Hierarchical Graph Matching Networks for Deep Graph Similarity Learning

07/08/2020, by Xiang Ling et al.

While the celebrated graph neural networks yield effective representations for individual nodes of a graph, there has been relatively less success in extending to deep graph similarity learning. Recent work has considered either global-level graph-graph interactions or low-level node-node interactions, ignoring the rich cross-level interactions (e.g., between nodes and a whole graph). In this paper, we propose a Hierarchical Graph Matching Network (HGMN) for computing the graph similarity between any pair of graph-structured objects. Our model jointly learns graph representations and a graph matching metric function for computing graph similarities in an end-to-end fashion. The proposed HGMN model consists of a node-graph matching network for effectively learning cross-level interactions between nodes of a graph and a whole graph, and a siamese graph neural network for learning global-level interactions between two graphs. Our comprehensive experiments demonstrate that HGMN consistently outperforms state-of-the-art graph matching network baselines for both classification and regression tasks.

1 Introduction

Learning a general similarity metric between arbitrary pairs of graph-structured objects is one of the key challenges in machine learning. Such learning problems arise in a variety of applications, ranging from graph searching in graph-based databases Yan and Han (2002) to few-shot 3D action recognition Guo et al. (2018) and malware detection Wang et al. (2019). Conceptually, classical exact or inexact graph matching techniques Bunke and Allermann (1983); Caetano et al. (2009); Riesen et al. (2010) provide a strong tool for learning graph similarity. However, these methods usually either require input graphs of similar sizes, or rely mainly on the graph structures to find a correspondence between the nodes of different graphs without taking the node representations or features into account. In contrast, in this paper we consider the graph matching problem of learning a mapping between a pair of graph inputs $(G_1, G_2)$ and a similarity score $y$, based on a set of training triplets of graph input pairs and scalar output scores $\{((G_1^i, G_2^i), y^i)\}_{i=1}^{N}$ drawn from some fixed but unknown probability distribution in real applications.

Although graph neural networks (GNNs) have recently been demonstrated to be a powerful class of neural networks for learning node embeddings of graphs, on tasks ranging from node classification and graph classification to graph generation Bronstein et al. (2017); Li et al. (2016); Kipf and Welling (2017); Hamilton et al. (2017); Velickovic et al. (2018), there is relatively little work on learning graph similarity using GNNs. A simple and straightforward approach is to use a GNN to encode each graph as a vector and combine the two vectors to make a decision. This can be effective, as the graph-level vectors contain important information about a pair of graphs, but one obvious limitation is that it ignores finer-grained interactions among different-level embeddings of the two graphs. Very recently, a few attempts have been made to take low-level interactions into account, either by considering the histogram information or spatial patterns (with CNNs) of the node-wise similarity matrix of node embeddings Bai et al. (2019, 2020), or by improving the node embeddings of one graph by incorporating the implicit attentive neighbors of the other graph Li et al. (2019). However, two significant challenges can make these graph matching networks ineffective: i) how to learn interactions between a pair of graphs at different levels of granularity (global level and local level); ii) how to effectively learn richer cross-level interactions between the nodes of one graph and the other whole graph.

Inspired by these observations, in this paper we propose a hierarchical graph matching network (HGMN) for computing the graph similarity between any pair of graph-structured objects. (The terminology "hierarchical" here refers to different levels of granularity of interactions between a pair of graphs, which is different from the meaning of the "hierarchical pooling" operations in Ying et al. (2018).) HGMN jointly learns graph representations and a graph matching metric function for computing graph similarity in an end-to-end fashion. It consists of a novel node-graph matching network for effectively learning cross-level interaction features between the nodes of one graph and the other whole graph, and a siamese graph neural network for learning global-level interaction features between two graphs. Our final small prediction networks consume the feature vectors from both cross-level and global-level interactions to perform either graph-graph classification or graph-graph regression tasks, respectively.

Recently proposed work computes graph similarity by considering either graph-graph classification tasks (with binary labels) Li et al. (2019) or graph-graph regression tasks (with similarity scores) Bai et al. (2019, 2020). To demonstrate the effectiveness of our model, we systematically evaluate the performance of HGMN on four datasets for both the graph-graph classification and regression tasks. Note that the graph-graph classification tasks here are different from general graph classification tasks Ying et al. (2018); Ma et al. (2019), which assign a label to each individual graph; our graph-graph classification tasks learn a binary label (i.e., similar or dissimilar) for a pair of graphs rather than a single graph. Another important aspect is that previous work does not consider the impact of the size of the input graphs, which often plays an important role in determining the performance of graph similarity learning. Motivated by this observation, we consider three different ranges of graph sizes to evaluate the robustness of models. In addition, to bridge the gap caused by the lack of standard datasets for graph similarity learning, we create one new dataset from a real application, together with a previously released dataset by Xu et al. (2017), for the graph-graph classification tasks. Both code and data are available at https://github.com/kleincup/HGMN. In brief, we highlight our main contributions as follows:

  • We propose a hierarchical graph matching network (HGMN) for computing the graph similarity between any pair of graph-structured objects. HGMN jointly learns graph representations and a graph matching metric function for computing graph similarity in an end-to-end fashion.

  • In particular, we propose a novel node-graph matching network for effectively capturing the cross-level interactions between a node embedding of a graph and a corresponding attentive graph-level embedding of another graph.

  • Comprehensive experiments demonstrate that HGMN consistently outperforms state-of-the-art graph similarity learning baselines for different tasks (i.e., classification and regression) and also exhibits stronger robustness as the sizes of the two input graphs increase.

2 Problem Formulation

In this section, we briefly introduce the problem formulation. Given a pair of graph inputs $(G_1, G_2)$, the aim of graph similarity learning in this paper is to produce a similarity score $y$. The graph $G_1$ is represented as a set of nodes with a feature matrix $X_1$, edges (binary or weighted) forming an adjacency matrix $A_1$, and a degree matrix $D_1$. Similarly, the graph $G_2$ is represented as a set of nodes with a feature matrix $X_2$, edges (binary or weighted) forming an adjacency matrix $A_2$, and a degree matrix $D_2$. Note that when performing the graph-graph classification tasks, $y$ is a class label (similar or dissimilar); when performing the graph-graph regression tasks, $y$ is a similarity score in $[0, 1]$. We train our model on a set of training triplets of structured input pairs and scalar output scores $\{((G_1^i, G_2^i), y^i)\}_{i=1}^{N}$.

Figure 1: The overall architecture of the full HGMN model, which consists of two components: SGNN and NGMN. At the final prediction layer, we concatenate a total of six aggregated graph-level embedding vectors, of which the middle four (pink) come from NGMN and the other two (blue) from SGNN.

3 Hierarchical Graph Matching Networks

In this section, we introduce the two key components of HGMN: Siamese Graph Neural Networks (SGNN) and Node-Graph Matching Networks (NGMN). We first discuss SGNN for learning the global-level interactions between two graphs, and then outline NGMN for effectively learning the cross-level node-graph interactions between the nodes of one graph and the other whole graph. The overall model architecture of HGMN is shown in Figure 1.

3.1 SGNN for Global-level Interaction Learning

The graph-level embeddings contain important information of a graph. Therefore, learning graph-level interactions between two graphs could be an important component for learning the graph similarity of two graphs. To capture the global-level interaction features between two graphs, we employ SGNN, which is based on the Siamese Network architecture Bromley et al. (1994) that has achieved great success in many applications such as visual recognition Bertinetto et al. (2016); Varior et al. (2016) and sentence similarity analysis He et al. (2015); Mueller and Thyagarajan (2016). In general, SGNN consists of 3 components: 1) node embedding layers; 2) graph-level embedding aggregation layers; 3) graph-graph matching and prediction layers.

Node Embedding Layers. We use a three-layer graph convolutional network (GCN) within the siamese architecture to generate node embeddings of both graphs $G_1$ and $G_2$,

$H^{(l+1)} = \sigma\big(\hat{A}\, H^{(l)}\, W^{(l)}\big)$  (1)

where $\sigma(\cdot)$ is the activation function, $\hat{A} = \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$ is the normalized (Laplacian-style) adjacency matrix computed from $A_1$ or $A_2$ with self-loops added, and the $W^{(l)}$ are the trainable weight matrices of each layer. Note that the twin networks share the GCN parameters when training on the pair of graphs $(G_1, G_2)$. The number of GCN layers required may depend on the graph data of the real application.
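For concreteness, the GCN propagation of Equation (1) can be sketched as follows (a minimal NumPy illustration; the tanh activation and all variable names are choices of this sketch, not prescribed by the model):

```python
import numpy as np

def gcn_layer(A, H, W, activation=np.tanh):
    """One GCN propagation step, in the spirit of Equation (1):
    H' = act( D~^{-1/2} (A + I) D~^{-1/2} H W ).

    A: (n, n) adjacency matrix, H: (n, d_in) node features,
    W: (d_in, d_out) weight matrix (fixed here; trainable in the model).
    """
    n = A.shape[0]
    A_tilde = A + np.eye(n)                    # add self-loops
    d = A_tilde.sum(axis=1)                    # degrees of A~
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))     # D~^{-1/2}
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt  # normalized adjacency
    return activation(A_hat @ H @ W)

def encode(A, X, weights):
    """Stack GCN layers (three in the paper). Reusing the same `weights`
    list for both input graphs mirrors the siamese parameter sharing."""
    H = X
    for W in weights:
        H = gcn_layer(A, H, W)
    return H
```

Calling `encode(A1, X1, weights)` and `encode(A2, X2, weights)` with the same weight list reproduces the twin-network weight sharing described above.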

Graph-Level Embedding Aggregation Layers. With the computed node embeddings $H_1$ and $H_2$ of each graph, we aggregate them to form a corresponding graph-level embedding,

$\mathbf{h}_{G_j} = f_{\mathrm{agg}}(H_j), \quad j \in \{1, 2\}$  (2)

We employ different aggregation functions $f_{\mathrm{agg}}$, such as element-wise max/mean pooling (Max/Avg), element-wise max/mean pooling preceded by a fully connected layer (FCMax/FCAvg), and an LSTM-based aggregator. Although an LSTM is not permutation invariant over a set of node embeddings, it may be more expressive in aggregation and has been applied in previous work Hamilton et al. (2017); Zhang et al. (2019).
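The pooling-based aggregators above can be sketched as follows (a minimal illustration; the tanh nonlinearity in the FC variants is an assumption of this sketch, and the LSTM-based aggregator is omitted for brevity):

```python
import numpy as np

def aggregate(H, how="max", W=None):
    """Graph-level embedding from node embeddings H of shape (n, d).

    'max'/'avg' are element-wise poolings over nodes; 'fcmax'/'fcavg'
    first apply a fully connected layer with weights W of shape (d, d)
    (tanh activation is an illustrative choice).
    """
    if how in ("fcmax", "fcavg"):
        H = np.tanh(H @ W)
    if how.endswith("max"):
        return H.max(axis=0)   # element-wise max over nodes
    return H.mean(axis=0)      # element-wise mean over nodes
```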

Graph-Graph Matching & Prediction Layers. After the graph-level embeddings $\mathbf{h}_{G_1}$ and $\mathbf{h}_{G_2}$ are computed for $G_1$ and $G_2$, we use them to compute the graph similarity score of $(G_1, G_2)$. As it is common to employ cosine similarity in classification tasks Xu et al. (2017); Gu et al. (2018), we directly compute the cosine similarity of the two graph-level embeddings,

$\hat{y} = \cos(\mathbf{h}_{G_1}, \mathbf{h}_{G_2})$  (3)

In contrast, the outputs of the regression tasks are continuous and lie in the range $[0, 1]$. Thus, for the regression tasks, we first concatenate the two graph embeddings into $[\mathbf{h}_{G_1}; \mathbf{h}_{G_2}]$, employ standard fully connected layers to gradually project the dimension of the resulting vector down to 1, and finally apply the sigmoid function to enforce the similarity score to lie in $[0, 1]$,

$\hat{y} = \mathrm{sigmoid}\big(\mathrm{MLP}([\mathbf{h}_{G_1}; \mathbf{h}_{G_2}])\big)$  (4)

For both tasks, we train the SGNN model with the mean squared error loss, comparing the computed similarity score $\hat{y}$ with the ground-truth similarity score $y$, i.e., $\mathcal{L} = \frac{1}{|\mathcal{D}|} \sum_{(G_1, G_2, y) \in \mathcal{D}} (\hat{y} - y)^2$.
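A minimal sketch of the two prediction heads and the training loss might look as follows (the hidden-layer shapes and tanh activations are illustrative assumptions):

```python
import numpy as np

def cosine(u, v):
    """Classification head: cosine similarity of two graph embeddings."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def regression_score(h1, h2, fc_weights):
    """Regression head: concatenate, project down through FC layers to a
    scalar, then squash with a sigmoid into (0, 1)."""
    z = np.concatenate([h1, h2])
    for W in fc_weights[:-1]:
        z = np.tanh(z @ W)               # hidden fully connected layers
    z = (z @ fc_weights[-1]).item()      # final projection to a scalar
    return 1.0 / (1.0 + np.exp(-z))      # sigmoid

def mse_loss(pred, target):
    """Mean squared error between predicted and ground-truth scores."""
    pred, target = np.asarray(pred, float), np.asarray(target, float)
    return float(np.mean((pred - target) ** 2))
```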

3.2 NGMN for Cross-level Node-Graph Interaction Learning

Although global-level interaction learning can capture the important structural and feature information of two graphs to some extent, it is not sufficient, since it ignores the cross-level interactions between parts of the two graphs. In particular, existing work has considered either global-level graph-graph interactions or low-level node-node interactions, ignoring the rich cross-level interactions between the nodes of one graph and the other whole graph. Inspired by these observations, we propose a novel node-graph matching network to effectively learn the cross-level interaction features, and we detail each part as follows.

Node Embedding Layers. As described in Section 3.1, we employ the three-layer GCN to generate node embeddings $H_1$ and $H_2$ for graphs $G_1$ and $G_2$. Conceptually, the node embedding layers of NGMN could be an independent GCN or a GCN shared with SGNN. As shown in Figure 1, our NGMN shares the same graph encoder (i.e., the GCN) with SGNN for two reasons: i) it reduces the number of parameters by half, which helps mitigate possible overfitting; ii) it keeps the resulting node embeddings consistent between NGMN and SGNN, potentially leading to better-aligned global-level and cross-level interaction features.

Node-Graph Matching Layers. This layer is the key part of NGMN; it effectively learns the cross-level interactions between the nodes of one graph and the other whole graph. There are generally two steps: i) compute an attentive graph-level embedding of one graph for each node of the other; ii) compare each node embedding with the associated attentive graph-level embedding and produce a similarity feature vector. To build tighter interactions between the two graphs when learning each other's graph-level embeddings, we first calculate the cross-graph attention coefficients between the $i$-th node of $G_1$ and every node of $G_2$, and, symmetrically, between the $j$-th node of $G_2$ and every node of $G_1$. The two sets of coefficients are computed independently with an attention function $f_s$,

$\alpha_{i,j} = f_s\big(\mathbf{h}_i^{(1)}, \mathbf{h}_j^{(2)}\big),\ \forall j \in \mathcal{V}_2; \quad \beta_{j,i} = f_s\big(\mathbf{h}_j^{(2)}, \mathbf{h}_i^{(1)}\big),\ \forall i \in \mathcal{V}_1$  (5)

where $f_s(\cdot,\cdot)$ is the attention function for computing the similarity score. For simplicity, we use the cosine function in our experiments, but other similarity metrics can be adopted as well. Then, we compute the attentive graph-level embeddings as the attention-weighted average of the node embeddings of the other graph,

$\bar{\mathbf{h}}_i^{(2)} = \frac{\sum_{j \in \mathcal{V}_2} \alpha_{i,j}\, \mathbf{h}_j^{(2)}}{\sum_{j \in \mathcal{V}_2} \alpha_{i,j}}, \quad \bar{\mathbf{h}}_j^{(1)} = \frac{\sum_{i \in \mathcal{V}_1} \beta_{j,i}\, \mathbf{h}_i^{(1)}}{\sum_{i \in \mathcal{V}_1} \beta_{j,i}}$  (6)
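This cross-graph attention step can be sketched as follows (a minimal NumPy illustration with cosine similarity as the attention function; function names are ours, and the weighted average assumes the coefficient sums are nonzero, e.g., for non-negative embeddings):

```python
import numpy as np

def cosine_matrix(H1, H2):
    """Pairwise cosine similarities between rows of H1 (n1, d) and H2 (n2, d)."""
    H1n = H1 / np.linalg.norm(H1, axis=1, keepdims=True)
    H2n = H2 / np.linalg.norm(H2, axis=1, keepdims=True)
    return H1n @ H2n.T

def attentive_graph_embeddings(H1, H2):
    """Attentive graph-level embeddings in the spirit of Eqs. (5)-(6):
    for node i of G1, average G2's node embeddings weighted by the
    cross-graph coefficients alpha[i, :] (and symmetrically for G2)."""
    alpha = cosine_matrix(H1, H2)                          # (n1, n2)
    g2_for_1 = (alpha @ H2) / alpha.sum(axis=1, keepdims=True)   # (n1, d)
    g1_for_2 = (alpha.T @ H1) / alpha.sum(axis=0)[:, None]       # (n2, d)
    return g2_for_1, g1_for_2
```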

Next, we define a multi-perspective matching function $f_m$ to compute a similarity feature vector by comparing two vectors as follows,

$f_m(\mathbf{x}, \mathbf{y}; W) = (m_1, \ldots, m_l), \quad m_k = f_s\big(\mathbf{w}_k \odot \mathbf{x},\ \mathbf{w}_k \odot \mathbf{y}\big)$  (7)

where the output is an $l$-dimensional similarity feature vector, $W \in \mathbb{R}^{l \times d}$ is a trainable weight matrix whose $k$-th row $\mathbf{w}_k$ represents one of the $l$ perspectives, and $\odot$ denotes element-wise multiplication. Notably, $f_s$ could be any similarity function; we use the cosine similarity metric in our experiments. It is worth noting that the proposed $f_m$ essentially shares a similar spirit with multi-head attention Vaswani et al. (2017), with the difference that multi-head attention uses $l$ weight matrices instead of $l$ weight vectors.
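A minimal sketch of the multi-perspective matching function with cosine similarity could be (variable names are ours; in the model $W$ is trainable, whereas here it is just an array):

```python
import numpy as np

def multi_perspective_match(x, y, W):
    """f_m(x, y; W): compare two d-dim vectors from l perspectives.

    W is an (l, d) weight matrix; perspective k reweights both inputs
    element-wise by W[k] and takes their cosine similarity, yielding an
    l-dimensional similarity feature vector.
    """
    xs = W * x                            # (l, d): x under each perspective
    ys = W * y
    num = (xs * ys).sum(axis=1)
    den = np.linalg.norm(xs, axis=1) * np.linalg.norm(ys, axis=1)
    return num / den
```

With all-ones weights, every perspective reduces to the plain cosine similarity, which makes the role of the per-perspective reweighting easy to see.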

Therefore, we can use $f_m$ to compare the $i$-th (resp. $j$-th) node embedding of one graph with the corresponding attentive graph-level embedding of the other graph, capturing the cross-level node-graph interactions. The resulting similarity feature vectors (w.r.t. each node in either $G_1$ or $G_2$) can thus be computed by,

$\mathbf{m}_i^{(1)} = f_m\big(\mathbf{h}_i^{(1)}, \bar{\mathbf{h}}_i^{(2)}; W\big), \quad \mathbf{m}_j^{(2)} = f_m\big(\mathbf{h}_j^{(2)}, \bar{\mathbf{h}}_j^{(1)}; W\big)$  (8)

After performing node-graph matching over all nodes of both $G_1$ and $G_2$, the newly produced interaction feature matrices $M^{(1)}$ and $M^{(2)}$ are ready to be fed into the aggregation layers.

Aggregation Layers. To aggregate the cross-level interaction feature matrix from the node-graph matching layer, we employ a BiLSTM Hochreiter and Schmidhuber (1997) over the (unordered) feature embeddings,

$\mathbf{h}_{G} = \mathrm{BiLSTM}(M)$  (9)

where $\mathbf{h}_{G}$ is computed by concatenating the last hidden vectors of the two directions and represents the aggregated graph-level embedding for each of $G_1$ and $G_2$. Although other aggregators could be used, our extensive experiments show that the BiLSTM aggregator achieves consistently better performance than the alternatives (see Appendix A.4). Similar LSTM-type aggregators have also been employed in previous work Hamilton et al. (2017); Zhang et al. (2019).
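A BiLSTM aggregation of this kind can be sketched as follows (a bare-bones NumPy LSTM with untrained parameters, for illustration only; real implementations would use a deep learning framework):

```python
import numpy as np

def _lstm_last_hidden(seq, Wx, Wh, b, d):
    """Run one LSTM direction over seq (n, d_in); return the last hidden (d,)."""
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    h, c = np.zeros(d), np.zeros(d)
    for x in seq:
        z = x @ Wx + h @ Wh + b         # (4d,) gate pre-activations
        i, f, o, g = np.split(z, 4)     # input, forget, output, candidate
        c = sig(f) * c + sig(i) * np.tanh(g)
        h = sig(o) * np.tanh(c)
    return h

def bilstm_aggregate(M, params, d):
    """Aggregate the rows of a feature matrix M (one row per node) by running
    an LSTM in both directions and concatenating the two last hidden states."""
    (Wx_f, Wh_f, b_f), (Wx_b, Wh_b, b_b) = params
    fwd = _lstm_last_hidden(M, Wx_f, Wh_f, b_f, d)
    bwd = _lstm_last_hidden(M[::-1], Wx_b, Wh_b, b_b, d)
    return np.concatenate([fwd, bwd])   # graph-level vector of size 2d
```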

Prediction Layers. After the aggregated graph embeddings $\mathbf{h}_{G_1}$ and $\mathbf{h}_{G_2}$ are obtained, we use these two embeddings to compute the similarity score of $(G_1, G_2)$. As in the prediction layer of SGNN, we use Equations (3) and (4) to predict the similarity score for the classification and regression tasks, respectively, and we use the same mean squared error loss function for model training.

3.3 Discussions on Our Full Model – HGMN

The full model HGMN combines the advantages of both SGNN and NGMN to capture both global-level graph-graph interaction features and cross-level node-graph interaction features between two graphs. For the final prediction layer of HGMN, we have a total of six aggregated graph embedding vectors: two from SGNN (one per graph) and the other four from NGMN (two per graph).

Complexity. The computational cost of SGNN is dominated by the sparse matrix-matrix operations in Equation (1). Similarly, the most computationally expensive operations of NGMN are those in Equations (6), (7), and (8). Compared with recently proposed work Bai et al. (2019, 2020); Li et al. (2019), the computational complexities are highly comparable.

4 Experiments

4.1 Datasets, Experimental Setup, and Baselines

Classification Datasets: we evaluate our model on the task of detecting whether two binary functions are similar, which is at the heart of many binary security problems Feng et al. (2016); Xu et al. (2017); Ding et al. (2019). As we represent binaries with control flow graphs (CFGs), detecting the similarity between two binaries can be cast as learning the similarity score between two control flow graphs. We prepare two datasets generated from two popular open-source software projects: FFmpeg and OpenSSL. In addition, existing work does not consider the impact of graph size on performance; however, we find that the larger the graphs are, the worse the performance becomes. It is therefore important to evaluate the robustness of graph similarity networks in this setting, so we further split each dataset into three sub-datasets ([3, 200], [20, 200], and [50, 200]) according to the range of graph sizes. (Although there are many benchmarks for general graph classification tasks, they cannot be directly used for our graph-graph classification tasks, as we cannot simply treat two graphs with the same label as "similar".)

Tasks           Dataset     Sub-dataset  # Graphs  # Functions  Avg # Nodes  Avg # Edges  Init Feature Dim.
classification  FFmpeg      [3, 200]     83,008    10,376       18.83        27.02        6
                            [20, 200]    31,696    7,668        51.02        75.88
                            [50, 200]    10,824    3,178        90.93        136.83
                OpenSSL     [3, 200]     73,953    4,249        15.73        21.97        6
                            [20, 200]    15,800    1,073        44.89        67.15
                            [50, 200]    4,308     338          83.68        127.75
regression      AIDS700     -            700       -            8.90         8.80         29
                LINUX1000   -            1000      -            7.58         6.94         1
Table 1: Summary statistics of datasets for both classification & regression tasks.

Regression Datasets: we evaluate our model on learning the graph edit distance (GED) Zeng et al. (2009); Gao et al. (2010); Riesen (2015), which measures the structural similarity between two graphs. Formally, GED is defined as the cost of the least expensive sequence of edit operations that transform one graph into another, where an edit operation can be an insertion or a deletion of a node or an edge. In our experiments, we normalize the GED by the average size of the two graphs, $\mathrm{nGED}(G_1, G_2) = \mathrm{GED}(G_1, G_2) / \big((|G_1| + |G_2|)/2\big)$, and convert it to a similarity score in $(0, 1]$. We evaluate models on two datasets, AIDS700 and LINUX1000, from Bai et al. (2019). Table 1 shows the statistics for all datasets, with more details in Appendix A.1.
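Assuming the exponential GED-to-similarity mapping popularized by SimGNN (Bai et al., 2019) — an assumption of this sketch, since the exact conversion is specified in the cited work — the normalization can be computed as:

```python
import math

def normalized_ged_similarity(ged, n1, n2):
    """Map a raw GED between graphs with n1 and n2 nodes to a similarity
    in (0, 1]: nGED = GED / ((n1 + n2) / 2), then s = exp(-nGED).
    The exp(-nGED) step follows the SimGNN convention (an assumption here).
    """
    nged = ged / ((n1 + n2) / 2.0)
    return math.exp(-nged)
```

Identical graphs (GED = 0) map to similarity 1, and the score decays smoothly toward 0 as the edit distance grows relative to the graphs' average size.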

Implementation Details. We implement our models in PyTorch 1.1 Paszke et al. (2017) and train them with the Adam optimizer Kingma and Ba (2015). We use 3 GCN layers, each with an output dimension of 100, and set the number of perspectives to 100. For the classification tasks, we train the model for 100 epochs with a learning rate of 0.5e-3. At each epoch, we build the pairwise training data as follows: for each graph $g$ in the training subset, we obtain one positive pair $(g, g^{+})$ and a corresponding negative pair $(g, g^{-})$, where $g^{+}$ is randomly selected from all control flow graphs compiled from the same source function as $g$, and $g^{-}$ is selected from the graphs of other functions. By default, each mini-batch includes 5 positive and 5 negative pairs. For the regression tasks, we train the model for 10,000 iterations with mini-batches of 128 graph pairs and a learning rate of 5e-3; each pair is a tuple $(G_1, G_2, \mathrm{nGED}(G_1, G_2))$, where $\mathrm{nGED}(G_1, G_2)$ is derived from the ground-truth GED between $G_1$ and $G_2$. Note that all experiments are conducted on a PC equipped with an 8-core Intel Xeon 2.2 GHz CPU and one NVIDIA GTX 1080 Ti GPU. Other model settings and experimental details can be found in Appendix A.2.1.
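The per-epoch pair construction described above can be sketched as follows (function and variable names are hypothetical; the ±1 labels and the requirement of at least two CFGs per source function, and at least two functions overall, are assumptions of this illustration):

```python
import random

def build_epoch_pairs(functions, seed=0):
    """Build one positive and one negative pair per graph for an epoch.

    `functions` maps a source-function name to its list of control flow
    graphs (each assumed to contain at least two CFGs, with at least two
    functions overall).
    """
    rng = random.Random(seed)
    names = list(functions)
    pairs = []
    for name in names:
        for g in functions[name]:
            pos = rng.choice([h for h in functions[name] if h != g])
            other = rng.choice([n for n in names if n != name])
            neg = rng.choice(functions[other])
            pairs.append((g, pos, 1))    # compiled from the same function
            pairs.append((g, neg, -1))   # compiled from a different function
    return pairs
```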

Baseline Methods. (As the three baseline methods consider only classification tasks or only regression tasks, we slightly adjust the last layer of the model or the loss function of each baseline in order to make fair comparisons on both tasks.) i) SimGNN Bai et al. (2019) adopts GCNs to encode node features and applies two strategies to model the similarity between two graphs: one based on interactions between the two graph-level embeddings, the other based on histogram features of the two sets of node embeddings; ii) GMN Li et al. (2019) employs a variant of message passing neural networks and improves the node embeddings of one graph by incorporating the information of attentive neighborhoods of the other graph; iii) GraphSim Bai et al. (2020) extends SimGNN by turning the two sets of node embeddings into a similarity matrix and then processing the matrix with CNNs Krizhevsky et al. (2012). Detailed experimental settings are given in Appendix A.2.2.

Note that we have two variants of the full model: HGMN (FCMax) and HGMN (BiLSTM), in which SGNN uses the FCMax or the BiLSTM aggregator, respectively. We repeat all experiments 5 times and report the mean and standard deviation of the results, with the best performance in bold.

4.2 Comparison with Baseline Methods

Comparison on the Graph-Graph Classification Tasks. For the graph-graph classification tasks, we measure the Area Under the ROC Curve (AUC) Bradley (1997) of the different models. As shown in Table 2, our models (both the full model HGMN and its key component NGMN) clearly achieve state-of-the-art performance on all 6 sub-datasets of both FFmpeg and OpenSSL. In particular, as the graph size increases, both HGMN and NGMN show better and more robust performance than the state-of-the-art methods. In addition, compared with SGNN (Max), NGMN is superior by a large margin, demonstrating the benefits of the multi-perspective node-graph matching mechanism that captures the cross-level interaction features between the node embeddings of one graph and the graph-level embedding of the other graph. HGMN (i.e., NGMN+SGNN) further improves on NGMN by adding the global-level interaction features learned by SGNN (see more experiments on SGNN with other aggregation functions in Appendix A.3).

Model            FFmpeg                                            OpenSSL
                 [3, 200]     [20, 200]    [50, 200]    [3, 200]     [20, 200]    [50, 200]
SimGNN           95.38±0.76   94.31±1.01   93.45±0.54   95.96±0.31   93.58±0.82   94.25±0.85
GMN              94.15±0.62   95.92±1.38   94.76±0.45   96.43±0.61   93.03±3.81   93.91±1.65
GraphSim         97.46±0.30   96.49±0.28   94.48±0.73   96.84±0.54   94.97±0.98   93.66±1.84
SGNN             93.92±0.07   93.82±0.28   85.15±1.39   91.07±0.10   88.94±0.47   82.10±0.51
NGMN             97.73±0.11   98.29±0.21   96.81±0.96   96.56±0.12   97.60±0.29   92.89±1.31
HGMN (FCMax)     98.07±0.06   98.29±0.10   97.83±0.11   96.87±0.24   97.59±0.24   95.58±1.13
HGMN (BiLSTM)    97.56±0.38   98.12±0.04   97.16±0.53   96.90±0.10   97.31±1.07   95.87±0.88
Table 2: Summary of classification results in terms of AUC scores (%).

Comparison on the Graph-Graph Regression Tasks. For the regression tasks of computing the normalized GED between two graphs, we evaluate the models using the Mean Squared Error (mse), Spearman's Rank Correlation Coefficient ($\rho$) Spearman (1904), Kendall's Rank Correlation Coefficient ($\tau$) Kendall (1938), and precision at $k$ (p@k). The results on both the AIDS700 and LINUX1000 datasets are summarized in Table 3. Although GraphSim performs better than the other two baselines, our models (the full model HGMN and its key component NGMN) outperform all baselines on both datasets in terms of most evaluation metrics. Moreover, compared with SGNN (Max), NGMN achieves much better performance (see more in Appendix A.3), which highlights the importance of our proposed node-graph matching mechanism: it effectively captures the cross-level node-graph interactions between the nodes of one graph and the other whole graph. HGMN (i.e., SGNN+NGMN) further improves on NGMN by adding the global-level interaction features learned by SGNN.

Datasets    Model            mse (10⁻³)    ρ            τ            p@10         p@20
AIDS700     SimGNN           1.376±0.066   0.824±0.009  0.665±0.011  0.400±0.023  0.489±0.024
            GMN              4.610±0.365   0.672±0.036  0.497±0.032  0.200±0.018  0.263±0.018
            GraphSim         1.919±0.060   0.849±0.008  0.693±0.010  0.446±0.027  0.525±0.021
            SGNN             2.822±0.149   0.765±0.005  0.588±0.004  0.289±0.016  0.373±0.012
            NGMN             1.191±0.048   0.904±0.003  0.749±0.005  0.465±0.011  0.538±0.007
            HGMN (FCMax)     1.205±0.039   0.904±0.002  0.749±0.003  0.457±0.014  0.532±0.016
            HGMN (BiLSTM)    1.169±0.036   0.905±0.002  0.751±0.003  0.456±0.019  0.539±0.018
LINUX1000   SimGNN           2.479±1.038   0.912±0.031  0.791±0.046  0.635±0.328  0.650±0.283
            GMN              2.571±0.519   0.906±0.023  0.763±0.035  0.888±0.036  0.856±0.040
            GraphSim         0.471±0.043   0.976±0.001  0.931±0.003  0.956±0.006  0.942±0.007
            SGNN             11.832±0.698  0.566±0.022  0.404±0.017  0.226±0.106  0.492±0.190
            NGMN             1.561±0.020   0.945±0.002  0.814±0.003  0.743±0.085  0.741±0.086
            HGMN (FCMax)     1.575±0.627   0.946±0.019  0.817±0.034  0.807±0.117  0.784±0.108
            HGMN (BiLSTM)    0.439±0.143   0.985±0.005  0.919±0.016  0.955±0.011  0.943±0.014
Table 3: Summary of regression results on AIDS700 and LINUX1000.

4.3 Ablation Studies

Different Attention Functions. As discussed in Section 3.2, the proposed multi-perspective matching function shares a similar spirit with the multi-head attention mechanism Vaswani et al. (2017), which makes it interesting to compare the two. We therefore investigate the impact of these two attention mechanisms on the proposed NGMN model, with classification results shown in Table 4. Interestingly, our multi-perspective attention mechanism consistently outperforms the multi-head attention mechanism by quite a large margin. We suspect that this is because multi-perspective attention uses weight vectors rather than weight matrices, which may significantly reduce the potential for overfitting.
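The parameter-count intuition behind this overfitting argument can be made concrete with a quick calculation (treating the multi-head-style variant as using $l$ weight matrices of size $d \times d$ in place of $l$ weight vectors, as described in Section 3.2):

```python
def perspective_params(l, d):
    """Trainable parameters in the multi-perspective matching function:
    l weight *vectors*, each of dimension d."""
    return l * d

def head_params(l, d):
    """Parameters if each of the l perspectives were instead a full
    d x d weight *matrix*, as in the multi-head-style comparison."""
    return l * d * d

# With the paper's defaults (d = 100-dim embeddings, l = 100 perspectives),
# the vector form needs d times fewer matching parameters than the matrix form.
ratio = head_params(100, 100) // perspective_params(100, 100)
```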

Model                FFmpeg                                            OpenSSL
                     [3, 200]     [20, 200]    [50, 200]    [3, 200]     [20, 200]    [50, 200]
Multi-Perspective    97.73±0.11   98.29±0.21   96.81±0.96   96.56±0.12   97.60±0.29   92.89±1.31
Multi-Head           91.18±5.91   77.49±5.21   68.15±6.97   92.81±5.21   85.43±5.76   56.87±7.53
Table 4: Classification results of multi-perspective versus multi-head attention in terms of AUC scores (%).

Different Numbers of Perspectives. We further investigate the impact of the number of perspectives adopted by the node-graph matching layer of NGMN on the classification tasks. Following the same settings as the previous experiments, we only change the number of perspectives $l$ of NGMN. From Table 5, it is clearly seen that the AUC score of NGMN does not consistently increase as the number of perspectives grows. We thus conclude that our model's performance is not sensitive to the number of perspectives (from 50 to 150), and we set $l = 100$ by default.

Model            FFmpeg                                            OpenSSL
                 [3, 200]     [20, 200]    [50, 200]    [3, 200]     [20, 200]    [50, 200]
NGMN (l = 50)    98.11±0.14   97.76±0.14   96.93±0.52   97.38±0.11   97.03±0.84   93.38±3.03
NGMN (l = 75)    97.99±0.09   97.94±0.14   97.41±0.05   97.09±0.25   98.66±0.11   92.10±4.37
NGMN (l = 100)   97.73±0.11   98.29±0.21   96.81±0.96   96.56±0.12   97.60±0.29   92.89±1.31
NGMN (l = 125)   98.10±0.03   98.06±0.08   97.26±0.36   96.73±0.33   98.67±0.11   96.03±2.08
NGMN (l = 150)   98.32±0.05   98.11±0.07   97.92±0.09   96.50±0.31   98.04±0.03   97.13±0.36
Table 5: Classification results for different numbers of perspectives in terms of AUC scores (%).

Different GNNs. We investigate the impact of different GNNs, including GraphSAGE Hamilton et al. (2017), GIN Xu et al. (2019), and GGNN Li et al. (2016), adopted in the node embedding layer of our NGMN models for both classification and regression tasks. Table 6 presents the classification results (see the regression results in Table 12 of Appendix A.5). In general, the performance of the different GNNs is quite similar across all datasets for both tasks, which indicates that our model is not sensitive to the choice of GNN in the node embedding layers. An interesting observation is that NGMN-GGNN performs even better than our default NGMN-GCN on both the FFmpeg and OpenSSL datasets. This suggests that our model can be further improved by adopting more advanced GNN models or by choosing the most appropriate GNN for a given application task.

Model             FFmpeg                                            OpenSSL
                  [3, 200]     [20, 200]    [50, 200]    [3, 200]     [20, 200]    [50, 200]
NGMN-GCN (ours)   97.73±0.11   98.29±0.21   96.81±0.96   96.56±0.12   97.60±0.29   92.89±1.31
NGMN-GraphSAGE    97.31±0.56   98.21±0.13   97.88±0.15   96.13±0.30   97.30±0.72   93.66±3.87
NGMN-GIN          97.97±0.08   98.06±0.22   94.66±4.01   96.98±0.20   97.42±0.48   92.29±2.23
NGMN-GGNN         98.42±0.41   99.77±0.07   97.93±1.18   99.35±0.06   98.51±1.04   94.17±7.74
Table 6: Classification results of different GNNs in terms of AUC scores (%).

5 Related Work

Conventional Graph Matching. In general, graph matching can be categorized into exact graph matching and error-tolerant graph matching. Exact graph matching aims to find a strict correspondence between two (largely) identical graphs, while error-tolerant graph matching allows matching between completely non-identical graphs Riesen (2015). In real-world applications, the constraint of exact graph matching is too rigid, and thus a substantial amount of work has addressed the error-tolerant graph matching problem, which is usually quantified by a specific similarity metric, such as GED, the maximum common subgraph (MCS) Bunke (1997), or an even coarser binary similarity, depending on the application. Computing either GED or MCS is a well-studied NP-hard problem, and exact solutions suffer from exponential computational complexity and huge memory requirements in practice Bunke (1997); McGregor (1982); Zeng et al. (2009); Blumenthal and Gamper (2018).

Graph Similarity Computation and Graph Matching Networks. Given the great significance and difficulty of computing graph similarity, various approximation methods have been proposed to improve accuracy and efficiency, including traditional heuristic methods Gao et al. (2010); Zeng et al. (2009); Riesen (2015); Wu et al. (2019); Yoshida et al. (2019); Wu et al. (2018) and recent data-driven graph matching networks Bai et al. (2019, 2020); Li et al. (2019), detailed as the baselines in Section 4.1. Our research belongs to the family of graph matching networks, but differs from prior work in two main aspects. First, unlike prior work that considers only graph-level or node-level interaction features, our HGMN model captures richer interactions between the nodes of one graph and the other whole graph. Second, our work is the first to systematically evaluate performance on both graph-graph classification and regression tasks, as well as the impact of input graph sizes.

6 Conclusion and Future Work

In this paper, we presented a novel hierarchical graph matching network (HGMN) for computing the graph similarity between any pair of graph-structured objects. Our model jointly learns graph embeddings and a data-driven graph matching metric for computing graph similarity in an end-to-end fashion. We further proposed a new node-graph matching network for effectively learning cross-level interactions between two graphs, beyond low-level node-node and global-level graph-graph interactions. Our extensive experimental results demonstrated superior performance compared with state-of-the-art baselines on both graph-graph classification and regression tasks. One interesting future direction is to adapt our HGMN model to different real-world applications such as unknown malware detection, text matching and entailment, and knowledge graph question answering.

References

  • [1] Y. Bai, H. Ding, S. Bian, T. Chen, Y. Sun, and W. Wang (2019) SimGNN: a neural network approach to fast graph similarity computation. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pp. 384–392. Cited by: §A.1.2, §A.1.2, §A.1.2, §A.1.2, §1, §1, §3.3, §4.1, §4.1, §5.
  • [2] Y. Bai, H. Ding, K. Gu, Y. Sun, and W. Wang (2020) Learning-based efficient graph similarity computation via multi-scale convolutional set matching. In Thirty-Fourth AAAI Conference on Artificial Intelligence, Cited by: §1, §1, §3.3, §4.1, §5.
  • [3] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. Torr (2016) Fully-convolutional siamese networks for object tracking. In European Conference on Computer Vision, pp. 850–865. Cited by: §3.1.
  • [4] D. B. Blumenthal and J. Gamper (2018) On the exact computation of the graph edit distance. Pattern Recognition Letters. Cited by: §5.
  • [5] A. P. Bradley (1997) The use of the area under the roc curve in the evaluation of machine learning algorithms. Pattern Recognition. Cited by: §4.2.
  • [6] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah (1994) Signature verification using a "siamese" time delay neural network. In Advances in Neural Information Processing Systems, pp. 737–744. Cited by: §3.1.
  • [7] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst (2017) Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine 34 (4), pp. 18–42. Cited by: §1.
  • [8] H. Bunke and G. Allermann (1983) Inexact graph matching for structural pattern recognition. Pattern Recognition Letters 1 (4), pp. 245–253. Cited by: §1.
  • [9] H. Bunke (1997) On a relation between graph edit distance and maximum common subgraph. Pattern Recognition Letters 18 (8), pp. 689–694. Cited by: §5.
  • [10] T. S. Caetano, J. J. McAuley, L. Cheng, Q. V. Le, and A. J. Smola (2009) Learning graph matching. IEEE transactions on pattern analysis and machine intelligence 31 (6), pp. 1048–1058. Cited by: §1.
  • [11] S. H. Ding, B. C. Fung, and P. Charland (2019) Asm2vec: boosting static representation robustness for binary clone search against code obfuscation and compiler optimization. In IEEE Symposium on Security and Privacy (S&P), Cited by: §4.1.
  • [12] Q. Feng, R. Zhou, C. Xu, Y. Cheng, B. Testa, and H. Yin (2016) Scalable graph-based bug search for firmware images. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, Cited by: §4.1.
  • [13] X. Gao, B. Xiao, D. Tao, and X. Li (2010) A survey of graph edit distance. Pattern Analysis and applications 13 (1), pp. 113–129. Cited by: §4.1, §5.
  • [14] X. Gu, H. Zhang, and S. Kim (2018) Deep code search. In 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE), pp. 933–944. Cited by: §3.1.
  • [15] M. Guo, E. Chou, D. Huang, S. Song, S. Yeung, and L. Fei-Fei (2018) Neural graph matching networks for fewshot 3d action recognition. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 653–669. Cited by: §1.
  • [16] W. Hamilton, Z. Ying, and J. Leskovec (2017) Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, Cited by: §1, §3.1, §3.2, §4.3.
  • [17] P. E. Hart, N. J. Nilsson, and B. Raphael (1968) A formal basis for the heuristic determination of minimum cost paths. IEEE transactions on Systems Science and Cybernetics 4 (2), pp. 100–107. Cited by: §A.1.2.
  • [18] H. He, K. Gimpel, and J. Lin (2015) Multi-perspective sentence similarity modeling with convolutional neural networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1576–1586. Cited by: §3.1.
  • [19] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation. Cited by: §3.2.
  • [20] M. G. Kendall (1938) A new measure of rank correlation. Biometrika. Cited by: §4.2.
  • [21] D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In International Conference on Learning Representations, Cited by: §4.1.
  • [22] T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, Cited by: §1.
  • [23] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §4.1.
  • [24] Y. Li, C. Gu, T. Dullien, O. Vinyals, and P. Kohli (2019) Graph matching networks for learning the similarity of graph structured objects. ICML. Cited by: §1, §1, §3.3, §4.1, §5.
  • [25] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel (2016) Gated graph sequence neural networks. International Conference on Learning Representations. Cited by: §1, §4.3.
  • [26] Y. Ma, S. Wang, C. C. Aggarwal, and J. Tang (2019) Graph convolutional networks with eigenpooling. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 723–731. Cited by: §1.
  • [27] J. J. McGregor (1982) Backtrack search algorithms and the maximal common subgraph problem. Software: Practice and Experience 12 (1), pp. 23–34. Cited by: §5.
  • [28] J. Mueller and A. Thyagarajan (2016) Siamese recurrent architectures for learning sentence similarity. In Thirtieth AAAI Conference on Artificial Intelligence, Cited by: §3.1.
  • [29] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. Cited by: §4.1.
  • [30] K. Riesen, S. Emmenegger, and H. Bunke (2013) A novel software toolkit for graph edit distance computation. In International Workshop on Graph-Based Representations in Pattern Recognition, pp. 142–151. Cited by: §A.1.2.
  • [31] K. Riesen, X. Jiang, and H. Bunke (2010) Exact and inexact graph matching: methodology and applications. In Managing and Mining Graph Data, pp. 217–247. Cited by: §1.
  • [32] K. Riesen (2015) Structural pattern recognition with graph edit distance. In Advances in computer vision and pattern recognition, Cited by: §4.1, §5, §5.
  • [33] C. Spearman (1904) The proof and measurement of association between two things. American Journal of Psychology. Cited by: §4.2.
  • [34] R. R. Varior, M. Haloi, and G. Wang (2016) Gated siamese convolutional neural network architecture for human re-identification. In European conference on computer vision, pp. 791–808. Cited by: §3.1.
  • [35] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §3.2, §4.3.
  • [36] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio (2018) Graph attention networks. In International Conference on Learning Representations, Cited by: §1.
  • [37] S. Wang, Z. Chen, X. Yu, D. Li, J. Ni, L. Tang, J. Gui, Z. Li, H. Chen, and P. S. Yu (2019) Heterogeneous graph matching networks for unknown malware detection. In Proceedings of International Joint Conference on Artificial Intelligence, Cited by: §1.
  • [38] X. Wang, X. Ding, A. K. Tung, S. Ying, and H. Jin (2012) An efficient graph indexing method. In 2012 IEEE 28th International Conference on Data Engineering, Cited by: §A.1.2.
  • [39] L. Wu, I. E. Yen, F. Xu, P. Ravikumar, and M. Witbrock (2018) D2ke: from distance to kernel and embedding. arXiv preprint arXiv:1802.04956. Cited by: §5.
  • [40] L. Wu, I. E. Yen, Z. Zhang, K. Xu, L. Zhao, X. Peng, Y. Xia, and C. Aggarwal (2019) Scalable global alignment graph kernel using random features: from node embedding to graph embedding. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1418–1428. Cited by: §5.
  • [41] K. Xu, W. Hu, J. Leskovec, and S. Jegelka (2019) How powerful are graph neural networks?. In International Conference on Learning Representations, Cited by: §4.3.
  • [42] X. Xu, C. Liu, Q. Feng, H. Yin, L. Song, and D. Song (2017) Neural network-based graph embedding for cross-platform binary code similarity detection. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, Cited by: §A.1.1, §1, §3.1, §4.1.
  • [43] X. Yan and J. Han (2002) Gspan: graph-based substructure pattern mining. In Proceedings of IEEE International Conference on Data Mining, pp. 721–724. Cited by: §1.
  • [44] Z. Ying, J. You, C. Morris, X. Ren, W. Hamilton, and J. Leskovec (2018) Hierarchical graph representation learning with differentiable pooling. In Advances in Neural Information Processing Systems, pp. 4800–4810. Cited by: §1, footnote 1.
  • [45] T. Yoshida, I. Takeuchi, and M. Karasuyama (2019) Learning interpretable metric between graphs: convex formulation and computation with graph mining. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1026–1036. Cited by: §5.
  • [46] Z. Zeng, A. K. Tung, J. Wang, J. Feng, and L. Zhou (2009) Comparing stars: on approximating graph edit distance. Proceedings of the VLDB Endowment 2 (1), pp. 25–36. Cited by: §4.1, §5, §5.
  • [47] C. Zhang, D. Song, C. Huang, A. Swami, and N. V. Chawla (2019) Heterogeneous graph neural network. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Cited by: §3.1, §3.2.

Appendix A Appendix

a.1 Datasets

a.1.1 Classification Datasets

In our evaluation, two binary functions compiled from the same source code but under different settings (architectures, compilers, optimization levels, etc.) are considered semantically similar to each other. Note that one source-code function, after being compiled with different settings, can generate various binary functions. To learn similarity scores for pairs of binary functions, we represent each binary function with its control flow graph, whose nodes represent basic blocks (a basic block is a sequence of instructions without jumps) and whose edges represent control flow paths between these basic blocks. Thus, detecting the similarity between two binary functions can be cast as the problem of learning the similarity score y ∈ {1, -1} between two control flow graphs G_1 and G_2, where y = 1 indicates that G_1 and G_2 are similar and y = -1 indicates that they are dissimilar. We prepare two benchmark datasets generated from two popular open-source projects, FFmpeg and OpenSSL, to evaluate our model on the graph-graph classification tasks.

For FFmpeg, we prepare the corresponding control flow graph (CFG) dataset as the benchmark for detecting binary function similarity. First, we compile FFmpeg 4.1.4 using 2 different compilers (gcc 5.4.0 and clang 3.8.0) and 4 different compiler optimization levels (O0-O3), generating 8 different binary files. Second, these 8 binaries are disassembled using IDA Pro (https://www.hex-rays.com/products/ida/index.shtml), which produces CFGs for all disassembled functions. Finally, for each basic block in the CFGs, we extract 6 block-level numeric features as the initial node representation using IDAPython (a Python-based plugin for IDA Pro).

OpenSSL is built from OpenSSL (v1.0.1f and v1.0.1u) using gcc 5.4 on three different architectures (x86, MIPS, and ARM) with four different optimization levels (O0-O3). The OpenSSL dataset we evaluate was previously released by [42] and is publicly available (https://github.com/xiaojunxu/dnn-binary-code-similarity), with the 6 block-level numeric features already prepared.

Overall, for both the FFmpeg and OpenSSL datasets, each node in the CFGs is initialized with 6 block-level numeric features: the number of string constants, numeric constants, total instructions, transfer instructions, call instructions, and arithmetic instructions.
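To make these features concrete, the sketch below counts the six quantities for one basic block. The `Insn` record and the mnemonic category sets are illustrative assumptions, not the paper's actual IDAPython extraction code:

```python
from collections import namedtuple

# Hypothetical instruction record; real extraction would use IDAPython APIs.
Insn = namedtuple("Insn", ["mnemonic", "string_consts", "numeric_consts"])

# Assumed mnemonic categories (x86-flavored, for illustration only).
TRANSFER = {"jmp", "jz", "jnz", "je", "jne", "ret"}
CALLS = {"call"}
ARITH = {"add", "sub", "mul", "imul", "div", "inc", "dec"}

def block_features(insns):
    """Return the 6 block-level numeric features for one basic block."""
    return [
        sum(len(i.string_consts) for i in insns),    # of string constants
        sum(len(i.numeric_consts) for i in insns),   # of numeric constants
        len(insns),                                  # of total instructions
        sum(i.mnemonic in TRANSFER for i in insns),  # of transfer instructions
        sum(i.mnemonic in CALLS for i in insns),     # of call instructions
        sum(i.mnemonic in ARITH for i in insns),     # of arithmetic instructions
    ]
```

The resulting 6-dimensional vector would serve as the initial node representation of the corresponding basic block.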

a.1.2 Regression Datasets

Instead of directly computing the graph edit distance (GED) between two graphs G_1 and G_2, we learn a similarity score s(G_1, G_2) in the range (0, 1], defined as the normalized exponential of the GED: s(G_1, G_2) = exp(-nGED(G_1, G_2)), where nGED(G_1, G_2) = GED(G_1, G_2) / ((|G_1| + |G_2|) / 2), |G_1| and |G_2| denote the numbers of nodes of G_1 and G_2, and nGED and GED denote the normalized and un-normalized graph edit distances between G_1 and G_2, respectively.
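As a concrete sketch, this normalization can be written directly (a minimal illustration assuming the SimGNN-style convention of dividing the GED by the average number of nodes of the two graphs):

```python
import math

def normalized_ged(ged, n1, n2):
    """nGED = GED / ((|G1| + |G2|) / 2), assuming the SimGNN-style
    normalization by the average size of the two graphs."""
    return ged / ((n1 + n2) / 2.0)

def ged_similarity(ged, n1, n2):
    """Map a non-negative GED to a similarity score in (0, 1]."""
    return math.exp(-normalized_ged(ged, n1, n2))
```

Identical graphs (GED = 0) get similarity 1, and the score decays smoothly toward 0 as the edit distance grows relative to graph size.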

We employ the AIDS700 and LINUX1000 datasets released by [1], which are publicly available (https://github.com/yunshengb/SimGNN). Each dataset contains a set of graph pairs together with their ground-truth GED scores, computed by exponential-time exact GED algorithms [17, 30]. As the ground-truth GEDs of another dataset, IMDB-MULTI, are only inexact approximations, we do not consider it in our experiments.

AIDS700 is a subset of the AIDS dataset, a collection of AIDS antiviral screen chemical compounds from the Development Therapeutics Program (DTP) at the National Cancer Institute (NCI, https://wiki.nci.nih.gov/display/NCIDTPdata/AIDS+Antiviral+Screen+Data). Originally, AIDS contains 42,687 chemical compounds, each of which can be represented as a graph with atoms as nodes and bonds as edges. To avoid calculating the ground-truth GED between graphs with large numbers of nodes, the authors of [1] created the AIDS700 dataset, which contains 700 graphs with 10 or fewer nodes. For each graph in AIDS700, every node is labeled with the element type of its atom and every edge is unlabeled (i.e., bond features are ignored).

LINUX1000 is a subset of the Linux dataset introduced in [38]. The original Linux dataset is a collection of 48,747 program dependence graphs generated from the Linux kernel. Each graph is a static representation of the data flow and control dependencies within one function, with each node assigned to one statement and each edge describing the dependency between two statements. Again, to avoid calculating the ground-truth GED between graphs with large numbers of nodes, the LINUX1000 dataset used in [1] was randomly sampled and contains 1000 graphs with 10 or fewer nodes. For each graph in LINUX1000, both nodes and edges are unlabeled.

For both the classification and regression datasets, Table 7 provides detailed statistics. In our evaluation, for the classification tasks, we split each dataset into three disjoint subsets of binary functions for training/validation/testing. For the regression tasks, we first split the graphs of each dataset into training, validation, and testing sets, and then build the pairwise training/validation/testing data following the previous work [1].

| Tasks | Datasets | Sub-datasets | # of Graphs | # of Functions | # of Nodes (Min/Max/Avg) | # of Edges (Min/Max/Avg) | Degree (Min/Max/Avg) |
|---|---|---|---|---|---|---|---|
| classification | FFmpeg | [3, 200] | 83,008 | 10,376 | 3/200/18.83 | 2/332/27.02 | 1.25/4.33/2.59 |
| classification | FFmpeg | [20, 200] | 31,696 | 7,668 | 20/200/51.02 | 20/352/75.88 | 1.90/4.33/2.94 |
| classification | FFmpeg | [50, 200] | 10,824 | 3,178 | 50/200/90.93 | 52/352/136.83 | 2.00/4.33/3.00 |
| classification | OpenSSL | [3, 200] | 73,953 | 4,249 | 3/200/15.73 | 1/376/21.97 | 0.12/3.95/2.44 |
| classification | OpenSSL | [20, 200] | 15,800 | 1,073 | 20/200/44.89 | 2/376/67.15 | 0.12/3.95/2.95 |
| classification | OpenSSL | [50, 200] | 4,308 | 338 | 50/200/83.68 | 52/376/127.75 | 2.00/3.95/3.04 |
| regression | AIDS700 | - | 700 | - | 2/10/8.90 | 1/14/8.80 | 1.00/2.80/1.96 |
| regression | LINUX1000 | - | 1000 | - | 4/10/7.58 | 3/13/6.94 | 1.50/2.60/1.81 |

Table 7: Summary statistics of datasets for both classification & regression tasks.

a.2 More Experimental Setup

a.2.1 Other experimental settings for our models

For SGNN, we use three GCN layers in the node embedding layer, each with an output dimension of 100. We use ReLU as the activation function, along with a dropout layer (dropout rate 0.1) after each GCN layer. In the graph-level embedding aggregation layer of SGNN, we can employ different aggregation functions (i.e., Max, FCMax, Avg, FCAvg, and BiLSTM), as stated in Section 3.1. For NGMN, we set the number of perspectives to 100. We also evaluated the same aggregation functions as for SGNN and found that BiLSTM consistently performs better than the others (see Appendix A.4). Thus, we take BiLSTM as the default aggregation function for NGMN and set its hidden size equal to the dimension of the node embeddings. For each graph, we concatenate the last hidden vectors of the two BiLSTM directions, which yields a 200-dimensional graph embedding.
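The BiLSTM aggregation described above can be sketched as follows (a minimal PyTorch illustration; the ordering of nodes fed to the BiLSTM and the exact wiring are assumptions):

```python
import torch
import torch.nn as nn

node_dim = 100  # dimension of node embeddings, as in the text
bilstm = nn.LSTM(input_size=node_dim, hidden_size=node_dim,
                 bidirectional=True, batch_first=True)

def aggregate(node_embeddings):
    """Map (num_nodes, node_dim) node embeddings to a (2 * node_dim,) graph embedding."""
    _, (h_n, _) = bilstm(node_embeddings.unsqueeze(0))  # h_n: (2, 1, node_dim)
    # Concatenate the last hidden states of the forward and backward directions.
    return torch.cat([h_n[0, 0], h_n[1, 0]], dim=-1)
```

Concatenating the two 100-dimensional final hidden states yields the 200-dimensional graph embedding mentioned in the text.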

a.2.2 Detailed experimental settings for baseline models

In principle, we follow the same experimental settings as in the baseline methods' original papers, adjusting a few settings to fit specific tasks. For instance, since SimGNN is originally designed for graph-graph regression tasks, we modify the final layer of its architecture so that it can be evaluated fairly on graph-graph classification tasks. Detailed experimental settings of all three baseline methods for both classification and regression tasks are given as follows.

SimGNN: SimGNN first adopts a three-layer GCN to encode each node of a pair of graphs into a vector. Then, SimGNN employs a two-stage strategy to model the similarity between the two graphs: i) it uses a Neural Tensor Network (NTN) module to model interactions between the two graph-level embeddings, each aggregated by a node attention mechanism; ii) it extracts histogram features from the pairwise node-node similarity scores. Finally, the features learned from the two stages are concatenated and fed into multiple fully connected layers to obtain the final prediction.
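For reference, a minimal NTN scoring module might look like the following (a sketch assuming K = 16 interaction slices as in the settings below; the initialization and exact nonlinearity may differ from SimGNN's implementation):

```python
import torch
import torch.nn as nn

class NTN(nn.Module):
    """Neural Tensor Network scoring two graph-level embeddings (sketch)."""
    def __init__(self, dim, k=16):
        super().__init__()
        self.W = nn.Parameter(torch.randn(k, dim, dim))  # bilinear tensor slices
        self.V = nn.Linear(2 * dim, k)                   # linear term (with bias)

    def forward(self, h1, h2):
        # Bilinear interaction: one score per tensor slice k.
        bilinear = torch.einsum("d,kde,e->k", h1, self.W, h2)
        return torch.relu(bilinear + self.V(torch.cat([h1, h2])))
```

The K-dimensional output is what gets concatenated with the histogram features before the fully connected layers.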

For the graph-graph regression tasks, the output dimensions of the three GCN layers are 64, 32, and 16, respectively. The value of K in the NTN and the number of histogram bins are both set to 16. Four fully connected layers reduce the dimension of the concatenated results from 32 to 16, 16 to 8, 8 to 4, and 4 to 1. As for training, the mean square error (MSE) loss function is used to train the model with the Adam optimizer. The learning rate is set to 0.001 and the batch size is set to 128. We set the number of iterations to 10,000 and select the best model based on the lowest validation loss.

To fairly compare our model with SimGNN on the graph-graph classification tasks, we adjust the settings of SimGNN as follows. We follow the same architecture as in the regression tasks, except that the output dimension of the last fully connected layer is set to 2. We apply a softmax operation over the output of SimGNN to get the predicted binary label. As for training, we use the cross-entropy loss function and set the number of epochs to 100. Other training hyper-parameters are kept the same as in the regression tasks.

GMN: The key idea of GMN is to improve the node embeddings of one graph by incorporating the implicit neighbors of the other graph through a soft attention mechanism. GMN follows a model architecture similar to the neural message passing network, with three components: an encoder layer that maps nodes and edges to initial feature vectors, a propagation layer that further updates the node embeddings through the proposed strategies, and an aggregator that computes a graph-level representation for each graph.

For the graph-graph classification tasks, we use a 1-layer MLP as the node/edge encoder and set the number of rounds of propagation to 5. The dimension of the node features is set to 32, and the dimension of the graph-level representation is set to 128. The Hamming distance is employed to compute the distance between the two graph-level representation vectors. Based on this distance, we train the model with the margin-based pairwise loss function for 100 epochs, with validation carried out every epoch. The Adam optimizer is used with a learning rate of 0.001 and a batch size of 10.
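A margin-based pairwise loss of this kind can be sketched with a standard contrastive formulation (an assumption for illustration; GMN's exact loss on the Hamming distance may differ in detail):

```python
import torch

def margin_pair_loss(dist, is_similar, margin=1.0):
    """Contrastive-style pairwise loss on a distance between two graph-level
    representations: similar pairs minimize the distance, while dissimilar
    pairs are pushed at least `margin` apart."""
    if is_similar:
        return dist
    return torch.relu(margin - dist)
```

With this formulation, a dissimilar pair already separated by more than the margin contributes zero loss.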

To enable fair comparisons with GMN on the graph-graph regression tasks, we adjust the GMN architecture by concatenating the graph-level representations of the two graphs and feeding the result into four fully connected layers (as in SimGNN), so that the final output dimension is reduced to 1. As for training, we use the mean square error (MSE) loss function with batch size 128. Other settings remain the same as in the classification tasks.

GraphSim: The main idea of GraphSim is to convert the graph similarity computation problem into a pattern recognition problem. GraphSim first employs GCNs to generate node embeddings for the pair of graphs, then turns the two sets of node embeddings into a similarity matrix consisting of the pairwise node-node similarity scores, feeds these matrices into convolutional neural networks (CNNs), and finally feeds the concatenated CNN outputs into multiple fully connected layers to obtain the final predicted graph-graph similarity score.
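The node-node similarity matrix at the core of GraphSim can be sketched as follows (cosine similarity of node embeddings is used here as an illustrative choice of score function, which is an assumption):

```python
import torch
import torch.nn.functional as F

def similarity_matrix(x1, x2):
    """Pairwise node-node similarity scores between the node-embedding
    matrices of two graphs; the (n1, n2) result is what gets fed to the CNNs.
    Cosine similarity (normalized inner products) is one common choice."""
    return F.normalize(x1, dim=1) @ F.normalize(x2, dim=1).t()
```

Treating this matrix as an image is what allows the subsequent CNN layers to detect similarity patterns.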

For the graph-graph regression tasks, three GCN layers are employed with output dimensions of 128, 64, and 32, respectively, followed by a stack of alternating convolutional and max-pooling layers as in the original GraphSim. Eight fully connected layers reduce the dimension of the concatenated CNN results from 384 to 256, 256 to 128, 128 to 64, 64 to 32, 32 to 16, 16 to 8, 8 to 4, and 4 to 1. As for training, the mean square error (MSE) loss function is used to train the model with the Adam optimizer. The learning rate is set to 0.001 and the batch size is set to 128. Similar to SimGNN, we set the number of iterations to 10,000 and select the best model based on the lowest validation loss.

To make a fair comparison with GraphSim in our evaluation, we also adjust GraphSim to handle the graph-graph classification tasks. We follow the same architecture as in the regression tasks, except that seven fully connected layers are used instead of eight. The output dimension of the final fully connected layer is set to 2, and we apply a softmax operation over it to get the predicted binary label. As for training, we use the cross-entropy loss function and set the number of epochs to 100. Other training hyper-parameters are kept the same as in the regression tasks.

a.2.3 Detailed experimental setup for different GNNs

When performing experiments to see how different GNNs affect the performance of NGMN, we only replace GCN with GraphSAGE, GIN, or GGNN, using the geometric deep learning library PyTorch Geometric (https://pytorch-geometric.readthedocs.io). More specifically, for GraphSAGE, we used a 3-layer GraphSAGE GNN with an output dimension of 100 for each layer. For GIN, we used 3 GIN modules, each with a 1-layer MLP of output dimension 100 as the learnable function. For GGNN, we used 3 one-layer propagation models to replace the 3 GCNs of our original setting, also with output dimension 100.

a.3 SGNN with different aggregation functions for both classification & regression tasks

To further compare our models with the SGNN models, we train and evaluate several SGNN variants with different aggregation functions: Max, FCMax, Avg, FCAvg, and BiLSTM. The classification and regression results are summarized in Table 8 and Table 9, respectively. For both classification and regression tasks, our models (the full HGMN model and its key component NGMN) show statistically significant improvements over all SGNN variants, which indicates the advantage of the proposed node-graph matching network.

| Model | FFmpeg [3, 200] | FFmpeg [20, 200] | FFmpeg [50, 200] | OpenSSL [3, 200] | OpenSSL [20, 200] | OpenSSL [50, 200] |
|---|---|---|---|---|---|---|
| SGNN (BiLSTM) | 96.92±0.13 | 97.62±0.13 | 96.35±0.33 | 95.24±0.06 | 96.30±0.27 | 93.99±0.62 |
| SGNN (Max) | 93.92±0.07 | 93.82±0.28 | 85.15±1.39 | 91.07±0.10 | 88.94±0.47 | 82.10±0.51 |
| SGNN (FCMax) | 95.37±0.04 | 96.29±0.14 | 95.98±0.32 | 92.64±0.15 | 93.79±0.17 | 93.21±0.82 |
| SGNN (Avg) | 95.61±0.05 | 96.09±0.05 | 96.70±0.13 | 92.89±0.09 | 93.90±0.24 | 94.12±0.35 |
| SGNN (FCAvg) | 95.18±0.03 | 95.74±0.15 | 96.43±0.16 | 92.70±0.09 | 93.72±0.19 | 93.49±0.30 |
| NGMN | 97.73±0.11 | 98.29±0.21 | 96.81±0.96 | 96.56±0.12 | 97.60±0.29 | 92.89±1.31 |
| HGMN (Max) | 97.44±0.32 | 97.84±0.40 | 97.22±0.36 | 94.77±1.80 | 97.44±0.26 | 94.06±1.60 |
| HGMN (FCMax) | 98.07±0.06 | 98.29±0.10 | 97.83±0.11 | 96.87±0.24 | 97.59±0.24 | 95.58±1.13 |
| HGMN (BiLSTM) | 97.56±0.38 | 98.12±0.04 | 97.16±0.53 | 96.90±0.10 | 97.31±1.07 | 95.87±0.88 |

Table 8: Classification results of SGNN models with different aggregation functions vs. NGMN and HGMN in terms of AUC scores (%).
| Datasets | Model | mse (10⁻³) | ρ | τ | p@10 | p@20 |
|---|---|---|---|---|---|---|
| AIDS700 | SGNN (BiLSTM) | 1.422±0.044 | 0.881±0.005 | 0.718±0.006 | 0.376±0.020 | 0.472±0.014 |
| AIDS700 | SGNN (Max) | 2.822±0.149 | 0.765±0.005 | 0.588±0.004 | 0.289±0.016 | 0.373±0.012 |
| AIDS700 | SGNN (FCMax) | 3.114±0.114 | 0.735±0.009 | 0.554±0.008 | 0.278±0.021 | 0.364±0.017 |
| AIDS700 | SGNN (Avg) | 1.453±0.015 | 0.876±0.002 | 0.712±0.002 | 0.353±0.007 | 0.444±0.012 |
| AIDS700 | SGNN (FCAvg) | 1.658±0.067 | 0.857±0.007 | 0.689±0.008 | 0.305±0.018 | 0.399±0.021 |
| AIDS700 | NGMN | 1.191±0.048 | 0.904±0.003 | 0.749±0.005 | 0.465±0.011 | 0.538±0.007 |
| AIDS700 | HGMN (Max) | 1.210±0.020 | 0.900±0.002 | 0.743±0.003 | 0.461±0.012 | 0.534±0.009 |
| AIDS700 | HGMN (FCMax) | 1.205±0.039 | 0.904±0.002 | 0.749±0.003 | 0.457±0.014 | 0.532±0.016 |
| AIDS700 | HGMN (BiLSTM) | 1.169±0.036 | 0.905±0.002 | 0.751±0.003 | 0.456±0.019 | 0.539±0.018 |
| LINUX1000 | SGNN (BiLSTM) | 2.140±1.668 | 0.935±0.050 | 0.825±0.100 | 0.978±0.012 | 0.965±0.007 |
| LINUX1000 | SGNN (Max) | 11.832±0.698 | 0.566±0.022 | 0.404±0.017 | 0.226±0.106 | 0.492±0.190 |
| LINUX1000 | SGNN (FCMax) | 17.795±0.406 | 0.362±0.021 | 0.252±0.015 | 0.239±0.000 | 0.241±0.000 |
| LINUX1000 | SGNN (Avg) | 2.343±0.453 | 0.933±0.012 | 0.790±0.017 | 0.778±0.048 | 0.811±0.050 |
| LINUX1000 | SGNN (FCAvg) | 3.211±0.318 | 0.909±0.004 | 0.757±0.008 | 0.831±0.163 | 0.813±0.159 |
| LINUX1000 | NGMN | 1.561±0.020 | 0.945±0.002 | 0.814±0.003 | 0.743±0.085 | 0.741±0.086 |
| LINUX1000 | HGMN (Max) | 1.054±0.086 | 0.962±0.003 | 0.850±0.008 | 0.877±0.054 | 0.883±0.047 |
| LINUX1000 | HGMN (FCMax) | 1.575±0.627 | 0.946±0.019 | 0.817±0.034 | 0.807±0.117 | 0.784±0.108 |
| LINUX1000 | HGMN (BiLSTM) | 0.439±0.143 | 0.985±0.005 | 0.919±0.016 | 0.955±0.011 | 0.943±0.014 |

Table 9: Regression results of SGNN models with different aggregation functions vs. NGMN and HGMN on AIDS700 and LINUX1000.

a.4 NGMN with different aggregation functions for both classification & regression tasks

We investigate the impact of different aggregation functions adopted in the aggregation layer of the NGMN model for both classification and regression tasks. Following the same default settings as in the previous experiments, we only change the aggregation layer of NGMN, using six possible aggregation functions: Max, FCMax, Avg, FCAvg, LSTM, and BiLSTM. As can be observed from Table 10 and Table 11, BiLSTM offers superior performance on all datasets for both classification and regression tasks in terms of most evaluation metrics. Therefore, we take BiLSTM as the default aggregation function for NGMN and fix it for the NGMN part of the HGMN models.

| Model | FFmpeg [3, 200] | FFmpeg [20, 200] | FFmpeg [50, 200] | OpenSSL [3, 200] | OpenSSL [20, 200] | OpenSSL [50, 200] |
|---|---|---|---|---|---|---|
| NGMN (Max) | 73.74±8.30 | 73.85±1.76 | 77.72±2.07 | 67.14±2.70 | 63.31±3.29 | 63.02±2.77 |
| NGMN (FCMax) | 97.28±0.08 | 96.61±0.17 | 96.65±0.30 | 95.37±0.19 | 96.08±0.48 | 95.90±0.73 |
| NGMN (Avg) | 85.92±1.07 | 83.29±4.49 | 85.52±1.42 | 80.10±4.59 | 70.81±3.41 | 66.94±4.33 |
| NGMN (FCAvg) | 95.93±0.21 | 73.90±0.70 | 94.22±0.06 | 93.38±0.80 | 94.52±1.16 | 94.71±0.86 |
| NGMN (LSTM) | 97.16±0.42 | 97.02±0.99 | 84.65±6.73 | 96.30±0.69 | 97.51±0.82 | 89.41±8.40 |
| NGMN (BiLSTM) | 97.73±0.11 | 98.29±0.21 | 96.81±0.96 | 96.56±0.12 | 97.60±0.29 | 92.89±1.31 |

Table 10: Classification results of NGMN models with different aggregation functions in terms of AUC scores (%).
| Datasets | Model | mse (10⁻³) | ρ | τ | p@10 | p@20 |
|---|---|---|---|---|---|---|
| AIDS700 | NGMN (Max) | 2.378±0.244 | 0.813±0.015 | 0.642±0.013 | 0.578±0.199 | 0.583±0.169 |
| AIDS700 | NGMN (FCMax) | 2.220±1.547 | 0.808±0.145 | 0.656±0.122 | 0.425±0.078 | 0.504±0.064 |
| AIDS700 | NGMN (Avg) | 1.524±0.161 | 0.880±0.010 | 0.717±0.012 | 0.408±0.044 | 0.474±0.027 |
| AIDS700 | NGMN (FCAvg) | 1.281±0.075 | 0.895±0.006 | 0.737±0.008 | 0.453±0.015 | 0.527±0.016 |
| AIDS700 | NGMN (LSTM) | 1.290±0.037 | 0.895±0.004 | 0.737±0.005 | 0.448±0.007 | 0.520±0.012 |
| AIDS700 | NGMN (BiLSTM) | 1.191±0.048 | 0.904±0.003 | 0.749±0.005 | 0.465±0.011 | 0.538±0.007 |
| LINUX1000 | NGMN (Max)* | 16.921±0.000 | - | - | - | - |
| LINUX1000 | NGMN (FCMax) | 4.793±0.262 | 0.829±0.006 | 0.665±0.011 | 0.764±0.170 | 0.767±0.166 |
| LINUX1000 | NGMN (Avg) | 4.050±0.594 | 0.888±0.008 | 0.719±0.012 | 0.501±0.093 | 0.536±0.112 |
| LINUX1000 | NGMN (FCAvg) | 6.953±0.195 | 0.897±0.004 | 0.736±0.005 | 0.499±0.126 | 0.509±0.129 |
| LINUX1000 | NGMN (LSTM) | 1.535±0.096 | 0.945±0.004 | 0.813±0.007 | 0.695±0.064 | 0.698±0.081 |
| LINUX1000 | NGMN (BiLSTM) | 1.561±0.020 | 0.945±0.002 | 0.814±0.003 | 0.743±0.085 | 0.741±0.086 |

  • * All repeated runs under this setting failed to converge during training, so the corresponding result metrics could not be calculated.

Table 11: Regression results of NGMN models with different aggregation functions on AIDS700 and LINUX1000.

a.5 NGMN with Different GNNs on regression tasks.

As a supplement to Table 6 in Section 4.3, Table 12 shows the experimental results of GCN versus GraphSAGE/GIN/GGNN in NGMN for the regression tasks.

| Datasets | Model | mse (10⁻³) | ρ | τ | p@10 | p@20 |
|---|---|---|---|---|---|---|
| AIDS700 | NGMN-GCN (ours) | 1.191±0.048 | 0.904±0.003 | 0.749±0.005 | 0.465±0.011 | 0.538±0.007 |
| AIDS700 | NGMN-GraphSAGE | 1.275±0.054 | 0.901±0.006 | 0.745±0.008 | 0.448±0.016 | 0.533±0.014 |
| AIDS700 | NGMN-GIN | 1.367±0.085 | 0.889±0.008 | 0.729±0.010 | 0.400±0.022 | 0.492±0.021 |
| AIDS700 | NGMN-GGNN | 1.870±0.082 | 0.871±0.004 | 0.706±0.005 | 0.388±0.015 | 0.457±0.017 |
| LINUX1000 | NGMN-GCN (ours) | 1.561±0.020 | 0.945±0.002 | 0.814±0.003 | 0.743±0.085 | 0.741±0.086 |
| LINUX1000 | NGMN-GraphSAGE | 2.784±0.705 | 0.915±0.019 | 0.767±0.028 | 0.682±0.183 | 0.693±0.167 |
| LINUX1000 | NGMN-GIN | 1.126±0.164 | 0.963±0.006 | 0.858±0.015 | 0.792±0.068 | 0.821±0.035 |
| LINUX1000 | NGMN-GGNN | 2.068±0.991 | 0.938±0.028 | 0.815±0.055 | 0.628±0.189 | 0.654±0.176 |

Table 12: Regression results of different GNNs on AIDS700 and LINUX1000.