I Introduction
Graph similarity search is a common and fundamental operation in graph databases, which over the past few decades has found widespread applications in fields including bioinformatics, sociology, and chemical analysis. For evaluating the similarity between graphs, Graph Edit Distance (GED) [1] is one of the most prevalent measures because of its wide applicability: GED is capable of dealing with various kinds of graphs, including directed and undirected graphs, labeled and unlabeled graphs, as well as simple graphs and multigraphs (which may have multiple edges between two vertices). Furthermore, GED has high interpretability, since it corresponds to concrete sequences of graph edit operations (insertions of vertices and edges, etc.) of minimal length, rather than the implicit graph embeddings utilized in spectral [2] or kernel [3] measures. Example 1 illustrates the basic idea of GED.
Example 1
Assume that we have two graphs and as shown in Figure 1. The label sets of graph ’s vertices and edges are and , respectively, and the label sets of graph ’s vertices and edges are and , respectively. The Graph Edit Distance (GED) between and is the minimal number of graph edit operations to transform into . It can be proved that the GED between and is 3, which can be achieved by (1) deleting the edge between and in , (2) inserting an isolated vertex with label , and (3) inserting an edge between and with label .
With GED as the graph similarity measure, the graph similarity search problem is formally stated as follows.
Problem Statement: (Graph Similarity Search) Given a graph database , a query graph , and a similarity threshold , the graph similarity search is to find a set of graphs , where the graph edit distance (GED) between and each graph in is less than or equal to .
A straightforward solution to the problem above is to check exact GEDs for all pairs of and graphs in database . However, despite its prevalence, GED is proved to be NP-hard to compute exactly [4], which may lead to unsatisfactory computational efficiency when we conduct a similarity search over large graphs. The most widely-applied approach for computing exact GED is the algorithm [5], which searches for the optimal matching between the vertices of two graphs in a heuristic manner. Specifically, given two graphs with and vertices, respectively, the time complexity of the algorithm is in the worst case. Due to the hardness of computing exact GED (NP-hard) [4], most existing works involving exact GED computation [4] [5] only conducted experiments on graphs with fewer than 10 vertices. In addition, a recent study [6] indicates that the algorithm is incapable of computing the GED between graphs with more than 12 vertices, which can hardly satisfy the demand for searching real-world graphs. For instance, a common requirement in bioinformatics is to search and compare molecular structures of similar proteins [7]. However, the structures of human proteins usually contain hundreds of vertices (i.e., atoms) [8], which places similarity search well beyond the capability of the approaches mentioned above. Another observation is that many real-world applications do not always require an exact similarity value; an approximate one with a quality guarantee is also acceptable, especially in real-time applications where responsiveness is far more important than accuracy. Taking protein search as an example again, it is certainly more desirable for users to obtain an approximate solution within a second than to wait a couple of days for the exact answer.
To address the problems above, many approaches have been proposed to achieve an approximate GED between the query graph and each graph in database in polynomial time [9], which can be leveraged to accelerate the graph similarity search by trading accuracy for efficiency. Assuming that there are two graphs with and vertices, respectively, where , one well-studied method [10] [11] for estimating the GED between these two graphs is to solve a corresponding linear sum assignment problem, which requires at least time for obtaining the global optimal value, or time for a local optimal value by applying the greedy algorithm [12]. An alternative method is spectral seriation [13], which first generates the vector representations of graphs by extracting the leading eigenvalues of their adjacency matrices ( time) [14], and then exploits a probabilistic model based on these vectors to estimate GED in time. To further enhance the efficiency of GED estimation and better satisfy the demands of graph similarity search on large graphs, we propose a novel probabilistic approach which aims at estimating the GED with a lower time cost , where is the number of vertices, is the average degree of the graphs involved, and is the similarity threshold in the stated graph similarity search problem. Note that the similarity threshold is often set to a small value (i.e., ) and does not increase with the number of vertices in previous studies [4] [15]; thus, we can assume that is a constant with regard to when the graph is sufficiently large. Moreover, most real-world graphs studied in related works [11] [12] are scale-free graphs [16], and we prove that the average degree is for scale-free graphs. Therefore, under the assumptions above, the time complexity of our approach is for comparing two scale-free graphs, and for searching similar graphs in the graph database , where is the number of graphs in database .
Our method is mainly inspired by probabilistic modeling approaches which are broadly utilized in similarity searches on various types of data such as text and images [17]. The basic idea of these methods is to model the formation of an object as a stochastic process, and to establish probabilistic connections between objects. In this paper, we follow this idea and model the formation of graph branch distances (GBDs) as the results of random graph editing processes, where GBD is defined in Section III. By doing so, we prove the probabilistic relationship between GED and GBD, which is finally utilized to estimate GED by graph branch distance (GBD).
To summarize, we make the following contributions.


We adopt branches [15] as elementary structures in graphs, and define a novel graph similarity measure by comparing branches between graphs, i.e., Graph Branch Distance (GBD), which can be efficiently calculated in time, where is the number of vertices, and is the average degree of the graphs involved.

We build a probabilistic model which reveals the relationship between GED and GBD by considering branch variations as the result ascribed to graph edit operations. By applying our model, the GED between any two graphs can be estimated by their GBD in time, where is the similarity threshold in the graph similarity search problem.

We conduct extensive experiments to show our approach has better accuracy, efficiency and scalability compared with the related approximation methods over real and synthetic data.
The paper is organized as follows. In Section II, we formally define the symbols and concepts which are used in this paper. In Section III, we give definitions of branches and Graph Branch Distance (GBD). In Section IV, we introduce the extended graphs, which are exploited to simplify our model. In Section V, we derive the probabilistic relation between GBD and GED, which is leveraged in Section VI to perform the graph similarity search. In Section VII, we demonstrate the efficiency and effectiveness of our proposed approaches through extensive experiments. We discuss related works in Section VIII, and we conclude the paper in Section IX.
II Preliminaries
The graphs discussed in this paper are restricted to simple labeled undirected graphs. Specifically, the th graph in database is denoted by: , where is the set of vertices, is the set of edges, while is a general labelling function. For any vertex , its label is given by . Similarly, for any edge , its label is given by . In addition, and are defined as the sets of all possible labels for vertices and edges, respectively. We also define as a virtual label, which will be used later in our approach. When the label of a vertex (or edge) is , the vertex (or edge) is said to be virtual and does not actually exist. Particularly, we have and . Note that our method can also handle directed and weighted graphs by considering edge directions and weights as special labels.
In this paper, we take Graph Edit Distance (GED)[1] as the graph similarity measure, which is defined as follows.
Definition 1 (Graph Edit Distance)
The edit distance between graphs and , denoted by , is the minimal number of graph edit operations which are necessary to transform into , where the graph edit operations (GEO) are restricted to the following six types:

AV: Add one isolated vertex with a nonvirtual label;

DV: Delete one isolated vertex;

RV: Relabel one vertex;

AE: Add one edge with a nonvirtual label;

DE: Delete one edge;

RE: Relabel one edge.
Assume that a particular graph edit operation sequence which transforms graph into is , where is the unique ID of this sequence. Then, according to Definition 1, is the minimal length for all possible operation sequences, that is, , where is the length of the sequence . The set of all operation sequences from to of the minimal length is defined as .
III Branch Distance Between Graphs
To reduce the high cost of exact GED computation (NP-hard [4]) in the graph similarity search, one widely-applied strategy for pruning search results [4] [18] [19] [15] is to exploit the differences between graph substructures as bounds on exact GED values. In this paper, we consider branches [15] as elementary graph units, which are defined as follows:
Definition 2 (Branches)
The branch rooted at vertex is defined as , where is the label of vertex , and is the sorted multiset containing all labels of edges adjacent to . The sorted multiset of all branches in is denoted by .
In practice, each branch is stored as a list of strings whose first element is and whose following elements are the strings in the sorted multiset . In addition, for each graph is stored in a sorted multiset, whose elements are essentially lists of strings (i.e., branches) and are always sorted in ascending order by the ordering algorithm in [20]. For a fair comparison of computational efficiency, we assume that all auxiliary data structures in different methods, such as the cost matrix [11], the adjacency matrix [13], and our branches, are precomputed and stored with the graphs.
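As a concrete illustration of how branches might be materialized, the sketch below builds each branch as a list of strings (vertex label first, then sorted adjacent edge labels) and returns all branches as a sorted multiset. The identifier names (`vertices`, `edges`, `vlabel`, `elabel`) are our own illustrative assumptions, not names from the paper.

```python
def branches(vertices, edges, vlabel, elabel):
    """Compute the branch of every vertex per Definition 2: the vertex label
    together with the sorted multiset of labels of its adjacent edges."""
    adj = {v: [] for v in vertices}
    for (u, w) in edges:
        adj[u].append(elabel[(u, w)])
        adj[w].append(elabel[(u, w)])
    # Each branch is a list of strings: the vertex label, then the
    # adjacent edge labels in sorted order.
    branch_list = [[vlabel[v]] + sorted(adj[v]) for v in vertices]
    # The multiset of all branches is itself kept sorted.
    return sorted(branch_list)
```

With a two-vertex graph labeled A and B joined by an edge labeled x, this yields the sorted branch multiset `[["A", "x"], ["B", "x"]]`.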
To define the equality between two branches, we introduce the concept of branch isomorphism as follows.
Definition 3 (Branch Isomorphism)
Two branches and are isomorphic if and only if and , which is denoted by .
Suppose that we have two branches and , where the degrees of vertices and are and , respectively. From the previous discussion, the branches and are stored as lists of lengths and , respectively. Therefore, checking whether and are isomorphic amounts to checking whether two lists of lengths and are identical, which can be done in time when , and otherwise in constant time, since the length of a list can be obtained in one unit of time.
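Since branches are stored as sorted lists of strings, the isomorphism test of Definition 3 reduces to plain list equality, with the length check serving as the constant-time shortcut described above. A minimal sketch:

```python
def branches_isomorphic(b1, b2):
    """Definition 3: two branches are isomorphic iff they have the same
    vertex label and the same sorted multiset of adjacent edge labels.
    With the list-of-strings storage, this is list equality; the length
    comparison rejects mismatched degrees in constant time."""
    return len(b1) == len(b2) and b1 == b2
```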
Finally, we define the Graph Branch Distance (GBD).
Definition 4 (Graph Branch Distance)
The branch distance between graphs and , denoted by , is defined as:
(1) 
where and are the multisets of all branches in graphs and , respectively.
TABLE I: Notations (symbols omitted).
- , the graph database
- , the th graph in database
- , the vertices in
- , the edges in
- , the query graph
- labelling function for vertices and edges
- the set of all possible vertex labels
- the set of all possible edge labels
- the virtual label
- the similarity threshold
- a graph edit operation (GEO) sequence
- set of all GEO sequences of minimal lengths
- Graph Edit Distance
- the branch rooted at vertex
- the label of vertex
- sorted multiset of labels of edges adjacent to
- sorted multiset of all branches in graph
- Graph Branch Distance
- extended graph of with extension factor
- , when
The intuition behind introducing GBD is as follows. The state-of-the-art method [15] for pruning graph similarity search results assumes that the difference between the branches of two graphs is closely related to their GED. Therefore, in this paper, we aim to use GBD to closely estimate the GED of two graphs.
Example 2 below illustrates the process of computing GBD.
Example 2
Assume that we have two graphs and as shown in Figure 1. The branches rooted at the vertices in graphs and are as follows.
The sorted multisets of branches in and are:
Therefore, according to Definition 4, we can obtain the graph branch distance (GBD) between graphs and by applying Equation (1), which is:
where and are the sizes of multisets and , respectively. In addition, according to Definition 3, the only pair of isomorphic branches between multisets and is . Therefore, the intersection set of multisets and is , whose size is .
Note that the time cost of computing the size of intersection of two sorted multisets is [4], where and are the sizes of two multisets, respectively. Therefore, the GBD between query graph and any graph can be computed in time:
(2) 
where , is the degree of th compared vertex in , and is the average degree of graph .
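The multiset-intersection step above can be sketched as follows. Note that Equation (1) itself is not reproduced in this text, so the `gbd` function below ASSUMES one plausible form (the larger branch-multiset size minus the intersection size) purely for illustration; consult Equation (1) for the exact definition. `Counter` is used here as a convenient multiset, although the paper's merge-based intersection over sorted multisets achieves the same result.

```python
from collections import Counter

def multiset_intersection_size(b1, b2):
    """Size of the multiset intersection of two branch multisets.
    Branches are lists, so they are converted to tuples to be hashable."""
    c1 = Counter(map(tuple, b1))
    c2 = Counter(map(tuple, b2))
    return sum((c1 & c2).values())

def gbd(b1, b2):
    """HEDGED sketch of Equation (1): assumed here to be the size of the
    larger branch multiset minus the multiset intersection size."""
    return max(len(b1), len(b2)) - multiset_intersection_size(b1, b2)
```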
The GBD defined in this section is utilized to model the graph edit process and further leveraged for estimating the graph edit distance (GED) in Section V.
IV Extended Graphs
In this section, we reduce the number of graph edit operation types that need to be considered by extending graphs with virtual vertices and edges, which helps to simplify our probabilistic model in Section V. Moreover, we show that the GED and the GBD between the extended graphs stay the same as the GED and the GBD between the original ones, respectively.
The definition of extended graphs is as follows.
Definition 5 (Extended Graphs)
For graph , its extended graph, denoted by , is generated by first inserting isolated virtual vertices into , and then inserting a virtual edge between each pair of nonadjacent vertices in (including virtual vertices), where is the extension factor.
Example 3
In particular, for any graph pair where , we define and by extending and with extension factor and , respectively. Previous studies [21] [22] have shown that, for any graph edit operation sequence which transforms the extended graph into and has the minimal length, every operation in is equivalent to a relabelling operation on . Therefore, we only need to consider graph edit operations of types RV and RE when modeling the graph edit process of transforming the extended graph into .
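A sketch of the extension procedure of Definition 5 follows, assuming the virtual label is represented by the placeholder string `"eps"` and virtual vertices by generated names `virtual_i` (both our own conventions, not the paper's):

```python
from itertools import combinations

EPS = "eps"  # stand-in for the virtual label

def extend_graph(vertices, edges, vlabel, elabel, k):
    """Definition 5: insert k isolated virtual vertices, then insert a
    virtual edge between every pair of non-adjacent vertices, so the
    extended graph is complete."""
    vertices = list(vertices) + [f"virtual_{i}" for i in range(k)]
    vlabel = dict(vlabel, **{f"virtual_{i}": EPS for i in range(k)})
    elabel = dict(elabel)
    existing = {frozenset(e) for e in edges}
    edges = list(edges)
    for u, w in combinations(vertices, 2):
        if frozenset((u, w)) not in existing:
            edges.append((u, w))
            elabel[(u, w)] = EPS  # virtual edge
    return vertices, edges, vlabel, elabel
```

Extending a two-vertex, one-edge graph with extension factor 1 yields a complete graph on three vertices, one of them virtual. As the paper notes, this construction is conceptual only and never materialized in practice.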
Finally, given the graphs and (for ), and their extended graphs and , we have the following Theorems 1 and 2, which are utilized in Section V.
Theorem 1
Please refer to Appendix A.
Theorem 2
Please refer to Appendix B.
Note that the extended graph is only a conceptual model for reducing the number of graph edit operation types that need to be considered. According to Theorems 1 and 2, whenever the values of and are required, we can instead calculate and . Therefore, in practice, we do not actually convert graphs into their extended versions, and the calculations of GEDs and GBDs are still conducted on original graphs rather than the extended ones, which means that there is no overhead for creating and maintaining extended graphs.
V Our Probabilistic Model
In this section, we aim to solve the stated graph similarity search problem by estimating GED from GBD in a probabilistic manner. According to Theorems 1 and 2, the relation between the original graphs’ and must be the same as the relation between the extended graphs’ and . Therefore, as discussed in Section IV, we only need to consider graph edit operations of types RV and RE when building our probabilistic model.
To be more specific, we consider two given graphs and , where , and their extended graphs and . For simplicity, we denote and by and , respectively. As mentioned in Section I, we model the formation of graph branch distances (GBDs) as the result of random graph editing processes, and thus we establish a probabilistic connection between GED and GBD. The detailed steps to construct our model are as follows.

 Step 1:

We consider as an observed random variable.
 Step 2:

We randomly choose one graph edit operation (GEO) sequence from all possible GEO sequences whose lengths are equal to . We define a random variable , where is a particular choice of operation sequence from , and iff the sequence with ID is chosen, that is, .
 Step 3:

We model the numbers of RV and RE operations in the sequence chosen in Step 2 as random variables and , respectively, where iff the number of operations with type RV in is , and iff the number of operations with type RE in is . Note that, when given and , we always have .
 Step 4:

We define the random variable , where is a particular value of , and iff and the number of vertices covered by relabeled edges is when conducting . In addition, we define the random variable , where and are particular values of and , respectively. That is, iff , and the number of vertices either relabeled or covered by relabeled edges is .
We conduct Step 4 because we want to model branch variations by the number of branches rooted at vertices that are either relabeled or covered by relabeled edges, i.e., the random variable .
 Step 5:

We consider as the random variable dependent on , where their relation is proved in Appendix G.
The random variables defined in the five-step model above are listed in Table II, and the relations among them are represented by a Bayesian network, as shown in Figure 3. We use Example 4 below to better illustrate the key idea of our model.
TABLE II: Random variables in our model (symbols omitted).
- Choice of operation sequence
- Number of relabeled vertices
- Number of relabeled edges
- Number of vertices covered by relabeled edges
- Number of vertices either relabeled or covered by relabeled edges
Example 4
Consider two extended graphs and , as shown in Figure 4, where the virtual edges are represented by dashed lines. The minimal number of graph edit operations to transform into is , and the set of all possible graph edit operation sequences with the minimal length is:
where the subscript of each sequence is its ID, and
Then, the values of random variables in our model could be:


First, we consider as a random variable, which has the value in this example.

Second, we randomly choose one sequence from , which is . Therefore, in this case the random variable .

Third, the numbers of RV and RE operations in are and , respectively. Therefore, the random variables and in this example.

Then, after conducting operations in , the number of vertices covered by relabelled edges is , and the number of vertices either relabelled or covered by relabelled edges is . Therefore, the random variables and .

Finally, is considered to be the random variable, where in this example.
In this paper, we aim to infer the probability distribution of when given , which is essentially to calculate the following probability for given constants and :
(3)
By applying Bayes’ Rule, we have:
(4) 
where
(5)  
(6)  
(7) 
Therefore, the problem to solve becomes calculating the values of , and when given the extended graphs and , and the constants and . In the following three subsections V-A, V-B and V-C, we illustrate how to utilize our model to calculate , and , respectively.
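The Bayes' rule computation of Equation (4) can be sketched as follows, where `likelihood`, `prior_ged`, and `prior_gbd` are hypothetical stand-ins for the three precomputed quantities discussed in the next three subsections (the conditional GBD distribution, the GED prior, and the GBD prior):

```python
def posterior_ged(likelihood, prior_ged, prior_gbd):
    """Bayes' rule (Equation (4)) as a sketch: likelihood[g] holds the
    probability of the observed GBD given GED = g, prior_ged[g] holds the
    prior probability of GED = g, and prior_gbd is the prior probability
    of the observed GBD value."""
    return {g: likelihood[g] * prior_ged[g] / prior_gbd
            for g in likelihood}
```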
V-A Calculating
In this subsection, we aim to calculate . The key idea is to expand its formula by applying the Chain Rule [24] according to the dependency relationships between the random variables in our model, where the expanded formula is:
(8)
where
(9)  
(10)  
(11)  
(12) 
V-B Calculating
In this subsection, we aim to calculate , which is essentially to infer the prior distribution of GBDs. In practice, we precompute the prior distribution of GBDs without knowing the query graph in the graph similarity search problem. This is because, in most real-world scenarios [4] [15], the query graph often comes from the same population as the graphs in database . As a counterexample, it is unusual to use a query graph of a protein structure to search for similar graphs in social networks, and vice versa. Therefore, we assume that the GBDs between and each graph in database follow the same prior distribution as those among all graph pairs in .
To calculate the prior distribution of GBDs, we first randomly sample of graph pairs from the database , and calculate the GBD between each pair of sampled graphs. Then, we utilize a Gaussian Mixture Model (GMM) [25] to approximate the distribution of GBDs between all pairs of sampled graphs, whose probability density function is:
(13) 
where is the number of components in the mixture model, defined by the user, and is the probability density function of the normal distribution. Here, , and are the parameters of the th component in the mixture model, which can be inferred from the GBDs over sampled graph pairs. Please refer to [25] for the process of inferring parameters in GMM. Finally, we can compute the prior probability by integrating the probability density function on the adjacent interval of , i.e., .
(14)
Note that this integration technique is commonly used in continuity correction [26], which aims to approximate discrete distributions by continuous distributions. In addition, the choice of the integral interval is a common practice in continuity correction [27].
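Assuming the mixture parameters have already been fitted to the sampled GBDs, Equations (13) and (14) can be sketched with the standard library's `NormalDist`; the continuity-corrected integral over the unit interval around a GBD value is evaluated exactly through CDF differences (the parameter names `weights`, `means`, `stds` are our own):

```python
from statistics import NormalDist

def gmm_pdf(x, weights, means, stds):
    """Probability density of a Gaussian mixture (Equation (13)),
    with one (weight, mean, std) triple per component."""
    return sum(w * NormalDist(m, s).pdf(x)
               for w, m, s in zip(weights, means, stds))

def prior_gbd(d, weights, means, stds):
    """Continuity-corrected prior of GBD = d (Equation (14)): integrate
    the mixture density over [d - 0.5, d + 0.5] via CDF differences."""
    return sum(w * (NormalDist(m, s).cdf(d + 0.5) - NormalDist(m, s).cdf(d - 0.5))
               for w, m, s in zip(weights, means, stds))
```

For a single standard-normal component, `prior_gbd(0, [1.0], [0.0], [1.0])` is the mass of N(0, 1) on [-0.5, 0.5], about 0.383.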
Example 5
We randomly sample 60,000 pairs of graphs from the Fingerprint dataset of the IAM Graph Database [28], where the distribution of GBDs between all pairs of sampled graphs is represented by the blue histogram in Figure 6. We infer the GBD prior distribution of the sampled graph pairs, which is represented by the red line in Figure 6. This way, we can compute and store the probability for each possible value of variable by Equation (14), where the range of variable is analyzed in Section VI.
V-C Calculating
In this subsection, we focus on calculating , i.e., the prior distribution of GEDs. Recall that GED computation is NP-hard [4]. Therefore, sampling graph pairs and calculating the GED between each pair is infeasible, especially for large graphs, which means that we cannot simply infer the prior distribution of GEDs from sampled graph pairs.
To address this problem, we utilize the Jeffreys prior [29] as the prior distribution of GEDs. The Jeffreys prior is well-known for its non-informative property [30], which means that it provides very little additional information to the probabilistic model. Therefore, the Jeffreys prior is a common choice in Bayesian methods when we know little about the actual prior distribution. Then, according to the definition of the Jeffreys prior [29], the value of probability is calculated by:
(15)  
(16) 
where Equation (15) comes from the definition of the Jeffreys prior [29], which is the expected value of function with respect to the conditional distribution of variable when given . Here, is a constant for normalization, and the function is defined in the following Equation (17).
(17) 
where the symbol represents the partial derivative of a function, and the symbol means to substitute value for variable in the function . Please refer to Appendix C for the closed forms of Equation (16).
According to the closed form of Equation (16), we find that the value of probability only depends on the constant and the size of the extended graph , i.e., (). Therefore, for each data set, we precompute the value of function for each possible value of and , and store these precomputed values in a matrix for quick lookup when searching similar graphs. In Section VI, we discuss the ranges of and in detail. Note that, if there are possible values of and possible values of , then the normalization constant in Equations (15) and (16) will be .
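Since the closed form of Equation (16) is given only in Appendix C, the sketch below illustrates just the generic recipe behind a precomputed Jeffreys-style table: evaluate the square root of a Fisher-information-like quantity on a grid of values and normalize so the stored probabilities sum to one. The callable `fisher_info` is a hypothetical stand-in for the paper's closed form.

```python
import math

def jeffreys_table(fisher_info, grid):
    """Build a normalized discrete prior proportional to the square root
    of a Fisher-information-style quantity over a grid of values, as the
    Jeffreys prior prescribes; the division by z plays the role of the
    normalization constant C in Equations (15) and (16)."""
    raw = [math.sqrt(fisher_info(g)) for g in grid]
    z = sum(raw)
    return {g: r / z for g, r in zip(grid, raw)}
```

With a constant `fisher_info`, this degenerates to the uniform prior over the grid, matching the intuition that the Jeffreys prior is non-informative.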
Example 6
VI Graph Similarity Search with the Probabilistic Model
In this section, we first elaborate on our graph similarity search algorithm (i.e., GBDA) based on the model derived in the previous section, which consists of two stages: offline preprocessing and online querying. Then, we study the time and space complexity of these two stages in detail.
VI-A Graph Similarity Search Algorithm
Given a query graph , a graph database , a similarity threshold , and a probability threshold , the search result is computed by Algorithm 1, where and are the extended graphs of and , respectively. Step 1, tagged with the symbol *, is precomputed in the offline stage, as discussed in Sections V-B and V-C. We give Example 7 below to better illustrate the process of Algorithm 1.
Example 7
Assume that the graph in Figure 1 is the query graph , and in Figure 1 is a graph in database . Given the similarity threshold and the probability threshold , the process of determining whether should be in the search result is as follows.


First, we precompute and by inferring the prior distributions of GBDs and GEDs on database , respectively. Since this is a simulated example and there is no concrete database , we assume that for all possible values of and .

Second, from Example 2, we know .

Then, according to Equation (8), we can calculate

Therefore, is inserted into the search result .
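The online stage of Algorithm 1 (Steps 2-4) can be sketched as follows. Here `posterior(d, g)` is a hypothetical callable standing in for the precomputed posterior tables, and `gbd_of` again ASSUMES one plausible form of Equation (1); the actual algorithm should be read from the paper.

```python
from collections import Counter

def gbd_of(b1, b2):
    # Hedged: assumed form of Equation (1) (larger branch-multiset size
    # minus the multiset intersection size).
    inter = sum((Counter(map(tuple, b1)) & Counter(map(tuple, b2))).values())
    return max(len(b1), len(b2)) - inter

def similarity_search(query_branches, database, tau, eta, posterior):
    """Sketch of Algorithm 1's online stage: for each data graph, compute
    its GBD to the query (Step 2), evaluate the probability that GED is
    at most tau given that GBD (Step 3), and keep the graph if this
    probability reaches the threshold eta (Step 4)."""
    result = []
    for name, branches in database:
        d = gbd_of(query_branches, branches)              # Step 2
        p = sum(posterior(d, g) for g in range(tau + 1))  # Step 3
        if p >= eta:                                      # Step 4
            result.append(name)
    return result
```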
VI-B Complexity Analysis of the Online Stage
The online querying stage in our approach includes Steps 2, 3 and 4 in Algorithm 1, where Step 4 clearly costs time for each graph . In addition, from the discussions in Section III, Step 2 costs time, where , and is the average degree of graph .
In Step 3, since and have already been precomputed in the offline stage (i.e., Step 1), their values can be obtained in time for each and graph .
Now we focus on analyzing the time complexity of computing in Step 3 of Algorithm 1. Let the time for computing , , and in Equation (8) be and , respectively. According to Equation (8), the time for computing for each and graph is:
(18) 
where and are the summation subscripts in Equation (8), and the ranges of and are:


. Since is the number of RV operations, must not be larger than the number of graph edit operations .

. Since is the number of vertices covered by relabelled edges given the relabelled edge number , and each edge can cover at most two vertices, we have .

. Note that is the number of vertices either relabelled or covered by relabelled edges when the relabelled vertex number is and the number of vertices covered by relabelled edges is . Therefore, we have .
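The three bullet points above imply nested summation ranges that can be enumerated directly. The bounds in the sketch below are our reading of the bullets: the RV count x is at most the total number of operations g, the number z of vertices covered by relabeled edges is at most twice the RE count g - x, and the union size y lies between max(x, z) and x + z; the paper's exact inequalities are not reproduced here.

```python
def subscript_ranges(g):
    """Enumerate plausible (x, z, y) triples for the summation in
    Equation (8), given a total of g graph edit operations:
      x : number of RV operations, 0 <= x <= g;
      z : vertices covered by relabeled edges, 0 <= z <= 2 * (g - x);
      y : vertices relabeled or covered, max(x, z) <= y <= x + z
          (size of a union of sets of sizes x and z)."""
    for x in range(g + 1):
        for z in range(2 * (g - x) + 1):
            for y in range(max(x, z), x + z + 1):
                yield (x, z, y)
```

For g = 1 this yields four triples, e.g. (1, 0, 1) for a single RV operation and (0, 2, 2) for a single RE operation covering two vertices.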
In addition, according to the closed form of Equation (8) in Appendix C, it is clear that and . According to Equation (18), the time of computing for each and graph is:
(19) 
Moreover, from the above discussions about the ranges of summation subscripts, for any , we have:
(20)  
(21)  
(22) 
where