Updates-Aware Graph Pattern based Node Matching

02/18/2020 ∙ by Guohao Sun, et al. ∙ Macquarie University The University of Queensland 0

Graph Pattern based Node Matching (GPNM) is to find all the matches of the nodes in a data graph GD based on a given pattern graph GP. GPNM has become increasingly important in many applications, e.g., group finding and expert recommendation. In real scenarios, both GP and GD are updated frequently. However, the existing GPNM methods either need to perform a new GPNM procedure from scratch to deliver the node matching results based on the updated GP and GD or incrementally perform the GPNM procedure for each of the updates, leading to low efficiency. Therefore, there is a pressing need for a new method to efficiently deliver the node matching results on the updated graphs. In this paper, we first analyze and detect the elimination relationships between the updates. Then, we construct an Elimination Hierarchy Tree (EH-Tree) to index these elimination relationships. In order to speed up the GPNM process, we propose a graph partition method and then propose a new updates-aware GPNM method, called UA-GPNM, considering the single-graph elimination relationships among the updates in a single graph of GP or GD, and also the cross-graph elimination relationships between the updates in GP and the updates in GD. UA-GPNM first delivers the GPNM result of an initial query, and then delivers the GPNM result of a subsequent query, based on the initial GPNM result and the multiple updates that occur between two queries. The experimental results on five real-world social graphs demonstrate that our proposed UA-GPNM is much more efficient than the state-of-the-art GPNM methods.


I Introduction

I-A Background

Graph Pattern Matching (GPM) is to find all the matching subgraphs of a pattern graph $G_P$ in a data graph $G_D$. In order to address the low-efficiency issue in the conventional NP-Complete GPM methods [ullmann1976algorithm, cordella2001improved, garey2002computers], Fan et al. proposed Bounded Graph Simulation (BGS) [fan2010graph], which has fewer restrictions but more capacity to efficiently extract more useful subgraphs because it supports simulation relations instead of an exact match of edges and nodes. In BGS, each node in $G_P$ and $G_D$ has a label (e.g., representing a person’s job title), and each edge in $G_P$ is labeled with either a positive integer or a symbol “*”. The integer is the constraint on the maximal shortest path length of a match in $G_D$, and “*” indicates that there is no path length constraint. Then, the match of an edge could be a path if the start node and the end node of the path in the data graph have the same labels as the corresponding nodes of the edge in the pattern graph respectively. In social networks, on average, any two people can be connected in about six hops [milgram1967small]. Therefore, the bound is usually set as a small integer in social networks [fan2010graph].
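The bounded path-length constraint on a single pattern edge can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the adjacency-dictionary representation, the function names and the toy graph are all assumptions made for the example.

```python
from collections import deque


def shortest_path_length(adj, source, target):
    """Hop count of the shortest nonempty directed path, or None if there is none."""
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def edge_bound_satisfied(adj, source, target, bound):
    """Check one pattern-edge constraint; bound is a positive integer or "*"."""
    length = shortest_path_length(adj, source, target)
    if length is None:
        return False  # no path at all, so even "*" cannot be satisfied
    return bound == "*" or length <= bound


# Hypothetical toy data graph: a -> b -> c -> d
toy_adj = {"a": ["b"], "b": ["c"], "c": ["d"], "d": []}
within_two = edge_bound_satisfied(toy_adj, "a", "c", 2)
too_far = edge_bound_satisfied(toy_adj, "a", "d", 2)
unbounded = edge_bound_satisfied(toy_adj, "a", "d", "*")
```

Label equality of the end nodes, which BGS also requires, is left out here so that the sketch isolates the path-length test.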

The GPM methods discussed above aim to find the entire matching subgraphs in $G_D$. However, in some applications, such as group finding [lappas2009finding] and expert recommendation [morris2010people, brynielsson2010detecting], users want to find a group of nodes based on a specified structure between them, leading to the Graph Pattern based Node Matching (GPNM) problem [liu2007identifying], with an example discussed below.

Fig. 1: Graph Pattern based Node Matching

Example 1 (GPNM Problem): Fig. 1(a) depicts a data graph $G_D$, where each node denotes a person labeled with a job title, e.g., Project Manager, Database Developer, Software Engineer, Test Engineer, or Secretary. Each edge indicates a collaboration relationship. A pattern graph $G_P$ is given in Fig. 1(b), where an IT project needs four types of people. In BGS [fan2010graph], the integer on an edge shows the constraint of the maximum path length between two nodes. For example, in Fig. 1(b), one of the pattern nodes needs to connect with two other pattern nodes within 3 hops each. The GPNM results are shown in TABLE I.

Nodes in $G_P$    Matching nodes in $G_D$
,
,
TABLE I: The node matching results of Example 1

The existing GPM methods can be applied to solve the GPNM problem. However, they need to deliver the entire matching subgraphs, rather than the matching nodes only, which incurs a high time complexity [fan2010graph, fan2011adding]. Therefore, Fan et al. [fan2013diversified] proposed a method to find matching nodes only based on a given pattern graph. Although their method can reduce query processing time, it does not consider the updates of $G_P$ and $G_D$ that commonly exist in real scenarios [berger2006framework]. Even if there is only one update in a small pattern graph, the existing GPNM methods have to recompute the matching results in the data graph from scratch, which costs much query processing time. In addition, the query structures of the pattern graphs given by billions of users in social networks change very frequently, so pattern updates occur at high frequency. It is therefore necessary to consider the updates in both the pattern graph and the data graph to improve the efficiency of node matching. For example, in group finding in Online Social Networks (OSNs) [lappas2009finding], the joining of new users or the withdrawal of existing users results in updates of $G_D$. When facing each of such updates, the existing GPNM methods [liu2007identifying, fan2013diversified] have to perform a new GPNM procedure from scratch, leading to low efficiency.

In order to improve efficiency, the state-of-the-art GPNM methods, called INC-GPNM [Sun2018incremental] and EH-GPNM [sun2019incremental], have been proposed. INC-GPNM first incrementally records the shortest path length range between different types of labels in $G_D$ and then identifies the affected area of $G_D$ w.r.t. the updates of $G_P$ and $G_D$. Thus, INC-GPNM can improve the efficiency of GPNM when $G_P$ and $G_D$ are updated. However, in a large-scale social graph that is updated with a high frequency, INC-GPNM is still computationally expensive because it ignores the relationships that exist among the updates in both $G_P$ and $G_D$, and thus it has to perform an incremental GPNM procedure for each of the updates. EH-GPNM considers the relationships among the updates in $G_D$ only. When facing updates in the pattern graph, it still has to perform the incremental GPNM procedure for each of the updates in the pattern graph. Therefore, a new efficient GPNM method is in demand.

I-B Motivations and Problems

In real scenarios, nodes and edges in both $G_P$ and $G_D$ are frequently updated over time. For example, in the application of group finding in social graphs, different queries can have different constraints and/or structures, which leads to updates of $G_P$, and the joining of new users in OSNs leads to updates of $G_D$. However, not all the updates in a pattern graph or a data graph actually affect the GPNM results. Example 2 below illustrates our motivation in detail.

Fig. 2: Updates-aware GPNM

Example 2 (Updates-aware GPNM): Based on the pattern graph and data graph shown in Fig. 2(c) and Fig. 2(a) respectively, the original GPNM results are shown in Table I. Suppose there are two updates in the pattern graph: one requires two node types to be associated within 2 hops, and the other requires two node types to be associated within 4 hops (both are marked in Fig. 2(b)). There are also two updates in the data graph, each of which establishes a collaboration relationship between two people (both are marked in Fig. 2(d)). The new pattern graph and new data graph are shown in Fig. 2(b) and Fig. 2(d) respectively.

Based on these two updated graphs, the state-of-the-art incremental GPNM method [Sun2018incremental] has to apply the incremental procedure four times because there are a total of four updates in $G_P$ and $G_D$, leading to low efficiency. However, in practice, one update can be eliminated by another update. Within a single graph, if an edge (or node) is first removed from (or inserted into) $G_P$ (or $G_D$) and then inserted back into (or removed from) it, the effects of the two updates eliminate each other. Therefore, there may exist elimination relationships among the updates in a single graph of $G_P$ or $G_D$, and we term these single-graph elimination relationships. More importantly, an update in one graph may eliminate an update in the other graph; we term these cross-graph elimination relationships. In Example 2, although one pattern graph update requires two node types to be associated within 2 hops, it leads to no change in the GPNM results. This is because one of the data graph updates happens to establish a collaboration relationship that connects all the nodes concerned in the data graph within 2 hops. Therefore, the effects of these two updates eliminate each other.
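The simplest single-graph case described above, an edge that is inserted and later deleted (or the reverse), can be sketched as follows. This is an illustrative sketch only: the `(operation, edge)` tuple format and the function name are assumptions, and it covers just this inverse-pair case, not the candidate-set and affected-set based elimination relationships analyzed later in the paper.

```python
def cancel_inverse_updates(updates):
    """Drop pairs of updates on the same edge whose effects cancel each other.

    `updates` is an ordered list of (operation, edge) tuples, where operation
    is "insert" or "delete" (a hypothetical encoding chosen for this sketch).
    """
    pending = {}  # edge -> surviving operation, in first-seen order
    for op, edge in updates:
        if edge in pending and pending[edge] != op:
            # An insertion followed by a deletion (or vice versa) cancels out.
            del pending[edge]
        else:
            pending[edge] = op
    return [(op, edge) for edge, op in pending.items()]


update_log = [
    ("insert", ("a", "b")),
    ("insert", ("b", "c")),
    ("delete", ("a", "b")),
]
remaining = cancel_inverse_updates(update_log)
```

With this log, only the insertion of the edge from b to c survives, so a single incremental step is needed instead of three.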

This example motivates us to develop a new GPNM solution which considers the elimination relationships among the updates to efficiently answer GPNM queries. When facing an updated pattern graph and an updated data graph, we can compute the GPNM result for the original pattern graph, and then deliver the new GPNM result by analyzing the elimination relationships among the updates, instead of performing the incremental GPNM procedure for each of the updates.

Such a new GPNM solution is significant for the social graph searches in large-scale and frequently updated social networks, such as Facebook and Twitter. For example, on Facebook, on average, within each minute, 400 new users join in, 510,000 comments are posted, 317,000 statuses are updated, and 147,000 photos are uploaded (see https://sproutsocial.com/insights/facebook-stats-for-marketers/).

In this new solution, there are three major challenges. Firstly, it is non-trivial to identify the elimination relationships among the updates because there exist both single-graph elimination relationships and cross-graph elimination relationships. Therefore, the first challenge of our work is CH1: how to effectively detect the elimination relationships of the updates. Secondly, if one update eliminates a second update, and the second in turn eliminates a third, the three updates form a hierarchical structure, and this applies to all the elimination relationships. As it is computationally expensive to deliver GPNM results by investigating each of the elimination relationships among the updates, it is beneficial to build up an index to record the hierarchical structure of all the elimination relationships. Therefore, the second challenge of our work is CH2: how to build up an index structure that records the hierarchical structure of all the elimination relationships, covering both single-graph and cross-graph elimination relationships, and that supports an efficient algorithm to deliver the GPNM results by making use of the index. Thirdly, in the GPNM procedure, we need to compute the shortest path length between any two nodes, which is very time-consuming. Therefore, the third challenge of our work is CH3: how to efficiently compute the shortest path length between any two nodes to speed up the GPNM procedure.

I-C Contributions

In this paper, we propose an efficient GPNM method to answer GPNM queries with multiple updates in both pattern graph and data graph. To the best of our knowledge, our method is the first GPNM solution that considers both the single-graph elimination relationships and cross-graph elimination relationships. The characteristics and contributions of our work are summarized as follows.

(1) Targeting CH1, we propose effective methods to detect the single-graph elimination relationships and cross-graph elimination relationships.

(2) Targeting CH2, we build up an Elimination Hierarchy Tree (EH-Tree) to index the hierarchical structure of all the different types of elimination relationships, which helps enhance query processing efficiency.

(3) Targeting CH3, we propose a graph partition method and based on the method we further propose a more efficient Updates-Aware GPNM algorithm called UA-GPNM.

(4) The experiments conducted on five real-world social graphs demonstrate that our UA-GPNM with the graph partition strategy significantly outperforms the state-of-the-art GPNM methods [Sun2018incremental, sun2019incremental], reducing the query processing time by an average of 58.60% and 35.29% respectively.

The rest of this paper is organized as follows. We first review the related work in Section II. Then we introduce the necessary concepts and formulate the main problem in Section III. Section IV analyzes the elimination relationships. Section V introduces the partition method. Section VI proposes the new algorithm, UA-GPNM. Section VII discusses the experimental results, and Section VIII concludes the paper.

II Related Work

The existing methods can be classified into two categories based on their delivered matching results, i.e., (1) Graph Pattern Matching (GPM), and (2) Graph Pattern based Node Matching (GPNM). In this section, we review these two categories respectively.

GPM: GPM is to find all the matching subgraphs of $G_P$ in $G_D$. For example, the algorithm in [ullmann1976algorithm] is the most famous method for subgraph isomorphism. In light of the intractability of the NP-complete problem of subgraph isomorphism, an approximate solution, BGS [fan2010graph], has been studied to find inexact matching subgraphs. In the application of community finding, Fang et al. [fang2016effective] proposed a method which aims to return an attributed community for an attributed graph, in which the attributed community is a subgraph which satisfies both structure cohesiveness and keyword cohesiveness. Lai et al. [lai2017scalable] studied scalable subgraph enumeration in MapReduce, considering that existing solutions for subgraph enumeration are not sufficiently scalable to handle large graphs.

However, social graphs are frequently updated [berger2006framework], and it is computationally expensive to perform a new procedure from scratch to find matching subgraphs when facing any updates. Therefore, Fan et al. [fan2013incremental] proposed an incremental approximate method to find the matching subgraphs. The complexity of this method is more accurately characterized in terms of the size of the area affected by the updates of data graphs, rather than the size of the entire input. Song et al. [song2014event] proposed a new notion, “event pattern matching”, on dynamic graphs, and studied its semantics and efficient online algorithms. In the application of cyber security, Choudhury et al. [choudhury2015selectivity] presented a new subgraph isomorphism algorithm for streaming graphs. They regard cyber attacks as a subgraph pattern, and apply the subgraph distributional statistics collected from the streaming graph to determine the query processing strategy. Semertzidis et al. [semertzidis2016durable] focused on labeled graphs that evolve over time and find the matches that exist for the longest period of time. Sun et al. [sun2017mining] extended incremental methods to find maximal cliques that contain vertices incident to an edge which has been inserted. Fan et al. [fan2017incremental] further proposed incremental algorithms for four types of typical pattern graphs, which can reduce the computations on big graphs and minimize unnecessary re-computation. Ma et al. [ma2017fast] proposed a method to find dense subgraphs in temporal networks. They focused on a special class of temporal networks, where the weights associated with edges regularly vary with timestamps. Li et al. [li2018persistent] aimed to identify the communities that are persistent over time in a temporal network, in which every edge is associated with a timestamp.
In addition, Li et al., [li2018efficient] proposed a method to seek cohesive subgraphs in a signed network, in which each edge can be positive or negative, denoting friendship or conflict respectively. Li et al., [li2019time] proposed a solution to efficiently answer subgraph search in streaming graph data. In the method, they designed concurrency management strategies to improve system throughput. Das et al., [das2019incremental] proposed change-sensitive algorithms to maintain the set of subgraphs in dynamic graphs. They showed nearly tight bounds for the magnitude of change in the set of subgraphs and the time complexity of enumerating the change is proportional to the magnitude of the change. Dias et al., [dias2019fractal] proposed Fractal, a high performance and high productivity system for supporting distributed GPM applications. Fractal employs a dynamic (auto-tuned) load-balancing based on a hierarchical and locality-aware work stealing mechanism, allowing the system to adapt to different workload characteristics.
GPNM: Applying the existing GPM methods to solve the GPNM problem incurs a high time complexity as they need to deliver the entire matching subgraphs in $G_D$ [fan2010graph, fan2011adding]. Therefore, several GPNM methods have been proposed, which aim to find some nodes based on a specified structure between those nodes, for applications such as group finding [lappas2009finding] and expert recommendation [morris2010people]. Some of them [liu2007identifying], [zou2009distance], [marian2005adaptive] find matches of a specific node via subgraph isomorphism, which has exponential complexity. To improve efficiency, Tong et al. [tong2007fast] proposed a “Seed-Finder” method that identifies approximate matches for certain pattern nodes. This method only requires cubic time. Based on BGS, Fan et al. [fan2013diversified] revised graph patterns to support a specific output node and defined functions to measure match relevance and diversity. Motivated by network analysis applications, Fan et al. [fan2016adding] proposed quantified matching for a specific pattern node, in which they extend traditional graph patterns with counting quantifiers.

To address the GPNM problem when graphs are updated over time, INC-GPNM and EH-GPNM have been proposed in [Sun2018incremental] and [sun2019incremental]. INC-GPNM first builds an index to incrementally record the shortest path length range between different label types in $G_D$, and then identifies the affected nodes of $G_D$ in GPNM w.r.t. the updates of $G_P$ and $G_D$. Moreover, based on the proposed index structure and novel search strategies, INC-GPNM can efficiently deliver node matching results taking the updates of $G_P$ and $G_D$ as input, and can greatly reduce the query processing time. EH-GPNM [sun2019incremental] observes that there may exist single-graph elimination relationships in the data graph. It can deliver the GPNM results without performing the incremental procedure for each of the updates in the data graph.
Summary: The existing methods in the above two categories face the efficiency issue when answering GPNM queries with the updates in both pattern graphs and data graphs. Firstly, the GPM methods cannot be applied in GPNM because of the low efficiency of delivering the entire subgraph structures. Secondly, the state-of-the-art GPNM methods INC-GPNM [Sun2018incremental] and EH-GPNM [sun2019incremental] cannot offer good efficiency either. INC-GPNM has to perform the incremental procedure for each of the updates, which is still computationally expensive in a large-scale graph that is updated frequently. Although EH-GPNM realized there may exist single-graph elimination relationships in data graph, it ignores the single-graph elimination relationships in pattern graph and the cross-graph elimination relationships. When facing any update in the pattern graph, EH-GPNM still has to perform the incremental GPNM procedure for each of the updates in the pattern graph.

III Preliminaries

In this section, we introduce the concepts of data graph and pattern graph, and the problem of GPNM and the problem of Updates-Aware GPNM. Table II lists the notations used in this paper.

Notation Meaning
$G_D$ / $G_P$    a data graph / pattern graph
/ an updated data/pattern graph
/ the updates of /
a directed edge from to
/ a set of vertices/edges in
/ a set of vertices/edges in
the bounded path length on in
the matching result of in based on BGS
/ the GPNM result of the initial/subsequent query
/ the insertions / deletions of edges for
/ the insertions / deletions of nodes for
/ the insertions / deletions of edges for
/ the insertions / deletions of nodes for
/ one update in /
SLen    the shortest path length matrix between each pair of nodes in $G_D$
the set of candidate nodes of
the set of affected nodes of
the shortest path length from to is
changed from to
one partition
/ the set of inner/outer bridge nodes of
TABLE II: Notations used in this paper

III-A Data Graph and Pattern Graph

Data Graph. A data graph is a directed graph $G_D = (V_D, E_D, f_a)$, where

  • $V_D$ is a set of nodes;

  • $E_D \subseteq V_D \times V_D$, in which a tuple $(v, v')$ denotes a directed edge from node $v$ to $v'$;

  • $f_a$ is a function such that for each node $v \in V_D$, $f_a(v)$ is a set of labels. Intuitively, $f_a(v)$ consists of the attributes of a node, e.g., name, age, job title [lappas2011survey].

Example 3: $G_D$ in Fig. 1(a) depicts a data graph, where each node denotes a person together with the label of that person, e.g., Project Manager. Each edge denotes a relationship between the two connected nodes, e.g., a collaboration relationship between two people.

Pattern Graph. A pattern graph is defined as $G_P = (V_P, E_P, f_v, f_e)$, where

  • $V_P$ and $E_P$ are a set of nodes and a set of directed edges, respectively;

  • $f_v$ is a function defined on $V_P$ such that for each node $u \in V_P$, $f_v(u)$ is the label of node $u$, e.g., Project Manager;

  • $f_e$ is a function defined on $E_P$ such that for each edge $(u, u') \in E_P$, $f_e(u, u')$ is the bounded path length of $(u, u')$, which is either a positive integer or a symbol “*”.

Example 4: $G_P$ in Fig. 1(b) depicts a pattern graph. In addition to the labels, each edge in $G_P$ has an integer as the bounded path length.
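The two definitions above can be mirrored by small container types. This is only a sketch under assumed names: `DataGraph`, `PatternGraph` and their fields are hypothetical, and the node identifiers in the usage lines are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DataGraph:
    """Directed data graph: each node carries a set of labels."""
    labels: dict = field(default_factory=dict)  # node -> set of labels
    edges: set = field(default_factory=set)     # set of (source, target) tuples

    def add_node(self, node, *node_labels):
        self.labels[node] = set(node_labels)

    def add_edge(self, source, target):
        if source not in self.labels or target not in self.labels:
            raise ValueError("both endpoints must be existing nodes")
        self.edges.add((source, target))


@dataclass
class PatternGraph:
    """Pattern graph: one label per node, one bounded path length per edge."""
    labels: dict  # node -> single label
    bounds: dict  # (source, target) -> positive integer or "*"

    def __post_init__(self):
        for (source, target), bound in self.bounds.items():
            if source not in self.labels or target not in self.labels:
                raise ValueError("pattern edge uses an unknown node")
            if bound != "*" and not (isinstance(bound, int) and bound > 0):
                raise ValueError("bound must be a positive integer or '*'")


data_graph = DataGraph()
data_graph.add_node("person1", "Project Manager")
data_graph.add_node("person2", "Software Engineer")
data_graph.add_edge("person1", "person2")

pattern_graph = PatternGraph(
    labels={"u1": "Project Manager", "u2": "Software Engineer"},
    bounds={("u1", "u2"): 3},
)
```

Keeping the bound on the pattern edge, rather than on data edges, reflects that data edges are plain relationships while pattern edges express path-length constraints.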

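Bounded Graph Simulation (BGS). Consider a data graph $G_D = (V_D, E_D, f_a)$ and a pattern graph $G_P = (V_P, E_P, f_v, f_e)$. The data graph $G_D$ matches the pattern graph $G_P$ based on bounded graph simulation, denoted by $G_P \unlhd G_D$, if there exists a binary relation $S \subseteq V_P \times V_D$ such that

  • for any $u \in V_P$, there exists $v \in V_D$ such that $(u, v) \in S$;

  • for any $(u, v) \in S$, $f_a(v)$ of $v$ includes $f_v(u)$ of $u$;

  • for any $(u, v) \in S$ and each edge $(u, u')$ in $E_P$, there exists a path $\rho$ from $v$ to $v'$ in $G_D$ such that $(u', v') \in S$, and $len(\rho) \le k$ if $f_e(u, u') = k$.

Remark: Note that there exists a path $\rho$ from $v$ to $v'$ with $len(\rho) \le k$ if the shortest path length from $v$ to $v'$ is no longer than $k$. If $G_P \unlhd G_D$, the graph pattern matching results are denoted as $M(G_P, G_D)$.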
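One common way to compute such a relation is to start from all label-compatible pairs and repeatedly discard pairs that violate an edge constraint until nothing changes. The sketch below follows that fixpoint idea under stated assumptions: the dictionary-based inputs, the function names and the toy graph are hypothetical, and shortest path lengths are obtained by plain breadth-first search rather than by any index from the paper.

```python
from collections import deque


def all_pairs_hops(nodes, adj):
    """dist[v][w] = hop count of the shortest nonempty path from v to w."""
    dist = {}
    for src in nodes:
        reached = {}
        queue = deque([(src, 0)])
        while queue:
            node, d = queue.popleft()
            for nxt in adj.get(node, ()):
                if nxt not in reached:
                    reached[nxt] = d + 1
                    queue.append((nxt, d + 1))
        dist[src] = reached
    return dist


def bounded_simulation(p_labels, p_bounds, d_labels, adj):
    """Return pattern node -> set of matching data nodes (all empty if no match)."""
    dist = all_pairs_hops(d_labels, adj)
    # Start from every data node whose label set includes the pattern label.
    sim = {u: {v for v, labs in d_labels.items() if p_labels[u] in labs}
           for u in p_labels}
    changed = True
    while changed:
        changed = False
        for (u, u2), bound in p_bounds.items():
            for v in list(sim[u]):
                ok = any(v2 in dist[v] and (bound == "*" or dist[v][v2] <= bound)
                         for v2 in sim[u2])
                if not ok:
                    sim[u].discard(v)  # v cannot satisfy the edge (u, u2)
                    changed = True
    # Every pattern node needs at least one match, otherwise nothing matches.
    if any(not matches for matches in sim.values()):
        return {u: set() for u in p_labels}
    return sim


# Hypothetical toy graph: pm1 -> x -> se1, and pm2 with no outgoing edge.
toy_labels = {"pm1": {"PM"}, "pm2": {"PM"}, "se1": {"SE"}, "x": {"TE"}}
toy_adj = {"pm1": ["x"], "x": ["se1"], "pm2": []}
toy_result = bounded_simulation({"p": "PM", "s": "SE"}, {("p", "s"): 2},
                                toy_labels, toy_adj)
```

The returned mapping lists, for each pattern node, the data nodes that survive the refinement, which is the per-node view that GPNM is interested in.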

III-B Graph Pattern based Node Matching (GPNM)

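GPNM. Given a pattern graph $G_P$ and a data graph $G_D$, for a given node $u$ in $G_P$, we define the matching nodes of $u$ in $G_D$ to be the nodes of $G_D$ that match $u$ in $M(G_P, G_D)$, where $M(G_P, G_D)$ is the set of matching subgraphs of $G_P$ in $G_D$ based on BGS. GPNM is to find the matching nodes for each node $u$ of $G_P$ in $G_D$. If $G_D$ has no match of $G_P$ based on BGS, then the set of matching nodes is empty.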

Example 5: Recall $G_D$ and $G_P$ shown in Fig. 1(a) and Fig. 1(b) respectively. Instead of finding the whole subgraphs, GPNM aims to find the matching nodes in $G_D$ for each node of $G_P$. A node of $G_D$ is a matching node of a pattern node if it appears in the subgraphs that match $G_P$ based on BGS. The complete node matching results are shown in Table I.

III-C Updates-Aware Graph Pattern based Node Matching

  • Input: a pattern graph $G_P$, a data graph $G_D$, the GPNM result of the initial query, a sequence of multiple updates to $G_P$, and a sequence of multiple updates to $G_D$.

  • Output: the GPNM result of the subsequent query, i.e., the GPNM result of the updated $G_P$ in the updated $G_D$.

Remark: may include the insertion of edges, insertion of nodes, deletion of edges and deletion of nodes, denoted by , , and respectively; may include the insertion of edges, insertion of nodes, deletion of edges and deletion of nodes, denoted by , , and respectively. We denote each update in as and each update in as .

Example 6: Recall and shown in Fig. 2(c) and Fig. 2(a) respectively, is shown in Table I. is to insert edge with a bounded path length 2 into and is to insert edge with a bounded path length 4 into shown in Fig. 2(b); is to insert edge into and is to insert edge into shown in Fig. 2(d). The updated pattern graph and updated data graph are shown in Fig. 2(b) and Fig. 2(d) respectively. The updates-aware GPNM is to deliver the for the in based on the updates and .

IV Elimination Relationships

In this section, we first analyze three types of elimination relationships. Then, we propose effective methods to detect the elimination relationships. We further build up an index to record the hierarchical structure of these elimination relationships.

IV-A Elimination Relationship Types

The elimination relationships can be categorized into three types. Below we analyze the elimination relationships for these three types respectively.
Single-graph elimination relationships in $G_P$ (Type I): For each update in the pattern graph $G_P$, we need to identify the nodes in the data graph $G_D$ that may be added into or removed from the matching results. We call these nodes candidate nodes and put them into the set of candidate nodes of that update. Given two updates in $G_P$, if the set of candidate nodes of the first update covers (i.e., is a superset of) that of the second, we say that the first update eliminates the second.
Remark: The set of candidate nodes can be divided into two subsets: a) the candidate nodes that may be added into the matching results; b) the candidate nodes that may be removed from the matching results.

Single-graph elimination relationships in $G_D$ (Type II): In GPNM, we need to investigate whether the shortest path length between each pair of nodes in $G_D$ satisfies the requirements of the bounded path length in $G_P$. For each update in the data graph $G_D$, if the shortest path between two nodes is affected by the update, we call these nodes affected nodes and put them into the set of affected nodes of that update. Given two updates in $G_D$, if the set of affected nodes of the first update covers (i.e., is a superset of) that of the second, we say that the first update eliminates the second.

Cross-graph elimination relationships between $G_P$ and $G_D$ (Type III): For an update from the pattern graph $G_P$ and an update from the data graph $G_D$, if these two updates together keep the GPNM results unchanged, then the two updates eliminate each other.

IV-B Detecting Elimination Relationships

Below we introduce the detailed steps for detecting the three types of elimination relationships respectively.

Detect Type I elimination relationships (DER-I): For each update in the pattern graph, we first identify the nodes that may be removed from or added into the original matching results. Then, for two updates, if the set of candidate nodes of the first covers that of the second, the first update eliminates the second. The detailed steps of detecting Type I elimination relationships are given below, and the pseudo-code is shown in Algorithm 1.

Input: , , ,
Output: The type I elimination relationships of the updates
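for each pair of updates in the pattern graph do
    if both updates are edge insertions then
        for each pair of nodes in the data graph do
            if SLen of the pair > the bounded path length on the edge of the first update then
                put the pair of nodes into the removal-candidate set of the first update;
            if SLen of the pair > the bounded path length on the edge of the second update then
                put the pair of nodes into the removal-candidate set of the second update;
        if the removal-candidate set of the first update ⊇ that of the second update then
            the first update eliminates the second update;
    if both updates are edge deletions then
        for each pair of nodes in the data graph do
            if SLen of the pair > the bounded path length on the edge of the first update then
                put the pair of nodes into the addition-candidate set of the first update;
            if SLen of the pair > the bounded path length on the edge of the second update then
                put the pair of nodes into the addition-candidate set of the second update;
        if the addition-candidate set of the first update ⊇ that of the second update then
            the first update eliminates the second update;
return the Type I elimination relationships of the updates;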
Algorithm 1 DER-I

Step 1: We first build up the shortest path length matrix, SLen, to record the shortest path length between each pair of nodes in $G_D$;
Step 2: For each update in the pattern graph, if it is an edge insertion, we inspect whether the shortest path length between each pair of nodes in $G_D$ satisfies the bounded path length constraint on the inserted edge; if not, we put these nodes into the set of candidate nodes to be removed, as they cannot satisfy the constraint of the newly added edge and have to be removed from the matching results. If it is an edge deletion, we inspect whether the shortest path length between each pair of nodes in $G_D$ satisfies the bounded path length constraint on the deleted edge; if not, we put these nodes into the set of candidate nodes to be added, as the constraint they cannot satisfy has been deleted and these nodes can be added into the matching results;
Step 3: For each pair of updates in the pattern graph, if the set of candidate nodes of the first covers that of the second, then the first update eliminates the second.
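The three steps can be sketched for edge insertions as follows. This is a simplified reading, not the authors' DER-I: it assumes that a data node becomes a removal candidate when no node matching the other end of the inserted pattern edge lies within the bound, and the names `removal_candidates`, `type1_eliminations`, the `(source label, target label, bound)` update format and the toy data are all hypothetical.

```python
def removal_candidates(slen, matches, update):
    """Candidate nodes for one inserted pattern edge.

    slen:    dict mapping (v, w) to the shortest path length from v to w
             (pairs with no path are simply absent)
    matches: dict mapping a label to the data nodes currently matched to it
    update:  (source_label, target_label, bound), bound is an int or "*"
    """
    src_label, dst_label, bound = update
    candidates = set()
    for v in matches.get(src_label, ()):
        satisfied = False
        for w in matches.get(dst_label, ()):
            length = slen.get((v, w))
            if length is not None and (bound == "*" or length <= bound):
                satisfied = True
                break
        if not satisfied:
            candidates.add(v)  # v cannot satisfy the newly added constraint
    return candidates


def type1_eliminations(slen, matches, updates):
    """Return (a, b) pairs where the candidate set of a covers that of b."""
    can = {name: removal_candidates(slen, matches, upd)
           for name, upd in updates.items()}
    return [(a, b) for a in can for b in can if a != b and can[a] >= can[b]]


toy_matches = {"PM": {"pm1", "pm2"}, "SE": {"se1"}, "TE": {"te1"}}
toy_slen = {("pm1", "se1"): 1, ("pm2", "se1"): 3,
            ("pm1", "te1"): 5, ("pm2", "te1"): 5}
toy_updates = {"U1": ("PM", "TE", 4), "U2": ("PM", "SE", 2)}
eliminations = type1_eliminations(toy_slen, toy_matches, toy_updates)
```

In the toy data, U1 puts both pm1 and pm2 at risk while U2 only affects pm2, so the candidate set of U1 covers that of U2 and handling U1 already covers the nodes touched by U2.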

Remark: In real social network-based graphs, many nodes have no out-degree or no in-degree. Therefore, the lengths of the shortest paths from the nodes with no out-degree to other nodes, and the lengths of the shortest paths from other nodes to the nodes with no in-degree, are infinite, which makes the matrix sparse. We can then compress the sparse matrix to reduce the storage space. The Hybrid format [bell2009implementing] is a well-known technique that can be adopted. The Hybrid format requires storage proportional to the number of nodes in the data graph multiplied by the maximum number of non-infinite values in a row. Compared with the quadratic space cost of the full matrix, the Hybrid format saves storage because the maximum number of non-infinite values in a row is usually much smaller than the number of nodes.
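The storage saving can be illustrated with a much simpler stand-in than the Hybrid format itself: the sketch below keeps one dictionary per row holding only the finite entries. It is not an implementation of the Hybrid format of [bell2009implementing]; the function names and the small matrix are assumptions made for illustration.

```python
INF = float("inf")


def to_sparse_rows(dense):
    """Keep only finite entries: one {column index: length} dict per row."""
    return [{j: x for j, x in enumerate(row) if x != INF} for row in dense]


def lookup(sparse_rows, i, j):
    """Shortest path length from node i to node j, INF if not stored."""
    return sparse_rows[i].get(j, INF)


# Hypothetical 4-node matrix in which nodes 1 and 3 have no outgoing edges.
dense_slen = [
    [0, 1, 2, INF],
    [INF, 0, INF, INF],
    [INF, 1, 0, 1],
    [INF, INF, INF, 0],
]
sparse_slen = to_sparse_rows(dense_slen)
stored_entries = sum(len(row) for row in sparse_slen)
dense_entries = len(dense_slen) ** 2
```

Here 8 of the 16 entries are finite, so half of the cells need not be stored; on graphs with many nodes lacking in-edges or out-edges the gap is correspondingly larger.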

Example 7: Recall $G_P$ and $G_D$ shown in Fig. 2(c) and Fig. 2(a) respectively; the initial GPNM result is shown in Table I. The first pattern update inserts an edge with a bounded path length of 2 into $G_P$, and the second inserts an edge with a bounded path length of 4 into $G_P$, as shown in Fig. 2(b). The SLen of $G_D$ in Fig. 1(c) is shown in Table III. With the first update, the two node types concerned need to be associated within 2 steps, but the shortest path length between two of the matching nodes is larger than the bounded path length 2, so these two nodes are added into the set of candidate nodes. After these two nodes are set as candidate nodes, we need to check whether the nodes connected to them can also be set as candidate nodes. Because the shortest path lengths between the remaining pairs of nodes are all less than the corresponding bounded path lengths in $G_P$, only these two nodes are added into the set of candidate nodes. With the second update, only one node is added into its set of candidate nodes. The sets of candidate nodes of the two updates are shown in Table IV. Because the set of candidate nodes of the first update covers that of the second, the first update eliminates the second.

0 3 2 1 3 2 1
0 1 2 2 3 3
1 0 1 1 2 2
3 2 0 3 1 1
3 2 3 0 4 1
4 3 1 4 0 2
4 3 4 1 5 2
2 1 2 2 3 0
TABLE III: SLen of $G_D$ in Fig. 1(c).
Updates in pattern graph
,
TABLE IV: The set of candidate nodes of

Theorem 1: The order of the updates in $G_P$ does not affect the correctness of the detection of Type I elimination relationships.
The proof of Theorem 1 can be found at http://web.science.mq.edu.au/~yanwang/Proof.pdf.

Detect Type II elimination relationships (DER-II): For each update in the data graph, we first detect the nodes between which the shortest path length in the data graph is changed by the update (denoted as affected nodes). Then, for two updates, if the set of affected nodes of the first covers that of the second, the first update eliminates the second. The detailed steps of detecting Type II elimination relationships are given below, and the pseudo-code is shown in Algorithm 2.

Input: , , ,
Output: The type II elimination relationships of the updates
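for each pair of updates in the data graph do
    if the shortest path lengths between the nodes are not affected then
        keep the shortest path lengths in the updated matrix the same as those in SLen;
    else
        apply Dijkstra’s algorithm to update the shortest path lengths between the affected nodes;
    put the affected nodes of the first update into its set of affected nodes;
    put the affected nodes of the second update into its set of affected nodes;
    if the set of affected nodes of the first update ⊇ that of the second update then
        the first update eliminates the second update;
return the Type II elimination relationships of the updates;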
Algorithm 2 DER-II

Step 1: We first update to get the updated shortest path length matrix, , for each update in the data graph;
Step 2: For each update , we compare the with ; if the shortest path length between two nodes is changed due to , we put these nodes into ;
Step 3: For each pair of updates and , if , then .
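
The three steps above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes an unweighted, directed data graph, handles edge insertions only, and recomputes the shortest path lengths from scratch with breadth-first search instead of updating them incrementally with Dijkstra's algorithm. The function names `all_pairs_slen`, `affected_nodes` and `type_ii_eliminations` are hypothetical, and "covers" is interpreted as a superset test.

```python
from collections import deque
from itertools import permutations

INF = float("inf")

def all_pairs_slen(nodes, edges):
    """All-pairs shortest path lengths of an unweighted directed graph (BFS)."""
    adj = {v: [] for v in nodes}
    for u, v in edges:
        adj[u].append(v)
    slen = {}
    for s in nodes:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        for t in nodes:
            slen[s, t] = dist.get(t, INF)
    return slen

def affected_nodes(slen_old, slen_new):
    """Step 2: endpoints of every node pair whose shortest path length changed."""
    changed = set()
    for (s, t), d in slen_old.items():
        if slen_new[s, t] != d:
            changed.update((s, t))
    return changed

def type_ii_eliminations(nodes, edges, inserted_edges):
    """Steps 1-3: return ([(a, b), ...], aff) where update a eliminates update b.

    ASSUMPTION: each update is a single edge insertion, evaluated
    independently against the original graph.
    """
    base = all_pairs_slen(nodes, edges)
    aff = {}
    for e in inserted_edges:
        # Step 1: shortest path lengths after applying this update alone.
        aff[e] = affected_nodes(base, all_pairs_slen(nodes, edges + [e]))
    # Step 3: a eliminates b when a's affected set is a superset of b's.
    elims = [(a, b) for a, b in permutations(inserted_edges, 2)
             if aff[a] >= aff[b]]
    return elims, aff
```

For a path graph a → b → c → d, inserting the edge (a, c) changes the lengths of the pairs (a, c) and (a, d), whereas inserting (a, d) changes only the pair (a, d); under the assumptions above, the first update therefore eliminates the second.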

0 3 2 1 3 2 1
0 1 2 2 3 3
1 0 1 1 2 2
3 2 0 3 1 1
3 2 3 0 4 1
4 3 1 4 0 2
4 3 4 1 5 2
2 1 2 2 3 0
TABLE V: with .
0 3 2 1 2 2 1
0 1 2 2 3 3
1 0 1 1 2 2
3 2 0 2 1 1
3 2 3 0 4 1
4 3 1 3 0 2
4 3 4 1 5 2
2 1 2 1 3 0
TABLE VI: with .

Remark: When identifying the candidate nodes for the updates in the pattern graph and the affected nodes for the updates in the data graph, we first record the shortest path lengths for all pairs of nodes once. Then, for each update, we first detect the nodes between which the shortest path lengths are unchanged, and then apply Dijkstra’s algorithm to update the shortest path lengths between the affected nodes.

Example 8: Recall and shown in Fig. 2(c) and Fig. 2(a) respectively, is shown in Table I. is to insert edge into and is to insert edge into shown in Fig. 2(d). The SLen of in Fig. 1(c) is shown in Table III. The of and in Fig. 1(d) are shown in Table V and Table VI respectively. With , because the shortest path lengths from all the other nodes to are changed, all the nodes in the data graph are set as the affected nodes of . With , because the shortest path lengths from , , and to are changed, , , , and are set as affected nodes. The sets of affected nodes of and are shown in Table VII. Because , then .

Updates in data graph The affected nodes
, , , , , , ,
, , , ,
TABLE VII: The affected nodes of and

Theorem 2: The order of the updates in does not affect the correctness of the detection of Type II elimination relationships.
The proof of Theorem 2 can be found at the following link:
http://web.science.mq.edu.au/~yanwang/Proof.pdf

Detect Type III elimination relationships (DER-III): For an update from a pattern graph and an update from a data graph, we need to inspect if these two updates keep the GPNM results unchanged. Below are the detailed steps of detecting Type III elimination relationships. The pseudo-code is shown in Algorithm 3.

Input: , , , , ,
Output: The Type III elimination relationships of the updates
for each update do
    Perform DER-I to get ;
for each update do
    Perform DER-II to get ;
for each pair of nodes , in do
    if the bounded path length on then
        ;
Return the Type III elimination relationships of the updates;
Algorithm 3 DER-III

Step 1: For the update from a pattern graph, we identify the candidate nodes for .
Step 2: For the update from a data graph, we identify affected nodes for .
Step 3: Based on and , if , which means the shortest path length between any nodes in the set of candidate nodes is changed due to the update , we inspect the updated shortest path length matrix to check whether the shortest path lengths between the candidate nodes still satisfy the new pattern graph. If so, no node needs to be added into or deleted from the matching results; therefore, .

Example 9: Recall and shown in Fig. 2(c) and Fig. 2(a) respectively, is shown in Table I. is to insert edge with a bounded path length 2 into shown in Fig. 2(b) and is to insert edge into shown in Fig. 2(d). Based on Example 7 and Example 8, we have = and =, then . Since = , the shortest path length between and can still satisfy the bounded path length on the newly inserted edge. Therefore, .

Complexity: The complexity of the generation and the updates of is [ramalingam1996computational]. In the worst case, DER-I, DER-II and DER-III each need to check for every update, so the complexity of each of DER-I, DER-II and DER-III is , where and are the number of nodes and the number of edges respectively in , and is the scale of the updates.

IV-C Elimination Hierarchy Tree (EH-Tree)

As it is computationally expensive to deliver GPNM results by investigating each of the elimination relationships among the updates, we build up an index to record the hierarchical structure of the elimination relationships. This index structure can efficiently help detect the elimination relationships between each pair of updates. We present the details of the generation of EH-Tree as follows.

(1) Firstly, we use the method described in Section IV to identify the affected nodes of each update in the data graph and the candidate nodes of each update in the pattern graph. Each tree node in the EH-Tree denotes an update and stores the affected nodes or candidate nodes of that update.
(2) Based on the affected nodes and candidate nodes of each update, we apply the following strategies: (a) the update that has the maximum number of affected nodes or candidate nodes is set as the root of the EH-Tree; (b) if the affected nodes of one update can be covered by another update , then is set as a child tree node of ; (c) if the candidate nodes of one update can be covered by another update , then is set as a child tree node of ; (d) if and can eliminate each other, then we can either set as a child tree node of or set as a child tree node of .
(3) We then recursively insert all the updates into the EH-Tree.

Example 10: Recall , , and in Fig. 2. As has the maximum number of affected nodes among all the updates, it is set as the root of the EH-Tree; with , as the set of affected nodes of covers that of , is set as a child node of ; with , as the set of candidate nodes of covers that of , is set as a child node of ; because and can eliminate each other, is set as a child node of . The completed EH-Tree is shown in Fig. 3.
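
The construction can be sketched as follows. This is an illustrative simplification, not the authors' code: it treats every update as carrying a single node set (affected nodes or candidate nodes) and interprets "covered" as a subset test. It also places each update under the deepest tree node that covers it, which is a design choice of this sketch. The names `EHNode` and `build_eh_tree` are hypothetical. Updates that are not covered by the root are returned separately, since this sketch does not attempt to place them.

```python
from dataclasses import dataclass, field

@dataclass
class EHNode:
    update: str                      # identifier of the update
    nodes: frozenset                 # its affected (or candidate) nodes
    children: list = field(default_factory=list)

def build_eh_tree(node_sets):
    """node_sets maps an update id to its affected or candidate node set."""
    # Largest set first, so a possible parent is inserted before its children.
    order = sorted(node_sets, key=lambda u: (-len(node_sets[u]), u))
    root = EHNode(order[0], frozenset(node_sets[order[0]]))
    unplaced = []
    for u in order[1:]:
        s = frozenset(node_sets[u])
        if not s <= root.nodes:
            unplaced.append(u)       # ASSUMPTION: left outside the tree
            continue
        parent = root
        while True:
            # Descend while some child still covers the new update.
            nxt = next((c for c in parent.children if s <= c.nodes), None)
            if nxt is None:
                break
            parent = nxt
        parent.children.append(EHNode(u, s))
    return root, unplaced
```

Two updates with identical node sets eliminate each other; in this sketch the one with the smaller identifier is inserted first and becomes the parent, which matches strategy (d) in that either orientation is acceptable.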

Fig. 3: The EH-Tree of Example 10

V Graph Partition

V-A Label-based Partition

Fig. 4: Label-based Partition

It is computationally expensive to construct the shortest path length matrix and update the . Therefore, in this section, we propose a graph partition method to improve the efficiency of computing the shortest path length between any two nodes. Based on the observation that people with the same role (e.g., having the same job title) usually connect with each other closely [brandes2009structural], we put the nodes that have the same label in a data graph, together with their corresponding edges, into the same partition. The shortest path computation is then performed in a distributed manner based on the partitions.

Example 11: Fig. 4(a) depicts a data graph that has three different node labels, namely , and . Based on the labels of the nodes, we divide the data graph into three partitions, denoted as partition , and respectively.

After the partition, we need to preserve the connectivity of the data graph. To do so, our partition method records each cross-partition edge in the partition that contains its starting node. For example, in Fig. 4(a), we record in the partition . Before introducing the process of computing the shortest path length, we first define two kinds of bridge nodes.

Definition 1.

inner bridge node: Given a partition , a node ( ) is termed as an inner bridge node of if there is an edge in data graph and . Let denote the set of the inner bridge nodes of partition .

Example 12: In Fig. 4, is an inner bridge node of , because there exists an edge in Fig. 4(a) and .

Definition 2.

outer bridge node: Given a partition , a node ( ) is termed as an outer bridge node of if there exists an edge in data graph and . Let denote the set of the outer bridge nodes of partition .

Example 13: In Fig. 4, is an outer bridge node of because there exists an edge in Fig. 4(a).

We use to record the inner bridge nodes and outer bridge nodes of each partition. For example, the inner bridge nodes of partition are and , and the outer bridge nodes are and .
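
A minimal sketch of the label-based partition is given below. It assumes a directed data graph given as a label map and an edge list. The edge directions in Definitions 1 and 2 are not legible in this text, so the sketch assumes that an outer bridge node is the starting node of an edge that leaves its partition and that an inner bridge node is the ending node of an edge that enters its partition from another one; both readings are assumptions. The function name `label_partition` and the dictionary keys are hypothetical.

```python
from collections import defaultdict

def label_partition(labels, edges):
    """labels: node -> label; edges: iterable of directed (u, v) pairs."""
    parts = defaultdict(lambda: {
        "nodes": set(),
        "edges": [],          # edges with both endpoints in the partition
        "cross_edges": [],    # cross-partition edges starting in the partition
        "inner_bridge": set(),
        "outer_bridge": set(),
    })
    for v, lab in labels.items():
        parts[lab]["nodes"].add(v)
    for u, v in edges:
        pu, pv = labels[u], labels[v]
        if pu == pv:
            parts[pu]["edges"].append((u, v))
        else:
            # The cross-partition edge is recorded where its starting node lives.
            parts[pu]["cross_edges"].append((u, v))
            # ASSUMPTION: u is an outer bridge node (edge leaves its partition).
            parts[pu]["outer_bridge"].add(u)
            # ASSUMPTION: v is an inner bridge node (edge enters from outside).
            parts[pv]["inner_bridge"].add(v)
    return dict(parts)
```

Storing each cross-partition edge exactly once, with its starting node, keeps the partitions disjoint in their edge lists while still allowing the connectivity of the original data graph to be recovered.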

V-B Graph Partition based Shortest Path Length Computation

We divide the computation of the shortest path length into two sub-processes: sub-process-1 computes the shortest path length between any two nodes in the same partition, and sub-process-2 computes the shortest path length between any two nodes in different partitions. Below we introduce these two sub-processes in detail.

sub-process-1: For each partition , if = , we apply Dijkstra’s algorithm in this partition to compute the shortest path length. Otherwise, we apply the following steps. The pseudo-code is shown in Algorithm 4.

  • Step 1: In each partition , we denote the nodes in as . For each pair of nodes from to , we first apply Dijkstra’s algorithm to compute the shortest path length value from to in this partition (denoted as ) and set the shortest path length value (denoted as ) from to in the data graph as ;

  • Step 2: For each outer bridge node ( ), if , then the shortest path length between