Fast and Robust Distributed Subgraph Enumeration

01/23/2019 ∙ by Xuguang Ren, et al. ∙ POSTECH ∙ The Chinese University of Hong Kong ∙ Griffith University

We study the classic subgraph enumeration problem under distributed settings. Existing solutions either suffer from severe memory crisis or rely on large indexes, which makes them impractical for very large graphs. Most of them follow a synchronous model where the performance is often bottlenecked by the machine with the worst performance. Motivated by this, in this paper, we propose RADS, a Robust Asynchronous Distributed Subgraph enumeration system. RADS first identifies results that can be found using single-machine algorithms. This strategy not only improves the overall performance but also reduces network communication and memory cost. Moreover, RADS employs a novel region-grouped multi-round expand verify & filter framework which does not need to shuffle and exchange the intermediate results, nor does it need to replicate a large part of the data graph in each machine. This feature not only reduces network communication cost and memory usage, but also allows us to adopt simple strategies for memory control and load balancing, making it more robust. Several heuristics are also used in RADS to further improve the performance. Our experiments verified the superiority of RADS to state-of-the-art subgraph enumeration approaches.




1 Introduction

Subgraph enumeration is the problem of finding all occurrences of a query graph in a data graph. Its solution is a basis for many other algorithms and it finds numerous applications. This problem has been well studied under single-machine settings [10][19]. However, in the real world, data graphs are often fragmented and distributed across different sites, which highlights the importance of distributed systems for subgraph enumeration. Moreover, the increasing size of modern graphs makes it hard to load a whole graph into the memory of one machine, which further strengthens the need for distributed subgraph enumeration.

In recent years, several approaches and systems have been proposed [1, 21, 13, 15, 6, 5]. However, existing systems either need to exchange large intermediate results (e.g., [13], [15] and [21]), or copy and replicate large parts of the data graph on each machine (e.g., [1] and [6, 5]), or rely on heavy indexes (e.g., [18]). Exchanging and caching large intermediate results or large parts of the data graph places a heavy burden on the network and on memory. In fact, when the graphs are large, these systems tend to crash due to memory depletion. In addition, most of the current systems are synchronous and hence suffer from synchronization delay: the machines must wait for each other to complete certain processing tasks, which makes the overall performance equivalent to that of the slowest machine. More details about existing work can be found in Section 8.

It is observed in previous work [15, 18] that when the data graph is large, the number of intermediate results can be huge, making the network communication cost a bottleneck and causing memory crash. On the other hand, systems that rely on replication of large parts of the data graph or on heavy indexes are impractical for large data graphs and low-end computer clusters. In this paper, we present RADS, a Robust Asynchronous Distributed Subgraph enumeration system. Different from previous work, our system does not need to exchange intermediate results or replicate large parts of the data graph. It does not rely on heavy indexes or suffer from synchronization delay. Our system is also more robust thanks to its memory control strategies, and it makes load balancing easy.

To be specific, we make the following contributions:

  1. We propose a novel distributed subgraph enumeration framework, where the machines do not need to exchange intermediate results, nor do they need to replicate large parts of the data graph.

  2. We propose a method to identify embeddings that can be found on each local machine independent of other machines, and use a single-machine algorithm to find them. This strategy not only improves the overall performance, but also reduces network communication and memory cost.

  3. We propose effective memory control strategies to minimize the chance of memory crash, making our system more robust. Our strategy also facilitates workload balancing.

  4. We propose optimization strategies to further improve the performance. These include (i) a set of rules to compute an efficient execution plan, and (ii) a dynamic data structure to compactly store intermediate results.

  5. We conduct extensive experiments which demonstrate that our system is not only significantly faster than existing solutions (except for some queries using [18], which relies on heavy indexes), but also more robust.

Paper Organization In Section 2, we present the preliminaries. In Section 3, we present the architecture and framework of our system RADS. In Section 4, we present algorithms for computing the execution plan. In Section 5, we present the embedding trie data structure to compress our intermediate results. Our memory control strategy is given in Section 6. We present our experiments in Section 7, discuss related work in Section 8 and conclude the paper in Section 9. Some proofs, detailed algorithms and auxiliary experimental results are given in the appendix.

2 Preliminaries

Data Graph & Query Graph Both the data graph and query graph (a.k.a. query pattern) are assumed to be unlabeled, undirected, and connected graphs. We use $G = (V_G, E_G)$ and $P = (V_P, E_P)$ to denote the data graph and query graph respectively, where $V_G$ and $V_P$ are the vertex sets, and $E_G$ and $E_P$ are the edge sets. We will use data (resp. query) vertex to refer to vertices in the data (resp. query) graph. Generally, for any graph $g$, we use $V_g$ and $E_g$ to denote its vertex set and edge set respectively, and for any vertex $v$ in $g$, we use $adj_g(v)$ to denote $v$’s neighbour set in $g$ and use $deg_g(v)$ to denote the degree of $v$.

Subgraph Isomorphism Given a data graph $G$ and a query pattern $P$, $P$ is subgraph isomorphic to $G$ if there exists an injective function $f: V_P \rightarrow V_G$ such that for any edge $(u_1, u_2) \in E_P$, there exists an edge $(f(u_1), f(u_2)) \in E_G$. The injective function is also known as an embedding of $P$ in $G$ (or, from $P$ to $G$), and it can be represented as a set of vertex pairs $(u, v)$ where $u \in V_P$ is mapped to $v \in V_G$. We will use $R_G(P)$ to denote the set of all embeddings of $P$ in $G$.

The problem of subgraph enumeration is to find the set $R_G(P)$. In the literature, subgraph enumeration is also referred to as subgraph isomorphism search [16][10][19] and subgraph listing [12][21].

Partial Embedding A partial embedding of graph $P$ in graph $G$ is an embedding in $G$ of a vertex-induced subgraph of $P$. A partial embedding is a full embedding if the vertex-induced subgraph is $P$ itself.

Symmetry Breaking A symmetry breaking technique based on automorphism is conventionally used to reduce duplicate embeddings [8]. As a result the data vertices in the final embeddings should follow a preserved order of the query vertices. We apply this technique in this paper by default and we will specify the preserved order when necessary.
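To make the two definitions above concrete, the following is a minimal brute-force sketch, not the algorithm used by RADS. The function name `enumerate_embeddings` and the vertex labels are hypothetical, and the order constraint used for symmetry breaking is simply one valid choice for a triangle pattern.

```python
from itertools import permutations

def enumerate_embeddings(pattern_edges, data_edges):
    """Return all injective maps f (as dicts) sending every pattern edge to a data edge."""
    p_vertices = sorted({u for e in pattern_edges for u in e})
    d_vertices = sorted({v for e in data_edges for v in e})
    # Undirected graphs: store each data edge in both directions.
    d_edge_set = {(a, b) for a, b in data_edges} | {(b, a) for a, b in data_edges}
    embeddings = []
    # permutations() yields ordered selections without repetition, i.e. injective maps.
    for image in permutations(d_vertices, len(p_vertices)):
        f = dict(zip(p_vertices, image))
        if all((f[u1], f[u2]) in d_edge_set for u1, u2 in pattern_edges):
            embeddings.append(f)
    return embeddings

# Hypothetical example: a triangle pattern in the complete graph on 4 vertices.
triangle = [("u0", "u1"), ("u1", "u2"), ("u0", "u2")]
k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
all_embeddings = enumerate_embeddings(triangle, k4)

# Symmetry breaking: keep one embedding per triangle by preserving the
# order f(u0) < f(u1) < f(u2) of the mapped data vertices.
ordered = [f for f in all_embeddings if f["u0"] < f["u1"] < f["u2"]]
```

Without symmetry breaking every triangle of the data graph is reported once per automorphism of the pattern; the preserved order keeps a single representative of each.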

Graph Partition & Storage Given a data graph $G$ and $m$ machines $\{M_1, \ldots, M_m\}$ in a distributed environment, a partition of $G$ is denoted $\{G_1, \ldots, G_m\}$ where $G_t$ is the partition located in machine $M_t$. In this paper, we assume each partition is stored as an adjacency-list. For any data vertex $v$, we assume its adjacency-list is stored in a single machine $M_t$ and we say $v$ is owned by $M_t$ (or $v$ resides in $M_t$). We call $v$ a foreign vertex of $M_t$ if $v$ is not owned by $M_t$. We say a data edge $e$ is owned by (or resides in) $M_t$ (denoted as $e \in E_{G_t}$) if either end vertex of $e$ resides in $M_t$. Note that an edge can reside in two different machines.

For any $v$ owned by $M_t$, we call $v$ a border vertex if any of its neighbors is owned by a machine other than $M_t$. Otherwise we call it a non-border vertex. We use $V^b_{G_t}$ to denote the set of all border vertices in $M_t$.

3 RADS Architecture

In this section, we first present an overview of the architecture of RADS, followed by the R-Meef framework of RADS. We give a detailed implementation of R-Meef in Appendix B.

3.1 Architecture Overview

Figure 1: RADS Architecture

The architecture of RADS is shown in Figure 1. Given a query pattern $P$, within each machine, RADS first launches a process of single-machine enumeration (SM-E) and a daemon thread simultaneously. After SM-E finishes, RADS launches an R-Meef thread. Note that the R-Meef threads of different machines may start at different times.

  • Single-Machine Enumeration The idea of SM-E is to try to find a set of local embeddings using a single-machine algorithm, such as TurboIso [10], which does not involve any distributed processing. The subsequent distributed process only has to find the remaining embeddings. This strategy can not only boost the overall enumeration efficiency but also significantly reduce the memory cost and communication cost of the subsequent distributed process. Moreover, the local embeddings can be used to estimate the space cost of a region group, which will help to effectively control the memory usage (to be discussed in Section 6).

    We first define the concepts of border distance and span, which will be used to identify embeddings that can be found by SM-E.

    Definition 1 (Border Distance)

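    Given a graph partition $G_t$ and data vertex $v$ in $G_t$, the border distance of $v$ w.r.t. $G_t$, denoted as $BD_{G_t}(v)$, is the minimum shortest distance between $v$ and any border vertex of $G_t$, that is

    $$BD_{G_t}(v) = \min_{v' \in V^b_{G_t}} dist(v, v')$$

    where $V^b_{G_t}$ is the set of border vertices of $G_t$ and $dist(v, v')$ is the shortest distance between $v$ and $v'$.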

    Definition 2 (Span)

    Given a query pattern $P$, the span of query vertex $u$, denoted as $Span_P(u)$, is the maximum shortest distance between $u$ and any other vertex of $P$, that is

    $$Span_P(u) = \max_{u' \in V_P} dist(u, u')$$

    Proposition 1

    Given a data vertex $v$ of $G_t$ and a query vertex $u$ of $P$, if $Span_P(u) \le BD_{G_t}(v)$, then there will be no embedding $f$ of $P$ in $G$ such that $f(u) = v$, and $v'$ is not owned by $M_t$, where $u' \in V_P$, $f(u') = v'$.

    Proposition 1 states that if the border distance of $v$ is not smaller than the span of query vertex $u$, there will be no cross-machine embeddings (i.e., embeddings where the query vertices are mapped to data vertices residing in different machines) which map $u$ to $v$. The proof of Proposition 1 is in Appendix A.1.

    Let $u_{start}$ be the starting query vertex (namely, the first query vertex to be mapped) and $C(u_{start})$ be the candidate vertex set of $u_{start}$ in $G_t$. Let $C_1(u_{start})$ be the subset of candidates whose border distance is no less than the span of $u_{start}$. According to Proposition 1, any embedding that maps $u_{start}$ to a vertex in $C_1(u_{start})$ can be found using a single-machine subgraph enumeration algorithm over $G_t$, independent of other machines. In RADS, the candidates in $C_1(u_{start})$ will be processed by SM-E, and the other candidates will be processed by the subsequent distributed process. The SM-E process is simple, and we will next focus on the distributed process. For presentation simplicity, from now on when we say a candidate vertex of $u_{start}$, we mean a candidate vertex in $C(u_{start}) \setminus C_1(u_{start})$, unless explicitly stated otherwise.

    The distributed process consists of some daemon threads and the subgraph enumeration thread:

  • Daemon Threads listen to requests from other machines and support four functionalities:
    (1) verifyE is to return the edge verification results for a given request consisting of vertex pairs. For example, given a request $\{(v_0, v_1), (v_2, v_3)\}$ posted to $M_t$, $M_t$ will return $\{true, false\}$ if $(v_0, v_1)$ is an edge in $G_t$ while $(v_2, v_3)$ is not.
    (2) fetchV is to return the adjacency-lists of the requested vertices of the data graph. The requested vertices sent to machine must reside in .
    (3) checkR is to return the number of unprocessed region groups (which is a group of candidate data vertices of the starting query vertex, see Section 3.2) of the local machine (i.e., the machine on which the thread is running).
    (4) shareR is to return an unprocessed region group of the local machine to the requester machine. shareR will also mark the region group sent out as processed.

  • R-Meef Thread is the core subgraph enumeration thread. When necessary, the local R-Meef thread sends verifyE requests and fetchV requests to the Daemon threads located in other machines, and the other machines respond to these requests accordingly.

    Once a local machine finishes processing its own region groups, it will broadcast a checkR request to the other machines. Upon receiving the numbers of unfinished region groups from other machines, it will send a shareR request to the machine with the maximum number of unprocessed region groups. Once it receives a region group, it will process it on the local machine. checkR and shareR are for load balancing purposes only, and they will not be discussed further in this paper.
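The split between SM-E candidates and distributed candidates described under Single-Machine Enumeration (Definitions 1 and 2 and Proposition 1) can be sketched as follows. This is an illustrative sketch under simplifying assumptions: the whole graph is given as one adjacency dict `adj` with an ownership map `owner`, every local vertex is treated as a candidate of the starting query vertex, and all function names are hypothetical.

```python
from collections import deque

def bfs_distances(adj, sources):
    """Multi-source BFS: hop distance from the nearest source to each reachable vertex."""
    dist = {s: 0 for s in sources}
    queue = deque(sources)
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return dist

def span(pattern_adj, u):
    """Span of u: the largest shortest distance from u to any other pattern vertex."""
    return max(bfs_distances(pattern_adj, [u]).values())

def border_distances(adj, owner, t):
    """Border distance of every vertex owned by machine t, measured among its own vertices."""
    local = {v for v in adj if owner[v] == t}
    border = [v for v in local if any(owner[w] != t for w in adj[v])]
    local_adj = {v: [w for w in adj[v] if w in local] for v in local}
    dist = bfs_distances(local_adj, border)
    # Vertices that cannot reach any border vertex are infinitely far from the border.
    return {v: dist.get(v, float("inf")) for v in local}

def split_candidates(adj, owner, t, pattern_adj, u_start):
    """Split the local candidates of u_start into (SM-E candidates, distributed candidates)."""
    s = span(pattern_adj, u_start)
    bd = border_distances(adj, owner, t)
    sme = sorted(v for v, d in bd.items() if d >= s)
    rest = sorted(v for v, d in bd.items() if d < s)
    return sme, rest

# Hypothetical data graph: a path 0-1-2-3-4-5; machine 0 owns 0..3, machine 1 owns 4 and 5.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
owner = {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1}
# Hypothetical pattern: a path u0-u1-u2.
pattern_adj = {"u0": ["u1"], "u1": ["u0", "u2"], "u2": ["u1"]}

from_middle = split_candidates(adj, owner, 0, pattern_adj, "u1")  # span of u1 is 1
from_end = split_candidates(adj, owner, 0, pattern_adj, "u0")     # span of u0 is 2
```

Starting from the middle pattern vertex (span 1) leaves only the border vertex 3 to the distributed process, while starting from an end vertex (span 2) also pushes vertex 2 there, which illustrates why a starting vertex with a small span is preferable.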

3.2 The R-Meef Framework

Before presenting the details of the R-Meef framework, we need the following definitions.

Definition 3 (embedding candidate)

Given a partition $G_t$ of data graph $G$ located in machine $M_t$ and a query pattern $P$, an injective function $f: V_P \rightarrow V_G$ is called an embedding candidate (EC) of $P$ if for any edge $(u, u') \in E_P$, there exists an edge $(f(u), f(u')) \in E_{G_t}$, provided either $f(u) \in V_{G_t}$ or $f(u') \in V_{G_t}$.

We use $\tilde{R}_{G_t}(P)$ to denote the set of ECs of $P$ in $G_t$. Note that for an EC $f$ and a query vertex $u$, $f(u)$ is not necessarily owned by $M_t$. That is, the adjacency-list of $f(u)$ may be stored in other machines. For any query edge $(u, u')$, an EC only requires that the corresponding data edge $(f(u), f(u'))$ exists if at least one of $f(u)$ and $f(u')$ resides in $M_t$. Therefore, an EC may not be an embedding. Intuitively, the existence of the edge $(f(u), f(u'))$ can only be verified in $M_t$ if one of its end vertices resides in $M_t$. Otherwise the existence of the edge cannot be verified in $M_t$, and we call such edges undetermined edges.

Definition 4

Given an EC $f$ of query pattern $P$, for any edge $(u, u') \in E_P$, we say $(f(u), f(u'))$ is an undetermined edge of $f$ if neither $f(u)$ nor $f(u')$ is in $G_t$.

Example 1

Consider a partition of a data graph and a triangle query pattern where . The mapping , , is an EC of in if , and and neither nor resides in . is an undetermined edge of .

Obviously if we want to determine whether $f$ is actually an embedding of the query pattern, we have to verify its undetermined edges in other machines. For any undetermined edge $(v, v')$, if its two end vertices reside in two different machines, we can use either of them to verify whether $(v, v') \in E_G$ or not. To do that, we need to send a verifyE request to one of the machines.

Note that it is possible that an undetermined edge is shared by multiple ECs. To reduce network traffic, we do not send verifyE requests once for each individual EC, instead, we build an edge verification index (EVI) and use it to identify ECs that share undetermined edges. We assume each EC is assigned an ID (We will discuss how to assign such IDs and how to build EVI in Section 5).

Definition 5 (edge verification index)

Given a set $R$ of ECs, the edge verification index (EVI) of $R$ is a key-value map $I$ where

  1. for any tuple $(e, IDs) \in I$,

    • the key $e$ is a vertex pair $(v, v')$.

    • the value $IDs$ is the set of IDs of the ECs in $R$ of which $(v, v')$ is an undetermined edge.

  2. for any undetermined edge $e$ of $f \in R$, there exists a unique tuple in $I$ with $e$ as the key and the ID of $f$ in the value.

Intuitively, the EVI groups the ECs that share each undetermined edge together. It is straightforward to see:

Proposition 2

Given data graph $G$, query pattern $P$ and an edge verification index $I$, for any $(e, IDs) \in I$, if $e \notin E_G$, then none of the ECs corresponding to $IDs$ can be an embedding of $P$ in $G$.

Example 2

Consider two embedding candidates , , and , , of a triangle pattern of a data graph where . Assuming is an undetermined edge, we can have an edge verification index: where are represented by their IDs in . If is verified non-existing, both and can be filtered out.
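The grouping performed by the EVI and the filtering justified by Proposition 2 can be sketched as below. This is a simplified illustration rather than the RADS implementation: ECs are plain dicts keyed by small integer IDs (the paper derives IDs from the embedding trie instead), the ownership map `owner` and all function names are assumptions, and the verification results are passed in directly instead of being obtained through verifyE requests.

```python
from collections import defaultdict

def undetermined_edges(ec, pattern_edges, owner, t):
    """Data edges of an EC whose two endpoints are both foreign to machine t."""
    result = []
    for u1, u2 in pattern_edges:
        v1, v2 = ec[u1], ec[u2]
        if owner[v1] != t and owner[v2] != t:
            result.append((min(v1, v2), max(v1, v2)))  # normalise undirected pairs
    return result

def build_evi(ecs, pattern_edges, owner, t):
    """Map each undetermined edge to the set of IDs of the ECs that contain it."""
    evi = defaultdict(set)
    for ec_id, ec in ecs.items():
        for e in undetermined_edges(ec, pattern_edges, owner, t):
            evi[e].add(ec_id)
    return dict(evi)

def filter_ecs(ecs, evi, verified):
    """Drop every EC that depends on an edge reported as non-existent."""
    failed = set()
    for e, ids in evi.items():
        if not verified[e]:
            failed |= ids
    return {i: ec for i, ec in ecs.items() if i not in failed}

# Hypothetical setting: machine 0 owns data vertices 0 and 3; vertices 1 and 2 are foreign.
owner = {0: 0, 3: 0, 1: 1, 2: 2}
triangle = [("u0", "u1"), ("u1", "u2"), ("u0", "u2")]
ecs = {
    1: {"u0": 0, "u1": 1, "u2": 2},
    2: {"u0": 3, "u1": 1, "u2": 2},
}
evi = build_evi(ecs, triangle, owner, 0)
survivors_if_missing = filter_ecs(ecs, evi, {(1, 2): False})
survivors_if_present = filter_ecs(ecs, evi, {(1, 2): True})
```

Both ECs share the single undetermined edge (1, 2), so one verification request decides the fate of both.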

Like SEED and TwinTwig, we decompose the pattern graph into small decomposition units.

Definition 6 (decomposition)

A decomposition of query pattern $P$ is a sequence of decomposition units $dp_0, \ldots, dp_l$, where every $dp_i$ is a subgraph of $P$ such that

  1. The vertex set of $dp_i$ consists of a pivot vertex $dp_i.piv$ and a non-empty set $dp_i.LF$ of leaf vertices (in an abuse of the word “leaf”), all of which are vertices in $V_P$; and for every $u \in dp_i.LF$, $(dp_i.piv, u) \in E_P$.

  2. The edge set of $dp_i$ consists of two parts, $E^{star}_{dp_i}$ and $E^{sib}_{dp_i}$, where $E^{star}_{dp_i}$ is the set of edges between the pivot vertex and the leaf vertices, and $E^{sib}_{dp_i}$ is the set of edges between the leaf vertices.

  3. $\bigcup_{i} V_{dp_i} = V_P$, and for $i < j$, $V_{dp_i} \cap dp_j.LF = \emptyset$.

Note that condition (3) in the above definition says the leaf vertices of each decomposition unit do not appear in the previous units. Unlike the decompositions in SEED [15] and TwinTwig [13], our decomposition unit is not restricted to stars and cliques, and $\bigcup_{i} E_{dp_i}$ may be a proper subset of $E_P$.

Example 3

Consider the query pattern in Figure 2 (a), we may have a decomposition , , , ) where , , , , , , , , , and , . Note that the edge , is not in any decomposition unit.

Figure 2: Running Example

Given a decomposition , , of pattern , we define a sequence of sub-query patterns , where = , and for , consists of the union of and together with the edges across the vertices of and , that is, = , = . Note that (a) none of the leaf vertices of can be in ; and (b) is the subgraph of induced by the vertex set , and . We say forms an execution plan if for every , the pivot vertex of is in . Formally, we have

Definition 7 (execution plan)

A decomposition , , of is an execution plan () if for all .

For example, the decomposition in Example 3 is an execution plan.

Let , , be an execution plan. For each , we define

We call the edges in , and the expansion edges, sibling edges, and cross-unit edges respectively. The sibling edges and cross-unit edges are both called verification edges.

Consider in Example 3, we have =, =. For , we have =, =.

Note that the expansion edges of all the units form a spanning tree of , and the verification edges are the edges not in the spanning tree.
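The conditions on decompositions and execution plans, and the observation that the expansion edges form a spanning tree, can be checked mechanically. The sketch below is an illustration under assumptions: each unit is encoded as a hypothetical `(pivot, leaves)` tuple, and verification edges are obtained simply as the pattern edges that are not expansion edges, without separating sibling from cross-unit edges.

```python
def check_plan(pattern_edges, units):
    """units: list of (pivot, leaves) tuples in processing order.

    Returns (expansion_edges, verification_edges), or raises ValueError
    if the sequence is not an execution plan for the pattern.
    """
    edges = {frozenset(e) for e in pattern_edges}
    vertices = {u for e in pattern_edges for u in e}
    seen = set()       # vertices of the sub-query pattern built so far
    expansion = set()
    for i, (pivot, leaves) in enumerate(units):
        if not leaves:
            raise ValueError(f"unit {i} has no leaf vertex")
        if i > 0 and pivot not in seen:
            raise ValueError(f"pivot of unit {i} is not matched by earlier units")
        for leaf in leaves:
            if frozenset((pivot, leaf)) not in edges:
                raise ValueError(f"leaf {leaf} is not adjacent to pivot {pivot}")
            if leaf in seen:
                raise ValueError(f"leaf {leaf} already appears in an earlier unit")
            expansion.add(frozenset((pivot, leaf)))
        seen |= {pivot, *leaves}
    if seen != vertices:
        raise ValueError("units do not cover every pattern vertex")
    return expansion, edges - expansion

# Hypothetical pattern: a 4-cycle a-b-c-d with the chord a-c.
pattern = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a"), ("a", "c")]
expansion, verification = check_plan(pattern, [("a", ["b", "c", "d"])])
```

For this pattern a single unit pivoted at `a` is a valid plan: its three expansion edges form a spanning tree, and the two remaining edges are verification edges.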

With the above concepts, we are ready to present the R-Meef framework. Given query pattern , data graph and its partition on machine , R-Meef finds a set of embeddings of in according to an execution plan , which provides a processing order for the query pattern . In our approach, each machine will evaluate in the first round, and based on the results in round i, it will evaluate the next pattern in the next round. The final results will be obtained when is evaluated in all machines (each machine computes a subset of the final embeddings, the union of which is the final set of embeddings of in ).

Moreover, in our approach, each machine starts by mapping (which is the in Section 3.1) to a candidate vertex of that resides in . When the number of such candidate vertices is large, there is a possibility of generating too many intermediate results (i.e., ECs and embeddings of , ). To prevent memory crash, we divide the candidate vertex set of into disjoint region groups = , , , and process each group separately.

The workflow of R-Meef is as follows:

  1. From the vertices residing in $M_t$, R-Meef divides the candidate vertices of the starting query vertex into different region groups. Then it processes each group sequentially and separately.

  2. For each region group, R-Meef processes one unit per round based on the execution plan. In the $i$-th round, the workflow is illustrated in Figure 3.

    Figure 3: R-Meef workflow

    In Figure 3, represents the set of embedding of generated and cached from the last round. For the first round (i.e., round 0), will be initialized as where is a candidate vertex of . By expanding , we get all the ECs of , i.e., . After verification and filtering, we get all the embeddings of for this region group of .

    In each round, the expand and verify & filter processes work as follows:


    • Expand Given an embedding of obtained from the previous round, has already been matched to a data vertex by since . By searching the neighborhood of , we expand to find the ECs of containing , . It is worth noting that if does not reside in , we have to fetch its adjacency-list from other machines. Different embeddings from previous round may share some common foreign vertices to fetch in order to expand. To reduce network traffic, for all the embeddings from last round, we gather all the vertices that need to be fetched and then fetch their adjacency-lists together by sending a single request.

      One important assumption here is that each machine has a record of the ownership information (i.e., which machine a data vertex resides in) of all the vertices. This record can be constructed offline as a map whose size is , which can be saved together with the adjacency-list and takes one extra byte space for each vertex.

    • Verify & Filter Upon having a set of ECs, we store them compactly in an embedding trie and build an EVI from them (the embedding trie and EVI will be further discussed in Section 5). Then we send a request consisting of the keys of the EVI, i.e., undetermined data edges, to other machines to verify their existence. After we get the verification results, each failed key indicates that the corresponding ECs can be filtered out. The output of the final round is the set of embeddings of the query pattern found by $M_t$ for this region group.

    Note that a detailed implementation and example of R-Meef is given in Appendix B. Although the idea of our framework is straightforward, each critical component must be carefully designed to achieve the best performance. In the following sections, we tackle the challenges one by one.

4 Computing Execution Plan

It is obvious that we may have multiple valid execution plans for a query pattern, and different execution plans may have different performance. The challenge is how to find the most efficient one among them. In this section, we present some heuristics to find a good execution plan.

4.1 Minimizing Number of Rounds

Given query pattern and an execution plan , we have rounds for each region group, and once all the rounds are processed we will get the set of final embeddings. Also, within each round, the workload can be shared. To be specific, a single undetermined edge may be shared by multiple ECs. If these embedding candidates are generated in the same round, the verification of can be shared by all of them. The same applies to the foreign vertices where the cost of fetching and memory space can be shared among multiple embedding candidates if they happen to be in the same round. Therefore, our first heuristic is to minimize the number of rounds (namely, the number of decomposition units) so as to maximize the workload sharing.

Here we present a technique to compute a query execution plan, which guarantees a minimum number of rounds. Our technique is based on the concept of maximum leaf spanning tree [7].

Definition 8

A maximum leaf spanning tree (MLST) of pattern $P$ is a spanning tree of $P$ with the maximum number of leaves (a leaf is a vertex with degree 1). The number of leaves in an MLST of $P$ is called the maximum leaf number of $P$, denoted $l_P$.

A closely related concept is minimum connected dominating set.

Definition 9

A connected dominating set (CDS) of $P$ is a subset $D$ of $V_P$ such that (1) $D$ is a dominating set of $P$, that is, any vertex of $P$ is either in $D$ or adjacent to a vertex in $D$, and (2) the subgraph of $P$ induced by $D$ is connected.

A minimum connected dominating set (MCDS) is a CDS with the smallest cardinality among all CDSs. The number of vertices in an MCDS is called the connected domination number, denoted $c_P$.

It is shown in [4] that $|V_P| = c_P + l_P$.

Theorem 1

Given a pattern $P$, any execution plan of $P$ has at least $c_P$ decomposition units, and there exists an execution plan with exactly $c_P$ decomposition units.

The proof of Theorem 1 is in Appendix A.1.

Theorem 1 indicates that $c_P$ is the minimum number of rounds of any execution plan. The proof provides a method to construct an execution plan with $c_P$ rounds from an MLST. It is worth noting that the decomposition units in the query plan constructed as in the proof have distinct pivot vertices.
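Since query patterns are small, the connected domination number of Definition 9 can be found by exhaustive search. The sketch below is a brute-force illustration with hypothetical function names; it is not the MLST-based construction used in the proof of Theorem 1, it only computes the quantity that the theorem identifies as the minimum number of rounds.

```python
from itertools import combinations

def is_connected(vertices, adj):
    """True if the subgraph induced by `vertices` is connected."""
    vertices = set(vertices)
    start = next(iter(vertices))
    stack, seen = [start], {start}
    while stack:
        x = stack.pop()
        for y in adj[x]:
            if y in vertices and y not in seen:
                seen.add(y)
                stack.append(y)
    return seen == vertices

def minimum_cds(adj):
    """Smallest connected dominating set of a small connected graph, by brute force."""
    vertices = sorted(adj)
    for k in range(1, len(vertices) + 1):
        for subset in combinations(vertices, k):
            chosen = set(subset)
            dominated = chosen | {y for x in chosen for y in adj[x]}
            if dominated == set(vertices) and is_connected(chosen, adj):
                return chosen
    return set(vertices)

# Hypothetical patterns: a path on five vertices and a triangle.
path5 = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
path_cds = minimum_cds(path5)
triangle_cds = minimum_cds(triangle)
```

For the five-vertex path the result is the three inner vertices, so any execution plan needs at least three units, for instance units pivoted at 1, 2 and 3 in turn; for the triangle a single unit suffices.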

Example 4

Consider the pattern , it can be easily verified that the tree obtained by erasing the edges , , , and is a MLST of . Choosing as the root, we will get a minimum round execution plan =, , where , , , , , , , and , . If we choose as the root, we will get a different minimum-round execution plan =, where , , , , , ,

4.2 Minimizing the span of $dp_0.piv$

Given a pattern $P$, multiple execution plans may exist with the minimum number of rounds, while their $dp_0.piv$ (the pivot vertex of the first decomposition unit) can be different. In this case, our second heuristic is to choose the plan(s) whose $dp_0.piv$ has the smallest span. This strategy maximizes the number of embeddings that can be found using SM-E. Recall from the RADS architecture that $dp_0.piv$ is the starting query vertex $u_{start}$. Based on Proposition 1, the more candidate vertices of $u_{start}$ can be processed in SM-E, the more workload is separated from the distributed processing, and therefore the more communication cost and memory usage are saved.

Figure 4: A Query Pattern

Consider the pattern in Figure 4, the bold edges demonstrate a MLST based on which both and can be chosen as . And the execution plans from them have the same number of rounds. However, while . Therefore we choose the plan with as the .

4.3 Maximizing Filtering Power

Given a pattern $P$, multiple execution plans may exist that have the minimum number of rounds and whose $dp_0.piv$ have the same smallest span. Here we use the third heuristic, which is to choose plans with more verification edges in the earlier rounds. The intuition is to maximize the filtering power of the verification edges as early as possible. To this end, we propose the following score function for an execution plan $PL = \{dp_0, \ldots, dp_l\}$:


is the number of verification edges in round , and is a positive parameter used to tune the score function. In our experiments we use . The function calculates a score by assigning larger weights to the verification edges in earlier rounds (since if ).

Example 5

Consider the query plans and in Example 4. The total number of verification edges in these plans are the same. In , the number of verifications edges for the first, second and third round is 2, 1, 2 respectively. In , the number of verification edges for the three rounds is 1, 2, and 2 respectively. Therefore, we prefer . Using , we can calculate the scores of the two plans as follows:

When several minimum-round execution plans have the same score, we use another heuristic rule to choose the best one from them: the larger the degree of the pivot vertex, the earlier we process the unit. The pivot vertex with a larger degree has a stronger power to filter unpromising candidates.

To accommodate this rule, we can modify the score function in (1) by adding another component as follows:


Thus we have a set of rules to follow when computing the execution plan. Since the number of query vertices is normally very small, we can simply enumerate all the possible execution plans and choose the best one according to those rules.
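The idea of weighting verification edges by round can be illustrated with a small sketch. The weight $1/(i+1)^{\rho}$ used here is an assumption chosen only because it decreases with the round index $i$, as the text requires; the function name `plan_score` and the default value of `rho` are likewise hypothetical, and the degree-based tie-breaking component is not modelled.

```python
def plan_score(verification_counts, rho=1.0):
    """Score a plan from its per-round numbers of verification edges.

    Assumed weighting: round i (0-based) contributes count / (i + 1) ** rho,
    so verification edges in earlier rounds weigh more.
    """
    return sum(n / (i + 1) ** rho for i, n in enumerate(verification_counts))

# Per-round verification edge counts of the two plans compared in Example 5.
score_first = plan_score([2, 1, 2])
score_second = plan_score([1, 2, 2])
```

Both plans have five verification edges in total, but the first one places more of them in round 0 and therefore receives the higher score under this weighting, which matches the preference stated in Example 5.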

5 Embedding Trie

As stated before, to save memory, the intermediate results (which include embeddings and embedding candidates generated in each round) are stored in a compact data structure called an embedding trie. Besides compression, the challenges here are how to ensure that each intermediate result has a unique ID in the embedding trie and that the embedding trie can be easily maintained.

Before we give our solution, we first define a matching order, which is the order in which the query vertices are matched in R-Meef. It is also the order in which the nodes in the embedding trie are organized.

Definition 10 (Matching Order)

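Given a query execution plan $PL = \{dp_0, \ldots, dp_l\}$ of pattern $P$, the matching order w.r.t. $PL$ is a relation $\prec$ defined over the vertices of $P$ that satisfies the following conditions:

  1. $dp_i.piv \prec dp_j.piv$ if $i < j$;

  2. For any two vertices $u_1 \in dp_i.LF$ and $u_2 \in dp_j.LF$, $u_1 \prec u_2$ if $i < j$.

  3. For $i \in [0, l]$:

    (i) $dp_i.piv \prec u$ for all $u \in dp_i.LF$;

    (ii) for any vertices $u_1, u_2 \in dp_i.LF$ that are not the pivot vertices of other units, $u_1 \prec u_2$ if $deg(u_1) > deg(u_2)$, or $deg(u_1) = deg(u_2)$ and the vertex ID of $u_1$ is less than that of $u_2$;

    (iii) if $u_1 \in dp_i.LF$ is a pivot vertex of another unit, and $u_2 \in dp_i.LF$ is not a pivot vertex of another unit, then $u_1 \prec u_2$.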

Intuitively the above relation orders the vertices of as follows: (a) Generally a vertex in is before a vertex in if , except for the special case where and . In this special case, may appear in the leaf of some previous unit (), and it may be arranged before according to Condition (2) or Condition (3) (ii). (b) Starting from , the vertex is arranged before all other vertices. For the leaf vertices of , it arranges those that are pivot vertices of other units before those that are not (Condition (3)(iii)), and for the former, it arranges them according to the ID of the units for which they are the pivot vertex (Condition (1); note that no two units share the same pivot vertex); for the latter, it arranges them in descending order of their degree in the original pattern , and if they have the same degree it arranges them in the order of vertex ID (Condition (3) (ii)). For each subsequent , the pivot vertex must appear in the leaf of some previous unit, hence its position has been fixed; and the leaf vertices of are arranged in the same way as the leaf vertices of .

It is easy to verify is a strict total order over . Following the matching order, the vertices of can be arranged into an ordered list. Consider the execution plan in Example 4. The vertices in the query can be arranged as () according to the matching order.

Let , , be an execution plan, be the subgraph of induced from the vertices in (as defined in Section 3.2), and be a set of results (i.e., embeddings or embedding candidates) of . For easy presentation, we assume the vertices in have been arranged into the list by the matching order, that is, the query vertex at position is . Then each result of can be represented as a list of corresponding data vertices. These lists can be merged into a collection of trees as follows:

  1. Initially, each result is treated as a tree , where the node at level stores the data vertex for , and the root is the node at level 0.

  2. If multiple results map to the same data vertex, merge the root nodes of their trees. This partitions the results in into different groups, each group will be stored in a distinct tree.

  3. For each newly merged node , if multiple children of correspond to the same vertex, merge these children into a single child of .

  4. Repeat step (3) until no nodes can be merged.

The collection of trees obtained above is a compact representation of the results in . Each leaf node in the tree uniquely identifies a result.
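The merging steps above amount to building a prefix tree over the result lists. The following sketch shows the effect with nested dicts; it is only meant to illustrate the compression, the helper names are hypothetical, and the results are made-up lists of data vertex IDs already arranged in matching order.

```python
def merge_results(results):
    """Merge result lists into a forest of nested dicts keyed by data vertex."""
    forest = {}
    for result in results:
        level = forest
        for v in result:
            # Reuse the child for v if it exists, otherwise create it.
            level = level.setdefault(v, {})
    return forest

def count_nodes(forest):
    """Total number of tree nodes in the forest."""
    return sum(1 + count_nodes(children) for children in forest.values())

# Hypothetical results, each a list of data vertices in matching order.
results = [(1, 2, 3), (1, 2, 4), (1, 5, 6), (7, 8, 9)]
forest = merge_results(results)
stored_nodes = count_nodes(forest)
stored_entries = sum(len(r) for r in results)
```

Here the four results share prefixes, so the forest holds two trees with 9 nodes in total instead of the 12 entries needed by separate lists.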

The embedding trie is a collection of similar trees. However, since the purpose of the embedding trie is to save space, we cannot get it by merging the result lists. Instead, we will have to construct it by inserting nodes one by one when results are generated, and removing nodes when results are eliminated. Next we formally define embedding trie and present the algorithms for the maintenance of the embedding trie.

5.1 Structure of the Embedding Trie

Definition 11 (Embedding Trie)

Given a set of results of , the embedding trie of is a collection of trees used to store the results in such that:

  1. Each tree represents a set of results that map to the same data vertex.

  2. Each tree node has

    • v: a data vertex

    • parentN: a pointer pointing to its parent node (the pointer of the root node is null).

    • childCount: the number of child nodes of .

  3. If two nodes have the same parent, then they store different data vertices.

  4. Every leaf-to-root path represents a result in , and every result in is represented as a unique leaf-to-root path.

  5. If we divide the tree nodes into different levels such that the root nodes are at level 0, the children of the root nodes are at level 1 and so on, then the tree nodes at level () store the set of values .

Figure 5: Example of Embedding Trie
Example 6

Consider in Example 7, where the vertices are ordered as according to the matching order. There are three ECs of : , and . These results can be stored in a tree shown in Figure 5(a). When the second EC is filtered out, we have compressed in a tree as shown in Figure 5(b). The first EC can be expanded to an EC of (where the list of vertices of are ), which is as shown in Figure 5(c).

Although the structure of embedding trie is simple, it has some nice properties:

  • Compression Storing the results in the embedding trie takes less space than storing them as a collection of lists.

  • Unique ID For each result in the embedding trie, the address of its leaf node in memory can be used as the unique ID.

  • Retrieval Given a particular ID represented by a leaf node, we can easily follow its pointer step-by-step to retrieve the corresponding result.

  • Removal To remove a result with a particular ID, we remove its corresponding leaf node and decrease the childCount of its parent node by 1. If the childCount of this parent node reaches 0, we remove the parent node as well. This process propagates recursively up through the ancestors of the leaf node.
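
The retrieval and removal properties can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: TrieNode, retrieve and remove are hypothetical names, the leaf object itself stands in for the memory address used as the unique ID, and "removing" a node here only detaches it (a real implementation would also free its memory).

```python
from typing import Hashable, List, Optional


class TrieNode:
    def __init__(self, v: Hashable, parentN: Optional["TrieNode"] = None):
        self.v = v                  # data vertex
        self.parentN = parentN      # parent pointer, None for a root
        self.childCount = 0         # number of child nodes
        if parentN is not None:
            parentN.childCount += 1


def retrieve(leaf: TrieNode) -> List[Hashable]:
    """Follow parent pointers from a leaf and return the result in root-to-leaf order."""
    path = []
    node: Optional[TrieNode] = leaf
    while node is not None:
        path.append(node.v)
        node = node.parentN
    path.reverse()
    return path


def remove(leaf: TrieNode) -> int:
    """Remove the result identified by leaf; return how many nodes were removed."""
    removed = 0
    node: Optional[TrieNode] = leaf
    while node is not None:
        parent = node.parentN
        node.parentN = None         # detach the node from the trie
        removed += 1
        if parent is None:
            break                   # a root was removed
        parent.childCount -= 1
        if parent.childCount > 0:
            break                   # parent still has other children, so stop
        node = parent               # parent became childless, remove it too
    return removed
```

Because each node stores only a parent pointer and a counter, removal touches at most one node per level and never needs to scan sibling lists.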

5.2 Maintaining the Embedding Trie

Recall that in Algorithm 4, given an embedding of , the function is used to search for the ECs of within the neighbourhood of the mapped data vertex of , where . Moreover, the function handles the task of expanding the embedding trie by concatenating with each newly found EC of . If an EC is filtered out or if an embedding cannot be expanded to a final result, the function must remove it from . Now we present the details of the function in Algorithm 1.

When is mapped to the data vertex by an embedding of , Algorithm 1 uses a backtracking approach to find the ECs of within the neighbourhood of . The recursive procedure is given in the subroutine . In each round of the recursive call, tries to match to a candidate vertex and add , to , where is a query vertex in . When is expanded to an EC of , which means an EC of is concatenated to the original , we add it into by chaining up the corresponding embedding trie nodes. If cannot be expanded into an EC of , we will remove it from .

Input: an embedding of , local machine , unit , embedding trie
Output: expanded and an edge verification index I
2 for each  do
4       for each  do
5             if  resides in  then
8      if  then
9             remove from
10             return
11 next vertex in query vertex list
12 get corresponding to
13 (, )
Algorithm 1 expandEmbedTrie

Lines 1 to 9 of Algorithm 1 compute the candidate set for each as the intersection of the neighbor set of and the neighbor set of each , where is a cross-unit edge and is in . If any of the candidate sets is empty, it removes from . Otherwise it passes on the next query vertex and the ID of (which is a node in ) to the recursive subroutine .
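
The candidate-set computation described above amounts to repeated set intersection. The Python sketch below illustrates it under explicit assumptions: candidate_sets and all of its parameter names (adj, local, anchor, leaves, cross_edges, embedding) are hypothetical, adjacency is assumed to be available only for vertices stored on the local machine, and a return value of None stands for "remove the embedding from the trie".

```python
from typing import Dict, Hashable, Iterable, Optional, Sequence, Set

Vertex = Hashable


def candidate_sets(
    adj: Dict[Vertex, Set[Vertex]],            # neighbour sets of locally stored data vertices
    local: Set[Vertex],                        # data vertices residing on this machine
    anchor: Vertex,                            # data vertex whose neighbourhood is searched
    leaves: Sequence[Vertex],                  # query vertices of the unit still to be matched
    cross_edges: Dict[Vertex, Iterable[Vertex]],  # leaf -> already matched query vertices
    embedding: Dict[Vertex, Vertex],           # query vertex -> data vertex
) -> Optional[Dict[Vertex, Set[Vertex]]]:
    """Return one candidate set per leaf, or None if any candidate set is empty."""
    result: Dict[Vertex, Set[Vertex]] = {}
    for u in leaves:
        cands = set(adj[anchor])               # start from the anchor's neighbours
        for w in cross_edges.get(u, ()):
            x = embedding[w]
            if x in local:                     # only locally stored vertices can be checked here
                cands &= adj[x]
        if not cands:
            return None                        # the embedding cannot be expanded
        result[u] = cands
    return result
```

Mapped vertices that reside on other machines are skipped in this sketch, so the sets it returns are upper bounds on the true candidates.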

The subroutine is given in Algorithm 2. It plays the same role as the SubgraphSearch procedure in the backtracking framework [16]. In Line 1, creates a local variable with default value . The value indicates whether can be extended to an EC of . For the leaf vertex , first creates a copy of , and then refines the candidate vertex set by considering every sibling edge where has already been mapped by to . If resides in , is shrunk by an intersection with (Lines 2 to 5). Then, for each vertex in the refined set , it first initializes a flag with the value (Line 7); this value indicates whether can potentially be mapped to . Then, if resides in , it checks every verification edge where has been mapped to see whether exists; if any such edge does not exist, it sets to false (Lines 8 to 11), meaning cannot be mapped to . This part (Lines 7 to 11) resembles the IsJoinable function in the backtracking framework [16].

If is still true after the local verification, we add , to (Line 13). Then we create a new trie node for with as its parentN (Lines 14 and 15). After that, if grows to an EC of , then for each undetermined edge of (both end vertices are not in the local machine), we add to (Lines 17 and 18). We also set to true (Line 19). If is not an EC of , which means there are still leaf vertices of not matched, we get the next leaf vertex (Line 21) and launch a recursive call of by passing it and (Line 22). We record the return value of the deeper call as . If is true after all the recursive calls, which means there are ECs with mapped to in , we increase the childCount of the parent node and add the newly created to as a child of in (Lines 23 to 25). Then we backtrack by removing from , so that we can try to map to another candidate vertex in .

After trying all the candidate vertices of , we return the value of (Line 27).
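
The recursion described above can be summarized by the following simplified Python sketch. It is not Algorithm 2 itself: expand, ok, found and the other names are hypothetical, the caller-supplied predicate ok stands in for the local sibling-edge and verification-edge checks, and the bookkeeping of undetermined edges in the edge verification index is omitted. What it does show is the key invariant: a newly created trie node is counted as a child of its parent only if the deeper recursion produced at least one complete extension.

```python
from typing import Callable, Dict, Hashable, List, Optional, Sequence, Set

Vertex = Hashable


class TrieNode:
    def __init__(self, v: Vertex, parentN: Optional["TrieNode"] = None):
        self.v = v
        self.parentN = parentN
        self.childCount = 0


def expand(
    parent: TrieNode,
    leaves: Sequence[Vertex],
    depth: int,
    cands: Dict[Vertex, Set[Vertex]],
    ok: Callable[[Vertex, Vertex, Dict[Vertex, Vertex]], bool],
    mapping: Dict[Vertex, Vertex],
    found: List[TrieNode],
) -> bool:
    """Match leaves[depth:] below parent; return True if any full extension exists."""
    u = leaves[depth]
    any_found = False
    for v in sorted(cands[u]):                 # sorted only to make the order deterministic
        if not ok(u, v, mapping):              # local verification failed for this candidate
            continue
        mapping[u] = v
        node = TrieNode(v, parent)             # tentative node, not yet counted by parent
        if depth + 1 == len(leaves):
            found.append(node)                 # all leaf vertices matched
            success = True
        else:
            success = expand(node, leaves, depth + 1, cands, ok, mapping, found)
        if success:
            parent.childCount += 1             # attach the node only when it leads to a result
            any_found = True
        del mapping[u]                         # backtrack and try the next candidate
    return any_found
```

Nodes that never lead to a complete extension are never counted by their parent, so no separate clean-up pass is needed for failed branches.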

Input: Trie node representing embedding of , leaf vertex of
Output: expanded and an edge verification index
3 for each mapped in and  do
4       if  resides in  then
6 for each  do
8       if  resides in  then
9             for each and mapped in  do
10                   if  not exists then
12  <