## 1 Introduction

The top-k team formation problem is to find a list of highly collaborative teams of experts such that every team satisfies the skill requirements of a certain task.
Various approaches [25, 22, 7, 15, 34, 9] have been proposed,
which fall into two categories according to how they improve the collaborative compatibility of team members:
(a) minimizing team communication costs,
defined with *e.g.,* the diameter, minimum spanning tree and the sum of pairwise member distances of the induced subgraph [25, 22, 7, 9], and
(b) maximizing team communication relations,
*e.g.,* the density of the induced subgraph [15, 34].
Further, [15] and [34] consider a practical setting by introducing a lower bound on the number of individuals with a specific skill in a team, and an upper bound on the total number of team members, respectively.

Example 1:
Consider a recommendation network taken from [37] as depicted in Fig. 1,
in which (a) a node denotes a person labeled with her expertise, *e.g.,* project manager (), software architect (), software developer (), software tester (),
user interface designer () and business analyst (),
and (b) an edge indicates the collaboration relationship between two persons, *e.g.,* (, ) indicates worked well with within previous projects.

A headhunter helps to set up a team for a software product by searching for proper candidates from (ignore dashed edges). A desired team has (i) one , and one to two , , , and , such that (ii) should collaborate with , and well, and and should collaborate with each other well and both with well.

One may verify that existing methods [25, 22, 15, 34] can hardly find a desired team.
They only find teams satisfying the skill requirement [25, 22, 15] and the lower bound capacity requirement [15, 34] (condition (i)), and cannot guarantee the specific collaboration relationships among team members, *i.e.,* the structural constraints (condition (ii)).

A natural question is how to capture both the structural and capacity constraints in a unified model for team formation. We introduce a revision of graph pattern matching for team formation to fill this gap. Given a pattern graph and a data graph , graph pattern matching is to find all subgraphs in that match , and has been extensively studied [38, 19, 11, 29, 28, 14]. Essentially, we utilize patterns to capture the structural constraints, and revise the semantics of graph pattern matching for team formation. For instance, a desired team requirement can be specified by the pattern (ignore dashed edges) in Fig. 1, in which nodes represent the skill requirements, edges specify the topology constraints, and the bounds on nodes are the capacity constraints.

Another issue is that team formation takes place in a highly dynamic environment. It typically takes considerable effort to find ideal teams, and it is common for professionals to refine patterns (requirements) over multiple rounds [36, 18]. Further, real-life graphs are often big and constantly evolve over time [13]. We show this with an example.

Example 2: Consider and in Example 1 again.

(1) One may find that is too restrictive to find any sensible match in .
Hence, she needs to refine the pattern by updating with , *e.g.,* an edge
deletion .

(2) It is also common that a data update comes on , *e.g.,* an edge insertion .

(3) Finally, it can be the case when pattern update and data update come simultaneously on and .

This motivates us to study the dynamic top-k team formation problem, to handle continuous pattern and data updates, separately and simultaneously. It is known that incremental algorithms avoid re-computing from scratch by re-using previous results [32]. However, incremental algorithms of graph pattern matching for pattern updates have not been investigated, though there exist incremental algorithms of graph pattern matching for data updates [11, 13, 10]. Further, it is also challenging for incremental algorithms to handle simultaneous pattern and data updates in a unified way.

Contributions. To this end, we introduce a graph pattern matching approach for (dynamic) top-k team formation.

(1) We propose team simulation, a revision of traditional graph pattern matching, for top-k team formation (Section 2).
It extends existing methods by incorporating the structural and capacity constraints using pattern graphs.
To cope with the highly dynamic environment of team formation, we also formulate the dynamic top-k team formation problem (Section 2), for dealing with pattern and data updates, separately and simultaneously.

(2) We develop a batch algorithm for computing top-k teams via team simulation (Section 3).
We study the satisfiability problem for pattern graphs, a new problem raised by the presence of capacity bounds in graph pattern matching.
We also propose two optimization techniques, handling radius-varied balls and density-based filtering, to speed up the computation.

(3) We develop a unified approach to handling both pattern and data updates (Sections 4 and 5).
Due to the inherent difficulty of the problem, we propose an incremental strategy based on pattern fragmentation and affected balls, which localizes the effects of pattern and data updates, and we develop a unified incremental algorithm for dealing with separate and simultaneous pattern and data updates,
with an optimization technique with the early return property for incremental top-k algorithms, an analogy of the traditional early termination property.

(4) Using real-life data (Citation) and synthetic data (Synthetic), we demonstrate the effectiveness and efficiency of our graph pattern matching approach for (dynamic) team formation (Section 6).
We find that (a) our method is able to identify more sensible teams than existing team formation methods *w.r.t.* practical measurements,
and (b) our incremental algorithm outperforms our batch algorithm, even when changes reach 36% for pattern updates, 34% for data updates and (25%, 22%) for simultaneous pattern and data updates, and when 29% for continuous pattern updates, 26% for continuous data updates and (20%, 18%) for continuously simultaneous pattern and data updates, respectively.

To our knowledge, this work is among the first to study simultaneous pattern and data incremental computations. No previous work has studied pattern updates for incremental pattern matching [10, 13], not to mention continuous and simultaneous pattern and data updates. This is the most general dynamic setting for incremental computations.

All detailed proofs are available in the full version [4].

Related work. Previous work can be classified as follows.

Graph simulation [19] and its extensions have been introduced for graph pattern matching [11, 29, 28, 14], in which strong simulation introduces duality and locality into simulation [29], and shows a good balance between its computational complexity and its ability to preserve graph topology. Furthermore, [14] already adopts capacity bounds on the edges of pattern graphs via subgraph isomorphism, and [12] uses graph pattern matching to find single experts, instead of a team of experts. In this study, team simulation is proposed for team formation as an extension of graph simulation and strong simulation on undirected graphs with capacity constraints on the nodes of pattern graphs.

There has been a host of work on team formation that minimizes the communication cost of team members, based on the diameter, density, minimum spanning tree, Steiner tree, and sum of pairwise member distances, among others [25, 22, 7, 15, 34, 9, 27]; these are essentially a specialized class of keyword search on graphs [6]. Similar to [22], we aim to find top-k teams. However, [22] adopted Lawler's procedure [26], which is inappropriate for large graphs. We also adopt density as the communication cost, which shows better performance [15], and further require that all team members be close to each other (located in the same balls), along the same lines as [25, 22, 7, 9]. Beyond simply minimizing the communication cost among team members, [20, 22] consider minimizing the cost between team members and team leaders. Different from these works, we introduce structural constraints, in terms of graph pattern matching [11, 29], into team formation, while retaining the capacity bounds on specific team members as in [34, 15].

Incremental algorithms (see [32, 10] for a survey) have proven usefulness in a variety of applications, and have been studied for graph pattern matching [11, 13] and team formation [7] as well. However, [32, 10, 11, 13] only consider data updates, and [7] only considers continuously coming new tasks. In this work, we deal with both pattern and data updates for team formation, and support both insertions and deletions. To our knowledge, this is the first study on pattern updates, and is the most general and practical dynamic setting considered so far.

Query reformulation (*a.k.a.* query rewriting) is to generate alternative queries that may produce better answers,
and has been studied for structured queries [31], keyword queries [40] and graph queries [30].
However, different from our study of handling pattern updates, the focus of query reformulation is not on incremental computations.

## 2 Dynamic Team Formation

We first propose team simulation, a revision of traditional graph pattern matching. We then formally introduce the top-k team formation problem via team simulation. We finally present the dynamic top-k team formation problem.

### 2.1 Team Simulation

We first extend pattern graphs of traditional graph pattern matching to carry capacity requirements, and then define team simulation on undirected graphs.

We start with basic notations.

Data graphs. A data graph is a labeled undirected graph , , , where and are the sets of nodes and edges, respectively; and is a total labeling function that maps each node in to a set of labels.

Pattern graphs. A pattern graph (or simply pattern) is an undirected graph , , , , in which (1) and are the set of nodes and the set of edges, respectively; (2) is a total labeling function that maps each node in to a single label; and (3) is a total capacity function such that for each node , is a closed interval , where are non-negative integers.

Intuitively, specifies a range bound for node , indicating the required quantity for the matched nodes in data graphs. Note that for traditional patterns [16, 41, 11, 14], bounds are typically carried on edges, not on nodes. We also denote data and pattern graphs as , and , respectively. The size of (resp. ), denoted by (resp. ), is defined to be the total number of nodes and edges in (resp. ).

We now redefine graph simulation on undirected graphs, which is originally defined on directed graphs [19, 11]. Consider pattern graph , and data graph , .

Graph simulation. Data graph matches pattern graph via graph simulation, denoted by , if there exists a binary match relation in for such that

(1) for each , the label of matches one label in the label set of , *i.e.,* ; and

(2) for each node , there exists such that (a) , and (b) for each adjacent node of in , there exists an adjacent node of in such that .

For any that matches , there exists a unique maximum match relation via graph simulation [19].
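To make the semantics concrete, the maximum match relation can be computed by iterative refinement. The following is a minimal Python sketch under our own illustrative encoding (adjacency sets, a single label per pattern node, a label set per data node); it is not the paper's implementation.

```python
def graph_simulation(pattern, data):
    """Maximum match relation for graph simulation on undirected graphs.

    pattern: {'adj': node -> set of neighbors, 'label': node -> label}
    data:    {'adj': node -> set of neighbors, 'labels': node -> set of labels}
    Returns {pattern node: set of matching data nodes}; the data graph
    matches the pattern iff every returned set is non-empty.
    """
    # Initialize: v is a candidate for u if u's label is among v's labels.
    sim = {u: {v for v in data['adj']
               if pattern['label'][u] in data['labels'][v]}
           for u in pattern['adj']}
    changed = True
    while changed:
        changed = False
        for u in pattern['adj']:
            for v in list(sim[u]):
                # v survives only if, for every neighbor u2 of u in the
                # pattern, some neighbor of v can still match u2.
                if any(not (data['adj'][v] & sim[u2])
                       for u2 in pattern['adj'][u]):
                    sim[u].remove(v)
                    changed = True
    return sim
```

Since candidate sets only shrink, the refinement terminates and yields the unique maximum relation.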

We then introduce the notions of balls and match graphs.

Balls. For a node in data graph and a non-negative integer , the ball with center and radius is a subgraph of , denoted by , such that (1) all nodes are in , if the number of hops between and , , is no more than , and (2) it has exactly the edges appearing in over the same node set.
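As a concrete illustration (a Python sketch under our own naming, not the paper's code), a ball can be built by a hop-bounded BFS followed by taking exactly the induced edges:

```python
from collections import deque

def ball(data_adj, center, r):
    """Ball with the given center and radius r: all nodes within r hops
    of center, with exactly the data graph's edges over that node set.
    data_adj: node -> set of neighbors."""
    dist = {center: 0}
    queue = deque([center])
    while queue:
        v = queue.popleft()
        if dist[v] == r:          # do not expand beyond radius r
            continue
        for w in data_adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    nodes = set(dist)
    # induced subgraph: keep only edges with both endpoints in the ball
    return {v: data_adj[v] & nodes for v in nodes}
```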

Match graphs. The match graph *w.r.t.* a binary relation is a subgraph of data graph , in which
(1) a node if and only if it is in , and
(2) it has exactly the edges
appearing in over the same node set.

Intuitively, the match graph *w.r.t.* is the induced subgraph of
such that its nodes play a role in .

We are now ready to define team simulation, by extending graph simulation to incorporate the locality constraints enforced by balls, and the capacity bounds carried by patterns.

Team simulation. Data graph matches pattern via
team simulation *w.r.t.* a radius , denoted by , if
there exists a ball (, ) in , such that

(1) , with the maximum match relation and the match graph *w.r.t.* ; and

(2) for each node in , the number of nodes in with falls into .

We refer to as a perfect subgraph of *w.r.t.* .

Intuitively, (1) pattern graphs capture the structural and capacity constraints, and (2) a perfect subgraph of pattern corresponds to a desired team, which is required to satisfy the following conditions: (a) itself is located in a ball where as a match graph; and (b) satisfies the capacity constraints carried over pattern .
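Condition (b), the capacity check, is a simple counting test over the maximum match relation; a minimal sketch (illustrative names, not the paper's code):

```python
def capacity_ok(sim, cap):
    """Team-simulation capacity check: for every pattern node u, the
    number of data nodes matched to u must fall into the closed
    interval cap[u] = (lo, hi).

    sim: pattern node -> set of matched data nodes
    cap: pattern node -> (lo, hi) with non-negative integers lo <= hi
    """
    return all(lo <= len(sim.get(u, ())) <= hi
               for u, (lo, hi) in cap.items())
```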

Example 3: Consider pattern and data graph in Fig. 1, and team simulation with is adopted.

One can easily verify that matches via team simulation, *i.e.,* ,
as (a) there is a perfect subgraph in in ball
, *i.e.,* the connected component of containing , which maps , , , , and in to , , {, }, {, }, {, } and {, }, respectively, and, moreover, (b) the capacity bounds on all pattern nodes are satisfied.

Remarks. (1) Team simulation differs from graph simulation [19] and strong simulation [29] in the existence of capacity bounds on pattern graphs and its ability to capture matches on undirected graphs.

(2) Different from strong simulation, which fixes the radius of balls (*i.e.,* to the diameter of the pattern), team simulation adopts a more natural setting in which the radius of balls is auto-adjustable, with only a user-specified upper bound.

### 2.2 Top-k Team Formation

Given pattern , data graph , and two positive integers and , the top-k team formation problem, denoted as kTF,
is to find a list of perfect subgraphs (*i.e.,* teams) with the top-k largest densities in for , via team simulation.

Here the density of graph is , where and are the number of edges and the number of nodes, respectively, as commonly used in data mining applications [17, 39].
Intuitively, the larger is, the more collaborative a team is.
In this way, not only are the two objective functions of existing team formation methods preserved,
*i.e.,* the locality retained by balls and the density function for selecting top-k results,
but the relationships among members and the capacity constraints on patterns are also captured.
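Concretely, with the graph stored as adjacency sets, the density |E|/|V| can be computed directly (a sketch with illustrative names):

```python
def density(adj):
    """Density |E| / |V| of an undirected graph in adjacency-set form.
    Each undirected edge appears twice across the adjacency sets,
    hence the division by 2."""
    m = sum(len(nbrs) for nbrs in adj.values()) // 2
    return m / len(adj)
```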

Example 4: Consider in Fig. 1 and . We simply set , as most existing solutions for kTF only compute the best team [25, 7, 15, 34, 9].

One may want to look for candidate teams with existing methods, satisfying the search requirement in Example 1: (1) by minimizing the team diameter [25], which returns the team with , , , , , ,

(2) by minimizing the sum of all-pair distances of teams [22], which returns exactly the same team as (1) in this case, or

(3) by maximizing the team density [15], which returns the team with all the nodes in the two connected components in with and , except , , , .

One may already notice that these teams only satisfy the skill requirement, *i.e.,* condition (i) in Example 1, and cannot guarantee the specific collaboration relationships among team members.
Indeed, the team found in (1) and (2) is connected by only, and the team found in (3) has loose collaborations among its members.
That is, existing methods are not appropriate for identifying the desired teams.

When team simulation is adopted, it returns the perfect subgraph in Example 3 with density = 1.4, satisfying both conditions (i) and (ii), which is much better than the teams found by the existing methods above.

### 2.3 Dynamic Top-k Team Formation

We now introduce dynamic top-k team formation.

Pattern updates (). There are five types of pattern updates: (1) edge insertions connecting nodes in , (2) edge deletions disconnecting nodes in , (3) node insertions attaching new nodes to , (4) node deletions removing nodes from , and (5) capacity changes adjusting the node capacities in , while remains connected in all cases.

Data updates (). There are four types of data updates, defined along the same lines as the first four types of pattern updates. Further, different from pattern updates, there is no need to keep connected for data updates.

Dynamic top-k team formation. Given pattern , data graph , positive integers and , the list of top-k perfect subgraphs for in , a set of pattern updates and a set of data updates , the dynamic top-k team formation problem, denoted by kDTF, is to find a list of perfect subgraphs with the top-k largest densities for in , via team simulation.

Here denotes applying changes to and to , and
and denote the updated pattern and data graphs.
It is worth mentioning that kDTF covers a broad range of dynamic situations,
*i.e.,* handling continuously separate and simultaneous pattern and data updates.

## 3 Finding Top-k Teams

In this section, we develop a batch algorithm for top-k team formation. We first study the pattern satisfiability problem for team simulation, then introduce two optimization techniques, and finally present our batch algorithm.

### 3.1 Pattern Satisfiability

Different from graph simulation [19] and its extensions [11, 29], there exist patterns that cannot match any data graph via team simulation, due to the presence of capacity constraints on patterns. We illustrate this with an example.

Example 5: (1) For pattern in Fig. 2, one can verify that there exist no data graphs such that because (a) for any nodes in , if matches with the node labeled with , then it must match with the node labeled with , and, hence, (b) the capacity upper bound on should not be less than the lower bound on .

We say that a pattern is satisfiable iff there exists a data graph such that matches via team simulation, *i.e.,* .
The good news is that checking the satisfiability of pattern graphs can be done in low polynomial time.

Proposition 1: The satisfiability of patterns can be checked in time.

By treating as both data and pattern graphs, compute the maximum match relation in for , via graph simulation. Then pattern is satisfiable iff for each with the capacity bounds on and on , respectively, holds. Observe that the size of is bounded by , and pattern graphs are typically small.
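The check above can be sketched as follows (Python, our own illustrative encoding; the direction of the capacity inequality, upper bound of the simulating node versus lower bound of the simulated node, follows our reading of Example 5):

```python
def satisfiable(adj, label, bounds):
    """Sketch of the satisfiability check (Proposition 1).

    Treat the pattern as both pattern and data graph, compute its maximum
    self-simulation, then verify that whenever u can be simulated by v,
    the capacity upper bound of v is no less than the lower bound of u.
    adj: node -> set of neighbors; label: node -> label;
    bounds: node -> (lo, hi).
    """
    # Maximum simulation of the pattern by itself (iterative refinement).
    sim = {u: {v for v in adj if label[v] == label[u]} for u in adj}
    changed = True
    while changed:
        changed = False
        for u in adj:
            for v in list(sim[u]):
                if any(not (adj[v] & sim[u2]) for u2 in adj[u]):
                    sim[u].remove(v)
                    changed = True
    return all(bounds[v][1] >= bounds[u][0]
               for u in adj for v in sim[u])
```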

By Proposition 1, we shall consider only satisfiable pattern graphs in the sequel.

### 3.2 Batch Algorithm

We first introduce two techniques for optimizing the computation of team simulation.

Handling radius-varied balls. kTF is to find top-k teams within balls , where and .
However, it is very costly to construct all these balls and to compute perfect subgraphs in all of them.
Indeed, it is not necessary: we only need to construct and compute the matches for a number of balls, *i.e.,* the set of balls
where and radius is , and then incrementally compute the perfect subgraphs for the inner balls () from the match graphs for ball , as shown below.

Theorem 2:
Given , ball and ( ) in ,
(1) if , then ; and
(2) if (resp. ) is the match graph *w.r.t.* the maximum match relation (resp. ) in (resp. ) for via graph simulation, then , and is a subgraph of .

When we have the match graph in for via graph simulation, to compute the perfect subgraph in () for via team simulation, we need to (1) first identify the subgraph of belonging to , which can easily be done during the construction of without extra computation; (2) check whether is already a match graph for in via graph simulation; if not, remove unmatched nodes and edges from until the match graph for in is found, which can be achieved by an efficient incremental process in [13]; and (3) finally check whether the capacity bounds are satisfied. If so, is the perfect subgraph in for via team simulation.

Density-based ball filtering. We further reduce the number of balls, to speed up the process, by adopting a density-based filtering technique. The key idea is to tell whether a ball can possibly produce one of the final top-k matches.

Given a ball , we compute the density upper bound ,
where is a subgraph of .
If the bound is larger than the density of the current k-th result, the ball may contribute to the final answer and is processed;
otherwise, the ball is simply ignored to avoid redundant computations.

The tricky part is how to efficiently compute the upper bound of for each ball in . As the best densest-subgraph algorithms run in time [17], which is costly, we utilize an important result from [39], shown below.

Lemma 3: Let and be the density of the maximum core and the densest subgraph of graph . Then (1) ; and (2) there exists an algorithm that computes in time [39].

Here the maximum core of a graph is a subgraph of whose node degrees are all at least , where is the maximum possible such value. By Lemma 3, we use as the density upper bound for filtering unnecessary balls.
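The maximum core and its density can be obtained by the standard peeling procedure; the following Python sketch (our own naming, with a simple O(n²)-selection loop rather than the bucketed linear-time variant of [39]) illustrates the bound used for filtering:

```python
def max_core_density(adj):
    """Density |E|/|V| of the maximum core of an undirected graph,
    used as the upper bound for density-based ball filtering.

    Peeling: repeatedly remove a minimum-degree node; the running
    maximum of removal degrees gives core numbers, and the nodes with
    the largest core number form the maximum core."""
    deg = {v: len(adj[v]) for v in adj}
    alive = set(adj)
    core = {}
    k = 0
    while alive:
        v = min(alive, key=deg.get)   # simple O(n) selection per step
        k = max(k, deg[v])
        core[v] = k
        alive.remove(v)
        for w in adj[v]:
            if w in alive:
                deg[w] -= 1
    kmax = max(core.values())
    nodes = {v for v in core if core[v] == kmax}
    m = sum(len(adj[v] & nodes) for v in nodes) // 2
    return m / len(nodes)
```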

We are now ready to present our batch algorithm for kTF.

Algorithm . As shown in Fig. 3, it takes as input , , and two integers and , and outputs the top-k densest perfect subgraphs in for . It first checks whether is satisfiable (line 1). If so, for each ball in , it computes the maximum core of , and checks whether the density-based ball filtering condition holds (lines 3-6). If so, it skips the current ball and moves to the next one; otherwise, it computes the perfect subgraph of in via team simulation by invoking (line 7, see the full version [4]), an adaptation of graph simulation [19, 11], and checking capacity bounds (line 8). It then computes perfect subgraphs of in inner balls by invoking , an extension of the data incremental algorithms in [13], and checking capacity bounds (lines 9-11).

Correctness & complexity analyses. The correctness of is assured by the following.

(1) The correctness of (resp. ) can be verified along the same lines as graph simulation [19] (resp. incremental simulation [13]).

(2) Theorem 2 and Lemma 3.
It takes to check pattern satisfiability, to compute team simulation, to incrementally compute matches in inner balls, and to compute the density of the maximum core for balls.
Thus is in . However, the actual running time is much less, due to density-based ball filtering and the fact that is the worst-case complexity of the incremental process, while is small, *i.e.,* 2 or 3.

## 4 A Unified Incremental Solution

In this section, we first analyze the challenges and design principles of dynamic top-k team formation, and then develop a unified incremental framework for kDTF. For convenience, the notations used are summarized in Table 1.

### 4.1 Analyses of Dynamic Team Formation

By Theorem 2, pattern matches a ball () only if matches ball via graph simulation, and the match results for can be derived from the matches for . Therefore, the key to incremental computation is to deal with the balls of radius . In the sequel, a ball has radius by default.

We first analyze the inherent computational complexity of dynamic top-k team formation.

Incremental complexity analysis. As observed in [33, 32], the complexity of incremental algorithms should be measured by the size of the changes in the input and output, rather than the entire input, to measure the amount of work essentially to be performed for the problem.

An incremental problem is said to be bounded if it can be solved by an algorithm whose complexity is a function of alone, and unbounded otherwise. Unsurprisingly, the dynamic top-k team formation problem is unbounded, similar to other extensions of graph simulation [11, 13].

Proposition 4: The kDTF problem is unbounded, even for = 1 and unit pattern or data updates.

We then illustrate the impact of pattern and data updates on the matching results with an example.

Example 6: We continue Example 1 with and .

(1) For , already matches , and may produce more matched nodes for ,
thus a re-computation for perfect subgraphs is needed.
For all other balls, may turn unmatched nodes to matched and may produce perfect subgraphs,
thus re-computation is also needed.

(2) For , it produces a new perfect subgraph for in ,
*i.e.,* the connected component having .

Notations | Description
---|---
 | pattern and data graphs
 | a ball in with center node and radius
 | the list of top-k perfect subgraphs in for
 | pattern and data updates
 | applying updates and to and
 | pattern fragmentation: fragments and cut
 | affected balls
 | the maximum match relation in for
 | fragment-ball matches (auxiliary structure)
, | fragment status, ball status (auxiliary structure)
 | fragment-ball-match index, containing
, | ball filter, update planner (auxiliary structure)

We finally discuss the challenges and principles of designing incremental algorithms for kDTF from three aspects.

(1) Impacts of pattern and data updates. Beyond Proposition 4 and Example 6, one can also verify that (a) a unit pattern update may change the previous results entirely, so that all balls need to be accessed and all matches re-computed, and (b) the impact of data updates can also be global, so that the entire data graph may need to be accessed to re-compute matches. Hence, the key is to identify and localize the impacts of pattern and data updates.

(2) Maintenance of auxiliary information.
Auxiliary data on intermediate or final results for in are typically maintained for incremental computation [33, 13].
How to design light-weight and effective auxiliary structures is critical.
One may want to store , the match relations of for all balls in ,
as adopted by existing incremental pattern matching algorithms for data updates [13].
However, the impact of is global, as shown in Example 6.
If we store , then for pattern edge/node deletions we have to recompute matches for all balls, *i.e.,* the entire .
Thus, storing could be useless, not to mention , the list of top-k perfect subgraphs for in *w.r.t.* .

(3) Support of continuous pattern and data updates. A practical solution should support continuous pattern and data updates, separately and simultaneously, which further increases difficulties on the design of auxiliary data structures and incremental algorithms.

### 4.2 A Unified Incremental Framework

Nevertheless, we develop an incremental approach to handling pattern and data updates in a unified framework, by utilizing pattern fragmentation and affected balls to localize the impacts of pattern and data updates, and to reduce the cost of maintaining auxiliary structures and computations.

(I) Localization with pattern fragmentation. We say that {, , , } is an -fragmentation of pattern , , denoted as , if (1) , (2) for any , (3) is exactly the edges in on , and (4) .

We refer to () as a fragment of , and to as the cut of , respectively.

Observe that, with pattern fragmentation, a pattern update on is either on a fragment or on the cut of ; in this way, the impact of pattern updates is localized. Moreover, graph simulation enjoys a nice property under pattern fragmentation, as shown below.

Theorem 5: Let be an -fragmentation of pattern . For any ball in , let () be the maximum match relation in for via graph simulation, and be the maximum match relation in for via graph simulation, respectively, then .

We also say that is a partial match relation in ball for via graph simulation. By the nature of graph simulation [19], is actually an intermediate result of . Once we have the maximum match relation for in , via graph simulation, we can further produce the result for in via team simulation, by a capacity check.

That is, based on pattern fragmentation, we maintain
an auxiliary structure for storing fragment-ball matches for incremental computations,
*i.e.,* *w.r.t.* that is the maximum match relations for all pattern fragments of in all balls of , via graph simulation.
Moreover, its space cost is light-weight, as will be shown in the experimental study.

By storing , we have for each ball , and we can simply update while leaving the other parts untouched. That is, we indeed compute for instead of , and combine all to derive . Even better, updates on the cut of only involve a simple combination process, avoiding the computation for any pattern fragment.

For a better incremental process, we typically want (1) to avoid skewed updates by balancing the sizes of all fragments, and (2) to minimize the effort to assemble the partial matches of all fragments. Thus we define and investigate the pattern fragmentation problem. Given pattern and a positive integer , it is to find an -fragmentation of such that both () and are minimized. Intuitively, this bi-criteria optimization problem partitions a pattern into components of roughly equal size while minimizing the cut size.

The problem is intractable, as shown below.

Proposition 6: The pattern fragmentation problem is NP-complete, even for = 2.
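Despite the intractability, small patterns admit simple heuristics. The following Python sketch (purely illustrative, our own naming; the heuristic actually used in the paper instead reduces to k-balanced partitioning) grows fragments of roughly equal size by BFS and takes the crossing edges as the cut; on adversarial shapes it may emit more than f fragments.

```python
from collections import deque

def fragment(adj, f):
    """Greedy sketch of pattern fragmentation: grow f fragments of about
    ceil(|V|/f) nodes each by BFS from unassigned seeds; the cut is the
    set of edges whose endpoints land in different fragments."""
    target = -(-len(adj) // f)          # ceil(|V| / f)
    assigned = {}
    parts = []
    for seed in adj:
        if seed in assigned:
            continue
        part, queue = [], deque([seed])
        while queue and len(part) < target:
            v = queue.popleft()
            if v in assigned:
                continue
            assigned[v] = len(parts)
            part.append(v)
            queue.extend(w for w in adj[v] if w not in assigned)
        if part:
            parts.append(part)
    cut = {frozenset((u, v)) for u in adj for v in adj[u]
           if assigned[u] != assigned[v]}
    return parts, cut
```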

However, and are typically small in practice [11], *e.g.,* and . In light of this, we give a heuristic algorithm, denoted by , for the problem, shown in the full version [4]. It works by connecting pattern fragmentation to the widely studied -Balanced Partition problem [8], which is not approximable in general, but has efficient and sophisticated heuristic algorithms [23].

(II) Localization with affected balls (). We further localize the impact of pattern and data updates with affected balls to avoid unnecessary computations.

We say that a ball in is affected *w.r.t.* an incremental algorithm and pattern and data updates, if accesses the ball again.
We use and to denote the cardinality and total size of , respectively.

Indeed, are those balls that can possibly contain final results *w.r.t.* and .
We only access , and ignore the remaining balls.
Specifically, (1) for , this allows us to avoid computing updated partial relations for an updated fragment in every ball;
and (2) for , the locality property of team simulation allows us to localize the update impacts to the set of balls whose structures are changed by .

(III) Algorithm framework. We now provide a unified incremental algorithm to handle both pattern and data updates, based on pattern fragmentation and affected balls.

Given pattern with its -fragmentation , data graph , two integers and , and auxiliary structures (to be introduced in Section 5) such as the partial match relations for all pattern fragments and all balls (radius ), algorithm consists of three steps for and , as follows.

(1) Identifying . Algorithm invokes two different procedures to identify for separate or , respectively. For simultaneous and , takes the union of the produced by the two procedures.

(2) Updating partial match relations in . For a ball affected by , updates the partial match relations for the updated pattern fragments with incremental computation; for a ball affected by , it updates the partial match relations for all pattern fragments; and for a ball affected by both and , it proceeds in the same way as for only. Meanwhile, the auxiliary structure (to be seen shortly) is updated for handling continuously separate and simultaneous pattern and data updates.

(3) Combining partial match relations. combines all partial relations for a subset of , and computes the top-k perfect subgraphs within them and their inner balls.

Observe that handles pattern and data updates, separately and simultaneously, in a unified way.

## 5 Incremental Algorithms

In this section, we introduce the details of our incremental algorithm , including (a) auxiliary data structures, (b) algorithms and to handle pattern and data updates, respectively, and (c) by integrating and together.

### 5.1 Auxiliary Data Structures

Auxiliary structures fall into two classes: those that maintain partial matches, and those that support incremental computation under pattern updates. Consider an -fragmentation = {} of pattern , data graph , and pattern updates .

(I) Data structures in the first class are as follows.

(1) Fragment status () consists of boolean vectors , referred to as type codes (tc), where is either 0 or 1. Recall that is very small, *e.g.,* 3.

We use to classify the match status of balls in into types *w.r.t.* .
For a ball with type code , is 1 iff matches the ball via graph simulation.
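The classification by type codes can be sketched as below; the bit-vector layout and the names `type_code` and `classify` are illustrative assumptions, with each ball's match status given as a map from fragment index to its (possibly empty) match relation.

```python
from collections import defaultdict

def type_code(fragment_matches, m):
    """Hypothetical type code: bit i is 1 iff fragment i matches the ball
    via graph simulation (i.e., its match relation is non-empty)."""
    return tuple(1 if fragment_matches.get(i) else 0 for i in range(m))

def classify(balls, m):
    """Group balls into at most 2^m classes by their type codes."""
    groups = defaultdict(list)
    for name, matches in balls.items():
        groups[type_code(matches, m)].append(name)
    return dict(groups)

# m = 2 fragments here; recall m is very small, e.g., 3.
balls = {"b1": {0: {"v1"}, 1: {"v2"}},   # matches both fragments
         "b2": {0: {"v3"}},              # matches fragment 0 only
         "b3": {}}                       # matches neither
groups = classify(balls, m=2)
```

Since m is small, the number of classes 2^m stays tiny regardless of how many balls the data graph contains.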

(2) Ball status () consists of triples , such that is the of a ball, is the id of the latest processed unit pattern update for the ball (initially set to ), and is the density upper bound of subgraphs in the ball.

We use to store the basic information for balls in .

(3) Fragment-ball matches of in , denoted as , are , such that is the maximum match relation for in ball via graph simulation, and there are balls in total.

Here is used to store match relations for the pattern fragments of in all balls of . Instead of storing a single , we organize in terms of the match status between pattern fragments and balls, *i.e.,* and .

(4) Fragment-ball-match index () links and together. Then is linked to . The details are as follows.

For each record of ball in , (a) there is a link from its type code in pointing to the record; and
(b) there is another link from the record to a set of in ,
if the type code with which the ball is associated has , *i.e.,* is not empty.

Intuitively, indexes the partial match relations based on the match status of balls *w.r.t.* .
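A compact way to picture the linkage is sketched below, under the assumption that the index maps type codes to ball records, and each ball record links to the non-empty match relations of the fragments whose tc bit is 1. All field names (`uid`, `ub`) and identifiers are illustrative, not the paper's.

```python
# Type code -> ids of balls in that class (fragment status side of the index).
FS = {(1, 1): ["b1"], (1, 0): ["b2"]}

# Ball status: id of the latest processed unit pattern update (`uid`,
# initially -1 here) and a density upper bound (`ub`) per ball.
BS = {"b1": {"uid": -1, "ub": 9},
      "b2": {"uid": -1, "ub": 4}}

# (ball, fragment) -> maximum match relation for that fragment in that ball.
FBM = {("b1", 0): {"u1": {"v1"}}, ("b1", 1): {"u2": {"v2"}},
       ("b2", 0): {"u1": {"v3"}}}

def relations_of(ball, tc):
    """Follow the index links: only fragments with tc bit 1 have
    non-empty stored relations to retrieve."""
    return {i: FBM[(ball, i)] for i, bit in enumerate(tc) if bit == 1}
```

Looking up a ball through its type code thus touches only the relations that actually exist, skipping fragments the ball does not match.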

Example 7: Consider and (both without dashed edges) in Fig. 1, , , , and auxiliary structures and shown in Fig. 4(a).

(1) Pattern is divided into fragments and by algorithm , so there are type codes in .

(2) For balls linked with , *e.g.,* ball ,
there are matches to both and in the ball.
Besides, there exist balls , and
linked with , and respectively.
For simplicity, we consider only these four balls in the following analysis.

These structures enforce a nice property as follows.

Theorem 7:
With and *w.r.t.* an -fragmentation of , given and ,
the incremental algorithm processes and in time
determined by , and , not directly depending on .

We shall prove Theorem 7 by providing specific techniques for and analyzing its time complexity.

(II) Data structures in the second class are as follows.

(1) Ball filter () consists of boolean vectors , , , referred to as filtering codes (fc), such that each in corresponds to a type code in . Each in an of is initially set to , and is updated for each unit pattern update in : (a) when is an edge deletion or a node deletion to , the -th bit of all the filtering codes in is set to ; otherwise, (b) remains intact.

(2) Update planner () consists of stacks , , , . Stack (resp. ) records all unit updates in all arrived pattern updates , , that are applied to fragment (resp. ) of . Initially, all of them are empty, and are dynamically updated for each unit update in each coming set of pattern updates.
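The update planner can be sketched as one stack per fragment, with unit updates tagged by globally increasing ids so that a stale ball can later replay only the suffix it has not yet processed. The class and method names below are hypothetical.

```python
class UpdatePlanner:
    """Hypothetical sketch of the update planner (UP): stack i records the
    unit pattern updates applied to fragment i, tagged with global ids."""

    def __init__(self, num_fragments):
        self.stacks = [[] for _ in range(num_fragments)]
        self.next_id = 0

    def record(self, fragment, update):
        """Push a unit update onto its fragment's stack with a fresh id."""
        self.stacks[fragment].append((self.next_id, update))
        self.next_id += 1

    def pending_since(self, fragment, uid):
        """Unit updates on `fragment` with id strictly larger than `uid`."""
        return [u for (i, u) in self.stacks[fragment] if i > uid]

up = UpdatePlanner(num_fragments=2)
up.record(0, "delete edge (PM, SA)")    # illustrative unit updates
up.record(1, "insert edge (SD, ST)")
up.record(0, "raise capacity of SD")
```

A ball that last processed update id 0 can then fetch exactly the fragment-0 updates it missed, which is what the lazy update policy in Section 5.2 relies on.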

### 5.2 Dealing with Pattern Updates

We present algorithm to handle pattern updates , following the steps in Section 4.2, and an early return optimization technique for .

(I) Identifying affected balls.
We first develop procedure to identify with structures and .

Procedure .
Given an -fragmentation of , ,
(1) it updates by processing all unit updates in .
(2) For each , it then executes a bitwise operation (&) between type code of in and updated filtering code in , *i.e.,* .
(3) Finally, if , refers to in to mark the balls with type code as , and resets to .
The condition holds as long as
(a) the -th () bit of is 0, *i.e.,* there exists an edge/node deletion on , which may produce more matched nodes, or
(b) the -th () bit of and are both 1, *i.e.,* balls with already match with , though there are no edge/node deletions on .
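Reading conditions (a) and (b) bit by bit, the elided test appears to be `tc & fc == fc`: bit of the conjunction agrees with fc whenever the fc bit is 0 (a deletion touched that fragment) or the tc bit is 1 (the ball already matches that fragment). The sketch below is a reconstruction under that assumption, including the assumption that filtering codes start as all ones and deletions reset bits to 0.

```python
def delete_on_fragments(fc, deleted_fragments):
    """A node/edge deletion on fragment i resets the i-th filter bit to 0
    (an assumption consistent with condition (a) of the procedure)."""
    for i in deleted_fragments:
        fc &= ~(1 << i)
    return fc

def is_affected(tc, fc):
    # Reconstructed test (assumption): tc & fc == fc, i.e., every bit i with
    # fc_i = 1 must have tc_i = 1; bits with fc_i = 0 (deletions) always pass.
    return (tc & fc) == fc

fc = 0b11                              # assumed initial all-ones filtering code
fc = delete_on_fragments(fc, [1])      # an edge deletion on the second fragment
```

With fc updated to 0b01, balls with type codes 0b11 and 0b01 pass the filter (they already match the first fragment), while 0b10 does not, mirroring the example that follows.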

(1) comes with a unit edge deletion = on , and all four in are updated from (1, 1) to (1, 0), as shown in the second column of in Fig. 4(b).
identifies ball with and with as ,
since and .
Then resets the two corresponding filtering codes in to .

(2) Consider another case when comes with and , where is the same as above and . is updated as shown in the second column of in Fig. 4(c), and identifies the same as above.

The correctness of is ensured by the following.

Proposition 8: For any ball in , if there exists a perfect subgraph of in , then must be an affected ball produced by procedure .

Lazy update policy. To reduce computation, only updates the partial relations for in for computing .
However, the partial relations in the filtered balls inevitably become outdated *w.r.t.* , and still need to be refreshed to handle future updates .
Hence, needs a careful policy to maintain those match relations in the filtered balls.

To do this, algorithm maintains the status of all unit updates applied to so far, and processes unit updates in as late as possible, while having no effects on future updates , *i.e.,* a lazy update policy.

Algorithm utilizes auxiliary structure together with the item in .
When handling current , for each ball , records the id of the latest processed unit pattern update for , and is initialized to .
When future comes, for any *w.r.t.* and any fragment ,
computes based on by procedure (to be seen shortly),
where consists of the unit updates stored in whose ids are larger than in .
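The replay step of the lazy policy can be sketched as follows: each ball remembers the id of the latest unit pattern update it has processed, and when next identified as affected, it replays only the suffix of planned updates with larger ids. The names `replay_pending` and `uid` are illustrative assumptions.

```python
def replay_pending(ball, planned):
    """Return the unit updates with id greater than the ball's uid, then
    advance uid to the latest recorded id (a sketch of the lazy policy)."""
    pending = [u for (i, u) in planned if i > ball["uid"]]
    if planned:
        ball["uid"] = planned[-1][0]
    return pending

planned = [(0, "e-"), (1, "e+"), (2, "v-")]   # (id, unit update), as in UP
ball = {"id": "b2", "uid": 0}                 # b2 already processed update 0
pending = replay_pending(ball, planned)       # only updates 1 and 2 remain
```

This way, filtered balls pay nothing at filtering time, and each ball pays for a missed update at most once, when it is next touched.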

(a) is updated *w.r.t.* .
updates the partial relations for *w.r.t.* in the two balls,
and sets their in to , as shown in Fig. 4.

(b) Afterwards, with an edge insertion comes.
updates and as shown in Fig. 4 and identifies balls with as , *e.g.,* .

(c) Finally, with a node deletion = comes.
identifies , and as .
Take ball for example, which is identified as an for the first time.
By referring to , updates the partial relations for *w.r.t.* ,
and for *w.r.t.* .

(2) In the case when contains multiple updates, and are updated accordingly as shown in Fig. 4(c).

(II) Updating Fragment-Ball matches. We then update the partial match relations for in *w.r.t.* ,
by procedure .

Procedure . Given an -fragmentation of , , , , and *w.r.t.* , updates to in for each fragment and each .
Recall that consists of unprocessed unit updates accumulated in applied to .
We show how to update in different cases.

(1) There exist edge/node deletions in . In this case, accesses the in . It simply computes the maximum match relations for in by procedure in time.

(2) No edge/node deletions in . processes updates of the same type together in this case as follows.

(i) Capacity changes in or updates on .
In this case, no computation is needed for maintaining partial relations for at all,
*i.e.,*