As geo-social networks become popular, finding geo-social groups has drawn great attention in recent years. In general, the geo-social group search problem [liu2012circle, armenatzoglou2013general, zhu2017geo, 7202872] aims to find a group that is socially cohesive while spatially closest to a location, i.e., in most works the found group satisfies a single social constraint while optimizing a distance objective function. This is different from most social-aware spatial search works [wu2014social, li2014spatial, ahuja2015geo, Ghosh:2018:FSS:3282495.3302535], which combine various objectives into an aggregate objective function and find the result that is optimum w.r.t. the aggregate function. One of the most motivating applications for geo-social group search is the instant organisation of impromptu activities, owing to two nice properties of geo-social groups. Firstly, the social cohesiveness of a geo-social group ensures the members are socially close within the group, which is key to ensuring a good communication atmosphere for the activity. Secondly, subject to social cohesiveness, a geo-social group is the one closest to the location of the activity, which potentially reduces the waiting time for the activity. However, since most existing geo-social group studies only focus on a social constraint while optimizing spatial closeness, they become less useful when an activity has more demands, e.g., demanding attendees with certain skills or a minimum number of attendees. Let us consider one of the application scenarios below.
Online open-world game data: finding participants for a real-time quest. In online open-world game data, each player is associated with a friend list, an attribute describing the role of the player, and location information showing his/her location in the virtual world. Suppose a real-time quest is requested at a random location with a given duration. The quest has a set of suggested roles and suggests that each role shall have no less than a given number of players for accomplishing the quest. The gaming system would like to form a group of participants who are adequate to carry out the quest. Who shall be the players in the group?
To effectively find the desired geo-social group for the above scenario, extra factors must be considered in addition to social and spatial closeness, i.e., the minimum number of players for each suggested role. If there are more demands, the effort to coordinate them increases substantially. As such, it is imperative to devise efficient novel techniques to alleviate the effort of planning or organising activities with multiple demands. A specific motivating example is shown below.
Figure 1 illustrates the gaming data, which consist of graph data in Figure 1(a) and spatial data in Figure 1(b), when a real-time quest is happening at the labelled location. The graph data contain friendships between players and the current role of each player in terms of a keyword. The spatial data contain the current location of each user. Let the quest have a set of suggested roles and a suggested minimum number of players, say, two, for each suggested role. Below, we show the desired result and the results found by the most related models.
The desired group for accomplishing the quest is the subgraph enclosed by the dashed rectangle in Figure 1(a). The players within the group have strong relationships while preserving spatial proximity to the query location. Simultaneously, the group contains players satisfying all roles recommended by the quest, and the group also has sufficient players, i.e., two players for each suggested role. Considering a single social constraint (e.g., k-core [KhaouidKCore]), geo-social group works such as [zhu2017geo] tend to find the nearest group satisfying the social constraint, i.e., the induced subgraph in Figure 1(a). Considering a social constraint and an exact group size constraint, existing works such as [7202872, Ghosh:2018:FSS:3282495.3302535] are likely to find the induced subgraph in Figure 1(a). None of them can find the desired group since they do not consider that the activity has multiple demands, as discussed above.
Geo-social group with multiple constraints. The example motivates us to study a novel type of geo-social group search problem for impromptu activities with multiple demands, and to propose efficient solutions. In particular, the model we study finds a group satisfying multiple constraints induced by the various demands of an activity while ensuring that the group is the most spatially proximate to the activity location, where spatial proximity is measured by the distance of the group member most distant from the activity location. The multi-constraint geo-social group search problem we focus on is to find an MKASG: a Group of people with Minimum requirements on Keyword cohesiveness, Acquaintance (social strength) and size, while preserving its Spatial proximity to a given location. We name this problem the MKASG search problem.
Existing search framework. Most existing approaches [liu2012circle, armenatzoglou2013general, zhu2017geo] for finding a geo-social group are mainly based on the nearest neighbour search framework. This framework progressively adds vertices that potentially satisfy the social constraint in nearest neighbour order (w.r.t. the activity location), while checking the social constraint after each vertex is added. It returns the optimum result when it finds a subgraph satisfying the social constraint for the first time. This framework is efficient when considering a single social constraint. When it comes to geo-social groups with multiple constraints, this framework becomes less attractive since some constraints, e.g., the minimum size constraint discussed above, may enlarge the size of the desired geo-social group. This makes the number of multi-constraint checks substantially large, resulting in poor performance.
Challenges. As discussed above, a general effective framework for searching geo-social groups with multiple polynomially checkable constraints is urgently required. This raises the following challenges. Firstly, can we have a search framework that narrows the search space fast while preserving the correct result? Secondly, can we have a theoretical bound on the size of the narrowed search space? Thirdly, given the specific constraints in MKASG, can we reduce the time complexity of multi-constraint checking to a constant number of single-constraint checks?
Our approach. In this paper, we devise a novel search framework for effectively finding geo-social groups with multiple constraints. This search framework contains an expanding stage and a reducing stage. The expanding stage addresses the first two challenges. It approaches a search space that is sufficiently large to contain the optimum result, at a cost equivalent to a constant number of multi-constraint checks. The approached search space exceeds the optimum search space by at most a parameterized constant factor, which vastly restricts the search space for the reducing stage. For the reducing stage, we adapt the method proposed in [liinfluentalcom2015]. For the MKASG search problem, within the proposed search framework, we further devise novel techniques including keyword-aware truss union and keyword-aware spanning forest, which reduce the overall search complexity, including the expanding and reducing stages, to a constant number of social-constraint checks. This addresses the third challenge. We also propose novel pruning techniques that further improve the search performance.
Contribution. Our main contributions in this paper are summarised as follows.
We study finding geo-social groups with multiple constraints, considering minimum keyword, social acquaintance and size constraints while preserving spatial proximity to a specific query location. (Section 2)
We devise an effective search framework for the multi-constraint geo-social group search problem, which first approaches the region containing a group satisfying all constraints and then reduces the group to an MKASG to guarantee spatial proximity. (Section 4)
For the expanding stage, we propose a power-law-based expanding strategy which ensures that the total search space evaluated during expansion is restricted. We further propose effective techniques, including a search region lower bound and a keyword-aware truss union-find operation, to speed up this stage. (Section 5)
For the reducing stage, we propose a novel keyword frequency aware spanning forest, which reduces the total cost of the reducing stage to its lower bound for MKASG search. (Section 6)
We conduct extensive experiments on real datasets to demonstrate the efficiency and effectiveness of the proposed algorithm and geo-social group model. (Section 7)
2 Problem formulation
In this section, we formulate MKASG with social, keyword and size constraints, as well as the MKASG search problem. Some other constraints on geo-social groups that can be handled by our proposed method are discussed in Section 6.3.
Data. We model data with network structure, spatial attributes and textual attributes as an undirected graph, which has a set of vertices (users) and a set of edges (friendships). Each vertex has a piece of location information expressed as latitude and longitude, and a keyword that describes the current role of the user.
We formally define the query for searching MKASG.
Query for MKASG. We allow users to give a query consisting of a query location, a set of keywords that describe the roles of the desired group members, an integer parameter defining the minimum size of the group, and an integer parameter defining social cohesiveness.
Multiple constraints for MKASG. Now we define the multiple constraints of MKASG, given an MKASG query.
Social constraint. We consider minimum trussness to measure the social cohesiveness of an MKASG. Trussness is defined based on the number of triangles each edge is involved in within a graph, where a triangle consists of three mutually adjacent vertices of a subgraph.
Support. The support of an edge is the number of triangles containing the edge, i.e., the number of common neighbours of its two endpoints in the subgraph.
Minimum subgraph trussness. The trussness of a subgraph is defined as an integer that is 2 plus the minimum possible support over the edges of the subgraph. That is, a subgraph with trussness k requires that every edge participates in no less than k − 2 triangles.
Based on the definition of trussness, we define the -truss constraint of an MKASG as follows:

-truss constraint. An MKASG satisfies the -truss constraint if its trussness is no less than the query parameter and it is connected.

Intuitively, for a trussness parameter k, the endpoints of every edge in the group have at least k − 2 common neighbours within the group, every vertex in the group is incident to no less than k − 1 edges, and at least k − 1 edges have to be deleted in order to disconnect the group. A group with a large k indicates strong internal social relationships over its vertices.
Example. For instance, in Figure 1(a), the whole graph is a -truss: every edge in this graph is involved in no less than the required number of triangles.
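The support and trussness definitions above can be illustrated with a minimal Python sketch (function names are ours, not the paper's):

```python
def edge_support(edges):
    """Support of an edge = number of triangles containing it,
    i.e., the number of common neighbours of its endpoints."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return {(u, v): len(adj[u] & adj[v]) for u, v in edges}

def is_k_truss(edges, k):
    """A k-truss requires every edge to lie in at least k - 2 triangles."""
    return all(s >= k - 2 for s in edge_support(edges).values())

# A 4-clique: every edge lies in exactly 2 triangles, so it is a 4-truss.
clique = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
print(is_k_truss(clique, 4))  # True
print(is_k_truss(clique, 5))  # False
```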
Keyword constraint. We adopt the concept of collective keyword coverage to measure the keyword cohesiveness between the keyword attributes of a group and the query keywords.

Collective keyword coverage. Given a group and the query keywords, the attributes of the group collectively cover the query keywords if and only if every query keyword is carried by at least one member of the group.
Minimum size constraint. In real applications, we could allow users to specify the minimum size of the group directly. However, this is likely to result in group members whose attributes overemphasize part of the query keywords, which is undesirable. To mitigate this effect, we propose an alternative approach defining the minimum size of the group together with the keyword constraint. We introduce the definition of the minimum keyword vertex constraint.

Given a set of keywords, a social group, and, for each keyword, the set of vertices in the group carrying that keyword, the minimum keyword vertex constraint is defined as follows.

Minimum keyword vertex constraint. Given an integer parameter, a group satisfies the minimum keyword vertex constraint if every query keyword is carried by no less than that number of vertices in the group.

With the minimum keyword vertex constraint, the size of a group is no less than the parameter times the number of query keywords. In the rest of this paper, we refer to the minimum keyword vertex constraint simply as the keyword vertex constraint.
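The keyword vertex constraint amounts to a per-keyword frequency check. A minimal sketch (hypothetical names; each vertex carries exactly one keyword, as in our data model):

```python
from collections import Counter

def satisfies_keyword_vertex_constraint(roles, query_keywords, p):
    """roles: vertex -> keyword (one keyword per vertex).
    Every query keyword must be carried by at least p group members."""
    freq = Counter(roles.values())
    return all(freq[w] >= p for w in query_keywords)

group = {1: 'healer', 2: 'healer', 3: 'tank', 4: 'tank', 5: 'dps'}
print(satisfies_keyword_vertex_constraint(group, {'healer', 'tank'}, 2))  # True
print(satisfies_keyword_vertex_constraint(group, {'healer', 'dps'}, 2))   # False
```

Since each vertex carries one keyword, a satisfying group necessarily has at least p times as many members as there are query keywords.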
Searching objective for MKASG search. Now, we formalize the spatial proximity measure for MKASG and the research problem studied in this paper.
Spatial proximity. Given a query location, we consider a distance function that measures the closeness between the query location and an MKASG as the maximum distance from a member of the group to the query location, where the distance between two locations is the Euclidean distance.
-truss. Given a query and a distance threshold, a subgraph is a -truss if it satisfies all the constraints below: it satisfies the -truss constraint, it satisfies the keyword vertex constraint, and every member is within the distance threshold of the query location.
MKASG search. Given a query and a graph, return a -truss such that no other -truss has a smaller distance to the query location.
Example. Returning to Example 1, set a query for MKASG search with the query location, the query keywords and the two integer parameters. The MKASG is the subgraph in the dotted area. It is a -truss subgraph, and for every keyword in the query there are no less than two members whose attributes match the keyword. It is also the group closest to the query location subject to the social, keyword and size constraints.
3 Baseline Solutions
In this section, we discuss three baseline solutions that find the exact result.
Incremental approach. Given a query, this approach progressively includes vertices into a candidate set in nearest neighbour order w.r.t. the query location. Every time a vertex is added to the candidate set, this approach checks whether a subgraph induced by vertices in the candidate set satisfies all constraints. If there is one, the approach stops and returns that subgraph as the result. Otherwise, it keeps exploring vertices in order.
The time complexity of this method is dominated by repeatedly checking the -truss constraint and the keyword vertex constraint.
Decremental approach. Borrowing the technique proposed in [liinfluentalcom2015], a baseline with better time complexity can be derived. This approach progressively deletes the vertex most distant to the query location. When a most distant vertex is deleted, it further deletes edges that no longer satisfy the trussness constraint. This ensures that, before the next most distant vertex is deleted, the remaining subgraphs are still trusses. To adapt this approach to our problem, after the trussness check we further check, using depth-first search, whether any remaining connected truss satisfies both the size and keyword vertex constraints. The decremental approach keeps deleting the most distant vertex and performing the multi-constraint check until no subgraph satisfies all constraints simultaneously. The last subgraph that satisfies all constraints becomes the result.
This approach reduces the cost of truss checking, but it suffers from exploring a large search space.
Binary search based approach. This approach progressively guesses a distance via binary search. For a guessed distance, it checks whether a subgraph satisfying all constraints exists in the subgraph induced by vertices whose distance to the query location is no greater than the guess. If there is one, the approach reduces the guess and continues. If there is no such subgraph, it increases the guess towards the last evaluated larger distance and checks the corresponding subgraph. For any two consecutive evaluated distances, if no vertex has a distance to the query location between them, the search stops and the last subgraph satisfying all constraints becomes the result. To support retrieving subgraphs by distance efficiently, we use an R-tree index in this method.

The major drawback of this approach is that its search space is large, even though it approaches the optimum result fast.
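The binary search over candidate radii can be sketched as follows, with a hypothetical feasibility oracle `satisfied(r)` standing in for the multi-constraint check on the r-bounded subgraph. The search is correct because the bounded subgraphs are nested, so feasibility is monotone in the radius:

```python
def binary_search_radius(distances, satisfied):
    """distances: sorted distinct vertex distances to the query location.
    satisfied(r): True iff the r-bounded subgraph passes the
    multi-constraint check. Returns the smallest feasible distance,
    or None when no radius is feasible."""
    lo, hi, best = 0, len(distances) - 1, None
    while lo <= hi:
        mid = (lo + hi) // 2
        if satisfied(distances[mid]):
            best = distances[mid]   # feasible: try a smaller radius
            hi = mid - 1
        else:
            lo = mid + 1            # infeasible: grow the radius
    return best

dists = [1.0, 2.5, 4.0, 7.0, 9.5]
print(binary_search_radius(dists, lambda r: r >= 7.0))  # 7.0
```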
Discussion. The advantage of the incremental approach is that if the result is near the query location, the search space is quite restricted. The advantage of the decremental approach is that it reduces the cost of truss computation. The advantage of the binary search based approach is that it approaches the optimum result quickly even in the worst case. Clearly, an ideal search framework shall combine all these advantages. This motivates us to devise a novel framework that explores only a restricted area, approaches the optimum result fast and reduces multi-constraint checking as much as possible.
4 Search Framework
Before presenting the search framework, we first introduce a pre-pruning technique and some definitions.
Maximal (, )-truss based pruning. A maximal (, )-truss is a (, , )-truss, with the distance threshold relaxed, that cannot be extended by adding either an edge or a vertex.

Given an MKASG query, it is clear that the MKASG for the query can only reside in a maximal (, )-truss if one exists. As such, given the MKASG query, computing the maximal (, )-truss subgraphs would reduce the search space significantly. This can be done by traversing the maximal truss subgraphs with the state-of-the-art truss technique [DBLP:conf/sigmod/0001Y19].
Radius bounded graph. Given a query location, a subgraph and a distance threshold, the radius bounded graph is the subgraph induced by the vertices whose distance to the query location is no greater than the threshold.

We would like to highlight a special instance of radius bounded graph: the one just large enough to contain the MKASG for a query, i.e., no radius bounded graph with a smaller radius contains the MKASG. We refer to it as the optimum search space since it is just large enough to contain the MKASG for the query.
For instance, two radius bounded graphs are demonstrated in Figure 2. Their identified regions are displayed in Figure 2(a), i.e., circles centred at the query location with the two radii respectively. The corresponding subgraphs are shown in Figure 2(b), i.e., the subgraph in the dotted area and the subgraph in the grey coloured area, one of which is the optimum search space containing the MKASG for the query in Example 1.
Next we present the search framework for MKASG. It first quickly approaches a radius bounded graph just sufficiently large to contain the optimum search space. Then it reduces this graph to the optimum result.
The framework. As shown in Algorithm 1, the MKASG search framework consists of two stages: the expanding stage (lines 3 to 7) and the reducing stage (line 8). During the expanding stage, Algorithm 1 intends to quickly identify a radius bounded graph that is just sufficiently large to contain the optimum search space by exploring radius bounded graphs that get progressively larger, in which isptTrussIn is called to determine the existence of a subgraph satisfying all constraints. In the reducing stage, to get the optimum result, reducepcTruss progressively removes the vertex most distant to the query location. The last surviving (, )-truss during the vertex removal process is the optimum result.
In the following sections, we discuss the details of the two stages. We will propose techniques that make the expanding stage have the time complexity of a single call of isptTrussIn. For the reducing stage, we will propose a novel online index and combine it with our proposed reducing strategy to efficiently check all constraints of MKASG. Eventually, our proposed techniques guarantee that Algorithm 1 has the time complexity of a single truss computation.
5 Expanding Stage
In this stage, we explore a series of radius bounded subgraphs, starting from a relatively small one and stopping at the first one that is a supergraph of the optimum search space.
Challenges. Since the expanding stage involves expensive constraint checking, our first challenge is to devise an expanding strategy that tightly bounds the overall computation. On the other hand, if we can reach the optimum search space with fewer attempts, the performance is improved. This can be achieved by starting the search from a radius close to, but no greater than, the optimum one. This raises the second challenge: can we identify such an initial search range efficiently? At last, when processing a radius subgraph during the expanding stage, if we apply multi-constraint checking only on some restricted subgraphs that potentially contain a (, )-truss, the search performance can be further boosted. This raises the third challenge: how do we quickly identify those potential subgraphs?
In the following sub-sections, we will address these three challenges consecutively.
5.1 Expanding Strategy
In this part, we propose an expanding strategy that bounds the total size of the subgraphs that will be evaluated.
We first define an expanding invariant as follows.
Size invariant. Let a series of radii define the radius bounded graphs. For any two consecutive radii in the series, we define the invariant as a fixed ratio between the sizes of the two corresponding radius bounded graphs, where the ratio must be greater than one.
The strategy. The strategy applied in the expanding stage is to maintain the size invariant over any two consecutively evaluated radius bounded graphs. Applying the invariant during the expanding stage guarantees the two nice properties below.

Nearest first search. Vertices accessed by the expanding stage are in non-decreasing order of their distance to the query location, on a batch basis.

Power law expansion [bi2018optimal]. The sizes of the series of radius bounded graphs follow a power law expansion, i.e., each size is a constant multiple of the previous one.
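Under the size invariant with a hypothetical ratio `alpha`, the schedule of evaluated prefix sizes and the geometric bound on their sum can be sketched as:

```python
def expansion_schedule(target_size, alpha=2):
    """Prefix sizes n_0, n_1, ... with n_{i+1} = alpha * n_i
    (the size invariant), stopping once the target size is reached."""
    sizes, n = [], 1
    while n < target_size:
        sizes.append(n)
        n *= alpha
    sizes.append(min(n, target_size))
    return sizes

# With exactly geometric growth, the total evaluated size is a geometric
# series bounded by alpha / (alpha - 1) times the final size.
sched = expansion_schedule(128)
print(sched)       # [1, 2, 4, 8, 16, 32, 64, 128]
print(sum(sched))  # 255, no more than 2 * 128 for alpha = 2
```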
The two properties help us introduce and prove a lemma as follows.
Let the last two radius subgraphs evaluated by the expanding stage be given; the second last does not contain the optimum search space, while the last one does.

The correctness is clear. Firstly, Properties 1 and 2 hold during expansion. Secondly, the expanding stage stops as soon as a radius subgraph contains a -truss for the first time.
Next, we establish the precise relationship between the last evaluated radius subgraph and the optimum search space via the lemma below.

Let the last radius subgraph evaluated by the expanding stage be given; its size is within a constant factor of the size of the optimum search space.
Now, let us show the tight bound that is guaranteed by applying the proposed expanding strategy.
Let the set of radius subgraphs evaluated in order by the expanding stage be given; the total size of all evaluated subgraphs is within a constant factor of the size of the last one.

Proof sketch. Given the size invariant, the total size is essentially the sum of a geometric progression with a constant common ratio. As such, it is bounded by a constant factor of the size of the last evaluated subgraph.
Discussion. With Lemma 3, the correctness of the following statement is clear. The running time of lines 3 to 8 in Algorithm 1 is proportional to the time complexity of isptTrussIn on the last evaluated radius subgraph. This provides a tight bound for the expanding stage if we can access the difference between every two consecutive radius subgraphs locally during the loop of lines 3 to 8. As such, we next introduce techniques that ensure local exploration during the expanding stage.
Local exploration. We propose a structure that aids us in retrieving the difference between consecutive radius subgraphs in time linear to its size. We first show a lemma as follows.

For any maximal connected truss subgraph and a fixed query, there is a structure that takes linear space, that can be built efficiently, and that retrieves the difference between consecutive radius subgraphs in time linear to its size.
The structure. The structure is an array of the edges sorted in non-decreasing order of their distances to the query location, where the distance from the query location to an edge is measured in the same way as Definition 3. To create the structure, we first sort the vertices and then arrange the edges into their appropriate positions. For different maximal connected truss subgraphs, we sort them separately and then merge the results to speed up performance.

With the structure, for any two consecutively evaluated radius subgraphs, we can easily retrieve their difference in time linear to its size.
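The sorted edge array and the retrieval of the edges between two consecutive radii can be sketched as follows (names are ours; we assume an edge's distance is the larger of its endpoint distances, in line with the group-distance measure of Definition 3):

```python
import bisect

def build_edge_array(edges, dist):
    """Sort edges by distance to the query location; an edge's distance
    is taken as the larger of its endpoint distances."""
    arr = sorted(edges, key=lambda e: max(dist[e[0]], dist[e[1]]))
    keys = [max(dist[u], dist[v]) for u, v in arr]
    return arr, keys

def edges_in_band(arr, keys, r_lo, r_hi):
    """Edges with distance in (r_lo, r_hi]: the difference between two
    consecutive radius subgraphs, located via binary search and then
    read off in time linear to its size."""
    return arr[bisect.bisect_right(keys, r_lo):bisect.bisect_right(keys, r_hi)]

dist = {1: 1.0, 2: 2.0, 3: 3.0, 4: 4.0}
arr, keys = build_edge_array([(1, 2), (2, 3), (3, 4), (1, 3)], dist)
print(edges_in_band(arr, keys, 2.0, 3.0))  # [(2, 3), (1, 3)]
```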
5.2 Initial Expanding Range
Intuitively, if the initial search range is close to the optimum radius, fewer subgraphs have to be evaluated to approach the optimum search space. This motivates us to study a lower bound radius subgraph.

We define the lower bound radius subgraph as follows.
Lower bound radius subgraph. A subgraph is a lower bound radius subgraph if it satisfies the following conditions: 1) it is connected, 2) it satisfies the keyword vertex constraint, and 3) no subgraph with a smaller radius satisfies the first two conditions.
The lower bound radius subgraph relaxes the structural constraint of MKASG. As such, it can be computed efficiently, as discussed below.
Finding the lower bound radius subgraph. Algorithm 2 demonstrates the major steps for finding the lower bound radius subgraph. It is a refined union-find process [Tarjan:1975:EGB:321879.321884]. We augment the union-find data structure with keyword vertex frequencies. Algorithm 2 progressively performs union operations on edges in non-decreasing order of their distance to the query location. Through union operations, vertices that are connected are added into the same set. Each set is attached with a keyword vertex frequency for each keyword. When an edge is being evaluated, Algorithm 2 first checks whether its two endpoints are contained in the same set in the existing union-find structure (lines 3 to 5). If not, the two sets containing the endpoints are connected via a standard union operation and the keyword vertex frequencies of the two sets are aggregated (lines 8 to 9). Due to space limitations, the discussion of standard union-find operations is omitted. After a union operation, if some set satisfies the keyword vertex constraint, the lower bound radius subgraph has been found. Otherwise, Algorithm 2 continues.
Time complexity. The time complexity of Algorithm 2 is bounded by the number of edges times the cost of one union-find operation [Tarjan:1975:EGB:321879.321884], since the number of union-find operations is at most the number of edges. Additionally, checking the keyword vertex constraint can be considered to take constant time assuming the number of query keywords is small.
Example. Figure 3 shows the keyword-aware union-find structure maintained by Algorithm 2 for the query in Example 1. Each set, represented as a tree in the keyword-aware union-find structure, indicates a connected component of the current subgraph. After the corresponding edge is added, the tree rooted at the corresponding vertex becomes the first connected component satisfying the keyword vertex constraint. The induced subgraphs of the vertices in the trees are displayed in Figure 3(b).
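A minimal sketch of the keyword-aware union-find underlying Algorithm 2 (simplified: path halving without union by rank, and hypothetical method names):

```python
class KeywordUnionFind:
    """Union-find whose roots carry per-keyword vertex counts, so the
    keyword vertex constraint can be checked after every union."""
    def __init__(self, roles):                 # roles: vertex -> keyword
        self.parent = {v: v for v in roles}
        self.freq = {v: {roles[v]: 1} for v in roles}

    def find(self, v):
        while self.parent[v] != v:
            self.parent[v] = self.parent[self.parent[v]]   # path halving
            v = self.parent[v]
        return v

    def union(self, u, v):
        ru, rv = self.find(u), self.find(v)
        if ru == rv:
            return
        self.parent[rv] = ru
        for w, c in self.freq.pop(rv).items():             # aggregate counts
            self.freq[ru][w] = self.freq[ru].get(w, 0) + c

    def satisfies(self, v, query_keywords, p):
        f = self.freq[self.find(v)]
        return all(f.get(w, 0) >= p for w in query_keywords)

uf = KeywordUnionFind({1: 'a', 2: 'a', 3: 'b', 4: 'b'})
uf.union(1, 2)
print(uf.satisfies(1, {'a', 'b'}, 2))  # False: no 'b' vertices yet
uf.union(3, 4); uf.union(2, 3)
print(uf.satisfies(4, {'a', 'b'}, 2))  # True
```

Processing edges nearest first, the first set whose counts pass `satisfies` identifies the lower bound radius subgraph.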
Alternative initial bound. We could instead relax the keyword vertex constraint to derive an alternative bound, i.e., considering the smallest radius subgraph containing a connected truss as a lower bound. However, this bound is costly to compute.
5.3 Checking (, )-truss in Radius Subgraph
In this section, we show the detailed implementation of checking a -truss in a radius bounded subgraph, i.e., the procedure isptTrussIn in Algorithm 1.

To simplify the discussion, for any two consecutively evaluated radius subgraphs, we introduce a notation to denote the subgraph induced by the vertices appearing in the edges of their difference.
Baseline approaches. For checking whether there is any -truss in a radius subgraph, one baseline approach is to compute the trussness of the entire subgraph and traverse the truss subgraphs to further verify the keyword vertex constraint and connectivity. A better approach is, for any two consecutively evaluated radius subgraphs, to update the trussness according to their difference and traverse the updated truss subgraphs for checking the keyword vertex constraint and connectivity.

The two baseline approaches suffer from two drawbacks. Firstly, the trussness of the whole radius subgraph is computed or updated; the truss computation on the parts that cannot contain the MKASG is wasted. Secondly, checking the keyword vertex constraint and connectivity has to traverse the whole radius subgraph. If we can perform the check incrementally, the performance can be improved. We propose novel techniques to address the two drawbacks.
To address the first drawback, we propose a lazy -truss checking strategy as follows.

Lazy -truss checking strategy. Given a radius subgraph, we only apply -truss checking on subgraphs that potentially contain a -truss, defined as potential subgraphs below.

Potential subgraph. A subgraph is a potential subgraph if it is connected, satisfies the keyword vertex constraint and is maximal within the radius subgraph.

The strategy. Since a -truss must reside in a potential subgraph, the lazy -truss checking strategy applies -truss constraint checking on every potential subgraph only, instead of on the entire radius subgraph.

Identifying all potential subgraphs can be done at almost no cost by using the keyword-aware union-find structure discussed in Algorithm 2. That is, when the radius subgraph is expanded, the vertices of the new edges are progressively added to the keyword-aware union-find structure. The potential subgraphs can then be retrieved easily, since every set in the keyword-aware union-find structure satisfying the keyword vertex constraint identifies a potential subgraph.
For instance, in Figure 3, after all the edges are retrieved, the potential subgraph for the query in Example 1 is the corresponding induced subgraph. As such, according to the lazy -truss checking strategy, we only apply -truss checking on this potential subgraph. In contrast, we will not apply -truss checking on the other induced subgraph.
Next, we show how to address the second drawback. Note that the computation discussed below is performed on potential subgraphs only, whose size is vastly restricted compared to the size of the whole radius subgraph.
Union with existing trusses. To avoid graph traversal for checking the keyword vertex constraint and connectivity after updating trussness, we propose the following solution. Firstly, we maintain every maximal connected truss subgraph in every potential subgraph, each attached with its keyword vertex frequencies. Secondly, after the radius subgraph is expanded, we update the maintained truss subgraphs where applicable. Although this approach cannot update the trussness of existing truss subgraphs precisely, it is sufficient and efficient for checking the existence of a -truss. As such, the keyword vertex constraint and connectivity checking for truss subgraphs can be performed simultaneously and incrementally. We give formal explanations below and focus on the truss union for expanding a single potential subgraph. Since all potential subgraphs are disjoint, the truss union for a single potential subgraph easily extends to all of them.
Existing trusses. We maintain the connected truss subgraphs if they exist. For each of them, its keyword vertex frequency for every keyword in the query is recorded.

Truss potential subgraph. After the radius subgraph is expanded, we only compute maximal truss subgraphs in the truss potential subgraph defined below.
Truss potential subgraph. Given two consecutive and with , the truss potential subgraph is defined as , where is the set of vertices appearing in .
The sufficiency of is clear since it contains all triangles in for edges that are not in but potentially lead to -truss.
Truss union. Based on Definition 8, for consecutive radius subgraphs, we compute the maximal truss subgraphs in the truss potential subgraph and then add them to the maintained trusses via union operations, which forms the updated set of trusses.
Example. Figure 4 shows an example of the truss union operation. In Figure 4(a), let the induced subgraph be the existing radius subgraph, whose potential subgraph and maintained truss are the same graph. Let the whole graph in Figure 4(a) be the expanded radius subgraph. Then the truss potential subgraph is the induced subgraph shown in Figure 4(b). Since there is a truss in that induced subgraph, the truss-union data structure in Figure 4(a) (in terms of a tree structure) is updated to the one in Figure 4(b).
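At the level of vertex sets, the union step can be sketched as merging overlapping maximal truss components (a simplified sketch; the paper's structure additionally aggregates keyword frequencies and operates on union-find trees):

```python
def truss_union(existing, new_components):
    """existing / new_components: vertex sets of maximal connected
    truss subgraphs; components sharing a vertex are merged."""
    merged = []
    for c in [set(x) for x in existing + new_components]:
        for m in [m for m in merged if m & c]:   # overlapping components
            merged.remove(m)
            c |= m                                # absorb into c
        merged.append(c)
    return merged

print(truss_union([{1, 2, 3}], [{3, 4, 5}, {7, 8, 9}]))
# [{1, 2, 3, 4, 5}, {7, 8, 9}]
```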
Next, we show the (,)-truss checking algorithm with the proposed techniques.
The (,)-truss checking algorithm. The principal steps of (,)-truss checking are shown in Algorithm 3.
Data structure. Since Algorithm 3 is called iteratively, it works on progressively refined data structures, including the adjacency list of the graph, the keyword-aware union-find structure storing every potential subgraph, and the keyword-aware truss union-find structure storing the maximal connected trusses with aggregated keyword frequencies. All these data structures are empty before Algorithm 3 is called for the first time.
Principal steps. Algorithm 3 adds each edge of the difference to the adjacency list and performs a union operation on each edge in the keyword-aware union-find structure, where the edges of the difference can be retrieved easily with the sorted array proposed in Lemma 4. After that, Algorithm 3 computes the maximal truss subgraphs in the truss potential subgraph defined in Definition 8 (line 5). Next, Algorithm 3 performs truss union operations for the computed maximal truss subgraphs. After the truss union, if there is a set in the truss union-find structure that satisfies the keyword vertex constraint, then there is a -truss and Algorithm 3 returns it (line 9). Otherwise, Algorithm 3 returns an empty result.
The correctness of Algorithm 3 follows directly from the techniques discussed above.
Time complexity. Computations between lines 1 and 4 are dominated by keyword-aware union-find operations, which take amortized near-constant time per operation, and the same holds for lines 6 to 7. The dominating part is line 5: in the worst case, the truss potential subgraph can be the same as $G_i$, so the truss computation on it determines the overall time complexity of Algorithm 3.
To conclude the expanding stage, we show the lemma below.
The time complexity of the expanding stage is bounded by the total cost of the iterative truss checking over the expanded graphs.
6 Reducing Stage
For the reducing stage, we focus on searching for the MKASG in the $k$-trusses found by the expanding stage, denoted as $G_r$. We would like to revisit that the size of $G_r$ is bounded as analyzed in the expanding stage.
Intuitively, this stage progressively removes from $G_r$ the vertex that is most distant to the query location until no qualifying $k$-truss remains. The last surviving $k$-truss is the MKASG.
Efficiently checking the existence of a qualifying $k$-truss after deleting a vertex is challenging: after each vertex deletion, we must handle truss computation, keyword vertex constraint verification, and connectivity checking. The obviously time-consuming part is the truss computation, which can be bounded nicely by taking advantage of decremental truss computation. The pitfall when analyzing the cost is to ignore the cost of keyword vertex constraint and connectivity checking. In fact, a graph-traversal-based implementation of these checks leads to a complexity worse than that of the truss computation and becomes the performance bottleneck of the MKASG search.
We therefore propose an efficient approach that checks the multiple constraints together.
6.1 Reducing Strategy
In this part, we show the reducing strategy.
The strategy. Algorithm 4 shows the major steps of the reducing stage. It progressively removes the vertex in $G_r$ that is most distant to $q$ (the query location) and checks whether a qualifying $k$-truss still exists in the remainder of $G_r$ after the deletion. If one exists, Algorithm 4 continues by deleting the next most distant vertex. Otherwise, Algorithm 4 returns the last surviving $k$-truss as the MKASG.
Clearly, the strategy finds the MKASG in $G_r$ correctly, since Algorithm 4 maintains the invariant that, each time the most distant vertex is deleted, the remaining graph still contains a set of qualifying $k$-trusses. This invariant is ensured by our proposed pcTrussChecking in Algorithm 4: after the most distant vertex is deleted, we further delete edges violating the minimum trussness requirement. Meanwhile, for each edge deletion, we immediately check whether the remaining subgraphs contain a connected subgraph satisfying the keyword vertex constraint. If not, we stop the edge deletions and return the empty set, since no qualifying $k$-truss exists. If so, we exclude all the other subgraphs, since they cannot lead to the MKASG.
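The strategy can be sketched as follows, assuming distances to the query location are precomputed in a hypothetical `dist` map and using a naive truss recomputation in place of pcTrussChecking (the actual algorithm is decremental and far more efficient; this sketch only shows the peeling order and the stopping rule):

```python
from collections import defaultdict

def k_truss_edges(edges, k):
    """Naive peeling: drop every edge with fewer than k-2 triangles."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    E = {tuple(sorted(e)) for e in edges}
    changed = True
    while changed:
        changed = False
        for u, v in list(E):
            if len(adj[u] & adj[v]) < k - 2:
                E.discard((u, v))
                adj[u].discard(v)
                adj[v].discard(u)
                changed = True
    return E

def reduce_stage(edges, dist, k):
    """Peel the vertex farthest from the query while a k-truss survives;
    return the last surviving k-truss (the spatially tightest one)."""
    order = sorted(dist, key=dist.get, reverse=True)  # farthest first
    last = k_truss_edges(edges, k)
    cur = list(edges)
    for v in order:
        cur = [e for e in cur if v not in e]           # delete vertex v
        t = k_truss_edges(cur, k)
        if not t:
            return last                                # last survivor
        last = t
    return last
```

For instance, with two 4-cliques sharing a vertex, where one clique's private vertices are far from the query, the peeling removes the far clique first and returns the near one.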
It is clear that the time complexity of Algorithm 4 consists of the trussness computation cost and the keyword-aware connectivity checking cost. The former is bounded, since Algorithm 4 takes advantage of decremental truss computation and we have shown that the graph returned by the expanding stage is bounded in size. The latter depends on the cost of ckChecking, which is called in pcTrussChecking of Algorithm 4.
In the following subsection, we focus on techniques for devising an efficient ckChecking (Algorithm 5), which keeps the total cost of keyword-aware connectivity checking below that of the truss computation. As such, the proposed strategy, together with these techniques, bounds the total cost of multi-constraint checking in the reducing stage.
6.2 Keyword-aware Connectivity Checking
In this section, we show how to efficiently check the existence of a connected subgraph satisfying the keyword vertex constraint after an edge is deleted as a result of removing the most distant vertex in Algorithm 4.
High-level idea. We maintain a minimum spanning forest for $G_r$ (the input of Algorithm 4), augmented with aggregated keyword vertex frequencies. Notice that initially every spanning tree in the forest satisfies the keyword vertex constraint. After an edge is deleted from $G_r$, one of the two cases below may happen.
Case 1: the deleted edge is not in the forest. In this case, the remaining subgraphs are still connected, and each connected subgraph still satisfies the keyword vertex constraint.
Case 2: the deleted edge is in the forest. In this case, one of the trees in the minimum spanning forest is cut into two trees, which leads to one of the following subcases.
Subcase 1: we cannot link the cut trees. In this subcase, we cannot find a replacement edge from the remaining graph to link the two trees, which means the subgraph referred to by the two trees becomes two disjoint subgraphs. We update the keyword vertex frequency for each of the cut trees. After the update, we safely prune a cut tree from the maintained spanning forest if it does not satisfy the keyword vertex constraint, since it cannot contribute to the MKASG.
Subcase 2: we can link the cut trees. If we can find a replacement edge, the subgraph referred to by the two cut trees is still connected. We link the two trees with the replacement edge. The keyword vertex frequencies remain the same.
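The case analysis above can be sketched naively as follows, using plain graph traversal to decide whether the endpoints remain connected (this is exactly the slow baseline that the level-based technique of [holm2001poly] avoids). The concrete constraint, that every keyword present must be covered by at least `c` vertices, is an assumption made for illustration only:

```python
from collections import Counter, defaultdict

class KeywordForest:
    """Naive keyword-aware connectivity maintenance: on each edge
    deletion, either the endpoints stay connected (a replacement path
    exists) or the component splits, in which case components violating
    the keyword constraint are pruned."""
    def __init__(self, edges, keywords, c):
        self.adj = defaultdict(set)
        for u, v in edges:
            self.adj[u].add(v)
            self.adj[v].add(u)
        self.keywords = keywords  # vertex -> keyword (assumed one each)
        self.c = c

    def _component(self, s):
        seen, stack = {s}, [s]
        while stack:
            u = stack.pop()
            for w in self.adj[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        return seen

    def delete_edge(self, u, v):
        """Delete (u, v); return the components still meeting the
        keyword constraint."""
        self.adj[u].discard(v)
        self.adj[v].discard(u)
        comps = [self._component(u)]
        if v not in comps[0]:            # no replacement path: real split
            comps.append(self._component(v))
        ok = []
        for comp in comps:
            freq = Counter(self.keywords[x] for x in comp)
            if all(n >= self.c for n in freq.values()):
                ok.append(comp)
            else:                        # prune: cannot lead to MKASG
                for x in comp:
                    self.adj.pop(x, None)
        return ok
```

Each `_component` call traverses the whole component, which is precisely the cost the level-based replacement-edge search eliminates.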
It is clear that the above idea correctly maintains all connected subgraphs satisfying the keyword vertex constraint, if any exist, after deleting an edge from $G_r$. However, it is challenging to perform the maintenance efficiently, since checking the existence of a replacement edge can be costly.
To make the maintenance efficient, we borrow the idea from [holm2001poly]. Every edge in $G_r$ is associated with a level that is progressively increased as edges are deleted, which is equivalent to progressively partitioning $G_r$ hierarchically. Edges with a high level refer to a more restricted part of $G_r$; in contrast, edges with a low level refer to a more general part (supergraphs of the high-level subgraphs). As such, when deleting an edge with a certain level, we do not need to consider any edge with a lower level as a replacement edge, which elegantly reduces the search space for finding a replacement edge.
We first use an example to demonstrate our method.
Example. Suppose we have the input graph shown in Figure 5(a), where the minimum spanning forest is drawn with solid lines and the edges not in the spanning forest are drawn with dashed lines; we want to remove a vertex. We omit the level of an edge when it is the initial level. Removing a vertex is equivalent to removing all edges incident to it, and the incident edges not in the spanning forest are trivially removed. After that, suppose we remove the tree edge shown as a grey line in Figure 5(b): the spanning tree splits into two trees, and the level of every edge in the smaller tree is increased by one. By checking the edges incident to the smaller tree, we find a replacement edge; connecting the two trees yields the spanning tree in Figure 5(c). Next, we remove the tree edge shown in Figure 5(c), where the smaller tree contains a single vertex. In this case, no replacement edge incident to it can be found, leading to Figure 5(d). We then know that the graph has become disconnected, and we also know the keyword frequencies of the connected component remaining in the graph. Without the proposed method, we could not simultaneously know the keyword vertex frequencies and the connectivity of the subgraph after the deletion.
Now, let us describe the method formally. We first introduce the keyword-aware spanning forest.
Keyword-aware spanning forest. The minimum spanning tree of every connected $k$-truss in $G_r$ returned by the expanding stage is computed and stored, and each spanning tree is augmented with keyword vertex frequencies. As discussed, initially every spanning tree in this forest satisfies the keyword vertex constraint, and the level of every edge is assigned the initial value $0$. Below, we use $F_i$ to denote the forest restricted to edges with level at least $i$.
The algorithm. Algorithm 5 guarantees that, after an edge deletion, every remaining minimum spanning tree in the keyword-aware spanning forest satisfies the keyword vertex constraint. It returns true if the keyword-aware spanning forest is non-empty; otherwise, it returns false. To achieve this efficiently, Algorithm 5 maintains the following invariants.
Invariant 1. $F_{i+1} \subseteq F_i$ always holds. This invariant ensures that no duplicated trees are generated.
Invariant 2. $F_i$ is a minimum spanning forest for the subgraph induced by edges with level at least $i$. This invariant maximizes the possibility that a deleted edge is not in the maintained forest.
Invariant 3. The number of vertices in any tree of $F_i$ is always no greater than $n/2^i$. This is because when a tree is split into two subtrees, Algorithm 5 always increases the levels of the edges in the smaller tree by one. As such, the worst case is that every time a tree is split, the two trees are of equal size, leading to a largest possible tree size at level $i$ of $n/2^i$. This invariant guarantees that the level of an edge is no greater than $\lfloor \log_2 n \rfloor$, which is the key to the time complexity analysis.
More detailed steps are given below.
Given that an edge $(u, v)$ with level $\ell$ is to be deleted, Algorithm 5 first checks whether the edge is in the current forest.
Case 1: the edge is not in the forest. In this case, the edge is simply deleted from the graph (line 3), and the algorithm returns true.
Case 2: the edge is in the forest. In this case, Algorithm 5 deletes it from the tree containing it, starting from level $\ell$, the highest forest it is in.
Performing the tree cut (lines 10 to 11). The tree is cut into two subtrees $T_u$ and $T_v$, and the levels of the edges in the smaller tree (in terms of the number of vertices) are increased by one. Next, Algorithm 5 propagates the deletion from level $\ell$ down to level $0$ so that, from the view of all levels, the tree is split.
Searching for a replacement edge (lines 13 to 20). After performing the tree cut, Algorithm 5 searches for a replacement edge that may reconnect $T_u$ to $T_v$. This is achieved by scanning the edges incident to the vertices appearing in the smaller tree. To maintain the minimum spanning forest property, Algorithm 5 searches for a replacement edge starting from level $\ell$. If a scanned edge has both endpoints in the smaller tree, its level is increased by one.
Subcase 1: the cut trees cannot be linked (lines 24 to 26). If no replacement edge is found, the subgraph corresponding to the smaller tree becomes isolated. The aggregated keyword frequencies of both trees are adjusted. If a tree violates the keyword vertex constraint after the adjustment, it is pruned.
Subcase 2: the cut trees can be linked (lines 21 to 23). If a replacement edge is found, it links $T_u$ and $T_v$, and the edge is inserted into the forests from its level down to level $0$ so that, from the view of all levels, the tree is relinked.
Next, we further discuss the data structures used in our implementation, which are useful for the time complexity analysis.
Data structure. In Algorithm 5, to efficiently handle tree cut and tree link operations, we store the spanning forest as Euler tours, and the Euler tours are stored as balanced binary search trees [henzinger1999randomized]. As such, each tree cut or tree link operation can be performed in $O(\log n)$ time.
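The key property of the Euler tour encoding is that each subtree occupies a contiguous segment of the tour, so a tree cut or link becomes a sequence split or concatenation. The sketch below builds the tour with a plain Python list for illustration; a real implementation stores the sequence in a balanced binary search tree so that each split or concatenation takes $O(\log n)$:

```python
def euler_tour(adj, root):
    """Euler tour of a rooted tree: a vertex is appended on every visit,
    so the tour has 2n - 1 entries and each subtree is a contiguous
    segment of it."""
    tour = [root]
    visited = {root}
    stack = [(root, iter(sorted(adj[root])))]
    while stack:
        u, it = stack[-1]
        for w in it:
            if w not in visited:          # descend into a new child
                visited.add(w)
                tour.append(w)
                stack.append((w, iter(sorted(adj[w]))))
                break
        else:                             # all children done: backtrack
            stack.pop()
            if stack:
                tour.append(stack[-1][0])
    return tour
```

Cutting the subtree rooted at a vertex then amounts to slicing out its contiguous segment of the tour, and linking two trees amounts to splicing one tour into the other.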
Time complexity analysis. The time complexity of the index initialisation is dominated by computing the minimum spanning forest together with its augmented structures. For deleting a sequence of edges, lines 7 to 9 of Algorithm 5 contribute a total cost proportional to the number of level increases, because the level of each edge is at most $\lfloor \log_2 n \rfloor$. The dominating parts are lines 14 to 23 and lines 24 to 27, since for each deletion they perform up to $O(\log n)$ cut or link operations across the levels and each operation costs $O(\log n)$, which results in an amortized cost of $O(\log^2 n)$ per edge deletion.