Many problems in areas as diverse as recommendation systems, social network analysis, semantic search, and distributed root cause analysis can be modeled as pattern search on labeled graphs (also called "heterogeneous information networks" or HINs). Given a large graph and a query pattern with node and edge label constraints, a fundamental challenge is to find the top-k matches according to a ranking function over edge and node weights. For users, it is difficult to select the value k. We therefore propose the novel notion of an any-k ranking algorithm: for a given time budget, return as many of the top-ranked results as possible. Then, given additional time, produce the next lower-ranked results quickly as well. It can be stopped anytime, but may have to continue until all results are returned. This paper focuses on acyclic patterns over arbitrary labeled graphs. We are interested in practical algorithms that effectively exploit (1) properties of heterogeneous networks, in particular selective constraints on labels, and (2) the fact that users often explore only a fraction of the top-ranked results. Our solution, KARPET, carefully integrates aggressive pruning that leverages the acyclic nature of the query, and incremental guided search. It enables us to prove strong non-trivial time and space guarantees, which is generally considered very hard for this type of graph search problem. Through experimental studies we show that KARPET achieves running times in the order of milliseconds for tree patterns on large networks with millions of nodes and edges.
Heterogeneous information networks (HIN) (Han et al., 2010), i.e., graphs with node and/or edge labels, have recently attracted a lot of attention for their ability to model many complex real-world relationships, thereby enabling rich queries. Often labels are used to represent types of nodes and their relationships:
Consider a photo-sharing social network with three vertex type labels: user, photo, and group. Users are connected to the photos they upload, and photos are connected to groups when they are posted there. Finally, users can connect to groups by joining them. To maintain a vibrant community and alert users about potentially interesting photos, the social network might run queries of the type shown in Figure 1: given photo1 and two users, user1 and user2, find alternative groups (matching nodes for group2) to post the photo in order to reach user2 without spamming her directly. This is achieved by identifying a user belonging to both groups (user3), who can post the photo in the other group. There might be hundreds of matching triples (group1, user3, group2), and there would be many more if user2 was not given in advance. Under these circumstances, the goal often is not to find all results, but only the most important ones. Importance can be determined based on node and edge weights, e.g., weights representing distances (or similarities). Then the query should return the lightest (or heaviest) pattern instances. For example, the weight of a group may be based on its number of members, the weight of a user on how active s/he is, and the weight of a link on the timestamp when it was established (to give preference to long-term relationships or more recent photo posts), or the sum of the PageRanks of its endpoints.
These types of rich query semantics also appear in other contexts, e.g., root-cause analysis in distributed systems. The Vitrage service for OpenStack (Vit, 2018) makes use of path and tree patterns to specify rules for automatic root cause deduction of alarms raised by virtual machines and hardware. Large OpenStack deployments, which involve thousands of hosts and tens of thousands of virtual machines and hardware components, necessitate pattern matching algorithms that deduce the root cause of such alarms in near real-time.
We focus on efficient solutions for acyclic pattern queries on general labeled graphs. To this end, we propose the notion of any-k algorithms, a novel variant of top-k algorithms. A top-k algorithm exploits knowledge about the given k to produce the top-k lightest patterns faster than the “full enumeration” algorithm (which first produces all results and then ranks them by weight). In practice, it is difficult for users to know the value of k upfront (“when will I have seen enough?”). An any-k algorithm addresses this issue by not requiring a pre-set value for k. Instead, an any-k algorithm
returns the top-ranked result as quickly as possible,
then returns the second-ranked result next, followed by the third-ranked, and so on,
until the user is satisfied and terminates the process.
In other words, the ranked enumeration can be stopped anytime and should then return as many top results as possible.
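The caller-facing contract of an any-k algorithm can be illustrated with a Python generator. This is only a sketch of the interface, not of KARPET itself: the function name `ranked_enumeration` and the toy result list are hypothetical, and the sketch materializes all results up front, which is exactly what an any-k algorithm tries to avoid.

```python
import heapq
from itertools import islice

def ranked_enumeration(weighted_results):
    """Yield (weight, result) pairs lightest-first, one at a time."""
    heap = list(weighted_results)
    heapq.heapify(heap)               # linear-time setup
    while heap:
        yield heapq.heappop(heap)     # next-lightest result on demand

# Hypothetical matches with their weights.
matches = [(7.0, "m3"), (2.5, "m1"), (4.0, "m2"), (9.0, "m4")]
stream = ranked_enumeration(matches)

first_two = list(islice(stream, 2))   # the user stops after two results ...
next_one = next(stream)               # ... and may resume later without restarting
```

The point of the generator form is that no value of k is fixed in advance: the consumer decides when to stop, and every prefix it has seen so far is a correct top-k answer.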
The queries we are interested in correspond to subgraph isomorphism, which is known to be hard in general. In particular, subgraph isomorphism on homogeneous graphs is already NP-complete in the size of the query (even for the path case as Hamiltonian path is a special case). And labeled graphs contain unlabeled graphs as a special case. On the other hand, labels provide more opportunities for achieving better performance in practice by exploiting heterogeneity where present. Note that a key reason for hardness of isomorphism lies in the “non-repetition constraint,” i.e., the same graph node cannot occur more than once in an answer. Without this constraint, pattern search would correspond to the easier subgraph homomorphism problem which can be solved in PTIME.
Our approach is based on three key insights: (1) Constraints on node or edge labels can dramatically reduce the number of matching results; (2) Mutually exclusive type labels “narrow the gap” in cardinality between the set of isomorphic subgraphs and the set of homomorphic subgraphs (which includes all isomorphic ones). The reason is that query pattern nodes of different types cannot be mapped to the same graph node, even when the algorithm is only searching for homomorphism. In the example photo-sharing network, users and photos cannot stand in for a group node. In the extreme, if all nodes in the query pattern have different types, then any solution for subgraph homomorphism also satisfies isomorphism. This suggests an approach that aggressively prunes for the homomorphism case and then filters based on node repetitions in the result patterns; and (3) In many real-world cases, output size is small relative to the combinatorial size of the pattern search space. Hence algorithm complexity bounds based on output size promise to deliver practically meaningful performance guarantees.
Overview of the Solution. Our approach combines three conceptually separate steps into a two-phase algorithm.
1) The search space of possible homomorphic patterns is pruned to the provably smallest representation of the original graph. We use insights from the well-known Yannakakis algorithm (Yannakakis, 1981) for evaluating answers to acyclic conjunctive queries to create this representation in just one bottom-up and a subsequent top-down sweep through the query tree.
2) We devise a novel any-k algorithm for enumerating homomorphic tree patterns. It uses dynamic programming to perform a bottom-up cost calculation, followed by a top-down guided search.
3) A final pruning step removes those homomorphic patterns that do not satisfy the isomorphism requirement.
We show how to combine the first two steps into just one bottom-up and one top-down phase. We then integrate the third step into the combined top-down phase. Our experiments show that even on graphs with millions of nodes and tens of millions of edges, we can return the top-ranked results in just a few milliseconds, whereas alternative approaches would take orders of magnitude longer. Our implementation can be downloaded from (Cod, 2018).
Main contributions. We devise KARPET (Kernelization And Rapid Pruning-based Exploration for Tree patterns), a novel and highly performant any-k algorithm that can quickly identify top-ranked tree patterns in large graphs, then return the next lower-ranked ones when given extra time. (Kernelization is a pre-processing technique that replaces the original input by a usually smaller representation called a “kernel” in order to reduce the computation cost. Our approach enumerates solutions over a smaller pruned candidate graph.)
1) KARPET is designed as an anytime ranking algorithm that enumerates homomorphic subtrees in order of total edge weight with strong theoretic guarantees: We show that our worst-case time complexity for returning all homomorphism results is identical to full enumeration. In addition, KARPET provides strong upper bound guarantees for the time to return the top-ranked homomorphism result, as well as the time between returning a homomorphism result and the next. For cases with “small gap” between homomorphism and isomorphism, i.e., when “sufficiently many” homomorphic patterns are also isomorphic patterns, these guarantees carry over to subgraph isomorphism.
2) We propose fast and effective local pruning operations that exploit the heterogeneity of labeled graphs, proving that they also guarantee strong global pruning properties. Intuitively, for subgraph homomorphism, we show that inexpensive pruning based on 1-node neighborhoods efficiently removes all candidate nodes that are not part of any result pattern.
3) In contrast to a lot of theoretical work on subgraph isomorphism algorithms, our algorithm is output-sensitive—its worst case complexity depends on the output size, which is smaller when the graph and the query are more heterogeneous, rather than being exponential in the size of the query pattern.
4) We show how to speed up the search for top-ranked isomorphic answers by pushing the pruning for non-repeating nodes into the incremental result enumeration algorithm.
Our goal is to find the lightest subgraphs of a labeled graph that are isomorphic to a given tree pattern . Instead of returning all results at once after a long wait time, we set out to devise an anytime algorithm, which returns the top-ranked match as quickly as possible and then incrementally returns the remaining results over time.
An any-k algorithm is a variant of a top-k algorithm in which k is not known at the start of the algorithm. The algorithm can be interrupted anytime, returning the top-k results with k being as large as possible.
We define the weight of a pattern as the sum of edge weights. This also supports search for the “most reliable” pattern based on probabilities assigned to edges. Finding the pattern with the greatest probability of being connected, assuming independence, is equivalent to maximizing the sum of the logarithms of the edge probabilities. For our problem with a fixed query pattern, lightest and heaviest pattern search can be easily converted into each other. It is also straightforward to modify our approach to support pattern weight defined as minimum or maximum of edge weights. We present the formal definitions next. Table 1 summarizes important notation.
|A labeled graph with node set and edge set|
|Set of node labels|
|Function mapping nodes to labels|
|Function mapping edges to weights|
|Tree pattern with node set and edge set|
|Required labels for a graph node matched to a query node|
|Set of leaf nodes (or terminals) in|
|Chosen root node in|
|Set of neighbors of in with label|
|Function mapping query nodes to graph nodes|
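The earlier remark on “most reliable” patterns can be checked numerically. The sketch below assumes independent edge probabilities and uses the negative logarithm as the additive edge weight, so that the lightest pattern is the most reliable one. The helper name `reliability_weight` and the probabilities are made up for illustration.

```python
import math

def reliability_weight(p):
    """Map an edge probability in (0, 1] to a non-negative additive weight."""
    return -math.log(p)

pattern_a = [0.9, 0.8]    # edge probabilities of one hypothetical pattern
pattern_b = [0.99, 0.5]   # edge probabilities of another

weight_a = sum(reliability_weight(p) for p in pattern_a)
weight_b = sum(reliability_weight(p) for p in pattern_b)

# Probability that all edges of the pattern exist, assuming independence.
prob_a = math.prod(pattern_a)
prob_b = math.prod(pattern_b)

# The lighter pattern under -log weights is the one with the larger product.
lightest = "a" if weight_a < weight_b else "b"
most_reliable = "a" if prob_a > prob_b else "b"
```

Negating the logarithm turns the maximization into the lightest-pattern search used throughout, which is one instance of converting between heaviest and lightest pattern search.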
A Weighted Heterogeneous Information Network (HIN) is a labeled undirected graph , where is a set of vertices, is a set of edges, is a node labeling function , and is an edge weight function .
In many HINs, a node has at least two different kinds of labels: a unique node ID and a type (or class). In the photo-sharing network example (see Example 1.1), the labeling function assigns types such as “user” or “photo” to each node. Our approach can be easily extended to include multiple labels per node, as well as (multiple) edge labels, node weights, and directed edges. We omit these straightforward generalizations in order to simplify the exposition.
Given a vertex and label , we use to denote the set of all neighbors of with label , i.e., .
Given a labeled graph , a tree pattern is a rooted tree in which each node has a label constraint . We use to denote the root of the tree and to denote the set of its leaves (or terminals, i.e. nodes of degree one).
The labeling constraint can encode the selection of specific nodes or node types. For example, in the photo-sharing network scenario, setting for user1 to be the ID of a specific user node limits the candidate set for user1 to just this one graph node. Similarly, setting to the label encoding the type “group” will enforce that only graph nodes representing groups, but not users or photos, will be considered.
Notice that being rooted is not a restriction: any node in a tree can be chosen to be the root. We merely make use of the fact that the tree pattern is rooted in order to more easily describe our algorithms.
A homomorphic result pattern (or homomorphic match) of query is a graph such that there exists a function with the following properties: (1) , and (2) . The weight of a result pattern is defined as .
An (isomorphic) result pattern (or match) of query is a homomorphic result pattern with a bijective mapping function .
The above definitions make it clear that the set of isomorphic matches is a subset of the homomorphic matches; and can be obtained by removing all those homomorphic matches where multiple query nodes are mapped to the same graph node.
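As a minimal sketch of this subset relationship, a match can be represented as a dictionary from query nodes to graph nodes. The representation and the name `is_isomorphic_match` are assumptions made for illustration. A homomorphic match is then isomorphic exactly when no graph node is used twice.

```python
def is_isomorphic_match(mapping):
    """True if no graph node is the image of more than one query node."""
    targets = list(mapping.values())
    return len(set(targets)) == len(targets)

# Two hypothetical homomorphic matches for the group1-user3-group2 part
# of the photo-sharing query.
homomorphic = [
    {"group1": "g1", "user3": "u3", "group2": "g2"},
    {"group1": "g1", "user3": "u3", "group2": "g1"},  # g1 is repeated
]
isomorphic = [m for m in homomorphic if is_isomorphic_match(m)]
```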
In the discussion below we will also refer to partial patterns (or partial matches) for intermediate results of the computation. These are incomplete instances where some of the query nodes are mapped to NIL by . The direct successor of a partial match is one where exactly one of the NIL targets is replaced by a graph node, growing the pattern by one additional node. With successor we refer to any partial or complete match in the transitive closure of direct successor.
For fast access to , we rely on GraphEdge, a two-level hash index constructed offline for . It maps a given node ID to another hash table, which in turn maps a given label to the set of neighbors of with label . If no label is specified, all nodes and corresponding edge weights in the secondary hash table for are returned. This index can be bulk-created from scratch in time linear in the graph size, and updated in time linear in the size of the changes.
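A two-level hash index of this kind could look as follows in Python. This is a sketch under assumptions: the function names `build_graph_edge` and `neighbors`, the edge-list input format, and the toy graph are hypothetical and do not reproduce the actual GraphEdge implementation.

```python
from collections import defaultdict

def build_graph_edge(edges, label):
    """node -> neighbor label -> {neighbor: weight}, built in one pass."""
    index = defaultdict(lambda: defaultdict(dict))
    for u, v, w in edges:
        index[u][label[v]][v] = w   # undirected graph: store both directions
        index[v][label[u]][u] = w
    return index

def neighbors(index, node, lab=None):
    """Neighbors of `node` with label `lab`, or all neighbors if lab is None."""
    by_label = index.get(node, {})
    if lab is not None:
        return dict(by_label.get(lab, {}))
    merged = {}
    for nbrs in by_label.values():
        merged.update(nbrs)
    return merged

label = {"u1": "user", "p1": "photo", "g1": "group", "g2": "group"}
edges = [("u1", "p1", 1.0), ("u1", "g1", 2.0), ("u1", "g2", 4.0)]
idx = build_graph_edge(edges, label)

groups_of_u1 = neighbors(idx, "u1", "group")
all_of_u1 = neighbors(idx, "u1")
```

Each edge is touched once during construction, and a label-restricted lookup costs time proportional to the number of neighbors returned, which is the property the algorithms rely on.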
Hardness. In general, even the decision version of sub-graph isomorphism, i.e., to determine if a given query graph is isomorphic to a sub-graph of , is NP-complete. When the sub-graph is connected acyclic (i.e., a tree), the best worst-case time bound for the decision problem is a parameterized algorithm of Koutis and Williams (Koutis and Williams, 2016) that requires time. Their algorithm also has matching conditional lower bounds (Krauthgamer and Trabelsi, 2017): achieving a bound of time, for any constant , would falsify a longstanding conjecture. Note that, since the decision problem is hard, the any- problem discussed here is at least as hard.
In practice we often know specific node instances such as for user1 in the photo-sharing network example, and can dramatically reduce the pattern search space by starting the exploration from these nodes. Still, as Figure 2 illustrates, one cannot tell from the immediate neighborhood of node , if edge will belong to top-ranked results, or any results at all. Worse yet, not even the 3-hop neighborhood of will answer this question. Hence a pattern search algorithm might suffer from expensive backtracking or the inability to determine, without extensive graph traversal, when the top-k lowest-weight patterns have been found.
To the best of our knowledge, KARPET is the first algorithm for ranked retrieval of graph query patterns that performs pruning and exploration based on “local” information, while provably guaranteeing to make the right decisions “globally.”
We next present an approach for sub-graph homomorphism; this is a relaxation of sub-graph isomorphism in that we do not require the mapping from query nodes to tree-pattern nodes to be bijective (in other words, a node can be repeated in the result pattern). Section 5 extends the approach for isomorphism.
KARPET consists of two phases: 1) a bottom-up sweep from leaves to the root of , and 2) a top-down depth-first traversal from root to leaves. The first phase prunes some of the spurious candidates and creates a “candidate graph” (discussed below) with “minimum subtree weights.” The second phase prunes the remaining spurious candidates and performs a search guided by the subtree weights. Here the term spurious candidate refers to a node or edge of the input graph that does not appear in any of the query results.
The bottom-up phase traverses the query tree in any bottom-up order and constructs a “candidate graph” consisting of two index structures: (1) returns for query node a hash index that maps a node candidate to a list of minimum subtree weights, with one weight for each of ’s children. (2) returns for each query edge between a node and its child a hash index that maps a candidate node of to all adjacent candidates of .
We illustrate Algorithm 1 with Figures 2(a), 2(b), and 2(c). It first inserts candidate nodes for each query leaf node into the corresponding candidates , setting their weights to zero (line 2). Note that leaves do not have children, hence the NIL value in the expression. In Figure 2(a) there is a single candidate per leaf, but in practice it can be a larger subset of for each query leaf, depending on the node constraints. Then, for each query node , the algorithm (i) finds possible candidate nodes, (ii) prunes them, and (iii) calculates the minimum subtree weights.
In more detail: (i) for each query edge leading to a child , it first finds all candidate edges , storing the map (line 8). (ii) Then, the algorithm only keeps the list of candidates for each query node that are reachable from candidate instances in all leaves of the query node (line 11): In Figure 2(c), the list of candidates for query node group1 is . Notice how spurious candidates not reachable from the leaves, e.g., in group2, are not even accessed (compare with Figure 2). Similarly, while in user3 is reachable from the left, it is not reachable from the right subtree and is thus automatically pruned as well. (iii) Then, the algorithm finds for each reachable node, the minimum weight along each query edge starting at (line 16). For example, in Figure 2(c), the left weight for is computed as the minimum of weights for following , which is 5 as the sum of the weight of edge (= 2) plus the weight of (= 2+1), or for following , which is 7 as the sum of the weight of edge (= 4) plus the weight of (= 2+1). Notice we use here as short form for the sum of weights at a node , which we get from CandNode. The two newly created indices speed up finding adjacent edges in a subtree of the query pattern during top-down traversal.
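The bottom-up sweep described above can be sketched in Python as follows. This is not the paper's Algorithm 1 but a simplified reading of it under stated assumptions: internal query nodes are constrained by a type label (`qlabel`), leaves by explicit candidate sets (`leaf_cands`), recursion replaces the explicit bottom-up order, and all names as well as the toy graph are hypothetical.

```python
from collections import defaultdict

def build_adj(edges, label):
    """node -> neighbor label -> {neighbor: edge weight} (undirected)."""
    adj = defaultdict(lambda: defaultdict(dict))
    for u, v, w in edges:
        adj[u][label[v]][v] = w
        adj[v][label[u]][u] = w
    return adj

def bottom_up(children, root, qlabel, leaf_cands, adj):
    cand_node = {}  # query node -> {graph node: [min subtree weight per child]}
    cand_edge = {}  # (query node, child) -> {graph node: {child graph node: weight}}

    def visit(q):
        kids = children.get(q, [])
        if not kids:                        # leaf: no children, no weights
            cand_node[q] = {v: [] for v in leaf_cands[q]}
            return
        reach, emaps = None, {}
        for c in kids:
            visit(c)
            emap = defaultdict(dict)
            for cv in cand_node[c]:         # (i) walk up from surviving child candidates
                for pv, w in adj[cv].get(qlabel[q], {}).items():
                    emap[pv][cv] = w
            emaps[c] = emap
            reach = set(emap) if reach is None else reach & set(emap)
        for c in kids:                      # (ii) keep candidates reachable from every child
            cand_edge[(q, c)] = {v: dict(emaps[c][v]) for v in reach}
        cand_node[q] = {                    # (iii) minimum subtree weight per child
            v: [min(w + sum(cand_node[c][cv])
                    for cv, w in cand_edge[(q, c)][v].items())
                for c in kids]
            for v in reach
        }

    visit(root)
    return cand_node, cand_edge

# Toy graph: users u1..u5 and groups g1..g3 with integer edge weights.
label = {"u1": "user", "u2": "user", "u3": "user", "u4": "user", "u5": "user",
         "g1": "group", "g2": "group", "g3": "group"}
edges = [("u1", "g1", 1), ("u1", "g2", 3), ("u2", "g3", 2), ("u3", "g1", 2),
         ("u3", "g2", 4), ("u3", "g3", 1), ("u4", "g2", 1), ("u4", "g3", 5),
         ("u5", "g1", 1)]
# Query tree: root U with children GA and GB, each with one leaf.
children = {"U": ["GA", "GB"], "GA": ["LA"], "GB": ["LB"]}
qlabel = {"U": "user", "GA": "group", "GB": "group"}
leaf_cands = {"LA": {"u1"}, "LB": {"u2"}}

cand_node, cand_edge = bottom_up(children, "U", qlabel, leaf_cands,
                                 build_adj(edges, label))
```

In this toy instance, u5 is reachable only through the left subtree and is therefore dropped from the root candidates, while u3 and u4 survive with one minimum subtree weight per child.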
The second part of our algorithm performs top-down search, starting at the root node and proceeding downward to the leaves. This is essential for two reasons: First, the pre-computed subtree weights provide information to guide the search to the lightest patterns before exploring the heavier ones. Second, the top-down traversal implicitly prunes all remaining spurious candidates for sub-graph homomorphism, as we will prove in Section 4. Again, pruning actually happens implicitly by not reaching those candidates. To see the latter, consider group1 candidate in Figure 2(c). It is spurious, but could not be removed by the bottom-up sweep. However, it will never be accessed during top-down traversal, because was never recorded in CandNode by Algorithm 1.
Algorithm 2 shows the pseudo-code for top-down guided search. Initially, all candidates in the query root are inserted into priority queue pq (line 3), with their priorities set to the sum of the candidate’s weights. In Figure 2(c), there is a single candidate, , of weight . Then the algorithm repeatedly pops the top element from pq and expands the partial pattern using pre-order traversal. Function NextPreorder returns the edge, as the pair of parent and child node, along which the partial pattern will be expanded next (line 12). The priority value of each expanded partial match is defined as the sum of the pattern’s edge weights plus the sum of the weights of the unexplored subtrees. In the example, partial match is inserted into pq with priority 8 = 2 (edge weight) + (2+1) (weights of ) + 3 (weight of right subtree of ). Similarly, partial match is inserted with priority 4+(2+1)+3 = 10. Note that those values are computed incrementally during traversal (line 16). Consider expansion of to . Priority of was 8, with weight 5 for the newly expanded subtree rooted at group1. After retrieving from CandEdge, priority of is computed as 8 (old) - 5 (newly expanded subtree) + 4 (weight of edge ) + 3 (priority of ) = 10 (line 16 in Alg. 2). Then is popped next, and expanded to partial match with priority 8 = 8 - 2 + 2 + (0+0). This pattern is then expanded next to , , and finally —all with the same priority of 8. The latter is output as the minimal-weight solution. Only then will partial match with the higher priority value 10 be expanded analogously. Each expansion operation requires a pop operation from priority queue, visiting potential edges once.
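The guided search can be sketched in Python with a binary heap. The sketch is a simplified reading of the description above, not the paper's Algorithm 2: the names `guided_search` and `preorder_edges`, the dictionary layout of the candidate structures, and the hand-built toy data are all assumptions.

```python
import heapq
from itertools import count

def preorder_edges(children, q):
    """Query edges as (parent, child, child index) in pre-order."""
    for i, c in enumerate(children.get(q, [])):
        yield q, c, i
        yield from preorder_edges(children, c)

def guided_search(children, root, cand_node, cand_edge):
    order = list(preorder_edges(children, root))
    tie = count()  # unique tie-breaker so the heap never compares dicts
    pq = [(sum(ws), next(tie), {root: v}) for v, ws in cand_node[root].items()]
    heapq.heapify(pq)
    while pq:
        prio, _, match = heapq.heappop(pq)
        if len(match) == len(order) + 1:    # every query node is mapped
            yield prio, match
            continue
        p, c, i = order[len(match) - 1]     # next query edge in pre-order
        pv = match[p]
        for cv, w in cand_edge[(p, c)][pv].items():
            # Replace the lower bound for the subtree under c by the actual
            # edge weight plus the lower bounds one level further down.
            new_prio = prio - cand_node[p][pv][i] + w + sum(cand_node[c][cv])
            heapq.heappush(pq, (new_prio, next(tie), {**match, c: cv}))

# Hand-built candidate structures for a toy query: root U with children
# GA and GB, which have the leaves LA and LB, respectively.
children = {"U": ["GA", "GB"], "GA": ["LA"], "GB": ["LB"]}
cand_node = {"U": {"u3": [3, 3], "u4": [4, 7]},
             "GA": {"g1": [1], "g2": [3]}, "GB": {"g3": [2]},
             "LA": {"u1": []}, "LB": {"u2": []}}
cand_edge = {("U", "GA"): {"u3": {"g1": 2, "g2": 4}, "u4": {"g2": 1}},
             ("U", "GB"): {"u3": {"g3": 1}, "u4": {"g3": 5}},
             ("GA", "LA"): {"g1": {"u1": 1}, "g2": {"u1": 3}},
             ("GB", "LB"): {"g3": {"u2": 2}}}

results = list(guided_search(children, "U", cand_node, cand_edge))
weights = [w for w, _ in results]
```

Because the priority of a partial match is a lower bound on the weight of every completion, complete matches leave the queue in non-decreasing weight order, and the generator can be abandoned after any number of results.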
All results in this section are for the relaxed version of the problem, based on sub-graph homomorphism instead of isomorphism. We discuss in Section 5 how to extend them to the isomorphism case. Proofs were omitted due to space constraints, but can be found in the extended version (Yang et al., 2018).
We show that during top-down search (Alg. 2), no spurious candidate node will ever be accessed. A candidate node for a query node is “spurious” if there does not exist any homomorphic result pattern where is matched to . Ensuring that no spurious nodes are accessed is crucial for proving strong upper bounds on the algorithm cost.
If node candidate for query node is accessed by Alg. 2, then there exists a homomorphic result pattern where .
Next, we show a powerful result that is crucial in establishing important algorithm properties: During the top-down guided search, for each query result there is at most one push and at most one pop operation on priority queue pq. For this, we need the following lemmas.
The priority value of a partial pattern is always less than or equal to the priority of all its successors.
Assume that Alg. 2 popped partial pattern , , of priority from pq. Then there exists a direct successor that has the same priority .
If the last pop operation on pq returned an incomplete pattern , then one of the direct successors of will have priority equal to the minimum priority over all elements in pq.
Consider the changes of pq for the example in Figure 2(c). Initially it contains , the sole root node candidate with priority 5+3=8. This element is popped and expanded along edges and . The priority of the former is 2 (weight of edge ) plus (2+1) (subtree weights of ) plus 3 (right subtree weight of ) = 8. It is identical to the initial priority of , because edge is the one that determined the minimum left subtree weight of 5 in . For , priority is 10 due to the higher weight of edge . After these two patterns are pushed, pq contains . The next pop delivers , which is expanded to , followed by repeated pop and push operations on this pattern, every time obtaining the same priority of 8, until the top result of weight 8 is completed. Only then will expansion of commence.
Front-element optimization. Based on Corollary 4.4, we next introduce an important optimization to Alg. 2. Since the corollary guarantees that one of the direct successors of the partial pattern popped before will have a minimal priority value, we avoid the push-pop cycle for it and keep expanding it directly, only pushing the other direct successors. More precisely, assume the algorithm just popped partial match of priority from pq. While expanding this pattern by one more node, it keeps in memory the first direct successor encountered that also has priority value , pushing all other direct successors to pq. This way the algorithm still works on a min-priority element, but avoids the push-pop cycle for it. This seemingly minor optimization has strong implications as formalized in the following theorems.
Using front-element optimization, for any , the -th pop operation from pq produces the -th lightest homomorphic result pattern, possibly requiring additional push operations, but no more pop operations until this result pattern is returned.
No matter how many results are retrieved, Alg. 2 never performs more than push operations on pq in total. Here denotes the number of homomorphic subtrees in .
This follows directly from Theorem 4.6 and the following observation. Assume the algorithm continues to run until all query results are found. At that point it has removed all partial matches from pq and the queue is empty. Theorem 4.6 implies that retrieving all results requires exactly pop operations. If the total number of push operations exceeded this, then the queue would not be empty. (And obviously, any execution of Alg. 2 that stops before returning all results will only have performed a subset of the push operations executed by the time all results are returned.)
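The effect of the front-element optimization on the number of queue operations can be observed in a small Python sketch. It is an illustration under assumptions, not the authors' implementation: the function names, the `stats` counter, and the toy candidate structures are hypothetical, and integer weights are used so that the equality test on priorities is exact.

```python
import heapq
from itertools import count

def preorder_edges(children, q):
    """Query edges as (parent, child, child index) in pre-order."""
    for i, c in enumerate(children.get(q, [])):
        yield q, c, i
        yield from preorder_edges(children, c)

def guided_search_front(children, root, cand_node, cand_edge, stats):
    order = list(preorder_edges(children, root))
    tie = count()
    pq = [(sum(ws), next(tie), {root: v}) for v, ws in cand_node[root].items()]
    heapq.heapify(pq)
    stats["push"], stats["pop"] = len(pq), 0
    while pq:
        prio, _, match = heapq.heappop(pq)
        stats["pop"] += 1
        while len(match) <= len(order):      # match is still incomplete
            p, c, i = order[len(match) - 1]
            pv, front = match[p], None
            for cv, w in cand_edge[(p, c)][pv].items():
                new_prio = prio - cand_node[p][pv][i] + w + sum(cand_node[c][cv])
                succ = {**match, c: cv}
                if front is None and new_prio == prio:
                    front = succ             # keep in hand: no push, no pop
                else:
                    heapq.heappush(pq, (new_prio, next(tie), succ))
                    stats["push"] += 1
            if front is None:                # only possible for inconsistent inputs
                break
            match = front
        else:
            yield prio, match                # loop ended without break: complete match

children = {"U": ["GA", "GB"], "GA": ["LA"], "GB": ["LB"]}
cand_node = {"U": {"u3": [3, 3], "u4": [4, 7]},
             "GA": {"g1": [1], "g2": [3]}, "GB": {"g3": [2]},
             "LA": {"u1": []}, "LB": {"u2": []}}
cand_edge = {("U", "GA"): {"u3": {"g1": 2, "g2": 4}, "u4": {"g2": 1}},
             ("U", "GB"): {"u3": {"g3": 1}, "u4": {"g3": 5}},
             ("GA", "LA"): {"g1": {"u1": 1}, "g2": {"u1": 3}},
             ("GB", "LB"): {"g3": {"u2": 2}}}

stats = {}
weights = [w for w, _ in guided_search_front(children, "U", cand_node, cand_edge, stats)]
```

In this toy instance there are three homomorphic matches, and the run performs exactly three pops and three pushes, one of each per result. With floating-point weights the equality test on priorities would need a tolerance or an explicit marker for the minimizing edge.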
To avoid notational clutter, we treat the size of the query pattern as a small constant and omit it from most formulas. (Note that pattern size is equal to the number of edges in , e.g., 5 in the photo-sharing network example.) It is straightforward to extend the formulas by including as a variable.
Algorithm 1. Theoretical worst case cost is , i.e., linear in graph size: for each of the query pattern edges, in the worst case all graph edges are accessed. The time for constructing CandEdge and CandNode adds a constant overhead per edge processed. In practice, only a small fraction of will be accessed because of the label constraints. In particular, by using GraphEdge in line 8 in Alg. 1, all neighbors of matching types (labels) are accessed in time linear in the number of these neighbors. Space cost is upper bounded by the combined size of CandNode and CandEdge, i.e., cannot exceed times input graph size.
Space cost of Alg. 2 is upper bounded by , the total result size for sub-graph homomorphism.
From a user’s point of view, the time it takes to produce the next lower-ranked result is crucial:
The initial latency for Alg. 2 to return the top-ranked homomorphic match, and also the time between returning any two consecutive homomorphic matches, is . Here is greater of (1) the number of candidates in the root node and (2) the maximum cardinality of the set of adjacent node candidates in for any query graph edge and candidate .
These strong results show that KARPET can effectively exploit selective label constraints. For instance, if there are a thousand homomorphic subgraphs in , then Theorems 4.8 and 4.9 guarantee that Alg. 2 will never store more than a thousand partial matches in memory and will perform at most a thousand (inexpensive) computation steps to deliver the next result to the user—no matter how big or connected the given graph!
We show next that the anytime property of KARPET, i.e., that it can deliver the top-ranked results quickly and then the next ones on request, incurs no performance penalty for producing all homomorphic matches:
The lower bound for producing all homomorphic result patterns is ; sorting them costs . Alg. 2 has matching total time complexity .
KARPET as introduced in Section 3 returns homomorphic matches. To obtain the desired isomorphic matches, function mapping query nodes to tree-pattern nodes has to be bijective. To guarantee this, one simply has to filter out all results where different query nodes are mapped to the same graph node. Instead of filtering on the final result, KARPET can perform early pruning by checking in line 15 in Alg. 2 if newly added node already appears in partial match —discarding if it does. This modification has the following implications for the cost analysis results in Section 4.3.
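The early repetition check amounts to a single membership test at expansion time. The sketch below assumes partial matches are dictionaries from query nodes to graph nodes. The function name `extend_match` and the flag `enforce_isomorphism` are hypothetical.

```python
def extend_match(match, query_node, graph_node, enforce_isomorphism=True):
    """Return the extended partial match, or None if it repeats a graph node."""
    if enforce_isomorphism and graph_node in match.values():
        return None                       # discard instead of pushing to the queue
    return {**match, query_node: graph_node}

partial = {"group1": "g1", "user3": "u3"}
kept = extend_match(partial, "group2", "g2")
dropped = extend_match(partial, "group2", "g1")
homomorphic_only = extend_match(partial, "group2", "g1", enforce_isomorphism=False)
```

Scanning the values of the partial match costs time linear in the pattern size, which is treated as a small constant in the cost analysis.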
Since some of the items previously pushed to priority queue pq will now be discarded early, space consumption as well as computation cost of KARPET are lower than for finding all subgraph homomorphism results. However, worst-case complexity as established by Theorems 4.8 and 4.10 remains the same. The guarantee for the time between results (Theorem 4.9), on the other hand, is weaker: in the worst case, e.g., when only the very first and the very last of the homomorphic matches represent isomorphic results, the time between consecutive results grows from to . Fortunately, as our experiments indicate, heterogeneity indeed results in a small gap between homomorphism and isomorphism, i.e., real-world performance is closer to .
Our experiments evaluate running time, memory consumption, and size of the search space explored by KARPET on several query templates and four data sets against two baseline algorithms. In order to allow reproducible results, our code can be downloaded from our project page (Cod, 2018).
Baseline algorithms. We chose two baselines that allow us to evaluate the relative contribution of our two key steps (pruning the search space, and guided search) to the performance of KARPET.
Unguided first calculates all results from the candidate graph and only then ranks them. Since it uses our pruned search space, but does not include any prioritizing of query results, it serves as a baseline to evaluate the contribution of our guided search phase.
Backbone is intended to evaluate the effect of our aggressive tree-based pruning strategy. It extends the state-of-the-art top-k algorithm for path queries on HINs (Liang et al., 2016). First, it identifies the longest terminal-to-terminal path in the tree pattern (called the “backbone” of the pattern). It then incrementally retrieves the lightest backbone instances in the graph one by one. Note that such an instance is a partial match with smaller unmapped subtrees “hanging off” the backbone path. Thus, we can execute Unguided on each such subpattern, yielding a divide-and-conquer algorithm. Since each subpattern is independent, we extract the lightest instance of each subpattern, and merge these solutions with the backbone instance. This involves checking for repeated node occurrences to enforce isomorphism. Then the next heavier subtree matches are explored, and so on. Given a pre-defined value of k, the algorithm can prune the set of remaining backbone instances every time a new full match is found. In all our experiments, we supply the ultimate value of k to Backbone, to explore the best possible performance this algorithm might achieve (if it was able to guess the correct value of k from the start).
Datasets. We use four well-known heterogeneous datasets: Flickr (Fli, 2018), DBLP (Liang et al., 2016), Enron (Liang et al., 2016), and Yelp (Yel, 2017). Figure 4 gives an overview of their properties. Notice that Enron, DBLP and Flickr have denser graphs than Yelp. In order to create weights for individual edges, we used the age of the edge with an exponential decay based on the difference between edge creation time and query time.
Query templates. We created five different pattern templates listed in Fig. 5. For each pattern, we chose 300 different assignments of actual nodes to the query leaves, which gave us 300 queries per template, for a total of 6,000 query instances across all datasets.
Performance metrics. We vary k between 1 and the total number of results, and compare the algorithms on three performance metrics. 1) Running time: We measure the time between having loaded the dataset into memory and returning the k-th result, reporting the average for each template. 2) Memory consumption: We report the maximum number of nodes (i.e., partial matches weighted by the number of matched nodes they contain) stored at any time during the execution of each algorithm. 3) Size of search space: We report the total number of partial matches (weighted by the number of matched nodes they contain) that are stored during execution.
Experimental setup. Experiments were run on an Intel Xeon CPU E5-2440 1.90GHz with 200GB of memory running Linux. We compiled the source code with g++ 4.8 (optimization flag O2).
To justify our “solving isomorphism through homomorphism” approach, we also ran KARPET with the node-repetition check turned off. This produces all homomorphic matches, from which we can then determine offline how many were also results for the isomorphism case. We plot both numbers as we increase k in Figure 6. It shows a representative result, obtained from the Enron dataset using a query from Template 1. The small gap between the lines confirms that the vast majority of homomorphic matches are also isomorphic result patterns.
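The offline comparison amounts to testing whether a homomorphic match maps distinct query nodes to distinct graph nodes. The sketch below assumes that a match is represented as a dict from query node to graph node; the function names are illustrative.

```python
def is_isomorphic_match(assignment):
    """True if no graph node is used for more than one query node."""
    graph_nodes = list(assignment.values())
    return len(graph_nodes) == len(set(graph_nodes))

def isomorphic_fraction(matches):
    """Share of homomorphic matches that pass the node-repetition check."""
    if not matches:
        return 0.0
    return sum(is_isomorphic_match(m) for m in matches) / len(matches)

# Hypothetical matches for a pattern with query nodes q1, q2, q3.
matches = [
    {"q1": "alice", "q2": "group7", "q3": "bob"},    # all nodes distinct
    {"q1": "alice", "q2": "group7", "q3": "alice"},  # alice is repeated
]
fraction = isomorphic_fraction(matches)
```

Here only the first match survives the check, so the fraction is 0.5; the experiment reports that on real data this fraction is close to 1.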
Figure 7 shows a representative result comparing KARPET to Unguided, which bulk-computes the entire output and hence is not affected by the choice of k. It is clearly visible how our any-k algorithm continuously returns the top-ranked results in order, with very low latency between consecutive outputs. By the time Unguided finally returns the first match, KARPET has already delivered more than 85% of all matches. In the end, it took KARPET 4.98s to output all matches. For comparison, bulk-computation by Unguided took 4.31s, indicating a small overhead of 0.67s for supporting the anytime property.
For each template (for a fixed k) we show the average running time, memory usage, and search space of our algorithm, along with the two baselines, on all four datasets in Tables 2, 3, and 4. In terms of average running time, KARPET outperforms the baselines in all cases. On DBLP, we see a 100x faster running time compared to Unguided, and a 10x speedup compared to Backbone. We make the following detailed observations about these results.
We observed that KARPET’s margin of improvement over the baselines is generally greater on denser graphs. Higher density leads to more matches, which KARPET can handle best, because it prioritizes the search based on subtree weight more effectively than the two baselines. The Yelp graph is extremely sparse, causing many queries to have only one or two matches. For those queries, all three algorithms behave nearly identically, e.g., running time for a typical single-match query on Yelp was msec for KARPET, msec for Unguided, and msec for Backbone. However, even for this sparse graph, there are several queries for each pattern that have a larger number of candidate instances. These result in a significantly slower average running time for both baselines, while KARPET averages to less than 1 msec.
Backbone does relatively well compared to Unguided on the larger dataset DBLP, and is the same order of magnitude for Flickr, but fails to perform well on Enron. For dense graphs, if the branch-and-bound does not terminate quickly, the overhead required by the divide-and-conquer merging steps can be very large because the same subtree may be visited multiple times.
The backbone-based algorithm often uses the least amount of memory, because in each iteration, it only holds the backbone matches and the single instance of the backbone it grows. The memory bottleneck for Unguided is the amount of storage needed to hold all matching trees when they are sorted by weight.
For “easy” queries, which only have a few matching instances, all three approaches show similar running time. On the other hand, when a query has to select the top-k from a larger result set, e.g., dozens or hundreds of results, KARPET has a significant advantage from efficiently pruning the search space at an earlier stage. For such “hard” queries, Unguided enumerates all matching instances before sorting them. Backbone only partially exploits pruning opportunities for the backbone. The non-trivial extensions proposed for KARPET are required to fully benefit from the constraints encoded by the entire tree structure.
Backbone can exhibit faster running times than Unguided in some cases. It achieves this speed-up due to its branch-and-bound nature, filtering out instances that exceed the threshold established by matches for the lightest backbones explored early on. (Note that this type of pruning takes advantage of advance knowledge of the final value of k. In practice, the algorithm would not know k and hence could not apply any such pruning.) Either way, in most cases, this advantage is outweighed by the fact that Backbone introduces an overhead for merging subtrees and repeatedly visiting some of the subtrees.
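The branch-and-bound threshold described here can be sketched as a bounded heap over the k lightest full matches seen so far. This is an illustration under assumptions, not the Backbone code: the class name `TopKBound`, its methods, and the use of a lower bound on the weight of any completion of a backbone instance are all hypothetical.

```python
import heapq

class TopKBound:
    """Tracks the k lightest full-match weights seen so far (illustrative)."""

    def __init__(self, k):
        if k < 1:
            raise ValueError("k must be at least 1")
        self.k = k
        self._heap = []  # negated weights, so the heaviest kept match is on top

    def offer(self, weight):
        """Register the weight of a newly found full match."""
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -weight)
        elif weight < -self._heap[0]:
            heapq.heapreplace(self._heap, -weight)

    def threshold(self):
        """Weight of the k-th lightest match, or infinity if fewer than k."""
        return -self._heap[0] if len(self._heap) == self.k else float("inf")

    def can_prune(self, lower_bound):
        """A candidate whose lower bound exceeds the threshold cannot be top-k."""
        return lower_bound > self.threshold()

bound = TopKBound(k=2)
for w in (5, 3, 4):
    bound.offer(w)
```

After the three offers the two lightest matches have weights 3 and 4, so any backbone instance whose lower bound is above 4 can be discarded. Without knowing k in advance, no finite threshold of this kind is available, which is the caveat raised in the parenthetical note.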
Our proposed notion of an any-k algorithm is novel and extends the functionality of the previously studied class of top-k algorithms. The problem of fast graph pattern search has been studied in different research communities, such as algorithms, graph databases, and data mining. While traditional data mining and work in theory focuses on the structure of the graph, meta-path based approaches (Sun et al., 2011) also leverage the type information.
Subgraph isomorphism. Subgraph isomorphism is an NP-hard problem (Lubiw, 1981), and state-of-the-art algorithms are not practical for large graphs. Lee et al. (Lee et al., 2012) empirically compare the performance of several state-of-the-art subgraph isomorphism algorithms, including the Generic Subgraph Isomorphism Algorithm (Cordella et al., 2004), the Ullmann algorithm (Ullmann, 1976), VF2 (Cordella et al., 2001), QuickSI (Shang et al., 2008), GADDI (Zhang et al., 2009), and GraphQL (He and Singh, 2008). They test on the real-world datasets AIDS, NASA, Yeast and Human, covering a spectrum of relatively small graphs. Modern social networks easily exceed that size by orders of magnitude, and the exact subgraph isomorphism problem remains intractable for larger networks when label constraints and top-k are not fully exploited. Hence, to the best of our knowledge, none of these existing precise pattern matching algorithms could be used for our target application.
K-shortest simple paths. The k-shortest paths problem is a natural and long-studied generalization of the shortest path problem, in which not one but several paths in increasing order of length are sought. The additional constraint of “simple” paths requires that the paths be without loops. This problem has been researched for directed graphs (Yen, 1971), undirected graphs (Katoh et al., 1982), approximation algorithms (Roditty, 2007), and algorithm engineering (Hershberger et al., 2007; Feng, 2014). However, this body of literature was developed for graphs without labels. In contrast, KARPET efficiently finds the k lightest instances matching a given query pattern by leveraging the heterogeneity constraints on the node and edge types to speed up the computation. Furthermore, in our scenario, k is not known upfront.
Querying graph data. In the database community, querying and managing of large-scale graph data has been studied (Aggarwal and Wang, 2010), e.g., in the context of GIS, XML databases, bioinformatics, social networks, and ontologies. The main focus there has been on identifying connection patterns between the graph vertices (Faloutsos et al., 2004; Tong and Faloutsos, 2006; Wei, 2010; Kasneci et al., 2009; Ramakrishnan et al., 2005; Cheng et al., 2009). In contrast, KARPET finds matches for a given query pattern.
HINs and path patterns. Heterogeneous Information Networks (HINs) (Han et al., 2010) are an abstraction to represent graphs whose nodes are affiliated with different types. To derive complex relations from such information networks, “meta-paths”, defined as node-typed paths on a heterogeneous network, represent connections between nodes by specifying the types along a path in the network (Sun et al., 2011). Thus, while the focus in that line of research has been on learning good meta-path patterns for various applications, KARPET efficiently finds matches for a given query pattern.
Liang et al. (Liang et al., 2016) derived a top-k algorithm for ranking path patterns in HINs. However, their algorithm does not easily extend to more complicated patterns such as trees; our Backbone baseline attempts to adapt their algorithm for ranking tree patterns, and our experiments show a considerable advantage of KARPET.
Top-k query evaluation in databases. There is a considerable amount of work on top-k queries in a ranking environment (Fagin et al., 2003; Ilyas et al., 2008; Akbarinia et al., 2007; Natsev et al., 2001). This work aims at minimizing I/O cost by trading sorted access vs. random access to data. In contrast to that body of work, we focus on main memory applications and a different cost model.
Graph search on RDFs and XML. Top-k keyword search algorithms for XML databases (Chen and Papakonstantinou, 2010) combine semantic pruning based on document structure encoding with top-k join algorithms from relational databases. The main challenge lies in dealing with query semantics based on least common ancestors. RDF is a flexible and extensible way to represent information about World Wide Web resources. Searching for a pattern on RDFs can be represented in SPARQL, and can be applied to ontology matching (Euzenat et al., 2007). Recent work on graph pattern matching in RDF databases (Aluc, 2015, Chapter 2) has resulted in several different approaches to database layout (Aluc, 2015, Chapter 3). However, as in the case of top-k query evaluation, it appears that more focus has been placed on scalability issues, such as replication, parallelization, and distribution of workloads (Huang et al., 2011; Hose and Schenkel, 2013), as RDF datasets are often too large for a single machine. It is well known that SPARQL is descriptive enough to capture graph pattern matching queries (so-called basic graph patterns (Aluc, 2015, Chapter 2)), and these queries are typically decomposed into combinations of database primitive operations such as joins, unions, difference, etc. (Aluc, 2015, Page 23). Although work has been done on optimizing these primitive operations in the context of graph patterns for certain types of queries that appear in practice (Neumann and Weikum, 2010), we are not aware of an approach similar to KARPET being employed for general tree patterns in the context of RDF.
We proposed KARPET for finding tree patterns in labeled graphs, e.g., heterogeneous information networks. Compared to previous work, it combines two unique properties. First, it is a top-k anytime algorithm, in the sense that it quickly returns the top-ranked results, then incrementally delivers more on request. This is achieved without sacrificing performance for full-result retrieval compared to bulk-computation. Second, while being subject to the same general hardness of graph isomorphism, KARPET aggressively exploits the special properties of HINs. We demonstrate this by proving surprisingly strong theoretical guarantees that connect space and time complexity to parameters affected by heterogeneity: result cardinality for the slightly relaxed graph homomorphism problem and number of adjacent edges of a given type. The formulas show that greater heterogeneity of graph and query labels works in our favor by reducing the “gap” between homomorphism and isomorphism, and by reducing result size. In future work we will attempt to extend the approach to query patterns with cycles, which appears to be significantly more challenging. Intuitively, it is more challenging to perform elimination of spurious node and edge candidates when pruning based only on local neighborhoods.
Acknowledgments. This work was supported in part by the National Institutes of Health (NIH) under award number R01 NS091421 and by the National Science Foundation (NSF) under award number CAREER III-1762268. The content is solely the responsibility of the authors and does not necessarily represent the official views of NIH or NSF. We would also like to thank the reviewers for their constructive feedback.
We sketch the main ideas of the proof. (1) By definition of Alg. 2, for to be reached during the top-down search, there has to exist a path , where , from a root node candidate down to . For each , let denote the query node for which is a candidate. Without loss of generality, let be the left-most child of , . We next show that we can construct a query result from the path, the subtree rooted at , and the right subtrees for all , .
By definition, Alg. 1 only adds a candidate to a query node , if there exists a subtree rooted at in that is isomorphic to the subtree rooted at in query . (Otherwise would not have been reached during the bottom-up edge traversal from all its leaves.) It is now easy to see that () the path , () all right subtrees of the nodes in , and () the subtree rooted at , together cover the entire query graph and are pairwise disjoint. Hence the corresponding candidate instances, i.e., () path , () the right subtree matches rooted at each candidate in , and () the subtree match rooted at , form a query result containing candidate for query node . ∎
Recall that the priority of a partial pattern is defined as the sum of the weights of all matched edges, plus the sum of the pre-computed weights of all unmatched subtrees rooted at the corresponding matched candidates in . A candidate’s subtree weight, by definition, is the minimum over all possible matches in the subtree rooted at this candidate. Hence any successor of , which might not expand along the edges that determined the minimum subtree weights, will have a priority value greater than or equal to the priority value of . ∎
Similar to the proof for Lemma 4.2, we start with the fact that the priority of a partial pattern is defined as the sum of the weights of all matched edges, plus the sum of the pre-computed weights of all unmatched subtrees rooted at the corresponding matched candidates in . Since a candidate’s subtree weight, by definition, is the minimum over all possible matches in the subtree rooted at this candidate, there exists a full match whose weight is equal to that minimum. For a partial match , successor has the same priority if the edge connecting to the pattern is the one that determined the minimum weight value for the corresponding subtree in the parent of . ∎
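The priority definition used in the two arguments above can be checked on a toy candidate tree. The sketch below is not the authors' data structure: the `branches` dict (candidate -> list of branches, each branch a list of child candidates), the `weight` dict, and all function names are assumptions chosen for illustration.

```python
def branch_min(branches, weight, node, i):
    """Lightest completion of the i-th branch hanging below `node`."""
    return min(weight[node, c] + subtree_weight(branches, weight, c)
               for c in branches[node][i])

def subtree_weight(branches, weight, node):
    """Minimum total weight over all matches of the subtree rooted at `node`."""
    return sum(branch_min(branches, weight, node, i)
               for i in range(len(branches.get(node, []))))

def priority(branches, weight, matched_edges, pending):
    """Matched edge weights plus minima of all still unmatched branches."""
    return (sum(weight[e] for e in matched_edges)
            + sum(branch_min(branches, weight, n, i) for n, i in pending))

# Hypothetical candidates: root r has two branches; a1 and a2 have one each.
branches = {"r": [["a1", "a2"], ["b1", "b2"]], "a1": [["x"]], "a2": [["y"]]}
weight = {("r", "a1"): 1, ("r", "a2"): 4, ("r", "b1"): 2, ("r", "b2"): 3,
          ("a1", "x"): 5, ("a2", "y"): 1}

p_root = priority(branches, weight, [], [("r", 0), ("r", 1)])
p_via_a1 = priority(branches, weight, [("r", "a1")], [("a1", 0), ("r", 1)])
p_via_a2 = priority(branches, weight, [("r", "a2")], [("a2", 0), ("r", 1)])
```

In this example the root-only partial match has priority 7. Expanding through `a2`, the child that determined the branch minimum, keeps the priority at 7, whereas expanding through `a1` raises it to 8, so no successor ever has a lower priority than its parent.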
Consider two results and with weights and , respectively, such that . Assume that the prefix of that was expanded into full result was popped at time ; and similarly for prefix of and time . We now prove that . Lemma 4.3 implies that the priority of at time is . Similarly, Lemmas 4.2 and 4.3 imply that at time , no prefix of could have been in pq. Any such prefix of has priority less than or equal to , hence would have been in front of at time . By definition, Alg. 2 can only push successors of a pattern it pops from pq. This implies that it can only push patterns that are successors of those partial matches currently in pq. (Note that it initially pushes all root candidates, guaranteeing the 1-node prefixes of all result patterns are initially in pq.) Since at time there is no prefix of in pq, Alg. 2 cannot push or pop any such prefix after time . This in turn implies that it must have popped in an earlier pop operation before . ∎
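The ordering argument can be illustrated with a deliberately simplified enumeration. The sketch below assumes a path-shaped pattern, omits the node-repetition check and the front-element optimization, and assumes that dead-end candidates were already pruned, so every node without successors is a leaf candidate. The names `any_k` and `succ` are hypothetical.

```python
import heapq
from itertools import islice

def any_k(succ, roots):
    """Yield (weight, path) for all root-to-leaf paths, lightest first.

    `succ` maps a candidate to a list of (child, edge_weight) pairs.
    """
    best = {}

    def lightest(node):  # minimum weight needed to complete a match below node
        if node not in best:
            best[node] = min((w + lightest(c) for c, w in succ.get(node, [])),
                             default=0)
        return best[node]

    # Queue entries: (priority, weight of matched edges, matched prefix).
    pq = [(lightest(r), 0, (r,)) for r in roots]
    heapq.heapify(pq)
    while pq:
        prio, matched, prefix = heapq.heappop(pq)
        children = succ.get(prefix[-1], [])
        if not children:  # prefix is a full match
            yield prio, prefix
            continue
        for child, w in children:
            heapq.heappush(pq, (matched + w + lightest(child),
                                matched + w, prefix + (child,)))

succ = {"r": [("a", 1), ("b", 2)], "a": [("x", 5), ("y", 1)], "b": [("x", 1)]}
all_results = list(any_k(succ, ["r"]))
first_two = list(islice(any_k(succ, ["r"]), 2))  # stop early, anytime style
```

Because the generator can be stopped after any number of results, the caller does not need to choose k in advance; in this toy instance the results appear with weights 2, 3, and 6.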
The upper bound follows from Corollary 4.7. We show that the bound is tight: Assume all results have the same prefix , except for the last node. Then Alg. 2 pushes only root , immediately pops it, then expands until reaching . The next expansion produces #results many patterns that are all pushed to pq, except for the minimal one. Hence at this point there are (#results - 1) elements in pq, each consisting of matched nodes. ∎
Theorem 4.1 implies . Assuming that pq is implemented using a relaxed heap, then it will have worst case time bound of O(1) for decrease key and O(log n) for delete min(Driscoll et al., 1988). Due to front-element optimization, the next full match is computed by (1) popping the first element from pq in time logarithmic in queue size, which is at most , then (2) expanding the partial match node-by-node until a full match. The latter requires retrieving all edges from a currently matched node in the partial match to the next unmatched node according to pre-order traversal. There are at most such edges, resulting in the corresponding number of pushes to pq, each at (expected) constant cost because of the constant time inserting into a relaxed heap. This is done at most times, which is a constant (see above). ∎
For the lower bound, note that each of the results has to be output. For the total cost of Alg. 2, Theorem 4.6 and Corollary 4.7 guarantee that the total number of push and pop operations on pq is . Since the former has constant cost, and the latter is logarithmic in queue size, we obtain an upper bound of for operations on pq. (To be precise, for both bulk-computation and Alg. 2, there is another additive term to account for the number of edges accessed to assemble each result pattern. This number is linear in .) ∎