Mining Rules Incrementally over Large Knowledge Bases

04/20/2019 ∙ by Xiaofeng Zhou, et al. ∙ University of Florida

Multiple web-scale Knowledge Bases, e.g., Freebase, YAGO, and NELL, have been constructed using semi-supervised or unsupervised information extraction techniques, and many of them, despite their large sizes, are continuously growing. Much research effort has been put into mining inference rules from knowledge bases. To address the task of rule mining over evolving web-scale knowledge bases, we propose a parallel incremental rule mining framework. Our approach efficiently mines rules based on the relational model and applies updates to large knowledge bases; we propose an alternative metric that reduces computational complexity without compromising quality; and we apply multiple optimization techniques that reduce runtime by more than 2 orders of magnitude. Experiments show that our approach scales efficiently to web-scale knowledge bases and saves over 90% of the time compared to the state-of-the-art batch rule mining system. We also apply our optimization techniques to the batch rule mining algorithm, reducing runtime by more than half compared to the state-of-the-art. To the best of our knowledge, our incremental rule mining system is the first that handles updates to web-scale knowledge bases.


1 Introduction

Knowledge Bases (KBs) have attracted significant research interest in recent years: numerous new KBs are being constructed rapidly, and existing ones keep growing in size. Examples include YAGO [13], Google Knowledge Vault [9], and NELL [3]. These KBs store millions to billions of facts about real-world entities such as people, places and organizations. Despite their large sizes, KBs usually keep growing and changing over time. For instance, NELL [3] has been running continuously since January 2010, learning facts from web pages, and DeepDive [23] utilizes database and machine learning techniques and “engineer-in-the-loop” development cycles for incremental knowledge base construction.


The growth in size of KBs provides a great opportunity for deducing new information about the world by reasoning over the facts in KBs. While latent feature models such as embedding based models [24, 21] or methods based on graphical models such as Markov Random Fields [19] have proven to be effective in inferring new facts from KBs, one major drawback is their lack of interpretability. Another approach is to use logical rules based on graph features. For example, the rule

liveIn(x, y) <- liveIn(x, z), isLocatedIn(z, y)

can help learn a person’s country of residence from the city where s/he lives. Such rules not only reveal correlations between facts in an explainable way, but also help in various applications, including reasoning over and expansion of knowledge bases [5] and knowledge base construction [3]. Rules are also used in other applications such as networks [15], medical geography [17] and recommender systems [7].

While mining rules, specifically Horn clauses, has been a major task in Inductive Logic Programming (ILP), classic ILP methods do not scale to today’s large KBs [6, 11]. Amie+ [11] scales up to 12M facts with pruning and approximation strategies, and OP [4] adapts Amie+’s mining model and scales to Freebase with 388M facts. However, to the best of our knowledge, no existing work handles the dynamic aspect of today’s KBs. Those systems work in batches: the state-of-the-art, OP [4], takes 20+ hours to finish on Freebase with fine-tuned parameters, and has to start from scratch whenever facts change or new facts are added. At this scale, it is unrealistic to rerun the batch algorithm every time a small update to the KB arrives.

In this paper, we aim to efficiently mine first-order inference rules incrementally over large growing KBs. We propose a novel incremental rule mining framework to mine rules from large evolving KBs by storing the inference rules and facts in relational tables, and using joins to propagate the updates on KBs.

However, the intermediate results used to compute changes in scoring metrics such as support and confidence are still too large to update efficiently, because those scoring metrics require deduplication. Thus we propose two alternative ways to avoid the storage cost: 1) instead of storing intermediate results, we use filter-and-join to remove duplicates before calculating changes to rule metrics; 2) we propose a new metric that avoids the need for deduplication. We compare the runtime of the two approaches and compare the new metric with standard confidence. We argue that the new metric performs as well on real knowledge bases and validate the claim experimentally. This new metric saves a significant amount of runtime.

We develop an incremental rule mining system that, to the best of our knowledge, is the first that can handle updates to web-scale KBs efficiently. In summary, the contributions in this paper are:

  • We design an efficient incremental rule mining algorithm that can handle updates to large KBs.

  • We propose a new metric that avoids searching or storing the prohibitively large intermediate results required in the state-of-the-art systems, and show experimentally that the new metric works as well as competing metrics.

  • We apply multiple optimization techniques that avoid unnecessary searching and handle data skew, which reduces the runtime by nearly 2 orders of magnitude. We also apply these optimization techniques to the state-of-the-art batch rule mining algorithm, OP [4], and cut its runtime by more than half.

  • We conduct detailed experiments over public large KBs, including YAGO, and Freebase. We validate our optimization techniques, and demonstrate that our system can handle updates to large KBs efficiently.

2 Related Work

Mining Horn Clauses. Researchers have been working on mining rules from knowledge bases [20], especially Horn clauses [14], since the pioneering works on Inductive Logic Programming (ILP) [18]. ILP requires counterexamples, which are absent in knowledge bases that adopt the open world assumption. Amie and Amie+ [11] extended ILP to handle the absence of counterexamples, and can process KBs on the order of 250K and 12M facts, respectively. OP [4, 6] adopts Amie+’s mining model and, by leveraging rule pruning and rule-based partitioning, scales up to Freebase with 388M facts. However, to the best of our knowledge, there is no existing work that can handle large dynamic knowledge bases.

Incremental Association Rule Mining. Association rule mining [1, 10] is a mature research field relevant to mining Horn rules from KBs. Various techniques have been proposed for association rule mining, including the Apriori algorithm [1], partitioning [22] and the frequent pattern tree [12]. Incremental association rule mining has adapted similar techniques from its batch counterparts [8] to handle new transactions. Like association rule mining, we use support and confidence to measure the significance of rules. However, transactions contribute independently to the itemset counts, while facts in KBs are interconnected by the rules and variables [6]. Thus, incremental updates to KBs cannot be handled trivially with techniques from incremental association rule mining.

3 Preliminaries

3.1 Relational Knowledge Base Model.

In RDF [16] knowledge bases, each fact is represented in a triple format (subject, predicate, object). The subject and object refer to real-world entities, and the predicate reveals the relationship between the subject and object. For example, (Barack Obama, LivesIn, USA) encodes the fact that Obama lives in the USA.

3.2 First-order Rule Mining.

In KBs such as Freebase [2], entities and predicates come from an ontology or schema. Each entity belongs to one or more types, and each predicate encodes the relationship between one or more types of entities. For example, a predicate such as LivesIn specifies the relationship between type person and type place. This information can be used to generate a syntactically correct candidate rule set by traversing the ontology [4].

The rules that we aim to mine are of the form:

$$ w:\; h(x, y) \leftarrow B, \qquad B = b_1 \wedge b_2 \wedge \dots \wedge b_k \tag{1} $$

where the body B is a conjunction of atoms, h is the head predicate, and w is the scoring metric measuring the rule’s likelihood.

Similar to OP [4], Amie+ [11] and ILP systems [18], we focus on mining closed and connected Horn rules. Two atoms in a rule are connected if they share an entity or a variable, and a rule is connected if each atom is connected transitively to another atom in the rule. A rule is closed if each entity or variable appears at least twice in the rule. The two restrictions are to avoid mining rules with completely unrelated atoms, or rules that predict merely the existence of a fact [11], limiting the search space for rules and retaining the sensitivity of the rules at the cost of expressiveness.
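
The two restrictions can be checked mechanically. The following is a minimal Python sketch, assuming a rule is given as a list of (predicate, arg1, arg2) atoms with the head included; the function names are hypothetical, and connectedness is interpreted here as all atoms forming a single component through shared arguments.

```python
from collections import Counter

def is_connected(atoms):
    """True if every atom can be reached from the first one via shared arguments."""
    if not atoms:
        return False
    seen, frontier = {0}, [0]
    while frontier:
        i = frontier.pop()
        for j, (_, a, b) in enumerate(atoms):
            if j not in seen and {a, b} & set(atoms[i][1:]):
                seen.add(j)
                frontier.append(j)
    return len(seen) == len(atoms)

def is_closed(atoms):
    """True if every variable or entity appears at least twice in the rule."""
    counts = Counter(arg for _, a, b in atoms for arg in (a, b))
    return all(c >= 2 for c in counts.values())

# liveIn(x, y) <- liveIn(x, z), isLocatedIn(z, y): head first, then body atoms
rule = [("liveIn", "x", "y"), ("liveIn", "x", "z"), ("isLocatedIn", "z", "y")]
print(is_connected(rule), is_closed(rule))
```

A rule such as liveIn(x, y) <- bornIn(x, z) passes the connectivity check but is rejected as not closed, because y and z each occur only once.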

Two first-order clauses are defined to be structurally equivalent if they differ only in predicate symbols but have the same order of variable symbols. For example:

liveIn(x, y) <- liveIn(x, z), isLocatedIn(z, y)
friendWith(x, y) <- actedIn(x, z), directedBy(z, y)

Although the two rules have different predicates and entity types, their structures are equivalent in the sense that the rule length and the order of variables are the same. This enables us to store structurally equivalent rules in one fixed-column table (Table 1), and take advantage of well-optimized relational queries.

Body1 Body2 Head
liveIn isLocatedIn liveIn
actedIn directedBy friendWith
Table 1: One table for one type of equivalent rules.

3.3 Scoring Metric.

Not all mined rules are interesting or valid. This might be due to missing or erroneous facts in the knowledge base or simply because of inherent exceptions to the rule. Another important factor determining a rule’s interestingness is the significance of it, i.e., how many facts can be explained using this rule. To measure the correctness and significance of a rule, different metrics have been proposed in previous related work [11, 4], and we briefly review them here.

Support. If a rule is only applicable to a small number of facts it is not that interesting. Support is a measure of significance of a rule. The support of a rule is the number of facts predicted by the rule in the KB.

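$$ \mathrm{supp}(h(x, y) \leftarrow B) = \#(x, y) : \exists z_1, \dots, z_m : B \wedge h(x, y) \tag{2} $$

where the z_i’s represent any entities connecting the body.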

Confidence. While support measures the number of correct predictions, confidence defines the quality of the rule based on the ratio of correct predictions to the total number of predictions. The most commonly used confidence is the standard confidence, defined as:

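$$ \mathrm{stdconf}(h(x, y) \leftarrow B) = \frac{\mathrm{supp}(h(x, y) \leftarrow B)}{\#(x, y) : \exists z_1, \dots, z_m : B} \tag{3} $$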

The support and standard confidence defined above incur extra computational costs in an incremental rule mining setting, hence, we introduce an alternative confidence in Section 6 to reduce the workload.
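
To make the two definitions concrete, the following Python sketch computes support and standard confidence for a single rule of the form h(x, y) <- b1(x, z), b2(z, y) over a toy KB. The facts and the function names are illustrative assumptions, not data from our experiments.

```python
def body_pairs(kb, b1, b2):
    """Distinct (x, y) for which some z satisfies b1(x, z) and b2(z, y)."""
    by_subject = {}
    for pred, s, o in kb:
        if pred == b2:
            by_subject.setdefault(s, set()).add(o)
    return {(x, y) for pred, x, z in kb if pred == b1
            for y in by_subject.get(z, ())}

def support_and_stdconf(kb, head, b1, b2):
    predicted = body_pairs(kb, b1, b2)            # denominator of stdconf
    correct = {(x, y) for (x, y) in predicted if (head, x, y) in kb}
    conf = len(correct) / len(predicted) if predicted else 0.0
    return len(correct), conf                     # (support, stdconf)

# Toy KB of (predicate, subject, object) triples (hypothetical facts)
toy_kb = {
    ("liveIn", "ann", "paris"), ("liveIn", "bob", "lyon"), ("liveIn", "cat", "rome"),
    ("isLocatedIn", "paris", "france"), ("isLocatedIn", "lyon", "france"),
    ("isLocatedIn", "rome", "italy"),
    ("liveIn", "ann", "france"), ("liveIn", "cat", "italy"),
}
print(support_and_stdconf(toy_kb, "liveIn", "liveIn", "isLocatedIn"))
```

In this toy KB the body predicts three distinct pairs and two of them are present as facts, so the support is 2 and the standard confidence is 2/3.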

3.4 Spark Basics.

Apache Spark [25] is a general-purpose cluster computing system built on the abstraction of resilient distributed datasets (RDDs), collections of objects that support parallel operations. Our method uses the following RDD operations: map, groupByKey, reduceByKey and join; further details are available on the Spark website (https://spark.apache.org/docs/latest/rdd-programming-guide.html).

4 Incremental Mining Framework

In this section we first give a formal definition of our problem. We then outline our general framework for incremental rule mining, and discuss one module shared across the framework: incremental inference.

4.1 Problem Definition.

Let ΔK_i, i = 1, 2, ..., be sets of an arbitrary number of triples of the form (subject, predicate, object), each conforming to a schema. Let K, the union of all ΔK_i, be an evolving knowledge base, and let K_c be the currently cumulated KB. We aim to mine first-order Horn clauses from K_c ∪ ΔK for a newly arrived update ΔK, given the scores previously mined from K_c.

4.2 General Framework.

The main objective is to compute updated scoring metrics after absorbing new facts ΔK. Figure 1 provides a brief example of this update. At time t, the set of candidate rules M, together with two facts in the existing KB K_c, contributes to the rule metrics R. For each rule we record the number of correct predictions and the total number of predictions instead of the confidence, which can be readily obtained by dividing the two stored values. At time t+1, an update ΔK arrives. To update the scoring metrics: 1) a new fact in ΔK, together with an existing fact and a rule, can infer a fact that already exists in the KB, i.e. a correct prediction, thus contributing (1, 1) to that rule's counts; 2) a new fact in ΔK can be inferred from existing facts in K_c, thus contributing (1, 0). By handling the two cases, we obtain the changes to the rule metrics R.

Figure 1: Handling an update ΔK, showing the candidate rule set M, the accumulated KB K_c, the intermediate result (which can optionally be materialized), and the rules' metrics R.

Based on the example above, we see that the update facts in ΔK can appear in the body or as the head of a rule, and we need to handle the two cases differently:

  • Case 1: A fact in ΔK appears as a part of the body of some rules. We need to run the inference again, applying (to K_c and ΔK) only the rules in M that may have a fact in ΔK participating in their body. However, inferred facts, or predictions, generated this way might have been inferred before using the same rules with facts solely from K_c. Those duplicate facts inferred by the same rules need to be ruled out when calculating the changes to the metrics. A straightforward way is to store and maintain the previously inferred (fact, rule ID) pairs, i.e. the intermediate result in Figure 1, and subsequently check whether the newly inferred facts exist in K_c to obtain the changes to rule metrics.

  • Case 2: A fact in ΔK appears as the head of some rules. We need to update the scoring metrics for the relevant rules that infer this fact, as their predictions have come true. This is straightforward if we maintain the intermediate result (i.e. the predictions with rule IDs above): we just check whether the new fact in ΔK exists in the intermediate result (both the previously stored predictions and the newly inferred ones), and update the scoring metrics for the relevant rules.

Based on the observation above, we propose Algorithm 1, an incremental rule mining framework which accepts an evolving KB K, with an upper limit l on the length of the rules to be mined.

1:  M ← construct-rules(K.schema, l)
2:  R ← ∅
3:  K_c ← ∅
4:  for all ΔK ∈ K do
5:     I ← Inc-Infer(K_c, ΔK, M)
6:     R ← R + Infer-Update(I, K_c)
7:     K_c ← K_c ∪ ΔK
8:     R ← R + Check-Update(ΔK, K_c, M)
9:     output R
Algorithm 1 Inc-Rule-Mining

Algorithm 1 outlines the major steps of our incremental framework: first we generate the syntactically valid rules from the schema (Line 1) using the ontological path-finding algorithm [4], then we process each update ΔK iteratively. In each iteration, we run incremental inference to get inferred facts with rule IDs (Line 5), and use them to get the changes to the scoring metrics for case 1, where the update facts act as the body of rules (Line 6). Finally we get the changes to the metrics for case 2, where the update facts appear as the head of rules (Line 8).

We focus on the commonly used module Inc-Infer in the rest of this section. Calculating the changes to scoring metrics in Infer-Update and Check-Update will be detailed in Section 5 and Section 6.

4.3 Incremental Inference.

Before calculating the changes to the candidate rules’ scoring metrics, we need to infer new facts, tagged with the IDs of the rules in M that produce them, using the update facts in ΔK. Without loss of generality, we describe our algorithms based on the following class of rules:

h(x, y) <- b1(x, z), b2(z, y)

It is straightforward to generalize to other types of rules (see [4] for the full list of rule types).

Require: K_c = {(pred_c, sub_c, obj_c)}
Require: ΔK = {(pred_i, sub_i, obj_i)}
Require: M = {(ID, head, body1, body2)}
1:  preds ← M.groupBy(body1)
2:  FlatMap each fact (pred_i, sub_i, obj_i) ∈ ΔK to: for all (ID, head, body1, body2) ∈ preds.get(pred_i), yield list of ((body2, obj_i), (head, sub_i, ID))
3:  Map each fact (pred_c, sub_c, obj_c) ∈ K_c to ((pred_c, sub_c), obj_c)
4:  Join output from Line 2 and Line 3 on (body2, obj_i) = (pred_c, sub_c) to get {((pred_c, sub_c), ((head, sub_i, ID), obj_c))}
5:  res1 ← Map each ((pred_c, sub_c), ((head, sub_i, ID), obj_c)) to (head, sub_i, obj_c, ID)
6:  res2 ← {similarly for ΔK appearing as body2 in M}
7:  res3 ← {similarly for ΔK appearing as both body1 and body2 in M}
8:  return res1 ∪ res2 ∪ res3
Algorithm 2 Inc-Infer(K_c, ΔK, M)

Algorithm 2 infers new predictions given the new facts ΔK, the cumulated KB K_c and the candidate rule set M. There are 3 cases to handle in Algorithm 2: the new facts in ΔK act only as body1, only as body2, or as both. For brevity, we only show how to handle the first case; the rest can be implemented similarly. First we do a map-side join of ΔK with the rules in M (Line 1-2), then we join the result with K_c, whose facts participate as body2 (Line 3-4), to get the inferred facts with rule IDs (Line 5). We start from the incremental update ΔK, which we assume to be much smaller than K_c, thus avoiding large intermediate results as much as possible.
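
The three cases can be illustrated on a single machine with hash indexes in place of Spark joins. The sketch below is a simplified stand-in under stated assumptions, not our distributed implementation: the function name inc_infer and the in-memory indexes are hypothetical, the update is assumed to be disjoint from the existing KB, and one prediction is emitted per body instantiation.

```python
from collections import defaultdict

def inc_infer(kb, delta, rules):
    """Predictions (head, x, y, rule_id) of h(x,y) <- b1(x,z), b2(z,y)
    whose body uses at least one fact from delta."""
    def index(facts, key_pos):
        idx = defaultdict(list)
        for fact in facts:
            idx[(fact[0], fact[key_pos])].append(fact)
        return idx

    kb_by_sub, kb_by_obj = index(kb, 1), index(kb, 2)
    delta_by_sub = index(delta, 1)
    rules_by_b1, rules_by_b2 = defaultdict(list), defaultdict(list)
    for rule in rules:
        rules_by_b1[rule[2]].append(rule)
        rules_by_b2[rule[3]].append(rule)

    out = []
    for pred, s, o in delta:
        for rid, head, _, b2 in rules_by_b1.get(pred, ()):
            # res1: new fact is b1(x, z), b2(z, y) comes from the existing KB
            out += [(head, s, y, rid) for _, _, y in kb_by_sub.get((b2, o), ())]
            # res3: both body atoms come from the update
            out += [(head, s, y, rid) for _, _, y in delta_by_sub.get((b2, o), ())]
        for rid, head, b1, _ in rules_by_b2.get(pred, ()):
            # res2: new fact is b2(z, y), b1(x, z) comes from the existing KB
            out += [(head, x, o, rid) for _, x, _ in kb_by_obj.get((b1, s), ())]
    return out

rules = [(1, "liveIn", "liveIn", "isLocatedIn")]
kb = {("isLocatedIn", "paris", "france"), ("liveIn", "bob", "lyon")}
delta = {("liveIn", "ann", "paris"), ("isLocatedIn", "lyon", "france")}
print(sorted(inc_infer(kb, delta, rules)))
```

Iterating over the update rather than over the KB mirrors the design choice above: the work is proportional to the size of the update and the number of matching KB facts.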

5 Mining Using Standard Confidence.

Here we first introduce our vanilla approach, in which the intermediate result is stored and maintained. Then we propose a searching approach that avoids maintaining the large intermediate result. The searching approach requires computing part of the intermediate result on the fly by searching the whole KB K_c. We discuss how to optimize this operation. We also improve the performance of batch rule mining algorithms using those optimization techniques.

5.1 Vanilla Approach.

Maintaining a deduplicated copy of the intermediate result makes applying updates straightforward. First, obtain the deduplicated newly inferred facts I via Algorithm 2. For case 1, remove from I the tuples already present in the stored intermediate result, and join the remainder with K_c to calculate the changes to the relevant rules’ metrics. For case 2, the update facts ΔK can be compared with the stored intermediate result to obtain the changes for the relevant rules. For brevity, we omit the algorithms here.
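
Since the algorithms are omitted, the following single-machine Python sketch illustrates one way the vanilla approach could be organized. It is an assumption-laden illustration rather than our implementation: the class VanillaMiner, the helper infer, and the reference function batch_metrics are hypothetical names, and the stored intermediate result is kept as an in-memory map from predicted fact to rule IDs.

```python
from collections import defaultdict

def infer(b1_facts, b2_facts, rules):
    """Distinct predictions (head, x, y, rule_id) of h(x,y) <- b1(x,z), b2(z,y)."""
    by_sub = defaultdict(list)
    for p, s, o in b2_facts:
        by_sub[(p, s)].append(o)
    return {(head, x, y, rid)
            for rid, head, b1, b2 in rules
            for p, x, z in b1_facts if p == b1
            for y in by_sub.get((b2, z), ())}

class VanillaMiner:
    def __init__(self, rules):
        self.rules = rules
        self.kb = set()                              # cumulated KB
        self.stored = defaultdict(set)               # predicted fact -> rule ids
        self.metrics = defaultdict(lambda: [0, 0])   # rule id -> [correct, total]

    def apply_update(self, delta):
        delta = set(delta) - self.kb
        merged = self.kb | delta
        # Case 1: bodies touching the update; drop predictions stored earlier.
        inferred = infer(delta, merged, self.rules) | infer(merged, delta, self.rules)
        fresh = {t for t in inferred if t[3] not in self.stored.get(t[:3], ())}
        for head, x, y, rid in fresh:
            self.metrics[rid][1] += 1
            self.metrics[rid][0] += (head, x, y) in self.kb
            self.stored[(head, x, y)].add(rid)
        # Case 2: update facts that a stored prediction already covers.
        for fact in delta:
            for rid in self.stored.get(fact, ()):
                self.metrics[rid][0] += 1
        self.kb = merged

def batch_metrics(kb, rules):
    """Reference: recompute [correct, total] per rule from scratch."""
    metrics = defaultdict(lambda: [0, 0])
    for head, x, y, rid in infer(kb, kb, rules):
        metrics[rid][1] += 1
        metrics[rid][0] += (head, x, y) in kb
    return dict(metrics)

rules = [(1, "liveIn", "liveIn", "isLocatedIn")]
miner = VanillaMiner(rules)
miner.apply_update({("liveIn", "ann", "paris"), ("isLocatedIn", "paris", "france"),
                    ("liveIn", "bob", "lyon")})
miner.apply_update({("isLocatedIn", "lyon", "france"), ("liveIn", "ann", "france")})
print(dict(miner.metrics), batch_metrics(miner.kb, rules))
```

After both updates the incrementally maintained counts agree with a from-scratch recomputation, which is the invariant the vanilla approach relies on.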

5.2 Mining via Searching.

To avoid storing the intermediate result, we can run the inference again on K_c to obtain it on the fly. However, the complexity would then be equivalent to that of the batch mining algorithm. To minimize the inference cost, we filter K_c before inference.

Require: K_c = {(pred_c, sub_c, obj_c)}
Require: I = {(head, sub, obj, ID)}
Require: M = {(ID, head, body1, body2)}
1:  I ← Distinct(I)
2:  leftOuterJoin: filter out tuples in I that also exist in Search(I, K_c, M)
3:  leftOuterJoin the remaining I with K_c, yield (ID, (exist, 1)), where exist=1 if the inferred fact exists, otherwise exist=0.
4:  reduceByKey summing the (exist, 1) pairs for each ID
Algorithm 3 Infer-UpdateSearch(K_c, I, M)
Require: ΔK = {(head_i, sub_i, obj_i)}
Require: K_c = {(pred_c, sub_c, obj_c)}
Require: M = {(ID, head, body1, body2)}
1:  tmp ← Distinct(Search(ΔK, K_c, M))
2:  Join ΔK with tmp, and issue (ID, (1, 0)) tuples.
3:  reduceByKey summing the (1, 0) pairs for each ID
Algorithm 4 Check-UpdateSearch(ΔK, K_c, M)

Algorithm 3 deals with case 1, where the update facts appear in the body of the rules. Instead of maintaining the intermediate result, we generate the part of it that can overlap with the newly inferred facts I (i.e. Algorithm 5 tries to infer facts from K_c with rules that can possibly ‘hit’ facts in I, in order to deduplicate I).

Algorithm 4 deals with case 2 similarly, but instead of using the stored intermediate result, we generate on the fly the partial result that could possibly infer ΔK from existing facts. This is done using Algorithm 5 detailed below.

5.2.1 Searching.

Algorithm 5 enables us to search the large intermediate result without materializing it, thus significantly improving performance. First we filter K_c with the entities (subjects and objects) in the input facts S to get f1 and f2, and group them by the join entity z to get the list of (predicate, entity) pairs for each z. Finally, we join the grouped lists on z, and for each matched pair of lists connected by z, we apply the join in Algorithm 6 to obtain the inferred facts, i.e. the part of the intermediate result relevant to the input S.

Require: S = {(pred_s, sub_s, obj_s)}
Require: K_c = {(pred_b, sub_b, obj_b)}
Require: M = {(ID, head, body1, body2)}
1:  f1 ← filter facts in K_c that have sub_b appearing as sub_s in S
2:  f2 ← filter facts in K_c that have obj_b appearing as obj_s in S
3:  g1 ← groupByKey f1, yielding a list of (pred, sub) for each obj, with max group size limit m.
4:  g2 ← groupByKey f2, yielding a list of (pred, obj) for each sub, with max group size limit m.
5:  Join g1 with g2 (apply Algorithm 6), yielding tuples of (pred, sub, obj, ID)
Algorithm 5 Search(S, K_c, M)
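
The filter-group-join structure of the search can be sketched on a single machine as follows. This is a simplified illustration under assumptions: the function name search and the dictionary-based grouping are hypothetical, the group size limit is omitted, and the nested loops stand in for the join of Algorithm 6.

```python
from collections import defaultdict

def search(targets, kb, rules):
    """Predictions (head, x, y, rule_id) derivable from kb whose subject x and
    object y both occur in `targets` (as some subject / some object)."""
    subs = {s for _, s, _ in targets}
    objs = {o for _, _, o in targets}
    g1 = defaultdict(list)   # z -> [(pred, x)] from facts pred(x, z) with x in subs
    g2 = defaultdict(list)   # z -> [(pred, y)] from facts pred(z, y) with y in objs
    for p, s, o in kb:
        if s in subs:
            g1[o].append((p, s))
        if o in objs:
            g2[s].append((p, o))
    rules_by_body = defaultdict(list)
    for rid, head, b1, b2 in rules:
        rules_by_body[(b1, b2)].append((rid, head))
    out = []
    for z in g1.keys() & g2.keys():          # join the two groupings on z
        for p1, x in g1[z]:
            for p2, y in g2[z]:
                for rid, head in rules_by_body.get((p1, p2), ()):
                    out.append((head, x, y, rid))
    return out

rules = [(1, "liveIn", "liveIn", "isLocatedIn")]
kb = {("liveIn", "ann", "paris"), ("isLocatedIn", "paris", "france"),
      ("liveIn", "bob", "lyon"), ("isLocatedIn", "lyon", "france")}
targets = {("liveIn", "ann", "france")}
print(search(targets, kb, rules))
```

The result is a superset of the predictions that exactly match the input facts, since the filter tests subjects and objects independently; the subsequent join with the input facts removes the extra candidates.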

Algorithm 6 is a hash-join algorithm. We process each pair of matched lists by searching for rules in M that can be applied to them, and output the inferred facts and rule IDs if available. The algorithm minimizes the search cost by iterating through the smaller of the matched lists and the rule set, and probing the larger via hash maps.

1:  if |M| < |L1| × |L2| then
2:     preds1 ← L1.groupBy(pred)
3:     preds2 ← L2.groupBy(pred)
4:     for r ∈ M do
5:         for sub1 ∈ preds1.get(r.body1) do
6:            for obj2 ∈ preds2.get(r.body2) do
7:               emit (r.head, sub1, obj2, r.ID)
8:  else
9:     rules ← M.groupBy((body1, body2))
10:     for (pred1, sub1) ∈ L1 do
11:         for (pred2, obj2) ∈ L2 do
12:            for r ∈ rules.get((pred1, pred2)) do
13:               emit (r.head, sub1, obj2, r.ID)
Algorithm 6 Group-Join-Adaptive(obj, L1 = {(pred, sub)}, L2 = {(pred, obj)}, M)
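
A minimal Python rendering of the adaptive join is given below. It is a sketch under assumptions: the function name group_join_adaptive is hypothetical, the switching condition compares the number of rules with the number of fact pairs, and the hash maps are rebuilt on every call, whereas a real system would build the rule map once and reuse it.

```python
from collections import defaultdict

def group_join_adaptive(left, right, rules):
    """left: [(pred, x)] from facts pred(x, z); right: [(pred, y)] from facts
    pred(z, y) sharing the same z; rules: [(rule_id, head, body1, body2)]."""
    out = []
    if len(rules) < len(left) * len(right):
        # Few rules: loop over rules and probe hash maps built from the lists.
        left_by_pred, right_by_pred = defaultdict(list), defaultdict(list)
        for p, x in left:
            left_by_pred[p].append(x)
        for p, y in right:
            right_by_pred[p].append(y)
        for rid, head, b1, b2 in rules:
            for x in left_by_pred.get(b1, ()):
                for y in right_by_pred.get(b2, ()):
                    out.append((head, x, y, rid))
    else:
        # Few fact pairs: loop over pairs and probe a hash map of rules.
        rules_by_body = defaultdict(list)
        for rid, head, b1, b2 in rules:
            rules_by_body[(b1, b2)].append((rid, head))
        for p1, x in left:
            for p2, y in right:
                for rid, head in rules_by_body.get((p1, p2), ()):
                    out.append((head, x, y, rid))
    return out

left = [("liveIn", "ann"), ("bornIn", "ann")]
right = [("isLocatedIn", "france")]
few_rules = [(1, "liveIn", "liveIn", "isLocatedIn")]
many_rules = few_rules + [(2, "bornInCountry", "bornIn", "isLocatedIn"),
                          (3, "unused", "a", "b")]
print(group_join_adaptive(left, right, few_rules))
print(sorted(group_join_adaptive(left, right, many_rules)))
```

Both branches emit the same predictions for the same inputs; only the loop order, and therefore the cost, differs.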

5.2.2 Optimization Techniques.

While Algorithm 5 is similar to the batch rule mining algorithm in OP [4] in how it applies the rules to generate the intermediate result, there are three important improvements.

  • Adaptive Join: Algorithm 6 tries to minimize the cost of searching for matching pairs. Due to the skewed power-law degree distribution of natural knowledge graphs, most of the matched lists are very small, while a few can be very large. Unlike OP, where the searching loop always starts from the rule set M, which usually contains a significant number of candidate rules, our algorithm starts searching from the smaller side: either the rule set or the matched lists.

  • Handle Data Skew: The runtime of Algorithm 6 is dominated by the largest matched lists in Line 5. This data skew causes one or a few long-running tasks and leads to resource under-utilization. We impose a size limit on each group in Algorithm 5 (Line 3-4) to redistribute the workload evenly.

  • Adaptive Filter: In Algorithm 5 there are two approaches to filtering: 1) broadcast the subjects/objects of the input facts to each worker to apply filtering; 2) join the subjects/objects of the input facts with K_c to apply filtering. While the broadcasting method is suitable only for small updates, large updates mandate the join-filtering method. We apply an adaptive filtering approach to minimize the filtering cost.

In Section 7 we conduct experimental analysis and show that our optimization techniques above can reduce runtime drastically by almost 2 orders of magnitude.
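
The group size limit can be illustrated with a short Python sketch. It assumes two in-memory groupings keyed by the join entity; the function names split_group and skew_aware_tasks are hypothetical. Each oversized group is cut into chunks, and every chunk on one side must be paired with every chunk on the other side that shares the key.

```python
from itertools import product

def split_group(values, max_size):
    """Cut one group into consecutive chunks of at most max_size elements."""
    return [values[i:i + max_size] for i in range(0, len(values), max_size)]

def skew_aware_tasks(g1, g2, max_size):
    """g1, g2: dict mapping join key -> list. Returns (key, left_chunk,
    right_chunk) join tasks, none of which exceeds max_size on either side."""
    tasks = []
    for key in g1.keys() & g2.keys():
        chunks = product(split_group(g1[key], max_size),
                         split_group(g2[key], max_size))
        tasks.extend((key, lc, rc) for lc, rc in chunks)
    return tasks

g1 = {"z": [1, 2, 3, 4, 5]}
g2 = {"z": ["a", "b", "c"], "w": ["d"]}
for task in skew_aware_tasks(g1, g2, max_size=2):
    print(task)
```

With a limit of 2, the group of five and the group of three become 3 × 2 = 6 tasks that together still cover all 15 pairs. A smaller limit bounds the longest task but multiplies the number of tasks, which is the trade-off behind the choice of the limit.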

6 Mining Using New Metric

The two incremental mining algorithms in the previous section either require storing the large intermediate result or suffer from a large search burden. This is because the standard scoring metrics are holistic: in the definitions of support (2) and confidence (3), the number of distinct (x, y) pairs is counted. Thus we need to deduplicate the intermediate result, which becomes costly in the incremental setting.

6.1 New Confidence Metric

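We introduce a new metric for measuring the correctness of a rule: xconf. Our metric measures confidence by normalizing the number of instantiations of the whole rule by the total number of body-only instantiations:

$$ \mathrm{xconf}(h(x, y) \leftarrow B) = \frac{\#(x, y, z_1, \dots, z_m) : B \wedge h(x, y)}{\#(x, y, z_1, \dots, z_m) : B} \tag{4} $$

where B is the rule body and the z_i’s are the entities connecting it. In other words, xconf does not enforce the uniqueness constraint on (x, y) pairs that standard confidence does.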
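
The difference between the two metrics shows up as soon as one (x, y) pair is supported by several body instantiations. The Python sketch below computes both on a toy KB; the facts and the function name confidences are illustrative assumptions.

```python
def confidences(kb, head, b1, b2):
    """Return (stdconf, xconf) for the rule head(x,y) <- b1(x,z), b2(z,y)."""
    by_sub = {}
    for p, s, o in kb:
        if p == b2:
            by_sub.setdefault(s, []).append(o)
    # One (x, y) entry per body instantiation (x, z, y); duplicates are kept.
    inst = [(x, y) for p, x, z in kb if p == b1 for y in by_sub.get(z, ())]
    hits = [(x, y) for (x, y) in inst if (head, x, y) in kb]
    if not inst:
        return 0.0, 0.0
    stdconf = len(set(hits)) / len(set(inst))   # distinct (x, y) pairs
    xconf = len(hits) / len(inst)               # every instantiation counts
    return stdconf, xconf

toy_kb = {
    ("liveIn", "ann", "paris"), ("liveIn", "ann", "nice"),
    ("isLocatedIn", "paris", "france"), ("isLocatedIn", "nice", "france"),
    ("liveIn", "ann", "france"),
    ("liveIn", "bob", "rome"), ("isLocatedIn", "rome", "italy"),
}
print(confidences(toy_kb, "liveIn", "liveIn", "isLocatedIn"))
```

Here the pair (ann, france) is reached through two different cities, so it counts once for stdconf (1 of 2 distinct pairs) and twice for xconf (2 of 3 instantiations).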

Using a rule’s number of instantiations to measure support is not generally suitable, because the number of instantiations is not monotonic: with this definition, adding new atoms to a rule can artificially increase the support [11]. For example, adding an extra atom that introduces a fresh variable z to a rule’s body makes the rule narrower, but the number of instantiations increases with every binding of z.
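
The effect can be reproduced with a few lines of Python. The predicates liveIn and hasFriend and the facts below are hypothetical, chosen only to show how a fresh variable z inflates the instantiation count while the number of distinct (x, y) pairs shrinks.

```python
def count_pairs_and_instantiations(kb):
    """Compare the body liveIn(x, y) with liveIn(x, y), hasFriend(x, z)."""
    lives = [(x, y) for p, x, y in kb if p == "liveIn"]
    friends = {}
    for p, x, z in kb:
        if p == "hasFriend":
            friends.setdefault(x, []).append(z)
    base = len(lives)                                    # body: liveIn(x, y)
    extended = [(x, y, z) for x, y in lives for z in friends.get(x, ())]
    pairs = len({(x, y) for x, y, _ in extended})        # distinct (x, y)
    return base, pairs, len(extended)

toy_kb = {
    ("liveIn", "ann", "paris"), ("liveIn", "bob", "rome"),
    ("hasFriend", "ann", "cat"), ("hasFriend", "ann", "dan"),
    ("hasFriend", "ann", "eve"),
}
print(count_pairs_and_instantiations(toy_kb))
```

The extended body matches fewer distinct pairs (1 instead of 2) but more instantiations (3 instead of 2), one per friend of ann.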

Nevertheless, it does not imply that the number of instantiations cannot be used to measure confidence. In xconf the denominator also incorporates the instantiations, and this normalizes the effect of adding new atoms mentioned above, except in very few corner cases. Additionally, rules longer than length 3 mostly reduce to shorter rules, or are erroneous, and are expensive to learn as runtime grows exponentially as rule length grows [4]. Thus we are mostly interested in closed horn rules of length 2 or 3. In such rules, the problems mentioned above do not arise and one can use the number of instantiations to measure significance.

It is trivial to support deletion of facts using the new metric once addition is implemented, as no deduplication is needed. For standard confidence, supporting deletion requires keeping track of the number of body instantiations (path counts) for the same head, resulting in a slight increase in storage cost and runtime. For simplicity we omit deletion in this paper.

6.2 Mining Using Xconf

Algorithm 7 is adapted from Algorithm 3 to handle the first case, where the update facts appear in the body of rules. Here we do not deduplicate, and thus avoid filtering out duplicate tuples in I using the Search algorithm.

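Require: K_c = {(pred_c, sub_c, obj_c)}
Require: I = {(head, sub, obj, ID)}
1:  leftOuterJoin I with K_c, yield (ID, (exist, 1)), where exist=1 if the inferred fact exists, otherwise exist=0.
2:  res ← reduceByKey summing the (exist, 1) pairs for each ID
Algorithm 7 Infer-Updatexconf(K_c, I)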

Algorithm 8 is almost identical to Algorithm 4, except that we remove the deduplication step and directly calculate the changes to the new metric from the possible inferred facts returned by Search, as all the body instantiations contribute separately.

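Require: ΔK = {(head_i, sub_i, obj_i)}
Require: K_c = {(pred_c, sub_c, obj_c)}
Require: M = {(ID, head, body1, body2)}
1:  tmp ← Search(ΔK, K_c, M)
2:  Join ΔK with tmp, and issue (ID, (1, 0)) tuples.
3:  reduceByKey summing the (1, 0) pairs for each ID
Algorithm 8 Check-Updatexconf(ΔK, K_c, M)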
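
Because every body instantiation contributes separately, the xconf counters can be updated with plain additions. The following single-machine Python sketch illustrates this under assumptions: the function names are hypothetical, the update is assumed to be disjoint from the existing KB, and a simple entity filter stands in for the distributed Search step.

```python
from collections import defaultdict

def instantiations(b1_facts, b2_facts, rules):
    """One (head, x, y, rule_id) per body instantiation of h(x,y) <- b1(x,z), b2(z,y)."""
    by_sub = defaultdict(list)
    for p, s, o in b2_facts:
        by_sub[(p, s)].append(o)
    return [(head, x, y, rid)
            for rid, head, b1, b2 in rules
            for p, x, z in b1_facts if p == b1
            for y in by_sub.get((b2, z), ())]

def xconf_batch(kb, rules):
    """Reference: [hits, total] per rule recomputed from scratch."""
    metrics = defaultdict(lambda: [0, 0])
    for head, x, y, rid in instantiations(kb, kb, rules):
        metrics[rid][1] += 1
        metrics[rid][0] += (head, x, y) in kb
    return metrics

def xconf_update(kb, delta, rules, metrics):
    """Apply `delta` (disjoint from kb) to metrics in place; return the merged KB."""
    # Case 1: bodies with at least one new fact, heads checked in the old KB.
    new_inst = (instantiations(delta, kb, rules) + instantiations(kb, delta, rules)
                + instantiations(delta, delta, rules))
    for head, x, y, rid in new_inst:
        metrics[rid][1] += 1
        metrics[rid][0] += (head, x, y) in kb
    merged = kb | delta
    # Case 2: new facts as heads; only facts sharing an entity with delta matter.
    subs = {s for _, s, _ in delta}
    objs = {o for _, _, o in delta}
    left = [f for f in merged if f[1] in subs]
    right = [f for f in merged if f[2] in objs]
    for head, x, y, rid in instantiations(left, right, rules):
        metrics[rid][0] += (head, x, y) in delta
    return merged

rules = [(1, "liveIn", "liveIn", "isLocatedIn")]
base = {("liveIn", "ann", "paris"), ("isLocatedIn", "paris", "france"),
        ("liveIn", "bob", "lyon")}
delta = {("isLocatedIn", "lyon", "france"), ("liveIn", "ann", "france"),
         ("liveIn", "bob", "france")}
metrics = xconf_batch(base, rules)
merged = xconf_update(base, delta, rules, metrics)
print(dict(metrics), dict(xconf_batch(merged, rules)))
```

On this example the incrementally updated counters match a full recomputation over the merged KB, and no prediction ever has to be deduplicated.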

7 Experiment

We conduct experiments on two real-world KBs: YAGO [13] and Freebase [2]. First, we compare the 3 variants of our incremental algorithm with the state-of-the-art batch rule mining system OP [4]. We show that for different update sizes, our incremental algorithms can easily save more than 90% of the time compared to rerunning OP. Second, we compare our new metric with the standard confidence in terms of rule quality and confirm that our new metric is close to standard confidence scores on real KBs. Third, we show how our optimization techniques for Algorithm 5 speed up our incremental algorithm.

YAGO. YAGO [13] is a multilingual knowledge base derived from Wikipedia, WordNet and GeoNames. We use the same version of YAGO2s used in OP.

Freebase. Freebase [2] is a large collaborative knowledge base constructed from many sources. Freebase has 1.9B facts (https://developers.google.com/freebase/); we preprocess the dataset by removing the multi-language support and use the remaining 344M facts. Statistics of the KBs are listed in Table 2.

KB         # Entities     # Facts
YAGO2s     2,137,468      4,484,907
Freebase   110,459,875    344,192,734

Parameter        Default
Max length       3
Max group size   30,000
Min support      0
Min confidence   0.0

Table 2: (Left) Dataset statistics. (Right) Default parameters.

Experimental Setup. We run experiments on a 64-core machine running AMD Opteron Processor (6376), with 512GB memory and 3TB disk space on Ubuntu 14.04.3 LTS with kernel version 3.13.0-68-generic, with software: Spark 2.2.0, in Scala 2.12 and Java 1.8.

7.1 Incremental vs Batch.

We use the state-of-the-art batch rule mining system OP [4] as the baseline. We also include the runtime of our implementation of batch rule mining enhanced with our optimization techniques (https://bitbucket.org/datasci/fast-rule-mining). Results are summarized in Table 3: OP takes 24.3 mins and 21.9 hours to finish mining on YAGO2s and Freebase, respectively. Our batch implementation takes 11.5 mins and 9.0 hours, less than half the time of OP.

Runtime YAGO2s Freebase
OP 24.3 mins 21.9 hrs
Fast OP (ours) 11.5 mins 9.0 hrs
Table 3: Batch Runtime

To evaluate the incremental algorithms, we randomly divide both datasets into two parts: 90% as the base (K_c), and updates of different sizes (ΔK) from 1% to 10%. We evaluate the incremental algorithms by applying updates of different sizes to the base, and report the runtime of our incremental algorithms.

Figure 2(a) shows the runtime of our incremental algorithms on YAGO2s with different update sizes. The incremental version with the new metric xconf takes about 2.4 mins for a 1% update and 4.1 mins for a 10% update. For the vanilla approach, the runtime is 5.9 mins for a 1% update and 8.3 mins for a 10% update; for the searching approach, the runtime goes up from 4.9 mins for a 1% update to 11.1 mins for a 10% update.

Figure 2: Runtime with different update sizes: (a) YAGO2s, (b) Freebase.

Similarly, Figure 2(b) shows the incremental mining runtime on Freebase. The new metric xconf performs the best: for a 1% update, it takes only about 0.61 hours to apply the update and get the changes to the rule scoring metrics, and only 1.16 hours to apply a 10% update, which is about 5% of the batch counterpart's runtime, i.e. a saving of almost 95%. The vanilla approach handles a 1% update within 1.8 hours and a 10% update in 2.4 hours, at the cost of storing an intermediate result more than 10 times larger than Freebase, which can grow rapidly as the KB gets less sparse. As for the searching approach, a 1% update takes about 4.2 hours and a 10% update about 6.2 hours, about 6 times slower than the new metric and 2-3 times slower than the vanilla approach. Still, the searching approach takes about an acceptable 25% of OP’s time to handle a 10% update. The performance boost shows that our optimization techniques are more effective for larger KBs.
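
The savings for xconf follow directly from the reported runtimes. The short Python sketch below redoes the arithmetic using only the Freebase figures quoted above and in Table 3; the variable and function names are hypothetical.

```python
batch_hours = 21.9                        # OP on Freebase (Table 3)
xconf_hours = {"1%": 0.61, "10%": 1.16}   # incremental runtime with xconf

def share_of_batch(hours, batch=batch_hours):
    """Incremental runtime as a percentage of the batch runtime."""
    return 100.0 * hours / batch

for size, hours in xconf_hours.items():
    pct = share_of_batch(hours)
    print(f"{size} update: {pct:.1f}% of batch runtime, saving {100 - pct:.1f}%")
```

A 10% update therefore costs about 5.3% of a full batch run, and a 1% update about 2.8%.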

7.2 Xconf vs. Stdconf

We compare xconf to stdconf via the rules mined from YAGO2s and Freebase. From the definitions in (3) and (4), it is clear that a rule will have a non-zero stdconf iff it has a non-zero xconf, thus both algorithms will produce the same set of rules but possibly with different confidence values. The histogram of difference between the two metrics for each rule on both datasets is shown in Figure 3. Almost all rules mined have a very small difference in confidence score when measured with the two metrics. When different, xconf is usually higher than stdconf in both KBs. Only a few rules have a significant difference in confidence scores, and some are shown in Table 4.

KB         Rule                                                                                              stdconf  xconf
YAGO2s     dealsWith(z,x), imports(z,y) => exports(x,y)                                                      0.06     0.13
YAGO2s     influences(z,x), hasGender(z,y) => hasGender(x,y)                                                 0.81     0.89
YAGO2s     dealsWith(z,x), dealsWith(y,z) => dealsWith(x,y)                                                  0.31     0.45
Freebase   model.parent_aircraft_model(x,z), model.manufacturer(z,y) => model.manufacturer(x,y)              0.68     0.74
Freebase   location.partially_contains(x,z), location.containedby(z,y) => location.partially_contains(x,y)   0.06     0.10
Table 4: Example rules with a notable difference between stdconf and xconf.
Figure 3: Histogram of difference between the two confidence metrics of rules mined on YAGO2s and Freebase.

7.3 Effect of Optimization Techniques

We experimentally examine how optimization techniques in Section 5.2.2 reduce runtime by 2 orders of magnitude.

Adaptive join. We experimentally demonstrate the effectiveness of the adaptive join used in Algorithm 5. Different searching strategies in Algorithm 6 lead to drastically different runtimes. Figure 4 shows the searching cost of 3 variants: 1) loop through the rules (same as OP) and search in the facts (rules-1k); 2) loop through the paired facts and search for matching rules (facts); 3) search adaptively from the smaller side as in Algorithm 6 (adaptive). Variant 1 takes more than a day to finish even for a 1% update size, thus we randomly sampled 1k rules from the 22K candidate rules, and the actual searching cost would be roughly 22 times larger. For update sizes from 1% to 10%, rules-1k reaches 700 mins at 5% while handling only 1/22 of the candidate rules; facts takes 42.2 mins to 702 mins; adaptive takes about 2.8 minutes to 12.3 minutes. Depending on the update size, adaptive achieves a 10x to 100x speedup.

Figure 4: Different hash joins
Figure 5: Different group sizes
Figure 6: Different filtering

Handle Data Skew In Groups. Figure 5 shows the runtime with different maximum group sizes used to handle data skew. For a 1% update size, the runtime drops from 23 minutes at max=1k to about 2-3 minutes between max=5k and max=40k, and increases to 29 minutes at max=50k. Similarly, for a 10% update size, the runtime drops from 81.6 minutes at max=5k to an optimal 11 minutes at max=40k and increases to more than 14 hours at max=50k (we had to terminate the run instead of waiting for it to finish). A small upper limit on the group size leads to more joining results, because the small groups split from the same key must be combined as a Cartesian product, while a large max leads to long-running subtasks. We choose max=30K as the default setting in our experiments.
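The trade-off described above can be made concrete with a small sketch. It assumes, as one reading of the text, that an oversized group is cut into chunks of at most max items and that two split groups sharing a key are joined chunk pair by chunk pair; split_group and skew_aware_join are hypothetical names, and the code runs on one machine rather than on a cluster.

```python
from itertools import product


def split_group(items, max_size):
    """Cut one group into consecutive chunks of at most max_size items."""
    return [items[i:i + max_size] for i in range(0, len(items), max_size)]


def skew_aware_join(left, right, max_size):
    """Join two groups that share a key as independent chunk-pair subtasks.

    Returns (subtasks, results): subtasks is the Cartesian product of the
    left and right chunks; results is the full set of joined pairs.
    """
    subtasks = list(product(split_group(left, max_size),
                            split_group(right, max_size)))
    results = [(l, r) for l_chunk, r_chunk in subtasks
               for l in l_chunk for r in r_chunk]
    return subtasks, results


if __name__ == "__main__":
    left, right = list(range(10)), list(range(7))
    for max_size in (2, 4, 100):
        subtasks, results = skew_aware_join(left, right, max_size)
        largest = max(len(l) * len(r) for l, r in subtasks)
        print(f"max={max_size}: {len(subtasks)} subtasks, "
              f"largest subtask {largest} pairs, {len(results)} results")
```

With groups of 10 and 7 items, a small max produces many tiny subtasks, while a large max collapses the join into a single subtask that does all the work; the joined output is the same in every case.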

Filtering: Broadcast vs Joining. We explore how to optimize the filtering in Algorithm 5. For a large update size, the join method is mandatory, because broadcasting requires the broadcast data to be small enough to fit into each worker's memory. Since we accommodate updates of various sizes, we explore when to switch to the join-filtering method. Figure 6 shows the runtime of the different filtering approaches for different update sizes on Freebase. For smaller updates, broadcast filtering is more efficient. Based on the graph, we set the upper size limit for using rule-based filtering to 10M, which achieves approximately the optimal runtime for both small (1%) and larger (10%) update sizes.
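A size-based switch between the two filtering strategies might look like the sketch below. It emulates partitions with plain Python lists instead of a cluster, and it assumes that filtering means keeping the rows whose key appears in a given key set; broadcast_limit, num_buckets and the function names are hypothetical, and the limit counts keys rather than reproducing the 10M setting above.

```python
def broadcast_filter(partitions, keep_keys):
    """Ship the whole key set to every partition and filter locally."""
    lookup = set(keep_keys)  # must fit in each worker's memory
    return [[row for row in part if row[0] in lookup] for part in partitions]


def join_filter(partitions, keep_keys, num_buckets=4):
    """Repartition rows and keys by key hash, then filter bucket by bucket."""
    row_buckets = [[] for _ in range(num_buckets)]
    key_buckets = [set() for _ in range(num_buckets)]
    for part in partitions:
        for row in part:
            row_buckets[hash(row[0]) % num_buckets].append(row)
    for key in keep_keys:
        key_buckets[hash(key) % num_buckets].add(key)
    # Each bucket only ever sees its own slice of the key set.
    return [[row for row in rows if row[0] in keys]
            for rows, keys in zip(row_buckets, key_buckets)]


def filter_rows(partitions, keep_keys, broadcast_limit):
    """Use broadcast filtering for small key sets, join filtering otherwise."""
    if len(keep_keys) <= broadcast_limit:
        return "broadcast", broadcast_filter(partitions, keep_keys)
    return "join", join_filter(partitions, keep_keys)


if __name__ == "__main__":
    rows = [("a", 1), ("b", 2), ("c", 3), ("a", 4)]
    partitions = [rows[:2], rows[2:]]
    for limit in (10, 1):
        method, parts = filter_rows(partitions, ["a", "c"], limit)
        print(method, sorted(row for part in parts for row in part))
```

Both paths return the same rows; they differ only in whether the key set is replicated to every worker or shuffled alongside the data, which is why the better choice depends on the update size.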

8 Conclusion

We propose an incremental rule mining framework that outperforms state-of-the-art batch methods on real-world datasets by more than 2 orders of magnitude. Our method leverages various optimization techniques and a new metric that significantly enhances the performance while maintaining the rule quality. We conduct experiments on two real-world knowledge bases to justify our claims. To the best of our knowledge, our system is the first that can efficiently mine rules from evolving large-scale knowledge bases with incremental updates.

Acknowledgments

This work is partially supported by NSF under IIS Award # 1526753 and DARPA under Award # FA8750-18-2-0014 (AIDA/GAIA). The authors would also like to thank Anthony Colas, Caleb Bryant and anonymous reviewers for their comments on this paper.

References

  • [1] Rakesh Agrawal, Tomasz Imieliński, and Arun Swami. Mining association rules between sets of items in large databases. In ACM SIGMOD Record, volume 22, pages 207–216. ACM, 1993.
  • [2] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pages 1247–1250. ACM, 2008.
  • [3] Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R Hruschka Jr, and Tom M Mitchell. Toward an architecture for never-ending language learning. In AAAI, volume 5, page 3, 2010.
  • [4] Yang Chen, Sean Goldberg, Daisy Zhe Wang, and Soumitra Siddharth Johri. Ontological pathfinding. In Proceedings of the 2016 International Conference on Management of Data, pages 835–846. ACM, 2016.
  • [5] Yang Chen and Daisy Zhe Wang. Knowledge expansion over probabilistic knowledge bases. In Proceedings of the 2014 ACM SIGMOD international conference on Management of data, pages 649–660. ACM, 2014.
  • [6] Yang Chen, Daisy Zhe Wang, and Sean Goldberg. Scalekb: scalable learning and inference over large knowledge bases. The VLDB Journal, 25(6):893–918, 2016.
  • [7] Carlos Iván Chesnevar and Ana G Maguitman. Arguenet: An argument-based recommender system for solving web search queries. In 2004 2nd International IEEE Conference on 'Intelligent Systems'. Proceedings (IEEE Cat. No. 04EX791), volume 1, pages 282–287. IEEE, 2004.
  • [8] David W Cheung, Jiawei Han, Vincent T Ng, and CY Wong. Maintenance of discovered association rules in large databases: An incremental updating technique. In Data Engineering, 1996. Proceedings of the Twelfth International Conference on, pages 106–114. IEEE, 1996.
  • [9] Xin Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Ni Lao, Kevin Murphy, Thomas Strohmann, Shaohua Sun, and Wei Zhang. Knowledge vault: A web-scale approach to probabilistic knowledge fusion. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 601–610. ACM, 2014.
  • [10] Wenfei Fan, Xin Wang, Yinghui Wu, and Jingbo Xu. Association rules with graph patterns. Proceedings of the VLDB Endowment, 8(12):1502–1513, 2015.
  • [11] Luis Galárraga, Christina Teflioudi, Katja Hose, and Fabian M Suchanek. Fast rule mining in ontological knowledge bases with amie+. The VLDB Journal, 24(6):707–730, 2015.
  • [12] Jiawei Han, Jian Pei, and Yiwen Yin. Mining frequent patterns without candidate generation. In ACM SIGMOD Record, volume 29, pages 1–12. ACM, 2000.
  • [13] Johannes Hoffart, Fabian M Suchanek, Klaus Berberich, and Gerhard Weikum. Yago2: A spatially and temporally enhanced knowledge base from wikipedia. Artificial Intelligence, 194:28–61, 2013.
  • [14] Alfred Horn. On sentences which are true of direct unions of algebras. The Journal of Symbolic Logic, 16(01):14–21, 1951.
  • [15] Akshay Jain, Elena Lopez-Aguilera, and Ilker Demirkol. Enhanced handover signaling through integrated mme-sdn controller solution. In 2018 IEEE 87th Vehicular Technology Conference (VTC Spring), pages 1–7. IEEE, 2018.
  • [16] Frank Manola, Eric Miller, Brian McBride, et al. Rdf primer. W3C recommendation, 10(1-107):6, 2004.
  • [17] Abolfazl Mollalo, Ali Sadeghian, Glenn D Israel, Parisa Rashidi, Aioub Sofizadeh, and Gregory E Glass. Machine learning approaches in gis-based ecological modeling of the sand fly phlebotomus papatasi, a vector of zoonotic cutaneous leishmaniasis in golestan province, iran. Acta tropica, 188:187–194, 2018.
  • [18] Stephen Muggleton and Luc De Raedt. Inductive logic programming: Theory and methods. The Journal of Logic Programming, 19:629–679, 1994.
  • [19] Feng Niu, Ce Zhang, Christopher Ré, and Jude Shavlik. Elementary: Large-scale knowledge-base construction via machine learning and statistical inference. International Journal on Semantic Web and Information Systems (IJSWIS), 8(3):42–73, 2012.
  • [20] Stefano Ortona, Venkata Vamsikrishna Meduri, and Paolo Papotti. Robust discovery of positive and negative rules in knowledge bases. In 2018 IEEE 34th International Conference on Data Engineering (ICDE), pages 1168–1179. IEEE, 2018.
  • [21] Ali Sadeghian, Miguel Rodriguez, Daisy Zhe Wang, and Anthony Colas. Temporal reasoning over event knowledge graphs. 2016.
  • [22] Ashok Savasere, Edward Robert Omiecinski, and Shamkant B Navathe. An efficient algorithm for mining association rules in large databases. Technical report, Georgia Institute of Technology, 1995.
  • [23] Jaeho Shin, Sen Wu, Feiran Wang, Christopher De Sa, Ce Zhang, and Christopher Ré. Incremental knowledge base construction using deepdive. Proceedings of the VLDB Endowment, 8(11):1310–1321, 2015.
  • [24] Kristina Toutanova, Danqi Chen, Patrick Pantel, Hoifung Poon, Pallavi Choudhury, and Michael Gamon. Representing text for joint embedding of text and knowledge bases. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1499–1509, 2015.
  • [25] Matei Zaharia, Reynold S Xin, Patrick Wendell, Tathagata Das, Michael Armbrust, Ankur Dave, Xiangrui Meng, Josh Rosen, Shivaram Venkataraman, Michael J Franklin, et al. Apache spark: A unified engine for big data processing. Communications of the ACM, 59(11):56–65, 2016.