I Introduction
Knowledge bases (KBs) are collections of domain-specific and commonsense facts. Recently, the sizes of KBs have grown rapidly due to automatic knowledge extraction. For example, the number of facts in WikiData has reached 974 million. According to our observation, current KBs, especially domain-specific KBs, show strong topical correlations among their relations [2, 1]. These patterns can be used to summarize and infer part of the facts in the KBs. Therefore, the original KBs can be minimized by extracting patterns and essential facts.
In this paper, we introduce a framework for extracting the essence of knowledge and reducing the overall volume of KBs by mining semantic patterns in relations. Facts are formalized as first-order predicates, and patterns are induced as Horn rules.
Table I and Rules (1) and (2) show an example of such an extraction. By extracting the rules from the listed facts, both Tables I(b) and I(c) can be inferred from the other tables and then be removed.





(1)  
(2) 
II Properties of Horn Rules
II-A Semantic Length and Fingerprint of a Rule
First-order Horn rules are adopted in our technique to describe semantic patterns in relations. A rule can be decomposed into equivalence classes. The elements of each class are the arguments that are assigned to the same variable; if an argument is assigned a constant, the corresponding equivalence class consists only of that argument and the constant. For example, Rule (1) is decomposed into the following equivalence classes (the number in brackets denotes the argument index within a relation, starting from 0):
The length of a rule $r$ is defined by the following equation:

$$|r| = \sum_{C \in \mathcal{C}(r)} \left(|C| - 1\right) \qquad (3)$$

where $\mathcal{C}(r)$ is the set of equivalence classes of $r$ and $C$ is one of these classes.
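As a concrete illustration, the decomposition into equivalence classes and the length of Eq. (3) can be computed as follows. The tuple-based rule encoding (relation name plus a list of arguments, with uppercase strings as variables and lowercase strings as constants) is an assumption for this sketch, not the paper's implementation.

```python
# A rule is a head atom plus a list of body atoms. Each atom is
# (relation, args); an argument is a variable (uppercase string) or a
# constant (lowercase string). This encoding is an illustrative assumption.

def equivalence_classes(head, body):
    """Group argument positions that share a variable; a constant forms a
    class together with the single position it occupies."""
    var_classes = {}      # variable -> set of (atom idx, relation, arg idx)
    const_classes = []
    for idx, (rel, args) in enumerate([head] + body):
        for i, a in enumerate(args):
            pos = (idx, rel, i)  # atom index disambiguates repeated relations
            if a[0].isupper():
                var_classes.setdefault(a, set()).add(pos)
            else:
                const_classes.append({pos, a})
    return list(var_classes.values()) + const_classes

def rule_length(head, body):
    """|r| = sum over equivalence classes C of (|C| - 1), cf. Eq. (3)."""
    return sum(len(c) - 1 for c in equivalence_classes(head, body))

# grandparent(X, Z) <- parent(X, Y), parent(Y, Z): three classes of size 2.
head = ("grandparent", ["X", "Z"])
body = [("parent", ["X", "Y"]), ("parent", ["Y", "Z"])]
print(rule_length(head, body))  # 3
```

Note that each extension operation of Section II-C adds exactly one equivalence condition, so under this definition the length equals the number of extension steps used to construct the rule.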
The fingerprint of a rule is built from its equivalence classes, where each class is labeled with the head arguments it contains. For example, the last column of the example above shows the labels of the head arguments.
Lemma 1.
Two rules are semantically equivalent if and only if their fingerprints are identical.
Proof.
(Necessity) If two rules are semantically equivalent, they can be written in syntactically identical forms. Thus the equivalence classes of corresponding variables and constants are identical.
(Sufficiency) Each equivalence class records the positions of one variable or constant. Therefore, the equality of all classes ensures that the sets of predicates in both rules are identical. The labels of head arguments further determine that the head predicates are the same. Thus, the two rules are identical. ∎
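The fingerprint comparison of Lemma 1 can be sketched as follows, reusing the assumed tuple-based encoding. For simplicity, this sketch identifies positions by (relation, argument index, is-head), which assumes no relation appears twice within one rule.

```python
def fingerprint(head, body):
    """Fingerprint of a rule: the frozen set of its equivalence classes.
    Each class records (relation, arg index, is_head) positions plus any
    constant. Atoms are (relation, [args]) with uppercase variables -- an
    assumed encoding; the sketch assumes no relation repeats in a rule."""
    classes = {}
    for is_head, (rel, args) in [(True, head)] + [(False, a) for a in body]:
        for i, arg in enumerate(args):
            if arg[0].isupper():          # variable: one class per variable
                key = arg
            else:                         # constant: its own class
                key = ("const", rel, i)
            cls = classes.setdefault(key, set())
            cls.add((rel, i, is_head))
            if arg[0].islower():
                cls.add(("const", arg))
    return frozenset(frozenset(c) for c in classes.values())

# Variable names do not matter: the first two rules share one fingerprint,
# while swapping argument order yields a different one.
r1 = (("parent", ["X", "Y"]), [("father", ["X", "Y"])])
r2 = (("parent", ["A", "B"]), [("father", ["A", "B"])])
r3 = (("parent", ["X", "Y"]), [("father", ["Y", "X"])])
print(fingerprint(*r1) == fingerprint(*r2), fingerprint(*r1) == fingerprint(*r3))  # True False
```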
II-B Search Space for Rules
Let $\Omega$ be the search space of first-order Horn rules. Some elements of $\Omega$ make no sense and should be excluded. If some predicate in the body is identical to the head, then that body predicate is redundant; such rules are trivial rules. If some subset of the body does not share any variable with the remaining part of the rule (including the head), then the rule is either redundant or unsatisfiable; such a subset is called an independent fragment. The search space excluding these two types of rules is written as $\Omega'$.
II-C Extension on Rules
Definition 2 (Limited Variable, Unlimited Variable, Generative Variable).
A variable is unlimited in a Horn rule $r$ if exactly one argument in $r$ is assigned to it. A variable is limited in $r$ if at least two arguments in $r$ are assigned to it. A variable is generative if arguments in both the head and the body of $r$ are assigned to it.
Searching for rules starts from the most general forms, i.e., rules with only a head predicate whose arguments are all distinct unlimited variables. To construct new rules, new equivalence conditions are added to the equivalence classes. Syntactically, these operations fall into five extension operations:
Case 1: Assign an existing limited variable to some argument.
Case 2: Add a new predicate with unlimited variables to the rule and then assign an existing limited variable to one of these arguments.
Case 3: Assign a new limited variable to a pair of arguments.
Case 4: Add a new predicate with unlimited variables to the rule and then assign a new limited variable to a pair of arguments. In this case, the two arguments are not both selected from the newly added predicate.
Case 5: Assign a constant to some argument.
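Under the same kind of assumed tuple-based encoding, case 3 can be sketched as enumerating all pairs of positions that currently hold distinct unlimited variables and unifying each pair with a fresh limited variable. This is an illustrative enumerator, not the paper's search procedure.

```python
from itertools import combinations

def case3_extensions(head, body):
    """Case 3: pick two argument positions holding distinct unlimited
    variables and assign a fresh limited variable to both. Atoms are
    (relation, [args]); uppercase strings are variables (assumed encoding)."""
    atoms = [head] + list(body)
    occurrences = {}  # variable -> list of (atom idx, arg idx)
    for ai, (_, args) in enumerate(atoms):
        for i, a in enumerate(args):
            if a[0].isupper():
                occurrences.setdefault(a, []).append((ai, i))
    # unlimited variables occur at exactly one position
    unlimited = [ps[0] for ps in occurrences.values() if len(ps) == 1]
    for (a1, i1), (a2, i2) in combinations(unlimited, 2):
        new_atoms = [(rel, list(args)) for rel, args in atoms]
        fresh = "V" + str(len(occurrences))  # a fresh variable name
        new_atoms[a1][1][i1] = fresh
        new_atoms[a2][1][i2] = fresh
        yield new_atoms[0], new_atoms[1:]

# Starting from parent(X, Y) with a body atom father(U, V) already added
# (as in a case-4 style step), all four variables are unlimited, so there
# are C(4, 2) = 6 candidate unifications, e.g. parent(W, Y) <- father(W, V).
exts = list(case3_extensions(("parent", ["X", "Y"]), [("father", ["U", "V"])]))
print(len(exts))  # 6
```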
If a rule $r'$ is obtained from $r$ by one of these extension operations, then $r'$ is an extension of $r$, and $r$ is an origin of $r'$ (a rule may have multiple origins). The neighbours of a rule in the restricted search space $\Omega'$ consist of all its extensions and origins. The extension operations above can be used to search $\Omega'$. Let $r_0$ be a rule that has only a head predicate whose arguments are all distinct unlimited variables; every element of $\Omega'$ can be reached from some $r_0$. To prove this, we define a link property between predicates in a rule: if two predicates $p$ and $q$ in a rule share a limited variable $X$, then $p$ and $q$ are linked by $X$. Moreover, if there is a sequence of predicates in which every two consecutive predicates are linked, then there is a linked path between the first and the last predicate. With this property, we can prove the search completeness as follows:
Lemma 3.
For every rule $r$ in the restricted search space $\Omega'$, every predicate in $r$ has a linked path to the head of $r$.
Proof.
Suppose some predicate $p$ in a rule $r$ has no linked path to the head. Then $p$ is not itself the head. Every predicate that has a linked path to $p$ likewise has no linked path to the head, for otherwise $p$ would be linked to the head through it. The set of these predicates, together with $p$, does not share any variable with the remaining predicates; that is, it forms an independent fragment of $r$. By the definition of the restricted search space $\Omega'$, $r \notin \Omega'$, which contradicts $r \in \Omega'$. ∎
Lemma 4.
(Search Completeness) Let $r_0$ be a rule that has only a head predicate whose arguments are all distinct unlimited variables. Then every rule $r$ in the restricted search space $\Omega'$ can be constructed from some $r_0$ by a sequence of extension operations.
Proof.
Let $p$ be a predicate in $r$. During the search for $r$, when the search is at some intermediate rule containing a predicate linked to $p$ in $r$, an extension can be constructed by adding $p$ as a new predicate and unifying the corresponding variables. Thus $p$ is introduced into the rule. According to Lemma 3, every predicate in $r$ has a linked path to its head. Hence each predicate can be introduced iteratively, starting from the head predicate whose arguments are all distinct unlimited variables. The remaining limited variables and constants can then be added by the other extension operations to finally construct $r$. ∎
Rules with independent fragments will never be constructed starting from $r_0$, as the extension operations do not introduce a predicate that shares no variable with the other predicates.
III Problem Definition
Definition 5 (Essential Knowledge Extraction).
Let $\mathcal{B}$ be the original KB, which is a finite set of atoms. An extraction on $\mathcal{B}$ is a triple $\langle H, N, C \rangle$, where $H$ (for "Hypothesis") is a set of first-order Horn rules, $N$ (for "Necessary") is a subset of $\mathcal{B}$, and $C$ (for "Counter Examples") is a subset of the complement of $\mathcal{B}$ under the closed-world assumption (CWA). The triple satisfies ($\models$ is logical entailment):

1. $H \cup N \models e$ for every $e \in \mathcal{B}$;
2. every atom entailed by $H \cup N$ that is not in $\mathcal{B}$ belongs to $C$;
3. $|N| + |C| + |H|$ is minimal,

where $|N|$ is the number of predicates in $N$, and so is $|C|$; $|H|$ is defined as the sum of the lengths of all rules in it.
Definition 6 (Minimum Vertex Cover Problem).
Let $G = \langle V, E \rangle$ be an undirected graph. A minimum vertex cover of $G$ is a minimum subset $V' \subseteq V$ such that every edge in $E$ has at least one endpoint in $V'$.
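For illustration, checking whether a vertex set covers all edges is straightforward (a minimal helper for the standard definition, not part of the reduction itself):

```python
def is_vertex_cover(cover, edges):
    """True if every edge has at least one endpoint in the cover."""
    return all(u in cover or v in cover for u, v in edges)

edges = [(1, 2), (2, 3)]                 # a path graph: 1 - 2 - 3
print(is_vertex_cover({2}, edges))       # True: {2} covers both edges
print(is_vertex_cover({1, 3}, edges))    # True, but not minimum (size 2)
print(is_vertex_cover({1}, edges))       # False: edge (2, 3) uncovered
```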
The complexity of essence extraction can be proved by reducing the minimum vertex cover problem to relational compression. Let $G = \langle V, E \rangle$ be the graph in the vertex cover problem. With the following settings we create a relational knowledge base aligned with $G$: introduce a unary predicate for each vertex in $V$ and a unary predicate for the edges; then, for each edge, add two constants and six predicates to the knowledge base.
For example, Figure 1 shows a graph with three vertices and two edges. The corresponding setting of relational compression is as follows:


, , , , , , , , , , , , , ,
By the reducibility from the minimum vertex cover problem to relational compression, we can prove that the latter is NP-hard. The details are as follows:
Lemma 7.
The axiom rules above do not belong to the hypothesis set of any minimal extraction.
Proof.
Consider such a rule. Taking the constants into consideration, the number of counter examples it entails equals the number of predicates it entails. The size it reduces is therefore zero, i.e., there is no actual reduction. Hence the rule does not reduce the size of the knowledge base and belongs to no minimal hypothesis set. ∎
Lemma 8.
Predicates of the edge relation can only be entailed by rules mapping a vertex predicate to an edge predicate.
Proof.
Let the rule entail edge predicates from a vertex predicate; its length is 1. The number of predicates it entails is determined by the number of edges connected to the corresponding vertex, and there are no counter examples entailed by this rule. Thus, whenever the vertex has enough incident edges, the rule can be used to reduce the size of the knowledge base.
According to Lemma 7, the edge predicates cannot be entailed by the axiom rules, and since there is no other predicate that could entail them, they can only be entailed by rules of this form. ∎
Lemma 9.
All predicates corresponding to the edges are provable after compression; that is, they belong to the set of all provable predicates.
Proof.
According to Lemma 8, the proof of an edge predicate relies only on vertex predicates. No matter whether the vertex predicates are themselves provable, the rules can always be applied to prove the edge predicates. Suppose some edge predicate were not provable. Then there is another unprovable predicate corresponding to the same edge and its duplicate, since the two are both entailed by some rule whenever one of them is. A new rule could then be applied to entail these two predicates and further reduce the size of the given result. However, according to the definition of relational compression, the output cannot be further reduced. This is a contradiction. ∎
Lemma 10.
Let $V'$ be the solution of the minimum vertex cover problem. Then the rule sets of a minimal extraction of the constructed knowledge base correspond to the vertices in $V'$, and vice versa.
Proof.
Theorem 11.
Relational compression is NPhard.
Proof.
Let $V'$ be a minimum vertex cover of $G$. According to the lemmas above, a minimal relational compression of the constructed knowledge base yields $V'$. All operations involved in the reduction take polynomial time. Thus the minimum vertex cover problem can be polynomially reduced to relational compression, and relational compression is NP-hard. ∎
IV Extraction Framework
To tell whether a fact is provable from others, we employ a directed graph $G_d = \langle V_d, E_d \rangle$ to encode the dependencies among facts with respect to inference. Each vertex is either a fact in $\mathcal{B}$ or the assertion $\top$ of truth under no condition. An edge $(u, v) \in E_d$ exists if fact $u$ is involved in the proof of fact $v$ by some rule; $(\top, v) \in E_d$ if $v$ can be inferred by some rule with an empty body. The extraction of the essence is given by Algorithm 1.
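A minimal sketch of building such a dependency graph follows, assuming proof records are available as (premises, conclusion) pairs. The record format is an assumed interface for illustration, not the paper's Algorithm 1.

```python
from collections import defaultdict

TRUTH = "T"  # the unconditional-truth vertex

def dependency_graph(proofs):
    """Build the dependency digraph from proof records. Each record is a
    pair (premises, conclusion); an empty premise sequence means the fact
    is entailed by a rule with an empty body."""
    succ = defaultdict(set)  # vertex -> set of out-neighbours
    for premises, conclusion in proofs:
        if premises:
            for p in premises:
                succ[p].add(conclusion)
        else:
            succ[TRUTH].add(conclusion)
    return succ

# father(a, b) and father(b, c) jointly prove grandfather(a, c) via a rule.
g = dependency_graph([((("father", "a", "b"), ("father", "b", "c")),
                       ("grandfather", "a", "c"))])
print(("grandfather", "a", "c") in g[("father", "a", "b")])  # True
```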
If the dependency graph is a DAG, then the essential predicates are represented by the vertices with zero in-degree. However, if cycles appear in the graph, then at least one vertex of each cycle should be included in the necessary set $N$. This assertion is proved below:
Lemma 12.
If some cycle in the dependency graph does not overlap with other cycles, then at least one of its vertices should be included in the necessary set $N$.
Proof.
A vertex in the dependency graph is guaranteed provable if it is in the necessary set $N$ or all of its in-neighbours are guaranteed provable. In the following proof, we assume that all parts of the graph other than the cycles are guaranteed provable. If no vertex of a single cycle (not overlapping with other cycles) is included in $N$, then each of these vertices has an in-neighbour that is not guaranteed provable. Thus, none of the vertices in the cycle is guaranteed provable, and at least one vertex should be selected into $N$. ∎
Lemma 13.
If some cycles in the dependency graph overlap, then at least one vertex of each cycle should be included in the necessary set $N$.
Proof.
Suppose two cycles overlap in the graph. If no vertex of these cycles is in the necessary set $N$, then none of them is guaranteed provable. If some vertex in a non-overlapping part is in $N$, then all vertices from this vertex to the one before the intersection are guaranteed provable; the other cycle then reduces to the non-overlapping case, and at least one of its vertices should be in $N$. If some vertex in the overlapping part is in $N$, then both cycles are guaranteed provable. In each case, at least one vertex is selected in each cycle. The cases with more than two overlapping cycles are similar. ∎
Lemma 14.
If there are cycles in the dependency graph, then at least one vertex of each cycle should be included in the necessary set $N$.
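The selection implied by the lemmas above can be sketched greedily: take all zero-in-degree facts, then break each remaining cycle by selecting one of its vertices. This is a simple stand-in for a coverage procedure, not the paper's CoverCycle, and it makes no minimality claim.

```python
def essential_vertices(vertices, succ):
    """Greedy sketch of selecting N: all zero-in-degree facts, plus one
    vertex per remaining cycle. `succ` maps a vertex to its out-neighbours."""
    indeg = {v: 0 for v in vertices}
    for u in succ:
        for v in succ.get(u, ()):
            indeg[v] += 1
    selected = {v for v, d in indeg.items() if d == 0}
    removed = set(selected)  # selected vertices are guaranteed provable

    def find_cycle():
        color = {}  # 1 = on current DFS path, 2 = finished

        def dfs(u, stack):
            color[u] = 1
            stack.append(u)
            for w in succ.get(u, ()):
                if w in removed:
                    continue
                if color.get(w) == 1:            # back edge closes a cycle
                    return stack[stack.index(w):]
                if w not in color:
                    c = dfs(w, stack)
                    if c:
                        return c
            color[u] = 2
            stack.pop()
            return None

        for v in vertices:
            if v not in color and v not in removed:
                c = dfs(v, [])
                if c:
                    return c
        return None

    while (cycle := find_cycle()) is not None:
        selected.add(cycle[0])
        removed.add(cycle[0])
    return selected

# A 2-cycle a <-> b plus a root r -> a: r is essential, and one of a, b
# must also be selected to break the cycle.
needed = essential_vertices(["r", "a", "b"], {"r": {"a"}, "a": {"b"}, "b": {"a"}})
print(sorted(needed))
```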
In the framework, two components may be implemented with different strategies according to specific domains: findSingleRule and CoverCycle. To implement findSingleRule, pruning techniques are needed, as the search space is large and useful candidates are sparse in it. Given that semantic correlations may be strong in domain-specific KBs, cycles are expected to be large and frequent. Therefore, an efficient coverage procedure is also required in the framework.
V Restore the Original KB
As the dependency graph implies, if all the in-neighbours of some vertex are in the KB, the vertex can be inferred by applying some rule in $H$. Thus, in order to restore the original KB, we can iteratively apply each rule on the current database until no more records are inferred. Inference by a single rule can be done without a full join in the relational data model. The procedure is shown in Algorithm 2.
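A naive fixpoint evaluator illustrates the idea of restoration; the tuple-based atom encoding with uppercase variables is an assumption, and unlike Algorithm 2, this sketch performs the joins by brute force.

```python
def restore(facts, rules):
    """Naive forward chaining: repeatedly apply every rule until no new
    fact is derived. Rules are (head, body) pairs of tuple atoms; assumes
    every head variable also occurs in the body."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            new = set()
            for binding in match(body, facts, {}):
                fact = substitute(head, binding)
                if fact not in facts:
                    new.add(fact)
            if new:
                facts |= new
                changed = True
    return facts

def match(atoms, facts, binding):
    """Enumerate variable bindings that satisfy all body atoms."""
    if not atoms:
        yield binding
        return
    rel, *args = atoms[0]
    for fact in facts:
        if fact[0] != rel or len(fact) != len(atoms[0]):
            continue
        b = dict(binding)
        if all(_unify(a, v, b) for a, v in zip(args, fact[1:])):
            yield from match(atoms[1:], facts, b)

def _unify(arg, value, binding):
    if arg[0].isupper():                      # variable
        return binding.setdefault(arg, value) == value
    return arg == value                       # constant

def substitute(atom, binding):
    return (atom[0],) + tuple(binding.get(a, a) for a in atom[1:])

facts = {("father", "a", "b"), ("father", "b", "c")}
rules = [(("anc", "X", "Y"), [("father", "X", "Y")]),
         (("anc", "X", "Z"), [("anc", "X", "Y"), ("anc", "Y", "Z")])]
restored = restore(facts, rules)
print(("anc", "a", "c") in restored)  # True
```

The iteration count is bounded by the longest chain of dependent derivations, which mirrors the simple-path bound discussed next.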
The cost of a single inference step is proportional to the number of equivalence classes of the rule and to the sizes of the relevant relations; the number of equivalence classes is in turn proportional to the number of columns involved. From the implication of the dependency graph, inference operations correspond to visiting vertices along paths in the graph. Thus, the maximum number of iterations is no larger than the maximum length $\lambda$ of a simple path in $G_d$, and the overall cost of decompression is proportional to $\lambda$ times the cost of a single pass.
Lemma 15.
When the maximum simple-path length $\lambda$ reaches its maximum, the worst-case cost of restoration is as follows.
Proof.
By definition, the maximum simple-path length satisfies $\lambda \le |V_d| - 1$, where $V_d$ is the vertex set of the dependency graph. When $\lambda = |V_d| - 1$, all vertices in the graph form one single simple path. In this case, there can be only one rule in $H$ and only one relation in the KB, and the number of arguments is also maximal, for otherwise the rule could not summarize the whole path. This gives the worst-case cost. The other cases are similar. ∎
VI Conclusion
In this paper, we introduced a framework for extracting the essence of factual knowledge. Theoretical proofs are given for the key properties of the framework. To put it into practice, more concrete work is required to design and analyze findSingleRule and CoverCycle.
References

[1] C. Belth, X. Zheng, J. Vreeken, and D. Koutra (2020) What is normal, what is strange, and what is missing in a knowledge graph: unified characterization via inductive summarization. In The Web Conference. Cited by: §I.
[2] L. Galárraga, C. Teflioudi, K. Hose, and F. M. Suchanek (2013) AMIE: association rule mining under incomplete evidence in ontological knowledge bases. In Proceedings of the 22nd International Conference on World Wide Web, pp. 413–422. Cited by: §I.