KGist
Knowledge Graph summarization for anomaly/error detection & completion (WebConf '20)
Knowledge graphs (KGs) store highly heterogeneous information about the world in the structure of a graph, and are useful for tasks such as question answering and reasoning. However, they often contain errors and are missing information. Vibrant research in KG refinement has worked to resolve these issues, tailoring techniques to either detect specific types of errors or complete a KG. In this work, we introduce a unified solution to KG characterization by formulating the problem as unsupervised KG summarization with a set of inductive, soft rules, which describe what is normal in a KG, and thus can be used to identify what is abnormal, whether it be strange or missing. Unlike first-order logic rules, our rules are labeled, rooted graphs, i.e., patterns that describe the expected neighborhood around a (seen or unseen) node, based on its type, and information in the KG. Stepping away from the traditional support/confidence-based rule mining techniques, we propose KGist, Knowledge Graph Inductive SummarizaTion, which learns a summary of inductive rules that best compress the KG according to the Minimum Description Length principle—a formulation that we are the first to use in the context of KG rule mining. We apply our rules to three large KGs (NELL, DBpedia, and Yago), and tasks such as compression, various types of error detection, and identification of incomplete information. We show that KGist outperforms task-specific, supervised and unsupervised baselines in error detection and incompleteness identification (identifying the location of up to 93% of missing entities, over 10% more than baselines), while also being efficient for large knowledge graphs.
Knowledge graphs (KGs), such as NELL (carlson2010toward, ), DBpedia (auer2007dbpedia, ), and Yago (suchanek2007yago, ), store collections of entities and relations among those entities (Fig. 1), and are often used for tasks such as question answering, powering virtual assistants, reasoning, and fact checking (huang2019knowledge, ; bhutani2019learning, ; nickel2015review, ; shiralkar2017finding, ). Many KGs encode encyclopedic information, i.e., facts about the world, and are, to a large degree, automatically built (nickel2015review, ). As a result, they contain many types of errors, and are missing edges, nodes, and labels. This has led to a significant amount of research on KG refinement, resulting in task-specific methods that either identify erroneous facts or add new ones (paulheim2017knowledge, ). While the accuracy of KG tasks may be improved by refinement, KGs grow to the order of millions or billions of edges, making KGs more inaccessible to users (huang2019knowledge, ), and tasks over them more computationally difficult (nickel2015review, ).
As refinement helps address accuracy issues, graph summarization (liu2018graph, ) can help address KG size issues by describing a graph with simple and concise patterns. However, KG-specific summarization (zneika2019quality, ) focuses mostly on query- or search-related summaries (song2018mining, ; wu2013summarizing, ; safavi2019personalized, ), while most general-graph summarization work is designed for purposes other than KG refinement, and aims to compress a graph by grouping together similarly linked and similarly labeled nodes. These summaries would only cluster existing information in a KG, but encyclopedic KGs will always be missing facts (since the world’s information is unbounded).
Thus, we introduce the problem of inductive KG summarization, in which, given a knowledge graph , we seek to find a concise and interpretable summary of with inductive rules that can generalize to the parts of the world not captured by . With this characterization of the norm, we can also identify what is strange in and what is missing from : the parts of the graph that violate the rules or remain unexplained by the summary. These strange parts of the graph may be genuine exceptions, errors, or missing information. To solve the problem, we propose KGist, an information-theoretic approach that serves as a unified solution to summarization and various KG refinement tasks, which have traditionally been viewed independently.
Our main contributions are summarized as follows:
Problem Formulation. Rather than targeting a specific refinement task (e.g., link prediction), we unify various refinement tasks by joining the problems of refinement and unsupervised summarization, and introduce the notion of inductive summarization with soft rules that plausibly generalize beyond the KG. § 3
Expressive rules. While current methods (§ 2) learn first-order logic rules that have single-element consequences, which predict single edges, our rules are labeled, rooted graphs that are recursively defined, allowing them to describe arbitrary graph structure around a node (i.e., they can have complex consequences). Our formulation of rules takes a step towards treating knowledge graphs as graphs—something often overlooked in KG refinement (paulheim2017knowledge, ). § 3
MDL-based approach. We introduce KGist, an unsupervised, information-theoretic approach that identifies rules via the Minimum Description Length (MDL) principle (Rissanen85Minimum, ), going beyond the support/confidence framework of prior work. § 4
Experiments on real KGs. We perform extensive experiments on large KGs (NELL, DBpedia, Yago), and diverse tasks, including compression, various types of error detection, and identifying the absence of nodes. We show that KGist learns orders of magnitude fewer rules than current methods, allowing KGist to be efficient and effective at diverse tasks. KGist identifies the location of 76-93% of missing entities—over 10% more than baselines. § 5
Our code and data are available at https://github.com/GemsLab/KGist.
KG refinement attempts to resolve erroneous or missing information (paulheim2017knowledge, ; paulheim2014improving, ). Next, we discuss the three most relevant categories of refinement techniques (although other methods exist, such as crowd-sourcing-based methods (jiang2018knowledge, )).
These approaches are reminiscent of association rule mining (agrawal1993mining, ). AMIE (galarraga2013amie, ) introduces an altered confidence metric based on the partial completeness assumption, according to which, if a particular relationship of an entity is known, then all relationships of that type for that entity are known (as opposed to the open-world assumption, which assumes that an absent relationship could either be missing or not hold in reality). AMIE+ (galarraga2015fast, ) is optimized to scale to larger KGs, and Tanon et al. (tanon2017completeness, ) seek to acquire and use counts of edges to measure the incompleteness of KGs. Other, non-rule-mining-based methods have also been proposed for measuring KG quality (rashid2019quality, ; jia2019triple, ). A supervised approach that augments AMIE+ (galarraga2017predicting, ) takes example complete and incomplete assertions (e.g., crowd-sourced) as training data, and predicts completeness of predicate types observed during training. These works focus on refinement and find Horn rules on binary predicates. In contrast, we focus on summarization, and our rules can be applied to a node, knowing only its type. Also, we go beyond the support/confidence framework, which treats KGs as a table of transactions, and take a graph-theoretic view instead. One work that does take a graph-theoretic view learns rules in a bottom-up fashion by sampling paths from the KG, but the rules are constrained to be path-based Horn-rules (meilicke2019anytime, ). Graph-Repairing Rules (GRRs) (cheng2018rule, ) have also been proposed to target the specific problems of identifying incomplete, conflicting, and redundant information in graphs. They focus on simple graphs, whereas KGs contain multi-edges (nickel2015review, ), multiple labels per node (Tab 2), and self-loops. GRRs were preceded by less expressive association rules with graph patterns (fan2015association, ) and functional dependencies for graphs (fan2016functional, ). 
Rule-mining also has applications beyond KG refinement, such as recommender systems (ma2019jointly, ). Our rules could potentially be used in these scenarios, but we leave that for future work.
KG embedding approaches seek to learn representations of nodes and relations in a latent space (wang2017knowledge, ), spanning from tensor factorization-based methods (nickel2011three, ; nickel2012factorizing, ) to translation-based methods such as TransE (bordes2013translating, ) and semantic matching models such as ComplEx (trouillon2016complex, ). These works often perform link prediction, which is useful for completing relationships among entities, but only predicts links between entities already in the KG. In contrast, KGist can identify the absence of entities from the KG. Recent refinement methods improve link prediction performance by iteratively applying rule mining and learning embeddings. For instance, pre-trained embeddings have been used to more accurately measure the quality of candidate rules (ho2018rule, ). In (zhang2019iteratively, ), facts inferred from rules improve embeddings of sparse entities, and in turn embeddings improve the efficiency of rule mining. Unlike these works, we focus on unifying different refinement tasks, going beyond link prediction.
Graph summarization seeks to succinctly describe a large graph in a smaller representation either in the original or a latent space (liu2018graph, ; JinRKKRK19, ). Much of the work on knowledge graph summarization has focused on query-related summaries, such as query answer-graph summaries (wu2013summarizing, ), patterns that can be used as query views to improve KG search (song2018mining, ; fan2014answering, ), and sparse, personalized KG summaries—based on historical user queries—for use on personal, resource-constrained devices (safavi2019personalized, ). While our summaries could conceptually be used for query-related problems, we focus on the problem of characterizing what is normal, strange, and missing in a KG. We also construct summaries with patterns that generalize, which is not considered by (song2018mining, ). Similar to summarization, Bobed et al. (Bobed2019DatadrivenAO, ) use MDL to assess KG evolution, but they do not target refinement. Beyond KGs, MDL has been used to summarize static and temporal graphs via structures, such as cliques, stars, and chains (koutra2014vog, ; shah2015timecrunch, ; NavlakhaRS08bounded, ; Goebl1TBP6, ), or frequent subgraphs (noble2003graph, ) (also studied from the perspective of subgraph support (elseidy2014grami, )). Unlike these works, we learn inductive summaries of recursively defined rules or rooted graphs, which incorporate both the KG structure and semantics, and can be used for graph refinement.
In this section we describe our proposed MDL formulation for inductive summarization of knowledge graphs, after introducing some preliminary definitions. We list the most frequently used symbols in Table 1, along with their definitions.
A KG is a labeled, directed graph G, consisting of a set of nodes or entities V, a set of relationship types (predicates) R, a set of edges or triples E ⊆ V × R × V, a set of node labels L, and a function mapping nodes to their labels, the set of which we call the node’s type. We give an example KG in Fig. 1. An edge or triple (s, p, o) connects the subject s and object o via a relationship type (predicate) p ∈ R. An example is (War & Peace, writtenBy, Leo Tolstoy). Triples encode a unit of information or fact, semantically about the subject. Since a pair of nodes may have multiple edges between them, we represent the connectivity of G with a |V| × |R| × |V| binary adjacency tensor A. Similarly, we store the label information in a |V| × |L| binary label matrix, L.
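To make the data structures concrete, the following is a minimal sketch (not the authors' code; all function and variable names are illustrative) of building the binary adjacency tensor A and label matrix L from a list of triples and a node-to-labels mapping:

```python
def build_structures(triples, node_labels):
    """Build the adjacency tensor A and label matrix L of a KG.

    triples: list of (subject, predicate, object)
    node_labels: dict mapping node -> set of labels
    """
    nodes = sorted({s for s, _, _ in triples} | {o for _, _, o in triples})
    preds = sorted({p for _, p, _ in triples})
    labels = sorted({l for ls in node_labels.values() for l in ls})
    n_idx = {n: i for i, n in enumerate(nodes)}
    p_idx = {p: i for i, p in enumerate(preds)}
    l_idx = {l: i for i, l in enumerate(labels)}
    # A[s][p][o] = 1 iff the triple (s, p, o) is in the KG
    A = [[[0] * len(nodes) for _ in preds] for _ in nodes]
    for s, p, o in triples:
        A[n_idx[s]][p_idx[p]][n_idx[o]] = 1
    # L[v][l] = 1 iff node v carries label l
    L = [[0] * len(labels) for _ in nodes]
    for v, ls in node_labels.items():
        for l in ls:
            L[n_idx[v]][l_idx[l]] = 1
    return A, L, n_idx, p_idx, l_idx

A, L, n_idx, p_idx, l_idx = build_structures(
    [("War & Peace", "writtenBy", "Leo Tolstoy")],
    {"War & Peace": {"Book"}, "Leo Tolstoy": {"Author"}},
)
```

In practice these structures would be stored sparsely, since real KGs have millions of nodes; dense nested lists are used here only for clarity.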
An ideal knowledge graph contains all the correct facts in the world and no incorrect ones, i.e., it contains a triple if and only if the corresponding fact holds in reality. An ideal KG is only a conceptual aid, and does not exist in practice, since KGs have errors and missing information.
A model of a KG is a set of inductive rules, which describe its facts (see formal definition in § 3.1.4). In § 3.2, we will explain a model in the context of our work.
Notation | Description |
---|---|
G | knowledge graph |
A, L | binary adjacency tensor and label matrix of G, resp. |
M, ∅ | model or set of rules, and the empty model, resp. |
L(·) | # of bits to transmit an object (e.g., a graph or rule) |
g | rule in the form of a graph pattern |
A(g), A_c(g), A_e(g) | assertions, correct assertions, exceptions of g, resp. |
\|·\|, ‖·‖ | set cardinality and number of 1s in tensor/matrix, resp. |
A rule is defined recursively and compositionally. Specifically, a rule g is a rooted, directed graph, with a subset of node labels L_r defining g’s root, and a set of children C_r. Each child in C_r is of the form (p, dir, g′), consisting of a predicate p (e.g., writtenBy), the directionality dir of the rule (i.e., → or ←), and a descendant rule g′. A leaf rule has no children, i.e., C_r = ∅. An atomic rule consists of one root with a single child (e.g., ({Book}, {(writtenBy, →, ({Author}, ∅))})); all rules can be formed from compositions of atomic rules. The rule in Fig. 2 (which reads, “Books have fictional family characters and are written by authors who are born in countries.”), rooted at Book, consists of three atomic rules: it has root {Book} and two children (for clarity we omit the braces for sets): (writtenBy, →, (Author, (bornIn, →, (Country, ∅)))) and (character, →, (Fictional Family, ∅)).
An assertion of a rule g over the KG is an instantiation of the edges and labels that g asserts around a particular node, and is reminiscent of a rule grounding (meilicke2019anytime, ). The set of all assertions of rule g is A(g). Formally, an assertion is a subgraph induced by a traversal that starts at a node with at least the same labels as g’s root, and that recursively follows g’s syntax. For example, War & Peace is the starting node of one assertion of the rule in Fig. 2. If the traversal fails to match the syntax of the rule at any point, then we call it an exception of g, in which case the assertion is just the node that violates the rule. Otherwise the induced subgraph is called a correct assertion of g. Formally, A_c(g) and A_e(g) are the sets of g’s correct assertions and exceptions, respectively. Every assertion is either a correct assertion or an exception, so A_c(g) and A_e(g) form a partition of A(g).
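The recursive rule structure above can be sketched as a small data type (illustrative only, not the authors' implementation; the class and field names are ours):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    """A rule: root labels plus (predicate, direction, child-rule) children."""
    root_labels: frozenset
    children: tuple = ()  # tuple of (predicate, direction, Rule)

    def is_leaf(self):
        # a leaf rule asserts nothing beyond its root labels
        return len(self.children) == 0

    def is_atomic(self):
        # one child, and that child is itself a leaf rule
        return len(self.children) == 1 and self.children[0][2].is_leaf()

# The rule of Fig. 2, built from atomic pieces:
country = Rule(frozenset({"Country"}))
author = Rule(frozenset({"Author"}), (("bornIn", "->", country),))
book = Rule(frozenset({"Book"}),
            (("writtenBy", "->", author),
             ("character", "->", Rule(frozenset({"Fictional Family"})))))
```

Because the dataclass is frozen and its fields are hashable, rules can be stored in sets and used as dictionary keys, which is convenient for candidate management.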
In two-part (crude) MDL (rissanen:78:mdl, ), given a set of candidate models, the best model M minimizes L(M) + L(D|M), where L(M) is the length (in bits) of the description of the model M, and L(D|M) is the length of the description of the data D when encoded with the help of M. In our work, we leverage MDL to concisely summarize a given KG.
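As a toy illustration of two-part MDL (not the paper's encoding): choose among candidate Bernoulli models for a bit string the one minimizing L(M) + L(D | M), with a flat model cost assumed here for simplicity:

```python
import math

def data_bits(bits, p):
    """L(D | M): bits to encode the data under Bernoulli rate p."""
    p = min(max(p, 1e-9), 1 - 1e-9)  # guard against log(0)
    return sum(-math.log2(p if b else 1 - p) for b in bits)

def best_model(bits, candidates, model_cost):
    """Two-part MDL: minimize L(M) + L(D | M) over candidate models."""
    return min(candidates, key=lambda p: model_cost(p) + data_bits(bits, p))

data = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]  # mostly 1s
chosen = best_model(data, [0.5, 0.9], model_cost=lambda p: 1.0)
```

The skewed model wins because its extra model description is repaid by cheaper data encoding, which is exactly the trade-off KGist exploits at KG scale.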
Because both errors and missing information are instances of abnormalities, we unify KG characterization in terms of what is normal, strange, and missing, as follows:
Given a knowledge graph G, and an inaccessible ideal knowledge graph, we seek to find a concise model M of inductive rules that summarize what is normal in both. The rules should be (1) interpretable (by which we mean readable in natural language) and, (2) their exceptions should reveal abnormal information in the KG, whether it be erroneous (e.g., a triple in G that does not hold in the ideal KG), missing (e.g., a triple of the ideal KG absent from G), or a legitimate exception (e.g., a correct triple that simply violates a rule).
The concise set of rules admits efficient performance on follow-up tasks (such as error detection and incompleteness identification). Although existing rule mining techniques can be adapted to handle variants of this problem (typically they are tailored to either detect a specific type of error or perform completion), they tend to result in a large number of redundant rules (§ 5.1.2) and require heuristics to be adapted to tasks that they were not designed for. In the next section, we formalize our problem definition further and propose a principled, information-theoretic solution that naturally unifies KG characterization.
The inductive KG summarization problem (Problem 1) is closely related to the idea of compression in information theory—compression finds patterns in data (what is normal), which in turn can reveal outliers (what is strange or missing). In this work, we leverage MDL (§ 3.1.6) for KG summarization—a formulation that we are the first to use in the context of KG rule mining. Based on our preliminary definitions above, Problem 1 can be restated more formally: Given a knowledge graph G, we seek to find the model M* (i.e., set of rules) that minimizes the description length of the graph,
M* = argmin_M L(G, M) = argmin_M [ L(M) + L(G | M) ]    (1)
where M is a set of rules (§ 3.1.3) describing what is normal in G, L(M) is the number of bits to describe M, and L(G | M) is the number of bits to describe the parts of G that M fails to describe. Thus, expensive parts of L(M) and L(G | M) reveal abnormal information in G (§ 4.3).
In § 3.2.1 we will define our model space, how to describe a KG with a model M, and how to encode it in bits. Then, in § 3.2.2 we will describe the KG under the model, L(G | M), which we refer to as the model’s error, since it encodes what is not captured by M. All logarithms are base 2.
A model is a set of rules, and each rule has a set of correct assertions (or guided traversals of a graph, § 3.1.5). The model thus describes G’s semantics (labels) and connectivity (edges) through rule-guided traversals over G. Each time a node is visited, some of its labels are revealed by the structure of the rule. For instance, arriving at the node Leo Tolstoy while traversing the subgraph in Fig. 2 reveals (i) its Author label, since this is implied by the rule on the left, and (ii) its link to where the traversal just came from (viz., War & Peace).
For our model, we consider a classic information-theoretic transmitter/receiver setting (shannon1948mathematical, ), where the goal is to transmit (or describe) the graph G to the receiver using as few bits as possible. In other words, the sender must guide the receiver in how to fill in an empty binary adjacency tensor and binary label matrix with the 1s needed to describe G. Since MDL seeks to find the best model, the costs that are constant across all models (e.g., the number of nodes and edges) can be ignored during model selection. At a high level, beyond this preliminary information, we need to transmit the number of rules (upper bounded by the number of possible candidate rules), followed by the rules in M and their assertions, which we discuss in detail next:
L(M) = log(|candidate rules| + 1) + Σ_{g ∈ M} [ L(g) + L(A(g)) ]    (2)
Encoding the Rules. The rules serve as schematic instructions on how to populate the adjacency tensor and label matrix that describe . Our rule definition states that a rule consists of a set of root labels (semantics) and recursive rule definitions of the children (structure), so we need to transmit both of them to the receiver. We encode them as
L(g) = L(L_r) + L_N(|C_r|) + Σ_{(p, dir, g′) ∈ C_r} [ −log(n_p / Σ_{p′ ∈ R} n_{p′}) + 1 + L(g′) ],
where n_p is the number of times predicate p occurs in G. We discuss each term next. We encode the root labels by transmitting their number (upper bounded by |L|) and then the actual labels via an optimal prefix code (cover2012elements, ), since they may not occur with the same frequency:
L(L_r) = log(|L|) + Σ_{l ∈ L_r} −log( n_l / Σ_{l′ ∈ L} n_{l′} )    (3)
where n_l is the number of times label l occurs in G. Then, for the children C_r, we transmit their number (expected to be small) using the optimal encoding of an unbounded natural number (rissanen:83:integers, ), similarly to (akoglu2013mining, ) and denoted L_N; and per child we specify: (i) its predicate using an optimal prefix code as in Eq. (3), (ii) its directionality (i.e., → or ←) with one bit, without making an a priori assumption about which is more likely, and (iii) its descendant rule g′, by recursively applying the rule encoding above until leaf rules (with 0 children) are reached.
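The two encodings used here, Rissanen's universal code for unbounded integers and the optimal prefix code over observed frequencies, can be sketched as follows (a minimal illustration; the constant and the log* recursion follow the standard formulation, and the function names are ours):

```python
import math

LOG2_C = math.log2(2.865064)  # Rissanen's constant c0 for the universal code

def l_n(n):
    """Universal code length L_N(n) for an integer n >= 1 (Rissanen, 1983)."""
    assert n >= 1
    bits = LOG2_C
    x = math.log2(n)
    while x > 0:           # log*(n): sum the positive iterated logarithms
        bits += x
        x = math.log2(x)
    return bits

def prefix_code_bits(counts, symbol):
    """Optimal prefix code length: -log2 of the symbol's relative frequency."""
    total = sum(counts.values())
    return -math.log2(counts[symbol] / total)
```

For example, a predicate occurring in half the edges costs one bit per occurrence, while rare predicates cost more, which is why frequent patterns compress well.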
We note that while some labels can be inferred from rules (e.g., the Author label of Leo Tolstoy), not all labels will necessarily be revealed by rules. Thus, we transmit the un-revealed labels as negative error—i.e., information needed to make the transmission lossless, but that is not modeled by M. We discuss this in § 3.2.2.
So far, in our running example, once the receiver has the information that War & Peace is a book, it can apply the rule in Fig. 2. It knows that War & Peace should have one or more Fictional Families as characters, and one or more Authors who wrote it, but it does not yet know which Fictional Families and Authors. This information will be encoded next in the assertions.
Encoding the Rule Assertions. In Eq. (2), the last term encodes the assertions, A(g), of each rule g. The receiver infers the starting nodes of the traversals from g’s root and the node labels (encoded via other rules or in Eq. (9)). Thus, we transmit the failed traversals (i.e., exceptions) and the details needed to guide the correct assertions:
L(A(g)) = L(A_e(g)) + L(A_c(g))    (4)
The first term transmits which assertions are exceptions to a rule (e.g. the book Syntactic Structures, which is non-fiction and hence does not have any Fictional Family characters). We transmit the number of exceptions, followed by their IDs (i.e., which assertions they are), chosen from among the assertions:
L(A_e(g)) = log(|A(g)| + 1) + log( |A(g)| choose |A_e(g)| )    (5)
where |A(g)| serves as an upper bound on the number of exceptions.
Intuitively, the bits needed to encode exceptions penalize overly complex rules, which are unlikely to be accurate and generalizable.
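The binomial transmission scheme used for exceptions (and again later for negative error) can be sketched directly; the log(n + 1) term for the count is our reading of the text, not a verbatim formula:

```python
import math

def log2_binom(n, k):
    """Bits to identify which k of n items are selected: log2 C(n, k)."""
    return math.log2(math.comb(n, k))

def exception_bits(num_assertions, num_exceptions):
    """Sketch of Eq. (5): transmit the exception count, then their IDs."""
    return math.log2(num_assertions + 1) + log2_binom(num_assertions,
                                                      num_exceptions)
```

Note how the cost grows with the number of exceptions: a rule with many violations pays for each one, which is the MDL penalty on overly broad rules.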
The remaining traversals are correct assertions, for which we transmit details as we traverse each a ∈ A_c(g). The encoding cost for A_c(g) is the sum of the cost of all these traversals:
L(A_c(g)) = Σ_{a ∈ A_c(g)} L(a)    (6)
Each traversal is encoded by recursively visiting neighbors according to the recursive structure of g. Formally,
L(a) = Σ_{(p, dir, g′) ∈ C_r} [ log(|V|) + log( |V| choose n_{g′} ) + Σ_{a′} L(a′) ]    (7)
where, for each child of the rule, we first transmit the number n_{g′} of the current node’s neighbors with the child’s labels (upper-bounded by the number of nodes in G), followed by the neighbors’ IDs (which are the starting nodes of the child rule’s correct assertions, since the child is itself a rule) using a binomial transmission scheme. Once the neighbors have been revealed, the traversal proceeds recursively to them. For example, the traversal in Fig. 2 begins at War & Peace, and the rule has two children (characters and authors). For each, we transmit the number of relevant nodes (5 and 1, respectively), followed by their IDs. The traversal then proceeds recursively to each node just specified.
In Eq. (1), along with sending the model M, we also need to send anything not modeled, i.e., the model’s negative error. This error consists of the cost of encoding (i) the node labels that are not revealed by the rules and (ii) the unmodeled edges. We denote the modeled labels and edges as L⁺ and A⁺ respectively, which contain the subset of 1s in L and A that the receiver has been able to fill in via the rules it received in M. We denote the unmodeled labels and edges as the binary matrix L⁻ and binary tensor A⁻, and these are what we refer to as negative error. The cost of the model’s error is thus
L(G | M) = L(L⁻) + L(A⁻)    (8)
Specifically, the receiver can infer the number of missing node labels (i.e., 1s in L⁻) given the total number of node labels and the number already explained by the model (§ 3.2.1). Thus, we send only the positions of the 1s in L⁻, encoding over a binomial (where |·| denotes set cardinality and ‖·‖ the number of 1s in a tensor/matrix):
L(L⁻) = log( (|V|·|L| − ‖L⁺‖) choose ‖L⁻‖ )    (9)
We transmit unmodeled edges analogously:
L(A⁻) = log( (|V|²·|R| − ‖A⁺‖) choose ‖A⁻‖ )    (10)
In the previous section, we fully defined the encoding cost of a knowledge graph with a model of rules. Here we introduce our method, KGist, which will leverage our KG encoding to find a concise summary of inductive rules, with which it will characterize what is normal, what is strange, and what is missing in the KG.
A necessary step to this end is to generate a set C of candidate rules from which MDL will construct the rules that can best compress G. However, even given that set, selecting the optimal model involves a combinatorial search space, since any subset of C is a valid model, i.e., there are 2^|C| possible models (where even |C| can be in the millions for large KGs). This space cannot be searched exhaustively, and our MDL search space does not have easily exploitable structure, such as the anti-monotonicity property of the support/confidence framework. To find a tractable solution, we exploit the compositionality of rules—starting with simple, atomic rules and building from there. We give KGist’s pseudocode in Alg. 1, and describe it line by line next.
We begin by generating atomic candidate rules—those that assert exactly one thing (§ 3.1.4). The number of possible atomic rules is exponential in the number of node labels, but not all of them need to be generated: rules that never apply in G do not explain any of the KG, and hence will not be selected by MDL. Thus, we use the graph to guide candidate generation. For each edge in the graph, KGist generates atomic rules that could explain it. For instance, the edge (War & Peace, writtenBy, Leo Tolstoy) could be explained by rules such as “books are written by authors” and “authors write books.” These have the forms ({Book}, {(writtenBy, →, ({Author}, ∅))}) and ({Author}, {(writtenBy, ←, ({Book}, ∅))}), respectively. To avoid candidate explosion from allowing rules to have any subset of node labels, we only generate atomic rules with a single label per node at this stage, and account for more combinations of labels in the next step.
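The label-guided generation step can be sketched as follows (illustrative only; KGist's actual rule representation is richer, and the tuple encoding here is ours):

```python
from itertools import product

def atomic_candidates(triples, node_labels):
    """Emit one candidate per (subject-label, object-label) pair per edge,
    in both directions, as (root_label, predicate, direction, child_label)."""
    cands = set()
    for s, p, o in triples:
        for ls, lo in product(node_labels.get(s, ()), node_labels.get(o, ())):
            cands.add((ls, p, "->", lo))  # e.g., "books are written by authors"
            cands.add((lo, p, "<-", ls))  # e.g., "authors write books"
    return cands

cands = atomic_candidates(
    [("War & Peace", "writtenBy", "Leo Tolstoy")],
    {"War & Peace": {"Book"}, "Leo Tolstoy": {"Author"}},
)
```

Because candidates are deduplicated in a set, repeated edges between nodes of the same types contribute the same candidates only once.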
Adding more labels to rules can help make them more accurate and more inductive by limiting the number of places they apply (e.g., Fig. 3), and subsequently their exceptions (which incur a cost in our MDL model). To this end, given a rule g, KGist identifies the set of labels shared by all the starting nodes of the correct assertions of the rule. If this set contains more labels than g’s root, then it forms a new rule g′ with this larger set as root. If L(G, {g′}) < L(G, {g}), i.e., g′ compresses the graph better than g when added to the empty model without any rules, KGist replaces g with g′ among the candidates. It carries this out for all the candidate rules. This can be viewed as qualifying g: it qualifies the conditions under which g applies, to those nodes that contain all the labels rather than its original label alone.
Considering all possible subsets of candidate rules to find the optimal model is not tractable. Moreover, an alternative greedy approach that constructs the model by selecting, in each iteration, the rule that leads to the greatest reduction in the encoding cost would still be quadratic in the number of candidates (which is in the order of many millions for large-scale KGs). Instead, for scalability, given the set of (potentially qualified) candidate rules, we devise a ranking that allows KGist to take a constant number of passes over the candidate rules. Intuitively, our ranking considers the amount of explanatory power that a rule has—i.e., how much reduction in error it could lead to:
s(g) = L(G | ∅) − L(G | {g})    (11)
KGist sorts the rules descending on this value, and breaks ties by considering rules with more correct assertions first. If that fails, the final tie-breaker is the lexicographic ordering of rules’ root labels.
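This ranking with its tie-breakers can be sketched as a single sort (a minimal illustration; the score, assertion-count, and root-label inputs are assumed precomputed, and the names are ours):

```python
def rank_rules(cands, power, n_correct, root):
    """Sort by explanatory power (desc), then # correct assertions (desc),
    then lexicographic root label, mirroring the tie-breaking order."""
    return sorted(cands, key=lambda g: (-power[g], -n_correct[g], root[g]))

ranked = rank_rules(
    ["g1", "g2", "g3"],
    power={"g1": 5.0, "g2": 9.0, "g3": 5.0},
    n_correct={"g1": 3, "g2": 1, "g3": 7},
    root={"g1": "Author", "g2": "Book", "g3": "City"},
)
```

Encoding all three criteria in one sort key keeps the ranking deterministic, which matters for reproducibility of the greedy selection that follows.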
After ranking the candidate rules, KGist initializes M = ∅ and considers each rule g in ranked order for inclusion in M. For each rule g, it computes L(G, M ∪ {g}), i.e., the MDL objective if g is added to the current model. If this is less than the MDL cost without the rule (e.g., because rule g correctly explains new parts of G), then KGist adds g to M. If g has a reverse version (e.g., “books are written by authors” vs. “authors write books”), KGist considers both at once and picks the one that gives the lower MDL cost. KGist runs a small number of passes over the candidates until no new rules are added. The resulting model M approximates the true optimal model M*.
The model at this point only contains atomic rules. To better approximate M*, we introduce two refinements that compose rules: merging (Rm) and nesting (Rn).
Refinement Rm, for “rule merging,” composes rules that share a root. It identifies all sets of rules with matching roots that correctly apply in the same cases, i.e., whose correct assertions start at the same nodes. It then merges each such set into a single rule, consisting of the shared root and the union of the children. For example, if all books that have authors also have publishers, then these two rules would be merged into a single rule. We refer to this variant as KGist+m.
Refinement Rn, for “rule nesting,” considers composing rules where an inner node of one rule g₁ matches the root of another rule g₂, possibly creating a more inductive rule. Rn begins by computing, between each compatible g₁ and g₂, the Jaccard similarity of the correct-assertion starting points of the matching inner and root nodes (i.e., it quantifies the ‘fit’ of the nodes). For instance, if a rule asserts that “books have authors,” and another rule asserts that “authors have a birthplace,” then the Jaccard similarity is computed over the set of book authors and the set of authors with birthplaces. The refinement then considers nesting the rules in descending order of Jaccard similarity, resulting in rule g₂ being subsumed into rule g₁, which becomes its ancestor. If the composed rule leads to a lower encoding cost than the individual rules (e.g., it qualifies them as in Fig. 3), then the composition replaces g₁ and g₂. The Jaccard similarity between the composed rule and the rules that were compatible with g₁ or g₂ is then recomputed, the list of compatible rules is re-sorted by Jaccard similarity, and the search continues until all pairs are considered with none being composed (i.e., when no composition reduces the encoding cost). This sorting is done over the set of selected rules M (§ 4.2.1), which is much smaller than the candidate set, and since only few compositions occur (§ 5.1.1), the repeated sorting is tractable. As nesting embeds one rule into another, this refinement allows arbitrarily expressive rules to form. We call our method with both refinements (merging and nesting) KGist+n.
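The 'fit' computation in Rn is a plain Jaccard similarity over assertion starting points, sketched here with hypothetical sets:

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of starting nodes."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

# Hypothetical starting points for "books have authors" (inner Author node)
# and "authors have a birthplace" (root Author node):
book_authors = {"Tolstoy", "Austen", "Chomsky"}
authors_with_birthplace = {"Tolstoy", "Austen", "Hugo"}
fit = jaccard(book_authors, authors_with_birthplace)
```

A fit near 1 suggests the nested rule would hold almost wherever the outer rule applies, so few new exceptions would be introduced by composing them.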
We now discuss how to leverage a model (i.e., a summary of rules) mined by KGist towards identifying what is strange or anomalous in a KG, whether it be erroneous or missing—two key tasks in KG research. Anomaly detection seeks to identify objects that differ from the norm (akoglu2015graph, ; Aggarwal16_outlier, ). In our case, the learned summary concisely describes what is normal in a KG. Intuitively, nodes that violate rules and edges that are unexplained are likely to be anomalous. Next, we make this intuition more principled by defining anomaly scores for entities (nodes) and relationships (edges) in information-theoretic terms.
We define the anomaly score of an entity or node as the number of bits needed to describe it as an exception to the rules in the model:
a_n(v) = Σ_{g ∈ R(v)} L(A_e(g)) / |A_e(g)|    (12)
where R(v) maps each node v to the rules that it violates. We distribute the cost of the exceptions equally over all violating nodes, following Eq. (5).
We also introduce an anomaly score for a relationship or edge. Intuitively, edges that are not explained by the model are anomalous, and their anomaly score is defined as the number of bits describing them as negative error. Since we transmit all unmodeled edges in together (Eq. (10)), we make this intuition more principled by distributing this transmission cost evenly across all unmodeled edges in the anomaly score:
a′_e(e) = L(A⁻) / ‖A⁻‖, for each unmodeled edge e ∈ A⁻    (13)
Under the reasonable assumption that our model effectively captures what is normal in the KG, it follows that edges unexplained by the model are likely to be abnormal. Equation (13) captures this notion, but to prevent the unexplained edges from all receiving equal scores, we add the anomaly score of the endpoints (Eq. (12)):
a_e((s, p, o)) = a′_e((s, p, o)) + a_n(s) + a_n(o)    (14)
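The anomaly scores of this section can be sketched compactly, assuming the per-rule exception costs (in bits), exception counts, and the total negative-error cost of unmodeled edges have already been computed (all names here are illustrative):

```python
def node_score(v, violated, exc_cost, n_exc):
    """Eq. (12) analogue: each violated rule contributes its exception
    bits, split evenly over that rule's violating nodes."""
    return sum(exc_cost[g] / n_exc[g] for g in violated.get(v, ()))

def edge_score(edge, unmodeled, neg_bits, violated, exc_cost, n_exc):
    """Eqs. (13)-(14) analogue: unmodeled edges share the negative-error
    bits evenly, plus the endpoint scores to break ties among them."""
    s, _, o = edge
    base = neg_bits / len(unmodeled) if edge in unmodeled else 0.0
    return (base
            + node_score(s, violated, exc_cost, n_exc)
            + node_score(o, violated, exc_cost, n_exc))
```

With this decomposition, two unmodeled edges get the same base score, but the one whose endpoints also violate rules ranks as more anomalous.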
Generating candidate rules involves iterating over each edge (and its nodes’ labels) as it is encountered. Letting d denote the maximum number of labels over all nodes, the number of possible atomic rules with a single label per node that could explain an edge is O(d²), so the overall complexity of candidate generation is O(|E|·d²). The number of candidate rules generated, |C|, is also O(|E|·d²). Computing the error, L(G | M), takes constant time, since it only involves computing the log-binomials. The computation of L(M) depends on the time to traverse and describe the correct assertions. Since the traversals occur in a DFS manner (in linear time) over subgraphs small enough relative to G to be ignored, this takes O(|E|) time. Since ranking only requires computing a per-rule score in small constant time, its cost comes only from sorting the |C| candidates, which is O(|C| log |C|). KGist takes a small number of passes over the candidate set (§ 4.2.1) in O(|C|·|E|) time. So, the overall complexity is O(|E|·d² + |C| log |C| + |C|·|E|). We omit the complexity of the refinements for brevity.
Dataset | \|V\| | \|E\| | \|L\| | \|R\| | avg | med |
---|---|---|---|---|---|---|
NELL | 46,682 | 231,634 | 266 | 821 | 1.53 | 1 |
DBpedia | 976,404 | 2,862,489 | 239 | 504 | 2.72 | 3 |
Yago | 6,349,336 | 12,027,848 | 629,681 | 33 | 3.81 | 3 |
Our experiments seek to answer the following questions:
Does KGist characterize what is normal? How well can KGist compress, in an interpretable way, a variety of KGs?
Does KGist identify what is strange? Can it identify and characterize multiple types of errors?
Does KGist identify what is missing?
Is KGist scalable?
Table 3: Compression results. AMIE+ mines Horn rules; the remaining methods select rules of the form introduced in this work.

| Dataset | Metric | AMIE+ | Freq | Coverage | KGist | KGist+m | KGist+n |
|---|---|---|---|---|---|---|---|
| NELL (6,268,200 bits) | % bits needed | N/A | 191.46% | 192.72% | 73.88% | 73.00% | 63.57% |
| | edges explained | N/A | 57.33% | 50.12% | 78.52% | 78.52% | 74.67% |
| | # rules | 32,676 | top-k | top-k | 1,115 | 647 | 573 |
| DBpedia (119,117,468 bits) | % bits needed | N/A | 674.51% | 718.22% | 69.88% | 69.84% | 69.77% |
| | edges explained | N/A | 80.64% | 71.70% | 89.17% | 89.17% | 88.51% |
| | # rules | 6,963 (galarraga2015fast, ) | top-k | top-k | 516 | 505 | 498 |
| Yago (793,027,801 bits) | % bits needed | N/A | 896.33% | 947.64% | 76.13% | 75.98% | 75.04% |
| | edges explained | N/A | 86.54% | 83.44% | 88.40% | 88.40% | 85.20% |
| | # rules | failed | top-k | top-k | 60,298 | 34,331 | 32,670 |
Data. Table 2 gives descriptive statistics for our data. NELL (carlson2010toward, ), or "Never-Ending Language Learning," continually learns facts by crawling the web. Our version contains 1,115 iterations, each introducing new facts whose confidence has grown sufficiently large. DBpedia (auer2007dbpedia, ) is extracted from Wikipedia data, heavily using the structure of infoboxes. The extracted content is aligned with the DBpedia ontology via crowd-sourcing (paulheim2017knowledge, ). Yago (suchanek2007yago, ), like DBpedia, is built largely from Wikipedia. Yago contains 3 orders of magnitude more node labels than the other two graphs (Tab. 2).

In this section, we demonstrate how KGist characterizes what is normal in a KG by achieving (1) high compression and (2) concise, (3) interpretable summaries with intuitive rules.
Although compression is not our goal, it is our means to evaluate the quality of the discovered rules. Effective compression means that the discovered rules describe the KG accurately and concisely.
Setup. We run KGist on all three KGs, since each has different properties (Tab. 2). As a baseline encoding, we use an empty model with no rules, which corresponds to transmitting the graph entirely as error. We compare compression against this empty model.
Baselines. We compare to: (i) Freq, which, instead of using MDL to select rules, selects the top-k rules that correctly apply most often, where we set k to the number of rules selected by the best-compressing version of KGist. (ii) Coverage is directly analogous to Freq, replacing frequency with the number of edges explained by the rule. Both select rules independently, without regard for whether rules explain the same edges. (iii) AMIE+ (galarraga2015fast, ) finds Horn rules, which cannot be encoded with our model, so we report only the number of rules it finds, not compression results. While other KG compression techniques exist (§ 2.2), we seek inductive rules that are useful for refinement, whereas generic graph compression methods compress the graph but never generate rules, and are hence not comparable.
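Both baselines reduce to the same independent top-k selection with different scoring functions; a sketch (names are ours):

```python
def select_top_k(candidates, k, score):
    """Rank candidate rules independently by `score` and keep the best k,
    with no regard for overlap in the edges the chosen rules explain."""
    return sorted(candidates, key=score, reverse=True)[:k]

# Freq:     score(rule) = number of times the rule correctly applies.
# Coverage: score(rule) = number of edges the rule explains.
```

Because the rules are scored independently, two selected rules may explain the same edges, which is exactly what the MDL objective penalizes.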
Metrics. For each dataset, the first row reports the number of bits needed by the learned model as a percentage of the bits needed by the empty model. Small values occur when the model compresses well, so smaller is better. The second row reports the percentage of edges explained by the selected rules. Lastly, we report how many rules were selected to achieve these results.
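The two table metrics reduce to simple ratios; a minimal sketch (function names are illustrative):

```python
def percent_bits_needed(model_bits, empty_model_bits):
    """First row of Table 3: description length under the learned model as a
    percentage of the empty model's; below 100% means the rules compress."""
    return 100.0 * model_bits / empty_model_bits

def percent_edges_explained(num_explained, num_edges):
    """Second row of Table 3: share of KG edges covered by a selected rule."""
    return 100.0 * num_explained / num_edges
```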
Results. We record KG compression in bits in Table 3. In all cases, KGist is significantly more effective than the Freq and Coverage baselines, which ignore MDL. Indeed, Freq and Coverage yield values greater than 100% in the first row, meaning they increase the encoding cost over the empty model, due in part to selecting rules independently of each other and hence potentially explaining the same parts of the graph with multiple rules. KGist is very effective at explaining the graph, leaving only a small percentage of the edges unexplained; it also explains more edges than Coverage, again due to rule overlap. The two refinements, Rm and Rn, are also effective at refining the model to describe the KG more concisely. Rn, which allows arbitrarily expressive rules, refines the model to contain fewer and better-compressing rules. KGist+n explains slightly fewer edges than KGist+m because nested rules apply only when their root does (e.g., Fig. 3).
We compare the number of rules mined by KGist to that of AMIE+ (galarraga2015fast, ). For AMIE+, we set min-support to 100 and min PCA confidence to 0.1, as suggested by the authors (galarraga2015fast, ). When running AMIE+ on graphs larger than NELL, we experienced intolerable runtimes (inconsistent with those in (galarraga2015fast, )). For Yago we were unable to get results, while for DBpedia we report numbers from (galarraga2013amie, ) on an older, but similarly sized version of DBpedia. In Tab. 3 we see that KGist mines orders of magnitude fewer rules than AMIE+, showing that it is more computationally tractable to apply our concise summary of rules to refinement tasks than the sheer number of rules obtained by other rule-mining methods that operate in a support/confidence framework. This is because redundant rules cost additional bits to describe, so MDL encourages conciseness. While these other methods could use the min-support parameter to reduce the number of rules, it is not clear how to set this parameter a priori. Using MDL, we can approximate the optimal number of rules in a parameter-free, information-theoretic way, leading to fewer but descriptive rules.
Furthermore, we present and discuss in Fig. 3 example rules mined with KGist. These show that our rules are interpretable and intuitively inductive, and that Rn is a useful refinement for improving the inductiveness of rules.
Here we quantitatively analyze the effectiveness of KGist at identifying a diverse set of anomalies, and demonstrate the interpretability of what it finds. Whereas most approaches focus on exceptional facts (zhang2018maverick, ), erroneous links, erroneous node type information (paulheim2017knowledge, ), or identification of incomplete information (e.g., link prediction) (galarraga2015fast, ), KGist rules can be used to address several of these at once. To evaluate this, we inject anomalies of multiple types into a KG, and measure how well KGist identifies them.
Table 4: Anomaly detection results (mean ± stdev). ComplEx and TransE are supervised; SDValidate, AMIE+, KGist_Freq, and KGist+m are unsupervised.

| Task | Metric | ComplEx | TransE | SDValidate | AMIE+ | KGist_Freq | KGist+m |
|---|---|---|---|---|---|---|---|
| All anomalies | AUC | 0.5508 ± 0.02 | 0.5779 ± 0.04 | 0.4996 ± 0.00 | 0.4871 ± 0.04 | 0.5739 ± 0.01 | 0.6052 ± 0.03* |
| | P@100 | 0.4820 ± 0.05 | 0.7040 ± 0.06 | 0.5100 ± 0.04 | 0.3980 ± 0.07 | 0.6816 ± 0.10 | 0.7419 ± 0.07* |
| | R@100 | 0.0087 ± 0.00 | 0.0126 ± 0.00 | 0.0092 ± 0.00 | 0.0072 ± 0.00 | 0.0126 ± 0.00 | 0.0139 ± 0.00* |
| | F1@100 | 0.0172 ± 0.00 | 0.0247 ± 0.00 | 0.0181 ± 0.00 | 0.0141 ± 0.00 | 0.0247 ± 0.01 | 0.0273 ± 0.01* |
| A1 missing labels | AUC | 0.5842 ± 0.04 | 0.6021 ± 0.06 | 0.4997 ± 0.00 | 0.4409 ± 0.06 | 0.5149 ± 0.02 | 0.6076 ± 0.03* |
| | P@100 | 0.2640 ± 0.05 | 0.4280 ± 0.15 | 0.3040 ± 0.06 | 0.1200 ± 0.05 | 0.4067 ± 0.11 | 0.4759 ± 0.05* |
| | R@100 | 0.0119 ± 0.00 | 0.0181 ± 0.01 | 0.0134 ± 0.00 | 0.0057 ± 0.00 | 0.0199 ± 0.01 | 0.0244 ± 0.01* |
| | F1@100 | 0.0227 ± 0.01 | 0.0346 ± 0.01 | 0.0257 ± 0.01 | 0.0109 ± 0.01 | 0.0377 ± 0.01 | 0.0463 ± 0.02* |
| A2 superfluous labels | AUC | 0.5502 ± 0.02 | 0.5659 ± 0.03 | 0.4989 ± 0.01 | 0.4946 ± 0.03 | 0.4997 ± 0.04 | 0.5115 ± 0.03 |
| | P@100 | 0.1780 ± 0.05 | 0.3160 ± 0.16 | 0.2160 ± 0.07 | 0.1040 ± 0.09 | 0.2081 ± 0.06 | 0.2485 ± 0.09 |
| | R@100 | 0.0122 ± 0.00 | 0.0219 ± 0.01 | 0.0152 ± 0.00 | 0.0070 ± 0.01 | 0.0169 ± 0.01 | 0.0175 ± 0.01 |
| | F1@100 | 0.0229 ± 0.00 | 0.0408 ± 0.02 | 0.0283 ± 0.01 | 0.0131 ± 0.01 | 0.0311 ± 0.01 | 0.0326 ± 0.01 |
| A3 erroneous links | AUC | 0.2495 ± 0.03 | 0.4126 ± 0.08 | 0.4966 ± 0.01 | 0.8902 ± 0.08 | 0.7383 ± 0.00 | 0.8423 ± 0.00 |
| | P@100 | 0.1020 ± 0.04 | 0.0020 ± 0.00 | 0.0480 ± 0.02 | 0.1860 ± 0.08* | 0.0131 ± 0.01 | 0.0137 ± 0.01 |
| | R@100 | 0.0374 ± 0.02 | 0.0007 ± 0.00 | 0.0176 ± 0.01 | 0.0679 ± 0.03* | 0.0051 ± 0.01 | 0.0052 ± 0.01 |
| | F1@100 | 0.0548 ± 0.02 | 0.0011 ± 0.00 | 0.0257 ± 0.01 | 0.0995 ± 0.05* | 0.0074 ± 0.01 | 0.0075 ± 0.01 |
| A4 swapped labels | AUC | 0.5369 ± 0.03 | 0.5527 ± 0.02 | 0.4991 ± 0.00 | 0.4891 ± 0.03 | 0.6904 ± 0.01* | 0.6633 ± 0.07 |
| | P@100 | 0.2160 ± 0.08 | 0.4200 ± 0.09 | 0.2080 ± 0.08 | 0.1240 ± 0.06 | 0.5360 ± 0.15* | 0.4768 ± 0.10 |
| | R@100 | 0.0136 ± 0.00 | 0.0269 ± 0.01 | 0.0128 ± 0.00 | 0.0079 ± 0.00 | 0.0379 ± 0.01* | 0.0320 ± 0.01 |
| | F1@100 | 0.0256 ± 0.01 | 0.0505 ± 0.01 | 0.0241 ± 0.01 | 0.0148 ± 0.01 | 0.0705 ± 0.01* | 0.0599 ± 0.01 |
| Avg rank | | 4.10 | 2.90 | 4.15 | 5.00 | 2.90 | 1.95 |
An "*" marks a statistically significant difference (paired t-test) between KGist_Freq/KGist+m and the other unsupervised methods. The final row shows the average rank of each method. KGist+m performs the most consistently well.

Setup. We inject four types of anomalies. For each, we select a fixed percentage of the KG's nodes uniformly at random to perturb. We sample nodes independently for each type, so occasionally a node may be chosen multiple times. This is realistic, since KGs contain multiple types of errors at once (paulheim2017knowledge, ). Although we target nodes, our perturbations also affect their incident edges. Thus, we formulate the anomaly detection problem as identifying the perturbed edges. Specifically, we introduce the following anomalies:
A1 Missing labels: We remove one label from each node. Unlike in A2-A4, we sample only nodes with more than one label. E.g., we may remove the entrepreneur label from Bill Gates, leaving the labels billionaire, etc. We consider all the in/out edges of the altered nodes as perturbed.
A2 Superfluous labels: We add to each node a randomly selected label that it does not currently have. E.g., we may add the label Fruit to Taj Mahal.
A3 Erroneous links: We inject 1 or 2 edges incident to each node, choosing the edge’s predicate and destination randomly. E.g., we may inject random edges like (Des Moines, owner, Coca-Cola). We mark injected edges as anomalous.
A4 Swapped labels: For each node, we replace a label with a new random one that it does not yet have.

For this experiment we show results on NELL, since it has confidence values for each of its edges, which we can use to sample negative examples. The perturbed edges are ground truth errors (positive examples), and we randomly sample from NELL an equal number of ground truth correct edges with a confidence value of 1.0 (after filtering out edges that our injected anomalies perturbed). We use a 20/80 validation/test split, and the perturbed graph for training.
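The label perturbations A1, A2, and A4 can be sketched as operations on a node-to-label-set map (a simplified protocol with illustrative names; A3's random edge injection is analogous on the edge set):

```python
import random

def a1_missing_label(labels, node, rng):
    """A1: remove one label from a node that has more than one."""
    if len(labels[node]) > 1:
        labels[node].remove(rng.choice(sorted(labels[node])))

def a2_superfluous_label(labels, node, vocab, rng):
    """A2: add a random label the node does not already have."""
    extra = [l for l in sorted(vocab) if l not in labels[node]]
    if extra:
        labels[node].add(rng.choice(extra))

def a4_swapped_label(labels, node, vocab, rng):
    """A4: replace one of the node's labels with a random absent one."""
    new = [l for l in sorted(vocab) if l not in labels[node]]
    if new and labels[node]:
        labels[node].remove(rng.choice(sorted(labels[node])))
        labels[node].add(rng.choice(new))
```

Seeding the generator makes the perturbed graphs reproducible across the five random seeds used in the experiments.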
Baselines. We compare to (i) ComplEx, an embedding method that we tune as in (trouillon2016complex, ) (ranking edges based on its scoring function), (ii) TransE, an embedding method that we tune as in (bordes2013translating, ) (ranking edges based on their energy scores), (iii) SDValidate (paulheim2014improving, ), an error detection method deployed in DBpedia (it outputs an edge ranking), and (iv) AMIE+, designed for link prediction, but which we adapt for error detection by ranking based on the sum of the confidences of the rules that predict each test edge (i.e., edges that are predicted by many, high-confidence rules will be low in the ranking, and edges that are not predicted by any rules will be high in the ranking). We also tried PaTyBRED (melo2017detection, ), but it had prohibitive runtime.
KGist variants. To define edge anomalies for our variants, we use the edge-based anomaly score in Eq. (14). KGist_Freq is the Freq method described in § 5.1.1, but uses KGist’s anomaly scores. While KGist+n learns compositional rules that help with compression, we found that the simpler rules of KGist+m performed better in this task, so we report only its results for brevity. The unsupervised methods do not have hyper-parameters, but are tested on the same test set as ComplEx/TransE, so the validation set errors are additional noise they must overcome.
Metrics. Each ranking includes only the test set edges, and we compute the AUC for each ranking, using reciprocal rank as the predicted score for each edge, so that edges higher in the ranking score closer to 1 (i.e., more anomalous). We also compute Precision@100, Recall@100, and F1@100 for (i) the entire test set of edges and (ii) the perturbed edges of each anomaly type. For ties in the ranking, we extend the list beyond 100 until the tie is broken (e.g., if the 100th and 101st edges have the same score, we compute over 101 edges). Ties did not often extend much beyond 100, though for KGist_Freq they tended to extend farthest. Perturbed edges are treated as positives and un-perturbed edges as negatives. When computing scores for a particular anomaly type, we first filter the ranking to contain only the edges perturbed by that anomaly type and the un-perturbed edges, to ensure that edges perturbed by other anomaly types are not counted as false negatives.
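The tie-handling rule above can be sketched as follows (names are ours):

```python
def precision_at_k(ranking, scores, positives, k=100):
    """Precision@k over a ranking sorted by descending anomaly score.
    If the edge at the cutoff ties with the last included edge, the cutoff
    is extended until the tie is broken."""
    cut = min(k, len(ranking))
    while cut < len(ranking) and scores[ranking[cut]] == scores[ranking[cut - 1]]:
        cut += 1
    top = ranking[:cut]
    return sum(1 for e in top if e in positives) / len(top)
```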
Results. In Table 4 we report results for anomalies generated with a fixed sampling probability and 5 random seeds, giving the average and standard deviation over the 5 perturbed graphs. Across all anomaly types, KGist+m is most effective at identifying anomalous edges, demonstrating its generality. This is further evidenced by its top average ranking: 1.95 across all anomaly types and metrics. Furthermore, as discussed in Fig. 4, not only can it identify anomalies, but its interpretable rules allow us to reason about why something is anomalous.

In most cases, KGist+m even outperformed ComplEx and TransE, supervised methods requiring validation data for hyper-parameter tuning. A2 is the only anomaly type where supervised methods outperform unsupervised methods, and the difference is not statistically significant. KGist_Freq performs better than most other baselines, demonstrating that our formulation of anomaly scores and rules is effective at finding anomalies. However, as KGist+m usually outperforms KGist_Freq, we conclude that MDL leads to improvement over simpler rule selection approaches. AMIE+ performed well only on A3. We conjecture that this is because randomly injected edges are likely to be left un-predicted by all of AMIE+'s rules, whereas edges with perturbed endpoints may be left un-predicted by some rules, but, out of the large number of rules that AMIE+ mines (§ 5.1.2), some rule is likely to still predict them. The results for other sampling probabilities were overall consistent, with a few fluctuations between KGist+m and KGist_Freq; we omit them for brevity.
In this section, we evaluate KGist’s ability to find missing information. Most KG completion methods target link prediction, which seeks to find missing links between pairs of nodes that are present in a KG. If either node is missing, then link prediction cannot provide any information. We focus on this task: revealing where entities are missing. Since KGist’s rules apply to nodes, rather than edges, the rule exceptions can reveal where links to seen or unseen entities are missing (but cannot predict which specific entity the link should be to). Thus, our task and link prediction are complementary.
Setup. We assume the commonly used partial completeness assumption (PCA) (paulheim2017knowledge, ; galarraga2013amie, ; galarraga2017predicting, ), according to which, if an entity has one relation of a particular type, then it has all relations of that type in the KG (e.g., a movie with at least one actor listed in the KG has all its actors listed). We generate a perturbed KG with ground-truth incomplete information via the following steps: (1) we randomly remove a fraction of the nodes (and their adjacent edges) from the KG, and (2) we enforce the PCA (e.g., if we removed one actor from a movie, then we remove all the movie's actor edges). Our goal is to identify that the neighbors of the removed nodes are missing information, and what that information is. We run KGist on the perturbed KG and identify the exceptions of each rule in the summary. If a rule asserts the removed information, then this is a true positive. For example, if we removed Frankenstein's author and KGist mines the rule that books are written by authors, then that rule asserts the removed information. We use NELL and DBpedia for this experiment, as their sizes permit several runs over different perturbed KG variants.
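Step (2) can be sketched as follows, assuming directed (subject, predicate, object) triples and, for brevity, enforcing the PCA only on the object side (e.g., a movie that lost one actor loses all its actor edges; the subject side would be handled analogously):

```python
def remove_and_enforce_pca(edges, removed_nodes):
    """Drop all edges incident to removed nodes, then drop every edge whose
    (subject, predicate) pair lost at least one object to the removal."""
    broken = {(s, p) for s, p, o in edges if o in removed_nodes}
    return {(s, p, o) for s, p, o in edges
            if s not in removed_nodes
            and o not in removed_nodes
            and (s, p) not in broken}
```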
Baselines. Link prediction methods are typically used for KG completion, but they do not apply to our setting: they require that both endpoints of an edge be present in the KG to predict the edge, while our setup assumes that one endpoint is missing. Thus, we compare KGist to Freq and AMIE+C. Freq, as before (§ 5.1.1), selects the top-k rules with the most correct assertions, where k is set to the number of rules KGist mines. AMIE+C is what we name the method from (galarraga2017predicting, ). AMIE+C requires training data comprised of (entity, incomplete, predicate) triples, each specifying that the entity is missing its links of that predicate type (e.g., a movie is missing actors). We use 80% of the removed data as training data for AMIE+C and test all methods on the remaining 20%. We tune AMIE+C's parameters as in (galarraga2017predicting, ).
Metrics. We report only recall, since information that we did not remove but that is reported missing could be either a false positive or real missing information we did not create (paulheim2017knowledge, ). We compute recall as the number of nodes identified as missing, divided by the total number of nodes removed. In addition, we compute a stricter recall variant, which also requires that the missing node's label be correctly identified (e.g., not only must we predict the absence of a missing writtenBy edge, but also that the edge should connect to an author node). KGist can return this label information, but AMIE+C predicts only the missing link, not the label; thus, we omit the stricter variant for AMIE+C.
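Both recall variants reduce to set ratios over the removed nodes; a minimal sketch with illustrative names:

```python
def completeness_recall(removed, flagged, flagged_with_label):
    """Recall: fraction of removed nodes whose absence a method flags.
    The stricter variant also requires the missing node's label to be
    predicted correctly (flagged_with_label is a subset of flagged)."""
    recall = len(removed & flagged) / len(removed)
    strict = len(removed & flagged_with_label) / len(removed)
    return recall, strict
```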
Results. We report results in Table 5 (they are consistent with other removal fractions). KGist outperforms all baselines by a statistically significant margin (10-11% on recall, 13-27% on the stricter label-aware variant; paired t-test), which demonstrates its effectiveness in finding information missing from a KG. We conjecture that KGist outperforms AMIE+C because AMIE+C requires that all training data be focused around a small number of predicates (e.g., 10-11 in (galarraga2017predicting, )) in order to learn effective rules, while MDL encourages KGist to explain as much of the KG as possible, allowing it to find missing information over more diverse regions of the KG. This is further evidenced by the fact that KGist outperforms Freq, which applies only to frequent regions of the KG. Not only does KGist effectively reveal where the missing nodes are, it also usually correctly identifies their labels (only a small drop from recall to the stricter variant). AMIE+C cannot do this, and Freq can report the label only sometimes, by taking advantage of the rules we formulated in this work.
Table 5: Completeness identification results. LP (link prediction) is supervised but does not apply in this setting; AMIE+C is supervised; Freq and KGist are unsupervised. AMIE+C cannot report labels.

| Dataset | Metric | LP | AMIE+C (galarraga2017predicting, ) | Freq | KGist |
|---|---|---|---|---|---|
| NELL | recall | N/A | | | |
| NELL | recall (label-aware) | N/A | N/A | | |
| DBpedia | recall | N/A | | | |
| DBpedia | recall (label-aware) | N/A | N/A | | |
In this section, we evaluate KGist’s performance as the number of edges in the KG grows. We perform this evaluation on an Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz with 1TB RAM, using a Python implementation. NELL, DBpedia, and Yago have from 231,634 to 12,027,848 edges. We run KGist on each KG three times. Since we aim to analyze the runtime with respect to the number of edges, but Yago has three orders of magnitude more labels, we run KGist with an optimization that only generates candidate rules with the 300 most frequent labels (approximately equal to NELL and DBpedia), allowing us to fairly investigate the effect of the number of edges. Figure 5 shows the number of edges vs. runtime in seconds. In practice, KGist is near-linear in the number of edges. Even on Yago, it mines summaries in only a few minutes, and on NELL in seconds.
This paper proposes a unified, information theoretic approach to KG characterization, KGist, which solves our proposed problem of inductive summarization with MDL. KGist describes what is normal in a KG with a set of interpretable, inductive rules, which we define in a new, graph-theoretic way. The rule exceptions, and the parts of the KG that the summary fails to describe reveal what is strange and missing in the KG. KGist detects various anomaly types and incomplete information in a principled, unified manner, while scaling nearly linearly with the number of edges in a KG—this property allows it to be applied to large, real-world KGs. Future work could explore using KGist’s rules to guide KG construction.
This material is based upon work supported by the National Science Foundation under Grant No. IIS 1845491, Army Young Investigator Award No. W911NF1810397, an Adobe Digital Experience research faculty award, and an Amazon faculty award. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or other funding parties.
A. Carlson et al. Toward an Architecture for Never-Ending Language Learning. In Proceedings of the 24th AAAI Conference on Artificial Intelligence (AAAI), Atlanta, GA, 2010.
M. Nickel et al. A Review of Relational Machine Learning for Knowledge Graphs. Proceedings of the IEEE, 104(1):11–33, 2015.
J. Rissanen. A Universal Prior for Integers and Estimation by Minimum Description Length. The Annals of Statistics, 11(2), 1983.