RuDaS: Synthetic Datasets for Rule Learning and Evaluation Tools

09/16/2019 ∙ by Cristina Cornelio, et al. ∙ ibm 0

Logical rules are a popular knowledge representation language in many domains, representing background knowledge and encoding information that can be derived from given facts in a compact form. However, rule formulation is a complex process that requires deep domain expertise, and is further challenged by today's often large, heterogeneous, and incomplete knowledge graphs. Several approaches for learning rules automatically, given a set of input example facts, have been proposed over time, including, more recently, neural systems. Yet, the area is missing adequate datasets and evaluation approaches: existing datasets often resemble toy examples that neither cover the various kinds of dependencies between rules nor allow for testing scalability. We present a tool for generating different kinds of datasets and for evaluating rule learning systems.




1 Introduction

Logical rules are a popular knowledge representation language in many domains. They represent domain knowledge, encode information that can be derived from given facts in a compact form, and allow for logical reasoning. For example, given facts and , the datalog rule [S. Ceri and Tanca1989] , encodes the fact and describes its dependency on the other facts. Moreover, if the data grows and new facts are added, we can automatically derive new knowledge. Since rule formulation is complex and requires domain expertise, rule learning [Raedt2008, Fürnkranz, Gamberger, and Lavrac2012] has been an area of active research in AI for a long time, also under the name inductive logic programming (ILP). It has recently been revived by the increasing use of knowledge graphs (KGs), which can be considered as large fact collections. KGs are used in various domains, for example in the Semantic Web or at companies such as Google [Dong et al.2014] or Amazon [Krishnan2018], and there are large knowledge bases in the medical domain. Useful rules over these knowledge bases would provide various benefits.

However, we argue that the evaluations of current ILP systems are insufficient. We demonstrate that the reported results are questionable, especially, in terms of generalization and because the datasets are lacking in various dimensions.

The evaluation of rule learning has changed over time. While the classical rule learning methods often focused on tricky problems in complex domains [ILP, Quinlan1990] and proved to be effective in practical applications, current evaluations can be divided into three categories. Some consider very small example problems with usually fewer than 50 facts and only a few rules to be learned [Evans and Grefenstette2018]. Often, these problems are completely defined, in the sense that all facts are classified as either true or false, or there are at least some negative examples given. Hence, the systems can be thoroughly evaluated based on classical measures such as accuracy. Other evaluations regard (subsets of) real KGs such as Wikidata or DBpedia, some with millions of facts [Galárraga et al.2015, Omran, Wang, and Wang2018, Ho et al.2018]. Since there are no rules over these KGs, the rule suggestions of the systems are usually evaluated using metrics capturing the precision and coverage of rules (e.g., standard confidence [Galárraga et al.2015]) based on the facts contained in the KG. However, since the KGs are generally incomplete, the quality of the rule suggestions is not fully captured in this way. For instance, Omran, Wang, and Wang [2018] present an illustrative example rule, , which might well capture the facts in many existing KGs but which is heavily biased and does not extend to the entirety of valid facts beyond them. Furthermore, we cannot assume that the few considered KGs completely capture the variety of existing domains and especially the rules in them. For example, Minervini et al. [2018] propose rules over WordNet that are of a very simple nature (containing only a small number of the predicates in WordNet and having only a single body atom) and very different from the ones suggested in [Galárraga et al.2015] for other KGs.

Also the evaluation metrics vary, especially considering the intersection between more modern and classic approaches. We will show that most of the standard information retrieval measures used in machine learning are not adequate for a logic context because they neglect important facets like the size of the Herbrand universe (e.g., this may yield a too high accuracy). Some other measures have been used for neural ILP such as Mean Reciprocal Rank, or precision/recall@K, but they can be applied only in specific cases (i.e. the system outputs weighted/probabilistic rules or a ranking of facts). Yet, strict logic measures are not perfect either, since they are based on the assumption that the domain is very small and human understandable. For this reason the community needs to consider several metrics and should define new metrics suitable for both worlds.

Recently, synthetic datasets have been proposed, but they are very simple and do not cover all characteristics necessary to evaluate an ILP tool properly: 1) Dong et al. [2019] provide a first synthetic dataset generator for graph reasoning, which can produce an arbitrary number of facts regarding five fixed predicates, while the rules are handwritten. 2) on_the_fly argue, in line with us, for more diverse datasets for rule learning. However, their generated datasets are still restricted in several dimensions, e.g., small size and very simple rules (based on five fixed templates). Moreover, there are well-known ILP competitions (for example, in 2016) in the logic community, but they consider only a few real-world datasets and base the evaluation only on test facts and not on rules.

In summary, we claim that the existing datasets are not sufficient to cover the possible variety of data and the rules that could be mined from arbitrary data. Many existing KGs are large, noisy, and heterogeneous, and might embed complex rules. The problem is that we do not know whether such embedded rules do not exist or whether they are just not learned today because of the restrictions of the current rule learners. Since it is unclear what sort of complexity is required to model the real world, we opted for an artificial but largely random approach that covers different kinds of variety and complexity missing in today’s datasets.

In this paper, we present RuDaS (Synthetic Datasets for Rule Learning), a tool for generating synthetic datasets containing both facts and rules, and for evaluating rule learning systems, that overcomes the above-mentioned shortcomings of existing datasets and evaluation methods. RuDaS is highly parameterizable; for instance, the number of constants, predicates, facts, and consequences of rules (i.e., completeness), the amount of noise (e.g., wrong or missing facts), and the kinds of dependencies between rules can be selected. Moreover, RuDaS allows for assessing the performance of rule learning systems by computing classical and more recent metrics, including a new one that we introduce. Finally, we evaluate representatives of different types of rule learning systems on our datasets, demonstrating the necessity of having a diversified portfolio of datasets to help reveal the variety in the capabilities of the systems, and thus also to support researchers in developing and optimizing new and existing approaches. RuDaS is available at

2 Rule Learning Preliminaries

We assume the reader to be familiar with first-order logic (FOL) and its related concepts (e.g., inference, Herbrand models and universes, etc.).

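We consider datalog rules [S. Ceri and Tanca1989]:

$\alpha_0 \;\text{:-}\; \alpha_1 \wedge \dots \wedge \alpha_m \qquad (1)$

of length $m$ where all atoms $\alpha_j$, $0 \le j \le m$, are of the form $p(t_1, \dots, t_n)$ with a predicate $p$ of arity $n$ and terms $t_i$, $1 \le i \le n$. A term is either a constant or a variable. $\alpha_0$ is called the head and the conjunction $\alpha_1 \wedge \dots \wedge \alpha_m$ the body of the rule. All variables that occur in the head must occur in the body. A fact is an atom not containing variables.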
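The definitions above can be made concrete with a small data structure. The following is a minimal sketch and not part of RuDaS: the `Atom` and `Rule` classes, the reading of rule length as the number of body atoms, and the Prolog-style convention that variables start with an uppercase letter are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Atom:
    predicate: str
    terms: Tuple[str, ...]  # each term is a constant or a variable

    def variables(self):
        # Assumed convention (as in Prolog): variables start with an uppercase letter.
        return {t for t in self.terms if t[:1].isupper()}


@dataclass(frozen=True)
class Rule:
    head: Atom
    body: Tuple[Atom, ...]

    def length(self):
        # Assumed reading of "length": the number of body atoms.
        return len(self.body)

    def is_safe(self):
        # All variables that occur in the head must occur in the body.
        body_vars = set()
        for atom in self.body:
            body_vars |= atom.variables()
        return self.head.variables() <= body_vars


def is_fact(atom):
    # A fact is an atom not containing variables.
    return not atom.variables()


# Hypothetical example rule: p0(X, Z) :- p1(X, Y), p1(Y, Z).
safe_rule = Rule(
    head=Atom("p0", ("X", "Z")),
    body=(Atom("p1", ("X", "Y")), Atom("p1", ("Y", "Z"))),
)
# W occurs in the head but not in the body, so this rule violates the condition.
unsafe_rule = Rule(head=Atom("p0", ("X", "W")), body=(Atom("p1", ("X", "Y")),))
```

Here `safe_rule.is_safe()` holds while `unsafe_rule.is_safe()` does not, and `is_fact(Atom("p1", ("c0", "c1")))` holds because both terms are constants.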

Note that several classical ILP systems also consider more complex function-free Horn rules, which allow for existential quantification in the rule head or negation in the body, but most recent systems focus on datalog rules or restrictions of those [Galárraga et al.2015, Evans and Grefenstette2018, Rocktäschel and Riedel2017]. In particular, reasoning systems for KGs [Yang, Yang, and Cohen2017, Omran, Wang, and Wang2018] often consider only binary predicates and chain rules of the form $p_0(X_1, X_{m+1}) \;\text{:-}\; p_1(X_1, X_2) \wedge \dots \wedge p_m(X_m, X_{m+1})$.

We define the problem of rule learning in the most general way: given background knowledge in the form of facts, including a set of so-called positive examples (vs. negative or counter-examples), the goal is to learn rules that can be used to infer the positive examples from the background knowledge, based on standard FOL semantics. As it is common today, we do not separate the background knowledge into two types of facts but consider a single set of facts as input.

We recall that the closed-world assumption (CWA), as opposed to the open-world assumption (OWA), states that all facts that are not explicitly given as true are assumed to be false.

A short overview of different types of rule learning systems is given in the appendix.

3 RuDaS Datasets

RuDaS contains an easy-to-use generator for ILP datasets. It generates datasets that vary in many dimensions and is highly parameterizable. While existing datasets are missing more detailed specifications but are described only in terms of size and number of different constants and predicates, we propose a much more detailed set of parameters which can serve as a general classification scheme for ILP datasets, and support evaluations. In this section, we give details about these parameters, and thus on the possible shapes of RuDaS datasets. Each dataset contains the rules and the facts in files in standard Prolog format (using the syntax of Rule (1)). We also describe example datasets we generated, which can be found in our repository.

Symbols. Our datasets are domain independent, which means that we consider synthetic names for predicates, for constants, and for variables with . While the kinds and numbers of the symbols used are random, they can be controlled by setting the following generator parameters:

  • number of constants and predicates

  • min/max arity of predicates

Observe that these numbers influence the variability and number of generated rules and facts.

Figure 1: Example rule structures generated for the different categories with size S and depth 2: (a) Chain, (b) Rooted DG (RDG), (c) Disjunctive Rooted DG (DRDG).

Rules. RuDaS datasets contain datalog rules (see Section 2) of variable structure. The generation is largely at random in terms of which predicates, variables, and constants appear in the rules; that is, in the structure of every single rule. We only require the head to contain some variable.

To classify a set of rules, we propose four categories depending on the dependencies between rules: Chain, Rooted Directed Graph (DG), Disjunctive Rooted DG, and Mixed. Figure 1 shows a generated rule set for each of the first three categories. The dependencies between the rules are represented as edges in a directed graph where the rules are the nodes. That is, an incoming edge shows that the facts inferred by the child node’s rule might be used during inference with the rule at the parent node. The node at the top is called the root. In the following, we use (rule) graph and DG interchangeably.

Category Chain. Each rule, except the one at the root, infers facts relevant for exactly one other rule (i.e., every node has at most one parent node) and, for each rule, there is at most one such other rule which might infer facts relevant for the rule (i.e., every node has at most one child node). Recursive rules (where the predicate in the head also occurs in the body) are an exception: they are relevant for themselves and for one other rule (i.e., the graph has a small loop at each node representing a recursive rule).

Category Rooted DG (RDG). It generalizes category Chain in that every rule can be relevant for several others (i.e., each node can have multiple parent nodes). Furthermore, for each rule, there may be several other rules which might infer facts relevant for the rule (i.e., a node may have several child nodes); and at least one such case exists. But, for each predicate occurring in the body of the former rule, there must be at most one other rule with this predicate in the head; that is, there are no alternative rules to derive facts relevant for a rule w.r.t. a specific body atom.

Category Disjunctive Rooted DG (DRDG). It generalizes category RDG by allowing for the latter alternative rules (represented as children of an “OR” node); and at least one such case exists.

Category Mixed. A rule graph that contains connected components belonging to different categories among the above.

Figure 1 illustrates the differences between the categories. In (a), for each rule, there is at most one child node with a rule relevant for its derivations. In (b), there might be multiple children, but each child node contains a different predicate in the head. In (c), the latter does not hold anymore; for given facts, there may be various derivations.
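The distinction between the three categories can be sketched as a check on the rule dependency graph. This is an illustrative simplification and not the RuDaS implementation: rules are reduced to hypothetical `(head_predicate, body_predicates)` pairs, the rule set is assumed to form a single connected component (so category Mixed is not handled), and the self-loops of recursive rules are ignored.

```python
from collections import defaultdict


def classify(rules):
    """Classify a rule set as 'Chain', 'RDG', or 'DRDG'.

    rules: list of (head_predicate, [body_predicates]) pairs.
    """
    heads = defaultdict(list)  # predicate -> indices of rules deriving it
    for i, (head, _) in enumerate(rules):
        heads[head].append(i)

    children = defaultdict(set)  # rule -> rules that may infer relevant facts
    parents = defaultdict(set)   # rule -> rules that may use its consequences
    disjunctive = False
    for i, (_, body) in enumerate(rules):
        for pred in set(body):
            # Other rules with this body predicate in the head.
            providers = [j for j in heads.get(pred, []) if j != i]
            if len(providers) > 1:
                # Alternative rules for one body predicate.
                disjunctive = True
            for j in providers:
                children[i].add(j)
                parents[j].add(i)

    if disjunctive:
        return "DRDG"
    branching = any(len(c) > 1 for c in children.values()) or any(
        len(p) > 1 for p in parents.values()
    )
    return "RDG" if branching else "Chain"


chain = [("p0", ["p1"]), ("p1", ["p2"])]
rdg = [("p0", ["p1", "p2"]), ("p1", ["p3"]), ("p2", ["p4"])]
drdg = [("p0", ["p1"]), ("p1", ["p2"]), ("p1", ["p3"])]
```

In the `drdg` example, two rules derive `p1`, which is the situation the "OR" node represents in Figure 1 (c).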

The numbers and categories of connected components are selected randomly by default. The shape of RuDaS rule sets can be influenced with the following parameters though:

  • number and maximal length of rules

  • category of connected components (i.e., one of the above)

  • min/max number of connected components

  • maximal depth of rule graphs (i.e., number of rule nodes in the maximum of the shortest paths between root and leaves)

Facts. The main advantage of the RuDaS datasets, the availability of the rules, allows for classifying the facts as well. More specifically, facts can be (ir)relevant for inference, depending on if their predicates do (not) occur in a rule body, and they may be consequences of inferences. While such a classification of facts is impossible for all the existing datasets that do not contain rules, it allows for a better evaluation of the rule learners’ capabilities (see Section 6).

RuDaS fact sets vary in the following parameters:

  • dataset size: XS, S, M, L, XL

  • open-world degree

  • amount of noise in the data ,

An XS dataset contains about 50-100 facts, an S dataset about 101-1,000, an M dataset about 1,001-10,000, an L dataset about 10,001-100,000, and an XL dataset about 100,001-500,000. For larger sizes, we suggest meaningful abbreviations in the form of X2L for XXL etc., which allow for extension while being short and easy to understand. Since the main purpose of RuDaS is allowing the analysis of the rules learned (vs. scalability), we have however not considered such larger datasets so far. The open-world degree specifies how many of the consequences from an initial set of relevant facts, called support facts, are missing in the dataset (see Section 4 for a detailed description of the generation process). By noise, we mean facts that are not helpful in learning the rules either because they are not relevant for deriving the positive examples () or because they are relevant but missing ().
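The size classes can be written down as a small lookup. This is only a sketch: the text gives approximate ranges ("about"), so treating them as hard bounds, as well as the names `SIZE_BOUNDS` and `size_label`, are assumptions made here.

```python
# Approximate fact-count ranges per size class, taken from the text above.
SIZE_BOUNDS = [
    ("XS", 50, 100),
    ("S", 101, 1_000),
    ("M", 1_001, 10_000),
    ("L", 10_001, 100_000),
    ("XL", 100_001, 500_000),
]


def size_label(num_facts):
    """Return the size class for a fact count, or None if outside all ranges."""
    for label, low, high in SIZE_BOUNDS:
        if low <= num_facts <= high:
            return label
    return None
```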

3.1 Example Datasets: RuDaS-v0

For demonstration purposes, we generated RuDaS-v0, a set of datasets which are available to the community (in our repository), and which we also used in our experiments (see Section 6). The datasets model different possible scenarios, and mainly vary in the structures and sizes of the rule sets and in the sorts and quantities of facts. RuDaS-v0 contains 40 Chain, 78 RDG, and 78 DRDG datasets, of sizes XS and S, and of depths 2 and 3, all evenly distributed. A table with further details is shown in the appendix. Note that each of the rule sets in RuDaS-v0 consists of exactly one connected component, and that we did not generate rule sets of category Mixed; datasets with more connected components of possibly different categories can be easily created by combining the datasets we generated. Further, we constrained both the maximal rule length and the arity of atoms to two because several existing rule learning systems require that.

All the datasets were generated such that they are missing 20-40% of all consequences and 15-30% of the original support facts, and contain 10-30% facts that are irrelevant for the derivation of positive examples. Since real datasets may vary strongly in the numbers of missing consequences and noise and, in particular, since these numbers are generally unknown, we chose factors that seemed reasonable to us. Also note that there is information regarding the accuracy of real fact sets such as YAGO (95%) and NELL (87%), which measures the amount of data correctly extracted from the Web etc. and hence corresponds to in our setting. Our choices in this regard thus seem to be realistic.

We hence simulated an open-world setting and incorporated noise. While we consider this to be the most realistic training or evaluation scenario, specific rule learning capabilities might be better evaluated in more artificial settings with either consequences or noise missing. For this reason, every dataset mentioned in the table additionally includes files containing the incomplete set of facts without noise (i.e., as in the table; ; ) and the complete set of facts (i.e., ), with and without noise.

4 Dataset Generation

In this section, we describe the generation process of the rules and facts in detail, assuming the generator parameters (also configuration) listed in Section 3 to be set.

Preprocessing. As already mentioned, many parameters are determined randomly in a preprocessing step if they are not fixed in the configuration, such as the symbols that will be used, the numbers of DGs to be generated, and their depths. However, all random selections we mention are within the bounds given in the configuration under consideration; for instance, we ensure that the symbols chosen suffice to generate rule graphs and fact sets of selected size and that at least one graph is of the given maximal depth.

Rule generation. According to the rule set category specified and graph depths determined, rules (nodes in the graphs) of form (1) are generated top down breadth first, for each of the rule graphs to be constructed. The generation is largely at random, that is, w.r.t. the number of child nodes of a node and which body atom they relate to; the number of atoms in a rule; and the predicates within the latter, including the choice of the target predicate (i.e., the predicate in the head of the root) in the very first step. RuDaS also offers the option that all graphs have the same target predicate. To allow for more derivations, we currently only consider variables as terms in head atoms; the choice of the remaining terms is based on probabilities as described in the following. Given the atoms to be considered (in terms of their number and predicates) and an arbitrary choice of head variables, we first determine a position for each of the latter in the former. Then we populate the other positions one after the other: a head variable is chosen with probability ; for one of the variables introduced so far, we have probability ; for a constant, ; and, for a fresh variable, . While this conditional scheme might seem rather complex, we found that it works best in terms of the variety it yields; nevertheless, the probabilities can be changed easily.

Fact generation. The fact generation is done in three phases: we first construct a set of relevant facts in a closed-world setting, consisting of support facts and their consequences , and then adapt it according to and .

Following the natural idea, we generate facts by instantiating the rule graphs multiple times, based on the assumption that rule learning systems need positive examples for a rule to learn that rule, and stop the generation when the requested number of facts has been generated. We actually stop later because we need to account for the fact that we will subsequently delete some of them according to . More specifically, we continuously iterate over all rule graphs, for each, select an arbitrary but fresh variable assignment , and then iterate over the graph nodes as described in the following, in a bottom-up way. First, we consider each leaf and corresponding rule of form (1) and generate support facts . Then, we infer the consequences based on the rules and all facts generated so far. For every node on the next level and corresponding rule of form (1), we only generate those of the facts as support facts which are not among the consequences inferred previously. We then again apply inference, possibly obtaining new consequences, and continue iterating over all nodes in the graph in this way. We further diversify the process based on two integer parameters, and : in every -th iteration the graph is instantiated exactly in the way described; in the other iterations, we skip the instantiation of a node with probability 1/ and, in the case of DRDGs, only instantiate a single branch below disjunctive nodes. We implemented this diversification to have more variability in the support facts, and to avoid having only complete paths from the leaves to the root.

In the open-world setting, we subsequently construct a set by randomly deleting consequences from according to the open-world degree given: assuming to be the set of target facts (i.e., consequences containing the target predicate), we remove from , and similarly from . In this way, we ensure that the open-world degree is reflected in the target facts. However, there is also the option to make the deletion more arbitrary by removing from instead of splitting the deletion into two parts.
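The split deletion described above can be sketched as follows. This is a hedged illustration rather than the RuDaS code: the function name `apply_open_world`, the parameter `ow_degree` (a fraction between 0 and 1), the representation of facts as `(predicate, arguments)` tuples, and the rounding of the number of deleted facts are all assumptions.

```python
import random


def apply_open_world(consequences, target_predicate, ow_degree, rng=None):
    """Delete a fraction `ow_degree` of the consequences.

    The deletion is applied separately to target facts and to the remaining
    consequences, so that the degree is reflected in the target facts.
    """
    rng = rng or random.Random(0)  # fixed seed only for reproducibility here
    target = sorted(f for f in consequences if f[0] == target_predicate)
    other = sorted(f for f in consequences if f[0] != target_predicate)
    kept = set()
    for group in (target, other):
        n_remove = round(ow_degree * len(group))
        removed = set(rng.sample(group, n_remove))
        kept |= set(group) - removed
    return kept


# Hypothetical consequences: 4 facts on the target predicate p0, 8 on p1.
consequences = {("p0", (f"c{i}", "c0")) for i in range(4)} | {
    ("p1", (f"c{i}", "c0")) for i in range(8)
}
kept = apply_open_world(consequences, "p0", 0.25)
```

With a degree of 0.25, one of the four target facts and two of the eight other consequences are removed. The more arbitrary variant mentioned above would instead sample once from the whole set.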

The noise generation is split similarly. Specifically, we construct a set based on by arbitrarily removing from , and by adding arbitrary fresh facts that are neither in (i.e., we do not add facts which we have removed in the previous step) nor contain the target predicate such that contains of noise. In addition, we add arbitrary fresh facts on the target predicate that are not in already such that the set of facts within on that predicate finally contains of noise.

Output. The dataset generation produces: the rules; a training set (), which is of the requested size, and fulfills , , and ; and custom fact sets and for our evaluation tools generated in the same way as and . For further experiments, RuDaS also outputs , (an adaptation of which contains noise but all of ), , , and  (see also the end of Section 3.1).

5 Evaluation Tools

RuDaS also contains an evaluator (written in Python) that compares the original rules of a dataset to the ones produced by a rule learning system. In this section, we describe this evaluator and the different measures it provides.

We focus on three logic(-inspired) distances and four standard information retrieval measures that are relevant to our goal of capturing rule learning performance: 1) Herbrand distance, the traditional distance between Herbrand models; two normalized versions of the Herbrand distance, namely 2) Herbrand accuracy (H-accuracy) and 3) Herbrand score (H-score), a new metric we propose in this paper; 4) accuracy; 5) precision; 6) recall; and 7) F1-score.

Our test fact sets (both facts and consequences) in the evaluation do not contain noise, and all the consequences can be recovered by the original rules applied over the given facts. In line with that, we focused on measures that maintain the closed-world assumption, and did not include in RuDaS measures that focus on the open-world aspect for the evaluation (e.g., PCA in [Galárraga et al.2015]). That said, as explained in Section 5.2, F1-score is the metric in RuDaS best suited to an open-world evaluation.

In what follows, denotes the set of facts inferred by grounding the rules over the support facts excluding the facts in . We denote an original rule set by , a learned one by , and support facts by . Our evaluation is performed by comparing two sets: 1) obtained by the application of the induced rules to the fact sets ( and described in Section 4, Output) using a forward-chaining engine (written in Python and available in our tool); 2) that corresponds to : the result of the application of the original rules to the fact set .
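To illustrate how a set of inferred facts can be obtained from rules and support facts, the following is a naive forward-chaining sketch. It is not the engine shipped with RuDaS: the representation of atoms as `(predicate, terms)` tuples, the uppercase-variable convention, and the function names are assumptions, and no attention is paid to efficiency.

```python
def is_var(term):
    # Assumed convention: variables start with an uppercase letter.
    return term[:1].isupper()


def match(atom, fact, subst):
    """Extend `subst` so that `atom` equals the ground `fact`, or return None."""
    (pred, terms), (fact_pred, fact_args) = atom, fact
    if pred != fact_pred or len(terms) != len(fact_args):
        return None
    extended = dict(subst)
    for term, arg in zip(terms, fact_args):
        if is_var(term):
            if extended.setdefault(term, arg) != arg:
                return None  # variable already bound to another constant
        elif term != arg:
            return None  # constant mismatch
    return extended


def forward_chain(rules, facts):
    """Return all facts derivable with `rules` from `facts`, excluding `facts`."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            substs = [{}]
            for atom in body:
                next_substs = []
                for subst in substs:
                    for fact in known:
                        extended = match(atom, fact, subst)
                        if extended is not None:
                            next_substs.append(extended)
                substs = next_substs
            for subst in substs:
                derived = (head[0], tuple(subst.get(t, t) for t in head[1]))
                if derived not in known:
                    known.add(derived)
                    changed = True
    return known - set(facts)


# Hypothetical rule p0(X, Z) :- p1(X, Y), p1(Y, Z) and two support facts.
rules = [(("p0", ("X", "Z")), [("p1", ("X", "Y")), ("p1", ("Y", "Z"))])]
facts = {("p1", ("c0", "c1")), ("p1", ("c1", "c2"))}
inferred = forward_chain(rules, facts)
```

For this input, the only inferred fact is `p0(c0, c2)`. The loop repeats until no new fact is added, which also covers recursive rules.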

5.1 Logic Measures

The Herbrand distance between two logic programs (sets of rules), defined over the same set of constants and predicates, is defined as the number of facts that differ between the two minimal Herbrand models of the two programs:

The standard confidence [Galárraga et al.2015] is the fraction of correctly inferred facts w.r.t. all facts that can be inferred by the learned rules capturing their precision:

In our closed-world setting, this corresponds to the precision of a model, since it is easy to see that are the number of true positive examples and corresponds to the union of true and false positive examples. The Herbrand accuracy corresponds to the Herbrand distance normalized on the Herbrand universe: , where is the size of the Herbrand universe defined by the original program. We introduce a new metric, the Herbrand score (H-score) defined as:

H-score provides an advantage over the other metrics since it captures both how many correct facts a set of rules produces and also its completeness (how many of the facts inferred by the original rules were correctly discovered), while the other measures consider these points only partially.

Note that Herbrand accuracy is not a significant measure if or the Herbrand universe is large, because, in these cases, it will be very high (close to ) disregarding the quality of the rules. This happens because all the facts in are considered correct predictions, as well as the facts in the Herbrand universe that neither appear in nor in .
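A small numeric sketch shows why Herbrand accuracy saturates. The Herbrand distance is computed as described above (the number of facts that differ between the two models). The normalization `1 - distance / universe_size` and the counting of all ground atoms over the given constants and predicates as the "size of the Herbrand universe" are assumptions made for this illustration, as are the function names.

```python
def herbrand_distance(model_a, model_b):
    # Number of facts that differ between the two models (symmetric difference).
    return len(set(model_a) ^ set(model_b))


def ground_atom_count(num_constants, arities):
    # Assumption: one ground atom per predicate and tuple of constants.
    return sum(num_constants ** arity for arity in arities)


def herbrand_accuracy(model_a, model_b, universe_size):
    # Assumed normalization of the Herbrand distance.
    return 1 - herbrand_distance(model_a, model_b) / universe_size


# Two completely different models with 10 facts each.
original = {("p0", (f"c{i}", "c0")) for i in range(10)}
learned = {("p1", (f"c{i}", "c0")) for i in range(10)}

# 100 constants and 5 binary predicates give 50,000 ground atoms.
universe_size = ground_atom_count(100, [2, 2, 2, 2, 2])
h_acc = herbrand_accuracy(original, learned, universe_size)
```

Although the learned model shares no fact with the original one, the distance of 20 is negligible against 50,000 ground atoms, so `h_acc` is 0.9996 under these assumptions.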

5.2 Information Retrieval Measures

We adapted the main measures used in machine learning evaluations to our context. We define: the number of true positive examples (TP) as the cardinality ; the number of false positive examples (FP) as the cardinality ; the number of false negative examples (FN) as the cardinality ; and the number of true negative examples (TN) as the cardinality of the difference between the Herbrand universe and the union . Given these four definitions, accuracy, precision, recall, F1-score etc. can be defined as usual [Russell and Norvig2002].
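The four counts and the derived measures can be sketched on plain sets of facts. This is an illustration under assumptions, not the RuDaS evaluator: TP, FP, and FN are read in the usual way (intersection and the two set differences between the facts inferred by the learned and by the original rules), the Herbrand universe is represented only by its size, and all names are hypothetical.

```python
def confusion_counts(predicted, expected, universe_size):
    """Return (TP, FP, FN, TN) for two sets of inferred facts."""
    predicted, expected = set(predicted), set(expected)
    tp = len(predicted & expected)   # inferred by both rule sets
    fp = len(predicted - expected)   # inferred only by the learned rules
    fn = len(expected - predicted)   # inferred only by the original rules
    tn = universe_size - len(predicted | expected)
    return tp, fp, fn, tn


def ir_measures(tp, fp, fn, tn):
    def ratio(num, den):
        return num / den if den else 0.0

    return {
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


# Facts are abbreviated to single letters for this example.
predicted = {"a", "b", "c"}
expected = {"b", "c", "d", "e"}
counts = confusion_counts(predicted, expected, universe_size=100)
scores = ir_measures(*counts)
```

In this example the counts are TP = 2, FP = 1, FN = 2, and TN = 95, so accuracy is 0.97 even though precision is 2/3 and recall is 0.5.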

Note that the accuracy measure is not a significant measure if or the Herbrand universe is large, for the same reason reported for Herbrand accuracy above. Moreover, F1-score is similar to H-score, with the difference that F1-score gives more priority to the TP examples. We believe that giving uniform priority to FN, TP, and FP is more reasonable in the context of logic; this is in line with standard logic measures like . However, F1-score is more suitable (compared to H-score) in open-world settings for the evaluation, where some of the consequences could be missing and thus be counted as FP when predicted (despite being correct). For this reason, F1-score would give a better estimate of the quality of the induced rules, since it focuses more on the TP examples and gives less priority to the generated FP examples.

We observe that, if , then H-score is equal to precision and both are equal to ; and, if and are disjoint, then both are . Moreover the two measures coincide if . The main difference between the two measures is highlighted in the case where . Then, precision but H-score is . This property is intentional for our new metric (H-score) because we want to have H-score only if the predicted facts are exactly those produced by the original rules while precision is as soon as all predicted facts are correct.
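The cases discussed above can be checked numerically. Since the formula is not reproduced here, the sketch below assumes that H-score is the ratio between the facts inferred by both rule sets and the facts inferred by either of them, that is, TP / (TP + FP + FN). This reading is an assumption, chosen because it gives uniform priority to TP, FP, and FN and reproduces the properties stated in the text. The function names are hypothetical.

```python
def h_score(predicted, expected):
    # Assumed definition: |intersection| / |union| of the two inferred fact sets.
    predicted, expected = set(predicted), set(expected)
    union = predicted | expected
    return len(predicted & expected) / len(union) if union else 0.0


def precision(predicted, expected):
    predicted, expected = set(predicted), set(expected)
    return len(predicted & expected) / len(predicted) if predicted else 0.0


expected = {1, 2, 3, 4}            # facts inferred by the original rules
subset = {1, 2}                    # learned rules infer only part of them
superset = {1, 2, 3, 4, 5, 6, 7, 8}
disjoint = {9, 10}

cases = {
    name: (precision(pred, expected), h_score(pred, expected))
    for name, pred in [
        ("subset", subset),
        ("superset", superset),
        ("disjoint", disjoint),
        ("equal", expected),
    ]
}
```

Under this assumption, the `subset` case yields precision 1.0 but an H-score of 0.5, while the two measures coincide in the other three cases.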

5.3 Rule-based Measures

Several metrics between sets of logic rules have been defined in the literature [Estruch et al.2005, Estruch et al.2010, Nienhuys-Cheng1997, Preda2006, Seda and Lane2003]. However, we decided not to include them in our analysis (and in the current version of RuDaS) since they strongly rely on the parse structure of the formulas and hence are more suitable for more expressive logics like full FOL.

6 Experiments

The goal of our experiments is to demonstrate the necessity of having a diversified portfolio of datasets for the evaluation of a rule learning system. The existing datasets are not diverse enough to provide a comprehensive evaluation of ILP methods (e.g., often fall into category Chain). In the following experiments, we show that the rule dependencies have a significant impact on the performance of the systems as well as the dataset size and the amount of noise.

In what follows, we evaluate representatives of the rule learning approaches on the datasets described in Section 3.1, in four main experiments to understand, respectively, the variety of the performance metrics, and the impact of missing consequences, noise, rule dependencies, and dataset size. We compared the following systems (configuration details in the appendix): 1) FOIL [Quinlan1990], a traditional ILP system; 2) AMIE+ [Galárraga et al.2015], a rule mining system; 3) Neural-LP [Yang, Yang, and Cohen2017]; and 4) NTP [Rocktäschel and Riedel2017]. The latter are both neural approaches. AMIE+, Neural-LP, and NTP output confidence scores for the learned rules. We therefore filtered their output using a system-specific threshold, obtained using grid search over all datasets. Further, to not disadvantage Neural-LP and NTP, which use auxiliary predicates, we ignored the facts produced on these predicates in the computation of the result metrics. It is important to notice that NTP requires additional information in the form of rule templates, that can be seen as an advantage given to this system.

In the experiments we do not report the standard deviation since the results span over different dataset categories and sizes and we do not penalize the instances that exceed the time limit (both for the evaluation or the systems’ execution). This does not influence the outcome of the evaluation; the results that successfully terminate are stable on average.

6.1 Overall Results in Terms of Different Metrics

In this experiment, we regarded overall results, reported in Table 1, in terms of the metrics introduced in Section 5. As expected, the results for F1-score and Herbrand score are very similar; the only difference is that F1-score is a more “optimistic” measure, giving an advantage to the methods with a higher number of true positive examples. Herbrand accuracy and accuracy also provide similar results. Observe that these two measures are not meaningful in our settings since they always yield very high values. Note that precision and H-score are very close for AMIE+, Neural-LP, and NTP, but not for FOIL. This could be explained by the fact that the training of the former systems maximizes functions that are similar to precision, while FOIL uses heuristics to produce the rules that induce the maximum number of facts in the training set and the minimum number of facts not in the training set. The large discrepancy between the two measures for FOIL means that the rules it learns do not produce many false facts but only a subset of the facts induced by the original rules. For AMIE+ instead, since precision and H-score are similar, its rules produce most of the consequences of the original rules and, thanks to the good performance, they do not produce too many false facts. For Neural-LP and NTP, the two measures are also very similar, but very low: their rules produce most of the positive examples but also a lot of false facts.

            FOIL    AMIE+   Neural-LP  NTP
H-accuracy  0.9872  0.8708  0.9852     0.9304
Accuracy    0.9872  0.8719  0.9850     0.9302
F1-score    0.2136  0.3164  0.1620     0.1192
H-score     0.1523  0.2429  0.1027     0.0772
Precision   0.5810  0.3125  0.1693     0.1049
Recall      0.2273  0.7178  0.2421     0.3960
Table 1: Impact of different metrics, each one averaged on 120 datasets with uniformly distributed categories {CHAIN, RDG, DRDG}, sizes {XS,S}, and graph depths {2,3}; , , .
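To make the relation between these measures concrete, the following sketch computes precision, recall, F1, and a Jaccard-style overlap on two sets of derived facts. It assumes, purely for illustration, that the H-score behaves like the Jaccard overlap between the consequences of the original and of the learned rules; Section 5 remains the reference for the exact definitions. The function name `fact_metrics` and the fact representation are hypothetical.

```python
from typing import Dict, Set, Tuple

Fact = Tuple[str, ...]  # e.g. ("p", "a", "b"); hypothetical representation


def fact_metrics(expected: Set[Fact], predicted: Set[Fact]) -> Dict[str, float]:
    """Compare facts derived by learned rules against the expected consequences."""
    tp = len(expected & predicted)  # facts that are both expected and predicted
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(expected) if expected else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    union = len(expected | predicted)
    # Assumption: Jaccard overlap stands in for the H-score in this sketch.
    jaccard = tp / union if union else 1.0
    return {"precision": precision, "recall": recall, "f1": f1, "jaccard": jaccard}


expected = {("p", "a"), ("p", "b"), ("p", "c"), ("p", "d")}
predicted = {("p", "a"), ("p", "b"), ("p", "e")}
m = fact_metrics(expected, predicted)
```

On a single pair of fact sets, the Jaccard overlap never exceeds F1, which matches the observation that F1-score is the more "optimistic" of the two. A learner with few false facts but low coverage gets high precision and a much lower overlap, the pattern reported for FOIL.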

6.2 Impact of Missing Consequences and Noise

In this experiment, we evaluated the performance of the systems in the presence of complete information, incomplete information, and incomplete information with noise. To this end, we analyzed the impact of the different parameters given in RuDaS: , , and . The results are reported in Table 2. The noise parameters are defined as follows (set memberships are intended to mean “uniformly distributed over”): complete datasets , , and , incomplete datasets , , and , and incomplete + noise datasets , , and . Moreover, in order to give an impression of some of the datasets considered in existing evaluations, we included one manually created dataset, EVEN, inspired by the corresponding dataset used in [Evans and Grefenstette2018], which contains complete information. (In our version, is the only rule, and the input facts are such that we also have an accuracy of if the symmetric rule is learned (using the original fact set it would be ). AMIE+ and Neural-LP do not support unary predicates, which are present in EVEN.) We notice that FOIL performs well if the information is exact and complete, and that its performance decreases in noisier scenarios. This is a result of the assumptions FOIL is based upon: it assumes that negative examples are given in addition in order to guide rule learning and, in particular, that missing facts are false (see Section 4.1 in [Quinlan1990]). AMIE+ seems to perform consistently on average, showing robustness to noise and incomplete data in all the datasets. Neural-LP and NTP also seem to be robust to noise and incomplete data, showing no change in performance as more noise and uncertainty are added.

           EVEN  Complete  Incomplete  Incomplete + Noise
FOIL       1.0   0.3951    0.2102      0.0940
AMIE+      -     0.2219    0.2646      0.2634
Neural-LP  -     0.0659    0.0750      0.0701
NTP        1.0   0.0601    0.0833      0.0718
Table 2: Effect of missing consequences and noise on 144 datasets. Each H-score value is averaged on 48 datasets, with uniformly distributed categories {RDG, DRDG}, sizes {XS,S}, and graph depths {2,3}.
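The three settings compared in Table 2 can be pictured with a small sketch that removes a fraction of the true consequences and adds a fraction of false facts. This is only an illustration under stated assumptions: the parameter names `missing` and `noise`, the function `perturb`, and the sampling scheme are hypothetical stand-ins and do not reproduce the RuDaS generator or its parameters.

```python
import random
from typing import Sequence, Set, Tuple

Fact = Tuple[str, str, str]  # (predicate, subject, object); hypothetical representation


def perturb(
    consequences: Set[Fact],
    missing: float,
    noise: float,
    predicates: Sequence[str],
    constants: Sequence[str],
    seed: int = 0,
) -> Set[Fact]:
    """Drop a fraction of true consequences and add a fraction of false facts."""
    rng = random.Random(seed)
    ordered = sorted(consequences)  # sort for reproducible sampling
    n_keep = round(len(ordered) * (1.0 - missing))
    kept = set(rng.sample(ordered, n_keep))
    # Candidate noise facts: every binary atom that is not a true consequence.
    candidates = sorted(
        (p, a, b)
        for p in predicates
        for a in constants
        for b in constants
        if (p, a, b) not in consequences
    )
    n_noise = min(round(len(ordered) * noise), len(candidates))
    return kept | set(rng.sample(candidates, n_noise))


constants = ["c0", "c1", "c2", "c3", "c4"]
true_facts = {("p", a, b) for a in constants[:2] for b in constants}  # 10 facts

complete = perturb(true_facts, 0.0, 0.0, ["p", "q"], constants)
incomplete = perturb(true_facts, 0.3, 0.0, ["p", "q"], constants)
incomplete_noisy = perturb(true_facts, 0.3, 0.2, ["p", "q"], constants)
```

A learner that treats every absent fact as false, as FOIL does, sees the dropped consequences as negative examples in the incomplete setting, which is consistent with the drop in its scores across the columns of Table 2.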

6.3 Impact of Dependencies Between Rules

In this experiment, we analyzed the impact of the kind of dependencies between rules (dataset categories). The results are reported in Table 3. As expected, the systems perform very differently depending on the datasets’ rule categories, which shows the necessity of diverse datasets for designing ILP systems. The systems perform better on the Chain datasets while learning RDG and DRDG rules only partially, meaning that the available rule learning systems are not yet able to capture complex rule set structures. Our results also confirm the system descriptions w.r.t. the rules they support (e.g., Neural-LP only supports chain rules; details in the appendix). Nevertheless, rules that are not fully supported are sometimes still partially recognized.

           CHAIN   RDG     DRDG
FOIL       0.2024  0.0873  0.1648
AMIE+      0.3395  0.2323  0.1443
Neural-LP  0.1291  0.1059  0.0718
NTP        0.1239  0.0551  0.0427
Table 3: Impact of dataset category. H-score averaged on 40 datasets. Datasets as in Section 6.1.

6.4 Scalability: Impact of Dataset Size

           XS-2    XS-3    S-2     S-3
FOIL       0.2815  0.2119  0.0346  0.0934
AMIE+      0.1449  0.1581  0.4392  0.2124
Neural-LP  0.1155  0.0643  0.1281  0.0992
NTP        0.1512  0.0605  0.0562  0.0471
Table 4: Impact of dataset size and rule graph depth. H-score averaged on 30 datasets. Datasets as in Section 6.1.

In this experiment, we analyzed the impact of the dataset size, considering four different size-depth combinations: the results for XS-2, XS-3, S-2, and S-3 datasets are reported in Table 4. We can observe that FOIL is not scalable, since there is a 20% performance gap between the XS datasets and the S datasets. However, it does not seem to be influenced by the depth of the rule dependency tree, which indicates support for nested rules. AMIE+ seemingly shows constant performance and thus scalability. We can observe that there is a noticeable decrease in performance if we increase the depth of the rule dependency graphs. Neural-LP and NTP are robust to noise and incomplete data, but NTP is not scalable, yielding good accuracy only on the very small and simple instances (XS-2), whereas Neural-LP seems to be more scalable (we see no decrease in performance when the size of the dataset is increased) but does not support nested rules.

7 Conclusions and Future Work

In this paper, we have presented RuDaS, a system for generating datasets for rule learning and for evaluating rule learning systems. Our experiments on new, generated datasets have shown that it is very important to have diverse datasets that consider several rule types separately, different sizes, and different amounts and types of noise, and to perform the evaluation using different measures of performance. Our datasets and evaluation tool provide these capabilities, making it possible to fully understand the weaknesses and strengths of a rule learning system.

There are various directions for future work. The dataset generation can be extended to more expressive logics, including probabilistic inference, which would make it possible to evaluate methods that learn probabilistic rules (e.g., [Manhaeve et al.2018]). Another possibility is to increase the probability of generating special predicate types: transitive predicates, predicates that admit only disjoint combinations of constants (e.g., the relation between a person and their SSN), functional predicates (e.g., the “biological parent” relationship), etc. In the evaluation, we want to consider additional measures that exploit the rule formulation without grounding the logic programs, as well as approximate accuracy measures that are easily computable.


  • [Campero et al.2018] Campero, A.; Pareja, A.; Klinger, T.; Tenenbaum, J.; and Riedel, S. 2018. Logical rule induction and theory learning using neural theorem proving. CoRR abs/1809.02193.
  • [de Jong and Sha2019] de Jong, M., and Sha, F. 2019. Neural theorem provers do not learn rules without exploration. ArXiv abs/1906.06805.
  • [Dong et al.2014] Dong, X. L.; Gabrilovich, E.; Heitz, G.; Horn, W.; Lao, N.; Murphy, K.; Strohmann, T.; Sun, S.; and Zhang, W. 2014. Knowledge vault: A web-scale approach to probabilistic knowledge fusion. In The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14, 601–610.
  • [Dong et al.2019] Dong, H.; Mao, J.; Lin, T.; Wang, C.; Li, L.; and Zhou, D. 2019. Neural logic machines. In International Conference on Learning Representations.
  • [Estruch et al.2005] Estruch, V.; Ferri, C.; Hernandez-Orallo, J.; and Ramírez-Quintana, M. 2005. Distance based generalisation. In Proceedings of the 15th International Conference on Inductive Logic Programming, ILP’05, 87–102. Berlin, Heidelberg: Springer-Verlag.
  • [Estruch et al.2010] Estruch, V.; Ferri, C.; Hernandez-Orallo, J.; and Ramírez-Quintana, M. 2010. An integrated distance for atoms. In Proceedings of Functional and Logic Programming (FLOPS 2010), volume 6009, 150–164.
  • [Evans and Grefenstette2018] Evans, R., and Grefenstette, E. 2018. Learning explanatory rules from noisy data. J. Artif. Intell. Res. 61:1–64.
  • [Fürnkranz, Gamberger, and Lavrac2012] Fürnkranz, J.; Gamberger, D.; and Lavrac, N. 2012. Foundations of Rule Learning. Cognitive Technologies. Springer.
  • [Galárraga et al.2015] Galárraga, L.; Teflioudi, C.; Hose, K.; and Suchanek, F. M. 2015. Fast rule mining in ontological knowledge bases with AMIE+. VLDB J. 24(6):707–730. code available at
  • [Ho et al.2018] Ho, V. T.; Stepanova, D.; Gad-Elrab, M. H.; Kharlamov, E.; and Weikum, G. 2018. Rule learning from knowledge graphs guided by embedding models. In The Semantic Web - ISWC 2018 - 17th International Semantic Web Conference, Proceedings, Part I, 72–90.
  • [ILP] ILP Applications and Datasets. accessed: 2019-01-30.
  • [Krishnan2018] Krishnan, A. 2018. Making search easier., accessed 2019-09-03.
  • [Manhaeve et al.2018] Manhaeve, R.; Dumancic, S.; Kimmig, A.; Demeester, T.; and Raedt, L. D. 2018. Deepproblog: Neural probabilistic logic programming. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018., 3753–3763.
  • [Minervini et al.2018] Minervini, P.; Bosnjak, M.; Rocktäschel, T.; and Riedel, S. 2018. Towards neural theorem proving at scale. In Neural Abstract Machines & Program Induction v2, NAMPI.
  • [Muggleton1995] Muggleton, S. 1995. Inverse entailment and progol. New Generation Comput. 13(3&4):245–286.
  • [Nienhuys-Cheng1997] Nienhuys-Cheng, S.-H. 1997. Distance between herbrand interpretations: A measure for approximations to a target concept. In Inductive Logic Programming. ILP 1997, volume 1297. Springer, Berlin, Heidelberg.
  • [Omran, Wang, and Wang2018] Omran, P. G.; Wang, K.; and Wang, Z. 2018. Scalable rule learning via learning representation. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, 2149–2155.
  • [Preda2006] Preda, M. 2006. Metrics for sets of atoms and logic programs. Annals of the University of Craiova 33:67–78.
  • [Quinlan1990] Quinlan, J. R. 1990. Learning logical definitions from relations. Machine Learning 5:239–266. Code available at
  • [Raedt2008] Raedt, L. D. 2008. Logical and relational learning. Cognitive Technologies. Springer.
  • [Rocktäschel and Riedel2017] Rocktäschel, T., and Riedel, S. 2017. End-to-end differentiable proving. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 3791–3803. code available at
  • [Russell and Norvig2002] Russell, S., and Norvig, P. 2002. Artificial Intelligence: A Modern Approach. Upper Saddle River, NJ, USA: Prentice Hall Press.
  • [S. Ceri and Tanca1989] Ceri, S.; Gottlob, G.; and Tanca, L. 1989. What you always wanted to know about datalog (and never dared to ask). IEEE Trans. on Knowl. and Data Eng.
  • [Seda and Lane2003] Seda, A. K., and Lane, M. 2003. On continuous models of computation: Towards computing the distance between (logic) programs. In 6th International Workshop on Formal Methods (IWFM), 1–15. British Computer Society.
  • [Stepanova, Gad-Elrab, and Ho2018] Stepanova, D.; Gad-Elrab, M. H.; and Ho, V. T. 2018. Rule induction and reasoning over knowledge graphs. In Reasoning Web. Learning, Uncertainty, Streaming, and Scalability - 14th International Summer School 2018, Tutorial Lectures, 142–172.
  • [Wang and Li2015] Wang, Z., and Li, J. 2015. Rdf2rules: Learning rules from RDF knowledge bases by mining frequent predicate cycles. CoRR abs/1512.07734.
  • [Yang, Yang, and Cohen2017] Yang, F.; Yang, Z.; and Cohen, W. W. 2017. Differentiable learning of logical rules for knowledge base reasoning. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 2316–2325. code available at

Appendix A Rule Learning Approaches

Classical ILP systems such as FOIL [Quinlan1990] and Progol [Muggleton1995] usually apply exhaustive algorithms to mine rules for the given data and either require false facts as counter-examples or assume a closed world (for an overview of classical ILP systems see Table 2 in [Stepanova, Gad-Elrab, and Ho2018]). The closed-world assumption (CWA) states that all facts that are not explicitly given as true are assumed to be false.
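The closed-world assumption can be illustrated with a small sketch: every ground atom over the known constants that is not listed as a fact is treated as a negative example. The function `cwa_negatives` and the fact representation are hypothetical and only serve this illustration; they are not taken from FOIL or Progol.

```python
from itertools import product
from typing import Iterable, Set, Tuple

Fact = Tuple[str, str, str]  # (predicate, subject, object); hypothetical representation


def cwa_negatives(facts: Set[Fact], predicate: str, constants: Iterable[str]) -> Set[Fact]:
    """Return all atoms of a binary predicate that are assumed false under the CWA."""
    universe = sorted(set(constants))
    return {
        (predicate, a, b)
        for a, b in product(universe, repeat=2)
        if (predicate, a, b) not in facts
    }


facts = {("parent", "ann", "bob"), ("parent", "bob", "cid")}
negatives = cwa_negatives(facts, "parent", ["ann", "bob", "cid"])
```

With three constants there are nine candidate atoms, two of which are given as facts, so the remaining seven are treated as counter-examples. If the fact set is incomplete, some of these seven are actually true, and a learner that relies on the CWA is then guided by wrong negatives.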

Today, however, knowledge graphs (KGs) with their often incomplete, noisy, heterogeneous, and, especially, large amounts of data raise new problems and require new solutions. For instance, real data most often only partially satisfies the CWA and does not contain counter-examples. Moreover, in an open world, absent facts cannot be considered as counter-examples either, since they are not regarded as false. Therefore, successor systems, with AMIE+ [Galárraga et al.2015] and RDF2Rules [Wang and Li2015] as the most prominent representatives, assume the data to be only partially complete and focus on rule learning in the sense of mining patterns that occur frequently in the data. Furthermore, they implement advanced optimization approaches that make them applicable in wider scenarios. In this way, they already address many of the issues that arise with today’s knowledge graphs, while keeping their processing exhaustive.

Recently, neural rule learning approaches have been proposed [Yang, Yang, and Cohen2017, Rocktäschel and Riedel2017, Evans and Grefenstette2018, Minervini et al.2018, Omran, Wang, and Wang2018, Campero et al.2018]. These methodologies seem a promising alternative considering that deep learning copes with vast amounts of noisy and heterogeneous data. The proposed solutions consider vector or matrix embeddings of symbols, facts and/or rules, and model inference using differentiable operations such as vector addition and matrix composition. However, they are still premature: they only learn certain kinds of rules or lack scalability (e.g., searching the entire rule space) and hence cannot compete with established rule mining systems such as AMIE+ yet, as shown in [Omran, Wang, and Wang2018], for example.

Appendix B Dataset Descriptions

Table 5 gives a detailed overview of the RuDaS-v0 datasets.

Appendix C More Observations Regarding the Experiments in Section 6.3

In this section, we analyze the specific requirements of the systems we used for the evaluation and show how these are reflected in our results.

AMIE+ does not consider reflexive rules and requires rules to be connected (a rule is connected when every atom shares an argument with each of the other atoms of the rule) and closed (a rule is closed if all its variables appear at least twice in the rule).

This explains why AMIE+ performs better on the Chain datasets, since the rules in these datasets more often satisfy this condition (this is not true in general: our generator produces comprehensive datasets that do not necessarily satisfy this property).
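The two conditions can be checked mechanically. The sketch below follows the definitions exactly as worded above (pairwise sharing of arguments for connectedness, at least two occurrences per variable for closedness). It assumes rules without constants, so every argument is a variable. The names `is_connected` and `is_closed` and the rule representation are hypothetical and are not part of AMIE+.

```python
from collections import Counter
from itertools import combinations
from typing import List, Tuple

Atom = Tuple[str, Tuple[str, ...]]  # (predicate, arguments); hypothetical representation


def is_connected(atoms: List[Atom]) -> bool:
    """Every pair of atoms shares at least one argument (definition used in the text)."""
    return all(set(a[1]) & set(b[1]) for a, b in combinations(atoms, 2))


def is_closed(atoms: List[Atom]) -> bool:
    """Every variable occurs at least twice in the rule (head and body together)."""
    counts = Counter(arg for _, args in atoms for arg in args)
    return all(n >= 2 for n in counts.values())


# p(X,Y) :- q(X,Z), r(Z,Y): a chain rule, listed as head followed by body atoms.
chain = [("p", ("X", "Y")), ("q", ("X", "Z")), ("r", ("Z", "Y"))]
# p(X,Y) :- q(X,Z): Y and Z each occur only once.
open_rule = [("p", ("X", "Y")), ("q", ("X", "Z"))]
# p(X,Y) :- q(X,Y), r(Z,Z): r shares no argument with the other atoms.
disconnected = [("p", ("X", "Y")), ("q", ("X", "Y")), ("r", ("Z", "Z"))]
```

The chain rule satisfies both conditions, which illustrates why rules of this shape fall within what AMIE+ can mine, whereas the other two examples each violate one condition.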

Nevertheless, rules that are not fully supported can still be recognized partially. We can observe this from the fact that Neural-LP performs, on average, comparably on RDG datasets, although it only supports chain rules.

Also, NTP performs better on Chain datasets, but the discrepancy with the other types of datasets is not substantial. This can be explained by the fact that we provided all necessary templates for the training (for more details about this system’s requirements see [Rocktäschel and Riedel2017]).

We cannot draw significant conclusions for FOIL given its unstable behaviour regarding the dataset type.

In conclusion, we point out the importance of differentiating datasets with different rule types and of considering different measures of performance in order to fully understand the weaknesses and strengths of a rule learning system.

Appendix D System Configurations

All the systems were run under the same computational restrictions (i.e., CPU, memory, time limit, etc.). The reader can find all the details (scripts etc.) in the RuDaS GitHub repository.




  • Paper: Differentiable Learning of Logical Rules for Knowledge Base Reasoning. Fan Yang, Zhilin Yang, William W. Cohen. NIPS 2017.

  • Running configuration:

    python $SYSDIR/$SYSTEM/src/
        > $DIR/../output/binary/$SYSTEM/$NAME/log.txt
  • Parameter for accepting the rules: learned using grid-search – all the rules with ri-normalized prob are accepted

Neural-theorem prover (ntp)

    {
        "data": {
            "kb": "$DATAPATH/$",
            "templates": "$DATAPATH/rules.nlt"
        },
        "meta": {
            "parent": "$SYSTEMSPATH/conf/default.conf",
            "test_graph_creation": False,
            "experiment_prefix": "$NAME",
            "test_set": "$TEST",
            "result_file": "$OUTPUTPATH/results.tsv",
            "debug": False
        },
        "training": {
            "num_epochs": 100,
            "report_interval": 10,
            "pos_per_batch": 10,
            "neg_per_pos": 1,
            "optimizer": "Adam",
            "learning_rate": 0.001,
            "sampling_scheme": "all",
            "init": None, # xavier initialization
            "clip": (-1.0, 1.0)
        },
        "model": {
            "input_size": 100,
            "k_max": 10,
            "name": "???",
            "neural_link_predictor": "ComplEx",
            "l2": 0.01, # 0.01 # 0.0001
            "keep_prob": 0.7
        }
    }