Learning Explanatory Rules from Noisy Data

11/13/2017 ∙ by Richard Evans, et al. ∙ Google

Artificial Neural Networks are powerful function approximators capable of modelling solutions to a wide variety of problems, both supervised and unsupervised. As their size and expressivity increase, so too does the variance of the model, yielding a nearly ubiquitous overfitting problem. Although mitigated by a variety of model regularisation methods, the common cure is to seek large amounts of training data (which is not necessarily easily obtained) that sufficiently approximates the data distribution of the domain we wish to test on. In contrast, logic programming methods such as Inductive Logic Programming offer an extremely data-efficient process by which models can be trained to reason on symbolic domains. However, these methods are unable to deal with the variety of domains neural networks can be applied to: they are not robust to noise in or mislabelling of inputs, and perhaps more importantly, cannot be applied to non-symbolic domains where the data is ambiguous, such as operating on raw pixels. In this paper, we propose a Differentiable Inductive Logic framework (∂ILP), which can not only solve tasks which traditional ILP systems are suited for, but shows a robustness to noise and error in the training data which ILP cannot cope with. Furthermore, as it is trained by backpropagation against a likelihood objective, it can be hybridised by connecting it with neural networks over ambiguous data in order to be applied to domains which ILP cannot address, while providing data efficiency and generalisation beyond what neural networks on their own can achieve.


1 Introduction

Inductive Logic Programming (ILP) is a collection of techniques for constructing logic programs from examples. Given a set of positive examples, and a set of negative examples, an ILP system constructs a logic program that entails all the positive examples but does not entail any of the negative examples. From a machine learning perspective, an ILP system can be interpreted as implementing a rule-based binary classifier over examples, mapping each example to an evaluation of its truth or falsehood according to the axioms provided to the system, alongside new rules inferred by the system during training.

ILP has a number of appealing features. First, the learned program is an explicit symbolic structure that can be inspected, understood, and verified. (Human readability is a much touted feature of ILP systems, but when the learned programs become large, and include a number of invented auxiliary predicates, the resulting programs become less readable; see besolddoes. But even a complex machine-generated logic program will be easier to understand than a large tensor of floating point numbers.) Second, ILP systems tend to be impressively data-efficient, able to generalise well from a small handful of examples. The reason for this data-efficiency is that ILP imposes a strong language bias on the sorts of programs that can be learned: a short general program will be preferred to a program consisting of a large number of special-case ad-hoc rules that happen to cover the training data. Third, ILP systems support continual and transfer learning. The program learned in one training session, being declarative and free of side-effects, can be copied and pasted into the knowledge base before the next training session, providing an economical way of storing learned knowledge.

The main disadvantage of traditional ILP systems is their inability to handle noisy, erroneous, or ambiguous data. If the positive or negative examples contain any mislabelled data, these systems will not be able to learn the intended rule. de2008probabilistic discuss this issue in depth, stressing the importance of building systems capable of applying relational learning to uncertain data.

A key strength of neural networks is that they are robust to noise and ambiguity. One way to overcome the brittleness of traditional ILP systems is to reimplement them in a robust connectionist framework. garcez2015neural argue strongly for the importance of integrating robust connectionist learning with symbolic relational learning.

Recently, a different approach to program induction has emerged from the deep learning community [Graves et al. 2014, Reed & de Freitas 2015, Neelakantan et al. 2015, Kaiser 2015, Andrychowicz & Kurach 2016, Graves et al. 2016]. These neural network-based systems do not construct an explicit symbolic representation of a program. Instead, they learn an implicit procedure (distributed in the weights of the net) that produces the intended results. These approaches take a relatively low-level model of computation, a model that is much “closer to the metal” than the Horn clauses used in ILP, and produce a differentiable implementation of that low-level model. (Low-level models include a Turing machine [Graves et al. 2014, Graves et al. 2016], a Forth virtual machine [Riedel et al. 2016], a cellular automaton [Kaiser 2015], and a pushdown automaton [Sun et al. 1998, Grefenstette et al. 2015, Joulin & Mikolov 2015].) The implicit procedure that is learned is a way of operating within that low-level model of computation (by moving the tape head, reading and writing in the case of differentiable Turing machines; by pushing and popping in the case of differentiable pushdown automata).

There are two appealing features of this differentiable approach to program induction. First, these systems are robust to noise. Unlike ILP, a neural system will tolerate some bad (mis-labeled) data. Second, a neural program induction system can be provided with fuzzy or ambiguous data (from a camera, for example). Unlike traditional ILP systems (which have to be fed crisp, symbolic input), a differentiable induction system can start with raw, un-preprocessed pixel input.

However, the neural approaches to program induction have two disadvantages when compared to ILP. First, the implicit procedure learned by a neural network is not inspectable or human-readable. It is notoriously hard to understand what it has learned, or to what extent it has generalised beyond the training data. Second, the performance of these systems tails off sharply when the test data are significantly larger than the training data: if we train a neural system to add numbers of length 10, it may also be successful when tested on numbers of length 20. But if we test it on numbers of length 100, the performance deteriorates [Kaiser 2015, Reed & de Freitas 2015]. General-purpose neural architectures, being universal function approximators, produce solutions with high variance. There is an ever-present danger of over-fitting. (With extra supervision, over-fitting can be avoided, to an extent. Reed & de Freitas (2015) use a much richer training signal. Instead of trying to learn rules from mere input-output pairs, they learn from explicit traces of desired computations. In their approach, a training instance is an input plus a fine-grained specification of the desired computational operations. For example, when learning addition, a training instance would be the two inputs $x$ and $y$ that we are expected to add, plus a detailed list of all the low-level operations involved in adding $x$ and $y$. With this additional signal, it is possible to learn programs that generalise to larger training instances.)

This paper proposes a system that addresses the limits of connectionist systems and ILP systems, and attempts to combine the strengths of both. Differentiable Inductive Logic Programming (∂ILP) is a reimplementation of ILP in an end-to-end differentiable architecture. It attempts to combine the advantages of ILP with the advantages of the neural network-based systems: a data-efficient induction system that can learn explicit human-readable symbolic rules, that is robust to noisy and ambiguous data, and that does not deteriorate when applied to unseen test data. The central component of this system is a differentiable implementation of deduction through forward chaining on definite clauses. We reinterpret the ILP task as a binary classification problem, and we minimise cross-entropy loss with regard to ground-truth boolean labels during training.

Our ∂ILP system is able to solve moderately complex tasks requiring recursion and predicate invention. For example, it is able to learn “Fizz-Buzz” using multiple invented predicates (see Section 5.3.3). Unlike the MLP described by Grus (see the darkly humorous http://joelgrus.com/2016/05/23/fizz-buzz-in-tensorflow/), our learned program generalises robustly to test data outside the range of training examples.

Unlike symbolic ILP systems, ∂ILP is robust to mislabelled data. It is able to achieve reasonable performance with up to 20% of mislabelled training data (see Section 5.4). Unlike symbolic ILP systems, ∂ILP is also able to handle ambiguous or fuzzy data. We tested ∂ILP by connecting it to a convolutional net trained on MNIST digits, and it was still able to learn effectively (see Section 5.5).

The major limitation of our ∂ILP system is that it requires significant memory resources. This limits the range of benchmark problems on which our system has been tested: the memory requirements force us to limit ourselves to predicates of arity at most two. We discuss this further in the introduction to Section 5.3 and in Section 5.3.4, and offer an analysis in Appendix E.

The structure of the paper is as follows. We begin, in Section 2, by giving an overview of logic programming as a field, and of the collection of learning methods known as Inductive Logic Programming. In Section 3, we re-cast learning under ILP as a satisfiability problem, and use this formalisation of the problem as the basis for introducing, in Section 4, a differentiable form of ILP where continuous representations of rules are learned through backpropagation against a likelihood objective. In Section 5, we evaluate our system against a variety of standard ILP tasks, we measure its robustness to noise by evaluating its performance under conditions where consistent errors exist in the data, and finally compare it to neural network baselines in tasks where logic programs are learned over ambiguous data such as raw pixels. We complete the paper by reviewing and contrasting with related work in Section 6 before offering our conclusions regarding the framework proposed here and its empirical validation.

2 Background

We first review Logic Programming and Inductive Logic Programming (ILP), before focusing on one particular approach to ILP, which turns the induction problem into a satisfiability problem.

2.1 Logic Programming

Logic programming is a family of programming languages in which the central component is not a command, or a function, but an if-then rule. These rules are also known as clauses.

A definite clause is a rule of the form

$$\alpha \leftarrow \alpha_1, \ldots, \alpha_m$$

composed of a head atom $\alpha$ and a body $\alpha_1, \ldots, \alpha_m$, where $m \geq 0$. (In this paper, we restrict ourselves to logic programs composed of definite clauses. We do not consider clauses containing negation-as-failure.) These rules are read from right to left: if all of the atoms on the right are true, then the atom on the left is also true. If the body is empty, that is, if $m = 0$, we abbreviate $\alpha \leftarrow$ to just $\alpha$.

An atom $\alpha$ is a tuple $p(t_1, \ldots, t_n)$, where $p$ is an $n$-ary predicate and $t_1, \ldots, t_n$ are terms, either variables or constants. An atom is ground if it contains no variables. The set of all atoms, ground and unground, is denoted by $A$. The set of ground atoms, also known as the Herbrand base, is $G$.

In first-order logic, terms can also be constructed from function symbols. So, for example, if $c$ is a constant, and $f$ is a one place function, then $f(c), f(f(c)), f(f(f(c))), \ldots$ are all terms. In this paper, we impose the following restriction, found in systems like Datalog, on clauses: the only terms allowed are constants and variables, while function symbols are disallowed. (The reason for restricting ourselves to definite Datalog clauses is that this language is decidable and forward chaining inference, described below, always terminates.)

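For example, the following program defines the $connected$ relation as the transitive closure of the $edge$ relation:

$$connected(X, Y) \leftarrow edge(X, Y)$$
$$connected(X, Y) \leftarrow edge(X, Z), connected(Z, Y)$$

We follow the logic programming convention of using upper case to denote variables and lower case for constants and predicates.

A ground rule is a clause in which all variables have been substituted by constants. For example:

$$connected(a, b) \leftarrow edge(a, b)$$

is a ground rule generated by applying the substitution $\theta = \{a/X, b/Y\}$ to the clause:

$$connected(X, Y) \leftarrow edge(X, Y)$$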

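The consequences of a set $R$ of clauses are computed by repeatedly applying the rules in $R$ until no more consequences can be derived. More formally, define $cn_R(X)$ as the immediate consequences of rules $R$ applied to ground atoms $X$:

$$cn_R(X) = X \cup \{\gamma \mid \gamma \leftarrow \gamma_1, \ldots, \gamma_m \in ground(R), \ \bigwedge_{i=1}^{m} \gamma_i \in X\}$$

where $ground(R)$ are the ground rules of $R$. In other words, ground atom $\gamma$ is in $cn_R(X)$ if $\gamma$ is in $X$ or there exists a ground rule $\gamma \leftarrow \gamma_1, \ldots, \gamma_m \in ground(R)$ such that each ground atom $\gamma_1, \ldots, \gamma_m$ is in $X$.

Alternatively, we can define $cn_R(X)$ using substitutions:

$$cn_R(X) = X \cup \{\alpha[\theta] \mid \alpha \leftarrow \alpha_1, \ldots, \alpha_m \in R, \ \bigwedge_{i=1}^{m} \alpha_i[\theta] \in X\}$$

where $\alpha[\theta]$ is the application of substitution $\theta$ to atom $\alpha$.

Now define a series of consequences $C_{R,0}, C_{R,1}, C_{R,2}, \ldots$:

$$C_{R,0} = \{\}$$
$$C_{R,i+1} = cn_R(C_{R,i})$$

Now define the consequences after $T$ time-steps as the union of this series:

$$con(R) = \bigcup_{i=0}^{T} C_{R,i}$$

Consider, for example, the program $R$:

$$edge(a, b) \qquad edge(b, c) \qquad edge(c, a)$$
$$connected(X, Y) \leftarrow edge(X, Y)$$
$$connected(X, Y) \leftarrow edge(X, Z), connected(Z, Y)$$

The consequences are:

$$C_{R,1} = \{edge(a, b), edge(b, c), edge(c, a)\}$$
$$C_{R,2} = C_{R,1} \cup \{connected(a, b), connected(b, c), connected(c, a)\}$$
$$C_{R,3} = C_{R,2} \cup \{connected(a, c), connected(b, a), connected(c, b)\}$$
$$C_{R,4} = C_{R,3} \cup \{connected(a, a), connected(b, b), connected(c, c)\}$$
$$con(R) = C_{R,4}$$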

Given a set $R$ of clauses, and a ground atom $\gamma$, we say $R$ entails $\gamma$, written $R \models \gamma$, if every model satisfying $R$ also satisfies $\gamma$. To test if $R \models \gamma$, we check if $\gamma \in con(R)$. This technique is called forward chaining.
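Forward chaining as described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the tuple encoding of atoms, the helper names, and the `edge`/`connected` example program are assumptions made for the sketch, and the loop starts from the facts rather than from the empty set, which yields the same fixed point.

```python
from itertools import product

# Assumed example program: facts plus two rules for the transitive closure.
FACTS = {("edge", "a", "b"), ("edge", "b", "c"), ("edge", "c", "a")}
# Each rule is (head, body); upper-case strings are variables.
RULES = [
    (("connected", "X", "Y"), [("edge", "X", "Y")]),
    (("connected", "X", "Y"), [("edge", "X", "Z"), ("connected", "Z", "Y")]),
]
CONSTANTS = ["a", "b", "c"]


def is_var(term):
    """Variables start with an upper-case letter (logic programming convention)."""
    return term[0].isupper()


def substitute(atom, theta):
    """Apply substitution theta (variable -> constant) to an atom."""
    return (atom[0],) + tuple(theta.get(t, t) for t in atom[1:])


def ground_rules(rules, constants):
    """Enumerate every ground instance of every rule."""
    for head, body in rules:
        variables = sorted({t for atom in [head] + body for t in atom[1:] if is_var(t)})
        for values in product(constants, repeat=len(variables)):
            theta = dict(zip(variables, values))
            yield substitute(head, theta), [substitute(b, theta) for b in body]


def immediate_consequences(grounded, atoms):
    """One application of cn_R: keep the atoms, add heads whose bodies all hold."""
    derived = {head for head, body in grounded if all(b in atoms for b in body)}
    return atoms | derived


def consequences(facts, rules, constants):
    """Iterate cn_R until no new ground atom is derived."""
    grounded = list(ground_rules(rules, constants))
    atoms = set(facts)
    while True:
        new_atoms = immediate_consequences(grounded, atoms)
        if new_atoms == atoms:
            return atoms
        atoms = new_atoms


closure = consequences(FACTS, RULES, CONSTANTS)
entailed = ("connected", "a", "a") in closure  # entailment test by membership
```

Because the language is function-free, the set of ground atoms is finite and the loop is guaranteed to stop.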

An alternative approach is backward chaining. To test if $R \models \gamma$, we work backwards from $\gamma$, looking for rules in $R$ whose head unifies with $\gamma$. For each such rule $\alpha \leftarrow \alpha_1, \ldots, \alpha_m$, where $\alpha[\theta] = \gamma$, we create a sub-goal to prove the body $\alpha_1[\theta], \ldots, \alpha_m[\theta]$. This procedure constructs a search tree, where nodes are lists of propositions to be proved, in left-to-right order, and edges are applications of rules with substitutions. The root of the tree is the node containing $\gamma$, and search terminates when we find a node with no atoms remaining to be proved.

A distinction we will need later is the difference between intensional and extensional predicates. An extensional predicate is a predicate that is wholly defined by a set of ground atoms. In our example above, $edge$ is an extensional predicate defined by the set of atoms:

$$\{edge(a, b), edge(b, c), edge(c, a)\}$$

An intensional predicate is defined by a set of clauses. In our example, $connected$ is an intensional predicate defined by the clauses:

$$connected(X, Y) \leftarrow edge(X, Y)$$
$$connected(X, Y) \leftarrow edge(X, Z), connected(Z, Y)$$

2.2 Inductive Logic Programming (ILP)

An ILP problem is a tuple $(\mathcal{B}, \mathcal{P}, \mathcal{N})$ of ground atoms, where:

  • $\mathcal{B}$ is a set of background assumptions, a set of ground atoms. (We assume that $\mathcal{B}$ is a set of ground atoms, but not all ILP systems make this restriction. In some ILP systems, the background assumptions are clauses, not atoms.)

  • $\mathcal{P}$ is a set of positive instances - examples taken from the extension of the target predicate to be learned

  • $\mathcal{N}$ is a set of negative instances - examples taken outside the extension of the target predicate

Consider, for example, the task of learning which natural numbers are even. We are given a minimal description of the natural numbers:

$$\mathcal{B} = \{zero(0), succ(0, 1), succ(1, 2), succ(2, 3), \ldots\}$$

Here, $zero(X)$ is the unary predicate true of $X$ if $X = 0$, and $succ$ is the successor relation. The positive and negative examples of the $even$ predicate are:

$$\mathcal{P} = \{even(0), even(2), even(4), even(6), \ldots\}$$
$$\mathcal{N} = \{even(1), even(3), even(5), even(7), \ldots\}$$

The aim of ILP is to construct a logic program that explains the positive instances and rejects the negative instances.

Given an ILP problem $(\mathcal{B}, \mathcal{P}, \mathcal{N})$, a solution is a set $R$ of definite clauses such that:

  • $\mathcal{B}, R \models \gamma$ for all $\gamma \in \mathcal{P}$

  • $\mathcal{B}, R \not\models \gamma$ for all $\gamma \in \mathcal{N}$

(This is a specific definition of ILP in terms of entailment. For a more general definition of the ILP problem, that contains this formulation as a special case, see the work of muggleton1994inductive, de2008probabilistic, and de2008probabilistic2.)

Induction is finding a set of rules $R$ such that, when they are applied deductively to the background assumptions $\mathcal{B}$, they produce the desired conclusions. Namely: positive examples are entailed, while negative examples are not.

In the example above, one solution is the set $R$:

$$even(X) \leftarrow zero(X)$$
$$even(X) \leftarrow even(Y), succ2(Y, X)$$
$$succ2(X, Y) \leftarrow succ(X, Z), succ(Z, Y)$$

Note that this simple toy problem is not entirely trivial. The solution requires both recursion and predicate invention: an auxiliary synthesised predicate $succ2$.
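To make the notion of a solution concrete, the sketch below evaluates one candidate program for the even-numbers task directly with Python sets and checks it against a few examples. The program shape (a base case on zero, a recursive case, and an invented two-step successor predicate named `succ2` here), the bound `N`, and the chosen examples are assumptions for illustration.

```python
N = 10  # assumed bound: background facts cover the numbers 0..N

# Background assumptions: zero(0) and succ(i, i + 1).
zero = {0}
succ = {(i, i + 1) for i in range(N)}

# Invented predicate: succ2(X, Y) <- succ(X, Z), succ(Z, Y)
succ2 = {(x, y) for (x, z) in succ for (z2, y) in succ if z == z2}

# even(X) <- zero(X)
# even(X) <- even(Y), succ2(Y, X)     (iterated to a fixed point)
even = set(zero)
while True:
    new_even = even | {x for (y, x) in succ2 if y in even}
    if new_even == even:
        break
    even = new_even

positives = {0, 2, 4, 6}
negatives = {1, 3, 5, 7}
covers_all_positives = positives <= even
covers_no_negatives = not (negatives & even)
```

Both checks hold for this program, which is what it means for the rule set to entail every positive example and none of the negative ones.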

We first describe how the ILP problem can be transformed into a satisfiability problem (Section 3), and then provide the main contribution of the paper: a differentiable implementation of the satisfiability solving process (Section 4).

3 ILP as a Satisfiability Problem

There are, broadly speaking, two families of approaches to ILP. The bottom-up approaches (for example, Progol [Muggleton 1995]) start by examining features of the examples, extract specific clauses from those examples, and then generalise from those specific clauses. The top-down approaches (for example, TopLog [Muggleton et al. 2008], TAL [Corapi et al. 2010], and Metagol [Muggleton et al. 2014, Muggleton et al. 2015, Cropper et al. 2015]) use generate-and-test: they generate clauses from a language definition, and test the generated programs against the positive and negative examples.

Amongst the top-down approaches, one particular method is to transform the problem of searching for clauses into a satisfiability problem. Amongst these induction-via-satisfiability approaches, some use Boolean flags to indicate which branches of the search space (defined by the language grammar) to explore; see, for example, the work by chikarainductive. Others generate a large set of clauses according to a program template, and use Boolean flags to determine which clauses are on and off; see, for example, the work by Corapi et al. (2010). Note that they are working with Answer-Set Programming, but under the hood, this is propositionalised and compiled into a SAT problem.

In this paper, we also take a top-down, generate-and-test approach, in which clauses are generated from a language template. We assign each generated clause a Boolean flag indicating whether it is on or off. Now the induction problem becomes a satisfiability problem: choose an assignment to the Boolean flags such that the turned-on clauses, together with the background facts, entail the positive examples and do not entail the negative examples. Our approach is an extension to this established approach to ILP: our contribution is to provide, via a continuous relaxation of satisfiability, a differentiable implementation of this architecture. This allows us to apply gradient descent to learn which clauses to turn on and off, even in the presence of noisy or ambiguous data.

We use the rest of this section (together with Appendix B) to explain ILP-as-satisfiability in detail.

3.1 Basic Concepts

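A language frame $\mathcal{L}$ is a tuple

$$(target, P_e, arity_e, C)$$

where:

  • $target$ is the target predicate, the intensional predicate we are trying to learn

  • $P_e$ is a set of extensional predicates

  • $arity_e$ is a map $P_e \cup \{target\} \rightarrow \mathbb{N}$ specifying the arity of each predicate

  • $C$ is a set of constants

An ILP problem is a tuple $(\mathcal{L}, \mathcal{B}, \mathcal{P}, \mathcal{N})$ where

  • $\mathcal{L}$ is a language frame

  • $\mathcal{B}$ is a set of background assumptions, ground atoms formed from the predicates in $P_e$ and the constants in $C$

  • $\mathcal{P}$ is a set of positive examples, ground atoms formed from the predicate $target$ and the constants in $C$

  • $\mathcal{N}$ is a set of negative examples, ground atoms formed from the predicate $target$ and the constants in $C$

Here, $\mathcal{P}$ are ground atoms where the target predicate holds. For example, in the case of $even$ on natural numbers, $\mathcal{P}$ might be $\{even(0), even(2), even(4), even(6)\}$. The set $\mathcal{N}$ contains ground atoms where the target predicate does not hold, e.g. $\{even(1), even(3), even(5)\}$.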

A rule template $\tau$ describes a range of clauses that can be generated. It is a pair $(v, int)$ where:

  • $v \in \mathbb{N}$ specifies the number of existentially quantified variables allowed in the clause

  • $int \in \{0, 1\}$ specifies whether the atoms in the body of the clause can use intensional predicates ($int = 1$) or only extensional predicates ($int = 0$)

A program template $\Pi$ describes a range of programs that can be generated. It is a tuple $(P_a, arity_a, rules, T)$ where:

  • $P_a$ is a set of auxiliary (intensional) predicates; these are the additional invented predicates used to help define the target predicate

  • $arity_a$ is a map $P_a \rightarrow \mathbb{N}$ specifying the arity of each auxiliary predicate

  • $rules$ is a map from each intensional predicate $p$ to a pair of rule templates $(\tau_p^1, \tau_p^2)$

  • $T \in \mathbb{N}$ specifies the max number of steps of forward chaining inference

Note that $rules$ defines each intensional predicate by a pair of rule templates. In our system, we insist, without loss of generality, that each predicate can be defined by exactly two clauses. (If a predicate $p$ is defined by three clauses $p \leftarrow \beta_1$, $p \leftarrow \beta_2$, and $p \leftarrow \beta_3$, create a new auxiliary predicate $q$ and replace the above with four clauses: (i) $p \leftarrow \beta_1$; (ii) $p \leftarrow q$; (iii) $q \leftarrow \beta_2$; (iv) $q \leftarrow \beta_3$. We have transformed a program in which one predicate is defined by three clauses into a program in which two predicates are defined by two clauses each.)

Our program template is a tuple of hyperparameters constraining the range of programs that are used to solve the ILP problem. This template plays the same role as a collection of mode declarations (see Appendix C) or a set of metarules in Metagol (see Appendix D).

Assume, for the moment, that the program template is given to us as part of the ILP problem specification. Later, in Section 7.1, we shall discuss how to search through the space of program templates using iterative deepening.

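We can combine the extensional predicates from the language-frame $\mathcal{L} = (target, P_e, arity_e, C)$ with the intensional predicates from the program template $\Pi = (P_a, arity_a, rules, T)$ into a language $L = (P_e, P_i, arity, C)$ where

$$P_i = P_a \cup \{target\}$$
$$arity = arity_e \cup arity_a$$

Note that the predicate $target$ is always one of the intensional predicates $P_i$.

Let $P$ be the complete set of predicates:

$$P = P_i \cup P_e$$

A language determines the set $G$ of all ground atoms. If we restrict ourselves to nullary, unary, and dyadic predicates, then the set of ground atoms is:

$$G = \{\gamma_i\}_{i=1}^{n} = \{p() \mid p \in P, arity(p) = 0\} \cup \{p(k) \mid p \in P, arity(p) = 1, k \in C\} \cup \{p(k_1, k_2) \mid p \in P, arity(p) = 2, k_1, k_2 \in C\} \cup \{\bot\}$$

Note that $G$ includes the falsum atom $\bot$, the atom that is always false.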
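The set of ground atoms determined by a language can be enumerated mechanically. The following sketch assumes a tuple encoding of atoms and uses hypothetical predicate names `p`, `q`, `r` and constants `a`, `b`; none of these names come from the paper.

```python
from itertools import product

FALSUM = ("⊥",)  # the atom that is always false


def ground_atoms(arity, constants):
    """List every ground atom for predicates of the given arities, falsum first."""
    atoms = [FALSUM]
    for predicate in sorted(arity):
        # product(..., repeat=0) yields a single empty tuple, covering nullary predicates.
        for args in product(constants, repeat=arity[predicate]):
            atoms.append((predicate,) + args)
    return atoms


# Hypothetical language: nullary p, unary q, dyadic r, over two constants.
atoms = ground_atoms({"p": 0, "q": 1, "r": 2}, ["a", "b"])
index = {atom: i for i, atom in enumerate(atoms)}  # unique integer index per atom
```

With $|C|$ constants, a predicate of arity $k$ contributes $|C|^k$ ground atoms, which is one way to see why memory use grows quickly with arity.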

3.2 Generating Clauses

For each rule template $\tau$, we can generate a set $cl(\tau)$ of clauses that satisfy the template. To keep the set of generated clauses manageable, we make a number of restrictions. First, the only clauses we consider are those composed of atoms involving free variables. We do not allow any constants in any of our clauses. This may seem, initially, to be a severe restriction, but recall that our language has extensional predicates as well as intensional predicates. If we need a predicate whose meaning depends on particular constants, then we treat it as an extensional predicate, rather than an intensional predicate. For example, the predicate $zero$, which appears in the arithmetic examples later, is treated as an extensional predicate. We do not treat $zero$ as an intensional predicate defined by a single rule with an empty body:

$$zero(0) \leftarrow$$

Rather, we treat $zero$ as an extensional predicate defined by the single background atom:

$$zero(0)$$

The second restriction on generated clauses is that we only allow predicates of arity 0, 1, or 2. We do not currently support ternary predicates or higher. (This is because of practical space constraints. See Appendix E.)

Third, we insist that all clauses have exactly two atoms in the body. This restriction can also be made without loss of generality. For any logic program involving clauses with more than two atoms in the body of some clause, there is an equivalent logic program (with additional intensional predicates used as auxiliary functions) with exactly two atoms in the body of each clause. (This is a separate condition from the restriction above in Section 3.1 that each predicate can be defined by exactly two clauses.)

There are four additional restrictions on generated clauses. We rule out clauses that are (1) unsafe (a variable used in the head is not used in the body), (2) circular (the head atom appears in the body), (3) duplicated (equivalent to another clause with the body atoms permuted), and (4) those that do not respect the intensional flag (i.e. those that contain an intensional predicate in the clause body, even though the flag was set to 0, i.e. False). We provide a worked example in Appendix B.
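The four pruning rules above can be expressed as simple predicates over candidate clauses. The sketch below is an illustration under assumptions: clauses are encoded as `(head, body)` pairs of tuples, the function names are hypothetical, and duplicate detection only handles permuted body atoms (not variable renaming).

```python
def atom_variables(atom):
    """All arguments are variables here, since generated clauses contain no constants."""
    return set(atom[1:])


def is_unsafe(head, body):
    """Unsafe: some variable in the head does not occur in the body."""
    body_variables = set()
    for atom in body:
        body_variables |= atom_variables(atom)
    return not atom_variables(head) <= body_variables


def is_circular(head, body):
    """Circular: the head atom itself appears in the body."""
    return head in body


def violates_int_flag(body, intensional, int_flag):
    """The body uses an intensional predicate although the flag is 0."""
    return int_flag == 0 and any(atom[0] in intensional for atom in body)


def prune(clauses, intensional, int_flag):
    """Keep clauses that pass all four checks, in their original order."""
    kept, seen = [], set()
    for head, body in clauses:
        key = (head, tuple(sorted(body)))  # same key for permuted bodies
        if key in seen or is_unsafe(head, body) or is_circular(head, body):
            continue
        if violates_int_flag(body, intensional, int_flag):
            continue
        seen.add(key)
        kept.append((head, body))
    return kept


candidates = [
    (("q", "X", "Y"), [("succ", "X", "Z"), ("succ", "Z", "Y")]),  # acceptable
    (("q", "X", "Y"), [("succ", "Z", "Y"), ("succ", "X", "Z")]),  # duplicate (permuted body)
    (("q", "X", "Y"), [("succ", "X", "Z"), ("succ", "Z", "X")]),  # unsafe: Y missing from body
    (("q", "X", "Y"), [("q", "X", "Y"), ("succ", "X", "Y")]),     # circular
    (("q", "X", "Y"), [("q", "X", "Z"), ("succ", "Z", "Y")]),     # uses intensional q
]
kept_extensional_only = prune(candidates, intensional={"q"}, int_flag=0)
kept_with_intensional = prune(candidates, intensional={"q"}, int_flag=1)
```

With the flag set to 0 only the first candidate survives; with the flag set to 1 the recursive fifth candidate is kept as well.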

3.3 Reducing Induction to Satisfiability

Given a program template $\Pi$, let $\tau_p^i$ be the $i$-th rule template for intensional predicate $p$, where $i \in \{1, 2\}$ indicates which of the two rule templates we are considering for $p$. Let $C_p^{i,j}$ be the $j$-th clause in $cl(\tau_p^i)$, the set of clauses generated for template $\tau_p^i$.

To turn the induction problem into a satisfiability problem, we define a set $\Phi$ of Boolean variables (i.e. atoms with nullary predicates) indicating which of the various clauses in $C_p^{i,j}$ are actually to be used in our program. Now a SAT solver can be used to find a truth-assignment to the propositions in $\Phi$, and we can extract the induced rules from the subset of propositions in $\Phi$ that are set to True. The technical details behind this approach are described in Appendix B.

4 A Differentiable Implementation of ILP

In this section, we describe our core model: a continuous reimplementation of the ILP-as-satisfiability approach described above. The discrete operations are replaced by differentiable operators, so the ILP problem can be solved by minimising a loss using stochastic gradient descent. In this case, the loss is the cross entropy of the correct label with regard to the predictions of the system.

Instead of the discrete semantics in which ground atoms are mapped to $\{False, True\}$, we now use a continuous semantics which maps atoms to the real unit interval $[0, 1]$. (For a related approach, see the work of logictensornetworks, discussed in Section 6.3 below. The values in $[0, 1]$ are interpreted as probabilities rather than fuzzy “degrees of truth”.) Instead of using Boolean flags to choose a discrete subset of clauses, we now use continuous weights to determine a probability distribution over clauses.

This model, which we call ∂ILP, implements differentiable deduction over continuous values. The gradient of the loss with regard to the rule weights, which we use to minimise classification loss, implements a continuous form of induction.

4.1 Valuations

Given a set $G$ of $n$ ground atoms, a valuation is a vector in $[0, 1]^n$ mapping each ground atom $\gamma_i \in G$ to the real unit interval.

Consider, for example, the language , where

One possible valuation on the ground atoms of is

We insist that all valuations map $\bot$ to 0. (The reason for including the atom $\bot$ will become clear in Section 4.5.)

4.2 Induction by Gradient Descent

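Given the sets $\mathcal{P}$ and $\mathcal{N}$ of positive and negative examples, we form a set $\Lambda$ of atom-label pairs:

$$\Lambda = \{(\gamma, 1) \mid \gamma \in \mathcal{P}\} \cup \{(\gamma, 0) \mid \gamma \in \mathcal{N}\}$$

Each pair $(\gamma, \lambda)$ indicates whether atom $\gamma$ is in $\mathcal{P}$ (when $\lambda = 1$) or $\mathcal{N}$ (when $\lambda = 0$). This can be thought of as a dataset, used to learn a binary classifier that maps atoms to their truth or falsehood.

Now given an ILP problem $(\mathcal{L}, \mathcal{B}, \mathcal{P}, \mathcal{N})$, a program template $\Pi$ and a set of clause-weights $W$, we construct a differentiable model that implements the conditional probability of $\lambda$ for a ground atom $\alpha$:

$$p(\lambda \mid \alpha, W, \Pi, \mathcal{L}, \mathcal{B})$$

We want our predicted label to match the actual label $\lambda$ in the pair $(\alpha, \lambda)$ we sample from $\Lambda$. We wish, in other words, to minimise the expected negative log likelihood when sampling pairs uniformly from $\Lambda$:

$$loss = -\mathbb{E}_{(\alpha, \lambda) \sim \Lambda}\left[\lambda \cdot \log p(\lambda \mid \alpha, W, \Pi, \mathcal{L}, \mathcal{B}) + (1 - \lambda) \cdot \log\left(1 - p(\lambda \mid \alpha, W, \Pi, \mathcal{L}, \mathcal{B})\right)\right]$$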
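The objective described here is the usual binary cross-entropy over atom-label pairs. A minimal sketch follows; the `predict` callable stands in for the differentiable model and is a hypothetical placeholder, and the clipping constant `eps` is an assumption added only to avoid taking the logarithm of zero.

```python
import math


def negative_log_likelihood(pairs, predict, eps=1e-12):
    """Mean negative log likelihood of the labels under the predicted probabilities.

    pairs:   list of (atom, label) with label 1 for positive, 0 for negative examples
    predict: hypothetical function mapping an atom to the probability that it is true
    """
    total = 0.0
    for atom, label in pairs:
        p = min(max(predict(atom), eps), 1.0 - eps)  # keep log() finite
        total -= label * math.log(p) + (1 - label) * math.log(1.0 - p)
    return total / len(pairs)


# Atom-label pairs built from assumed positive and negative examples.
positives = ["even(0)", "even(2)"]
negatives = ["even(1)", "even(3)"]
pairs = [(atom, 1) for atom in positives] + [(atom, 0) for atom in negatives]

uninformed = negative_log_likelihood(pairs, lambda atom: 0.5)
informed = negative_log_likelihood(pairs, lambda atom: 0.99 if atom in positives else 0.01)
```

A predictor that always outputs 0.5 scores $\log 2$ per example, while a predictor that agrees with the labels scores close to zero, so lower loss corresponds to a rule set that better separates the two example sets.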

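To calculate the probability of the label $\lambda$ given the atom $\alpha$, we infer the consequences of applying the rules to the background facts (using $T$ steps of forward chaining). In Figure 1 below, these consequences are called the “Conclusion Valuation”. Then, we extract $\lambda$ as the probability of $\alpha$ in this valuation.

The probability $p(\lambda \mid \alpha, W, \Pi, \mathcal{L}, \mathcal{B})$ is defined as:

$$p(\lambda \mid \alpha, W, \Pi, \mathcal{L}, \mathcal{B}) = f_{extract}(f_{infer}(f_{convert}(\mathcal{B}), f_{generate}(\Pi, L), W, T), \alpha)$$

Here, $p(\lambda \mid \alpha, W, \Pi, \mathcal{L}, \mathcal{B})$ is computed using four auxiliary functions: $f_{extract}$, $f_{infer}$, $f_{convert}$, and $f_{generate}$ (see Figure 1). $f_{extract}$ and $f_{infer}$ are differentiable operations, while $f_{convert}$ and $f_{generate}$ are non-differentiable.

The function $f_{extract}$ takes a valuation $x$ and an atom $\gamma$ and extracts the value for that atom:

$$f_{extract}(x, \gamma) = x[index(\gamma)]$$

where $index : G \rightarrow \mathbb{N}$ is a function that assigns each ground atom a unique integer index. The function $f_{convert}$ takes a set of atoms and converts it into a valuation mapping the elements of $\mathcal{B}$ to 1 and all other elements of $G$ to 0:

$$f_{convert}(\mathcal{B}) = y \quad \text{where} \quad y[i] = \begin{cases} 1 & \text{if } \gamma_i \in \mathcal{B} \\ 0 & \text{otherwise} \end{cases}$$

and where $\gamma_i$ is the $i$-th ground atom in $G$ for $i = 1, \ldots, n$.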
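The two bookkeeping functions just described are straightforward to write down. In this sketch valuations are plain Python lists, atoms are strings, indices are zero-based (the text counts from 1), and the small set of ground atoms is an assumed example.

```python
# Assumed ordering of ground atoms; position in the list is the atom's index.
GROUND_ATOMS = ["⊥", "zero(0)", "succ(0,1)", "succ(1,2)", "even(0)", "even(1)", "even(2)"]
INDEX = {atom: i for i, atom in enumerate(GROUND_ATOMS)}


def f_convert(background):
    """Valuation with 1.0 for background atoms and 0.0 for every other ground atom."""
    return [1.0 if atom in background else 0.0 for atom in GROUND_ATOMS]


def f_extract(valuation, atom):
    """Read off the value that a valuation assigns to a single atom."""
    return valuation[INDEX[atom]]


a0 = f_convert({"zero(0)", "succ(0,1)", "succ(1,2)"})
value_of_background_atom = f_extract(a0, "zero(0)")
value_of_target_atom = f_extract(a0, "even(2)")
```

Before any inference the target atoms all have value 0; it is the inference step that raises them.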

The function $f_{generate}$ produces a set of clauses from a program template $\Pi$ and a language $L$:

$$f_{generate}(\Pi, L) = \{cl(\tau_p^i) \mid p \in P_i, \ i \in \{1, 2\}\}$$

This uses the $cl$ function defined in Section 3.2 above.

The $f_{infer}$ function is where all the heavy-lifting takes place. It performs $T$ steps of forward-chaining inference using the generated clauses (recall that $T$ is a part of the program template $\Pi$), amalgamating the various conclusions together using the clause weights $W$. It is described in detail below.

Figure 1: The ∂ILP Architecture.

Figure 1 shows the architecture. The inputs to the network, the elements that are fed in every training step, are the atom $\alpha$ and the corresponding label $\lambda$. These are fed into the boxes labeled “Sampled Target Atom” and “Sampled Label”. This input pair is sampled from the set $\Lambda$. The conditional probability $p(\lambda \mid \alpha, W, \Pi, \mathcal{L}, \mathcal{B})$ is the value in the “Predicted Label” box. The only trainable component is the set of clause weights $W$, shown in red. The differentiable operations are shown in green, while the non-differentiable operations are shown in orange. Note that, even though not all operations are differentiable, the operators between the loss and the clause weights $W$ are all differentiable, so we can compute $\partial loss / \partial W$, which in turn is used to update the clause weights by stochastic gradient descent (or any related optimisation method).

So far, we have described the high-level picture, but not the details. In Section 4.3, we explain how the rule weights are represented. In Section 4.4, we show how inference is performed over multiple time steps, by translating each clause $c$ into a function

$$\mathcal{F}_c : [0, 1]^n \rightarrow [0, 1]^n$$

Finally, in Section 4.5, we describe how the functions $\mathcal{F}_c$ are computed.

4.3 Rule Weights

The weights $W$ are a set $W_1, \ldots, W_{|P_i|}$ of matrices, one matrix for each intensional predicate $p \in P_i$. The matrix $W_p$ for predicate $p$ is of shape $|cl(\tau_p^1)| \times |cl(\tau_p^2)|$, where $|cl(\tau_p^1)|$ is the number of clauses generated by the first rule template $\tau_p^1$, and $|cl(\tau_p^2)|$ is the number of clauses generated by the second rule template $\tau_p^2$. Note that the various matrices $W_p$ are typically not of the same size, as the different rule templates produce different sized sets of clauses.

The weight $W_p[j, k]$ represents how strongly the system believes that the pair of clauses $(C_p^{1,j}, C_p^{2,k})$ is the right way to define the intensional predicate $p$. (Recall from Section 3.3 that $C_p^{i,j}$ is the $j$-th clause of the $i$-th rule template for intensional predicate $p$. Here $i \in \{1, 2\}$, since, as assumed in Section 3.1, each intensional predicate is defined by exactly two clauses.) The weight matrix $W_p \in \mathbb{R}^{|cl(\tau_p^1)| \times |cl(\tau_p^2)|}$ is a matrix of real numbers. We transform it into a probability distribution $W_p^*$ using $softmax$:

$$W_p^*[j, k] = \frac{e^{W_p[j, k]}}{\sum_{j', k'} e^{W_p[j', k']}}$$

Here, $W_p^*[j, k]$ represents the probability that the pair of clauses $(C_p^{1,j}, C_p^{2,k})$ is the right way to define the intensional predicate $p$.
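The softmax here is taken over all entries of the matrix jointly, not row by row, so that the result is a single distribution over clause pairs. A small pure-Python sketch, with an assumed 2×3 weight matrix and the standard max-subtraction trick for numerical stability:

```python
import math


def softmax_matrix(weights):
    """Softmax over every entry of a matrix (list of rows), treated as one distribution."""
    largest = max(max(row) for row in weights)
    # Subtracting the maximum leaves the result unchanged but avoids overflow in exp().
    exps = [[math.exp(w - largest) for w in row] for row in weights]
    normaliser = sum(sum(row) for row in exps)
    return [[e / normaliser for e in row] for row in exps]


# Assumed example: 2 clauses from the first template, 3 from the second.
weights = [[0.0, 0.0, 0.0],
           [0.0, 2.0, 0.0]]
probabilities = softmax_matrix(weights)
total_probability = sum(sum(row) for row in probabilities)
```

The entry with the largest weight, here the pair (second clause, second clause), receives the largest probability, and all entries sum to one.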

Using matrices to store the weights of each pair of clauses requires a lot of memory. See Appendix E. An alternative, less memory-hungry approach would be to have a vector of weights for every set of clauses generated by every individual rule template. Unfortunately, this alternative is much less effective at the ILP tasks. See Appendix F for a fuller discussion.

4.4 Inference

The central idea behind our differentiable implementation of inference is that each clause induces a function on valuations. Consider, for example, the clause :

Table 1 shows the results of applying the corresponding function to two valuations on the set of ground atoms.

Table 1: Applying , treated as a function , to valuations and

The details of how the functions are generated is deferred to Section 4.5 below. The important point now is that we can automatically generate, from each clause , a differentiable function on valuations that implements a single step of forward chaining inference using .

Recall that $C$ is an indexed set of generated clauses, where $C_p^{i,j}$ is the $j$'th clause of the $i$'th rule template for intensional predicate $p$. Define a corresponding indexed set of functions $F$, where $F_p^{i,j}$ is the valuation function corresponding to the clause $C_p^{i,j}$.

Now we define another indexed set of functions $G_p^{j,k}$ that combines the application of two functions $F_p^{1,j}$ and $F_p^{2,k}$. Recall that each intensional predicate $p$ is defined by two clauses generated from two rule templates $\tau_p^1$ and $\tau_p^2$. Now $G_p^{j,k}$ is the result of applying both clauses $C_p^{1,j}$ and $C_p^{2,k}$ and taking the element-wise max:

$$G_p^{j,k}(a) = x \quad \text{where} \quad x[i] = \max\left(F_p^{1,j}(a)[i],\; F_p^{2,k}(a)[i]\right)$$
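The combination of two clause functions by element-wise max can be sketched as follows. This is an illustrative toy, not our implementation: the atom ordering, the helper `copy_clause` (a hand-written stand-in for the function induced by a one-atom-body clause), and the valuation are all assumptions.

```python
# Assumed ordering of ground atoms for this toy example.
ATOMS = ["⊥", "p(a)", "p(b)", "q(a)", "q(b)", "r(a)", "r(b)"]

def copy_clause(head_slots, body_slots):
    """Toy clause function for head(X) <- body(X): copy body values into head slots."""
    def F(a):
        out = [0.0] * len(a)
        for h, b in zip(head_slots, body_slots):
            out[h] = a[b]
        return out
    return F

def combine_max(F1, F2):
    """Apply both clause functions to a valuation and take the element-wise max."""
    def G_jk(a):
        return [max(u, v) for u, v in zip(F1(a), F2(a))]
    return G_jk

F1 = copy_clause([1, 2], [3, 4])  # p(X) <- q(X)
F2 = copy_clause([1, 2], [5, 6])  # p(X) <- r(X)
G_jk = combine_max(F1, F2)

a = [0.0, 0.0, 0.0, 0.9, 0.2, 0.4, 0.7]  # hypothetical valuation over ATOMS
result = G_jk(a)
```

Here `result` is non-zero only in the slots for `p(a)` and `p(b)`, each holding the larger of the two clauses' conclusions.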

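Next, we will define a series of valuations of the form $a_t$. A valuation $a_t$ represents our conclusions after $t$ time-steps of inference.

The initial value $a_0$ when $t = 0$ is based on our initial set $B \subseteq G$ of background axioms:

$$a_0[x] = \begin{cases} 1 & \text{if } \gamma_x \in B \\ 0 & \text{otherwise} \end{cases}$$

We now define $c_t^{p,j,k}$:

$$c_t^{p,j,k} = G_p^{j,k}(a_t)$$

Intuitively, $c_t^{p,j,k}$ is the result of applying one step of forward chaining inference to $a_t$ using clauses $C_p^{1,j}$ and $C_p^{2,k}$. Note that this only has non-zero values for one particular predicate: the $p$'th intensional predicate.

We can now define the weighted average of the $c_t^{p,j,k}$, using the softmax of the weights:

$$b_t^p = \sum_{j,k} c_t^{p,j,k} \cdot \frac{e^{W_p[j,k]}}{\sum_{j',k'} e^{W_p[j',k']}}$$

Intuitively, $b_t^p$ is the result of applying all the possible pairs of clauses that can jointly define predicate $p$, and weighting the results by the weights $W_p$. Note that $b_t^p$ is also zero everywhere except for the $p$'th intensional predicate.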
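A minimal sketch of this weighted average is given below. It assumes the softmax has already been applied, so the function receives a matrix of probabilities; the function name `weighted_mixture`, the shapes, and the numbers are hypothetical.

```python
def weighted_mixture(c, probs):
    """Average the valuations c[j][k], weighting each by the probability probs[j][k]."""
    n = len(c[0][0])  # number of ground atoms
    b = [0.0] * n
    for row_c, row_p in zip(c, probs):
        for valuation, prob in zip(row_c, row_p):
            for i, value in enumerate(valuation):
                b[i] += prob * value
    return b

# One clause from the first template, two from the second, three ground atoms.
c_t = [[[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]]
probs = [[0.25, 0.75]]  # assumed to be the softmax of the raw weights
b_t = weighted_mixture(c_t, probs)
```

Because the probabilities sum to one and every input valuation lies in the unit interval, each entry of `b_t` also lies in the unit interval.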

From this, we define the successor $a_{t+1}$ of $a_t$:

$$a_{t+1} = f_{amalgamate}\left(a_t, \sum_{p \in P_i} b_t^p\right)$$

The successor depends on the previous valuation $a_t$ and a weighted mixture of the clauses defining the other intensional predicates. Note that the valuations $b_t^p$ are disjoint for different $p$, so we can simply sum these valuations.

When amalgamating the previous valuation, $a_t$, with the single-step consequences, $\sum_p b_t^p$, there are various functions we can use for $f_{amalgamate}$. First we considered:

$$f_{amalgamate}(x, y) = \max(x, y)$$

(Note this is element-wise max over the two valuation vectors.) But the use of $\max$ here adversely affected gradient flow. The definition of $f_{amalgamate}$ we actually use is the probabilistic sum:

$$f_{amalgamate}(x, y) = x + y - x \cdot y$$

This keeps valuations within the real unit interval $[0,1]$ while allowing gradients to flow through both $x$ and $y$. The two alternative ways of computing $f_{amalgamate}$ are compared in Table 2.
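The difference in gradient flow can be checked numerically. The sketch below is illustrative only: it estimates the slope of each amalgamation function with respect to its second argument by a central finite difference, at an arbitrarily chosen point where the first argument is the larger one. The helper names are assumptions.

```python
def amalgamate_max(x, y):
    """Element-wise max of two valuation vectors."""
    return [max(u, v) for u, v in zip(x, y)]

def amalgamate_prob_sum(x, y):
    """Element-wise probabilistic sum x + y - x*y, which stays within [0, 1]."""
    return [u + v - u * v for u, v in zip(x, y)]

def slope_wrt_y(f, x, y, h=1e-6):
    """Central finite difference of f with respect to its second (scalar) argument."""
    return (f([x], [y + h])[0] - f([x], [y - h])[0]) / (2 * h)

x, y = 0.6, 0.3  # hypothetical truth values with x > y
slope_max = slope_wrt_y(amalgamate_max, x, y)
slope_prob = slope_wrt_y(amalgamate_prob_sum, x, y)
```

With max, the smaller argument receives no gradient at all (`slope_max` is zero), whereas the probabilistic sum has slope $1 - x$ with respect to $y$, so both arguments contribute.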

4.5 Computing the $F_c$ Functions

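We now explain the details of how the various functions $F_c$ are computed.

Each function $F_c$ can be computed as follows. Let $X_c = \{x_k\}$ be a set of sets of pairs of indices of ground atoms for clause $c$. Each $x_k$ contains all the pairs of indices of atoms that justify atom $\gamma_k$ according to the current clause $c$:

$$x_k = \{(a, b) \mid satisfies_c(\gamma_a, \gamma_b) \wedge head_c(\gamma_a, \gamma_b) = \gamma_k\}$$

Note that we can restrict ourselves to pairs only (and don't have to worry about triples, etc.) because we are restricting ourselves to rules with two atoms in the body.

Here, $satisfies_c(\gamma_1, \gamma_2)$ holds if the pair of ground atoms $(\gamma_1, \gamma_2)$ satisfies the body of clause $c$. If $c = \alpha \leftarrow \alpha_1, \alpha_2$, then $satisfies_c(\gamma_1, \gamma_2)$ is true if there is a substitution $\theta$ such that $\alpha_1[\theta] = \gamma_1$ and $\alpha_2[\theta] = \gamma_2$.

Also, $head_c(\gamma_1, \gamma_2)$ is the head atom produced when applying clause $c$ to the pair of atoms $(\gamma_1, \gamma_2)$. If $c = \alpha \leftarrow \alpha_1, \alpha_2$ and $\alpha_1[\theta] = \gamma_1$ and $\alpha_2[\theta] = \gamma_2$ then

$$head_c(\gamma_1, \gamma_2) = \alpha[\theta]$$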

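For example, suppose $P = \{p, q, r\}$ and $C = \{a, b\}$. Then our ground atoms $G$ are:

k      0     1        2        3        4        5        6        7        8
γ_k    ⊥     p(a,a)   p(a,b)   p(b,a)   p(b,b)   q(a,a)   q(a,b)   q(b,a)   q(b,b)

k      9        10       11       12
γ_k    r(a,a)   r(a,b)   r(b,a)   r(b,b)

Suppose clause $c$ is:

$$r(X, Y) \leftarrow p(X, Z), q(Z, Y)$$

Then $X_c = \{x_k\}$ is:

k     γ_k      x_k
0     ⊥        {}
1     p(a,a)   {}
2     p(a,b)   {}
3     p(b,a)   {}
4     p(b,b)   {}
5     q(a,a)   {}
6     q(a,b)   {}
7     q(b,a)   {}
8     q(b,b)   {}
9     r(a,a)   {(1, 5), (2, 7)}
10    r(a,b)   {(1, 6), (2, 8)}
11    r(b,a)   {(3, 5), (4, 7)}
12    r(b,b)   {(3, 6), (4, 8)}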

Focusing on a particular row, the reason why $(2, 7)$ is in $x_9$ is that $\gamma_2 = p(a,b)$, $\gamma_7 = q(b,a)$, the pair of atoms $(p(a,b), q(b,a))$ satisfies the body of clause $c$, and the head of the clause (for this pair of atoms) is $r(a,a)$, which is $\gamma_9$.
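The index sets for this running example can be reproduced by enumerating every substitution for $X$, $Y$ and $Z$. The sketch below hard-codes the clause $r(X,Y) \leftarrow p(X,Z), q(Z,Y)$ rather than handling arbitrary clauses; the tuple encoding of atoms and the function name `build_x_c` are assumptions made for illustration.

```python
from itertools import product

constants = ["a", "b"]
predicates = ["p", "q", "r"]

# Index 0 is the falsum atom; the binary ground atoms follow in the order used in the text.
atoms = [("⊥",)] + [(pred, x, y)
                    for pred in predicates
                    for x, y in product(constants, repeat=2)]
index = {atom: i for i, atom in enumerate(atoms)}

def build_x_c():
    """For r(X,Y) <- p(X,Z), q(Z,Y): list, per head atom, the body-atom index pairs."""
    x_c = [[] for _ in atoms]
    for x, y in product(constants, repeat=2):
        for z in constants:  # each choice of Z gives one substitution for this head
            head = index[("r", x, y)]
            x_c[head].append((index[("p", x, z)], index[("q", z, y)]))
    return x_c

x_c = build_x_c()
```

Each ground instance of the head collects one pair per value of the existential variable $Z$, which is why every non-empty entry here has exactly two pairs.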

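We can transform $X_c$, a set of sets of pairs, into a three dimensional tensor: $X \in \mathbb{N}^{n \times w \times 2}$. Here, $w$ is the maximum number of pairs for any $x_k$ in $X_c$. The width $w$ depends on the number $v$ of existentially quantified variables in the rule template. Each existentially quantified variable can take $|C|$ values, so $w = |C|^v$. $X$ is constructed from $X_c$, filling in unused space with $(0, 0)$ pairs that point to the pair of atoms $(\bot, \bot)$:

$$X[k, m] = \begin{cases} x_k[m] & \text{if } m < |x_k| \\ (0, 0) & \text{otherwise} \end{cases}$$

This is why we needed to include the falsum atom $\bot$ in $G$, so that the null pairs have some atom to point to. In our running example, this yields:

k     γ_k      X[k]
0     ⊥        [(0, 0), (0, 0)]
1     p(a,a)   [(0, 0), (0, 0)]
2     p(a,b)   [(0, 0), (0, 0)]
3     p(b,a)   [(0, 0), (0, 0)]
4     p(b,b)   [(0, 0), (0, 0)]
5     q(a,a)   [(0, 0), (0, 0)]
6     q(a,b)   [(0, 0), (0, 0)]
7     q(b,a)   [(0, 0), (0, 0)]
8     q(b,b)   [(0, 0), (0, 0)]
9     r(a,a)   [(1, 5), (2, 7)]
10    r(a,b)   [(1, 6), (2, 8)]
11    r(b,a)   [(3, 5), (4, 7)]
12    r(b,b)   [(3, 6), (4, 8)]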
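The padding step can be sketched in a few lines. The index sets below are copied from the running example; the helper name `pad_pairs` is an assumption, and nested Python lists stand in for the three dimensional tensor.

```python
def pad_pairs(x_c, width):
    """Pad each list of index pairs to a fixed width using (0, 0), the falsum pair."""
    return [pairs + [(0, 0)] * (width - len(pairs)) for pairs in x_c]

# Index sets for r(X,Y) <- p(X,Z), q(Z,Y): atoms 0..8 have no justifying pairs.
x_c = [[] for _ in range(9)] + [
    [(1, 5), (2, 7)],
    [(1, 6), (2, 8)],
    [(3, 5), (4, 7)],
    [(3, 6), (4, 8)],
]

num_constants, num_existential = 2, 1
width = num_constants ** num_existential  # w = |C|^v
X = pad_pairs(x_c, width)
```

After padding, every row of `X` has exactly `width` pairs, so the structure is rectangular and can be stored as a dense integer tensor of shape $n \times w \times 2$.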

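Let $X_1, X_2 \in \mathbb{N}^{n \times w}$ be two slices of $X$, taking the first and second elements in each pair:

$$X_1 = X[:, :, 0] \qquad X_2 = X[:, :, 1]$$

We shall use a function $gather_2 : \mathbb{R}^a \times \mathbb{N}^{b \times c} \to \mathbb{R}^{b \times c}$:

$$gather_2(x, y)[i, j] = x[y[i, j]]$$

Now we are ready to define $F_c(a)$. Let $Y_1, Y_2 \in [0,1]^{n \times w}$ be the results of assembling the elements of $a$ according to the matrix of indices in $X_1$ and $X_2$:

$$Y_1 = gather_2(a, X_1) \qquad Y_2 = gather_2(a, X_2)$$

Now let $Z \in [0,1]^{n \times w}$ contain the results of element-wise multiplying the elements of $Y_1$ and $Y_2$:

$$Z = Y_1 \odot Y_2$$

Here, $Z[k, :]$ is the vector of fuzzy conjunctions of all the pairs of atoms that contribute to the truth of $\gamma_k$, according to the current clause. Now we can define $F_c(a)$ by taking the max fuzzy truth values in $Z$. Let $F_c(a) = a'$ where $a'[k] = \max(Z[k, :])$.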
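Putting the slices, the gather, the element-wise product and the row-wise max together gives the following sketch of a single clause function. It uses nested lists in place of tensors, and the valuation is a made-up example rather than the one tabulated in the paper; the names `gather2` and `apply_clause` are assumptions.

```python
def gather2(x, idx):
    """Look up x at every index in the 2-D index matrix idx."""
    return [[x[i] for i in row] for row in idx]

def apply_clause(a, X):
    """One forward-chaining step: max over pairs of the product of the two body atoms."""
    X1 = [[pair[0] for pair in row] for row in X]  # first element of each pair
    X2 = [[pair[1] for pair in row] for row in X]  # second element of each pair
    Y1, Y2 = gather2(a, X1), gather2(a, X2)
    Z = [[u * v for u, v in zip(r1, r2)] for r1, r2 in zip(Y1, Y2)]
    return [max(row) for row in Z]

# Padded index tensor for r(X,Y) <- p(X,Z), q(Z,Y) over the 13 ground atoms.
X = [[(0, 0), (0, 0)] for _ in range(9)] + [
    [(1, 5), (2, 7)],
    [(1, 6), (2, 8)],
    [(3, 5), (4, 7)],
    [(3, 6), (4, 8)],
]

# Hypothetical valuation: index 0 is the falsum atom and must stay at 0.
a = [0.0, 1.0, 0.5, 0.0, 0.0, 0.125, 0.0, 0.5, 0.75, 0.0, 0.0, 0.0, 0.0]
a_next = apply_clause(a, X)
```

For this valuation, `a_next[9]` (the atom $r(a,a)$) is the larger of $a[1] \cdot a[5]$ and $a[2] \cdot a[7]$. The padded $(0, 0)$ pairs contribute nothing because the falsum atom has value zero.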

The following table shows the calculation of $F_c(a)$ for a particular valuation $a$, using our running example $c = r(X,Y) \leftarrow p(X,Z), q(Z,Y)$. Here, since there is one existential variable $Z$, $v = 1$, and $w = |\{a, b\}|^v = 2$.

0 0.0 0.00
1 1.0