This paper introduces a differentiable semantic reasoner in which rules are represented as a set of graph transformations. These rules can be written manually or inferred from a set of facts and goals presented as a training set. While the internal representation uses embeddings in a latent space, each rule can be expressed as a set of predicates conforming to a subset of Description Logic.
Symbolic logic is the most powerful representation for building interpretable computational systems [Garcez2020]. In this work we adopt a subset of Description Logic [Krtzsch2012ADL] to represent knowledge and build a semantic reasoner, which derives new facts by applying a chain of transformations to the original set.
In the restricted context of this paper, knowledge can be expressed in predicate or graph form, interchangeably. Thus, semantic reasoning can be understood as a sequence of graph transformations [ehrigGraphTransformations], which act on a subset of the original knowledge base and sequentially apply the matching rules.
In this paper, we show that rule matching can be made differentiable by representing nodes and edges as embeddings. After building a one-to-one correspondence between a sequence of rules and a linear algebra expression, the system can eventually train the embeddings using a convenient loss function. The rules created in this fashion can then be applied during inference time.
Our system follows the recent revival of hybrid neuro-symbolic models [Garcez2020], combining insights from logic programming with deep learning methods. The main contribution of this work is to show that reasoning over graphs is a learnable task. While the system is presented here as a proof of concept, we show that differentiable graph transformations can effectively learn new rules by training nodes, edges, and matching thresholds through backpropagation.
The system presented here is a semantic reasoner inspired by the early STRIPS language [STRIPS1971]. It creates a chain of rules that connects an initial state of facts to a final state of inferred predicates. Each rule has a set of pre- and post-conditions, expressed here using a subset of Description Logic (DL). In the following, we restrict our DL to Assertional Axioms (ABox). Thus, each fact can be represented as a set of predicates, or - equivalently - as a graph with matching rules as described below.
We use a predicate form to represent facts, rules, and intermediate states, as shown in Fig. 1. For example, the semantics of "Joe wins the election in the USA" is captured in the following form

win(Joe, election), in(election, USA)

In the prior example, Joe, election, and USA are nodes of the semantic graph, whereas win and in are convenient relations that represent the graph's edges.
The rules are specified with a MATCH/CREATE pair. The MATCH statement specifies the pre-condition that triggers the rule, while the CREATE statement acts as the effect, or post-condition, after applying the rule. The result of applying this rule is shown in Fig. 1, where a new state is created from the original fact. Notice that the name Joe, which matches a variable in the pre-condition, is propagated forward to the next set of facts.
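As an illustration, a rule in this format might read as follows; the predicate names and the variable syntax here are hypothetical, mirroring the election example rather than quoting the paper's literal notation:

```
MATCH  win(X, election), in(election, COUNTRY)
CREATE president_of(X, COUNTRY)
```

Matching the pre-condition against the facts binds X to Joe and COUNTRY to USA, and the CREATE step emits the new predicate under those bindings.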
By applying rules in sequence, one builds an inferential chain of MATCH and CREATE conditions. After each rule, the initial fact graph is transformed into a new set of nodes and edges. This chain of graph transformations builds a path in a convenient semantic space, as shown in Fig. 2. One of this paper's main results is to show that there is a one-to-one correspondence between a chain of matching rules and a chain of linear algebra operations.
Both nodes and edges are represented as embeddings in a latent space. For convenience, in the current work the vocabulary of possible nodes matches the GloVe 300-dimensional embeddings [pennington2014glove], whereas edges are assigned random embeddings linked to the relevant ontology.
A rule is triggered if the pre-condition graph is subgraph-isomorphic to the graph of facts. Each node and edge of the pre-conditions carries a learnable threshold value. Two items match if the dot product between their embeddings is greater than the corresponding threshold. In the predicate representation, we make these trainable thresholds explicit by annotating each predicate's name with its threshold, so that the rule in II-A becomes a thresholded version of the same MATCH/CREATE pair, indicating, for example, that two nodes would only match if their normalized dot product is greater than the annotated threshold. In the Description Logic framework, this is equivalent to an individuality assertion.
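As a minimal sketch of this matching criterion (the 3-dimensional vectors below are hypothetical stand-ins for GloVe embeddings):

```python
import numpy as np

def items_match(a, b, threshold):
    """A rule item matches a fact item when the normalized dot product
    (cosine similarity) of their embeddings exceeds the item's threshold."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return cos > threshold

# Hypothetical 3-d embeddings standing in for trained vectors.
joe      = np.array([0.9, 0.1, 0.0])
joseph   = np.array([0.8, 0.2, 0.1])   # close in embedding space
election = np.array([0.0, 0.2, 0.9])   # far from both

print(items_match(joe, joseph, 0.9))    # near-synonyms clear the threshold
print(items_match(joe, election, 0.9))  # unrelated concepts do not
```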
Matching facts and preconditions creates a most general unifier (MGU) that is propagated forward along the inference chain.
During training, the final state acts as the goal of the system, as shown in Fig. 2. The system learns how to create rules given a set of empty template rules, in which the embeddings for each node and edge are chosen randomly. These templates are specified prior to training using a placeholder symbol to indicate a random embedding.
In the current state of development, the algorithm generates all possible paths compatible with the boundary conditions, and then applies the training algorithm explained below to each of them. A more efficient method will be pursued in future work.
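A brute-force sketch of this path generation, assuming a path is an ordered sequence of distinct rules (a rule is used at most once per path):

```python
from itertools import permutations

def candidate_paths(rules, max_len):
    """Enumerate every ordered sequence of distinct rules up to max_len."""
    paths = []
    for length in range(1, max_len + 1):
        paths.extend(permutations(rules, length))
    return paths

rules = ["R1", "R2", "R3"]
paths = candidate_paths(rules, 2)
print(len(paths))  # 3 single-rule paths + 6 ordered pairs = 9
```

Each candidate path is then trained independently, which is why this exhaustive strategy dominates the running time.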
At every step of the inference chain, the collection of predicates changes according to the applied transformations. At every step $s$ we employ a vector $v_s$ that signals the truth value of each predicate. For computational reasons, the dimension of this vector must be fixed in advance and set to a value $N$, the maximum size of the predicate set. The first vector $v_0$ is a vector of ones, as every predicate in the initial knowledge base is assumed to be true.
At the end of the resolution chain there is a "goal" set of predicates, usually less numerous than the initial set of facts. A vector $g$ indicates the truth conditions of the goal predicates. This vector, also of size $N$, contains a number of ones equal to the number of goal nodes and is zero otherwise. The application of a rule can then be described by two matrices: the similarity matrix $S$ and the rule propagation matrix $R$.
A similarity matrix describes how well a set of facts matches the preconditions:

$$S_s = \mathrm{softmax}\left(P^\top F_s + B\right) \qquad (1)$$

where $P$ is the matrix with the pre-condition nodes as columns and $F_s$ is the matrix with the fact nodes as columns at step $s$. $S_s$ is the matrix of matches, bearing a value close to 1 where two nodes match and vanishing otherwise. For example, if the first node of the pre-conditions matches the second node of the facts, the matrix has a value close to 1 at the corresponding entry.
The matrix $B$ is a bias matrix whose columns are the list of (trainable) thresholds for each predicate in the pre-conditions. This bias effectively enforces the matching thresholds: a negative argument to the exponential leads to an exponentially small result after the softmax operation.
All the matrices $S_s$, $P$, and $F_s$ are square $N \times N$ matrices. Eq. 1 is reminiscent of self-attention [vaswani_attention], with an added bias matrix and a mask.
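A minimal numpy sketch of such a thresholded, softmax-normalized similarity; the one-hot vectors, the bias value, and the sharpening factor are illustrative choices, not the paper's settings:

```python
import numpy as np

def softmax_rows(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Illustrative N = 3 setup with one-hot "embeddings" standing in for
# trained node vectors: precondition i should match the permuted fact.
P = np.eye(3)                  # precondition nodes as columns
F = np.eye(3)[:, [1, 0, 2]]    # fact nodes as columns (same set, permuted)
B = np.full((3, 3), -5.0)      # bias of (scaled, negated) matching thresholds

scale = 10.0                   # sharpening factor, for illustration only
S = softmax_rows(scale * (P.T @ F) + B)

# Each row of S scores one precondition against every fact; matched pairs
# get weight ~1 while sub-threshold pairs are exponentially suppressed.
print(S.argmax(axis=1))        # recovers the permutation: [1 0 2]
```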
A rule propagation matrix $R$ puts the left side of a rule into contact with the right side. The idea behind $R$ is to keep track of how information travels inside a single rule. In this work we simplify the propagation matrix to a fully connected layer with only one trainable parameter. For example, if the chosen size is $N = 4$, a rule with three pre-condition nodes and two post-condition nodes has an $R$ matrix as

$$R = \begin{pmatrix} w & w & w & 0 \\ w & w & w & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix} \qquad (2)$$

where $w$ is the "weight" of the rule. Given a first state $v_0$, the set of truth conditions after $n$ steps is

$$v_n = \left( \prod_{s=1}^{n} R_s S_s \right) v_0 \qquad (3)$$
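A sketch of one propagation step under these simplifications, assuming a perfect match (identity similarity) at this step:

```python
import numpy as np

N = 4
w = 0.8                 # the rule's single trainable weight

# Fully connected block from three pre-condition slots (columns 0-2)
# to two post-condition slots (rows 0-1); zero elsewhere.
R = np.zeros((N, N))
R[:2, :3] = w

S = np.eye(N)           # assume a perfect match at this step
v0 = np.ones(N)         # initial truth vector: every fact holds

v1 = R @ (S @ v0)       # truth values propagated through one rule
print(v1)
```

Each post-condition slot receives $w$ times the summed truth of the pre-condition slots, so here `v1` is `[2.4, 2.4, 0, 0]`.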
This final state is compared against the goal’s truth vector to create a loss function.
The training of the relation embeddings follows the same sequence of operations as for the nodes. The set of truth vectors and states is acted upon by the relation similarity matrix

$$S^{\mathrm{rel}}_s = \mathrm{softmax}\left( (P^{\mathrm{rel}})^\top F^{\mathrm{rel}}_s + B^{\mathrm{rel}} \right) \qquad (4)$$

and the corresponding rule propagation matrix for relations $R^{\mathrm{rel}}_s$, leading to the final truth vector for relations

$$v^{\mathrm{rel}}_n = \left( \prod_{s=1}^{n} R^{\mathrm{rel}}_s S^{\mathrm{rel}}_s \right) v^{\mathrm{rel}}_0 \qquad (5)$$

Following the example of the nodes, the goal vector for the relations is named $g^{\mathrm{rel}}$; it contains the desired truth conditions for the relations at the end of the chain.
The system learns the node and edge embeddings of the rules, while the initial facts and the goal are frozen during training. The system also learns the matching thresholds and each rule's weight $w$. Following Eqs. 3 and 5, the final loss function is computed as a binary cross-entropy expression

$$\mathcal{L} = \mathrm{BCE}\left(v_n, g\right) + \mathrm{BCE}\left(v^{\mathrm{rel}}_n, g^{\mathrm{rel}}\right) \qquad (6)$$
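A sketch of the node term of such a loss (numpy, with standard clipping for numerical stability; the relation term is computed analogously and added):

```python
import numpy as np

def bce(pred, target, eps=1e-7):
    """Binary cross entropy between predicted and target truth values."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(pred)
                          + (1.0 - target) * np.log(1.0 - pred)))

v_final = np.array([0.9, 0.8, 0.1, 0.05])  # illustrative predicted truth vector
g       = np.array([1.0, 1.0, 0.0, 0.0])   # frozen goal truth vector
loss_nodes = bce(v_final, g)
print(round(loss_nodes, 3))  # ≈ 0.121
```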
The system can in principle be trained over multiple pairs of facts and goals, in which case the loss function is the sum of the losses over all pairs. For simplicity, in this paper we limit the training to a single pair of facts and goal.
In order to avoid the Sussman anomaly, the same rule can only be used once in the same path.
As a toy example we want the system to learn that if someone is married to a "first lady", then this person is president. The facts are
and the goal is
Given the empty rule
The system correctly learns the rule that connects the facts with the goal.
While trivial, this is a fundamental test of the system's capacity to learn the correct transformation. The matching thresholds have been clipped and cannot go below a fixed minimum during training.
While a successful result is almost guaranteed by choosing a rule that closely matches the boundary conditions, the system proves capable of converging onto the correct embeddings and thresholds using backpropagation alone.
While a single-rule transformation can be useful in a few edge cases, the real power of semantic reasoning comes from combining rules together. In this section we show, using another toy example, that the system can learn two rules at the same time. The simplified task is to learn that "if a fruit is round and is delicious, then it is an apple." The facts are
and the goal is
The system is given the two template rules to fit
Notice the "and" relations in the templates. These relations are frozen during training and constitute another constraint for the system to satisfy. In the end, our model learns the correct rules
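A plausible rendering of the learned pair, with hypothetical predicate names rather than the paper's literal output, is:

```
Rule 1:
  MATCH  is(FRUIT, round), and(round, delicious), is(FRUIT, delicious)
  CREATE intermediate(FRUIT)

Rule 2:
  MATCH  intermediate(FRUIT)
  CREATE is(FRUIT, apple)
```

Only the frozen "and" relation and the boundary conditions pin the chain down; the intermediate predicate is whatever embedding the training converges to.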
which satisfy the goal when chained.
Here we forced the system to apply two rules, since no single template would fit the boundary conditions. Of particular interest is the fact that the system learned the pre-conditions of the second rule. This is not a trivial task, given that training started with random embeddings and the only information about the correct values is the one propagated forward from the first rule.
Neuro-symbolic reasoning has been an intriguing line of research in the past decades [neurosymb_book2002, Garcez2009NeuralSymbolicCR]. Some recent results make use of a Prolog-like resolution tree as a harness in which to train a neural network [rockt2017, minervini2018towards, weber-etal-2019-nlprolog, minervini2020differentiable]. Our work is similar to theirs but builds upon a STRIPS-like system instead of Prolog. A different approach employs a Herbrand base for inductive logic programming in a bottom-up solver [evansILP].
Finally, one can see our method as a sequence of operations that create or destroy items. Each (differentiable) transformation brings forward a new state of the system made of discrete elements. These types of algorithms have already been investigated in the Physics community, for example in [sandvik2002].
In this work we presented a semantic reasoner that leverages differentiable graph transformations for rule learning. The system is built upon a one-to-one correspondence between a chain of rules and a sequence of linear algebra operations. Given a set of facts, a goal, and a set of rules with random embeddings, the reasoner can learn new rules that satisfy the constraints. The rules are then written as a set of predicates with pre- and post-conditions, a more interpretable representation than embeddings and weights.
The system presented here is limited in speed and, as a consequence, in the volume of training data it can handle. This is mostly due to our path-creation algorithm, which generates all possible paths given a set of rules. A more efficient algorithm would employ a guided approach to path creation, similar to the method in [minervini2020differentiable]. A different and possibly novel efficiency gain could be found in a Monte Carlo method, where the path converges to the correct one by means of a Metropolis algorithm. This last approach has already found application in the Computational Physics community and could be useful in our approach as well.
Another open question is whether the system is able to generalize when given multiple sets of facts and goals. This inquiry will need a faster algorithm and will be pursued in future work.