Learning Invariants through Soft Unification

09/16/2019 ∙ by Nuri Cingillioglu, et al. ∙ Imperial College London

Human reasoning involves recognising common underlying principles across many examples by utilising variables. The by-products of such reasoning are invariants that capture patterns across examples such as "if someone went somewhere then they are there" without mentioning specific people or places. Humans learn what variables are and how to use them at a young age, and the question this paper addresses is whether machines can also learn and use variables solely from examples without requiring human pre-engineering. We propose Unification Networks that incorporate soft unification into neural networks to learn variables and by doing so lift examples into invariants that can then be used to solve a given task. We evaluate our approach on four datasets to demonstrate that learning invariants captures patterns in the data and can improve performance over baselines.




1 Introduction

Humans have the ability to process symbolic knowledge and maintain symbolic thought (Unger and Deacon, 1998). When reasoning, humans do not require combinatorial enumeration of examples but instead utilise invariant patterns with placeholders replacing specific entities. Symbolic cognitive models (Lewis, 1999) embrace this perspective, with the human mind seen as an information processing system operating on formal symbols, such as reading a stream of tokens in natural language. The language of thought hypothesis (Morton and Fodor, 1978) frames human thought as a structural construct with varying sub-components such as “X went to Y”. By recognising what varies across examples, humans are capable of lifting examples into invariant principles that account for other instances. This symbolic thought with variables is learned at a young age through symbolic play (Piaget, 2001). For instance, a child learns that a sword can be substituted with a stick (Frost et al., 2004) and engages in pretend play.

X:bernhard is a Y:frog
Z:lily is a Y:frog
Z:lily is A:green
what colour is X:bernhard
Figure 1: Invariant learned for bAbI task 16, basic induction, where X:bernhard denotes a variable with default symbol bernhard. This single invariant accounts for all the training and test examples.

Although variables are inherent in models of computation and symbolic formalisms, as in first-order logic (Russell and Norvig, 2018), they are pre-engineered and used to solve specific tasks by means of unification or assignments that bind variables to given values. However, when learning from data only, being able to recognise when and which symbols should take on different values, i.e. symbols that can act as variables, is crucial for lifting examples into general principles that are invariant across multiple instances. Figure 1 shows the invariant learned by our approach: if someone is the same thing as someone else then they have the same colour. With this invariant, our approach can solve all of the training and test examples in task 16 of the bAbI dataset (Weston et al., 2015).

In this paper we address the question of whether a machine can learn and use the notion of a variable, i.e. a symbol that can take on different values. For instance, given an example of the form “bernhard is a frog”, the machine would learn that the token “bernhard” could be someone else and the token “frog” could be something else. If we consider unification to be the selection of the most appropriate value for a variable given a choice of values, we can reframe it as a form of attention. Attention models (Bahdanau et al., 2014; Luong et al., 2015; Chaudhari et al., 2019) allow neural networks to focus on, i.e. attend to, certain parts of the input, often for the purpose of selecting a relevant portion. Since attention mechanisms are also differentiable, they are often jointly learned within a task. This perspective motivates our idea of a unification mechanism that utilises attention and is therefore fully differentiable, which we refer to as soft unification.

Hence, we propose an end-to-end differentiable neural network approach for learning and utilising the notion of a variable that in turn can lift examples into invariants used by the network to perform reasoning tasks. Specifically, we (i) propose a novel architecture capable of learning and using variables by lifting a given example through soft unification, (ii) present the empirical results of our approach on four datasets and (iii) analyse the learned invariants that capture the underlying patterns present in the tasks. Our implementation using Chainer (Tokui et al., 2015) is publicly available at https://github.com/nuric/softuni with the accompanying data.

2 Soft Unification

Reasoning with variables involves identifying what variables are, the setting in which they are used as well as the process by which they are assigned values. When the varying components, i.e. variables, of an example are identified, the remaining structure can be lifted into an invariant which then accounts for multiple other instances.

Definition 1 (Variable).

Given a set of symbols S, a variable X is defined as a pair (d, x) where d ∈ S is the default symbol of the variable and x is a discrete random variable of which the support is S. The representation of a variable is equal to the expected value of the corresponding random variable x given the default symbol d:

r(X) = E[φ(x) | d] = Σ_{s ∈ S} P(x = s | d) φ(s)    (1)

where φ : S → R^d is a d-dimensional real-valued feature of a symbol s ∈ S.

For example, φ could be an embedding and r(X) would become a weighted sum of symbol embeddings as in conventional attention models. The default symbol of a variable is intended to capture the variable’s bound meaning, following the idea of referents by Frege (1948). We denote variables using X, Y, A, etc., such as X:bernhard where X is the name of the variable and bernhard the default symbol, as shown in Figure 1.
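As an illustrative sketch (our own, not the paper's released code), the expectation in equation 1 can be computed as attention over symbol embeddings; the names `embed`, `softmax` and `variable_repr` are ours, and for simplicity we score candidate symbols with the same embeddings used for representation:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Illustrative vocabulary of 5 symbols with 4-dimensional embeddings phi(s).
rng = np.random.default_rng(0)
embed = rng.normal(size=(5, 4))  # phi: one row per symbol

def variable_repr(feat):
    """E[phi(x) | d]: a probability mass over all symbols, scored against
    the default symbol's feature, weights a sum of embeddings (equation 1)."""
    scores = feat @ embed.T  # one score per candidate symbol
    p = softmax(scores)      # P(x = s | d), sums to 1
    return p @ embed         # expected (attention-weighted) embedding

# Representation of a variable whose default symbol has index 2.
r = variable_repr(embed[2])
assert r.shape == (4,)
```

The result is a weighted sum of symbol embeddings, exactly the form conventional attention models produce.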

Definition 2 (Invariant).

Given a structure G (e.g. list, grid) over S, an invariant is a pair (G, ψ) where G is the invariant example, such as a tokenised story, and ψ : S → [0, 1] is a function representing the degree to which a symbol is considered a variable. Thus, the final representation of a symbol s included in G is:

φ′(s) = (1 − ψ(s)) φ(s) + ψ(s) E[φ(x) | s]    (2)

the linear interpolation between its representation φ(s) and its variable bound value E[φ(x) | s] with itself as the default symbol.

We adhere to the term invariant and refrain from mentioning rules, unground rules, etc. used in logic-based formalisms, e.g. Muggleton and de Raedt (1994), since the invariant structure need not be rule-like, nor do the variables carry logical semantics. This distinction is clarified in Section 6.

Definition 3 (Unification).

Given an invariant (G, ψ) and an example K, unification binds the variables in G to symbols in K. Defined as a function g((G, ψ), K), unification binds variables by computing the probability mass functions in equation 1 and returns the unified representation using equation 2. The probability mass function of a variable X:d is:

P(x = k | d) = softmax( φ_U(d) · φ_U(k) )    (3)

where φ_U is the unifying feature of a symbol and the softmax is applied element-wise over the symbols k in K. If g is differentiable, it is referred to as soft unification.

We distinguish φ_U from φ to emphasise that the unifying properties of the symbols might be different from their representations. For example, φ could represent a specific person whereas φ_U the notion of someone.

Overall soft unification incorporates 3 learnable components, φ, ψ and φ_U, which denote the base features, variableness and unifying features of a symbol respectively. Given an upstream, potentially task specific, network f, an invariant I and an input example K with a corresponding desired output y, the following holds:

f(g(I, K)) = f(K) = y    (4)

where f now predicts y based on the unified representation produced by g. In this work, we focus on g, the invariants it produces together with the interaction of f ∘ g.
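The lifting step g, gating by ψ (equation 2) and binding via unifying features (equation 3), can be sketched end to end as follows; all array names and sizes here are illustrative assumptions, not the paper's code:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
V, D = 6, 4                      # vocabulary and embedding sizes (illustrative)
phi = rng.normal(size=(V, D))    # base features of symbols
phi_u = rng.normal(size=(V, D))  # unifying features, may differ from phi
w = rng.normal(size=V)           # variableness logits, psi = sigmoid(w)

def unify(invariant, example):
    """g(I, K): interpolate each invariant symbol between its own embedding
    (weight 1 - psi) and its attention-weighted binding to the example's
    symbols (weight psi) -- equations 2 and 3."""
    psi = 1.0 / (1.0 + np.exp(-w[invariant]))
    scores = phi_u[invariant] @ phi_u[example].T  # eq. 3 logits
    p = softmax(scores, axis=-1)                  # P(x = k | d)
    bound = p @ phi[example]                      # expected bound value
    return (1 - psi)[:, None] * phi[invariant] + psi[:, None] * bound

G = np.array([0, 1, 2])  # invariant symbol indices
K = np.array([3, 4, 5])  # example symbol indices
U = unify(G, K)          # unified representation, fed to the upstream network f
assert U.shape == (3, D)
```

Because every step is a differentiable tensor operation, gradients from the upstream network flow back into φ, ψ and φ_U jointly.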

3 Unification Networks

Since soft unification is end-to-end differentiable, it can be incorporated into existing task-specific upstream architectures. We present 3 architectures that model f using multi-layer perceptrons (MLP), convolutional neural networks (CNN) and memory networks (Weston et al., 2014) to demonstrate the flexibility of our approach. In all cases, the d-dimensional representations of symbols are learnable embeddings φ(s) = W[s], with W randomly initialised from N(0, 1) and [s] the one-hot encoding of the symbol. The variableness of a symbol is a learnable weight ψ(s) = σ(w_s), where w_s ∈ R and σ is the sigmoid function. We consider every symbol independently a variable irrespective of its surrounding context and leave further contextualised formulations as future work. The underlying intuition of this configuration is that a useful symbol for a correct prediction might need to take on other values for different inputs. This usefulness can be viewed as the inbound gradient to the corresponding parameter w_s, with ψ(s) acting as a gate. For further model details including the size of the embeddings, please refer to Appendix A.

Unification MLP (UMLP) (f: MLP, φ_U: RNN) We combine soft unification with a multi-layer perceptron to process fixed length inputs. In this case, the structure is a sequence of symbols with a fixed length, e.g. a sequence of digits 4234. Given an embedded input, the upstream MLP computes the output symbol based on the flattened representations of the sequence fed through its layers. However, to compute the unifying features φ_U, definition 3, g uses a bi-directional GRU (Cho et al., 2014) running over the sequence such that φ_U(s) is a learnable projection of the hidden state of the GRU at symbol s. This model emphasises the flexibility around the boundary of f and g and that the unifying features can be computed in any differentiable manner.
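As a rough stand-in for the contextual unifying features, a plain tanh bi-directional RNN rather than the paper's GRU, with hypothetical weight names, might look like:

```python
import numpy as np

rng = np.random.default_rng(2)
D, H = 4, 3                             # embedding and hidden sizes (illustrative)
Wf = rng.normal(size=(H, D + H)) * 0.1  # forward RNN weights
Wb = rng.normal(size=(H, D + H)) * 0.1  # backward RNN weights
Wu = rng.normal(size=(D, 2 * H)) * 0.1  # projection to unifying features

def bi_rnn_features(seq_embed):
    """Contextual unifying features phi_U for each symbol in a sequence:
    concatenate forward and backward hidden states, then project."""
    T = seq_embed.shape[0]
    hf, hb = np.zeros(H), np.zeros(H)
    fwd, bwd = [], []
    for t in range(T):                  # forward pass over the sequence
        hf = np.tanh(Wf @ np.concatenate([seq_embed[t], hf]))
        fwd.append(hf)
    for t in reversed(range(T)):        # backward pass
        hb = np.tanh(Wb @ np.concatenate([seq_embed[t], hb]))
        bwd.append(hb)
    bwd.reverse()
    h = np.concatenate([np.stack(fwd), np.stack(bwd)], axis=1)  # (T, 2H)
    return h @ Wu.T                                             # (T, D)

feats = bi_rnn_features(rng.normal(size=(5, D)))
assert feats.shape == (5, D)
```

The point is only that φ_U can come from any differentiable computation over the structure, not just a per-symbol lookup.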

Figure 2: Graphical overview of soft unification within a memory network. Each sentence is processed by two bi-directional RNNs for memory and unification. At each iteration the context attention selects which sentences to unify and the invariant produces the same answer as the example.

Unification CNN (UCNN) (f: CNN, φ_U: CNN) Given a grid of embedded symbols of width w and height h, we use a convolutional neural network f whose final prediction is computed from the result of global max pooling over the convolutional layers followed by learnable output parameters. We also model φ_U using a separate convolutional network with the same architecture as f, taking the unifying feature of each symbol from the output of the convolutional layers at its position. The grid is padded with 0s to retain the w × h shape after each convolution such that every symbol has a unifying feature. This model conveys how soft unification can be adapted to the specifics of the domain, for example by using a convolution on a spatially structured input.

Unification Memory Networks (UMN) (f: MemNN, φ_U: RNN) Soft unification does not need to happen prior to f in an f(g(I, K)) fashion but can also be incorporated at any intermediate stage multiple times. To demonstrate this ability, we unify the symbols at different memory locations at each iteration of a Memory Network (Weston et al., 2014). Memory networks can handle a list of lists structure such as a tokenised story, as shown in Figure 2. The memory network uses the final hidden state of a bi-directional GRU (outer squares in Figure 2) as the sentence representations to compute a context attention. At each iteration, we unify the words between the attended sentences using the same approach as in UMLP with another bi-directional GRU (inner diamonds in Figure 2) for the unifying features φ_U. Following equation 2, the new unified representation of the memory slot is computed and used to perform the next iteration. Concretely, g produces a unification tensor over the sentences and words of the invariant against the sentences of the example such that, after the context attentions of the invariant and the example are applied, we obtain the unified sentence at that iteration. Note that unlike in the UMLP case, the sentences can be of varying length. The prediction is then computed from the hidden state of the invariant after the final iteration. This setup, however, requires pre-training f such that the context attentions match the correct sentences.

A task might contain different questions such as “Where is X?” and “Why did X go to Y?”. To let the models differentiate between questions and potentially learn different invariants, we extend them with a repository of invariants and aggregate the predictions from each invariant. One simple approach, used in UMLP and UCNN, is to sum the predictions of the invariants. Another approach could be to use features from the invariants, such as memory representations in the case of UMN. For UMN, we weigh the predictions using a bilinear attention based on the hidden states of the invariants and the example at the first iteration. To initially form the repository of invariants, we use the bag-of-words representations of the questions and find the most dissimilar ones based on their cosine similarity as a heuristic to obtain varied examples.
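The repository-seeding heuristic, greedily picking questions that are mutually dissimilar under bag-of-words cosine similarity, can be sketched as follows (function names and the greedy strategy details are our own assumptions):

```python
import numpy as np

def bow(question, vocab):
    """Bag-of-words count vector for a whitespace-tokenised question."""
    v = np.zeros(len(vocab))
    for tok in question.split():
        v[vocab[tok]] += 1
    return v

def pick_dissimilar(questions, k):
    """Greedily pick k questions, each minimising its maximum cosine
    similarity to the ones already chosen -- a sketch of the heuristic
    used to seed the invariant repository with varied examples."""
    vocab = {t: i for i, t in enumerate(
        sorted({w for q in questions for w in q.split()}))}
    vecs = [bow(q, vocab) for q in questions]
    chosen = [0]  # start from the first question
    while len(chosen) < k:
        def max_sim(i):
            return max(vecs[i] @ vecs[j]
                       / (np.linalg.norm(vecs[i]) * np.linalg.norm(vecs[j]))
                       for j in chosen)
        rest = [i for i in range(len(questions)) if i not in chosen]
        chosen.append(min(rest, key=max_sim))  # most dissimilar to chosen set
    return [questions[i] for i in chosen]

picked = pick_dissimilar(["where is mary", "where is john",
                          "why did mary go", "what colour is brian"], k=2)
```

With k=2 the near-duplicate "where is john" is never picked, since it shares most tokens with the first question.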

4 Datasets

We use 4 datasets consisting of context, query and answer triples: fixed length sequences of symbols, shapes of symbols in a grid, story based natural language reasoning with the bAbI (Weston et al., 2015) dataset and logical reasoning represented as logic programs; examples are shown in Table 1 with further samples in Appendix B. In each case we use an appropriate model: UMLP for fixed length sequences, UCNN for grids and UMN for iterative reasoning. We use synthetic datasets of which the data generating distributions are known to evaluate not only the quantitative performance but also the quality of the invariants learned by our approach.

Dataset    Context                           Query       Answer   Training Size
Sequence   8384                              duplicate   8        1k, 50
Grid       0 0 3                             corner      7        1k, 50
           0 1 6
           8 5 7
bAbI       Mary went to the kitchen.         Where is    kitchen  1k, 50
           Sandra journeyed to the garden.   Mary?
Logic      p(X) ← q(X). p(a).                            True     1k, 10k, 50
Table 1: Sample context, query and answer triples and their training sizes per task. For the distribution of the number of generated examples per task on the Sequence and Grid data, refer to Appendix B.

Fixed Length Sequences We generate fixed length sequences with 8 unique symbols represented as digits to predict (i) a constant, (ii) the head of the sequence, (iii) the tail and (iv) the duplicate symbol. We randomly generate 1000 triples and then only take the unique ones to ensure the test split contains unseen examples. The training is then performed over a 5-fold cross-validation.
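A hypothetical generator for such triples (the text does not give the generation code; the length defaults to 4 to match the Table 1 sample, and all names are ours) might look like:

```python
import random

SYMBOLS = list(range(1, 9))  # 8 unique symbols represented as digits

def gen_sequence_task(task, length=4, seed=None):
    """Generate one (context, query, answer) triple for the sequence tasks:
    predict the head, the tail or the duplicated symbol. Illustrative only."""
    rnd = random.Random(seed)
    if task == "duplicate":
        seq = rnd.sample(SYMBOLS, length - 1)   # distinct symbols
        dup = rnd.choice(seq)
        seq.insert(rnd.randrange(length), dup)  # exactly one repeated symbol
        return seq, task, dup
    seq = rnd.sample(SYMBOLS, length)           # all symbols distinct
    answer = seq[0] if task == "head" else seq[-1]
    return seq, task, answer

ctx, query, answer = gen_sequence_task("duplicate", seed=0)
assert ctx.count(answer) == 2  # the answer symbol appears exactly twice
```

Deduplicating generated triples, as the text describes, then amounts to keeping only unique `(tuple(ctx), query, answer)` keys.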

Grid To spatially organise symbols, we generate a grid with 8 unique symbols organised into a box of identical symbols, a vertical, diagonal or horizontal sequence of length 3, a cross or a plus shape, and a triangle. In each task we predict (i) the identical symbol, (ii) the head of the sequence, (iii) the centre of the cross or plus and (iv) the corner of the triangle respectively. We follow the same procedure as for sequences and randomly generate 1000 triples, discarding duplicates.

bAbI The bAbI dataset has become a standard benchmark for evaluating memory based networks. It consists of 20 synthetically generated natural language reasoning tasks (refer to Weston et al. (2015) for task details). We take the 1k English set and use 0.1 of the training set as validation. Each token is lower cased and considered a unique symbol. Following previous works (Seo et al., 2016; Sukhbaatar et al., 2015), we also take multiple word answers to be single unique symbols.

Logical Reasoning To demonstrate the flexibility of our approach and distinguish our notion of a variable from that used in logic-based formalisms, we generate logical reasoning tasks in the form of logic programs using the procedure by Cingillioglu and Russo (2018). The tasks involve learning over 12 classes of logic programs exhibiting varying paradigms of logical reasoning including negation by failure (Clark, 1978). We generate 1k and 10k logic programs per task for training with 0.1 as validation and another 1k for testing. We set the arity of literals to 1 or 2, using one random lower case character from the English alphabet for predicates and constants, e.g. p(a), and an upper case character for logical variables, e.g. X.

5 Experiments

We probe three aspects of soft unification: the impact of unification on performance over unseen data, the effect of multiple invariants and data efficiency. To that end, we train UMLP and UCNN with and without unification, and UMN with pre-training using 1 or 3 invariants over either the entire training set or only 50 examples. Every model is trained 3 times via back-propagation using Adam (Kingma and Ba, 2014) on an Intel Core i7-6700 CPU using the following objective function:

L = α L_nll(f(K), y) + β L_nll(f(g(I, K)), y) + λ Σ_s ψ(s)    (5)

where L_nll is the negative log-likelihood, with sparsity regularisation over ψ at weight λ to discourage the models from utilising a spurious number of variables. For UMLP and UCNN, we set α = 0 and β = 1 for training just the unified output, and the converse for the non-unifying versions. To pre-train the UMN, we start with β = 0 for 40 epochs then set β = 1 to jointly train the unified output. For iterative tasks, the mean squared error between hidden states at each iteration and, in the strongly supervised cases, the negative log-likelihood for the context attentions are also added to the objective function. Further details such as batch size and total number of epochs are available in Appendix C.
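An objective of this shape, two weighted negative log-likelihood terms plus an L1 sparsity penalty on ψ, can be sketched as follows; the weight names `alpha`, `beta` and `lam` and their default values are illustrative placeholders, not the paper's reported settings:

```python
import numpy as np

def nll(probs, target):
    """Negative log-likelihood of the target class."""
    return -np.log(probs[target])

def objective(p_unified, p_plain, target, psi, alpha=0.0, beta=1.0, lam=0.01):
    """Weighted NLL of the unified and plain predictions plus an L1
    sparsity penalty on the variableness psi (in the spirit of eq. 5)."""
    return (alpha * nll(p_plain, target)
            + beta * nll(p_unified, target)
            + lam * np.abs(psi).sum())

# Example: only the unified output is trained (alpha = 0).
loss = objective(p_unified=np.array([0.1, 0.9]),
                 p_plain=np.array([0.5, 0.5]),
                 target=1,
                 psi=np.array([0.2, 0.0, 0.7]))
assert loss > 0
```

The sparsity term pushes ψ towards 0 for symbols whose variability does not help the prediction, which is what later makes thresholding ψ meaningful.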

Figure 3: Test accuracy over iterations for Unification MLP and Unification CNN models with 1 invariant versus no unification. We observe that with soft unification the models achieve higher accuracy with fewer iterations than their plain counterparts for both per-task training sizes.

Figure 3 portrays how soft unification generalises better to unseen examples in the test sets (the same sequence or grid never appears in both the training and test sets, as outlined in Section 4) compared to plain models. Despite having more trainable parameters than f alone, the models with unification not only maintain higher accuracy at each iteration, solving the tasks in as few as 250 iterations with 1k training examples, but also improve accuracy when trained with only 50 examples per task. We believe soft unification architecturally biases the models towards learning structural patterns, which in turn achieves better results on recognising common patterns of symbols across examples. Results with multiple invariants are identical and the models seem to ignore the extra invariants, due to the fact that the tasks can be solved with a single invariant and the regularisation encourages zeroing out unnecessary invariants; further results are in Appendix D. The fluctuations in accuracy around iterations 750 to 1000 in UCNN are also caused by penalising ψ, which forces the model to relearn the task with fewer variables half way through training.

Training Size    1k                         50      1k
Supervision      Weak         Strong                Weak                          Strong
# Invs / Model   1     3      1     3       3       N2N   GN2N   EntNet   QRN    MemNN
Mean             19.1  20.5   6.3   6.0     27.6    13.9  12.7   29.6     11.3   6.7
#                8     10     4     4       17      11    10     15       5      4
Table 2: Aggregate error rates (%) on bAbI 1k for UMN and N2N, GN2N, MemNN by Sukhbaatar et al. (2015), Liu and Perez (2017) and Weston et al. (2014) respectively. Full comparisons on individual tasks are available in Appendix D.

Following Tables 2 and 3, we observe a trend of better performance with strong supervision, more data per task and using only 1 invariant. We believe strong supervision aids with selecting the correct sentences to unify, and in a weak setting the model attempts to unify arbitrary context sentences, often failing to follow the iterative reasoning chain. The increase in performance with more data and strong supervision is consistent with previous work, reflecting how f ∘ g can be bounded by the efficacy of f modelled as a memory network. As a result, only in the supervised case do we observe a minor improvement over MemNN by 0.7 in Table 2 and no improvement in the weak case over DMN or IMA in Table 3, with the model failing 17/20 and 12/12 tasks when trained using only 50 examples. The increase in error rate with 3 invariants, we speculate, stems from having more parameters and more pathways in the model, rendering training more difficult.

Training Size    1k                          10k                       50           20k
Supervision      Weak         Strong         Weak        Strong        Weak
Arity            1     2      1     2        1     2     1     2       2     2     2     2
# Invs / Model   1     3      1     3        1     3     1     1       3     3     DMN   IMA
Mean             36.4  39.3   14.3  28.9     21.5  31.8  2.4   12.2    16.0  47.1  21.2  9.1
#                9     11     7     11       7     10    1     5       9     12    11    5
Table 3: Aggregate task error rates (%) on the logical reasoning dataset for UMN and DMN, IMA by Cingillioglu and Russo (2018). Strong supervision, more data and only 1 invariant seem to improve the performance of UMN over plain iterative models. Individual task results are in Appendix D.

6 Analysis

After training, we can extract the learned invariants by applying a threshold on ψ(s) indicating whether a symbol is used as a variable or not, with a different threshold for bAbI than for the other datasets. The magnitude of this threshold seems to depend on the amount of regularisation λ in equation 5 and on the number of training steps along with the batch size, all controlling how much ψ is pushed towards 0. Sample invariants shown in Figure 4 describe the common patterns present in the tasks, with parts that contribute towards the final answer becoming variables. Extra symbols such as is or travelled do not emerge as variables, as shown in Figure 3(a); we attribute this behaviour to the fact that changing the token travelled to went does not influence the prediction, but changing the action, the value of Z:left, to ‘picked’ does. However, based on random initialisation, our approach can convert a random symbol into a variable and let f compensate for the unifications it produces. For example, the invariant “X:8 5 2 2” could predict the tail of another example by unifying the head with the tail using the unifying features, equation 3, of those symbols. Pre-training as done in UMN seems to produce more robust and consistent invariants compared to immediately training f ∘ g since, we speculate, joint training through equation 4 might encourage such degenerate unifications.
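Extracting a readable invariant by thresholding ψ can be sketched as follows; the function name, the variable naming scheme and the threshold value are illustrative:

```python
def extract_invariant(tokens, psi, names="XYZABC", threshold=0.5):
    """Render a learned invariant: tokens whose variableness psi exceeds
    the threshold become variables written as Name:default, matching the
    X:bernhard notation; repeated tokens reuse the same variable name."""
    out, assigned = [], {}
    for tok, p in zip(tokens, psi):
        if p > threshold:
            if tok not in assigned:
                assigned[tok] = names[len(assigned)]
            out.append(f"{assigned[tok]}:{tok}")
        else:
            out.append(tok)  # below threshold: kept as a plain symbol
    return " ".join(out)

inv = extract_invariant(["bernhard", "is", "a", "frog"],
                        [0.9, 0.1, 0.05, 0.8])
# → "X:bernhard is a Y:frog"
```

Only the rendering is post-hoc; the ψ values themselves come from training under the sparsity penalty.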

Y:john Z:left the X:football
Y:john travelled to the A:office
where is the X:football
(a) bAbI task 2, two supporting facts. The model also learns Z:left since people can also drop or pick up objects potentially affecting the answer.
this Z:morning X:bill went to the Y:school
yesterday X:bill journeyed to the A:park
where was X:bill before the Y:school
(b) bAbI task 14, time reasoning. X:bill and Y:school are recognised as variables alongside Z:morning capturing when someone went which is crucial to this task.
5 8 6 4 const 2
X:8 3 3 1 head X:8
8 3 1 Y:5 tail Y:5
Z:1 4 3 Z:1 dup Z:1
(c) Successful invariants learned with UMLP using 50 training examples only.
0 X X
0 X X
0 0 0
0 1 0
6 Y 8
0 7 0
0 0 1
0 5 4
7 8 X
box centre corner
(d) Successful invariants learned with UCNN. Variable default symbols are omitted for clarity.
X:i(T) ← Z:l(T),
Z:l(U) ← R:x(U),
R:x(K) ← S:n(K),
S:n(Y:o) X:i(Y:o)
(e) Logical reasoning task 5 with arity 1. The model captures how S:n could entail X:i in a chain.
Figure 4: Invariants learned across the four datasets using the three architectures. For iterative reasoning datasets, bAbI and logical reasoning, they are taken from strongly supervised UMN.
(a) bAbI task 16. A one-to-one mapping is created between variables X:bernhard with brian and Y:frog with lion.

(b) bAbI task 6. Only Y:bathroom is recognised as a variable, creating a one-to-many binding to capture the same information.
(c) bAbI task 2. When a variable is unused in the next iteration, e.g. Z:football, it unifies with random tokens often biased by position.
(d) Logical reasoning task 1. A one-to-one alignment is created between predicates and constants.
(e) Logical reasoning task 3. Arity 1 atom forced to bind with arity 2 creates a one-to-many mapping.
(f) Logical reasoning task 1. An arity 2 atom forced to bind with arity 1 creates a many-to-one mapping.
Figure 5: Variable assignment maps from equation 3. Darker cells indicate higher attention values.

Interpretability versus Ability A desired property of interpretable models is transparency (Lipton, 2018). A novel outcome of the learned invariants in our approach is that they provide an approximation of the underlying general principle present in the data, such as the structure of multi-hop reasoning shown in Figure 3(e). However, certain aspects such as temporal reasoning are still hidden inside f. In Figure 3(b), although we observe Z:morning as a variable, the overall learned invariant captures nothing about how changing the value of Z:morning alters the behaviour of f. The model might look before or after a certain time point X:bill went somewhere depending on what Z:morning binds to. Without the regularising term on ψ, we initially noticed the models using what one might call extra symbols as variables and binding them to the same value, occasionally producing unifications such as “bathroom bathroom to the bathroom” and still predicting, perhaps unsurprisingly, the correct answer as bathroom. Hence, regularising ψ with the correct amount λ in equation 5 seems critical in extracting not just any invariant but one that represents the common structure.

Soft unification from equation 3 reveals three main patterns: one-to-one, one-to-many and many-to-one bindings, as shown in Figure 5. Figures 4(a) and 4(d) capture what one might expect unification to look like, where variables unify with their corresponding counterparts. However, occasionally the model can optimise to use fewer variables and squeeze the required information into a single variable, for example by binding Y:bathroom to john and kitchen as in Figure 4(b). We believe this occurs due to the sparsity constraint on ψ encouraging the model to be as conservative as possible. In a similar fashion, the unification can bind a single variable Y:o to both other constants as in Figure 4(e). Finally, if there are more variables than needed as in Figure 4(f), we observe a many-to-one binding with Y:w and Z:e mapping to the same constant. This behaviour raises the question of how the model differentiates between Y:w and Z:e. We speculate the model uses differences in magnitude to encode the distinction despite both variables unifying with the same constant.

7 Related Work

Learning an underlying general principle in the form of an invariant is often the means for arguing generalisation in neural networks. For example, Neural Turing Machines (Graves et al., 2014) are tested on previously unseen sequences to support the view that the model might have captured the underlying pattern or algorithm. In fact, Weston et al. (2014) claim “MemNNs can discover simple linguistic patterns based on verbal forms such as (X, dropped, Y), (X, took, Y) or (X, journeyed to, Y) and can successfully generalise the meaning of their instantiations.” However, this claim is based on the model's output, and unfortunately it is unknown whether the model has truly learned such a representation or indeed is utilising it. Our approach sheds light on this ambiguity and presents these linguistic patterns explicitly as invariants, ensuring their utility through f rather than solely analysing the output on previously unseen symbols. Although we associate these invariants with our existing understanding of the task, we may mistakenly anthropomorphise the machine, for example by thinking it has learned X:mary as someone; it is important to acknowledge that these are just symbolic patterns. In these cases, our interpretations do not necessarily correspond to any understanding of the machine, relating to the Chinese room argument made by Searle (1980).

Learning invariants by lifting ground examples is related to least common generalisation (Reynolds, 1970) by which inductive inference is performed on facts (Shapiro, 1981), such as generalising went(mary,kitchen) and went(john,garden) to went(X,Y). Unlike in a predicate logic setting, our approach allows for soft alignment and therefore generalisation between varying length sequences. Existing neuro-symbolic systems (Broda et al., 2002) focus on inducing rules that adhere to given logical semantics of what a variable and a rule are. For example, ∂ILP (Evans and Grefenstette, 2018) constructs a network by rigidly following the given semantics of first-order logic. Similarly, Lifted Relational Neural Networks (Sourek et al., 2015) ground first-order logic rules into a neural network, while Neural Theorem Provers (Rocktäschel and Riedel, 2017) build neural networks using backward-chaining (Russell and Norvig, 2018) on a given background knowledge base with templates. This architectural approach for combining logical variables is also observed with TensorLog (Cohen, 2016) and Logic Tensor Networks (Serafini and d’Avila Garcez, 2016), while grounding logical rules can also be used as regularisation (Hu et al., 2016). However, in these systems the notion of a variable is pre-engineered rather than learned, with a focus on presenting a practical approach to solving certain problems, whereas our motivation stems from a cognitive perspective.

At first it may seem the learned invariants, Section 6, make the model more interpretable; however, this transparency is not of the model but of the data. The invariant captures patterns that potentially approximate the data generating distribution, but we still do not know how the model uses them upstream. Thus, from the perspective of explainable artificial intelligence (XAI) (Adadi and Berrada, 2018), learning invariants or interpreting them does not constitute an explanation of the reasoning model, even though “if someone goes somewhere then they are there” might look like one. Instead, it can be perceived as causal attribution (Miller, 2019) in which someone being somewhere is attributed to them going there. This perspective also relates to gradient based model explanation methods such as Layer-Wise Relevance Propagation (Bach et al., 2015) and Grad-CAM (Selvaraju et al., 2017; Chattopadhay et al., 2018). Consequently, a possible view on ψ, equation 2, is as a gradient based usefulness measure such that a symbol utilised upstream by f to determine the answer becomes a variable, similar to how a group of pixels in an image contribute more to its classification.

Finally, one can argue that our model maintains a form of counterfactual thinking (Roese, 1997) in which soft unification creates counterfactuals on the invariant example to alter the output of f towards the desired answer, equation 4. Asking where Mary would have been had she gone to the garden instead of the kitchen is the process by which an invariant is learned through multiple examples during training. This view relates to methods of causal inference (Pearl, 2019; Holland, 1986) in which counterfactuals are vital, as demonstrated in structured models by Pearl (1999).

8 Conclusion

We presented a new approach for learning variables and lifting examples into invariants through the usage of soft unification. Evaluating on four datasets, we analysed how Unification Networks perform comparably to existing similar architectures while having the benefit of lifting examples into invariants that capture underlying patterns present in the tasks. Since our approach is end-to-end differentiable, we plan to apply this technique to multi-modal tasks in order to yield multi-modal invariants, for example in visual question answering.


Acknowledgements

We would like to thank Murray Shanahan for his helpful comments, critical feedback and insights regarding this work.


References

  • A. Adadi and M. Berrada (2018) Peeking inside the black-box: a survey on explainable artificial intelligence (XAI). IEEE Access 6, pp. 52138–52160. External Links: Document Cited by: §7.
  • S. Bach, A. Binder, G. Montavon, F. Klauschen, K. Müller, and W. Samek (2015) On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLOS ONE 10 (7), pp. e0130140. External Links: Document Cited by: §7.
  • D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. External Links: http://arxiv.org/abs/1409.0473v7 Cited by: §1.
  • K. B. Broda, A. S. D. Garcez, and D. M. Gabbay (2002) Neural-symbolic learning systems. Springer London. External Links: ISBN 1852335122 Cited by: §7.
  • A. Chattopadhay, A. Sarkar, P. Howlader, and V. N. Balasubramanian (2018) Grad-CAM++: generalized gradient-based visual explanations for deep convolutional networks. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 839–847. External Links: Document Cited by: §7.
  • S. Chaudhari, G. Polatkan, R. Ramanath, and V. Mithal (2019) An attentive survey of attention models. External Links: http://arxiv.org/abs/1904.02874v1 Cited by: §1.
  • K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio (2014) On the properties of neural machine translation: encoder–decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, External Links: Document Cited by: §3.
  • N. Cingillioglu and A. Russo (2018) DeepLogic: towards end-to-end differentiable logical reasoning. External Links: http://arxiv.org/abs/1805.07433v3 Cited by: §4, Table 3.
  • K. L. Clark (1978) Negation as failure. In Logic and Data Bases, pp. 293–322. External Links: Document Cited by: §4.
  • W. W. Cohen (2016) TensorLog: a differentiable deductive database. arXiv:1605.06523. External Links: http://arxiv.org/abs/1605.06523v2 Cited by: §7.
  • R. Evans and E. Grefenstette (2018) Learning explanatory rules from noisy data. Journal of Artificial Intelligence Research 61, pp. 1–64. External Links: Document, http://arxiv.org/abs/1711.04574v2 Cited by: §7.
  • G. Frege (1948) Sense and reference. The Philosophical Review 57 (3), pp. 209–230. External Links: Document Cited by: §2.
  • J. L. Frost, P. Brown, J. A. Sutterby, and C. D. Thornton (2004) The developmental benefits of playgrounds. Association for Childhood Education International. External Links: ISBN 0871731649 Cited by: §1.
  • A. Graves, G. Wayne, and I. Danihelka (2014) Neural turing machines. External Links: http://arxiv.org/abs/1410.5401v2 Cited by: §7.
  • A. Graves, G. Wayne, M. Reynolds, T. Harley, I. Danihelka, A. Grabska-Barwińska, S. G. Colmenarejo, E. Grefenstette, T. Ramalho, J. Agapiou, A. P. Badia, K. M. Hermann, Y. Zwols, G. Ostrovski, A. Cain, H. King, C. Summerfield, P. Blunsom, K. Kavukcuoglu, and D. Hassabis (2016) Hybrid computing using a neural network with dynamic external memory. Nature 538 (7626), pp. 471–476. External Links: Document Cited by: Table 7.
  • M. Henaff, J. Weston, A. Szlam, A. Bordes, and Y. LeCun (2016) Tracking the world state with recurrent entity networks. ICLR. External Links: http://arxiv.org/abs/1612.03969v3 Cited by: Table 7.
  • P. W. Holland (1986) Statistics and causal inference. Journal of the American Statistical Association 81 (396), pp. 945–960. External Links: Document Cited by: §7.
  • Z. Hu, X. Ma, Z. Liu, E. Hovy, and E. Xing (2016) Harnessing deep neural networks with logic rules. ACL. External Links: Document, http://arxiv.org/abs/1603.06318v5 Cited by: §7.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. ICLR. External Links: http://arxiv.org/abs/1412.6980v9 Cited by: §5.
  • A. Kumar, O. Irsoy, P. Ondruska, M. Iyyer, J. Bradbury, I. Gulrajani, V. Zhong, R. Paulus, and R. Socher (2015) Ask me anything: dynamic memory networks for natural language processing. In ICML, pp. 1378–1387. External Links: http://arxiv.org/abs/1506.07285v5 Cited by: Table 7.
  • R. L. Lewis (1999) Cognitive modeling, symbolic. pp. 525–527. Cited by: §1.
  • Z. C. Lipton (2018) The mythos of model interpretability. Communications of the ACM 61 (10), pp. 36–43. External Links: Document, http://arxiv.org/abs/1606.03490v3 Cited by: §6.
  • F. Liu and J. Perez (2017) Gated end-to-end memory networks. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pp. 1–10. External Links: Document, http://arxiv.org/abs/1610.04211v2 Cited by: Table 7, Table 2.
  • T. Luong, H. Pham, and C. D. Manning (2015) Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, External Links: Document, http://arxiv.org/abs/1508.04025v5 Cited by: §1.
  • T. Miller (2019) Explanation in artificial intelligence: insights from the social sciences. Artificial Intelligence 267, pp. 1–38. External Links: Document, http://arxiv.org/abs/1706.07269v3 Cited by: §7.
  • A. Morton and J. A. Fodor (1978) The language of thought.. Vol. 75, Philosophy Documentation Center. External Links: Document Cited by: §1.
  • S. Muggleton and L. de Raedt (1994) Inductive logic programming: theory and methods. The Journal of Logic Programming 19-20, pp. 629–679. External Links: Document Cited by: §2.
  • J. Pearl (1999) Probabilities of causation: three counterfactual interpretations and their identification. Synthese 121 (1-2), pp. 93–149. Cited by: §7.
  • J. Pearl (2019) The seven tools of causal inference with reflections on machine learning. Communications of the ACM 62 (3), pp. 54–60. External Links: Document Cited by: §7.
  • J. Piaget (2001) The psychology of intelligence. Routledge. External Links: ISBN 0415254019 Cited by: §1.
  • J. C. Reynolds (1970) Transformational systems and algebraic structure of atomic formulas. Machine intelligence 5, pp. 135–151. Cited by: §7.
  • T. Rocktäschel and S. Riedel (2017) End-to-end differentiable proving. In NIPS, pp. 3791–3803. External Links: http://arxiv.org/abs/1705.11040v2 Cited by: §7.
  • N. J. Roese (1997) Counterfactual thinking.. Psychological Bulletin 121 (1), pp. 133–148. External Links: Document Cited by: §7.
  • S. Russell and P. Norvig (2018) Artificial intelligence: a modern approach (3rd edition). Pearson. External Links: ISBN 1292153962 Cited by: §1, §7.
  • J. R. Searle (1980) Minds, brains, and programs. Behavioral and Brain Sciences 3 (3), pp. 417–424. External Links: Document Cited by: §7.
  • R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2017) Grad-CAM: visual explanations from deep networks via gradient-based localization. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 618–626. External Links: Document Cited by: §7.
  • M. Seo, S. Min, A. Farhadi, and H. Hajishirzi (2016) Query-reduction networks for question answering. ICLR. External Links: http://arxiv.org/abs/1606.04582v6 Cited by: Table 7, §4.
  • L. Serafini and A. d’Avila Garcez (2016) Logic tensor networks: deep learning and logical reasoning from data and knowledge. arXiv:1606.04422. External Links: http://arxiv.org/abs/1606.04422v2 Cited by: §7.
  • E. Y. Shapiro (1981) Inductive inference of theories from facts. Yale University, Department of Computer Science. Cited by: §7.
  • G. Sourek, V. Aschenbrenner, F. Zelezny, and O. Kuzelka (2015) Lifted relational neural networks. arXiv:1508.05128. External Links: http://arxiv.org/abs/1508.05128v2 Cited by: §7.
  • S. Sukhbaatar, A. Szlam, J. Weston, and R. Fergus (2015) End-to-end memory networks. In NIPS, pp. 2440–2448. External Links: 1503.08895 Cited by: Table 7, §4, Table 2.
  • S. Tokui, K. Oono, S. Hido, and J. Clayton (2015) Chainer: a next-generation open source framework for deep learning. In Proceedings of Workshop on Machine Learning Systems (LearningSys) in The Twenty-ninth Annual Conference on Neural Information Processing Systems (NIPS), External Links: Link Cited by: §1.
  • J. M. Unger and T. W. Deacon (1998) The symbolic species: the co-evolution of language and the brain. Vol. 82, Wiley. External Links: Document Cited by: §1.
  • J. Weston, A. Bordes, S. Chopra, A. M. Rush, B. van Merriënboer, A. Joulin, and T. Mikolov (2015) Towards ai-complete question answering: a set of prerequisite toy tasks. External Links: http://arxiv.org/abs/1502.05698v10 Cited by: Table 7, §1, §4, §4.
  • J. Weston, S. Chopra, and A. Bordes (2014) Memory networks. External Links: http://arxiv.org/abs/1410.3916v11 Cited by: Table 7, §3, §3, Table 2, §7.
  • C. Xiong, S. Merity, and R. Socher (2016) Dynamic memory networks for visual and textual question answering. pp. 2397–2406. External Links: http://arxiv.org/abs/1603.01417v1 Cited by: Table 7.

Appendix A Model Details

a.1 Unification MLP & CNN

Unification MLP (UMLP) To model the prediction function as a multi-layer perceptron, we take the symbol embeddings of a sequence and flatten them into a single input vector. The MLP consists of 2 hidden layers with non-linearity. To process the query, we concatenate the one-hot encoding of the task id, yielding the final input. For the unification features, we use a bi-directional GRU; the hidden state at each symbol is passed through a linear transformation to give that symbol's feature. The variable assignment is then computed as an attention over the symbols according to equation 3.
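The soft variable assignment above can be sketched in a few lines of numpy. This is our own illustration with hypothetical names (`soft_unify`), and the biGRU features are replaced by arbitrary precomputed vectors; only the attention-based binding of equation 3 is shown:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def soft_unify(invariant_feats, example_embs):
    """Soft unification: each invariant symbol attends over the example
    symbols, and its variable binding is the attention-weighted mixture
    of the example embeddings.

    invariant_feats: (n, d) unification features of the invariant symbols
                     (in the paper these come from a biGRU followed by a
                     linear transformation; here they are given directly).
    example_embs:    (m, d) embeddings of the example symbols.
    Returns (n, d) soft bindings, one per invariant symbol.
    """
    attn = softmax(invariant_feats @ example_embs.T, axis=-1)  # (n, m)
    return attn @ example_embs

rng = np.random.default_rng(0)
inv = rng.normal(size=(4, 8))   # 4 invariant symbols, 8-dim features
ex = rng.normal(size=(6, 8))    # 6 example symbols
bound = soft_unify(inv, ex)
assert bound.shape == (4, 8)
```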

Unification CNN (UCNN) We take symbol embeddings to obtain an input grid. Similar to UMLP, for each symbol we append the task id as a one-hot vector. The prediction function then consists of 2 convolutional layers with a kernel size of 3 and stride 1, with non-linearity in between the layers. We pad the grid with 2 columns and 2 rows such that the output of the convolutions again yields a hidden output of the same shape. As the final hidden output, we take a global max pool over the grid. The unification function is modelled identically but without the max pooling, so that each position of the convolutional hidden output provides the unification feature of the corresponding symbol.
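The shared conv-then-pool structure can be sketched as follows. This is a simplified single-channel, single-layer illustration with hypothetical names (`conv2d_same`, `ucnn_features`), not the paper's 2-layer implementation: a shape-preserving convolution gives per-cell unification features, and a global max pool over the same hidden output gives the prediction feature.

```python
import numpy as np

def conv2d_same(grid, kernel):
    """Naive 2D convolution with zero padding that preserves shape."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(grid, ((ph, ph), (pw, pw)))
    out = np.zeros_like(grid, dtype=float)
    for i in range(grid.shape[0]):
        for j in range(grid.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

def ucnn_features(grid, kernel):
    """Per-cell unification features and a globally pooled prediction
    feature, both derived from the same convolutional hidden output."""
    hidden = np.maximum(conv2d_same(grid, kernel), 0.0)  # ReLU non-linearity
    return hidden, hidden.max()  # (per-symbol features, pooled feature)

g = np.arange(9.0).reshape(3, 3)
k = np.zeros((3, 3)); k[1, 1] = 1.0  # identity kernel for the sketch
hidden, pooled = ucnn_features(g, k)
assert np.allclose(hidden, g)  # identity kernel preserves the grid
assert pooled == 8.0           # global max pool picks the largest cell
```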

a.2 Unification Memory Networks

Unlike the previous architectures, with UMN we interleave unification into the iterative reasoning. We model the prediction function with an iterative memory network. We take the final hidden state of a bi-directional GRU to represent each sentence of the context and the query as fixed-size vectors. The initial state of the memory network is the query representation. At each iteration:


where the features are produced by another bi-directional GRU, combined through element-wise multiplication and concatenation of vectors. Taking the resulting distribution as the context attention, we obtain the next state of the memory network:


and iterate a fixed number of times set in advance. The final prediction is computed from the last memory state. All weight matrices and bias vectors are independent but are tied across iterations.
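The memory iteration above can be sketched as a single attention-and-update step. This is a simplified rendition with hypothetical names (`memory_step`); the paper's gating and biGRU attention features are reduced to a dot-product attention and one weight matrix:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def memory_step(state, context_vecs, W):
    """One iteration of an attention-based memory network.

    state:        (d,) current memory state (initially the query vector)
    context_vecs: (k, d) sentence vectors of the context
    W:            (d, 2d) weight matrix combining state and attended context
    Returns the next (d,) state.
    """
    attn = softmax(context_vecs @ state)  # context attention
    read = attn @ context_vecs            # attended context vector
    return np.tanh(W @ np.concatenate([state, read]))

d, k = 8, 5
rng = np.random.default_rng(1)
state = rng.normal(size=d)                # stand-in for the query vector
ctx = rng.normal(size=(k, d))             # stand-in sentence vectors
W = rng.normal(size=(d, 2 * d)) * 0.1     # tied across iterations
for _ in range(3):                        # iterate a fixed number of times
    state = memory_step(state, ctx, W)
assert state.shape == (d,)
```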

Appendix B Generated Dataset Samples

Dataset   Task  Context                 Query      Answer
Sequence  i     1488                    constant   2
Sequence  ii    6157                    head       6
Sequence  iii   1837                    tail       7
Sequence  iv    3563                    duplicate  3
Grid      i     0 0 0 / 0 2 2 / 8 2 2   box        2
Grid      ii    4 0 0 / 0 7 0 / 8 0 1   head       4
Grid      iii   0 6 0 / 1 7 2 / 0 3 0   centre     7
Grid      iv    8 0 0 / 5 6 0 / 2 4 1   corner     2
Table 4: Sample context, query and answer triples from the sequence and grid tasks. Grid contexts are shown row by row, separated by "/".
Task Sequences Grid
Table 5: Training sizes for randomly generated fixed-length sequence and grid tasks with 8 unique symbols. Grid task (i) is smaller because there are at most 32 combinations of boxes in a grid with 8 unique symbols.

Appendix C Training Details

c.1 Unification MLP & CNN

Both unification models are trained with 5-fold cross-validation over the generated datasets for 2000 iterations with a batch size of 64. We do not use any weight decay, and we record the training and test accuracies every 10 iterations, as presented in Figure 3.

c.2 Unification Memory Networks

We again use a batch size of 64 and pre-train for 40 epochs, then train jointly for a further 260 epochs. We train by epochs for UMN since the dataset sizes are fixed. To learn the unification alongside the prediction, we combine error signals from the unification of the invariant and the example. Following equation 4, the objective function incorporates not only the negative log-likelihood of the answer but also the mean squared error between the intermediate states at each iteration as an auxiliary loss:


We pre-train by setting the auxiliary loss weight to zero for 40 epochs and then enable it. For strong supervision we also compute the negative log-likelihood for the context attention, described in Appendix A, at each iteration using the supporting facts of the tasks. We apply a dropout of 0.1 to all recurrent neural networks used and, for the bAbI dataset only, weight decay with a coefficient of 0.001.
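The objective described above can be sketched as follows. This is our own illustration with hypothetical names (`combined_loss`): the answer's negative log-likelihood plus a mean-squared-error term between intermediate states of the invariant and example runs, with the auxiliary weight held at zero during pre-training.

```python
import numpy as np

def combined_loss(pred_probs, answer_idx, inv_states, ex_states, aux_weight):
    """Negative log-likelihood of the answer plus an auxiliary MSE
    between intermediate states at each iteration.

    pred_probs: (v,) predicted distribution over answers
    answer_idx: index of the correct answer
    inv_states, ex_states: (t, d) intermediate states of the invariant
                           and the example runs over t iterations
    aux_weight: 0.0 during the pre-training phase, then switched on
    """
    nll = -np.log(pred_probs[answer_idx] + 1e-12)
    mse = np.mean((inv_states - ex_states) ** 2)
    return nll + aux_weight * mse

probs = np.array([0.1, 0.7, 0.2])
s1 = np.zeros((3, 4))
s2 = np.ones((3, 4))
# During pre-training the auxiliary term contributes nothing:
assert combined_loss(probs, 1, s1, s2, 0.0) == combined_loss(probs, 1, s1, s1, 0.0)
# Once enabled, mismatched states are penalised:
assert combined_loss(probs, 1, s1, s2, 1.0) > combined_loss(probs, 1, s1, s1, 1.0)
```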

Appendix D Further Results

Figure 6: Results of Unification MLP and CNN with an increasing number of invariants. There is no impact on performance when more invariants per task are given. Upon closer inspection, we noticed that the models ignore the extra invariants and use only one. We speculate that the regularisation encourages the models to use a single invariant.
Supervision Weak Strong
# Invs 1 3 1 3 3
Training Size 1k 1k 1k 1k 50
1 0.0 0.0 0.0 0.0 1.4
2 65.6 63.1 0.3 0.7 30.0
3 67.1 62.6 1.0 2.4 39.8
4 0.0 0.0 0.0 0.0 37.0
5 3.4 4.0 0.8 1.1 26.5
6 0.2 0.6 0.0 0.0 18.4
7 22.0 22.8 10.7 11.3 22.8
8 10.3 8.5 7.4 7.6 24.7
9 0.1 25.7 0.0 0.0 33.8
10 0.1 2.0 0.0 0.3 32.6
11 0.0 0.0 0.0 0.0 11.9
12 0.0 0.1 0.0 0.0 21.3
13 2.1 3.7 0.0 0.1 5.8
14 19.7 13.5 0.5 0.1 54.8
15 0.0 0.7 0.0 0.0 0.0
16 55.2 56.2 0.0 0.0 39.7
17 39.2 49.0 51.1 49.3 48.8
18 4.4 8.0 0.6 0.5 10.4
19 91.8 89.6 53.9 46.7 90.2
20 0.0 0.0 0.0 0.0 2.7
Mean 19.1 20.5 6.3 6.0 27.6
Std 27.9 27.0 15.6 14.3 21.0
# 8 10 4 4 17
Table 6: Individual task error rates on bAbI tasks for Unification Memory Networks.
Support Weak Strong
Size 1k 10k 1k 10k
1 0.0 0.0 0.7 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 8.3 8.1 56.4 0.5 65.6 0.3 0.4 0.0 0.7 1.8
3 40.3 38.8 69.7 1.2 67.1 1.1 1.8 0.0 2.4 4.8
4 2.8 0.4 1.4 0.7 0.0 0.0 0.0 0.0 0.0 0.0
5 13.1 1.0 4.6 1.2 3.4 0.5 0.8 2.0 1.1 0.7
6 7.6 8.4 30.0 1.2 0.2 0.0 0.0 0.0 0.0 0.0
7 17.3 17.8 22.3 9.4 22.0 2.4 0.6 15.0 11.3 3.1
8 10.0 12.5 19.2 3.7 10.3 0.0 0.3 9.0 7.6 3.5
9 13.2 10.7 31.5 0.0 0.1 0.0 0.2 0.0 0.0 0.0
10 15.1 16.5 15.6 0.0 0.1 0.0 0.2 2.0 0.3 2.5
11 0.9 0.0 8.0 0.0 0.0 0.0 0.0 0.0 0.0 0.1
12 0.2 0.0 0.8 0.0 0.0 0.0 0.0 0.0 0.0 0.0
13 0.4 0.0 9.0 0.3 2.1 0.0 0.1 0.0 0.1 0.2
14 1.7 1.2 62.9 3.8 19.7 0.2 0.4 1.0 0.1 0.0
15 0.0 0.0 57.8 0.0 0.0 0.0 0.0 0.0 0.0 0.0
16 1.3 0.1 53.2 53.4 55.2 45.3 55.1 0.0 0.0 0.6
17 51.0 41.7 46.4 51.8 39.2 4.2 12.0 35.0 49.3 40.6
18 11.1 9.2 8.8 8.8 4.4 2.1 0.8 5.0 0.5 4.7
19 82.8 88.5 90.4 90.7 91.8 0.0 3.9 64.0 46.7 65.5
20 0.0 0.0 2.6 0.3 0.0 0.0 0.0 0.0 0.0 0.0
Mean 13.9 12.7 29.6 11.3 19.1 2.8 3.8 6.7 6.0 6.4
# 11 10 15 5 8 1 2 4 4 2
Table 7: Comparison of individual task error rates (%) on the bAbI (Weston et al., 2015) dataset for the best run. Where a model has published experiments on both 1k and 10k, we preferred the 1k results for data efficiency. References from left to right: (Sukhbaatar et al., 2015) - (Liu and Perez, 2017) - (Henaff et al., 2016) - (Seo et al., 2016) - Ours - (Xiong et al., 2016) - (Graves et al., 2016) - (Weston et al., 2014) - Ours - (Kumar et al., 2015)
Size 1k 10k 50 20k
Support Weak Strong Weak Strong Weak
Arity 1 2 1 2 1 2 1 2 2 2 2 2
# Invs / Model 1 3 1 3 1 3 1 1 3 3 DMN IMA
Facts 1.2 0.9 0.0 0.4 0.0 0.0 0.0 0.0 0.0 33.5 0.0 0.0
Unification 0.0 10.3 0.0 10.8 0.0 0.0 0.0 0.0 0.0 41.3 13.0 10.0
1 Step 50.3 49.8 4.4 20.0 1.2 27.8 0.1 1.3 5.7 50.2 26.0 2.0
2 Steps 47.5 50.0 5.7 35.0 37.2 47.8 0.0 29.7 28.7 49.9 33.0 5.0
3 Steps 47.6 49.2 10.4 38.7 39.6 45.6 0.0 26.0 26.1 48.3 23.0 6.0
AND 31.3 37.4 10.7 16.4 29.8 29.0 0.2 0.4 1.2 50.0 20.0 5.0
OR 25.2 38.1 21.0 35.0 20.5 30.2 4.4 20.6 17.4 47.6 13.0 3.0
Transitivity 50.0 26.6 39.6 5.0 6.0 49.2 50.0 50.0
1 Step NBF 46.4 38.7 3.8 28.8 1.1 21.6 0.1 1.1 8.0 47.6 21.0 2.0
2 Steps NBF 48.5 48.9 7.7 39.6 30.4 48.2 0.1 33.4 28.7 50.3 15.0 4.0
AND NBF 51.0 50.1 43.1 48.6 29.4 44.2 0.1 1.3 40.1 49.5 16.0 8.0
OR NBF 51.4 48.4 50.8 47.3 47.6 47.8 21.3 27.6 30.5 47.3 25.0 14.0
Mean 36.4 39.3 14.3 28.9 21.5 31.8 2.4 12.2 16.0 47.1 21.2 9.1
Std 18.7 15.9 16.4 14.1 17.1 16.7 6.1 13.2 13.6 4.7 12.3 13.4
# 9 11 7 11 7 10 1 5 9 12 11 5
Table 8: Individual task error rates (%) on the logical reasoning dataset.
X:sandra went back to the Y:bathroom
is X:sandra in the Y:bathroom
Figure 7: bAbI task 6, yes or no questions. The invariant does not variablise the answer.
X:m ( Y:e ) X:m ( Y:e )
X:a ( Y:w , Z:e ) X:a ( Y:w , Z:e )
X:m ( T ) X:m ( c )
X:x ( A ) not Z:q ( A ) X:x ( Y:z )
Figure 8: Invariants learned on tasks 1, 2 and 11 with arity 1 and 2 from the logical reasoning dataset. Last invariant on task 11 lifts the example around the negation by failure, denoted as not, capturing its semantics.
3 X 7 Y head
7 4 X Y head
X 3 1 X tail
X X 5 Y duplicate
(a) Invariants with extra variables learned with UMLP.
X 0 0
0 1 0
0 0 Y
0 Y 0
1 X 8
0 2 0
6 4 Y
0 X 8
0 0 7
head centre corner
(b) Mismatching invariants learned with UCNN.
Figure 9: Invariants learned by UMLP and UCNN that do not match the data generating distribution. In these instances the unification still binds to the correct symbols in order to predict the desired answer; quantitatively we obtain the same results. Variable default symbols are omitted for clarity.