Learning to Reason with Third-Order Tensor Products

11/29/2018 ∙ by Imanol Schlag, et al. ∙ IDSIA 12

We combine Recurrent Neural Networks with Tensor Product Representations to learn combinatorial representations of sequential data. This improves symbolic interpretation and systematic generalisation. Our architecture is trained end-to-end through gradient descent on a variety of simple natural language reasoning tasks, significantly outperforming the latest state-of-the-art models in single-task and all-tasks settings. We also augment a subset of the data such that training and test data exhibit large systematic differences and show that our approach generalises better than the previous state-of-the-art.




1 Introduction

Certain connectionist architectures based on Recurrent Neural Networks (RNNs) Werbos:88gasmarket ; WilliamsZipser:92 ; RobinsonFallside:87tr such as the Long Short-Term Memory (LSTM) lstm97and95 ; Gers:2000nc are general computers, e.g., siegelmann91turing . LSTM-based systems achieved breakthroughs in various speech and Natural Language Processing tasks googlevoice2015 ; wu2016google ; facebook2017 . Unlike humans, however, current RNNs cannot easily extract symbolic rules from experience and apply them to novel instances in a systematic way fodor1988connectionism ; hadley1994systematicity . They are catastrophically affected by systematic fodor1988connectionism ; hadley1994systematicity differences between training and test data still_not_systematic_lake ; lake2017building ; atzmon2016learning ; phillips1995connectionism .

In particular, standard RNNs have performed poorly at natural language reasoning (NLR) babi_tasks_weston where systematic generalisation (such as rule-like extrapolation) is essential. Consider a network trained on a variety of NLR tasks involving short stories about multiple entities. One task could be about tracking entity locations ([…] Mary went to the office. […] Where is Mary?), another about tracking objects that people are holding ([…] Daniel picks up the milk. […] What is Daniel holding?). If every person is able to perform every task, this will open up a large number of possible person-task pairs. Now suppose that during training we only have stories from a small subset of all possible pairs. More specifically, let us assume Mary is never seen picking up or dropping any item. Unlike during training, we want to test on tasks such as […] Mary picks up the milk. […] What is Mary carrying?. In this case, the training and test data exhibit systematic differences. Nevertheless, a systematic model should be able to infer milk because it has adopted a rule-like, entity-independent reasoning pattern that generalises beyond the training distribution. RNNs, however, tend to fail to learn such patterns if the train and test data exhibit such differences.

Here we aim at improving systematic generalisation by learning to deconstruct natural language statements into combinatorial representations BrousseCombinatorialRepresentations . We propose a new architecture based on the Tensor Product Representation (TPR) smolensky1990tensor , a general method for embedding symbolic structures in a vector space.

Previous work already showed that TPRs allow for powerful symbolic processing with distributed representations, given certain manual assignments of the vector space embedding. However, TPRs have commonly not been trained from data through gradient descent. Here we combine gradient-based RNNs with third-order TPRs to learn combinatorial representations from natural language, training the entire system on NLR tasks via error backpropagation Linnainmaa:1970 ; Kelley:1960 ; Werbos1990BTT . We point out similarities to systems with Fast Weights von1981correlation ; feldman1982dynamic ; hinton1987deblur , in particular, end-to-end-differentiable Fast Weight systems Schmidhuber:92ncfastweights ; Schmidhuber:93ratioicann ; schlag2017gated . In experiments, we achieve state-of-the-art results on the bAbI dataset babi_tasks_weston , obtaining better systematic generalisation than other methods. We also analyse the emerging combinatorial and, to some extent, interpretable representations. The code we used to train and evaluate our models is available at github.com/ischlag/TPR-RNN.

2 Review of the Tensor Product Representation and Notation

The TPR method is a mechanism to create a vector-space embedding of symbolic structures. To illustrate, consider the relation implicit in the short sentences "Kitty the cat" and "Mary the person". In order to store this structure in a TPR of order 2, each sentence has to be decomposed into two components by choosing a so-called filler symbol $f \in F$ and a role symbol $r \in R$. A possible set of fillers and roles for a unique role/filler decomposition could be $F = \{\text{Kitty}, \text{Mary}\}$ and $R = \{\text{Cat}, \text{Person}\}$. The two relations are then described by the set of filler/role bindings $\{\text{Kitty}/\text{Cat}, \text{Mary}/\text{Person}\}$. Let $d, n, j, k$ denote positive integers. A distributed representation is then achieved by encoding each filler symbol $f$ by a filler vector $\mathbf{f}$ in a vector space $V_F$ and each role symbol $r$ by a role vector $\mathbf{r}$ in a vector space $V_R$. In this work, every vector space is over $\mathbb{R}^d$. The TPR of the symbolic structures is defined as the tensor $\mathbf{T}$ in a vector space $V_F \otimes V_R$, where $\otimes$ is the tensor product operator. In this example the tensor is of order 2, a matrix, which allows us to write the equation of our example using matrix multiplication:

$$\mathbf{T} = \mathbf{f}_{\text{Kitty}} \otimes \mathbf{r}_{\text{Cat}} + \mathbf{f}_{\text{Mary}} \otimes \mathbf{r}_{\text{Person}} = \mathbf{f}_{\text{Kitty}} \mathbf{r}_{\text{Cat}}^\top + \mathbf{f}_{\text{Mary}} \mathbf{r}_{\text{Person}}^\top$$


Here, the tensor product — or generalised outer product — acts as a variable binding operator. The final TPR representation is a superposition of all bindings via the element-wise addition.
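The binding and superposition steps can be illustrated with a few lines of NumPy. This is only a minimal sketch: the vector values and dimensionalities are made up for illustration, and the variable names (`f_kitty`, `r_cat`, and so on) are assumptions rather than anything taken from the authors' code.

```python
import numpy as np

# Hypothetical filler vectors (3-dimensional) and role vectors (2-dimensional).
f_kitty = np.array([1.0, 0.0, 2.0])
f_mary = np.array([0.0, 3.0, 1.0])
r_cat = np.array([1.0, 0.0])
r_person = np.array([0.0, 1.0])

# Binding: the outer product of a filler and a role vector.
# Superposition: element-wise addition of all bindings.
T = np.outer(f_kitty, r_cat) + np.outer(f_mary, r_person)

print(T.shape)  # order-2 tensor (a matrix) with one row per filler dimension
```

Because the two role vectors in this toy example are the standard basis vectors, each column of `T` holds exactly one filler vector.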

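In the TPR method the so-called unbinding operator consists of the tensor inner product $\bullet$, which is used to exactly reconstruct previously stored variables from $\mathbf{T}$ using an unbinding vector. Recall that the dot product of two vectors $\mathbf{f} = (f_1; f_2; \dots; f_n)$ and $\mathbf{r} = (r_1; r_2; \dots; r_n)$ is the sum of the pairwise products of the elements of $\mathbf{f}$ and $\mathbf{r}$. Equivalently, the tensor inner product $\bullet_{jk}$ can be expressed through the order-increasing tensor product followed by the sum of the pairwise products of the elements of the $j$-th and $k$-th order.

$$\mathbf{f} \bullet_{12} \mathbf{r} = \sum_{i=1}^{n} (\mathbf{f} \otimes \mathbf{r})_{ii} = \sum_{i=1}^{n} f_i r_i = \mathbf{f} \cdot \mathbf{r}$$

Given now the unbinding vector $\mathbf{u}_{\text{Cat}}$, we can then retrieve the stored filler $\mathbf{f}_{\text{Kitty}}$. In the simplest case, if the role vectors are orthonormal, the unbinding vector $\mathbf{u}_{\text{Cat}}$ equals $\mathbf{r}_{\text{Cat}}$. Again, for a TPR of order 2 the unbinding operation can also be expressed using matrix multiplication.

$$\mathbf{T} \bullet_{23} \mathbf{u}_{\text{Cat}} = \mathbf{T} \mathbf{u}_{\text{Cat}} = \mathbf{f}_{\text{Kitty}}$$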


Note how the dot product and matrix multiplication are special cases of the tensor inner product. We will later use the tensor inner product which can be used with a tensor of order 3 (a cube) and a tensor of order 1 (a vector) such that they result in a tensor of order 2 (a matrix). Other aspects of the TPR method are not essential for this paper. For further details, we refer to Smolensky’s work smolensky1990tensor ; smolensky2012symbolic ; basic_reasoningTPR_smolensky .
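A small NumPy sketch may help to make the unbinding operation concrete. It assumes orthonormal role vectors (obtained here from a QR decomposition of a random matrix) so that the role vector itself serves as the unbinding vector. All names and sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two random filler vectors and two orthonormal role vectors.
f_kitty, f_mary = rng.normal(size=(2, 4))
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # columns of Q are orthonormal
r_cat, r_person = Q[:, 0], Q[:, 1]

T = np.outer(f_kitty, r_cat) + np.outer(f_mary, r_person)

# Unbinding: contract the role order of T with the unbinding vector.
# For an order-2 TPR this is the same as the matrix-vector product T @ r_cat.
retrieved = np.tensordot(T, r_cat, axes=([1], [0]))

# The same kind of contraction applied to an order-3 tensor and a vector
# yields an order-2 tensor (a matrix).
cube = rng.normal(size=(4, 3, 4))
matrix = np.tensordot(cube, rng.normal(size=4), axes=([0], [0]))
```

With orthonormal roles the cross term vanishes, so `retrieved` matches `f_kitty` up to floating-point error; with merely linearly independent roles, a different unbinding vector would be needed.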

3 The TPR as a Structural Bias for Combinatorial Representations

A drawback of Smolensky’s TPR method is that the decomposition of the symbolic structures into structural elements (e.g. $f$ and $r$ in our previous example) is not learned but externally defined. Similarly, the distributed representations $\mathbf{f}$ and $\mathbf{r}$ are assigned manually instead of being learned from data, yielding arguments against the TPR as a connectionist theory of cognition fodor1990connectionism .

Here we aim at overcoming these limitations by recognising the TPR as a form of Fast Weight memory which uses multi-layer perceptron (MLP) based neural networks trained end-to-end by stochastic gradient descent. Previous outer product-based Fast Weights Schmidhuber:93ratioicann , which share strong similarities to TPRs of order 2, have been shown to be powerful associative memory mechanisms Ba2016using ; schlag2017gated . Inspired by this capability, we use a graph interpretation of the memory where the representations of a node and an edge allow for the associative retrieval of a neighbouring node. In the context of this work, we refer to the nodes of such a graph as entities and to the edges as relations. This requires MLPs which deconstruct an input sentence into the source-entity $\mathbf{f}$, the relation $\mathbf{r}$, and the target-entity $\mathbf{t}$ such that $\mathbf{f}$ and $\mathbf{t}$ belong to the vector space $V_{\text{Entity}}$ and $\mathbf{r}$ to $V_{\text{Relation}}$. These representations are then bound together with the binding operator and stored as a TPR of order 3 where we interpret multiple unbindings as a form of graph traversal.

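We use a simple example to illustrate the idea. Consider the following raw input: "Mary went to the kitchen.". A possible three-way task-specific decomposition could be $\mathbf{f}_{\text{mary}}$, $\mathbf{r}_{\text{is-at}}$, and $\mathbf{t}_{\text{kitchen}}$. At a later point in time, a question like "Where is Mary?" would have to be decomposed into the vector representations $\mathbf{n}_{\text{mary}} \in V_{\text{Entity}}$ and $\mathbf{l}_{\text{where-is}} \in V_{\text{Relation}}$. The vectors $\mathbf{n}_{\text{mary}}$ and $\mathbf{l}_{\text{where-is}}$ have to be similar to the true unbinding vectors of $\mathbf{f}_{\text{mary}}$ and $\mathbf{r}_{\text{is-at}}$ in order to retrieve the previously stored but possibly noisy $\mathbf{t}_{\text{kitchen}}$.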

We chose a graph interpretation of the memory due to its generality as it can be found implicitly in the data of many problems. Another important property of a graph inspired neural memory is the combinatorial nature of entities and relations in the sense that any entity can be connected through any relation to any other entity. If the MLPs can disentangle entity-like information from relation-like information, the TPR will provide a simple mechanism to combine them in arbitrary ways. This means that if there is enough data for the network to learn a specific entity representation such as $\mathbf{f}_{\text{mary}} \in V_{\text{Entity}}$, then it should not require any more data or training to combine $\mathbf{f}_{\text{mary}}$ with any of the learned vectors embedded in $V_{\text{Relation}}$, even though such examples have never been covered by the training data. In Section 7 we analyse a trained model and present results which indicate that it indeed seems to learn representations in line with this perspective.
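The graph interpretation can be sketched as a third-order tensor that stores (source entity, relation, target entity) triples. The example below is an illustration under simplifying assumptions: entity and relation vectors are hand-picked one-hot vectors rather than MLP outputs, and the helper names `bind` and `unbind` are hypothetical.

```python
import numpy as np

d_entity, d_relation = 5, 3
E = np.eye(d_entity)      # hypothetical orthonormal entity vectors
R = np.eye(d_relation)    # hypothetical orthonormal relation vectors
mary, kitchen, john, garden = E[0], E[1], E[2], E[3]
is_at, holds = R[0], R[1]

def bind(source, relation, target):
    """Third-order tensor product of a source, a relation and a target."""
    return np.einsum("i,j,k->ijk", source, relation, target)

def unbind(memory, source, relation):
    """Follow the edge `relation` starting at node `source`."""
    return np.einsum("ijk,i,j->k", memory, source, relation)

# Superimpose two edges of the graph in a single tensor.
memory = bind(mary, is_at, kitchen) + bind(john, is_at, garden)

where_is_mary = unbind(memory, mary, is_at)
where_is_john = unbind(memory, john, is_at)
what_mary_holds = unbind(memory, mary, holds)  # nothing stored for this edge
```

Any entity vector can be combined with any relation vector without changing the mechanism, which is the combinatorial property the paragraph above refers to.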

4 Proposed Method

RNNs can implement algorithms which map input sequences to output sequences. A traditional RNN uses one or several tensors of order 1 (i.e. a vector usually referred to as the hidden state) to encode the information of the past sequence elements necessary to infer the correct current and future outputs. Our architecture is a non-traditional RNN encoding relevant information from the preceding sequence elements in a TPR of order 3.

At discrete time $t$ in the input sequence of varying length $T$, the previous state $F_{t-1}$ is updated by the element-wise addition of an update representation $\Delta F_t$.

$$F_t = F_{t-1} + \Delta F_t$$

The proposed architecture is separated into three parts: an input, update, and inference module. The update module produces $\Delta F_t$ while the inference module uses $F_t$ as parameters (Fast Weights) to compute the output $\hat{y}_t$ of the model given a question as input. $F_0$ is the zero tensor.

Input Module

Similar to previous work, our model iterates over a sequence of sentences and uses an input module to learn a sentence representation from a sequence of words NIPS2015_memory_networks . Let the input to the architecture at time $t$ be a sentence of $k$ words with learned embeddings $\{d_1, \dots, d_k\}$. The sequence is then compressed into a vector representation $s_t$ by

$$s_t = \sum_{i=1}^{k} d_i \odot p_i$$

where $\{p_1, \dots, p_k\}$ are learned position vectors that are equivalent for all input sequences and $\odot$ is the Hadamard product. The vectors $s_t$, $p_i$, and $d_i$ are in the vector space $V_{\text{Symbol}}$.
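The sentence encoding described above can be sketched as a position-weighted sum of word embeddings. The toy vocabulary, the embedding size, and the uniform initialisation range below are assumptions made for the example. Initialising the position vectors as ones divided by the number of position vectors follows the description in Appendix A.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary; index 0 is the padding symbol.
vocab = {"<pad>": 0, "mary": 1, "went": 2, "to": 3, "the": 4, "kitchen": 5, ".": 6}
d_symbol, max_len = 8, 7

# Word embeddings (the range is an assumed placeholder) and position vectors.
embeddings = rng.uniform(-0.1, 0.1, size=(len(vocab), d_symbol))
positions = np.ones((max_len, d_symbol)) / max_len

def encode_sentence(words):
    """Compress a sentence into one vector: sum_i d_i * p_i (Hadamard product)."""
    ids = [vocab[w] for w in words]
    ids += [vocab["<pad>"]] * (max_len - len(ids))  # pad to a uniform length
    d = embeddings[ids]                             # shape (max_len, d_symbol)
    return (d * positions).sum(axis=0)

s = encode_sentence("mary went to the kitchen .".split())
```

At initialisation every position vector is identical, so the encoding reduces to the mean of the word embeddings; training is what would make the encoding sensitive to word order.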

Update Module

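The TPR update $\Delta F_t$ is defined as the element-wise sum of the tensors produced by a write, move, and backlink function. We abbreviate the respective tensors as $W_t$, $M_t$, and $B_t$ and refer to them as memory operations.

$$\Delta F_t = W_t + M_t + B_t$$

To this end, two entity and three relation representations are computed from the sentence representation $s_t$ using five separate networks such that

$$e^{(i)}_t = f_{e^{(i)}}(s_t; \theta_{e^{(i)}}), \quad i \in \{1, 2\}, \qquad r^{(j)}_t = f_{r^{(j)}}(s_t; \theta_{r^{(j)}}), \quad j \in \{1, 2, 3\}$$

where $f$ is an MLP network and $\theta$ its weights.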

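The write operation allows for the storage of a new node-edge-node association ($e^{(1)}_t$, $r^{(1)}_t$, $e^{(2)}_t$) using the tensor product where $e^{(1)}_t$ represents the source entity, $e^{(2)}_t$ represents the target entity, and $r^{(1)}_t$ the relation connecting them. To avoid superimposing the new association onto a possibly already existing association ($e^{(1)}_t$, $r^{(1)}_t$, $\hat{w}_t$), the previous target entity $\hat{w}_t$ has to be retrieved from the previous state $F_{t-1}$ and subtracted from the TPR. If no such association exists, then $\hat{w}_t$ will ideally be the zero vector. Below, the tensor inner product $\bullet$ with $e^{(1)}_t$ contracts the source-entity order and the one with $r^{(1)}_t$ contracts the relation order.

$$\hat{w}_t = (F_{t-1} \bullet e^{(1)}_t) \bullet r^{(1)}_t$$

$$W_t = -(e^{(1)}_t \otimes r^{(1)}_t \otimes \hat{w}_t) + (e^{(1)}_t \otimes r^{(1)}_t \otimes e^{(2)}_t)$$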


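While the write operation removes the previous target entity representation $\hat{w}_t$, the move operation allows $\hat{w}_t$ to be written back into the TPR with a different relation $r^{(2)}_t$. Similar to the write operation, we have to retrieve and remove the previous target entity $\hat{m}_t$ that would otherwise interfere.

$$\hat{m}_t = (F_{t-1} \bullet e^{(1)}_t) \bullet r^{(2)}_t$$

$$M_t = -(e^{(1)}_t \otimes r^{(2)}_t \otimes \hat{m}_t) + (e^{(1)}_t \otimes r^{(2)}_t \otimes \hat{w}_t)$$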

Figure 1: Illustration of our memory operations for a single time-step given some previous state. Each arrow is represented by a tensor of order 3. The superposition of multiple tensors defines the current graph. Red arrows are subtracted from the state while green arrows are added. In this illustration, $\hat{w}_t$ exists but $\hat{m}_t$ and $\hat{b}_t$ do not yet; they are zero vectors. Hence, the two constructed third-order tensors that are subtracted according to the move and backlink operation will both be zero tensors as well. Note that the associations are not necessarily as discrete as illustrated. Best viewed in color.

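The final operation is the backlink. It switches source and target entities and connects them with yet another relation $r^{(3)}_t$. This allows for the associative retrieval of the neighbouring entity starting from either one but with different relations (e.g. John is left of Mary and Mary is right of John). As before, the previous target entity $\hat{b}_t$ of this association is retrieved and removed.

$$\hat{b}_t = (F_{t-1} \bullet e^{(2)}_t) \bullet r^{(3)}_t$$

$$B_t = -(e^{(2)}_t \otimes r^{(3)}_t \otimes \hat{b}_t) + (e^{(2)}_t \otimes r^{(3)}_t \otimes e^{(1)}_t)$$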
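The three memory operations can be sketched together in NumPy as follows. This is a simplified illustration, not the authors' implementation: the entity and relation vectors are hand-picked one-hot vectors instead of MLP outputs, and the relation names `is_at`, `was_at`, and `contains` are hypothetical labels chosen for readability.

```python
import numpy as np

def bind(source, relation, target):
    """Third-order tensor product of three vectors."""
    return np.einsum("i,j,k->ijk", source, relation, target)

def unbind(F, source, relation):
    """Contract the source and relation orders of F, leaving a target vector."""
    return np.einsum("ijk,i,j->k", F, source, relation)

def update(F, e1, e2, r1, r2, r3):
    """One memory update: superimpose the write, move and backlink tensors."""
    w_hat = unbind(F, e1, r1)  # previous target of (e1, r1); zero if none
    m_hat = unbind(F, e1, r2)  # previous target of (e1, r2)
    b_hat = unbind(F, e2, r3)  # previous target of (e2, r3)
    write = bind(e1, r1, e2) - bind(e1, r1, w_hat)
    move = bind(e1, r2, w_hat) - bind(e1, r2, m_hat)
    backlink = bind(e2, r3, e1) - bind(e2, r3, b_hat)
    return F + write + move + backlink

# Hand-picked one-hot vectors; a trained model would produce these from text.
mary, kitchen, garden = np.eye(3)
is_at, was_at, contains = np.eye(3)

F = np.zeros((3, 3, 3))                                # zero tensor at the start
F = update(F, mary, kitchen, is_at, was_at, contains)  # "Mary went to the kitchen."
F = update(F, mary, garden, is_at, was_at, contains)   # "Mary went to the garden."

current_location = unbind(F, mary, is_at)
previous_location = unbind(F, mary, was_at)
who_is_in_garden = unbind(F, garden, contains)
```

After the second update, the write operation has replaced the kitchen with the garden as Mary's current location, the move operation has kept the kitchen reachable through a different relation, and the backlink lets the garden point back to Mary.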


Inference Module

One of our experiments requires a single prediction after the last element of an observed sequence (i.e. the last sentence). This final element is the question sentence representation $s_Q$. Since the inference module does not edit the TPR memory, it is sufficient to compute the prediction only when necessary. Hence we drop the index $t$ in the following equations. Similar to the update module, first an entity and a set of relations are extracted from the current sentence using four different networks.

$$e^{(1)} = f_{e^{(1)}}(s_Q; \theta_{e^{(1)}}), \qquad r^{(j)} = f_{r^{(j)}}(s_Q; \theta_{r^{(j)}}), \quad j \in \{1, 2, 3\}$$

Figure 2: Illustration of the inference procedure. Given an entity and three relations (blue) we can extract the inferred representations (yellow).

The extracted representations are used to retrieve one or several previously stored associations by providing the necessary unbinding vectors. The values of the TPR can be thought of as context-specific weights which are not trained by gradient descent but constructed incrementally during inference. They define a function that takes the entity and relations as an input. A simple illustration of this process is shown in Figure 2.

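The most basic retrieval requires one source entity $e^{(1)}$ and one relation $r^{(1)}$ to extract the first target entity. We refer to this retrieval as a one-step inference and use the additional extracted relations to compute multi-step inferences. Here $\mathrm{LN}$ refers to layer normalization Ba2017LayerNorm which includes a learned scaling and shifting scalar. As in other Fast Weight work, $\mathrm{LN}$ improves our training procedure, possibly because it makes the optimization landscape smoother batch_norm_smoothes .

$$\hat{i}^{(1)} = \mathrm{LN}((F \bullet e^{(1)}) \bullet r^{(1)})$$

$$\hat{i}^{(2)} = \mathrm{LN}((F \bullet \hat{i}^{(1)}) \bullet r^{(2)})$$

$$\hat{i}^{(3)} = \mathrm{LN}((F \bullet \hat{i}^{(2)}) \bullet r^{(3)})$$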


Finally, the output $\hat{y}$ of our architecture consists of the sum of the three previous inference steps $\hat{i}^{(1)}$, $\hat{i}^{(2)}$, and $\hat{i}^{(3)}$ followed by a linear projection $Z$ into the symbol space, where a softmax transforms the activations into a probability distribution over all words from the vocabulary of the current task.

$$\hat{y} = \mathrm{softmax}\Big(Z \sum_{i=1}^{3} \hat{i}^{(i)}\Big)$$
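The multi-step inference and the output projection can be sketched as follows. This is a hedged illustration: the tensor and vectors are random stand-ins for learned quantities, the scalar `gain` and `bias` of the layer normalization are left at fixed values instead of being learned, and the function names are assumptions.

```python
import numpy as np

def layer_norm(x, gain=1.0, bias=0.0, eps=1e-6):
    """Layer normalization with a scalar scale and shift."""
    return gain * (x - x.mean()) / np.sqrt(x.var() + eps) + bias

def softmax(z):
    z = z - z.max()  # subtract the maximum for numerical stability
    e = np.exp(z)
    return e / e.sum()

def infer(F, e1, r1, r2, r3, Z):
    """Three chained unbindings, summed and projected to vocabulary size."""
    def unbind(source, relation):
        return np.einsum("ijk,i,j->k", F, source, relation)
    i1 = layer_norm(unbind(e1, r1))  # one-step inference
    i2 = layer_norm(unbind(i1, r2))  # two-step: start from the first result
    i3 = layer_norm(unbind(i2, r3))  # three-step
    return softmax(Z @ (i1 + i2 + i3))

rng = np.random.default_rng(0)
d_entity, d_relation, vocab_size = 4, 3, 6
F = rng.normal(size=(d_entity, d_relation, d_entity))
e1 = rng.normal(size=d_entity)
r1, r2, r3 = rng.normal(size=(3, d_relation))
Z = rng.normal(size=(vocab_size, d_entity))

probs = infer(F, e1, r1, r2, r3, Z)
```

Each inference step reuses the previous result as the new source entity, which corresponds to walking one further edge in the graph stored in `F`.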


5 Related Work

To our knowledge, our system is the first with a TPR of order 3 trained on raw data by backpropagation Linnainmaa:1970 ; Kelley:1960 ; Werbos1990BTT . However, previous work used TPRs of order 2 for simpler associations in the context of image-caption generation TPGN_captioning , question-answering TPRN_squad , and general NLP huang2018attentive problems with a gradient-based optimizer similar to ours.

On the other hand, the central operation of an order 2 TPR is the outer product of two vectors. This relates to many previous ideas, most notably Hebbian learning Hebb:49 , which partially inspired differentiable, outer product-based Fast Weight architectures Schmidhuber:93ratioicann learning context dependent weight changes through error backpropagation (compare even earlier work on differentiable Fast Weights Schmidhuber:91fastweights ). Variations of such outer product-based Fast Weights were able to generalise in a variety of small but complex sequence problems where standard RNNs tend to perform poorly Ba2016using ; schlag2017gated ; miconi2018differentiable .

RNNs are popular choices for modelling natural language. Despite ongoing research in RNN architectures, the good old LSTM lstm97and95 has been shown to outperform more recent variants lstm_still_sota_melis on standard language modelling datasets. However, such networks do not perform well in NLR tasks such as question answering babi_tasks_weston . Recent progress came through the addition of memory and attention components to RNNs. In the context of question answering, a popular line of research is memory networks memory_networks_weston ; dynamic_memory_networks_kumar ; weakly_memory_networks_sukhbaatar ; gated_endtoend_memory_networks_Liu ; dynamic_memory_networks_2_xiong . But because their internal representations are difficult to interpret, it remains unclear whether mistakes in trained models arise from imperfect logical reasoning, knowledge representation, or insufficient data dupoux_beyond_toytasks .

Some early memory-augmented RNNs focused primarily on improving the ratio of the number of trainable parameters to memory size Schmidhuber:93ratioicann . The Neural Turing Machine Graves2014NTM was among the first models with an attention mechanism over external memory that outperformed standard LSTM on tasks such as copying and sorting. The Differentiable Neural Computer (DNC) further refined this approach graves2016 ; sparse_dnc_rae , yielding strong performance also on question-answering problems.

6 Experiments

Figure 3: Training accuracy on all bAbI tasks over the first 600k iterations. All our all-tasks models achieve <5% error in about 48 hours (i.e. about 250k steps). We stopped training our own implementation of the DNC graves2016 after roughly 7 days (600k steps) and instead compare accuracy in Table 1 using previously published results.

We evaluate our architecture on bAbI tasks, a set of 20 different synthetic question-answering tasks designed to evaluate NLR systems such as intelligent dialogue agents babi_tasks_weston . Every task addresses a different form of reasoning. Each sample consists of a story (a sequence of sentences) followed by a question sentence with a single-word answer. We used the train/validation/test split as it was introduced in v1.2 for the 10k samples version of the dataset. We ignored the provided supporting facts that simplify the problem by pointing out sentences relevant to the question. We only show story sentences once and before the query sentence, with no additional supervision signal apart from the prediction error.

We experiment with two models. The single-task model is only trained and tested on the data from one task but uses the same computational graph and hyper-parameters for all. The all-tasks model is a scaled up version trained and tested on all tasks simultaneously, using only the default hyper-parameters. More details such as specific hyper-parameters can be found in Appendix A.

In Tables 1 and 2 we compare our model to various state-of-the-art models in the literature. We added best results for a better comparison to earlier work which did not provide statistics generated from multiple runs. Our system outperforms the state-of-the-art in both settings. We also seem to outperform the DNC in convergence speed as shown in Figure 3.

Task             REN            DNC            SDNC           TPR-RNN (ours)
Avg Error        9.7 ± 2.6      12.8 ± 4.7     6.4 ± 2.5      1.34 ± 0.52
Failure (>5%)    5 ± 1.2        8.2 ± 2.5      4.1 ± 1.6      0.86 ± 1.11
Sources: REN recurrent_entity_networks_henaff ; DNC graves2016 ; SDNC sparse_dnc_rae
Table 1: Mean and standard deviation of the test error for the all-tasks setting. We perform early stopping according to the validation set. Our statistics are generated from the 8 runs listed in Table 4.

Task             LSTM    N2N     DMN+    REN     TPR-RNN (ours)
                 best    best    best    best    best    mean
1                0.0     0.0     0.0     0.0     0.0     0.02 ± 0.05
2                81.9    0.3     0.3     0.1     0.0     0.06 ± 0.09
3                83.1    2.1     1.1     4.1     1.2     1.78 ± 0.58
4                0.2     0.0     0.0     0.0     0.0     0.02 ± 0.04
5                1.2     0.8     0.5     0.3     0.5     0.61 ± 0.17
6                51.8    0.1     0.0     0.2     0.0     0.22 ± 0.19
7                24.9    2.0     2.4     0.0     0.5     2.78 ± 1.81
8                34.1    0.9     0.0     0.5     0.1     0.47 ± 0.45
9                20.2    0.3     0.0     0.1     0.0     0.14 ± 0.13
10               30.1    0.0     0.0     0.6     0.3     1.24 ± 1.30
11               10.3    0.0     0.0     0.3     0.0     0.14 ± 0.11
12               23.4    0.0     0.0     0.0     0.0     0.04 ± 0.05
13               6.1     0.0     0.0     1.3     0.3     0.42 ± 0.11
14               81.0    0.2     0.2     0.0     0.0     0.24 ± 0.29
15               78.7    0.0     0.0     0.0     0.0     0.0 ± 0.0
16               51.9    51.8    45.3    0.2     0.0     0.02 ± 0.045
17               50.1    18.6    4.2     0.5     0.4     0.9 ± 0.69
18               6.8     5.3     2.1     0.3     0.1     0.64 ± 0.33
19               31.9    2.3     0.0     2.3     0.0     12.64 ± 17.39
20               0.0     0.0     0.0     0.0     0.0     0.0 ± 0.00
Avg Error        36.4    4.2     2.8     0.5     0.17    1.12 ± 1.19
Failure (>5%)    16      3       1       0       0       0.4 ± 0.55
Sources: LSTM and N2N weakly_memory_networks_sukhbaatar ; DMN+ dynamic_memory_networks_2_xiong ; REN recurrent_entity_networks_henaff
Table 2: Mean and variance of the test error for the single-task setting. We perform early stopping according to the validation set. Statistics are generated from 5 runs. We added best results for comparison with previous work. Note that only our results for task 19 are unstable where different seeds either converge with perfect accuracy or fall into a local minimum. It is not clear how much previous work is affected by such issues.

Ablation Study

We ran ablation experiments on every task to assess the necessity of the three memory operations. The experimental results in Table 3 indicate that a majority of the tasks can be solved by the write operation alone. This is surprising at first because for some of those tasks the symbolic operations that a person might think of as ideal typically require more complex steps than what the write operation allows for.

Operations Failed tasks (err > 5%)
3, 6, 9, 10, 12, 13, 17, 19
9, 10, 13, 17
Table 3: Summary results of the ablation experiments. We experimented with 3 variations of memory operations in order to analyse their necessity with regards to single-task performance. The results indicate that the move operation is in general less important than the backlink operation.

However, the optimizer seems to be able to find representations that overcome the limitations of the architecture. That said, more complex tasks do benefit from the additional operations without affecting the performance on simpler tasks.

7 Analysis

Here we analyse the representations produced by the MLPs of the update module. We collect the set of unique sentences across all stories from the validation set of a task and compute their respective entity and relation representations $e^{(1)}$, $e^{(2)}$, $r^{(1)}$, $r^{(2)}$, and $r^{(3)}$. For each representation we then hierarchically cluster all sentences based on their cosine similarity.
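The clustering step can be illustrated with a small NumPy sketch. The text does not state which linkage criterion was used, so average linkage is an assumption here, as are the toy vectors and the function names. A real analysis would use the representations produced by the trained MLPs.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    return X @ X.T

def agglomerative_clusters(S, n_clusters):
    """Average-linkage agglomerative clustering on a similarity matrix S."""
    clusters = [[i] for i in range(len(S))]
    while len(clusters) > n_clusters:
        best, pair = -np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average similarity between all members of the two clusters.
                sim = S[np.ix_(clusters[a], clusters[b])].mean()
                if sim > best:
                    best, pair = sim, (a, b)
        a, b = pair
        clusters[a] = clusters[a] + clusters[b]  # merge the most similar pair
        del clusters[b]
    return clusters

# Toy "sentence representations": two noisy copies of two orthogonal prototypes.
reps = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.95, 0.05],
])
S = cosine_similarity_matrix(reps)
clusters = agglomerative_clusters(S, n_clusters=2)
```

Reordering the rows and columns of `S` by cluster membership gives the kind of block-structured similarity matrix that the figures in this section show.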

Figure 4: The hierarchically clustered similarity matrices of all unique sentences of the validation set of task 3. We compute one similarity matrix for each representation produced by the update module using the cosine similarity measure for clustering.

In Figure 4 we show such similarity matrices for a model trained on task 3. The image based on shows 4 distinct clusters which indicate that learned representations are almost perfectly orthogonal. By comparing the sentences from different clusters it becomes apparent that they represent the four entities independent of other factors. Note that the dimensionality of this vector space is 15 which seems larger than necessary for this task.

In the case of we observe that sentences seem to group into three, albeit less distinct, clusters. In this task, the structure in the data implies three important events for any entity: moving to any location, bind with any object, and unbind from a previously bound object; all three represented by a variety of possible words and phrases. By comparing sentences from different clusters, we can clearly associate them with the three general types of events.

We observed clusters of similar discreteness in all tasks; often with a semantic meaning that becomes apparent when we compare sentences that belong to different clusters. We also noticed that even though there are often clean clusters they are not always perfectly combinatorial, e.g., in as seen in Figure 4, we found two very orthogonal clusters for the target entity symbols and .

Systematic Generalisation

Figure 5: Average accuracy over the generated test sets of each task. The novel entities that we add to the training data were not trained on all tasks. For a model that generalises systematically, the test accuracy should not drop for entities with only partial training data.

We conduct an additional experiment to empirically analyse the model’s capability to generalise in a systematic way fodor1988connectionism ; hadley1994systematicity . For this purpose, we join together all tasks which use the same four entity names with at least one entity appearing in the question (i.e. tasks 1, 6, 7, 8, 9, 11, 12, 13). We then augment this data with five new entities such that the train and test data exhibit systematic differences. The stories for a new entity are generated by randomly sampling 500 story/question pairs from a task such that in 20% of the generated stories the new entity is also contained in the question. We then add generated stories from all possible 40 combinations of new entities and tasks to the test set. To the training set, however, we only add stories from a subset of all tasks.

More specifically, the new entities are Alex, Glenn, Jordan, Mike, and Logan for which we generate training set stories from , , , , of the tasks respectively. We summarize the results in Figure 5 by averaging over tasks. After the network has been trained, we find that our model achieves high accuracy on entity/task pairs on which it has not been trained. This indicates its systematic generalisation capability due to the disentanglement of entities and relations.

Our analysis and the additional experiment indicate that the model seems to learn combinatorial representations resulting in interpretable distributed representations and data efficiency due to rule-like generalisation.

8 Limitations

To compute the correct gradients, an RNN with external memory trained by backpropagation through time must store all values of all temporary variables at every time step of a sequence. Since outer product-based Fast Weights Schmidhuber:93ratioicann ; schlag2017gated and our TPR system have many more time-varying variables per learnable parameter than a classic RNN such as LSTM, this makes them less scalable in terms of memory requirements. The problem can be overcome through RTRL WilliamsZipser:92 ; RobinsonFallside:87tr , but only at the expense of greater time complexity. Nevertheless, our results illustrate how the advantages of TPRs can outweigh such disadvantages for problems of combinatorial nature.

One difficulty of our Fast Weight-like memory is the well-known vanishing gradient problem. Due to multiplicative interaction of Fast Weights with RNN activations, forward and backward propagation is unstable and can result in vanishing or exploding activations and error signals. A similar effect may affect the forward pass if the values of the activations are not bounded by some activation function. Nevertheless, in our experiments, we abandoned bounded TPR values as they significantly slowed down learning with little benefit. Although our current sub-optimal initialization may occasionally lead to exploding activations and NaN values after the first few iterations of gradient descent, we did not observe any extreme cases after a few dozen successful steps, and therefore simply reinitialize the model in such cases.

A direct comparison with the DNC is somewhat inconclusive for the following reasons. Our architecture uses a sentence encoding layer similar to how many memory networks encode their input. This slightly simplifies the problem since the network does not have to learn which words belong to the same sentence. Most memory networks also iterate over sentence representations, which is less general than iterating over words, as the DNC does, which in turn is less general than iterating over characters. In preliminary experiments, a word level variation of our architecture solved many tasks, but it may require non-trivial changes to solve all of them.

9 Conclusion

Our novel RNN-TPR combination learns to decompose natural language sentences into combinatorial components useful for reasoning. It outperforms previous models on the bAbI tasks through attentional control of memory. Our approach is related to Fast Weight architectures, another way of increasing the memory capacity of RNNs. An analysis of a trained model suggests straight-forward interpretability of the learned representations. Our model generalises better than a previous state-of-the-art model when there are strong systematic differences between training and test data.


We thank Paulo Rauber, Klaus Greff, and Filipe Mutz for helpful comments and helping hands. We are also grateful to NVIDIA Corporation for donating a DGX-1 as part of the Pioneers of AI Research Award and to IBM for donating a Minsky machine. This research was supported by a European Research Council Advanced Grant (no: 742870).



Appendix A Experimental Details

We encode the valid words for a task as a one-hot vector; the dimensionality of the vector space is equal to the size of the vocabulary. Each MLP which produces the entity and relation representations from a sentence representation consists of two layers, where each layer is an affine transformation followed by the hyperbolic tangent nonlinearity. The hidden layers of the MLPs refer to the intermediate activations and are vectors from the vector space .

We initialize the word embeddings with a uniform distribution from to and apply the Glorot initialization scheme [53] for all other weights except the position vectors which are initialized as a vector of ones divided by , the number of position vectors. We implemented the model in the TensorFlow framework and compute the gradients through its automatic differentiation engine [54] based on Linnainmaa’s automatic differentiation or backpropagation scheme [19] . We pad shorter sentences with the padding symbol to achieve a uniform sentence length but keep the story length dynamic as in previous work.

To deal with possible unstable initializations we incorporate a warm-up phase in which we train the network for 50 steps with of the learning rate. In the case of NaN values during this warm-up phase we reinitialize the network from scratch. After successful warm-up phases we never encountered any further instabilities.
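The warm-up procedure can be sketched as a small retry loop. Everything in this example is a stand-in: `init_fn` and `step_fn` are hypothetical callables, the toy objective replaces the real model, and `lr_scale` is a placeholder parameter whose value is an assumption, not the fraction used in the experiments.

```python
import math

def warm_up(init_fn, step_fn, base_lr, lr_scale, warmup_steps=50, max_restarts=10):
    """Run a warm-up phase; reinitialise from scratch if the loss becomes NaN."""
    for attempt in range(max_restarts):
        params = init_fn(attempt)
        for _ in range(warmup_steps):
            params, loss = step_fn(params, base_lr * lr_scale)
            if math.isnan(loss):
                break  # unstable initialisation: discard it and restart
        else:
            return params, attempt  # all warm-up steps finished without NaN
    raise RuntimeError("no stable initialisation found")

# Toy stand-in for a model: minimise x**2. The first initialisation is
# deliberately NaN to exercise the restart path.
def toy_init(attempt):
    return float("nan") if attempt == 0 else 3.0

def toy_step(x, lr):
    x = x - lr * 2.0 * x  # gradient step on x**2
    return x, x * x

params, attempts_used = warm_up(toy_init, toy_step, base_lr=0.1, lr_scale=0.1)
```

In this toy run the first attempt is discarded and the second one completes the warm-up, after which training would continue with the full learning rate.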

We optimize the neural networks using the Nadam optimizer [55] which in our experiments consistently outperformed others in convergence speed but not necessarily in final performance. Finally, we multiply the learning rate by a factor of once it has reached a validation set loss below .

The Single-Task Model

For the single-task model , , and . Note that depends on the vocabulary size of each individual task. We achieved the results in Table 2 using the hyper-parameters , , , and a batch-size of 128. These hyper-parameters have been optimised using an evolution procedure with small random perturbations. The main effect is improved convergence speed. With the exception of a few tasks that were sensitive to the momentum parameter, similar final performance can be achieved with the default hyper-parameters.

The All-Tasks Model

In the all-tasks setting we train one model on all tasks simultaneously. We increase the size of the model to , , and and train with a batch-size of 32. We used the default hyper-parameters , , in that case.

Appendix B Detailed All-Tasks Training Runs

task run-0 run-1 run-2 run-3 run-4 run-5 run-6 run-7 best mean std
all (0) 1.50 1.69 1.13 1.04 0.78 0.96 1.20 2.40 0.78 1.34 0.52
1 0.10 0.00 0.10 0.20 0.00 0.00 0.00 0.00 0.00 0.05 0.08
2 1.70 0.80 0.60 0.30 0.40 0.50 0.50 0.30 0.30 0.64 0.46
3 4.70 2.50 3.50 2.20 3.40 5.40 3.50 7.90 2.20 4.14 1.85
4 0.00 0.00 0.00 0.10 0.20 0.10 0.00 0.00 0.00 0.05 0.08
5 1.10 1.50 0.80 0.70 1.00 1.00 0.80 1.10 0.70 1.00 0.25
6 0.00 1.10 0.70 0.10 0.10 0.40 0.00 0.50 0.00 0.36 0.39
7 1.70 3.50 1.10 2.60 1.00 1.90 1.60 1.60 1.00 1.88 0.82
8 0.20 1.40 0.40 0.40 0.50 0.40 0.30 0.50 0.20 0.51 0.37
9 0.20 1.30 0.20 0.10 0.30 0.80 0.20 0.10 0.10 0.40 0.43
10 1.40 2.40 1.20 0.30 0.40 0.20 0.40 0.80 0.20 0.89 0.75
11 1.60 2.00 1.10 0.70 1.30 1.00 0.50 1.20 0.50 1.18 0.48
12 1.30 1.00 2.60 1.00 0.20 0.00 3.40 1.30 0.00 1.35 1.14
13 2.50 2.10 2.10 1.90 2.10 2.50 2.40 3.40 1.90 2.38 0.47
14 0.80 0.20 0.70 1.90 0.20 0.90 1.00 1.10 0.20 0.85 0.54
15 0.20 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.03 0.07
16 0.20 0.20 0.10 4.00 0.40 0.00 0.60 0.10 0.00 0.70 1.35
17 1.60 9.00 4.20 0.80 0.60 1.40 2.60 7.30 0.60 3.44 3.16
18 0.20 1.60 1.30 0.70 0.00 0.70 1.20 0.10 0.00 0.72 0.60
19 11.00 3.90 2.50 1.20 4.20 4.10 6.00 22.80 1.20 6.96 7.03
20 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Table 4: Error percentage of 8 different TPR-RNNs in the all-tasks setting. The performance is further broken down into task-specific error percentages. We compare our all-tasks results to the previous state-of-the-art in Table 1.
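The summary values at the end of each row can be recomputed from the eight runs. The short check below uses the all-tasks row; the last value of that row is consistent with the sample standard deviation (roughly 0.52), which is what `statistics.stdev` computes. The variable and function names are chosen for this example only.

```python
import statistics

# Error percentages of the eight runs in the "all (0)" row of Table 4.
all_tasks_errors = [1.50, 1.69, 1.13, 1.04, 0.78, 0.96, 1.20, 2.40]


def summarize(errors):
    """Return (best, mean, sample standard deviation) of a list of errors."""
    return min(errors), statistics.mean(errors), statistics.stdev(errors)


best, mean, std = summarize(all_tasks_errors)
print(f"best={best:.2f} mean={mean:.4f} std={std:.4f}")
```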

Appendix C Further Similarity Matrices

Figure 6: Hierarchically clustered sentences based on the cosine similarity of their representations. We select sentences randomly from the test set and use our best all-tasks model to extract the representations. The sentences below are also clustered. Note how they cluster together based on entity information like milk, green, or bathroom.

36: mary put down the milk there .
39: mary discarded the milk .
42: mary grabbed the milk there .
41: mary picked up the milk there .
45: jason picked up the milk there .
5: brian is green .
7: lily is green .
35: mary travelled to the kitchen .
43: jason went to the kitchen .
11: daniel went back to the kitchen .
17: sandra went back to the kitchen .
19: sandra travelled to the kitchen .
21: sandra journeyed to the kitchen .
33: mary went to the bathroom .
18: sandra went back to the bathroom .
49: john travelled to the bathroom .
14: daniel travelled to the bathroom .
12: daniel went to the bathroom .
15: daniel journeyed to the bathroom .
31: mary went back to the bedroom .
30: mary moved to the bedroom .
38: mary journeyed to the bedroom .
48: john went to the bedroom .
27: yann travelled to the bedroom .
10: daniel moved to the bedroom .
13: daniel travelled to the bedroom .
2: julius is white .
28: yann is tired .
8: sumit moved to the garden .
9: sumit is bored .
44: jason is thirsty .
34: mary travelled to the hallway .
26: bernhard is gray .
47: john went to the office .
20: sandra journeyed to the office .
32: mary went to the office .
3: julius is a rhino .
6: lily is a rhino .
22: sandra discarded the football .
23: sandra picked up the football there .
24: sandra took the football there .
46: then she journeyed to the bathroom .
1: greg is a frog .
4: brian is a frog .
29: yann picked up the pajamas there .
25: bernhard is a swan .
40: mary discarded the apple there .
16: daniel took the apple there .
37: mary got the apple there .
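An ordering of this kind can be produced by agglomerative clustering on pairwise cosine similarities. The sketch below is a small pure-Python illustration and not the code used for the figures; the linkage criterion is not stated in the captions, so average linkage is an assumption, as are the function names and the toy vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def cluster_order(vectors):
    """Average-linkage agglomerative clustering.

    Repeatedly merges the two clusters with the highest mean pairwise
    cosine similarity and returns the item indices in merge order, so
    that similar items end up next to each other.
    """
    n = len(vectors)
    sim = [[cosine_similarity(vectors[i], vectors[j]) for j in range(n)]
           for i in range(n)]
    clusters = [[i] for i in range(n)]
    while len(clusters) > 1:
        best_pair, best_sim = None, -2.0  # cosine similarity is at least -1
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
                avg = sum(sim[i][j] for i, j in pairs) / len(pairs)
                if avg > best_sim:
                    best_pair, best_sim = (a, b), avg
        a, b = best_pair
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters[0]


# Toy representations: items 0 and 2 point in similar directions, as do 1 and 3.
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
order = cluster_order(toy_vectors)
```

In practice the vectors would be the sentence representations extracted from the trained model, and a library routine would replace the quadratic search above.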

Figure 7: Hierarchically clustered sentences based on the cosine similarity of their representations. We select sentences randomly from the test set and use our best all-tasks model to extract the representations. The sentences below are also clustered. Note how the sentences cluster together based on entity information like milk, green, or bathroom.

30: mary went back to the kitchen .
28: mary moved to the hallway .
29: mary moved to the garden .
27: sandra journeyed to the bathroom .
24: sandra travelled to the bedroom .
22: sandra moved to the kitchen .
23: sandra went to the office .
3: daniel went back to the bathroom .
4: daniel went to the bathroom .
2: daniel moved to the kitchen .
1: daniel moved to the office .
5: daniel travelled to the office .
36: john travelled to the bathroom .
35: john went to the kitchen .
33: john moved to the office .
34: john went back to the bedroom .
6: daniel and john went to the hallway .
31: mary and sandra travelled to the hallway .
26: sandra left the apple .
37: john discarded the football .
19: the bathroom is south of the hallway .
20: the bathroom is north of the kitchen .
21: the bathroom is east of the garden .
12: the hallway is west of the office .
10: the hallway is north of the bedroom .
11: the hallway is east of the bedroom .
7: the office is east of the bathroom .
8: the office is west of the hallway .
9: the office is west of the garden .
38: john grabbed the football there .
32: mary picked up the milk there .
13: the garden is west of the bedroom .
25: sandra got the apple there .
17: the kitchen is south of the office .
18: the kitchen is south of the bathroom .
16: the bedroom is west of the garden .
14: the bedroom is east of the hallway .
15: the bedroom is east of the kitchen .

Figure 8: Hierarchically clustered questions based on the cosine similarity of their representations. We select questions randomly from the test set and use our best all-tasks model to extract the representations. The questions below are also clustered. Note how the questions cluster together based on relation information like carrying, color, or afraid of.

4: is the pink rectangle to the left of the yellow square ?
3: is the red sphere below the triangle ?
1: is bill in the kitchen ?
0: does the chocolate fit in the container ?
5: is mary in the hallway ?
2: is julie in the park ?
6: is mary in the kitchen ?
14: how do you go from the bedroom to the office ?
13: how do you go from the garden to the hallway ?
21: what is john carrying ?
17: what is daniel carrying ?
19: what is mary carrying ?
15: how many objects is mary carrying ?
26: what did fred give to mary ?
22: what color is greg ?
24: what color is lily ?
23: what color is brian ?
25: what color is bernhard ?
12: where was fred before the office ?
10: where is mary ?
7: where is daniel ?
18: what is the kitchen south of ?
9: where is sandra ?
11: where is john ?
8: where is the milk ?
16: what is gertrude afraid of ?
20: what is emily afraid of ?