
StepGame: A New Benchmark for Robust Multi-Hop Spatial Reasoning in Texts

Inferring spatial relations in natural language is a crucial ability an intelligent system should possess. The bAbI dataset tries to capture tasks relevant to this domain (task 17 and 19). However, these tasks have several limitations. Most importantly, they are limited to fixed expressions, they are limited in the number of reasoning steps required to solve them, and they fail to test the robustness of models to input that contains irrelevant or redundant information. In this paper, we present a new Question-Answering dataset called StepGame for robust multi-hop spatial reasoning in texts. Our experiments demonstrate that state-of-the-art models on the bAbI dataset struggle on the StepGame dataset. Moreover, we propose a Tensor-Product based Memory-Augmented Neural Network (TP-MANN) specialized for spatial reasoning tasks. Experimental results on both datasets show that our model outperforms all the baselines with superior generalization and robustness performance.



1 Introduction

Neural networks have been successful in a wide array of perceptual tasks, but it is often stated that they are incapable of solving tasks that require higher-level reasoning Ding et al. (2020). Since spatial reasoning is ubiquitous in many scenarios such as autonomous navigation Vogel and Jurafsky (2010), situated dialog Kruijff et al. (2007), and robotic manipulation Yang et al. (2020); Landsiedel et al. (2017), grounding spatial references in texts is essential for effective human-machine communication through natural language. Navigation tasks require agents to reason about their relative position to objects and how these relations change as they move through the environment Chen et al. (2019). If we want to develop conversational systems able to assist users in solving tasks where spatial references are involved, we need to make them able to understand and reason about spatial references in natural language. Such ability can help conversational systems to successfully follow instructions and understand spatial descriptions. However, despite its tremendous applicability, reasoning over spatial relations remains a challenging task for existing conversational systems.

Earlier works in spatial reasoning focused on spatial instruction understanding in a synthetic environment Bisk et al. (2018); Tan and Bansal (2018); Janner et al. (2018) or in a simulated world with spatial information annotation in texts Pustejovsky et al. (2015), spatial relation extractions across entities Petruck and Ellsworth (2018) and visual observations Anderson et al. (2018); Chen et al. (2019). However, few of the existing datasets are designed to evaluate models’ inference over spatial information in texts. A spatial relational inference task often requires a conversational system to infer the spatial relation between two items given a description of a scene. For example, imagine a user asking a conversational system to recognize the location of an entity based on the description of other entities in a scene. To do so, the conversational system needs to be able to reason about the location of the various entities in the scene using only textual information.

bAbI Weston et al. (2016) is the most relevant dataset for this task. It contains 20 synthetic question answering (QA) tasks to test a variety of reasoning abilities in texts, like deduction, co-reference, and counting. In particular, the positional reasoning task (no. 17) and the path finding task (no. 19) are designed to evaluate models’ spatial reasoning ability. These two tasks are arguably the most challenging ones van Aken et al. (2019). The state-of-the-art model on the bAbI dataset Le et al. (2020) almost perfectly solves these two spatial reasoning tasks. However, in this paper, we demonstrate that such good performance is attributable to issues with the bAbI dataset rather than to the model's inference ability.

We find four major issues with bAbI’s tasks 17 and 19: (1) There is data leakage between the train and test sets; that is, most of the test set samples appear in the training set. Hence, the evaluation results on the test set cannot truly reflect models’ reasoning ability; (2) Named entities are fixed and only four relations are considered. Each text sample always contains the same four named entities in the training, validation, and test sets. This further biases the learning models towards these four entities. When named entities in the test set are replaced by unseen entities or the number of such entities increases, the model performance decreases dramatically Chen et al. (2020a). Also, relations such as top-left, top-right, lower-left, and lower-right are not taken into consideration; (3) Learning models are required to reason only over one or two sentences in the text descriptions, making such tasks relatively simple. Palm et al. (2018) pointed out that multi-hop reasoning is not necessary for the bAbI dataset since models only need a single step to solve all the tasks; and (4) It is a synthetic dataset with a limited diversity of spatial relation descriptions. It thus cannot truly reveal the models’ ability to understand textual descriptions of space.

In this paper, we propose a new dataset called StepGame to tackle the above-mentioned issues and a novel Tensor Product-based Memory-Augmented Neural Network architecture (TP-MANN) for multi-hop spatial reasoning in texts.

The StepGame dataset is based on crowdsourced descriptions of 8 potential spatial relations between 2 entities. These descriptions are then used as templates when generating the dataset. To increase the diversity of these templates, crowdworkers were asked to diversify their expressions. This was done to ensure that the crowdsourced templates cover most of the natural ways in which relations between two entities can be described in text. The StepGame dataset is characterized by a combinatorial growth in the number of possible descriptions of scenes, named stories, as the number of described relations between two entities increases. This combinatorial growth reduces the chance of leaking stories from the training set to the validation and test sets. Moreover, we use a large number of named entities and require multi-hop reasoning to answer questions about two entities mentioned in the stories. Experimental results show that existing models (1) fail to achieve a performance on the StepGame dataset similar to that achieved on the bAbI dataset, and (2) suffer from a large performance drop as the number of required reasoning steps increases.

The TP-MANN architecture is based on tensor product representations Smolensky (1990) that are used in a recurrent memory module to store, update or delete the relation information among entities inferred from stories. This recurrent architecture provides three key benefits: (1) it enables the model to make inferences based on the stored memory; (2) it allows multi-hop reasoning and it is robust to noise, and; (3) the number of parameters remains unchanged as the number of recurrent layers in the memory module increases. Experimental results on the StepGame dataset show that our model achieves state-of-the-art performance with a substantial improvement, and demonstrates a better generalization ability to more complex stories. Finally, we also conduct some analysis of our recurrent structure and demonstrate its importance for multi-hop reasoning.

2 Related Work and Background

2.1 Related Work

Reasoning Datasets.

The role of language in spatial reasoning has been investigated since the 1980s Pustejovsky (1989); Gershman and Tenenbaum (2015); Tversky (2019), and reasoning about spatial relations has been studied in several contexts such as 2D and 3D navigation Bisk et al. (2018); Tan and Bansal (2018); Janner et al. (2018); Yang et al. (2020) and robotic manipulation Landsiedel et al. (2017). However, few of the datasets used in these works evaluate systems’ spatial reasoning ability in texts.

The bAbI Weston et al. (2016) dataset consists of several QA tasks. Solving these tasks requires logical reasoning steps; they cannot be solved by simple word matching. Of particular interest to this paper are tasks 17 and 19. Task 17 is about positional reasoning while task 19 is about path finding. These two tasks can be used to evaluate the spatial inference ability of learning models. However, the bAbI dataset has several issues as mentioned above: the data leakage, the fixed named entities and expressions, and the lack of a need to perform multi-hop reasoning. Another relevant dataset is SpartQA Mirzaee et al. (2021), which is designed for spatial reasoning over texts but requires only limited multi-hop reasoning compared to StepGame.

Multi-Hop QA Datasets.

Multi-hop QA tasks require reasoning over multiple pieces of evidence and focus on leveraging the connections between entities to infer a requested property of a set of them. Commonly used multi-hop QA datasets are HotpotQA Yang et al. (2018), ComplexWebQuestions Talmor and Berant (2018), and QAngaroo Welbl et al. (2018). The proposed StepGame dataset is different from these datasets. The StepGame dataset focuses on spatial reasoning, which requires machine learning models to infer the spatial relations among the described entities. Moreover, multi-hop QA datasets usually require no more than two reasoning steps, while the StepGame dataset can require as many as 10 reasoning steps.

Reasoning Models.

There are three types of reasoning models: memory-augmented neural networks, graph neural networks, and transformer-based networks. Works of the first type augment neural networks with external memory, such as End-to-End Memory Networks Sukhbaatar et al. (2015), the Differentiable Neural Computer Graves et al. (2016), and Gated End-to-End Memory Networks Liu and Perez (2017). These models have shown remarkable abilities in tackling difficult computational and reasoning tasks. Works of the second type use graph structure to incorporate a stronger relational inductive bias Battaglia et al. (2018). Santoro et al. (2017) introduced Relation Networks (RN) and demonstrated strong relational reasoning capabilities with a shallow architecture by modelling binary relations between entity pairs. Palm et al. (2018) proposed a graph representation of objects and modelled multi-hop relational reasoning using a message passing mechanism. Works of the third type use transformers. Although transformers have been proven successful in many NLP tasks, they still struggle with reasoning tasks. van Aken et al. (2019) analyzed the performance of BERT Devlin et al. (2019) on bAbI’s tasks and demonstrated that most of BERT’s errors come from tasks 17 and 19, which require spatial reasoning. Meanwhile, Dehghani et al. (2019) demonstrated that standard transformers cannot perform as well as memory-augmented networks on the bAbI dataset. Moreover, it is important to note that most of the errors of their proposed Universal Transformer also come from task 17 and task 19 of the bAbI dataset, which matches our observations on other transformer-based models. Therefore, spatial reasoning tasks are arguably the most challenging tasks in the bAbI dataset.

Tensor Product Representation.

The Tensor Product Representation (TPR) Smolensky (1990); Schlag et al. (2021) is a technique for encoding symbolic structural information and modelling symbolic reasoning in vector spaces by learning to deconstruct natural language statements into combinatorial representations Chen et al. (2020b). TPR has been used for tasks that require deductive reasoning abilities: it is able to represent entire problem statements to solve math questions in natural language Chen et al. (2020b) and to generate natural language captions from images Huang et al. (2018).

Schlag and Schmidhuber (2018) proposed a gradient-based RNN with third-order TPR (TPR-RNN), which creates a vector space embedding of complex symbolic structures by tensor products and stores these learned representations into a third-order TPR-like memory. Self-Attentive Associative Memory (STM) Le et al. (2020) utilizes a second-order item memory and a third-order TPR-like relational memory to simulate the hippocampus, achieving state-of-the-art performance on the bAbI dataset. Despite a gain in the performance on bAbI compared to TPR-RNN, STM takes a longer time to converge in practice.

Recently, Schlag et al. (2021) compared a concatenated memory with a third-order memory, and experimental results indicate a drop in performance when a concatenated memory is used. However, neither STM nor TPR-RNN processes information at the paragraph level or allows later modifications after the first information is stored, as done in our model. Both STM and TPR-RNN use an RNN-like architecture where each sentence in a paragraph is stored recurrently. This may result in a long-term dependency problem Vaswani et al. (2017) where necessary pieces of information do not interact with each other. To solve this issue, our model introduces an explicit mechanism to update relational information between entities at the end of each story.

2.2 Background

The Tensor Product Representation (TPR) is a method to create a vector space embedding of complex symbolic structures by tensor product. Such a representation can be constructed as follows:

$$M = \sum_{i=1}^{n} f_i \otimes r_i = \sum_{i=1}^{n} f_i r_i^{\top},$$

where $M$ is the TPR, $f = (f_1, \dots, f_n)$ is a set of $n$ filler vectors and $r = (r_1, \dots, r_n)$ is a set of $n$ role vectors. For each role-filler vector pair, which can be considered as an entity-relation pair, we bind (or store) them into $M$ by performing their outer product. Then, given an unbinding role vector $u_j$ associated to the filler vector $f_j$, $f_j$ can be recovered by performing:

$$M u_j = \left[ \sum_{i=1}^{n} f_i \otimes r_i \right] u_j = \sum_{i=1}^{n} \alpha_{ij} f_i \propto f_j,$$

where $\alpha_{ij} = r_i^{\top} u_j \neq 0$ if and only if $i = j$. It can be proven that the recovery is perfect if the role vectors are orthogonal to each other. In our model, TPR-like binding and unbinding methods are used to store information in, and retrieve it from, the TPR $M$, which we will call memory.
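The binding and unbinding operations described above can be checked numerically. The following is a minimal NumPy sketch, not the authors' code: the dimensions, the random fillers, and the choice of orthonormal role vectors (taken from a QR decomposition, with each unbinding vector set equal to its role vector) are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_f, d_r, n = 4, 6, 3  # filler dim, role dim, number of pairs (arbitrary choices)

fillers = rng.normal(size=(n, d_f))

# Orthonormal role vectors (assumption): the first n columns of Q from a QR
# decomposition of a random square matrix are mutually orthonormal.
q, _ = np.linalg.qr(rng.normal(size=(d_r, d_r)))
roles = q[:, :n].T  # shape (n, d_r); rows are orthonormal

# Binding: the memory is the sum of outer products of fillers and roles.
memory = sum(np.outer(f, r) for f, r in zip(fillers, roles))

# Unbinding: multiplying the memory by a role vector isolates its filler,
# because the inner products with all other roles vanish.
recovered = memory @ roles[1]
print(np.allclose(recovered, fillers[1]))  # prints True
```

With role vectors that are only approximately orthogonal, the same product returns the target filler plus cross-talk from the other fillers, which is why orthogonality matters for exact recovery.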

3 The StepGame Dataset

To design a benchmark dataset that explicitly tests models’ spatial reasoning ability and tackles the above-mentioned problems, we build a new dataset named StepGame, inspired by the spatial reasoning tasks in the bAbI dataset Weston et al. (2016). StepGame is a contextual QA dataset, where the system is required to interpret a story about several entities expressed in natural language and answer a question about the relative position of two of those entities. Although this reasoning task is trivial for humans, equipping current NLU models with such a spatial ability remains a challenge. Also, to increase the complexity of this dataset we model several forms of distracting noise. Such noise aims to make the task more difficult and to force machine learning models that are trained on this dataset to be more robust in their inference process.

Figure 1: An example of the generation of a StepGame sample with .

3.1 Template Collection

The aim of this crowdsourcing task is to find out all possible ways we can describe the positional relationship between two entities. The crowdworkers from Amazon Mechanical Turk were provided with an image visually describing the spatial relations of two entities and a request to describe these entities’ relation. This crowdsourcing task was performed in multiple runs. In the first run, we provided crowdworkers with an image and two entities (e.g., A and B) and they were asked to describe their positional relation. From the data collected in this round, we then manually removed bad answers, and showed the remaining good ones as positive examples to crowdworkers in the next run. However, crowdworkers were instructed to avoid repeating them as an answer to our request. We repeated this process until no new templates could be collected. In total, after performing a manual generalization where templates discovered for a relation were translated to the other relations, we collected 23 templates for left and right relations, 27 templates for top and down relations, and 26 templates for top-left, top-right, down-left, and down-right relations.

3.2 Data Generation

The task defined by the StepGame dataset is composed of several story-question pairs written in natural language. In its basic form, the story describes a set of $k$ spatial relations among $k+1$ entities, and it is structured as a list of $k$ sentences each talking about 2 entities. The relations are $k$ and the entities $k+1$ because they define a chain-like shape. The question requests the relative position of two entities among the ones mentioned in the story. An answer is associated with each story-question pair. This answer can take 9 possible values: top-left, top-right, top, left, overlap, right, down-left, down-right, and down, each representing a relative position. The number of edges between the two entities in the question (at most $k$) determines the number of hops a model has to perform in order to get to the correct answer.

To generate a story, we follow three steps, as depicted in Figure 1. Given a value $k$ and a set of entities $E$:

Step 1.

We generate a sequence of $k+1$ entities by sampling $k+1$ unique entities from $E$. Then, for each pair of adjacent entities in the sequence, a spatial relation is sampled, giving $k$ relations in total. These spatial relations can take any of the 8 possible values: top, down, left, right, top-left, top-right, down-left, and down-right. Because the sampling is unconstrained, entities can overlap with each other. This step results in a sequence of linked entities that from now on we will call a chain.

Step 2.

Two of the chain’s entities are then selected at random to be used in the question.

Step 3.

From the chain generated in Step 1, we translate the relations into sentence descriptions in natural language. Each description is based on a randomly sampled crowdsourced template. We then shuffle these sentences to avoid potential distributional biases. These shuffled sentence descriptions are called a story. From the entities selected in Step 2, we then generate a question, also in natural language. Finally, using the chain and the selected entities, we infer the answer to each story-question pair.
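The three steps above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than the released generator: the unit grid offsets in `OFFSETS`, the two placeholder sentences in `TEMPLATES` (standing in for the crowdsourced templates), the question wording, and the function names `label` and `generate_sample` are all hypothetical.

```python
import random

# Assumption: each relation moves one unit on a 2D grid.
OFFSETS = {
    "top": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0),
    "top-left": (-1, 1), "top-right": (1, 1),
    "down-left": (-1, -1), "down-right": (1, -1),
}
# Placeholders for the crowdsourced templates, which are not reproduced here.
TEMPLATES = ["{a} is {rel} of {b}.", "{a} is positioned {rel} of {b}."]


def sign(x):
    return (x > 0) - (x < 0)


def label(dx, dy):
    """Map an accumulated offset to one of the 9 possible answers."""
    vert = {1: "top", -1: "down", 0: ""}[sign(dy)]
    horiz = {1: "right", -1: "left", 0: ""}[sign(dx)]
    if not vert and not horiz:
        return "overlap"
    return "-".join(part for part in (vert, horiz) if part)


def generate_sample(k, entities, rng):
    # Step 1: a chain of k+1 unique entities linked by k sampled relations.
    chain = rng.sample(entities, k + 1)
    relations = [rng.choice(list(OFFSETS)) for _ in range(k)]
    positions = [(0, 0)]
    for rel in relations:  # chain[n + 1] is `rel` of chain[n]
        dx, dy = OFFSETS[rel]
        x, y = positions[-1]
        positions.append((x + dx, y + dy))
    # Step 2: two distinct chain entities are picked for the question.
    i, j = rng.sample(range(k + 1), 2)
    # Step 3: verbalise each relation, shuffle the sentences, infer the answer.
    story = [rng.choice(TEMPLATES).format(a=chain[n + 1], rel=rel, b=chain[n])
             for n, rel in enumerate(relations)]
    rng.shuffle(story)
    question = f"What is the relation of {chain[i]} to {chain[j]}?"
    answer = label(positions[i][0] - positions[j][0],
                   positions[i][1] - positions[j][1])
    return story, question, answer


rng = random.Random(0)
story, question, answer = generate_sample(3, list("ABCDEFGHIJKLMNOPQRSTUVWXYZ"), rng)
```

Because the relations are sampled without constraints, two entities can land on the same grid cell, which is where the `overlap` answer comes from.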

Given this generation process we can quickly calculate the complexity of the task before using the templates. This is possible because entities can overlap. Given relations, entities sampled from in any order (), 8 possible relations between pairs of entities with 2 ways of describing them (), e.g., A is on the left of B or B is on the right of A, a random order of the sentences in the story (), and a question about 2 entities with 2 ways of describing it (), the number of examples that we can generate is equal to:


The complexity of the dataset grows exponentially with $k$. The StepGame dataset uses $|E| = 26$. For $k = 1$ we have 10,400 possible samples, for $k = 2$ we have more than 23 million samples, and so on. The sample complexity of the problem guarantees that, when generating the dataset, the probability of leaking samples from the training set to the test set diminishes as $k$ increases. Please note that these calculations do not include templates. If we were to consider the templates as well, the number of variations of the StepGame would be even larger.
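The two sample counts quoted above can be reproduced with a short calculation. The sketch below is one reading of the counting argument, not necessarily the paper's exact factorisation: it assumes 26 entities and counts the question as an unordered pair of chain entities, because this combination matches both reported figures. The function name `num_samples` is hypothetical.

```python
from math import comb, factorial, perm


def num_samples(k, num_entities=26):
    """Count distinct template-free samples for k relations (assumed reading)."""
    ordered_entities = perm(num_entities, k + 1)  # k+1 entities, in any order
    relation_choices = 16 ** k                    # 8 relations x 2 phrasings per sentence
    sentence_orders = factorial(k)                # random order of the k sentences
    question_pairs = comb(k + 1, 2)               # assumption: unordered entity pair
    return ordered_entities * relation_choices * sentence_orders * question_pairs


for k in (1, 2):
    print(k, num_samples(k))
# 1 10400
# 2 23961600
```

Under this reading the count already exceeds 23 million at two relations, which illustrates why training and test overlap becomes unlikely for larger stories.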

Figure 2: On the left-hand side we have the original chain. Orange entities are those targeted by the question. Next to it, we show the same chain with the addition of noise. In green we represent irrelevant, disconnected and supporting entities.

3.3 Distracting Noise

To make the StepGame more challenging we also include noisy examples in the test set. We assume that when models trained on the non-noisy dataset make mistakes on the noisy test set, these models have failed to learn how to infer spatial relations. We generate three kinds of distracting noise: disconnected, irrelevant, and supporting. Examples of all kinds of noise are provided in Figure 2. The irrelevant noise extends the original chain by branching it out with new entities and relations. The disconnected noise adds to the original chain a new independent chain with new entities and relations. The supporting noise adds to the original chain new entities and relations that may provide alternative reasoning paths. We only add supporting noise to chains with more than 2 entities. No kind of noise has any impact on the correct answer. The type and amount of noise added to each chain are randomly determined. The detailed statistics for each type of distracting noise are provided in the Appendix.
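Two of the three noise types can be illustrated on a chain stored as a list of `(entity, relation, entity)` edges. This is a hedged sketch, not the dataset's generator: the edge representation, the function names `add_irrelevant` and `add_disconnected`, and the `length` parameter are assumptions. Supporting noise is omitted because it must stay geometrically consistent with the existing chain, which needs the entity positions as well.

```python
import random

RELATIONS = ["top", "down", "left", "right",
             "top-left", "top-right", "down-left", "down-right"]


def add_irrelevant(edges, chain_entities, unused, rng, length=1):
    """Branch off the chain by attaching new entities to a random chain entity."""
    edges = list(edges)
    anchor = rng.choice(chain_entities)
    for _ in range(length):
        new = unused.pop()  # a fresh entity that never appears in the chain
        edges.append((new, rng.choice(RELATIONS), anchor))
        anchor = new
    return edges


def add_disconnected(edges, unused, rng, length=1):
    """Add an independent chain that shares no entity with the original one."""
    edges = list(edges)
    prev = unused.pop()
    for _ in range(length):
        new = unused.pop()
        edges.append((new, rng.choice(RELATIONS), prev))
        prev = new
    return edges


rng = random.Random(0)
chain = [("B", "left", "A"), ("C", "top", "B")]  # original 2-hop chain
unused = list("XYZWV")                            # entities not used by the chain
noisy = add_irrelevant(chain, ["A", "B", "C"], unused, rng)
noisy = add_disconnected(noisy, unused, rng)
```

Both functions only append edges that introduce new entities, so the relations between the original chain entities, and therefore the answer, are left untouched.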

4 The TP-MANN Model

Figure 3: The TP-MANN architecture. PE stands for positional encoder and LN for layer normalization. The remaining boxes denote a feed-forward neural network, the outer-product operator, and the inner-product operator. These boxes implement the formulae presented in Section 4. Lines indicate the flow of information; those without an arrow indicate which symbols are taken as input and are output by their box.

In this section we introduce the proposed TP-MANN model, as shown in Figure 3. The model comprises three major components: a question and story encoder, a recurrent memory module, and a relation decoder. The encoder learns to represent entities and relations for each sentence in a story. The recurrent memory module learns to store entity-relation pair representations into the memory independently. It also updates the entity-relation pair representations based on the current memory and stores the inferred information. The decoder learns to represent the question and using the information stored in the memory recurrently infers the spatial relation of the two entities mentioned in the question.

It has also been shown that learned representations in the TPR-like memory could be orthogonal Schlag and Schmidhuber (2018). We use an example to illustrate the inspiration behind this architecture. When a person goes back to her hometown and sees an old tree, her happy childhood memory of playing with her friends under that tree might be recalled. However, this memory may not be recalled unless triggered by the sight of the old tree. In our model, unbinding vectors in the decoder module play the role of the old tree in the example, where unbinding vectors are learned based on the target questions. The decoder module unbinds relevant memories given a question via a recurrent mechanism. Moreover, although memories are stored separately, there are integration processes in brains that retrieve information via a recursive mechanism. This allows episodes in memories to interact with each other Kumaran and McClelland (2012); Schapiro et al. (2017); Koster et al. (2018).


Encoder.

The input of the encoder is a story and a question. Given an input story made of several sentences and a question, both described by words in a vocabulary, each sentence is mapped to learnable word embeddings. Then, a positional encoding (PE) is applied to each word embedding and the results are averaged together, where the position vectors are learnable and are combined with the word embeddings through the element-wise product. This operation defines a matrix in which each row represents an encoded sentence and whose number of columns is the dimension of a word embedding. The input question is converted to a vector in the same way. For each sentence of the story, we learn entity and relation representations as:


where are feed-forward neural networks that output entity representations and are feed-forward neural networks that output relation representations . Finally, we define three search keys as:


where . Keys will be used to manipulate the memory in the next module and retrieve potential existing associations for each entity-relation pair.

Recurrent Memory Module.

To allow stored information to interact with each other, we use a recurrent architecture with recurrent-layers to update the TPR-like memory representation , where contains trainable parameters. Through this recurrent architecture, existing episodes stored in memory can interact with new inferences to generate new episodes. Different from many models like Transformer Vaswani et al. (2017) and graph-based models Kipf and Welling (2017); Velickovic et al. (2018) where adding more layers in the model leads to a larger number of trainable parameters, our model will not increase the number of trainable parameters as the number of recurrent-layers increases.

At each layer , given the keys s, we extract pseudo-entities s for each sentence in . In the first layer (), since there is no previous information existing in memory , the model just converts each sentence in as an episode and stores them in it (). Then at the later layers (), pseudo-entities s build bridges between episodes in the current memory and allow them to interact with potential entity-relation associations.


where . We then construct the memory episodes needed to be updated or removed. This is done after the first storage at so that all story information is already available in . These old episodes, , will be updated or removed to avoid memory conflicts that may occur when receiving new information:


Afterwards, new episodes, and , will be added into the memory:


Then we apply this change to the memory by removing (subtracting) old episodes and adding up the new ones to the now dated memory :


where is a layer normalization.
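The overall shape of this update (subtract the old episodes, add the new ones, apply layer normalization, and repeat with the same weights) can be sketched as follows. This is a schematic stand-in, not the authors' update equations: the memory is reduced to a matrix, the old and new episodes are produced by two hypothetical linear maps `w_old` and `w_new`, and the normalization is taken over the whole memory. The only point it illustrates is that reusing one set of parameters across layers keeps the parameter count fixed.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalise the whole memory to zero mean and unit variance (simplification)."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)


def recurrent_memory(memory, params, num_layers):
    """Apply the same update num_layers times; params are shared by every layer."""
    w_old, w_new = params
    for _ in range(num_layers):
        old = memory @ w_old  # stand-in for the episodes to remove
        new = memory @ w_new  # stand-in for the episodes to add
        memory = layer_norm(memory - old + new)
    return memory


rng = np.random.default_rng(0)
d = 8
params = (0.1 * rng.normal(size=(d, d)), 0.1 * rng.normal(size=(d, d)))
memory0 = rng.normal(size=(d, d))

shallow = recurrent_memory(memory0, params, num_layers=1)
deep = recurrent_memory(memory0, params, num_layers=4)
num_params = sum(p.size for p in params)  # independent of num_layers
```

In a stacked, non-recurrent design each of the four layers would carry its own pair of matrices, so the parameter count would grow linearly with depth.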


Decoder.

The prediction is computed based on the memory constructed at the last layer and the question vector. To do this we follow the same procedure designed by Schlag and Schmidhuber (2018):


where is a feed-forward neural network that outputs a -dimensional unbinding vector, and are feed-forward neural networks that output -dimensional unbinding vectors. Then, the information stored in will be retrieved in a recurrent way based on unbinding vectors learned from the question:


A linear projection with trainable parameters, followed by a softmax function, maps the extracted information to the output. Hence, the decoder module outputs a probability distribution over the terms of the vocabulary.


5 Experiments and Results

In this section we aim to address the following research questions: (RQ1) What is the degree of data leakage in the datasets? (RQ2) How does our model behave with respect to state-of-the-art NLU models in spatial reasoning tasks? (RQ3) How do these models behave when tested on examples more challenging than those used for training? (RQ4) What is the effect of the number of recurrent-layers in the recurrent memory module? Before answering these questions, we first present the material and baselines used in our experiments. The software and data are available at:

5.1 Material and Baselines

In the following experiments we will use two datasets, the bAbI dataset and the StepGame dataset. For the bAbI dataset we only focus on task 17 and task 19 and use the original train and test splits made of samples for the training set and for the validation and test sets. For the StepGame dataset, we generate a training set made of samples with $k$ varying from 1 to 5 in steps of 1, and a test set with $k$ varying from 1 to 10. Moreover, the test set will also contain distracting noise. The final dataset consists of, for each value of $k$, training samples, validation samples, and test samples.

We compare our model against five baselines: Recurrent Relational Networks (RRN) Palm et al. (2018), Relation Network (RN) Santoro et al. (2017), TPR-RNN Schlag and Schmidhuber (2018), Self-attentive Associative Memory (STM) Le et al. (2020), and Universal Transformer (UT) Dehghani et al. (2019). Each model is trained and validated on each dataset independently following the hyper-parameter ranges and procedures provided in their original papers. All training details, including those for our model, are reported in the Appendix.

5.2 Training-Test Leakage

To answer RQ1 we have calculated the degree of data leakage present in the bAbI and StepGame datasets. For task 17, we counted how many samples in the test set also appear in the training set: 23.2% of the test samples are also in the training set. For task 19, for each sample we extracted the relevant sentences in the stories (i.e., those sentences necessary to answer the question correctly) and the questions. Then we counted how many such pairs in the test set appear in the training set: 80.2% of the pairs overlap with pairs in the training set. For the StepGame dataset, for each sample we extracted the sentences in the stories and the questions. The sentences in the story are sorted in lexicographical order. Then we counted how many such pairs in the test set also appear in the training set before adding distracting noise and using the templates: 1.09% of the pairs overlap with pairs in the training set. However, this overlap is entirely produced by the samples with $k = 1$, which, because so few distinct samples exist, have a higher chance of also being included in the test set. If we remove those examples, the overlap between training and test sets drops to 0%.
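The overlap measurement described above amounts to comparing order-insensitive keys built from each story-question pair. The following is a minimal sketch of that idea, assuming each sample is a `(story, question)` tuple with the story given as a list of sentences; the function names `canonical` and `leakage` and the toy data are hypothetical.

```python
def canonical(story, question):
    """Order-insensitive key: sentences sorted lexicographically, plus the question."""
    return (tuple(sorted(story)), question)


def leakage(train, test):
    """Fraction of test samples whose canonical form also occurs in train."""
    seen = {canonical(story, question) for story, question in train}
    hits = sum(canonical(story, question) in seen for story, question in test)
    return hits / len(test)


train = [(["A is left of B.", "B is above C."], "A to C?")]
test = [
    (["B is above C.", "A is left of B."], "A to C?"),  # same story, shuffled
    (["D is left of E."], "D to E?"),                   # unseen story
]
print(leakage(train, test))  # prints 0.5
```

Sorting the sentences before comparison matters because stories are shuffled during generation, so two samples with the same content could otherwise be counted as different.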

5.3 Spatial Inference

To answer RQ2 and judge the spatial inference ability of our model and the baselines we train them on the bAbI and the StepGame datasets and compare them by measuring their test accuracy.

| Model | Task 17 | Task 19 | Mean |
|---|---|---|---|
| RN | 97.33±1.55 | 98.63±1.79 | 97.98 |
| RRN | 97.80±2.34 | 49.80±5.76 | 73.80 |
| STM | 97.80±1.06 | 99.98±0.05 | 98.89 |
| UT | 98.60±3.40 | 93.90±7.30 | 96.25 |
| TPR-RNN | 97.55±1.99 | 99.95±0.06 | 98.75 |
| Ours | 99.88±0.10 | 99.98±0.04 | 99.93 |

Table 1: Test accuracy on tasks 17 and 19 of the bAbI dataset: Mean±Std over 5 runs.

In Table 1 we present the results of our model and the baselines on the task 17 and 19 of the bAbI dataset. The performance of our model is slightly better than the best baseline. However, due to the issues of the bAbI dataset, these results are not enough to firmly answer RQ2.

| Model | $k$=1 | $k$=2 | $k$=3 | $k$=4 | $k$=5 | Mean |
|---|---|---|---|---|---|---|
| RN Santoro et al. (2017) | 22.64±0.25 | 17.08±1.41 | 15.08±2.58 | 12.84±2.27 | 11.52±1.73 | 15.83 |
| RRN Palm et al. (2018) | 24.05±4.48 | 19.98±4.68 | 16.03±2.89 | 13.22±2.51 | 12.31±2.16 | 17.12 |
| UT Dehghani et al. (2019) | 45.11±4.16 | 28.36±4.50 | 17.41±2.18 | 14.07±2.87 | 13.45±1.35 | 23.68 |
| STM Le et al. (2020) | 53.42±3.73 | 35.96±4.45 | 23.03±1.83 | 18.45±1.87 | 15.14±1.56 | 29.20 |
| TPR-RNN Schlag and Schmidhuber (2018) | 70.29±3.03 | 46.03±2.24 | 36.14±2.66 | 26.82±2.64 | 24.77±2.75 | 40.81 |
| Ours | 85.77±3.18 | 60.31±2.23 | 50.18±2.65 | 37.45±4.21 | 31.25±3.38 | 52.99 |

Table 2: Test accuracy on the StepGame dataset: Mean±Std over 5 runs.

| Model | $k$=6 | $k$=7 | $k$=8 | $k$=9 | $k$=10 | Mean |
|---|---|---|---|---|---|---|
| RN Santoro et al. (2017) | 11.12±0.96 | 11.53±0.70 | 11.21±0.98 | 11.13±1.00 | 11.34±0.87 | 11.27 |
| RRN Palm et al. (2018) | 11.62±0.80 | 11.40±0.76 | 11.83±0.75 | 11.22±0.86 | 11.69±1.40 | 11.56 |
| UT Dehghani et al. (2019) | 12.73±2.37 | 12.11±1.52 | 11.40±0.92 | 11.41±0.96 | 11.74±1.07 | 11.88 |
| STM Le et al. (2020) | 13.80±1.95 | 12.63±1.69 | 11.54±1.61 | 11.30±1.13 | 11.77±0.93 | 12.21 |
| TPR-RNN Schlag and Schmidhuber (2018) | 22.25±3.12 | 19.88±2.80 | 15.45±2.98 | 13.01±2.28 | 12.65±2.71 | 16.65 |
| Ours | 28.53±3.59 | 26.45±2.95 | 23.67±2.78 | 22.52±2.36 | 21.46±1.72 | 24.53 |

Table 3: Test accuracy on StepGame for larger values of $k$ (only on the test set). Mean±Std over 5 runs.

In Table 2 we present the results for the StepGame dataset. In this dataset, the training set contains no noise, whereas the test set contains distracting noise. In the table we break down the performance of the trained models across $k$. In the last column we report the average performance across $k$. Our model outperforms all the baseline models. Compared to Table 1, the lower accuracy in Table 2 demonstrates the difficulty of spatial reasoning with distracting noise. It is not surprising that the performance of all five baseline models decreases when $k$ increases, that is, when the number of required inference hops increases. We also report test accuracy on test sets without distracting noise in the Appendix.
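The layout of Tables 2 and 3 (a Mean ± Std cell per value of $k$, plus a final column averaging the per-$k$ means) can be reproduced with the short sketch below. The helper names and the toy accuracies are hypothetical. The source does not say whether the reported Std is the sample or the population standard deviation, so the sample version (`statistics.stdev`) used here is an assumption.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple


def summarize(acc_by_k: Dict[int, List[float]]) -> Dict[int, Tuple[float, float]]:
    """Per-k (mean, sample std) of test accuracy over independent runs."""
    return {k: (mean(runs), stdev(runs)) for k, runs in sorted(acc_by_k.items())}


def overall_mean(acc_by_k: Dict[int, List[float]]) -> float:
    """Average of the per-k means, i.e. the value in the last table column."""
    return mean(mean(runs) for runs in acc_by_k.values())


# Toy accuracies in percent for three runs per k (not results from the paper).
toy_runs = {1: [80.0, 90.0, 85.0], 2: [60.0, 70.0, 65.0]}

for k, (mu, sigma) in summarize(toy_runs).items():
    print(f"k={k}: {mu:.2f} ± {sigma:.2f}")
print(f"Mean: {overall_mean(toy_runs):.2f}")
```

Swapping `stdev` for `statistics.pstdev` gives the population variant if that is the convention being matched.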

5.4 Systematic Generalization

To answer RQ3, we generate new StepGame test sets with $k \in \{6, 7, 8, 9, 10\}$ and distracting noise. We then test all the models jointly trained on the StepGame training set with $k \in \{1, 2, 3, 4, 5\}$, as in Section 5.3. We can consider this experiment a zero-shot setting for larger values of $k$.

In Table 3 we present the performance of the different models on this generalization task. Not surprisingly, the performance of all models tends to degrade as we increase $k$. RN, RRN, UT and STM fail to generalize to the test sets with higher $k$ values, staying at roughly 11–14% accuracy, while our model is more robust and outperforms the baseline models by a large margin. This demonstrates the better generalization ability of our model, which copes better with longer stories never seen during training.

5.5 Inference Analysis

Figure 4: Analysis of the number of recurrent layers in TP-MANN. The x-axis is the number of recurrent layers with which the model has been trained. Each line represents a different value of $k$ in the StepGame dataset.

To answer RQ4, we analyze the hyper-parameter controlling the number of recurrent layers in our model. We jointly train TP-MANN on the StepGame dataset with $k$ between 1 and 5, varying the number of recurrent layers between 1 and 6, and report the test accuracy broken down by $k$. These results are shown in the left-hand panel of Figure 4. The test sets with higher $k$ benefit more from a higher number of recurrent layers than those with lower $k$, indicating that recurrent layers are critical for multi-hop reasoning. We also analyze how the recurrent-layer structure affects systematic generalization. To do this, we also test on StepGame test sets with $k$ between 6 and 10, with noise. These values of $k$ are larger than the largest $k$ used during training. These results are shown in the right-hand panel of Figure 4. Here we see that as the number of recurrent layers increases, the performance of the model improves. This analysis further corroborates that our recurrent structure supports multi-hop inference. It is worth noting that the number of trainable parameters in our model remains unchanged as the number of recurrent layers increases. Interestingly, we find that the number of recurrent layers needed to solve the task is less than the length of the stories, suggesting that the inference process may happen in parallel.
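One way to see why fewer passes than hops can suffice is a purely symbolic analogy, sketched below. It is not TP-MANN and makes no claim about what the network actually learns. It assumes relations are stored as hypothetical `(dx, dy)` grid offsets between entity pairs, and that a single pass composes every pair of stored relations sharing a middle entity at the same time. Under these assumptions the length of the longest resolved chain can double with each pass, so a $k$-hop query needs about $\lceil \log_2 k \rceil$ passes rather than $k$.

```python
from typing import Dict, Tuple

Offset = Tuple[int, int]
Memory = Dict[Tuple[str, str], Offset]


def one_pass(memory: Memory) -> Memory:
    """Compose all stored relations (a, b) and (b, c) into (a, c) in a single sweep."""
    updated = dict(memory)
    for (a, b), (dx1, dy1) in memory.items():
        for (b2, c), (dx2, dy2) in memory.items():
            if b == b2 and a != c and (a, c) not in updated:
                updated[(a, c)] = (dx1 + dx2, dy1 + dy2)
    return updated


def passes_needed(memory: Memory, query: Tuple[str, str], max_passes: int = 10) -> int:
    """Number of passes until the queried pair is in memory, or -1 if it never appears."""
    for n in range(max_passes + 1):
        if query in memory:
            return n
        memory = one_pass(memory)
    return -1


# Toy 4-hop chain: each entity is one step to the right of the next one.
entities = ["A", "B", "C", "D", "E"]
chain: Memory = {(entities[i], entities[i + 1]): (1, 0) for i in range(4)}

print(passes_needed(chain, ("A", "E")))        # 2
print(one_pass(one_pass(chain))[("A", "E")])   # (4, 0)
```

The first pass resolves all 2-hop pairs and the second combines them into pairs up to 4 hops apart, which is why two passes answer a 4-hop query.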

6 Conclusion

In this paper, we proposed a new dataset named StepGame that requires a robust multi-hop spatial reasoning ability to be solved and mitigates the issues observed in the bAbI dataset. Then, we introduced TP-MANN, a tensor product-based memory-augmented neural network architecture that achieves state-of-the-art performance on both datasets. Further analysis also demonstrated the importance of a recurrent memory module for multi-hop reasoning.

References


  • P. Anderson, Q. Wu, D. Teney, J. Bruce, M. Johnson, N. Sünderhauf, I. D. Reid, S. Gould, and A. van den Hengel (2018) Vision-and-language navigation: interpreting visually-grounded navigation instructions in real environments. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp. 3674–3683. External Links: Document Cited by: §1.
  • P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, et al. (2018) Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261. Cited by: §2.1.
  • Y. Bisk, K. J. Shih, Y. Choi, and D. Marcu (2018) Learning interpretable spatial operations in a rich 3d blocks world. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, Cited by: §1, §2.1.
  • C. Chen, Y. Fu, H. Cheng, and S. Lin (2020a) Unseen filler generalization in attention-based natural language reasoning models. In 2020 IEEE Second International Conference on Cognitive Machine Intelligence (CogMI), pp. 42–51. Cited by: §1.
  • H. Chen, A. Suhr, D. Misra, N. Snavely, and Y. Artzi (2019) TOUCHDOWN: natural language navigation and spatial reasoning in visual street environments. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, Cited by: §1, §1.
  • K. Chen, Q. Huang, H. Palangi, P. Smolensky, K. D. Forbus, and J. Gao (2020b) Mapping natural-language problems to formal-language solutions using structured neural representations. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, Proceedings of Machine Learning Research. Cited by: §2.1.
  • M. Dehghani, S. Gouws, O. Vinyals, J. Uszkoreit, and L. Kaiser (2019) Universal transformers. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, Cited by: §2.1, §5.1, Table 2, Table 3.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Cited by: §2.1.
  • D. Ding, F. Hill, A. Santoro, and M. Botvinick (2020) Object-based attention for spatio-temporal reasoning: outperforming neuro-symbolic models with flexible distributed architectures. arXiv preprint arXiv:2012.08508. Cited by: §1.
  • S. Gershman and J. B. Tenenbaum (2015) Phrase similarity in humans and machines. In Proceedings of the 37th Annual Meeting of the Cognitive Science Society, CogSci 2015, Pasadena, California, USA, July 22-25, 2015, Cited by: §2.1.
  • A. Graves, G. Wayne, M. Reynolds, T. Harley, I. Danihelka, A. Grabska-Barwińska, S. G. Colmenarejo, E. Grefenstette, T. Ramalho, J. Agapiou, et al. (2016) Hybrid computing using a neural network with dynamic external memory. Nature. Cited by: §2.1.
  • Q. Huang, P. Smolensky, X. He, L. Deng, and D. Wu (2018) Tensor product generation networks for deep NLP modeling. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana. Cited by: §2.1.
  • M. Janner, K. Narasimhan, and R. Barzilay (2018) Representation learning for grounded spatial reasoning. Transactions of the Association for Computational Linguistics. Cited by: §1, §2.1.
  • T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, Cited by: §4.
  • R. Koster, M. J. Chadwick, Y. Chen, D. Berron, A. Banino, E. Düzel, D. Hassabis, and D. Kumaran (2018) Big-loop recurrence within the hippocampal system supports integration of information across episodes. Neuron 99 (6), pp. 1342–1354. Cited by: §4.
  • G. M. Kruijff, H. Zender, P. Jensfelt, and H. I. Christensen (2007) Situated dialogue and spatial organization: what, where… and why?. International Journal of Advanced Robotic Systems. Cited by: §1.
  • D. Kumaran and J. L. McClelland (2012) Generalization through the recurrent interaction of episodic memories: a model of the hippocampal system.. Psychological review. Cited by: §4.
  • C. Landsiedel, V. Rieser, M. Walter, and D. Wollherr (2017) A review of spatial reasoning and interaction for real-world robotics. Advanced Robotics. Cited by: §1, §2.1.
  • H. Le, T. Tran, and S. Venkatesh (2020) Self-attentive associative memory. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, Proceedings of Machine Learning Research, Vol. 119, pp. 5682–5691. External Links: Link Cited by: §1, §2.1, §5.1, Table 2, Table 3.
  • F. Liu and J. Perez (2017) Gated end-to-end memory networks. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, Valencia, Spain. Cited by: §2.1.
  • R. Mirzaee, H. R. Faghihi, Q. Ning, and P. Kordjamshidi (2021) SPARTQA: a textual question answering benchmark for spatial reasoning. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4582–4598. Cited by: §2.1.
  • R. B. Palm, U. Paquet, and O. Winther (2018) Recurrent relational networks. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, S. Bengio, H. M. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 3372–3382. External Links: Link Cited by: §1, §2.1, §5.1, Table 2, Table 3.
  • M. R. L. Petruck and M. J. Ellsworth (2018) Representing spatial relations in FrameNet. In Proceedings of the First International Workshop on Spatial Language Understanding, New Orleans, pp. 41–45. External Links: Document, Link Cited by: §1.
  • J. Pustejovsky, P. Kordjamshidi, M. Moens, A. Levine, S. Dworman, and Z. Yocum (2015) SemEval-2015 task 8: SpaceEval. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), Denver, Colorado. External Links: Link Cited by: §1.
  • J. Pustejovsky (1989) Language and spatial cognition. Computational Linguistics 15 (3). External Links: Link Cited by: §2.1.
  • A. Santoro, D. Raposo, D. G. T. Barrett, M. Malinowski, R. Pascanu, P. W. Battaglia, and T. Lillicrap (2017) A simple neural network module for relational reasoning. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett (Eds.), pp. 4967–4976. External Links: Link Cited by: §2.1, §5.1, Table 2, Table 3.
  • A. C. Schapiro, N. B. Turk-Browne, M. M. Botvinick, and K. A. Norman (2017) Complementary learning systems within the hippocampus: a neural network modelling approach to reconciling episodic memory with statistical learning. Philosophical Transactions of the Royal Society B: Biological Sciences. Cited by: §4.
  • I. Schlag, T. Munkhdalai, and J. Schmidhuber (2021) Learning associative inference using fast weight memory. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021, Cited by: §2.1, §2.1.
  • I. Schlag and J. Schmidhuber (2018) Learning to reason with third order tensor products. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, S. Bengio, H. M. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 10003–10014. External Links: Link Cited by: §2.1, §4, §4, §5.1, Table 2, Table 3.
  • P. Smolensky (1990) Tensor product variable binding and the representation of symbolic structures in connectionist systems. Artificial intelligence 46 (1-2), pp. 159–216. Cited by: §1, §2.1.
  • S. Sukhbaatar, A. Szlam, J. Weston, and R. Fergus (2015) End-to-end memory networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pp. 2440–2448. External Links: Link Cited by: §2.1.
  • A. Talmor and J. Berant (2018) The web as a knowledge-base for answering complex questions. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), M. A. Walker, H. Ji, and A. Stent (Eds.), pp. 641–651. External Links: Document Cited by: §2.1.
  • H. Tan and M. Bansal (2018) Source-target inference models for spatial instruction understanding. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, S. A. McIlraith and K. Q. Weinberger (Eds.), pp. 5504–5511. External Links: Link Cited by: §1, §2.1.
  • B. Tversky (2019) Mind in motion: how action shapes thought. Hachette UK. Cited by: §2.1.
  • B. van Aken, B. Winter, A. Löser, and F. A. Gers (2019) How does BERT answer questions?: A layer-wise analysis of transformer representations. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM, Cited by: §1, §2.1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems, Cited by: §2.1, §4.
  • P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio (2018) Graph attention networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, External Links: Link Cited by: §4.
  • A. Vogel and D. Jurafsky (2010) Learning to follow navigational directions. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden, pp. 806–814. External Links: Link Cited by: §1.
  • J. Welbl, P. Stenetorp, and S. Riedel (2018) Constructing datasets for multi-hop reading comprehension across documents. Transactions of the Association for Computational Linguistics 6, pp. 287–302. Cited by: §2.1.
  • J. Weston, A. Bordes, S. Chopra, and T. Mikolov (2016) Towards ai-complete question answering: A set of prerequisite toy tasks. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), External Links: Link Cited by: §1, §2.1, §3.
  • T. Yang, A. Lan, and K. Narasimhan (2020) Robust and interpretable grounding of spatial references with relation networks. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online, pp. 1908–1923. External Links: Document, Link Cited by: §1, §2.1.
  • Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. W. Cohen, R. Salakhutdinov, and C. D. Manning (2018) HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pp. 2369–2380. External Links: Document Cited by: §2.1.