Systematic Generalization on gSCAN with Language Conditioned Embedding

09/11/2020 ∙ by Tong Gao, et al. ∙ The University of Texas at Austin

Systematic Generalization refers to a learning algorithm's ability to extrapolate learned behavior to unseen situations that are distinct but semantically similar to its training data. As shown in recent work, state-of-the-art deep learning models fail dramatically even on tasks for which they are designed when the test set is systematically different from the training data. We hypothesize that explicitly modeling the relations between objects in their contexts while learning their representations will help achieve systematic generalization. Therefore, we propose a novel method that learns objects' contextualized embeddings with dynamic message passing conditioned on the input natural language and end-to-end trainable with other downstream deep learning modules. To our knowledge, this model is the first one that significantly outperforms the provided baseline and reaches state-of-the-art performance on grounded-SCAN (gSCAN), a grounded natural language navigation dataset designed to require systematic generalization in its test splits.




1 Introduction

Systematic Generalization refers to a learning algorithm’s ability to extrapolate learned behavior to unseen situations that are distinct but semantically similar to its training data. It has long been recognized as a key aspect of humans’ cognitive capacities (Fodor et al., 1988). Specifically, humans’ mastery of systematic generalization is prevalent in grounded natural language understanding. For example, humans can reason about the relations between all pairs of concepts from two domains, even if they have only seen a small subset of pairs during training. If a child observes “red squares”, “green squares” and “yellow circles”, he or she can recognize “red circles” at their first encounter. Humans can also contextualize their reasoning about objects’ attributes. For example, a city being referred to as “the larger one” within a state might be referred to as “the smaller one” nationwide.

In the past decade, deep neural networks have shown tremendous success on a collection of grounded natural language processing tasks, such as visual question answering (VQA), image captioning, and vision-and-language navigation (Hudson and Manning, 2018; Anderson et al., 2018a, b). Despite this success, recent literature shows that current deep learning approaches exploit statistical patterns discovered in the datasets to achieve high performance, an approach that does not achieve systematic generalization. Gururangan et al. (2018) discovered that annotation artifacts like negation words or purpose clauses in natural language inference data can be used by a simple text classification model to solve the given task. Jia and Liang (2017) demonstrated that adversarial examples can fool reading comprehension systems. Indeed, deep learning models often fail to achieve systematic generalization even on tasks on which they are claimed to perform well. As shown by Bahdanau et al. (2018), state-of-the-art Visual Question Answering (VQA) models (Hudson and Manning, 2018; Perez et al., 2018) fail dramatically even on a synthetic VQA dataset designed with systematic differences between training and test sets.

In this work, we focus on approaching systematic generalization in grounded natural language understanding tasks. We experiment with a recently introduced synthetic dataset, grounded SCAN (gSCAN), that requires systematic generalization to solve (Ruis et al., 2020). For example, after observing how to “walk hesitantly” to a target object in a grid world, the learning agent is tested with an instruction that requires it to “pull hesitantly”, thereby testing its ability to generalize adverbs to unseen adverb-verb combinations.

When presented with a world of objects with different attributes, and natural language sentences that describe such objects, the goal of the model is to generalize its ability to understand unseen sentences describing novel combinations of observed objects, or even novel objects with observed attributes. One of the essential steps in achieving this goal is to obtain good object embeddings to which natural language can be grounded. By considering each object as a bag of its descriptive attributes, this problem is further transformed into learning good representations for those attributes based on the training data. This requires: 1) learning good representations of attributes whose actual meanings are contextualized, for example, “smaller” and “lighter”; 2) learning good representations for attributes so that conceptually similar attributes, e.g., “yellow” and “red”, have similar representations.

We hypothesize that explicitly modeling the relation between objects in their contexts, i.e., learning contextualized object embeddings, will help achieve systematic generalization. This is intuitively helpful for learning concepts with contextualized meaning, just as learning to recognize the “smaller” object in a novel pair requires experience of comparison between semantically similar object pairs. Learning contextualized object embeddings can also be helpful for obtaining good representations for semantically similar concepts when such concepts are the only differences between two contexts. Inspired by Hu et al. (2019), we propose a novel method that learns objects’ contextualized embeddings with dynamic message passing conditioned on the input natural language. At each round of message passing, our model collects relational information between each object pair, and constructs an object’s contextualized embedding as a weighted combination of them. Such weights are dynamically computed conditioned on the input natural language sentence.

The contextualized object embedding scheme is trained end-to-end with downstream deep modules for specific grounded natural language processing tasks, such as navigation. Experiments show that our approach significantly outperforms a strong baseline on gSCAN.

2 Related Work

Research on deep learning models’ systematic generalization behavior has gained traction in recent years, with particular focus on natural language processing tasks.

2.1 Compositionality

An idea that is closely related to systematic generalization is compositionality. Kamp and Partee (1995) phrased the principle of compositionality as “The meaning of a whole is a function of the meanings of the parts and of the way they are syntactically combined”. Hupkes et al. (2020) synthesize different interpretations of this abstract principle into 5 theoretically grounded tests to evaluate a model’s ability to represent compositionality: 1) Systematicity: if the model can systematically recombine known parts and rules; 2) Productivity: if the model can extend its predictions beyond what it has seen in the training data; 3) Substitutivity: if the model is robust to synonym substitutions; 4) Localism: if the model’s composition operations are local or global; and 5) Overgeneralisation: if the model favors rules or exceptions during training. The gSCAN dataset focuses more on capturing the first three tests in a grounded natural language understanding setting, and our proposed model achieves significant performance improvement on test sets relating to systematicity and substitutivity.

2.2 Systematic Generalization Datasets

Many systematic generalization datasets have been proposed in recent years (Bahdanau et al., 2018; Chevalier-Boisvert et al., 2018; Hill et al., 2019; Lake and Baroni, 2017; Ruis et al., 2020). This paper is conceptually most related to the SQOOP dataset proposed by Bahdanau et al. (2018), the SCAN dataset proposed by Lake and Baroni (2017), and the gSCAN dataset proposed by Ruis et al. (2020).

The SQOOP dataset consists of a random number of MNIST-style alphanumeric characters scattered in an image with specific spatial relations (“left”, “right”, “up”, “down”) among them (Bahdanau et al., 2018). The algorithm is tested with a binary decision task of reasoning about whether a specific relation holds between a pair of alphanumeric characters. Systematic difference is created between the testing and training set by only providing supervision on relations for a subset of character pairs to the learner, while testing its ability to reason about relations between unseen alphanumeric character pairs. For example, the algorithm is tested with questions like “is S above T” while it never sees a relation involving both S and T during training. Therefore, to fully solve this dataset, it must learn to generalize its understanding of the relation “above” to unseen pairs of characters.

Lake and Baroni (2017) proposed the SCAN dataset and its related benchmark that tests a learning algorithm’s ability to perform compositional learning and zero-shot generalization on a natural language command translation task. Given a natural language command with a limited vocabulary, an algorithm needs to translate it into a corresponding action sequence consisting of action tokens from a finite token set. Compared to SQOOP, SCAN tests the algorithm’s ability to learn more complicated linguistic generalizations, like generalizing from “walk around left” to “walk around right”. SCAN also ensures that the target action sequence is unique, and that an oracle solution exists, by providing an interpreter function that can unambiguously translate any given command to its target action sequence.
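To make the idea of an interpreter function concrete, the following is a toy sketch in Python. It is not the SCAN grammar or its released interpreter: the supported vocabulary, the output token names (`WALK`, `LTURN`, and so on), and the function name `interpret` are assumptions chosen for illustration. It only handles a single verb with an optional direction, `around` or `opposite`, and `twice` or `thrice`.

```python
from typing import List

# Toy, SCAN-style command interpreter. Token names are hypothetical.
VERBS = {"walk": "WALK", "run": "RUN", "jump": "JUMP", "look": "LOOK"}
TURNS = {"left": "LTURN", "right": "RTURN"}
REPEATS = {"twice": 2, "thrice": 3}


def interpret(command: str) -> List[str]:
    """Deterministically map a small command to an action-token sequence."""
    words = command.split()
    repeat = 1
    if words and words[-1] in REPEATS:
        repeat = REPEATS[words.pop()]
    if not words or words[0] not in VERBS:
        raise ValueError(f"unsupported command: {command!r}")
    action = VERBS[words[0]]
    rest = words[1:]
    if not rest:
        base = [action]
    elif len(rest) == 1 and rest[0] in TURNS:
        # "walk left": turn once, then act.
        base = [TURNS[rest[0]], action]
    elif len(rest) == 2 and rest[0] == "opposite" and rest[1] in TURNS:
        # "walk opposite left": turn twice, then act.
        base = [TURNS[rest[1]]] * 2 + [action]
    elif len(rest) == 2 and rest[0] == "around" and rest[1] in TURNS:
        # "walk around left": turn and act four times.
        base = [TURNS[rest[1]], action] * 4
    else:
        raise ValueError(f"unsupported command: {command!r}")
    return base * repeat
```

Because every command maps to exactly one token sequence, a model's output can be checked against the interpreter's output token by token.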

Going beyond SCAN, which focuses purely on syntactic aspects of systematic generalization, the gSCAN dataset proposed by Ruis et al. (2020) is an extension of SCAN. It contains a series of systematic generalization tasks that require the learning agent to ground its understanding of natural language commands in a given grid world to produce the correct action token sequence. We choose gSCAN as our benchmark dataset, as its input command sentences are linguistically more complex, and solving it requires processing multi-modal input.

2.3 Systematic Generalization Algorithms

Bahdanau et al. (2018) demonstrated that modular networks, with a carefully chosen module layout, can achieve nearly perfect systematic generalization on the SQOOP dataset. Our approach can be considered a conceptual generalization of theirs. Each object’s initial embedding can be considered a simple affine encoder module, and we learn the connection scheme among these modules conditioned on natural language instead of hand-designing it. Gordon et al. (2019) proposed to solve the SCAN benchmark by hard-coding their model to be equivariant to all permutations of SCAN’s verb primitives. Andreas (2020) proposed GECA (“Good-Enough Compositional Augmentation”), which systematically augments the SCAN dataset by identifying sentence fragments with similar syntactic context, and permuting them to generate novel training examples. This line of permutation-invariant approaches has been shown not to generalize well on the gSCAN dataset (Ruis et al., 2020). To the best of our knowledge, our method is the first one to outperform the strong baseline provided in the gSCAN benchmark, and also the first one to apply language-conditioned message passing to learn contextualized input embeddings for systematic generalization tasks.

3 Problem Definition & Algorithm

3.1 Task Definition

gSCAN contains a series of systematic generalization tasks in a grounded natural language understanding setting. In gSCAN, the learning agent is tested on following a given natural language instruction to navigate in a two-dimensional grid world with objects. It does so by generating a sequence of action tokens, drawn from a finite action token set, that brings the agent from its starting location to the target location. An object in gSCAN’s world state is encoded with a one-hot encoding describing its attributes in three property types: 1) color, 2) shape, and 3) size. The agent is also encoded as an “object” in the grid world, with properties including its orientation and a binary variable denoting the presence of the agent. Therefore, the whole grid is represented as a three-dimensional tensor: two dimensions index the cells of the grid, and the third holds the per-cell property encoding. Mathematically, given an input tuple consisting of the navigation instruction and the grid world state, the agent needs to predict the correct output action token sequence.

Despite its simple form, this task is quite challenging. For one, generating the correct action token sequence requires understanding the instruction within the context of the agent’s current grid world. It also involves connecting specific instructions with complex dynamic patterns. As an example, “pulling” a square will be mapped to a “pull” command when the square has a size of 1 or 2, but to “pull pull” when the square has a size of 3 or 4 (a “heavy” square); “move cautiously” requires the agent to turn left and turn right once each before making the actual move. gSCAN also introduces a series of test sets that have systematic differences from the training set. Computing the correct action token sequences on these test sets requires the model to learn to combine seen concepts into novel combinations, including novel object property combinations and novel contextual references.
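The per-cell encoding described above can be sketched as follows. This is a minimal illustration, not the dataset's actual layout: the property values are the ones mentioned in the text, but their ordering, the orientation names, the channel order, and the helper names `encode_cell` and `encode_grid` are assumptions.

```python
from typing import Dict, List, Optional, Tuple

# Property vocabularies. Ordering and channel layout are assumptions.
COLORS = ["red", "green", "blue", "yellow"]
SHAPES = ["circle", "square", "cylinder"]
SIZES = [1, 2, 3, 4]
ORIENTATIONS = ["north", "east", "south", "west"]  # assumed naming

# One block per property type, plus one binary agent-presence channel.
CHANNELS = len(COLORS) + len(SHAPES) + len(SIZES) + len(ORIENTATIONS) + 1


def one_hot(value, vocabulary) -> List[float]:
    """All zeros when value is None or not in the vocabulary."""
    return [1.0 if value == v else 0.0 for v in vocabulary]


def encode_cell(obj: Optional[dict] = None,
                agent_dir: Optional[str] = None) -> List[float]:
    """Encode one grid cell; an empty cell is the all-zero vector."""
    color = obj["color"] if obj else None
    shape = obj["shape"] if obj else None
    size = obj["size"] if obj else None
    vec = one_hot(color, COLORS) + one_hot(shape, SHAPES) + one_hot(size, SIZES)
    vec += one_hot(agent_dir, ORIENTATIONS)
    vec.append(1.0 if agent_dir is not None else 0.0)
    return vec


def encode_grid(d: int, objects: Dict[Tuple[int, int], dict],
                agent_pos: Tuple[int, int], agent_dir: str):
    """Return a d x d x CHANNELS nested list describing the world state."""
    grid = [[encode_cell() for _ in range(d)] for _ in range(d)]
    for (row, col), obj in objects.items():
        grid[row][col] = encode_cell(obj)
    row, col = agent_pos
    grid[row][col] = encode_cell(objects.get(agent_pos), agent_dir)
    return grid
```

For example, `encode_grid(3, {(0, 1): {"color": "red", "shape": "square", "size": 2}}, (2, 2), "north")` yields a 3 x 3 grid in which only two cells are non-zero: the red square and the agent.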

Figure 1: Model Overview

3.2 Algorithm Definition

The overview of our model architecture is shown in Figure 1.

3.2.1 Input Extraction

Given the input sentence and the grid world state, we first project them into higher dimensional embeddings. Following the practice of Ruis et al. (2020) and Hu et al. (2019), we represent each word of the input instruction by its embedding vector and feed the resulting sequence to a Bi-LSTM, obtaining a hidden state for every word and a summary vector for the whole instruction. The hidden state for a word is the concatenation of the forward and backward LSTM hidden states for that word (in our notation, a semicolon represents concatenation). For each round of message passing between the object embeddings, we further apply a transformation using a multi-step textual attention module similar to that of Hudson and Manning (2018) and Hu et al. (2018) to extract the round-specific textual context. Using a round-specific projection matrix, we compute a textual attention score for each word at each message passing round. The final textual context embedding for a round is obtained by applying these attention scores to the word hidden states.

Details of the message passing mechanism are described in Section 3.2.2.
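The round-specific textual attention can be sketched as below. The original formulas are not reproduced here, so the exact form is an assumption: the summary vector is projected by a round-specific matrix into a query, dot-product scores against the word hidden states are normalized with a softmax, and the context is the score-weighted sum of hidden states. The names `textual_context` and `w_round` are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def softmax(logits: Vector) -> Vector:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def textual_context(hidden: List[Vector], summary: Vector,
                    w_round: Matrix) -> Tuple[Vector, Vector]:
    """Return (attention scores over words, textual context) for one round."""
    # Round-specific query derived from the sentence summary (assumed form).
    query = matvec(w_round, summary)
    logits = [sum(h * q for h, q in zip(h_s, query)) for h_s in hidden]
    scores = softmax(logits)
    dim = len(hidden[0])
    # Context is the attention-weighted sum of the word hidden states.
    context = [sum(a * h_s[k] for a, h_s in zip(scores, hidden))
               for k in range(dim)]
    return scores, context
```

Using a different `w_round` per round lets each round of message passing focus on different words of the same instruction.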
As for the grid-world representation, from each grid cell we extract one-hot representations of color, shape, size, and agent orientation, and embed each property with a 16-dimensional vector. We finally concatenate them into one vector and use this vector as the object’s local embedding.

3.2.2 Language-conditioned Message Passing

After extracting textual context embeddings and objects’ local embeddings, we perform language-conditioned iterative message passing for a fixed number of rounds to obtain the contextualized object embeddings, where the number of rounds is a hyper-parameter.

1) At each round, we first construct a fused representation of each object by concatenating its local embedding, its context embedding from the previous round, and their element-wise product. We use an object’s local embedding to initialize its context embedding at round 0.

2) For each pair of objects, we use their fused representations, together with this round’s textual context embedding, to compute their message passing weight. Note that the computation of the raw weight logits is asymmetric.

3) We consider all the objects in a grid world as nodes, and together they form a complete graph. Each node computes a message to each receiver node, and each receiver node updates its context embedding using the messages it receives.

After the final round of iterative message passing, we obtain the final contextualized embedding for each object.
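The three steps above can be sketched in plain Python as follows. This is a sketch under stated assumptions, not the authors' implementation: the weight logits are a dot product between a projected receiver representation and a text-modulated sender representation, messages are text-modulated projections of the sender, the update is a linear map of the old context concatenated with the aggregated messages, parameters are shared across rounds, and self-edges are included. All parameter names are hypothetical, and random matrices stand in for learned weights.

```python
import math
import random
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def hadamard(a: Vector, b: Vector) -> Vector:
    return [x * y for x, y in zip(a, b)]


def softmax(logits: Vector) -> Vector:
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def make_params(dim: int, text_dim: int, rng: random.Random) -> Dict[str, Matrix]:
    """Random stand-ins for learned weights (names and shapes are hypothetical)."""
    def mat(rows: int, cols: int) -> Matrix:
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return {"query": mat(dim, 3 * dim), "key": mat(dim, 3 * dim),
            "msg": mat(dim, 3 * dim), "text_key": mat(dim, text_dim),
            "text_msg": mat(dim, text_dim), "update": mat(dim, 2 * dim)}


def message_passing_round(local, context, text_ctx, p):
    n = len(local)
    # 1) Fuse local embedding, previous context, and their element-wise product.
    fused = [loc + ctx + hadamard(loc, ctx) for loc, ctx in zip(local, context)]
    # 2) Sender keys and messages are modulated by this round's textual context.
    text_key = matvec(p["text_key"], text_ctx)
    text_msg = matvec(p["text_msg"], text_ctx)
    queries = [matvec(p["query"], f) for f in fused]
    keys = [hadamard(matvec(p["key"], f), text_key) for f in fused]
    messages = [hadamard(matvec(p["msg"], f), text_msg) for f in fused]
    new_context, weights = [], []
    for i in range(n):  # receiver i attends over all senders j
        logits = [sum(q * k for q, k in zip(queries[i], keys[j])) for j in range(n)]
        w_i = softmax(logits)  # asymmetric: logit(i, j) != logit(j, i) in general
        # 3) Aggregate weighted messages, then update the receiver's context.
        incoming = [sum(w_i[j] * messages[j][k] for j in range(n))
                    for k in range(len(messages[0]))]
        new_context.append(matvec(p["update"], context[i] + incoming))
        weights.append(w_i)
    return new_context, weights


def contextualize(local, text_contexts, p):
    """Run one round per textual context; round 0 context is the local embedding."""
    context = [list(x) for x in local]
    for text_ctx in text_contexts:
        context, _ = message_passing_round(local, context, text_ctx, p)
    return context
```

Because the softmax runs over senders for each receiver, the weights form a row-stochastic matrix that changes with the instruction, which is what makes the resulting graph "dynamic".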
3.2.3 Encoding the Grid World

After obtaining contextualized embeddings for all objects in a grid world, we map them back to their locations in the grid world, and construct a new grid world representation by zero-padding cells without any object. This is then fed into three parallel convolutional networks with different kernel sizes to obtain a grid world’s embedding at multiple scales, as done by Wang and Lake (2019). The final grid world encoding is built from the outputs of the three convolutional networks.
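The scatter-and-encode step can be illustrated with the sketch below. It makes several assumptions that are not in the text: fixed box (sum) filters stand in for the learned convolutional networks, the kernel sizes `(1, 3, 5)` are placeholders, and the three outputs are combined by channel-wise concatenation. The function names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Grid = List[List[List[float]]]


def place_on_grid(d: int, dim: int,
                  embeddings: Dict[Tuple[int, int], Sequence[float]]) -> Grid:
    """Put each object's embedding at its cell; cells without objects stay zero."""
    grid = [[[0.0] * dim for _ in range(d)] for _ in range(d)]
    for (row, col), vec in embeddings.items():
        grid[row][col] = list(vec)
    return grid


def box_filter(grid: Grid, kernel: int) -> Grid:
    """Zero-padded sum over a kernel x kernel window (stand-in for a learned conv)."""
    d, dim, half = len(grid), len(grid[0][0]), kernel // 2
    out = [[[0.0] * dim for _ in range(d)] for _ in range(d)]
    for r in range(d):
        for c in range(d):
            for rr in range(max(0, r - half), min(d, r + half + 1)):
                for cc in range(max(0, c - half), min(d, c + half + 1)):
                    for k in range(dim):
                        out[r][c][k] += grid[rr][cc][k]
    return out


def multi_scale_encoding(grid: Grid, kernels: Sequence[int] = (1, 3, 5)) -> Grid:
    """Concatenate the per-cell outputs of all scales along the channel axis."""
    scales = [box_filter(grid, k) for k in kernels]
    d = len(grid)
    return [[sum((s[r][c] for s in scales), []) for c in range(d)]
            for r in range(d)]
```

With a single object at the center of a 3 x 3 grid, the corner cell sees nothing at kernel size 1 but sees the object at kernel sizes 3 and 5, which is the sense in which larger kernels capture wider context.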

3.2.4 Decoding Action Sequences

We use a Bi-LSTM with multi-modal attention to both the grid world embedding and the input instruction embedding to decode the final action sequence, following the baseline model provided by Ruis et al. (2020). At each step, the hidden state of the decoder is computed from three inputs: the embedding of the previous output action token, the instruction context computed with attention over the textual encoder’s hidden states, and the grid world context computed with attention over all locations in the grid world embedding. We use the attention implementation proposed by Bahdanau et al. (2014) for both contexts; each attention module has its own learnable parameters. The distribution of the next action token can then be computed from the decoder state.
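For reference, one decoding step with additive attention in the style of Bahdanau et al. (2014) can be written as below. The symbols are assumptions introduced here rather than the paper's notation: $h^{d}_{i}$ is the decoder state at step $i$, $e_{i-1}$ the embedding of the previous action token, $h^{c}_{j}$ the encoder hidden state of word $j$, $c^{c}_{i}$ and $c^{s}_{i}$ the instruction and grid world contexts, and $v_{c}$, $W_{c}$, $W_{o}$ learnable parameters. The grid world context $c^{s}_{i}$ would be obtained analogously by attending over grid locations with a separate parameter pair.

```latex
\begin{aligned}
u_{ij} &= v_{c}^{\top} \tanh\!\left(W_{c}\left[h^{d}_{i-1};\, h^{c}_{j}\right]\right), \\
\alpha_{ij} &= \frac{\exp(u_{ij})}{\sum_{j'} \exp(u_{ij'})}, \\
c^{c}_{i} &= \sum_{j} \alpha_{ij}\, h^{c}_{j}, \\
h^{d}_{i} &= \mathrm{LSTM}\!\left(\left[e_{i-1};\, c^{c}_{i};\, c^{s}_{i}\right],\, h^{d}_{i-1}\right), \\
p\left(y_{i} \mid x,\, y_{<i}\right) &= \mathrm{softmax}\!\left(W_{o}\, h^{d}_{i}\right).
\end{aligned}
```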

4 Experimental Evaluation

Split Baseline Our Model Description
A: Random Randomly split test sets
B: Novel Direction Target object is to the South-West of the agent
C: Relativity Target object is a size 2 circle, referred to with the small modifier
D: Red Squares Red squares are the target object
E: Yellow Squares Yellow squares are referred to with a color and a shape at least
F: Adverb to verb All examples with the adverb ’while spinning’ and the verb ’pull’
G: Class Inference All examples where the agent needs to push a square of size 3
Table 1: Experimental Results

4.1 Methodology & Implementation Details

We run experiments to test the hypothesis that contextualized embeddings help systematic generalization. Since this task has a limited vocabulary size, word-level accuracy is no longer a proper metric to reflect the model’s performance. We follow the baseline and use the exact match percentage as our metric, where an exact match means that the produced action token sequence is exactly the same as the target sequence. We compare our model with the baseline on different test sets, and use early stopping based on the exact match score on the validation set. We set the learning rate to 1e-4, decaying by 0.9 every 20,000 steps. We choose the number of message passing iterations to be 4. Our model is trained for 6 separate runs, and the average performance as well as the standard deviation are reported. Our encoder/decoder model is implemented in PyTorch (Paszke et al., 2017) and the message passing graph network is backed by DGL (Wang et al., 2019). For comparison, we use the test sets, validation set, and baseline model released by Ruis et al. (2020).
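The evaluation protocol above can be sketched with the standard library alone. Two details are assumptions rather than facts from the text: the decay is applied as a staircase (once every 20,000 steps, not continuously), and the reported standard deviation is the sample standard deviation. The function names are hypothetical.

```python
import statistics
from typing import List, Sequence, Tuple


def exact_match_percentage(predictions: Sequence[Sequence[str]],
                           targets: Sequence[Sequence[str]]) -> float:
    """Percentage of examples whose predicted token sequence equals the target."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(1 for p, t in zip(predictions, targets) if list(p) == list(t))
    return 100.0 * hits / len(targets)


def learning_rate(step: int, base_lr: float = 1e-4, decay: float = 0.9,
                  decay_every: int = 20_000) -> float:
    """Staircase schedule (assumed): multiply by `decay` every `decay_every` steps."""
    return base_lr * decay ** (step // decay_every)


def summarize_runs(scores: List[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation over independent training runs."""
    return statistics.mean(scores), statistics.stdev(scores)
```

Note that a prediction that gets every token right except one still counts as a miss, which is why exact match is much stricter than token-level accuracy on long action sequences.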

4.2 Results

Table 1 shows our experiment results on 7 different test sets. In the following sections, we present the results on each systematic generalization test split, and also introduce the configuration of test splits. Note that test split A is a random split set that has no systematic difference from the training set.

Split B: This tests the model’s ability to generalize to navigation in a novel direction. For example, a testing example would require the agent to move to a target object that is to its south-west, even though during training target objects are never placed south-west of the agent. Although our model manages to predict some correct action sequences compared to the baseline’s complete failure, our model still fails on the majority of cases. We further analyze the failure on Split B in the discussion section.

Split C, G: Split C tests the model’s ability to generalize to novel contextual references. In the training set, a circle of size 2 is never referred to as “the small circle”, while in the test set the agent needs to generalize the notion “small” to it based on its size comparison with other circles in the grid world. The message passing mechanism helps the model comprehend the relative sizes of objects, and boosts the performance on split C. In addition, our model shows promising results on exploring the interrelationship between an agent and other objects in the scene, as well as learning abstract concepts by contextual comparison, as shown in split G. This test split asks the model to push a square of size 3. An object with a size of 3 or 4 is defined as “heavy”, according to the configuration, and requires two consecutive push/pull actions applied to it before it actually moves. The challenge here is that the model has been trained to “pull” heavy squares and to “push” squares of size 4, but was never trained to “push” a size-3 square. Thus, it needs to generalize the concept of “heavy” and act accordingly.
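The “heavy” rule that split G probes can be written down in a few lines. This is a sketch of the rule as stated in the text (sizes 3 and 4 need two actions per cell moved); the function name `interaction_tokens` and the `cells` parameter are hypothetical.

```python
from typing import List

HEAVY_SIZES = {3, 4}  # sizes defined as "heavy" in the text


def interaction_tokens(verb: str, size: int, cells: int = 1) -> List[str]:
    """Action tokens needed to push or pull an object `cells` grid cells."""
    if verb not in ("push", "pull"):
        raise ValueError(f"unsupported verb: {verb!r}")
    # A heavy object needs two consecutive actions before it moves one cell.
    per_cell = 2 if size in HEAVY_SIZES else 1
    return [verb] * (per_cell * cells)
```

The rule itself is trivial; the difficulty for a learned model is that it never sees the specific pairing of “push” with a size-3 square and must infer that size 3 belongs to the same class as size 4.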

Split D, E: Split D and E are similar, as they both define the target object with novel combinations of color and shape. Split E is generally easier because the target object, a yellow square, appears as the target in training examples, but is only referred to as “the square”, “the smaller square”, or “the bigger square”. Split D increases the difficulty by referring to the red square, which never appears in the training set as a target but does appear as a background object. We find that while the baseline model understands the concept of “square”, it gets confused by target objects with a new color-shape combination. In contrast, our model can generalize to novel compositions of object properties and correctly find the target object, performing significantly better on these two splits.

Split F: This split is designed to test the model’s ability to generalize to novel adverb-verb combinations, where the model is tested under different situations but always with the terms “while spinning” and “pull” in the commands. The two terms never appear together in the training set, so the model needs to generalize to this novel combination of adverb and verb. The results show that our model does slightly better than the baseline, but suffers from high variance across runs.

Figure 2: While the target is correctly chosen, the baseline did not stop pushing even after encountering an obstacle.

4.3 Discussion

Figure 3: Baseline cannot distinguish the correct square from similar candidates.

Model Comparison. We reveal the strength of our model by analyzing two test examples where it succeeds and the baseline fails. For each example, we visualize the grid world that the agent is in, where each cell is colored with different grey-scale levels indicating its assigned attention score.

Figure 2 from split G visualizes the prediction sequence as well as the attention weights generated by the baseline. The baseline attends to the position of the target object but is unable to capture the dynamic relationship between the target object and the green cylinder. It tries to push the target object over it, while our model correctly predicts the incoming collision and stops at the right time.

Another example on which our model outperforms the baseline is shown in Figure 3. The baseline model incorrectly attends to two small blue squares and picks one as the target rather than the correct small red square. Note that the model has seen blue and green squares as targets in the training set, but has never seen a red square. This is a common mistake since the baseline struggles to choose target objects with novel property combinations when there are similar objects in the scene that were seen during training. On the contrary, our model handles these cases well, demonstrating its ability to generalize to novel color-shape combinations with the help of contextualized object embeddings.

Split Base Full
Table 2: Ablation Study

Ablation Study. We conduct an ablation study to test the significance of the language-conditioned message passing component in our network. We built a base model whose architecture is the same as our full model, except that we remove the language-conditioned message passing module described in section 3.2.2. That is, we follow all the steps in section 3.2.1 to obtain every object’s local embedding, then map these embeddings back to their locations as stated in section 3.2.3. The results in Table 2 indicate that language-conditioned message passing does help achieve higher exact match accuracy in many test splits, though it sometimes hurts the performance on split F. We conclude that the model is getting better at understanding object-related commands (“pull” moves the object), while sacrificing some ability to discover the meaning of easy-to-translate adverbs that are irrelevant to the interaction with objects (“while spinning” only describes the behavior of the agent, with no impact on the scene).

Failure on Split B. Here we analyze a failure case to understand why split B is notably difficult for our model. Figure 4 demonstrates an example that leads to both models’ failure. The attention scores indicate that the model has identified the correct target position, but does not know the correct action sequence to get there. The LSTM decoder cannot generalize the meaning of action tokens that direct the agent towards an unseen direction. We can observe from our model’s output prediction that, even if it manages to correctly predict the first few steps (“turn left turn left walk”), it quickly gets lost and fails to navigate to the target location. The model only observes the initial world state and the command, then generates a sequence of actions toward the target. In other words, it is blindly generating the action sequence with only a static image of the agent and the target’s location, not really modeling the movement of the agent. However, humans usually do not handle navigation in a novel direction in this way. Instead, they will first turn to the correct direction, and transform the novel task into a familiar task (“walk southwest is equivalent to turn southwest then walk the same as you walk north”). This naturally requires a change of perspective and conditioning on the agent’s previous action. A possible improvement is to introduce clues to inform the model of possible changes in its view as it takes actions.
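The “change of perspective” idea can be illustrated with a small coordinate transform. This is only an illustration of the suggestion above, not something the model does: it assumes grid positions are `(row, column)` pairs with rows increasing southward, and the function name `egocentric_offset` is hypothetical. It expresses the target's position relative to the agent as (cells ahead, cells to the right) given the agent's heading.

```python
from typing import Tuple


def egocentric_offset(agent: Tuple[int, int], target: Tuple[int, int],
                      heading: str) -> Tuple[int, int]:
    """Return (forward, right) offset of the target in the agent's own frame."""
    d_row = target[0] - agent[0]  # positive means the target is further south
    d_col = target[1] - agent[1]  # positive means the target is further east
    if heading == "north":
        return (-d_row, d_col)
    if heading == "east":
        return (d_col, d_row)
    if heading == "south":
        return (d_row, -d_col)
    if heading == "west":
        return (-d_col, -d_row)
    raise ValueError(f"unknown heading: {heading!r}")
```

Under this transform, a target to the south-west of an agent facing south has the same egocentric offset as a target to the north-east of an agent facing north, so an unseen absolute direction becomes a relative configuration that may already be familiar from training.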

Figure 4: Failure case on split B, prediction and attention scores were generated by our model.

5 Conclusion and Future Work

In this paper, we proposed a language-conditioned message passing model for a grounded language navigation task that can dynamically extract contextualized embeddings based on input command sentences, and can be trained end-to-end with the downstream action-sequence decoder. We showed that obtaining such contextualized embeddings improves performance on a recently introduced challenge problem, gSCAN, significantly outperforming the state-of-the-art across several test splits designed to test a model’s ability to represent novel concept compositions and achieve systematic generalization.

Nonetheless, our model’s fairly poor performance on splits B and F shows that challenges still remain. As explained in the discussion section, our model falls short of estimating the effect of each action on the agent’s state. An alternative view of this problem is as a reinforcement learning task with sparse reward. Sample-efficient model-based reinforcement learning (Buckman et al., 2018) could then be used, and its natural ability to explicitly model environment change should improve performance on this task.

It would also be beneficial to visualize the dynamically generated edge weights during message passing to have a more intuitive understanding of what contextual information is integrated during the message passing phase. Currently, we consider all objects appearing on the grid, including the agent, as homogeneous nodes during message passing, and all edges in the message passing graph are modelled in the same way. However, intuitively, we should model the relation between different types of objects differently. For example, the relation between the agent and the target object of pulling might be different from the relation between two objects on the grid. Inspired by Bahdanau et al. (2018), it would be interesting to try modeling different edge types explicitly with neural modules, and perform type-specific message passing to obtain better contextualized embeddings.


References

  • P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang (2018a) Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6077–6086. Cited by: §1.
  • P. Anderson, Q. Wu, D. Teney, J. Bruce, M. Johnson, N. Sünderhauf, I. Reid, S. Gould, and A. van den Hengel (2018b) Vision-and-language navigation: interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3674–3683. Cited by: §1.
  • J. Andreas (2020) Good-enough compositional data augmentation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 7556–7566. External Links: Link Cited by: §2.3.
  • D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §3.2.4.
  • D. Bahdanau, S. Murty, M. Noukhovitch, T. H. Nguyen, H. de Vries, and A. C. Courville (2018) Systematic generalization: what is required and can it be learned?. CoRR abs/1811.12889. External Links: Link, 1811.12889 Cited by: §1, §2.2, §2.3, §5.
  • J. Buckman, D. Hafner, G. Tucker, E. Brevdo, and H. Lee (2018) Sample-efficient reinforcement learning with stochastic ensemble value expansion. In Advances in Neural Information Processing Systems, pp. 8224–8234. Cited by: §5.
  • M. Chevalier-Boisvert, D. Bahdanau, S. Lahlou, L. Willems, C. Saharia, T. H. Nguyen, and Y. Bengio (2018) BabyAI: first steps towards grounded language learning with a human in the loop. arXiv preprint arXiv:1810.08272. Cited by: §2.2.
  • J. A. Fodor, Z. W. Pylyshyn, et al. (1988) Connectionism and cognitive architecture: a critical analysis. Cognition 28 (1-2), pp. 3–71. Cited by: §1.
  • J. Gordon, D. Lopez-Paz, M. Baroni, and D. Bouchacourt (2019) Permutation equivariant models for compositional generalization in language. In International Conference on Learning Representations, Cited by: §2.3.
  • S. Gururangan, S. Swayamdipta, O. Levy, R. Schwartz, S. R. Bowman, and N. A. Smith (2018) Annotation artifacts in natural language inference data. arXiv preprint arXiv:1803.02324. Cited by: §1.
  • F. Hill, A. Lampinen, R. Schneider, S. Clark, M. Botvinick, J. L. McClelland, and A. Santoro (2019) Emergent systematic generalization in a situated agent. arXiv preprint arXiv:1910.00571. Cited by: §2.2.
  • R. Hu, J. Andreas, T. Darrell, and K. Saenko (2018) Explainable neural computation via stack neural module networks. In Proceedings of the European conference on computer vision (ECCV), pp. 53–69. Cited by: §3.2.1.
  • R. Hu, A. Rohrbach, T. Darrell, and K. Saenko (2019) Language-conditioned graph networks for relational reasoning. 2019 IEEE/CVF International Conference on Computer Vision (ICCV). External Links: ISBN 9781728148038, Link, Document Cited by: §1, §3.2.1.
  • D. A. Hudson and C. D. Manning (2018) Compositional attention networks for machine reasoning. arXiv preprint arXiv:1803.03067. Cited by: §1, §3.2.1.
  • D. Hupkes, V. Dankers, M. Mul, and E. Bruni (2020) Compositionality decomposed: how do neural networks generalise?. Journal of Artificial Intelligence Research 67, pp. 757–795. Cited by: §2.1.
  • R. Jia and P. Liang (2017) Adversarial examples for evaluating reading comprehension systems. arXiv preprint arXiv:1707.07328. Cited by: §1.
  • H. Kamp and B. Partee (1995) Prototype theory and compositionality. Cognition 57 (2), pp. 129–191. Cited by: §2.1.
  • B. M. Lake and M. Baroni (2017) Generalization without systematicity: on the compositional skills of sequence-to-sequence recurrent networks. External Links: 1711.00350 Cited by: §2.2.
  • A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. Cited by: §4.1.
  • E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville (2018) Film: visual reasoning with a general conditioning layer. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §1.
  • L. Ruis, J. Andreas, M. Baroni, D. Bouchacourt, and B. M. Lake (2020) A benchmark for systematic generalization in grounded language understanding. External Links: 2003.05161 Cited by: §1, §2.2, §2.2, §2.3, §3.2.1, §3.2.4, §4.1.
  • M. Wang, L. Yu, D. Zheng, Q. Gan, Y. Gai, Z. Ye, M. Li, J. Zhou, Q. Huang, C. Ma, et al. (2019) Deep graph library: towards efficient and scalable deep learning on graphs. arXiv preprint arXiv:1909.01315. Cited by: §4.1.
  • Z. Wang and B. M. Lake (2019) Modeling question asking using neural program generation. arXiv preprint arXiv:1907.09899. Cited by: §3.2.3.

Appendix A Appendix

A.1 Example Visualization

We visualize more cases reflecting our model’s strengths and weaknesses. Figures 5-8 show cases in which our model’s prediction exactly matches the target while the baseline’s does not. Some common failure cases of our model are shown in Figures 9-11.

Figure 5: Baseline picked a yellow square as the target.
Figure 6: Baseline picked a red square as the target.
Figure 7: Baseline falsely predicted the consequential interaction and decided not to push.
Figure 8: Baseline picked the bigger circle instead of the smaller one.
Figure 9: Getting lost on a long sequence: Our model fails when the target sequence contains the same action token repeated several times.
Figure 10: Incorrect path plan: Our model generates the path plan in a partially-reversed order.
Figure 11: Early stop before reaching boundary: Our model stops pushing when the target object is next to the boundary grid.