Systematic Generalization refers to a learning algorithm's ability to extrapolate learned behavior to unseen situations that are distinct but semantically similar to its training data. As shown in recent work, state-of-the-art deep learning models fail dramatically even on tasks for which they are designed when the test set is systematically different from the training data. We hypothesize that explicitly modeling the relations between objects in their contexts while learning their representations will help achieve systematic generalization. Therefore, we propose a novel method that learns objects' contextualized embeddings with dynamic message passing conditioned on the input natural language, end-to-end trainable with other downstream deep learning modules. To our knowledge, this model is the first one that significantly outperforms the provided baseline and reaches state-of-the-art performance on grounded SCAN (gSCAN), a grounded natural language navigation dataset designed to require systematic generalization in its test splits.
Systematic Generalization refers to a learning algorithm's ability to extrapolate learned behavior to unseen situations that are distinct but semantically similar to its training data. It has long been recognized as a key aspect of humans' cognitive capacities (Fodor et al., 1988). Specifically, humans' mastery of systematic generalization is prevalent in grounded natural language understanding. For example, humans can reason about the relations between all pairs of concepts from two domains, even if they have only seen a small subset of pairs during training. If a child observes “red squares”, “green squares” and “yellow circles”, he or she can recognize “red circles” at their first encounter. Humans can also contextualize their reasoning about objects' attributes. For example, a city referred to as “the larger one” within a state might be referred to as “the smaller one” nationwide. In the past decade, deep neural networks have shown tremendous success on a collection of grounded natural language processing tasks, such as visual question answering (VQA), image captioning, and vision-and-language navigation (Hudson and Manning, 2018; Anderson et al., 2018a, b). Despite all this success, recent literature shows that current deep learning approaches exploit statistical patterns discovered in the datasets to achieve high performance, an approach that does not achieve systematic generalization. Gururangan et al. (2018) discovered that annotation artifacts like negation words or purpose clauses in natural language inference data can be used by a simple text classification model to solve the given task. Jia and Liang (2017) demonstrated that adversarial examples can fool reading comprehension systems. Indeed, deep learning models often fail to achieve systematic generalization even on tasks on which they are claimed to perform well. As shown by Bahdanau et al. (2018), state-of-the-art visual question answering models (Hudson and Manning, 2018; Perez et al., 2018) fail dramatically even on a synthetic VQA dataset designed with systematic differences between training and test sets.
In this work, we focus on approaching systematic generalization in grounded natural language understanding tasks. We experiment with a recently introduced synthetic dataset, grounded SCAN (gSCAN), that requires systematic generalization to solve (Ruis et al., 2020). For example, after observing how to “walk hesitantly” to a target object in a grid world, the learning agent is tested with an instruction that requires it to “pull hesitantly”, thereby testing its ability to generalize adverbs to unseen adverb-verb combinations.
When presented with a world of objects with different attributes, and natural language sentences that describe such objects, the goal of the model is to generalize its ability to understand unseen sentences describing novel combinations of observed objects, or even novel objects with observed attributes. One of the essential steps in achieving this goal is to obtain good object embeddings to which natural language can be grounded. By considering each object as a bag of its descriptive attributes, this problem is further transformed into learning good representations for those attributes based on the training data. This requires: 1) learning good representations of attributes whose actual meanings are contextualized, for example, “smaller” and “lighter”; and 2) learning good representations for attributes so that conceptually similar attributes, e.g., “yellow” and “red”, have similar representations. We hypothesize that explicitly modeling the relations between objects in their contexts, i.e., learning contextualized object embeddings, will help achieve systematic generalization. This is intuitively helpful for learning concepts with contextualized meaning, just as learning to recognize the “smaller” object in a novel pair requires experience of comparison between semantically similar object pairs. Learning contextualized object embeddings can also help obtain good representations for semantically similar concepts when such concepts are the only differences between two contexts. Inspired by Hu et al. (2019), we propose a novel method that learns objects' contextualized embeddings with dynamic message passing conditioned on the input natural language. At each round of message passing, our model collects relational information between each object pair, and constructs an object's contextualized embedding as a weighted combination of this information. Such weights are dynamically computed conditioned on the input sentence.
The contextualized object embedding scheme is trained end-to-end with downstream deep modules for specific grounded natural language processing tasks, such as navigation. Experiments show that our approach significantly outperforms a strong baseline on gSCAN.
Research on deep learning models’ systematic generalization behavior has gained traction in recent years, with particular focus on natural language processing tasks.
An idea that is closely related to systematic generalization is compositionality. Kamp and Partee (1995) phrased the principle of compositionality as “The meaning of a whole is a function of the meanings of the parts and of the way they are syntactically combined”. Hupkes et al. (2020) synthesize different interpretations of this abstract principle into 5 theoretically grounded tests to evaluate a model's ability to represent compositionality: 1) Systematicity: whether the model can systematically recombine known parts and rules; 2) Productivity: whether the model can extend its predictions beyond what it has seen in the training data; 3) Substitutivity: whether the model is robust to synonym substitutions; 4) Localism: whether the model's composition operations are local or global; and 5) Overgeneralisation: whether the model favors rules or exceptions during training. The gSCAN dataset focuses on capturing the first three tests in a grounded natural language understanding setting, and our proposed model achieves significant performance improvements on test sets relating to systematicity and substitutivity.
Many systematic generalization datasets have been proposed in recent years (Bahdanau et al., 2018; Chevalier-Boisvert et al., 2018; Hill et al., 2019; Lake and Baroni, 2017; Ruis et al., 2020). This paper is conceptually most related to the SQOOP dataset proposed by Bahdanau et al. (2018), the SCAN dataset proposed by Lake and Baroni (2017), and the gSCAN dataset proposed by Ruis et al. (2020).
The SQOOP dataset consists of a random number of MNIST-style alphanumeric characters scattered in an image with specific spatial relations (“left”, “right”, “up”, “down”) among them (Bahdanau et al., 2018). The algorithm is tested with a binary decision task of reasoning about whether a specific relation holds between a pair of alphanumeric characters. A systematic difference is created between the test and training sets by only providing supervision on relations for a subset of character pairs to the learner, while testing its ability to reason about relations between unseen pairs. For example, the algorithm is tested with questions like “is S above T” while it never sees a relation involving both S and T during training. Therefore, to fully solve this dataset, it must learn to generalize its understanding of the relation “above” to unseen pairs of characters. Lake and Baroni (2017) proposed the SCAN dataset and its related benchmark that tests a learning algorithm's ability to perform compositional learning and zero-shot generalization on a natural language command translation task. Given a natural language command with a limited vocabulary, an algorithm needs to translate it into a corresponding action sequence consisting of action tokens from a finite token set. Compared with SQOOP, SCAN tests the algorithm's ability to learn more complicated linguistic generalizations, such as from “walk around left” to “walk around right”. SCAN also ensures that the target action sequence is unique and that an oracle solution exists, by providing an interpreter function that can unambiguously translate any given command to its target action sequence.
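SCAN's interpreter function can be illustrated with a toy sketch. The mapping below follows the published SCAN semantics for the commands it covers, but this fragment handles only single verbs with optional direction and “around”/“opposite” modifiers; the full grammar (with “twice”, “thrice”, “and”, “after”) is omitted.

```python
# Toy interpreter for a fragment of the SCAN grammar: maps commands like
# "walk around left" to action-token sequences.
VERBS = {"walk": "WALK", "run": "RUN", "jump": "JUMP", "look": "LOOK"}
TURNS = {"left": "LTURN", "right": "RTURN"}

def interpret(command: str) -> list:
    """Translate a (fragmentary) SCAN command into its action-token sequence."""
    words = command.split()
    verb = VERBS[words[0]]
    if len(words) == 1:                  # e.g. "walk"
        return [verb]
    if len(words) == 2:                  # e.g. "walk left": turn, then act
        return [TURNS[words[1]], verb]
    turn = TURNS[words[2]]
    if words[1] == "opposite":           # turn twice, then act
        return [turn, turn, verb]
    if words[1] == "around":             # four (turn, act) repetitions
        return [turn, verb] * 4
    raise ValueError(f"unsupported command: {command}")
```

Because the interpreter is deterministic, each command has a unique target sequence, which is the property the benchmark relies on.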
Going beyond SCAN, which focuses purely on syntactic aspects of systematic generalization, the gSCAN dataset proposed by Ruis et al. (2020) extends SCAN with a series of systematic generalization tasks that require the learning agent to ground its understanding of natural language commands in a given grid world to produce the correct action token sequence. We choose gSCAN as our benchmark dataset, as its input command sentences are linguistically more complex and require processing multi-modal input to solve.
Bahdanau et al. (2018) demonstrated that modular networks, with a carefully chosen module layout, can achieve nearly perfect systematic generalization on the SQOOP dataset. Our approach can be considered a conceptual generalization of theirs: each object's initial embedding can be considered a simple affine encoder module, and we learn the connection scheme among these modules conditioned on natural language instead of hand-designing it. Gordon et al. (2019) proposed to solve the SCAN benchmark by hard-coding their model to be equivariant to all permutations of SCAN's verb primitives. Andreas (2020) proposed GECA (“Good-Enough Compositional Augmentation”), which systematically augments the SCAN dataset by identifying sentence fragments with similar syntactic contexts and permuting them to generate novel training examples. This line of permutation-invariant approaches has been shown not to generalize well on the gSCAN dataset (Ruis et al., 2020). To the best of our knowledge, our method is the first to outperform the strong baseline provided in the gSCAN benchmark, and also the first to apply language-conditioned message passing to learn contextualized input embeddings for systematic generalization tasks.
gSCAN contains a series of systematic generalization tasks in a grounded natural language understanding setting. In gSCAN, the learning agent is tested on the task of following a given natural language instruction to navigate in a two-dimensional grid world with objects. This is achieved by generating a sequence of action tokens from a finite action token set $\mathcal{A}$ that brings the agent from its starting location to the target location. An object in gSCAN's world state is encoded with a one-hot encoding describing its attributes in three property types: 1) color, 2) shape, and 3) size. The agent is also encoded as an “object” in the grid world, with properties including its orientation and a binary variable denoting the presence of the agent. Therefore, the whole grid is represented as a tensor $X \in \mathbb{R}^{d \times d \times c}$, where $d$ is the dimension of the grid and $c$ is the total number of attribute values. Mathematically, given an input tuple $(I, X)$, where $I$ represents the navigation instruction, the agent needs to predict the correct output action token sequence $a_1, \dots, a_m$. Despite its simple form, this task is quite challenging. For one, generating the correct action token sequence requires understanding the instruction within the context of the agent's current grid world. It also involves connecting specific instructions with complex dynamic patterns. As an example, “pulling” a square will be mapped to a “pull” command when the square has a size of 1 or 2, but to “pull pull” when the square has a size of 3 or 4 (a “heavy” square); “move cautiously” requires the agent to turn left and turn right once each before making the actual move. gSCAN also introduces a series of test sets that have systematic differences from the training set. Computing the correct action token sequences on these test sets requires the model to combine seen concepts into novel combinations, including novel object property combinations, novel contextual references, etc.
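The world-state encoding described above can be sketched as follows. The attribute vocabularies, grid dimension, and channel layout here are illustrative assumptions, not the dataset's exact specification:

```python
import numpy as np

# Hypothetical attribute vocabularies; gSCAN's actual sets differ in detail.
COLORS = ["red", "blue", "green", "yellow"]
SHAPES = ["circle", "square", "cylinder"]
SIZES = [1, 2, 3, 4]
DIRS = ["north", "east", "south", "west"]

D = 6  # assumed grid dimension
C = len(COLORS) + len(SHAPES) + len(SIZES) + len(DIRS) + 1  # +1: agent-presence bit

def encode_cell(color=None, shape=None, size=None, agent_dir=None):
    """One-hot encode a single cell's object and/or agent into a C-dim vector."""
    v = np.zeros(C, dtype=np.float32)
    off = 0
    if color is not None:
        v[off + COLORS.index(color)] = 1.0
    off += len(COLORS)
    if shape is not None:
        v[off + SHAPES.index(shape)] = 1.0
    off += len(SHAPES)
    if size is not None:
        v[off + SIZES.index(size)] = 1.0
    off += len(SIZES)
    if agent_dir is not None:
        v[off + DIRS.index(agent_dir)] = 1.0
        v[-1] = 1.0  # mark the agent's presence
    return v

# The full grid is a d x d x c tensor; empty cells stay all-zero.
grid = np.zeros((D, D, C), dtype=np.float32)
grid[2, 3] = encode_cell(color="red", shape="square", size=3)
grid[0, 0] = encode_cell(agent_dir="east")
```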
The overview of our model architecture is shown in Figure 1.
Given the input sentence and the grid world state, we first project them into higher-dimensional embeddings. For the input instruction $I = (w_1, \dots, w_L)$, where $e_i$ is the embedding vector of word $w_i$, following the practice of Ruis et al. (2020) and Hu et al. (2019), we first encode it as the hidden states $[h_1, \dots, h_L]$ and the summary vector $s$ obtained by feeding the input to a Bi-LSTM as:

$[h_1, \dots, h_L],\; s = \mathrm{BiLSTM}(e_1, \dots, e_L), \quad h_i = [\overrightarrow{h_i}; \overleftarrow{h_i}]$

where we use a semi-colon to represent concatenation, and $h_i$ is the concatenation of the forward- and backward-direction LSTM hidden states for input word $w_i$. For each round of message passing between the object embeddings, we further apply a transformation using a multi-step textual attention module similar to that of Hudson and Manning (2018) and Hu et al. (2018) to extract the round-specific textual context. Given a round-specific projection matrix $W^{(t)}$, the textual attention score for word $i$ at message passing round $t$ is computed as:

$\alpha_i^{(t)} = \mathrm{softmax}_i\big( (W^{(t)} s)^\top h_i \big)$

The final textual context embedding for message passing round $t$ is computed as:

$c^{(t)} = \sum_{i=1}^{L} \alpha_i^{(t)} h_i$
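A minimal PyTorch sketch of this round-specific textual attention follows. The hidden dimension, the per-round linear projection, and the dot-product scoring form are assumptions consistent with the description, not the exact published parameterization:

```python
import torch
import torch.nn as nn

class RoundTextAttention(nn.Module):
    """For each message-passing round t, project the Bi-LSTM summary vector
    with a round-specific matrix and attend over the word hidden states."""
    def __init__(self, hidden_dim: int, num_rounds: int):
        super().__init__()
        # one projection matrix W^(t) per message-passing round
        self.proj = nn.ModuleList(
            nn.Linear(hidden_dim, hidden_dim, bias=False) for _ in range(num_rounds)
        )

    def forward(self, h, s, t):
        # h: (L, hidden_dim) word hidden states; s: (hidden_dim,) summary; t: round
        query = self.proj[t](s)               # round-specific query W^(t) s
        scores = h @ query                    # (L,) dot-product logits
        alpha = torch.softmax(scores, dim=0)  # attention weights over words
        return alpha @ h                      # textual context c^(t)

h = torch.randn(5, 32)   # 5 words, hidden dim 32 (assumed)
s = torch.randn(32)
attn = RoundTextAttention(hidden_dim=32, num_rounds=4)
c_t = attn(h, s, t=0)
```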
Details of the message passing mechanism will be described in later sections.
As for the grid-world representation, from each grid cell we extract one-hot representations of the color $o^{color}$, shape $o^{shape}$, size $o^{size}$ and agent orientation $o^{dir}$, and embed each property with a 16-dimensional vector. We finally concatenate them back into one vector $x_j^{loc}$ and use this vector as object $j$'s local embedding.
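The per-property embedding and concatenation can be sketched as below; the property vocabulary sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class ObjectLocalEmbedding(nn.Module):
    """Embed each one-hot property with a 16-dim vector and concatenate
    the results into the object's local embedding."""
    def __init__(self, n_colors=4, n_shapes=3, n_sizes=4, n_dirs=5, dim=16):
        super().__init__()
        self.color = nn.Embedding(n_colors, dim)
        self.shape = nn.Embedding(n_shapes, dim)
        self.size = nn.Embedding(n_sizes, dim)
        self.orient = nn.Embedding(n_dirs, dim)  # agent orientation (incl. "none")

    def forward(self, color, shape, size, orient):
        # each argument: (N,) long tensor of property indices for N objects
        return torch.cat(
            [self.color(color), self.shape(shape), self.size(size), self.orient(orient)],
            dim=-1,
        )  # (N, 4 * dim) local embeddings

emb = ObjectLocalEmbedding()
ids = torch.zeros(3, dtype=torch.long)   # three objects, all index 0
x_loc = emb(ids, ids, ids, ids)          # (3, 64)
```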
After extracting textual context embeddings and objects' local embeddings, we perform language-conditioned iterative message passing for $T$ rounds to obtain the contextualized object embeddings, where $T$ is a hyper-parameter.
1) Denoting the extracted local embedding of object $j$ as $x_j^{loc}$, and the previous round's context embedding as $x_j^{(t-1)}$, we first construct a fused representation of the object at round $t$ by concatenating its local and context embeddings as well as their element-wise product:

$\tilde{x}_j^{(t)} = [x_j^{loc};\; x_j^{(t-1)};\; x_j^{loc} \odot x_j^{(t-1)}]$

We use an object's local embedding to initialize its context embedding at round 0, i.e., $x_j^{(0)} = x_j^{loc}$.

2) For each pair of objects $(j, k)$, we use their fused representations, together with this round's textual context embedding $c^{(t)}$, to compute their message passing weight as:

$w_{j \to k}^{(t)} = \mathrm{softmax}_j\big( \mathrm{MLP}([\tilde{x}_j^{(t)};\; \tilde{x}_k^{(t)};\; c^{(t)}]) \big)$

Note that the computation of the raw weight logits is asymmetric: the logit for the message from $j$ to $k$ differs from that for the message from $k$ to $j$.

3) We consider all the objects in a grid world as nodes, which together form a complete graph. Each node $j$ computes its message to receiver node $k$ as:

$m_{j \to k}^{(t)} = w_{j \to k}^{(t)} \, W_m \tilde{x}_j^{(t)}$

and each receiver node updates its context embedding as:

$x_k^{(t)} = \sum_{j \neq k} m_{j \to k}^{(t)}$

After $T$ rounds of iterative message passing, the final contextualized embedding for object $j$ will be:

$x_j^{ctx} = [x_j^{loc};\; x_j^{(T)}]$
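One round of this scheme can be sketched in PyTorch roughly as follows. The embedding dimensions, the two-layer MLP scoring the weight logits, and normalizing the weights over senders are assumptions consistent with the description above, not the paper's exact implementation (which is backed by DGL):

```python
import torch
import torch.nn as nn

class LangConditionedMP(nn.Module):
    """One round of language-conditioned message passing over all object pairs."""
    def __init__(self, d: int, d_text: int):
        super().__init__()
        # raw (asymmetric) weight logits from [sender; receiver; text context]
        self.score = nn.Sequential(
            nn.Linear(2 * 3 * d + d_text, d), nn.ReLU(), nn.Linear(d, 1)
        )
        self.msg = nn.Linear(3 * d, d)   # transforms the sender's fused repr.

    def forward(self, x_loc, x_ctx, c_t):
        # x_loc, x_ctx: (N, d) local / previous-round context embeddings
        # c_t: (d_text,) textual context for this round
        fused = torch.cat([x_loc, x_ctx, x_loc * x_ctx], dim=-1)   # (N, 3d)
        N = fused.size(0)
        send = fused.unsqueeze(1).expand(N, N, -1)                 # sender j
        recv = fused.unsqueeze(0).expand(N, N, -1)                 # receiver k
        ctx = c_t.expand(N, N, -1)
        logits = self.score(torch.cat([send, recv, ctx], -1)).squeeze(-1)
        w = torch.softmax(logits, dim=0)   # normalize over senders j, per receiver
        msgs = self.msg(fused)             # (N, d) per-sender message content
        return w.transpose(0, 1) @ msgs    # (N, d) updated context per receiver

mp = LangConditionedMP(d=16, d_text=32)
x_loc = torch.randn(4, 16)                        # 4 objects in the grid
x_new = mp(x_loc, x_loc.clone(), torch.randn(32)) # round 0: context = local
```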
After obtaining contextualized embeddings for all objects in a grid world, each of dimensionality $D$, we map them back to their locations in the grid world, and construct a new grid world representation $X^{ctx} \in \mathbb{R}^{d \times d \times D}$ by zero-padding cells without any object. This is then fed into three parallel convolutional networks with different kernel sizes to obtain the grid world's embedding at multiple scales, as done by Wang and Lake (2019). The final grid world encoding is as follows:

$G = [\mathrm{CNN}_1(X^{ctx});\; \mathrm{CNN}_2(X^{ctx});\; \mathrm{CNN}_3(X^{ctx})]$

where $\mathrm{CNN}_k$ denotes the $k$-th convolutional network.
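The multi-scale encoder amounts to three parallel 2-D convolutions with different kernel sizes whose feature maps are concatenated channel-wise. The specific kernel sizes (1, 5, 7) and “same” padding below are assumptions for illustration:

```python
import torch
import torch.nn as nn

class MultiScaleGridEncoder(nn.Module):
    """Encode the contextualized grid at multiple receptive-field sizes."""
    def __init__(self, in_ch: int, out_ch: int, kernel_sizes=(1, 5, 7)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, k, padding=k // 2)  # odd k -> "same" padding
            for k in kernel_sizes
        )

    def forward(self, grid):
        # grid: (batch, in_ch, d, d); concatenate per-scale feature maps
        return torch.cat([conv(grid) for conv in self.convs], dim=1)

enc = MultiScaleGridEncoder(in_ch=32, out_ch=16)
out = enc(torch.randn(2, 32, 6, 6))   # three scales of 16 channels each
```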
We use a Bi-LSTM with multi-modal attention to both the grid world embedding and the input instruction embedding to decode the final action sequence, following the baseline model provided by Ruis et al. (2020). At each step $t$, the hidden state of the decoder is computed as:

$h_t^{dec} = \mathrm{LSTM}([e_{t-1};\; c_t^{instr};\; c_t^{world}],\; h_{t-1}^{dec})$

where $e_{t-1}$ is the embedding of the previous output action token $a_{t-1}$, $c_t^{instr}$ is the instruction context computed with attention over the textual encoder's hidden states $h_1, \dots, h_L$, and $c_t^{world}$ is the grid world context computed with attention over all locations in the grid world embedding $G$. We use the attention implementation proposed by Bahdanau et al. (2014). The instruction context is computed as:

$c_t^{instr} = \sum_i \beta_{t,i} h_i, \quad \beta_{t,i} = \mathrm{softmax}_i\big( v_a^\top \tanh(W_a h_{t-1}^{dec} + U_a h_i) \big)$

Similarly, the grid world context is computed as:

$c_t^{world} = \sum_l \gamma_{t,l} G_l, \quad \gamma_{t,l} = \mathrm{softmax}_l\big( v_b^\top \tanh(W_b h_{t-1}^{dec} + U_b G_l) \big)$

where $v_a$, $W_a$, $U_a$, $v_b$, $W_b$, $U_b$ are learnable parameters. The distribution of the next action token can then be computed as $p(a_t \mid a_{<t}, I, X) = \mathrm{softmax}(W_o h_t^{dec})$.
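A single decoder step with Bahdanau-style additive attention can be sketched as below. The dimensions, the single-layer `LSTMCell`, and the 8-token action vocabulary are illustrative assumptions:

```python
import torch
import torch.nn as nn

def bahdanau_context(query, keys, W, U, v):
    """Additive attention: score_i = v^T tanh(W q + U k_i); weighted sum of keys."""
    scores = torch.tanh(W(query) + U(keys)) @ v   # (num_keys,)
    weights = torch.softmax(scores, dim=0)
    return weights @ keys                         # context vector

d = 32
cell = nn.LSTMCell(3 * d, d)        # input: [e_prev; c_instr; c_world]
W_a, U_a = nn.Linear(d, d), nn.Linear(d, d)
W_b, U_b = nn.Linear(d, d), nn.Linear(d, d)
v_a, v_b = torch.randn(d), torch.randn(d)
out_proj = nn.Linear(d, 8)          # 8 = hypothetical action vocabulary size

h_prev, c_prev = torch.zeros(1, d), torch.zeros(1, d)
text_h = torch.randn(5, d)          # encoder word hidden states
grid_g = torch.randn(36, d)         # flattened 6x6 grid embedding
e_prev = torch.randn(d)             # previous action-token embedding

c_instr = bahdanau_context(h_prev[0], text_h, W_a, U_a, v_a)
c_world = bahdanau_context(h_prev[0], grid_g, W_b, U_b, v_b)
x = torch.cat([e_prev, c_instr, c_world]).unsqueeze(0)
h_t, c_t = cell(x, (h_prev, c_prev))
probs = torch.softmax(out_proj(h_t), dim=-1)   # next-action distribution
```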
| Split | Description |
| --- | --- |
| A: Random | Randomly split test set |
| B: Novel Direction | Target object is to the south-west of the agent |
| C: Relativity | Target object is a size-2 circle, referred to with the “small” modifier |
| D: Red Squares | Red squares are the target object |
| E: Yellow Squares | Yellow squares are referred to with at least a color and a shape |
| F: Adverb to verb | All examples with the adverb “while spinning” and the verb “pull” |
| G: Class Inference | All examples where the agent needs to push a square of size 3 |
We run experiments to test the hypothesis that contextualized embeddings help systematic generalization. Since this task has a limited vocabulary size, word-level accuracy is not a proper metric of model performance. We follow the baseline and use the exact match percentage as our metric, where an exact match means that the produced action token sequence is exactly the same as the target sequence. We compare our model with the baseline on different test sets, and use early stopping based on the exact match score on the validation set. We set the learning rate to 1e-4, decaying by 0.9 every 20,000 steps. We choose the number of message passing iterations to be 4. Our model is trained for 6 separate runs, and the average performance as well as the standard deviation are reported. Our encoder/decoder model is implemented in PyTorch (Paszke et al., 2017), and the message passing graph network is backed by DGL (Wang et al., 2019). For comparison, we use the test sets, validation set, and baseline model released by Ruis et al. (2020).
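The learning-rate schedule above (1e-4, decayed by 0.9 every 20,000 steps) is a standard step decay; a minimal sketch with PyTorch's `StepLR` and a placeholder model:

```python
import torch

model = torch.nn.Linear(8, 8)   # placeholder for the full encoder/decoder model
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=20000, gamma=0.9)

for step in range(40001):       # the scheduler is stepped once per training step
    opt.step()                  # (gradients omitted in this sketch)
    sched.step()

# After 40,000 steps the learning rate has decayed twice: 1e-4 * 0.9^2
```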
Table 1 shows our experiment results on 7 different test sets. In the following sections, we present the results on each systematic generalization test split, and also introduce the configuration of test splits. Note that test split A is a random split set that has no systematic difference from the training set.
Split B: This tests the model’s ability to generalize to navigation in a novel direction. For example, a testing example would require the agent to move to a target object that is to its south-west, even though during training target objects are never placed south-west of the agent. Although our model manages to predict some correct action sequences compared to the baseline’s complete failure, our model still fails on the majority of cases. We further analyze the failure on Split B in the discussion section.
Split C, G: Split C tests the model's ability to generalize to novel contextual references. In the training set, a circle of size 2 is never referred to as “the small circle”, while in the test set the agent needs to generalize the notion “small” to it based on its size comparison with other circles in the grid world. The message passing mechanism helps the model comprehend the relative sizes of objects and boosts performance on split C. Our model also shows promising results on exploring the interrelationship between the agent and other objects in the scene, as well as on learning abstract concepts by contextual comparison, as shown in split G. This test split asks the model to push a square of size 3. An object of size 3 or 4 is defined as “heavy”, according to the configuration, and requires two consecutive push/pull actions before it actually moves. The challenge here is that the model has been trained to “pull” heavy squares and to “push” squares of size 4, but was never trained to “push” a size-3 square. Thus, it needs to generalize the concept of “heavy” and act accordingly.
Split D, E: Split D and E are similar, as they both define the target object with novel combinations of color and shape. Split E is generally easier because the target object, a yellow square, appears as the target in training examples, but is only referred to as “the square”, “the smaller square”, or “the bigger square”. Split D increases the difficulty by referring to the red square, which never appears in the training set as a target but does appear as a background object. We find that while the baseline model understands the concept of “square”, it gets confused by target objects with a new color-shape combination. In contrast, our model can generalize to novel compositions of object properties and correctly find the target object, performing significantly better on these two splits.
Split F: This split is designed to test the model's ability to generalize to novel adverb-verb combinations: the model is tested under different situations, but always with the terms “while spinning” and “pull” in the commands. These never appear together in the training set, so the model needs to generalize to this novel combination of adverb and verb. The results show that our model does slightly better than the baseline, but suffers from high variance across different runs.
Model Comparison. We reveal the strength of our model by analyzing two test examples where it succeeds and the baseline fails. For each example, we visualize the grid world that the agent is in, where each cell is colored with different grey-scale levels indicating its assigned attention score.
Figure 2 from split G visualizes the prediction sequence as well as the attention weights generated by the baseline. The baseline attends to the position of the target object but is unable to capture the dynamic relationship between the target object and the green cylinder. It tries to push the target object over it, while our model correctly predicts the incoming collision and stops at the right time.
Another example on which our model outperforms the baseline is shown in Figure 3. The baseline model incorrectly attends to two small blue squares and picks one as the target rather than the correct small red square. Note that the model has seen blue and green squares as targets in the training set, but has never seen a red square. This is a common mistake since the baseline struggles to choose target objects with novel property combinations when there are similar objects in the scene that were seen during training. On the contrary, our model handles these cases well, demonstrating its ability to generalize to novel color-shape combinations with the help of contextualized object embeddings.
Ablation Study. We conduct an ablation study to test the significance of the language-conditioned message passing component in our network. We built a base model whose architecture is the same as our full model, except that we remove the language-conditioned message passing module described in section 3.2.2. That is, we follow all the steps in section 3.2.1 to obtain every object's local embedding, then map the embeddings back to their locations as stated in section 3.2.3. The results in Table 2 indicate that language-conditioned message passing does help achieve higher exact match accuracy on many test splits, though it sometimes hurts performance on split F. We conclude that the model gets better at understanding object-related commands (“pull” moves the object) while sacrificing some ability to discover the meaning of easy-to-translate adverbs that are irrelevant to the interaction with objects (“while spinning” only describes the behavior of the agent, with no impact on the scene).
Failure on Split B. Here we analyze a failure case to understand why split B is notably difficult for our model. Figure 4 shows an example on which both models fail. The attention scores indicate that the model has identified the correct target position, but does not know the correct action sequence to get there. The LSTM decoder cannot generalize the meaning of action tokens that direct the agent toward an unseen direction. We can observe from our model's output that, even though it correctly predicts the first few steps (“turn left turn left walk”), it quickly gets lost and fails to navigate to the target location. The model only observes the initial world state and the command, then generates a sequence of actions toward the target. In other words, it is blindly generating the action sequence from a static image of the agent and the target's location, not really modeling the movement of the agent. Humans usually do not handle navigation to a novel direction in this way. Instead, they first turn to the correct direction, transforming the novel task into a familiar one (“walk south-west” is equivalent to “turn south-west, then walk the same way you walk north”). This naturally requires a change of perspective and conditioning on the agent's previous action. A possible improvement is to introduce cues that inform the model of the changes in its view as it takes actions.
In this paper, we proposed a language-conditioned message passing model for a grounded language navigation task that can dynamically extract contextualized embeddings based on input command sentences, and can be trained end-to-end with the downstream action-sequence decoder. We showed that obtaining such contextualized embeddings improves performance on a recently introduced challenge problem, gSCAN, significantly outperforming the state-of-the-art across several test splits designed to test a model’s ability to represent novel concept compositions and achieve systematic generalization.
Nonetheless, our model’s fairly poor performance on split B and F shows that challenges still remain. As explained in the discussion section, our model is falling short of estimating the effect of each action on the agent’s state. An alternative view of this problem is as a reinforcement learning task with sparse reward. Sample-efficient model-based reinforcement learning(Buckman et al., 2018) could then be used, and its natural ability to explicitly model environment change should improve performance on this task.
It would also be beneficial to visualize the dynamically generated edge weights during message passing to have a more intuitive understanding of what contextual information is integrated during the message passing phase. Currently, we consider all objects appearing on the grid, including the agent, as homogeneous nodes during message passing, and all edges in the message passing graph are modelled in the same way. However, intuitively, we should model the relation between different types of objects differently. For example, the relation between the agent and the target object of pulling might be different from the relation between two objects on the grid. Inspired by Bahdanau et al. (2018), it would be interesting to try modeling different edge types explicitly with neural modules, and perform type-specific message passing to obtain better contextualized embeddings.