Gated-Attention Architectures for Task-Oriented Language Grounding

To perform tasks specified by natural language instructions, autonomous agents need to extract semantically meaningful representations of language and map them to visual elements and actions in the environment. This problem is called task-oriented language grounding. We propose an end-to-end trainable neural architecture for task-oriented language grounding in 3D environments which assumes no prior linguistic or perceptual knowledge and requires only raw pixels from the environment and the natural language instruction as input. The proposed model combines the image and text representations using a Gated-Attention mechanism and learns a policy to execute the natural language instruction using standard reinforcement and imitation learning methods. We show the effectiveness of the proposed model on unseen instructions as well as unseen maps, both quantitatively and qualitatively. We also introduce a novel environment based on a 3D game engine to simulate the challenges of task-oriented language grounding over a rich set of instructions and environment states.



1 Introduction

Artificial Intelligence (AI) systems are expected to perceive the environment and take actions to perform a certain task [Russell and Norvig 1995]. Task-oriented language grounding refers to the process of extracting semantically meaningful representations of language by mapping it to visual elements and actions in the environment in order to perform the task specified by the instruction.

Consider the scenario shown in Figure 1, where an agent takes a natural language instruction and pixel-level visual information as input to carry out the task in the real world. To accomplish this goal, the agent has to draw semantic correspondences between the visual and verbal modalities and learn a policy to perform the task. This problem poses several challenges: the agent has to learn to recognize objects in raw pixel input; explore the environment, as the objects might be occluded or outside the agent's field of view; ground each concept of the instruction in visual elements or actions in the environment; reason about the pragmatics of language based on the objects in the current environment (for example, instructions with superlative tokens, such as ‘Go to the largest object’); and navigate to the correct object while avoiding incorrect ones.

Figure 1: An example of task-oriented language grounding in the 3D Doom environment with sample instructions. The test set consists of unseen instructions.
Figure 2: The proposed model architecture to estimate the policy given the natural language instruction and the image showing the first-person view of the environment.

To tackle this problem, we propose an architecture that comprises a state processing module, which creates a joint representation of the instruction and the images observed by the agent, and a policy learner, which predicts the optimal action the agent has to take at that timestep. The state processing module consists of a novel Gated-Attention multimodal fusion mechanism, which is based on multiplicative interactions between both modalities [Dhingra et al. 2017, Wu et al. 2016].

The contributions of this paper are summarized as follows: 1) We propose an end-to-end trainable architecture that handles raw pixel-based input for task-oriented language grounding in a 3D environment and assumes no prior linguistic or perceptual knowledge (the environment and code are available online). We show that the proposed model generalizes well to unseen instructions as well as unseen maps (demo videos are available online). 2) We develop a novel Gated-Attention mechanism for multimodal fusion of representations of verbal and visual modalities. We show that the Gated-Attention mechanism outperforms the baseline method of concatenating the representations under various policy learning methods. The visualization of the attention weights in the Gated-Attention unit shows that the model learns to associate attributes of the object mentioned in the instruction with the visual representations learned by the model. 3) We introduce a new environment, built over ViZDoom [Kempka et al. 2016], for task-oriented language grounding with a rich set of actions, objects and their attributes. The environment provides a first-person view of the world state, and allows for simulating complex scenarios for tasks such as navigation.

2 Related Work

Grounding Language in Robotics. In the context of grounding language in objects and their attributes, Guadarrama et al. (2014) present a method to ground open vocabulary to objects in the environment. Several works look at grounding concepts through human-robot interaction [Chao, Cakmak, and Thomaz 2011, Lemaignan et al. 2012]. Other works in grounding include attempts to ground natural language instructions in haptic signals [Chu et al. 2013] and teaching robots to ground natural language using active learning [Kulick et al. 2013]. Work that aims to ground navigational instructions includes [Guadarrama et al. 2013], [Bollini et al. 2013] and [Beetz et al. 2011], where the focus was to ground verbs like go, follow, etc. and spatial relations of verbs [Tellex et al. 2011, Fasola and Mataric 2013].

Mapping Instructions to Action Sequences. Chen and Mooney (2011) and Artzi and Zettlemoyer (2013) present methods based on semantic parsing to map navigational instructions to a sequence of actions. Mei, Bansal, and Walter (2015) look at neural mapping of instructions to a sequence of actions, along with input from bag-of-word features extracted from the visual image. While these works focus on grounding navigational instructions to actions in the environment, we aim to ground visual attributes of objects such as shape, size and color.

Deep Reinforcement Learning Using Visual Data. Prior work has explored using deep reinforcement learning approaches for playing FPS games [Lample and Chaplot 2016, Kempka et al. 2016, Kulkarni et al. 2016]. The challenge here is to learn an optimal policy for a variety of tasks, including navigation, using raw visual pixel information. Chaplot et al. (2016) look at transfer learning between different tasks in the Doom environment. In all these methods, the policy for each task is learned separately using deep Q-Learning [Mnih et al. 2013]. In contrast, we train a single network for multiple tasks/instructions. Zhu et al. (2016) look at target-driven visual navigation, given the image of the target object. We use the natural language instruction and do not have the visual image of the object. Yu et al. (2017) look at learning to navigate in a 2D maze-like environment and execute commands, for both seen and zero-shot settings, where the combination of words has not been seen before. Misra, Langford, and Artzi (2017) also look at mapping raw visual observations and text input to actions in a 2D Blocks environment. While these works also look at executing a variety of instructions, they tackle only 2D environments. Oh et al. (2017) look at zero-shot task generalization in a 3D environment. Their method tackles long instructions with several subtasks and a wide variety of action verbs. However, the position of the agent is discretized as in 2D mazes, and their method encodes some prior linguistic knowledge in an analogy-making objective.

Compared to the prior work, this paper aims to address visual language grounding in a challenging 3D setting involving raw-pixel input, continuous agent positions and a partially observable environment, which poses additional challenges of perception, exploration and reasoning. Unlike many of the previous methods, our model assumes no prior linguistic or perceptual knowledge, and is trainable end-to-end.

3 Problem Formulation

We tackle the problem of task-oriented language grounding in the context of target-driven visual navigation conditioned on a natural language instruction, where the agent has to navigate to the object described in the instruction. Consider an agent interacting with an episodic environment $\mathcal{E}$. At the beginning of each episode, the agent receives a natural language instruction ($L$) which describes the target, a visual object in the environment. At each time step $t$, the agent receives a raw pixel-level image of the first-person view of the environment ($I_t$) and performs an action $a_t$. The episode terminates whenever the agent reaches any object or the number of time steps exceeds the maximum episode length. Let $s_t = \{I_t, L\}$ denote the state at each time step. The objective of the agent is to learn an optimal policy $\pi(a_t \mid s_t)$, which maps the observed states to actions, eventually leading to successful completion of the task. In this case, the task is to reach the correct object before the episode terminates. We consider two different learning approaches: (1) Imitation Learning [Bagnell 2015], where the agent has access to an oracle which specifies the optimal action given any state in the environment; (2) Reinforcement Learning [Sutton and Barto 1998], where the agent receives a positive reward when it reaches the target object and a negative reward when it reaches any other object.
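As a minimal sketch of the episode structure just described, the termination rule and reward signal for the reinforcement learning setting could be written as follows. The function name, the reward magnitudes (`reward_correct`, `reward_wrong`), the zero reward on timeout, and the exact timeout comparison are illustrative assumptions; the text above only states that the reward is positive for the target and negative for any other object.

```python
def episode_step_outcome(reached_object, target_object, step, max_steps,
                         reward_correct=1.0, reward_wrong=-1.0):
    """Return (reward, done) for one time step of an episode.

    reached_object is None while the agent has not touched any object.
    The reward magnitudes are placeholders, not values from the paper.
    """
    if reached_object is not None:
        # Reaching any object ends the episode; only the target is rewarded.
        if reached_object == target_object:
            return reward_correct, True
        return reward_wrong, True
    if step >= max_steps:
        # Timeout: the episode ends without reaching an object (assumed 0 reward).
        return 0.0, True
    return 0.0, False
```

Under these assumptions, the accuracy metric used later corresponds to the fraction of episodes that end with the positive reward.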

4 Proposed Approach

We propose a novel architecture for task-oriented visual language grounding, which assumes no prior linguistic or perceptual knowledge and can be trained end-to-end. The proposed model is divided into two modules, state processing and policy learning, as shown in Figure 2.

State Processing Module: The state processing module takes the current state $s_t = \{I_t, L\}$ as the input and creates a joint representation for the image and the instruction. This joint representation is used by the policy learner to predict the optimal action to take at that timestep. It consists of a convolutional network [LeCun, Bengio, and others 1995] to process the image $I_t$, a Gated Recurrent Unit (GRU) [Cho et al. 2014] network to process the instruction $L$, and a multimodal fusion unit that combines the representations of the instruction and the image. Let $x_I = f(I_t; \theta_{conv}) \in \mathbb{R}^{d \times H \times W}$ be the representation of the image, where $\theta_{conv}$ denotes the parameters of the convolutional network, $d$ denotes the number of feature maps (intermediate representations) in the convolutional network output, and $H$ and $W$ denote the height and width of each feature map. Let $x_L = f(L; \theta_{gru})$ be the representation of the instruction, where $\theta_{gru}$ denotes the parameters of the GRU network. The multimodal fusion unit, $M(x_I, x_L)$, combines the image and instruction representations. Many prior methods combine the multimodal representations by concatenation [Mei, Bansal, and Walter 2015, Misra, Langford, and Artzi 2017]. We develop a multimodal fusion unit, Gated-Attention, based on multiplicative interactions between the instruction and image representations.

Concatenation: In this approach, the image representation $x_I$ and the instruction representation $x_L$ are simply flattened and concatenated to create a joint state representation:

$$M_{concat}(x_I, x_L) = [\mathrm{vec}(x_I); \mathrm{vec}(x_L)],$$

where $\mathrm{vec}(\cdot)$ denotes the flattening operation. The concatenation unit is used as a baseline for the proposed Gated-Attention unit as it is used by prior methods [Mei, Bansal, and Walter 2015, Misra, Langford, and Artzi 2017].
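The concatenation baseline can be sketched in a few lines of NumPy. The function name and the example shapes are assumptions made for illustration: 64 feature maps of size 17x8 (what the convolutional stack described in the experimental setup would produce without padding) and a 256-dimensional instruction embedding.

```python
import numpy as np

def concat_fusion(image_features, instruction_embedding):
    """Flatten both representations and join them into one state vector."""
    return np.concatenate([
        image_features.reshape(-1),
        instruction_embedding.reshape(-1),
    ])

# Illustrative shapes only; the random values stand in for learned features.
rng = np.random.default_rng(0)
image_features = rng.standard_normal((64, 17, 8))
instruction_embedding = rng.standard_normal(256)
joint_state = concat_fusion(image_features, instruction_embedding)
```

Note that in this form the instruction can only influence the image features through the layers that follow the fusion unit.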

Figure 3: Gated-Attention unit architecture.
Figure 4: A3C policy model architecture.

Gated-Attention: In the Gated-Attention unit, the instruction embedding $x_L$ is passed through a fully-connected linear layer with a sigmoid activation. The output dimension of this linear layer, $d$, is equal to the number of feature maps in the output of the convolutional network (the first dimension of $x_I \in \mathbb{R}^{d \times H \times W}$). The output of this linear layer is called the attention vector $a_L = h(x_L) \in \mathbb{R}^{d}$, where $h(\cdot)$ denotes the fully-connected layer with sigmoid activation. Each element of $a_L$ is expanded to an $H \times W$ matrix. This results in a 3-dimensional matrix $M(a_L) \in \mathbb{R}^{d \times H \times W}$, whose $(i, j, k)$-th element is given by $M(a_L)[i, j, k] = a_L[i]$. This matrix is multiplied element-wise with the output of the convolutional network:

$$M_{GA}(x_I, x_L) = M(h(x_L)) \odot x_I = M(a_L) \odot x_I,$$

where $\odot$ denotes the Hadamard product [Horn 1990]. The architecture of the Gated-Attention unit is shown in Figure 3. The whole unit is differentiable, which makes the architecture end-to-end trainable.
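A minimal NumPy sketch of the Gated-Attention unit described above is given below. The weight and bias arrays stand in for the learned fully-connected layer, and the shapes (64 feature maps of size 17x8, a 256-dimensional instruction embedding) are assumptions chosen to match the experimental setup; this is an illustration of the operation, not the released implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_attention(image_features, instruction_embedding, weight, bias):
    """Gate each convolutional feature map by one sigmoid attention value.

    image_features:        (d, H, W) convolutional output
    instruction_embedding: (k,) instruction representation
    weight, bias:          (d, k) and (d,) parameters of the linear layer
    """
    attention = sigmoid(weight @ instruction_embedding + bias)  # shape (d,)
    # Expanding each attention value to an H x W matrix is done implicitly
    # by broadcasting (d, 1, 1) against (d, H, W).
    gated = attention[:, None, None] * image_features
    return gated, attention

# Illustrative shapes and random parameters (hypothetical values).
rng = np.random.default_rng(0)
d, H, W, k = 64, 17, 8, 256
image_features = rng.standard_normal((d, H, W))
instruction_embedding = rng.standard_normal(k)
weight = 0.01 * rng.standard_normal((d, k))
bias = np.zeros(d)
gated, attention = gated_attention(image_features, instruction_embedding, weight, bias)
```

Because every gate lies between 0 and 1, a feature map whose gate is near 0 is effectively suppressed, while one whose gate is near 1 passes through unchanged.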

The proposed Gated-Attention unit is inspired by the Gated-Attention Reader architecture for text comprehension [Dhingra et al. 2017]. They integrate a multi-hop architecture with a Gated-Attention mechanism, which is based on multiplicative interactions between the query embedding and the intermediate states of a recurrent neural network document reader. In contrast, we propose a Gated-Attention multimodal fusion unit which is based on multiplicative interactions between the instruction representation and the convolutional feature maps of the image representation. This architecture can be extended to any application of multimodal fusion of verbal and visual modalities.

The intuition behind the Gated-Attention unit is that the trained convolutional feature maps detect different attributes of the objects in the frame, such as color and shape. The agent needs to attend to specific attributes of the objects based on the instruction. For example, depending on whether the instruction is “Go to the green object”, “Go to the pillar” or “Go to the green pillar”, the agent needs to attend to objects which are ‘green’, objects which look like a ‘pillar’, or both. The Gated-Attention unit is designed to gate specific feature maps based on the attention vector computed from the instruction.

Policy Learning Module: The output of the multimodal fusion unit (either the Concatenation or the Gated-Attention unit) is fed to the policy learning module. The architecture of the policy learning module is specific to the learning paradigm: (1) Imitation Learning or (2) Reinforcement Learning.

For imitation learning, we consider two algorithms, Behavioral Cloning [Bagnell 2015] and DAgger [Ross, Gordon, and Bagnell 2011]. Both algorithms require an oracle that can return an optimal action given the current state. The oracle is implemented by extracting agent and target object locations and orientations from the Doom game engine. Given any state, the oracle determines the optimal action as follows: the agent first reorients (using the turn_left and turn_right actions) towards the target object. It then moves forward (move_forward action), reorienting towards the target object whenever the deviation of the agent’s orientation is greater than the minimum turn angle supported by the environment.
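The oracle rule above can be sketched as a small function. The coordinate convention (heading in degrees, measured counterclockwise from the positive x-axis, with turn_left increasing the heading), the function name, and the `min_turn_deg` parameter are assumptions for illustration; the actual oracle reads positions and orientations from the game engine.

```python
import math

def oracle_action(agent_xy, agent_heading_deg, target_xy, min_turn_deg):
    """Return 'turn_left', 'turn_right' or 'move_forward' for the current state."""
    dx = target_xy[0] - agent_xy[0]
    dy = target_xy[1] - agent_xy[1]
    bearing_deg = math.degrees(math.atan2(dy, dx))
    # Signed deviation wrapped into [-180, 180): positive means target is to the left.
    deviation = (bearing_deg - agent_heading_deg + 180.0) % 360.0 - 180.0
    if abs(deviation) > min_turn_deg:
        return "turn_left" if deviation > 0 else "turn_right"
    return "move_forward"
```

For example, with a hypothetical minimum turn angle of 10 degrees, an agent at the origin facing along the x-axis would turn left for a target directly above it and move forward for a target straight ahead.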

For reinforcement learning, we use the Asynchronous Advantage Actor-Critic (A3C) algorithm [Mnih et al. 2016], which uses a deep neural network to learn the policy and value functions and runs multiple parallel threads to update the network parameters. We also use entropy regularization for improved exploration, as described by [Mnih et al. 2016]. In addition, we use the Generalized Advantage Estimator [Schulman et al. 2015] to reduce the variance of the policy gradient [Williams 1992] updates.
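For readers unfamiliar with the Generalized Advantage Estimator, the following is a minimal sketch of its standard backward recursion over one rollout. The function name and the smoothing parameter `lam` are assumptions (the text above does not state the value used); `gamma` is the discount factor.

```python
def generalized_advantages(rewards, values, bootstrap_value, gamma, lam):
    """Compute GAE advantages for one rollout.

    rewards[t] and values[t] refer to time step t; bootstrap_value is the
    value estimate of the state after the last step (0 if the episode ended).
    """
    advantages = [0.0] * len(rewards)
    next_value = bootstrap_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference error at time t.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages
```

Setting `lam` to 1 recovers the discounted return minus the value baseline, while `lam` equal to 0 keeps only the one-step TD error; intermediate values trade bias against variance.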

The policy learning module for imitation learning contains a fully connected layer to estimate the policy function. The policy learning module for reinforcement learning using A3C (shown in Figure 4) consists of an LSTM layer, followed by fully connected layers to estimate the policy function as well as the value function. The LSTM layer is introduced so that the agent can have some memory of previous states. This is important because a reinforcement learning agent might explore states where not all objects are visible, and it then needs to remember the objects seen previously.

Figure 5: Sample starting states and bird’s eye view of the map (not visible to the agent) showing agent and object locations in Easy, Medium and Hard modes.

5 Environment

We create an environment for task-oriented language grounding, where the agent can execute a natural language instruction and obtain a positive reward on successful completion of the task. Our environment is built on top of the ViZDoom API [Kempka et al. 2016], based on Doom, a classic first-person shooter game. It provides the raw visual information from a first-person perspective at every timestep. Each scenario in the environment comprises an agent and a list of objects (a subset of ViZDoom objects), one correct and the rest incorrect, in a customized map. The agent can interact with the environment by performing navigational actions such as turn left, turn right and move forward. Given an instruction “Go to the green torch”, the task is considered successful if the agent is able to reach the green torch. The customizable nature of the environment enables us to create scenarios with varying levels of difficulty, which we believe will encourage the design of sophisticated learning algorithms that address the challenge of multi-task and zero-shot reinforcement learning.

An instruction is an (action, attribute(s), object) triple. Each instruction can have more than one attribute, but we limit the number of actions and objects to one each. The environment allows a variety of objects to be spawned at different locations in the map. The objects can have various visual attributes such as color, shape and size (see the Appendix for the list of objects and instructions). We provide a set of 70 manually generated instructions. For each of these instructions, the environment allows for automatic creation of multiple episodes, each randomly created with its own set of correct and incorrect objects. Although the number of instructions is limited, the combinations of correct and incorrect objects for each instruction allow us to create multiple settings for the same instruction. Each time an instruction is selected, the environment generates a random combination of incorrect objects and the correct object in randomized locations. One of the significant challenges posed for a learning algorithm is to understand that the same instruction can refer to different objects in different episodes. For example, “Go to the red object” can refer to a red keycard in one episode, and a red torch in another episode. Similarly, “Go to the keycard” can refer to keycards of various colors in different episodes. Objects could also occlude each other, or might not even be present in the agent’s field of view, or the map could be more complicated, making it difficult for the agent to make a decision based solely on the current input and stressing the need for efficient exploration.
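To make concrete how one instruction can refer to different objects across episodes, the following sketch matches a parsed instruction against candidate objects. The parsing scheme, the attribute vocabularies, and the dictionary layout of objects are all hypothetical simplifications; superlative instructions such as "largest object" are not handled here because they depend on the other objects in the episode.

```python
COLORS = {"red", "green", "blue", "yellow"}
SIZES = {"tall", "short"}

def parse_instruction(text):
    """Split 'Go to the <attributes> <noun>' into color, size and object type."""
    words = text.lower().split()
    if words[:3] != ["go", "to", "the"] or len(words) < 4:
        raise ValueError("expected an instruction of the form 'Go to the X'")
    *attributes, noun = words[3:]
    return {
        "color": next((w for w in attributes if w in COLORS), None),
        "size": next((w for w in attributes if w in SIZES), None),
        # The generic noun 'object' places no constraint on the object type.
        "type": None if noun == "object" else noun,
    }

def matches(query, obj):
    """An object matches if it agrees with every attribute the query specifies."""
    return all(query[key] is None or query[key] == obj[key]
               for key in ("color", "size", "type"))

# Hypothetical candidate objects for illustration.
candidates = [
    {"type": "keycard", "color": "red", "size": None},
    {"type": "torch", "color": "red", "size": "tall"},
    {"type": "torch", "color": "blue", "size": "short"},
    {"type": "keycard", "color": "blue", "size": None},
]
red_objects = [o for o in candidates if matches(parse_instruction("Go to the red object"), o)]
keycards = [o for o in candidates if matches(parse_instruction("Go to the keycard"), o)]
```

In this toy setting the first instruction matches both the red keycard and the red torch, so an episode generator would have to place only one of them to keep the target unambiguous.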

Our environment also provides different modes with respect to the spawning of objects, each with a different difficulty level (Figure 5). Easy: The agent is spawned at a fixed location. The candidate objects are spawned at five fixed locations along a single horizontal line within the field of view of the agent. Medium: The candidate objects are spawned at random locations, but the environment ensures that they are in the field of view of the agent. The agent is still spawned at a fixed location. Hard: The candidate objects and the agent are spawned at random locations, and the objects may or may not be in the agent’s field of view in the initial configuration. The agent needs to explore the map to view all the objects.

Figure 6: Comparison of the performance of the proposed Gated-Attention (GA) unit to the baseline Concatenation unit using Reinforcement learning algorithm, A3C for (a) easy, (b) medium and (c) hard environments.

6 Experimental Setup

We perform our experiments in all three environment difficulty modes, where we restrict the number of objects to 5 for each episode (one correct object and four incorrect objects are spawned along with the agent). During training, the objects are spawned according to a training set of 55 instructions, while 15 instructions pertaining to unseen attribute-object combinations are held out in a test set for zero-shot evaluation. During training, at the start of each episode, one of the train instructions is selected randomly. A correct target object is selected and 4 incorrect objects are selected at random. These objects are placed at random locations depending on the difficulty level of the environment. The episode terminates if the agent reaches any object or the time step exceeds the maximum episode length. The evaluation metric is the accuracy of the agent, which is the success rate of reaching the correct object before the episode terminates. We consider two scenarios for evaluation:

(1) Multitask Generalization (MT), where the agent is evaluated on unseen maps with instructions in the train set. Unseen maps comprise unseen combinations of objects placed at randomized locations. This scenario tests that the agent doesn’t overfit to or memorize the training maps and can execute multiple instructions or tasks in unseen maps.

(2) Zero-shot Task Generalization (ZSL), where the agent is evaluated on unseen test instructions. This scenario tests whether the agent can generalize to new combinations of attribute-object pairs which are not seen during the training. The maps in this scenario are also unseen.

Baseline Approaches

Reinforcement Learning: We adapt [Misra, Langford, and Artzi 2017] as a reinforcement learning baseline in the proposed environment. Misra, Langford, and Artzi (2017) look at jointly reasoning on linguistic and visual inputs for moving blocks in a 2D grid environment to execute an instruction. Their work uses raw features from the 2D grid, processed by a CNN, while the instruction is processed by an LSTM. Text and visual representations are combined using concatenation. The agent is trained using reinforcement learning and enhanced using distance-based reward shaping. We do not use reward shaping as we would like the method to generalize to environments where the distance from the target is not available.

Imitation Learning: We adapt [Mei, Bansal, and Walter 2015] as an imitation learning baseline in the proposed environment. Mei, Bansal, and Walter (2015) map a sequence of instructions to actions, treated as a sequence-to-sequence learning problem, with visual state input received by the decoder at each decoding timestep. While they use a bag-of-visual-words representation for the visual state, we adapt the baseline to directly process raw pixels from the 3D environment using CNNs.

To ensure fairness in comparison, we use the exact same architecture of CNNs (to process visual input), GRUs (to process the textual instruction) and policy learning across baseline and proposed models. This reduces the reinforcement learning baseline to the A3C algorithm with concatenation multimodal fusion (A3C-Concat), and the imitation learning baseline to Behavioral Cloning with Concatenation (BC-Concat).

                                          Easy           Medium         Hard
Paradigm                 Model            Parameters   MT     ZSL    MT     ZSL    MT     ZSL
Imitation Learning       BC Concat        5.21M        0.86   0.71   0.23   0.15   0.20   0.15
Imitation Learning       BC GA            5.09M        0.97   0.81   0.30   0.23   0.36   0.29
Imitation Learning       DAgger Concat    5.21M        0.92   0.73   0.45   0.23   0.19   0.13
Imitation Learning       DAgger GA        5.09M        0.94   0.85   0.55   0.40   0.29   0.30
Reinforcement Learning   A3C Concat       3.44M        1.00   0.80   0.80   0.54   0.24   0.12
Reinforcement Learning   A3C GA           3.39M        1.00   0.81   0.89   0.75   0.83   0.73
Table 1: The accuracy of all the models with Concatenation and Gated-Attention (GA) units, for Multitask Generalization (MT) and Zero-shot Task Generalization (ZSL) in each difficulty mode. A3C Concat and BC Concat are the adapted versions of Misra, Langford, and Artzi (2017) and Mei, Bansal, and Walter (2015), respectively, for the proposed environment. All the accuracy values are averaged over 100 episodes.


The input to the neural network is the instruction and an RGB image of size 3x300x168. The first layer convolves the image with 128 filters of 8x8 kernel size with stride 4, followed by 64 filters of 4x4 kernel size with stride 2 and another 64 filters of 4x4 kernel size with stride 2. The architecture of the convolutional layers is adapted from previous work on playing deathmatches in Doom [Chaplot and Lample 2017]. The input instruction is encoded through a Gated Recurrent Unit (GRU) [Chung et al. 2014] of size 256.
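The size of the resulting feature maps follows from the layer specification by simple arithmetic. The sketch below assumes unpadded convolutions with floor rounding, which the text does not state explicitly, and the helper names are hypothetical.

```python
def conv_output_size(size, kernel, stride):
    """Output length of an unpadded convolution along one spatial axis."""
    return (size - kernel) // stride + 1

# (filters, kernel size, stride) for the three convolutional layers above.
LAYERS = [(128, 8, 4), (64, 4, 2), (64, 4, 2)]

def feature_map_shape(spatial_dims, layers=LAYERS):
    """Propagate the two spatial dimensions through every layer."""
    dims = list(spatial_dims)
    channels = None
    for channels, kernel, stride in layers:
        dims = [conv_output_size(n, kernel, stride) for n in dims]
    return (channels, *dims)

shape = feature_map_shape((300, 168))
flattened = shape[0] * shape[1] * shape[2]
```

Under these assumptions the 300-pixel axis shrinks as 300, 74, 36, 17 and the 168-pixel axis as 168, 41, 19, 8, giving 64 feature maps of 17x8 values and 8704 values once flattened.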

For the imitation learning approach, we run experiments with the Behavioral Cloning (BC) and DAgger algorithms in an online fashion, where each outer iteration consists of a data generation step and a policy update step. The policy learner for imitation learning comprises a linear layer of size 512 which is fully-connected to 3 neurons to predict the policy function (i.e. the probability of each action). In each data generation step, we sample state trajectories based on the oracle’s policy in BC, and based on a mixture of the oracle’s policy and the currently learned policy in DAgger. The mixing of the policies is governed by an exploration coefficient, which decays linearly from 1 to 0. For each state, we collect the optimal action given by the policy oracle. Then the policy is updated for 10 epochs over all the state-action pairs collected so far, using the RMSProp optimizer [Tieleman and Hinton 2012]. Both methods use the Huber loss [Huber 1964] between the estimated policy and the optimal policy given by the policy oracle.
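The policy mixing used during DAgger data generation can be sketched as follows. It is assumed here, following the usual DAgger formulation, that the exploration coefficient is the probability of executing the oracle's action; the function names and the per-iteration decay schedule are illustrative. The label stored for training is always the oracle's action, regardless of which action is executed.

```python
import random

def exploration_coefficient(iteration, num_iterations):
    """Linear decay from 1 at the first outer iteration to 0 at the last."""
    if num_iterations <= 1:
        return 1.0
    return 1.0 - iteration / (num_iterations - 1)

def action_to_execute(oracle_action, learned_action, beta, rng=random):
    """Follow the oracle with probability beta, else the learned policy."""
    return oracle_action if rng.random() < beta else learned_action

# Example schedule over five outer iterations.
schedule = [exploration_coefficient(i, 5) for i in range(5)]
```

With a coefficient fixed at 1, the trajectories come from the oracle alone, which corresponds to the Behavioral Cloning data generation described above.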

For reinforcement learning, we run experiments with the A3C algorithm. The policy learning module has a linear layer of size 256 followed by an LSTM layer of size 256 which encodes the history of state observations. The LSTM layer’s output is fully-connected to a single neuron to predict the value function as well as three other neurons to predict the policy function. All the network parameters are shared for predicting both the value function and the policy function except the final fully connected layer. All the convolutional layers and fully-connected linear layers have ReLU activations [Nair and Hinton 2010]. The A3C model was trained using Stochastic Gradient Descent (SGD) with a learning rate of 0.001. We used a discount factor of 0.99 for calculating expected rewards and ran 16 parallel threads for each experiment. We use the mean-squared loss between the estimated value function and the discounted sum of rewards for training with respect to the value function, and the policy gradient loss for training with respect to the policy function.
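As a small worked example of the value-function target, the sketch below computes the discounted sum of rewards with the stated discount factor of 0.99 and the mean-squared loss against value estimates. The function names and the sample rollout are hypothetical.

```python
def discounted_returns(rewards, gamma=0.99, bootstrap_value=0.0):
    """Discounted sum of future rewards for every step of a rollout."""
    returns = []
    running = bootstrap_value
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns

def value_loss(value_estimates, returns):
    """Mean-squared error between value estimates and discounted returns."""
    squared = [(v - g) ** 2 for v, g in zip(value_estimates, returns)]
    return sum(squared) / len(squared)

# A three-step rollout that ends with a reward of 1 (illustrative values).
targets = discounted_returns([0.0, 0.0, 1.0])
```

For this rollout the targets are approximately 0.9801, 0.99 and 1.0, so states closer to the rewarded step receive higher value targets.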

7 Results & Discussions

For all the models described in section 4, the performance on both Multitask and Zero-shot Generalization is shown in Table 1. The performance of A3C models on Multitask Generalization during training is plotted in Figure 6.

Performance of GA models: We observe that models with the Gated-Attention (GA) unit outperform models with the Concatenation unit for Multitask and Zero-Shot Generalization. From Figure 6 we observe that A3C models with GA units learn faster than Concat models and converge to higher levels of accuracy. In hard mode, GA achieves 83% accuracy on Multitask Generalization and 73% on Zero-Shot Generalization, whereas Concat achieves only 24% and 12%, respectively (Table 1). For Imitation Learning, we observe that GA models perform better than Concat, and that imitation learning performs progressively worse as the environment modes get harder, since the medium and hard settings require exploration. In contrast, the inherent extensive exploration of the reinforcement learning algorithm makes the A3C model more robust to the agent’s location and lets it cover more state trajectories.

Policy Execution: Figure 9 shows a policy execution of the A3C model in the hard mode for the instruction ‘Go to the short green torch’. In this figure, we demonstrate the agent’s ability to explore the environment and handle occlusion. In this example, none of the objects are in the field of view of the agent in the initial frame. The agent explores the environment (makes a turn of roughly 300 degrees) and eventually navigates towards the target object. It has also learned to distinguish between a short green torch and a tall green torch, and to avoid the tall torch before reaching the short torch.

Figure 7: Heatmap of the values of the 64-dimensional attention vector for different instructions grouped by object type and sub-grouped by object color. The test instructions are marked by *. The red boxes indicate that certain dimensions of the attention vector get activated for particular attributes of the target object referred in the instruction.
Figure 8: The t-SNE visualization of the attention vectors showing clusters based on object color, type and size.
Figure 9: An example of the A3C policy execution at different points for the instruction ‘Go to the short green torch’. Left: navigation map of the agent. Right: frames at each point. A: Initial frame, none of the objects are visible. B: The agent has turned so that objects are in the field of view. C: The agent successfully avoids the tall green torch. D: The agent moves towards the short green torch. E: The agent reaches the target.

Analysis of Attention Maps: Figure 7 shows the heatmap of the values of the attention vector for different instructions, grouped by the object type of the target object (additional attention maps are given in the supplementary material). As seen in the figure, dimension 18 corresponds to ‘armor’, dimension 8 corresponds to ‘skullkey’ and dimension 36 corresponds to ‘pillar’. Also, note that there is no dimension which is high for all the instructions in the first group. This indicates that the model also recognizes that the word ‘object’ does not correspond to a particular object type, but rather refers to any object of that color (indicated by dotted red boxes in Figure 7). These observations indicate that the model is learning to recognize the attributes of objects such as color and type, and that specific feature maps are gated based on these attributes. Furthermore, the attention vector weights on the test instructions (marked by * in Figure 7) indicate that the Gated-Attention unit is also able to recognize attributes of the object in unseen instructions. We also visualize the t-SNE plots of the attention vectors based on attributes, color and object type, as shown in Figure 8. The attention vectors for objects that are red, blue, green or yellow form clusters, whereas those for instructions which do not mention the object’s color are spread out and belong to the clusters corresponding to the object type. Similarly, objects of a particular type form clusters. The clusters indicate that the model is able to recognize object attributes, as it learns similar attention vectors for objects with similar attributes.

8 Conclusion

In this paper we proposed an end-to-end architecture for task-oriented language grounding from raw pixels in a 3D environment, for both reinforcement learning and imitation learning. The architecture uses a novel multimodal fusion mechanism, Gated-Attention, which learns a joint state representation based on multiplicative interactions between instruction and image representation. We observe that the models (A3C for reinforcement learning and Behavioral Cloning/DAgger for imitation learning) which use the Gated-Attention unit outperform the models with concatenation units for both Multitask and Zero-Shot task generalization, across three modes of difficulty. The visualization of the attention weights for the Gated-Attention unit indicates that the agent learns to recognize objects, color attributes and size attributes.


We would like to thank Prof. Louis-Philippe Morency and Dr. Tadas Baltrušaitis for their valuable comments and guidance throughout the development of this work. This work was partially supported by BAE grants ADeLAIDE FA8750-16C-0130-001 and ConTAIN HR0011-16-C-0136.


Appendix A Doom objects

The ViZDoom environment supports spawning of several objects of various colors and sizes. The types of objects available are Columns, Torches, Armors and Keycards. In our experiments, we use several of these objects, which are shown in Figure 10.

Figure 10: Objects of various colors and sizes used in the environment

Appendix B Instructions

The list of 70 navigational instructions that were used to train and test the system is given in Table 2.

Instruction Type           Instructions
Size + Color               tall green torch, short red object, short red pillar, short red torch, tall red object, tall blue object, tall green object, tall red pillar, tall green pillar, short blue torch, tall red torch, short green torch, short green object, short blue object, tall blue torch, short green pillar
Color + Size               red short object, green tall torch, red short pillar, red short torch, red tall object, green tall object, blue tall object, red tall pillar, green tall pillar, red tall torch, blue tall torch, green short object, green short torch, blue short object, green short pillar, blue short torch
Color                      blue torch, red torch, green torch, yellow object, green armor, tall object, red skullkey, red object, green object, blue object, red pillar, green pillar, red keycard, red armor, blue skullkey, blue keycard, yellow keycard, yellow skullkey
Object Type                torch, keycard, skullkey, pillar, armor
Superlative Size + Color   smallest yellow object, smallest blue object, smallest green object, largest blue object, largest red object, largest green object, largest yellow object, smallest red object
Superlative Size           largest object, smallest object
Size                       short torch, tall torch, tall pillar, short pillar, short object, tall object
Table 2: List of instructions. Each instruction is of the form ‘Go to the X’, where ‘X’ is an entry in the table.

Appendix C Attention Maps

The attention maps for different instructions grouped by description are shown in Figure 11, and those grouped by color are shown in Figure 12.

Figure 11: Attention vector output for different instructions grouped by description. The test instructions are marked by *.
Figure 12: Attention vector output for different instructions grouped by color. The test instructions are marked by *.