Mapping Instructions and Visual Observations to Actions with Reinforcement Learning

04/28/2017 ∙ by Dipendra Misra, et al. ∙ Microsoft ∙ Cornell University

We propose to directly map raw visual observations and text input to actions for instruction execution. While existing approaches assume access to structured environment representations or use a pipeline of separately trained models, we learn a single model to jointly reason about linguistic and visual input. We use reinforcement learning in a contextual bandit setting to train a neural network agent. To guide the agent's exploration, we use reward shaping with different forms of supervision. Our approach does not require intermediate representations, planning procedures, or training different models. We evaluate in a simulated environment, and show significant improvements over supervised learning and common reinforcement learning variants.




1 Introduction

Put the Toyota block in the same row as the SRI block, in the first open space to the right of the SRI block
Move Toyota to the immediate right of SRI, evenly aligned and slightly separated
Move the Toyota block around the pile and place it just to the right of the SRI block
Place Toyota block just to the right of The SRI Block
Toyota, right side of SRI
Figure 1: Instructions in the Blocks environment. The instructions all describe the same task. Given the observed RGB image of the start state (large image), our goal is to execute such instructions. In this task, the direct-line path to the target position is blocked, and the agent must plan and move the Toyota block around. The small image marks the target and an example path, which includes 34 steps.

An agent executing natural language instructions requires robust understanding of language and its environment. Existing approaches addressing this problem assume structured environment representations (e.g., Chen and Mooney, 2011; Mei et al., 2016), or combine separately trained models (e.g., Matuszek et al., 2010; Tellex et al., 2011), including for language understanding and visual reasoning. We propose to directly map text and raw image input to actions with a single learned model. This approach offers multiple benefits, such as not requiring intermediate representations, planning procedures, or training multiple models.

Figure 1 illustrates the problem in the Blocks environment (Bisk et al., 2016). The agent observes the environment as an RGB image using a camera sensor. Given the RGB input, the agent must recognize the blocks and their layout. To understand the instruction, the agent must identify the block to move (Toyota block) and the destination (just right of the SRI block). This requires solving semantic and grounding problems. For example, consider the topmost instruction in the figure. The agent needs to identify the phrase referring to the block to move, Toyota block, and ground it. It must resolve and ground the phrase SRI block as a reference position, which is then modified by the spatial meaning recovered from the same row as or first open space to the right of, to identify the goal position. Finally, the agent needs to generate actions, for example moving the Toyota block around obstructing blocks.

To address these challenges with a single model, we design a neural network agent. The agent executes instructions by generating a sequence of actions. At each step, the agent takes as input the instruction text, observes the world as an RGB image, and selects the next action. Action execution changes the state of the world. Given an observation of the new world state, the agent selects the next action. This process continues until the agent indicates execution completion. When selecting actions, the agent jointly reasons about its observations and the instruction text. This enables decisions based on close interaction between observations and linguistic input.

We train the agent with different levels of supervision, including complete demonstrations of the desired behavior and annotations of the goal state only. While the learning problem can be easily cast as a supervised learning problem, learning only from the states observed in the training data results in poor generalization and failure to recover from test errors. We use reinforcement learning (Sutton and Barto, 1998) to observe a broader set of states through exploration. Following recent work in robotics (e.g., Levine et al., 2016; Rusu et al., 2016), we assume the training environment, in contrast to the test environment, is instrumented and provides access to the state. This enables a simple problem reward function that uses the state and provides positive reward on task completion only. This type of reward offers two important advantages: (a) it is a simple way to express the ideal agent behavior we wish to achieve, and (b) it creates a platform to add training data information.

We use reward shaping (Ng et al., 1999) to exploit the training data and add to the reward additional information. The modularity of shaping allows varying the amount of supervision, for example by using complete demonstrations for only a fraction of the training examples. Shaping also naturally associates actions with immediate reward. This enables learning in a contextual bandit setting (Auer et al., 2002; Langford and Zhang, 2007), where optimizing the immediate reward is sufficient and has better sample complexity than unconstrained reinforcement learning (Agarwal et al., 2014).

We evaluate with the block world environment and data of Bisk et al. (2016), where each instruction moves one block (Figure 1). While the original task focused on source and target prediction only, we build an interactive simulator and formulate the task of predicting the complete sequence of actions. At each step, the agent must select between 81 actions with 15.4 steps required to complete a task on average, significantly more than existing environments (e.g., Chen and Mooney, 2011). Our experiments demonstrate that our reinforcement learning approach effectively reduces execution error by 24% over standard supervised learning and 34-39% over common reinforcement learning techniques. Our simulator, code, models, and execution videos are available at:

2 Technical Overview


Let $\mathcal{X}$ be the set of all instructions, $\mathcal{S}$ the set of all world states, and $\mathcal{A}$ the set of all actions. An instruction $\bar{x} \in \mathcal{X}$ is a sequence $\langle x_1, \dots, x_n \rangle$, where each $x_i$ is a token. The agent executes instructions by generating a sequence of actions, and indicates execution completion with the special action $\texttt{STOP}$. Action execution modifies the world state following a transition function $T : \mathcal{S} \times \mathcal{A} \rightarrow \mathcal{S}$. The execution $\bar{e}$ of an instruction $\bar{x}$ starting from $s_1$ is an $m$-length sequence $\langle (s_1, a_1), \dots, (s_m, a_m) \rangle$, where $s_j \in \mathcal{S}$, $a_j \in \mathcal{A}$, $T(s_j, a_j) = s_{j+1}$, and $a_m = \texttt{STOP}$. In Blocks (Figure 1), a state specifies the positions of all blocks. For each action, the agent moves a single block on the plane in one of four directions (north, south, east, or west). There are 20 blocks, and 81 possible actions at each step, including $\texttt{STOP}$. For example, to correctly execute the instructions in the figure, the agent’s likely first action is $\texttt{TOYOTA-WEST}$, which moves the Toyota block one step west. Blocks cannot move over or through other blocks.
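As a rough illustration of the action space and transition function described above, the following is a minimal sketch and not the simulator used in the experiments. The names `all_actions` and `transition`, the dictionary state representation, and the axis orientation are assumptions made for the example; collision handling is deliberately left out.

```python
from typing import Dict, Tuple

# Hypothetical constants for illustration. The text fixes 20 blocks, four
# directions, and a step of 0.04 of the board size; the axis orientation
# chosen here is an assumption.
NUM_BLOCKS = 20
DIRECTIONS = {
    "north": (0.0, 1.0),
    "south": (0.0, -1.0),
    "east": (1.0, 0.0),
    "west": (-1.0, 0.0),
}
STEP = 0.04
STOP = ("STOP", None)

State = Dict[int, Tuple[float, float]]  # block id -> (x, y) position


def all_actions():
    """Enumerate every (block, direction) pair plus the STOP action."""
    moves = [(block, d) for block in range(NUM_BLOCKS) for d in DIRECTIONS]
    return moves + [STOP]


def transition(state: State, action) -> State:
    """Deterministic T(s, a): move one block by one step.

    STOP leaves the state unchanged. Collision checks (blocks cannot pass
    through each other) are omitted from this sketch.
    """
    if action == STOP:
        return dict(state)
    block, direction = action
    if block not in state:
        return dict(state)  # block absent from this board: treated as a no-op
    dx, dy = DIRECTIONS[direction]
    x, y = state[block]
    new_state = dict(state)
    new_state[block] = (x + dx * STEP, y + dy * STEP)
    return new_state
```

With 20 blocks and four directions, `all_actions()` yields 20 × 4 + 1 = 81 actions, matching the count in the text.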


The agent observes the world state via a visual sensor (i.e., a camera). Given a world state $s$, the agent observes an RGB image $I$ generated by the function $\text{IMG}(s)$. We distinguish between the world state $s$ and the agent context $\tilde{s}$ [1], which includes the instruction, the observed image $\text{IMG}(s)$, images of previous states, and the previous action. To map instructions to actions, the agent reasons about the agent context $\tilde{s}$ to generate a sequence of actions. At each step, the agent generates a single action. We model the agent with a neural network policy. At each step $j$, the network takes as input the current agent context $\tilde{s}_j$, and predicts the next action to execute $a_j$. We formally define the agent context and model in Section 4.

[1] We use the term context similar to how it is used in the contextual bandit literature to refer to the information available for decision making. While agent contexts capture information about the world state, they do not include physical information, except as captured by observed images.


We assume access to training data with $N$ examples $\{(\bar{x}^{(i)}, s_1^{(i)}, \bar{e}^{(i)})\}_{i=1}^{N}$, where $\bar{x}^{(i)}$ is an instruction, $s_1^{(i)}$ is a start state, and $\bar{e}^{(i)}$ is an execution demonstration of $\bar{x}^{(i)}$ starting at $s_1^{(i)}$. We use policy gradient (Section 5) with reward shaping derived from the training data to increase learning speed and exploration effectiveness (Section 6). Following work in robotics (e.g., Levine et al., 2016), we assume an instrumented environment with access to the world state to compute the reward during training only. We define our approach in general terms with demonstrations, but also experiment with training using goal states.


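We evaluate task completion error on a test set $\{(\bar{x}^{(i)}, s_1^{(i)}, s_g^{(i)})\}_{i=1}^{M}$, where $\bar{x}^{(i)}$ is an instruction, $s_1^{(i)}$ is a start state, and $s_g^{(i)}$ is the goal state. We measure execution error as the distance between the final execution state and $s_g^{(i)}$.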

3 Related Work

Learning to follow instructions was studied extensively with structured environment representations, including with semantic parsing (Chen and Mooney, 2011; Kim and Mooney, 2012, 2013; Artzi and Zettlemoyer, 2013; Artzi et al., 2014a, b; Misra et al., 2015, 2016), alignment models (Andreas and Klein, 2015), reinforcement learning (Branavan et al., 2009, 2010; Vogel and Jurafsky, 2010), and neural network models (Mei et al., 2016). In contrast, we study the problem of an agent that takes as input instructions and raw visual input. Instruction following with visual input was studied with pipeline approaches that use separately learned models for visual reasoning (Matuszek et al., 2010, 2012; Tellex et al., 2011; Paul et al., 2016). Rather than decomposing the problem, we adopt a single-model approach and learn from instructions paired with demonstrations or goal states. Our work is related to Sung et al. (2015). While they use sensory input to select and adjust a trajectory observed during training, we are not restricted to training sequences. Executing instructions in non-learning settings has also received significant attention (e.g., Winograd, 1972; Webber et al., 1995; MacMahon et al., 2006).

Our work is related to a growing interest in problems that combine language and vision, including visual question answering (e.g., Antol et al., 2015; Andreas et al., 2016b, a), caption generation (e.g., Chen et al., 2015, 2016; Xu et al., 2015), and visual reasoning (Johnson et al., 2016; Suhr et al., 2017). We address the prediction of the next action given a world image and an instruction.

Reinforcement learning with neural networks has been used for various NLP tasks, including text-based games (Narasimhan et al., 2015; He et al., 2016), information extraction (Narasimhan et al., 2016), co-reference resolution (Clark and Manning, 2016), and dialog (Li et al., 2016).

Neural network reinforcement learning techniques have been recently studied for behavior learning tasks, including playing games (Mnih et al., 2013, 2015, 2016; Silver et al., 2016) and solving memory puzzles (Oh et al., 2016). In contrast to this line of work, our data is limited. Observing new states in a computer game simply requires playing it. However, our agent also considers natural language instructions. As the set of instructions is limited to the training data, the set of agent contexts seen during learning is constrained. We address the data efficiency problem by learning in a contextual bandit setting, which is known to be more tractable (Agarwal et al., 2014), and using reward shaping to increase exploration effectiveness. Zhu et al. (2017) address generalization of reinforcement learning to new target goals in visual search by providing the agent an image of the goal state. We address a related problem. However, we provide natural language and the agent must learn to recognize the goal state.

Reinforcement learning is extensively used in robotics (Kober et al., 2013). Similar to recent work on learning neural network policies for robot control (Levine et al., 2016; Schulman et al., 2015; Rusu et al., 2016), we assume an instrumented training environment and use the state to compute rewards during learning. Our approach adds the ability to specify tasks using natural language.

4 Model

Figure 2: Illustration of the policy architecture showing the 10th step in the execution of the instruction Place the Toyota east of SRI in the state from Figure 1. The network takes as input the instruction $\bar{x}$, image of the current state $I_{10}$, images of previous states $I_8$ and $I_9$ (with $K = 2$), and the previous action $a_9$. The text and images are embedded with LSTM and CNN. The actions are selected with the task specific multi-layer perceptron.

We model the agent policy $\pi$ with a neural network. The agent observes the instruction and an RGB image of the world. Given a world state $s$, the image $I$ is generated using the function $\text{IMG}(s)$. The instruction execution is generated one step at a time. At each step $j$, the agent observes an image $I_j$ of the current world state $s_j$ and the instruction $\bar{x}$, predicts the action $a_j$, and executes it to transition to the next state $s_{j+1}$. This process continues until $\texttt{STOP}$ is predicted and the agent stops, indicating instruction completion. The agent also has access to $K$ images of previous states and the previous action to distinguish between different stages of the execution (Mnih et al., 2015). Figure 2 illustrates our architecture.


We use bold-face capital letters for matrices and bold-face lowercase letters for vectors. Computed input and state representations use bold versions of the symbols. For example, $\bar{\mathbf{x}}$ is the computed representation of an instruction $\bar{x}$. At step $j$, the agent considers an agent context $\tilde{s}_j$, which is a tuple $(\bar{x}, I_j, I_{j-1}, \dots, I_{j-K}, a_{j-1})$, where $\bar{x}$ is the natural language instruction, $I_j$ is an image of the current world state, the images $I_{j-1}, \dots, I_{j-K}$ represent $K$ previous states, and $a_{j-1}$ is the previous action. The agent context includes information about the current state and the execution. Considering the previous action $a_{j-1}$ allows the agent to avoid repeating failed actions, for example when trying to move in the direction of an obstacle. In Figure 2, the agent is given the instruction Place the Toyota east of SRI, is at the 10th execution step, and considers $K = 2$ previous images.
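The agent context can be pictured as a small record that is rebuilt at every step. The sketch below is one possible way to maintain it during a rollout; the class names `AgentContext` and `ContextBuilder` and the nested-list image type are assumptions for illustration, not part of the described model.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Tuple

Image = List[List[float]]  # stand-in for an RGB image; real inputs are tensors


@dataclass(frozen=True)
class AgentContext:
    """The tuple (instruction, current image, K previous images, previous action)."""
    instruction: Tuple[str, ...]
    current_image: Image
    previous_images: Tuple[Image, ...]  # most recent first, length K
    previous_action: str


class ContextBuilder:
    """Keeps the K most recent images and the last action during a rollout."""

    def __init__(self, instruction, k: int, blank: Image):
        self.instruction = tuple(instruction)
        # Before execution starts there is no history, so blank images are used.
        self.history: Deque[Image] = deque([blank] * k, maxlen=k)
        self.previous_action = "NONE"

    def observe(self, image: Image) -> AgentContext:
        """Build the context for the current step from the newly observed image."""
        return AgentContext(self.instruction, image,
                            tuple(self.history), self.previous_action)

    def record(self, image: Image, action: str) -> None:
        """Store the image and action of the step that was just executed."""
        self.history.appendleft(image)  # the oldest image is dropped when full
        self.previous_action = action
```

A rollout would call `observe` before selecting an action and `record` after executing it, so the context never contains the world state itself, only what the agent has seen.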

We generate continuous vector representations for all inputs, and jointly reason about both text and image modalities to select the next action. We use a recurrent neural network (RNN; Elman, 1990) with a long short-term memory (LSTM; Hochreiter and Schmidhuber, 1997) recurrence to map the instruction $\bar{x} = \langle x_1, \dots, x_n \rangle$ to a vector representation $\bar{\mathbf{x}}$. Each token $x_i$ is mapped to a fixed dimensional vector with the learned embedding function $\psi(x_i)$. The instruction representation $\bar{\mathbf{x}}$ is computed by applying the LSTM recurrence to generate a sequence of hidden states $\mathbf{l}_i = \text{LSTM}(\psi(x_i), \mathbf{l}_{i-1})$, and computing the mean $\bar{\mathbf{x}} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{l}_i$ (Narasimhan et al., 2015). The current image $I_j$ and previous images $I_{j-1}, \dots, I_{j-K}$ are concatenated along the channel dimension and embedded with a convolutional neural network (CNN) to generate the visual state $\mathbf{v}_j$ (Mnih et al., 2013). The last action $a_{j-1}$ is embedded with the function $\psi_a(a_{j-1})$. The vectors $\mathbf{v}_j$, $\bar{\mathbf{x}}$, and $\psi_a(a_{j-1})$ are concatenated to create the agent context vector representation $\tilde{\mathbf{s}}_j = [\mathbf{v}_j, \bar{\mathbf{x}}, \psi_a(a_{j-1})]$.

To compute the action to execute, we use a feed-forward perceptron that decomposes according to the domain actions. This computation selects the next action conditioned on the instruction text and observations from both the current world state and recent history. In the block world domain, where actions decompose to selecting the block to move and the direction, the network computes block and direction probabilities. Formally, we decompose an action $a$ to direction $a^D$ and block $a^B$. We compute the feedforward network:

$$\mathbf{h}^1 = \max(\mathbf{W}^{(1)} \tilde{\mathbf{s}}_j + \mathbf{b}^{(1)}, 0)$$
$$\mathbf{h}^D = \mathbf{W}^{(D)} \mathbf{h}^1 + \mathbf{b}^{(D)}$$
$$\mathbf{h}^B = \mathbf{W}^{(B)} \mathbf{h}^1 + \mathbf{b}^{(B)}$$

and the action probability is a product of the component probabilities:

$$P(a_j^D = d \mid \bar{x}, s_j, a_{j-1}) \propto \exp(\mathbf{h}^D_d)$$
$$P(a_j^B = b \mid \bar{x}, s_j, a_{j-1}) \propto \exp(\mathbf{h}^B_b)$$

At the beginning of execution, the first action $a_0$ is set to the special value NONE, and previous images are zero matrices. The embedding function $\psi$ is a learned matrix. The function $\psi_a$ concatenates the embeddings of $a^D_{j-1}$ and $a^B_{j-1}$, which are obtained from learned matrices, to compute the embedding of $a_{j-1}$. The model parameters $\theta$ include $\mathbf{W}^{(1)}$, $\mathbf{b}^{(1)}$, $\mathbf{W}^{(D)}$, $\mathbf{b}^{(D)}$, $\mathbf{W}^{(B)}$, $\mathbf{b}^{(B)}$, the parameters of the LSTM recurrence, the parameters of the convolutional network CNN, and the embedding matrices. In our experiments (Section 7), all parameters are learned without external resources.
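The factored action distribution can be illustrated with a small sketch. It assumes the two output layers have already produced one score vector for directions and one for blocks; the function names and the toy score sizes are hypothetical, and how the stop action is encoded in the two heads is not modeled here.

```python
import math
from typing import Dict, List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def action_distribution(h_direction: List[float],
                        h_block: List[float]) -> Dict[Tuple[int, int], float]:
    """Return P(block, direction) = P(block) * P(direction) for every pair."""
    p_dir = softmax(h_direction)
    p_blk = softmax(h_block)
    return {(b, d): p_blk[b] * p_dir[d]
            for b in range(len(p_blk))
            for d in range(len(p_dir))}
```

Because each head is normalized separately, the product over all pairs is itself a valid distribution, and the network needs only one output per block plus one per direction rather than one per joint action.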

5 Learning

We use policy gradient for reinforcement learning (Williams, 1992) to estimate the parameters $\theta$ of the agent policy. We assume access to a training set of $N$ examples $\{(\bar{x}^{(i)}, s_1^{(i)}, \bar{e}^{(i)})\}_{i=1}^{N}$, where $\bar{x}^{(i)}$ is an instruction, $s_1^{(i)}$ is a start state, and $\bar{e}^{(i)}$ is an execution demonstration starting from $s_1^{(i)}$ of instruction $\bar{x}^{(i)}$. The main learning challenge is learning how to execute instructions given raw visual input from relatively limited data. We learn in a contextual bandit setting, which provides theoretical advantages over general reinforcement learning. In Section 8, we verify this empirically.

Reward Function

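The instruction execution problem defines a simple problem reward to measure task completion. The agent receives a positive reward when the task is completed, a negative reward for incorrect completion (i.e., $\texttt{STOP}$ in the wrong state) and actions that fail to execute (e.g., when the direction is blocked), and a small penalty otherwise, which induces a preference for shorter trajectories. To compute the reward, we assume access to the world state. This learning setup is inspired by work in robotics, where it is achieved by instrumenting the training environment (Section 3). The agent, on the other hand, only uses the agent context (Section 4). When deployed, the system relies on visual observations and natural language instructions only. The reward function $R^{(i)} : \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ is defined for each training example $(\bar{x}^{(i)}, s_1^{(i)}, \bar{e}^{(i)})$, $i = 1 \dots N$:

$$R^{(i)}(s, a) = \begin{cases} 1.0 & \text{if } s = s_{m^{(i)}} \text{ and } a = \texttt{STOP} \\ -1.0 & \text{if } s \neq s_{m^{(i)}} \text{ and } a = \texttt{STOP} \\ -1.0 & \text{if } a \text{ fails to execute} \\ -\delta & \text{otherwise} \end{cases}$$

where $m^{(i)}$ is the length of $\bar{e}^{(i)}$.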
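The case analysis of the problem reward can be written as a short function. This is a sketch under stated assumptions: the reward magnitudes of 1.0 and the default step penalty `delta` are illustrative values, and the boolean inputs stand in for checks that the instrumented training environment would perform on the world state.

```python
def problem_reward(state_is_goal: bool,
                   action: str,
                   action_failed: bool,
                   delta: float = 0.02) -> float:
    """Sparse task-completion reward.

    state_is_goal: whether the current world state matches the goal state
                   (computed by the instrumented environment, not the agent).
    action_failed: whether the action could not be executed, e.g. blocked.
    delta:         small per-step penalty; the default is a made-up value.
    """
    if action == "STOP":
        # Stopping is rewarded only when the task is actually complete.
        return 1.0 if state_is_goal else -1.0
    if action_failed:
        return -1.0
    # Any other step costs a little, which favors shorter trajectories.
    return -delta
```

Note that positive reward appears only at the very end of a correct execution, which is why random exploration early in training rarely observes it.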

The reward function does not provide intermediate positive feedback to the agent for actions that bring it closer to its goal. When the agent explores randomly early during learning, it is unlikely to encounter the goal state due to the large number of steps required to execute tasks. As a result, the agent does not observe positive reward and fails to learn. In Section 6, we describe how reward shaping, a method to augment the reward with additional information, is used to take advantage of the training data and address this challenge.

Policy Gradient Objective

We adapt the policy gradient objective defined by Sutton et al. (1999) to multiple starting states and reward functions:

$$J = \frac{1}{N} \sum_{i=1}^{N} V_{\pi}^{(i)}(s_1^{(i)})$$

where $V_{\pi}^{(i)}(s_1^{(i)})$ is the value given by $R^{(i)}$ starting from $s_1^{(i)}$ under the policy $\pi$. The summation expresses the goal of learning a behavior parameterized by natural language instructions.

Contextual Bandit Setting

In contrast to most policy gradient approaches, we apply the objective to a contextual bandit setting where immediate reward is optimized rather than total expected reward. The primary theoretical advantage of contextual bandits is much tighter sample complexity bounds when comparing upper bounds for contextual bandits (Langford and Zhang, 2007) even with an adversarial sequence of contexts (Auer et al., 2002) to lower bounds (Krishnamurthy et al., 2016) or upper bounds (Kearns et al., 1999) for total reward maximization. This property is particularly suitable for the few-sample regime common in natural language problems. While reinforcement learning with neural network policies is known to require large amounts of training data (Mnih et al., 2015), the limited number of training sentences constrains the diversity and volume of agent contexts we can observe during training. Empirically, this translates to poor results when optimizing the total reward (REINFORCE baseline in Section 8). To derive the approximate gradient, we use the likelihood ratio method:

$$\nabla_{\theta} J = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{m} \mathbb{E}\left[ \nabla_{\theta} \log \pi(\tilde{s}_j, a_j) \, R^{(i)}(s_j, a_j) \right]$$

where reward is computed from the world state but policy is learned on the agent context. We approximate the gradient using sampling.
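To make the likelihood ratio estimate concrete, the sketch below computes the single-sample gradient for one step of a softmax policy. It is a toy illustration only: the policy here is parameterized directly by a vector of logits rather than by a neural network, and the function names are hypothetical. For a softmax, the gradient of the log-probability of the chosen action with respect to the logits is the one-hot vector of that action minus the probability vector.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bandit_gradient(logits: List[float], action: int, reward: float) -> List[float]:
    """Single-sample estimate of grad_logits [log pi(action)] * immediate reward."""
    probs = softmax(logits)
    return [reward * ((1.0 if k == action else 0.0) - p)
            for k, p in enumerate(probs)]
```

A positive immediate reward raises the logit of the sampled action and lowers the others; a negative reward does the opposite. Only the immediate reward enters the estimate, which is what distinguishes the contextual bandit update from one based on the cumulative return.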

This training regime, where immediate reward optimization is sufficient to optimize policy parameters , is enabled by the shaped reward we introduce in Section 6. While the objective is designed to work best with the shaped reward, the algorithm remains the same for any choice of reward definition including the original problem reward or several possibilities formed by reward shaping.

Entropy Penalty

We observe that early in training, the agent is overwhelmed with negative reward and rarely completes the task. This results in the policy $\pi$ rapidly converging towards a suboptimal deterministic policy with an entropy of 0. To delay premature convergence we add an entropy term to the objective (Williams and Peng, 1991; Mnih et al., 2016). The entropy term encourages a uniform distribution policy, and in practice stimulates exploration early during training. The regularized gradient is:

$$\nabla_{\theta} J = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{m} \mathbb{E}\left[ \nabla_{\theta} \log \pi(\tilde{s}_j, a_j) \, R^{(i)}(s_j, a_j) + \lambda \nabla_{\theta} H(\pi(\tilde{s}_j, \cdot)) \right]$$

where $H(\pi(\tilde{s}_j, \cdot))$ is the entropy of $\pi$ given the agent context $\tilde{s}_j$, and $\lambda$ is a hyperparameter that controls the strength of the regularization. While the entropy term delays premature convergence, it does not eliminate it. Similar issues are observed for vanilla policy gradient (Mnih et al., 2016).
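The effect of the entropy term can be checked numerically with a small sketch. It assumes a softmax policy and takes the gradient with respect to the logits rather than network parameters; for that case the derivative of the entropy with respect to logit $k$ works out to $-p_k(\log p_k + H)$. The function names are hypothetical.

```python
import math
from typing import List


def entropy(probs: List[float]) -> float:
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_gradient(probs: List[float]) -> List[float]:
    """Gradient of the entropy with respect to the softmax logits.

    Derived from dp_i/dz_k = p_i * (delta_ik - p_k), which gives
    dH/dz_k = -p_k * (log p_k + H).
    """
    h = entropy(probs)
    return [-p * (math.log(p) + h) if p > 0.0 else 0.0 for p in probs]
```

For a uniform policy the gradient vanishes, since entropy is already maximal. For a skewed policy the gradient is negative for the dominant action and positive for the rest, so adding it to the update pushes probability mass back toward less likely actions.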

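1: Input: Training set $\{(\bar{x}^{(i)}, s_1^{(i)}, \bar{e}^{(i)})\}_{i=1}^{N}$, learning rate $\mu$, epochs $T$, horizon $J$, and entropy regularization term $\lambda$.
2: Definitions: $\text{IMG}(s)$ is a camera sensor that reports an RGB image of state $s$. $\pi$ is a probabilistic neural network policy parameterized by $\theta$, as described in Section 4. $\text{EXECUTE}(s, a)$ executes the action $a$ at the state $s$, and returns the new state. $R^{(i)}$ is the reward function for example $i$. $\text{ADAM}(\Delta)$ applies a per-feature learning rate to the gradient $\Delta$ (Kingma and Ba, 2014).
3: Output: Policy parameters $\theta$.
4: » Iterate over the training data.
5: for $t = 1$ to $T$, $i = 1$ to $N$ do
6:     $I_{1-K}, \dots, I_0 = \vec{0}$
7:     $a_0 = \text{NONE}$, $s_1 = s_1^{(i)}$
8:     $j = 1$
9:     » Rollout up to episode limit.
10:    while $j \le J$ and $a_{j-1} \ne \texttt{STOP}$ do
11:        » Observe world and construct agent context.
12:        $I_j = \text{IMG}(s_j)$
13:        $\tilde{s}_j = (\bar{x}^{(i)}, I_j, I_{j-1}, \dots, I_{j-K}, a_{j-1})$
14:        » Sample an action from the policy.
15:        $a_j \sim \pi(\tilde{s}_j, \cdot)$
16:        $s_{j+1} = \text{EXECUTE}(s_j, a_j)$
17:        » Compute the approximate gradient.
18:        $\Delta_j \leftarrow \nabla_{\theta} \log \pi(\tilde{s}_j, a_j) \, R^{(i)}(s_j, a_j) + \lambda \nabla_{\theta} H(\pi(\tilde{s}_j, \cdot))$
19:        $j = j + 1$
20:    end while
21:    $\theta \leftarrow \theta + \mu \, \text{ADAM}\left(\frac{1}{j-1} \sum_{j'=1}^{j-1} \Delta_{j'}\right)$
22: end for
23: return $\theta$
Algorithm 1 Policy gradient learning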


Algorithm 1 shows our learning algorithm. We iterate over the data $T$ times. In each epoch, for each training example $(\bar{x}^{(i)}, s_1^{(i)}, \bar{e}^{(i)})$, $i = 1 \dots N$, we perform a rollout using our policy to generate an execution (lines 10 - 20). The length of the rollout is bound by $J$, but may be shorter if the agent selected the $\texttt{STOP}$ action. At each step $j$, the agent updates the agent context $\tilde{s}_j$ (lines 12 - 13), samples an action from the policy $\pi$ (line 15), and executes it to generate the new world state $s_{j+1}$ (line 16). The gradient is approximated using the sampled action with the computed reward $R^{(i)}(s_j, a_j)$ (line 18). Following each rollout, we update the parameters $\theta$ with the mean of the gradients using Adam (Kingma and Ba, 2014).
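The structure of a single rollout can be sketched with the environment and policy passed in as plain functions. This is a simplified skeleton under stated assumptions, not the training code: the entropy term and the Adam update are omitted, the callables (`make_context`, `sample_action`, `execute`, `reward`, `grad_log_prob`) are hypothetical placeholders, and gradients are plain lists of floats.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def rollout_gradient(start_state,
                     horizon: int,
                     make_context: Callable,
                     sample_action: Callable,
                     execute: Callable,
                     reward: Callable,
                     grad_log_prob: Callable) -> Tuple[Vector, int]:
    """Run one rollout and return (mean per-step gradient, number of steps).

    Each step contributes grad log pi(context, action) scaled by the
    immediate reward only, as in the contextual bandit setting.
    """
    state, prev_action = start_state, "NONE"
    step_grads: List[Vector] = []
    for _ in range(horizon):
        context = make_context(state, prev_action)   # observe and build context
        action = sample_action(context)              # sample from the policy
        r = reward(state, action)                    # uses the world state
        g = grad_log_prob(context, action)           # uses the agent context
        step_grads.append([r * x for x in g])
        if action == "STOP":
            break
        state = execute(state, action)
        prev_action = action
    n = len(step_grads)
    if n == 0:
        return [], 0
    mean = [sum(column) / n for column in zip(*step_grads)]
    return mean, n
```

The returned mean gradient would then be handed to an optimizer once per rollout. Keeping the reward a function of the world state while the gradient depends only on the context mirrors the separation between the instrumented training environment and the deployed agent.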

6 Reward Shaping

Reward shaping is a method for transforming a reward function by adding a shaping term to the problem reward. The goal is to generate more informative updates by adding information to the reward. We use this method to leverage the training demonstrations, a common form of supervision for training systems that map language to actions. Reward shaping allows us to fully use this type of supervision in a reinforcement learning framework, and effectively combine learning from demonstrations and exploration.

Adding an arbitrary shaping term can change the optimality of policies and modify the original problem, for example by making bad policies according to the problem reward optimal according to the shaped function [3]. Ng et al. (1999) and Wiewiora et al. (2003) outline potential-based terms that realize sufficient conditions for safe shaping [4]. Adding a shaping term is safe if the order of policies according to the shaped reward is identical to the order according to the original problem reward. While safe shaping only applies to optimizing the total reward, we show empirically the effectiveness of the safe shaping terms we design in a contextual bandit setting.

[3] For example, adding a shaping term $F = -R$ will result in a shaped reward that is always 0, and any policy will be trivially optimal with respect to it.
[4] For convenience, we briefly overview the theorems of Ng et al. (1999) and Wiewiora et al. (2003) in Appendix A.

We introduce two shaping terms. The final shaped reward is a sum of them and the problem reward. Similar to the problem reward, we define example-specific shaping terms. We modify the reward function signature as required.

Figure 3: Visualization of the shaping potentials for two tasks. We show demonstrations (blue arrows), but omit instructions. To visualize the potentials intensity, we assume only the target block can be moved, while rewards and potentials are computed for any block movement. We illustrate the sparse problem reward (left column) as a potential function and consider only its positive component, which is focused on the goal. The middle column adds the distance-based potential. The right adds both potentials.

Distance-based Shaping ($F_1$)

The first shaping term measures if the agent moved closer to the goal state. We design it to be a safe potential-based term (Ng et al., 1999):

$$F_1^{(i)}(s_j, a_j, s_{j+1}) = \phi_1^{(i)}(s_{j+1}) - \phi_1^{(i)}(s_j)$$

The potential $\phi_1^{(i)}(s)$ is proportional to the negative distance from the goal state $s_g^{(i)}$. Formally, $\phi_1^{(i)}(s) = -\eta \, \| s - s_g^{(i)} \|$, where $\eta$ is a constant scaling factor, and $\| \cdot \|$ is a distance metric. In the block world, the distance between two states is the sum of the Euclidean distances between the positions of each block in the two states, and $\eta$ is the inverse of block width. The middle column in Figure 3 visualizes the potential $\phi_1^{(i)}$.
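A minimal sketch of the distance-based term follows. The state representation (a dictionary from block name to position) and the function names are assumptions, and the block width used in the usage note is a made-up value chosen so the numbers come out round.

```python
import math
from typing import Dict, Tuple

State = Dict[str, Tuple[float, float]]  # block name -> (x, y) position


def state_distance(s: State, t: State) -> float:
    """Sum of Euclidean distances between each block's position in s and t."""
    return sum(math.hypot(s[b][0] - t[b][0], s[b][1] - t[b][1]) for b in s)


def distance_potential(state: State, goal: State, eta: float) -> float:
    """Potential: negative scaled distance to the goal state."""
    return -eta * state_distance(state, goal)


def distance_shaping(state: State, next_state: State,
                     goal: State, eta: float) -> float:
    """Shaping term: potential of the next state minus potential of the current one."""
    return (distance_potential(next_state, goal, eta)
            - distance_potential(state, goal, eta))
```

With a hypothetical block width of 0.04 (so `eta = 25`), moving a block from two widths away to one width away from its goal position gives a shaping value of about +1, and moving it back gives about -1. Over any closed loop of states the terms cancel, which is the property that makes potential-based shaping safe.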

Trajectory-based Shaping ($F_2$)

Distance-based shaping may lead the agent to sub-optimal states, for example when an obstacle blocks the direct path to the goal state, and the agent must temporarily increase its distance from the goal to bypass it. We incorporate complete trajectories by using a simplification of the shaping term introduced by Brys et al. (2015). Unlike $F_1$, it requires access to the previous state and action. It is based on the look-back advice shaping term of Wiewiora et al. (2003), who introduced safe potential-based shaping that considers the previous state and action. The second term is:

$$F_2^{(i)}(s_{j-1}, a_{j-1}, s_j, a_j) = \phi_2^{(i)}(s_j, a_j) - \phi_2^{(i)}(s_{j-1}, a_{j-1})$$

Given $\bar{e}^{(i)} = \langle (s_1, a_1), \dots, (s_m, a_m) \rangle$, to compute the potential $\phi_2^{(i)}(s, a)$, we identify the closest state $s_j$ in $\bar{e}^{(i)}$ to $s$. If $\eta \, \| s_j - s \| < 1$ and $a_j = a$, $\phi_2^{(i)}(s, a) = 1.0$, else $\phi_2^{(i)}(s, a) = -\delta_f$, where $\delta_f$ is a penalty parameter. We use the same distance computation and parameter $\eta$ as in $F_1$. When the agent is in a state close to a demonstration state, this term encourages taking the action taken in the related demonstration state. The right column in Figure 3 visualizes the effect of the potential $\phi_2^{(i)}$.
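The trajectory-based potential can be sketched as a nearest-demonstration-state lookup. Several details below are assumptions made for illustration: the state is reduced to the position of the single moved block, "close" is taken to mean a scaled distance below 1, a matching action earns a potential of 1.0, and the values of `eta` and `delta_f` in the usage note are made up.

```python
import math
from typing import List, Tuple

Position = Tuple[float, float]           # position of the single moved block
Demonstration = List[Tuple[Position, str]]  # sequence of (state, action) pairs


def trajectory_potential(state: Position, action: str,
                         demonstration: Demonstration,
                         eta: float, delta_f: float) -> float:
    """Reward agreeing with the demonstration when near one of its states."""
    def dist(pair):
        demo_state, _ = pair
        return math.hypot(demo_state[0] - state[0], demo_state[1] - state[1])

    closest_state, demo_action = min(demonstration, key=dist)
    is_close = eta * dist((closest_state, demo_action)) < 1.0
    if is_close and demo_action == action:
        return 1.0
    return -delta_f


def trajectory_shaping(prev_state: Position, prev_action: str,
                       state: Position, action: str,
                       demonstration: Demonstration,
                       eta: float, delta_f: float) -> float:
    """Shaping term: potential now minus potential at the previous step."""
    return (trajectory_potential(state, action, demonstration, eta, delta_f)
            - trajectory_potential(prev_state, prev_action, demonstration,
                                   eta, delta_f))
```

For a demonstration that moves east twice and then stops, an agent that follows it step by step receives a shaping value of 0 (both potentials are 1.0), while deviating after a correct step receives `-delta_f - 1.0`. Unlike the distance-based term, this one can reward a detour around an obstacle, because it compares against the demonstrated action rather than against distance to the goal.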

7 Experimental Setup


We use the environment of Bisk et al. (2016). The original task required predicting the source and target positions for a single block given an instruction. In contrast, we address the task of moving blocks on the plane to execute instructions given visual input. This requires generating the complete sequence of actions needed to complete the instruction. The environment contains up to 20 blocks marked with logos or digits. Each block can be moved in four directions. Including the $\texttt{STOP}$ action, in each step, the agent selects between 81 actions. The set of actions is constant and is not limited to the blocks present. The transition function is deterministic. The size of each block step is 0.04 of the board size. The agent observes the board from above. We adopt a relatively challenging setup with a large action space. While a simpler setup, for example decomposing the problem to source and target prediction and using a planner, is likely to perform better, we aim to minimize task-specific assumptions and engineering of separate modules. However, to better understand the problem, we also report results for the decomposed task with a planner.


Bisk et al. (2016) collected a corpus of instructions paired with start and goal states. Figure 1 shows example instructions. The original data includes instructions for moving one block or multiple blocks. Single-block instructions are relatively similar to navigation instructions and referring expressions. While they present much of the complexity of natural language understanding and grounding, they rarely display the planning complexity of multi-block instructions, which are beyond the scope of this paper. Furthermore, the original data does not include demonstrations. While generating demonstrations for moving a single block is straightforward, disambiguating action ordering when multiple blocks are moved is challenging. Therefore, we focus on instructions where a single block changes its position between the start and goal states, and restrict demonstration generation to move the changed block. The remaining data, and the complexity it introduces, provide an important direction for future work.

To create demonstrations, we compute the shortest paths. While this process may introduce noise for instructions that describe specific trajectories (e.g., move SRI two steps north and …) rather than only describing the goal state, analysis of the data shows this issue is limited. Out of 100 sampled instructions, 92 describe the goal state rather than the trajectory. A secondary source of noise is due to discretization of the state space. As a result, the agent often cannot reach the exact target position. The error of the demonstrations illustrates this problem (Table 3). To provide task completion reward during learning, we relax the state comparison, and consider states to be equal if the sum of block distances is under the size of one block.

The corpus includes 11,871/1,719/3,177 instructions for training/development/testing. Table 1 shows corpus statistics compared to the commonly used SAIL navigation corpus (MacMahon et al., 2006; Chen and Mooney, 2011). While the SAIL agent only observes its immediate surroundings, overall the blocks domain provides more complex instructions. Furthermore, the SAIL environment includes only 400 states, which is insufficient for generalization with vision input. We compare to other data sets in Appendix D.

                           SAIL     Blocks
Number of instructions     3,237    16,767
Mean instruction length    7.96     15.27
Vocabulary                 563      1,426
Mean trajectory length     3.12     15.4
Table 1: Corpus statistics for the block environment we use and the SAIL navigation domain.


We evaluate task completion error as the sum of Euclidean distances for each block between its position at the end of the execution and in the gold goal state. We divide distances by block size to normalize for the image size. In contrast, Bisk et al. (2016) evaluate the selection of the source and target positions independently.
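The evaluation metric can be sketched as follows. The state representation, the function names, and the block size in the usage note are assumptions for illustration; only the definition (sum of per-block Euclidean distances, divided by block size, then summarized by mean and median) comes from the text.

```python
import math
import statistics
from typing import Dict, List, Tuple

State = Dict[str, Tuple[float, float]]  # block name -> (x, y) position


def execution_error(final_state: State, goal_state: State,
                    block_size: float) -> float:
    """Sum of per-block Euclidean distances, in units of block size."""
    total = sum(
        math.hypot(final_state[b][0] - goal_state[b][0],
                   final_state[b][1] - goal_state[b][1])
        for b in goal_state
    )
    return total / block_size


def summarize(errors: List[float]) -> Tuple[float, float]:
    """Mean and median error over a set of executions."""
    return statistics.mean(errors), statistics.median(errors)
```

For instance, with a hypothetical block size of 0.1, a single block left 0.5 board units from its goal position yields an error of about 5 block widths. Normalizing by block size makes the reported numbers independent of the image resolution.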


We report performance of ablations, the upper bound of following the demonstrations (Demonstrations), and five baselines: (a) Stop: the agent immediately stops, (b) Random: the agent takes random actions, (c) Supervised: supervised learning with maximum-likelihood estimate using demonstration state-action pairs, (d) DQN: deep Q-learning with both shaping terms (Mnih et al., 2015), and (e) REINFORCE: policy gradient with cumulative episodic reward with both shaping terms (Sutton et al., 1999). Full system details are given in Appendix B.

Parameters and Initialization

Full details are in Appendix C. We consider $K = 4$ previous images, and horizon length $J = 40$. We initialize our model with the Supervised model.

8 Results

                          Distance Error     Min. Distance
Algorithm                 Mean     Med.      Mean     Med.
Demonstrations            0.35     0.30      0.35     0.30
Stop                      5.95     5.71      5.95     5.71
Random                    15.3     15.70     5.92     5.70
Supervised                4.65     4.45      3.72     3.26
REINFORCE                 5.57     5.29      4.50     4.25
DQN                       6.04     5.78      5.63     5.49
Our Approach              3.60     3.09      2.72     2.21
  w/o Sup. Init           3.78     3.13      2.79     2.21
  w/o Prev. Action        3.95     3.44      3.20     2.56
  w/o $F_1$               4.33     3.74      3.29     2.64
  w/o $F_2$               3.74     3.11      3.13     2.49
  w/ Distance Reward      8.36     7.82      5.91     5.70
Ensembles
Supervised                4.64     4.27      3.69     3.22
REINFORCE                 5.28     5.23      4.75     4.67
DQN                       5.85     5.59      5.60     5.46
Our Approach              3.59     3.03      2.63     2.15
Table 2: Mean and median (Med.) development results.
                          Distance Error     Min. Distance
Algorithm                 Mean     Med.      Mean     Med.
Demonstrations            0.37     0.31      0.37     0.31
Stop                      6.23     6.12      6.23     6.12
Random                    15.11    15.35     6.21     6.09
Supervised                4.95     4.53      3.82     3.33
REINFORCE                 5.69     5.57      5.11     4.99
DQN                       6.15     5.97      5.86     5.77
Our Approach              3.78     3.14      2.83     2.07
Table 3: Mean and median (Med.) test results.

Table 2 shows development results. We run each experiment three times and report the best result. The Random and Stop baselines illustrate the complexity of the task. Our approach, including both shaping terms in a contextual bandit setting, significantly outperforms the other methods. Supervised learning demonstrates lower performance. A likely explanation is that test-time execution errors lead to unfamiliar states with poor later performance (Kakade and Langford, 2002), a form of the covariate shift problem. The low performance of REINFORCE and DQN illustrates the challenge of general reinforcement learning with limited data due to relatively high sample complexity (Kearns et al., 1999; Krishnamurthy et al., 2016). We also report results using ensembles of the three models.

We ablate different parts of our approach. Ablations of supervised initialization (our approach w/o sup. init) or the previous action (our approach w/o prev. action) result in an increase in error. While the contribution of initialization is modest, it provides faster learning. On average, after two epochs, we observe an error of with initialization and without. We hypothesize that the shaping term, which uses full demonstrations, helps to narrow the gap at the end of learning. Without supervised initialization and , the error increases to (the 0% point in Figure 4). We observe the contribution of each shaping term and their combination. To study the benefit of potential-based shaping, we experiment with a negative distance-to-goal reward. This reward replaces the problem reward and encourages getting closer to the goal (our approach w/ distance reward). With this reward, learning fails to converge, leading to a relatively high error.

Figure 4 shows our approach with varying amounts of supervision. We remove demonstrations from both supervised initialization and the shaping term. For example, when only 25% of the demonstrations are available, only 25% of the data is available for initialization and the term is only present for this part of the data. While some demonstrations are necessary for effective learning, we get most of the benefit with only 12.5%.

Table 3 provides test results, using the ensembles to decrease the risk of overfitting the development set. We observe trends similar to the development results, with our approach outperforming all baselines. The remaining gap to the demonstrations upper bound illustrates the need for future work.

To understand performance better, we measure minimal distance (min. distance in Tables 2 and 3), the closest the agent got to the goal. We observe a strong trend: the agent often gets close to the goal and fails to stop. This behavior is also reflected in the number of steps the agent takes. While the mean number of steps in development demonstrations is , the agent generates on average steps, and of the time it takes the maximum number of allowed steps (). Testing on the training data shows an average steps and exhausts the number of steps of the time. The mean number of steps in training demonstrations is . This illustrates the challenge of learning how to behave at an absorbing state, which is observed relatively rarely during training. This behavior also shows in our video.

Figure 4: Mean distance error (vertical axis, Mean Error) as a function of the ratio of training examples that include complete trajectories (horizontal axis, % Demonstrations). The rest of the data includes the goal state only.

We also evaluate a supervised learning variant that assumes a perfect planner. (As there is no sequence of decisions, our reinforcement approach is not appropriate for the planner experiment. The architecture details are described in Appendix B.) This setup is similar to Bisk et al. (2016), except using raw image input. It allows us to roughly understand how well the agent generates actions. We observe a mean error of on the development set, an improvement of almost two points over supervised learning with our approach. This illustrates the complexity of the complete problem.

We conduct a shallow linguistic analysis to understand the agent behavior with regard to differences in the language input. As expected, the agent is sensitive to unknown words. For instructions without unknown words, the mean development error is . It increases to for instructions with a single unknown word, and to for two. (This trend continues, although the number of instructions is too low () to be reliable.)
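An analysis of this kind can be sketched as follows. The whitespace tokenization, the function name `mean_error_by_unknown_count`, and the toy instructions and error values are assumptions for illustration; they are not the paper's data or results.

```python
from collections import defaultdict

def mean_error_by_unknown_count(examples, train_vocab):
    """Group (instruction, error) pairs by the number of out-of-vocabulary
    tokens and return the mean error for each group."""
    buckets = defaultdict(list)
    for instruction, error in examples:
        tokens = instruction.lower().split()  # naive tokenization (assumption)
        unknown = sum(1 for token in tokens if token not in train_vocab)
        buckets[unknown].append(error)
    return {count: sum(errs) / len(errs) for count, errs in sorted(buckets.items())}

# Toy data only; the error values are made up for the example.
vocab = {"move", "toyota", "right", "of", "sri"}
dev = [
    ("move toyota right of sri", 2.0),
    ("move toyota right of sri", 4.0),
    ("shift toyota right of sri", 5.0),
]
print(mean_error_by_unknown_count(dev, vocab))  # {0: 3.0, 1: 5.0}
```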

We also study the agent behavior when observing new phrases composed of known words by looking at instructions with new n-grams and no unknown words. We observe no significant correlation between performance and new bi-grams and tri-grams. We also see no meaningful correlation between instruction length and performance. Although counterintuitive given the linguistic complexities of longer instructions, it aligns with results in machine translation (Luong et al., 2015).

9 Conclusions

We study the problem of learning to execute instructions in a situated environment given only raw visual observations. Supervised approaches do not explore adequately to handle test-time errors, and reinforcement learning approaches require a large number of samples for good convergence. Our solution provides an effective combination of both approaches: reward shaping to create relatively stable optimization in a contextual bandit setting, which takes advantage of a signal similar to supervised learning, with a reinforcement basis that admits substantial exploration and easy avenues for smart initialization. This combination is designed for the few-samples regime that we address. When the number of samples is unbounded, the drawbacks observed in this scenario for optimizing longer term reward do not hold.


Acknowledgments

This research was supported by a Google Faculty Award, an Amazon Web Services Research Grant, and a Schmidt Sciences Research Award. We thank Alane Suhr, Luke Zettlemoyer, and the anonymous reviewers for their helpful feedback, and Claudia Yan for technical help. We also thank the Cornell NLP group and the Microsoft Research Machine Learning NYC group for their support and insightful comments.


References

  • Agarwal et al. (2014) Alekh Agarwal, Daniel J. Hsu, Satyen Kale, John Langford, Lihong Li, and Robert E. Schapire. 2014. Taming the monster: A fast and simple algorithm for contextual bandits. In Proceedings of the International Conference on Machine Learning.
  • Andreas and Klein (2015) Jacob Andreas and Dan Klein. 2015. Alignment-based compositional semantics for instruction following. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.
  • Andreas et al. (2016a) Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. 2016a. Learning to compose neural networks for question answering. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
  • Andreas et al. (2016b) Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. 2016b. Neural module networks. In Conference on Computer Vision and Pattern Recognition.
  • Antol et al. (2015) Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In International Journal of Computer Vision.
  • Artzi et al. (2014a) Yoav Artzi, Dipanjan Das, and Slav Petrov. 2014a. Learning compact lexicons for CCG semantic parsing. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing.
  • Artzi et al. (2014b) Yoav Artzi, Maxwell Forbes, Kenton Lee, and Maya Cakmak. 2014b. Programming by demonstration with situated semantic parsing. In AAAI Fall Symposium Series.
  • Artzi and Zettlemoyer (2013) Yoav Artzi and Luke Zettlemoyer. 2013. Weakly supervised learning of semantic parsers for mapping instructions to actions. Transactions of the Association of Computational Linguistics 1:49–62.
  • Auer et al. (2002) Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. 2002. The nonstochastic multiarmed bandit problem. SIAM J. Comput. 32(1):48–77.
  • Bisk et al. (2016) Yonatan Bisk, Deniz Yuret, and Daniel Marcu. 2016. Natural language communication with robots. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
  • Branavan et al. (2009) S.R.K. Branavan, Harr Chen, Luke Zettlemoyer, and Regina Barzilay. 2009. Reinforcement learning for mapping instructions to actions. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP.
  • Branavan et al. (2010) S.R.K. Branavan, Luke Zettlemoyer, and Regina Barzilay. 2010. Reading between the lines: Learning to map high-level instructions to commands. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics.
  • Brys et al. (2015) Tim Brys, Anna Harutyunyan, Halit Bener Suay, Sonia Chernova, Matthew E. Taylor, and Ann Nowé. 2015. Reinforcement learning from demonstration through shaping. In Proceedings of the International Joint Conference on Artificial Intelligence.
  • Chen and Mooney (2011) David L. Chen and Raymond J. Mooney. 2011. Learning to interpret natural language navigation instructions from observations. In Proceedings of the National Conference on Artificial Intelligence.
  • Chen et al. (2016) Wenhu Chen, Aurélien Lucchi, and Thomas Hofmann. 2016. Bootstrap, review, decode: Using out-of-domain textual data to improve image captioning. CoRR abs/1611.05321.
  • Chen et al. (2015) Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C. Lawrence Zitnick. 2015. Microsoft COCO captions: Data collection and evaluation server. CoRR abs/1504.00325.
  • Clark and Manning (2016) Kevin Clark and D. Christopher Manning. 2016. Deep reinforcement learning for mention-ranking coreference models. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.
  • Elman (1990) Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science 14:179–211.
  • He et al. (2016) Ji He, Jianshu Chen, Xiaodong He, Jianfeng Gao, Lihong Li, Li Deng, and Mari Ostendorf. 2016. Deep reinforcement learning with a natural language action space. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
  • Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9.
  • Johnson et al. (2016) Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross B. Girshick. 2016. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. CoRR abs/1612.06890.
  • Kakade and Langford (2002) Sham Kakade and John Langford. 2002. Approximately optimal approximate reinforcement learning. In Machine Learning, Proceedings of the Nineteenth International Conference (ICML 2002), University of New South Wales, Sydney, Australia, July 8-12, 2002.
  • Kearns et al. (1999) Michael Kearns, Yishay Mansour, and Andrew Y. Ng. 1999. A sparse sampling algorithm for near-optimal planning in large Markov decision processes. In Proceedings of the International Joint Conference on Artificial Intelligence.
  • Kim and Mooney (2012) Joohyun Kim and Raymond Mooney. 2012. Unsupervised PCFG induction for grounded language learning with highly ambiguous supervision. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning.
  • Kim and Mooney (2013) Joohyun Kim and Raymond Mooney. 2013. Adapting discriminative reranking to grounded language learning. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
  • Kingma and Ba (2014) Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations.
  • Kober et al. (2013) Jens Kober, J. Andrew Bagnell, and Jan Peters. 2013. Reinforcement learning in robotics: A survey. International Journal of Robotics Research 32:1238–1274.
  • Krishnamurthy et al. (2016) Akshay Krishnamurthy, Alekh Agarwal, and John Langford. 2016. PAC reinforcement learning with rich observations. In Advances in Neural Information Processing Systems.
  • Langford and Zhang (2007) John Langford and Tong Zhang. 2007. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems 20, Proceedings of the Twenty-First Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 3-6, 2007.
  • Levine et al. (2016) Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. 2016. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research 17.
  • Li et al. (2016) Jiwei Li, Will Monroe, Alan Ritter, Dan Jurafsky, Michel Galley, and Jianfeng Gao. 2016. Deep reinforcement learning for dialogue generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.
  • Luong et al. (2015) Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.
  • MacMahon et al. (2006) Matthew MacMahon, Brian Stankiewics, and Benjamin Kuipers. 2006. Walk the talk: Connecting language, knowledge, action in route instructions. In Proceedings of the National Conference on Artificial Intelligence.
  • Matuszek et al. (2010) Cynthia Matuszek, Dieter Fox, and Karl Koscher. 2010. Following directions using statistical machine translation. In Proceedings of the international conference on Human-robot interaction.
  • Matuszek et al. (2012) Cynthia Matuszek, Evan Herbst, Luke S. Zettlemoyer, and Dieter Fox. 2012. Learning to parse natural language commands to a robot control system. In Proceedings of the International Symposium on Experimental Robotics.
  • Mei et al. (2016) Hongyuan Mei, Mohit Bansal, and R. Matthew Walter. 2016. What to talk about and how? selective generation using lstms with coarse-to-fine alignment. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
  • Misra et al. (2016) Dipendra K. Misra, Jaeyong Sung, Kevin Lee, and Ashutosh Saxena. 2016. Tell me dave: Context-sensitive grounding of natural language to manipulation instructions. The International Journal of Robotics Research 35(1-3):281–300.
  • Misra et al. (2015) Kumar Dipendra Misra, Kejia Tao, Percy Liang, and Ashutosh Saxena. 2015. Environment-driven lexicon induction for high-level instructions. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers).
  • Mnih et al. (2016) Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous methods for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning.
  • Mnih et al. (2013) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin A. Riedmiller. 2013. Playing atari with deep reinforcement learning. In Advances in Neural Information Processing Systems.
  • Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, and Georg Ostrovski. 2015. Human-level control through deep reinforcement learning. Nature 518(7540).
  • Narasimhan et al. (2015) Karthik Narasimhan, Tejas Kulkarni, and Regina Barzilay. 2015. Language understanding for text-based games using deep reinforcement learning. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.
  • Narasimhan et al. (2016) Karthik Narasimhan, Adam Yala, and Regina Barzilay. 2016. Improving information extraction by acquiring external evidence with reinforcement learning. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.
  • Ng et al. (1999) Andrew Y. Ng, Daishi Harada, and Stuart J. Russell. 1999. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the International Conference on Machine Learning.
  • Oh et al. (2016) Junhyuk Oh, Valliappa Chockalingam, Satinder P. Singh, and Honglak Lee. 2016. Control of memory, active perception, and action in minecraft. In Proceedings of the International Conference on Machine Learning.
  • Paul et al. (2016) Rohan Paul, Jacob Arkin, Nicholas Roy, and Thomas M. Howard. 2016. Efficient grounding of abstract spatial concepts for natural language interaction with robot manipulators. In Robotics: Science and Systems.
  • Rusu et al. (2016) Andrei A. Rusu, Matej Vecerik, Thomas Rothörl, Nicolas Heess, Razvan Pascanu, and Raia Hadsell. 2016. Sim-to-real robot learning from pixels with progressive nets. CoRR.
  • Schulman et al. (2015) John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, and Pieter Abbeel. 2015. Trust region policy optimization.
  • Silver et al. (2016) David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. 2016. Mastering the game of go with deep neural networks and tree search. Nature 529(7587):484–489.
  • Suhr et al. (2017) Alane Suhr, Mike Lewis, James Yeh, and Yoav Artzi. 2017. A corpus of compositional language for visual reasoning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.
  • Sung et al. (2015) Jaeyong Sung, Seok Hyun Jin, and Ashutosh Saxena. 2015. Robobarista: Object part based transfer of manipulation trajectories from crowd-sourcing in 3d pointclouds. In International Symposium on Robotics Research.
  • Sutton and Barto (1998) Richard S. Sutton and Andrew G. Barto. 1998. Reinforcement learning: An introduction. IEEE Trans. Neural Networks 9:1054–1054.
  • Sutton et al. (1999) Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. 1999. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems.
  • Tellex et al. (2011) Stefanie Tellex, Thomas Kollar, Steven Dickerson, Matthew Walter, Ashis G. Banerjee, Seth Teller, and Nicholas Roy. 2011. Understanding natural language commands for robotic navigation and mobile manipulation. In Proceedings of the National Conference on Artificial Intelligence.
  • Vogel and Jurafsky (2010) Adam Vogel and Daniel Jurafsky. 2010. Learning to follow navigational directions. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics.
  • Webber et al. (1995) Bonnie Webber, Norman Badler, Barbara Di Eugenio, Christopher Geib, Libby Levison, and Michael Moore. 1995. Instructions, intentions and expectations. Artificial Intelligence 73(1):253–269.
  • Wiewiora et al. (2003) Eric Wiewiora, Garrison W. Cottrell, and Charles Elkan. 2003. Principled methods for advising reinforcement learning agents. In Proceedings of the International Conference on Machine Learning.
  • Williams (1992) Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8.
  • Williams and Peng (1991) Ronald J Williams and Jing Peng. 1991. Function optimization using connectionist reinforcement learning algorithms. Connection Science 3(3):241–268.
  • Winograd (1972) Terry Winograd. 1972. Understanding natural language. Cognitive Psychology 3(1):1–191.
  • Xu et al. (2015) Kelvin Xu, Jimmy Ba, Jamie Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning.
  • Zhu et al. (2017) Yuke Zhu, Roozbeh Mottaghi, Eric Kolve, Joseph J. Lim, Abhinav Gupta, Li Fei-Fei, and Ali Farhadi. 2017. Target-driven visual navigation in indoor scenes using deep reinforcement learning.

Appendix A Reward Shaping Theorems

In Section 6, we introduce two reward shaping terms. We follow the safe-shaping theorems of Ng et al. (1999) and Wiewiora et al. (2003). The theorems outline potential-based terms that realize sufficient conditions for safe shaping. Applying safe terms guarantees the order of policies according to the original problem reward does not change. While the theory only applies when optimizing the total reward, we show empirically the effectiveness of the safe shaping terms in a contextual bandit setting. For convenience, we provide the definitions of potential-based shaping terms and the theorems introduced by Ng et al. (1999) and Wiewiora et al. (2003) using our notation. We refer the reader to the original papers for the full details and proofs.

The distance-based shaping term is defined based on the theorem of Ng et al. (1999):

A shaping term is potential-based if there exists a function such that, at time , , and , where is a future reward discounting factor. The function is the potential function of the shaping term .
Given a reward function , if the shaping term is potential-based, the shaped reward does not modify the total order of policies.

In the definition of , we set the discounting term to 1.0 and omit it.
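In the standard statement of Ng et al. (1999), a potential-based term has the form F(s, a, s') = γ·φ(s') − φ(s). The sketch below uses generic notation that may differ from the paper's, and checks numerically that with γ = 1 the shaped return of a trajectory differs from the original return only by φ(s_end) − φ(s_start), a quantity that does not depend on the intermediate steps. The one-dimensional states and the negative-distance potential are illustrative assumptions, not the paper's definitions.

```python
def potential(state, goal):
    """Hypothetical potential: negative distance to the goal (1-D toy world)."""
    return -abs(goal - state)

def shaping_term(state, next_state, goal, gamma=1.0):
    """Potential-based term F(s, a, s') = gamma * phi(s') - phi(s)."""
    return gamma * potential(next_state, goal) - potential(state, goal)

def shaped_return(states, rewards, goal, gamma=1.0):
    """Undiscounted sum of original rewards plus shaping terms."""
    total = 0.0
    for j, reward in enumerate(rewards):
        total += reward + shaping_term(states[j], states[j + 1], goal, gamma)
    return total

states = [0, 1, 3, 2, 5]          # visited states, including the start
rewards = [0.0, 0.0, -1.0, 1.0]   # toy problem rewards, one per transition
goal = 5

plain = sum(rewards)
shaped = shaped_return(states, rewards, goal)
# The shaping terms telescope: the difference is phi(end) - phi(start).
print(shaped - plain)  # 5.0
```

Because the difference is the same for every trajectory between the same start and end states, adding the term cannot reorder policies by total reward, which is the intuition behind the theorem.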

The trajectory-based shaping term follows the shaping term introduced by Brys et al. (2015). To define it, we use the look-back advice shaping term of Wiewiora et al. (2003), who extended the potential-based term of Ng et al. (1999) for terms that consider the previous state and action:

A shaping term is potential-based if there exists a function such that, at time , , and , where is a future reward discounting factor. The function is the potential function of the shaping term .
Given a reward function , if the shaping term is potential-based, the shaped reward does not modify the total order of policies.

In the definition of as well, we set the discounting term to 1.0 and omit it.
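For reference, the look-back advice term of Wiewiora et al. (2003) is usually written as below. The symbols here (F, φ, γ, and the step index j) are generic notation chosen for this note and may not match the paper's; the original paper remains the authority for the exact statement and conditions.

```latex
% Look-back advice: the potential is defined over state-action pairs and
% the term compares the current pair with the previous one.
F(s_{j-1}, a_{j-1}, s_j, a_j) = \phi(s_j, a_j) - \gamma^{-1}\,\phi(s_{j-1}, a_{j-1})

% With the discounting term set to 1.0, as done above, this reduces to
F(s_{j-1}, a_{j-1}, s_j, a_j) = \phi(s_j, a_j) - \phi(s_{j-1}, a_{j-1})
```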

Appendix B Evaluation Systems

We implement multiple systems for evaluation.


Stop

The agent performs the action immediately at the beginning of execution.


Random

The agent samples actions uniformly until is sampled or actions were sampled, where is the execution horizon.
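A uniform-sampling baseline of this kind can be sketched as below. The 81-action layout (20 blocks times four directions, plus a stop action) follows the description in Appendices B and C; the identifiers `ACTIONS`, `STOP`, and `random_rollout`, and the horizon value used in the example, are assumptions for illustration.

```python
import random

NUM_BLOCKS = 20
DIRECTIONS = ("north", "south", "east", "west")
STOP = "stop"
# 20 blocks x 4 directions + stop = 81 actions.
ACTIONS = [(b, d) for b in range(NUM_BLOCKS) for d in DIRECTIONS] + [STOP]

def random_rollout(horizon, rng):
    """Sample actions uniformly until stop is drawn or the horizon is reached."""
    actions = []
    while len(actions) < horizon:
        action = rng.choice(ACTIONS)
        actions.append(action)
        if action == STOP:
            break
    return actions

rollout = random_rollout(horizon=10, rng=random.Random(0))  # horizon is hypothetical
print(len(ACTIONS), len(rollout))
```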


Supervised

Given the training data with instruction-state-execution triplets, we generate training data of instruction-state-action triplets and optimize the log-likelihood of the data. Formally, we optimize the objective:

where is the length of the execution , is the agent context at step in sample , and is the demonstration action of step in demonstration execution . Agent contexts are generated with the annotated previous actions (i.e., to generate previous images and the previous action). We use minibatch gradient descent with Adam updates (Kingma and Ba, 2014).
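The quantity being maximized, the summed log-probability of the demonstrated actions, can be sketched as follows. The policy interface (a function from an agent context to a dictionary of action probabilities) and all names are assumptions for illustration; the paper's model is a neural network, not the toy uniform policy used here.

```python
import math

def demonstration_log_likelihood(policy, demonstrations):
    """Sum over demonstrations and steps of log pi(demonstrated action | context).

    `demonstrations` is a list of (contexts, actions) pairs of equal length.
    `policy(context)` returns a dict mapping each action to its probability.
    """
    total = 0.0
    for contexts, actions in demonstrations:
        for context, action in zip(contexts, actions):
            total += math.log(policy(context)[action])
    return total

def uniform_policy(context):
    """Toy stand-in for the learned policy: four equally likely actions."""
    return {"north": 0.25, "south": 0.25, "east": 0.25, "west": 0.25}

demos = [(["c0", "c1"], ["north", "east"]), (["c2"], ["west"])]
print(demonstration_log_likelihood(uniform_policy, demos))  # 3 * log(0.25)
```

Training would adjust the policy parameters to increase this value, for example with minibatch Adam updates as stated above.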


DQN

We use deep Q-learning (Mnih et al., 2015) to train a Q-network. We use the architecture described in Section 4, except replacing the task-specific part with a single 81-dimension layer. In contrast to our probabilistic model, we do not decompose block and direction selection. We use the shaped reward function, including both and . We use a replay memory of size 2,000 and an ε-greedy behavior policy to generate rollouts. We attenuate the value of ε from 1 to 0.1 in 100,000 steps and use prioritized sweeping for sampling. We also use a target network that is synchronized after every epoch.
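The exploration schedule and replay memory described here can be sketched as below. The linear shape of the annealing is an assumption, since the text only gives the end points and the number of steps. The sketch uses a plain first-in-first-out buffer and does not implement the prioritized sampling mentioned above. All function names are hypothetical.

```python
import random
from collections import deque

def epsilon_at(step, start=1.0, end=0.1, anneal_steps=100_000):
    """Linearly anneal epsilon from `start` to `end`, then hold it at `end`."""
    fraction = min(step, anneal_steps) / anneal_steps
    return start + fraction * (end - start)

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)

# Bounded replay memory: the oldest transitions are dropped once it is full.
replay_memory = deque(maxlen=2000)
for transition_id in range(2500):
    replay_memory.append(transition_id)

print(epsilon_at(0), epsilon_at(50_000), len(replay_memory))
```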


REINFORCE

We use the REINFORCE algorithm (Sutton et al., 1999) to train our agent. REINFORCE performs policy gradient learning with total reward accumulated over the roll-out as opposed to using immediate rewards as in our main approach. REINFORCE samples the total reward using Monte Carlo sampling by performing a roll-out. We use the shaped reward function, including both and terms. Similar to our approach, we initialize with a Supervised model and regularize the objective with the entropy of the policy. We do not use a reward baseline.
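The difference between the two weighting schemes can be made concrete with a surrogate loss whose gradient gives the corresponding policy gradient estimate. This is a simplified sketch: it weights every step by the full roll-out reward (the reward-to-go variant is not shown), it omits the entropy regularizer, and the function name and toy numbers are assumptions.

```python
def surrogate_loss(log_probs, rewards, use_total_reward):
    """Negative reward-weighted sum of action log-probabilities.

    use_total_reward=True  : every step is weighted by the roll-out total,
                             as in the REINFORCE baseline described above.
    use_total_reward=False : each step is weighted by its own immediate
                             reward, as in the contextual bandit setting.
    """
    if use_total_reward:
        weights = [sum(rewards)] * len(rewards)
    else:
        weights = list(rewards)
    return -sum(w * lp for w, lp in zip(weights, log_probs))

log_probs = [-1.0, -2.0]   # toy log pi(a_j | context_j) values
rewards = [1.0, 3.0]       # toy shaped rewards
print(surrogate_loss(log_probs, rewards, use_total_reward=True))   # 12.0
print(surrogate_loss(log_probs, rewards, use_total_reward=False))  # 7.0
```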

Supervised with Oracle Planner

We use a variant of our model assuming a perfect planner. We modify the architecture in Section 4 to predict the block to move and its target position as a pair of coordinates. This model assumes that the sequence of actions is inferred from the predicted target position using an oracle planner. We train using supervised learning by maximizing the likelihood of the block being moved and minimizing the squared distance between the predicted target position and the annotated target position.
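A training loss with these two parts might look like the following sketch. Adding the two terms with equal weight is an assumption, as are the function name `planner_loss` and the list-based representation of the block distribution; the text does not specify how the terms are combined.

```python
import math

def planner_loss(block_probs, gold_block, predicted_xy, gold_xy):
    """Negative log-likelihood of the gold block plus squared position error."""
    nll = -math.log(block_probs[gold_block])
    squared_distance = (
        (predicted_xy[0] - gold_xy[0]) ** 2 + (predicted_xy[1] - gold_xy[1]) ** 2
    )
    return nll + squared_distance  # equal weighting is an assumption

# Toy example with two candidate blocks.
loss = planner_loss([0.5, 0.5], gold_block=0, predicted_xy=(1.0, 2.0), gold_xy=(0.0, 0.0))
print(loss)  # log(2) + 5
```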

Appendix C Parameters and Initialization

C.1 Architecture Parameters

We use an RGB image of 120x120 pixels, and a convolutional neural network (CNN) with 4 layers. The first two layers apply 32 filters with a stride of 4, the third applies 32 filters with a stride of 2. The last layer performs an affine transformation to create a 200-dimension vector. We linearly scale all images to have zero mean and unit norm. We use a single layer RNN with 150-dimensional word embeddings and 250 LSTM units. The dimension of the action embedding is 56, including 32 for embedding the block and 24 for embedding the directions. is a matrix and is a 120-dimension vector. is for 20 blocks, and is for the four directions (north, south, east, west) and the action. We consider previous images, and use horizon length .
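One reading of the image scaling step (subtract the mean, then divide by the Euclidean norm) can be sketched as follows. Treating the image as a flat list of pixel values and normalizing each image independently are assumptions; the text does not say whether the statistics are per image or over the dataset.

```python
import math

def normalize_image(pixels):
    """Linearly rescale a flat list of pixel values to zero mean and unit norm."""
    mean = sum(pixels) / len(pixels)
    centered = [p - mean for p in pixels]
    norm = math.sqrt(sum(c * c for c in centered))
    if norm == 0.0:
        return centered  # constant image: nothing to scale
    return [c / norm for c in centered]

scaled = normalize_image([0.0, 64.0, 128.0, 255.0])
print(sum(scaled), math.sqrt(sum(v * v for v in scaled)))
```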

C.2 Initialization

Embedding matrices are initialized with a zero-mean unit-variance Gaussian distribution. All biases are initialized to . We use a zero-mean truncated normal distribution to initialize the CNN filters (0.005 variance) and CNN weight matrices (0.004 variance). All other weight matrices are initialized with a normal distribution (mean= , standard deviation= ). The matrices used in the word embedding function are initialized with a zero-mean normal distribution with standard deviation of 1.0. Action embedding matrices, which are used for , are initialized with a zero-mean normal distribution with 0.001 standard deviation. We initialize policy gradient learning, including our approach, with parameters estimated using supervised learning for two epochs, except the direction parameters and , which we learn from scratch. We found this initialization method to provide a good balance between strong initialization and not biasing the learning too much, which can result in limited exploration.

C.3 Learning Parameters

We use the distance error on a small validation set as the stopping criterion. After each epoch, we save the model, and select the final model based on development set performance. While this method overfits the development set, we found it more reliable than using the small validation set alone. Our relatively modest performance degradation on the held-out set illustrates that our models generalize well. We set the reward and shaping penalties . The entropy regularization coefficient is . The learning rate is for supervised learning and for policy gradient. We clip the gradient at a norm of . All learning algorithms use a mini-batch of size 32 during training.
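Gradient clipping by norm can be sketched as below. The threshold used in the example is hypothetical, since the value is not recoverable from the text above, and clipping a single flat gradient vector (rather than per-parameter tensors) is an assumption.

```python
import math

def clip_by_norm(gradient, max_norm):
    """Rescale the gradient so its Euclidean norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= max_norm or norm == 0.0:
        return list(gradient)
    scale = max_norm / norm
    return [g * scale for g in gradient]

# max_norm=1.0 is a placeholder value for illustration.
print(clip_by_norm([3.0, 4.0], max_norm=1.0))
print(clip_by_norm([0.3, 0.4], max_norm=1.0))
```

Clipping keeps the update direction and only limits its length, which helps avoid very large steps when a reward-weighted gradient is unusually large.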

Appendix D Dataset Comparisons

Name       # Samples   Vocabulary   Mean Instruction   # Actions   Mean Trajectory   Partially
                       Size         Length                         Length            Observed
Blocks     16,767      1,426        15.27              81          15.4              No
SAIL       3,237       563          7.96               3           3.12              Yes
Matuszek   217         39           6.65               3           N/A               No
Misra      469         775          48.7                           21.5              No
Table 4: Comparison of several related natural language instructions corpora.

We briefly review instruction following datasets in Table 4, including: Blocks (Bisk et al., 2016), SAIL (MacMahon et al., 2006; Chen and Mooney, 2011), Matuszek (Matuszek et al., 2012), and Misra (Misra et al., 2015). Overall, Blocks provides the largest training set and a relatively complex environment with well over possible states. (We compute this lower bound on the number of states in the block world as (the number of block permutations). This is a very loose lower bound.) The most similar dataset is SAIL, which provides only partial observability of the environment (i.e., the agent observes what is around it only). However, SAIL is less complex on other dimensions related to the instructions, trajectories, and action space. In addition, while Blocks has a large number of possible states, SAIL includes only 400 states. The small number of states makes it difficult to learn vision models that generalize well. Misra (Misra et al., 2015) provides a parameterized action space (e.g., ), which leads to a large number of potential actions. However, the corpus is relatively small.
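To give a sense of scale for a bound based on block permutations, the snippet below counts the orderings of 20 distinct blocks. Reading "the number of block permutations" as 20 factorial is an assumption based on the 20 blocks mentioned in Appendix C; the exact figure given in the original text is not reproduced here.

```python
import math

def block_permutations(num_blocks):
    """Number of distinct orderings of `num_blocks` distinguishable blocks."""
    return math.factorial(num_blocks)

print(block_permutations(20))  # 2432902008176640000
```

Even this count, which ignores where the blocks sit on the board, is many orders of magnitude larger than the 400 states of SAIL.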

Appendix E Common Questions

This is a list of potential questions following various decisions that we made. While we ablated and discussed all the crucial decisions in the paper, we decided to include this appendix to provide as much information as possible.

Is it possible to manually engineer a competitive reward function without shaping?

Shaping is a principled approach to add information to a problem reward with relatively intuitive potential functions. Our experiments demonstrate its effectiveness. Investing engineering effort in a reward function designed specifically for the task is a potential alternative approach.

Are you using beam search? Why not?

While using beam search can probably increase our performance, we chose to avoid it. We are motivated by robotic scenarios, where implementing beam search is a challenging task and often not possible. We distinguish between beam search and back-tracking. Beam search is also incompatible with common assumptions of reinforcement learning, although it is often used at test time with reinforcement learning systems.

Why are you using the mean of the LSTM hidden states instead of just the final state?

We empirically tested both options. Using the mean worked better. This was also observed by Narasimhan et al. (2015). Understanding in which scenarios one technique is better than the other is an important question for future work.
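The two options compared here can be sketched with plain lists standing in for LSTM hidden states. The function names and the toy vectors are assumptions for illustration.

```python
def mean_hidden_state(hidden_states):
    """Element-wise mean over all time steps of a sequence of hidden vectors."""
    steps = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(h[i] for h in hidden_states) / steps for i in range(dim)]

def final_hidden_state(hidden_states):
    """The alternative representation: the last hidden vector only."""
    return list(hidden_states[-1])

states = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # toy 2-dimensional hidden states
print(mean_hidden_state(states))   # [3.0, 4.0]
print(final_hidden_state(states))  # [5.0, 6.0]
```

The mean gives every token's state equal weight, whereas the final state depends on how much early information the recurrent network retains.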

Can you provide more details about initialization?

Please see Appendix C.

Does the agent in the block world learn to move obstacles and other blocks?

While the agent can move any block at any step, in practice, it rarely happens. The agent prefers to move blocks around obstacles rather than moving other blocks and moving them back into place afterwards. This behavior is learned from the data and shows even when we use only a very limited number of demonstrations. We hypothesize that in other tasks the agent is likely to learn that moving obstacles is advantageous, for example when demonstrations include moving obstacles.

Does the agent explicitly mark where it is in the instruction?

We estimate that over 90% of the instructions describe the target position. Therefore, it is often not clear how much of the instruction was completed during the execution. The agent does not have an explicit mechanism to mark portions of the instruction that are complete. We briefly experimented with attention, but found that empirically it does not help in our domain. Designing an architecture that allows such considerations is an important direction for future work.

Does the agent know which blocks are present?

Not all blocks are included in each task. The agent must infer which blocks are present from the image and instruction. The set of possible actions, which includes moving all possible blocks, does not change between tasks. If the agent chooses to move a block that is not present, the world state does not change.
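The fixed action space with no-op behavior can be sketched as follows. This is an illustrative sketch only: the dictionary-based state, the function name, and the coordinates are assumptions made for the example and do not describe the simulator's actual interface.

```python
def apply_move(positions, block, delta):
    """Return the state after trying to move `block` by `delta`.

    `positions` maps each block that is present to its (x, y) location.
    The action set is the same in every task, so `block` may name a block
    that is absent; in that case the state is returned unchanged.
    """
    new_positions = dict(positions)
    if block not in positions:
        return new_positions  # absent block: the world state does not change
    x, y = positions[block]
    dx, dy = delta
    new_positions[block] = (x + dx, y + dy)
    return new_positions


# Hypothetical task in which only two blocks are present.
state = {"Toyota": (0, 0), "SRI": (3, 0)}
after_valid = apply_move(state, "Toyota", (1, 0))
after_absent = apply_move(state, "HP", (1, 0))
```

Here `after_absent` equals `state`, so choosing an absent block simply wastes a step.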

Did you experiment with executing sequences of instructions? The Bisk et al. (2016) corpus includes such instructions, right?

The majority of existing corpora, including SAIL (Chen and Mooney, 2011; Artzi and Zettlemoyer, 2013; Mei et al., 2016), provide segmented sequences of instructions. Existing approaches take advantage of this segmentation during training. For example, Chen and Mooney (2011), Artzi and Zettlemoyer (2013), and Mei et al. (2016) all train on segmented data and test on sequences of instructions by doing inference on one sentence at a time. We are also able to do this. Similar to these approaches, we will likely suffer from cascading errors. The multi-instruction paragraphs in the Bisk et al. (2016) data are an open problem and present new challenges beyond just instruction length. For example, they often merge multiple block placements in one instruction (e.g., put the SRI, HP, and Dell blocks in a row). Since the original corpus does not provide trajectories and our automatic generation procedure is not able to resolve which block to move first, we do not have demonstrations for this data. The instructions also present a significantly more complex task. This is an important direction for future work, which illustrates the complexity and potential of the domain.

Potential-based shaping was proven to be safe when maximizing the total expected reward. Does this apply for the contextual bandit setting, where you maximize the immediate reward?

The safe shaping theorems (Appendix A) do not hold in our contextual bandit setting. We show empirically that shaping works in practice. However, whether and how it changes the ordering of policies is an open question.
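The difference between the two settings can be illustrated with a small sketch. It assumes the standard potential-based form, where the shaping term for a step from s to s' is gamma * phi(s') - phi(s), and a hypothetical potential equal to the negative distance to the goal; the numbers are toy values, not taken from our experiments.

```python
def shaping_term(phi_s, phi_next, gamma=1):
    """Potential-based shaping term for one step: gamma * phi(s') - phi(s)."""
    return gamma * phi_next - phi_s


# Hypothetical potentials along a 3-step trajectory (negative distance to goal).
potentials = [-3, -2, -1, 0]
terms = [shaping_term(a, b) for a, b in zip(potentials, potentials[1:])]

# With gamma = 1 the terms telescope: their sum depends only on the endpoints.
total = sum(terms)

# Immediate view: from the same state, two actions receive different terms.
toward_goal = shaping_term(-2, -1)
away_from_goal = shaping_term(-2, -3)
```

Summed over a trajectory, the terms collapse to the difference between the final and initial potentials, which is why the total expected reward ordering is preserved. An agent that maximizes the immediate reward instead sees each term separately, so the shaping directly affects which action looks best at each step.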

How long does it take to train? How many frames does the agent observe?

The agent observes about 2.5 million frames. Training with our approach takes 16 hours using 50% of the capacity of an Nvidia Pascal Titan X GPU. DQN takes more than twice the time for the same number of epochs. Supervised learning takes about 9 hours to converge. We also trained DQN for around four days, but did not observe improvement.

Did you consider initializing DQN with supervised learning?

Initializing DQN with the probabilistic supervised model is challenging. Since DQN is not probabilistic, it is not clear what this initialization means. Smart initialization of DQN is an important problem for future work.