Human Instruction-Following with Deep Reinforcement Learning via Transfer-Learning from Text

05/19/2020 ∙ by Felix Hill, et al. ∙ Google

Recent work has described neural-network-based agents that are trained with reinforcement learning (RL) to execute language-like commands in simulated worlds, as a step towards an intelligent agent or robot that can be instructed by human users. However, the optimisation of multi-goal motor policies via deep RL from scratch requires many episodes of experience. Consequently, instruction-following with deep RL typically involves language generated from templates (by an environment simulator), which does not reflect the varied or ambiguous expressions of real users. Here, we propose a conceptually simple method for training instruction-following agents with deep RL that are robust to natural human instructions. By applying our method with a state-of-the-art pre-trained text-based language model (BERT), on tasks requiring agents to identify and position everyday objects relative to other objects in a naturalistic 3D simulated room, we demonstrate substantially-above-chance zero-shot transfer from synthetic template commands to natural instructions given by humans. Our approach is a general recipe for training any deep RL-based system to interface with human users, and bridges the gap between two research directions of notable recent success: agent-centric motor behavior and text-based representation learning.







1 Introduction

Developing machines that can follow natural human commands, particularly those pertaining to an environment shared by both machine and human, is a long-standing and elusive goal of AI (Winograd, 1972). Recent work has applied deep reinforcement learning (RL) methods to this challenge, where a neural-network-based agent is optimized to process language input, perceive its surroundings and execute appropriate movements jointly (Oh et al., 2017; Hermann et al., 2017; Chaplot et al., 2018; Zhong et al., 2020). Deep RL promises a way to deal flexibly with the complexity of the physical, visual and linguistic world without relying on (potentially brittle) hand-crafted features, rules or policies. Nevertheless, the cost of this flexibility is the large number of environment interactions (samples) required for a network to learn behaviour policies from raw experience. To make the approach feasible, many studies thus employ a synthetic language that is generated on demand from templates by the environment simulator (Chevalier-Boisvert et al., 2018; Jiang et al., 2019; Yu et al., 2018b, a; Zhong et al., 2020). Attempts to integrate deep RL with natural language typically employ less realistic grid-like environments (Misra et al., 2017) or grant agents privileged global observations and gold-standard action sequences to make learning more tractable (Misra et al., 2018).

Figure 1: Left: stills of the putting task from the agent’s perspective, involving three randomly coloured and positioned moveable objects, plus a pink bed and a white tray. During evaluation, the agent receives an instruction given by a human (e.g. “Drop A Casio onto the bed”), and must identify the intended object of reference, move towards it and lift it, rotating if necessary, and then lower it into the specified location. Right: example ShapeNet models from the class ‘chair’ used in the reference task, demonstrating the perceptual variation which in turn drives substantial linguistic variation across the human testing instructions.

Here, we propose an alternative learning-based recipe for training deep RL agents that are robust to natural human commands, which we call SHIFTT (Simulation-to-Human Instruction Following via Transfer from Text). In this approach, agents are first endowed with powerful pretrained language encoders, and then their visual processing and behaviour policies are optimised using conventional deep RL methods to respond to synthetic, template-based language. Finally, we evaluate the agents on their ability to execute instructions from human testers that are (in theory) semantically equivalent, but can be superficially quite distinct from those experienced by the agent during training.

We demonstrate the effectiveness of this approach in the context of a 3D room containing object models from the ShapeNet dataset (Chang et al., 2015). Using 3D models based on everyday objects allows us to solicit from human testers diverse ways of instructing and/or referring to things, and hence to probe the agents’ robustness to this diversity. Unsurprisingly, we find that agents trained with conventional template-based commands do not adapt well to the natural (keyboard-typed) commands of the testers. With SHIFTT, however, agents can satisfy human instructions with substantially above-chance accuracy, which shows that powerful language models can support a notable degree of zero-shot policy generalization. Indeed, on the two tasks we consider in this work, our agent, trained only in simulation, performs as well as naive human game operators at the task of executing the noisy instructions provided by other humans.

To better analyze the generalization that supports this robustness, we consider different ways of incorporating pre-trained language models into a deep RL agent, and probe each condition with specific synthetic deviations from their template-based training instructions. We find that pre-trained language encoders based on non-BERT (context-free) word embeddings can support a degree of generalization driven by lexical similarity (executing Find a vehicle when trained to Find a car), but not phrasal equivalence (failing at Put a plate on the container when trained to Put the dish on the tray). In contrast, methods that integrate powerful contextual word representations (i.e. BERT) support both types of generalization. We note further boosts to test robustness when these pretrained language encoders are complemented with additional learned self-attention layers (tuned to the agent’s environmental objectives). Ablation experiments isolate the role of WordPiece tokenization (Schuster and Nakajima, 2012) in robustness to typed human instructions. This motivates the addition of typo noise as a critical component of the training pipeline for the highest-performing agents.

The SHIFTT method

Figure 2: The three stages of the SHIFTT method. During RL training (2.) the influence of semantic knowledge induced via text-based pre-training (1.) is gradually distributed across the agent’s policy network, yielding an agent robust to human instructions (3.).

In the general case, our proposed recipe (Fig 2) involves the following steps:

  1. Train or acquire a pretrained language model.

  2. Construct a language-dependent object-manipulation task. Define a global set O of human-nameable objects and a class R of binary spatial relations. For each r ∈ R, write a reward function ρ_r, such that ρ_r(o_1, o_2) = 1 iff o_1 and o_2 are in spatial relation r.

    For each episode:

    1. Sample a subset O_e ⊆ O of unique objects, and individual objects o_1, o_2 ∈ O_e.

    2. Sample a spatial relation r ∈ R.

    3. Construct an instruction according to the template Put the ⟨o_1⟩ ⟨r⟩ the ⟨o_2⟩, where ⟨x⟩ is the everyday name for object or relation x.

    4. Spawn all o ∈ O_e and an agent at random positions and orientations in the environment. Append the instruction to the agent’s environment observations.

  3. Train the agent using RL to maximize the expected cumulative reward from ρ_r, based on per-timestep visual observations and instructions encoded by the language model.

  4. Have human testers interact with episodes in the environment and pose instructions to the agent (with the macro-objective of having relation r realized for some objects in each episode). Evaluate the performance of the agent in response to these instructions.
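The per-episode sampling in step 2 can be sketched as follows. This is an illustrative reconstruction, not the paper's environment code: the object list, the single 'on' relation, and the geometric reward predicate are all placeholder assumptions.

```python
import random

# Hypothetical object and relation sets; the real environment draws on ShapeNet names.
OBJECTS = ["mug", "train", "bus", "helicopter", "robot"]
RELATIONS = {
    # r(p1, p2): object 1 roughly above and horizontally aligned with object 2
    "on": lambda p1, p2: abs(p1[0] - p2[0]) < 0.1 and p1[1] > p2[1],
}

def sample_episode(num_objects=3, rng=random):
    """Sample the episode's objects and relation, and build the template instruction."""
    objs = rng.sample(OBJECTS, num_objects)   # O_e: unique objects spawned this episode
    o1, o2 = rng.sample(objs, 2)              # the two objects named in the instruction
    rel = rng.choice(sorted(RELATIONS))       # r: a binary spatial relation
    instruction = f"Put the {o1} {rel} the {o2}"
    return instruction, o1, o2, rel

def reward(rel, p1, p2):
    """Return 1 iff the named objects satisfy the sampled spatial relation."""
    return 1.0 if RELATIONS[rel](p1, p2) else 0.0
```

The agent only ever observes the instruction string and its pixels; the reward predicate is evaluated inside the simulator.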

In this work, we apply SHIFTT in a simulated room in the Unity game engine. Note that there is no reason, in theory at least, why the recipe could not apply to robot learning for object manipulation (e.g. Johannink et al. (2019)).

2 Architectures for instruction-following agents

Since the object of our study is instruction-following, we consider an agent architecture that is as conventional as possible in aspects not related to language processing. It consists of standard components for computing visual representations of pixels, embedding text strings and predicting actions contingent on these and memory of past inputs (see Supplement for details not included here).

Visual processing

The visual observations received by the agent at each timestep are real-valued tensors, which are processed by a 3-layer residual convnet.

Memory core

The output of the visual processing is combined with language information according to a particular encoding strategy, as described below. In all conditions, some combination of vision and language input at each timestep passes into an LSTM memory core.

Action and value prediction

The state of the memory core at each timestep is passed through a linear layer and softmax to compute a distribution over 26 actions. Independently, the memory state is passed to a linear layer to yield a scalar value prediction.

Training algorithm

The agent is trained using an importance-weighted actor-critic algorithm with a central learner and distributed actors (Espeholt et al., 2018).

2.1 Language encoding

The agent must process both string observations in the simulated environment (during training) and human instructions, also encoded as strings (during evaluation). Its representations of language must be combined with visual information to make decisions about how to act. We compare several ways of achieving this encoding. For methods that transfer knowledge from unsupervised text-based learning, we take weights from the well-known BERT model (Devlin et al., 2018), specifically the uncased BERTBASE model made available by the authors.

BERT + mean pool  For a given input sequence, BERTBASE returns one context-dependent (sub)word representation per input unit. In this condition, a mean-pooling operation is applied over the sequence dimension to yield a single representation, which is concatenated with the flattened output of the visual processing module. This multi-modal representation is passed through a single-layer MLP before entering the memory core of the agent. We apply the standard BERTBASE WordPiece vocabulary (of size 30,000). Note that WordPiece encodes language in terms of subwords (a mix of characters, common word chunks, morphemes and words) rather than the word-level vocabulary applied in more traditional neural language models.
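A minimal NumPy sketch of this mean-pool fusion. The 768-wide token embeddings match BERT-base, but the visual-feature shape and MLP width below are arbitrary assumptions, not the paper's values:

```python
import numpy as np

def fuse_bert_mean_pool(bert_out, visual_out, w, b):
    """Mean-pool BERT token embeddings over the sequence dimension, concatenate
    with flattened visual features, and project with a single-layer MLP (tanh)."""
    pooled = bert_out.mean(axis=0)                       # average over tokens
    fused = np.concatenate([pooled, visual_out.ravel()]) # multi-modal vector
    return np.tanh(w @ fused + b)                        # fed to the LSTM memory core

rng = np.random.default_rng(0)
bert_out = rng.normal(size=(7, 768))       # 7 WordPiece tokens, BERT-base width
visual_out = rng.normal(size=(8, 8, 64))   # assumed convnet feature map
w = rng.normal(size=(256, 768 + 8 * 8 * 64)) * 0.01  # assumed MLP output width: 256
b = np.zeros(256)
z = fuse_bert_mean_pool(bert_out, visual_out, w, b)
```

Because the pooling collapses the token dimension, the agent's policy sees a single fixed-size language summary regardless of instruction length.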

BERT + self-attention layer  When applying BERT to text classification, performance can sometimes be improved by fine-tuning the weights in the BERT encoder to suit the task. Because of the large number of gradient updates required to learn a complex behaviour policy, fine-tuning the BERT weights in this way would cause substantial overfitting to the synthetic environment language. We therefore keep the BERT weights frozen, but experiment with an additional learned self-attention layer (Vaswani et al., 2017) to create a language encoder whose bottom layers are pretrained and whose final layer adapts these representations to the present environment and tasks. This additional layer is multi-headed, with learned key and value embeddings.

BERT + cross-modal self-attention  We also consider a cross-modal self-attention layer of the type suggested by e.g. Tsai et al. (2019); Lu et al. (2019), which in our application provides an explicit pathway for the agent to bind visual experience to specific (contextual) word representations. In this case, we treat each of the output channels of the visual processing network as word-like entities, by passing them through a linear layer whose output size matches that of the BERT output. The embeddings for all words and the visual channels are then processed with a single self-attention layer, whose weights are again learned jointly with the other agent parameters. Note that, when applied across modalities, a Transformer self-attention layer is more expressive than prior attention-based operations for fusing vision and language in agents (Chaplot et al., 2018), since it permits interactions at the (contextualized) word level rather than over the entire instruction, because it includes multiple heads, and because the key-query-value mechanism allows the model to learn both content-based and key-based lookup processes.
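The cross-modal layer can be sketched as a single-head self-attention over the joint word/visual token sequence. All dimensions and weight matrices here are illustrative assumptions; a real implementation would use multiple heads and learned parameters:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_self_attention(word_tokens, visual_channels, w_proj, wq, wk, wv):
    """Treat each visual channel as a word-like token, project it to the
    language width, then run one self-attention layer over the joint sequence."""
    vis_tokens = visual_channels @ w_proj            # (n_vis, d): match language width
    seq = np.concatenate([word_tokens, vis_tokens])  # joint language+vision sequence
    q, k, v = seq @ wq, seq @ wk, seq @ wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))   # every token attends to every token
    return attn @ v

d, n_words, n_vis, c = 64, 5, 16, 32                 # toy sizes, not the paper's
rng = np.random.default_rng(1)
out = cross_modal_self_attention(
    rng.normal(size=(n_words, d)),
    rng.normal(size=(n_vis, c)),
    rng.normal(size=(c, d)) * 0.1,
    *(rng.normal(size=(d, d)) * 0.1 for _ in range(3)),
)
```

Because words and visual channels sit in one sequence, each visual channel can attend directly to the specific word it should ground, and vice versa.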

Pretrained (sub)-word embeddings  Language classification experiments with BERT show the value of highly context-dependent (sub)-word representations, but transfer in that context is also possible with more conventional (context-independent) word embeddings (Collobert and Weston, 2008). To measure the effect of this distinction, we consider a simpler encoder based on the (context-independent) input (sub)-word embeddings from BERT. (These embeddings capture lexical similarity much like conventional word embeddings; a cosine metric over them correlates with human ratings from the SimLex-999 dataset (Hill et al., 2015).)

Taking a mean of these context-independent vectors would yield a word-order-invariant representation of language. We therefore first process them with a single-layer transformer. As in BERT + mean pool, this output is averaged, reduced by a single-layer MLP and passed to the agent’s core memory.

Typo noise

Finally, in one condition we introduce typo noise when training the agent (SHIFTT stage 2). The noise function simply replaces each character of the template language strings produced by the environment with an adjacent character on a standard keyboard, with some small probability. This encourages the agent to learn, during training, to compensate for potential typographical errors of human operators. No modifications were made to the evaluation stimuli. Much previous work applies typing noise for language-classifier robustness; see e.g. Pruthi et al. (2019) for a recent survey.
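A minimal sketch of such a noise function. The adjacency map below is a small illustrative subset, not the paper's actual keyboard map, and the replacement probability is left as a free parameter:

```python
import random

# Illustrative subset of a QWERTY keyboard-adjacency map
ADJACENT = {
    "a": "qwsz", "e": "wrd", "i": "uok", "n": "bhm",
    "o": "ipl", "r": "etf", "s": "awdz", "t": "ryg",
}

def typo_noise(text, p, rng=random):
    """Replace each mapped character with a keyboard-adjacent one with probability p."""
    chars = []
    for ch in text:
        if ch in ADJACENT and rng.random() < p:
            chars.append(rng.choice(ADJACENT[ch]))
        else:
            chars.append(ch)
    return "".join(chars)
```

Applied to every template instruction during RL training, this exposes the agent to the kind of character-level corruption human typists produce, while the evaluation instructions remain untouched.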

We compare to various baselines designed to isolate specific components of these encoders.

Random mean pool  As a direct baseline for BERT mean pool, we consider an identical architecture but in which the contextual BERT embeddings are replaced by fixed random (subword-specific) vectors of the same dimension. The weights in the rest of the agent network are trained as in other conditions. We note that random sentence vectors are a competitive baseline on many language classification tasks (Wieting and Kiela, 2019).

Word-level and WordPiece transformers  In addition to pre-trained weights, a potentially important aspect of encoders based on BERT is WordPiece tokenization, which can afford greater ability to make sense of typos, or of rare words with familiar morphemes, than a more conventional word-level encoder. To isolate the effect of tokenization from that of pretrained weights, we compared two further encoders. In one, we split all input strings by whitespace and hash each word string to a unique index representing an input to a single-layer transformer with embedding size chosen to match BERTBASE; the output is processed identically to the Random mean pool condition. We contrast this with an otherwise identical condition in which the WordPiece tokenization from BERT is applied rather than splitting on whitespace.
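The difference between the two tokenizers comes down to the greedy longest-match-first subword segmentation that WordPiece performs; a misspelled or rare word still decomposes into familiar pieces rather than hashing to an unseen index. A minimal sketch with a tiny illustrative vocabulary (not BERT's 30,000-piece vocabulary):

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first subword tokenization, as in BERT's WordPiece.
    Continuation pieces carry the '##' prefix; unknown words map to [UNK]."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:       # take the longest piece that is in the vocab
                pieces.append(piece)
                break
            end -= 1
        if end == start:             # no piece matched at this position
            return ["[UNK]"]
        start = end
    return pieces

VOCAB = {"pillow", "##s", "cushion", "find", "a"}
```

With this vocabulary, the plural "pillows" splits into "pillow" + "##s", so the agent can reuse whatever it learned about "pillow"; a word-level hash would treat "pillows" as entirely novel.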

Human performance estimate

As a point of comparison, we recruited a set of five human testers, separate from those that produced the original instructions, and subjected them to an evaluation procedure similar to the agent’s for twenty episodes each of the reference and putting tasks. Observing these episodes, we found the reasons for human failure on a given episode to include unclear or ambiguous instructions (such as Put a toy on the bed when any object could conceivably be a toy), objects with ambiguous appearance (e.g. the car and bus look similar), and (rarely) a failure of control or object manipulation.

3 Experiments

We experiment with a reference task, which focuses on object identification, and a putting task, which focuses on object relations and discrete object control. In both tasks the locations of all objects and the initial position and orientation of the agent are chosen at random on floating-point scales, so it is highly unlikely that any two episodes are spatially identical (whether during training or evaluation). The action set consists of movements (move-forward, back, left, right), turns (turn-up, down, left, right), and object pick-up and hand control with multiple degrees of freedom. The placement of objects is assisted by a visible vertical line. See Supplement for a full description.

3.1 Determining reference of ShapeNet objects

In an episode of the reference task, the environment selects two movable objects at random from a global object set, and generates a language instruction of the form Find a X, where X is the correct name for (exactly) one of those objects. To achieve reward, the agent must locate the X, lift it more than 1m above the ground and keep it there for 5 time steps. If the reward function in the environment determines that these conditions are met, a positive reward is emitted and the episode ends. If a lifting takes place with the incorrect object, the episode also ends, with negative reward. For a global set of recognisable objects, we use the ShapeNet dataset (Chang et al., 2015), which contains over 12,000 3D rendered models. For performance reasons we discard models with more than 8,000 vertices or an OBJ file size greater than 1 MB, and consider only those tagged into a synset in the WordNet taxonomy. We take models from the ShapeNetSem slice of the data, and use the name of the first lemma of that synset as a proxy for the name of the object. From these names, we selected 80 for the environment, each referring to a set of object models with a minimum of 12 exemplars. See Supplement for details.
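The success condition of the reference task can be sketched as a reward predicate over the lifted object's height trajectory. The reward magnitudes below are placeholder assumptions; the paper specifies only that reward is positive on success and that lifting the wrong object ends the episode:

```python
def reference_reward(lift_heights, correct_object, min_height=1.0, hold_steps=5):
    """Return (reward, episode_done) for the reference task.
    Success: the named object is held above min_height for hold_steps
    consecutive timesteps. Lifting the wrong object ends the episode."""
    if not correct_object:
        return -1.0, True          # placeholder negative reward, episode ends
    consecutive = 0
    for h in lift_heights:         # per-timestep height of the lifted object
        consecutive = consecutive + 1 if h > min_height else 0
        if consecutive >= hold_steps:
            return 1.0, True       # placeholder positive reward, episode ends
    return 0.0, False              # condition not yet met; episode continues
```

Note that the hold must be consecutive: a single dip below 1m resets the counter.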

Template lang. (training) | Synonym        | Natural reference
Find a flag               | Find a banner  | Find the indian flag; Find a flag; Find the flag.; Find the cardboard.
Find a pillow             | Find a cushion | Find a pillows; Find a cushion; Find a paper; Find a set of cushions
Table 1: Example training/test instructions in the reference task.

For all language encoding strategies described above, the agent was able to learn the task, and a well-trained policy completed episodes with accuracy around 90%. The failure of agents to reach perfect performance on the training set is unimportant for the present study; we suspect that the agent’s comparatively small convolutional network fails to perceive important distinctions between the more intricate ShapeNet models.

Template lang. (training) | D.O. synonym                | I.O. synonym            | D.O. & I.O. synonym          | Natural instruction
Put a mug on the tray     | Put a cup on the tray       | Put a mug on the box    | Put a cup on the box         | place mug in the basket; Keep the cup in atub; place the mug in a container; put the coffee mug in the box
Put a train on the bed    | Put a locomotive on the bed | Put a train on the bunk | Put a locomotive on the bunk | Put the tractor on the bed; Move the train toy onto the bed; Place a toyvehicle on the bed; place the rail on tthe bed
Table 2: Example training (left) and test instructions in the putting task. Underlined words are examples of synonyms; italics indicate entire phrases provided by human annotators.

We consider two evaluation settings. In the synonym evaluation, we ran the environment for 1,000 episodes with the noun in the environment template instruction replaced by a synonym (Find a X becomes Find a Y, where Y is a synonym of X). The synonyms were provided by native English-speaking subjects. In the natural reference evaluation, we gave 40 annotators access to a room containing a single ShapeNet model via a crowd-sourcing platform. We asked them to write down what they found in the room and then hit a button that restarted the environment with a new object in the room (see Supplement for all synonyms and human instructions). We used this interactive method of soliciting instructions, rather than paraphrases of template instructions, because the setting more faithfully reflects the perspective of a user instructing a robot or situated learning system.

As illustrated in Table 1, unlike the synonyms, the natural referring expressions involve variation in articles as well as nouns (a pencil might become the pen), may include spelling mistakes or typos, can refer entirely incorrectly to the intended object (if the subject fails to recognize the ShapeNet model), but may also match the training instruction exactly. Moreover, unlike the synonym test, there are natural referring expressions for each of the 80 environment nouns, from which we sample randomly when evaluating the agent (again on 1,000 evaluation episodes). Overall, the human referring expressions are highly varied; while there are 82 unique word types across all possible template instructions for this task, the human referring expressions involved 557 unique word types. The full set of human instructions, aligned with ShapeNet model IDs, is available for download.

3.2 Putting objects on other objects

The strength of models like BERT is their ability to combine lexical representations into phrasal or sentence representations. To study this capacity in the context of instruction-following, we devised a putting task involving the verb ‘to put’, which, in the imperative (put the cup on the tray), takes two arguments, the direct object (D.O.) cup and the indirect object (I.O.) tray. In terms of the behaviours required, the putting task focuses on control and object relations rather than object identification or reference. The environment was configured to begin each episode with three randomly-chosen moveable objects and two larger immovable objects (a bed and a tray), each randomly positioned in the room. In each episode of this task, the agent receives an instruction Put a D.O. on the I.O., where D.O. is any of the three moveable objects (chosen from a global set of ten) and I.O. is either bed or tray. The environment checks whether an instance of D.O. is at rest (and not held by the agent) on top of the I.O., returning a positive reward if so and ending the episode. If the object D.O. is placed on something other than the I.O., or if another movable object is placed on the bed or the tray, then again the episode ends immediately, with negative reward. (We found that ending the episode in such cases made learning much faster.)

As before, we first trained all agents on the putting task with synthetic environment language instructions. Training a policy on this task with reinforcement learning required a bespoke task curriculum (see Supplement for more details), after which a well-trained policy completed episodes reliably. To gather the evaluation stimuli, we again crowd-sourced humans to provide both natural synonyms for each of the 12 objects in the global set for this task and, in this case, entirely free-form natural human instructions. To obtain natural instructions, we instantiated an environment with only one of the global set of moveable objects, coloured red, and one of either the bed or the tray, coloured white, and asked subjects to write an instruction asking somebody to place the red object on top of the white object without mentioning its colour. This resulted in a set of natural instructions containing a total of 180 unique word types, compared to only 19 in the template instructions used for agent training. We then evaluated agents on four specific evaluations, illustrated in Table 2: D.O. synonym, I.O. synonym and D.O. & I.O. synonym, in which particular parts of the original template command were replaced with synonyms, and Natural instruction, the fully free-form human instruction, which can include orthographic errors and misidentified objects.

3.3 Discussion of results

Model                      | Template language (training) | Synonym | Natural reference
Random ‘lifting’ act       | 0.5                          | 0.50    | 0.50
Random-embedding + MP      |                              |         |
Word-level Transf.         |                              |         |
WordPiece Transf.          |                              |         |
Word-level Transf. + TN    |                              |         |
WordPiece Transf. + TN     |                              |         |
Word embeddings + Transf.  |                              |         |
Human performance estimate |                              |         |
Table 3: Accuracy on training instructions, the synonym evaluation and the natural referring expression evaluation in the reference task. MP: mean pool, SA: self-attention layer, CMSA: cross-modal self-attention layer, TN: typo-noise. Scores show mean across 1,000 episodes (with instructions randomly chosen by the environment generator).
Model                      | Template lang (training) | D.O. synonym | I.O. synonym | D.O. & I.O. synonym | Natural instruction
Random ‘putting’ act       | 0.17                     | 0.17         | 0.17         | 0.17                | 0.17
Random-embedding + MP      |                          |              |              |                     |
Word-level Transf.         |                          |              |              |                     |
WordPiece Transf.          |                          |              |              |                     |
Word-level Transf. + TN    |                          |              |              |                     |
WordPiece Transf. + TN     |                          |              |              |                     |
Word embeddings + Transf.  |                          |              |              |                     |
Multitask BERT + CMSA + TN |                          |              |              |                     |
Human performance estimate |                          |              |              |                     |
Table 4: Accuracy of different models on the putting task when different parts of instructions of the form "Put a [D.O.] on the [I.O.]" are replaced with synonyms. Underlined words in model names indicate pre-trained weights from text-based training. MP: mean pool, SA: self-attention layer, CMSA: cross-modal self-attention layer, TN: typo-noise. Multitask: single agent trained on both reference and putting tasks. The evaluation task covers episodes in which [I.O.] is either ‘bed’ or ‘tray’. Scores show mean across 1,000 random episodes.

The accuracies for the reference task are presented in Table 3 and for the putting task in Table 4. A video of the BERT+CMSA+TN agent both succeeding and failing when following human instructions on the putting task is available online. The results reveal the following main effects of language encoding on model performance:

Substantial transfer from text requires contextual encoders  Agents with weights that are pretrained on text data exhibit substantially higher accuracy on both the reference and the putting tasks. This effect is greatest in the more focused synonym evaluations, but also holds for the free-form human instructions. A small transfer effect can be seen by comparing the word embeddings + Transformer condition (62% accuracy on the synonym evaluation, reference task, and 57% accuracy on the D.O. synonym evaluation, putting task) with the WordPiece Transformer (57% and 20%). However, overall the transfer effect is much stronger in the case of the full context-dependent BERT representations. On the same two evaluations, BERT + mean pool achieves 77% and 94% accuracy respectively. The gains from transferring via BERT representations vs. just (sub)word embeddings are greater for the (longer) putting instructions than for the reference (finding) instructions, and greatest of all in the D.O. & I.O. synonym evaluation. These are cases where one would expect the marginal value of powerful sentential (rather than just lexical) representations to be greatest.

Figure 3: Self-attention layers learn to pull apart both training nouns and their synonyms as the agent learns the putting task with synthetic environment language.

Tuning via self-attention layers (with typo noise) helps  Interestingly, we find that tuned self-attention layers do not improve generalization performance over using BERT and mean pooling. This may be simply because the additional layers cause a degree of overfitting to the template environment language during training. However, typo noise mitigates this issue, so that the strongest evaluation performance on the putting tasks overall is observed with a combination of a tuned cross-modal self-attention layer and typo-noise training (see Supplement for a comparison of BERT-based architectures with and without typo noise). Indeed, the value of typo noise as a regularizer can be seen from the fact that it improves the robustness of agents with tuned self-attention layers even in the synonym evaluations (for both the reference and putting tasks), which do not involve any typos. Thus, the BERT + CMSA + TN model performs better than all others on two of the three synonym evaluations in the putting task.

One way in which the appropriately-tuned self-attention layer might make the agent more robust to synonyms in this case is by spreading out task-relevant object-nouns in the agent’s language representation space (leaving those words closer to synonyms than to potential confounding words). The degree to which this happens when the object-nouns and their synonyms in our environments are passed first through BERT and then through a (BERT + SA) agent-tuned self-attention layer (compared to passing through the same layer but with random weights) is shown in Fig 3.

WordPiece tokenization adds robustness  In the two evaluations involving natural language instructions from humans, a comparison of the Word-level Transformer and the WordPiece Transformer shows that some robustness is obtained simply from WordPiece encoding, which in turn must play some part in all BERT-based conditions (improving accuracy in both the natural referring expression evaluation and the natural instruction putting evaluation). As mentioned above, BERT-based agents with WordPiece encoding are particularly robust to human instructions when trained with typo noise, and this is most effective when combined with tuned self-attention layers. This makes intuitive sense, as a learnable self-attention layer should provide the agent with more flexibility to learn to correct for typos during training. Indeed, on the natural instruction evaluation of the putting task, where typos or spelling errors are most common, the cross-modal attention agent trained with typo noise achieves 70% accuracy. Note that 100% accuracy on this evaluation may be impossible, even for humans, because visual ambiguity or human error mean that instructions can sometimes refer in entirely mistaken ways to the objects in the room.

Scaling to multiple tasks When evaluated on the putting task, the performance of an agent trained on both the putting and reference tasks is not substantially lower than agents that are specialized to each of the tasks individually.

Agent performance on this data is close to ceiling The best-performing agent achieves a similar level of performance to humans that are otherwise unfamiliar with the game or the instruction-setters. This underlines the extent to which the instructions procured from our human raters may be ambiguous or ill-formed. However, it also suggests that the best agents must be performing close to perfectly on all test episodes in which they have a reasonable chance to do so.

4 Related work

Most closely related to our work is an experiment by Chan et al. (2019), who showed how an agent trained with InferLite sentence representations (Kiros and Chan, 2018) can be robust to synonym replacements in template instructions. The task itself involves object identification in the VizDoom environment (Kempka et al., 2016), which requires only motor actions. Our work develops this insight substantially: we apply a similar approach to a visually-realistic environment requiring fine-grained object control, integrate context-dependent pre-trained models with subword tokenization (BERT), analyse architectures and training strategies for integrating such models, and extend from synonym replacements to free-form instructions typed by humans.

Much recent work applies deep learning and policy optimization in end-to-end approaches to learning instruction following (Chaplot et al., 2018; Oh et al., 2017; Bahdanau et al., 2018; Chevalier-Boisvert et al., 2018; Yu et al., 2018b, a; Jiang et al., 2019). As noted in the introduction, many of these studies, particularly those requiring policies to make fine-grained movements of the agent’s body or objects, do not involve human language.

Vision-and-language navigation (VLN) models learn to follow natural language directions that are longer than those considered in this work (Misra et al., 2017, 2018; Anderson et al., 2018; Fried et al., 2018; Wang et al., 2019; Zhu et al., 2019). The best approaches to VLN do not apply deep RL methods, since VLN typically requires a high degree of exploration rather than precise object control. VLN agents also often make use of privileged (i.e. non-first-person) observations (Misra et al., 2018), knowledge of shortest paths (Fried et al., 2018) and/or expert trajectories (Wang et al., 2019). In contrast, because we seek to evaluate a method that may ultimately be applied to robotics, we consider a setting in which the agent has access to neither privileged observations nor gold-standard trajectories, and an environment in which different everyday objects must be controlled with a finer-grained set of actions ( DoF). Another important difference is that SHIFTT is a method for zero-shot transfer from template-based to natural instructions, whereas VLN models are both trained and tested on natural language instructions.

A limitation of all studies above, including this one, is the reliance on simulated environments. Both object identification and manipulation are likely far harder in reality, and it remains to be seen whether our proposed method would work seamlessly for robot language understanding (Tellex et al., 2011, 2012; Walter et al., 2014). However, we note that it can in theory be applied in any multi-task or language-dependent policy-learning setting. See also (Anderson et al., 2018; Wang et al., 2019) for recent improvements to visual realism in simulated environments.

Finally, there is a long history of building knowledge about the structure of language and/or its environment into instruction-following systems, rather than learning it end-to-end. In Winograd (1972)’s SHRDLU, syntactic modules parsed the language input into a logical form, and hand-written rules were applied to connect such forms to the environment. More recent pipeline-based approaches use learning algorithms to map language to a program that can then interface with a planner (Chen and Mooney, 2011; Matuszek et al., 2013; Wang et al., 2016) and/or a controller, both of which may have privileged information about how the world connects to the program. It is likely that pretrained language encoders could add robustness to parts of these approaches, much as they do here. Our focus on end-to-end learning, however, is motivated by the intuition that it may eventually scale or adapt more flexibly to arbitrary environments and problems than pipeline approaches.

5 Conclusions

In this work, we have developed an agent that can follow natural human instructions requiring the identification, control and positioning of visually-realistic assets. Our method relies on zero-shot transfer from template language instructions to those given by human annotators when asked to refer and instruct in natural ways. The results show that, with powerful pretrained language encoders, this transfer effect is sufficiently strong to permit decoding extended language-dependent motor behaviours, despite the shift in distribution of the agent’s input. More generally, we hope that this contribution serves to bring research on text-based and situated language learning closer together. To facilitate this, we make available our dataset of natural instructions and referring expressions aligned to ShapeNet models.

While this study can be considered an interesting proof of concept for the SHIFTT approach to training linguistic deep RL agents via transfer learning, there are many ways in which it can be improved and extended. As agents become better able to learn a wide range of conditional policies covering a larger set of motor behaviours, it will be instructive to scale the linguistic scope of the agent via the proposed technique, for instance to language involving verbs, or to questions and dialogue. Moreover, an important long-term objective is to apply SHIFTT to the training of robotic agents. There are also many alternative possibilities for effecting the knowledge transfer from language model to agent that are not considered here. For instance, our approach freezes rather than fine-tunes the BERT encoder weights on the desired behaviour policy, to avoid overfitting, but techniques such as knowledge distillation (Hinton et al., 2015) could point to more elegant ways to learn jointly from text and environmental experiences. In addition, we have focused on BERT, but improvements may be possible with alternative general-purpose language encoders, such as GPT-2 (Radford et al., 2019), RoBERTa (Liu et al., 2019) and Transformer-XL (Dai et al., 2019).
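The freezing strategy mentioned above, keeping the pretrained encoder weights fixed while only the policy-specific parameters are optimised, can be illustrated with a toy scalar example; every value and name here is illustrative, not the agent's actual architecture:

```python
# Toy illustration of freezing pretrained weights: the "encoder" parameter is
# excluded from gradient updates, so only the task ("head") parameter adapts
# to the training objective. The real agent freezes the BERT encoder instead.

encoder_w = 2.0          # stands in for frozen pretrained weights
head_w = 0.0             # task-specific head, trained from scratch
x, target, lr = 1.5, 6.0, 0.05

def forward(x):
    return head_w * (encoder_w * x)   # frozen encoder output feeds the head

for _ in range(100):
    h = encoder_w * x                 # frozen encoder features
    err = head_w * h - target         # squared-error gradient signal
    head_w -= lr * err * h            # gradient step on the head only
    # no update to encoder_w: its gradient is simply never applied
```

After training, the head has adapted around the fixed encoder features, while `encoder_w` still holds its pretrained value.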

In sum, we have presented a conceptually simple recipe for transferring knowledge from a text corpus to a reinforcement-learning agent, and shown that the method permits zero-shot transfer from simulated to (constrained) natural language with surprising efficacy. We hope this opens new channels for research combining unsupervised (or semi-supervised) representation-learning with reinforcement learning, particularly at the intersection of language, vision and behaviour.


  • P. Anderson, Q. Wu, D. Teney, J. Bruce, M. Johnson, N. Sünderhauf, I. Reid, S. Gould, and A. van den Hengel (2018) Vision-and-language navigation: interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3674–3683.
  • D. Bahdanau, F. Hill, J. Leike, E. Hughes, A. Hosseini, P. Kohli, and E. Grefenstette (2018) Learning to understand goal specifications by modelling reward. arXiv preprint arXiv:1806.01946.
  • H. Chan, Y. Wu, J. Kiros, S. Fidler, and J. Ba (2019) ACTRCE: augmenting experience via teacher’s advice for multi-goal reinforcement learning. arXiv preprint arXiv:1902.04546.
  • A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. (2015) ShapeNet: an information-rich 3D model repository. arXiv preprint arXiv:1512.03012.
  • D. S. Chaplot, K. M. Sathyendra, R. K. Pasumarthi, D. Rajagopal, and R. Salakhutdinov (2018) Gated-attention architectures for task-oriented language grounding. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • D. L. Chen and R. J. Mooney (2011) Learning to interpret natural language navigation instructions from observations. In Twenty-Fifth AAAI Conference on Artificial Intelligence.
  • M. Chevalier-Boisvert, D. Bahdanau, S. Lahlou, L. Willems, C. Saharia, T. H. Nguyen, and Y. Bengio (2018) BabyAI: first steps towards grounded language learning with a human in the loop. arXiv preprint arXiv:1810.08272.
  • R. Collobert and J. Weston (2008) A unified architecture for natural language processing: deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pp. 160–167.
  • Z. Dai, Z. Yang, Y. Yang, W. W. Cohen, J. Carbonell, Q. V. Le, and R. Salakhutdinov (2019) Transformer-XL: attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, et al. (2018) IMPALA: scalable distributed deep-RL with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561.
  • D. Fried, R. Hu, V. Cirik, A. Rohrbach, J. Andreas, L. Morency, T. Berg-Kirkpatrick, K. Saenko, D. Klein, and T. Darrell (2018) Speaker-follower models for vision-and-language navigation. In Advances in Neural Information Processing Systems, pp. 3314–3325.
  • K. M. Hermann, F. Hill, S. Green, F. Wang, R. Faulkner, H. Soyer, D. Szepesvari, W. M. Czarnecki, M. Jaderberg, D. Teplyashin, et al. (2017) Grounded language learning in a simulated 3D world. arXiv preprint arXiv:1706.06551.
  • F. Hill, R. Reichart, and A. Korhonen (2015) SimLex-999: evaluating semantic models with (genuine) similarity estimation. Computational Linguistics 41 (4), pp. 665–695.
  • G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
  • Y. Jiang, S. Gu, K. Murphy, and C. Finn (2019) Language as an abstraction for hierarchical deep reinforcement learning. arXiv preprint arXiv:1906.07343.
  • T. Johannink, S. Bahl, A. Nair, J. Luo, A. Kumar, M. Loskyll, J. A. Ojea, E. Solowjow, and S. Levine (2019) Residual reinforcement learning for robot control. In 2019 International Conference on Robotics and Automation (ICRA), pp. 6023–6029.
  • M. Kempka, M. Wydmuch, G. Runc, J. Toczek, and W. Jaśkowski (2016) ViZDoom: a Doom-based AI research platform for visual reinforcement learning. In 2016 IEEE Conference on Computational Intelligence and Games (CIG), pp. 1–8.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • J. Kiros and W. Chan (2018) InferLite: simple universal sentence representations from natural language inference data. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4868–4874.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
  • J. Lu, D. Batra, D. Parikh, and S. Lee (2019) ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. arXiv preprint arXiv:1908.02265.
  • C. Matuszek, E. Herbst, L. Zettlemoyer, and D. Fox (2013) Learning to parse natural language commands to a robot control system. In Experimental Robotics, pp. 403–415.
  • D. Misra, A. Bennett, V. Blukis, E. Niklasson, M. Shatkhin, and Y. Artzi (2018) Mapping instructions to actions in 3D environments with visual goal prediction. arXiv preprint arXiv:1809.00786.
  • D. Misra, J. Langford, and Y. Artzi (2017) Mapping instructions and visual observations to actions with reinforcement learning. arXiv preprint arXiv:1704.08795.
  • V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu (2016) Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937.
  • J. Oh, S. Singh, H. Lee, and P. Kohli (2017) Zero-shot task generalization with multi-task deep reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pp. 2661–2670.
  • D. Pruthi, B. Dhingra, and Z. C. Lipton (2019) Combating adversarial misspellings with robust word recognition. arXiv preprint arXiv:1905.11268.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. OpenAI Blog 1 (8).
  • M. Schuster and K. Nakajima (2012) Japanese and Korean voice search. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5149–5152.
  • S. Tellex, T. Kollar, S. Dickerson, M. R. Walter, A. G. Banerjee, S. Teller, and N. Roy (2011) Approaching the symbol grounding problem with probabilistic graphical models. AI Magazine 32, pp. 64–76.
  • S. Tellex, P. Thaker, R. Deits, D. Simeonov, T. Kollar, and N. Roy (2012) A probabilistic approach for enabling robots to acquire information from human partners using language.
  • Y. H. Tsai, S. Bai, P. P. Liang, J. Z. Kolter, L. Morency, and R. Salakhutdinov (2019) Multimodal transformer for unaligned multimodal language sequences. arXiv preprint arXiv:1906.00295.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
  • M. R. Walter, S. Hemachandra, B. Homberg, S. Tellex, and S. Teller (2014) A framework for learning semantic maps from grounded natural language descriptions. The International Journal of Robotics Research 33 (9), pp. 1167–1190.
  • S. I. Wang, P. Liang, and C. D. Manning (2016) Learning language games through interaction. arXiv preprint arXiv:1606.02447.
  • X. Wang, Q. Huang, A. Celikyilmaz, J. Gao, D. Shen, Y. Wang, W. Y. Wang, and L. Zhang (2019) Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6629–6638.
  • J. Wieting and D. Kiela (2019) No training required: exploring random encoders for sentence classification. arXiv preprint arXiv:1901.10444.
  • T. Winograd (1972) Understanding natural language. Cognitive Psychology 3 (1), pp. 1–191.
  • H. Yu, X. Lian, H. Zhang, and W. Xu (2018a) Guided feature transformation (GFT): a neural language grounding module for embodied agents. arXiv preprint arXiv:1805.08329.
  • H. Yu, H. Zhang, and W. Xu (2018b) Interactive grounded language acquisition and generalization in a 2D world. arXiv preprint arXiv:1802.01433.
  • V. Zhong, T. Rocktäschel, and E. Grefenstette (2020) RTFM: generalising to novel environment dynamics via reading. In Proceedings of ICLR.
  • F. Zhu, Y. Zhu, X. Chang, and X. Liang (2019) Vision-language navigation with self-supervised auxiliary reasoning tasks. arXiv preprint arXiv:1911.07883.

6 Supplementary Material

6.1 Agent architecture and training details

Figure 4: Schematic of the agent architecture with CMSA layer, and typo-noise applied to the environment’s template instructions.
Table 5: Agent action spec. Actions are grouped into body movement actions, movement and grip actions, and object manipulation actions.

The action set of the agent is presented in Table 5.

To process visual input, the agent uses a residual convolutional network with channels in the first, second and third layers respectively and residual blocks in each layer.

As the agent learns, actors carry replicas of the latest learner network, interact with the environment independently and send trajectories (observations and agent actions) back to a central learner. The learning algorithm [Espeholt et al., 2018] modifies the learner weights to optimize an actor-critic objective, with updates importance-weighted to correct for differences between the actor policy and the current state of the learner policy. We used a learning rate of , a learner batch size of , an agent unroll length of , a discounting factor of , an epsilon (epsilon-greedy policy) of and an entropy cost of  [Mnih et al., 2016]. We use an Adam optimizer [Kingma and Ba, 2014] with and .
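The importance weighting described above can be illustrated by the truncated per-step importance ratios used in V-trace-style corrections (Espeholt et al., 2018); the following is a simplified sketch with our own function and parameter names, omitting the full value-target computation:

```python
def truncated_importance_weights(pi_probs, mu_probs, rho_bar=1.0, c_bar=1.0):
    """Per-step truncated importance ratios for off-policy corrections.

    pi_probs: probabilities of the taken actions under the learner policy.
    mu_probs: probabilities of the same actions under the actor (behaviour)
    policy that generated the trajectory. Ratios are clipped at rho_bar
    (for the value targets) and c_bar (for the trace-cutting coefficients).
    """
    ratios = [p / m for p, m in zip(pi_probs, mu_probs)]
    rhos = [min(r, rho_bar) for r in ratios]   # clipped weights for value targets
    cs = [min(r, c_bar) for r in ratios]       # clipped trace-cutting coefficients
    return rhos, cs
```

Clipping keeps updates stable when the learner policy has drifted far from the actor policy that produced the trajectory.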

In order to train the agent on the putting task, it was necessary to combine episodes of the task itself with episodes of simpler tasks, in a form of curriculum (although learning proceeded in parallel on all tasks). In particular, we trained it concurrently on a reference task that involved the same moveable objects as the putting task (so that the agent could receive signal about their names without relying on a complete act of putting). We also found it beneficial to add a put-near task to the curriculum, where the instructions were of the form Put a cup near the tray rather than Put a cup on the tray, and a reward was emitted if the agent moved the cup to a short distance from the tray and placed it on the ground. During training, the agent received equal experience of the modified reference task, the put-near task and the (target) putting task, and continued learning until performance on the putting task converged. When training the agent on the reference task, no curriculum was required. When training the multi-task agent on both tasks, the agent was trained with equal experience on four tasks: the reference task, the modified reference task, the put-near task and the putting task. Training was stopped when performance on both the putting task and the reference task converged. Note that the distributed (IMPALA) framework lends itself naturally to sharing learning experience across tasks in this way, since we can simply have some proportion of the total agent actor threads interacting with particular tasks. The experience across the tasks is aggregated into batches on the learner according to these proportions.
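The allocation of actor threads to tasks by fixed proportions can be sketched as follows; the helper and task names are hypothetical, not the paper's infrastructure:

```python
def assign_actors_to_tasks(num_actors, task_proportions):
    """Split a pool of actor ids across tasks according to fixed proportions.

    task_proportions: mapping from task name to a fraction summing to 1.
    Returns a mapping from task name to the list of actor ids serving it;
    any rounding remainder is absorbed by the final task.
    """
    assignment, start = {}, 0
    items = list(task_proportions.items())
    for i, (task, frac) in enumerate(items):
        count = num_actors - start if i == len(items) - 1 else round(num_actors * frac)
        assignment[task] = list(range(start, start + count))
        start += count
    return assignment
```

For example, an equal three-way curriculum would pass proportions of one third per task; the learner then batches trajectories in the same ratios.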

For the reference task, training the agent requires approximately 200 million frames of experience (approximately 7 million episodes), which takes about 24 hours with 250 actors on GPU. In the putting task, training in each condition was stopped after 30 million episodes, approximately three days of training.

6.2 Full experimental results

Figure 5: Reference task, synonym evaluation. Left: accuracy of different agents over 1,000 episodes when no change is made to the environment template instruction. Right: accuracy of different agents when a synonym is introduced. Dotted line indicates score of an agent that carries out the correct behaviour but with random objects from the room.
Figure 6: Reference task, natural referring expression evaluation. Performance of agents on 1,000 evaluation episodes when part of the environment template reference task instruction is replaced by a natural referring expression from human annotators. Dotted line indicates score of an agent that carries out the correct behaviour but with random objects from the room.
Figure 7: Putting task, synonym evaluation. Performance of agents on 1,000 evaluation episodes when (left) no change is made to the template environment instruction and (rightmost three) different parts of the template environment instruction are replaced with synonyms. Dotted line indicates score of an agent that carries out the correct behaviour but with random objects from the room.
Figure 8: Putting task, natural instruction evaluation. Performance of agents on 1,000 evaluation episodes when the full environment template putting instruction is replaced by a natural instruction from human annotators. Dotted line indicates score of an agent that carries out the correct behaviour but with random objects from the room.

For a superset of the experimental results presented in the main paper (with two additional conditions involving typo noise), see Figures 5–8.

6.3 Instructions to human annotators

Human annotators use a keyboard and mouse to control a player in the environment simulator, as is standard in first-person video games. The annotators were given the following instructions as part of each annotation task.

6.3.1 Natural referring expressions

This is a task called Name The Object. You will find yourself in a room containing a single object. Please move around the room to get a good view of the object. When you know what the object is:

  1. Hit Enter

  2. Type the name of the object

  3. Hit Enter again

Examples of good responses:

  1. A kettle

  2. Some trousers

  3. A pair of scissors

  4. A tennis ball

Please don’t describe the object. Just write down what you see, with an article like ‘a’ or ‘some’ if appropriate.

Example of bad responses

  1. A brown ball

  2. A large piano with long black legs

  3. A small thing with lumps on the side

You should not need more than 4-5 different words, and most objects will require just 1 or 2 words to name.

Sometimes, you might be unsure what the object is. If that’s the case, just make your best guess.

6.3.2 Full human instructions

This is a task called Ask to put. You will find yourself in a room. Your job is to imagine giving an instruction to somebody else so that that person puts the red object on to the white object. Move around the room to get a good look at what the two objects are. When you are ready to give your instruction:

  1. Hit Enter

  2. Type your instruction

  3. Hit Enter again

Examples of good instructions:

  1. Place the cup onto the table

  2. Put the ball on the plate

  3. Move the pencil onto the box

Words to avoid

Your instruction must not contain words for colours or other properties of the object. Please do not use the words red, white, scarlet, dark, large etc. in your instruction. Instead, refer to objects by their name as you recognise them. If you don’t recognise what an object is, just make your best guess.

Examples of bad instructions:

  1. Put the red thing on the table

  2. Put the large object on the small object

  3. Put the small round thing on the chest

Keep your language varied. Try to use various ways to express your instruction in different episodes to keep things interesting.

6.4 Further environment details

For all experiments, the environment is a Unity room of dimension 4m x 4m. The walls, floor and ceiling are always the same colour, but we add a window and a door (positioned randomly per episode) to give the agent some sense of absolute location.

When importing ShapeNet models, we use the scaling provided in the metadata, unless it is very small (less than 0.000001), in which case we interpret the coordinates in the OBJ file as meters. All objects have rigid bodies with collision meshes generated using Unity’s built-in MeshCollider, with convex set to true. The masses of all movable objects are set to 1 kg, so that our avatar has enough strength to pick all of them up (except for beds and trays, which are made kinematic).
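The scale-selection rule above can be sketched as a small helper (the function name is ours; the threshold is the one stated in the text):

```python
def import_scale(metadata_scale, threshold=0.000001):
    """Choose the scale factor for an imported ShapeNet model.

    Uses the scale from the model's metadata unless it is implausibly
    small, in which case the OBJ coordinates are treated as already
    being expressed in meters (i.e. scale factor 1.0).
    """
    if metadata_scale < threshold:
        return 1.0  # interpret OBJ coordinates directly as meters
    return metadata_scale
```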

When selecting ShapeNet models, for performance reasons we discarded all models with a vertex count higher than 8,000 or an OBJ file size greater than 1 MB. The native ShapeNet category names are often not natural everyday names (for instance, they can be highly specific, like "dual shock analog controller"). To mitigate this, we used ShapeNet’s WordNet tags, grouping models into categories according to WordNet synsets and assigning the name of the first synset lemma to the category.
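The synset-based grouping can be sketched as follows; the offset-to-lemma table here is a hypothetical stand-in for a real WordNet lookup (e.g. via a WordNet library), using two offsets from Table 6:

```python
from collections import defaultdict

# Hypothetical offset -> first-lemma table standing in for a WordNet lookup.
SYNSET_FIRST_LEMMA = {
    "n3018908": "chest_of_drawers",
    "n4386330": "table",
}

def group_models_by_synset(models):
    """Group (model_id, synset_offset) pairs into named categories.

    Each category is named after the first lemma of its WordNet synset,
    replacing the often-unnatural native ShapeNet category names.
    """
    categories = defaultdict(list)
    for model_id, offset in models:
        name = SYNSET_FIRST_LEMMA[offset].replace("_", " ")
        categories[name].append(model_id)
    return dict(categories)
```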

Because depth-perception is challenging without binocular vision, the agent is assisted in manipulation by a visual guide (bottom-right) that highlights objects within grasping range and drops a vertical line from held objects.

Full lists of objects and synonyms are on the following pages.

name | synonym | unique models | example ShapeNet model id | WordNet synset
chest of drawers cupboard 241 793aa6d322f1e31d9c75eb4326997fae n3018908
table desk 199 db80fbb9728e5df343f47bfd2fc426f7 n4386330
chair seat 167 6bcabd23a81dd790e386ecf78eadd61c n3005231
sofa couch 158 11d5e99e8faa10ff3564590844406360 n4263630
television receiver aerial 147 cebb35bd663c82d3554a13580615ae1 n4413042
lamp light 138 62ab9c2f7b826fbcb910025244eec99a n3641940
desk bench 98 a2c81b5c0ce9231245bc5a3234295587 n3184367
box tin 81 15b5c54a8ddc84353a5acb1c5cbd19ff n2886585
vase flute 78 415a96bfbb45b3ee35836c728d324152 n4529463
bed bunk 73 bbd23cc1e56a24204e649bb31ff339dd n2821967
floor lamp lamp 73 54017ba00148a652846fa729d90e125 n3371905
cabinet cupboard 68 f5024636f3514eb51d0662c550f8f994 n2936496
book magazine 67 c8691c86e110318ef2bc9da1ba799c60 n6422547
pencil pen 67 839923a17af633dfb29a5ca18b250f3b n3914323
laptop computer 66 20d42e934260b59c53c0c910fd6231ef n3648120
monitor screen 58 4d11b3c781f36fb3675041302508f0e1 n3787723
coffee table table 53 e21cadab660eb1c5c71d82cf1efbe60a n3067971
picture image 52 2c31ea08641b076d6ab1a912a88dca35 n3937282
bench chair 52 c6706e8a2b4efef5133db57650fae9d9 n2832068
painting portrait 51 6e51501eee677f7f73b3b0e3e8724599 n3882197
rug carpet 44 b49fda0f6fc828e8c2b8c5618e94c762 n4125115
shelf surface 43 482f328d1a777c4bd810b14a81e12eca n4197095
switch flower 42 7f20daa4952b5f5a61e2803817d42718 n4379457
plant shrub 42 94a36139281d1837465e08d496c0420f n17402
stool seat 41 1b0626e5a8bf92b3945a77b945b7b70f n4334034
bottle jug 40 a429f8eb0c3e6a1e6ea2d79f658bbae7 n2879899
loudspeaker speaker 37 82b9111b3232904eec3b2e05ce8fd39b n3696785
electric refrigerator fridge 36 392f9b7a08c8ba8718c6f74ea0d202aa n3278824
toilet urinal 34 5d567a0b5b57d8ab8b6558e44187a06e n4453655
dining table table 29 82a1545cc0b3227ede650492e45fb14f n3205892
poster picture 28 de74ab90cc9f46af9703672f184b66db n6806283
wall clock clock 28 1e2ea05e566e315c35836c728d324152 n4555566
cellular telephone mobile 26 85a94f368a791343985b19765176f4ab n2995984
person human 25 2c3dbe3bd247b1ddb19d42104c111188 n5224944
cup mug 24 542235fc88d22e1e3406473757712946 n3152175
stapler clipper 24 376eb047b40ef4f6a480e3d8fdbd4a92 n4310635
mirror reflector 24 36d2cb436ef75bc7fae7b9efb5c3bbd1 n3778568
toilet tissue toilet paper 24 6658857ea89df65ea35a7666f0cfa5bb n15099708
desktop computer pc 24 102a6b7809f4e51813842bc8ef6fe18 n3184677
table lamp desk light 23 85b52753cc7e7207cf004563556ddb36 n4387620
flag banner 22 ac50bd1ed53b7cb9cec8a10ad1c084eb n3359749
armchair settee 21 806bce1a95268006ecd7cae46ee113ea n2741540
cupboard wardrobe 21 3bb80aa0267a12bed00bff798ed59ff5 n3152990
pen pencil 20 628e5c83b1762320873ca101f05858b9 n3913116
bag sack 20 b2c9ac70c58c90fa6dd2d391b72f2211 n2776843
soda can drink 20 16526d147e837c386829bf9ee210f5e7 n4262696
sword knife 20 555c17f73cd6d530603c267665ac68a6 n4380981
bookshelf shelf 20 586356e8809b6a678d44b95ca8abc7b2 n2874800
fireplace burner 19 df1bd1065e7f7cde5e29ce2c9d37b952 n3351301
curtain drape 19 6f3da555075ec7f9e17e45953c2e0371 n3155743
battery cell 18 62733b55e76a3b718c9d9ab13336021b n2813606
bookcase bookshelf 18 bc80335bbfda741df1783a44a88d6274 n2874241
pizza pie 18 caca4c8d409cddc66b04c0f74e5b376e n7889783
hammer mallet 17 a49a6c15dcef467bc84c00e08b1e25d7 n3486255
microwave oven 16 b3bf04a02a596b139220647403cfb896 n3766619
cereal box oatmeal 16 dc394c7fdea7e07887e775fad2c0bf27 n3001610
food meal 16 d92f30d9e38cf61ce69bbdea737daae6 n21445
screen monitor 16 4d7fb62f0ed368a18f62bdf4e9082924 n4159912
glass beaker 16 89cf9af7513ecd0947bdf66811027e14 n3443167
wine bottle magnum 15 e101cc44ead036294bc79c881a0e818b n4599016
lamppost streetlight 15 9cf4a30ab7e41c85c671a0255ba06fe5 n3642472
oven cooker 15 69c57dd0f6abdab7ac51268fdb437a9e n3868196
camera polaroid 14 48e26e789524c158e1e4b7162e96446c n2946154
globe world 14 3e97d91fda31a1c56ab57877a8c22e14 n3445436
cd player stereo 14 fe90d87deaa5a8a5336d4ad4cfab5bfe n2991759
wardrobe cupboard 14 cb48ec828b688a78d747fd5e045d7490 n4557470
bible text 14 6ad85ddb5a110664e632c30f54a9aa37 n6434286
pillow cushion 13 8b0c10a775c4c4edc1ebca21882cca5d n3944520
mug cup 13 ea33ad442b032208d778b73d04298f62 n3802912
chandelier light 13 37d81dd3c640a6e2b3087a7528d1dd6a n3008889
chessboard checkers 13 ec6e09bca187c688a4166ee2938aa8ff n3017971
calculator computer 13 387e59aec6f5fdc04b836f408176a54c n2942270
dishwasher washing machine 12 ff421b871c104dabf37a318b55c6a3c n3212662
mattress mat 12 f30cb7b1b9a4184eb32a756399014da6 n3736655
basket bag 12 91b15dd98a6320afc26651d9d35b77ca n2805104
pencil sharpener sharpener 12 287b8f5f679b8e8ecf01bc59d215f0 n3914833
candle lantern 12 cf1b637d9c8c30da1c637e821f12a67 n2951508
bowl dish 12 594b22f21daf33ce6aea2f18ee404fd5 n2884435
clock watch 12 299832be465f4037485059ffe7a2f9c7 n3050242
coat hanger hanger 12 5aa1d74d04065fd98a7103092f8e8a33 n3061905
Table 6: The 80 ShapeNet category names and their sizes, corresponding WordNet synsets and an example model id for a category exemplar used in the reference task. In the extra materials we provide a full list of ShapeNet model ids in each category.
Environment word | Synonym
Immovable objects:
tray | box
bed | mattress
Movable objects:
boat | ship
hairdryer | dryer
racket | bat
bus | coach
rocket | spaceship
car | automobile
plane | aeroplane
mug | cup
robot | android
train | locomotive
keyboard | piano
helicopter | airplane
candle | lamp
Table 7: The two immovable and ten movable objects in the putting experiment. To begin each episode, both immovable objects and three randomly-selected movable objects are randomly positioned in the room.