Developing machines that can follow natural human commands, particularly those pertaining to an environment shared by both machine and human, is a long-standing and elusive goal of AI (Winograd, 1972). Recent work has applied deep reinforcement learning (RL) methods to this challenge, where a neural-network-based agent is optimized to process language input, perceive its surroundings and execute appropriate movements jointly (Oh et al., 2017; Hermann et al., 2017; Chaplot et al., 2018; Zhong et al., 2020). Deep RL promises a way to deal flexibly with the complexity of the physical, visual and linguistic world without relying on (potentially brittle) hand-crafted features, rules or policies. Nevertheless, the cost of this flexibility is the large number of environment interactions (samples) required for a network to learn behaviour policies from raw experience. To make the approach feasible, many studies thus employ a synthetic language that is generated on demand from templates by the environment simulator (Chevalier-Boisvert et al., 2018; Jiang et al., 2019; Yu et al., 2018b, a; Zhong et al., 2020). Attempts to integrate deep RL with natural language typically employ less realistic grid-like environments (Misra et al., 2017) or grant agents privileged global observations and gold-standard action sequences to make learning more tractable (Misra et al., 2018).
Here, we propose an alternative learning-based recipe for training deep RL agents that are robust to natural human commands, which we call SHIFTT (Simulation-to-Human Instruction Following via Transfer from Text). In this approach, agents are first endowed with powerful pretrained language encoders; their visual processing and behaviour policies are then optimised using conventional deep RL methods to respond to synthetic, template-based language. Finally, we evaluate the agents on their ability to execute instructions from human testers that are (in theory) semantically equivalent to, but can be superficially quite distinct from, those experienced by the agent during training.
We demonstrate the effectiveness of this approach in the context of a 3D room containing object models from the ShapeNet dataset (Chang et al., 2015). Using 3D models based on everyday objects allows us to solicit from human testers diverse ways of instructing and/or referring to things, and hence to probe the agents’ robustness to this diversity. Unsurprisingly, we find that agents trained with conventional template-based commands do not adapt well to the natural (keyboard-typed) commands of the testers. With SHIFTT, however, agents can satisfy human instructions with substantially above-chance accuracy, which shows that powerful language models can support a notable degree of zero-shot policy generalization. Indeed, on the two tasks we consider in this work, our agent, trained only in simulation, performs as well as naive human game operators at the task of executing the noisy instructions provided by other humans.
To better analyze the generalization that supports this robustness, we consider different ways of incorporating pretrained language models into a deep RL agent, and probe each condition with specific synthetic deviations from their template-based training instructions. We find that pretrained language encoders based on non-BERT (context-free) word embeddings can support a degree of generalization driven by lexical similarity (executing Find a vehicle when trained to Find a car), but not phrasal equivalence (failing at Put a plate on the container when trained to Put the dish on the tray). In contrast, methods that integrate powerful contextual word representations (i.e. BERT) support both types of generalization. We note further boosts to test robustness when these pretrained language encoders are complemented with additional learned self-attention layers (tuned to the agent's environmental objectives). Ablation experiments isolate the role of WordPiece tokenization (Schuster and Nakajima, 2012) in robustness to typed human instructions, which motivates the addition of typo noise as a critical component of the training pipeline for the highest-performing agents.
1.1 The SHIFTT method
In the general case, our proposed recipe (Fig 2) involves the following steps:
Train or acquire a pretrained language model L.
Construct a language-dependent object manipulation task. Define a global set G of human-nameable objects and a class R of binary spatial relations. For each r ∈ R, write a reward function f_r(x, y), such that f_r(x, y) = 1 iff objects x and y are in spatial relation r.
For each episode:
Sample a subset O ⊂ G of unique objects, and individual objects x, y ∈ O.
Sample a spatial relation r ∈ R.
Construct an instruction according to the template Put the n(x) n(r) the n(y), where n(z) is the everyday name for object or relation z.
Spawn all objects in O and an agent at random positions and orientations in the environment. Append the instruction to the agent's environment observations.
Train the agent using RL to maximize the expected cumulative reward from f_r, based on per-timestep visual observations and instructions encoded by L.
Have human testers interact with episodes in the environment and pose instructions to the agent (with the macro-objective of having relation r realized for some x, y in each episode). Evaluate the performance of the agent in response to these instructions.
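The episode-generation and reward steps of this recipe can be sketched end-to-end in a few lines. Everything below is illustrative: the object list, the relation set and all function names are stand-ins, not the actual environment's API.

```python
import random

# Hypothetical global object set and relation set; the real environment
# uses 80 ShapeNet categories and relations such as "on".
OBJECTS = ["mug", "tray", "train", "bed", "pillow", "flag"]
RELATIONS = {"on": "on"}  # relation id -> everyday name


def sample_episode(num_objects=3, rng=random):
    """Sample one training episode: a subset of objects, a target pair
    and relation, and a template instruction (recipe steps 3a-3c)."""
    objs = rng.sample(OBJECTS, num_objects)
    x, y = objs[0], objs[1]              # target pair for the reward function
    rel = rng.choice(list(RELATIONS))
    instruction = f"Put the {x} {RELATIONS[rel]} the {y}"
    return objs, (x, rel, y), instruction


def reward(state, target):
    """Binary reward f_r: 1 iff the target pair is in the target relation.
    Here `state` is a symbolic stand-in mapping object pairs to relations."""
    x, rel, y = target
    return 1.0 if state.get((x, y)) == rel else 0.0
```

In the real pipeline the instruction string is appended to the agent's observations and the symbolic state check is replaced by a physics-based test.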
2 Architectures for instruction-following agents
Since the object of our study is instruction-following, we consider an agent architecture that is as conventional as possible in aspects not related to language processing. It consists of standard components for computing visual representations of pixels, embedding text strings and predicting actions contingent on these and a memory of past inputs (see Supplement for details not included here).
The visual observations received by the agent at each timestep are real-valued tensors, which are processed by a 3-layer residual convnet.
The output of the visual processing is combined with language information according to a particular encoding strategy, as described below. In all conditions, some combination of vision and language input at each timestep passes into an LSTM memory.
Action and value prediction
The state of the memory core at each timestep is passed through a linear layer and softmax to compute a distribution over 26 actions. Independently, the memory state is passed to a linear layer to yield a scalar value prediction.
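A minimal numpy sketch of these two heads; the weight matrices stand in for the learned linear layers, and no biases are shown.

```python
import numpy as np

NUM_ACTIONS = 26


def policy_value_heads(memory_state, w_pi, w_v):
    """Map the LSTM memory state to a distribution over the 26 actions
    (linear layer + softmax) and, independently, to a scalar value
    prediction (a second linear layer)."""
    logits = memory_state @ w_pi                  # (NUM_ACTIONS,)
    logits -= logits.max()                        # numerical stability
    pi = np.exp(logits) / np.exp(logits).sum()    # softmax over actions
    value = float(memory_state @ w_v)             # scalar baseline
    return pi, value
```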
The agent is trained using an importance-weighted actor-critic algorithm with a central learner and distributed actors (Espeholt et al., 2018).
2.1 Language encoding
The agent must process both string observations in the simulated environment (during training) and human instructions, also encoded as strings (during evaluation). Its representations of language must be combined with visual information to make decisions about how to act. We compare various ways of achieving this encoding. For methods that transfer knowledge from unsupervised text-based learning, we take weights from the well-known BERT model (Devlin et al., 2018), specifically the uncased BERTBASE model made available by the authors at https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1.
BERT + mean pool For a given input sequence of words, BERTBASE returns context-dependent (sub)-word representations of dimension 768. In this condition, a mean pooling operation is applied over the sequence dimension to yield a single representation of dimension 768, which is concatenated with the flattened output of the visual processing module. This multi-modal representation is passed through a single-layer MLP before entering the memory core of the agent. We apply the standard BERTBASE WordPiece vocabulary (of size 30,000). Note that WordPiece encodes language in terms of subwords (a mix of characters, common word chunks, morphemes and words) rather than the word-level vocabulary applied in more traditional neural language models.
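A sketch of this fusion step, with plain numpy arrays standing in for the frozen BERT activations and convnet features; the tanh nonlinearity is an assumption, since the MLP's activation is not specified here.

```python
import numpy as np


def fuse_mean_pool(bert_out, visual_feat, w_mlp):
    """BERT + mean pool fusion: average the (num_tokens, 768) contextual
    token embeddings over the sequence dimension, concatenate with the
    flattened visual features, and project through a single-layer MLP.
    The result is what enters the agent's LSTM memory core."""
    pooled = bert_out.mean(axis=0)                       # (768,)
    joint = np.concatenate([visual_feat.ravel(), pooled])
    return np.tanh(joint @ w_mlp)                        # assumed nonlinearity
```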
BERT + self-attention layer When applying BERT to text classification, performance can sometimes be improved by fine-tuning the weights in the BERT encoder to suit the task. Because of the large number of gradient updates required to learn a complex behaviour policy, fine-tuning the BERT weights in this way would cause substantial overfitting to the synthetic environment language. We therefore keep the BERT weights frozen, but experiment with an additional learned self-attention layer (Vaswani et al., 2017) to create a language encoder whose bottom layers are pretrained and whose final layer adapts these representations to the present environment and tasks. This additional layer uses multi-head attention with learned key and value embeddings.
BERT + cross-modal self-attention We also consider a cross-modal self-attention layer of the type suggested by e.g. Tsai et al. (2019); Lu et al. (2019), which in our application provides an explicit pathway for the agent to bind visual experience to specific (contextual) word representations. In this case, we treat each of the output channels of the visual processing network as word-like entities, passing them through a linear layer of output size 768 to match the BERT output. The embeddings for all words and the visual channels are then processed with a single self-attention layer, whose weights are again learned jointly with other agent parameters. Note that, when applied across modalities, a Transformer self-attention layer is more expressive than prior attention-based operations for fusing vision and language in agents (Chaplot et al., 2018): it permits interactions at the (contextualized) word level rather than over the entire instruction, it includes multiple heads, and its key-query-value mechanism allows the model to learn both content-based and key-based lookup processes.
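The cross-modal layer can be sketched as single-head scaled dot-product attention over the concatenated word and visual tokens; the real layer is multi-headed, and all weight matrices here are placeholders for the learned projections.

```python
import numpy as np


def cross_modal_attention(word_emb, visual_channels, w_q, w_k, w_v, w_proj):
    """Single-head sketch of the cross-modal layer: visual channels are
    projected to the BERT width (768) and treated as extra word-like
    tokens; all tokens then attend to each other jointly."""
    vis_tokens = visual_channels @ w_proj        # (n_channels, 768)
    tokens = np.vstack([word_emb, vis_tokens])   # words + visual tokens
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])      # scaled dot-product
    scores -= scores.max(axis=-1, keepdims=True)
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)     # row-wise softmax
    return attn @ v                              # contextualized tokens
```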
Pretrained (sub)-word embeddings Language classification experiments with BERT show the value of highly context-dependent (sub)-word representations, but transfer in that context is also possible with more conventional (context-independent) word embeddings (Collobert and Weston, 2008). To measure the effect of this distinction, we consider a simpler encoder based on the (context-independent) input (sub)-word embeddings from BERT, which are also of dimension 768. (These embeddings capture lexical similarity much like conventional word embeddings; a cosine metric over them correlates with human ratings from the SimLex-999 dataset (Hill et al., 2015).)
Taking a mean of these context-independent vectors would yield a word-order-invariant representation of language. We therefore process them with a single-layer transformer with multi-head attention. As in BERT + mean pool, this output is averaged, reduced by a single-layer MLP and passed to the agent's core memory.
Finally, in one condition we introduce typo noise when training the agent (SHIFTT stage 2). The noise function simply replaces each character of the template language strings produced by the environment with an adjacent character on a standard keyboard, with some small probability. This encourages the agent to learn, during the training process, to compensate for potential typographical errors of human operators. No modifications were made to the evaluation stimuli. Much previous work applies typing noise for language classifier robustness; see e.g. Pruthi et al. (2019) for a recent survey.
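A sketch of such a noise function; the adjacency map below is deliberately partial and the default replacement probability is a placeholder, since the value used in training is not stated here.

```python
import random

# Partial keyboard-adjacency map (illustrative; a full map covers all keys).
ADJACENT = {"a": "qwsz", "e": "wrsd", "o": "ipkl", "t": "rygf", "n": "bhjm"}


def typo_noise(text, p=0.05, rng=random):
    """Replace each character that has an adjacency entry with a randomly
    chosen neighbouring key, independently with probability p."""
    out = []
    for ch in text:
        if ch in ADJACENT and rng.random() < p:
            out.append(rng.choice(ADJACENT[ch]))
        else:
            out.append(ch)
    return "".join(out)
```

Applied on the fly to every template instruction, this exposes the agent to a different corrupted string on each episode while leaving evaluation stimuli untouched.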
We compare to various baselines designed to isolate specific components of these encoders.
Random mean pool As a direct baseline for BERT + mean pool, we consider an identical architecture in which the contextual BERT embeddings are replaced by fixed random (subword-specific) vectors (also of dimension 768). The weights in the rest of the agent network are trained as in other conditions. We note that random sentence vectors are a competitive baseline on many language classification tasks (Wieting and Kiela, 2019).
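This baseline amounts to a fixed random lookup table, sketched here with numpy (table sizes chosen to match the BERT vocabulary and width; the table itself is never trained).

```python
import numpy as np


def random_embedding_table(vocab_size=30000, dim=768, seed=0):
    """Random mean pool baseline: each subword index gets a fixed random
    vector; only the rest of the agent network trains."""
    rng = np.random.default_rng(seed)
    return rng.normal(size=(vocab_size, dim))


def encode_mean_pool(token_ids, table):
    """Look up the fixed vectors for a token sequence and mean-pool them."""
    return table[token_ids].mean(axis=0)   # (dim,)
```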
Word-level and WordPiece transformers In addition to pretrained weights, a potentially important aspect of encoders based on BERT is WordPiece tokenization, which can afford greater ability to make sense of typos or rare words with familiar morphemes than a more conventional word-level encoder. To isolate the effect of tokenization from that of pretrained weights, we compared two further encoders. In one, we split all input strings by whitespace and hash each word string to a unique index, which serves as input to a single-layer transformer with multi-head attention and embedding size 768 (chosen to match BERTBASE); the output is processed identically to the Random mean pool condition. We contrast this with an otherwise identical condition in which the WordPiece tokenization from BERT is applied rather than splitting on whitespace.
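The word-level hashing scheme might look as follows; md5 here stands in for whatever hash function is actually used, which is not specified in this description.

```python
import hashlib


def word_level_tokenize(instruction, vocab_size=30000):
    """Word-level baseline: split on whitespace and hash each word to a
    fixed index. Unlike WordPiece, a typo ('mgu' for 'mug') maps to an
    unrelated index rather than to familiar subword units."""
    ids = []
    for w in instruction.lower().split():
        digest = hashlib.md5(w.encode()).hexdigest()
        ids.append(int(digest, 16) % vocab_size)
    return ids
```

This makes concrete why WordPiece can help with noisy input: the hashing scheme gives corrupted or rare words arbitrary, unshared representations.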
Human performance estimate
As a point of comparison, we recruited a set of five human testers, separate from those that produced the original instructions, and subjected them to a similar evaluation procedure as the agent for twenty episodes of the reference and putting tasks. Observing these episodes, we found the reasons for human failure on a given episode to include unclear or ambiguous instructions (such as Put a toy on the bed when any object could conceivably be a toy), objects with ambiguous appearance (e.g. the car and bus look similar) and, rarely, a failure of control or object manipulation.
We experiment with a reference task, which focuses on object identification, and a putting task, which focuses on object relations and discrete object control. In both tasks the locations of all objects and the initial position and orientation of the agent are chosen at random on floating-point scales, so it is highly unlikely that any two episodes are spatially identical (whether during training or evaluation). The action set consists of movements (move-forward, back, left, right), turns (turn-up, down, left, right), and object pick-up and hand control with 4 DoF. The placement of objects is assisted by a visible vertical line. See Supplement for a full description.
3.1 Determining reference of ShapeNet objects
In an episode of the reference task, the environment selects two movable objects at random from a global object set, and generates a language instruction of the form Find a X, where X is the correct name for (exactly) one of those objects. To achieve reward, the agent must locate the X, lift it more than 1 m above the ground and keep it there for 5 timesteps. If the reward function in the environment determines that these conditions are met, a positive reward is emitted and the episode ends. If a lifting takes place with the incorrect object, the episode also ends, without reward. For a global set of recognisable objects, we use the ShapeNet dataset (Chang et al., 2015), which contains over 12,000 3D rendered models. For performance reasons we discard models with more than 8,000 vertices or an OBJ file size greater than 1 MB, and consider only those tagged with a synset in the WordNet taxonomy. We take models from the ShapeNetSem slice of the data, and use the name of the first lemma of that synset as a proxy for the name of the object. From these names, we selected 80 for the environment, each referring to a set of object models with a minimum of 12 exemplars. See Supplement for details.
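The success condition can be sketched as a check over the recent heights of held objects (a simplification of the environment's actual reward function; the reward value of 1.0 is shown purely for illustration).

```python
def lifting_reward(held_heights, target, lift_height=1.0, hold_steps=5):
    """Reference-task check: an object counts as 'found' once it has been
    held more than 1 m above the ground for 5 consecutive timesteps.
    held_heights maps each object name to its recent heights (metres).
    Returns (reward, episode_done)."""
    for obj, heights in held_heights.items():
        recent = heights[-hold_steps:]
        if len(recent) == hold_steps and all(h > lift_height for h in recent):
            # Lifting the wrong object also ends the episode, without reward.
            return (1.0, True) if obj == target else (0.0, True)
    return 0.0, False   # no sustained lift yet; episode continues
```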
Table 1: Example instructions for the reference task.

| Template lang. | Synonym | Natural |
| --- | --- | --- |
| Find a flag | Find a banner | Find the indian flag; Find a flag; Find the flag.; Find the cardboard. |
| Find a pillow | Find a cushion | Find a pillows; Find a cushion; Find a paper; Find a set of cushions |
For all language encoding strategies described above, the agent was able to learn the task, with a well-trained policy completing episodes at an accuracy of around 90%. The failure of agents to reach perfect performance on the training set is unimportant for the present study; we suspect that the agent's comparatively small convolutional network fails to perceive important distinctions between the more intricate ShapeNet models.
Table 2: Example instructions for the putting task.

| Template lang. | D.O. synonym | I.O. synonym | D.O. & I.O. synonym | Natural |
| --- | --- | --- | --- | --- |
| Put a mug on the tray | Put a cup on the tray | Put a mug on the box | Put a cup on the box | place mug in the basket; Keep the cup in atub; place the mug in a container; put the coffee mug in the box |
| Put a train on the bed | Put a locomotive on the bed | Put a train on the bunk | Put a locomotive on the bunk | Put the tractor on the bed; Move the train toy onto the bed; Place a toyvehicle on the bed; place the rail on tthe bed |
We consider two evaluation settings. In the synonym evaluation, we ran the environment for 1,000 episodes with the noun in the environment template instruction replaced by a synonym (Find a X becomes Find a Y, where Y is a synonym of X). The synonyms were provided by native English-speaking subjects. In the natural reference evaluation, we gave 40 annotators access to a room containing a single ShapeNet model via a crowd-sourcing platform. We asked them to write down what they found in the room and then hit a button that restarted the environment with a new object in the room (see Supplement for all the synonyms and human instructions). We used this interactive method of soliciting instructions, rather than paraphrases of template instructions, because the setting more faithfully reflects the perspective of a user instructing a robot or situated learning system.
As illustrated in Table 1, unlike the synonyms, the natural referring expressions involve variation in articles as well as nouns (a pencil might become the pen), may include spelling mistakes or typos, can refer entirely incorrectly to the intended object (if the subject fails to recognize the ShapeNet model), but may also match the training instruction exactly. Moreover, unlike the synonym test, there are natural referring expressions for each of the 80 environment nouns, from which we sample randomly when evaluating the agent (again on 1,000 evaluation episodes). Overall, the human referring expressions are highly varied; while there are 82 unique word types across all possible template instructions for this task, the human referring expressions involved 557 unique word types. The full set of human instructions aligned with ShapeNet model IDs can be downloaded from https://tinyurl.com/s6u5bbj.
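The synonym evaluation amounts to a string substitution over the template instruction; the synonym map below is illustrative, not the actual human-provided list.

```python
# Illustrative synonym pairs; the real list was provided by native
# English-speaking subjects and covers all 80 environment nouns.
SYNONYMS = {"car": "vehicle", "mug": "cup", "train": "locomotive"}


def synonym_instruction(template_instr, noun, synonyms=SYNONYMS):
    """Replace the noun in a template instruction with its synonym,
    leaving the rest of the string intact."""
    return template_instr.replace(noun, synonyms[noun])
```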
3.2 Putting objects on other objects
The strength of models like BERT is their ability to combine lexical representations into phrasal or sentence representations. To study this capacity in the context of instruction-following, we devised a putting task involving the verb 'to put', which, in the imperative (put the cup on the tray), takes two arguments: the direct object (D.O.) cup and the indirect object (I.O.) tray. In terms of the behaviours required, the putting task focuses on control and object relations rather than object identification or reference. The environment was configured to begin each episode with three randomly-chosen moveable objects and two larger immovable objects (a bed and a tray), each randomly positioned in the room. In each episode of this task, the agent receives an instruction Put a D.O. on the I.O., where D.O. is any of the three moveable objects (chosen from a global set of ten) and I.O. is either bed or tray. The environment checks whether an instance of D.O. is at rest (and not held by the agent) on top of the I.O., returning a positive reward if so and ending the episode. If the object D.O. is placed on something other than the I.O., or if another movable object is placed on the bed or the tray, then again the episode ends immediately, without reward. (We found that ending the episode in such cases made learning much faster.)
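The terminal reward logic described here can be sketched as a pure function over a symbolic summary of the scene (a simplification of the physics-based at-rest check the environment actually performs; the reward value of 1.0 is illustrative).

```python
def putting_reward(placement, direct_obj, indirect_obj):
    """Putting-task check. `placement` is None until some movable object
    comes to rest on another object, then an (object, support) pair.
    Returns (reward, episode_done)."""
    if placement is None:
        return 0.0, False                  # nothing placed yet; continue
    obj, support = placement
    if obj == direct_obj and support == indirect_obj:
        return 1.0, True                   # correct object on correct support
    return 0.0, True                       # any other placement ends the episode
```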
As before, we first trained all agents on the putting task with synthetic environment language instructions. Training a policy on this task with reinforcement learning required a bespoke task curriculum (see Supplement for more details). To gather the evaluation stimuli, we again crowd-sourced both natural synonyms for each of the 12 objects in the global set for this task and, in this case, entirely free-form natural human instructions. To obtain natural instructions, we instantiated an environment with only one of the global set of moveable objects, coloured red, and one of either the bed or the tray, coloured white, and asked subjects to ask somebody to place the red object on top of the white object without mentioning their colour. This resulted in a set of natural instructions containing a total of 180 unique word types, compared to only 19 in the template instructions used for agent training. We then evaluated agents on four specific evaluations, illustrated in Table 2: D.O. synonym, I.O. synonym and D.O. & I.O. synonym, in which particular parts of the original template command were replaced with synonyms, and Natural instruction, the fully free-form human instruction, which can include orthographic errors and misidentified objects.
3.3 Discussion of results
[Table 3: Reference-task accuracy under the Template language, Synonym and Natural evaluations, for: Random ‘lifting’ act (chance, 0.50 in each evaluation); Random-embedding + MP; Word-level Transf. + TN; WordPiece Transf. + TN; Word embeddings + Transf.; BERT + MP; BERT + SA; BERT + CMSA; BERT + CMSA + TN; and the human performance estimate.]
[Table 4: Putting-task accuracy under the Template language, D.O. synonym, I.O. synonym, D.O. & I.O. synonym and Natural evaluations, for: Random ‘putting’ act (chance, 0.17 in each evaluation); Random-embedding + MP; Word-level Transf. + TN; WordPiece Transf. + TN; Word embeddings + Transf.; BERT + MP; BERT + SA; BERT + CMSA; BERT + CMSA + TN; Multitask BERT + CMSA + TN; and the human performance estimate.]
The accuracies for the reference task are presented in Table 3 and for the putting task in Table 4. A video of the BERT+CMSA+TN agent both succeeding and failing when following human instructions on the putting task can be seen at https://tinyurl.com/uy4fus2. The results reveal the following main effects of language encoding on model performance:
Substantial transfer from text requires contextual encoders Agents with weights that are pretrained on text data exhibit substantially higher accuracy on both the reference and the putting tasks. This effect is greatest in the more focused synonym evaluations, but also holds for the free-form human instructions. A small transfer effect can be seen by comparing the word embeddings + Transformer condition (62% accuracy on the synonym evaluation of the reference task and 57% accuracy on the D.O. synonym evaluation of the putting task) with the WordPiece Transformer (57% and 20%). However, the transfer effect is much stronger overall in the case of the full context-dependent BERT representations: on the same two evaluations, BERT + mean pool achieves 77% and 94% accuracy respectively. The gains from transferring via BERT representations vs. just (sub)word embeddings are greatest for the (longer) putting instructions than for the reference (finding) instructions, and greatest of all in the D.O. & I.O. synonym evaluation. These are cases where one would expect the marginal value of powerful sentential (rather than just lexical) representations to be greatest.
Tuning via self-attention layers (with typo noise) helps Interestingly, we find that tuned self-attention layers do not by themselves improve generalization performance over using BERT and mean pooling. This may simply be because the additional layers cause a degree of overfitting to the template environment language during training. However, typo noise mitigates this issue, so that the strongest evaluation performance on the putting tasks overall is observed with a combination of a tuned cross-modal self-attention layer and typo-noise training (see Supplement for a comparison of BERT-based architectures with and without typo noise). Indeed, the value of typo noise as a regularizer can be seen from the fact that it improves the robustness of agents with tuned self-attention layers even in the synonym evaluations (for both the reference and putting tasks), which do not involve any typos. Thus, the BERT + CMSA + TN model performs better than all others on two of the three synonym evaluations in the putting task.
One way in which the appropriately-tuned self-attention layer might make the agent more robust to synonyms is by spreading out task-relevant object-nouns in the agent's language representation space (leaving those words closer to synonyms than to potential confounding words). The degree to which this happens when the object-nouns and their synonyms in our environments are passed first through BERT and then through a (BERT + SA) agent-tuned self-attention layer (compared to passing through the same layer but with random weights) is shown in Fig 3.
WordPiece tokenization adds robustness In the two evaluations involving natural language instructions from humans, a comparison of the Word-level Transformer and the WordPiece Transformer shows that some robustness is obtained simply from WordPiece encoding, which in turn must play some part in all BERT-based conditions, improving accuracy both in the natural referring-expression evaluation and in the natural-instruction putting evaluation. As mentioned above, BERT-based agents with WordPiece encoding are particularly robust to human instructions when trained with typo noise, and this is most effective when combined with tuned self-attention layers. This makes intuitive sense, as a learnable self-attention layer should provide the agent with more flexibility to learn to correct for typos during training. Indeed, on the natural instruction evaluation of the putting task, where typos and spelling errors are most common, the cross-modal attention agent trained with typo noise achieves 70% accuracy. Note that 100% accuracy on this evaluation may be impossible, even for humans, because visual ambiguity or human error mean that instructions can sometimes refer in entirely mistaken ways to the objects in the room.
Scaling to multiple tasks When evaluated on the putting task, the performance of an agent trained on both the putting and reference tasks is not substantially lower than agents that are specialized to each of the tasks individually.
Agent performance on this data is close to ceiling The best-performing agent achieves a similar level of performance to humans that are otherwise unfamiliar with the game or the instruction-setters. This underlines the extent to which the instructions procured from our human raters may be ambiguous or ill-formed. However, it also suggests that the best agents must be performing close to perfectly on all test episodes in which they have a reasonable chance to do so.
4 Related work
Most closely related to our work is an experiment by Chan et al. (2019), who showed how an agent trained with InferLite sentence representations (Kiros and Chan, 2018) can be robust to synonym replacements in template instructions. The task itself involves object identification in the VizDoom environment (Kempka et al., 2016), which requires only motor actions. Our work develops this insight substantially, applying a similar approach to a visually-realistic environment requiring fine-grained (4 DoF) object control, integrating context-dependent pretrained models with subword tokenization (BERT), analysing architectures and training strategies for integrating such models, and extending from synonym replacements to free-form instructions typed by humans.
Much recent work applies deep learning and policy optimization in end-to-end approaches to learning instruction following (Chaplot et al., 2018; Oh et al., 2017; Bahdanau et al., 2018; Chevalier-Boisvert et al., 2018; Yu et al., 2018b, a; Jiang et al., 2019). As noted in the introduction, many of these studies, particularly those requiring policies to make fine-grained movements of the agent's body or objects, do not involve human language.
Vision and language navigation (VLN) models learn to follow natural language directions that are longer than those considered in this work (Misra et al., 2017, 2018; Anderson et al., 2018; Fried et al., 2018; Wang et al., 2019; Zhu et al., 2019). The best approaches to VLN do not apply deep RL methods, since VLN typically requires a high degree of exploration rather than precise object control. VLN agents also often make use of privileged (i.e. non-first-person) observations (Misra et al., 2018), knowledge of shortest paths (Fried et al., 2018) and/or expert trajectories (Wang et al., 2019). In contrast, because we seek to evaluate a method that may ultimately be applied to robotics, we consider a setting in which the agent has access to neither privileged observations nor gold-standard trajectories, and an environment where different everyday objects must be controlled with a finer-grained set of actions (4 DoF). Another important difference is that SHIFTT is a method for zero-shot transfer from template-based to natural instructions, whereas VLN models are both trained and tested on natural language instructions.
A limitation of all studies above, including this one, is the reliance on simulated environments. Both object identification and manipulation are likely far harder in reality, and it remains to be seen whether our proposed method would work seamlessly for robot language understanding (Tellex et al., 2011, 2012; Walter et al., 2014). However, we note that it can in theory be applied in any multi-task or language-dependent policy-learning setting. See also (Anderson et al., 2018; Wang et al., 2019) for recent improvements to visual realism in simulated environments.
Finally, there is a long history of building knowledge about the structure of language and/or its environment into instruction-following systems, rather than learning it end-to-end. In Winograd (1972)'s SHRDLU, syntactic modules parsed the language input into a logical form, and hand-written rules were applied to connect such forms to the environment. More recent pipeline-based approaches use learning algorithms to map language to a program that can then interface with a planner (Chen and Mooney, 2011; Matuszek et al., 2013; Wang et al., 2016) and/or a controller, both of which may have privileged information about how the world connects to the program. It is likely that pretrained language encoders could add robustness to parts of these approaches, much as they do here. Our focus on end-to-end learning, however, is motivated by the intuition that it may eventually scale or adapt more flexibly to arbitrary environments or problems than pipeline approaches.
In this work, we have developed an agent that can follow natural human instructions requiring the identification, control and positioning of visually-realistic assets. Our method relies on zero-shot transfer from template language instructions to those given by human annotators when asked to refer and instruct in natural ways. The results show that, with powerful pretrained language encoders, this transfer effect is sufficiently strong to permit decoding extended language-dependent motor behaviours, despite the shift in distribution of the agent’s input. More generally, we hope that this contribution serves to bring research on text-based and situated language learning closer together. To facilitate this, we make available our dataset of natural instructions and referring expressions aligned to ShapeNet models.
While this study can be considered an interesting proof of concept for the SHIFTT approach to training linguistic deep RL agents via transfer learning, there are many ways in which it can be improved and extended. As agents become better able to learn a wide range of conditional policies covering a larger set of motor behaviours, it will be instructive to scale the linguistic scope of the agent via the proposed technique, for instance to language involving verbs, or to questions and dialogue. Moreover, an important long-term objective is to apply SHIFTT to the training of robotic agents. There are also many alternative possibilities for effecting the knowledge transfer from language model to agent that are not considered here. For instance, our approach freezes the BERT encoder weights rather than fine-tuning them on our desired behaviour policy, to avoid overfitting, but techniques such as knowledge distillation (Hinton et al., 2015) could point to more elegant ways to learn jointly from text and environmental experience. In addition, we have focused on BERT, but improvements may be possible with alternative general-purpose language encoders, such as GPT-2 (Radford et al., 2019), RoBERTa (Liu et al., 2019) and Transformer-XL (Dai et al., 2019).
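As a concrete illustration of the freezing strategy discussed above, the following is a minimal PyTorch sketch. The encoder here is a placeholder standing in for a pretrained model such as BERT, and the dimensions, class name and action count are illustrative assumptions, not the architecture used in our experiments:

```python
import torch
import torch.nn as nn

class FrozenEncoderPolicy(nn.Module):
    """Sketch: a trainable policy head on top of a frozen pretrained text encoder."""

    def __init__(self, encoder: nn.Module, embed_dim: int, num_actions: int):
        super().__init__()
        self.encoder = encoder
        for param in self.encoder.parameters():
            param.requires_grad = False  # freeze: RL gradients never touch the encoder
        self.policy_head = nn.Linear(embed_dim, num_actions)

    def forward(self, features_in: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():  # no gradient computation through the frozen encoder
            features = self.encoder(features_in)
        return self.policy_head(features)

# Placeholder encoder; in practice this would be a pretrained language model.
encoder = nn.Sequential(nn.Linear(32, 32), nn.Tanh())
policy = FrozenEncoderPolicy(encoder, embed_dim=32, num_actions=26)
logits = policy(torch.randn(4, 32))
trainable = [name for name, p in policy.named_parameters() if p.requires_grad]
```

Only the policy head appears among the trainable parameters, so RL optimisation leaves the pretrained representations intact.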
In sum, we have presented a conceptually simple recipe for transferring knowledge from a text corpus to a reinforcement-learning agent, and shown that the method permits zero-shot transfer from simulated to (constrained) natural language with surprising efficacy. We hope this opens new channels for research combining unsupervised (or semi-supervised) representation-learning with reinforcement learning, particularly at the intersection of language, vision and behaviour.
- Vision-and-language navigation: interpreting visually-grounded navigation instructions in real environments. pp. 3674–3683. Cited by: §4.
- Learning to understand goal specifications by modelling reward. arXiv preprint arXiv:1806.01946. Cited by: §4.
- ACTRCE: augmenting experience via teacher’s advice for multi-goal reinforcement learning. arXiv preprint arXiv:1902.04546. Cited by: §4.
- ShapeNet: an information-rich 3D model repository. arXiv preprint arXiv:1512.03012. Cited by: §1, §3.1.
- Gated-attention architectures for task-oriented language grounding. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §1, §2.1, §4.
- Learning to interpret natural language navigation instructions from observations. In Twenty-Fifth AAAI Conference on Artificial Intelligence, Cited by: §4.
- BabyAI: first steps towards grounded language learning with a human in the loop. arXiv preprint arXiv:1810.08272. Cited by: §1, §4.
- A unified architecture for natural language processing: deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pp. 160–167. Cited by: §2.1.
- Transformer-XL: attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860. Cited by: §5.
- BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §2.1.
- IMPALA: scalable distributed deep-RL with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561. Cited by: §2, §6.1.
- Speaker-follower models for vision-and-language navigation. In Advances in Neural Information Processing Systems, pp. 3314–3325. Cited by: §4.
- Grounded language learning in a simulated 3D world. arXiv preprint arXiv:1706.06551. Cited by: §1.
- Simlex-999: evaluating semantic models with (genuine) similarity estimation. Computational Linguistics 41 (4), pp. 665–695. Cited by: footnote 4.
- Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §5.
- Language as an abstraction for hierarchical deep reinforcement learning. arXiv preprint arXiv:1906.07343. Cited by: §1, §4.
- Residual reinforcement learning for robot control. In 2019 International Conference on Robotics and Automation (ICRA), pp. 6023–6029. Cited by: §1.
- ViZDoom: a Doom-based AI research platform for visual reinforcement learning. In 2016 IEEE Conference on Computational Intelligence and Games (CIG), pp. 1–8. Cited by: §4.
- Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §6.1.
- Inferlite: simple universal sentence representations from natural language inference data. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4868–4874. Cited by: §4.
- RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: §5.
- ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. arXiv preprint arXiv:1908.02265. Cited by: §2.1.
- Learning to parse natural language commands to a robot control system. In Experimental Robotics, pp. 403–415. Cited by: §4.
- Mapping instructions to actions in 3D environments with visual goal prediction. arXiv preprint arXiv:1809.00786. Cited by: §1, §4.
- Mapping instructions and visual observations to actions with reinforcement learning. arXiv preprint arXiv:1704.08795. Cited by: §1, §4.
- Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pp. 1928–1937. Cited by: §6.1.
- Zero-shot task generalization with multi-task deep reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2661–2670. Cited by: §1, §4.
- Combating adversarial misspellings with robust word recognition. arXiv preprint arXiv:1905.11268. Cited by: §2.1.
- Language models are unsupervised multitask learners. OpenAI Blog 1 (8). Cited by: §5.
- Japanese and Korean voice search. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5149–5152. Cited by: §1.
- Approaching the symbol grounding problem with probabilistic graphical models. AI magazine 32, pp. 64–76. Cited by: §4.
- A probabilistic approach for enabling robots to acquire information from human partners using language. Cited by: §4.
- Multimodal transformer for unaligned multimodal language sequences. arXiv preprint arXiv:1906.00295. Cited by: §2.1.
- Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §2.1.
- A framework for learning semantic maps from grounded natural language descriptions. The International Journal of Robotics Research 33 (9), pp. 1167–1190. Cited by: §4.
- Learning language games through interaction. arXiv preprint arXiv:1606.02447. Cited by: §4.
- Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6629–6638. Cited by: §4.
- No training required: exploring random encoders for sentence classification. arXiv preprint arXiv:1901.10444. Cited by: §2.1.
- Understanding natural language. Cognitive psychology 3 (1), pp. 1–191. Cited by: §1, §4.
- Guided feature transformation (gft): a neural language grounding module for embodied agents. arXiv preprint arXiv:1805.08329. Cited by: §1, §4.
- Interactive grounded language acquisition and generalization in a 2D world. arXiv preprint arXiv:1802.01433. Cited by: §1, §4.
- RTFM: generalising to novel environment dynamics via reading. Proceedings of ICLR. Cited by: §1.
- Vision-language navigation with self-supervised auxiliary reasoning tasks. arXiv preprint arXiv:1911.07883. Cited by: §4.
6 Supplementary Material
6.1 Agent architecture and training details
| Body movement actions | Movement and grip actions | Object manipulation |
| --- | --- | --- |
| NOOP | GRAB | GRAB + SPIN_OBJECT_RIGHT |
| MOVE_FORWARD | GRAB + MOVE_FORWARD | GRAB + SPIN_OBJECT_LEFT |
| MOVE_BACKWARD | GRAB + MOVE_BACKWARD | GRAB + SPIN_OBJECT_UP |
| MOVE_RIGHT | GRAB + MOVE_RIGHT | GRAB + SPIN_OBJECT_DOWN |
| MOVE_LEFT | GRAB + MOVE_LEFT | GRAB + SPIN_OBJECT_FORWARD |
| LOOK_RIGHT | GRAB + LOOK_RIGHT | GRAB + SPIN_OBJECT_BACKWARD |
| LOOK_LEFT | GRAB + LOOK_LEFT | GRAB + PUSH_OBJECT_AWAY |
| LOOK_UP | GRAB + LOOK_UP | GRAB + PULL_OBJECT_CLOSE |
| LOOK_DOWN | GRAB + LOOK_DOWN | |
The action set of the agent is presented in Table 5.
To process visual input, the agent uses a residual convolutional network with channels in the first, second and third layers respectively and residual blocks in each layer.
As the agent learns, actors carry replicas of the latest learner network, interact with the environment independently and send trajectories (observations and agent actions) back to a central learner. The learning algorithm [Espeholt et al., 2018] modifies the learner weights to optimize an actor-critic objective, with updates importance-weighted to correct for differences between the actor policy and the current state of the learner policy. We used a learning rate of , a learner batch size of , an agent unroll length of , a discounting factor of , an epsilon (epsilon-greedy policy) of and an entropy cost of [Mnih et al., 2016]. We use an Adam optimizer [Kingma and Ba, 2014] with and .
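The importance weighting mentioned above can be illustrated with a minimal sketch of the truncated importance ratio at the heart of the V-trace correction used by the learning algorithm (the function name and default truncation level are our illustrative assumptions, not a reproduction of the full learner):

```python
import math

def truncated_is_weight(learner_logprob: float, actor_logprob: float,
                        rho_bar: float = 1.0) -> float:
    """Truncated importance ratio min(rho_bar, pi_learner(a|s) / pi_actor(a|s)),
    computed from log-probabilities for numerical stability."""
    return min(rho_bar, math.exp(learner_logprob - actor_logprob))

# When the actor and learner policies agree, the weight is exactly 1.
same = truncated_is_weight(math.log(0.5), math.log(0.5))
# When the learner assigns much higher probability, truncation caps the weight.
capped = truncated_is_weight(math.log(0.9), math.log(0.1))
```

Truncating the ratio bounds the variance introduced by the lag between the actors' behaviour policies and the current learner policy.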
In order to train the agent on the putting task, it was necessary to combine episodes of the task itself with episodes of simpler tasks, in a form of curriculum (although learning proceeded in parallel on all tasks). In particular, we trained it concurrently on a reference task that involved the same moveable objects as the putting task (so that the agent could receive signal about their names without relying on a complete act of putting). We also found it beneficial to add a put-near task to the curriculum, where the instructions were of the form Put a cup near the tray rather than Put a cup on the tray, and a reward was emitted if the agent moved the cup a short distance from the tray and placed it on the ground. During training, the agent received equal experience of the modified reference task, the put-near task and the (target) putting task, and continued learning until performance on the putting task converged. When training the agent on the reference task, no curriculum was required. When training the multi-task agent on both tasks, the agent was trained with equal experience on four tasks: the reference task, the modified reference task, the put-near task and the putting task. Training was stopped when performance on the putting task and the reference task converged. Note that the distributed (IMPALA) framework lends itself naturally to sharing learning experience across tasks in this way, since we can simply have some proportion of the total agent actor threads interacting with particular tasks. The experience across the tasks is then aggregated into batches on the learner according to these proportions.
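The proportional sharing of actor threads across tasks can be sketched as follows (task names and the helper function are illustrative; in the real system each actor thread simply interacts with its assigned task inside the distributed framework):

```python
def assign_actors_to_tasks(num_actors, proportions):
    """Deterministically split actor threads across tasks in fixed proportions."""
    assignment = []
    for task, fraction in proportions.items():
        assignment.extend([task] * round(fraction * num_actors))
    # Pad or trim so every actor gets exactly one task (rounding may drift).
    while len(assignment) < num_actors:
        assignment.append(next(iter(proportions)))
    return assignment[:num_actors]

# Equal experience on the four tasks used to train the multi-task agent.
proportions = {"reference": 0.25, "modified_reference": 0.25,
               "put_near": 0.25, "putting": 0.25}
assignment = assign_actors_to_tasks(250, proportions)
```

The learner then aggregates trajectories into batches in the same proportions, so each task's share of gradient updates matches its share of actors.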
For the reference task, training the agent requires approximately 200 million frames of experience (approximately 7 million episodes), which takes about 24 hours with 250 actors on a GPU. For the putting task, training in each condition was stopped after 30 million episodes, approximately three days of training.
6.2 Full experimental results
For a superset of the experimental results presented in the main paper (with two additional conditions involving typo noise), see Figures 1–4 on the following pages.
6.3 Instructions to human annotators
Human annotators used a keyboard and mouse to control a player in the environment simulator, as is standard in first-person video games. The annotators were given the following instructions as part of each annotation task.
6.3.1 Natural referring expressions
This is a task called Name The Object. You will find yourself in a room containing a single object. Please move around the room to get a good view of the object. When you know what the object is:
- Type the name of the object
- Hit Enter again
Examples of good responses:
- A pair of scissors
- A tennis ball
Please don’t describe the object. Just write down what you see, with an article like ‘a’ or ‘some’ if appropriate.
Examples of bad responses:
- A brown ball
- A large piano with long black legs
- A small thing with lumps on the side
You should not need more than 4-5 different words, and most objects will require just 1 or 2 words to name.
Sometimes, you might be unsure what the object is. If that’s the case, just make your best guess.
6.3.2 Full human instructions
This is a task called Ask to put. You will find yourself in a room. Your job is to imagine giving an instruction to somebody else so that the other person puts the red object on to the white object. Move around the room to get a good look at what the two objects are. When you are ready to give your instruction:
- Type your instruction
- Hit Enter again
Examples of good instructions:
- Place the cup onto the table
- Put the ball on the plate
- Move the pencil onto the box
Words to avoid
Your instruction must not contain words for colours or other properties of the object. Please do not use the words red, white, scarlet, dark, large etc. in your instruction. Instead, refer to objects by their name as you recognise them. If you don’t recognise what an object is, just make your best guess.
Examples of bad instructions:
- Put the red thing on the table
- Put the large object on the small object
- Put the small round thing on the chest
Keep your language varied. Try to use various ways to express your instruction in different episodes to keep things interesting.
6.4 Further environment details
For all experiments, the environment is a Unity room measuring 4 m × 4 m. The walls, floor and ceiling are always the same color, but we add a window and a door (positioned randomly per episode) to give the agent some sense of absolute location.
When importing ShapeNet models, we use the scaling provided in the metadata, unless it is very small (less than 0.000001), in which case we interpret the coordinates in the OBJ file as meters. All objects have rigid bodies with collision meshes generated using Unity's built-in MeshCollider, with convex set to true. The masses of all movable objects are set to 1 kg, so that our avatar has enough strength to pick all of them up (except for beds and trays, which are made kinematic).
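The import-scaling rule above can be sketched as follows (the threshold comes from the text; the function name is an illustrative assumption):

```python
def import_scale(metadata_scale: float, threshold: float = 0.000001) -> float:
    """Use the ShapeNet metadata scale unless it is implausibly small,
    in which case OBJ coordinates are interpreted directly as meters."""
    return 1.0 if metadata_scale < threshold else metadata_scale

normal = import_scale(0.5)   # plausible metadata scale is kept
tiny = import_scale(1e-9)    # implausibly small scale: treat coordinates as meters
```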
When selecting ShapeNet models, for performance reasons we discarded all models with a vertex count higher than 8000, and an OBJ file size greater than 1 MB. The native ShapeNet category names are often not natural everyday names (for instance, they can be highly specific, like "dual shock analog controller"). To mitigate this, we used ShapeNet’s WordNet tags, grouping models into categories according to WordNet synsets and assigning the name of the first synset lemma to the category.
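The filtering and synset-grouping steps above can be sketched as follows (the field names and example records are illustrative assumptions, not ShapeNet's actual metadata schema):

```python
from collections import defaultdict

MAX_VERTICES = 8000
MAX_OBJ_BYTES = 1_000_000  # 1 MB OBJ file size limit

def build_categories(models):
    """models: iterable of dicts with keys 'id', 'vertices', 'obj_bytes',
    'synset', 'first_lemma'. Returns synset -> (category name, model ids)."""
    categories = defaultdict(list)
    names = {}
    for m in models:
        if m["vertices"] > MAX_VERTICES or m["obj_bytes"] > MAX_OBJ_BYTES:
            continue  # discard overly heavy models for performance reasons
        categories[m["synset"]].append(m["id"])
        names.setdefault(m["synset"], m["first_lemma"])  # name = first synset lemma
    return {s: (names[s], ids) for s, ids in categories.items()}

models = [
    {"id": "a", "vertices": 500, "obj_bytes": 10_000,
     "synset": "n3018908", "first_lemma": "chest of drawers"},
    {"id": "b", "vertices": 9000, "obj_bytes": 10_000,
     "synset": "n3018908", "first_lemma": "chest of drawers"},
]
cats = build_categories(models)
```

Grouping by synset rather than by native ShapeNet category name ensures that over-specific labels collapse into natural everyday names.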
Because depth-perception is challenging without binocular vision, the agent is assisted in manipulation by a visual guide (bottom-right) that highlights objects within grasping range and drops a vertical line from held objects.
Full lists of objects and synonyms are on the following pages.
| name | synonym | unique models | example ShapeNet model id | WordNet synset |
| --- | --- | --- | --- | --- |
| chest of drawers | cupboard | 241 | 793aa6d322f1e31d9c75eb4326997fae | n3018908 |
| toilet tissue | toilet paper | 24 | 6658857ea89df65ea35a7666f0cfa5bb | n15099708 |
| table lamp | desk light | 23 | 85b52753cc7e7207cf004563556ddb36 | n4387620 |