Executing natural language navigation instructions from raw observations requires solving language, perception, planning, and control problems. Consider instructing a quadcopter drone using natural language. Figure 2 shows an example instruction. Resolving the instruction requires identifying the blue fence, anvil and tree in the world, understanding the spatial constraints "towards" and "on the right", planning a trajectory that satisfies these constraints, and continuously controlling the quadcopter to follow the trajectory. Existing work has addressed this problem mostly using manually-designed symbolic representations for language meaning and environment [1, 2, 3, 4, 5, 6]. This approach requires significant knowledge representation effort and is hard to scale. Recently, Blukis et al. [7] proposed to trade off the representation design with representation learning. However, their approach was developed using synthetic language only, where a small set of words was combined in a handful of different ways. This does not convey the full complexity of natural language, and may lead to design decisions and performance estimates divorced from the real problem.
In this paper, we study the problem of executing instructions with a realistic quadcopter simulator using a corpus of crowdsourced natural language navigation instructions. Our data and environment combine language and robotic challenges. The instruction language is rich with linguistic phenomena, including object references, co-references within sentences, and spatial and temporal relations; the environment simulator provides a close approximation of realistic quadcopter flight, including a realistic controller that requires rapid decisions in response to continuously changing observations.
We address the complete execution problem with a single model that is decomposed into two stages of planning and plan execution. Figure 2 illustrates the two stages. The first stage takes as input the language and the observations of the agent, and outputs two distributions that aim to solve different challenges: (a) identifying the positions that are likely to be visited during a correct instruction execution, and (b) recognizing the correct goal position. The second stage of the model controls the drone to fly between the high probability positions to complete the task and reach the most likely goal location. The two stages are combined into a single neural network. While the approach does not require designing an intermediate symbolic representation, the agent plan is still interpretable by simple visualization of the distributions over a map.
Our approach introduces two learning challenges: (a) estimate the model parameters with the limited language data available and a realistic number of experiences in the environment; and (b) ensure the different parts of the model specialize to solve their intended tasks. We address both challenges by training each of the two stages separately. We train the visitation prediction stage with supervised learning using expert demonstrations, and the plan execution stage by mapping expert visitation distributions to actions using imitation learning. At test time, the second stage uses the predicted distributions from the first. This learning method emphasizes sample efficiency. The first stage uses supervised learning with training demonstrations; the second stage is independent from the complex language and visual reasoning required, allowing for sample-efficient imitation learning. This approach also does not suffer from the credit assignment problem of training the complete network using rewards on actions only. This ensures the different parts of the network solve their intended task, and the generated interpretable distributions are representative of the agent reasoning.
To evaluate our approach, we adapt the Lani corpus [8] for the realistic quadcopter simulator from Blukis et al. [7], and create a continuous control instruction following benchmark. The Lani corpus includes 27,965 crowdsourced natural language instructions paired with human demonstrations. We compare our approach to the continuous-action analogs of two recently proposed approaches [9, 10], and demonstrate absolute task-completion accuracy improvements of 16.85%. We also discuss a generalization of our position visitation prediction approach to state-visitation distribution prediction for sequential decision processes, and suggest the conditions for applying it to future work on robot learning problems. The models, dataset, and environment are publicly available at https://github.com/clic-lab/drif.
2 Technical Overview
Let $\mathcal{U}$ be the set of natural language instructions, $\mathcal{S}$ be the set of world states, and $\mathcal{A}$ be the set of all actions. An instruction $u \in \mathcal{U}$ is a sequence of $n$ tokens $\langle u_1, \dots, u_n \rangle$. An action $a \in \mathcal{A}$ is either a tuple $(v, \omega)$ of forward and angular velocities or the completion action $\mathtt{STOP}$. The state $s \in \mathcal{S}$ contains information about the current configuration of all objects in the world. Given a start state $s_1 \in \mathcal{S}$ and an instruction $u$, the agent executes $u$ by generating a sequence of actions, where the last action is the special action $\mathtt{STOP}$, which indicates task completion. The agent behavior is determined by its configuration $\rho$. An execution $\Xi$ of length $T$ is a sequence $\langle (s_1, a_1), \dots, (s_T, a_T) \rangle$, where $s_t$ is the state at timestep $t$, $a_t$ is the action updating the agent configuration, and the last action is $a_T = \mathtt{STOP}$. Given an action $a_t = (v_t, \omega_t)$, we set the agent configuration $\rho = (v_t, \omega_t)$, which specifies the controller setpoint. Between actions, the agent maintains its configuration.
The agent does not have access to the world state. At timestep $t$, the agent observes the agent context $c_t = (u, I_1, \dots, I_t, P_1, \dots, P_t)$, where $u$ is the instruction and $I_i$ and $P_i$, $i = 1, \dots, t$, are monocular first-person RGB images and 6-DOF agent poses observed at timestep $i$. The pose $P_i$ is a pair $(p_i, \gamma_i)$, where $p_i$ is a position and $\gamma_i$ is an orientation. Given an agent context $c_t$, we predict two visitation distributions that define a plan to execute and the actions required to execute the plan. A visitation distribution is a discrete distribution over positions in the environment. The trajectory-visitation distribution $d^p$ puts high probability on positions in the environment the agent is likely to go through during execution, and the goal-visitation distribution $d^g$ puts high probability on positions where the agent should $\mathtt{STOP}$ to complete its execution. Given $d^p$ and $d^g$, the second stage of the model predicts the actions to complete the task by going through high probability positions according to the distributions. As the agent observes more of the environment during execution, the distributions are continuously updated.
We assume access to a training set of $N$ examples $\{(u^{(i)}, s_1^{(i)}, \Xi^{(i)})\}_{i=1}^{N}$, where $u^{(i)}$ is an instruction, $s_1^{(i)}$ is a start state, and $\Xi^{(i)}$ is a sequence of positions that defines a trajectory generated from a human demonstration execution of $u^{(i)}$. Learning is decomposed into two stages. We first train the visitation distributions prediction given the visitation distributions inferred from the oracle policy $\pi^*$. We then use imitation learning using $\pi^*$ to generate the sequence of actions required given the visitation distributions.
We evaluate on a test set of $M$ examples $\{(u^{(i)}, s_1^{(i)}, p_g^{(i)})\}_{i=1}^{M}$, where $u^{(i)}$ is an instruction, $s_1^{(i)}$ is a start state, and $p_g^{(i)}$ is the goal position. We consider the task successfully completed if the agent outputs the action $\mathtt{STOP}$ within a predefined Euclidean distance of $p_g^{(i)}$. We additionally evaluate the mean and median Euclidean distance to the goal position.
3 Related Work
Natural language instruction following has been studied extensively on physical robots [11, 12, 13, 2, 14, 15, 16] and simulated agents [17, 18, 19, 3, 20, 21]. These methods require hand-engineering of intermediate symbolic representations, an effort that is hard to scale to complex domains. Our approach does not require a symbolic representation, instead relying on a learned spatial representation induced directly from human demonstrations. Our approach is related to recent work on executing instructions without such symbolic representations using discrete environments [9, 22, 23, 8]. In contrast, we use a continuous environment. While we focus on the challenge of using natural language, this problem was also studied using synthetic language with the goal of abstracting natural language challenges and focusing on navigation [24, 10] and continuous control [7].
Our approach is related to recent work on learning visuomotor control policies for grasping [25, 26, 27], dexterous manipulation [28, 29, 30] and visual navigation [31]. While these methods have mostly focused on learning single robotic tasks, or transferring a single task between multiple domains [30, 32, 33, 34], our aim is to train a model that can execute navigation tasks specified using natural language, including tasks not seen before test time.
Treating planning as prediction of visitation probabilities is related to recent work on neural network models that explicitly construct internal maps [35, 7, 36], incorporate external maps [31, 37], or do planning [32]. These architectures take advantage of domain knowledge to provide sample-efficient training and interpretable representations. In contrast, we cast planning as an image-to-image mapping [38, 39], where the output image is interpreted as a probability distribution over environment locations. Our architecture borrows building blocks from prior work. We use the ResNet architecture for perception and the neural mapping approach of Blukis et al. [7] to construct a dynamic semantic map. We also use the LingUNet conditional image translation module [8]. While it was introduced for first-person goal location prediction, we use it to predict visitation distributions.
Learning from Demonstrations (LfD) approaches have previously decomposed robot learning into learning high-level tasks and low-level skills (e.g., Dynamic Movement Primitives [41, 42, 43, 44]). Our approach follows this general idea. However, instead of using trajectories or probabilities as task representations, we predict visitation distributions using a neural network. This results in a reactive approach that defers planning of the full trajectory and starts task execution under uncertainty that is gradually reduced with additional observations. This approach does not assume access to the full system configuration space or a symbolic environment representation. Furthermore, the learned representation is not constrained to a specific robot. For example, the same predicted visitation distribution could potentially be used on a humanoid or a ground vehicle, each running its own plan execution component.
4 Model
We model the agent behavior using a neural network policy $\pi$. The input to the policy at time $t$ is the agent context $c_t = (u, I_1, \dots, I_t, P_1, \dots, P_t)$, where $u$ is the instruction and $I_i$ and $P_i$, $i = 1, \dots, t$, are first-person images and 6-DOF agent poses observed at timestep $i$ and state $s_i$. The policy outputs an action $(v_t, \omega_t)$, where $v_t$ is a forward velocity and $\omega_t$ is an angular velocity, and a probability $p^{\text{stop}}_t$ for the action $\mathtt{STOP}$. We decompose the policy into visitation prediction and plan execution. Visitation prediction computes a 2D discrete semantic map $S^W_t$. Each position in $S^W_t$ corresponds to an area in the environment, and represents it with a learned vector. The map is used to generate two probability distributions: the trajectory-visitation distribution $d^p_t(\tilde{p} \mid c_t)$ and the goal-visitation distribution $d^g_t(\tilde{p} \mid c_t)$, where $\tilde{p}$ is a position in $S^W_t$. The first distribution models the probability of visiting each position as part of an optimal policy executing the instruction $u$, and the second the probability of each position being the goal where the agent should select the $\mathtt{STOP}$ action. We update the semantic map at every timestep with the latest observations. The distributions $d^p_t$ and $d^g_t$ are only computed every $T_d$ timesteps. When not updating the distributions, we set $d^p_t = d^p_{t-1}$ and $d^g_t = d^g_{t-1}$. This allows for periodic re-planning and limits the computational workload. In the second stage, plan execution generates the action $(v_t, \omega_t)$ and the stop probability $p^{\text{stop}}_t$. Figure 3 illustrates our architecture, and Figure 4 shows example visitation distributions generated by our approach.
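The update cadence described above (map refreshed at every timestep, distributions recomputed only periodically and held in between) can be sketched in a few lines. This is only an illustration: `run_visitation_stage`, `update_map`, `predict`, and `replan_every` are hypothetical names, and the two callables stand in for the learned mapping and prediction components.

```python
def run_visitation_stage(observations, replan_every, update_map, predict):
    """Update the map every step; recompute the plan every `replan_every` steps."""
    semantic_map = None
    plan = None
    plans = []
    for t, observation in enumerate(observations):
        # The semantic map integrates the newest observation at every timestep.
        semantic_map = update_map(semantic_map, observation)
        # The visitation distributions are recomputed only periodically;
        # otherwise the previous plan is carried over unchanged.
        if t % replan_every == 0:
            plan = predict(semantic_map)
        plans.append(plan)
    return plans
```

With a toy "map" that simply counts observations, `run_visitation_stage([1] * 7, 3, lambda m, o: (m or 0) + o, lambda m: m)` holds each plan for three steps before refreshing it.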
4.1 Stage 1: Visitation Prediction
Feature Projection and Semantic Mapping
We predict the visitation distributions over a learned semantic map of the environment. We construct the map using the method of Blukis et al. [7]. The full details of the process are specified in the original paper. Roughly speaking, the semantic mapping process includes three steps: feature extraction, projection, and accumulation. At timestep $t$, we process the currently observed image $I_t$ using a 13-layer residual neural network ResNet to generate a feature map $F^C_t = \mathrm{ResNet}(I_t)$ of size $W_f \times H_f \times C$. We compute a feature map $F^W_t$ in the world coordinate frame by projecting $F^C_t$ with a pinhole camera model onto the ground plane at elevation zero. The semantic map of the environment $S^W_t$ at time $t$ is an integration of $F^W_t$ and $S^W_{t-1}$, the map from the previous timestep. The integration equation is given in Section 4c in Blukis et al. [7]. This process generates a tensor $S^W_t$ of size $W_w \times H_w \times C$ that represents a map, where each location $[S^W_t]_{(x,y)}$ is a $C$-dimensional feature vector computed from all past observations, each processed to learned features $F^C$ and projected onto the environment ground in the world frame at coordinates $(x, y)$. This map maintains a learned high-level representation for every world location that has been visible in any of the previously observed images. We define the world coordinate frame using the agent starting pose $P_1$: the agent position is the coordinates $(0, 0)$, and the positive direction of the $x$-axis is along the agent heading. This gives consistent meaning to spatial language, such as "turn left" or "pass on the left side of".
Position Visitation Distribution Prediction
We use image generation to predict the visitation distributions $d^p_t$ and $d^g_t$. For each of the two distributions, we generate a matrix of dimension $W_w \times H_w$, the height and width dimensions of the semantic map $S^W_t$, and normalize the values to compute the distribution. To generate these matrices we use LingUNet, a language-conditioned image-to-image encoder-decoder architecture [8].
The input to LingUNet is the semantic map $S^W_t$ and a grounding map $R^W_t$ that incorporates the instruction $u$ into the semantic map. We create $R^W_t$ with a $1 \times 1$ convolution $R^W_t = S^W_t \circledast K_G$. The kernel $K_G$ is computed using a learned linear transformation $K_G = W_G \mathbf{u} + b_G$, where $\mathbf{u}$ is the instruction embedding. The grounding map $R^W_t$ has the same height and width as $S^W_t$, and during training we optimize the parameters so it captures the objects mentioned in the instruction $u$ (Section 5).
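As a rough illustration of the language-conditioned 1×1 convolution described above, the sketch below computes a kernel from an instruction embedding with an affine map and applies it independently at every map cell. It assumes the kernel is the reshaped output of that affine map; all function names, the nested-list tensor layout, and the argument `c_out` are hypothetical choices for this example and are not taken from the released code.

```python
def linear(weights, bias, vec):
    """Affine map: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def conv1x1(feature_map, kernel):
    """Apply a C_out x C_in kernel to the feature vector of every cell.

    feature_map[x][y] is a list of C_in channel values.
    """
    return [[[sum(k * f for k, f in zip(kernel_row, cell))
              for kernel_row in kernel]
             for cell in column]
            for column in feature_map]


def grounding_map(semantic_map, embedding, weights, bias, c_out):
    """Build the kernel from the instruction embedding, then filter the map."""
    flat = linear(weights, bias, embedding)
    c_in = len(flat) // c_out
    # Assumption: the flat affine output is reshaped row-major into the kernel.
    kernel = [flat[i * c_in:(i + 1) * c_in] for i in range(c_out)]
    return conv1x1(semantic_map, kernel)
```

Because the convolution is 1×1, the output keeps the height and width of the input map and only the channel dimension changes.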
LingUNet uses a series of convolution and deconvolution operations. The input map $F_0 = [S^W_t, R^W_t]$ is processed through $L$ cascaded convolutional layers to generate a sequence of feature maps $F_k = \mathrm{CNN}_k(F_{k-1})$, $k = 1, \dots, L$, where $[\cdot, \cdot]$ denotes concatenation along the channel dimension. Each $F_k$ is filtered with a $1 \times 1$ convolution with weights $K_k$. The kernels $K_k$ are computed from the instruction embedding $\mathbf{u}$ using a learned linear transformation $K_k = W^u_k \mathbf{u} + b^u_k$. This generates $L$ language-conditioned feature maps $G_k = F_k \circledast K_k$, $k = 1, \dots, L$. A series of $L$ deconvolution operations computes $L$ feature maps of increasing size:

$$H_k = \begin{cases} \mathrm{DECONV}_k([H_{k+1}, G_k]) & \text{if } 1 \le k \le L - 1 \\ \mathrm{DECONV}_k(G_k) & \text{if } k = L \end{cases}$$

The output of LingUNet is $H_1$, which is of size $W_w \times H_w \times 2$. The full details of LingUNet are specified in Misra et al. [8]. We apply a softmax operation on each channel of $H_1$ separately to generate the trajectory-visitation distribution $d^p_t$ and the goal-visitation distribution $d^g_t$. In Section 5, we describe how we estimate the parameters to ensure that $d^p_t$ and $d^g_t$ model the visitation distributions.
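The per-channel normalization described above can be illustrated with a small standard-library sketch. It assumes the network output is available as a nested list indexed `[channel][row][col]` with exactly two channels; the function names are hypothetical.

```python
import math


def spatial_softmax(scores):
    """Normalize a 2D grid of scores into a distribution over grid positions."""
    peak = max(max(row) for row in scores)  # subtract the max for stability
    exps = [[math.exp(value - peak) for value in row] for row in scores]
    total = sum(sum(row) for row in exps)
    return [[value / total for value in row] for row in exps]


def visitation_distributions(output):
    """Split a two-channel output into (trajectory, goal) distributions.

    Each channel is normalized separately, so each sums to one on its own.
    """
    trajectory_scores, goal_scores = output
    return spatial_softmax(trajectory_scores), spatial_softmax(goal_scores)
```

Normalizing the channels separately means the two distributions do not compete for probability mass: a position can be likely both as a waypoint and as a goal.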
4.2 Stage 2: Plan Execution
The action generation component generates the action values from the two visitation distributions $d^p_t$ and $d^g_t$ and the current agent pose $P_t$. We first perform an affine transformation of the most recent visitation distributions to align them with the current agent egocentric reference frame as defined by its pose $P_t$, and crop a $K \times K$ region centered around the agent's position. We fill the positions outside the semantic map with zeros. We flatten and concatenate the cropped regions of the distributions into a single vector $x$ of size $2K^2$, and compute the action $(v_t, \omega_t)$ and the stop probability $p^{\text{stop}}_t$ from $x$ with a feed-forward network.
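The alignment-and-crop step can be sketched as follows. This is a simplified illustration rather than the paper's implementation: it assumes grids indexed `[x][y]`, a planar pose given as an integer cell position plus a yaw angle in radians (counterclockwise from the +x axis), and nearest-neighbor sampling in place of a proper affine image warp. The names `egocentric_crop` and `plan_features` are hypothetical.

```python
import math


def egocentric_crop(grid, agent_xy, yaw, size):
    """Sample a size x size window around the agent, rotated to its heading.

    crop[i][j] holds the value at egocentric offset (i - half, j - half),
    where the first offset points along the agent heading.
    Cells that fall outside the map are filled with zeros.
    """
    half = size // 2
    ax, ay = agent_xy
    cos_yaw, sin_yaw = math.cos(yaw), math.sin(yaw)
    crop = []
    for i in range(size):
        ex = i - half
        row = []
        for j in range(size):
            ey = j - half
            # Rotate the egocentric offset into the world frame and translate.
            wx = round(ax + ex * cos_yaw - ey * sin_yaw)
            wy = round(ay + ex * sin_yaw + ey * cos_yaw)
            inside = 0 <= wx < len(grid) and 0 <= wy < len(grid[0])
            row.append(grid[wx][wy] if inside else 0.0)
        crop.append(row)
    return crop


def plan_features(trajectory_dist, goal_dist, agent_xy, yaw, size):
    """Flatten and concatenate both crops into one vector of length 2 * size**2."""
    features = []
    for grid in (trajectory_dist, goal_dist):
        for row in egocentric_crop(grid, agent_xy, yaw, size):
            features.extend(row)
    return features
```

The resulting vector depends only on where probability mass lies relative to the agent, which is what lets the control stage ignore language and vision entirely.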
5 Learning
Our model parameters can be divided into two groups. The visitation prediction parameters include the parameters of the LSTM and ResNet, the grounding kernel transformation $W_G$, $b_G$, and the components of LingUNet: $\mathrm{CNN}_k$, $W^u_k$, $b^u_k$, and $\mathrm{DECONV}_k$. The plan execution parameters are the parameters of the feed-forward network. We use supervised learning to estimate the visitation prediction parameters and imitation learning for the plan execution parameters.
Estimating Visitation Prediction Parameters
We assume access to training examples $(u, s_1, \Xi)$, where $u$ is an instruction, $s_1$ is a start state, and $\Xi$ is a sequence of positions (to simplify notation, we describe learning for a single example). We convert the sequence $\Xi$ to a sequence of positions $\tilde{\Xi}$ in the semantic map $S^W$. We generate the expert trajectory-visitation distribution $d^p_*$ by assigning high probability for positions around the demonstration trajectory, and the goal-visitation distribution $d^g_*$ by assigning high probability around the goal position $\tilde{p}_g$. For each location $\tilde{p}$ in the semantic map, we calculate the probability of visiting and stopping there as:

$$d^p_*(\tilde{p}) = \frac{1}{Z^p} \sum_{\tilde{p}' \in \tilde{\Xi}} g(\tilde{p} \mid \tilde{p}', \sigma), \qquad d^g_*(\tilde{p}) = \frac{1}{Z^g}\, g(\tilde{p} \mid \tilde{p}_g, \sigma),$$

where $g(\cdot \mid \mu, \sigma)$ is a Gaussian probability density function with mean $\mu$ and variance $\sigma$, and $Z^p$ and $Z^g$ are normalization terms. The distributions are computed efficiently by applying a Gaussian filter on an image of the human trajectory. We then generate a sequence of agent contexts $c_t$ by executing an oracle policy $\pi^*$, which is implemented with a simple control rule that steers the quadcopter along the human demonstration trajectory $\Xi$. We create a training example for each timestep in the oracle trajectory when we compute the visitation distributions, and minimize the KL divergence between the expert and predicted distributions: $D_{KL}(d^p_* \,\|\, d^p) + D_{KL}(d^g_* \,\|\, d^g)$. The data and objective do not consider the incremental update of the distributions, and we always optimize towards the full visitation distributions.
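To make the construction of the expert distributions concrete, the sketch below sums one isotropic Gaussian bump per demonstrated map position, normalizes the grid, and evaluates a KL divergence against a prediction. It is a direct, unoptimized version of the idea (the text notes that a Gaussian image filter is used in practice). The parameter `sigma` is treated here as a standard deviation in grid cells, the small `eps` guard is added for numerical safety, and the function names are hypothetical.

```python
import math


def gaussian_visitation(positions, width, height, sigma):
    """Normalized sum of Gaussian bumps centered on the given grid positions."""
    grid = [[0.0] * height for _ in range(width)]
    for px, py in positions:
        for x in range(width):
            for y in range(height):
                squared = (x - px) ** 2 + (y - py) ** 2
                # The Gaussian's constant factor cancels in the normalization.
                grid[x][y] += math.exp(-squared / (2.0 * sigma ** 2))
    total = sum(map(sum, grid))
    return [[value / total for value in column] for column in grid]


def kl_divergence(expert, predicted, eps=1e-12):
    """KL(expert || predicted), summed over all grid cells."""
    return sum(e * math.log((e + eps) / (q + eps))
               for expert_col, predicted_col in zip(expert, predicted)
               for e, q in zip(expert_col, predicted_col)
               if e > 0.0)
```

For a demonstration path `path`, the trajectory target would be `gaussian_visitation(path, W, H, sigma)` and the goal target `gaussian_visitation([path[-1]], W, H, sigma)`, taking the last demonstrated position as the goal.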
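We additionally use three auxiliary loss functions from Blukis et al. [7] to bias the different components in the model to specialize as intended: (a) the object recognition loss $J_{\text{percept}}$ to classify visible objects using their corresponding positions in the semantic map; (b) the grounding loss $J_{\text{ground}}$ to classify if a visible object in the semantic map is mentioned in the instruction $u$; and (c) the language loss $J_{\text{lang}}$ to classify if objects are mentioned in the instruction $u$. To compute $J_{\text{ground}}$ and $J_{\text{lang}}$, we use alignments between words and object labels that we heuristically extract from the training data using pointwise mutual information. Please refer to the supplementary material for full details.
The complete objective for an example at time $t$ is:

$$J = D_{KL}(d^p_* \,\|\, d^p) + D_{KL}(d^g_* \,\|\, d^g) + \lambda_{\text{percept}} J_{\text{percept}} + \lambda_{\text{ground}} J_{\text{ground}} + \lambda_{\text{lang}} J_{\text{lang}},$$

where the first two terms are the KL divergences between the expert distributions $d^p_*$, $d^g_*$ and the predicted distributions $d^p$, $d^g$, and each $\lambda_{(\cdot)}$ is a hyperparameter weighting the contribution of the corresponding auxiliary loss.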
Estimating Plan Execution Parameters
We train the plan execution stage using imitation learning with the oracle policy $\pi^*$. During imitation learning, we use the visitation distributions $d^p_*$ and $d^g_*$ induced from the human demonstrations. This provides the model access to the same information that guides the oracle policy, which it learns to imitate. We use DAggerFM [7], a variant of DAgger for low-memory usage. DAggerFM performs multiple iterations of training. For each iteration and each training example $(u, s_1, \Xi)$, we generate an execution using a mixture policy. At each timestep, the mixture policy selects an action using either the oracle $\pi^*$ or the learned policy, with the mixing probability set by a hyperparameter. The states generated in the execution are aggregated in a dataset across iterations. After each iteration, we prune the dataset to a fixed size and perform one epoch of supervised learning. We use a binary cross-entropy loss for the $\mathtt{STOP}$ probability $p^{\text{stop}}_t$, and a mean-squared-error loss for the velocities. When the oracle selects $\mathtt{STOP}$, both velocities are zero. We initialize imitation learning with supervised learning using the oracle policy trajectories.
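One data-collection iteration of a DAgger-style loop with a mixture policy can be sketched as below. Several details are assumptions made for illustration and are not taken from DAggerFM: every visited state is labeled with the oracle action (as in standard DAgger), `beta` is the probability of executing the oracle action in this iteration, pruning keeps a uniform random subset, and the environment is abstracted as a `step(state, action)` function. All names are hypothetical.

```python
import random


def dagger_iteration(dataset, start_states, step, oracle, learned,
                     beta, horizon, max_size, rng):
    """Roll out a mixture policy, aggregate oracle-labeled states, then prune."""
    for state in start_states:
        for _ in range(horizon):
            # Supervision always comes from the oracle (assumption: standard DAgger).
            dataset.append((state, oracle(state)))
            # The executed action comes from the oracle with probability beta,
            # otherwise from the current learned policy.
            use_oracle = rng.random() < beta
            action = oracle(state) if use_oracle else learned(state)
            state = step(state, action)
    # Keep memory bounded; random subsampling is an assumption of this sketch.
    if len(dataset) > max_size:
        dataset[:] = rng.sample(dataset, max_size)
    return dataset
```

After each call, one epoch of supervised learning on `dataset` would update `learned` before the next iteration. Executing the learned policy during collection is what exposes the model to the states it will actually reach at test time.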
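Our approach is an instance of learning state-visitation distributions in Markov Decision Processes (MDPs). Consider an MDP $(\mathcal{S}, \mathcal{A}, R, P, H, \mu)$, where $\mathcal{S}$ is a set of states, $\mathcal{A}$ is a set of actions, $R$ is a reward function, $P : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ is a probabilistic transition function, $H$ is the time horizon, and $\mu$ is the start-state distribution ($\Delta(\cdot)$ denotes a probability distribution). The state-visitation distribution of a policy $\pi$ is defined as $d(s; \pi, \mu) = \frac{1}{H} \sum_{t=1}^{H} P(s_t = s \mid \pi, \mu)$, where $P(s_t = s \mid \pi, \mu)$ is the probability of visiting state $s$ at time $t$ following policy $\pi$ with the initial state-distribution $\mu$.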
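For a small finite MDP, a state-visitation distribution can be computed exactly by propagating the state distribution forward in time. The sketch below assumes the common time-averaged definition, in which the visitation probability of a state is the average over the horizon of the probability of being in that state at each timestep. It also assumes the policy has already been folded into a single state-to-state transition matrix. The function name is hypothetical.

```python
def state_visitation(start_dist, transition, horizon):
    """Time-averaged state-visitation distribution under a fixed policy.

    start_dist[s]     : probability of starting in state s
    transition[s][n]  : probability of moving from s to n under the policy
    """
    num_states = len(start_dist)
    current = list(start_dist)
    visitation = [0.0] * num_states
    for _ in range(horizon):
        # Accumulate the probability of being in each state at this timestep.
        for s in range(num_states):
            visitation[s] += current[s] / horizon
        # Push the state distribution one step forward.
        current = [sum(current[s] * transition[s][n] for s in range(num_states))
                   for n in range(num_states)]
    return visitation
```

For example, with two states where the agent starts in state 0, moves to state 1, and stays there, a horizon of 4 gives a visitation distribution of `[0.25, 0.75]`.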
Reasoning about the entire state space $\mathcal{S}$ is challenging. Instead, we consider an alternative discrete state space $\tilde{\mathcal{S}}$ with a mapping $\phi : \mathcal{S} \to \tilde{\mathcal{S}}$ and a reward function $\tilde{R}$. For example, in a robot navigation scenario, $\tilde{\mathcal{S}}$ can be the robot pose estimate $P$, or the positions in our semantic map $S^W$. In a manipulation setup, $\tilde{\mathcal{S}}$ can be the manipulator configuration. This choice is task-specific, but should include variables that are measurable and relevant to task completion. The state-visitation distribution in $\tilde{\mathcal{S}}$ is $\tilde{d}(\tilde{s}; \pi, \mu)$. In general, we construct $\tilde{\mathcal{S}}$ as a small set to support efficient computation of the visitation distribution, and enable our two-stage learning. In the first stage, we train a visitation model to predict the visitation distribution for the oracle policy $\pi^*$, and in the second stage, we learn a plan execution model from the oracle visitation distribution using imitation learning.
There is a strong relation between learning the state distribution and policy learning. For predicted visitation distributions with a bounded error in regard to the optimal visitation distribution, the sub-optimality error of policies that accurately follow the predicted distribution is bounded as well:
Suppose and let be the set of all policies whose approximate state-visitation distribution has at maximum KL divergence from . Assume that for every there holds . Then:
6 Experimental Setup
Data and Environments
We evaluate our approach on the Lani corpus [8]. Lani contains crowd-sourced instructions for navigation in an open environment. Each datapoint includes an instruction, a human-annotated ground-truth demonstration trajectory, and an environment with various landmarks and lakes. The dataset train/dev/test split is 19,758/4,135/4,072. Each environment specification defines placement of 6–13 landmarks within a square grass field of size 50 m × 50 m. We use the quadcopter simulator environment from Blukis et al. [7] based on the Unreal Engine (https://www.unrealengine.com/), which uses the AirSim plugin to simulate realistic quadcopter dynamics.
We create additional data for visitation prediction learning by rotating the semantic map and the gold trajectory-visitation and goal-visitation distributions by a random angle. This allows the agent to generalize beyond the common behavior of heading towards the object in front.
We measure the stopping distance of the agent from the goal as $\lVert p_T - p_g \rVert_2$, where $p_g$ is the end-point of the human annotated demonstration and $p_T$ is the position where the agent output the $\mathtt{STOP}$ action. A task is completed successfully if the stopping distance is less than 5.0 m, 10% of the environment edge-length. We also report the average and median stopping distance.
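The three reported metrics can be computed with a few lines of standard-library Python. This is a minimal sketch: the function name and the dictionary keys are hypothetical, positions are assumed to be 2D coordinates in meters, and treating the success threshold as a strict inequality is an assumption about the boundary case.

```python
import math
import statistics


def evaluate_stops(stop_positions, goal_positions, threshold):
    """Success rate plus mean and median Euclidean stopping distance."""
    distances = [math.dist(stop, goal)
                 for stop, goal in zip(stop_positions, goal_positions)]
    successes = sum(1 for d in distances if d < threshold)
    return {
        "success_rate": successes / len(distances),
        "mean_distance": statistics.mean(distances),
        "median_distance": statistics.median(distances),
    }
```

With a 50 m field edge, the threshold would be `0.1 * 50.0`. Reporting the median alongside the mean helps separate a few badly failed executions from a general loss of precision.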
We compare our Position-visitation Network (PVN) approach to the Chaplot [10] and GSMN [7] approaches. Chaplot is an instruction following model that makes use of gated attention. Similar to our approach, GSMN builds a semantic map, but uses simple language-derived convolutional filters to infer the goal location instead of computing visitation probabilities. We also report Oracle performance as an upper bound and two trivial baselines: (a) Stop: stop immediately; and (b) Average: fly forward for the average number of steps () with the average velocity (), both computed with the Oracle policy from the training data. Hyperparameter settings are provided in the supplementary material.
Table 1 shows the performance on the test set and our ablations on the development set. The low performance of the Stop and Average baselines shows the difficulty of the problem. Our full model PVN demonstrates absolute task-completion improvement of 16.85% over the second-best system (GSMN), and a relative improvement of 12.7% on average stopping distance and 32.3% on the median stopping distance. The relatively low performance of GSMN compared to previous results with the same environment but synthetic language [7], an accuracy drop of 54.8, illustrates the challenges introduced by natural language. The performance of Chaplot similarly degrades by 9.6 accuracy points compared to previously reported results on the same corpus but with a discrete environment [8]. This demonstrates the challenges introduced by a realistic simulation.
Our ablations show that all components of the method contribute to its performance. Removing the auxiliary objectives (PVN no aux) or the goal-distribution prediction to rely only on the trajectory-visitation distribution (PVN no $d^g$) both lower performance significantly. While using imitation learning shows a significant benefit, model performance degradation is less pronounced when only using supervised learning for the second stage (PVN no DAgger). The low performance of the model without access to the instruction (PVN no $u$) illustrates that our model makes effective use of the input language. Figure 5 shows example trajectories executed by our model, illustrating the ability to reason about spatial language. The supplementary material includes more examples.
We evaluate the quality of the goal-visitation distribution with an ideal plan execution model that stops perfectly at the most likely predicted stopping position $\arg\max_{\tilde{p}} d^g(\tilde{p})$. The performance increase from using a perfect goal-visitation distribution with our model (PVN ideal Act) illustrates the improvement that could be achieved by a better plan execution policy. We observe a more drastic improvement with full observability (PVN full obs), where the input image is set to the top-down view of the environment. This suggests the model architecture is capable of significantly higher performance with improved exploration and mapping.
Finally, we do initial tests for model robustness against test-time variations. We test for visual differences by flying at 2.5m (PVN ), half the training height (5.0m). We test for dynamic differences by doubling the angular velocity during testing for every output action (PVN ). In both cases, the difference in model performance is relatively small, revealing the robustness of a modular approach to small visual and dynamics differences.
We study the problem of mapping natural language instructions and raw observations to continuous control of a quadcopter drone. Our approach is tailored for navigation. We design a model that enables interpretable visualization of the agent plans, and a learning method optimized for sample efficiency. Our modular approach is suitable for related tasks with different robotics agents. However, the effectiveness of our mapping mechanism with limited visibility, for example with a ground robot, remains to be tested empirically in future work. Investigating the generalization of our visitation prediction approach to other tasks also remains an important direction for future work.
This research was supported by Schmidt Sciences, NSF award CAREER-1750499, AFOSR award FA9550-17-1-0109, the Amazon Research Awards program, and cloud computing credits from Amazon. We thank the anonymous reviewers for their helpful comments.
- Huang et al.  A. S. Huang, S. Tellex, A. Bachrach, T. Kollar, D. Roy, and N. Roy. Natural language command of an autonomous micro-air vehicle. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 2010.
- Tellex et al.  S. Tellex, T. Kollar, S. Dickerson, M. R. Walter, A. Gopal Banerjee, S. Teller, and N. Roy. Approaching the Symbol Grounding Problem with Probabilistic Graphical Models. AI Magazine, 2011.
- Matuszek et al.  C. Matuszek, E. Herbst, L. Zettlemoyer, and D. Fox. Learning to parse natural language commands to a robot control system. In International Symposium on Experimental Robotics, 2012.
Thomason et al. 
J. Thomason, S. Zhang, R. J. Mooney, and P. Stone.
Learning to interpret natural language commands through human-robot
International Joint Conferences on Artificial Intelligence, 2015.
- Arumugam et al.  D. Arumugam, S. Karamcheti, N. Gopalan, L. L. Wong, and S. Tellex. Accurately and efficiently interpreting human-robot instructions of varying granularities. In Robotics: Science and Systems, 2017.
- Gopalan et al.  N. Gopalan, D. Arumugam, L. L. Wong, and S. Tellex. Sequence-to-sequence language grounding of non-markovian task specifications. In Robotics: Science and Systems, 2018.
- Blukis et al.  V. Blukis, N. Brukhim, A. Bennet, R. Knepper, and Y. Artzi. Following high-level navigation instructions on a simulated quadcopter with imitation learning. In Robotics: Science and Systems, 2018.
Misra et al. 
D. Misra, A. Bennett, V. Blukis, E. Niklasson, M. Shatkin, and Y. Artzi.
Mapping instructions to actions in 3D environments with visual goal
Conference on Empirical Methods in Natural Language Processing, 2018.
Misra et al. 
D. Misra, J. Langford, and Y. Artzi.
Mapping instructions and visual observations to actions with reinforcement learning.In Conference on Empirical Methods in Natural Language Processing, 2017.
- Chaplot et al.  D. S. Chaplot, K. M. Sathyendra, R. K. Pasumarthi, D. Rajagopal, and R. Salakhutdinov. Gated-attention architectures for task-oriented language grounding. AAAI Conference on Artificial Intelligence, 2018.
Matuszek et al. 
C. Matuszek, N. FitzGerald, L. Zettlemoyer, L. Bo, and D. Fox.
A Joint Model of Language and Perception for Grounded Attribute
International Conference on Machine Learning, 2012.
- Duvallet et al.  F. Duvallet, T. Kollar, and A. Stentz. Imitation learning for natural language direction following through unknown environments. In IEEE International Conference on Robotics and Automation, 2013.
- Walter et al.  M. R. Walter, S. Hemachandra, B. Homberg, S. Tellex, and S. Teller. Learning Semantic Maps from Natural Language Descriptions. In Robotics: Science and Systems, 2013.
- Misra et al.  D. K. Misra, J. Sung, K. Lee, and A. Saxena. Tell me dave: Context-sensitive grounding of natural language to mobile manipulation instructions. In Robotics: Science and Systems, 2014.
- Hemachandra et al.  S. Hemachandra, F. Duvallet, T. M. Howard, N. Roy, A. Stentz, and M. R. Walter. Learning models for following natural language directions in unknown environments. In IEEE International Conference on Robotics and Automation, 2015.
- Knepper et al.  R. A. Knepper, S. Tellex, A. Li, N. Roy, and D. Rus. Recovering from Failure by Asking for Help. Autonomous Robots, 2015.
- MacMahon et al.  M. MacMahon, B. Stankiewicz, and B. Kuipers. Walk the talk: Connecting language, knowledge, and action in route instructions. In AAAI Conference on Artificial Intelligence, 2006.
- Branavan et al.  S. R. K. Branavan, L. S. Zettlemoyer, and R. Barzilay. Reading between the lines: Learning to map high-level instructions to commands. In Annual Meeting of the Association for Computational Linguistics, 2010.
- Matuszek et al.  C. Matuszek, D. Fox, and K. Koscher. Following directions using statistical machine translation. In International Conference on Human-Robot Interaction, 2010.
- Artzi and Zettlemoyer  Y. Artzi and L. Zettlemoyer. Weakly supervised learning of semantic parsers for mapping instructions to actions. Transactions of the Association for Computational Linguistics, 2013.
- Suhr and Artzi  A. Suhr and Y. Artzi. Situated mapping of sequential instructions to actions with single-step reward observation. In Annual Meeting of the Association for Computational Linguistics, 2018.
- Shah et al.  P. Shah, M. Fiser, A. Faust, J. C. Kew, and D. Hakkani-Tur. Follownet: Robot navigation by following natural language directions with deep reinforcement learning. arXiv preprint arXiv:1805.06150, 2018.
- Anderson et al.  P. Anderson, Q. Wu, D. Teney, J. Bruce, M. Johnson, N. Sünderhauf, I. Reid, S. Gould, and A. v. d. Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. arXiv preprint arXiv:1711.07280, 2017.
- Hermann et al.  K. M. Hermann, F. Hill, S. Green, F. Wang, R. Faulkner, H. Soyer, D. Szepesvari, W. Czarnecki, M. Jaderberg, D. Teplyashin, et al. Grounded language learning in a simulated 3d world. arXiv preprint arXiv:1706.06551, 2017.
- Lenz et al.  I. Lenz, H. Lee, and A. Saxena. Deep learning for detecting robotic grasps. The International Journal of Robotics Research, 2015.
- Levine et al.  S. Levine, P. Pastor, A. Krizhevsky, and D. Quillen. Learning hand-eye coordination for robotic grasping with large-scale data collection. In International Symposium on Experimental Robotics, 2016.
- Quillen et al.  D. Quillen, E. Jang, O. Nachum, C. Finn, J. Ibarz, and S. Levine. Deep reinforcement learning for vision-based robotic grasping: A simulated comparative evaluation of off-policy methods. IEEE International Conference on Robotics and Automation, 2018.
- Levine et al.  S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 2016.
- Nair et al.  A. Nair, D. Chen, P. Agrawal, P. Isola, P. Abbeel, J. Malik, and S. Levine. Combining self-supervised learning and imitation for vision-based rope manipulation. In IEEE International Conference on Robotics and Automation, 2017.
- Tobin et al.  J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 2017.
- Bhatti et al.  S. Bhatti, A. Desmaison, O. Miksik, N. Nardelli, N. Siddharth, and P. H. Torr. Playing doom with slam-augmented deep reinforcement learning. arXiv preprint arXiv:1612.00380, 2016.
- Srinivas et al.  A. Srinivas, A. Jabri, P. Abbeel, S. Levine, and C. Finn. Universal planning networks. International Conference on Machine Learning, 2018.
- Bousmalis et al.  K. Bousmalis, A. Irpan, P. Wohlhart, Y. Bai, M. Kelcey, M. Kalakrishnan, L. Downs, J. Ibarz, P. Pastor, K. Konolige, et al. Using simulation and domain adaptation to improve efficiency of deep robotic grasping. IEEE International Conference on Robotics and Automation, 2018.
- Tan et al.  J. Tan, T. Zhang, E. Coumans, A. Iscen, Y. Bai, D. Hafner, S. Bohez, and V. Vanhoucke. Sim-to-real: Learning agile locomotion for quadruped robots. Robotics: Science and Systems, 2018.
- Gupta et al.  S. Gupta, J. Davidson, S. Levine, R. Sukthankar, and J. Malik. Cognitive mapping and planning for visual navigation. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
- Khan et al.  A. Khan, C. Zhang, N. Atanasov, K. Karydis, V. Kumar, and D. D. Lee. Memory augmented control networks. In International Conference on Learning Representations, 2018.
- Savinov et al.  N. Savinov, A. Dosovitskiy, and V. Koltun. Semi-parametric topological memory for navigation. International Conference on Learning Representations, 2018.
- Ronneberger et al.  O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, 2015.
- Zhu et al.  J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In International Conference on Computer Vision, 2017.
- He et al.  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
- Pastor et al.  P. Pastor, H. Hoffmann, T. Asfour, and S. Schaal. Learning and generalization of motor skills by learning from demonstration. In IEEE International Conference on Robotics and Automation, 2009.
- Pastor et al.  P. Pastor, M. Kalakrishnan, S. Chitta, E. Theodorou, and S. Schaal. Skill learning and task outcome prediction for manipulation. In IEEE International Conference on Robotics and Automation, 2011.
- Konidaris et al.  G. Konidaris, S. Kuindersma, R. Grupen, and A. Barto. Robot learning from demonstration by constructing skill trees. The International Journal of Robotics Research, 2012.
- Maeda et al.  G. Maeda, M. Ewerton, T. Osa, B. Busch, and J. Peters. Active incremental learning of robot movement primitives. In Conference on Robot Learning, 2017.
- Paraschos et al.  A. Paraschos, C. Daniel, J. R. Peters, and G. Neumann. Probabilistic movement primitives. In Advances in Neural Information Processing Systems, 2013.
- Hochreiter and Schmidhuber  S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 1997.
- Maas et al.  A. L. Maas, A. Y. Hannun, and A. Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In International Conference on Machine Learning, 2013.
- Ross et al.  S. Ross, G. Gordon, and D. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In International conference on artificial intelligence and statistics, 2011.
- Shah et al.  S. Shah, D. Dey, C. Lovett, and A. Kapoor. Airsim: High-fidelity visual and physical simulation for autonomous vehicles. In Field and Service Robotics, 2017.
Appendix A Details on Auxiliary Objectives
We use three additive auxiliary objectives to help the different components of the model specialize as intended with a limited amount of training data.
Object Recognition Loss
The object-recognition objective ensures that the semantic map stores information about the locations and identities of objects. At each timestep, for every object that is visible in the first-person image, we classify the element of the semantic map that corresponds to the object's location in the world. We apply a linear softmax classifier to every semantic map element that spatially corresponds to the center of an object. At a given timestep, the classifier loss is the cross-entropy between the true class label of each object and the predicted class probabilities, computed over the set of objects visible in the image.
Grounding Loss
For every object visible in the first-person image, we take the feature vector of the grounding map at the object's location in the world and apply a linear softmax classifier to predict whether the object was mentioned in the instruction. The objective is the cross-entropy between a 0/1-valued label indicating whether the object o was mentioned in the instruction and the corresponding model prediction, computed over the set of objects visible in the image.
Instruction-Mention Loss
The instruction-mention auxiliary objective uses a classifier similar to that of the grounding loss. Given the instruction embedding, we predict for each of the 63 possible objects whether it was mentioned in the instruction. The objective is the cross-entropy between the same 0/1-valued mention labels and the corresponding model predictions, computed over all 63 objects.
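As a minimal sketch of how three such objectives could be computed and added together, the snippet below assumes a mean categorical cross-entropy for object recognition and a mean binary cross-entropy for the two mention classifiers. The function names, the averaging over objects, and the `weights` argument are illustrative assumptions and are not taken from the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def object_recognition_loss(class_logits, true_classes):
    """Mean categorical cross-entropy over the visible objects.

    class_logits: one list of class scores per visible object (hypothetical input).
    true_classes: one integer class index per visible object.
    """
    if not true_classes:
        return 0.0  # no visible objects, so nothing to classify
    losses = [-math.log(softmax(logits)[c])
              for logits, c in zip(class_logits, true_classes)]
    return sum(losses) / len(losses)


def mention_loss(probs, labels, eps=1e-12):
    """Mean binary cross-entropy between mention probabilities and 0/1 labels."""
    if not labels:
        return 0.0
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


def auxiliary_loss(percept, ground, lang, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three auxiliary objectives (weights are placeholders)."""
    return weights[0] * percept + weights[1] * ground + weights[2] * lang
```

The same `mention_loss` function serves both the grounding objective (applied to the visible objects) and the instruction-mention objective (applied to all possible objects), which mirrors the statement that the two use similar classifiers.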
Appendix B Automatic Word-object Alignment Extraction
To infer whether an object was mentioned in the instruction, we use word-object alignments extracted automatically from the dataset. For each object and word type, we consider three events: the object occurs within 15 meters of the human-demonstration trajectory, the word type occurs in the instruction, and both occur simultaneously. We compute the pointwise mutual information between the object event and the word event over the training set, with probabilities estimated from counts over training examples. The output set of word-object alignments consists of the object-word pairs selected by two threshold hyperparameters.
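The extraction procedure can be sketched as follows, assuming the standard definition of pointwise mutual information, log P(a, b) / (P(a) P(b)). The function name, the input format, and the roles of the two thresholds (a lower bound on PMI and an upper bound on word probability) are assumptions made for illustration; the text above states only that two threshold hyperparameters are used.

```python
import math
from collections import Counter


def extract_alignments(examples, pmi_threshold, max_word_prob):
    """Return the set of (object, word) pairs that pass both thresholds.

    examples: list of (objects_near_trajectory, instruction_words) pairs,
              one per training example (hypothetical input format).
    pmi_threshold: assumed lower bound on the PMI of a kept pair.
    max_word_prob: assumed upper bound on word probability, to drop
                   very frequent words.
    """
    n = len(examples)
    obj_counts, word_counts, joint_counts = Counter(), Counter(), Counter()
    for objects, words in examples:
        objects, words = set(objects), set(words)
        obj_counts.update(objects)
        word_counts.update(words)
        joint_counts.update((o, w) for o in objects for w in words)

    alignments = set()
    for (o, w), count in joint_counts.items():
        # Probabilities are estimated from counts over training examples.
        p_joint = count / n
        p_obj = obj_counts[o] / n
        p_word = word_counts[w] / n
        pmi = math.log(p_joint / (p_obj * p_word))
        if pmi > pmi_threshold and p_word < max_word_prob:
            alignments.add((o, w))
    return alignments


examples = [
    ({"anvil"}, {"go", "to", "the", "anvil"}),
    ({"tree"}, {"go", "to", "the", "tree"}),
    ({"anvil", "tree"}, {"fly", "past", "the", "anvil"}),
    ({"tree"}, {"go", "around", "the", "tree"}),
]
alignments = extract_alignments(examples, pmi_threshold=0.1, max_word_prob=0.9)
```

On a toy corpus this small, rare words such as "fly" also co-occur with objects by chance, which suggests why a second threshold is useful in addition to the PMI threshold.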
Appendix C Hyperparameter Settings
Image and Feature Dimensions
Camera horizontal FOV:
Input image dimensions:
Feature map dimensions:
Semantic map dimensions:
Visitation distributions and dimensions:
Cropped visitation distribution dimensions:
Environment edge length in meters:
Environment edge length in pixels on :
Visitation prediction interval timesteps:
Auxiliary objective weights: , ,
Learning library: PyTorch 0.3.0
Number of iterations:
Number of environments for policy execution per iteration:
Number of policy executions per iteration (executions): on average
Memory size (number of executions):
Appendix D Proof of Theorem 5.1
Given that the state-visitation distribution of a policy is defined as , we can write the state-value function for the policy as:
where is the start-state distribution that places the entire probability mass on state .
Using the definition and assuming we can write,
The steps of the derivation are justified, in order, as follows:
- Using the properties of a probability distribution.
- Using Hölder's inequality.
- Using Pinsker's inequality.
- Using the theorem assumptions. Additionally, rewards are only positive.
We did not use any information about or in the above steps except for . Therefore taking supremum over and completes the proof. ∎
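For reference, the Hölder and Pinsker steps combine in a standard way. The sketch below uses generic placeholder symbols that are not the paper's notation: $p$ and $q$ are two distributions over states, and $f$ is a bounded function such as a reward. It illustrates the general argument only and is not a reconstruction of the exact derivation.

```latex
\begin{align}
\left| \mathbb{E}_{s \sim p}[f(s)] - \mathbb{E}_{s \sim q}[f(s)] \right|
  &= \left| \sum_{s} \big( p(s) - q(s) \big) f(s) \right| \\
  &\le \lVert p - q \rVert_{1} \, \lVert f \rVert_{\infty}
     && \text{(H\"older's inequality)} \\
  &\le \sqrt{2 \, D_{\mathrm{KL}}(p \,\Vert\, q)} \; \lVert f \rVert_{\infty}
     && \text{(Pinsker's inequality)}
\end{align}
```

A bound on the KL divergence between two visitation distributions therefore translates into a bound on the difference of expected values under them.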
Appendix E Additional Instruction-Following Examples
Figure 6 shows example instructions from the development set along with the trajectories taken by our model and the human demonstrators.