Robotics has benefited greatly from advances in computer vision, but sometimes our objectives have been misaligned. While the central question of computer vision is “tell me what you see”, ours is “do what I say.” In goal-directed tasks, most of the scene is a distraction. When grabbing an apple, an agent only needs to care about the counter, table, or chairs if they interfere with accomplishing the goal. Additionally, when a robot learns through grounded interactions, architectures must be sample efficient in order to learn visual representations quickly for new environments. In this work we show how inverting the traditional perception pipeline (Vision → Reasoning → Action) to incorporate goal information early into the visual stream allows agents to jointly reason and perceive (Vision + Goal → Action), yielding faster and more robust learning. While our approach is relatively simple, early fusion has not been explored to the same extent as late fusion attention models, and it presents many advantages in episodic goal-oriented tasks.
In this work we focus on the task of retrieving objects in a 3D environment. This task includes vocabulary learning, navigation, and scene understanding. Task completion requires resolving 3D occlusions from 2D images and computing action trajectories that satisfy the user’s requests. Fast and efficient planners work well in the presence of ground-truth knowledge of the world. However, in practice, this ground-truth knowledge is difficult to obtain, and we must often settle for noisy estimates. Additionally, when many objects need to be collected or moved, the planner’s search space grows rapidly.
Our work is most closely related to recent advances in instruction following and visual attention [2, 3]. However, we focus on learning visual representations for goal-specific task completion without explicit supervision of object detection or classification. In order to isolate the goal-oriented visual learning component of this problem, we provide explicit goals instead of natural language questions and use imitation learning to provide action supervision. We show that early fusion of goal information in the visual processing pipeline (Early Fusion) outperforms more traditional approaches and learns faster. Furthermore, model accuracy does not degrade even when model parameters are reduced by several orders of magnitude (from 6M to 25K).
II Task Definition
Our task is to collect objects in a 3D scene as efficiently as possible. The agent is presented with a cluttered scene and a list of requested objects. Often there are multiple instances of the same object, or unrequested objects blocking the agent’s ability to reach a target. This forces the agent to develop detailed spatial reasoning to recognize the target objects, determine which is closest and finally remove obstructions as necessary. The list of remaining requested objects is presented to the agent at every time step, to avoid conflating scene understanding performance with issues of memory. The goal (Figure 1) is to train an agent that receives an image and a list of requested objects and produces the optimal next action.
II-A Simulation Environment: CHALET
Our environment consists of a tabletop setting with randomly placed objects, within a kitchen from the CHALET  house environment. Every episode consists of a randomly sampled environment which determines the set of objects (number, position, orientation and type) in addition to which subset will be requested. When there is more than one instance of a particular object, collecting any instance will satisfy the collection criteria, but one may be closer and require fewer steps to reach. Figure 2 shows the sixteen object types that we use for this task (six from CHALET and ten from the YCB dataset).
The objects are chosen randomly and placed at a random location on the table with a random upright orientation. Positions and orientations are sampled until a non-colliding configuration is found. A random subset of the instances on the table are used for the list of requested objects. This process allows the same object type to be requested multiple times if multiple of those objects exist in the scene. Additionally, random sampling means an object may serve as a target in one episode and a distractor in the next. The agent receives 128x128 pixel images of the world and has a 60° horizontal field of view, allowing it to see most of the workspace.
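The sampling procedure above can be sketched as follows. The table extents, the pose fields, and the `collides` predicate are illustrative assumptions, not the environment’s actual API:

```python
import random

def sample_scene(num_objects, object_types, collides, max_tries=100):
    """Place objects at random non-colliding poses via rejection sampling,
    then request a random subset of the placed instances."""
    placed = []
    for _ in range(num_objects):
        for _ in range(max_tries):
            pose = {
                "type": random.choice(object_types),
                "x": random.uniform(-0.5, 0.5),     # assumed table extent
                "y": random.uniform(-0.5, 0.5),
                "yaw": random.uniform(0.0, 360.0),  # random upright orientation
            }
            if not collides(pose, placed):  # resample until non-colliding
                placed.append(pose)
                break
    # a random subset of instances becomes the request list, so the same
    # object type may be requested multiple times
    k = random.randint(1, len(placed))
    requested = [obj["type"] for obj in random.sample(placed, k)]
    return placed, requested
```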
Our agent consists of a first-person camera that can tilt up and down and pan left and right with additional collect, remove and idle actions. Each of the pan and tilt actions deterministically rotates the camera 2° in the specified direction. The collect action removes the nearest object that is within 3° of the center axis of the camera and registers the object as having been collected for the purposes of calculating the agent’s score. This region is visualized in Figure 1 as a magenta circle in the center of the frame. The remove action does the same thing as collect, but does not register the item as having been collected. This is used to remove superfluous items occluding the requested target. Finally, the idle action performs no action and should only be used once all requested items have been collected. All actions require one time step, therefore objects which are physically closer to the center of the camera may take more time to reach if they are occluded. For example, in Figure 3 the peach (orange box) requires fewer steps to collect than the Jello box (blue box). The precision required to successfully collect an object makes this a difficult task to master from visual data alone.
III Models
In our task, models must learn to ground visual representations of the world to the description of what to collect. How to best combine this information is a crucial modelling decision. Most multimodal approaches compute a visual feature map representing the contents of the entire image before selectively filtering based on the goal. This is commonly achieved using soft attention mechanisms developed in the language and vision [6, 7, 8] communities.
Attention re-weights the image representation and leads to more informative gradients, helping models learn quickly and efficiently. Despite its successes, attention has important limitations. Most notably, because task specific knowledge (e.g. target information) is only incorporated late in the visual processing pipeline, the model must first build dense image representations that encode anything the attention might want to extract for all possible future goals. In complex scenes and tasks, this places a heavy burden on the initial stages of the vision system. In contrast, we present a technique that injects goal information early into the visual pipeline in order to build a task specific representation of the image from the bottom up. Our approach avoids the traditional bottleneck imposed on perception systems, and allows the model to discard irrelevant information immediately.
Below, we briefly describe the three models (Figure 4) we compare: Traditional approaches with delayed goal information (Late Fusion & Attention Map) versus our goal conditioned Early Fusion architecture.
III-A Late Fusion
The Late Fusion model constructs a single holistic representation of the entire image via a stack of convolution and pooling layers, then concatenates an embedding of the requested objects in order to predict an action. An object embedding is computed using a simple linear layer designed to turn a one-hot encoding of the object into a dense representation. The complete request for multiple objects is computed as a sum of these individual object embeddings. This architecture forces the vision module to store semantic and spatial information about every object in the scene so the final fully connected layers can ground target objects and reason about actions.
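A minimal PyTorch sketch of this baseline follows. The layer sizes, the seven-way action head, and the 9-channel stacked-frame input are taken from the descriptions elsewhere in the paper, but the exact architecture here is illustrative, not the authors’ released code:

```python
import torch
import torch.nn as nn

class LateFusion(nn.Module):
    """Sketch of the Late Fusion baseline: the goal embedding is
    concatenated only after the full image has been encoded."""
    def __init__(self, num_object_types=16, channels=128, hidden=128):
        super().__init__()
        # dense object embedding; summing one-hot embeddings equals
        # applying the linear layer to a count vector
        self.embed = nn.Linear(num_object_types, hidden)
        conv = lambda cin, cout: nn.Sequential(
            nn.Conv2d(cin, cout, 3, padding=1), nn.MaxPool2d(2),
            nn.BatchNorm2d(cout), nn.ReLU())
        self.encoder = nn.Sequential(
            conv(9, channels), conv(channels, channels),
            conv(channels, channels), conv(channels, channels))
        self.head = nn.Sequential(
            nn.Linear(channels * 8 * 8 + hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 7))  # 4 pan/tilt + collect, remove, idle

    def forward(self, frames, request_counts):
        # frames: (B, 9, 128, 128); request_counts: (B, num_object_types)
        goal = self.embed(request_counts)
        image = self.encoder(frames).flatten(1)   # fuse only at the end
        return self.head(torch.cat([image, goal], dim=1))
```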
III-B Attention Map
We test traditional attention mechanisms over image regions. As with Late Fusion, the first step of this model is to pass the image through a stack of convolution layers. Rather than concatenate the request embedding directly onto the resulting representation, these models first compute an attention map over the spatial dimensions of the convolution output. This is accomplished by comparing the embedded target vector with each region of the convolutional feature map via a simple dot product. This provides a distribution over regions which can then be used to form the final image representation. This weighted representation is concatenated to the request in order to make an action decision. We test two attention models: Softmax Attention Map, defined above, and Attention Map, which omits the normalization. Using a softmax causes the model to remove the contribution of small entries and focus on fewer regions of the image. This is visualized in Figure 7.
In contrast to the Late Fusion model, the attention mechanism provides a filter on extraneous aspects of the image to simplify the control processing. In these models the grounding from image features to goal objects is done with a direct comparison operator (the dot product). These models are widely used for Visual Question Answering (VQA) problems on static images. We also explored more complex models  for computing attention maps, but found this traditional version worked the best in our setting and provided a strong baseline for comparison.
When compressing an entire image via a weighted sum, spatial information is lost from the final vector. To address this we concatenate a normalized grid ranging from -1 to 1 in 2D to our image representation for these models [9]. We only report results from the best performing architecture, which concatenates the grid directly to the initial image rather than later in the convolution stack. Grid information did not yield gains for the Early Fusion or Late Fusion models.
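The dot-product attention and the coordinate grid can be sketched as follows; tensor shapes and function names are illustrative, and the goal embedding is assumed to share the channel dimension of the feature map:

```python
import torch

def attention_pool(feature_map, goal, use_softmax=True):
    """Dot-product attention over conv regions (a sketch of the
    Attention Map baselines). feature_map: (B, C, H, W); goal: (B, C)."""
    B, C, H, W = feature_map.shape
    regions = feature_map.flatten(2)                 # (B, C, H*W)
    scores = (goal.unsqueeze(2) * regions).sum(1)    # dot product per region
    if use_softmax:                                  # Softmax Attention Map
        scores = torch.softmax(scores, dim=1)
    return (regions * scores.unsqueeze(1)).sum(2)    # weighted sum, (B, C)

def add_coordinate_grid(image):
    """Concatenate a normalized (-1, 1) 2D grid to the input image,
    restoring the spatial cue lost by the weighted sum."""
    B, _, H, W = image.shape
    ys = torch.linspace(-1, 1, H).view(1, 1, H, 1).expand(B, 1, H, W)
    xs = torch.linspace(-1, 1, W).view(1, 1, 1, W).expand(B, 1, H, W)
    return torch.cat([image, ys, xs], dim=1)
```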
III-C Early Fusion
Finally, we present our most successful approach, Early Fusion, which concatenates the request embedding to every region of a convolutional filter map. This feature is then processed normally by a set of convolution kernels that have been augmented to account for the extra channels. Figure 5 shows this process. All further processing in the network is computed normally. The model’s subsequent convolution and fully connected layers may filter the visual information according to the goal description that is now combined with the visual input. This results in an image representation which contains only the necessary information for deciding on the next action, effectively gaining the benefits of a bottleneck while dispersing the logic throughout the network. Critically, this means that the network does not have to build a semantic representation of the entire image (See section IV-D for details).
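A minimal sketch of one such goal-conditioned convolution block follows; the tiling of the request embedding is the mechanism described above, while the particular layer sizes and block composition are illustrative:

```python
import torch
import torch.nn as nn

class EarlyFusionBlock(nn.Module):
    """Sketch of the Early Fusion step: the request embedding is tiled
    across every spatial location of a feature map, and the next
    convolution simply uses extra input channels."""
    def __init__(self, in_channels, goal_dim, out_channels):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels + goal_dim, out_channels, 3, padding=1),
            nn.MaxPool2d(2), nn.BatchNorm2d(out_channels), nn.ReLU())

    def forward(self, feature_map, goal):
        # feature_map: (B, C, H, W); goal: (B, goal_dim)
        B, _, H, W = feature_map.shape
        tiled = goal.view(B, -1, 1, 1).expand(B, goal.shape[1], H, W)
        return self.conv(torch.cat([feature_map, tiled], dim=1))
```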
Two important results of this architecture are: 1. Because the goal information is incorporated early, the network can learn to ground the image features to the goal objects at any point in the model without additional machinery (like attention); and 2. The model can compute and retain the spatial information needed for its next action without requiring the addition of a spatial grid. These benefits allow us to obviate the complexity of other approaches, minimize parameters, and outperform other approaches on our task.
III-D Imitation Learning
Following DAgger [10], we roll out trajectories using the current model while collecting supervision from the expert. We then use batches of recent trajectories to train the model for a small number of epochs and repeat this process. We found that for our item retrieval problem this was faster to train than a more faithful implementation of DAgger, which trains a new policy on all previous data at each step, and offered significant improvements over behavior cloning (training on trajectories demonstrated by the expert policy). The models performed best when trained on only the most recent 150 trajectories. When training, we make three passes through the data in these 150 trajectories before rolling out 50 new trajectories, discarding the oldest 50 trajectories, and repeating this process.
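The loop above can be sketched as follows; `rollout` and `train_epoch` are assumed callables standing in for environment interaction with expert relabeling and for one optimizer pass over a batch of trajectories:

```python
from collections import deque

def train_dagger_style(rollout, train_epoch, total_rollouts=50_000,
                       buffer_size=150, batch_rollouts=50, epochs=3):
    """Sketch of the training loop: keep only the most recent 150
    trajectories, make three passes over them, then roll out 50 new
    trajectories (dropping the oldest 50) and repeat."""
    buffer = deque(maxlen=buffer_size)  # old trajectories fall off the end
    done = 0
    while done < total_rollouts:
        for _ in range(batch_rollouts):   # roll out with the current policy,
            buffer.append(rollout())      # collecting expert supervision
            done += 1
        for _ in range(epochs):           # three passes over the buffer
            train_epoch(list(buffer))
```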
Rather than teach our agents to find the shortest path to multiple objects, which is intractable in general, we design our expert policy to behave greedily and move to collect the requested object that would take the fewest steps to reach (including the time necessary to remove occluding objects).
III-E Implementation Details
All convolutions have 3×3 kernels with a padding of one, followed by 2×2 max pooling, a ReLU nonlinearity [13], and batch normalization [14]. This produces a feature map with half the spatial dimensions of the input. The number of convolution channels and hidden dimensions in the fully connected layers vary by experiment (see Section IV-B).
Our images are RGB and 128x128 pixels, but as is common practice in visual episodic settings [15] we found our models performed best when we concatenated the most recent three frames to create a 9x128x128 input. All models use four convolution layers.
Models are provided the complete set of remaining items to collect as a list of one-hot vectors. These are encoded into a single dense vector by summing their learned embeddings. Because the sequence order is not important to our task, we found no benefit from RNN based encodings, though the use of an embedding layer, rather than a count vector, proved essential to model performance.
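This request encoding can be sketched as follows, assuming each request list arrives as a tensor of object type indices (repeated items contribute their embedding multiple times); padding handling is omitted for brevity:

```python
import torch
import torch.nn as nn

class RequestEncoder(nn.Module):
    """Encode the remaining request list as a sum of learned embeddings.
    The sum is order-invariant, so no RNN is needed; sizes illustrative."""
    def __init__(self, num_object_types=16, dim=128):
        super().__init__()
        self.embedding = nn.Embedding(num_object_types, dim)

    def forward(self, requested_ids):
        # requested_ids: (B, L) indices of remaining items, repeats allowed
        return self.embedding(requested_ids).sum(dim=1)  # (B, dim)
```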
IV Experiments
We tested all four models on a series of increasingly cluttered and difficult problems. We also tested these models with varying network capacity by reducing the number of convolution channels and features in the fully connected layers. In all of these experiments, our Early Fusion model performs as well or better than the others, while typically training faster and with fewer parameters.
IV-A Varying Problem Difficulty
To test models on problems of increasing difficulty, we built three variations of the basic task by varying clutter and the number of requested items. In the simplest task (Simple), each episode starts with four instances randomly placed on the table and one object type is requested. Next, for Medium eight instances are placed and two are requested. Finally, for Hard twelve instances are placed and three are requested. The agent’s goal is to collect only the requested items in the allotted time. To evaluate peak performance for these experiments we fixed the number of convolution channels and the hidden dimensions of the fully connected layers to 128.
Each episode runs for forty-eight steps, during which it is possible for the agent to both successfully collect requested objects and erroneously collect items that were not requested. We therefore measure task completion using an F1 score. Precision is the percentage of collected objects that were actually requested, and recall is the percentage of requested objects that were collected. The F1 score is computed at the end of each episode. In addition, we report overall agreement between the model and the expert’s actions over the entire episode. Figure 6 plots the results of all four models on each of these problems as a function of training time.
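The episode-level metric can be sketched as follows. One assumption worth flagging: each collected instance is counted against at most one outstanding request, which matches the task definition where collecting any instance of a requested type satisfies that request once:

```python
def episode_f1(collected, requested):
    """End-of-episode F1: precision over collected items, recall over
    requested items (both are lists of object types)."""
    correct = 0
    remaining = list(requested)
    for item in collected:
        if item in remaining:       # each request satisfied at most once
            remaining.remove(item)
            correct += 1
    if not collected or not requested or correct == 0:
        return 0.0
    precision = correct / len(collected)  # collected items actually requested
    recall = correct / len(requested)     # requested items actually collected
    return 2 * precision * recall / (precision + recall)
```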
Except for the Late Fusion model, which performs poorly in all scenarios, all models are able to master the easiest task. The Early Fusion and Softmax Attention Map models learn quickly, but Attention Map eventually catches up to them. The failure of the Late Fusion baseline on this task shows that even the simplest version of this problem is non-trivial.
The intermediate problem formulation is clearly more difficult, as no models are able to perform as well on it as the easiest problem. The Early Fusion model gains a small but significant improvement in performance while Softmax Attention Map and Attention Map are slightly worse, but comparable to each other.
On the hardest problem the networks must deal with more cluttered images and more complex goal descriptions. The Early Fusion model is clearly superior, learning significantly faster than the other models and achieving greater overall performance.
It is also worth comparing the Attention Map and Softmax Attention Map models. While these models perform similarly on these tasks, the Softmax Attention Map model learns faster than the Attention Map model on the easiest task, but slightly slower on the more difficult ones. We posit that the softmax focuses the attention heavily on only a few regions which is useful for sparse uncluttered environments, but less appropriate when the network must reason about multiple objects in different regions.
Figure 7 provides a comparison of four attention maps. Unsurprisingly, the Softmax Attention Map model produces a sharper distribution around the requested objects, but both methods correctly highlight the objects of interest. In this work, we have limited our definition of clutter to 12 items per scene, in part for ease of visualization and compute time. We have not investigated the upper limit for saturating our models, but rather focused on how their learning curves diverged as a function of scene complexity. We anticipate a further widening in arbitrarily complex real world images.
IV-B Varying Network Capacity
Having demonstrated that Early Fusion is at least as powerful as attention based approaches while being simpler (no grid information or attention logic), we next explore how these approaches perform on varying parameter budgets. Real-time and embedded systems require efficiency both when training and during inference. Since Early Fusion filters and removes irrelevant information early in the processing pipeline, we expect it to require less network capacity than the other methods. To test this claim, we re-run our Medium difficulty setting (because attention models performed well) and compare performance when models have access to 256, 128, 64, 32, or only 16 channel convolutions and fully connected layers, reducing our model sizes by several orders of magnitude.
In Figure 8, we see that training time increases for small networks, but Early Fusion quickly achieves roughly the same final performance even at extremely small network capacities. This allows for dramatically more efficient inference and parameter/memory usage. The same cannot be said for the other models, which degrade substantially as the number of parameters in the network decreases. Note that after 50,000 trajectories the attention based models appear to still be improving slowly, but there is a stark contrast in learning rates. In particular, for the smallest models (16 channels), Attention Map reaches only half the performance of Early Fusion even after training for twice as long.
Because attention mechanisms collapse their final representations, they have a smaller fully connected layer and therefore fewer parameters for the same number of channels. To account for this, we have also included a dashed orange line in Figure 8 which shows the performance of Early Fusion with half as many channels as the other models and fewer parameters. We see again that smaller Early Fusion networks outperform and learn faster than the other approaches.
IV-C Generalization
To determine how well the agent can generalize and represent the compositionality inherent in the requests we conduct experiments in which the agent is trained on a subset of the possible request combinations and then tested on unseen requests. Here the agent is trained with 128 different two-item combinations for 15,000 trajectories, and then tested on the held out 128 two-item combinations (Rows 1 and 2 below). In this setting, performance degrades only slightly, indicating that the agent is not merely memorizing combinations, but learning to recognize the structure of requests composed of individual objects.
In the second experiment, the same agent was tested on a random collection of three-item combinations to determine if the agent can generalize to higher counts than during training (Row 3). Performance degrades slightly more in this case, but the agent remains quite robust. In all cases, results were generated by testing on 1,000 new episodes after training.
IV-D Information Retention
We have argued above that knowing the request allows the network to discard information about irrelevant objects in the scene. To investigate how much information is retained in the intermediate stages of the network we use the hidden states from models trained on the Simple task and assess whether they can be used to predict the correct action for a new query different from the one they were conditioned on. This is implemented by freezing the original model and training a new set of final layers with a second conditional input (Figure 9). In this experiment, we use the Late Fusion model as a proxy for the layer prior to attention in the attention models.
For all models, we find that if the same request is fed to both the original network and the new branch, we quickly achieve performance comparable to the original model (dotted lines). On the other hand if mismatched requests are fed into the two branches all models suffer a substantial degradation of performance, with most unable to collect a single object (solid lines). Both Early Fusion and the attention models have completely removed irrelevant information, while Late Fusion approaches appear to only retain some of the irrelevant information.
V Related Work
Learning to recognize objects, perform actions, and ground instructions to observed phenomena is core to many AI domains, and advances are spread across the Robotics, Vision and Natural Language communities. Most immediately relevant to our work is the recent proliferation of goal directed visual learning in simulated worlds [19, 20, 21, 22, 23, 24], which each aim to bring different amounts of language, vision and interaction to the task of navigating a 3D environment. These systems are often built using one of several open environment simulators based on 3D game engines [25, 26, 27, 28]. This has also been attempted in real 3D environments [29]. Importantly, in contrast to our work, these approaches often pretrain as much of their networks as possible. A notable exception [21] does not pretrain for their RL based language learning; their work focuses on a limited vocabulary with composition and does not address learning with occlusion or larger vocabularies.
Visual Question Answering and Visual Referring Expressions have recently emerged as challenging problems in the computer vision community. In these settings a model is trained to either answer questions about an image or detect parts of an image referred to in a natural language expression. Challenging benchmark datasets have been proposed using both real [30, 31, 32] and simulated [33] images. Many successful approaches to these problems use visual attention similar to our baseline models [34, 35, 36].
Within the Natural Language community, instruction following [37] is often studied as semantic parsing [38, 39, 40], which aims to convert sentences into executable logical forms. Recent work has also investigated mapping language in referring expressions [41, 42]. In parallel, the robotics literature has investigated grounding instructions directly to robotic control [43, 44, 45, 24, 12, 46].
Training end-to-end visual and control networks [47] has proven difficult due to long roll outs and large action spaces. Within reinforcement learning, several approaches for mapping natural language instructions to actions rely on reward shaping [2, 3] and imitation learning [12, 46]. Imitation learning has also proven effective for fine grained activities like grasping, leading to state-of-the-art results on a broad set of tasks. The difficulty encountered in these scenarios emphasizes the need to explore new methods for efficient learning of multimodal representations. Prior work has explored attention model architectures, but did not include early fusion techniques. Early fusion of goal information has shown promise with small observation spaces, but our work begins to explore this method for high-dimensional visual domains. In this paper, we hope to provide some insight into this approach and highlight its power in interactive settings.
VI Conclusion
Goal directed computer vision is an important area for robotics research. While our community has benefited greatly by borrowing results from computer vision, we often have different goals and need to specialize our architectures appropriately. Minimally, in many robotic systems vision alone is not enough to direct behavior, and instead additional goal or task information must be taken into account. This paper argues that vision systems perform best when they are aware of goal and task information as early as possible. So far we have shown this to be true on a simulated robotic object retrieval task, but to the extent that this holds more generally, it motivates a line of research that moves away from vision systems that produce broadly general descriptions of images and environments and towards systems that build contextual representations that are specific to the task at hand.
We have compared four models on a simplified robotic retrieval task to show both the necessity of selective reasoning in these problems and the effectiveness of the Early Fusion technique. We see how it gradually filters unnecessary information, in contrast to the hard bottleneck of attention. Further, because of the relative simplicity of our approach, we observe substantially better scaling and parameter efficiency of Early Fusion, making it particularly well suited to low power and embedded systems.
-  S. Srinivasa, A. Johnson, G. Lee, M. Koval, S. Choudhury, J. King, C. Dellin, M. Harding, D. Butterworth, P. Velagapudi, and A. Thackston, “A system for multi-step mobile manipulation: Architecture, algorithms, and experiments,” in International Symposium on Experimental Robotics, 2016.
-  D. K. Misra, J. Langford, and Y. Artzi, “Mapping instructions and visual observations to actions with reinforcement learning,” in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017.
-  D. Misra, A. Bennett, V. Blukis, E. Niklasson, M. Shatkhin, and Y. Artzi, “Mapping instructions to actions in 3D environments with visual goal prediction,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2018.
-  C. Yan, D. Misra, A. Bennett, A. Walsman, Y. Bisk, and Y. Artzi, “CHALET: Cornell House Agent Learning Environment,” 2018. [Online]. Available: https://arxiv.org/abs/1801.07357
-  D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in ICLR, 2015.
-  J. Ba, V. Mnih, and K. Kavukcuoglu, “Multiple object recognition with visual attention,” in ICLR, 2015.
-  V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu, “Recurrent models of visual attention,” in NIPS, 2014.
-  J. Singh, V. Ying, and A. Nutkiewicz, “Attention on attention: Architectures for visual question answering (vqa),” arXiv preprint arXiv:1803.07724, 2018.
-  R. Liu, J. Lehman, P. Molino, F. P. Such, E. Frank, A. Sergeev, and J. Yosinski, “An intriguing failing of convolutional neural networks and the coordconv solution,” arXiv preprint arXiv:1807.03247, 2018.
-  S. Ross, G. J. Gordon, and J. A. Bagnell, “A reduction of imitation learning and structured prediction to no-regret online learning,” in Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.
-  S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, “Scheduled sampling for sequence prediction with recurrent neural networks,” in Advances in Neural Information Processing Systems, 2015, pp. 1171–1179.
-  V. Blukis, N. Brukhim, A. Bennett, R. A. Knepper, and Y. Artzi, “Following high-level navigation instructions on a simulated quadcopter with imitation learning,” in Proceedings of the Robotics: Science and Systems Conference, 2018.
-  X. Glorot, A. Bordes, and Y. Bengio, “Deep sparse rectifier neural networks,” in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011, pp. 315–323.
-  S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” 2015.
-  V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing atari with deep reinforcement learning,” arXiv preprint arXiv:1312.5602, 2013.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in 3rd International Conference for Learning Representations, 2015.
-  N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014.
-  A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in pytorch,” 2017.
-  D. Gordon, A. Kembhavi, M. Rastegari, J. Redmon, D. Fox, and A. Farhadi, “IQA: Visual Question Answering in Interactive Environments,” in Computer Vision and Pattern Recognition, 2018. [Online]. Available: https://arxiv.org/abs/1712.03316
-  A. Das, S. Datta, G. Gkioxari, S. Lee, D. Parikh, and D. Batra, “Embodied Question Answering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
-  K. M. Hermann, F. Hill, S. Green, F. Wang, R. Faulkner, H. Soyer, D. Szepesvari, W. M. Czarnecki, M. Jaderberg, D. Teplyashin, M. Wainwright, C. Apps, D. Hassabis, and P. Blunsom, “Grounded Language Learning in a Simulated 3D World,” 2017. [Online]. Available: http://arxiv.org/abs/1706.06551
-  P. Anderson, Q. Wu, D. Teney, J. Bruce, M. Johnson, N. Sünderhauf, I. D. Reid, S. Gould, and A. van den Hengel, “Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. [Online]. Available: http://arxiv.org/abs/1711.07280
-  F. Codevilla, M. Müller, A. Dosovitskiy, A. López, and V. Koltun, “End-to-end driving via conditional imitation learning,” arXiv preprint arXiv:1710.02410, 2017.
-  Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, and A. Farhadi, “Target-driven visual navigation in indoor scenes using deep reinforcement learning,” in ICRA, 2017.
-  S. Brodeur, E. Perez, A. Anand, F. Golemo, L. Celotti, F. Strub, J. Rouat, H. Larochelle, and A. C. Courville, “HoME: a Household Multimodal Environment,” 2017. [Online]. Available: http://arxiv.org/abs/1711.11017
-  M. Savva, A. X. Chang, A. Dosovitskiy, T. Funkhouser, and V. Koltun, “MINOS: Multimodal Indoor Simulator for Navigation in Complex Environments,” arXiv:1712.03931, 2017. [Online]. Available: https://arxiv.org/abs/1712.03931v1
-  Y. Wu, Y. Wu, G. Gkioxari, and Y. Tian, “Building Generalizable Agents with a Realistic and Rich 3D Environment,” 2017. [Online]. Available: https://arxiv.org/abs/1801.02209v1
-  E. Kolve, R. Mottaghi, D. Gordon, Y. Zhu, A. Gupta, and A. Farhadi, “Ai2-thor: An interactive 3d environment for visual ai,” arXiv preprint arXiv:1712.05474, 2017.
-  S. Gupta, J. Davidson, S. Levine, R. Sukthankar, and J. Malik, “Cognitive mapping and planning for visual navigation,” arXiv preprint arXiv:1702.03920, vol. 3, 2017.
-  S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh, “Vqa: Visual question answering,” in International Conference on Computer Vision (ICCV), 2015.
-  S. Kazemzadeh, V. Ordonez, M. Matten, and T. Berg, “Referitgame: Referring to objects in photographs of natural scenes,” in Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 2014, pp. 787–798.
-  J. Mao, J. Huang, A. Toshev, O. Camburu, A. L. Yuille, and K. Murphy, “Generation and comprehension of unambiguous object descriptions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 11–20.
-  J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L. Zitnick, and R. Girshick, “Clevr: A diagnostic dataset for compositional language and elementary visual reasoning,” in Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on. IEEE, 2017, pp. 1988–1997.
-  K. Chen, J. Wang, L.-C. Chen, H. Gao, W. Xu, and R. Nevatia, “Abc-cnn: An attention based convolutional neural network for visual question answering,” arXiv preprint arXiv:1511.05960, 2015.
-  H. Xu and K. Saenko, “Ask, attend and answer: Exploring question-guided spatial attention for visual question answering,” in European Conference on Computer Vision. Springer, 2016, pp. 451–466.
-  D. A. Hudson and C. D. Manning, “Compositional attention networks for machine reasoning,” arXiv preprint arXiv:1803.03067, 2018.
-  A. Vogel and D. Jurafsky, “Learning to follow navigational directions,” in Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Uppsala, Sweden: Association for Computational Linguistics, 07 2010, pp. 806–814.
-  D. Chen and R. J. Mooney, “Learning to interpret natural language navigation instructions from observations,” in Proceedings of the National Conference on Artificial Intelligence, 2011.
-  Y. Artzi and L. S. Zettlemoyer, “Weakly supervised learning of semantic parsers for mapping instructions to actions,” Transactions of the Association for Computational Linguistics, vol. 1, pp. 49–62, 2013.
-  J. Andreas and D. Klein, “Alignment-based compositional semantics for instruction following,” in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon, Portugal: Association for Computational Linguistics, 09 2015, pp. 1165–1174.
-  Y. Bisk, D. Yuret, and D. Marcu, “Natural language communication with robots,” in Proceedings of the 15th Annual Conference of the North American Chapter of the Association for Computational Linguistics, San Diego, CA, June 2016, pp. 751–761.
-  V. Cirik, T. Berg-Kirkpatrick, and L.-P. Morency, “Using syntax to ground referring expressions in natural images,” 2018.
-  C. Matuszek, D. Fox, and K. Koscher, “Following directions using statistical machine translation,” in Proceedings of the international conference on Human-robot interaction, 2010.
-  S. Tellex, T. Kollar, S. Dickerson, M. R. Walter, A. G. Banerjee, S. Teller, and N. Roy, “Understanding natural language commands for robotic navigation and mobile manipulation,” in Proceedings of the National Conference on Artificial Intelligence, 2011.
-  D. K. Misra, J. Sung, K. Lee, and A. Saxena, “Tell me dave: Context-sensitive grounding of natural language to mobile manipulation instructions,” in Robotics: Science and Systems, ser. RSS, 2014.
-  V. Blukis, D. Misra, R. A. Knepper, and Y. Artzi, “Mapping navigation instructions to continuous control actions with position visitation prediction,” in Proceedings of the Conference on Robot Learning, 2018.
-  S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-end training of deep visuomotor policies,” Journal of Machine Learning Research, 2017.
-  C. Eppner, J. Sturm, M. Bennewitz, C. Stachniss, and W. Burgard, “Imitation learning with generalized task descriptions,” in 2009 IEEE International Conference on Robotics and Automation, May 2009, pp. 3968–3974.
-  C. Eppner, S. Höfer, R. Jonschkowski, R. Martín-Martín, A. Sieverling, V. Wall, and O. Brock, “Lessons from the amazon picking challenge: Four aspects of building robotic systems,” in Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, 2017, pp. 4831–4835. [Online]. Available: https://doi.org/10.24963/ijcai.2017/676
-  L. Tai, G. Paolo, and M. Liu, “Virtual-to-real deep reinforcement learning: Continuous control of mobile robots for mapless navigation,” in Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on. IEEE, 2017, pp. 31–36.