
From Seeing to Moving: A Survey on Learning for Visual Indoor Navigation (VIN)

The Visual Indoor Navigation (VIN) task has drawn increasing attention from the data-driven machine learning communities, especially with the recently reported success of learning-based methods. Due to the innate complexity of this task, researchers have approached the problem from a variety of angles, the full scope of which has not yet been captured in an overarching report. In this survey, we discuss representative work on learning-based approaches for visual navigation and its related tasks. First, we summarize the current work in terms of task representations and applied methods, along with their properties. We then identify and discuss lingering issues impeding the performance of VIN tasks and motivate future research in these key areas.



1 Introduction

John McCarthy, who coined the term Artificial Intelligence back in 1955 [23], defines it as the “science and engineering of making intelligent machines”, in which an intelligent agent is a system that perceives its environment and takes actions that maximize its chances of achieving certain goals. Visual Indoor Navigation (dubbed VIN) fits this standard definition of an AI task: an intelligent agent (e.g., a robot) is instructed to navigate towards a user-specified goal in an indoor environment based on its first-person visual observations (typically the RGB images captured by its on-board camera). It is a fundamental and integral task on the way to achieving the goal of Artificial Intelligence, since it requires the agent to understand its visual inputs, infer its current location, reason about the goal location, plan a trajectory, and select an action to execute at each step. The capability of performing VIN well further enables a variety of higher-level AI tasks, such as Embodied Question Answering [5], where the agent needs to navigate to a question-specified target location to gather visual information for question answering, and Vision-and-Language Navigation [2], in which the agent has to follow human language instructions to navigate indoor environments. As a result, VIN has drawn increasing research attention and inspired a large amount of work attempting to tackle it.

Classical map-based methods for visual navigation have been studied for years [4]. These methods explicitly decompose the navigation task into a set of sub-tasks, i.e. mapping, localization, planning and motion control. Although these methods have achieved a decent amount of success over the years, their modular designs have fundamental limitations that prevent widespread adoption. One significant limitation is their susceptibility to sensor noise, which accumulates and propagates down the pipeline from the mapper to the controller, making these algorithms less robust. More importantly, they require extensive case-specific, scenario-driven manual engineering, making them difficult to integrate with other downstream AI tasks that have achieved superior performance with data-driven learning methods, such as visual recognition, question answering, and scene captioning [12].

Figure 1: An illustration of the learning-based method for visual indoor navigation task.

Tasks      Methods   Related Work
Label      SL        [14], [13]
           RL        [40], [37]
Image      SL        [27], [35]
           RL        [45], [38]
Language   SL        [9], [31]
           RL        [11], [43]

Figure 2: A brief summary of the VIN tasks categorized by their goals’ representations, and the applied methods (SL: Supervised Learning based methods; RL: Reinforcement Learning based methods).

Due to their recent success in related tasks, there has been a surge of works applying learning-based methods to VIN challenges. As shown in Figure 1, the learning methods take visual observations and user-specified goals as inputs and output an action for the agent to take at each time step in order to achieve those goals. As opposed to classical methods, learning-based methods infer solutions directly from the data and, as a consequence, require little manual engineering and serve as a foundation for novel AI-driven visual navigation tasks. While promising, learning to navigate also poses challenges. Examples include how to efficiently represent the visual inputs, how to reason about the connection between the current observation and the user-specified goal location (especially when they come from different modalities), and how to train the model without labeled ground-truth actions. Each challenge warrants extensive research effort, and an overarching guiding theory is still lacking, resulting in scattered perspectives. This survey aims to provide readers with a more holistic understanding of recent work in learning-based methods for VIN and to provide guidance on how general theory and protocols for the field could be achieved.

We group the recent learning-based visual navigation work into high-level categories and specify which aspects of the VIN system each work improves. We then revisit the visual navigation system and discuss what is still missing to further improve VIN performance. Lastly, we summarize the current progress of learning methods on this task and conclude by listing directions in which future research can progress.

Figure 3: An illustration of the recent research on VIN (Visual Indoor Navigation) task.

2 VIN Overview

First, we describe the recent work that addresses the VIN problem with the learning methods. These works are categorized as depicted in Figure 2 and detailed in the following sections.

2.1 A Variety of Goal Representations

Visual observations and user-specified goals are the two inputs to learning-based visual navigation models, with tasks being defined based on the latter. We summarize them in terms of the representations of the user-specified goals (as shown in Figure 3).

Goals represented by labels. In a 2D environment or an environment where the map is known or learned, it is straightforward to specify the goal with an absolute 3D position that is defined in the coordinate frame of either the environment or the agent [13, 14]. However, in the first-person view navigation task, the map information is not necessary and thus specifying the absolute goal position is not efficient. Some work encoded the goal position into the model, allowing the agent to memorize it [24, 16]. It is more common to specify the goals with labels that can be inferred from the visual observation, such as room types or object categories, in order to ask the agent to navigate to the designated rooms [36, 37] or search for target objects [40, 41, 25, 7]. [39] also proposed a task called Embodied Amodal Recognition (EAR), in which the goal of the agent is to correctly classify, amodally localize and segment the occluded target objects through the viewpoints collected during the agent’s navigation.

Goals represented by images. [45] represented the goals with the scene images taken at the goal positions, so that the goal representations and the agent’s visual observations are homogeneous. The setting is also followed by [27, 33, 35, 20]. In [38], the authors adopted the scene images that contain objects as the goals to guide the agent to approach the image-indicated object, while in [42], the authors provided the target objects’ images without any contextual information included.

Goals represented by natural language. Two primary tasks that take human language as navigation goals have drawn much attention. The first is the Embodied Question Answering (EQA) [5] or the Interactive Question Answering (IQA) [11] task. The task asks questions that require the agent to navigate in an indoor environment and collect visual information to infer an answer. More advanced variations can be found in [6, 10, 43, 34]. The second is the Vision-and-Language Navigation (VLN) task proposed by [2], in which navigation instructions are provided in the form of natural language. Unlike other goal-driven navigation tasks, the VLN task requires both the goal locations and the navigation trajectories to be aligned with the provided instructions. The task has been extensively studied in [32, 9, 30, 15, 31].

2.2 Through the Lens of Representation Learning

With many simulation platforms being developed for visual navigation tasks, such as AI2-THOR [18], House3D [36], R2R [2] and Habitat [28], the optimal trajectories for most visual navigation tasks are accessible. For example, the optimal trajectories for most goal-driven navigation tasks are the shortest paths from the agent’s current locations to the goal locations, where the two locations are specified by the agent’s visual observations and the goal inputs, respectively. Even for the VLN task, the corresponding R2R dataset provides desired trajectories as references. As a consequence, each visual observation together with the user-specified goal is associated with an optimal action (or action distribution), and these pairs can serve as training data that enables supervised learning methods to address the visual navigation problem. To be specific, the supervised learning method approximates a function that maps each observation-goal pair in the training data to its optimal action, with the hope that the function generalizes to the testing data. In the visual navigation task, such a function requires capturing strong feature representations from the visual observations as well as the user-specified goals.
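The way shortest paths turn into supervised labels can be illustrated with a minimal sketch. Everything here is a simplifying assumption made for illustration: a small occupancy grid stands in for a simulator, a grid cell stands in for the agent's visual observation, and the names `ACTIONS`, `distances_to_goal` and `expert_labels` are hypothetical rather than taken from any of the surveyed systems.

```python
from collections import deque

# Hypothetical action set for a toy grid world: name -> (d_row, d_col).
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def distances_to_goal(grid, goal):
    """Breadth-first search from the goal; returns {cell: steps to goal}."""
    dist = {goal: 0}
    queue = deque([goal])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ACTIONS.values():
            nxt = (r + dr, c + dc)
            in_bounds = 0 <= nxt[0] < len(grid) and 0 <= nxt[1] < len(grid[0])
            if in_bounds and grid[nxt[0]][nxt[1]] == 0 and nxt not in dist:
                dist[nxt] = dist[(r, c)] + 1
                queue.append(nxt)
    return dist


def expert_labels(grid, goal):
    """Label every reachable non-goal cell with one shortest-path action."""
    dist = distances_to_goal(grid, goal)
    labels = {}
    for (r, c), d in dist.items():
        if d == 0:
            continue  # the goal itself needs no action label
        for name, (dr, dc) in ACTIONS.items():
            # An action is optimal if it moves one step closer to the goal.
            if dist.get((r + dr, c + dc)) == d - 1:
                labels[(r, c)] = name
                break
    return labels


# 0 = free cell, 1 = obstacle (toy layout, not from any dataset).
GRID = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
LABELS = expert_labels(GRID, goal=(2, 0))
```

Each `(cell, goal) -> action` pair in `LABELS` plays the role of one training example; in an actual VIN setting the cell would be replaced by the image observed at that pose, and a network would be fit to these pairs.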

Treating visual navigation as feature representation learning is common in the VLN task, as the primary challenge of VLN is the cross-modal grounding of visual observations and natural language instructions. With the desired trajectories provided in the VLN benchmark dataset R2R, many works aim to learn better feature representations for VLN tasks. [31] presented a novel cross-modal matching architecture to ground language instructions in both the local visual observation and the global visual trajectory. In [22, 44, 15], the authors proposed self-supervised auxiliary tasks to accelerate the learning of effective feature representations. Both [22] and [44] estimated the navigation progress, represented by either the distance to the goal location or the percentage of steps completed. [44] and [15] performed a cross-modal alignment task, where [44] checked whether the language feature matches the vision-language feature, while [15] predicted whether a given instruction and path fit each other. In addition, [44] proposed a trajectory retelling task to reconstruct the instruction words and an angle prediction task to predict the ground-truth action angles, considering that the available actions incorporate visual noise. Some methods augmented the training data in order to acquire more robust feature representations and thus better generalization ability. For instance, the Speaker-Follower models introduced in [9] augmented data by using a speaker model to create synthetic instructions on newly sampled routes. [30] came up with the “environment dropout” method to mimic unseen environments.

In addition to the VLN task, other goal-driven visual navigation tasks also learn feature representations in the presence of the optimal trajectories. In [25], the authors evaluated various visual representations for goal-driven navigation. [35] learned the feature representation by developing a generative model to predict the next expected visual observation. [27, 13, 14] built environment representations from visual observations and then planned a sequence of actions on them to perform the visual navigation task. In [27], the authors represented the environment with a non-parametric landmark graph generated from a recording of a traversal of the environment during the pre-exploration stage. With a retrieval network, the authors further localized the agent’s current location and the goal location on the graph and then planned a shortest path on the graph to select a sub-goal for the locomotion network to achieve. The authors in [13] built a top-down egocentric map from the agent’s current visual observation and developed a planner that outputs desired actions given the generated map and the goal specification. In [14], by contrast, the authors generated the top-down egocentric map from a small number of registered images taken from the agent’s past experience, followed by a planner to plan a path and an execution module to execute the path.
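The sub-goal selection step on a landmark graph can be sketched as follows. This is only an illustration of shortest-path planning over a graph, not the implementation of [27]: the graph is given as a plain adjacency dictionary, localization is assumed to have already mapped the agent and the goal to node ids, and the function name `next_subgoal` and the room names are hypothetical.

```python
from collections import deque


def next_subgoal(graph, start, goal):
    """Return the first node after `start` on a shortest path to `goal`.

    `graph` maps each landmark id to the ids it is connected to.
    Returns `start` if it already is the goal, and None if the goal
    cannot be reached.
    """
    if start == goal:
        return start
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            break
        for neighbor in graph.get(node, ()):
            if neighbor not in parent:
                parent[neighbor] = node
                queue.append(neighbor)
    if goal not in parent:
        return None
    # Walk back from the goal until reaching the node right after `start`.
    node = goal
    while parent[node] != start:
        node = parent[node]
    return node


# Toy landmark graph with made-up node names.
LANDMARKS = {
    "hall": ["kitchen", "living"],
    "kitchen": ["hall"],
    "living": ["hall", "bedroom"],
    "bedroom": ["living"],
}
```

Calling `next_subgoal(LANDMARKS, "kitchen", "bedroom")` yields the landmark a low-level controller would be asked to reach next; replanning after every step keeps the sub-goal consistent with the agent's latest localization.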

2.3 Modeling VIN as a Decision Making Process

Although shortest paths can be generated and made accessible as optimal trajectories on simulation platforms, they are expensive and almost impractical to obtain in real-world scenarios.

Moreover, for some visual navigation tasks, the optimal trajectories are unavailable and even unattainable. For example, the optimal trajectories of VLN task are generated from human annotations. In Embodied Question Answering task, instead of the shortest paths, the optimal trajectories should help the agent to collect useful observations to answer the user-specified questions correctly. Similarly in the Embodied Amodal Recognition task, the optimal trajectories are the ones to assist the agent to recognize the occluded objects as early as possible. As an intuitive extension, researchers attempted to address the visual navigation problem without exploiting the optimal trajectories. They formulate the visual navigation problem as a Markov Decision Process (MDP) and address it within the deep reinforcement learning paradigm.

In the MDP setting, the agent’s visual input is defined as an observation of its hidden state. At each time step, the agent takes an action that moves it from its current state to a new state, which yields a new observation, and then receives a reward as feedback; this repeats until it reaches a goal state. The agent collects experience through its trial-and-error interactions with the environment and learns the optimal action policy by maximizing the expected cumulative reward. While optimal trajectories are not needed as supervision, solving VIN under the MDP setting relies heavily on 1) defining a proper reward function, 2) representing the agent’s hidden state and 3) determining the task’s goal state.
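The objective of maximizing the expected cumulative reward is commonly written as below. The notation is a conventional sketch and is not used elsewhere in this survey: $\pi$ denotes the policy, $r_t$ the reward received at step $t$, $T$ the episode horizon, and $\gamma \in (0, 1]$ a discount factor, which is a standard assumption rather than something specified by the works discussed here.

```latex
\pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{T} \gamma^{t} r_{t} \right]
```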

A straightforward reward setup is to provide a positive reward when the agent reaches the goal states and zero or a small negative reward when the agent lingers at intermediate states, which was adopted in [45, 40, 38, 20]. The authors of [42] and [41] defined the reward based on the size of the bounding box of the target object from the agent’s detection system in order to solve their object search task. A much denser reward function was defined in [32, 30, 15, 37], where they calculated the change in distance to the goal location as the immediate reward for the performed action. Such a dense reward function was also applied in [5, 6, 43, 21] to help the agent get close to the goal locations. Additionally, they adopted the accuracy of their question answering models as the final reward in order to perform the EQA task well. In [39], the authors rewarded the agent with the performance of its downstream task, i.e. the amodal recognition.
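The difference between the sparse and the distance-based dense reward can be made concrete with a small sketch. The function names, the default reward magnitudes and the use of Euclidean distance are all assumptions chosen for illustration; the cited works may use different values or other distance measures, such as path length along the shortest route.

```python
import math


def sparse_reward(reached_goal, goal_reward=1.0, step_penalty=-0.01):
    """Positive reward at the goal, a small negative reward otherwise.

    The default magnitudes are illustrative placeholders.
    """
    return goal_reward if reached_goal else step_penalty


def dense_reward(prev_pos, new_pos, goal_pos):
    """Reduction in straight-line distance to the goal after one action.

    Positive when the action moved the agent closer, negative when it
    moved the agent away. Euclidean distance is an assumption here.
    """
    return math.dist(prev_pos, goal_pos) - math.dist(new_pos, goal_pos)
```

With the dense variant the agent receives feedback after every action, which typically eases exploration, whereas the sparse variant only signals success at the very end of an episode.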

To improve the sample efficiency, efforts have been made towards capturing meaningful state representations. [45] adopted siamese layers to capture the spatial arrangement between the agent’s visual observation and the goal observation. [41] represented the hidden state with the semantic masks and depth information estimated from the visual observation. [38] proposed an inverse dynamics model to capture the state representation by predicting the action given two adjacent visual observations in a self-supervised manner. [40, 20, 21] augmented the visual observation with additional information to regularize the learning of the state representations. In [40], the authors built a topological graph from the agent’s exploration experiences to represent the environment. Then, the attention features extracted from the graph were concatenated with the agent’s visual observation to better represent the agent’s hidden state. Similarly, another work supplemented the visual observation with a feature embedding extracted from an external knowledge graph. The authors of [21] incorporated the predicted next observations into the state representations. With the informed state representations, some work [40, 5, 6, 21] explored the idea of letting the agent learn to stop by itself, in the hope that it would become aware of the goal. Others utilized a termination checker to determine whether the goal state is reached, such as the semantic room classifier developed in [37] and the detection system adopted in [42, 41]. Nevertheless, most works [45, 38, 20, 39] chose to stop the agent automatically whenever it steps into a goal state designated by the user or the environment.

2.4 The Holy Grail: Generalization

While generalization ability is always adopted to evaluate learning-based approaches, especially the supervised learning methods, the definition of generalization in VIN is still an open problem due to varying settings. In general, the generalization ability in the visual navigation task denotes how well a trained visual navigation model performs in an unseen environment to achieve a new homogeneous user-specified goal without any extra training. When a model is trained with optimal trajectories in the seen environments in a supervised way, it is natural to evaluate it in terms of its generalization ability, as most of the work we enumerate in Section 2.2 did. However, when viewing the visual navigation task as an MDP problem and solving it with deep reinforcement learning (DRL), the methods are unlikely to generalize well to unseen environments or goals, since reinforcement learning methods are designed to tackle a fixed MDP problem defined on a specific environment. Efforts are being made to improve DRL’s generalization ability for specific tasks, such as EQA.

As described above, the shortest paths are not the optimal trajectories for the EQA task. Still, researchers take the shortest paths as the supervised signal to pre-train their navigation models before fine-tuning the whole EQA model with reinforcement learning algorithms in order to achieve better performance on both seen and unseen environments [5, 6, 43, 21]. A few works exploit prior knowledge to improve the generalization ability. For example, the authors of [32] built an environment dynamics model that allows the agent to plan ahead and thereby transfer better to unseen environments. [40] improved the generalization ability by embedding an object relational graph learned from the Visual Genome dataset [19] into the state representation. [37] adopted a probabilistic graph to capture the room layout prior. The graph allows efficient planning and updating, and can be used as a high-level planner integrated with a reinforcement learning based locomotion policy in order to achieve better generalization performance. Additionally, [41] and [25] explored feature representations to improve generalization ability. [45] designed scene-specific layers to enable their trained models to generalize to unseen environments with a much smaller number of extra training iterations.

3 Further Discussion

The goal of the visual navigation task is to equip an intelligent agent with the capability of navigating towards a user-specified goal in any environment, especially in real-world environments. To this end, we discuss certain critical issues that remain unsolved in the recent work summarized in Section 2.

3.1 Simulation vs Real-world

Simulation platforms allow almost unlimited experiments without the hassle of time-consuming mechanical work on a physical robot, which largely improves the data efficiency and facilitates research on the VIN task. To reduce the gap between the simulation and the real-world environment, some simulation platforms, like the R2R dataset [2], build the virtual environments upon real images. However, simulations are still far from the real world, as a large share of real-world uncertainties cannot be captured and accurately modeled, impeding the transfer of the progress achieved on simulation platforms to real-world scenarios.

One benefit of simulation platforms is the feasibility of generating the shortest paths as supervised signals to train visual navigation models for certain tasks. However, such shortest paths are typically generated without taking real-world uncertainty into consideration. One significant factor is the control error of physical robots. Thus, whether the feature representations learned from the oracle shortest paths in simulation can adapt to real-world environments remains unclear. We argue that achieving this is critical, as generating shortest paths in real-world environments is extremely expensive, and even generating a small number of samples for model fine-tuning may not be affordable. In addition, even simulated environments constructed from real images are static, far from reality, where environments are subject to changes in lighting, object layout, etc. Therefore, the solutions developed under static simulated environments still leave much to be desired, and exploring the visual navigation task under dynamic environments remains challenging.

As also pointed out by [17], the progress on simulation platforms does not hold well in reality, since the virtual agents tend to take advantage of the simulators’ imperfections. In their point goal navigation experiments conducted on Habitat [28], they found that the virtual agents are able to slide around obstacles to reach a desirable state, which would not happen in the real world. In other simulation platforms and experiments [45, 42], the agents simply stay at the current position without changing the environment when a collision happens, which is also unrealistic. To this end, many works study simulation-to-real transfer in the visual navigation domain [3], while in [17], the authors suggest evaluating simulators in terms of how likely the performance improvements achieved on them are to hold in reality, rather than in terms of their visual or physical realism. To conclude, the gap between simulation and the real world still needs further study, and claims that are validated in simulated environments only should be made with caution.

3.2 Supervised Learning vs Reinforcement Learning

As we describe in Section 2, there are two primary methods being used in VIN task. One is the supervised learning method of matching the predicted actions to the available ground-truth navigation trajectories and the other is the reinforcement learning method of solving the VIN task by maximizing the user-defined rewards. Though the efficacy of both methods has been demonstrated, the applicability and the limitations of each method have not been explicitly made clear.

While the supervised learning method is straightforward yet powerful, it requires a large amount of ground-truth annotations as supervised signals. For certain tasks, such as point goal navigation, it makes sense to take the shortest paths as the ground-truth trajectories. Typically, it is easy to generate shortest paths in terms of the agent’s available actions on simulation platforms when control noise is ignored entirely. However, in real-world scenarios or on simulation platforms that model the uncertainties of the real world, generating the shortest paths is expensive, which limits the use of the supervised learning method. Moreover, the shortest paths are not always the optimal or ground-truth trajectories for other tasks, such as the EQA task, VLN task and EAR task. There, the optimal trajectories are either not accessible or require intensive labor, making the supervised learning methods not applicable. By contrast, the reinforcement learning method is always applicable, as it does not require ground-truth trajectories to exist.

Additionally, even in the situations where both supervised learning method and reinforcement learning method are applicable, it is not fair to directly compare the two methodologies. Firstly, the supervised learning method requires optimal trajectories under the training environments as additional information compared to the reinforcement learning method. Secondly, the goal of the supervised learning methods is to learn the feature representations with the optimal trajectories under the training environments that can generalize to the testing environments, where the training and testing environments are homogeneous. Therefore, the evaluation of the supervised learning method is typically conducted on the testing environments where no extra training process is allowed. In comparison, the reinforcement learning method assumes the optimal trajectories are unknown under the training environments, and its ultimate goal is to figure out the optimal trajectories through many trial and error interactions with these environments. As a result, the reinforcement learning method is typically evaluated in the training environments or on a testing environment that allows further training process.

In summary, the supervised learning method is more generalizable in homogeneous tasks, but its applicability is limited by the availability of the ground-truth solutions. The reinforcement learning method achieves less desirable generalization ability, but the methodology applied to one task sheds light on solving other tasks, including heterogeneous ones.

3.3 Reinforcement Learning for Generalization

Improving the generalization ability with the reinforcement learning method has always been an active research area with many challenges. Typically, it is studied under the benchmarks that consist of a suite of similar control tasks. In the VIN task, more attention is given to learning from the informative visual inputs, with only a few empirical attempts on generalization improvement (see Section 2.3). In fact, the generalization ability is also of great importance in the VIN task, and the advances towards the generalization of the reinforcement learning could lead to potential solutions.

Generalization in reinforcement learning has various formulations, depending on whether training is allowed at testing time. For some tasks, such as the VLN task, where the optimal action policy can be inferred from the agent’s visual observation and the user-specified goal, it is justifiable to require the same kind of generalization ability as in the supervised learning method, where training is not allowed during testing. In such a case, the action policy should be learned in a way that is either robust or adaptable to environmental variations. A survey of the relevant methods can be found in [26], but their efficacy on the VIN task needs to be further validated.

More commonly in VIN, the agent needs to interact with the new environment for adaptation. For example, the object search task requires the agent to be aware of the location of the target object, which can only be acquired from the agent’s interactions with the environment rather than from the model’s inputs. As a result, the generalization ability is measured by the data efficiency of adapting a learned model to the new environment, and thus the meta-learning formulation can be adopted. In the meta-learning formulation, the model’s generalization ability is explicitly optimized by training the model on a set of tasks drawn from a certain distribution. An example is the MAML algorithm proposed by [8]. MAML finds a shared prior parameter in the parameter space that has the minimum average distance to the optimal parameter of each task. In the VIN task, the shared prior can be seen as general navigation ability, such as collision avoidance. In [29], the authors conditioned state-action values on task embeddings extracted from the agent’s interactions; as a consequence, the state-action values can quickly adapt to new tasks when fed the embeddings generated from new interactions. With these methods, it is also promising to generalize a learned model to a new task, rather than only to the same task in new environments.
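For reference, the MAML update mentioned above is usually stated in its one-gradient-step form as below. The notation follows the common presentation of the algorithm in [8] and is not used elsewhere in this survey: $\theta$ is the shared initialization, $\mathcal{T}_i$ a task drawn from the task distribution $p(\mathcal{T})$, $\mathcal{L}_{\mathcal{T}_i}$ the loss on that task, and $\alpha$ the inner-loop step size. Reading each task as navigation in one environment or towards one goal is an illustrative assumption.

```latex
\theta_i' = \theta - \alpha \nabla_{\theta} \mathcal{L}_{\mathcal{T}_i}(\theta),
\qquad
\min_{\theta} \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}(\theta_i')
```

The first expression adapts the shared parameters to a single task; the second optimizes the shared parameters so that the adapted parameters perform well across tasks, which is what makes fast adaptation to a new environment plausible.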

Advances in the generalization of reinforcement learning could be used to overcome the shortcomings of applying RL to the VIN task. However, these methods have barely been applied and demonstrated on VIN, and we therefore expect to see more exploration in this area in the near future. Moreover, the visual navigation task itself serves as a natural testbed for studying generalization in reinforcement learning, which could support further studies along this avenue.

3.4 Role of Knowledge

The importance of knowledge has long been identified in many AI tasks, such as image recognition [1]. Intuitively, knowledge should also be of great benefit to the VIN task. Even for human beings, it is much easier to navigate in a structured indoor environment than in a contextless maze, indicating that high navigation performance is usually achieved by reasoning from observations rather than merely memorizing the environments. To be specific, indoor environments typically have distinct structures, such as functional areas and object layouts in houses. With the knowledge of such structure, the agent is expected to explore the environment more efficiently by avoiding getting trapped at irrelevant locations. For example, with the commonsense knowledge that a sofa is typically found in the living room, the agent shouldn’t spend much time in the kitchen to find a sofa. Moreover, such knowledge is likely to still hold in previously unseen environments, making it possible to achieve better generalization ability.

A few works exploit knowledge to help perform the visual navigation task. In [37], the authors captured the room layout information aiding navigation. [40] encoded the spatial relationships of all the objects to perform an object search task. However, they are still preliminary compared to how human beings perform the VIN task, indicating a fruitful direction to be further explored.

4 Summary and Future Work

In this paper, we discussed the recent advances in learning-based visual indoor navigation. We first summarized all relevant tasks into three categories in terms of the representations of the user-specified goals: label-indicated goals, such as goal positions and semantic labels of the target objects or rooms; image-indicated goals, such as the images of the target objects or scene images taken from the goal positions; and language-indicated goals, such as questions or navigation instructions. We further described the two primary methods that are applied under certain conditions. One is the supervised learning method, which aims to learn generalizable feature representations when the optimal trajectories are available or easy to obtain. The other is the reinforcement learning method, which treats the visual navigation task as an MDP problem. Since the reinforcement learning method is not as well suited to generalization as the supervised learning method, we also introduced some existing studies on improving its generalization ability for the VIN task.

To support further performance improvements in VIN, we pointed out several open issues in the field that are worth future study: 1) the validity of studying the VIN task on simulation platforms; 2) the fairness of comparisons between supervised learning based methods and reinforcement learning based methods; 3) the potential for improving the generalization ability of the reinforcement learning method in the VIN domain; 4) the possibility of integrating knowledge to guide the learning process for VIN.

References


  • [1] S. Aditya, Y. Yang, and C. Baral (2019) Integrating knowledge and reasoning in image understanding. In IJCAI, Cited by: §3.4.
  • [2] P. Anderson, Q. Wu, D. Teney, J. Bruce, M. Johnson, N. Sünderhauf, I. Reid, S. Gould, and A. van den Hengel (2018) Vision-and-language navigation: interpreting visually-grounded navigation instructions in real environments. In CVPR, Cited by: §1, §2.1, §2.2, §3.1.
  • [3] H. Bharadhwaj, Z. Wang, Y. Bengio, and L. Paull (2019) A data-efficient framework for training and sim-to-real transfer of navigation policies. In ICRA, Cited by: §3.1.
  • [4] F. Bonin-Font, A. Ortiz, and G. Oliver (2008) Visual navigation for mobile robots: a survey. Journal of intelligent and robotic systems 53 (3), pp. 263. Cited by: §1.
  • [5] A. Das, S. Datta, G. Gkioxari, S. Lee, D. Parikh, and D. Batra (2018) Embodied question answering. In CVPR Workshops, Cited by: §1, §2.1, §2.3, §2.3, §2.4.
  • [6] A. Das, G. Gkioxari, S. Lee, D. Parikh, and D. Batra (2018) Neural modular control for embodied question answering. In CoRL, Cited by: §2.1, §2.3, §2.3, §2.4.
  • [7] R. Druon, Y. Yoshiyasu, A. Kanezaki, and A. Watt (2020) Visual object search by learning spatial context. RA-L. Cited by: §2.1.
  • [8] C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, Cited by: §3.3.
  • [9] D. Fried, R. Hu, V. Cirik, A. Rohrbach, J. Andreas, L. Morency, T. Berg-Kirkpatrick, K. Saenko, D. Klein, and T. Darrell (2018) Speaker-follower models for vision-and-language navigation. In NeurIPS, Cited by: Figure 2, §2.1, §2.2.
  • [10] D. Gordon, D. Fox, and A. Farhadi (2019) What should I do now? Marrying reinforcement learning and symbolic planning. arXiv:1901.01492. Cited by: §2.1.
  • [11] D. Gordon, A. Kembhavi, M. Rastegari, J. Redmon, D. Fox, and A. Farhadi (2018) IQA: visual question answering in interactive environments. In CVPR, Cited by: Figure 2, §2.1.
  • [12] Y. Guo, Y. Liu, A. Oerlemans, S. Lao, S. Wu, and M. S. Lew (2016) Deep learning for visual understanding: a review. Neurocomputing 187, pp. 27–48. Cited by: §1.
  • [13] S. Gupta, J. Davidson, S. Levine, R. Sukthankar, and J. Malik (2017) Cognitive mapping and planning for visual navigation. In CVPR, Cited by: Figure 2, §2.1, §2.2.
  • [14] S. Gupta, D. Fouhey, S. Levine, and J. Malik (2017) Unifying map and landmark based representations for visual navigation. arXiv:1712.08125. Cited by: Figure 2, §2.1, §2.2.
  • [15] H. Huang, V. Jain, H. Mehta, A. Ku, G. Magalhaes, J. Baldridge, and E. Ie (2019) Transferable representation learning in vision-and-language navigation. In ICCV, Cited by: §2.1, §2.2, §2.3.
  • [16] M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu (2017) Reinforcement learning with unsupervised auxiliary tasks. In ICLR, Cited by: §2.1.
  • [17] A. Kadian, J. Truong, A. Gokaslan, A. Clegg, E. Wijmans, S. Lee, M. Savva, S. Chernova, and D. Batra (2019) Are we making real progress in simulated environments? measuring the sim2real gap in embodied visual navigation. arXiv:1912.06321. Cited by: §3.1.
  • [18] E. Kolve, R. Mottaghi, W. Han, E. VanderBilt, L. Weihs, A. Herrasti, D. Gordon, Y. Zhu, A. Gupta, and A. Farhadi (2017) AI2-THOR: an interactive 3D environment for visual AI. arXiv:1712.05474. Cited by: §2.2.
  • [19] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. Li, D. A. Shamma, et al. (2017) Visual genome: connecting language and vision using crowdsourced dense image annotations. IJCV 123 (1), pp. 32–73. Cited by: §2.4.
  • [20] D. Li, D. Zhao, Q. Zhang, Y. Zhuang, and B. Wang (2019) Graph attention memory for visual navigation. arXiv:1905.13315. Cited by: §2.1, §2.3, §2.3.
  • [21] J. Li, S. Tang, F. Wu, and Y. Zhuang (2019) Walking with mind: mental imagery enhanced embodied qa. In MM ’19, Cited by: §2.3, §2.3, §2.4.
  • [22] C. Ma, J. Lu, Z. Wu, G. AlRegib, Z. Kira, R. Socher, and C. Xiong (2019) Self-monitoring navigation agent via auxiliary progress estimation. arXiv:1901.03035. Cited by: §2.2.
  • [23] J. McCarthy, M. L. Minsky, N. Rochester, and C. E. Shannon (2006) A proposal for the Dartmouth summer research project on artificial intelligence, August 31, 1955. AI Magazine 27 (4), pp. 12–12. Cited by: §1.
  • [24] P. Mirowski, R. Pascanu, F. Viola, H. Soyer, A. J. Ballard, A. Banino, M. Denil, R. Goroshin, L. Sifre, K. Kavukcuoglu, et al. (2016) Learning to navigate in complex environments. In ICLR, Cited by: §2.1.
  • [25] A. Mousavian, A. Toshev, M. Fišer, J. Košecká, A. Wahid, and J. Davidson (2019) Visual representations for semantic target driven navigation. In ICRA, Cited by: §2.1, §2.2, §2.4.
  • [26] C. Packer, K. Gao, J. Kos, P. Krähenbühl, V. Koltun, and D. Song (2018) Assessing generalization in deep reinforcement learning. arXiv:1810.12282. Cited by: §3.3.
  • [27] N. Savinov, A. Dosovitskiy, and V. Koltun (2018) Semi-parametric topological memory for navigation. In ICLR, Cited by: Figure 2, §2.1, §2.2.
  • [28] M. Savva, A. Kadian, O. Maksymets, Y. Zhao, E. Wijmans, B. Jain, J. Straub, J. Liu, V. Koltun, J. Malik, et al. (2019) Habitat: a platform for embodied ai research. In ICCV, Cited by: §2.2, §3.1.
  • [29] F. Sung, L. Zhang, T. Xiang, T. Hospedales, and Y. Yang (2017) Learning to learn: meta-critic networks for sample efficient learning. arXiv:1706.09529. Cited by: §3.3.
  • [30] H. Tan, L. Yu, and M. Bansal (2019) Learning to navigate unseen environments: back translation with environmental dropout. In NAACL-HLT, Cited by: §2.1, §2.2, §2.3.
  • [31] X. Wang, Q. Huang, A. Celikyilmaz, J. Gao, D. Shen, Y. Wang, W. Y. Wang, and L. Zhang (2019) Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation. In CVPR, Cited by: Figure 2, §2.1, §2.2.
  • [32] X. Wang, W. Xiong, H. Wang, and W. Yang Wang (2018) Look before you leap: bridging model-free and model-based reinforcement learning for planned-ahead vision-and-language navigation. In ECCV, Cited by: §2.1, §2.3, §2.4.
  • [33] D. Watkins-Valls, J. Xu, N. Waytowich, and P. Allen (2019) Learning your way without map or compass: panoramic target driven visual navigation. arXiv:1909.09295. Cited by: §2.1.
  • [34] E. Wijmans, S. Datta, O. Maksymets, A. Das, G. Gkioxari, S. Lee, I. Essa, D. Parikh, and D. Batra (2019) Embodied question answering in photorealistic environments with point cloud perception. In CVPR, Cited by: §2.1.
  • [35] Q. Wu, D. Manocha, J. Wang, and K. Xu (2019) Visual navigation by generating next expected observations. arXiv:1906.07207. Cited by: Figure 2, §2.1, §2.2.
  • [36] Y. Wu, Y. Wu, G. Gkioxari, and Y. Tian (2018) Building generalizable agents with a realistic and rich 3D environment. arXiv:1801.02209. Cited by: §2.1, §2.2.
  • [37] Y. Wu, Y. Wu, A. Tamar, S. Russell, G. Gkioxari, and Y. Tian (2019) Bayesian relational memory for semantic visual navigation. In ICCV, Cited by: Figure 2, §2.1, §2.3, §2.3, §2.4, §3.4.
  • [38] Y. Wu, Z. Rao, W. Zhang, S. Lu, W. Lu, and Z. Zha (2019) Exploring the task cooperation in multi-goal visual navigation. In IJCAI, Cited by: Figure 2, §2.1, §2.3, §2.3.
  • [39] J. Yang, Z. Ren, M. Xu, X. Chen, D. J. Crandall, D. Parikh, and D. Batra (2019) Embodied amodal recognition: learning to move to perceive objects. In ICCV, Cited by: §2.1, §2.3, §2.3.
  • [40] W. Yang, X. Wang, A. Farhadi, A. Gupta, and R. Mottaghi (2018) Visual semantic navigation using scene priors. In ICLR, Cited by: Figure 2, §2.1, §2.3, §2.3, §2.4, §3.4.
  • [41] X. Ye, Z. Lin, J. Lee, J. Zhang, S. Zheng, and Y. Yang (2019) GAPLE: generalizable approaching policy learning for robotic object searching in indoor environment. RA-L. Cited by: §2.1, §2.3, §2.3, §2.4.
  • [42] X. Ye, Z. Lin, H. Li, S. Zheng, and Y. Yang (2018) Active object perceiver: recognition-guided policy learning for object searching on mobile robots. In IROS, Cited by: §2.1, §2.3, §2.3, §3.1.
  • [43] L. Yu, X. Chen, G. Gkioxari, M. Bansal, T. L. Berg, and D. Batra (2019) Multi-target embodied question answering. In CVPR, Cited by: Figure 2, §2.1, §2.3, §2.4.
  • [44] F. Zhu, Y. Zhu, X. Chang, and X. Liang (2019) Vision-language navigation with self-supervised auxiliary reasoning tasks. arXiv:1911.07883. Cited by: §2.2.
  • [45] Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, and A. Farhadi (2017) Target-driven visual navigation in indoor scenes using deep reinforcement learning. In ICRA, Cited by: Figure 2, §2.1, §2.3, §2.3, §2.4, §3.1.