An Exploration of Embodied Visual Exploration

01/07/2020 ∙ by Santhosh K. Ramakrishnan, et al. ∙ The University of Texas at Austin ∙ University of Pennsylvania

Embodied computer vision considers perception for robots in general, unstructured environments. Of particular importance is the embodied visual exploration problem: how might a robot equipped with a camera scope out a new environment? Despite the progress thus far, many basic questions pertinent to this problem remain unanswered: (i) What does it mean for an agent to explore its environment well? (ii) Which methods work well, and under which assumptions and environmental settings? (iii) Where do current approaches fall short, and where might future work seek to improve? Seeking answers to these questions, we perform a thorough empirical study of four state-of-the-art paradigms on two photorealistic simulated 3D environments. We present a taxonomy of key exploration methods and a standard framework for benchmarking visual exploration algorithms. Our experimental results offer insights, and suggest new performance metrics and baselines for future work in visual exploration.




1 Introduction

Visual recognition has seen tremendous success in recent years, driven by large-scale collections of internet data [43, 31, 50, 29] and massive parallelization. However, the focus on passively analyzing manually captured photos after learning from curated datasets does not fully address the issues faced by real-world robots, which must actively capture their own visual observations. Embodied active perception [1, 8, 7, 54] tackles these problems by learning task-specific motion policies for navigation [58, 22, 21, 44, 4] and recognition [34, 27, 57], often via deep reinforcement learning (RL), where the agent is rewarded for reaching a specific target or inferring the right label.

In contrast, in embodied visual exploration [39, 45, 14, 42], the goal is inherently more open-ended and task-agnostic: how does an agent learn to move around in an environment to gather information that will be useful for a variety of tasks that it may have to perform in the future?

Figure 1: Intelligent exploration prepares an agent for future tasks.

Intelligent exploration in 3D environments is important as it allows unsupervised preparation for future tasks. Embodied agents that have the ability to explore are flexible to deploy, since they can use knowledge about previously seen environments to quickly gather useful information in the new one, without having to rely on humans. This ability allows them to prepare for as yet unspecified downstream tasks in the new environment. For example, a newly deployed home-robot could prepare itself by automatically discovering rooms, corridors, and objects in the house. After this exploration stage, it could quickly adapt to instructions such as “Bring coffee from the kitchen to Steve in the living room.” See Fig. 1.

A variety of exploration methods have been proposed both in the reinforcement learning and computer vision literature. They employ ideas like curiosity [47, 39, 11], novelty [52, 9, 45], coverage [14], and reconstruction [28, 42] to overcome sparse rewards [9, 39, 11, 45] or learn task-agnostic policies [28, 14, 42] that generalize to new tasks. In this study, we consider algorithms designed to handle complex photorealistic indoor environments where a mobile agent observes the world through its camera’s field of view and can exploit common semantic priors from previously seen environments (as opposed to exploration in randomly generated mazes or Atari games).

Despite the growing literature on embodied visual exploration, it has been hard to analyze what works when and why: how do different exploration algorithms work for different types of environments and downstream tasks? This difficulty is due to several reasons. First, prior work evaluates on different simulation environments such as SUN360 [42, 49], ModelNet [42, 49], VizDoom [39, 45], SUNCG [14], DeepMindLab [45], and Matterport3D [13]. Second, prior work uses different baselines, architectures, and reinforcement learning algorithms. Finally, exploration methods have been evaluated from different perspectives such as overcoming sparse rewards [39, 11], pixelwise reconstruction of environments [28, 41, 42, 49], area covered in the environment [14, 13], object interactions [40, 23], or as an information gathering phase to solve downstream tasks such as navigation [14], recognition [28, 42, 49], or pose estimation [42]. Due to this lack of standardization, it is hard to compare any two methods in the literature.

This paper presents a unified view of exploration algorithms for visually rich 3D environments, and a common evaluation framework to understand their strengths and weaknesses. First, we formally define the exploration task. Next, we provide a taxonomy of exploration approaches and sample representative methods for evaluation. We evaluate these methods and several baselines on common ground: two well-established 3D datasets [2, 12] and a state-of-the-art architecture [14]. Unlike navigation and recognition, which have clear-cut success measures, exploration is more open-ended. Hence, we quantify exploration quality along multiple meaningful dimensions such as mapping, robustness to sensor noise, and relevance for different downstream tasks. Finally, we highlight the strengths and weaknesses of different methods and identify key factors for learning good exploration policies.

Our main contribution is the first unified and systematic empirical study of exploration paradigms on 3D environments. In the spirit of recent influential studies in other domains [17, 33, 20, 36, 35], we aim first and foremost to provide a reliable, unbiased, and thorough view of the state of the art that can be a reference to the field moving forward. To facilitate this, we introduce technical improvements to shore up existing approaches, and we propose new baselines and performance metrics. We will publicly release all code to standardize the development and evaluation of exploration algorithms.

Figure 2: Examples of 3D environment layouts in the Active Vision Dataset [2] (first 2 cols.) and Matterport3D [12] (last 3 cols.). The top row shows a first-person view and the bottom row shows the 3D layout of the environment with free space in gray, occupied space in white, and the viewpoint as the blue arrow.

2 Empirical study framework

We define a framework for systematically studying different exploration algorithms. Exploration is sequential: at each time step, the agent takes as input the current view and history of accumulated views, updates its internal model of the environment (e.g., a spatial memory or map), and selects the next action (i.e., camera motion) to maximize the information it gathers about the environment. How the latter is defined is specific to the exploration algorithm, as we will detail in Sec. 3. This process may be formalized as a finite-horizon partially observed Markov decision process (POMDP). Next, we describe the POMDP formulation, the 3D simulators used to realize it, and the policy architecture that accounts for partial observability while taking actions.

2.1 The exploration POMDP

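A partially observable Markov decision process consists of a tuple $(\mathcal{S}, \mathcal{O}, \mathcal{A}, \mathcal{Z}, \mathcal{T}, \mathcal{R}, \rho_0, \gamma, T_{exp})$ with state space $\mathcal{S}$, observation space $\mathcal{O}$, action space $\mathcal{A}$, state-conditioned observation distribution $\mathcal{Z}$, transition distribution $\mathcal{T}$, reward function $\mathcal{R}$, initial state distribution $\rho_0$, discount factor $\gamma$, and finite exploration episode length $T_{exp}$. The agent is spawned at an initial state $s_0 \sim \rho_0$ in an unknown environment. At time $t$, the agent at state $s_t$ receives an observation $o_t \sim \mathcal{Z}(\cdot \mid s_t)$, executes camera motion action $a_t \sim \pi(h_t)$, receives a reward $r_t \sim \mathcal{R}(\cdot \mid s_t, a_t, s_{t+1})$, and reaches state $s_{t+1} \sim \mathcal{T}(\cdot \mid s_t, a_t)$. The state representation $h_t$ is obtained using the history of observations $\{o_1, \dots, o_t\}$, and $\pi$ is the agent’s policy. The goal is to learn an optimal exploration policy $\pi^*$ that maximizes the expected cumulative sum of the discounted rewards over an episode:

$$\pi^* = \arg\max_{\pi} \; \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t=1}^{T_{exp}} \gamma^{t-1} r_t \right] \tag{1}$$

where $\tau$ is a sequence of $(s_t, o_t, a_t, r_t)$ tuples generated by starting at $s_0 \sim \rho_0$ and behaving according to policy $\pi$ at each time step. The reward function captures the method-specific incentive for exploration (see Sec. 3). Next, we concretely define the instantiation of the POMDP in terms of photorealistic 3D simulators.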
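To make the objective concrete, the following minimal sketch computes the discounted sum of rewards for one finite episode and averages it over sampled episodes. The function names are hypothetical, and a plain Monte Carlo average stands in for the expectation; it is an illustration of the quantity being maximized, not the training procedure used in the study.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Discounted sum of rewards for one episode; the first reward is undiscounted."""
    total = 0.0
    # Accumulate from the last step backwards: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def estimate_objective(episodes: Sequence[Sequence[float]], gamma: float) -> float:
    """Monte Carlo estimate of the expected discounted return of a policy.

    `episodes` holds the per-step rewards of trajectories sampled from
    that policy (hypothetical input format).
    """
    returns = [discounted_return(ep, gamma) for ep in episodes]
    return sum(returns) / len(returns)
```

A policy with a higher estimated objective gathers more reward, earlier in the episode, under whichever exploration reward is plugged in.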

2.2 Simulators for embodied perception

In order to standardize the experimentation pipeline, we use simulators built on top of two realistic 3D datasets: (1) Active Vision Dataset [2] (AVD) and (2) Matterport3D [12] (MP3D). There are several other valuable 3D assets in the community [55, 30, 6, 51]; we chose MP3D and AVD for this study due to their complementary attributes (see Tab. 1).

| Properties | Active Vision [2] | Matterport3D [12] |
| --- | --- | --- |
| View sampling | Discrete | Continuous |
| Environment sizes | Small | Large |
| Train / val / test splits | 9 / 2 / 4 | 61 / 11 / 18 |
| Large scale data | No | Yes |
| Outdoor components | No | Yes |
| Clutter | Significant | Mild |
| Forward motion | 0.3 | 0.25 |
| Rotation angle | | |

Table 1: The contrasting properties of AVD and MP3D provide diverse testing conditions. Last 2 rows show the action magnitudes.

The Active Vision Dataset [2] is a dataset of dense RGB-D scans from 15 unique indoor houses and office buildings. We simulate embodied motion by converting the dataset into a connectivity graph with discrete points and moving along the edges of the graph (similar to [5]). AVD offers realistic cluttered home interiors, which are lacking in real estate photos or computer graphics datasets.

Matterport3D [12] is a dataset of photorealistic 3D meshes from 90 indoor buildings. Some include outdoor components such as swimming pools, porches, and gardens which present unique challenges to exploration (as we will see in Sec. 5.3). We use the publicly available Habitat simulator [35], which provides fast simulation.

In these simulation environments, the state space consists of the agent’s position and orientation within the environment. The observation space consists of RGB-D camera readings, odometer readings to track the camera position, and a bump sensor to detect collisions. The action space is discrete and has three actions: move forward, turn left, and turn right. The motion values are given in Tab. 1.

2.3 Policy architecture

To benchmark exploration algorithms, we train them with a common policy architecture [14] that is well-suited to these partially observed rich 3D environments. It incorporates a spatial occupancy memory and temporally aggregates information to facilitate long-term information storage and effective planning. Unlike traditional SLAM, such a learned spatio-temporal memory—popular in recent embodied perception approaches [21, 26, 14, 13]—allows the agent to leverage both statistical visual patterns as well as geometry to extrapolate what it learns to novel environments.

The spatial map is built by aggregating the RGB-D and odometer sensor readings over time (similar to [14]). At each time step, a local occupancy map is derived from the depth map, transformed to global coordinates using the odometer readings, and accumulated over time to generate an allocentric map (Fig. 3, top-left). Egocentric maps at two resolutions are generated given the agent’s current position (Fig. 3 red and orange boxes).
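The map-building step can be sketched as follows. This is a simplified illustration under stated assumptions: the projection from the depth map to agent-frame 2D points is omitted (the inputs `occupied_pts` and `free_pts` are assumed to be already derived from depth), the pose `(x, y, heading)` is assumed to come from the odometer, and the rule that obstacles override free space is a choice made for this sketch rather than a detail taken from the paper.

```python
import math


def local_to_global(points, pose):
    """Rotate and translate agent-frame (x, y) points by the pose (x, y, heading)."""
    px, py, theta = pose
    c, s = math.cos(theta), math.sin(theta)
    return [(px + c * x - s * y, py + s * x + c * y) for x, y in points]


def update_allocentric_map(grid, occupied_pts, free_pts, pose, cell_size):
    """Accumulate one time step of local observations into a global map.

    `grid` maps (row, col) -> "occupied" or "free"; unseen cells are absent.
    """
    def to_cell(x, y):
        return (math.floor(y / cell_size), math.floor(x / cell_size))

    for x, y in local_to_global(free_pts, pose):
        # Mark as free only if the cell has not been seen before.
        grid.setdefault(to_cell(x, y), "free")
    for x, y in local_to_global(occupied_pts, pose):
        # Obstacles take priority over earlier "free" labels (sketch assumption).
        grid[to_cell(x, y)] = "occupied"
    return grid
```

Calling `update_allocentric_map` once per time step with the latest pose yields the accumulated allocentric map; an egocentric crop would then be taken around the agent’s current cell.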

The RGB and occupancy maps are encoded using independent ResNet [25] models (Fig. 3 center). The image features, the past action, and the collision sensor inputs are temporally aggregated in a recurrent neural network (GRU), whose hidden states are used as the state representation $h_t$ for learning a policy $\pi(h_t)$ and a corresponding value function $V(h_t)$ used for variance reduction [48] (Fig. 3 right). See Supp. file for more details. This framework allows us to plug in different reward functions (enumerated in Sec. 3) to train different exploration agents.

Figure 3: The policy architecture has a spatial occupancy memory and a temporal GRU memory for effective state representation.

3 Taxonomy of exploration paradigms

Figure 4: The four paradigms of exploration in 3D visual environments. Curiosity rewards visiting states that are predicted poorly by the current forward-dynamics model. Novelty rewards visiting less frequently visited states. Coverage rewards visiting “all possible” parts of the environment. Reconstruction rewards visiting states that allow better reconstruction (hallucination) of the full environment.

We now present a taxonomy for exploration algorithms in the literature. We identify four core paradigms: curiosity, novelty, reconstruction, and coverage (see Fig. 4). Each paradigm can be viewed as a particular reward function in the POMDP. In the following, we review their key ideas and choose representative methods for benchmarking that capture the essence of each paradigm. Please see Supp. for more background on prior work.

3.1 Curiosity

In the curiosity paradigm [47, 38, 32, 39], the agent is encouraged to visit states where its predictive model of the environment is uncertain. We focus on the dynamics-based formulation of curiosity, which was shown to perform well on large-scale scenarios [39, 11]. The agent learns a forward-dynamics model $\mathcal{F}$ that predicts the effect of an action on the agent’s state, i.e., $\hat{s}_{t+1} = \mathcal{F}(s_t, a_t)$. Then, the curiosity reward at each time step is:

$$r_t \propto \lVert \mathcal{F}(s_t, a_t) - s_{t+1} \rVert_2^2 \tag{2}$$

The forward-dynamics model is trained in an online fashion to minimize $\lVert \mathcal{F}(s_t, a_t) - s_{t+1} \rVert_2^2$, encouraging the agent to move to newer states once its predictions on the current set of states are accurate.

We adapt the curiosity formulation from [39] for our experiments by using the GRU hidden state in Sec. 2.3 as the state representation for forward dynamics prediction.
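The interplay between the reward and the online model update can be illustrated with a toy example. The sketch below assumes a squared-error reward on state features and replaces the learned neural forward model with a hypothetical per-action offset model (`ToyForwardModel`); it only shows why the reward for a repeatedly visited transition shrinks as the model improves.

```python
def curiosity_reward(predicted_next, actual_next):
    """Squared L2 error between predicted and observed next-state features."""
    if len(predicted_next) != len(actual_next):
        raise ValueError("feature vectors must have the same length")
    return sum((p - a) ** 2 for p, a in zip(predicted_next, actual_next))


class ToyForwardModel:
    """Toy stand-in for a forward-dynamics model: next_state = state + offset[action]."""

    def __init__(self, num_actions, dim, lr=0.5):
        self.offsets = [[0.0] * dim for _ in range(num_actions)]
        self.lr = lr

    def predict(self, state, action):
        return [s + o for s, o in zip(state, self.offsets[action])]

    def update(self, state, action, next_state):
        """Take one online gradient step; return the reward computed before the step."""
        pred = self.predict(state, action)
        reward = curiosity_reward(pred, next_state)
        offset = self.offsets[action]
        for i, (p, n) in enumerate(zip(pred, next_state)):
            # Gradient of the squared error w.r.t. the offset is 2 * (p - n);
            # the constant factor is folded into the learning rate.
            offset[i] -= self.lr * (p - n)
        return reward
```

Repeating the same transition yields a decaying reward, which is what pushes a curiosity-driven agent toward transitions it cannot yet predict.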

3.2 Novelty

While curiosity seeks hard-to-predict states, novelty [52, 9, 37, 53, 45] seeks previously unvisited states. Each state $s$ is assigned a visitation count $n(s)$. The novelty reward is inversely proportional to the square-root of visitation frequency at the current state:

$$r_t \propto \frac{1}{\sqrt{n(s_t)}} \tag{3}$$

For our experiments, we adapt the Grid Oracle method from [45]: we discretize the 3D environment into a 2D grid where each grid cell is considered to be a unique state, and assign rewards according to Eqn. 3. We use square grid cells whose width is set separately for AVD and MP3D, and consider all points within a grid cell to correspond to that state.
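A grid-based novelty reward of this kind can be sketched in a few lines. The class name and the `cell_width` value used in the usage note are hypothetical (they are not the widths used in the experiments), and counting the current visit before computing the reward is a choice made here so that the count is never zero.

```python
import math
from collections import defaultdict


class GridNovelty:
    """Visitation-count novelty reward on a 2D grid of square cells."""

    def __init__(self, cell_width):
        self.cell_width = cell_width
        self.counts = defaultdict(int)

    def cell(self, x, y):
        """Map a continuous (x, y) position to its grid cell."""
        return (math.floor(x / self.cell_width), math.floor(y / self.cell_width))

    def reward(self, x, y):
        """Count the current visit, then return 1 / sqrt(visit count)."""
        key = self.cell(x, y)
        self.counts[key] += 1
        return 1.0 / math.sqrt(self.counts[key])
```

With `GridNovelty(cell_width=1.0)`, the first visit to a cell returns 1.0 and later visits to any point in the same cell return progressively smaller values.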

3.3 Coverage

The coverage paradigm aims to observe as many things of interest as possible—typically the area seen in the environment [14, 13]. Whereas novelty encourages explicitly visiting all locations, coverage encourages observing all of the environment. Note that the two are distinct: at any given location, how much and how far a robot can see varies depending on the nearby 3D structures.

The coverage approach from [14] learns RL policies that maximize area coverage. More generally, this idea can be exploited for learning to visit other things of interest such as objects (similar to the search task from [18]) and landmarks, as we define below. The coverage reward consists of the increment in some observed quantity of interest:

$$r_t \propto I_t - I_{t-1} \tag{4}$$

where $I_t$ is the quantity of interesting things (e.g., area) visited by time $t$. We consider several options for $I$:
(1) area-coverage: the number of filled cells in the agent’s allocentric map (green+blue regions in Fig. 3).
(2) objects-coverage: We consider an object to be visited if the agent is close to it and the object is unoccluded within its field of view (see Supp. for the exact criteria).
(3) landmarks-coverage: We mine a set of “landmark” viewpoints from the environment which contain distinctive visual components that do not appear elsewhere in that environment, e.g., a colorful painting or a decorated fireplace.

(4) random-views-coverage: We sample random viewpoints in the environment and reward the agent for visiting them using the same visitation criteria. This method is similar to the “goal agnostic” baseline in [44].
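All four variants share the same incremental structure, which the sketch below captures. It assumes that the things of interest can be represented as hashable identifiers (map cells, object IDs, landmark IDs, or sampled viewpoint IDs); the class name and this representation are illustrative assumptions, and the visitation criteria that decide which items count as observed are left to the caller.

```python
class CoverageReward:
    """Reward equal to the number of newly observed items at each step."""

    def __init__(self):
        self.seen = set()

    def step(self, observed_items):
        """Add this step's observations; return the increment in the covered quantity."""
        before = len(self.seen)
        self.seen.update(observed_items)
        return len(self.seen) - before
```

For area-coverage, `observed_items` would be the map cells filled at the current step; for the other variants it would be the objects, landmarks, or random viewpoints that satisfy the visitation criteria.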

3.4 Reconstruction

Reconstruction-based methods [28, 41, 42, 49] use the objective of active observation completion [28] to learn exploration policies. The idea is to gather views that best facilitate the prediction of unseen viewpoints in the environment. The reconstruction reward scores the quality of the predicted outputs:

$$r_t \propto -d\big(V(\mathcal{P}), \hat{V}_t(\mathcal{P})\big) \tag{5}$$

where $V(\mathcal{P})$ is a set of true “query” views at camera poses $\mathcal{P}$ in the environment, $\hat{V}_t(\mathcal{P})$ is the set of view reconstructions generated by the agent after $t$ time steps, and $d$ is a distance function. Whereas curiosity rewards views that are individually surprising, reconstruction rewards views that bolster the agent’s correct hallucination of all other views.

Prior attempts at exploration with reconstruction are limited to pixelwise reconstructions on panoramas and CAD models, where the distance $d$ is computed on pixels. To scale the idea to 3D environments, we propose a novel adaptation that predicts concepts present in unobserved views rather than pixels, i.e., a form of semantic reconstruction. This reformulation requires the agent to predict whether, say, an instance of a “door” concept is present at some query location, rather than reconstruct the door pixelwise as in [28, 42].

We automatically discover these visual concepts from the training environments, which has the advantage of not relying on supervised object detectors or semantic annotations. Specifically, we sample views uniformly from training environments and cluster them into discrete concepts using $K$-means applied to ResNet-50 features. Each cluster centroid is a concept and may be semantic (doors, pillars) or geometric (arches, corners).

Then, we reward the agent for acquiring views that help it accurately predict the dominant concepts in all query views sampled from a uniform grid of locations in each environment. Let $f_i$ denote the ResNet-50 feature for the $i$-th query view. We define its true “reconstructing” concepts to be the $J$ nearest cluster centroids to $f_i$, and assign equal probability to those $J$ concepts. The agent has a multilabel classifier that takes a query pose $p_i$ as input (not the view $v(p_i)$ itself) and returns $\hat{c}_i$, the posteriors for each concept being present in $v(p_i)$. The distance in Eqn. (5) is the KL-divergence between the true concept distribution $c_i$ and the agent’s inferred distribution $\hat{c}_i$, summed over all $i$. The reward thus encourages reducing the prediction error in reconstructing the true concepts.
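The concept-level distance can be sketched as follows. This is a hedged illustration: features and centroids are plain Python lists, nearest centroids are found with squared Euclidean distance, the predicted posteriors are assumed to be normalized into a distribution, and the small `eps` floor that avoids taking the log of zero is an assumption of this sketch. All function names are hypothetical.

```python
import math


def true_concept_distribution(feature, centroids, j):
    """Uniform distribution over the j cluster centroids nearest to the feature."""
    dists = [
        sum((f - c) ** 2 for f, c in zip(feature, centroid))
        for centroid in centroids
    ]
    nearest = sorted(range(len(centroids)), key=lambda k: dists[k])[:j]
    probs = [0.0] * len(centroids)
    for k in nearest:
        probs[k] = 1.0 / j
    return probs


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q); entries with p_k == 0 contribute nothing."""
    return sum(
        pk * math.log(pk / max(qk, eps)) for pk, qk in zip(p, q) if pk > 0
    )


def reconstruction_distance(true_dists, predicted_dists):
    """Sum of per-query KL divergences between true and predicted concept distributions."""
    return sum(kl_divergence(p, q) for p, q in zip(true_dists, predicted_dists))
```

A lower distance means the agent’s predictions for unseen query poses better match the concepts actually present there, so views that reduce this distance are the ones the reward favors.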

4 Exploration evaluation framework

Having defined the taxonomy of exploration algorithms, we now define baselines and metrics to evaluate them.

4.1 Baseline methods

Heuristic baselines: We use four non-learned heuristics for exploration: (1) random-actions [42, 35] samples from a uniform distribution over all actions, (2) forward-action [35] always samples the forward action, (3) forward-action+ samples the forward action unless a collision occurs, in which case, it turns left, and (4) frontier-exploration [56] uses the egocentric map from Fig. 3 and iteratively visits the frontiers, i.e., the edges between free and unexplored spaces (see Supp.). This is closely related to area-coverage, but depends on hand-crafted heuristics and may be vulnerable to noisy inputs.

Oracle graph exploration: exploits the underlying graph structure in the environment (which reveals reachability and obstacles) to visit a sequence of sampled target locations via the true shortest paths. In contrast, all methods we benchmark are not given this graph, and must discover it through exploration. We define three oracles that visit (1) randomly sampled locations, (2) landmark views (see Sec. 3.3), and (3) objects within the environment. These oracles serve as an upper bound for exploration performance.

Imitation learning: We imitate the oracle trajectories to get three imitation variants [10, 19], one for each oracle above. Whereas the oracles assume full observability and therefore are not viable exploration approaches, these imitation learning baselines are viable, assuming disjoint training data with full observability is available.
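Two of the heuristic baselines are simple enough to sketch directly. The action names, the grid encoding (`"free"`, `"occupied"`, `"unknown"`), and the use of 4-connectivity are assumptions of this sketch; the frontier function only detects frontier cells and leaves out the planning needed to actually visit them.

```python
FORWARD, LEFT, RIGHT = "forward", "left", "right"


def forward_action_plus(collided):
    """forward-action+ rule: move forward unless the last step collided, then turn left."""
    return LEFT if collided else FORWARD


def frontier_cells(grid):
    """Return free cells that are 4-adjacent to at least one unknown cell.

    `grid` is a list of rows whose entries are "free", "occupied", or "unknown".
    """
    rows, cols = len(grid), len(grid[0])
    frontiers = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != "free":
                continue
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == "unknown":
                    frontiers.append((r, c))
                    break  # one unknown neighbor is enough
    return frontiers
```

Because frontier detection reads the occupancy map directly, errors in that map (for example, from noisy depth) change which cells are reported as frontiers, which is one way such a heuristic can be sensitive to input noise.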

Tab. 2 lists the assumptions on information availability made by different approaches. Methods requiring less information are more flexible. While we assume access to the full environment during training, there is ongoing work [9, 37, 45] on relaxing this assumption.

| Method | GT depth (training / testing) | GT pose (training / testing) | GT objects (training / testing) | GT state (training / testing) |
| --- | --- | --- | --- | --- |
| random-actions | No / No | No / No | No / No | No / No |
| forward-action(+) | No / No | No / No | No / No | No / No |
| imitation-X | Yes / Yes* | Yes / Yes* | Yes / No | Yes / No |
| curiosity | Yes / Yes* | Yes / Yes* | No / No | No / No |
| novelty | Yes / Yes* | Yes / Yes* | No / No | Yes / No |
| frontier-exploration | - / Yes | - / Yes | - / No | - / No |
| coverage | Yes / Yes* | Yes / Yes* | Yes / No | No / No |
| oracle | No / No | No / No | Yes / Yes | Yes / Yes |

Table 2: Assumptions about information availability: the information required for each method (including the architecture assumptions) during training/testing. In our experiments, we assume all information is given during training, but only sensory inputs are given for testing. * - learned methods may adapt to noisy inputs.

Figure 5: Visiting interesting things: The plots compare the best methods from each paradigm and select baselines for clarity. The table shows the mean and std dev at the last time step and includes all baselines. Parallel plots for the landmarks visitation metric are in Supp.

4.2 Evaluation metrics

A good exploration method visits interesting locations and collects information that is useful for a variety of downstream tasks. Different methods may be better suited for different tasks. For example, a method optimizing for area coverage may not interact sufficiently with objects, leading to poor performance on object-centric tasks. We measure exploration performance with two families of metrics:

(1) Visiting interesting things.

These metrics quantify the extent to which the agent visits things of interest such as area [14, 56, 13, 18], objects [18, 23], and landmarks. Together, they capture different levels of semantic and geometric reasoning that an agent may need to perform in the environment. To account for varying environment sizes and content, we normalize each metric into $[0, 1]$ by dividing by the best oracle score on each episode.

(2) Downstream task transfer.

These metrics directly evaluate exploration’s impact on downstream tasks. The setup is as follows: an exploration agent is given a time budget of $T_{exp}$ time steps to explore the environment, after which the information gathered must be utilized to solve a task within the same environment. More efficient exploration algorithms gather better information and are expected to have higher task performance. We consider three recent tasks from the literature which ask fundamental, yet diverse questions: (i) PointNav: how to quickly navigate from point A to point B? (ii) view localization: where was this photo taken? and (iii) reconstruction: what can I expect to see at point B?

In PointNav [3, 46], the agent is respawned at the original starting point after exploration, and given a navigation goal relative to its position. The agent must use its map to navigate to the goal within a fixed maximum number of time steps. Intuitively, for efficient navigation, an exploration algorithm needs to explore potential dead ends in the environment that may cause planning failure. We use an A* planner [24] that navigates using the spatial occupancy map built during exploration. While other navigation algorithms are possible, A* is a consistent and lightweight means to isolate the impact of the exploration models. We evaluate using Success rate normalized by Path Length (SPL) [3].
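For reference, SPL as defined in [3] averages, over episodes, the success indicator weighted by the ratio of the shortest-path length to the larger of the shortest-path length and the agent’s actual path length. A minimal sketch, assuming strictly positive shortest-path lengths and a hypothetical function name:

```python
def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length over a batch of navigation episodes.

    successes:        booleans, whether the goal was reached
    shortest_lengths: shortest-path distance from start to goal (assumed > 0)
    path_lengths:     length of the path the agent actually took
    """
    total = 0.0
    for success, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        if success:
            # Failed episodes contribute 0; detours shrink the contribution.
            total += shortest / max(taken, shortest)
    return total / len(successes)
```

An agent scores 1.0 on an episode only if it succeeds along a path no longer than the shortest one.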

In view localization [42], the agent is presented with images sampled from distinct landmark viewpoints (cf. Sec. 3.3) and must localize them relative to its starting location. This task captures the agent’s model of the overall layout, e.g., “where would you need to go to see this view?”. While PointNav requires planning a path to a point target around obstacles, view localization requires locating a visual target. We measure localization accuracy by the pose-success rate (PSR), the fraction of views localized within a threshold distance (in mm) of their ground truth pose.

In reconstruction [28, 42, 49], the agent is presented with uniformly spread query locations, and must predict the set of concepts present at each location (cf. Sec. 3.4). This can be viewed as the inverse problem of view localization: the agent has to predict views given locations. Performance is measured by the precision of the agent’s predicted concepts with respect to the ground truth.

For each task, we define a standard pipeline that uses the exploration experience to solve the task (details in Supp.). We do not fine-tune exploration policies for downstream tasks since we treat them as independent evaluation metrics.

Figure 6: Visualizing exploration behaviors: Exploration trajectories for each paradigm are visualized from the top-down view of the environment (AVD in first four, MP3D in last four). The best coverage variant is selected per dataset. Black and green locations represent unexplored and explored areas, respectively. The agent’s trajectory uses color changes to represent time. The behaviors are largely correlated with the quantitative performance of each paradigm in Sec. 5.1: better exploration methods cover larger parts of the environment.

5 Experiments

We next present the results organized by the visitation metrics (Sec. 5.1) and downstream tasks (Sec. 5.2).

Implementation details

All policies are trained on AVD/MP3D for 2000 episode batches. Test episodes on AVD/MP3D are sampled uniformly from all test environments. For navigation, we generate difficult test episodes on AVD/MP3D that require good exploration to navigate effectively. We learn exploration policies in two stages [15, 16, 14]. First, we pre-train the policy by imitating [10, 19] shortest-path oracle trajectories (cf. Sec. 4.1). We then fine-tune the policy with the RL training objective using Proximal Policy Optimization (PPO) [48]. See Supp. for more implementation details.

5.1 Results on visitation metrics

We first evaluate the visitation metrics (cf. Sec. 4.2). For brevity, only the best of the three oracle and imitation variants on each metric are reported.

How well do the four paradigms explore?

Fig. 5 shows the results on both datasets, and Fig. 6 shows examples. We compare the performance of different paradigms, where we select the best coverage variant individually per metric. We see two trends emerging:


The trends on AVD and MP3D match except in two cases: (1) novelty performs significantly better in MP3D while coverage underperforms. This trend can be explained by our finding that novelty scales better than coverage when trained on more environments (see Sec. 5.3). (2) curiosity performs poorly on MP3D and is consistently outperformed by imitation and the other learned methods. curiosity needs a good state representation that accounts for partial observability, which is naturally harder in the large MP3D environments. While the memory architecture that we use is better for curiosity than using image features (see Supp.), better memory architectures may boost the performance further.

When compared to the baselines random-actions and forward-action, which were used in prior work [28, 42, 35], our proposed baselines forward-action+ and imitation are significantly better on most metrics. Even so, forward-action+ tends to get stuck circling a single room and saturates quickly. While learned methods generally outperform these baselines, frontier-exploration outperforms most learned methods on MP3D. However, it performs sub-par on AVD, possibly because depth inputs are noisier in AVD. Even in MP3D, noisy inputs deteriorate its performance, echoing past findings [14] (see Sec. 5.3).

How well do the different coverage variants explore?

Next we unpack how the coverage variants compare to one another (see Supp. for plots). For each of the visitation metrics, we have one method that is optimized for doing well on that metric. For example, area-coverage optimizes for area visited, objects-coverage optimizes for objects visited, etc. As expected, on AVD we generally observe that the method optimized for a particular metric typically does better than most methods on that metric. Interestingly, random-views-coverage, a method that is not optimized for any of these metrics, generalizes well across the metrics and outperforms most of the other methods. This may be because random views are easy to encounter in small environments, providing denser rewards. However, on MP3D, we find that area-coverage dominates, while (landmarks,random-views)-coverage perform poorly on all metrics. We believe that this shift in the trend is due to optimization difficulties caused by reward sparsity: landmarks and randomly sampled views occur more sparsely in the large MP3D environments.

Figure 7: Exploration skills: Each exploration agent is assigned a 0-1 value for a set of five skills. The most general agents have larger polygon areas (See Sec. 5.2). Best viewed in color.
Figure 8: Transfer tasks: Comparison of different paradigms on AVD (first 3 columns) and MP3D (last 3 columns) for the tasks of view localization, navigation, and reconstruction. Please see Supp. for complete results on all baselines, oracles, and the coverage variants.

5.2 Task transfer results

Next we evaluate the relative success of the four paradigms for the three downstream tasks (see Fig. 8). We observe similar trends as reported in Sec. 5.1:


All methods do significantly worse on view-localization in MP3D than AVD, indicative of increased task difficulty associated with large MP3D environments. Oracles exploit the underlying graph-structure to achieve high scores even in large environments (see Supp.). On both datasets, the trends in navigation differ when compared to other metrics. The gaps between different methods are reduced, imitation closely competes with other paradigms, and much larger exploration time budgets are required to improve performance on MP3D. This hints at the inherent difficulty of exploring well to navigate: efficient navigation requires exploration agents to uncover potential obstacles that reduce path planning efficiency. Notably, existing exploration methods do not incorporate such priors into their rewards.

The radar plots in Fig. 7 concisely summarize results thus far along five skills: Mapping, Navigation, Object discovery, Localization, and Reconstruction. The metrics from above are normalized to [0, 1] where 0 and 1 represent the performance of random-actions and oracle, respectively. See Supp. for the metrics to skills mapping. We pick the best coverage variant that works well across several skills (AVD: random-views-coverage; MP3D: area-coverage). In AVD, coverage dominates the other paradigms on most skills, closely followed by reconstruction. In MP3D environments, which tend to be large, different methods are stronger on different skills. For example, novelty performs best on object discovery, frontier-exploration dominates on localization and reconstruction, and both dominate on mapping.
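The skill normalization can be written as a linear rescaling between the two reference scores. In the sketch below, the function name is hypothetical, and clipping to [0, 1] is an assumption added so that a method scoring below random-actions (or above the oracle) still lands inside the plotted range.

```python
def normalize_skill(value, random_score, oracle_score, clip=True):
    """Rescale a metric so that random-actions maps to 0 and the oracle maps to 1."""
    span = oracle_score - random_score
    if span == 0:
        raise ValueError("oracle and random-actions scores must differ")
    score = (value - random_score) / span
    if clip:
        # Clipping is a choice of this sketch, not a detail taken from the paper.
        score = min(1.0, max(0.0, score))
    return score
```

For example, a method scoring 50 on a metric where random-actions scores 20 and the oracle scores 80 receives a skill value of 0.5.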

5.3 Factors influencing performance

We now analyze some factors affecting exploration quality. Please see Supp. for further details on each point below.

How does dataset size affect learning?

We first analyze how the exploration performance (area visited by the end of the exploration budget) of area-coverage and novelty varies with the training dataset size, i.e., the number of unique 3D environments in MP3D. We select these two methods as their behavior on AVD and MP3D varies significantly. We train agents on 3 random subsets of MP3D environments. While both agents achieve reasonable performance with only a small number of environments, performance climbs with the number of environments for novelty, but saturates for area-coverage (Fig. 9 left). We attribute this to novelty’s smoother reward function, which gradually anneals rewards as the visitation frequency grows and may provide better training signals than the binary rewards in area-coverage.

Figure 9: Impact of training data, testing environments on exploration.

How does environment size affect exploration?

Next, we select the top 5 methods on MP3D and measure their performance as a function of testing environment size (see Fig. 9 right). First, we group test episodes based on the area visited by the best oracle. Within each group, we report the percentage of episodes in which each method ranks in the top 3 out of 5. novelty and reconstruction are robust, as they perform well on most groups. frontier-exploration struggles in large MP3D environments with mesh defects, as the agent gets stuck. Interestingly, area-coverage performs fairly well in large, open environments, but gets stuck in small rooms with narrow exits. Please see Supp. for qualitative results demonstrating these cases.
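The grouping and ranking procedure described above can be sketched as follows. This is an illustrative reading, not the authors' evaluation code: the bin edges, the placeholder method names "A" to "E", and all area values are hypothetical, and ties are broken by dictionary order because Python's sort is stable.

```python
import bisect
from collections import defaultdict


def size_group(oracle_area, bin_edges):
    """Index of the size bin that an episode's oracle-visited area falls into."""
    return bisect.bisect_right(bin_edges, oracle_area)


def top_k_rate_by_group(episodes, k=3):
    """Percentage of episodes per group in which each method ranks in the top k.

    `episodes` is a list of (group, {method: area_visited}) pairs.
    """
    totals = defaultdict(int)
    hits = defaultdict(lambda: defaultdict(int))
    methods = set()
    for group, scores in episodes:
        totals[group] += 1
        methods.update(scores)
        ranked = sorted(scores, key=scores.get, reverse=True)  # best first
        for method in ranked[:k]:
            hits[group][method] += 1
    return {
        group: {m: 100.0 * hits[group][m] / totals[group] for m in sorted(methods)}
        for group in totals
    }


# Hypothetical episodes: (oracle area, area visited by each of 5 methods).
raw_episodes = [
    (30.0, {"A": 25, "B": 20, "C": 15, "D": 10, "E": 5}),
    (40.0, {"A": 5, "B": 30, "C": 20, "D": 10, "E": 35}),
    (120.0, {"A": 90, "B": 40, "C": 60, "D": 100, "E": 20}),
]
bin_edges = [50.0]  # assumed boundary between "small" and "large" environments
episodes = [(size_group(area, bin_edges), s) for area, s in raw_episodes]
rates = top_k_rate_by_group(episodes)
```

Reporting a top-3 rate per size group, rather than a mean area, makes methods comparable across environments whose absolute areas differ widely.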

How does noisy occupancy affect exploration?

Noisy occupancy maps can result from mesh defects, noisy depth, and incorrectly estimated affordances, especially outdoors. (Height-based occupancy does not account for affordances. For example, an agent cannot walk on a swimming pool at the ground-plane level.) We characterize noise robustness as the ratio of the area visited by an agent with noisy and noise-free occupancy maps. On MP3D, noise robustness is close to for all learned agents, but drops to for a purely geometric method like frontier-exploration.
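The noise robustness ratio can be written as a short sketch. How the ratio is aggregated over episodes is not specified here, so the version below assumes a ratio of total areas; the function name and the area values are hypothetical.

```python
def noise_robustness(areas_noisy, areas_clean):
    """Ratio of total area visited with noisy maps to that with noise-free maps.

    Assumes both lists cover the same episodes in the same order.
    """
    if len(areas_noisy) != len(areas_clean):
        raise ValueError("episode lists must have the same length")
    total_clean = sum(areas_clean)
    if total_clean <= 0:
        raise ValueError("noise-free area must be positive")
    return sum(areas_noisy) / total_clean


# Hypothetical per-episode areas (in square meters) for one agent.
robustness = noise_robustness([45.0, 30.0], [50.0, 50.0])
```

A value near 1 means the agent covers about as much area with noisy occupancy as without it, while lower values indicate that noise degrades exploration.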

Is imitation learning pre-training important?

We find that imitation pre-training does not accelerate training compared to pure RL training from scratch, except for novelty, despite its strong base performance.

6 Conclusions

We considered the problem of visual exploration in 3D environments. Prior work presents results under varying experimental conditions, making it hard to analyze what works when. Motivated by this, we presented a comparative study of four popular exploration paradigms. We benchmarked different methods under common experimental conditions: policy architecture, 3D environments, and learning algorithm. To enable this study, we introduced new metrics and baselines, and improved upon the existing reconstruction-based approaches to scale to 3D environments. Our analysis provides a comprehensive view of the state of the art and each paradigm’s strengths and weaknesses.

To recap some of our key findings: novelty and frontier-exploration are the strongest performers in large environments, and tend to dominate on different skills, highlighting the need for diverse evaluation metrics. Our proposed adaptation of reconstruction successfully explores 3D environments and competes closely with the best methods in most settings. Two new easy-to-implement heuristics, forward-action+ and imitation, significantly outperform the baselines typically employed, and can serve as better baselines for future research. Also, a relatively simple method trained to cover random views in the environment outperforms all other methods in small environments. Our ablations indicate that coverage does well with little data while novelty scales better with more data; novelty and reconstruction are robust to different testing environments while frontier-exploration collapses in the presence of mesh defects.

We hope that our study serves as a useful starting point and a reliable benchmark for future research in exploration. Code, data and models will be publicly released.


  • [1] J. Aloimonos, I. Weiss, and A. Bandyopadhyay (1988) Active vision. International Journal of Computer Vision. Cited by: §1.
  • [2] P. Ammirato, P. Poirson, E. Park, J. Kosecka, and A. Berg (2016) A dataset for developing and benchmarking active vision. In ICRA, Cited by: Figure 2, §1, §2.2, §2.2, Table 1.
  • [3] P. Anderson, A. Chang, D. S. Chaplot, A. Dosovitskiy, S. Gupta, V. Koltun, J. Kosecka, J. Malik, R. Mottaghi, M. Savva, et al. (2018) On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757. Cited by: §4.2.
  • [4] P. Anderson, A. Chang, D. S. Chaplot, A. Dosovitskiy, S. Gupta, V. Koltun, J. Kosecka, J. Malik, R. Mottaghi, M. Savva, and A. R. Zamir (2018) On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757. Cited by: §1.
  • [5] P. Anderson, Q. Wu, D. Teney, J. Bruce, M. Johnson, N. Sünderhauf, I. Reid, S. Gould, and A. van den Hengel (2018) Vision-and-language navigation: interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.2.
  • [6] I. Armeni, A. Sax, A. R. Zamir, and S. Savarese (2017-02) Joint 2D-3D-Semantic Data for Indoor Scene Understanding. ArXiv e-prints. External Links: 1702.01105 Cited by: §2.2.
  • [7] R. Bajcsy (1988) Active perception. Proceedings of the IEEE. Cited by: §1.
  • [8] D. H. Ballard (1991) Animate vision. Artificial intelligence. Cited by: §1.
  • [9] M. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos (2016) Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, Cited by: §1, §3.2, §4.1.
  • [10] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, et al. (2016) End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316. Cited by: §4.1, §5.
  • [11] Y. Burda, H. Edwards, D. Pathak, A. Storkey, T. Darrell, and A. A. Efros (2018) Large-scale study of curiosity-driven learning. In arXiv:1808.04355, Cited by: §1, §1, §3.1.
  • [12] A. Chang, A. Dai, T. Funkhouser, M. Nießner, M. Savva, S. Song, A. Zeng, and Y. Zhang (2017) Matterport3D: learning from rgb-d data in indoor environments. In Proceedings of the International Conference on 3D Vision (3DV), Cited by: Figure 2, §1, §2.2, §2.2, Table 1.
  • [13] D. S. Chaplot, S. Gupta, A. Gupta, and R. Salakhutdinov Modular visual navigation using active neural mapping. Cited by: §1, §2.3, §3.3, §4.2.
  • [14] T. Chen, S. Gupta, and A. Gupta (2019) Learning exploration policies for navigation. In International Conference on Learning Representations, External Links: Link Cited by: §1, §1, §1, §1, §2.3, §2.3, §3.3, §3.3, §4.2, §5, §5.1.
  • [15] A. Das, S. Datta, G. Gkioxari, S. Lee, D. Parikh, and D. Batra (2018) Embodied Question Answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §5.
  • [16] A. Das, G. Gkioxari, S. Lee, D. Parikh, and D. Batra (2018) Neural modular control for embodied question answering. In Conference on Robot Learning, pp. 53–62. Cited by: §5.
  • [17] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel (2016) Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pp. 1329–1338. Cited by: §1.
  • [18] K. Fang, A. Toshev, L. Fei-Fei, and S. Savarese (2019) Scene memory transformer for embodied agents in long-horizon tasks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 538–547. Cited by: §3.3, §4.2.
  • [19] A. Giusti, J. Guzzi, D. C. Cireşan, F. He, J. P. Rodríguez, F. Fontana, M. Faessler, C. Forster, J. Schmidhuber, G. Di Caro, et al. (2016) A machine learning approach to visual perception of forest trails for mobile robots. IEEE Robotics and Automation Letters. Cited by: §4.1, §5.
  • [20] P. Goyal, D. Mahajan, A. Gupta, and I. Misra (2019) Scaling and benchmarking self-supervised visual representation learning. arXiv preprint arXiv:1905.01235. Cited by: §1.
  • [21] S. Gupta, J. Davidson, S. Levine, R. Sukthankar, and J. Malik (2017) Cognitive mapping and planning for visual navigation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2616–2625. Cited by: §1, §2.3.
  • [22] S. Gupta, D. Fouhey, S. Levine, and J. Malik (2017) Unifying map and landmark based representations for visual navigation. arXiv preprint arXiv:1712.08125. Cited by: §1.
  • [23] N. Haber, D. Mrowca, L. Fei-Fei, and D. L. Yamins (2018) Learning to play with intrinsically-motivated self-aware agents. arXiv preprint arXiv:1802.07442. Cited by: §1, §4.2.
  • [24] P. E. Hart, N. J. Nilsson, and B. Raphael (1968) A formal basis for the heuristic determination of minimum cost paths. IEEE transactions on Systems Science and Cybernetics 4 (2), pp. 100–107. Cited by: §4.2.
  • [25] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §2.3.
  • [26] J. F. Henriques and A. Vedaldi (2018) Mapnet: an allocentric spatial memory for mapping environments. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8476–8484. Cited by: §2.3.
  • [27] D. Jayaraman and K. Grauman (2018) End-to-end policy learning for active visual categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §1.
  • [28] D. Jayaraman and K. Grauman (2018) Learning to look around: intelligently exploring unseen environments for unknown tasks. In Computer Vision and Pattern Recognition, 2018 IEEE Conference on, Cited by: §1, §1, §3.4, §3.4, §4.2, §5.1.
  • [29] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. (2017) The kinetics human action video dataset. arXiv preprint arXiv:1705.06950. Cited by: §1.
  • [30] E. Kolve, R. Mottaghi, W. Han, E. VanderBilt, L. Weihs, A. Herrasti, D. Gordon, Y. Zhu, A. Gupta, and A. Farhadi (2017) AI2-THOR: An Interactive 3D Environment for Visual AI. arXiv. Cited by: §2.2.
  • [31] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European Conference on Computer Vision, Cited by: §1.
  • [32] M. Lopes, T. Lang, M. Toussaint, and P. Oudeyer (2012) Exploration in model-based reinforcement learning by empirically estimating learning progress. In Advances in neural information processing systems, pp. 206–214. Cited by: §3.1.
  • [33] A. R. Mahmood, D. Korenkevych, G. Vasan, W. Ma, and J. Bergstra (2018) Benchmarking reinforcement learning algorithms on real-world robots. In Conference on Robot Learning, pp. 561–591. Cited by: §1.
  • [34] M. Malmir, K. Sikka, D. Forster, J. Movellan, and G. W. Cottrell (2015) Deep Q-learning for active recognition of GERMS. In BMVC, Cited by: §1.
  • [35] M. Savva, A. Kadian, O. Maksymets, Y. Zhao, E. Wijmans, B. Jain, J. Straub, J. Liu, V. Koltun, J. Malik, D. Parikh, and D. Batra (2019) Habitat: A Platform for Embodied AI Research. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Cited by: §1, §2.2, §4.1, §5.1.
  • [36] D. Mishkin, A. Dosovitskiy, and V. Koltun (2019) Benchmarking classic and learned navigation in complex 3d environments. arXiv preprint arXiv:1901.10915. Cited by: §1.
  • [37] G. Ostrovski, M. G. Bellemare, A. van den Oord, and R. Munos (2017) Count-based exploration with neural density models. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2721–2730. Cited by: §3.2, §4.1.
  • [38] P. Oudeyer, F. Kaplan, and V. V. Hafner (2007) Intrinsic motivation systems for autonomous mental development. IEEE transactions on evolutionary computation 11 (2), pp. 265–286. Cited by: §3.1.
  • [39] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell (2017) Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning, Cited by: §1, §1, §1, §3.1, §3.1.
  • [40] D. Pathak, D. Gandhi, and A. Gupta (2018) Beyond games: bringing exploration to robots in real-world. Cited by: §1.
  • [41] S. K. Ramakrishnan and K. Grauman (2018) Sidekick policy learning for active visual exploration. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 413–430. Cited by: §1, §3.4.
  • [42] S. K. Ramakrishnan, D. Jayaraman, and K. Grauman (2019) Emergence of exploratory look-around behaviors through active observation completion. Science Robotics 4 (30). External Links: Document, Link, Cited by: §1, §1, §1, §3.4, §3.4, §4.1, §4.2, §4.2, §5.1.
  • [43] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision. Cited by: §1.
  • [44] N. Savinov, A. Dosovitskiy, and V. Koltun (2018) Semi-parametric topological memory for navigation. arXiv preprint arXiv:1803.00653. Cited by: §1, §3.3.
  • [45] N. Savinov, A. Raichuk, R. Marinier, D. Vincent, M. Pollefeys, T. Lillicrap, and S. Gelly (2018) Episodic curiosity through reachability. arXiv preprint arXiv:1810.02274. Cited by: §1, §1, §1, §3.2, §4.1.
  • [46] M. Savva, A. X. Chang, A. Dosovitskiy, T. Funkhouser, and V. Koltun (2017) MINOS: multimodal indoor simulator for navigation in complex environments. arXiv preprint arXiv:1712.03931. Cited by: §4.2.
  • [47] J. Schmidhuber (1991) Curious model-building control systems. In Proc. international joint conference on neural networks, pp. 1458–1463. Cited by: §1, §3.1.
  • [48] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §2.3, §5.
  • [49] S. Seifi and T. Tuytelaars (2019) Where to look next: unsupervised active visual exploration on 360 deg input. arXiv preprint arXiv:1909.10304. Cited by: §1, §3.4, §4.2.
  • [50] K. Soomro, A. R. Zamir, and M. Shah (2012) UCF101: a dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402. Cited by: §1.
  • [51] J. Straub, T. Whelan, L. Ma, Y. Chen, E. Wijmans, S. Green, J. J. Engel, R. Mur-Artal, C. Ren, S. Verma, A. Clarkson, M. Yan, B. Budge, Y. Yan, X. Pan, J. Yon, Y. Zou, K. Leon, N. Carter, J. Briales, T. Gillingham, E. Mueggler, L. Pesqueira, M. Savva, D. Batra, H. M. Strasdat, R. D. Nardi, M. Goesele, S. Lovegrove, and R. Newcombe (2019) The Replica dataset: a digital replica of indoor spaces. arXiv preprint arXiv:1906.05797. Cited by: §2.2.
  • [52] A. L. Strehl and M. L. Littman (2008) An analysis of model-based interval estimation for markov decision processes. Journal of Computer and System Sciences 74 (8), pp. 1309–1331. Cited by: §1, §3.2.
  • [53] H. Tang, R. Houthooft, D. Foote, A. Stooke, O. X. Chen, Y. Duan, J. Schulman, F. DeTurck, and P. Abbeel (2017) # exploration: a study of count-based exploration for deep reinforcement learning. In Advances in neural information processing systems, pp. 2753–2762. Cited by: §3.2.
  • [54] D. Wilkes and J. K. Tsotsos (1992) Active object recognition. In Computer Vision and Pattern Recognition, 1992. IEEE Computer Society Conference on, Cited by: §1.
  • [55] F. Xia, A. R. Zamir, Z. He, A. Sax, J. Malik, and S. Savarese (2018) Gibson env: real-world perception for embodied agents. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9068–9079. Cited by: §2.2.
  • [56] B. Yamauchi (1997) A frontier-based approach for autonomous exploration. Cited by: §4.1, §4.2.
  • [57] J. Yang, Z. Ren, M. Xu, X. Chen, D. Crandall, D. Parikh, and D. Batra (2019) Embodied visual recognition. arXiv preprint arXiv:1904.04404. Cited by: §1.
  • [58] Y. Zhu, D. Gordon, E. Kolve, D. Fox, L. Fei-Fei, A. Gupta, R. Mottaghi, and A. Farhadi (2017) Visual Semantic Planning using Deep Successor Representations. In Computer Vision, 2017 IEEE International Conference on, Cited by: §1.