Learning to Visually Navigate in Photorealistic Environments Without any Supervision

04/10/2020 ∙ by Lina Mezghani, et al.

Learning to navigate in a realistic setting where an agent must rely solely on visual inputs is a challenging task, in part because the lack of position information makes it difficult to provide supervision during training. In this paper, we introduce a novel approach for learning to navigate from image inputs without external supervision or reward. Our approach consists of three stages: learning a good representation of first-person views, then learning to explore using memory, and finally learning to navigate by setting its own goals. The model is trained with intrinsic rewards only so that it can be applied to any environment with image observations. We show the benefits of our approach by training an agent to navigate challenging photo-realistic environments from the Gibson dataset with RGB inputs only.







1 Introduction

Designing algorithms for learning to navigate is a classical problem in robotics. The problem is challenging, especially in settings where accurate depth or position information is unavailable or, more generally, where as little supervision as possible should be used. Furthermore, if the goal location is specified as an image, the agent needs to learn a good visual representation and an efficient exploration strategy in addition to the navigation policy.

One important set of approaches, called Simultaneous Localization And Mapping (SLAM) [29], builds a map of an environment while keeping track of where the agent is in the partial map. Although many SLAM methods use statistical methods to improve estimation, until recently they did not emphasize statistical learning. These methods are therefore unable to generalize and exploit regularities in the environment (or between environments) beyond what has been built by hand into the algorithm.

There has been recent interest in using techniques from deep learning in the context of SLAM, or more generally, in the context of navigation [2, 11, 13, 15, 16, 22, 23, 32, 36]. Deep learning-based methods typically require a large number of trials during training and have rarely been considered outside of simulators. However, the growing number of photorealistic environments [4, 33], efficient simulators [9, 18], and dedicated methods to transfer from simulated to real environments [25, 30] have fueled research in deep learning-based navigation methods.

In a separate line of study, there has been great progress in learning image representations through “self-supervised” approaches [3, 8, 12, 37]. In these works, using prior knowledge about the basic regularities of images, researchers find pretext tasks that, when solved, give good feature representations for other tasks of interest. While self-supervised learning is interesting for understanding learning methods abstractly, it also promises to be important in applications, as a pretext task is often easier to come by and more general than strong supervision.

Figure 1: Three stages of training: the agent learns to distinguish locations from its visual inputs, then it explores the environment and builds a map of it, and finally it uses this map to learn how to navigate the environment. No stage requires external supervision or reward: the agent only has access to a visual RGB input and has no information about its position.

In this work, we introduce an entirely unsupervised method for learning to navigate, through simulators like Habitat [18], in photorealistic environments and large-scale three-dimensional point clouds such as the Gibson dataset [33]. In particular, we assume that the agent only has access to image observations and that the target location is also given as an image. The method is composed of three stages. First, the agent learns a visual representation that can distinguish between nearby and far-away pairs of points, in a similar way to [27]. The fundamental prior knowledge we use is that, in most situations, an agent’s representation of the world should not change very fast as it moves, but for most pairs of far-away points, the representations should be different. Next, the agent learns to explore, adding states to a memory buffer when their feature representations are dissimilar to any in the buffer. Finally, the agent trains itself to complete navigation tasks, using the buffer to shape the reward for the navigation policy. An important component of our model is that the agent uses a Scene Memory Buffer for both its policy and its reward; in particular, the agent takes actions via a Transformer [31] applied to the memory buffer. Because our approach can be applied in situations where the practitioner has no control over the environment (in particular, with no ability to give supervision or move the agent to arbitrary positions), the method is general. We show that despite this generality, its final navigation policy outperforms other approaches.

Our contributions in this paper are the following:

  • We propose a novel three-stage algorithm for learning to navigate using only RGB vision, without any external supervision or reward, in photorealistic environments that simulate actual houses.

  • We introduce several improvements to the exploration policy of [27], such as conditioning on past memory and using discrete rewards.

  • We evaluate our model and show that it outperforms all baselines on scenes from the Gibson dataset.

2 Related Work

Iteratively building a map of an environment to perform localization or navigation tasks has been extensively studied in robotics in the context of SLAM  [29]. Standard SLAM is composed of multiple hand-crafted modules to fit with the physical constraints of a robot [21].

Recently, several works have replaced components of SLAM with neural networks; for example, Chaplot et al. [5] replace the localization module. Gupta et al. [13] propose a model composed of two successive modules: a mapper that builds a latent world map, and a planner that takes actions based on this map. The mapper has no dedicated external rewards, but the planner performs tasks associated with external rewards and backpropagates the resulting gradients to the mapper. This map has been further extended with image features [14] or with a dynamic structure [2, 15]. Other works replace SLAM entirely by deep models with no planning but explicit map-like or SLAM-inspired memory structures [16, 22, 23, 36]. Closer to our work, Kumar et al. [17] use human-made trajectories stored as sequences of feature representations of views, and Fang et al. [11] show the potential of the Transformer layer [31] as a scene memory for navigating realistic environments. As opposed to these works, our model is trained with intrinsic reward only.

Alternatively, several works train deep models to solve a navigation task without explicit world representations. Mirowski et al. [20] learn a navigation policy with a recurrent network in synthetic mazes and, later, on real-world data from Google Maps [19]. Similar to our work, they use a surrogate loss on loop closure to help the training of the model, but they use sparse external reward to guide its training. Similarly, Zhu et al. [38] show the benefit of deep models on a localization task framed as finding an observation taken from the goal. Later, Yang et al. [34] extend this to navigation to an object described only by its name.

Many works train agents to explore the world with an intrinsic reward [7, 24, 28]. For example, the curiosity-driven reward of Pathak et al. [24] encourages agents to move to states that are hard to predict. Of particular interest, Chen et al. [6] propose a coverage reward that encourages the agent to explore every part of its latent map. This reward is quite general and benefits both exploration and navigation, but unlike ours it does not directly optimize for navigation.

Finally, our approach is most related to a recent line of research that uses multiple stages of learning to build a set or graph of scene observations [10, 26, 27, 35]. Savinov et al. [26] internalize a landmark memory obtained from human trajectories: they store representations of the locations visited in the trajectories and build a navigation graph based on their similarities. Our work follows their self-supervised training of a reachability network to distinguish between nearby observations, but we extend the self-supervision to both exploration and navigation. Savinov et al. [27] also use a curiosity-driven intrinsic reward based on a memory buffer. Our exploration phase follows an intrinsic reward inspired by their work, but we also use the memory buffer in our Transformer-based policy. Finally, Eysenbach et al. [10] propose a method for teaching an agent to explore and navigate an environment with intrinsic rewards. Their training follows the same sequence of steps as ours, except that they clean the graph by testing existing edges and adding new ones, and then learn to navigate on the graph rather than in the environment. Instead, our agent trains itself to navigate the environment directly by shaping dense rewards from the memory buffer, which means it can potentially learn more efficient navigation strategies that are not constrained to paths on the memory graph.

3 Problem formulation

In this paper, we simulate a realistic setting where an agent must learn to navigate in a 3D environment. We formulate this problem using the following assumptions:

  • No extrinsic reward. We do not have control over the environment and thus cannot add extrinsic reward to guide the training of the agent.

  • No human guidance. The environment is new and has never been explored. We do not have access to human trajectories or other forms of external information.

  • 3D-scanned environments. We focus on photorealistic environments such as the ones in the Habitat platform.

We are interested in the capability of the agent to explore and navigate an environment and we report the following metrics to measure its success:

  • Coverage. We measure the coverage of an environment by discretizing the map into cells of equal size and counting the ratio of cells visited by the agent after a fixed number of steps.

  • Image-driven navigation. We measure the ability of an agent to navigate the environment to an image target. That is, we give the agent an image observation taken at the goal location, and we measure the number of steps it takes to reach the destination (i.e., until its observation matches the image target), starting from the entry point of the map.

Finally, as a secondary goal, we are also interested in the robustness of an agent to limited sensor data. To that end, we focus on RGB inputs in this paper: we do not use depth, GPS coordinates, or relative position as inputs.

4 Approach

In this section, we describe our algorithm and its three-stage training: first, the agent learns a visual representation of the environment from random trajectories; then it learns to explore the environment to build a latent map; and finally it trains itself to navigate using the map. Each stage has a module trained without external supervision.

4.1 Stage 1: Visual representation of the environment

Figure 2: Reachability network [27]. Given a set of observations made by an agent with a random walk policy (left), we train the (local) reachability network to distinguish between observations that are temporally near or distant. For a given observation (marked in blue), the nearest observations are in green and the distant ones in red. The reachability network (right) is a siamese network composed of a convolutional network followed by a fully-connected network.

As the agent moves around the environment, it receives data from its visual sensor, which in this work produces RGB images. From this first-person input, the agent builds a representation of its current location that should encode information to distinguish the current location from other locations, as well as give an idea of the distance between locations. This is achieved by encouraging nearby locations to have similar representations while pushing distant locations to have different representations. However, in the absence of information about the agent position or a map, we do not have an explicit notion of distance between locations.

Reachability as image-based self-supervision [27].

An approximation of the spatial distance between two locations is the number of time steps taken by an agent with a random walk policy to reach one location starting from the other. Indeed, the expected distance covered by a random walk grows as the square root of the number of time steps. We thus use the temporal distance between observations as a surrogate similarity measure. More precisely, we let a random agent explore the environment for T steps and collect the sequence of observations o_1, ..., o_T. We then define a reachability label for each pair of observations (o_i, o_j) based on their distance in the sequence, i.e., the label is equal to 1 if |i - j| <= k and 0 otherwise, k being a hyperparameter.
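As an illustration, this labeling rule can be sketched in a few lines of Python. The temporal-threshold rule follows the text; the exhaustive enumeration of pairs is an assumption made for simplicity, since the paper's pair-sampling scheme is not detailed here.

```python
def reachability_pairs(num_obs, k):
    """Label pairs of observation indices from one random-walk trajectory.

    A pair (i, j) gets label 1 if the two observations are at most k time
    steps apart, and 0 otherwise.  `k` is the temporal-threshold
    hyperparameter; enumerating all pairs is an illustrative choice.
    """
    pairs = []
    for i in range(num_obs):
        for j in range(i + 1, num_obs):
            label = 1 if (j - i) <= k else 0
            pairs.append((i, j, label))
    return pairs
```

In practice one would subsample pairs to keep positives and negatives balanced rather than enumerate all of them.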

Learning visual features.

We train a siamese neural network R to predict the reachability label from the input pair of observations with a logistic regression. It is parameterized by a feedforward network g and a convolutional network f such that R(o_i, o_j) = g(f(o_i), f(o_j)) [27]. We use the resulting convolutional network f to form visual features, and the siamese network R in the reward function of the exploration module. This stage is summarized in Fig. 2.

Figure 3: Exploration and navigation stages. The agent first learns to explore (left) the environment using a Scene Memory Buffer (SMB) of visited regions for its policy and intrinsic reward. Next, the agent learns to navigate (right) by using the SMB to set image-based goals for itself and learning to navigate towards them.

4.2 Stage 2: Learning to Explore

Once the agent can differentiate images of nearby locations from distant locations, it can explore and map the environment. In this section, we describe how to train our exploration module with a curiosity-driven intrinsic reward, which is the second stage of our training.

4.2.1 Exploration module.

The agent explores the environment to dynamically build an internal map. At each step, this map and the current observation are used to plan an action that moves the agent toward unexplored regions. We model the internal map as a scene memory buffer that contains important past observations, and the agent takes actions by applying a Transformer on this memory buffer. This stage is shown in Fig. 3 (left).

Scene Memory Buffer.

The agent has a Scene Memory Buffer (SMB) module that stores some of its previous observations. At each time step t, the SMB M_t stores an unstructured set of visual features. Storing every observation is not efficient, so we follow the mechanism of Savinov et al. [27] to select which observations to store. The idea is to add only observations that are distant from the current memory vectors. Since the siamese network R has been trained explicitly to distinguish close from distant observations, we compute a score of novelty by comparing the current observation o_t with the SMB, i.e., s_t = max_{m in M_t} R(o_t, m), and we update the SMB by adding the visual features of o_t to M_t whenever s_t falls below a threshold that influences the radius covered by each memory vector in the SMB. The SMB is reset after each episode.
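A minimal sketch of this add-if-novel update rule follows. The scalar `reach` function is a toy stand-in for the trained siamese network, which in reality compares image features; only the thresholded max-similarity rule comes from the text.

```python
def novelty_score(feature, memory, reach):
    """Highest reachability (similarity) score between the candidate
    feature and any vector already stored in the memory buffer."""
    return max(reach(feature, m) for m in memory)

def update_smb(memory, feature, reach, tau):
    """Add `feature` to the buffer only if it looks unreachable (novel)
    from every stored vector, i.e. its best similarity is below `tau`.
    Returns the buffer and whether the feature was added."""
    if not memory or novelty_score(feature, memory, reach) < tau:
        memory.append(feature)
        return memory, True
    return memory, False
```

The buffer would be cleared at the start of every episode, matching the reset described above.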

Transformer on the SMB.

The exploration policy exploits the SMB to move toward unexplored locations through a Transformer. More precisely, we apply a Transformer layer on top of the SMB and the visual features of the current location to form a vector h_t; the logits of the policy and its value function are both linear functions of this vector. Overall, at time step t, the agent receives an observation o_t and has an SMB M_t, from which we compute h_t = LN(MLP(Att(CNN(o_t), M_t))), where Att, MLP and LN denote, respectively, the multi-head attention, the feedforward and the layer-normalization sublayers of a Transformer. Note that this CNN is a convolutional network distinct from f. We also add absolute temporal position embeddings to encode the temporal distance between the current time step and the moment a memory vector was inserted in the SMB. We refer the reader to Vaswani et al. [31] for more details on Transformers.
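As a rough NumPy-only sketch of how the current view attends over the memory: the real model uses a multi-head Transformer sublayer with learned projections, temporal position embeddings, and a trained feedforward network; everything below is a simplified single-head stand-in.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend_over_memory(query, memory):
    """Toy single-head dot-product attention of the current-view feature
    (query) over the memory vectors, followed by crude stand-ins for the
    feedforward and layer-normalization sublayers of a Transformer layer."""
    memory = np.asarray(memory, dtype=float)         # shape (n, d)
    scores = memory @ query / np.sqrt(query.size)    # one score per slot
    weights = softmax(scores)                        # attention distribution
    attended = weights @ memory                      # weighted memory readout
    h = np.tanh(attended + query)                    # stand-in MLP + residual
    return (h - h.mean()) / (h.std() + 1e-5)         # layer normalization
```

The policy logits and value would then be linear functions of the returned vector.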

4.2.2 Intrinsic exploration reward.

Intrinsic curiosity rewards the agent for exploring parts of the environment that look unfamiliar to it. This reward is based on the agent’s intrinsic representation of the environment; in our case, this representation is the Scene Memory Buffer, and a positive reward is given if and only if the current observation has been added to the SMB.

This reward is a discrete version of the episodic curiosity bonus [27]. Discretizing the reward removes the trivial solutions noticed in [27], where the agent stops in a location that gives a reward greater than that of any reachable location.
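The contrast between the two bonus shapes can be sketched as follows. The continuous form below is an assumed simplification of the episodic-curiosity bonus, not the exact formula from [27]; the discrete form is the all-or-nothing rule described above.

```python
def discrete_bonus(was_added_to_smb):
    """Discretized bonus: reward only the steps whose observation was
    novel enough to be inserted into the SMB."""
    return 1.0 if was_added_to_smb else 0.0

def continuous_bonus(similarity, alpha=1.0, beta=0.5):
    """Assumed continuous-curiosity shape: larger when the current
    observation looks less familiar.  An agent can park at a spot whose
    similarity sits just above the add-threshold and still collect a
    near-maximal bonus at every step, which is the failure mode the
    discrete version removes."""
    return alpha * (beta - similarity)
```

With the discrete bonus, standing still earns nothing once the surroundings are already in memory.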

4.3 Stage 3: Learning to Navigate

In this section, we describe the third stage of our algorithm: the training of our navigation module. Every episode starts with an exploration phase, in which the exploration module builds an internal map of the environment. This is followed by a navigation phase that trains the navigation module to reach a goal sampled from the map. The internal map is also used to generate the intrinsic navigation reward. The trained navigation module does not need to follow the visited locations on the map: these are only used during training to shape the reward. In particular, at test time the navigation policy can be more efficient than policies that plan over visited locations on the map. This stage of the training is depicted in Fig. 3 (right).

4.3.1 Building an internal map.

In the exploration phase of an episode, an internal map of the environment is built by the exploration policy trained in the previous stage. The exploration policy runs for a fixed number of steps and fills the SMB with visual representations of locations. While the SMB alone is sufficient for training the navigation module with sparse rewards, we also record the connectivity of those locations, to be leveraged in the dense-reward version of the training.

The path followed by the agent connects different memory vectors in the SMB. We use this path to form a directed graph G on top of the SMB. More precisely, after updating the SMB at each step, we keep track of the memory element closest to the current observation; note that this element is the current observation’s own features if they were just added to the SMB. Whenever this closest element differs from the one at the previous step, we add an edge between the two to G. The result is a directed graph representing paths between the memory vectors of the SMB.
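A compact sketch of this bookkeeping, where memory elements are represented by integer indices and the rule for finding the closest element is abstracted away:

```python
def update_graph(edges, prev_closest, closest):
    """Add a directed edge between the memory elements that were closest to
    the agent at two consecutive steps, whenever they differ.  Returns the
    new `closest`, to be passed as `prev_closest` at the next step."""
    if prev_closest is not None and closest != prev_closest:
        edges.add((prev_closest, closest))
    return closest

# Sketch: a path whose closest memory slots are 0 -> 0 -> 1 -> 2 -> 1.
edges = set()
prev = None
for node in [0, 0, 1, 2, 1]:
    prev = update_graph(edges, prev, node)
```

Repeated visits to the same slot add no edge, so the graph stays sparse.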

4.3.2 Navigation module.

The navigation module takes as input the current observation o_t as well as a target observation o*. The module transforms these observations into features with a CNN and concatenates the resulting features. We then apply a Transformer layer on top of this concatenated vector and the SMB, resulting in a feature h_t computed in the same way as in the exploration module, with the concatenated features playing the role of the query.

As in the exploration module, the policy and value function are linear functions of the feature h_t. Note that the sets of parameters of the attention modules for the exploration and navigation modules are different, but the CNNs are shared.

4.3.3 Memory based navigation reward.

After the exploration phase of an episode, the navigation phase starts by setting a randomly selected element of the SMB as a goal to navigate towards. A positive intrinsic reward is given if the agent considers, based on its reachability network, that it has reached the target location, i.e., if the reachability score between the current observation and the goal exceeds a threshold.

This is an intrinsic reward built solely on the agent’s ability to perceive whether it has reached the goal sampled from its SMB. However, this reward is sparse, and we propose to densify it by further exploiting the SMB.

Dense intrinsic navigation reward.

We leverage the graph G to form a dense navigation reward by computing a graph-based approximation of the distance to the goal. More precisely, at each time step t, we compute the shortest path in G between the memory element closest to the current observation and the goal element, and denote by d_t its length. We then add a dense reward based on the decrease of this distance. Note that, since we update the graph as we navigate the environment, this reward may change over time for the same target and memory vector. Note also that this bonus rewards only absolute progress towards the goal, so the total reward accumulated over an episode is equal to the length of the shortest path as estimated at the beginning of the episode. Overall, we use both the dense and the sparse reward during the navigation phase.
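A possible implementation of the graph-distance computation, paired with a progress-only shaping whose episode total telescopes to the initial distance estimate. The BFS is standard; the exact reward shape is an assumption, since the paper's formula is not fully specified here.

```python
from collections import deque

def shortest_path_length(edges, start, goal):
    """Breadth-first shortest-path length (in hops) over the directed
    SMB graph given as a set of (u, v) edges."""
    if start == goal:
        return 0
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in adj.get(node, []):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # goal unreachable in the current graph

def dense_reward(best_so_far, current_dist):
    """Pay only for absolute progress: reward each hop of improvement over
    the best graph distance seen so far, so the per-episode total telescopes
    to the initial distance estimate.  Returns (reward, new best)."""
    if current_dist is None or current_dist >= best_so_far:
        return 0.0, best_so_far
    return float(best_so_far - current_dist), current_dist
```

Backtracking earns nothing, and re-approaching previously reached distances is not paid twice.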

5 Experimental Evaluation

In this section we present the empirical evaluation of our model. We evaluate both the exploration and the navigation modules. Let us start by describing the data we use and providing technical details of our experimental setup.

5.1 Datasets.

For a realistic setup, we perform all of our experiments on scenes taken from the Gibson dataset [33], running the simulations inside the Habitat-sim framework [18]. We selected a subset of eight scenes from the Gibson dataset based on the quality of the 3D mesh, the surfaces, and the number of floors, following the study presented in [18]. The scenes are fairly complex: they have 16 rooms on average, spanning multiple floors. Some statistics for the selected scenes are provided in the supplementary material. The action set contains three actions: moving forward by one meter, and turning right or left by 45 degrees. We only keep the RGB data, discarding the depth channel.

In this work, we make the assumption that the agent is always spawned at the same location in a scene. To achieve this, for each scene we manually select a starting position corresponding to the entrance door of the house.

5.2 Implementation Details.

Visual Representation Learning.

We implement the reachability network R as a siamese network with a ResNet-18 as the function f, and use a comparison function g composed of two fully-connected hidden layers. For each scene, we sample random trajectories and extract pairs of observations from them, yielding a large dataset of image pairs; the maximal temporal distance for a positive pair is set to five steps. We train this network using SGD with momentum and weight decay, and no dropout. We do not share parameters between scenes.

Exploration and Navigation.

For our CNN module, we use a small network of convolutional layers. For the attention on the memory, we use a multi-head attention with two heads followed by a feedforward network. We train the policy using PPO, where each batch consists of 16 full episodes, and optimize the parameters using RMSprop with a learning-rate warm-up phase; for this model we also use dropout. As with the reachability network, we do not share parameters between scenes.

5.3 Main Results

The main experiment in our evaluation checks how well our agent navigates to new test goals. After training itself to navigate to elements of the memory, the agent can be given a new goal feature as a target. In this experiment, we sampled 100 random locations from each scene and saved the corresponding RGB observation and location. For each scene and each target location, we first run 1000 steps of exploration to fill the memory and then launch the navigation episode. The navigation episode lasts for 1000 steps in total, and as soon as a goal is reached, a new goal is sampled.

Evaluation Metrics.

We evaluate success by computing the number of targets, out of 100, that the agent reached successfully. A target is considered reached if the agent navigates to a distance of at most one meter from the target location. The first metric we compute is the success rate, which simply corresponds to the fraction of goals that were reached within the allocated 1000 steps. Since this measure does not account for the length of the path taken, the second metric we report is the SPL metric [1]. Let us assume that we have access to the length l_i of the shortest path from the starting location to goal i, computed by the simulator. If we write S_i for the indicator of success as defined above, and p_i for the metric length of the trajectory obtained with our algorithm, the SPL over N goals is defined as follows:

SPL = (1/N) * sum_i S_i * l_i / max(p_i, l_i)
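In code, the metric is a direct transcription of the standard SPL definition of Anderson et al. [1]:

```python
def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length: the mean over goals of
    S_i * l_i / max(p_i, l_i), where S_i is the binary success indicator,
    l_i the simulator's shortest-path length, and p_i the metric length of
    the agent's trajectory."""
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, path_lengths):
        total += s * l / max(p, l)
    return total / len(successes)
```

A failed goal contributes zero, and a successful but wasteful trajectory contributes less than one.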
In order to evaluate the quality of our navigation algorithm, we compare our model to three baselines: SPTM, Supervised and Random. We describe these baselines in more detail here.

First, we compare our algorithm to Semi-Parametric Topological Memory (SPTM) [26]. To adapt SPTM to the environments used in our experiments, we train its action and edge prediction networks on them. For each scene, we train the networks for 300 epochs of 1000 batches each, with a batch size of 64. Samples in the batches are obtained from random trajectories that are sampled online. This amounts to a number of environment steps comparable to the number used to train our method (exploration and navigation). Since SPTM requires an expert human-provided exploration trajectory, we use random exploration in its place.

Adrian Albert. Arkan. Ballou Capist. Goffs Mosq. Sanc. Mean
Random 13.5 19.3 16.4 10.5 26.0 9.3 10.6 12.6 14.8
SPTM [26] 25.5 23.5 20.2 9.7 38.6 9.3 16.9 10.1 19.2
Supervised 27.5 30.5 21.9 11.1 45.8 14.8 13.0 17.4 22.8
Ours-sparse 27.8 39.9 30.6 17.0 60.9 15.1 16.3 32.9 30.1
Ours-dense 35.6 45.2 32.3 27.8 65.9 16.8 18.8 24.5 33.4
Table 1: Navigation performance as measured by the SPL metric for our method (Ours) and selected baselines on all considered environments.

Second, we also compare against a feedforward policy trained with supervised rewards (Supervised). This policy is trained using RL, assuming that at each step the distance from the agent to the goal is known. In that setup, the agent receives a reward of 10 when this distance falls below one meter, which is equivalent to the success criterion defined above. Please note that this feedforward policy is trained on the same set of 100 goals that are used during evaluation. For reference, we also provide the performance of random navigation.

Figure 4: Navigation performance broken down by physical shortest distance from start to goal location. Left: Histogram of distances from start to goal in our evaluation dataset. Center: Breakdown of the SPL metric by distance. Right: Breakdown of the flat success rate by distance. We clearly see the advantage of using dense rewards for learning to navigate to far-away locations.

We run the evaluation for the baselines and for our method with sparse or dense rewards, and report the results for each scene in Table 1. We can make a couple of observations about this experiment. First of all, our method outperforms all the baselines by a large margin on all of the scenes. Surprisingly, it even works better than the supervised agent, which utilizes location information to which our method has no access. This can be explained by our architectural choice of conditioning the navigation module on the memory of previously visited states; in contrast, the Supervised baseline is only a feedforward network and has no representation of past observations.

Second, the SPTM baseline performs poorly compared to our method, with only a small improvement over the randomly acting agent. This can be explained by the fact that SPTM only has access to a random exploration trajectory, limiting the set of goals that it can ever reach. Moreover, SPTM restricts navigation to its exploration graph, severely limiting the possible routes to the goal. In comparison, our method encourages the agent to reach the goal as fast as possible by taking any possible route: our dense reward does use the graph, but only as a guide that can be completely ignored if better solutions exist.

Finally, we see that the dense reward generated using the graph, as described in Sec. 4.3.3, allows us to train a better navigation policy, outperforming the sparse reward on most of the scenes. Indeed, for our agent, this dense reward corresponds to a discrete distance over the graph that leads to the goal when minimized. This effect is clearer when we measure performance for different goal distances, as shown in Fig. 4: the gap between the dense and sparse rewards widens for far-away goals. This is likely because the graph provides intermediate goals, which helps greatly when the goal cannot be reached easily.

5.4 Analysis of Exploration

As mentioned in Sec. 4.3, the coverage obtained during the exploration stage is critical for the final navigation task. In this section, we want to evaluate the quality of this exploration stage alone.

Evaluation Metrics.

The goal of the exploration stage is to train an agent to explore and map an environment without any form of supervision. For this experiment, we follow previous work and evaluate the quality of the exploration using a coverage metric. To define this metric, we discretize the environment using a grid with cells of equal size and, at the end of the episode, report the number of cells visited by the agent. Since the environments we consider can have multiple floors, we infer the floors of the environment by sampling random locations and keeping the most frequent heights, separated by a minimum vertical distance, using non-maximal suppression. We then keep one coverage grid per floor.
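The per-floor coverage computation itself is straightforward; a sketch for a single floor, where the one-meter default cell size is an illustrative assumption:

```python
import math

def coverage(positions, cell_size=1.0):
    """Count the distinct grid cells visited, given (x, y) positions of the
    agent on one floor.  Each position is binned into a cell of side
    `cell_size`; the set of cells gives the covered area in cell units."""
    cells = {(math.floor(x / cell_size), math.floor(y / cell_size))
             for x, y in positions}
    return len(cells)
```

Multiplying the returned count by the cell area converts the metric to square meters.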


First, we compare our exploration module to Episodic Curiosity (EC) [27]. In that baseline, unlike in our method, the policy has no dependency on the past: the policy and value function depend directly on the features of the current observation instead of the memory-conditioned vector produced by the Transformer. Another difference between our method and the EC baseline is the nature of the intrinsic reward: while the original bonus proposed in [27] is continuous, we use a discretized version instead. Note that we cannot compare to EC on the navigation task in Sec. 5.3, because it does not provide a means of navigating without supervision.

Second, we include a Supervised policy trained using the “oracle” reward, i.e., the measure that we use for evaluation. In this case, we densely reward the agent every time a new cell is visited. Apart from using a different source of reward, all parameters for this model are the same as for our model.

Figure 5: Quantitative and qualitative evaluation of the exploration phase. Left: Performance of exploration policies as measured by the coverage metric in square meters. We compare the performance of our model to a baseline (EC) and a supervised topline. Right: Visualization of the graph built during exploration in the Ballou environment.

The performance evolution of our method and the baselines during training is shown in Fig. 5 (left). The coverage metric is averaged over the eight scenes. Our method performs comparably to the supervised agent, which can be considered an upper bound as it directly optimizes the coverage metric. In Fig. 5 (right), we show an example of exploration behavior learnt by our agent. The nodes of this graph are states added to the SMB by the agent, and they are connected following the rule described in Sec. 4.3. We see that the agent has explored most of the house successfully and made connections consistent with its topology, which will assist the training of the navigation module. Surprisingly, we observed that the agent trained with vanilla EC does not learn a good exploration policy. We investigate the reason for this in the following experiment.

Ablation of the exploration model.

In [27], the authors propose a continuous curiosity reward based on the reachability of the current observation from states stored in memory. In this ablation study, we exhibit the importance of our improvements over [27], namely the discrete bonus and the attention mechanism over the SMB. To this end, we show the evolution of the intrinsic reward and of the coverage metric for three models on the Ballou environment: our full model, vanilla EC, and an exploration policy such as ours but with no memory (Fig. 6).

We observe that using the continuous reward lets the agent find trivial maxima by exploiting the reward design: the total episode reward converges to a value just below βT, where T is the number of steps and β the reward threshold (see Fig. 6, Left). Although the agent trains properly and optimizes its reward, it performs poorly on the metric we care about, the coverage metric, as shown in Fig. 6 (Center). We provide a qualitative view of this phenomenon by visualizing the agent’s path, as well as the spatial location of elements in the memory, for both continuous and discrete rewards in Fig. 6 (Right). The agent trained with discrete rewards manages to explore the scene properly, whereas the agent trained with continuous intrinsic rewards gets stuck in a specific subpart of the environment where it keeps receiving a reward just below the threshold β.
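The failure mode above can be illustrated with toy versions of the two bonuses. The linear form α(β − c) and the constants below are illustrative assumptions, not the exact expressions from [27] or from our method; c stands for the reachability score of the current observation with respect to memory (high c means the state is already well covered).

```python
def continuous_bonus(c, alpha=1.0, beta=0.5):
    """Toy continuous curiosity bonus: positive whenever reachability c is
    below beta. An agent can loop through states whose c stays low without
    actually covering new area, harvesting a small positive reward forever."""
    return alpha * (beta - c)

def discrete_bonus(c, beta=0.5, bonus=1.0):
    """Toy discretized bonus: a fixed reward is granted only when the state
    is novel enough (c below beta) to be added to memory; revisiting it
    later yields a high c and therefore zero reward."""
    return bonus if c < beta else 0.0
```

Under the continuous bonus, T repeated steps with c held just above zero accumulate close to αβT, matching the plateau observed in training; the discrete bonus removes this exploit because each novel state pays out only once before entering memory.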

Figure 6: Ablation study of our exploration policy. We report the performance for our model, the EC baseline, as well as a variant of our model with no attention mechanism on the SMB. Left: Evolution of the intrinsic bonus reward as a function of iterations. Center: Evolution of the coverage metric. Right: Visualization of the trajectories obtained with a policy trained with continuous and discrete bonus rewards.

6 Conclusion

We have shown how to train an agent to perform goal-directed navigation in photorealistic environments without using any extrinsic rewards. Our agent trains in a purely self-supervised manner, using only RGB image observations. The model is composed of three interconnected components: one that learns visual representations, a second that explores the environment, and a third that teaches itself to navigate. We have shown that our self-supervised navigation model manages to navigate to novel test goals.

In future work, we can consider several natural extensions of this model. First, we would like to train all components of the model end-to-end. Second, we want to study the generalization capabilities of our method by training it on a large set of scenes with shared parameters and testing it on previously unseen environments. Finally, we can improve the use of the SMB by including the graph structure in the attention mechanism for both the exploration and navigation policies.


We would like to thank Dhruv Batra, Oleksandr Maksymets, Danielle Rothermel and Hervé Jégou for their invaluable help and constructive comments throughout this project.


  • [1] P. Anderson, A. Chang, D. S. Chaplot, A. Dosovitskiy, S. Gupta, V. Koltun, J. Kosecka, J. Malik, R. Mottaghi, M. Savva, et al. (2018) On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757. Cited by: §5.3.
  • [2] G. Avraham, Y. Zuo, T. Dharmasiri, and T. Drummond (2019) EMPNet: neural localisation and mapping using embedded memory points. Cited by: §1, §2.
  • [3] M. Caron, P. Bojanowski, A. Joulin, and M. Douze (2018) Deep clustering for unsupervised learning of visual features. Cited by: §1.
  • [4] A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niessner, M. Savva, S. Song, A. Zeng, and Y. Zhang (2017) Matterport3D: learning from rgb-d data in indoor environments. In 3DV, Cited by: §1.
  • [5] D. S. Chaplot, E. Parisotto, and R. Salakhutdinov (2018) Active neural localization. Cited by: §2.
  • [6] T. Chen, S. Gupta, and A. Gupta (2019) Learning exploration policies for navigation. Cited by: §2.
  • [7] N. Chentanez, A. G. Barto, and S. P. Singh (2005) Intrinsically motivated reinforcement learning. Cited by: §2.
  • [8] C. Doersch, A. Gupta, and A. A. Efros (2015) Unsupervised visual representation learning by context prediction. Cited by: §1.
  • [9] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun (2017) CARLA: An open urban driving simulator. In CoRL, Cited by: §1.
  • [10] B. Eysenbach, R. Salakhutdinov, and S. Levine (2019) Search on the replay buffer: bridging planning and reinforcement learning. arXiv preprint arXiv:1906.05253. Cited by: §2.
  • [11] K. Fang, A. Toshev, L. Fei-Fei, and S. Savarese (2019) Scene memory transformer for embodied agents in long-horizon tasks. Cited by: §1, §2.
  • [12] P. Goyal, D. Mahajan, A. Gupta, and I. Misra (2019) Scaling and benchmarking self-supervised visual representation learning. Cited by: §1.
  • [13] S. Gupta, J. Davidson, S. Levine, R. Sukthankar, and J. Malik (2017) Cognitive mapping and planning for visual navigation. Cited by: §1, §2.
  • [14] S. Gupta, D. Fouhey, S. Levine, and J. Malik (2017) Unifying map and landmark based representations for visual navigation. arXiv preprint arXiv:1712.08125. Cited by: §2.
  • [15] J. F. Henriques and A. Vedaldi (2018) Mapnet: an allocentric spatial memory for mapping environments. Cited by: §1, §2.
  • [16] A. Khan, C. Zhang, N. Atanasov, K. Karydis, V. Kumar, and D. D. Lee (2018) Memory augmented control networks. Cited by: §1, §2.
  • [17] A. Kumar, S. Gupta, D. Fouhey, S. Levine, and J. Malik (2018) Visual memory for robust path following. Cited by: §2.
  • [18] Manolis Savva*, Abhishek Kadian*, Oleksandr Maksymets*, Y. Zhao, E. Wijmans, B. Jain, J. Straub, J. Liu, V. Koltun, J. Malik, D. Parikh, and D. Batra (2019) Habitat: A Platform for Embodied AI Research. Cited by: §1, §1, §5.1.
  • [19] P. Mirowski, M. Grimes, M. Malinowski, K. M. Hermann, K. Anderson, D. Teplyashin, K. Simonyan, A. Zisserman, R. Hadsell, et al. (2018) Learning to navigate in cities without a map. Cited by: §2.
  • [20] P. Mirowski, R. Pascanu, F. Viola, H. Soyer, A. J. Ballard, A. Banino, M. Denil, R. Goroshin, L. Sifre, K. Kavukcuoglu, et al. (2017) Learning to navigate in complex environments. Cited by: §2.
  • [21] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos (2015) ORB-slam: a versatile and accurate monocular slam system. IEEE transactions on robotics 31 (5), pp. 1147–1163. Cited by: §2.
  • [22] J. Oh, V. Chockalingam, S. Singh, and H. Lee (2016) Control of memory, active perception, and action in minecraft. Cited by: §1, §2.
  • [23] E. Parisotto and R. Salakhutdinov (2018) Neural map: structured memory for deep reinforcement learning. Cited by: §1, §2.
  • [24] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell (2017) Curiosity-driven exploration by self-supervised prediction. Cited by: §2.
  • [25] F. Sadeghi and S. Levine (2017) Cad2rl: real single-image flight without a single real image. Cited by: §1.
  • [26] N. Savinov, A. Dosovitskiy, and V. Koltun (2018) Semi-parametric topological memory for navigation. In ICLR, Cited by: §2, §5.3, Table 1.
  • [27] N. Savinov, A. Raichuk, R. Marinier, D. Vincent, M. Pollefeys, T. Lillicrap, and S. Gelly (2019) Episodic curiosity through reachability. Cited by: 2nd item, §1, §2, Figure 2, §4.1, §4.1, §4.2.1, §4.2.2, §5.4, §5.4.
  • [28] J. Schmidhuber (1991) Curious model-building control systems. Cited by: §2.
  • [29] S. Thrun, W. Burgard, and D. Fox (2005) Probabilistic robotics. MIT press. Cited by: §1, §2.
  • [30] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel (2017) Domain randomization for transferring deep neural networks from simulation to the real world. Cited by: §1.
  • [31] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. Cited by: §1, §2, §4.2.1.
  • [32] D. Wierstra, A. Förster, J. Peters, and J. Schmidhuber (2010) Recurrent policy gradients. Logic Journal of the IGPL 18 (5), pp. 620–634. Cited by: §1.
  • [33] F. Xia, A. R. Zamir, Z. He, A. Sax, J. Malik, and S. Savarese (2018) Gibson Env: real-world perception for embodied agents. In CVPR. Note: Gibson dataset license agreement available at https://storage.googleapis.com/gibson_material/Agreement%20GDS%2006-04-18.pdf Cited by: Learning to Visually Navigate in Photorealistic Environments Without any Supervision, §1, §1, §5.1.
  • [34] W. Yang, X. Wang, A. Farhadi, A. Gupta, and R. Mottaghi (2019) Visual semantic navigation using scene priors. Cited by: §2.
  • [35] A. Zhang, A. Lerer, S. Sukhbaatar, R. Fergus, and A. Szlam (2018) Composable planning with attributes. arXiv preprint arXiv:1803.00512. Cited by: §2.
  • [36] J. Zhang, L. Tai, J. Boedecker, W. Burgard, and M. Liu (2017) Neural slam: learning to explore with external memory. arXiv preprint arXiv:1706.09520. Cited by: §1, §2.
  • [37] R. Zhang, P. Isola, and A. A. Efros (2016) Colorful image colorization. Cited by: §1.
  • [38] Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, and A. Farhadi (2017) Target-driven visual navigation in indoor scenes using deep reinforcement learning. Cited by: §2.