Unsupervised Domain Adaptation for Visual Navigation

by Shangda Li, et al.
Carnegie Mellon University

Advances in visual navigation methods have led to intelligent embodied navigation agents capable of learning meaningful representations from raw RGB images and performing a wide variety of tasks involving structural and semantic reasoning. However, most learning-based navigation policies are trained and tested in simulation environments. For these policies to be practically useful, they need to be transferred to the real world. In this paper, we propose an unsupervised domain adaptation method for visual navigation. Our method translates the images in the target domain to the source domain such that the translation is consistent with the representations learned by the navigation policy. The proposed method outperforms several baselines across two different navigation tasks in simulation. We further show that our method can be used to transfer the navigation policies learned in simulation to the real world.




1 Introduction

In the past few years, a lot of progress has been made in learning to navigate from first-person RGB images. Reinforcement learning has been applied to train navigation policies to navigate to goals specified by coordinates Gupta et al. (2017); Chaplot et al. (2020b); Wijmans et al. (2020), images Zhu et al. (2017c), object labels Gupta et al. (2017); Yang et al. (2018), room labels Wu et al. (2018, 2019) and language instructions Hermann et al. (2017); Chaplot et al. (2018); Anderson et al. (2018b); Fried et al. (2018); Chen et al. (2019a); Wang et al. (2019). However, such navigation policies are predominantly trained and tested in simulation environments. Our goal is to have such navigation capabilities in the real world. While some progress has been made towards moving from game-like simulation environments to more realistic simulation environments based on reconstructions Xia et al. (2018); Chang et al. (2017); Straub et al. (2019) or 3D modeling Kolve et al. (2017), there is still a significant gap between simulation environments and the real world.

Training the above navigation policies in the real world has not been possible, as current reinforcement learning methods typically require millions of samples for training. Even if we parallelize the training across multiple robots, it will still require multiple weeks of training with constant human supervision due to safety concerns and battery limitations. This makes real-world training practically infeasible and leaves us with the other option of transferring models trained in simulation to the real world, which highlights the importance of domain adaptation methods.

Among domain adaptation techniques, unsupervised methods are favorable because it is extremely expensive to collect parallel data for the purpose of visual navigation. It essentially requires reconstructing real-world scenes in the simulator separately for all possible scenarios one might deploy the navigation model in such as different lighting conditions, time of day, indoor vs outdoor, weather conditions, and so on. Reconstructing real-world scenes is a tedious job requiring specialized cameras and significant human effort. Unsupervised learning methods have the potential to overcome this difficulty since they require only a few real-world images taken by regular cameras.

One possible solution involves using unsupervised image translation techniques to translate visual perception between simulation and the real world and thereby adapt the navigation policy learned in simulation to the real world. Although there is a rich body of prior work on unsupervised image translation techniques that transfer images from one domain to another Zhu et al. (2017a); Liu et al. (2017); Huang et al. (2018), these techniques are not well suited for navigation, since the image translations are agnostic of the navigation policy and instead focus on photorealism and clarity.

Figure 1: PBIT. The proposed policy-based image translation for unsupervised visual navigation adaptation.

In this paper, we propose an unsupervised domain adaptation method for transferring navigation policies from simulation to the real world, using unsupervised image translation subject to the constraint that the image translation respects the agent’s policy. In order to learn policy-based image translation (PBIT) in an unsupervised fashion, we devise a disentanglement of content and style in images such that the representations learnt by the navigation policy are consistent for images with the same content but different styles. See Figure 1 for an illustration of PBIT. Our experiments show that the proposed method outperforms the baselines in transferring navigation policies for different tasks between two simulation domains and from simulation to the real world.

2 Related Work

Simulation to real-world (Sim-to-Real) transfer of visual navigation policies requires adapting both the visual perception and the agent’s policy. From the wide range of relevant literature, we focus on related work on visual navigation, visual domain adaptation and policy transfer.

Visual Navigation. Visual navigation has been widely studied in robotics for over two decades. Bonin-Font et al. (2008) provide an in-depth survey of visual navigation methods in classical robotics. In the past few years, there has been an increasing focus on learning-based methods for visual navigation. This includes methods which tackle navigation tasks primarily requiring geometrical scene understanding, such as the pointgoal task Gupta et al. (2017); Anderson et al. (2018a), where the relative coordinate to the goal is given, and the exploration task Chen et al. (2019b); Fang et al. (2019); Chaplot et al. (2020b), where the objective is to maximize the explored area. There has also been a lot of work on navigation tasks involving more semantics, such as image-goal Zhu et al. (2017c); Chaplot et al. (2020c), object-goal Yang et al. (2018); Wortsman et al. (2019); Mousavian et al. (2019); Chaplot et al. (2020a); Chang et al. (2020), high-level language goal Hermann et al. (2017); Chaplot et al. (2018) and low-level language instructions Anderson et al. (2018b); Fried et al. (2018).

While performance on semantic navigation tasks is still far from perfect even in simulation, recent improvements in both visual simulation quality Savva et al. (2019); Xia et al. (2018); Chang et al. (2017); Straub et al. (2019) and algorithms Sax et al. (2019); Chaplot et al. (2020b); Wijmans et al. (2020) have led to impressive results on geometric navigation tasks. However, most of the above works train navigation policies using reinforcement or imitation learning in simulation and test on different scenes in the same domain in the simulator. Some prior works which tackle sim-to-real transfer for navigation policies directly transfer the policy trained in simulation to the real world without any domain adaptation technique Gupta et al. (2017); Chaplot et al. (2020b). We show that the proposed domain adaptation method can lead to large improvements over direct policy transfer.

Visual Domain Adaptation. Simulation and the real world can be viewed as two distinct visual domains, and adapting their visual perceptions can be regarded as an image-to-image translation task. Thanks to the success of Generative Adversarial Networks (GANs) Goodfellow et al. (2014) in matching cross-domain distributions, we are able to adapt an image across domains without changing its context. For example, pix2pix Isola et al. (2017) changes only the style of an image (e.g., photograph portrait) while preserving its context (e.g., the same face of a person). We note that, for Sim2Real navigation, some amount of context should be preserved across domains, such as the obstacles and walls, to prevent collisions.

If we have access to paired cross-domain images, then pix2pix Isola et al. (2017) and BicycleGAN Zhu et al. (2017b) serve as good candidates to model the context-preserving adaptation. However, paired data between simulation and the real world is notoriously hard to collect Tzeng et al. (2015) or may not exist at all (e.g., we cannot always build simulators for new environments). To tackle this challenge, numerous visual domain adaptation approaches Taigman et al. (2016); Shrivastava et al. (2017); Zhu et al. (2017a); Kim et al. (2017); Yi et al. (2017); Liu et al. (2017) have been proposed to relax the constraint of requiring paired data during training. Nevertheless, the above methods still assume a one-to-one correspondence across domains. As an example, these models can only generate the same target-domain image given a source-domain image. We argue that it is more realistic to assume a many-to-many mapping between simulation and the real world.

To learn multimodal mappings without paired data, prior works Huang et al. (2018); Lee et al. (2018) disentangle the context and style of an image. Precisely, they assume the context is shared across domains and the styles are specific to each domain. Note that these models focus on realistic image generation, and hence it remains unclear how image translation benefits cross-domain visual navigation. To further bridge the gap between navigation and image translation, our key idea is to ensure that the agent’s navigation policy is consistent under domain translation. As a consequence, we propose to enforce constraints such that the agent’s policy is only inferred from the context shared across simulation and the real world.

Policy Transfer. Existing works on sim-to-sim or sim-to-real policy transfer require domain knowledge specific to certain environments and tasks, as well as extra human supervision. Zhang et al. (2019) employs a semantic loss and a shift loss for consistent generation to facilitate the domain translation. Sadeghi (2019) builds a new simulator with diverse layouts populated with diverse furniture placements and defines several auxiliary tasks that are specific to the Collision-Free Goal Reaching task. Müller et al. (2018) trains the domain translation module of its driving agent on a human-labeled segmentation dataset to generalize its learned policy from one domain to the other. Bousmalis et al. (2018) utilizes a customized simulator to improve a policy’s real-world performance, given a policy already trained in the real world, which is different from our setting where the policy is trained in simulation and then transferred to the real world. Gordon et al. (2019) targets similar visual navigation problems and defines auxiliary tasks for better sim-to-sim transfer, which is complementary to our contribution. Blukis et al. (2019) relies on Supervised Learning for Visitation Prediction, which requires human-supervised trajectories in both the real world and the simulation.

3 Methods

Denote the source domain as $\mathcal{S} = (\mathcal{X}_S, \mathcal{A}, P_S)$ and the target domain as $\mathcal{T} = (\mathcal{X}_T, \mathcal{A}, P_T)$, where $\mathcal{X}$ is the state space, $\mathcal{A}$ is the set of actions, and $P$ is the transition probability distribution. Note that we assume the action space $\mathcal{A}$ is shared across domains. Let $\pi_S$ be a navigation policy in the source domain (the navigation policy is given). For our task setup, we have access to some target-domain images during training, but we cannot perform target-domain policy ($\pi_T$) training. Our objective is to learn a many-to-many mapping $G_{T \to S}$ from target-domain images to source-domain images such that the navigation policy under this mapping, $\pi_S \circ G_{T \to S}$, is effective in the target domain $\mathcal{T}$. Under the Sim2Real setting, the source domain refers to the simulator and the target domain refers to the real world. Unless specified otherwise, we abbreviate $\pi_S$ as $\pi$ for the rest of the paper.

Figure 2: Policy Decomposition. The Task-specific Navigation Policy ($\pi$) can be sequentially decomposed into a Policy Feature Extractor ($F$) and an Action Policy ($A$) such that $F$ extracts all task-specific features ($F(I)$) and throws away all domain-specific features from the input image ($I$), and $A$ learns an action distribution function over the task-specific features.

3.1 Policy Decomposition

As our objective is to transfer the task-specific navigation policy across domains, we assume that the task itself is domain-invariant. As a consequence, given a policy $\pi$ (for the navigation task in the source domain), we assume that some intermediate task-specific representation inferred by the policy is invariant from the source to the target domain. For example, a simple obstacle avoidance navigation policy would extract task-specific and domain-invariant features such as the distance to obstacles at various angles and then learn a policy over these features. Let $\pi$ be sequentially decomposed into a Policy Feature Extractor ($F$) and an Action Policy ($A$). $F$ extracts all task-specific features ($F(I)$, with $I$ indicating the input image) and throws away all domain-specific features in the input image. $A$ learns an action distribution function over the task-specific features. We illustrate the policy decomposition in Figure 2.

3.2 Policy-based Consistency Loss

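Recall that our objective is to learn an image translation model $G_{T \to S}$ such that the policy $\pi \circ G_{T \to S}$ is effective in the target domain. Based on the policy decomposition described above, different translations of the same target-domain image should have similar task-specific features. Precisely, if $I_S^{1}$ and $I_S^{2}$ are images translated to the source domain from the same target-domain image $I_T$, then $F(I_S^{1}) \approx F(I_S^{2})$, where $F$ is the Policy Feature Extractor.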

Figure 3: Policy-based Consistency Loss. Since the task is domain-invariant, task-specific representations obtained from different domain-specific styles but the same domain-invariant content should be similar.

To achieve the above policy consistency in an unsupervised fashion, we take inspiration from style and content-based unsupervised methods designed for image translation Huang et al. (2018). We assume that each image can be decomposed into a domain-invariant content representation ($c$) and a domain-specific style representation ($s$). Let $E_d$ be an Image Encoder for domain $d \in \{S, T\}$ which encodes an image ($I_d$) to domain-invariant content ($c$) and domain-specific style ($s_d$): $E_d(I_d) = (c, s_d)$. Conversely, let $D_d$ be an Image Decoder which is the inverse of the Image Encoder: $D_d(c, s_d) = I_d$.

Since we assume the navigation task is domain-invariant, all the task-specific features are a subset of the content representation $c$. Therefore, images generated from different styles but the same content should lead to the same task-specific features, as shown in Figure 3. We operationalize this idea using the following policy-based consistency loss:


with $s_1$ and $s_2$ being two distinct styles sampled from the prior distribution $q(s_S)$.
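One way to write a loss that matches this description is sketched below. The notation is an assumption made for illustration: $F$ is the Policy Feature Extractor, $E_T$ the target-domain Image Encoder, $D_S$ the source-domain Image Decoder, $c$ the content extracted from a target-domain image $I_T$, and $q(s_S)$ the prior over source-domain styles. The choice of the L1 norm is also an assumption; any distance between feature vectors would fit the description.

```latex
\mathcal{L}_{\text{policy}}
  = \mathbb{E}_{I_T,\; s_1, s_2 \sim q(s_S)}
    \Big[ \big\lVert F\big(D_S(c, s_1)\big) - F\big(D_S(c, s_2)\big) \big\rVert_1 \Big],
\qquad (c, s_T) = E_T(I_T)
```

The loss is zero exactly when the two decoded images, which share content but differ in style, produce identical task-specific features.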

Note that in the above equation, the Policy Feature Extractor $F$ is part of the given navigation policy. We assume the navigation policy is trained before deciding the target domain; hence, $F$ is fixed during the domain adaptation phase. This ensures that the presented policy-based image translation (PBIT) can be used for transferring a policy across domains (potentially not anticipated during policy training) without re-training the navigation policy.
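The idea of the policy-based consistency loss can be illustrated with a small, dependency-free sketch. Everything here is a stand-in chosen for illustration: the toy decoder, the two toy feature extractors, the standard normal style prior, and the mean absolute difference are assumptions, not the networks or the exact loss used in the experiments.

```python
import random


def policy_consistency_loss(feature_extractor, decoder, content, style_dim, rng):
    """Mean absolute gap between policy features of two decodings of one content."""
    # Sample two independent styles (a standard normal prior is assumed here).
    s1 = [rng.gauss(0.0, 1.0) for _ in range(style_dim)]
    s2 = [rng.gauss(0.0, 1.0) for _ in range(style_dim)]
    # Decode the same content with both styles, then extract policy features.
    f1 = feature_extractor(decoder(content, s1))
    f2 = feature_extractor(decoder(content, s2))
    return sum(abs(a - b) for a, b in zip(f1, f2)) / len(f1)


# Hypothetical toy decoder: an "image" is the content followed by the style.
def toy_decoder(content, style):
    return list(content) + list(style)


# Hypothetical extractor that only reads the content part of the image.
def content_only_features(image):
    return image[:4]


# Hypothetical extractor that also reacts to the style part of the image.
def leaky_features(image):
    return image


content = [0.2, -1.0, 0.5, 3.0]
loss_invariant = policy_consistency_loss(
    content_only_features, toy_decoder, content, 2, random.Random(0))
loss_leaky = policy_consistency_loss(
    leaky_features, toy_decoder, content, 2, random.Random(0))
```

In this toy setting the loss vanishes when the features depend only on content and is positive when style leaks into them. In the actual method the feature extractor is fixed, so minimizing the loss shapes the encoder and decoder rather than the policy.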

3.3 Reconstruction and Adversarial Loss

Using just the policy-based consistency loss would make the decoder ignore the style and decode based only on the content. Inspired by prior work Zhu et al. (2017a); Huang et al. (2018), to encourage the content to be domain-invariant and the style representations to be domain-specific, we adopt the following image and latent representation reconstruction losses, and use $q(s_S)$ and $q(s_T)$ to denote the prior distributions of the source and target styles $s_S$ and $s_T$:


We also use adversarial losses to match the distribution of images to their respective domains. Let $\mathrm{Dis}_S$ and $\mathrm{Dis}_T$ be the discriminators for the source and target domains:


Putting everything together, our overall objective is


where the $\lambda$s are hyper-parameters controlling the weight of each loss during training. The cross-domain image translation model consists of the image encoders and decoders described above, and the optimization admits a min-max objective:
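As a sketch of the structure described here, the overall objective can be read as a weighted sum of the loss families introduced above, minimized over the encoders and decoders and maximized over the discriminators. The symbols $\lambda_{\text{policy}}$, $\lambda_{\text{recon}}$, $\lambda_{\text{adv}}$, $E_S$, $E_T$, $D_S$, $D_T$, $\mathrm{Dis}_S$ and $\mathrm{Dis}_T$ are assumed notation, and grouping all reconstruction terms under a single weight is a simplification for illustration.

```latex
\mathcal{L}
  = \lambda_{\text{policy}}\,\mathcal{L}_{\text{policy}}
  + \lambda_{\text{recon}}\,\mathcal{L}_{\text{recon}}
  + \lambda_{\text{adv}}\,\mathcal{L}_{\text{adv}},
\qquad
\min_{E_S, E_T, D_S, D_T}\;\max_{\mathrm{Dis}_S, \mathrm{Dis}_T}\;\mathcal{L}
```

Only the adversarial term depends on the discriminators, so the inner maximization affects that term alone.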


4 Experimental Setup

We conduct two sets of experiments to test the domain adaptation of navigation policies in Sim-to-Sim and Sim-to-Real settings. In the Sim-to-Sim experiments, we adapt navigation policies trained in the Gibson Xia et al. (2018) domain to the Replica Straub et al. (2019) domain in the Habitat Simulator Savva et al. (2019). For Sim-to-Real, we adapt policies trained in Gibson to real-world office scenes. For training the domain adaptation models, we collect 7200 images randomly in 72 training scenes in Gibson, 7200 images in 18 test scenes in Replica, and 1125 images in the real-world.

We study domain adaptation for two navigation tasks, PointGoal and Exploration. The PointGoal task Anderson et al. (2018b); Savva et al. (2019) involves navigating to a target location specified using point coordinates. We use Success Rate and Success weighted by Path Length (SPL) Anderson et al. (2018b) as evaluation metrics for PointGoal. An episode is considered successful if the agent is within a fixed radius of the goal location at the end of the episode. The Exploration task Chen et al. (2019b); Chaplot et al. (2020b) involves maximizing the coverage or the explored area within a fixed time budget of 500 steps. A traversable point is defined to be explored by the agent if it is in the field-of-view of the agent and within a fixed distance of the agent. We use two evaluation metrics, the absolute coverage area (Explored Area) and the proportion of area explored in the scene (Explored Ratio).
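The two PointGoal metrics can be computed as in the following sketch. SPL averages, over all episodes, the success indicator multiplied by the ratio of the shortest-path length to the larger of the agent's path length and the shortest-path length. The function and argument names are illustrative, and the sketch assumes every shortest-path length is positive.

```python
def success_rate(successes):
    """Fraction of episodes that ended in success."""
    return sum(1 for s in successes if s) / len(successes)


def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length, averaged over episodes.

    successes:        iterable of booleans, one per episode
    shortest_lengths: shortest-path distance from start to goal per episode (> 0)
    path_lengths:     length of the path the agent actually travelled per episode
    """
    if not (len(successes) == len(shortest_lengths) == len(path_lengths)):
        raise ValueError("all inputs must have one entry per episode")
    total = 0.0
    for success, shortest, travelled in zip(successes, shortest_lengths, path_lengths):
        if success:
            # A failed episode contributes 0; a successful one contributes
            # at most 1, reached when the agent follows a shortest path.
            total += shortest / max(travelled, shortest)
    return total / len(successes)


# Three made-up episodes: optimal success, failure, success with a 2x detour.
example_spl = spl([True, False, True], [2.0, 3.0, 4.0], [2.0, 10.0, 8.0])
example_sr = success_rate([True, False, True])
```

For the three made-up episodes, the contributions are 1, 0 and 0.5, so the SPL is 0.5 while the success rate is 2/3. SPL therefore penalizes detours that the success rate ignores.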

For both the tasks, the agent has two sensors: RGB camera () and an odometry sensor ( coordinates and orientation). For the PointGoal task, the odometry sensor is used to compute the relative distance and angle of the pointgoal at each time step. The action space consists of 3 actions: move_forward (0.25), turn_left (), turn_right ().
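A minimal sketch of how a relative distance and angle to the point goal could be derived from odometry is given below. The coordinate convention (planar x, y coordinates, heading in degrees measured counter-clockwise, angle wrapped to [0, 360)) and all names are assumptions made for illustration, not the exact convention of the simulator or robot.

```python
import math


def relative_goal(agent_x, agent_y, agent_heading_deg, goal_x, goal_y):
    """Return (distance, angle in degrees in [0, 360)) of the goal relative to the agent."""
    dx = goal_x - agent_x
    dy = goal_y - agent_y
    distance = math.hypot(dx, dy)
    # Bearing of the goal in the world frame, then expressed relative to the heading.
    bearing_deg = math.degrees(math.atan2(dy, dx))
    angle = (bearing_deg - agent_heading_deg) % 360.0
    return distance, angle


# Agent at the origin facing along +x; goal offsets are made up for illustration.
dist_a, angle_a = relative_goal(0.0, 0.0, 0.0, 2.0, 2.0)
dist_b, angle_b = relative_goal(0.0, 0.0, 0.0, 4.0, -2.0)
```

Under this assumed convention, a goal offset of (2, 2) gives roughly 2.83 and 45.00, and an offset of (4, -2) gives roughly 4.47 and 333.43, which are the kinds of distance and angle pairs used to specify real-world episodes. Recomputing the pair from odometry at every step keeps the goal expressed in the agent's current frame.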

4.1 Navigation Policy

The source domain navigation policy ($\pi$) is trained using reinforcement learning. The reward for the PointGoal task is the decrease in geodesic distance to the point goal, and for the Exploration task it is the increase in the explored area. The navigation policy is decomposed into a Policy Feature Extractor ($F$) and an Action Policy ($A$) as shown in Fig 2. $F$ is based on the 18-layer ResNet He et al. (2016) and outputs 128-dimensional policy-related representations given RGB images. $A$ is based on a 2-layer GRU, which takes the features $F(I)$ along with the relative distance and angle of the point goal (in the PointGoal task) or readings from the odometry sensor (in the Exploration task) as input. We train a separate policy for each task in the source Gibson domain using the 72 scenes in the training set provided by Savva et al. (2019). The policies are trained using PPO Schulman et al. (2017) for 30 million frames. In the PointGoal task, the agent stops when the relative distance of the PointGoal falls below a fixed threshold, whereas, in the Exploration task, the agent explores for the maximum episode length of 500 steps. More architecture and hyperparameter details for navigation policy training are provided in the supplementary material.

4.2 Implementation Details

Given a navigation policy, we use the proposed method for domain adaptation from Gibson to Replica and vice versa. The image encoders and decoders used in PBIT consist of several convolutional layers and residual blocks. The exact architecture is described in the supplementary material. Regarding hyperparameters, we use the same loss weights for all experiments. We scale the weight of the policy-based consistency loss based on the mean of the image features for each task, where the mean is estimated using the Gibson images from the domain translation dataset. We use the Adam optimizer Kingma and Ba (2014). The learning rate is halved every 100K iterations. We train the proposed model and all the baselines for 500K iterations.
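The step schedule described here (halving every 100K iterations over 500K iterations) can be sketched as a small function. The initial learning rate is left as a parameter, and the value 1.0 used in the example is a placeholder for illustration, not the value used in the experiments.

```python
def learning_rate(step, base_lr, halve_every=100_000):
    """Learning rate at a given iteration under a halve-every-N-steps schedule."""
    if step < 0:
        raise ValueError("step must be non-negative")
    # Integer division counts how many halvings have happened so far.
    return base_lr * 0.5 ** (step // halve_every)


# Multipliers relative to a placeholder base learning rate of 1.0.
lr_start = learning_rate(0, 1.0)
lr_before_first_drop = learning_rate(99_999, 1.0)
lr_after_first_drop = learning_rate(100_000, 1.0)
lr_final = learning_rate(499_999, 1.0)
```

Over a 500K-iteration run the rate is halved four times, so the last 100K iterations use one sixteenth of the initial learning rate.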

4.3 Baselines

We compare the proposed PBIT model against the following baselines for domain adaptation:
1) Direct Transfer: This is the most common method of transferring navigation policies across domains, which involves directly testing the policy in the target domain without any fine-tuning.
2) CycleGAN Zhu et al. (2017a) is a competitive and popular unsupervised image translation method. This method is designed for static image translation and is agnostic to the navigation policy.
3) PBIT w.o. Policy Loss: This method is an ablation of the proposed method without the policy-based consistency loss. This ablation is conceptually very similar to prior works based on style and content disentanglement such as Huang et al. (2018); Lee et al. (2018). We use the architecture and hyperparameters from the PBIT model for this ablation to quantify the effect of the policy-based consistency loss on performance.

5 Results

                         PointGoal               Exploration
Method                   SPL     Success Rate    Explored Ratio   Explored Area
Direct Transfer          0.505   0.688           0.832            22.9
CycleGAN                 0.605   0.803           0.868            24.3
PBIT w.o. Policy Loss    0.669   0.852           0.879            24.6
PBIT                     0.712   0.881           0.897            25.3
Table 1: Results. Comparisons between the proposed Policy-Based Image Translation (PBIT) and baselines on the PointGoal and Exploration tasks when transferred from the Gibson to the Replica domain.

5.1 Gibson to Replica

For evaluation in Replica, we generate a separate test set of 900 episodes (50 episodes in each of the 18 scenes) for both the PointGoal and Exploration tasks. The episodes are sampled using random starting locations and aggressive rejection sampling of near-straight-line episodes, as in prior work Savva et al. (2019); Chaplot et al. (2020b). The performance of our method and all the baselines on both tasks is presented in Table 1. PBIT outperforms all the baselines on both tasks. Compared to the CycleGAN baseline, it improves the SPL from 0.605 to 0.712 for PointGoal and the Explored Ratio from 0.868 to 0.897 for Exploration. There is a considerable difference between PBIT and the ablation (0.712 vs 0.669 SPL), indicating the importance of the policy-based consistency loss. PBIT performed better than all the baselines consistently across 3 different runs, with a small standard deviation of 0.007 SPL and 0.004 Explored Ratio. The Explored Ratios of all the methods in Table 1 are high on an absolute level because the Replica scenes are relatively small, usually having one or two rooms. Just turning on the spot leads to an explored ratio of 0.436.

In Figure 4, we visualize an example trajectory for the PointGoal task using the proposed method PBIT (Fig 4, top) and the Direct Transfer baseline (Fig 4, bottom) for the same episode specification. The figure shows the images observed by the agent in the target domain, the translated images, and a top-down map (not visible to the agent) showing the point goal coordinates and the agent’s path. The PBIT model successfully reaches the PointGoal within 46 steps while the Direct Transfer baseline is unable to reach the goal. The image translations indicate that policy-relevant characteristics of the image such as corners of obstacles, walls and free space are preserved during the translation.

Figure 4: Trajectory Comparison between PBIT and the Direct Transfer Baseline on the PointGoal Task in Replica. The upper half of the figure is the trajectory of PBIT: the agent successfully navigates from a corner of the apartment to the fridge in 46 steps, by seeing the translated images by PBIT. The agent takes almost the shortest path possible, as shown in the Top-Down Map (not visible to the agent). The lower half of the figure is the trajectory of the Direct Transfer baseline on the same test episode. The Direct Transfer agent fails to navigate to the target location and gets lost, even after 460 steps.

5.2 Gibson to Real-world

For the real-world experiments, we conduct 18 trials each for the proposed method and the Direct Transfer baseline for the PointGoal task. We transfer the navigation policy to a LoCoBot Wögerer et al. (2012) using the PyRobot API Murali et al. (2019) for both the methods. We conduct trials in 2 scenes, Seen and Unseen. 151 images among the set of the 1125 real-world images used for training the PBIT model were sampled in the Seen scene, whereas none of the images were sampled from the Unseen scene. We terminate the episode if the robot collides with an obstacle and count it as a failure. We allow a maximum episode length of 99 steps in the real-world experiments.

Each trial specification and the corresponding results are presented in Table 2. PBIT achieves an absolute improvement of 61.1% (77.8% vs 16.7%) in success rate over the Direct Transfer baseline across all the trials. PBIT also has a much lower collision rate than the baseline (5.6% vs 27.7%). Surprisingly, the PBIT model achieves a 100% success rate in the Seen scene, an absolute improvement of 88.9% (100% vs 11.1%). These results indicate that the navigation policy can be reliably transferred to the real world using the proposed method given access to a few images in the real-world scene. Even in the Unseen scene, PBIT leads to a large absolute improvement of 33.4% (55.6% vs 22.2%) over the Direct Transfer baseline.

In Figure 5, we show an example of a successful trajectory in the Unseen real-world scene using PBIT. It shows some of the images seen by the agent during the trajectory, the corresponding translations, and a third-person view of the robot. The trajectory shows that PBIT is able to successfully navigate around the blue chair obstacle to reach the pointgoal. The image translations shown in the figure also indicate that the model generates good translations similar to images in the Gibson domain. For example, the dark grey carpet floors in the real-world office scenes are successfully translated to brown floors, representative of the wooden floors of apartment scenes in the Gibson domain. Similarly, wooden doors and staircases in the real world are translated to off-white walls, which are common in Gibson scenes. At the same time, navigation-relevant details, such as the boundary between the floor and walls and other obstacles, are preserved during translation.

Figure 5: Sample Real World Trajectory on PointGoal Task. The figure contains raw inputs from the real world (row 1), the images translated to the Gibson domain by PBIT (row 2), and a third-person perspective from behind the robot (row 3). The PBIT agent successfully reaches its destination (a trash can) by avoiding an obstacle (a chair) in its way.
Episode Specification            Baseline: Direct Transfer             PBIT
Ep No      Dist (m)  Angle (°)   Steps  Collision  Final Dist  Succ.   Steps  Collision  Final Dist  Succ.
Scene 1 (Seen): Corridors
1          2.83      45.00       99     0          3.41        0       21     0          0.15        1
2          4.47      333.43      99     0          4.68        0       41     0          0.02        1
3          4.00      0.00        99     0          4.43        0       41     0          0.13        1
4          5.39      21.80       10     1          4.60        0       32     0          0.13        1
5          4.24      315.00      99     0          6.20        0       73     0          0.14        1
6          4.00      0.00        36     1          3.88        0       63     0          0.09        1
7          1.41      45.00       99     0          1.95        0       9      0          0.19        1
8          1.41      135.00      50     0          0.15        1       29     0          0.07        1
9          4.47      26.57       99     0          4.73        0       27     0          0.19        1
Scene Avg                        92     22.2%      3.78        11.1%   37.3   0%         0.13        100%
Scene 2 (Unseen): Public kitchen area
1          2.00      0.00        32     1          0.50        0       36     0          0.18        1
2          2.00      0.00        35     1          0.56        0       10     0          0.04        1
3          2.24      333.43      31     0          0.02        1       16     0          0.07        1
4          2.24      153.43      38     0          0.08        1       43     0          0.09        1
5          4.12      345.96      80     1          4.10        0       44     1          1.96        0
6          4.47      26.57       99     0          4.01        0       99     0          2.85        0
7          4.47      26.57       99     0          4.49        0       70     0          0.14        1
8          5.39      21.80       99     0          5.50        0       99     0          2.73        0
9          2.83      45.00       99     0          3.57        0       99     0          2.84        0
Scene Avg                        77.5   33.3%      2.54        22.2%   59.0   11.1%      1.21        55.6%
Overall                          84.8   27.7%      3.16        16.7%   48.2   5.6%       0.67        77.8%
Table 2: Real-world results. Table comparing the performance of the Direct Transfer baseline and the proposed method PBIT across 18 trials in two scenes in the real-world. The training set for PBIT consists of some images sampled from Scene 1: Corridors (Seen), but no image from Scene 2: Public kitchen area (Unseen). The episode specification (starting relative distance and angle to the PointGoal) are shown in the left columns. Identical starting locations and episode specifications were used to evaluate both the methods. The performance of the baseline and PBIT are shown in the center and right, respectively. Success is abbreviated as Succ.
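The success rates in Table 2 can be recomputed from the per-episode success flags, as in the short sketch below. The flags are transcribed from the table's success columns; the variable and function names are illustrative.

```python
# Per-episode success flags (episodes 1 to 9) transcribed from Table 2.
baseline_seen = [0, 0, 0, 0, 0, 0, 0, 1, 0]
baseline_unseen = [0, 0, 1, 1, 0, 0, 0, 0, 0]
pbit_seen = [1, 1, 1, 1, 1, 1, 1, 1, 1]
pbit_unseen = [1, 1, 1, 1, 0, 0, 1, 0, 0]


def success_percent(flags):
    """Success rate in percent, rounded to one decimal place."""
    return round(100.0 * sum(flags) / len(flags), 1)


rates = {
    "baseline_seen": success_percent(baseline_seen),
    "baseline_unseen": success_percent(baseline_unseen),
    "baseline_overall": success_percent(baseline_seen + baseline_unseen),
    "pbit_seen": success_percent(pbit_seen),
    "pbit_unseen": success_percent(pbit_unseen),
    "pbit_overall": success_percent(pbit_seen + pbit_unseen),
}
```

The recomputed values agree with the Scene Avg and Overall rows: 3 of 18 baseline trials and 14 of 18 PBIT trials succeed.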

6 Conclusion

In this paper, we proposed a domain adaptation method for transferring navigation policies from simulation to the real world. Given a navigation policy in the source domain, our method translates images from the target domain to the source domain such that the translations are consistent with the task-specific and domain-invariant representations learnt by the given policy. Our experiments across two different tasks for domain transfer in simulation show that the proposed method improves the performance of the transferred navigation policies over baselines. We also show strong performance of navigation policies transferred from simulation to the real world using our method. In this paper, we considered navigation tasks involving mostly spatial reasoning. In the future, the proposed method can be extended to navigation tasks involving more semantic reasoning by incorporating semantic consistency losses along with the policy consistency losses.


  • P. Anderson, A. Chang, D. S. Chaplot, A. Dosovitskiy, S. Gupta, V. Koltun, J. Kosecka, J. Malik, R. Mottaghi, M. Savva, et al. (2018a) On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757. Cited by: §2.
  • P. Anderson, Q. Wu, D. Teney, J. Bruce, M. Johnson, N. Sünderhauf, I. Reid, S. Gould, and A. van den Hengel (2018b) Vision-and-language navigation: interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3674–3683. Cited by: §1, §2, §4.
  • V. Blukis, Y. Terme, E. Niklasson, R. A. Knepper, and Y. Artzi (2019) Learning to map natural language instructions to physical quadcopter control using simulated flight. In CoRL, Cited by: §2.
  • F. Bonin-Font, A. Ortiz, and G. Oliver (2008) Visual navigation for mobile robots: a survey. Journal of intelligent and robotic systems 53 (3), pp. 263. Cited by: §2.
  • K. Bousmalis, A. Irpan, P. Wohlhart, Y. Bai, M. Kelcey, M. Kalakrishnan, L. Downs, J. Ibarz, P. Pastor, K. Konolige, S. Levine, and V. Vanhoucke (2018) Using simulation and domain adaptation to improve efficiency of deep robotic grasping. 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 4243–4250. Cited by: §2.
  • A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niessner, M. Savva, S. Song, A. Zeng, and Y. Zhang (2017) Matterport3D: learning from rgb-d data in indoor environments. International Conference on 3D Vision (3DV). Cited by: §1, §2.
  • M. Chang, A. Gupta, and S. Gupta (2020) Semantic visual navigation by watching youtube videos. arXiv preprint arXiv:2006.10034. Cited by: §2.
  • D. S. Chaplot, D. Gandhi, A. Gupta, and R. Salakhutdinov (2020a) Object goal navigation using goal-oriented semantic exploration. arXiv preprint arXiv:2007.00643. Cited by: §2.
  • D. S. Chaplot, D. Gandhi, S. Gupta, A. Gupta, and R. Salakhutdinov (2020b) Learning to explore using active neural slam. In ICLR, Cited by: §1, §2, §2, §4, §5.1.
  • D. S. Chaplot, R. Salakhutdinov, A. Gupta, and S. Gupta (2020c) Neural topological slam for visual navigation. In CVPR, Cited by: §2.
  • D. S. Chaplot, K. M. Sathyendra, R. K. Pasumarthi, D. Rajagopal, and R. Salakhutdinov (2018) Gated-attention architectures for task-oriented language grounding. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §1, §2.
  • H. Chen, A. Suhr, D. Misra, N. Snavely, and Y. Artzi (2019a) Touchdown: natural language navigation and spatial reasoning in visual street environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12538–12547. Cited by: §1.
  • T. Chen, S. Gupta, and A. Gupta (2019b) Learning exploration policies for navigation. arXiv preprint arXiv:1903.01959. Cited by: §2, §4.
  • K. Fang, A. Toshev, L. Fei-Fei, and S. Savarese (2019) Scene memory transformer for embodied agents in long-horizon tasks. In CVPR, Cited by: §2.
  • D. Fried, R. Hu, V. Cirik, A. Rohrbach, J. Andreas, L. Morency, T. Berg-Kirkpatrick, K. Saenko, D. Klein, and T. Darrell (2018) Speaker-follower models for vision-and-language navigation. In Advances in Neural Information Processing Systems, pp. 3314–3325. Cited by: §1, §2.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §2.
  • D. Gordon, A. Kadian, D. Parikh, J. Hoffman, and D. Batra (2019) SplitNet: sim2sim and task2task transfer for embodied visual navigation. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 1022–1031. Cited by: §2.
  • S. Gupta, J. Davidson, S. Levine, R. Sukthankar, and J. Malik (2017) Cognitive mapping and planning for visual navigation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2616–2625. Cited by: §1, §2, §2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: Appendix B, §4.1.
  • K. M. Hermann, F. Hill, S. Green, F. Wang, R. Faulkner, H. Soyer, D. Szepesvari, W. M. Czarnecki, M. Jaderberg, D. Teplyashin, et al. (2017) Grounded language learning in a simulated 3d world. arXiv preprint arXiv:1706.06551. Cited by: §1, §2.
  • X. Huang and S. Belongie (2017) Arbitrary style transfer in real-time with adaptive instance normalization. In The IEEE International Conference on Computer Vision (ICCV), Cited by: Appendix C.
  • X. Huang, M. Liu, S. Belongie, and J. Kautz (2018) Multimodal unsupervised image-to-image translation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 172–189. Cited by: §1, §2, §3.2, §3.3, §4.3.
  • P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1125–1134. Cited by: §2, §2.
  • T. Kim, M. Cha, H. Kim, J. K. Lee, and J. Kim (2017) Learning to discover cross-domain relations with generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1857–1865. Cited by: §2.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: Appendix B, §4.2.
  • E. Kolve, R. Mottaghi, W. Han, E. VanderBilt, L. Weihs, A. Herrasti, D. Gordon, Y. Zhu, A. Gupta, and A. Farhadi (2017) AI2-THOR: An Interactive 3D Environment for Visual AI. arXiv. Cited by: §1.
  • H. Lee, H. Tseng, J. Huang, M. Singh, and M. Yang (2018) Diverse image-to-image translation via disentangled representations. In Proceedings of the European conference on computer vision (ECCV), pp. 35–51. Cited by: §2, §4.3.
  • M. Liu, T. Breuel, and J. Kautz (2017) Unsupervised image-to-image translation networks. In Advances in neural information processing systems, pp. 700–708. Cited by: §1, §2.
  • A. Mousavian, A. Toshev, M. Fišer, J. Košecká, A. Wahid, and J. Davidson (2019) Visual representations for semantic target driven navigation. In 2019 International Conference on Robotics and Automation (ICRA), pp. 8846–8852. Cited by: §2.
  • M. Müller, A. Dosovitskiy, B. Ghanem, and V. Koltun (2018) Driving policy transfer via modularity and abstraction. In CoRL, Cited by: §2.
  • A. Murali, T. Chen, K. V. Alwala, D. Gandhi, L. Pinto, S. Gupta, and A. Gupta (2019) PyRobot: an open-source robotics framework for research and benchmarking. arXiv preprint arXiv:1906.08236. Cited by: §5.2.
  • F. Sadeghi (2019) DIViS: domain invariant visual servoing for collision-free goal reaching. ArXiv abs/1902.05947. Cited by: §2.
  • M. Savva, A. Kadian, O. Maksymets, Y. Zhao, E. Wijmans, B. Jain, J. Straub, J. Liu, V. Koltun, J. Malik, et al. (2019) Habitat: a platform for embodied ai research. In ICCV, Cited by: Appendix B, §2, §4.1, §4, §4, §5.1.
  • A. Sax, J. O. Zhang, B. Emi, A. Zamir, S. Savarese, L. Guibas, and J. Malik (2019) Learning to navigate using mid-level visual priors. arXiv preprint arXiv:1912.11121. Cited by: §2.
  • J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel (2015) High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438. Cited by: Appendix B.
  • J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: Appendix B, §4.1.
  • A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb (2017) Learning from simulated and unsupervised images through adversarial training. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2107–2116. Cited by: §2.
  • J. Straub, T. Whelan, L. Ma, Y. Chen, E. Wijmans, S. Green, J. J. Engel, R. Mur-Artal, C. Ren, S. Verma, A. Clarkson, M. Yan, B. Budge, Y. Yan, X. Pan, J. Yon, Y. Zou, K. Leon, N. Carter, J. Briales, T. Gillingham, E. Mueggler, L. Pesqueira, M. Savva, D. Batra, H. M. Strasdat, R. D. Nardi, M. Goesele, S. Lovegrove, and R. Newcombe (2019) The Replica dataset: a digital replica of indoor spaces. arXiv preprint arXiv:1906.05797. Cited by: §1, §2, §4.
  • Y. Taigman, A. Polyak, and L. Wolf (2016) Unsupervised cross-domain image generation. arXiv preprint arXiv:1611.02200. Cited by: §2.
  • E. Tzeng, C. Devin, J. Hoffman, C. Finn, P. Abbeel, S. Levine, K. Saenko, and T. Darrell (2015) Adapting deep visuomotor representations with weak pairwise constraints. arXiv preprint arXiv:1511.07111. Cited by: §2.
  • D. Ulyanov, A. Vedaldi, and V. Lempitsky (2017) Improved texture networks: maximizing quality and diversity in feed-forward stylization and texture synthesis. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Appendix C.
  • T. Wang, M. Liu, J. Zhu, A. Tao, J. Kautz, and B. Catanzaro (2018) High-resolution image synthesis and semantic manipulation with conditional gans. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Appendix C.
  • X. Wang, Q. Huang, A. Celikyilmaz, J. Gao, D. Shen, Y. Wang, W. Y. Wang, and L. Zhang (2019) Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6629–6638. Cited by: §1.
  • E. Wijmans, A. Kadian, A. Morcos, S. Lee, I. Essa, D. Parikh, M. Savva, and D. Batra (2020) Decentralized distributed ppo: solving pointgoal navigation. In ICLR, Cited by: §1, §2.
  • C. Wögerer, H. Bauer, M. Rooker, G. Ebenhofer, A. Rovetta, N. Robertson, and A. Pichler (2012) LOCOBOT-low cost toolkit for building robot co-workers in assembly lines. In International conference on intelligent robotics and applications, pp. 449–459. Cited by: §5.2.
  • S. Wold, K. Esbensen, and P. Geladi (1987) Principal component analysis. Chemometrics and intelligent laboratory systems 2 (1-3), pp. 37–52. Cited by: Appendix A.
  • M. Wortsman, K. Ehsani, M. Rastegari, A. Farhadi, and R. Mottaghi (2019) Learning to learn how to learn: self-adaptive visual navigation using meta-learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6750–6759. Cited by: §2.
  • Y. Wu, Y. Wu, G. Gkioxari, and Y. Tian (2018) Building generalizable agents with a realistic and rich 3d environment. arXiv preprint arXiv:1801.02209. Cited by: §1.
  • Y. Wu, Y. Wu, A. Tamar, S. Russell, G. Gkioxari, and Y. Tian (2019) Bayesian relational memory for semantic visual navigation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2769–2779. Cited by: §1.
  • F. Xia, A. R. Zamir, Z. He, A. Sax, J. Malik, and S. Savarese (2018) Gibson Env: real-world perception for embodied agents. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on, Cited by: §1, §2, §4.
  • W. Yang, X. Wang, A. Farhadi, A. Gupta, and R. Mottaghi (2018) Visual semantic navigation using scene priors. arXiv preprint arXiv:1810.06543. Cited by: §1, §2.
  • Z. Yi, H. Zhang, P. Tan, and M. Gong (2017) Dualgan: unsupervised dual learning for image-to-image translation. In Proceedings of the IEEE international conference on computer vision, pp. 2849–2857. Cited by: §2.
  • J. Zhang, L. Tai, P. Yun, Y. Xiong, M. Liu, J. Boedecker, and W. Burgard (2019) VR-goggles for robots: real-to-sim domain adaptation for visual control. IEEE Robotics and Automation Letters 4 (2), pp. 1148–1155. Cited by: §2.
  • J. Zhu, T. Park, P. Isola, and A. A. Efros (2017a) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2223–2232. Cited by: §1, §2, §3.3, §4.3.
  • J. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman (2017b) Toward multimodal image-to-image translation. In Advances in neural information processing systems, pp. 465–476. Cited by: §2.
  • Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, and A. Farhadi (2017c) Target-driven visual navigation in indoor scenes using deep reinforcement learning. In 2017 IEEE international conference on robotics and automation (ICRA), pp. 3357–3364. Cited by: §1, §2.

Appendix A Visualization of Policy Representations

We analyze the policy representations before and after translation by reducing their dimensionality with Principal Component Analysis (PCA) Wold et al. (1987). In Figures 6 and 7, we visualize the policy representations reduced to 2 dimensions using PCA in Replica and Real-World respectively. Both figures show that PBIT brings the representations of target domain Replica/Real-World images closer to the distribution of representations of Gibson images.
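
The projection described above can be sketched as follows. This is a minimal illustration, not the authors' code: the feature arrays are synthetic stand-ins for the 128-dimensional policy representations, the helper names (`pca_project`, `centroid_gap`) are hypothetical, and fitting a single PCA jointly on all three groups is an assumption made so that they share the same 2-D axes.

```python
import numpy as np

def pca_project(features, n_components=2):
    """Project the rows of `features` onto their top principal components."""
    mean = features.mean(axis=0)
    centered = features - mean
    # SVD of the centered data: the rows of vt are the principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    return centered @ components.T

def centroid_gap(a, b):
    """Euclidean distance between the centroids of two point sets."""
    return float(np.linalg.norm(a.mean(axis=0) - b.mean(axis=0)))

rng = np.random.default_rng(0)
n, dim = 200, 128
# Synthetic stand-ins (assumption): source features, shifted target features,
# and "translated" target features that lie much closer to the source.
source = rng.normal(size=(n, dim))
target = rng.normal(size=(n, dim)) + 3.0
translated = rng.normal(size=(n, dim)) + 0.3

# Fit one PCA on all groups so the 2-D coordinates are comparable.
proj = pca_project(np.vstack([source, target, translated]))
src2d, tgt2d, trn2d = proj[:n], proj[n:2 * n], proj[2 * n:]

gap_before = centroid_gap(src2d, tgt2d)
gap_after = centroid_gap(src2d, trn2d)
print(f"centroid gap before: {gap_before:.2f}, after: {gap_after:.2f}")
```

The centroid gap is only a crude numerical summary added for this sketch; the figures in this appendix compare the projected point clouds visually.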

Figure 6: We use PCA to visualize and compare the 128-dimensional task-specific feature vectors produced by the PointGoal agent trained in Gibson, when given Gibson images, Replica images, or Replica images translated to Gibson by PBIT as input. The figure shows that the images translated by PBIT bridge the policy domain gap between Gibson and Replica.

Figure 7: We use PCA to visualize and compare the 128-dimensional task-specific feature vectors produced by the PointGoal agent trained in Gibson, when given Gibson images, Real-World images, or Real-World images translated to Gibson by PBIT as input. The figure shows that the images translated by PBIT bridge the policy domain gap between Gibson and Real-World.

Appendix B Navigation Policy Training Details

Policy Architecture. The navigation policy is decomposed into a Policy Feature Extractor and an Action Policy, as shown in Fig 2. The Policy Feature Extractor is based on the 18-layer ResNet He et al. (2016); its architecture is shown in Figure 8. It outputs 128-dimensional policy-related representations given RGB images of shape . The Action Policy is based on a 2-layer GRU, which takes these representations together with readings from either the GPS+Compass sensor in the PointGoal task or the base odometry sensor in the Exploration task.

Policy Training. The navigation policies are trained using PPO Schulman et al. (2017) with Generalized Advantage Estimation Schulman et al. (2015). The training dataset contains episodes in 72 training scenes in Gibson as provided by Savva et al. (2019). We use 8 concurrent workers and train for a total of 30 million frames. The time horizon for a PPO update is 128 steps, the number of epochs per PPO update is 2, and the clipping parameter is set to 0.2. We use a discount factor of 0.99, a GAE parameter of 0.95, and the Adam optimizer Kingma and Ba (2014) with learning rate .
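
To make the roles of the discount factor, GAE parameter, and clipping parameter concrete, the following is a minimal sketch of the standard GAE recursion and the PPO clipped surrogate for a single sample, using the values stated above as defaults. It is not the training code used in the paper; the function names are hypothetical, and the rollout is assumed to contain no episode terminations.

```python
def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.

    Assumes no episode ends inside the rollout; `last_value` is the value
    estimate for the state that follows the final step.
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference residual.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of the residuals that follow step t.
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages

def ppo_clipped_objective(ratio, advantage, clip=0.2):
    """PPO clipped surrogate for a single sample (a quantity to maximize)."""
    clipped_ratio = max(1.0 - clip, min(1.0 + clip, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# Toy two-step rollout with zero value estimates.
adv = gae_advantages(rewards=[0.0, 1.0], values=[0.0, 0.0], last_value=0.0)
print(adv)
```

In this toy rollout the advantage at the last step equals its reward, and the advantage at the first step is that value scaled by the product of the discount factor and the GAE parameter.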

Appendix C PBIT Model Architecture Details

We use several convolutional layers and residual blocks to construct the image encoders and image decoders. We use Instance Normalization Ulyanov et al. (2017) in the encoders and Adaptive Instance Normalization Huang and Belongie (2017) in the decoders. For the discriminators, we adopt the multi-scale discriminator architecture proposed by Wang et al. (2018). Detailed descriptions of the architecture are given in Table 3.
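
As a reminder of how the two normalization layers differ, the sketch below implements both in NumPy for feature maps of shape (N, C, H, W). It is an illustration under stated assumptions rather than the PBIT implementation: the function names are hypothetical, and the scale and shift passed to the adaptive variant are simply supplied by the caller, since how PBIT produces them is not specified here.

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """Normalize each channel of each sample over its spatial dimensions.

    x has shape (N, C, H, W).
    """
    mean = x.mean(axis=(2, 3), keepdims=True)
    var = x.var(axis=(2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def adaptive_instance_norm(x, gamma, beta, eps=1e-5):
    """Instance-normalize x, then apply an externally supplied scale and shift.

    gamma and beta have shape (N, C).
    """
    return gamma[:, :, None, None] * instance_norm(x, eps) + beta[:, :, None, None]

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 3, 8, 8))
gamma = np.full((2, 3), 2.0)   # hypothetical per-channel scale
beta = np.full((2, 3), -1.0)   # hypothetical per-channel shift
y = adaptive_instance_norm(x, gamma, beta)
print(y.mean(axis=(2, 3)))  # per-channel means, close to beta
print(y.std(axis=(2, 3)))   # per-channel standard deviations, close to gamma
```

After the adaptive layer, the per-channel statistics of the output are set by `gamma` and `beta` rather than by the input feature map, which is what lets a style signal control the decoder output.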

Figure 8: An illustration of the network architecture of the Policy Feature Extractor.
Table 3: Architecture specification of different parts of the proposed PBIT model: (a) Style Encoder, (b) Content Encoder, (c) Decoder with AdaIN, (d) Discriminator with multi-scale input.

Appendix D Additional Trajectory Visualizations

Figure 9: Additional Real-World Trajectories. Figure showing raw inputs, translated Gibson images by PBIT, and a third-person perspective for two trajectories in the real world for the PointGoal task.
Figure 10: Additional Replica Trajectories. Figure showing raw inputs, translated Gibson images by PBIT, and the ground-truth top-down map (not visible to the agent) for three trajectories in the Replica domain.