Pathdreamer: A World Model for Indoor Navigation

by Jing Yu Koh, et al.

People navigating in unfamiliar buildings take advantage of myriad visual, spatial and semantic cues to efficiently achieve their navigation goals. Towards equipping computational agents with similar capabilities, we introduce Pathdreamer, a visual world model for agents navigating in novel indoor environments. Given one or more previous visual observations, Pathdreamer generates plausible high-resolution 360° visual observations (RGB, semantic segmentation and depth) for viewpoints that have not been visited, in buildings not seen during training. In regions of high uncertainty (e.g. predicting around corners, imagining the contents of an unseen room), Pathdreamer can predict diverse scenes, allowing an agent to sample multiple realistic outcomes for a given trajectory. We demonstrate that Pathdreamer encodes useful and accessible visual, spatial and semantic knowledge about human environments by using it in the downstream task of Vision-and-Language Navigation (VLN). Specifically, we show that planning ahead with Pathdreamer brings about half the benefit of looking ahead at actual observations from unobserved parts of the environment. We hope that Pathdreamer will help unlock model-based approaches to challenging embodied navigation tasks such as navigating to specified objects and VLN.




1 Introduction

Figure 1: Generating photorealistic visual observations from an imagined 6.3m trajectory in a previously unseen building. Observations also include depth and segmentations (not shown here).

World models [23], or models of environments [72], are an appealing way to represent an agent’s knowledge about its surroundings. An agent with a world model can predict its future by ‘imagining’ the consequences of a series of proposed actions. This capability can be used for sampling-based planning [16, 57], learning policies directly from the model (i.e., learning in a dream) [17, 23, 64, 25], and for counterfactual reasoning [6]. Model-based approaches such as these also typically improve the sample efficiency of deep reinforcement learning [72, 62]. However, world models that generate high-dimensional visual observations (i.e., images) have typically been restricted to relatively simple environments, such as Atari games [62] and tabletops [16].

Our goal is to develop a generic visual world model for agents navigating in indoor environments. Specifically, given one or more previous observations and a proposed navigation action sequence, we aim to generate plausible high-resolution visual observations for viewpoints that have not been visited, and do so in buildings not seen during training. Beyond applications in video editing and content creation, solving this problem would unlock model-based methods for many embodied AI tasks, including navigating to objects [5], instruction-guided navigation [3, 66, 40] and dialog-guided navigation [74, 26]. For example, an agent asked to find a certain type of object in a novel building, e.g. ‘find a chair’, could perform mental simulations using the world model to identify navigation trajectories that are most likely to include chair observations – without moving.

Building such a model is challenging. It requires synthesizing completions of partially visible objects, using as few as one previous observation. This is akin to novel view synthesis from a single image [19, 80], but with potentially unbounded viewpoint changes. There is also the related but considerably more extreme challenge of predicting around corners. For example, as shown in Fig. 1, any future navigation trajectory passing the entrance of an unseen room requires the model to plausibly imagine the entire contents of that room (we dub this the room reveal problem). This requires generalizing from the visual, spatial and semantic structure of previously explored environments, which in our case are photorealistic 3D captures of real indoor spaces in the Matterport3D dataset [7]. A third problem is temporal consistency: predictions of unseen building regions should ideally be stochastic (capturing the full distribution of possible outcomes), but revisited regions should be rendered consistently with previous observations.

Towards this goal, we introduce Pathdreamer. Given one or more visual observations (consisting of RGB, depth and semantic segmentation for panoramas) from an indoor scene, Pathdreamer synthesizes high-resolution visual observations (RGB, depth and semantic segmentations) along a specified trajectory through future viewpoints, using a hierarchical two-stage approach. Pathdreamer’s first stage, the Structure Generator, generates depth and semantic segmentations. Inspired by work in video prediction [11], these outputs are conditioned on a latent noise tensor capturing the stochastic information about the next observation (such as the layout of an unseen room) that cannot be predicted deterministically. The second stage, the Image Generator, renders the depth and semantic segmentations as realistic RGB images using modified Multi-SPADE blocks [63, 51]. To maintain long-term consistency in the generated observations, both stages use back-projected 3D point cloud representations which are re-projected into image space for context [51].

As illustrated in Figure 1, Pathdreamer can generate plausible views of previously unseen indoor scenes under large viewpoint changes, while also addressing the room reveal problem – in this case correctly hypothesizing that the unseen room revealed at position 2 will most likely resemble a kitchen. Empirically, using the Matterport3D dataset [7] and 360° observations, we evaluate both stages of our model against prior work and reasonable baselines and ablations. We find that the hierarchical structure of the model is essential for predicting over large viewpoint changes, that maintaining both RGB and semantic context is required, and that prediction quality degrades gradually when we evaluate with trajectory rollouts of up to 13m (with viewpoints 2.25m apart on average).

Encouraged by these results, we further evaluate whether RGB predictions from Pathdreamer can improve performance on a downstream visual navigation task. We focus on Vision-and-Language Navigation (VLN) using the R2R dataset [3]. VLN requires agents to interpret and execute natural language navigation instructions in a photorealistic 3D environment. A robust finding from previous VLN research is that task success is dramatically increased by allowing an agent to look ahead at unobserved parts of the environment while following an instruction [50]. We find that replacing look-ahead observations with Pathdreamer predictions maintains around half of this improvement, a finding we expect to have significant implications for research in this area. In summary, our main contributions include:

  • Proposing the study of visual world models for generic indoor environments and defining evaluation protocols and baselines for future work.

  • Pathdreamer, a stochastic hierarchical visual world model combining multiple, independent threads of previous work on video prediction [11], semantic image synthesis [63] and video-to-video synthesis [51].

  • Extensive experiments characterizing the performance of Pathdreamer and demonstrating improved results on the downstream VLN task [3].

2 Related Work

Video Prediction

Our work is closely related to the task of video prediction, which aims to predict the future frames of a video sequence. While some video prediction methods predict RGB video frames directly [76, 1, 41, 44], many others use hierarchical models to first predict an intermediate representation (such as semantic segmentation) [47, 35, 77, 82, 42], which improves the fidelity of long-term predictions [42]. Several approaches have also incorporated 3D point cloud representations, using projective camera geometry to explicitly infer aspects of the next frame [75, 51, 43]. Inspired by this work, we adopt and combine both the hierarchical two-stage approach and 3D point cloud representations. Further, since our interest is in action-conditional world models, we provide a trajectory of future viewpoints to the model rather than assuming a constant frame rate and modeling camera motion implicitly, which is more typical in video generation [44, 42].

Action-Conditional Video Prediction

Conditional video prediction to improve agent reasoning and planning has been explored in several tasks. This includes video prediction for Atari games conditioned on control inputs [60, 10, 62, 25] and 3D game environments like Doom [23]. In robotics, action-conditional video prediction has been investigated for object pushing in tabletop settings to improve generalization to novel objects [15, 16, 14]. This work has been restricted to simple environments and low-resolution images, such as 64×64 images of objects in a wooden box. To the best of our knowledge, we are the first to investigate action-conditional video prediction in building-scale environments with high-resolution (1024×512) images.

World Models and Navigation Priors

World models [23] are an appealing way to summarize and distill knowledge about complex, high-dimensional environments. However, world models can differ in their outputs. While Pathdreamer predicts visual observations, there is also a vast literature on world models that predict compact latent representations of future states [38, 24, 25] or other task-specific measurements [13] or rewards [61]. This includes recent work attempting to learn statistical regularities and other priors for indoor navigation—for example, by mining spatial co-occurrences from real estate video tours [8], learning to predict top-down belief maps over room characteristics [58], or learning to reconstruct house floor plans using audio and visual cues from a short video sequence [65]. In contrast to these approaches, we focus on explicitly predicting visual observations (i.e., pixels) which are generic, human-interpretable, and apply to a wide variety of downstream tasks and applications. Further, recent work identifies a close correlation between image prediction accuracy and downstream task performance in model-based RL [4].

Embodied Navigation Agents

High-quality 3D environment datasets such as Matterport3D [7], StreetLearn [56, 53], Gibson [81] and Replica [71] have triggered intense interest in developing embodied agents that act in realistic human environments [2]. Tasks of interest include ObjectNav [5] (navigating to an instance of a particular kind of object), and Vision-and-Language Navigation (VLN) [3], in which agents must navigate according to natural language instructions. Variations of VLN include indoor navigation [3, 33, 66, 40], street-level navigation [9, 53], vision-and-dialog navigation [59, 74, 26], VLN in continuous environments [39], and more. Notwithstanding considerable exploration of pretraining strategies [46, 27, 50, 87], data augmentation approaches [20, 21, 73], agent architectures and loss functions [86, 48, 49], existing work in this space considers only model-free approaches. Our aim is to unlock model-based approaches to these tasks, using a visual world model to encode prior commonsense knowledge about human environments and thereby relieve the burden on the agent to learn these regularities. Underscoring the potential of this direction, we note that using the ground-truth environment for planning with beam search typically improves VLN success rates on the R2R dataset by 17-19% [20, 73].

Novel View Synthesis

Finally, we position our work in the context of novel view synthesis [19, 37, 29, 18, 70, 85, 54]. Many methods have been proposed to represent 3D scenes, including point cloud representations [80], layered depth images [12], and mesh representations [68]. More recently, neural radiance fields (NeRF) [55, 52, 83] achieved impressive results by capturing volume density and color implicitly with a neural network. NeRF models can synthesize very high quality 3D scenes, but a significant drawback for our purposes is that they require a large number of input views to render a single scene (e.g., 20–62 images per scene in [55]). More importantly, these models are typically trained to represent a single scene, and currently do not generalize well to unseen environments. In contrast, our problem demands generalization to unseen environments, using as few as one previous observation.

Figure 2: Pathdreamer model architecture at step $t$. Given a history of visual observations (RGB, depth and semantics) and a trajectory of future viewpoints, the Structure Generator conditions on a sampled noise tensor before generating semantic and depth outputs to provide a high-level structural representation of the scene. Realistic RGB images are synthesized by the Image Generator in the second stage.

3 Pathdreamer

Pathdreamer is a world model that generates high-resolution visual observations from a trajectory of future viewpoints in buildings it has never observed. The input to Pathdreamer is a sequence of previous observations consisting of RGB images $I_{1:t-1}$, semantic segmentation images $s_{1:t-1}$, and depth images $d_{1:t-1}$ (where the depth and segmentations could be ground-truth or estimates from a model). We assume that a corresponding sequence of camera poses $P_{1:t-1}$ is available from an odometry system, and that the camera intrinsics are known or estimated. Our goal is to generate realistic RGB, semantic segmentation and depth images for a trajectory of future poses $P_t, P_{t+1}, \dots, P_T$, which may be provided up front or iteratively by some agent interacting with the returned observations. Note that we generate depth and segmentation because these modalities are useful in many downstream tasks. We assume that the future trajectory may traverse unseen areas of the environment, requiring the model to not only in-fill minor object dis-occlusions, but also to imagine entire room reveals (Figure 1).

Figure 2 shows our proposed hierarchical two-stage model for addressing this challenge. It uses a latent noise tensor $z_t$ to capture the stochastic information about the next observation (e.g. the layout of an unseen room) that cannot be predicted deterministically. Given a sampled noise tensor $z_t$, the first stage (Structure Generator) generates a new depth image $\hat{d}_t$ and segmentation image $\hat{s}_t$ to provide a plausible high-level semantic representation of the scene, using as context the previous semantic and depth images $s_{1:t-1}$, $d_{1:t-1}$. In the second stage (Image Generator), the predicted semantic and depth images $\hat{s}_t$, $\hat{d}_t$ are rendered into a realistic RGB image $\hat{I}_t$ using previous RGB images $I_{1:t-1}$ as context. In each stage, context is provided by accumulating previous observations as a 3D point cloud which is re-projected into 2D using the current pose $P_t$.
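The control flow of the two-stage rollout can be sketched as follows. This is only an illustrative outline: `rollout`, `sample_noise`, `structure_generator` and `image_generator` are hypothetical callables standing in for the learned components, and the context is kept as a plain list rather than a 3D point cloud.

```python
def rollout(history, future_poses, sample_noise, structure_generator, image_generator):
    """Generate one observation per future pose, feeding predictions back as context.

    history: list of dicts with keys 'rgb', 'sem', 'depth', 'pose' (previous observations).
    The three callables are placeholders (assumptions) for the learned models.
    """
    context = list(history)  # accumulated observations (stand-in for the point cloud)
    outputs = []
    for pose in future_poses:
        z = sample_noise(context, pose)                     # stochastic latent for this step
        sem, depth = structure_generator(context, pose, z)  # stage 1: semantics and depth
        rgb = image_generator(context, pose, sem, depth)    # stage 2: RGB rendering
        obs = {"rgb": rgb, "sem": sem, "depth": depth, "pose": pose}
        context.append(obs)                                 # predictions become context
        outputs.append(obs)
    return outputs
```

Calling `rollout` several times with different noise samples would yield alternative imagined trajectories for the same sequence of poses.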

3.1 Structure Generator: Segmentation & Depth

Pathdreamer’s first stage is the Structure Generator, a stochastic encoder-decoder network for generating diverse, plausible segmentation and depth images. Like [51], to provide the previous observation context, we first back-project the previous segmentations $s_{1:t-1}$ into a unified 3D semantic point cloud using the depth images $d_{1:t-1}$ and camera poses $P_{1:t-1}$. We then re-project this point cloud back into pixel space using the current pose $P_t$ to create sparse segmentation and depth guidance images $s'_t$, $d'_t$, which reflect the current pose.
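A minimal NumPy sketch of back-projecting an equirectangular panorama into a labeled point cloud and re-projecting it at a new camera position is given below. It makes several simplifying assumptions that are not taken from the paper: depth is measured along the viewing ray, the camera pose is a position only (no rotation), and the function names `pixel_rays`, `back_project` and `re_project` are hypothetical.

```python
import numpy as np

def pixel_rays(h, w):
    """Unit viewing direction for each pixel center of an h x w equirectangular image."""
    lon = (np.arange(w) + 0.5) / w * 2 * np.pi - np.pi   # longitude in [-pi, pi)
    lat = np.pi / 2 - (np.arange(h) + 0.5) / h * np.pi   # latitude in (-pi/2, pi/2)
    lon, lat = np.meshgrid(lon, lat)                      # both shaped (h, w)
    return np.stack([np.cos(lat) * np.sin(lon), np.sin(lat), np.cos(lat) * np.cos(lon)], axis=-1)

def back_project(depth, sem, cam_pos):
    """Lift every pixel to a 3D point carrying its semantic label."""
    h, w = depth.shape
    points = pixel_rays(h, w) * depth[..., None] + cam_pos
    return points.reshape(-1, 3), sem.reshape(-1)

def re_project(points, labels, cam_pos, h, w):
    """Render sparse semantic and depth guidance images; -1 and 0 mark unseen pixels."""
    rel = points - cam_pos
    rng = np.linalg.norm(rel, axis=1)
    keep = rng > 1e-6
    rel, rng, labels = rel[keep], rng[keep], labels[keep]
    lon = np.arctan2(rel[:, 0], rel[:, 2])
    lat = np.arcsin(np.clip(rel[:, 1] / rng, -1.0, 1.0))
    col = np.floor((lon + np.pi) / (2 * np.pi) * w).astype(int) % w
    row = np.clip(np.floor((np.pi / 2 - lat) / np.pi * h).astype(int), 0, h - 1)
    flat = row * w + col
    depth_buf = np.full(h * w, np.inf)
    np.minimum.at(depth_buf, flat, rng)          # z-buffer: keep the nearest point per pixel
    nearest = rng <= depth_buf[flat]
    sem_buf = np.full(h * w, -1, dtype=int)
    sem_buf[flat[nearest]] = labels[nearest]
    depth_buf[np.isinf(depth_buf)] = 0.0         # pixels that received no point
    return sem_buf.reshape(h, w), depth_buf.reshape(h, w)
```

Re-projecting at the original camera position reproduces the input images, while re-projecting at a displaced position leaves holes wherever no previously observed point lands, which is the sparsity the guidance images exhibit.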

The input to the encoder is a one-hot encoding of the semantic guidance image $s'_t$, concatenated with the depth guidance image $d'_t$. The architecture of the encoder-decoder model is based on RedNet [34] – a ResNet-50 [28] architecture designed for indoor RGB-D semantic segmentation. RedNet uses transposed convolutions for upsampling in the decoder and skip connections between the encoder and decoder to preserve spatial information. Since the input contains a segmentation image, and segmentation classes differ across datasets, the encoder-decoder is not pretrained. We introduce the latent spatial noise tensor $z_t$ into the model by concatenating it with the feature map between the encoder and the decoder. The final output of the encoder-decoder model is a segmentation image $\hat{s}_t$ and a depth image $\hat{d}_t$, with segmentation predictions generated by a $C$-way softmax (where $C$ is the number of semantic classes) and depth outputs normalized to the range $[0, 1]$ and generated via a sigmoid function. At each step during inference, the segmentation prediction $\hat{s}_t$ is back-projected and added to the point cloud to assist prediction in future timesteps.

To generate the noise tensor $z_t$, we take inspiration from SVG [11] and learn a conditional prior noise distribution $p_{\text{prior}}(z_t)$. Intuitively, there are many possible scenes that may be generated for an unseen building region. We would like $z_t$ to carry the stochastic information about the next observation that the deterministic encoder cannot capture, and we would like the decoder to make good use of that information. During training, we encourage the first outcome by using a KL-divergence loss to force the prior distribution $p_{\text{prior}}(z_t)$ to be close to the posterior distribution $p_{\text{post}}(z_t)$, which is conditioned on the ground-truth segmentation and depth images. We encourage the second outcome by providing the decoder with sampled values from the posterior distribution (conditioned on the ground-truth outputs) during training. During inference, the latent noise is sampled from the prior distribution and the posterior distribution is not used. Both distributions are modeled using 3-layer CNNs that take their input from the encoder and output two channels representing $\mu$ and $\sigma$ to parameterize a multivariate Gaussian distribution $\mathcal{N}(\mu, \sigma)$. As shown in Figure 3, the noise is useful in encoding diverse, plausible representations of unseen regions.
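For diagonal Gaussians, the KL term and the sampling step described above have simple closed forms. The sketch below assumes the divergence is taken from the posterior to the prior, as in SVG-style training; that direction, and the function names, are assumptions rather than details confirmed by the text.

```python
import numpy as np

def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """KL(q || p) between diagonal Gaussians, summed over all tensor elements."""
    return 0.5 * np.sum(
        2.0 * np.log(sigma_p / sigma_q)
        + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / sigma_p ** 2
        - 1.0
    )

def sample_noise_tensor(mu, sigma, rng):
    """Reparameterized sample z = mu + sigma * eps with eps ~ N(0, I)."""
    return mu + sigma * rng.standard_normal(mu.shape)
```

During training the sample would be drawn from the posterior parameters, and at inference from the prior parameters (or set to the prior mean for deterministic evaluation).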

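Overall, the Structure Generator is trained to minimize a joint loss consisting of a cross-entropy loss for semantic predictions, a mean absolute error term for depth predictions, and the KL-divergence term for the noise tensor:

$$\mathcal{L}_{\text{Structure}} = \lambda_{ce}\,\mathcal{L}_{ce}(\hat{s}_t, s_t) + \lambda_{d}\,\lVert \hat{d}_t - d_t \rVert_1 + \lambda_{KL}\,\mathrm{KL}\big(p_{\text{post}}(z_t)\,\Vert\,p_{\text{prior}}(z_t)\big)$$

where $\hat{s}_t$, $\hat{d}_t$ are the predicted and $s_t$, $d_t$ the ground-truth segmentation and depth images, and $\lambda_{ce}$, $\lambda_{d}$, and $\lambda_{KL}$ are weights determined by a grid search. We set these to 1, 100, and 0.5 respectively.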
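As a rough illustration of how the three terms of the joint loss combine, the following NumPy sketch computes a per-pixel cross-entropy, a mean absolute depth error, and adds a precomputed KL value using the weights 1, 100 and 0.5 quoted above. The function names and the averaging over pixels are assumptions made for this example.

```python
import numpy as np

def cross_entropy(logits, target):
    """Mean per-pixel cross-entropy. logits: (H, W, C); target: (H, W) integer labels."""
    shifted = logits - logits.max(axis=-1, keepdims=True)          # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    picked = np.take_along_axis(log_probs, target[..., None], axis=-1)
    return float(-picked.mean())

def structure_loss(sem_logits, sem_gt, depth_pred, depth_gt, kl,
                   w_ce=1.0, w_depth=100.0, w_kl=0.5):
    """Weighted sum of the semantic, depth and KL terms."""
    ce = cross_entropy(sem_logits, sem_gt)
    l1 = float(np.abs(depth_pred - depth_gt).mean())               # mean absolute error
    return w_ce * ce + w_depth * l1 + w_kl * kl
```

The large depth weight is consistent with depth values being normalized to a small range, where absolute errors are numerically much smaller than cross-entropy values.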

Figure 3: When predicting around corners, the Structure Generator can sample diverse and semantically plausible scene layouts which are closely reflected in the RGB output of the Image Generator, shown here for two guidance image inputs (left columns; unseen areas are indicated by solid black regions). Each example shows three alternative room reveals and the groundtruth. In the bottom example, the model considers various completions for a bedroom but fails to anticipate the groundtruth’s matching lamp on the opposite side of the bed.

3.2 Image Generator: RGB

The Image Generator is an image-to-image translation GAN [22, 78] that converts the semantic and depth predictions $\hat{s}_t$, $\hat{d}_t$ from the first stage into a realistic RGB image $\hat{I}_t$. Our model architecture is based on SPADE blocks [63] that use spatially-adaptive normalization layers to insert context into multiple layers of the network. As with our Structure Generator, we maintain an accumulating 3D point cloud containing all previous image observations. This provides a sparse RGB guidance image $I'_t$ when re-projected. Similar to Multi-SPADE [51], we insert two SPADE normalization layers into each residual block: one conditioned on the concatenated semantic and depth inputs $[\hat{s}_t, \hat{d}_t]$, and one conditioned on the RGB guidance image $I'_t$. The sparsity of the RGB guidance image is handled by applying partial convolutions [45]. In total, the Image Generator consists of 7 Multi-SPADE blocks, preceded by a single convolution block.

Following SPADE [63], the model is trained with the GAN hinge loss, feature matching loss [78], and perceptual loss [36] from a pretrained VGG-19 [69] model. During training, the generator is provided with the ground-truth segmentation image $s_t$ and ground-truth depth image $d_t$. Our discriminator architecture is based on PatchGAN [32], and takes as input the concatenation of the ground-truth image $I_t$ or generated image $\hat{I}_t$, the ground-truth depth image $d_t$ and the ground-truth semantic image $s_t$. The losses for the generator and the discriminator are:

$$\mathcal{L}_G = -\mathbb{E}_{x}\big[D(G(x))\big] + \lambda_{VGG} \sum_i \lVert \phi^i(G(x)) - \phi^i(I_t) \rVert_1 + \lambda_{FM} \sum_i \lVert D^i(G(x)) - D^i(I_t) \rVert_1$$

$$\mathcal{L}_D = -\mathbb{E}_{x}\big[\min(0, -1 + D(I_t))\big] - \mathbb{E}_{x}\big[\min(0, -1 - D(G(x)))\big]$$

where $x$ denotes the complete set of inputs to the generator, $\phi^i$ denotes the intermediate output of the $i$-th layer of the pretrained VGG-19 network, $D^i$ denotes the output of the discriminator’s $i$-th layer, and the conditioning inputs to the discriminator have been dropped to save space. We follow [63] for the choice of weights $\lambda_{VGG}$ and $\lambda_{FM}$. Like the Structure Generator, the Image Generator is not pretrained.
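The hinge formulation of the adversarial losses, together with an L1 feature-difference term of the kind used for feature matching and perceptual losses, can be sketched in NumPy as below. The inputs are plain arrays of discriminator scores and feature maps; how those are produced, and any per-layer normalization, are outside this sketch and the function names are hypothetical.

```python
import numpy as np

def discriminator_hinge_loss(d_real, d_fake):
    """Hinge loss for the discriminator: push real scores above +1 and fake scores below -1."""
    return float(np.mean(np.maximum(0.0, 1.0 - d_real)) + np.mean(np.maximum(0.0, 1.0 + d_fake)))

def generator_hinge_loss(d_fake):
    """Adversarial term for the generator: raise the discriminator score of generated images."""
    return float(-np.mean(d_fake))

def l1_feature_loss(feats_generated, feats_real):
    """Sum over layers of the mean absolute difference between feature maps."""
    return float(sum(np.abs(g - r).mean() for g, r in zip(feats_generated, feats_real)))
```

The same `l1_feature_loss` helper applies to both discriminator features (feature matching) and VGG-19 features (perceptual loss); only the source of the feature maps differs.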

Figure 4: Example full prediction sequence beginning with one observation (depth, semantics, RGB) as context and generating observations for 3 new viewpoints traversing a corridor. At 2.3m the model completes a room reveal, imagining a kitchen-like space. After 8.6m the model’s predictions degrade. More examples are provided in the supplementary.

3.3 Training and Inference


For training and evaluation we use Matterport3D [7], a dataset of 10.8k RGB-D panoramas from 90 building-scale indoor environments. For each environment, Matterport3D also includes a textured 3D mesh which is annotated with 40 semantic classes of objects and building components. To align with downstream VLN tasks, in all experiments the RGB, depth and semantic images are panoramas in equirectangular format.


To train Pathdreamer, we sampled 400k trajectories from the Matterport3D training environments. To define feasible trajectories, we used the navigation graphs from the Room-to-Room (R2R) dataset [3], in which nodes correspond to panoramic image locations, and edges define navigable state transitions. For each trajectory 5–8 panoramas were sampled, choosing the starting node and the edge transitions uniformly at random. On average the viewpoints in these trajectories are 2m apart. Training with relatively large viewpoint changes is desirable, since the model learns to synthesize observations with large viewpoint changes in a single step (without the need to incur the computational cost of generating intervening frames). However, this does not preclude Pathdreamer from generating smooth video outputs at high frame rates (see our video generation results).


The first and second stages of the model are trained separately. For the Image Generator, we use the Matterport3D RGB panoramas as training targets at 1024×512 resolution. We use the Habitat simulator [67] to render ground-truth depth and semantic training inputs and stitch these into equirectangular panoramas. We perform data augmentation by randomly cropping and horizontally rolling the RGB panoramas, which we found essential due to the limited number of panoramas available.
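Horizontal rolling is a natural augmentation for equirectangular panoramas because the left and right image edges are adjacent in the scene. A possible NumPy sketch is shown below; applying the same roll and crop to the depth and semantic inputs so that they stay aligned with the RGB target is an assumption of this example, as is the function name.

```python
import numpy as np

def roll_and_crop(rgb, depth, sem, rng, crop_hw):
    """Apply one random horizontal roll and one random crop to all three modalities."""
    h, w = rgb.shape[:2]
    shift = int(rng.integers(0, w))                       # rotate the heading by a random amount
    rolled = [np.roll(x, shift, axis=1) for x in (rgb, depth, sem)]
    ch, cw = crop_hw
    top = int(rng.integers(0, h - ch + 1))
    left = int(rng.integers(0, w - cw + 1))
    cropped = [x[top:top + ch, left:left + cw] for x in rolled]
    return cropped, shift
```

Because the roll wraps pixels around the image border, no content is lost and every heading becomes equally likely to appear at the image center.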

To train the Structure Generator, we again used Habitat to render depth and semantic images. Since this stage does not require aligned RGB images for training, in this case we performed data augmentation by perturbing the viewpoint coordinates with a random Gaussian noise vector drawn independently along each 3D axis. The Structure Generator was trained with equirectangular panoramas at 512×256 resolution.


To avoid heading discontinuities during inference, we use circular padding on the image x-axis for both the Structure Generator and the Image Generator. The 512×256 resolution semantic and depth outputs of the Structure Generator are upsampled to 1024×512 using nearest neighbor interpolation before they are passed to the Image Generator. In quantitative experiments, we set the Structure Generator noise tensor $z_t$ to the mean of the prior.
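Both inference-time operations are easy to illustrate in NumPy. The sketch below wraps an image around its horizontal axis before a convolution would be applied, and doubles the resolution with nearest-neighbor interpolation; the function names and the padding width are illustrative assumptions.

```python
import numpy as np

def circular_pad_x(img, pad):
    """Pad the width axis by wrapping around, so both image edges see their true neighbors."""
    pad_width = [(0, 0), (pad, pad)] + [(0, 0)] * (img.ndim - 2)
    return np.pad(img, pad_width, mode="wrap")

def upsample_nearest(img, factor=2):
    """Nearest-neighbor upsampling by an integer factor along height and width."""
    return img.repeat(factor, axis=0).repeat(factor, axis=1)
```

Nearest-neighbor interpolation is a sensible choice for semantic maps because it never blends two class labels into a meaningless intermediate value.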

4 Experiments

For evaluation we use the paths from the Val-Seen and Val-Unseen splits of the R2R dataset [3]. Val-Seen contains 340 trajectories from environments in the Matterport3D training split. Val-Unseen contains 783 trajectories in Matterport3D environments not seen in training. Since R2R trajectories contain 5-7 panoramas and at least 1 previous observation is given as context, we report evaluations over 1–6 steps, representing predictions over trajectory rollouts of around 2–13m (panoramas are 2.25m apart on average). See Figure 4 for an example rollout over 8.6m. We characterize the performance of Pathdreamer in comparison to baselines, ablations and in the context of the downstream task of Vision-and-Language Navigation (VLN).

4.1 Pathdreamer Results

Semantic Generation

A key feature of our approach is the ability to generate semantic segmentation and depth outputs, in addition to RGB. We evaluate the generated semantic segmentation images using mean Intersection-Over-Union (mIOU) and report results for:

  • Nearest Neighbor: A baseline without any learned components, using nearest-neighbor interpolation to fill holes in the projected semantic guidance image $s'_t$.

  • Ours (Teacher Forcing): Structure Generator trained using the ground truth semantic and depth images as the previous observation at every time step.

  • Ours (Recurrent): Structure Generator trained while feeding back its own semantic and depth predictions as previous observations for the next step prediction. This reduces train-test mismatch and may allow the model to compensate for errors when doing longer roll-outs.

We also tried training the hierarchical convolutional LSTM from [42], but found that it frequently collapsed to a single class prediction. We attribute this to the large viewpoint changes and heavy occlusion in the training sequences; we believe this can be more effectively modeled with point cloud geometry than with a geometry-unaware LSTM.

As illustrated in Table 1, Pathdreamer performs far better than the Nearest Neighbor baseline regardless of the number of steps in the rollout or the number of previous observations used as context. As expected, performance in seen environments is higher than unseen. Perhaps surprisingly, in Figure 5(a) we show that Recurrent training improves results during longer rollouts in the training environments (Val-Seen), but this does not improve results on Val-Unseen, perhaps indicating that the error compensation learned by the Structure Generator does not easily generalize.

In addition to accurate predictions, we also want generated results to be diverse. Figure 3 shows that our model can generate diverse semantic scenes by interpolating the noise tensor $z_t$, and that the RGB outputs closely reflect the generated semantic image. This allows us to generate multiple plausible alternatives for the same navigation trajectory.
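For reference, mean Intersection-Over-Union over a pair of label maps can be computed as in the sketch below. Skipping classes that appear in neither the prediction nor the ground truth is a common convention assumed here; the exact averaging protocol used for the reported numbers is not specified in the text.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Average per-class IoU, ignoring classes absent from both label maps."""
    ious = []
    for c in range(num_classes):
        p, t = pred == c, target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue                      # class not present in either map
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious))
```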

Figure 5: Pathdreamer semantic segmentation mean-IOU (a, above; ↑ higher is better) and RGB generation FID (b, below; ↓ lower is better). [TF]: Teacher Forcing. [Rec]: Recurrent. [GT]: Ground truth semantic inputs. [SG]: Structure Generator predictions. Results are shown for Val-Seen (left) and Val-Unseen (right). Confidence intervals indicate the range of outcomes with 1, 2 or 3 previous observations as context.

RGB Generation

To evaluate the quality of RGB panoramas generated by the Image Generator, we compute the Fréchet Inception Distance (FID) [30] between generated and real images for each step in the paths. We report results using the semantic images generated by the Structure Generator as inputs (i.e., our full model). To quantify the potential for uplift with better Structure Generators, we also report results using ground truth semantic segmentations as input. We compare to two ablated versions of our model:

  • No Semantics: The semantic and depth inputs $\hat{s}_t$, $\hat{d}_t$ are removed from the Multi-SPADE blocks.

  • SPADE: An ablation of the RGB inputs to the model, comprising the previous RGB image $I_{t-1}$ and the re-projected RGB guidance image $I'_t$. The semantic image replaces the previous RGB image as input to the model and the RGB guidance input layers are removed from the Multi-SPADE blocks, making this effectively the SPADE model [63].
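FID compares the statistics of feature vectors extracted from real and generated images. The sketch below computes only the Fréchet distance between two sets of feature vectors; extracting Inception features is outside its scope, and the eigenvalue-based trace of the matrix square root is an implementation choice for this example rather than the evaluation code used by the authors.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two (n_samples, n_features) arrays."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) equals the sum of square roots of the eigenvalues of cov_a @ cov_b.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(np.sum((mu_a - mu_b) ** 2) + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)
```

Identical feature sets give a distance of zero, and shifting every feature vector by a constant offset increases the distance by the squared length of that offset.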

                                    Val-Seen               Val-Unseen
Model                    Context    1 Step    1–6 Steps    1 Step    1–6 Steps
Nearest Neighbor         1          59.5      32.0         59.1      30.6
Ours (Teacher Forcing)   1          84.9      59.2         78.3      50.8
Ours (Recurrent)         1          84.7      65.9         77.5      50.9
Nearest Neighbor         2          57.4      35.2         56.5      33.8
Ours (Teacher Forcing)   2          85.4      64.6         77.4      55.5
Ours (Recurrent)         2          85.1      70.2         76.6      55.7
Nearest Neighbor         3          57.4      38.7         56.1      37.7
Ours (Teacher Forcing)   3          85.1      68.5         77.3      60.4
Ours (Recurrent)         3          84.6      72.7         76.8      60.8

Table 1: Mean-IOU (↑) for generated semantic segmentations with varying context and prediction steps.
                               Inputs        Val-Seen               Val-Unseen
Model          Context Obs     Sem    RGB    1 Step    1–6 Steps    1 Step    1–6 Steps
No Semantics   1               -      ✓      61.2      112.1        64.0      117.8
SPADE          1               GT     -      23.3      24.9         47.3      50.3
Ours           1               GT     ✓      25.1      28.6         37.9      45.2
Ours           1               SG     ✓      26.2      48.3         36.6      85.5
No Semantics   2               -      ✓      54.2      98.9         65.6      107.4
SPADE          2               GT     -      22.8      25.3         52.3      51.2
Ours           2               GT     ✓      24.4      28.6         41.3      44.9
Ours           2               SG     ✓      25.7      42.6         41.8      74.7
No Semantics   3               -      ✓      54.9      89.8         64.2      94.5
SPADE          3               GT     -      23.1      26.2         52.8      50.7
Ours           3               GT     ✓      24.5      29.2         41.5      44.2
Ours           3               SG     ✓      26.3      39.2         42.9      63.6

Table 2: FID scores (↓) for generated RGB images with varying context and prediction steps, using either ground truth semantics (GT) or Structure Generator predictions (SG) as input.

As shown in Table 2, SPADE performs the best in Val-Seen, indicating that the model has the capacity to memorize the training environments. In this case, RGB inputs are not necessary. However, our model performs noticeably better in Val-Unseen, highlighting the importance of maintaining RGB context in unseen environments (which is our focus). Performance degrades significantly in the No Semantics setting in both Val-Seen and Val-Unseen. We observed that without semantic inputs, the model is unable to generate meaningful images over longer horizons, which validates our two-stage hierarchical approach. These results are reflected in the FID scores, as well as qualitatively (Figure 6); the Image Generator’s outputs are significantly crisper, especially over longer horizons. Due to the benefit of guidance images, the Image Generator’s textures are also generally better matched with the unseen environment, while SPADE tends to wash out textures, usually creating images of a standard style. Figure 5(b) plots performance for every setting step-by-step. FID of the Image Generator improves substantially when using ground truth semantics, particularly for longer rollouts, highlighting the potential to benefit from improvements to the Structure Generator.

Figure 6: Visual comparison of ablated Image Generator outputs on Val-Unseen using ground truth segmentation and depth inputs. Both RGB and semantic context is required for best performance.

4.2 VLN Results

Finally, we evaluate whether predictions from Pathdreamer can improve performance on a downstream visual navigation task. We focus on Vision-and-Language Navigation (VLN) using the R2R dataset [3]. Because reaching the navigation goal requires successfully grounding natural language instructions to visual observations, this provides a challenging task-based assessment of prediction quality.

In our inference setting, at each step while moving through the environment we use a baseline VLN agent based on [79] to generate a large number of possible future trajectories using beam search. We then rank these alternative trajectories using an instruction-trajectory compatibility model [84] to assess which trajectory best matches the instruction. The agent then executes the first action from the top-ranked trajectory before repeating the process. We consider three different planning horizons, with future trajectories containing 1, 2 or 3 forward steps.
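One planning step of this procedure can be sketched as follows. The sketch assumes the compatibility score is a cosine similarity between an instruction embedding and a trajectory embedding; `choose_next_action` and `encode_trajectory` are hypothetical names, and candidate generation by beam search is taken as given.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def choose_next_action(instruction_vec, candidates, encode_trajectory):
    """Rank candidate trajectories against the instruction and return the best first action.

    candidates: sequences of actions proposed by the baseline agent (e.g. via beam search).
    encode_trajectory: placeholder for the trajectory encoder, which would consume
    ground-truth, predicted, or repeated observations for the future steps.
    """
    scores = [cosine(instruction_vec, encode_trajectory(c)) for c in candidates]
    best = candidates[int(np.argmax(scores))]
    return best[0], scores
```

After the chosen action is executed, the agent would regenerate candidates from its new position and repeat the ranking.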

The instruction-trajectory compatibility model is a dual-encoder that separately encodes textual instructions and trajectories (encoded using visual observations and path geometry) into a shared latent space. To improve performance on incomplete paths, we introduce truncated paths into the original contrastive training scheme proposed in [84]. The compatibility model is trained using only ground truth observations. However, during inference, RGB observations for future steps are drawn from three different sources:

  • Ground truth: RGB observations from the actual environment, i.e., look-ahead observations.

  • Pathdreamer: RGB predictions from our model.

  • Repeated pano: A simple baseline in which the most recent RGB observation is repeated in future steps.
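To make the three observation sources concrete, the sketch below shows how future observations might be assembled before scoring. It rests on several assumptions that are not taken from the paper: observations are plain feature vectors, the trajectory embedding is a mean over observation features, and compatibility is cosine similarity in the shared space. The names `future_observations`, `mean_pool` and `cosine` are hypothetical, and `predict` is a placeholder for a world model such as Pathdreamer.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def future_observations(
    source: str,
    last_obs: Vector,
    true_future: Sequence[Vector],
    predict: Callable[[Vector, int], List[Vector]],
    steps: int,
) -> List[Vector]:
    """Return observation features for the next `steps` viewpoints."""
    if source == "ground_truth":
        # Look-ahead: read the actual observations from the environment.
        return [list(obs) for obs in true_future[:steps]]
    if source == "pathdreamer":
        # Placeholder call standing in for the world model's predictions.
        return predict(last_obs, steps)
    if source == "repeated_pano":
        # Baseline: copy the most recent observation into every future step.
        return [list(last_obs) for _ in range(steps)]
    raise ValueError(f"unknown observation source: {source}")


def mean_pool(observations: Sequence[Vector]) -> Vector:
    # Assumed trajectory encoder: average the per-step features.
    count = len(observations)
    return [sum(column) / count for column in zip(*observations)]


def cosine(a: Vector, b: Vector) -> float:
    # Assumed compatibility score between two embeddings in the shared space.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

Because the scorer is the same in all three settings, differences in navigation performance in this setup come from the observation source alone.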

Note that in all cases the geometry of the future trajectories is determined by the ground truth R2R navigation graphs. In Table 3, we report Val-Unseen results for this experiment using standard metrics for VLN: navigation error (NE), success rate (SR), success weighted by path length (SPL), normalized Dynamic Time Warping (nDTW) [31], and success weighted by normalized Dynamic Time Warping (sDTW) [31].
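For readers unfamiliar with these metrics, the following sketch computes them for a single episode. It is an illustration under simplifying assumptions rather than the evaluation code used for Table 3: distances are Euclidean between 2D points instead of shortest-path distances on the navigation graph, the success threshold is taken to be 3 m, and the function names are hypothetical. SPL follows the definition S · l / max(p, l) and nDTW follows exp(−DTW(R, Q) / (|R| · d_th)).

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]


def path_length(path: Sequence[Point]) -> float:
    return sum(math.dist(p, q) for p, q in zip(path, path[1:]))


def dtw(reference: Sequence[Point], query: Sequence[Point]) -> float:
    """Classic dynamic time warping cost between two point sequences."""
    n, m = len(reference), len(query)
    cost: List[List[float]] = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = math.dist(reference[i - 1], query[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def episode_metrics(
    reference: Sequence[Point],
    query: Sequence[Point],
    shortest_dist: float,
    threshold: float = 3.0,  # assumed success radius in meters
) -> Dict[str, float]:
    ne = math.dist(query[-1], reference[-1])  # navigation error
    success = 1.0 if ne < threshold else 0.0
    longest = max(path_length(query), shortest_dist)
    spl = success * shortest_dist / longest if longest > 0 else success
    ndtw = math.exp(-dtw(reference, query) / (len(reference) * threshold))
    return {"NE": ne, "SR": success, "SPL": spl, "nDTW": ndtw, "sDTW": success * ndtw}
```

Reported values are averages of these per-episode quantities over the evaluation split. SPL penalizes successful but inefficient paths, while nDTW and sDTW reward staying close to the reference path rather than only reaching the goal.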

Consistent with prior work [20, 73], we find that looking ahead using ground truth visual observations provides a robust performance boost, e.g., success rate increases from 44.6% with 1 planning step (top panel) to 59.3% with 3 planning steps (bottom panel). At the other extreme, the Repeated pano baseline is weak, with a success rate of just 35.7% with 1 planning step (top row). This is not surprising: repeating the last pano denies the compatibility model any useful visual representation of the next action, which is crucial to performance [20, 73]. However, increasing the planning horizon does improve performance even for the Repeated pano baseline, since the compatibility model is able to compare the geometry of alternative future trajectories. Finally, we observe that using Pathdreamer’s visual observations closes about half the gap between the Repeated pano baseline and the ground truth observations, e.g., with 3 planning steps, 48.9% success with Pathdreamer vs. 40.6% and 59.3% respectively for the others. We conclude that using Pathdreamer as a visual world model can improve performance on downstream tasks, although existing agents still rely on a navigation graph to define the feasible action space at each step. Pathdreamer is complementary to current SOTA model-based approaches, and a combination would likely lead to further boosts in VLN performance, which is worth investigating in future work.

Observations    Plan Steps   NE     SR     SPL    nDTW   sDTW
Repeated pano   1            6.75   35.7   33.8   52.0   31.2
Pathdreamer     1            6.48   40.3   38.8   55.0   35.6
Ground truth    1            5.80   44.6   42.7   58.9   39.4
Repeated pano   2            6.76   36.8   34.0   51.8   31.7
Pathdreamer     2            5.76   46.5   44.0   59.5   41.0
Ground truth    2            4.95   54.3   51.3   64.9   48.3
Repeated pano   3            6.25   40.6   37.7   55.6   35.2
Pathdreamer     3            5.61   48.9   46.0   60.5   43.1
Ground truth    3            4.44   59.3   55.8   67.9   52.7
Table 3: VLN Val-Unseen results using an instruction-trajectory compatibility model to rank alternative future trajectories with planning horizons of 1, 2 or 3 steps.
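The claim that Pathdreamer closes about half the gap can be checked directly from the success rates in Table 3. The short calculation below uses only the SR values from the table; the function name `gap_closed` and the dictionary layout are illustrative choices.

```python
# Success rates (SR, %) from Table 3, keyed by number of planning steps.
SUCCESS_RATE = {
    1: {"repeated_pano": 35.7, "pathdreamer": 40.3, "ground_truth": 44.6},
    2: {"repeated_pano": 36.8, "pathdreamer": 46.5, "ground_truth": 54.3},
    3: {"repeated_pano": 40.6, "pathdreamer": 48.9, "ground_truth": 59.3},
}


def gap_closed(row: dict) -> float:
    """Fraction of the Repeated pano -> Ground truth SR gap recovered by Pathdreamer."""
    gained = row["pathdreamer"] - row["repeated_pano"]
    available = row["ground_truth"] - row["repeated_pano"]
    return gained / available


for steps, row in SUCCESS_RATE.items():
    print(f"{steps} planning step(s): {gap_closed(row):.2f} of the gap closed")
```

The resulting fractions are roughly 0.52, 0.55 and 0.44 for 1, 2 and 3 planning steps respectively, consistent with "about half" at every horizon.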

5 Conclusion

In this paper, we presented Pathdreamer, a stochastic hierarchical visual world model. Pathdreamer is capable of synthesizing realistic and diverse panoramic images for unseen trajectories in real buildings. As a visual world model, Pathdreamer also shows strong promise in improving performance on downstream tasks, such as VLN. Most notably, we show that Pathdreamer captures around half the benefit of looking ahead at actual observations from the environment. The efficacy of Pathdreamer in the VLN task may be attributed to its ability to model fundamental constraints in the real world, relieving the agent from having to learn the geometry and the visual and semantic structure of buildings. Applying Pathdreamer to other embodied navigation tasks such as Object-Nav [5], VLN-CE [39] and street-level navigation [9, 53] is a natural direction for future work.


  • [1] S. Aigner and M. Körner (2018) FutureGAN: anticipating the future frames of video sequences using spatio-temporal 3d convolutions in progressively growing gans. arXiv preprint arXiv:1810.01325. Cited by: §2.
  • [2] P. Anderson, A. Chang, D. S. Chaplot, A. Dosovitskiy, S. Gupta, V. Koltun, J. Kosecka, J. Malik, R. Mottaghi, M. Savva, and A. R. Zamir (2018) On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757. Cited by: §2.
  • [3] P. Anderson, Q. Wu, D. Teney, J. Bruce, M. Johnson, N. Sünderhauf, I. Reid, S. Gould, and A. van den Hengel (2018) Vision-and-language navigation: interpreting visually-grounded navigation instructions in real environments. In CVPR, pp. 3674–3683. Cited by: 3rd item, §1, §1, §2, §3.3, §4.2, §4.
  • [4] M. Babaeizadeh, M. T. Saffar, D. Hafner, H. Kannan, C. Finn, S. Levine, and D. Erhan (2020) Models, pixels, and rewards: evaluating design trade-offs in visual model-based reinforcement learning. arXiv preprint arXiv:2012.04603. Cited by: §2.
  • [5] D. Batra, A. Gokaslan, A. Kembhavi, O. Maksymets, R. Mottaghi, M. Savva, A. Toshev, and E. Wijmans (2020) Objectnav revisited: on evaluation of embodied agents navigating to objects. arXiv preprint arXiv:2006.13171. Cited by: §1, §2, §5.
  • [6] L. Buesing, T. Weber, Y. Zwols, S. Racaniere, A. Guez, J. Lespiau, and N. Heess (2019) Woulda, coulda, shoulda: counterfactually-guided policy search. In ICLR, Cited by: §1.
  • [7] A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niessner, M. Savva, S. Song, A. Zeng, and Y. Zhang (2017) Matterport3D: learning from RGB-D data in indoor environments. International Conference on 3D Vision (3DV). Cited by: §1, §1, §2, §3.3.
  • [8] M. Chang, A. Gupta, and S. Gupta (2020) Semantic visual navigation by watching youtube videos. NeurIPS. Cited by: §2.
  • [9] H. Chen, A. Suhr, D. Misra, N. Snavely, and Y. Artzi (2019) Touchdown: natural language navigation and spatial reasoning in visual street environments. In CVPR, pp. 12538–12547. Cited by: §2, §5.
  • [10] S. Chiappa, S. Racaniere, D. Wierstra, and S. Mohamed (2017) Recurrent environment simulators. ICLR. Cited by: §2.
  • [11] E. Denton and R. Fergus (2018) Stochastic video generation with a learned prior. ICML. Cited by: 2nd item, §1, §3.1.
  • [12] H. Dhamo, K. Tateno, I. Laina, N. Navab, and F. Tombari (2019) Peeking behind objects: layered depth prediction from a single image. Pattern Recognition Letters 125, pp. 333–340. Cited by: §2.
  • [13] A. Dosovitskiy and V. Koltun (2017) Learning to act by predicting the future. In ICLR, Cited by: §2.
  • [14] F. Ebert, C. Finn, A. X. Lee, and S. Levine (2017) Self-supervised visual planning with temporal skip connections. Conference on Robot Learning (CoRL). Cited by: §2.
  • [15] C. Finn, I. Goodfellow, and S. Levine (2016) Unsupervised learning for physical interaction through video prediction. In NeurIPS, pp. 64–72. Cited by: §2.
  • [16] C. Finn and S. Levine (2017) Deep visual foresight for planning robot motion. In ICRA, pp. 2786–2793. Cited by: §1, §2.
  • [17] C. Finn, X. Y. Tan, Y. Duan, T. Darrell, S. Levine, and P. Abbeel (2016) Deep spatial autoencoders for visuomotor learning. In ICRA, pp. 512–519. Cited by: §1.
  • [18] J. Flynn, M. Broxton, P. Debevec, M. DuVall, G. Fyffe, R. Overbeck, N. Snavely, and R. Tucker (2019) Deepview: view synthesis with learned gradient descent. In CVPR, pp. 2367–2376. Cited by: §2.
  • [19] J. Flynn, I. Neulander, J. Philbin, and N. Snavely (2016) DeepStereo: learning to predict new views from the world’s imagery. In CVPR, Cited by: §1, §2.
  • [20] D. Fried, R. Hu, V. Cirik, A. Rohrbach, J. Andreas, L. Morency, T. Berg-Kirkpatrick, K. Saenko, D. Klein, and T. Darrell (2018) Speaker-follower models for vision-and-language navigation. NeurIPS. Cited by: §2, §4.2.
  • [21] T. Fu, X. E. Wang, M. F. Peterson, S. T. Grafton, M. P. Eckstein, and W. Y. Wang (2020) Counterfactual vision-and-language navigation via adversarial path sampler. In ECCV, pp. 71–86. Cited by: §2.
  • [22] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial networks. NeurIPS. Cited by: §3.2.
  • [23] D. Ha and J. Schmidhuber (2018) World models. NeurIPS. Cited by: §1, §2, §2.
  • [24] D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi (2019) Dream to control: learning behaviors by latent imagination. ICLR. Cited by: §2.
  • [25] D. Hafner, T. Lillicrap, M. Norouzi, and J. Ba (2020) Mastering atari with discrete world models. arXiv preprint arXiv:2010.02193. Cited by: §1, §2, §2.
  • [26] M. Hahn, J. Krantz, D. Batra, D. Parikh, J. M. Rehg, S. Lee, and P. Anderson (2020) Where are you? localization from embodied dialog. Cited by: §1, §2.
  • [27] W. Hao, C. Li, X. Li, L. Carin, and J. Gao (2020) Towards learning a generic agent for vision-and-language navigation via pre-training. In CVPR, Cited by: §2.
  • [28] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §3.1.
  • [29] P. Henzler, V. Rasche, T. Ropinski, and T. Ritschel (2018) Single-image tomography: 3d volumes from 2d cranial x-rays. In Computer Graphics Forum, Vol. 37, pp. 377–388. Cited by: §2.
  • [30] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) GANs trained by a two time-scale update rule converge to a local nash equilibrium. In NeurIPS, Cited by: §4.1.
  • [31] G. Ilharco, V. Jain, A. Ku, E. Ie, and J. Baldridge (2019) General evaluation for instruction conditioned navigation using dynamic time warping. NeurIPS Workshop on Visually Grounded Interaction and Language (ViGIL). Cited by: §4.2.
  • [32] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. CVPR. Cited by: §3.2.
  • [33] V. Jain, G. Magalhaes, A. Ku, A. Vaswani, E. Ie, and J. Baldridge (2019) Stay on the path: instruction fidelity in vision-and-language navigation. Cited by: §2.
  • [34] J. Jiang, L. Zheng, F. Luo, and Z. Zhang (2018) Rednet: residual encoder-decoder network for indoor rgb-d semantic segmentation. arXiv preprint arXiv:1806.01054. Cited by: §3.1.
  • [35] X. Jin, H. Xiao, X. Shen, J. Yang, Z. Lin, Y. Chen, Z. Jie, J. Feng, and S. Yan (2017) Predicting scene parsing and motion dynamics in the future. In NeurIPS, pp. 6915–6924. Cited by: §2.
  • [36] J. Johnson, A. Alahi, and L. Fei-Fei (2016) Perceptual losses for real-time style transfer and super-resolution. In ECCV, pp. 694–711. Cited by: §3.2.
  • [37] A. Kar, C. Häne, and J. Malik (2017) Learning a multi-view stereo machine. In NeurIPS, pp. 365–376. Cited by: §2.
  • [38] M. Karl, M. Soelch, J. Bayer, and P. Van der Smagt (2017) Deep variational bayes filters: unsupervised learning of state space models from raw data. ICLR. Cited by: §2.
  • [39] J. Krantz, E. Wijmans, A. Majumdar, D. Batra, and S. Lee (2020) Beyond the nav-graph: vision and language navigation in continuous environments. In ECCV, Cited by: §2, §5.
  • [40] A. Ku, P. Anderson, R. Patel, E. Ie, and J. Baldridge (2020) Room-across-room: multilingual vision-and-language navigation with dense spatiotemporal grounding. EMNLP. Cited by: §1, §2.
  • [41] Y. Kwon and M. Park (2019) Predicting future frames using retrospective cycle gan. In CVPR, pp. 1811–1820. Cited by: §2.
  • [42] W. Lee, W. Jung, H. Zhang, T. Chen, J. Y. Koh, T. Huang, H. Yoon, H. Lee, and S. Hong (2021) Revisiting hierarchical approach for persistent long-term video prediction. In ICLR, Cited by: §2, §4.1.
  • [43] Z. Li, Z. Cui, and M. R. Oswald (2020) Street-view panoramic video synthesis from a single satellite image. arXiv preprint arXiv:2012.06628. Cited by: §2.
  • [44] A. Liu, R. Tucker, V. Jampani, A. Makadia, N. Snavely, and A. Kanazawa (2020) Infinite nature: perpetual view generation of natural scenes from a single image. arXiv preprint arXiv:2012.09855. Cited by: §2.
  • [45] G. Liu, F. A. Reda, K. J. Shih, T. Wang, A. Tao, and B. Catanzaro (2018) Image inpainting for irregular holes using partial convolutions. In ECCV, pp. 85–100. Cited by: §3.2.
  • [46] J. Lu, D. Batra, D. Parikh, and S. Lee (2019) Vilbert: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NeurIPS, pp. 13–23. Cited by: §2.
  • [47] P. Luc, N. Neverova, C. Couprie, J. Verbeek, and Y. LeCun (2017) Predicting deeper into the future of semantic segmentation. In ICCV, pp. 648–657. Cited by: §2.
  • [48] C. Ma, J. Lu, Z. Wu, G. AlRegib, Z. Kira, R. Socher, and C. Xiong (2019) Self-monitoring navigation agent via auxiliary progress estimation. ICLR. Cited by: §2.
  • [49] C. Ma, Z. Wu, G. AlRegib, C. Xiong, and Z. Kira (2019) The regretful agent: heuristic-aided navigation through progress estimation. In CVPR, pp. 6732–6740. Cited by: §2.
  • [50] A. Majumdar, A. Shrivastava, S. Lee, P. Anderson, D. Parikh, and D. Batra (2020) Improving vision-and-language navigation with image-text pairs from the web. ECCV. Cited by: §1, §2.
  • [51] A. Mallya, T. Wang, K. Sapra, and M. Liu (2020) World-consistent video-to-video synthesis. ECCV. Cited by: 2nd item, §1, §2, §3.1, §3.2.
  • [52] R. Martin-Brualla, N. Radwan, M. S. M. Sajjadi, J. T. Barron, A. Dosovitskiy, and D. Duckworth (2021) NeRF in the Wild: Neural Radiance Fields for Unconstrained Photo Collections. In CVPR, Cited by: §2.
  • [53] H. Mehta, Y. Artzi, J. Baldridge, E. Ie, and P. Mirowski (2020) Retouchdown: adding touchdown to streetlearn as a shareable resource for language grounding tasks in street view. EMNLP Workshop on Spatial Language Understanding (SpLU). Cited by: §2, §5.
  • [54] B. Mildenhall, P. P. Srinivasan, R. Ortiz-Cayon, N. K. Kalantari, R. Ramamoorthi, R. Ng, and A. Kar (2019) Local light field fusion: practical view synthesis with prescriptive sampling guidelines. ACM Transactions on Graphics (TOG) 38 (4), pp. 1–14. Cited by: §2.
  • [55] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng (2020) Nerf: representing scenes as neural radiance fields for view synthesis. ECCV. Cited by: §2.
  • [56] P. Mirowski, A. Banki-Horvath, K. Anderson, D. Teplyashin, K. M. Hermann, M. Malinowski, M. K. Grimes, K. Simonyan, K. Kavukcuoglu, A. Zisserman, et al. (2019) The streetlearn environment and dataset. arXiv preprint arXiv:1903.01292. Cited by: §2.
  • [57] A. Nagabandi, G. Kahn, R. S. Fearing, and S. Levine (2018) Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In ICRA, pp. 7559–7566. Cited by: §1.
  • [58] M. Narasimhan, E. Wijmans, X. Chen, T. Darrell, D. Batra, D. Parikh, and A. Singh (2020) Seeing the un-scene: learning amodal semantic maps for room navigation. In ECCV, pp. 513–529. Cited by: §2.
  • [59] K. Nguyen and H. Daumé III (2019) Help, anna! visual navigation with natural multimodal assistance via retrospective curiosity-encouraging imitation learning. EMNLP. Cited by: §2.
  • [60] J. Oh, X. Guo, H. Lee, R. L. Lewis, and S. Singh (2015) Action-conditional video prediction using deep networks in atari games. In NeurIPS, pp. 2863–2871. Cited by: §2.
  • [61] J. Oh, S. Singh, and H. Lee (2017) Value prediction network. In NeurIPS, Cited by: §2.
  • [62] B. Osinski, C. Finn, D. Erhan, G. Tucker, H. Michalewski, K. Czechowski, L. M. Kaiser, M. Babaeizadeh, P. Kozakowski, P. Milos, R. H. Campbell, A. Mohiuddin, R. Sepassi, and S. Levine (2020) Model-based reinforcement learning for atari. In ICLR, Cited by: §1, §2.
  • [63] T. Park, M. Liu, T. Wang, and J. Zhu (2019) Semantic image synthesis with spatially-adaptive normalization. In CVPR, pp. 2337–2346. Cited by: 2nd item, §1, §3.2, §3.2, 2nd item.
  • [64] A. Piergiovanni, A. Wu, and M. S. Ryoo (2019) Learning real-world robot policies by dreaming. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 7680–7687. Cited by: §1.
  • [65] S. Purushwalkam, S. V. A. Gari, V. K. Ithapu, C. Schissler, P. Robinson, A. Gupta, and K. Grauman (2020) Audio-visual floorplan reconstruction. arXiv preprint arXiv:2012.15470. Cited by: §2.
  • [66] Y. Qi, Q. Wu, P. Anderson, X. Wang, W. Y. Wang, C. Shen, and A. v. d. Hengel (2020) REVERIE: remote embodied visual referring expression in real indoor environments. In CVPR, Cited by: §1, §2.
  • [67] M. Savva, A. Kadian, O. Maksymets, Y. Zhao, E. Wijmans, B. Jain, J. Straub, J. Liu, V. Koltun, J. Malik, D. Parikh, and D. Batra (2019) Habitat: A Platform for Embodied AI Research. In ICCV, Cited by: §3.3.
  • [68] M. Shih, S. Su, J. Kopf, and J. Huang (2020) 3D photography using context-aware layered depth inpainting. In CVPR, Cited by: §2.
  • [69] K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. ICLR. Cited by: §3.2.
  • [70] P. P. Srinivasan, R. Tucker, J. T. Barron, R. Ramamoorthi, R. Ng, and N. Snavely (2019) Pushing the boundaries of view extrapolation with multiplane images. In CVPR, pp. 175–184. Cited by: §2.
  • [71] J. Straub, T. Whelan, L. Ma, Y. Chen, E. Wijmans, S. Green, J. J. Engel, R. Mur-Artal, C. Ren, S. Verma, A. Clarkson, M. Yan, B. Budge, Y. Yan, X. Pan, J. Yon, Y. Zou, K. Leon, N. Carter, J. Briales, T. Gillingham, E. Mueggler, L. Pesqueira, M. Savva, D. Batra, H. M. Strasdat, R. D. Nardi, M. Goesele, S. Lovegrove, and R. Newcombe (2019) The Replica dataset: a digital replica of indoor spaces. arXiv preprint arXiv:1906.05797. Cited by: §2.
  • [72] R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. A Bradford Book, Cambridge, MA, USA. Cited by: §1.
  • [73] H. Tan, L. Yu, and M. Bansal (2019) Learning to navigate unseen environments: back translation with environmental dropout. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 2610–2621. Cited by: §2, §4.2.
  • [74] J. Thomason, M. Murray, M. Cakmak, and L. Zettlemoyer (2020) Vision-and-dialog navigation. In Conference on Robot Learning (CoRL), pp. 394–406. Cited by: §1, §2.
  • [75] S. Vora, R. Mahjourian, S. Pirk, and A. Angelova (2018) Future semantic segmentation leveraging 3d information. ECCV 3D Reconstruction meets Semantics Workshop. Cited by: §2.
  • [76] J. Walker, A. Gupta, and M. Hebert (2014) Patch to the future: unsupervised visual prediction. In CVPR, pp. 3302–3309. Cited by: §2.
  • [77] T. Wang, M. Liu, J. Zhu, G. Liu, A. Tao, J. Kautz, and B. Catanzaro (2018) Video-to-video synthesis. NeurIPS. Cited by: §2.
  • [78] T. Wang, M. Liu, J. Zhu, A. Tao, J. Kautz, and B. Catanzaro (2018) High-resolution image synthesis and semantic manipulation with conditional gans. In CVPR, pp. 8798–8807. Cited by: §3.2, §3.2.
  • [79] X. Wang, Q. Huang, A. Celikyilmaz, J. Gao, D. Shen, Y. Wang, W. Y. Wang, and L. Zhang (2019) Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation. In CVPR, pp. 6629–6638. Cited by: §4.2.
  • [80] O. Wiles, G. Gkioxari, R. Szeliski, and J. Johnson (2020) Synsin: end-to-end view synthesis from a single image. In CVPR, pp. 7467–7477. Cited by: §1, §2.
  • [81] F. Xia, A. R. Zamir, Z. He, A. Sax, J. Malik, and S. Savarese (2018) Gibson env: real-world perception for embodied agents. In CVPR, Cited by: §2.
  • [82] J. Xu, B. Ni, Z. Li, S. Cheng, and X. Yang (2018) Structure preserving video prediction. In CVPR, pp. 1460–1469. Cited by: §2.
  • [83] A. Yu, V. Ye, M. Tancik, and A. Kanazawa (2020) PixelNeRF: neural radiance fields from one or few images. arXiv preprint arXiv:2012.02190. Cited by: §2.
  • [84] M. Zhao, P. Anderson, V. Jain, S. Wang, A. Ku, J. Baldridge, and E. Ie (2021) On the evaluation of vision-and-language navigation instructions. In Conference of the European Chapter of the Association for Computational Linguistics (EACL), Cited by: §4.2, §4.2.
  • [85] T. Zhou, R. Tucker, J. Flynn, G. Fyffe, and N. Snavely (2018) Stereo magnification: learning view synthesis using multiplane images. SIGGRAPH. Cited by: §2.
  • [86] F. Zhu, Y. Zhu, X. Chang, and X. Liang (2020) Vision-language navigation with self-supervised auxiliary reasoning tasks. In CVPR, pp. 10012–10022. Cited by: §2.
  • [87] W. Zhu, X. Wang, T. Fu, A. Yan, P. Narayana, K. Sone, S. Basu, and W. Y. Wang (2021) Multimodal text style transfer for outdoor vision-and-language navigation. EACL. Cited by: §2.