PanoGen: Text-Conditioned Panoramic Environment Generation for Vision-and-Language Navigation

05/30/2023
by Jialu Li, et al.

Vision-and-Language Navigation (VLN) requires an agent to follow language instructions to navigate through 3D environments. One main challenge in VLN is the limited availability of photorealistic training environments, which makes it hard to generalize to new and unseen environments. To address this problem, we propose PanoGen, a generation method that can potentially create an unlimited number of diverse panoramic environments conditioned on text. Specifically, we collect room descriptions by captioning the room images in existing Matterport3D environments, and leverage a state-of-the-art text-to-image diffusion model to generate new panoramic environments. We use recursive outpainting over the generated images to create consistent 360-degree panorama views. By conditioning on text descriptions, our new panoramic environments share similar semantics with the original environments, which ensures that object co-occurrences in the panoramas follow human intuition, while image outpainting introduces ample diversity in room appearance and layout. We then explore two ways of utilizing PanoGen in VLN pre-training and fine-tuning: we generate instructions for paths in our PanoGen environments with a speaker built on a pre-trained vision-and-language model for VLN pre-training, and we augment the visual observations with our panoramic environments during fine-tuning to avoid overfitting to the seen environments. Empirically, learning with our PanoGen environments achieves new state-of-the-art results on the Room-to-Room, Room-for-Room, and CVDN datasets. Pre-training with our PanoGen speaker data is especially effective on CVDN, whose instructions are under-specified and require commonsense knowledge. Finally, we show that the agent benefits from training with more generated panoramic environments, suggesting promising results for scaling up PanoGen.
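To make the recursive outpainting step concrete, below is a minimal sketch in Python. It assumes the Hugging Face diffusers Stable Diffusion inpainting pipeline as a stand-in for the paper's text-to-image model; the checkpoint name, window size, overlap width, and step count are illustrative choices, not the authors' settings.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

# Illustrative checkpoint; the paper's actual generator may differ.
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting",
    torch_dtype=torch.float16,
).to("cuda")

def outpaint_panorama(seed_view: Image.Image, caption: str,
                      steps: int = 6, overlap: int = 256) -> Image.Image:
    """Grow a single 512x512 room view into a wide panoramic strip by
    repeatedly outpainting to the right, reusing the last `overlap`
    columns as visual context so consecutive views stay consistent."""
    panorama = seed_view.convert("RGB").resize((512, 512))
    for _ in range(steps):
        # Place the rightmost `overlap` columns of the panorama on the
        # left of a fresh canvas; the rest is left for the model to fill.
        canvas = Image.new("RGB", (512, 512))
        context = panorama.crop((panorama.width - overlap, 0,
                                 panorama.width, 512))
        canvas.paste(context, (0, 0))
        # Inpainting mask: black (0) = keep pixels, white (255) = generate.
        mask = Image.new("L", (512, 512), 255)
        mask.paste(0, (0, 0, overlap, 512))
        new_view = pipe(prompt=caption, image=canvas,
                        mask_image=mask).images[0]
        # Stitch only the newly generated columns onto the strip.
        grown = Image.new("RGB", (panorama.width + 512 - overlap, 512))
        grown.paste(panorama, (0, 0))
        grown.paste(new_view.crop((overlap, 0, 512, 512)),
                    (panorama.width, 0))
        panorama = grown
    return panorama
```

In this sketch the seed view could come from an ordinary text-to-image pass on the same room caption; repeating the loop until the strip spans the full horizontal field of view would yield a 360-degree panorama in the spirit of the method described above.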

Related research

03/29/2022
EnvEdit: Environment Editing for Vision-and-Language Navigation
In Vision-and-Language Navigation (VLN), an agent needs to navigate thro...

04/08/2019
Learning to Navigate Unseen Environments: Back Translation with Environmental Dropout
A grand goal in AI is to build a robot that can accurately navigate base...

06/09/2022
FOAM: A Follower-aware Speaker Model For Vision-and-Language Navigation
The speaker-follower models have proven to be effective in vision-and-la...

04/11/2023
Improving Vision-and-Language Navigation by Generating Future-View Image Semantics
Vision-and-Language Navigation (VLN) is the task that requires an agent ...

11/17/2019
Counterfactual Vision-and-Language Navigation via Adversarial Path Sampling
Vision-and-Language Navigation (VLN) is a task where agents must decide ...

01/20/2022
LEMON: Language-Based Environment Manipulation via Execution-Guided Pre-training
Language-based environment manipulation requires agents to manipulate th...

05/22/2022
Housekeep: Tidying Virtual Households using Commonsense Reasoning
We introduce Housekeep, a benchmark to evaluate commonsense reasoning in...
