Improving Vision-and-Language Navigation by Generating Future-View Image Semantics

04/11/2023
by Jialu Li, et al.

Vision-and-Language Navigation (VLN) is the task of requiring an agent to navigate through an environment by following natural language instructions. At each step, the agent takes the next action by selecting from a set of navigable locations. In this paper, we go one step further and explore whether the agent can benefit from generating potential future views during navigation. Intuitively, humans form an expectation of what the future environment will look like based on the natural language instructions and surrounding views, and this expectation aids correct navigation. Hence, to equip the agent with the ability to generate the semantics of future navigation views, we first propose three proxy tasks during the agent's in-domain pre-training: Masked Panorama Modeling (MPM), Masked Trajectory Modeling (MTM), and Action Prediction with Image Generation (APIG). These three objectives teach the model to predict missing views in a panorama (MPM), predict missing steps in the full trajectory (MTM), and generate the next view based on the full instruction and navigation history (APIG), respectively. We then fine-tune the agent on the VLN task with an auxiliary loss that minimizes the difference between the view semantics generated by the agent and the ground-truth view semantics of the next step. Empirically, our VLN-SIG achieves new state-of-the-art results on both the Room-to-Room dataset and the CVDN dataset. We further show qualitatively that our agent learns to fill in missing patches in future views, which adds interpretability to its predicted actions. Lastly, we demonstrate that learning to predict future view semantics also enables the agent to perform better on longer paths.
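
To make the auxiliary objective concrete, here is a minimal PyTorch sketch. It assumes future views are discretized into tokens from a fixed visual codebook (for example, by a frozen image tokenizer) and that the agent outputs, for each patch of the next view, a distribution over that codebook; the function name, tensor shapes, and cross-entropy formulation are illustrative assumptions, not the paper's actual implementation.

    import torch
    import torch.nn.functional as F

    def future_view_semantics_loss(pred_logits, target_tokens):
        # pred_logits:   (batch, num_patches, codebook_size), the agent's
        #                predicted semantics for the next view
        # target_tokens: (batch, num_patches), discrete tokens of the
        #                ground-truth next view from a frozen tokenizer
        # Cross-entropy over the codebook, averaged over all patches.
        return F.cross_entropy(
            pred_logits.reshape(-1, pred_logits.size(-1)),
            target_tokens.reshape(-1),
        )

    # Toy usage: 2 trajectories, 8 patches per view, a 1024-token codebook.
    logits = torch.randn(2, 8, 1024)
    targets = torch.randint(0, 1024, (2, 8))
    aux_loss = future_view_semantics_loss(logits, targets)

In fine-tuning, such a term would typically be added to the standard action-prediction loss with a weighting coefficient; the same token-prediction machinery could plausibly serve the masked objectives (MPM, MTM) by masking views or trajectory steps and predicting their tokens.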


Related research

03/02/2020 · Multi-View Learning for Vision-and-Language Navigation
Learning to navigate in a visual environment following natural language ...

03/22/2022 · HOP: History-and-Order Aware Pre-training for Vision-and-Language Navigation
Pre-training has been adopted in a few of recent works for Vision-and-La...

05/30/2023 · PanoGen: Text-Conditioned Panoramic Environment Generation for Vision-and-Language Navigation
Vision-and-Language Navigation (VLN) requires the agent to follow langua...

08/09/2023 · Bird's-Eye-View Scene Graph for Vision-Language Navigation
Vision-language navigation (VLN), which entails an agent to navigate 3D ...

07/20/2020 · Seeing the Un-Scene: Learning Amodal Semantic Maps for Room Navigation
We introduce a learning-based approach for room navigation using semanti...

07/15/2021 · Neighbor-view Enhanced Model for Vision and Language Navigation
Vision and Language Navigation (VLN) requires an agent to navigate to a ...

07/11/2020 · Evolving Graphical Planner: Contextual Global Planning for Vision-and-Language Navigation
The ability to perform effective planning is crucial for building an ins...
