Visual-Language Navigation Pretraining via Prompt-based Environmental Self-exploration

03/08/2022
by Xiwen Liang, et al.

Vision-language navigation (VLN) is a challenging task due to the large search space in the environment. To address this problem, previous works have fine-tuned large models pretrained on large-scale datasets. However, conventional fine-tuning requires extra human-labeled navigation data and lacks self-exploration capabilities in the environment, which hinders generalization to unseen scenes. To improve the ability of fast cross-domain adaptation, we propose Prompt-based Environmental Self-exploration (ProbES), which self-explores environments by sampling trajectories and automatically generates structured instructions via a large-scale cross-modal pretrained model (CLIP). Our method fully exploits the knowledge learned by CLIP to build an in-domain dataset through self-exploration, without human labeling. Unlike conventional fine-tuning, we introduce prompt-based learning to achieve fast adaptation of language embeddings, which substantially improves learning efficiency by leveraging prior knowledge. By automatically synthesizing trajectory-instruction pairs in any environment without human supervision, combined with efficient prompt-based learning, our model can adapt to diverse vision-language navigation tasks, including VLN and REVERIE. Both qualitative and quantitative results show that ProbES significantly improves the generalization ability of the navigation model.
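The instruction-synthesis idea described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: a trajectory's viewpoints are scored against an object vocabulary (in the paper this scoring comes from CLIP image-text similarity; here `clip_score` is a hypothetical stand-in, and the templates and vocabulary are made up for illustration), and the best matches fill a structured instruction template, yielding a trajectory-instruction pair with no human labeling.

```python
import random

# Hypothetical templates and vocabulary (assumptions for illustration).
ACTION_TEMPLATES = ["walk past the {}", "turn at the {}", "stop near the {}"]
OBJECT_VOCAB = ["sofa", "dining table", "staircase", "doorway"]

def clip_score(view, phrase):
    """Stand-in for CLIP's image-text similarity (an assumption here;
    the real method encodes the view image and the phrase with CLIP
    and takes their cosine similarity)."""
    return random.random()

def synthesize_instruction(trajectory_views):
    """Build one structured instruction for a sampled trajectory."""
    steps = []
    for i, view in enumerate(trajectory_views):
        # Pick the vocabulary phrase the scorer deems closest to this view.
        best_obj = max(OBJECT_VOCAB, key=lambda o: clip_score(view, o))
        template = ACTION_TEMPLATES[min(i, len(ACTION_TEMPLATES) - 1)]
        steps.append(template.format(best_obj))
    return ", then ".join(steps)

random.seed(0)
views = ["view_0.jpg", "view_1.jpg", "view_2.jpg"]  # sampled trajectory
print(synthesize_instruction(views))
```

Pairs produced this way form the in-domain pretraining dataset; the downstream navigator then adapts via prompt-based learning rather than full fine-tuning.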


