Prompt-based Context- and Domain-aware Pretraining for Vision and Language Navigation

09/07/2023
by Ting Liu, et al.

With strong representation capabilities, pretrained vision-language models are widely used in vision and language navigation (VLN). However, most of them are trained on web-crawled, general-purpose datasets, which introduces a considerable domain gap when they are applied to VLN tasks. Another challenge for VLN is how the agent understands the contextual relations between actions along a trajectory and performs cross-modal alignment sequentially. In this paper, we propose a novel Prompt-bAsed coNtext- and Domain-Aware (PANDA) pretraining framework to address these problems. It performs prompting in two stages. In the domain-aware stage, we apply a low-cost prompt-tuning paradigm to learn soft visual prompts from an in-domain dataset, equipping the pretrained models with object-level and scene-level cross-modal alignment for VLN tasks. In the context-aware stage, we design a set of hard context prompts to capture sequence-level semantics and instill both out-of-context and contextual knowledge from the instruction into the cross-modal representations; these prompts enable further tuning of the pretrained models via contrastive learning. Experimental results on both R2R and REVERIE show the superiority of PANDA over previous state-of-the-art methods.
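
The two stages can be pictured with a small, hypothetical sketch. Below is a minimal PyTorch illustration, assuming a frozen vision-language backbone: SoftVisualPrompt stands in for the domain-aware stage, where only a few learnable prompt vectors prepended to the visual tokens are tuned, and context_contrastive_loss stands in for the context-aware stage, where trajectory embeddings are aligned with instruction embeddings built from hard context prompts via an InfoNCE-style objective. All names, dimensions, and the loss formulation are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of the two prompting stages; not the PANDA codebase.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SoftVisualPrompt(nn.Module):
    """Domain-aware stage (assumed form): learnable soft prompts prepended to
    visual tokens. Only the prompt embeddings are trained; the pretrained
    backbone would stay frozen, which keeps the tuning cost low."""

    def __init__(self, num_prompts: int = 8, dim: int = 768):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, num_tokens, dim)
        batch = visual_tokens.size(0)
        prompts = self.prompts.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompts, visual_tokens], dim=1)


def context_contrastive_loss(traj_emb: torch.Tensor,
                             instr_emb: torch.Tensor,
                             temperature: float = 0.07) -> torch.Tensor:
    """Context-aware stage (assumed form): InfoNCE-style loss aligning pooled
    trajectory embeddings with instruction embeddings produced from hard
    context prompt templates. Matching pairs on the diagonal are positives;
    all other pairs in the batch serve as negatives."""
    traj_emb = F.normalize(traj_emb, dim=-1)
    instr_emb = F.normalize(instr_emb, dim=-1)
    logits = traj_emb @ instr_emb.t() / temperature   # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    # Toy usage with random features standing in for frozen backbone outputs.
    prompter = SoftVisualPrompt(num_prompts=8, dim=768)
    visual_tokens = torch.randn(4, 36, 768)        # e.g. 36 view features per panorama
    prompted = prompter(visual_tokens)              # (4, 44, 768)

    traj_emb = prompted.mean(dim=1)                 # pooled trajectory embedding
    instr_emb = torch.randn(4, 768)                 # prompted instruction embedding
    loss = context_contrastive_loss(traj_emb, instr_emb)
    print(prompted.shape, loss.item())
```

In practice only the prompt parameters (and any lightweight heads) would receive gradients while the backbone stays frozen, which is what makes the domain-aware stage a low-cost tuning paradigm.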
