Prompt-based Context- and Domain-aware Pretraining for Vision and Language Navigation

09/07/2023
by Ting Liu et al.

With strong representation capabilities, pretrained vision-language models are widely used in vision and language navigation (VLN). However, most of them are trained on web-crawled, general-purpose datasets, which incurs a considerable domain gap when they are applied to VLN tasks. Another challenge for VLN is enabling the agent to understand the contextual relations between actions along a trajectory and to perform cross-modal alignment sequentially. In this paper, we propose a novel Prompt-bAsed coNtext- and Domain-Aware (PANDA) pretraining framework to address these problems. It performs prompting in two stages. In the domain-aware stage, we apply a low-cost prompt-tuning paradigm to learn soft visual prompts from an in-domain dataset, equipping the pretrained models with object-level and scene-level cross-modal alignment for VLN tasks. In the context-aware stage, we design a set of hard context prompts to capture sequence-level semantics and instill both the out-of-context and the contextual knowledge in the instruction into the cross-modal representations; these prompts enable further tuning of the pretrained models via contrastive learning. Experimental results on both R2R and REVERIE show the superiority of PANDA over previous state-of-the-art methods.
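The abstract's two prompting stages lend themselves to a short illustration. Below is a minimal PyTorch sketch, not PANDA's actual code: a frozen visual backbone extended with learnable soft prompt vectors (the domain-aware stage's low-cost prompt tuning), and an InfoNCE-style contrastive objective pairing trajectory features with prompt-augmented instruction features (the context-aware stage). All names (`PromptedVisualEncoder`, `num_prompts`, `contrastive_prompt_loss`) and hyperparameters are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptedVisualEncoder(nn.Module):
    """Frozen visual encoder with learnable soft prompts prepended to its
    input tokens (a sketch of domain-aware soft visual prompt tuning)."""
    def __init__(self, backbone: nn.Module, embed_dim: int, num_prompts: int = 8):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():   # keep pretrained weights fixed
            p.requires_grad = False
        # Soft visual prompts: the only parameters updated in this stage.
        self.prompts = nn.Parameter(torch.randn(num_prompts, embed_dim) * 0.02)

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, seq_len, embed_dim)
        b = visual_tokens.size(0)
        prompts = self.prompts.unsqueeze(0).expand(b, -1, -1)
        return self.backbone(torch.cat([prompts, visual_tokens], dim=1))


def contrastive_prompt_loss(traj_emb: torch.Tensor,
                            instr_emb: torch.Tensor,
                            temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style objective for the context-aware stage: matched
    (trajectory, prompted-instruction) pairs are pulled together, while
    mismatched pairs within the batch are pushed apart."""
    traj_emb = F.normalize(traj_emb, dim=-1)
    instr_emb = F.normalize(instr_emb, dim=-1)
    logits = traj_emb @ instr_emb.t() / temperature   # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```

In this sketch only `self.prompts` receives gradients, which reflects why the abstract can describe the domain-aware stage as a low-cost alternative to fully fine-tuning the pretrained model.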

Related research

09/09/2021 · Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers
Pretrained vision-and-language BERTs aim to learn representations that c...

12/22/2020 · Seeing past words: Testing the cross-modal capabilities of pretrained V&L models
We investigate the ability of general-purpose pretrained vision and lang...

07/24/2022 · A Priority Map for Vision-and-Language Navigation with Trajectory Plans and Feature-Location Cues
In a busy city street, a pedestrian surrounded by distractions can pick ...

05/31/2022 · ADAPT: Vision-Language Navigation with Modality-Aligned Action Prompts
Vision-Language Navigation (VLN) is a challenging task that requires an ...

03/08/2022 · Visual-Language Navigation Pretraining via Prompt-based Environmental Self-exploration
Vision-language navigation (VLN) is a challenging task due to its large ...

04/10/2019 · Context-Aware Embeddings for Automatic Art Analysis
Automatic art analysis aims to classify and retrieve artistic representa...

05/09/2023 · Exploiting Pseudo Image Captions for Multimodal Summarization
Cross-modal contrastive learning in vision language pretraining (VLP) fa...
