Grounded Entity-Landmark Adaptive Pre-training for Vision-and-Language Navigation

08/24/2023
by Yibo Cui, et al.
Cross-modal alignment is one key challenge for Vision-and-Language Navigation (VLN). Most existing studies concentrate on mapping the global instruction or individual sub-instructions to the corresponding trajectory. However, the equally critical problem of achieving fine-grained alignment at the entity level is seldom considered. To address this problem, we propose a novel Grounded Entity-Landmark Adaptive (GELA) pre-training paradigm for VLN tasks. To enable this paradigm, we first introduce grounded entity-landmark human annotations into the Room-to-Room (R2R) dataset, yielding the GEL-R2R dataset. We then adopt three grounded entity-landmark adaptive pre-training objectives: 1) entity phrase prediction, 2) landmark bounding box prediction, and 3) entity-landmark semantic alignment, which explicitly supervise the learning of fine-grained cross-modal alignment between entity phrases and environment landmarks. Finally, we validate our model on two downstream benchmarks: VLN with descriptive instructions (R2R) and VLN with dialogue instructions (CVDN). Comprehensive experiments show that our GELA model achieves state-of-the-art results on both tasks, demonstrating its effectiveness and generalizability.
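
To make the third objective more concrete, the sketch below shows one plausible form of an entity-landmark semantic alignment loss: a symmetric contrastive objective that pulls each grounded entity-phrase embedding toward the embedding of its annotated landmark region and pushes it away from the other landmarks in the batch. The function name, tensor shapes, temperature value, and the choice of an InfoNCE-style loss are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch (assumed, not the authors' code) of an entity-landmark
# semantic alignment objective as a symmetric contrastive loss.
import torch
import torch.nn.functional as F

def entity_landmark_alignment_loss(entity_emb, landmark_emb, temperature=0.07):
    """entity_emb:   (N, D) pooled embeddings of grounded entity phrases.
    landmark_emb: (N, D) embeddings of the matching landmark regions.
    Row i of each tensor is assumed to be an annotated entity-landmark pair."""
    entity_emb = F.normalize(entity_emb, dim=-1)
    landmark_emb = F.normalize(landmark_emb, dim=-1)
    # (N, N) similarity matrix; diagonal entries are the positive pairs.
    logits = entity_emb @ landmark_emb.t() / temperature
    targets = torch.arange(entity_emb.size(0), device=entity_emb.device)
    # Symmetric cross-entropy: match each entity to its landmark and vice versa.
    loss_e2l = F.cross_entropy(logits, targets)
    loss_l2e = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_e2l + loss_l2e)

# Example usage with random features (batch of 8 annotated pairs, 768-dim):
if __name__ == "__main__":
    e = torch.randn(8, 768)
    l = torch.randn(8, 768)
    print(entity_landmark_alignment_loss(e, l).item())
```

Such a loss would typically be combined with the entity phrase prediction and landmark bounding box prediction objectives during the adaptive pre-training stage; the weighting between the three terms is not specified here.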
