Improving Vision-and-Language Navigation with Image-Text Pairs from the Web

04/30/2020
by Arjun Majumdar, et al.

Following a navigation instruction such as 'Walk down the stairs and stop at the brown sofa' requires embodied AI agents to ground scene elements referenced via language (e.g. 'stairs') to visual content in the environment (pixels corresponding to 'stairs'). We ask the following question – can we leverage abundant 'disembodied' web-scraped vision-and-language corpora (e.g. Conceptual Captions) to learn visual groundings (what do 'stairs' look like?) that improve performance on a relatively data-starved embodied perception task (Vision-and-Language Navigation)? Specifically, we develop VLN-BERT, a visiolinguistic transformer-based model for scoring the compatibility between an instruction ('...stop at the brown sofa') and a sequence of panoramic RGB images captured by the agent. We demonstrate that pretraining VLN-BERT on image-text pairs from the web before fine-tuning on embodied path-instruction data significantly improves performance on VLN – outperforming the prior state-of-the-art in the fully-observed setting by 4 absolute percentage points on success rate. Ablations of our pretraining curriculum show each stage to be impactful – with their combination resulting in further positive synergistic effects.

