Airbert: In-domain Pretraining for Vision-and-Language Navigation

08/20/2021
by Pierre-Louis Guhur, et al.

Vision-and-language navigation (VLN) aims to enable embodied agents to navigate in realistic environments using natural language instructions. Given the scarcity of domain-specific training data and the high diversity of image and language inputs, the generalization of VLN agents to unseen environments remains challenging. Recent methods explore pretraining to improve generalization; however, the use of generic image-caption datasets or existing small-scale VLN environments is suboptimal and yields limited improvements. In this work, we introduce BnB, a large-scale and diverse in-domain VLN dataset. We first collect image-caption (IC) pairs from hundreds of thousands of listings from online rental marketplaces. Using the IC pairs, we next propose automatic strategies to generate millions of VLN path-instruction (PI) pairs. We further propose a shuffling loss that improves the learning of temporal order inside PI pairs. We use BnB to pretrain our Airbert model, which can be adapted to discriminative and generative settings, and show that it outperforms the state of the art on the Room-to-Room (R2R) navigation and Remote Referring Expression (REVERIE) benchmarks. Moreover, our in-domain pretraining significantly increases performance on a challenging few-shot VLN evaluation, where we train the model only on VLN instructions from a few houses.
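
As a rough illustration of the data idea in the abstract (not the authors' implementation), the sketch below joins the image-caption pairs of one listing into a path-instruction pair and builds a shuffled copy of the path whose image order no longer matches the caption order. A shuffling objective could use such a copy as a negative example. The names `ICPair`, `make_pi_pair` and `shuffled_negative`, the concatenation strategy, and the example listing are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ICPair:
    """One image-caption pair (hypothetical structure)."""
    image_id: str
    caption: str


def make_pi_pair(pairs: List[ICPair]) -> Tuple[List[str], str]:
    """Concatenate IC pairs into a (path, instruction) pair.

    Assumption: the path is the ordered list of image ids and the
    instruction is the captions joined in the same order.
    """
    path = [p.image_id for p in pairs]
    instruction = " ".join(p.caption for p in pairs)
    return path, instruction


def shuffled_negative(path: List[str], rng: random.Random) -> List[str]:
    """Return a permutation of the path that differs from the original order."""
    if len(set(path)) < 2:
        raise ValueError("need at least two distinct images to shuffle")
    shuffled = path[:]
    # Reshuffle until the order actually changes.
    while shuffled == path:
        rng.shuffle(shuffled)
    return shuffled


# Made-up listing, used only to exercise the functions above.
listing = [
    ICPair("img_hall", "Enter through the hallway."),
    ICPair("img_kitchen", "Walk past the bright kitchen."),
    ICPair("img_bedroom", "Stop in the bedroom."),
]
path, instruction = make_pi_pair(listing)
negative = shuffled_negative(path, random.Random(0))
```

Here `path` keeps the listing order, while `negative` contains the same images in a different order, so a model trained to separate the two has to attend to temporal order rather than only to which images are present.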


