Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training

02/25/2020
by   Weituo Hao, et al.

Learning to navigate in a visual environment by following natural-language instructions is a challenging task, because the multimodal inputs to the agent are highly variable and the training data for a new task is often limited. In this paper, we present the first pre-training and fine-tuning paradigm for vision-and-language navigation (VLN) tasks. By training on a large amount of image-text-action triplets in a self-supervised manner, the pre-trained model provides generic representations of visual environments and language instructions. It can easily be used as a drop-in for existing VLN frameworks, yielding the proposed agent, Prevalent, which learns new tasks more effectively and generalizes better to previously unseen environments. The performance is validated on three VLN tasks. On the Room-to-Room benchmark, our model improves the state of the art from 47% to 51% on success rate weighted by path length. Furthermore, the learned representation is transferable to other VLN tasks: on two recent tasks, vision-and-dialog navigation and "Help, Anna!", the proposed Prevalent leads to significant improvement over existing methods, achieving a new state of the art.
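The abstract describes self-supervised pre-training on image-text-action triplets. As a rough illustration only (this is not the paper's code; all names and the masking scheme are hypothetical assumptions), a batch builder for two proxy tasks of this kind, masked word prediction on the instruction and action prediction given the image-instruction pair, might look like:

```python
import random

def make_pretraining_batch(triplets, mask_token="[MASK]", mask_prob=0.15, seed=0):
    """Build a self-supervised batch from (image_feat, instruction, action) triplets.

    Each example carries targets for two proxy losses: masked word
    prediction on the instruction, and next-action prediction.
    """
    rng = random.Random(seed)
    batch = []
    for image_feat, instruction, action in triplets:
        tokens = instruction.split()
        masked, mlm_targets = [], {}
        for i, tok in enumerate(tokens):
            if rng.random() < mask_prob:
                mlm_targets[i] = tok       # label for the masked-word loss
                masked.append(mask_token)
            else:
                masked.append(tok)
        batch.append({
            "image": image_feat,           # visual features of the view
            "tokens": masked,              # instruction with some words masked
            "mlm_targets": mlm_targets,    # position -> original word
            "action_target": action,       # label for the action-prediction loss
        })
    return batch

# Toy triplets; in practice these would come from R2R trajectories
# and synthesized instruction data.
triplets = [
    ([0.1, 0.5], "walk past the sofa and stop at the door", "forward"),
    ([0.3, 0.2], "turn left at the kitchen", "turn_left"),
]
batch = make_pretraining_batch(triplets)
```

The point of the sketch is only the data shape: because both objectives are computed from the triplets themselves, no human annotation beyond the trajectories is needed, which is what lets the paradigm scale to a large corpus.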


Related research

- 03/02/2020: Multi-View Learning for Vision-and-Language Navigation
  Learning to navigate in a visual environment following natural language ...
- 09/05/2019: Robust Navigation with Language Pretraining and Stochastic Sampling
  Core to the vision-and-language navigation (VLN) challenge is building r...
- 11/20/2022: Structure-Encoding Auxiliary Tasks for Improved Visual Representation in Vision-and-Language Navigation
  In Vision-and-Language Navigation (VLN), researchers typically take an i...
- 06/07/2021: Hierarchical Task Learning from Language Instructions with Unified Transformers and Self-Monitoring
  Despite recent progress, learning new tasks through language instruction...
- 04/08/2019: Learning to Navigate Unseen Environments: Back Translation with Environmental Dropout
  A grand goal in AI is to build a robot that can accurately navigate base...
- 03/31/2020: Take the Scenic Route: Improving Generalization in Vision-and-Language Navigation
  In the Vision-and-Language Navigation (VLN) task, an agent with egocentr...
- 03/06/2019: Tactical Rewind: Self-Correction via Backtracking in Vision-and-Language Navigation
  We present FAST NAVIGATOR, a general framework for action decoding, whic...
