Multi-modal Discriminative Model for Vision-and-Language Navigation

05/31/2019
by   Haoshuo Huang, et al.
0

Vision-and-Language Navigation (VLN) is a natural language grounding task where agents have to interpret natural language instructions in the context of visual scenes in a dynamic environment to achieve prescribed navigation goals. Successful agents must have the ability to parse natural language of varying linguistic styles, ground them in potentially unfamiliar scenes, plan and react with ambiguous environmental feedback. Generalization ability is limited by the amount of human annotated data. In particular, paired vision-language sequence data is expensive to collect. We develop a discriminator that evaluates how well an instruction explains a given path in VLN task using multi-modal alignment. Our study reveals that only a small fraction of the high-quality augmented data from Fried:2018:Speaker, as scored by our discriminator, is useful for training VLN agents with similar performance on previously unseen environments. We also show that a VLN agent warm-started with pre-trained components from the discriminator outperforms the benchmark success rates of 35.5 by 10% relative measure on previously unseen environments.

READ FULL TEXT

page 5

page 8

research
08/09/2019

Transferable Representation Learning in Vision-and-Language Navigation

Vision-and-Language Navigation (VLN) tasks such as Room-to-Room (R2R) re...
research
06/15/2021

Vision-Language Navigation with Random Environmental Mixup

Vision-language Navigation (VLN) tasks require an agent to navigate step...
research
11/17/2019

Counterfactual Vision-and-Language Navigation via Adversarial Path Sampling

Vision-and-Language Navigation (VLN) is a task where agents must decide ...
research
04/09/2022

On the Importance of Karaka Framework in Multi-modal Grounding

Computational Paninian Grammar model helps in decoding a natural languag...
research
08/10/2021

Embodied BERT: A Transformer Model for Embodied, Language-guided Visual Task Completion

Language-guided robots performing home and office tasks must navigate in...
research
03/01/2019

Learning To Follow Directions in Street View

Navigating and understanding the real world remains a key challenge in m...
research
08/24/2022

Learning from Unlabeled 3D Environments for Vision-and-Language Navigation

In vision-and-language navigation (VLN), an embodied agent is required t...

Please sign up or login with your details

Forgot password? Click here to reset