Are You Looking? Grounding to Multiple Modalities in Vision-and-Language Navigation

06/02/2019
by Ronghang Hu, et al.

Vision-and-Language Navigation (VLN) requires grounding instructions, such as "turn right and stop at the door", to routes in a visual environment. The actual grounding can connect language to the environment through multiple modalities, e.g. "stop at the door" might ground into visual objects, while "turn right" might rely only on the geometric structure of a route. We investigate where the natural language empirically grounds under two recent state-of-the-art VLN models. Surprisingly, we discover that visual features may actually hurt these models: models that use only route structure, ablating visual features, outperform their visual counterparts in unseen environments on the benchmark Room-to-Room dataset. To better use all the available modalities, we propose to decompose the grounding procedure into a set of expert models with access to different modalities (including object detections) and ensemble them at prediction time, improving the performance of state-of-the-art models on the VLN task.
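
The sketch below illustrates the prediction-time ensembling idea described in the abstract; it is not the authors' implementation. The names (ModalityExpert, ensemble_action_probs) and dimensions are hypothetical. Each expert scores the candidate actions from one modality (e.g. visual features, route structure, or object detections), and the ensemble averages their action distributions.

# A minimal sketch, assuming PyTorch, of ensembling modality-specific experts.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityExpert(nn.Module):
    """Scores candidate actions from a single modality
    (e.g. visual features, route geometry, or object detections)."""
    def __init__(self, feat_dim, instr_dim, hidden_dim=256):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(feat_dim + instr_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, cand_feats, instr_ctx):
        # cand_feats: (num_candidates, feat_dim) features for each candidate action
        # instr_ctx:  (instr_dim,) encoding of the instruction (e.g. from an LSTM)
        ctx = instr_ctx.expand(cand_feats.size(0), -1)
        return self.scorer(torch.cat([cand_feats, ctx], dim=-1)).squeeze(-1)

def ensemble_action_probs(experts, modality_feats, instr_ctx):
    """Average per-expert distributions over candidate actions at prediction time."""
    probs = [F.softmax(expert(feats, instr_ctx), dim=-1)
             for expert, feats in zip(experts, modality_feats)]
    return torch.stack(probs).mean(dim=0)

At prediction time the agent would take the argmax of the returned distribution over candidate actions; how the individual experts are trained (jointly or separately) is a design choice left open in this sketch.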


