Bridging the visual gap in VLN via semantically richer instructions

10/27/2022
by Joaquin Ossandón, et al.

The Vision-and-Language Navigation (VLN) task requires an agent to follow a textual instruction through a natural indoor environment using only visual information. While this is trivial for most humans, it remains an open problem for AI models. In this work, we hypothesize that poor use of the available visual information is at the core of the low performance of current models. To support this hypothesis, we provide experimental evidence showing that state-of-the-art models are not severely affected when they receive limited or even no visual data, indicating strong overfitting to the textual instructions. To encourage more suitable use of the visual information, we propose a new data augmentation method that fosters the inclusion of more explicit visual information in the generated textual navigational instructions. Our main intuition is that current VLN datasets include textual instructions intended to inform an expert navigator, such as a human, rather than a beginner visual navigational agent, such as a randomly initialized deep learning model. Specifically, to bridge the visual semantic gap of current VLN datasets, we take advantage of the metadata available for the Matterport3D dataset, which includes, among other things, the labels of objects present in each scene. Training a state-of-the-art model with the new set of instructions increases its performance by 8%, demonstrating the advantages of the proposed data augmentation method.
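The abstract describes two concrete procedures: an ablation that degrades or removes the agent's visual input to measure how much the model actually relies on it, and an augmentation step that makes scene object labels explicit in the instruction text. The sketch below is a minimal illustration of both ideas, not the authors' code: the function names, the (num_views, feat_dim) feature layout, and the enrichment template are all assumptions made for the example.

```python
# Minimal sketch of the two ideas in the abstract. Not the paper's code:
# function names, tensor shapes, and the phrasing template are hypothetical.
import torch

def ablate_visual_input(img_feats: torch.Tensor, mode: str = "zero") -> torch.Tensor:
    """Degrade the visual stream fed to a VLN agent to probe how much the
    policy actually depends on vision.

    img_feats: (num_views, feat_dim) precomputed features for one panorama.
    """
    if mode == "zero":    # no visual information at all
        return torch.zeros_like(img_feats)
    if mode == "noise":   # uninformative random features
        return torch.randn_like(img_feats)
    if mode == "mean":    # every view identical: content removed, scale kept
        return img_feats.mean(dim=0, keepdim=True).expand_as(img_feats)
    return img_feats      # "full": unmodified baseline


def enrich_instruction(instruction: str, objects_along_path: list[str]) -> str:
    """Naive stand-in for the proposed augmentation: make the objects a
    navigator would see explicit in the instruction, using labels taken
    from Matterport3D-style scene metadata."""
    if not objects_along_path:
        return instruction
    landmarks = ", ".join(dict.fromkeys(objects_along_path))  # dedupe, keep order
    return f"{instruction} On the way you will pass: {landmarks}."


if __name__ == "__main__":
    feats = torch.randn(36, 2048)  # 36 discretized views, ResNet-like features
    assert ablate_visual_input(feats, "zero").abs().sum() == 0
    print(enrich_instruction(
        "Walk down the hallway and stop at the second door.",
        ["painting", "sofa", "painting", "plant"],
    ))
```

Under the paper's hypothesis, a model whose success rate barely drops in the "zero" or "noise" settings is navigating mostly from the text; enriched instructions of the second kind give the training signal an explicit reason to attend to the visual stream.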

Related research

Improving Cross-Modal Alignment in Vision Language Navigation via Syntactic Information (04/19/2021)
Vision language navigation is the task that requires an agent to navigat...

Grounding Complex Navigational Instructions Using Scene Graphs (06/03/2021)
Training a reinforcement learning agent to carry out natural language in...

CrossMap Transformer: A Crossmodal Masked Path Transformer Using Double Back-Translation for Vision-and-Language Navigation (03/01/2021)
Navigation guided by natural language instructions is particularly suita...

Speaker-Follower Models for Vision-and-Language Navigation (06/07/2018)
Navigation guided by natural language instructions presents a challengin...

Thinking Like an Annotator: Generation of Dataset Labeling Instructions (06/24/2023)
Large-scale datasets are essential to modern day deep learning. Advocate...

Actional Atomic-Concept Learning for Demystifying Vision-Language Navigation (02/13/2023)
Vision-Language Navigation (VLN) is a challenging task which requires an...

Towards Evaluating Plan Generation Approaches with Instructional Texts (01/13/2020)
Recent research in behaviour understanding through language grounding ha...
