Improving Cross-Modal Alignment in Vision Language Navigation via Syntactic Information

04/19/2021
by   Jialu Li, et al.

Vision-language navigation (VLN) requires an agent to navigate through a 3D environment following natural language instructions. A key challenge in this task is grounding the instructions in the visual information the agent currently perceives. Most existing work applies soft attention over individual words to locate the part of the instruction relevant to the next action. However, different words serve different functions in a sentence (e.g., modifiers convey attributes, verbs convey actions), and syntactic information such as dependencies and phrase structures can help the agent locate the important parts of an instruction. Hence, in this paper, we propose a navigation agent that uses syntactic information derived from a dependency tree to enhance the alignment between the instruction and the current visual scene. Empirically, our agent outperforms a baseline model that does not use syntactic information on the Room-to-Room dataset, especially in unseen environments. Moreover, our agent achieves a new state of the art on the Room-Across-Room dataset, which contains instructions in three languages (English, Hindi, and Telugu). Qualitative visualizations further show that our agent aligns instructions with the current visual information more accurately. Code and models: https://github.com/jialuli-luka/SyntaxVLN
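To make the core idea concrete, here is a minimal, hypothetical sketch of how a dependency tree can group an instruction's words into syntactic phrases, so that attention can operate over chunks rather than isolated words. The toy instruction and hand-written head indices below are illustrative assumptions, not the paper's actual data or model; a real agent would obtain the parse from a dependency parser.

```python
# Hypothetical sketch: grouping instruction tokens into dependency
# sub-trees (phrases). The parse is hand-written for illustration;
# a real system would produce it with a dependency parser.

instruction = ["walk", "past", "the", "red", "chair", "and", "stop"]
# head[i] = index of token i's syntactic head (-1 marks the root)
head = [-1, 0, 4, 4, 1, 0, 0]

def subtree(root: int) -> list[int]:
    """Collect all token indices dominated by `root` (inclusive)."""
    out = [root]
    for i, h in enumerate(head):
        if h == root:
            out.extend(subtree(i))
    return sorted(out)

# The phrase attached to the verb "walk" via the preposition "past":
phrase = [instruction[i] for i in subtree(1)]
print(phrase)  # ['past', 'the', 'red', 'chair']
```

An attention module could then score each such sub-tree as a unit, letting the agent attend to "past the red chair" as one landmark phrase instead of four unrelated words.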

Related research:

- CLEAR: Improving Vision-Language Navigation with Cross-Lingual, Environment-Agnostic Representations (07/05/2022)
- Language and Visual Entity Relationship Graph for Agent Navigation (10/19/2020)
- Self-Monitoring Navigation Agent via Auxiliary Progress Estimation (01/10/2019)
- Generative Language-Grounded Policy in Vision-and-Language Navigation with Bayes' Rule (09/16/2020)
- Bridging the visual gap in VLN via semantically richer instructions (10/27/2022)
- Vision-Language Navigation with Random Environmental Mixup (06/15/2021)
- Look Wide and Interpret Twice: Improving Performance on Interactive Instruction-following Tasks (06/01/2021)
