SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation

10/27/2021
by Abhinav Moudgil, et al.

Natural language instructions for visual navigation often use scene descriptions (e.g., "bedroom") and object references (e.g., "green chairs") to provide a breadcrumb trail to a goal location. This work presents a transformer-based vision-and-language navigation (VLN) agent that uses two different visual encoders – a scene classification network and an object detector – which produce features that match these two distinct types of visual cues. In our method, scene features contribute high-level contextual information that supports object-level processing. With this design, our model is able to use vision-and-language pretraining (i.e., learning the alignment between images and text from large-scale web data) to substantially improve performance on the Room-to-Room (R2R) and Room-Across-Room (RxR) benchmarks. Specifically, our approach leads to improvements of 1.8 and 3.7 points on R2R and RxR, respectively. The gains are even larger for navigation instructions that contain six or more object references, which further suggests that our approach is better able to use object features and align them to references in the instructions.
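To make the scene/object fusion concrete, here is a minimal PyTorch sketch (not the authors' released code) of how instruction tokens, a global scene feature, and detected-object region features might be projected into a shared space and attended over jointly by a transformer. All class names, dimensions, and the single-scene-token fusion scheme are illustrative assumptions.

```python
# Illustrative sketch only: fusing scene and object features for a
# transformer-based VLN agent. Names, dimensions, and the fusion scheme
# are assumptions, not the paper's implementation.
import torch
import torch.nn as nn


class SceneObjectFusion(nn.Module):
    """Projects instruction, scene, and object features into one space and
    runs joint self-attention so object references in the instruction can
    attend to detected objects, with the scene token as high-level context."""

    def __init__(self, scene_dim=2048, object_dim=2048, text_dim=768, d_model=768):
        super().__init__()
        self.scene_proj = nn.Linear(scene_dim, d_model)    # scene-classification features
        self.object_proj = nn.Linear(object_dim, d_model)  # object-detector region features
        self.text_proj = nn.Linear(text_dim, d_model)      # instruction token embeddings
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, text_tokens, scene_feat, object_feats):
        # text_tokens:  (B, L, text_dim)   instruction token features
        # scene_feat:   (B, scene_dim)     one global scene vector per view
        # object_feats: (B, K, object_dim) K detected-object region features
        text_tok = self.text_proj(text_tokens)                # (B, L, d_model)
        scene_tok = self.scene_proj(scene_feat).unsqueeze(1)  # (B, 1, d_model)
        object_tok = self.object_proj(object_feats)           # (B, K, d_model)
        # Concatenate all tokens so self-attention operates over the joint sequence.
        tokens = torch.cat([text_tok, scene_tok, object_tok], dim=1)
        return self.encoder(tokens)                           # (B, L+1+K, d_model)


if __name__ == "__main__":
    model = SceneObjectFusion()
    out = model(torch.randn(2, 20, 768), torch.randn(2, 2048), torch.randn(2, 5, 2048))
    print(out.shape)  # torch.Size([2, 26, 768])
```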
