Local Slot Attention for Vision-and-Language Navigation

06/17/2022
by Yifeng Zhuang, et al.

Vision-and-language navigation (VLN), a frontier study aiming to pave the way for general-purpose robots, has been a hot topic in the computer vision and natural language processing communities. The VLN task requires an agent to navigate to a goal location in an unfamiliar environment by following natural language instructions. Recently, transformer-based models have achieved significant improvements on the VLN task, since the attention mechanism in the transformer architecture can better integrate inter- and intra-modal information of vision and language. However, current transformer-based models have two problems. 1) The models process each view independently, without taking the integrity of objects into account. 2) During the self-attention operation in the visual modality, views that are spatially distant can be interwoven with each other without explicit restriction. This kind of mixing may introduce extra noise instead of useful information. To address these issues, we propose 1) a slot-attention-based module that incorporates information from segmentations of the same object, and 2) a local attention mask mechanism that limits the visual attention span. The proposed modules can be easily plugged into any VLN architecture, and we use Recurrent VLN-BERT as our base model. Experiments on the R2R dataset show that our model achieves state-of-the-art results.
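The abstract does not spell out how the local attention mask is built, so the following is only a minimal sketch of the general idea of limiting the visual attention span. It assumes a panorama of 36 views laid out as a grid of 12 headings by 3 elevations, indexed as `elevation * 12 + heading`; the function names, the `radius` parameter, and the square neighborhood are hypothetical choices for illustration, not the authors' implementation.

```python
import math


def local_attention_mask(n_headings=12, n_elevations=3, radius=1):
    """Return mask[i][j] == True when view j is within `radius` grid steps of view i.

    Assumed layout (not taken from the paper): view index = elevation * n_headings + heading.
    Headings wrap around the panorama; elevations do not.
    """
    n = n_headings * n_elevations
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        elev_i, head_i = divmod(i, n_headings)
        for j in range(n):
            elev_j, head_j = divmod(j, n_headings)
            d_head = abs(head_i - head_j)
            d_head = min(d_head, n_headings - d_head)  # circular heading distance
            d_elev = abs(elev_i - elev_j)
            mask[i][j] = d_head <= radius and d_elev <= radius
    return mask


def masked_softmax(scores, mask_row):
    """Softmax over the allowed entries only; masked-out entries get weight 0."""
    top = max(s for s, allowed in zip(scores, mask_row) if allowed)
    exps = [math.exp(s - top) if allowed else 0.0
            for s, allowed in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]


mask = local_attention_mask()
# Attention weights of view 0 over all 36 views, with uniform raw scores.
weights = masked_softmax([0.0] * 36, mask[0])
```

With this layout, a view in the middle elevation row attends to a 3-by-3 neighborhood, while a view in the top or bottom row attends to a 2-by-3 neighborhood. Spatially distant views receive zero attention weight, which is the restriction the abstract describes.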


research
09/26/2022

LOViS: Learning Orientation and Visual Signals for Vision and Language Navigation

Understanding spatial and visual information is essential for a navigati...
research
04/21/2020

Attention Module is Not Only a Weight: Analyzing Transformers with Vector Norms

Because attention modules are core components of Transformer-based model...
research
05/26/2023

GeoVLN: Learning Geometry-Enhanced Visual Representation with Slot Attention for Vision-and-Language Navigation

Most existing works solving Room-to-Room VLN problem only utilize RGB im...
research
03/15/2020

Vision-Dialog Navigation by Exploring Cross-modal Memory

Vision-dialog navigation posed as a new holy-grail task in vision-langua...
research
05/14/2022

Multiformer: A Head-Configurable Transformer-Based Model for Direct Speech Translation

Transformer-based models have been achieving state-of-the-art results in...
research
09/20/2023

AttentionMix: Data augmentation method that relies on BERT attention mechanism

The Mixup method has proven to be a powerful data augmentation technique...
research
11/05/2019

Self-Attention and Ingredient-Attention Based Model for Recipe Retrieval from Image Queries

Direct computer vision based-nutrient content estimation is a demanding ...
