GeoVLN: Learning Geometry-Enhanced Visual Representation with Slot Attention for Vision-and-Language Navigation

05/26/2023
by   Jingyang Huo, et al.
0

Most existing works solving Room-to-Room VLN problem only utilize RGB images and do not consider local context around candidate views, which lack sufficient visual cues about surrounding environment. Moreover, natural language contains complex semantic information thus its correlations with visual inputs are hard to model merely with cross attention. In this paper, we propose GeoVLN, which learns Geometry-enhanced visual representation based on slot attention for robust Visual-and-Language Navigation. The RGB images are compensated with the corresponding depth maps and normal maps predicted by Omnidata as visual inputs. Technically, we introduce a two-stage module that combine local slot attention and CLIP model to produce geometry-enhanced representation from such input. We employ V L BERT to learn a cross-modal representation that incorporate both language and vision informations. Additionally, a novel multiway attention module is designed, encouraging different phrases of input instruction to exploit the most related features from visual input. Extensive experiments demonstrate the effectiveness of our newly designed modules and show the compelling performance of the proposed method.

READ FULL TEXT

page 8

page 14

page 15

page 16

page 17

page 18

page 19

page 20

research
03/15/2020

Vision-Dialog Navigation by Exploring Cross-modal Memory

Vision-dialog navigation posed as a new holy-grail task in vision-langua...
research
07/15/2021

Neighbor-view Enhanced Model for Vision and Language Navigation

Vision and Language Navigation (VLN) requires an agent to navigate to a ...
research
06/17/2022

Local Slot Attention for Vision-and-Language Navigation

Vision-and-language navigation (VLN), a frontier study aiming to pave th...
research
01/26/2022

Self-supervised 3D Semantic Representation Learning for Vision-and-Language Navigation

In the Vision-and-Language Navigation task, the embodied agent follows l...
research
06/22/2022

Depth-aware Glass Surface Detection with Cross-modal Context Mining

Glass surfaces are becoming increasingly ubiquitous as modern buildings ...
research
06/21/2022

Pyramid Region-based Slot Attention Network for Temporal Action Proposal Generation

It has been found that temporal action proposal generation, which aims t...
research
07/24/2022

A Priority Map for Vision-and-Language Navigation with Trajectory Plans and Feature-Location Cues

In a busy city street, a pedestrian surrounded by distractions can pick ...

Please sign up or login with your details

Forgot password? Click here to reset