GridMM: Grid Memory Map for Vision-and-Language Navigation

07/24/2023
by   Zihan Wang, et al.

Vision-and-language navigation (VLN) requires an agent to navigate to a remote location in a 3D environment by following natural language instructions. To represent the previously visited environment, most VLN approaches implement memory using recurrent states, topological maps, or top-down semantic maps. In contrast, we build a top-down, egocentric, and dynamically growing Grid Memory Map (GridMM) to structure the visited environment. From a global perspective, historical observations are projected into a unified top-down grid map, which better represents the spatial relations of the environment. From a local perspective, we further propose an instruction relevance aggregation method to capture fine-grained visual clues in each grid region. Extensive experiments on the REVERIE, R2R, and SOON datasets in discrete environments, and on the R2R-CE dataset in continuous environments, show the superiority of our proposed method.
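The core global-memory idea — projecting observed 3D points into an egocentric, top-down grid centered on the agent — can be sketched roughly as follows. This is a minimal illustration under simplifying assumptions (a fixed square grid, points already in world coordinates, heading as a rotation about the vertical axis), not the paper's implementation; the function name and parameters are hypothetical.

```python
import numpy as np

def project_to_grid(points_world, agent_pos, agent_heading,
                    grid_size=14, cell_meters=0.5):
    """Assign observed 3D points to cells of an egocentric top-down grid.

    points_world:  (N, 3) array of points (x, y, z) in the world frame.
    agent_pos:     (3,) agent position in the same frame.
    agent_heading: heading angle in radians (rotation about the up axis).
    Returns an (N,) array of flat cell indices; -1 marks points that
    fall outside the grid.
    """
    # Translate so the agent sits at the origin of the map.
    rel = points_world - agent_pos
    # Rotate into the agent's egocentric frame; the top-down view
    # keeps the two horizontal axes (here x and z) and drops height.
    c, s = np.cos(-agent_heading), np.sin(-agent_heading)
    x = c * rel[:, 0] - s * rel[:, 2]
    z = s * rel[:, 0] + c * rel[:, 2]
    # Discretize into grid cells centered on the agent.
    half = grid_size * cell_meters / 2.0
    col = np.floor((x + half) / cell_meters).astype(int)
    row = np.floor((z + half) / cell_meters).astype(int)
    inside = (col >= 0) & (col < grid_size) & (row >= 0) & (row < grid_size)
    return np.where(inside, row * grid_size + col, -1)
```

In a full system, the visual features of all points landing in the same cell would then be aggregated (in GridMM, weighted by their relevance to the instruction) to form that cell's memory entry, and the grid is re-centered on the agent as it moves.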


