GridMM: Grid Memory Map for Vision-and-Language Navigation

by Zihan Wang, et al.
Institute of Computing Technology, Chinese Academy of Sciences

Vision-and-language navigation (VLN) requires an agent to navigate to a remote location in a 3D environment by following natural language instructions. To represent the previously visited environment, most VLN approaches implement memory using recurrent states, topological maps, or top-down semantic maps. In contrast to these approaches, we build a top-down, egocentric, and dynamically growing Grid Memory Map (GridMM) to structure the visited environment. From a global perspective, historical observations are projected into a unified grid map in a top-down view, which better represents the spatial relations of the environment. From a local perspective, we further propose an instruction-relevance aggregation method to capture fine-grained visual clues in each grid region. Extensive experiments on the REVERIE, R2R, and SOON datasets in discrete environments, and on the R2R-CE dataset in continuous environments, show the superiority of our proposed method.
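The core idea of projecting egocentric observations into a unified top-down grid can be illustrated with a minimal sketch. The function below is a simplified, hypothetical illustration (not the paper's actual implementation): it bins observed 3D points into a fixed-size grid on the ground plane and averages the visual features of the points landing in each cell. The names `build_grid_memory`, `map_size`, and `cell_size` are assumptions for this example.

```python
import numpy as np

def build_grid_memory(points, features, map_size=14, cell_size=0.5):
    """Project egocentric 3D points (N, 3) into a top-down grid map.

    Each cell of the (map_size, map_size, D) map averages the features
    of the points that fall into it; the map is centered on the agent.
    This is an illustrative sketch, not the GridMM implementation.
    """
    D = features.shape[1]
    grid = np.zeros((map_size, map_size, D))
    counts = np.zeros((map_size, map_size))
    half = map_size * cell_size / 2  # map spans [-half, half] in x and z
    for p, f in zip(points, features):
        x, _, z = p  # y is height; a top-down view uses the x/z plane
        col = int((x + half) // cell_size)
        row = int((z + half) // cell_size)
        if 0 <= row < map_size and 0 <= col < map_size:
            grid[row, col] += f
            counts[row, col] += 1
    occupied = counts > 0
    grid[occupied] /= counts[occupied][:, None]  # mean-pool per cell
    return grid
```

In the paper the per-cell aggregation is instruction-conditioned rather than a plain average, so features relevant to the instruction dominate each region; the mean here only stands in for that step.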


Weakly-Supervised Multi-Granularity Map Learning for Vision-and-Language Navigation


A Dual Semantic-Aware Recurrent Global-Adaptive Network For Vision-and-Language Navigation


Topological Planning with Transformers for Vision-and-Language Navigation


Find a Way Forward: a Language-Guided Semantic Map Navigator


Bird's-Eye-View Scene Graph for Vision-Language Navigation


Instance-Level Semantic Maps for Vision Language Navigation


Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation

