A Dual Semantic-Aware Recurrent Global-Adaptive Network For Vision-and-Language Navigation

05/05/2023
by Liuyi Wang, et al.

Vision-and-Language Navigation (VLN) is a realistic but challenging task that requires an agent to locate a target region using verbal and visual cues. Despite significant recent advances, two broad limitations remain: (1) explicit mining of the guiding semantics hidden in both vision and language is still under-explored; (2) previous structured-map methods represent each visited node by the average of its historical appearances, ignoring both the distinct contributions of individual images and the retention of potent information during reasoning. This work proposes a dual semantic-aware recurrent global-adaptive network (DSRG) to address these problems. First, DSRG introduces an instruction-guidance linguistic module (IGL) and an appearance-semantics visual module (ASV) to boost semantic learning in language and vision, respectively. For the memory mechanism, a global adaptive aggregation module (GAA) is devised for explicit panoramic observation fusion, and a recurrent memory fusion module (RMF) supplies implicit temporal hidden states. Extensive experiments on the R2R and REVERIE datasets demonstrate that our method outperforms existing methods.
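The two memory components named in the abstract lend themselves to a short illustration. Below is a minimal PyTorch sketch of the ideas in the spirit of GAA and RMF: attention-weighted fusion of a node's panoramic view features rather than a plain historical average, and a recurrent cell that carries an implicit hidden state across navigation steps. The module names, dimensions, and gating choices are illustrative assumptions, not the paper's actual implementation.

# Minimal sketch of the two memory ideas described in the abstract:
# (1) adaptive (attention-weighted) aggregation of a node's panoramic
#     view features instead of a plain historical average, and
# (2) a recurrent fusion step that carries an implicit hidden state
#     across timesteps.
# Module names, dimensions, and the gating scheme are assumptions made
# for illustration, not the paper's implementation.
import torch
import torch.nn as nn


class AdaptiveNodeAggregation(nn.Module):
    """Fuse the view features observed at a node with learned weights."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # per-view importance score

    def forward(self, views: torch.Tensor) -> torch.Tensor:
        # views: (num_views, dim) -- all images observed at one node
        weights = torch.softmax(self.score(views), dim=0)  # (num_views, 1)
        return (weights * views).sum(dim=0)  # (dim,) weighted fusion


class RecurrentMemoryFusion(nn.Module):
    """Carry an implicit temporal hidden state across navigation steps."""

    def __init__(self, dim: int):
        super().__init__()
        self.cell = nn.GRUCell(dim, dim)

    def forward(self, obs: torch.Tensor, hidden: torch.Tensor) -> torch.Tensor:
        # obs: (batch, dim) fused observation; hidden: (batch, dim) memory
        return self.cell(obs, hidden)


if __name__ == "__main__":
    dim = 768
    agg = AdaptiveNodeAggregation(dim)
    rmf = RecurrentMemoryFusion(dim)
    node_views = torch.randn(36, dim)     # e.g. a 36-view panorama
    fused = agg(node_views).unsqueeze(0)  # (1, dim)
    hidden = torch.zeros(1, dim)
    hidden = rmf(fused, hidden)           # updated implicit memory
    print(fused.shape, hidden.shape)

The contrast with a plain average is the learned softmax weighting: views that score higher contribute more to the node representation, while an average would weight all observed images equally.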

Related research

07/24/2023 | GridMM: Grid Memory Map for Vision-and-Language Navigation
Vision-and-language navigation (VLN) enables the agent to navigate to a ...

03/05/2021 | Structured Scene Memory for Vision-Language Navigation
Recently, numerous algorithms have been developed to tackle the problem ...

01/10/2019 | Self-Monitoring Navigation Agent via Auxiliary Progress Estimation
The Vision-and-Language Navigation (VLN) task entails an agent following...

03/28/2023 | KERM: Knowledge Enhanced Reasoning for Vision-and-Language Navigation
Vision-and-language navigation (VLN) is the task to enable an embodied a...

07/15/2021 | Neighbor-view Enhanced Model for Vision and Language Navigation
Vision and Language Navigation (VLN) requires an agent to navigate to a ...

11/10/2021 | Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation
Vision-and-Language Navigation (VLN) is a task that an agent is required...

07/31/2019 | EMPNet: Neural Localisation and Mapping using Embedded Memory Points
Continuously estimating an agent's state space and a representation of i...
