Structured Scene Memory for Vision-Language Navigation

03/05/2021
by Hanqing Wang, et al.

Recently, numerous algorithms have been developed to tackle the problem of vision-language navigation (VLN), which requires an agent to navigate 3D environments by following linguistic instructions. However, current VLN agents simply store their past experiences/observations as latent states in recurrent networks, failing to capture environment layouts and to perform long-term planning. To address these limitations, we propose a new memory architecture, called Structured Scene Memory (SSM). It is compartmentalized enough to accurately memorize the percepts during navigation. It also serves as a structured scene representation, which captures and disentangles visual and geometric cues in the environment. SSM has a collect-read controller that adaptively collects information to support the current decision and mimics iterative algorithms for long-range reasoning. Since SSM provides a complete action space, i.e., all the navigable places on the map, a frontier-exploration-based decision-making strategy is introduced to enable efficient and global planning. Experimental results on two VLN datasets (i.e., R2R and R4R) show that our method achieves state-of-the-art performance on several metrics.
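
The abstract describes SSM as a structured map of the environment plus a frontier-exploration strategy that treats every navigable-but-unvisited place as a candidate action. As a rough illustration only, the sketch below shows one way such a graph memory and a global frontier-selection step could look. It is not the authors' implementation: the names (SceneNode, StructuredSceneMemory, select_frontier_action) and the scoring callable are hypothetical stand-ins, whereas the paper's controller is a learned, instruction-conditioned module.

```python
# Minimal illustrative sketch of a graph-structured scene memory with
# frontier-based action selection. NOT the paper's code: class and function
# names are hypothetical, and the score function is a stand-in for the
# learned collect-read controller conditioned on the instruction.

from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SceneNode:
    """One navigable place on the map, with disentangled cues."""
    node_id: str
    visual_feat: List[float]         # e.g. pooled view features of the place
    geometric_feat: List[float]      # e.g. heading / elevation / position cues
    visited: bool = False
    neighbors: List[str] = field(default_factory=list)


class StructuredSceneMemory:
    """Topological map built incrementally as the agent navigates."""

    def __init__(self) -> None:
        self.nodes: Dict[str, SceneNode] = {}

    def add_observation(self, node: SceneNode) -> None:
        """Insert a newly perceived place, or refresh a known one."""
        if node.node_id in self.nodes:
            known = self.nodes[node.node_id]
            known.visual_feat = node.visual_feat
            known.geometric_feat = node.geometric_feat
            known.neighbors = list(set(known.neighbors) | set(node.neighbors))
        else:
            self.nodes[node.node_id] = node

    def frontier(self) -> List[SceneNode]:
        """All places seen but not yet visited: the global action space."""
        return [n for n in self.nodes.values() if not n.visited]


def select_frontier_action(
    memory: StructuredSceneMemory,
    score_fn: Callable[[SceneNode], float],
) -> SceneNode:
    """Pick the frontier node with the highest score.

    `score_fn` stands in for the learned controller that reads the memory
    and the instruction; here it is just any callable returning a float.
    """
    candidates = memory.frontier()
    if not candidates:
        raise RuntimeError("No unvisited navigable places left to explore.")
    return max(candidates, key=score_fn)


if __name__ == "__main__":
    ssm = StructuredSceneMemory()
    ssm.add_observation(SceneNode("a", [0.1], [0.0], visited=True, neighbors=["b", "c"]))
    ssm.add_observation(SceneNode("b", [0.4], [1.0], neighbors=["a"]))
    ssm.add_observation(SceneNode("c", [0.9], [2.0], neighbors=["a"]))

    # Toy score: prefer the frontier node with the largest visual feature.
    best = select_frontier_action(ssm, lambda n: n.visual_feat[0])
    print("Next goal on the map:", best.node_id)   # -> "c"
```

The key design point the sketch tries to capture is that decisions are made over all navigable places stored in the map, not only the neighbors of the current viewpoint; the toy `score_fn` merely marks where the learned, instruction-conditioned reader would plug in.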

Related research

Bird's-Eye-View Scene Graph for Vision-Language Navigation (08/09/2023)
Evolving Graphical Planner: Contextual Global Planning for Vision-and-Language Navigation (07/11/2020)
A Dual Semantic-Aware Recurrent Global-Adaptive Network For Vision-and-Language Navigation (05/05/2023)
DREAMWALKER: Mental Planning for Continuous Vision-Language Navigation (08/14/2023)
Building Intelligent Autonomous Navigation Agents (06/25/2021)
Breaking Down the Task: A Unit-Grained Hybrid Training Framework for Vision and Language Decision Making (07/16/2023)
Scene Memory Transformer for Embodied Agents in Long-Horizon Tasks (03/09/2019)
