Target-Driven Structured Transformer Planner for Vision-Language Navigation

07/19/2022
by Yusheng Zhao, et al.

Vision-language navigation is the task of directing an embodied agent to navigate 3D scenes by following natural language instructions. For the agent, inferring the long-term navigation target from visual-linguistic clues is crucial for reliable path planning, yet this has rarely been studied in the literature. In this article, we propose a Target-Driven Structured Transformer Planner (TD-STP) for long-horizon, goal-guided, and room layout-aware navigation. Specifically, we devise an Imaginary Scene Tokenization mechanism for explicit estimation of the long-term target, even when it is located in unexplored environments. In addition, we design a Structured Transformer Planner that incorporates the explored room layout into a neural attention architecture for structured and global planning. Experimental results demonstrate that TD-STP substantially improves the previous best methods' success rate by 2% and 5% on the test set of R2R and REVERIE benchmarks, respectively. Our code is available at https://github.com/YushengZhao/TD-STP.
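
To make the idea of folding room layout into attention more concrete, here is a minimal sketch of one possible reading: scaled dot-product attention over explored viewpoints, with an additive bias that penalizes pairs of viewpoints not directly connected in the explored graph. This is not taken from the TD-STP code. The function name `layout_biased_attention`, the `penalty` parameter, and the choice of an additive penalty are all assumptions made for illustration; the paper may encode layout quite differently.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def layout_biased_attention(queries, keys, values, adjacency, penalty):
    """Attention over explored nodes with a layout-derived additive bias.

    Hypothetical illustration: every node can attend to every other node
    (global planning), but pairs that are not directly connected in the
    explored graph have `penalty` subtracted from their attention score.
    """
    dim = len(queries[0])
    outputs = []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            score = sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            connected = (i == j) or bool(adjacency[i][j])
            scores.append(score if connected else score - penalty)
        weights = softmax(scores)
        # Weighted sum of value vectors, one output channel at a time.
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return outputs


# Toy explored layout: three viewpoints in a chain, 0 - 1 - 2.
adjacency = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
# Zero queries and keys make all content scores equal, so only the
# layout bias shapes the attention weights in this toy example.
queries = [[0.0, 0.0] for _ in range(3)]
keys = [[0.0, 0.0] for _ in range(3)]
values = [[1.0], [2.0], [3.0]]

out = layout_biased_attention(queries, keys, values, adjacency, penalty=math.log(2))
print(out)
```

In this toy setup the middle viewpoint weights all three nodes equally, while each end viewpoint gives half as much weight to the node it is not connected to. A soft penalty rather than a hard mask is used here so that planning stays global, which is one design choice among several plausible ones.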

