A^2Nav: Action-Aware Zero-Shot Robot Navigation by Exploiting Vision-and-Language Ability of Foundation Models

08/15/2023
by Peihao Chen, et al.

We study the task of zero-shot vision-and-language navigation (ZS-VLN), a practical yet challenging problem in which an agent learns to navigate by following a path described by language instructions, without requiring any path-instruction annotation data. Such instructions typically have complex grammatical structures and contain varied action descriptions (e.g., "proceed beyond", "depart from"). Correctly understanding and executing these action demands is a critical problem, and the absence of annotated data makes it even more challenging. Notably, a well-educated human can easily understand path instructions without any special training. In this paper, we propose an action-aware zero-shot VLN method (A^2Nav) that exploits the vision-and-language ability of foundation models. Specifically, the proposed method consists of an instruction parser and an action-aware navigation policy. The instruction parser utilizes the advanced reasoning ability of large language models (e.g., GPT-3) to decompose complex navigation instructions into a sequence of action-specific object navigation sub-tasks. Each sub-task requires the agent to localize the object and navigate to a specific goal position according to the associated action demand. To accomplish these sub-tasks, an action-aware navigation policy is learned from freely collected action-specific datasets that reveal the distinct characteristics of each action demand. The learned policy then executes the sub-tasks sequentially to follow the full navigation instruction. Extensive experiments show that A^2Nav achieves promising ZS-VLN performance and even surpasses supervised learning methods on the R2R-Habitat and RxR-Habitat datasets.
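The abstract describes a two-stage pipeline: an LLM-based parser turns a free-form instruction into ordered, action-specific object-navigation sub-tasks, and an action-aware policy executes them one by one. The following Python sketch is a minimal illustration of how such a pipeline could be wired together; the prompt format and the call_llm, policy, and agent interfaces are hypothetical placeholders introduced for illustration, not the authors' implementation.

```python
# Illustrative sketch of the pipeline described above.
# call_llm, policy, and agent are hypothetical stand-ins, not the A^2Nav code.

from dataclasses import dataclass
from typing import List

@dataclass
class SubTask:
    action: str   # action demand, e.g. "proceed beyond", "depart from"
    target: str   # object or landmark to localize, e.g. "the kitchen table"

PARSE_PROMPT = (
    "Decompose the navigation instruction into an ordered list of steps, "
    "one per line, in the form '<action> | <target object>'.\n"
    "Instruction: {instruction}\nSteps:"
)

def parse_instruction(instruction: str, call_llm) -> List[SubTask]:
    """Use a large language model (e.g. GPT-3) to split a complex instruction
    into action-specific object-navigation sub-tasks."""
    reply = call_llm(PARSE_PROMPT.format(instruction=instruction))
    subtasks = []
    for line in reply.strip().splitlines():
        if "|" in line:
            action, target = (part.strip() for part in line.split("|", 1))
            subtasks.append(SubTask(action=action, target=target))
    return subtasks

def follow_instruction(instruction: str, call_llm, policy, agent) -> None:
    """Execute the parsed sub-tasks sequentially with an action-aware policy
    until each sub-goal is reached."""
    for subtask in parse_instruction(instruction, call_llm):
        done = False
        while not done:
            observation = agent.get_observation()
            # The policy conditions on both the target object and the action
            # demand (e.g. stop in front of the object vs. walk past it).
            step = policy.act(observation, subtask.target, subtask.action)
            done = agent.execute(step)
```

The key design choice reflected here is that the action demand is passed to the policy alongside the target object, so that sub-tasks sharing a landmark but differing in action (e.g. "depart from the table" vs. "stop at the table") lead to different goal behavior.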


