Discuss Before Moving: Visual Language Navigation via Multi-expert Discussions

09/20/2023
by Yuxing Long, et al.

Visual language navigation (VLN) is an embodied task demanding a wide range of skills encompassing understanding, perception, and planning. For such a multifaceted challenge, previous VLN methods rely entirely on a single model's own reasoning to make predictions within a single round. However, existing models, even the most advanced large language models such as GPT-4, still struggle to handle multiple subtasks through single-round self-thinking. In this work, drawing inspiration from expert consultation meetings, we introduce a novel zero-shot VLN framework. Within this framework, large models with distinct abilities serve as domain experts. Our proposed navigation agent, named DiscussNav, actively discusses with these experts to collect essential information before moving at every step. These discussions cover critical navigation subtasks such as instruction understanding, environment perception, and completion estimation. Through comprehensive experiments, we demonstrate that discussions with domain experts effectively facilitate navigation by perceiving instruction-relevant information, correcting inadvertent errors, and sifting through inconsistent movement decisions. Results on the representative VLN task R2R show that our method surpasses the leading zero-shot VLN model by a large margin on all metrics. In addition, real-robot experiments demonstrate clear advantages of our method over single-round self-thinking.
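The abstract describes the discuss-before-moving loop only at a high level. The Python sketch below illustrates one plausible shape of such a loop: each step, the agent queries a set of domain experts before committing to a move. Every name here (Expert, InstructionExpert, PerceptionExpert, CompletionExpert, discuss_before_moving) is a hypothetical illustration, not the paper's actual API or implementation.

```python
# Minimal sketch of a "discuss before moving" navigation loop.
# All class and function names are hypothetical illustrations,
# not the actual DiscussNav interface.
from dataclasses import dataclass


@dataclass
class Observation:
    views: list  # candidate viewpoints visible from the current position


class Expert:
    """Base class for a domain expert backed by a large model."""

    def discuss(self, instruction: str, obs: Observation, history: list) -> str:
        raise NotImplementedError


class InstructionExpert(Expert):
    def discuss(self, instruction, obs, history):
        # In the paper this role would call an LLM to interpret the
        # instruction; here we simply echo it.
        return f"key actions and landmarks parsed from: {instruction!r}"


class PerceptionExpert(Expert):
    def discuss(self, instruction, obs, history):
        # Would call a vision-language model to describe
        # instruction-relevant objects in each candidate view.
        return f"descriptions of {len(obs.views)} candidate views"


class CompletionExpert(Expert):
    def discuss(self, instruction, obs, history):
        # Would estimate how much of the instruction is already done.
        return f"{len(history)} instructed steps appear complete so far"


def discuss_before_moving(instruction, obs, history, experts):
    """Collect every expert's input, then pick a move."""
    notes = [e.discuss(instruction, obs, history) for e in experts]
    # A real agent would feed `notes` to an LLM planner and filter its
    # sampled decisions for consistency; we return a placeholder choice.
    return obs.views[0], notes


if __name__ == "__main__":
    experts = [InstructionExpert(), PerceptionExpert(), CompletionExpert()]
    obs = Observation(views=["hallway", "kitchen door", "stairs"])
    move, notes = discuss_before_moving(
        "walk past the kitchen and stop at the stairs", obs, [], experts)
    print("move to:", move)
    for note in notes:
        print("-", note)
```

Note that the final decision step is deliberately a placeholder: the abstract indicates the actual method samples multiple movement decisions and sifts out inconsistent ones, which this sketch does not attempt to reproduce.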


Related Research

05/18/2023
Aligning Instruction Tasks Unlocks Large Language Models as Zero-Shot Relation Extractors
Recent work has shown that fine-tuning large language models (LLMs) on l...

08/15/2023
A^2Nav: Action-Aware Zero-Shot Robot Navigation by Exploiting Vision-and-Language Ability of Foundation Models
We study the task of zero-shot vision-and-language navigation (ZS-VLN), ...

03/06/2023
Can an Embodied Agent Find Your "Cat-shaped Mug"? LLM-Based Zero-Shot Object Navigation
We present LGX, a novel algorithm for Object Goal Navigation in a "langu...

11/30/2022
CLIP-Nav: Using CLIP for Zero-Shot Vision-and-Language Navigation
Household environments are visually diverse. Embodied agents performing ...

04/11/2023
L3MVN: Leveraging Large Language Models for Visual Target Navigation
Visual target navigation in unknown environments is a crucial problem in...

05/19/2023
Multimodal Web Navigation with Instruction-Finetuned Foundation Models
The progress of autonomous web navigation has been hindered by the depen...

02/03/2023
Multiple Thinking Achieving Meta-Ability Decoupling for Object Navigation
We propose a meta-ability decoupling (MAD) paradigm, which brings togeth...
