CLIP-Nav: Using CLIP for Zero-Shot Vision-and-Language Navigation

11/30/2022
by Vishnu Sashank Dorbala, et al.

Household environments are visually diverse. Embodied agents performing Vision-and-Language Navigation (VLN) in the wild must handle this diversity while also following arbitrary language instructions. Recently, Vision-Language models like CLIP have shown strong performance on zero-shot object recognition. In this work, we ask whether these models are also capable of zero-shot language grounding. In particular, we utilize CLIP to tackle the novel problem of zero-shot VLN using natural language referring expressions that describe target objects, in contrast to past work that used simple language templates describing object classes. We examine CLIP's capability to make sequential navigational decisions without any dataset-specific fine-tuning, and study how it influences the path that an agent takes. Our results on the coarse-grained instruction following task of REVERIE demonstrate the navigational capability of CLIP, surpassing the supervised baseline in terms of both success rate (SR) and success weighted by path length (SPL). More importantly, we quantitatively show via Relative Change in Success (RCS) that our CLIP-based zero-shot approach generalizes better than SOTA fully supervised learning approaches, delivering more consistent performance across environments.
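To illustrate the kind of CLIP-based grounding the abstract describes, here is a minimal sketch, not the paper's exact pipeline, of scoring candidate views at a single navigation step against a referring expression with OpenAI's clip package. The backbone choice, the image paths, and the pick_best_view helper are illustrative assumptions.

import torch
import clip
from PIL import Image

# Illustrative sketch (assumed setup, not the paper's method): use CLIP
# image-text similarity to decide which candidate view best matches a
# natural language referring expression at one navigation step.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def pick_best_view(instruction, view_image_paths):
    # Encode the instruction once and every candidate view, then return
    # the index of the view with the highest cosine similarity.
    text = clip.tokenize([instruction]).to(device)
    images = torch.stack(
        [preprocess(Image.open(p)) for p in view_image_paths]
    ).to(device)
    with torch.no_grad():
        image_features = model.encode_image(images)
        text_features = model.encode_text(text)
        image_features = image_features / image_features.norm(dim=-1, keepdim=True)
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)
        scores = (image_features @ text_features.T).squeeze(-1)
    return scores.argmax().item()

# Hypothetical usage: choose among panoramic sub-views at one step.
# best = pick_best_view("the blue mug on the kitchen counter",
#                       ["view_left.jpg", "view_front.jpg", "view_right.jpg"])

Repeating such a scoring step at each viewpoint yields the kind of sequential, fine-tuning-free navigational decisions the abstract refers to.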


research
03/06/2023

Can an Embodied Agent Find Your "Cat-shaped Mug"? LLM-Based Zero-Shot Object Navigation

We present LGX, a novel algorithm for Object Goal Navigation in a "langu...
research
03/20/2022

CLIP on Wheels: Zero-Shot Object Navigation as Object Localization and Exploration

Households across the world contain arbitrary objects: from mate gourds ...
research
08/15/2023

A^2Nav: Action-Aware Zero-Shot Robot Navigation by Exploiting Vision-and-Language Ability of Foundation Models

We study the task of zero-shot vision-and-language navigation (ZS-VLN), ...
research
06/17/2023

MO-VLN: A Multi-Task Benchmark for Open-set Zero-Shot Vision-and-Language Navigation

Given a natural language, a general robot has to comprehend the instruct...
research
06/07/2022

Intra-agent speech permits zero-shot task acquisition

Human language learners are exposed to a trickle of informative, context...
research
09/20/2023

Discuss Before Moving: Visual Language Navigation via Multi-expert Discussions

Visual language navigation (VLN) is an embodied task demanding a wide ra...
research
11/28/2021

Explore the Potential Performance of Vision-and-Language Navigation Model: a Snapshot Ensemble Method

Vision-and-Language Navigation (VLN) is a challenging task in the field ...
