CLIP on Wheels: Zero-Shot Object Navigation as Object Localization and Exploration

03/20/2022
by Samir Yitzhak Gadre, et al.

Households across the world contain arbitrary objects: from mate gourds and coffee mugs to sitars and guitars. Considering this diversity, robot perception must handle a large variety of semantic objects without additional fine-tuning to be broadly applicable in homes. Recently, zero-shot models have demonstrated impressive performance in image classification of arbitrary objects (i.e., classifying images at inference with categories not explicitly seen during training). In this paper, we translate the success of zero-shot vision models (e.g., CLIP) to the popular embodied AI task of object navigation. In our setting, an agent must find an arbitrary goal object, specified via text, in unseen environments coming from different datasets. Our key insight is to modularize the task into zero-shot object localization and exploration. Employing this philosophy, we design CLIP on Wheels (CoW) baselines for the task and evaluate each zero-shot model in both Habitat and RoboTHOR simulators. We find that a straightforward CoW, with CLIP-based object localization plus classical exploration, and no additional training, often outperforms learnable approaches in terms of success, efficiency, and robustness to dataset distribution shift. This CoW achieves 6.3% SPL in Habitat and 10.0% SPL in RoboTHOR, when tested zero-shot on all categories. On a subset of four RoboTHOR categories considered in prior work, the same CoW shows a 16.1 percentage point improvement in Success over the learnable state-of-the-art baseline.
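To make the modularization concrete, the sketch below shows one way a CoW-style agent could gate a classical exploration policy on CLIP-based localization. It assumes the openai/clip Python package; the prompt template, the similarity threshold, and the agent interface (observe, explore_step, stop) are illustrative assumptions, not the paper's implementation, whose localizers produce spatial relevance within a frame rather than a single whole-image score.

```python
# Sketch of a CoW-style decision loop: CLIP scores each egocentric frame
# against the text goal; below a threshold the agent keeps exploring,
# above it the agent ends the episode. All agent-facing names (observe,
# explore_step, stop) are hypothetical placeholders.
import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def goal_in_view(frame: Image.Image, goal: str, threshold: float = 0.28) -> bool:
    """Score the egocentric frame against the text goal with CLIP.

    The 0.28 cosine-similarity threshold is an illustrative choice,
    not a value taken from the paper.
    """
    image = preprocess(frame).unsqueeze(0).to(device)
    text = clip.tokenize([f"a photo of a {goal}"]).to(device)
    with torch.no_grad():
        image_feat = model.encode_image(image)
        text_feat = model.encode_text(text)
        image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
        similarity = (image_feat @ text_feat.T).item()
    return similarity > threshold

def run_episode(agent, goal: str):
    # Explore until CLIP localizes the goal object, then stop.
    while not goal_in_view(agent.observe(), goal):
        agent.explore_step()  # e.g., frontier-based exploration
    agent.stop()
```

Note the key property the abstract emphasizes: neither component requires additional training, so swapping in a new goal category is just a change of the text prompt.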
