ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings

06/24/2022
by Arjun Majumdar, et al.

We present a scalable approach for learning open-world object-goal navigation (ObjectNav) – the task of asking a virtual robot (agent) to find any instance of an object in an unexplored environment (e.g., "find a sink"). Our approach is entirely zero-shot – i.e., it does not require ObjectNav rewards or demonstrations of any kind. Instead, we train on the image-goal navigation (ImageNav) task, in which agents find the location where a picture (i.e., goal image) was captured. Specifically, we encode goal images into a multimodal, semantic embedding space to enable training semantic-goal navigation (SemanticNav) agents at scale in unannotated 3D environments (e.g., HM3D). After training, SemanticNav agents can be instructed to find objects described in free-form natural language (e.g., "sink", "bathroom sink", etc.) by projecting language goals into the same multimodal, semantic embedding space. As a result, our approach enables open-world ObjectNav. We extensively evaluate our agents on three ObjectNav datasets (Gibson, HM3D, and MP3D) and observe absolute improvements in success of 4.2% - 20.0% over existing zero-shot methods. For reference, these gains are similar or better than the 5% improvement in success between the Habitat 2020 and 2021 ObjectNav challenge winners. In an open-world setting, we discover that our agents can generalize to compound instructions with a room explicitly mentioned (e.g., "Find a kitchen sink") and when the target room can be inferred (e.g., "Find a sink and a stove").
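The core mechanism described above – embedding image goals at training time and language goals at inference time into one shared semantic space, so the policy is agnostic to the goal modality – can be illustrated with a toy sketch. Everything below (the hash-based encoder, the dimension, the token scheme) is a hypothetical stand-in for a real frozen multimodal encoder such as CLIP, not the paper's actual model; the point is only that both goal types produce vectors the same navigation policy can consume.

```python
import zlib
import numpy as np

D = 256  # embedding dimension (CLIP uses 512+; 256 keeps the toy fast)

def _embed(tokens, dim=D):
    """Deterministic bag-of-tokens embedding standing in for a real
    (frozen) multimodal encoder like CLIP."""
    v = np.zeros(dim)
    for t in tokens:
        seed = zlib.crc32(t.encode("utf-8"))  # stable across runs
        v += np.random.default_rng(seed).standard_normal(dim)
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def encode_image_goal(visible_objects):
    """Training time: embed a goal image (represented here by the
    objects visible in it) into the shared semantic space."""
    return _embed(visible_objects)

def encode_language_goal(instruction):
    """Inference time: embed a free-form instruction into the SAME
    space, so no ObjectNav rewards or demonstrations are needed."""
    return _embed(instruction.lower().split())

if __name__ == "__main__":
    g_img = encode_image_goal(["sink", "faucet", "mirror"])
    g_sink = encode_language_goal("find a sink")
    g_bed = encode_language_goal("find a bed")
    # The "sink" instruction lands closer to the sink image goal than
    # an unrelated instruction does; a policy conditioned on these
    # embeddings never needs to know which modality produced them.
    print(float(g_img @ g_sink) > float(g_img @ g_bed))
```

In the real system the encoder is a pretrained vision-language model and the goal embedding conditions a learned navigation policy; this sketch only shows why training on image goals can transfer zero-shot to language goals when both share one embedding space.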

Related research

03/06/2023 · Can an Embodied Agent Find Your "Cat-shaped Mug"? LLM-Based Zero-Shot Object Navigation
We present LGX, a novel algorithm for Object Goal Navigation in a "langu...

03/10/2023 · Zero-Shot Object Searching Using Large-scale Object Relationship Prior
Home-assistant robots have been a long-standing research topic, and one ...

03/14/2021 · Success Weighted by Completion Time: A Dynamics-Aware Evaluation Criteria for Embodied Navigation
We present Success weighted by Completion Time (SCT), a new metric for e...

11/29/2022 · Instance-Specific Image Goal Navigation: Training Embodied Agents to Find Object Instances
We consider the problem of embodied visual navigation given an image-goa...

06/23/2020 · ObjectNav Revisited: On Evaluation of Embodied Agents Navigating to Objects
We revisit the problem of Object-Goal Navigation (ObjectNav). In its sim...

04/11/2023 · L3MVN: Leveraging Large Language Models for Visual Target Navigation
Visual target navigation in unknown environments is a crucial problem in...

11/18/2021 · Simple but Effective: CLIP Embeddings for Embodied AI
Contrastive language image pretraining (CLIP) encoders have been shown t...
