Learning Navigational Visual Representations with Semantic Map Supervision

07/23/2023
by Yicong Hong, et al.

Being able to perceive the semantics and the spatial structure of the environment is essential for visual navigation of a household robot. However, most existing works only employ visual backbones pre-trained either on independent images for classification or with self-supervised learning methods to adapt to the indoor navigation domain, neglecting the spatial relationships that are essential to learning navigation. Inspired by the way humans naturally build semantically and spatially meaningful cognitive maps in their brains during navigation, in this paper we propose a novel navigation-specific visual representation learning method that contrasts the agent's egocentric views with semantic maps (Ego^2-Map). We apply a visual transformer as the backbone encoder and train the model with data collected from the large-scale Habitat-Matterport3D environments. Ego^2-Map learning transfers the compact and rich information from a map, such as objects, structure, and transitions, to the agent's egocentric representations for navigation. Experiments show that agents using our learned representations on object-goal navigation outperform recent visual pre-training methods. Moreover, our representations significantly improve vision-and-language navigation in continuous environments for both high-level and low-level action spaces, achieving a new state-of-the-art result of 47% success rate (SR) on the VLN-CE test server.
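
As a concrete illustration of the contrastive objective the abstract describes, the sketch below aligns egocentric-view features with semantic-map features using a symmetric InfoNCE loss over a batch of paired samples. This is a minimal sketch under stated assumptions, not the paper's implementation: the projection heads, feature dimensions, and temperature value are hypothetical, and the upstream encoders (e.g., a ViT for views) are assumed to produce the input features.

```python
# Minimal sketch of a view-to-map contrastive objective in the spirit of
# Ego^2-Map. Assumptions (not from the paper): feature dims, projection
# heads, and temperature are illustrative; matched (view, map) pairs are
# assumed to share the same index within a batch.
import torch
import torch.nn as nn
import torch.nn.functional as F


class EgoMapContrast(nn.Module):
    def __init__(self, view_dim=768, map_dim=768, embed_dim=256, temperature=0.07):
        super().__init__()
        # Project each modality into a shared embedding space.
        self.view_proj = nn.Linear(view_dim, embed_dim)
        self.map_proj = nn.Linear(map_dim, embed_dim)
        self.temperature = temperature

    def forward(self, view_feats, map_feats):
        # L2-normalize so the dot product is cosine similarity.
        v = F.normalize(self.view_proj(view_feats), dim=-1)
        m = F.normalize(self.map_proj(map_feats), dim=-1)
        # Pairwise similarities; positives lie on the diagonal.
        logits = v @ m.t() / self.temperature
        targets = torch.arange(v.size(0), device=v.device)
        # Symmetric InfoNCE: views -> maps and maps -> views.
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    # Random stand-in features for a batch of 8 paired samples.
    criterion = EgoMapContrast()
    views = torch.randn(8, 768)  # egocentric-view features from the backbone
    maps = torch.randn(8, 768)   # semantic-map features
    print(criterion(views, maps).item())
```

The symmetric form treats both retrieval directions equally, encouraging the egocentric representation to absorb the map's object, structure, and transition information, which is the transfer effect the paper targets.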

Related research

05/20/2021
VTNet: Visual Transformer Network for Object Goal Navigation
Object goal navigation aims to steer an agent towards a target object ba...

05/21/2023
Instance-Level Semantic Maps for Vision Language Navigation
Humans have a natural ability to perform semantic associations with the ...

06/08/2023
SNAP: Self-Supervised Neural Maps for Visual Positioning and Semantic Understanding
Semantic 2D maps are commonly used by humans and machines for navigation...

05/04/2020
VisualEchoes: Spatial Image Representation Learning through Echolocation
Several animal species (e.g., bats, dolphins, and whales) and even visua...

11/20/2022
Structure-Encoding Auxiliary Tasks for Improved Visual Representation in Vision-and-Language Navigation
In Vision-and-Language Navigation (VLN), researchers typically take an i...

01/26/2022
Self-supervised 3D Semantic Representation Learning for Vision-and-Language Navigation
In the Vision-and-Language Navigation task, the embodied agent follows l...

11/29/2022
MoDA: Map style transfer for self-supervised Domain Adaptation of embodied agents
We propose a domain adaptation method, MoDA, which adapts a pretrained e...
