CoCoNets: Continuous Contrastive 3D Scene Representations

04/08/2021
by Shamit Lal, et al.

This paper explores self-supervised learning of amodal 3D feature representations from RGB and RGB-D posed images and videos, agnostic to object and scene semantic content, and evaluates the resulting scene representations on the downstream tasks of visual correspondence, object tracking, and object detection. The model infers a latent 3D representation of the scene in the form of 3D feature points, where each continuous world 3D point is mapped to its corresponding feature vector. The model is trained for contrastive view prediction by rendering 3D feature clouds in queried viewpoints and matching them against the 3D feature point cloud predicted from the query view. Notably, the representation can be queried for any 3D location, even one not visible from the input view. Our model brings together three powerful ideas from recent research: 3D feature grids as a neural bottleneck for view prediction, implicit functions for handling the resolution limitations of 3D grids, and contrastive learning for unsupervised training of feature representations. We show that the resulting 3D visual feature representations scale effectively across objects and scenes, imagine information occluded or missing from the input viewpoints, track objects over time, align semantically related objects in 3D, and improve 3D object detection. We outperform many existing state-of-the-art methods for 3D feature learning and view prediction, which are either limited by 3D grid spatial resolution, do not attempt to build amodal 3D representations, or do not handle combinatorial scene variability due to their non-convolutional bottlenecks.
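The contrastive view-prediction objective described above can be sketched as a point-wise InfoNCE loss: features rendered from the inferred 3D representation are contrasted against features predicted directly from the query view, with corresponding 3D world points treated as positives. The function name, tensor shapes, and temperature below are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of point-wise contrastive matching between two feature
# point clouds; all names and shapes are illustrative assumptions.
import torch
import torch.nn.functional as F

def pointwise_infonce(rendered_feats, query_feats, temperature=0.07):
    """InfoNCE over matched 3D feature points.

    rendered_feats, query_feats: (N, D) tensors where row i of each
    corresponds to the same continuous 3D world point. Row i of the
    other cloud is the positive; all other rows are negatives.
    """
    a = F.normalize(rendered_feats, dim=1)
    b = F.normalize(query_feats, dim=1)
    logits = a @ b.t() / temperature          # (N, N) cosine similarities
    targets = torch.arange(a.size(0))         # positives on the diagonal
    return F.cross_entropy(logits, targets)

# Usage with random stand-in features for 128 corresponding points:
loss = pointwise_infonce(torch.randn(128, 32), torch.randn(128, 32))
```

Because the loss only needs feature vectors at matched 3D locations, it applies equally to points sampled continuously from an implicit function, not just to voxel-grid cells.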


