ObPose: Leveraging Canonical Pose for Object-Centric Scene Inference in 3D

06/07/2022
by Yizhe Wu, et al.

We present ObPose, an unsupervised object-centric generative model that learns to segment 3D objects from RGB-D video. Inspired by prior art in 2D representation learning, ObPose considers a factorised latent space, separately encoding object-wise location (where) and appearance (what) information. In particular, ObPose leverages an object's canonical pose, defined via a minimum volume principle, as a novel inductive bias for learning the where component. To achieve this, we propose an efficient, voxelised approximation approach to recover the object shape directly from a neural radiance field (NeRF). As a consequence, ObPose models scenes as compositions of NeRFs representing individual objects. When evaluated on the YCB dataset for unsupervised scene segmentation, ObPose outperforms the current state-of-the-art in 3D scene inference (ObSuRF) by a significant margin in terms of segmentation quality, both for video inputs and for multi-view static scenes. In addition, the design choices made in the ObPose encoder are validated with relevant ablations.
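
To give a rough intuition for what a minimum volume principle can mean, the sketch below searches over yaw angles for the rotation under which an object's axis-aligned bounding box is smallest. This is an illustrative assumption only, not the ObPose procedure: the abstract does not specify the search, and the names `rotation_z` and `min_volume_yaw`, the restriction to rotations about the vertical axis, and the brute-force angle grid are all hypothetical choices made here. The input is assumed to be a plain point set (for example, occupied voxel centres), not a NeRF.

```python
import numpy as np


def rotation_z(theta):
    """3x3 rotation matrix for a counter-clockwise turn of theta about z."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def min_volume_yaw(points, num_angles=90):
    """Return (theta, volume) for the yaw giving the smallest axis-aligned box.

    points: (N, 3) array of 3D positions (hypothetical stand-in for an
    object's occupied voxel centres).
    """
    centred = points - points.mean(axis=0)
    best_theta, best_volume = 0.0, np.inf
    # An axis-aligned box has the same volume after a 90-degree turn,
    # so searching [0, pi/2) covers every distinct yaw.
    for theta in np.linspace(0.0, np.pi / 2, num_angles, endpoint=False):
        rotated = centred @ rotation_z(theta).T
        extent = rotated.max(axis=0) - rotated.min(axis=0)
        volume = float(np.prod(extent))
        if volume < best_volume:
            best_theta, best_volume = float(theta), volume
    return best_theta, best_volume
```

Calling `min_volume_yaw` on the corners of a tilted cuboid returns the yaw that re-aligns it with the coordinate axes (up to a 90-degree ambiguity), which is one simple way to pin down a canonical orientation from shape alone.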


research
11/09/2021

Object-Centric Representation Learning with Generative Spatial-Temporal Factorization

Learning object-centric scene representations is essential for attaining...
research
05/31/2021

APEX: Unsupervised, Object-Centric Scene Segmentation and Tracking for Robot Manipulation

Recent advances in unsupervised learning for object detection, segmentat...
research
04/02/2021

Decomposing 3D Scenes into Objects via Unsupervised Volume Segmentation

We present ObSuRF, a method which turns a single image of a scene into a...
research
06/07/2021

SIMONe: View-Invariant, Temporally-Abstracted Object Representations via Unsupervised Video Decomposition

To help agents reason about scenes in terms of their building blocks, we...
research
12/14/2016

Disentangling Space and Time in Video with Hierarchical Variational Auto-encoders

There are many forms of feature information present in video data. Princ...
research
09/25/2019

LAVAE: Disentangling Location and Appearance

We propose a probabilistic generative model for unsupervised learning of...
research
06/25/2020

Unsupervised Video Decomposition using Spatio-temporal Iterative Inference

Unsupervised multi-object scene decomposition is a fast-emerging problem...
