SIMONe: View-Invariant, Temporally-Abstracted Object Representations via Unsupervised Video Decomposition

06/07/2021
by Rishabh Kabra, et al.

To help agents reason about scenes in terms of their building blocks, we wish to extract the compositional structure of any given scene (in particular, the configuration and characteristics of objects comprising the scene). This problem is especially difficult when scene structure needs to be inferred while also estimating the agent's location/viewpoint, as the two variables jointly give rise to the agent's observations. We present an unsupervised variational approach to this problem. Leveraging the shared structure that exists across different scenes, our model learns to infer two sets of latent representations from RGB video input alone: a set of "object" latents, corresponding to the time-invariant, object-level contents of the scene, as well as a set of "frame" latents, corresponding to global time-varying elements such as viewpoint. This factorization of latents allows our model, SIMONe, to represent object attributes in an allocentric manner which does not depend on viewpoint. Moreover, it allows us to disentangle object dynamics and summarize their trajectories as time-abstracted, view-invariant, per-object properties. We demonstrate these capabilities, as well as the model's performance in terms of view synthesis and instance segmentation, across three procedurally generated video datasets.
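The split into time-invariant "object" latents and time-varying "frame" latents can be pictured with a small shape-level sketch. This is only an illustration under stated assumptions, not the model described above: the feature grid `features`, the sizes `K`, `T`, `D`, the use of plain averaging to pool each axis, and the helper names `factorize_latents` and `pair_latents` are all hypothetical and do not come from the abstract.

```python
import numpy as np

def factorize_latents(features):
    """Split a (K, T, D) feature grid into object and frame latents.

    features[k, t] stands in for a hypothetical encoder output for
    slot k at frame t. Mean pooling is an assumption for illustration.
    """
    object_latents = features.mean(axis=1)  # (K, D): pooled over time
    frame_latents = features.mean(axis=0)   # (T, D): pooled over slots
    return object_latents, frame_latents

def pair_latents(object_latents, frame_latents):
    """Build every (object, frame) pair a decoder could be conditioned on."""
    K, D = object_latents.shape
    T, _ = frame_latents.shape
    obj = np.broadcast_to(object_latents[:, None, :], (K, T, D))
    frm = np.broadcast_to(frame_latents[None, :, :], (K, T, D))
    return np.concatenate([obj, frm], axis=-1)  # (K, T, 2 * D)

# Toy sizes: 4 object slots, 6 frames, 8 feature dimensions (all assumed).
rng = np.random.default_rng(0)
K, T, D = 4, 6, 8
features = rng.normal(size=(K, T, D))

object_latents, frame_latents = factorize_latents(features)
pairs = pair_latents(object_latents, frame_latents)
```

In this sketch the object half of `pairs[k, t]` is identical for every `t`, and the frame half is identical for every `k`, which mirrors the intended reading: object latents carry no per-frame information, and frame latents carry nothing specific to one object.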


