Unmasking the Inductive Biases of Unsupervised Object Representations for Video Sequences

06/12/2020
by Marissa A. Weis, et al.

Perceiving the world in terms of objects is a crucial prerequisite for reasoning and scene understanding. Recently, several methods have been proposed for unsupervised learning of object-centric representations. However, because these models have been evaluated on different downstream tasks, it remains unclear how they compare in terms of basic perceptual abilities such as detection, figure-ground segmentation, and tracking of individual objects. In this paper, we argue that the established evaluation protocol of multi-object tracking tests precisely these perceptual qualities, and we propose a new benchmark dataset based on procedurally generated video sequences. Using this benchmark, we compare the perceptual abilities of three state-of-the-art unsupervised object-centric learning approaches. To this end, we propose a video extension of MONet, a seminal object-centric model for static scenes, and compare it to two recent video models: OP3, which exploits clustering via spatial mixture models, and TBA, which uses an explicit factorization via spatial transformers. Our results indicate that architectures that employ unconstrained latent representations based on per-object variational autoencoders and full-image object masks learn more powerful representations, in terms of object detection, segmentation, and tracking, than the explicitly parameterized architecture based on spatial transformers. We also observe that none of the methods handles the most challenging tracking scenarios gracefully, suggesting that our synthetic video benchmark may provide fruitful guidance towards learning more robust object-centric video representations.
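The abstract does not say which tracking metrics the benchmark reports. As an illustration only, one widely used summary score in multi-object tracking evaluation is MOTA (multi-object tracking accuracy), which penalizes missed objects, false positives, and identity switches relative to the number of ground-truth objects. The sketch below assumes that per-frame error counts have already been obtained by matching predictions to ground truth; the function name `mota` and its argument names are hypothetical and not taken from the paper.

```python
def mota(misses, false_positives, id_switches, gt_objects):
    """Compute MOTA for one sequence from per-frame counts.

    Each argument is a list with one integer per frame:
      misses          - ground-truth objects with no matched prediction
      false_positives - predictions with no matched ground-truth object
      id_switches     - matched objects whose predicted identity changed
      gt_objects      - number of ground-truth objects in the frame
    """
    total_gt = sum(gt_objects)
    if total_gt == 0:
        raise ValueError("MOTA is undefined without ground-truth objects")
    total_errors = sum(misses) + sum(false_positives) + sum(id_switches)
    # MOTA = 1 - (all errors) / (all ground-truth objects); it can be negative.
    return 1.0 - total_errors / total_gt


# Illustrative counts for a three-frame sequence (made-up numbers).
score = mota(
    misses=[1, 0, 2],
    false_positives=[0, 1, 0],
    id_switches=[0, 0, 1],
    gt_objects=[5, 5, 5],
)
print(f"MOTA = {score:.3f}")
```

A single score like this folds detection errors (misses and false positives) and tracking errors (identity switches) into one number, so reporting the three error counts separately as well makes it easier to see which perceptual ability a model lacks.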


