Unsupervised Video Decomposition using Spatio-temporal Iterative Inference

06/25/2020
by Polina Zablotskaia, et al.

Unsupervised multi-object scene decomposition is a fast-emerging problem in representation learning. Despite significant progress on static scenes, existing models are unable to leverage the important dynamic cues present in video. We propose a novel spatio-temporal iterative inference framework that is powerful enough to jointly model complex multi-object representations and explicit temporal dependencies between latent variables across frames. This is achieved by incorporating a 2D-LSTM and temporally conditioned inference and generation into iterative amortized inference for posterior refinement. Our method improves the overall quality of decompositions, encodes information about the objects' dynamics, and can be used to predict the trajectory of each object separately. Additionally, we show that our model achieves high accuracy even without color information. We demonstrate the decomposition, segmentation, and prediction capabilities of our model and show that it outperforms the state of the art on several benchmark datasets, one of which was curated for this work and will be made publicly available.
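
As a rough illustration of iterative posterior refinement with temporal conditioning, the toy sketch below refines a scalar posterior mean over several steps per frame, starting each frame from the previous frame's estimate. It is not the model from the paper: the scalar linear-Gaussian setup, the function names (`frame_loss`, `frame_grad`, `refine`, `infer_video`), and all constants are assumptions made for illustration, there is no 2D-LSTM or object decomposition, and a plain gradient step stands in for the learned refinement network that amortized inference would use.

```python
# Toy sketch (assumed setup, not the paper's architecture):
# each frame x_t is modelled as x_t = w * z_t + noise, and the prior on z_t
# is centred on the previous frame's refined posterior mean.

def frame_loss(mu, x, mu_prior, w=1.0, obs_var=1.0, prior_var=1.0):
    """Negative log joint (up to a constant) for one frame."""
    recon = (x - w * mu) ** 2 / (2.0 * obs_var)           # reconstruction term
    temporal = (mu - mu_prior) ** 2 / (2.0 * prior_var)   # temporal prior term
    return recon + temporal


def frame_grad(mu, x, mu_prior, w=1.0, obs_var=1.0, prior_var=1.0):
    """Gradient of frame_loss with respect to the posterior mean mu."""
    return -w * (x - w * mu) / obs_var + (mu - mu_prior) / prior_var


def refine(mu, grad, step=0.25):
    """Hypothetical stand-in for a learned refinement network:
    here it is just a fixed-size gradient step."""
    return mu - step * grad


def infer_video(frames, num_steps=20, mu0=0.0):
    """Refine the posterior mean for each frame in turn."""
    means, losses = [], []
    mu_prev = mu0
    for x in frames:
        mu = mu_prev  # temporal conditioning: start from the previous estimate
        trace = [frame_loss(mu, x, mu_prev)]
        for _ in range(num_steps):
            mu = refine(mu, frame_grad(mu, x, mu_prev))
            trace.append(frame_loss(mu, x, mu_prev))
        means.append(mu)
        losses.append(trace)
        mu_prev = mu  # carry the refined estimate forward to the next frame
    return means, losses


frames = [1.0, 1.5, 2.0, 2.5]  # made-up scalar "frames"
means, losses = infer_video(frames)
```

In this toy setting each refinement step lowers the per-frame loss, and the estimate for a frame settles between the observation and the estimate carried over from the previous frame.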
