Unsupervised Video Decomposition using Spatio-temporal Iterative Inference

06/25/2020 ∙ by Polina Zablotskaia, et al. ∙ The University of British Columbia

Unsupervised multi-object scene decomposition is a fast-emerging problem in representation learning. Despite significant progress in static scenes, such models are unable to leverage important dynamic cues present in video. We propose a novel spatio-temporal iterative inference framework that is powerful enough to jointly model complex multi-object representations and explicit temporal dependencies between latent variables across frames. This is achieved by leveraging 2D-LSTM, temporally conditioned inference and generation within the iterative amortized inference for posterior refinement. Our method improves the overall quality of decompositions, encodes information about the objects' dynamics, and can be used to predict trajectories of each object separately. Additionally, we show that our model has a high accuracy even without color information. We demonstrate the decomposition, segmentation, and prediction capabilities of our model and show that it outperforms the state-of-the-art on several benchmark datasets, one of which was curated for this work and will be made publicly available.




1 Introduction

Unsupervised representation learning, which has a long history dating back to Boltzmann Machines Hinton and Sejnowski (1986) and the original works of Marr Marr (1970), has recently emerged as one of the important directions of research, carrying the newfound promise of alleviating the need for excessively large and fully labeled datasets. More traditional representation learning approaches focus on unsupervised (e.g., autoencoder-based Pathak et al. (2016); Vincent et al. (2008)) or self-supervised Noroozi and Favaro (2016); Vondrick et al. (2016); Zhang et al. (2016) learning of holistic representations that, for example, are tasked with producing (spatial Noroozi and Favaro (2016), temporal Vondrick et al. (2016), or color Zhang et al. (2016)) encodings of images or patches. The latest and most successful methods along these lines include ViLBERT Lu et al. (2019) and others Sun et al. (2019); Tan and Bansal (2019) that utilize powerful transformer architectures Vaswani et al. (2017) coupled with proxy multi-modal tasks (e.g., masked token prediction or visual-lingual alignment). Learning good disentangled, spatially granular representations that are, for example, able to decouple object appearance and shape in complex visual scenes consisting of multiple moving objects remains elusive.

Recent works that attempt to address this challenge can be characterized as: (i) attention-based methods Crawford and Pineau (2019b); Eslami et al. (2016), which infer latent representations for each object in a scene, and (ii) iterative refinement models Greff et al. (2019, 2017), which decompose a scene into a collection of components by grouping pixels. Importantly, the former have been limited to latent representations at object- or image patch-levels, while the latter class of models have illustrated the ability for more granular latent representations at the pixel (segmentation)-level. Specifically, most refinement models learn pixel-level generative models driven by spatial mixtures Greff et al. (2017) and utilize amortized iterative refinements Marino et al. (2018) for inference of disentangled latent representations within the VAE framework Kingma and Welling (2014); a prime example is IODINE Greff et al. (2019). However, while providing a powerful model and abstraction which is able to segment and disentangle complex scenes, IODINE Greff et al. (2019) and other similar architectures are fundamentally limited by the fact that they only consider images. Even when applied for inference in video, they process one frame at a time. This makes it excessively challenging to discover and represent individual instances of objects that may share properties such as appearance and shape but differ in dynamics.

In computer vision, it has been a long-held belief that motion carries important information for segmenting objects Jepson et al. (2002); Weiss and Adelson (1996). Armed with this intuition, we propose a spatio-temporal amortized inference model capable of not only unsupervised multi-object scene decomposition, but also of learning and leveraging the implicit probabilistic dynamics of each object from raw perspective video alone. This is achieved by introducing temporal dependencies between the latent variables across time. As such, IODINE Greff et al. (2019) could be considered a special (spatial) case of our spatio-temporal formulation. Modeling temporal dependencies among video frames also allows us to make use of conditional priors Chung et al. (2015) for variational inference, leading to more accurate and efficient inference results. The resulting model, illustrated in Fig. 1, achieves superior performance on complex multi-object benchmarks with respect to state-of-the-art models, including R-NEM Van Steenkiste et al. (2018) and IODINE Greff et al. (2019).


We propose a new spatio-temporal amortized inference model that is not only capable of multi-object video decomposition in an unsupervised manner but also learns and models the probabilistic dynamics of each object from complex raw video data by leveraging temporal dependencies between the latent random variables at each frame. To the best of our knowledge this is the first spatio-temporal model of this kind. Our model has a number of appealing properties, including temporal extrapolation (prediction), computational efficiency, and the ability to work with complex data exhibiting non-linear dynamics, colors, and changing number of objects within the same video sequence (e.g., due to objects exiting and entering the scene). In addition, we introduce an entropy prior to improve our model’s performance in scenarios where object appearance alone is not sufficiently distinctive (e.g., greyscale data). Finally, we illustrate state-of-the-art performance on challenging multi-object benchmark datasets (Bouncing Balls and CLEVRER), outperforming results of R-NEM Van Steenkiste et al. (2018) and IODINE Greff et al. (2019) in terms of segmentation, prediction, and generalization.

Figure 1: Unsupervised Video Decomposition. Our approach allows us to infer precise segmentations of the objects via interpretable latent representations that can be used to decompose each frame and simulate the future dynamics, all in an unsupervised fashion. Whenever a new object emerges into a frame, the model dynamically adapts and assigns one of the segmentation slots to the new object.

2 Related work

Unsupervised Scene Representation Learning.

Unsupervised scene representation learning has a rich history. Generally, these methods can be divided into two groups: attention-based methods, which infer latent representations for each object in a scene, and more complex and powerful iterative refinement models, which often make use of spatial mixtures and can decompose a scene into a collection of precisely estimated components by grouping pixels together.

Attention-based methods, such as AIR Eslami et al. (2016) and SPAIR Crawford and Pineau (2019b), decompose scenes into latent variables representing the appearance, position, and size of the underlying objects. However, both methods can only infer the objects’ bounding boxes (not segmentations) and have not been shown to work on non-trivial 3D scenes with perspective distortions and occlusions. MoNet Burgess et al. (2019) is the first model in this family tackling more complex data and inferring representations that can be used for instance segmentation of the objects and individual reconstructions. On the other hand, it is not a probabilistic generative model and thus not suitable for density estimation. GENESIS Engelcke et al. (2020) extends MoNet and alleviates some of its limitations by introducing a probabilistic framework and allowing for spatial relations between the objects. Iterative refinement models started with Tagger Greff et al. (2016), which explicitly reasons about the segmentation of its inputs and features. However, it does not allow explicit latent representations and cannot be scaled to larger and more complex images. NEM Greff et al. (2017), as an extension of Tagger, uses a spatial mixture model inside an expectation maximization framework but is limited to binary data. Finally, IODINE Greff et al. (2019) is a notable example of a model employing iterative amortized inference w.r.t. a spatial mixture formulation and achieves state-of-the-art performance in scene decomposition and segmentation. Furthermore, it can cope with complex data, including occlusions, and uses an auxiliary component to separate the objects from the background.

Unsupervised Video Tracking and Object Detection. SQAIR Kosiorek et al. (2018), SILOT Crawford and Pineau (2019a) and SCALOR Jiang et al. (2020) are temporal extensions of the static attention-based models that are tailored to tracking and object detection tasks. SQAIR is restricted to binary data and operates at the level of bounding boxes. SILOT and SCALOR are more expressive and can cope with cluttered scenes, larger numbers of objects, and dynamic backgrounds but do not work on colored perspective data (perspective videos are more complex, as objects can occlude one another and change in size over time); accurate segmentation remains a challenge. Finally, STOVE Kossen et al. (2020) focuses on physics-driven learning and simulation.

Unsupervised Video Decomposition and Segmentation. Models employing spatial mixtures and iterative inference in a temporal setting are closest to our method from a technical perspective. Notably, only a few models fall into this line of work: RTagger Prémont-Schwarz et al. (2017) is a recurrent extension of Tagger but inherits the limitations of its predecessor. R-NEM Van Steenkiste et al. (2018) effectively learns the objects’ dynamics and interactions through a relational module but is limited to orthographic binary data.

Non-representation Learning Methods. Orthogonal to unsupervised representation learning for instance segmentation and object detection are methods relying on fully labeled data, including Mask R-CNN He et al. (2017), YOLOv3 Redmon and Farhadi (2018), and Fast R-CNN Girshick (2015). Alternatively, hand-crafted features can be used, as demonstrated in Felzenszwalb and Huttenlocher (2004); Shi and Malik (2000). Unsupervised video segmentation also plays an important role in reinforcement learning: MOREL Goel et al. (2018) takes an optical flow approach to segment the moving objects, while others use RL agents to infer segmentations Casanova et al. (2020).

3 Dynamic Video Decomposition

We now introduce our dynamic model for unsupervised video decomposition. Our approach builds upon a generative model of multi-object representations Greff et al. (2019) and leverages elements of iterative amortized inference Marino et al. (2018). We briefly review both concepts (section 3.1) and then introduce our model (section 3.2).

3.1 Background

Multi-Object Representations.

The multi-object framework introduced in Greff et al. (2019) decomposes a static image $x = (x_i)_{i=1}^{D}$ into $K$ objects (including background). Each object is represented by a latent vector $z^{(k)}$ capturing the object’s unique appearance and can be thought of as an encoding of common visual properties, such as color, shape, position, and size. For each $z^{(k)}$ independently, a broadcast decoder Watters et al. (2019) generates pixelwise pairs $(m_i^{(k)}, \mu_i^{(k)})$ describing the assignment probability and appearance of pixel $i$ for object $k$. Together, they induce the generative image formation model

$$p(x \mid z) = \prod_{i=1}^{D} \sum_{k=1}^{K} m_i^{(k)} \, \mathcal{N}\big(x_i; \mu_i^{(k)}, \sigma^2\big), \tag{1}$$

where $\sum_{k} m_i^{(k)} = 1$, and $\sigma$ is the same and fixed for all $i$ and $k$. The original image pixels can be reconstructed from this probabilistic representation as $\tilde{x}_i = \sum_{k} m_i^{(k)} \mu_i^{(k)}$.
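As an illustration of the spatial mixture described above, the following plain-Python sketch evaluates the mixture log-likelihood and the mask-weighted reconstruction for a toy image. It assumes scalar (grayscale) pixels and hand-picked masks and means; the function names are hypothetical and not taken from the authors' code.

```python
import math

def gaussian_pdf(x, mean, sigma):
    """Density of a univariate normal N(x; mean, sigma^2)."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_log_likelihood(pixels, masks, means, sigma):
    """Sum over pixels of log sum_k mask_k * N(pixel; mean_k, sigma^2).

    pixels: list of D pixel values.
    masks, means: K lists of length D, one list per slot.
    """
    total = 0.0
    for i, x_i in enumerate(pixels):
        p_i = sum(masks[k][i] * gaussian_pdf(x_i, means[k][i], sigma)
                  for k in range(len(masks)))
        total += math.log(p_i)
    return total

def reconstruct(masks, means):
    """Mask-weighted mean image: sum over slots of mask * appearance."""
    num_pixels = len(masks[0])
    return [sum(m[i] * mu[i] for m, mu in zip(masks, means))
            for i in range(num_pixels)]

# Toy example (assumed values): D = 3 pixels, K = 2 slots.
# The masks sum to 1 at every pixel, as the mixture requires.
masks = [[0.9, 0.5, 0.1], [0.1, 0.5, 0.9]]
means = [[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]]
pixels = [1.0, 0.5, 0.0]

x_tilde = reconstruct(masks, means)
log_lik = mixture_log_likelihood(pixels, masks, means, sigma=0.3)
```

Swapping the two appearance lists while keeping the masks lowers the log-likelihood, which is the signal the inference procedure uses to assign pixels to slots.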

Iterative Amortized Inference.

Our approach leverages the iterative amortized inference framework Marino et al. (2018), which uses the learning to learn principle Andrychowicz et al. (2016) to close the amortization gap Cremer et al. (2017) typically observed in traditional variational inference. The need for such an iterative process arises due to the multi-modality of Eq.(1), which results in an order invariance and assignment ambiguity in the approximate posterior that standard variational inference cannot overcome Greff et al. (2019).

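The idea of amortized iterative inference is to start with randomly guessed parameters $\lambda_1^{(k)}$ for the approximate posterior $q_\lambda(z_1^{(k)} \mid x)$ and update this initial estimate through a series of $R$ refinement steps. Each refinement step $r \in \{1, \dots, R\}$ first samples a latent representation from $q_\lambda$ to evaluate the ELBO $\mathcal{L}$ and then uses the approximate posterior gradients $\nabla_\lambda \mathcal{L}$ to compute an additive update $f_\phi$, producing a new parameter estimate $\lambda_{r+1}^{(k)}$:

$$z_r^{(k)} \sim q_\lambda\big(z_r^{(k)} \mid x\big), \qquad \lambda_{r+1}^{(k)} \leftarrow \lambda_r^{(k)} + f_\phi\big(a^{(k)}, h_{r-1}^{(k)}\big), \tag{2}$$

where $a^{(k)}$ is a function of $x$, $\nabla_\lambda \mathcal{L}$, $\lambda_r^{(k)}$, and additional inputs. The function $f_\phi$ consists of a sequence of convolutional layers and an LSTM. The memory unit takes as input a hidden state $h_{r-1}^{(k)}$ from the previous refinement step.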
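The additive update scheme can be illustrated on a toy problem. The sketch below is not the learned refinement network: it swaps in a plain gradient step with a fixed step size, uses the analytic ELBO gradient instead of sampling, and assumes a one-dimensional model with likelihood N(x; z, 1) and prior N(z; 0, 1), for which the exact posterior mean is x / 2. All names are hypothetical.

```python
def elbo_gradient(mean, x):
    """Gradient of the ELBO with respect to the posterior mean.

    For likelihood N(x; z, 1), prior N(z; 0, 1) and a Gaussian posterior
    with fixed variance, the ELBO depends on the mean through
    -0.5 * (x - mean)^2 - 0.5 * mean^2, so the gradient is x - 2 * mean.
    """
    return x - 2.0 * mean

def refine(x, initial_mean, num_steps, step_size=0.25):
    """Apply the additive update num_steps times and return all estimates."""
    estimates = [initial_mean]
    mean = initial_mean
    for _ in range(num_steps):
        # Stand-in for the learned refinement network: a plain gradient step.
        mean = mean + step_size * elbo_gradient(mean, x)
        estimates.append(mean)
    return estimates

# Each step halves the distance to the exact posterior mean x / 2 = 1.0.
trajectory = refine(x=2.0, initial_mean=0.0, num_steps=5)
```

In the amortized setting, the fixed step is replaced by a network that maps gradients and current parameters to an update, so that a handful of steps suffices.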

3.2 Spatio-Temporal Iterative Inference

Our proposed model builds upon the concepts introduced in the previous section and enables robust learning of dynamic scenes through spatio-temporal iterative inference. Specifically, we consider the task of decomposing a video sequence $x = (x_t)_{t=1}^{T}$ into $K$ slot sequences $(m_t^{(k)})_{t=1}^{T}$ and $K$ appearance sequences $(\mu_t^{(k)})_{t=1}^{T}$. To this end, we introduce explicit temporal dependencies into the sequence of posterior refinements and show how to leverage this contextual information during decoding with a generative model. The resulting computation graph can be thought of as a 2D grid with time dimension $T$ and refinement dimension $R$ (Fig. 2(a)). Propagation of information along these two axes is achieved with a 2D-LSTM Graves et al. (2007) (Fig. 2(b)), which allows us to model the joint probability over the entire video sequence inside the iterative amortized inference framework. The proposed method is expressive enough to model the multimodality of our image formation process and posterior, yet its runtime complexity is smaller than that of its static counterpart.

3.2.1 Variational Objective

Since exact likelihood training is intractable, we formulate our task in terms of a variational objective. In contrast to traditional optimization of the evidence lower bound (ELBO) through static encodings of the approximate posterior, we incorporate information from two dynamic axes: (1) variational estimates from previous refinement steps; (2) temporal information from previous frames. Together, they form the basis for spatio-temporal variational inference via iterative refinements. Specifically, we train our model by maximizing the following ELBO objective (for simplicity, we drop references to the object slot $k$ from now on and formulate all equations on a per-slot basis):

$$\mathcal{L}_{\mathrm{ELBO}}(x) = \sum_{t=1}^{T} \sum_{r=1}^{\hat{R}} \Big( \mathbb{E}_{z_{t,r} \sim q_\lambda(z_{t,r} \mid x_{\le t}, z_{<t})} \big[ \log p(x_t \mid x_{<t}, z_{<t}, z_{t,r}) \big] - \beta \, \mathrm{KL}\big( q_\lambda(z_{t,r} \mid x_{\le t}, z_{<t}) \,\|\, p(z_t \mid x_{<t}, z_{<t}) \big) \Big), \tag{3}$$

where the first term expresses the reconstruction error of a single frame and the second term measures the divergence between the variational posterior and the prior. The relative weight between terms is controlled with a hyperparameter $\beta$ Higgins et al. (2017). Furthermore, to reduce the overall complexity of the model and to make it easier to train, we set $\hat{R} := \max(R - t, 1)$ (see Fig. 2 for an illustration). Compared to a static model, which infers each frame independently, reusing information from previous refinement steps also makes our model more computationally efficient. In the next sections, we discuss the form of the conditional distributions in Eq.(3) in more detail.
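The divergence term of such an objective has a closed form when both the posterior and the prior are diagonal Gaussians. The sketch below computes it and combines it with a reconstruction log-likelihood into a single weighted term. It assumes, following the usual convention of Higgins et al. (2017), that the weighting hyperparameter multiplies the divergence; the function names are hypothetical.

```python
import math

def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL(q || p) for diagonal Gaussians given per-dimension means and stds."""
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        total += (math.log(sp / sq)
                  + (sq * sq + (mq - mp) ** 2) / (2.0 * sp * sp)
                  - 0.5)
    return total

def weighted_elbo_term(log_likelihood, kl, beta=1.0):
    """One (frame, refinement) term: reconstruction minus weighted divergence.

    Placing beta on the divergence is an assumption of this sketch.
    """
    return log_likelihood - beta * kl

# Posterior shifted by one standard deviation from a unit-variance prior.
kl = kl_diag_gaussians([1.0], [1.0], [0.0], [1.0])
term = weighted_elbo_term(log_likelihood=-3.0, kl=kl, beta=2.0)
```

With a conditional prior, the prior mean and standard deviation passed to `kl_diag_gaussians` would change from frame to frame rather than being fixed to a standard normal.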

3.2.2 Inference and Generation

Figure 2: Model Overview. (a) Inference in our model passes through a 2D grid in which each light gray cell $(r, t)$ represents the $r$-th refinement at time $t$; dark gray cells are where the final reconstruction is computed and no refinement is needed. Each light gray cell receives three inputs: a refinement hidden state $h_{t,r-1}$, a temporal hidden state $h_{t-1,\hat{R}}$, and posterior parameters $\lambda_{t,r}$. The outputs are a new hidden state $h_{t,r}$ and new posterior parameters $\lambda_{t,r+1}$. (b) An example of the internal structure of the highlighted cell from panel (a). We process the inputs with the help of a spatial broadcast decoder Greff et al. (2019) and a 2D LSTM Graves et al. (2007). The rest of the light gray cells have the same structure.

Posterior Refinement. Optimizing Eq.(3) inside the iterative amortized inference framework (Section 3.1) requires careful thought about the nature and processing of the hidden states. While there is vast literature on the propagation of a single signal, including different types of RNNs Hochreiter and Schmidhuber (1997); Cho et al. (2014); Graves et al. (2005); Chung et al. (2017) and transformers Vaswani et al. (2017), the optimal solution for multiple axes with different semantic meaning (i.e., time and refinements) is less obvious. Here, we propose to use a 2D version of the uni-directional MD-LSTM Graves et al. (2007) to compute our variational objective (Eq.(3)) in an iterative manner. In order to do so, we replace the traditional LSTM in the refinement network (Eq.(2)) with a 2D extension. This extension allows the posterior gradients to flow through both the grid of the previous refinements and the previous time steps (see Fig. 2(a)). Writing $z_{t,r}$ for the latent encoding at time $t$ and refinement $r$, we can formalize this new update scheme as follows:

$$z_{t,r} \sim q_\lambda(z_{t,r} \mid x_{\le t}, z_{<t}), \qquad \lambda_{t,r+1} \leftarrow \lambda_{t,r} + f_\phi\big(a, h_{t,r-1}, h_{t-1,\hat{R}}\big). \tag{4}$$

Note that the hidden state from the previous time step is always $h_{t-1,\hat{R}}$, i.e., the one computed during the final refinement $\hat{R}$ at time $t-1$. Our reasoning for this is that the approximation of the posterior only improves with the number of refinements Marino et al. (2018).

Temporal Conditioning. Inside the learning objective, we condition the prior and the likelihood on the previous frames and the refinement steps. This follows naturally from the idea that each frame depends on its predecessor’s dynamics, so the latent representations should have the same property. Conditioning on the refinement steps is essential to the iterative amortized inference procedure. To model the prior and the likelihood distributions accordingly, we adopt the approach proposed by Chung et al. (2015) but tailor it to our iterative amortized inference setting. Specifically, the parameters of our Gaussian prior are now computed from the temporal hidden state $h_{t-1,\hat{R}}$:

$$p(z_t \mid x_{<t}, z_{<t}) = \mathcal{N}\big(z_t; \mu_t, \mathrm{diag}(\sigma_t^2)\big), \qquad [\mu_t, \sigma_t] = \phi_{\mathrm{prior}}\big(h_{t-1,\hat{R}}\big), \tag{5}$$

where $\phi_{\mathrm{prior}}$ is a simple neural network with a few layers (in practice, $\phi_{\mathrm{prior}}$ predicts $\log \sigma_t$ for stability reasons; please refer to the supplemental material for details). Note that the prior only changes along the time dimension and is independent of the refinement iterations, because we refine the posterior to be as close as possible to the dynamic prior for the current time step. Finally, to complete the conditional generation, we modify the likelihood distribution as follows (since our likelihood is a Gaussian mixture model, we are now referencing the object slot $k$):

$$p(x_t \mid x_{<t}, z_{<t}, z_{t,r}) = \prod_{i=1}^{D} \sum_{k=1}^{K} m_{t,r,i}^{(k)} \, \mathcal{N}\big(x_{t,i}; \mu_{t,r,i}^{(k)}, \sigma^2\big), \qquad \big[m_{t,r,i}^{(k)}, \mu_{t,r,i}^{(k)}\big] = g_\theta\big(z_{t,r}^{(k)}, h_{t-1,\hat{R}}^{(k)}\big), \tag{6}$$

where $(m_{t,r,i}^{(k)}, \mu_{t,r,i}^{(k)})$ are mask and appearance of pixel $i$ in slot $k$ at time step $t$ and refinement step $r$. $g_\theta$ is a spatial mixture broadcast decoder Greff et al. (2019) with a preceding MLP to transform the pair $(z_{t,r}^{(k)}, h_{t-1,\hat{R}}^{(k)})$ into a single vector representation.

3.2.3 Learning and Prediction

Architecture. From a graphical point of view, we can think of the refinement steps and time steps as being organized on a 2D grid (Fig. 2(a)), with each light gray cell $(r, t)$ representing the $r$-th refinement at time $t$. According to Eq.(4), each such cell takes as input the hidden state from a previous refinement $h_{t,r-1}$, the temporal hidden state $h_{t-1,\hat{R}}$, and the posterior parameters $\lambda_{t,r}$. Outputs of each light gray cell are new posterior parameters $\lambda_{t,r+1}$ and a new hidden state $h_{t,r}$. At the last refinement $\hat{R}$ at time $t$, the value of the refinement hidden state is assigned to a new temporal hidden state $h_{t,\hat{R}}$.
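The data flow through the grid can be sketched as a nested loop. The code below is a schematic under several assumptions: the cell is an arbitrary callable standing in for the 2D-LSTM refinement network, the refinement state restarts at each frame, and the initial posterior parameters come from a caller-supplied function. None of these names or choices are taken from the authors' implementation.

```python
def run_grid(num_frames, num_refinements, cell, init_params):
    """Walk the (time, refinement) grid frame by frame.

    cell(t, r, h_refine, h_temporal, params) -> (new_h_refine, new_params)
    num_refinements(t) -> number of refinement steps spent on frame t.
    init_params(t) -> starting posterior parameters for frame t (assumed).
    """
    h_temporal = None          # no earlier frame exists before the first one
    final_params = []
    for t in range(num_frames):
        h_refine = None        # assumption: refinement state restarts per frame
        params = init_params(t)
        for r in range(num_refinements(t)):
            h_refine, params = cell(t, r, h_refine, h_temporal, params)
        # The last refinement state of frame t becomes the temporal state
        # that every cell of frame t + 1 receives.
        h_temporal = h_refine
        final_params.append(params)
    return final_params

# A tracing cell that records its inputs instead of running a network.
trace = []

def tracing_cell(t, r, h_refine, h_temporal, params):
    trace.append((t, r, h_refine, h_temporal))
    return (t, r), params + 1

result = run_grid(3, lambda t: max(3 - t, 1), tracing_cell, lambda t: 0)
```

Inspecting `trace` shows that every cell of a frame sees the same temporal state, namely the one produced by the final refinement of the preceding frame.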

Training Objective. Instead of a direct optimization of Eq.(3), we propose two modifications that we found to improve our model’s practical performance: (1) Similar to observations made by Greff et al. (2019), we found that color is an important factor for high-quality segmentations. In the absence of such information, we mitigate the arising ambiguity by maximizing the entropy of the masks $m_{t,r,i}^{(k)}$ along the slot dimension $k$, i.e., we train our model by maximizing the objective

$$\mathcal{L} = \mathcal{L}_{\mathrm{ELBO}}(x) - \gamma \sum_{t=1}^{T} \sum_{r=1}^{\hat{R}} \sum_{i=1}^{D} \sum_{k=1}^{K} m_{t,r,i}^{(k)} \log m_{t,r,i}^{(k)}, \tag{7}$$

where $\gamma$ defines the weight of the entropy loss. (2) In addition to the entropy loss, we also prioritize later refinement steps by weighting the terms in the inner sum of Eq.(3) with a factor that grows with the refinement index $r$.
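The entropy of the masks along the slot dimension is straightforward to compute per pixel. The following sketch uses plain Python lists and hypothetical function names; it only evaluates the quantity and makes no claim about how the authors weight or aggregate it.

```python
import math

def slot_entropy(mask_column):
    """Entropy of one pixel's assignment probabilities over the K slots."""
    # Terms with zero probability contribute nothing (limit of m * log m).
    return -sum(m * math.log(m) for m in mask_column if m > 0.0)

def mean_mask_entropy(masks):
    """Average per-pixel entropy; masks[k][i] is slot k's mask at pixel i."""
    num_pixels = len(masks[0])
    per_pixel = [slot_entropy([slot[i] for slot in masks])
                 for i in range(num_pixels)]
    return sum(per_pixel) / num_pixels

# Two slots, two pixels: the first pixel is undecided, the second is certain.
masks = [[0.5, 1.0],
         [0.5, 0.0]]
undecided = slot_entropy([0.5, 0.5])
certain = slot_entropy([1.0, 0.0])
average = mean_mask_entropy(masks)
```

An undecided pixel attains the maximum value log K, while a pixel assigned entirely to one slot has zero entropy.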

Prediction. On top of pure video decomposition, our model is also able to simulate future frames $x_{T+1}, x_{T+2}, \dots$. Because our model requires image data $x_t$ as input, which is not available during simulation of new frames, we use the reconstructed image $\tilde{x}_t$ in place of $x_t$ to compute the likelihood in these cases. We also set the three gradient inputs of the refinement network to zero.

Complexity. Our model’s ability to reuse information from previous refinements leads to a runtime complexity of $O(R^2 + T)$, which is much more efficient than the $O(RT)$ complexity of the traditional IODINE model Greff et al. (2019) (when each frame is inferred independently) in the typical case of $R \ll T$.
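To make the saving concrete, the sketch below counts refinement cells under two schedules. It assumes a triangular schedule in which frame t (counted from 1) receives max(R - t, 1) refinement steps, which is our reading of the efficient grid, and compares it with a static model that spends R steps on every frame. The function names and the example sizes are hypothetical.

```python
def refinement_budget(t, max_refinements):
    """Assumed triangular schedule: fewer refinements for later frames."""
    return max(max_refinements - t, 1)

def grid_cells(num_frames, max_refinements):
    """Total refinement cells when hidden states are reused across frames."""
    return sum(refinement_budget(t, max_refinements)
               for t in range(1, num_frames + 1))

def static_cells(num_frames, max_refinements):
    """Total refinement cells when every frame is inferred independently."""
    return num_frames * max_refinements

# Example sizes (assumed): 40 frames and at most 5 refinement steps.
dynamic = grid_cells(num_frames=40, max_refinements=5)
static = static_cells(num_frames=40, max_refinements=5)
```

Under this schedule the count grows quadratically in the refinement budget for the first few frames and then only linearly in the number of frames, whereas the static count grows with their product.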

4 Experiments

We validate our model on Bouncing Balls Van Steenkiste et al. (2018) and an augmented version of CLEVRER Yi et al. (2020). Our experiments comprise quantitative studies of decomposition quality during generation and prediction as well as an ablation study. We complement these results with visual illustrations. A wider range of visualizations and experimental details can be found in the supplemental material.

4.1 Setup

Datasets. Bouncing Balls consists of binary video sequences. Each video shows simulated balls with different masses bouncing elastically off each other and the image border. We train our model on the first 40 frames of 50K videos containing 4 balls in each frame. We use two different test sets consisting of 10K videos with 4 balls and 10K videos with 6-8 balls. We also validate our model on a color version of this dataset that we generate using the segmentation masks.

CLEVRER contains synthetic videos of moving and colliding objects. Each video is 5 seconds long (128 frames) at a resolution of 480 × 320, which we trim and rescale (see supp. mat.). For training, we use the same 10K videos as in the original source. For testing, we compute ground truth masks for the validation set using the provided annotations and test on 2.5K instances containing 3-5 objects and on 1.1K instances containing 6 objects. We set the number of slots to 6 for the CLEVRER training set and to one more than the maximum number of objects in all other cases.

Baselines. We compare our approach to two recent baselines: R-NEM Van Steenkiste et al. (2018) and IODINE Greff et al. (2019). R-NEM is a state-of-the-art model for unsupervised video decomposition and physics learning. While showing impressive results on simulation tasks, it is limited to binary data and has difficulties with perspective scenes. IODINE is more expressive but static in nature and cannot capture temporal dynamics within its probabilistic framework. However, as noted in Greff et al. (2019), it can be readily applied to temporal sequences by feeding a new video frame to each iteration of the LSTM in the refinement network. We call this variant SEQ-IODINE and compare to it as well.

4.2 Evaluation Metrics

ARI. The Adjusted Rand Index Rand (1971); Hubert and Arabie (1985) is a measure of clustering similarity. It is computed by counting all pairs of samples that are assigned to the same or different clusters in the predicted and true clusterings. It ranges from -1 to 1, with a score of 0 indicating a random clustering and 1 indicating a perfect match. We treat each pixel as one sample and its segmentation as the cluster assignment.

F-ARI. The Foreground Adjusted Rand Index is a modification of the ARI score ignoring background pixels, which often occupy the majority of the image. We argue that both metrics are necessary to assess the segmentation quality of a video decomposition method; this metric is also used in Greff et al. (2019); Van Steenkiste et al. (2018).
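Both scores can be computed from pair counts over a contingency table. The sketch below is a minimal pure-Python version that treats flattened per-pixel label lists as clusterings; the function names and the convention that label 0 marks the background are assumptions of this example, not details from the paper.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(true_labels, pred_labels):
    """Adjusted Rand Index between two labelings of the same samples."""
    n = len(true_labels)
    if n < 2:
        return 1.0  # no pairs to disagree on
    # Number of sample pairs grouped together in both, in truth, in prediction.
    pairs_both = sum(comb(c, 2)
                     for c in Counter(zip(true_labels, pred_labels)).values())
    pairs_true = sum(comb(c, 2) for c in Counter(true_labels).values())
    pairs_pred = sum(comb(c, 2) for c in Counter(pred_labels).values())
    expected = pairs_true * pairs_pred / comb(n, 2)
    maximum = 0.5 * (pairs_true + pairs_pred)
    if maximum == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (pairs_both - expected) / (maximum - expected)

def foreground_ari(true_labels, pred_labels, background=0):
    """ARI restricted to pixels whose ground-truth label is not background."""
    keep = [i for i, label in enumerate(true_labels) if label != background]
    return adjusted_rand_index([true_labels[i] for i in keep],
                               [pred_labels[i] for i in keep])

# Six pixels: two background pixels (label 0) and two objects.
truth = [0, 0, 1, 1, 2, 2]
# The prediction splits the background across two slots but gets objects right.
prediction = [5, 6, 7, 7, 8, 8]
ari = adjusted_rand_index(truth, prediction)
f_ari = foreground_ari(truth, prediction)
```

In this example the foreground score is perfect while the full score is penalized for the split background, which illustrates why the two metrics can diverge as strongly as they do in the result tables.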

MSE. The mean squared error between pixels of the reconstructed and the ground truth frames.

4.3 Video Decomposition

We evaluate the models on a video decomposition task at different sequence lengths. As shown in Table 1, our model outperforms the baselines regardless of the presence of color information, which further reduces the error. We are at least 7% better than R-NEM on all metrics and at least 20% better than IODINE on ARI and MSE. Since R-NEM cannot cope well with colored data or perspective scenes, it is only evaluated on the binary Bouncing Balls dataset, where it produces high-error results in the first frames, a phenomenon not affecting our model. For both datasets, IODINE’s results are computed independently on each frame of the longest sequence. Because IODINE processes frames separately, it does not keep the same object-slot assignment across frames; we chose to ignore this when computing the scores.

Bouncing Balls (binary)
Model       | ARI (↑)                             | F-ARI (↑)                           | MSE (↓)
Length      | 10       20       30       40       | 10       20       30       40       | 10       20       30       40
R-NEM       | 0.5031   0.6199   0.6632   0.6833   | 0.6259   0.7325   0.7708   0.7899   | 0.0252   0.0138   0.0096   0.0076
IODINE      | -        -        -        0.0318   | -        -        -        0.9986   | -        -        -        0.0018
SEQ-IODINE  | 0.0230   0.0223   0.0021   -0.0201  | 0.8645   0.6028   0.5444   0.4063   | 0.0385   0.0782   0.0846   0.0968
Ours        | 0.7169   0.7263   0.7286   0.7294   | 0.9999   0.9999   0.9999   0.9999   | 0.0004   0.0004   0.0004   0.0004

Bouncing Balls (color)
Model       | ARI (↑)                             | F-ARI (↑)                           | MSE (↓)
Length      | 10       20       30       40       | 10       20       30       40       | 10       20       30       40
IODINE      | -        -        -        0.5841   | -        -        -        0.9752   | -        -        -        0.0014
SEQ-IODINE  | 0.3789   0.3743   0.3225   0.2654   | 0.7517   0.8159   0.7537   0.6734   | 0.0160   0.0164   0.0217   0.0270
Ours        | 0.7275   0.7291   0.7298   0.7301   | 1.0000   1.0000   0.9999   0.9999   | 0.0002   0.0002   0.0002   0.0002

CLEVRER
Model       | ARI (↑)                             | F-ARI (↑)                           | MSE (↓)
Length      | 10       20       30       40       | 10       20       30       40       | 10       20       30       40
IODINE      | -        -        -        0.1791   | -        -        -        0.9316   | -        -        -        0.0004
SEQ-IODINE  | 0.1171   0.1378   0.1558   0.1684   | 0.8520   0.8774   0.8780   0.8759   | 0.0009   0.0009   0.0010   0.0010
Ours        | 0.2220   0.2403   0.2555   0.2681   | 0.9182   0.9258   0.9309   0.9312   | 0.0003   0.0003   0.0003   0.0003

IODINE is evaluated frame by frame on the longest sequence, so a single value per metric is reported (shown in the length-40 column).

Table 1: Quantitative Evaluation (Scene Decomposition). We show our model’s ability to produce high-quality instance segmentations for sequences with varying length. We test on sequences with 4 balls and two different types of data (binary, colored) for Bouncing Balls and on sequences with 3-5 objects for CLEVRER. Note that R-NEM does not cope with color data; hence we only run it on binary data.

Figure 3: Qualitative Evaluation (Bouncing Balls). Our model can generalize to sequences with 8 balls when trained on 4 balls. Top-to-bottom: output masks, reconstructions, and ground truth video.

4.4 Generalization

We investigated how well our model adapts to a higher number of objects, evaluating its performance on the Bouncing Balls dataset (6 to 8 objects) and on the CLEVRER dataset (6 objects). Table 2 shows that our F-ARI and MSE scores are at least 50% better than those for R-NEM, while our ARI score is only marginally worse, and only on the binary data. In comparison to IODINE, we are at least 4% better across all metrics. For the Bouncing Balls dataset, we have also investigated the impact of changing the total number of possible colors to 4 and 8, the former resulting in duplicate colors for different objects and the latter in unique colors for each object. The higher MSE score for the 8 variant is due to the model not being able to reconstruct the unseen colors. Sample qualitative results are shown in Fig. 3 and 4, while more can be found in the supplementary material.

Figure 4: Qualitative Evaluation (CLEVRER). Our model can generalize to sequences with 6 objects. Furthermore, we demonstrate the ability to handle a dynamically changing number of objects, ranging from 4 in the beginning to 6 at the end.
Bouncing Balls (binary)
Model            | ARI (↑)  | F-ARI (↑) | MSE (↓)
R-NEM            | 0.4484   | 0.6377    | 0.0328
IODINE           | 0.0271   | 0.9969    | 0.0040
SEQ-IODINE       | 0.0263   | 0.8874    | 0.0521
Ours             | 0.4453   | 0.9999    | 0.0008

Bouncing Balls (color; number of possible colors in parentheses)
Model            | ARI (↑)  | F-ARI (↑) | MSE (↓)
IODINE (4)       | 0.4136   | 0.8211    | 0.0138
IODINE (8)       | 0.2823   | 0.7197    | 0.0281
SEQ-IODINE (4)   | 0.2068   | 0.5854    | 0.0338
SEQ-IODINE (8)   | 0.1571   | 0.5231    | 0.0433
Ours (4)         | 0.4275   | 0.9998    | 0.0004
Ours (8)         | 0.4317   | 0.9900    | 0.0114

CLEVRER
Model            | ARI (↑)  | F-ARI (↑) | MSE (↓)
IODINE           | 0.2205   | 0.9305    | 0.0006
SEQ-IODINE       | 0.1482   | 0.8645    | 0.0012
Ours             | 0.2839   | 0.9355    | 0.0004

Table 2: Generalization. At test time, we change the number of slots in the models from 5 to 9 for the Bouncing Balls test dataset (6-8 balls), and from 6 to 7 for the CLEVRER test dataset (6 objects).

Base | Grid | CP+G | Length | Entropy | ARI (↑)  | F-ARI (↑) | MSE (↓)
✓    |      |      | 20     |         | 0.0126   | 0.7765    | 0.0340
✓    | ✓    | ✓    | 20     |         | 0.2994   | 0.9999    | 0.0010
✓    | ✓    | ✓    | 40     |         | 0.3528   | 0.9998    | 0.0010
✓    | ✓    | ✓    | 40     | ✓       | 0.7263   | 0.9999    | 0.0004

[Base: base model using 2D-LSTM; Grid: efficient triangular grid structure (Fig. 2(a)); CP+G: conditional prior and generation; Length: sequence length; Entropy: entropy term (Eq.(7))]

Table 3: Ablation Study. A 2D-LSTM extension of IODINE trained on sequences of 20 frames is unstable, and its output segmentation lacks precision and consistency. Our efficient version of the 2D-LSTM grid (Fig. 2(a)) and the conditional prior and generation increase both segmentation and reconstruction quality. Training this model on longer sequences of 40 frames, we observe further improvement of the scores. Our full model, including the entropy loss term (Eq.(7)), leads to the highest scores.

4.5 Prediction

We compare the predictions of our model (Section 3.2.3) to those of R-NEM after 20 steps of inference on 10 predicted steps on the Bouncing Balls dataset (Fig. 5a). As the results show, our model is superior to R-NEM on shorter sequences; on longer sequences, however, we outperform R-NEM only on colored data. We also show the prediction errors on the CLEVRER dataset in Fig. 5b, which slowly decreases over time as expected.

4.6 Ablation

The quantitative results for the ablation study on the binary Bouncing Balls dataset are shown in Table 3. We investigate the effects of the efficient grid, the conditional prior and generation, the length of the training sequences, and the entropy term on the performance of our model; all of them prove necessary and important.

Figure 5: Prediction. (a) Bouncing Balls; (b) CLEVRER. We plot the prediction errors for 3, 5, 7 and 10 frames after 20 inference steps.

5 Conclusion

We presented a novel unsupervised learning framework capable of precise scene decomposition and dynamics modeling in multi-object videos with complex appearance and motion. The proposed approach leverages temporal consistency between latent random variables expressed through a variational energy, resulting in a robust and efficient inference model. This leads to state-of-the-art results in decomposition, segmentation, and prediction tasks on several datasets, one of which was collected by us. Notably, our model generalizes well to more populous scenes and, due to the entropy term, has improved stability in scenes with missing color information.

This work was funded, in part, by the Vector Institute for AI, Canada CIFAR AI Chair, NSERC CRC and an NSERC DG and Discovery Accelerator Grants.


  • M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, and N. de Freitas (2016) Learning to learn by gradient descent by gradient descent. In NIPS, Cited by: §3.1.
  • C. P. Burgess, L. Matthey, N. Watters, R. Kabra, I. Higgins, M. Botvinick, and A. Lerchner (2019) Monet: unsupervised scene decomposition and representation. arXiv preprint arXiv:1901.11390. Cited by: §2.
  • A. Casanova, P. O. Pinheiro, N. Rostamzadeh, and C. J. Pal (2020) Reinforced active learning for image segmentation. In ICLR, Cited by: §2.
  • K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using rnn encoder-decoder for statistical machine translation. In EMNLP, Cited by: §3.2.2.
  • J. Chung, S. Ahn, and Y. Bengio (2017) Hierarchical multiscale recurrent neural networks. In ICLR, Cited by: §3.2.2.
  • J. Chung, K. Kastner, L. Dinh, K. Goel, A. C. Courville, and Y. Bengio (2015) A recurrent latent variable model for sequential data. In NIPS, pp. 2980–2988. Cited by: §1, §3.2.2.
  • E. Crawford and J. Pineau (2019a) Exploiting spatial invariance for scalable unsupervised object tracking. In AAAI, Cited by: §2.
  • E. Crawford and J. Pineau (2019b) Spatially invariant unsupervised object detection with convolutional neural networks. In AAAI, pp. 3412–3420. Cited by: §1, §2.
  • C. Cremer, X. Li, and D. Duvenaud (2017) Inference suboptimality in variational autoencoders. In NIPS Workshop on Advances in Approximate Bayesian Inference, Cited by: §3.1.
  • M. Engelcke, A. R. Kosiorek, O. P. Jones, and I. Posner (2020) Genesis: generative scene inference and sampling with object-centric latent representations. In ICLR, Cited by: §F, §2.
  • S. A. Eslami, N. Heess, T. Weber, Y. Tassa, D. Szepesvari, G. E. Hinton, et al. (2016) Attend, infer, repeat: fast scene understanding with generative models. In NIPS, pp. 3225–3233. Cited by: §1, §2.
  • P. F. Felzenszwalb and D. P. Huttenlocher (2004) Efficient graph-based image segmentation. International journal of computer vision 59 (2), pp. 167–181. Cited by: §2.
  • R. Girshick (2015) Fast r-cnn. In ICCV, pp. 1440–1448. Cited by: §2.
  • V. Goel, J. Weng, and P. Poupart (2018) Unsupervised video object segmentation for deep reinforcement learning. In NeurIPS, pp. 5683–5694. Cited by: §2.
  • A. Graves, S. Fernández, and J. Schmidhuber (2005) Bidirectional lstm networks for improved phoneme classification and recognition. In Artificial Neural Networks: Formal Models and Their Applications – ICANN 2005, W. Duch, J. Kacprzyk, E. Oja, and S. Zadrożny (Eds.), Berlin, Heidelberg, pp. 799–804. External Links: ISBN 978-3-540-28756-8 Cited by: §3.2.2.
  • A. Graves, S. Fernández, and J. Schmidhuber (2007) Multi-dimensional recurrent neural networks. In International conference on artificial neural networks, pp. 549–558. Cited by: Figure 2, §3.2.2, §3.2.
  • K. Greff, R. L. Kaufmann, R. Kabra, N. Watters, C. Burgess, D. Zoran, L. Matthey, M. Botvinick, and A. Lerchner (2019) Multi-object representation learning with iterative variational inference. arXiv preprint arXiv:1903.00450. Cited by: §H, §1, §1, §1, §2, Figure 2, §3.1, §3.1, §3.2.2, §3.2.3, §3.2.3, §3, §4.1, §4.2.
  • K. Greff, A. Rasmus, M. Berglund, T. Hao, H. Valpola, and J. Schmidhuber (2016) Tagger: deep unsupervised perceptual grouping. In NIPS, pp. 4484–4492. Cited by: §2.
  • K. Greff, S. Van Steenkiste, and J. Schmidhuber (2017) Neural expectation maximization. In NIPS, pp. 6691–6701. Cited by: §1, §2.
  • K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In ICCV, pp. 2961–2969. Cited by: §2.
  • I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner (2017) Beta-vae: learning basic visual concepts with a constrained variational framework. In ICLR, Cited by: §3.2.1.
  • G.E. Hinton and T.J. Sejnowski (1986) Learning and relearning in boltzmann machines. Parallel Distributed Processing: Explorations in the Microstructure of Cognition 1, pp. 282–317. Cited by: §1.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §3.2.2.
  • L. Hubert and P. Arabie (1985) Comparing partitions. Journal of classification 2 (1), pp. 193–218. Cited by: §4.2.
  • A. Jepson, D. Fleet, and M. Black (2002) A layered motion representation with occlusion and compact spatial support. In ECCV, pp. 692–706. Cited by: §1.
  • J. Jiang, S. Janghorbani, G. De Melo, and S. Ahn (2020) SCALOR: generative world models with scalable object representations. In ICLR, Cited by: §2.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §D.
  • D. P. Kingma and M. Welling (2014) Auto-encoding variational bayes. In ICLR, Cited by: §1.
  • A. Kosiorek, H. Kim, Y. W. Teh, and I. Posner (2018) Sequential attend, infer, repeat: generative modelling of moving objects. In NeurIPS, pp. 8606–8616. Cited by: §F, §2.
  • J. Kossen, K. Stelzner, M. Hussing, C. Voelcker, and K. Kersting (2020) Structured object-aware physics prediction for video modeling and planning. In ICLR, Cited by: §2.
  • J. Lu, D. Batra, D. Parikh, and S. Lee (2019) ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NeurIPS, Cited by: §1.
  • J. Marino, Y. Yue, and S. Mandt (2018) Iterative amortized inference. In ICML, Cited by: §1, §3.1, §3.2.2, §3.
  • D. Marr (1970) A theory for cerebral neocortex. Proceedings of the Royal Society of London Series B (176), pp. 161–234. Cited by: §1.
  • M. Noroozi and P. Favaro (2016) Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, Cited by: §1.
  • D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros (2016) Context encoders: feature learning by inpainting. In CVPR, Cited by: §1.
  • I. Prémont-Schwarz, A. Ilin, T. Hao, A. Rasmus, R. Boney, and H. Valpola (2017) Recurrent ladder networks. In NIPS, pp. 6009–6019. Cited by: §2.
  • W. M. Rand (1971) Objective criteria for the evaluation of clustering methods. Journal of the American Statistical association 66 (336), pp. 846–850. Cited by: §4.2.
  • J. Redmon and A. Farhadi (2018) Yolov3: an incremental improvement. arXiv preprint arXiv:1804.02767. Cited by: §2.
  • A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, and T. Lillicrap (2017) A simple neural network module for relational reasoning. In NIPS, pp. 4967–4976. Cited by: §F.
  • J. Shi and J. Malik (2000) Normalized cuts and image segmentation. IEEE Transactions on pattern analysis and machine intelligence 22 (8), pp. 888–905. Cited by: §2.
  • C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid (2019) VideoBERT: a joint model for video and language representation learning. In ICCV, Cited by: §1.
  • H. Tan and M. Bansal (2019) Lxmert: learning cross-modality encoder representations from transformers. In Conference on Empirical Methods in Natural Language Processing, Cited by: §1.
  • S. Van Steenkiste, M. Chang, K. Greff, and J. Schmidhuber (2018) Relational neural expectation maximization: unsupervised discovery of objects and their interactions. In ICLR, Cited by: §A.1, §B, §C, §1, §1, §2, §4.1, §4.2, §4.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NIPS, pp. 5998–6008. Cited by: §1, §3.2.2.
  • P. Vincent, H. Larochelle, Y. Bengio, and P. A. Manzagol (2008) Extracting and composing robust features with denoising autoencoders. In ICML, Cited by: §1.
  • C. Vondrick, H. Pirsiavash, and A. Torralba (2016) Anticipating visual representations with unlabeled videos. In CVPR, Cited by: §1.
  • N. Watters, L. Matthey, C. P. Burgess, and A. Lerchner (2019) Spatial broadcast decoder: a simple architecture for learning disentangled representations in vaes. arXiv preprint arXiv:1901.07017. Cited by: §3.1.
  • Y. Weiss and E. Adelson (1996) A unified mixture framework for motion segmentation: incorporating spatial coherence and estimating the number of models. In CVPR, pp. 321–326. Cited by: §1.
  • K. Yi, C. Gan, Y. Li, P. Kohli, J. Wu, A. Torralba, and J. B. Tenenbaum (2020) Clevrer: collision events for video representation and reasoning. In ICLR, Cited by: §B, §4.
  • R. Zhang, P. Isola, and A. A. Efros (2016) Colorful image colorization. In ECCV, Cited by: §1.

Supplemental Material

A Baselines

A.1 R-NEM

We use the R-NEM Van Steenkiste et al. (2018) authors’ original implementation and their publicly available models: https://github.com/sjoerdvansteenkiste/Relational-NEM.

A.2 IODINE

Our IODINE experiments are based on the following PyTorch implementation: https://github.com/MichaelKevinKelly/IODINE. We use the same parameters as in this code, with the exceptions of the weight factor and, for the Bouncing Balls experiments, the number of refinement steps. The majority of the hyperparameters shared between our own model and IODINE are identical.

A.3 SEQ-IODINE

In order to test the sequential version of IODINE, we use the regularly trained IODINE model but change the number of refinement steps to the number of video frames during testing. During each refinement step, instead of computing the error between the reconstructed image and the ground truth image, we use the next video frame. Since the IODINE model was trained with fewer refinement steps, extending the number of refinement steps to the video length leads to exploding gradients. This effect is especially problematic in the binary Bouncing Balls dataset with 20, 30 and 40 frames per video, because the scores of the static model are already low. We deal with this issue by clamping the gradients and the refinement value to a fixed range (clamping was applied only to binary Bouncing Balls with 20, 30 and 40 frames). SEQ-IODINE’s weak performance, especially w.r.t. the ARI, reflects the gradual divergence from the optimum as the number of frames increases.
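The clamped, frame-by-frame refinement described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the experiments: the bound `CLAMP` and the names `clamp`, `refine_over_video` and `squared_error_grad` are hypothetical, a plain gradient step stands in for IODINE's learned refinement network, and a scalar stands in for the posterior parameters.

```python
from typing import Callable, List, Sequence

CLAMP = 1.0  # hypothetical bound; the actual clamping range is not given here


def clamp(x: float, lo: float = -CLAMP, hi: float = CLAMP) -> float:
    """Restrict x to the interval [lo, hi]."""
    return max(lo, min(hi, x))


def refine_over_video(
    frames: Sequence[float],
    grad_fn: Callable[[float, float], float],
    lam: float = 0.0,
    step: float = 0.5,
) -> List[float]:
    """Run one refinement step per frame, each against the next frame in turn.

    grad_fn(lam, target) returns the derivative of the error w.r.t. lam.
    Both the gradient and the resulting update are clamped.
    """
    history = []
    for target in frames:
        grad = clamp(grad_fn(lam, target))  # clamp the gradient
        delta = clamp(-step * grad)         # clamp the refinement value
        lam += delta
        history.append(lam)
    return history


def squared_error_grad(lam: float, target: float) -> float:
    """Gradient of (lam - target)^2, a toy stand-in for the reconstruction error."""
    return 2.0 * (lam - target)


trajectory = refine_over_video([10.0, 10.0, 10.0], squared_error_grad)
```

Without the two `clamp` calls, the first update in this toy example would overshoot the target by the full gradient magnitude; with them, each step moves the estimate by at most `CLAMP`.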

B Datasets

Bouncing Balls. Bouncing Balls is a dataset provided by the authors of R-NEM Van Steenkiste et al. (2018). We use the train and test splits of this dataset in two different versions: binary and color. For the color version, we randomly choose 4 colors for the 4-balls (sub-)dataset. For the 6-8 balls test data, we color them in 2 different ways: 4 colors (same as train) and 8 colors (4 from train, 4 new ones). Note that the former results in identical colors for multiple objects, while the latter guarantees unique colors for each object.

CLEVRER. The version of the CLEVRER dataset Yi et al. (2020) used in this work was processed as follows:

  • Train split, validation split and validation annotations were obtained from the official website: http://clevrer.csail.mit.edu/. We use the validation set as test set, because the test set does not contain annotations.

  • For training, we use the original train split. Our minimal preprocessing consists of cropping the frames along the width axis by 40 pixels on both sides, followed by a uniform downscaling to 64x64 pixels. Since the length of each video is 128 frames and the maximum number of frames during training was 40, we split the videos into multiple sequences to obtain a larger number of training samples.

  • For testing, we trim the videos to a subsequence containing at least 3 objects and object motion. We compute these subsequences by running the script (slice_videos_from_annotations.py in the attached code) from the folder with the validation split and validation annotations.

  • The test set ground truth masks can be downloaded from here. The masks and the preprocessed test videos will be grouped into separate folders based on the number of objects in a video.
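The training-time preprocessing listed above (cropping, downscaling, splitting into shorter sequences) could look roughly like the sketch below. It is an illustration only: the function names are hypothetical, nearest-neighbour sampling is an assumption made to avoid extra dependencies (the interpolation method is not stated), and splitting into consecutive non-overlapping chunks is likewise assumed.

```python
import numpy as np

CROP = 40      # pixels removed from each side along the width axis
OUT_SIZE = 64  # target height and width after downscaling
MAX_LEN = 40   # maximum number of frames per training sequence


def preprocess_video(video: np.ndarray) -> np.ndarray:
    """Crop the width by CROP on both sides, then resample frames to 64x64.

    video has shape (T, H, W, C). Nearest-neighbour sampling is an assumption.
    """
    cropped = video[:, :, CROP:-CROP, :]
    _, h, w, _ = cropped.shape
    rows = np.arange(OUT_SIZE) * h // OUT_SIZE  # source row for each output row
    cols = np.arange(OUT_SIZE) * w // OUT_SIZE  # source column for each output column
    return cropped[:, rows[:, None], cols[None, :], :]


def split_into_sequences(video: np.ndarray, max_len: int = MAX_LEN) -> list:
    """Split a video into consecutive, non-overlapping chunks of at most max_len frames."""
    return [video[i:i + max_len] for i in range(0, len(video), max_len)]


# Dummy 128-frame video; the spatial size here is arbitrary.
dummy = np.zeros((128, 80, 160, 3), dtype=np.uint8)
clips = split_into_sequences(preprocess_video(dummy))
```

With a 128-frame input and a maximum length of 40, this yields three 40-frame clips and one 8-frame remainder; whether such remainders were kept or discarded in the original pipeline is not specified.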

C Hyperparameters

Initialization. We initialize the parameters of the posterior by sampling from . In all experiments, we use a latent dimensionality , such that . Horizontal and vertical hidden states and cell states are of size 128, initialized with zeros. The variance of the likelihood is set to in all experiments.

Experiments on Bouncing Balls. For this experiment, we explored several values for the number of refinement steps and selected the one that was empirically optimal in terms of accuracy and efficiency. Refining the posterior more than 6 times does not lead to any substantial improvement, while time and memory consumption increase significantly. For the 4-balls dataset, we use slots for train and test. For our tests on 6-8 balls, we use slots. This protocol is identical to the one used in R-NEM Van Steenkiste et al. (2018). Furthermore, we set and scale the KL term by . The weight of the entropy term is set to in the binary case. As expected, the effect of the entropy term is most pronounced with binary data, so we set in all experiments with RGB data.

Experiments on CLEVRER. We keep the default number of iterative refinements at , because we did not observe any substantial improvements from a further increase. We use slots during training, slots when testing on 3-5 objects and slots when testing on 6 objects.

D Training

We use ADAM Kingma and Ba (2014) for all experiments, with a learning rate of 0.0003 and default values for all remaining parameters. During training, we gradually increase the number of frames per video, as we have found this to make the optimisation more stable. We start with sequences of length 4 and train the model until we observe a stagnant loss or posterior collapse. At the beginning of training, the batch size is 32; it is then gradually decreased in inverse proportion to the number of frames in the video.
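One way to read the batch-size rule is sketched below. The starting values (batch size 32, sequence length 4) come from the text; the exact rule (keeping the product of batch size and sequence length roughly constant) and the intermediate sequence lengths are assumptions made for illustration, and `batch_size_for` is a hypothetical helper.

```python
BASE_BATCH = 32  # batch size at the start of training (from the text)
BASE_FRAMES = 4  # initial sequence length (from the text)


def batch_size_for(num_frames: int) -> int:
    """Batch size inversely proportional to the sequence length.

    Assumption: batch_size * num_frames stays roughly constant, so the
    number of frames processed per batch does not grow with the curriculum.
    """
    return max(1, BASE_BATCH * BASE_FRAMES // num_frames)


# Hypothetical curriculum; only the start (4 frames) and the maximum
# training length (40 frames) are stated in the text.
schedule = [(n, batch_size_for(n)) for n in (4, 8, 16, 32, 40)]
```

Under this reading, doubling the sequence length halves the batch size, which keeps per-step memory use approximately level as the videos get longer.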

E Infrastructure and Runtime

We train our models on 8 GeForce GTX 1080 Ti GPUs, which takes approximately one day per model.

F Discussion and Future work

Introduction of a temporal component not only enables modelling of dynamics inside the amortized iterative inference framework but also improves the quality of the results overall. From our quantitative and qualitative comparisons with IODINE and SEQ-IODINE, we see that our model shows more accurate results on the decomposition task. We can detect new objects faster and are less sensitive to color, because our model can leverage the objects’ motion cues. The ability to work with complex colored data, a property inherited from IODINE, means that we significantly outperform R-NEM. However, R-NEM is a stronger model when it comes to prediction of longer sequences, owing to its ability to model the relations between the objects in the scene. Similar ideas were used in SQAIR Kosiorek et al. (2018) and GENESIS Engelcke et al. (2020) by adding a relational RNN Santoro et al. (2017). Integration of these concepts into our framework is a promising direction for future research. Another possible route is an application of our model to complex real-world scenarios. However, given that such datasets typically contain a much higher number of objects, as well as intricate interactions and spatially varying materials, we consider the resulting scalability questions as a separate line of research.

G Additional Qualitative Results

Figure 6: Video decomposition using our model applied to the Bouncing Balls dataset with 4 balls.
Figure 7: Video decomposition using our model applied to the Bouncing Balls dataset with 6-8 balls.
Figure 8: Prediction on Bouncing Balls (colored) dataset.
Figure 9: Prediction on CLEVRER dataset.
Figure 10: Qualitative results for the Ours vs. IODINE vs. SEQ-IODINE decomposition experiment. (a) Our model detects new objects entering the frame much sooner, while SEQ-IODINE struggles to properly reconstruct and decompose them; IODINE lacks temporal consistency and reshuffles the slot order. (b) Our model is much more stable over time and does not fail to detect objects, unlike IODINE and SEQ-IODINE.

H Disentanglement

We demonstrate that introducing a new temporal hidden state and an additional MLP in front of the spatial broadcast decoder has not impacted the model's ability to separate each object’s representation and to disentangle it by color, position, size and other features, similar to the results shown in Greff et al. (2019).

Figure 11: Disentanglement of the latent representations corresponding to distinct interpretable features. CLEVRER latent walks along three different dimensions: color, size and position. We chose a random frame and, for each object’s representation in the scene, traversed the dimensions independently.