1 Introduction
Unsupervised representation learning, which has a long history dating back to Boltzmann machines Hinton and Sejnowski (1986) and the original works of Marr (1970), has recently emerged as one of the important directions of research, carrying the newfound promise of alleviating the need for excessively large and fully labeled datasets. More traditional representation learning approaches focus on unsupervised (e.g., autoencoder-based Pathak et al. (2016); Vincent et al. (2008)) or self-supervised Noroozi and Favaro (2016); Vondrick et al. (2016); Zhang et al. (2016) learning of holistic representations that, for example, are tasked with producing (spatial Noroozi and Favaro (2016), temporal Vondrick et al. (2016), or color Zhang et al. (2016)) encodings of images or patches. The latest and most successful methods along these lines include ViLBERT Lu et al. (2019) and others Sun et al. (2019); Tan and Bansal (2019) that utilize powerful transformer architectures Vaswani et al. (2017) coupled with proxy multimodal tasks (e.g., masked token prediction or visual-lingual alignment). Learning of good disentangled, spatially granular representations that are, for example, able to decouple object appearance and shape in complex visual scenes consisting of multiple moving objects remains elusive.
Recent works that attempt to address this challenge can be characterized as: (i) attention-based methods Crawford and Pineau (2019b); Eslami et al. (2016), which infer latent representations for each object in a scene, and (ii) iterative refinement models Greff et al. (2019, 2017), which decompose a scene into a collection of components by grouping pixels. Importantly, the former have been limited to latent representations at the object or image-patch level, while the latter class of models has illustrated the ability to form more granular latent representations at the pixel (segmentation) level. Specifically, most refinement models learn pixel-level generative models driven by spatial mixtures Greff et al. (2017) and utilize amortized iterative refinements Marino et al. (2018) for inference of disentangled latent representations within the VAE framework Kingma and Welling (2014); a prime example is IODINE Greff et al. (2019). However, while providing a powerful model and abstraction able to segment and disentangle complex scenes, IODINE Greff et al.
(2019) and other similar architectures are fundamentally limited by the fact that they only consider images. Even when applied for inference in video, they process one frame at a time. This makes it excessively challenging to discover and represent individual instances of objects that may share properties such as appearance and shape but differ in dynamics.
In computer vision, it has been a long-held belief that motion carries important information for segmenting objects Jepson et al. (2002); Weiss and Adelson (1996). Armed with this intuition, we propose a spatio-temporal amortized inference model capable not only of unsupervised multi-object scene decomposition, but also of learning and leveraging the implicit probabilistic dynamics of each object from raw perspective video alone. This is achieved by introducing temporal dependencies between the latent variables across time. As such, IODINE Greff et al. (2019) can be considered a special (spatial) case of our spatio-temporal formulation. Modeling temporal dependencies among video frames also allows us to make use of conditional priors Chung et al. (2015) for variational inference, leading to more accurate and efficient inference results. The resulting model, illustrated in Fig. 1, achieves superior performance on complex multi-object benchmarks with respect to state-of-the-art models, including R-NEM Van Steenkiste et al. (2018) and IODINE Greff et al. (2019).

Contributions.
We propose a new spatio-temporal amortized inference model that is not only capable of multi-object video decomposition in an unsupervised manner but also learns and models the probabilistic dynamics of each object from complex raw video data by leveraging temporal dependencies between the latent random variables at each frame. To the best of our knowledge, this is the first spatio-temporal model of this kind. Our model has a number of appealing properties, including temporal extrapolation (prediction), computational efficiency, and the ability to work with complex data exhibiting non-linear dynamics, colors, and a changing number of objects within the same video sequence (e.g., due to objects exiting and entering the scene). In addition, we introduce an entropy prior to improve our model's performance in scenarios where object appearance alone is not sufficiently distinctive (e.g., greyscale data). Finally, we illustrate state-of-the-art performance on challenging multi-object benchmark datasets (Bouncing Balls and CLEVRER), outperforming R-NEM Van Steenkiste et al. (2018) and IODINE Greff et al. (2019) in terms of segmentation, prediction, and generalization.

2 Related work
Unsupervised Scene Representation Learning.
Unsupervised scene representation learning has a rich history. Generally, these methods can be divided into two groups: attention-based methods, which infer latent representations for each object in a scene, and more complex and powerful iterative refinement models, which often make use of spatial mixtures and can decompose a scene into a collection of precisely estimated components by grouping pixels together.
Attention-based methods, such as AIR Eslami et al. (2016) and SPAIR Crawford and Pineau (2019b), decompose scenes into latent variables representing the appearance, position, and size of the underlying objects. However, both methods can only infer the objects' bounding boxes (not segmentations) and have not been shown to work on non-trivial 3D scenes with perspective distortions and occlusions. MONet Burgess et al. (2019) is the first model in this family to tackle more complex data, inferring representations that can be used for instance segmentation of the objects and individual reconstructions. On the other hand, it is not a probabilistic generative model and is thus not suitable for density estimation. GENESIS Engelcke et al. (2020) extends MONet and alleviates some of its limitations by introducing a probabilistic framework and allowing for spatial relations between the objects. Iterative refinement models started with Tagger Greff et al. (2016), which explicitly reasons about the segmentation of its inputs and features. However, it does not allow explicit latent representations and cannot be scaled to larger and more complex images. NEM Greff et al. (2017), an extension of Tagger, uses a spatial mixture model inside an expectation-maximization framework but is limited to binary data. Finally, IODINE
Greff et al. (2019) is a notable example of a model employing iterative amortized inference w.r.t. a spatial mixture formulation and achieves state-of-the-art performance in scene decomposition and segmentation. Furthermore, it can cope with complex data, including occlusions, and uses an auxiliary component to separate the objects from the background.

Unsupervised Video Tracking and Object Detection. SQAIR Kosiorek et al. (2018), SILOT Crawford and Pineau (2019a), and SCALOR Jiang et al. (2020) are temporal extensions of the static attention-based models that are tailored to tracking and object detection tasks. SQAIR is restricted to binary data and operates at the level of bounding boxes. SILOT and SCALOR are more expressive and can cope with cluttered scenes, larger numbers of objects, and dynamic backgrounds, but do not work on colored perspective data^1; accurate segmentation remains a challenge. Finally, STOVE Kossen et al. (2020) focuses on physics-driven learning and simulation.

^1 Perspective videos are more complex, as objects can occlude one another and change in size over time.
Unsupervised Video Decomposition and Segmentation. Models employing spatial mixtures and iterative inference in a temporal setting are closest to our method from a technical perspective. Notably, only a few models fall into this line of work: RTagger Prémont-Schwarz et al. (2017) is a recurrent extension of Tagger but inherits the limitations of its predecessor. R-NEM Van Steenkiste et al. (2018) effectively learns the objects' dynamics and interactions through a relational module but is limited to orthographic binary data.
Non-representation Learning Methods. Orthogonal to unsupervised representation learning for instance segmentation and object detection are methods relying on fully labeled data, including Mask R-CNN He et al. (2017), YOLOv3 Redmon and Farhadi (2018), and Fast R-CNN Girshick (2015). Alternatively, hand-crafted features can be used, as demonstrated in Felzenszwalb and Huttenlocher (2004); Shi and Malik (2000). Unsupervised video segmentation also plays an important role in reinforcement learning: MOREL Goel et al. (2018) takes an optical-flow approach to segment the moving objects, while others use RL agents to infer segmentations Casanova et al. (2020).

3 Dynamic Video Decomposition
We now introduce our dynamic model for unsupervised video decomposition. Our approach builds upon a generative model of multi-object representations Greff et al. (2019) and leverages elements of iterative amortized inference Marino et al. (2018). We briefly review both concepts (Section 3.1) and then introduce our model (Section 3.2).
3.1 Background
Multi-Object Representations.
The multi-object framework introduced in Greff et al. (2019) decomposes a static image $\mathbf{x}$ into $K$ objects (including background). Each object is represented by a latent vector $\mathbf{z}_k$ capturing the object's unique appearance and can be thought of as an encoding of common visual properties, such as color, shape, position, and size. For each $\mathbf{z}_k$ independently, a broadcast decoder Watters et al. (2019) generates pixel-wise pairs $(m_{ik}, x_{ik})$ describing the assignment probability and appearance of pixel $i$ for object $k$. Together, they induce the generative image formation model

$$p(\mathbf{x} \mid \mathbf{z}) = \prod_i \sum_{k=1}^{K} m_{ik}\,\mathcal{N}\!\left(x_i;\, x_{ik}, \sigma^2\right), \qquad (1)$$

where $\sum_k m_{ik} = 1$, and $\sigma^2$ is the same and fixed for all $i$ and $k$. The original image pixels can be reconstructed from this probabilistic representation as $\hat{x}_i = \sum_k m_{ik}\, x_{ik}$.
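The image formation model of Eq. (1) is easy to state concretely. The following numpy sketch (the flattened pixel layout, array shapes, and the value of $\sigma$ are illustrative assumptions, not the authors' code) computes the mixture log-likelihood and the reconstruction $\hat{x}_i = \sum_k m_{ik} x_{ik}$:

```python
import numpy as np

def mixture_reconstruction(masks, appearances):
    """Reconstruct an image from K per-slot masks and appearances.

    masks:       (K, N) assignment probabilities m_ik, summing to 1 over K
    appearances: (K, N) per-slot pixel means x_ik
    returns:     (N,)  reconstruction  x_hat_i = sum_k m_ik * x_ik
    """
    return (masks * appearances).sum(axis=0)

def mixture_log_likelihood(image, masks, appearances, sigma=0.1):
    """Log-likelihood of Eq. (1): sum_i log sum_k m_ik N(x_i; x_ik, sigma^2)."""
    norm = 1.0 / np.sqrt(2.0 * np.pi * sigma**2)
    # Gaussian density of each pixel under each slot's appearance.
    dens = norm * np.exp(-((image[None, :] - appearances) ** 2) / (2.0 * sigma**2))
    return np.log((masks * dens).sum(axis=0)).sum()
```

With one-hot masks the mixture reduces to a hard segmentation; soft masks blend slot appearances per pixel.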
Iterative Amortized Inference.
Our approach leverages the iterative amortized inference framework Marino et al. (2018), which uses the learning to learn principle Andrychowicz et al. (2016) to close the amortization gap Cremer et al. (2017) typically observed in traditional variational inference. The need for such an iterative process arises due to the multimodality of Eq.(1), which results in an order invariance and assignment ambiguity in the approximate posterior that standard variational inference cannot overcome Greff et al. (2019).
The idea of amortized iterative inference is to start with randomly guessed parameters $\lambda^{(1)}$ for the approximate posterior $q_\lambda$ and update this initial estimate through a series of refinement steps. Each refinement step $r$ first samples a latent representation $\mathbf{z} \sim q_{\lambda^{(r)}}$ to evaluate the ELBO $\mathcal{L}$ and then uses the approximate posterior gradients $\nabla_\lambda \mathcal{L}$ to compute an additive update $f_\phi$, producing a new parameter estimate $\lambda^{(r+1)}$:

$$\lambda^{(r+1)} = \lambda^{(r)} + f_\phi\big(\nabla_\lambda \mathcal{L}, \mathbf{x}, \lambda^{(r)}, \cdot\big), \qquad (2)$$

where $f_\phi$ is a function of $\nabla_\lambda \mathcal{L}$, $\mathbf{x}$, $\lambda^{(r)}$, and additional inputs. The function $f_\phi$ consists of a sequence of convolutional layers and an LSTM. The memory unit takes as input a hidden state $h^{(r-1)}$ from the previous refinement step.
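Schematically, the refinement loop of Eq. (2) looks as follows. Here `f` stands in for the convolutional/LSTM refinement network $f_\phi$ and `elbo_grad` for the sample-and-backprop step; both are placeholders for illustration, not the paper's architecture:

```python
import numpy as np

def iterative_refinement(x, lam0, elbo_grad, f, num_steps=5):
    """Iterative amortized inference (Eq. (2)), schematically.

    lam:       parameters of the approximate posterior q_lambda
    elbo_grad: samples z ~ q_lambda, evaluates the ELBO, returns its gradient
    f:         refinement network computing an additive update (+ new memory)
    """
    lam = lam0
    hidden = np.zeros_like(lam)            # LSTM memory, carried across refinements
    for _ in range(num_steps):
        grad = elbo_grad(x, lam)           # approximate posterior gradient
        delta, hidden = f(grad, lam, hidden)
        lam = lam + delta                  # lambda^(r+1) = lambda^(r) + delta
    return lam
```

With a toy quadratic "ELBO" and a plain gradient step for `f`, the loop converges to the optimum, illustrating how successive refinements close the amortization gap.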
3.2 Spatio-Temporal Iterative Inference
Our proposed model builds upon the concepts introduced in the previous section and enables robust learning of dynamic scenes through spatio-temporal iterative inference. Specifically, we consider the task of decomposing a video sequence into per-object slot sequences and appearance sequences. To this end, we introduce explicit temporal dependencies into the sequence of posterior refinements and show how to leverage this contextual information during decoding with a generative model. The resulting computation graph can be thought of as a 2D grid with time dimension and refinement dimension (Fig. 1(a)). Propagation of information along these two axes is achieved with a 2D-LSTM Graves et al. (2007) (Fig. 1(b)), which allows us to model the joint probability over the entire video sequence inside the iterative amortized inference framework. The proposed method is expressive enough to model the multimodality of our image formation process and posterior, yet its runtime complexity is smaller than that of its static counterpart.
3.2.1 Variational Objective
Since exact likelihood training is intractable, we formulate our task in terms of a variational objective. In contrast to traditional optimization of the evidence lower bound (ELBO) through static encodings of the approximate posterior, we incorporate information from two dynamic axes: (1) variational estimates from previous refinement steps; (2) temporal information from previous frames. Together, they form the basis for spatio-temporal variational inference via iterative refinements. Specifically, we train our model by maximizing the following ELBO objective^2:

$$\mathcal{L}(\mathbf{x}_{1:T}) = \sum_{t=1}^{T} \sum_{r=1}^{R_t} \Big[\, \mathbb{E}_{q}\big[\log p(\mathbf{x}_t \mid \mathbf{z}_t^{(r)}, \mathbf{x}_{<t})\big] - \beta\, D_{\mathrm{KL}}\big(q^{(r)}(\mathbf{z}_t \mid \mathbf{x}_{\le t}) \,\big\|\, p(\mathbf{z}_t \mid \mathbf{x}_{<t})\big) \Big], \qquad (3)$$

where the first term expresses the reconstruction error of a single frame and the second term measures the divergence between the variational posterior and the prior. The relative weight between the terms is controlled with a hyperparameter $\beta$ Higgins et al. (2017). Furthermore, to reduce the overall complexity of the model and to make it easier to train, we let the number of refinement steps $R_t$ decrease over time (see Fig. 2 for an illustration). Compared to a static model, which infers each frame independently, reusing information from previous refinement steps also makes our model more computationally efficient. In the next sections, we discuss the form of the conditional distributions in Eq. (3) in more detail.

^2 For simplicity, we drop references to the object slot $k$ from now on and formulate all equations on a per-slot basis.

3.2.2 Inference and Generation
Posterior Refinement. Optimizing Eq. (3) inside the iterative amortized inference framework (Section 3.1) requires careful thought about the nature and processing of the hidden states. While there is vast literature on the propagation of a single signal, including different types of RNNs Hochreiter and Schmidhuber (1997); Cho et al. (2014); Graves et al. (2005); Chung et al. (2017) and transformers Vaswani et al. (2017), the optimal solution for multiple axes with different semantic meaning (i.e., time and refinements) is less obvious. Here, we propose to use a 2D version of the unidirectional MDLSTM Graves et al. (2007) to compute our variational objective (Eq. (3)) in an iterative manner. To do so, we replace the traditional LSTM in the refinement network (Eq. (2)) with a 2D extension. This extension allows the posterior gradients to flow through both the grid of the previous refinements and the previous time steps (see Fig. 1(a)). Writing $\lambda_t^{(r)}$ for the latent encoding at time $t$ and refinement $r$, we can formalize this new update scheme as follows:

$$\lambda_t^{(r+1)} = \lambda_t^{(r)} + f_\phi\big(\nabla_\lambda \mathcal{L}, \mathbf{x}_t, \lambda_t^{(r)}, h_t^{(r)}, \tilde{h}_{t-1}\big), \qquad (4)$$

Note that the hidden state from the previous time step is always $\tilde{h}_{t-1} = h_{t-1}^{(R_{t-1})}$, i.e., the one computed during the final refinement at time $t-1$. Our reasoning for this is that the approximation of the posterior only improves with the number of refinements Marino et al. (2018).
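A minimal form of such a 2D cell is sketched below: it differs from a standard LSTM only in that it receives one (hidden, cell) pair per axis and fuses the two cell states with separate forget gates. This is a simplified reading of the MDLSTM of Graves et al. (2007), not the exact gate parameterization used in the paper:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class MDLSTMCell:
    """Minimal 2D (MD)LSTM cell: one incoming state per axis
    (previous refinement r-1 and previous time step t-1)."""

    def __init__(self, in_dim, hid_dim, seed=0):
        rng = np.random.default_rng(seed)
        k = in_dim + 2 * hid_dim
        # input, output, cell gates + one forget gate per axis
        self.W = {g: rng.normal(0, 0.1, (hid_dim, k)) for g in "iofFg"}
        self.b = {g: np.zeros(hid_dim) for g in "iofFg"}

    def __call__(self, x, state_refine, state_time):
        (h_r, c_r), (h_t, c_t) = state_refine, state_time
        z = np.concatenate([x, h_r, h_t])
        i = sigmoid(self.W["i"] @ z + self.b["i"])
        o = sigmoid(self.W["o"] @ z + self.b["o"])
        f_r = sigmoid(self.W["f"] @ z + self.b["f"])   # forget gate, refinement axis
        f_t = sigmoid(self.W["F"] @ z + self.b["F"])   # forget gate, time axis
        g = np.tanh(self.W["g"] @ z + self.b["g"])
        c = f_r * c_r + f_t * c_t + i * g              # fuse both cell states
        h = o * np.tanh(c)
        return h, c
```

Stacking such cells on the grid of Fig. 1(a) lets gradients flow along both the refinement and the time axis.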
Temporal Conditioning. Inside the learning objective, we set the prior and the likelihood to be conditioned on the previous frames and the refinement steps. This naturally follows from the idea that each frame depends on its predecessor's dynamics, and the latent representations should therefore follow the same property. Conditioning on the refinement steps is essential to the iterative amortized inference procedure. To model the prior and likelihood distributions accordingly, we adopt the approach proposed in Chung et al. (2015) but tailor it to our iterative amortized inference setting. Specifically, the parameters of our Gaussian prior are now computed from the temporal hidden state $\tilde{h}_{t-1}$:

$$p(\mathbf{z}_t \mid \mathbf{x}_{<t}) = \mathcal{N}\big(\mathbf{z}_t;\, \mu_t, \sigma_t^2\big), \qquad [\mu_t, \sigma_t] = g_\psi(\tilde{h}_{t-1}), \qquad (5)$$

where $g_\psi$ is a simple neural network with a few layers.^3 Note that the prior only changes along the time dimension and is independent of the refinement iterations, because we refine the posterior to be as close as possible to the dynamic prior for the current time step. Finally, to complete the conditional generation, we modify the likelihood distribution as follows^4:

$$p(\mathbf{x}_t \mid \mathbf{z}_t^{(r)}, \mathbf{x}_{<t}) = \prod_i \sum_{k=1}^{K} m_{ikt}^{(r)}\,\mathcal{N}\big(x_{it};\, x_{ikt}^{(r)}, \sigma^2\big), \qquad (6)$$

where $m_{ikt}^{(r)}$ and $x_{ikt}^{(r)}$ are the mask and appearance of pixel $i$ in slot $k$ at time step $t$ and refinement step $r$. They are produced by a spatial mixture broadcast decoder Greff et al. (2019) with a preceding MLP that transforms the pair $(\mathbf{z}_t^{(r)}, \tilde{h}_{t-1})$ into a single vector representation.

^3 In practice, $g_\psi$ predicts $\log \sigma_t$ for stability reasons. Please refer to the supplemental material for details.
^4 Since our likelihood is a Gaussian mixture model, we are now referencing the object slot $k$ again.
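The conditional prior of Eq. (5) can be sketched as a small MLP $g_\psi$ mapping the temporal hidden state to $(\mu_t, \sigma_t)$; following the footnote above, it predicts $\log \sigma_t$ so the standard deviation stays positive. The layer sizes and the diagonal-Gaussian KL helper (the quantity the refinements minimize against this prior) are illustrative assumptions:

```python
import numpy as np

def make_prior_net(hid_dim, z_dim, seed=0):
    """g_psi: maps the temporal hidden state h_{t-1} to prior parameters."""
    rng = np.random.default_rng(seed)
    W1, b1 = rng.normal(0, 0.1, (64, hid_dim)), np.zeros(64)
    W2, b2 = rng.normal(0, 0.1, (2 * z_dim, 64)), np.zeros(2 * z_dim)

    def g_psi(h_prev):
        a = np.tanh(W1 @ h_prev + b1)
        out = W2 @ a + b2
        mu, log_sigma = out[:z_dim], out[z_dim:]
        return mu, np.exp(log_sigma)       # sigma > 0 by construction
    return g_psi

def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, diag sigma_q^2) || N(mu_p, diag sigma_p^2) )."""
    return (np.log(sigma_p / sigma_q)
            + (sigma_q**2 + (mu_q - mu_p)**2) / (2.0 * sigma_p**2) - 0.5).sum()
```

Since the prior depends only on $\tilde{h}_{t-1}$, it is computed once per frame and shared by all refinement steps at that frame.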
3.2.3 Learning and Prediction
Architecture. From a graphical point of view, we can think of the refinement steps and time steps as being organized on the 2D grid from Fig. 1(a), with each light gray cell representing the $r$-th refinement at time $t$. According to Eq. (4), each such cell takes as input the hidden state from the previous refinement, the temporal hidden state, and the posterior parameters. The outputs of each light gray cell are new posterior parameters and a new hidden state. At the last refinement at time $t$, the value of the refinement hidden state is assigned to a new temporal hidden state.
Training Objective. Instead of a direct optimization of Eq. (3), we propose two modifications that we found to improve our model's practical performance: (1) similar to observations made by Greff et al. (2019), we found that color is an important factor for high-quality segmentations. In the absence of such information, we mitigate the arising ambiguity by maximizing the entropy of the masks along the slot dimension $k$, i.e., we train our model by maximizing the objective

$$\mathcal{L}'(\mathbf{x}_{1:T}) = \mathcal{L}(\mathbf{x}_{1:T}) + \alpha \sum_{t=1}^{T} \sum_i H\big(m_{i\cdot t}\big), \qquad (7)$$

where $\alpha$ defines the weight of the entropy loss and $H$ denotes the entropy of the slot-assignment distribution of pixel $i$. (2) In addition to the entropy loss, we also prioritize later refinement steps by weighting the terms in the inner sum of Eq. (3) with weights that increase in $r$.
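The entropy term in Eq. (7) can be computed directly from the decoder masks. A numpy sketch (the (K, N) mask layout is an assumption): it is maximal when a pixel's assignment probability is spread evenly across slots and zero when the assignment is one-hot:

```python
import numpy as np

def mask_entropy(masks, eps=1e-8):
    """Mean per-pixel entropy of the slot-assignment distribution.

    masks: (K, N) array of assignment probabilities m_ik with
           masks.sum(axis=0) == 1 for every pixel i.
    """
    return float(-(masks * np.log(masks + eps)).sum(axis=0).mean())
```

During training, this quantity (scaled by the entropy weight) is simply added to the ELBO being maximized.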
Prediction. On top of pure video decomposition, our model is also able to simulate future frames. Because our model requires image data as input, which is not available during simulation of new frames, we use the reconstructed image $\hat{\mathbf{x}}_t$ in place of $\mathbf{x}_t$ to compute the likelihood in these cases. We also set the gradient inputs of the refinement network to zero.
Complexity. Our model's ability to reuse information from previous refinements leads to a runtime complexity of $\mathcal{O}(T + R)$, which is much more efficient than the $\mathcal{O}(T \cdot R)$ complexity of the traditional IODINE model Greff et al. (2019) (when each frame is inferred independently) in the typical case of $R > 1$.
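To make the complexity claim concrete: if the first frame receives the full budget of R refinements and each later frame reuses the previous posterior with a single refinement (one plausible reading of the grid in Fig. 1(a); the exact schedule is an assumption here), the total number of inference cells is R + (T − 1) = O(T + R), versus T · R when every frame is inferred from scratch:

```python
def cells_ours(T, R):
    """Assumed schedule: R refinements on frame 1, one refinement per later frame."""
    return R + (T - 1)

def cells_frame_independent(T, R):
    """IODINE applied per frame: R refinements for each of the T frames."""
    return T * R
```

For a 40-frame video with 5 refinements, this is 44 cells instead of 200.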
4 Experiments
We validate our model on Bouncing Balls Van Steenkiste et al. (2018) and an augmented version of CLEVRER Yi et al. (2020). Our experiments comprise quantitative studies of decomposition quality during generation and prediction as well as an ablation study. We complement these results with visual illustrations. A wide range of additional visualizations and experimental details can be found in the supplemental material.
4.1 Setup
Datasets. Bouncing Balls consists of binary video sequences showing simulated balls with different masses bouncing elastically off each other and the image border. We train our model on the first 40 frames of 50K videos containing 4 balls in each frame. We use two different test sets, consisting of 10K videos with 4 balls and 10K videos with 6–8 balls. We also validate our model on a color version of this dataset that we generate using the segmentation masks.
CLEVRER contains synthetic videos of moving and colliding objects. Each video is 5 seconds long (128 frames), which we trim and rescale (see supp. mat.). For training, we use the same 10K videos as in the original source. For testing, we compute ground-truth masks for the validation set using the provided annotations and test on 2.5K instances containing 3–5 objects and on 1.1K instances containing 6 objects. We set the number of slots to 6 for the CLEVRER training set and to one more than the maximum number of objects in all other cases.
Baselines. We compare our approach to two recent baselines: R-NEM Van Steenkiste et al. (2018) and IODINE Greff et al. (2019). R-NEM is a state-of-the-art model for unsupervised video decomposition and physics learning. While showing impressive results on simulation tasks, it is limited to binary data and has difficulties with perspective scenes. IODINE is more expressive but static in nature and cannot capture temporal dynamics within its probabilistic framework. However, as noted in Greff et al. (2019), it can readily be applied to temporal sequences by feeding a new video frame to each iteration of the LSTM in the refinement network. We call this variant SEQIODINE and compare to it as well.
4.2 Evaluation Metrics
ARI. The Adjusted Rand Index Rand (1971); Hubert and Arabie (1985) is a measure of clustering similarity. It is computed by counting all pairs of samples that are assigned to the same or different clusters in the predicted and true clusterings. It ranges from −1 to 1, with a score of 0 indicating a random clustering and 1 indicating a perfect match. We treat each pixel as one sample and its segmentation as the cluster assignment.
FARI. The Foreground Adjusted Rand Index is a modification of the ARI score ignoring background pixels, which often occupy the majority of the image. We argue that both metrics are necessary to assess the segmentation quality of a video decomposition method; this metric is also used in Greff et al. (2019); Van Steenkiste et al. (2018).
MSE. The mean squared error between the pixels of the reconstructed frames $\hat{\mathbf{x}}_t$ and the ground-truth frames $\mathbf{x}_t$.
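Both clustering scores can be implemented in a few lines. The sketch below builds the contingency table between ground-truth and predicted segmentations and applies the standard adjusted-for-chance formula; treating label 0 as background for FARI is an assumption of this sketch:

```python
import numpy as np

def adjusted_rand_index(true, pred):
    """ARI between two flat label arrays (each pixel is one sample)."""
    _, ti = np.unique(true, return_inverse=True)
    _, pi = np.unique(pred, return_inverse=True)
    n = len(true)
    table = np.zeros((ti.max() + 1, pi.max() + 1))
    for i, j in zip(ti, pi):
        table[i, j] += 1                        # contingency table
    comb2 = lambda m: m * (m - 1) / 2.0         # number of same-cluster pairs
    sum_ij = comb2(table).sum()
    sum_a = comb2(table.sum(axis=1)).sum()      # ground-truth marginals
    sum_b = comb2(table.sum(axis=0)).sum()      # prediction marginals
    expected = sum_a * sum_b / comb2(n)
    max_index = 0.5 * (sum_a + sum_b)
    return (sum_ij - expected) / (max_index - expected)

def foreground_ari(true, pred, background_label=0):
    """FARI: ARI restricted to non-background pixels of the ground truth."""
    fg = true != background_label
    return adjusted_rand_index(true[fg], pred[fg])
```

Note that the score is invariant to a permutation of the predicted labels, which is what makes it suitable for unsupervised slot assignments.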
4.3 Video Decomposition
We evaluate the models on a video decomposition task at different sequence lengths. As shown in Table 1, our model outperforms the baselines regardless of the presence of color information, which further reduces the error. We are at least 7% better than R-NEM on all metrics and at least 20% better than IODINE on ARI and MSE. Since R-NEM cannot cope well with colored data or the perspective of the scenes, it is only evaluated on the (binary) Bouncing Balls dataset, producing high-error results in the first frames, a phenomenon not affecting our model. For both datasets, IODINE's results are computed independently on each frame of the longest sequence; because it processes frames separately, IODINE does not keep the same object-slot assignment, which we chose to ignore when computing the scores.
Table 1: Video decomposition results at different sequence lengths. IODINE is evaluated per frame, so a single value per metric is reported.

Bouncing Balls
                 ARI (↑)                           FARI (↑)                          MSE (↓)
Length           10      20      30      40        10      20      30      40        10      20      30      40
binary
  R-NEM          0.5031  0.6199  0.6632  0.6833    0.6259  0.7325  0.7708  0.7899    0.0252  0.0138  0.0096  0.0076
  IODINE         0.0318 (per frame)                0.9986 (per frame)                0.0018 (per frame)
  SEQIODINE      0.0230  0.0223  0.0021  0.0201    0.8645  0.6028  0.5444  0.4063    0.0385  0.0782  0.0846  0.0968
  Our            0.7169  0.7263  0.7286  0.7294    0.9999  0.9999  0.9999  0.9999    0.0004  0.0004  0.0004  0.0004
color
  IODINE         0.5841 (per frame)                0.9752 (per frame)                0.0014 (per frame)
  SEQIODINE      0.3789  0.3743  0.3225  0.2654    0.7517  0.8159  0.7537  0.6734    0.0160  0.0164  0.0217  0.0270
  Our            0.7275  0.7291  0.7298  0.7301    1.0000  1.0000  0.9999  0.9999    0.0002  0.0002  0.0002  0.0002

CLEVRER
                 ARI (↑)                           FARI (↑)                          MSE (↓)
Length           10      20      30      40        10      20      30      40        10      20      30      40
color
  IODINE         0.1791 (per frame)                0.9316 (per frame)                0.0004 (per frame)
  SEQIODINE      0.1171  0.1378  0.1558  0.1684    0.8520  0.8774  0.8780  0.8759    0.0009  0.0009  0.0010  0.0010
  Our            0.2220  0.2403  0.2555  0.2681    0.9182  0.9258  0.9309  0.9312    0.0003  0.0003  0.0003  0.0003
4.4 Generalization
We investigated how well our model adapts to a higher number of objects, evaluating its performance on the Bouncing Balls dataset (6 to 8 objects) and on the CLEVRER dataset (6 objects). Table 2 shows that our FARI and MSE scores are at least 50% better than those of R-NEM, while our ARI scores are only marginally worse, and only on the binary data. In comparison to IODINE, we are at least 4% better across all metrics. For the Bouncing Balls dataset, we have also investigated the impact of changing the total number of possible colors to 4 and 8, the former resulting in duplicate colors for different objects and the latter in unique colors for each object. The higher MSE scores for the 8-color variant are due to the model not being able to reconstruct the unseen colors. Sample qualitative results are shown in Figs. 3 and 4; more can be found in the supplementary material.
Table 2: Generalization to a higher number of objects. (4)/(8) denotes the number of possible colors.

Bouncing Balls (6–8 balls)
                 ARI (↑)   FARI (↑)  MSE (↓)
binary
  R-NEM          0.4484    0.6377    0.0328
  IODINE         0.0271    0.9969    0.0040
  SEQIODINE      0.0263    0.8874    0.0521
  Our            0.4453    0.9999    0.0008
color
  IODINE (4)     0.4136    0.8211    0.0138
  IODINE (8)     0.2823    0.7197    0.0281
  SEQIODINE (4)  0.2068    0.5854    0.0338
  SEQIODINE (8)  0.1571    0.5231    0.0433
  Our (4)        0.4275    0.9998    0.0004
  Our (8)        0.4317    0.9900    0.0114

CLEVRER (6 objects)
                 ARI (↑)   FARI (↑)  MSE (↓)
color
  IODINE         0.2205    0.9305    0.0006
  SEQIODINE      0.1482    0.8645    0.0012
  Our            0.2839    0.9355    0.0004
Table 3: Ablation study on the binary Bouncing Balls dataset.

Base  Grid  CP+G  Entropy  Length   ARI (↑)  FARI (↑)  MSE (↓)
 ✓                          20      0.0126   0.7765    0.0340
 ✓     ✓     ✓              20      0.2994   0.9999    0.0010
 ✓     ✓     ✓              40      0.3528   0.9998    0.0010
 ✓     ✓     ✓     ✓        40      0.7263   0.9999    0.0004

[Base: base model using the 2D-LSTM; Grid: efficient triangular grid structure (Fig. 1(a)); CP+G: conditional prior and generation; Entropy: entropy term (Eq. (7)); Length: sequence length]
4.5 Prediction
We compare the predictions of our model (Section 3.2.3) to those of R-NEM after 20 steps of inference and 10 predicted steps on the Bouncing Balls dataset (Fig. 5a). As the results show, our model is superior to R-NEM on shorter sequences; for longer sequences, we outperform R-NEM only on colored data. We also show the prediction errors on the CLEVRER dataset in Fig. 5b, which slowly decrease over time, as expected.
4.6 Ablation
The quantitative results of the ablation study on the binary Bouncing Balls dataset are shown in Table 3. We investigate the effects of the efficient grid, the conditional prior and generation, the length of the training sequences, and the entropy term on the performance of our model; all prove necessary and important.
5 Conclusion
We presented a novel unsupervised learning framework capable of precise scene decomposition and dynamics modeling in multi-object videos with complex appearance and motion. The proposed approach leverages temporal consistency between latent random variables, expressed through a variational energy, resulting in a robust and efficient inference model. This leads to state-of-the-art performance in decomposition, segmentation, and prediction tasks on several datasets, one of which was collected by us. Notably, our model generalizes well to more populous scenes and, due to the entropy term, has improved stability in scenes with missing color information.
This work was funded, in part, by the Vector Institute for AI, Canada CIFAR AI Chair, NSERC CRC and an NSERC DG and Discovery Accelerator Grants.
References
 Learning to learn by gradient descent by gradient descent. In NIPS, Cited by: §3.1.
 Monet: unsupervised scene decomposition and representation. arXiv preprint arXiv:1901.11390. Cited by: §2.

Reinforced active learning for image segmentation
. In ICLR, Cited by: §2.  Learning phrase representations using rnn encoderdecoder for statistical machine translation. In EMNLP, Cited by: §3.2.2.

Hierarchical multiscale recurrent neural networks
. In ICLR, Cited by: §3.2.2.  A recurrent latent variable model for sequential data. In NIPS, pp. 2980–2988. Cited by: §1, §3.2.2.
 Exploiting spatial invariance for scalable unsupervised object tracking. In AAAI, Cited by: §2.

Spatially invariant unsupervised object detection with convolutional neural networks
. In AAAI, pp. 3412–3420. Cited by: §1, §2. 
Inference suboptimality in variational autoencoders.
In
NIPS Workshop on Advances in Approximate Bayesian Inference
, Cited by: §3.1.  Genesis: generative scene inference and sampling with objectcentric latent representations. In ICLR, Cited by: §F, §2.
 Attend, infer, repeat: fast scene understanding with generative models. In NIPS, pp. 3225–3233. Cited by: §1, §2.
 Efficient graphbased image segmentation. International journal of computer vision 59 (2), pp. 167–181. Cited by: §2.
 Fast rcnn. In ICCV, pp. 1440–1448. Cited by: §2.
 Unsupervised video object segmentation for deep reinforcement learning. In NeurIPS, pp. 5683–5694. Cited by: §2.
 Bidirectional lstm networks for improved phoneme classification and recognition. In Artificial Neural Networks: Formal Models and Their Applications – ICANN 2005, W. Duch, J. Kacprzyk, E. Oja, and S. Zadrożny (Eds.), Berlin, Heidelberg, pp. 799–804. External Links: ISBN 9783540287568 Cited by: §3.2.2.
 Multidimensional recurrent neural networks. In International conference on artificial neural networks, pp. 549–558. Cited by: Figure 2, §3.2.2, §3.2.
 Multiobject representation learning with iterative variational inference. arXiv preprint arXiv:1903.00450. Cited by: §H, §1, §1, §1, §2, Figure 2, §3.1, §3.1, §3.2.2, §3.2.3, §3.2.3, §3, §4.1, §4.2.
 Tagger: deep unsupervised perceptual grouping. In NIPS, pp. 4484–4492. Cited by: §2.
 Neural expectation maximization. In NIPS, pp. 6691–6701. Cited by: §1, §2.
 Mask rcnn. In ICCV, pp. 2961–2969. Cited by: §2.
 Betavae: learning basic visual concepts with a constrained variational framework.. In ICLR, Cited by: §3.2.1.

Learning and relearning in boltzmann machines
. Parallel Distributed Processing: Explorations in the Microstructure of Cognition 1, pp. 282–317. Cited by: §1.  Long shortterm memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §3.2.2.
 Comparing partitions. Journal of classification 2 (1), pp. 193–218. Cited by: §4.2.
 A layered motion representation with occlusion and compact spatial support. In ECCV, pp. 692–706. Cited by: §1.
 SCALOR: generative world models with scalable object representations. In ICLR, Cited by: §2.
 Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §D.
 Autoencoding variational bayes. In ICLR, Cited by: §1.
 Sequential attend, infer, repeat: generative modelling of moving objects. In NeurIPS, pp. 8606–8616. Cited by: §F, §2.
 Structured objectaware physics prediction for video modeling and planning. In ICLR, Cited by: §2.
 ViLBERT: pretraining taskagnostic visiolinguistic representations for visionandlanguage tasks. In NeurIPS, Cited by: §1.
 Iterative amortized inference. In ICML, Cited by: §1, §3.1, §3.2.2, §3.
 A theory for cerebral neocortex. Proceedings of the Royal Society of London Series B (176), pp. 161–234. Cited by: §1.
 Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, Cited by: §1.
 Context encoders: feature learning by inpainting. In CVPR, Cited by: §1.
 Recurrent ladder networks. In NIPS, pp. 6009–6019. Cited by: §2.
 Objective criteria for the evaluation of clustering methods. Journal of the American Statistical association 66 (336), pp. 846–850. Cited by: §4.2.
 Yolov3: an incremental improvement. arXiv preprint arXiv:1804.02767. Cited by: §2.
 A simple neural network module for relational reasoning. In NIPS, pp. 4967–4976. Cited by: §F.
 Normalized cuts and image segmentation. IEEE Transactions on pattern analysis and machine intelligence 22 (8), pp. 888–905. Cited by: §2.
 VideoBERT: a joint model for video and language representation learning. In ICCV, Cited by: §1.

Lxmert: learning crossmodality encoder representations from transformers.
In
Conference on Empirical Methods in Natural Language Processing
, Cited by: §1.  Relational neural expectation maximization: unsupervised discovery of objects and their interactions. In ICLR, Cited by: §A.1, §B, §C, §1, §1, §2, §4.1, §4.2, §4.
 Attention is all you need. In NIPS, pp. 5998–6008. Cited by: §1, §3.2.2.

Extracting and composing robust features with denoising autoencoders
. In ICML, Cited by: §1.  Anticipating visual representations with unlabeled videos. In CVPR, Cited by: §1.
 Spatial broadcast decoder: a simple architecture for learning disentangled representations in vaes. arXiv preprint arXiv:1901.07017. Cited by: §3.1.
 A unified mixture framework for motion segmentation: incorporating spatial coherence and estimating the number of models. In CVPR, pp. 321–326. Cited by: §1.
 CLEVRER: collision events for video representation and reasoning. In ICLR, Cited by: §B, §4.

Colorful image colorization. In ECCV, Cited by: §1.
Supplemental Material
A Baselines
A.1 RNEM
We use the RNEM Van Steenkiste et al. (2018) authors’ original implementation and their publicly available models: https://github.com/sjoerdvansteenkiste/RelationalNEM.
A.2 IODINE
Our IODINE experiments are based on the following PyTorch implementation:
https://github.com/MichaelKevinKelly/IODINE. We use the same parameters as in this code, with the exception of the weight factor and, for the Bouncing Balls experiments, the number of refinement steps. Most of the remaining hyperparameters shared between our model and IODINE are identical.
A.3 SEQIODINE
In order to test a sequential version of IODINE, we use the regularly trained IODINE model but, at test time, set the number of refinement steps to the number of video frames. During each refinement step, instead of computing the error between the reconstructed image and the ground-truth image, we use the next video frame. Since the IODINE model was trained with a fixed, small number of refinement steps, extending the number of refinement steps to the video length leads to exploding gradients. This effect is especially problematic on the binary Bouncing Balls dataset with 20, 30 and 40 frames per video, because the scores of the static model are already low. We deal with this issue by clamping the gradients and the refinement value between fixed minimum and maximum bounds (note that clamping was applied only to binary Bouncing Balls with 20, 30 and 40 frames). SEQIODINE’s weak performance, especially w.r.t. the ARI, reflects the gradual divergence from the optimum as the number of frames increases.
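The clamping described above can be sketched as follows. This is a minimal illustrative sketch, not our exact implementation: the bound of 5.0, the step size, and the function names are assumptions chosen for the example, not the values used in our experiments.

```python
def clamp(x, lo, hi):
    """Clamp a scalar into the closed interval [lo, hi]."""
    return max(lo, min(hi, x))

def clamped_refine_step(lam, grad, step_size=0.1, bound=5.0):
    """One SEQIODINE-style refinement step with clamping (illustrative).

    Both the incoming gradient and the refined posterior parameter are
    clamped to [-bound, bound], so that extending the number of
    refinement steps to the full video length cannot produce
    exploding updates.
    """
    g = clamp(grad, -bound, bound)        # clamp the gradient
    lam_new = lam - step_size * g         # gradient-based refinement
    return clamp(lam_new, -bound, bound)  # clamp the refined value
```

Even a pathologically large gradient then yields a bounded update, e.g. `clamped_refine_step(0.0, 1e6)` moves the parameter by at most `step_size * bound`.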
B Datasets
Bouncing Balls. Bouncing Balls is a dataset provided by the authors of RNEM Van Steenkiste et al. (2018). We use the train and test splits of this dataset in two versions: binary and color. For the color version, we randomly choose 4 colors for the 4-balls (sub)dataset. For the 6–8 balls test data, we color the balls in 2 different ways: with 4 colors (same as train) and with 8 colors (4 from train, 4 new ones). Note that the former results in identical colors for multiple objects, while the latter guarantees a unique color for each object.
CLEVRER. The version of the CLEVRER dataset Yi et al. (2020) used in this work was processed as follows:

Train split, validation split and validation annotations were obtained from the official website: http://clevrer.csail.mit.edu/. We use the validation set as our test set, because the official test set does not contain annotations.

For training, we use the original train split. Our minimal preprocessing consists of cropping the frames along the width axis by 40 pixels on both sides, followed by uniform downscaling to 64×64 pixels. Since each video is 128 frames long and the maximum number of frames during training was 40, we split the videos into multiple subsequences to obtain a larger number of training samples.

For testing, we trim the videos to a subsequence containing at least 3 objects and object motion. We compute these subsequences by running the script (slice_videos_from_annotations.py in the attached code) from the folder with the validation split and validation annotations.

The test set ground truth masks can be downloaded from here. The masks and the preprocessed test videos will be grouped into separate folders based on the number of objects in a video.
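The cropping and sequence-splitting steps above can be sketched as follows. This is a simplified, dependency-free sketch under stated assumptions: frames are represented as nested lists rather than image tensors, the function names are our own, and the 64×64 downscaling step is omitted since it requires an image library.

```python
def crop_width(frame, margin=40):
    """Drop `margin` pixels from each side of the width axis.

    `frame` is a nested list of rows (H x W); returns the cropped rows.
    """
    return [row[margin:len(row) - margin] for row in frame]

def split_into_sequences(num_frames=128, max_len=40):
    """Split a video of `num_frames` frames into consecutive
    subsequences of at most `max_len` frames, returned as
    (start, end) index pairs into the frame array."""
    chunks, start = [], 0
    while start < num_frames:
        end = min(start + max_len, num_frames)
        chunks.append((start, end))
        start = end
    return chunks
```

For a 128-frame CLEVRER video with a 40-frame training limit, `split_into_sequences()` yields three full 40-frame subsequences plus one 8-frame remainder.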
C Hyperparameters
Initialization. We initialize the parameters of the posterior by sampling from a fixed initial distribution, using the same latent dimensionality in all experiments. Horizontal and vertical hidden states and cell states are of size 128, initialized with zeros. The variance of the likelihood is fixed to the same value in all experiments.
Experiments on Bouncing Balls. For this experiment, we explored several numbers of refinement steps and empirically found 6 to be optimal in terms of accuracy and efficiency: refining the posterior more than 6 times does not lead to any substantial improvement, while significantly increasing time and memory consumption. For the 4-balls dataset, we use the same number of slots for train and test; for our tests on 6–8 balls, we adjust the number of slots accordingly. This protocol is identical to the one used in RNEM Van Steenkiste et al. (2018). Furthermore, we scale the KL term by a constant weight. The entropy term is given a nonzero weight in the binary case; as expected, its effect is most pronounced with binary data, so we use a different weight in all experiments with RGB data.
Experiments on CLEVRER. We keep the default number of iterative refinements, because we did not observe any substantial improvements from a further increase. We use a fixed number of slots during training, and adjust it at test time according to the number of objects in the video (3–5 or 6 objects).
D Training
We use ADAM Kingma and Ba (2014) for all experiments, with a learning rate of 0.0003 and default values for all remaining parameters. During training, we gradually increase the number of frames per video, as we have found this to make the optimisation more stable. We start with sequences of length 4 and train the model until we observe a stagnant loss or posterior collapse. At the beginning of training, the batch size is 32; it is then gradually decreased, inversely proportional to the number of frames in the video.
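A schedule of this shape can be sketched as below. The exact schedule is a hypothetical assumption for illustration (the text only specifies the starting point of batch size 32 at 4 frames and an inverse-proportional decrease); the function name and the integer rounding are our own.

```python
def batch_size_for(num_frames, base_frames=4, base_batch=32):
    """Hypothetical curriculum schedule: start at `base_batch` for
    `base_frames`-long sequences, then shrink the batch size inversely
    proportional to the number of frames per video (floored at 1)."""
    return max(1, base_batch * base_frames // num_frames)
```

Under this sketch, doubling the sequence length halves the batch size, keeping the total number of frames per batch roughly constant.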
E Infrastructure and Runtime
We train our models on 8 GeForce GTX 1080 Ti GPUs, which takes approximately one day per model.
F Discussion and Future work
The introduction of a temporal component not only enables modelling of dynamics inside the amortized iterative inference framework but also improves the overall quality of the results. Our quantitative and qualitative comparisons with IODINE and SEQIODINE show that our model produces more accurate results on the decomposition task: it detects new objects faster and is less sensitive to color, because it can leverage the objects’ motion cues. The ability to work with complex colored data, a property inherited from IODINE, means that we significantly outperform RNEM. However, RNEM is a stronger model when it comes to the prediction of longer sequences, owing to its ability to model the relations between the objects in a scene. Similar ideas were used in SQAIR Kosiorek et al. (2018) and GENESIS Engelcke et al. (2020) by adding a relational RNN Santoro et al. (2017); integrating these concepts into our framework is a promising direction for future research. Another possible route is the application of our model to complex real-world scenarios. However, given that such datasets typically contain a much higher number of objects, as well as intricate interactions and spatially varying materials, we consider the resulting scalability questions a separate line of research.
G Additional Qualitative Results
H Disentanglement
We demonstrate that introducing a new temporal hidden state and an additional MLP in front of the spatial broadcast decoder does not impact the model’s ability to separate each object’s representation and to disentangle it based on color, position, size and other features, similar to the results shown in Greff et al. (2019).