1 Introduction
Learning good representations of complex visual scenes is a challenging problem for artificial intelligence that is far from solved. Recent breakthroughs in unsupervised representation learning
(Higgins et al., 2017a; Makhzani et al., 2015; Chen et al., 2016) tend to focus on data where a single object of interest is placed in front of some background (e.g. dSprites, 3D Chairs, CelebA). Yet in general, visual scenes contain a variable number of objects arranged in various spatial configurations, often with partial occlusions (e.g. CLEVR, Johnson et al. 2017; see Figure 1). This motivates the question: what forms a good representation of a scene with multiple objects? In line with recent advances (Burgess et al., 2019; van Steenkiste et al., 2018; Eslami et al., 2016), we maintain that the discovery of objects in a scene should be considered a crucial aspect of representation learning, rather than treated as a separate problem.

We approach the problem from a spatial mixture model perspective (Greff et al., 2017) and use amortized iterative refinement (Marino et al., 2018b) of latent object representations within a variational framework (Rezende et al., 2014; Kingma & Welling, 2013). We encode our basic intuition about the existence of objects into the structure of our model, which simultaneously facilitates their discovery and efficient representation in a fully data-driven, unsupervised manner. We name the resulting architecture IODINE (short for Iterative Object Decomposition Inference NEtwork).
IODINE can segment complex scenes and learn disentangled object features without supervision on datasets like CLEVR, Objects Room (Burgess et al., 2019), and Tetris (see Appendix A). We show systematic generalization to more objects than included in the training regime, as well as to objects formed from unseen feature combinations. This highlights the benefits of multi-object representation learning compared to a VAE's single-slot representations. We also show how the sampling used in iterative refinement allows the model to resolve multi-modal and multi-stable decompositions.
2 Method
We first express the assumptions required for multi-object representation learning within the framework of generative modelling (Section 2.1). Then, building upon the successful Variational Autoencoder framework (VAEs; Rezende et al. 2014; Kingma & Welling 2013), we leverage variational inference to jointly learn both the generative and inference model (Section 2.2). There we also discuss the particular challenges that arise for inference in this context and show how they can be solved using iterative amortization. Finally, in Section 2.3 we bring all elements together and show how the complete system can be trained end-to-end by simply maximizing its Evidence Lower Bound (ELBO).

2.1 Multi-Object Representations
Flat vector representations as used by standard VAEs are inadequate for capturing the combinatorial object structure that many datasets exhibit. To achieve the kind of systematic generalization that is so natural for humans, we propose employing a multi-slot representation, where each slot shares the underlying representation format and each ideally describes an independent part of the input. Consider the example in Figure 1: by construction, the scene consists of 8 objects, each with its own properties such as shape, size, position, color and material. To split objects, a flat representation would have to encode each object using separate feature dimensions. But this neglects the simple and (to us) trivial fact that they are interchangeable objects with common properties.

Generative Model
We represent each scene with K latent object representations z_k that collaborate to generate the input image x (c.f. Figure 1(b)). The z_k are assumed to be independent, and their generative mechanism is shared such that any ordering of them produces the same image (i.e. entailing permutation invariance). Objects distinguished in this way can easily be compared, reused and recombined, thus facilitating combinatorial generalization.
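This permutation invariance can be illustrated with a minimal numerical sketch: if the image is composed as a mask-weighted sum of per-slot appearances (as in the spatial mixture described next), any reordering of the slots yields the same image. The array shapes and random stand-ins for decoder outputs below are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
K, P, C = 4, 16, 3  # slots, pixels, channels (toy sizes)

# Stand-ins for decoded per-slot appearances and softmaxed masks
mu = rng.normal(size=(K, P, C))
logits = rng.normal(size=(K, P))
m = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)  # sum to 1 over slots

def compose(mu, m):
    """Mask-weighted combination of slot appearances (the mixture mean)."""
    return (m[..., None] * mu).sum(axis=0)  # (P, C)

perm = rng.permutation(K)
# Reordering the slots leaves the composed image unchanged
assert np.allclose(compose(mu, m), compose(mu[perm], m[perm]))
```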
The image x is modeled with a spatial Gaussian mixture model where each mixing component (slot) corresponds to a single object. That means each object vector z_k is decoded into a pixel-wise mean μ_ik (the appearance of the object) and a pixel-wise assignment m_ik (the segmentation mask; c.f. Figure 1(c)). Assuming that the pixels i are independent conditioned on z, the likelihood thus becomes:

p(x | z) = ∏_i Σ_{k=1}^{K} m_ik N(x_i; μ_ik, σ²)    (1)

where we use a global fixed variance σ² for all pixels.

Decoder Structure
Our decoder network structure directly reflects the structure of the generative model; see Figure 1(d) for an illustration. Each object latent z_k is decoded separately into pixel-wise means μ_k and mask-logits m̂_k, which we then normalize using a softmax applied across slots, such that the masks for each pixel sum to 1. Together, μ and m parameterize the spatial mixture distribution as defined in Equation 1. For the network architecture we use a broadcast decoder (Watters et al., 2019), which spatially replicates the latent vector z_k, appends two coordinate channels (ranging from −1 to 1 horizontally and vertically), and applies a series of size-preserving convolutional layers. This structure encourages disentangling the position across the image from other features such as color or texture, and generally supports disentangling. All slots share weights to ensure a common format, and are decoded independently up until the mask normalization.

2.2 Inference
Similar to VAEs, we use amortized variational inference to obtain an approximate posterior q_λ(z|x), parameterized as an isotropic Gaussian with parameters λ. However, our object-oriented generative model poses a few specific challenges for the inference process. Firstly, being a (spatial) mixture model, we need to infer both the components (i.e. object appearance) and the mixing (i.e. object segmentation). This type of problem is well known, for example in clustering and image segmentation, and is traditionally tackled with iterative procedures, because no efficient direct solutions are available. A related second problem is that any slot can, in principle, explain any pixel. Once a pixel is explained by one of the slots, however, the others no longer need to account for it. This explaining-away property complicates inference by strongly coupling it across the individual slots. Finally, slot permutation invariance induces a multimodal posterior with at least one mode per slot permutation. This is problematic, since our approximate posterior is parameterized as a unimodal distribution. For all these reasons, the standard feedforward VAE inference model is inadequate here, so we consider a more powerful method for inference.
Iterative Inference
The basic idea of iterative inference is to start with an arbitrary guess for the posterior parameters λ, and then iteratively refine them using the input and samples from the current posterior estimate. We build on the framework of iterative amortized inference (Marino et al., 2018b), which uses a trained refinement network f_φ. Unlike Marino et al., we consider only additive updates to the posterior, and we use several salient auxiliary inputs a to the refinement network (instead of just the gradient of the ELBO). We update the posterior of each slot k independently and in parallel (indicated by ∼ and ←), as follows:

z_k^(t) ~ q_λ(z_k^(t) | x)    (2)

λ_k^(t+1) ← λ_k^(t) + f_φ(z_k^(t), x, a_k^(t))    (3)
Thus the only place where the slots interact is at the input level. Instead of amortizing the posterior directly (as in a regular VAE encoder), the refinement network can be thought of as amortizing the gradient of the posterior (Marino et al., 2018a). The alternating updates to λ and z are also akin to message passing.
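A minimal sketch of this refinement loop, with small random linear maps standing in for the trained decoder and refinement network (all shapes, the residual-based auxiliary input, and the toy "networks" are illustrative assumptions, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
K, D, P = 3, 8, 32            # slots, latent dims, pixels (toy sizes)
x = rng.normal(size=P)        # a flattened toy "image"

# Random linear maps standing in for the trained decoder and refinement network
W_dec = 0.1 * rng.normal(size=(D, P))
W_ref = 0.01 * rng.normal(size=(P, 2 * D))

# lambda holds (mu, log_sigma) of each slot's Gaussian posterior
lam = np.zeros((K, 2 * D))
for t in range(5):
    mu, log_sigma = lam[:, :D], lam[:, D:]
    # Reparameterized sample from the current posterior (Eq. 2)
    z = mu + np.exp(log_sigma) * rng.normal(size=(K, D))
    # Crude stand-in for the auxiliary inputs a: the reconstruction residual
    residual = x[None, :] - z @ W_dec          # (K, P)
    # Additive update, applied in parallel across slots (Eq. 3)
    lam = lam + residual @ W_ref
```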
Inputs
We feed a set of auxiliary inputs to the refinement network, which are generally cheap to compute and substantially simplify the task. Crucially, we include gradient information about the ELBO in the inputs, as it conveys information about what is not yet explained by other slots.
Omitting the iteration superscript (t) for clarity, the auxiliary inputs are:

- image x, means μ, masks m, and mask-logits m̂,
- gradients ∇_μ L, ∇_{m̂} L, and ∇_λ L,
- posterior mask p(m_k | x, μ),
- pixel-wise likelihood p(x | z),
- the leave-one-out likelihood p(x | z_{≠k}),
- and two coordinate channels like in the decoder.
With the exception of ∇_λ L, these are all image-sized and cheap to compute, so we feed them as additional input channels to the refinement network. The approximate gradient ∇_λ L is computed using the reparameterization trick by a backward pass through the generator network. This is computationally quite expensive, but we found that this information helps to significantly improve training of the refinement network. Like Marino et al. (2018b), we found it beneficial to normalize the gradient-based inputs with LayerNorm (Ba et al., 2016). See Section 4.2 for an ablation study of these auxiliary inputs.
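The image-sized auxiliary inputs can be computed cheaply from the decoder outputs. A sketch under toy shapes (single channel; the shapes and the renormalized form of the leave-one-out likelihood are our assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
K, P = 4, 64                  # slots, pixels (toy sizes)
sigma = 0.1                   # fixed global standard deviation

x = rng.random(P)             # "image"
mu = rng.random((K, P))       # per-slot means from the decoder
logits = rng.normal(size=(K, P))
m = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)  # masks

# Per-slot, per-pixel Gaussian densities N(x_i; mu_ik, sigma^2)
dens = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Pixel-wise likelihood of the full mixture: sum_k m_ik * N(...)
pixel_lik = (m * dens).sum(axis=0)                 # (P,)

# Posterior mask: each slot's responsibility for each pixel
post_mask = m * dens / pixel_lik                   # (K, P), columns sum to 1

# Leave-one-out likelihood: the mixture with slot k removed (one plausible form,
# with the remaining masks renormalized)
loo = np.stack([
    np.delete(m * dens, k, axis=0).sum(axis=0) / np.delete(m, k, axis=0).sum(axis=0)
    for k in range(K)
])
```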
2.3 Training
We train the parameters of the decoder (θ), of the refinement network (φ), and of the initial posterior (λ^(1)) by gradient descent through the unrolled iterations. In principle, it is enough to minimize only the final negative ELBO L_T, but we found it beneficial to use a weighted sum which also includes the earlier terms:
L_total = Σ_{t=1}^{T} (t/T) L_t    (4)
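One simple such weighted sum uses a linear ramp t/T over the T unrolled iterations, so later iterations contribute more; a sketch (the per-iteration loss values are placeholders):

```python
# Placeholder per-iteration negative ELBOs L_1 .. L_T from unrolled refinement
losses = [4.0, 3.0, 2.5, 2.2]
T = len(losses)

# Linear ramp weighting: iteration t contributes with weight t/T
total = sum((t / T) * L for t, L in zip(range(1, T + 1), losses))
```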
It is worth noting that IODINE contains two nested minimizations of the loss. In the inner loop, each (refinement) iteration minimizes the loss by adjusting λ (this is what is shown in Figure 3 and Algorithm 1). Training the model itself by gradient descent constitutes an outer loop, which minimizes the loss (indirectly) by adjusting θ and φ to make the iterative refinement more effective. In this sense, our system resembles the learning-to-learn meta-learner of Andrychowicz et al. (2016). Unfortunately, this double minimization also leads to numerical instabilities connected to double derivatives. We found that this problem can be mitigated by dropping the double-derivative terms, i.e. stopping gradients from backpropagating through the gradient inputs ∇_μ L, ∇_{m̂} L, and ∇_λ L in the auxiliary inputs a.

3 Related Work
Representation learning (Bengio et al., 2013) has received much attention and has seen several recent breakthroughs. This includes disentangled representations through the use of VAEs (Higgins et al., 2017a), adversarial autoencoders (Makhzani et al., 2015), Factor-VAEs (Kim & Mnih, 2018), and improved generalization through non-Euclidean embeddings (Nickel & Kiela, 2017). However, most advances have focused on the feature-level structure of representations, and do not address the issue of representing multiple, potentially repeating objects, which we tackle here.
Another line of work is concerned with obtaining segmentations of images, usually without considering representation learning. This has led to impressive results on real-world images; however, many approaches (such as “semantic segmentation” or object detection) rely on supervised signals (Girshick, 2015; He et al., 2017; Redmon & Farhadi, 2018), while others require hand-engineered features (Shi & Malik, 2000; Felzenszwalb & Huttenlocher, 2004). In contrast, since we learn to both segment and represent, our method can perform inpainting (Figure 1) and deal with ambiguity (Figure 10), going beyond what most methods relying on feature engineering are currently able to do.
Finally, work tackling the full problem of scene representation is rarer. Probabilistic-programming-based approaches, like stroke-based character generation (Lake et al., 2015) or 3D indoor scene rendering (Pero et al., 2012), have produced appealing results, but require carefully engineered generative models, which are typically not fully learned from data. Work on end-to-end models has shown promise in using autoregressive inference or generative approaches (Eslami et al., 2016; Gregor et al., 2015), including the recent MONet (Burgess et al., 2019). Apart from MONet, few methods can achieve similar kinds of results on scenes of the complexity we consider here. However, unlike our model, MONet does not utilize any iterative refinement, which we believe will be important for scaling up to even more challenging datasets.

Closely related to ours is Neural Expectation Maximization (Greff et al., 2017) (along with its sequential and relational extensions (van Steenkiste et al., 2018)), which uses recurrent neural networks to amortize expectation maximization for a spatial mixture model. The Tagger (Greff et al., 2016) uses iterative inference to segment and represent images based on a denoising training objective. However, due to its use of a ladder network, its representation is less explicit than what we can learn with IODINE.

4 Results
We evaluate our model on three main datasets: 1) CLEVR (Johnson et al., 2017), 2) a multi-object version of the dSprites dataset (Matthey et al., 2017), and 3) a dataset of multiple “Tetris”-like pieces that we created. In all cases we train the system using the Adam optimizer (Kingma & Ba, 2015) to minimize the negative ELBO. We varied several hyperparameters, including: number of slots, dimensionality of z_k, number of inference iterations, number of convolutional layers and their filter sizes, batch size, and learning rate. For details of the models and hyperparameters refer to Appendix B.

4.1 Representation Quality
Segmentation
An appealing property of IODINE is that it provides a readily interpretable segmentation of the data, as seen in Figure 4. These examples clearly demonstrate the model's ability to segment out the same objects which were used to generate the dataset, despite never having received supervision to do so.
To quantify segmentation quality, we measure the similarity between ground-truth (instance) segmentations and our predicted object masks using the Adjusted Rand Index (ARI; Rand 1971; Hubert & Arabie 1985). ARI is a measure of clustering similarity that ranges from 0 (chance) to 1 (perfect clustering) and can handle arbitrary permutations of the clusters. We apply it as a measure of instance segmentation quality by treating each foreground pixel (ignoring the background) as one point and its segmentation as its cluster assignment.
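ARI as used here can be sketched in a few lines from the contingency table (the foreground-masking convention and helper name are ours):

```python
import numpy as np
from math import comb

def ari(labels_true, labels_pred):
    """Adjusted Rand Index via the contingency table (Hubert & Arabie, 1985)."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    n = labels_true.size
    _, inv_t = np.unique(labels_true, return_inverse=True)
    _, inv_p = np.unique(labels_pred, return_inverse=True)
    cont = np.zeros((inv_t.max() + 1, inv_p.max() + 1), dtype=int)
    np.add.at(cont, (inv_t, inv_p), 1)          # pairwise co-occurrence counts
    sum_ij = sum(comb(int(v), 2) for v in cont.ravel())
    sum_a = sum(comb(int(v), 2) for v in cont.sum(axis=1))
    sum_b = sum(comb(int(v), 2) for v in cont.sum(axis=0))
    expected = sum_a * sum_b / comb(n, 2)       # chance-level agreement
    return (sum_ij - expected) / (0.5 * (sum_a + sum_b) - expected)

# Foreground-only instance segmentation ARI (background id 0 ignored)
true_seg = np.array([0, 0, 1, 1, 2, 2])
pred_seg = np.array([5, 7, 5, 5, 7, 7])        # same foreground grouping, other labels
fg = true_seg != 0
assert ari(true_seg[fg], pred_seg[fg]) == 1.0  # invariant to label permutation
```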
Our model produces excellent ARI scores on CLEVR, Multi-dSprites, and Tetris, where we report means and standard deviations calculated over five independent random seeds. We attempted to compare these scores to baseline methods such as Neural Expectation Maximization, but neither Relational-NEM nor the simpler RNN-NEM variant could cope well with colored images. As a result, we could only compare our ARI scores on a binarized version of Multi-dSprites and the Shapes dataset. These are summarized in Table 1.

Table 1:

                            IODINE   R-NEM
Binarized Multi-dSprites    0.96     0.53
Shapes                      0.92     0.72
Information Content
The object reconstructions in Figure 4 show that the representations contain all the information about each object. But in what format, and how usable is it? To answer this question, we associate each ground-truth object with its corresponding z_k based on the segmentation masks. We then train a single-layer network to predict ground-truth factors for each object. Note that this predictor is trained after the model has finished training (i.e. no supervised fine-tuning). It tells us whether a linear mapping is sufficient to extract information like the color, position, shape or size of an object from its latent representation, and gives an important indication of the usefulness of the representation. The results in Figure 5 clearly show that a linear mapping suffices to extract this information to high accuracy. This is in contrast with the scene representations learned by a standard VAE, for which even training the factor predictor is difficult, as there is no obvious way to align objects with features. To make this comparison, we chose a canonical ordering of the objects based on their size, material, shape, and position (with decreasing precedence). The precedence of features was intended as a heuristic to maximize the predictability of the ordering. We then trained a linear network to predict the concatenated features of the canonically ordered objects from the latent scene representation. As the results in Figure 5 indicate, the information is present, but in a much less explicit and usable form.

Disentanglement
Disentanglement is another important and desirable property of representations (Bengio et al., 2013): it captures how well learned features separate and correspond to individual, interpretable factors of variation in the data. While its precise definition is still highly debated (Higgins et al., 2018; Eastwood & Williams, 2018; Ridgeway & Mozer, 2018; Locatello et al., 2018), the concept has generated a lot of interest recently. Good disentanglement is believed to lead to both better generalization and more interpretable features (Lake et al., 2016; Higgins et al., 2017b). Interestingly, for these advantages to bear out, disentangled features seem most useful for properties of single objects, such as color, position, or shape. It is much less clear how to operationalize this in order to create disentangled representations of entire scenes with variable numbers of objects. And indeed, if we take a VAE that successfully disentangles features on a single-object dataset, we find that its representation becomes highly entangled on a multi-object dataset (see Figure 6, left vs middle). IODINE, on the other hand, successfully learns disentangled representations, because it is able to first decompose the scene and then represent individual objects (Figure 6, right). In Figure 6 we show traversals of the most important features (selected by KL) of a standard VAE vs IODINE. While the standard VAE clearly entangles many properties, even across multiple objects, IODINE is able to neatly separate them.
Generalization
Finally, we can ask directly: does the system generalize to novel scenes in a systematic way? Specifically, does it generalize to scenes with more or fewer objects than ever encountered during training? Slots are exchangeable by design, so we can freely vary the number of slots at test time (more on this in Section 4.2). In Figure 7 we qualitatively show the performance of a system that was trained on scenes with up to 6 objects, but evaluated on scenes with 9 objects. In Figure 8(a), the orange boxes show that, even quantitatively, segmentation performance decreases only slightly when generalizing to more objects.
A more extreme form of generalization involves handling unseen feature combinations. To test this, we trained our system on a subset of CLEVR that does not contain green spheres (though it does contain spheres and other green objects), and then tested what the system does when confronted with a green sphere. As Figure 7 shows, IODINE is still able to represent green spheres, despite never having seen this combination during training.
4.2 Robustness & Ablation
Now that we have established the usefulness of the object representations produced by IODINE, we turn our attention to investigating its behavior in more detail.
Iterations
The number of iterations is one of the central hyperparameters of our approach. To investigate its impact, we trained four models with 1, 2, 4 and 6 iterations on CLEVR, and evaluated them all using 15 iterations (c.f. Figure 8). The first thing to note is that inference converges very quickly, within the first 3-5 iterations, after which neither the segmentation nor the reconstruction change much. The second important finding is that the system remains stable for much longer than the number of iterations it was trained with. The model even further improves segmentation and reconstruction when run for more iterations, though it eventually starts to diverge after about two to three times the number of training iterations, as can be seen in the blue and orange curves in Figure 8.
Slots
The other central parameter of IODINE is the number of slots K, as it controls the maximum number of objects the system can separate. It is important to distinguish varying K during training from varying it at test time. As can be seen in Figure 9, if the model was trained with sufficiently many slots to fit all objects, then test-time behavior generalizes very well. Typical behavior (not shown) is to leave excess slots empty, and when confronted with too many objects the model will often completely ignore some of them, leaving the other object representations mostly intact. As mentioned in Section 4.1, given enough slots at test time, such a model can even segment and represent scenes of higher complexity (more objects) than any scene encountered during training (see Figure 7 and the orange boxes in Figure 9). If, on the other hand, the model was trained with too few slots to hold all objects, its performance suffers substantially. This happens because the only way to reconstruct the entire scene during training is then to consistently represent multiple objects per slot, which leads the model to learn inefficient and entangled representations akin to the VAE in Figure 6 (also apparent from their much higher KL in Figure 8(c)). Once learned, this suboptimal strategy cannot be mitigated by increasing the number of slots at test time, as can be seen from their decreased performance in Figure 8(a).
Ablations
We ablated each of the different inputs to the refinement network described in Section 2.2. Broadly, we found that individually removing an input did not noticeably affect the results (with two exceptions noted below); see Figures 29-36 in the Appendix, which demonstrate this lack of effect on different terms of the model's loss and on the ARI segmentation score for both CLEVR and Tetris. A more comprehensive analysis could ablate combinations of inputs and identify synergistic or redundant groups, and thus potentially simplify the model. We did not pursue this direction, since none of the inputs incurs any noticeable computational overhead, and at some point during our experimentation each of them contributed towards stable training behavior.
The main exceptions to the above concern two of the inputs. Computing the gradient ∇_λ L requires an entire backward pass through the decoder, and accounts for a significant fraction of the computational cost of the entire model. But we found that it often substantially improves performance and training convergence, which justifies its inclusion. Another somewhat surprising finding was that, for the Tetris dataset, removing the second of these inputs had a pronounced detrimental effect, while for CLEVR its removal was negligible. At this point we do not have a good explanation for this effect.
4.3 MultiModality and MultiStability
Standard VAEs are unable to represent multimodal posteriors, because q_λ(z|x) is parameterized as a unimodal Gaussian distribution. However, as demonstrated in Figure 10, IODINE can actually handle this problem quite well. So what is going on? It turns out that this is an important side effect of iterative variational inference that, to the best of our knowledge, has not been noticed before: the stochasticity at each iteration, which results from sampling z to approximate the likelihood, implicitly acts as an auxiliary (inference) random variable. This effect compounds over iterations, and is possibly amplified by the slot structure and the effective message passing between slots over the course of iterations. In effect, the model can implicitly represent multiple modes (if integrated over all ways of sampling z) and thus converge to different modes (see Figure 10, left) depending on these samples. This does not happen in a regular VAE, where stochasticity never enters the inference process. Also, if we had an exact and deterministic way to compute the likelihood and its gradient, this effect would vanish.

A neat side effect of this is the ability of IODINE to elegantly capture ambiguous (aka multi-stable) segmentations, such as the ones shown in Figure 10. We presented the model with an ambiguous arrangement of Tetris blocks, which has three different yet equally valid "explanations" (given the data distribution). When we evaluate an IODINE model on this image, we get different segmentations on different evaluations. Some of these correspond to different slot orderings (1st vs 3rd row). But we also find qualitatively different segmentations (e.g. 3rd vs 4th row) that correspond to different interpretations of the scene. Multi-stability is a well-studied and pervasive feature of human perception that is important for handling ambiguity, and one that is not modelled by any of the standard image recognition networks.
5 Discussion and Future Work
We have introduced IODINE, a novel approach for unsupervised representation learning of multiobject scenes, based on amortized iterative refinement of the inferred latent representation. We analyzed IODINE’s performance on various datasets, including realistic images containing variable numbers of partially occluded 3D objects, and demonstrated that our method can successfully decompose the scenes into objects and represent each of them in terms of their individual properties such as color, size, and material. IODINE can robustly deal with occlusions by inpainting covered sections, and generalises beyond the training distribution in terms of numerosity and objectproperty combinations. Furthermore, when applied to scenes with ambiguity in terms of their object decomposition, IODINE can represent – and converge to – multiple valid solutions given the same input image.
We also probed the limits of our current setup by applying IODINE to the Textured MNIST dataset (Greff et al., 2016) and to ImageNet, testing how it would deal with texture segmentation and more complex real-world data (Figure 11). Trained on ImageNet data, IODINE segmented mostly by color rather than by objects. This behavior is interesting, but also expected: ImageNet was never designed as a dataset for unsupervised learning, and likely lacks the richness in poses, lighting, sizes, positions and distance variations required to learn object segmentations from scratch. Trained on Textured MNIST, IODINE was able to model the background, but mostly failed to capture the foreground digits. Together these results point to the importance of color as a strong cue for segmentation, especially early in the iterative refinement process. As demonstrated by our results on grayscale CLEVR (Figure 10(c)) though, color is not a requirement, and instead seems to speed up and stabilize the learning.
Beyond furnishing the model with more diverse training data, we want to highlight three other promising directions for scaling IODINE to richer real-world data. First, the iterative nature of IODINE makes it an ideal candidate for extension to sequential data. This is an especially attractive future direction, because temporal data naturally contains rich statistics about objectness, both in the movement itself and in the smooth variations of object factors. IODINE can in fact readily be applied to sequences by feeding a new frame at every iteration, and we have done some preliminary experiments, described in Section B.4. As a nice side effect, the model automatically maintains the object-to-slot association, turning it into an unsupervised object tracker. However, IODINE in its current form has limited abilities for modelling dynamics, so extending the model in this direction is promising.
Physical interaction between objects is another common occurrence in sequential data, which can serve as a further strong cue for object decomposition. Similarly, even objects placed within a static scene commonly adhere to certain relations among each other, such as cars appearing on streets rather than on houses. Currently, however, IODINE assumes the objects to be placed independently of each other, and relaxing this assumption will be important for modelling physical interactions.
Yet while learning about the statistical properties of relations should help scene decomposition, there is also a need to balance this with a strong independence assumption between objects, since the system should still be able to segment out a car that is floating in the sky. Thus, the question of how strongly relations should factor into unsupervised object discovery is an open research problem that will be exciting to tackle in future work. In particular, we believe integration with some form of graph network to support relations while preserving slot symmetry to be a promising direction.
Ultimately, object representations have to be useful, such as for supervised tasks like caption generation, or for agents in reinforcement learning setups. Whatever the task, it will likely provide important feedback about which objects matter and which are irrelevant. Complex visual scenes can contain an extremely large number of potential objects (think of the sand grains on a beach), which can make it infeasible to represent them all simultaneously. Thus, allowing task-related signals to bias the selection of what and how to decompose may enable scaling up unsupervised scene representation learning approaches like IODINE to arbitrarily complex scenes.
Acknowledgements
We would like to thank Danilo Rezende, Sjoerd van Steenkiste, and Malcolm Reynolds for helpful suggestions and generous support.
References
 Andrychowicz et al. (2016) Andrychowicz, M., Denil, M., Gomez, S., Hoffman, M. W., Pfau, D., Schaul, T., Shillingford, B., and De Freitas, N. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pp. 3981–3989, 2016.
 Ba et al. (2016) Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization. arXiv:1607.06450 [cs, stat], July 2016.
 Bengio et al. (2013) Bengio, Y., Courville, A., and Vincent, P. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell., 35(8):1798–1828, 2013.
 Burgess et al. (2019) Burgess, C. P., Matthey, L., Watters, N., Kabra, R., Higgins, I., Botvinick, M., and Lerchner, A. MONet: Unsupervised scene decomposition and representation. arXiv preprint, January 2019.
 Chen et al. (2016) Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., and Abbeel, P. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. arXiv:1606.03657 [cs, stat], June 2016.
 Clevert et al. (2015) Clevert, D.A., Unterthiner, T., and Hochreiter, S. Fast and accurate deep network learning by exponential linear units (ELUs). November 2015.
 Eastwood & Williams (2018) Eastwood, C. and Williams, C. K. I. A framework for the quantitative evaluation of disentangled representations. ICLR, 2018.
 Eslami et al. (2016) Eslami, S. M. A., Heess, N., Weber, T., Tassa, Y., Szepesvari, D., Kavukcuoglu, K., and Hinton, G. E. Attend, infer, repeat: Fast scene understanding with generative models. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 29, pp. 3225–3233. Curran Associates, Inc., 2016.
 Felzenszwalb & Huttenlocher (2004) Felzenszwalb, P. F. and Huttenlocher, D. P. Efficient Graph-Based image segmentation. Int. J. Comput. Vis., 59(2):167–181, September 2004.
 Girshick (2015) Girshick, R. B. Fast RCNN. CoRR, abs/1504.08083, 2015. URL http://arxiv.org/abs/1504.08083.
 Greff et al. (2016) Greff, K., Rasmus, A., Berglund, M., Hao, T. H., Valpola, H., and Schmidhuber, J. Tagger: Deep unsupervised perceptual grouping. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 29, pp. 4484–4492, 2016.
 Greff et al. (2017) Greff, K., van Steenkiste, S., and Schmidhuber, J. Neural expectation maximization. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30, pp. 6694–6704. Curran Associates, Inc., 2017.
 Gregor et al. (2015) Gregor, K., Danihelka, I., Graves, A., Rezende, D. J., and Wierstra, D. DRAW: A recurrent neural network for image generation. In Proceedings of the 32nd International Conference on Machine Learning, ICML'15, pp. 1462–1471, Lille, France, 2015. JMLR.org.
 He et al. (2017) He, K., Gkioxari, G., Dollár, P., and Girshick, R. B. Mask R-CNN. CoRR, abs/1703.06870, 2017. URL http://arxiv.org/abs/1703.06870.
 Higgins et al. (2017a) Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. beta-VAE: Learning basic visual concepts with a constrained variational framework. In Proceedings of the International Conference on Learning Representations (ICLR), 2017a.
 Higgins et al. (2017b) Higgins, I., Pal, A., Rusu, A., Matthey, L., Burgess, C., Pritzel, A., Botvinick, M., Blundell, C., and Lerchner, A. DARLA: Improving zero-shot transfer in reinforcement learning. ICML, 2017b.
 Higgins et al. (2018) Higgins, I., Amos, D., Pfau, D., Racanière, S., Matthey, L., Rezende, D. J., and Lerchner, A. Towards a definition of disentangled representations. CoRR, abs/1812.02230, 2018. URL http://arxiv.org/abs/1812.02230.
 Hubert & Arabie (1985) Hubert, L. and Arabie, P. Comparing partitions. J. Classification, 2(1):193–218, December 1985.
 Johnson et al. (2017) Johnson, J., Hariharan, B., van der Maaten, L., FeiFei, L., Zitnick, C. L., and Girshick, R. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pp. 1988–1997. openaccess.thecvf.com, 2017.
 Kim & Mnih (2018) Kim, H. and Mnih, A. Disentangling by factorising. February 2018.
 Kingma & Ba (2015) Kingma, D. and Ba, J. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.
 Kingma & Welling (2013) Kingma, D. P. and Welling, M. Autoencoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
 Lake et al. (2015) Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. Humanlevel concept learning through probabilistic program induction. Science, 350(6266):1332–1338, December 2015.
 Lake et al. (2016) Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gershman, S. J. Building machines that learn and think like people. Behavioral and Brain Sciences, pp. 1–101, 2016.
 Locatello et al. (2018) Locatello, F., Bauer, S., Lucic, M., Gelly, S., Schölkopf, B., and Bachem, O. Challenging common assumptions in the unsupervised learning of disentangled representations. arXiv preprint arXiv:1811.12359, 2018.
 Maaten & Hinton (2008) Maaten, L. v. d. and Hinton, G. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
 Makhzani et al. (2015) Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I., and Frey, B. Adversarial autoencoders. November 2015.
 Marino et al. (2018a) Marino, J., Cvitkovic, M., and Yue, Y. A general method for amortizing variational filtering. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 7868–7879. Curran Associates, Inc., 2018a.
 Marino et al. (2018b) Marino, J., Yue, Y., and Mandt, S. Iterative amortized inference. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 3403–3412, Stockholmsmässan, Stockholm Sweden, 2018b. PMLR.
 Matthey et al. (2017) Matthey, L., Higgins, I., Hassabis, D., and Lerchner, A. dSprites: Disentanglement testing sprites dataset. https://github.com/deepmind/dsprites-dataset/, 2017.
 Nickel & Kiela (2017) Nickel, M. and Kiela, D. Poincaré embeddings for learning hierarchical representations. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30, pp. 6338–6347. Curran Associates, Inc., 2017.

 Pascanu et al. (2012) Pascanu, R., Mikolov, T., and Bengio, Y. Understanding the exploding gradient problem. CoRR, abs/1211.5063, 2012.
 Pero et al. (2012) Pero, L. D., Bowdish, J., Fried, D., Kermgard, B., Hartley, E., and Barnard, K. Bayesian geometric modeling of indoor scenes. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2719–2726. ieeexplore.ieee.org, June 2012.
 Rand (1971) Rand, W. M. Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc., 66(336):846–850, December 1971.
 Redmon & Farhadi (2018) Redmon, J. and Farhadi, A. Yolov3: An incremental improvement. CoRR, abs/1804.02767, 2018. URL http://arxiv.org/abs/1804.02767.
 Reichert & Serre (2013) Reichert, D. P. and Serre, T. Neuronal synchrony in complex-valued deep networks. arXiv:1312.6115 [cs, q-bio, stat], December 2013.
 Rezende et al. (2014) Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In Xing, E. P. and Jebara, T. (eds.), Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pp. 1278–1286, Beijing, China, 2014. PMLR.
 Ridgeway & Mozer (2018) Ridgeway, K. and Mozer, M. C. Learning deep disentangled embeddings with the fstatistic loss. NIPS, 2018.
 Shi & Malik (2000) Shi, J. and Malik, J. Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 22(8):888–905, August 2000.
 van Steenkiste et al. (2018) van Steenkiste, S., Chang, M., Greff, K., and Schmidhuber, J. Relational neural expectation maximization: Unsupervised discovery of objects and their interactions. In Proceedings of the International Conference on Learning Representations (ICLR), January 2018.
 Watters et al. (2019) Watters, N., Matthey, L., Burgess, C. P., and Lerchner, A. Spatial broadcast decoder: A simple architecture for learning disentangled representations in VAEs. arXiv:1901.07017, 2019.
Appendix A Dataset Details
a.1 Clevr
We regenerated the CLEVR dataset (Johnson et al., 2017) using their open-source code, because we needed ground-truth segmentations for all objects. The dataset contains 70 000 images with a resolution of pixels, from which we extract a square center crop of and scale it to pixels. Each scene contains between three and ten objects, each of which is characterized in terms of shape (cube, cylinder, or sphere), size (small or large), material (rubber or metal), color (8 different colors), position (continuous), and rotation (continuous). We do not make use of the question-answering task. Figure 12 shows a few samples from the dataset.
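As an illustration, the center-crop-and-rescale preprocessing could be sketched as below. The exact crop and output sizes are not given in this excerpt, so the arguments are placeholders, and nearest-neighbour resampling stands in for whatever interpolation was actually used:

```python
import numpy as np

def center_crop_and_resize(image, crop, out_size):
    """Extract a square center crop of side `crop` and downscale it to
    `out_size` x `out_size` via nearest-neighbour sampling (sketch only)."""
    h, w = image.shape[:2]
    top = (h - crop) // 2
    left = (w - crop) // 2
    cropped = image[top:top + crop, left:left + crop]
    # nearest-neighbour index map from output pixels to crop pixels
    idx = np.arange(out_size) * crop // out_size
    return cropped[idx][:, idx]
```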
a.2 Multi dSprites
This dataset, based on the dSprites dataset (Matthey et al., 2017), consists of 60 000 images with a resolution of . Each image contains two to five random sprites, which vary in terms of shape (square, ellipse, or heart), color (uniform saturated colors), scale (continuous), position (continuous), and rotation (continuous). Furthermore, the background color is varied in brightness but always remains grayscale. Figure 13 shows a few samples from the dataset.
We also used a binary version of Multi dSprites, where the sprites are always white and the background is always black.
a.3 Tetris
We generated this dataset of 60 000 images by placing three random Tetrominoes without overlap in an image of pixels. Each Tetromino is composed of four blocks that are each pixels. There are a total of 17 different Tetrominoes (counting rotations). We randomly color each Tetromino with one of 6 colors (red, green, blue, cyan, magenta, or yellow). Figure 14 shows a few samples from the dataset.
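The sampling procedure described above can be sketched as follows. Since the image and block sizes are not given in this excerpt, the defaults (35-pixel images, 5-pixel blocks) are assumptions, and only three of the 17 tetromino rotations are listed for brevity:

```python
import numpy as np

# Illustrative subset of the 17 tetromino rotations; each shape is a list of
# (row, col) block offsets.
TETROMINOES = [
    [(0, 0), (0, 1), (0, 2), (0, 3)],  # I (horizontal)
    [(0, 0), (0, 1), (1, 0), (1, 1)],  # O
    [(0, 0), (1, 0), (2, 0), (2, 1)],  # L
]
COLORS = np.array([
    [255, 0, 0], [0, 255, 0], [0, 0, 255],        # red, green, blue
    [0, 255, 255], [255, 0, 255], [255, 255, 0],  # cyan, magenta, yellow
], dtype=np.uint8)

def sample_tetris_image(size=35, block=5, n_pieces=3, rng=None):
    """Place `n_pieces` random tetrominoes without overlap on a block grid;
    returns the RGB image and per-piece ground-truth masks."""
    rng = rng or np.random.default_rng()
    grid = size // block
    occupied = np.zeros((grid, grid), dtype=bool)
    image = np.zeros((size, size, 3), dtype=np.uint8)
    masks = np.zeros((n_pieces, size, size), dtype=bool)
    for k in range(n_pieces):
        while True:  # rejection-sample a non-overlapping placement
            shape = TETROMINOES[rng.integers(len(TETROMINOES))]
            h = max(r for r, _ in shape) + 1
            w = max(c for _, c in shape) + 1
            r0, c0 = rng.integers(grid - h + 1), rng.integers(grid - w + 1)
            cells = [(r0 + r, c0 + c) for r, c in shape]
            if not any(occupied[r, c] for r, c in cells):
                break
        color = COLORS[rng.integers(len(COLORS))]
        for r, c in cells:
            occupied[r, c] = True
            image[r * block:(r + 1) * block, c * block:(c + 1) * block] = color
            masks[k, r * block:(r + 1) * block, c * block:(c + 1) * block] = True
    return image, masks
```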
a.4 Shapes
We use the same shapes dataset as Reichert & Serre (2013). It contains 60 000 binary images of size , each containing three random shapes from the set .
a.5 Objects Room
For the preliminary sequential experiments, we used a sequential version of the Objects Room dataset (Burgess et al., 2019). This dataset consists of 64×64 RGB images of a cubic room, with randomly colored walls, floors, and objects randomly scattered around the room. The camera is always positioned on a ring inside the room, always facing towards the centre, and oriented vertically in the range . There are 3 randomly shaped objects in the room, with 1–3 objects visible in any given frame. This version contains sequences of camera flights over 16 time steps, with the camera position and angle (within the above constraints) changing according to a fixed velocity over the entire sequence (with a random velocity sampled for each sequence).
Appendix B Model and Hyperparameter Details
Training
Unless otherwise specified, all models were trained with the ADAM optimizer (Kingma & Ba, 2015), with default parameters and a learning rate of . We used gradient clipping as recommended by Pascanu et al. (2012): if the norm of the global gradient exceeds a threshold, the gradient is scaled down to that norm. Note that this is virtually always the case, as the gradient norm typically exceeds the threshold, but we nonetheless found it useful to apply this strategy. Finally, the batch size was 32 (split across GPUs).
Architecture
All layers use the ELU activation function (Clevert et al., 2015), and convolutional layers use a stride of 1, unless mentioned otherwise.
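The global-norm gradient clipping described above can be sketched as follows. The threshold of 5.0 is a placeholder, since the paper's actual value is not given in this excerpt:

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Scale all gradients jointly so that their global L2 norm is at most
    `max_norm`; gradients below the threshold are left unchanged."""
    global_norm = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
    scale = min(1.0, max_norm / (global_norm + 1e-12))
    return [g * scale for g in grads], global_norm
```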
Inputs
For all models, we use the following inputs to the refinement network, where LN means Layernorm and SG means stop gradients:
Description | Formula | LN | SG
----------- | ------- | -- | --
image | | |
means | | |
mask | | |
mask-logits | | |
gradient of means | | ✓ | ✓
gradient of mask | | ✓ | ✓
gradient of posterior | | ✓ | ✓
posterior | | |
mask posterior | | |
pixelwise likelihood | | ✓ | ✓
leave-one-out likelihood | | ✓ | ✓
two coordinate channels | | |
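A minimal sketch of the per-channel layer normalization applied to the checked (LN) inputs above. Normalizing each channel over its spatial dimensions is an assumption; the stop-gradient (SG) is a framework-level operation (e.g., detach) with no numpy analogue and is therefore not shown:

```python
import numpy as np

def spatial_layernorm(x, eps=1e-5):
    """Normalize each channel map of x (shape: channels x height x width) to
    zero mean and unit variance across its spatial dimensions."""
    mean = x.mean(axis=(1, 2), keepdims=True)
    std = x.std(axis=(1, 2), keepdims=True)
    return (x - mean) / (std + eps)
```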
b.1 Clevr
Most models were trained on the scenes with 3–6 objects from the CLEVR dataset, had slots, and used iterations. However, we varied these parameters as mentioned in the text. The rest of the architecture and hyperparameters are described in the following.
Decoder
4 Convolutional layers.
Kernel Size | Nr. Channels | Comment
----------- | ------------ | -------
 | 128 |
 | 64 |
 | 64 |
 | 64 |
 | 64 |
 | 4 | Output: RGB + Mask
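The decoder receives the latent broadcast across space together with coordinate channels (cf. the "coordinates" comments in the decoder tables below and the spatial broadcast decoder of Watters et al., 2019). A sketch of that broadcast step, assuming coordinates in [-1, 1]:

```python
import numpy as np

def spatial_broadcast(z, height, width):
    """Tile a latent vector z over a height x width grid and append two
    coordinate channels, producing the input to the convolutional stack."""
    tiled = np.tile(z[:, None, None], (1, height, width))
    ys = np.linspace(-1.0, 1.0, height)
    xs = np.linspace(-1.0, 1.0, width)
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    return np.concatenate([tiled, yy[None], xx[None]], axis=0)
```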
Refinement Network
4 convolutional layers with stride 2, followed by a two-layer MLP, one LSTM layer, and finally a linear update layer.
Type | Size/Channels | Act. Func. | Comment
---- | ------------- | ---------- | -------
MLP | 128 | Linear | Update
LSTM | 256 | Tanh |
MLP | 256 | |
Conv | 64 | |
Conv | 64 | |
Conv | 64 | |
Conv | 64 | |
 | 18 | | Inputs
b.2 Multi dSprites
Models were trained with slots, and used iterations.
Decoder
4 Convolutional layers.
Kernel Size | Nr. Channels | Comment
----------- | ------------ | -------
 | 32 | coordinates
 | 32 |
 | 32 |
 | 32 |
 | 32 |
 | 4 | Output: RGB + Mask
Refinement Network
3 convolutional layers with stride 2, followed by a one-layer MLP, one LSTM layer, and an update layer.
Type | Size/Channels | Act. Func. | Comment
---- | ------------- | ---------- | -------
MLP | 32 | Linear | Update
LSTM | 128 | Tanh |
MLP | 128 | |
Conv | 32 | |
Conv | 32 | |
Conv | 32 | |
 | 24 | | Inputs
b.3 Tetris
Models were trained with slots, and used iterations.
Decoder
4 Convolutional layers.
Kernel Size | Nr. Channels | Comment
----------- | ------------ | -------
 | 64 | coordinates
 | 32 |
 | 32 |
 | 32 |
 | 32 |
 | 4 | Output: RGB + Mask
Refinement Network
3 convolutional layers with stride 2, followed by a one-layer MLP and an update layer.
Type | Size/Channels | Act. Func. | Comment
---- | ------------- | ---------- | -------
MLP | 64 | Linear | Update
MLP | 128 | |
Conv | 32 | |
Conv | 32 | |
Conv | 32 | |
 | 24 | | Inputs
b.4 Sequences
The iterative nature of IODINE lends itself readily to sequential data, e.g., by feeding a new frame at every iteration instead of the same input image. This setup corresponds to one iteration per timestep, and to using next-step prediction instead of reconstruction as part of the training objective. An example of this can be seen in Figure 15, where we show a 16-timestep sequence along with reconstructions and masks. When used in this way, the model automatically maintains the association of objects to slots over time (i.e., it displays robust slot stability). Thus, object tracking comes almost for free as a byproduct of IODINE. Note, though, that IODINE has to rely on the LSTM in the inference network to model any dynamics, which means that none of the dynamics of tracked objects (e.g., velocity) will be part of the object representation.
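Schematically, this one-refinement-iteration-per-frame usage can be written as the loop below. The callables `init_posterior`, `decode`, and `refine_step` are hypothetical stand-ins for the corresponding IODINE components, not the paper's actual interfaces:

```python
def iodine_sequence(frames, init_posterior, decode, refine_step):
    """Run one refinement iteration per video frame: decode the current
    posterior, then refine it against the newly observed frame. The LSTM
    state carried inside `refine_step` would model any dynamics."""
    posterior = init_posterior()
    outputs = []
    for frame in frames:
        recon_and_masks = decode(posterior)       # decode current slots
        posterior = refine_step(posterior, frame, recon_and_masks)
        outputs.append(recon_and_masks)
    return outputs
```

With toy callables in place of the networks, the loop threads the posterior through the sequence while emitting per-frame reconstructions.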
Appendix C More Plots
Figures 16–18 show additional decomposition samples on our datasets. Figure 19 shows a complete version of Figure 7, showing all individual masked reconstruction slots. Figures 20–22 show a comparison between the object reconstructions and the mask logits used for assigning decoded latents to pixels. Figures 23–25 demonstrate how object latents are clustered when projected onto the first two principal components of the latent distribution. Figures 26–28 show how object latents are clustered when projected onto a t-SNE (Maaten & Hinton, 2008) of the latent distribution. Figures 29–36 give an overview of the impact of each of the inputs to the refinement network on the total loss, mean squared reconstruction error, KL divergence loss term, and the ARI segmentation performance (excluding the background pixels) on the CLEVR and Tetris datasets.
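For reference, the ARI score (Rand, 1971; Hubert & Arabie, 1985) on flattened segmentations can be computed as follows. This is a generic sketch and does not include the background-pixel exclusion mentioned above:

```python
from math import comb

import numpy as np

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same pixels:
    1.0 for identical partitions, ~0 for random ones."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    n = labels_true.size
    # contingency table between ground-truth segments and predicted slots
    _, class_idx = np.unique(labels_true, return_inverse=True)
    _, cluster_idx = np.unique(labels_pred, return_inverse=True)
    table = np.zeros((class_idx.max() + 1, cluster_idx.max() + 1), dtype=int)
    np.add.at(table, (class_idx, cluster_idx), 1)
    sum_comb = sum(comb(int(v), 2) for v in table.ravel())
    sum_a = sum(comb(int(v), 2) for v in table.sum(axis=1))
    sum_b = sum(comb(int(v), 2) for v in table.sum(axis=0))
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case: trivial partitions
        return 1.0
    return (sum_comb - expected) / (max_index - expected)
```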