Multi-Object Representation Learning with Iterative Variational Inference

03/01/2019 ∙ by Klaus Greff, et al.

Human perception is structured around objects which form the basis for our higher-level cognition and impressive systematic generalization abilities. Yet most work on representation learning focuses on feature learning without even considering multiple objects, or treats segmentation as an (often supervised) preprocessing step. Instead, we argue for the importance of learning to segment and represent objects jointly. We demonstrate that, starting from the simple assumption that a scene is composed of multiple entities, it is possible to learn to segment images into interpretable objects with disentangled representations. Our method learns -- without supervision -- to inpaint occluded parts, and extrapolates to scenes with more objects and to unseen objects with novel feature combinations. We also show that, due to the use of iterative variational inference, our system is able to learn multi-modal posteriors for ambiguous inputs and extends naturally to sequences.




1 Introduction

Figure 1: Object decomposition of an image from the CLEVR dataset by IODINE. The model is able to decompose the image into separate objects in an unsupervised manner, inpainting occluded objects in the process (see slots (b), (e) and (f)).

Learning good representations of complex visual scenes is a challenging problem for artificial intelligence that is far from solved. Recent breakthroughs in unsupervised representation learning (Higgins et al., 2017a; Makhzani et al., 2015; Chen et al., 2016) tend to focus on data where a single object of interest is placed in front of some background (e.g. dSprites, 3D Chairs, CelebA). Yet in general, visual scenes contain a variable number of objects arranged in various spatial configurations, often with partial occlusions (e.g., CLEVR, Johnson et al. 2017; see Figure 1). This motivates the question: what forms a good representation of a scene with multiple objects? In line with recent advances (Burgess et al., 2019; van Steenkiste et al., 2018; Eslami et al., 2016), we maintain that the discovery of objects in a scene should be considered a crucial aspect of representation learning, rather than treated as a separate problem.

We approach the problem from a spatial mixture model perspective (Greff et al., 2017) and use amortized iterative refinement (Marino et al., 2018b) of latent object representations within a variational framework (Rezende et al., 2014; Kingma & Welling, 2013). We encode our basic intuition about the existence of objects into the structure of our model, which simultaneously facilitates their discovery and efficient representation in a fully data-driven, unsupervised manner. We name the resulting architecture IODINE (short for Iterative Object Decomposition Inference NEtwork).

IODINE can segment complex scenes and learn disentangled object features without supervision on datasets like CLEVR, Objects Room (Burgess et al., 2019), and Tetris (see Appendix A). We show systematic generalization to more objects than included in the training regime, as well as to objects formed from unseen feature combinations. Comparison with a VAE's single-slot representations highlights the benefits of multi-object representation learning. We also show how the sampling used in iterative refinement enables the model to resolve multi-modal and multi-stable decompositions.

2 Method

We first express the assumptions required for multi-object representation learning within the framework of generative modelling (Section 2.1). Then, building upon the successful Variational AutoEncoder framework (VAEs; Rezende et al. 2014; Kingma & Welling 2013), we leverage variational inference to jointly learn both the generative and the inference model (Section 2.2). There we also discuss the particular challenges that arise for inference in this context and show how they can be solved using iterative amortization. Finally, in Section 2.3 we bring all elements together and show how the complete system can be trained end-to-end by simply maximizing its Evidence Lower Bound (ELBO).

2.1 Multi-Object Representations

Flat vector representations as used by standard VAEs are inadequate for capturing the combinatorial object structure that many datasets exhibit. To achieve the kind of systematic generalization that is so natural for humans, we propose employing a multi-slot representation, where each slot shares the underlying representation format and each ideally describes an independent part of the input. Consider the example in Figure 1: by construction, the scene consists of 8 objects, each with its own properties such as shape, size, position, color and material. A flat representation would have to describe each object using separate feature dimensions, which neglects the simple and (to us) trivial fact that they are interchangeable objects with common properties.

(a) VAE
(b) Multi-object VAE
(c) IODINE multi-object decoder
(d) IODINE neural architecture
Figure 2: Generative model illustrations. (a) A regular VAE decoder. (b) A hypothetical multi-object VAE decoder that recomposes the scene from three objects. (c) IODINE's multi-object decoder showing the latent vectors z_k corresponding to the K objects, refined over the iterations. The deterministic pixel-wise means and masks are denoted μ_ik and m_ik respectively, and D denotes the dimensionality of the input or reconstructed image. (d) The neural architecture of IODINE's multi-object spatial mixture decoder.

Generative Model

We represent each scene with K latent object representations z_k that collaborate to generate the input image x (c.f. Figure 2(b)). The z_k are assumed to be independent, and their generative mechanism is shared such that any ordering of them produces the same image (i.e. entailing permutation invariance). Objects distinguished in this way can easily be compared, reused and recombined, thus facilitating combinatorial generalization.

The image x is modeled with a spatial Gaussian mixture model where each mixing component (slot) corresponds to a single object. That means each object vector z_k is decoded into a pixel-wise mean μ_ik (the appearance of the object) and a pixel-wise assignment m_ik (the segmentation mask; c.f. Figure 2(c)). Assuming that the pixels i are independent conditioned on z, the likelihood thus becomes:

p(x | z) = ∏_{i=1}^{D} ∑_{k=1}^{K} m_ik · N(x_i; μ_ik, σ²),     (1)

where we use a global fixed variance σ² for all pixels.
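As a concrete illustration, this mixture likelihood can be sketched in a few lines of NumPy. The shapes and names below (K slots, D flattened pixels) are our own choices for the sketch, not the paper's code:

```python
import numpy as np

def mixture_log_likelihood(x, mu, m, sigma=0.1):
    """Spatial Gaussian mixture log-likelihood in the spirit of Eq. 1.

    x:  (D,)   input image, flattened to D pixels
    mu: (K, D) per-slot pixel-wise means
    m:  (K, D) per-slot masks, summing to 1 over slots for each pixel
    """
    # Gaussian log-density of each pixel under each slot, with fixed variance.
    log_gauss = -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)
    # Mixture over slots for each pixel, then independence over pixels.
    per_pixel = np.log(np.sum(m * np.exp(log_gauss), axis=0))  # (D,)
    return per_pixel.sum()
```

If one slot reconstructs the image exactly and owns every pixel (mask 1), this reduces to a plain diagonal Gaussian log-likelihood, which is a quick sanity check.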

Decoder Structure

Our decoder network structure directly reflects the structure of the generative model; see Figure 2(d) for an illustration. Each object latent z_k is decoded separately into pixel-wise means μ_k and mask-logits m̂_k, which we then normalize using a softmax operation applied across slots, such that the masks for each pixel sum to 1. Together, μ and m parameterize the spatial mixture distribution as defined in Equation 1. For the network architecture we use a broadcast decoder (Watters et al., 2019), which spatially replicates the latent vector z_k, appends two coordinate channels (ranging from -1 to 1 horizontally and vertically), and applies a series of size-preserving convolutional layers. This structure encourages disentangling the position across the image from other features such as color or texture, and generally supports disentangling. All slots share weights to ensure a common format, and are decoded independently, up until the mask normalization.
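The broadcast step itself is simple to make concrete. The following NumPy snippet is our own minimal version (not the paper's code): it tiles a latent across an H×W grid and appends the two coordinate channels that a stack of size-preserving convolutions would then decode:

```python
import numpy as np

def spatial_broadcast(z, height, width):
    """Tile latent z over a grid and append x/y coordinate channels.

    z: (L,) latent vector  ->  (L + 2, height, width) tensor.
    """
    # Replicate the latent at every spatial position.
    grid = np.tile(z[:, None, None], (1, height, width))
    # Coordinate channels spanning [-1, 1] vertically and horizontally.
    ys = np.linspace(-1.0, 1.0, height)
    xs = np.linspace(-1.0, 1.0, width)
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    return np.concatenate([grid, yy[None], xx[None]], axis=0)
```

Because every spatial position sees the same latent plus its own coordinates, position information only ever enters through the coordinate channels, which is what biases the decoder toward disentangling position from appearance.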

2.2 Inference

Similar to VAEs, we use amortized variational inference to obtain an approximate posterior, parameterized as an isotropic Gaussian with parameters λ. However, our object-oriented generative model poses a few specific challenges for the inference process. Firstly, being a (spatial) mixture model, we need to infer both the components (i.e. object appearance) and the mixing (i.e. object segmentation). This type of problem is well known, for example in clustering and image segmentation, and is traditionally tackled as an iterative procedure, because no efficient direct solutions are available. A related second problem is that any slot can, in principle, explain any pixel. Once a pixel is explained by one of the slots, however, the others no longer need to account for it. This explaining-away property complicates inference by strongly coupling it across the individual slots. Finally, slot permutation invariance induces a multimodal posterior with at least one mode per slot permutation. This is problematic, since our approximate posterior is parameterized as a unimodal distribution. For all these reasons, the standard feed-forward VAE inference model is inadequate for our case, so we consider a more powerful method for inference.

Iterative Inference

Figure 3: Illustration of the iterative inference procedure.

The basic idea of iterative inference is to start with an arbitrary guess for the posterior parameters λ, and then iteratively refine them using the input and samples from the current posterior estimate. We build on the framework of iterative amortized inference (Marino et al., 2018b), which uses a trained refinement network f_φ. Unlike Marino et al., we consider only additive updates to the posterior, and we use several salient auxiliary inputs a to the refinement network (instead of just the gradient of the ELBO). We update the posterior of the K slots independently and in parallel, as follows:

z_k^(t) ~ q_λ(z_k | x)
λ_k^(t+1) ← λ_k^(t) + f_φ(z_k^(t), x, a_k^(t))

Thus the only place where the slots interact is at the input level. Instead of amortizing the posterior directly (as in a regular VAE encoder), the refinement network can be thought of as amortizing the gradient of the posterior (Marino et al., 2018a). The alternating updates to the posterior and the likelihood model are also akin to message passing.

  Input: image x, hyperparameters K, T, σ²
  Input: trainable parameters λ^(1), θ, φ
  for t = 1 to T do
      z_k^(t) ~ q_λ(z_k | x)                                 // Sample
      μ_k^(t), m̂_k^(t) = g_θ(z_k^(t))                        // Decode
      m^(t) = softmax_k(m̂^(t))                               // Masks
      p(x | z^(t)) = ∏_i ∑_k m_ik^(t) N(x_i; μ_ik^(t), σ²)   // Likelihood
      a_k^(t) = auxiliary inputs (see below)                 // Inputs
      λ_k^(t+1) ← λ_k^(t) + f_φ(z_k^(t), x, a_k^(t))         // Refinement
  end for
Algorithm 1 IODINE Pseudocode.
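The control flow of Algorithm 1 can be sketched end-to-end in NumPy. The decoder and refinement network below are stand-in random linear maps of our own choosing (the real model uses a broadcast decoder and an LSTM-based refiner), so only the loop structure mirrors the algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
K, D, L, T, sigma = 3, 16, 8, 4, 0.3   # slots, pixels, latent dim, iterations

# Stand-in "networks": fixed random linear decoder and refiner.
W_mu = rng.normal(size=(D, L))
W_logit = rng.normal(size=(D, L))
W_ref = rng.normal(size=(2 * L, L + D)) * 0.01

# Posterior parameters lambda = (mean, log-std) per slot, shared init.
lam = np.zeros((K, 2 * L))
x = rng.normal(size=D)

def softmax(a, axis):
    a = a - a.max(axis=axis, keepdims=True)   # numerically stable
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

for t in range(T):
    mean, log_std = lam[:, :L], lam[:, L:]
    z = mean + np.exp(log_std) * rng.normal(size=(K, L))   # Sample
    mu = z @ W_mu.T                                        # Decode means (K, D)
    m = softmax(z @ W_logit.T, axis=0)                     # Masks over slots
    # Per-slot, per-pixel mixture weight, used here as a cheap aux input.
    pixel_ll = m * np.exp(-((x - mu) ** 2) / (2 * sigma**2))
    a = np.concatenate([z, pixel_ll], axis=1)              # Inputs (K, L + D)
    lam = lam + a @ W_ref.T                                # Additive refinement
```

Note that the update is additive and applied to all K slots in parallel with shared weights, exactly as the slot-symmetric design requires.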


We feed a set of auxiliary inputs to the refinement network, which are generally cheap to compute and substantially simplify the task. Crucially, we include gradient information about the ELBO in the inputs, as it conveys information about what is not yet explained by other slots.

Omitting the superscript (t) for clarity, the auxiliary inputs are:

  • image x, means μ, masks m, and mask-logits m̂,

  • gradients ∇_μ L, ∇_m L, and ∇_λ L,

  • the posterior mask p(m_k | x, μ),

  • the pixelwise likelihood p(x | z),

  • the leave-one-out likelihood p(x | z_{i≠k}),

  • and two coordinate channels like in the decoder.

With the exception of ∇_λ L, these are all image-sized and cheap to compute, so we feed them as additional input channels into the refinement network. The approximate gradient ∇_λ L is computed using the reparameterization trick by a backward pass through the generator network. This is computationally quite expensive, but we found that this information significantly improves training of the refinement network. Like Marino et al. (2018b), we found it beneficial to normalize the gradient-based inputs with LayerNorm (Ba et al., 2016). See Section 4.2 for an ablation study on these auxiliary inputs.
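Two of these inputs are easy to make concrete. Under a spatial mixture, the posterior mask is the responsibility of each slot for each pixel, and the leave-one-out likelihood evaluates the mixture with slot k removed. The NumPy sketch below uses our own shapes and names, and the re-normalization of the remaining mask weights is our reading, not taken from the paper's code:

```python
import numpy as np

def aux_inputs(x, mu, m, sigma=0.1):
    """x: (D,), mu/m: (K, D). Returns posterior masks and
    leave-one-out likelihoods, both of shape (K, D)."""
    K = m.shape[0]
    # Per-slot, per-pixel Gaussian densities.
    dens = np.exp(-((x - mu) ** 2) / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
    weighted = m * dens                              # (K, D)
    total = weighted.sum(axis=0, keepdims=True)      # full mixture p(x_i | z)
    post_mask = weighted / total                     # responsibilities, sum to 1
    # Mixture likelihood with slot k removed, re-normalized over the rest.
    loo = np.empty_like(weighted)
    for k in range(K):
        rest = np.delete(weighted, k, axis=0).sum(axis=0)
        loo[k] = rest / np.delete(m, k, axis=0).sum(axis=0)
    return post_mask, loo
```

Both quantities are image-sized, so they slot directly into the channel stack fed to the refinement network.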

2.3 Training

We train the parameters of the decoder (θ), of the refinement network (φ), and of the initial posterior (λ^(1)) by gradient descent through the unrolled iterations. In principle, it is enough to minimize the final negative ELBO L_T, but we found it beneficial to use a weighted sum which also includes the earlier terms:

L_total = ∑_{t=1}^{T} (t/T) · L_t

It is worth noting that IODINE contains two nested minimizations of the loss: In the inner loop, each (refinement) iteration minimizes the loss by adjusting λ (this is what is shown in Figure 3 and Algorithm 1). Training the model itself by gradient descent constitutes an outer loop which minimizes the loss (indirectly) by adjusting θ and φ to make the iterative refinement more effective. In this sense, our system resembles the learning-to-learn meta learner of Andrychowicz et al. (2016). Unfortunately, this double minimization also leads to numerical instabilities connected to double derivatives. We found that this problem can be mitigated by dropping the double-derivative terms, i.e. stopping the gradients from backpropagating through the gradient-inputs ∇_μ L, ∇_m L, and ∇_λ L in the auxiliary inputs a.
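The weighting scheme itself is trivial to write down. Assuming a linear schedule that weights iteration t by t/T (our reading of the weighted sum described above), a sketch:

```python
import numpy as np

def weighted_elbo_loss(per_iteration_losses):
    """Combine per-iteration negative ELBOs L_1..L_T with weights t/T,
    so later (more refined) iterations dominate the training signal."""
    losses = np.asarray(per_iteration_losses, dtype=float)
    T = len(losses)
    weights = np.arange(1, T + 1) / T          # 1/T, 2/T, ..., 1
    return float(np.sum(weights * losses))
```

In an autodiff framework, the stop-gradient mentioned above would be applied to the gradient-based auxiliary inputs before they enter the refinement network (e.g. by detaching them from the graph), so no second-order terms flow through them.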

3 Related Work

Representation learning (Bengio et al., 2013) has received much attention and has seen several recent breakthroughs. This includes disentangled representations through the use of β-VAEs (Higgins et al., 2017a), adversarial autoencoders (Makhzani et al., 2015), FactorVAEs (Kim & Mnih, 2018), and improved generalization through non-Euclidean embeddings (Nickel & Kiela, 2017). However, most advances have focused on the feature-level structure of representations, and do not address the issue of representing multiple, potentially repeating objects, which we tackle here.

Another line of work is concerned with obtaining segmentations of images, usually without considering representation learning. This has led to impressive results on real-world images, however, many approaches (such as “semantic segmentation” or object detection) rely on supervised signals (Girshick, 2015; He et al., 2017; Redmon & Farhadi, 2018), while others require hand-engineered features (Shi & Malik, 2000; Felzenszwalb & Huttenlocher, 2004). In contrast, as we learn to both segment and represent, our method can perform inpainting (Figure 1) and deal with ambiguity (Figure 10), going beyond what most methods relying on feature engineering are currently able to do.

Finally, work tackling the full problem of scene representation is rarer. Probabilistic-programming-based approaches, like stroke-based character generation (Lake et al., 2015) or 3D indoor scene rendering (Pero et al., 2012), have produced appealing results, but require carefully engineered generative models, which are typically not fully learned from data. Work on end-to-end models has shown promise in using autoregressive inference or generative approaches (Eslami et al., 2016; Gregor et al., 2015), including the recent MONet (Burgess et al., 2019). Apart from MONet, few methods achieve comparable results on scenes of the complexity we consider here. However, unlike our model, MONet does not utilize any iterative refinement, which we believe will be important for scaling up to even more challenging datasets.

Closely related to ours is Neural Expectation Maximization (Greff et al., 2017) (along with its sequential and relational extensions (van Steenkiste et al., 2018)), which uses recurrent neural networks to amortize expectation maximization for a spatial mixture model. The Tagger (Greff et al., 2016) uses iterative inference to segment and represent images, with denoising as its training objective. However, due to its use of a ladder network, its representation is less explicit than what we can learn with IODINE.

4 Results

Figure 4: IODINE segmentations and object reconstructions on CLEVR (top), multi-dSprites (middle), and Tetris (bottom). The individual masked reconstruction slots represent objects separately (along with their shadow on CLEVR). Border colours are matched to the segmentation mask on the left.
Figure 5: Prediction accuracy / R² score for the factor regression on CLEVR. Position is continuous; the rest are categorical with 8 colors, 3 shapes, and 2 sizes.
Figure 6: Disentanglement in regular VAEs vs IODINE. Rows indicate traversals of single latents, annotated by our interpretation of their effects. (Left) When a VAE is trained on single-object scenes it can disentangle meaningful factors of variation. (Center) When the same VAE is trained on multi-object scenes, the latents entangle across both factors and objects. (Right) In contrast, traversals of individual latents in IODINE vary individual factors of single objects, here the orange cylinder. Thus, the architectural bias for discovering multiple entities in a common format enables not only the discovery of objects, but also facilitates disentangling of their features.

We evaluate our model on three main datasets: 1) CLEVR (Johnson et al., 2017), 2) a multi-object version of the dSprites dataset (Matthey et al., 2017), and 3) a dataset of multiple "Tetris"-like pieces that we created. In all cases we train the system using the Adam optimizer (Kingma & Ba, 2015) to minimize the negative ELBO. We varied several hyperparameters, including: the number of slots, the dimensionality of z, the number of inference iterations, the number of convolutional layers and their filter sizes, the batch size, and the learning rate. For details of the models and hyperparameters refer to Appendix B.

4.1 Representation Quality


An appealing property of IODINE is that it provides a readily interpretable segmentation of the data, as seen in Figure 4. These examples clearly demonstrate the model's ability to segment out the same objects which were used to generate the dataset, despite never having received supervision to do so.

To quantify segmentation quality, we measure the similarity between ground-truth (instance) segmentations and our predicted object masks using the Adjusted Rand Index (ARI; Rand 1971; Hubert & Arabie 1985). ARI is a measure of clustering similarity that ranges from 0 (chance) to 1 (perfect clustering) and can handle arbitrary permutations of the clusters. We apply it as a measure of instance segmentation quality by treating each foreground pixel (ignoring the background) as one point and its segmentation as its cluster assignment.
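To make the metric precise, a self-contained NumPy implementation of ARI over flat label arrays (equivalent in spirit to sklearn.metrics.adjusted_rand_score, one entry per foreground pixel) might look like:

```python
import numpy as np

def adjusted_rand_index(true_labels, pred_labels):
    """ARI between two flat label arrays (one entry per foreground pixel)."""
    a = np.asarray(true_labels)
    b = np.asarray(pred_labels)
    # Contingency table between the two clusterings.
    _, ai = np.unique(a, return_inverse=True)
    _, bi = np.unique(b, return_inverse=True)
    C = np.zeros((ai.max() + 1, bi.max() + 1))
    np.add.at(C, (ai, bi), 1)
    comb2 = lambda n: n * (n - 1) / 2.0       # "n choose 2", elementwise
    sum_ij = comb2(C).sum()
    sum_a = comb2(C.sum(axis=1)).sum()
    sum_b = comb2(C.sum(axis=0)).sum()
    expected = sum_a * sum_b / comb2(len(a))
    max_index = 0.5 * (sum_a + sum_b)
    return (sum_ij - expected) / (max_index - expected)
```

Because the score is built from the contingency table alone, relabeling the predicted clusters (i.e. permuting slots) leaves it unchanged, which is exactly why it suits permutation-invariant slot models.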

Our model produces excellent ARI scores on CLEVR, Multi dSprites, and Tetris, where we report the mean and standard deviation calculated over five independent random seeds. We attempted to compare these scores to baseline methods such as Neural Expectation Maximization, but neither Relational-NEM nor the simpler RNN-NEM variant could cope well with colored images. As a result, we could only compare our ARI scores on a binarized version of Multi dSprites and on the Shapes dataset. These are summarized in Table 1.

Dataset                     IODINE   Baseline
Binarized Multi dSprites    0.96     0.53
Shapes                      0.92     0.72
Table 1: Summary of IODINE's segmentation performance in terms of ARI versus a baseline model.

Information Content

The object reconstructions in Figure 4 show that their representations contain all the information about the object. But in what format, and how usable is it? To answer this question, we associate each ground-truth object with its corresponding z_k based on the segmentation masks. We then train a single-layer network to predict the ground-truth factors for each object. Note that this predictor is trained after the model has finished training (i.e. no supervised fine-tuning). The results, shown in Figure 5, clearly demonstrate that a linear mapping is sufficient to extract information like color, position, shape, or size of an object from its latent representation to high accuracy, an important indication of the usefulness of the representation. This result is in contrast with the scene representations learned by a standard VAE, where even training the factor-predictor is difficult, as there is no obvious way to align objects with features. To make this comparison, we chose a canonical ordering of the objects based on their size, material, shape, and position (with decreasing precedence), intended as a heuristic to maximize the predictability of the ordering. We then trained a linear network to predict the concatenated features of the canonically ordered objects from the latent scene representation. As the results in Figure 5 indicate, the information is present, but in a much less explicit/usable form.
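The probe itself is just a linear readout fit to (latent, factor) pairs. Since we do not have the paper's trained latents, the sketch below uses synthetic stand-in data where two "factors" are by construction linear in the latent; only the fitting procedure mirrors the evaluation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in data: 200 "object latents" of size 16, where two ground-truth
# factors (say x/y position) happen to be linear functions of the latent.
Z = rng.normal(size=(200, 16))
W_true = rng.normal(size=(16, 2))
factors = Z @ W_true                     # (200, 2) ground-truth factors

# Append a bias column and fit the single-layer predictor by least squares.
Zb = np.hstack([Z, np.ones((200, 1))])
W_hat, *_ = np.linalg.lstsq(Zb, factors, rcond=None)

# R^2 of the linear readout: near 1 means the factors are linearly decodable.
pred = Zb @ W_hat
r2 = 1 - ((factors - pred) ** 2).sum() / ((factors - factors.mean(0)) ** 2).sum()
```

For categorical factors (color, shape, size) the same recipe applies with a linear classifier and accuracy instead of R².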


Disentanglement is another important and desirable property of representations (Bengio et al., 2013): it captures how well learned features separate and correspond to individual, interpretable factors of variation in the data. While its precise definition is still highly debated (Higgins et al., 2018; Eastwood & Williams, 2018; Ridgeway & Mozer, 2018; Locatello et al., 2018), the concept of disentanglement has generated a lot of interest recently. Good disentanglement is believed to lead to both better generalization and more interpretable features (Lake et al., 2016; Higgins et al., 2017b). Interestingly, disentangled features seem to be most useful for properties of single objects, such as color, position, or shape. It is much less clear how to operationalize this in order to create disentangled representations of entire scenes with variable numbers of objects. Indeed, if we train a VAE that can successfully disentangle the features of a single-object dataset, we find that its representation becomes highly entangled on a multi-object dataset (see Figure 6, left vs. middle). IODINE, on the other hand, successfully learns disentangled representations, because it is able to first decompose the scene and then represent individual objects (Figure 6, right). In Figure 6 we show traversals of the most important features (selected by KL) of a standard VAE vs. IODINE. While the standard VAE clearly entangles many properties, even across multiple objects, IODINE is able to neatly separate them.


Figure 7: IODINE's iterative inference process and generalization capabilities. Rows indicate steps of iterative inference, refining reconstructions and segmentations when moving down the figure. Of particular interest is the explaining-away effect visible between slots 2 and 3, where they settle on different objects despite both starting with the large cylinder. This particular model was only trained with 7 slots on scenes with 3-6 objects (excluding green spheres), and yet is able to generalize to 11 slots (only 4 are shown; see Figure 19 in the appendix for a full version) on a scene with 9 objects, including the never-before-seen green sphere (last column).

Finally, we can ask directly: Does the system generalize to novel scenes in a systematic way? Specifically, does it generalize to scenes with more or fewer objects than ever encountered during training? Slots are exchangeable by design, so we can freely vary the number of slots at test-time (more on this in Section 4.2). In Figure 7 we qualitatively show the performance of a system that was trained with 7 slots on scenes with up to 6 objects, but evaluated with 11 slots on a scene with 9 objects. The orange boxes in Figure 9(a) show that, even quantitatively, the segmentation performance decreases little when generalizing to more objects.

A more extreme form of generalization involves handling unseen feature combinations. To test this, we trained our system on a subset of CLEVR that does not contain green spheres (though it does contain spheres and other green objects), and then tested what the system does when confronted with a green sphere. As can be seen in Figure 7, IODINE is still able to represent green spheres, despite never having seen this combination during training.

4.2 Robustness & Ablation

Now that we have established the usefulness of the object representations produced by IODINE, we turn our attention to investigating its behavior in more detail.


(a) ARI
(b) MSE
(c) KL
Figure 8: The effect of varying the number of iterations, for both training and at test time. (a) Median ARI score, (b) MSE and (c) KL over test-iterations, for models trained with different numbers of iterations on CLEVR. The region beyond the filled dots thus shows test-time generalization behavior. Shaded region from 25th to 75th percentile.

The number of iterations is one of the central hyperparameters of our approach. To investigate its impact, we trained four models with 1, 2, 4, and 6 iterations on CLEVR, and evaluated them all using 15 iterations (c.f. Figure 8). The first thing to note is that inference converges very quickly, within the first 3-5 iterations, after which neither the segmentation nor the reconstruction changes much. The second important finding is that the system remains stable for much longer than the number of iterations it was trained with. The model even continues to improve the segmentation and reconstruction when run for more iterations, though it eventually starts to diverge after about two to three times the number of training iterations, as can be seen with the blue and orange curves in Figure 8.


(a) ARI
(b) MSE
(c) KL
Figure 9: IODINE trained on CLEVR with varying numbers of slots (columns) on scenes with 3-6 objects. Evaluation of ARI (a), MSE (b), and KL (c) with 7 slots on 3-6 Objects (blue) and 11 slots on 7-9 objects (orange).

The other central parameter of IODINE is the number of slots K, as it controls the maximum number of objects the system can separate. It is important to distinguish varying K for training vs. varying it at test-time. As can be seen in Figure 9, if the model was trained with sufficiently many slots to fit all objects, then test-time behavior generalizes very well. Typical behavior (not shown) is to leave excess slots empty, and when confronted with too many objects the model will often completely ignore some of them, leaving the other object-representations mostly intact. As mentioned in Section 4.1, given enough slots at test-time, such a model can even segment and represent scenes of higher complexity (more objects) than any scene encountered during training (see Figure 7 and the orange boxes in Figure 9). If, on the other hand, the model was trained with too few slots to hold all objects, its performance suffers substantially. This happens because the only way to reconstruct the entire scene during training is then to consistently represent multiple objects per slot, which leads the model to learn inefficient and entangled representations akin to the VAE in Figure 6 (also apparent from the much higher KL in Figure 9(c)). Once learned, this sub-optimal strategy cannot be mitigated by increasing the number of slots at test-time, as can be seen from the decreased performance in Figure 9(a).


We ablated each of the different inputs to the refinement network described in Section 2.2. Broadly, we found that individually removing an input did not noticeably affect the results (with two exceptions noted below). See Figures 29-36 in the Appendix, which demonstrate this lack of effect on the different terms of the model's loss and on the ARI segmentation score for both CLEVR and Tetris. A more comprehensive analysis could ablate combinations of inputs and identify synergistic or redundant groups, and thus potentially simplify the model. We did not pursue this direction, since none of the inputs incurs any noticeable computational overhead, and at some point during our experimentation each of them contributed towards stable training behavior.

The main exceptions to the above concern the gradient-based inputs. Computing the gradient ∇_λ L requires an entire backward pass through the decoder and accounts for a substantial share of the computational cost of the entire model. But we found that it often substantially improves performance and training convergence, which justifies its inclusion. Another somewhat surprising finding was that for the Tetris dataset, removing one particular input had a pronounced detrimental effect, while for CLEVR its removal was negligible. At this point we do not have a good explanation for this effect.

4.3 Multi-Modality and Multi-Stability

Figure 10: Multi-stability of segmentation when presented with ambiguous stimuli. IODINE can interpret the set of Tetris blocks differently depending on its iterative sampling process. (left) Seed image with ambiguous Tetris blocks, along with several inferred masks. (right) PCA of latent space, coloured by the segmentation obtained. Shows multimodality and variety in the interpretation of the same seed image.
(a) Textured MNIST
(b) ImageNet
(c) Grayscale CLEVR
Figure 11: Segmentation challenges. (a) IODINE did not succeed in capturing the foreground digits in the Textured MNIST dataset. (b) IODINE groups ImageNet not into meaningful objects but mostly into regions of similar color. (c) On a grayscale version of CLEVR, IODINE still produces the desired groupings.

Standard VAEs are unable to represent multi-modal posteriors, because q(z|x) is parameterized using a unimodal Gaussian distribution. However, as demonstrated in Figure 10, IODINE can actually handle this problem quite well. So what is going on? It turns out that this is an important side-effect of iterative variational inference that, to the best of our knowledge, has not been noticed before: the stochasticity at each iteration, which results from sampling z to approximate the likelihood, implicitly acts as an auxiliary (inference) random variable. This effect compounds over iterations, and is possibly amplified by the slot structure and the effective message-passing between slots over the course of the iterations. In effect, the model can implicitly represent multiple modes (if integrated over all ways of sampling z) and thus converge to different modes (see Figure 10, left) depending on these samples. This does not happen in a regular VAE, where stochasticity never enters the inference process. Also, if we had an exact and deterministic way to compute the likelihood and its gradient, this effect would vanish.

A neat side-effect of this is the ability of IODINE to elegantly capture ambiguous (a.k.a. multi-stable) segmentations such as the ones shown in Figure 10. We presented the model with an ambiguous arrangement of Tetris blocks, which has three different yet equally valid "explanations" (given the data distribution). When we evaluate an IODINE model on this image, we get different segmentations on different evaluations. Some of these correspond to different slot-orderings (1st vs. 3rd row). But we also find qualitatively different segmentations (e.g. 3rd vs. 4th row) that correspond to different interpretations of the scene. Multi-stability is a well-studied and pervasive feature of human perception that is important for handling ambiguity, and one that is not modelled by any of the standard image recognition networks.

5 Discussion and Future Work

We have introduced IODINE, a novel approach for unsupervised representation learning of multi-object scenes, based on amortized iterative refinement of the inferred latent representation. We analyzed IODINE’s performance on various datasets, including realistic images containing variable numbers of partially occluded 3D objects, and demonstrated that our method can successfully decompose the scenes into objects and represent each of them in terms of their individual properties such as color, size, and material. IODINE can robustly deal with occlusions by inpainting covered sections, and generalises beyond the training distribution in terms of numerosity and object-property combinations. Furthermore, when applied to scenes with ambiguity in terms of their object decomposition, IODINE can represent – and converge to – multiple valid solutions given the same input image.

We also probed the limits of our current setup by applying IODINE to the Textured MNIST dataset (Greff et al., 2016) and to ImageNet, testing how it would deal with texture segmentation and more complex real-world data (Figure 11). Trained on ImageNet data, IODINE segmented mostly by color rather than by objects. This behavior is interesting, but also expected: ImageNet was never designed as a dataset for unsupervised learning, and likely lacks the richness in poses, lighting, sizes, positions and distance variations required to learn object segmentations from scratch. Trained on Textured MNIST, IODINE was able to model the background, but mostly failed to capture the foreground digits. Together these results point to the importance of color as a strong cue for segmentation, especially early in the iterative refinement process. As demonstrated by our results on grayscale CLEVR (Figure 11(c)), though, color is not a requirement, and instead seems to speed up and stabilize learning.

Beyond furnishing the model with more diverse training data, we want to highlight three other promising directions for scaling IODINE to richer real-world data. First, the iterative nature of IODINE makes it an ideal candidate for extension to sequential data. This is an especially attractive future direction, because temporal data naturally contains rich statistics about objectness, both in the movement itself and in the smooth variation of object factors. In fact, IODINE can readily be applied to sequences by feeding a new frame at every iteration; we describe some preliminary experiments in Section B.4. As a nice side-effect, the model automatically maintains the object-to-slot association, turning it into an unsupervised object tracker. However, IODINE in its current form has limited abilities for modelling dynamics, so extending the model in this direction is a promising avenue.

Physical interaction between objects is another common occurrence in sequential data, and can serve as a further strong cue for object decomposition. Similarly, even objects placed within a static scene commonly adhere to certain relations with each other, such as cars appearing on streets rather than on houses. Currently, however, IODINE assumes that objects are placed independently of one another, and relaxing this assumption will be important for modelling physical interactions.

Yet while learning about the statistical properties of relations should help scene decomposition, there is also a need to balance this against a strong independence assumption between objects: the system should still be able to segment out a car that is floating in the sky. Thus, the question of how strongly relations should factor into unsupervised object discovery is an open research problem that will be exciting to tackle in future work. In particular, we believe integration with some form of graph network, to support relations while preserving slot symmetry, to be a promising direction.

Ultimately, object representations have to be useful, such as for supervised tasks like caption generation, or for agents in reinforcement learning setups. Whatever the task, it will likely provide important feedback about which objects matter and which are irrelevant. Complex visual scenes can contain an extremely large number of potential objects (think of sand grains on a beach), which can make it infeasible to represent them all simultaneously. Thus, allowing task-related signals to bias the selection of what to decompose, and how, may enable scaling up unsupervised scene representation learning approaches like IODINE to arbitrarily complex scenes.


We would like to thank Danilo Rezende, Sjoerd van Steenkiste, and Malcolm Reynolds for helpful suggestions and generous support.


  • Andrychowicz et al. (2016) Andrychowicz, M., Denil, M., Gomez, S., Hoffman, M. W., Pfau, D., Schaul, T., Shillingford, B., and De Freitas, N. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pp. 3981–3989, 2016.
  • Ba et al. (2016) Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization. arXiv:1607.06450 [cs, stat], July 2016.
  • Bengio et al. (2013) Bengio, Y., Courville, A., and Vincent, P. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell., 35(8):1798–1828, 2013.
  • Burgess et al. (2019) Burgess, C. P., Matthey, L., Watters, N., Kabra, R., Higgins, I., Botvinick, M., and Lerchner, A. MONet: Unsupervised scene decomposition and representation. arXiv preprint, January 2019.
  • Chen et al. (2016) Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., and Abbeel, P. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. arXiv:1606.03657 [cs, stat], June 2016.
  • Clevert et al. (2015) Clevert, D.-A., Unterthiner, T., and Hochreiter, S. Fast and accurate deep network learning by exponential linear units (ELUs). November 2015.
  • Eastwood & Williams (2018) Eastwood, C. and Williams, C. K. I. A framework for the quantitative evaluation of disentangled representations. ICLR, 2018.
  • Eslami et al. (2016) Eslami, S. M. A., Heess, N., Weber, T., Tassa, Y., Szepesvari, D., Kavukcuoglu, K., and Hinton, G. E. Attend, infer, repeat: Fast scene understanding with generative models. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 29, pp. 3225–3233. Curran Associates, Inc., 2016.
  • Felzenszwalb & Huttenlocher (2004) Felzenszwalb, P. F. and Huttenlocher, D. P. Efficient Graph-Based image segmentation. Int. J. Comput. Vis., 59(2):167–181, September 2004.
  • Girshick (2015) Girshick, R. B. Fast R-CNN. CoRR, abs/1504.08083, 2015.
  • Greff et al. (2016) Greff, K., Rasmus, A., Berglund, M., Hao, T. H., Valpola, H., and Schmidhuber, J. Tagger: Deep unsupervised perceptual grouping. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 29, pp. 4484–4492, 2016.
  • Greff et al. (2017) Greff, K., van Steenkiste, S., and Schmidhuber, J. Neural expectation maximization. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30, pp. 6694–6704. Curran Associates, Inc., 2017.
  • Gregor et al. (2015) Gregor, K., Danihelka, I., Graves, A., Rezende, D. J., and Wierstra, D. DRAW: A recurrent neural network for image generation. In Proceedings of the 32Nd International Conference on International Conference on Machine Learning - Volume 37, ICML’15, pp. 1462–1471, Lille, France, 2015.
  • He et al. (2017) He, K., Gkioxari, G., Dollár, P., and Girshick, R. B. Mask R-CNN. CoRR, abs/1703.06870, 2017.
  • Higgins et al. (2017a) Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. beta-VAE: Learning basic visual concepts with a constrained variational framework. In In Proceedings of the International Conference on Learning Representations (ICLR), 2017a.
  • Higgins et al. (2017b) Higgins, I., Pal, A., Rusu, A., Matthey, L., Burgess, C., Pritzel, A., Botvinick, M., Blundell, C., and Lerchner, A. DARLA: Improving zero-shot transfer in reinforcement learning. ICML, 2017b.
  • Higgins et al. (2018) Higgins, I., Amos, D., Pfau, D., Racanière, S., Matthey, L., Rezende, D. J., and Lerchner, A. Towards a definition of disentangled representations. CoRR, abs/1812.02230, 2018.
  • Hubert & Arabie (1985) Hubert, L. and Arabie, P. Comparing partitions. J. Classification, 2(1):193–218, December 1985.
  • Johnson et al. (2017) Johnson, J., Hariharan, B., van der Maaten, L., Fei-Fei, L., Zitnick, C. L., and Girshick, R. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pp. 1988–1997., 2017.
  • Kim & Mnih (2018) Kim, H. and Mnih, A. Disentangling by factorising. February 2018.
  • Kingma & Ba (2015) Kingma, D. and Ba, J. Adam: A method for stochastic optimization. CBLS, 2015.
  • Kingma & Welling (2013) Kingma, D. P. and Welling, M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • Lake et al. (2015) Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, December 2015.
  • Lake et al. (2016) Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gershman, S. J. Building machines that learn and think like people. Behavioral and Brain Sciences, pp. 1–101, 2016.
  • Locatello et al. (2018) Locatello, F., Bauer, S., Lucic, M., Gelly, S., Schölkopf, B., and Bachem, O. Challenging common assumptions in the unsupervised learning of disentangled representations. arXiv preprint arXiv:1811.12359, 2018.
  • Maaten & Hinton (2008) Maaten, L. v. d. and Hinton, G. Visualizing data using t-sne. Journal of machine learning research, 9(Nov):2579–2605, 2008.
  • Makhzani et al. (2015) Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I., and Frey, B. Adversarial autoencoders. November 2015.
  • Marino et al. (2018a) Marino, J., Cvitkovic, M., and Yue, Y. A general method for amortizing variational filtering. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 7868–7879. Curran Associates, Inc., 2018a.
  • Marino et al. (2018b) Marino, J., Yue, Y., and Mandt, S. Iterative amortized inference. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 3403–3412, Stockholmsmässan, Stockholm Sweden, 2018b. PMLR.
  • Matthey et al. (2017) Matthey, L., Higgins, I., Hassabis, D., and Lerchner, A. dsprites: Disentanglement testing sprites dataset., 2017.
  • Nickel & Kiela (2017) Nickel, M. and Kiela, D. Poincaré embeddings for learning hierarchical representations. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30, pp. 6338–6347. Curran Associates, Inc., 2017.
  • Pascanu et al. (2012) Pascanu, R., Mikolov, T., and Bengio, Y. Understanding the exploding gradient problem. CoRR, abs/1211.5063, 2012.
  • Pero et al. (2012) Pero, L. D., Bowdish, J., Fried, D., Kermgard, B., Hartley, E., and Barnard, K. Bayesian geometric modeling of indoor scenes. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2719–2726., June 2012.
  • Rand (1971) Rand, W. M. Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc., 66(336):846–850, December 1971.
  • Redmon & Farhadi (2018) Redmon, J. and Farhadi, A. YOLOv3: An incremental improvement. CoRR, abs/1804.02767, 2018.
  • Reichert & Serre (2013) Reichert, D. P. and Serre, T. Neuronal synchrony in complex-valued deep networks. arXiv:1312.6115 [cs, q-bio, stat], December 2013.
  • Rezende et al. (2014) Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In Xing, E. P. and Jebara, T. (eds.), Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pp. 1278–1286, Bejing, China, 2014. PMLR.
  • Ridgeway & Mozer (2018) Ridgeway, K. and Mozer, M. C. Learning deep disentangled embeddings with the f-statistic loss. NIPS, 2018.
  • Shi & Malik (2000) Shi, J. and Malik, J. Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 22(8):888–905, August 2000.
  • van Steenkiste et al. (2018) van Steenkiste, S., Chang, M., Greff, K., and Schmidhuber, J. Relational neural expectation maximization: Unsupervised discovery of objects and their interactions. In Proceedings of the International Conference on Learning Representations (ICLR), January 2018.
  • Watters et al. (2019) Watters, N., Matthey, L., Burgess, C. P., and Lerchner, A. Spatial broadcast decoder: A simple architecture for learning disentangled representations in vaes. arXiv, 1901.07017, 2019.

Appendix A Dataset Details

a.1 Clevr

We regenerated the CLEVR dataset (Johnson et al., 2017) using their open-source code, because we needed ground-truth segmentations for all the objects. The dataset contains 70 000 images with a resolution of pixels from which we extract a square center crop of and scale it to pixels. Each scene contains between three and ten objects, each of which is characterized in terms of shape (cube, cylinder, or sphere), size (small or large), material (rubber or metal), color (8 different colors), position (continuous), and rotation (continuous). We do not make use of the question-answering task. Figure 12 shows a few samples from the dataset.
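The center-crop-and-scale preprocessing can be sketched as follows. The crop and output sizes are left as parameters since the paper's exact values are not reproduced here, and nearest-neighbour resampling stands in for whatever resampling the original pipeline used; the function name is our own.

```python
import numpy as np

def center_crop_and_scale(img, crop, out):
    """Square center crop followed by nearest-neighbour downscaling.

    img:  (H, W, C) array; crop: side length of the square crop;
    out:  side length of the final image.
    """
    h, w = img.shape[:2]
    top, left = (h - crop) // 2, (w - crop) // 2
    cropped = img[top:top + crop, left:left + crop]
    # Nearest-neighbour sampling grid from crop-space to output-space.
    idx = (np.arange(out) * crop / out).astype(int)
    return cropped[np.ix_(idx, idx)]
```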

Figure 12: Samples from the CLEVR dataset. The first column is the scene, the second column is the background mask and the following columns are the ground-truth object masks.

a.2 Multi dSprites

This dataset, based on the dSprites dataset (Matthey et al., 2017), consists of 60 000 images with a resolution of . Each image contains two to five random sprites, which vary in terms of shape (square, ellipse, or heart), color (uniform saturated colors), scale (continuous), position (continuous), and rotation (continuous). Furthermore, the background color is varied in brightness but always remains grayscale. Figure 13 shows a few samples from the dataset.

Figure 13: Samples from the Multi dSprites dataset. The first column is the full image, the second column is the background mask and the following columns are the ground-truth object masks.

We also used a binary version of Multi dSprites, where the sprites are always white and the background is always black.

a.3 Tetris

We generated this dataset of 60 000 images by placing three random Tetrominoes without overlap in an image of pixels. Each Tetromino is composed of four blocks that are each pixels. There are a total of 17 different Tetrominoes (counting rotations). We randomly color each Tetromino with one of 6 colors (red, green, blue, cyan, magenta, or yellow). Figure 14 shows a few samples from the dataset.
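The generation procedure above can be sketched as rejection sampling over piece placements. The image size below is a placeholder (the paper's exact value is not reproduced here), and only a subset of the 17 tetromino orientations is listed for brevity; all names are our own.

```python
import numpy as np

# A subset of tetromino shapes as (row, col) block offsets; the real
# dataset uses all 17 orientations, omitted here for brevity.
TETROMINOES = [
    [(0, 0), (0, 1), (0, 2), (0, 3)],   # I
    [(0, 0), (0, 1), (1, 0), (1, 1)],   # O
    [(0, 0), (1, 0), (2, 0), (2, 1)],   # L
]
COLORS = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1],
                   [0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)

def sample_image(rng, size=35, n_pieces=3, max_tries=100):
    """Place n_pieces random tetrominoes without overlap. Returns an RGB
    image and a per-pixel instance mask (0 = background). The image size
    here is a placeholder, not the paper's value."""
    img = np.zeros((size, size, 3))
    mask = np.zeros((size, size), dtype=int)
    placed = 0
    for _ in range(max_tries):
        if placed == n_pieces:
            break
        shape = TETROMINOES[rng.integers(len(TETROMINOES))]
        color = COLORS[rng.integers(len(COLORS))]
        r0, c0 = rng.integers(0, size - 4, size=2)
        cells = [(r0 + dr, c0 + dc) for dr, dc in shape]
        if any(mask[r, c] for r, c in cells):
            continue  # overlap with an existing piece: resample
        placed += 1
        for r, c in cells:
            mask[r, c] = placed
            img[r, c] = color
    return img, mask
```

The instance mask doubles as the ground-truth segmentation shown in Figure 14.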

Figure 14: Samples from the Tetris dataset. The first column is the full image, the second column is the background mask and the following columns are the ground-truth object masks.

a.4 Shapes

We use the same shapes dataset as in (Reichert & Serre, 2013). It contains 60 000 binary images of size each with three random shapes from the set .

a.5 Objects Room

For the preliminary sequential experiments we used a sequential version of the Objects Room dataset (Burgess et al., 2019). This dataset consists of 64×64 RGB images of a cubic room, with randomly colored walls, floors, and objects randomly scattered around the room. The camera is positioned on a ring inside the room, always facing towards the centre and oriented vertically in the range . There are 3 randomly shaped objects in the room, with 1–3 objects visible in any given frame. This version contains sequences of camera flights over 16 time steps, with the camera position and angle (within the above constraints) changing according to a fixed velocity for the entire sequence (with a random velocity sampled for each sequence).

Appendix B Model and Hyperparameter Details


Unless otherwise specified, all models are trained with the ADAM optimizer (Kingma & Ba, 2015), with default parameters and a learning rate of . We used gradient clipping as recommended by Pascanu et al. (2012): if the norm of the global gradient exceeds a threshold, then the gradient is scaled down to that norm. Note that this is virtually always the case, as the gradient norm is typically on the order of , but we nonetheless found it useful to apply this strategy. Finally, the batch size was 32 (GPUs).
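The global-norm clipping strategy described above can be sketched generically in NumPy; the function name is our own, and the logic mirrors what framework utilities such as `torch.nn.utils.clip_grad_norm_` do.

```python
import numpy as np

def clip_global_norm(grads, max_norm):
    """Scale all gradient arrays in place so their combined (global)
    L2 norm does not exceed max_norm. Returns the pre-clipping norm."""
    total_norm = float(np.sqrt(sum((g ** 2).sum() for g in grads)))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-8)
        for g in grads:
            g *= scale  # in-place rescaling preserves gradient direction
    return total_norm
```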


All layers use the ELU activation function (Clevert et al., 2015), and the convolutional layers use a stride of 1, unless mentioned otherwise.


For all models, we use the following inputs to the refinement network, where LN denotes LayerNorm and SG denotes stop-gradient:

Description Formula LN SG
gradient of means
gradient of mask
gradient of posterior
mask posterior
pixelwise likelihood
leave-one-out likelihood
two coordinate channels
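Two of the auxiliary inputs listed above can be sketched generically: the LayerNorm applied to selected inputs, and the two coordinate channels. The function names and normalization axes below are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """LayerNorm-style normalization: zero mean and unit variance over
    all non-leading axes (axis choice is an illustrative assumption)."""
    axes = tuple(range(1, x.ndim))
    mu = x.mean(axis=axes, keepdims=True)
    sd = x.std(axis=axes, keepdims=True)
    return (x - mu) / (sd + eps)

def coordinate_channels(h, w):
    """Two channels holding each pixel's (y, x) position in [-1, 1],
    giving the (otherwise translation-invariant) convnet access to
    absolute position."""
    ys, xs = np.meshgrid(np.linspace(-1, 1, h),
                         np.linspace(-1, 1, w), indexing="ij")
    return np.stack([ys, xs], axis=-1)
```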

b.1 Clevr

Most models we trained were on the scenes with 3–6 objects from the CLEVR dataset, had slots, and used iterations. However, we varied these parameters as mentioned in the text. The rest of the architecture and hyperparameters are described in the following.


4 Convolutional layers.

Kernel Size Nr. Channels Comment
4 Output: RGB + Mask

Refinement Network

4 Convolutional layers with stride 2 followed by a 2-layer MLP, one LSTM layer and finally a linear update layer.

Type Size/Channels Act. Func. Comment
MLP 128 Linear Update
LSTM 256 Tanh
MLP 256
Conv 64
Conv 64
Conv 64
Conv 64
18 Inputs

b.2 Multi dSprites

Models were trained with slots, and used iterations.


4 Convolutional layers.

Kernel Size Nr. Channels Comment
32 coordinates
4 Output: RGB + Mask

Refinement Network

3 Convolutional layers with stride 2 followed by a 1-layer MLP, one LSTM layer and an update layer.

Type Size/Channels Act. Func. Comment
MLP 32 Linear Update
LSTM 128 Tanh
MLP 128
Conv 32
Conv 32
Conv 32
24 Inputs

b.3 Tetris

Models were trained with slots, and used iterations.


4 Convolutional layers.

Kernel Size Nr. Channels Comment
64 coordinates
4 Output: RGB + Mask

Refinement Network

3 Convolutional layers with stride 2 followed by a 1-layer MLP and an update layer.

Type Size/Channels Act. Func. Comment
MLP 64 Linear Update
MLP 128
Conv 32
Conv 32
Conv 32
24 Inputs

b.4 Sequences

Figure 15: IODINE applied to sequences by setting the number of refinement iterations equal to the number of timesteps in the data.

The iterative nature of IODINE lends itself readily to sequential data: e.g., by feeding a new frame at every iteration instead of the same input image. This setup corresponds to one iteration per timestep, and to using next-step prediction instead of reconstruction as part of the training objective. An example of this can be seen in Figure 15, where we show a 16-timestep sequence along with reconstructions and masks. When used in this way, the model automatically maintains the association of objects to slots over time (i.e., it displays robust slot stability). Thus, object tracking comes almost for free as a by-product of IODINE. Note, though, that IODINE has to rely on the LSTM in the inference network to model any dynamics, which means that none of the dynamics of tracked objects (e.g., velocity) will be part of the object representation.
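The one-iteration-per-timestep scheme can be illustrated with a toy stand-in model. `ToyIterativeModel` and its methods are purely hypothetical placeholders for IODINE's actual decode/refine step; the sketch only shows how per-slot state carries over between frames, which is what makes the slot-to-object binding persist.

```python
import numpy as np

class ToyIterativeModel:
    """Stand-in for an iterative-refinement model (NOT IODINE itself):
    the 'posterior' is per-slot state nudged toward each new input."""
    def __init__(self, num_slots, dim):
        self.num_slots, self.dim = num_slots, dim

    def initial_posterior(self):
        return np.zeros((self.num_slots, self.dim))

    def refine(self, posterior, frame):
        # Move each slot's state halfway toward a projection of the frame.
        target = np.resize(frame, (self.num_slots, self.dim))
        return posterior + 0.5 * (target - posterior)

def run_on_sequence(model, frames):
    """One refinement iteration per timestep: slot state (and hence the
    object-to-slot binding) carries over between frames, which is what
    yields tracking as a by-product."""
    posterior = model.initial_posterior()
    trajectory = []
    for frame in frames:
        posterior = model.refine(posterior, frame)
        trajectory.append(posterior.copy())
    return trajectory
```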

Appendix C More Plots

Figures 16–18 show additional decomposition samples on our datasets. Figure 19 shows a complete version of Figure 7, showing all individual masked reconstruction slots. Figures 20–22 show a comparison between the object reconstructions and the mask logits used for assigning decoded latents to pixels. Figures 23–25 demonstrate how object latents are clustered when projected onto the first two principal components of the latent distribution. Figures 26–28 show how object latents are clustered when projected onto a t-SNE (Maaten & Hinton, 2008) of the latent distribution. Figures 29–36 give an overview of the impact of each of the inputs to the refinement network on the total loss, mean squared reconstruction error, KL divergence loss term, and the ARI segmentation performance (excluding the background pixels) on the CLEVR and Tetris datasets.
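The ARI score with background excluded, as used in these plots, is a plain Adjusted Rand Index (Hubert & Arabie, 1985) restricted to foreground pixels. A pure-NumPy sketch using the standard contingency-table formula follows; the function names are our own.

```python
import numpy as np

def ari(labels_true, labels_pred):
    """Adjusted Rand Index via the standard contingency-table formula."""
    _, y = np.unique(labels_true, return_inverse=True)
    _, x = np.unique(labels_pred, return_inverse=True)
    table = np.zeros((y.max() + 1, x.max() + 1))
    np.add.at(table, (y, x), 1)                  # contingency counts
    comb2 = lambda m: m * (m - 1) / 2.0          # "m choose 2"
    sum_ij = comb2(table).sum()
    a = comb2(table.sum(axis=1)).sum()           # row-marginal pairs
    b = comb2(table.sum(axis=0)).sum()           # column-marginal pairs
    expected = a * b / comb2(len(labels_true))
    max_index = (a + b) / 2.0
    return (sum_ij - expected) / (max_index - expected)

def fg_ari(true_seg, pred_seg, background_id=0):
    """ARI restricted to foreground pixels: background pixels (by
    ground-truth label) are dropped before scoring."""
    t = np.asarray(true_seg).ravel()
    p = np.asarray(pred_seg).ravel()
    keep = t != background_id
    return ari(t[keep], p[keep])
```

Note that ARI is invariant to slot permutations, which is why it is a suitable score for models with symmetric, unordered slots.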

Figure 16: Additional segmentation and object reconstruction results on CLEVR. Border colors are matched to the segmentation mask on the left.
Figure 17: Additional segmentation and object reconstruction results on Multi dSprites. Border colors are matched to the segmentation mask on the left.
Figure 18: Additional segmentation and object reconstruction results on Tetris. Border colors are matched to the segmentation mask on the left.
Figure 19: Full version of Figure 7, showcasing all slots.
Figure 20: CLEVR dataset. Odd rows: image and object masks as determined by the model. Even rows: first column is the input image, the second is the ground-truth masks, and the following ones are mask logits produced by the model.

Figure 21: Multi dSprites dataset. Odd rows: image and object masks as determined by the model. Even rows: first column is the input image, second one is the ground-truth masks and the following ones are mask logits produced by the model.
Figure 22: Tetris dataset. Odd rows: image and object masks as determined by the model. Even rows: first column is the input image, second one is the ground-truth masks and the following ones are mask logits produced by the model.
Figure 23: Projection on the first two principal components of the latent distribution for the CLEVR dataset. Each dot represents one object latent and is colored according to the corresponding ground truth factor.
Figure 24: Projection on the first two principal components of the latent distribution for the Multi dSprites dataset. Each dot represents one object latent and is colored according to the corresponding ground truth factor.
Figure 25: Projection on the first two principal components of the latent distribution for the Tetris dataset. Each dot represents one object latent and is colored according to the corresponding ground truth factor.
Figure 26: t-SNE of the latent distribution for the CLEVR dataset. Each dot represents one object latent and is colored according to the corresponding ground truth factor
Figure 27: t-SNE of the latent distribution for the Multi dSprites dataset. Each dot represents one object latent and is colored according to the corresponding ground truth factor
Figure 28: t-SNE of the latent distribution for the Tetris dataset. Each dot represents one object latent and is colored according to the corresponding ground truth factor
Figure 29: Ablation study for the model’s total loss on CLEVR. Each curve denotes the result of training the model without a particular input.
Figure 30: Ablation study for the model’s segmentation performance in terms of ARI (excluding the background pixels) on CLEVR. Each curve denotes the result of training the model without a particular input.
Figure 31: Ablation study for the model’s reconstruction loss term on CLEVR. Each curve denotes the result of training the model without a particular input. The y-axis shows the mean squared error between the target image and the output means (of the final iteration) as a proxy for the full reconstruction loss.
Figure 32: Ablation study for the model’s KL divergence loss term on CLEVR, summed across iterations. Each curve denotes the result of training the model without a particular input.
Figure 33: Ablation study for the model’s total loss on Tetris. Each curve denotes the result of training the model without a particular input.
Figure 34: Ablation study for the model’s segmentation performance in terms of ARI (excluding the background pixels) on Tetris. Each curve denotes the result of training the model without a particular input.
Figure 35: Ablation study for the model’s reconstruction loss term on Tetris. Each curve denotes the result of training the model without a particular input. The y-axis shows the mean squared error between the target image and the output means (of the final iteration) as a proxy for the full reconstruction loss.
Figure 36: Ablation study for the model’s KL divergence loss term on Tetris, summed across iterations. Each curve denotes the result of training the model without a particular input.