GENESIS: Generative Scene Inference and Sampling with Object-Centric Latent Representations

Generative models are emerging as promising tools in robotics and reinforcement learning. Yet, even though tasks in these domains typically involve distinct objects, most state-of-the-art methods do not explicitly capture the compositional nature of visual scenes. Two exceptions, MONet and IODINE, decompose scenes into objects in an unsupervised fashion via a set of latent variables. Their underlying generative processes, however, do not account for component interactions. Hence, neither of them allows for principled sampling of coherent scenes. Here we present GENESIS, the first object-centric generative model of visual scenes capable of both decomposing and generating complete scenes by explicitly capturing relationships between scene components. GENESIS parameterises a spatial GMM over pixels which is encoded by component-wise latent variables that are inferred sequentially or sampled from an autoregressive prior. We train GENESIS on two publicly available datasets and probe the information in the latent representations through a set of classification tasks, outperforming several baselines.

1 Introduction

Task execution in robotics and reinforcement learning requires accurate reasoning about elements of an environment and the ability to generalise to situations not encountered during training. While in some cases it is feasible to facilitate skill acquisition through corpora of manually collected labels for task-relevant objects (He et al. [1]), it is generally intractable to do so for every new task. As a consequence, deep generative models have gained in popularity due to their suitability for unsupervised training, for example, to improve sample efficiency in reinforcement learning (Ha and Schmidhuber [2], Racanière et al. [3]), to synthesise training instances for data augmentation (Lee et al. [4]), or to capture beliefs about the state of a scene (Gregor et al. [5]). Of particular interest recently have been models that are able to decompose visual scenes into meaningful objects without supervision (Burgess et al. [6], Greff et al. [7]).

Generative modelling, or probability density estimation, finds a parameterised distribution p_θ(x) that explains the training data, sometimes via a set of latent variables z. Typical tasks in robotics require interaction with distinct elements in an environment. Thus, p_θ(x) should capture that visual scenes consist of individual components that can each be described with a concise latent code. Most state-of-the-art generative models of images, though, do not cater to this (Parmar et al. [8]). Burgess et al. [6] and Greff et al. [7] recently proposed two latent-variable models, monet and iodine, that decompose visual scenes into objects. However, neither provides an object-centric generative model of scenes that accounts for relationships between constituent parts: for example, two physical objects cannot occupy the same location. Consequently, it is not possible to employ these models in applications such as generating synthetic scenes for data augmentation in downstream tasks (Lee et al. [4]) or as a “world model” (Ha and Schmidhuber [2]).

Moreover, despite being motivated as a probabilistic generative model, monet relies on a deterministic attention network and lacks a probabilistic inference mechanism. This supposedly makes the optimisation problem easier, but it also implies that monet cannot estimate the likelihood p_θ(x) and that it does not constitute a probabilistic generative model of visual scenes. In addition, the attention mechanism of monet embeds a CNN in an RNN, posing an issue in terms of scalability. These two considerations do not apply to iodine, but iodine employs a computationally expensive iterative refinement mechanism, also limiting its applicability.

Therefore, we propose GENESIS which is, to the best of our knowledge, the first object-centric generative model of visual scenes that explicitly captures dependencies between scene components (we use the terms “object” and “scene component” synonymously in this work). Crucially, this makes GENESIS significantly more suitable for various robotics and reinforcement learning applications. We demonstrate the ability of GENESIS to (i) decompose scenes into meaningful components, (ii) manipulate scenes, and (iii) generate novel scenes in a compositional fashion that resemble the training data. In genesis, relationships between scene components are modelled with an autoregressive prior that is learned alongside a sequential inference network. This sequential inference is performed in a low-dimensional latent space, allowing all convolutional encoders and decoders to be run in parallel to better exploit modern graphics processing hardware.

We conduct experiments on two canonical and publicly available datasets: the GQN dataset (Eslami et al. [9]) and ShapeStacks (Groth et al. [10]). These simulated environments serve as testing grounds for navigation and object manipulation tasks, respectively. We use the scene annotations available for ShapeStacks to show the utility of object-centric latent representations for tasks such as predicting whether a block tower is stable or not, where GENESIS outperforms recent baselines.

We will release our PyTorch implementation and trained models for further community evaluation.

2 Related Work

This work draws inspiration from several different lines of research from the machine learning, robotics, and computer vision communities.

Applications of Environment Models  Three use cases of environment models for sample efficient reinforcement learning are presented in Ha and Schmidhuber [2], Racanière et al. [3], and Gregor et al. [5]. Ha and Schmidhuber [2] are able to train an agent entirely inside a generative model, while Racanière et al. [3] enable agents to imagine different future outcomes to learn better policies. Gregor et al. [5] recently showed that generative models can capture beliefs about the state of the environment and that this can be leveraged to considerably improve the sample efficiency of reinforcement learning algorithms. In robotics, image editing and generating synthetic scenes are of particular interest as techniques for data augmentation. For example, Yao et al. [11] segment objects and subsequently render them in different poses. Alhaija et al. [12] overlay synthetic objects on real backgrounds and Lee et al. [4] edit scenes in pixel-wise semantic label space. Integrating GENESIS into methods such as these presents an exciting avenue for future research.

Structured Models  While perception modules in robotics are typically trained in a supervised fashion (He et al. [1]), collecting labels for every new task is intractable. Several methods leveraging structured latent variables have been proposed to discover meaningful objects without supervision. Attend-Infer-Repeat (Eslami et al. [13]) and its sequential counterpart in Kosiorek et al. [14] use spatial attention to partition scenes into objects. Tagger (Greff et al. [15]) and Neural Expectation Maximisation (Greff et al. [16]) perform unsupervised segmentation by modelling images as spatial mixture models.

Stacked Capsule Autoencoders (Kosiorek et al. [17]) discover geometric relationships between objects and their parts by using an affine-aware decoder. Yet, these approaches are limited with regard to the image complexity they can handle and none of these works demonstrates the ability to generate novel scenes with more complex, relational structure. While Xu et al. [18] present an extension of Eslami et al. [13] to generate images, their method only works for images with a uniform background and makes the prior assumption that object bounding boxes do not overlap. In contrast, we apply GENESIS to images from Eslami et al. [9] and Groth et al. [10] which contain simulated 3D scenes with complex backgrounds and considerable occlusion.

monet & iodine  While this work is most directly related to monet (Burgess et al. [6]) and iodine (Greff et al. [7]), it sets itself apart by introducing a generative model of complete visual scenes. monet employs a deterministic, recurrent CNN to implement a stick-breaking process (SBP) that outputs pixel-wise attention masks for scene components. Subsequently, a VAE (Kingma and Welling [19], Rezende et al. [20]) with a spatial broadcast decoder (Watters et al. [21]) models the individual components. Due to the deterministic attention network, monet does not constitute a proper probabilistic generative model, limiting its applicability. iodine uses a decoder and a refinement network with a computationally expensive, gradient-based iterative inference procedure. Structurally, GENESIS is more comparable to monet and avoids the expensive iterative mechanism in iodine. Unlike monet, though, the convolutional encoders and decoders in GENESIS can be run in parallel, rendering the model computationally more scalable to inputs with a larger number of scene components. Both monet and iodine use an independent prior over scene components, whereas the autoregressive prior of GENESIS allows the modelling of relationships between components and the generation of coherent novel scenes.

Adversarial Methods  A few recent works have proposed to use an adversary for scene segmentation and generation. While Chen et al. [22] and Bielski and Favaro [23] segment a single foreground object per image, Arandjelović and Zisserman [24] are able to segment several synthetic objects superimposed on natural images. Azadi et al. [25] introduce a model to combine two objects or an object and a background scene in a sensible fashion, whereas van Steenkiste et al. [26] can generate scenes with an arbitrary number of components. In comparison, GENESIS performs both inference and generation without exhibiting the instabilities common in adversarial training. Furthermore, the computational complexity of GENESIS grows as O(K), where K is the number of components, compared to the O(K^2) complexity of the relational stage in van Steenkiste et al. [26].

3 Genesis: Generative Scene Inference and Sampling

Akin to state-space representations, latent representations learned by generative models have been shown to improve sample efficiency of machine learning models in downstream tasks (van Steenkiste et al. [27]). Separating scene representations into object-centric ones facilitates relational reasoning (Santoro et al. [28]) and has the potential to further improve sample efficiency. Existing solutions, however, either require expensive inference procedures (Greff et al. [7, 15, 16]) or lack a proper probabilistic formulation (Burgess et al. [6]). Therefore, we introduce GENESIS which features a probabilistic formulation and an efficient inference procedure. Moreover, GENESIS has an autoregressive prior which is capable of learning relationships between scene components. We first describe the generative model of GENESIS, as well as a simplified variant called GENESIS-s, and follow by detailing the associated inference procedure. See Figure 1 for an overview of GENESIS and Figure 2 for the graphical model in comparison to alternative methods. GENESIS-s is illustrated in Figure 7 in Section A.1.

Generative model

Let x be an image. We formulate the problem of image generation as a spatial Gaussian mixture model (GMM). That is, every Gaussian component k ∈ {1, …, K}, where K is the maximum number of scene components, represents an image-sized scene component x_k. The corresponding image-sized mixing probability π_k represents whether the component is present at a given location in the image. The mixing probabilities are normalised across scene components, i.e. they sum to one at every pixel, and can be regarded as pixel-wise attention masks. Since there are strong spatial dependencies between different scene components, we formulate an autoregressive prior distribution over mask variables z^m_{1:K}, which encode the mixing probabilities π_{1:K}, as

p_\theta(z^m_{1:K}) = \prod_{k=1}^{K} p_\theta(z^m_k \mid z^m_{1:k-1})    (1)

The dependence on the previous latents z^m_{1:k-1} is implemented via an RNN whose hidden state summarises them.
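As a concrete illustration, the following PyTorch sketch shows one way such an autoregressive prior over mask latents could be realised with an LSTM; the class name, layer sizes, and the zero-initialised first input are assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class AutoregressivePrior(nn.Module):
    """Sketch of p(z^m_k | z^m_{1:k-1}) realised with an LSTM (illustrative sizes)."""

    def __init__(self, latent_dim=64, hidden_dim=256):
        super().__init__()
        self.latent_dim = latent_dim
        self.rnn = nn.LSTMCell(latent_dim, hidden_dim)
        self.to_stats = nn.Linear(hidden_dim, 2 * latent_dim)  # mean and log-variance

    def sample(self, batch_size, num_steps, device="cpu"):
        # The first step has no predecessors, so a zero vector is fed in; afterwards
        # each step is conditioned on the previously sampled latent through both the
        # input and the recurrent state.
        z_prev = torch.zeros(batch_size, self.latent_dim, device=device)
        h = torch.zeros(batch_size, self.rnn.hidden_size, device=device)
        c = torch.zeros_like(h)
        samples = []
        for _ in range(num_steps):
            h, c = self.rnn(z_prev, (h, c))
            mu, logvar = self.to_stats(h).chunk(2, dim=-1)
            z_prev = mu + (0.5 * logvar).exp() * torch.randn_like(mu)
            samples.append(z_prev)
        return torch.stack(samples, dim=1)  # (batch_size, num_steps, latent_dim)
```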

Next, we assume that the appearance of scene components is conditionally independent given their spatial allocation in the scene. The corresponding conditional distribution over component variables z^c_{1:K}, which encode the component appearances, factorises as follows,

p_\theta(z^c_{1:K} \mid z^m_{1:K}) = \prod_{k=1}^{K} p_\theta(z^c_k \mid z^m_k)    (2)

The image likelihood is given by a mixture model,

p_\theta(x \mid z^m_{1:K}, z^c_{1:K}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}\!\left(x ; \mu_\theta(z^c_k), \sigma_x^2\right)    (3)

where the mixing probabilities π_{1:K} are created via a stick-breaking process (SBP) adapted from Burgess et al. [6] as follows, slightly overloading the notation,

\pi_k = \begin{cases} \pi_\theta(z^m_1) & k = 1 \\ \pi_\theta(z^m_k) \prod_{j=1}^{k-1}\left(1 - \pi_\theta(z^m_j)\right) & 1 < k < K \\ \prod_{j=1}^{K-1}\left(1 - \pi_\theta(z^m_j)\right) & k = K \end{cases}    (4)

Note that this step is not essential for our model; we could instead use a softmax to normalise the masks as in Greff et al. [7]. Early experiments indicated, however, that the SBP learns to decompose scenes into meaningful objects more quickly.
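For clarity, here is a minimal sketch of the stick-breaking construction in Equation 4, assuming the mask decoder outputs per-pixel probabilities in [0, 1]; an actual implementation would likely operate in log-space for numerical stability.

```python
import torch

def stick_breaking(mask_probs):
    """Sketch of the SBP in Equation 4 (illustrative only).

    mask_probs: tensor of shape (K, B, 1, H, W) with values in [0, 1],
                e.g. sigmoid outputs of the mask decoder for components 1..K.
    Returns mixing probabilities pi of the same shape that sum to one over K.
    """
    K = mask_probs.shape[0]
    scope = torch.ones_like(mask_probs[0])  # pixels not yet explained
    pis = []
    for k in range(K - 1):
        pis.append(scope * mask_probs[k])
        scope = scope * (1.0 - mask_probs[k])
    pis.append(scope)  # the last component absorbs whatever scope remains
    return torch.stack(pis, dim=0)
```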

Finally and omitting subscripts, the full generative model can be written as follows,

p_\theta(x) = \iint p_\theta(x \mid z^m, z^c) \, p_\theta(z^c \mid z^m) \, p_\theta(z^m) \, \mathrm{d}z^c \, \mathrm{d}z^m    (5)

where we assume that all conditional distributions are Gaussian. The Gaussian components of the image likelihood have a fixed scalar standard deviation σ_x. We refer to this model as GENESIS. To investigate whether separate latents for masks and component appearances are necessary for decomposition, we consider a simplified model, GENESIS-s, with a single latent variable z_k per component,

p_\theta(z_{1:K}) = \prod_{k=1}^{K} p_\theta(z_k \mid z_{1:k-1})    (6)

In this case, z_k takes the role of z^c_k in Equation 3 and of z^m_k in Equation 4, while Equation 2 is no longer necessary.
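The pixel-wise mixture likelihood of Equation 3 can then be evaluated as in the sketch below; the value of sigma_x and the tensor layout are assumptions for illustration.

```python
import torch

def image_log_likelihood(x, pi, mu, sigma_x=0.7):
    """Sketch of the spatial GMM likelihood (Equation 3); sigma_x is illustrative.

    x:  (B, 3, H, W) image
    pi: (K, B, 1, H, W) mixing probabilities, e.g. from the stick-breaking process
    mu: (K, B, 3, H, W) component means decoded from the component latents
    Returns the log-likelihood per image, summed over pixels and channels.
    """
    var = sigma_x ** 2
    log_norm = -0.5 * torch.log(torch.tensor(2.0 * torch.pi * var))
    log_pdf = log_norm - 0.5 * (x.unsqueeze(0) - mu) ** 2 / var
    # log sum_k pi_k N(x; mu_k, sigma_x^2), evaluated stably in log-space per pixel
    log_mix = torch.logsumexp(torch.log(pi + 1e-8) + log_pdf, dim=0)
    return log_mix.sum(dim=(1, 2, 3))  # (B,)
```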

Figure 1: GENESIS overview. Given an image x, an encoder and an RNN compute the mask latents z^m_{1:K}. These are decoded to obtain the mixing probabilities π_{1:K}. The image and the individual masks are concatenated to infer the component latents z^c_{1:K}, from which the scene components x_{1:K} are decoded. In GENESIS-s, the second stage is omitted and π_{1:K} and x_{1:K} are computed simultaneously, as illustrated in Figure 7 in Section A.1.
(a) vae     (b) monet     (c) iodine       (d) genesis      (e) genesis-s
Figure 2: Graphical models of genesis and genesis-s compared to related methods; the plate for iodine denotes its repeated refinement iterations. Unlike the other methods, both genesis variants explicitly model dependencies between the scene components.
Approximate posterior

We amortise inference by using an approximate posterior distribution q_φ with parameters φ and a structure similar to that of the generative model. The full approximate posterior reads as follows,

q_\phi(z^m_{1:K}, z^c_{1:K} \mid x) = \prod_{k=1}^{K} q_\phi(z^c_k \mid x, z^m_k) \, q_\phi(z^m_k \mid x, z^m_{1:k-1})    (7)

with the dependence on z^m_{1:k-1} realised by an RNN. The RNN could, in principle, be shared with the prior, but we have not investigated this option. All conditional distributions are Gaussian. For GENESIS-s, the approximate posterior takes the form q_\phi(z_{1:K} \mid x) = \prod_{k=1}^{K} q_\phi(z_k \mid x, z_{1:k-1}).
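The second inference stage, q_φ(z^c_k | x, z^m_k), could be pictured as in the sketch below: the image is concatenated with the decoded mask of component k and encoded to a Gaussian over z^c_k. The network and its sizes are stand-ins rather than the component VAE used in the paper; note that all K components can be processed in parallel by folding the component index into the batch dimension.

```python
import torch
import torch.nn as nn

class ComponentPosterior(nn.Module):
    """Sketch of q(z^c_k | x, z^m_k) with a small stand-in encoder (illustrative)."""

    def __init__(self, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 32, 3, stride=2, padding=1), nn.ELU(),  # 3 image + 1 mask channels
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ELU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, 2 * latent_dim),
        )

    def forward(self, x, mask):
        # x: (B, 3, H, W), mask: (B, 1, H, W); to process all K components at once,
        # fold the component index into the batch dimension before calling this.
        mu, logvar = self.encoder(torch.cat([x, mask], dim=1)).chunk(2, dim=-1)
        return mu + (0.5 * logvar).exp() * torch.randn_like(mu)  # reparameterised sample
```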

Learning

GENESIS can be trained by maximising the evidence lower bound (ELBO) on the log-marginal likelihood log p_θ(x), given by

\log p_\theta(x) \geq \mathbb{E}_{q_\phi(z^m_{1:K}, z^c_{1:K} \mid x)}\!\left[\log \frac{p_\theta(x, z^m_{1:K}, z^c_{1:K})}{q_\phi(z^m_{1:K}, z^c_{1:K} \mid x)}\right]    (8)
= \mathbb{E}_{q_\phi}\!\left[\log p_\theta(x \mid z^m_{1:K}, z^c_{1:K})\right] - \mathrm{KL}\!\left(q_\phi(z^m_{1:K}, z^c_{1:K} \mid x) \,\|\, p_\theta(z^m_{1:K}, z^c_{1:K})\right)    (9)

However, this often leads to a strong emphasis on the likelihood term (first term in Equation 9), while allowing the marginal approximate posterior to drift away from the prior distribution, hence increasing the kl-divergence (second term in Equation 9). This also decreases the quality of samples drawn from the model.

To prevent this behaviour, we use the GECO objective from Rezende and Viola [29] instead, which changes the learning problem to minimising the kl-divergence subject to a reconstruction constraint. Let C be the minimum allowed reconstruction log-likelihood; GECO then uses Lagrange multipliers to solve the following problem,

\min_{\theta, \phi} \; \mathrm{KL}\!\left(q_\phi(z^m_{1:K}, z^c_{1:K} \mid x) \,\|\, p_\theta(z^m_{1:K}, z^c_{1:K})\right) \quad \text{subject to} \quad \mathbb{E}_{q_\phi}\!\left[\log p_\theta(x \mid z^m_{1:K}, z^c_{1:K})\right] \geq C    (10)
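A minimal sketch of one GECO-style update is given below; the multiplicative multiplier update and its step size are illustrative simplifications and differ from the exact schedule described in Section 4.1.

```python
import torch

def geco_step(kl, recon_ll, lam, C, step_size=1e-2):
    """Sketch of a GECO-style update for Equation 10 (hyperparameters illustrative).

    kl:       kl-divergence between posterior and prior for the current batch
    recon_ll: reconstruction log-likelihood for the current batch
    lam:      current Lagrange multiplier (positive scalar tensor, held out of ADAM)
    C:        minimum allowed reconstruction log-likelihood
    Returns the loss to backpropagate and the updated multiplier.
    """
    constraint = C - recon_ll          # <= 0 once the constraint is satisfied
    loss = kl + lam * constraint       # Lagrangian of the constrained problem
    with torch.no_grad():
        # Grow lam while the reconstruction is below C, shrink it otherwise.
        lam = lam * torch.exp(step_size * constraint)
    return loss, lam
```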

4 Experiments

In this section, we present qualitative and quantitative results on the “rooms-ring-camera” dataset from GQN (Eslami et al. [9]) and the ShapeStacks dataset (Groth et al. [10]). We use an image resolution of 64-by-64 for all experiments. The maximum number of scene components K was chosen separately for GQN and for ShapeStacks. Additional results are included in the Appendix, also illustrating a typical failure case of the model.

GQN  The “rooms-ring-camera” dataset includes simulated 3D scenes of a square room with different floor and wall textures, containing one to three objects of various shapes and sizes.

ShapeStacks  Images show simulated block towers of different heights (two to six blocks) where individual blocks have different shapes, sizes, and colours. Scenes have annotations for: stability of the tower (binary), number of blocks (two to six), properties of individual blocks, locations in the tower of centre-of-mass violations and planar surface violations, wall and floor textures (five each), light presets (five), and camera view points (sixteen).

4.1 Implementation Details

For the results in this section, we use the architecture from Berg et al. [30] to encode the input image and to decode the mask latents. This architecture was modified for GENESIS by applying batch normalisation (Ioffe and Szegedy [31]) before the GLUs (Dauphin et al. [32]). The encoder and the decoder each consist of five convolutional layers with size-5 kernels, strides of [1, 2, 1, 2, 1], and filter sizes of [32, 32, 64, 64, 64] and [64, 32, 32, 32, 32], respectively. Fully-connected layers are used at the lowest resolution.
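A sketch of how this gated convolutional encoder could look is shown below; the padding, the exact GLU placement after batch normalisation, and the fully-connected head size (which assumes 64-by-64 inputs) are our assumptions rather than the exact architecture of Berg et al. [30].

```python
import torch
import torch.nn as nn

class GatedConvEncoder(nn.Module):
    """Sketch of the gated convolutional encoder described above (illustrative)."""

    def __init__(self, in_channels=3, feature_dim=128):
        super().__init__()
        filters = [32, 32, 64, 64, 64]
        strides = [1, 2, 1, 2, 1]
        layers, prev = [], in_channels
        for f, s in zip(filters, strides):
            # Each convolution outputs 2*f channels; the GLU gates them back down to f.
            layers += [
                nn.Conv2d(prev, 2 * f, kernel_size=5, stride=s, padding=2),
                nn.BatchNorm2d(2 * f),
                nn.GLU(dim=1),
            ]
            prev = f
        self.convs = nn.Sequential(*layers)
        # Fully-connected head at the lowest resolution (64x64 inputs end up at 16x16).
        self.fc = nn.Linear(prev * 16 * 16, feature_dim)

    def forward(self, x):
        h = self.convs(x)
        return self.fc(h.flatten(start_dim=1))
```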

The encoded image is passed to an LSTM cell (Hochreiter and Schmidhuber [33]) followed by a linear layer to compute the mask latents z^m_k of size 64. The LSTM state size is twice the latent size. Importantly, unlike the analogous counterpart in monet, the decoding of z^m_{1:K} into the mixing probabilities is performed in parallel across components. The autoregressive prior is implemented as an LSTM with 256 units. The conditional distribution p_θ(z^c_k | z^m_k) is parameterised by an MLP with two hidden layers, 256 units per layer, and ELUs (Clevert et al. [34]). We use the same component VAE featuring a spatial broadcast decoder as monet to encode and decode the component latents z^c_k, but we replace ReLUs (Glorot et al. [35]) with ELUs.

For GENESIS-s, the encoder for z_k is the same as for z^m_k above and the decoder from Berg et al. [30] is again used to compute the mixing probabilities. However, GENESIS-s also has a second decoder with spatial broadcasting to obtain the object appearances from z_k. We found the use of two different decoders to be important for GENESIS-s to learn meaningful scene decompositions. While this shows that it is not necessary to use separate latent variables for component masks and appearances in order to decompose scenes, we found that GENESIS consistently trained more quickly with better qualitative results. We conjecture that this gap might close to some degree when training the models for more iterations. Hence, the qualitative results reported in this section are from GENESIS.

All models are trained for a fixed number of iterations with a batch size of 32 using the ADAM optimiser (Kingma and Ba [36]) and a constant learning rate. With these settings, training GENESIS takes about two days on a single GPU, though we expect performance to improve with further training. We deliberately choose a comparatively weak reconstruction constraint for the GECO objective to emphasise kl minimisation and sample quality. We increase the GECO step size by a constant factor when the reconstruction constraint is satisfied to accelerate training.

4.2 Scene Generation

Unlike previous works, GENESIS has an autoregressive prior to capture intricate dependencies between scene components. Modelling these relationships is necessary to generate coherent scenes. For example, different parts of the background need to fit together; components that can only be present once, such as the sky, should not be generated several times; and several physical objects cannot occupy the same location. GENESIS is able to generate novel scenes by sequentially sampling scene components from the prior and conditioning each new component on those that have been generated during previous steps.
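Putting the pieces together, generation could be sketched as below; all callables are placeholder interfaces (for instance, the autoregressive prior and the stick_breaking helper sketched in Section 3), and the default number of components is illustrative.

```python
import torch

def generate_scene(prior, comp_prior, mask_decoder, comp_decoder, stick_breaking,
                   K=7, batch=1):
    """Sketch of component-wise scene generation (all interfaces are placeholders).

    prior:          autoregressive prior over mask latents, with a .sample() method
    comp_prior:     maps mask latents to (mean, log-variance) of the component latents
    mask_decoder:   maps mask latents to per-pixel probabilities in [0, 1]
    comp_decoder:   maps component latents to RGB component appearances
    stick_breaking: normalises the decoded masks into mixing probabilities
    """
    z_m = prior.sample(batch, K)                 # (B, K, D_m)
    mask_probs = mask_decoder(z_m)               # (K, B, 1, H, W)
    pi = stick_breaking(mask_probs)              # mixing probabilities, sum to one over K
    mu_c, logvar_c = comp_prior(z_m)             # p(z^c_k | z^m_k)
    z_c = mu_c + (0.5 * logvar_c).exp() * torch.randn_like(mu_c)
    components = comp_decoder(z_c)               # (K, B, 3, H, W)
    return (pi * components).sum(dim=0)          # expected image under the spatial GMM
```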

Figure 3 shows two scenes generated by GENESIS after training on GQN. At every step, either an object in the foreground or a part of the background is generated. Crucially, these components fit together to make up a semantically consistent scene that looks similar to the training data.

Figure 3: Two scenes generated in a component-wise fashion by GENESIS after training on the GQN dataset. The first pane shows the final image and the subsequent panes show the different scene components generated at each step. The model first generates the sky and the floor, followed by individual objects, and finally distinct parts of the wall in the background.

4.3 Scene Decomposition

Similar to monet and iodine, GENESIS is able to segment scenes into meaningful components, with some slots reconstructing individual foreground objects and other slots reconstructing different parts of the background. Step-by-step reconstructions for a model trained on the GQN dataset are shown in Figure 4. The model starts by reconstructing part of the background in the first step and individual objects are reconstructed in separate steps.

We observed that the model pursues a similar decomposition strategy for all input images. Models trained with different hyperparameters also typically learn the same strategy. We hypothesise that this is due to the fact that large surfaces with little variation in appearance have a big influence on the reconstruction loss and are also easy to learn compared to more involved parts of the scene with finer details. In particular, the mask latents in the first step only depend on the input, whereas later masks depend on previous steps which could introduce distracting noise early on in training.

In Figure 5, we show the step-by-step reconstructions on ShapeStacks. Again, the model learns a similar reconstruction strategy as for GQN. Notably, though, GENESIS is able to differentiate between all five objects in the block tower, more than in the GQN scenes, even though the background is quite noisy and the objects in the tower occlude each other significantly.

Figure 4: Decomposing a validation image from the GQN dataset. From left to right, the panels show: the input scene presented to the model, the final reconstruction, the first reconstruction step (attending to the ground and to the sky), the second step (focusing on the yellow shape), the third step (focusing on the purple shape), and three further steps (attending to different parts of the wall).
Figure 5: Decomposing a validation image from ShapeStacks. Despite a noisy background and considerable occlusion, GENESIS is able to segment all five shapes in the block tower.

4.3.1 Scene Manipulation

Learning disentangled representations (van Steenkiste et al. [27]) combined with the ability to decompose scenes enables GENESIS to deliberately manipulate selected parts of the scene while leaving the rest unchanged. This is achieved by linearly traversing the latent space of individual component latents. This can be used, for example, to change the appearance of the background or any of the foreground objects as illustrated in Figure 6.
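A sketch of such a traversal is shown below; the decoder interface, slot index, and value range are placeholders for illustration.

```python
import torch

def traverse_component(decoder, z_masks, z_comps, slot, dim, values):
    """Sketch of a linear traversal of one component latent (placeholder API).

    decoder: callable mapping (z_masks, z_comps) -> reconstructed image batch
    z_masks: inferred mask latents of shape (B, K, D_m)
    z_comps: inferred component latents of shape (B, K, D_c)
    slot:    index k of the component to manipulate
    dim:     latent dimension to traverse
    values:  iterable of scalars substituted into z_comps[:, slot, dim]
    """
    frames = []
    for v in values:
        z_mod = z_comps.clone()
        z_mod[:, slot, dim] = v          # change one dimension of one slot only
        with torch.no_grad():
            frames.append(decoder(z_masks, z_mod))
    return torch.stack(frames, dim=1)    # (B, len(values), 3, H, W)
```

For instance, calling traverse_component(model.decode, z_m, z_c, slot=0, dim=3, values=torch.linspace(-2, 2, 8)) with a hypothetical decode function would sweep a single latent dimension of the first slot, which is the kind of manipulation shown in Figure 6.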

Figure 6: In the first row, the floor appearance gradually changes from a dark shade to a light shade by traversing one of the component latents. In the second row, the colour of the foreground object transitions from green to red.

4.4 Evaluating Unsupervised Representations on ShapeStacks

We employ a set of discriminative tasks to evaluate how well the representations learned by GENESIS capture various properties of input scenes and compare to several VAE and monet baselines. Alternative metrics such as ELBO or negative log-likelihood cannot be computed for monet. To this end, ShapeStacks is an interesting testing ground as scenes contain multiple objects with surface contact. In particular, we consider three tasks: (1) Is a tower stable or not? (2) What is the tower’s height in terms of the number of blocks? (3) What is the quantised camera pose (out of 16 possibilities)? The first two tasks focus on information relating to foreground objects in the scene and the third task was inspired by Eslami et al. [9].

We compare against four baselines: a VAE with a deconvolutional decoder (DeconvVAE), a VAE with a spatial broadcast decoder (BroadcastVAE), and two different variants of monet with either [32, 32, 64, 64, 64] (monet32) or [64, 64, 128, 128, 128] (monet64) filters in the attention encoder and the reverse in the decoder. The VAEs have 64 latents and the same encoder (and decoder in the case of the DeconvVAE) as in Berg et al. [30]. The BroadcastVAE has the same decoder with ELUs as GENESIS, but with twice the number of filters to enable a better comparison. For the monet variants, the prior on the masks was normalised with a softmax function to compute the kl-divergence.

MLPs with one hidden layer, 512 units, and ELU activations were used for classification. The classifiers were trained for 100 epochs on 50,000 labelled examples with a batch size of 128 using a cross-entropy loss, the ADAM optimiser, and a fixed learning rate. As inputs to the classifiers, we concatenate z^m_{1:K} and z^c_{1:K} for GENESIS, z_{1:K} for GENESIS-s, and the component VAE latents for the two monet variants. Table 1 shows the test accuracy of the models. The first row of the table shows the accuracy obtained by always predicting the class with the largest number of examples in the test set.
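The probe described above could be set up roughly as follows; the latent layout, number of slots, and learning rate are illustrative assumptions.

```python
import torch
import torch.nn as nn

def make_probe(input_dim, num_classes):
    """One-hidden-layer MLP probe with 512 units and ELU activations."""
    return nn.Sequential(
        nn.Linear(input_dim, 512),
        nn.ELU(),
        nn.Linear(512, num_classes),
    )

# Illustrative setup: K slots with D-dimensional latents each are flattened,
# concatenated, and classified (K, D, and the learning rate are assumptions).
K, D = 9, 128
probe = make_probe(K * D, num_classes=2)  # e.g. the binary stability task
optimiser = torch.optim.Adam(probe.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def train_step(latents, labels):
    # latents: (B, K, D) inferred by the frozen representation network; labels: (B,)
    logits = probe(latents.flatten(start_dim=1))
    loss = criterion(logits, labels)
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()
```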

Model Stability Height View
Largest class 50.0 30.5 6.25
DeconvVAE 58.7 67.8 98.7
BroadcastVAE 59.2 77.5 99.8
monet32 59.3 85.0 99.1
monet64 60.4 86.2 99.4
GENESIS 65.2 81.5 98.8
GENESIS-s 63.5 78.0 99.8
Table 1: Classification accuracy in % on the test sets of the ShapeStacks tasks.

None of the models are able to reach the stability prediction accuracies reported in Groth et al. [10]. This is not surprising considering that the representation networks were trained on sub-sampled images without data augmentation and with a pixel-wise reconstruction loss. Notably, however, both GENESIS and monet do much better at predicting tower stability and height than the VAE baselines, indicating that object-centric latents are beneficial for these tasks. monet performs particularly well at predicting the height of the towers, which might be facilitated by its deterministic segmentation network. Nevertheless, both variants of GENESIS outperform all baselines in terms of stability prediction. All models do well at predicting the camera view.

5 Conclusions

In this work, we propose a novel object-centric latent variable model of scenes called GENESIS. We show that GENESIS is, to the best of our knowledge, the first unsupervised model that both decomposes visual scenes into semantically meaningful constituent parts and generates novel, coherent scenes in a component-wise fashion. Importantly, this is achieved by capturing relationships between scene components with an autoregressive prior, allowing complete scenes to be modelled as a collection of components. Regarding future work, the most interesting challenge is to scale GENESIS to more complex datasets. Another potentially promising research direction is to adapt the formulation to only model parts of the scene that are relevant for a certain task. We hope that this work will open up promising avenues for further research, in particular with regard to learning and planning in latent space as well as better harnessing the capabilities of generative models in robotics and reinforcement learning applications.

This research was supported by an EPSRC Programme Grant (EP/M019918/1). The authors would like to acknowledge the use of the University of Oxford Advanced Research Computing (ARC) facility in carrying out this work, http://dx.doi.org/10.5281/zenodo.22558, and the use of Hartree Centre resources. The authors thank Yizhe Wu for his help with re-implementing monet.

References

Appendix A Appendix

A.1 genesis-s

Figure 7: GENESIS-s overview. Given an image x, an encoder and an RNN compute the latent variables z_{1:K}. These are decoded to directly obtain the mixing probabilities π_{1:K} and the scene components x_{1:K}.

A.2 Architecture Variations

We investigate several architecture variations of GENESIS and GENESIS-s and compare their performances on the ShapeStacks tasks.

The basic GatedConv architecture is adapted from Berg et al. [30] as described in Section 4.1. Here, we report results for three variants of GENESIS and GENESIS-s with the following modifications:

  • increase the dimension of the mask latents from 64 to 128 and reduce the number of filters in the mask decoder by a factor of two;

  • use four stride-2 layers with [32, 32, 64, 64] and [64, 32, 32, 32] filters in the encoder and decoder, respectively, and use instance normalisation (IN) (Ulyanov et al. [37]) instead of batch normalisation (BN);

  • use the attention architecture from monet32 without skip-connections and replace ReLU activations with ELUs.

The results are summarised in Table 2. Overall, we found that performance on ShapeStacks was not strongly affected by these architecture changes. Both GENESIS and GENESIS-s still perform better on the stability prediction task than the baselines in Table 1. It would be interesting to conduct a more comprehensive study to establish best practices with regard to architecture design for these types of models, in particular regarding performance vs. run-time trade-offs.

Model Latent dim Architecture Norm Stability Height View
GENESIS 64 GatedConv BN 65.2 81.5 98.8
128 GatedConv, 1/2 channels in decoder BN 64.3 79.8 98.6
128 GatedConv, four stride-2 layers IN 61.0 77.1 99.6
128 monet32-like IN 63.1 78.8 99.6
GENESIS-s 64 GatedConv BN 63.5 78.0 99.8
128 GatedConv, 1/2 channels in decoder BN 64.1 81.7 99.7
128 GatedConv, four stride-2 layers IN 64.4 79.0 99.8
128 monet32-like IN 62.6 77.2 99.9
Table 2: Classification accuracy in % on the semi-supervised tasks.

a.3 Scene Generation

Figure 8: Step-by-step generation of a synthetic scene by GENESIS after training on GQN, showing the unmasked component appearances in row two, the masks of the mixing probabilities in row three, and the “scope” of the stick-breaking process which represents pixels in the image that still need to be explained. Notably, the model decides to not use the second reconstruction step as there is only one object in the foreground.

a.4 Scene Decomposition

Figure 9: From left to right, the eight panels show: the input scene from GQN presented to the model, the final model reconstruction, the first reconstruction step which attends to the floor and the sky, the second step which attends to the yellow shape, the third step which attends to the purple shape, and three further steps that attend to different parts of the background.
Figure 10: De-rendering an image from the ShapeStacks validation set. Despite a noisy background and considerable occlusion, GENESIS is able to segment all five shapes in the block tower. It is interesting to observe how the component appearances in steps three to eight all capture the light conditions of the scene, even though the top surfaces of the blocks below the top block are mostly occluded.
Figure 11: Failure case on the ShapeStacks dataset. Especially when there is a large number of objects in the block tower (six in this example), GENESIS sometimes explains several objects in a single reconstruction step, as in steps two, three, and four for this input image.