1 Introduction
Task execution in robotics and reinforcement learning requires accurate reasoning about elements of an environment and the ability to generalise to situations not encountered during training. While in some cases it is feasible to facilitate skill acquisition through corpora of manually collected labels for task-relevant objects (He et al. [1]), it is generally intractable to do so for every new task. As a consequence, deep generative models have gained in popularity due to their suitability to unsupervised training, for example, to improve sample efficiency in reinforcement learning (Ha and Schmidhuber [2], Racanière et al. [3]), to synthesise training instances for data augmentation (Lee et al. [4]) or to capture beliefs about the state of a scene (Gregor et al. [5]). Of particular interest recently have been models that are able to decompose visual scenes into meaningful objects without supervision (Burgess et al. [6], Greff et al. [7]).
Generative modelling, or probability density estimation, finds a parameterised distribution $p_\theta(x)$ that explains the training data $x$, sometimes via a set of latent variables $z$. Typical tasks in robotics require interaction with distinct elements in an environment. Thus, $p_\theta(x)$ should capture that visual scenes consist of individual components that can each be described with a concise latent code. Most state-of-the-art generative models of images, though, do not cater to this (Parmar et al. [8]). Burgess et al. [6] and Greff et al. [7] recently proposed two latent-variable models, MONet and IODINE, that decompose visual scenes into objects. However, neither provides an object-centric generative model of scenes that accounts for relationships between constituent parts: for example, two physical objects cannot occupy the same location. Consequently, it is not possible to employ these models in applications such as generating synthetic scenes for data augmentation in downstream tasks (Lee et al. [4]) or as a “world model” (Ha and Schmidhuber [2]). Moreover, despite being motivated as a probabilistic generative model, MONet relies on a deterministic attention network and lacks a probabilistic inference mechanism. This arguably makes the optimisation problem easier, but it also means that MONet cannot estimate the likelihood $p_\theta(x)$ and therefore does not constitute a probabilistic generative model of visual scenes. In addition, the attention mechanism of MONet embeds a CNN in an RNN, posing an issue in terms of scalability. These two considerations do not apply to IODINE, but IODINE employs a computationally expensive iterative refinement mechanism, which also limits its applicability.
Therefore, we propose GENESIS which is, to the best of our knowledge, the first object-centric generative model of visual scenes that explicitly captures dependencies between scene components (we use the terms “object” and “scene component” synonymously in this work). Crucially, this makes GENESIS significantly more suitable for various robotics and reinforcement learning applications. We demonstrate the ability of GENESIS to (i) decompose scenes into meaningful components, (ii) manipulate scenes, and (iii) generate novel scenes in a compositional fashion that resemble the training data. In GENESIS, relationships between scene components are modelled with an autoregressive prior that is learned alongside a sequential inference network. This sequential inference mechanism operates in a low-dimensional latent space, allowing all convolutional encoders and decoders to be run in parallel to better exploit modern graphics processing hardware.
We conduct experiments on two canonical and publicly available datasets: the GQN dataset (Eslami et al. [9]) and ShapeStacks (Groth et al. [10]). These simulated environments serve as testing grounds for navigation and object manipulation tasks, respectively. We use the scene annotations available for ShapeStacks to show the utility of object-centric latent representations for tasks such as predicting whether a block tower is stable or not, where GENESIS outperforms recent baselines.
We will release our PyTorch implementation and trained models for further community evaluation.
2 Related Work
This work draws inspiration from several different lines of research from the machine learning, robotics, and computer vision communities.
Applications of Environment Models Three use cases of environment models for sample-efficient reinforcement learning are presented in Ha and Schmidhuber [2], Racanière et al. [3], and Gregor et al. [5]. Ha and Schmidhuber [2] are able to train an agent entirely inside a generative model, while Racanière et al. [3] enable agents to imagine different future outcomes to learn better policies. Gregor et al. [5] recently showed that generative models can capture beliefs about the state of the environment and that this can be leveraged to considerably improve the sample efficiency of reinforcement learning algorithms. In robotics, image editing and generating synthetic scenes are of particular interest as techniques for data augmentation. For example, Yao et al. [11] segment objects and subsequently render them in different poses. Alhaija et al. [12] overlay synthetic objects on real backgrounds and Lee et al. [4] edit scenes in pixel-wise semantic label space. Integrating GENESIS into methods such as these presents an exciting avenue for future research.
Structured Models While perception modules in robotics are typically trained in a supervised fashion (He et al. [1]), collecting labels for every new task is intractable.
Several methods leveraging structured latent variables have been proposed to discover meaningful objects without supervision.
Attend-Infer-Repeat (Eslami et al. [13]) and its sequential counterpart in Kosiorek et al. [14] use spatial attention to partition scenes into objects.
Tagger (Greff et al. [15]) and Neural Expectation Maximisation (Greff et al. [16]) perform unsupervised segmentation by modelling images as spatial mixture models.
Stacked Capsule Autoencoders (Kosiorek et al. [17]) discover objects and their constituent parts by exploiting geometric relationships between them, while Xu et al. [18] employ amortised structural regularisation when generating scenes with multiple objects.
MONet & IODINE While this work is most directly related to MONet (Burgess et al. [6]) and IODINE (Greff et al. [7]), it sets itself apart by introducing a generative model of complete visual scenes. MONet employs a deterministic, recurrent CNN to implement a stick-breaking process (SBP) that outputs pixel-wise attention masks for scene components. Subsequently, a VAE (Kingma and Welling [19], Rezende et al. [20]) with a spatial broadcast decoder (Watters et al. [21]) models individual components. Due to the deterministic attention network, MONet does not constitute a proper probabilistic generative model, limiting its applicability. IODINE uses a decoder and a refinement network with a computationally expensive, gradient-based iterative inference procedure. Structurally, GENESIS is more comparable to MONet but avoids the expensive iterative mechanism of IODINE. Unlike MONet, though, the convolutional encoders and decoders in GENESIS can be run in parallel, rendering the model computationally more scalable to inputs with a larger number of scene components. Both MONet and IODINE use an independent prior over scene components, whereas the autoregressive prior of GENESIS allows the modelling of relationships between components and the generation of coherent novel scenes.
Adversarial Methods A few recent works have proposed to use an adversary for scene segmentation and generation. While Chen et al. [22] and Bielski and Favaro [23] segment a single foreground object per image, Arandjelović and Zisserman [24] are able to segment several synthetic objects superimposed on natural images. Azadi et al. [25] introduce a model to combine two objects or an object and a background scene in a sensible fashion, whereas van Steenkiste et al. [26] can generate scenes with an arbitrary number of components. In comparison, GENESIS performs both inference and generation without exhibiting the instabilities common in adversarial training. Furthermore, the computational complexity of GENESIS increases with $\mathcal{O}(K)$, where $K$ is the number of components, compared to the $\mathcal{O}(K^2)$ complexity of the relational stage in van Steenkiste et al. [26].
3 GENESIS: Generative Scene Inference and Sampling
Akin to state-space representations, latent representations learned by generative models have been shown to improve sample efficiency of machine learning models in downstream tasks (van Steenkiste et al. [27]). Separating scene representations into object-centric ones facilitates relational reasoning (Santoro et al. [28]) and has the potential to further improve sample efficiency. Existing solutions, however, either require expensive inference procedures (Greff et al. [7, 15, 16]) or lack a proper probabilistic formulation (Burgess et al. [6]). Therefore, we introduce GENESIS which features a probabilistic formulation and an efficient inference procedure. Moreover, GENESIS has an autoregressive prior which is capable of learning relationships between scene components. We first describe the generative model of GENESIS, as well as a simplified variant called GENESIS-s, and follow by detailing the associated inference procedure. See Figure 1 for an overview of GENESIS and Figure 2 for the graphical model in comparison to alternative methods. GENESIS-s is illustrated in Figure 7 in Section A.1.
Generative model
Let $x$ be an image. We formulate the problem of image generation as a spatial Gaussian mixture model (GMM). That is, every Gaussian component $k \in \{1, \ldots, K\}$, where $K$ is the maximum number of scene components, represents an image-sized scene component $x_k$. The corresponding image-sized mixing probability $\pi_k$ represents whether the component is present at each location in the image. The mixing probabilities are normalised across scene components at every pixel $i$, i.e. $\sum_{k=1}^{K} \pi_{k,i} = 1$, and can be regarded as pixel-wise attention masks. Since there are strong spatial dependencies between different scene components, we formulate an autoregressive prior distribution over mask variables $z^m_{1:K}$, which encode the mixing probabilities $\pi_{1:K}$, as
$$p_\theta(z^m_{1:K}) = \prod_{k=1}^{K} p_\theta\left(z^m_k \mid z^m_{1:k-1}\right). \tag{1}$$
The dependence on the previous latents $z^m_{1:k-1}$ is implemented via an RNN whose hidden state summarises the latents of the preceding steps.
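To make this concrete, the following PyTorch sketch implements Equation 1 with an LSTM cell; the module name, the latent size of 64, the hidden size of 256, and the Gaussian parameterisation via a single linear head are illustrative assumptions rather than a description of the released implementation.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal


class AutoregressivePrior(nn.Module):
    """Models p(z^m_k | z^m_{1:k-1}) with an RNN over previously sampled mask latents."""

    def __init__(self, latent_dim=64, hidden_dim=256):
        super().__init__()
        self.rnn = nn.LSTMCell(latent_dim, hidden_dim)
        self.to_stats = nn.Linear(hidden_dim, 2 * latent_dim)  # predicts mean and log-variance
        self.latent_dim = latent_dim
        self.hidden_dim = hidden_dim

    def forward(self, num_steps, batch_size, device):
        z_prev = torch.zeros(batch_size, self.latent_dim, device=device)
        state = (torch.zeros(batch_size, self.hidden_dim, device=device),
                 torch.zeros(batch_size, self.hidden_dim, device=device))
        samples = []
        for _ in range(num_steps):
            state = self.rnn(z_prev, state)            # hidden state summarises z^m_{1:k-1}
            mean, logvar = self.to_stats(state[0]).chunk(2, dim=-1)
            z_prev = Normal(mean, (0.5 * logvar).exp()).rsample()
            samples.append(z_prev)
        return torch.stack(samples, dim=1)             # (batch, K, latent_dim)
```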
Next, we assume that the appearance of scene components is conditionally independent given their spatial allocation in the scene. The corresponding conditional distribution over component variables $z^c_{1:K}$, which encode the component appearances, factorises as
$$p_\theta(z^c_{1:K} \mid z^m_{1:K}) = \prod_{k=1}^{K} p_\theta\left(z^c_k \mid z^m_k\right). \tag{2}$$
The image likelihood is given by a mixture model,
$$p_\theta(x \mid z^m_{1:K}, z^c_{1:K}) = \prod_{i=1}^{D} \sum_{k=1}^{K} \pi_{k,i}\, p_\theta(x_i \mid z^c_k), \tag{3}$$
where $i$ indexes the $D$ image pixels and the mixing probabilities are created from the mask latents via a stick-breaking process (SBP) adapted from Burgess et al. [6] as follows, slightly overloading the notation,
$$\pi_k = \begin{cases} \pi(z^m_1) & k = 1 \\ \pi(z^m_k) \prod_{j=1}^{k-1} \left(1 - \pi(z^m_j)\right) & 1 < k < K \\ \prod_{j=1}^{K-1} \left(1 - \pi(z^m_j)\right) & k = K \end{cases} \tag{4}$$
Note that this step is not essential for our model; instead, we could use a softmax to normalise the masks as in Greff et al. [7]. Early experiments indicated, however, that the SBP learns to decompose scenes into meaningful objects more quickly.
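The following sketch shows how the SBP in Equation 4 and the pixel-wise mixture in Equation 3 might be computed from decoded mask logits and component reconstructions; the tensor layout and the value of the fixed standard deviation are assumptions for illustration.

```python
import torch
from torch.distributions import Normal


def stick_breaking(mask_logits):
    """Turn per-component mask logits (B, K, 1, H, W) into mixing probabilities (Equation 4)."""
    num_comps = mask_logits.shape[1]
    scope = torch.ones_like(mask_logits[:, 0:1])        # remaining "stick" per pixel
    pis = []
    for k in range(num_comps):
        alpha = torch.sigmoid(mask_logits[:, k:k + 1])
        if k < num_comps - 1:
            pis.append(scope * alpha)                   # claim a fraction of the remaining stick
            scope = scope * (1.0 - alpha)
        else:
            pis.append(scope)                           # last component absorbs the remainder
    return torch.cat(pis, dim=1)                        # sums to one over the component axis


def mixture_log_likelihood(x, recons, pis, sigma=0.7):
    """Per-pixel spatial GMM of Equation 3: sum over pixels of log sum_k pi_k N(x_i | x_{k,i}, sigma^2)."""
    # x: (B, C, H, W); recons: (B, K, C, H, W); pis: (B, K, 1, H, W); sigma is an assumed value
    log_p = Normal(recons, sigma).log_prob(x.unsqueeze(1))
    log_mix = torch.logsumexp(torch.log(pis + 1e-8) + log_p, dim=1)
    return log_mix.sum(dim=(1, 2, 3))                   # one log-likelihood per image
```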
Finally, and omitting subscripts, the full generative model can be written as

$$p_\theta(x) = \iint p_\theta(x \mid z^m, z^c)\, p_\theta(z^c \mid z^m)\, p_\theta(z^m)\, \mathrm{d}z^m\, \mathrm{d}z^c, \tag{5}$$

where we assume that all conditional distributions are Gaussian. The Gaussian components of the image likelihood have a fixed scalar standard deviation $\sigma$. We refer to this model as GENESIS. To investigate whether separate latents for masks and component appearances are necessary for decomposition, we consider a simplified model, GENESIS-s, with a single latent variable $z_k$ per component,

$$p_\theta(x) = \int p_\theta(x \mid z)\, p_\theta(z)\, \mathrm{d}z, \qquad p_\theta(z) = \prod_{k=1}^{K} p_\theta(z_k \mid z_{1:k-1}). \tag{6}$$

In this case, $z_k$ takes the role of $z^c_k$ in Equation 3 and of $z^m_k$ in Equation 4, while Equation 2 is no longer necessary.
Approximate posterior
We amortise inference by using an approximate posterior distribution $q_\phi(z^m_{1:K}, z^c_{1:K} \mid x)$ with parameters $\phi$ and a structure similar to the generative model. The full approximate posterior reads

$$q_\phi(z^m_{1:K}, z^c_{1:K} \mid x) = \prod_{k=1}^{K} q_\phi(z^m_k \mid z^m_{1:k-1}, x)\, q_\phi(z^c_k \mid z^m_k, x), \tag{7}$$

with the dependence on $z^m_{1:k-1}$ realised by an RNN. The RNN could, in principle, be shared with the prior, but we have not investigated this option. All conditional distributions are Gaussian. For GENESIS-s, the approximate posterior takes the form $q_\phi(z_{1:K} \mid x) = \prod_{k=1}^{K} q_\phi(z_k \mid z_{1:k-1}, x)$.
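A sketch of this sequential inference procedure is given below. Only the recurrence over the low-dimensional mask latents is sequential; decoding the masks and encoding the components can then be batched across all $K$ slots. The module interfaces, the latent size, and the use of an LSTM cell are assumptions for illustration.

```python
import torch
from torch.distributions import Normal

LATENT_DIM = 64  # assumed mask-latent size


def infer_mask_latents(x, image_encoder, lstm_cell, stats_head, num_steps):
    """Sample z^m_k ~ q(z^m_k | z^m_{1:k-1}, x) for k = 1..K (first factor of Equation 7)."""
    features = image_encoder(x)                                # (B, F), shared across steps
    z_prev = torch.zeros(x.shape[0], LATENT_DIM, device=x.device)
    state = None
    z_masks = []
    for _ in range(num_steps):
        state = lstm_cell(torch.cat([features, z_prev], dim=-1), state)
        mean, logvar = stats_head(state[0]).chunk(2, dim=-1)
        z_prev = Normal(mean, (0.5 * logvar).exp()).rsample()  # reparameterised sample
        z_masks.append(z_prev)
    return torch.stack(z_masks, dim=1)                         # (B, K, LATENT_DIM)
```

All mask latents can then be decoded in a single batched call, for example `mask_decoder(z_masks.flatten(0, 1))`, after which $q_\phi(z^c_k \mid z^m_k, x)$ is computed by a component encoder applied in parallel to the image concatenated with each mask.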
Learning
GENESIS can be trained by maximising the evidence lower bound (ELBO) on the log-marginal likelihood $\ln p_\theta(x)$, given by

$$\mathcal{L}_{\mathrm{ELBO}}(x) = \mathbb{E}_{q_\phi(z^m_{1:K}, z^c_{1:K} \mid x)}\left[\ln \frac{p_\theta(x, z^m_{1:K}, z^c_{1:K})}{q_\phi(z^m_{1:K}, z^c_{1:K} \mid x)}\right] \tag{8}$$

$$= \mathbb{E}_{q_\phi(z^m_{1:K}, z^c_{1:K} \mid x)}\left[\ln p_\theta(x \mid z^m_{1:K}, z^c_{1:K})\right] - \mathrm{KL}\left(q_\phi(z^m_{1:K}, z^c_{1:K} \mid x) \,\|\, p_\theta(z^m_{1:K}, z^c_{1:K})\right). \tag{9}$$
However, this often leads to a strong emphasis on the likelihood term (first term in Equation 9), while allowing the marginal approximate posterior to drift away from the prior distribution, hence increasing the KL divergence (second term in Equation 9). This also decreases the quality of samples drawn from the model.
To prevent this behaviour, we use the GECO objective from Rezende and Viola [29] instead, which changes the learning problem to minimising the KL divergence subject to a reconstruction constraint. Let $\kappa$ be the minimum allowed reconstruction log-likelihood; GECO then uses Lagrange multipliers to solve the constrained optimisation problem

$$\min_{\theta, \phi}\; \mathrm{KL}\left(q_\phi(z^m_{1:K}, z^c_{1:K} \mid x) \,\|\, p_\theta(z^m_{1:K}, z^c_{1:K})\right) \quad \text{s.t.} \quad \mathbb{E}_{q_\phi(z^m_{1:K}, z^c_{1:K} \mid x)}\left[\ln p_\theta(x \mid z^m_{1:K}, z^c_{1:K})\right] \geq \kappa. \tag{10}$$
4 Experiments
In this section, we present qualitative and quantitative results on the “rooms-ring-camera” dataset from GQN (Eslami et al. [9]) and the ShapeStacks dataset (Groth et al. [10]). We use an image resolution of 64-by-64 for all experiments. The maximum number of components $K$ was set separately for GQN and for ShapeStacks. Additional results are included in the Appendix, also illustrating a typical failure case of the model.
GQN The “rooms-ring-camera” dataset includes simulated 3D scenes of a square room with different floor and wall textures, containing one to three objects of various shapes and sizes.
ShapeStacks Images show simulated block towers of different heights (two to six blocks) where individual blocks have different shapes, sizes, and colours. Scenes have annotations for: stability of the tower (binary), number of blocks (two to six), properties of individual blocks, locations in the tower of centre-of-mass violations and planar surface violations, wall and floor textures (five each), light presets (five), and camera view points (sixteen).
4.1 Implementation Details
For the results in this section, we use the architecture from Berg et al. [30] to encode and decode the mask latents $z^m_{1:K}$. This architecture was modified for GENESIS by applying batch normalisation (Ioffe and Szegedy [31]) before the GLUs (Dauphin et al. [32]). The encoder and decoder each have five convolutional layers with size-5 kernels, strides of [1, 2, 1, 2, 1], and filter sizes of [32, 32, 64, 64, 64] and [64, 32, 32, 32, 32], respectively. Fully-connected layers are used at the lowest resolution.
The encoded image is passed to an LSTM cell (Hochreiter and Schmidhuber [33]) followed by a linear layer to compute the mask latents of size 64. The LSTM state size is twice the latent size. Importantly, unlike the analogous counterpart in MONet, the decoding of $z^m_{1:K}$ is performed in parallel. The autoregressive prior is implemented as an LSTM with 256 units. The conditional distribution $p_\theta(z^c_k \mid z^m_k)$ is parameterised by an MLP with two hidden layers, 256 units per layer, and ELUs (Clevert et al. [34]). We use the same component VAE featuring a spatial broadcast decoder as MONet to encode and decode $z^c_{1:K}$, but we replace ReLUs (Glorot et al. [35]) with ELUs.
For GENESIS-s, the encoder of $z_{1:K}$ is the same as for the mask latents above, and the decoder from Berg et al. [30] is again used to compute the mixing probabilities. However, GENESIS-s also has a second decoder with spatial broadcasting to obtain the object appearances from $z_{1:K}$. We found the use of two different decoders to be important for GENESIS-s to learn meaningful scene decompositions. While this shows that it is not necessary to use separate latent variables for component masks and appearances in order to decompose scenes, we found that GENESIS consistently trained more quickly and with better qualitative results. We conjecture that this gap might close to some degree when training the models for more iterations. Hence, qualitative results reported in this section are from GENESIS.
All models are trained with a batch size of 32 using the ADAM optimiser (Kingma and Ba [36]). With these settings, training GENESIS takes about two days on a single GPU, though we expect performance to improve with further training. We deliberately choose a comparatively weak reconstruction constraint for the GECO objective to emphasise KL minimisation and sample quality. We increase the GECO step size by a constant factor when the reconstruction constraint is satisfied to accelerate training.
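A minimal sketch of the GECO update described above is shown below; the initial multiplier, step size, and boost factor are illustrative assumptions, as is the use of a log-parameterised multiplier to keep $\lambda$ positive.

```python
import torch


class GECO:
    """Tracks the Lagrange multiplier for: min KL subject to E_q[log p(x|z)] >= kappa (Equation 10)."""

    def __init__(self, kappa, step_size=1e-5, boost=2.0, device="cpu"):
        self.kappa = kappa                  # minimum allowed reconstruction log-likelihood
        self.step_size = step_size
        self.boost = boost                  # larger step once the constraint is satisfied
        self.log_lambda = torch.zeros(1, device=device)   # lambda = exp(log_lambda) > 0

    def loss(self, recon_ll, kl):
        """Lagrangian used to update the model parameters; lambda is treated as a constant here."""
        return kl + self.log_lambda.exp() * (self.kappa - recon_ll)

    def update(self, recon_ll):
        """Grow lambda while the constraint is violated, shrink it (faster) once it holds."""
        with torch.no_grad():
            constraint = self.kappa - float(recon_ll.mean())
            step = self.step_size * (self.boost if constraint <= 0 else 1.0)
            self.log_lambda += step * constraint
```

In a training step, one would compute the reconstruction log-likelihood and KL of a batch, call `geco.loss(recon_ll, kl).mean().backward()`, take the optimiser step, and then call `geco.update(recon_ll)`.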
4.2 Scene Generation
Unlike previous works, GENESIS has an autoregressive prior to capture intricate dependencies between scene components. Modelling these relationships is necessary to generate coherent scenes: different parts of the background need to fit together; components that can only appear once, such as the sky, should not be generated several times; and several physical objects cannot occupy the same location. GENESIS generates novel scenes by sequentially sampling scene components from the prior, conditioning each new component on those generated during previous steps.
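In code, this generation procedure amounts to ancestral sampling followed by mixing the decoded components. The sketch below reuses the `AutoregressivePrior` and `stick_breaking` helpers sketched in Section 3, and treating the component prior as returning one appearance latent per slot is an assumption about the interface rather than the released implementation.

```python
import torch


@torch.no_grad()
def generate_scenes(prior, mask_decoder, component_prior, component_decoder,
                    num_steps, batch_size, device):
    """Ancestral sampling: z^m ~ p(z^m), z^c ~ p(z^c | z^m), then mix the decoded components."""
    z_masks = prior(num_steps, batch_size, device)            # (B, K, D), sampled sequentially
    mask_logits = mask_decoder(z_masks.flatten(0, 1))         # decode all masks in parallel
    height, width = mask_logits.shape[-2:]
    pis = stick_breaking(mask_logits.view(batch_size, num_steps, 1, height, width))
    z_comps = component_prior(z_masks)                        # one appearance latent per slot
    comps = component_decoder(z_comps.flatten(0, 1))
    comps = comps.view(batch_size, num_steps, -1, height, width)
    return (pis * comps).sum(dim=1)                           # expected image under the mixture
```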
Figure 3 shows two scenes generated by GENESIS after training on GQN. At every step, either an object in the foreground or a part of the background is generated. Crucially, these components fit together to make up a semantically consistent scene that looks similar to the training data.
[Figure 3: Scenes generated component-by-component by GENESIS trained on GQN.]
4.3 Scene Decomposition
Similar to monet and iodine, GENESIS is able to segment scenes into meaningful components, with some slots reconstructing individual foreground objects and other slots reconstructing different parts of the background. Step-by-step reconstructions for a model trained on the GQN dataset are shown in Figure 4. The model starts by reconstructing part of the background in the first step and individual objects are reconstructed in separate steps.
We observed that the model pursues a similar decomposition strategy for all input images, and models trained with different hyperparameters typically learn the same strategy. We hypothesise that this is because large surfaces with little variation in appearance have a large influence on the reconstruction loss and are also easier to learn than more involved parts of the scene with finer details. In particular, the mask latents in the first step only depend on the input, whereas later masks depend on previous steps, which could introduce distracting noise early on in training.
In Figure 5, we show step-by-step reconstructions on ShapeStacks. Again, the model learns a similar reconstruction strategy as for GQN. Notably, though, GENESIS is able to differentiate between all five objects in the block tower, more than the number of objects in the GQN scenes, even though the background is quite noisy and the blocks occlude each other significantly.
[Figure 4: Step-by-step scene reconstructions on GQN. Figure 5: Step-by-step scene reconstructions on ShapeStacks.]
4.3.1 Scene Manipulation
Learning disentangled representations (van Steenkiste et al. [27]) combined with the ability to decompose scenes enables GENESIS to deliberately manipulate selected parts of a scene while leaving the rest unchanged. This is achieved by linearly traversing the latents of individual components and can be used, for example, to change the appearance of the background or of any foreground object, as illustrated in Figure 6.
[Figure 6: Changing the appearance of individual scene components via latent traversals.]
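A sketch of the latent traversal used for such edits is shown below, assuming a model interface that exposes per-slot encode and decode functions (the names `encode` and `decode` are hypothetical); the slot index, dimension, and offset range are arbitrary.

```python
import torch


@torch.no_grad()
def traverse_component(model, x, slot, dim, offsets=torch.linspace(-2.0, 2.0, 8)):
    """Linearly traverse one latent dimension of a single component while freezing the rest."""
    z_masks, z_comps = model.encode(x)            # assumed interface: per-slot latents (B, K, D)
    images = []
    for delta in offsets:
        z_edit = z_comps.clone()
        z_edit[:, slot, dim] += delta             # move only the selected component
        images.append(model.decode(z_masks, z_edit))
    return torch.stack(images, dim=1)             # (B, len(offsets), C, H, W)
```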
4.4 Evaluating Unsupervised Representations on ShapeStacks
We employ a set of discriminative tasks to evaluate how well the representations learned by GENESIS capture various properties of input scenes and compare to several VAE and MONet baselines. Alternative metrics such as ELBO or negative log-likelihood cannot be computed for MONet. To this end, ShapeStacks is an interesting testing ground as scenes contain multiple objects with surface contact. In particular, we consider three tasks: (1) Is a tower stable or not? (2) What is the tower’s height in terms of the number of blocks? (3) What is the quantised camera pose (out of 16 possibilities)? The first two tasks focus on information relating to foreground objects in the scene and the third task was inspired by Eslami et al. [9].
We compare against four baselines: a VAE with a deconvolutional decoder (DeconvVAE), a VAE with a spatial broadcast decoder (BroadcastVAE), and two different variants of MONet with either [32, 32, 64, 64, 64] (MONet32) or [64, 64, 128, 128, 128] (MONet64) filters in the attention encoder and the reverse in the decoder. The VAEs have 64 latents and the same encoder (and decoder in the case of the DeconvVAE) as in Berg et al. [30]. The BroadcastVAE has the same decoder with ELUs as GENESIS, but with twice the number of filters to enable a better comparison. For the MONet variants, the prior on the masks was normalised with a softmax function to compute the KL divergence.
An MLP with one hidden layer, 512 units, and ELU activations was used for classification. The classifiers were trained for 100 epochs on 50,000 labelled examples with a batch size of 128 using a cross-entropy loss and the ADAM optimiser. As inputs to the classifiers, we concatenate $z^m_{1:K}$ and $z^c_{1:K}$ for GENESIS, $z_{1:K}$ for GENESIS-s, and the component VAE latents for the two MONet variants. Table 1 shows the test accuracy of the models; the first row reports the accuracy obtained by always predicting the class with the largest number of examples in the test set. A minimal sketch of this probe setup is given after the table.

Table 1: Test accuracy (%) on the ShapeStacks classification tasks.

Model | Stability | Height | View
---|---|---|---|
Largest class | 50.0 | 30.5 | 6.25 |
DeconvVAE | 58.7 | 67.8 | 98.7 |
BroadcastVAE | 59.2 | 77.5 | 99.8 |
MONet32 | 59.3 | 85.0 | 99.1 |
MONet64 | 60.4 | 86.2 | 99.4 |
GENESIS | 65.2 | 81.5 | 98.8 |
GENESIS-s | 63.5 | 78.0 | 99.8 |
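For reference, a minimal sketch of the probe setup described above, assuming the latents have already been inferred and concatenated into one feature vector per scene; the learning rate is left at the optimiser default since the exact value is not reproduced here.

```python
import torch
import torch.nn as nn


def make_probe(in_dim, num_classes):
    """MLP with one hidden layer of 512 units and ELU activations, as used for classification."""
    return nn.Sequential(nn.Linear(in_dim, 512), nn.ELU(), nn.Linear(512, num_classes))


def train_probe(features, labels, num_classes, epochs=100, batch_size=128):
    """features: (N, D) concatenated latents, e.g. [z^m_{1:K}, z^c_{1:K}]; labels: (N,) class indices."""
    probe = make_probe(features.shape[1], num_classes)
    optimiser = torch.optim.Adam(probe.parameters())
    loader = torch.utils.data.DataLoader(
        torch.utils.data.TensorDataset(features, labels), batch_size=batch_size, shuffle=True)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for batch_features, batch_labels in loader:
            optimiser.zero_grad()
            criterion(probe(batch_features), batch_labels).backward()
            optimiser.step()
    return probe
```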
None of the models reach the stability prediction accuracies reported in Groth et al. [10]. This is not surprising considering that the representation networks were trained on sub-sampled images without data augmentation and with a pixel-wise reconstruction loss. Notably, however, both GENESIS and MONet do much better at predicting tower stability and height than the VAE baselines, indicating that object-centric latents are beneficial for these tasks. MONet performs particularly well on predicting the height of the towers, which might be facilitated by its deterministic segmentation network. Nevertheless, both variants of GENESIS outperform all baselines on stability prediction. All models do well at predicting the camera view.
5 Conclusions
In this work, we propose a novel object-centric latent variable model of scenes called GENESIS. To the best of our knowledge, GENESIS is the first unsupervised model that both decomposes visual scenes into semantically meaningful constituent parts and generates novel, coherent scenes in a component-wise fashion. Importantly, this is achieved by capturing relationships between scene components with an autoregressive prior, allowing complete scenes to be modelled as a collection of components. Regarding future work, the most interesting challenge is to scale GENESIS to more complex datasets. Another potentially promising research direction is to adapt the formulation to only model parts of the scene that are relevant for a certain task. We hope that this work will open up promising avenues for further research, in particular with regards to learning and planning in latent space as well as better harnessing the capabilities of generative models in robotics and reinforcement learning applications.
This research was supported by an EPSRC Programme Grant (EP/M019918/1). The authors would like to acknowledge the use of the University of Oxford Advanced Research Computing (ARC) facility in carrying out this work, http://dx.doi.org/10.5281/zenodo.22558, and the use of Hartree Centre resources. The authors thank Yizhe Wu for his help with re-implementing monet.
References
- He et al. [2017] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. International Conference on Computer Vision, 2017.
- Ha and Schmidhuber [2018] D. Ha and J. Schmidhuber. World Models. Neural Information Processing Systems, 2018.
- Racanière et al. [2017] S. Racanière, T. Weber, D. Reichert, L. Buesing, A. Guez, D. J. Rezende, A. P. Badia, O. Vinyals, N. Heess, Y. Li, et al. Imagination-Augmented Agents for Deep Reinforcement Learning. Neural Information Processing Systems, 2017.
- Lee et al. [2018] D. Lee, S. Liu, J. Gu, M.-Y. Liu, M.-H. Yang, and J. Kautz. Context-Aware Synthesis and Placement of Object Instances. Neural Information Processing Systems, 2018.
- Gregor et al. [2019] K. Gregor, D. J. Rezende, F. Besse, Y. Wu, H. Merzic, and A. v. d. Oord. Shaping Belief States with Generative Environment Models for RL. arXiv preprint arXiv:1906.09237, 2019.
- Burgess et al. [2019] C. P. Burgess, L. Matthey, N. Watters, R. Kabra, I. Higgins, M. Botvinick, and A. Lerchner. MONet: Unsupervised Scene Decomposition and Representation. arXiv preprint arXiv:1901.11390, 2019.
- Greff et al. [2019] K. Greff, R. L. Kaufmann, R. Kabra, N. Watters, C. Burgess, D. Zoran, L. Matthey, M. Botvinick, and A. Lerchner. Multi-Object Representation Learning with Iterative Variational Inference. International Conference on Machine Learning, 2019.
- Parmar et al. [2018] N. Parmar, A. Vaswani, J. Uszkoreit, Ł. Kaiser, N. Shazeer, A. Ku, and D. Tran. Image Transformer. International Conference on Machine Learning, 2018.
- Eslami et al. [2018] S. A. Eslami, D. J. Rezende, F. Besse, F. Viola, A. S. Morcos, M. Garnelo, A. Ruderman, A. A. Rusu, I. Danihelka, K. Gregor, et al. Neural Scene Representation and Rendering. Science, 2018.
- Groth et al. [2018] O. Groth, F. B. Fuchs, I. Posner, and A. Vedaldi. ShapeStacks: Learning Vision-Based Physical Intuition for Generalised Object Stacking. European Conference on Computer Vision, 2018.
- Yao et al. [2018] S. Yao, T. M. Hsu, J.-Y. Zhu, J. Wu, A. Torralba, B. Freeman, and J. Tenenbaum. 3D-Aware Scene Manipulation via Inverse Graphics. Neural Information Processing Systems, 2018.
- Alhaija et al. [2018] H. A. Alhaija, S. K. Mustikovela, L. Mescheder, A. Geiger, and C. Rother. Augmented Reality Meets Computer Vision : Efficient Data Generation for Urban Driving Scenes. International Journal of Computer Vision, 2018.
- Eslami et al. [2016] S. A. Eslami, N. Heess, T. Weber, Y. Tassa, D. Szepesvari, G. E. Hinton, et al. Attend, Infer, Repeat: Fast Scene Understanding with Generative Models. Neural Information Processing Systems, 2016.
- Kosiorek et al. [2018] A. Kosiorek, H. Kim, Y. W. Teh, and I. Posner. Sequential Attend, Infer, Repeat: Generative Modelling of Moving Objects. Neural Information Processing Systems, 2018.
- Greff et al. [2016] K. Greff, A. Rasmus, M. Berglund, T. Hao, H. Valpola, and J. Schmidhuber. Tagger: Deep Unsupervised Perceptual Grouping. Neural Information Processing Systems, 2016.
- Greff et al. [2017] K. Greff, S. van Steenkiste, and J. Schmidhuber. Neural Expectation Maximization. Neural Information Processing Systems, 2017.
- Kosiorek et al. [2019] A. R. Kosiorek, S. Sabour, Y. W. Teh, and G. E. Hinton. Stacked Capsule Autoencoders. arXiv preprint arXiv:1906.06818, 2019.
- Xu et al. [2018] K. Xu, C. Li, J. Zhu, and B. Zhang. Multi-Objects Generation with Amortized Structural Regularization. Neural Information Processing Systems, 2018.
- Kingma and Welling [2014] D. P. Kingma and M. Welling. Auto-Encoding Variational Bayes. International Conference on Learning Representations, 2014.
- Rezende et al. [2014] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. International Conference on Machine Learning, 2014.
- Watters et al. [2019] N. Watters, L. Matthey, C. P. Burgess, and A. Lerchner. Spatial Broadcast Decoder: A Simple Architecture for Learning Disentangled Representations in VAEs. arXiv preprint arXiv:1901.07017, 2019.
- Chen et al. [2019] M. Chen, T. Artières, and L. Denoyer. Unsupervised Object Segmentation by Redrawing. arXiv preprint arXiv:1905.13539, 2019.
- Bielski and Favaro [2019] A. Bielski and P. Favaro. Emergence of Object Segmentation in Perturbed Generative Models. arXiv preprint arXiv:1905.12663, 2019.
- Arandjelović and Zisserman [2019] R. Arandjelović and A. Zisserman. Object Discovery with a Copy-Pasting GAN. arXiv preprint arXiv:1905.11369, 2019.
- Azadi et al. [2019] S. Azadi, D. Pathak, S. Ebrahimi, and T. Darrell. Compositional GAN: Learning Image-Conditional Binary Composition. arXiv preprint arXiv:1807.07560, 2019.
- van Steenkiste et al. [2018] S. van Steenkiste, K. Kurach, and S. Gelly. A Case for Object Compositionality in Deep Generative Models of Images. NeurIPS Workshop on Modeling the Physical World: Learning, Perception, and Control, 2018.
- van Steenkiste et al. [2019] S. van Steenkiste, F. Locatello, J. Schmidhuber, and O. Bachem. Are Disentangled Representations Helpful for Abstract Visual Reasoning? arXiv preprint arXiv:1905.12506, 2019.
- Santoro et al. [2017] A. Santoro, D. Raposo, D. G. T. Barrett, M. Malinowski, R. Pascanu, P. W. Battaglia, and T. P. Lillicrap. A Simple Neural Network Module for Relational Reasoning. Neural Information Processing Systems, 2017.
- Rezende and Viola [2018] D. J. Rezende and F. Viola. Taming VAEs. arXiv preprint arXiv:1810.00597, 2018.
- Berg et al. [2018] R. v. d. Berg, L. Hasenclever, J. M. Tomczak, and M. Welling. Sylvester Normalizing Flows for Variational Inference. Conference on Uncertainty in Artificial Intelligence, 2018.
- Ioffe and Szegedy [2015] S. Ioffe and C. Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. International Conference on Machine Learning, 2015.
- Dauphin et al. [2017] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier. Language Modeling with Gated Convolutional Networks. International Conference on Machine Learning, 2017.
- Hochreiter and Schmidhuber [1997] S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. Neural Computation, 1997.
- Clevert et al. [2016] D.-A. Clevert, T. Unterthiner, and S. Hochreiter. Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs). International Conference on Learning Representations, 2016.
- Glorot et al. [2011] X. Glorot, A. Bordes, and Y. Bengio. Deep Sparse Rectifier Neural Networks. International Conference on Artificial Intelligence and Statistics, 2011.
- Kingma and Ba [2015] D. P. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. International Conference on Learning Representations, 2015.
- Ulyanov et al. [2016] D. Ulyanov, A. Vedaldi, and V. Lempitsky. Instance Normalization: The Missing Ingredient for Fast Stylization. arXiv preprint arXiv:1607.08022, 2016.
Appendix A
A.1 GENESIS-s
[Figure 7: Illustration of GENESIS-s.]
A.2 Architecture Variations
We investigate several architecture variations of GENESIS and GENESIS-s and compare their performances on the ShapeStacks tasks.
The basic GatedConv architecture is adapted from Berg et al. [30] as described in Section 4.1. Here, we report results for three variants of GENESIS and GENESIS-s with the following modifications:
- increase the dimension of the mask latents from 64 to 128 and reduce the number of filters in the mask decoder by a factor of two;
- use four stride-2 layers with [32, 32, 64, 64] and [64, 32, 32, 32] filters in the encoder and decoder, respectively, and use instance normalisation (IN) (Ulyanov et al. [37]) instead of batch normalisation (BN);
- use the attention architecture from MONet32 without skip-connections and replace ReLU activations with ELUs.
The results are summarised in Table 2. Overall, we found that performance on ShapeStacks was not strongly affected by these architecture changes. Both GENESIS and GENESIS-s still perform better on the stability prediction task than the baselines in Table 1. It would be interesting to conduct a more comprehensive study to establish best practices with regards to architecture design for these types of models, in particular in consideration of performance vs. run-time trade-offs.
Table 2: Test accuracy (%) on ShapeStacks for architecture variations of GENESIS and GENESIS-s.

| Model | Mask latent dim | Architecture | Norm | Stability | Height | View |
|---|---|---|---|---|---|---|
| GENESIS | 64 | GatedConv | BN | 65.2 | 81.5 | 98.8 |
| | 128 | GatedConv, 1/2 channels in decoder | BN | 64.3 | 79.8 | 98.6 |
| | 128 | GatedConv, four stride-2 layers | IN | 61.0 | 77.1 | 99.6 |
| | 128 | MONet32-like | IN | 63.1 | 78.8 | 99.6 |
| GENESIS-s | 64 | GatedConv | BN | 63.5 | 78.0 | 99.8 |
| | 128 | GatedConv, 1/2 channels in decoder | BN | 64.1 | 81.7 | 99.7 |
| | 128 | GatedConv, four stride-2 layers | IN | 64.4 | 79.0 | 99.8 |
| | 128 | MONet32-like | IN | 62.6 | 77.2 | 99.9 |
A.3 Scene Generation
[Additional scenes generated by GENESIS.]
A.4 Scene Decomposition
[Additional scene decompositions.]