
SPACE: Unsupervised Object-Oriented Scene Representation via Spatial Attention and Decomposition

by   Zhixuan Lin, et al.

The ability to decompose complex multi-object scenes into meaningful abstractions like objects is fundamental to achieving higher-level cognition. Previous approaches to unsupervised object-oriented scene representation learning are based on either spatial-attention or scene-mixture approaches and are limited in scalability, which is a main obstacle towards modeling real-world scenes. In this paper, we propose a generative latent variable model, called SPACE, that provides a unified probabilistic modeling framework combining the best of spatial-attention and scene-mixture approaches. SPACE can explicitly provide factorized object representations for foreground objects while also decomposing background segments of complex morphology. Previous models are good at either of these, but not both. SPACE also resolves the scalability problems of previous methods by incorporating parallel spatial-attention and is thus applicable to scenes with a large number of objects without performance degradation. We show through experiments on Atari and 3D-Rooms that SPACE achieves the above properties consistently in comparison to SPAIR, IODINE, and GENESIS. Results of our experiments can be found on our project website:





Code Repositories


Official PyTorch implementation of "SPACE: Unsupervised Object-Oriented Scene Representation via Spatial Attention and Decomposition"


1 Introduction

One of the unsolved key challenges in machine learning is unsupervised learning of structured representation for a visual scene containing many objects with occlusion, partial observability, and complex background. When properly decomposed into meaningful abstract entities such as objects and spaces, this structured representation brings many advantages of abstract (symbolic) representation to areas where contemporary deep learning approaches with a global continuous vector representation of a scene have not been successful. For example, a structured representation may improve sample efficiency for downstream tasks such as a deep reinforcement learning agent (Mnih2013PlayingAW). It may also enable visual variable binding (sun1992variable) for reasoning and causal inference over the relationships between the objects and agents in a scene. Structured representations also provide composability and transferability for better generalization.

Recent approaches to this problem of unsupervised object-oriented scene representation can be categorized into two types of models: scene-mixture models and spatial-attention models. In scene-mixture models (nem; iodine; monet; genesis), a visual scene is explained by a mixture of a finite number of component images. This type of representation provides flexible segmentation maps that can handle objects and background segments of complex morphology. However, since each component corresponds to a full-scale image, important physical features of objects like position and scale are only implicitly encoded in the scale of a full image and further disentanglement is required to extract these useful features. Also, since it does not explicitly reflect useful inductive biases like the locality of an object in the Gestalt principles (koffka2013principles), the resulting component representation is not necessarily a representation of a local area. Moreover, to obtain a complete scene, a component needs to refer to other components, and thus inference is inherently performed sequentially, resulting in limitations in scaling to scenes with many objects.

In contrast, spatial-attention models (air; spair) can explicitly obtain the fully disentangled geometric representation of objects such as position and scale. Such features are grounded on the semantics of physics and should be useful in many ways (e.g., sample efficiency, interpretability, geometric reasoning and inference, transferability). However, these models cannot represent complex objects and background segments whose morphology is too flexible to be captured by spatial attention (i.e., attention based on rectangular bounding boxes). Similar to scene-mixture models, previous models in this class show scalability issues as objects are processed sequentially.

In this paper, we propose a method, called Spatially Parallel Attention and Component Extraction (SPACE), that combines the best of both approaches. SPACE learns to process foreground objects, which can be captured efficiently by bounding boxes, by using parallel spatial-attention while decomposing the remaining area that includes both morphologically complex objects and background segments by using component mixtures. Thus, SPACE provides an object-wise disentangled representation of foreground objects along with explicit properties like position and scale per object while also providing decomposed representations of complex background components. Furthermore, by fully parallelizing the foreground object processing, we resolve the scalability issue of existing spatial attention methods. In experiments on 3D-room scenes and Atari game scenes, we quantitatively and qualitatively compare the representation of SPACE to other models and show that SPACE combines the benefits of both approaches in addition to significant speed-ups due to the parallel foreground processing.

The contributions of the paper are as follows. First, we introduce a model that unifies the benefits of spatial-attention and scene-mixture approaches in a principled framework of probabilistic latent variable modeling. Second, we introduce a spatially parallel multi-object processing module and demonstrate that it can significantly mitigate the scalability problems of previous methods. Lastly, we provide an extensive comparison with previous models where we illustrate the capabilities and limitations of each method.

2 The Proposed Model: SPACE

In this section, we describe our proposed model, Spatially Parallel Attention and Component Extraction (SPACE). The main idea of SPACE, presented in Figure 1, is to propose a unified probabilistic generative model that combines the benefits of the spatial-attention and scene-mixture models.

Figure 1: Illustration of the SPACE model. SPACE consists of a foreground module and a background module. In the foreground module, the input image is divided into a grid of $H \times W$ cells. An image encoder is used to compute $z^{where}$, $z^{depth}$, and $z^{pres}$ for each cell in parallel. $z^{where}$ is used to identify proposal bounding boxes and a spatial transformer is used to attend to each bounding box in parallel, computing a $z^{what}$ encoding for each cell. The model selects patches using the bounding boxes and reconstructs them using a VAE from all the foreground latents $z^{fg}$. The background module segments the scene into $K$ components (4 in the figure) using a pixel-wise mixture model. Each component consists of a set of latents $z^{bg}_k = (z^m_k, z^c_k)$, where $z^m_k$ models the mixing probability of the component and $z^c_k$ models the RGB distribution of the component. The components are combined to reconstruct the background using a VAE. The reconstructed background and foreground are then combined using a pixel-wise mixture model to generate the full reconstructed image.

2.1 Generative Process

SPACE assumes that a scene $x$ is decomposed into two independent latents: foreground $z^{fg}$ and background $z^{bg}$. The foreground is further decomposed into a set of independent foreground objects $z^{fg} = \{z^{fg}_i\}$ and the background is also decomposed further into a sequence of background segments $z^{bg} = z^{bg}_{1:K}$. While our choice of modeling the foreground and background independently worked well empirically, for better generation, it may also be possible to condition one on the other. The image distributions of the foreground objects and the background components are combined together with a pixel-wise mixture model to produce the complete image distribution:

$$p(x \mid z^{fg}, z^{bg}) = \alpha \underbrace{p(x \mid z^{fg})}_{\text{Foreground}} + (1-\alpha) \sum_{k=1}^{K} \pi_k \underbrace{p(x \mid z^{bg}_k)}_{\text{Background}}.$$

Here, $\alpha$ is the foreground mixing probability, which is computed from the foreground latents $z^{fg}$. This way, the foreground is given precedence in assigning its own mixing weight and the remaining weight $1-\alpha$ is apportioned to the background. The mixing weight assigned to the background is further sub-divided among the background components. These weights $\pi_k$ are computed from the background latents and sum to one over the $K$ components. With these notations, the complete generative model can be described as follows.

$$p(x) = \iint p(x \mid z^{fg}, z^{bg})\, p(z^{bg})\, p(z^{fg})\, dz^{fg}\, dz^{bg}$$

We now describe the foreground and background models in more detail.
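To make the pixel-wise mixture concrete, the following is a minimal sketch for a single pixel intensity, not the authors' implementation. The function names and the fixed standard deviations `fg_std` and `bg_std` are assumptions chosen for illustration.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def effective_weights(alpha, pis):
    """Per-pixel weights of all components: [alpha, (1-alpha)*pi_1, ...]."""
    return [alpha] + [(1.0 - alpha) * pi for pi in pis]


def pixel_likelihood(x, alpha, fg_mean, pis, bg_means, fg_std=0.15, bg_std=0.15):
    """Mixture likelihood of one pixel value x.

    alpha    : foreground mixing probability at this pixel
    pis      : background component weights pi_k (must sum to 1)
    bg_means : mean of each background component at this pixel
    fg_std, bg_std : hypothetical fixed standard deviations
    """
    if abs(sum(pis) - 1.0) > 1e-6:
        raise ValueError("background weights must sum to 1")
    fg = gaussian_pdf(x, fg_mean, fg_std)
    bg = sum(pi * gaussian_pdf(x, mu, bg_std) for pi, mu in zip(pis, bg_means))
    # Foreground claims weight alpha first; the rest is shared by the background.
    return alpha * fg + (1.0 - alpha) * bg
```

Because the background weights sum to one, the weights returned by `effective_weights` also sum to one, so the foreground and the $K$ background components together form a single valid mixture at every pixel.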

Foreground. SPACE implements $z^{fg}$ as a structured latent. In this structure, an image is treated as if it were divided into $H \times W$ cells and each cell is tasked with modeling at most one (nearby) object in the scene. This type of structuring has been used in (yolo; rn; spair). Similar to SPAIR, in order to model an object, each cell $i$ is associated with a set of latents $(z^{pres}_i, z^{where}_i, z^{depth}_i, z^{what}_i)$. In this notation, $z^{pres}$ is a binary random variable denoting if the cell models any object or not, $z^{where}$ denotes the size of the object and its location relative to the cell, $z^{depth}$ denotes the depth of the object to resolve occlusions, and $z^{what}$ models the object appearance and its mask. These latents may then be used to compute the foreground image component $p(x \mid z^{fg})$, which is modeled as a Gaussian distribution $\mathcal{N}(\mu^{fg}, \sigma_{fg}^2)$. In practice, we treat $\sigma_{fg}$ as a hyperparameter and decode only the mean image $\mu^{fg}$. In this process, SPACE reconstructs the objects associated to each cell having $z^{pres}_i = 1$. For each such cell, the model uses $z^{what}_i$ to decode the object glimpse and its mask, and the glimpse is then positioned on a full-resolution canvas using $z^{where}_i$ via the Spatial Transformer Network (spatial_transformer). Using the object masks and $z^{depth}_i$, all the foreground objects are combined into a single foreground mean-image $\mu^{fg}$ and the foreground mask $\alpha$ (see Appendix D for more details).

SPACE imposes a prior distribution on these latents as follows:

$$p(z^{fg}) = \prod_{i=1}^{H \times W} p(z^{pres}_i) \left( p(z^{where}_i)\, p(z^{depth}_i)\, p(z^{what}_i) \right)^{z^{pres}_i}$$

Here, only $z^{pres}_i$ is modeled using a Bernoulli distribution while the remaining latents are modeled as Gaussian.
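The exponent $z^{pres}_i$ in the prior acts as a switch: a cell that models no object contributes only its presence term. The sketch below evaluates the log-prior under that factorization. The presence probability `p_pres` and the standard-normal priors are illustrative assumptions, not values taken from the paper.

```python
import math


def log_bernoulli(z, p):
    """log p(z) for z in {0, 1} under Bernoulli(p)."""
    return math.log(p) if z == 1 else math.log(1.0 - p)


def log_normal(z, mean=0.0, std=1.0):
    """Log-density of N(mean, std^2) at z."""
    return (-0.5 * ((z - mean) / std) ** 2
            - math.log(std) - 0.5 * math.log(2.0 * math.pi))


def log_prior_fg(cells, p_pres=0.1):
    """Sum over cells of log p(pres) + pres * (log p(where) + log p(depth) + log p(what)).

    Each cell is a dict with 'pres' (0 or 1) and lists 'where', 'depth', 'what'.
    p_pres and the standard-normal priors are assumptions for this sketch.
    """
    total = 0.0
    for cell in cells:
        total += log_bernoulli(cell["pres"], p_pres)
        if cell["pres"] == 1:  # the exponent z_pres switches these terms on
            for name in ("where", "depth", "what"):
                total += sum(log_normal(v) for v in cell[name])
    return total
```

For an empty cell the values stored under `where`, `depth`, and `what` have no effect on the result, which mirrors the gating in the equation.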

Background. To model the background, SPACE implements $z^{bg}_k$, similar to GENESIS, as $z^{bg}_k = (z^m_k, z^c_k)$, where $z^m_k$ models the mixing probabilities $\pi_k$ of the components and $z^c_k$ models the RGB distribution $p(x \mid z^{bg}_k)$ of the $k$-th background component as a Gaussian. The following prior is imposed upon these latents.

$$p(z^{bg}) = \prod_{k=1}^{K} p(z^c_k \mid z^m_k)\, p(z^m_k \mid z^m_{<k})$$

2.2 Inference and Training

Since we cannot analytically evaluate the integrals in equation 2.1 due to the continuous latents $z^{fg}$ and $z^{bg}_{1:K}$, we train the model using a variational approximation. The true posterior on these variables is approximated as follows.

$$p(z^{bg}_{1:K}, z^{fg} \mid x) \approx q(z^{fg} \mid x) \prod_{k=1}^{K} q(z^{bg}_k \mid z^{bg}_{<k}, x)$$

This is used to derive an ELBO, which is optimized using the reparameterization trick and SGD (vae).


See Appendix B for the detailed decomposition of the ELBO and the related details.
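As a sketch under the factorization stated above (the standard variational bound, not necessarily the exact form or grouping of terms used by the authors), the ELBO can be written as a reconstruction term minus one KL term per latent group:

```latex
\mathcal{L}(x)
  = \mathbb{E}_{q(z^{fg},\, z^{bg}_{1:K} \mid x)}
      \left[ \log p(x \mid z^{fg}, z^{bg}) \right]
  - D_{KL}\!\left( q(z^{fg} \mid x) \,\|\, p(z^{fg}) \right)
  - D_{KL}\!\left( q(z^{bg}_{1:K} \mid x) \,\|\, p(z^{bg}) \right),
\qquad
q(z^{bg}_{1:K} \mid x) = \prod_{k=1}^{K} q(z^{bg}_k \mid z^{bg}_{<k}, x).
```

The two KL terms separate because both the prior and the approximate posterior factorize into a foreground part and a background part, and $\mathcal{L}(x) \le \log p(x)$ by Jensen's inequality.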

Parallel Inference of Cell Latents. SPACE uses a mean-field approximation when inferring the cell latents, so $z^{fg}_i$ for each cell does not depend on other cells:

$$q(z^{fg} \mid x) = \prod_{i=1}^{H \times W} q(z^{pres}_i \mid x) \left( q(z^{where}_i \mid x)\, q(z^{depth}_i \mid x)\, q(z^{what}_i \mid z^{where}_i, x) \right)^{z^{pres}_i}.$$

As shown in Figure 1, this allows each cell to act as an independent object detector, spatially attending to its own local region in parallel. This is in contrast to inference in SPAIR, where each cell's latents auto-regressively depend on some or all of the previously traversed cells in a row-major order, i.e., $q(z^{fg} \mid x) = \prod_{i=1}^{H \times W} q(z^{fg}_i \mid z^{fg}_{<i}, x)$. However, this method becomes prohibitively expensive in practice as the number of objects increases. While the authors of SPAIR (spair) claim that these lateral connections are crucial for performance since they model dependencies between objects and thus prevent duplicate detections, we challenge this assertion by observing that 1) due to the bottom-up encoding conditioning on the input image, each cell should have information about its nearby area without explicitly communicating with other cells, and 2) in (physical) space, two objects cannot exist at the same position. Thus, the relation and interference between objects should not be severe and the mean-field approximation is a good choice in our model. In our experiments, we verify empirically that this is indeed the case and observe that SPACE shows comparable detection performance to SPAIR while having significant gains in training speed and efficiently scaling to scenes with many objects.

Preventing Box-Splitting. If the prior for the bounding box size is set too small, the model could split a large object across multiple bounding boxes; if the size prior is too large, the model may not capture small objects in the scene. This results in a trade-off between the prior values of the bounding box size. To alleviate this problem, we found it helpful to introduce an auxiliary loss which we call the boundary loss. In the boundary loss, we construct a boundary of fixed thickness (in pixels) along the borders of each glimpse. Then, we restrict an object to be inside this boundary and penalize the model if an object's mask overlaps with the boundary area. Thus, the model is penalized if it tries to split a large object across multiple smaller bounding boxes. A detailed implementation of the boundary loss is given in Appendix C.

3 Related Works

Our proposed model is inspired by several recent works in unsupervised object-oriented scene decomposition. The Attend-Infer-Repeat (AIR) (air) framework uses a recurrent neural network to attend to different objects in a scene and each object is sequentially processed one at a time. An object-oriented latent representation is prescribed that consists of ‘what’, ‘where’, and ‘presence’ variables. The ‘what’ variable stores the appearance information of the object, the ‘where’ variable represents the location of the object in the image, and the ‘presence’ variable controls how many steps the recurrent network runs and acts as an interruption variable when the model decides that all objects have been processed.

Since the number of steps AIR runs scales with the number of objects it attends to, it does not scale well to images with many objects. Spatially Invariant Attend, Infer, Repeat (SPAIR) (spair) attempts to address this issue by replacing the recurrent network with a convolutional network. Similar to YOLO (yolo), the locations of objects are specified relative to local grid cells rather than the entire image, which allows for spatially invariant computations. In the encoder network, a convolutional neural network is first used to map the image to a feature volume with dimensions equal to a pre-specified grid size. Then, each cell of the grid is processed to produce objects. This is done sequentially because the processing of each cell takes as input feature vectors and sampled objects of nearby cells that have already been processed. SPAIR therefore scales with the pre-defined grid size, which also represents the maximum number of objects that can be detected. Our model uses an approach similar to SPAIR to detect foreground objects, but importantly we make the foreground object processing fully parallel to scale to a large number of objects without performance degradation. Works based on Neural Expectation Maximization (rnem; nem) do achieve unsupervised object detection but do not explicitly model the presence, appearance, and location of objects. These methods also suffer from the problem of scaling to images with a large number of objects.

For unsupervised scene-mixture models, several recent models have shown promising results. MONet (monet) leverages a deterministic recurrent attention network that outputs pixel-wise masks for the scene components. A variational autoencoder (VAE) (vae) is then used to model each component. IODINE (iodine) approaches the problem from a spatial mixture model perspective and uses amortized iterative refinement of latent object representations within the variational framework. GENESIS (genesis) also uses a spatial mixture model which is encoded by component-wise latent variables. Relationships between these components are captured with an autoregressive prior, allowing complete images to be modeled by a collection of components.

4 Evaluation

We evaluate our model on two datasets: 1) an Atari (bellemare13arcade) dataset that consists of random images from a pretrained agent playing the games, and 2) a generated 3D-room dataset that consists of images of a walled enclosure with a random number of objects on the floor. In order to test the scalability of our model, we use both a small 3D-room dataset that has 4-8 objects and a large 3D-room dataset that has 18-24 objects. Each image is taken from a random camera angle and the colors of the objects, walls, floor, and sky are also chosen at random. Additional details of the datasets can be found in Appendix E.

Baselines. We compare our model against two scene-mixture models (IODINE and GENESIS) and one spatial-attention model (SPAIR). Since SPAIR does not have an explicit background component, we add an additional VAE for processing the background. Additionally, we test against two implementations of SPAIR: one where we train on the entire image using a full-image grid and another where we train on random pixel patches using a patch-sized grid. We denote the former model as SPAIR and the latter as SPAIR-P. SPAIR-P is consistent with SPAIR's alternative training regime on Space Invaders, demonstrated in (spair), which addresses the slow training of SPAIR on the full grid size caused by its sequential inference. Lastly, for performance reasons, unlike the original SPAIR implementation, we use parallel processing for rendering the objects from their respective latents onto the canvas for both SPAIR and SPAIR-P. (Footnote: the worst-case complexity of rendering grows with both the image size and the number of objects, which makes it extremely time-consuming for large images and/or many objects.) Because of these improvements, our SPAIR implementation can be seen as a stronger baseline than the original SPAIR. The complete details of the architecture used are given in Appendix D.

Figure 2: Qualitative comparison between SPACE, SPAIR, SPAIR-P, IODINE and GENESIS for the 3D-Room dataset.
Figure 3: Qualitative comparison between SPACE, SPAIR, IODINE and GENESIS for Space Invaders, Air Raid, and River Raid.

4.1 Qualitative Comparison of Inferred Representations

In this section, we provide a qualitative analysis of the generated representations of the different models. For each model, we performed a hyperparameter search and present the results for the best hyperparameter settings for each environment. Figure 2 shows sample scene decompositions of our baselines from the 3D-Room dataset and Figure 3 shows the results on Atari. Note that SPAIR does not use component masks and IODINE and GENESIS do not separate foreground from background, hence the corresponding cells are left empty. Additionally, we only show a few representative components for IODINE and GENESIS since we ran those experiments with a larger number of components $K$ than can be displayed. More qualitative results of SPACE can be found in Appendix A.

IODINE & GENESIS. In the 3D-Room environment, IODINE is able to segment the objects and the background into separate components. However, it occasionally does not properly decompose objects (in the Large 3D-room results, the orange sphere on the right is not reconstructed) and may generate blurry objects. GENESIS is able to segment the background walls, floor, and sky into multiple components. It is able to capture blurry foreground objects in the Small 3D-Room, but is not able to cleanly capture foreground objects with the larger number of objects in the Large 3D-Room. In Atari, for all games, both IODINE and GENESIS fail to capture the foreground properly. We believe this is because the objects in Atari games are smaller, less regular and lack the obvious latent factors like color and shape as in the 3D dataset, which demonstrates that detection-based approaches are more appropriate in this case.

SPAIR & SPAIR-P. SPAIR is able to detect tight bounding boxes in both 3D-Room and most Atari games (it does not work as well on dynamic background games, which we discuss below). SPAIR-P, however, often fails to detect the foreground objects in proper bounding boxes, frequently uses multiple bounding boxes for one object, and redundantly detects parts of the background as foreground objects. This is a limitation of the patch training: the receptive field of each patch is limited to a glimpse, which prevents the model from detecting objects larger than the glimpse and makes it difficult to distinguish the background from the foreground. These two properties are illustrated well in Space Invaders, where SPAIR-P is able to detect the small aliens, but detects the long strip of ground at the bottom of the image as foreground objects.

SPACE. In 3D-Room, SPACE is able to accurately detect almost all objects despite the large variations in object positions, colors, and shapes, while producing a clean segmentation of the background walls, ground, and sky. This is in contrast to the SPAIR model, which, while providing similar foreground detection quality, encodes the whole background into a single component, making the representation less disentangled and the reconstruction blurrier. Similarly in Atari, SPACE consistently captures all foreground objects while producing clean background segmentation across many different games.

Dynamic Backgrounds. SPACE and SPAIR exhibit some very interesting behavior when trained on games with dynamic backgrounds. For the most static game, Space Invaders, both SPACE and SPAIR work well. For Air Raid, in which the background building moves, SPACE captures all objects accurately while providing a two-component segmentation, whereas SPAIR and SPAIR-P produce splitting and heavy re-detections. In the most dynamic games, SPAIR completely fails because of the difficulty of modeling a dynamic background with a single VAE component, while SPACE is able to perfectly segment the blue racing track while accurately detecting all foreground objects.

Foreground vs Background. Typically, foreground is the dynamic local part of the scene that we are interested in, and background is the relatively static and global part. This definition, though intuitive, is ambiguous. Some objects, such as the red shields in Space Invaders and the key in Montezuma’s Revenge (Figure 5), are detected as foreground objects in SPACE, but are considered background in SPAIR. Though these objects are static (the pretrained agent we used to collect the game frames does not capture the key), they are important elements of the games and should be considered as foreground objects. Similar behavior is observed in Atlantis (Figure 7), where SPACE detects some foreground objects from the middle base that is above the water. We believe this is an interesting property of SPACE and could be very important for providing useful representations for downstream tasks. By using a spatial broadcast network (Watters2019SpatialBD), which is much weaker than other decoders such as sub-pixel convolutional nets (Shi_2016_CVPR), we limit the capacity of the background module, which favors modeling static objects as foreground rather than background.

Boundary Loss. We notice that SPAIR sometimes splits objects into two whereas SPACE is able to create the correct bounding box for the objects (for example, see Air Raid). This may be attributed to the addition of the auxiliary boundary loss in the SPACE model, which penalizes splitting an object across multiple bounding boxes.

4.2 Quantitative Comparison

In this section, we compare SPACE with the baselines on several quantitative metrics. (Footnote: As previously shown, SPAIR-P does not work well in many of our environments, so we do not include it in these quantitative experiments. As in the qualitative section, we use the SPAIR implementation with sequential inference and parallel rendering in order to speed up the experiments.) We first note that each of the baseline models has a different decomposition capacity, which we define as the capability of the model to decompose the scene into its semantic constituents such as the foreground objects and the background segmented components. For SPACE, the decomposition capacity is equal to the number of grid cells $H \times W$ (which is the maximum number of foreground objects that can be detected) plus the number of background components $K$. For SPAIR, the decomposition capacity is equal to the number of grid cells plus 1 for the background. For IODINE and GENESIS, it is equal to the number of components $K$.

For each experiment, we compare the metrics for each model with similar decomposition capacities. This way, each model can decompose the image into the same number of components. For a setting in SPACE with a grid size of $H \times W$ and $K$ background components, the equivalent setting in IODINE and GENESIS would use $H \times W + K$ components. The equivalent setting in SPAIR would be a grid size of $H \times W$.
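The capacity matching described above is simple arithmetic; the following sketch spells it out. The function names and the example grid size are hypothetical and are not the settings used in the experiments.

```python
def space_capacity(grid_h, grid_w, k_bg):
    """SPACE: one slot per grid cell plus k_bg background components."""
    return grid_h * grid_w + k_bg


def spair_capacity(grid_h, grid_w):
    """SPAIR: one slot per grid cell plus a single background component."""
    return grid_h * grid_w + 1


def mixture_components_to_match(grid_h, grid_w, k_bg):
    """Number of components giving IODINE/GENESIS the same capacity as SPACE."""
    return space_capacity(grid_h, grid_w, k_bg)


# Hypothetical example: an 8 x 8 grid with 5 background components.
print(space_capacity(8, 8, 5), spair_capacity(8, 8), mixture_components_to_match(8, 8, 5))
```

With the same grid size, SPAIR's capacity differs from SPACE's only by the number of extra background components, which is why the grid size alone is matched in that comparison.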

Figure 4: Quantitative performance comparison between SPACE, SPAIR, IODINE and GENESIS in terms of batch-processing time during training, training convergence and converged pixel MSE. Convergence plots showing pixel-MSE were computed on a held-out set during training.
| Model | Dataset | Avg. Precision | Avg. Precision | Object Count Error Rate |
|---|---|---|---|---|
| SPACE | 3D-Room Large | 0.8942 | 0.4608 | 0.0507 |
| SPACE-WB | 3D-Room Large | 0.8925 | 0.4428 | 0.0947 |
| SPAIR | 3D-Room Large | 0.9076 | 0.4453 | 0.0427 |
| SPACE | 3D-Room Small | 0.9057 | 0.5203 | 0.0330 |
| SPACE-WB | 3D-Room Small | 0.9048 | 0.5057 | 0.0566 |
| SPAIR | 3D-Room Small | 0.9036 | 0.4883 | 0.0257 |

Table 1: Comparison of SPACE with SPACE-WB (SPACE without boundary loss) and the SPAIR baseline with respect to the quality of the bounding boxes in the 3D-Room setting.

Gradient Step Latency. The leftmost chart of Figure 4 shows the time taken to complete one gradient step (forward and backward propagation) for different decomposition capacities for each of the models. We see that SPAIR’s latency grows with the number of cells because of the sequential nature of its latent inference step. Similarly GENESIS and IODINE’s latency grows with the number of components because each component is processed sequentially in both the models. IODINE is the slowest overall with its computationally expensive iterative inference procedure. Furthermore, both IODINE and GENESIS require storing data for each of the components, so we were unable to run our experiments on 256 components or greater before running out of memory on our 22GB GPU. On the other hand, SPACE employs parallel processing for the foreground which makes it scalable to large grid sizes, allowing it to detect a large number of foreground objects without any significant performance degradation. Although this data was collected for gradient step latency, this comparison implies a similar relationship exists with inference time which is a main component in the gradient step.

Time for Convergence. The remaining three charts in Figure 4 show the amount of time each model takes to converge in different experimental settings. We use the pixel-wise mean squared error (MSE) as a measurement of how close a model is to convergence. We see that not only does SPACE achieve the lowest MSE, it also converges the quickest out of all the models.

Average Precision and Error Rate. In order to assess the quality of our bounding box predictions and the effectiveness of boundary loss, we measure the Average Precision and Object Count Error Rate of our predictions. Our results are shown in Table 1. We only report these metrics for 3D-Room since we have access to the ground truth bounding boxes for each of the objects in the scene. All three models have very similar average precision and error rate. Despite being parallel in its inference, SPACE has a comparable count error rate to that of SPAIR. SPACE also achieves better average precision and count error rate compared to its variant without the boundary loss (SPACE-WB), which shows the efficacy of our proposed loss.
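Bounding-box metrics of this kind are conventionally built on intersection-over-union (IoU) matching between predicted and ground-truth boxes. The sketch below shows IoU and a precision value at a single threshold. It is not the evaluation code behind Table 1: the box format, the greedy matching, and the 0.5 threshold are assumptions, and full average precision additionally sweeps over detection confidences.

```python
def iou(box_a, box_b):
    """Intersection-over-union of boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_at(pred_boxes, true_boxes, threshold=0.5):
    """Fraction of predictions matched to a distinct ground-truth box.

    Greedy matching: each prediction takes the best remaining ground-truth
    box and counts as a hit when their IoU reaches the threshold.
    """
    unmatched = list(true_boxes)
    hits = 0
    for pred in pred_boxes:
        best = max(unmatched, key=lambda t: iou(pred, t), default=None)
        if best is not None and iou(pred, best) >= threshold:
            unmatched.remove(best)
            hits += 1
    return hits / len(pred_boxes) if pred_boxes else 0.0
```

Because each ground-truth box can be matched only once, a duplicate detection of the same object lowers the precision, which is the failure mode the mean-field inference was suspected to cause.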

From our experiments, we can assert that SPACE produces bounding boxes of similar quality to SPAIR while 1) having orders of magnitude faster inference and gradient step time, 2) converging more quickly, 3) scaling to a large number of objects without significant performance degradation, and 4) providing complex background segmentation.

5 Conclusion

We propose SPACE, a unified probabilistic model that combines the benefits of the object representation models based on spatial attention and the scene decomposition models based on component mixture. SPACE can explicitly provide factorized object representation per foreground object while also decomposing complex background segments. SPACE also achieves a significant speed-up and thus makes the model applicable to scenes with a much larger number of objects without performance degradation. In addition, the objects detected by SPACE are more intuitive than those detected by other methods. We show the above properties of SPACE on Atari and 3D-Rooms. Interesting future directions are to replace the sequential processing of the background with a parallel one and to improve the model for natural images. Our next plan is to apply SPACE to object-oriented model-based reinforcement learning.


SA thanks Kakao Brain and the Center for Super Intelligence (CSI) for their support. ZL thanks the ZJU-3DV group for its support. The authors would also like to thank Chang Chen and Zhuo Zhi for their insightful discussions and help in generating the 3D-Room dataset.


Appendix A Additional Results of SPACE

Figure 5: Case illustration of Montezuma’s Revenge comparing object-detection behaviour in SPACE and SPAIR.
Figure 6: Qualitative demonstration of SPACE trained on the 3D-room dataset.
Figure 7: Qualitative demonstration of SPACE trained jointly on a selection of 10 ATARI games. We show 6 games with complex background here.
Figure 8: Object detection and background segmentation using SPACE on 3D-Room data set with small number of objects. Each row corresponds to one input image.
Figure 9: Object detection and background segmentation using SPACE on 3D-Room data set with large number of objects.

Appendix B ELBO Derivations

In this section, we derive the ELBO for the log-likelihood $\log p(x)$.

KL Divergence for the Foreground Latents. Under SPACE’s approximate inference, the KL term for the foreground inside the expectation can be evaluated as follows.
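As a sketch derived from the foreground prior and the mean-field posterior given in Section 2 (the grouping of terms may differ from the authors' own expression), taking the expectation of $\log q(z^{fg} \mid x) - \log p(z^{fg})$ under $q$ gives:

```latex
D_{KL}\big( q(z^{fg} \mid x) \,\|\, p(z^{fg}) \big)
= \sum_{i=1}^{H \times W} \Big[
    D_{KL}\big( q(z^{pres}_i \mid x) \,\|\, p(z^{pres}_i) \big)
    + \mathbb{E}_{q(z^{pres}_i \mid x)} \Big[ z^{pres}_i \Big(
        D_{KL}\big( q(z^{where}_i \mid x) \,\|\, p(z^{where}_i) \big)
      + D_{KL}\big( q(z^{depth}_i \mid x) \,\|\, p(z^{depth}_i) \big)
      + \mathbb{E}_{q(z^{where}_i \mid x)}
          D_{KL}\big( q(z^{what}_i \mid z^{where}_i, x) \,\|\, p(z^{what}_i) \big)
    \Big) \Big]
\Big]
```

The factor $z^{pres}_i$ comes from the exponent in both the prior and the posterior, so cells that model no object contribute only the presence term. The inner expectation over $z^{where}_i$ appears because the posterior of $z^{what}_i$ is conditioned on $z^{where}_i$.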

KL Divergence for the Background Latents. Under our GENESIS-like modeling of inference for the background latents, the KL term inside the expectation for the background is evaluated as follows.
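As a sketch based only on the autoregressive factorizations stated in Section 2 (the authors may split each term further into its mask and component parts), the chain rule for the KL divergence gives:

```latex
D_{KL}\big( q(z^{bg}_{1:K} \mid x) \,\|\, p(z^{bg}) \big)
= \sum_{k=1}^{K} \mathbb{E}_{q(z^{bg}_{<k} \mid x)}
    D_{KL}\big( q(z^{bg}_k \mid z^{bg}_{<k}, x) \,\|\, p(z^{bg}_k \mid z^{bg}_{<k}) \big),
\qquad
p(z^{bg}_k \mid z^{bg}_{<k}) = p(z^c_k \mid z^m_k)\, p(z^m_k \mid z^m_{<k}).
```

Each term depends on the previously inferred components through the outer expectation, which is why the background latents are processed sequentially.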

Relaxed treatment of $z^{pres}$. In our implementation, we model the Bernoulli random variable $z^{pres}_i$ using the Gumbel-Softmax distribution (jang2016categorical). We use the relaxed value of $z^{pres}$ throughout training and use hard samples only for the visualizations.
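For a binary variable, the Gumbel-Softmax relaxation reduces to adding logistic noise to the logit and applying a tempered sigmoid. The following is a minimal sketch of that sampling step, not the authors' code. The temperature value and the 0.5 threshold used for hard samples are assumptions.

```python
import math
import random


def sigmoid(a):
    """Numerically stable logistic function."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def relaxed_bernoulli_sample(logit, temperature, rng=random):
    """Draw a relaxed sample in [0, 1] for a Bernoulli with the given logit.

    The difference of two Gumbel variables is logistic, so a single
    logistic noise term is enough in the two-class case.
    """
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep log() finite
    logistic_noise = math.log(u) - math.log(1.0 - u)
    return sigmoid((logit + logistic_noise) / temperature)


def hard_sample(relaxed_value):
    """Discretize a relaxed sample (assumed threshold: 0.5)."""
    return 1 if relaxed_value > 0.5 else 0
```

Lower temperatures push the relaxed samples towards 0 or 1, while the probability that the thresholded sample equals 1 stays at `sigmoid(logit)` for any temperature.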

Appendix C Boundary Loss

In this section we elaborate on the implementation details of the boundary loss. We construct a kernel of the size of the glimpse, with negative uniform weights inside a boundary and zero weight in the gap between the boundary and the edge of the glimpse. This ensures that the model is penalized when the object is outside the boundary. This kernel is first mapped onto the global space via a Spatial Transformer Network (STN) (spatial_transformer) to obtain the global kernel, which is then multiplied element-wise with the global object mask to obtain the boundary loss map. The objective is to minimize the mean of this boundary loss map. Like the ELBO, this loss is back-propagated and optimized with RMSProp (rmsprop). Because of the boundary constraint, this loss makes the bounding boxes less tight and results in lower average precision, so we disable it and optimize only the ELBO once the model has converged well.
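The kernel construction can be sketched as follows. This is a simplified illustration that works directly in glimpse coordinates and skips the spatial-transformer mapping to the global space; the glimpse size of 8, the gap of 2, and all function names are hypothetical values chosen only to keep the example small.

```python
def boundary_kernel(glimpse_size, gap, inside_weight=-1.0):
    """Kernel with a negative uniform weight inside the boundary and
    zero weight in the band between the boundary and the glimpse edge."""
    kernel = []
    for row in range(glimpse_size):
        kernel_row = []
        for col in range(glimpse_size):
            inside = (gap <= row < glimpse_size - gap
                      and gap <= col < glimpse_size - gap)
            kernel_row.append(inside_weight if inside else 0.0)
        kernel.append(kernel_row)
    return kernel


def boundary_loss(kernel, mask):
    """Mean of the element-wise product of the kernel and an object mask."""
    size = len(kernel)
    total = sum(kernel[r][c] * mask[r][c]
                for r in range(size) for c in range(size))
    return total / (size * size)


def box_mask(size, top, left, height, width):
    """Binary mask that is 1 inside an axis-aligned box and 0 elsewhere."""
    return [[1.0 if top <= r < top + height and left <= c < left + width else 0.0
             for c in range(size)] for r in range(size)]


kernel = boundary_kernel(glimpse_size=8, gap=2)
centered_loss = boundary_loss(kernel, box_mask(8, 2, 2, 4, 4))  # object inside the boundary
shifted_loss = boundary_loss(kernel, box_mask(8, 0, 0, 4, 4))   # object spilling into the gap
```

Because the loss is minimized, an object mask that lies entirely inside the boundary receives a lower (more negative) value than one that spills into the zero-weight band.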

Appendix D Implementation Details

D.1 Algorithms

Algorithm 1 and Algorithm 2 present SPACE’s inference for the foreground and the background. Algorithm 3 shows the details of the generation process of the background module. For foreground generation, we simply sample the latent variables from the priors instead of conditioning on the input. Note that, for convenience, the algorithms for the foreground module and the background module are presented with for loops, but inference for all variables of the foreground module is implemented as parallel convolution operations, and most operations of the background module (barring the LSTM module) are parallel as well.

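Input: image $\mathbf{x}$
Output: foreground mask, foreground appearance
$\mathbf{e}$ = ImageEncoderFg($\mathbf{x}$)
for each grid cell $i$ do
       /* The following is performed in parallel */
       presence probability = ZPresNet($\mathbf{e}_i$)
       depth parameters = ZDepthNet($\mathbf{e}_i$)
       scale parameters = ZScaleNet($\mathbf{e}_i$)
       shift parameters = ZShiftNet($\mathbf{e}_i$)
       $z^{\text{pres}}_i \sim$ Bern(presence probability)
       sample $z^{\text{depth}}_i$, $z^{\text{scale}}_i$, $z^{\text{shift}}_i$ from Gaussians with the predicted parameters
       /* rescale local shift to global shift */
       /* Extract glimpses with a Spatial Transformer */
       glimpse = ST($\mathbf{x}$, $z^{\text{scale}}_i$, $z^{\text{shift}}_i$)
       what parameters = GlimpseEncoder(glimpse)
       sample $z^{\text{what}}_i$ from a Gaussian with the predicted parameters
       /* Foreground mask and appearance */
       glimpse mask, glimpse appearance = GlimpseDecoder($z^{\text{what}}_i$)
end for
/* Compute weights for each component */
/* Compute global weighted mask and foreground appearance */
Algorithm 1 Foreground Inference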
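The step that rescales a local shift to a global shift in Algorithm 1 could look like the sketch below. The coordinate convention (image spanning [-1, 1] on each axis), the tanh squashing, and the half-cell offset range are all assumptions made for illustration, not details confirmed by the text; the function name is hypothetical.

```python
import math


def local_to_global_shift(z_shift, cell_row, cell_col, grid_size):
    """Map a cell-relative (x, y) shift to image coordinates in [-1, 1]."""
    cell_width = 2.0 / grid_size  # assumed: the image spans [-1, 1] per axis
    center_x = -1.0 + (cell_col + 0.5) * cell_width
    center_y = -1.0 + (cell_row + 0.5) * cell_width
    # Assumed: tanh bounds the offset, and half a cell width keeps the
    # object center inside the cell that proposed it.
    offset_x = math.tanh(z_shift[0]) * cell_width / 2.0
    offset_y = math.tanh(z_shift[1]) * cell_width / 2.0
    return (center_x + offset_x, center_y + offset_y)


cell_center = local_to_global_shift((0.0, 0.0), cell_row=0, cell_col=0, grid_size=4)
cell_corner = local_to_global_shift((100.0, -100.0), cell_row=3, cell_col=3, grid_size=4)
```

With this convention every cell can place an object center anywhere within its own cell, so the cells jointly cover the whole image without any dependence between them.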
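Input: image $\mathbf{x}$, initial LSTM states, initial dummy mask
Output: background masks and appearances for each component $k$
$\mathbf{e}$ = ImageEncoderBg($\mathbf{x}$)
for each background component $k$ do
       mask parameters = PredictMask(LSTM state at step $k$)
       sample $z^m_k$ from a Gaussian with the predicted parameters
       /* Actually decoded in parallel */
       decode $z^m_k$ into an unnormalized mask with the mask decoder
       /* Stick breaking process as described in GENESIS */
       compute mask $k$ from the unnormalized masks
       component parameters = CompEncoder($\mathbf{x}$, mask $k$)
       sample $z^c_k$ from a Gaussian with the predicted parameters
       appearance $k$ = CompDecoder($z^c_k$)
end for
Algorithm 2 Background Inference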
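The stick-breaking step referenced in Algorithm 2 can be illustrated for a single pixel as below. This is the generic stick-breaking construction, shown under the assumption that each decoded mask value has already been squashed into [0, 1]; the function name and the example fractions are hypothetical.

```python
def stick_breaking(fractions):
    """Turn K-1 fractions in [0, 1] into K non-negative weights that sum to 1."""
    weights = []
    remaining = 1.0  # portion of the stick not yet assigned
    for fraction in fractions:
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    weights.append(remaining)  # the last component takes whatever is left
    return weights


pixel_weights = stick_breaking([0.5, 0.5, 0.2])
```

Applied independently at every pixel, this yields component masks that are non-negative and sum to one, so the background components form a valid mixture.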
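Input: initial LSTM states, initial dummy mask
Output: background masks and appearances for each component $k$
for each background component $k$ do
       mask prior parameters = PredictMaskPrior(LSTM state at step $k$)
       sample $z^m_k$ from a Gaussian with the predicted parameters
       /* Actually decoded in parallel */
       decode $z^m_k$ into an unnormalized mask with the mask decoder
       /* Stick breaking process as described in GENESIS */
       compute mask $k$ from the unnormalized masks
       component prior parameters = PredictComp($z^m_k$)
       sample $z^c_k$ from a Gaussian with the predicted parameters
       appearance $k$ = CompDecoder($z^c_k$)
end for
Algorithm 3 Background Generation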

D.2 Training Regime and Hyperparameters

For all experiments we use a fixed image size and a batch size of 12 to 16, depending on memory usage. We use the RMSProp (rmsprop) optimizer for the foreground module and the Adam (kingma2014adam) optimizer for the background module. We use gradient clipping with a maximum norm of 1.0. For Atari games, we find it beneficial to keep the foreground mask at a fixed value for the first 1000-2000 steps, and we vary the fixed value and the number of steps for different games. This allows both the foreground and the background module to learn in the early stage of training.
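Gradient clipping with a maximum norm of 1.0 can be sketched as follows. The sketch assumes the norm is the global L2 norm over all gradient entries, which is one common convention; whether the norm is computed globally or per parameter group is not stated in the text, and the function name is hypothetical.

```python
import math


def clip_grad_norm(grads, max_norm=1.0):
    """Rescale a flat list of gradients so that its L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm


clipped, norm_before = clip_grad_norm([3.0, 4.0], max_norm=1.0)
untouched, small_norm = clip_grad_norm([0.3, 0.4], max_norm=1.0)
```

Clipping preserves the direction of the update and only limits its magnitude, so gradients that are already small pass through unchanged.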

We list our hyperparameters for the large 3D-Room dataset and for joint training on 10 static Atari games below. Hyperparameters for other experiments are similar but are fine-tuned for each dataset individually. Some values in the tables below are annealed from a starting value to a final value between a given start step and end step.
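A hyperparameter annealing schedule of this kind could be implemented as in the sketch below. Linear interpolation is an assumption, since the text does not state the shape of the schedule, and the function name and example values are hypothetical.

```python
def anneal(step, start_value, end_value, start_step, end_step):
    """Linearly interpolate a hyperparameter between two training steps."""
    if step <= start_step:
        return start_value
    if step >= end_step:
        return end_value
    progress = (step - start_step) / (end_step - start_step)
    return start_value + progress * (end_value - start_value)


before = anneal(0, 1.0, 0.0, start_step=100, end_step=200)
halfway = anneal(150, 1.0, 0.0, start_step=100, end_step=200)
after = anneal(300, 1.0, 0.0, start_step=100, end_step=200)
```

The value stays constant before the start step and after the end step, so the same helper covers hyperparameters that are held fixed for an initial number of steps.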

3D Room Large
Name Symbol Value
prior prob
prior mean
prior stdev 0.1
prior
prior
prior
foreground stdev 0.15
background stdev 0.15
component number 5
gumbel-softmax temperature 2.0
#steps to fix N/A
fixed value N/A
Joint Training on 10 Atari Games
Name Symbol Value
prior prob 0.01
prior mean
prior stdev 0.1
prior
prior
prior
foreground stdev 0.20
background stdev 0.10
component number 3
gumbel-softmax temperature 2.5
#steps to fix 2000
fixed value 0.1

D.3 Model Architecture

Here we describe the architecture of our SPACE model. The variant with the coarser grid of cells uses the same architecture, except that the last layer of the image encoder is a stride-2 convolution.

All modules that output distribution parameters are implemented with either a single fully connected layer or a single convolution layer, with the appropriate output size. Image encoders are fully convolutional networks that output a feature map, and the glimpse encoder comprises convolutional layers followed by a final linear layer that computes the parameters of a Gaussian distribution. For the glimpse decoder of the foreground module and the mask decoder of the background module, we use the sub-pixel convolution layer (Shi_2016_CVPR). Following GENESIS (genesis) and IODINE (iodine), we adopt the Spatial Broadcast Network (Watters2019SpatialBD) as the component decoder to decode the component latents into background components.
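The spatial broadcast operation at the input of the component decoder can be sketched as below: the latent vector is tiled over a grid and two coordinate channels are appended, which matches the "32+2" entry of the Component Decoder table. The coordinate range of [-1, 1], the channel ordering, and the function name are assumptions made for illustration.

```python
def spatial_broadcast(latent, height, width):
    """Tile a latent vector over a grid and append x and y coordinate channels.

    Returns nested lists in channels-first layout with shape
    (len(latent) + 2, height, width).
    """
    def linspace(n):
        # Evenly spaced coordinates in [-1, 1] (assumed range).
        return [0.0] if n == 1 else [-1.0 + 2.0 * i / (n - 1) for i in range(n)]

    xs, ys = linspace(width), linspace(height)
    channels = [[[value] * width for _ in range(height)] for value in latent]
    channels.append([list(xs) for _ in range(height)])  # x coordinate channel
    channels.append([[y] * width for y in ys])          # y coordinate channel
    return channels


latent = [0.1 * i for i in range(32)]
broadcast = spatial_broadcast(latent, height=4, width=4)
```

The subsequent stride-1 convolutions then see the same latent at every location and rely on the coordinate channels to produce spatially varying output.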

For inference and generation of the background module, the dependence of each mask latent $z^m_k$ on the preceding mask latents is implemented with LSTMs with a hidden size of 64. The dependence of each component latent $z^c_k$ on $z^m_k$ is implemented with an MLP with two hidden layers of 64 units each. We apply

when computing standard deviations for all Gaussian distributions, and apply

when computing reconstruction and masks. We use either Group Normalization (GN) (Wu_2018_ECCV) and CELU (CELU) or Batch Normalization (BN) (pmlr-v37-ioffe15) and ELU (clevert2015fast), depending on the module type.

The rest of the architecture details are described below. In the following tables, ConvSub(n) denotes a sub-pixel convolution layer implemented as a stride-1 convolution followed by a PyTorch PixelShuffle layer with upscale factor n, and GN(n) denotes Group Normalization with n groups.
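The rearrangement performed after the stride-1 convolution in a sub-pixel convolution layer can be sketched in plain Python as below. It follows the channel-to-space index mapping used by PyTorch's `PixelShuffle` layer; the function name and the tiny example input are hypothetical, and the convolution itself is omitted.

```python
def pixel_shuffle(feature_map, upscale):
    """Rearrange (C * r * r, H, W) nested lists into (C, H * r, W * r)."""
    in_channels = len(feature_map)
    height, width = len(feature_map[0]), len(feature_map[0][0])
    out_channels = in_channels // (upscale * upscale)
    output = [[[0.0] * (width * upscale) for _ in range(height * upscale)]
              for _ in range(out_channels)]
    for c in range(out_channels):
        for h in range(height):
            for w in range(width):
                for i in range(upscale):
                    for j in range(upscale):
                        # Each group of r * r input channels fills one r x r output block.
                        source = c * upscale * upscale + i * upscale + j
                        output[c][h * upscale + i][w * upscale + j] = feature_map[source][h][w]
    return output


# Four 1x1 channels become one 2x2 channel.
shuffled = pixel_shuffle([[[0.0]], [[1.0]], [[2.0]], [[3.0]]], upscale=2)
```

Each such layer multiplies the spatial resolution by the upscale factor while dividing the channel count by its square, which is how the decoders grow a small latent map into a full-resolution output.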

Name Value Comment
dim 1
dim 1
dim 2 for x and y axis
dim 2 for x and y axis
dim 32
dim 32
dim 32
glimpse shape
image shape
Glimpse Encoder
Layer Size/Ch. Stride Norm./Act.
Input 3
Conv 16 1 GN(4)/CELU
Conv 32 2 GN(8)/CELU
Conv 32 1 GN(4)/CELU
Conv 64 1 GN(8)/CELU
Conv 128 2 GN(8)/CELU
Conv 256 1 GN(16)/CELU
Linear 32 + 32
Glimpse Decoder
Layer Size/Ch. Stride Norm./Act.
Input 32
Conv 256 1 GN(16)/CELU
ConvSub(2) 128 1 GN(16)/CELU
Conv 128 1 GN(16)/CELU
ConvSub(2) 128 1 GN(16)/CELU
Conv 128 1 GN(16)/CELU
ConvSub(2) 64 1 GN(8)/CELU
Conv 64 1 GN(8)/CELU
ConvSub(2) 32 1 GN(8)/CELU
Conv 32 3 GN(8)/CELU
ConvSub(2) 16 1 GN(4)/CELU
Conv 16 1 GN(4)/CELU
Background Image Encoder
Layer Size/Ch. Stride Norm./Act.
Input 3
Conv 64 2 BN/ELU
Conv 64 2 BN/ELU
Conv 64 2 BN/ELU
Conv 64 2 BN/ELU
Flatten
Linear 64 ELU
Mask Decoder
Layer Size/Ch. Stride Norm./Act.
Input 32
Conv 256 1 GN(16)/CELU
ConvSub(4) 256 1 GN(16)/CELU
Conv 256 1 GN(16)/CELU
ConvSub(2) 128 1 GN(16)/CELU
Conv 128 1 GN(16)/CELU
ConvSub(4) 64 1 GN(8)/CELU
Conv 64 1 GN(8)/CELU
ConvSub(4) 16 1 GN(4)/CELU
Conv 16 1 GN(4)/CELU
Conv 1 1
Component Encoder
Layer Size/Ch. Stride Norm./Act.
Input 3+1 (RGB+mask)
Conv 32 2 BN/ELU
Conv 32 2 BN/ELU
Conv 64 2 BN/ELU
Conv 64 2 BN/ELU
Flatten
Linear 32+32
Component Decoder
Layer Size/Ch. Stride Norm./Act.
Input 32 (1d)
Spatial Broadcast 32+2 (3d)
Conv 32 1 BN/ELU
Conv 32 1 BN/ELU
Conv 32 1 BN/ELU
Conv 3 1

D.4 Baseline SPAIR

Here we give the details of the background encoder and decoder used in training SPAIR (both full-image and patch-wise training). The image encoder is the same as that of SPACE, with the only difference being that the inferred latents are conditioned on previous cells’ latents, as described in Section 2.2.

SPAIR Background Encoder - Full Image training
Layer Size/Ch. Norm./Act.
Input
Linear 256 GN(16)/CELU
Linear 128 GN(16)/CELU
Linear 32
SPAIR Background Decoder
Layer Size/Ch. Stride Norm./Act.
Input 16
Conv 256 1 GN(16)/CELU
ConvSub 128 1 GN(16)/CELU
Conv 128 1 GN(16)/CELU
ConvSub 64 1 GN(8)/CELU
Conv 64 1 GN(8)/CELU
ConvSub 16 1 GN(4)/CELU
Conv 16 1 GN(4)/CELU
Conv 16 1 GN(4)/CELU
Conv 3 1
SPAIR Background Encoder For Patch Training
Layer Size/Ch. Stride Norm./Act.
Input 3
Conv 16 2 GN(4)/CELU
Conv 32 2 GN(8)/CELU
Conv 64 2 GN(8)/CELU
Conv 128 2 GN(16)/CELU
Conv 32 2 GN(4)/CELU
SPAIR Background Decoder For Patch Training
Layer Size/Ch. Stride Norm./Act.
Input 16
Conv 256 1 GN(16)/CELU
Conv 2048 1
ConvSub(4) 128 1 GN(16)/CELU
Conv 128 1 GN(16)/CELU
Conv 256 1
ConvSub(2) 64 1 GN(8)/CELU
Conv 64 1 GN(8)/CELU
Conv 256 1
ConvSub(4) 16 1 GN(4)/CELU
Conv 16 1 GN(4)/CELU
Conv 16 1 GN(4)/CELU
Conv 3 1

Appendix E Dataset Details

Atari. For each game, we sample 60,000 random images from a pretrained agent (wu2016tensorpack). We split the images into 50,000 for the training set, 5,000 for the validation set, and 5,000 for the test set. Each image is preprocessed to a fixed size with BGR color channels. We present results for the following games: Space Invaders, Air Raid, River Raid, Montezuma’s Revenge.

We also train our model jointly on a dataset of 10 games, with 8,000 training images, 1,000 validation images, and 1,000 test images for each game. We use the following games: Asterix, Atlantis, Carnival, Double Dunk, Kangaroo, Montezuma’s Revenge, Pacman, Pooyan, Qbert, Space Invaders.

Room 3D. We use MuJoCo (mujoco) to generate this dataset. Each image consists of a walled enclosure with a random number of objects on the floor. The possible objects are randomly sized spheres, cubes, and cylinders. The small 3D-Room dataset has 4-8 objects and the large 3D-Room dataset has 18-24 objects. The colors of the objects are randomly chosen from 8 different colors, and the colors of the background (wall, ground, sky) are randomly chosen from 5 different colors. The angle of the camera is also selected randomly. We use a training set of 63,000 images, a validation set of 7,000 images, and a test set of 7,000 images. We use a 2-D projection from the camera to determine the ground-truth bounding boxes of the objects so that we can report the average precision of the different models.
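Average precision for bounding boxes is typically computed by matching predicted boxes to ground-truth boxes by their intersection over union (IoU). The sketch below shows only that overlap measure, under the assumption that boxes are given as (x_min, y_min, x_max, y_max); the matching thresholds used in the evaluation are not stated in this section, and the function name is hypothetical.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


overlap = box_iou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))
disjoint = box_iou((0.0, 0.0, 1.0, 1.0), (2.0, 2.0, 3.0, 3.0))
identical = box_iou((0.0, 0.0, 2.0, 2.0), (0.0, 0.0, 2.0, 2.0))
```

A predicted box usually counts as a correct detection when its IoU with a ground-truth box exceeds a chosen threshold, which is why looser boxes lower the reported average precision.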