1 Introduction
Many real-world tasks, such as inferring physical relationships between objects in an image and visual-spatial reasoning, require identifying and learning a useful representation of elements in the scene. Such objects can be conveniently encoded into a representation containing their category, attributes, and spatial positions and orientations. For example, an object can be of category vehicle, with attributes such as red colour and 4 doors, and positioned at the bottom right of the scene in a specific orientation. Humans, when recognising objects or trying to draw them, are believed to have attentional templates (Carlisle et al., 2011) of different categories of objects in mind that are augmented by different attributes and selected spatially via attention.
Machine approaches to such problems often use generative models such as the Variational AutoEncoder (VAE) (Kingma & Welling, 2013), which uses an inference model to infer latent codes corresponding to the representation, and a generative model which reconstructs data given the representation. Recurrent versions of the VAE, such as the Attend-Infer-Repeat (AIR) model by Eslami et al. (2016), have been developed to decompose a scene into multiple objects, each represented by a latent code $z^i$. While this latent code disentangles spatial information $z_{\text{where}}$ and object presence $z_{\text{pres}}$, for most tasks the object representation $z_{\text{what}}$ is an entangled real-valued vector and thus difficult to interpret. While AIR does propose the possibility of using a discrete latent code as $z_{\text{what}}$, it only experimented with the discrete code using a specifically designed graphics engine as decoder. Here, we propose DiscreteAIR, an end-to-end trainable autoencoder which structures the latent representation $z_{\text{what}}$ into $z_{\text{cat}}$, representing the category of objects, and $z_{\text{attr}}$, representing attributes of objects. Figure 1 illustrates how a scene of different shapes can be separately identified into different categories with varying attributes. This decomposition is similar to InfoGAN by Chen et al. (2016), which also decomposes the representation into style and shape using a modified Generative Adversarial Network (Goodfellow et al., 2014). However, there are two main differences. Firstly, while InfoGAN uses a mutual information objective in addition to the GAN objective to encourage disentangled coding, DiscreteAIR only uses the variational lower bound (ELBO) objective and encourages disentanglement through the inductive bias in the structure of the latent code. Secondly, while InfoGAN is only applied to images containing single objects, DiscreteAIR is developed for scenes with multiple objects.

Related to this work are other approaches which decompose scenes into different categories. Neural Expectation Maximization (NEM) by Greff et al. (2017) implements the Expectation-Maximization algorithm with an end-to-end trainable neural network. NEM is able to perceptually group pixels of an image into different clusters. However, it does not learn a generative model that allows controllable generation, such as using $z_{\text{cat}}$ and $z_{\text{attr}}$ in DiscreteAIR. Ganin et al. (2018) train a neural network to synthesise programs that can be fed into a graphics engine to generate scenes. While this approach learns an inference model for the generative model, the graphics engine that provides learning gradients is predefined and not learnt. In contrast, DiscreteAIR jointly learns an inference model and a generative model from scratch.

We show that our DiscreteAIR model can decompose scenes into a set of interpretable latent codes for two multi-object datasets, namely the MultiMNIST dataset as used in the original AIR model (Eslami et al., 2016) and a multi-object dataset in a similar style to the dSprites dataset (Matthey et al., 2017). We show that unsupervised training of the DiscreteAIR model is able to effectively capture the categories of objects in the scene for both the MultiMNIST and MultiSprites datasets.
2 Attend Infer Repeat
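The Attend-Infer-Repeat (AIR) model, introduced by Eslami et al. (2016), is a recurrent version of the Variational AutoEncoder (VAE) (Kingma & Welling, 2013) that decomposes a scene into multiple objects, each represented by a latent code $z^i = (z^i_{\text{pres}}, z^i_{\text{where}}, z^i_{\text{what}})$ at recurrent time step $i$. Among them, $z^i_{\text{pres}}$ is a binary discrete variable encoding whether an object is inferred in the current step $i$. If $z^i_{\text{pres}}$ is 0, inference stops. The sequence of $z^i_{\text{pres}}$ for all $i$ can be concatenated into a vector of $n$ ones and a final zero; $n$ is therefore a variable representing the number of objects in the scene. $z^i_{\text{where}}$ is a spatial attention parameter used to locate a target object in the image, and $z^i_{\text{what}}$ is the latent code of the target object. In AIR an amortised variational approximation $q_\phi(z \mid x)$, as computed in equation 1, is used to approximate the true posterior $p_\theta(z \mid x)$ by minimising the KL divergence $\mathrm{KL}\left(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\right)$. In the AIR implementation, $z^i_{\text{where}}$ and $z^i_{\text{what}}$ are parametrised as Gaussian distributions with diagonal covariance.

$$q_\phi(z \mid x) = q_\phi\left(z^{n+1}_{\text{pres}} = 0 \mid z^{1:n}, x\right) \prod_{i=1}^{n} q_\phi\left(z^i_{\text{what}}, z^i_{\text{where}}, z^i_{\text{pres}} = 1 \mid x, z^{1:i-1}\right) \tag{1}$$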
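In the generative model of AIR, the number of objects $n$ can be sampled from a prior such as a geometric prior, and then forms the sequence of $z^i_{\text{pres}}$. Next, $z^i_{\text{where}}$ and $z^i_{\text{what}}$ are sampled from Gaussian priors. An object $o^i$ is generated by processing $z^i_{\text{what}}$ through a decoder. $o^i$ is then written to the canvas, gated by $z^i_{\text{pres}}$ and with scaling and translation specified by $z^i_{\text{where}}$, using a Spatial Transformer (Jaderberg et al., 2015), a powerful spatial attention module. The generative model can be summarised in equations 2 and 3, where $f_{\text{dec}}$ is the decoder, $\mathrm{ST}$ is the spatial transformer and $\odot$ is the element-wise product.

$$o^i = f_{\text{dec}}\left(z^i_{\text{what}}\right) \tag{2}$$

$$x = \sum_{i=1}^{n} \mathrm{ST}\left(o^i, z^i_{\text{where}}\right) \odot z^i_{\text{pres}} \tag{3}$$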
The inference and generative models of AIR are jointly optimised by maximising the lower bound $\mathcal{L}(\theta, \phi) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right]$. While the sampling operation for $z$ is not differentiable (and differentiability is a requirement for gradient-based training), there are various ways to circumvent this. For the continuous latent codes, the reparametrisation trick for VAEs (Kingma & Welling, 2013) is applied, which lets parameters estimated by the inference model deterministically transform a sample from a fixed noise distribution, thereby allowing backpropagation through the deterministic function. For discrete latent codes, AIR uses the NVIL likelihood-ratio estimator introduced by Mnih & Gregor (2014) to produce an unbiased estimate of the gradient for discrete latent variables.
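As a minimal sketch of how the object count and the presence sequence relate in the generative model, the snippet below samples a run of ones followed by a terminating zero. The function name, the `success_prob` parameter and the truncation at `max_steps` are illustrative assumptions, not details taken from the AIR implementation.

```python
import random


def sample_presence_sequence(success_prob, max_steps, rng):
    """Sample z_pres as a run of ones followed by a single terminating zero.

    At every step an object is present with probability `success_prob`
    (a hypothetical prior parameter), so the object count n follows a
    geometric distribution truncated at `max_steps`.
    """
    z_pres = []
    for _ in range(max_steps):
        if rng.random() < success_prob:
            z_pres.append(1)
        else:
            break
    z_pres.append(0)  # final zero: generation stops here
    return z_pres


rng = random.Random(0)
z_pres = sample_presence_sequence(success_prob=0.5, max_steps=3, rng=rng)
n_objects = sum(z_pres)  # the number of ones is the number of objects
```

The number of objects can be read off the sequence as the number of ones, which is what makes the presence sequence and the count interchangeable.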
3 DiscreteAIR
While the AIR model can encode objects in a scene into latent code $z_{\text{what}}$, the representation is still entangled and therefore not interpretable. In DiscreteAIR, we introduce structure into the latent distribution to encourage disentanglement. We break $z_{\text{what}}$ into $z_{\text{cat}}$ and $z_{\text{attr}}$. $z_{\text{cat}}$ is a discrete latent variable that captures the category of the object, while $z_{\text{attr}}$ is a combination of continuous and discrete latent variables that captures attributes of the object. We do not use any additional objective function to encourage $z_{\text{cat}}$ to capture category and $z_{\text{attr}}$ to capture attributes. Rather, we allow the model to automatically learn the best way of using these discrete and continuous latent variables through the process of likelihood maximisation.
3.1 Sampling discrete variable
In DiscreteAIR, we treat binary discrete variables as scalars with 0/1 values and multi-class categorical discrete variables as one-hot vectors. As sampling from a discrete distribution is non-differentiable, we model discrete latent variables with Gumbel Softmax (Maddison et al., 2016; Jang et al., 2016), a continuous approximation to the discrete distribution from which we can sample an approximately one-hot discrete vector $y = (y_1, \dots, y_K)$ where:

$$y_k = \frac{\exp\left((\log \alpha_k + G_k)/\tau\right)}{\sum_{j=1}^{K} \exp\left((\log \alpha_j + G_j)/\tau\right)} \tag{4}$$

$\alpha_k$ are a parametrisation of the distribution, $G_k$ are Gumbel noise sampled from the Gumbel distribution $\mathrm{Gumbel}(0, 1)$, and $\tau$ is a temperature parameter controlling the smoothness of the distribution. As $\tau \to 0$, the distribution converges to a discrete distribution. For binary discrete variables such as $z_{\text{pres}}$, we use Gumbel Sigmoid, which is essentially Gumbel Softmax with the softmax function replaced by the sigmoid function:

$$y = \frac{\exp\left((\log \alpha + G)/\tau\right)}{1 + \exp\left((\log \alpha + G)/\tau\right)} \tag{5}$$
In contrast to the NVIL estimator (Mnih & Gregor, 2014) used in the original AIR model, we found that Gumbel softmax/Sigmoid is more stable during training, experiencing no model collapse during all the training experiments.
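The sampling procedure described in this section can be sketched in plain Python as follows. This is an illustrative sketch rather than the authors' implementation: the function names are ours, the sigmoid variant follows the textual description (a single Gumbel noise term passed through a sigmoid), and in a real model the log-parameters would come from the inference network inside an automatic differentiation framework.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one sample from Gumbel(0, 1) by inverse transform sampling."""
    u = rng.random()
    # Clamp away from 0 and 1 so that both logarithms stay finite.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def gumbel_softmax(log_alpha, tau, rng):
    """Relaxed one-hot sample: softmax((log_alpha + G) / tau)."""
    logits = [(la + sample_gumbel(rng)) / tau for la in log_alpha]
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_sigmoid(log_alpha, tau, rng):
    """Relaxed binary sample: sigmoid((log_alpha + G) / tau)."""
    x = (log_alpha + sample_gumbel(rng)) / tau
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    ex = math.exp(x)  # stable branch for negative inputs
    return ex / (1.0 + ex)


rng = random.Random(0)
log_alpha = [math.log(p) for p in (0.2, 0.3, 0.5)]
soft_sample = gumbel_softmax(log_alpha, tau=1.0, rng=rng)
sharp_sample = gumbel_softmax(log_alpha, tau=0.1, rng=rng)
presence = gumbel_sigmoid(0.0, tau=0.5, rng=rng)
```

Lower temperatures push most of the probability mass of a sample onto a single entry, at the cost of higher-variance gradients, which is the usual motivation for annealing the temperature during training.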
3.2 Generative model
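The probabilistic generative model is shown in figure 2. From $z_{\text{cat}}$, a template of this object category is generated. This template is then modified by attributes $z_{\text{attr}}$ into an image of the object $o^i$ that is subsequently drawn onto the canvas using spatial write attention. $z_{\text{cat}}$, $z_{\text{attr}}$, $z_{\text{where}}$ and $z_{\text{pres}}$, jointly denoted $z^i$, are estimated from the inference model at each time step $i$ of inference.
We replace the decoder function $f_{\text{dec}}(z_{\text{what}})$ from the original AIR model with a new function $f_{\text{dec}}(z_{\text{cat}}, z_{\text{attr}})$ parametrised by the category variable $z_{\text{cat}}$ and the attribute variable $z_{\text{attr}}$. There are various candidate functions for combining $z_{\text{cat}}$ and $z_{\text{attr}}$. We have experimented with three different variations, which are: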

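Additive: $f_{\text{dec}}(z_{\text{cat}}, z_{\text{attr}}) = f_{\text{cat}}(z_{\text{cat}}) + f_{\text{attr}}(z_{\text{attr}})$, where $f_{\text{cat}}$ generates a template, while $f_{\text{attr}}$ generates an additive modification of the template.

Multiplicative: $f_{\text{dec}}(z_{\text{cat}}, z_{\text{attr}}) = f_{\text{cat}}(z_{\text{cat}}) \odot f_{\text{attr}}(z_{\text{attr}})$, where $f_{\text{cat}}$ generates a template, while $f_{\text{attr}}$ generates a multiplicative modification of the template.

Convolutional: $f_{\text{dec}}(z_{\text{cat}}, z_{\text{attr}}) = f_{\text{cat}}(z_{\text{cat}}) * f_{\text{attr}}(z_{\text{attr}})$, where $f_{\text{cat}}$ generates a template, while $f_{\text{attr}}$ generates a set of convolution kernels that are convolved with the template to modify it.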
In our experiments we found that the choice of combining functions has only a small effect on the model performance. We found that the additive combining function performs slightly better, and thus use this function in all of the experiments presented.
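To make the three combining functions concrete, here is a toy sketch that operates on one-dimensional lists instead of images. The template, modification and kernel values are fixed by hand purely for illustration; in the model they would be produced by learnt networks from the category and attribute variables, and the function names are ours.

```python
def combine_additive(template, modification):
    """Template plus an element-wise additive modification."""
    return [t + m for t, m in zip(template, modification)]


def combine_multiplicative(template, modification):
    """Template times an element-wise multiplicative modification."""
    return [t * m for t, m in zip(template, modification)]


def combine_convolutional(template, kernel):
    """Same-size 1-D cross-correlation with zero padding (odd kernel length)."""
    half = len(kernel) // 2
    output = []
    for i in range(len(template)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            j = i + k - half  # input position covered by this kernel tap
            if 0 <= j < len(template):
                acc += weight * template[j]
        output.append(acc)
    return output


template = [0.0, 1.0, 1.0, 0.0]  # hypothetical category template
brighter = combine_additive(template, [0.5, 0.5, 0.0, 0.0])
scaled = combine_multiplicative(template, [2.0, 2.0, 2.0, 2.0])
shifted = combine_convolutional(template, [1.0, 0.0, 0.0])
```

In this toy setting the kernel `[1.0, 0.0, 0.0]` shifts the template by one position, which hints at how a learnt kernel can deform a template rather than merely rescale it.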
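In the original AIR model, the spatial transformation operation specified by attention variable $z_{\text{where}}$ only contains translation and scaling. Affine transformations such as rotation and shearing are accounted for in the latent variable $z_{\text{what}}$ in an entangled way. In DiscreteAIR, we explicitly introduce additional spatial transformer networks that account for rotation and skewing, thereby allowing $z_{\text{attr}}$ to have a reduced number of variables. The spatial attention for the generative decoder is thus factorised as in equation 6,

$$A_{\text{dec}} = A_{\text{ts}} A_{\text{rot}} A_{\text{sk}} = \begin{bmatrix} s_x & 0 & t_x \\ 0 & s_y & t_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} \cos\gamma & -\sin\gamma & 0 \\ \sin\gamma & \cos\gamma & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & k_x & 0 \\ k_y & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \tag{6}$$

where $A_{\text{ts}}$ is the combined transformation matrix of translation and scaling used in the original AIR model, $A_{\text{rot}}$ is the transformation matrix for rotation and $A_{\text{sk}}$ is the transformation matrix for skewing. In the matrices, $s_x$ and $s_y$ are for scaling, $t_x$ and $t_y$ are horizontal and vertical translations, $\gamma$ is the angle of rotation, and $k_x$ and $k_y$ are parameters for shearing along the horizontal and vertical axes.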
3.3 Inference
Figure 3 shows an overview of the DiscreteAIR architecture. At inference step $i$, a difference image between the input image $x$ and the previous canvas is fed together with the previous latent code $z^{i-1}$ into a Recurrent Neural Network (RNN), implemented as a Long Short-Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997), to generate parameters of the distributions for $z^i_{\text{pres}}$ and $z^i_{\text{where}}$. The Spatial Attention module then attends to part of the image and applies a transformation according to $z^i_{\text{where}}$. We enforce the encoding transformation $A_{\text{enc}}$ to be the inverse of the decoding transformation $A_{\text{dec}}$, which means $A_{\text{enc}} = A_{\text{dec}}^{-1}$. This constraint forces the model to match attended objects in the scene with the invariant template specified by $z_{\text{cat}}$. In practice, we compute $A_{\text{enc}}$ as the product of the inverses of the transformation matrices composing $A_{\text{dec}}$:

$$A_{\text{enc}} = A_{\text{sk}}^{-1} A_{\text{rot}}^{-1} A_{\text{ts}}^{-1} \tag{7}$$
The transformed image is then processed by an encoder to estimate parameters of the distributions for $z^i_{\text{cat}}$ and $z^i_{\text{attr}}$. $z^i_{\text{cat}}$ is sampled from Gumbel Softmax as discussed in section 3.1. $z^i_{\text{attr}}$ can be sampled from any distribution that is suitable for the task at hand. For the tasks presented, continuous variables such as the colour intensity or part deformation of an object can be sampled from a multivariate Gaussian distribution using the reparameterization trick (Kingma & Welling, 2013), which allows gradients to pass through the otherwise non-differentiable sampling function. The generative model described in section 3.2 then takes the sampled $z^i_{\text{cat}}$ and $z^i_{\text{attr}}$ and generates an object that is written to the canvas using the spatial attention module.
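The constraint that the encoding transformation is the inverse of the decoding transformation can be checked numerically with a small sketch. The composition order used here (translation and scaling, then rotation, then shear) and all function and parameter names are assumptions made for illustration; the inverse is built as the product of the individual inverses in reverse order.

```python
import math


def matmul(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def scale_translate(sx, sy, tx, ty):
    return [[sx, 0.0, tx], [0.0, sy, ty], [0.0, 0.0, 1.0]]


def rotation(angle):
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def shear(kx, ky):
    return [[1.0, kx, 0.0], [ky, 1.0, 0.0], [0.0, 0.0, 1.0]]


def decoding_matrix(sx, sy, tx, ty, angle, kx, ky):
    """Compose translation/scaling, rotation and shear (assumed order)."""
    return matmul(matmul(scale_translate(sx, sy, tx, ty), rotation(angle)),
                  shear(kx, ky))


def encoding_matrix(sx, sy, tx, ty, angle, kx, ky):
    """Product of the individual inverses, taken in reverse order."""
    inv_ts = scale_translate(1.0 / sx, 1.0 / sy, -tx / sx, -ty / sy)
    inv_rot = rotation(-angle)
    det = 1.0 - kx * ky  # must be non-zero for the shear to be invertible
    inv_shear = [[1.0 / det, -kx / det, 0.0],
                 [-ky / det, 1.0 / det, 0.0],
                 [0.0, 0.0, 1.0]]
    return matmul(matmul(inv_shear, inv_rot), inv_ts)


params = dict(sx=0.5, sy=0.8, tx=0.2, ty=-0.1, angle=math.pi / 6, kx=0.3, ky=0.1)
identity_check = matmul(decoding_matrix(**params), encoding_matrix(**params))
```

Each factor has a cheap closed-form inverse, so inverting the factors separately avoids a general matrix inversion inside the network.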
3.4 Learning
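Similar to the original AIR model, we train the DiscreteAIR model end-to-end by maximising the lower bound on the marginal likelihood of the data:

$$\log p_\theta(x) \geq \mathcal{L}(\theta, \phi) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] \tag{8}$$

In the original AIR model, this equation cannot be rearranged further because of the non-differentiable sampling process used for the discrete variables. For DiscreteAIR, by using Gumbel Softmax as a reparameterized sampling process, we can rearrange equation 8 as:

$$\mathcal{L}(\theta, \phi) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - \mathrm{KL}\left(q_\phi(z \mid x) \,\|\, p(z)\right) \tag{9}$$

where $p_\theta(x \mid z)$ is the data likelihood and $\mathrm{KL}$ is the Kullback-Leibler (KL) divergence. This is the same objective as implemented in the original VAE (Kingma & Welling, 2013). Computing $\partial \mathcal{L} / \partial \theta$, the loss derivative with respect to the parameters of the generative model, is relatively straightforward as the generative model is fully differentiable. With a sampled batch of latent codes $z$, the partial derivative can be directly computed.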
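When computing $\partial \mathcal{L} / \partial \phi$, we can use the reparameterization trick (Kingma & Welling, 2013) to reparametrise the sampling of both continuous and discrete latent variables as a deterministic function of the form $z^i = h(\omega^i, \epsilon^i)$. $\omega^i$ are the parameters of the distributions for $z^i$ at time step $i$, and $\epsilon^i$ is random noise at time step $i$. In this way we can use the chain rule to compute the gradient with respect to $\phi$ as:

$$\frac{\partial \mathcal{L}}{\partial \phi} = \sum_{i} \frac{\partial \mathcal{L}}{\partial z^i} \cdot \frac{\partial h}{\partial \omega^i} \cdot \frac{\partial \omega^i}{\partial \phi} \tag{10}$$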
For our experiments, we parametrise continuous variables as multivariate Gaussian distributions with a diagonal covariance matrix. Thus $h(\omega^i, \epsilon^i) = \mu^i + \sigma^i \odot \epsilon^i$, with $\omega^i = (\mu^i, \sigma^i)$ and $\epsilon^i \sim \mathcal{N}(0, I)$. For discrete variables, we use Gumbel Softmax introduced in section 3.1, which is itself a reparametrised differentiable sampling function.

For the KL-divergence term, assuming all latent variables are conditionally independent, we can factorise $q_\phi(z \mid x)$ as a product $\prod_j q_\phi(z_j \mid x)$ over the individual latent variables $z_j$ and thereby separate the KL terms, as discussed in (Dupont, 2018). We use a Gaussian prior for all continuous variables. As the KL divergence between two Gumbel Softmax distributions is not available in closed form, we approximate it with a Monte Carlo estimate of the KL divergence with a categorical prior for $z_{\text{cat}}$, similar to (Jang et al., 2016). For $z_{\text{pres}}$ we use a geometric prior and compute a Monte Carlo estimate of the KL divergence (Maddison et al., 2016).
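For the continuous variables, the reparameterised sampling step and the Gaussian KL term can be sketched as below. The sketch assumes a standard normal prior, which the text does not state explicitly (it only says the prior is Gaussian), and the function names are ours.

```python
import math
import random


def reparameterise(mu, sigma, rng):
    """z = mu + sigma * eps with eps ~ N(0, 1), applied element-wise."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def kl_diag_gaussian_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ).

    Assumes a standard normal prior; a different Gaussian prior would
    change this expression.
    """
    return 0.5 * sum(m * m + s * s - 1.0 - 2.0 * math.log(s)
                     for m, s in zip(mu, sigma))


rng = random.Random(0)
mu, sigma = [0.3, -1.2], [0.5, 2.0]
z_sample = reparameterise(mu, sigma, rng)
kl_value = kl_diag_gaussian_to_standard_normal(mu, sigma)
```

Because the noise is drawn independently of `mu` and `sigma`, the sample is a deterministic and differentiable function of the distribution parameters, which is what lets the gradient in the chain rule above flow back to the inference network.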
4 Evaluation
We evaluate DiscreteAIR on two multi-object datasets, namely the MultiMNIST dataset as used in the original AIR model (Eslami et al., 2016) and a multi-object shape dataset comprising simple shapes similar to the dSprites dataset (Matthey et al., 2017). We perform experiments to show that DiscreteAIR, while retaining the original strength of the AIR model of discovering the number of objects in a scene, can additionally categorise each discovered object. In order to evaluate how accurately DiscreteAIR can categorise each object, we compute the correspondence rate between the best permutation of category assignments from the DiscreteAIR model and the true labels of the dataset.
To explain the metric we used, we first define some notation. For each input image $x_m$, DiscreteAIR generates a corresponding category latent code $z_{\text{cat}}$ and presence variable $z_{\text{pres}}$ at every step. From these we can form a set of predicted object categories $C_m = \{c_1, \dots, c_n\}$ for the $n$ predicted objects, where $c_j$ is the category of object $j$. For each image we also have a set of true labels of existing objects $Y_m$. Due to the non-identifiability problem of unsupervised learning, where any permutation of the best cluster assignment gives the same optimal result, the category assignments produced by DiscreteAIR do not necessarily correspond to the labels. For example, an image patch of digit 1 could fall into category 4. We thus permute the category assignments and use the permutation that corresponds best with the true labels as the category assignment; in the example above, such a permutation would map predicted category 4 to label 1. To put it more formally, we define a function $f(C, \pi)$, where $C$ is a set or an array of sets, and $\pi$ is an index permutation function that maps the elements in $C$. For the whole dataset, we have an array of predicted category sets $C = [C_1, \dots, C_M]$ and an array of true label sets $Y = [Y_1, \dots, Y_M]$. We define the correspondence rate as $R(C, Y) = N_{\text{correct}}(C, Y) / N_{\text{total}}(Y)$, where $N_{\text{correct}}(C, Y)$ gives the number of true labels in $Y$ that are correctly identified in $C$ and $N_{\text{total}}(Y)$ gives the total number of labels. We thus compute the best correspondence rate as:

$$R^{*} = \max_{\pi \in \Pi} R\left(f(C, \pi), Y\right) \tag{11}$$

where $\Pi$ is the set of all possible permutations of predicted categories. This score ranges from 0 to 1, and a random category assignment has an expected score of $1/K$, where $K$ is the number of categories.
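The best correspondence rate can be computed by brute force over all category permutations, as in the following sketch. Treating each image's predictions and labels as multisets, and letting each prediction match at most one label, is our assumption about how matches are counted; the function names and the toy data are illustrative.

```python
from itertools import permutations


def correspondence_rate(predicted, labels):
    """Fraction of true labels matched by a predicted category in the same image."""
    correct, total = 0, 0
    for pred, true in zip(predicted, labels):
        unmatched = list(pred)  # each prediction can match at most one label
        for label in true:
            total += 1
            if label in unmatched:
                unmatched.remove(label)
                correct += 1
    return correct / total if total else 0.0


def best_correspondence_rate(predicted, labels, num_categories):
    """Search all category permutations and keep the best-scoring one."""
    best_rate, best_perm = -1.0, None
    for perm in permutations(range(num_categories)):
        remapped = [[perm[c] for c in pred] for pred in predicted]
        rate = correspondence_rate(remapped, labels)
        if rate > best_rate:
            best_rate, best_perm = rate, perm
    return best_rate, best_perm


# Toy data: the model uses category 0 for label 1, 1 for label 2, 2 for label 0.
predicted = [[0, 1], [2], [0]]
labels = [[1, 2], [0], [1]]
raw_rate = correspondence_rate(predicted, labels)
best_rate, best_perm = best_correspondence_rate(predicted, labels, num_categories=3)
```

The search visits $K!$ permutations, which is cheap for 3 categories but grows to $10! = 3{,}628{,}800$ evaluations for 10 categories.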
We train DiscreteAIR with the ELBO objective as presented in equation 8. We use the Adam optimiser (Kingma & Ba, 2014) to optimise the model with a batch size of 64 and a learning rate of 0.0001. For Gumbel Softmax, we also apply temperature annealing (Jang et al., 2016) of $\tau$, starting with a smoother distribution and gradually approaching a discrete distribution. For more details about training, please see Appendix A in the supplementary material.
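Temperature annealing can be illustrated with an exponential decay clipped at a minimum value, the general form of schedule described by Jang et al. (2016). The schedule actually used here is not stated in this section, so the start value, floor and decay rate below are hypothetical placeholders.

```python
import math


def annealed_temperature(step, tau_start=1.0, tau_min=0.5, rate=1e-4):
    """Exponentially decay the temperature, never going below tau_min.

    All three default values are hypothetical and chosen only for illustration.
    """
    return max(tau_min, tau_start * math.exp(-rate * step))


schedule = [annealed_temperature(step) for step in (0, 1000, 5000, 20000)]
```

Starting warm keeps the relaxed samples smooth and the gradients well behaved early on, while the floor prevents the temperature from reaching values where gradient variance becomes very large.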
4.1 MultiSprites
To evaluate DiscreteAIR, we have built a multi-object dataset in a similar style to the dSprites dataset (Matthey et al., 2017). This dataset consists of 90000 images of pixel size
. In each image there are 0 to 3 objects with shapes in the categories of square, triangle and ellipse. The objects’ spatial locations, orientations and size are all sampled randomly from uniform distributions. Details about constructing this dataset can be found in Appendix B. Figure
4 illustrates the application of DiscreteAIR on the MultiSprites dataset. Figure 4 a) shows samples of input data from the dataset with each object detected and categorised (with differently coloured bounding boxes). The number at the top-left corner shows the estimated number of objects in the scene. Figure 4 b) shows images reconstructed by the DiscreteAIR model. Figure 4 c) shows the fully interpretable latent code of each object in the scene. For this dataset, we used a discrete variable of 3 categories as $z_{\text{cat}}$ together with spatial attention variables $z_{\text{where}}$. We did not include $z_{\text{attr}}$ for this dataset as the attributes of each object, including location, orientation and size, can all be controlled by $z_{\text{where}}$. We did not include the shear transformation in the spatial attention as the dataset generation process does not have a shear transformation. For more details about the architecture, please see Appendix A.

For quantitative evaluation of DiscreteAIR we use three metrics, namely reconstruction error in the form of Mean Squared Error (MSE), count accuracy for the number of objects in the scene, and categorical correspondence rate. We also compare DiscreteAIR with AIR on the first two metrics. Table 1 shows the performance for these three metrics. We report mean performance across 10 independent runs. DiscreteAIR has slightly better count accuracy than AIR, and is able to categorise objects with a mean category correspondence rate of 0.956. The best achieved correspondence rate is 0.967. DiscreteAIR does have increased reconstruction MSE compared to the AIR model. However, DiscreteAIR only uses a category latent variable of dimension 3, while the original AIR model uses 50 latent variables.
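Table 1: MSE, count accuracy and category correspondence rate on the MultiSprites dataset.

| Model | MSE | Count acc. | Category corr. |
|---|---|---|---|
| DiscreteAIR | 0.096 | 0.985 | 0.945 |
| AIR | 0.074 | 0.981 | N/A |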
We also plot count accuracy during training for both DiscreteAIR and AIR in figure 5. Both models converge towards similar accuracies, but the count accuracy of DiscreteAIR rises slightly faster and is more stable at the start of training.
DiscreteAIR can generate a scene with a given number of objects in a fully controlled way. We can specify the categories of objects with $z_{\text{cat}}$ and their spatial attributes with $z_{\text{where}}$. Figure 6 shows a sampled generation process. Note that while the training data contains up to 3 objects, DiscreteAIR can generate an arbitrary number of objects in the generative model. We generate 4 objects in the sequence "square, ellipse, square, triangle" with specified locations, orientations and sizes.
4.2 MultiMNIST
We also evaluated DiscreteAIR on the MultiMNIST dataset used by the original AIR model (Eslami et al., 2016). The dataset consists of 60000 images of size . Each image contains 0 to 2 digits sampled randomly from the MNIST dataset (LeCun et al., 1998) and placed at random spatial positions. The dataset is publicly available in the 'observations' Python package (https://github.com/edwardlib/observations). For this dataset, we choose a categorical variable with 10 categories as $z_{\text{cat}}$ and 1 continuous variable with a Normal distribution as $z_{\text{attr}}$, as this gives the best correspondence rate. We choose to combine two of the spatial transformation matrices into one because this gives slightly better results. Figure 7 shows sampled input data from the dataset and reconstructions by DiscreteAIR. Figure 7 c) also shows interpretable latent codes for each digit in the image. From this figure we can observe clearly that DiscreteAIR learns to match templates of a category with modifiable attributes to the input data. For example, in the second image of input data, the digit '8' is written in a drastically different style from most other '8's in the dataset. However, as we can see in the reconstruction, DiscreteAIR is able to match a template of digit '8' with modified attributes such as slantedness and stroke thickness.

We also performed the same quantitative analysis as for the MultiSprites dataset, as shown in table 2. For correspondence rate measurements, we only use 10% subsampled data because permuting 10 digits requires $10!$ steps of evaluating across the dataset, which is too slow for the whole dataset. While the count accuracies of the DiscreteAIR and AIR models are very close, DiscreteAIR is able to categorise the digits in the image with a mean correspondence rate of 0.871. The best achieved correspondence rate is 0.913. We also plotted the count accuracy during training, as shown in figure 8. We can see that the rate at which DiscreteAIR's count accuracy increases is very close to that of the AIR model.
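Table 2: MSE, count accuracy and category correspondence rate on the MultiMNIST dataset.

| Model | MSE | Count acc. | Category corr. |
|---|---|---|---|
| DiscreteAIR | 0.134 | 0.984 | 0.871 |
| AIR | 0.107 | 0.985 | N/A |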
As shown for the MultiSprites dataset, DiscreteAIR is able to generate images in a fully controlled way with given categories of digits $z_{\text{cat}}$, attribute variable $z_{\text{attr}}$ and spatial variable $z_{\text{where}}$. Figure 9 shows a sampled generated image. Two digits are generated in subsequent images with the attribute variable $z_{\text{attr}}$ increasing from top to bottom. In the first sequence we generate digits '5' and '2', while in the second sequence we generate digits '3' and '9'. We can observe that the learnt attribute variable $z_{\text{attr}}$ encodes attributes that cannot be encoded by the affine transformation spatial variable $z_{\text{where}}$. For example, increasing $z_{\text{attr}}$ increases the size of the hook space in digit '5', the hook space in digit '2', and the hook curve in digit '9'.
5 Related work
Generative models for unsupervised scene discovery have been an important topic in machine learning. The Variational AutoEncoder (VAE) (Kingma & Welling, 2013) is one important generative model that learns, in an unsupervised way, how to encode scenes into a set of latent codes that represent the scene at a high level of feature abstraction. Among these unsupervised models, the DRAW model (Gregor et al., 2015) combines Recurrent Neural Networks, VAE and an attention mechanism to allow the generative model to focus on one part of the image at a time, mimicking the foveation of the human eye. The AIR model (Eslami et al., 2016) extends this idea to allow the model to focus on one integral part of the scene at a time, such as a digit or an object.
One important topic in unsupervised learning is improving the interpretability of learnt representations. For VAE, there have been various approaches to disentangling the latent code distribution, such as beta-VAE models (Higgins et al., 2017; Chen et al., 2018; Kim & Mnih, 2018), which change the balance between reconstruction quality and latent capacity through weighting parameters. The most closely related work in terms of disentangling is by Dupont (2018), which also disentangles the latent code into discrete and continuous parts. However, that work can only disentangle images that contain a single centred object, while our model works for scenes containing multiple objects.
DiscreteAIR extends the AIR model to not only discover integral parts, but also assign interpretable, disentangled latent representations to these integral parts by encoding each part into different categories and different attributes. Several works (Eslami et al., 2016; Wu et al., 2017; Romaszko et al., 2017), including the AIR model itself, attempt to use a predefined graphics-like decoder to generate sequences of disentangled latent representations for multiple parts of the scene. DiscreteAIR, to the best of our knowledge, is the first end-to-end trainable VAE model without a predefined generative function to achieve this.
One other notable approach to disentangling scene representations is Neural Expectation Maximization (NEM) (Greff et al., 2017), which develops an end-to-end clustering method to cluster pixels in an image. However, NEM does not have an encoder that encodes attended objects into latent variables such as object location, size and orientation, but only assigns pixels to clusters by maximising a likelihood function.
6 Conclusion
In summary, we developed DiscreteAIR, an unsupervised autoencoder that learns to model a scene of multiple objects with interpretable latent codes. We have shown that DiscreteAIR can capture the category of each object in the scene and disentangle attribute variables from the categorical variable. DiscreteAIR can be applied to various problems where discrete representations are useful, such as visual reasoning, including solving Raven's Progressive Matrices (Barrett et al., 2018), and symbolic visual question answering (Yi et al., 2018). These two works approach visual reasoning problems with supervised learning methods, where for each object its category, spatial parameters and attributes are labelled. DiscreteAIR can be used as a symbolic encoder or for unsupervised pretraining of an encoding model, thereby reducing or even completely removing the requirement for labelled data.
References
 Barrett et al. (2018) Barrett, D., Hill, F., Santoro, A., Morcos, A., and Lillicrap, T. Measuring abstract reasoning in neural networks. In International Conference on Machine Learning, pp. 511–520, 2018.
 Carlisle et al. (2011) Carlisle, N. B., Arita, J. T., Pardo, D., and Woodman, G. F. Attentional templates in visual working memory. Journal of Neuroscience, 31(25):9315–9322, 2011.
 Chen et al. (2018) Chen, T. Q., Li, X., Grosse, R., and Duvenaud, D. Isolating sources of disentanglement in variational autoencoders. arXiv preprint arXiv:1802.04942, 2018.
 Chen et al. (2016) Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., and Abbeel, P. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2172–2180, 2016.
 Dupont (2018) Dupont, E. Learning disentangled joint continuous and discrete representations. In Advances in Neural Information Processing Systems, pp. 708–718, 2018.

 Eslami et al. (2016) Eslami, S. A., Heess, N., Weber, T., Tassa, Y., Szepesvari, D., Hinton, G. E., et al. Attend, infer, repeat: Fast scene understanding with generative models. In Advances in Neural Information Processing Systems, pp. 3225–3233, 2016.
 Ganin et al. (2018) Ganin, Y., Kulkarni, T., Babuschkin, I., Eslami, S., and Vinyals, O. Synthesizing programs for images using reinforced adversarial learning. arXiv preprint arXiv:1804.01118, 2018.
 Goodfellow et al. (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
 Greff et al. (2017) Greff, K., van Steenkiste, S., and Schmidhuber, J. Neural expectation maximization. In Advances in Neural Information Processing Systems, pp. 6691–6701, 2017.
 Gregor et al. (2015) Gregor, K., Danihelka, I., Graves, A., Rezende, D., and Wierstra, D. Draw: A recurrent neural network for image generation. In International Conference on Machine Learning, pp. 1462–1471, 2015.
 Higgins et al. (2017) Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2017.
 Hochreiter & Schmidhuber (1997) Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
 Jaderberg et al. (2015) Jaderberg, M., Simonyan, K., Zisserman, A., et al. Spatial transformer networks. In Advances in Neural Information Processing Systems, pp. 2017–2025, 2015.
 Jang et al. (2016) Jang, E., Gu, S., and Poole, B. Categorical reparameterization with Gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.
 Kim & Mnih (2018) Kim, H. and Mnih, A. Disentangling by factorising. arXiv preprint arXiv:1802.05983, 2018.
 Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Kingma & Welling (2013) Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
 LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 Maddison et al. (2016) Maddison, C. J., Mnih, A., and Teh, Y. W. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.
 Matthey et al. (2017) Matthey, L., Higgins, I., Hassabis, D., and Lerchner, A. dSprites: Disentanglement testing sprites dataset. https://github.com/deepmind/dsprites-dataset/, 2017.
 Mnih & Gregor (2014) Mnih, A. and Gregor, K. Neural variational inference and learning in belief networks. In International Conference on Machine Learning, pp. 1791–1799, 2014.
 Romaszko et al. (2017) Romaszko, L., Williams, C. K., Moreno, P., and Kohli, P. Vision-as-inverse-graphics: Obtaining a rich 3D explanation of a scene from a single image. In ICCV Workshops, pp. 940–948, 2017.
 Wu et al. (2017) Wu, J., Tenenbaum, J. B., and Kohli, P. Neural scene de-rendering. In Proc. CVPR, volume 2, 2017.
 Yi et al. (2018) Yi, K., Wu, J., Gan, C., Torralba, A., Kohli, P., and Tenenbaum, J. Neural-symbolic VQA: Disentangling reasoning from vision and language understanding. In Advances in Neural Information Processing Systems, pp. 1039–1050, 2018.