Unsupervised and interpretable scene discovery with Discrete-Attend-Infer-Repeat

by Duo Wang, et al.

In this work we present Discrete Attend Infer Repeat (Discrete-AIR), a Recurrent Auto-Encoder with structured latent distributions containing discrete categorical distributions, continuous attribute distributions, and factorised spatial attention. While inspired by the original AIR model and retaining the AIR model's ability to identify objects in an image, Discrete-AIR provides direct interpretability of the latent codes. We show that for Multi-MNIST and a multiple-objects version of the dSprites dataset, the Discrete-AIR model needs just one categorical latent variable and one attribute variable (for Multi-MNIST only), together with spatial attention variables, for efficient inference. We perform analysis to show that the learnt categorical distributions effectively capture the categories of objects in the scene for Multi-MNIST and for Multi-Sprites.





1 Introduction

Many real-world tasks, such as inferring physical relationships between objects in an image and visual-spatial reasoning, require identifying and learning a useful representation of elements in the scene. Such objects can be conveniently encoded into a representation containing their category, attributes, spatial position and orientation. For example, an object can be of category vehicle, with attributes such as red colour and 4 doors, and positioned at the bottom right of the scene in a specific orientation. Humans, when recognising objects or trying to draw them, are believed to have attentional templates (Carlisle et al., 2011) of different categories of objects in mind that are augmented by different attributes and selected spatially via attention.

Machine approaches to such problems often use generative models such as the Variational Auto-Encoder (VAE) (Kingma & Welling, 2013), which uses an inference model to infer latent codes corresponding to the representation and a generative model to reconstruct data given the representation. Recurrent versions of the VAE, such as the Attend-Infer-Repeat (AIR) model by Eslami et al. (Eslami et al., 2016), have been developed to decompose a scene into multiple objects, each represented by a latent code $z^i$. While this latent code disentangles spatial information $z_{\mathrm{where}}$ and object presence $z_{\mathrm{pres}}$, for most tasks the object representation $z_{\mathrm{what}}$ is an entangled real-valued vector and thus difficult to interpret. While AIR does propose the possibility of using a discrete latent code as $z_{\mathrm{what}}$, it only experimented with the discrete code using a specifically designed graphics engine as decoder. Here, we propose Discrete-AIR, an end-to-end trainable autoencoder which structures the latent representation $z_{\mathrm{what}}$ into $z_{\mathrm{cat}}$, representing the category of objects, and $z_{\mathrm{attr}}$, representing attributes of objects. Figure 1 illustrates how a scene of different shapes can be separately identified into different categories with varying attributes. This decomposition is similar to InfoGAN by Chen et al. (Chen et al., 2016), which also decomposes the representation into style and shape using a modified Generative Adversarial Network (Goodfellow et al., 2014). However, there are two main differences. Firstly, while InfoGAN uses a mutual information objective in addition to the GAN objective to encourage disentangled coding, Discrete-AIR only uses the Variational Lower Bound (ELBO) objective and encourages disentanglement through the inductive-bias structure of the latent code. Secondly, while InfoGAN is only applied to images containing single objects, Discrete-AIR is developed for scenes with multiple objects.

Figure 1: Illustration of encoding scenes into category and attribute latent codes. From a scene containing three different shapes, Discrete-AIR separately identifies each of the shapes (with differently coloured bounding boxes). It also estimates the spatial x-axis and y-axis locations and the orientation of each shape.

Related to this work are other approaches which decompose scenes into different categories. Neural Expectation Maximization (NEM) by Greff et al. (Greff et al., 2017) implements the Expectation-Maximization algorithm with an end-to-end trainable neural network. NEM is able to perceptually group pixels of an image into different clusters. However, it does not learn a generative model that allows controllable generation, as is possible using $z_{\mathrm{cat}}$ and $z_{\mathrm{attr}}$ in Discrete-AIR. Ganin et al. (Ganin et al., 2018) train a neural network to synthesise programs that can be fed into a graphics engine to generate scenes. While this approach learns an inference model for the generative model, the graphics engine that provides learning gradients is pre-defined and not learnt. In contrast, Discrete-AIR jointly learns an inference model and a generative model from scratch.

We show that our Discrete-AIR model can decompose scenes into a set of interpretable latent codes for two multi-object datasets, namely Multi-MNIST dataset as used in the original AIR model (Eslami et al., 2016) and a multi-object dataset in similar style as the dSprites dataset (Matthey et al., 2017). We show that unsupervised training of Discrete-AIR model is able to effectively capture the categories of objects in the scene for Multi-MNIST and for Multi-Sprites datasets.

2 Attend Infer Repeat

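The Attend-Infer-Repeat (AIR) model, introduced by Eslami et al. (Eslami et al., 2016), is a recurrent version of the Variational Auto-Encoder (VAE) (Kingma & Welling, 2013) that decomposes a scene into multiple objects represented by a latent code $z^i = \{z^i_{\mathrm{pres}}, z^i_{\mathrm{where}}, z^i_{\mathrm{what}}\}$ at each recurrent time step $i$. Among them, $z^i_{\mathrm{pres}}$ is a binary discrete variable encoding whether an object is inferred in the current step $i$. If $z^i_{\mathrm{pres}}$ is 0, inference stops. The sequence of $z^i_{\mathrm{pres}}$ for all $i$ can be concatenated into a vector of $n$ ones and a final zero; $n$ is therefore a variable representing the number of objects in the scene. $z^i_{\mathrm{where}}$ is a spatial attention parameter used to locate a target object in the image, and $z^i_{\mathrm{what}}$ is the latent code of the target object. In AIR an amortised variational approximation $q_\phi(z \mid x)$, as computed in equation 1, is used to approximate the true posterior $p(z \mid x)$ by minimising the KL divergence $KL(q_\phi(z \mid x) \,\|\, p(z \mid x))$. In the AIR implementation, $q_\phi(z_{\mathrm{where}})$ and $q_\phi(z_{\mathrm{what}})$ are parametrised as Gaussian distributions with diagonal covariance.

$$q_\phi(z \mid x) = q_\phi(z^{n+1}_{\mathrm{pres}} = 0 \mid z^{1:n}, x) \prod_{i=1}^{n} q_\phi(z^i_{\mathrm{where}}, z^i_{\mathrm{what}}, z^i_{\mathrm{pres}} = 1 \mid z^{1:i-1}, x) \tag{1}$$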



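In the generative model of AIR, the number of objects $n$ can be sampled from a prior such as a geometric prior, and then forms the sequence of $z^i_{\mathrm{pres}}$. Next, $z^i_{\mathrm{where}}$ and $z^i_{\mathrm{what}}$ are sampled from their prior distributions. An object $o^i$ is generated by processing $z^i_{\mathrm{what}}$ through a decoder. $o^i$ is then written to the canvas, gated by $z^i_{\mathrm{pres}}$ and with scaling and translation specified by $z^i_{\mathrm{where}}$, using a Spatial Transformer (Jaderberg et al., 2015), a powerful spatial attention module. The generative model can be summarised in equation 2, where $f_{\mathrm{dec}}$ is the decoder, $ST$ is the spatial transformer and $\odot$ is the element-wise product.

$$\hat{x} = \sum_{i=1}^{n} ST\big(f_{\mathrm{dec}}(z^i_{\mathrm{what}}),\, z^i_{\mathrm{where}}\big) \odot z^i_{\mathrm{pres}} \tag{2}$$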


The inference and generative models of AIR are jointly optimised by maximising the evidence lower bound $\mathcal{L}(\theta, \phi)$. The sampling operation for $z$ is not differentiable, whereas gradient-based training requires differentiability, but there are various ways to circumvent this. For the continuous latent codes, the re-parametrisation trick for VAEs (Kingma & Welling, 2013) is applied: parameters estimated by the inference model deterministically transform a sample drawn from a fixed noise distribution, thereby allowing back-propagation through the deterministic function. For discrete latent codes, AIR uses the NVIL likelihood-ratio estimator introduced by Mnih & Gregor (Mnih & Gregor, 2014) to produce an unbiased estimate of the gradient for discrete latent variables.

3 Discrete-AIR

While the AIR model can encode objects in a scene into latent codes $z_{\mathrm{what}}$, the representation is still entangled and therefore not interpretable. In Discrete-AIR, we introduce structure into the latent distribution to encourage disentanglement. We break $z_{\mathrm{what}}$ into $z_{\mathrm{cat}}$ and $z_{\mathrm{attr}}$. $z_{\mathrm{cat}}$ is a discrete latent variable that captures the category of the object, while $z_{\mathrm{attr}}$ is a combination of continuous and discrete latent variables that captures attributes of the object. We do not use any objective function to encourage $z_{\mathrm{cat}}$ to capture category and $z_{\mathrm{attr}}$ to capture attributes. Rather, we allow the model to automatically learn the best way of using these discrete and continuous latent variables through the process of likelihood maximisation.

3.1 Sampling discrete variable

In Discrete-AIR, we treat binary discrete variables as scalars with 0/1 values and multi-class categorical discrete variables as one-hot vectors. As sampling from a discrete distribution is non-differentiable, we model discrete latent variables with Gumbel Softmax (Maddison et al., 2016; Jang et al., 2016), a continuous approximation to the discrete distribution from which we can sample an approximately one-hot discrete vector $y = (y_1, \dots, y_k)$ where:

$$y_i = \frac{\exp\big((\log \alpha_i + g_i)/\tau\big)}{\sum_{j=1}^{k} \exp\big((\log \alpha_j + g_j)/\tau\big)}$$

Here $\alpha_i$ are a parametrisation of the distribution, $g_i$ are Gumbel noise sampled from the Gumbel distribution $\mathrm{Gumbel}(0, 1)$, and $\tau$ is a temperature parameter controlling the smoothness of the distribution. As $\tau \to 0$, the distribution converges to a discrete distribution. For binary discrete variables such as $z_{\mathrm{pres}}$, we use Gumbel Sigmoid, which is essentially Gumbel softmax with the softmax function replaced by the sigmoid function:

$$y = \frac{1}{1 + \exp\big(-(\log \alpha + g)/\tau\big)}$$

In contrast to the NVIL estimator (Mnih & Gregor, 2014) used in the original AIR model, we found that Gumbel softmax/Sigmoid is more stable during training, experiencing no model collapse during all the training experiments.
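The sampling procedure of this section can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names, the `log_alpha` argument (unnormalised log-probabilities) and the small `eps` constant for numerical safety are assumptions made for the example.

```python
import numpy as np

def sample_gumbel(shape, rng, eps=1e-20):
    # Gumbel(0, 1) noise via the inverse-CDF transform of uniform samples.
    u = rng.uniform(0.0, 1.0, size=shape)
    return -np.log(-np.log(u + eps) + eps)

def gumbel_softmax(log_alpha, tau, rng):
    # Relaxed one-hot sample: softmax of (log_alpha + Gumbel noise) / tau.
    g = sample_gumbel(log_alpha.shape, rng)
    logits = (log_alpha + g) / tau
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=-1, keepdims=True)

def gumbel_sigmoid(log_alpha, tau, rng):
    # Binary analogue: the softmax is replaced by a sigmoid.
    g = sample_gumbel(np.shape(log_alpha), rng)
    return 1.0 / (1.0 + np.exp(-(log_alpha + g) / tau))

rng = np.random.default_rng(0)
y = gumbel_softmax(np.log(np.array([0.2, 0.3, 0.5])), tau=0.5, rng=rng)
b = gumbel_sigmoid(np.array(0.0), tau=0.5, rng=rng)
```

Lowering `tau` pushes `y` towards a one-hot vector and `b` towards 0 or 1, at the cost of higher-variance gradients.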

3.2 Generative model

The probabilistic generative model is shown in figure 2. From $z_{\mathrm{cat}}$, a template of this object category is generated. This template is then modified by attributes $z_{\mathrm{attr}}$ into an image of the object $o^i$ that is subsequently drawn onto the canvas using spatial write attention. $z_{\mathrm{cat}}$, $z_{\mathrm{attr}}$, $z_{\mathrm{where}}$ and $z_{\mathrm{pres}}$, jointly as $z^i$, are estimated from the inference model for each time step $i$ of inference.

Figure 2: Generative Model of Discrete-AIR.

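We replace the decoder function $f_{\mathrm{dec}}(z_{\mathrm{what}})$ from the original AIR model with a new function $f_{\mathrm{dec}}(z_{\mathrm{cat}}, z_{\mathrm{attr}})$ parametrised by the category variable $z_{\mathrm{cat}}$ and the attribute variable $z_{\mathrm{attr}}$. There are various candidate functions for combining $z_{\mathrm{cat}}$ and $z_{\mathrm{attr}}$. We have experimented with three different variations, which are:

  • Additive: $f_{\mathrm{dec}} = f_{\mathrm{cat}}(z_{\mathrm{cat}}) + f_{\mathrm{attr}}(z_{\mathrm{attr}})$, where $f_{\mathrm{cat}}$ generates a template, while $f_{\mathrm{attr}}$ generates an additive modification of the template.

  • Multiplicative: $f_{\mathrm{dec}} = f_{\mathrm{cat}}(z_{\mathrm{cat}}) \odot f_{\mathrm{attr}}(z_{\mathrm{attr}})$, where $f_{\mathrm{cat}}$ generates a template, while $f_{\mathrm{attr}}$ generates a multiplicative modification of the template.

  • Convolutional: $f_{\mathrm{dec}} = f_{\mathrm{cat}}(z_{\mathrm{cat}}) * f_{\mathrm{attr}}(z_{\mathrm{attr}})$, where $f_{\mathrm{cat}}$ generates a template, while $f_{\mathrm{attr}}$ generates a set of convolution kernels that can be convolved with the template to modify it.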

In our experiments we found that the choice of combining functions has only a small effect on the model performance. We found that the additive combining function performs slightly better, and thus use this function in all of the experiments presented.
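The three combining functions can be illustrated on plain arrays. In this sketch `template` and `modifier` stand in for the outputs of the category and attribute decoder networks; the names `combine` and `conv2d_same`, and the use of a single kernel with zero padding, are assumptions made for the example and are not taken from the paper.

```python
import numpy as np

def conv2d_same(image, kernel):
    # Naive 2-D cross-correlation with zero padding; output has the input shape.
    # Assumes odd kernel height and width.
    kh, kw = kernel.shape
    padded = np.pad(image, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros(image.shape, dtype=float)
    for r in range(image.shape[0]):
        for c in range(image.shape[1]):
            out[r, c] = np.sum(padded[r:r + kh, c:c + kw] * kernel)
    return out

def combine(template, modifier, mode):
    # template: decoded category template; modifier: decoded attribute output.
    if mode == "additive":
        return template + modifier
    if mode == "multiplicative":
        return template * modifier
    if mode == "convolutional":
        return conv2d_same(template, modifier)
    raise ValueError(f"unknown mode: {mode}")

template = np.arange(16, dtype=float).reshape(4, 4)
identity_kernel = np.zeros((3, 3))
identity_kernel[1, 1] = 1.0  # leaves the template unchanged

added = combine(template, np.ones((4, 4)), "additive")
scaled = combine(template, np.full((4, 4), 2.0), "multiplicative")
filtered = combine(template, identity_kernel, "convolutional")
```

In the additive and multiplicative cases the modifier has the same shape as the template, whereas in the convolutional case it is a small kernel.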

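In the original AIR model, the spatial transformation operation specified by the attention variable $z_{\mathrm{where}}$ only contains translation and scaling. Affine transformations such as rotation and shearing are accounted for in the latent variable $z_{\mathrm{what}}$ in an entangled way. In Discrete-AIR, we explicitly introduce additional spatial transformer networks that account for rotation and skewing, thereby allowing $z_{\mathrm{attr}}$ to have a reduced number of variables. The spatial attention for the generative decoder is thus factorised as in equation 6,

$$T_{\mathrm{dec}} = T_{ts}\, T_{r}\, T_{sk}, \quad T_{ts} = \begin{bmatrix} s_x & 0 & t_x \\ 0 & s_y & t_y \\ 0 & 0 & 1 \end{bmatrix}, \quad T_{r} = \begin{bmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix}, \quad T_{sk} = \begin{bmatrix} 1 & k_x & 0 \\ k_y & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \tag{6}$$

where $T_{ts}$ is the combined transformation matrix of translation and scaling used in the original AIR model, $T_{r}$ is the transformation matrix for rotation and $T_{sk}$ is the transformation matrix for skewing. In these matrices, $s_x$ and $s_y$ are for scaling, $t_x$ and $t_y$ are horizontal and vertical translations, $\theta$ is an angle of rotation, and $k_x$ and $k_y$ are parameters for shearing along the horizontal and vertical axes.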
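As a concrete illustration of the factorised spatial attention, the sketch below builds the three matrices in homogeneous coordinates and composes them. The composition order (translation-and-scaling, then rotation, then shear, read left to right) and all function names are assumptions made for this example; the paper's own ordering may differ.

```python
import numpy as np

def translate_scale(sx, sy, tx, ty):
    # Scaling (sx, sy) combined with translation (tx, ty), as in the original AIR.
    return np.array([[sx, 0.0, tx], [0.0, sy, ty], [0.0, 0.0, 1.0]])

def rotation(theta):
    # Counter-clockwise rotation by angle theta (radians).
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def shear(kx, ky):
    # Horizontal shear kx and vertical shear ky.
    return np.array([[1.0, kx, 0.0], [ky, 1.0, 0.0], [0.0, 0.0, 1.0]])

def decode_transform(sx, sy, tx, ty, theta, kx, ky):
    # Assumed composition order; each factor has its own interpretable parameters.
    return translate_scale(sx, sy, tx, ty) @ rotation(theta) @ shear(kx, ky)

point = np.array([1.0, 1.0, 1.0])  # (x, y, 1) in homogeneous coordinates
moved = decode_transform(2.0, 3.0, 1.0, -1.0, 0.0, 0.0, 0.0) @ point
turned = decode_transform(1.0, 1.0, 0.0, 0.0, np.pi / 2, 0.0, 0.0) @ np.array([1.0, 0.0, 1.0])
```

Because each factor is controlled by separate parameters, a single quantity such as the rotation angle can be read off or set directly instead of being absorbed into an entangled latent vector.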

3.3 Inference

Figure 3: Overview of Discrete-AIR architecture. Blue parts are neural-network trainable modules and yellow parts are sampling processes.

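Figure 3 shows an overview of the Discrete-AIR architecture. At inference step $i$, a difference image between the input image $x$ and the previous canvas $\hat{x}^{i-1}$ is fed together with the previous latent code $z^{i-1}$ into a Recurrent Neural Network (RNN), implemented as a Long Short-Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997), to generate parameters of the distribution for $z^i_{\mathrm{pres}}$ and $z^i_{\mathrm{where}}$. The Spatial Attention module then attends to parts of the image and applies a transformation according to $z^i_{\mathrm{where}}$. We enforce the encoding transformation $T_{\mathrm{enc}}$ to be the inverse of the decoding transformation $T_{\mathrm{dec}}$, which means $T_{\mathrm{enc}} = T_{\mathrm{dec}}^{-1}$. This constraint forces the model to match attended objects in the scene with the invariant template specified by $z_{\mathrm{cat}}$. In practice, we compute $T_{\mathrm{enc}}$ as the product of the inverses of the transformation matrices composing $T_{\mathrm{dec}}$:

$$T_{\mathrm{enc}} = T_{sk}^{-1}\, T_{r}^{-1}\, T_{ts}^{-1}$$

where $T_{ts}$, $T_{r}$ and $T_{sk}$ are the translation-and-scaling, rotation and skewing matrices respectively.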
The transformed image is then processed by an encoder to estimate parameters of the distributions for $z_{\mathrm{cat}}$ and $z_{\mathrm{attr}}$. $z_{\mathrm{cat}}$ is sampled from Gumbel Softmax as discussed in section 3.1. $z_{\mathrm{attr}}$ can be sampled from any distribution that is suitable for the task at hand. For the tasks presented, continuous variables such as the colour intensity or part deformation of an object are sampled from a multivariate Gaussian distribution using the re-parameterization trick (Kingma & Welling, 2013), which allows gradients to pass through the otherwise non-differentiable sampling function. The generative model described in section 3.2 then samples $z_{\mathrm{cat}}$ and $z_{\mathrm{attr}}$ from these distributions in order to generate an object $o^i$ that will be written to the canvas using the spatial attention module.
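The inverse-transform constraint used in this section can be checked numerically. The sketch below assumes a decoding transform composed as translation-and-scaling, rotation, then shear, and builds the encoding transform from the closed-form inverse of each factor in reverse order. The function names and the composition order are illustrative assumptions.

```python
import numpy as np

def translate_scale(sx, sy, tx, ty):
    return np.array([[sx, 0.0, tx], [0.0, sy, ty], [0.0, 0.0, 1.0]])

def rotation(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def shear(kx, ky):
    return np.array([[1.0, kx, 0.0], [ky, 1.0, 0.0], [0.0, 0.0, 1.0]])

def encode_transform(sx, sy, tx, ty, theta, kx, ky):
    # Closed-form inverse of each factor, multiplied in reverse order.
    inv_ts = np.array([[1.0 / sx, 0.0, -tx / sx],
                       [0.0, 1.0 / sy, -ty / sy],
                       [0.0, 0.0, 1.0]])
    inv_r = rotation(-theta)  # a rotation is undone by rotating back
    det = 1.0 - kx * ky       # must be non-zero for the shear to be invertible
    inv_sk = np.array([[1.0 / det, -kx / det, 0.0],
                       [-ky / det, 1.0 / det, 0.0],
                       [0.0, 0.0, 1.0]])
    return inv_sk @ inv_r @ inv_ts

params = (1.5, 0.8, 0.2, -0.1, 0.3, 0.1, -0.2)
sx, sy, tx, ty, theta, kx, ky = params
t_dec = translate_scale(sx, sy, tx, ty) @ rotation(theta) @ shear(kx, ky)
t_enc = encode_transform(*params)
```

Inverting each small factor analytically avoids a general matrix inversion and keeps the encoder tied to the same interpretable parameters as the decoder.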

3.4 Learning

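Similar to the original AIR model, we train the Discrete-AIR model end-to-end by maximising the lower bound on the marginal likelihood of the data:

$$\log p_\theta(x) \ge \mathcal{L}(\theta, \phi) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] \tag{8}$$

In the original AIR model, this expression cannot be rearranged further because of the non-differentiable discrete variable sampling process used. For Discrete-AIR, by using Gumbel-Softmax as a reparameterised sampling process, we can rearrange equation 8 as:

$$\mathcal{L}(\theta, \phi) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - KL\big(q_\phi(z \mid x) \,\|\, p(z)\big)$$

where $\log p_\theta(x \mid z)$ is the data likelihood and $KL(q_\phi(z \mid x) \,\|\, p(z))$ is the Kullback-Leibler (KL) divergence. This is the same objective as implemented in the original VAE (Kingma & Welling, 2013). Computing $\partial \mathcal{L} / \partial \theta$, the loss derivative with respect to the parameters of the generative model, is relatively straightforward as it is fully differentiable. With a sampled batch of latent codes $z$, the partial derivative can be directly computed.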

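When computing $\partial \mathcal{L} / \partial \phi$, we can use the re-parameterization trick (Kingma & Welling, 2013) to re-parametrise the sampling of both continuous and discrete latent variables as a deterministic function of the form $z^i = h(\omega^i, \epsilon^i)$. Here $\omega^i$ are the parameters of the distributions for $z^i$ at time step $i$, and $\epsilon^i$ is random noise at time step $i$. In this way we can use the chain rule to compute the gradient with respect to $\phi$:

$$\frac{\partial \mathcal{L}}{\partial \phi} = \frac{\partial \mathcal{L}}{\partial z^i}\, \frac{\partial z^i}{\partial \omega^i}\, \frac{\partial \omega^i}{\partial \phi}$$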



For our experiments, we parametrise continuous variables as multivariate Gaussian distributions with a diagonal covariance matrix. Thus $h(\omega^i, \epsilon^i) = \mu^i + \sigma^i \odot \epsilon^i$ with $\epsilon^i \sim \mathcal{N}(0, I)$. For discrete variables, we use Gumbel softmax introduced in section 3.1, which is itself a re-parametrised differentiable sampling function.

For the KL-divergence term, assuming all latent variables are conditionally independent, we can factorise $q_\phi(z \mid x)$ as $\prod_j q_\phi(z_j \mid x)$ and thereby separate the KL terms, as discussed in (Dupont, 2018). We use a Gaussian prior for all continuous variables. While the KL divergence between two Gumbel-Softmax distributions is not available in closed form, we approximate it with a Monte-Carlo estimate of the KL divergence with a categorical prior for $z_{\mathrm{cat}}$, similar to (Jang et al., 2016). For $z_{\mathrm{pres}}$ we use a geometric prior and compute a Monte-Carlo estimate of the KL divergence (Maddison et al., 2016).
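The pieces of this objective can be sketched separately. The example below shows a reparameterised Gaussian sample, the closed-form KL divergence of a diagonal Gaussian from a standard normal prior, and a Monte-Carlo KL estimate for a categorical variable. Two simplifications are assumed for illustration: the Gaussian prior is taken to be $\mathcal{N}(0, I)$, and the Monte-Carlo estimate draws exact discrete samples rather than relaxed Gumbel-Softmax samples. All function names are hypothetical.

```python
import numpy as np

def reparameterise(mu, log_var, rng):
    # z = mu + sigma * eps, with eps ~ N(0, I); deterministic in (mu, log_var).
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_gaussian_standard_normal(mu, log_var):
    # Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over dimensions.
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)

def mc_kl_categorical(probs_q, probs_p, rng, n_samples=1000):
    # Monte-Carlo estimate of KL(q || p): average of log q(k) - log p(k), k ~ q.
    samples = rng.choice(len(probs_q), size=n_samples, p=probs_q)
    return float(np.mean(np.log(probs_q[samples]) - np.log(probs_p[samples])))

rng = np.random.default_rng(0)
z = reparameterise(np.zeros(2), np.zeros(2), rng)
kl_cont = kl_gaussian_standard_normal(np.array([1.0, 0.0]), np.zeros(2))

q = np.array([0.7, 0.2, 0.1])
uniform_prior = np.full(3, 1.0 / 3.0)
kl_cat_estimate = mc_kl_categorical(q, uniform_prior, rng, n_samples=20000)
kl_cat_exact = float(np.sum(q * (np.log(q) - np.log(uniform_prior))))
```

Because the latent variables are assumed conditionally independent, the total KL term is the sum of such per-variable terms.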

4 Evaluation

We evaluate Discrete-AIR on two multi-object datasets, namely the Multi-MNIST dataset as used in the original AIR model (Eslami et al., 2016) and a multi-object shape dataset comprising simple shapes similar to the dSprites dataset (Matthey et al., 2017). We perform experiments to show that Discrete-AIR, while retaining the original strength of the AIR model in discovering the number of objects in a scene, can additionally categorise each discovered object. In order to evaluate how accurately Discrete-AIR can categorise each object, we compute the correspondence rate between the best permutation of category assignments from the Discrete-AIR model and the true labels of the dataset.

To explain the metric we used, we first define some notation. For each input image $x_j$, Discrete-AIR generates corresponding category latent codes $z_{\mathrm{cat}}$ and presence variables $z_{\mathrm{pres}}$. From these we can form a set of predicted object categories $C_j = \{c_1, \dots, c_n\}$ for the $n$ predicted objects, where $c_i$ is the category of the $i$-th object. For each image we also have a set of true labels of existing objects $Y_j$. Due to the non-identifiability problem of unsupervised learning, where any permutation of the best cluster assignment gives the same optimal result, the category assignments produced by Discrete-AIR do not necessarily correspond to the labels. For example, an image patch of digit 1 could fall into category 4. We thus permute the category assignments and use the permutation that corresponds best with the true labels as the category assignment. Given the predicted category set and the true label set of an image, we apply to the predicted set the permutation of categories that achieves the best correspondence. To put it more formally, we define a function $f_\pi(C)$, where $C$ is a set or an array of sets and $\pi$ is an index permutation function that maps the elements in $C$. For the whole dataset, we have an array of predicted category sets $\mathbf{C}$ and an array of true label sets $\mathbf{Y}$. We define the correspondence rate as $R(\mathbf{C}, \mathbf{Y}) = N_{\mathrm{correct}}(\mathbf{C}, \mathbf{Y}) / N_{\mathrm{total}}(\mathbf{Y})$, where $N_{\mathrm{correct}}(\mathbf{C}, \mathbf{Y})$ gives the number of true labels in $\mathbf{Y}$ that are correctly identified in $\mathbf{C}$ and $N_{\mathrm{total}}(\mathbf{Y})$ gives the total number of labels. We thus compute the best correspondence rate as:

$$R^{*} = \max_{\pi \in \Pi} R\big(f_\pi(\mathbf{C}), \mathbf{Y}\big)$$

where $\Pi$ is the set of all possible permutations of predicted categories. This score ranges from 0 to 1, and a random category assignment has an expected score of $1/K$, where $K$ is the number of categories.
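The metric can be sketched as a brute-force search over permutations. The helper names and the toy data below are hypothetical, and one detail is an assumption: since each image yields an unordered collection of categories, the number of correctly identified labels per image is computed here as the size of the multiset intersection between relabelled predictions and true labels.

```python
from collections import Counter
from itertools import permutations

def count_correct(pred, true):
    # Size of the multiset intersection between predictions and true labels.
    return sum((Counter(pred) & Counter(true)).values())

def correspondence_rate(pred_sets, true_sets, perm):
    # perm[c] is the label assigned to predicted category c.
    correct = sum(count_correct([perm[c] for c in pred], true)
                  for pred, true in zip(pred_sets, true_sets))
    total = sum(len(true) for true in true_sets)
    return correct / total if total else 0.0

def best_correspondence_rate(pred_sets, true_sets, n_categories):
    # Exhaustive search over all n_categories! relabellings.
    best_rate, best_perm = -1.0, None
    for perm in permutations(range(n_categories)):
        rate = correspondence_rate(pred_sets, true_sets, perm)
        if rate > best_rate:
            best_rate, best_perm = rate, perm
    return best_rate, best_perm

# Toy data: three images, category indices are a relabelling of the true labels.
predicted = [[0, 1], [2], [1, 1]]
labels = [[2, 0], [1], [0, 0]]
rate, perm = best_correspondence_rate(predicted, labels, n_categories=3)
```

The search visits every permutation, so its cost grows factorially with the number of categories. This is cheap for 3 categories but expensive for 10.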

We train Discrete-AIR with the ELBO objective as presented in equation 8. We use the Adam optimiser (Kingma & Ba, 2014) to optimise the model with a batch size of 64 and a learning rate of 0.0001. For Gumbel Softmax, we also apply annealing of the temperature $\tau$ (Jang et al., 2016), starting with a smoother distribution and gradually approaching a discrete distribution. For more details about training, please see Appendix A in the supplementary material.
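A temperature schedule of this kind might look as follows. The exact schedule used for Discrete-AIR is not stated in this text, so the sketch assumes an exponential decay with a lower floor, in the spirit of Jang et al. (2016). The function name and the constants `rate`, `tau_start` and `tau_min` are illustrative choices, not values from the paper.

```python
import math

def annealed_temperature(step, rate=1e-4, tau_start=1.0, tau_min=0.5):
    # Exponential decay from tau_start, clipped from below at tau_min.
    # All constants are hypothetical and would need tuning in practice.
    return max(tau_min, tau_start * math.exp(-rate * step))

schedule = [annealed_temperature(s) for s in (0, 1000, 5000, 10000, 100000)]
```

Early training steps use a high temperature, which gives smooth samples and low-variance gradients. Later steps use a lower temperature so that samples are closer to one-hot.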

4.1 Multi-Sprites

To evaluate Discrete-AIR, we have built a multi-object dataset in similar style as the dSprites dataset (Matthey et al., 2017). This dataset consists of 90000 images of pixel size

. In each image there are 0 to 3 objects with shapes in the categories of square, triangle and ellipse. The objects' spatial locations, orientations and sizes are all sampled randomly from uniform distributions. Details about constructing this dataset can be found in Appendix B. Figure 4 illustrates the application of Discrete-AIR on the Multi-Sprites dataset. Figure 4 a) shows samples of input data from the dataset with each object detected and categorised (with differently coloured bounding boxes). The number at the top-left corner shows the estimated number of objects in the scene. Figure 4 b) shows images reconstructed by the Discrete-AIR model. Figure 4 c) shows the fully interpretable latent code of each object in the scene. For this dataset, we used a discrete variable of 3 categories as $z_{\mathrm{cat}}$ together with spatial attention variables $z_{\mathrm{where}}$. We did not include $z_{\mathrm{attr}}$ for this dataset as the attributes of each object, including location, orientation and size, can all be controlled by $z_{\mathrm{where}}$. We did not include a shear transformation in the spatial attention as the dataset generation process does not have a shear transformation. For more details about the architecture, please see Appendix A.

(a) Data
(b) Reconstruction
(c) Latent codes
Figure 4: Input data from Multi-Sprites dataset and reconstruction from Discrete-AIR model. The coloured bounding boxes show each detected object. The number at the top-left corner shows the count of number of objects in the image. Latent codes representing the scene, including object categories, sizes, spatial locations and orientation are also presented.

For quantitative evaluation of Discrete-AIR we use three metrics, namely reconstruction error in the form of Mean Squared Error (MSE), count accuracy for the number of objects in the scene, and categorical correspondence rate. We also compare Discrete-AIR with AIR on the first two metrics. Table 1 shows the performance on these three metrics. We report mean performance across 10 independent runs. Discrete-AIR has slightly better count accuracy than AIR, and is able to categorise objects with a mean category correspondence rate of 0.956. The best achieved correspondence rate is 0.967. Discrete-AIR does have increased reconstruction MSE compared to the AIR model. However, Discrete-AIR only uses a category latent variable of dimension 3, while the original AIR model uses 50 latent variables.

Model          MSE     Count acc.   Category corr.
Discrete-AIR   0.096   0.985        0.945
AIR            0.074   0.981        N/A
Table 1: Quantitative evaluation of Discrete-AIR and comparison with AIR model for Multi-Sprites dataset.

We also plot count accuracy during training for both Discrete-AIR and AIR in figure 5. One can observe that both models converge towards similar accuracies, but Discrete-AIR model has slightly better increase rate and stability at the start of training.

Figure 5: Plot of count accuracy for AIR and Discrete-AIR during training for Multi-Sprites dataset.

Discrete-AIR can generate a scene with a given number of objects in a fully controlled way. We can specify the categories of objects with $z_{\mathrm{cat}}$ and their spatial attributes with $z_{\mathrm{where}}$. Figure 6 shows a sampled generation process. Note that while the training data contains up to 3 objects, Discrete-AIR can generate an arbitrary number of objects in the generative model. We generate 4 objects in the sequence "square, ellipse, square, triangle" with specified locations, orientations and sizes.

Figure 6: Generation of images by Discrete-AIR.

4.2 Multi-MNIST

We also evaluated Discrete-AIR on the Multi-MNIST dataset used by the original AIR model (Eslami et al., 2016). The dataset consists of 60000 images of size . Each image contains 0 to 2 digits sampled randomly from the MNIST dataset (LeCun et al., 1998) and placed at random spatial positions. The dataset is publicly available in the 'observations' Python package (https://github.com/edwardlib/observations). For this dataset, we choose a categorical variable with 10 categories as $z_{\mathrm{cat}}$ and 1 continuous variable with a Normal distribution as $z_{\mathrm{attr}}$, as this gives the best correspondence rate. We choose to combine two of the transformation matrices into one because this gives slightly better results. Figure 7 shows sampled input data from the dataset and the reconstruction by Discrete-AIR. Figure 7 c) also shows interpretable latent codes for each digit in the image. From this figure we can observe clearly that Discrete-AIR learns to match templates of a category with modifiable attributes to the input data. For example, in the second image of the input data, the digit '8' is written in a drastically different style from most other '8's in the dataset. However, as we can see in the reconstruction, Discrete-AIR is able to match a template of digit '8' with modified attributes such as slantedness and stroke thickness.

(a) Data
(b) Reconstruction
(c) Latent codes
Figure 7: Input data from Multi-MNIST dataset and reconstruction from Discrete-AIR model. The coloured bounding boxes show each detected object. The number at the top-left corner shows the count of number of objects in the image. Latent codes representing the scene, including digit categories, attribute variable value, sizes and spatial locations are also presented.

We also performed the same quantitative analysis as for the Multi-Sprites dataset, as shown in table 2. For correspondence rate measurements, we only use a 10% subsample of the data because permuting 10 digits requires $10! = 3628800$ steps of evaluating across the dataset, which is too slow for the whole dataset. While the count accuracies of Discrete-AIR and the AIR model are very close, Discrete-AIR is able to categorise the digits in the image with a mean correspondence rate of 0.871. The best achieved correspondence rate is 0.913. We also plotted the count accuracy over the course of training, as shown in figure 8. We can see that the rate at which Discrete-AIR's count accuracy increases is very close to that of the AIR model.

Model          MSE     Count acc.   Category corr.
Discrete-AIR   0.134   0.984        0.871
AIR            0.107   0.985        N/A
Table 2: Quantitative evaluation of Discrete-AIR and comparison with AIR model for Multi-MNIST dataset.
Figure 8: Plot of count accuracy for AIR and Discrete-AIR during training for Multi-MNIST dataset.

As shown for the Multi-Sprites dataset, Discrete-AIR is able to generate images in a fully controlled way with given digit categories $z_{\mathrm{cat}}$, attribute variable $z_{\mathrm{attr}}$ and spatial variable $z_{\mathrm{where}}$. Figure 9 shows sampled generated images. Two digits are generated in subsequent images with the attribute variable $z_{\mathrm{attr}}$ increasing from top to bottom. In the first sequence we generate digits '5' and '2' while in the second sequence we generate digits '3' and '9'. We can observe that the learnt attribute variable $z_{\mathrm{attr}}$ encodes attributes that cannot be encoded by the affine transformation spatial variable $z_{\mathrm{where}}$. For example, increasing $z_{\mathrm{attr}}$ increases the size of the hook space in digit '5', the hook space in digit '2', and the hook curve in digit '9'.

Figure 9: Generation of images by Discrete-AIR.

5 Related work

Generative models for unsupervised scene discovery have been an important topic in Machine Learning. The Variational Auto-Encoder (VAE) (Kingma & Welling, 2013) is one important generative model that learns, in an unsupervised way, how to encode scenes into a set of latent codes that represent the scene at a high level of feature abstraction. Among these unsupervised models, the DRAW model (Gregor et al., 2015) combines Recurrent Neural Networks, the VAE and an attention mechanism to allow the generative model to focus on one part of the image at a time, mimicking the foveation of the human eye. The AIR model (Eslami et al., 2016) extends this idea to allow the model to focus on one integral part of the scene at a time, such as a digit or an object.

One important topic in unsupervised learning is improving the interpretability of learnt representations. For the VAE, there have been various approaches to disentangling the latent code distribution, such as beta-VAE models (Higgins et al., 2017; Chen et al., 2018; Kim & Mnih, 2018), which change the balance between reconstruction quality and latent capacity with a weighting hyperparameter. The most related work in terms of disentangling is by Dupont (Dupont, 2018), which also disentangles the latent code into discrete and continuous parts. However, this work can only disentangle images that contain a single centred object, while our model works for scenes containing multiple objects.

Discrete-AIR extends the AIR model to not only discover integral parts, but also assign interpretable, disentangled latent representations to these integral parts by encoding each part into different categories and different attributes. Several works (Eslami et al., 2016; Wu et al., 2017; Romaszko et al., 2017), including the AIR model itself, attempt to use a pre-defined graphics-like decoder to generate sequences of disentangled latent representations for multiple parts of the scene. Discrete-AIR, to the best of our knowledge, is the first end-to-end trainable VAE model without a pre-defined generative function to achieve this.

One other notable approach to disentangling scene representations is Neural Expectation Maximization (N-EM) (Greff et al., 2017), which develops an end-to-end clustering method to cluster pixels in an image. However, N-EM does not have an encoder that encodes attended objects into latent variables such as object location, size and orientation, but only assigns pixels to clusters by maximising a likelihood function.

6 Conclusion

In summary, we developed Discrete-AIR, an unsupervised auto-encoder that learns to model a scene of multiple objects with interpretable latent codes. We have shown that Discrete-AIR can capture the category of each object in the scene and disentangle attribute variables from the categorical variable. Discrete-AIR can be applied to various problems where discrete representations are useful, such as visual reasoning, including solving Raven's Progressive Matrices (Barrett et al., 2018) and symbolic visual question answering (Yi et al., 2018). These two works approach visual reasoning problems with supervised learning methods where, for each object, its category, spatial parameters and attributes are labelled. Discrete-AIR can be used as a symbolic encoder or for unsupervised pre-training of the encoding model, thereby reducing or even completely removing the requirement for labelled data.