GMAIR: Unsupervised Object Detection Based on Spatial Attention and Gaussian Mixture

by Weijin Zhu, et al.
Shanghai Jiao Tong University

Recent studies on unsupervised object detection based on spatial attention have achieved promising results. Models, such as AIR and SPAIR, output "what" and "where" latent variables that represent the attributes and locations of objects in a scene, respectively. Most of the previous studies concentrate on the "where" localization performance; however, we claim that acquiring "what" object attributes is also essential for representation learning. This paper presents a framework, GMAIR, for unsupervised object detection. It incorporates spatial attention and a Gaussian mixture in a unified deep generative model. GMAIR can locate objects in a scene and simultaneously cluster them without supervision. Furthermore, we analyze the "what" latent variables and clustering process. Finally, we evaluate our model on MultiMNIST and Fruit2D datasets and show that GMAIR achieves competitive results on localization and clustering compared to state-of-the-art methods.




1 Introduction

The perception of human vision is naturally hierarchical. We can recognize objects in a scene at a glance and classify them according to their appearances, functions, and other attributes. It is expected that an intelligent agent can also decompose scenes into meaningful object abstractions, which is known as the object detection task in machine learning. In the last decade, there have been significant developments in supervised object detection. However, its unsupervised counterpart remains challenging.

Recently, there has been some progress in unsupervised object detection. Attend, infer, repeat (AIR, Eslami et al. (2016)), which is a variational autoencoder (VAE, Kingma and Welling (2013)) based method, achieved encouraging results. Spatially invariant AIR (SPAIR, Crawford and Pineau (2019)) replaced the recurrent network in AIR by a convolutional network that attained better scalability and lower computational cost. SPACE (Lin et al. (2020)), which combines spatial-attention and scene-mixture approaches, performed better in background prediction.

Despite the recent progress in unsupervised object detection, results of previous studies remain unsatisfactory. One of the reasons for this could be that previous studies on unsupervised object detection were mainly concentrated on object localization and lacked analysis and evaluation on the “what” latent variables, which represent the attributes of objects. These variables are essential for many tasks such as clustering, image generation, and style transfer. Another important concern is that they do not directly reason about the category of objects in the scene, which is beneficial to know in many cases, unlike most of the studies on corresponding supervised tasks.

This paper presents a framework for unsupervised object detection that can directly reason about the category and localization of objects in the scenes and provide an intuitive way to analyze the “what” latent variables by simply incorporating a Gaussian mixture prior assumption. In Sec. 2, we introduce the architecture of our framework, GMAIR. We introduce related works in Sec. 3. We analyze the “what” latent variables in Sec. 4.1. We describe our model for image generation in Sec. 4.2. Finally, we present quantitative evaluation results of both clustering and localization in Sec. 4.3.

Our main contributions are:

  • We combine spatial attention and a Gaussian mixture in a unified deep generative model, enabling our model to cluster discovered objects.

  • We analyze the “what” latent variables, which are essential because they represent the attributes of the objects.

  • Our method achieves competitive results on both clustering and localization compared to state-of-the-art methods.

2 Gaussian Mixture Attend, Infer, Repeat

In this section, we introduce our framework, GMAIR, for unsupervised object detection. GMAIR is a spatial-attention model with a Gaussian mixture prior assumption for the “what” latent variables, and this enables the model to cluster discovered objects. An overview of GMAIR is presented in Fig. 1.


Figure 1: Architecture of GMAIR. This is a VAE-based model that consists of a probabilistic encoder and a probabilistic decoder. In the encoder, the input data passes through a backbone network that extracts feature maps representing the features of the divided regions. The feature maps are then fed into three separate modules: pres-head, depth-head, and where-head, which produce the posteriors of $z^{\text{pres}}$, $z^{\text{depth}}$, and $z^{\text{where}}$, respectively. A cat-encoder module generates the posterior of $z^{\text{cat}}$, taking as input the glimpses produced by a spatial transformer network (STN), and a what-encoder module generates the posterior of $z^{\text{what}}$, taking the glimpses and $z^{\text{cat}}$ as input. In the decoder, each $z^{\text{what}}$ is fed into a glimpse decoder to generate decoded glimpses, which the renderer combines to recover the final generated image. Finally, the priors of the other latent variables are fixed, whereas the prior of $z^{\text{what}}$ is generated by a “what priors” module using $z^{\text{cat}}$ as input.

2.1 Structured Object-semantic Latent Representation

We follow SPAIR to attain object abstraction latent variables (Crawford and Pineau (2019)): the image is divided into a grid of regions. The latent variable $z$ is a concatenation of per-region latent variables $z_i$, where $z_i$ is the latent variable for the $i$-th region and represents the semantic features of the object centered in that region. Furthermore, for each region we divide $z_i$ into five separate latent variables, $z_i = (z_i^{\text{pres}}, z_i^{\text{where}}, z_i^{\text{depth}}, z_i^{\text{what}}, z_i^{\text{cat}})$; the dimension of the “what” latent variables and the number of clusters are hyperparameters. The meanings of $z_i^{\text{pres}}$, $z_i^{\text{where}}$, $z_i^{\text{depth}}$, and $z_i^{\text{what}}$ are the same as in Crawford and Pineau (2019), while $z_i^{\text{cat}}$ are one-hot vectors for categories.

GMAIR imposes a prior on those latent variables as follows:


Gaussian Mixture Prior Assumption

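The latent variables $z_i^{\text{cat}}$ are one-hot vectors that act as classification indicators. They obey a categorical distribution, $z_i^{\text{cat}} \sim \text{Cat}(\pi)$, where $\pi = (\pi_1, \dots, \pi_K)$ and $K$ is the number of clusters. For simplicity, we assume that $\pi_k = 1/K$ for all $k$.

We assume that $z_i^{\text{what}}$ conditional on $z_i^{\text{cat}}$ obeys a Gaussian distribution. In that case, $z_i^{\text{what}}$ obeys a Gaussian mixture model, that is,

$$p(z_i^{\text{what}}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(z_i^{\text{what}}; \mu_k, \sigma_k^2), \tag{2}$$

where $\mathcal{N}$ is the probability density function of the Gaussian distribution, and $\mu_k$ and $\sigma_k$ ($1 \le k \le K$) are the mean and standard deviation of the $k$-th Gaussian distribution. We let $\mu_k$ and $\sigma_k$ be learnable parameters that are jointly trained with the other parameters. In the implementation, $\mu_k$ and $\sigma_k$ are computed from $z_i^{\text{cat}}$ (selecting the $k$ for which the $k$-th entry of $z_i^{\text{cat}}$ is 1) by functions that can be modeled as linear layers. These form the “what priors” module in Figure 1.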
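As an illustration of the mixture prior described above, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes diagonal covariances and uniform mixing weights, and the names `diag_gaussian_logpdf`, `gmm_logpdf`, `means`, and `stds` are hypothetical.

```python
import math

def diag_gaussian_logpdf(z, mean, std):
    """Log density of a diagonal Gaussian evaluated at the vector z."""
    return sum(
        -0.5 * math.log(2.0 * math.pi) - math.log(s) - 0.5 * ((x - m) / s) ** 2
        for x, m, s in zip(z, mean, std)
    )

def gmm_logpdf(z, means, stds):
    """Log density of a Gaussian mixture with uniform weights 1/K at z.

    means[k] and stds[k] stand in for the mean and standard deviation of the
    k-th component (assumed diagonal here).
    """
    k = len(means)
    comps = [diag_gaussian_logpdf(z, m, s) - math.log(k) for m, s in zip(means, stds)]
    top = max(comps)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(c - top) for c in comps))

# Toy example: two well-separated 2-D clusters.
means = [[-3.0, 0.0], [3.0, 0.0]]
stds = [[1.0, 1.0], [1.0, 1.0]]
near = gmm_logpdf([3.0, 0.0], means, stds)  # at a cluster center
far = gmm_logpdf([0.0, 0.0], means, stds)   # between the clusters
```

A point at a component mean receives a higher log density than a point between the components, which is the property that lets the prior pull “what” codes toward cluster centers.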

For the other latent variables, $z_i^{\text{pres}}$ is modeled using a Bernoulli distribution whose parameter is the presence probability, and $z_i^{\text{where}}$ and $z_i^{\text{depth}}$ are modeled using normal distributions. All priors of latent variables are listed in Table 1.

Latent Variables Priors
Table 1: Priors of latent variables

2.2 Inference and Generation Model

Inference Model

In the inference model, latent variables conditional on data are modeled by Eqn. 3.


During implementation, a backbone network extracts feature maps from the input data, covering the divided regions. The posteriors of $z^{\text{pres}}$, $z^{\text{where}}$, and $z^{\text{depth}}$ are then inferred by pres-head, where-head, and depth-head, respectively. Input images are cropped into glimpses by a spatial transformer network, and each glimpse is passed to the cat-encoder module to generate the posterior of $z^{\text{cat}}$. Subsequently, we use the concatenation of the $i$-th glimpse and $z_i^{\text{cat}}$ as the input of the what-encoder to generate the posterior of $z_i^{\text{what}}$.

Generation Model

In the generation model, each $z_i^{\text{what}}$ is decoded back into a glimpse by a glimpse decoder. Then, a renderer combines the glimpses to generate the output image. We use the same rendering algorithm as in previous studies (Eslami et al. (2016), Crawford and Pineau (2019)).

2.3 The Loss Functions

Evidence Lower Bound

In general, we learn the parameters of the VAE jointly by maximizing the evidence lower bound (ELBO), which can be formulated as:

$$\mathcal{L}_{\text{ELBO}} = \mathbb{E}_{q(z \mid x)}\left[\log p(x \mid z)\right] - D_{\text{KL}}\left(q(z \mid x) \,\|\, p(z)\right), \tag{4}$$

where the first term is called the reconstruction term and the second term is the regularization term. The regularization term can be further decomposed into five terms by substituting Eqn. 1 and Eqn. 3 into Eqn. 4, with each of the five terms corresponding to the Kullback–Leibler divergence (or its expectation) between a type of latent variable and its prior:


The terms in Eqn. 5 are:


A complete derivation is given in Appendix A.
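To make the kinds of KL terms involved more concrete, here is a hedged sketch of closed-form KL divergences for the distribution families named in Sec. 2.1 (Bernoulli, categorical with a uniform prior, and diagonal Gaussian). It is an illustration under those assumptions rather than the paper's exact terms; all function names are hypothetical, and `expected_what_kl` assumes one Gaussian posterior per candidate category.

```python
import math

def kl_bernoulli(q, p):
    """KL(Bernoulli(q) || Bernoulli(p)) for probabilities strictly in (0, 1)."""
    return q * math.log(q / p) + (1.0 - q) * math.log((1.0 - q) / (1.0 - p))

def kl_categorical_uniform(probs):
    """KL(Categorical(probs) || uniform distribution over len(probs) classes)."""
    k = len(probs)
    return sum(p * math.log(p * k) for p in probs if p > 0.0)

def kl_diag_gaussians(mu_q, std_q, mu_p, std_p):
    """KL between two diagonal Gaussians, summed over dimensions."""
    return sum(
        math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2.0 * sp ** 2) - 0.5
        for mq, sq, mp, sp in zip(mu_q, std_q, mu_p, std_p)
    )

def expected_what_kl(cat_probs, post_means, post_stds, prior_means, prior_stds):
    """Expectation, over the category posterior, of the Gaussian KL for each category."""
    return sum(
        w * kl_diag_gaussians(mq, sq, mp, sp)
        for w, mq, sq, mp, sp in zip(cat_probs, post_means, post_stds, prior_means, prior_stds)
    )

# Toy example: two categories, a 1-D "what" code, unit-variance priors at 0.
example = expected_what_kl([0.5, 0.5], [[0.0], [1.0]], [[1.0], [1.0]],
                           [[0.0], [0.0]], [[1.0], [1.0]])
```

The last function shows why one of the terms is an expectation: the Gaussian KL for the “what” variables depends on the category, so it is averaged under the category posterior.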

Overlap Loss

In practice, we find that penalizing overlaps between objects sometimes helps. Therefore, we introduce an auxiliary loss called the overlap loss. First, each decoded glimpse is transformed by a spatial transformer network into an image with the same height and width as the input image, giving one full-size image per glimpse. The overlap loss is then calculated as the average, over pixels, of the sum across these images minus the maximum across these images.

This loss, inspired by the boundary loss in SPACE (Lin et al. (2020)), penalizes the model when it tries to split a large object into multiple smaller ones. However, we achieve this by using a different calculation method that incurs a lower computational cost.
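The overlap computation can be sketched as follows. This is a minimal illustration, assuming single-channel, non-negative full-size images stored as nested lists; the function name `overlap_loss` and the toy inputs are hypothetical.

```python
def overlap_loss(canvases):
    """Mean over pixels of (sum across canvases - max across canvases).

    canvases: list of images, each a list of rows of non-negative floats,
    all sharing the height and width of the input image.
    """
    height, width = len(canvases[0]), len(canvases[0][0])
    total = 0.0
    for r in range(height):
        for c in range(width):
            values = [canvas[r][c] for canvas in canvases]
            # Zero when at most one canvas is active at this pixel.
            total += sum(values) - max(values)
    return total / (height * width)

# Two objects on disjoint pixels: no penalty.
disjoint = overlap_loss([[[1.0, 0.0], [0.0, 0.0]],
                         [[0.0, 0.0], [0.0, 1.0]]])
# Two objects on the same pixel: positive penalty.
stacked = overlap_loss([[[1.0, 0.0], [0.0, 0.0]],
                        [[0.5, 0.0], [0.0, 0.0]]])
```

Because the per-pixel quantity vanishes whenever only one glimpse covers a pixel, the loss only penalizes pixels claimed by several glimpses at once.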

Total Loss

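The total loss is a weighted sum of the loss terms above (the reconstruction term, the KL terms, and the overlap loss), where the weights are the coefficients of the corresponding loss terms.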

3 Related Works

Several studies on unsupervised object detection have been conducted, including spatial-attention methods such as AIR (Eslami et al. (2016)), SPAIR (Crawford and Pineau (2019)), and SPACE (Lin et al. (2020)), and scene-mixture methods such as MONet (Burgess et al. (2019)), IODINE (Greff et al. (2019)), and GENESIS (Engelcke et al. (2019)). Most of them, including our work, are based on a VAE (Kingma and Welling (2013)).

The AIR (Eslami et al. (2016)) framework uses a VAE-based hierarchical probabilistic model, marking a milestone in unsupervised scene understanding. In AIR, latent variables are structured into groups, one for each discovered object, each of which consists of “what,” “where,” and “presence” variables. A recurrent neural network is used in the inference model to produce these latent variables, and there is a decoder network for decoding the “what” variables of each object in the generation model. A spatial transformer network (Jaderberg et al. (2015)) is used for rendering.

Because AIR attends to one object at a time, it does not scale well to scenes that contain many objects. SPAIR (Crawford and Pineau (2019)) attempted to address this issue by replacing the recurrent network with a convolutional network that follows a spatially invariant assumption. Similar to YOLO (Redmon et al. (2016)), in SPAIR, the locations of objects are specified relative to local grid cells.

Scene-mixture models such as MONet (Burgess et al. (2019)), IODINE (Greff et al. (2019)), and GENESIS (Engelcke et al. (2019)) perform segmentation instead of explicitly finding the location of objects. SPACE (Lin et al. (2020)) employs a combination of both methods. It consists of a spatial-attention model for the foreground and a scene-mixture model for the background.

In the area of deep unsupervised clustering, recent methods include AAE (Makhzani et al. (2015)), GMVAE (Dilokthanakul et al. (2016)), and IIC (Ji et al. (2019)). AAE combines the ideas of generative adversarial networks and variational inference. GMVAE uses a Gaussian mixture model as a prior distribution. In IIC, objects are clustered by maximizing the mutual information of pairs of images. All of them show promising results on unsupervised clustering.

GMAIR incorporates a Gaussian mixture model for clustering, similar to the GMVAE framework (we also refer to a blog post published by Rui Shu). It is worth noting that our approach may simply be one choice among many options. Unlike previous research, our main contribution is to show the feasibility of performing clustering and localization simultaneously. Moreover, our method provides a simple and intuitive way to analyze the mechanics of the detection process.

4 Models and Experiments

The experiments were divided into three parts: a) the analysis of “what” representation and clustering along with the iterations, b) image generation, and c) quantitative evaluation of the models.

We evaluate the models on two datasets:

  • MultiMNIST: A dataset generated by placing 1–10 small images randomly chosen from MNIST (a standard handwritten digits dataset, LeCun (1998)) at random positions on empty images.

  • Fruit2D: A dataset collected from a real-world game. The scenes contain multiple types of fruits of various sizes. Both the number and the size of objects differ greatly between small and large objects: the largest type of object is much larger than the smallest type, and objects of the smallest size are far more numerous than objects of the largest size. These settings make it difficult to perform localization and clustering.

In the experiments, we compared GMAIR to two models, SPAIR and SPACE, both of which achieve state-of-the-art localization performance in unsupervised object detection. Separate Gaussian mixture models are applied to the “what” latent variables generated by the compared models to obtain the clustering results. Unless otherwise stated, the number of clusters and the number of Monte Carlo samples are the same for all experiments. We present the details of the models in Appendix B.

It is worth mentioning that the model sometimes successfully locates an object but encloses it with an overly large box. In that case, the IoU between the ground-truth box and the predicted box will be small, and therefore the prediction will not count as a correct bounding box when calculating AP. We fix this issue by removing the empty area in generated glimpses to obtain the real size of the predicted boxes.
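One way to remove the empty area of a glimpse is to take the tightest box around its non-empty pixels. The sketch below illustrates this idea only; the `threshold` parameter and the function name `tight_box` are assumptions, and the result still has to be mapped from glimpse coordinates back to image coordinates.

```python
def tight_box(glimpse, threshold=0.0):
    """Return (top, left, bottom, right) around pixels above threshold, or None.

    glimpse: list of rows of floats. bottom and right are exclusive indices.
    """
    rows = [r for r, row in enumerate(glimpse) if any(v > threshold for v in row)]
    cols = [c for c in range(len(glimpse[0]))
            if any(row[c] > threshold for row in glimpse)]
    if not rows:
        return None  # the glimpse is entirely empty
    return rows[0], cols[0], rows[-1] + 1, cols[-1] + 1

# A 4x4 glimpse whose content occupies only the central 2x2 area.
glimpse = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.9, 0.0, 0.0],
    [0.0, 0.0, 0.7, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
box = tight_box(glimpse)
```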

4.1 “What” Representation and Cluster Analysis

We conducted the experiments using the MultiMNIST dataset. We ran GMAIR for 440k iterations and observed the change in the average precision (AP) of bounding boxes and in the accuracy (ACC) and normalized mutual information (NMI) of clustering up to 100k iterations. We also visualized the “what” latent variables in the latent space during the process, as shown in Fig. 2. Although all values continued to increase even after 100k iterations, the visualization results were similar to those at 100k iterations. For completeness, we report the results from 100k to 440k iterations in Appendix D. Details of calculating the AP, ACC, and NMI are discussed in Appendix C.

The results showed that at an early stage (~10k iterations) of training, the model can already locate objects well (see the AP curve in Fig. 2(a)). At the same time, $z^{\text{what}}$, the representations of objects, were still evolving, and the clustering results (Fig. 2(b)) were not yet satisfactory; the digits were blurry in Fig. 2(f). After 50k and 100k iterations of training, the clustering effect of $z^{\text{what}}$ was increasingly apparent, and the digits were clearer (Fig. 2(g), 2(h)). The clustering results (ACC and NMI) at 50k and 100k iterations were improved (Fig. 2(c), 2(d)).

It should be noted that even if $z^{\text{what}}$ is sufficiently well clustered, the model may fail to locate the centers of clusters (for example, the large cluster in light red in Fig. 2(d)), leading to poor clustering results. In the worst case, the model may learn to make the parameters of all mixture components converge to the same values, so that the Gaussian mixture model degenerates to a single Gaussian distribution, resulting in a very poor clustering result. In general, we found that this phenomenon usually occurs at the early stage of training and can be avoided by adjusting the learning rate of the relevant modules and the coefficients of the loss functions.

(a) AP(IoU=0.5), ACC and NMI during training
(b) “What” latent space, at 10k iterations
(c) “What” latent space, at 50k iterations
(d) “What” latent space, at 100k iterations
(e) Original image
(f) Generated image, at 10k iterations
(g) Generated image, at 50k iterations
(h) Generated image, at 100k iterations
Figure 2: “What” representation and cluster analysis. (a) Average precision (AP), accuracy (ACC), and normalized mutual information (NMI) during training. (b-d) “What” latent space visualized by t-SNE (Van der Maaten and Hinton (2008)) at 10k, 50k, and 100k iterations, respectively. Each small dot represents a sample of $z^{\text{what}}$, and different colors represent the ground-truth categories of the corresponding objects. The large dots are the component means $\mu_k$ described in Sec. 2.1, and each of these can be seen as the center of a cluster. The closures represent the clustering results: each one encloses the points closest to $\mu_k$ among those assigned to the $k$-th cluster. The colors of $\mu_k$ and the closures are decided by a matching algorithm such that a maximum number of samples are correctly classified to the ground-truth label. (e) Sample of original image. (f-h) Samples of generated image at 10k, 50k, and 100k iterations, respectively.

4.2 Image Generation

It is expected that $\mu_k$ represents the average feature of the $k$-th type of objects, and the latent variable $z_i^{\text{what}}$ can be decomposed into:

$$z_i^{\text{what}} = \mu_k + z_i^{\text{loc}}$$

if the $i$-th object is in the $k$-th category, where $z_i^{\text{loc}}$ represents the local feature of the object. By altering $\mu_k$ or $z_i^{\text{loc}}$, we should obtain new objects that belong to other categories or to the same category with different styles, respectively. In the experiment, we altered $\mu_k$ and $z_i^{\text{loc}}$ and observed the generated images for each object, as shown in Fig. 3. In Fig. 3(a), objects in each cluster correspond to a type of digit, which is exactly what we expected (except for digit 8 in column 3). In Fig. 3(b), categories with a large number of objects are grouped into multiple clusters, while categories with a small number are grouped into one cluster. This is due to the significant difference in number between the various types. However, the objects in a cluster generally come from a single category.

The structure of GMAIR ensures its ability to control the object categories, object styles, and positions of the objects in the generated images by altering $z^{\text{cat}}$, $z^{\text{loc}}$, and $z^{\text{where}}$, respectively. Examples are shown in Fig. 4.
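The category/style manipulation can be illustrated with a small sketch. It assumes that a “what” code is the sum of its category mean and a local residual, so that changing the category amounts to swapping the mean while keeping the residual; the additive form, the function names, and the toy values are assumptions made here for illustration.

```python
def local_feature(z_what, category_mean):
    """Residual of a "what" code after removing its category mean."""
    return [z - m for z, m in zip(z_what, category_mean)]

def change_category(z_what, means, old_k, new_k):
    """Move a "what" code from category old_k to new_k, keeping its local feature."""
    local = local_feature(z_what, means[old_k])
    return [m + l for m, l in zip(means[new_k], local)]

# Toy 2-D example with two category means.
means = [[0.0, 0.0], [10.0, 10.0]]
z_what = [0.5, -0.25]                      # an object of category 0
moved = change_category(z_what, means, old_k=0, new_k=1)
```

Decoding `moved` with the glimpse decoder would then be expected to produce an object of the new category in the style of the original one; varying the residual while keeping the mean fixed changes style within a category.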

(a) MultiMNIST
(b) Fruit2D
Figure 3: Generated objects obtained by varying the category and the local feature. The horizontal axis represents varying the category, and the vertical axis represents varying the local feature, in both (a) and (b).
(a) MultiMNIST
(b) Fruit2D
Figure 4: Generated images obtained by varying the attributes and locations of objects. Columns 1 to 5 are numbered from left to right. Column 1 shows original images. Column 2 shows the generated images without varying the categories, styles, or positions. Column 3 presents images generated by setting all objects to the same random category. Column 4 depicts images generated by varying the style of the objects. Column 5 shows images generated by applying a random shuffle to the object positions.

This could provide a new approach for tasks such as style transfer, image generation, and data augmentation. Note that previous methods such as AIR, SPAIR, and their variants can also obtain similar results, but we achieve them at a finer granularity.

4.3 Quantitative Evaluations

We quantitatively evaluate the models in terms of the AP of bounding boxes and the ACC and NMI of the clusters; the results are listed in Table 2. In the first part, we summarize some results of state-of-the-art models for unsupervised clustering on the MNIST dataset for comparison. In the second and third parts, we compare GMAIR to state-of-the-art models for unsupervised object detection on the MultiMNIST and Fruit2D datasets, respectively. The clustering results of SPAIR and SPACE are obtained by Gaussian mixture models (GMMs). The results show that GMAIR achieves competitive results on both clustering and localization.

Model Dataset AP (%, IoU=0.5) ACC (%) NMI (%)
Table 2: Quantitative Results on Localization (AP) and Clustering (Accuracy and NMI)

5 Conclusion

We introduce GMAIR, which combines spatial attention and a Gaussian mixture, such that it can locate and cluster unseen objects simultaneously. We analyze the “what” latent variables and clustering process, provide examples of GMAIR application to the task of image generation, and evaluate GMAIR quantitatively compared with SPAIR and SPACE.

This work was partially supported by the Research and Development Projects of Applied Technology of Inner Mongolia Autonomous Region, China under Grant No. 201802005, the Key Program of the National Natural Science Foundation of China under Grant No. 61932014, and Pudong New Area Science & Technology Development Fundation under Grant No. PKX2019-R02. Yao Shen is the corresponding author.


Appendix A Derivation of The KL Terms

In this section, we derive the KL terms in Eqn. 5. By the assumptions on the posterior and the prior (Eqn. 3 and Eqn. 1), we have:


The term can be further expanded as follows:


Continue to expand Eqn. A:


By the definition of Kullback–Leibler divergence, the four terms in the RHS of Eqn. A are indeed


respectively. Therefore, we complete the proof of Eqn. 5.

In the implementation, we model the discrete variables $z^{\text{pres}}$ and $z^{\text{cat}}$ using the Gumbel-Softmax approximation (Jang et al. (2016)). Therefore, all variables can be sampled in a differentiable way using the reparameterization trick.
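For reference, the Gumbel-Softmax relaxation draws a soft one-hot vector by adding Gumbel noise to the logits and applying a temperature-scaled softmax. The sketch below shows only the sampling formula in plain Python (no automatic differentiation); the function name, the clamping constant, and the example logits and temperature are assumptions.

```python
import math
import random

def gumbel_softmax_sample(logits, temperature, rng=random):
    """Draw one relaxed one-hot sample from unnormalized log-probabilities."""
    noisy = []
    for logit in logits:
        u = rng.random()
        # Clamp away from 0 and 1 so that both logarithms are finite.
        u = min(max(u, 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    top = max(noisy)  # subtract the maximum for a stable softmax
    exps = [math.exp(v - top) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
sample = gumbel_softmax_sample([0.0, 1.0, 2.0], temperature=0.5, rng=rng)
```

Lower temperatures push the sample toward a hard one-hot vector, while higher temperatures make it smoother. In an autodiff framework, the same expression is differentiable with respect to the logits because the randomness enters only through the noise.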

Appendix B Implementation Details

Our code is available at

B.1 Models

Here, we describe the architecture of each module of GMAIR, as shown in Fig. 1. The backbone is a ResNet18 (He et al. (2016)) network with two deconvolution layers replacing the fully connected layer, as shown in Table 3. Pres-head, depth-head, and where-head are convolutional networks that differ only in the number of output channels, as shown in Table 4. What-encoder and cat-encoder are multilayer networks, as shown in Table 5. Finally, the glimpse decoder is a deconvolutional network, as shown in Table 6.

For the other models, we make use of existing code for SPAIR and for SPACE. We use most of the default configuration for both models, changing only the dimension of the “what” latent variables for comparison and the size of the base bounding box for large objects.

B.2 Training and Hyperparameters

The base set of hyperparameters for GMAIR is given in Table 7. For stability, some values are decreased gradually in the early stage of training, including a prior that drops from its initial value to its final value. The learning rate is varied within a fixed range.

B.3 Testing

During the testing phase, in order to obtain deterministic results, we use the value with the largest probability (density) for each latent variable, instead of sampling it from its distribution.

Appendix C Calculation of AP, ACC and NMI

The value of AP is calculated at an IoU threshold of 0.5 by using the calculation method from VOC (Everingham et al. (2010)). Before calculating the ACC and NMI of clusters, we filter out the incorrect bounding boxes. A predicted box is correct iff there is a ground-truth box whose IoU with it exceeds the threshold, and the class of a correct predicted box is assigned to the class of the ground-truth box with which its IoU is maximized. After filtering, all correct predicted boxes are used for the calculation of ACC and NMI. Note that there are still many ways to assign each predicted category to a real category when calculating ACC. Among all of them, we select the one that maximizes ACC, following Dilokthanakul et al. (2016). The formulas for ACC and NMI are given in Eqn. 16 and Eqn. 17:


Here the formulas are evaluated on the ground-truth categories and the predicted categories of all correct boxes; they also involve the number of clusters, the number of real classes, and the entropy and mutual information functions.
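The evaluation steps above can be sketched as follows. This is a hedged illustration rather than the authors' evaluation code: boxes are assumed to be `(x1, y1, x2, y2)` tuples, ACC is maximized over unconstrained maps from clusters to classes (a majority vote per cluster), and NMI is normalized by the arithmetic mean of the two entropies, which is one common convention among several. All function names are hypothetical.

```python
import math
from collections import Counter

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def match_labels(pred_boxes, gt_boxes, gt_classes, threshold=0.5):
    """Class of the best-IoU ground-truth box per prediction, or None if incorrect."""
    labels = []
    for p in pred_boxes:
        scores = [iou(p, g) for g in gt_boxes]
        best = max(range(len(scores)), key=scores.__getitem__) if scores else None
        ok = best is not None and scores[best] > threshold
        labels.append(gt_classes[best] if ok else None)
    return labels

def clustering_accuracy(true, pred):
    """Best accuracy over all maps from clusters to classes (majority vote)."""
    best = {}
    for (p, _), n in Counter(zip(pred, true)).items():
        best[p] = max(best.get(p, 0), n)
    return sum(best.values()) / len(true)

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())

def nmi(true, pred):
    """Mutual information divided by the mean of the two entropies (assumed form)."""
    n = len(true)
    ct, cp, joint = Counter(true), Counter(pred), Counter(zip(true, pred))
    mi = sum(c / n * math.log(c * n / (ct[t] * cp[p])) for (t, p), c in joint.items())
    denom = 0.5 * (entropy(true) + entropy(pred))
    return mi / denom if denom > 0 else 1.0

# Cluster ids are arbitrary: a perfect clustering with swapped ids still scores 1.
acc = clustering_accuracy([0, 0, 1, 1], [1, 1, 0, 0])
score = nmi([0, 0, 1, 1], [1, 1, 0, 0])
```

Predictions for which `match_labels` returns `None` would be dropped before computing ACC and NMI, mirroring the filtering step described above.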

Layer | Type | Size | Act./Norm. | Output Size
resnet | ResNet18 (w/o fc) | | |
deconv layer 1 | Deconv | | ReLU/BN |
deconv layer 2 | Deconv | | ReLU/BN |
Table 3: Architecture of Backbone

Layer | Type | Size | Act./Norm. | Output Size
Hidden | Conv | | ReLU |
Output | Conv | | |
Table 4: Architectures of Pres/Depth/Where-Head

Layer | Type | Size | Act./Norm. | Output Size
Input | Flatten | | |
Layer 1 | Linear | | ReLU |
Layer 2 | Linear | | ReLU |
Layer 3 | Linear | | ReLU |
Output | Linear | | |
Table 5: Architectures of What/Cat-Encoder

Layer | Type | Size | Act./Norm. | Output Size
Input | Linear | | ReLU |
Layer 1 | Deconv | | ReLU/GN(8) |
Layer 2 | Deconv | | ReLU/GN(8) |
Layer 3 | Deconv | | ReLU/GN(8) |
Layer 4 | Deconv | | ReLU/GN(8) |
Conv | Conv | | ReLU/GN(8) |
Layer 5 | DeConv | | ReLU/GN(4) |
Output | Conv | | |
Table 6: Architecture of Glimpse-Decoder
Description Variable Value
Base bbox size
Batch size
Dim. of
Dim. of
Glimpse size
Learning rate
Loss Coef. of 1
Loss Coef. of
Loss Coef. of 1
Loss Coef. of 1
Loss Coef. of 8,16
Loss Coef. of 1
Loss Coef. of 1
Prior on
Prior on
Prior on
Prior on
Prior on
Table 7: Base Hyperparameters

Appendix D Additional Experiment Results

The graphs of “what” representation after 100k iterations are shown in Fig. 5.

(a) AP(IoU=0.5), ACC and NMI during training
(b) “What” latent space, at 220k iterations
(c) “What” latent space, at 330k iterations
(d) “What” latent space, at 440k iterations
(e) Original image
(f) Generated image, at 220k iterations
(g) Generated image, at 330k iterations
(h) Generated image, at 440k iterations
Figure 5: “What” representation and cluster analysis after 100k iterations. (a) Average precision (AP), accuracy (ACC), normalized mutual information (NMI) during training. (b-d) Visualized “what” latent space by t-SNE (Van der Maaten and Hinton (2008)) at 220k, 330k, and 440k iterations, respectively. (e) Sample of original image. (f-h) Samples of generated image at 220k, 330k, and 440k iterations, respectively.