An official implementation of GMAIR
Recent studies on unsupervised object detection based on spatial attention have achieved promising results. Models such as AIR and SPAIR output "what" and "where" latent variables that represent the attributes and locations of objects in a scene, respectively. Most previous studies concentrate on the "where" localization performance; however, we claim that acquiring "what" object attributes is also essential for representation learning. This paper presents a framework, GMAIR, for unsupervised object detection. It incorporates spatial attention and a Gaussian mixture in a unified deep generative model. GMAIR can locate objects in a scene and simultaneously cluster them without supervision. Furthermore, we analyze the "what" latent variables and the clustering process. Finally, we evaluate our model on the MultiMNIST and Fruit2D datasets and show that GMAIR achieves competitive results on localization and clustering compared to state-of-the-art methods.
The perception of human vision is naturally hierarchical. We can recognize objects in a scene at a glance and classify them according to their appearances, functions, and other attributes. It is expected that an intelligent agent can also decompose scenes into meaningful object abstractions, which is known as the object detection task in machine learning. In the last decade, there have been significant developments in supervised object detection. However, its unsupervised counterpart continues to be challenging.
Recently, there has been some progress in unsupervised object detection. Attend, infer, repeat (AIR, Eslami et al. (2016)), which is a variational autoencoder (VAE, Kingma and Welling (2013)) based method, achieved encouraging results. Spatially invariant AIR (SPAIR, Crawford and Pineau (2019)) replaced the recurrent network in AIR with a convolutional network, attaining better scalability and lower computational cost. SPACE (Lin et al. (2020)), which combines spatial-attention and scene-mixture approaches, performed better in background prediction.
Despite the recent progress in unsupervised object detection, results of previous studies remain unsatisfactory. One of the reasons for this could be that previous studies on unsupervised object detection were mainly concentrated on object localization and lacked analysis and evaluation on the “what” latent variables, which represent the attributes of objects. These variables are essential for many tasks such as clustering, image generation, and style transfer. Another important concern is that they do not directly reason about the category of objects in the scene, which is beneficial to know in many cases, unlike most of the studies on corresponding supervised tasks.
This paper presents a framework for unsupervised object detection that can directly reason about the category and localization of objects in the scenes and provide an intuitive way to analyze the “what” latent variables by simply incorporating a Gaussian mixture prior assumption. In Sec. 2, we introduce the architecture of our framework, GMAIR. We introduce related works in Sec. 3. We analyze the “what” latent variables in Sec. 4.1. We describe our model for image generation in Sec. 4.2. Finally, we present quantitative evaluation results of both clustering and localization in Sec. 4.3.
Our main contributions are:
We combine spatial attention and a Gaussian mixture in a unified deep generative model, enabling our model to cluster discovered objects.
We analyze the “what” latent variables, which are essential because they represent the attributes of the objects.
Our method achieves competitive results on both clustering and localization compared to state-of-the-art methods.
In this section, we introduce our framework, GMAIR, for unsupervised object detection. GMAIR is a spatial-attention model with a Gaussian mixture prior assumption for the “what” latent variables, which enables the model to cluster discovered objects. An overview of GMAIR is presented in Fig. 1.
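We follow SPAIR to obtain object-abstraction latent variables (Crawford and Pineau (2019)); the image is divided into $H \times W$ regions. The latent variable $z = (z_1, \dots, z_{HW})$ is a concatenation of $HW$ latent variables, where $z_i$ is the latent variable for the $i$-th region and represents the semantic features of the object centered in the $i$-th region. Furthermore, for each region we divide $z_i$ into five separate latent variables, $z_i = (z_i^{pres}, z_i^{depth}, z_i^{where}, z_i^{what}, z_i^{cat})$, where $z_i^{pres} \in \{0, 1\}$, $z_i^{depth} \in \mathbb{R}$, $z_i^{where} \in \mathbb{R}^4$, $z_i^{what} \in \mathbb{R}^A$, $z_i^{cat} \in \{0, 1\}^C$, $A$ is the dimension of the “what” latent variables, and $C$ is the number of clusters. The meanings of $z_i^{pres}$, $z_i^{depth}$, $z_i^{where}$, and $z_i^{what}$ are the same as in Crawford and Pineau (2019), while $z_i^{cat}$ are one-hot vectors for categories.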
GMAIR imposes a prior on those latent variables as follows:
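The latent variables $z_i^{cat}$ are one-hot vectors that act as classification indicators. They obey the categorical distribution $z_i^{cat} \sim \mathrm{Cat}(\pi)$, where $\pi = (\pi_1, \dots, \pi_C)$ with $\sum_{k} \pi_k = 1$. For simplicity, we assume that $\pi_k = 1/C$ for all $k$.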
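We assume that $z_i^{what}$ conditional on $z_i^{cat}$ obeys a Gaussian distribution. In that case, $z_i^{what}$ obeys a Gaussian mixture model, that is,

$$p(z_i^{what}) = \sum_{k=1}^{C} \pi_k \, \mathcal{N}(z_i^{what}; \mu_k, \sigma_k^2),$$

where $\mathcal{N}$ is the probability density function of the Gaussian distribution and $(\mu_k, \sigma_k)$ are the mean and standard deviation of the $k$-th Gaussian distribution. We let $\mu_k$ and $\sigma_k$ be learnable parameters that are jointly trained with the other parameters. In the implementation, $\mu_k = \mu(z_i^{cat})$ and $\sigma_k = \sigma(z_i^{cat})$ if $z_{i,k}^{cat} = 1$, where $\mu(\cdot)$ and $\sigma(\cdot)$ can be modeled as linear layers. Together they form the “what priors” module in Fig. 1.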
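The mixture prior described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the released implementation: the function names, the bias-free weight matrices `w_mu` and `w_log_sigma`, and the use of an exponential to keep the standard deviation positive are all hypothetical choices.

```python
import numpy as np


def what_prior_params(z_cat, w_mu, w_log_sigma):
    """Map a one-hot category vector to the mean and std of its Gaussian.

    A bias-free linear layer applied to a one-hot vector simply selects
    one row of the weight matrix, so each row acts as one cluster's
    parameters. The exp() that keeps sigma positive is an assumption.
    """
    z_cat = np.asarray(z_cat, dtype=float)
    mu = z_cat @ np.asarray(w_mu, dtype=float)
    sigma = np.exp(z_cat @ np.asarray(w_log_sigma, dtype=float))
    return mu, sigma


def gaussian_mixture_log_prob(z_what, means, stds, weights):
    """log p(z_what) under a diagonal Gaussian mixture.

    z_what: (A,), means and stds: (C, A), weights: (C,) summing to 1.
    """
    z = np.asarray(z_what, dtype=float)
    means = np.asarray(means, dtype=float)
    stds = np.asarray(stds, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Log-density of each diagonal Gaussian component, summed over dimensions.
    log_comp = -0.5 * np.sum(
        ((z - means) / stds) ** 2 + 2.0 * np.log(stds) + np.log(2.0 * np.pi),
        axis=1,
    )
    log_joint = np.log(weights) + log_comp
    # Log-sum-exp over components for numerical stability.
    m = np.max(log_joint)
    return float(m + np.log(np.sum(np.exp(log_joint - m))))
```

With a uniform mixing weight, `weights` would simply be an array filled with one over the number of clusters.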
For the other latent variables, $z_i^{pres}$ is modeled using a Bernoulli distribution parameterized by the presence probability, and $z_i^{where}$ and $z_i^{depth}$ are modeled using normal distributions. All priors of the latent variables are listed in Table 1.
In the inference model, latent variables conditional on data are modeled by Eqn. 3.
In the implementation, feature maps of dimension $H \times W \times D$ are extracted from a backbone network using the data $x$ as input, where $D$ is the number of channels of the feature maps. The posteriors of $z^{pres}$, $z^{where}$, and $z^{depth}$ are then inferred by the pres-head, where-head, and depth-head, respectively. Input images are cropped into $HW$ glimpses by a spatial transformer network, and each glimpse is passed to the cat-encoder module to generate the posterior of $z^{cat}$. Subsequently, we use the concatenation of the $i$-th glimpse and $z_i^{cat}$ as the input of the what-encoder to generate the posterior of $z_i^{what}$.
In general, we learn the parameters of a VAE jointly by maximizing the evidence lower bound (ELBO), which can be formulated as:

$$\mathcal{L}_{ELBO} = \mathbb{E}_{q(z|x)}[\log p(x|z)] - \mathrm{KL}(q(z|x) \,\|\, p(z)), \tag{4}$$

where the first term is the reconstruction term and the second term is the regularization term. The regularization term can be further decomposed into five terms by substituting Eqn. 1 and Eqn. 3 into Eqn. 4, each of which corresponds to the Kullback–Leibler divergence (or its expectation) between one type of latent variable and its prior:
The terms in Eqn. 5 are:
A complete derivation is given in Appendix A.
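For reference, the Kullback–Leibler terms of this kind have standard closed forms when the distributions are categorical or diagonal Gaussian. The sketch below shows those closed forms, plus an expectation over the category posterior for the “what” term. The function names and the per-cluster posterior layout are assumptions made for illustration and may not match how the released code organizes these terms.

```python
import numpy as np


def kl_categorical(q, p):
    """KL(q || p) between two categorical distributions over the same support."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    mask = q > 0  # terms with q_k = 0 contribute nothing
    return float(np.sum(q[mask] * (np.log(q[mask]) - np.log(p[mask]))))


def kl_diag_gaussian(mu_q, sigma_q, mu_p, sigma_p):
    """KL(N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2)) for diagonal Gaussians."""
    mu_q = np.asarray(mu_q, dtype=float)
    sigma_q = np.asarray(sigma_q, dtype=float)
    mu_p = np.asarray(mu_p, dtype=float)
    sigma_p = np.asarray(sigma_p, dtype=float)
    per_dim = (
        np.log(sigma_p / sigma_q)
        + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
        - 0.5
    )
    return float(np.sum(per_dim))


def expected_what_kl(q_cat, post_means, post_stds, prior_means, prior_stds):
    """Expectation over q(z_cat) of the Gaussian KL for the "what" variable.

    post_means[k], post_stds[k] describe the "what" posterior obtained when
    the category is k (hypothetical layout); prior_means[k], prior_stds[k]
    are the parameters of the k-th mixture component.
    """
    return float(sum(
        q_k * kl_diag_gaussian(post_means[k], post_stds[k],
                               prior_means[k], prior_stds[k])
        for k, q_k in enumerate(q_cat)
    ))
```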
In the actual implementation, we find that penalizing overlaps between objects sometimes helps. Therefore, we introduce an auxiliary loss called the overlap loss. First, we compute $HW$ images of size $H_{img} \times W_{img}$, where $H_{img}$ and $W_{img}$ are respectively the height and width of the input image, by transforming the decoded glimpses with a spatial transformer network. The overlap loss is then calculated as the average, over all pixels, of the sum of these images minus their maximum at each pixel.
This loss, inspired by the boundary loss in SPACE (Lin et al. (2020)), penalizes the model when it tries to split a large object into multiple smaller ones. However, we use a different calculation method that incurs a lower computational cost.
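The per-pixel "sum minus maximum" rule can be written compactly. The following is a minimal sketch that assumes each object has already been rendered into a full-size, single-channel, non-negative map (for example an opacity map). The array layout and the function name are hypothetical.

```python
import numpy as np


def overlap_loss(object_images):
    """Average over pixels of (sum over objects - max over objects).

    object_images: array of shape (N, H_img, W_img) holding one full-size,
    non-negative map per object (assumed layout).
    A pixel covered by at most one object contributes zero; a pixel covered
    by several objects contributes everything except the strongest response.
    """
    imgs = np.asarray(object_images, dtype=float)
    per_pixel = imgs.sum(axis=0) - imgs.max(axis=0)
    return float(per_pixel.mean())
```

Because the rule needs only one sum and one max over the object axis, its cost grows linearly with the number of objects.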
The total loss is:
where, , and are the coefficients of the corresponding loss terms.
Several studies on unsupervised object detection have been conducted, including spatial-attention methods such as AIR (Eslami et al. (2016)), SPAIR (Crawford and Pineau (2019)), and SPACE (Lin et al. (2020)), and scene-mixture methods such as MONet (Burgess et al. (2019)), IODINE (Greff et al. (2019)), and GENESIS (Engelcke et al. (2019)). Most of them including our work are based on a VAE (Kingma and Welling (2013)).
The AIR (Eslami et al. (2016)) framework uses a VAE-based hierarchical probabilistic model and marks a milestone in unsupervised scene understanding. In AIR, the latent variables are structured into groups, one for each discovered object, each of which consists of “what,” “where,” and “presence” variables. A recurrent neural network is used in the inference model to produce these latent variables, and a decoder network decodes the “what” variables of each object in the generation model. A spatial transformer network (Jaderberg et al. (2015)) is used for rendering.
Because AIR attends one object at a time, it does not scale well to scenes that contain many objects. SPAIR (Crawford and Pineau (2019)) attempted to address this issue by replacing the recurrent network with a convolutional network that follows a spatially invariant assumption. Similar to YOLO (Redmon et al. (2016)), in SPAIR, the locations of objects are specified relative to local grid cells.
Scene-mixture models such as MONet (Burgess et al. (2019)), IODINE (Greff et al. (2019)), and GENESIS (Engelcke et al. (2019)) perform segmentation instead of explicitly finding the location of objects. SPACE (Lin et al. (2020)) employs a combination of both methods. It consists of a spatial-attention model for the foreground and a scene-mixture model for the background.
Several methods have also been proposed for unsupervised clustering, including AAE, GMVAE, and IIC. AAE combines the ideas of generative adversarial networks and variational inference. GMVAE uses a Gaussian mixture model as a prior distribution. In IIC, objects are clustered by maximizing the mutual information of pairs of images. All of them show promising results on unsupervised clustering.
GMAIR incorporates a Gaussian mixture model for clustering, similar to the GMVAE framework (we also refer to a blog post by Rui Shu, http://ruishu.io/2016/12/25/gmvae/). It is worth noting that our approach may simply be one choice among many available options. Unlike previous research, our main contribution is to show the feasibility of performing clustering and localization simultaneously. Moreover, our method provides a simple and intuitive way to analyze the mechanics of the detection process.
The experiments were divided into three parts: a) the analysis of “what” representation and clustering along with the iterations, b) image generation, and c) quantitative evaluation of the models.
We evaluate the models on two datasets:
Fruit2D: A dataset collected from a real-world game. The scenes contain several types of fruit of various sizes. Small and large objects differ greatly in both number and size: the largest type of object is much larger than the smallest type, and objects of the smallest size are far more numerous than objects of the largest size. These settings make it difficult to perform localization and clustering.
In the experiments, we compared GMAIR to two models, SPAIR and SPACE, both of which achieve state-of-the-art localization performance in unsupervised object detection. To obtain clustering results for the compared models, separate Gaussian mixture models are fitted to the “what” latent variables they generate. The number of clusters and the number of Monte Carlo samples are kept fixed across all experiments unless otherwise stated. We present the details of the models in Appendix B.
It is worth mentioning that the model sometimes successfully locates an object but encloses it in an overly large box. In that case, the IoU between the ground-truth box and the predicted box will be small, and the prediction will therefore not be counted as a correct bounding box when calculating AP. We fix this issue by removing the empty area in the generated glimpses to obtain the real size of the predicted boxes.
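One way to realize this box tightening is to find the extent of the non-empty pixels inside each generated glimpse and rescale it back to image coordinates. The sketch below assumes the glimpse comes with a single-channel opacity map, that a box is given as `(y, x, h, w)` for the top-left corner and size, and that a fixed threshold separates empty from non-empty pixels. None of these details are confirmed by the text.

```python
import numpy as np


def tight_extent(alpha, threshold=0.1):
    """Extent of the non-empty pixels in a glimpse opacity map.

    Returns (y_min, x_min, y_max, x_max) in glimpse pixels with exclusive
    maxima, or None if no pixel exceeds the (assumed) threshold.
    """
    mask = np.asarray(alpha) > threshold
    if not mask.any():
        return None
    rows = np.where(mask.any(axis=1))[0]
    cols = np.where(mask.any(axis=0))[0]
    return int(rows[0]), int(cols[0]), int(rows[-1]) + 1, int(cols[-1]) + 1


def shrink_box(box, alpha, threshold=0.1):
    """Shrink an image-space box (y, x, h, w) to the non-empty glimpse area."""
    extent = tight_extent(alpha, threshold)
    if extent is None:
        return box  # nothing to tighten
    y, x, h, w = box
    glimpse_h, glimpse_w = np.asarray(alpha).shape
    scale_y, scale_x = h / glimpse_h, w / glimpse_w
    y0, x0, y1, x1 = extent
    return (y + y0 * scale_y, x + x0 * scale_x,
            (y1 - y0) * scale_y, (x1 - x0) * scale_x)
```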
We conducted the experiments using the MultiMNIST dataset. We ran GMAIR for 440k iterations and observed the change in the average precision (AP) of bounding boxes and in the accuracy (ACC) and normalized mutual information (NMI) of clustering up to 100k iterations. We also visualized the “what” latent variables in the latent space during the process, as shown in Fig. 2. Although all values continued to increase after 100k iterations, the visualization results were similar to those at 100k iterations. For completeness, we report the results from 100k to 440k iterations in Appendix D. Details of calculating the AP, ACC, and NMI are discussed in Appendix C.
The results showed that at an early stage (~10k iterations) of training, the model can already locate objects well in terms of AP (Fig. 2(a)). At the same time, the “what” representations of objects were still evolving, and the clustering results (Fig. 2(b)) were poor in terms of ACC and NMI; the digits were blurry in Fig. 2(f). After 50k and 100k iterations of training, the clustering effect of the “what” latent variables was increasingly apparent, and the digits were clearer (Fig. 2(g), 2(h)). The clustering results in terms of ACC and NMI improved accordingly at 50k and 100k iterations (Fig. 2(c), 2(d)).
It should be noted that even if the clustering effect of the “what” latent variables is sufficiently strong, the model may fail to locate the centers of the clusters (for example, the large cluster in light red in Fig. 2(d)), leading to poor clustering results. In the worst case, the model may learn to make the parameters of all Gaussian components converge to the same values, so that the Gaussian mixture model degenerates to a single Gaussian distribution, resulting in a very poor clustering result. In general, we found that this phenomenon usually occurs at the early stage of training and can be avoided by adjusting the learning rates of the relevant modules and the coefficients of the loss functions.
Figure 2: “What” representation and cluster analysis. (a) Average precision (AP), accuracy (ACC), and normalized mutual information (NMI) during training. (b-d) “What” latent space visualized by t-SNE (Van der Maaten and Hinton (2008)) at 10k, 50k, and 100k iterations, respectively. Each small dot represents a sample of $z^{what}$, and different colors represent the ground-truth categories of the corresponding objects. The large dots are the $\mu_k$ described in Sec. 2.1, and each of these can be seen as the center of a cluster. The closures represent the clustering results: each is the closure of a subset of the points assigned to the $k$-th cluster that lie closest to $\mu_k$. The colors of $\mu_k$ and the closures are decided by a matching algorithm such that a maximum number of samples are correctly classified to the ground-truth label. (e) Sample of the original image. (f-h) Samples of the generated image at 10k, 50k, and 100k iterations, respectively.
It is expected that $\mu_k$ represents the average feature of the $k$-th type of objects, and that the latent variable $z_i^{what}$ can be decomposed into:

$$z_i^{what} = \mu_k + z_i^{local}$$

if the $i$-th object is in the $k$-th category, where $z_i^{local}$ represents the local feature of the object. By altering $\mu_k$ or $z_i^{local}$, we should obtain new objects that belong to other categories or to the same category with different styles, respectively. In the experiment, we altered $\mu_k$ and $z_i^{local}$ and observed the generated images for each object, as shown in Fig. 3. In Fig. 3(a), objects in each cluster correspond to a type of digit, which is exactly what we expected (except for digit 8 in column 3). In Fig. 3(b), categories with a large number of objects are grouped into multiple clusters, while categories with a small number are grouped into one cluster. This is due to the significant difference in number between the various types. In general, however, the objects in a cluster come from a single category.
The structure of GMAIR ensures its ability to control the object categories, object styles, and positions of each object in the generated images by altering $\mu_k$, $z_i^{local}$, and $z_i^{where}$, respectively. Examples are shown in Fig. 4.
This could provide a new approach for tasks such as style transfer, image generation, and data augmentation. Note that previous methods such as AIR, SPAIR, and their variants can also obtain similar results, but we achieve them at a finer granularity.
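The category and style edits described in this section can be illustrated with simple vector arithmetic, assuming the “what” code of an object splits additively into a cluster mean plus a local (style) residual. The helper names below are hypothetical, and the edited code would still have to be passed through the glimpse decoder to produce an image.

```python
import numpy as np


def local_feature(z_what, cluster_means, k):
    """Residual of a "what" code relative to the mean of cluster k."""
    means = np.asarray(cluster_means, dtype=float)
    return np.asarray(z_what, dtype=float) - means[k]


def change_category(z_what, cluster_means, k_old, k_new):
    """Keep the local (style) residual but swap in another cluster mean."""
    means = np.asarray(cluster_means, dtype=float)
    return means[k_new] + local_feature(z_what, means, k_old)


def change_style(cluster_means, k, new_local):
    """Keep the category (cluster mean) but use a different local residual."""
    means = np.asarray(cluster_means, dtype=float)
    return means[k] + np.asarray(new_local, dtype=float)
```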
We quantitatively evaluate the models in terms of the AP of bounding boxes and the ACC and NMI of the clusters; the results are listed in Table 2. In the first part, we summarize some results of state-of-the-art models for unsupervised clustering on the MNIST dataset for comparison. In the second and third parts, we compare GMAIR to state-of-the-art models for unsupervised object detection on the MultiMNIST and Fruit2D datasets, respectively. The clustering results of SPAIR and SPACE are obtained by Gaussian mixture models (GMMs). The results show that GMAIR achieves competitive results on both clustering and localization.
|Model||Dataset||AP (%, IoU=0.5)||ACC (%)||NMI (%)|
|SPAIR + GMM||MultiMNIST|
|SPACE + GMM||MultiMNIST|
|SPAIR + GMM||Fruit2D|
|SPACE + GMM||Fruit2D|
We introduce GMAIR, which combines spatial attention and a Gaussian mixture, such that it can locate and cluster unseen objects simultaneously. We analyze the “what” latent variables and clustering process, provide examples of GMAIR application to the task of image generation, and evaluate GMAIR quantitatively compared with SPAIR and SPACE.
This work was partially supported by the Research and Development Projects of Applied Technology of Inner Mongolia Autonomous Region, China under Grant No. 201802005, the Key Program of the National Natural Science Foundation of China under Grant No. 61932014, and Pudong New Area Science & Technology Development Fundation under Grant No. PKX2019-R02. Yao Shen is the corresponding author.
Spatially invariant unsupervised object detection with convolutional neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3412–3420, 2019.
Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6569–6578, 2019.
Cgmvae: Coupling gmm prior and gmm estimator for unsupervised clustering and disentanglement.IEEE Access, 2021.
Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016.
The regularization term in Eqn. 4 can be further expanded as follows:
Continue to expand Eqn. A:
By the definition of Kullback–Leibler divergence, the four terms in the RHS of Eqn. A are indeed
respectively. Therefore, we complete the proof of Eqn. 5.
In the implementation, we model the discrete variables $z^{pres}$ and $z^{cat}$ using the Gumbel-Softmax approximation (Jang et al. (2016)). Therefore, all latent variables can be sampled in a differentiable way using the reparameterization trick.
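For readers unfamiliar with the Gumbel-Softmax approximation, a minimal NumPy sketch of drawing one relaxed one-hot sample is given below. It shows only the sampling rule; the temperature value, the clipping of the uniform noise, and the function name are illustrative assumptions. In training, an automatic-differentiation framework would be used so that gradients flow through the logits.

```python
import numpy as np


def gumbel_softmax_sample(logits, temperature=1.0, rng=None):
    """Draw a relaxed one-hot sample from a categorical distribution.

    logits: unnormalized log-probabilities with the classes on the last axis.
    Lower temperatures give samples closer to exact one-hot vectors.
    """
    rng = np.random.default_rng() if rng is None else rng
    logits = np.asarray(logits, dtype=float)
    # Gumbel(0, 1) noise; the lower bound avoids log(0).
    u = rng.uniform(low=1e-10, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))
    y = (logits + gumbel) / temperature
    # Numerically stable softmax over the class axis.
    y = y - y.max(axis=-1, keepdims=True)
    e = np.exp(y)
    return e / e.sum(axis=-1, keepdims=True)
```

A binary variable can be handled the same way by treating it as a two-class categorical variable.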
Our code is available at https://github.com/EmoFuncs/GMAIR-pytorch.
Here, we describe the architecture of each module of GMAIR, as shown in Fig. 1. The backbone is a ResNet18 (He et al. (2016)) network with two deconvolution layers replacing the fully connected layer, as shown in Table 3. The pres-head, depth-head, and where-head are convolutional networks that differ only in the number of output channels, as shown in Table 4. The what-encoder and cat-encoder are multilayer networks, as shown in Table 5. Finally, the glimpse decoder is a deconvolutional network, as shown in Table 6.
For the other models, we make use of code from https://github.com/yonkshi/SPAIR_pytorch for SPAIR and https://github.com/zhixuan-lin/SPACE for SPACE. We keep most of the default configuration for both models, changing only the dimension of the “what” latent variables (for comparison) and the size of the base bounding box (for large objects).
During the testing phase, in order to obtain deterministic results, we use the value with the largest probability (density) for the latent variables instead of sampling them from the distributions. Specifically, we use the most probable value for discrete latent variables and the mean for Gaussian latent variables.
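As a small illustration of this deterministic test-time rule, the sketch below takes the mode of each posterior: a 0.5 threshold for the Bernoulli presence variable, an argmax for the categorical variable, and the mean for the Gaussian variable. The argument names and shapes are assumptions.

```python
import numpy as np


def deterministic_latents(pres_prob, cat_probs, what_mean):
    """Pick the most probable value of each latent instead of sampling.

    pres_prob: (N,) presence probabilities.
    cat_probs: (N, C) category probabilities.
    what_mean: (N, A) posterior means of the "what" variables.
    """
    # Mode of a Bernoulli distribution: 1 if p > 0.5, else 0.
    z_pres = (np.asarray(pres_prob, dtype=float) > 0.5).astype(float)
    # Mode of a categorical distribution: one-hot vector at the argmax.
    cat_probs = np.asarray(cat_probs, dtype=float)
    z_cat = np.eye(cat_probs.shape[-1])[cat_probs.argmax(axis=-1)]
    # Mode of a Gaussian distribution: its mean.
    z_what = np.asarray(what_mean, dtype=float)
    return z_pres, z_cat, z_what
```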
The value of AP is calculated at the threshold IoU = 0.5 using the calculation method from VOC (Everingham et al. (2010)). Before calculating the ACC and NMI of the clusters, we filter out the incorrect bounding boxes. A predicted box is correct iff there is a ground-truth box whose IoU with it exceeds the threshold, and the class of a correct predicted box is taken to be the class of the ground-truth box that maximizes the IoU. After filtering, all correct predicted boxes are used for the calculation of ACC and NMI. Note that there are still many ways to assign each predicted category to a real category when calculating the value of ACC. Among all of them, we select the one that maximizes ACC, following Dilokthanakul et al. (2016). The formulas for the calculation of ACC and NMI are shown in Eqn. 16 and Eqn. 17:
where and are respectively the ground-truth categories and predicted categories for all correct boxes, and are the number of clusters and real classes, and are the entropy and mutual information function, respectively.
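The evaluation procedure above can be sketched in plain Python. Several details are assumptions rather than facts from the text: boxes are given as `(x_min, y_min, x_max, y_max)`, the correctness threshold is 0.5, the ACC-maximizing assignment is taken over unconstrained cluster-to-class mappings (which reduces to a majority vote per cluster), and NMI is normalized by the arithmetic mean of the two entropies. Other normalizations of NMI exist and may be the one used in Eqn. 17.

```python
import math
from collections import Counter


def iou(a, b):
    """Intersection over union of two boxes (x_min, y_min, x_max, y_max)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def match_boxes(pred_boxes, gt_boxes, gt_labels, threshold=0.5):
    """Return (pred_index, gt_label) for every correct predicted box."""
    matches = []
    for i, pred in enumerate(pred_boxes):
        if not gt_boxes:
            break
        best = max(range(len(gt_boxes)), key=lambda j: iou(pred, gt_boxes[j]))
        if iou(pred, gt_boxes[best]) > threshold:
            matches.append((i, gt_labels[best]))
    return matches


def clustering_accuracy(true_labels, pred_clusters):
    """Best accuracy over all cluster-to-class mappings (majority vote)."""
    by_cluster = {}
    for y, c in zip(true_labels, pred_clusters):
        by_cluster.setdefault(c, Counter())[y] += 1
    correct = sum(cnt.most_common(1)[0][1] for cnt in by_cluster.values())
    return correct / len(true_labels)


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * math.log(c * n / (pa[x] * pb[y]))
               for (x, y), c in pab.items())


def nmi(true_labels, pred_clusters):
    """Mutual information normalized by the mean of the two entropies."""
    h = entropy(true_labels) + entropy(pred_clusters)
    if h == 0:
        return 1.0  # both labelings are constant
    return 2.0 * mutual_information(true_labels, pred_clusters) / h
```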
|resnet||ResNet18 (w/o fc)|
|deconv layer 1||Deconv||ReLU/BN|
|deconv layer 2||Deconv||ReLU/BN|
|Base bbox size|
|Loss Coef. of||1|
|Loss Coef. of|
|Loss Coef. of||1|
|Loss Coef. of||1|
|Loss Coef. of||8,16|
|Loss Coef. of||1|
|Loss Coef. of||1|
The graphs of “what” representation after 100k iterations are shown in Fig. 5.