Deep Unsupervised Clustering with Clustered Generator Model

11/19/2019 ∙ by Dandan Zhu, et al. ∙ Shanghai Jiao Tong University, Stevens Institute of Technology

This paper addresses the problem of unsupervised clustering, which remains one of the most fundamental challenges in machine learning and artificial intelligence. We propose the clustered generator model for clustering, which contains both continuous and discrete latent variables. The discrete latent variables model the cluster labels while the continuous ones model variations within each cluster. The learning of the model proceeds in a unified probabilistic framework and incorporates unsupervised clustering as an inner step, without the need for an extra inference model as in existing variational-based models. The learned latent variables serve as both an embedding of the observed data and a latent representation of the data distribution. Our experiments show that the proposed model can achieve competitive unsupervised clustering accuracy and can learn disentangled latent representations to generate realistic samples. In addition, the model can be naturally extended to per-pixel unsupervised clustering, which remains largely unexplored.




1 Introduction

Clustering, one of the central themes in data understanding and analysis, has been widely studied in the realm of unsupervised learning. However, unsupervised clustering remains one of the most fundamental challenges in machine learning because of the high dimensionality of data and the high complexity of their hidden structures.

Long-established approaches for unsupervised clustering, including K-means [15] and the Gaussian Mixture Model (GMM) [3], are still the building blocks for numerous applications due to their efficiency and simplicity. However, their distance metrics are limited to the data space, making them ineffective for high-dimensional data such as images. Therefore, considerable effort has been put into obtaining a good feature embedding of data, usually of low dimensionality, for effective clustering [37]. However, the representation obtained by standalone data embedding typically cannot capture the latent structure and variation of the observed data, which may make it ineffective for clustering. We believe a good representation for clustering should also compactly represent the observed data distribution, so that it encodes all necessary characteristics of the observation.

Deep generative models (a.k.a. generator models) have shown great promise in learning latent representations for high-dimensional signals such as images and videos [32, 23, 11]. Generator models parameterized by deep neural networks specify a non-linear mapping from latent variables to observed data. As a compact probabilistic representation of knowledge, a generator model can embed high-dimensional data into a low-dimensional latent representation. Besides, it has been shown that the generator model is also capable of generating realistic images, indicating that the learned latent representation encodes the necessary and useful information of the data. Though powerful, the generator model has mainly been studied with a focus on generation tasks using continuous latent variables. While jointly learning latent representations and clustering is clearly desirable, developing and learning such a generator model for unsupervised clustering is still in its infancy, with only a few recent works [22, 9, 24].

In this paper, we develop a new model-based clustering algorithm using the generator model. Specifically, we propose to use a generator model with both discrete and continuous latent variables. The discrete latent variables model the cluster labels while the continuous ones model variations within each cluster. We term this model the clustered generator model to emphasize that it aims to achieve unsupervised clustering. By learning the clustered generator model, we naturally incorporate unsupervised clustering as an inference step for the discrete latent variables in an inner loop. As a result, useful latent representations (i.e., discrete and continuous latent variables) and unsupervised clustering are seamlessly integrated into a unified probabilistic learning framework. The experiments show that by learning the clustered generator model, we can achieve competitive or even state-of-the-art unsupervised clustering accuracy while obtaining realistic and disentangled latent representations.

1.1 Related Work and Contributions

Our work is closely related to unsupervised clustering as well as learning the generator models.

The most fundamental methods for clustering are the K-means [15] algorithm and the Gaussian Mixture Model (GMM) [3]. K-means assumes the data are centered around some centroids, and clusters are found by minimizing the distance to the centroid within each cluster. GMM, on the other hand, assumes that data are generated by a mixture of Gaussian distributions whose parameters are learned through the Expectation-Maximization (EM) algorithm. Without a proper representation, these methods are ineffective in handling high-dimensional data whose underlying structure can be highly non-linear. Spectral clustering and its variants [34, 31, 36, 38] further generalize the distance function for non-linear clusters, yet in general they can be computationally intensive and still result in unsatisfactory clustering on high-dimensional data.

Generator models have received increasing attention over the past few years as they can effectively capture the data distribution through latent representations. The Generative Adversarial Network (GAN) [10] and the Variational Auto-encoder (VAE) [23, 33] are two notable examples. These generative models have shown great potential in various applications such as image generation [32, 2, 13], image completion [11, 12], and disentangled latent representation [16, 7, 14]. However, integrating such a powerful knowledge representation tool with the unsupervised clustering task has not been thoroughly investigated.

Only a few existing works jointly consider learning the latent representation for data and the clustering task. Conditional-VAE (CVAE) [24] considers discrete latent variables for clustering and is closely related to our work, but it is primarily developed for supervised/semi-supervised learning where (part of) the data labels are given. HashGAN [5] combines a pair conditional Wasserstein GAN (PC-WGAN) with hash-encoded information. It mainly uses a new PC-WGAN conditioned on pairwise similarity information to generate an image that is closest to the real image. However, this method is also mainly used in supervised/semi-supervised tasks. Variational Deep Embedding (VaDE) [22] and the Gaussian Mixture Variational Auto-encoder (GMVAE) [9] combine GMM models and VAEs for unsupervised clustering. The Adversarial Auto-encoder (AAE) [27] can also be adapted to unsupervised clustering, but it needs a GAN to match the aggregated posterior of the latent representation with the prior of the VAE, requiring complex computation and additional network structures. Other related models include Deep Embedded Clustering (DEC) [37] and the more recent Invariant Information Clustering (IIC) [21], which specifically learn feature representations for clustering tasks. The latent representations learned by DEC and IIC are unable to represent the observed data distribution and therefore fail to generalize to other tasks (e.g., generation). While most of these variational-based models can achieve relatively impressive clustering accuracy, they need to design and learn a separate inference model for the cluster labels. Besides, due to the discrete nature of the cluster labels, variational learning cannot take advantage of the reparametrization trick and generally needs further approximation.

In contrast to recent models that use variational learning for latent representation and clustering, we introduce the novel clustered generator model for unsupervised clustering. Learning such a model naturally integrates the unsupervised clustering process as an inference inner loop, without utilizing additional networks or any further approximation.

Contributions of our paper are as follows:

  • We propose the clustered generator model for unsupervised clustering which includes discrete latent variables to model cluster labels and continuous latent variables to capture variations within each cluster.

  • We develop a novel learning algorithm for the clustered generator model in a probabilistic framework which naturally involves unsupervised clustering as an exact inference step, without any assisting models or approximations.

  • We conduct extensive experiments to show the effectiveness of the proposed model. Specifically, our model achieves competitive unsupervised clustering accuracy on large-scale image datasets and obtains reasonably good per-pixel unsupervised clustering, a task that has remained largely unexplored. Besides, our model can obtain disentangled latent representations, as indicated by its realistic generation.

2 Model and Learning Algorithm

In this section, we describe the details of the model and the corresponding inference and learning algorithm.

2.1 Clustered Generator Model

Suppose $x$ is the observed data of dimension $D$. The generator model [10] assumes that the observation $x$ is generated by a latent variable $z$ of dimension $d$:

$$z \sim \mathcal{N}(0, I_d), \qquad x = g_\theta(z) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2 I_D),$$

where $\epsilon$ is the noise and is independent of $z$, and $g_\theta$ is the top-down neural network with parameters $\theta$. In general, the latent variable $z$ is low-dimensional (i.e., $d < D$) and is learned to (1) embed the high-dimensional data $x$ in a low-dimensional latent space, and (2) represent the data distribution of $x$ through a generative model that generates realistic samples.

Traditional generator models have been shown to be effective in image generation [32, 2, 13]. However, they only deal with continuous latent variables, making them ineffective in clustering tasks, which are discrete in nature. Therefore, we propose to use a generator model with both discrete and continuous latent variables for unsupervised clustering.

Suppose we have $K$ clusters. The observed data $x$ is now generated by not only the continuous latent variables $z$ but also the discrete latent variables $y$ of dimension $K$, which represent the cluster labels:

$$y \sim \mathrm{Cat}(\pi), \qquad z \sim \mathcal{N}(0, I_d), \qquad x = g_\theta(z, y) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2 I_D),$$

where $\mathrm{Cat}(\pi)$ denotes the categorical distribution with $\pi = (\pi_1, \dots, \pi_K)$ being the prior probability for the $K$ clusters. $\epsilon$ is the noise of the model and is independent of $z$ and $y$. We call such a model the clustered generator to emphasize that it incorporates unsupervised clustering naturally inside its learning framework. In this way, the latent variables $z$ and $y$ serve as both an embedding of the observed data, which is used for clustering, and a latent representation, which represents the data distribution of $x$. A similar form has been used in [24]. However, that model is not developed for unsupervised clustering. Besides, the representation it learns for clustering is different from the latent representation learned for the data distribution, which can be ineffective in both realms. We elaborate on this point in the next section and in the experiments.
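The generative process described above can be illustrated with a small sampling sketch. This is a toy under stated assumptions, not the authors' network: the generator is a hypothetical per-cluster linear map `centers[y] + slopes[y] * z` with scalar `x` and `z`, standing in for the top-down neural network, and the function and parameter names are made up for illustration.

```python
import random

def sample_clustered_generator(prior, centers, slopes, sigma, rng):
    """Draw one (x, z, y) triple from a toy clustered generator.

    Assumed toy generator: g(z, y) = centers[y] + slopes[y] * z, scalar z and x.
    """
    # y ~ Categorical(prior): the discrete cluster label
    y = rng.choices(range(len(prior)), weights=prior, k=1)[0]
    # z ~ N(0, 1): the continuous within-cluster variation
    z = rng.gauss(0.0, 1.0)
    # x = g(z, y) + eps, with eps ~ N(0, sigma^2) independent of z and y
    x = centers[y] + slopes[y] * z + rng.gauss(0.0, sigma)
    return x, z, y

rng = random.Random(0)
samples = [
    sample_clustered_generator([0.5, 0.5], [-5.0, 5.0], [1.0, 0.5], 0.3, rng)
    for _ in range(1000)
]
```

Here the label picks which "mode" of the data is generated, while `z` moves the sample around within that mode, mirroring the division of labour between the discrete and continuous latent variables.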

2.2 Inference and Learning

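The clustered generator model defines the generation process as $(z, y) \to x$. Therefore, the complete-data model can be defined as $p_\theta(x, z, y) = p(z)\, p(y)\, p_\theta(x \mid z, y)$. If we observe a set of training data $\{x_i, i = 1, \dots, N\}$ coming from the true but unknown distribution $p_{\mathrm{data}}(x)$, then the learning and inference of the clustered generator model can be accomplished by maximizing the observed-data log-likelihood:

$$L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \log p_\theta(x_i) = \frac{1}{N} \sum_{i=1}^{N} \log \sum_{y} \int p_\theta(x_i, z, y)\, dz.$$

The model parameters $\theta$ can be learned by gradient-based optimization, which amounts to evaluating:

$$\frac{\partial}{\partial \theta} \log p_\theta(x) = \mathbb{E}_{p_\theta(z, y \mid x)} \left[ \frac{\partial}{\partial \theta} \log p_\theta(x, z, y) \right]. \tag{1}$$

However, the expectation in Eqn. 1 is in general analytically intractable. For a given observation example $x$, we obtain fair samples from the posterior distribution, i.e., $(z, y) \sim p_\theta(z, y \mid x)$, using a Gibbs sampler which iteratively performs conditional sampling on the latent variables $z$ and $y$, i.e., $z \sim p_\theta(z \mid x, y)$ and $y \sim p_\theta(y \mid x, z)$.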

2.2.1 Inference on continuous $z$:

The continuous latent variable $z$ is sampled based on the posterior distribution given $y$ fixed:

$$p_\theta(z \mid x, y) \propto p(z)\, p_\theta(x \mid z, y). \tag{2}$$

Fair samples can be drawn using MCMC techniques like HMC or Langevin dynamics [29]. Langevin dynamics is used in this work because it can help navigate the landscape of the latent space more thoroughly and effectively. Specifically, we have:

$$z_{\tau+1} = z_\tau + \frac{s^2}{2} \frac{\partial}{\partial z} \log p_\theta(x, z_\tau, y) + s\, e_\tau, \tag{3}$$

where $s$ is the step size and $\tau$ is the time step of the Langevin inference. $e_\tau \sim \mathcal{N}(0, I_d)$ is the random noise injected in each iteration. The log-joint can be evaluated as:

$$\log p_\theta(x, z, y) = -\frac{1}{2\sigma^2} \| x - g_\theta(z, y) \|^2 - \frac{1}{2} \| z \|^2 + C, \tag{4}$$

where $C$ is a constant which does not involve $z$. The variable $\sigma$ is the pre-specified standard deviation of our model. Note that $y$ is fixed to its currently sampled value during the learning iteration. It has been shown that the dynamics have $p_\theta(z \mid x, y)$ as their stationary distribution. Therefore, a fair sample of $z$ from $p_\theta(z \mid x, y)$ can be ensured.

In fact, from Eqn. 2, we can see that for a given observation example $x$, the inference on $z$ amounts to finding a suitable latent representation that resembles the observation, assuming it comes from the specific cluster indicated by $y$.
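As a rough illustration of this inference step, the sketch below runs Langevin dynamics on a scalar `z` with the cluster label held fixed. It assumes a standard normal prior on `z`, Gaussian noise with standard deviation `sigma`, and a hypothetical linear toy generator `g(z) = a + b * z` in place of the neural network; the linear choice is made only because it gives a closed-form posterior mean to compare against. The `add_noise` flag is an addition for demonstration: switching it off turns the update into plain gradient ascent to the posterior mode.

```python
import random

def grad_log_joint_z(x, z, a, b, sigma):
    """d/dz of log p(x, z, y) for the toy generator g(z) = a + b * z."""
    residual = x - (a + b * z)
    # reconstruction term pushed back through g, plus the N(0, 1) prior term
    return residual * b / sigma ** 2 - z

def langevin_z(x, a, b, sigma, steps, step_size, rng, z0=0.0, add_noise=True):
    """Run Langevin updates on z with the cluster (hence a and b) held fixed."""
    z = z0
    for _ in range(steps):
        z += 0.5 * step_size ** 2 * grad_log_joint_z(x, z, a, b, sigma)
        if add_noise:
            z += step_size * rng.gauss(0.0, 1.0)
    return z

def posterior_mean_z(x, a, b, sigma):
    """Closed-form posterior mean, available only because the toy g is linear."""
    return b * (x - a) / (sigma ** 2 + b ** 2)

rng = random.Random(0)
z_mode = langevin_z(1.0, 0.0, 1.0, 0.3, steps=500, step_size=0.1, rng=rng,
                    add_noise=False)
z_sample = langevin_z(1.0, 0.0, 1.0, 0.3, steps=200, step_size=0.1, rng=rng)
```

With a real generator network the gradient would come from back-propagation rather than a hand-written formula, and no closed-form posterior would be available for comparison.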

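2.2.2 Inference on discrete $y$:

The discrete latent variable $y$ is sampled based on the posterior distribution given $z$ fixed:

$$p_\theta(y \mid x, z) \propto p(y)\, p_\theta(x \mid z, y). \tag{5}$$

Suppose we have $K$ clusters, then:

$$p_\theta(y = k \mid x, z) = \frac{\pi_k\, p_\theta(x \mid z, y = k)}{\sum_{j=1}^{K} \pi_j\, p_\theta(x \mid z, y = j)}, \tag{6}$$

where

$$p_\theta(x \mid z, y = k) \propto \exp\left( -\frac{1}{2\sigma^2} \| x - g_\theta(z, y = k) \|^2 \right), \tag{7}$$

and $\pi_k$ is the prior probability of the $k$-th cluster, which is pre-specified.

In fact, from Eqn. 5, the inference on $y$ is based on the true posterior distribution and essentially estimates the probability of the observed $x$ falling into each cluster based on the current latent representation $z$. This is essentially unsupervised clustering based on the current representation $z$ and the model $\theta$. Existing variational-based models [24, 22] have to design and learn a separate inference model for $y$, i.e., $q_\phi(y \mid x)$, in order to approximate the true posterior distribution, which can be ineffective as demonstrated in our experiments.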
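The cluster posterior can be computed by scoring the observation against the reconstruction for each candidate label and normalizing. The sketch below assumes Gaussian noise with standard deviation `sigma`, so that the log-likelihood is a scaled squared reconstruction error; `cluster_posterior` and `toy_generator` are hypothetical names, and the log-sum-exp shift is a standard numerical precaution rather than something stated in the paper.

```python
import math

def cluster_posterior(x, z, prior, generator, sigma):
    """Return [p(y = k | x, z) for k in range(K)]; x and g(z, k) are float lists."""
    log_w = []
    for k, pi_k in enumerate(prior):
        recon = generator(z, k)
        sq_err = sum((xi - gi) ** 2 for xi, gi in zip(x, recon))
        # log prior + Gaussian log-likelihood (up to a constant shared by all k)
        log_w.append(math.log(pi_k) - sq_err / (2.0 * sigma ** 2))
    shift = max(log_w)  # log-sum-exp shift avoids underflow in exp
    weights = [math.exp(v - shift) for v in log_w]
    total = sum(weights)
    return [w / total for w in weights]

def toy_generator(z, k):
    """Hypothetical 2-D generator: one center per cluster, shifted slightly by z."""
    centers = [[-1.0, -1.0], [1.0, 1.0]]
    return [c + 0.1 * z for c in centers[k]]

probs = cluster_posterior([0.9, 1.1], 0.0, [0.5, 0.5], toy_generator, 0.3)
```

A cluster label can then be sampled from `probs`, or taken as the index of its largest entry for a MAP assignment.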

2.2.3 Learning model parameter $\theta$:

For a given observed example $x$, after obtaining the inferred continuous latent variable $z$ using Eqn. 3 and the discrete latent variable $y$ using Eqn. 6, we use the sampled $z$ and $y$ to learn the clustered generator model by stochastic gradient descent as in Eqn. 1. More precisely,

$$\frac{\partial}{\partial \theta} \log p_\theta(x, z, y) = -\frac{\partial}{\partial \theta} \frac{1}{2\sigma^2} \| x - g_\theta(z, y) \|^2 = \frac{1}{\sigma^2} \big( x - g_\theta(z, y) \big)^\top \frac{\partial}{\partial \theta} g_\theta(z, y). \tag{8}$$

The whole algorithm iterates the above three steps until convergence. Note that [11] shares a similar alternating nature with our method. However, their model does not consider the discrete latent variable and is mainly developed for image generation. See Algorithm 1 for a summary of the learning and inference of our model.

Note that the whole algorithm can be efficient and scales well to relatively large datasets, as shown in our experiments. Although the Langevin sampling on $z$ involves multiple steps, the gradient in Eqn. 3 shares the same chain-rule computation as in Eqn. 8, which greatly reduces the computational burden.
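The shared chain-rule computation can be made explicit with a short derivation. It is written under the assumptions used throughout this section (Gaussian noise with standard deviation $\sigma$, a standard normal prior on $z$, and generator $g_\theta(z, y)$); the exact notation is a reconstruction and may differ from the authors'.

```latex
\begin{aligned}
\log p_\theta(x, z, y)
  &= -\frac{1}{2\sigma^2}\,\lVert x - g_\theta(z, y)\rVert^2
     - \frac{1}{2}\lVert z\rVert^2 + \log p(y) + \text{const},\\[4pt]
\frac{\partial}{\partial z}\log p_\theta(x, z, y)
  &= \frac{1}{\sigma^2}
     \Big(\frac{\partial g_\theta(z, y)}{\partial z}\Big)^{\!\top}
     \big(x - g_\theta(z, y)\big) - z,\\[4pt]
\frac{\partial}{\partial \theta}\log p_\theta(x, z, y)
  &= \frac{1}{\sigma^2}
     \Big(\frac{\partial g_\theta(z, y)}{\partial \theta}\Big)^{\!\top}
     \big(x - g_\theta(z, y)\big).
\end{aligned}
```

Both gradients back-propagate the same residual $x - g_\theta(z, y)$ through the same network; they differ only in whether the final Jacobian is taken with respect to the input $z$ or the weights $\theta$.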

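Algorithm 1 Learning and inference algorithm
Input: (1) training examples $\{x_i\}_{i=1}^{N}$; (2) cluster number $K$; (3) cluster prior probability $\pi$; (4) number of Langevin steps $l$ and learning iterations $T$.
Output: (1) learned parameters $\theta$; (2) inferred continuous latent variables $\{z_i\}$; (3) inferred discrete latent variables $\{y_i\}$.
  1: Let $t \leftarrow 0$, initialize $\theta$.
  2: Initialize $z_i$ for $i = 1, \dots, N$.
  3: Initialize $y_i$ for $i = 1, \dots, N$.
  4: repeat
  5:    Inference on $z$: For each observed $x_i$, starting from the current $z_i$ and $y_i$, run $l$ steps of Langevin dynamics to update $z_i$ as in Eqn. 3.
  6:    Inference on $y$: For each observed $x_i$, based on the current $z_i$, sample $y_i$, or obtain its Maximum a Posteriori (MAP) estimate, using the estimated probability as in Eqn. 6.
  7:    Learning $\theta$: Update $\theta_{t+1} \leftarrow \theta_t + \eta_t L'(\theta_t)$ with learning rate $\eta_t$, where $L'(\theta_t)$ is computed according to Eqn. 8.
  8:    Let $t \leftarrow t + 1$.
  9: until $t = T$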
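The alternating scheme of Algorithm 1 can be sketched end to end on a toy problem. Everything specific here is an assumption made for illustration: scalar data, a hypothetical generator `g(z, k) = centers[k] + slope * z` whose only learnable parameters are the per-cluster centers, a MAP assignment for the label instead of sampling, and arbitrary hyperparameters. It is meant to show the control flow, not to reproduce the authors' implementation.

```python
import math
import random

def g(z, k, centers, slope):
    """Toy generator g(z, y=k) = centers[k] + slope * z (scalar x and z)."""
    return centers[k] + slope * z

def learn(xs, num_clusters, prior, sigma=0.3, iters=50, langevin_steps=10,
          step_size=0.1, lr=0.02, slope=0.5, seed=0):
    rng = random.Random(seed)
    # theta: one center per cluster, initialised inside the data range
    centers = [rng.uniform(min(xs), max(xs)) for _ in range(num_clusters)]
    zs = [0.0 for _ in xs]                            # continuous latents
    ys = [rng.randrange(num_clusters) for _ in xs]    # discrete latents
    for _ in range(iters):
        grads = [0.0] * num_clusters
        for i, x in enumerate(xs):
            # Step 1: Langevin dynamics on z_i with y_i held fixed
            z = zs[i]
            for _ in range(langevin_steps):
                grad_z = (x - g(z, ys[i], centers, slope)) * slope / sigma ** 2 - z
                z += 0.5 * step_size ** 2 * grad_z + step_size * rng.gauss(0.0, 1.0)
            zs[i] = z
            # Step 2: posterior over y_i given z_i (log scale), MAP assignment
            log_w = [math.log(prior[k])
                     - (x - g(z, k, centers, slope)) ** 2 / (2.0 * sigma ** 2)
                     for k in range(num_clusters)]
            ys[i] = max(range(num_clusters), key=lambda k: log_w[k])
            # Accumulate d/d(centers[y_i]) of log p(x, z, y)
            grads[ys[i]] += (x - g(z, ys[i], centers, slope)) / sigma ** 2
        # Step 3: gradient step on theta using the inferred (z, y)
        for k in range(num_clusters):
            centers[k] += lr * grads[k] / len(xs)
    return centers, zs, ys

data_rng = random.Random(1)
xs = ([data_rng.gauss(-3.0, 0.3) for _ in range(50)]
      + [data_rng.gauss(3.0, 0.3) for _ in range(50)])
centers, zs, ys = learn(xs, num_clusters=2, prior=[0.5, 0.5], iters=20)
```

Replacing the MAP assignment with a draw from the normalized weights would give the sampling variant of the label inference step.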

3 Experiments

In this section, we demonstrate the effectiveness of the proposed model through experimental results. First, to show the unsupervised clustering performance of the proposed model, we provide a quantitative comparison of the unsupervised clustering accuracy of our method with other state-of-the-art methods on three benchmark datasets (i.e., MNIST [25], SVHN [30], STL-10 [8]). Furthermore, to demonstrate that the proposed model can be adapted for inferring a 2D label map, we perform unsupervised clustering for per-pixel labels on three datasets (i.e., Facades [35], COCO-Stuff [4] and Potsdam [20]) and compare it with CVAE [24] and other state-of-the-art methods. To demonstrate that our proposed model can learn disentangled latent representations and generate realistic images, we perform image generation experiments on three benchmark datasets. Finally, we explore the effect of varying the number of clusters on clustering performance.

3.1 Datasets

To evaluate our method, we use six public datasets: MNIST, SVHN, STL-10, Facades, COCO-Stuff and Potsdam. Figure 1 shows examples from these datasets.
MNIST: This is a standard handwritten digits dataset. It consists of 60,000 training samples and 10,000 testing samples. Each image in this dataset consists of $28 \times 28$ pixels, each of which is represented by a gray value. We reshape each image to a 784-dimensional row vector.

SVHN: This dataset is obtained from house numbers in Google Street View images. All images in the dataset are color house-number images, including 73,257 digits for training, 26,032 digits for testing, and 531,131 extra training digits, for over 600,000 cropped images in total. We use the testing data to evaluate our unsupervised clustering and the rest of the data for model training.

Figure 1: Some examples of the six datasets. Panels: (a) MNIST, (b) SVHN, (c) STL-10, (d) Facades, (e) COCO-Stuff, (f) Potsdam.

STL-10: This is an image dataset containing 10 classes of objects, with 1,300 images per class (500 training images and 800 testing images). All images in the dataset are color images. We use the training images for model learning and the testing images for unsupervised clustering accuracy evaluation.
Facades: The Facades dataset [35] is assembled at the Center for Machine Perception and includes 606 rectified images of facades from various sources. It is divided into training, testing and validation sets. The facades are from cities around the world and cover different architectural styles. We mainly consider four labels: wall, doors, windows and decorations, the last of which contains roof, cornice and sill. We emphasize that the Facades dataset is commonly used for image-to-image translation [18], where the image is synthesized given the label map, whereas in this paper we aim to obtain the label map given the image.
COCO-Stuff: COCO-Stuff [4] is a challenging and diverse segmentation dataset containing “stuff” classes ranging from buildings to bodies of water. Following the procedure in [21], we use the variant with 15 coarse labels and 52k images, taking only images with at least 75% stuff pixels. COCO-Stuff-3 is a subset of COCO-Stuff with only sky, ground and plants labelled. All input images are shrunk, cropped to a fixed size and Sobel pre-processed as in [21].
Potsdam: Potsdam [20] contains 8550 RGBIR satellite images, of which 3150 are unlabelled. As in [21], we test the 6-label variant (roads and cars, vegetation and trees, buildings and clutter) as well as a 3-label variant (Potsdam-3). The construction of Potsdam-3 and the training/testing set preparation also follow [21].

Note that images from Facades, COCO-Stuff and Potsdam have been manually annotated; however, the annotations are not used in our model training and are only used as ground truth for evaluation.

Figure 2: Generated samples by our proposed method. Within each row one of the two latent variables (discrete or continuous) is held fixed, and within each column the other is held fixed. (a) Generated samples on the MNIST dataset. (b) Generated samples on the SVHN dataset. (c) Generated samples on the STL-10 dataset.

3.2 Evaluation Metric

Similar to the work of DEC [37], we use the unsupervised clustering accuracy (ACC) to evaluate the performance of the proposed method. It is defined as follows:

$$\mathrm{ACC} = \max_{m \in \mathcal{M}} \frac{\sum_{i=1}^{n} \mathbb{1}\{ l_i = m(c_i) \}}{n},$$

where $n$ is the total number of samples, $l_i$ is the ground-truth label and $c_i$ is the clustering assignment obtained by the model. $\mathcal{M}$ denotes the set of all possible one-to-one mappings between cluster assignments and labels. The Kuhn-Munkres algorithm [28] is used to find the best mapping. ACC ranges between 0 and 1, and a larger value indicates better unsupervised clustering performance.
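The metric can be made concrete with a small sketch. For readability it searches the one-to-one mappings by brute force over permutations, which is only practical for a handful of clusters; this is a stand-in for the Kuhn-Munkres algorithm named above, not an implementation of it. The function name and the assumption that labels and clusters are integers in `range(num_classes)` are choices made for this example.

```python
from itertools import permutations

def clustering_accuracy(labels, clusters, num_classes):
    """ACC: best fraction of matches over one-to-one cluster-to-label mappings."""
    # counts[c][l] = number of samples assigned to cluster c with true label l
    counts = [[0] * num_classes for _ in range(num_classes)]
    for label, cluster in zip(labels, clusters):
        counts[cluster][label] += 1
    best = 0
    # mapping[c] is the label assigned to cluster c under this candidate mapping
    for mapping in permutations(range(num_classes)):
        matched = sum(counts[c][mapping[c]] for c in range(num_classes))
        best = max(best, matched)
    return best / len(labels)

acc_perfect = clustering_accuracy([0, 0, 1, 1, 2, 2], [1, 1, 2, 2, 0, 0], 3)
acc_chance = clustering_accuracy([0, 0, 1, 1], [0, 1, 0, 1], 2)
```

The first call scores a clustering that is correct up to a renaming of the clusters, the second one that carries no information about the labels. For ten or more clusters the factorial search becomes expensive, which is where an assignment solver such as Kuhn-Munkres is needed.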

3.3 Implementation Details

Our implementation is based on the TensorFlow [1] framework. The experiments are all carried out on a workstation with an NVIDIA GeForce RTX 2080Ti and 1 TB RAM.

During the training process, the parameters of our algorithm are set as follows: we set the standard deviation of the noise vector to 0.3. In each learning iteration, we set the number of steps of Langevin dynamic sampling to 100. We performed learning iterations with learning rate 0.0002 and momentum 0.5.

The proposed clustered generator model mainly adopts the structure of a deconvolution-based generator, which is composed of multiple convolutional layers and deconvolution layers. A complete convolutional layer is composed of a convolution, a ReLU layer and a downsampling operation. A deconvolution layer consists of linear superposition, a ReLU layer and an upsampling operation. To make the training process more stable, we also use batch normalization [17]. The detailed structural information of our proposed clustered generator model will be given later and our experimental code will be released.

We use various convolutional structures to generate realistic images through our proposed learning algorithm. Below we describe in detail the network structure used for image generation on the MNIST dataset. The network structures for image generation on the other datasets (SVHN, STL-10) are similar.

The proposed network structure consists of 5 convolution layers and 5 deconvolution layers. In the convolution stage, the same convolution kernel size is used in each layer, and the stride from layer 1 to layer 5 is set to 1, 2, 2, 2, 2, respectively. In the deconvolution stage, the same kernel size is used in each deconvolution layer, with stride 2 from layer 6 to layer 9 and stride 1 on layer 10. We use the one-hot form of the discrete latent variable with dimension 10, and set the dimension of the continuous latent variables to 100.

K-means 10 53.49% 28.40%
AAE [27] 16 83.48% 80.01%
DEC [37] 10 84.30% 80.62% 11.90%
VaDE [22] 10 94.46% 84.45%
HashGAN [5] 10 96.50% 39.40%
CVAE [24] 10 82.26% 62.37% 58.25%
IIC [21] 10 99.2% 59.6%
Our method 10 98.35% 85.15% 75.30%
Table 1: Comparison of unsupervised clustering accuracy (ACC) for various methods on different datasets.
Figure 3: The impact of Langevin steps for unsupervised clustering in terms of ACC. More Langevin steps for inference indicate more accurate clustering.

3.4 Unsupervised Clustering

We now evaluate the model on the task of unsupervised clustering. We learn our model on the training sets of the benchmark datasets (MNIST, SVHN and STL-10) and evaluate its clustering performance on the corresponding testing sets. Given the test data, we infer the corresponding cluster labels using Eqn. 6. If the inference is accurate, then we would expect a competitive unsupervised clustering accuracy as indicated by ACC. We made a quantitative comparison of various clustering methods, and the results are shown in Table 1. Note that the CVAE [24] model is primarily developed for supervised/semi-supervised learning settings and we extend it to unsupervised clustering for a fair comparison. As can be seen from Table 1, all deep learning models (AAE [27], DEC [37], VaDE [22], HashGAN [5], IIC [21] and CVAE [24]) perform better than the traditional machine learning method (K-means [15]). Moreover, we achieve competitive unsupervised clustering accuracy compared with the state-of-the-art methods. Specifically, on the MNIST, SVHN and STL-10 datasets, our method achieves clustering accuracies of 98.35%, 85.15% and 75.30%, which exceed the CVAE method by 16.09, 22.78 and 17.05 percentage points, respectively. The improvement is largest on the more challenging SVHN and STL-10 datasets.

The competitive or superior clustering accuracy obtained indicates that the inference process of our model is more accurate than that of the existing variational-based models [22, 27, 24]. We argue this is because those variational models need carefully designed approximate recognition models for efficient inference. Our model, on the other hand, performs inference based on the posterior distribution in a unified probabilistic framework, which leads to better inference and clustering accuracy. It is worth noting that more steps of Langevin dynamics with Eqn. 3 render more accurate inference on the continuous $z$, which further improves the accuracy of the unsupervised clustering, as can be seen from Figure 3.

Figure 4: Qualitative comparison of our clustering results with the CVAE method on three datasets: Facade (top), COCO-Stuff-3 (middle) and Potsdam-3 (bottom). The first column is the testing image. The second column is the clustering result of our method. The third column is the clustering result of the CVAE method, and the ground-truth (GT) label map is shown in the last column. COCO-Stuff-3 considers labels: sky, vegetation and ground. The Potsdam-3 considers: vegetation, roads and buildings.

3.5 Per-pixel Unsupervised Clustering

In this section, we evaluate the ability of the model to accurately infer a 2D discrete latent map by performing unsupervised clustering tasks on three datasets: Facades, COCO-Stuff and Potsdam. To the best of our knowledge, only a few methods [21] currently attempt to perform per-pixel unsupervised clustering of an image. The main challenge is that the per-pixel clustering should conform to the underlying pixel-wise relations (e.g., consistency for neighbouring regions), which requires accurate inference. Our proposed model can obtain reasonably good per-pixel unsupervised clustering results.

Method COCO-Stuff-3 COCO-Stuff Potsdam-3 Potsdam
K-means 52.2% 14.1% 45.7% 35.3%
SIFT [26] 38.1% 20.2% 38.2% 28.5%
DeepCluster [6] 41.6% 19.9% 41.7% 29.2%
Co-Occurrence [19] 54.0% 24.3% 63.9% 44.9%
IIC [21] 72.3% 27.7% 65.1% 45.4%
CVAE [24] 62.4% 24.5% 61.9% 39.8%
Our method 73.3% 28.1% 66.3% 46.2%
Table 2: Comparison of unsupervised clustering accuracy (ACC) for various methods on different datasets. The accuracy numbers except CVAE and our model are from [21].

Unlike traditional clustering methods, we cluster each pixel on the label map. The traditional clustering setting, as in the previous experiment, assigns a single one-hot label vector to each image. Here we instead perform clustering in a two-dimensional pixel space. Specifically, we consider a one-hot representation for every pixel, based on which we perform the inference using Eqn. 5. To make a fair comparison with the CVAE model, all other settings (e.g., the number of labels in the datasets and the number of iterations of the unsupervised clustering algorithm) are kept the same except for the clustering method.

For a qualitative comparison, we present the cluster assignments obtained by our method and the CVAE method in the form of label maps and compare them with the ground-truth labels. The visualization of the per-pixel unsupervised clustering results is shown in Figure 4. We also quantitatively compare with CVAE and other related baseline models in terms of clustering accuracy (ACC) on the COCO-Stuff and Potsdam datasets. The preparation of the datasets follows the routine in [21] and the results are shown in Table 2. Note that the baseline models (SIFT [26], DeepCluster [6], Co-Occurrence [19]) do not directly learn a clustering function and require a further application of k-means to be used for image clustering. The most recent IIC [21] model can directly learn a 2D clustering map; however, it only learns the feature embedding and is unable to represent the observed data distribution, and therefore does not have the generation ability that our model shows in Sec. 3.6.

As shown in Figure 4, compared to the CVAE method, our approach better preserves the internal structure of buildings and objects and displays details more clearly. This is further supported by Table 2, where our model achieves competitive or better clustering accuracy.

3.6 Image Generation

Our model not only obtains a powerful data embedding that ensures accurate unsupervised clustering, but also learns disentangled latent representations to generate realistic samples. To demonstrate the effectiveness of our proposed model, we perform experiments on the MNIST, SVHN and STL-10 datasets. We train our proposed model on the three datasets and show that the learning and inference of the latent variables obtain disentangled latent representations of the data. To show this, we obtain generated samples from the learned clustered generator model by varying the two sets of latent variables in the following way:

(1) Firstly, we fix the discrete class $y$ and change the continuous latent variable $z$ within a certain range;

(2) Secondly, we fix the continuous latent variable $z$ and enumerate all possible values of the discrete class label $y$.

Figure 2 shows the generation results of our model on the three datasets MNIST, SVHN and STL-10. As can be seen from Figure 2, the images generated by our model are both realistic and diverse. Meanwhile, it can be clearly seen that if the cluster label is fixed, the generated samples have different styles and variations while maintaining their identity, indicating that the continuous latent variable effectively captures the variations within each cluster. On the other hand, changing the discrete latent variable changes the identity of the sample, indicating that it is effective for cluster label modeling. Therefore, the learned discrete and continuous latent variables form a disentangled latent representation.
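The two traversals above amount to evaluating the generator on a grid of label and continuous-code pairs. The sketch below shows that bookkeeping only; the row/column layout, the function name `generation_grid` and the scalar toy generator are all assumptions for illustration, and a real use would pass the trained network and image-shaped outputs.

```python
def generation_grid(generator, num_clusters, z_values):
    """Row k fixes the label y = k; column j fixes the continuous code z_values[j]."""
    return [[generator(z, k) for z in z_values] for k in range(num_clusters)]

def toy_generator(z, k):
    """Hypothetical scalar generator: the label sets the mode, z perturbs it."""
    return 10.0 * k + z

grid = generation_grid(toy_generator, num_clusters=3, z_values=[-1.0, 0.0, 1.0])
```

Reading along a row shows variation within one cluster, while reading down a column shows how the same continuous code is rendered under different cluster labels.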

Figure 5: Visual comparison of the clustering results by setting different numbers of clusters on the MNIST dataset: (a) k=6, (b) k=12.

3.7 The Impact of the Number of Clusters

The number of clusters is given a priori in our model, and is set to the number of classes for each dataset. To further investigate how different numbers of clusters affect our model, we conduct experiments on the MNIST dataset with different values, such as 6 and 12. The clustering results are shown in Figure 5. It can be seen from Figure 5 that if the number of clusters is smaller than the actual number of classes, digits with similar appearances are grouped together, such as digits 3, 6 and 5. If the number of clusters is larger than the actual number of classes, some digits are divided into subclasses based on visually identifiable attributes, such as slant and roundness. As can be seen from Figure 5, the upright and oblique 1 are divided into two clusters, and the 9 with two handwriting styles is also divided into two clusters.

4 Conclusion

In this paper, we propose the clustered generator model for the task of unsupervised clustering. The clustered generator model contains both the discrete latent variables which capture the cluster labels and the continuous latent variables which capture the variations within the clusters. We then develop the novel learning and inference algorithm for clustered generator in a unified probabilistic framework. Specifically, we iteratively infer the continuous and discrete latent variables in a Gibbs manner, then use the inferred variables to learn the clustered generator model. The learning can naturally incorporate the unsupervised clustering as an inference step without the need for extra assisting models for approximation. The latent variables learned can be served as both observed data embedding as well as latent representations for data distribution. The extensive experiments show both quantitatively and qualitatively the effectiveness of our proposed model.

The model can be adapted for semi-supervised learning when only a small portion of the labels is given. The model can also be generalized to a dynamic one by including a transition model for the latent variables. In addition, the number of clusters is pre-specified in the current work; it could instead be learned directly from data. We leave these as future directions.


The work is partially supported by DARPA XAI project N66001-17-2-4029.


  • [1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. (2016) Tensorflow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283. Cited by: §3.3.
  • [2] M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein gan. arXiv preprint arXiv:1701.07875. Cited by: §1.1, §2.1.
  • [3] C. M. Bishop (2006) Pattern recognition and machine learning (information science and statistics). Springer-Verlag, Berlin, Heidelberg. External Links: ISBN 0387310738 Cited by: §1.1, §1.
  • [4] H. Caesar, J. Uijlings, and V. Ferrari (2018) Coco-stuff: thing and stuff classes in context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1209–1218. Cited by: §3.1, §3.
  • [5] Y. Cao, B. Liu, M. Long, and J. Wang (2018-06) HashGAN: deep learning to hash with pair conditional wasserstein gan. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.1, §3.4, Table 1.
  • [6] M. Caron, P. Bojanowski, A. Joulin, and M. Douze (2018) Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 132–149. Cited by: §3.5, Table 2.
  • [7] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel (2016) Infogan: interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2172–2180. Cited by: §1.1.
  • [8] A. Coates, A. Ng, and H. Lee (2011) An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 215–223. Cited by: §3.
  • [9] N. Dilokthanakul, P. A. Mediano, M. Garnelo, M. C. Lee, H. Salimbeni, K. Arulkumaran, and M. Shanahan (2016) Deep unsupervised clustering with gaussian mixture variational autoencoders. arXiv preprint arXiv:1611.02648. Cited by: §1.1, §1.
  • [10] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §1.1, §2.1.
  • [11] T. Han, Y. Lu, S. Zhu, and Y. N. Wu (2017) Alternating back-propagation for generator network.. In AAAI, Vol. 3, pp. 13. Cited by: §1.1, §1, §2.2.3.
  • [12] T. Han, E. Nijkamp, X. Fang, M. Hill, S. Zhu, and Y. N. Wu (2018) Divergence triangle for joint training of generator model, energy-based model, and inference model. arXiv preprint arXiv:1812.10907. Cited by: §1.1.
  • [13] T. Han, E. Nijkamp, X. Fang, M. Hill, S. Zhu, and Y. N. Wu (2019) Divergence triangle for joint training of generator model, energy-based model, and inferential model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8670–8679. Cited by: §1.1, §2.1.
  • [14] T. Han, X. Xing, and Y. N. Wu (2018) Learning multi-view generator network for shared representation. In 2018 24th International Conference on Pattern Recognition (ICPR), pp. 2062–2068. Cited by: §1.1.
  • [15] J. A. Hartigan and M. A. Wong (1979) Algorithm as 136: a k-means clustering algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics) 28 (1), pp. 100–108. Cited by: §1.1, §1, §3.4.
  • [16] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner (2017) Beta-vae: learning basic visual concepts with a constrained variational framework.. ICLR 2 (5), pp. 6. Cited by: §1.1.
  • [17] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §3.3.
  • [18] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1125–1134. Cited by: §3.1.
  • [19] P. Isola, D. Zoran, D. Krishnan, and E. H. Adelson (2015) Learning visual groups from co-occurrences in space and time. arXiv preprint arXiv:1511.06811. Cited by: §3.5, Table 2.
  • [20] W. ISPRS 4. isprs 2d semantic labeling contest. Cited by: §3.1, §3.
  • [21] X. Ji, J. F. Henriques, and A. Vedaldi (2019) Invariant information clustering for unsupervised image classification and segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9865–9874. Cited by: §1.1, §3.1, §3.4, §3.5, §3.5, Table 1, Table 2.
  • [22] Z. Jiang, Y. Zheng, H. Tan, B. Tang, and H. Zhou (2016) Variational deep embedding: an unsupervised and generative approach to clustering. arXiv preprint arXiv:1611.05148. Cited by: §1.1, §1, §2.2.2, §3.4, §3.4, Table 1.
  • [23] D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §1.1, §1.
  • [24] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling (2014) Semi-supervised learning with deep generative models. In Advances in neural information processing systems, pp. 3581–3589. Cited by: §1.1, §1, §2.1, §2.2.2, §3.4, §3.4, Table 1, Table 2, §3.
  • [25] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §3.
  • [26] D. G. Lowe (2004) Distinctive image features from scale-invariant keypoints. International journal of computer vision 60 (2), pp. 91–110. Cited by: §3.5, Table 2.
  • [27] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey (2015) Adversarial autoencoders. arXiv preprint arXiv:1511.05644. Cited by: §1.1, §3.4, §3.4, Table 1.
  • [28] J. Munkres (1957) Algorithms for the assignment and transportation problems. Journal of the society for industrial and applied mathematics 5 (1), pp. 32–38. Cited by: §3.2.
  • [29] R. M. Neal et al. (2011) MCMC using hamiltonian dynamics. Handbook of markov chain monte carlo 2 (11), pp. 2. Cited by: §2.2.1.
  • [30] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng (2011) Reading digits in natural images with unsupervised feature learning. Cited by: §3.
  • [31] A. Y. Ng, M. I. Jordan, and Y. Weiss (2002) On spectral clustering: analysis and an algorithm. In Advances in neural information processing systems, pp. 849–856. Cited by: §1.1.
  • [32] A. Radford, L. Metz, and S. Chintala (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434. Cited by: §1.1, §1, §2.1.
  • [33] D. J. Rezende, S. Mohamed, and D. Wierstra (2014) Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082. Cited by: §1.1.
  • [34] J. Shi and J. Malik (2000) Normalized cuts and image segmentation. Departmental Papers (CIS), pp. 107. Cited by: §1.1.
  • [35] R. Tyleček and R. Šára (2013) Spatial pattern templates for recognition of objects with regular structure. In German Conference on Pattern Recognition, pp. 364–374. Cited by: §3.1, §3.
  • [36] U. Von Luxburg (2007) A tutorial on spectral clustering. Statistics and computing 17 (4), pp. 395–416. Cited by: §1.1.
  • [37] J. Xie, R. Girshick, and A. Farhadi (2016) Unsupervised deep embedding for clustering analysis. In International conference on machine learning, pp. 478–487. Cited by: §1.1, §1, §3.2, §3.4, Table 1.
  • [38] Y. Yang, D. Xu, F. Nie, S. Yan, and Y. Zhuang (2010) Image clustering using local discriminant models and global integration. IEEE Transactions on Image Processing 19 (10), pp. 2761–2773. Cited by: §1.1.