Abstract
Generative Adversarial Networks (GANs) have obtained remarkable success in many unsupervised learning tasks and, unarguably, clustering is an important unsupervised learning problem. While one can potentially exploit the latent-space back-projection in GANs to cluster, we demonstrate that the cluster structure is not retained in the GAN latent space. In this paper, we propose ClusterGAN as a new mechanism for clustering using GANs. By sampling latent variables from a mixture of one-hot encoded variables and continuous latent variables, coupled with an inverse network (which projects the data to the latent space) trained jointly with a clustering-specific loss, we are able to achieve clustering in the latent space. Our results show a remarkable phenomenon that GANs can preserve latent-space interpolation across categories, even though the discriminator is never exposed to such vectors. We compare our results with various clustering baselines and demonstrate superior performance on both synthetic and real datasets.
1 Introduction
1.1 Motivation
Representation learning enables machine learning models to decipher underlying semantics in data and disentangle hidden factors of variation. These powerful representations have made it possible to transfer knowledge across various tasks. But what makes one representation better than another?
[2] mentioned several general-purpose priors that are not dependent on the downstream task, but appear as commonalities in good representations. One of the general-purpose priors of representation learning that is ubiquitous across data-intensive domains is clustering. Clustering has been extensively studied in unsupervised learning with multifarious approaches seeking efficient algorithms [24], problem-specific distance metrics [29], validation [12] and the like. Even though the main focus of clustering has been to separate the original data into classes, it would be even better if such clustering were obtained along with dimensionality reduction, since real data often seems to come from a lower-dimensional manifold.
In recent times, much of unsupervised learning is driven by deep generative approaches, the two most prominent being the Variational Autoencoder (VAE) [17] and the Generative Adversarial Network (GAN) [9]. The popularity of generative models is hinged upon their ability to capture high-dimensional probability distributions, impute missing data and deal with multimodal outputs. Both GAN and VAE aim to match the real data distribution (VAE using an explicit approximation of maximum likelihood and GAN through implicit sampling), and simultaneously provide a mapping from a latent space
to the input space. The latent space of GANs not only provides dimensionality reduction, but also gives rise to novel applications. Perturbations in the latent space could be used to determine adversarial examples that further help build robust classifiers
[14]. Compressed sensing using GANs [3] relies on finding a latent vector that minimizes the reconstruction error for the measurements. Generative compression is yet another application [26]. One of the most fascinating outcomes of GAN training is the interpolation in the latent space. Simple vector arithmetic properties emerge which, when manipulated, lead to changes in the semantic qualities of the generated images [25]. This differentiates GANs from traditional dimensionality reduction techniques [22] [20], which lack interpretability. One potential application that demands such a property is clustering of cell types in genomics. GANs provide a means to understand the change in high-dimensional gene expression as one traverses from one cell type (i.e., cluster) to another in the latent space. Here, it is critical to have both clustering as well as good interpretability and interpolation ability. This brings us to the principal motivation of this work: Can we design a GAN training methodology that clusters in the latent space?
1.2 Related Work
Deep learning approaches have been used for dimensionality reduction starting with variants of the autoencoder such as stacked denoising autoencoders [28], the sparse autoencoder [5] and deep CCA [1]. Architectures for deep unsupervised subspace clustering have also been built on the encoder-decoder framework [15]. Recent works have addressed the problem of joint clustering and dimensionality reduction in autoencoders. [30] solved this problem by initializing the cluster centroids and the embedding with a stacked autoencoder, then using alternating optimization to improve the clustering; they report state-of-the-art results in both clustering accuracy and speed on real datasets. Their clustering algorithm is referred to as DEC. Since K-means is the most widely used algorithm for clustering, [31] improved upon DEC by introducing a modified cost function that incorporates the K-means loss. They optimized the non-convex objective using alternating SGD to obtain an embedding that is amenable to K-means clustering. Their algorithm, DCN, was shown to outperform standard clustering methods on a range of datasets. It is interesting to note that the vanilla autoencoder by itself did not explicitly have any clustering objective, but it could be improved to achieve this end by careful algorithmic design. Since GANs have outperformed autoencoders in generating high-fidelity samples, we had a strong intuition that the powerful latent representations of GANs could also provide improved clustering performance.
Interpretable representation learning in the latent space has been investigated for GANs in the seminal work of [4]. The authors trained a GAN with an additional term in the loss that seeks to maximize the mutual information between a subset of the generator's noise variables and the generated output. The key goal of InfoGAN is to create interpretable and disentangled latent variables. While InfoGAN does employ discrete latent variables, it is not specifically designed for clustering. In this paper, we show that our proposed architecture is superior to InfoGAN for clustering. The other prominent family of generative models, VAEs, has the additional advantage of an inference network, the encoder, which is jointly learnt during training. This enables a mapping from $\mathcal{X}$ to $\mathcal{Z}$ that could potentially preserve cluster structure by suitable algorithmic design. Unfortunately, no such inference mechanism exists in GANs, let alone the possibility of clustering in the latent space.
To bridge the gap between VAE and GAN, methods such as Adversarially Learned Inference (ALI) [8] and Bidirectional Generative Adversarial Networks (BiGAN) [7]
have introduced an inference network which is trained to match the joint distributions of $(x, z)$ learnt by the encoder and decoder networks. Typically, the reconstruction in ALI/BiGAN is poor as there is no deterministic pointwise matching between $x$ and $z$ involved in the training. Architectures such as the Wasserstein Autoencoder [27] and the Adversarial Autoencoder [21], which depart from the traditional GAN framework, also have an encoder as part of the network. This led us to consider a formulation using an encoder which could both reduce the cycle loss and aid in clustering.
1.3 Main Contributions
To the best of our knowledge, this is the first work that addresses the problem of clustering in the latent space of GAN. The main contributions of the paper can be summarized as follows:

We show that even though the GAN latent variable preserves information about the observed data, the latent points are smoothly scattered based on the latent distribution, leading to no observable clusters.

We propose three main algorithmic ideas in ClusterGAN in order to remedy this situation.

We utilize a mixture of discrete and continuous latent variables in order to create a non-smooth geometry in the latent space.

We propose a novel backpropagation algorithm accommodating the discrete-continuous mixture, as well as an explicit inverse-mapping network to obtain the latent variables given the data points, since the problem is non-convex.

We propose to jointly train the GAN along with the inverse-mapping network with a clustering-specific loss so that the distance geometry in the projected space reflects the distance geometry of the latent variables.


We compare ClusterGAN and other possible GAN-based clustering algorithms, such as InfoGAN, along with multiple clustering baselines on varied datasets. This demonstrates the superior performance of ClusterGAN for the clustering task.

We demonstrate that ClusterGAN surprisingly retains good interpolation across the different classes (encoded using one-hot latent variables), even though the discriminator is never exposed to such samples.
The formulation is general enough to provide a meta framework that incorporates the additional desirable property of clustering in GAN training.
2 Discrete-Continuous Prior
2.1 Background
Generative adversarial networks consist of two components, the generator $G$ and the discriminator $D$. Both $G$ and $D$ are usually implemented as neural networks parameterized by $\Theta_G$ and $\Theta_D$ respectively. The generator can also be considered a mapping from the latent space to the data space, which we denote as $G : \mathcal{Z} \to \mathcal{X}$. The discriminator defines a mapping from the data space to a real value, which can correspond to the probability of the sample being real, $D : \mathcal{X} \to \mathbb{R}$. The GAN training sets up a two-player game between $G$ and $D$, defined by the minimax objective
$\min_{\Theta_G} \max_{\Theta_D} \; \mathbb{E}_{x \sim \mathbb{P}_x^r}\, q(D(x)) + \mathbb{E}_{z \sim \mathbb{P}_z}\, q(1 - D(G(z)))$,
where $\mathbb{P}_x^r$ is the distribution of real data samples, $\mathbb{P}_z$ is the prior noise distribution on the latent space and $q(\cdot)$ is the quality function. For the vanilla GAN, $q(x) = \log x$, and for the Wasserstein GAN (WGAN), $q(x) = x$. We also denote the distribution of generated samples $G(z), z \sim \mathbb{P}_z$, as $\mathbb{P}_x^g$. The discriminator and the generator are optimized alternately so that at the end of training $\mathbb{P}_x^g$ matches $\mathbb{P}_x^r$.
2.2 Vanilla GAN does not cluster well in the latent space
One possible way to cluster using a GAN is to back-project the data into the latent space (using backpropagation decoding [19]) and cluster the latent space. However, this method usually leads to very poor results (see Fig. 3 for clustering results on MNIST). The key reason is that, if back-projection succeeds, the back-projected data distribution should look similar to the latent-space distribution, which is typically chosen to be a Gaussian or uniform distribution, and we cannot expect to cluster in that space. Thus, even though the latent space may contain full information about the data, the distance geometry in the latent space does not reflect the inherent clustering. In [11], the authors sampled from a Gaussian mixture prior and obtained diverse samples even in limited data regimes. However, even GANs with a Gaussian mixture prior failed to cluster, as shown in Figure 3(c). As observed by the authors of DeLiGAN, the Gaussian components tend to 'crowd' and become redundant. Only lifting the latent space using categorical variables solved this problem effectively. But continuity in the latent space is traditionally viewed as a prerequisite for good interpolation; in other words, interpolation seems to be at loggerheads with the clustering objective. We demonstrate in this paper how ClusterGAN can obtain good interpolation and good clustering simultaneously.
2.3 Sampling from Discrete-Continuous Mixtures
In ClusterGAN, we sample from a prior that consists of normal random variables cascaded with one-hot encoded vectors. To be more precise, $z = (z_n, z_c)$, $z_n \sim \mathcal{N}(0, \sigma^2 I)$, $z_c = e_k$, where $e_k$ is the $k$-th elementary vector in $\mathbb{R}^K$ and $K$ is the number of clusters in the data. In addition, we need to choose $\sigma$ in such a way that the one-hot vector provides a sufficient signal to the GAN training that leads to each mode only generating samples from a corresponding class in the original data. To be more precise, we chose a small $\sigma$ in all our experiments so that each dimension of the normal latent variables $z_n$ remains small with high probability. Small variances are chosen to ensure the clusters in $\mathcal{Z}$ space are separated. Hence this prior naturally enables us to design an algorithm that clusters in the latent space.
2.4 Linear Generator clusters perfectly
The following lemma suggests that with discrete-continuous mixtures, a linear generator suffices to produce a mixture of Gaussians in the generated space.
Lemma 1.
A linear generator acting on the continuous latent variables $z_n$ alone cannot produce a mixture of Gaussians in the generated space. Further, a linear mapping can map discrete-continuous mixtures to a mixture of Gaussians.
Proof.
If the latent space has only the continuous part, $z = z_n \sim \mathcal{N}(0, \sigma^2 I)$, then by the linearity property, any linear generator can only produce a Gaussian distribution in the generated space. Now we show there exists a linear mapping from discrete-continuous mixtures to the generated data $x \sim \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mu_k, \sigma^2 I)$, where $K$ is the number of mixture components. This is possible if we let $G(z) = A z_c + z_n$, with $A$ a matrix whose $k$-th column is the mean $\mu_k$. ∎
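The lemma can be illustrated with a small numerical sketch: sampling $z = (z_n, z_c)$ from the discrete-continuous prior and applying a purely linear generator yields a mixture of Gaussians whose component means are selected by the one-hot part. The number of components, the means and $\sigma$ below are illustrative choices, not values from the paper.

```python
import numpy as np

# Discrete-continuous prior + linear generator => mixture of Gaussians.
rng = np.random.default_rng(0)
K, dim_x, sigma, n = 3, 2, 0.10, 3000
mu = rng.normal(size=(K, dim_x)) * 5.0        # one mean per mixture component

# z_c is one-hot over the K modes, z_n ~ N(0, sigma^2 I)
modes = rng.integers(0, K, size=n)
z_c = np.eye(K)[modes]
z_n = sigma * rng.standard_normal((n, dim_x))

# Linear generator G(z) = z_c @ mu + z_n: the one-hot part selects the
# component mean, the Gaussian part supplies the within-component spread.
x = z_c @ mu + z_n
```

Because the component means are well separated relative to $\sigma$, the generated samples form clearly separated Gaussian clusters.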
To illustrate this lemma, and hence the drawback of traditional priors for clustering, we performed a simple experiment. The real samples are drawn from a mixture of Gaussians. The means of the Gaussians are sampled at random and the variance of each component is fixed at a small value. We trained a GAN with a Gaussian prior where the generator is a multilayer perceptron with two hidden layers. For comparison, we also trained a GAN with $z$ sampled from one-hot encoded normal vectors, the dimension of the categorical variable being equal to the number of components. The generator for this GAN consisted of a linear mapping $G(z) = Wz$. After training, the latent vectors are recovered using Algorithm 1 for the linear generator, and restarts with random initializations for the nonlinear generator. Even for this toy setup, the linear generator perfectly clustered the latent vectors (Acc. = 1.0, NMI = 1.0, ARI = 1.0), but the nonlinear generator performed poorly (Acc. = 0.73, NMI = 0.75, ARI = 0.60) (Figure 2). The situation becomes worse for real datasets such as MNIST when we trained a GAN using latent vectors drawn from uniform, normal or mixture-of-Gaussians priors. None of these configurations succeeded in clustering in the latent space, as shown in Figure 3.
2.5 Modified Backpropagation-Based Decoding
Previous works [6] [19] have explored solving an optimization problem in $\mathcal{Z}$ space to recover the latent vectors, $z^* = \arg\min_z \mathcal{L}(G(z), x) + \lambda \lVert z \rVert$, where $\mathcal{L}$ is some suitable loss function and $\lVert \cdot \rVert$ denotes a norm. This approach is insufficient for clustering with traditional latent priors even if backpropagation were lossless and recovered accurate latent vectors. To make the situation worse, the optimization problem above is non-convex in $z$ ($G$ being implemented as a neural network) and can yield different embeddings in the $\mathcal{Z}$ space based on initialization. Some approaches to address this issue are multiple restarts with different initializations, picking the restart with the smallest loss, or stochastic clipping of $z$ at each iteration step. None of these lead to clustering, since they do not address the root problem of sampling from separated manifolds in $\mathcal{Z}$ space. But our sampling procedure naturally gives way to such an algorithm. We use the squared $\ell_2$ loss for $\mathcal{L}$. Since we sample $z_n$ from a normal distribution, we use the regularizer $\lambda \lVert z_n \rVert_2^2$, penalizing only the normal variables. We use one restart for each one-hot component and optimize with respect to only the normal variables, keeping $z_c$ fixed. Adam [16] is used for the updates during backprop decoding. Formally, Algorithm 1 summarizes the approach.
2.6 Separate Modes for distinct classes in the data
It was surprising to find that, trained in a purely unsupervised manner without additional loss terms, each one-hot encoded component generated points from a specific class in the original data. For instance, mode $k$ generated a particular digit $\pi(k)$ in MNIST for multiple samplings of $z_n$ ($\pi$ denotes a permutation). This was a necessary first step for the success of Algorithm 1. We also quantitatively evaluated the modes learnt by the GAN by using a supervised classifier for MNIST. The supervised classifier had a high test accuracy, so it could reliably distinguish the digits. We sample from a mode $k$ and generate a digit, which is then classified by the classifier as $\hat{y}$. From the pair $(k, \hat{y})$, we can map each mode to a digit and compute the accuracy of that digit being generated from mode $k$. This is denoted as Mode Accuracy. Each digit sample $x$ with label $y$ can be decoded in the latent space by Algorithm 1 to obtain $\hat{z}$. Now $\hat{z}$ can be used to generate $\hat{x}$, which when passed through the classifier gives the label $\hat{y}$. The pair $(y, \hat{y})$ must be equal in the ideal case, and this accuracy is denoted as Reconstruction Accuracy. Finally, all the mappings of points in the same class in $\mathcal{X}$ space should have the same one-hot encoding when embedded in $\mathcal{Z}$ space. This defines the Cluster Accuracy. This methodology can be extended to quantitatively evaluate mode generation for other datasets also, provided there is a reliable classifier. For MNIST, we obtained high values for all three accuracies. Some of the modes in Fashion-MNIST and MNIST are shown in Figures 4 and 5, respectively. Supplementary materials contain the images from all modes in these two datasets.
2.7 Interpolation in latent space is preserved
The latent space in a traditional GAN with a Gaussian latent distribution enforces that different classes are continuously scattered in the latent space, allowing nice inter-class interpolation, which is a key strength of GANs. In ClusterGAN, the latent vector is sampled with a one-hot component, and in order to interpolate across classes, we have to sample from a convex combination of the one-hot vectors. While such vectors have never been sampled during the training process, we surprisingly observed very smooth inter-class interpolation in ClusterGAN. To demonstrate interpolation, we fixed the $z_n$ part in two latent vectors with different one-hot components and interpolated the one-hot encoded part to give rise to new latent vectors. As Figure 6 illustrates, we observed a nice transition from one digit to another as well as across different classes in Fashion-MNIST. This demonstrates that ClusterGAN learns a very smooth manifold even on the untrained directions of the discrete-continuous distribution. We also show interpolations from a vanilla GAN trained with a Gaussian prior as reference.
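The inter-class interpolation above can be sketched as a small helper: hold $z_n$ fixed and replace the one-hot part with the convex combination $(1 - t)\,e_i + t\,e_j$. The function name and dimensions below are illustrative; such interpolated codes are never sampled during training.

```python
import numpy as np

# Interpolate between cluster modes i and j: z_n is held fixed while the
# one-hot part moves along the convex combination (1 - t) * e_i + t * e_j.
def interpolate_codes(z_n, i, j, n_clusters, steps=5):
    e_i, e_j = np.eye(n_clusters)[i], np.eye(n_clusters)[j]
    ts = np.linspace(0.0, 1.0, steps)
    return np.stack([np.concatenate([z_n, (1.0 - t) * e_i + t * e_j])
                     for t in ts])
```

Feeding each row of the returned array through the generator produces the smooth transition from class $i$ to class $j$ shown in Figure 6.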
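The backprop decoding of Section 2.5 (Algorithm 1) can be sketched as follows, using a known linear generator $G(z) = z_c^T A + z_n$ as a stand-in for the trained network: one restart per one-hot mode, gradient descent on $z_n$ only ($z_c$ held fixed), keeping the restart with the lowest regularized loss. The step size, iteration count and penalty weight are illustrative, and plain gradient descent replaces the Adam updates used in the paper.

```python
import numpy as np

def decode(x, mu, lam=1e-2, lr=0.1, iters=200):
    """Recover (mode, z_n) for a point x under a linear generator whose
    k-th row of `mu` is the k-th component mean."""
    K, dim = mu.shape
    best_loss, best_mode, best_zn = np.inf, None, None
    for k in range(K):                              # restart with z_c = e_k
        z_c = np.eye(K)[k]
        z_n = np.zeros(dim)
        for _ in range(iters):
            resid = (z_c @ mu + z_n) - x            # G(z) - x
            grad = 2.0 * resid + 2.0 * lam * z_n    # grad of the loss below
            z_n -= lr * grad
        loss = np.sum(((z_c @ mu + z_n) - x) ** 2) + lam * np.sum(z_n ** 2)
        if loss < best_loss:
            best_loss, best_mode, best_zn = loss, k, z_n
    return best_mode, best_zn
```

Because the L2 penalty applies only to $z_n$, the restart whose one-hot mode matches the point's true cluster attains the smallest loss.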
3 ClusterGAN
Even though the above approach enables the GAN to cluster in the latent space, it may perform even better with a clustering-specific loss term in the minimax objective. For MNIST, digit strokes correspond well to the category in the data, but for more complicated datasets we need to enforce structure in the GAN training. One way to ensure that is to enforce precise recovery of the latent vector. We therefore introduce an encoder $E : \mathcal{X} \to \mathcal{Z}$, a neural network parameterized by $\Theta_E$. The GAN objective now takes the following form:
$\min_{\Theta_G, \Theta_E} \max_{\Theta_D} \; \mathbb{E}_{x \sim \mathbb{P}_x^r}\, q(D(x)) + \mathbb{E}_{z \sim \mathbb{P}_z}\, q(1 - D(G(z))) + \beta_n \, \mathbb{E}_{z \sim \mathbb{P}_z} \lVert z_n - E_n(G(z)) \rVert_2^2 + \beta_c \, \mathbb{E}_{z \sim \mathbb{P}_z} \, \mathcal{H}(z_c, E_c(G(z)))$  (1)
where $\mathcal{H}(\cdot, \cdot)$ is the cross-entropy loss and $E_n(\cdot)$, $E_c(\cdot)$ denote the continuous and categorical parts of the encoder output. The relative magnitudes of the regularization coefficients $\beta_n$ and $\beta_c$ enable a flexible choice to vary the importance of preserving the discrete and continuous portions of the latent code. One could imagine other variations of the regularization that map $E(x)$ to be close to the centroid of the respective cluster, in similar spirit to K-means. The GAN training in this approach involves jointly updating the parameters of $\Theta_G$ and $\Theta_E$.
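The two encoder regularization terms added to the minimax objective in Eq. (1) can be sketched numerically: an L2 penalty on the recovered continuous code and a cross-entropy penalty on the recovered one-hot code, weighted by $\beta_n$ and $\beta_c$. The function name, shapes and the logit-based encoder head are assumptions for illustration.

```python
import numpy as np

def encoder_regularizer(z_n, z_n_hat, z_c, z_c_logits, beta_n, beta_c):
    """beta_n * ||z_n - E_n(G(z))||^2 + beta_c * H(z_c, E_c(G(z))),
    averaged over a batch. `z_c_logits` are the encoder's categorical logits."""
    l2 = np.mean(np.sum((z_n - z_n_hat) ** 2, axis=1))
    # numerically stable log-softmax of the categorical logits
    shifted = z_c_logits - z_c_logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.sum(np.exp(shifted), axis=1, keepdims=True))
    cross_entropy = -np.mean(np.sum(z_c * log_probs, axis=1))
    return beta_n * l2 + beta_c * cross_entropy
```

When the encoder recovers the latent code exactly, both terms vanish; raising $\beta_c$ relative to $\beta_n$ prioritizes preserving the cluster label over the continuous style code.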
4 Experiments


Dataset     Algorithm        ACC    NMI    ARI
Synthetic   ClusterGAN       0.99   0.99   0.99
            InfoGAN          0.88   0.75   0.74
            GAN with bp      0.95   0.85   0.88
            GAN with Disc.   0.99   0.98   0.98
            AGGLO.           0.99   0.99   0.99
            NMF              0.98   0.96   0.97
            SC               0.99   0.98   0.98
MNIST       ClusterGAN       0.95   0.89   0.89
            InfoGAN          0.87   0.84   0.81
            GAN with bp      0.95   0.90   0.89
            GAN with Disc.   0.70   0.62   0.52
            DCN              0.83   0.81   0.75
            AGGLO.           0.64   0.65   0.46
            NMF              0.56   0.45   0.36
Fashion-10  ClusterGAN       0.63   0.64   0.50
            InfoGAN          0.61   0.59   0.44
            GAN with bp      0.56   0.53   0.37
            GAN with Disc.   0.43   0.37   0.23
            AGGLO.           0.55   0.57   0.37
            NMF              0.50   0.51   0.34
Fashion-5   ClusterGAN       0.73   0.59   0.48
            InfoGAN          0.67   0.55   0.42
            GAN with bp      0.73   0.54   0.45
            GAN with Disc.   0.67   0.49   0.40
            AGGLO.           0.66   0.52   0.36
            NMF              0.67   0.48   0.40
10x_73k     ClusterGAN       0.83   0.73   0.69
            InfoGAN          0.62   0.58   0.43
            GAN with bp      0.65   0.59   0.45
            GAN with Disc.   0.33   0.17   0.07
            AGGLO.           0.63   0.58   0.40
            NMF              0.71   0.69   0.53
            SC               0.40   0.29   0.18
Pendigits   ClusterGAN       0.79   0.73   0.65
            InfoGAN          0.72   0.73   0.61
            GAN with bp      0.76   0.71   0.63
            GAN with Disc.   0.65   0.57   0.45
            DCN              0.72   0.69   0.56
            AGGLO.           0.70   0.69   0.52
            NMF              0.67   0.58   0.45
            SC               0.70   0.69   0.52
4.1 Datasets
Synthetic Data The data is generated from a mixture of Gaussians in 2D, which constitutes the $\mathcal{Z}$ space, with an equal number of points generated from each Gaussian component. The $\mathcal{X}$ space is obtained by a nonlinear transformation of $z$: $x = f(Wz)$, where $W$ is a weight matrix and $f(\cdot)$ is the sigmoid function, applied elementwise to introduce nonlinearity.
MNIST It consists of 70k images of digits ranging from 0 to 9. Each data sample is a 28 × 28 greyscale image. We used a DCGAN-style architecture with conv-deconv layers, batch normalization and leaky ReLU activations, the details of which are available in the Supplementary material.
Fashion-MNIST (10 and 5 classes) This dataset has the same number of images and the same image size as MNIST, but it is considerably more complicated. Instead of digits, it consists of various types of fashion products. Supervised methods achieve lower accuracy on this dataset than on MNIST. For training a GAN, we used the same architecture as for MNIST. We also merged similar categories to form a separate 5-class dataset. The five groups were as follows: {T-shirt/Top, Dress}, {Trouser}, {Pullover, Coat, Shirt}, {Bag}, {Sandal, Sneaker, Ankle Boot}.
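The 5-class grouping above amounts to a simple label mapping, assuming the standard Fashion-MNIST label indices (0: T-shirt/Top, 1: Trouser, 2: Pullover, 3: Dress, 4: Coat, 5: Sandal, 6: Shirt, 7: Sneaker, 8: Bag, 9: Ankle Boot); the constant name is illustrative.

```python
# Map the 10 original Fashion-MNIST labels onto the 5 merged categories.
FASHION5 = {
    0: 0, 3: 0,         # {T-shirt/Top, Dress}
    1: 1,               # {Trouser}
    2: 2, 4: 2, 6: 2,   # {Pullover, Coat, Shirt}
    8: 3,               # {Bag}
    5: 4, 7: 4, 9: 4,   # {Sandal, Sneaker, Ankle Boot}
}
```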
10x_73k
Even though GANs have achieved unprecedented success in generating realistic images, it is not clear whether they can be equally effective for other types of data. In this experiment, we trained a GAN to cluster cell types from a single-cell RNA-seq counts matrix. Moreover, while computer vision may have an ample supply of labelled images, obtaining labels in some fields, for instance biology, is extremely costly and laborious. Thus, unsupervised clustering of data is truly a necessity for this domain. The dataset consists of RNA-transcript counts of 73k data points belonging to different cell types [33]. To reduce the dimension of the data, we selected the highest-variance genes across the cells. The entries of the counts matrix are first log-transformed and then divided by the maximum entry of the transformation to obtain values in the range [0, 1]. One of the major challenges in this data is sparsity: even after sub-selection of genes based on variance, a large fraction of the entries of the data matrix were zero.
Pendigits It is a very different dataset that consists of a time series of (x, y) coordinates. The points are sampled as writers write digits on a pressure-sensitive tablet. The total number of datapoints is 10,992, split across 10 classes, one for each digit. It provided a unique challenge of training GANs for point-cloud data.
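The 10x_73k counts-matrix preprocessing described above can be sketched as follows; the exact log variant is an assumption (a log(1 + x) transform is shown, which handles the many zero counts), and the function name is illustrative.

```python
import numpy as np

def preprocess_counts(counts):
    """Log-transform raw RNA-seq counts, then scale by the global maximum
    so all values lie in [0, 1]."""
    logged = np.log1p(np.asarray(counts, dtype=float))  # log(1 + x)
    return logged / logged.max()
```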
In all our experiments in this paper, we used an improved variant of WGAN (WGAN-GP), which includes a gradient penalty [10]. Using cross-validation for selecting hyperparameters is not an option in purely unsupervised problems due to the absence of labels. We adapted standard architectures for the datasets [4] and avoided data-specific tuning as much as possible. A single choice of the regularization parameters $\beta_n$ and $\beta_c$ worked well across all datasets.
4.2 Evaluation
Since clustering is an unsupervised problem, we ensured that all algorithms are oblivious to the true labels, unlike a supervised framework such as conditional GAN [23]. We compared ClusterGAN with the other possible GAN-based clustering approaches we could conceive.


Dataset     ClusterGAN   WGAN (Normal)   WGAN (One-Hot)   InfoGAN
MNIST       0.81         0.88            0.94             1.88
Fashion     0.91         0.95            6.14             11.04
10x_73k     2.50         2.02            2.24             25.59
Pendigits   9.56         6.45            13.44            87.80



Dataset: MNIST, Algorithm: ClusterGAN
K      7      9      10     11     13
ACC    0.60   0.84   0.95   0.80   0.84

Algorithm 1 + K-means is denoted as "GAN with bp". For InfoGAN, we used the categorical code as the inferred cluster label for each $x$. Further, the features in the last layer of the discriminator could contain some class-specific discriminating features for clustering, so we used K-means on these features to cluster; this is denoted as "GAN with Disc.". We also included clustering results from Non-negative Matrix Factorization (NMF) [18], Agglomerative Clustering (AGGLO.) [32] and Spectral Clustering (SC). AGGLO. with the Euclidean affinity score and Ward linkage gave the best results. NMF had both $\ell_1$ and $\ell_2$ regularization, was initialized with Non-negative Double SVD and used the KL-divergence loss. SC used an RBF kernel for affinity. We reported normalized mutual information (NMI), adjusted Rand index (ARI), and clustering purity (ACC). Since DCN has been shown to outperform various deep-learning based clustering algorithms, we reported its metrics from the paper [31] for MNIST and Pendigits. We found DCN to be very sensitive to the choice of hyperparameters, architecture and learning rates, and could not obtain reasonable results from it on the other datasets. But we outperformed the DCN results on the MNIST and Pendigits datasets.
Since clustering metrics do not reveal the quality of generated samples from a GAN, we also report the Fréchet Inception Distance (FID) [13] for all real datasets. We found that ClusterGAN achieves good clustering without compromising sample quality. In fact, for our image datasets, ClusterGAN samples are closer to the real distribution than those of a vanilla WGAN with a Gaussian prior.
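The clustering purity (ACC) reported in the tables is accuracy under the best one-to-one mapping between cluster indices and true labels. A minimal sketch is below; brute force over permutations is fine for a handful of clusters, while `scipy.optimize.linear_sum_assignment` is the standard choice for larger numbers of clusters.

```python
import itertools
import numpy as np

def clustering_acc(true_labels, cluster_ids):
    """Accuracy maximized over all one-to-one cluster-to-label mappings."""
    true_labels = np.asarray(true_labels)
    cluster_ids = np.asarray(cluster_ids)
    labels = np.unique(true_labels)
    best = 0.0
    for perm in itertools.permutations(labels):
        mapping = dict(zip(np.unique(cluster_ids), perm))
        mapped = np.array([mapping[c] for c in cluster_ids])
        best = max(best, float(np.mean(mapped == true_labels)))
    return best
```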
In all datasets, we provided the true number of clusters to all algorithms. In addition, for MNIST, Table 3 provides the clustering performance of ClusterGAN as the number of clusters is varied. Overestimating the number of clusters does not severely hurt ClusterGAN, but underestimating it does.
5 Discussion and Future Work
In this work, we discussed the drawback of training a GAN with traditional prior latent distributions for clustering and considered discretecontinuous mixtures for sampling noise variables. We proposed ClusterGAN, an architecture that enables clustering in the latent space. Comparison with clustering baselines on varied datasets using ClusterGAN illustrates that GANs can be suitably adapted for clustering. Future directions can explore better datadriven priors for the latent space. Another possibility is to improve results for problems that have a sparse generative structure such as compressed sensing.
References
 [1] Galen Andrew, Raman Arora, Jeff Bilmes, and Karen Livescu. Deep canonical correlation analysis. In International Conference on Machine Learning, pages 1247–1255, 2013.
 [2] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013.
 [3] Ashish Bora, Ajil Jalal, Eric Price, and Alexandros G. Dimakis. Compressed sensing using generative models. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 537–546, 2017.
 [4] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172–2180, 2016.

 [5] Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 215–223, 2011.
 [6] Antonia Creswell and Anil A Bharath. Inverting the generator of a generative adversarial network (ii). arXiv preprint arXiv:1802.05701, 2018.
 [7] Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.
 [8] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier Mastropietro, Alex Lamb, Martin Arjovsky, and Aaron Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.
 [9] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
 [10] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, pages 5769–5779, 2017.
 [11] Swaminathan Gurumurthy, Ravi Kiran Sarvadevabhatla, and R Venkatesh Babu. Deligan: Generative adversarial networks for diverse and limited data.
 [12] Maria Halkidi, Yannis Batistakis, and Michalis Vazirgiannis. On clustering validation techniques. Journal of intelligent information systems, 17(23):107–145, 2001.
 [13] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two timescale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017.
 [14] Andrew Ilyas, Ajil Jalal, Eirini Asteri, Constantinos Daskalakis, and Alexandros G Dimakis. The robust manifold defense: Adversarial training using generative models. arXiv preprint arXiv:1712.09196, 2017.
 [15] Pan Ji, Tong Zhang, Hongdong Li, Mathieu Salzmann, and Ian Reid. Deep subspace clustering networks. In Advances in Neural Information Processing Systems, pages 23–32, 2017.
 [16] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [17] Diederik P Kingma and Max Welling. Autoencoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
 [18] Daniel D Lee and H Sebastian Seung. Learning the parts of objects by nonnegative matrix factorization. Nature, 401(6755):788, 1999.
 [19] Zachary C Lipton and Subarna Tripathi. Precise recovery of latent vectors from generative adversarial networks. arXiv preprint arXiv:1702.04782, 2017.
 [20] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of machine learning research, 9(Nov):2579–2605, 2008.
 [21] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.
 [22] Sebastian Mika, Bernhard Schölkopf, Alex J Smola, Klaus-Robert Müller, Matthias Scholz, and Gunnar Rätsch. Kernel pca and denoising in feature spaces. In Advances in neural information processing systems, pages 536–542, 1999.
 [23] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.

 [24] Andrew Y Ng, Michael I Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In Advances in neural information processing systems, pages 849–856, 2002.
 [25] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
 [26] Shibani Santurkar, David Budden, and Nir Shavit. Generative compression. arXiv preprint arXiv:1703.01467, 2017.
 [27] Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. Wasserstein autoencoders. arXiv preprint arXiv:1711.01558, 2017.
 [28] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and PierreAntoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11(Dec):3371–3408, 2010.
 [29] Shiming Xiang, Feiping Nie, and Changshui Zhang. Learning a mahalanobis distance metric for data clustering and classification. Pattern Recognition, 41(12):3600–3612, 2008.
 [30] Junyuan Xie, Ross Girshick, and Ali Farhadi. Unsupervised deep embedding for clustering analysis. In International conference on machine learning, pages 478–487, 2016.
 [31] Bo Yang, Xiao Fu, Nicholas D. Sidiropoulos, and Mingyi Hong. Towards k-means-friendly spaces: Simultaneous deep learning and clustering. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 3861–3870, 2017.
 [32] Wei Zhang, Xiaogang Wang, Deli Zhao, and Xiaoou Tang. Graph degree linkage: Agglomerative clustering on a directed graph. In European Conference on Computer Vision, pages 428–441. Springer, 2012.
 [33] Grace XY Zheng, Jessica M Terry, Phillip Belgrader, Paul Ryvkin, Zachary W Bent, Ryan Wilson, Solongo B Ziraldo, Tobias D Wheeler, Geoff P McDermott, Junjie Zhu, et al. Massively parallel digital transcriptional profiling of single cells. Nature communications, 8:14049, 2017.
6 Supplementary Material
6.1 Hyperparameter and Architecture Details
The networks were trained with the Adam optimizer for all datasets. Multiple discriminator updates were performed for each generator update, and the gradient penalty coefficient for WGAN-GP was fixed across all experiments. The dimension of $z_c$ is the same as the number of classes in the dataset. Most networks used Leaky ReLU activations and Batch Normalization (BN); details for each dataset are provided below. (In the architecture without an encoder, Algorithm 1 used the Adam optimizer to minimize the objective for a fixed number of iterations per point.)
Synthetic Data
We used batch size = , of dimensions. LReLU activation with leak = was used. , .
Generator  Encoder  Discriminator 
Input  Input  Input 
FC LReLU BN  FC LReLU BN  FC LReLU BN 
FC LReLU BN  FC LReLU BN  FC LReLU BN 
FC Sigmoid  FC linear for  FC linear 
Softmax on last to obtain 
MNIST and FashionMNIST
We used batch size = , of dimensions. LReLU activation with leak = was used. for MNIST and for FashionMNIST, for both .
Generator  Encoder  Discriminator 
Input  Input  Input 
FC ReLU BN  upconv,  upconv, 
64 stride 2 LReLU 
64 stride 2 LReLU  
FC ReLU BN  upconv,  upconv, 
128 stride 2 LReLU  128 stride 2 LReLU  
upconv,  FC LReLU  FC LReLU 
64 stride 2 ReLU BN  
upconv,  FC linear for  FC linear 
1 stride 2, Sigmoid  Softmax on last to obtain 
For FashionMNIST, we used . Rest of the architecture remained identical.
10x_73k
We used batch size = , of dimensions. LReLU activation with leak = was used. , .
Generator  Encoder  Discriminator 
Input  Input  Input 
FC LReLU  FC LReLU  FC LReLU 
FC LReLU  FC LReLU  FC LReLU 
FC Linear  FC linear for  FC linear 
Softmax on last to obtain 
Pendigits
We used batch size = , of dimensions. LReLU activation with leak = was used. , .
Generator  Encoder  Discriminator 
Input  Input  Input 
FC LReLU BN  FC LReLU BN  FC LReLU BN 
FC LReLU BN  FC LReLU BN  FC LReLU BN 
FC Sigmoid  FC linear for  FC linear 
Softmax on last to obtain 
For InfoGAN, we used the authors' implementation (https://github.com/openai/InfoGAN) for MNIST and Fashion-MNIST. For the other datasets, we used our hyperparameters for the Generator and Discriminator and added the Q network (FC 128, BN, LReLU, followed by a linear FC layer of dimension equal to the number of categories). For "GAN with bp", we used the same Generator and Discriminator hyperparameters as ClusterGAN. Features for "GAN with Disc." were obtained from the trained Discriminator of the "GAN with bp" experiments.