ClusterGAN : Latent Space Clustering in Generative Adversarial Networks

09/10/2018 · Sudipto Mukherjee, et al. · University of Washington

Abstract

Generative Adversarial Networks (GANs) have obtained remarkable success in many unsupervised learning tasks and, unarguably, clustering is an important unsupervised learning problem. While one can potentially exploit the latent-space back-projection in GANs to cluster, we demonstrate that the cluster structure is not retained in the GAN latent space. In this paper, we propose ClusterGAN as a new mechanism for clustering using GANs. By sampling latent variables from a mixture of one-hot encoded variables and continuous latent variables, coupled with an inverse network (which projects the data to the latent space) trained jointly with a clustering-specific loss, we are able to achieve clustering in the latent space. Our results show a remarkable phenomenon that GANs can preserve latent space interpolation across categories, even though the discriminator is never exposed to such vectors. We compare our results with various clustering baselines and demonstrate superior performance on both synthetic and real datasets.

1 Introduction

1.1 Motivation

Representation learning enables machine learning models to decipher underlying semantics in data and disentangle hidden factors of variation. These powerful representations have made it possible to transfer knowledge across various tasks. But what makes one representation better than another?

[2] mentioned several general-purpose priors that are not dependent on the downstream task, but appear as commonalities in good representations. One general-purpose prior of representation learning that is ubiquitous across data-intensive domains is clustering. Clustering has been extensively studied in unsupervised learning with multifarious approaches seeking efficient algorithms [24], problem-specific distance metrics [29], validation [12] and the like. While the main focus of clustering has been to separate the original data into classes, it is even more desirable to obtain such a clustering jointly with dimensionality reduction, since real data often appears to lie on a lower-dimensional manifold.

In recent times, much of unsupervised learning is driven by deep generative approaches, the two most prominent being the Variational Autoencoder (VAE) [17] and the Generative Adversarial Network (GAN) [9]. The popularity of generative models is hinged upon their ability to capture high-dimensional probability distributions, impute missing data and deal with multimodal outputs. Both GAN and VAE aim to match the real data distribution (VAE using an explicit approximation of maximum likelihood and GAN through implicit sampling), and simultaneously provide a mapping from a latent space $\mathcal{Z}$ to the input space $\mathcal{X}$. The latent space of GANs not only provides dimensionality reduction, but also gives rise to novel applications. Perturbations in the latent space can be used to determine adversarial examples that further help build robust classifiers [14]. Compressed sensing using GANs [3] relies on finding a latent vector that minimizes the reconstruction error for the measurements. Generative compression is yet another application involving GANs [26]. One of the most fascinating outcomes of GAN training is interpolation in the latent space: simple vector arithmetic properties emerge which, when manipulated, lead to changes in the semantic qualities of the generated images [25]. This differentiates GANs from traditional dimensionality reduction techniques [22, 20], which lack such interpretability. One potential application that demands this property is clustering of cell types in genomics. GANs provide a means to understand the change in high-dimensional gene expression as one traverses from one cell type (i.e., cluster) to another in the latent space. Here, it is critical to have clustering together with good interpretability and interpolation ability. This brings us to the principal motivation of this work: can we design a GAN training methodology that clusters in the latent space?

1.2 Related Works

Deep learning approaches have been used for dimensionality reduction starting with variants of the autoencoder, such as the stacked denoising autoencoder [28], the sparse autoencoder [5] and deep CCA [1]. Architectures for deep unsupervised subspace clustering have also been built on the encoder-decoder framework [15]. Recent works have addressed the problem of joint clustering and dimensionality reduction in autoencoders. [30] solved this problem by initializing the cluster centroids and the embedding with a stacked autoencoder, then using alternating optimization to improve the clustering, and reported state-of-the-art results in both clustering accuracy and speed on real datasets; the clustering algorithm is referred to as DEC in their paper. Since K-means is often the most widely used algorithm for clustering, [31] improved upon DEC by introducing a modified cost function that incorporates the K-means loss. They optimized the non-convex objective using alternating SGD to obtain an embedding that is amenable to K-means clustering. Their algorithm, DCN, was shown to outperform standard clustering methods on a range of datasets. It is interesting to note that the vanilla autoencoder by itself does not have an explicit clustering objective, yet it could be adapted to this end by careful algorithmic design. Since GANs have outperformed autoencoders in generating high-fidelity samples, we had a strong intuition that the powerful latent representations of GANs could also provide improved clustering performance.

Figure 1: ClusterGAN Architecture

Interpretable representation learning in the latent space has been investigated for GANs in the seminal work of [4]. The authors trained a GAN with an additional term in the loss that seeks to maximize the mutual information between a subset of the generator's noise variables and the generated output. The key goal of InfoGAN is to create interpretable and disentangled latent variables. While InfoGAN does employ discrete latent variables, it is not specifically designed for clustering; in this paper, we show that our proposed architecture is superior to InfoGAN for clustering. The other prominent family of generative models, VAEs, has the additional advantage of an inference network, the encoder, which is jointly learnt during training. This enables a mapping from $\mathcal{X}$ to $\mathcal{Z}$ that could potentially preserve cluster structure by suitable algorithmic design. Unfortunately, no such inference mechanism exists in GANs, let alone the possibility of clustering in the latent space. To bridge the gap between VAE and GAN, methods such as Adversarially Learned Inference (ALI) [8] and Bidirectional Generative Adversarial Networks (BiGAN) [7] have introduced an inference network trained to match the joint distributions of $(x, z)$ learnt by the encoder and decoder networks. Typically, the reconstruction in ALI/BiGAN is poor, as the training involves no deterministic pointwise matching between $x$ and its reconstruction. Architectures such as the Wasserstein Autoencoder [27] and the Adversarial Autoencoder [21], which depart from the traditional GAN framework, also include an encoder as part of the network. This led us to consider a formulation with an encoder that can both reduce the cycle loss and aid in clustering.

1.3 Main Contributions

To the best of our knowledge, this is the first work that addresses the problem of clustering in the latent space of GAN. The main contributions of the paper can be summarized as follows:

  • We show that even though the GAN latent variable preserves information about the observed data, the latent points are smoothly scattered based on the latent distribution leading to no observable clusters.

  • We propose three main algorithmic ideas in ClusterGAN in order to remedy this situation.

    1. We utilize a mixture of discrete and continuous latent variables in order to create a non-smooth geometry in the latent space.

    2. We propose a novel backpropagation algorithm accommodating the discrete-continuous mixture, as well as an explicit inverse-mapping network to obtain the latent variables given the data points, since the problem is non-convex.

    3. We propose to jointly train the GAN along with the inverse-mapping network with a clustering-specific loss so that the distance geometry in the projected space reflects the distance geometry of the latent variables.

  • We compare ClusterGAN and other possible GAN based clustering algorithms, such as InfoGAN, along with multiple clustering baselines on varied datasets. This demonstrates the superior performance of ClusterGAN for the clustering task.

  • We demonstrate that ClusterGAN surprisingly retains good interpolation across the different classes (encoded using one-hot latent variables), even though the discriminator is never exposed to such samples.

The formulation is general enough to provide a meta framework that incorporates the additional desirable property of clustering in GAN training.

(a) Non-linear generator with normal latent variables
(b) Linear generator with one-hot encoded latent variables
Figure 2: t-SNE visualization of the latent space: the linear generator recovers clusters.

2 Discrete-Continuous Prior

2.1 Background

Generative adversarial networks consist of two components, the generator $G$ and the discriminator $D$. Both $G$ and $D$ are usually implemented as neural networks, parameterized by $\Theta_G$ and $\Theta_D$ respectively. The generator can be viewed as a mapping from the latent space to the data space, which we denote as $G : \mathcal{Z} \to \mathcal{X}$. The discriminator defines a mapping from the data space to a real value, which can correspond to the probability of a sample being real, $D : \mathcal{X} \to \mathbb{R}$. GAN training sets up a two-player game between $G$ and $D$, defined by the minimax objective
$$\min_{\Theta_G} \max_{\Theta_D} \; \mathbb{E}_{x \sim \mathbb{P}^r_x}\, q(D(x)) \;+\; \mathbb{E}_{z \sim \mathbb{P}_z}\, q(1 - D(G(z))),$$
where $\mathbb{P}^r_x$ is the distribution of real data samples, $\mathbb{P}_z$ is the prior noise distribution on the latent space, and $q(\cdot)$ is the quality function. For the vanilla GAN, $q(x) = \log x$, and for the Wasserstein GAN (WGAN), $q(x) = x$. We also denote the distribution of generated samples as $\mathbb{P}^g_x$. The discriminator and the generator are optimized alternately so that at the end of training $\mathbb{P}^g_x$ matches $\mathbb{P}^r_x$.
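
For concreteness, a minimal PyTorch-style sketch of this game value under the two quality functions might look as follows; the function names are ours, and the discriminator is assumed to output a probability in the vanilla case and a raw score in the WGAN case:

```python
import torch

def q_vanilla(t):
    # Vanilla GAN quality function: q(t) = log t (D is assumed to output a probability).
    return torch.log(t)

def q_wgan(t):
    # WGAN quality function: q(t) = t (D acts as a critic emitting a raw score).
    return t

def game_value(D, G, x_real, z, q):
    # E_x q(D(x)) + E_z q(1 - D(G(z))): the discriminator ascends this value,
    # the generator descends it.
    return q(D(x_real)).mean() + q(1.0 - D(G(z))).mean()
```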

2.2 Vanilla GAN does not cluster well in the latent space

One possible way to cluster using a GAN is to back-project the data into the latent space (using back-propagation decoding [19]) and cluster the latent space. However, this method usually gives poor results (see Fig. 3 for clustering results on MNIST). The key reason is that, if back-propagation indeed succeeds, the back-projected data distribution should look similar to the latent space distribution, which is typically chosen to be a Gaussian or uniform distribution, and we cannot expect to cluster in that space. Thus, even though the latent space may contain full information about the data, the distance geometry in the latent space does not reflect the inherent clustering. In [11], the authors sampled from a Gaussian mixture prior and obtained diverse samples even in limited data regimes. However, even GANs with a Gaussian mixture prior failed to cluster, as shown in Figure 3(c). As observed by the authors of DeLiGAN, the Gaussian components tend to 'crowd' and become redundant. Only lifting the latent space with categorical variables could solve this problem effectively. But continuity in the latent space is traditionally viewed as a prerequisite for good interpolation; in other words, interpolation seems to be at loggerheads with the clustering objective. We demonstrate in this paper how ClusterGAN can obtain good interpolation and good clustering simultaneously.

(a) Uniform
(b) Normal
(c) Gaussian Mix
(d)
Figure 3: t-SNE visualization of the latent space: MNIST

2.3 Sampling from Discrete-Continuous Mixtures

In ClusterGAN, we sample from a prior that consists of normal random variables cascaded with one-hot encoded vectors. To be more precise, $z = (z_n, z_c)$, where $z_n \sim \mathcal{N}(0, \sigma^2 I_{d_n})$ and $z_c = e_k$ with $k \sim \mathcal{U}\{1, \dots, K\}$; here $e_k$ is the $k$-th elementary vector in $\mathbb{R}^K$ and $K$ is the number of clusters in the data. In addition, we need to choose $\sigma$ in such a way that the one-hot vector provides a sufficient signal to the GAN training, so that each mode generates samples only from a corresponding class in the original data. To be more precise, we chose $\sigma$ small in all our experiments so that each dimension of the normal latent variables $z_n$ stays small in magnitude with high probability. The small variance ensures that the clusters in $\mathcal{Z}$ space are well separated. Hence this prior naturally enables us to design an algorithm that clusters in the latent space.
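
As an illustration, a minimal NumPy sketch of this sampling step could look as follows; the function name and the default value of σ are our own choices rather than values from the paper:

```python
import numpy as np

def sample_z(batch_size, dim_n, num_clusters, sigma=0.1, rng=None):
    """Sample z = (z_n, z_c) from the discrete-continuous prior:
    z_n ~ N(0, sigma^2 I) and z_c = e_k with k uniform over the clusters."""
    rng = rng or np.random.default_rng()
    z_n = sigma * rng.standard_normal((batch_size, dim_n))
    k = rng.integers(num_clusters, size=batch_size)
    z_c = np.eye(num_clusters)[k]                 # one-hot rows
    return np.concatenate([z_n, z_c], axis=1), k
```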

2.4 Linear Generator clusters perfectly

The following lemma suggests that, with discrete-continuous mixtures, only a linear generator is needed to produce a mixture of Gaussians in the generated space.

Lemma 1.

Clustering with only the continuous latent variable $z_n$ cannot recover mixture-of-Gaussian data in the linearly generated space. Further, a linear mapping can map discrete-continuous mixtures to a mixture of Gaussians.

Proof.

If the latent space has only the continuous part $z_n \sim \mathcal{N}(0, \sigma^2 I)$, then by linearity any linear generator can only produce a Gaussian in the generated space. Now we show that there exists a linear mapping from discrete-continuous mixtures to the generated data $x \sim \sum_{k=1}^{K} \frac{1}{K}\, \mathcal{N}(\mu_k, \sigma^2 I)$ ($K$ being the number of mixture components). This is possible if we let $z = (z_n, e_k)$ and $x = z_n + M z_c$, with $M$ a matrix whose $k$-th column is the component mean $\mu_k$, so that $x = z_n + \mu_k \sim \mathcal{N}(\mu_k, \sigma^2 I)$. ∎
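
A small numerical sketch of this construction, under illustrative dimensions and component means chosen by us, shows each one-hot component landing on its own Gaussian blob:

```python
import numpy as np

rng = np.random.default_rng(0)
K, d, sigma, n = 4, 2, 0.1, 5000             # illustrative settings, not the paper's
mu = rng.normal(size=(d, K))                 # column k is the mean of component k

z_n = sigma * rng.standard_normal((n, d))    # continuous part of the prior
k = rng.integers(K, size=n)
z_c = np.eye(K)[k]                           # one-hot part of the prior

x = z_n + z_c @ mu.T                         # linear generator: x = z_n + mu_k ~ N(mu_k, sigma^2 I)

for j in range(K):                           # empirical means recover the component means
    print(j, x[k == j].mean(axis=0).round(2), mu[:, j].round(2))
```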

To illustrate this lemma, and hence the drawback of traditional priors for clustering, we performed a simple experiment. The real samples are drawn from a mixture of Gaussians; the component means are sampled randomly and the variance of each component is fixed. We trained a GAN with a normal latent prior, where the generator is a multi-layer perceptron with two hidden layers. For comparison, we also trained a GAN with $z$ sampled from one-hot encoded normal vectors, the dimension of the categorical variable matching the number of components, and a generator consisting of a linear mapping $W$, such that $x = Wz$. After training, the latent vectors are recovered using Algorithm 1 for the linear generator, and with restarts from random initializations for the non-linear generator. Even for this toy setup, the linear generator perfectly clustered the latent vectors (Acc. = 1.0, NMI = 1.0, ARI = 1.0), while the non-linear generator performed poorly (Acc. = 0.73, NMI = 0.75, ARI = 0.60) (Figure 2). The situation becomes worse for real datasets such as MNIST: we trained GANs using latent vectors drawn from uniform, normal and mixture-of-Gaussians priors, and none of these configurations succeeded in clustering in the latent space, as shown in Figure 3.

Figure 4: Fashion items generated from distinct modes : Fashion-MNIST
Figure 5: Digits generated from distinct modes : MNIST

2.5 Modified Backpropagation Based Decoding

Previous works [6] [19] have explored solving an optimization problem in $z$ to recover the latent vectors, $z^* = \arg\min_z \mathcal{L}(G(z), x) + \lambda \lVert z \rVert$, where $\mathcal{L}(\cdot, \cdot)$ is some suitable loss function and $\lVert \cdot \rVert$ denotes a norm. This approach is insufficient for clustering with traditional latent priors, even if back-propagation were lossless and recovered accurate latent vectors. To make the situation worse, the optimization problem above is non-convex in $z$ ($G$ being implemented as a neural network) and can yield different embeddings in the $\mathcal{Z}$ space depending on the initialization. Approaches to address this issue include multiple restarts with different initializations, or stochastic clipping of $z$ at each iteration step. None of these lead to clustering, since they do not address the root problem of sampling from separated manifolds in $\mathcal{Z}$. But our sampling procedure naturally gives rise to such an algorithm. We use a reconstruction loss between $G(z)$ and $x$, and since we sample $z_n$ from a normal distribution, we use a regularizer that penalizes only the normal variables $z_n$. We use $K$ restarts, each sampling $z_c$ from a different one-hot component, and optimize with respect to only the normal variables, keeping $z_c$ fixed. Adam [16] is used for the updates during backprop decoding. Formally, Algorithm 1 summarizes the approach.

Input: Real data point $x$, generator function $G$, number of clusters $K$, regularization parameter $\lambda$, Adam iterations $T$
Output: Latent embedding $\hat{z}$
for k = 1, ..., K do
       Sample $z_n \sim \mathcal{N}(0, \sigma^2 I)$ and initialize $z \leftarrow (z_n, e_k)$ ($e_k$ is the elementary unit vector in $K$ dimensions)
       for t = 1, ..., T do
              Obtain the gradient of the loss $\mathcal{L}(G(z), x) + \lambda \lVert z_n \rVert^2$ with respect to $z_n$
              Update $z_n$ with an Adam iteration to minimize the loss
              Clip $z_n$ to keep it within the high-probability region of the prior
       end for
       Update $\hat{z} \leftarrow z$ if $z$ attains the lowest loss obtained so far
end for
return $\hat{z}$
Algorithm 1 Decode_latent
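
A rough PyTorch sketch of this decoding procedure is shown below. The squared-error reconstruction loss, the clipping range and all default values are illustrative assumptions on our part, not settings taken from the paper:

```python
import torch

def decode_latent(x, G, K, dim_n, sigma=0.1, lam=1e-2, T=500, lr=1e-2):
    """Sketch of Algorithm 1 (backprop decoding) with illustrative defaults.

    For each of the K one-hot components, z_c is held fixed and only the normal
    part z_n is optimized with Adam to minimize ||G(z) - x||^2 + lam * ||z_n||^2.
    The restart with the lowest final loss wins."""
    best_z, best_loss = None, float("inf")
    for k in range(K):
        z_c = torch.eye(K)[k].unsqueeze(0)                      # fixed one-hot part
        z_n = (sigma * torch.randn(1, dim_n)).requires_grad_()  # optimized normal part
        opt = torch.optim.Adam([z_n], lr=lr)
        for _ in range(T):
            opt.zero_grad()
            z = torch.cat([z_n, z_c], dim=1)
            loss = ((G(z) - x) ** 2).sum() + lam * (z_n ** 2).sum()
            loss.backward()
            opt.step()
            with torch.no_grad():
                z_n.clamp_(-3 * sigma, 3 * sigma)  # keep z_n near the prior's support (assumed range)
        if loss.item() < best_loss:
            best_loss, best_z = loss.item(), torch.cat([z_n.detach(), z_c], dim=1)
    return best_z
```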

2.6 Separate Modes for distinct classes in the data

It was surprising to find that, trained in a purely unsupervised manner without additional loss terms, each one-hot encoded component generated points from a specific class in the original data. For instance, $z = (z_n, e_k)$ generated a particular digit $\pi(k)$ in MNIST for multiple samplings of $z_n$ (here $\pi$ denotes a permutation of the classes). This was a necessary first step for the success of Algorithm 1. We also quantitatively evaluated the modes learnt by the GAN by using a supervised classifier for MNIST. The supervised classifier had a high test accuracy, so it could reliably distinguish the digits. We sample $z$ from a mode $k$ and generate a point $x = G(z)$, which the classifier labels as $\hat{y}$. From the pairs $(k, \hat{y})$, we can map each mode to a digit and compute the accuracy with which that digit is generated from the mode; this is denoted as Mode Accuracy. Each digit sample $x$ with label $y$ can be decoded in the latent space by Algorithm 1 to obtain $\hat{z}$. Now $\hat{z}$ can be used to generate $\hat{x} = G(\hat{z})$, which when passed through the classifier gives the label $\hat{y}$. The pair $(y, \hat{y})$ must be equal in the ideal case, and this accuracy is denoted as Reconstruction Accuracy. Finally, all points of the same class in $\mathcal{X}$ space should have the same one-hot encoding when embedded in $\mathcal{Z}$ space; this defines the Cluster Accuracy. This methodology can be extended to quantitatively evaluate mode generation for other datasets as well, provided a reliable classifier is available. For MNIST, we obtained high Mode, Reconstruction and Cluster Accuracy. Some of the modes in Fashion-MNIST and MNIST are shown in Figures 4 and 5, respectively. The supplementary material contains the images from all modes in these two datasets.
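
A small sketch of how Mode Accuracy could be computed from generated samples is given below; the function and the majority-vote mapping from modes to labels are our own formalization of the description above:

```python
import numpy as np

def mode_accuracy(modes, preds):
    """modes[i]: index k of the one-hot component used to generate sample i.
    preds[i]: non-negative integer label assigned by a trusted classifier to G(z_i).
    Each mode is mapped to the majority label of its samples; the accuracy is the
    fraction of samples whose predicted label matches their mode's label."""
    modes, preds = np.asarray(modes), np.asarray(preds)
    correct = 0
    for k in np.unique(modes):
        labels = preds[modes == k]
        correct += (labels == np.bincount(labels).argmax()).sum()
    return correct / len(modes)
```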

2.7 Interpolation in latent space is preserved

The latent space in a traditional GAN with a Gaussian latent distribution enforces that different classes are continuously scattered in the latent space, allowing nice inter-class interpolation, which is a key strength of GANs. In ClusterGAN, the latent vector is sampled with a one-hot component, and in order to interpolate across classes we have to sample from a convex combination of one-hot vectors. While such vectors are never sampled during the training process, we surprisingly observed very smooth inter-class interpolation in ClusterGAN. To demonstrate interpolation, we fixed the continuous part $z_n$ in two latent vectors with different one-hot components, say $e_{k_1}$ and $e_{k_2}$, and interpolated the one-hot encoded part to give rise to new latent vectors $z = (z_n, \beta e_{k_1} + (1-\beta) e_{k_2})$ with $\beta \in [0, 1]$. As Figure 6 illustrates, we observed a nice transition from one digit to another, as well as across different classes in Fashion-MNIST. This demonstrates that ClusterGAN learns a very smooth manifold even along directions of the discrete-continuous distribution that are never trained on. We also show interpolations from a vanilla GAN trained with a Gaussian prior as a reference.
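
A minimal sketch of how such interpolated latent vectors could be constructed (the function name and step count are ours) is:

```python
import numpy as np

def interpolate_classes(z_n, k1, k2, K, steps=8):
    """Keep the 1-D normal part z_n fixed and move along a convex combination of the
    one-hot components e_k1 and e_k2; pass each returned z through the trained
    generator G to render the interpolation path."""
    e1, e2 = np.eye(K)[k1], np.eye(K)[k2]
    betas = np.linspace(0.0, 1.0, steps)
    return np.stack([np.concatenate([z_n, (1 - b) * e1 + b * e2]) for b in betas])
```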


Figure 6: (a) ClusterGAN (left)     (b) vanilla WGAN (right)

3 ClusterGAN

Even though the above approach enables the GAN to cluster in the latent space, it can perform even better with a clustering-specific loss term in the minimax objective. For MNIST, digit strokes correspond well to the categories in the data, but for more complicated datasets we need to enforce structure in the GAN training. One way to ensure this is to enforce precise recovery of the latent vector. We therefore introduce an encoder $E : \mathcal{X} \to \mathcal{Z}$, a neural network parameterized by $\Theta_E$. The GAN objective now takes the following form:

$$\min_{\Theta_G, \Theta_E} \max_{\Theta_D} \; \mathbb{E}_{x \sim \mathbb{P}^r_x}\, q(D(x)) + \mathbb{E}_{z \sim \mathbb{P}_z}\, q(1 - D(G(z))) + \beta_n\, \mathbb{E}_{z \sim \mathbb{P}_z} \lVert z_n - E_n(G(z)) \rVert_2^2 + \beta_c\, \mathbb{E}_{z \sim \mathbb{P}_z}\, \mathcal{H}(z_c, E_c(G(z))) \quad (1)$$

where $E_n(\cdot)$ and $E_c(\cdot)$ denote the continuous and discrete portions of the encoder output and $\mathcal{H}(\cdot, \cdot)$ is the cross-entropy loss. The relative magnitudes of the regularization coefficients $\beta_n$ and $\beta_c$ enable a flexible choice to vary the importance of preserving the discrete and continuous portions of the latent code. One could imagine other variations of the regularization that map $E(x)$ to be close to the centroid of the respective cluster, in a similar spirit to K-means. The GAN training in this approach involves jointly updating the parameters $\Theta_G$ and $\Theta_E$.
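
To make the objective concrete, here is a rough PyTorch sketch of the generator/encoder side of Eq. (1) under the WGAN reading $q(x) = x$; the assumption that $E$ returns the recovered continuous code plus logits over the $K$ components, and the values of $\beta_n$ and $\beta_c$, are ours:

```python
import torch
import torch.nn.functional as F

def clustergan_generator_loss(G, D, E, z_n, z_c, beta_n=10.0, beta_c=10.0):
    """Generator/encoder loss sketch for Eq. (1); beta values are illustrative.

    E(.) is assumed to return (zn_hat, zc_logits): the recovered continuous code
    and unnormalized logits over the K one-hot components."""
    z = torch.cat([z_n, z_c], dim=1)
    x_fake = G(z)
    adv = -D(x_fake).mean()                                  # fool the WGAN critic
    zn_hat, zc_logits = E(x_fake)
    recon_n = F.mse_loss(zn_hat, z_n)                        # preserve continuous code
    recon_c = F.cross_entropy(zc_logits, z_c.argmax(dim=1))  # preserve one-hot code
    return adv + beta_n * recon_n + beta_c * recon_c
```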

4 Experiments

 

Dataset      Algorithm        ACC    NMI    ARI
Synthetic    ClusterGAN       0.99   0.99   0.99
             Info-GAN         0.88   0.75   0.74
             GAN with bp      0.95   0.85   0.88
             GAN with Disc.   0.99   0.98   0.98
             AGGLO.           0.99   0.99   0.99
             NMF              0.98   0.96   0.97
             SC               0.99   0.98   0.98
MNIST        ClusterGAN       0.95   0.89   0.89
             Info-GAN         0.87   0.84   0.81
             GAN with bp      0.95   0.90   0.89
             GAN with Disc.   0.70   0.62   0.52
             DCN              0.83   0.81   0.75
             AGGLO.           0.64   0.65   0.46
             NMF              0.56   0.45   0.36
Fashion-10   ClusterGAN       0.63   0.64   0.50
             Info-GAN         0.61   0.59   0.44
             GAN with bp      0.56   0.53   0.37
             GAN with Disc.   0.43   0.37   0.23
             AGGLO.           0.55   0.57   0.37
             NMF              0.50   0.51   0.34
Fashion-5    ClusterGAN       0.73   0.59   0.48
             Info-GAN         0.67   0.55   0.42
             GAN with bp      0.73   0.54   0.45
             GAN with Disc.   0.67   0.49   0.40
             AGGLO.           0.66   0.52   0.36
             NMF              0.67   0.48   0.40
10x_73k      ClusterGAN       0.83   0.73   0.69
             Info-GAN         0.62   0.58   0.43
             GAN with bp      0.65   0.59   0.45
             GAN with Disc.   0.33   0.17   0.07
             AGGLO.           0.63   0.58   0.40
             NMF              0.71   0.69   0.53
             SC               0.40   0.29   0.18
Pendigits    ClusterGAN       0.79   0.73   0.65
             Info-GAN         0.72   0.73   0.61
             GAN with bp      0.76   0.71   0.63
             GAN with Disc.   0.65   0.57   0.45
             DCN              0.72   0.69   0.56
             AGGLO.           0.70   0.69   0.52
             NMF              0.67   0.58   0.45
             SC               0.70   0.69   0.52

Table 1: Comparison of clustering metrics across datasets

4.1 Datasets

Synthetic Data The data is generated from a mixture of Gaussians in 2D, which constitutes the $\mathcal{Z}$ space. We generated an equal number of points from each Gaussian. The $\mathcal{X}$ space is obtained by a non-linear transformation of $z$, composed of random linear maps and the sigmoid function, the latter introducing the non-linearity.

MNIST It consists of 70k images of handwritten digits from 0 to 9. Each data sample is a 28×28 greyscale image. We used a DCGAN-style architecture with conv-deconv layers, batch normalization and leaky ReLU activations, the details of which are available in the supplementary material.

Fashion-MNIST (10 and 5 classes) This dataset has the same number of images and the same image size as MNIST, but it is considerably more complicated. Instead of digits, it consists of various types of fashion products, and supervised methods achieve lower accuracy on it than on MNIST. For training the GAN, we used the same architecture as for MNIST. We also merged similar categories to form a separate 5-class dataset. The five groups were as follows: {Tshirt/Top, Dress}, {Trouser}, {Pullover, Coat, Shirt}, {Bag}, {Sandal, Sneaker, Ankle Boot}.

10x_73k

Even though GANs have achieved unprecedented success in generating realistic images, it is not clear whether they can be equally effective for other types of data. In this experiment, we trained a GAN to cluster cell types from a single-cell RNA-seq counts matrix. Moreover, while computer vision has an ample supply of labelled images, obtaining labels in some fields, for instance biology, is extremely costly and laborious. Thus, unsupervised clustering of data is truly a necessity for this domain. The dataset consists of RNA-transcript counts for roughly 73k cells belonging to different cell types [33]. To reduce the dimension of the data, we selected the highest-variance genes across the cells. The entries of the counts matrix are first log-transformed and then divided by the maximum entry of the transformation to obtain values in the range [0, 1]. One of the major challenges in this data is sparsity: even after sub-selecting genes based on variance, the data matrix remained dominated by zero entries.
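
A minimal sketch of this preprocessing pipeline might look as follows; the number of retained genes and the exact form of the log transform are illustrative assumptions, not values from the paper:

```python
import numpy as np

def preprocess_counts(counts, n_genes=1000):
    """Select the highest-variance genes from a cells-by-genes count matrix,
    log-transform the counts and scale the entries into [0, 1]."""
    counts = np.asarray(counts, dtype=float)
    top = np.argsort(counts.var(axis=0))[-n_genes:]   # highest-variance genes (count assumed)
    x = np.log(1.0 + counts[:, top])                  # log(1 + count) transform (assumed form)
    return x / x.max()                                # scale entries into [0, 1]
```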

Pendigits This is a very different dataset, consisting of time series of (x, y) coordinates sampled as writers trace digits on a pressure-sensitive tablet. It contains 10 classes, one for each digit, and provided a unique challenge of training GANs on point-cloud data.

In all our experiments in this paper, we used an improved GAN variant (WGAN-GP), which includes a gradient penalty [10]. Using cross-validation to select hyperparameters is not an option in purely unsupervised problems due to the absence of labels. We adapted standard architectures for the datasets [4] and avoided data-specific tuning as much as possible. The same choice of the regularization parameters $\beta_n$ and $\beta_c$ worked well across all datasets.

4.2 Evaluation

Since clustering is an unsupervised problem, we ensured that all algorithms are oblivious to the true labels, unlike a supervised framework such as conditional GAN [23]. We compared ClusterGAN with the other GAN-based clustering approaches we could conceive of.

 

Dataset      ClusterGAN   WGAN (Normal)   WGAN (One-Hot)   InfoGAN
MNIST        0.81         0.88            0.94             1.88
Fashion      0.91         0.95            6.14             11.04
10x_73k      2.50         2.02            2.24             25.59
Pendigits    9.56         6.45            13.44            87.80

Table 2: Comparison of Fréchet Inception Distance (FID) (lower distance is better)

 

Dataset: MNIST, Algorithm: ClusterGAN
K      7      9      10     11     13
ACC    0.60   0.84   0.95   0.80   0.84

Table 3: Robustness to Cluster Number

Algorithm 1 followed by K-means is denoted as “GAN with bp”. For InfoGAN, we used the categorical code inferred by its recognition network as the cluster label for each data point. Further, the features in the last layer of the discriminator could contain class-specific discriminative information, so we also applied K-means to these discriminator features to cluster, denoted as “GAN with Disc.”. We also included clustering results from Non-negative Matrix Factorization (NMF) [18], Agglomerative Clustering (AGGLO) [32] and Spectral Clustering (SC). AGGLO with Euclidean affinity and Ward linkage gave the best results. NMF used both l1 and l2 regularization, was initialized with Non-negative Double SVD and used the KL-divergence loss. SC used an RBF kernel for the affinity. We report normalized mutual information (NMI), adjusted Rand index (ARI) and clustering purity (ACC). Since DCN has been shown to outperform various deep-learning-based clustering algorithms, we report its metrics from the paper [31] for MNIST and Pendigits. We found DCN to be very sensitive to the choice of hyperparameters, architecture and learning rates and could not obtain reasonable results from it on the other datasets; but we outperformed the reported DCN results on MNIST and Pendigits.
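
For reference, a sketch of how these three metrics could be computed with scikit-learn and SciPy is given below; the Hungarian-matching definition of ACC is our assumption about how the purity/accuracy score is obtained:

```python
import numpy as np
from sklearn import metrics
from scipy.optimize import linear_sum_assignment

def clustering_scores(y_true, y_pred):
    """Return (ACC, NMI, ARI) for integer-labelled clusterings, as in Table 1.
    ACC maps predicted clusters to true labels with the Hungarian algorithm."""
    nmi = metrics.normalized_mutual_info_score(y_true, y_pred)
    ari = metrics.adjusted_rand_score(y_true, y_pred)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((n, n), dtype=int)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1                            # confusion counts
    row, col = linear_sum_assignment(-cost)        # maximize matched counts
    acc = cost[row, col].sum() / len(y_true)
    return acc, nmi, ari
```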

Since clustering metrics do not reveal the quality of the samples generated by a GAN, we also report the Fréchet Inception Distance (FID) [13] for all real datasets. We found that ClusterGAN achieves good clustering without compromising sample quality. In fact, for our image datasets, ClusterGAN samples are closer to the real distribution than those of a vanilla WGAN with a Gaussian prior.

In all datasets, we provided the true number of clusters to all algorithms. In addition, for MNIST, Table 3 shows the clustering performance of ClusterGAN as the number of clusters is varied. Overestimating the number of clusters does not severely hurt ClusterGAN, but underestimating it does.

5 Discussion and Future Work

In this work, we discussed the drawback of training a GAN with traditional prior latent distributions for clustering and considered discrete-continuous mixtures for sampling noise variables. We proposed ClusterGAN, an architecture that enables clustering in the latent space. Comparison with clustering baselines on varied datasets using ClusterGAN illustrates that GANs can be suitably adapted for clustering. Future directions can explore better data-driven priors for the latent space. Another possibility is to improve results for problems that have a sparse generative structure such as compressed sensing.

References

6 Supplementary Material

6.1 Hyperparameter and Architecture Details

The networks were trained with the Adam optimizer for all datasets, with multiple discriminator updates for each generator update. The same gradient penalty coefficient for WGAN-GP was used in all experiments. The dimension of $z_c$ is the same as the number of classes in the dataset. Most networks used Leaky ReLU activations and Batch Normalization (BN); details for each dataset are provided below. (In the architecture without the encoder, Algorithm 1 used the Adam optimizer to minimize the objective for a fixed number of iterations per point.)

Synthetic Data

We used fully connected networks with LReLU activations; the batch size, the dimension of $z_n$, the LReLU leak and the coefficients $\beta_n$, $\beta_c$ were fixed for this dataset.

Generator: Input $z$; FC, LReLU, BN; FC, LReLU, BN; FC, Sigmoid
Encoder: Input $x$; FC, LReLU, BN; FC, LReLU, BN; FC, linear (Softmax on the last $K$ outputs to obtain $z_c$)
Discriminator: Input $x$; FC, LReLU, BN; FC, LReLU, BN; FC, linear

MNIST and Fashion-MNIST

We used the same batch size and dimension of $z_n$ for both datasets, with LReLU activations; one regularization coefficient differed between MNIST and Fashion-MNIST, while the other was shared by both.

Generator: Input $z$; FC, ReLU, BN; FC, ReLU, BN; upconv 64, stride 2, ReLU, BN; upconv 1, stride 2, Sigmoid
Encoder: Input $x$; conv 64, stride 2, LReLU; conv 128, stride 2, LReLU; FC, LReLU; FC, linear (Softmax on the last $K$ outputs to obtain $z_c$)
Discriminator: Input $x$; conv 64, stride 2, LReLU; conv 128, stride 2, LReLU; FC, LReLU; FC, linear

For Fashion-MNIST, apart from the differing coefficient noted above, the rest of the architecture remained identical.

10x_73k

We used fully connected networks with LReLU activations; the batch size, the dimension of $z_n$ and the coefficients $\beta_n$, $\beta_c$ were fixed for this dataset.

Generator: Input $z$; FC, LReLU; FC, LReLU; FC, linear
Encoder: Input $x$; FC, LReLU; FC, LReLU; FC, linear (Softmax on the last $K$ outputs to obtain $z_c$)
Discriminator: Input $x$; FC, LReLU; FC, LReLU; FC, linear

Pendigits

We used fully connected networks with LReLU activations; the batch size, the dimension of $z_n$ and the coefficients $\beta_n$, $\beta_c$ were fixed for this dataset.

Generator: Input $z$; FC, LReLU, BN; FC, LReLU, BN; FC, Sigmoid
Encoder: Input $x$; FC, LReLU, BN; FC, LReLU, BN; FC, linear (Softmax on the last $K$ outputs to obtain $z_c$)
Discriminator: Input $x$; FC, LReLU, BN; FC, LReLU, BN; FC, linear

For InfoGAN, we used the authors' implementation (https://github.com/openai/InfoGAN) for MNIST and Fashion-MNIST. For the other datasets, we used our Generator and Discriminator hyperparameters and added the Q network (FC 128, BN, LReLU, followed by an FC layer whose output dimension equals the number of clusters). For “GAN with bp”, we used the same Generator and Discriminator hyperparameters as ClusterGAN. Features for “GAN with Disc.” were obtained from the trained Discriminator of the “GAN with bp” experiments.

6.2 Generated Modes

Figure 7: Generated digits from distinct modes (one panel per mode)
Figure 8: Generated fashion items from distinct modes (one panel per mode)