1 Introduction
Learning a good representation of a complex data distribution can be achieved with deep directed generative models. Among them, the Generative Adversarial Network (GAN) [Goodfellow2016] generates a complicated data space by sampling from a simple predefined latent space. Specifically, a generator is trained to map latent samples to real data, and a discriminator to differentiate real samples from generated ones. However, the original GAN only learns the forward mapping from an entangled latent space to the data space. Given complicated data, it lacks an inverse inference network to map the data back to an interpretable latent space.
Efforts have been made to learn this bidirectional mapping in an adversarial way. InfoGAN [Chen et al.2016] addresses the uninformative latent space of GAN by disentangling the latent variables and maximizing the mutual information between a subset of the variables and the observations. InfoGAN is able to learn a representation with semantic meaning in a fully unsupervised way, but it cannot achieve faithful reconstruction. Another model, the Adversarial Autoencoder (AAE) [Makhzani et al.2015], performs variational inference by matching the aggregated posterior distribution to the prior distribution using an adversarial loss. Its autoencoder-like structure guarantees good reconstruction performance, but generation from sampled latent variables is not faithful enough. BiGAN [Donahue et al.2016] and ALI [Dumoulin et al.2016] both propose an encoder (inference network) and a decoder (generative network), and seek to match the joint distributions of latent variables and data induced by the two networks. However, their objective functions place no constraint on the relationship between the latent variables and the observations, which results in unsatisfactory reconstruction. ALICE [Li et al.2017] resolves this non-identifiability issue by additionally optimizing the conditional entropy, but it does not learn a disentangled latent space for semantic interpretation and knowledge discovery.
Bidirectional mapping is also addressed in applications such as image domain transformation and semantic image editing. In BicycleGAN [Zhu et al.2017b], the authors differentiate two models, cVAE-GAN and cLR-GAN, and explain the hybrid model intuitively (in terms of real or fake sampling). It does not encode interpretable information into the latent vector, but directly concatenates the vector with images from another domain. crVAE [Shang et al.2017] demonstrates the semantic meaning of its latent vector only by visual inspection. IAN [Brock et al.2016] proposes a hybrid of VAE and GAN for semantic photo editing that improves the representation capability of the latent space without increasing its dimensionality: the decoder of the VAE serves as the generator of the GAN, and hidden-layer outputs of the discriminator quantify the reconstruction loss, which was shown to improve reconstruction quality. However, no cycle consistency is enforced and the latent space is not disentangled. DTN [Taigman et al.2016] applies a similar structure to image domain transfer, but its latent space is not constrained to a regularized distribution, so random generation is not performed.
In this paper, we seek to learn a generic interpretable representation and a bidirectional network capable of reconstruction, generation and clustering at the same time. A model supporting all these capabilities is important for data analysis and transmission: reconstruction helps data compression during transmission, while clustering and generation benefit the analysis of complicated data without human prior knowledge.
We first perform a theoretical analysis of two popular unsupervised learning models, the Adversarial Autoencoder (AAE) and the Information Maximizing GAN (InfoGAN). We identify their respective advantages and disadvantages by studying the loss functions they minimize, and relate them to mutual information and conditional entropy from information theory [Zhao et al.2017]. We then propose a novel model that uses cycle consistency [Zhu et al.2017a, Yi et al.2017, Kim et al.2017] to combine the two, achieving better overall performance in unsupervised classification accuracy, data reconstruction and generation by learning a useful, generic, disentangled latent space. Finally, we show and analyze the effectiveness of this new model on the MNIST, FashionMNIST, CelebA and SVHN datasets.

2 Related Work
In this section, we review two models, AAE [Makhzani et al.2015] and InfoGAN [Chen et al.2016], by studying their network structures and loss functions. We denote the parameters of the generator (from z to x) as θ, and those of the encoder (from x to z) as φ.
2.1 Adversarial Autoencoder (AAE)
The Adversarial Autoencoder [Makhzani et al.2015] is closely related to the Variational Autoencoder (VAE) [Kingma and Welling2013, Rezende et al.2014, Doersch2016]. The basic AAE-like structure is shown in figure 1. Recall that the variational lower bound optimized by the VAE is,
(1)  L(θ, φ; x) = −D_KL( q_φ(z|x) ‖ p(z) ) + E_{q_φ(z|x)}[ log p_θ(x|z) ]
The first term is the regularization term, which matches q_φ(z|x), the posterior of z conditioned on x, to a target prior distribution p(z) using the KL divergence. The second term represents the reconstruction loss: given the data x, generate the latent representation z, and then use this z to reconstruct the data. Notice that the loss above is for a specific data point x. To get the loss over a training batch, we average it over the data distribution p_d(x), namely
(2)  L(θ, φ) = E_{p_d(x)}[ −D_KL( q_φ(z|x) ‖ p(z) ) + E_{q_φ(z|x)}[ log p_θ(x|z) ] ]
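For a diagonal-Gaussian posterior q_φ(z|x) = N(μ, diag(σ²)) and a standard normal prior, the KL regularization term above has a well-known closed form. A minimal sketch in plain Python (illustrative only, not the paper's implementation):

```python
import math

def kl_to_standard_normal(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ),
    summed over latent dimensions -- the regularization term of
    the VAE objective."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, logvar))
```

The term vanishes exactly when the posterior matches the prior and is positive otherwise, which is what drives the aggregated posterior toward p(z).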
The loss function of AAE is similar to that of the VAE, except that the regularization term is replaced by an adversarial learning process (represented by the JS divergence) on the aggregated posterior distribution. The objective function for AAE therefore becomes,
(3)  min_{θ,φ}  D_JS( q_φ(z) ‖ p(z) ) − E_{p_d(x)} E_{q_φ(z|x)}[ log p_θ(x|z) ],  where q_φ(z) = E_{p_d(x)}[ q_φ(z|x) ] is the aggregated posterior.
InfoVAE [Zhao et al.2017] generalizes the regularization term of AAE to a family of divergences, and justifies the richer information carried by latent codes sampled from the aggregated posterior distribution. The authors also prove that the latent space learned by InfoVAE (or its variant AAE) does not suffer from variance explosion or uninformative latent modeling. However, unsupervised generation with a disentangled latent vector was not reported in the original AAE or InfoVAE papers.
2.2 Information Maximizing GAN (InfoGAN)
Another unsupervised learning model is InfoGAN (figure 1), which adds an information-maximizing term on top of the vanilla GAN [Goodfellow2016], so that the generator is forced to use all the information contained in its input when generating samples. In the original InfoGAN paper, the latent vector is disentangled into categorical, continuous and noise parts, and the discriminator outputs the categorical and continuous parts to achieve the mutual information maximization. The design of the discriminator is similar to previous work such as [Odena et al.2016, Salimans et al.2016, Springenberg2015]. The loss function of InfoGAN can be written as,
(4)  min_{θ,φ} max_D  E_{p_d(x)}[ log D(x) ] + E_{p(z)}[ log(1 − D(G_θ(z))) ] − λ E_{p(z)} E_{p_θ(x|z)}[ log q_φ(z|x) ]
InfoGAN achieves fully unsupervised representation learning with a disentangled semantic latent space. Both its generation and clustering performance are impressive, but the reconstruction quality of input images was not reported.
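The information term of InfoGAN is a variational lower bound on the mutual information, L_I = E_{c,x}[log q(c|x)] + H(c) ≤ I(c; x), tight when the auxiliary distribution q equals the true posterior. A discrete toy check (illustrative only, with hand-picked distributions; not the paper's code):

```python
import math

def mi_lower_bound(p_c, p_x_given_c, q_c_given_x):
    """Variational lower bound L_I = E[log q(c|x)] + H(c) <= I(c; x)
    for discrete toy distributions given as nested lists."""
    h_c = -sum(p * math.log(p) for p in p_c if p > 0)
    expect = sum(p_c[c] * p_x_given_c[c][x] * math.log(q_c_given_x[x][c])
                 for c in range(len(p_c))
                 for x in range(len(p_x_given_c[0]))
                 if p_c[c] * p_x_given_c[c][x] > 0)
    return expect + h_c

p_c = [0.5, 0.5]                       # uniform categorical code
p_x_given_c = [[1.0, 0.0], [0.0, 1.0]]  # deterministic "generator"
q_exact = [[1.0, 0.0], [0.0, 1.0]]      # the true posterior q(c|x)
q_blur = [[0.9, 0.1], [0.1, 0.9]]       # an imperfect auxiliary network
```

With the exact posterior the bound attains I(c; x) = H(c) = log 2; any imperfect q gives a strictly smaller value, which is why maximizing L_I pushes the generator to keep c recoverable from x.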
3 RepGAN
In this section, we derive the connection between the loss functions of the previous models and conditional entropy, and compare their influence on the learned representation. We then integrate the losses and propose a novel model, Representation GAN (RepGAN), to learn a useful disentangled latent space.
3.0.1 Conditional Entropy
The conditional entropy measures the uncertainty of one random variable given the other. For example, H(x|z) quantifies the uncertainty of the observation space x given the latent space z, which can be formulated as,

(5)  H(x|z) = −E_{p(x,z)}[ log p(x|z) ] ≤ −E_{p_d(x)} E_{q_φ(z|x)}[ log p_θ(x|z) ],

where the upper bound follows from the non-negativity of the KL divergence between p(x|z) and p_θ(x|z).
Comparing the upper bound of the conditional entropy with the second term in the VAE/AAE loss (equation 2), we can see that they are exactly the same expression. As a result, the upper bound of the conditional entropy is equivalent to the reconstruction term of the VAE/AAE objective function. Minimizing this upper bound also maximizes the mutual information I(x; z), provided the entropy H(x) of the data distribution is assumed fixed.
Symmetrically, the conditional entropy of the latent variables given the data is,

(6)  H(z|x) = −E_{p(z,x)}[ log p(z|x) ] ≤ −E_{p(z)} E_{p_θ(x|z)}[ log q_φ(z|x) ]

We see that the reconstruction loss term in equation 4 is exactly the upper bound of the conditional entropy H(z|x). As noted by the authors of the original InfoGAN paper, this loss minimizes the conditional entropy and consequently maximizes the mutual information I(z; x).
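In the discrete case the conditional entropy of equations (5)–(6) can be checked numerically via the chain rule H(X|Z) = H(X, Z) − H(Z). A small self-contained sketch (toy distributions of our choosing):

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def conditional_entropy(joint):
    """H(X|Z) = H(X,Z) - H(Z) for a joint table joint[x][z]."""
    h_joint = entropy([p for row in joint for p in row])
    p_z = [sum(row[z] for row in joint) for z in range(len(joint[0]))]
    return h_joint - entropy(p_z)
```

A deterministic mapping z → x leaves no residual uncertainty (H(X|Z) = 0), while independent x and z give H(X|Z) = H(X); minimizing H(X|Z) therefore pushes toward the deterministic extreme.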
3.0.2 Comparing AAE and InfoGAN
Now we study the properties of the loss functions of AAE and InfoGAN. AAE minimizes the conditional entropy H(x|z); it is trained to decrease the uncertainty of x given z. As shown in figure 2, AAE exhibits a stochastic mapping from x to the latent representation z. Comparing equation 4 to equation 2, we can see that they are symmetric to each other: AAE follows an x→z→x pattern while InfoGAN follows a z→x→z pattern. As a result, the same argument as in the AAE case shows that InfoGAN actually maps multiple data points back to the same latent representation, as shown in figure 2.
Both AAE and InfoGAN try to maximize the mutual information by minimizing conditional entropies. However, since conditional entropy is not symmetric, the two models have different focuses. Specifically, AAE maps multiple points in the latent space to a single point in the data space, whereas InfoGAN maps multiple points in the data space to a single point in the latent space. Therefore, AAE is good at reconstruction (when the latent space is large enough), though its classification performance is not guaranteed. Conversely, InfoGAN is good at classification, because different digits with subtle differences can be put into the same category, which makes the classifier robust to noise and small style changes; but its reconstruction performance is not guaranteed.
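This asymmetry is easy to see in a discrete toy case: if two latent codes decode to the same image, x is certain given z while z stays uncertain given x. A small check (illustrative distributions of our choosing, not from the paper):

```python
import math

def cond_entropy(joint, given):
    """Conditional entropy for a joint table joint[x][z]:
    given='z' returns H(X|Z); given='x' returns H(Z|X)."""
    h_joint = -sum(p * math.log(p) for row in joint for p in row if p > 0)
    if given == 'z':
        marg = [sum(row[z] for row in joint) for z in range(len(joint[0]))]
    else:
        marg = [sum(row) for row in joint]
    return h_joint + sum(p * math.log(p) for p in marg if p > 0)

# both latent codes z=0 and z=1 decode to the same image x=0
joint = [[0.5, 0.5], [0.0, 0.0]]
```

Here H(X|Z) = 0 while H(Z|X) = log 2, so a model minimizing only one of the two conditional entropies can leave the other arbitrarily large — exactly the AAE versus InfoGAN trade-off described above.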
In fact, if we follow the design of the latent vector in the original InfoGAN paper, the reconstruction of InfoGAN cannot be good, because the noise at the input of InfoGAN is not recovered at the output; subtle information describing the details of the image is discarded during reconstruction. In our experiments, we find that the noise actually changes the generated image greatly, as shown in figures 10 and 11. If the noise is simply discarded, the reconstructed images differ substantially from the originals.
To further understand the mapping relations of AAE and InfoGAN, consider the discrete case. The second term in equation 2 is minimized to zero when p_θ(x|z) equals one for every (x, z) pair, where the z are samples from the encoder q_φ(z|x) of the AAE. Thus in the optimal case, the probability mass function q_φ(z|x) should have disjoint support for different x [Zhao et al.2017], but it is not optimized to 1 for any specific z. That is why one data point x can be mapped to different z, and multiple z can be used to reconstruct the same x. Similar explanations apply to the mapping property of InfoGAN.

Another problem is the dimension of the latent vector. A lower latent dimension suffers from insufficient representation ability, while a higher latent dimension increases the difficulty of distribution regularization using adversarial learning or KL divergence. For example, consider the case where the latent space has a categorical one-hot variable c and a continuous variable s, and we wish to categorize the digits in MNIST in terms of which number they represent and their style. If the input is a slim, clockwise-rotated digit 7, the encoder should output c with the 7th element being one and the others zero, and s should have its two elements describing the style of the digit, namely slimness and rotation. However, apart from being slim and rotated, this digit 7 still has many other characteristics, perhaps a bar in the middle or a sharp turn at the corner. That subtle information is lost during encoding due to the insufficient latent dimension. As a result, when the latent space is not large enough, the loss function of AAE makes the output of the decoder converge to an averaged version of all inputs represented by the same latent code, which yields a blurred image when training with the L2 norm under a Gaussian assumption. When the latent space is large enough, the reconstruction of AAE is much better, but the high latent dimension increases the difficulty of latent regularization. These drawbacks are shown in figures 4 and 5.

3.0.3 Model Structure
In order to combine the strengths of both models, we propose to train AAE and InfoGAN together with shared parameters, so that the new model achieves good classification and reconstruction performance at the same time. The network architecture is illustrated in figure 1. Specifically, the encoders (X2Rep) of AAE and InfoGAN are the same module with shared parameters; likewise, the decoders (Rep2X) of AAE and InfoGAN are the same module with shared parameters. During training, we alternate between the InfoGAN objective and the AAE objective, so that classification accuracy is improved by the InfoGAN updates while reconstruction performance is improved by the AAE updates. The InfoGAN training is emphasized, which experimentally gives better results: the ratio of InfoGAN to AAE training steps is 5:1. The latent vector is split into three subsets, a categorical variable c, a continuous variable s, and a noise n; the continuous and noise variables can be sampled from Gaussian distributions. The full objective function for RepGAN is,

(7)  min_{θ,φ} max_{D_z, D_x}  D_JS( q_φ(z) ‖ p(z) ) − E_{p_d(x)} E_{q_φ(z|x)}[ log p_θ(x|z) ] + D_JS( p_θ(x) ‖ p_d(x) ) − E_{p(z)} E_{p_θ(x|z)}[ log q_φ(z|x) ],

where the two JS-divergence terms are estimated adversarially by the latent discriminator D_z and the image discriminator D_x.
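As a concrete illustration of the latent design above (our sketch, not the authors' code), sampling z = [c, s, n] can be written as follows; the default dimensions (10 + 2 + 20 = 32) match the decoder input used in the experiments:

```python
import random

def sample_latent(n_cat=10, n_cont=2, n_noise=20, sigma=0.5):
    """Sample the disentangled latent vector z = [c, s, n]:
    a one-hot categorical code c, a Gaussian continuous code s,
    and Gaussian noise n."""
    c = [0.0] * n_cat
    c[random.randrange(n_cat)] = 1.0                        # one-hot category
    s = [random.gauss(0.0, sigma) for _ in range(n_cont)]   # style code
    n = [random.gauss(0.0, sigma) for _ in range(n_noise)]  # residual noise
    return c + s + n
```

During evaluation the categorical slot is fixed to pick a cluster while s and n are varied, which is how the style and noise traversals later in the paper are produced.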
In practical implementation, we rewrite the objective function as,
(8)  L = L_adv(D_z) + L_adv(D_x) + L_rec(x, x̂) + L_CE(c, ĉ) + L_NLL(s, ŝ)
where L_rec is computed as the L2 norm for image reconstruction, L_CE represents the cross-entropy loss for the categorical code, and L_NLL is the negative log-likelihood of a Gaussian for the continuous code, using the reparameterization trick as in InfoGAN. The model structure for the experiments is summarized in table 1.
The stride of every convolution layer is 2, and we follow the structural design and optimization tricks of WGAN [Arjovsky and Bottou2017, Arjovsky et al.2017]. The learning rates for the generators, the AAE discriminator and the InfoGAN discriminator were set separately for each of MNIST, FashionMNIST and SVHN.

encoder | decoder
In 28x28x1 | In 32x1
4x4x64 conv, LReLU, BN | FC 1024, ReLU, BN
4x4x128 conv, LReLU, BN | FC 7x7x128, ReLU, BN
FC 1024, LReLU, BN | 4x4x64 deConv, ReLU, BN
c: FC 10, softmax, BN | 4x4x1 deConv, Sigmoid
s mean: FC 2, LReLU, BN |
s sigma: FC 2, LReLU, BN, exp() |
n: FC 20, LReLU, BN |

Dz | Dx
In c/s/n | In 28x28x1
FC 3000, LReLU | 4x4x64 conv, LReLU
FC 3000, LReLU | 4x4x128 conv, LReLU, BN
FC 1, raw output (WGAN) | FC 1024, LReLU, BN
 | FC 1, sigmoid

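The alternating training schedule described in section 3.0.3 (InfoGAN and AAE updates at a 5:1 ratio) can be sketched as follows; `infogan_step` and `aae_step` are hypothetical placeholders for the two parameter updates:

```python
def alternate_training(n_iters, infogan_step, aae_step, ratio=5):
    """Run `ratio` InfoGAN-style updates for every AAE-style update,
    emphasizing the InfoGAN objective as in the paper (5:1)."""
    schedule = []
    for i in range(n_iters):
        if i % (ratio + 1) < ratio:
            infogan_step()
            schedule.append("infogan")
        else:
            aae_step()
            schedule.append("aae")
    return schedule
```

Because both objectives update the same shared encoder and decoder, this interleaving lets the InfoGAN steps shape the clustering while the AAE steps keep reconstructions faithful.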
4 Experiments
We test the three models AAE, InfoGAN and RepGAN on the MNIST [LeCun et al.2010], FashionMNIST [Xiao et al.2017] and SVHN [Netzer et al.2011] datasets. We conduct two sets of experiments with different types of latent variable z. First, z is not disentangled but directly sampled from an isotropic Gaussian distribution; here we investigate the mapping theory discussed in section 3.0.2. Second, as in the original InfoGAN, we split the latent vector into three slots: a one-hot vector c sampled from a categorical distribution, a continuous vector s sampled from a Gaussian, and a random noise n. In addition to reconstruction and generation performance, we demonstrate the unsupervised and semi-supervised clustering performance of RepGAN, and the importance of the noise.
4.1 Gaussian Latent space
We first implement AAE, InfoGAN and RepGAN with a single entangled latent space, using a latent vector sampled from an isotropic Gaussian distribution with zero mean and 0.5 variance. We vary the dimension of the latent vector over 2, 8, 16, 32 and 64, and compare the reconstruction performance in image and latent space for all three models. Training and testing are conducted on the MNIST dataset.
4.1.1 Image Reconstruction
The image reconstruction result is computed by organizing the trained networks into an AAE-like structure and feeding real data samples to the input. The visualization is shown in figure 3. AAE achieves better reconstruction than InfoGAN, and RepGAN is almost as good as AAE. As shown in figures 3 and 5, InfoGAN reconstructs poorly: for all latent dimensions its error remains the highest among the three models, because the InfoGAN loss places no constraint on image reconstruction.
4.1.2 Latent Reconstruction and Generation
For latent reconstruction evaluation, we follow the structure of InfoGAN for testing. After training all the models, we reorganize the network structure and feed a sampled latent vector into the network. We plot the MSE of the latent code in figure 5 to examine the latent regularization ability and the existence of mode collapse. For AAE, the error becomes large when the latent dimension is high, because of (1) heavy mode collapse: given different z, the model generates identical x, and (2) imperfect latent regularization, as in VAE. In this case, when we sample from the true prior distribution, the AAE model cannot generate good-quality images, because the high-quality manifold has shifted. As shown in figure 4, AAE cannot generate high-quality samples when the latent dimension is too large, and cannot generate sharp samples when the dimension is small. The model fails to learn a good latent distribution when the latent dimension is greater than or equal to 16.
In contrast, InfoGAN and RepGAN achieve identically good performance in latent space modeling and new sample generation; the generated images are sharp and clear. All the images in figure 4 are randomly sampled from the generation results. Compared with AAE and InfoGAN, which can each guarantee only recognition or generation, the proposed RepGAN achieves both capabilities simultaneously by constraining both conditional entropies, so that the mapping between the latent variables and the real data shrinks toward a bijection.
4.2 Disentangled Latent space
In this experiment, we disentangle the latent space and follow the original structure of InfoGAN: c, s and n have dimensions 10, 2 and 20, respectively. In addition to reconstruction and generation, we compare unsupervised and semi-supervised clustering performance. We also investigate the importance of the noise.
4.2.1 Unsupervised Learning
Model | MNIST (Acc) | MNIST (MSE) | FMNIST (Acc) | FMNIST (MSE)
AAE | 86.92% | 0.007 | 57.30% | 0.015
InfoGAN | 95% | 0.07 | 53.81% | 0.098
VADE | 94.46% | None | None | None
DEC | 84.30% | None | None | None
RepGAN | 96.74% | 0.02 | 58.64% | 0.013

Model | SVHN (Acc) | SVHN (MSE) | FMNIST (Acc) | FMNIST (MSE)
AAE | 71.67% | 0.004 | 83.82% | 0.02
InfoGAN | 74.16% | 0.03 | 79.41% | 0.12
RepGAN | 76.05% | 0.006 | 82.81% | 0.034
When evaluating the unsupervised clustering accuracy, we set the continuous and noise vectors to zero and generate the cluster head of each cluster. We then search the training set for the sample closest to each cluster head and assign that sample's label to the whole cluster. Finally, we compute the accuracy based on the assigned cluster labels. Table 2 shows the classification accuracies alongside comparable models VADE [Jiang et al.2016] and DEC [Xie et al.2016] on MNIST and FashionMNIST. InfoGAN and RepGAN achieve average accuracies of 95% and 96%, much higher than AAE, which achieves only 87%. For FashionMNIST, the classification accuracy is low due to the high similarity between images assigned different category labels. This result is consistent with our theoretical analysis in section 3.0.2, namely that InfoGAN is better suited to classification than AAE. RepGAN, as the combination of AAE and InfoGAN, preserves the clustering and generation ability of InfoGAN and the reconstruction ability of AAE. Note that our network structure differs slightly from the original InfoGAN and AAE papers, so the results may not match the reported ones.
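The cluster-labeling procedure above can be sketched as follows (a toy re-implementation with plain lists; the function and variable names are ours, not the authors'):

```python
def cluster_accuracy(cluster_heads, samples, labels, assignments):
    """Label each cluster by the training sample nearest (squared L2)
    to its generated cluster head, then score the predicted cluster
    assignments.  `assignments[i]` is the cluster index predicted
    for sample i."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    # assign each cluster head the label of its closest real sample
    head_label = [labels[min(range(len(samples)),
                             key=lambda i: dist2(head, samples[i]))]
                  for head in cluster_heads]
    correct = sum(head_label[k] == y for k, y in zip(assignments, labels))
    return correct / len(labels)
```

This matching needs only one labeled nearest neighbor per cluster, so the evaluation itself stays essentially unsupervised.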
The qualitative evaluation of the reconstruction and generation ability of RepGAN is shown in figures 6 and 9. By fixing the categorical code, the model can generate samples belonging to that cluster, and by varying the continuous value, the model traverses the learned manifold of styles. For reconstruction, RepGAN is more faithful than InfoGAN and produces sharper images than AAE.
We also compare our generated images on CelebA with InfoGAN in figure 7. Using the same latent space configuration as InfoGAN, namely 10 categorical variables, each a 10-dimensional one-hot vector, we achieve better image quality while still showing attribute changes. In summary, RepGAN performs well on all three tasks: reconstruction, generation and unsupervised clustering.
4.2.2 Semisupervised Learning
We conduct semi-supervised learning on FashionMNIST and SVHN, using 1000 labeled images from each dataset for training. Unlike ACGAN
[Odena et al.2016] or catGAN [Springenberg2015], InfoGAN is not designed for semi-supervised or supervised learning; in our experimental setting, we first train the discriminator with the selected 1000 labels, and then jointly train the discriminator and the generator. Therefore, the reported semi-supervised results on SVHN and FashionMNIST may not surpass other state-of-the-art approaches. AAE, however, adapts easily to the semi-supervised setting and, according to the original paper, achieves comparable classification accuracy on SVHN. Adding the InfoGAN objectives greatly enhances generation when sampling from the latent space, and achieves comparable or better clustering accuracy on SVHN and FashionMNIST. The testing accuracy and MSE are shown in table 3. The three models demonstrate similar classification performance, but InfoGAN's reconstruction remains unsatisfactory. We also compare generation quality on SVHN between AAE and RepGAN in figure 8, showing that RepGAN generates sharper images than AAE.
4.2.3 Effectiveness of Noise
The noise variable is interpreted in InfoGAN as representing incompressible information. We tunnel the noise through for intact and plausible image reconstruction while training the AAE-like part, since the categorical and continuous variables may not be expressive enough for intact reconstruction. The difference between the continuous and noise variables is that the lower-dimensional continuous variable s encodes the most salient attributes (the largest directions of data variance) commonly shared by all samples, enhanced by its mutual-information term, while the noise n encodes incompressible or entangled information, enhanced by the reconstruction term.
In figure 10, we demonstrate the effect of the continuous versus the noise variable on generated samples. On the first row, we interpolate the continuous code and set the noise variable to zero: as the continuous variable varies, the style changes explicitly and smoothly. After adding random noise, in addition to the uniformly changed style, more variants are generated.
On the second row, we interpolate the first two dimensions of the noise code and set the continuous variable to zero. We see only tiny changes in the generated images when traversing the first two noise dimensions, and the changes differ across clusters, demonstrating that the information encoded in the noise is cluster-dependent. If we randomly sample the continuous variable and keep it the same for all clusters, we observe identical changes of image attributes across clusters (slant and thickness), demonstrating that the information encoded in s is cluster-independent, or cluster-shared. Similarly, figure 11 shows generated samples from the FashionMNIST dataset (unsupervised).
5 Conclusion and Discussion
In this paper, we analyzed the advantages and disadvantages of two popular unsupervised machine learning models, AAE and InfoGAN. We showed both theoretically and experimentally that InfoGAN achieves higher classification accuracy, whereas AAE achieves better reconstruction quality. We then combined the two models to take advantage of both. We showed on the MNIST, FashionMNIST and SVHN datasets that the new model, RepGAN, achieves both high classification accuracy and good reconstruction quality in both the original input space and the latent space. By performing well on both classification and reconstruction, RepGAN learns a good bidirectional mapping between the input space and the latent space, which is a desired property of an unsupervised representation learning model. Applying it to arbitrarily complicated data discovery, with more complex network structures and larger latent dimensions, is left for future work.
References
 [Arjovsky and Bottou2017] Martin Arjovsky and Léon Bottou. Towards principled methods for training generative adversarial networks. arXiv preprint, 2017.
 [Arjovsky et al.2017] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.
 [Brock et al.2016] Andrew Brock, Theodore Lim, James M. Ritchie, and Nick Weston. Neural photo editing with introspective adversarial networks. arXiv preprint arXiv:1609.07093, 2016.
 [Chen et al.2016] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172–2180, 2016.
 [Doersch2016] Carl Doersch. Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908, 2016.
 [Donahue et al.2016] Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.
 [Dumoulin et al.2016] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Alex Lamb, Martin Arjovsky, Olivier Mastropietro, and Aaron Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.
 [Goodfellow2016] Ian Goodfellow. Nips 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160, 2016.
 [Jiang et al.2016] Zhuxi Jiang, Yin Zheng, Huachun Tan, Bangsheng Tang, and Hanning Zhou. Variational deep embedding: An unsupervised and generative approach to clustering. arXiv preprint arXiv:1611.05148, 2016.
 [Kim et al.2017] Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jungkwon Lee, and Jiwon Kim. Learning to discover cross-domain relations with generative adversarial networks. arXiv preprint arXiv:1703.05192, 2017.
 [Kingma and Welling2013] Diederik P Kingma and Max Welling. Autoencoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
 [LeCun et al.2010] Yann LeCun, Corinna Cortes, and CJ Burges. Mnist handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.
 [Li et al.2017] Chunyuan Li, Hao Liu, Changyou Chen, Yuchen Pu, Liqun Chen, Ricardo Henao, and Lawrence Carin. Alice: Towards understanding adversarial learning for joint distribution matching. In Advances in Neural Information Processing Systems, pages 5501–5509, 2017.
 [Makhzani et al.2015] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.

 [Netzer et al.2011] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, page 5, 2011.
 [Odena et al.2016] Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier gans. arXiv preprint arXiv:1610.09585, 2016.
 [Rezende et al.2014] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
 [Salimans et al.2016] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.
 [Shang et al.2017] Wenling Shang, Kihyuk Sohn, Zeynep Akata, and Yuandong Tian. Channel-recurrent variational autoencoders. arXiv preprint arXiv:1706.03729, 2017.
 [Springenberg2015] Jost Tobias Springenberg. Unsupervised and semisupervised learning with categorical generative adversarial networks. arXiv preprint arXiv:1511.06390, 2015.
 [Taigman et al.2016] Yaniv Taigman, Adam Polyak, and Lior Wolf. Unsupervised cross-domain image generation. arXiv preprint arXiv:1611.02200, 2016.
 [Xiao et al.2017] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashionmnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

 [Xie et al.2016] Junyuan Xie, Ross Girshick, and Ali Farhadi. Unsupervised deep embedding for clustering analysis. In International conference on machine learning, pages 478–487, 2016.
 [Yi et al.2017] Zili Yi, Hao Zhang, Ping Tan, and Minglun Gong. Dualgan: Unsupervised dual learning for image-to-image translation. arXiv preprint arXiv:1704.02510, 2017.
 [Zhao et al.2017] Shengjia Zhao, Jiaming Song, and Stefano Ermon. Infovae: Information maximizing variational autoencoders. arXiv preprint arXiv:1706.02262, 2017.
 [Zhu et al.2017a] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593, 2017.

 [Zhu et al.2017b] Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell, Alexei A Efros, Oliver Wang, and Eli Shechtman. Toward multimodal image-to-image translation. In Advances in Neural Information Processing Systems, pages 465–476, 2017.