1 Introduction
Image generation has long been one of the core topics in computer vision. Thanks to the rapid development of deep learning, numerous generative models have been proposed, including encoder-decoder based models [19, 37, 2], generative adversarial networks (GANs) [11, 6, 42, 32, 4, 13], and density estimator based models [38, 22, 5, 9, 23, 47, 43, 31]. The encoder-decoder based models and GANs are the most prominent ones due to their capability to generate high-quality images.
Intrinsically, the generator in a generative model aims to learn the real data distribution $\nu$ supported on the data manifold $\chi$ [36]. Suppose the distribution of a specific class of natural data is concentrated on a low-dimensional manifold $\chi$ embedded in the high-dimensional data space. The encoder-decoder methods first embed the data into the latent space through the encoder; samples from the latent distribution are then mapped back to the manifold by the decoder to generate new data. GANs, in contrast, have no encoder and directly learn a map (the generator) that transports a given low-dimensional prior distribution to $\nu$.
Usually, GANs are unstable to train and suffer from mode collapse [12, 28]. The difficulty comes from the fact that the generator of a GAN model is trained to approximate the discontinuous distribution transport map from the
unimodal Gaussian distribution
to the real data distribution by a continuous neural network [42, 2, 18]. In fact, when the supporting manifolds of the source and target distributions differ in topology or convexity, the OT map between them is discontinuous [40], as illustrated in Fig. 1. In practice, distribution transport maps can have complicated singularities, even when the ambient dimension is low (see e.g. [10]). This poses a great challenge for generator training in standard GAN models.
To tackle the mode collapse and mode mixture problems caused by discontinuous transport maps, the authors of [2] proposed the AE-OT model. In this model, an autoencoder is used to map the image manifold into the latent manifold. Then the semi-discrete optimal transport (SDOT) map from the uniform distribution to the latent empirical distribution is explicitly computed via a convex optimization approach, and a piecewise linear extension of the SDOT map, denoted $\tilde T$, pushes the uniform distribution forward to a continuous latent distribution $\tilde\mu$ (where $T_\#$ denotes the push-forward induced by a map $T$, so $\tilde\mu = \tilde T_\# \mathcal{U}$), which gives a good approximation of the latent distribution $\mu$. Composing the continuous decoder $g_\xi$ and the discontinuous $\tilde T$ together, i.e. $g_\xi \circ \tilde T(z)$ with $z$ sampled from the uniform distribution, this model can generate new images. Though free of mode collapse/mixture, the generated images look blurry. The framework of AE-OT is shown as follows:
In this work, we propose the AE-OT-GAN framework to combine the advantages of both models and generate high-quality images without mode collapse/mixture. Specifically, after training the autoencoder and computing the extended SDOT map, we can sample directly from the latent distribution $\tilde\mu$ by applying the extended SDOT map to the uniform distribution, and use these samples to train the GAN model. In contrast to conventional GAN models, whose generators are trained to transport a latent Gaussian distribution to the data manifold distribution, our GAN model samples from the data-inferred latent distribution $\tilde\mu$. The distribution transport map from $\tilde\mu$ to the data distribution is continuous and thus can be well approximated by the generator (parameterized by CNNs), as shown in Fig. 1. Moreover, the decoder of the pretrained autoencoder gives a warm start to the generator, so that the real and fake batches of images have non-vanishing overlap in their supports during the training phase and the Kullback–Leibler divergence between them remains finite. Furthermore, the content loss and feature loss between paired latent codes and real input images regularize the adversarial loss and stabilize the GAN training. Experiments demonstrate the efficacy and efficiency of the proposed model.
The contributions of the current work can be summarized as follows: (1) This paper proposes a novel AE-OT-GAN model that combines the strengths of the AE-OT model and the GAN model. It eliminates the mode collapse/mixture of GANs and removes the blurriness of the images generated by AE-OT. (2) The decoder of the autoencoder provides a good initialization of the generator of the GAN. The number of iterations required to reach equilibrium is reduced by more than 100 times compared to typical GANs. (3) In addition to the adversarial loss, the explicit correspondence between the latent codes and the real images provides auxiliary constraints, namely the content loss, on the generator. (4) Our experiments demonstrate that our model generates images consistently better than or comparable to the results of state-of-the-art methods.
2 Related Work
The method proposed in this paper is closely related to encoder-decoder based generative models, generative adversarial networks (GANs), conditional GANs, and hybrid models that combine the advantages of the above.
Encoder-decoder architecture A breakthrough for image generation comes from the Variational Autoencoder (VAE) scheme (e.g. [19] and [34]), where the decoder approximates the real data distribution from a Gaussian distribution in a variational approach. Later, Burda et al. [45] lowered the requirement on the latent distribution and proposed the importance weighted autoencoder (IWAE) model through a different lower bound. Dai and Wipf [7] observed that the latent distribution of a VAE may not be Gaussian and improved it by first training the original model and then generating new latent codes through an extended ancestral process. Another improvement of the VAE is the VQ-VAE model [1], which requires the encoder to output discrete latent codes by vector quantisation, so that the posterior collapse of VAEs can be overcome. With a multi-scale hierarchical organization, this idea is further used to generate high-quality images in VQ-VAE-2
[33]. In [37], the authors adopt the Wasserstein distance in the latent space to measure the distance between the distribution of the latent codes and a given one, and generate images with better quality. Different from the VAEs, the AE-OT model [2] first embeds the images into the latent space by an autoencoder; then an extended semi-discrete OT map is computed to generate new latent codes based on the existing ones, and new images are produced by the decoder. Although the encoder-decoder based methods are relatively simple to train, the generated images tend to be blurry.
Generative adversarial networks The GAN model [11] alternately updates the generator, which maps the noise sampled from a given distribution to real images, and the discriminator, which differentiates the generated images from the real ones. If the generated images successfully fool the discriminator, we say the model is well trained. Later, [32]
proposes a deep convolutional GAN (DCGAN) to generate images with better quality. While being a powerful tool for generating realistic samples, GANs can be hard to train and suffer from the mode collapse problem [12]. After a delicate analysis, [4] points out that it is the KL divergence used by the original GAN that causes these problems. The authors then introduce the celebrated WGAN, which makes the whole framework easier to converge. To satisfy the Lipschitz continuity required by WGAN, a number of methods have been proposed, including weight clipping [4], gradient penalty [13], spectral normalization [30], and so on. Later, Wu et al. [41] use a Wasserstein divergence objective, which gets rid of the Lipschitz approximation problem and achieves better results. Instead of the cost adopted by WGAN, Liu et al. [27] propose WGAN-QC, which takes the quadratic transport cost into consideration. Though various GANs can generate sharp images, they will theoretically encounter the mode collapse or mode mixture problem [12, 2].
Hybrid models To solve the blurry image problem of the encoder-decoder architecture and the mode collapse/mixture problems of GANs, a natural idea is to compose them together. Larsen et al. [21] propose to combine the variational autoencoder with a generative adversarial network, and thus generate better images than VAEs. [29] matches the aggregated posterior of the hidden code vector of the autoencoder with an arbitrary prior distribution by a discriminator, and then applies the model to tasks like semi-supervised classification and dimensionality reduction. BiGAN [16], with an architecture similar to ours, uses the discriminator to differentiate both the generated images and the generated latent codes. Further, by utilizing the BigGAN generator [3], BigBiGAN [8] extends this method to generate much better results. Here we also treat BourGAN [42] as a hybrid model, because it first embeds the images into the latent space by the Bourgain embedding theorem, and then trains the GAN model by sampling from the latent space using a Gaussian mixture model (GMM).
Conditional GANs are another kind of hybrid model and can also be treated as image-to-image transformation. For example, using an encoder-decoder architecture to build the connection between paired images and then differentiating the decoded images from the real ones by a discriminator, [15] is able to transform images between different styles. Further, SRGAN [25] uses a similar architecture to obtain super-resolution images from their low-resolution versions. SRGAN is the work most similar to ours, as it also utilizes a content loss and an adversarial loss. The main differences between that model and ours include: (i) SRGAN uses only the paired data, while the proposed method uses both the paired data and newly generated latent codes to train the model; (ii) the visually meaningful features used by SRGAN are extracted from the pretrained VGG-19 network [35], while in our model they come from the encoder itself. This makes them more suitable, especially when the datasets are not among those used to train the VGG.
3 The Proposed Method
In this section, we explain our proposed AE-OT-GAN model in detail. It consists of three main modules: an autoencoder (AE), an optimal transport mapper (OT), and a GAN model. Firstly, an AE model is trained to embed the data manifold $\chi$ into the latent space. At the same time, the encoder pushes the ground-truth data distribution $\nu$ supported on $\chi$ forward to the ground-truth latent distribution $\mu$ supported on the latent manifold. Secondly, we compute the semi-discrete OT map from the uniform distribution to the empirical latent distribution. By extending the SDOT map, we can construct a continuous distribution $\tilde\mu$ that approximates the ground-truth latent distribution well. Finally, using $\tilde\mu$ as the latent distribution, our GAN model is trained to generate both realistic and crisp images. The pipeline of our proposed model is illustrated in Fig. 2. In the following, we explain the three modules one by one.
3.1 Data Embedding with Autoencoder
We model the real data distribution as a probability measure $\nu$ supported on an $n$-dimensional manifold $\chi$ embedded in the $D$-dimensional Euclidean space (the ambient space), with $n \ll D$.
In the first step of our AE-OT-GAN model, we train an autoencoder (AE) to embed the real data manifold $\chi$ into the latent space as the latent manifold. In particular, training the AE model amounts to computing the encoding map $f_\theta$ and the decoding map $g_\xi$ by minimizing the reconstruction loss
$L_{rec}(\theta, \xi) = \sum_{i=1}^{N} \| g_\xi(f_\theta(x_i)) - x_i \|^2,$
with $f_\theta$ and $g_\xi$ parameterized by standard CNNs ($\theta$ and $\xi$ are the parameters of the networks, respectively). Given dense sampling from the image manifold (a detailed explanation is included in the supplementary material) and ideal optimization (namely, the loss function goes to $0$), $g_\xi \circ f_\theta$ coincides with the identity map. After training, $f_\theta$ restricted to $\chi$ is a continuous, invertible map, namely a homeomorphism, and $g_\xi$ is the inverse homeomorphism. This means $f_\theta$ is an embedding, and it pushes $\nu$ forward to the latent data distribution $\mu$. In practice, we only have the empirical data distribution given by the samples $\{x_i\}_{i=1}^N$, which is pushed forward to the empirical latent distribution supported on the latent codes $z_i = f_\theta(x_i)$, where $N$ is the number of samples.
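The embedding step above can be illustrated with a toy, linear stand-in for the CNN autoencoder (everything here is a hypothetical sketch, not the paper's architecture): the encoder and decoder are single matrices trained by gradient descent on the reconstruction loss.

```python
import numpy as np

# Toy linear autoencoder (hypothetical stand-in for the paper's CNNs):
# encoder f_theta(x) = x W_e, decoder g_xi(z) = z W_d, trained by plain
# gradient descent on the loss  sum_i || g_xi(f_theta(x_i)) - x_i ||^2.
rng = np.random.default_rng(0)
D, n, N = 8, 2, 200                                    # ambient dim, latent dim, samples
X = rng.normal(size=(N, n)) @ rng.normal(size=(n, D))  # data on an n-dim subspace

W_e = rng.normal(scale=0.1, size=(D, n))   # encoder parameters (theta)
W_d = rng.normal(scale=0.1, size=(n, D))   # decoder parameters (xi)

def rec_loss():
    return float(np.mean(np.sum((X @ W_e @ W_d - X) ** 2, axis=1)))

loss0 = rec_loss()
lr = 1e-3
for _ in range(2000):
    Z = X @ W_e                           # latent codes f_theta(x_i)
    R = Z @ W_d - X                       # reconstruction residual
    g_d = (Z.T @ R) / N                   # gradient w.r.t. the decoder
    g_e = (X.T @ (R @ W_d.T)) / N         # gradient w.r.t. the encoder
    W_d -= lr * g_d
    W_e -= lr * g_e

print(loss0, rec_loss())                  # the reconstruction loss decreases
```

Because the data lies on a 2-dimensional linear subspace, a latent dimension of 2 suffices for near-perfect reconstruction in this toy setting.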
3.2 Constructing $\tilde\mu$ with the Semi-Discrete OT Map
In this section, starting from the empirical latent distribution, we construct a continuous latent distribution $\tilde\mu$ following [2] such that (i) it generalizes well, so that all of the modes are covered by the support of $\tilde\mu$; (ii) the support of $\tilde\mu$ has similar topology to that of the ground-truth latent distribution $\mu$, which ensures that the transport map from $\tilde\mu$ to the data distribution is continuous; and (iii) it is efficient to sample from.
To obtain $\tilde\mu$, the semi-discrete OT map $T$ from the uniform distribution $\mathcal{U}$ on $[0,1]^d$ to the empirical latent distribution is first computed, where $d$ is the dimension of the latent space. By extending $T$ to a piecewise linear map $\tilde T$, we can construct $\tilde\mu$ as the push-forward distribution of $\mathcal{U}$ under $\tilde T$: $\tilde\mu = \tilde T_\# \mathcal{U}$.
In the first step, we compute the semi-discrete OT map $T$, with $T(x) \in \{z_i\}_{i=1}^N$. Under $T$, the continuous domain $[0,1]^d$ is decomposed into cells $U_i$ with $T(U_i) = z_i$, the Lebesgue measure of each $U_i$ being $1/N$. The cell structure is shown in the left frame of Fig. 3 (the orange cells). Computational details of $T$ can be found in the supplementary material and [2].
Secondly, we extend the image domain of $T$ from the discrete latent codes to a continuous neighborhood, which serves as the supporting manifold of $\tilde\mu$. Specifically, we construct a simplicial complex $\Sigma$ from the latent codes: its 0-skeleton is the set of all latent codes $\{z_i\}$, and its $k$-skeletons are built from groups of nearby latent codes, controlled by a constant threshold. The right frame of Fig. 3 shows an example of $\Sigma$. Assuming that the latent codes are densely sampled from the latent manifold, and with an appropriate threshold, $\Sigma$ will have "hole" and "gap" structure consistent with the latent manifold, in the sense of homology equivalence. Details are described in the supplementary material.
Finally, we define the piecewise linear extended OT map $\tilde T$. Given a random sample $x$ drawn from $\mathcal{U}$, we find the cell $U_i$ containing it. We then compute the barycentric parameters $\lambda_j$ with respect to the mass centers $c_j$ of the nearby cells $U_j$, i.e. the $\lambda_j$ such that $x = \sum_j \lambda_j c_j$ with $\lambda_j \geq 0$ and $\sum_j \lambda_j = 1$, where $j$ ranges over the neighbors of $U_i$. Then $x$ is mapped to $\tilde T(x) = \sum_j \lambda_j z_j$ if the corresponding $z_j$'s form a simplex of $\Sigma$; otherwise we map $x$ to $z_i$, i.e. $\tilde T(x) = T(x)$. As illustrated in Fig. 3, compared to the many-to-one semi-discrete OT map $T$, $\tilde T$ maps samples within the triangular areas (the purple triangles in the left frame) linearly and bijectively to the corresponding simplices of $\Sigma$ (the purple triangles in the right frame). We denote the distribution pushed forward under $\tilde T$ as $\tilde\mu$.
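A drastically simplified, hypothetical illustration of the idea behind $\tilde T$ (not the paper's simplicial construction): interpolate between the two nearest latent codes when a sample lies near a cell boundary, and fall back to the discrete SDOT assignment otherwise.

```python
import numpy as np

# Toy sketch of the piecewise-linear extension: a sample near the boundary of
# two power cells is sent onto the 1-simplex (segment) between the two latent
# codes via a barycentric weight, instead of collapsing onto a single code.
def extended_map(x, Y, h, eps=0.4):
    # generalized costs of the SDOT assignment: argmin_j |x - y_j|^2 - h_j
    cost = ((Y - x) ** 2).sum(axis=1) - h
    i, j = np.argsort(cost)[:2]          # two nearest cells
    gap = cost[j] - cost[i]
    if gap < eps:                        # inside the "extended" boundary region
        lam = 0.5 + 0.5 * gap / eps      # barycentric weight on y_i, in [0.5, 1]
        return lam * Y[i] + (1 - lam) * Y[j]
    return Y[i]                          # deep inside a cell: the SDOT map T

Y = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # toy latent codes
h = np.zeros(3)
z = extended_map(np.array([0.5, 0.0]), Y, h)   # exactly between two codes
print(z)                                        # midpoint of the two codes
```

A sample equidistant from two codes is mapped to their midpoint, while a sample well inside a cell is mapped to that cell's latent code, mimicking how $\tilde T$ fills the simplices of $\Sigma$ continuously.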
Theorem 1.
The 2-Wasserstein distance $W_2(\tilde\mu, \mu)$ between $\tilde\mu$ and the ground-truth latent distribution $\mu$ is bounded. Moreover, if the latent codes are densely sampled from the latent manifold, we have $W_2(\tilde\mu, \mu) \to 0$ almost surely.
For notational simplicity, we omit the subscripts in the following. With the proof included in the supplementary material, this theorem tells us that, as a continuous generalization of the empirical latent distribution, $\tilde\mu$ is a good approximation of $\mu$. We also note that $\tilde T$ is a piecewise linear map that pushes $\mathcal{U}$ forward to $\tilde\mu$, which makes sampling from $\tilde\mu$ efficient and accurate.
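As a small numerical aside on the kind of statement in the theorem (a sketch, restricted to 1D, where the 2-Wasserstein distance between two equal-size empirical measures reduces to matching sorted samples):

```python
import numpy as np

# In 1D the 2-Wasserstein distance between two equal-size empirical measures
# has a closed form: match the sorted samples and average the squared gaps.
def w2_empirical_1d(a, b):
    a, b = np.sort(a), np.sort(b)
    return float(np.sqrt(np.mean((a - b) ** 2)))

N = 1000
grid = (np.arange(N) + 0.5) / N           # empirical approximation of U[0,1]
print(w2_empirical_1d(grid, grid + 0.3))  # a translation by 0.3 has W2 = 0.3
```

The deterministic check confirms the sorted-matching formula: translating an empirical measure by a constant shifts it by exactly that constant in 2-Wasserstein distance.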
3.3 GAN Training from $\tilde\mu$
The GAN model computes the transport map from the continuous latent distribution $\tilde\mu$ to the data distribution $\nu$ on the manifold.
Our GAN model is based on the vanilla GAN model proposed by Goodfellow et al. [11]. The generator $g$ is used to generate new images from samples of the latent distribution $\tilde\mu$, while the discriminator $d$ is used to judge whether the distribution of the generated images is the same as that of the real images. The training process is formalized as the min-max optimization problem
$\min_g \max_d L(g, d),$
where the loss function is given by
$L(g, d) = L_{adv}(g, d) + \lambda_1 L_{content}(g) + \lambda_2 L_{feat}(g).$ (1)
In our model, the loss function consists of three terms: the image content loss $L_{content}$, the feature loss $L_{feat}$, and the adversarial loss $L_{adv}$. Here $\lambda_1$ is the weight of the content loss and $\lambda_2$ that of the feature loss.
Adversarial Loss We adopt the vanilla GAN loss [11] based on the Kullback–Leibler (KL) divergence. The key difference between our model and the original GAN is that our latent samples are drawn from the data-related latent distribution $\tilde\mu$, instead of a Gaussian distribution. The adversarial loss is given by
$L_{adv}(g, d) = \mathbb{E}_{x \sim \nu}[\log d(x)] + \mathbb{E}_{z \sim \tilde\mu}[\log(1 - d(g(z)))].$
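The adversarial loss can be evaluated directly for a toy one-dimensional discriminator (a hypothetical logistic stand-in for the CNN discriminator; the data below is synthetic):

```python
import numpy as np

# E[log d(x)] + E[log(1 - d(g(z)))] for a toy discriminator d(x) = sigmoid(wx + b).
def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def adversarial_loss(d_real, d_fake):
    # maximized by the discriminator; the generator minimizes the second term
    return float(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))

rng = np.random.default_rng(3)
real = rng.normal(loc=2.0, size=500)    # toy "real image" scores
fake = rng.normal(loc=-2.0, size=500)   # toy "generated image" scores
w, b = 1.0, 0.0                         # toy linear discriminator parameters
L = adversarial_loss(sigmoid(w * real + b), sigmoid(w * fake + b))
print(L)   # close to the maximum value 0, since the two batches are well separated
```

When the two batches overlap heavily instead, both expectations move toward $\log(1/2)$, which is the equilibrium the generator training drives the loss toward.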
According to [4], the vanilla GAN is hard to converge because the supports of the distributions of real and fake images may not intersect, which makes the KL divergence between them infinite. This issue is resolved in our case because (1) the training of the AE gives a warm start to the generator, so at the beginning of training the generated distribution is already close to the real data distribution $\nu$; and (2) by a careful composition of the fake and real batches used to train the discriminator, we can keep the KL divergence between them finite so that it converges well. In detail, as shown in Fig. 2, the fake batch is composed of both the reconstructed images from the real latent codes (the orange circles) and the generated images from the generated latent codes (the purple crosses), and the real batch includes both the real images corresponding to the real latent codes and some randomly selected images.
Content Loss Recall that the generator can produce two types of images: images reconstructed from real latent codes and images from generated latent codes. Given a real sample $x_i$, its latent code is $z_i = f_\theta(x_i)$ and the reconstructed image is $g(z_i)$. Each reconstructed image is thus represented as a triple $(x_i, z_i, g(z_i))$. Suppose there are $K$ reconstructed images in total; the content loss is given by
$L_{content}(g) = \frac{1}{K} \sum_{i=1}^{K} \| g(z_i) - x_i \|^2,$ (2)
where $g$ is the generator parameterized by the network weights.
Feature Loss We adopt a feature loss similar to that in [25]. Given a reconstructed image triple $(x_i, z_i, g(z_i))$, we encode $g(z_i)$ by the encoder of the AE. Ideally, the real image $x_i$ and the generated image $g(z_i)$ should be the same, and therefore their latent codes should be similar. We measure the difference between their latent codes by the feature loss. Furthermore, we can also measure the difference between their intermediate features from different layers of the encoder.
Suppose the encoder is a network with $S$ layers, and the output of the $l$-th layer is denoted $f^{(l)}$. The feature loss is given by
$L_{feat}(g) = \frac{1}{K} \sum_{i=1}^{K} \sum_{l=1}^{S} \eta_l \, \| f^{(l)}(g(z_i)) - f^{(l)}(x_i) \|^2,$
where $\eta_l$ is the weight of the feature loss of the $l$-th layer.
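A minimal sketch of the layer-weighted feature loss, with a random toy encoder standing in for the trained one (all shapes, weights, and layer counts below are assumptions for illustration):

```python
import numpy as np

# Toy 3-layer "encoder": each layer is a random matrix followed by tanh.
rng = np.random.default_rng(4)
layers = [rng.normal(scale=0.3, size=(16, 16)) for _ in range(3)]

def features(x):
    # return the output f^(l)(x) of every layer
    feats = []
    for W in layers:
        x = np.tanh(W @ x)
        feats.append(x)
    return feats

def feature_loss(x_real, x_gen, etas=(1.0, 1.0, 1.0)):
    # weighted sum of squared feature differences across layers
    return float(sum(eta * np.sum((fr - fg) ** 2)
                     for eta, fr, fg in zip(etas, features(x_real), features(x_gen))))

x = rng.normal(size=16)
print(feature_loss(x, x))            # identical inputs give loss 0
print(feature_loss(x, x + 0.1) > 0)  # a perturbed "generated" image gives a positive loss
```

The per-layer weights `etas` play the role of the $\eta_l$ above; in practice the deeper layers capture more semantic features and can be weighted accordingly.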
For reconstructed images, the content loss and the feature loss force the generated image $g(z_i)$ to be the same as the real image $x_i$, so that the generated manifold aligns well with the real data manifold $\chi$.
4 Experiments
To evaluate the proposed method, several experiments are conducted on the simple dataset MNIST [24] and on complex datasets including CIFAR-10 [20], CelebA [46], and CelebA-HQ [26].
Figure 4. (a) Real and generated latent codes visualized by t-SNE. (b) Generated handwritten digits (left) and real digits (right).
Figure 5. Training curves of each epoch, including the content loss (a), the self-perceptual loss (b), the discriminator output (c), and the FIDs (d).
Figure 6. (a) Epoch 0 (AE-OT); (b) Epoch 80; (c) Epoch 160; (d) Epoch 240; (e) Ground truth.
Figure 7. (a) Epoch 0 (AE-OT); (b) Epoch 80; (c) Epoch 160; (d) Epoch 240.
Architecture We adopt the InfoGAN [6] architecture as our GAN model for the MNIST dataset. The standard and ResNet models used for the CIFAR-10 dataset are the same as those used by SNGAN [30], and the architectures of WGAN-div [41] are used for the CelebA dataset. The encoder is set to mirror the generators/decoders.
Evaluation metric To assess the performance of the proposed method, we adopt the commonly used Fréchet Inception Distance (FID) [14] as our evaluation metric. The FID takes both the generated images and the real images into consideration: the images are embedded into a feature space by an Inception network, two high-dimensional Gaussian distributions are fitted to the empirical distributions of the generated and real features, respectively, and the FID is given by the Fréchet distance between the two Gaussians. A lower FID means better quality of the generated dataset. This metric has proven effective in judging the performance of generative models, and it serves as a standard for comparison with other works.
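The Fréchet distance between the two fitted Gaussians has the closed form $\mathrm{FID} = \|m_1 - m_2\|^2 + \mathrm{Tr}\big(C_1 + C_2 - 2 (C_1 C_2)^{1/2}\big)$, which can be sketched as follows (toy random features stand in for Inception activations):

```python
import numpy as np

def sqrtm_psd(A):
    # square root of a symmetric PSD matrix via eigendecomposition
    vals, vecs = np.linalg.eigh(A)
    return (vecs * np.sqrt(np.clip(vals, 0, None))) @ vecs.T

def fid(feats1, feats2):
    m1, m2 = feats1.mean(0), feats2.mean(0)
    C1 = np.cov(feats1, rowvar=False)
    C2 = np.cov(feats2, rowvar=False)
    s1 = sqrtm_psd(C1)
    # Tr((C1 C2)^(1/2)) = Tr((C1^(1/2) C2 C1^(1/2))^(1/2)), a symmetric form
    covmean = sqrtm_psd(s1 @ C2 @ s1)
    return float(np.sum((m1 - m2) ** 2) + np.trace(C1 + C2 - 2 * covmean))

rng = np.random.default_rng(5)
A = rng.normal(size=(2000, 8))          # toy "real" features
B = rng.normal(size=(2000, 8)) + 1.0    # toy "generated" features, mean-shifted
print(fid(A, A))   # ~0 for identical feature sets
print(fid(A, B))   # dominated by the squared mean shift, ~8 here
```

Production FID implementations use Inception-v3 features and a numerically stabilized matrix square root, but the formula itself is exactly the one computed above.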
Training details
To get rid of the vanishing gradient problem and make the model converge better, we use the following three strategies:
(i) Train the discriminator using Batch Composition There are two types of latent codes in our method: the real latent codes coming from encoding the real images by the encoder, and generated latent codes coming from the extended OT map. Correspondingly, there are two types of generated images, the reconstructed images from the real latent codes and the generated images from the generated latent codes.
To train the discriminator, both a fake batch and a real batch are used. The fake batch consists of both randomly selected reconstructed images and generated images, while the real batch only includes real images, whose first part has a one-to-one correspondence with the reconstructed images in the fake batch, as shown in Fig. 2. In all the experiments, the ratio between the number of generated images and reconstructed images in the fake batch is 3.
This strategy ensures that there is an overlap between the supports of the fake and real batches, so that the KL divergence does not become infinite.
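The batch-composition strategy can be sketched as follows (small arrays stand in for images; the helper and its parameters are hypothetical, using the 3:1 generated-to-reconstructed ratio from the experiments):

```python
import numpy as np

rng = np.random.default_rng(6)

def compose_batches(real_images, recon_images, gen_images, batch_size=32):
    # fake batch: generated and reconstructed images at a 3:1 ratio
    n_rec = batch_size // 4
    n_gen = batch_size - n_rec
    idx = rng.choice(len(recon_images), n_rec, replace=False)
    fake = np.concatenate([gen_images[rng.choice(len(gen_images), n_gen)],
                           recon_images[idx]])
    # real batch: random real images plus the originals that correspond
    # one-to-one with the reconstructed images in the fake batch
    rand = real_images[rng.choice(len(real_images), n_gen)]
    real = np.concatenate([rand, real_images[idx]])
    return fake, real

real = rng.normal(size=(100, 4))                   # toy real "images"
recon = real + 0.01 * rng.normal(size=(100, 4))    # decoder(encoder(x)) stand-ins
gen = rng.normal(size=(100, 4))                    # decoder(extended-OT sample) stand-ins
fake_b, real_b = compose_batches(real, recon, gen)
print(fake_b.shape, real_b.shape)
```

Because the reconstructed images in the fake batch are close to their paired originals in the real batch, the two batches share support, which is exactly what keeps the KL term finite.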
(ii) Different learning rates For better training, we use different learning rates for the generator and the discriminator, as suggested by Heusel et al. in [14]. This improves the stability of the training process.
(iii) Different inner steps Another way to improve the training consistency of the whole framework is to set different numbers of update steps for the generator and the discriminator: each time the discriminator is updated once, the generator is updated $n_g$ times. This strategy is the opposite of training vanilla GANs, which typically require multiple discriminator update steps per generator update step.
With a suitable choice of $n_g$ and the learning rates, we can keep the discriminator output for real images slightly larger than that for generated images, which better guides the training of the generator; the concrete values differ for the MNIST, CIFAR-10, and CelebA datasets. In Eq. 1, the weights $\lambda_1$ and $\lambda_2$ are also set per dataset, with the feature term of the last encoder layer used to regularize the loss of the latent codes.
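The alternating schedule can be sketched as a skeleton loop (the step functions below are hypothetical placeholders for the real update routines):

```python
# One discriminator update followed by several generator updates per round --
# the reverse of the usual WGAN-style schedule of several critic steps per
# generator step.
def train(n_rounds, n_g, d_step, g_step):
    for _ in range(n_rounds):
        d_step()                 # one discriminator update ...
        for _ in range(n_g):
            g_step()             # ... then n_g generator updates

counts = {"d": 0, "g": 0}        # count the calls to each placeholder
train(10, 3,
      lambda: counts.__setitem__("d", counts["d"] + 1),
      lambda: counts.__setitem__("g", counts["g"] + 1))
print(counts)   # 10 discriminator steps, 30 generator steps
```

Favoring the generator is sensible here because the warm-started generator is already close to the data distribution, so the discriminator needs comparatively little work to stay slightly ahead.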
With the above settings and the warm initialization of the generator from the pretrained decoder, the total number of training epochs for each dataset is set to 500, which is far less than in the training of typical GANs (usually 10k–50k).
Figure 8. Qualitative comparison: CTGAN [44], WGAN-GP [13], WGAN-div [41], WGAN-QC [27], and the proposed method.
Figure 9. Qualitative comparison: WGAN-GP [13], SNGAN [30], WGAN-div [41], AE-OT [2], and the proposed method.
Table 1. FIDs on CIFAR-10 and CelebA under the standard and ResNet architectures.

                 CIFAR-10            CelebA
               Standard  ResNet   Standard  ResNet
WGAN-GP [13]     40.2     19.6      21.2     18.4
PGGAN [17]        -       18.8       -       16.3
SNGAN [30]       25.5     21.7       -        -
WGAN-div [41]     -       18.1      17.5     15.2
WGAN-QC [27]      -        -         -       12.9
AE-OT [2]        34.2     28.5      24.3     28.6
AE-OT-GAN        25.2     17.1      11.2      7.8
4.1 Convergence Analysis on MNIST
In this experiment, we evaluate the performance of our proposed model on the MNIST dataset [24], which can be well embedded into a low-dimensional latent space with the architecture of InfoGAN [6]. In Fig. 4(a), we visualize the real latent codes (brown circles) and the generated latent codes (purple crosses) by t-SNE [39]. It is evident that the support of the real latent distribution and that of the generated distribution align well. Frame (b) of Fig. 4 shows a comparison between the generated handwritten digits (left) and the real digits (right), which are very difficult for humans to distinguish.
To show the convergence of the proposed method, we plot the related curves in Fig. 5. Frames (a) and (b) show the evolution of the content loss on the images and on the latent codes, both of which decrease monotonically. Frame (c) shows that the output of the discriminator for real images stays only slightly larger than that for the fake images during training, which helps the generator generate more realistic digits. Frame (d) shows the evolution of the FID; the final value is lower than the best previously reported FIDs with the same InfoGAN architecture in [28] and [2], which shows that our model outperforms the state of the art.
4.2 Quality Evaluation on Complex Dataset
In this section, we compare with the stateoftheart methods quantitatively and qualitatively.
Progressive Quality Improvement Firstly, we show the evolution of the proposed method during GAN training in Fig. 6 and Fig. 7. The quality of the generated images increases monotonically during the process. The images in the first four frames of Fig. 6 illustrate the results reconstructed from the real latent codes by the decoder, with the last frame showing the corresponding ground-truth input images. By examining the frames carefully, it is evident that as the epochs increase, the generated images become sharper and sharper, eventually becoming very close to the ground truth. Fig. 7 shows images generated from generated latent codes (which therefore have no corresponding real images). Similarly, these images become sharper as the epochs increase. Note that epoch 0 means the images are generated by the original decoder, which is equivalent to the output of an AE-OT model [2]. Thus we can conclude that the proposed AE-OT-GAN improves the performance of AE-OT prominently.
Comparison on CelebA and CIFAR-10 Secondly, we compare with state-of-the-art methods, including WGAN-GP [13], PGGAN [17], SNGAN [30], CTGAN [44], WGAN-div [41], WGAN-QC [27], and the recently proposed AE-OT model [2], on CIFAR-10 [20] and CelebA [46]. Tab. 1 shows the FIDs of our method and the baselines trained under both the standard and ResNet architectures. The FIDs of the other methods come from the cited papers, except those of AE-OT, which are computed directly by our model (the results at epoch 0). From the table we can see that our method obtains much better results than the others on the CelebA dataset, under both the standard and the ResNet architecture. Also, the faces generated by the proposed method have fewer flaws than those of other GANs, as shown in Fig. 8. On CIFAR-10, the FIDs of our model are also comparable to the state of the art, and we show some generated images in Fig. 9. The convergence curves for both datasets can be found in the supplementary material.
Table 2. FIDs on CelebA-HQ.

PGGAN   WGAN-div   WGAN-QC   AE-OT-GAN
 14.7     13.5       7.7        7.2
Experiment on CelebA-HQ Furthermore, we also test the proposed method on images with high resolution, namely the CelebA-HQ dataset with image size 256×256. The architecture used to train the model is described in the supplementary material. Our model has far fewer parameters than those of [27, 41, 17], while its performance is better, as shown in Tab. 2. We also display several generated images in Fig. 10, which are crisp and visually realistic.
5 Conclusion and Future Work
In this paper, we propose the AE-OT-GAN model, which composes the AE-OT model and the vanilla GAN together. By utilizing the merits of both models, our method can generate high-quality images without mode collapse or mode mixture. Firstly, the images are embedded into the latent space by an autoencoder; then the SDOT map from the uniform distribution to the empirical distribution supported on the latent codes is computed. Sampling from the latent distribution by applying the extended SDOT map, we can train our GAN model. Moreover, the paired latent codes and images give us additional constraints on the generator. Using the FID as the metric, we show that the proposed model is able to generate images comparable to or better than the state of the art.
References
 [1] (2017) Neural discrete representation learning. In NeurIPS, Cited by: §2.
 [2] (2020) AE-OT: a new generative model based on extended semi-discrete optimal transport. In International Conference on Learning Representations, Cited by: §1, §1, §1, §2, §2, §3.2, §3.2, Figure 9, §4.1, §4.2, §4.2, Table 1.
 [3] (2019) Large scale gan training for high fidelity natural image synthesis. In International Conference on Learning Representations, Cited by: §2.
 [4] (2017) Wasserstein generative adversarial networks. In ICML, pp. 214–223. Cited by: §1, §2, §3.3.
 [5] (2017) Density estimation using real nvp. In ICLR, Cited by: §1.
 [6] (2016) InfoGAN: interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, Cited by: §1, §4.1, §4.
 [7] (2019) Diagnosing and enhancing VAE models. In International Conference on Learning Representations, Cited by: §2.
 [8] (2019) Large scale adversarial representation learning. In https://arxiv.org/abs/1907.02544, Cited by: §2.
 [9] (2018) Glow: generative flow with invertible 1x1 convolutions. In NeurIPS, Cited by: §1.

 [10] (2010) Regularity properties of optimal maps between nonconvex domains in the plane. Communications in Partial Differential Equations 35 (3), pp. 465–479. Cited by: §1.
 [11] (2014) Generative adversarial nets. Cited by: §1, §2, §3.3, §3.3.
 [12] (2016) NIPS 2016 tutorial: generative adversarial networks. arXiv preprint arXiv:1701.00160. Cited by: §1, §2.
 [13] (2017) Improved training of wasserstein gans. In NIPS, pp. 5769–5779. Cited by: §1, §2, Figure 8, Figure 9, §4.2, Table 1.
 [14] (2017) Gans trained by a two timescale update rule converge to a nash equilibrium. Cited by: §4, §4.

 [15] (2017) Image-to-image translation with conditional adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
 [16] (2017) Adversarial feature learning. In International Conference on Learning Representations, Cited by: §2.
 [17] (2018) Progressive growing of gans for improved quality, stability, and variation. In ICLR, Cited by: §4.2, §4.2, Table 1.
 [18] (2018) Disconnected manifold learning for generative adversarial networks. In Advances in Neural Information Processing Systems, Cited by: §1.
 [19] (2013) Autoencoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §1, §2.
 [20] (2009) Learning multiple layers of features from tiny images. Tech report. Cited by: Figure 9, §4.2, §4.
 [21] (2016) Autoencoding beyond pixels using a learned similarity metric. Cited by: §2.
 [22] (2014) NICE: nonlinear independent components estimation. arXiv preprint arXiv:1410.8516. Cited by: §1.
 [23] (2006) A tutorial on energy-based learning. Cited by: §1.
 [24] (2010) MNIST handwritten digit database. External Links: Link Cited by: Figure 5, §4.1, §4.
 [25] (2017) Photo-realistic single image super-resolution using a generative adversarial network. Cited by: §2, §3.3.
 [26] (2019) MaskGAN: towards diverse and interactive facial image manipulation. arXiv preprint arXiv:1907.11922. Cited by: §4.
 [27] (2019) Wasserstein gan with quadratic transport cost. In ICCV, Cited by: §2, Figure 8, §4.2, §4.2, Table 1.
 [28] (2018) Are gans created equal? a largescale study. In Advances in neural information processing systems, pp. 698–707. Cited by: §1, §4.1.
 [29] (2015) Adversarial autoencoders. arXiv preprint arXiv:1511.05644. Cited by: §2.
 [30] (2018) Spectral normalization for generative adversarial networks. In ICLR, Cited by: §2, Figure 9, §4.2, Table 1, §4.
 [31] (2019) On learning nonconvergent nonpersistent shortrun mcmc toward energybased model. arXiv preprint arXiv:1904.09770. Cited by: §1.
 [32] (2016) Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, Cited by: §1, §2.
 [33] (2019) Generating diverse high-fidelity images with VQ-VAE-2. In NeurIPS, Cited by: §2.

 [34] (2014) Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082. Cited by: §2.
 [35] (2014) Very deep convolutional networks for large-scale image recognition. Cited by: §2.
 [36] (2000) A global geometric framework for nonlinear dimensionality reduction. Science 290 (5500), pp. 2319–2323. Cited by: §1.
 [37] (2018) Wasserstein autoencoders. In ICLR, Cited by: §1, §2.
 [38] (2016) Conditional image generation with pixelcnn decoders. In Advances in Neural Information Processing Systems, Cited by: §1.

 [39] (2008) Visualizing data using t-SNE. Journal of Machine Learning Research. Cited by: §4.1.
 [40] (2008) Optimal transport: old and new. Vol. 338, Springer Science & Business Media. Cited by: §1.
 [41] (2018) Wasserstein divergence for gans. In ECCV, Cited by: §2, Figure 8, Figure 9, §4.2, §4.2, Table 1, §4.
 [42] (2018) Bourgan: generative networks with metric embeddings. In NeurIPS, Cited by: §1, §1, §2.
 [43] (2016) Cooperative training of descriptor and generator networks. IEEE transactions on pattern analysis and machine intelligence. Cited by: §1.
 [44] (2019) Modeling tabular data using conditional gan. In Advances in Neural Information Processing Systems, Cited by: Figure 8, §4.2.
 [45] (2015) Importance weighted autoencoders. In ICML, Cited by: §2.
 [46] (2018) From facial expression recognition to interpersonal relation prediction. International Journal of Computer Vision. Cited by: Figure 6, Figure 7, Figure 8, §4.2, §4.
 [47] (1998) Filters, random fields and maximum entropy (frame): towards a unified theory for texture modeling. International Journal of Computer Vision. Cited by: §1.