AE-OT-GAN: Training GANs from data specific latent distribution

01/11/2020 ∙ by Dongsheng An, et al. ∙ 10

Though generative adversarial networks (GANs) are prominent models for generating realistic and crisp images, they often encounter mode collapse and are hard to train, which comes from approximating the intrinsically discontinuous distribution transform map with continuous DNNs. The recently proposed AE-OT model addresses this problem by explicitly computing the discontinuous distribution transform map through solving a semi-discrete optimal transport (OT) map in the latent space of the autoencoder. However, the generated images are blurry. In this paper, we propose the AE-OT-GAN model to combine the advantages of both models: it generates high quality images and at the same time overcomes the mode collapse/mixture problems. Specifically, we first faithfully embed the low dimensional image manifold into the latent space by training an autoencoder (AE). Then we compute the optimal transport (OT) map that pushes forward the uniform distribution to the latent distribution supported on the latent manifold. Finally, our GAN model is trained to generate high quality images from the latent distribution, from which the distribution transform map to the empirical data distribution is continuous. The paired data between the latent codes and the real images gives us further constraints on the generator. Experiments on the simple MNIST dataset and on complex datasets like Cifar-10 and CelebA show the efficacy and efficiency of our proposed method.




1 Introduction

Image generation has long been one of the core topics in computer vision. Thanks to the rapid development of deep learning, numerous generative models have been proposed, including encoder-decoder based models [19, 37, 2], generative adversarial networks (GANs) [11, 6, 42, 32, 4, 13], density estimator based models [38, 22, 5, 9] and energy based models [23, 47, 43, 31]. The encoder-decoder based models and GANs are the most prominent ones due to their capability to generate high quality images.

Figure 1: Distribution transport maps from two different latent distributions to the same data distribution. The first map sends a unimodal latent distribution on the left to the data distribution on the right. The difference in topology between their supporting manifolds causes discontinuities in the map, which are hard to approximate with continuous neural networks. The singular set of this map has two components (shown in red): continuous samplings of the source distribution are mapped to three disjoint segments (shown in purple). The second map, on the other hand, samples from a suitably supported latent distribution and is less likely to suffer from the discontinuity problem, so it can be well approximated by neural networks.

Intrinsically, the generator in a generative model aims to learn the real data distribution supported on the data manifold [36]. Suppose the distribution of a specific class of natural data is concentrated on a low dimensional manifold embedded in the high dimensional data space. The encoder-decoder methods first embed the data into the latent space through the encoder; samples from the latent distribution are then mapped back to the manifold by the decoder to generate new data. GANs, which have no encoder, instead directly learn a map (the generator) that transports a given low dimensional prior distribution to the data distribution.

Usually, GANs are unstable to train and suffer from mode collapse [12, 28]. The difficulties come from the fact that the generator of a GAN model is trained to approximate the discontinuous distribution transport map from the unimodal Gaussian distribution to the real data distribution with continuous neural networks [42, 2, 18]. In fact, when the supporting manifolds of the source and target distributions differ in topology or convexity, the OT map between them will be discontinuous [40], as illustrated by the first map of Fig. 1. In practice, distribution transport maps can have complicated singularities, even when the ambient dimension is low (see e.g. [10]). This poses a great challenge for generator training in standard GAN models.
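The discontinuity can be seen in a one-dimensional toy case. The sketch below is an illustration only and does not come from the paper: it assumes a source distribution uniform on [0, 1] and a target that is an equal mixture of uniform distributions on [-2, -1] and [1, 2]. In one dimension the optimal map for a convex cost is the target quantile function composed with the source CDF, so it can be written in closed form; the function name `ot_map_1d` is hypothetical.

```python
def ot_map_1d(x):
    """Monotone transport map from Uniform[0, 1] to an equal mixture of
    Uniform[-2, -1] and Uniform[1, 2] (a toy target with disconnected support).

    The source CDF on [0, 1] is x itself, so the map is just the target
    quantile function evaluated at x.
    """
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must lie in [0, 1]")
    if x < 0.5:
        return -2.0 + 2.0 * x          # first half of the mass fills [-2, -1)
    return 1.0 + 2.0 * (x - 0.5)       # second half of the mass fills [1, 2]


# Two source points that are almost identical land on different components.
eps = 1e-6
left = ot_map_1d(0.5 - eps)
right = ot_map_1d(0.5 + eps)
jump = right - left
print(left, right, jump)
```

The map jumps across the gap between the two target components at x = 0.5, which is exactly the kind of behaviour a continuous network can only approximate by sending some samples into the gap.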

To tackle the mode collapse and mode mixture problems caused by discontinuous transport maps, the authors of [2] proposed the AE-OT model. In this model, an autoencoder is used to map the image manifold to the latent manifold. Then, the semi-discrete optimal transport (SDOT) map from the uniform distribution to the empirical latent distribution is explicitly computed via convex optimization. A piece-wise linear extension of the SDOT map then pushes forward the uniform distribution to a continuous latent distribution, which in turn gives a good approximation of the true latent distribution. New images are generated by composing the continuous decoder with the discontinuous extended map and applying the composition to samples from the uniform distribution. Although this model has no mode collapse or mode mixture, the generated images look blurry.

In this work we propose the AE-OT-GAN framework to combine the advantages of both models and generate high quality images without mode collapse/mixture. Specifically, after training the autoencoder and computing the extended SDOT map, we can sample directly from the latent distribution by applying the extended map to the uniform distribution, and use these samples to train the GAN model. In contrast to conventional GAN models, whose generators are trained to transport a latent Gaussian distribution to the data distribution, our GAN model samples from the latent distribution inferred from the data. The distribution transport map from this latent distribution to the data distribution is continuous and thus can be well approximated by the generator (parameterized by CNNs), as shown by the second map of Fig. 1. Moreover, the decoder of the pre-trained autoencoder gives a warm start to the generator, so that the real and fake batches of images have non-vanishing overlap in their supports during the training phase and the Kullback–Leibler divergence between them stays finite. Furthermore, the content loss and feature loss between paired latent codes and real input images regularize the adversarial loss and stabilize the GAN training. Experiments show the efficacy and efficiency of our proposed model.

The contributions of the current work can be summarized as follows: (1) This paper proposes a novel AE-OT-GAN model that combines the strengths of the AE-OT model and the GAN model. It eliminates the mode collapse/mixture of GANs and removes the blurriness of the images generated by AE-OT. (2) The decoder of the autoencoder provides a good initialization of the generator of the GAN. The number of iterations required to reach the equilibrium is reduced by more than 100 times compared to typical GANs. (3) In addition to the adversarial loss, the explicit correspondence between the latent codes and the real images provides auxiliary constraints on the generator, namely the content loss. (4) Our experiments demonstrate that our model generates images consistently better than or comparable to the results of state-of-the-art methods.

2 Related Work

Figure 2: The framework of the proposed method. First, the autoencoder is trained to embed the images into the latent space; the real latent codes are shown as orange circles. Then we compute the extended semi-discrete OT map to generate new latent codes in the latent space (the purple crosses). Finally, our GAN model is trained to map the latent distribution induced by the extended OT map to the image distribution. Here the generator is just the decoder of the autoencoder. The fake batch used to train the discriminator (the bar with orange and purple colors) is composed of two parts: the images reconstructed from the real latent codes, and the images generated from latent codes obtained by applying the extended OT map to samples from the uniform distribution. The real batch (the bar with only orange color) is also composed of two parts: the real images corresponding to the real latent codes, and randomly selected real images.

The proposed method is closely related to encoder-decoder based generation models, generative adversarial networks (GANs), conditional GANs and the hybrid models that combine the advantages of the above.

Encoder-decoder architecture A breakthrough in image generation comes from Variational Autoencoders (VAEs) (e.g. [19]), where the decoders approximate real data distributions from a Gaussian distribution in a variational approach (e.g. [19] and [34]). Later, Burda et al. [45] relax the requirement on the latent distribution and propose the importance weighted autoencoder (IWAE) model through a different lower bound. Dai and Wipf [7] argue that the latent distribution of a VAE may not be Gaussian and improve it by first training the original model and then generating new latent codes through the extended ancestral process. Another improvement of the VAE is the VQ-VAE model [1], which requires the encoder to output discrete latent codes by vector quantisation, so that the posterior collapse of VAEs can be overcome. With a multi-scale hierarchical organization, this idea is further used to generate high quality images in VQ-VAE-2 [33]. In [37], the authors adopt the Wasserstein distance in the latent space to measure the distance between the distribution of the latent codes and the given one, and generate images with better quality. Different from the VAEs, the AE-OT model [2] first embeds the images into the latent space with an autoencoder, then computes an extended semi-discrete OT map to generate new latent codes based on the fixed ones. Decoding these codes yields new images. Although the encoder-decoder based methods are relatively simple to train, the generated images tend to be blurry.

Generative adversarial networks The GAN model [11] alternately updates the generator, which maps noise sampled from a given distribution to images, and the discriminator, which distinguishes the generated images from the real ones. If the generated images successfully fool the discriminator, we say the model is well trained. Later, [32] proposes a deep convolutional architecture (DCGAN) to generate images with better quality. While being a powerful tool for generating realistic samples, GANs can be hard to train and suffer from the mode collapse problem [12]. After careful analysis, [4] points out that the KL divergence used by the original GAN causes these problems. The authors then introduce the celebrated WGAN, which makes the whole framework easier to converge. To satisfy the Lipschitz continuity required by WGAN, many methods have been proposed, including clipping [4], gradient penalty [13], spectral normalization [30] and so on. Later, Wu et al. [41] use the Wasserstein divergence objective, which gets rid of the Lipschitz approximation problem and achieves better results. Instead of the distance cost adopted by WGAN, Liu et al. [27] propose WGAN-QC, which takes the quadratic cost into consideration. Though various GANs can generate sharp images, in theory they will encounter the mode collapse or mode mixture problem [12, 2].

Hybrid models To solve the blurry image problem of the encoder-decoder architecture and the mode collapse/mixture problems of GANs, a natural idea is to compose them together. Larsen et al. [21] propose to combine the variational autoencoder with a generative adversarial network, and thus generate images better than VAEs. [29] matches the aggregated posterior of the hidden code vector of the autoencoder with an arbitrary prior distribution by a discriminator, and then applies the model to tasks like semi-supervised classification and dimensionality reduction. BiGAN [16], which has the same architecture as ours, uses the discriminator to differentiate both the generated images and the generated latent codes. Further, by utilizing the BigGAN generator [3], BigBiGAN [8] extends this method to generate much better results. Here we also treat BourGAN [42] as a hybrid model, because it first embeds the images into the latent space via Bourgain's theorem, then trains the GAN model by sampling from the latent space using a GMM model.

Conditional GANs are another kind of hybrid model that can also be treated as image-to-image transformation. For example, by using an encoder-decoder architecture to build the connection between paired images and then using a discriminator to distinguish the decoded images from the real ones, [15] is able to transform images between different styles. Further, SRGAN [25] uses a similar architecture to obtain super resolution images from their low resolution versions. The SRGAN model is the most similar work to ours, as it also utilizes the content loss and the adversarial loss. The main differences between this model and ours are: (i) SRGAN only uses the paired data, while the proposed method uses both the paired data and newly generated latent codes to train the model; (ii) the visually meaningful features used by SRGAN are extracted from the pre-trained VGG19 network [35], while in our model they come from the encoder itself. This makes them more reasonable, especially in settings where the dataset is not among those used to train the VGG network.

3 The Proposed Method

In this section, we explain our proposed AE-OT-GAN model in detail. There are three main modules: an autoencoder (AE), an optimal transport mapper (OT) and a GAN model. First, an AE model is trained to embed the data manifold into the latent space. At the same time, the encoder pushes forward the ground-truth data distribution supported on the data manifold to the ground-truth latent distribution supported on the latent manifold in the latent space. Second, we compute the semi-discrete OT map from the uniform distribution to the empirical latent distribution. By extending the SDOT map, we can construct a continuous distribution that approximates the ground-truth latent distribution well. Finally, starting from this continuous distribution as the latent distribution, our GAN model is trained to generate both realistic and crisp images. The pipeline of our proposed model is illustrated in Fig. 2. In the following, we explain the three modules one by one.

3.1 Data Embedding with Autoencoder

We model the real data distribution as a probability measure supported on a low dimensional manifold embedded in a higher dimensional Euclidean space (the ambient space).

In the first step of our AE-OT-GAN model, we train an autoencoder (AE) to embed the real data manifold into the latent space, where its image is the latent manifold. In particular, training the AE model is equivalent to computing the encoding map and the decoding map by minimizing a reconstruction loss function, with both maps parameterized by standard CNNs. Given dense sampling from the image manifold (a detailed explanation is included in the supplementary) and ideal optimization (namely, the loss function goes to zero), the composition of the encoder and the decoder coincides with the identity map on the data manifold. After training, the encoder is a continuous, invertible map, namely a homeomorphism, and the decoder is the inverse homeomorphism. This means the encoder is an embedding, and it pushes forward the data distribution to the latent data distribution. In practice, we only have the empirical data distribution given by the training samples, which is pushed forward to the empirical latent distribution.
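As a point of reference, a reconstruction objective of this kind is commonly written as follows. The notation is assumed here rather than taken from the text: f_θ denotes the encoder, g_ξ the decoder, x_1, ..., x_n the training images, and the squared Euclidean norm is one possible choice of distance.

```latex
\mathcal{L}(\theta, \xi) \;=\; \frac{1}{n} \sum_{i=1}^{n}
  \bigl\| x_i - g_\xi\bigl(f_\theta(x_i)\bigr) \bigr\|^{2}
```

Under this form, the loss reaching zero means that the decoder undoes the encoder on every training sample, which is the sense in which their composition acts as the identity on the data manifold.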

3.2 Constructing the Latent Distribution with the Semi-Discrete OT Map

Figure 3: The OT map and the extended OT map in the 2D case. The semi-discrete OT map sends all points in each polyhedral cell (orange cells on the left) to the corresponding latent code (circles and squares on the right). The piece-wise linear extended map sends triangulated regions of the source domain to the simplicial complex in the latent space (shown in purple). Given the barycenters of the cells, each triangle they span is mapped to the corresponding simplex of latent codes, so a point inside a triangle on the left is mapped to the point with the same barycentric coordinates in the corresponding simplex on the right. Red lines on the left illustrate the singular set of the extended map, which corresponds to the pre-image of gaps or holes in the simplicial complex.

In this section, starting from the empirical latent distribution, we construct a continuous latent distribution following [2] such that (i) it generalizes well, so that all of the modes are covered by its support; (ii) its support has a topology similar to that of the latent manifold, which ensures that the transport map from the constructed distribution to the data distribution is continuous; and (iii) it is efficient to sample from.

To obtain this distribution, the semi-discrete OT map from the uniform distribution to the empirical latent distribution is computed first. The uniform distribution is defined on a domain whose dimension equals that of the latent space. By extending this map to a piece-wise linear map, we construct the continuous latent distribution as the push-forward of the uniform distribution under the extended map.

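In the first step, we compute the semi-discrete OT map from the uniform distribution to the empirical latent distribution. Under this map, the continuous source domain is decomposed into cells, one per latent code, and every point of a cell is mapped to that cell's latent code. The Lebesgue measure of each cell equals the mass that the empirical latent distribution assigns to its code. The cell structure is shown in the left frame of Fig. 3 (the orange cells). Computational details can be found in the supplementary material and [2].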
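The cell decomposition can be sketched numerically. The code below is a minimal Monte Carlo sketch under stated assumptions, not the authors' implementation: it assumes the source is uniform on the unit cube, a quadratic transport cost, and equal mass 1/n on every latent code. It uses the standard convex formulation in which each code carries a height, a sample belongs to the cell whose affine function is largest, and the heights are adjusted by gradient descent until every cell holds its target mass. The function names, step size and sample counts are hypothetical.

```python
import numpy as np


def semi_discrete_ot(codes, n_iters=300, n_samples=4000, lr=0.1, seed=0):
    """Estimate heights h such that each cell of the map
    x -> codes[argmax_i(<x, codes[i]> + h[i])] receives mass 1/n
    under the uniform distribution on the unit cube [0, 1]^d."""
    rng = np.random.default_rng(seed)
    n, d = codes.shape
    h = np.zeros(n)
    target = np.full(n, 1.0 / n)
    for _ in range(n_iters):
        x = rng.random((n_samples, d))                 # uniform source samples
        idx = np.argmax(x @ codes.T + h, axis=1)       # cell index of each sample
        mass = np.bincount(idx, minlength=n) / n_samples
        h -= lr * (mass - target)                      # shrink cells that hold too much mass
        h -= h.mean()                                  # heights are only defined up to a constant
    return h


def ot_map(x, codes, h):
    """Send each row of x to the latent code of the cell that contains it."""
    return codes[np.argmax(x @ codes.T + h, axis=1)]


# Toy example: four latent codes in the plane.
codes = np.array([[0.2, 0.2], [0.8, 0.2], [0.2, 0.8], [0.8, 0.8]])
h = semi_discrete_ot(codes)
test_x = np.random.default_rng(1).random((20000, 2))
images = ot_map(test_x, codes, h)
```

Every source sample lands exactly on one of the latent codes, which is why the map is many-to-one and why an extension step is needed before new codes can be generated.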

Second, we extend the image domain of the semi-discrete OT map from the discrete latent codes to a continuous neighborhood, which serves as the supporting manifold of the continuous latent distribution. Specifically, we construct a simplicial complex from the latent codes; the construction depends on a constant parameter. The 0-skeleton of the complex is the set of all latent codes, and the higher dimensional skeletons are then defined from it. The right frame of Fig. 3 shows an example. Assuming that the latent codes are densely sampled from the latent manifold and that the constant is chosen appropriately, the complex has "hole" and "gap" structure consistent with that of the latent manifold, in the sense of homology equivalence. Details are described in the supplementary material.

Finally, we define the piece-wise linear extended OT map. Given a random sample from the uniform distribution, we find the cell containing it. We then compute its barycentric parameters with respect to the mass centers of the neighbouring cells, that is, non-negative weights that sum to one and express the sample as a weighted combination of those mass centers. If the latent codes of those cells form a simplex of the complex, the sample is mapped to the combination of these codes with the same weights. Otherwise it is mapped to the latent code of its own cell, as under the semi-discrete OT map. As illustrated in Fig. 3, in contrast to the many-to-one semi-discrete OT map, the extended map sends samples within the triangular areas of the source domain (the purple triangles in the left frame) linearly and bijectively to the corresponding simplices of the complex (the purple triangles in the right frame). The push-forward of the uniform distribution under the extended map is the continuous latent distribution used in the rest of the paper.
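The interpolation step can be illustrated for a single simplex. The sketch below is an assumption-laden illustration rather than the paper's procedure: it takes the mass centers and latent codes of d + 1 neighbouring cells as given, and it assumes that the codes "form a simplex" when all pairwise distances are at most a threshold `eps`. The function names and the threshold rule are hypothetical.

```python
import numpy as np


def barycentric_weights(x, centers):
    """Solve x = sum_i lam_i * centers[i] subject to sum_i lam_i = 1,
    for d + 1 affinely independent centers in R^d."""
    k = centers.shape[0]
    a = np.vstack([centers.T, np.ones(k)])     # (d + 1) x k linear system
    b = np.append(x, 1.0)
    return np.linalg.solve(a, b)


def extended_map(x, centers, codes, cell_index, eps):
    """Interpolate the latent codes when x lies inside the simplex of the
    mass centers and the codes are pairwise within eps (assumed criterion);
    otherwise fall back to the code of the cell that contains x."""
    lam = barycentric_weights(x, centers)
    dists = np.linalg.norm(codes[:, None, :] - codes[None, :, :], axis=-1)
    forms_simplex = bool(np.all(dists <= eps))
    if forms_simplex and np.all(lam >= -1e-12):
        return lam @ codes
    return codes[cell_index]


# Toy example in the plane: three mass centers and their latent codes.
centers = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
codes = np.array([[2.0, 2.0], [3.0, 2.0], [2.0, 3.0]])
x = np.array([0.25, 0.25])
interpolated = extended_map(x, centers, codes, cell_index=0, eps=2.0)
fallback = extended_map(x, centers, codes, cell_index=0, eps=0.5)
```

With a generous threshold the sample is mapped to a new code strictly between the existing ones; with a tight threshold the codes are treated as lying across a gap and the sample keeps the code of its own cell.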

Theorem 1.

The 2-Wasserstein distance between and satisfies . Moreover, if the latent codes are densely sampled from the latent manifold , we have -almost surely.

To avoid confusion, we omit the subscript and denote as . With the proof included in the supplementary material, this theorem tells us that the constructed distribution, as a continuous generalization of the empirical latent distribution, is a good approximation of the true latent distribution. We also note that the extended map is piece-wise linear and pushes forward the uniform distribution to the constructed distribution, which makes sampling from it efficient and accurate.

3.3 GAN Training from the Latent Distribution

The GAN model computes the transport map from the continuous latent distribution to the data distribution on the manifold.

Our GAN model is based on the vanilla GAN model proposed by Goodfellow [11]. The generator is used to generate new images by sampling from the latent distribution, while the discriminator is used to judge whether the distribution of the generated images is the same as that of the real images. The training process is formalized as a min-max optimization problem over the generator and the discriminator.

In our model, the loss function consists of three terms: the image content loss, the feature loss and the adversarial loss. The content loss is scaled by a weight.

Adversarial Loss We adopt the vanilla GAN model [11] based on the Kullback–Leibler (KL) divergence. The key difference between our model and the original GAN is that our latent samples are drawn from the data related latent distribution, instead of a Gaussian distribution.

According to [4], the vanilla GAN is hard to converge because the supports of the distributions of real images and fake images may not intersect, which makes the KL divergence between them infinite. This issue is solved in our case, because (1) the training of the AE gives a warm start to the generator, so at the beginning of the training the generated distribution is already close to the real data distribution; and (2) by carefully composing the fake and real batches used to train the discriminator, we can keep the KL divergence between them finite and well behaved. In detail, as shown in Fig. 2, the fake batch is composed of both the images reconstructed from the real latent codes (the orange circles) and the images generated from the generated latent codes (the purple crosses), and the real batch includes both the real images corresponding to the real latent codes and some randomly selected images.
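For reference, the standard adversarial objective of the vanilla GAN [11] has the form below. The notation is assumed for illustration: G is the generator, D the discriminator, the first expectation runs over real images x, and the second over latent samples z, which in the present model would be drawn from the data related latent distribution rather than from a Gaussian.

```latex
\min_{G} \max_{D} \;
  \mathbb{E}_{x \sim \text{data}} \bigl[ \log D(x) \bigr]
  + \mathbb{E}_{z \sim \text{latent}} \bigl[ \log\bigl( 1 - D(G(z)) \bigr) \bigr]
```

Only the sampling distribution of z changes relative to the usual setup; the form of the objective itself is the familiar one.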

Content Loss Recall that the generator can produce two types of images: images reconstructed from real latent codes and images generated from generated latent codes. Given a real sample, the encoder yields its latent code and the generator yields the reconstructed image from that code. Each reconstructed image is therefore represented as a triple: the real sample, its latent code and the reconstruction. The content loss is computed over all such triples and penalizes the difference between each real sample and its reconstruction by the generator.

Feature Loss We adopt a feature loss similar to that in [25]. Given a reconstructed image triple, we encode the reconstructed image with the encoder of the AE. Ideally, the real image and the generated image should be the same, so their latent codes should be similar. We measure the difference between their latent codes by the feature loss. Furthermore, we can measure the difference between their intermediate features from different layers of the encoder.

Suppose the encoder is a multi-layer network. The feature loss is a weighted sum, over the layers of the encoder, of the difference between the features of the real image and those of the reconstructed image at each layer, with one weight per layer.

For reconstructed images, the content loss and the feature loss force the generated image to be the same as the real image, so the manifold of generated images aligns well with the real data manifold.
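How the three terms fit together can be sketched in a few lines. This is a minimal sketch under assumptions, not the authors' loss: it uses mean squared differences for both the content and feature terms (the norm used in the paper is not recoverable from the text), and the names `beta`, `weights` and all function names are hypothetical.

```python
import numpy as np


def content_loss(real, recon):
    """Mean squared pixel difference between paired real and reconstructed images
    (the squared norm is an assumption)."""
    return float(np.mean((real - recon) ** 2))


def feature_loss(real_feats, recon_feats, weights):
    """Weighted sum over encoder layers of mean squared feature differences."""
    return float(sum(w * np.mean((fr - fg) ** 2)
                     for w, fr, fg in zip(weights, real_feats, recon_feats)))


def generator_loss(adv, real, recon, real_feats, recon_feats, beta, weights):
    """Adversarial term plus weighted content term plus feature term."""
    return (adv
            + beta * content_loss(real, recon)
            + feature_loss(real_feats, recon_feats, weights))


# Toy example: two 4-pixel "images" and one encoder layer with 3 features.
real = np.ones((2, 4))
recon = np.zeros((2, 4))
real_feats = [np.ones(3)]
recon_feats = [np.zeros(3)]
total = generator_loss(adv=0.2, real=real, recon=recon,
                       real_feats=real_feats, recon_feats=recon_feats,
                       beta=2.0, weights=[0.5])
```

The content and feature terms apply only to reconstructed images, since only those have a paired real image; images produced from generated latent codes contribute through the adversarial term alone.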

4 Experiments

To evaluate the proposed method, several experiments are conducted on the simple MNIST dataset [24] and on complex datasets including Cifar10 [20], CelebA [46] and CelebA-HQ [26].

Figure 4: (a) Latent code distribution. The orange circles represent the fixed latent codes and the purple crosses are the generated ones. (b) Comparison between the generated digits (left) and the real digits (right).
Figure 5: The curves for training on the MNIST dataset [24] for each epoch, including the content loss (a), the self-perceptual loss (b), the discriminator output (c) and the FIDs (d).
Figure 6: Evolution of the generator during training on the CelebA dataset [46]. Reconstructed images from real latent codes at different epochs are shown. Panels: (a) Epoch 0 (AE-OT), (b) Epoch 80, (c) Epoch 160, (d) Epoch 240, (e) Ground-truth.
Figure 7: Evolution of the generator during training on the CelebA dataset [46]. Generated images from generated latent codes at different epochs are shown. Panels: (a) Epoch 0 (AE-OT), (b) Epoch 80, (c) Epoch 160, (d) Epoch 240.

Architecture We adopt the InfoGAN [6] architecture as our GAN model for the MNIST dataset. The standard and ResNet models used for the Cifar10 dataset are the same as those used by SNGAN [30], and the architectures of WGAN-div [41] are used for the CelebA dataset. The encoder is set to be the mirror of the generator/decoder.

Evaluation metrics To illustrate the performance of the proposed method, we adopt the commonly used Fréchet Inception distance (FID) [14] as our evaluation metric. FID takes both the generated images and the real images into consideration. After the images are embedded into the feature space by the Inception network, two high dimensional Gaussian distributions are fitted to the generated and real features, respectively. The FID is then the Fréchet distance between these two Gaussian distributions. A lower FID means better quality of the generated dataset. This metric has proven effective in judging the performance of generative models, and it serves as a standard for comparison with other works.
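The distance between the two fitted Gaussians has a closed form: the squared distance between the means plus the trace of the two covariances minus twice the trace of the square root of their product. The sketch below computes it with NumPy on arbitrary feature arrays; it assumes the features have already been extracted (the Inception network itself is not included), and the function name is hypothetical.

```python
import numpy as np


def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two feature sets.

    Each input has shape (num_samples, feature_dim) with feature_dim >= 2.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative when
    # both covariances are positive semi-definite.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)


# Toy check: shifting every feature vector changes only the mean term.
rng = np.random.default_rng(0)
feats = rng.normal(size=(500, 3))
same = frechet_distance(feats, feats)
shifted = frechet_distance(feats, feats + np.array([3.0, 0.0, 0.0]))
```

In the toy check the covariances are identical, so the distance reduces to the squared length of the shift, which makes the two terms of the formula easy to tell apart.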

Training details

To get rid of the vanishing gradient problem and make the model converge better, we use the following three strategies:

(i) Train the discriminator using Batch Composition There are two types of latent codes in our method: the real latent codes coming from encoding the real images by the encoder, and generated latent codes coming from the extended OT map. Correspondingly, there are two types of generated images, the reconstructed images from the real latent codes and the generated images from the generated latent codes.

To train the discriminator, both a fake batch and a real batch are used. The fake batch consists of randomly selected reconstructed images together with generated images. The real batch includes only real images, and its first part is in one-to-one correspondence with the reconstructed images in the fake batch, as shown in Fig. 2. In all the experiments, the ratio between the number of generated images and the number of reconstructed images in the fake batch is 3:1.

This strategy ensures that the supports of the fake and real batches overlap, so that the KL divergence stays finite.
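The batch composition can be sketched as follows. This is an illustrative sketch with hypothetical names, assuming that `real_images[i]` and `reconstructions[i]` form a matched pair and that the remaining real images are drawn at random; only the 3:1 ratio comes from the text.

```python
import random


def compose_batches(real_images, reconstructions, generated, batch_size, ratio=3, seed=0):
    """Build one real batch and one fake batch for the discriminator.

    The fake batch holds reconstructions followed by generated images in a
    1:ratio proportion; the first part of the real batch is paired
    one-to-one with the reconstructions.
    """
    rng = random.Random(seed)
    n_recon = batch_size // (ratio + 1)
    n_gen = batch_size - n_recon
    paired = rng.sample(range(len(real_images)), n_recon)
    fake = [reconstructions[i] for i in paired] + rng.sample(generated, n_gen)
    extra = rng.sample(range(len(real_images)), n_gen)
    real = [real_images[i] for i in paired] + [real_images[i] for i in extra]
    return real, fake, paired


# Toy example with string placeholders instead of image tensors.
real_images = [f"x{i}" for i in range(10)]
reconstructions = [f"r{i}" for i in range(10)]
generated = [f"g{i}" for i in range(10)]
real, fake, paired = compose_batches(real_images, reconstructions, generated, batch_size=8)
```

Because each reconstruction in the fake batch sits next to its own source image in the real batch, the discriminator always sees fake samples that lie close to real ones, which is the overlap the strategy relies on.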

(ii) Different learning rate For better training, we use different learning rates for the generator and the discriminator as suggested by Heusel et al. in [14]. Specifically, we set the learning rate of the generator to be and that of the discriminator to be , where . This improves the stability of the training process.

(iii) Different inner steps Another way to improve the training consistency of the whole framework is to use different numbers of update steps for the generator and the discriminator. Namely, each time the discriminator is updated once, the generator is updated multiple times. This strategy is the opposite of training vanilla GANs, which typically require multiple discriminator update steps per generator update step.

By setting and , we can keep the discriminator output for the real images slightly larger than that for the generated images, which better guides the training of the generator. For the MNIST dataset, and ; for the Cifar10 dataset, and ; and for the CelebA dataset, and . In Eq. 1, and with , where denotes the last layer of the encoder. is used to regularize the loss of the latent codes.
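The alternating schedule described in strategy (iii) amounts to the loop below. It is a schematic sketch only: the function name and the step counts are hypothetical, and the actual per-dataset values are those chosen by the authors.

```python
def update_schedule(n_discriminator_steps, generator_steps_per_d):
    """Return the order of updates: one discriminator step ('D') followed by
    several generator steps ('G'), repeated."""
    schedule = []
    for _ in range(n_discriminator_steps):
        schedule.append("D")
        schedule.extend(["G"] * generator_steps_per_d)
    return schedule


# Hypothetical setting: three generator updates per discriminator update.
schedule = update_schedule(n_discriminator_steps=2, generator_steps_per_d=3)
print(schedule)
```

In a real training loop each "D" or "G" entry would trigger one optimizer step for the corresponding network, each with its own learning rate as in strategy (ii).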

With the above settings and the warm initialization of the generator from the pre-trained decoder, the total number of training epochs for each dataset is set to 500, which is far fewer than is typical for GAN training (usually 10k~50k).

Figure 8: The visual comparison between the proposed method and the state-of-the-arts on CelebA dataset [46] with ResNet architecture. Panels: CT-GAN [44], WGAN-GP [13], WGAN-div [41], WGAN-QC [27], Proposed method.
Figure 9: The visual comparison between the proposed method and the state-of-the-arts on Cifar10 dataset [20] with ResNet architecture. Panels: WGAN-GP [13], SNGAN [30], WGAN-div [41], AE-OT [2], Proposed method.
Figure 10: The generation results of CelebA-HQ by the proposed method.
                  CIFAR10               CelebA
Method            Standard   ResNet     Standard   ResNet
WGAN-GP [13]      40.2       19.6       21.2       18.4
PGGAN [17]        -          18.8       -          16.3
SNGAN [30]        25.5       21.7       -          -
WGAN-div [41]     -          18.1       17.5       15.2
WGAN-QC [27]      -          -          -          12.9
AE-OT [2]         34.2       28.5       24.3       28.6
AE-OT-GAN         25.2       17.1       11.2       7.8
Table 1: FID comparison between the proposed method and the state of the art on Cifar10 and CelebA (lower is better).

4.1 Convergence Analysis on MNIST

In this experiment, we evaluate the performance of our proposed model on the MNIST dataset [24], which can be well embedded into the latent space with the architecture of InfoGAN [6]. In Fig. 4(a), we visualize the real latent codes (orange circles) and the generated latent codes (purple crosses) with t-SNE [39]. The support of the real latent distribution and that of the generated distribution align well. Frame (b) of Fig. 4 compares the generated handwritten digits (left) with the real digits (right); the two are very difficult for humans to tell apart.

To show the convergence behaviour of the proposed method, we plot the related curves in Fig. 5. Frames (a) and (b) show the loss on the images and the loss on the latent codes; both decrease monotonically. Frame (c) shows that the output of the discriminator for real images is only slightly larger than that for fake images during the training process, which helps the generator produce more realistic digits. Frame (d) shows the evolution of the FID. For the MNIST dataset, the best known FIDs with the same InfoGAN architecture are those reported in [28] and [2]; our model outperforms both.

4.2 Quality Evaluation on Complex Dataset

In this section, we compare with the state-of-the-art methods quantitatively and qualitatively.

Progressive Quality Improvement First, Fig. 6 and Fig. 7 show how the results of the proposed method evolve during GAN training. The quality of the generated images improves steadily during the process. The first four frames of Fig. 6 show images reconstructed from the real latent codes by the decoder, and the last frame shows the corresponding ground-truth input images. As the number of epochs increases, the generated images become sharper, and eventually they are very close to the ground truth. Fig. 7 shows images generated from generated latent codes, which therefore have no corresponding real images. Similarly, these images become sharper as the number of epochs increases. Note that epoch 0 means the images are generated by the original decoder, which is equivalent to the output of an AE-OT model [2]. Thus we can conclude that the proposed AE-OT-GAN improves markedly on AE-OT.

Comparison on CelebA and CIFAR 10 Second, we compare with the state-of-the-art methods, including WGAN-GP [13], PGGAN [17], SNGAN [30], CTGAN [44], WGAN-div [41], WGAN-QC [27] and the recently proposed AE-OT model [2], on Cifar10 [20] and CelebA [46]. Tab. 1 shows the FIDs of our method and of the compared methods, trained under both the standard and the ResNet architectures. The FIDs of the other methods come from the listed papers, except those of AE-OT, which are computed directly by our model (the results at epoch 0). From the table we can see that our method obtains much better results than the others on the CelebA dataset, under both the standard and the ResNet architecture. The faces generated by the proposed method also have fewer flaws than those of other GANs, as shown in Fig. 8. On Cifar10, the FIDs of our model are comparable to the state of the art, and some generated images are shown in Fig. 9. The convergence curves for both datasets can be found in the supplementary.

14.7 13.5 7.7 7.2
Table 2: The FIDs of the proposed method and the state-of-the-arts.

Experiment on CelebA-HQ Furthermore, we test the proposed method on high resolution images, namely the CelebA-HQ dataset with an image size of 256x256. The architecture used to train the model is illustrated in the supplementary. Our model has far fewer parameters than those of [27, 41, 17], while its performance is better than theirs, as shown in Tab. 2. We also display several generated images in Fig. 10, which are crisp and visually realistic.

5 Conclusion and Future Work

In this paper, we propose the AE-OT-GAN model, which composes the AE-OT model and the vanilla GAN together. By utilizing the merits of both models, our method can generate high quality images without mode collapse or mode mixture. First, the images are embedded into the latent space by an autoencoder, then the SDOT map from the uniform distribution to the empirical distribution supported on the latent codes is computed. By sampling from the latent distribution obtained with the extended SDOT map, we can train our GAN model. Moreover, the paired latent codes and images give us additional constraints on the generator. Using the FID as the metric, we show that the proposed model is able to generate images comparable to or better than the state of the art.


  • [1] K. K. Aaron van den Oord (2017) Neural discrete representation learning. In NeurIPS, Cited by: §2.
  • [2] D. An, Y. Guo, N. Lei, Z. Luo, S. Yau, and X. Gu (2020) AE-ot: a new generative model based on extended semi-discrete optimal transport. In International Conference on Learning Representations, Cited by: §1, §1, §1, §2, §2, §3.2, §3.2, Figure 9, §4.1, §4.2, §4.2, Table 1.
  • [3] K. S. Andrew Brock (2019) Large scale gan training for high fidelity natural image synthesis. In International Conference on Learning Representations, Cited by: §2.
  • [4] M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein generative adversarial networks. In ICML, pp. 214–223. Cited by: §1, §2, §3.3.
  • [5] S. Bengio (2017) Density estimation using real nvp. In ICLR, Cited by: §1.
  • [6] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel (2016) InfoGAN: interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, Cited by: §1, §4.1, §4.
  • [7] B. Dai and D. Wipf (2019) Diagnosing and enhancing VAE models. In International Conference on Learning Representations, Cited by: §2.
  • [8] J. Donahue and K. Simonyan (2019) Large scale adversarial representation learning. Cited by: §2.
  • [9] P. D. Durk P Kingma (2018) Glow: generative flow with invertible 1x1 convolutions. In NeurIPS, Cited by: §1.
  • [10] A. Figalli (2010) Regularity properties of optimal maps between nonconvex domains in the plane. Communications in Partial Differential Equations 35 (3), pp. 465–479. Cited by: §1.
  • [11] I. J. Goodfellow (2014) Generative adversarial nets. Cited by: §1, §2, §3.3, §3.3.
  • [12] I. Goodfellow (2016) NIPS 2016 tutorial: generative adversarial networks. arXiv preprint arXiv:1701.00160. Cited by: §1, §2.
  • [13] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville (2017) Improved training of wasserstein gans. In NIPS, pp. 5769–5779. Cited by: §1, §2, Figure 8, Figure 9, §4.2, Table 1.
  • [14] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, G. Klambauer, and S. Hochreiter (2017) Gans trained by a two time-scale update rule converge to a nash equilibrium. Cited by: §4, §4.
  • [15] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [16] T. D. Jeff Donahue (2017) Adversarial feature learning. In International Conference on Learning Representations, Cited by: §2.
  • [17] T. Karras (2018) Progressive growing of gans for improved quality, stability, and variation. In ICLR, Cited by: §4.2, §4.2, Table 1.
  • [18] M. Khayatkhoei, M. K. Singh, and A. Elgammal (2018) Disconnected manifold learning for generative adversarial networks. In Advances in Neural Information Processing Systems, Cited by: §1.
  • [19] D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §1, §2.
  • [20] A. Krizhevsky (2009) Learning multiple layers of features from tiny images. Tech report. Cited by: Figure 9, §4.2, §4.
  • [21] A. B. L. Larsen (2016) Autoencoding beyond pixels using a learned similarity metric. Cited by: §2.
  • [22] Y. B. Laurent Dinh (2014) NICE: non-linear independent components estimation. arXiv preprint arXiv:1410.8516. Cited by: §1.
  • [23] Y. LeCun, S. Chopra, and R. Hadsell (2006) A tutorial on energy-based learning. Cited by: §1.
  • [24] Y. LeCun and C. Cortes (2010) MNIST handwritten digit database. External Links: Link Cited by: Figure 5, §4.1, §4.
  • [25] C. Ledig (2017) Photo-realistic single image super-resolution using a generative adversarial network. Cited by: §2, §3.3.
  • [26] C. Lee, Z. Liu, L. Wu, and P. Luo (2019) MaskGAN: towards diverse and interactive facial image manipulation. arXiv preprint arXiv:1907.11922. Cited by: §4.
  • [27] H. Liu, X. Gu, and D. Samaras (2019) Wasserstein gan with quadratic transport cost. In ICCV, Cited by: §2, Figure 8, §4.2, §4.2, Table 1.
  • [28] M. Lucic, K. Kurach, M. Michalski, S. Gelly, and O. Bousquet (2018) Are gans created equal? a large-scale study. In Advances in neural information processing systems, pp. 698–707. Cited by: §1, §4.1.
  • [29] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey (2015) Adversarial autoencoders. arXiv preprint arXiv:1511.05644. Cited by: §2.
  • [30] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida (2018) Spectral normalization for generative adversarial networks. In ICLR, Cited by: §2, Figure 9, §4.2, Table 1, §4.
  • [31] E. Nijkamp (2019) On learning non-convergent non-persistent short-run mcmc toward energy-based model. arXiv preprint arXiv:1904.09770. Cited by: §1.
  • [32] A. Radford, L. Metz, and S. Chintala (2016) Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, Cited by: §1, §2.
  • [33] Cited by: §2.
  • [34] D. J. Rezende, S. Mohamed, and D. Wierstra (2014) Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082. Cited by: §2.
  • [35] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. Cited by: §2.
  • [36] J. B. Tenenbaum, V. de Silva, and J. C. Langford (2000) A global geometric framework for nonlinear dimensionality reduction. Science 290 (5500), pp. 2319–2323. Cited by: §1.
  • [37] I. Tolstikhin (2018) Wasserstein auto-encoders. In ICLR, Cited by: §1, §2.
  • [38] A. van den Oord, N. Kalchbrenner, L. Espeholt, k. kavukcuoglu, O. Vinyals, and A. Graves (2016) Conditional image generation with pixelcnn decoders. In Advances in Neural Information Processing Systems, Cited by: §1.
  • [39] L. van der Maaten and G. Hinton (2008) Visualizing data using t-SNE. Journal of Machine Learning Research. Cited by: §4.1.
  • [40] C. Villani (2008) Optimal transport: old and new. Vol. 338, Springer Science & Business Media. Cited by: §1.
  • [41] J. Wu (2018) Wasserstein divergence for gans. In ECCV, Cited by: §2, Figure 8, Figure 9, §4.2, §4.2, Table 1, §4.
  • [42] C. Xiao, P. Zhong, and C. Zheng (2018) Bourgan: generative networks with metric embeddings. In NeurIPS, Cited by: §1, §1, §2.
  • [43] J. Xie, Y. Lu, S. Zhu, and Y. Wu (2016) Cooperative training of descriptor and generator networks. IEEE transactions on pattern analysis and machine intelligence. Cited by: §1.
  • [44] L. Xu, M. Skoularidou, A. Cuesta-Infante, and K. Veeramachaneni (2019) Modeling tabular data using conditional gan. In Advances in Neural Information Processing Systems, Cited by: Figure 8, §4.2.
  • [45] R. S. Yuri Burda (2015) Importance weighted autoencoders. In ICML, Cited by: §2.
  • [46] Z. Zhang, P. Luo, C. C. Loy, and X. Tang (2018) From facial expression recognition to interpersonal relation prediction. International Journal of Computer Vision. Cited by: Figure 6, Figure 7, Figure 8, §4.2, §4.
  • [47] S. Zhu, Y. Wu, and D. Mumford (1998) Filters, random fields and maximum entropy (frame): towards a unified theory for texture modeling. International Journal of Computer Vision. Cited by: §1.