Deep generative modeling is the field of modeling high dimensional data distributions with neural networks. It has widespread applications including text generation, data augmentation, and speech synthesis. The two approaches dominating the field are generative adversarial networks (GANs), developed by Goodfellow et al. [5], and generative autoencoders, the most prominent of which is the variational autoencoder (VAE) of Kingma and Welling [1]. GANs involve a minimax game between a generator and a discriminator, and training a GAN is often difficult, prone to exploding or vanishing gradients as well as mode collapse [6, 7, 8]. VAEs, in contrast, require only a simple minimization problem and are thus easier to train. Mapping a density onto the latent space is at the heart of VAEs, and improvements to this mapping have been explored in two different ways, embodied by the WAE of Tolstikhin et al. [2] and the AE-OT of Liu et al. [3]. WAE adds a regularizer term to force the latent space of the autoencoder to match a normal distribution. In theory, this makes generation of images much easier, as any vector sampled from a normal distribution should be familiar to the decoder. However, artificially regularizing the latent space to match a normal distribution lowers the quality of image generation. AE-OT maintains the shape of the latent space of the autoencoder, yet does a poor job of mapping noise vectors to the proper latent vectors.
We propose novel generative mapping algorithms OTtrans and OTgen. These algorithms leverage optimal transport to train deep neural networks to generate samples from lower dimensional data distributions. To generate high dimensional data, we first apply an autoencoder to the high dimensional data and then apply OTtrans and OTgen to sample from the latent distribution of the autoencoder. Depending on which algorithm we use, we call this two step procedure either AE-OTtrans or AE-OTgen.
AE-OTtrans and AE-OTgen have superior performance compared to GANs on lower complexity datasets, including MNIST and Fashion MNIST. This is largely because optimal transport mitigates the mode collapse problem on simpler datasets. Furthermore, AE-OTtrans and AE-OTgen are simpler to train and better understood theoretically than GANs. Compared to non-adversarial generative models, AE-OTtrans and AE-OTgen generate high quality images and interpolations. As opposed to VAE and WAE, AE-OTtrans and AE-OTgen preserve the latent structure of the autoencoder and do not force the latent distribution to be a Gaussian or other prior. This provides more flexibility for the model, which translates into the generation of higher fidelity data. As opposed to the latent space generator of AE-OT, OTtrans and OTgen do not need to train a discriminator and thus are able to deal with sparser datasets. As a result, AE-OTtrans and AE-OTgen are able to serve as much better generators than AE-OT.
2 Related Work
The field of deep generative modeling solves the problem of sampling from high dimensional probability distributions that often lie on a much lower dimensional manifold. For instance, deep models for face generation are able to sample images from spaces of hundreds of thousands of pixels, far too large for traditional sampling techniques. The two methods most prominent in the field are GANs and VAEs, which both leverage the lower dimensional manifold. VAEs do this through an autoencoder, whereas GANs do this through adversarial training. Specifically, GANs pit a generator against a discriminator in a two player minimax zero sum game, in which the generator tries to generate images to fool the discriminator, and the discriminator tries to distinguish between the generated and real images. The discriminator eventually learns which images lie on the lower dimensional manifold, whereas the generator learns to generate images which lie on the manifold. In practice, GANs are able to model complex datasets, such as the CelebA dataset, producing many more samples of higher quality than non-adversarial generative models [10]. Yet achieving this performance is difficult, requiring a variety of tricks [6, 7, 8, 9] as well as much trial and error to carefully tune hyperparameters. We restate that non-adversarial generative modeling is still valuable for its well understood behavior, ease of training, and superior performance on less complex data.
2.1 Base VAE
The VAE is an approach which requires only a single minimization problem. It consists of an encoder and a decoder. The encoder takes an image and condenses it into a series of means and standard deviations which parameterize a multi-dimensional normal distribution in the latent space. A vector is sampled from this distribution and then put through the decoder. The probabilistic nature of the encoder forces the decoder to generalize to most points within the latent space. Then, to generate points, one simply samples vectors from an n-dimensional normal distribution and feeds them through the decoder. However, as noted before [2, 3], the images generated by a VAE tend to be blurrier than real images. This is because the VAE's stochastic training algorithm introduces uncertainty to the autoencoder, which responds by blurring the image to minimize the mean squared loss.
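The encode-sample-decode step above can be sketched as follows. This is a minimal numpy illustration of the reparameterized sampling only; the linear "heads" are hypothetical stand-ins for the paper's encoder network.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, d_latent=8):
    # Hypothetical stand-in for the encoder: splits features into a mean head
    # and a log-variance head parameterizing N(mu, sigma^2) in latent space.
    mu = x[:d_latent]
    log_var = -np.abs(x[d_latent:2 * d_latent])
    return mu, log_var

def sample_latent(mu, log_var):
    # Reparameterization: z = mu + sigma * eps, with eps ~ N(0, I).
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

x = rng.standard_normal(784)          # a flattened 28x28 "image"
mu, log_var = encode(x)
z = sample_latent(mu, log_var)        # fed to the decoder during training

# At generation time, z is instead drawn directly from N(0, I):
z_gen = rng.standard_normal(8)
```

During training, z is decoded and the reconstruction loss is backpropagated through mu and log_var thanks to the reparameterization.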
2.2 WAE
The WAE is an improvement on the base VAE. It consists of a deterministic autoencoder with an added cost term forcing the latent distribution of the autoencoder to be similar to a normal distribution. It differs from the VAE in that the VAE encodes each single point to a normal distribution, whereas the WAE penalizes the cumulative distribution of the whole batch to match a normal distribution. This enables the WAE to have much higher reconstruction quality. Furthermore, the WAE can calculate the divergence between the latent distribution and the normal distribution in two ways: WAE-GAN does this with a GAN in the latent space, whereas WAE-MMD uses Maximum Mean Discrepancy (MMD). Because WAE-GAN uses adversarial training in the latent space, we compare our model to WAE-MMD instead. WAE-MMD inevitably performs worse than a vanilla autoencoder at reconstructing images, as it needs to satisfy the MMD penalty. Furthermore, the difference in image quality between reconstructed and generated images suggests that the latent space of WAE-MMD does not truly match a normal distribution. Thus, MMD is a suboptimal metric for regularizing the latent distribution.
2.3 AE-OT
AE-OT [3] pretrains an autoencoder, consisting of an encoder E and a decoder D, on the data. Then, it trains a neural network to distinguish between real points in the latent distribution and noise generated from a prior. It does this by approximating the Kantorovich potential, where a higher Kantorovich potential corresponds to a point more likely to be from the real distribution rather than the prior. To generate images, one simply samples noise z from the normal distribution. The network maps z to a vector which should have a high Kantorovich potential and thus should be likely to lie in the real latent distribution; the final image is then obtained by applying the decoder D to this vector. This algorithm preserves the latent space of the autoencoder, and thus any reconstructed images are very sharp.
However, AE-OT also has some flaws. In practice, training an optimal discriminator is extremely difficult due to the sparse nature of the dataset. Even when we reduce the latent dimensionality to 64, the discriminator is unable to serve as a good generator.
3 Our Generative Autoencoders
In this paper, we follow [3, 4] and propose a two step generative model framework for high dimensional data, such as images. First, reduce the dimensionality of the high dimensional data by training an encoder E and a decoder D in an autoencoder framework. Then, train a model g which maps from a noise distribution ν to the latent distribution μ of the autoencoder. To generate data similar to the training data, simply sample z from ν. The vectors g(z) will approximate μ, and the decoded data D(g(z)) will approximate the data distribution.
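The two step generation procedure can be sketched as follows. The mapper and decoder here are illustrative stand-ins; in the framework they are trained networks (g and D respectively).

```python
import numpy as np

rng = np.random.default_rng(0)
d_latent, d_data = 8, 784                   # hypothetical dimensions

W_dec = rng.standard_normal((d_latent, d_data))

def decoder(w):
    # Stand-in for the trained decoder D: latent space -> data space.
    return np.tanh(w @ W_dec)

def mapper(z):
    # Stand-in for the trained latent map g: noise -> latent distribution.
    return 2.0 * z

def generate(n):
    z = rng.standard_normal((n, d_latent))  # 1. sample noise z from the prior
    w = mapper(z)                           # 2. map noise into the latent distribution
    return decoder(w)                       # 3. decode latent vectors to data space

samples = generate(16)
```

The key point is that only the middle step differs between the two proposed algorithms; the autoencoder is trained once and frozen.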
We propose two possible ways to do latent distribution mapping, named OTtrans and OTgen. As we will see, OTtrans is more similar to transporting from a distribution to another distribution whereas OTgen is more similar to generating points from a distribution. The corresponding image generative models with autoencoders are then named AE-OTtrans and AE-OTgen respectively.
3.1 Transporter: OTtrans
In OTtrans, we train a neural network T to approximate optimal transport. Let μ be the distribution we want to generate from, and let ν be our prior distribution, which will often be a noise distribution. First, sample vectors x_1, ..., x_n from μ and z_1, ..., z_n from ν. Then calculate the optimal transport mapping, denoted by σ, a bijection from the z_i to the x_i. This is the only time we use optimal transport in the OTtrans algorithm. Finally, train a neural network T which approximates this mapping. Specifically, T should attempt to map each z_i to x_{σ(i)}.
Algorithm 1: OTtrans
Prerequisites: Start by initializing encoder E, decoder D, and transport neural network T. Let c(a, b) = ||a − b||² be the squared cost between a and b, and let ν be the prior distribution.
1. Sample y_1, ..., y_n from the training set and z_1, ..., z_n from ν.
2. Encode each y_i to x_i with x_i = E(y_i).
3. Calculate the optimal transport map σ, a bijection from {z_i} to {x_i}.
while T is not converged do:
4. Randomly sample a batch B ⊆ {1, ..., n}.
5. Calculate loss: L = Σ_{i ∈ B} c(T(z_i), x_{σ(i)}).
6. Update T by using Adam to minimize the loss.
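For equal-weight empirical samples, the exact optimal transport of step 3 reduces to an assignment problem. The following is a minimal numpy/scipy sketch of the algorithm, using scipy's Hungarian solver in place of POT's network simplex, and fitting a closed-form affine map as a stand-in for training T with Adam; the data below is synthetic.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
n, d = 256, 8

x = rng.standard_normal((n, d)) + 3.0   # stand-in for encoded latent vectors E(y_i)
z = rng.standard_normal((n, d))         # noise sampled from the prior

# Step 3: exact OT between two equal-weight point sets is an assignment
# problem; sigma is the bijection matching each z_i to some x_{sigma(i)}.
cost = ((z[:, None, :] - x[None, :, :]) ** 2).sum(-1)   # squared cost c
rows, sigma = linear_sum_assignment(cost)
targets = x[sigma]                      # T should map z[i] to targets[i]

# Steps 4-6 (sketched): here T is an affine map fit by least squares,
# a stand-in for a neural network trained with Adam on the same pairs.
z1 = np.hstack([z, np.ones((n, 1))])
W, *_ = np.linalg.lstsq(z1, targets, rcond=None)
transported = z1 @ W

plan_cost = cost[rows, sigma].mean()    # average squared transport cost
```

Since the matching is computed once up front, the training loop itself is an ordinary regression onto the fixed pairs (z_i, x_{σ(i)}).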
AE-OTtrans is significantly different from AE-OT. Whereas AE-OT relies on a network to learn the Kantorovich potential, from which the optimal transport mapping is approximated, AE-OTtrans trains a network to learn the optimal transport mapping directly. Learning the mapping directly removes any need to approximate the potential or concern oneself with the model's first derivatives. The resulting model is more robust to sparse datasets and easier to train.
3.2 Latent Space Generator: OTgen
In OTgen, we train a neural network g to generate points by using optimal transport to give "feedback" to the network. In contrast to the previous algorithm, we calculate optimal transport multiple times, once at every iteration of the training step. Let μ be the latent distribution we want to generate from, and let ν be the prior distribution. First, sample batches x_1, ..., x_n from μ and z_1, ..., z_n from ν. Then calculate the predictions ŷ_i made by the network, ŷ_i = g(z_i).
We then use optimal transport to calculate "feedback" for each ŷ_i. Find the optimal transport mapping on these predictions to get a bijection σ which maps each ŷ_i to some x_{σ(i)}. Intuitively, g(z_i) "should have been" x_{σ(i)}. Finally, update g based on this new optimal transport mapping. Specifically, g should attempt to map z_i to x_{σ(i)}.
Finally, we also add an L_div term weighted by a hyperparameter λ to increase diversity. L_div takes the average distance between two generated latent vectors and compares it to the average distance between two genuine latent vectors. This forces the generated vectors to be, on average, as far apart as the genuine latent vectors. This is especially useful in a high dimensional latent space, while for many lower dimensional latent spaces, λ = 0 works well. Formally, it is calculated as
L_div = | E_{z, z' ~ ν} [ c(g(z), g(z')) ] − E_{x, x' ~ μ} [ c(x, x') ] |.
Algorithm 2: OTgen
Prerequisites: Start by initializing encoder E, decoder D, and generative neural network g. Let c(a, b) = ||a − b||² be the squared cost between a and b, let ν be the prior distribution, and let λ be the weight of the divergence term.
while g is not converged do:
1. Sample y_1, ..., y_n from the training set and z_1, ..., z_n from ν.
2. Encode each y_i to x_i with x_i = E(y_i).
3. Calculate the predictions ŷ_1, ..., ŷ_n made by g, with ŷ_i = g(z_i).
4. Calculate the optimal transport map σ, a bijection from the predictions {ŷ_i} to {x_i}.
5. Calculate loss: L = Σ_i c(ŷ_i, x_{σ(i)}) + λ L_div.
6. Update g by using Adam to minimize the loss.
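A single OTgen iteration can be sketched in the same style. The generator g, the batch, and λ below are synthetic stand-ins, and scipy's assignment solver again replaces the network simplex; in the actual algorithm the loss would be minimized with respect to g's parameters via Adam.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(1)
n, d, lam = 128, 8, 0.1

x = 2.0 * rng.standard_normal((n, d))   # real latent batch (stand-in for E(y_i))
z = rng.standard_normal((n, d))         # prior noise

def g(z):
    return z + 1.0                      # current generator (stand-in network)

y = g(z)                                # step 3: predictions
cost = ((y[:, None, :] - x[None, :, :]) ** 2).sum(-1)
_, sigma = linear_sum_assignment(cost)  # step 4: y[i] "should have been" x[sigma[i]]

def mean_pairwise(v):
    # Average distance between two distinct vectors in the batch.
    diffs = np.sqrt(((v[:, None, :] - v[None, :, :]) ** 2).sum(-1))
    return diffs.sum() / (len(v) * (len(v) - 1))

# Step 5: transport loss plus the weighted diversity term L_div.
transport_loss = ((y - x[sigma]) ** 2).sum(-1).mean()
l_div = abs(mean_pairwise(y) - mean_pairwise(x))
loss = transport_loss + lam * l_div
```

Note that the matching σ is recomputed from scratch at every iteration, which is what makes the feedback adapt as g improves.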
3.3 Notes on Optimal Transport
There are multiple ways to calculate the optimal transport map from one empirical distribution to the other. We employ the Python Optimal Transport library [17], which leverages the network simplex algorithm in order to calculate the exact bijection. We also attempted using the Sinkhorn-Knopp algorithm to solve the entropy-regularized optimal transport problem, but the map it provides is unhelpful due to its non-bijective nature: a neural network has a difficult time approximating it.
3.4 OTtrans vs OTgen
We argue that OTgen is more like a latent space generator than a latent space transporter. In OTtrans, noise vectors are mapped to nearby latent vectors in order to minimize the total distance moved. Such a model has the advantage of requiring optimal transport to be computed only once.
However, minimizing total distance moved is an artificial restriction and actually inhibits model performance. When generating an image from a noise vector z, we do not necessarily want to generate the image whose latent vector is closest to z; we simply want to generate a good image. OTgen is free of this restriction and has more flexibility: the network is given the opportunity to transform the noise vectors before optimal transport is applied. The cost is that optimal transport must be computed at every step.
4 Experiments in Distribution Mapping
We validate our algorithms with multiple experiments. First we show that OTgen and OTtrans both do well in generating points from lower dimensional distributions. We specifically use the two moons and concentric circles datasets as depicted below. Such experiments also give us a way to visualize how each algorithm functions.
In these experiments, the neural network architecture for OTgen and OTtrans is the same. It consists of 4 fully-connected layers of 512 neurons and a final layer of 2 neurons, with Leaky ReLU [11] used between each layer. All networks are trained for a total of 10K steps, the learning rate is set to 0.0003, and the prior noise distribution is a standard normal distribution. Batch size is set to 128. For OTgen, the diversity hyperparameter λ is set to zero; as the dimension is small, we do not need to artificially increase diversity.
We calculate the divergence between the generated distribution and the real distribution by using optimal transport. This gives us a measure of the quality of the generated distribution and therefore the model; the lower the divergence, the closer the generated distribution is to the real distribution, and the better the model is. The divergence is simply the average distance when transporting optimally from the generated distribution to the real distribution. Again, we use the network simplex method to calculate the exact bijective mapping. The average distances are as follows:
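The divergence metric can be computed as follows. We use scipy's assignment solver here as a stand-in for POT's network simplex (for equal-weight samples both give the exact bijection), and a Euclidean ground distance for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def ot_divergence(generated, real):
    # Average distance when transporting optimally from the generated
    # samples to the real samples via the exact bijective matching.
    cost = np.sqrt(((generated[:, None, :] - real[None, :, :]) ** 2).sum(-1))
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].mean()

rng = np.random.default_rng(0)
a = rng.standard_normal((200, 2))
b = a + np.array([3.0, 0.0])     # a translated copy of the same point set

div_same = ot_divergence(a, a.copy())   # identical samples: zero divergence
div_shift = ot_divergence(a, b)         # a pure translation: the shift length
```

A pure translation gives a divergence equal to the shift length, which is a useful sanity check that the matching is exact.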
4.1 Comparison to K-means
For comparison, we also model the distributions with clusters. We apply K-means clustering to our data before approximating each cluster with a normal distribution. Intuitively, the more clusters there are, the more accurately the distribution will be approximated. Thus, if our models have high distribution modeling capabilities, they should be able to compare with an approximation using a high number of clusters.
We find that both OTgen and OTtrans model each distribution better than the approximation with eight clusters and on par with the approximation with sixteen clusters. Hence, our models are highly capable of modeling lower dimensional data, often coming close to the optimal divergence. Examples of points generated by each model can be found in the appendix.
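The K-means baseline can be sketched as follows: a self-contained numpy version of Lloyd's algorithm with per-cluster Gaussian sampling. The toy two-moons data, cluster count, and noise level here are illustrative stand-ins, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(points, k, iters=50):
    # Plain Lloyd's algorithm (stand-in for a library implementation).
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = points[labels == j].mean(0)
    return centers, labels

def sample_mixture(points, k, n):
    # Approximate each cluster with an axis-aligned Gaussian, then sample
    # clusters proportionally to their size.
    _, labels = kmeans(points, k)
    probs = np.bincount(labels, minlength=k) / len(points)
    out = []
    for j in rng.choice(k, n, p=probs):
        members = points[labels == j]
        out.append(members.mean(0) + members.std(0) * rng.standard_normal(points.shape[1]))
    return np.array(out)

# Toy two-moons-like data: two noisy arcs.
t = rng.uniform(0, np.pi, 500)
data = np.concatenate([
    np.stack([np.cos(t), np.sin(t)], 1),
    np.stack([1 - np.cos(t), -np.sin(t) + 0.5], 1),
]) + 0.05 * rng.standard_normal((1000, 2))

samples = sample_mixture(data, k=16, n=256)
```

Increasing k tightens the per-cluster Gaussians around the manifold, which is why a sixteen-cluster baseline is a meaningfully stronger comparison than an eight-cluster one.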
4.2 OTgen’s Training
The two dimensional distributions also let us visualize OTgen’s training and gain insight about its stochastic nature. OTgen’s training process involves generating points and receiving "feedback" on the quality of each point. Yet this "feedback" is calculated based on the rest of the batch, which introduces some randomness. Consider the following two images:
These two images portray the feedback given on consecutive training steps. Blue dots are generated points, green dots are real data, and the red lines show a one-to-one mapping between the green and blue dots which minimizes the overall distance travelled. Note that in this case, distance is the L1 distance rather than the traditional Euclidean distance. The first image seems to indicate that the OTgen model is generating too many points in the bottom moon and that some should be mapped to the top moon. The second image tells the opposite story, that the model is generating too many points in the top moon and that some should be mapped to the bottom moon. The model is not changing drastically between two consecutive training steps, so some of the feedback is wrong. However, on average the feedback provides useful information, so with enough training steps, the model converges.
5 Experiments in Image Generation
Next, we show that OTgen and OTtrans do well in generating points from a higher dimensional distribution, namely the latent space of an autoencoder trained on the MNIST handwritten digits dataset [15], the Fashion MNIST clothing dataset [14], or the CelebA faces dataset [16]. The MNIST dataset of 28x28 black and white handwritten digits is widely regarded as the baseline dataset for many computer vision tasks, including image generation. The images are simple to generate and are without intricate patterns or gradients. Fashion MNIST is more difficult to generate than MNIST, as the clothing items have different shades of gray and many difficult details, including shirt designs, stripes, frills, and gradients. Yet the Fashion MNIST dataset is still black and white and relatively small at 28x28. The CelebA faces dataset is the most complicated of these three, with larger, colored images and faces showing different complex expressions. In our case we use the cropped CelebA images resized to 64x64x3. We compare against AE-OT, WAE-MMD, and VAE.
5.1 MNIST and Fashion MNIST
For MNIST and Fashion MNIST, we did not use convolutional autoencoders, instead using only fully connected layers. Each autoencoder's latent space dimension was set to eight. The OTgen mapping network consists of seven layers, and its prior is set to a standard normal distribution. The λ values in both WAE-MMD and AE-OT are set to 0.1, as suggested in each respective paper. Batch size is set to 128. The diversity λ for OTgen was set to 0. Below, the inception scores on the MNIST and Fashion MNIST images are shown (higher is better).
In both MNIST and Fashion MNIST datasets, AE-OTgen comes the closest to the optimal inception score, with AE-OTtrans in second place. Both AE-OTgen and AE-OTtrans do substantially better than both the non-adversarial generators and adversarial generators such as GAN and WGAN. This demonstrates the capabilities of OTgen and OTtrans on natural latent distributions with low dimensions. In particular, our usage of optimal transport ensures diversity in that the model generates similar amounts of each class. Example images are shown in the appendix.
5.2 CelebA
For CelebA, we compare our models with the different non-adversarial models. The autoencoders in AE-OTgen, AE-OTtrans, AE-OT, VAE, and WAE all have the same architecture. For AE-OTgen and AE-OTtrans, the mapping network is the same as for MNIST. Batch size is set to 4096 to help increase the diversity of images; with a batch size of 4096, we ensure that our sampling from the latent distribution consistently matches the true latent distribution. The λ values for AE-OT and WAE are set to 0.1, and the diversity hyperparameter λ for AE-OTgen is set to 1. The Frechet Inception Distances (FID) [13] on the CelebA dataset are shown below (lower is better).
From the FID scores, we see that AE-OTgen and AE-OTtrans again outperform WAE-MMD, VAE, and AE-OT. This shows their efficacy in modeling higher dimensional data and that they are state of the art in the field of non-adversarial generative modeling. As noted before, the images generated by VAE are very blurry. In contrast, though the images generated by WAE-MMD are very sharp, they often lack proper facial structure. AE-OT is unable to handle the sparsity of the CelebA autoencoder's latent distribution and generates poor images. Example images for WAE-MMD, VAE, AE-OT, AE-OTtrans, and AE-OTgen are found in the appendix.
5.3 AE-OTtrans vs AE-OTgen
Here, we compare the images generated by AE-OTtrans to those generated by AE-OTgen. Both sets of images are well structured, without major deformities. AE-OTtrans images are more diverse than AE-OTgen's, with more varied face archetypes and backgrounds, yet the faces generated by AE-OTgen are sharper, albeit with less diversity. A similar pattern is seen in each model's interpolations. Interpolation serves to check that a model does not simply memorize the datapoints but instead can generate from the whole distribution in a smooth fashion. The two models' interpolations are shown below:
AE-OTgen and AE-OTtrans differ significantly with respect to interpolation. AE-OTtrans interpolation is smoother, without many radical shifts in the image, yet this comes at the cost of image quality: the images are a bit more blurry and unrealistic. On the contrary, AE-OTgen interpolation is not as smooth, with more drastic shifts in the image, but most images are realistic and sharp. This suggests that AE-OTtrans interpolation is more natural and smooth, whereas AE-OTgen's interpolation jumps to ensure that each transition image is realistic.
6 Conclusion and Further Works
In conclusion, we have proposed two models, OTtrans and OTgen, which perform latent distribution mapping. These two models extend to AE-OTtrans and AE-OTgen, which perform high dimensional data generation without adversarial training. We have shown that OTtrans and OTgen are reasonable models when applied to two dimensional datasets, often outperforming the conventional modeling of distributions using clusters. Similarly, AE-OTtrans and AE-OTgen also do well, significantly outperforming VAE, WAE-MMD, and AE-OT on MNIST and Fashion MNIST and slightly outperforming the competing models on the CelebA dataset. Furthermore, AE-OTtrans and AE-OTgen outperform GANs on the MNIST and Fashion MNIST datasets. Combined with the simpler training procedure of our new non-adversarial algorithms, this provides a compelling case for using AE-OTtrans or AE-OTgen over adversarial training for lower dimensional data generation. Future work will include an expanded theoretical analysis of AE-OTtrans and AE-OTgen as well as further improvements to both models.
Acknowledgments
The authors of this paper would like to acknowledge Zach Gaslowitz of Proof School for the many fruitful discussions along the way. We would also like to thank Dr. Mei Han from Ping An Technology for supporting and facilitating this project. Finally, we would like to thank all the students at Proof School who helped in proofreading the paper.
References
D. P. Kingma and M. Welling. Auto-encoding variational bayes. In ICLR, 2014.
I. Tolstikhin, O. Bousquet, S. Gelly, and B. Schoelkopf. Wasserstein auto-encoders. In ICLR, 2018.
H. Liu, Y. Guo, N. Lei, Z. Shu, S.-T. Yau, D. Samaras, and X. Gu. Latent space optimal transport for generative models. arXiv preprint arXiv:1809.05964, 2018.
N. Lei, K. Su, L. Cui, S.-T. Yau, and D. X. Gu. A geometric view of optimal transportation and generative model. arXiv preprint arXiv:1710.05488, 2017.
I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672-2680, 2014.
M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
D. Berthelot, T. Schumm, and L. Metz. BEGAN: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717, 2017.
N. Kodali, J. Abernethy, J. Hays, and Z. Kira. On convergence and stability of GANs. arXiv preprint arXiv:1705.07215, 2017.
L. Mescheder, A. Geiger, and S. Nowozin. Which training methods for GANs do actually converge? In International Conference on Machine Learning, pages 3478-3487, 2018.
A. Brock, J. Donahue, and K. Simonyan. Large scale GAN training for high fidelity natural image synthesis. In ICLR, 2019.
B. Xu, N. Wang, T. Chen, and M. Li. Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853, 2015.
S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448-456, 2015.
M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NIPS, pages 6626-6637, 2017.
H. Xiao, K. Rasul, and R. Vollgraf. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, volume 86(11), pages 2278-2324, 1998.
Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In ICCV, 2015.
R. Flamary and N. Courty. Python Optimal Transport. https://github.com/rflamary/POT, 2017.
G. Peyré and M. Cuturi. Computational Optimal Transport. arXiv preprint arXiv:1803.00567, 2019.