1 Introduction
Deep generative modeling is the field of modeling high dimensional data distributions with neural network architectures. It has widespread applications, including text generation, data augmentation, and speech synthesis. The two approaches dominating the field are generative adversarial networks (GANs), developed by Goodfellow et al. [5], and generative autoencoders, the most prominent of which are variational autoencoders (VAEs) by Kingma and Welling [1]. GANs involve a minimax game between a generator and a discriminator, and training a GAN is often difficult: it is prone to exploding or vanishing gradients as well as mode collapse [6, 7, 8]. Variational autoencoders instead only require solving a single minimization problem and thus are easier to train. Latent space density function mapping is at the heart of VAEs, and improvements to this mapping have been explored in two different ways, embodied by WAE by Tolstikhin et al. [2] and AEOT by Liu et al. [3]. WAE adds a regularizer term that forces the latent space of the autoencoder to match a normal distribution. Theoretically, this makes generation of images much easier, as any vector sampled from a normal distribution should be familiar to the decoder. However, artificially regularizing the latent space to match a normal distribution lowers the quality of the generated images. AEOT maintains the shape of the latent space of the autoencoder, yet does not do a good job of mapping noise vectors to the proper latent vectors.
We propose novel generative mapping algorithms OTtrans and OTgen. These algorithms leverage optimal transport to train deep neural networks to generate samples from lower dimensional data distributions. To generate high dimensional data, we first apply an autoencoder to the high dimensional data and then apply OTtrans and OTgen to sample from the latent distribution of the autoencoder. Depending on which algorithm we use, we call this two step procedure either AEOTtrans or AEOTgen.
AEOTtrans and AEOTgen have superior performance compared to GANs on lower complexity datasets, including MNIST and FashionMNIST. This is largely because optimal transport mitigates the mode collapse problem in simpler datasets. Furthermore, AEOTtrans and AEOTgen are simpler to train and are better understood theoretically than GANs. Compared to nonadversarial generative models, AEOTtrans and AEOTgen generate high quality images and interpolations. As opposed to VAE and WAE, AEOTtrans and AEOTgen preserve the latent structure of the autoencoder and do not force the latent distribution to be a Gaussian or other prior. This provides more flexibility for the model, which translates into the generation of higher fidelity data. As opposed to the latent space generator of AEOT, OTtrans and OTgen do not need to train a discriminator and thus are able to deal with sparser datasets. As a result, AEOTtrans and AEOTgen are able to serve as much better generators than AEOT.
2 Related Works
The field of deep generative modeling solves the problem of sampling from high dimensional probability distributions that often lie on a much lower dimensional manifold. For instance, deep models for face generation are able to sample images from spaces of hundreds of thousands of pixels, far too large for traditional sampling techniques. The two most prominent methods in the field are GANs and VAEs, which both leverage the lower dimensional manifold. VAEs do this through an autoencoder, whereas GANs do this through adversarial training. Specifically, GANs pit a generator against a discriminator in a two player zero sum minimax game, in which the generator tries to generate images that fool the discriminator, and the discriminator tries to distinguish between the generated and real images. The discriminator eventually learns which images are on the lower dimensional manifold, whereas the generator learns to generate images which are on the manifold. In practice, GANs are able to model complex datasets, such as the CelebA dataset, producing samples of much higher quality than nonadversarial generative models [10]. Yet achieving this performance is difficult: it requires a variety of different tricks [6, 7, 8, 9] as well as much trial and error to find the correct hyperparameters. We restate that nonadversarial generative modeling is still valuable for its well understood behavior, ease of training, and superior performance on less complex data.
2.1 Base VAE
The VAE is an approach that requires solving only a single minimization problem. It consists of an encoder and a decoder. The encoder takes an image and condenses it into a series of means and standard deviations which parameterize a multidimensional normal distribution in the latent space. A vector is sampled from this distribution and then put through the decoder. The probabilistic nature of the encoder forces the decoder to generalize to most points within the latent space. Then, to generate points, simply sample vectors from a normal distribution of the same dimension as the latent space and feed them through the decoder. However, as noted before [2, 3], the images generated by VAE tend to be blurrier than real images. This is because the VAE's stochastic training algorithm introduces some uncertainty to the autoencoder, which responds by blurring the image to minimize the mean squared loss.
2.2 WAE
WAE is an improvement on the base VAE. It consists of a deterministic autoencoder with an added cost term forcing the latent distribution of the AE to be similar to a normal distribution. It is different from VAE in that VAE encodes a single point to a normal distribution whereas in WAE, the cumulative distribution of the whole batch is penalized to match a normal distribution. This enables WAE to have a much higher reconstruction quality. Furthermore, WAE can calculate the divergence between the latent distribution and the normal distribution in two ways. WAEGAN does this with a GAN in the latent space, whereas WAEMMD uses Maximum Mean Discrepancy (MMD). Because WAEGAN uses adversarial training in the latent space, we compare our model to WAEMMD instead. WAEMMD inevitably performs worse than a vanilla autoencoder at reconstruction of images, as it needs to satisfy the MMD penalty. Furthermore, the difference between image quality in reconstructed images and generated images further suggests that the latent space of WAEMMD doesn’t truly match a normal distribution. Thus, MMD is a suboptimal metric for regularizing the latent distribution.
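To make the MMD penalty concrete, the following is a minimal sketch of a batch estimate of squared MMD between encoded latent vectors and samples from a normal prior. The Gaussian (RBF) kernel, the bandwidth of 1.0, and the function names are assumptions chosen for illustration; they are not taken from the WAE implementation.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth):
    # Pairwise squared Euclidean distances between rows of a and rows of b.
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))

def mmd_squared(latents, prior_samples, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between two sample sets."""
    k_xx = rbf_kernel(latents, latents, bandwidth).mean()
    k_yy = rbf_kernel(prior_samples, prior_samples, bandwidth).mean()
    k_xy = rbf_kernel(latents, prior_samples, bandwidth).mean()
    return k_xx + k_yy - 2.0 * k_xy

rng = np.random.default_rng(0)
prior = rng.standard_normal((256, 8))           # samples from the normal prior
matched = rng.standard_normal((256, 8))         # latent batch that follows the prior
shifted = rng.standard_normal((256, 8)) + 2.0   # latent batch that does not
mmd_matched = mmd_squared(matched, prior)
mmd_shifted = mmd_squared(shifted, prior)
```

In this toy setting the shifted batch receives the larger penalty, which is the signal a regularizer of this kind adds to the reconstruction loss.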
2.3 AEOT
AEOT [3] pretrains an autoencoder, consisting of an encoder and a decoder, on the data. Then, it trains a neural network to distinguish between real points in the latent distribution and noise generated from a prior. It does this by approximating the Kantorovich potential, where a higher Kantorovich potential corresponds to a point more likely to be from the real distribution and not the prior. To generate images, simply sample noise from the normal distribution. The network maps each noise vector to a latent vector, which should have a high Kantorovich potential and thus should be likely to be in the real latent distribution. The final image is then obtained by passing this latent vector through the decoder. This algorithm preserves the latent space of the autoencoder, and thus any reconstructed images are very sharp.
However, AEOT also has some flaws. In practice, training an optimal discriminator is extremely difficult due to the sparse nature of the dataset. Even when we reduce dimensionality to 64, the discriminator is unable to serve as a good generator.
3 Our Generative Autoencoders
In this paper, we follow [3, 4] and propose a two step generative model framework for high dimensional data, such as images. First, reduce the dimensionality of the data by training an encoder and a decoder in an autoencoder framework. Then, train a model which maps from a noise distribution to the latent distribution of the autoencoder. To generate data similar to the training data, simply sample from the noise distribution and pass the samples through the model and then the decoder. The mapped vectors will approximate the latent distribution, and the decoded data will approximate the data distribution.
We propose two possible ways to do latent distribution mapping, named OTtrans and OTgen. As we will see, OTtrans is more similar to transporting from a distribution to another distribution whereas OTgen is more similar to generating points from a distribution. The corresponding image generative models with autoencoders are then named AEOTtrans and AEOTgen respectively.
3.1 Transporter: OTtrans
In OTtrans, we train a neural network to approximate optimal transport. Consider the distribution we want to generate from and a prior distribution, which will often be a noise distribution. First, sample an equal number of vectors from each distribution. Then calculate the optimal transport mapping, a bijection from the sampled prior vectors to the sampled target vectors. This is the only time we use optimal transport in the OTtrans algorithm. Finally, train a neural network which approximates this mapping. Specifically, the network should attempt to map each sampled prior vector to the target vector assigned to it by the bijection.
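Algorithm 1: OTtrans
Prerequisites: Start by initializing the encoder, the decoder, and the transport neural network. Let the cost be the squared cost between a noise vector and a latent vector, and let the prior distribution be given.
1. Sample data points from the training set and the same number of noise vectors from the prior.
2. Encode the data points to latent vectors with the encoder.
3. Calculate the optimal transport map, a bijection from the noise vectors to the latent vectors.
while the transport network is not converged do:
    4. Randomly sample a batch of noise vectors together with their assigned latent vectors.
    5. Calculate loss: the squared cost between the network's output for each noise vector in the batch and the latent vector assigned to that noise vector by the optimal transport map. (1)
    6. Update the transport network by using Adam to minimize loss.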
AEOTtrans is significantly different from AEOT. Whereas AEOT relies on a network to learn the Kantorovich Duals and approximates the optimal transport mapping, AEOTtrans trains a network to learn the optimal transport mapping directly. Learning the mapping directly takes away any need to approximate or concern oneself with the model’s first derivatives. The resulting model is more robust to sparse datasets and easier to train.
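The OTtrans procedure can be sketched on toy data as follows. This is an illustration under stated assumptions, not the authors' implementation: `scipy.optimize.linear_sum_assignment` stands in for the exact solver, an affine map trained by plain minibatch gradient descent stands in for the neural network and Adam, and the Gaussian "latent" samples replace the output of a trained encoder.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def squared_cost(a, b):
    # Matrix of squared Euclidean distances between rows of a and rows of b.
    diff = a[:, None, :] - b[None, :, :]
    return np.sum(diff**2, axis=2)

def ot_targets(noise, latents):
    """Return latents reordered so that row i is the partner of noise[i]."""
    rows, cols = linear_sum_assignment(squared_cost(noise, latents))
    targets = np.empty_like(latents)
    targets[rows] = latents[cols]
    return targets

rng = np.random.default_rng(0)
latents = rng.standard_normal((200, 2)) * 0.5 + np.array([3.0, -1.0])  # stand-in encoded data
noise = rng.standard_normal((200, 2))                                  # stand-in prior samples
targets = ot_targets(noise, latents)   # optimal transport is solved once, up front

# Stand-in "network": an affine map fitted to the fixed noise-to-target pairs.
W, b = np.eye(2), np.zeros(2)

def transport_loss(W, b, z, t):
    return np.mean(np.sum((z @ W + b - t) ** 2, axis=1))

initial_loss = transport_loss(W, b, noise, targets)
for step in range(500):
    idx = rng.choice(len(noise), size=32, replace=False)
    z, t = noise[idx], targets[idx]
    residual = z @ W + b - t
    W -= 0.05 * (2.0 / len(idx)) * z.T @ residual
    b -= 0.05 * 2.0 * residual.mean(axis=0)
final_loss = transport_loss(W, b, noise, targets)
```

Because the pairs are fixed before training starts, the second stage is ordinary regression, which is why no potential function or gradient of the model is involved.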
3.2 Latent Space Generator: OTgen
In OTgen, we train a neural network to generate points by using optimal transport to give "feedback" to the network. In contrast to the previous algorithm, we calculate optimal transport multiple times, at every iteration of the training step. Consider the latent distribution we want to generate from and the prior distribution. First, sample a batch of latent vectors and a batch of noise vectors of the same size, and enumerate each batch. Then calculate the predictions made by the network, one for each noise vector.
We then use optimal transport to calculate "feedback" for each prediction. Find the optimal transport mapping on these predictions to get a bijection which maps each prediction to some latent vector in the batch. Intuitively, the prediction should have been that latent vector. Finally, update the network based on this new optimal transport mapping. Specifically, the network should attempt to map each noise vector to the latent vector matched with its prediction.
Finally, we also add a diversity term, weighted by a hyperparameter, to increase diversity. This term takes the average distance between two generated latent vectors and compares it to the average distance between two genuine latent vectors. This forces the generated vectors to be, on average, as far apart as the genuine latent vectors. This is especially useful in a high dimensional latent space; for many lower dimensional latent spaces, setting the weight to zero works well. Formally, it is calculated:
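As a rough illustration of a term of this kind, the sketch below compares the two average pairwise distances by their absolute difference. Both the absolute difference and the use of Euclidean distance are assumptions made for this example; the function names are hypothetical.

```python
import numpy as np

def mean_pairwise_distance(points):
    # Average Euclidean distance over all unordered pairs of distinct rows.
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt(np.sum(diff**2, axis=2))
    return dist[np.triu_indices(len(points), k=1)].mean()

def diversity_penalty(generated, genuine):
    # Assumed form: absolute gap between the two average pairwise distances.
    return abs(mean_pairwise_distance(generated) - mean_pairwise_distance(genuine))

rng = np.random.default_rng(0)
genuine = rng.standard_normal((64, 8))
collapsed = 0.1 * rng.standard_normal((64, 8))   # generated points bunched together
spread = rng.standard_normal((64, 8))            # generated points as spread out as the data
penalty_collapsed = diversity_penalty(collapsed, genuine)
penalty_spread = diversity_penalty(spread, genuine)
```

A batch of generated vectors that has collapsed toward a single point is penalized far more than one whose spread matches the genuine latent vectors.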
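Algorithm 2: OTgen
Prerequisites: Start by initializing the encoder, the decoder, and the generative neural network. Let the cost be the squared cost between a prediction and a latent vector, let the prior distribution be given, and let the weight of the diversity term be chosen.
while the generative network is not converged do:
    1. Sample a batch of data points from the training set and the same number of noise vectors from the prior.
    2. Encode the data points to latent vectors with the encoder.
    3. Calculate the predictions made by the generative network, one for each noise vector.
    4. Calculate the optimal transport map, a bijection from the predictions to the latent vectors.
    5. Calculate loss: the squared cost between each prediction and the latent vector assigned to it by the optimal transport map, plus the diversity term multiplied by its weight. (2)
    6. Update the generative network by using Adam to minimize loss.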
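A toy version of the OTgen loop is sketched below. The assumptions are the same kind as before and are for illustration only: `scipy.optimize.linear_sum_assignment` replaces the exact solver, an affine map with plain gradient steps replaces the generative network and Adam, Gaussian samples replace encoded data, and the diversity weight is taken to be zero.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_to_latents(predictions, latents):
    """Reorder latents so that row i is the optimal transport partner of predictions[i]."""
    diff = predictions[:, None, :] - latents[None, :, :]
    rows, cols = linear_sum_assignment(np.sum(diff**2, axis=2))
    targets = np.empty_like(latents)
    targets[rows] = latents[cols]
    return targets

rng = np.random.default_rng(0)
data_latents = rng.standard_normal((2000, 2)) * 0.5 + np.array([3.0, -1.0])  # stand-in encoded data
W, b = np.eye(2), np.zeros(2)   # stand-in generator: an affine map of the noise

def generate(noise):
    return noise @ W + b

for step in range(300):
    latents = data_latents[rng.choice(len(data_latents), size=64, replace=False)]
    noise = rng.standard_normal((64, 2))
    predictions = generate(noise)
    targets = match_to_latents(predictions, latents)   # optimal transport is re-solved every step
    residual = predictions - targets                   # targets are treated as constants
    W -= 0.05 * (2.0 / 64) * noise.T @ residual
    b -= 0.05 * 2.0 * residual.mean(axis=0)

samples = generate(rng.standard_normal((2000, 2)))
```

The matching is recomputed on the network's current outputs, so the targets change from step to step; this is the "feedback" described above, in contrast to the fixed pairs of OTtrans.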
3.3 Notes on Optimal Transport
There are multiple ways to calculate the optimal transport map between two sets of sampled vectors. We employ the Python Optimal Transport library [17], which uses the network simplex algorithm [18] to calculate the exact bijection. We also attempted to use the Sinkhorn-Knopp algorithm to solve the entropic-regularized optimal transport problem, but the map it provides is unhelpful because it is not a bijection, and a neural network has a difficult time approximating it.
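The non-bijective nature of the entropic solution can be seen in a small self-contained sketch. The Sinkhorn iteration is written out directly in NumPy rather than calling a library; the cost rescaling to [0, 1] and the regularization strength of 1.0 are arbitrary choices for this example, and weaker regularization gives sharper but still fractional plans.

```python
import numpy as np

def sinkhorn_plan(cost, reg, n_iter=500):
    """Entropic-regularized transport plan between two uniform point clouds."""
    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    kernel = np.exp(-cost / reg)
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iter):
        u = a / (kernel @ v)
        v = b / (kernel.T @ u)
    return u[:, None] * kernel * v[None, :]

rng = np.random.default_rng(0)
noise = rng.standard_normal((6, 2))
latents = rng.standard_normal((6, 2))
cost = np.sum((noise[:, None, :] - latents[None, :, :]) ** 2, axis=2)
cost = cost / cost.max()                 # rescale so the regularization is comparable to the costs
plan = sinkhorn_plan(cost, reg=1.0)

# Largest share of each noise vector's mass that goes to a single latent vector.
# A bijection would put all of the mass (a share of 1.0) on one target.
largest_share = (plan / plan.sum(axis=1, keepdims=True)).max(axis=1)
```

Each noise vector has its mass split across several latent vectors, so there is no single regression target per noise vector unless an extra rounding step is added.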
3.4 OTtrans vs OTgen
We argue that OTgen is more like a latent space generator than a latent space transporter. In OTtrans, noise vectors are mapped to nearby latent vectors in order to minimize the total distance moved. Such a model has the advantage of requiring optimal transport to be computed only once.
However, minimizing total distance moved is an artificial restriction and actually inhibits model performance. When generating an image from a given noise vector, we don't necessarily want to generate the image whose latent space vector is closest to that noise vector; we simply want to generate a good image. OTgen is without this restriction and has more flexibility: the network is given the opportunity to transform any noise vectors before optimal transport is applied. Yet this requires computing optimal transport at every step.
4 Experiments in Distribution Mapping
We validate our algorithms with multiple experiments. First we show that OTgen and OTtrans both do well in generating points from lower dimensional distributions. We specifically use the two moons and concentric circles datasets as depicted below. Such experiments also give us a way to visualize how each algorithm functions.
In these experiments, the neural network architecture for OTgen and OTtrans is the same. It consists of 4 fully connected layers of 512 neurons and a final layer of 2 neurons. Leaky ReLU [11] is used in between each layer. All networks are trained for a total of 10K steps with a learning rate of 0.0003 and a batch size of 128, with noise drawn from a fixed prior distribution. For OTgen, the diversity hyperparameter is set to zero; the dimension is small enough that we don't need to artificially increase diversity.
We calculate the divergence between the generated distribution and the real distribution by using optimal transport. This gives us a measure of the quality of the generated distribution and therefore the model; the lower the divergence, the closer the generated distribution is to the real distribution, and the better the model is. The divergence is simply the average distance when transporting optimally from the generated distribution to the real distribution. Again, we use the network simplex method to calculate the exact bijective mapping. The average distances are as follows:
Method     Moons    Circles
---------------------------
OTgen      0.090    0.092
OTtrans    0.086    0.075
Data       0.070    0.071
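The divergence reported above can be sketched as follows. This is an illustration under assumptions: `scipy.optimize.linear_sum_assignment` stands in for the network simplex solver, the matching is optimized for Euclidean distance, and the Gaussian samples are placeholders for the two moons and concentric circles data.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def ot_divergence(generated, real):
    """Average Euclidean distance under the optimal one-to-one matching."""
    diff = generated[:, None, :] - real[None, :, :]
    dist = np.sqrt(np.sum(diff**2, axis=2))
    rows, cols = linear_sum_assignment(dist)
    return dist[rows, cols].mean()

rng = np.random.default_rng(0)
real = rng.standard_normal((300, 2))
fresh = rng.standard_normal((300, 2))           # independent draw from the same distribution
shifted = rng.standard_normal((300, 2)) + 1.0   # draw from a shifted distribution
baseline = ot_divergence(fresh, real)
degraded = ot_divergence(shifted, real)
```

Even two independent samples of the same distribution have a nonzero divergence because the sets are finite, so `baseline` plays a role comparable to a reference row computed from held-out data.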
4.1 Comparison to K-means
For comparison, we also model the distributions with clusters. We apply K-means clustering to our data and then approximate each cluster with a normal distribution. Intuitively, the more clusters there are, the more accurately the distribution will be approximated. Thus, if our models have high distribution modeling capabilities, they should be comparable to an approximation with a high number of clusters.
Method            Moons    Circles
----------------------------------
OTgen             0.090    0.092
OTtrans           0.086    0.075
Cluster (k=8)     0.117    0.123
Cluster (k=16)    0.084    0.090
Data              0.070    0.071
We find that both OTgen and OTtrans model each distribution better than the approximation with eight clusters and on par with the approximation with sixteen clusters. Hence, our models are capable of modeling lower dimensional data, often coming close to the divergence obtained by the data itself. Examples of points generated by each model can be found in the appendix.
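The cluster baseline can be sketched as K-means followed by one normal distribution per cluster. The details below are assumptions for illustration: plain Lloyd iterations with random initialization, a full covariance matrix per cluster, mixture weights proportional to cluster size, and a noisy ring as stand-in data.

```python
import numpy as np

def nearest_center(points, centers):
    dist = np.sum((points[:, None, :] - centers[None, :, :]) ** 2, axis=2)
    return dist.argmin(axis=1)

def kmeans(points, k, rng, n_iter=50):
    """Plain Lloyd iterations; returns the centers and each point's cluster label."""
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        labels = nearest_center(points, centers)
        for j in range(k):
            if np.any(labels == j):      # keep the old center if a cluster empties
                centers[j] = points[labels == j].mean(axis=0)
    return centers, nearest_center(points, centers)

def sample_cluster_model(points, k, n_samples, rng):
    """Fit one Gaussian per K-means cluster and sample from the resulting mixture."""
    _, labels = kmeans(points, k, rng)
    counts = np.bincount(labels, minlength=k)
    chosen = rng.choice(k, size=n_samples, p=counts / counts.sum())
    dim = points.shape[1]
    samples = np.empty((n_samples, dim))
    for j in range(k):
        mask = chosen == j
        if not mask.any():
            continue
        members = points[labels == j]
        cov = np.cov(members, rowvar=False) if len(members) > 1 else np.zeros((dim, dim))
        samples[mask] = rng.multivariate_normal(members.mean(axis=0), cov, size=mask.sum())
    return samples

rng = np.random.default_rng(0)
angles = rng.uniform(0.0, 2.0 * np.pi, size=1000)
ring = np.stack([np.cos(angles), np.sin(angles)], axis=1) + 0.05 * rng.standard_normal((1000, 2))
samples = sample_cluster_model(ring, k=16, n_samples=1000, rng=rng)
```

With more clusters, each Gaussian covers a shorter arc of the ring, which is the intuition behind the improvement from eight to sixteen clusters.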
4.2 OTgen’s Training
The two dimensional distributions also let us visualize OTgen’s training and gain insight about its stochastic nature. OTgen’s training process involves generating points and receiving "feedback" on the quality of each point. Yet this "feedback" is calculated based on the rest of the batch, which introduces some randomness. Consider the following two images:
These two images portray the feedback given on consecutive training steps. Blue dots are generated points, green dots are real data, and the red lines show a one-to-one mapping between the green and blue dots which minimizes overall distance travelled. Note that in this case, distance is L1 distance rather than traditional Euclidean distance. The first image seems to indicate that the OTgen model is generating too many points in the bottom moon and that some should be mapped to the top moon. The second image tells the opposite story: the model is generating too many points in the top moon and some should be mapped to the bottom moon. The model isn't changing drastically between two consecutive training steps, so some of the feedback is wrong. However, on average the feedback provides useful information, so with enough training steps, the model converges.
5 Experiments in Image Generation
Next, we show that OTgen and OTtrans do well in generating points from a higher dimensional distribution, namely the latent space of an autoencoder trained on the MNIST handwritten digits dataset [14], the Fashion MNIST clothing dataset [15], or the CelebA faces dataset [16]. The MNIST dataset of 28x28 black and white handwritten digits is widely regarded as the baseline dataset for many computer vision tasks, including image generation. The images are simple to generate and are without intricate patterns or gradients. Fashion MNIST is more difficult to generate than MNIST, as the clothing items have different shades of gray and many difficult details, including shirt designs, stripes, frills, and gradients. Yet the Fashion MNIST images are still black and white and relatively small at 28x28. The CelebA faces dataset is the most complicated of the three, with larger, colored images and faces showing different complex expressions. In our case, we use the cropped CelebA images resized to 64x64x3. We compare against AEOT, WAEMMD, and VAE.
5.1 MNIST and Fashion MNIST
For MNIST and Fashion MNIST, we did not use convolutional autoencoders but rather chose to only use fully connected layers. Each autoencoder's latent space dimension was set to eight. The OTgen mapping network consists of seven layers and takes its input from a fixed prior. The lambdas in both WAEMMD and AEOT are set to 0.1, as suggested in the respective papers. Batch size is set to 128. The diversity lambda for OTgen was set to 0. The inception scores of the MNIST and Fashion MNIST images are shown below (higher is better).
Method         MNIST    Fashion
-------------------------------
True Images    9.86     9.07
AEOTgen        9.52     7.90
AEOTtrans      9.19     7.45
AEOT           6.89     5.81
WAEMMD         7.46     5.97
VAE            6.03     5.39
GAN            6.43     6.65
WGAN           6.90     5.96
In both MNIST and Fashion MNIST datasets, AEOTgen comes the closest to the optimal inception score, with AEOTtrans in second place. Both AEOTgen and AEOTtrans do substantially better than both the nonadversarial generators and adversarial generators such as GAN and WGAN. This demonstrates the capabilities of OTgen and OTtrans on natural latent distributions with low dimensions. In particular, our usage of optimal transport ensures diversity in that the model generates similar amounts of each class. Example images are shown in the appendix.
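For reference, the inception score is the exponential of the average KL divergence between a classifier's per-image class distribution and its marginal class distribution over all images. The sketch below computes it from a matrix of class probabilities; `class_probs` is a hypothetical input standing in for the outputs of whatever pretrained classifier is used, which is not specified here.

```python
import numpy as np

def inception_score(class_probs, eps=1e-12):
    """exp of the mean KL divergence between p(y|x) and the marginal p(y)."""
    marginal = class_probs.mean(axis=0, keepdims=True)
    kl = np.sum(class_probs * (np.log(class_probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))

confident_and_balanced = np.eye(10)[np.arange(1000) % 10]   # one-hot, every class equally often
uninformative = np.full((1000, 10), 0.1)                    # classifier cannot tell classes apart
score_best = inception_score(confident_and_balanced)
score_worst = inception_score(uninformative)
```

With ten classes the score ranges from 1 to 10, and it reaches the top of that range only when every image is classified confidently and all classes appear equally often. This is why a generator that produces similar amounts of each class is rewarded by the metric.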
5.2 CelebA
For CelebA, we compare our models with the different nonadversarial models. The autoencoders in AEOTgen, AEOTtrans, AEOT, VAE, and WAE all have the same architecture as in [2]. For AEOTgen and AEOTtrans, the mapping network is the same as for MNIST. Finally, batch size is set to 4096 to help increase the diversity of images. With a batch size of 4096, we ensure that our sampling from the latent distribution consistently matches the true latent distribution. The lambdas of AEOT and WAE are set to 0.1, and AEOT's diversity hyperparameter is set to 1. The Frechet Inception Distances on the CelebA dataset are shown below (lower is better).
Model        FID
-------------------
AEOTgen      58.07
AEOTtrans    58.79
AEOT         106.96
WAEMMD       64.71
VAE          59.85
From the FID scores, we see that AEOTgen and AEOTtrans again outperform WAEMMD, VAE, and AEOT. This shows their efficacy in modeling higher dimensional data and that they are state of the art in the field of nonadversarial generative modeling. As noted before, the images generated by VAE are very blurry. In contrast, though the images generated by WAEMMD are very sharp, they often lack facial structure. AEOT is unable to handle the sparsity of the CelebA autoencoder's latent distribution and generates poor images. Example images pertaining to WAEMMD, VAE, AEOT, AEOTtrans, and AEOTgen are found in the appendix.
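The Frechet distance underlying FID compares the mean and covariance of two sets of feature vectors. The sketch below implements the standard formula in NumPy; the random arrays are placeholders for Inception features of real and generated images, which would come from a pretrained network that is not included here.

```python
import numpy as np

def frechet_distance(features_a, features_b):
    """Frechet distance between Gaussians fitted to two sets of feature vectors."""
    mu_a, mu_b = features_a.mean(axis=0), features_b.mean(axis=0)
    cov_a = np.cov(features_a, rowvar=False)
    cov_b = np.cov(features_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) equals the sum of square roots of the eigenvalues of cov_a @ cov_b.
    eigenvalues = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sum(np.sqrt(np.clip(eigenvalues.real, 0.0, None)))
    return float(np.sum((mu_a - mu_b) ** 2) + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)

rng = np.random.default_rng(0)
real_features = rng.standard_normal((500, 4))
similar_features = rng.standard_normal((500, 4))          # same distribution as the real features
shifted_features = rng.standard_normal((500, 4)) + 1.0    # mean shifted by 1 in every dimension
distance_similar = frechet_distance(similar_features, real_features)
distance_shifted = frechet_distance(shifted_features, real_features)
```

The distance grows with both a shift in the feature mean and a mismatch in feature covariance, so it penalizes blurry samples and samples lacking diversity alike.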
5.3 AEOTtrans vs AEOTgen
Here, we compare the images generated by AEOTtrans to the images generated by AEOTgen. Both sets of images are well structured, without major deformities. AEOTtrans images are more diverse than AEOTgen images, with more varied face archetypes and backgrounds. Yet the faces generated by AEOTgen are sharper, albeit with less diversity. A similar pattern is seen when viewing the different models' interpolations. Interpolation is used to check that each model doesn't simply memorize the different datapoints but instead can generate the whole distribution in a smooth fashion. The two models' interpolations are shown below:
AEOTgen and AEOTtrans differ significantly with respect to interpolation. AEOTtrans interpolation is smoother, without many radical shifts in the image. Yet this comes at the cost of image quality; the images are a bit more blurry and unrealistic. In contrast, AEOTgen interpolation is not as smooth, with more drastic shifts in the image, but most images are realistic and sharp. This suggests that AEOTtrans interpolation is more natural and smooth, whereas in AEOTgen, the interpolation is jumpy to ensure each transition image is realistic.
6 Conclusion and Further Works
In conclusion, we have proposed two models, OTtrans and OTgen, which perform latent distribution mapping. These two models can be extended to AEOTtrans and AEOTgen, which perform high dimensional data generation without adversarial training. We've shown that OTtrans and OTgen are reasonable models when applied to two dimensional datasets, often outperforming the conventional modeling of distributions using clusters. Similarly, AEOTtrans and AEOTgen also do well, significantly outperforming VAE, WAEMMD, and AEOT on the MNIST and FashionMNIST datasets and slightly outperforming the competing models on the CelebA dataset. Furthermore, AEOTtrans and AEOTgen outperform GANs on the MNIST and FashionMNIST datasets. Combined with the simpler training procedure of our new nonadversarial algorithms, this provides a compelling case to use AEOTtrans or AEOTgen for lower dimensional data generation over adversarial training. Future work will include an expanded theoretical analysis of AEOTtrans and AEOTgen as well as further improvements on the AEOTtrans and AEOTgen models.
Acknowledgements
The authors of this paper would like to acknowledge Zach Gaslowitz of Proof School for the many fruitful discussions along the way. We would also like to thank Dr. Mei Han from Ping An Technology for supporting and facilitating this project. Finally, we would like to thank all the students at Proof School who helped in proofreading the paper.
References


D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In ICLR, 2014.

I. Tolstikhin, O. Bousquet, S. Gelly, and B. Schoelkopf. Wasserstein auto-encoders. In ICLR, 2018.

H. Liu, Y. Guo, N. Lei, Z. Shu, S. T. Yau, D. Samaras, and X. Gu. Latent space optimal transport for generative models. arXiv preprint arXiv:1809.05964, 2018.

N. Lei, K. Su, L. Cui, S.T. Yau, and D. X. Gu. A geometric view of optimal transportation and generative model. arXiv preprint arXiv:1710.05488, 2017.

I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672-2680, 2014.

M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.

D. Berthelot, T. Schumm, and L. Metz. BEGAN: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717, 2017.

N. Kodali, J. Abernethy, J. Hays, and Z. Kira. On convergence and stability of GANs. arXiv preprint arXiv:1705.07215, 2017.

L. Mescheder, A. Geiger, and S. Nowozin. Which training methods for GANs do actually converge? In International Conference on Machine Learning, pages 3478-3487, 2018.
A. Brock, J. Donahue, and K. Simonyan. Large scale GAN training for high fidelity natural image synthesis. In ICLR, 2019.

B. Xu, N. Wang, T. Chen, and M. Li. Empirical evaluation of rectified activations in convolutional network. In CoRR, abs/1505.00853, 2015.

S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448-456, 2015.

M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NIPS, pages 6626-6637, 2017.

H. Xiao, K. Rasul, and R. Vollgraf. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, volume 86(11), pages 2278-2324, 1998.

R. Flamary and N. Courty. Python Optimal Transport. https://github.com/rflamary/POT, 2017.

G. Peyre and M. Cuturi. Computational Optimal Transport. arXiv:1803.00567, 2019.