Generative models that include an encoder-decoder architecture have several appealing properties. For example, they tend to be more stable to train Tolstikhin et al. (2018) and can potentially be used for classification in a semi-supervised fashion Makhzani et al. (2016). However, a drawback of recent generative models with an encoder-decoder architecture is the requirement to train two separate models, including the need to ensure the encoder and the decoder are reciprocal. While autoencoders with tied weights can at least overcome the problem of training two separate models, recent approaches applying autoencoders to realistic image datasets such as CelebA Liu et al. (2015) use separate decoder and encoder models Donahue et al. (2017); Dumoulin et al. (2016); Tolstikhin et al. (2018).
An interesting alternative to encoder-decoder architectures could be models that are invertible by design. Recently, invertible-by-design neural networks called reversible neural networks were proposed. Initially, they were used as generative models Dinh et al. (2015, 2017), later as classification models with smaller memory requirements Gomez et al. (2017), and finally to study theoretical assumptions about learning and generalization in deep neural networks Jacobsen et al. (2018). For example, their good classification performance showed that loss of information about the input in later representations of a neural network is not a necessary precondition for good generalization.
In their application as generative models, reversible networks have been trained in two ways. In earlier works, they were trained using the so-called change-of-variables formula to directly optimize the likelihood of the data under the reversible network model Dinh et al. (2015, 2017). Later, they were trained using an adversarial approach on the generated samples Danihelka et al. (2017); Grover et al. (2018), as in generative adversarial networks Goodfellow et al. (2014). In this study, we instead investigate their performance when using an adversary in the latent space, in an adversarial autoencoder framework Makhzani et al. (2016). In general, the RevNet's built-in bijectivity could either be an advantage or a disadvantage when optimizing in this framework. For example, the bijectivity prevents one from hand-designing the value range of the generated samples, as is sometimes done by using a sigmoid nonlinearity as the final operation on the decoder output.
We indeed find it is possible to use RevNets as generative models in the adversarial autoencoder framework, producing samples of comparable quality to variational autoencoders (VAEs) Kingma & Welling (2014) on the CelebA dataset Liu et al. (2015). Furthermore, in an attempt to exploit the direct correspondence between encodings and inputs in a RevNet, we make a proof of concept for an adversary-free training without a prespecified number of latent dimensions on the MNIST dataset.
2.1 Reversible Networks
Reversible networks (RevNets) are neural networks that are invertible by design Dinh et al. (2015); Gomez et al. (2017); Jacobsen et al. (2018) through the use of invertible blocks. The basic invertible block is defined for an input x, split into disjoint parts x_1 and x_2, and two functions F and G whose output size equals their input size, as follows (also see Figure 1):

y_1 = x_1 + F(x_2)
y_2 = x_2 + G(y_1)
The inputs x_1 and x_2 can be recovered from the outputs y_1 and y_2 as follows:

x_2 = y_2 - G(y_1)
x_1 = y_1 - F(x_2)
F and G will typically be sequences of convolutional or other neural network layers. The splitting of the input x into the disjoint parts x_1 and x_2 is often implemented along the channel dimension of the network.
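As a minimal illustration, the forward and inverse computations of one additive invertible block can be written in a few lines of NumPy; here F and G are stand-in callables for the convolutional subnetworks:

```python
import numpy as np

def invertible_block_forward(x1, x2, F, G):
    # additive coupling: each half is updated using the other half
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def invertible_block_inverse(y1, y2, F, G):
    # invert in reverse order: first recover x2, then x1
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2
```

Because only additions are inverted, the inversion is exact regardless of what F and G compute, which is what makes the reconstruction loss zero by design.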
One important addition to the invertible network architecture is invertible subsampling blocks, which were introduced to make RevNets end-to-end invertible Dinh et al. (2015); Jacobsen et al. (2018). Invertible subsampling is possible by shifting spatial dimensions into the channel dimension: for a 2x2 subsampling, four translated spatial checkerboard patterns of the input are moved into four different channels, as seen in Figure 2. Our subsampling operation is a slightly modified version of the operation proposed in earlier work Dinh et al. (2015); Jacobsen et al. (2018); it ensures that the final x_1 and x_2 correspond to checkerboard patterns covering the entire input image, as indicated in Figure 2. This was motivated by our observation that, early in training, the values of the RevNet encodings are still strongly influenced by the values at the input positions they correspond to, as shown in Figure 3. Therefore, having both F and G see inputs that cover the entire image might make it easier for them to correctly predict what will be added to their output, easing the generative training.
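The 2x2 invertible subsampling can be sketched as a space-to-depth rearrangement. The snippet below shows a generic NumPy version (not the exact modified variant described above) together with its inverse:

```python
import numpy as np

def checkerboard_subsample(x):
    # x: (C, H, W) -> (4C, H/2, W/2); the four 2x2-shifted subgrids
    # of the input become four channel groups
    c, h, w = x.shape
    parts = np.stack([x[:, 0::2, 0::2], x[:, 0::2, 1::2],
                      x[:, 1::2, 0::2], x[:, 1::2, 1::2]], axis=0)
    return parts.reshape(4 * c, h // 2, w // 2)

def checkerboard_upsample(y):
    # exact inverse: scatter the four channel groups back to their subgrids
    c4, h, w = y.shape
    c = c4 // 4
    parts = y.reshape(4, c, h, w)
    x = np.empty((c, 2 * h, 2 * w), dtype=y.dtype)
    x[:, 0::2, 0::2] = parts[0]
    x[:, 0::2, 1::2] = parts[1]
    x[:, 1::2, 0::2] = parts[2]
    x[:, 1::2, 1::2] = parts[3]
    return x
```

Since the operation only rearranges values, it is exactly invertible and preserves the total dimensionality.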
2.2 Adversarial Autoencoders
In the adversarial autoencoder framework, the encoder and decoder are trained together to minimize a reconstruction loss on the decoded inputs and an adversarial loss on the encodings. For the reconstruction loss, the encoder and decoder are optimized to minimize the reconstruction error E_{x~p_X}[l(x, g(f(x)))], where x is an input, p_X the input distribution, g the decoder (generator), f the encoder, and l a reconstruction loss such as the L1 or L2 loss.
Since RevNets are invertible by construction, we propose to use a single RevNet to instantiate both the encoder and the decoder, leading to a reconstruction loss of zero by design, regardless of the weights of the RevNet. In practice, we aim for a lower-dimensional latent space that still yields good reconstructions. To obtain this, in the reconstruction training phase we clip the encodings produced by the RevNet according to a prior distribution that sets most encoding dimensions to zero. We then invert the clipped encodings through the RevNet and optimize the L1 reconstruction loss between the original inputs and the inverted clipped encodings; the L1 loss has been reported to work better than L2 in natural image settings Isola et al. (2017); Ulyanov et al. (2017).
We also penalize the L2-distance between the encodings and the clipped encodings, as we found this to greatly stabilize this training phase. (In practice, since we wanted to keep the option of using a uniform distribution in ±2 as the latent sampling distribution, we also clipped the nonzero dimensions to ±2 for both the L1 and L2 loss, but we do not expect this to strongly influence the results.)
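A minimal NumPy sketch of the clipping and the two resulting penalties; `decode` here is a hypothetical callable standing in for the RevNet inverse:

```python
import numpy as np

def clip_to_prior(z, keep_idx, clip_val=2.0):
    # zero out all encoding dims except keep_idx; clip kept dims to +-clip_val
    z_clipped = np.zeros_like(z)
    z_clipped[:, keep_idx] = np.clip(z[:, keep_idx], -clip_val, clip_val)
    return z_clipped

def reconstruction_losses(x, z, decode, keep_idx):
    # decode: hypothetical stand-in for the RevNet inverse pass
    z_clipped = clip_to_prior(z, keep_idx)
    l1_rec = np.mean(np.abs(x - decode(z_clipped)))  # L1 reconstruction loss
    l2_enc = np.mean((z - z_clipped) ** 2)           # stabilizing L2 penalty
    return l1_rec, l2_enc
```

The L2 term pulls the zeroed dimensions toward zero directly in encoding space, which is what stabilizes the reconstruction phase.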
While in principle the reconstruction phase is not even necessary for reversible networks, we still found it useful as a first phase to allow the network to generate a useful arrangement of the inputs in the encoding space before optimization of the adversarial loss. We still apply this reconstruction loss in the next phase of the training where we include an adversarial loss.
For the adversarial loss, a discriminator network is trained to distinguish the distribution of the encoder outputs from a prior distribution. The encoder tries to fool the discriminator by making the encoder outputs indistinguishable from samples of the prior distribution. The adversarial game can be set up with a variety of loss functions; we choose the adversarial hinge loss as advocated for use in Generative Adversarial Networks (GANs) Goodfellow et al. (2014) in Lim & Ye (2017):

L_D = E_{z~p_Z}[max(0, 1 - D(z))] + E_{x~p_X}[max(0, 1 + D(f(x)))]
L_E = -E_{x~p_X}[D(f(x))]

with L_D and L_E the losses for the discriminator D and the encoder f (in our case the RevNet), respectively. In our setting, we only apply the discriminator to the nonzero dimensions of the prior distribution, while penalizing the remaining dimensions through the L1 and L2 losses on the clipped encodings as explained before. This should greatly simplify the adversarial training, as it makes the problem substantially lower-dimensional (e.g., 64 dimensions vs. 12288 dimensions in the case of a 64-dimensional prior and 64x64 RGB images, which will be our setting on the CelebA dataset Liu et al. (2015) as explained in the experiments section).
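The two hinge losses can be computed as below; this is a NumPy sketch operating on raw discriminator scores for prior samples and encodings:

```python
import numpy as np

def hinge_d_loss(d_prior, d_enc):
    # discriminator: push scores of prior samples above +1
    # and scores of encodings below -1
    return (np.mean(np.maximum(0.0, 1.0 - d_prior))
            + np.mean(np.maximum(0.0, 1.0 + d_enc)))

def hinge_enc_loss(d_enc):
    # encoder (RevNet): raise the discriminator score on its encodings
    return -np.mean(d_enc)
```

Scores already on the correct side of the margin contribute zero gradient to the discriminator, which is part of what makes the hinge formulation well-behaved.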
We also make use of the recently proposed spectral normalization for the discriminator Miyato et al. (2018). Spectral normalization rescales each weight matrix W of the discriminator to unit spectral norm:

W_SN = W / σ(W),

where the spectral norm σ(W) is equivalent to the largest singular value of W. Spectral normalization was designed to regularize the Lipschitz norm of the discriminator network to stabilize the training Miyato et al. (2018). In practice, spectral normalization can be computed efficiently using the power iteration method with only a single iteration per forward pass; we defer to Miyato et al. (2018) for details.
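A minimal NumPy sketch of spectral normalization via power iteration; the efficient per-forward-pass variant keeps n_iters=1 and reuses the vector u across training steps:

```python
import numpy as np

def spectral_normalize(w, n_iters=1, u=None):
    # estimate the largest singular value sigma(W) by power iteration,
    # then rescale W to unit spectral norm
    if u is None:
        u = np.random.default_rng(0).standard_normal(w.shape[0])
    for _ in range(n_iters):
        v = w.T @ u
        v /= np.linalg.norm(v) + 1e-12
        u = w @ v
        u /= np.linalg.norm(u) + 1e-12
    sigma = u @ w @ v  # Rayleigh-quotient estimate of the top singular value
    return w / sigma, u  # return u so it can be reused next step
```

With enough iterations the estimate converges to the exact largest singular value, so the normalized matrix has spectral norm close to one.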
2.3 Optimal Transport
Optimal transport distances measure the distance between two distributions as the cost needed to morph one distribution into the other. This can be visualized as the transport of sand when imagining both distributions to be piles of sand Peyré & Cuturi (2018). Formally, for two distributions p and q it is defined as:

W_c(p, q) = inf_{γ ∈ Γ(p, q)} E_{(x, y)~γ}[c(x, y)]

where c is a user-defined cost/distance function, and γ is a coupling distribution whose probabilities specify how much probability mass is moved from each point x to each point y. To ensure that this coupling correctly distributes all the mass from one distribution to the other, it must come from the set Γ(p, q) of all joint distributions with marginals p and q, respectively. For two empirical distributions with the same number of samples, the optimal transport distance is equivalent to finding the pairing that minimizes the average distance between the pairs.
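For tiny equal-sized samples, this minimum-cost pairing can be computed exactly by brute force over all pairings, as sketched below in NumPy; practical implementations instead use dedicated exact solvers such as the one of Bonneel et al. (2011):

```python
import itertools
import numpy as np

def empirical_ot_distance(xs, ys):
    # exact minimum-average-cost pairing by brute force over permutations;
    # only feasible for tiny samples (n! pairings)
    costs = np.linalg.norm(xs[:, None, :] - ys[None, :, :], axis=-1)
    n = len(xs)
    return min(costs[np.arange(n), list(perm)].mean()
               for perm in itertools.permutations(range(n)))
```

For example, two samples containing the same points in different order have distance zero, since the optimal pairing matches identical points.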
Optimal transport distances have recently seen increasing usage and interest in the field of generative models, especially due to their ability to compare distributions with disjoint support. As such, they have been used in different ways to train GANs Arjovsky et al. (2017); Salimans et al. (2018). For a more thorough overview of optimal transport and its applications, we highly recommend Peyré & Cuturi (2018). In this study, we make more direct use of optimal transport distances in our experiments on the MNIST dataset Lecun et al. (1998).
Finally, we note that theoretical analysis using optimal transport distances has recently generalized the adversarial autoencoder framework into the Wasserstein Autoencoder framework Tolstikhin et al. (2018). This analysis showed that any method that matches the latent sampling distribution and the encoding distribution of the real inputs can minimize an arbitrary optimal transport distance (the chosen reconstruction loss) between the distribution of generated inputs and the real input distribution. More precisely, for a given decoder, a given latent sampling distribution and a given distance, the optimal transport distance is equivalent to the minimum expected encoder-decoder reconstruction distance over all such encoders whose encoding distribution of the real inputs is identical to the latent sampling distribution. We defer to Tolstikhin et al. (2018) for more details.
For our RevNets, inverting unclipped encodings should result in the exact same inputs that produced the encodings. Therefore, the distribution of generated samples would be identical to the real input distribution if the latent sampling distribution exactly matched the encoding distribution of the real inputs produced by the RevNet. Nevertheless, as the encodings never exactly match the imposed prior distribution, it remains important that the encoding distances remain meaningful throughout the training, which we found to be much more the case when using the initial reconstruction phase described earlier.
2.4 Fréchet Inception Distance
The Fréchet Inception Distance (FID) has been proposed as a measure for evaluating the quality of generated samples for a specific dataset Heusel et al. (2017). It is the optimal L2-transport distance between features of the ImageNet-pretrained Inception network computed on the given dataset and on a set of generated samples, under the assumption that both feature distributions are Gaussian. The Gaussianity assumption makes it possible to compute the optimal transport distance directly from the means and covariance matrices. The FID has been advocated as the automatically computable measure best correlated with human notions of sample quality among those proposed so far Heusel et al. (2017); Lucic et al. (2017), although alternatives overcoming the Gaussianity assumption have recently been proposed Bińkowski et al. (2018).
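Under the Gaussian assumption, the distance reduces to the closed-form Fréchet distance ||mu1 - mu2||² + Tr(cov1 + cov2 - 2(cov1 cov2)^{1/2}), which can be computed directly from the two means and covariances, as in this NumPy sketch:

```python
import numpy as np

def _psd_sqrt(a):
    # matrix square root of a symmetric PSD matrix via eigendecomposition
    vals, vecs = np.linalg.eigh(a)
    return (vecs * np.sqrt(np.clip(vals, 0, None))) @ vecs.T

def frechet_distance(mu1, cov1, mu2, cov2):
    # squared L2 optimal transport distance between two Gaussians
    s1 = _psd_sqrt(cov1)
    # Tr((cov1 cov2)^{1/2}) computed via the symmetric form s1 cov2 s1
    covmean = _psd_sqrt(s1 @ cov2 @ s1)
    return float(np.sum((mu1 - mu2) ** 2)
                 + np.trace(cov1 + cov2 - 2.0 * covmean))
```

For the FID itself, mu and cov are estimated from Inception features of real and generated images; the function above is only the Gaussian transport distance.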
We run our generative reversible network on the CelebA dataset Liu et al. (2015), a widely used dataset for evaluating autoencoders. We crop and downsample the images to 64x64 pixels, as is common practice, using the same code as in Tolstikhin et al. (2018) (see https://github.com/tolstikhin/wae/blob/a1fdf24066b83665feffbcf18298cd605658e33d/datahandler.py#L188-L208). Our RevNet architecture uses 11 reversible function blocks and 6 reversible subsampling steps, with 60 million parameters, and is shown in Figure 4.
For the discriminator, we use a fully connected network with two hidden layers of 400 and 800 units, respectively. The first layer uses concatenated ReLUs (CReLU) Shang et al. (2016) and the second layer regular ReLUs Glorot et al. (2011) as nonlinearities. We chose concatenated ReLUs in the first layer, as we observed in preliminary experiments that they help the discriminator produce more useful gradients when the encodings are too concentrated around the mean of the prior distribution. We apply spectral normalization to the discriminator using one power iteration per forward pass, as described in Miyato et al. (2018).
The prior distribution is a 64-dimensional standard-normal distribution. The 64 dimensions are the output dimensions with the highest standard deviations of the outputs of the untrained RevNet on the dataset. For the optimization, we follow Heusel et al. (2017) in employing different learning rates for the generative RevNet and the discriminator, using Adam; these settings were chosen identical to a fairly recent successful GAN setting Zhang et al. (2018). Code for reproducing these experiments will be released upon publication.
Our generative reversible network generates globally coherent faces, as seen in Figure 5. The generated faces are fairly blurry, which is also reflected in an FID score close to those reported for VAEs and higher than those of other autoencoders in an adversarial framework (see Table 1). Reconstructions from the restricted latent space again show that the RevNet preserves some global attributes while losing detail (Figure 6). Reconstructions from the unrestricted outputs show that the RevNet does not suffer from any numerical instabilities that could be visually perceived in the reconstructions (Figure 6). Numerical analysis of the reconstruction losses confirms this, with a small mean L1 error on the entire CelebA dataset for our trained RevNet. Interpolations in latent space show coherent interpolated faces when staying in the latent space restricted to the nonzero dimensions of the prior (Figure 7). Interpolations in the full latent space, while having more detail, also show unrealistic artifacts in some cases (Figure 8). Samples generated by varying the latent code in 5 dimensions of the prior latent distribution show that the latent dimensions seem to encode combinations of semantically meaningful attributes such as smiling vs. non-smiling, hair color, background and gender (Figure 9).
Finally, we observe that the training is very stable: we reran the experiment four times using the same model pretrained in the reconstruction phase, but varying the order of examples and the seeds for initializing the adversary parameters. Due to time constraints, we were not yet able to rerun the reconstruction phase with different seeds, but based on preliminary experiments we expect similar results in that case as well.
In a second experiment, we attempted to answer two questions. First, can generative reversible networks be trained using optimal transport without an adversary? The question of adversary-free training, or alternatively training with an adversary limited to computing an adversarial kernel function, continues to attract considerable interest due to the often difficult training dynamics of generative adversarial networks Bińkowski et al. (2018); Tolstikhin et al. (2018); Rubenstein et al. (2018). Second, is it possible to avoid prespecifying the latent dimensionality? This question is interesting as an overly large latent dimensionality might make the matching impossible Makhzani et al. (2016); Tolstikhin et al. (2018); Rubenstein et al. (2018), while an overly small latent dimensionality might make the network unable to model some variation in the generated samples that it could otherwise retain (see Rubenstein et al. (2018) for a more thorough discussion of these effects).
We chose the MNIST dataset as this experiment should mainly serve as a proof of principle, not to judge the quality of this approach compared to more established approaches for optimizing generative models. To this end, we considered that a simpler dataset such as MNIST, with fewer factors of variation, could yield more helpful insights for a first attempt.
Concretely, we train the RevNet to match class-conditional latent distributions at the outputs, while simultaneously optimizing the parameters of these distributions, as follows. We first define the class-conditional latent distributions as uncorrelated Gaussian distributions and set their means and standard deviations to the corresponding means and standard deviations of the encodings of the untrained RevNet. Then, for each minibatch, we compute the optimal transport distance between the encodings of that minibatch and a same-size sample from the latent distribution, using Euclidean distances as the cost function and solving the transport problem exactly with the algorithm from Bonneel et al. (2011) (we use the code from the Python Optimal Transport library, https://github.com/rflamary/POT/blob/81b2796226f3abde29fc024752728444da77509a/ot/lp/__init__.py#L19). The optimal transport distance is then used as a loss for both the RevNet and the means and standard deviations of the class-conditional latent distributions. While the optimal transport distance is known to have biased gradients Bellemare et al. (2017); Salimans et al. (2018), we still find it to work well enough on MNIST for reasonable per-class batch sizes. Besides ensuring a small optimal transport distance between the encodings and the sampling distributions, we must prevent the RevNet from "hiding" information in encoding dimensions with small standard deviations, so that the transport distances in encoding space remain meaningful and the training stays stable. For that, we propose a simple perturbation loss that penalizes the reconstruction error after applying a small Gaussian perturbation to the encodings. Concretely, we penalize:
E_{x~p_X, ε~N(0, σ²I)}[ d(x, f^{-1}(f(x) + ε)) ]

where f and f^{-1} are the forward and inverse functions of the RevNet, respectively, d is the reconstruction distance, and σ is a small standard deviation. The perturbation loss should also prevent the RevNet and the latent distributions from shrinking their standard deviations too much, which would otherwise lead to very unstable training, as we have also observed in practice.
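A NumPy sketch of the perturbation loss, with `f` and `f_inv` as hypothetical callables standing in for the RevNet forward and inverse passes, and the L1 distance used for illustration:

```python
import numpy as np

def perturbation_loss(x, f, f_inv, sigma=0.1, rng=None):
    # reconstruct from perturbed encodings; networks that hide information
    # in tiny-variance encoding dimensions incur a large loss here
    rng = np.random.default_rng(0) if rng is None else rng
    z = f(x)                                    # forward pass (encode)
    eps = rng.normal(0.0, sigma, size=z.shape)  # small Gaussian perturbation
    return np.mean(np.abs(x - f_inv(z + eps)))  # L1 reconstruction error
```

If the network stored essential information in dimensions with standard deviations far below sigma, even this small perturbation would destroy the reconstruction, which is exactly what the loss penalizes.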
The RevNet ends up using only 3-4 dimensions per class to encode the digits, with several of the dimensions shared between classes. While the samples are somewhat blurry and lack diversity (see Figure 11), interestingly, some of the used dimensions with the largest standard deviations encode semantically identical features across the different digits, as shown in Figure 12. This indicates the RevNet has learned an encoding that keeps class-independent factors such as thickness or tilt in the same encoding dimensions, despite having the freedom to use completely different dimensions for each class.
Overall, we have shown for the first time that reversible neural networks can be used inside the adversarial autoencoder framework, yielding globally coherent generated faces on CelebA. While they still underperform relative to recent advanced generative autoencoder models on that dataset according to the Fréchet Inception Distance, the performance gap might be due to hyperparameters or architecture design choices, which have not been explored for RevNets prior to this work and are known to strongly affect generative model results Lucic et al. (2017). Closing the performance gap through automated search for architectures and hyperparameters could therefore be an interesting next step. This could also include other forms of matching the distributions, such as maximum mean discrepancy Gretton et al. (2012); Li et al. (2017); Tolstikhin et al. (2018) or sliced Wasserstein distances Kolouri et al. (2018).
Furthermore, the previous maximum-likelihood and input-adversarial methods used to train invertible networks in a generative setting Dinh et al. (2015, 2017); Danihelka et al. (2017); Grover et al. (2018) could be more directly compared to the adversarial autoencoder method from this study. The generated samples in the maximum-likelihood approach on CelebA in Dinh et al. (2017) feature more details, but also more unnatural artifacts. Attributing these differences to the training procedure or the model architecture could meaningfully extend prior work comparing maximum-likelihood and input-adversarial training of generative RevNets Danihelka et al. (2017). For the input-adversarial approach, one could also combine it with our proposal to use only a subset of the full latent sampling dimensionality. For this combination, it might be insightful to study the resulting encodings of the real inputs, especially in terms of what is modelled in the encoding space outside of the used sampling distribution, similar to our reconstructions from restricted and full latent space. Finally, one could compare the performance of RevNets in the adversarial autoencoder framework to approaches that use more traditional non-invertible autoencoders Donahue et al. (2017); Dumoulin et al. (2016); Ulyanov et al. (2017).
Later works on generative invertible networks used a hierarchical ordering of the latent sampling dimensions (see Dinh et al. (2017) for details). This might be worth exploring further. First, one might study this idea in combination with the adversarial autoencoder framework employed in this study. Second, the model architecture and hierarchical latent dimension ordering to enable high-quality generative modelling could be further optimized. Third, one might try to combine this idea with the progressive training of generative models as in Karras et al. (2018).
Our experiment on MNIST indicates that, for simple datasets, an adversary-free approach that does not need a prespecified latent dimensionality can result in meaningful encoding dimensions. This might be interesting for other works investigating the effect of latent dimensionality and intrinsic dimensionality on generative models Rubenstein et al. (2018). However, even for MNIST, the results are somewhat underwhelming with regard to the diversity of the generated samples. Still, we hope our results inspire further investigations into how to properly achieve the joint goals of meaningful encoding dimensions, a small distance between the encodings of the real inputs and the sampling distribution, and realistic generated samples.
This work was supported by the BrainLinks-BrainTools Cluster of Excellence (DFG grant EXC 1086) and by the Federal Ministry of Education and Research (BMBF, grant Motor-BIC 13GW0053D).
- Arjovsky et al. (2017) Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein GAN. arXiv:1701.07875 [cs, stat], January 2017. URL http://arxiv.org/abs/1701.07875. arXiv: 1701.07875.
- Bellemare et al. (2017) Bellemare, M. G., Danihelka, I., Dabney, W., Mohamed, S., Lakshminarayanan, B., Hoyer, S., and Munos, R. The Cramer Distance as a Solution to Biased Wasserstein Gradients. arXiv:1705.10743 [cs, stat], May 2017. URL http://arxiv.org/abs/1705.10743. arXiv: 1705.10743.
- Bińkowski et al. (2018) Bińkowski, M., Sutherland, D. J., Arbel, M., and Gretton, A. Demystifying MMD GANs. In International Conference on Learning Representations (ICLR), 2018. URL https://openreview.net/forum?id=r1lUOzWCW.
- Bonneel et al. (2011) Bonneel, N., van de Panne, M., Paris, S., and Heidrich, W. Displacement Interpolation Using Lagrangian Mass Transport. In Proceedings of the 2011 SIGGRAPH Asia Conference, SA ’11, pp. 158:1–158:12, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0807-6. doi: 10.1145/2024156.2024192. URL http://doi.acm.org/10.1145/2024156.2024192.
- Danihelka et al. (2017) Danihelka, I., Lakshminarayanan, B., Uria, B., Wierstra, D., and Dayan, P. Comparison of Maximum Likelihood and GAN-based training of Real NVPs. arXiv:1705.05263 [cs], May 2017. URL http://arxiv.org/abs/1705.05263. arXiv: 1705.05263.
- Dinh et al. (2015) Dinh, L., Krueger, D., and Bengio, Y. NICE: Non-linear Independent Components Estimation. In International Conference on Learning Representations (ICLR), 2015. URL http://arxiv.org/abs/1410.8516. arXiv: 1410.8516.
- Dinh et al. (2017) Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using Real NVP. In International Conference on Learning Representations (ICLR), 2017. URL https://openreview.net/forum?id=SyPNSAW5.
- Donahue et al. (2017) Donahue, J., Krähenbühl, P., and Darrell, T. Adversarial Feature Learning. In International Conference on Learning Representations (ICLR), 2017. URL http://arxiv.org/abs/1605.09782. arXiv: 1605.09782.
- Dumoulin et al. (2016) Dumoulin, V., Belghazi, I., Poole, B., Mastropietro, O., Lamb, A., Arjovsky, M., and Courville, A. Adversarially Learned Inference. arXiv:1606.00704 [cs, stat], June 2016. URL http://arxiv.org/abs/1606.00704. arXiv: 1606.00704.
- Glorot et al. (2011) Glorot, X., Bordes, A., and Bengio, Y. Deep Sparse Rectifier Neural Networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 315–323, 2011.
- Gomez et al. (2017) Gomez, A. N., Ren, M., Urtasun, R., and Grosse, R. B. The Reversible Residual Network: Backpropagation Without Storing Activations. arXiv:1707.04585 [cs], July 2017. URL http://arxiv.org/abs/1707.04585. arXiv: 1707.04585.
- Goodfellow et al. (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative Adversarial Nets. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 27, pp. 2672–2680. Curran Associates, Inc., 2014. URL http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf.
- Gretton et al. (2012) Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. A Kernel Two-Sample Test. Journal of Machine Learning Research, 13:723–773, March 2012. ISSN 1533-7928. URL http://jmlr.csail.mit.edu/papers/v13/gretton12a.html.
- Grover et al. (2018) Grover, A., Dhar, M., and Ermon, S. Flow-GAN: Combining Maximum Likelihood and Adversarial Learning in Generative Models. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018. URL http://arxiv.org/abs/1705.08868. arXiv: 1705.08868.
- Heusel et al. (2017) Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30, pp. 6626–6637. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/7240-gans-trained-by-a-two-time-scale-update-rule-converge-to-a-local-nash-equilibrium.pdf.
- Isola et al. (2017) Isola, P., Zhu, J. Y., Zhou, T., and Efros, A. A. Image-to-Image Translation with Conditional Adversarial Networks. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017. doi: 10.1109/CVPR.2017.632.
- Jacobsen et al. (2018) Jacobsen, J.-H., Smeulders, A. W. M., and Oyallon, E. i-RevNet: Deep Invertible Networks. In International Conference on Learning Representations (ICLR), 2018. URL https://openreview.net/forum?id=HJsjkMb0Z.
- Karras et al. (2018) Karras, T., Aila, T., Laine, S., and Lehtinen, J. Progressive Growing of GANs for Improved Quality, Stability, and Variation. In International Conference on Learning Representations (ICLR), 2018. URL http://arxiv.org/abs/1710.10196. arXiv: 1710.10196.
- Kingma & Welling (2014) Kingma, D. P. and Welling, M. Auto-Encoding Variational Bayes. In International Conference on Machine Learning, 2014. URL http://arxiv.org/abs/1312.6114. arXiv: 1312.6114.
- Kolouri et al. (2018) Kolouri, S., Martin, C. E., and Rohde, G. K. Sliced-Wasserstein Autoencoder: An Embarrassingly Simple Generative Model. arXiv:1804.01947 [cs, stat], April 2018. URL http://arxiv.org/abs/1804.01947. arXiv: 1804.01947.
- Lecun et al. (1998) Lecun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-Based Learning Applied to Document Recognition. In Proceedings of the IEEE, pp. 2278–2324, 1998.
- Li et al. (2017) Li, C.-L., Chang, W.-C., Cheng, Y., Yang, Y., and Poczos, B. MMD GAN: Towards Deeper Understanding of Moment Matching Network. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30, pp. 2203–2213. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/6815-mmd-gan-towards-deeper-understanding-of-moment-matching-network.pdf.
- Lim & Ye (2017) Lim, J. H. and Ye, J. C. Geometric GAN. arXiv:1705.02894 [cond-mat, stat], May 2017. URL http://arxiv.org/abs/1705.02894. arXiv: 1705.02894.
- Liu et al. (2015) Liu, Z., Luo, P., Wang, X., and Tang, X. Deep Learning Face Attributes in the Wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015.
- Lucic et al. (2017) Lucic, M., Kurach, K., Michalski, M., Gelly, S., and Bousquet, O. Are GANs Created Equal? A Large-Scale Study. arXiv:1711.10337 [cs, stat], November 2017. URL http://arxiv.org/abs/1711.10337. arXiv: 1711.10337.
- Makhzani et al. (2016) Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I., and Frey, B. Adversarial Autoencoders. In International Conference on Learning Representations (ICLR), 2016. URL http://arxiv.org/abs/1511.05644. arXiv: 1511.05644.
- Miyato et al. (2018) Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spectral Normalization for Generative Adversarial Networks. In International Conference on Learning Representations (ICLR), 2018. URL http://arxiv.org/abs/1802.05957. arXiv: 1802.05957.
- Peyré & Cuturi (2018) Peyré, G. and Cuturi, M. Computational Optimal Transport. arXiv:1803.00567 [stat], March 2018. URL http://arxiv.org/abs/1803.00567. arXiv: 1803.00567.
- Rubenstein et al. (2018) Rubenstein, P. K., Schoelkopf, B., and Tolstikhin, I. On the Latent Space of Wasserstein Auto-Encoders. In International Conference on Learning Representations (ICLR) Workshop, 2018. URL http://arxiv.org/abs/1802.03761. arXiv: 1802.03761.
- Salimans et al. (2018) Salimans, T., Zhang, H., Radford, A., and Metaxas, D. Improving GANs Using Optimal Transport. In International Conference on Learning Representations (ICLR), 2018. URL https://openreview.net/forum?id=rkQkBnJAb.
- Shang et al. (2016) Shang, W., Sohn, K., Almeida, D., and Lee, H. Understanding and Improving Convolutional Neural Networks via Concatenated Rectified Linear Units. In International Conference on Machine Learning, pp. 2217–2225, 2016.
- Tolstikhin et al. (2018) Tolstikhin, I., Bousquet, O., Gelly, S., and Schoelkopf, B. Wasserstein Auto-Encoders. In International Conference on Learning Representations (ICLR), 2018. URL http://arxiv.org/abs/1711.01558. arXiv: 1711.01558.
- Ulyanov et al. (2017) Ulyanov, D., Vedaldi, A., and Lempitsky, V. Adversarial Generator-Encoder Networks. arXiv preprint arXiv:1704.02304, 2017.
- Zhang et al. (2018) Zhang, H., Goodfellow, I., Metaxas, D., and Odena, A. Self-Attention Generative Adversarial Networks. arXiv:1805.08318 [cs, stat], May 2018. URL http://arxiv.org/abs/1805.08318. arXiv: 1805.08318.