1 Introduction
Deep generative models are a powerful tool for sampling complex high-dimensional objects from a low-dimensional manifold. The dominant approaches for learning such generative models are variational autoencoders (VAEs) (Kingma & Welling, 2014; Rezende et al., 2014) and generative adversarial networks (GANs) (Goodfellow et al., 2014). VAEs allow one not only to generate samples from the data distribution but also to encode objects into the latent space. However, VAE-like models require a careful choice of likelihood. Misspecifying it may lead to undesirable effects in samples and reconstructions (e.g., blurry images). In contrast, GANs do not rely on an explicit likelihood and use a more complex loss function provided by a discriminator. As a result, they produce higher-quality images. However, the original formulation of GANs (Goodfellow et al., 2014) lacks an encoder, a property that enables many practical applications. For example, encoding is used in semi-supervised learning (Kingma et al., 2014), in manipulating object properties using a low-dimensional manifold (Creswell et al., 2017), and in optimization that exploits the known structure of embeddings (Gómez-Bombarelli et al., 2018).

VAE-GAN hybrids are of great interest due to their potential ability to learn latent representations like VAEs while generating high-quality objects like GANs. In such generative models with a bidirectional mapping between the data space and the latent space, one of the desired properties is to have good reconstructions ($x \approx G(E(x))$). In many hybrid approaches (Rosca et al., 2017; Ulyanov et al., 2018; Zhu et al., 2017; Brock et al., 2017; Tolstikhin et al., 2017), as well as in VAE-like methods, this is achieved by minimizing the $L_1$ or $L_2$ pixel-wise norm between $x$ and $G(E(x))$. However, the main drawback of these standard reconstruction losses is that they force the generative model to recover too many unnecessary details of the source object $x$. For example, to reconstruct a picture of a bird we do not need the exact position of the bird in the image, but a pixel-wise loss heavily penalizes shifted reconstructions. Recently, Li et al. (2017) improved the ALI model (Dumoulin et al., 2017; Donahue et al., 2017) by introducing a reconstruction loss in the form of a discriminator which classifies pairs $(x, x)$ and $(x, G(E(x)))$. However, in such an approach the discriminator tends to detect the fake pair $(x, G(E(x)))$ just by checking the identity of $x$ and $G(E(x))$, which leads to vanishing gradients.

In this paper, we propose a novel autoencoding model which matches the distributions in the data space and in the latent space independently, as in Zhu et al. (2017). To ensure good reconstructions, we introduce an augmented adversarial reconstruction loss in the form of a discriminator which classifies pairs $(x, a(x))$ and $(x, G(E(x)))$, where $a(\cdot)$ is a stochastic augmentation function. This forces the discriminator to take into account content that is invariant to the augmentation, thus making training more robust. We call this approach Pairwise Augmented Generative Adversarial Networks (PAGANs).

Measuring the reconstruction quality of autoencoding models is challenging. The standard reconstruction metric, RMSE, does not perform a content-based comparison. To deal with this problem we propose a novel metric, Reconstruction Inception Dissimilarity (RID), which is robust to content-preserving transformations (e.g., small shifts of an image). We show qualitative results on common datasets such as MNIST (LeCun & Cortes, 2010), CIFAR-10 (Krizhevsky et al., 2009) and CelebA (Liu et al., 2015). PAGANs outperform existing VAE-GAN hybrids in Inception Score (Salimans et al., 2016) and Fréchet Inception Distance (Heusel et al., 2017), except for the recently announced PD-WGAN method (Gemici et al., 2018) in Inception Score on the CIFAR-10 dataset.
2 Preliminaries
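Let us consider an adversarial learning framework where our goal is to match the true distribution $p^*(x)$ to the model distribution $p_\theta(x)$. As proposed in the original paper (Goodfellow et al., 2014), the model distribution $p_\theta(x)$ is induced by the generator $G_\theta: z \mapsto x$, where $z$ is sampled from a prior $p(z)$. To match the distributions $p^*(x)$ and $p_\theta(x)$ in an adversarial manner, we introduce a discriminator $D_\psi: x \mapsto [0, 1]$. It takes an object $x$ and predicts the probability that this object is sampled from the true distribution $p^*(x)$. The training procedure of GANs (Goodfellow et al., 2014) is based on a minimax game of two players: the generator $G_\theta$ and the discriminator $D_\psi$. This game is defined as follows:
$$\min_\theta \max_\psi V(\theta, \psi) = \mathbb{E}_{p^*(x)} \log D_\psi(x) + \mathbb{E}_{p_\theta(x)} \log\bigl(1 - D_\psi(x)\bigr), \tag{1}$$
where $V(\theta, \psi)$ is the value function of this game.
The optimal discriminator $D_{\psi^*}$ for a fixed generator $G_\theta$ is
$$D_{\psi^*}(x) = \frac{p^*(x)}{p^*(x) + p_\theta(x)}, \tag{2}$$
and the value function for the generator given the optimal discriminator is then equivalent, up to a constant and a positive factor, to the Jensen-Shannon divergence between the model distribution $p_\theta(x)$ and the true distribution $p^*(x)$, i.e.
$$V(\theta, \psi^*) = -\log 4 + 2\,\mathrm{JSD}\bigl(p_\theta(x) \,\|\, p^*(x)\bigr). \tag{3}$$
However, in practice the gradient of the value function $V(\theta, \psi)$ with respect to the generator's parameters $\theta$ often vanishes. Therefore, Goodfellow et al. (2014) proposed to train the generator by minimizing $-\mathbb{E}_{p_\theta(x)} \log D_\psi(x)$ instead of $\mathbb{E}_{p_\theta(x)} \log\bigl(1 - D_\psi(x)\bigr)$. This generator loss provides much more stable gradients and has the same fixed point as the minimax game of $G_\theta$ and $D_\psi$.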
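The difference between the two generator losses can be seen on the discriminator logit. The following is a minimal NumPy sketch, not the authors' code: it assumes the discriminator output is $D = \sigma(s)$ for a logit $s$ (the function names are illustrative) and compares the gradient of each loss with respect to $s$ on a generated sample.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def saturating_grad(s):
    # d/ds log(1 - sigmoid(s)) = -sigmoid(s)
    return -sigmoid(s)

def non_saturating_grad(s):
    # d/ds [-log sigmoid(s)] = -(1 - sigmoid(s))
    return -(1.0 - sigmoid(s))

# Logits of the discriminator on generated samples:
# very negative = confidently "fake", very positive = confidently "real".
logits = np.array([-8.0, 0.0, 8.0])
sat = saturating_grad(logits)
nonsat = non_saturating_grad(logits)
print("saturating:    ", sat)
print("non-saturating:", nonsat)
```

When the discriminator confidently rejects a sample (logit of -8), the saturating loss yields an almost zero gradient, while the non-saturating loss still yields a gradient of magnitude close to one.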
3 Pairwise Augmented Generative Adversarial Networks
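In the PAGANs model our aim is not only to learn how to generate realistic objects with the generator $G_\theta(z)$, where $z$ is sampled from the prior $p(z)$, but at the same time to learn an inverse mapping (encoder) $E_\varphi: x \mapsto z$. Additionally, we use a third stochastic transformation without parameters, $a: x \mapsto y$, which is called the augmenter. It produces an augmentation $y$ of the source object $x$.
Let us consider the distributions which are induced by these three mappings:

$p_\theta(x \mid z)$: the conditional distribution of outputs of the generator $G_\theta$ given $z$;

$q_\varphi(z \mid x)$: the conditional distribution of outputs of the encoder $E_\varphi$ given $x$;

$r(y \mid x)$: the conditional distribution over the augmentations $y$ given a source object $x$.
Within the PAGANs model our goal is to find optimal parameters $\theta^*$ and $\varphi^*$ that ensure

generator matching: $p_{\theta^*}(x) = p^*(x)$, where $p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz$, i.e. the generator samples objects from the true distribution $p^*(x)$;

encoder matching: $q_{\varphi^*}(z) = p(z)$, where $q_\varphi(z) = \int q_\varphi(z \mid x)\, p^*(x)\, dx$, i.e. the encoder generates embeddings distributed as the prior $p(z)$;

reconstruction matching: $p_{\theta^*, \varphi^*}(y \mid x) = r(y \mid x)$, where
$$p_{\theta, \varphi}(y \mid x) = \int p_\theta(y \mid z)\, q_\varphi(z \mid x)\, dz, \tag{4}$$
i.e. reconstructions are distributed as augmentations of the source object $x$.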
3.1 Generator & Encoder Matching
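To solve the generator and encoder matching problems we can use the framework of vanilla GANs (Goodfellow et al., 2014). We introduce two discriminators, $D_{\psi_x}(x)$ on the data space and $D_{\psi_z}(z)$ on the latent space, for two minimax games:

generator matching:
$$\min_\theta \max_{\psi_x} V_x(\theta, \psi_x) = \mathbb{E}_{p^*(x)} \log D_{\psi_x}(x) + \mathbb{E}_{p_\theta(x)} \log\bigl(1 - D_{\psi_x}(x)\bigr); \tag{5}$$
encoder matching:
$$\min_\varphi \max_{\psi_z} V_z(\varphi, \psi_z) = \mathbb{E}_{p(z)} \log D_{\psi_z}(z) + \mathbb{E}_{q_\varphi(z)} \log\bigl(1 - D_{\psi_z}(z)\bigr). \tag{6}$$
Then the value functions $V_x$ and $V_z$ given the optimal discriminators $D_{\psi_x^*}$ and $D_{\psi_z^*}$ are equivalent to the Jensen-Shannon divergence:
$$V_x(\theta, \psi_x^*) = -\log 4 + 2\,\mathrm{JSD}\bigl(p_\theta(x) \,\|\, p^*(x)\bigr), \tag{7}$$
$$V_z(\varphi, \psi_z^*) = -\log 4 + 2\,\mathrm{JSD}\bigl(q_\varphi(z) \,\|\, p(z)\bigr). \tag{8}$$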
3.2 Reconstruction Matching: Augmented Adversarial Reconstruction Loss
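The solution of the reconstruction matching problem ensures that reconstructions $G_\theta(E_\varphi(x))$ correspond to the source object $x$ up to the defined random augmentations $a(x)$. In the PAGANs model we introduce a minimax game for training the adversarial distance between the reconstructions and the augmentations of the source object $x$. We consider a discriminator $D_\psi$ which takes a pair $(x, y)$ and classifies it into one of the following classes:

the real class: pairs $(x, y)$ from the distribution $p^*(x)\, r(y \mid x)$, i.e. the object $x$ is taken from the true distribution $p^*(x)$ and the second object $y$ is obtained from $x$ by the random augmentation $a(x)$;

the fake class: pairs $(x, y)$ from the distribution
$$p^*(x)\, p_{\theta, \varphi}(y \mid x) = p^*(x) \int p_\theta(y \mid z)\, q_\varphi(z \mid x)\, dz, \tag{9}$$
i.e. $x$ is sampled from $p^*(x)$, then $z$ is generated from the conditional distribution $q_\varphi(z \mid x)$ by the encoder, and $y$ is produced by the generator from the conditional model distribution $p_\theta(y \mid z)$.
Then the minimax problem is
$$\min_{\theta, \varphi} \max_\psi V_r(\theta, \varphi, \psi), \tag{10}$$
where
$$V_r(\theta, \varphi, \psi) = \mathbb{E}_{p^*(x)\, r(y \mid x)} \log D_\psi(x, y) + \mathbb{E}_{p^*(x)\, p_{\theta, \varphi}(y \mid x)} \log\bigl(1 - D_\psi(x, y)\bigr). \tag{11}$$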
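Let us prove that this minimax game matches the distributions $r(y \mid x)$ and $p_{\theta, \varphi}(y \mid x)$. First, we find the optimal discriminator:
Proposition 1.
Given a fixed generator $G_\theta$ and a fixed encoder $E_\varphi$, the optimal discriminator $D_{\psi^*}$ is
$$D_{\psi^*}(x, y) = \frac{r(y \mid x)}{r(y \mid x) + p_{\theta, \varphi}(y \mid x)}. \tag{12}$$
Proof.
Given in Appendix A.1. ∎
Then we can prove that, given an optimal discriminator, the value function $V_r(\theta, \varphi, \psi^*)$ is equivalent to the expected Jensen-Shannon divergence between the distributions $r(y \mid x)$ and $p_{\theta, \varphi}(y \mid x)$.
Proposition 2.
The minimization of the value function $V_r$ under an optimal discriminator $D_{\psi^*}$ is equivalent to the minimization of the expected Jensen-Shannon divergence between $r(y \mid x)$ and $p_{\theta, \varphi}(y \mid x)$, i.e.
$$V_r(\theta, \varphi, \psi^*) = -\log 4 + 2\,\mathbb{E}_{p^*(x)}\, \mathrm{JSD}\bigl(r(y \mid x) \,\|\, p_{\theta, \varphi}(y \mid x)\bigr). \tag{13}$$
Proof.
Given in Appendix A.2. ∎
If $r(y \mid x) = \delta(y - x)$, i.e. no augmentation is applied, then the optimal discriminator $D_{\psi^*}(x, y)$ learns an indicator $\mathbb{I}\{x = y\}$, as was proved in Li et al. (2017). As a consequence, the objectives of the generator and the encoder are very unstable and have vanishing gradients in practice. In contrast, if the distribution $r(y \mid x)$ is non-degenerate, as in our model, then the value function $V_r(\theta, \varphi, \psi)$ is well-behaved and much more stable, which we observed in practice.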
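Propositions 1 and 2 can be checked numerically on a small discrete example. The sketch below is illustrative only: it assumes finite sets for $x$ and $y$, draws arbitrary distributions (the names `p_star`, `r` and `p_rec` are hypothetical stand-ins for the true distribution, the augmentation distribution and the reconstruction distribution), and verifies that the discriminator of Proposition 1 attains a value equal to $-\log 4$ plus twice the expected Jensen-Shannon divergence.

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def jsd(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

rng = np.random.default_rng(0)
n_x, n_y = 4, 5
p_star = rng.dirichlet(np.ones(n_x))            # true distribution over x
r = rng.dirichlet(np.ones(n_y), size=n_x)       # r(y|x): one row per x
p_rec = rng.dirichlet(np.ones(n_y), size=n_x)   # reconstruction distribution p(y|x)

def value(d):
    # Value of the reconstruction game for a discriminator table d[x, y].
    real = np.sum(p_star[:, None] * r * np.log(d))
    fake = np.sum(p_star[:, None] * p_rec * np.log(1.0 - d))
    return float(real + fake)

d_opt = r / (r + p_rec)                          # candidate optimum (Proposition 1)
v_opt = value(d_opt)

expected_jsd = sum(p_star[i] * jsd(r[i], p_rec[i]) for i in range(n_x))
rhs = -np.log(4.0) + 2.0 * expected_jsd          # right-hand side of Proposition 2

# Any perturbed discriminator should not exceed the optimal value.
d_pert = np.clip(d_opt + 0.05 * rng.uniform(-1, 1, size=d_opt.shape), 1e-6, 1 - 1e-6)
v_pert = value(d_pert)
print(v_opt, rhs, v_pert)
```

The agreement of `v_opt` and `rhs` holds for any strictly positive choice of the three distributions, since the argument is pointwise in $(x, y)$.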
3.3 Training Objectives
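It follows that for the generator and the encoder we should each optimize the sum of two value functions, where $V_x$, $V_z$ and $V_r$ denote the value functions of the generator, encoder and reconstruction matching games, with discriminators $D_{\psi_x}(x)$, $D_{\psi_z}(z)$ and $D_\psi(x, y)$ respectively:

the generator's objective:
$$\min_\theta \; V_x(\theta, \psi_x) + V_r(\theta, \varphi, \psi) \tag{14}$$
$$= \min_\theta \; \mathbb{E}_{p_\theta(x)} \log\bigl(1 - D_{\psi_x}(x)\bigr) + \mathbb{E}_{p^*(x)\, p_{\theta, \varphi}(y \mid x)} \log\bigl(1 - D_\psi(x, y)\bigr) + \mathrm{const}; \tag{15}$$
the encoder's objective:
$$\min_\varphi \; V_z(\varphi, \psi_z) + V_r(\theta, \varphi, \psi) \tag{16}$$
$$= \min_\varphi \; \mathbb{E}_{q_\varphi(z)} \log\bigl(1 - D_{\psi_z}(z)\bigr) + \mathbb{E}_{p^*(x)\, p_{\theta, \varphi}(y \mid x)} \log\bigl(1 - D_\psi(x, y)\bigr) + \mathrm{const}. \tag{17}$$
Here $\mathrm{const}$ collects the terms that do not depend on the optimized parameters. In practice, to speed up training, we follow Goodfellow et al. (2014) and use more stable objectives, replacing $\log(1 - D(\cdot))$ with $-\log D(\cdot)$. See Figure 1 for the description of our model and Algorithm 1 for an algorithmic illustration of the training procedure.
We can straightforwardly extend the definition of the PAGANs model to $f$-PAGANs, which minimize the $f$-divergence, and to WPAGANs, which optimize the Wasserstein-1 distance. A more detailed analysis of these models is given in Appendix B.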
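As an illustration of the objectives in this section, the following sketch computes the non-saturating generator and encoder losses from discriminator outputs. It is a hedged sketch rather than the authors' implementation: the function names and the convention that each discriminator returns a probability in $(0, 1)$ are assumptions made for this example.

```python
import numpy as np

def non_saturating(d_fake, eps=1e-12):
    # -E[log D(.)] on "fake" inputs, used in place of E[log(1 - D(.))].
    return float(-np.mean(np.log(d_fake + eps)))

def generator_loss(d_x_on_samples, d_pair_on_recon):
    # Data-space term on generated samples plus the reconstruction term
    # on pairs (x, reconstruction of x).
    return non_saturating(d_x_on_samples) + non_saturating(d_pair_on_recon)

def encoder_loss(d_z_on_codes, d_pair_on_recon):
    # Latent-space term on encoded samples plus the same reconstruction term.
    return non_saturating(d_z_on_codes) + non_saturating(d_pair_on_recon)

# Hypothetical discriminator outputs on a small batch.
undecided = np.array([0.5, 0.5, 0.5])
fooled = np.array([0.9, 0.9, 0.9])
print(generator_loss(undecided, undecided), generator_loss(fooled, fooled))
```

Both losses share the reconstruction term, so the pair discriminator sends gradients to the generator and the encoder at the same time.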
4 Related Work
Recent papers on VAE-GAN hybrids explore different ways to build a generative model with an encoder part. One direction is to apply adversarial training in the VAE framework to match the variational posterior distribution $q(z \mid x)$ and the prior distribution $p(z)$ (Mescheder et al., 2017), or to match the marginal $q(z)$ and $p(z)$ (Makhzani et al., 2016; Tolstikhin et al., 2017). Another way within the VAE model is to introduce the discriminator as a part of the data likelihood (Larsen et al., 2015; Brock et al., 2017). Within the GAN framework, a common technique is to regularize the model with a reconstruction loss term (Che et al., 2017; Rosca et al., 2017; Ulyanov et al., 2018).
Another principal approach is to train the generator and the encoder simultaneously in a fully adversarial way (Donahue et al., 2017; Dumoulin et al., 2017; Li et al., 2017). These methods match the joint distributions $p^*(x)\, q(z \mid x)$ and $p_\theta(x \mid z)\, p(z)$ by training a discriminator which classifies the pairs $(x, z)$. The ALICE model (Li et al., 2017) introduces an additional entropy loss to deal with the non-identifiability issues of the ALI model. Li et al. (2017) approximated the entropy loss with a cycle-consistency term, which is equivalent to the adversarial reconstruction loss. The model of Pu et al. (2017a) puts ALI into the VAE framework, where the same joint distributions are matched in an adversarial manner. As an alternative, Ulyanov et al. (2018) train the generator and the encoder by optimizing a minimax game without a discriminator. The optimal transport approach has also been explored: Gemici et al. (2018) introduce an algorithm based on primal and dual formulations of an optimal transport problem.

In the PAGANs model the marginal distributions in the data space, $p_\theta(x)$ and $p^*(x)$, and in the latent space, $p(z)$ and $q_\varphi(z)$, are matched independently, as in Zhu et al. (2017). Additionally, the augmented adversarial reconstruction loss is minimized by fooling the discriminator which classifies the pairs $(x, a(x))$ and $(x, G_\theta(E_\varphi(x)))$.
5 Experiments
In this section, we validate our model experimentally. First, we compare PAGAN with other similar methods that support both inference and generation, using Inception Score and Fréchet Inception Distance. Second, to measure reconstruction quality, we introduce Reconstruction Inception Dissimilarity (RID) and demonstrate its usefulness. In the last two experiments we show the importance of the adversarial loss and of the augmentations.
For the architecture we used the deterministic DCGAN generator and discriminator networks provided by pfnetresearch (https://github.com/pfnetresearch/chainerganlib); the DCGAN architecture is a common choice for GANs, and other works use a similar architecture. The encoder network has the same architecture as the discriminator except for the output dimension. The encoder's output is a factorized normal distribution. Thus $q_\varphi(z \mid x) = \mathcal{N}\bigl(z \mid \mu_\varphi(x), \mathrm{diag}(\sigma_\varphi^2(x))\bigr)$, where $\mu_\varphi(x)$ and $\sigma_\varphi(x)$ are outputs of the encoder network. The architecture of the discriminator on latent codes, $D_{\psi_z}$, is chosen to be a 2-layer MLP with 512 and 256 hidden units. We also used the default hyperparameters provided in the repository and applied spectral normalization following Miyato et al. (2018). For the augmentation $a(x)$ defined in Section 3 we used a combination of 10% reflection padding and a random crop back to the original image size. The prior distribution $p(z)$ is chosen to be a standard normal distribution $\mathcal{N}(0, I)$. To evaluate Inception Score and Fréchet Inception Distance we used the official implementation provided in tensorflow 1.10.1 (Abadi et al., 2015).

To optimize objectives (16) and (14), we need a discriminator working on pairs $(x, y)$. This can be done using special network architectures such as siamese networks (Bromley et al., 1993) or via image concatenation. The latter approach can be implemented in two alternative ways: concatenating channel-wise or width-wise. Empirically we found that the siamese architecture does not lead to a significant improvement and that width-wise concatenation is the most stable. We use this configuration in all the experiments.
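The augmentation and the pair construction described above can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' code: images are taken to be arrays of shape (H, W, C), "10% pad" is interpreted as padding 10% of the side length on each border, and the helper names `augment` and `make_pair` are hypothetical.

```python
import numpy as np

def augment(image, rng, pad_frac=0.10):
    """Reflect-pad an (H, W, C) image on every border, then random-crop back to (H, W)."""
    h, w, _ = image.shape
    ph, pw = int(round(h * pad_frac)), int(round(w * pad_frac))
    padded = np.pad(image, ((ph, ph), (pw, pw), (0, 0)), mode="reflect")
    top = rng.integers(0, 2 * ph + 1)    # random vertical offset of the crop
    left = rng.integers(0, 2 * pw + 1)   # random horizontal offset of the crop
    return padded[top:top + h, left:left + w]

def make_pair(x, y):
    """Width-wise concatenation of two (H, W, C) images into one (H, 2W, C) input."""
    return np.concatenate([x, y], axis=1)

rng = np.random.default_rng(0)
x = rng.random((32, 32, 3))   # stand-in for a CIFAR-sized image
y = augment(x, rng)           # y plays the role of a sample from r(y|x)
pair = make_pair(x, y)        # input for a discriminator on pairs
print(pair.shape)
```

A fake pair would be built the same way, with the reconstruction of `x` in place of `y`.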
Sampling Quality
To see whether our method provides good-quality samples from the prior, we compared our model to related works that provide an inverse mapping. We performed our evaluations on the CIFAR-10 dataset, since quantitative metrics are available there.
In terms of Fréchet Inception Distance (FID), our model outperforms all other methods. In terms of Inception Score, PAGANs are significantly better than the other methods except for the recently announced PD-WGAN.
Quantitative results are given in Table 1. Plots with samples and reconstructions for the CIFAR-10 dataset are provided in Figure 2. Additional visual results for more datasets can be found in Appendix D.3.
Table 1: FID and Inception Score on CIFAR-10 ("-" marks values that are not reported).

| Model | FID | Inception Score |
|---|---|---|
| WAE-GAN (Tolstikhin et al., 2017) | 87.7 | 4.18 ± 0.04 |
| ALI (Dumoulin et al., 2017) | - | 5.34 ± 0.04 |
| AGE (Ulyanov et al., 2018) | 39.51 | 5.9 ± 0.04 |
| ALICE (Li et al., 2017) | - | 6.02 ± 0.03 |
| α-GANs (Rosca et al., 2017) | - | 6.2 |
| AS-VAE (Pu et al., 2017b) | - | 6.3 |
| PD-WGAN (Gemici et al., 2018) | 33.0 | 6.70 ± 0.09 |
| PAGAN (ours) | 32.84 | 6.56 ± 0.06 |

Figure 2: Evaluation of the generator and the encoder on the CIFAR-10 dataset. In plots (c) and (d), odd columns show original images and even columns show the corresponding reconstructions on the test partition.
Reconstruction Inception Dissimilarity
The traditional approach to estimating reconstruction quality is to compute the RMSE distance between source images and their reconstructions. However, this metric focuses on exact reconstruction and is not content-aware. RMSE penalizes content-preserving transformations while tolerating undesirable effects such as blurriness, which degrades visual quality significantly. We propose a novel metric, Reconstruction Inception Dissimilarity (RID), which is based on a pre-trained classification network and is defined as follows:
$$\mathrm{RID} = \exp\Bigl\{ \mathbb{E}_{x \sim \mathcal{D}}\, D_{KL}\bigl( p(y \mid x) \,\|\, p(y \mid G_\theta(E_\varphi(x))) \bigr) \Bigr\}, \tag{18}$$
where $p(y \mid x)$ is a pre-trained classifier that estimates the label distribution given an image (here $y$ denotes a class label) and $\mathcal{D}$ is the set of evaluated images. Similar to Salimans et al. (2016), we use a pre-trained Inception Network (Szegedy et al., 2016) to calculate the softmax outputs.
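Given classifier outputs, a RID-style score can be computed in a few lines. The sketch below assumes that RID is the exponential of the mean KL divergence between the label distribution of each source image and that of its reconstruction, which is consistent with a value close to 1 meaning almost unchanged content. The function names are hypothetical, the inputs are arrays of softmax probabilities of shape (N, classes), and the split helper mirrors the common practice of reporting a mean and standard deviation over ten sequential parts of the test set.

```python
import numpy as np

def rid(p_src, p_rec, eps=1e-12):
    """exp of the mean KL(p(y|x) || p(y|x_rec)) over N images."""
    kl = np.sum(p_src * (np.log(p_src + eps) - np.log(p_rec + eps)), axis=1)
    return float(np.exp(np.mean(kl)))

def rid_mean_std(p_src, p_rec, n_splits=10):
    """Mean and standard deviation of RID over sequential, unshuffled splits."""
    scores = [rid(a, b) for a, b in zip(np.array_split(p_src, n_splits),
                                        np.array_split(p_rec, n_splits))]
    return float(np.mean(scores)), float(np.std(scores))

# Toy softmax outputs for 20 images and 3 classes (illustrative values only).
p = np.tile(np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]), (10, 1))
same = rid(p, p)              # identical predictions: KL is zero
changed = rid(p, p[:, ::-1])  # class probabilities reversed: content changed
mean_same, std_same = rid_mean_std(p, p)
print(same, changed, mean_same, std_same)
```

In a real evaluation `p_src` and `p_rec` would come from the same pre-trained network applied to the source images and to their reconstructions.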
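Table 2: RMSE and RID of reconstructions on CIFAR-10.

| Model | RMSE | RID |
|---|---|---|
| AUG | 8.89 | 1.57 ± 0.02 |
| VAE | 5.85 | 44.33 ± 2.27 |
| AGE | 6.675 | 19.02 ± 0.84 |
| PAGANs | 8.12 | 13.01 ± 0.82 |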
A low RID indicates that the content did not change after reconstruction. To calculate standard deviations, we use the same approach as for IS and split the test set into 10 equal parts (the split is done sequentially, without shuffling). Moreover, RID is robust to augmentations that do not change the visual content, and in this sense it is much better than RMSE. To compare the new metric with RMSE, we train a vanilla VAE with a ResNet-like architecture on CIFAR-10. We compute RID for its reconstructions and for real images with the augmentation (mirror 10% pad + random crop). Table 2 shows that RMSE for the VAE is better than for augmented images (AUG), even though its reconstructions are unsatisfactory (see Figure 8 in Appendix D.4); Figure 3 provides even more convincing results. RID allows a fairer comparison: for the VAE it is dramatically higher (44.33) than for AUG (1.57). The value 1.57 for AUG indicates that the KL divergence is close to zero and thus that the content is almost unchanged. We also provide RID and RMSE estimates for AGE, for which a pretrained model is publicly available (https://github.com/DmitryUlyanov/AGE). From Table 2 we see that PAGANs outperform AGE in RID, which indicates that our model has better reconstruction quality.
Importance of adversarial loss
To demonstrate the importance of the adversarial loss, we replace it with the standard $L_1$ pixel-wise distance between source images and the corresponding reconstructions and compare the FID, IS and RID metrics.
Using an augmentation in this setting is ambiguous, so we did not use any augmentation when training the modified model. Quantitative results for the experiment are provided in Table 3. The IS and FID results show that our model without the adversarial loss generates worse samples. Reconstruction quality, as measured by RID, also drops significantly. Visual results in Appendix D.1 confirm our quantitative findings.
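Table 3: Effect of the adversarial reconstruction loss and of the augmentation.

| Model | FID | IS | RID |
|---|---|---|---|
| PAGAN | 32.84 | 6.56 ± 0.06 | 13.01 ± 0.82 |
| PAGAN-L1 | 76.73 | 4.46 ± 0.03 | 30.94 ± 1.58 |
| PAGAN-NOAUG | 111.151 | 4.23 ± 0.06 | 50.15 ± 2.71 |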
Importance of augmentation
In the ALICE model (Li et al., 2017), an adversarial reconstruction loss was implemented without augmentation.
As we discussed in Section 1 its absence leads to undesirable effects.
Here we run an experiment to show that our model without augmentation performs worse.
Quantitative results provided in Table 3 illustrate that our model without an augmentation fails to recover both good reconstruction and generation properties. Visual comparisons can be found in Appendix D.2.
Using the results obtained from the last two experiments we conclude that adversarial reconstruction loss works significantly better with augmentation.
6 Conclusions
In this paper, we proposed a novel framework with an augmented adversarial reconstruction loss. We introduced RID to estimate the reconstruction quality of images. We showed empirically that this metric can perform a content-based comparison of reconstructed images. Using RID, we demonstrated the value of augmentation in our experiments. We showed that the augmented adversarial loss in this framework plays a key role in obtaining not only good reconstructions but also good generated images.
Some open questions are left for future work. More complex architectures may be used to achieve better IS and RID. The random shift augmentation may not be the only possible choice, and other choices remain unexplored.
References

Abadi et al. (2015) Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL https://www.tensorflow.org/. Software available from tensorflow.org.
Ali & Silvey (1966) Syed Mumtaz Ali and Samuel D Silvey. A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society. Series B (Methodological), pp. 131–142, 1966.
Arjovsky et al. (2017) Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In Doina Precup and Yee Whye Teh (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 214–223, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR. URL http://proceedings.mlr.press/v70/arjovsky17a.html.
Brock et al. (2017) Andrew Brock, Theodore Lim, James M Ritchie, and Nick Weston. Neural photo editing with introspective adversarial networks. ICLR, 2017.
Bromley et al. (1993) Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. Signature verification using a "siamese" time delay neural network. In Proceedings of the 6th International Conference on Neural Information Processing Systems, NIPS'93, pp. 737–744, San Francisco, CA, USA, 1993. Morgan Kaufmann Publishers Inc. URL http://dl.acm.org/citation.cfm?id=2987189.2987282.
Che et al. (2017) Tong Che, Yanran Li, Athul Paul Jacob, Yoshua Bengio, and Wenjie Li. Mode regularized generative adversarial networks. ICLR, 2017.
Creswell et al. (2017) Antonia Creswell, Anil A Bharath, and Biswa Sengupta. Conditional autoencoders with adversarial information factorization. arXiv preprint arXiv:1711.05175, 2017.
Donahue et al. (2017) Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. ICLR, 2017.
Dumoulin et al. (2017) Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier Mastropietro, Alex Lamb, Martin Arjovsky, and Aaron Courville. Adversarially learned inference. ICLR, 2017.
Gemici et al. (2018) Mevlana Gemici, Zeynep Akata, and Max Welling. Primal-dual Wasserstein GAN. arXiv preprint arXiv:1805.09575, 2018.
Gómez-Bombarelli et al. (2018) Rafael Gómez-Bombarelli, Jennifer N Wei, David Duvenaud, José Miguel Hernández-Lobato, Benjamín Sánchez-Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy D Hirzel, Ryan P Adams, and Alán Aspuru-Guzik. Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Science, 4(2):268–276, 2018.
Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
Gulrajani et al. (2017) Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved training of Wasserstein GANs. Mar 2017. URL http://arxiv.org/abs/1704.00028v3.
Heusel et al. (2017) Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Jun 2017. URL http://arxiv.org/abs/1706.08500v6. Advances in Neural Information Processing Systems 30 (NIPS 2017).
Kingma & Welling (2014) Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. ICLR, 2014.
Kingma et al. (2014) Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pp. 3581–3589, 2014.
Krizhevsky et al. (2009) Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. CIFAR-10 (Canadian Institute for Advanced Research). 2009. URL http://www.cs.toronto.edu/~kriz/cifar.html.
Larsen et al. (2015) Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. CoRR, 2015.
LeCun & Cortes (2010) Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010. URL http://yann.lecun.com/exdb/mnist/.
Li et al. (2017) Chunyuan Li, Hao Liu, Changyou Chen, Yuchen Pu, Liqun Chen, Ricardo Henao, and Lawrence Carin. ALICE: Towards understanding adversarial learning for joint distribution matching. In Advances in Neural Information Processing Systems, pp. 5495–5503, 2017.
Liu et al. (2015) Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015.
Makhzani et al. (2016) Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. ICLR, 2016.
Mescheder et al. (2017) Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. Adversarial variational Bayes: Unifying variational autoencoders and generative adversarial networks. In Doina Precup and Yee Whye Teh (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 2391–2400, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR. URL http://proceedings.mlr.press/v70/mescheder17a.html.
Miyato et al. (2018) Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. ICLR, 2018.
Nguyen et al. (2008) XuanLong Nguyen, Martin J Wainwright, and Michael I Jordan. Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization. In Advances in Neural Information Processing Systems, pp. 1089–1096, 2008.
Nowozin et al. (2016) Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pp. 271–279, 2016.
Pu et al. (2017a) Yuchen Pu, Weiyao Wang, Ricardo Henao, Liqun Chen, Zhe Gan, Chunyuan Li, and Lawrence Carin. Adversarial symmetric variational autoencoder. In Advances in Neural Information Processing Systems, pp. 4330–4339, 2017a.
Pu et al. (2017b) Yuchen Pu, Weiyao Wang, Ricardo Henao, Liqun Chen, Zhe Gan, Chunyuan Li, and Lawrence Carin. Adversarial symmetric variational autoencoder. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems 30, pp. 4330–4339. Curran Associates, Inc., 2017b. URL http://papers.nips.cc/paper/7020adversarialsymmetricvariationalautoencoder.pdf.
Rezende et al. (2014) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. ICML, 2014.
Rosca et al. (2017) Mihaela Rosca, Balaji Lakshminarayanan, David Warde-Farley, and Shakir Mohamed. Variational approaches for auto-encoding generative adversarial networks. arXiv preprint arXiv:1706.04987, 2017.
Salimans et al. (2016) Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pp. 2234–2242, 2016.
Szegedy et al. (2016) Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 2818–2826, 2016. doi: 10.1109/CVPR.2016.308. URL https://doi.org/10.1109/CVPR.2016.308.
Tolstikhin et al. (2017) Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. Wasserstein auto-encoders. Nov 2017. URL http://arxiv.org/abs/1711.01558v3.
Ulyanov et al. (2018) Dmitry Ulyanov, Andrea Vedaldi, and Victor S. Lempitsky. It takes (only) two: Adversarial generator-encoder networks. In AAAI. AAAI Press, 2018.
Villani (2008) Cédric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.
Zhu et al. (2017) Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Computer Vision (ICCV), 2017 IEEE International Conference on, 2017.
Appendix A Proofs
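A.1 Proof of Proposition 1 (optimal discriminator)
See Proposition 1 in Section 3.2.
Proof.
For a fixed generator and encoder, the value function $V_r$ with respect to the discriminator $D_\psi$ is
$$V_r(\psi) = \mathbb{E}_{p^*(x)\, r(y \mid x)} \log D_\psi(x, y) + \mathbb{E}_{p^*(x)\, p_{\theta, \varphi}(y \mid x)} \log\bigl(1 - D_\psi(x, y)\bigr). \tag{19}$$
Let us introduce new variables and notations
$$t = (x, y), \qquad p_1(t) = p^*(x)\, r(y \mid x), \qquad p_2(t) = p^*(x)\, p_{\theta, \varphi}(y \mid x). \tag{20}$$
Then
$$V_r(\psi) = \mathbb{E}_{p_1(t)} \log D_\psi(t) + \mathbb{E}_{p_2(t)} \log\bigl(1 - D_\psi(t)\bigr). \tag{21}$$
Using the results of Goodfellow et al. (2014) we obtain
$$D_{\psi^*}(t) = \frac{p_1(t)}{p_1(t) + p_2(t)} = \frac{r(y \mid x)}{r(y \mid x) + p_{\theta, \varphi}(y \mid x)}. \tag{22}$$
∎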
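A.2 Proof of Proposition 2
See Proposition 2 in Section 3.2.
Proof.
As in Goodfellow et al. (2014), we rewrite the value function for the optimal discriminator as follows, writing $r = r(y \mid x)$, $p = p_{\theta, \varphi}(y \mid x)$ and $m = (r + p)/2$ for brevity:
$$V_r(\theta, \varphi, \psi^*) = \mathbb{E}_{p^*(x)\, r} \log \frac{r}{r + p} + \mathbb{E}_{p^*(x)\, p} \log \frac{p}{r + p} \tag{23}$$
$$= -\log 4 + \mathbb{E}_{p^*(x)} \Bigl[ \mathbb{E}_{r} \log \frac{r}{m} + \mathbb{E}_{p} \log \frac{p}{m} \Bigr] \tag{24}$$
$$= -\log 4 + \mathbb{E}_{p^*(x)} \Bigl[ D_{KL}(r \,\|\, m) + D_{KL}(p \,\|\, m) \Bigr] \tag{25}$$
$$= -\log 4 + 2\,\mathbb{E}_{p^*(x)}\, \mathrm{JSD}\bigl(r(y \mid x) \,\|\, p_{\theta, \varphi}(y \mid x)\bigr). \tag{26}$$
∎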
Appendix B Extending PAGANs
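B.1 f-divergence PAGANs
$f$-GANs (Nowozin et al., 2016) are a generalization of the GAN approach. Nowozin et al. (2016) introduce a model which minimizes the $f$-divergence $D_f$ (Ali & Silvey, 1966) between the true distribution $p^*(x)$ and the model distribution $p_\theta(x)$, i.e. it solves the optimization problem
$$\min_\theta D_f(p^* \,\|\, p_\theta) = \min_\theta \int p_\theta(x)\, f\!\left(\frac{p^*(x)}{p_\theta(x)}\right) dx, \tag{27}$$
where $f: \mathbb{R}_+ \to \mathbb{R}$ is a convex, lower-semicontinuous function satisfying $f(1) = 0$.
The minimax game for $f$-GANs is defined as
$$\min_\theta \max_\psi V(\theta, \psi) = \mathbb{E}_{p^*(x)} T_\psi(x) - \mathbb{E}_{p_\theta(x)} f^*\bigl(T_\psi(x)\bigr), \tag{28}$$
where $V(\theta, \psi)$ is a value function and $f^*$ is the Fenchel conjugate of $f$ (Nguyen et al., 2008). For fixed parameters $\theta$, the optimal $T_{\psi^*}(x)$ is $f'\bigl(p^*(x)/p_\theta(x)\bigr)$. Then the value function for optimal parameters $\psi^*$ equals the $f$-divergence between the distributions $p^*$ and $p_\theta$ (Nguyen et al., 2008), i.e.
$$V(\theta, \psi^*) = D_f\bigl(p^*(x) \,\|\, p_\theta(x)\bigr). \tag{29}$$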
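The statement that the value function at the optimal $T$ equals the $f$-divergence can be verified numerically for one concrete choice of $f$. The sketch below is illustrative: it assumes the KL case $f(u) = u \log u$, for which $f^*(t) = \exp(t - 1)$ and $f'(u) = \log u + 1$, and uses two arbitrary discrete distributions with hypothetical names `p_true` and `p_model`.

```python
import numpy as np

def f_conj_kl(t):
    # Fenchel conjugate of f(u) = u log u.
    return np.exp(t - 1.0)

def f_prime_kl(u):
    # Derivative of f(u) = u log u.
    return np.log(u) + 1.0

def f_gan_value(p_true, p_model, t):
    # E_{p_true}[T] - E_{p_model}[f*(T)] for a discrete variational function t.
    return float(np.sum(p_true * t) - np.sum(p_model * f_conj_kl(t)))

rng = np.random.default_rng(0)
p_true = rng.dirichlet(np.ones(6))
p_model = rng.dirichlet(np.ones(6))

t_opt = f_prime_kl(p_true / p_model)              # optimal variational function
kl = float(np.sum(p_true * np.log(p_true / p_model)))
v_opt = f_gan_value(p_true, p_model, t_opt)
v_other = f_gan_value(p_true, p_model, t_opt + 0.3)  # any other T gives a lower bound
print(v_opt, kl, v_other)
```

For this choice of $f$ the $f$-divergence is the KL divergence from the model distribution to the true one, so `v_opt` and `kl` coincide, while `v_other` is strictly smaller.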
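We can straightforwardly extend the definition of the PAGANs model to $f$-PAGANs. For each matching problem we introduce the $f$-GAN value function with its own variational function $T$, i.e.

generator matching:
$$\min_\theta \max_{\psi_x} V_x^f(\theta, \psi_x), \tag{30}$$
$$V_x^f(\theta, \psi_x) = \mathbb{E}_{p^*(x)} T_{\psi_x}(x) - \mathbb{E}_{p_\theta(x)} f^*\bigl(T_{\psi_x}(x)\bigr); \tag{31}$$
encoder matching:
$$\min_\varphi \max_{\psi_z} V_z^f(\varphi, \psi_z), \tag{32}$$
$$V_z^f(\varphi, \psi_z) = \mathbb{E}_{p(z)} T_{\psi_z}(z) - \mathbb{E}_{q_\varphi(z)} f^*\bigl(T_{\psi_z}(z)\bigr); \tag{33}$$
reconstruction matching:
$$\min_{\theta, \varphi} \max_\psi V_r^f(\theta, \varphi, \psi), \tag{34}$$
$$V_r^f(\theta, \varphi, \psi) = \mathbb{E}_{p^*(x)\, r(y \mid x)} T_\psi(x, y) - \mathbb{E}_{p^*(x)\, p_{\theta, \varphi}(y \mid x)} f^*\bigl(T_\psi(x, y)\bigr). \tag{35}$$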
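B.2 Wasserstein PAGANs
Arjovsky et al. (2017) proposed the WGAN model for minimizing the Wasserstein-1 distance between the distributions $p^*(x)$ and $p_\theta(x)$, i.e.
$$W_1(p^*, p_\theta) = \inf_{\gamma \in \Pi(p^*, p_\theta)} \mathbb{E}_{(x, x') \sim \gamma} \|x - x'\|, \tag{36}$$
where $\Pi(p^*, p_\theta)$ is the set of joint distributions with marginals $p^*$ and $p_\theta$.
Because this distance is intractable, they consider solving the Kantorovich-Rubinstein dual problem (Villani, 2008)
$$W_1(p^*, p_\theta) = \sup_{\|f\|_L \le 1} \mathbb{E}_{p^*(x)} f(x) - \mathbb{E}_{p_\theta(x)} f(x), \tag{37}$$
where the supremum is taken over all 1-Lipschitz functions $f$.
As in Section B.1, we can easily extend the PAGANs model to WPAGANs. In each matching problem the corresponding distance between distributions is the Wasserstein-1 distance.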
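The relation between the primal and dual forms can be illustrated in one dimension, where the Wasserstein-1 distance between two equal-size empirical samples reduces to the mean absolute difference of the sorted samples. The sketch below is a toy illustration only (the function names are hypothetical, and a real critic would be a neural network): every 1-Lipschitz critic gives a lower bound, and for a pure shift the identity critic attains the distance.

```python
import numpy as np

def w1_empirical_1d(a, b):
    # Exact W1 between two equal-size 1-D empirical distributions.
    return float(np.mean(np.abs(np.sort(a) - np.sort(b))))

def dual_value(samples_true, samples_model, critic):
    # E_{p*}[f(x)] - E_{p_theta}[f(x)] for a given critic f.
    return float(np.mean(critic(samples_true)) - np.mean(critic(samples_model)))

rng = np.random.default_rng(0)
model = rng.normal(size=1000)
true = model + 2.0                                 # true samples are shifted by 2

w1 = w1_empirical_1d(true, model)
tight = dual_value(true, model, lambda x: x)       # identity is 1-Lipschitz
loose = dual_value(true, model, np.tanh)           # tanh is also 1-Lipschitz
print(w1, tight, loose)
```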
Appendix C Other models and experiment details
C.1 Training Wasserstein PAGAN
As an alternative approach to matching implicit distributions we can use the Wasserstein distance. Recent empirical works showed promising results (Gulrajani et al., 2017; Gemici et al., 2018), so it is an interesting point of comparison. As mentioned above, we still need a critic that works on pairs of images. Unlike in the standard GAN framework, a strong critic is desirable here. Channel-wise concatenation of pairs worked best in terms of visual quality and training stability. As a default choice to improve the Wasserstein distance optimization, we applied the gradient penalty proposed in Gulrajani et al. (2017). To apply the gradient penalty to a critic on pairs we have to interpolate between a real pair $(x_1, y_1)$ and a fake pair $(x_2, y_2)$. There are two choices:
shared alpha
$$\hat{x} = \alpha x_1 + (1 - \alpha) x_2, \qquad \hat{y} = \alpha y_1 + (1 - \alpha) y_2, \qquad \alpha \sim U[0, 1]; \tag{38}$$
independent alpha for each part
$$\hat{x} = \alpha_1 x_1 + (1 - \alpha_1) x_2, \qquad \hat{y} = \alpha_2 y_1 + (1 - \alpha_2) y_2, \qquad \alpha_1, \alpha_2 \sim U[0, 1]. \tag{39}$$
Empirically we found no difference in results, and in further experiments we used the shared alpha as the default choice. The gradient penalty strength parameter was set to 10, as recommended by Gulrajani et al. (2017). We used 10 discriminator steps per generator/encoder step for WPAGAN to slightly improve quality in this setting; other parameters were unchanged. In Table 4 we present results for the Wasserstein loss used instead of the standard GAN objective in the PAGAN model. While it gives good reconstructions, this type of loss failed to achieve good generation results.
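The two interpolation schemes can be sketched as follows. This is a hedged NumPy illustration with a hypothetical helper name: it only builds the interpolated pairs on which a gradient penalty would be evaluated, since computing the critic's gradient norm requires an automatic differentiation framework and is omitted here. Batches are assumed to have shape (N, H, W, C), with one interpolation coefficient per example.

```python
import numpy as np

def interpolate_pairs(real_pair, fake_pair, rng, shared_alpha=True):
    """Interpolate between a batch of real pairs and a batch of fake pairs."""
    (x_real, y_real), (x_fake, y_fake) = real_pair, fake_pair
    n = x_real.shape[0]
    shape = (n,) + (1,) * (x_real.ndim - 1)      # one coefficient per example
    alpha_x = rng.uniform(size=shape)
    # Shared: reuse the same coefficient for both parts of the pair.
    # Independent: draw a second coefficient for the second part.
    alpha_y = alpha_x if shared_alpha else rng.uniform(size=shape)
    x_hat = alpha_x * x_real + (1.0 - alpha_x) * x_fake
    y_hat = alpha_y * y_real + (1.0 - alpha_y) * y_fake
    return x_hat, y_hat

rng = np.random.default_rng(0)
zeros = np.zeros((4, 8, 8, 3))   # stand-in for real images
ones = np.ones((4, 8, 8, 3))     # stand-in for fake images
xs, ys = interpolate_pairs((zeros, zeros), (ones, ones), rng, shared_alpha=True)
xi, yi = interpolate_pairs((zeros, zeros), (ones, ones), rng, shared_alpha=False)
print(np.allclose(xs, ys), np.allclose(xi, yi))
```

With a critic that consumes channel-wise concatenated pairs, `x_hat` and `y_hat` would be concatenated along the channel axis before the penalty is computed.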
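Table 4: Results for PAGAN trained with the Wasserstein loss.

| Model | FID | IS | RID |
|---|---|---|---|
| WPAGAN | 52.29 | 5.62 ± 0.09 | 13.44 ± 0.44 |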