1 Introduction
Deep (Variational) Autoencoders (AEs Bengio09 and VAEs Kingma13; Rezende14) and deep Generative Adversarial Networks (GANs Goodfellow14) are two of the most popular approaches to generative learning. These methods have complementary strengths and weaknesses. VAEs can learn a bidirectional mapping between a complex data distribution and a much simpler prior distribution, allowing both generation and inference; in contrast, the original formulation of GANs learns a unidirectional mapping that only allows sampling from the data distribution. On the other hand, GANs use more complex loss functions than the simplistic data-fitting losses in (V)AEs and can usually generate more realistic samples.
Several recent works have looked for hybrid approaches that support, in a principled way, both sampling and inference as AEs do, while producing samples of quality comparable to GANs. Typically this is achieved by training an AE jointly with one or more adversarial discriminators whose purpose is to improve the alignment of distributions in the latent space Brock16; Makhzani15, the data space Che16; Larsen15, or the joint (product) latent-data space Donahue16; Dumoulin16. Alternatively, the method of Zhu16 starts by learning a unidirectional GAN, and then learns a corresponding inverse mapping (the encoder) post hoc.
While compounding autoencoding and adversarial discrimination does improve GANs and VAEs, it does so at the cost of added complexity. In particular, each of these systems involves at least three deep mappings: an encoder, a decoder/generator, and a discriminator. In this work, we show that this is unnecessary and that the advantages of autoencoders and adversarial training can be combined without increasing the complexity of the model.
In order to do so, we propose a new architecture, called an Adversarial Generator-Encoder (AGE) Network (section 2), that contains only two feed-forward mappings, the encoder and the generator, operating in opposite directions. As in VAEs, the generator maps a simple prior distribution in latent space to the data space, while the encoder is used to move both the real and generated data samples into the latent space. In this manner, the encoder induces two latent distributions, corresponding respectively to the encoded real data and the encoded generated data. The AGE learning process then considers the divergence of each of these two distributions to the original prior distribution.
There are two advantages of this approach. First, due to the simplicity of the prior distribution, computing its divergence to the latent data distributions reduces to the calculation of simple statistics over small batches of images. Second, unlike GAN-like approaches, real and generated distributions are never compared directly, thus bypassing the need for discriminator networks as used by GANs. Instead, the adversarial signal in AGE comes from training the encoder to increase the divergence between the latent distribution of the generated data and the prior, which works against the generator, which tries to decrease the same divergence (Figure 1). Optionally, AGE training may include reconstruction losses typical of AEs.
The AGE approach is evaluated (section 3) on a number of standard image datasets, where we show that the quality of generated samples is comparable to that of GANs Goodfellow14; Radford15, and the quality of reconstructions is comparable to or better than that of the more complex Adversarially-Learned Inference (ALI) approach of Dumoulin16, while training faster. We further evaluate the AGE approach in the conditional setting, where we show that it can successfully tackle the colorization problem that is known to be difficult for GAN-based approaches. Our findings are summarized in section 4.
Other related work. Apart from the above-mentioned approaches, AGE networks can be related to several other recent GAN-based systems. In particular, they are related to improved GANs Salimans16, which proposed to use batch-level information in order to prevent mode collapse. The divergences within AGE training are also computed as batch-level statistics.
Another avenue for improving the stability of GANs has been the replacement of the classifying discriminator with a regression-based one, as in energy-based GANs Zhao16 and Wasserstein GANs Arjovsky17. Our statistics (the divergence from the prior distribution) can be seen as a very special form of regression. In this way, the encoder in the AGE architecture can be (with some reservations) seen as a discriminator computing a single number, similarly to how it is done in Arjovsky17; Zhao16.
2 Adversarial Generator-Encoder Networks
This section introduces our Adversarial Generator-Encoder (AGE) networks. An AGE is composed of two parametric mappings: the encoder, with its learnable parameters, which maps the data space to the latent space, and the generator, with its own learnable parameters, which runs in the opposite direction. We will use a shorthand notation to denote the distribution of a random variable transformed by one of these mappings.
The reference distribution in the latent space is chosen so that it is easy to sample from, which in turn allows unconditional sampling of data by first drawing a latent sample and then evaluating the generator in a feed-forward pass, exactly as is done in GANs. In our experiments, we pick the latent space to be a sphere, and the latent distribution to be the uniform distribution on that sphere. We have also conducted some experiments with the unit Gaussian distribution in the Euclidean space and have obtained results comparable in quality.
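As an illustration of why such a prior is easy to sample from, the following minimal sketch draws uniform samples on the unit sphere by normalizing Gaussian vectors, which is a standard construction. The function name `sample_sphere_prior`, the batch size, and the dimensionality `dim` are illustrative assumptions, not values taken from the experiments.

```python
import numpy as np

def sample_sphere_prior(batch_size, dim, rng=None):
    """Draw samples uniformly from the unit sphere embedded in R^dim."""
    rng = np.random.default_rng() if rng is None else rng
    # An isotropic Gaussian is rotation-invariant, so its normalized
    # samples are uniformly distributed on the sphere.
    z = rng.standard_normal((batch_size, dim))
    return z / np.linalg.norm(z, axis=1, keepdims=True)

# Hypothetical batch size and latent dimensionality, for illustration only.
z = sample_sphere_prior(batch_size=64, dim=8, rng=np.random.default_rng(0))
```

Passing such a batch through the generator would then yield unconditional data samples.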
The goal of learning an AGE is to align the real data distribution to the generated distribution while establishing a correspondence between data samples and latent samples. The real data distribution is empirical and represented by a large number of data samples. Learning amounts to tuning the parameters of the encoder and the generator to optimize the AGE criterion, discussed in section 2.1. This criterion is based on an adversarial game whose saddle points correspond to networks that align the real and generated data distributions. The criterion is augmented with additional terms that encourage the reciprocity of the encoder and the generator (section 2.2). The details of the training procedure are given in section 2.3.
2.1 Adversarial distribution alignment
The GAN approach to aligning two distributions is to define an adversarial game based on a ratio of probabilities Goodfellow14. The ratio is estimated by repeatedly fitting a binary classifier that distinguishes between samples obtained from the real and generated data distributions. Here, we propose an alternative adversarial setup with some advantages over that of GANs, including avoiding generator collapse Goodfellow17.
The goal of AGE is to generate a distribution in data space that is close to the true data distribution. However, direct matching of the distributions in the high-dimensional data space, as done in GANs, can be challenging. We propose instead to move this comparison to the simpler latent space. This is done by introducing a divergence measure between distributions defined in the latent space. We only require this divergence to be non-negative and zero if, and only if, the distributions are identical (we do not require the divergence to be a distance). The encoder maps the real and generated distributions defined in data space to corresponding distributions in the latent space. Below, we show how to design an adversarial criterion such that minimizing the divergence in latent space induces the real and generated distributions to align in data space as well.
In the theoretical analysis below, we assume that encoders and decoders span the class of all measurable mappings between the corresponding spaces. This assumption, often referred to as the non-parametric limit, is justified by the universality of neural networks Hornik1989359. We further assume that there exists at least one “perfect” generator that matches the data distribution.
We start by considering a simple game with objective defined as:
(1) 
As the following theorem shows, perfect generators form saddle points (Nash equilibria) of the game (1) and all saddle points of the game (1) are based on perfect generators.
Theorem 1.
A generator-encoder pair forms a saddle point of the game (1) if and only if the generator matches the data distribution.
The proofs of this and the following theorems are given in the supplementary material.
While the game (1) is sufficient for aligning distributions in the data space, finding such saddle points is difficult due to the need to compare two empirical (hence non-parametric) distributions. We can avoid this issue by introducing an intermediate reference distribution and comparing both distributions to that instead, resulting in the game:
(2) 
Importantly, (2) still induces alignment of real and generated distributions in data space:
Theorem 2.
The important benefit of formulation (2) is that, if the reference distribution is selected in a suitable manner, it is simple to compute its divergence to the two empirical distributions. For convenience, in particular, we choose the reference distribution to coincide with the “canonical” (prior) latent distribution. After this substitution in objective (2), the loss can be extended to include reconstruction terms that can improve the quality of the result. It can also be optimized by using stochastic approximations as described in section 2.3.
Given a distribution in data space, the encoder and the divergence can be interpreted as extracting statistics from it. Hence, game (2) can be thought of as comparing certain statistics of the real and generated data distributions. Similarly to GANs, these statistics are not fixed but evolve during learning.
We also note that, even away from the saddle point, the minimization over the generator for a fixed encoder does not tend to collapse for many reasonable choices of divergence (e.g. KL-divergence). In fact, any collapsed distribution would inevitably lead to a very high value of the first term in (2). Thus, unlike GANs, our approach can optimize the generator for a fixed adversary until convergence and obtain a non-degenerate solution. On the other hand, the maximization over the encoder for a fixed generator can lead to an unbounded score for some divergences.
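For concreteness, the two games of this section can be written out as follows. This is a sketch in assumed notation, consistent with the verbal description rather than a quotation: $g$ and $e$ denote the generator and the encoder, $X$ and $Z$ the data and prior distributions, $Y$ the reference distribution, $\Delta$ the divergence, and $f(A)$ the distribution obtained by applying a mapping $f$ to samples of $A$.

```latex
% Game (1): compare encoded generated data with encoded real data directly.
\max_{e}\,\min_{g}\; V_1(g, e) = \Delta\bigl(e(g(Z)) \,\|\, e(X)\bigr)

% Game (2): compare both encoded distributions with the reference Y instead.
\max_{e}\,\min_{g}\; V_2(g, e)
  = \Delta\bigl(e(g(Z)) \,\|\, Y\bigr) - \Delta\bigl(e(X) \,\|\, Y\bigr)
```

Written this way, the generator influences only the first term of $V_2$, which is the term that grows for collapsed generated distributions, while the encoder is rewarded for keeping the encoded real data close to $Y$ and pushing the encoded generated data away from it.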
2.2 Encoder-generator reciprocity and reconstruction losses
In the previous section we have demonstrated that finding a saddle point of (2) is sufficient to align the real and generated data distributions and thus generate realistic-looking data samples. At the same time, this by itself does not necessarily imply that the encoder and generator mappings are reciprocal. Reciprocity, however, can be desirable if one wishes to reconstruct samples from their codes.
In this section, we introduce losses that encourage encoder and generator to be reciprocal. Reciprocity can be measured either in the latent space or in the data space, resulting in the loss functions based on reconstruction errors, e.g.:
(3)  
(4) 
Both losses (3) and (4) thus encourage the reciprocity of the two mappings. Note also that (3) is the traditional pixel-wise loss used within AEs (the L1 loss was preferred, as it is known to perform better in image synthesis tasks with deep architectures).
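The two reconstruction losses can be sketched as follows. The data-space loss uses the L1 norm, as stated above; the squared Euclidean norm in the latent-space loss is an assumption of this sketch, as are the function names and the toy linear `encoder` and `generator` used to exercise them.

```python
import numpy as np

def data_space_loss(x, encoder, generator):
    """Mean L1 error between data x and its round trip generator(encoder(x))."""
    return np.mean(np.sum(np.abs(x - generator(encoder(x))), axis=1))

def latent_space_loss(z, encoder, generator):
    """Mean squared L2 error between codes z and encoder(generator(z)) (assumed norm)."""
    return np.mean(np.sum((z - encoder(generator(z))) ** 2, axis=1))

# Toy stand-ins (hypothetical): a linear generator R^2 -> R^4 and its
# pseudo-inverse as the encoder.
rng = np.random.default_rng(0)
W = rng.standard_normal((2, 4))
W_pinv = np.linalg.pinv(W)
generator = lambda z: z @ W
encoder = lambda x: x @ W_pinv

z = rng.standard_normal((16, 2))
x_generated = generator(z)                  # data the generator can produce
x_arbitrary = rng.standard_normal((16, 4))  # data outside the generator's range

loss_z = latent_space_loss(z, encoder, generator)
loss_x_generated = data_space_loss(x_generated, encoder, generator)
loss_x_arbitrary = data_space_loss(x_arbitrary, encoder, generator)
```

In this toy example the latent-space loss vanishes and generated data are reconstructed exactly, whereas arbitrary data outside the generator's range are not.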
A natural question then is whether it is helpful to minimize both losses (3) and (4) at the same time or whether considering only one is sufficient. The answer is given by the following statement:
Theorem 3.
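Let the latent and data distributions be aligned by the generator (i.e. the generated distribution coincides with the data distribution) and let the latent-space reconstruction loss (4) be zero. Then, for samples drawn from the latent and the data distributions, encoding a generated sample recovers its latent code and generating from an encoded data sample recovers that sample almost certainly, i.e. the generator and the encoder invert each other almost everywhere on the supports of the two distributions. Furthermore, the encoder aligns the data distribution with the latent distribution.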
2.3 Training AGE networks
Based on the theoretical analysis derived in the previous subsections, we now suggest an approach to the joint training of the generator and the encoder within AGE networks. As in the case of GAN training, we set up the learning process for an AGE network as a game with iterative updates over the generator and encoder parameters that are driven by the optimization of different objectives. In general, the optimization process combines the maximin game for the functional (2) with the optimization of the reciprocity losses (3) and (4).
In particular, we use the following game objectives for the generator and the encoder:
(5)  
(6) 
where the encoder and generator parameters that are not being optimized are held at their values at the moment of the optimization, and the weights of the reconstruction losses are user-defined parameters. Note that both objectives (5), (6) include only one of the reconstruction losses. Specifically, the generator objective includes only the latent space reconstruction loss. In the experiments, we found that the omission of the other reconstruction loss (in the data space) is important to avoid possible blurring of the generator outputs that is characteristic of autoencoders. Similarly to GANs, in (5), (6) we perform only several steps toward the optimum at each iteration, thus alternating between generator and encoder updates.
By maximizing the difference between the divergence of the encoded generated data from the prior and the divergence of the encoded real data from the prior, the optimization process (6) focuses on the maximization of the mismatch between the real data distribution and the distribution of the samples from the generator. Informally speaking, the optimization (6) forces the encoder to find the mapping that aligns the real data distribution with the target distribution, while mapping non-real (synthesized) data away from it. When the target is a uniform distribution on a sphere, the goal of the encoder would be to spread the real data uniformly over the sphere, while cramming as much of the synthesized data as possible together, ensuring non-uniformity of its latent distribution.
Any differences (misalignment) between the two distributions are thus amplified by the optimization process (6) and force the optimization process (5) to focus specifically on removing these differences. Since the misalignment between the real and generated distributions is measured after projecting the two distributions into the latent space, the maximization of this misalignment makes the encoder compute features that distinguish the two distributions.
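The alternating objectives described above can be sketched as below. This is a schematic under stated assumptions, not the training code: `divergence` stands for any batch-level divergence to the prior (here the KL divergence of a diagonal Gaussian fit from the unit Gaussian), `lam` and `mu` are hypothetical names for the user-defined weights, and the squared L2 norm in the latent loss is assumed. A real implementation would differentiate these scalars with an automatic differentiation framework and take a few gradient steps on each in turn.

```python
import numpy as np

def divergence(codes, eps=1e-8):
    """KL(N(m, diag(s^2)) || N(0, I)) for a Gaussian fitted to a batch of codes."""
    m, s = codes.mean(axis=0), codes.std(axis=0) + eps
    return np.sum((s ** 2 + m ** 2) / 2.0 - np.log(s)) - codes.shape[1] / 2.0

def game_value(x, z, encoder, generator):
    """Divergence of encoded generated data minus divergence of encoded real data."""
    return divergence(encoder(generator(z))) - divergence(encoder(x))

def generator_objective(x, z, encoder, generator, lam):
    """Minimized by the generator: game value plus latent reconstruction loss."""
    loss_z = np.mean(np.sum((z - encoder(generator(z))) ** 2, axis=1))
    return game_value(x, z, encoder, generator) + lam * loss_z

def encoder_objective(x, z, encoder, generator, mu):
    """Maximized by the encoder: game value minus data reconstruction loss."""
    loss_x = np.mean(np.sum(np.abs(x - generator(encoder(x))), axis=1))
    return game_value(x, z, encoder, generator) - mu * loss_x

# Toy check with hypothetical mappings: identity maps give a zero game value,
# while a generator that shrinks everything toward one point is penalized.
rng = np.random.default_rng(0)
z = rng.standard_normal((256, 4))
identity = lambda a: a
shrink = lambda a: 0.01 * a

v_aligned = game_value(z, z, identity, identity)
v_collapsed = game_value(z, z, identity, shrink)
```

The collapsed generator receives a much larger game value than the aligned one, which matches the remark that a collapsed distribution inflates the first term of the game.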
3 Experiments
Figure 2: Samples (b) and reconstructions (c) for the Tiny ImageNet dataset (top) and the SVHN dataset (bottom). The results of ALI Dumoulin16 on the same datasets are shown in (d). In (c, d), odd columns show real examples and even columns show their reconstructions. Qualitatively, our method seems to obtain more accurate reconstructions than ALI Dumoulin16, especially on the Tiny ImageNet dataset, while having samples of similar visual quality.
We have validated AGE networks in two settings. A more traditional setting involves unconditional generation and reconstruction, where we consider a number of standard image datasets. We have also evaluated AGE networks in the conditional setting. Here, we tackle the problem of image colorization, which is hard for GANs. In this setting, we condition both the generator and the encoder on the grayscale image. Taken together, our experiments demonstrate the versatility of the AGE approach.
3.1 Unconditionally-trained AGE networks
Network architectures: In our experiments, the generator and the encoder networks have a similar structure to the generator and the discriminator in DCGAN Radford15. To turn the discriminator into the encoder, we have modified it to output a vector of the latent-space dimensionality and replaced the final sigmoid layer with a normalization layer that projects the points onto the sphere.
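The normalization layer mentioned above amounts to dividing each output vector by its Euclidean norm. A minimal sketch, where the function name and the small `eps` guard against division by zero are assumptions of this example:

```python
import numpy as np

def project_to_sphere(features, eps=1e-12):
    """Rescale each row of `features` to unit Euclidean norm."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    return features / np.maximum(norms, eps)

# Two raw encoder outputs (made-up values) projected onto the unit circle.
codes = project_to_sphere(np.array([[3.0, 4.0], [0.0, -2.0]]))
```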
Divergence measure: As we need to measure the divergence between the empirical distribution and the prior distribution in the latent space, we have used the following measure. Given a set of samples on the sphere, we fit a Gaussian distribution with diagonal covariance matrix in the embedding $M$-dimensional space and we compute the KL-divergence of this Gaussian from the unit Gaussian as
$$\mathrm{KL}\bigl(\mathcal{N}(m, \operatorname{diag}(s^2)) \,\|\, \mathcal{N}(0, I)\bigr) = -\frac{M}{2} + \sum_{j=1}^{M} \left( \frac{s_j^2 + m_j^2}{2} - \log s_j \right) \qquad (7)$$
where $m_j$ and $s_j$ are the means and the standard deviations of the fitted Gaussian along the $M$ dimensions. Since the uniform distribution on the sphere will entail the lowest possible divergence with the unit Gaussian in the embedding space among all distributions on the unit sphere, such a divergence measure is valid for our analysis above. We have also tried to measure the same divergence non-parametrically using the Kozachenko-Leonenko estimator Kozachenko87. In our initial experiments, both versions worked equally well, and we used the simpler parametric estimator in the presented experiments.
Hyperparameters: We use the ADAM Kingma14 optimizer with the learning rate of . We perform two generator updates per one encoder update for all datasets. For each dataset we tried and picked the best one. We ended up using for all datasets. The dimensionality of the latent space was manually set according to the complexity of the dataset. We thus used for CelebA and SVHN datasets, and for the more complex datasets of Tiny ImageNet and CIFAR-10.
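The parametric divergence measure described above can be sketched as follows: fit a diagonal Gaussian to a batch of latent codes and evaluate its KL divergence from the unit Gaussian in closed form. The function name, the `eps` stabilizer, and the batch and latent sizes are assumptions of this sketch.

```python
import numpy as np

def kl_to_unit_gaussian(codes, eps=1e-8):
    """KL(N(m, diag(s^2)) || N(0, I)), with m and s estimated per dimension."""
    m = codes.mean(axis=0)
    s = codes.std(axis=0) + eps
    dim = codes.shape[1]
    return -dim / 2.0 + np.sum((s ** 2 + m ** 2) / 2.0 - np.log(s))

rng = np.random.default_rng(0)

# Codes spread roughly uniformly over the unit sphere in R^16.
spread = rng.standard_normal((512, 16))
spread /= np.linalg.norm(spread, axis=1, keepdims=True)

# Codes concentrated near a single point of the same sphere.
clumped = 0.05 * spread + np.array([1.0] + [0.0] * 15)
clumped /= np.linalg.norm(clumped, axis=1, keepdims=True)

kl_spread = kl_to_unit_gaussian(spread)
kl_clumped = kl_to_unit_gaussian(clumped)
```

The spread-out batch receives the lower value, while the concentrated batch is heavily penalized. Both values stay positive, since the per-dimension standard deviation of points on the unit sphere is well below one.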
Results: We evaluate unconditional AGE networks on several standard datasets, while treating the system of Dumoulin16 as the most natural reference for comparison (as the closest three-component counterpart to our two-component system). The results for Dumoulin16 are either reproduced with the authors’ code or copied from Dumoulin16.
In Figure 2, we present the results on the challenging Tiny ImageNet dataset RussakovskyDSKS15 and the SVHN dataset Netzer. We show both samples obtained from the generator and reconstructions alongside the corresponding real data samples. We also show the reconstructions obtained by Dumoulin16 for comparison. Inspection reveals that the reconstruction fidelity of Dumoulin16 is considerably lower for the Tiny ImageNet dataset.
[Figure 3 column labels, repeated for three groups of examples: Orig. | AGE 10 ep. | ALI 10 ep. | ALI 100 ep. | VAE]
In Figure 3, we further compare the reconstructions of CelebA LiuLWT15 images obtained by the AGE network, ALI Dumoulin16, and VAE Kingma13. Overall, the fidelity and the visual quality of AGE reconstructions are roughly comparable to or better than those of ALI. Furthermore, ALI takes notably longer to converge than our method, and the reconstructions of ALI after 10 epochs of training (which take six hours) look considerably worse than AGE reconstructions after 10 epochs (which take only two hours), thus attesting to the benefits of having a simpler two-component system.
Next we evaluate our method quantitatively. For the model trained on the CIFAR-10 dataset we compute the Inception score Salimans16. The AGE score is , which is higher than the ALI Dumoulin16 score of (as reported in WardeFarley17) and than the score of from Salimans16. The state-of-the-art from WardeFarley17 is higher still (). Qualitative results of AGE for CIFAR-10 and other datasets are shown in the supplementary material.
We also computed the log-likelihood for AGE and ALI on the MNIST dataset using the method of Wu16 with latent space of size using the authors’ source code. ALI’s score is while AGE’s score is . The AGE model is also superior to both VAE and GAN, whose scores are and respectively, as evaluated by Wu16.
Finally, similarly to Dumoulin16 ; Donahue16 ; Radford15 we investigated whether the learned features are useful for discriminative tasks. We reproduced the evaluation pipeline from Dumoulin16 for SVHN dataset and obtained error rate in the unsupervised feature learning protocol with our model, while their result is . At the moment, it is unclear to us why AGE networks underperform ALI at this task.
3.2 Conditional AGE network experiments.
Recently, several GAN-based systems have achieved very impressive results in the conditional setting, where the latent space is augmented or replaced with a second data space corresponding to a different modality Isola16; Zhu17. Arguably, it is in the conditional setting where the bidirectionality lacking in conventional GANs is most needed. In fact, by allowing one to switch back and forth between the data space and the latent space, bidirectionality enables powerful neural image editing interfaces Zhu16; Brock16.
Here, we demonstrate that AGE networks perform well in the conditional setting. To show that, we have picked the image colorization problem, which is known to be hard for GANs. To the best of our knowledge, while the idea of applying GANs to the colorization task seems very natural, the only successful GAN-based colorization results were presented in Isola16, and we compare to the authors’ implementation of their pix2pix system. We are also aware of several unsuccessful efforts to use GANs for colorization.
To use AGE for colorization, we work with images in the Lab color space, and we treat the ab color channels of an image as a data sample. We then use the lightness channel of the image as an input to both the encoder and the generator, effectively conditioning the encoder and the generator on it. Thus, different latent variables will result in different colorizations for the same grayscale image. The latent space in these experiments is taken to be three-dimensional.
The particular architecture of the generator takes the input grayscale image, augments it with the latent variables expanded to constant maps of the same spatial dimensions as the image, and then applies a ResNet-type architecture He16; Johnson16 that computes the color output (i.e. the ab-channels). The encoder architecture is a convolutional network that maps the concatenation of the lightness and color channels (essentially, an image in the Lab-space) to the latent space. The divergence measure is the same as in the unconditional AGE experiments and is computed “unconditionally” (i.e. each minibatch passed through the encoder combines multiple images with different grayscale inputs).
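The input construction for the conditional generator can be sketched as follows. Shapes, names, and the channel-first layout are assumptions of this example; only the idea of expanding the latent variables to constant maps and stacking them with the lightness channel comes from the description above.

```python
import numpy as np

def build_generator_input(lightness, z):
    """Stack a lightness image (H, W) with len(z) constant maps -> (1 + len(z), H, W)."""
    h, w = lightness.shape
    # Each latent coordinate becomes a constant-valued map of the image size.
    z_maps = np.broadcast_to(z[:, None, None], (z.shape[0], h, w))
    return np.concatenate([lightness[None, :, :], z_maps], axis=0)

lightness = np.zeros((4, 5))        # toy grayscale image
z = np.array([0.1, -0.2, 0.3])      # three-dimensional latent vector
inp = build_generator_input(lightness, z)
```

Feeding the same lightness image with different `z` changes only the constant maps, which is what lets the generator produce different colorizations for one grayscale input.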
We perform the colorization experiments on the Stanford Cars dataset Krause13 with 16,000 training images of 196 car models, since cars have inherently ambiguous colors and hence their colorization is particularly prone to the regression-to-mean effect. The images were downsampled to .
We present colorization results in Figure 4. Crucially, the AGE generator is often able to produce plausible and diverse colorizations for different latent vector inputs. As we wanted to enable the pix2pix GAN-based system of Isola16 to produce diverse colorizations, we augmented the input to their generator architecture with three constant-valued maps (same as in our method). We found, however, that their system effectively learns to ignore such input augmentation and the diversity of colorizations was very low (Figure 4a).
To demonstrate the meaningfulness of the latent space learned by the conditional AGE training, we also show color transfer examples, where the latent vector obtained by encoding one color image is then used to colorize the grayscale version of another image (Figure 4b).
4 Conclusion
We have introduced a new approach for simultaneous learning of generation and inference networks. We have demonstrated how to set up such learning as an adversarial game between generation and inference, which has a different type of objective from traditional GAN approaches. In particular the objective of the game considers divergences between distributions rather than discrimination at the level of individual samples. As a consequence, our approach does not require training a discriminator network and enjoys relatively quick convergence.
We demonstrate that, on a range of standard datasets, the generators obtained by our approach provide high-quality samples, and that the reconstructions of real data samples passed subsequently through the encoder and the generator are of better fidelity than in Dumoulin16. We have also shown that our approach is able to generate plausible and diverse colorizations, which is not possible with the GAN-based system Isola16.
Our approach leaves a lot of room for further experiments. In particular, a more complex latent space distribution can be chosen as in Makhzani15 , and other divergence measures between distributions can be easily tried.
References
[1] Martín Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. Proc. ICLR, 2017.
[2] Yoshua Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.
[3] Andrew Brock, Theodore Lim, James M. Ritchie, and Nick Weston. Neural photo editing with introspective adversarial networks. Proc. ICLR, 2017.
[4] Tong Che, Yanran Li, Athul Paul Jacob, Yoshua Bengio, and Wenjie Li. Mode regularized generative adversarial networks. Proc. ICLR, 2017.
[5] Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. Proc. ICLR, 2017.
[6] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Alex Lamb, Martín Arjovsky, Olivier Mastropietro, and Aaron C. Courville. Adversarially learned inference. Proc. ICLR, 2017.
[7] Ian J. Goodfellow. NIPS 2016 tutorial: Generative adversarial networks. CoRR, abs/1701.00160, 2017.
[8] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. In Proc. NIPS, pages 2672–2680, 2014.
[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proc. CVPR, pages 770–778, 2016.
[10] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.
[11] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In Proc. CVPR, 2017.
[12] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, 2016.
[13] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. Proc. ICLR, 2015.
[14] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. Proc. ICLR, 2014.
[15] L. F. Kozachenko and N. N. Leonenko. Sample estimate of the entropy of a random vector. Probl. Inf. Transm., 23(1-2):95–101, 1987.
[16] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3D object representations for fine-grained categorization. In Proc. ICCV 3DRR Workshop, pages 554–561, 2013.
[17] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. CoRR, abs/1512.09300, 2015.
[18] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In ICCV, pages 3730–3738. IEEE Computer Society, 2015.
[19] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, and Ian J. Goodfellow. Adversarial autoencoders. Proc. ICLR, 2016.
[20] Youssef Marzouk, Tarek Moselhy, Matthew Parno, and Alessio Spantini. An introduction to sampling via measure transport. arXiv preprint arXiv:1602.05023, 2016.
[21] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
[22] G. Owen. Game Theory. Academic Press, 1982.
[23] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. Proc. ICLR, 2016.
[24] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
[25] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[26] Tim Salimans, Ian J. Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems (NIPS), pages 2226–2234, 2016.
[27] Cédric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.
[28] David Warde-Farley and Yoshua Bengio. Improving generative adversarial networks with denoising feature matching. In Proc. ICLR, 2017.
[29] Yuhuai Wu, Yuri Burda, Ruslan Salakhutdinov, and Roger B. Grosse. On the quantitative analysis of decoder-based generative models. Proc. ICLR, 2017.
[30] Junbo Jake Zhao, Michaël Mathieu, and Yann LeCun. Energy-based generative adversarial network. Proc. ICLR, 2017.
[31] Jun-Yan Zhu, Philipp Krähenbühl, Eli Shechtman, and Alexei A. Efros. Generative visual manipulation on the natural image manifold. In Proc. ECCV, 2016.
[32] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. CoRR, abs/1703.10593, 2017.
5 Appendix
In this supplementary material, we provide proofs for the theorems of the main text (restating these theorems for convenience of reading). We also show additional qualitative results on several datasets.
Appendix A Proofs
Let the data distribution and the latent distribution be defined in the data space and the latent space, respectively. We assume that they are such that there exists an almost everywhere invertible function which transforms the latent distribution into the data distribution. This assumption is weak, since such an invertible function exists for any pair of atomless distributions (i.e. distributions in which no single point carries a positive mass). For a detailed discussion of this topic, please refer to [20, 27]. Since the latent distribution is ours to choose, simply setting it to a Gaussian distribution (for a Euclidean latent space) or to the uniform distribution on the sphere (for a spherical latent space) is good enough.
Lemma A.1.
Let two distributions be defined in the same space. The distributions are equal if and only if their images under any measurable function are equal.
Proof.
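It is obvious that if the two distributions are equal, then so are their images under any measurable function.
Now let the images coincide for any measurable function. To show that the distributions are equal, we assume the converse. Then there exists a set to which the two distributions assign different mass, and a function under which some set in the target space has exactly this set as its preimage. Then the two image distributions assign different mass to that target set, which contradicts the previous assumption. ∎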
Lemma A.2.
Let two different generator-encoder pairs be Nash equilibria of the same game. Then the game takes the same value at both equilibria.
Proof.
See chapter 2 of [22]. ∎
Theorem 1.
Proof.
First note that . Consider such that , then for any : . We conclude that is a saddle point since is a maximum over and minimum over .
Lemma A.3.
Let function be almost everywhere invertible, i.e. . Then if for a mapping holds , then .
Proof.
From definition of almost everywhere invertibility follows for any set . Then:
Comparing the expressions on the sides we conclude .
∎
Theorem 2.
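Let the reference distribution be any fixed distribution in the latent space. Consider a game
(9) 
If a generator-encoder pair is a Nash equilibrium of game (9), then the generated and real distributions are aligned. Conversely, if the fake and real distributions are aligned, then the generator forms a saddle point together with some encoder.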
Proof.

As for a generator which aligns distributions : for any we conclude by A.2 that the optimal game value is . For an optimal pair and arbitrary from the definition of equilibrium:
(10) For invertible almost everywhere encoder such that the first term is zero since inequality (10) and then . Using result of the lemma A.3 we conclude, that .

If then and .
The corresponding optimal encoder is such that .
∎
Note that not for every optimal encoder the distributions and are aligned with . For example if collapses into two points then for any distribution : . For the optimal generator the parameter is such, that for all other generators such that : .
Theorem 3.
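Let the latent and data distributions be aligned by the generator (i.e. the generated distribution coincides with the data distribution) and let the latent-space reconstruction loss (4) be zero. Then, for samples drawn from the latent and the data distributions, encoding a generated sample recovers its latent code and generating from an encoded data sample recovers that sample almost certainly, i.e. the generator and the encoder invert each other almost everywhere on the supports of the two distributions. Furthermore, the encoder aligns the data distribution with the latent distribution.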
Proof.
Since , we have almost certainly for . Using this and the fact that for we derive:
Thus almost certainly for .
To show alignment first recall the definition of alignment. Distributions are aligned iff : . Then we have
Comparing the expressions on the sides we conclude . ∎
Appendix B Additional qualitative results.
In the figures, we present additional qualitative results and comparisons for various image datasets. See figure captions for explanations.