1 Introduction
Deep directed generative models are a powerful framework for modeling complex data distributions. Generative Adversarial Networks (GANs) [1] can implicitly learn the data generating distribution; more specifically, GAN can learn to sample from it. In order to do this, GAN trains a generator to mimic real samples, by learning a mapping from a latent space (where the samples are easily drawn) to the data space. Concurrently, a discriminator is trained to distinguish between generated and real samples. The key idea behind GAN is that if the discriminator finds it difficult to distinguish real from artificial samples, then the generator is likely to be a good approximation to the true data distribution.
In its standard form, GAN only yields a one-way mapping, i.e., it lacks an inverse mapping mechanism (from data to latent space), preventing GAN from being able to do inference. The ability to compute a posterior distribution of the latent variable conditioned on a given observation may be important for data interpretation and for downstream applications (e.g., classification from the latent variable) [2, 3, 4, 5, 6, 7]. Efforts have been made to simultaneously learn an efficient bidirectional model that can produce high-quality samples for both the latent and data spaces [3, 4, 8, 9, 10, 11]. Among them, the recently proposed Adversarially Learned Inference (ALI) [4, 10] casts the learning of such a bidirectional model in a GAN-like adversarial framework. Specifically, a discriminator is trained to distinguish between two joint distributions: that of the real data sample and its inferred latent code, and that of the real latent code and its generated data sample.
While ALI is an inspiring and elegant approach, it tends to produce reconstructions that are not necessarily faithful reproductions of the inputs [4]. This is because ALI only seeks to match two joint distributions, but the dependency structure (correlation) between the two random variables (conditionals) within each joint is not specified or constrained. In practice, this results in solutions that satisfy ALI’s objective and that are able to produce real-looking samples, but have difficulties reconstructing observed data [4]. ALI also has difficulty discovering the correct pairing relationship in domain transformation tasks [12, 13, 14].
In this paper, we first describe the non-identifiability issue of ALI. To solve this problem, we propose to regularize ALI using the framework of Conditional Entropy (CE), hence we call the proposed approach ALICE. Adversarial learning schemes are proposed to estimate the conditional entropy, for both unsupervised and supervised learning paradigms. We provide a unified view for a family of recently proposed GAN models from the perspective of joint distribution matching, including ALI [4, 10], CycleGAN [12, 13, 14] and Conditional GAN [15]. Extensive experiments on synthetic and real data demonstrate that ALICE is significantly more stable to train than ALI, in that it consistently yields more viable solutions (good generation and good reconstruction), without being too sensitive to perturbations of the model architecture, i.e., hyperparameters. We also show that ALICE results in more faithful image reconstructions. Further, our framework can leverage paired data (when available) for semi-supervised tasks. This is empirically demonstrated on the discovery of relationships for cross-domain tasks based on image data.
2 Background
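Consider two general marginal distributions $q(x)$ and $p(z)$ over $x \in \mathcal{X}$ and $z \in \mathcal{Z}$. One domain can be inferred based on the other using conditional distributions, $p(x|z)$ and $q(z|x)$. Further, the combined structure of both domains is characterized by joint distributions $p(x,z) = p(x|z)p(z)$ and $q(x,z) = q(z|x)q(x)$.
To generate samples from these random variables, adversarial methods [1] provide a sampling mechanism that only requires gradient backpropagation, without the need to specify the conditional densities. Specifically, instead of sampling directly from the desired conditional distribution, the random variable is generated as a deterministic transformation of two inputs, the variable in the source domain, and an independent noise, e.g., a Gaussian distribution. Without loss of generality, we use a universal distribution approximator specification [9], i.e., the sampling procedure for conditionals $\tilde{x} \sim p_\theta(x|z)$ and $\tilde{z} \sim q_\phi(z|x)$ is carried out through the following two generating processes:
$$\tilde{x} = g_\theta(z, \epsilon),\ z \sim p(z),\ \epsilon \sim \mathcal{N}(0, I), \quad \text{and} \quad \tilde{z} = g_\phi(x, \zeta),\ x \sim q(x),\ \zeta \sim \mathcal{N}(0, I), \tag{1}$$
where $g_\theta(\cdot)$ and $g_\phi(\cdot)$ are two generators, specified as neural networks with parameters $\theta$ and $\phi$, respectively. In practice, the inputs of $g_\theta(\cdot)$ and $g_\phi(\cdot)$ are simple concatenations, $[z\ \epsilon]$ and $[x\ \zeta]$, respectively. Note that (1) implies that $p_\theta(x|z)$ and $q_\phi(z|x)$ are parameterized by $\theta$ and $\phi$ respectively, hence the subscripts.
The goal of GAN [1] is to match the marginal $p_\theta(x) = \int p_\theta(x|z)\,p(z)\,dz$ to $q(x)$. Note that $q(x)$ denotes the true distribution of the data (from which we have samples) and $p(z)$ is specified as a simple parametric distribution, e.g., isotropic Gaussian. In order to do the matching, GAN trains an $\omega$-parameterized adversarial discriminator network, $f_\omega(x)$, to distinguish between samples from $p_\theta(x)$ and $q(x)$. Formally, the minimax objective of GAN is given by the following expression:
$$\min_\theta \max_\omega \mathcal{L}_{\mathrm{GAN}}(\theta, \omega) = \mathbb{E}_{x \sim q(x)}[\log \sigma(f_\omega(x))] + \mathbb{E}_{\tilde{x} \sim p_\theta(x|z),\, z \sim p(z)}[\log(1 - \sigma(f_\omega(\tilde{x})))], \tag{2}$$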
where $\sigma(\cdot)$ is the sigmoid function. The following lemma characterizes the solutions of (2) in terms of marginals $p_\theta(x)$ and $q(x)$.
Lemma 1 ([1])
The optimal decoder and discriminator, parameterized by $\{\theta^*, \omega^*\}$, correspond to a saddle point of the objective in (2), if and only if $p_{\theta^*}(x) = q(x)$.
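Alternatively, ALI [4] matches the joint distributions $p_\theta(x,z) = p_\theta(x|z)p(z)$ and $q_\phi(x,z) = q(x)q_\phi(z|x)$, using an adversarial discriminator network similar to (2), $f_\omega(x,z)$, parameterized by $\omega$. The minimax objective of ALI can then be written as
$$\min_{\theta,\phi} \max_\omega \mathcal{L}_{\mathrm{ALI}}(\theta, \phi, \omega) = \mathbb{E}_{x \sim q(x),\, \tilde{z} \sim q_\phi(z|x)}[\log \sigma(f_\omega(x, \tilde{z}))] + \mathbb{E}_{\tilde{x} \sim p_\theta(x|z),\, z \sim p(z)}[\log(1 - \sigma(f_\omega(\tilde{x}, z)))]. \tag{3}$$
Lemma 2 ([4])
The optimum of the two generators and the discriminator with parameters $\{\theta^*, \phi^*, \omega^*\}$ forms a saddle point of the objective in (3), if and only if $p_{\theta^*}(x,z) = q_{\phi^*}(x,z)$.
From Lemma 2, if a solution of (3) is achieved, it is guaranteed that all marginals and conditional distributions of the pair $\{x, z\}$ match. Note that this implies that $q_\phi(z|x)$ and $p_\theta(z|x)$ match; however, (3) imposes no restrictions on these two conditionals. This is key for the identifiability issues of ALI described below.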
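As a rough illustration of how the ALI objective in (3) is evaluated on a minibatch, the sketch below averages the two log-sigmoid terms over discriminator outputs. The function names `log_sigmoid` and `ali_value`, and the idea of passing precomputed logits as plain lists, are assumptions made for this example; they are not taken from the authors' code.

```python
import math

def log_sigmoid(t):
    """Numerically stable log(sigmoid(t))."""
    if t >= 0:
        return -math.log1p(math.exp(-t))
    return t - math.log1p(math.exp(t))

def ali_value(logits_encoder_pairs, logits_decoder_pairs):
    """Monte Carlo estimate of the ALI objective (3).

    logits_encoder_pairs: discriminator logits on pairs (x, z~), z~ from the encoder
    logits_decoder_pairs: discriminator logits on pairs (x~, z), x~ from the decoder
    (hypothetical inputs; in practice these come from a neural discriminator)
    """
    first = sum(log_sigmoid(t) for t in logits_encoder_pairs) / len(logits_encoder_pairs)
    # log(1 - sigmoid(t)) is the same as log(sigmoid(-t))
    second = sum(log_sigmoid(-t) for t in logits_decoder_pairs) / len(logits_decoder_pairs)
    return first + second

# A discriminator that cannot separate the two joints outputs logit 0 everywhere.
undecided = ali_value([0.0, 0.0], [0.0, 0.0])
# A discriminator that separates them perfectly pushes the value towards 0.
confident = ali_value([50.0], [-50.0])
```

When the two joints match, the best the discriminator can do is the undecided case, whose value is $-\log 4$; this is the saddle point referred to in Lemma 2.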
3 Adversarial Learning with Information Measures
The relationship (mapping) between random variables $x$ and $z$ is not specified or constrained by ALI. As a result, it is possible that the matched distribution $\pi(x,z) \triangleq p_{\theta^*}(x,z) = q_{\phi^*}(x,z)$ is undesirable for a given application.
To illustrate this issue, Figure 1 shows all solutions (saddle points) to the ALI objective on a simple toy problem. The data and latent random variables can take two possible values, $\mathcal{X} = \{x_1, x_2\}$ and $\mathcal{Z} = \{z_1, z_2\}$, respectively. In this case, their marginals $q(x)$ and $p(z)$ are known, i.e., $q(x = x_1) = 0.5$ and $p(z = z_1) = 0.5$. The matched joint distribution, $\pi(x,z)$, can be represented as a $2 \times 2$ contingency table. Figure 1(a) represents all possible solutions of the ALI objective in (3), for any $\delta \in [0, 1]$. Figures 1(b) and 1(c) represent the two opposite extreme solutions, at the endpoints of this range. Note that although we can generate “realistic” values of $x$ from any sample of $p(z)$, for $0 < \delta < 1$ we will have poor reconstruction ability, since the sequence $x \sim q(x)$, $\tilde{z} \sim q_\phi(z|x)$, $\hat{x} \sim p_\theta(x|\tilde{z})$ can easily result in $\hat{x} \neq x$. The two (trivial) exceptions where the model can achieve perfect reconstruction correspond to $\delta \in \{0, 1\}$, and are illustrated in Figures 1(b) and 1(c). From this simple example, we see that due to the flexibility of the joint distribution, $\pi(x,z)$, it is quite likely to obtain an undesirable solution to the ALI objective. For instance, i) one with poor reconstruction ability or ii) one where a single instance of $z$ can potentially map to any possible value in $\mathcal{X}$, e.g., in Figure 1(a) with $\delta = 0.5$, $z_1$ can generate either $x_1$ or $x_2$ with equal probability.
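The toy problem above can be checked numerically. The sketch below builds a $2 \times 2$ joint table with uniform marginals and computes the probability that one encode-decode cycle returns the original value. The placement of $\delta/2$ on the diagonal is an assumption made for this example (the text only states that the solutions are indexed by one parameter), and the function names are hypothetical.

```python
def joint_table(delta):
    """2x2 joint pi(x, z) with uniform marginals; rows index x, columns index z.

    Assumed parametrization: mass delta/2 on the diagonal cells.
    """
    return [[delta / 2, (1 - delta) / 2],
            [(1 - delta) / 2, delta / 2]]

def reconstruction_probability(pi):
    """P(x_hat == x) for the cycle x ~ q(x), z ~ pi(z|x), x_hat ~ pi(x|z)."""
    n_x, n_z = len(pi), len(pi[0])
    q_x = [sum(pi[i]) for i in range(n_x)]                          # marginal of x
    p_z = [sum(pi[i][j] for i in range(n_x)) for j in range(n_z)]   # marginal of z
    prob = 0.0
    for i in range(n_x):
        for j in range(n_z):
            if pi[i][j] > 0:
                # q(x_i) * pi(z_j | x_i) * pi(x_i | z_j)
                prob += q_x[i] * (pi[i][j] / q_x[i]) * (pi[i][j] / p_z[j])
    return prob

chance_level = reconstruction_probability(joint_table(0.5))
perfect_a = reconstruction_probability(joint_table(1.0))
perfect_b = reconstruction_probability(joint_table(0.0))
```

Under this parametrization the cycle succeeds with probability $\delta^2 + (1-\delta)^2$, which is 1 only at the two endpoints and drops to chance level at $\delta = 0.5$, even though every table in the family satisfies the ALI objective equally well.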
Many applications require meaningful mappings. Consider two scenarios:
A1: In unsupervised learning, one desirable property is cycle-consistency [12], meaning that the inferred $z$ of a corresponding $x$ can reconstruct $x$ itself with high probability. In Figure 1 this corresponds to either $\delta \to 0$ or $\delta \to 1$, as in Figures 1(b) and 1(c).
A2: In supervised learning, the pre-specified correspondence between samples imposes restrictions on the mapping between $x$ and $z$, e.g., in image tagging, $x$ are images and $z$ are tags. In this case, paired samples from the desired joint distribution are usually available, thus we can leverage this supervised information to resolve the ambiguity between Figure 1(b) and (c).
From our simple example in Figure 1, we see that in order to alleviate the identifiability issues associated with the solutions to the ALI objective, we have to impose constraints on the conditionals $q_\phi(z|x)$ and $p_\theta(x|z)$. Furthermore, to fully mitigate the identifiability issues we require supervision, i.e., paired samples from domains $\mathcal{X}$ and $\mathcal{Z}$.
To deal with the problem of undesirable but matched joint distributions, below we propose to use an information-theoretic measure to regularize ALI. This is done by controlling the “uncertainty” between pairs of random variables, i.e., $x$ and $z$, using conditional entropies.
3.1 Conditional Entropy
Conditional Entropy (CE) is an information-theoretic measure that quantifies the uncertainty of random variable $x$ when conditioned on $z$ (or the other way around), under joint distribution $\pi(x,z)$:
$$H^{\pi}(x|z) \triangleq -\mathbb{E}_{\pi(x,z)}[\log \pi(x|z)], \quad \text{and} \quad H^{\pi}(z|x) \triangleq -\mathbb{E}_{\pi(x,z)}[\log \pi(z|x)]. \tag{4}$$
The uncertainty of $x$ given $z$ is linked with $H^{\pi}(x|z)$; in fact, $H^{\pi}(x|z) = 0$ if and only if $x$ is a deterministic mapping of $z$. Intuitively, by controlling the uncertainty of $q_\phi(z|x)$ and $p_\theta(x|z)$, we can restrict the solutions of the ALI objective to joint distributions whose mappings result in better reconstruction ability. Therefore, we propose to use the CE in (4), denoted as $\mathcal{L}^{\pi}_{\mathrm{CE}}(\theta, \phi) = H^{\pi}(x|z)$ or $H^{\pi}(z|x)$ (depending on the task; see below), as a regularization term in our framework, termed ALI with Conditional Entropy (ALICE), and defined as the following minimax objective:
$$\min_{\theta,\phi} \max_\omega \mathcal{L}_{\mathrm{ALICE}}(\theta, \phi, \omega) = \mathcal{L}_{\mathrm{ALI}}(\theta, \phi, \omega) + \mathcal{L}^{\pi}_{\mathrm{CE}}(\theta, \phi). \tag{5}$$
$\mathcal{L}^{\pi}_{\mathrm{CE}}(\theta, \phi)$ is dependent on the underlying distributions for the random variables, parameterized by $(\theta, \phi)$, as made clearer below. Ideally, we could select the desirable solutions of (5) by evaluating their CE, once all the saddle points of the ALI objective have been identified. However, in practice, $\mathcal{L}^{\pi}_{\mathrm{CE}}(\theta, \phi)$ is intractable because we do not have access to the saddle points beforehand. Below, we propose to approximate the CE in (5) during training for both unsupervised and supervised tasks. Since $x$ and $z$ are symmetric in terms of CE according to (4), we use $x$ to derive our theoretical results. Similar arguments hold for $z$, as discussed in the Supplementary Material (SM).
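To make the role of the conditional entropy in (4) concrete, the following sketch computes $H(x|z)$ for a discrete joint table. The function name and the two example tables are illustrative assumptions, not part of the original work.

```python
import math

def conditional_entropy_x_given_z(pi):
    """H(x|z) = -sum over (x, z) of pi(x, z) * log pi(x|z), in nats.

    pi is a joint table with rows indexing x and columns indexing z.
    """
    n_x, n_z = len(pi), len(pi[0])
    p_z = [sum(pi[i][j] for i in range(n_x)) for j in range(n_z)]
    h = 0.0
    for i in range(n_x):
        for j in range(n_z):
            if pi[i][j] > 0:  # the term 0 * log 0 is taken as 0
                h -= pi[i][j] * math.log(pi[i][j] / p_z[j])
    return h

ambiguous = [[0.25, 0.25], [0.25, 0.25]]   # x is independent of z
deterministic = [[0.5, 0.0], [0.0, 0.5]]   # x is a function of z
h_ambiguous = conditional_entropy_x_given_z(ambiguous)
h_deterministic = conditional_entropy_x_given_z(deterministic)
```

Both tables have the same uniform marginals, so a criterion that only matches joint distributions to each other cannot prefer one over the other; the conditional entropy separates them, being $\log 2$ for the first and zero for the second.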
3.2 Unsupervised Learning
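In the absence of explicit probability distributions needed for computing the CE, we can bound the CE using the criterion of cycle-consistency [12]. We denote the reconstruction of $x$ as $\hat{x}$, via generating procedure (cycle) $\hat{x} \sim p_\theta(\hat{x}|z)$, $z \sim q_\phi(z|x)$. We desire that $p_\theta(\hat{x}|z)$ have high likelihood for $\hat{x} = x$, for the $z \sim q_\phi(z|x)$ that begins the cycle $x \to z \to \hat{x}$, and hence that $\hat{x}$ be similar to the original $x$. Lemma 3 below shows that cycle-consistency is an upper bound of the conditional entropy in (4).
Lemma 3
For joint distributions $p_\theta(x,z)$ or $q_\phi(x,z)$, we have
$$H^{q_\phi}(x|z) \triangleq -\mathbb{E}_{q_\phi(x,z)}[\log q_\phi(x|z)] = -\mathbb{E}_{q_\phi(x,z)}[\log p_\theta(x|z)] - \mathbb{E}_{q_\phi(z)}[\mathrm{KL}(q_\phi(x|z) \,\|\, p_\theta(x|z))] \le -\mathbb{E}_{q_\phi(x,z)}[\log p_\theta(x|z)] \triangleq \mathcal{L}_{\mathrm{Cycle}}(\theta, \phi), \tag{6}$$
where $q_\phi(z) = \int q_\phi(x,z)\,dx$. The proof is in the SM. Note that latent $z$ is implicitly involved in $\mathcal{L}_{\mathrm{Cycle}}(\theta, \phi)$ via $\mathbb{E}_{q_\phi(x,z)}[\cdot]$. For the unsupervised case we want to leverage (6) to optimize the following upper bound of (5):
$$\min_{\theta,\phi} \max_\omega \mathcal{L}_{\mathrm{ALI}}(\theta, \phi, \omega) + \mathcal{L}_{\mathrm{Cycle}}(\theta, \phi). \tag{7}$$
Note that as ALI reaches its optimum, $p_\theta(x,z)$ and $q_\phi(x,z)$ reach saddle point $\pi(x,z)$, then $\mathcal{L}_{\mathrm{Cycle}}(\theta, \phi) \to H^{q_\phi}(x|z) \to H^{\pi}(x|z)$ in (4) accordingly, thus (7) effectively approaches (5) (ALICE). Unlike $\mathcal{L}^{\pi}_{\mathrm{CE}}(\theta, \phi)$ in (4), its upper bound, $\mathcal{L}_{\mathrm{Cycle}}(\theta, \phi)$, can be easily approximated via Monte Carlo simulation. Importantly, (7) can be readily added to ALI’s objective without additional changes to the original training procedure.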
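The bound in Lemma 3 and its Monte Carlo approximation can be illustrated on a small discrete example. In the sketch below, the encoder and decoder are hand-picked conditional probability tables (made-up numbers chosen only for illustration), and the function names are hypothetical. It computes the cycle loss exactly, estimates it by sampling cycles, and compares both with the exact conditional entropy under the encoder joint.

```python
import math
import random

q_x = [0.5, 0.5]                          # data marginal q(x)
q_z_given_x = [[0.9, 0.1], [0.2, 0.8]]    # encoder table, rows index x
p_x_given_z = [[0.7, 0.3], [0.4, 0.6]]    # decoder table, rows index z

def exact_cycle_loss():
    """L_Cycle = -E_{q(x) q(z|x)}[log p(x|z)], computed by enumeration."""
    return -sum(q_x[i] * q_z_given_x[i][j] * math.log(p_x_given_z[j][i])
                for i in range(2) for j in range(2))

def exact_conditional_entropy():
    """H(x|z) under the encoder joint q(x, z) = q(x) q(z|x)."""
    joint = [[q_x[i] * q_z_given_x[i][j] for j in range(2)] for i in range(2)]
    q_z = [joint[0][j] + joint[1][j] for j in range(2)]
    return -sum(joint[i][j] * math.log(joint[i][j] / q_z[j])
                for i in range(2) for j in range(2))

def monte_carlo_cycle_loss(num_samples, seed=0):
    """Average -log p(x|z) over sampled cycles x ~ q(x), z ~ q(z|x)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        i = 0 if rng.random() < q_x[0] else 1
        j = 0 if rng.random() < q_z_given_x[i][0] else 1
        total -= math.log(p_x_given_z[j][i])
    return total / num_samples

entropy = exact_conditional_entropy()
cycle = exact_cycle_loss()
cycle_estimate = monte_carlo_cycle_loss(20000)
```

The gap between `cycle` and `entropy` is the expected KL term in (6); it closes only when the decoder table equals the posterior implied by the encoder joint.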
The cycle-consistency property has been previously leveraged in CycleGAN [12], DiscoGAN [13] and DualGAN [14]. However, in [12, 13, 14], cycle-consistency, $\mathcal{L}_{\mathrm{Cycle}}(\theta, \phi)$, is implemented via $\ell_k$ losses, for $k = 1, 2$, and real-valued data such as images. As a consequence of an $\ell_k$-based pixel-wise loss, the generated samples tend to be blurry [8]. Recognizing this limitation, we further suggest enforcing cycle-consistency (for better reconstruction) using fully adversarial training (for better generation), as an alternative to $\mathcal{L}_{\mathrm{Cycle}}(\theta, \phi)$ in (7). Specifically, to reconstruct $x$, we specify an $\eta$-parameterized discriminator $f_\eta(x, \hat{x})$ to distinguish between $x$ and its reconstruction $\hat{x}$:
$$\min_{\theta,\phi} \max_\eta \mathcal{L}^{A}_{\mathrm{Cycle}}(\theta, \phi, \eta) = \mathbb{E}_{x \sim q(x)}[\log \sigma(f_\eta(x, x))] + \mathbb{E}_{\hat{x} \sim p_\theta(\hat{x}|z),\, z \sim q_\phi(z|x)}[\log(1 - \sigma(f_\eta(x, \hat{x})))]. \tag{8}$$
Finally, the fully adversarial training algorithm for unsupervised learning using the ALICE framework is the result of replacing $\mathcal{L}_{\mathrm{Cycle}}(\theta, \phi)$ with $\mathcal{L}^{A}_{\mathrm{Cycle}}(\theta, \phi, \eta)$ in (7); thus, for fixed $(\theta, \phi)$, we maximize wrt $\{\omega, \eta\}$.
The use of paired samples in (8) is critical. It encourages the generators to mimic the reconstruction relationship implied in the first joint; otherwise, the model may reduce to the basic GAN discussed in Section 2, and generate any realistic sample in $\mathcal{X}$. The objective in (8) enjoys many theoretical properties of GAN. In particular, Proposition 1 guarantees the existence of the optimal generator and discriminator.
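The pairing used in (8) can be sketched as follows. The discriminator sees each input together with either itself or its reconstruction, rather than the reconstruction alone. The function name and the `encode`/`decode` callables are hypothetical stand-ins for sampling from the two generators; the toy lambdas are chosen only so the result is easy to inspect.

```python
def cycle_discriminator_pairs(batch, encode, decode):
    """Build the two kinds of input pairs for the discriminator in (8).

    encode and decode are hypothetical stand-ins for the two generators;
    any callables with matching input/output types work.
    """
    real_pairs = [(x, x) for x in batch]                   # first term: (x, x)
    fake_pairs = [(x, decode(encode(x))) for x in batch]   # second term: (x, x_hat)
    return real_pairs, fake_pairs

# Toy stand-ins: the decoder shifts every reconstruction by 1.
real_pairs, fake_pairs = cycle_discriminator_pairs(
    [0.0, 2.0], encode=lambda x: 2.0 * x, decode=lambda z: z / 2.0 + 1.0)
```

Because the original sample is the first element of every pair, the discriminator can penalize a reconstruction that is realistic on its own but does not correspond to its input; dropping the first element would recover a marginal discriminator that cannot make this distinction.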
Proposition 1
The optimum of the generators and discriminator $\{\theta^*, \phi^*, \eta^*\}$ of the objective in (8) is achieved, if and only if $\mathbb{E}_{q_{\phi^*}(z|x)}[p_{\theta^*}(\hat{x}|z)] = \delta(x - \hat{x})$.
The proof is provided in the SM. Together with Lemmas 2 and 3, we can also show that:
Corollary 1
When cycle-consistency is satisfied (the optimum in (8) is achieved), i) a deterministic mapping enforces $\mathbb{E}_{q_\phi(z)}[\mathrm{KL}(q_\phi(x|z) \,\|\, p_\theta(x|z))] = 0$, which indicates the conditionals are matched. ii) Conversely, the matched conditionals enforce $H^{q_\phi}(x|z) = 0$, which indicates the corresponding mapping becomes deterministic.
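3.3 Semi-supervised Learning
When the objective in (7) is optimized in an unsupervised way, the identifiability issues associated with ALI are largely reduced due to the cycle-consistency-enforcing bound in Lemma 3. This means that samples in the training data have been probabilistically “paired” with high certainty, by conditionals $p_\theta(x|z)$ and $q_\phi(z|x)$, though perhaps not in the desired configuration. In real-world applications, obtaining correctly paired data samples for the entire dataset is expensive or even impossible. However, in some situations obtaining paired data for a very small subset of the observations may be feasible. In such a case, we can leverage the small set of empirically paired samples to provide further guidance on selecting the correct configuration. This suggests that ALICE is suitable for semi-supervised classification.
For a paired sample drawn from empirical distribution $\tilde{\pi}(x,z)$, its desirable joint distribution is well specified. Thus, one can directly approximate the CE as
$$H^{\tilde{\pi}}(x|z) \approx -\mathbb{E}_{\tilde{\pi}(x,z)}[\log p_\theta(x|z)] \triangleq \mathcal{L}_{\mathrm{Map}}(\theta), \tag{9}$$
where the approximation ($\approx$) arises from the fact that $p_\theta(x|z)$ is an approximation to $\tilde{\pi}(x|z)$. For the supervised case we leverage (9) to approximate (5) using the following minimax objective:
$$\min_{\theta,\phi} \max_\omega \mathcal{L}_{\mathrm{ALI}}(\theta, \phi, \omega) + \mathcal{L}_{\mathrm{Map}}(\theta). \tag{10}$$
Note that as ALI reaches its optimum, $p_\theta(x,z)$ and $q_\phi(x,z)$ reach saddle point $\pi(x,z)$, then $\mathcal{L}_{\mathrm{Map}}(\theta) \to H^{\tilde{\pi}}(x|z) \to H^{\pi}(x|z)$ in (4) accordingly, thus (10) approaches (5) (ALICE).
We can employ standard losses for supervised learning objectives to approximate $\mathcal{L}_{\mathrm{Map}}(\theta)$ in (10), such as cross-entropy or $\ell_k$ loss in (9). Alternatively, to also improve generation ability, we propose an adversarial learning scheme to directly match $p_\theta(x|z)$ to the paired empirical conditional $\tilde{\pi}(x|z)$, using conditional GAN [15] as an alternative to $\mathcal{L}_{\mathrm{Map}}(\theta)$ in (10). The $\chi$-parameterized discriminator $f_\chi$ is used to distinguish the true pair $\{x, z\}$ from the artificially generated one $\{\hat{x}, z\}$ (conditioned on $z$), using
$$\min_\theta \max_\chi \mathcal{L}^{A}_{\mathrm{Map}}(\theta, \chi) = \mathbb{E}_{(x,z) \sim \tilde{\pi}(x,z)}\big[\log \sigma(f_\chi(x, z)) + \mathbb{E}_{\hat{x} \sim p_\theta(\hat{x}|z)} \log(1 - \sigma(f_\chi(\hat{x}, z)))\big]. \tag{11}$$
The fully adversarial training algorithm for supervised learning using ALICE is the result of replacing $\mathcal{L}_{\mathrm{Map}}(\theta)$ with $\mathcal{L}^{A}_{\mathrm{Map}}(\theta, \chi)$ from (11) in (10); thus, for fixed $(\theta, \phi)$, we maximize wrt $\{\omega, \chi\}$.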
Proposition 2
The optimum of the generator and discriminator $\{\theta^*, \chi^*\}$ forms a saddle point of the objective in (11), if and only if $\tilde{\pi}(x|z) = p_{\theta^*}(x|z)$ and $\tilde{\pi}(x,z) = p_{\theta^*}(x,z)$.
The proof is provided in the SM. Proposition 2 guarantees that the generator maps to the correctly paired sample in the other space. Together with the theoretical result for ALI in Lemma 2, we have
Corollary 2
When the optimum in (10) is achieved, $\tilde{\pi}(x,z) = p_{\theta^*}(x,z) = q_{\phi^*}(x,z)$.
Corollary 2 indicates that ALI’s drawbacks associated with identifiability issues can be alleviated in the fully supervised learning scenario. Two conditional GANs can be used to boost the performance, one for each mapping direction. When the weights of the discriminators of the two conditional GANs are tied, ALICE recovers Triangle GAN [16]. In practice, samples from the paired set often contain enough information to readily approximate the sufficient statistics of the entire dataset. In such a case, we may use the following objective for semi-supervised learning:
$$\min_{\theta,\phi} \max_\omega \mathcal{L}_{\mathrm{ALI}}(\theta, \phi, \omega) + \mathcal{L}_{\mathrm{Cycle}}(\theta, \phi) + \mathcal{L}_{\mathrm{Map}}(\theta). \tag{12}$$
The first two terms operate on the entire set, while the last term only applies to the paired subset. Note that we can train (12) fully adversarially by replacing $\mathcal{L}_{\mathrm{Cycle}}$ and $\mathcal{L}_{\mathrm{Map}}$ with $\mathcal{L}^{A}_{\mathrm{Cycle}}$ and $\mathcal{L}^{A}_{\mathrm{Map}}$ in (8) and (11), respectively. In (12), the three terms are weighted equally in the experiments unless specifically mentioned, but of course one may introduce additional hyperparameters to adjust the relative emphasis of each term.
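A minimal sketch of how the three terms of (12) might be combined on one minibatch is given below. The function name, the per-sample loss lists, and the weights `w_cycle` and `w_map` are assumptions for illustration; the weights correspond to the optional hyperparameters mentioned above and default to the equal weighting used in the text.

```python
def semi_supervised_objective(ali_value, cycle_losses, map_losses_paired,
                              w_cycle=1.0, w_map=1.0):
    """Combine the three terms of (12) for one minibatch.

    ali_value:          scalar estimate of the ALI term (whole batch)
    cycle_losses:       per-sample cycle losses (whole batch)
    map_losses_paired:  per-sample supervised losses (paired subset only)
    w_cycle, w_map:     hypothetical weighting hyperparameters
    """
    cycle_term = sum(cycle_losses) / len(cycle_losses)
    # The supervised term is averaged over the paired subset only; it
    # contributes nothing when the batch contains no paired samples.
    if map_losses_paired:
        map_term = sum(map_losses_paired) / len(map_losses_paired)
    else:
        map_term = 0.0
    return ali_value + w_cycle * cycle_term + w_map * map_term

with_pairs = semi_supervised_objective(-1.0, [0.2, 0.4], [0.6])
without_pairs = semi_supervised_objective(-1.0, [0.2, 0.4], [])
```

Setting `w_map=0.0`, or passing an empty paired subset, reduces the sketch to the unsupervised objective in (7).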
4 Related Work: A Unified Perspective for Joint Distribution Matching
Connecting ALI and CycleGAN. We provide an information-theoretic interpretation for cycle-consistency, and show that it is equivalent to controlling conditional entropies and matching conditional distributions. When cycle-consistency is satisfied, Corollary 1 shows that the conditionals are matched in CycleGAN. They also train additional discriminators to guarantee the matching of marginals for $x$ and $z$ using the original GAN objective in (2). This reveals the equivalence between ALI and CycleGAN, as the latter can also guarantee the matching of joint distributions $p_\theta(x,z)$ and $q_\phi(x,z)$. In practice, CycleGAN is easier to train, as it decomposes the joint distribution matching objective (as in ALI) into four subproblems. Our approach leverages a similar idea, and further improves it with adversarially learned cycle-consistency, when high-quality samples are of interest.
Stochastic Mapping vs. Deterministic Mapping. We propose to enforce the cycle-consistency in ALI for the case when two stochastic mappings are specified as in (1). When cycle-consistency is achieved, Corollary 1 shows that the bounded conditional entropy vanishes, and thus the corresponding mapping becomes deterministic. In the literature, one deterministic mapping has been empirically tested in ALI’s framework [4], without explicitly specifying cycle-consistency. BiGAN [10] uses two deterministic mappings. In theory, deterministic mappings guarantee cycle-consistency in ALI’s framework. However, to achieve this, the model has to fit a delta distribution (deterministic mapping) to another distribution in the sense of KL divergence (see Lemma 3). Due to the asymmetry of KL, the cost function will pay extremely low cost for generating fake-looking samples [17]. This explains the underfitting reasoning in [4] behind the subpar reconstruction ability of ALI. Therefore, in ALICE, we explicitly add a cycle-consistency regularization to accelerate and stabilize training.
Conditional GANs as Joint Distribution Matching. Conditional GAN and its variants [15, 18, 19, 20] have been widely used in supervised tasks. Our scheme to learn conditional entropy borrows the formulation of conditional GAN [15]. To the authors’ knowledge, this is the first attempt to study the conditional GAN formulation as a joint distribution matching problem. Moreover, we add the potential to leverage the well-defined distribution implied by paired data, to resolve the ambiguity issues of unsupervised ALI variants [4, 10, 12, 13, 14].
5 Experimental Results
The code to reproduce these experiments is at https://github.com/ChunyuanLI/ALICE
5.1 Effectiveness and Stability of Cycle-Consistency
To highlight the role of the CE regularization for unsupervised learning, we perform an experiment on a toy dataset. $q(x)$ is a 2D Gaussian Mixture Model (GMM) with 5 mixture components, and $p(z)$ is chosen as a standard Gaussian, $\mathcal{N}(0, I)$. Following [4], the covariance matrices and centroids are chosen such that the distribution exhibits severely separated modes, which makes it a relatively hard task despite its 2D nature. Following [21], to study stability, we run an exhaustive grid search over a set of architectural choices and hyperparameters, 576 experiments for each method. We report Mean Squared Error (MSE) and inception score (denoted as ICP) [22] to quantitatively evaluate the performance of generative models. MSE is a proxy for reconstruction quality, while ICP reflects the plausibility and variety of sample generation. Lower MSE and higher ICP indicate better results. See SM for the details of the grid search and the calculation of ICP.
We train on 2048 samples, and test on 1024 samples. The ground-truth test samples for $x$ and $z$ are shown in Figure 2(a) and (b), respectively. We compare ALICE, ALI and Denoising Auto-Encoders (DAEs) [23], and report the distribution of ICP and MSE values, for all (576) experiments in Figure 2(c) and (d), respectively. For reference, samples drawn from the “oracle” (ground-truth) GMM yield ICP $= 4.977 \pm 0.016$. ALICE yields an ICP larger than 4.5 in 77% of experiments, while ALI’s ICP varies wildly across different runs. These results demonstrate that ALICE is more consistent and quantitatively reliable than ALI. The DAE yields the lowest MSE, as expected, but it also results in the weakest generation ability. The comparatively low MSE of ALICE demonstrates acceptable reconstruction ability compared to DAE, and a very significant improvement over ALI.
Figure 2: (a) True $x$, (b) True $z$, (c) Inception Score, (d) MSE.
Figure 3 shows the qualitative results on the test set. Since ALI’s results vary largely from trial to trial, we present the one with highest ICP. In the figure, we color samples from different mixture components to highlight their correspondence between the ground truth, in Figure 2(a), and their reconstructions, in Figure 3 (first row, columns 2, 4 and 6, for ALICE, ALI and DAE, respectively). Importantly, though the reconstruction of ALI can recover the shape of the manifold in $x$ (Gaussian mixture), each individual reconstructed sample can be substantially far away from its “original” mixture component (note the highly mixed coloring), hence the poor MSE. This occurs because the adversarial training in ALI only requires that the generated samples look realistic, i.e., that they be located near true samples in $\mathcal{X}$, but the mapping between observed and latent spaces ($x \to z$ and $z \to x$) is not specified. In the SM we also consider ALI with various combinations of stochastic/deterministic mappings, and conclude that models with deterministic mappings tend to have lower reconstruction ability but higher generation ability. In terms of the estimated latent space, $z$, in Figure 3 (first row, columns 1, 3 and 5, for ALICE, ALI and DAE, respectively), we see that ALICE results in a better latent representation, in the sense of mapping consistency (samples from different mixture components remain clustered) and distribution consistency (samples approximate a Gaussian distribution). The results for reconstruction of $z$ and sampling of $x$ are shown in the SM.
In Figure 3 (second row), we also investigate latent space interpolation between a pair of test set examples. We take two test samples in the data space, map them into their latent codes, linearly interpolate between the two codes to get intermediate points, and then map these back to the original space. We only show the index of the samples for better visualization. Figure 3 shows that ALICE’s interpolation is smooth and consistent with the ground-truth distributions. Interpolation using ALI results in realistic samples (within mixture components), but the transition is not order-wise consistent. DAEs provide smooth transitions, but the samples in the original space look unrealistic as some of them are located in low probability density regions of the true model.
We investigate the impact of different amounts of regularization on three datasets, including the toy dataset, MNIST and CIFAR-10, in SM Section D. The results show that our regularizer can improve image generation and reconstruction of ALI for a large range of weighting hyperparameter values.
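The latent-space interpolation used in this experiment can be sketched as follows. The function name and the list-of-floats representation of a latent code are assumptions for illustration; mapping the end points into the latent space and the interpolated codes back to the data space would be done by the trained encoder and decoder, which are not shown here.

```python
def interpolate_latents(z_start, z_end, num_points):
    """Return num_points codes evenly spaced on the segment from z_start to z_end.

    Both end points are included, so num_points must be at least 2.
    """
    if num_points < 2:
        raise ValueError("num_points must be at least 2")
    codes = []
    for k in range(num_points):
        t = k / (num_points - 1)
        codes.append([(1 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return codes

path = interpolate_latents([0.0, 0.0], [1.0, 2.0], 3)
```

Decoding each code along `path` in order gives the sequence of samples whose indices are plotted in the second row of Figure 3.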
Figure 3: (a) ALICE, (b) ALI, (c) DAEs.
5.2 Reconstruction and Cross-Domain Transformation on Real Datasets
Two image-to-image translation tasks are considered. Car-to-Car [24]: each domain ($x$ and $z$) includes car images in 11 different angles, on which we seek to demonstrate the power of adversarially learned reconstruction and weak supervision. Edge-to-Shoe [25]: one domain consists of shoe photos and the other consists of edge images, on which we report extensive quantitative comparisons. Cycle-consistency is applied on both domains. The goal is to discover the cross-domain relationship (i.e., cross-domain prediction), while maintaining reconstruction ability on each domain.
Adversarially learned reconstruction. To demonstrate the effectiveness of our fully adversarial scheme in (8) (Joint A.) on real datasets, we use it in place of the $\ell_2$ losses in DiscoGAN [13]. In practice, feature matching [22] is used to help the adversarial objective in (8) to reach its optimum. We also compare with a baseline scheme (Marginal A.) in [12], which adversarially discriminates between $x$ and its reconstruction $\hat{x}$.
Figure 4: (a) Reconstruction, (b) Prediction.
The results are shown in Figure 4(a). From top to bottom, each row shows ground-truth images, DiscoGAN (with Joint A., $\ell_2$ loss and Marginal A. schemes, respectively) and BiGAN [10]. Note that BiGAN is the best ALI variant in our grid search comparison. The proposed Joint A. scheme retains the crispness characteristic of adversarially trained models, while $\ell_2$ tends to be blurry. Marginal A. provides realistic car images, but not faithful reproductions of the inputs. This explains the observation in [12] of no performance gain. BiGAN learns the shapes of cars, but misses the textures. This is a sign of underfitting, indicating that BiGAN is not easy to train.
Weak supervision. DiscoGAN and BiGAN are unsupervised methods, and exhibit very different cross-domain pairing configurations during different training epochs, which is indicative of non-identifiability issues. We leverage very weak supervision to help with convergence and guide the pairing. The results are shown in Figure 4(b). We run each method multiple times; the width of the colored lines reflects the standard deviation. We start with 1% true pairs for supervision, which yields significantly higher accuracy than DiscoGAN/BiGAN. We then provide 10% supervision in only 2 or 6 angles (of 11 total angles), which yields angle prediction accuracy comparable with full angle supervision in testing. This shows ALICE’s ability in terms of zero-shot learning, i.e., predicting unseen pairs. In the SM, we show that enforcing different weak supervision strategies affects the final pairing configurations, i.e., we can leverage supervision to obtain the desirable joint distribution.
Quantitative comparison. To quantitatively assess the generated images, we use structural similarity (SSIM) [26], which is an established image quality metric that correlates well with human visual perception. SSIM values are between $[0, 1]$; higher is better. The SSIM of ALICE on prediction and reconstruction is shown in Figure 5(a)(b) for the edge-to-shoe task. As a baseline, we set DiscoGAN with $\ell_2$-based supervision. BiGAN/ALI, highlighted with a circle, is outperformed by ALICE in two aspects: (i) In the unpaired setting (0% supervision), cycle-consistency regularization ($\mathcal{L}_{\mathrm{Cycle}}$) shows significant performance gains, particularly on reconstruction. (ii) When supervision is leveraged (10%), SSIM is significantly increased on prediction. The adversarial-based supervision shows higher prediction than the $\ell_2$-based supervision. ALICE achieves very similar performance with the 50% and full supervision setups, indicating its advantage in semi-supervised learning. Several generated edge images (with 50% supervision) are shown in Figure 5(c); the adversarial-based supervision tends to provide more details than the $\ell_2$-based one. Both methods generate correctly paired edges, and quality is higher than BiGAN and DiscoGAN. In the SM, we also report MSE metrics, and results on the edge domain only, which are consistent with the results presented here.
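For readers unfamiliar with SSIM, the sketch below evaluates the SSIM formula of [26] over a single global window. This is a simplification assumed for illustration: the standard metric averages the same expression over local windows, and the evaluation in this paper may differ in such details. The function name, the flat-list image representation and the `data_range` parameter are hypothetical; the stabilizing constants 0.01 and 0.03 are the commonly used defaults.

```python
def global_ssim(img_a, img_b, data_range=1.0):
    """Single-window SSIM over two equal-length flat lists of pixel intensities."""
    if len(img_a) != len(img_b) or not img_a:
        raise ValueError("images must be non-empty and of equal length")
    n = len(img_a)
    mu_a = sum(img_a) / n
    mu_b = sum(img_b) / n
    var_a = sum((a - mu_a) ** 2 for a in img_a) / n
    var_b = sum((b - mu_b) ** 2 for b in img_b) / n
    cov = sum((a - mu_a) * (b - mu_b) for a, b in zip(img_a, img_b)) / n
    c1 = (0.01 * data_range) ** 2   # stabilizes the luminance term
    c2 = (0.03 * data_range) ** 2   # stabilizes the contrast/structure term
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return numerator / denominator

image = [0.1, 0.5, 0.9]
same = global_ssim(image, image)
flipped = global_ssim(image, [0.9, 0.5, 0.1])
```

An image compared with itself scores 1, and any structural mismatch lowers the score, which is why higher SSIM indicates a more faithful prediction or reconstruction.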
One-side cycle-consistency. When uncertainty in one domain is desirable, we consider one-side cycle-consistency. This is demonstrated on the CelebA face dataset [27]. Each face is associated with a 40-dimensional attribute vector. The results are in Figure 13 of the SM. In the first task, we consider the images $x$ as generated from a 128-dimensional Gaussian latent space $z$, and apply $\mathcal{L}_{\mathrm{Cycle}}$ on $x$. We compare ALICE and ALI on reconstruction in (a)(b). ALICE shows more faithful reproduction of the input subjects. In the second task, we consider $z$ as the attribute space, from which the images $x$ are generated. The mapping from $x$ to $z$ is then attribute classification. We only apply $\mathcal{L}_{\mathrm{Cycle}}$ on the attribute domain, and $\mathcal{L}^{A}_{\mathrm{Map}}$ on both domains. When only a subset of paired samples is considered, the predicted attributes still reach 86% accuracy, which is comparable with the fully supervised case. To test the diversity on $x$, we first predict the attributes of a true face image, and then generate multiple images conditioned on the predicted attributes. Four examples are shown in (c).
Figure 5: (a) Cross-domain transformation, (b) Reconstruction, (c) Generated edges.
6 Conclusion
We have studied the problem of non-identifiability in bidirectional adversarial networks. A unified perspective of understanding various GAN models as joint matching is provided to tackle this problem. This insight enables us to propose ALICE (with both adversarial and non-adversarial solutions) to reduce the ambiguity and control the conditionals in unsupervised and semi-supervised learning. For future work, the proposed view can provide opportunities to leverage the advantages of each model, to advance joint-distribution modeling.
Acknowledgements
We acknowledge Shuyang Dai, Chenyang Tao and Zihang Dai for helpful feedback/editing. This research was supported in part by ARO, DARPA, DOE, NGA, ONR and NSF.
References
 [1] I. Goodfellow, J. PougetAbadie, M. Mirza, B. Xu, D. WardeFarley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
 [2] D. P. Kingma and M. Welling. Autoencoding variational Bayes. In ICLR, 2014.
 [3] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In NIPS, 2016.
 [4] V. Dumoulin, I. Belghazi, B. Poole, A. Lamb, M. A., O. Mastropietro, and A. Courville. Adversarially learned inference. ICLR, 2017.

[5]
L. Mescheder, S. Nowozin, and A. Geiger.
Adversarial variational bayes: Unifying variational autoencoders and generative adversarial networks.
ICML, 2017. 
[6]
Y. Pu, Z. Gan, R. Henao, X. Yuan, C. Li, A. Stevens, and L. Carin.
Variational autoencoder for deep learning of images, labels and captions.
In NIPS, 2016.  [7] Y. Pu, Z. Gan, R. Henao, C. Li, S. Han, and L. Carin. Vae learning via Stein variational gradient descent. NIPS, 2017.
 [8] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther. Autoencoding beyond pixels using a learned similarity metric. ICML, 2016.
 [9] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.
 [10] J. Donahue, K. Philipp, and T. Darrell. Adversarial feature learning. ICLR, 2017.
 [11] Y. Pu, W. Wang, R. Henao, L. Chen, Z. Gan, C. Li, and L. Carin. Adversarial symmetric variational autoencoder. NIPS, 2017.
 [12] J. Zhu, T. Park, P. Isola, and A. Efros. Unpaired imagetoimage translation using cycleconsistent adversarial networks. ICCV, 2017.
 [13] T. Kim, M. Cha, H. Kim, J. Lee, and J. Kim. Learning to discover crossdomain relations with generative adversarial networks. ICML, 2017.
 [14] Z. Yi, H. Zhang, and P. Tan. DualGAN: Unsupervised dual learning for imagetoimage translation. ICCV, 2017.
 [15] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
 [16] Z. Gan, L. Chen, W. Wang, Y. Pu, Y. Zhang, H. Liu, C. Li, and L. Carin. Triangle generative adversarial networks. NIPS, 2017.
 [17] M. Arjovsky and L. Bottou. Towards principled methods for training generative adversarial networks. In ICLR, 2017.
 [18] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. In ICML, 2016.

[19]
P. Isola, J. Zhu, T. Zhou, and A. Efros.
Imagetoimage translation with conditional adversarial networks.
CVPR, 2017.  [20] C. Li, K. Xu, J. Zhu, and B. Zhang. Triple generative adversarial nets. NIPS, 2017.
 [21] J. Zhao, M. Mathieu, and Y. LeCun. Energybased generative adversarial network. ICLR, 2017.
 [22] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In NIPS, 2016.

 [23] P. Vincent, H. Larochelle, Y. Bengio, and P. Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, 2008.
 [24] S. Fidler, S. Dickinson, and R. Urtasun. 3D object detection and viewpoint estimation with a deformable 3D cuboid model. In NIPS, 2012.
 [25] A. Yu and K. Grauman. Fine-grained visual comparisons with local learning. In CVPR, 2014.
 [26] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Trans. on Image Processing, 2004.
 [27] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In ICCV, 2015.
 [28] S. Xie and Z. Tu. Holistically-nested edge detection. In ICCV, 2015.
Appendix A Information Measures
Since our paper constrains the correlation of two random variables using information-theoretic measures, we first review the related concepts. For any probability measure on the random variables and , we have the following additive and subtractive relationships among various information measures, including Mutual Information (MI), Variation of Information (VI) and Conditional Entropy (CE).
(13)  
(14)  
(15)  
(16)  
(17) 
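For reference, the standard identities linking these three quantities are collected below. This is a sketch in generic notation: the symbols $x$, $z$, $H$, $I$ and $\mathrm{VI}$ are assumed here and need not match the notation or ordering of (13)-(17).

```latex
\begin{align*}
H(x, z) &= H(x) + H(z \mid x) = H(z) + H(x \mid z), \\
I(x; z) &= H(x) - H(x \mid z) = H(z) - H(z \mid x), \\
\mathrm{VI}(x, z) &= H(x \mid z) + H(z \mid x) = H(x, z) - I(x; z).
\end{align*}
```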
A.1 Relationship between Mutual Information, Conditional Entropy and the Negative Log Likelihood of Reconstruction
The following shows how the negative log-likelihood (NLL) of the reconstruction is related to the variation of information and the mutual information. On the support of , we denote as the encoder probability measure, and as the decoder probability measure. Note that the reconstruction loss for can be written in its log-likelihood form as .
Lemma 4
For random variables and with two different probability measures, and , we have
(18)  
(19)  
(20)  
(21) 
where is the conditional entropy. From Lemma 4, we have
Corollary 3
For random variables and with probability measure , the mutual information between and can be written as
(22) 
Given a simple prior such as an isotropic Gaussian, is a constant.
Corollary 4
For random variables and with probability measure , the variation of information between and can be written as
(23) 
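One standard way a reconstruction NLL relates to a conditional entropy is the cross-entropy decomposition: the NLL under the decoder equals the conditional entropy under the encoder plus an expected KL term, so the NLL upper-bounds the conditional entropy. We assume this is the kind of relation Lemma 4 expresses. The sketch below checks it numerically on a small discrete example; the table sizes and all variable names are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical discrete setting: x takes 4 values, z takes 3 values.
n_x, n_z = 4, 3

# Encoder-side joint q(x, z): a strictly positive table that sums to 1.
q_joint = rng.random((n_x, n_z)) + 0.1
q_joint /= q_joint.sum()
q_z = q_joint.sum(axis=0)        # marginal q(z)
q_x_given_z = q_joint / q_z      # each column is q(x | z)

# Decoder conditional p(x | z): another table whose columns sum to 1.
p_x_given_z = rng.random((n_x, n_z)) + 0.1
p_x_given_z /= p_x_given_z.sum(axis=0)

# Conditional entropy under q: -E_{q(x,z)}[log q(x | z)].
cond_entropy = -np.sum(q_joint * np.log(q_x_given_z))

# Reconstruction NLL: -E_{q(x,z)}[log p(x | z)].
nll = -np.sum(q_joint * np.log(p_x_given_z))

# Expected KL divergence: E_{q(z)}[KL(q(x | z) || p(x | z))].
expected_kl = np.sum(q_joint * (np.log(q_x_given_z) - np.log(p_x_given_z)))

# The gap between the NLL and the conditional entropy is the expected KL.
gap = nll - cond_entropy
```

The gap vanishes only when the decoder conditional matches the encoder-induced conditional, which is why minimizing the reconstruction NLL tightens the bound on the conditional entropy.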
Appendix B Proof for Adversarial Learning Schemes
The proofs for cycle-consistency and conditional GAN using adversarial training are shown below. They follow the proof in the original GAN paper [1]: we first show the implication of the optimal discriminator, and then show the corresponding optimal generator.
B.1 Proof of Proposition 1: Adversarially Learned Cycle-Consistency for Unpaired Data
In the unsupervised case, given data sample , one desirable property is reconstruction. The following game learns to reconstruct:
(24) 
Proposition 3
For fixed , the optimal in (24) yields .

We start from a simple observation
(25)
when . Therefore, the objective in (24) can be expressed as
(26)
(27)
Note that
(28)
(29)
(30)
The expression in (26) is maximal as a function of if and only if the integrand is maximal for every . However, the problem attains its maximum at , showing that
(31)
For the game in (24), for which are optimized so as to most confuse the discriminator, the optimal solution for the distribution parameters yields [1], and therefore from (31)
(32)
Similarly, we can show the cycle consistency property for reconstructing as .
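The step "the integrand is maximal for every" point relies on a standard fact from the original GAN analysis [1]: for positive constants a and b, the function a log(d) + b log(1 - d) is maximized over d in (0, 1) at d = a / (a + b). The sketch below verifies this numerically; the values of `a` and `b` are hypothetical density values at a single point and do not come from the paper.

```python
import numpy as np

def optimal_d(a, b):
    """Closed-form maximizer of a*log(d) + b*log(1 - d) over d in (0, 1)."""
    return a / (a + b)

def objective(d, a, b):
    """Pointwise discriminator objective for density values a and b."""
    return a * np.log(d) + b * np.log(1.0 - d)

# Hypothetical density values of the two distributions at one point.
a, b = 0.7, 0.2

# Brute-force search over a fine grid of discriminator outputs.
grid = np.linspace(1e-4, 1.0 - 1e-4, 100001)
d_grid = grid[np.argmax(objective(grid, a, b))]
d_star = optimal_d(a, b)
```

When the two densities agree at a point (a = b), the maximizer is 1/2, which is the "most confused" discriminator output used in the argument above.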
B.2 Proof of Proposition 2: Adversarially Learned Conditional Generation for Paired Data
In the supervised case, given the paired data sample , the following game is used to conditionally generate [15]:
(33) 
To show the results, we need the following Lemma:
Lemma 5
The optimal generator and discriminator, with parameters , form the saddle points of the game in (33) if and only if . Further,

For the observed paired data , we have , where is the marginal empirical distribution of for the paired data.
Also, when is paired with in the dataset. We start from the observation
(34)
Therefore, the objective in (33) can be expressed as
(35)
This integral is maximal as a function of if and only if the integrand is maximal for every . However, the problem attains its maximum at , showing that
(36)
or equivalently, the optimal generator is . Since , we further have . Similarly, for the conditional GAN of , we can show that is and for the . Combining them, we show that .
Appendix C More Results on the Toy Data
C.1 The detailed setup
The 5-component Gaussian mixture model (GMM) in is set with the means , and standard deviation . The isotropic Gaussian in is set with mean and standard deviation .
We consider various network architectures to compare the stability of the methods. The hyperparameters include the number of layers and the number of neurons of the discriminator and the two generators, and the update frequency for the discriminator and generator. The grid search specification is summarized in Table 2. Hence, the total number of experiments is .
A generalized version of the inception score is calculated, , where denotes a generated sample and is the label predicted by a classifier that is trained offline using the entire training set. It is also worth noting that although we inherit the name “inception score” from [22], our evaluation is not related to the “inception” model trained on the ImageNet dataset. Our classifier is a regular 3-layer neural network trained on the dataset of interest, which yields classification accuracy on this toy dataset.
[Figure: panels (a) ALICE, (b) ALI, (c) DAEs]
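The exact formula for the generalized inception score is not recoverable from the text above. As a minimal sketch, assume it takes the standard form of [22], the exponential of the average KL divergence between the per-sample predicted label distribution and the marginal label distribution, with the offline classifier supplying the predictions. The function name and the two test inputs below are hypothetical.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """Compute exp(E_x[KL(p(y|x) || p(y))]).

    probs: array of shape (n_samples, n_classes) holding the classifier's
    predicted class probabilities for each generated sample.
    """
    probs = np.asarray(probs, dtype=float)
    marginal = probs.mean(axis=0, keepdims=True)  # p(y), averaged over samples
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))

# Confident predictions spread evenly over 5 classes: score close to 5.
confident = np.eye(5)[np.arange(100) % 5]
# Uninformative predictions: score close to 1.
uniform = np.full((100, 5), 0.2)

score_confident = inception_score(confident)
score_uniform = inception_score(uniform)
```

Under this form, a 5-component mixture has a maximum score of 5, reached when every generated sample is assigned confidently to one component and all components are covered equally.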
C.2 Reconstruction of and sampling for
We show additional results for the reconstruction of and sampling for in Figure 6. ALICE shows good sampling ability, as it reflects the Gaussian characteristics of each of the 5 components, while ALI’s samples tend to be concentrated, as reflected by the shrunken Gaussian components. DAE learns an identity mapping, and thus shows weak generation ability.
C.3 Summary of the four variants of ALICE
ALICE is a general CE-based framework to regularize the objectives of bidirectional adversarial training, in order to obtain desirable solutions. To clearly show the versatility of ALICE, we summarize its four variants, and test their effectiveness on toy datasets.
In unsupervised learning, two forms of cycle-consistency/reconstruction are considered to bound CE:

Explicit cycle-consistency: Explicitly specified norm for reconstruction;

Implicit cycle-consistency: Implicitly learned reconstruction via adversarial training.
In semi-supervised learning, the pairwise information is leveraged in two forms to approximate CE:

Explicit mapping: Explicitly specified norm mapping (e.g., standard supervised losses);

Implicit mapping: Implicitly learned mapping via adversarial training.
Discussion
Explicit methods such as losses (): the similarity/quality of the reconstruction relative to the original sample is measured in terms of metric. This is easy to implement and optimize. However, it may lead to visually low-quality reconstructions in high dimensions.
Implicit methods via adversarial training: these essentially require the reconstruction to be close to the original sample in terms of metric (see Section 3.3 of [10]: Adversarial feature learning). This theoretically guarantees perfect reconstruction; however, it is hard to achieve in practice, especially in high-dimensional spaces.
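As a minimal sketch of the explicit variant, the following computes an average norm-based distance between samples and their reconstructions. The function name, the choice of norm order `k`, and the rotation-based encoder/decoder pair are all hypothetical and serve only to illustrate the loss; they are not the networks used in the paper.

```python
import numpy as np

def explicit_cycle_loss(x, x_rec, k=2):
    """Mean l_k distance between samples (rows of x) and their reconstructions."""
    diff = np.asarray(x, dtype=float) - np.asarray(x_rec, dtype=float)
    return float(np.mean(np.linalg.norm(diff, ord=k, axis=1)))

# Hypothetical invertible encoder/decoder: a 2D rotation and its inverse.
theta = 0.3
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

def encode(x):
    return x @ R

def decode(z):
    return z @ R.T  # the inverse of a rotation is its transpose

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 2))

# A perfectly cycle-consistent pair gives (numerically) zero loss.
perfect = explicit_cycle_loss(x, decode(encode(x)))
# Shifting every coordinate by 1 gives an l_1 distance of 2 per sample.
shifted = explicit_cycle_loss(x, x + 1.0, k=1)
```

The implicit variant replaces this fixed distance with a learned discriminator, so there is no closed-form loss to evaluate in the same way.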
Results
The effectiveness of these algorithms is demonstrated on low-dimensional toy data in Figure 7. The unsupervised variants are tested on the same toy dataset described above; the results are in Figure 7 (a)(b). For the supervised variants, we create a toy dataset where domain is a 2-component GMM and domain is a 5-component GMM. Since each domain is symmetric, ambiguity exists when CycleGAN variants attempt to discover the relationship between the two domains in a purely unsupervised setting. Indeed, we observed random switching of the discovered corresponding components across different runs of CycleGAN. By adding a tiny fraction of pairwise information (a cheap way to specify the desirable relationship), we can easily learn the correct correspondences for the entire dataset. In Figure 7 (c)(d), pairs (out of 2048) are pre-specified: the points in domain are paired with the points in domain with opposite signs. Both explicit and implicit ALICE find the correct pairing configurations for the other unlabeled samples. This inspires us to manually label the relations for a few samples between domains, and use ALICE to automatically control the pairing of the full dataset for real datasets. One example is shown on the Car2Car dataset.
C.4 Comparisons of ALI with stochastic/deterministic mappings
We investigate the ALI model with different mappings:

ALI: two stochastic mappings;

ALI: one stochastic mapping and one deterministic mapping;

BiGAN: two deterministic mappings.
We plot the histograms of ICP and MSE in Fig. 8, and report the mean and standard deviation in Table 2. In Fig. 9, we compare their reconstruction and generation ability. Models with deterministic mappings have higher reconstruction ability, but lower sampling ability.
Comparison on Reconstruction. Please see rows 1 and 2 in Fig. 9. For reconstruction, we start from one sample (red dot), and pass it through the cycle formed by the two mappings 100 times. The resulting reconstructions are shown as blue dots. The reconstructed samples tend to be more concentrated with more deterministic mappings.
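The reconstruction procedure can be sketched as follows, under two loudly stated assumptions: the 100 passes are independent passes of the same starting sample (rather than a chain feeding each output back in), and each mapping is a hypothetical identity-plus-Gaussian-noise function rather than a trained network. The sketch only illustrates why deterministic mappings give more concentrated reconstructions.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_mapping(noise_std):
    """Hypothetical mapping: identity plus Gaussian noise.

    noise_std = 0 gives a deterministic mapping.
    """
    def mapping(v):
        return v + noise_std * rng.normal(size=v.shape)
    return mapping

def reconstruct_many(x, encoder, decoder, n_passes=100):
    """Pass the same sample through the encoder-decoder cycle n_passes times."""
    return np.stack([decoder(encoder(x)) for _ in range(n_passes)])

x = np.array([1.0, -1.0])  # the starting sample (the "red dot")

stochastic = reconstruct_many(x, make_mapping(0.3), make_mapping(0.3))
deterministic = reconstruct_many(x, make_mapping(0.0), make_mapping(0.0))

# Average per-coordinate spread of the reconstructions (the "blue dots").
spread_stochastic = stochastic.std(axis=0).mean()
spread_deterministic = deterministic.std(axis=0).mean()
```

With both mappings deterministic, every pass returns the same point, so the spread is exactly zero; any stochastic mapping in the cycle makes the reconstructions scatter around the starting sample.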
Comparison on Sampling. Please see rows 3 and 4 in Fig. 9. For sampling, we first draw samples in each domain, and pass them through the mappings. The generated samples are colored according to the index of the Gaussian component they come from in the original domain.
[Figure: panels (a) Inception Score, (b) MSE]