Bidirectional Conditional Generative Adversarial Networks

11/20/2017 ∙ by Ayush Jaiswal, et al. ∙ USC Information Sciences Institute

Conditional variants of Generative Adversarial Networks (GANs), known as cGANs, are generative models that can produce data samples (x) conditioned on both latent variables (z) and known auxiliary information (c). Another GAN variant, the Bidirectional GAN (BiGAN), is a recently developed framework for learning the inverse mapping from x to z through an encoder trained simultaneously with the generator and the discriminator of an unconditional GAN. We propose the Bidirectional Conditional GAN (BiCoGAN), which combines cGANs and BiGANs into a single framework with an encoder that learns inverse mappings from x to both z and c, trained simultaneously with the conditional generator and discriminator in an end-to-end setting. We present crucial techniques for training BiCoGANs, which incorporate an extrinsic factor loss along with an associated dynamically-tuned importance weight. Compared to other encoder-based GANs, BiCoGANs not only encode c more accurately but also utilize z and c more effectively and in a more disentangled way to generate data samples.




1 Introduction

Generative Adversarial Networks (GANs) [6] have recently gained immense popularity in generative modeling of data from complex distributions for a variety of applications such as image editing [24], image synthesis from text descriptions [25], image super-resolution [15], video summarization [18], and others [3, 9, 11, 12, 16, 27, 29, 30, 31]. GANs essentially learn a mapping from a latent distribution to a higher dimensional, more complex data distribution. Many variants of the GAN framework have been recently developed to augment GANs with more functionality and to improve their performance in both data modeling and target applications [24, 4, 5, 7, 10, 19, 20, 21, 22, 33]. Conditional GAN (cGAN) [22] is a variant of standard GANs that was introduced to augment GANs with the capability of conditional generation of data samples based on both latent variables (or intrinsic factors) and known auxiliary information (or extrinsic factors) such as class information or associated data from other modalities. Desired properties of cGANs include the ability to disentangle the intrinsic and extrinsic factors, and also disentangle the components of extrinsic factors from each other, in the generation process, such that the incorporation of a factor minimally influences that of the others. Inversion of such a cGAN provides a disentangled information-rich representation of data, which can be used for downstream tasks (such as classification) instead of raw data. Therefore, an optimal framework would be one that ensures that the generation process uses factors in a disentangled manner and provides an encoder to invert the generation process, giving us a disentangled encoding. The existing equivalent of such a framework is the Invertible cGAN (IcGAN) [24], which learns inverse mappings to intrinsic and extrinsic factors for pretrained cGANs. The limitations of post-hoc training of encoders in IcGANs are that it prevents them from (1) influencing the disentanglement of factors during generation, and (2) learning the inverse mapping to intrinsic factors effectively, as noted for GANs in [5] and as evident in our experiments for IcGANs. Other encoder-based cGAN models either do not encode extrinsic factors [19] or encode them in fixed-length continuous vectors that do not have an explicit form [20], which prevents the generation of data with arbitrary combinations of extrinsic attributes.

We propose the Bidirectional Conditional GAN (BiCoGAN), which overcomes the deficiencies of the aforementioned encoder-based cGANs. The encoder in the proposed BiCoGAN is trained simultaneously with the generator and the discriminator, and learns inverse mappings of data samples to both intrinsic and extrinsic factors. Hence, our model exhibits implicit regularization, mode coverage and robustness against mode collapse similar to Bidirectional GANs (BiGANs) [4, 5]. However, training BiCoGANs naïvely does not produce good results in practice, because the encoder fails to model the inverse mapping to extrinsic attributes and the generator fails to incorporate the extrinsic factors while producing data samples. We present crucial techniques for training BiCoGANs, which address both of these problems. BiCoGANs outperform IcGANs on both encoding and generation tasks, and have the added advantages of end-to-end training, robustness to mode collapse and fewer model parameters. Additionally, the BiCoGAN-encoder outperforms IcGAN and the state-of-the-art methods on facial attribute prediction on cropped and aligned CelebA [17] images. Furthermore, state-of-the-art performance can be achieved at predicting previously unseen facial attributes using features learned by our model instead of images. The proposed model is significantly different from the conditional extension of the ALI model (cALIM) [5]. cALIM does not encode extrinsic attributes from data samples. It requires both data samples and extrinsic attributes as inputs to encode the intrinsic features. Thus, their extension is a conditional BiGAN, which is functionally different from the proposed bidirectional cGAN.

This paper has the following contributions. It (1) introduces the new BiCoGAN framework, (2) provides crucial techniques for training BiCoGANs, and (3) presents a thorough comparison of BiCoGANs with other encoder-based GANs, showing that our method achieves the state-of-the-art performance on several metrics. The rest of the paper is organized as follows. Section 2 discusses related work. In Section 3 we review the building blocks underlying the design of our model: GANs, cGANs and BiGANs. Section 4 describes our BiCoGAN framework and techniques for training BiCoGANs. Qualitative and quantitative analyses of our model are presented in Section 5. Section 6 concludes the paper and provides directions for future research.

2 Related Work

Perarnau et al. [24] developed the IcGAN model to learn inverse mappings of a pretrained cGAN from data samples to intrinsic and extrinsic attributes using two independent encoders trained post-hoc, one for each task. In their experiments they showed that using a common encoder did not perform well. In contrast, the proposed BiCoGAN model incorporates a single encoder to embed both intrinsic and extrinsic factors, which is trained jointly with the generator and the discriminator from scratch.

BiGANs are related to autoencoders [8], which also encode data samples and reconstruct data from compact embeddings. Donahue et al. [4] show a detailed mathematical relationship between the two frameworks. Makhzani et al. [19] introduced an adversarial variant of autoencoders (AAE) that constrains the latent embedding to be close to a simple prior distribution (e.g., a multivariate Gaussian). Their model consists of an encoder, a decoder and a discriminator. While the encoder and the decoder are trained with a reconstruction loss on real data samples, the discriminator decides whether a latent vector comes from the prior distribution or from the encoder’s output distribution. In their paper, they presented unsupervised, semi-supervised and supervised variants of AAEs. Supervised AAEs (SAAEs) have a setting similar to that of BiCoGANs. Both SAAE decoders and BiCoGAN generators transform intrinsic and extrinsic factors into data samples. However, SAAE encoders learn only intrinsic factors while encoders of the proposed BiCoGAN model learn both. While the structure of data samples is learned explicitly through the reconstruction loss in SAAE, it is learned implicitly in BiCoGANs.

Variational Autoencoders (VAE) [13] have also been trained adversarially in both unconditional and conditional settings [20, 21]. The conditional adversarial VAE of [20] (cAVAE) encodes extrinsic factors of data into a fixed-length continuous vector. This vector, along with the encoded latent attributes, can be used to reconstruct images. However, the vector is not interpretable and comes from encoding a real data sample. Hence, generating a new sample with certain desired extrinsic properties from a cAVAE requires first encoding a similar real data sample (with exactly those properties) to get its extrinsic vector. In comparison, such attributes can be explicitly provided to BiCoGANs for data generation.

Table 1 discusses our targeted GAN properties and shows that existing encoder based GANs do not fulfill all of them.

Table 1: Target properties and their fulfillment by relevant GAN models. Properties compared: feature encoding; single encoder for z & c; generator uses z & c; generator allows explicitly specifying c. (Per-model entries are missing in this copy.)

3 Preliminaries

In this section, we introduce the mathematical notation and a brief description of the fundamental building blocks underlying the design of BiCoGANs including GANs, cGANs and BiGANs.

3.1 Generative Adversarial Networks

The working principle of the GAN framework is learning a mapping from a simple latent (or prior) distribution to the more complex, high-dimensional data distribution. A GAN is composed of a generator and a discriminator. The goal of the generator is to produce samples that resemble real data samples, while the discriminator’s objective is to differentiate between real samples and those generated by the generator. The data $x$ comes from the distribution $p_d$ and the latent vector $z$ is drawn from a prior distribution $p_z$. Therefore, the generator is a mapping $G(z; \theta_G)$ from $p_z$ to the generator’s distribution $p_G$ with the goal of bringing $p_G$ as close as possible to $p_d$. On the other hand, the discriminator $D(x; \theta_D)$ is simply a classifier that produces a scalar value $y \in [0, 1]$ indicating whether $x$ is from $p_G$ or from $p_d$. The generator and the discriminator play the minimax game (with the networks trained through backpropagation) as shown in Equation 1.

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_d(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{1}$$
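As a rough numerical illustration of this minimax game, the value function can be estimated from a batch of discriminator outputs. The sketch below assumes the standard GAN value function and a discriminator that outputs the probability of a sample being real; the function name `gan_value`, the `eps` guard and the toy numbers are illustrative assumptions, not taken from the paper.

```python
import math

def gan_value(d_real, d_fake, eps=1e-12):
    """Monte Carlo estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs D(x) on real samples, each in [0, 1].
    d_fake: discriminator outputs D(G(z)) on generated samples, each in [0, 1].
    eps guards against log(0); it is an implementation convenience only.
    """
    real_term = sum(math.log(max(p, eps)) for p in d_real) / len(d_real)
    fake_term = sum(math.log(max(1.0 - p, eps)) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that cannot tell real from fake outputs 0.5 everywhere,
# which gives V = 2 * log(0.5) = -log(4).
confused = gan_value([0.5, 0.5], [0.5, 0.5])

# A confident, correct discriminator pushes V toward 0, its maximum.
confident = gan_value([0.99, 0.98], [0.01, 0.02])
```

The discriminator tries to increase this quantity while the generator tries to decrease it, which is why a well-trained generator drives the discriminator toward the "confused" case.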



3.2 Conditional Generative Adversarial Networks

Mirza et al. [22] introduced conditional GAN (cGAN), which extends the GAN framework to the conditional setting where data can be generated conditioned on known auxiliary information such as class labels, object attributes, semantic tags and associated data from different modalities. cGANs thus provide more control over the data generation process with an explicit way to communicate desired attributes of the data to be generated to the GAN. This can be thought of as using a new prior vector $\tilde{z}$ with two components $\tilde{z} = [z \; c]$, where $z$ represents latent intrinsic factors and $c$ represents auxiliary extrinsic factors. Hence, the generator is a mapping $G(\tilde{z}; \theta_G)$ from $p_{\tilde{z}}$ to $p_G$ and the discriminator models $D(x, c; \theta_D)$ that gives $y \in [0, 1]$. The cGAN discriminator also utilizes the knowledge of $c$ to determine if $x$ is real or fake. Thus, the generator must incorporate $c$ while producing $x$ in order to fool the discriminator. The model is trained with a similar minimax objective as the original GAN formulation, as shown in Equation 2.

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_d(x)}[\log D(x, c)] + \mathbb{E}_{\tilde{z} \sim p_{\tilde{z}}(\tilde{z})}[\log(1 - D(G(\tilde{z}), c))] \tag{2}$$
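The two-component prior described above can be sketched as a simple concatenation of a sampled latent vector with a known extrinsic vector. This is a minimal illustration only: the standard normal prior, the one-hot label encoding, and the helper names `sample_prior` and `one_hot` are assumptions made for the example.

```python
import random

def one_hot(label, num_classes):
    """Encode a class label as a one-hot extrinsic vector c."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

def sample_prior(z_dim, c, rng):
    """Build the combined prior vector [z, c] fed to a conditional generator.

    z is drawn from a standard normal (an assumed choice of simple prior);
    c is the known extrinsic vector, e.g. a one-hot class label.
    """
    z = [rng.gauss(0.0, 1.0) for _ in range(z_dim)]
    return z + list(c)

rng = random.Random(0)
z_tilde = sample_prior(z_dim=4, c=one_hot(7, 10), rng=rng)
```

Because c is passed in explicitly, any desired combination of extrinsic attributes can be requested at generation time, independently of the random draw of z.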


3.3 Bidirectional Generative Adversarial Networks

The GAN framework provides a mapping from $z$ to $x$, but not another from $x$ to $z$. Such a mapping is highly useful as it provides a compact, information-rich representation of $x$, which can be used as input feature vectors for downstream tasks (such as classification) instead of the original data in simple yet effective ways [4, 5]. Donahue et al. [4] and Dumoulin et al. [5] independently developed the BiGAN (or ALI) model that adds an encoder to the original generator-discriminator framework. The generator models the same mapping as the original GAN generator while the encoder is a mapping $E(x; \theta_E)$ from $p_d$ to $p_E$ with the goal of bringing $p_E$ close to $p_z$. The discriminator is modified to incorporate both $z$ and $G(z)$ or both $x$ and $E(x)$ to make real/fake decisions as $D(z, G(z); \theta_D)$ or $D(E(x), x; \theta_D)$, respectively. Donahue et al. [4] provide a detailed proof to show that under optimality, $G$ and $E$ must be inverses of each other to successfully fool the discriminator. The model is trained with the new minimax objective as shown in Equation 3.

$$\min_{G, E} \max_D V(D, G, E) = \mathbb{E}_{x \sim p_d(x)}[\log D(E(x), x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(z, G(z)))] \tag{3}$$


4 Proposed Model — Bidirectional Conditional GAN

An optimal cGAN framework would be one in which (1) the extrinsic factors can be explicitly specified so as to enable data generation conditioned on arbitrary combinations of factors, (2) the generation process uses intrinsic and extrinsic factors in a disentangled manner, (3) the components of the extrinsic factors minimally affect each other while generating data, and (4) the generation process can be inverted, giving us a disentangled information-rich embedding of data. However, existing models fail to simultaneously fulfill all of these desired properties, as reflected in Table 1. Moreover, formulating and training such a cGAN model is difficult given the inherent complexity of training GANs and the added constraints required to achieve the said goals.

We design the proposed Bidirectional Conditional GAN (BiCoGAN) framework with the aforementioned properties as our foundational guidelines. While goal (1) is fulfilled by explicitly providing the extrinsic factors as inputs to the BiCoGAN generator, in order to accomplish goals (2) and (3), we design the BiCoGAN discriminator to check the consistency of the input data with the associated intrinsic and extrinsic factors. Thus, the BiCoGAN generator must effectively incorporate both the sets of factors into the generation process to successfully fool the discriminator. Finally, in order to achieve goal (4), we incorporate an encoder in the BiCoGAN framework that learns the inverse mapping of data samples to both intrinsic and extrinsic factors. We train the encoder jointly with the generator and discriminator to ascertain that it effectively learns the inverse mappings and improves the generation process through implicit regularization, better mode coverage and robustness against mode collapse (like BiGANs [4, 5]). Thus, BiCoGANs generate samples conditioned on desired extrinsic factors and effectively encode real data samples into disentangled representations comprising both intrinsic and extrinsic attributes. This provides an information-rich representation of data for auxiliary supervised semantic tasks [4], as well as a way for conditional data augmentation [27, 28] to aid their learning. Figure 1 illustrates the proposed BiCoGAN framework.

Figure 1: Bidirectional Conditional Generative Adversarial Network. The dotted line indicates that the encoder $E$ is trained to predict the $c$ part of $E(x)$ with supervision (extrinsic factor loss).

The generator learns a mapping $G(\tilde{z}; \theta_G)$ from the distribution $p_{\tilde{z}}$ (where $\tilde{z} = [z \; c]$) to $p_G$ with the goal of bringing $p_G$ close to $p_d$, while the encoder models $E(x; \theta_E)$ from $p_d$ to $p_E$ with the goal of bringing $p_E$ close to $p_{\tilde{z}}$. The discriminator makes real/fake decisions as $D(\tilde{z}, G(\tilde{z}); \theta_D)$ or $D(E(x), x; \theta_D)$. It is important to note that the proposed BiCoGAN encoder must learn the inverse mapping of $x$ to $z$ and $c$, just like the generator must learn to incorporate both into the generation of data samples in order to fool the discriminator, following from the invertibility under optimality theorem of BiGANs [4, 5]. However, in practice, such optimality is difficult to achieve, especially when the prior vector contains structured information or has a complex distribution. While the intrinsic factors are sampled randomly from a simple latent distribution, the extrinsic factors are much more specialized and model specific forms of high-level information, such as class labels or object attributes, making their underlying distribution significantly more difficult to model. To address this challenge, we introduce the extrinsic factor loss (EFL) as an explicit mechanism that helps guide BiCoGANs to better encode extrinsic factors. This is built on the fact that the $c$ associated with each real data sample is known during training, and can, thus, be used to improve the learning of inverse mappings from $x$ to $c$.

We do not give an explicit form to EFL in the BiCoGAN objective because the choice of the loss function depends on the nature of $c$, and hence, on the dataset/domain.
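Since the form of EFL is left open, the following sketch only illustrates how the choice might follow the nature of the extrinsic factors. The three losses are common supervised losses chosen here as assumptions (categorical cross-entropy for one-hot labels, binary cross-entropy for binary attribute vectors, mean squared error for continuous factors); the text does not state which losses the authors used.

```python
import math

def categorical_cross_entropy(c_true, c_pred, eps=1e-12):
    """EFL candidate for one-hot class labels (e.g. digit classes)."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(c_true, c_pred))

def binary_cross_entropy(c_true, c_pred, eps=1e-12):
    """EFL candidate for independent binary attributes, averaged over attributes."""
    total = 0.0
    for t, p in zip(c_true, c_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() arguments strictly positive
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(c_true)

def mean_squared_error(c_true, c_pred):
    """EFL candidate for continuous factors (e.g. an angle)."""
    return sum((t - p) ** 2 for t, p in zip(c_true, c_pred)) / len(c_true)

# One-hot label with the true class predicted at probability 0.5.
cce = categorical_cross_entropy([0, 1, 0], [0.2, 0.5, 0.3])
# Binary attributes: a sharper correct prediction yields a lower loss.
bce_sharp = binary_cross_entropy([1, 0], [0.9, 0.1])
bce_vague = binary_cross_entropy([1, 0], [0.6, 0.4])
```

In each case the loss compares the known c of a real sample with the c part of the encoder output for that sample.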

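Adding EFL to the BiCoGAN objective is not sufficient to achieve the best results for both encoding $c$ and generating $x$ that incorporates the knowledge of $c$. This is justified by the fact that the training process has no information about the inherent difficulty of encoding $c$ (specific to the domain). Thus, it is possible that the backpropagated gradients of the EFL (to the encoder) are distorted by those from the discriminator in the BiCoGAN framework. Therefore, we multiply EFL with an importance weight, which we denote by $\gamma$ and refer to as the EFL weight (EFLW), in the BiCoGAN objective as shown in Equation 4, where $E_c(x)$ denotes the $c$ part of $E(x)$.

$$\min_{G, E} \max_D V(D, G, E) = \mathbb{E}_{x \sim p_d(x)}[\log D(E(x), x)] + \gamma \, \mathbb{E}_{(x, c) \sim p_d(x, c)}[\mathrm{EFL}(c, E_c(x))] + \mathbb{E}_{\tilde{z} \sim p_{\tilde{z}}(\tilde{z})}[\log(1 - D(\tilde{z}, G(\tilde{z})))] \tag{4}$$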


The importance weight $\gamma$ can be chosen as a constant value or a dynamic parameter that keeps changing during training to control the focus of the training between the naïve adversarial objective and the EFL. While the former option is straightforward, the latter requires some understanding of the dynamics between the original generator-discriminator setup of cGANs and the additional encoder as introduced in the proposed BiCoGAN model. It can be seen that the objective of the generator is significantly more difficult than that of the encoder, making the former more vulnerable to instability during training. Thus, in the dynamic setting, we design $\gamma$ as a clipped exponentially increasing variable that starts with a small initial value, i.e., $\gamma = \min(\alpha e^{\rho t}, \phi)$, where $\alpha$ is the initial value for $\gamma$, $\phi$ is its maximum value, $\rho$ controls the rate of exponential increase and $t$ indicates the number of epochs the model has already been trained. This is motivated by a similar approach introduced in [2] for deep multi-task learning.
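The clipped exponential schedule described above can be sketched in a few lines. The parameter names (`alpha` for the initial value, `rho` for the growth rate, `phi` for the maximum) and the numeric values below are illustrative assumptions chosen only to show the shape of the schedule; they are not the settings used in the experiments.

```python
import math

def eflw(epoch, alpha, rho, phi):
    """Clipped exponential schedule: min(alpha * exp(rho * epoch), phi)."""
    return min(alpha * math.exp(rho * epoch), phi)

# Hypothetical hyperparameters: start small, grow each epoch, cap at phi.
schedule = [eflw(t, alpha=0.1, rho=0.5, phi=5.0) for t in range(12)]
```

Starting with a small weight lets the adversarial objective dominate early, when the generator is most fragile, and shifts emphasis toward the EFL only as training progresses.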

5 Experimental Evaluation

In this section we evaluate the performance of the encoder and the generator of the proposed BiCoGAN model jointly and independently, and conduct a series of experiments to compare BiCoGANs with other encoder-based GANs, specifically, IcGAN, cALIM and cAVAE on various learning tasks. We also evaluate the effect of EFL and its associated EFLW on BiCoGAN training. We used keras-adversarial to implement the proposed BiCoGAN model.

5.1 Datasets

All models are evaluated on the MNIST [14] handwritten digits dataset and the CelebA [17] dataset of celebrity faces with annotated facial attributes. Samples in the MNIST dataset are grayscale images while those in the CelebA dataset are cropped and resized from the aligned version of the dataset. We consider the class labels in the MNIST dataset as extrinsic factors and components of writing styles as intrinsic factors. We select the same visually impactful facial attributes of the CelebA dataset that the authors of IcGAN [24] used as extrinsic factors and all other factors of variation as intrinsic features. We did not evaluate the other GAN models on datasets for which their official implementations are not available. Therefore, we compare BiCoGAN with IcGAN and cAVAE on MNIST, and with IcGAN and cALIM on CelebA. Furthermore, we present qualitative results of the proposed BiCoGAN model on a dataset of chairs rendered using a Computer Aided Design (CAD) model [1]. Each chair is rendered at different yaw angles, and cropped and downsampled. We use the yaw angle, a continuous value, as the extrinsic attribute for this dataset and all other factors of variation as intrinsic variables.

5.2 Metrics

We quantify the performance of the proposed BiCoGAN and other aforementioned GANs on encoding the extrinsic factors, $c$, using both the mean accuracy and the mean $F_1$-score of the $c$ encoded by these models versus the ground-truth $c$. We refer to these metrics as the encoding accuracy and the encoding $F_1$-score, respectively. We follow the approach in [26] by using an external discriminative model to assess the quality of generated images. The core idea behind this approach is that the performance of an external model trained on real data samples should be similar when evaluated on both real and GAN-generated test samples. We trained a digit classifier using a simple convolutional neural network for MNIST and the attribute predictor Anet [17] model for CelebA, on the training splits of these datasets. Thus, in our experimental settings, this metric also measures the ability of the generator in incorporating $c$ in the generation of $x$. We use both accuracy and $F_1$-score to quantify the performance of the external model on generated images, and refer to them as the generation accuracy and the generation $F_1$-score. We show the accuracy and the $F_1$-score of these external models on real test datasets for reference (the “Ext-real” values). We also calculate the adversarial accuracy (AA) as proposed in [33]. AA is calculated by training the external classifier on samples generated by a GAN and testing on real data. If the generator generalizes well and produces good quality images, the AA score should be similar to the corresponding accuracy score. In order to calculate the generation accuracy, the generation $F_1$-score and AA, we use each GAN to generate a set of images. Each generated image is created using a $c$ from the real training dataset combined with a randomly sampled $z$. The generated set is then used as the testing set for calculating the generation accuracy and $F_1$-score, and as the training set for calculating AA. Furthermore, we evaluate the ability of the GAN models to disentangle intrinsic factors from extrinsic attributes in the data generation process on the CelebA dataset using an identity-matching score (IMS). The motivation behind this metric is that the identity of generated faces should not change when identity-independent attributes (like hair color or the presence of eyeglasses) change. We first randomly generate faces with “male” and “black hair” attributes and another with “female” and “black hair” attributes. We then generate eight variations of these base images with the attributes: “bangs”, “receding hairline”, “blond hair”, “brown hair”, “gray hair”, “heavy makeup”, “eyeglasses” and “smiling” respectively. We encode all the generated images using a pretrained VGG-Face [23] model. IMS is then calculated as the mean cosine similarity of the base images with their variations. We provide results on MNIST and CelebA for two settings of BiCoGANs; one where we prioritize the performance of the generator (BiCoGAN-gen) and another for that of the encoder (BiCoGAN-enc), which gives us an empirical upper bound on the performance of BiCoGAN encoders.
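The identity-matching score described above reduces to a mean of cosine similarities between face embeddings. The sketch below is a minimal illustration under stated assumptions: the embeddings are taken as plain lists of floats (in practice they would come from a face-recognition network such as VGG-Face), and the function names and the toy 2-D vectors are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def identity_matching_score(base_embeddings, variation_embeddings):
    """Mean cosine similarity between each base embedding and its variations.

    base_embeddings: list of embedding vectors, one per base image.
    variation_embeddings: list of the same length; each entry is a list of
        embedding vectors for the attribute variations of that base image.
    """
    sims = [
        cosine_similarity(base, var)
        for base, variations in zip(base_embeddings, variation_embeddings)
        for var in variations
    ]
    return sum(sims) / len(sims)

# Toy 2-D "embeddings" standing in for real face features.
bases = [[1.0, 0.0], [0.0, 1.0]]
variations = [[[1.0, 0.0], [1.0, 1.0]], [[0.0, 2.0], [1.0, 1.0]]]
ims = identity_matching_score(bases, variations)
```

A score close to 1 means that changing identity-independent attributes leaves the embedded identity nearly unchanged.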

5.3 Importance of Extrinsic Factor Loss

We first analyze the importance of incorporating EFL for training BiCoGANs and the influence of EFLW on the performance of the trained model. Figures 2(d) and 3(d) show some examples of images randomly generated using a BiCoGAN trained without EFL on MNIST and CelebA, respectively. We see that BiCoGANs are not able to incorporate $c$ into the data generation process when trained without EFL. The metrics discussed in Section 5.2 are calculated for BiCoGANs trained with a range of constant EFLW values on MNIST and on CelebA, and with the dynamic EFLW setting on both. Figure 4 summarizes our results. As before, we see that BiCoGANs are unable to learn the inverse mapping of $x$ to $c$ when the EFLW is zero. The results show that increasing the EFLW up until a tipping point helps train BiCoGANs better. However, beyond that point, the EFL term starts dominating the overall objective, leading to degrading performance in the quality of generated images (as reflected by the generation accuracy and $F_1$-score). Meanwhile, the dynamic setting of the EFLW achieves the best results on both the datasets on almost all metrics, establishing its effectiveness at training BiCoGANs. It is also important to note that a dynamic EFLW saves significant time and effort involved in selecting a constant EFLW through manual optimization, which also depends on the complexity of the dataset. Therefore, we use BiCoGANs trained with the dynamic EFLW for the comparative results in the following sections.

(a) BiCoGAN with EFL
(b) IcGAN
(c) cAVAE
(d) BiCoGAN without EFL
Figure 2: Randomly generated (MNIST) digits
(a) BiCoGAN with EFL
(b) cALIM
(c) IcGAN
(d) BiCoGAN without EFL
Figure 3: Randomly generated (CelebA) faces. Base images are generated with black hair and gender as male (first row) and female (second row). “Gender” column indicates gender change. Red boxes show cases where unspecified attributes or latent factors are mistakenly changed during generation.
Figure 4: Influence of EFLW for BiCoGANs trained on (a & b) MNIST and on (c & d) CelebA. The encoding accuracy and $F_1$-score show the performance of encoding $c$, while the generation accuracy and $F_1$-score show that of data generation. “EFLW=auto” denotes the dynamic-EFLW setting. The accuracy and $F_1$-score of the external models on real test data are shown as “XExt-real” values. The y-axes of the plots have been scaled to easily observe differences.

5.4 Conditional Generation

In this section, we evaluate the ability of the BiCoGAN generator to (1) generalize over the prior distribution of intrinsic factors, i.e., be able to generate images with random intrinsic factors, (2) incorporate extrinsic factors while producing images, and (3) disentangle intrinsic and extrinsic factors during generation.

In terms of a qualitative evaluation of the proposed BiCoGAN model, Figures 2(a), 2(b) and 2(c) show some generated MNIST images with BiCoGAN, IcGAN and cAVAE, respectively. For each of these figures, we randomly sampled $z$ vectors from the latent distribution (fixed along rows) and combined them with the digit class information $c$ (fixed along columns) to generate images. In order to vary $c$ while generating images with cAVAE, we picked a random MNIST image from each of the classes and passed it through the cAVAE extrinsic-factor encoder to get the $c$ representation for each class, as discussed in Section 2. This is required because $c$ in cAVAE does not have an explicit form and is instead a fixed-length continuous vector. We find that the apparent visual quality of the generated MNIST digits is similar for all three models, with cAVAE producing slightly unrealistic images. Figures 3(a), 3(b) and 3(c) show some randomly generated CelebA images with BiCoGAN, cALIM and IcGAN, respectively. For each row in these figures, we sampled a $z$ vector from the latent distribution. We set the extrinsic attributes to male and black-hair for the first row and female and black-hair for the second row. We then generate each image in the grids based on the combination of these attributes with the new feature specified as the column header. The figures show that BiCoGANs perform the best at preserving intrinsic (like subject identity and lighting) and extrinsic factors (besides the specified new attribute). Hence, BiCoGAN outperforms the other models in disentangling the influence of $z$ and the components of $c$ in the data generation process.


Model         Enc. accuracy   Enc. F1   AA       Gen. accuracy   Gen. F1
cAVAE         -               -         0.9614   0.8880          0.9910
IcGAN         0.9871          0.9853    0.9360   0.9976          0.9986
BiCoGAN-gen   0.9888          0.9888    0.9384   0.9986          0.9986
BiCoGAN-enc   0.9902          0.9906    0.9351   0.9933          0.9937

Table 2: Encoding and Generation Performance - MNIST


Model         Enc. accuracy   Enc. F1   AA       Gen. accuracy   Gen. F1   IMS
cALIM         -               -         0.9138   0.9139          0.6423    0.9085
IcGAN         0.9127          0.5593    0.8760   0.9030          0.5969    0.8522
BiCoGAN-gen   0.9166          0.6978    0.9174   0.9072          0.6289    0.9336
BiCoGAN-enc   0.9274          0.7338    0.8747   0.8849          0.5443    0.9286

Table 3: Encoding and Generation Performance - CelebA

As mentioned earlier, we quantify the image generation performance using the generation accuracy, generation $F_1$-score, AA and IMS metrics. Table 2 shows results on the MNIST dataset for BiCoGAN, IcGAN and cAVAE. We show the Ext-real accuracy and $F_1$-score for reference within parentheses in the generation accuracy and generation $F_1$-score column headings, respectively. While the proposed BiCoGAN performs the best on generation accuracy and $F_1$-score, cAVAE performs better on AA. This indicates that cAVAE is more prone to producing digits of wrong but easily confusable classes. Table 3 shows similar results on CelebA for BiCoGAN, IcGAN and cALIM. BiCoGAN outperforms IcGAN on almost all metrics. However, cALIM performs the best on generation accuracy and $F_1$-score. While this indicates that cALIM is better able to incorporate extrinsic factors for generating images, IMS indicates that cALIM does this at the cost of intrinsic factors. Specifically, cALIM fails to effectively use the identity information contained in the intrinsic factors and to disentangle it from the extrinsic attributes while generating images. In comparison, BiCoGAN performs the best on IMS. BiCoGAN also performs the best on AA, indicating that it successfully generates diverse but realistic images.


Attribute   LNet+ANet WalkLearn   IcGAN BiCoGAN (Ours)


Bald   0.98 0.92   0.98 0.98
Bangs   0.95 0.96   0.92 0.95
Black_Hair   0.88 0.84   0.83 0.88
Blond_Hair   0.95 0.92   0.93 0.95
Brown_Hair   0.8 0.81   0.87 0.87
Bushy_Eyebrows   0.9 0.93   0.91 0.92
Eyeglasses   0.99 0.97   0.98 0.99
Gray_Hair   0.97 0.95   0.98 0.98
Heavy_Makeup   0.9 0.96   0.88 0.90
Male   0.98 0.96   0.96 0.97
Mouth_Slightly_Open   0.93 0.97   0.90 0.93
Mustache   0.95 0.90   0.96 0.96
Pale_Skin   0.91 0.85   0.96 0.97
Receding_Hairline   0.89 0.84   0.92 0.93
Smiling   0.92 0.98   0.90 0.92
Straight_Hair   0.73 0.75   0.80 0.80
Wavy_Hair   0.8 0.85   0.76 0.79
Wearing_Hat   0.99 0.96   0.98 0.98


MEAN   0.91 0.91   0.91 0.93


Table 4: Attribute-level Breakdown of Encoder Accuracy - CelebA

5.5 Encoding Extrinsic Factors

We assess the performance of the models at encoding the extrinsic factors from data samples using the encoding accuracy and encoding $F_1$-score metrics. We calculate these scores directly on the testing split of each dataset. Tables 2 and 3 show the performance of IcGAN and BiCoGAN in encoding $c$ on MNIST and CelebA, respectively. We note here that we cannot calculate these scores for cALIM because it does not encode $c$ from $x$, and for cAVAE because the $c$ it encodes does not have an explicit form. BiCoGAN consistently outperforms IcGAN at encoding extrinsic factors from data. Furthermore, we provide an attribute-level breakdown of accuracies for the CelebA dataset in Table 4 and compare it with two state-of-the-art methods for cropped and aligned CelebA facial attribute prediction as reported in [32], namely, LNet+ANet [17] and WalkLearn [32]. BiCoGAN outperforms the state-of-the-art methods even though the EFL directly responsible for it is only one part of the entire adversarial objective. This indicates that supervised tasks (like attribute prediction) can benefit from training the predictor with a generator and a discriminator in an adversarial framework like ours.

Figure 5: Interpolation results on MNIST hand-written digits using (a) BiCoGAN, (b) IcGAN and (c) cAVAE. Each row in a grid shows the interpolation between the leftmost and the rightmost images.
Figure 6: Interpolation results on CelebA faces using (a) BiCoGAN, (b) IcGAN and (c) cALIM. Each row in a grid shows the interpolation between the leftmost and the rightmost images.

5.6 Learning the Underlying Manifold

We assess the ability of the proposed BiCoGAN to learn the underlying manifold by interpolating $z$ and $c$ between pairs of images and comparing with IcGAN and cAVAE on MNIST, and with IcGAN and cALIM on CelebA. Figures 5 and 6 show results of our experiments for MNIST and CelebA images, respectively. We see that BiCoGANs exhibit smoother transitions than the other models while traversing the underlying latent space between the pairs of images.
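An interpolation experiment of this kind can be sketched as follows. The sketch assumes linear interpolation between the encoded vectors of the two endpoint images, which is a common choice but is not stated in the text; the function name `interpolate` and the toy codes are hypothetical.

```python
def interpolate(start, end, steps):
    """Linearly interpolate between two vectors, endpoints included."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    path = []
    for i in range(steps):
        t = i / (steps - 1)  # t runs from 0.0 (start) to 1.0 (end)
        path.append([(1.0 - t) * a + t * b for a, b in zip(start, end)])
    return path

# Hypothetical encoded [z, c] vectors for the leftmost and rightmost images.
code_left = [0.0, 2.0, 1.0, 0.0]
code_right = [4.0, -2.0, 0.0, 1.0]
path = interpolate(code_left, code_right, steps=5)
# Each element of `path` would be passed to the generator to render one frame.
```

Smooth visual transitions along such a path suggest that nearby codes map to similar images, i.e. that the model has captured the structure of the data manifold.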

5.7 Image Reconstruction with Variations

We assess the performance of the generator and the encoder in the BiCoGAN framework jointly by comparing our model with IcGAN and cAVAE on the ability to reconstruct images with varied extrinsic factors on the MNIST dataset, and with IcGAN on the CelebA dataset. We do not compare against cALIM since it does not encode $c$. In order to vary $c$ while generating images with cAVAE, we first calculate the $c$-embedding for each class as we did in Section 5.4. Figures 7 and 8 show our results on MNIST and CelebA, respectively. We see that intrinsic factors (such as writing style for MNIST and subject identity, lighting and pose for CelebA) are better preserved in variations of images reconstructed with BiCoGANs compared to other models. On CelebA we also see that for BiCoGAN, changing an attribute has less effect on the incorporation of other extrinsic factors as well as the intrinsic features in the generation process, compared to IcGAN. This reinforces similar results that we observed in Section 5.4.
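The reconstruction-with-variations procedure amounts to encoding an image once and regenerating it with the same intrinsic code and different extrinsic factors. The sketch below shows only this control flow; `encoder` and `generator` are placeholders for trained networks, and the toy stand-ins exist solely so the example runs.

```python
def reconstruct_with_variations(x, encoder, generator, c_variants):
    """Encode x to (z, c), then regenerate with the same z and different c.

    encoder: callable mapping an image to a (z, c) pair.
    generator: callable mapping (z, c) to an image.
    Returns the plain reconstruction and the list of variations.
    """
    z, c = encoder(x)
    reconstruction = generator(z, c)
    variations = [generator(z, c_new) for c_new in c_variants]
    return reconstruction, variations

# Toy stand-ins: an "image" is just a (style, label) pair.
def toy_encoder(x):
    style, label = x
    return style, label

def toy_generator(z, c):
    return (z, c)

recon, variants = reconstruct_with_variations(
    ("slanted", 3), toy_encoder, toy_generator, c_variants=[0, 1, 2]
)
```

With a well-disentangled model, every variation keeps the intrinsic content of the original (the "slanted" style in the toy case) while only the requested extrinsic factor changes.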

Figure 7: MNIST images reconstructed with varied class information using (a) BiCoGAN, (b) IcGAN and (c) cAVAE. Column “O” shows the real image; “R” shows the reconstruction. The following columns show images with the same z but varied c.
Figure 8: CelebA images reconstructed with varied attributes using (a) BiCoGAN and (b) IcGAN. “Orig” shows the real image, “Recon” shows its reconstruction and the other columns show images with the same z but varied c. Red boxes show cases where unspecified attributes or latent factors are mistakenly modified during generation.
Figure 9: BiCoGAN-generated chairs
Figure 10: Interpolation results on the Chairs dataset

5.8 Continuous Extrinsic Factors

In our evaluations in the previous subsections, we have provided results on datasets where c is a one-hot encoding or a vector of binary attributes. We explore the ability of the proposed BiCoGAN to model data distributions when the extrinsic attributes are continuous, on the chairs dataset [1] with c denoting the yaw angle, as mentioned above. Figure 9 shows chairs generated at eight different angles using our model, with z fixed along rows. The results show that the model is able to generate chairs for different c while preserving the information contained in z. We also assess the ability of BiCoGAN to learn the underlying manifold by interpolating between pairs of chairs. Figure 10 shows results of our experiments. Each row in the grid shows results of interpolation between the leftmost and the rightmost images. We see that the proposed BiCoGAN model shows smooth transitions while traversing the underlying latent space of chairs.


Attribute            LNet+ANet   WalkLearn   BiCoGAN (Ours)
5_o_Clock_Shadow     0.91        0.84        0.92
Arched_Eyebrows      0.79        0.87        0.79
Attractive           0.81        0.84        0.79
Bags_Under_Eyes      0.79        0.87        0.83
Big_Lips             0.68        0.78        0.70
Big_Nose             0.78        0.91        0.83
Blurry               0.84        0.91        0.95
Chubby               0.91        0.89        0.95
Double_Chin          0.92        0.93        0.96
Goatee               0.95        0.92        0.96
High_Cheekbones      0.87        0.95        0.85
Narrow_Eyes          0.81        0.79        0.86
No_Beard             0.95        0.90        0.92
Oval_Face            0.66        0.79        0.74
Pointy_Nose          0.72        0.77        0.75
Rosy_Cheeks          0.90        0.96        0.94
Sideburns            0.96        0.92        0.96
Wearing_Earrings     0.82        0.91        0.84
Wearing_Lipstick     0.93        0.92        0.93
Wearing_Necklace     0.71        0.77        0.86
Wearing_Necktie      0.93        0.84        0.93
Young                0.87        0.86        0.85
MEAN                 0.84        0.87        0.87


Table 5: Accuracies of Predicting Additional Factors using Inferred Encoding - CelebA

5.9 Using The Learned Representation

Finally, we quantitatively evaluate the encoding learned by the proposed BiCoGAN model on the CelebA dataset by using the inferred z and c, i.e., the intrinsic factors and the extrinsic attributes on which the model is trained, to predict the other features annotated in the dataset. We train a simple feed-forward neural network for this task. Table 5 shows the results of our experiment with the attribute-level breakdown of prediction accuracies. We show results of the state-of-the-art methods, LNet+ANet [17] and WalkLearn [32], for reference. The results show that it is possible to achieve state-of-the-art results on predicting these attributes by using the z and c encoded by the proposed BiCoGAN model, instead of the original images. This not only shows that information about these attributes is captured in the encoding but also presents a successful use-case of the disentangled embedding learned by the BiCoGAN encoder.
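
The idea of predicting a held-out attribute from the inferred encoding can be sketched as follows. The sketch substitutes a linear logistic classifier for the feed-forward network mentioned above, purely to keep the example short; the data, the learning rate, and all function names are made up for illustration.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def mean_log_loss(weights, features, labels):
    """Average binary cross-entropy of a logistic model on the given data."""
    total = 0.0
    for x, y in zip(features, labels):
        p = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(features)


def train_logistic(features, labels, lr=0.5, steps=200):
    """Full-batch gradient descent on the logistic loss, starting from zero weights."""
    weights = [0.0] * len(features[0])
    for _ in range(steps):
        grad = [0.0] * len(weights)
        for x, y in zip(features, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x))) - y
            for k, xi in enumerate(x):
                grad[k] += error * xi / len(features)
        weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights


# Made-up encodings: inferred z (2-d), inferred c (one binary attribute), and
# one held-out attribute label per image.
encodings = [([1.0, 0.2], [1.0]), ([0.8, -0.5], [0.0]),
             ([-1.0, 0.3], [1.0]), ([-0.7, -0.4], [0.0])]
labels = [1, 1, 0, 0]
# The predictor input is the concatenation [z, c] plus a constant bias feature.
features = [z + c + [1.0] for z, c in encodings]

initial_loss = mean_log_loss([0.0] * 4, features, labels)
weights = train_logistic(features, labels)
final_loss = mean_log_loss(weights, features, labels)
```

The predictor sees only the concatenated encoding and never the image itself. A network with hidden layers would replace `train_logistic` without changing how the input is built.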

6 Conclusions and Future Work

We presented the bidirectional conditional GAN (BiCoGAN) framework, which learns to generate data conditioned on intrinsic and extrinsic factors in a disentangled manner and provides a jointly trained encoder that maps data to both the intrinsic and the extrinsic factors underlying the data distribution. We presented necessary techniques for training BiCoGANs that incorporate an extrinsic factor loss with an associated importance weight. We showed that the proposed BiCoGAN exhibits state-of-the-art performance at encoding extrinsic factors of data and at disentangling intrinsic and extrinsic factors in the generation process on MNIST and CelebA images. We provided qualitative results on the Chairs dataset to show that it also works well with continuous extrinsic factors. Finally, we showed that state-of-the-art performance can be achieved at predicting previously unseen attributes using BiCoGAN embeddings, demonstrating that the encodings can be used for downstream tasks instead of images.

We note here that for c composed of multiple extrinsic factors, it is possible to use a different importance weight for each factor, but we did not explore this approach in this work. We also did not incorporate many of the techniques developed in recent literature for further improving the performance of GANs. This ensures that our evaluation is not influenced by those techniques and purely analyzes our contributions as compared to official implementations of other models. We will explore these techniques to further improve performance in future work.


  • [1] Aubry, M., Maturana, D., Efros, A.A., Russell, B.C., Sivic, J.: Seeing 3D Chairs: Exemplar Part-based 2D-3D Alignment using a Large Dataset of CAD Models. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (Jun 2014)
  • [2] Belharbi, S., Hérault, R., Chatelain, C., Adam, S.: Deep multi-task learning with evolving weights. In: European Symposium on Artificial Neural Networks (ESANN) (2016)
  • [3] Bousmalis, K., Silberman, N., Dohan, D., Erhan, D., Krishnan, D.: Unsupervised Pixel-Level Domain Adaptation With Generative Adversarial Networks. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (Jul 2017)
  • [4] Donahue, J., Krähenbühl, P., Darrell, T.: Adversarial Feature Learning. In: International Conference on Learning Representations (2017)
  • [5] Dumoulin, V., Belghazi, I., Poole, B., Lamb, A., Arjovsky, M., Mastropietro, O., Courville, A.: Adversarially Learned Inference. In: International Conference on Learning Representations (2017)
  • [6] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative Adversarial Nets. In: Advances in neural information processing systems. pp. 2672–2680 (2014)
  • [7] Gurumurthy, S., Kiran Sarvadevabhatla, R., Venkatesh Babu, R.: DeLiGAN : Generative Adversarial Networks for Diverse and Limited Data. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (Jul 2017)
  • [8] Hinton, G.E., Salakhutdinov, R.R.: Reducing the Dimensionality of Data with Neural Networks. Science 313(5786), 504–507 (2006)
  • [9] Huang, S., Ramanan, D.: Expecting the Unexpected: Training Detectors for Unusual Pedestrians With Adversarial Imposters. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (Jul 2017)
  • [10] Huang, X., Li, Y., Poursaeed, O., Hopcroft, J., Belongie, S.: Stacked Generative Adversarial Networks. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (Jul 2017)
  • [11] Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-To-Image Translation With Conditional Adversarial Networks. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (Jul 2017)
  • [12] Kaneko, T., Hiramatsu, K., Kashino, K.: Generative Attribute Controller With Conditional Filtered Generative Adversarial Networks. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (Jul 2017)
  • [13] Kingma, D.P., Welling, M.: Auto-encoding Variational Bayes. In: International Conference on Learning Representations (2014)
  • [14] LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based Learning Applied to Document Recognition. Proceedings of the IEEE 86(11), 2278–2324 (1998)
  • [15] Ledig, C., Theis, L., Huszar, F., Caballero, J., Cunningham, A., Acosta, A., Aitken, A., Tejani, A., Totz, J., Wang, Z., Shi, W.: Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (Jul 2017)
  • [16] Li, J., Liang, X., Wei, Y., Xu, T., Feng, J., Yan, S.: Perceptual Generative Adversarial Networks for Small Object Detection. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (Jul 2017)
  • [17] Liu, Z., Luo, P., Wang, X., Tang, X.: Deep Learning Face Attributes in the Wild. In: Proceedings of International Conference on Computer Vision (ICCV) (Dec 2015)
  • [18] Mahasseni, B., Lam, M., Todorovic, S.: Unsupervised Video Summarization With Adversarial LSTM Networks. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (Jul 2017)
  • [19] Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I.: Adversarial Autoencoders. In: International Conference on Learning Representations (2016)
  • [20] Mathieu, M.F., Zhao, J.J., Zhao, J., Ramesh, A., Sprechmann, P., LeCun, Y.: Disentangling Factors of Variation in Deep Representation using Adversarial Training. In: Advances in Neural Information Processing Systems. pp. 5040–5048 (2016)
  • [21] Mescheder, L., Nowozin, S., Geiger, A.: Adversarial Variational Bayes: Unifying Variational Autoencoders and Generative Adversarial Networks. In: International Conference on Machine Learning (ICML) (2017)
  • [22] Mirza, M., Osindero, S.: Conditional Generative Adversarial Nets. arXiv preprint arXiv:1411.1784 (2014)
  • [23] Parkhi, O.M., Vedaldi, A., Zisserman, A.: Deep Face Recognition. In: British Machine Vision Conference (2015)
  • [24] Perarnau, G., Weijer, J.v.d., Raducanu, B., Álvarez, J.M.: Invertible Conditional GANs for image editing. In: NIPS Workshop on Adversarial Training (2016)
  • [25] Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., Lee, H.: Generative Adversarial Text-to-Image Synthesis. In: Proceedings of The 33rd International Conference on Machine Learning (2016)
  • [26] Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X.: Improved Techniques for Training GANs. In: Advances in Neural Information Processing Systems. pp. 2234–2242 (2016)
  • [27] Shrivastava, A., Pfister, T., Tuzel, O., Susskind, J., Wang, W., Webb, R.: Learning from Simulated and Unsupervised Images through Adversarial Training. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (Jul 2017)
  • [28] Sixt, L., Wild, B., Landgraf, T.: RenderGAN: Generating Realistic Labeled Data. arXiv preprint arXiv:1611.01331 (2016)
  • [29] Tzeng, E., Hoffman, J., Saenko, K., Darrell, T.: Adversarial Discriminative Domain Adaptation. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (Jul 2017)
  • [30] Vondrick, C., Torralba, A.: Generating the Future With Adversarial Transformers. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (Jul 2017)
  • [31] Wan, C., Probst, T., Van Gool, L., Yao, A.: Crossing Nets: Combining GANs and VAEs with a Shared Latent Space for Hand Pose Estimation. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (Jul 2017)
  • [32] Wang, J., Cheng, Y., Feris, R.S.: Walk and Learn: Facial Attribute Representation Learning from Egocentric Video and Contextual Data. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2295–2304 (2016)
  • [33] Yang, J., Kannan, A., Batra, D., Parikh, D.: LR-GAN: Layered Recursive Generative Adversarial Networks for Image Generation. In: International Conference on Learning Representations (2017)