Training disentangled representations with generative adversarial networks (GANs) remains challenging, with leading implementations failing to achieve comparable performance to Variational Autoencoder (VAE)-based methods. After β-VAE and FactorVAE discovered that regularizing the total correlation of the latent vectors promotes disentanglement, numerous VAE-based methods emerged. Such a discovery has yet to be made for GANs, and reported disentanglement scores of GAN-based methods are significantly inferior to VAE-based methods on benchmark datasets. To this end, we propose a novel regularizer that achieves higher disentanglement scores than state-of-the-art VAE- and GAN-based approaches. The proposed contrastive regularizer is inspired by a natural notion of disentanglement: latent traversal. Latent traversal refers to generating images by varying one latent code while fixing the rest. We turn this intuition into a regularizer by adding a discriminator that detects how the latent codes are coupled together in paired examples. Numerical experiments show that this approach improves upon competing state-of-the-art approaches on benchmark datasets.
Learning low-dimensional, informative data representations can critically enhance the data’s utility. The notion of disentangled representations in particular was theoretically proposed in Bengio et al. (2013); Ridgeway (2016); Higgins et al. (2016)
for diverse applications including supervised and reinforcement learning. These ideas were later empirically validated in Higgins et al. (2018b) for learning hierarchical visual concepts and in Higgins et al. (2017) for improving the robustness of reinforcement learning algorithms. A disentangled generative model takes a number of interpretable latent factors as input, with each factor controlling one aspect of the generated data. For example, in facial images, disentangled latent factors might control variations in eyes, noses, and hair.
Most approaches for disentangling latent factors (or codes) are based on the following natural intuition. We say a generative model has a better disentanglement if changing one latent code (while fixing other latent codes) makes a noticeable and distinct change in the generated sample. Noticeable changes are desired as we want the latent codes to capture important characteristics of the image. Distinct changes are desired as we want each latent code to represent an aspect of the samples different from other latent codes. As such, a successful disentanglement is typically evaluated by traversing the latent space as in Figure 1: by fixing all latent codes except one, varying that code systematically, and visualizing the resulting changes. Figure 1 illustrates how the latent codes of a generator have been successfully trained to capture noticeable and distinct properties of the images.
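The latent traversal described above is straightforward to express in code. The following is a minimal sketch, not the paper's implementation: `latent_traversal` and the toy linear "generator" are hypothetical stand-ins for a trained generator network.

```python
import numpy as np

def latent_traversal(generator, codes, index, values):
    """Generate samples by varying one latent code while fixing the rest.

    generator: any callable mapping a batch of latent codes to images
    codes:     a fixed base latent vector, shape (num_codes,)
    index:     which code to traverse
    values:    the grid of values to sweep, e.g. np.linspace(-1, 1, 10)
    """
    batch = np.tile(codes, (len(values), 1))  # one copy of the base codes per step
    batch[:, index] = values                  # vary only the chosen code
    return generator(batch)

# Toy "generator" for illustration: a random linear map from 5 codes to 8x8 images.
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 64))
toy_generator = lambda c: (c @ W).reshape(len(c), 8, 8)

images = latent_traversal(toy_generator, np.zeros(5), index=2,
                          values=np.linspace(-1, 1, 10))
```

A successful disentanglement would make such a sweep produce a noticeable, distinct change (e.g. only the x-position of the shape moving) across the ten generated images.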
Recent approaches have focused on adding carefully chosen regularizers to promote disentanglement, building upon the two popular deep generative models: Variational AutoEncoders (VAE) Kingma and Welling (2013) and Generative Adversarial Networks (GAN) Goodfellow et al. (2014). However, fundamental differences between these two architectures have led to the design of different regularizers. In a standard VAE training, an encoder finds a compact representation of real data, and a decoder is used to reconstruct the original image from the latent representation. To achieve disentanglement, a popular approach is to add an extra regularizer to promote “uncorrelatedness” by making each latent code distinct, as in β-VAE Higgins et al. (2016).
In a standard GAN, a neural network generator is trained to map a randomly-drawn noise vector to the domain of the real data. A discriminator is introduced to encourage the generated sample distribution to be close to the real data. Disentangled GANs add a secondary input of latent codes, which are meant to control the underlying factors. The loss function then adds an extra regularizer to promote “informativeness”, as proposed in InfoGAN Chen et al. (2016). However, InfoGAN has been quantitatively reported to be significantly inferior to its VAE-based counterparts, which has led to slow progress on GAN-based disentangled representation learning.
Related work. Learning a disentangled representation was first demonstrated in the semisupervised setting, where additional annotated data is available, consisting of examples from desired isolated latent factor traversals Karaletsos et al. (2015); Kulkarni et al. (2015); Narayanaswamy et al. (2017); Lopez et al. (2018). However, as manual data annotation is costly, unsupervised methods for disentangling are desired. Early approaches to unsupervised disentangling imposed uncorrelatedness by making it difficult to predict one representational unit from the rest Schmidhuber (1992), by disentangling higher-order moments Desjardins et al. (2012), by using factor analysis Tang et al. (2013), and by applying group representations Cohen and Welling (2014). Breakthroughs in making these ideas scalable were achieved by β-VAE Higgins et al. (2016) for VAE-based methods, and InfoGAN Chen et al. (2016) for GAN-based ones. Rapid progress in improving disentanglement was driven mainly by VAE-based methods, in a series of papers including Kim and Mnih (2018); Locatello et al. (2018); Chen et al. (2018); Lopez et al. (2018); Ansari and Soh (2018); Esmaeili et al. (2018); Gao et al. (2018); Pineau and Lelarge (2018); Dupont (2018); Ainsworth et al. (2018b, a). Quantitative comparisons in these papers suggest that InfoGAN cannot learn good disentangled representations. This has led to a misconception that GAN-based methods are inherently poor at learning disentangled representations.
Main contributions. We first disprove the common misconception that InfoGAN is inferior by training an InfoGAN model to achieve a disentanglement comparable to the best VAE-based method (FactorVAE Kim and Mnih (2018)). This is achieved by stabilizing the training using spectral normalization and two time-scale update rules, suggesting that the previously-reported poor performance of InfoGAN is due to training choices, not fundamental differences. We next design an appropriate regularizer to improve upon it. We propose a novel regularizer from first principles of disentanglement, which we call a contrastive regularizer (CR). We show that the new architecture, InfoGAN-CR, achieves a significant gain over competing state-of-the-art approaches on benchmark datasets.
Overview of InfoGAN-CR. Our proposed regularizer is inspired by the idea that disentanglement should be measured via changes in the images when traversing the latent space. This is a popular interpretation of disentanglement, as evidenced by the widely-adopted visual evaluation of disentanglement (e.g. Figure 1). To measure such changes, we naturally pair two (or more) samples drawn from one of multiple appropriately-designed joint distributions by coupling the two random latent vectors. We introduce a new discriminator that performs multiway hypothesis testing on which joint distribution was used to create the paired examples. The loss of this discriminator provides an additional regularization, which we call Contrastive Regularization (CR). Building upon InfoGAN’s architecture (see Section 2 for details), we add contrastive regularization and refer to the resulting architecture as InfoGAN-CR, illustrated in Figure 1 (right).
Concretely, the input to the generator is partitioned into two parts: a latent code vector c that learns the disentangled representation and an input noise vector z that provides additional randomness. We generate a pair of test samples x and x′ from a pair of inputs (c, z) and (c′, z′). We design a set of joint distributions over c and c′, and treat them as multiple hypotheses on how the pair is generated. This paired example is now fed into the CR discriminator, which tries to detect which hypothesized coupling was used. Both the generator and the CR discriminator try to make this hypothesis testing successful. We design the hypotheses to encourage each latent code to make changes that are easy to detect: noticeable and distinct, hence encouraging disentanglement. Contrastive regularization is discussed in detail in Section 3.
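The coupling step can be sketched as a small sampling routine. This is a simplified illustration rather than the paper's exact sampler: the function name, the uniform prior on the codes, and the dimensions are assumptions for the sake of the example.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_coupled_pair(num_codes, noise_dim):
    """Sample a pair of generator inputs (c, z) and (c', z') that share
    one randomly chosen latent code.

    Returns both inputs and the index i of the shared code, which the CR
    discriminator must recover from the generated image pair.
    """
    i = rng.integers(num_codes)        # which hypothesis (shared code) is true
    c1 = rng.uniform(-1, 1, num_codes)
    c2 = rng.uniform(-1, 1, num_codes)
    c2[i] = c1[i]                      # couple the chosen code; the rest differ
    z1 = rng.normal(size=noise_dim)    # independent extra randomness
    z2 = rng.normal(size=noise_dim)
    return (c1, z1), (c2, z2), i

(c1, z1), (c2, z2), shared = sample_coupled_pair(num_codes=5, noise_dim=16)
```

The two inputs would then be passed through the generator, and the resulting image pair fed to the CR discriminator, which is trained to predict `shared`.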
Figure 2 shows an experiment illustrating how the proposed contrastive regularizer enhances disentanglement beyond vanilla InfoGAN. The blue curve shows the performance when we train a vanilla InfoGAN on the dSprites dataset for 28 epochs (322,560 iterations) total. To show the effect of the proposed CR regularizer, we take the model we just trained with InfoGAN at 25 epochs (288,000 iterations), and keep training with an added CR regularizer (red curve), precisely defined in Eq. (6). All other hyperparameters are identical. We measure disentanglement using the popular metric of Kim and Mnih (2018), defined in Section 4. The jump at epoch 28 suggests that contrastive regularization significantly enhances disentanglement, on top of what was achieved by the InfoGAN regularizer alone.
In this section, we give a brief overview of InfoGAN, introduced in Chen et al. (2016). Specifically, we highlight the design and training choices that have led to the misconception that InfoGAN cannot achieve disentanglement comparable to competing approaches. We disprove this misconception empirically by achieving a performance comparable to the best reported disentanglement scores (Table 1), by applying recently introduced techniques for stabilizing GAN training. We also derive in Remark 1 an implicit bias term that arises in practical InfoGAN implementations.
Background on GAN. Generative Adversarial Networks (GANs) are a breakthrough method for training generative models proposed in Goodfellow et al. (2014). A deep neural network generative model G maps a latent code z to a desired distribution of the samples x = G(z).
The latent code z is typically drawn from a Gaussian distribution with identity covariance. To train the neural network, no likelihood is available for maximum-likelihood training. GANs instead update the weights of a generator G and a discriminator D using alternating gradient updates on an adversarial loss. The discriminator provides an approximate measure of how different the current generator distribution is from the distribution of the real data. For example, a common choice is the cross-entropy loss
$$ \min_G \max_D \; \mathbb{E}_{x \sim P_{\text{data}}}[\log D(x)] \;+\; \mathbb{E}_{z}[\log(1 - D(G(z)))] , $$
which provides an approximation of the Jensen-Shannon divergence between the real data distribution and the current generator distribution.
Background on InfoGAN. In order to achieve disentanglement, InfoGAN proposes a regularizer based on mutual information. As the goal is not to disentangle all latent codes, but rather to disentangle a subset, InfoGAN Chen et al. (2016) proposed to first split the latent input into two parts: the disentangled code vector c and the remaining noise vector z that provides additional randomness. InfoGAN then uses the GAN loss with regularization to encourage informative latent codes c:
$$ \min_G \max_D \; \mathbb{E}_{x \sim P_{\text{data}}}[\log D(x)] + \mathbb{E}_{z,c}[\log(1 - D(G(z,c)))] \;-\; \lambda \, I(c; G(z,c)) , $$
where I(c; G(z,c)) denotes the mutual information between the latent code c and the sample generated from that latent code, and λ is a positive scalar coefficient. Notice that encouraging informativeness alone does not necessarily imply good disentanglement; a fully entangled representation can achieve infinite mutual information I(c; G(z,c)). Despite this, InfoGAN achieves reasonable performance in practice. Its empirical performance follows from implementation choices that promote stability and alter the InfoGAN objective, which we discuss next.
Practical implementation of InfoGAN loss and the resulting implicit bias. Let P(c, x) denote the joint distribution of the latent code c and the generated image x = G(z, c). Two structural assumptions are enforced in Chen et al. (2016) to make the maximization of I(c; G(z,c)) tractable. First, to make optimization of the mutual information tractable, all practical implementations of InfoGAN replace I(c; G(z,c)) with a variational lower bound. Here Q(c|x) is an auxiliary conditional distribution, which is maximized over in the InfoGAN regularizer defined as
$$ L_I(G, Q) \;=\; \mathbb{E}_{c \sim P(c),\; x \sim G(z,c)}\big[\log Q(c \mid x)\big] \;+\; H(c) , $$
where P(c) and P(z) denote the distributions chosen as priors. When this lower bound is maximized over Q, it acts as a surrogate for the mutual information: rearranging the terms gives L_I(G, Q) ≤ I(c; G(z,c)), with equality when Q(c|x) matches the true posterior P(c|x) (see Chen et al. (2016) for a derivation). However, this maximization is a functional optimization over an infinite-dimensional function Q, which is intractable. To make it tractable, the optimization in Chen et al. (2016) is done over a restricted family of Gaussian distributions, namely factorized (or independent) Gaussian distributions (see Remark 1). Q(c|x) can then be parametrized by the conditional means and variances, which are finite-dimensional functions, and one can use deep neural networks to approximate them. Note that the Shannon entropy H(c) is a constant that does not depend on G or Q.
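Under the unit-variance factorized Gaussian restriction discussed next, this regularizer becomes especially simple: up to an additive constant, maximizing E[log Q(c|x)] is least-squares regression of the codes from the image. A minimal numpy sketch, where `c_pred` stands in for the recognition network's predicted means μ(x) (the function name and shapes are illustrative assumptions):

```python
import numpy as np

def infogan_regularizer(c_true, c_pred):
    """Variational lower-bound term E[log Q(c|x)] for a factorized
    unit-variance Gaussian Q(c|x) = prod_i N(c_i; mu_i(x), 1).

    Up to the additive constant -0.5*log(2*pi) per code dimension,
    maximizing this term is equivalent to minimizing the squared error
    between the true codes and the network's predicted means mu(x).
    """
    d = c_true.shape[1]
    const = -0.5 * d * np.log(2 * np.pi)
    return const - 0.5 * np.mean(np.sum((c_true - c_pred) ** 2, axis=1))

rng = np.random.default_rng(2)
c = rng.uniform(-1, 1, size=(128, 5))
# A perfect recognition network (c_pred == c) maximizes the bound.
best = infogan_regularizer(c, c)
worse = infogan_regularizer(c, c + 0.1)
```

With unit variance the bound is automatically bounded above by the constant term, which is exactly the stabilizing effect described below.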
Next, if this maximum over Q has been achieved, then notice that in the generator update, the generator tries to maximize L_I(G, Q). This is problematic as the update will increase L_I unboundedly, tending to infinity (even if Q is restricted to the factorized Gaussian class). The maximum value of L_I is approached, for example, as the variance of Q(c_i|x) tends to zero for some i. This problem is not just theoretical. In Appendix B, we confirm experimentally that training InfoGAN with an unfactorized Q leads to training instability. To avoid such catastrophic failures, Chen et al. (2016) forces Q to have an identity covariance matrix. This restricts the class of Q that we search over, forces L_I to be bounded, and keeps the G and Q updates well-defined. In summary, for stability and efficiency of training, Chen et al. (2016) restricted Q to be a factorized Gaussian with identity covariance. We show next that this choice creates an implicit bias.
Remark 1. If Q is optimized over a class of factorized Gaussian conditional distributions with unit variances, i.e. Q(c|x) = ∏_i Q_i(c_i|x) with Q_i(c_i|x) = N(μ_i(x), 1) for some μ_i(x) for all i and x, the maximum of the InfoGAN loss in Eq. (3) has an implicit bias toward conditionals P(c_i|x) that are well-approximated by unit-variance Gaussians, where P(c, x) is the joint distribution between the latent code c and the generated image x = G(z, c), and N(μ_i(x), 1) with μ_i(x) = E[c_i | x] is the best one-dimensional Gaussian approximation of P(c_i | x).
We provide a proof in Appendix A. The above implicit bias keeps the loss bounded, so it is necessary. On the other hand, it might have undesired and unintended consequences in terms of learning a disentangled deep generative model. In this paper, we therefore introduce a new regularizer to explicitly encourage disentanglement during InfoGAN training.
Improving InfoGAN performance via stable training. Before introducing our proposed regularizer, note that several VAE-based architectures claim to outperform InfoGAN by significant margins Kim and Mnih (2018); Higgins et al. (2016); Chen et al. (2018). This series of empirical results has created a misconception that InfoGAN is fundamentally limited in achieving disentanglement, which has been reinforced in subsequent literature Jeon et al. (2018); Ansari and Soh (2018) that refers to those initial results. We show that the previously-reported inferior empirical performance of InfoGAN is due to poor architectural and hyperparameter choices in training. We take the same implementation reported to have bad performance (disentanglement score of 0.59 in Table 1). We then apply recently-proposed (but now popular) techniques for stabilizing GAN training and achieve a performance comparable to the best reported results of competing approaches (disentanglement score of 0.83 in Table 1). We provide more supporting experimental results in Section 4. Concretely, we start from the implementation in Kim and Mnih (2018). We then apply spectral normalization to the adversarial loss discriminator Miyato et al. (2018), with a choice of cross-entropy loss, and use the two time-scale update rule Heusel et al. (2017) with an appropriate choice of the learning rate. These choices are motivated by similar choices and successes of Brock et al. (2018) in scaling GANs to extremely large datasets. Implementation details are in Appendix E.4, and we also submit the code for reproducibility. Hence, the starting point for our design is to achieve even better disentanglement than a properly-trained version of InfoGAN.
Based on the insights from Section 2, we introduce a contrastive regularizer. A new discriminator encourages disentanglement, and its loss is added to the InfoGAN objective, weighted by a positive scalar:
The key insight is that disentanglement is fundamentally measured by the changes made when traversing the latent space. Detecting changes inevitably requires the discriminator to make decisions based on multiple samples jointly. We propose generating multiple samples whose latent codes are coupled. We design multiple hypotheses on how to couple those latent codes, and generate multiple examples from one of those hypotheses. The new discriminator tries to detect which hypothesis the set of images was generated from. We design those hypotheses to encourage each latent code to make changes in the sample that are noticeable and distinct.
We progressively change the hypotheses during the course of training, from easy to hard. The hypothesis class we propose is as follows. Both the generator and the CR discriminator try to make the following k-way hypothesis testing succeed, where k is the number of disentangled latent codes. First we draw a random index i uniformly over the k indices, and sample a value for the chosen latent code c_i. Two images are generated with the same value of c_i; the remaining codes are chosen independently at random. The sampling of the latent codes is parametrized by a contrastive gap: the larger the contrastive gap, the more distinct the pair of samples. We gradually reduce the contrastive gap for progressive training (Section 4.1). This pair of coupled images x and x′ is fed to the CR discriminator, which tries to identify which code was shared. We use the cross-entropy loss:
where P(x, x′) denotes the joint distribution of the paired images, e_i denotes the one-hot encoding of the random index i, and H(x, x′) is a k-dimensional vector-valued neural network normalized so that its entries are nonnegative and sum to one (a probability vector over the k hypotheses) for all x and x′. This encourages each latent code to make distinct and noticeable changes, hence promoting disentanglement. This architecture is illustrated in Figure 1. An alternative interpretation of the proposed loss as a divergence is provided in Appendix C.
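The cross-entropy objective for a single pair can be sketched numerically. Here `cr_loss` and the hand-picked logits are hypothetical stand-ins for the CR discriminator's output head; the softmax plays the role of the normalization described above.

```python
import numpy as np

def cr_loss(logits, shared_index):
    """Cross-entropy loss for the CR discriminator on one image pair.

    logits: shape (k,), raw scores for the k hypotheses
            (one per latent code that could have been shared)
    shared_index: the true hypothesis, i.e. which code was coupled
    """
    # Softmax normalizes the scores into a probability vector over hypotheses.
    shifted = logits - logits.max()           # for numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum()
    return -np.log(probs[shared_index])

logits = np.array([0.1, 2.5, -0.3, 0.0, 0.4])  # discriminator favors code 1
loss_right = cr_loss(logits, 1)                # low loss: correct hypothesis
loss_wrong = cr_loss(logits, 0)                # higher loss: wrong hypothesis
```

In training, both the generator and the CR discriminator are updated to make this loss small, which pushes each latent code to produce changes that are easy to tell apart from the changes of the other codes.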
We run experiments on a combination of synthetic datasets with pre-defined latent factors and real-world datasets. (The code for all experiments is available at https://github.com/fjxmlzn/InfoGAN-CR.) In our experiments, we evaluate two properties: disentanglement and image quality. For image quality, we use the common inception score metric. For disentanglement, we use the popular metric proposed in Kim and Mnih (2018). This metric is a number between zero and one, with one being perfect disentanglement. We additionally compute the (less common) disentanglement metric of Eastwood and Williams (2018). We give a full description of all metrics in Appendix D.
We begin with the synthetic dSprites dataset Matthey et al. (2017), commonly used to numerically compare disentangling methods. The dataset consists of 737,280 binary 64×64 images of 2D shapes generated from five ground-truth independent latent factors: shape, scale, rotation, and x and y position (color is held fixed). All combinations of latent factors are present in the dataset; some examples are illustrated in Figure 3. Figure 1 illustrates the latent traversal for InfoGAN-CR. To generate this figure, we fix all latent factors except one, c_i, and vary c_i from -1 to 1 in evenly-spaced intervals. We observe that each of the five empirically-learned factors captures one true underlying factor, and the traversals span the full range of possibilities for each hidden factor.
We compute and/or reproduce disentanglement metrics for a number of protocols in Table 1. In this table, all algorithms with an asterisk were either run or independently confirmed by us. For the metric of Kim and Mnih (2018), Table 1 shows that the baseline disentanglement of InfoGAN can be substantially boosted through better training, from 0.59 to 0.83. Contrastive regularization provides an additional gain, bringing InfoGAN-CR’s disentanglement up to 0.90, higher than any baseline from the VAE or GAN literature. For the metric of Eastwood and Williams (2018), we find similar trends, except that FactorVAE is tied with InfoGAN-CR for the best score. We were made aware of independent work that proposes a similar idea to contrastive regularization in Li et al. (2018). Their algorithm is a special case of contrastive regularization, which empirically achieves lower disentanglement scores (0.39 ± 0.02 standard error over 10 runs) than even vanilla InfoGAN. For this reason, we do not consider it as a baseline moving forward. Note that this discrepancy is not a matter of parameter tuning, but of the loss function; indeed, in our own preliminary trials, we found that training a CR regularizer without the InfoGAN loss achieved similarly poor performance, as described in Appendix E.3. Concretely, Li et al. (2018) fixes one term in our loss (5), and also uses a special coupling that matches all but one latent code for the matched pairs. In combination, these choices result in a significant degradation of performance.
We implemented both FactorVAE and InfoGAN using the architectures described in Kim and Mnih (2018). For completeness, we describe both architectures in detail in Appendix E.4. Although InfoGAN exhibits a reported disentanglement score of 0.59 in Kim and Mnih (2018), we find that InfoGAN can exhibit substantially higher disentanglement scores (0.83) through some basic changes to the architecture and loss function. In particular, in accordance with Miyato et al. (2018), we changed the loss function from Wasserstein GAN to the traditional JSD loss. We also changed the Adam learning rates of the generator and of the InfoGAN and CR discriminators; we used 5 continuous input codes, whereas Kim and Mnih (2018) reported using four continuous codes and one discrete one. We also used batch normalization in the generator and spectral normalization in the discriminator. Finally, we used the InfoGAN regularizer, and for InfoGAN-CR, we additionally used the contrastive regularizer. The effects of these changes are shown in the line ‘InfoGAN (modified)’ in Table 1. For FactorVAE, we used the architecture and latent codes of Kim and Mnih (2018). To compute disentanglement, we use the metric of Kim and Mnih (2018).
We start by showing the effect of varying InfoGAN-CR’s parameters: the weights of the InfoGAN regularizer and of the contrastive regularizer. In particular, we are interested in understanding how these parameters trade off disentanglement with image quality. To this end, we vary both parameters and compute both the disentanglement metric of Kim and Mnih (2018) and the inception score Salimans et al. (2016), explained in detail in Appendix D. Since inception scores can only be computed for labelled, categorical data, we compute it over the ‘shape’ latent factor, which has three options: heart, square, or oval. Figure 3(a) shows the resulting trends, both for InfoGAN-CR and for modified InfoGAN (a special case of InfoGAN-CR with the contrastive regularizer weight set to zero). For InfoGAN-CR, we fix the InfoGAN regularizer weight and sweep the contrastive regularizer weight. The size of each point in Figure 3(a) corresponds to the value of the contrastive weight, which we explicitly label for the blue triangles. Each data point is averaged over 11 runs. As we increase the contrastive regularizer, the inception score decreases, whereas disentanglement improves up to a point and then decreases. Notably, moderate contrastive regularizer weights seem to incur small reductions in inception score for a significant boost in disentanglement. To better understand these results, we generated similar plots using more nuanced image quality metrics, included in Appendices D and E.1.
One key aspect of contrastive regularization is its progressive reduction of the contrastive gap during training. In this section, we show the importance of progressive training by comparing it to regularization with a fixed gap. Figure 3(b) shows the disentanglement metric of Kim and Mnih (2018) as a function of batch number. For the ‘progressive training’ curve, we use a contrastive gap of 1.9 for 120,000 batches, and then introduce a (more aggressive) gap of 0. For the ‘no progressive training’ curves, we use a gap of 0 or 1.9 for all 230,400 batches. All curves use the InfoGAN regularizer together with the CR regularizer. Notice that the results in Table 1 were computed for a different value of the CR weight, which explains why Figure 3(b) achieves lower disentanglement scores. The ‘no progressive training, gap=0.0’ curve is averaged over 21 runs, whereas the other two curves are averaged over 10 runs.
We observe that progressive training helps the disentanglement metric grow faster than either setting without progressive training. We also tried a smooth progressive training schedule (e.g. smoothly decreasing the contrastive gap), but found this to hurt the disentanglement performance. In both cases, we found that re-randomizing the CR-discriminator’s parameters every time the contrastive gap was changed helped to stabilize training.
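The progressive schedule used in our dSprites runs, together with the re-randomization trigger, can be written as two small helper functions. This is an illustrative sketch with assumed function names; the gap values and the 120,000-batch switch point are the ones reported above.

```python
def contrastive_gap(batch):
    """Progressive schedule: an easy gap of 1.9 for the first 120,000
    batches, then the harder gap of 0 for the remaining batches."""
    return 1.9 if batch < 120_000 else 0.0

def should_rerandomize(batch):
    """Re-initialize the CR discriminator's parameters whenever the
    contrastive gap changes between consecutive batches."""
    return contrastive_gap(batch) != contrastive_gap(batch - 1)
```

A step-function schedule like this worked better in our experiments than smoothly decreasing the gap.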
We claim that contrastive regularization is tailored to work well with GAN architectures. Similarly, we claim that the total correlation regularization of FactorVAE is specifically tailored to VAEs. To test this hypothesis, we applied contrastive regularization (CR) to FactorVAE and total correlation (TC) regularization to InfoGAN-CR. Figure 3(c) shows the disentanglement metric of each as a function of batch number. For FactorVAE, we introduce CR regularization at batch 300,000. Note that the metric does not change perceptibly after adding CR. For InfoGAN-CR, we ran one set of trials with TC regularization from the beginning (red curve) and one set of trials without TC regularization (green curve); for the former, we use both the InfoGAN regularizer and a TC regularizer. Notice that InfoGAN-CR has a lower disentanglement score than FactorVAE in this plot because we did not use the optimal regularizer weights for this dataset; this sensitivity is a weakness of InfoGAN-CR (as well as InfoGAN). In Figure 3(c), we observe that TC regularization actually reduces disentanglement compared to InfoGAN-CR without TC. The jumps in disentanglement for the InfoGAN-CR curves are due to progressive training; we change the contrastive gap from 1.9 to 0.0 at batch 120,000. The red line (InfoGAN-CR + TC) is averaged over 4 runs, the blue line (FactorVAE + CR) over 2 runs, and the green line (InfoGAN-CR) over 10. These results support (but do not prove) our hypothesis that CR is better-suited to GAN architectures, whereas TC is better-suited to VAE architectures. To further confirm this intuition, we show that disentanglement appears negatively correlated with a measure of total correlation in Appendix E.2.
We ran InfoGAN-CR on the teapots dataset from Eastwood and Williams (2018), with images of teapots in various orientations and colors generated by the renderer of Moreno et al. (2016). Images have five latent factors: color (red, blue, and green channels), vertical rotation, and lateral rotation, each drawn at random from a fixed range. We generated a dataset of 200,000 such images. Table 2 shows the disentanglement scores of FactorVAE and InfoGAN compared to InfoGAN-CR. Again, we observe that InfoGAN-CR achieves a higher disentanglement score than the other baselines, and contrastive regularization increases the metric compared to modified InfoGAN. Since the teapots dataset does not have classes, we do not compute the inception score for this dataset; however, the images generated by InfoGAN-CR appear sharper than those generated by FactorVAE. Details on this point, our implementation, and additional plots appear in Appendix F.
Finally, we tested InfoGAN-CR on the CelebA dataset to observe its performance on real-world data. CelebA is a dataset of 202,599 facial images of celebrities. Since these images do not have known latent factors or labelled categories, we cannot compute the disentanglement metric or inception score. We therefore evaluate this dataset qualitatively by producing latent traversals, as seen in Figure 5. Details of this experiment are included in Appendix H.
In this work, we propose contrastive regularization (CR) for improving the disentanglement performance of GAN-based generative models. Our main finding is that GAN-based disentanglement methods can substantially outperform state-of-the-art VAE-based methods. In addition, we experimentally find that CR substantially increases the disentanglement capabilities of InfoGAN, but does not appear to affect state-of-the-art VAEs. Similarly, we experimentally show that total correlation regularization, a popular technique for disentangling VAEs, does not improve disentanglement in GAN training. This suggests that disentangling VAEs and GANs requires fundamentally different techniques. The proposed CR regularization could be used in any application to enhance disentanglement of GANs, for example in hierarchical image representation or reinforcement learning. Understanding this phenomenon analytically is an interesting direction for future work, and may give rise to a more general understanding of how to design regularizers for GANs as opposed to VAEs. Another key question is to understand disentanglement in datasets more challenging than those studied in the literature as benchmarks. We study two such datasets. The first studies three-dimensional rotations on the teapots dataset in Appendix F.1. Existing training datasets include only a subset of the full rotations, making disentanglement substantially easier. When training data is drawn from the complete set of rotations in 3-D space, several challenges arise. The usual rotations along the three standard basis vectors do not commute, and hence do not disentangle. We can find a commutative coordinate system, but it is not uniquely defined. Our preliminary experiments suggest that current state-of-the-art methods fail to learn a disentangled representation. The second studies a two-dimensional polar coordinate system in Appendix G. State-of-the-art methods fail to learn the disentangled representation of the polar coordinates.
The authors would like to thank Qian Ge for valuable discussions about InfoGAN architecture.
This work is supported by NSF awards 1927712, 1705007, and 1815535 and Google Faculty Research Award. This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number OCI-1053575. Specifically, it used the Bridges system, which is supported by NSF award number ACI-1445606, at the Pittsburgh Supercomputing Center (PSC). This work is partially supported by the generous research credits on AWS cloud computing resources from Amazon and funding from Siemens.
Notice that L_I(G, Q) ≤ I(c; G(z,c)). To understand why this works, let us decompose this lower bound further:
We can further simplify and maximize the last term, which is the only one dependent on Q, as
where the last equality follows from the fact that any such Q can be parametrized by its conditional mean functions, in which case the minimum of zero can be achieved by setting each conditional mean to the corresponding posterior mean E[c_i | x].
In this section we verify that InfoGAN trained with a non-factorizing multivariate Gaussian decoder distribution is unstable. Specifically, we train an InfoGAN with a decoder distribution of , where  is the multivariate Gaussian distribution with mean  and full covariance matrix . These parameters of the distribution are modelled as neural network functions of , where we explicitly enforce the positive semi-definiteness of . This is less restrictive than the factorizing decoder distribution, with its fixed diagonal covariance matrix , in the standard InfoGAN (see Remark 1).
Figure 6 shows the degradation in disentanglement due to the non-factorizing decoder. Similar to the experiment in Figure 2, we first train the standard InfoGAN () with the factorizing decoder distribution on the dSprites dataset for 25 epochs (288,000 batches), and from this point onwards we train two different versions of this model for 3 more epochs: one where we continue training the standard InfoGAN (blue solid curve), and another where we replace the factorizing decoder distribution with the non-factorizing one (red dashed curve). We plot the disentanglement metric, as defined in Section 4.
We see that the non-factorizing decoder is highly unstable: its performance drops (from that of the factorizing decoder) to  (the minimum possible value for 5 latent codes), and its training terminates early because the learnt covariance matrix becomes rank deficient.
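One standard way to enforce the positive semi-definiteness constraint mentioned above is to have the network output the entries of a lower-triangular Cholesky factor and form the covariance as Σ = LLᵀ. The sketch below is ours, not the paper's implementation; the function name and the softplus choice on the diagonal are assumptions. Note that Σ becomes rank deficient exactly when a diagonal entry of L collapses to zero, matching the failure mode described above.

```python
import numpy as np

def cholesky_from_raw(raw, dim):
    """Map an unconstrained vector of dim*(dim+1)//2 network outputs to a
    covariance matrix Sigma = L @ L.T, positive semi-definite by
    construction. A softplus on the diagonal keeps L full rank as long as
    the raw diagonal outputs stay finite."""
    L = np.zeros((dim, dim))
    L[np.tril_indices(dim)] = raw
    diag = np.arange(dim)
    L[diag, diag] = np.log1p(np.exp(L[diag, diag]))  # softplus > 0
    return L @ L.T

# Example: a 3x3 covariance from 6 unconstrained outputs.
sigma = cholesky_from_raw(np.array([0.3, -1.2, 0.8, 0.5, -0.7, 1.1]), 3)
assert np.allclose(sigma, sigma.T)
assert np.all(np.linalg.eigvalsh(sigma) > 0)  # strictly positive definite
```

The same parametrization works for any output dimension; only the number of raw outputs changes.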
An alternative interpretation of the proposed loss is as a divergence over the distributions of the test examples. The discriminator, given enough representational power, provides an approximation of the generalized Jensen-Shannon divergence (as shown below). In the subsequent generator update, the generator forces the ’s to be as different as possible, as measured by the provided JS divergence. This, in turn, forces changes in the latent codes to produce changes in the images that are noticeable and easy to distinguish from (the changes of) other latent codes.
When maximized over the class of all functions, the maximum in Eq. (6) is achieved by with a normalizing constant and the maximum value is the generalized Jensen-Shannon divergence up to a shift by a constant that only depends on :
The solution to the following optimization problem is with a normalizing constant .
This follows from the fact that the gradient of the Lagrangian is , where  is the Lagrange multiplier; setting it to zero gives the desired maximizer. Plugging this back into the objective function, we get that
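The Lagrangian argument above can be sanity-checked numerically in the discrete case. The sketch below is ours and assumes the constrained problem is to maximize the sum over classes i of E_{x~p_i}[log D_i(x)] subject to the D_i(x) summing to one for each x (a softmax-type output); the closed-form maximizer D_i(x) = p_i(x) / Σ_j p_j(x) then dominates every feasible perturbation.

```python
import numpy as np

rng = np.random.default_rng(0)
k, m = 3, 5                                  # k distributions over m outcomes
P = rng.dirichlet(np.ones(m), size=k)        # row i is the pmf p_i

def objective(D):
    # sum_i E_{x ~ p_i}[ log D_i(x) ] in the discrete case
    return np.sum(P * np.log(D))

# Closed-form maximizer from the Lagrangian: D_i(x) proportional to p_i(x).
D_star = P / P.sum(axis=0, keepdims=True)    # columns sum to 1

# Any feasible perturbation (columns still summing to 1) does no better,
# since the objective is concave in D.
for _ in range(100):
    Z = rng.dirichlet(np.ones(k), size=m).T  # random feasible point
    D = 0.9 * D_star + 0.1 * Z
    assert objective(D) <= objective(D_star) + 1e-12
```

The achieved maximum value is, up to a constant shift, the generalized Jensen-Shannon divergence of the p_i, as stated above.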
We use the following metrics to evaluate various aspects of the trained latent representation: disentanglement, independence, and generated image quality.
We use the popular disentanglement metric proposed in . This metric is defined for datasets with known ground-truth factors and is computed as follows. First, generate data points where the th factor is fixed but the other factors vary uniformly at random, for a randomly selected . Pass each sample through the encoder, and normalize each resulting dimension by its empirical standard deviation. Take the resulting dimension with the smallest variance. In a setting with perfect disentanglement, the variance in the dimension corresponding to the fixed factor would be 0. Each group’s encodings generate a ‘vote’; we take the majority vote as the final output of a classifier that maps latent dimensions to factors. The disentanglement metric is the accuracy of this classifier, taken over many independent trials of this procedure. In our experiments, for each fixed factor index , we generate 100 groups of images, where each group has 100 images with the same value at the th index.
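The voting procedure above can be sketched on synthetic data. The sketch below is ours, not the paper's code; the function names are hypothetical, and the final check assumes an idealized encoder in which code j copies factor j exactly, so the metric should come out to 1.0.

```python
import numpy as np

def majority_vote_metric(votes):
    """votes[t] = (argmin-variance latent dim, fixed factor index) for
    trial t. Fit the majority-vote classifier dim -> factor, then return
    its accuracy over the same trials."""
    num_dims = max(d for d, _ in votes) + 1
    num_factors = max(k for _, k in votes) + 1
    counts = np.zeros((num_dims, num_factors), dtype=int)
    for d, k in votes:
        counts[d, k] += 1
    classifier = counts.argmax(axis=1)        # majority vote per dimension
    correct = sum(classifier[d] == k for d, k in votes)
    return correct / len(votes)

def vote(group_codes, global_std, fixed_factor):
    """One trial: normalize each latent dim by its global std and vote
    for the dimension with the smallest within-group variance."""
    z = group_codes / global_std
    return int(np.argmin(z.var(axis=0))), fixed_factor

# Synthetic check with a perfectly disentangled 'encoder' (code j = factor j).
rng = np.random.default_rng(0)
votes = []
for _ in range(300):
    k = rng.integers(3)                       # randomly chosen fixed factor
    codes = rng.normal(size=(100, 3))         # varying factors -> noisy codes
    codes[:, k] = 0.5                         # fixed factor -> constant code
    votes.append(vote(codes, np.ones(3), k))
score = majority_vote_metric(votes)
assert score == 1.0
```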
One challenge is computing the disentanglement metric for FactorVAE when it is trained with more latent codes than there are true latent factors (let  denote the true number of factors). For instance,  uses 10 latent codes for datasets with only five factors. To account for this,  first removes all latent codes that have collapsed to the prior (i.e., ); they then use the majority vote on the remaining codes. However, this approach can artificially change the metric if the number of non-collapsed codes differs from . Hence, to measure the metric on FactorVAE (or more generally, whenever the number of codes exceeds ), we first compute the metric matrix and find the maximum value of each row. We then take the top  among the maximum row values and sum them up.
We additionally compute the (less common) disentanglement metric of . This metric first requires an estimate of the disentangled code from the generated samples, for which we use our encoder. Next, we train a regressor to predict  from , so that . These regressors must also provide a matrix of relative importance , such that  denotes the relative importance of  in predicting . Because of this requirement, Eastwood and Williams propose using regressors that provide importance scores, such as LASSO and random forests. These restrictions limit the generality of the metric; nonetheless, we include it for completeness. Given the regression, disentanglement is computed for the th latent code as , where  is the entropy of  and . The final disentanglement score is .
The Eastwood and Williams disentanglement metric is computed using the random forest regressor (https://github.com/cianeastwood/qedr/blob/master/quantify.ipynb), as implemented in the scikit-learn library (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html), with default values for all parameters except max_depth, for which we use the values 4, 2, 4, 2, and 2 for the latent factors shape, scale, rotation, x-position, and y-position respectively, as used by the IB-GAN paper (per a private communication with its authors).
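Given a relative-importance matrix (e.g. random-forest feature importances, as in the setup above), the Eastwood–Williams disentanglement score reduces to a small computation. The sketch below is our reading of that computation, not the reference code: rows are latent codes, columns are factors, row entropies are taken in base equal to the number of factors, and each code is weighted by its share of the total importance.

```python
import numpy as np

def dci_disentanglement(R):
    """Disentanglement from a (num_codes x num_factors) importance matrix R.
    D_i = 1 - H(P_i), where P_ij = R_ij / sum_k R_ik and the entropy is in
    base num_factors; the final score weights code i by rho_i, its share
    of the total importance."""
    num_factors = R.shape[1]
    P = R / R.sum(axis=1, keepdims=True)
    # log(1) = 0 stands in for the 0 * log(0) = 0 convention
    H = -np.sum(P * np.log(np.where(P > 0, P, 1.0)), axis=1) / np.log(num_factors)
    D = 1.0 - H
    rho = R.sum(axis=1) / R.sum()
    return float(np.sum(rho * D))

# A one-to-one importance matrix scores 1.0 ...
assert abs(dci_disentanglement(np.eye(5)) - 1.0) < 1e-9
# ... while a code spreading importance uniformly over factors scores 0.
assert abs(dci_disentanglement(np.ones((5, 5)))) < 1e-9
```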
The dHSIC score is an empirical, kernel-based measure of the total correlation of a multivariate random variable, computed from samples. It is used in  to enforce independence among latent factors. We use this metric to understand whether and how disentanglement is correlated with total correlation. Suppose we have $n$ samples of the latent codes $(c^{(1)}, \ldots, c^{(d)})$. We want to compute the distance of the joint distribution from the product of the marginal distributions. The dHSIC score over the samples is computed as follows. Consider a Gaussian kernel $k(x, x') = \exp(-\|x - x'\|^2 / (2\sigma^2))$ with a median heuristic for the bandwidth, $\sigma = \operatorname{median}\{\|x_i - x_j\| : 1 \le i < j \le n\}$. When there are $d$ latent codes and $n$ samples, we have
$$\widehat{\operatorname{dHSIC}} = \frac{1}{n^2}\sum_{i,j=1}^{n}\prod_{l=1}^{d} k\big(c_i^{(l)}, c_j^{(l)}\big) + \prod_{l=1}^{d}\frac{1}{n^2}\sum_{i,j=1}^{n} k\big(c_i^{(l)}, c_j^{(l)}\big) - \frac{2}{n}\sum_{i=1}^{n}\prod_{l=1}^{d}\frac{1}{n}\sum_{j=1}^{n} k\big(c_i^{(l)}, c_j^{(l)}\big).$$
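The estimator can be written compactly with per-code Gram matrices. The sketch below is a minimal numpy implementation under the standard dHSIC formulation (one Gaussian kernel per latent code, median-heuristic bandwidth); function names are ours.

```python
import numpy as np

def gaussian_gram(x):
    """Gram matrix of a 1-D sample under a Gaussian kernel with
    median-heuristic bandwidth."""
    d2 = (x[:, None] - x[None, :]) ** 2
    med = np.median(np.sqrt(d2[np.triu_indices(len(x), k=1)]))
    sigma2 = med ** 2 if med > 0 else 1.0
    return np.exp(-d2 / (2 * sigma2))

def dhsic(codes):
    """Empirical dHSIC of an (n x d) sample, one kernel per latent code.
    Approaches zero (in the limit) iff the d codes are jointly independent."""
    n, d = codes.shape
    K = [gaussian_gram(codes[:, l]) for l in range(d)]
    prod = np.ones((n, n))
    for Kl in K:
        prod *= Kl                                   # entrywise product over codes
    term1 = prod.mean()                              # (1/n^2) sum_ij prod_l K_l
    term2 = np.prod([Kl.mean() for Kl in K])         # prod_l (1/n^2) sum_ij K_l
    term3 = np.prod([Kl.mean(axis=1) for Kl in K], axis=0).mean()
    return term1 + term2 - 2 * term3

# Dependent codes should score higher than independent ones.
rng = np.random.default_rng(0)
z = rng.normal(size=(500, 2))
independent = dhsic(z)
dependent = dhsic(np.column_stack([z[:, 0], z[:, 0] + 0.1 * z[:, 1]]))
assert dependent > independent
```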
Inception score was first proposed in  for labelled data, and is computed as $\exp\big(\mathbb{E}_{x}\big[\mathrm{KL}\big(p(y \mid x)\,\|\,p(y)\big)\big]\big)$, where KL denotes the Kullback-Leibler divergence of two distributions. The label distribution $p(y \mid x)$ was originally computed with the Inception network , but we instead use a pre-trained classifier for the dataset at hand. Notice that this metric does not require any information about the disentangled representation of a sample. Inception score is widely used in the GAN literature to evaluate data quality. We also describe more nuanced measures of image quality in Appendix D, and measure them on our datasets.
Intuitively, inception score measures a combination of two effects: mode collapse and individual sample quality. To tease apart these effects, we compute two additional metrics. The first is reverse KL-divergence, proposed in  to measure mode collapse in labelled data. It measures the KL-divergence between the generated label distribution and the true label distribution. The second is classifier confidence, which we use as a proxy for sample quality. Classifier confidence is measured as the maximum value of the softmax layer of a pre-trained classifier; the higher this value, the more confident the classifier is in its output, which suggests better image quality.
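All three quantities can be computed from the classifier's softmax outputs on generated samples. The sketch below is ours (the function name is hypothetical) and assumes the standard definitions: inception score exponentiates the mean per-sample KL, reverse KL compares the generated label marginal to the true one, and confidence is the mean maximum softmax value. The check uses a perfectly confident, class-balanced generator over 4 classes.

```python
import numpy as np

def scores_from_probs(probs, true_label_dist):
    """probs: (n x num_classes) softmax outputs of a pre-trained classifier
    on generated samples. Returns (inception score, reverse KL of the
    generated label distribution from the true one, mean confidence)."""
    p_y = probs.mean(axis=0)                        # generated label marginal
    kl = np.sum(probs * (np.log(probs) - np.log(p_y)), axis=1)
    inception = float(np.exp(kl.mean()))
    reverse_kl = float(np.sum(p_y * (np.log(p_y) - np.log(true_label_dist))))
    confidence = float(probs.max(axis=1).mean())    # max softmax value
    return inception, reverse_kl, confidence

# Perfectly confident, balanced generator over 4 classes (epsilon avoids log 0):
probs = np.tile(np.eye(4), (25, 1)) + 1e-12
inc, rkl, conf = scores_from_probs(probs, np.full(4, 0.25))
assert abs(inc - 4.0) < 1e-3     # inception score hits the class count
assert abs(rkl) < 1e-3           # no mode collapse
assert abs(conf - 1.0) < 1e-3    # fully confident classifier
```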
To better understand the parameter exploration in Figure 3(a), we generate similar plots, but representing image quality by reverse KL-divergence (Figure 7) and classifier confidence (Figure 8).
For both reverse KL-divergence and classifier confidence, we observe trends similar to Figure 3(a): increasing  improves disentanglement up to a point, whereas it appears to hurt both image quality metrics. One observation is that for , there is little noticeable change in either image quality metric. This suggests that CR does not introduce mode collapse or substantial reductions in image quality for small .
To explore the relation between total correlation and disentanglement, Figure 9 plots the disentanglement score of  as a function of dHSIC score while varying for InfoGAN-CR. Each point represents a single model, and point size/color signifies the value of . Larger points denote larger . Since dHSIC approximates the total correlation between the latent codes, a lower dHSIC score implies a lower total correlation. Perhaps surprisingly, we find a noticeable positive correlation between dHSIC score and disentanglement. This suggests that TC regularization (i.e., encouraging small TC) actually hurts disentanglement for InfoGAN-CR. This may help to explain, or at least confirm, the findings in Figure 3(c), which show that adding TC regularization to InfoGAN-CR reduces the disentanglement score.
To test whether the InfoGAN regularizer is necessary, we trained on the dSprites dataset without the InfoGAN regularizer (i.e., ) while progressively increasing . This new loss suffers from significant mode collapse, which can be substantially reduced with a recent technique for mitigating mode collapse known as PacGAN, introduced in . The main idea is to modify the adversarial discriminator to take several samples packed together, all drawn from either the real data or the generated data. This provably introduces an implicit inductive bias towards penalizing mode collapse, which is mathematically defined in , in terms of binary hypothesis testing and type I and type II errors. The resulting metrics are shown in Figure 10, where even with PacGAN we do not reach the desired level of disentanglement without the InfoGAN regularizer. We believe that the InfoGAN and contrastive regularizers play complementary roles in disentangling GANs.
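The packing step described above is a pure tensor operation. The sketch below is ours, not the PacGAN reference code; it assumes NHWC image batches, and concatenates each group of m same-source samples along the channel axis so the discriminator sees them jointly.

```python
import numpy as np

def pack(batch, m):
    """Pack m samples into one discriminator input by concatenating along
    the channel axis: (n, H, W, C) -> (n // m, H, W, C * m). All m samples
    in a packed input come from the same source (all real or all fake)."""
    n, h, w, c = batch.shape
    assert n % m == 0, "batch size must be divisible by the packing degree"
    return (batch.reshape(n // m, m, h, w, c)
                 .transpose(0, 2, 3, 1, 4)
                 .reshape(n // m, h, w, c * m))

# A batch of 64 binary 64x64 images packed in pairs:
real = np.random.rand(64, 64, 64, 1)
packed = pack(real, 2)
assert packed.shape == (32, 64, 64, 2)
# Channel 0 of packed input 0 is sample 0; channel 1 is sample 1.
assert np.allclose(packed[0, :, :, 0], real[0, :, :, 0])
assert np.allclose(packed[0, :, :, 1], real[1, :, :, 0])
```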
In the dSprites experiments, we used a convolutional neural network for the FactorVAE encoder, the InfoGAN discriminator, and the CR discriminator; a deconvolutional neural network for the decoder; and a multi-layer perceptron for the total correlation discriminators. We used the Adam optimizer for all updates, with learning rates described below. For unmodified InfoGAN, we used the architecture described in Table 5 of , reproduced in Table 3 for completeness. As mentioned in Section 2, we make several changes to the training of InfoGAN to improve its disentanglement, including changing the Adam learning rate to 0.001 for the generator and 0.002 for the InfoGAN and CR discriminators ( is still 0.5). The architectural changes are included in Table 4. We include in Table 5 the architecture of our CR discriminator, which is similar to the InfoGAN discriminator. Finally, Table 7 contains the architecture of FactorVAE, reproduced from . We use a batch size of 64 for all experiments.
| Discriminator / Encoder | Generator |
| --- | --- |
| Input 64 × 64 binary image | Input |
| conv. 32 lReLU. stride 2 | FC. 128 ReLU. batchnorm |
| conv. 32 lReLU. stride 2. batchnorm | FC. ReLU. batchnorm |
| conv. 64 lReLU. stride 2. batchnorm | upconv. 64 lReLU. stride 2. batchnorm |
| conv. 64 lReLU. stride 2. batchnorm | upconv. 32 lReLU. stride 2. batchnorm |
| FC. 128 lReLU. batchnorm (*) | upconv. 32 lReLU. stride 2. batchnorm |
| From *: FC. 1 sigmoid. (output layer for D) | upconv. 1 sigmoid. stride 2 |
| From *: FC. 128 lReLU. batchnorm. FC 5 for | |
| Discriminator / Encoder | Generator |
| --- | --- |
| Input 64 × 64 binary (dSprites) or color (teapots) image | Input |
| conv. 32 lReLU. stride 2. spectral normalization | FC. 128 ReLU. batchnorm |
| conv. 32 lReLU. stride 2. spectral normalization | FC. ReLU. batchnorm |
| conv. 64 lReLU. stride 2. spectral normalization | upconv. 64 lReLU. stride 2. batchnorm |
| conv. 64 lReLU. stride 2. spectral normalization | upconv. 32 lReLU. stride 2. batchnorm |
| FC. 128 lReLU. spectral normalization (*) | upconv. 32 lReLU. stride 2. batchnorm |
| From *: FC. 1 sigmoid. (output layer for D) | upconv. 1 (dSprites) or 3 (teapots) sigmoid. stride 2 |
| From *: FC. 128 lReLU. spectral normalization | |
| FC 5. spectral normalization (output layer for ) | |
| CR discriminator |
| --- |
| Input 64 × 64 × 2 (2 binary images) |
| conv. 32 lReLU. stride 2. spectral normalization |
| conv. 32 lReLU. stride 2. spectral normalization |
| conv. 64 lReLU. stride 2. spectral normalization |
| conv. 64 lReLU. stride 2. spectral normalization |
| FC. 128 lReLU. spectral normalization |
| FC 5. softmax |
| CR discriminator |
| --- |
| Input 64 × 64 × 6 (2 color images) |
| conv. 32 lReLU. stride 2. |
| conv. 32 lReLU. stride 2. batchnorm |
| conv. 64 lReLU. stride 2. batchnorm |
| conv. 64 lReLU. stride 2. batchnorm |
| FC. 128 lReLU. batchnorm |
| FC 5. softmax |
| Encoder | Decoder |
| --- | --- |
| Input 64 × 64 binary image | Input |
| conv. 32 ReLU. stride 2 | FC. 128 ReLU. |
| conv. 32 ReLU. stride 2. | FC. ReLU. |
| conv. 64 ReLU. stride 2. | upconv. 64 ReLU. stride 2. |
| conv. 64 ReLU. stride 2. | upconv. 32 ReLU. stride 2. |
| FC. 128. | upconv. 32 ReLU. stride 2. |
| FC. . | upconv. 1. stride 2 |
We now discuss implementation details and additional experiments on the teapots dataset. We used architectures identical to those for the dSprites dataset for InfoGAN and FactorVAE. For InfoGAN-CR, the CR discriminator architecture changes slightly and is listed in Table 6. As with dSprites, FactorVAE used 10 latent codes, so we chose the best five to compute the disentanglement metric. We also plot the disentanglement metric during training in Appendix F.
For InfoGAN-CR, we train InfoGAN with  for 50,000 batches, then InfoGAN-CR with , , gap = 1.9 for 35,000 batches, and then InfoGAN-CR with , , gap = 0.0 for 40,000 batches. We use a batch size of 64 for all experiments. The decoder and disentangling discriminator architectures are the same as in the dSprites experiments; the CR architecture is shown in Table 6.
We illustrate a latent traversal for the teapots dataset under InfoGAN-CR in Figure 12, taken from the best run of InfoGAN-CR, with a disentanglement metric of 1.0. To make the traversal easier to interpret, we have separated the color channels for each latent factor that captures color. This shows that each such latent factor controls a single color channel while the others are held fixed. A similar traversal is shown in Figure 13 for FactorVAE, taken from the best run of FactorVAE with , with a disentanglement metric of 0.94. Although it is difficult to draw conclusions from a qualitative comparison, we found that the sharpness of the FactorVAE images was reduced, though FactorVAE is able to learn a meaningful disentanglement with three color factors and three (one duplicate) rotation factors.
Building on Table 2, we also plot the disentanglement metric of  during the training of InfoGAN-CR and FactorVAE in Figure 11. This plot shows that InfoGAN-CR achieves a consistently higher disentanglement score than FactorVAE throughout the training procedure, though FactorVAE comes close when .
To further push the boundaries of disentangled generation, it is important to understand when and why InfoGAN-CR fails; the teapots dataset gives a useful starting point. In particular,  observed that rotations about the canonical basis axes are not commutative. For example, suppose we apply two rotations of some angles, one about the  axis and one about the  axis, in the two different orders; in general, the resulting orientations of the object need not be the same.
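The non-commutativity of rotations about the canonical axes can be checked directly with rotation matrices; the short sketch below (ours, for illustration) verifies it, and also that rotations about a single fixed axis do commute.

```python
import numpy as np

def rot_x(t):
    """Rotation by angle t about the x axis."""
    return np.array([[1, 0, 0],
                     [0, np.cos(t), -np.sin(t)],
                     [0, np.sin(t),  np.cos(t)]])

def rot_y(t):
    """Rotation by angle t about the y axis."""
    return np.array([[ np.cos(t), 0, np.sin(t)],
                     [0, 1, 0],
                     [-np.sin(t), 0, np.cos(t)]])

a, b = 0.7, 1.1
# The same two rotations applied in different orders give different
# orientations, so the canonical axes do not disentangle jointly.
assert not np.allclose(rot_x(a) @ rot_y(b), rot_y(b) @ rot_x(a))
# Rotations about a single fixed axis, by contrast, do commute.
assert np.allclose(rot_x(a) @ rot_x(b), rot_x(b) @ rot_x(a))
```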
Because of this, a disentangled GAN or VAE trained on a dataset where every possible orientation of the teapot is represented should not recover the canonical basis for 3D space. Despite this fact, existing experiments (including our own) appear to recover the canonical axes of rotation. To explain this phenomenon, we observe that existing experiments on this dataset, including  and our own initial experiments, do not include all orientations of the object in the training data. For example, notice that none of the visualized images in Figure 12 show the bottom of the teapot. Because of the way the training data is selected, the canonical axes are indeed recovered.
We next ask what would happen if all orientations were present in the dataset. To answer this question, first note that some families of 3D rotations do commute. For example, the rotational coordinate system described in Figure 14 commutes. In that coordinate system,  represents the rotation about the axis of the teapot (orthogonal to the bottom of the teapot), while  and  represent the orientation of the teapot’s principal axis. However, this coordinate system is not unique; we could choose a different -axis, for instance, and construct a different commutative rotational coordinate system. Hence, even if we were to represent all orientations of the teapots in our training data, it is unclear which coordinate system we would recover.
Figure 15 shows the result from a single trial of an experiment where every possible orientation of the teapot is included in the training data. We trained InfoGAN-CR with InfoGAN coefficient and CR coefficient . As with our other experiments, we trained five latent factors and visualize the three most meaningful ones in Figure 15. To our surprise, we find that the system (roughly) recovers a similar coordinate system as the one depicted in Figure 14. Here appears to capture , appears to recover , and recovers . Even upon running multiple trials of this experiment, the InfoGAN-CR appears to learn (approximately) the same coordinate system from Figure 14. We hypothesize that this happens because of the illumination in the images. By default, the renderer from  renders the teapots with an overhead light source. This may distinguish the vertical axis from the others, causing InfoGAN-CR to learn the vertical axis as a reference for the rotational coordinate system.
To test this hypothesis, Figure 16 shows an experiment in which we randomized the light source for each image. For this experiment, we again used an InfoGAN coefficient of , and a disentangling coefficient of ; the experimental setup is identical to that of Figure 15, except for the light source in the training data. We find that in this case, InfoGAN-CR no longer appears to learn the vertical coordinate system. Indeed, it does not seem to learn any disentangled representation. We observed similar results for FactorVAE for a range of total correlation coefficients on this dataset, so we do not believe this effect is unique to InfoGAN-CR. Instead, it suggests that in settings where there is no single disentangled representation, current disentanglement methods fail.
To further study the failure modes of InfoGAN-CR and other state-of-the-art disentangling architectures, we introduce the following experiments. For this purpose we generate a new synthetic dataset, which we call Circular dSprites (CdSprites).
It contains a set of 1080 64 × 64 8-bit gray-scale images generated as follows. Each image has a white (pixel value 255) circular (disc) shape of radius 5 pixels on a black background (pixel value 0). For the placement of the shape we construct a polar (2D) coordinate system, whose coordinates are the radius  and angle , with origin at the center of the image canvas: (32, 32). The circular shape is then placed at a point , where the radius is selected uniformly from  (pixel units) and the angle is selected uniformly from . Thus if the center of the circular shape is selected as , it is placed at the pixel . In total there are 27 × 40 = 1080 images in the dataset. Fig. 17(a) shows some samples and their corresponding radius () and angle () indices. Fig. 17(b) shows the overlap of all the images in the dataset, which shows the circular region where the shape can be placed. We expect a good disentangled representation to disentangle the radius and angle latent factors.
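The generation procedure can be sketched as follows. This is our reconstruction, not the released generator: the grid sizes (27 radii × 40 angles) match the 1080-image count above, but the exact radius range is elided in the text, so the RADII values below are placeholders.

```python
import numpy as np

# 27 radii x 40 angles = 1080 images; the radius interval here is a
# placeholder, since the source does not state the exact range.
RADII = np.linspace(5, 24, 27)
ANGLES = np.linspace(0, 2 * np.pi, 40, endpoint=False)

def render(r, theta, size=64, shape_radius=5):
    """Render one 8-bit image: a white disc of radius 5 centered at the
    polar coordinate (r, theta) relative to the canvas center (32, 32)."""
    cx, cy = 32 + r * np.cos(theta), 32 + r * np.sin(theta)
    yy, xx = np.mgrid[0:size, 0:size]
    img = np.zeros((size, size), dtype=np.uint8)
    img[(xx - cx) ** 2 + (yy - cy) ** 2 <= shape_radius ** 2] = 255
    return img

dataset = np.stack([render(r, t) for r in RADII for t in ANGLES])
assert dataset.shape == (1080, 64, 64)
assert dataset.max() == 255 and dataset.min() == 0
```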
We train FactorVAE, InfoGAN, and InfoGAN-CR models on this dataset, using the same architectures as for the dSprites dataset. However, we reduce the number of latent codes to 2 for all models, since there are only two factors to be learned.
In Fig. 18, we show the traversal of the dataset through the true latents. Each row corresponds to one latent’s traversal while the other is fixed; each column has a different fixed value for the other latent. As we traverse a latent with the other fixed, for easy visualization we increase the shade of the shape from darkest to brightest. In Figs. 19, 20, and 21, we show the traversal through the learned latents of FactorVAE, InfoGAN, and InfoGAN-CR respectively. We see that none of the models truly disentangles the true radius and angle factors; in fact, the two are mixed in the learned latents. We believe this dataset is hard for current models to disentangle, and it could thus serve as a good baseline for future research.
We crop the center 128 × 128 pixels from the original CelebA images, and resize the images to 32 × 32 for training.
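This preprocessing step can be sketched in a few lines. The code below is ours: it assumes HWC arrays at CelebA's native 218 × 178 resolution, and uses 4 × 4 average pooling as a stand-in for the resize step, whose interpolation method is not specified in the text.

```python
import numpy as np

def center_crop_resize(img):
    """Center-crop 128x128 from an (H, W, C) image, then downsample to
    32x32 via 4x4 average pooling (a stand-in for the unspecified resize)."""
    h, w = img.shape[:2]
    top, left = (h - 128) // 2, (w - 128) // 2
    crop = img[top:top + 128, left:left + 128]
    # 128 = 32 * 4 on both axes, so block-mean pooling gives 32x32.
    return crop.reshape(32, 4, 32, 4, -1).mean(axis=(1, 3))

# CelebA images are 218 x 178 (H x W) with 3 color channels.
celeba_like = np.random.rand(218, 178, 3)
small = center_crop_resize(celeba_like)
assert small.shape == (32, 32, 3)
```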
The architectures of the generator, InfoGAN discriminator, and CR discriminator in this experiment are shown in Tables 8 and 9, and are in part motivated by an independent implementation of InfoGAN (https://github.com/conan7882/tf-gans). We used the Adam optimizer for all updates. The generator’s learning rate is 2e-3; the InfoGAN and CR discriminators’ learning rates are 2e-4.  is 0.5 and the batch size is 128 for all components. We train InfoGAN with  for 80,000 batches, and then train InfoGAN-CR with , ,  for another 10,173 batches (so that the total number of epochs is 57). The samples are generated at the end of training of one run.
| Discriminator / Encoder | Generator |
| --- | --- |
| Input 32 × 32 color image | Input |
| conv. 64 lReLU. stride 2. spectral normalization | FC. 1024 ReLU. batchnorm |
| conv. 128 lReLU. stride 2. spectral normalization | FC. ReLU. batchnorm |
| conv. 256 lReLU. stride 2. spectral normalization | upconv. 128 ReLU. stride 2. batchnorm |
| conv. 512 lReLU. stride 2. spectral normalization (*) | upconv. 64 ReLU. stride 2. batchnorm |
| From *: FC. 1 sigmoid. (output layer for D) | upconv. 3 tanh. stride 2 |
| From *: FC. 5. spectral normalization (output layer for ) | |
| CR discriminator |
| --- |
| Input 32 × 32 × 6 (2 color images) |
| conv. 64 lReLU. stride 2. |
| conv. 128 lReLU. stride 2. batchnorm |
| conv. 256 lReLU. stride 2. batchnorm |
| conv. 512 lReLU. stride 2. batchnorm |
| FC 5. softmax |
In the design of CR, there are four choices for how the two input images in a pair are generated (assuming the gap is always zero):
Same noise variables, and same latent codes except one
Same noise variables, and random latent codes except one
Random noise variables, and same latent codes except one
Random noise variables, and random latent codes except one
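The four pairing choices can be sketched as a single sampling routine. The code below is our illustration, not the released implementation; the function name and the uniform/Gaussian sampling ranges are assumptions. "Same codes except one" resamples the distinguished code, while "random codes except one" shares only the distinguished code.

```python
import numpy as np

def sample_pair(option, num_codes=5, noise_dim=62, rng=None):
    """Sample latent inputs for the two images fed to the CR discriminator
    under pairing choices 1-4 above (zero gap). Returns
    (codes1, codes2, noise1, noise2, shared_or_varied_index)."""
    rng = rng or np.random.default_rng()
    i = rng.integers(num_codes)           # the index CR must identify
    c1 = rng.uniform(-1, 1, num_codes)
    n1 = rng.normal(size=noise_dim)
    same_codes = option in (1, 3)         # options 1 & 3: same codes except one
    same_noise = option in (1, 2)         # options 1 & 2: same noise
    c2 = c1.copy() if same_codes else rng.uniform(-1, 1, num_codes)
    n2 = n1.copy() if same_noise else rng.normal(size=noise_dim)
    if same_codes:
        c2[i] = rng.uniform(-1, 1)        # vary only the distinguished code
    else:
        c2[i] = c1[i]                     # share only the distinguished code
    return c1, c2, n1, n2, i

# Option 2 (the one chosen below): shared noise, one shared code.
c1, c2, n1, n2, i = sample_pair(2, rng=np.random.default_rng(0))
assert np.allclose(n1, n2) and c1[i] == c2[i]
```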
We tried all four settings on the MNIST dataset, using the architecture in , with one 10-category latent code, 2 continuous latent codes, and 62 noise variables. We ran each setting 8 times and visually inspected whether it correctly disentangles digit, rotation, and width. The disentanglement success rates are recorded in Table 10.
| Choice | Rate of Successful Disentanglement |
| --- | --- |
Based on these results, we chose option 2 as the starting point for designing CR.