1 Introduction
Disentanglement learning holds the key to understanding the world from observations, transferring knowledge across different tasks and domains, generating novel designs, and learning compositional concepts bengio2013representation ; higgins2017scan ; lake2017building ; peters2017elements ; schmidhuber1992learning . Assuming the observation $x$ is generated from latent factors $z$ via $p(x|z)$, the goal of disentanglement learning is to correctly uncover a set of independent factors $\{z_i\}$ that give rise to the observation. While there has been considerable progress in recent years, common assumptions about disentangled representations appear to be inadequate locatello2019challenging .
Unsupervised disentangling methods are highly desirable as they assume no knowledge of the ground truth factors. These methods typically impose constraints to encourage independence among latent variables. Examples of constraints include forcing the variational posterior $q(z|x)$ to be similar to a factorial prior $p(z)$ burgess2018understanding ; higgins2017beta , forcing the variational aggregated prior $q(z)$ to be similar to the prior $p(z)$ makhzani2015adversarial , adding a total correlation loss kim2018disentangling , forcing the covariance matrix of $q(z)$ to be close to the identity matrix kumar2017variational , and using a kernel-based measure of independence lopez2018information . However, it remains unclear how the independence constraint affects other properties of the representation. Indeed, more independence may lead to higher reconstruction error in some models higgins2017beta ; kim2018disentangling . Worse still, the independent representations may not match human-defined concepts locatello2019challenging . This suggests that supervised methods, which associate a representation (or a group of representations) with a particular ground truth factor, may be more adequate. However, most supervised methods have only been shown to perform well on toy datasets harsh2018disentangling ; kulkarni2015deep ; mathieu2016disentangling in which data are generated from a multiplicative combination of the ground truth factors. Their performance on real datasets remains unclear.
We believe that there are at least two major reasons for the current unsatisfying state of disentanglement learning: i) the lack of a formal notion of disentangled representations to support the design of proper objective functions tschannen2018recent ; locatello2019challenging , and ii) the lack of robust evaluation metrics to enable a fair comparison between models, regardless of their architectures or design purposes. To that end, we contribute by formally characterizing disentangled representations along three dimensions, namely informativeness, separability and interpretability, drawing from concepts in information theory (Section 3). We then design robust quantitative metrics for these properties and argue that an ideal method for disentanglement learning should achieve high performance on these metrics (Section 4).
We run a series of experiments to demonstrate how to compare different models using our proposed metrics, showing that the quantitative results provided by these metrics are consistent with visual results (Section 5). In the process, we gain important insights about some well-known disentanglement learning methods, namely FactorVAE kim2018disentangling and AAE makhzani2015adversarial .
2 Preliminaries
VAE-based methods
Variational Autoencoder (VAE) kingma2013auto ; rezende2014stochastic is a class of latent variable models, assuming a generative process as $p(x, z) = p(x|z)\,p(z)$, where $x$ denotes the observed data whose samples are drawn from the empirical distribution $p_{\mathcal{D}}(x)$ and $z$ denotes the latent variable. Standard VAEs are trained by minimizing the variational upper bound of the expected negative log-likelihood over the data. However, this objective function does not encourage disentanglement in the representation. A simple solution is β-VAE higgins2017beta , which modifies the objective as follows:
$$\mathcal{L}_{\beta\text{-VAE}} = \mathbb{E}_{p_{\mathcal{D}}(x)}\left[\mathbb{E}_{q(z|x)}\left[-\log p(x|z)\right] + \beta\, D_{KL}\left(q(z|x)\,\|\,p(z)\right)\right]$$
where $\beta \geq 0$ and $q(z|x)$ is a parameterized variational estimator of the true posterior distribution $p(z|x)$. When $\beta = 1$, $\mathcal{L}_{\beta\text{-VAE}}$ reduces to $\mathcal{L}_{\text{VAE}}$. Another proposal is FactorVAE kim2018disentangling , which adds a constraint to the standard VAE loss to explicitly impose factorization of $q(z)$:
$$\mathcal{L}_{\text{FactorVAE}} = \mathcal{L}_{\text{VAE}} + \gamma\, D_{KL}\Big(q(z)\,\Big\|\,\prod_i q(z_i)\Big) \qquad (1)$$
where $D_{KL}\left(q(z)\,\|\,\prod_i q(z_i)\right)$ is known as the total correlation (TC) of $z$. Intuitively, $\gamma$ can be large without affecting the mutual information between $z$ and $x$, making FactorVAE more robust than β-VAE in learning disentangled representations. Other variants include β-TCVAE chen2018isolating and DIP-VAE kumar2017variational , which are technically equivalent to FactorVAE in the sense that they also force $q(z)$ to be factorized.
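As a small illustration of the total correlation term in Eq. 1, the sketch below evaluates it in closed form for a zero-mean Gaussian $q(z)$ with covariance $\Sigma$, where it reduces to $\frac{1}{2}\left(\sum_i \log \Sigma_{ii} - \log\det\Sigma\right)$. The Gaussian assumption and the function name `gaussian_total_correlation` are ours, chosen for illustration only; FactorVAE itself estimates this term with a discriminator rather than in closed form.

```python
import numpy as np


def gaussian_total_correlation(cov):
    """Total correlation KL(q(z) || prod_i q(z_i)) in nats for z ~ N(0, cov).

    Illustrative assumption: q(z) is exactly Gaussian, so the product of
    marginals is N(0, diag(cov)) and the KL divergence has a closed form.
    """
    cov = np.asarray(cov, dtype=float)
    sign, logdet = np.linalg.slogdet(cov)
    if sign <= 0:
        raise ValueError("cov must be positive definite")
    # Sum of marginal log-variances minus the joint log-determinant.
    return 0.5 * (np.sum(np.log(np.diag(cov))) - logdet)


if __name__ == "__main__":
    independent = np.eye(3)
    correlated = np.array([[1.0, 0.8], [0.8, 1.0]])
    print(gaussian_total_correlation(independent))  # factorized latents
    print(gaussian_total_correlation(correlated))   # correlated latents
```

For two variables the value coincides with their mutual information, $-\frac{1}{2}\log(1 - \rho^2)$, so the term vanishes exactly when the latents are uncorrelated.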
InfoGAN
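Generative Adversarial Network (GAN) goodfellow2014generative is another class of generative models. GAN solves the minimax problem $\min_G \max_D V(D, G) = \mathbb{E}_{p_{\mathcal{D}}(x)}[\log D(x)] + \mathbb{E}_{p(z)}[\log(1 - D(G(z)))]$ for a generator $G$ and discriminator $D$. InfoGAN chen2016infogan improves over GANs for learning disentangled representations. It assumes that the latent code vector $z$ is a concatenation of two parts: a factorial part $c$ and a noisy part $\epsilon$, denoted as $z = [c, \epsilon]$. InfoGAN learns to disentangle by maximizing the mutual information between $c$ and the observed data $x$, using the following objective function:
$$\min_{G} \max_{D}\; V(D, G) - \lambda\, I(c, G(c, \epsilon)) \qquad (2)$$
$$\min_{G, Q} \max_{D}\; V(D, G) - \lambda\, L_I(G, Q) \qquad (3)$$
where $I(\cdot, \cdot)$ denotes the mutual information; $\lambda > 0$; and $L_I(G, Q)$ is a lower bound of $I(c, G(c, \epsilon))$ agakov2004algorithm . Here, $L_I(G, Q) = \mathbb{E}_{p(c)\,p(\epsilon)}\left[\log Q(c \mid G(c, \epsilon))\right] + H(c)$, where $Q(c|x)$ is a variational estimator of the true conditional distribution $p(c|x)$.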
3 Rethinking Disentanglement
Inspired by bengio2013representation ; ridgeway2016survey , we adopt the notion of disentangled representation learning as “a process of decorrelating information in the data into separate informative representations, each of which corresponds to a concept defined by humans”. This suggests three important properties of a disentangled representation: informativeness, separability and interpretability, which we quantify as follows:
Informativeness
We formulate the informativeness of a particular representation $z_i$ w.r.t. the data $x$ as the mutual information between $x$ and $z_i$:
$$I(x, z_i) = \int_x \int_{z_i} p_{\mathcal{D}}(x)\, q(z_i|x) \log \frac{q(z_i|x)}{q(z_i)}\, dz_i\, dx \qquad (4)$$
where $q(z_i) = \int_x p_{\mathcal{D}}(x)\, q(z_i|x)\, dx$. In order to represent the data faithfully, a representation $z_i$ should be informative of $x$, meaning $I(x, z_i)$ should be large. Because $I(x, z_i) = H(z_i) - H(z_i|x)$, a large value of $I(x, z_i)$ means that $H(z_i|x) \approx 0$, given that $H(z_i)$ can be chosen to be relatively fixed. In other words, if $z_i$ is informative w.r.t. $x$, $q(z_i|x)$ usually has small variance. It is important to note that $I(x, z_i)$ in Eq. 4 is defined on the variational encoder $q(z_i|x)$, and does not require a decoder. This implies we do not need to minimize the reconstruction error over $x$ (e.g., in VAEs) to increase the informativeness of a particular $z_i$.
Separability and Independence
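Two representations $z_i$, $z_j$ are separable w.r.t. the data $x$ if they do not share common information about $x$, which can be formulated as follows:
$$I(x, z_i, z_j) = 0 \qquad (5)$$
where $I(x, z_i, z_j)$ denotes the multivariate mutual information mcgill1954multivariate between $x$, $z_i$ and $z_j$. $I(x, z_i, z_j)$ can be decomposed into standard bivariate mutual information terms as follows:
$$I(x, z_i, z_j) = I(x, z_i) + I(x, z_j) - I(x, (z_i, z_j)) = I(z_i, z_j) - I(z_i, z_j | x)$$
$I(x, z_i, z_j)$ can be either positive or negative. It is positive if $z_i$ and $z_j$ contain redundant information about $x$. The meaning of a negative $I(x, z_i, z_j)$ remains elusive bell2003co .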
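The sign behaviour of the multivariate mutual information can be checked on two toy distributions. The sketch below uses the standard identity $I(x, z_i, z_j) = I(x, z_i) + I(x, z_j) - I(x, (z_i, z_j))$ on small discrete joint tables; the binary examples (a redundant copy and an XOR) and all function names are our own illustrative assumptions, not part of the experiments.

```python
import numpy as np


def entropy(p):
    """Shannon entropy in nats of a pmf given as an array of any shape."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))


def mutual_info(joint):
    """I(A; B) from a 2-D joint pmf with A on rows and B on columns."""
    joint = np.asarray(joint, dtype=float)
    return entropy(joint.sum(axis=1)) + entropy(joint.sum(axis=0)) - entropy(joint)


def co_information(p):
    """I(x, z_i, z_j) from a joint pmf indexed as p[x, z_i, z_j]."""
    p = np.asarray(p, dtype=float)
    i_x_zi = mutual_info(p.sum(axis=2))
    i_x_zj = mutual_info(p.sum(axis=1))
    # Treat the pair (z_i, z_j) as a single variable by flattening its axes.
    i_x_pair = mutual_info(p.reshape(p.shape[0], -1))
    return i_x_zi + i_x_zj - i_x_pair


# Redundant case: z_i = z_j = x for a fair bit x.
copy = np.zeros((2, 2, 2))
copy[0, 0, 0] = copy[1, 1, 1] = 0.5

# Synergistic case: z_i, z_j are independent fair bits and x = z_i XOR z_j.
xor = np.zeros((2, 2, 2))
for a in (0, 1):
    for b in (0, 1):
        xor[a ^ b, a, b] = 0.25

if __name__ == "__main__":
    print(co_information(copy))  # positive: redundant information about x
    print(co_information(xor))   # negative: neither latent alone informs x
```

In the XOR case each latent alone carries no information about $x$ while the pair determines it, which is one concrete situation in which the quantity becomes negative.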
Achieving separability with respect to $x$ does not guarantee that $z_i$ and $z_j$ are separable in general. $z_i$ and $z_j$ are fully separable or statistically independent if and only if:
$$I(z_i, z_j) = 0 \qquad (6)$$
Let us consider how FactorVAE and InfoGAN implement this independence requirement. FactorVAE enforces the independence in Eq. 6 for every pair of , via the TC term (see Eq. 1). In InfoGAN, the existence of such condition is not clear. However, if we look closely into the term in Eq. 3, it is actually the reconstruction error over sampled from a factorial prior . By minimizing this term, we will force close to and close to , making and independent. Note that are derived from the assumption where is the implicit generative distribution of by transforming via the generator . The original GAN objective in InfoGAN assumes that matches the empirical data distribution . However, when this assumption does not hold, is not really grounded on real data making it hard to interpret the independence.
Note that there is a trade-off between informativeness, independence and the number of latent variables, which we discuss in Appdx. A5.
Interpretability
Obtaining informative and independent representations does not guarantee interpretability by humans locatello2019challenging . We argue that in order to achieve interpretability, we should provide models with a set of predefined concepts $y$. In this case, a representation $z_i$ is interpretable with respect to $y_k$ if it only contains information about $y_k$ (given that $z_i$ is separable from all other $z_{-i}$ and all $y_k$ are distinct). Full interpretability can be formulated as follows:
$$I(z_i, y_k) = H(z_i) = H(y_k) \qquad (7)$$
Eq. 7 is equivalent to the condition that $z_i$ is an invertible function of $y_k$. If we want $z_i$ to generalize beyond the observed $y_k$ (i.e., $H(z_i) > H(y_k)$), we can change the condition in Eq. 7 into:
$$I(z_i, y_k) = H(y_k) \quad \text{or} \quad H(y_k | z_i) = 0 \qquad (8)$$
which suggests that the model should accurately predict $y_k$ given $z_i$. If $z_i$ satisfies the condition in Eq. 8, it is said to be partially interpretable w.r.t. $y_k$.
In real data, underlying factors of variation are usually correlated. For example, men often have beards and short hair. Therefore, it is very difficult to match independent latent variables to different ground truth factors at the same time. We believe that in order to achieve good interpretability, we should isolate the factors and learn one at a time.
3.1 An information-theoretic definition of disentangled representations
Given a dataset , where each data point is associated with a set of labeled factors of variation . Assume that there exists a mapping of to groups of hidden representation which follows the distribution . Denoting and . We define disentangled representations for unsupervised cases as follows:
Definition 1 (Unsupervised).
A representation or a group of representations is said to be “fully disentangled” w.r.t a ground truth factor if is marginally independent of all other representations and is fully interpretable w.r.t . Mathematically, this can be written as:
(9) 
where
The definition of disentangled representations for supervised cases is similar to the above, except that now we model instead of and .
Recently, there have been several works eastwood2018framework ; higgins2018towards ; ridgeway2018learning that attempted to define disentangled representations. Higgins et al. higgins2018towards proposed a definition based on group theory cohen2014learning which is (informally) stated as follows: “A representation is disentangled w.r.t a particular subgroup (from a symmetry group ) if can be decomposed into different subspaces in which the subspace should be independent of all other representation subspaces , and should only be affected by the action of a single subgroup and not by other subgroups .” Their definition shares similar observations with ours. However, it is less convenient for designing models and metrics than our information-theoretic definition.
Eastwood et al. eastwood2018framework did not provide any explicit definition of disentangled representation but characterized it along three dimensions, namely “disentanglement”, “completeness”, and “informativeness” (between any ). A high “disentanglement” score () for indicates that it captures at most one factor, let’s say . A high “completeness” score () for indicates that it is captured by at most one latent and is likely to be . A high “informativeness” score (in eastwood2018framework , the authors consider the prediction error of given instead; a high “informativeness” score means this error should be close to ) for indicates that all information of is captured by the representations . Intuitively, when all the three notions achieve optimal values, there should be only a single representation that captures all information of the factor but no information from other factors . However, even in that case, is still not fully interpretable w.r.t since may contain some information in that does not appear in . This makes their notions only applicable to toy datasets on which we know that the data are only generated from predefined ground truth factors . Our definition can handle the situation where we only know some but not all factors of variation in the data. The notions in ridgeway2018learning follow those in eastwood2018framework and hence suffer from the same disadvantage.
3.2 Representations learned by FactorVAE
We empirically observed that FactorVAE learns the same set of disentangled representations across different runs with varying numbers of latent variables (see Appdx. A8). This behavior is akin to that of deterministic PCA, which uncovers a fixed set of linearly independent factors (or principal components). (When we mention factors in this context, they are not really factors of variation. They refer to the columns of the projection matrix in the case of PCA and the component encoding functions in the case of deep generative models.) Standard VAE is theoretically similar to probabilistic PCA (pPCA) tipping1999probabilistic as both assume the same generative process . Unlike deterministic PCA, pPCA learns a rotation-invariant family of factors instead of an identifiable set of factors. However, in a particular pPCA model, the relative orthogonality among factors is still preserved. This means that the factors learned by different pPCA models are statistically equivalent. We hypothesize that by enforcing independence among latent variables, FactorVAE can also learn statistically equivalent factors (or ) which correspond to visually similar results. We provide a proof sketch for the hypothesis in Appdx. A6. We note that Rolinek et al. rolinek2018variational also discovered the same phenomenon in VAE.
4 Robust Evaluation Metrics
We argue that a robust metric for disentanglement should meet the following criteria: i) it supports both supervised and unsupervised models; ii) it can be applied to real datasets; iii) it is computationally straightforward, i.e., it does not require any training procedure; iv) it provides consistent results across different methods and different latent representations; and v) it agrees with qualitative (visual) results. Here we propose information-theoretic metrics to measure informativeness, independence and interpretability which meet all of these robustness criteria.
4.1 Metrics for informativeness
We measure the informativeness of a particular representation $z_i$ w.r.t. $x$ by computing $I(x, z_i)$ in Eq. 4. The main challenges are estimating $q(z_i)$ and computing the integral over $z_i$. We deal with these problems by quantizing $z_i$. To ensure that $I(x, z_i)$ is consistent and comparable among different $z_i$ as well as different models, we apply the same quantization range for different $z_i$. In practice, we choose the range since most of the latent values fall within this range. We divide the range into a set of equal-size bins and estimate $I(x, z_i)$ as follows:
$$I(x, z_i) \approx \frac{1}{N}\sum_{n=1}^{N}\sum_{s_i} Q(s_i|x^{(n)}) \log\frac{Q(s_i|x^{(n)})}{Q(s_i)} \qquad (10)$$
where $Q(s_i)$ and $Q(s_i|x)$ are the probability mass function and the conditional probability mass function of a particular bin $s_i$. Because $Q(s_i) = \mathbb{E}_{p_{\mathcal{D}}(x)}[Q(s_i|x)] \approx \frac{1}{N}\sum_{n=1}^{N} Q(s_i|x^{(n)})$ (we must take into account the whole quantized distribution $Q(s_i|x)$; simply counting the quantized mean of $q(z_i|x)$ for all $x$ is wrong), we only have to compute $Q(s_i|x)$, which, by definition, is:
$$Q(s_i|x) = \int_a^b q(z_i|x)\, dz_i \qquad (11)$$
where $a$, $b$ are the two ends of the bin $s_i$.
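The quantized estimate described above can be sketched as follows, assuming each conditional $q(z_i|x)$ is a Gaussian given by a mean and a standard deviation, so that the bin integral is a difference of Gaussian CDFs. The function names, the default range `(-4.0, 4.0)`, the number of bins, and the optional division by the log of the number of bins are illustrative assumptions rather than the settings used in the experiments.

```python
import math

import numpy as np

_erf = np.vectorize(math.erf, otypes=[float])


def bin_probs(mu, sigma, edges):
    """Bin masses of N(mu_n, sigma_n^2) for every sample n, renormalized per sample."""
    mu = np.asarray(mu, dtype=float)[:, None]
    sigma = np.asarray(sigma, dtype=float)[:, None]
    # Gaussian CDF evaluated at every bin edge, shape (N, num_bins + 1).
    cdf = 0.5 * (1.0 + _erf((edges[None, :] - mu) / (sigma * math.sqrt(2.0))))
    probs = np.diff(cdf, axis=1)
    # Guard against samples whose mass lies entirely outside the range.
    total = np.maximum(probs.sum(axis=1, keepdims=True), 1e-12)
    return probs / total


def informativeness(mu, sigma, num_bins=100, z_range=(-4.0, 4.0), normalize=False):
    """Monte Carlo estimate of I(x, z_i) in nats from per-sample Gaussian encoders."""
    edges = np.linspace(z_range[0], z_range[1], num_bins + 1)
    q_cond = bin_probs(mu, sigma, edges)   # conditional bin masses, shape (N, bins)
    q_marg = q_cond.mean(axis=0)           # aggregated bin masses, shape (bins,)
    safe_marg = np.where(q_marg > 0, q_marg, 1.0)
    # Empty bins contribute zero, so their ratio is set to 1 (log 1 = 0).
    ratio = np.where(q_cond > 0, q_cond / safe_marg, 1.0)
    mi = float(np.mean(np.sum(q_cond * np.log(ratio), axis=1)))
    return mi / math.log(num_bins) if normalize else mi


if __name__ == "__main__":
    # Two well-separated, nearly deterministic codes versus two identical ones.
    print(informativeness([-2.0, 2.0], [0.01, 0.01]))
    print(informativeness([0.0, 0.0], [1.0, 1.0]))
```

A latent whose conditional distribution is identical for every input yields an estimate of zero, while two well-separated and almost deterministic codes yield about $\log 2$ nats, i.e. one bit.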
There are two ways to compute the bin probability $Q(s_i|x)$. In the first way, we simply consider the unnormalized $Q(s_i|x)$ as the area of a rectangle whose width is the bin width $b - a$ and whose height is $q(\tilde{z}_i|x)$ with $\tilde{z}_i$ at the center value of the bin $s_i$. Then, we normalize over all bins to get $Q(s_i|x)$. In the second way, if $q(z_i|x)$ is approximately a Gaussian distribution, we can estimate the above integral with a closed-form function (see Appdx. A12 for details). After computing $I(x, z_i)$, we can divide it by $H(z_i)$ to normalize it to the range [0, 1]. However, this normalization will change the interpretation of the metric and may lead to a situation where a latent variable $z_i$ is less informative than a variable $z_j$ (i.e., $I(x, z_i) < I(x, z_j)$) but still has a higher rank than $z_j$ because $H(z_i) < H(z_j)$. A better way is to divide it by $\log(\#\text{bins})$, where #bins denotes the number of bins. An important note for implementation is that sometimes the standard deviation of $q(z_i|x)$ is close to 0 (or $z_i$ is deterministic given $x$), causing the unnormalized $Q(s_i|x)$ to be close to 0 for all bins. In this case, we set $Q(s_i|x) = 1$ if $s_i$ is the bin that contains the mean of $q(z_i|x)$ and 0 otherwise.
4.2 Metrics for independence
We can compute the independence between two latent variables $z_i$, $z_j$ based on $I(z_i, z_j)$. However, a serious problem of $I(z_i, z_j)$ is that it generates the following order among pairs of representations:
where , are informative representations and , are uninformative (or noisy) representations. This means that if we simply want $z_i$, $z_j$ to be independent, the best scenario is that both are noisy and independent (e.g. ). Therefore, we propose a new metric for independence named MISJED (which stands for Mutual Information Sums Joint Entropy Difference), defined as follows:
$$\text{MISJED}(z_i, z_j) = H(z_i) + H(z_j) - H(\bar{z}_i, \bar{z}_j) = I(z_i, z_j) + H(z_i, z_j) - H(\bar{z}_i, \bar{z}_j)$$
where $\bar{z}_i$ and $\bar{z}_j$ are the means of $q(z_i|x)$ and $q(z_j|x)$, respectively. Since $\bar{z}_i$, $\bar{z}_j$ have less variance than $z_i$, $z_j$, respectively, $H(\bar{z}_i, \bar{z}_j) \leq H(z_i, z_j)$, making $\text{MISJED}(z_i, z_j) \geq 0$.
To achieve a small value of MISJED, i.e., a high degree of independence, the representations $z_i$, $z_j$ must be both independent and informative (or, in an extreme case, deterministic given $x$). Using the MISJED metric, we can ensure the following order: . Because $0 \leq \text{MISJED}(z_i, z_j) \leq 2\log(\#\text{bins})$, we can divide MISJED by $2\log(\#\text{bins})$ to normalize it to [0, 1].
4.3 Metrics for interpretability
Recently, several metrics have been proposed to quantitatively evaluate the interpretability of representations by examining the relationship between the representations and manually labeled factors of variation. The most popular ones are the Z-diff score higgins2017beta ; kim2018disentangling , SAP kumar2017variational and MIG chen2018isolating . Detailed analysis of these metrics is provided in Appdx. A9. Among them, only MIG is based on mutual information and, to some extent, matches the formulation of “interpretability” in Section 3. However, MIG has only been used for toy datasets like dSprites dsprites2017 . The main drawback comes from its probabilistic assumption (see Fig. 1). Note that is a distribution over the high dimensional data space, and is very hard to robustly estimate but the authors simplified it to be if (is the support set for a particular value ) and otherwise. This equation only holds for toy datasets where we know exactly how is generated from . In addition, since depends on the value of , it will be problematic if is continuous.
RMIG
Addressing the drawbacks of MIG, we propose RMIG (which stands for Robust MIG), formulated as follows:
$$\text{RMIG}(y_k) = I(z_{i^*}, y_k) - I(z_{j^\circ}, y_k) \qquad (12)$$
where $I(z_{i^*}, y_k)$ and $I(z_{j^\circ}, y_k)$ are the highest and the second highest mutual information values computed between every $z_i$ and $y_k$; $z_{i^*}$ and $z_{j^\circ}$ are the corresponding latent variables. Like MIG, we can normalize RMIG($y_k$) to [0, 1] by dividing it by $H(y_k)$, but this will favor imbalanced factors (small $H(y_k)$). Details of computation are given in Appdx. A10.
RMIG inherits the idea of MIG but differs in the probabilistic assumption (and other technicalities). RMIG assumes that for unsupervised learning and for supervised learning (see Fig. 1). This not only eliminates all the problems of MIG but also provides additional advantages. First, we can estimate using Monte Carlo sampling on . Second, is well defined for both discrete/continuous and deterministic/stochastic . If is continuous, we can quantize . If is deterministic (i.e., a Dirac delta function), we simply set it to for the value of corresponding to and for other values of . Our metric can also use from an external expert model. Third, for any particular value , we compute for all rather than just for , which gives more accurate results.
JEMMI
A high RMIG value of $y_k$ means that there is a representation $z_{i^*}$ that captures the factor $y_k$. However, $z_{i^*}$ may also capture other factors $y_{\neq k}$ of the data. To make sure that $z_{i^*}$ fits exactly to $y_k$, we provide another metric for interpretability named JEMMI (standing for Joint Entropy Minuses Mutual Information), computed as follows:
$$\text{JEMMI}(y_k) = H(z_{i^*}, y_k) - I(z_{i^*}, y_k) + I(z_{j^\circ}, y_k)$$
where $z_{i^*}$ and $z_{j^\circ}$ are defined in Eq. 12. JEMMI($y_k$) is bounded by 0 and $H(y_k) + \log(\#\text{bins})$. A small JEMMI score means that $z_{i^*}$ should match exactly to $y_k$ and $z_{j^\circ}$ should not be related to $y_k$. Note that if we replace $H(z_{i^*}, y_k)$ by $H(y_k)$ to account for the generalization of $z_{i^*}$ over $y_k$, we obtain a metric equivalent to RMIG (but in reverse order).
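Both interpretability scores can be computed from quantized joint tables between each latent and one factor. The sketch below assumes RMIG is the gap between the largest and second largest mutual information, and that JEMMI is the joint entropy of the best latent and the factor, minus the largest mutual information, plus the second largest; this form is the one consistent with the remark that replacing the joint entropy by the factor entropy yields the factor entropy minus RMIG. The input layout (one bins-by-classes joint pmf per latent) and all names are hypothetical.

```python
import numpy as np


def entropy(p):
    """Shannon entropy in nats of a pmf given as an array of any shape."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))


def mutual_info(joint):
    """I(z; y) from a 2-D joint pmf with latent bins on rows, classes on columns."""
    joint = np.asarray(joint, dtype=float)
    return entropy(joint.sum(axis=1)) + entropy(joint.sum(axis=0)) - entropy(joint)


def rmig_and_jemmi(joints):
    """Return (RMIG, JEMMI, index of the best latent), all unnormalized.

    joints: sequence of at least two joint pmfs, one per latent variable,
    each of shape (num_bins, num_classes) and summing to 1.
    """
    mis = np.array([mutual_info(j) for j in joints])
    order = np.argsort(mis)[::-1]          # latents sorted by decreasing MI
    best, second = int(order[0]), int(order[1])
    rmig = mis[best] - mis[second]
    jemmi = entropy(joints[best]) - mis[best] + mis[second]
    return float(rmig), float(jemmi), best


if __name__ == "__main__":
    independent = np.full((2, 2), 0.25)             # latent unrelated to the factor
    copy = np.array([[0.5, 0.0], [0.0, 0.5]])       # latent that copies a fair binary factor
    print(rmig_and_jemmi([independent, copy]))
```

In this toy setting the copying latent attains the largest possible RMIG for a fair binary factor ($\log 2$ nats) together with a JEMMI of zero, which is the ideal combination described above.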
5 Experiments
We evaluated the performance of FactorVAE kim2018disentangling , β-VAE higgins2017beta and AAE makhzani2015adversarial using our proposed metrics on the CelebA liu2015faceattributes , MNIST and dSprites dsprites2017 datasets. Details about the datasets and model settings are provided in Appdx. A1 and Appdx. A2, respectively. Due to space limits, we only report results on CelebA here, leaving the rest to the supplementary materials.
Informativeness
We sorted the representations of different models according to their informativeness scores in descending order and plotted the results in Fig. 2. There are distinct patterns for different methods. AAE captures equally large amounts of information from the data while FactorVAE and β-VAE capture smaller and varying amounts. This is because FactorVAE and β-VAE penalize the informativeness of representations while AAE does not. Recall that . For AAE, and is equal to the entropy of . For FactorVAE and β-VAE, and is usually smaller than the entropy of due to a narrow (note that does not depend on whether is zero-centered or not).
In Fig. 2, we see a sudden drop of the scores to 0 for some of FactorVAE’s and β-VAE’s representations. These representations are totally random and contain no information about the data (i.e., ). We call them “noisy” representations and provide discussions in Appdx. A5.
We visualize the top 10 most informative representations for these models in Fig. 3. AAE’s representations are more detailed than FactorVAE’s and β-VAE’s, suggesting the effect of high informativeness. However, AAE’s representations mainly capture information within the support of . This explains why we still see a face when interpolating AAE’s representations. By contrast, FactorVAE’s and β-VAE’s representations usually contain information outside the support of . Thus, when we interpolate these representations, we may see something not resembling a face.
Independence
Table 1 reports MISJED scores (Section 4.2) for the top most informative representations. FactorVAE achieves the lowest MISJED scores, AAE comes next and β-VAE is the worst. We argue that this is because FactorVAE learns independent and nearly deterministic representations, β-VAE learns strongly independent yet highly stochastic representations, and AAE, at the other extreme, learns strongly deterministic yet not very independent representations. From Table 1 and Fig. 4, it is clear that MISJED produces correct orders among pairs of representations according to their informativeness.
Table 1: MISJED (unnormalized)
FactorVAE  |  0.008  |  0.009  |  2.476  |  2.443  |  4.858  |  4.892
β-VAE  |  0.113  |  0.131  |  3.413  |  3.401  |  6.661  |  6.739
AAE  |  0.022  |  0.023  |  0.022  |  0.021  |  0.021  |  0.020
Interpretability
We report the RMIG scores and JEMMI scores for several ground truth factors on the CelebA dataset in Tables 2 and 3, respectively. In general, FactorVAE learns representations that agree better with the ground truth factors than β-VAE and AAE do. This is consistent with the qualitative results in Fig. 5. However, all models still perform poorly in terms of interpretability since their RMIG and JEMMI scores are very far from 1 and 0, respectively.
Table 2: RMIG (normalized)
Model  |  Bangs  |  Black Hair  |  Eyeglasses  |  Goatee  |  Male  |  Smiling
  |  H=0.4256  |  H=0.5500  |  H=0.2395  |  H=0.2365  |  H=0.6801  |  H=0.6923
FactorVAE  |  0.1742  |  0.0430  |  0.0409  |  0.0343  |  0.0060  |  0.0962
β-VAE  |  0.0176  |  0.0223  |  0.0045  |  0.0325  |  0.0094  |  0.0184
AAE  |  0.0035  |  0.0276  |  0.0018  |  0.0069  |  0.0060  |  0.0099
Table 3: JEMMI (normalized)
Model  |  Bangs  |  Black Hair  |  Eyeglasses  |  Goatee  |  Male  |  Smiling
  |  H=0.4256  |  H=0.5500  |  H=0.2395  |  H=0.2365  |  H=0.6801  |  H=0.6923
FactorVAE  |  0.6118  |  0.6334  |  0.6041  |  0.6616  |  0.6875  |  0.6150
β-VAE  |  0.8632  |  0.8620  |  0.8602  |  0.8600  |  0.8690  |  0.8699
AAE  |  0.8463  |  0.8613  |  0.8423  |  0.8496  |  0.8644  |  0.8575


Sensitivity of the number of bins
All metrics we propose in this paper require computing the mutual information (MI). To handle continuous cases, we use quantization. It is important to note that quantization is just a trick for computing MI, not the inherent problem of our metrics. With quantization, we need to specify the number of bins (#bins) in advance. Fig. 6 (left, middle) shows the effect of #bins on RMIG scores and JEMMI scores for different models.
We can see that when #bins is small, RMIG scores are low. This is because the quantized distributions and look similar, causing and to be similar as well. When #bins is large, the quantized distribution and look more different, leading to higher RMIG scores. RMIG scores are stable when #bins > 200.
Unlike RMIG scores, JEMMI scores keep increasing when we increase #bins. Note that JEMMI only differs from RMIG in the appearance of . Finer quantizations of introduce more information about , hence, always lead to higher (see Fig. 6 (right)). Larger JEMMI scores also reflect the fact that finer quantizations of make look more continuous, thus, less interpretable w.r.t the discrete factor .
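The dependence of entropy-based terms on #bins can be reproduced on a toy case. The sketch below quantizes a standard Gaussian over a fixed range and reports the entropy of the resulting bin distribution for several bin counts; the standard Gaussian, the range `(-4.0, 4.0)` and the function name are illustrative assumptions and not the latent distributions of the trained models.

```python
import math

import numpy as np


def quantized_gaussian_entropy(num_bins, z_range=(-4.0, 4.0)):
    """Entropy in nats of a standard Gaussian quantized into equal-size bins."""
    edges = np.linspace(z_range[0], z_range[1], num_bins + 1)
    # Standard normal CDF at each bin edge.
    cdf = np.array([0.5 * (1.0 + math.erf(e / math.sqrt(2.0))) for e in edges])
    probs = np.diff(cdf)
    probs = probs / probs.sum()   # renormalize the mass inside the range
    probs = probs[probs > 0]
    return float(-np.sum(probs * np.log(probs)))


if __name__ == "__main__":
    for bins in (10, 50, 200, 1000):
        print(bins, quantized_gaussian_entropy(bins), math.log(bins))
```

The entropy keeps growing as the quantization becomes finer while staying below the logarithm of the number of bins, which matches the behaviour reported for JEMMI and is why scores should only be compared at a fixed #bins.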
Although #bins affects the RMIG and JEMMI scores of a single model, the relative order among different models remains the same. This suggests that once #bins is fixed, we can use RMIG and JEMMI scores to compare different models.
6 Discussion
We have proposed information-theoretic characterizations of disentangled representations, and designed robust metrics for evaluation, along three dimensions: informativeness, separability and interpretability. We examined three well-known representation learning models, namely FactorVAE, β-VAE and AAE, on the CelebA, MNIST and dSprites datasets. Under our metrics, FactorVAE is the best among the three, with reasonably good informativeness and very good MISJED scores. In addition, FactorVAE also learns consistent representations. However, all the examined models still perform poorly under our metrics for interpretability, meaning that they have not met desirable requirements for disentanglement learning. Our work also suggests that unsupervised disentanglement may not be possible, and that labels of ground truth factors should be provided during learning. Thus, we plan to investigate methods which support semi-supervised or few-shot learning in the future.
References
 (1) Error function. https://en.wikipedia.org/wiki/Error_function, May 2019.
 (2) David Barber and Felix Agakov. The IM algorithm: A variational approach to information maximization. Advances in Neural Information Processing Systems, 16:201, 2004.
 (3) Alexander A Alemi, Ian Fischer, Joshua V Dillon, and Kevin Murphy. Deep variational information bottleneck. arXiv preprint arXiv:1612.00410, 2016.

 (4) Anthony J Bell. The co-information lattice. In Proceedings of the Fifth International Workshop on Independent Component Analysis and Blind Signal Separation: ICA, volume 2003, 2003.
 (5) Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.
 (6) Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.
 (7) Christopher P Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Desjardins, and Alexander Lerchner. Understanding disentangling in β-VAE. arXiv preprint arXiv:1804.03599, 2018.
 (8) Tian Qi Chen, Xuechen Li, Roger Grosse, and David Duvenaud. Isolating sources of disentanglement in variational autoencoders. arXiv preprint arXiv:1802.04942, 2018.
 (9) Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172–2180, 2016.

 (10) Taco Cohen and Max Welling. Learning the irreducible representations of commutative Lie groups. In International Conference on Machine Learning, pages 1755–1763, 2014.
 (11) Guillaume Desjardins, Aaron Courville, and Yoshua Bengio. Disentangling factors of variation via generative entangling. arXiv preprint arXiv:1210.5474, 2012.
 (12) Cian Eastwood and Christopher KI Williams. A framework for the quantitative evaluation of disentangled representations. 2018.
 (13) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

 (14) Ananya Harsh Jha, Saket Anand, Maneesh Singh, and VSR Veeravasarapu. Disentangling factors of variation with cycle-consistent variational autoencoders. In Proceedings of the European Conference on Computer Vision (ECCV), pages 805–820, 2018.
 (15) Irina Higgins, David Amos, David Pfau, Sebastien Racaniere, Loic Matthey, Danilo Rezende, and Alexander Lerchner. Towards a definition of disentangled representations. arXiv preprint arXiv:1812.02230, 2018.
 (16) Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2017.
 (17) Irina Higgins, Nicolas Sonnerat, Loic Matthey, Arka Pal, Christopher P Burgess, Matko Bosnjak, Murray Shanahan, Matthew Botvinick, Demis Hassabis, and Alexander Lerchner. Scan: Learning hierarchical compositional visual concepts. arXiv preprint arXiv:1707.03389, 2017.
 (18) Hyunjik Kim and Andriy Mnih. Disentangling by factorising. ICML, 2018.
 (19) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 (20) Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
 (21) Durk P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581–3589, 2014.
 (22) Tejas D Kulkarni, William F Whitney, Pushmeet Kohli, and Josh Tenenbaum. Deep convolutional inverse graphics network. In Advances in Neural Information Processing Systems, pages 2539–2547, 2015.
 (23) Abhishek Kumar, Prasanna Sattigeri, and Avinash Balakrishnan. Variational inference of disentangled latent concepts from unlabeled observations. arXiv preprint arXiv:1711.00848, 2017.
 (24) Brenden M Lake, Tomer D Ullman, Joshua B Tenenbaum, and Samuel J Gershman. Building machines that learn and think like people. Behavioral and Brain Sciences, 40, 2017.
 (25) Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.
 (26) Francesco Locatello, Stefan Bauer, Mario Lucic, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. Challenging common assumptions in the unsupervised learning of disentangled representations. ICML, 2019.
 (27) Romain Lopez, Jeffrey Regier, Michael I Jordan, and Nir Yosef. Information constraints on autoencoding variational bayes. In Advances in Neural Information Processing Systems, pages 6114–6125, 2018.
 (28) Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.
 (29) Michael F Mathieu, Junbo Jake Zhao, Junbo Zhao, Aditya Ramesh, Pablo Sprechmann, and Yann LeCun. Disentangling factors of variation in deep representation using adversarial training. In Advances in Neural Information Processing Systems, pages 5040–5048, 2016.
 (30) Loic Matthey, Irina Higgins, Demis Hassabis, and Alexander Lerchner. dSprites: Disentanglement testing sprites dataset. https://github.com/deepmind/dsprites-dataset/, 2017.
 (31) William McGill. Multivariate information transmission. Transactions of the IRE Professional Group on Information Theory, 4(4):93–111, 1954.
 (32) Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of causal inference: foundations and learning algorithms. MIT press, 2017.
 (33) Scott Reed, Kihyuk Sohn, Yuting Zhang, and Honglak Lee. Learning to disentangle factors of variation with manifold interaction. In International Conference on Machine Learning, pages 1431–1439, 2014.
 (34) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
 (35) Karl Ridgeway. A survey of inductive biases for factorial representation learning. arXiv preprint arXiv:1612.05299, 2016.
 (36) Karl Ridgeway and Michael C Mozer. Learning deep disentangled embeddings with the f-statistic loss. In Advances in Neural Information Processing Systems, pages 185–194, 2018.
 (37) Michal Rolinek, Dominik Zietlow, and Georg Martius. Variational autoencoders pursue pca directions (by accident). arXiv preprint arXiv:1812.06775, 2018.
 (38) Adrià Ruiz, Oriol Martinez, Xavier Binefa, and Jakob Verbeek. Learning disentangled representations with reference-based variational autoencoders. arXiv preprint arXiv:1901.08534, 2019.
 (39) Jürgen Schmidhuber. Learning factorial codes by predictability minimization. Neural Computation, 4(6):863–879, 1992.
 (40) N Siddharth, Brooks Paige, Jan-Willem van de Meent, Alban Desmaison, Noah D Goodman, Pushmeet Kohli, Frank Wood, and Philip HS Torr. Learning disentangled representations with semi-supervised deep generative models. NIPS, 2017.

 (41) Michael E Tipping and Christopher M Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3):611–622, 1999.
 (42) Michael Tschannen, Olivier Bachem, and Mario Lucic. Recent advances in autoencoder-based representation learning. arXiv preprint arXiv:1812.05069, 2018.
 (43) Zhenyao Zhu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Multi-view perceptron: a deep model for learning face identity and view representations. In Advances in Neural Information Processing Systems, pages 217–225, 2014.
Appendix A Appendix
a.1 Datasets
We used the CelebA, MNIST and dSprites datasets. For CelebA, we resized the original images to 64×64. Dataset statistics are provided in Table 4.
Dataset  #Train  #Test  Image size

CelebA  162,770  19,962  64×64×3
MNIST  60,000  10,000  28×28×1
dSprites  737,280  0  64×64×1
a.2 Model settings
For FactorVAE, VAE and AAE we used the same architectures for the encoder and decoder (see Table 5 and Table 6; only FactorVAE and AAE use a discriminator over $z$), following kim2018disentangling . We trained the models for 300 epochs with minibatches of size 64. The learning rate is for the encoder/decoder and is for the discriminator over $z$. We used the Adam kingma2014adam optimizer with and . Unless explicitly mentioned, we fixed the following: number of latent variables to 65, coefficient for the TC term in FactorVAE to 50, value for $\beta$ in VAE to 50, and coefficient for the generator loss over $z$ in AAE to 50.
We note that the FactorVAE in kim2018disentangling only used 10 latent variables for learning factors of variation on the CelebA dataset, so our results may look different from theirs. However, by using larger numbers of latent variables, we are able to discover that FactorVAE learns consistent representations (see Appdx. A.8).
Encoder  Decoder  Discriminator Z

dims: 64×64×3  dim: 65  dim: 65
conv (4, 4, 32), stride 2, ReLU  FC 1×1×256, ReLU  5×[FC 1000, LReLU]
conv (4, 4, 32), stride 2, ReLU  deconv (4, 4, 64), stride 1, valid, ReLU  FC 1
conv (4, 4, 64), stride 2, ReLU  deconv (4, 4, 64), stride 2, ReLU  : 1
conv (4, 4, 64), stride 2, ReLU  deconv (4, 4, 32), stride 2, ReLU  
conv (4, 4, 256), stride 1, valid, ReLU  deconv (4, 4, 32), stride 2, ReLU  
FC 65  deconv (4, 4, 3), stride 2, ReLU  
dim: 65  dim: 64×64×3
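The layer sizes in the table above can be checked with a little shape bookkeeping. The sketch below is only an illustration under stated assumptions: every stride-2 layer is taken to use "same" padding (the table does not state the padding mode), only the layers marked "valid" use no padding, and the helper names `conv_out`, `deconv_out` and `trace` are hypothetical.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution along one axis."""
    if padding == "same":
        return -(-size // stride)          # ceil(size / stride)
    return (size - kernel) // stride + 1   # "valid": no padding


def deconv_out(size, kernel, stride, padding):
    """Spatial output size of a transposed convolution along one axis."""
    if padding == "same":
        return size * stride
    return (size - 1) * stride + kernel    # "valid": no padding


def trace(size, layers, step):
    """Apply `step` layer by layer and record every intermediate size."""
    sizes = [size]
    for kernel, stride, padding in layers:
        size = step(size, kernel, stride, padding)
        sizes.append(size)
    return sizes


# (kernel, stride, padding) per layer; the padding mode is an assumption.
encoder = [(4, 2, "same")] * 4 + [(4, 1, "valid")]
decoder = [(4, 1, "valid")] + [(4, 2, "same")] * 4

enc_sizes = trace(64, encoder, conv_out)    # spatial size after each conv
dec_sizes = trace(1, decoder, deconv_out)   # spatial size after each deconv
print(enc_sizes, dec_sizes)
```

Under these assumptions the encoder reduces a 64×64 input to a 1×1×256 feature map before the final FC layer, and the decoder mirrors it back to 64×64.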
Encoder  Decoder  Discriminator Z

dims: 28×28×1  dims: 65  dims: 65
conv (4, 4, 64), stride 2, LReLU  FC 1024, BN, ReLU  4×[FC 256, LReLU]
conv (4, 4, 128), stride 2, BN, LReLU  FC 7×7×128, BN, ReLU  FC 1
FC 1024, BN, LReLU  deconv (4, 4, 64), stride 2, BN, ReLU  : 1
FC 128, BN, LReLU  deconv (4, 4, 1), stride 2, sigmoid  
FC 65, BN, LReLU  dims: 28×28×1  
dims: 65
Encoder  Decoder  Discriminator Z

dims: 64×64×1  dim: 65  dim: 65
conv (4, 4, 32), stride 2, ReLU  FC 128, ReLU  5×[FC 1000, LReLU]
conv (4, 4, 32), stride 2, ReLU  FC 4×4×64, ReLU  FC 1
conv (4, 4, 64), stride 2, ReLU  deconv (4, 4, 64), stride 2, ReLU  : 1
conv (4, 4, 64), stride 2, ReLU  deconv (4, 4, 32), stride 2, ReLU  
FC 128, ReLU  deconv (4, 4, 32), stride 2, ReLU  
FC 65  deconv (4, 4, 1), stride 2, ReLU  
dim: 65  dim: 64×64×1
a.3 Reviews of disentanglement learning methods
There have been many works that attempt to learn disentangled representations, and they differ significantly in approach and generality. However, there is a lack of consensus on many key aspects locatello2019challenging , including the definition of disentanglement itself higgins2018towards .
Supervised methods kulkarni2015deep ; zhu2014multi assume access to the ground truth factors. For example, DC-IGN kulkarni2015deep is a VAE whose latent variables correspond to different ground truth factors. At each training step, DC-IGN chooses a minibatch in which one ground truth factor varies while the other factors are fixed. It then allows only the latent variable corresponding to the selected factor to capture the variation in the minibatch, by replacing all other latent variables with their mean values over the minibatch. This “clamping” strategy was also applied in reed2014learning to improve the disentanglement capability of a higher-order Boltzmann machine. Mathieu et al. mathieu2016disentangling proposed a conditional VAE that models both labeled factors of variation and other unspecified latent representations. Since the labeled factors are given, this model only has to learn the unspecified representations, which are assumed to be entangled. To ensure the decoder does not ignore the labeled information, the authors swap the unspecified latent representations of two samples and use an additional GAN to force the images generated from and where to be similar. The main problem of this method is that none of the generated images are fixed, which results in unstable GAN training, as reported in mathieu2016disentangling . Other methods derived from mathieu2016disentangling include harsh2018disentangling ; ruiz2019learning .
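The clamping strategy of kulkarni2015deep described above can be illustrated with a few lines of NumPy. This is a minimal sketch of the clamping step only, not the authors' implementation: the function name `clamp_latents`, the array layout (one row per example) and the random stand-in for encoder outputs are assumptions made for illustration.

```python
import numpy as np


def clamp_latents(z, active):
    """Keep column `active` of z and replace every other column by its
    mean over the minibatch.

    z: array of shape (batch, num_latents), one row per example.
    """
    # Start from the minibatch mean repeated for every example ...
    clamped = np.broadcast_to(z.mean(axis=0), z.shape).copy()
    # ... and restore the per-example values of the selected latent only.
    clamped[:, active] = z[:, active]
    return clamped


rng = np.random.default_rng(0)
z = rng.normal(size=(8, 4))            # stand-in for encoder outputs
z_clamped = clamp_latents(z, active=2)
```

After clamping, only the selected latent variable differs across the minibatch, so it is the only one that can explain the variation seen by the decoder.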
Unsupervised methods learn disentangled representations directly from raw data without using knowledge about the ground truth factors of variation. Desjardins et al. desjardins2012disentangling made an early attempt at unsupervised disentanglement learning by using a higher-order spike-and-slab RBM with block-sparse connectivity to model the multiplicative interactions between (unknown) factors of variation. Despite some success on the Toronto Face dataset, this method has two main drawbacks that make it impractical: its modeling complexity and its oversimplified assumption of multiplicative interactions between factors. Current state-of-the-art unsupervised methods are based on powerful deep generative models such as GAN chen2016infogan or VAE chen2018isolating ; higgins2017beta ; kim2018disentangling ; kumar2017variational . They show promising disentanglement results on many real datasets and are scalable. The key idea behind these methods is learning independent yet informative representations.
Semi-supervised methods have also been proposed. Kingma et al. kingma2014semi proposed two variants of VAE to solve the semi-supervised learning problem. One variant, denoted M1, adds a classifier on top of the latent representation to predict the label. The other variant, denoted M2, assumes a generative model in which the observation depends on both the label and a latent variable, with separate inference networks for the label and the latent variable. The M2 model is able to separate style from content using a very small amount of labeled data (about 15% of the total data). Siddharth et al. siddharth2017learning replace the variational objective of the M2 model with the importance-weighted loss burda2015importance so that the inference model can have arbitrary conditional dependencies between the label and the latent variable, instead of just the decomposition used in kingma2014semi . However, this does not lead to any significant change in the model architecture.
a.4 Evaluating independence with correlation matrix
For every sampled from the training data, we generated latent samples and built a correlation matrix from these samples for each of the models FactorVAE, VAE and AAE. We also built another version of the correlation matrix which is based on the (called the conditional means) instead of samples from . Both are shown in Fig. 7. We can see that the correlation matrices computed based on the conditional means incorrectly describe the independence between representations of FactorVAE and VAE. AAE is not affected because it learns deterministic given . Using the correlation matrix is not a principled way to evaluate independence in disentanglement learning.
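The gap between the two correlation matrices can be reproduced on a toy example. The sketch below does not use any of the trained models: it assumes a hypothetical stochastic encoder with two latent variables whose conditional means both depend weakly on a scalar input while the sampling noise is independent and of unit variance.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
x = rng.normal(size=n)                      # toy one-dimensional "data"

# Hypothetical stochastic encoder with two latent variables: both
# conditional means depend weakly on x, the noise is independent.
means = np.stack([0.1 * x, 0.1 * x], axis=1)
samples = means + rng.normal(size=(n, 2))   # one latent sample per x

corr_means = np.corrcoef(means, rowvar=False)      # from conditional means
corr_samples = np.corrcoef(samples, rowvar=False)  # from latent samples
print(corr_means[0, 1], corr_samples[0, 1])
```

In this toy setting the conditional means are perfectly correlated although the sampled latent variables are almost uncorrelated, so a matrix built from the means alone would misreport the dependence between the two representations.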
a.5 Tradeoff between informativeness, independence and the number of latent variables
Before starting our discussion, we provide the following fact:
Fact 2.
Assume we try to fill a fixed-size pool with fixed-size balls, given that all the balls must be inside the pool. The only way to increase the number of balls without making them overlap is to reduce their size.
In the context of representation learning, a pool is with size which depends on the training data. Balls are with size . Fact 2 reflects the situation of AAE (see Fig. 8 left). In AAE, all are deterministic given so the condition “all balls are inside the pool” is met. which is fixed so the condition “fixed-size balls” is also met. Therefore, when the number of latent variables in AAE increases, all must be less informative (i.e., must decrease) given that the independence constraint on the latent variables is still satisfied. This is empirically verified in Fig. 9, where the distribution of over all becomes narrower when we increase the number of representations from 65 to 200. Also note that increasing the number of latent variables from 65 to 100 does not change the distribution. This suggests that 65 or 100 latent variables are still not enough to capture all information in the data.
FactorVAE, however, handles the increasing number of latent variables in a different way. Thanks to the KL term in the loss function that forces to be stochastic, FactorVAE can break the constraint in Fact 2 and allow the balls to stay outside the pool (see Fig. 8 right). If we increase the number of latent variables but still enforce the independence constraint on them, FactorVAE will keep a fixed number of informative representations and make all other representations “noisy” with zero informativeness scores. We refer to this capability of FactorVAE as code compression.
a.6 Why can FactorVAE learn consistent representations?
Inspired by the variational information bottleneck theory alemi2016deep , we rewrite the standard VAE objective in an equivalent form as follows:
(13) 
where denotes the reconstruction loss over and is a scalar.
In the case of FactorVAE, since all latent representations are independent, we can decompose into . Thus, we argue that FactorVAE optimizes the following information bottleneck objective:
(14) 
We assume that represents a fixed condition on all . Because is a convex function of (see Appdx. A.7), minimizing Eq. 14 leads to unique solutions for all (Note that we do not count permutation invariance among here).
To make a fixed condition on all , we can further optimize with sampled from a fixed distribution like . This suggests that we can add a GAN objective to the original FactorVAE objective to achieve more consistent representations.
a.7 is a convex function of
Let us first start with the definition of a convex function and some of its known properties.
Definition 3.
Let be a set in the real vector space and be a function that outputs a scalar. is convex if and , we have:
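For reference, the standard form of this inequality is given below. The symbol names are assumptions made here for readability: $f$ denotes the function, $\mathcal{X}$ its (convex) domain, $x_1, x_2$ two points of the domain and $\lambda$ a mixing weight.

```latex
\[
  f\big(\lambda x_1 + (1 - \lambda)\, x_2\big)
  \;\le\;
  \lambda f(x_1) + (1 - \lambda)\, f(x_2),
  \qquad \forall\, x_1, x_2 \in \mathcal{X},\ \forall\, \lambda \in [0, 1].
\]
```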
Proposition 4.
A twice differentiable function is convex on an interval if and only if its second derivative is nonnegative there.
Proposition 5 (Jensen’s inequality).
Let be real numbers and let be positive weights on such that . If is a convex function on the domain of , then
Equality holds if and only if all are equal or is a linear function.
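As a quick numerical sanity check of Proposition 5, the sketch below evaluates the gap $\sum_i a_i f(x_i) - f(\sum_i a_i x_i)$, which Jensen's inequality states is nonnegative for a convex $f$. The symbols $f$, $x_i$, $a_i$ and the helper `jensen_gap` are illustrative names assumed here, and $f(t) = t \log t$ is chosen only as a familiar convex function on $t > 0$.

```python
import numpy as np


def jensen_gap(f, x, a):
    """Return sum_i a_i f(x_i) - f(sum_i a_i x_i) for weights a summing to 1."""
    x = np.asarray(x, dtype=float)
    a = np.asarray(a, dtype=float)
    return float(np.dot(a, f(x)) - f(np.dot(a, x)))


def t_log_t(t):
    return t * np.log(t)        # convex on t > 0 (second derivative is 1 / t)


x = [0.5, 1.0, 4.0]
a = [0.2, 0.3, 0.5]             # positive weights that sum to 1

gap_convex = jensen_gap(t_log_t, x, a)                       # distinct points
gap_equal = jensen_gap(t_log_t, [2.0, 2.0, 2.0], a)          # all x_i equal
gap_linear = jensen_gap(lambda t: 3.0 * t + 1.0, x, a)       # linear f
print(gap_convex, gap_equal, gap_linear)
```

The first gap is strictly positive, while the other two vanish up to floating-point error, matching the two equality cases stated in the proposition.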
Proposition 6 (Log-sum inequality).
Let and be nonnegative numbers. Denote and