1 Introduction
In representation learning it is often assumed that real-world observations $\mathbf{x}$ (e.g. images or videos) are generated by a two-step generative process. First, a multivariate latent random variable $\mathbf{z}$ is sampled from a distribution $P(\mathbf{z})$. Intuitively, $\mathbf{z}$ corresponds to semantically meaningful factors of variation of the observations (e.g. content and position of objects in an image). Then, in a second step, the observation $\mathbf{x}$ is sampled from the conditional distribution $P(\mathbf{x}|\mathbf{z})$. The key idea behind this model is that the high-dimensional data $\mathbf{x}$ can be explained by the substantially lower-dimensional and semantically meaningful latent variable $\mathbf{z}$, which is mapped to the higher-dimensional space of observations $\mathbf{x}$. Informally, the goal of representation learning is to find useful transformations $r(\mathbf{x})$ of $\mathbf{x}$ that “make it easier to extract useful information when building classifiers or other predictors” [4].

A recent line of work has argued that disentangled representations are an important step towards better representation learning. While there is not (yet) a single, widely accepted formal notion of disentanglement, the key intuition is that a disentangled representation should separate the distinct, informative factors of variations in the data [4]. A change in a single underlying factor of variation $z_i$ should lead to a change in a single factor in the learned representation $r(\mathbf{x})$. This assumption can be extended to groups of factors as e.g. in Bouchacourt et al. [5] or Suter et al. [49]. Based on this idea, a variety of disentanglement evaluation protocols have been proposed that leverage the statistical relations between the learned representation and the ground-truth factors of variation. Disentanglement is then measured as a particular structural property of these relations [19, 27, 15, 30, 7, 44].
State-of-the-art approaches for unsupervised disentanglement learning are based on Variational Autoencoders (VAEs) [28]: One assumes a specific prior $P(\mathbf{z})$ on the latent space and then uses a deep neural network to parameterize the conditional probability $P(\mathbf{x}|\mathbf{z})$. Similarly, the distribution $P(\mathbf{z}|\mathbf{x})$ is approximated using a variational distribution $Q(\mathbf{z}|\mathbf{x})$, again parametrized using a deep neural network. The model is then trained by minimizing a suitable approximation to the negative log-likelihood. The representation $r(\mathbf{x})$ is defined as the mean of the approximate posterior distribution $Q(\mathbf{z}|\mathbf{x})$. The usual choices for the prior and the variational distribution are multivariate Gaussians. Several variations of VAEs were proposed with the motivation that they lead to better disentanglement [19, 6, 27, 7, 30, 45]. The common theme behind all these approaches is that they try to enforce a factorized aggregated posterior $\int_{\mathbf{x}} Q(\mathbf{z}|\mathbf{x}) P(\mathbf{x})\, d\mathbf{x}$, which should encourage disentanglement.
Our contributions.
The original motivation of this work was to provide a neutral large-scale study that benchmarks different unsupervised disentanglement methods and metrics on a wide set of data sets in a fair, reproducible experimental setup. However, the empirical evidence led us to instead challenge many commonly held assumptions in this field. Our key contributions can be summarized as follows:

We theoretically prove that (perhaps unsurprisingly) the unsupervised learning of disentangled representations is fundamentally impossible without inductive biases both on the considered learning approaches and the data sets.

We investigate current approaches and their inductive biases in a reproducible large-scale experimental study with a sound experimental protocol for unsupervised disentanglement learning (reproducing these experiments requires approximately 2.54 GPU years on NVIDIA P100). We implement from scratch six recent unsupervised disentanglement learning methods as well as six disentanglement measures and train more than 12,000 models on seven data sets.

We evaluate our experimental results and challenge many common assumptions in unsupervised disentanglement learning: (i) While all considered methods prove effective at ensuring that the individual dimensions of the aggregated posterior (which is sampled) are not correlated, only one method also consistently ensures that the individual dimensions of the representation (which is taken to be the mean) are not correlated. (ii) We do not find any evidence that the considered models can be used to reliably learn disentangled representations in an unsupervised manner, as random seeds and hyperparameters seem to matter more than the model choice. Furthermore, good trained models seemingly cannot be identified without access to ground-truth labels even if we are allowed to transfer good hyperparameter values across data sets. (iii) For the considered models and data sets, we cannot validate the assumption that disentanglement is useful for downstream tasks, for example through a decreased sample complexity of learning.

Based on this empirical evidence, we suggest three critical areas of further research: (i) The role of inductive biases and of implicit and explicit supervision should be made explicit: unsupervised model selection persists as a key question. (ii) The concrete practical benefits of enforcing a specific notion of disentanglement of the learned representations should be demonstrated. (iii) Experiments should be conducted in a reproducible experimental setup on data sets of varying degrees of difficulty.
Roadmap.
We begin by briefly reviewing the motivations for disentanglement together with the current state-of-the-art methods and metrics in Section 2. In Section 3 we present and discuss the theoretical impossibility of unsupervised learning of disentangled representations. In Section 4 we discuss our experimental design with the research questions, our guiding principles, a detailed discussion of the limitations of this study and the differences with existing implementations. In Section 5, we present the empirical results. Finally, in Section 6 we study the implications of this work.
2 A review of (unsupervised) disentanglement learning
We first provide an overview of the key concepts and definitions, followed by quantitative evaluation metrics, and briefly review the proposed models based on VAEs and other related work.
2.1 What is disentanglement and why is it a desirable property?
Disentanglement is an abstract notion that has been brought back to the spotlight in the seminal paper of Bengio et al. [4]. It has been argued that an artificial intelligence able to understand and reason about the world should be able to identify and disentangle the explanatory factors of variations hidden in the data [4, 41, 33, 3, 46, 31]. A common modeling assumption is that we observe data from a high-dimensional random variable $\mathbf{x}$ generated by an underlying generative process which can be described by a low-dimensional set of factors of variations. These factors can be fully observable, partially observable or not observable at all. We focus our study on this last setting as it has recently gathered increased interest in the community.

There are many suggestions in the literature why disentangled representations could be useful: First of all, we could use them directly to build a predictive model instead of relying on the high-dimensional $\mathbf{x}$. Indeed, they should contain all the information present in $\mathbf{x}$ in a compact and interpretable structure [4, 29, 8]. Furthermore, they are independent from the task at hand [17, 34]. Therefore, they would be useful for (semi-)supervised learning of downstream tasks, transfer and few-shot learning [4, 47, 41]. Moreover, one can use them to integrate out nuisance factors [30]. Last but not least, they enable interventions (i.e. to conditionally sample) and answer counterfactual questions (i.e. what would this sample look like had a factor been different?) [41, 39].
2.2 What are proposed definitions for disentanglement?
While there is a comprehensive list of requirements for a disentangled representation, there does not yet exist a single, commonly accepted definition. Instead, the community has proposed a variety of concrete metrics, each of which aims to formalize and measure the notion in a slightly different way. In general, all these metrics assume access to the ground-truth factors of variations, if not to the full ground-truth generative model. In the following, we briefly review the metrics that we consider in our study.

BetaVAE metric. Higgins et al. [19] suggest fixing a randomly chosen factor of variation in the underlying generative model and sampling two mini-batches of observations $\mathbf{x}$. Disentanglement is then measured as the accuracy of a linear classifier that predicts the index of the fixed factor based on the coordinate-wise sum of absolute differences between the representation vectors in the two mini-batches.

Mutual Information Gap. Chen et al. [7] argue that the BetaVAE metric and the FactorVAE metric are neither general nor unbiased as they depend on some hyperparameters. They compute the pairwise mutual information between each ground-truth factor and each factor in the computed representation $r(\mathbf{x})$. For each ground-truth factor $z_k$, they then consider the two factors in $r(\mathbf{x})$ that have the highest mutual information with $z_k$. The Mutual Information Gap (MIG) is then defined as the average, normalized difference between the highest and the second-highest mutual information.

Modularity and Explicitness. Ridgeway and Mozer [44] argue that two different properties of representations should be considered, i.e., Modularity and Explicitness. In a modular representation each dimension of $r(\mathbf{x})$ depends on at most a single factor of variation. In an explicit representation, the value of a factor of variation is easily predictable (i.e. with a linear model) from $r(\mathbf{x})$. They propose to measure the Modularity as the average normalized squared difference of the mutual information of the factor of variations with the highest and second-highest mutual information with a dimension of $r(\mathbf{x})$. They measure Explicitness as the ROC-AUC of a one-versus-rest logistic regression classifier trained to predict the factors of variation. In this study, we focus on Modularity as it is the property that corresponds to disentanglement.

Disentanglement, Completeness and Informativeness. Eastwood and Williams [15] consider three properties of representations, i.e., Disentanglement, Completeness and Informativeness. First, they compute the importance of each dimension of the learned representation for predicting a factor of variation. The predictive importance of the dimensions of $r(\mathbf{x})$ can be computed with a Lasso or a Random Forest classifier. Disentanglement is the average of the difference from one of the entropy of the probability that a dimension of the learned representation is useful for predicting a factor, weighted by the relative importance of each dimension. Completeness is the average of the difference from one of the entropy of the probability that a factor of variation is captured by a dimension of the learned representation. Finally, Informativeness can be computed as the prediction error of predicting the factors of variations. In this paper, we mainly consider Disentanglement, which we subsequently call “DCI Disentanglement” for clarity.

SAP score. Kumar et al. [30] propose to compute the $R^2$ score of the linear regression predicting the factor values from each dimension of the learned representation. For discrete factors, they propose to train a classifier. The Separated Attribute Predictability (SAP) score is the average difference of the prediction error of the two most predictive latent dimensions for each factor.
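As a rough illustration of how a score of this kind can be computed, the following sketch implements the Mutual Information Gap described above for discrete ground-truth factors. It is a minimal sketch under stated assumptions, not the implementation used in this study: the function names, the equal-width binning of the representation and the default of 20 bins are all hypothetical choices made for the example.

```python
import numpy as np


def discrete_entropy(a):
    """Shannon entropy (in nats) of an integer-coded 1-D array."""
    _, counts = np.unique(a, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))


def discrete_mutual_info(a, b):
    """Mutual information (in nats) between two integer-coded 1-D arrays."""
    _, a_idx = np.unique(a, return_inverse=True)
    _, b_idx = np.unique(b, return_inverse=True)
    joint = np.zeros((a_idx.max() + 1, b_idx.max() + 1))
    np.add.at(joint, (a_idx, b_idx), 1.0)  # contingency table of co-occurrences
    joint /= joint.sum()
    outer = joint.sum(axis=1, keepdims=True) * joint.sum(axis=0, keepdims=True)
    nz = joint > 0  # skip empty cells, which contribute zero
    return float(np.sum(joint[nz] * np.log(joint[nz] / outer[nz])))


def mutual_information_gap(codes, factors, num_bins=20):
    """codes: (n, d) real-valued representation with d >= 2.
    factors: (n, k) integer-coded ground-truth factors."""
    n, d = codes.shape
    binned = np.empty((n, d), dtype=int)
    for j in range(d):
        # Assumption: equal-width bins per code dimension.
        edges = np.histogram_bin_edges(codes[:, j], bins=num_bins)
        binned[:, j] = np.digitize(codes[:, j], edges[:-1])
    gaps = []
    for k in range(factors.shape[1]):
        mi = np.array([discrete_mutual_info(binned[:, j], factors[:, k])
                       for j in range(d)])
        top = np.sort(mi)[::-1]
        # Gap between the two most informative code dimensions,
        # normalized by the entropy of the factor.
        gaps.append((top[0] - top[1]) / discrete_entropy(factors[:, k]))
    return float(np.mean(gaps))
```

On a toy example where every code dimension copies exactly one factor, this score is close to one; if every code dimension mixes two factors symmetrically, it is close to zero.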
2.3 Unsupervised learning of disentangled representations with VAEs
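Variants of variational autoencoders [28] are considered the state of the art for unsupervised disentanglement learning. One assumes a specific prior $p(\mathbf{z})$ on the latent space and then parameterizes the conditional probability $p_\theta(\mathbf{x}|\mathbf{z})$ with a deep neural network. Similarly, the distribution $p(\mathbf{z}|\mathbf{x})$ is approximated using a variational distribution $q_\phi(\mathbf{z}|\mathbf{x})$, again parametrized using a deep neural network. One can then derive the following approximation to the maximum likelihood objective,
$$\max_{\phi, \theta}\; \mathbb{E}_{p(\mathbf{x})}\left[\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[\log p_\theta(\mathbf{x}|\mathbf{z})\right] - D_{\mathrm{KL}}\left(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})\right)\right] \tag{1}$$
which is also known as the evidence lower bound (ELBO). By carefully considering the KL term, one can encourage various properties of the resulting representation. We will briefly review the main approaches.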
Bottleneck capacity.
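Higgins et al. [19] propose the $\beta$-VAE, introducing a hyperparameter $\beta$ in front of the KL regularizer of vanilla VAEs. They maximize the following expression:
$$\mathbb{E}_{p(\mathbf{x})}\left[\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[\log p_\theta(\mathbf{x}|\mathbf{z})\right] - \beta\, D_{\mathrm{KL}}\left(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})\right)\right]$$
By setting $\beta > 1$, the encoder distribution will be forced to better match the factorized unit Gaussian prior. This procedure introduces additional constraints on the capacity of the latent bottleneck, encouraging the encoder to learn a disentangled representation for the data. Burgess et al. [6] argue that when the bottleneck has limited capacity, the network will be forced to specialize on the factors of variation that contribute most to a small reconstruction error. Therefore, they propose to progressively increase the bottleneck capacity, so that the encoder can focus on learning one factor of variation at a time:
$$\mathbb{E}_{p(\mathbf{x})}\left[\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[\log p_\theta(\mathbf{x}|\mathbf{z})\right] - \gamma \left| D_{\mathrm{KL}}\left(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})\right) - C \right|\right]$$
where $C$ is annealed from zero to some value which is large enough to produce good reconstructions. In the following, we refer to this model as AnnealedVAE.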
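To make the role of the KL regularizer concrete, the following sketch evaluates both bottleneck objectives on a batch, given the outputs of an encoder and a decoder. It is a minimal NumPy sketch assuming a diagonal Gaussian encoder, a standard normal prior and a Bernoulli decoder parametrized by logits; the function names and argument layout are illustrative and are not taken from any of the cited implementations.

```python
import numpy as np


def gaussian_kl_to_standard_normal(mean, log_var):
    """KL(N(mean, diag(exp(log_var))) || N(0, I)) per example, shape (n,)."""
    return 0.5 * np.sum(np.exp(log_var) + mean ** 2 - 1.0 - log_var, axis=1)


def bernoulli_log_likelihood(x, logits):
    """log p(x | z) of a Bernoulli decoder, summed over pixels, shape (n,)."""
    # Numerically stable form of x*log(sigmoid(l)) + (1-x)*log(1-sigmoid(l)).
    return np.sum(x * logits - np.logaddexp(0.0, logits), axis=1)


def beta_vae_objective(x, logits, mean, log_var, beta=1.0):
    """Batch average of E_q[log p(x|z)] - beta * KL; beta=1 gives the ELBO.
    The reconstruction term uses a single decoded sample per example."""
    rec = bernoulli_log_likelihood(x, logits)
    kl = gaussian_kl_to_standard_normal(mean, log_var)
    return float(np.mean(rec - beta * kl))


def annealed_vae_objective(x, logits, mean, log_var, gamma, capacity):
    """Batch average of E_q[log p(x|z)] - gamma * |KL - C|."""
    rec = bernoulli_log_likelihood(x, logits)
    kl = gaussian_kl_to_standard_normal(mean, log_var)
    return float(np.mean(rec - gamma * np.abs(kl - capacity)))
```

In a training loop one would maximize these quantities (or minimize their negatives) with respect to the encoder and decoder parameters; for the annealed variant, `capacity` would be increased over the course of training according to some schedule.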
Penalizing the total correlation.
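Let $I(\mathbf{x}; \mathbf{z})$ denote the mutual information between $\mathbf{x}$ and $\mathbf{z}$ and note that the KL term in equation 1 can be rewritten as
$$\mathbb{E}_{p(\mathbf{x})}\left[D_{\mathrm{KL}}\left(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})\right)\right] = I(\mathbf{x}; \mathbf{z}) + D_{\mathrm{KL}}\left(q(\mathbf{z}) \,\|\, p(\mathbf{z})\right).$$
Therefore, when $\beta > 1$, $\beta$-VAE penalizes the mutual information between the latent representation and the data, thus constraining the capacity of the latent space. Furthermore, it pushes $q(\mathbf{z})$, the so-called aggregated posterior, to match the prior and therefore to factorize, given a factorized prior. Kim and Mnih [27] argue that penalizing $I(\mathbf{x}; \mathbf{z})$ is neither necessary nor desirable for disentanglement. Therefore, they propose the FactorVAE, which augments the VAE objective with an additional regularizer that specifically penalizes dependencies between the dimensions of the representation:
$$\mathbb{E}_{p(\mathbf{x})}\left[\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[\log p_\theta(\mathbf{x}|\mathbf{z})\right] - D_{\mathrm{KL}}\left(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})\right)\right] - \gamma\, D_{\mathrm{KL}}\Big(q(\mathbf{z}) \,\Big\|\, \prod_{j=1}^{d} q(z_j)\Big).$$
This last term is also known as total correlation [51]. While this term is intractable and vanilla Monte Carlo approximations require marginalization over the training set, it can be optimized using the density ratio trick [38, 48]. Samples from $\prod_{j=1}^{d} q(z_j)$ can be obtained by shuffling samples from $q(\mathbf{z})$ [1]. Concurrently, Chen et al. [7] propose the $\beta$-TCVAE. As opposed to FactorVAE, they propose a tractable biased Monte Carlo estimate for the total correlation.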
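The shuffling step mentioned above can be illustrated in a few lines. The sketch below permutes each latent dimension independently across a batch, which keeps every one-dimensional marginal intact while breaking the dependencies between dimensions. The function name `permute_dims` and the toy data are assumptions made for this example; the sketch does not include the discriminator that the density ratio trick additionally requires.

```python
import numpy as np


def permute_dims(z, rng):
    """Shuffle every latent dimension independently across the batch.

    z: (n, d) array of samples from the aggregated posterior.
    Returns an (n, d) array whose columns have the same empirical marginals
    as z but are (approximately) independent of each other.
    """
    n, d = z.shape
    out = np.empty_like(z)
    for j in range(d):
        out[:, j] = z[rng.permutation(n), j]
    return out


# Toy usage: two strongly dependent dimensions become nearly uncorrelated.
rng = np.random.default_rng(0)
base = rng.normal(size=(5000, 1))
z = np.hstack([base, base + 0.1 * rng.normal(size=(5000, 1))])
z_perm = permute_dims(z, rng)
corr_before = np.corrcoef(z, rowvar=False)[0, 1]
corr_after = np.corrcoef(z_perm, rowvar=False)[0, 1]
```

A classifier trained to distinguish rows of `z` from rows of `z_perm` then yields an estimate of the density ratio between the aggregated posterior and the product of its marginals.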
Disentangled priors.
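Kumar et al. [30] argue that a disentangled generative model requires a disentangled prior. This approach is related to the total correlation penalty, but now the aggregated posterior is pushed to match a factorized prior. Therefore, they consider
$$\mathbb{E}_{p(\mathbf{x})}\left[\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[\log p_\theta(\mathbf{x}|\mathbf{z})\right] - D_{\mathrm{KL}}\left(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})\right)\right] - \lambda\, D\left(q(\mathbf{z}) \,\|\, p(\mathbf{z})\right),$$
where $D$ is some (arbitrary) divergence. Since this term is intractable when $D$ is the KL divergence, they propose to match the moments of these distributions. In particular, they regularize the deviation of either $\mathrm{Cov}_{p(\mathbf{x})}[\mu_\phi(\mathbf{x})]$ or $\mathrm{Cov}_{q_\phi}[\mathbf{z}]$ from the identity matrix in the two variants of the DIP-VAE. This results in maximizing either the DIP-VAE-I objective
$$\mathbb{E}_{p(\mathbf{x})}\left[\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[\log p_\theta(\mathbf{x}|\mathbf{z})\right] - D_{\mathrm{KL}}\left(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})\right)\right] - \lambda_{od} \sum_{i \neq j} \left[\mathrm{Cov}_{p(\mathbf{x})}[\mu_\phi(\mathbf{x})]\right]_{ij}^2 - \lambda_{d} \sum_{i} \left(\left[\mathrm{Cov}_{p(\mathbf{x})}[\mu_\phi(\mathbf{x})]\right]_{ii} - 1\right)^2$$
or the DIP-VAE-II objective
$$\mathbb{E}_{p(\mathbf{x})}\left[\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[\log p_\theta(\mathbf{x}|\mathbf{z})\right] - D_{\mathrm{KL}}\left(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})\right)\right] - \lambda_{od} \sum_{i \neq j} \left[\mathrm{Cov}_{q_\phi}[\mathbf{z}]\right]_{ij}^2 - \lambda_{d} \sum_{i} \left(\left[\mathrm{Cov}_{q_\phi}[\mathbf{z}]\right]_{ii} - 1\right)^2.$$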
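The moment-matching idea behind the two DIP-VAE variants can be sketched as a penalty on a covariance matrix: squared off-diagonal entries are penalized with one weight, and squared deviations of the diagonal from one with another. The code below is a hedged illustration assuming a diagonal Gaussian encoder; the weight names `lambda_od` and `lambda_d` and all function names are choices made for this example, and the covariance of the sampled representation is obtained from the law of total covariance rather than from drawn samples.

```python
import numpy as np


def identity_deviation_penalty(cov, lambda_od, lambda_d):
    """Penalize off-diagonal entries and the deviation of the diagonal from 1."""
    diag = np.diag(cov)
    off_diag = cov - np.diag(diag)
    return float(lambda_od * np.sum(off_diag ** 2)
                 + lambda_d * np.sum((diag - 1.0) ** 2))


def dip_vae_i_penalty(mean, lambda_od, lambda_d):
    """Regularize the covariance of the encoder means over the batch.

    mean: (n, d) encoder means with d >= 2.
    """
    cov_mean = np.cov(mean, rowvar=False, bias=True)
    return identity_deviation_penalty(cov_mean, lambda_od, lambda_d)


def dip_vae_ii_penalty(mean, log_var, lambda_od, lambda_d):
    """Regularize the covariance of the sampled representation.

    Law of total covariance for a diagonal Gaussian encoder:
    Cov[z] = E_x[diag(sigma^2(x))] + Cov_x[mu(x)].
    """
    cov_mean = np.cov(mean, rowvar=False, bias=True)
    cov_z = np.diag(np.mean(np.exp(log_var), axis=0)) + cov_mean
    return identity_deviation_penalty(cov_z, lambda_od, lambda_d)
```

Either penalty would be subtracted from the ELBO of a batch. Note that the two variants can disagree: encoder means with identity covariance incur no DIP-VAE-I penalty, yet the same encoder with unit variances incurs a DIP-VAE-II penalty because the sampled representation then has covariance $2I$.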
2.4 Other related work
In a similar spirit to disentanglement, (non-)linear independent component analysis [12, 2, 25, 22] studies the problem of recovering independent components of a signal. The underlying assumption is that there is a generative model for the signal composed of the combination of statistically independent non-Gaussian components. While the identifiability result for linear ICA [12] proved to be a milestone for the classical theory of factor analysis, similar results are in general not obtainable for the nonlinear case and the underlying sources generating the data cannot be identified [23]. The lack of almost any identifiability result in nonlinear ICA has been a main bottleneck for the utility of the approach [24] and partially motivated alternative machine learning approaches [14, 46, 11]. Given that unsupervised algorithms did not initially perform well on realistic settings, most of the other works have considered some more or less explicit form of supervision [42, 53, 29, 26, 9, 35, 37, 55, 49]. [20, 10] assume some knowledge of the effect of the factors of variations even though they are not observed. One can also exploit known relations between factors in different samples [18, 52, 16, 13, 21, 54]. This is not a limiting assumption, especially in sequential data, e.g., for videos.
3 Impossibility of unsupervised disentanglement learning
The first question that we investigate is whether unsupervised disentanglement learning is even possible for arbitrary generative models. Theorem 1 essentially shows that without inductive biases both on models and data sets the task is fundamentally impossible. The proof is provided in Appendix A.
Theorem 1.
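For $d > 1$, let $(\mathbf{z}, \mathbf{x}) \sim P$ denote any generative model which admits a density $p(\mathbf{z}, \mathbf{x}) = p(\mathbf{z})\, p(\mathbf{x}|\mathbf{z})$ with $p(\mathbf{z}) = \prod_{i=1}^{d} p(z_i)$, where $\mathbf{z} \in \mathbb{R}^d$ denotes the independent latent variables and $\mathbf{x}$ the data observations. Then, there exists an infinite family of bijective functions $f: \mathrm{supp}(\mathbf{z}) \to \mathrm{supp}(\mathbf{z})$ such that $P(\mathbf{z} \leq \mathbf{u}) = P(f(\mathbf{z}) \leq \mathbf{u})$ for all $\mathbf{u} \in \mathrm{supp}(\mathbf{z})$ and $\frac{\partial f_i(\mathbf{u})}{\partial u_j} \neq 0$ almost everywhere for all $i$ and $j$.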
Consider the commonly used “intuitive” notion of disentanglement which advocates that a change in a single ground-truth factor should lead to a single change in the representation. In that setting, Theorem 1 implies that unsupervised disentanglement learning is impossible for arbitrary generative models with a factorized prior (Theorem 1 only applies to factorized priors; however, we expect that a similar result can be extended to non-factorizing priors) in the following sense: Consider any unsupervised disentanglement method and assume that it finds a representation $r(\mathbf{x})$ that is perfectly disentangled with respect to $\mathbf{z}$ in a generative model with a factorizing prior on $\mathbf{z}$. Then, Theorem 1 implies that there is an equivalent generative model with the latent variable $\hat{\mathbf{z}} = f(\mathbf{z})$ where $\hat{\mathbf{z}}$ is completely entangled with respect to $\mathbf{z}$ and thus also $r(\mathbf{x})$. Furthermore, since $f$ is deterministic, both generative models have the same marginal distribution of the observations $\mathbf{x}$ by construction. Since the (unsupervised) disentanglement method only has access to observations $\mathbf{x}$, it hence cannot distinguish between the two equivalent generative models and thus has to be entangled with respect to at least one of them.
The intuition behind this result may be seen from a generative model where consists of two independent standard Gaussian random variables. Consider an alternative latent variable obtained by rotating by . By definition, corresponds to two independent standard Gaussian random variables that are fully entangled with the original latent variable . Any disentanglement method can only be disentangled with respect to one of them and has to be entangled with respect to the other. We note that this result is obvious for multivariate Gaussians; however, perhaps surprisingly, Theorem 1
also holds for distributions which are not invariant to rotation, for example multivariate uniform distributions.
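The Gaussian example can be checked numerically. The following sketch draws two independent standard Gaussian latents, rotates them, and verifies that the rotated variable has (up to sampling noise) the same identity covariance while every rotated coordinate depends on both original coordinates. The rotation angle of $\pi/4$, the sample size and the function names are assumptions chosen for this illustration.

```python
import numpy as np


def rotation_matrix(angle):
    """2-D rotation matrix for the given angle in radians."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]])


def rotate_latents(z, angle):
    """Apply the rotation to every row of an (n, 2) array of latents."""
    return z @ rotation_matrix(angle).T


rng = np.random.default_rng(0)
z = rng.standard_normal((100_000, 2))  # two independent standard Gaussians
angle = np.pi / 4                      # assumed angle, for illustration only
z_hat = rotate_latents(z, angle)

# Same distribution: the rotated latents are again (approximately) N(0, I).
cov_hat = np.cov(z_hat, rowvar=False)

# Fully entangled: the Jacobian of z_hat with respect to z is the rotation
# matrix itself, and none of its entries is zero.
jacobian = rotation_matrix(angle)
```

Because a decoder can absorb the inverse rotation, a generative model built on `z_hat` produces exactly the same observations as one built on `z`, so no method that only sees observations can tell which of the two latent parametrizations is the "true" one.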
Theorem 1 may not be surprising to readers familiar with the causality literature as it is consistent with the following argument: After observing $\mathbf{x}$, we can construct infinitely many generative models which have the same marginal distribution of $\mathbf{x}$. Any one of these models could be the true causal generative model for the data and, by Peters et al. [41], the right model cannot be identified by only observing the distribution of $\mathbf{x}$. Furthermore, similar results have been obtained in the context of nonlinear ICA [23].
While Theorem 1 shows that unsupervised disentanglement learning is fundamentally impossible for arbitrary generative models, this does not necessarily mean it is an impossible endeavour in practice. After all, real-world generative models may have a certain structure that could be exploited through suitably chosen inductive biases. However, Theorem 1 clearly shows that inductive biases are required both for the models (so that we find a specific set of solutions) and for the data sets (such that these solutions match the true generative model). We hence argue that the role of inductive biases should be made explicit and investigated further, as done in the following large-scale experimental study.
4 Experimental design
Research questions.
Theorem 1 opens several questions on the role of inductive biases in the empirical performance of state-of-the-art models.

Are current methods effective at enforcing a factorizing aggregated posterior and representation?

How much do existing disentanglement metrics agree?

How important are different models and hyperparameters for disentanglement?

Are there reliable recipes for hyperparameter selection?

Are these disentangled representations useful for downstream tasks in terms of the sample complexity of learning?
Experimental conditions and guiding principles.
In our study, we seek controlled, fair and reproducible experimental conditions. We consider the case in which we can sample from a well-defined and known ground-truth generative model by first sampling the factors of variations $\mathbf{z}$ from a distribution $P(\mathbf{z})$ and then sampling an observation $\mathbf{x}$ from $P(\mathbf{x}|\mathbf{z})$. Our experimental protocol works as follows: During training, we only observe the samples of $\mathbf{x}$ obtained by marginalizing $P(\mathbf{x}|\mathbf{z})$ over $P(\mathbf{z})$. After training, we obtain a representation $r(\mathbf{x})$ by either taking a sample from the probabilistic encoder $Q(\mathbf{z}|\mathbf{x})$ or by taking its mean. Typically, disentanglement metrics consider the latter as the representation $r(\mathbf{x})$. During the evaluation, we assume access to the whole generative model, i.e. we can draw samples from both $P(\mathbf{z})$ and $P(\mathbf{x}|\mathbf{z})$. In this way, we can perform interventions on the latent factors as required by certain scores. We explicitly note that we effectively consider the statistical learning problem where we optimize the loss and the metrics on the known data generating distribution. As a result, we do not use separate train and test sets but always take i.i.d. samples from the known ground-truth distribution. This is justified as the statistical problem is well defined and it allows us to remove the additional complexity of dealing with overfitting and empirical risk minimization.
We consider four data sets in which $\mathbf{x}$ is obtained as a deterministic function of $\mathbf{z}$: dSprites [19], Cars3d [43], SmallNORB [32], Shapes3D [27]. We also introduce three data sets in which $\mathbf{x}$ is a stochastic function of $\mathbf{z}$ as there are additional continuous noise variables: Color-dSprites, Noisy-dSprites and Scream-dSprites. This means that for a fixed realization of the factors of variations $\mathbf{z}$, two samples from $P(\mathbf{x}|\mathbf{z})$ can be different. The key idea behind these three additional data sets is that they have exactly the same latent factors as dSprites but that they correspond to increasingly harder problems with more complex observations. In Color-dSprites, the shapes are colored with a random color. In Noisy-dSprites, we consider white-colored shapes on a noisy background. Finally, in Scream-dSprites the background is replaced with a random patch in a random color shade extracted from the famous The Scream painting [36]. The dSprites shape is embedded into the image by inverting the color of its pixels.
To fairly evaluate the different approaches, we separate the effect of the regularization from the other inductive biases (e.g., the choice of the neural architecture). Each method uses the same convolutional architecture, optimizer, hyperparameters of the optimizer and batch size. All methods use a Gaussian encoder where the mean and the log variance of each latent factor are parametrized by the deep net, a Bernoulli decoder and a latent dimension fixed to 10. We note that these are all standard choices in prior work [19, 27].
We choose six different regularization strengths, i.e., hyperparameter values, for each of the considered methods. The key idea was to take a wide enough set to ensure that there are useful hyperparameters for different settings for each method and not to focus on specific values known to work for specific data sets. However, the values are partially based on the ranges that are prescribed in the literature (including the hyperparameters suggested by the authors).
We run each model for each regularization strength for 50 different random seeds on each data set. We fix our experimental setup in advance and we run all the methods described in Section 2.3 and evaluate them on every metric of Section 2.2 on each data set. The full details on the experimental setup are provided in the Appendix and we plan to release code to train the considered models, the trained models and the evaluation pipeline for the sake of reproducibility and to establish a strong and fair baseline.
Limitations of our study.
While we aim to provide a useful and fair experimental study, there are clear limitations to the conclusions that can be drawn from it due to design choices that we have taken. In all these choices, we have aimed to capture what is considered the state-of-the-art inductive bias in the community.
On the data set side, we only consider images with a heavy focus on synthetic images. We do not explore other modalities and we only consider the toy scenario in which we have access to a data generative process with uniformly distributed factors of variations. Furthermore, all our data sets have a small number of independent discrete factors of variations without any confounding variables.
For the methods, we only consider the inductive bias of convolutional architectures. We do not test fully connected architectures or additional techniques such as skip connections. Furthermore, we do not explore different activation functions, reconstruction losses or different numbers of layers. We also do not vary any hyperparameters other than the regularization weight. In particular, we do not evaluate the role of different latent space sizes, optimizers and batch sizes. We do not test the sample efficiency of the metrics but simply set the size of the train and test set to large values.
Implementing the different disentanglement methods and metrics has proven to be a difficult endeavour. Few “official” open source implementations are available and there are many small details to consider. We take a best-effort approach to these implementations and implemented all the methods and metrics from scratch, as any sound machine learning practitioner might do based on the original papers. When our implementation choices differ from those of the original papers, we explicitly state and motivate them.
Figure 1: Reconstructions for different data sets and methods. Odd columns show real samples and even columns their reconstruction.
Differences with previous implementations.
As described above, we use a single choice of architecture, batch size and optimizer for all the methods, which might deviate from the settings considered in the original papers. However, we argue that unification of these choices is the only way to guarantee a fair comparison among the different methods such that valid conclusions may be drawn across methods. The largest change is that for DIP-VAE and for $\beta$-TCVAE we used a batch size of 64 instead of 400 and 2048 respectively. However, Chen et al. [7] show in Section H.2 of the Appendix that the bias in the mini-batch estimation of the total correlation does not significantly affect the performance of their model even with small batch sizes. For DIP-VAE-II, we did not implement the additional regularizer on the third-order central moments since no implementation details are provided and since this regularizer is only used on specific data sets.
Our implementations of the disentanglement metrics deviate from the implementations in the original papers as follows: First, we strictly enforce that all factors of variations are treated as discrete variables as this corresponds to the assumed ground-truth model in all our data sets. Hence, we used classification instead of regression for the SAP score and the disentanglement score of [15]. This is important as it does not make sense to use regression on true factors of variations that are discrete (e.g., shape on dSprites). Second, wherever possible, we resorted to using the default, well-tested scikit-learn [40] implementations instead of using custom implementations with potentially hard-to-set hyperparameters. Third, for the Mutual Information Gap [7], we estimate the discrete mutual information (as opposed to continuous) on the mean representation (as opposed to sampled) on a subset of the samples (as opposed to the whole data set). We argue that this is the correct choice as the mean is usually taken to be the representation. Hence, it would be wrong to consider the full Gaussian encoder or samples thereof as that would correspond to a different representation. Finally, we fix the number of sampled train and test points across all metrics to a large value to ensure robustness.
5 Experimental results
5.1 Can one achieve a good reconstruction error across data sets and models?
First, we check for each data set that we manage to train models that achieve reasonable reconstructions. To this end, for each data set we sample a random model and show real samples next to their reconstructions. The results are depicted in Figure 1. As expected, the additional variants of dSprites with continuous noise variables are harder than the original data set. On Noisy-dSprites and Color-dSprites the models produce reasonable reconstructions, with the noise on Noisy-dSprites being ignored. Scream-dSprites is even harder and we observe that the shape information is lost. On the other data sets, we observe that reconstructions are blurry but objects are distinguishable. SmallNORB seems to be the most challenging data set.
5.2 Can current methods enforce a factorizing aggregated posterior and representation?
We investigate whether the considered unsupervised disentanglement approaches are effective at enforcing a factorizing aggregated posterior. For each trained model, we sample images and compute a sample from the corresponding approximate posterior. We then fit a multivariate Gaussian distribution over these samples by computing the empirical mean and covariance matrix (for numerical stability, we add a small constant to the diagonal of the covariance matrix). Finally, we compute the total correlation of the fitted Gaussian and report the median value for each data set, method and hyperparameter value.

Figure 2 shows the total correlation of the sampled representation plotted against the regularization strength for each data set and method except AnnealedVAE. On all data sets except SmallNORB, we observe that plain vanilla variational autoencoders (i.e. the $\beta$-VAE model with $\beta = 1$) exhibit the highest total correlation. For $\beta$-VAE and $\beta$-TCVAE, it can be clearly seen that the total correlation of the sampled representation decreases on all data sets as the regularization strength (in the form of $\beta$) is increased. The two variants of DIP-VAE exhibit low total correlation across the data sets, except DIP-VAE-I, which incurs a slightly higher total correlation on SmallNORB compared to a vanilla VAE. Increased regularization in the DIP-VAE objective also seems to lead to a reduced total correlation, even if the effect is not as pronounced as for $\beta$-VAE and $\beta$-TCVAE. While FactorVAE achieves a low total correlation on all data sets except on SmallNORB, we observe that the total correlation does not seem to decrease with increasing regularization strength. This may seem surprising given that the FactorVAE objective aims to penalize the total correlation of the aggregated posterior. However, we expect this is due to the fact that the total correlation of the aggregated posterior is estimated using adversarial density estimation, which was shown to consistently underestimate the true TC, see Figure 7 of [27]. We further observe that AnnealedVAE (shown in Figure 18 in the Appendix) is much more sensitive to the regularization strength. However, on all data sets except Scream-dSprites (on which AnnealedVAE performs poorly), the total correlation seems to decrease with increased regularization strength.
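The total correlation of a fitted Gaussian used in this analysis has a closed form: for $\mathcal{N}(\mu, \Sigma)$ it equals $\frac{1}{2}\left(\sum_j \log \Sigma_{jj} - \log \det \Sigma\right)$, the KL divergence between the Gaussian and the product of its marginals. The sketch below implements this estimate. The function names and the jitter value of `1e-8` added to the diagonal are assumptions made for illustration, not values taken from the study.

```python
import numpy as np


def gaussian_total_correlation(cov):
    """Total correlation (nats) of a Gaussian with covariance matrix cov.

    Equals KL(N(m, cov) || prod_j N(m_j, cov_jj)) and is zero if and only if
    cov is diagonal.
    """
    _, logdet = np.linalg.slogdet(cov)
    return float(0.5 * (np.sum(np.log(np.diag(cov))) - logdet))


def total_correlation_of_samples(samples, jitter=1e-8):
    """Fit a Gaussian to (n, d) samples with d >= 2 and return its TC.

    jitter is a hypothetical small constant added to the diagonal of the
    empirical covariance for numerical stability.
    """
    d = samples.shape[1]
    cov = np.cov(samples, rowvar=False) + jitter * np.eye(d)
    return gaussian_total_correlation(cov)
```

The same function can be applied either to samples from the encoder or to the encoder means, which makes it possible to compare the sampled and the mean representation of a model on equal footing.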
While many of the considered methods aim to enforce a factorizing aggregated posterior, they use the mean vector of the Gaussian encoder as the representation and not a sample from the Gaussian encoder. This may seem like a minor, irrelevant modification; however, it is not clear whether a factorizing aggregated posterior also ensures a factorizing representation. To test whether this is true, we also compute the mean of the Gaussian encoder for the same samples, fit a multivariate Gaussian and compute the total correlation of that fitted Gaussian. Figure 3 shows the total correlation of the mean representation plotted against the regularization strength for each data set and method except AnnealedVAE. We observe that, for β-VAE and β-TCVAE, increased regularization leads to a substantially increased total correlation of the mean representations. This effect can also be observed for FactorVAE, albeit in a less extreme fashion. For DIP-VAE-I, we observe that the total correlation of the mean representation is consistently low. This is not surprising as the DIP-VAE-I objective directly optimizes the covariance matrix of the mean representation to be diagonal, which implies that the corresponding total correlation is low. The DIP-VAE-II objective, which enforces the covariance matrix of the sampled representation to be diagonal, seems to lead to a factorized mean representation on some data sets (for example Shapes3D and Cars3D), but also seems to fail on others (dSprites). For AnnealedVAE (shown in Figure 19), we overall observe mean representations with a very high total correlation. In Figure 4, we further plot the log total correlations of the sampled representations versus the mean representations for each of the trained models. It can be clearly seen that for a large number of models, the total correlation of the mean representations is much higher than that of the sampled representations. The only method that achieves a consistently low total correlation both for the mean and the sampled representation is, as expected, DIP-VAE-I.
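The Gaussian-fit procedure described above can be sketched in a few lines. For a multivariate Gaussian with covariance matrix Σ, the total correlation has the closed form ½(Σᵢ log Σᵢᵢ − log det Σ). The snippet below is a minimal illustration under stated assumptions, not the code used in the study: the function name, the synthetic data and the diagonal constant `eps` are all hypothetical.

```python
import numpy as np

def gaussian_total_correlation(samples, eps=1e-5):
    """Total correlation of a Gaussian fitted to `samples`.

    `samples` has one row per data point and one column per latent
    dimension (at least two columns). `eps` is an assumed stabilizing
    constant added to the diagonal of the covariance matrix.
    """
    cov = np.cov(samples, rowvar=False)
    cov = cov + eps * np.eye(cov.shape[0])
    _, logdet = np.linalg.slogdet(cov)
    # KL divergence between N(mu, cov) and the product of its marginals.
    return 0.5 * (np.sum(np.log(np.diag(cov))) - logdet)

rng = np.random.default_rng(0)
independent = rng.normal(size=(10000, 4))

# Mix the first dimension into the second to create correlated codes.
mixing = np.eye(4)
mixing[0, 1] = 0.9
correlated = independent @ mixing

tc_independent = gaussian_total_correlation(independent)
tc_correlated = gaussian_total_correlation(correlated)
```

Applying the same function once to samples from the encoder and once to the encoder means gives the two quantities compared in this section. Note that this score only measures linear dependence, since it is computed from the covariance matrix alone.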
Implications.
Overall, these results lead us to conclude that, with minor exceptions, the considered methods are effective at enforcing an aggregated posterior whose individual dimensions are not correlated. However, except for DIP-VAE-I, this does not seem to imply that the mean representation (which is usually used as the representation) also factorizes.
5.3 How much do existing disentanglement metrics agree?
As there exists no single, common definition of disentanglement, an interesting question is how much the different proposed metrics agree. Figure 5 shows pairwise scatter plots of the different considered metrics on dSprites, where each point corresponds to a trained model, while Figure 6 shows the Spearman rank correlation between different disentanglement metrics on different data sets. Overall, we observe that all metrics except Modularity seem to be strongly correlated on the data sets dSprites, Color-dSprites and Scream-dSprites and mildly correlated on the other data sets. There appear to be two pairs among these metrics that capture particularly similar notions: the BetaVAE and the FactorVAE score as well as the Mutual Information Gap and DCI Disentanglement.
Implication.
All disentanglement metrics except Modularity appear to be correlated. However, the level of correlation changes between different data sets.
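As an illustration of how such an agreement matrix can be computed, the sketch below builds a Spearman rank correlation matrix from per-model scores. The metric names and score values are hypothetical placeholders rather than results from the study, and the rank correlation is implemented directly with NumPy (SciPy's `scipy.stats.spearmanr` provides the same statistic).

```python
import numpy as np

def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman(a, b):
    """Spearman correlation = Pearson correlation of the ranks."""
    return float(np.corrcoef(average_ranks(a), average_ranks(b))[0, 1])

# Hypothetical scores: one entry per trained model, one list per metric.
scores = {
    "metric_a": [0.10, 0.35, 0.20, 0.80, 0.55],
    "metric_b": [0.05, 0.30, 0.25, 0.90, 0.60],
    "metric_c": [0.70, 0.10, 0.40, 0.20, 0.65],
}
names = list(scores)
rank_corr = np.array(
    [[spearman(scores[r], scores[c]) for c in names] for r in names]
)
```

Here `metric_a` and `metric_b` order the five models identically, so their rank correlation is 1 up to floating-point error, whereas `metric_c` orders them quite differently and yields a correlation of -0.5 with `metric_a`.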
5.4 How important are different models and hyperparameters for disentanglement?
The primary motivation behind the considered methods is that they should lead to improved disentanglement scores. This raises the question how disentanglement is affected by the model choice, the hyperparameter selection and randomness (in the form of different random seeds). To investigate this, we compute all the considered disentanglement metrics for each of our trained models. In Figure 7, we show the range of attainable disentanglement scores for each method on each data set. We clearly see that these ranges are heavily overlapping leading us to conclude that the choice of hyperparameters seems to be substantially more important than the choice of objective function. While certain models seem to attain better maximum scores on specific data sets and disentanglement metrics, we do not observe any consistent pattern that one model is consistently better than the other. Furthermore, we note that in our study we have fixed the range of hyperparameters a priori to six different values for each model and did not explore additional hyperparameters based on the results (as that would bias our study). However, this also means that specific models may have performed better than in Figure 7 if we had chosen a different set of hyperparameters.
In Figure 8, we further show the impact of randomness in the form of random seeds on the disentanglement scores. Each violin plot shows the distribution of the disentanglement metric across all 50 trained models for each model and hyperparameter setting on dSprites. We clearly see that randomness (in the form of different random seeds) has a substantial impact on the attained result and that a good run with a bad hyperparameter can beat a bad run with a good hyperparameter in many cases. We note that for AnnealedVAE, this seems to hold to a lesser extent. However, AnnealedVAE appears to be more sensitive to the choice of hyperparameter, and bad hyperparameters for AnnealedVAE sometimes seem to lead to consistently bad disentanglement scores.
Implication. The disentanglement scores of unsupervised models are heavily influenced by randomness (in the form of the random seed) and the choice of the hyperparameter (in the form of the regularization strength). The objective function appears to have less impact.
5.5 Are there reliable recipes for model selection?
Given that the impact of randomness and hyperparameters seems to be more critical than the loss function, this raises the natural question of how good hyperparameters can be chosen and how we can distinguish between good and bad training runs. In this paper, we advocate the key principle that model selection should not depend on the considered disentanglement score, for the following reasons: The point of unsupervised learning of disentangled representations is that there is no access to the labels, as otherwise we could incorporate them and would have to compare to semi-supervised and fully supervised methods. All the disentanglement metrics considered in this paper require a substantial amount of ground-truth labels or the full generative model (for example for the BetaVAE and the FactorVAE metric). Hence, one may substantially bias the results of a study by tuning hyperparameters based on (supervised) disentanglement metrics. Furthermore, we argue that it is not sufficient to fix a set of hyperparameters a priori and then show that one of those hyperparameters and a specific random seed achieves a good disentanglement score, as this amounts to showing the existence of a good model but does not guide the practitioner in finding it. Finally, in many practical settings, we might not even have access to adequate labels as it may be hard to identify the true underlying factors of variation, in particular if we consider data modalities that are less suitable to human interpretation than images.
In the remainder of this section, we hence investigate and assess different ways in which hyperparameters and good model runs could be chosen. In this study, we focus on choosing the learning model and the regularization strength corresponding to that loss function. However, we note that in practice this problem is likely even harder as a practitioner might also want to tune other modeling choices such as architecture or optimizer.
General recipes for hyperparameter selection.
We first investigate whether we may find generally applicable “rules of thumb” for choosing the hyperparameters. For this, we plot in Figure 9 different disentanglement metrics against different regularization strengths for each model and each data set. The values correspond to the median obtained values across 50 random seeds for each model, hyperparameter and data set. There seems to be no model dominating all the others and for each model there does not seem to be a consistent strategy in choosing the regularization strength to maximize disentanglement scores. Furthermore, even if we could identify a good objective function and corresponding hyperparameter value, we still could not distinguish between a good and a bad training run.
Model selection based on unsupervised scores.
Another approach could be to select hyperparameters based on unsupervised scores such as the reconstruction error, the KL divergence between the prior and the approximate posterior, the Evidence Lower Bound or the estimated total correlation of the sampled representation. This would have the advantage that we could select specific trained models and not just good hyperparameter settings whose median trained model would perform well. To test whether such an approach is fruitful, we compute the rank correlation between these unsupervised metrics and the disentanglement metrics and present it in Figure 10. While we do observe some correlations (in particular on SmallNORB), no clear pattern emerges which leads us to conclude that this approach is unlikely to be successful in practice.
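Table 1: Percentage of trials in which the transfer strategy performs at least as well as random model selection.

| | Random different data set | Same data set |
|---|---|---|
| Random different metric | 54.1% | 63.1% |
| Same metric | 57.4% | 79.5% |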
Hyperparameter selection based on transfer.
The final strategy for hyperparameter selection that we consider is based on transferring good settings across data sets. The key idea is that good hyperparameter settings may be inferred on data sets where we have labels available (such as dSprites) and then applied to novel data sets. To test this idea, we plot in Figure 12 the different disentanglement scores obtained on dSprites against the scores obtained on other data sets. To ensure robustness of the results, we again consider the median across all 50 runs for each model, regularization strength, and data set. We observe that the scores on Color-dSprites seem to be strongly correlated with the scores obtained on the regular version of dSprites. Figure 11 further shows the rank correlations obtained between different data sets for each disentanglement score. This confirms the strong and consistent correlation between dSprites and Color-dSprites. We further observe that the Mutual Information Gap and DCI Disentanglement appear to be mildly correlated across all data sets except SmallNORB. While these results suggest that some transfer of hyperparameters is possible, they do not allow us to distinguish between good and bad random seeds on the target data set.
To illustrate this, we compare such a transfer-based approach to hyperparameter selection with random model selection as follows: First, we sample one of our 50 random seeds, a random disentanglement metric and a data set and, among the models trained with that random seed, select the hyperparameter setting with the highest attained score. Then, we compare that selected hyperparameter setting to a randomly selected model on either the same or a random different data set, based on either the same or a random different metric and for a randomly sampled seed. Finally, we report in Table 1 the percentage of trials in which this transfer strategy performs at least as well as random model selection. If we choose the same metric and the same data set (but a different random seed), we obtain a score of 79.5%. If we aim to transfer for the same metric across data sets, we achieve around 57.4%. Finally, if we transfer both across metrics and data sets, our performance drops to 54.1%.
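One possible reading of this comparison protocol is sketched below on synthetic scores. Everything in it is an assumption made for illustration: the array layout `scores[data set, metric, hyperparameter, seed]`, the helper names, the restriction to a single method, the use of the same target seed for both competitors, and the synthetic score model. It is not the evaluation code behind Table 1.

```python
import numpy as np

def other(rng, n, exclude):
    """Uniformly sample an index in [0, n) that differs from `exclude`."""
    choice = int(rng.integers(n - 1))
    return choice + 1 if choice >= exclude else choice

def transfer_win_rate(scores, same_metric, same_dataset, trials=2000, seed=0):
    """Fraction of trials where the transferred setting is at least as good."""
    rng = np.random.default_rng(seed)
    n_data, n_metric, n_hyper, n_seed = scores.shape
    wins = 0
    for _ in range(trials):
        # Source task: pick the best hyperparameter for one seed.
        src_d = int(rng.integers(n_data))
        src_m = int(rng.integers(n_metric))
        src_s = int(rng.integers(n_seed))
        best_h = int(np.argmax(scores[src_d, src_m, :, src_s]))
        # Target task: same or different data set / metric, different seed.
        tgt_d = src_d if same_dataset else other(rng, n_data, src_d)
        tgt_m = src_m if same_metric else other(rng, n_metric, src_m)
        tgt_s = other(rng, n_seed, src_s)
        rand_h = int(rng.integers(n_hyper))
        transferred = scores[tgt_d, tgt_m, best_h, tgt_s]
        random_pick = scores[tgt_d, tgt_m, rand_h, tgt_s]
        wins += int(transferred >= random_pick)
    return wins / trials

# Synthetic scores: a shared hyperparameter effect plus seed noise.
rng = np.random.default_rng(1)
hyper_effect = np.linspace(0.0, 1.0, 6)
scores = hyper_effect[None, None, :, None] + 0.1 * rng.normal(size=(7, 6, 6, 50))

rate_same = transfer_win_rate(scores, same_metric=True, same_dataset=True)
rate_cross = transfer_win_rate(scores, same_metric=False, same_dataset=False)
```

In this synthetic setting the hyperparameter effect is shared across all data sets and metrics by construction, so transfer looks far better than it does in the study. Replacing `scores` with real per-model metric values is what makes the comparison informative.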
Implications.
Unsupervised model selection remains an unsolved problem. Transfer of good hyperparameters between metrics and data sets does not seem to work as there appears to be no unsupervised way to distinguish between good and bad random seeds on the target task.
5.6 Are these disentangled representations useful for downstream tasks in terms of the sample complexity of learning?
One of the key motivations behind disentangled representations is that they are assumed to be useful for later downstream tasks. In particular, it is argued that disentanglement should lead to a better sample complexity of learning [4, 47, 41]. In this section, we consider the simplest downstream task, where the goal is to recover the true factors of variation from the observations. As all our ground-truth models have independent, discrete factors of variation, this corresponds to a set of classification tasks. Our goal is to investigate the relationship between disentanglement and the average classification accuracy on these downstream tasks, as well as whether better disentanglement leads to a decreased sample complexity of learning.
To compute the classification accuracy for each trained model, we sample true factors of variation and observations from our ground-truth generative models. We then feed the observations into our trained model and take the mean of the Gaussian encoder as the representation. Finally, we predict each of the ground-truth factors based on the representations with a separate learning algorithm. We consider both a 5-fold cross-validated multi-class logistic regression and gradient boosted trees from the Scikit-learn package. For each of these methods, we train on several training sets of increasing size. We compute the average accuracy across all factors of variation using an additional set of randomly drawn samples.
Figure 13 shows the rank correlations between the disentanglement metrics and the downstream performance for all considered data sets. We observe that all metrics except Modularity seem to be correlated with increased downstream performance on the different variations of dSprites and to some degree on Shapes3D. However, it is not clear whether this is due to the fact that disentangled representations perform better or whether some of these scores actually also (partially) capture the informativeness of the evaluated representation.
To assess the sample complexity argument, we compute for each trained model a statistical efficiency score, which we define as the average accuracy based on a small number of training samples divided by the average accuracy based on a large number of training samples, for either the logistic regression or the gradient boosted trees. The key idea is that if disentangled representations lead to sample efficiency, then they should also exhibit a higher statistical efficiency score. The corresponding results are shown in Figures 14 and 15, where we plot the statistical efficiency versus different disentanglement metrics for different data sets and models, and in Figure 13, where we show rank correlations. Overall, we do not observe conclusive evidence that models with higher disentanglement scores also lead to higher statistical efficiency. We note that some AnnealedVAE models seem to exhibit a high statistical efficiency on Scream-dSprites and to some degree on Noisy-dSprites. This can be explained by the fact that these models have low downstream performance, so that the accuracy with few samples is similar to the accuracy with many samples. We further observe that DCI Disentanglement and MIG seem to lead to a better statistical efficiency on the data set Shapes3D for gradient boosted trees. Figures 16 and 17 show the downstream performance for three groups with increasing levels of disentanglement (measured in DCI Disentanglement and MIG respectively). We observe that models with higher disentanglement scores indeed seem to exhibit better performance for gradient boosted trees with 100 samples. However, considering all data sets, it appears that increased disentanglement is correlated with better downstream performance rather than with statistical efficiency. Overall, we conclude that we lack strong evidence that disentanglement leads to better statistical efficiency.
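The statistical efficiency score can be illustrated with a small Scikit-learn sketch. The data here is a synthetic stand-in (a single discrete factor and a noisy four-dimensional "mean representation"), and the sample sizes `n_small`, `n_large` and `n_test` are assumed values chosen so that the example runs quickly; they are not the sizes used in the study, which also averages the accuracy over all factors of variation.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegressionCV

def sample_data(rng, n, num_classes=3, dim=4):
    """Synthetic stand-in for (representation, ground-truth factor) pairs."""
    factor = rng.integers(num_classes, size=n)
    centers = 3.0 * np.eye(num_classes, dim)
    representation = centers[factor] + rng.normal(size=(n, dim))
    return representation, factor

def accuracy(make_model, train, test):
    """Fit a fresh model on `train` and return its accuracy on `test`."""
    model = make_model().fit(*train)
    return float(np.mean(model.predict(test[0]) == test[1]))

rng = np.random.default_rng(0)
n_small, n_large, n_test = 100, 2000, 2000  # assumed sizes
train_large = sample_data(rng, n_large)
train_small = (train_large[0][:n_small], train_large[1][:n_small])
test = sample_data(rng, n_test)

models = {
    "logistic_regression": lambda: LogisticRegressionCV(cv=5, max_iter=1000),
    "gradient_boosted_trees": lambda: GradientBoostingClassifier(random_state=0),
}

results = {}
for name, make_model in models.items():
    acc_small = accuracy(make_model, train_small, test)
    acc_large = accuracy(make_model, train_large, test)
    # Statistical efficiency: accuracy with few samples / with many samples.
    results[name] = (acc_small, acc_large, acc_small / acc_large)
```

A ratio close to 1 means that the small training set already achieves most of the attainable accuracy. As noted above, a high ratio can also arise when both accuracies are low, so the score should be read together with the absolute downstream performance.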
Implications.
While the empirical results in this section are negative, they should also be interpreted with care. After all, we have seen in previous sections that the considered models in this study fail to reliably produce disentangled representations. Hence, the results in this section might change if one were to consider a different set of models, for example semi-supervised or fully supervised ones. Furthermore, there are many more potential notions of usefulness, such as interpretability and fairness, that we have not considered in our experimental evaluation. Nevertheless, we argue that the lack of concrete examples of useful disentangled representations necessitates that future work on disentanglement methods should make this point more explicit.
6 Conclusions
In this work we show, perhaps unsurprisingly, that the unsupervised learning of disentangled representations is fundamentally impossible without inductive biases. We then performed a large-scale empirical study with six state-of-the-art disentanglement methods and six disentanglement metrics on seven data sets and conclude the following: (i) While all considered methods prove effective at ensuring that the aggregated posterior (which is sampled) factorizes, only one method also consistently ensures that the representation (which is taken to be the mean) factorizes. (ii) We do not find any evidence that the considered methods can be used to reliably learn disentangled representations in an unsupervised manner, as random seeds and hyperparameters seem to matter more than the model, and "good" random seeds and hyperparameters seemingly cannot be identified without access to ground-truth labels. Similarly, we observe that good models cannot be reliably identified by transferring hyperparameters across data sets or across disentanglement metrics. (iii) For the considered models and data sets, we cannot validate the assumption that disentanglement is useful for downstream tasks through a decreased sample complexity of learning. Based on these findings, we suggest three main directions for future research:

Inductive biases and implicit and explicit supervision. Our theoretical impossibility result in Section 3 highlights the need for inductive biases, while our experimental results indicate that the role of supervision is crucial. As there currently does not seem to exist a reliable strategy to choose hyperparameters in the unsupervised learning of disentangled representations, we argue that future work should make the role of inductive biases and implicit and explicit supervision more explicit. Given the seemingly fundamental impossibility of purely unsupervised disentanglement learning, we would encourage and motivate future work on disentangled representation learning that deviates from the static, purely unsupervised setting considered in this work. Promising settings (that have been explored to some degree) seem to be, for example, (i) disentanglement learning with interactions [50], (ii) when weak forms of supervision, e.g. through grouping information, are available [5], or (iii) when temporal structure is available for the learning problem. The last setting seems to be particularly interesting given recent identifiability results in nonlinear ICA [22], which could indicate that significant improvements for autoencoding-based approaches could be possible if the sequential structure of data can be exploited.
Concrete practical benefits of disentangled representations. In our experiments we investigated whether disentanglement leads to increased sample efficiency for downstream tasks and did not find evidence that this is the case. While these results only apply to the setting and downstream task used in our study, we are also not aware of other prior work that compellingly shows the usefulness of disentangled representations. Hence, we argue that future work should aim to show concrete benefits of disentangled representations. Interpretability and fairness as well as interactive settings seem to be particularly promising candidates. In such settings, feedback could potentially be incorporated and might alleviate some of the difficulties of the purely unsupervised setting. One potential approach to include inductive biases, offer interpretability, and generalization is the concept of independent causal mechanisms and the framework of causal inference [39, 41].

Experimental setup and diversity of data sets. Our study also highlights the need for a sound, robust, and reproducible experimental setup on a diverse set of data sets in order to draw valid conclusions. First, one has to be careful with the experimental design as for example hyperparameter selection has a substantial impact on the obtained results. We have tried to keep our study fair by choosing a wide set of hyperparameters up front and not modifying them while performing the study. We note that, as our hyperparameter selection was partially based on suggestions from prior work that considered some of the same data sets, one might validly argue that we have implicitly incorporated inductive biases on these data sets. We have also chosen the exact same training and evaluation protocol for all the different methods and disentanglement metrics. Second, we have observed that it is easy to draw spurious conclusions from experimental results if one only considers a subset of methods, metrics and data sets. We hence argue that it is crucial for future work to perform experiments on a wide variety of data sets to see whether conclusions and insights are generally applicable. This is particularly important in the setting of disentanglement learning as experiments are largely performed on toylike data sets. We are hence interested in insights that generalize across multiple data sets rather than the absolute performance on specific data sets.
Acknowledgements
The authors thank Gunnar Rätsch, Ilya Tolstikhin, Paul Rubenstein and Josip Djolonga for helpful discussions and comments. This research was partially supported by the Max Planck ETH Center for Learning Systems and by an ETH core grant (to Gunnar Rätsch). This work was partially done while Francesco Locatello was at Google AI.
References
 Arcones and Gine [1992] Miguel A Arcones and Evarist Gine. On the bootstrap of u and v statistics. The Annals of Statistics, pages 655–674, 1992.
 Bach and Jordan [2002] Francis R Bach and Michael I Jordan. Kernel independent component analysis. Journal of machine learning research, 3(Jul):1–48, 2002.
 Bengio et al. [2007] Yoshua Bengio, Yann LeCun, et al. Scaling learning algorithms towards ai. Largescale kernel machines, 34(5):1–41, 2007.
 Bengio et al. [2013] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013.
 Bouchacourt et al. [2017] Diane Bouchacourt, Ryota Tomioka, and Sebastian Nowozin. Multilevel variational autoencoder: Learning disentangled representations from grouped observations. arXiv preprint arXiv:1705.08841, 2017.
 Burgess et al. [2018] Christopher P Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Desjardins, and Alexander Lerchner. Understanding disentangling in betavae. arXiv preprint arXiv:1804.03599, 2018.
 Chen et al. [2018] Tian Qi Chen, Xuechen Li, Roger Grosse, and David Duvenaud. Isolating sources of disentanglement in variational autoencoders. In To Appear In Neural Information Processing Systems, 2018.
 Chen et al. [2016] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, pages 2172–2180, 2016.
 Cheung et al. [2014] Brian Cheung, Jesse A Livezey, Arjun K Bansal, and Bruno A Olshausen. Discovering hidden factors of variation in deep networks. arXiv preprint arXiv:1412.6583, 2014.
 Cohen and Welling [2014a] Taco Cohen and Max Welling. Learning the irreducible representations of commutative lie groups. In International Conference on Machine Learning, pages 1755–1763, 2014a.
 Cohen and Welling [2014b] Taco S Cohen and Max Welling. Transformation properties of learned visual representations. arXiv preprint arXiv:1412.7659, 2014b.
 Comon [1994] Pierre Comon. Independent component analysis, a new concept? Signal processing, 36(3):287–314, 1994.
 Denton et al. [2017] Emily L Denton et al. Unsupervised learning of disentangled representations from video. In Advances in Neural Information Processing Systems, pages 4414–4423, 2017.
 Desjardins et al. [2012] Guillaume Desjardins, Aaron Courville, and Yoshua Bengio. Disentangling factors of variation via generative entangling. arXiv preprint arXiv:1210.5474, 2012.
 Eastwood and Williams [2018] Cian Eastwood and Christopher KI Williams. A framework for the quantitative evaluation of disentangled representations. 2018.
 Fraccaro et al. [2017] Marco Fraccaro, Simon Kamronn, Ulrich Paquet, and Ole Winther. A disentangled recognition and nonlinear dynamics model for unsupervised learning. In Advances in Neural Information Processing Systems, pages 3601–3610, 2017.
 Goodfellow et al. [2009] Ian Goodfellow, Honglak Lee, Quoc V Le, Andrew Saxe, and Andrew Y Ng. Measuring invariances in deep networks. In Advances in neural information processing systems, pages 646–654, 2009.
 Goroshin et al. [2015] Ross Goroshin, Michael F Mathieu, and Yann LeCun. Learning to linearize under uncertainty. In Advances in Neural Information Processing Systems, pages 1234–1242, 2015.
 Higgins et al. [2016] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. betavae: Learning basic visual concepts with a constrained variational framework. 2016.
 Hinton et al. [2011] Geoffrey E Hinton, Alex Krizhevsky, and Sida D Wang. Transforming autoencoders. In International Conference on Artificial Neural Networks, pages 44–51. Springer, 2011.
 Hsu et al. [2017] WeiNing Hsu, Yu Zhang, and James Glass. Unsupervised learning of disentangled and interpretable representations from sequential data. In Advances in Neural Information Processing Systems, 2017.

 Hyvarinen and Morioka [2016] Aapo Hyvarinen and Hiroshi Morioka. Unsupervised feature extraction by time-contrastive learning and nonlinear ICA. In Advances in Neural Information Processing Systems, pages 3765–3773, 2016.
 Hyvärinen and Pajunen [1999] Aapo Hyvärinen and Petteri Pajunen. Nonlinear independent component analysis: Existence and uniqueness results. Neural Networks, 12(3):429–439, 1999.
 Hyvarinen et al. [2018] Aapo Hyvarinen, Hiroaki Sasaki, and Richard E Turner. Nonlinear ica using auxiliary variables and generalized contrastive learning. arXiv preprint arXiv:1805.08651, 2018.
 Jutten and Karhunen [2003] Christian Jutten and Juha Karhunen. Advances in nonlinear blind source separation. In Proc. of the 4th Int. Symp. on Independent Component Analysis and Blind Signal Separation (ICA2003), pages 245–256, 2003.
 Karaletsos et al. [2015] Theofanis Karaletsos, Serge Belongie, and Gunnar Rätsch. Bayesian representation learning with oracle constraints. arXiv preprint arXiv:1506.05011, 2015.
 Kim and Mnih [2018] Hyunjik Kim and Andriy Mnih. Disentangling by factorising. arXiv preprint arXiv:1802.05983, 2018.
 Kingma and Welling [2013] Diederik P Kingma and Max Welling. Autoencoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
 Kulkarni et al. [2015] Tejas D Kulkarni, William F Whitney, Pushmeet Kohli, and Josh Tenenbaum. Deep convolutional inverse graphics network. In Advances in neural information processing systems, pages 2539–2547, 2015.
 Kumar et al. [2017] Abhishek Kumar, Prasanna Sattigeri, and Avinash Balakrishnan. Variational inference of disentangled latent concepts from unlabeled observations. In International Conference on Learning Representations, 2017.
 Lake et al. [2017] Brenden M Lake, Tomer D Ullman, Joshua B Tenenbaum, and Samuel J Gershman. Building machines that learn and think like people. Behavioral and Brain Sciences, 40, 2017.
 LeCun et al. [2004] Yann LeCun, Fu Jie Huang, and Leon Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, volume 2, pages II–104. IEEE, 2004.
 LeCun et al. [2015] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521(7553):436, 2015.
 Lenc and Vedaldi [2015] Karel Lenc and Andrea Vedaldi. Understanding image representations by measuring their equivariance and equivalence. In IEEE conference on computer vision and pattern recognition, pages 991–999, 2015.
 Mathieu et al. [2016] Michael F Mathieu, Junbo Jake Zhao, Junbo Zhao, Aditya Ramesh, Pablo Sprechmann, and Yann LeCun. Disentangling factors of variation in deep representation using adversarial training. In Advances in Neural Information Processing Systems, pages 5040–5048, 2016.
 Munch [1893] Edvard Munch. The scream, 1893.
 Narayanaswamy et al. [2017] Siddharth Narayanaswamy, T Brooks Paige, JanWillem Van de Meent, Alban Desmaison, Noah Goodman, Pushmeet Kohli, Frank Wood, and Philip Torr. Learning disentangled representations with semisupervised deep generative models. In Advances in Neural Information Processing Systems, pages 5925–5935, 2017.
 Nguyen et al. [2010] XuanLong Nguyen, Martin J Wainwright, and Michael I Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861, 2010.
 Pearl [2009] Judea Pearl. Causality. Cambridge university press, 2009.
 Pedregosa et al. [2011] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikitlearn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
 Peters et al. [2017] Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of causal inference: foundations and learning algorithms. MIT press, 2017.
 Reed et al. [2014] Scott Reed, Kihyuk Sohn, Yuting Zhang, and Honglak Lee. Learning to disentangle factors of variation with manifold interaction. In International Conference on Machine Learning, pages 1431–1439, 2014.
 Reed et al. [2015] Scott E Reed, Yi Zhang, Yuting Zhang, and Honglak Lee. Deep visual analogymaking. In Advances in neural information processing systems, pages 1252–1260, 2015.
 Ridgeway and Mozer [2018] Karl Ridgeway and Michael C Mozer. Learning deep disentangled embeddings with the fstatistic loss. In To Appear In Neural Information Processing Systems, 2018.
 Rubenstein et al. [2018] Paul K Rubenstein, Bernhard Schoelkopf, and Ilya Tolstikhin. Learning disentangled representations with wasserstein autoencoders. 2018.
 Schmidhuber [1992] Jürgen Schmidhuber. Learning factorial codes by predictability minimization. Neural Computation, 4(6):863–879, 1992.
 Schölkopf et al. [2012] Bernhard Schölkopf, Dominik Janzing, Jonas Peters, Eleni Sgouritsa, Kun Zhang, and Joris Mooij. On causal and anticausal learning. In International Conference on Machine Learning, 2012.
 Sugiyama et al. [2012] Masashi Sugiyama, Taiji Suzuki, and Takafumi Kanamori. Densityratio matching under the bregman divergence: a unified framework of densityratio estimation. Annals of the Institute of Statistical Mathematics, 64(5):1009–1044, 2012.
 Suter et al. [2018] Raphael Suter, Đorđe Miladinović, Stefan Bauer, and Bernhard Schölkopf. Interventional robustness of deep latent variable models. arXiv preprint arXiv:1811.00007, 2018.
 Thomas et al. [2018] Valentin Thomas, Emmanuel Bengio, William Fedus, Jules Pondard, Philippe Beaudoin, Hugo Larochelle, Joelle Pineau, Doina Precup, and Yoshua Bengio. Disentangling the independently controllable factors of variation by interacting with the world. arXiv preprint arXiv:1802.09484, 2018.
 Watanabe [1960] Satosi Watanabe. Information theoretical analysis of multivariate correlation. IBM Journal of Research and Development, 4(1):66–82, 1960.
 Whitney et al. [2016] William F Whitney, Michael Chang, Tejas Kulkarni, and Joshua B Tenenbaum. Understanding visual concepts with continuation learning. arXiv preprint arXiv:1602.06822, 2016.
 Yang et al. [2015] Jimei Yang, Scott E Reed, Ming-Hsuan Yang, and Honglak Lee. Weakly-supervised disentangling with recurrent transformations for 3d view synthesis. In Neural Information Processing Systems, 2015.
 Yingzhen and Mandt [2018] Li Yingzhen and Stephan Mandt. Disentangled sequential autoencoder. In International Conference on Machine Learning, pages 5656–5665, 2018.

 Zhu et al. [2014] Zhenyao Zhu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Multi-view perceptron: a deep model for learning face identity and view representations. In Advances in Neural Information Processing Systems, 2014.
Appendix A Proof of Theorem 1
Proof.
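To show the claim, we explicitly construct a family of functions $f$ using a sequence of bijective functions. Let $d > 1$ be the dimensionality of the latent variable $\mathbf{z}$ and consider the function $g: \operatorname{supp}(\mathbf{z}) \to [0,1]^d$ defined by
\[
g_i(\mathbf{v}) = P(\mathbf{z}_i \le v_i) \quad \forall i = 1, 2, \dots, d.
\]
Since $P$ admits a density $p(\mathbf{z}) = \prod_i p(\mathbf{z}_i)$, the function $g$ is bijective and, for almost every $\mathbf{v} \in \operatorname{supp}(\mathbf{z})$, it holds that $\frac{\partial g_i(\mathbf{v})}{\partial v_i} \neq 0$ for all $i$ as well as $\frac{\partial g_i(\mathbf{v})}{\partial v_j} = 0$ for all $i \neq j$. Furthermore, it is easy to see that, by construction, $g(\mathbf{z})$ follows an independent $d$-dimensional uniform distribution. Similarly, consider the function $h: (0,1)^d \to \mathbb{R}^d$ defined by
\[
h_i(\mathbf{v}) = \psi^{-1}(v_i) \quad \forall i = 1, 2, \dots, d,
\]
where $\psi(\cdot)$ denotes the cumulative distribution function of a standard normal distribution. Again, by definition, $h$ is bijective with $\frac{\partial h_i(\mathbf{v})}{\partial v_i} \neq 0$ for all $i$ as well as $\frac{\partial h_i(\mathbf{v})}{\partial v_j} = 0$ for all $i \neq j$. Furthermore, the random variable $h(g(\mathbf{z}))$ follows a $d$-dimensional standard normal distribution.

Let $A \in \mathbb{R}^{d \times d}$ be an arbitrary orthogonal matrix with $A_{ij} \neq 0$ for all $i$ and $j$. An infinite family of such matrices can be constructed using a Householder transformation: Choose an arbitrary $\alpha \in (0, 0.5)$ and consider the vector $\mathbf{v}$ with $v_1 = \sqrt{\alpha}$ and $v_i = \sqrt{\frac{1-\alpha}{d-1}}$ for $i = 2, 3, \dots, d$. By construction, we have $\mathbf{v}^T \mathbf{v} = 1$ and both $v_i \neq 0$ and $v_i \neq \sqrt{1/2}$ for all $i$. Define the matrix $A = I_d - 2\mathbf{v}\mathbf{v}^T$ and note that $A_{ii} = 1 - 2v_i^2 \neq 0$ for all $i$ as well as $A_{ij} = -2 v_i v_j \neq 0$ for all $i \neq j$. Furthermore, $A$ is orthogonal since
\[
A^T A = (I_d - 2\mathbf{v}\mathbf{v}^T)^T (I_d - 2\mathbf{v}\mathbf{v}^T) = I_d - 4\mathbf{v}\mathbf{v}^T + 4\mathbf{v}(\mathbf{v}^T\mathbf{v})\mathbf{v}^T = I_d.
\]

Since $A$ is orthogonal, it is invertible and thus defines a bijective linear operator. The random variable $A h(g(\mathbf{z})) \in \mathbb{R}^d$ hence follows an independent, multivariate standard normal distribution since its covariance matrix $A A^T$ is equal to $I_d$.

Since $h$ is bijective, it follows that $h^{-1}(A h(g(\mathbf{z})))$ follows an independent $d$-dimensional uniform distribution. Define the function $f: \operatorname{supp}(\mathbf{z}) \to \operatorname{supp}(\mathbf{z})$ by
\[
f(\mathbf{u}) = g^{-1}\big(h^{-1}(A h(g(\mathbf{u})))\big)
\]
and note that by definition $f(\mathbf{z})$ has the same marginal distribution as $\mathbf{z}$ under $P$, i.e., $P(\mathbf{z} \le \mathbf{u}) = P(f(\mathbf{z}) \le \mathbf{u})$ for all $\mathbf{u}$. Finally, for almost every $\mathbf{u} \in \operatorname{supp}(\mathbf{z})$, it holds that
\[
\frac{\partial f_i(\mathbf{u})}{\partial u_j} = \frac{A_{ij} \cdot \frac{\partial h_j(g(\mathbf{u}))}{\partial v_j} \cdot \frac{\partial g_j(\mathbf{u})}{\partial u_j}}{\frac{\partial h_i\left(h^{-1}(A h(g(\mathbf{u})))\right)}{\partial v_i} \cdot \frac{\partial g_i\left(g^{-1}(h^{-1}(A h(g(\mathbf{u}))))\right)}{\partial v_i}} \neq 0,
\]
as claimed. Since the choice of $A$ was arbitrary, there exists an infinite family of such functions $f$. ∎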
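The Householder step of the proof can be checked numerically. The following is a minimal sketch, not part of the original proof: it assumes a unit vector whose first entry is the square root of a free parameter `alpha` in (0, 0.5) and whose remaining entries are equal. The function name `householder_entangler` and the sample sizes are illustrative choices.

```python
import numpy as np


def householder_entangler(d: int, alpha: float) -> np.ndarray:
    """Return a d x d orthogonal matrix I - 2 v v^T without zero entries.

    Assumes d > 1 and 0 < alpha < 0.5; v is a unit vector with
    v[0] = sqrt(alpha) and equal remaining entries.
    """
    if d < 2 or not 0.0 < alpha < 0.5:
        raise ValueError("need d > 1 and 0 < alpha < 0.5")
    v = np.full(d, np.sqrt((1.0 - alpha) / (d - 1)))
    v[0] = np.sqrt(alpha)
    return np.eye(d) - 2.0 * np.outer(v, v)


rng = np.random.default_rng(0)
A = householder_entangler(d=5, alpha=0.3)

# Independent standard normal samples stay standard normal after mixing with A,
# although every output coordinate now depends on every input coordinate.
z = rng.standard_normal((100_000, 5))
z_mixed = z @ A.T
empirical_cov = np.cov(z_mixed, rowvar=False)
```

Different values of `alpha` give different matrices, which illustrates why the family of entangling maps is infinite.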
Appendix B Implementation of metrics
All our metrics consider the expected representation of training samples (except total correlation for which we also consider the sampled representation as described in Section 5).
BetaVAE metric.
We sample two batches of 64 points with a random factor fixed to a randomly sampled value across the two batches and the others varying randomly. We compute the mean representations for these points and take the absolute difference between pairs from the two batches. We then average these 64 values to form the features of a training (or testing) point. We train a Scikit-learn logistic regression with default parameters on a set of such training points and evaluate it on a separate set of testing points.
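The feature construction described above can be sketched as follows. This is an illustration under stated assumptions rather than the code used in the study: `FACTOR_SIZES`, `sample_factors`, and `mean_representation` are hypothetical stand-ins for the data set and the trained encoder, the classification target is assumed to be the index of the fixed factor, and the numbers of training and testing points are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

FACTOR_SIZES = np.array([3, 6, 40])  # hypothetical number of values per factor


def sample_factors(rng, n):
    """Sample n factor vectors uniformly at random."""
    return np.stack([rng.integers(0, s, size=n) for s in FACTOR_SIZES], axis=1)


def mean_representation(factors):
    """Hypothetical encoder mean: one latent dimension per factor."""
    return factors / FACTOR_SIZES


def beta_vae_point(rng, batch_size=64):
    """Return (features, index of the fixed factor) for one classifier point."""
    k = int(rng.integers(len(FACTOR_SIZES)))
    first = sample_factors(rng, batch_size)
    second = sample_factors(rng, batch_size)
    value = rng.integers(0, FACTOR_SIZES[k])
    first[:, k] = value   # factor k is fixed in both batches
    second[:, k] = value
    diff = np.abs(mean_representation(first) - mean_representation(second))
    return diff.mean(axis=0), k  # average over the 64 pairs


def make_dataset(rng, num_points):
    points = [beta_vae_point(rng) for _ in range(num_points)]
    return np.array([p[0] for p in points]), np.array([p[1] for p in points])


rng = np.random.default_rng(0)
x_train, y_train = make_dataset(rng, 500)  # sizes chosen for illustration only
x_test, y_test = make_dataset(rng, 200)
classifier = LogisticRegression().fit(x_train, y_train)
beta_vae_score = classifier.score(x_test, y_test)
```

With this perfectly aligned toy encoder, the averaged difference is zero exactly in the dimension of the fixed factor, so the classifier separates the classes easily.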
FactorVAE metric.
First, we estimate the variance of each latent dimension by embedding random samples from the data set, and we exclude collapsed dimensions with variance smaller than 0.05. Then, we generate the votes for the majority vote classifier by sampling a batch of 64 points, all with a factor fixed to the same random value. Next, we compute the variance of each dimension of their latent representation and divide it by the variance of that dimension computed on the data without interventions. The training point for the majority vote classifier consists of the index of the dimension with the smallest normalized variance. We train the classifier on one set of such points and evaluate it on a separate set.
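A minimal sketch of the voting procedure follows, assuming a hypothetical toy encoder (`toy_representation`) with two informative dimensions and one collapsed dimension. The helper names, the number of votes, and the toy data are illustrative and are not taken from the original implementation.

```python
import numpy as np


def factor_vae_vote(batch_repr, global_var, active):
    """Index of the active dimension with the smallest normalized variance."""
    local_var = np.var(batch_repr, axis=0, ddof=1)
    normalized = local_var[active] / global_var[active]
    return int(np.flatnonzero(active)[np.argmin(normalized)])


def majority_vote_counts(votes, factor_ids, num_dims, num_factors):
    """counts[i, k]: how often dimension i was voted while factor k was fixed."""
    counts = np.zeros((num_dims, num_factors), dtype=int)
    np.add.at(counts, (votes, factor_ids), 1)
    return counts


rng = np.random.default_rng(0)


def toy_representation(factors):
    """Hypothetical encoder: noisy copy of the factors plus a collapsed dimension."""
    noisy = factors + 0.05 * rng.standard_normal(factors.shape)
    collapsed = 0.01 * rng.standard_normal((len(factors), 1))
    return np.hstack([noisy, collapsed])


# Variance of each dimension on data without interventions.
global_var = np.var(toy_representation(rng.uniform(-1, 1, size=(5000, 2))), axis=0, ddof=1)
active = global_var >= 0.05  # drop collapsed dimensions

votes, factor_ids = [], []
for _ in range(200):
    k = int(rng.integers(2))
    batch = rng.uniform(-1, 1, size=(64, 2))
    batch[:, k] = rng.uniform(-1, 1)  # fix factor k to one random value
    votes.append(factor_vae_vote(toy_representation(batch), global_var, active))
    factor_ids.append(k)

counts = majority_vote_counts(np.array(votes), np.array(factor_ids), num_dims=3, num_factors=2)
train_accuracy = counts.max(axis=1).sum() / counts.sum()
```

The majority vote classifier predicts, for each latent dimension, the factor in `counts.argmax(axis=1)`; the accuracy above is computed on the same votes and would be evaluated on held-out votes in practice.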
Mutual Information Gap.
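The original metric was proposed for evaluating the sampled representation. Instead, we consider the mean representation, in order to be consistent with the other metrics. We estimate the discrete mutual information by binning each dimension of the representations obtained from the sampled points into 20 bins. Then, the score is computed as follows:
\[
\frac{1}{K} \sum_{k=1}^{K} \frac{1}{H(z_k)} \Big( I(v_{j_k}, z_k) - \max_{j \neq j_k} I(v_j, z_k) \Big),
\]
where $z_k$ is a factor of variation, $v_j$ is a dimension of the latent representation, $K$ is the number of factors, $H(z_k)$ is the entropy of $z_k$, and $j_k = \arg\max_j I(v_j, z_k)$.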
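As an illustration, the gap between the two most informative latent dimensions can be computed as in the sketch below. It assumes equal-width binning and discrete ground-truth factors; `discretize`, `mutual_information_gap`, and the toy data are hypothetical names and inputs, while `mutual_info_score` is the Scikit-learn function for discrete mutual information (in nats).

```python
import numpy as np
from sklearn.metrics import mutual_info_score


def discretize(x, num_bins=20):
    """Bin each column of x into num_bins equal-width bins."""
    binned = np.empty(x.shape, dtype=int)
    for j in range(x.shape[1]):
        edges = np.histogram_bin_edges(x[:, j], bins=num_bins)
        binned[:, j] = np.digitize(x[:, j], edges[1:-1])
    return binned


def mutual_information_gap(representations, factors, num_bins=20):
    """Average normalized gap between the top two latent dimensions per factor."""
    codes = discretize(representations, num_bins)
    num_codes, num_factors = codes.shape[1], factors.shape[1]
    mi = np.array([[mutual_info_score(factors[:, k], codes[:, j])
                    for k in range(num_factors)] for j in range(num_codes)])
    # The entropy of a discrete variable equals its mutual information with itself.
    entropy = np.array([mutual_info_score(factors[:, k], factors[:, k])
                        for k in range(num_factors)])
    top_two = np.sort(mi, axis=0)[::-1][:2]
    return float(np.mean((top_two[0] - top_two[1]) / entropy))


rng = np.random.default_rng(0)
factors = rng.integers(0, 10, size=(5000, 2))
noise = rng.standard_normal((5000, 1))
aligned = np.hstack([factors.astype(float), noise])       # one latent per factor
rotation = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2.0)
rotated = np.hstack([factors @ rotation, noise])          # factors mixed together

mig_aligned = mutual_information_gap(aligned, factors)
mig_rotated = mutual_information_gap(rotated, factors)
```

The aligned toy representation scores close to one, while the rotated one scores close to zero because two latent dimensions are about equally informative for each factor.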
Modularity.
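For the modularity score, we sample a set of points for which we obtain the latent representations. We discretize these points into 20 bins and compute the mutual information between representations and the values of the factors of variation. These values are stored in a matrix $m$. For each dimension $i$ of the representation, we compute a vector $t_i$ as:
\[
t_{i,f} = \begin{cases} \theta_i & \text{if } f = \arg\max_g m_{i,g}, \\ 0 & \text{otherwise,} \end{cases}
\]
where $\theta_i = \max_g m_{i,g}$. The modularity score is the average over the dimensions of the representation of $1 - \delta_i$, where:
\[
\delta_i = \frac{\sum_f (m_{i,f} - t_{i,f})^2}{\theta_i^2 (N - 1)}
\]
and $N$ is the number of factors.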
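Given a matrix of mutual information values, the score can be computed as in this minimal sketch. The function name is illustrative, and assigning a score of zero to a dimension that carries no information about any factor is an assumption made here to avoid a division by zero.

```python
import numpy as np


def modularity_score(mi):
    """mi[i, f]: mutual information between latent dimension i and factor f."""
    mi = np.asarray(mi, dtype=float)
    num_factors = mi.shape[1]
    theta = mi.max(axis=1)
    # Template keeps only the largest entry of each row.
    template = np.zeros_like(mi)
    template[np.arange(mi.shape[0]), mi.argmax(axis=1)] = theta
    residual = ((mi - template) ** 2).sum(axis=1)
    denominator = theta ** 2 * (num_factors - 1)
    scores = np.zeros_like(theta)  # assumption: uninformative dimensions score 0
    valid = denominator > 0
    scores[valid] = 1.0 - residual[valid] / denominator[valid]
    return float(scores.mean())


perfectly_modular = modularity_score([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
fully_mixed = modularity_score([[1.0, 1.0, 1.0]])
partially_mixed = modularity_score([[2.0, 1.0, 0.0]])
```

A dimension that is informative about a single factor contributes a score of one, and a dimension that is equally informative about all factors contributes zero.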
DCI Disentanglement.
We sample separate sets of training and test points. For each factor, we fit gradient boosted trees from Scikit-learn with the default settings. From this model, we extract the importance weights for the feature dimensions. We take the absolute value of these weights and use them to form the importance matrix $R$, whose rows correspond to factors and columns to the representation. To compute the disentanglement score, we first subtract from 1 the entropy of each column of this matrix (we treat the columns as distributions by normalizing them). This gives a vector of length equal to the dimensionality of the latent space. Then, we compute the relative importance of each dimension $i$ as $\rho_i = \sum_k R_{ki} / \sum_{k,i'} R_{ki'}$ and the disentanglement score as $\sum_i \rho_i (1 - H(R_{\cdot i}))$, where $H(R_{\cdot i})$ denotes the entropy of the normalized $i$-th column.
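The step from the importance matrix to the score can be sketched as follows. This sketch assumes that the entropy is taken with a logarithm whose base is the number of factors, so that each per-dimension term lies between 0 and 1, and that a column of zeros is treated as a uniform distribution; both are assumptions of this illustration, as is the function name.

```python
import numpy as np


def dci_disentanglement(importance):
    """importance[k, i]: importance of latent dimension i for predicting factor k."""
    R = np.abs(np.asarray(importance, dtype=float))
    num_factors = R.shape[0]
    column_sums = R.sum(axis=0)
    with np.errstate(divide="ignore", invalid="ignore"):
        # Normalize each column into a distribution over factors.
        P = np.where(column_sums > 0, R / column_sums, 1.0 / num_factors)
        plogp = np.where(P > 0, P * np.log(P), 0.0)
    entropy = -plogp.sum(axis=0) / np.log(num_factors)  # assumed base: num_factors
    rho = column_sums / column_sums.sum()                # relative importance
    return float(np.sum(rho * (1.0 - entropy)))


one_to_one = dci_disentanglement(np.eye(3))
uniform = dci_disentanglement(np.ones((3, 3)))
with_unused_dimension = dci_disentanglement([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
```

Dimensions that are unimportant for every factor receive zero weight and therefore do not affect the score.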
SAP score.
We sample separate sets of points for training and for testing. We then compute a score matrix containing the prediction error on the test set for a linear SVM that predicts the value of a factor from a single latent dimension. The SAP score is computed as the average across factors of the difference between the scores of the two most predictive latent dimensions.
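A minimal sketch of this computation is given below. It assumes discrete factors and fills the score matrix with test accuracy, so that a higher score means a more predictive dimension; the regularization constant `C`, the toy data, and the function names are hypothetical choices for illustration.

```python
import numpy as np
from sklearn.svm import LinearSVC


def sap_score_matrix(z_train, y_train, z_test, y_test, C=1.0):
    """scores[i, k]: test accuracy of a linear SVM predicting factor k from latent i."""
    num_latents, num_factors = z_train.shape[1], y_train.shape[1]
    scores = np.zeros((num_latents, num_factors))
    for i in range(num_latents):
        for k in range(num_factors):
            classifier = LinearSVC(C=C)  # C is a hypothetical value
            classifier.fit(z_train[:, [i]], y_train[:, k])
            scores[i, k] = classifier.score(z_test[:, [i]], y_test[:, k])
    return scores


def sap_score(scores):
    """Average gap between the two most predictive latents of each factor."""
    ordered = np.sort(scores, axis=0)
    return float(np.mean(ordered[-1] - ordered[-2]))


rng = np.random.default_rng(0)


def toy_data(n):
    """Two binary factors, two informative latents and one noise latent."""
    y = rng.integers(0, 2, size=(n, 2))
    z = np.hstack([y + 0.1 * rng.standard_normal((n, 2)), rng.standard_normal((n, 1))])
    return z, y


z_train, y_train = toy_data(2000)
z_test, y_test = toy_data(1000)
scores = sap_score_matrix(z_train, y_train, z_test, y_test)
sap = sap_score(scores)
```

In the toy data each factor is predicted almost perfectly by one latent dimension and at chance level by the others, so the gap is close to one half.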
Downstream task.
We sample training sets of several different sizes and always evaluate on a test set of fixed size. We consider as a downstream task the prediction of the values of each factor from the representation. For each factor we fit a different model and then report the average test accuracy across factors. We consider two different models. First, we train a cross-validated logistic regression from Scikit-learn with 10 different values for the regularization strength. Second, we train a gradient boosting classifier from Scikit-learn with default parameters.
Total correlation.
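We sample a set of points and obtain their latent representation either by sampling from the encoder distribution or by taking its mean. We then compute the mean $\mu$ and covariance matrix $\Sigma$ of these points and compute the total correlation of a Gaussian with this mean and covariance:
\[
D_{\mathrm{KL}}\Big( \mathcal{N}(\mu, \Sigma) \,\Big\|\, \prod_j \mathcal{N}(\mu_j, \Sigma_{jj}) \Big),
\]
where $j$ indexes the dimensions in the latent space.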
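For a Gaussian, the total correlation has a closed form: half the difference between the sum of the log variances and the log determinant of the covariance matrix. The sketch below uses this identity; the function names and the toy samples are illustrative and not taken from the original code.

```python
import numpy as np


def gaussian_total_correlation(cov):
    """KL divergence between N(mu, cov) and the product of its marginals."""
    cov = np.asarray(cov, dtype=float)
    sign, logdet = np.linalg.slogdet(cov)
    if sign <= 0:
        raise ValueError("covariance matrix must be positive definite")
    return float(0.5 * (np.sum(np.log(np.diag(cov))) - logdet))


def total_correlation_of_samples(representations):
    """Fit a Gaussian to the rows of `representations` and return its total correlation."""
    return gaussian_total_correlation(np.cov(representations, rowvar=False))


rng = np.random.default_rng(0)
correlated_cov = np.array([[1.0, 0.8], [0.8, 1.0]])
samples = rng.multivariate_normal(np.zeros(2), correlated_cov, size=20_000)
tc_estimate = total_correlation_of_samples(samples)
tc_exact = gaussian_total_correlation(correlated_cov)
```

The mean does not appear in the closed form, which is why only the covariance matrix is passed to `gaussian_total_correlation`.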
Appendix C Main experiment hyperparameters
In the main experiment, we fix all hyperparameters except one per model. Model-specific hyperparameters can be found in Table 3. The common architecture is depicted in Table 2, and the other fixed hyperparameters are listed in Table 4(a). For the discriminator in FactorVAE we use the architecture in Table 4(b) with the hyperparameters in Table 4(c). All the hyperparameters for which we report single values were not varied and were selected based on the literature.
Table 2: Encoder and decoder architecture.
Encoder | Decoder
Input: number of channels | Input:
 | FC, 256 ReLU
conv, 32 ReLU, stride 2 | FC, ReLU
conv, 64 ReLU, stride 2 | upconv, 64 ReLU, stride 2
conv, 64 ReLU, stride 2 | upconv, 32 ReLU, stride 2
FC 256, F2 | upconv, 32 ReLU, stride 2
 | upconv, number of channels, stride 2
Table 3: Model-specific hyperparameters.
Model | Parameter | Values
β-VAE | |
AnnealedVAE | |
 | iteration threshold |
FactorVAE | |
DIP-VAE-I | |
DIP-VAE-II | |
β-TCVAE | |



Appendix D Data sets and preprocessing
All the data sets contain images whose pixel values lie between fixed lower and upper bounds.
Color-dSprites: Every time we sample a point, we also sample a random scaling for each channel from a uniform distribution.
Noisy-dSprites: Every time we sample a point, we fill the background with uniform noise.
Scream-dSprites: Every time we sample a point, we sample a random patch of The Scream painting. We then change the color distribution by adding a random uniform number to each channel and dividing the result by two. Then, we embed the dSprites shape by inverting the colors of each of its pixels.
Appendix E Changes compared to previous versions
Creating a large-scale experimental study such as the one in this paper is difficult as there are many crucial details to the considered models and metrics. We took a best-effort approach to implement all the models and metrics as close to the original papers and our experimental setup as possible. As we obtain comments from the community and update this preliminary preprint, we document substantial changes to our experimental code and results in this section.
v1 to v2.
We substantially changed the computation of the following disentanglement metrics. For DCI Disentanglement, we switched from an SVM with regularization to gradient boosted trees, as linear models appeared to be too restrictive for estimating discrete ground-truth factors (which may be captured by a model in a single continuous dimension). In the previous version, we further incorrectly computed the SAP score by taking the difference across latent factors. The current version takes the difference across latent codes as suggested in the original paper. Finally, we fixed a bug in the computation of the discrete mutual information which affected both the Mutual Information Gap and the Modularity score. This has changed our discussion in Section 5.3, as with these changes the considered disentanglement metrics seem to be much more correlated.