Introduction
In representation learning, we often have access to high-dimensional observations $\mathbf{x}$ (e.g., images or videos) without additional annotations. However, such observations are often assumed to be the manifestation of a set of low-dimensional ground-truth factors of variation $\mathbf{z}$. For example, the factors of variation in natural images may be pose, content, location of objects, and lighting conditions.
The goal of representation learning is to learn a vector $r(\mathbf{x})$ that is low-dimensional and useful for any downstream task [bengio2013representation]. The key idea of disentangled representations is that they capture the information about the explanatory factors of variation independently: each factor of variation is separately represented in just a few dimensions of the representation [bengio2013representation]. In our example of natural images, we may wish to encode pose, content, location of objects, and lighting conditions separately.
Disentangled representations hold the promise of being interpretable and robust, and of simplifying downstream prediction tasks [bengio2013representation]. Recently, disentanglement has been found useful for a variety of downstream tasks including fair machine learning [locatello2019fairness, creager2019flexibly], abstract visual reasoning tasks [van2019disentangled] and real-world robotic data sets [gondal2019transfer].
State-of-the-art approaches for unsupervised disentanglement learning are largely based on variants of Variational Autoencoders (VAEs) [kingma2013auto] where the encoder is further regularized to encourage disentanglement.
In this paper, we comment on some of the key contributions of [locatello2019challenging]:

We discuss why it is impossible to learn disentangled representations for arbitrary data sets without supervision or inductive biases.

We provide a sober look at the performance of state-of-the-art approaches. We highlight challenges for model selection and identify critical areas for future research.

To facilitate future research and reproducibility, we release a library to train and evaluate disentangled representations on standard benchmark data sets.
Background
For the purpose of disentanglement learning, the world is modeled as a two-step generative process. First, we sample a latent variable $\mathbf{z}$ from a distribution with factorized density $p(\mathbf{z}) = \prod_i p(z_i)$. Each dimension of $\mathbf{z}$ corresponds to an independent factor of variation such as pose, content, location of objects, and lighting conditions in an image. Second, the observations $\mathbf{x}$ are obtained as samples from $p(\mathbf{x} \mid \mathbf{z})$.
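The two-step process can be made concrete with a toy simulation. The following is a minimal sketch, assuming three hypothetical factors (horizontal position, vertical position, and blob width) and a hand-written renderer standing in for the conditional distribution of observations given factors. The function names, factor ranges, and image size are illustrative assumptions and are not taken from [locatello2019challenging].

```python
import numpy as np


def sample_factors(rng, n):
    """Step 1: draw n latent vectors from a factorized prior.

    Each column is sampled independently of the others, so the joint
    density is the product of the per-factor densities.
    (Factor names and ranges are hypothetical.)
    """
    x_pos = rng.uniform(0.0, 1.0, size=n)   # horizontal position
    y_pos = rng.uniform(0.0, 1.0, size=n)   # vertical position
    width = rng.uniform(0.5, 1.5, size=n)   # relative blob width
    return np.stack([x_pos, y_pos, width], axis=1)


def render(z, rng, size=16, noise_std=0.01):
    """Step 2: sample one observation given the factors z.

    A Gaussian blob is drawn at the given position and width, then
    pixel noise is added so that the mapping is stochastic.
    """
    grid = np.linspace(0.0, 1.0, size)
    xx, yy = np.meshgrid(grid, grid)
    x_pos, y_pos, width = z
    sq_dist = (xx - x_pos) ** 2 + (yy - y_pos) ** 2
    image = np.exp(-sq_dist / (2.0 * (0.1 * width) ** 2))
    return image + rng.normal(0.0, noise_std, size=image.shape)


rng = np.random.default_rng(0)
z = sample_factors(rng, 5)                       # ground-truth factors
x = np.stack([render(z_i, rng) for z_i in z])    # observations
print(z.shape, x.shape)  # (5, 3) (5, 16, 16)
```

In the unsupervised setting discussed here, a learner would only ever see `x`; the array `z` is kept solely so that a learned representation could later be compared against the ground-truth factors.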
The goal of disentanglement is to encode the factors of variation independently in a vector $r(\mathbf{x})$. The key idea is that a change in a dimension of the factors $\mathbf{z}$ corresponds to a change in a dimension (or subset of dimensions) of the representation $r(\mathbf{x})$ [bengio2013representation]. This definition has been further extended in the languages of group theory [higgins2018towards] and causality [suter2018interventional].
Metrics
The lack of a formal definition of disentanglement resulted in a variety of different metrics. We assume access to the ground-truth factors $\mathbf{z}$ and characterize the structure of the statistical relations between $\mathbf{z}$ and $r(\mathbf{x})$. Intuitively, we measure how the information about $\mathbf{z}$ is encoded in $r(\mathbf{x})$. The BetaVAE [higgins2016beta] and FactorVAE [kim2018disentangling] scores measure disentanglement by predicting the index of a fixed factor. Other scores are typically composed of two steps. First, they estimate a matrix relating $\mathbf{z}$ and $r(\mathbf{x})$: the Mutual Information Gap (MIG) [chen2018isolating] and Modularity [ridgeway2018learning] estimate the pairwise mutual information matrix, DCI Disentanglement [eastwood2018framework] the feature importance for predicting $\mathbf{z}$ from $r(\mathbf{x})$, and the SAP score [kumar2017variational] the predictability of $\mathbf{z}$ from $r(\mathbf{x})$. Second, this matrix is aggregated to obtain a score by computing some normalized gap either row-wise or column-wise. For more details, see Appendix C of [locatello2019challenging].
Methods
In Variational Autoencoders (VAEs) [kingma2013auto], one assumes a prior $p(\mathbf{z})$ on the latent space and parameterizes the conditional probability $p(\mathbf{x} \mid \mathbf{z})$ using a deep neural network (i.e., a decoder network). The posterior distribution is approximated by a variational distribution $q(\mathbf{z} \mid \mathbf{x})$, again parameterized using a deep neural network (i.e., an encoder network). The model is then trained by maximizing a variational lower bound on the log-likelihood, and the representation is usually taken to be the mean of the encoder distribution. To learn disentangled representations, state-of-the-art approaches enrich the VAE objective with a suitable regularizer.
The β-VAE [higgins2016beta] and AnnealedVAE [burgess2018understanding] constrain the capacity of the VAE bottleneck. The intuition is that recovering the factors of variation is the most efficient compression scheme to achieve good reconstruction [PetJanSch17]. The FactorVAE [kim2018disentangling] and β-TCVAE both penalize the total correlation of the aggregated posterior (i.e., the encoder distribution after marginalizing the training data). The DIP-VAE variants [kumar2017variational] match the moments of the aggregated posterior and a “disentanglement prior”, which in practice is simply a factorized distribution. We refer to Appendix B of [locatello2019challenging] for a more detailed description.
Theoretical impossibility
Theorem 1 in [locatello2019challenging] states that the unsupervised learning of disentangled representations is impossible for arbitrary data sets. Even in the infinite data regime, where supervised learning algorithms such as k-nearest neighbours classifiers are consistent, no model can find a disentangled representation observing samples from $p(\mathbf{x})$ only. This theoretical result motivates the need for either implicit supervision, explicit supervision, or inductive biases.
The key idea is that we can construct two generative models whose latent variables $\mathbf{z}$ and $\hat{\mathbf{z}}$ are entangled with each other but have the same marginal distribution over $\mathbf{x}$, i.e., the same $p(\mathbf{x})$. If a representation is disentangled with respect to one of these generative models, it must be entangled with respect to the other by construction. Observing only samples from $p(\mathbf{x})$, it is impossible to decide which of the two sets of latent variables a representation should disentangle: both $\mathbf{z}$ and $\hat{\mathbf{z}}$ are equally plausible and “look the same” as they produce the same $\mathbf{x}$ with the same probability.
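A classical special case makes this argument tangible: a standard Gaussian prior is invariant under rotation. The sketch below is an illustration under that assumption, not the general construction behind Theorem 1, and the decoder is a hypothetical fixed nonlinear map chosen only for the example. It builds two generative models that emit exactly the same observations while their latent variables are mixed by a 45-degree rotation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Model 1: latent variables with a factorized standard Gaussian prior.
z = rng.standard_normal((n, 2))

# Model 2: latents obtained by rotating z by 45 degrees. A rotated
# isotropic Gaussian is again an isotropic Gaussian, so z_hat has the
# same factorized prior as z.
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
z_hat = z @ R.T  # row-wise z_hat = R z


def decoder_1(latent):
    """Hypothetical decoder of model 1 (any fixed map would do)."""
    weights = np.array([[1.0, 0.5], [-0.3, 2.0]])
    return np.tanh(latent @ weights)


def decoder_2(latent):
    """Decoder of model 2: undo the rotation, then apply decoder_1."""
    return decoder_1(latent @ R)


x_model_1 = decoder_1(z)
x_model_2 = decoder_2(z_hat)

# Both models generate identical observations ...
print(np.allclose(x_model_1, x_model_2))
# ... both priors are (empirically) factorized with unit variance ...
print(np.round(np.cov(z_hat, rowvar=False), 2))
# ... yet every dimension of z_hat depends on every dimension of z.
cross_corr = np.corrcoef(z_hat.T, z.T)[:2, 2:]
print(np.round(cross_corr, 2))
```

A representation that aligns with the axes of `z` is necessarily a mixture of the axes of `z_hat`, and the observations alone give no reason to prefer one model over the other.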
Note that Theorem 1 in [locatello2019challenging] does not account for the structure that real-world generative models may exhibit. Inductive biases on both the models and the data may be sufficient to learn disentangled representations in practice, as certain solutions may be favored over others, i.e., some model may naturally converge to a solution that disentangles the true $\mathbf{z}$ instead of $\hat{\mathbf{z}}$. Similar results have been obtained in the context of nonlinear ICA [hyvarinen1999nonlinear], where i.i.d. data is known to be insufficient for identifiability in general.
Implications
We proved that the unsupervised learning of disentangled representations is in general impossible without inductive biases on both methods and data sets. We argue that future work should make the role of inductive biases or supervision more explicit.
Disentanglement in practice
In this section, we highlight the implications of some of the empirical results of [locatello2019challenging]. We implemented six recent unsupervised disentanglement learning methods as well as six disentanglement metrics from scratch. Overall, we trained over models and computed over scores on seven data sets and 50 random seeds (reproducing our results requires approximately 2.52 GPU years on NVIDIA P100 GPUs). We refer to Section 5 of [locatello2019challenging] for more details and a richer quantitative description.
Figure 1: … TCVAE, 3=DIP-VAE-I, 4=DIP-VAE-II, 5=AnnealedVAE. The variance is due to different hyperparameters and random seeds. We observe that the scores are heavily overlapping. (center) FactorVAE score vs. hyperparameters for each score on Cars3D. There seems to be no model dominating all the others, and for each model there does not seem to be a consistent strategy for choosing the regularization strength. (right) Rank correlation of the DCI Disentanglement metric across different data sets. Good hyperparameters seem to transfer especially between dSprites and Color-dSprites.
Which method should be used?
This first question is particularly relevant for practitioners interested in the benefits of disentanglement methods off-the-shelf. In Figure 1 (left), we observe that the choice of the objective function seems to matter less than the choice of hyperparameters and seed. In particular, only of the observed variance in the models can be explained by the choice of the objective function. Since our trained models exhibit such a large variance, it appears to be crucial to identify good hyperparameters and runs.
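One common way to quantify how much of the spread in scores a categorical choice accounts for is the ratio of between-group to total sum of squares (often called eta squared). The sketch below illustrates that quantity; it is not necessarily the analysis performed in [locatello2019challenging], and the function name and toy numbers are hypothetical.

```python
import numpy as np


def variance_explained(scores, groups):
    """Fraction of the variance in `scores` explained by `groups`.

    Computed as between-group sum of squares divided by total sum of
    squares. Returns a value in [0, 1]; assumes the scores are not all
    identical (otherwise the total sum of squares is zero).
    """
    scores = np.asarray(scores, dtype=float)
    groups = np.asarray(groups)
    grand_mean = scores.mean()
    total_ss = np.sum((scores - grand_mean) ** 2)
    between_ss = 0.0
    for g in np.unique(groups):
        members = scores[groups == g]
        between_ss += len(members) * (members.mean() - grand_mean) ** 2
    return float(between_ss / total_ss)


# Toy illustration with made-up scores for two hypothetical methods.
method = np.array([0, 0, 1, 1])
# The method determines the score completely:
print(variance_explained([0.1, 0.1, 0.9, 0.9], method))  # 1.0
# The method tells us nothing; all spread comes from other sources:
print(variance_explained([0.1, 0.9, 0.1, 0.9], method))  # 0.0
```

A low value for the objective function, as reported above, means that most of the spread in scores comes from the remaining sources, namely hyperparameters and random seeds.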
Implications
It is not clear which method should be used; choosing good hyperparameters and selecting good runs seem to matter more.
How to choose the hyperparameters?
We investigated whether we may find “rules of thumb” for selecting good hyperparameters. In Figure 1 (center), we plot the median FactorVAE score for different regularization strengths for each method on Cars3D. We observe that no method is consistently better than all the others and there does not seem to be an obvious trend that can be used to maximize disentanglement scores. In Figure 1 (right), we test whether good hyperparameter settings may be transferred across data sets. We observe that at the distribution level there appears to be some correlation between the disentanglement scores across the different data sets.
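To make the transfer test concrete, the following sketch computes a Spearman rank correlation between the scores that the same hyperparameter settings obtain on two data sets. It is a simplified illustration: the helper names and the score values are hypothetical, and the ranking helper assumes there are no tied scores.

```python
import numpy as np


def ranks(values):
    """Rank of each entry (0 = smallest). Assumes no ties."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)
    result = np.empty(len(values), dtype=float)
    result[order] = np.arange(len(values))
    return result


def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return float(np.corrcoef(ranks(a), ranks(b))[0, 1])


# Made-up disentanglement scores of five hyperparameter settings,
# evaluated on two hypothetical data sets A and B.
scores_on_a = np.array([0.31, 0.42, 0.55, 0.47, 0.60])
scores_on_b = np.array([0.28, 0.35, 0.52, 0.50, 0.49])

rho = spearman(scores_on_a, scores_on_b)
print(f"rank correlation between data sets: {rho:.2f}")
```

A value close to 1 would indicate that settings that rank well on one data set also rank well on the other. Note that this only describes the ordering of settings and says nothing about whether an individual training run with a given setting turns out well.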
Implications
There is no clear rule of thumb, but transfer across data sets may help. Note that we still cannot distinguish between a good and a bad training run.
How to select the best model from a set of trained models?
First, we note that the transfer of hyperparameters does not reliably outperform random model selection: it improves only of the times. To understand why this is the case, we plot in Figure 2 (left) the distribution of FactorVAE models evaluated with the FactorVAE score on Cars3D. We observe that randomness has a substantial impact on the representation, as a good run with bad hyperparameters can easily outperform a bad run with the best hyperparameters. Finally, we check whether the unsupervised training metrics may be used for model selection. In Figure 2 (right), we observe that the training metrics appear to be rather uncorrelated with disentanglement.
Implications
Unsupervised model selection remains an open research challenge. Transfer of good hyperparameters does not seem to work and we did not find a way to distinguish between good and bad runs without supervision.
Directions of future research
Finally, we discuss the critical open challenges in disentanglement and some of the lessons we learned with this study.
Inductive biases and implicit and explicit supervision.
Our results highlight an overall need for supervision. In theory, inductive biases are crucial to distinguish among equally plausible generative models. In practice, we did not find a reliable strategy to choose hyperparameters without supervision. Recent work [duan2019heuristic] proposed a stability-based heuristic for unsupervised model selection. Further exploring these techniques may help us understand the practical role of inductive biases and implicit supervision. Otherwise, we advocate considering different settings, for example when limited explicit [locatello2019disentangling] or weak supervision [bouchacourt2017multi, gresele2019incomplete] is available.
Experimental setup and diversity of data sets.
Our study highlights the need for a sound, robust, and reproducible experimental setup on a diverse set of data sets. In our experiments, we observed that the results may be easily misinterpreted if one only looks at a subset of the data sets. As current research is typically focused on the synthetic data sets of [higgins2016beta, reed2015deep, lecun2004learning, kim2018disentangling, locatello2019challenging], with only a few recent exceptions [gondal2019transfer], we advocate for insights that generalize across data sets rather than individual absolute performance. For this reason, we released disentanglement_lib (https://github.com/google-research/disentanglement_lib), a library to facilitate reproducible research on disentanglement. Our library allows users to train and evaluate state-of-the-art disentangled representations on common benchmark data sets and produces automatic visualizations for visual inspection of all the trained models. Furthermore, we released over trained models, which can be used as baselines for future research.
Acknowledgements
The authors thank Ilya Tolstikhin, Paul Rubenstein and Josip Djolonga for helpful discussions and comments. This research was partially supported by the Max Planck ETH Center for Learning Systems, by an ETH core grant (to Gunnar Rätsch) and a Google Ph.D. Fellowship to FL. This work was partially done while FL was at Google Research Zurich.