In representation learning, we often have access to high-dimensional observations (e.g., images or videos) without additional annotations. However, such observations are often assumed to be the manifestation of a set of low-dimensional ground-truth factors of variation. For example, the factors of variation in natural images may be pose, content, location of objects, and lighting conditions.
The goal of representation learning is to learn a vector representation $r(x)$ of the observations $x$ that is low-dimensional and useful for any downstream task [bengio2013representation]. The key idea of disentangled representations is that they capture the information about the explanatory factors of variation independently: each factor of variation is separately represented in just a few dimensions of the representation [bengio2013representation]. In our example of natural images, we may wish to encode pose, content, location of objects, and lighting conditions separately.
Disentangled representations hold the promise of being interpretable and robust, and of simplifying downstream prediction tasks [bengio2013representation]. Recently, disentanglement has been found useful for a variety of downstream tasks, including fair machine learning [locatello2019fairness, creager2019flexibly], abstract visual reasoning [van2019disentangled], and real-world robotic data sets [gondal2019transfer].
State-of-the-art approaches for unsupervised disentanglement learning are largely based on variants of Variational Autoencoders (VAEs) [kingma2013auto], where the encoder is further regularized to encourage disentanglement.
In this paper, we comment on some of the key contributions of [locatello2019challenging]:
We discuss why it is impossible to learn disentangled representations for arbitrary data sets without supervision or inductive biases.
We provide a sober look at the performances of state-of-the-art approaches. We highlight challenges for model selection and identify critical areas for future research.
To facilitate future research and reproducibility, we release a library to train and evaluate disentangled representations on standard benchmark data sets.
For the purpose of disentanglement learning, the world is modeled as a two-step generative process. First, we sample a latent variable $z$ from a distribution $P(z)$ with factorized density $p(z) = \prod_i p(z_i)$. Each dimension of $z$ corresponds to an independent factor of variation such as pose, content, location of objects, and lighting conditions in an image. Second, the observations $x$ are obtained as samples from $P(x|z)$.
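To make the two-step process concrete, here is a toy numerical sketch (our own illustration, not the paper's code; the "renderer" and all names are assumptions): $z$ is drawn from a factorized density and $x$ is a fixed nonlinear function of $z$ plus noise.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_factors(n, d=4):
    """Step 1: z ~ P(z) with factorized density p(z) = prod_i p(z_i)."""
    return rng.normal(size=(n, d))  # independent dimensions

def sample_observations(z):
    """Step 2: x ~ P(x|z), here a fixed nonlinear 'renderer' plus noise."""
    W = np.arange(z.shape[1] * 16).reshape(z.shape[1], 16) % 7 - 3.0
    return np.tanh(z @ W) + 0.01 * rng.normal(size=(z.shape[0], 16))

z = sample_factors(1000)
x = sample_observations(z)
# The learner only ever sees x; the ground-truth factors z stay hidden.
```

The key property is that the dimensions of $z$ are statistically independent, while the observed $x$ mixes all of them.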
The goal of disentanglement is to encode the factors of variation in a vector $r(x)$ independently. The key idea is that a change in a single dimension of $z$ corresponds to a change in a dimension (or subset of dimensions) of $r(x)$ [bengio2013representation]. This definition has been further extended in the languages of group theory [higgins2018towards] and causality [suter2018interventional].
The lack of a formal definition of disentanglement has resulted in a variety of different metrics. We assume access to the ground-truth factors $z$ and characterize the structure of the statistical relations between $z$ and $r(x)$. Intuitively, we measure how the information about $z$ is encoded in $r(x)$. The BetaVAE [higgins2016beta] and FactorVAE [kim2018disentangling] scores measure disentanglement by predicting the index of a fixed factor of variation. Other scores are typically composed of two steps: first, they estimate a matrix relating $z$ and $r(x)$. The Mutual Information Gap (MIG) [chen2018isolating] and Modularity [ridgeway2018learning] estimate the pairwise mutual information matrix, DCI Disentanglement [eastwood2018framework] the feature importance when predicting $z$ from $r(x)$, and the SAP score [kumar2017variational] the predictability of $z$ from $r(x)$. Second, this matrix is aggregated into a score by computing some normalized gap either row- or column-wise. For more details, see Appendix C of [locatello2019challenging].
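The two-step recipe can be sketched for an MIG-style score (a hedged, minimal implementation of ours, not the one used in the paper): estimate a histogram-based mutual-information matrix between code dimensions and factors, then aggregate the normalized gap between the two most informative code dimensions per factor.

```python
import numpy as np

def discrete_mi(a, b, bins=20):
    """Histogram estimate of the mutual information between two 1-D variables."""
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    p = joint / joint.sum()
    px, py = p.sum(1, keepdims=True), p.sum(0, keepdims=True)
    nz = p > 0
    return float((p[nz] * np.log(p[nz] / (px @ py)[nz])).sum())

def mig(z, r, bins=20):
    """MIG sketch: for each factor z_j, the gap between the two code
    dimensions most informative about it, normalized by H(z_j)."""
    mi = np.array([[discrete_mi(r[:, i], z[:, j], bins)
                    for j in range(z.shape[1])] for i in range(r.shape[1])])
    gaps = []
    for j in range(z.shape[1]):
        order = np.sort(mi[:, j])[::-1]          # column-wise aggregation
        h, _ = np.histogram(z[:, j], bins=bins)  # entropy of the binned factor
        p = h / h.sum()
        hz = -(p[p > 0] * np.log(p[p > 0])).sum()
        gaps.append((order[0] - order[1]) / hz)
    return float(np.mean(gaps))
```

A code that copies each factor into exactly one dimension gets a score near 1; a code unrelated to the factors scores near 0.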
In Variational Autoencoders (VAEs) [kingma2013auto], one assumes a prior $P(z)$ on the latent space and parameterizes the conditional probability $P(x|z)$ using a deep neural network (i.e., a decoder network). The posterior distribution $P(z|x)$ is approximated by a variational distribution $Q(z|x)$, again parameterized using a deep neural network (i.e., an encoder network). The model is then trained by maximizing a variational lower bound on the log-likelihood, and the representation is usually taken to be the mean of the encoder distribution. To learn disentangled representations, state-of-the-art approaches enrich the VAE objective with a suitable regularizer.
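As an illustration of such a regularized objective (our notation, not a reference implementation), the sketch below computes a negative ELBO for a Gaussian encoder and a unit Gaussian prior, where the KL term is closed form and `beta` is an assumed regularization weight; `beta=1` recovers the plain VAE bound.

```python
import numpy as np

def gaussian_kl(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), per example."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=1)

def regularized_vae_loss(x, x_recon, mu, logvar, beta=4.0):
    """Negative ELBO with the KL term up-weighted by beta (beta=1: plain VAE)."""
    recon = np.sum((x - x_recon) ** 2, axis=1)  # Gaussian decoder, fixed variance
    return float(np.mean(recon + beta * gaussian_kl(mu, logvar)))
```

In a real model, `mu` and `logvar` come from the encoder network and `x_recon` from the decoder; here they are plain arrays for clarity.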
The $\beta$-VAE [higgins2016beta] and AnnealedVAE [burgess2018understanding] constrain the capacity of the VAE bottleneck. The intuition is that recovering the factors of variation is the most efficient compression scheme that achieves good reconstructions [PetJanSch17]. The FactorVAE [kim2018disentangling] and $\beta$-TCVAE [chen2018isolating] both penalize the total correlation of the aggregated posterior (i.e., the encoder distribution after marginalizing over the training data). The DIP-VAE variants [kumar2017variational] match the moments of the aggregated posterior and a “disentanglement prior”, which in practice is simply a factorized distribution. We refer to Appendix B of [locatello2019challenging] for a more detailed description.
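The moment-matching idea can be sketched as follows (a simplified version of ours, not the official DIP-VAE code; `lambda_od` and `lambda_d` are our names for the off-diagonal and diagonal weights): penalize the deviation of the covariance of the aggregated posterior means from the identity, i.e., from a factorized prior.

```python
import numpy as np

def dip_penalty(mu, lambda_od=10.0, lambda_d=1.0):
    """DIP-VAE-I-style regularizer (sketch): push the covariance of the
    aggregated posterior means toward the identity matrix."""
    cov = np.cov(mu, rowvar=False)
    off_diag = cov - np.diag(np.diag(cov))
    return float(lambda_od * np.sum(off_diag ** 2)
                 + lambda_d * np.sum((np.diag(cov) - 1.0) ** 2))
```

Codes whose dimensions are correlated across the data set pay a large off-diagonal penalty; already-factorized codes pay almost none.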
Theorem 1 in [locatello2019challenging] states that the unsupervised learning of disentangled representations is impossible for arbitrary data sets. Even in the infinite-data regime, where supervised learning algorithms such as k-nearest-neighbour classifiers are consistent, no model can find a disentangled representation by observing samples from $P(x)$ only. This theoretical result motivates the need for either implicit supervision, explicit supervision, or inductive biases.
The key idea is that we can construct two generative models whose latent variables $z$ and $\hat{z}$ are entangled with each other but have the same marginal distribution over the observations $x$, i.e., the same $P(x)$. If a representation is disentangled with respect to one of these generative models, it must be entangled with respect to the other by construction. Observing only samples from $P(x)$, it is impossible to distinguish which model should be disentangled: both $z$ and $\hat{z}$ are equally plausible and “look the same”, as they produce the same $x$ with the same probability.
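For intuition, consider Gaussian latents: if $z \sim \mathcal{N}(0, I)$ and $\hat{z} = Rz$ for a rotation $R$, then $\hat{z}$ has exactly the same distribution, so a decoder composed with $R^{-1}$ induces the same $P(x)$, yet every coordinate of $\hat{z}$ mixes several coordinates of $z$. A numerical sketch of this construction (ours, not the paper's proof):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=(100000, 2))  # z ~ N(0, I), factorized

theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
z_hat = z @ R.T  # entangled w.r.t. z: each dimension of z_hat mixes both z_j

# z and z_hat have the same (standard normal) distribution, so any decoder
# produces the same P(x) from either parameterization.
```

A representation axis-aligned with $z$ is maximally entangled with $\hat{z}$ at $\theta = \pi/4$, and no observer of $x$ alone can tell the two models apart.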
Note that Theorem 1 in [locatello2019challenging] does not account for the structure that real-world generative models may exhibit. Inductive biases on both the models and the data may be sufficient to learn disentangled representations in practice, as certain solutions may be favored over others, i.e., some models may naturally converge to a solution that disentangles the true $z$ instead of $\hat{z}$. Similar results have been obtained in the context of non-linear ICA [hyvarinen1999nonlinear], where i.i.d. data alone is known to be insufficient for identifiability in general.
We proved that the unsupervised learning of disentangled representations is in general impossible without inductive biases on both methods and data sets. We argue that future work should make the role of inductive biases or supervision more explicit.
Disentanglement in practice
In this section, we highlight the implications of some of the empirical results of [locatello2019challenging]. We implemented six recent unsupervised disentanglement learning methods as well as six disentanglement metrics from scratch. Overall, we trained a large number of models, and computed each of the metrics for all of them, on seven data sets with 50 random seeds. (Reproducing our results requires approximately 2.52 GPU years on NVIDIA P100 GPUs.) We refer to Section 5 of [locatello2019challenging] for more details and a richer quantitative description.
Figure 1: (left, legend partially truncated) …$\beta$-TCVAE, 3=DIP-VAE-I, 4=DIP-VAE-II, 5=AnnealedVAE. The variance is due to different hyperparameters and random seeds; we observe that the scores are heavily overlapping. (center) FactorVAE score vs. regularization strength for each method on Cars3D. No method appears to dominate all the others, and for each method there does not seem to be a consistent strategy for choosing the regularization strength. (right) Rank correlation of the DCI Disentanglement metric across different data sets. Good hyperparameters seem to transfer, especially between dSprites and Color-dSprites.
Which method should be used?
This first question is particularly relevant for practitioners interested in using disentanglement methods off-the-shelf. In Figure 1 (left), we observe that the choice of the objective function seems to matter less than the choice of hyperparameters and random seed. In particular, only a small fraction of the observed variance in the models' scores can be explained by the choice of the objective function. Since our trained models exhibit such a large variance, it appears crucial to identify good hyperparameters and good runs.
It is not clear which method should be used; choosing good hyperparameters and selecting good runs seem to matter more.
How to choose the hyperparameters?
We investigated whether we may find “rules of thumb” for selecting good hyperparameters. In Figure 1 (center), we plot the median FactorVAE score for different regularization strengths for each method on Cars3D. We observe that no method is consistently better than all the others and there does not seem to be an obvious trend that can be used to maximize disentanglement scores. In Figure 1 (right), we test whether good hyperparameter settings may be transferred across data sets. We observe that at the distribution level there appears to be some correlation between the disentanglement scores across the different data sets.
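The transfer analysis amounts to rank-correlating, over hyperparameter configurations, the scores obtained on two data sets. A minimal sketch with hypothetical scores (both the helper and the numbers below are ours, not from the paper):

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks (no ties)."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    return float(np.corrcoef(ra, rb)[0, 1])

# Hypothetical disentanglement scores of the same six hyperparameter
# configurations evaluated on two data sets.
scores_dsprites = np.array([0.31, 0.55, 0.42, 0.60, 0.28, 0.47])
scores_color   = np.array([0.33, 0.58, 0.44, 0.65, 0.30, 0.40])

rho = spearman(scores_dsprites, scores_color)  # ~0.94 for these numbers
```

A high `rho` means configurations that rank well on one data set tend to rank well on the other, i.e., hyperparameters transfer at the distribution level even though individual scores differ.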
There is no clear rule of thumb, but transfer across data sets may help. Note that we still cannot distinguish between a good and a bad training run.
How to select the best model from a set of trained models?
First, we note that the transfer of hyperparameters does not reliably outperform random model selection: it yields an improvement only part of the time. To understand why this is the case, we plot in Figure 2 (left) the distribution of FactorVAE models evaluated with the FactorVAE score on Cars3D. We observe that randomness has a substantial impact on the representation: a good run with bad hyperparameters can easily outperform a bad run with the best hyperparameters. Finally, we check whether the unsupervised training metrics may be used for model selection. In Figure 2 (right), we observe that the training metrics appear to be largely uncorrelated with disentanglement.
Unsupervised model selection remains an open research challenge. Transfer of good hyperparameters does not seem to work and we did not find a way to distinguish between good and bad runs without supervision.
Directions of future research
Finally, we discuss the critical open challenges in disentanglement research and some of the lessons learned from this study.
Inductive biases and implicit and explicit supervision.
Our results highlight an overall need for supervision. In theory, inductive biases are crucial to distinguish between equally plausible generative models. In practice, we did not find a reliable strategy for choosing hyperparameters without supervision. Recent work [duan2019heuristic]
proposed a stability-based heuristic for unsupervised model selection. Further exploring such techniques may help us understand the practical role of inductive biases and implicit supervision. Otherwise, we advocate considering different settings, for example when limited explicit [locatello2019disentangling] or weak supervision [bouchacourt2017multi, gresele2019incomplete] is available.
Experimental setup and diversity of data sets.
Our study highlights the need for a sound, robust, and reproducible experimental setup on a diverse set of data sets. In our experiments, we observed that results may easily be misinterpreted if one only looks at a subset of the data sets. As current research typically focuses on the synthetic data sets of [higgins2016beta, reed2015deep, lecun2004learning, kim2018disentangling, locatello2019challenging] (with only a few recent exceptions [gondal2019transfer]), we advocate for insights that generalize across data sets rather than individual absolute performance. For this reason, we released disentanglement_lib (https://github.com/google-research/disentanglement_lib), a library to facilitate reproducible research on disentanglement. Our library makes it possible to train and evaluate state-of-the-art disentangled representations on common benchmark data sets, and it produces automatic visualizations for visually inspecting all trained models. Furthermore, we released a large set of trained models, which can be used as baselines for future research.
The authors thank Ilya Tolstikhin, Paul Rubenstein and Josip Djolonga for helpful discussions and comments. This research was partially supported by the Max Planck ETH Center for Learning Systems, by an ETH core grant (to Gunnar Rätsch) and a Google Ph.D. Fellowship to FL. This work was partially done while FL was at Google Research Zurich.