1 Introduction
In machine learning, it is commonly assumed that high-dimensional observations (such as images) are the manifestation of a low-dimensional latent variable of ground-truth factors of variation bengio2013representation ; kulkarni2015deep ; chen2016infogan ; tschannen2018recent . More specifically, one often assumes that there is a distribution over these latent variables and that observations in this ground-truth model are generated in two steps: the latent variables are first sampled from this distribution, and the observations are then sampled from a conditional distribution given the latent variables. The goal of disentanglement learning is to find a representation of the data which captures all the ground-truth factors of variation independently. The hope is that such representations will be interpretable, maximally compact, allow for counterfactual reasoning and be useful for a large variety of downstream tasks bengio2013representation ; PetJanSch17 ; lecun2015deep ; bengio2007scaling ; schmidhuber1992learning ; lake2017building ; goodfellow2009measuring ; lenc2015understanding ; tschannen2018recent ; higgins2018towards ; suter2018interventional . In particular, we hope to learn a disentangled representation with as little supervision as possible so that all the available labels can be used to learn downstream tasks bengio2013representation ; scholkopf2012causal ; PetJanSch17 ; pearl2009causality ; SpiGlySch93 .
Current state-of-the-art unsupervised disentanglement approaches enrich the Variational Autoencoder (VAE) kingma2013auto objective with different unsupervised regularizers that aim to encourage disentangled representations higgins2016beta ; burgess2018understanding ; kim2018disentangling ; chen2018isolating ; kumar2017variational ; rubenstein2018learning ; mathieu2019disentangling ; rolinek2018variational . While these approaches can find disentangled representations if one trains a lot of different models, there is a large variance across these models and it appears hard to identify the ones with disentangled representations without supervision locatello2018challenging . This is consistent with the theoretical result of locatello2018challenging that the unsupervised learning of disentangled representations is impossible without inductive biases.
While visual inspection can be used to select good model runs and hyperparameters, we argue that such supervision should be made explicit. We hence consider the practically realistic setting where one has access to labels for the latent variables for a very limited number of observations, for example through human labeling. Even though this setting is not universally applicable (e.g. when the observations are not human-interpretable or the ground-truth factors are unknown), we argue that it is broad enough to warrant investigation. Furthermore, while the true ground-truth model may be unknown and non-unique locatello2018challenging , the considered setting allows us to encode additional knowledge and implicit biases into the learned representation via the labels.
In this paper, we first investigate whether access to limited labels allows us to reliably perform model selection of current state-of-the-art unsupervised disentanglement methods. Second, we explore whether it is more beneficial to incorporate the limited amount of labels into training. For this purpose, we perform a reproducible large-scale experimental study (reproducing these experiments requires approximately 4.73 GPU years on NVIDIA P100), training a large number of models on four different data sets. We found that unsupervised training with supervised validation enables reliable learning of disentangled representations. On the other hand, using some of the labeled data for training is beneficial both in terms of disentanglement and downstream performance. Overall, we show that a very small amount of supervision is enough to learn disentangled representations as illustrated in Figure 1. Our key contributions can be summarized as follows:


We observe that some of the existing disentanglement metrics (which require observations of the ground-truth factors) can be used to tune the hyperparameters of unsupervised methods even when only very few labeled examples are available (Section 3). Therefore, training a large number of models and introducing supervision to select the good runs is a viable solution to overcome the impossibility result of locatello2018challenging .

We find that adding a simple supervised loss, using only a very small number of labeled examples, outperforms unsupervised training with supervised model validation both in terms of disentanglement scores and downstream performance (Section 4.2). Further, the inductive bias given by the ordinal information of the factors of variation is shown to be useful for learning disentangled representations (Section 4.4). This result empirically validates the importance of inductive biases in disentanglement learning as theoretically claimed in locatello2018challenging ; tschannen2018recent .

We discover that neither unsupervised training with supervised validation nor semi-supervised training needs precise labels; imprecise approximations are sufficient (Sections 3.2 and 4.3). Furthermore, binning may have a regularizing effect and can improve the robustness of certain metrics when only very few labels are available.
2 Background and related work
Consider a generative model with a latent variable that has a factorized density, and observations obtained as samples from the conditional distribution given this latent variable. Intuitively, the goal of disentanglement learning is to find a representation separating the factors of variation into independent components so that a change in a single factor of variation corresponds to a change in a single dimension of the representation bengio2013representation . Refinements of this definition include disentangling independent groups in the topological sense higgins2018towards and learning disentangled causal models suter2018interventional . These definitions are reflected in various disentanglement metrics that aim at measuring some structural property of the statistical dependencies between the factors of variation and the representation.
The BetaVAE score higgins2016beta measures disentanglement as the accuracy of a linear classifier that predicts the index of a fixed factor of variation. The FactorVAE score kim2018disentangling replaces the linear classifier of the BetaVAE score with a majority vote classifier on the relative variance of each dimension of the representation when a factor of variation is fixed. The Mutual Information Gap (MIG) chen2018isolating computes for each factor of variation the normalized gap in mutual information between the two coordinates of the representation with the highest and second highest mutual information with that factor. The Modularity ridgeway2018learning computes the mutual information between each coordinate of the representation and each factor of variation and measures if each dimension of the representation depends on at most one factor of variation. The Disentanglement metric of eastwood2018framework (which we call DCI Disentanglement following locatello2018challenging ) computes disentanglement based on the entropy of the feature importance (quantified e.g. via random forest) of each dimension of the representation for predicting the factors of variation. Finally, the SAP score kumar2017variational trains a classifier on each dimension of the representation predicting each factor of variation and then computes the average gap in the prediction error of the two most predictive latent dimensions for each factor.
Since all these metrics require access to labels, they cannot be used for unsupervised training. Many state-of-the-art unsupervised disentanglement methods therefore extend VAEs kingma2013auto with a regularizer that enforces structure in the latent space of the VAE induced by the encoding distribution, with the hope that this leads to disentangled representations. These approaches higgins2016beta ; burgess2018understanding ; kim2018disentangling ; chen2018isolating ; kumar2017variational can be cast under the following optimization template:
(1) 
The β-VAE higgins2016beta and AnnealedVAE burgess2018understanding reduce the capacity of the VAE bottleneck under the assumption that encoding the factors of variation is the most efficient way to achieve a good reconstruction PetJanSch17 . The FactorVAE kim2018disentangling and β-TCVAE both penalize the total correlation of the aggregated posterior (i.e. the encoder distribution after marginalizing the training data). The DIP-VAE variants kumar2017variational match the moments of the aggregated posterior and a factorized distribution. We refer to Appendix B of locatello2018challenging and Section 3 of tschannen2018recent for a more detailed description of these regularizers.
While there has also been work on semi-supervised disentanglement learning reed2014learning ; cheung2014discovering ; mathieu2016disentangling ; narayanaswamy2017learning ; kingma2014semi ; klys2018learning , these methods aim to disentangle only some observed factors of variation from the other latent variables, which themselves remain entangled. Furthermore, if the observed factors of variation and the observations are confounded by a latent variable, the structure is known to not be identifiable PetJanSch17 ; d2019multi ; suter2018interventional . In contrast, we consider the setting where one has access to an extremely limited number of fully labeled observations and a large number of unlabeled observations. Exploiting relational information or knowledge of the effect of the factors of variation has also been studied qualitatively as a way to learn disentangled representations hinton2011transforming ; cohen2014learning ; karaletsos2015bayesian ; goroshin2015learning ; whitney2016understanding ; fraccaro2017disentangled ; denton2017unsupervised ; hsu2017unsupervised ; li2018disentangled ; locatello2018clustering ; kulkarni2015deep ; ruiz2019learning ; bouchacourt2017multi . These are not limiting assumptions, especially for sequential data or reinforcement learning thomas2018disentangling ; steenbrugge2018improving ; laversanne2018curiosity ; nair2018visual ; higgins2017darla ; higgins2018scan . However, most of these works did not quantitatively measure disentanglement, as they see disentanglement as a tool to achieve some downstream goal. While a quantitative comparison of these methods in terms of disentanglement and sample complexity is an interesting research direction, it is beyond the scope of this paper.
Other related work. Due to the lack of a commonly accepted formal definition, the term “disentangled representations” has been used in very different lines of work. There is for example a rich literature in disentangling pose from content in 3D objects and content from motion in videos yang2015weakly ; li2018disentangled ; hsieh2018learning ; fortuin2018deep ; deng2017factorized ; goroshin2015learning ; hyvarinen2016unsupervised . This can be achieved with different degrees of supervision, ranging from fully unsupervised to semi-supervised. Another line of work aims at disentangling class labels from latent variables by assuming the existence of a causal model where the latent variable has an arbitrary factorization with the class variable. In this setting, the class variable is partially observed reed2014learning ; cheung2014discovering ; mathieu2016disentangling ; narayanaswamy2017learning ; kingma2014semi ; klys2018learning . Without further assumptions on the structure of the graphical model, this is equivalent to partially observed factors of variation with latent confounders. Except for very special cases, the recovery of the structure of the generative model is known to be impossible with purely observational data PetJanSch17 ; d2019multi ; suter2018interventional . Here, we intend to disentangle factors of variation in the sense of bengio2013representation ; suter2018interventional ; higgins2018towards .
We aim at separating the effects of all factors of variation, which translates to learning a representation with independent components. This problem has already been studied extensively in the nonlinear ICA literature comon1994independent ; bach2002kernel ; jutten2003advances ; hyvarinen2016unsupervised ; hyvarinen1999nonlinear ; hyvarinen2018nonlinear ; the impossibility result of locatello2018challenging may therefore not be surprising.
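As an illustration of the metrics described in this section, the following is a minimal sketch of the Mutual Information Gap, assuming discrete factor labels and latent codes that are discretized with equal-width histograms. The function names, the binning scheme and the number of bins are assumptions made for this example and are not taken from the evaluation code used in the study.

```python
import numpy as np


def discrete_mutual_info(a, b):
    """Mutual information (in nats) between two discrete 1-D arrays."""
    _, a_idx = np.unique(a, return_inverse=True)
    _, b_idx = np.unique(b, return_inverse=True)
    joint = np.zeros((a_idx.max() + 1, b_idx.max() + 1))
    np.add.at(joint, (a_idx, b_idx), 1.0)  # empirical joint counts
    joint /= joint.sum()
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / (pa @ pb)[nz])))


def discrete_entropy(a):
    """Entropy (in nats) of a discrete 1-D array."""
    _, counts = np.unique(a, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))


def mutual_information_gap(codes, factors, num_bins=20):
    """codes: (n, latent_dim) floats, latent_dim >= 2.
    factors: (n, num_factors) discrete labels, each taking >= 2 values."""
    # Assumption: each latent coordinate is discretized into equal-width bins.
    binned = np.stack(
        [np.digitize(c, np.histogram_bin_edges(c, bins=num_bins)[1:-1])
         for c in codes.T],
        axis=1,
    )
    gaps = []
    for k in range(factors.shape[1]):
        mi = np.array([discrete_mutual_info(binned[:, j], factors[:, k])
                       for j in range(binned.shape[1])])
        top = np.sort(mi)[::-1]
        # Gap between the two most informative coordinates, normalized by
        # the entropy of the factor.
        gaps.append((top[0] - top[1]) / discrete_entropy(factors[:, k]))
    return float(np.mean(gaps))
```

With codes that copy one factor per coordinate the score should be close to one, whereas codes that mix two factors symmetrically (for example their sum and difference) should score close to zero.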
3 Unsupervised training with supervised model selection
The impossibility result of locatello2018challenging states that for a factorized prior one can construct infinitely many generative models that are all entangled with each other. This implies that there are many equally plausible generative models and an unsupervised method cannot distinguish between them without further inductive biases. While state-of-the-art unsupervised models can find disentangled representations, there is a large variance in the disentanglement of representations when different random seeds are used. At the same time, it appears hard to identify well-disentangled models in a purely unsupervised fashion locatello2018challenging .
In this section, we investigate whether commonly used disentanglement metrics can be used to identify good models if we have a very small number of labeled observations available. While existing metrics are often evaluated using a much larger number of labeled examples, it might be feasible in many practical settings to annotate 100 or 1000 data points and use them to obtain a disentangled representation. At the same time, it is unclear whether such an approach would work, as existing disentanglement metrics have been found to be noisy (even with more samples) kim2018disentangling . Finally, we explicitly note that the impossibility result of locatello2018challenging does not apply in this setting as we do observe samples of the ground-truth factors.
3.1 Experimental setup
Data sets. We consider four commonly used disentanglement data sets where one has explicit access to the ground-truth generative model and the factors of variation: dSprites higgins2016beta , Cars3D reed2015deep , SmallNORB lecun2004learning and Shapes3D kim2018disentangling . Following locatello2018challenging , we consider the statistical setting where one directly samples from the generative model, effectively sidestepping the issue of empirical risk minimization and overfitting. For each data set, we assume that either 100 or 1000 labeled examples are available, along with a large number of unlabeled observations. We note that labels correspond to labeling % of the state space of dSprites, % of Cars3D, % of SmallNORB and % of Shapes3D.
True vs. imprecise labels. In addition to using the true labels of the groundtruth generative model, we also consider the setting where the returned labels are binned to take at most five different values. This is meant to simulate the process of a practitioner quickly labeling a small number of images.
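The coarsening of labels can be sketched as follows. This is only an illustration under the assumption that each factor is encoded as an integer index and that bins are of equal width; the exact binning procedure used in the experiments is not specified here, and `bin_labels` is a hypothetical helper name.

```python
import numpy as np


def bin_labels(labels, num_values, max_bins=5):
    """Coarsen integer factor labels to at most `max_bins` values per factor.

    labels: (n, num_factors) ints with labels[:, k] in [0, num_values[k]).
    num_values: number of distinct values of each factor.
    """
    labels = np.asarray(labels)
    num_values = np.asarray(num_values)
    # Factors with fewer than `max_bins` values are left unchanged.
    bins = np.minimum(num_values, max_bins)
    # Equal-width binning (an assumption) that preserves the ordering.
    return (labels * bins) // num_values
```

For a factor with ten values, indices 0 and 1 fall into bin 0 and indices 8 and 9 into bin 4, so the ordering of the factor is kept while its resolution is reduced.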
Model selection metrics. We use MIG chen2018isolating , DCI Disentanglement eastwood2018framework and SAP score kumar2017variational for model selection as they can be used on purely observational data. In contrast, the BetaVAE higgins2016beta and FactorVAE kim2018disentangling scores cannot be used for model selection on observational data because they require access to the true generative model and the ability to perform interventions. At the same time, prior work has found all these disentanglement metrics to be substantially correlated locatello2018challenging .
Experimental protocol. In total, we consider 16 different experimental settings, where an experimental setting corresponds to a data set (dSprites/Cars3D/SmallNORB/Shapes3D), a specific number of labeled examples (100/1000), and a labeling setting (perfect/imprecise). For each considered setting, we generate five different sets of labeled examples using five different random seeds. For each of these labeled sets, we train cohorts of β-VAEs higgins2016beta , β-TCVAEs chen2018isolating , FactorVAEs kim2018disentangling , and DIP-VAE-Is kumar2017variational , where each model cohort consists of 36 different models with 6 different hyperparameters for each model and 6 random seeds. For a detailed description of hyperparameters, architecture, and model training we refer to Section A. For each of these models, we then compute all the model selection metrics on the set of labeled examples and use these scores to select the best models in each of the cohorts. We use a dedicated prefix in the model names to denote unsupervised training with supervised model selection. Finally, we evaluate robust estimates of the BetaVAE score, the FactorVAE score, MIG, Modularity, DCI Disentanglement and SAP score for each model based on an additional set of evaluation samples (for the BetaVAE score and the FactorVAE score, this includes specific realizations based on interventions on the latent space) in the same way as in locatello2018challenging .
3.2 Key findings
We highlight our key findings with plots picked to be representative of our main results. In Appendices B and C, we provide complete sets of plots for different methods, data sets and disentanglement metrics.
Model selection with perfect labels. In Figure 2 (left), we show the rank correlation between the validation metrics computed on 100 samples and the test metrics on dSprites. We observe that MIG and DCI Disentanglement generally correlate well with the test metrics (with the only exception of Modularity) while the correlation for the SAP score is substantially lower. This is not surprising given that the SAP score requires us to train a multiclass support vector machine for each dimension of the representation predicting each factor of variation. For example, on Cars3D the factor determining the object type can take 183 distinct values, which can make it hard to train a classifier using only 100 training samples. In Figure 2 (right), we observe that the rank correlation improves considerably for the SAP score if we have 1000 labeled examples available and slightly for MIG and DCI Disentanglement. In Figure 1 we show latent traversals for the model achieving maximum validation MIG on the labeled examples on Shapes3D.
Model selection with imperfect labels. Figure 3 shows the rank correlation between the model selection metrics with binned values and the test metrics with exact labels. We observe that the metrics are surprisingly robust with respect to binned labels and they still correlate well with the test metrics. This is meant to simulate the process of a practitioner labeling by hand a reasonable number of images into “rough” categories. We observe that the rough labeling does not seem detrimental to the performance of the model selection with few labels. We interpret these results as indicating that, for the purpose of disentanglement, fine-grained labeling is not critical, as the different factors of variation can already be disentangled using coarse feedback. Interestingly, the rank correlation of the SAP score and the test metrics improves significantly (in particular for 100 labels). This is to be expected, as now we only have five classes for each factor of variation, so the classification problem becomes easier and the estimate of the SAP score more reliable.
Conclusions: From this experiment, we conclude that it is possible to identify good runs and hyperparameter settings on the considered data sets using the MIG and the DCI Disentanglement based on a small number of labeled examples. The SAP score may also be used, depending on how difficult the underlying classification problem is. Surprisingly, these metrics are reliable even if we do not collect the labels exactly but only use imprecise labels for the factors of variation. We conclude that labeling a small number of examples for supervised validation appears to be a reasonable solution to learn disentangled representations in practice.
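The model selection procedure of this section can be sketched in a few lines: score every model in a cohort with a validation metric computed on the few labeled examples, pick the highest-scoring model, and check how well the validation scores rank the models compared with the test metric. The snippet below is a minimal illustration with hypothetical helper names; it computes a Spearman-style rank correlation with NumPy only and assumes that scores are distinct (ties are broken arbitrarily).

```python
import numpy as np


def spearman_rank_correlation(a, b):
    """Pearson correlation of the ranks of a and b (assumes no ties)."""
    rank_a = np.argsort(np.argsort(a)).astype(float)
    rank_b = np.argsort(np.argsort(b)).astype(float)
    return float(np.corrcoef(rank_a, rank_b)[0, 1])


def select_best_model(validation_scores):
    """Index of the model with the highest validation score in a cohort."""
    return int(np.argmax(validation_scores))
```

A rank correlation close to one means that selecting the model with the best validation score is also likely to select a model that performs well on the test metric.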
4 Incorporating label information during training
Using labels for model selection, even only a small number of them, raises the natural question of whether these labels should rather be used for training a good model directly. In particular, such an approach also allows structure of the ground-truth factors of variation to be used, for example ordinal information. In this section, we investigate a simple approach to incorporate the information of very few labels into existing unsupervised disentanglement methods and compare that approach to the alternative of unsupervised training with supervised model selection (as described in Section 3).
The key idea is that the limited labeling information should be used to ensure a latent space of the VAE with desirable structure with respect to the ground-truth factors of variation (as there are not enough labeled samples to learn a good representation solely from the labels). We hence incorporate supervision by constraining Equation 1:
(2) 
where the constraint function is computed on the (few) available observation-label pairs and is bounded by a threshold. In other words, we constrain the otherwise unsupervised problem using some supervised penalty. We can now include this function in the loss as a regularizer under the Karush-Kuhn-Tucker conditions:
(3) 
We rely on the binary cross-entropy loss to match the factors to their targets, where the targets are normalized to [0, 1] and the logistic function is applied to the mean (vector) of the encoding distribution. When the representation has more dimensions than the number of factors of variation, only the first d dimensions are regularized, where d is the number of factors of variation. While the targets do not model probabilities of a binary random variable but factors of variation with potentially more than two discrete states, we have found the binary cross-entropy loss to work empirically well out of the box. We also experimented with a simple L2 loss, but obtained significantly worse results than for the binary cross-entropy. Similar observations were made in the context of VAEs, where the binary cross-entropy as reconstruction loss is widely used and outperforms the L2 loss even when pixels have continuous values in [0, 1] (see, e.g. the code accompanying chen2018isolating ; locatello2018challenging ). Many other candidates for supervised regularizers could be explored in future work. However, given the already extensive experiments in this study, this is beyond the scope of this paper.
Inductive bias based on ordinal information. We emphasize that the considered supervised regularizer uses an inductive bias in the sense that it assumes the ordering of the factors of variation to matter. This inductive bias is valid for many ground-truth factors of variation both in the considered data sets and the real world (such as spatial positions, sizes, angles or even color). We argue that such inductive biases should generally be exploited whenever they are available, which is the case if we have few manually annotated labels. To better understand the role of ordinal information, we also investigate what happens if this inductive bias is removed (see next Section).
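The supervised regularizer described in this section can be sketched as a binary cross-entropy between the logistic function of the encoder mean and the normalized factor labels. The snippet below is a NumPy illustration rather than the training code of the study; the function name, the use of integer factor indices and the normalization by the number of values minus one are assumptions made for this example.

```python
import numpy as np


def supervised_bce_regularizer(mean_codes, factors, num_values):
    """Binary cross-entropy between sigmoid(encoder mean) and scaled labels.

    mean_codes: (n, latent_dim) encoder means, latent_dim >= num_factors.
    factors: (n, num_factors) integer labels in [0, num_values[k]).
    num_values: number of values of each factor (each assumed >= 2).
    """
    mean_codes = np.asarray(mean_codes, dtype=float)
    factors = np.asarray(factors, dtype=float)
    d = factors.shape[1]
    # Assumption: targets are scaled to [0, 1] by dividing by (num_values - 1).
    targets = factors / (np.asarray(num_values, dtype=float) - 1.0)
    # Only the first d latent dimensions are regularized.
    logits = mean_codes[:, :d]
    # Numerically stable form of -t*log(s(l)) - (1-t)*log(1-s(l)).
    loss = (np.maximum(logits, 0.0) - logits * targets
            + np.log1p(np.exp(-np.abs(logits))))
    # Sum over factors, average over the labeled examples.
    return float(loss.sum(axis=1).mean())
```

Because the targets keep the order of the factor values, a regularizer of this form rewards latent dimensions that vary monotonically with the corresponding factor, which is the ordinal inductive bias discussed in the text.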
Differences to prior work on semi-supervised disentanglement. Existing semi-supervised approaches tackle the different problem of disentangling some factors of variation that are (partially) observed from the others that remain entangled reed2014learning ; cheung2014discovering ; mathieu2016disentangling ; narayanaswamy2017learning ; kingma2014semi . In contrast, we assume that all ground-truth generative factors are observed, but only for a very limited number of observations. Disentangling only some of the factors of variation from the others is an interesting extension of this study. However, it is not clear how to adapt existing disentanglement scores to this different setup as they are designed to measure the disentanglement of all the factors of variation. We remark that the goal of this comparison is to test the two different approaches to incorporate supervision into state-of-the-art unsupervised disentanglement methods. Furthermore, by assuming that all the causal parents of the observations are partially observed, we avoid unobserved confounding between the observations and the factors of variation, which makes the structure not identifiable from observational data PetJanSch17 ; d2019multi ; suter2018interventional . For this reason, we resorted to a simple and well-understood setup and supervised loss.
4.1 Experimental setup
True vs. imprecise labels vs. violation of inductive bias. In addition to using the true and binned labels of the ground-truth generative model, we also consider the case in which the ordinal information we deduce from the labeling is incorrect. In principle, this should not harm the performance on the test metrics as they are invariant to permutations of the ordinal information. Nevertheless, the supervised approach we consider explicitly makes use of this ordinal information as an inductive bias. This experiment is meant to showcase the importance of being explicit about the biases of the model, as their violation might significantly harm performance.
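One way to simulate incorrect ordinal information is to relabel the values of each factor with a random permutation, so that the identity of each value is preserved while its position in the ordering is not. The following sketch assumes integer-encoded factors; `permute_factor_values` is a hypothetical helper and not necessarily the procedure used in the experiments.

```python
import numpy as np


def permute_factor_values(labels, num_values, seed=0):
    """Relabel each factor with its own random permutation of its values.

    labels: (n, num_factors) ints with labels[:, k] in [0, num_values[k]).
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    permuted = np.empty_like(labels)
    for k, n in enumerate(num_values):
        perm = rng.permutation(n)  # one fixed relabeling per factor
        permuted[:, k] = perm[labels[:, k]]
    return permuted
```

Since the relabeling is a bijection, quantities that only depend on which examples share a factor value (such as discrete mutual information) are unaffected, while a loss that relies on the order of the values sees a scrambled target.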
Experimental protocol. To include supervision during training, we split the labeled data set into a training and a validation set. We consider 24 different experimental settings, each corresponding to a data set (dSprites/Cars3D/SmallNORB/Shapes3D), a specific number of labeled examples (100/1000), and a labeling setting (perfect/imprecise/randomly permuted). For each considered setting, we use the same five sets of labeled examples as in Section 3. For each of the labeled sets, we train cohorts of β-VAEs, β-TCVAEs, FactorVAEs, and DIP-VAE-Is with the additional supervised regularizer. Each model cohort consists of 36 different models with 6 different hyperparameters for each of the two regularizers and one random seed. Details on the hyperparameter values can be found in Section A. For each of these models, we compute the value of the supervised regularizer on the validation examples and use these scores to select the best method in each of the cohorts. We use a dedicated prefix in the model names to denote semi-supervised training with supervised model selection and compute the same test disentanglement metrics as in Section 3.
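The grid of experimental settings can be enumerated directly. The sketch below assumes that the two label budgets are the 100 and 1000 examples mentioned elsewhere in the text; the variable names are illustrative only.

```python
from itertools import product

DATA_SETS = ("dSprites", "Cars3D", "SmallNORB", "Shapes3D")
NUM_LABELS = (100, 1000)  # assumption: the two label budgets used in the text
LABELING = ("perfect", "imprecise", "randomly permuted")

# One setting per (data set, number of labels, labeling) combination.
settings = list(product(DATA_SETS, NUM_LABELS, LABELING))

# 6 x 6 hyperparameter grid over the two regularizers, one random seed each.
MODELS_PER_COHORT = 6 * 6
```

Dropping the "randomly permuted" option recovers the smaller grid of settings used for unsupervised training with supervised model selection.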
Fully supervised baseline. We further consider a fully supervised baseline where the encoder is trained solely based on the supervised loss (without any decoder, KL divergence and reconstruction loss) with perfectly labeled training examples (again with a train/validation split). The supervised loss does not have any tunable hyperparameters, and for each labeled data set, we run cohorts of six models with different random seeds. For each of these models, we compute the value of the supervised loss on the validation examples and use these scores to select the best method in the cohort.
4.2 Should labels be used for training?
First, we investigate the benefit of including the label information during training by comparing semi-supervised training with supervised validation in Figure 5 (left). Each dot in the plot corresponds to the median of the DCI Disentanglement score across the draws of the labeled subset on SmallNORB (using 100 vs. 1000 examples for validation). For unsupervised training with supervised validation, we use MIG for validation (MIG has a higher rank correlation with most of the testing metrics than other validation metrics, see Figure 2). From this plot one can see that the fully supervised baseline performs worse than the approaches that make use of unsupervised data. As expected, having more labels can improve the median performance of the semi-supervised approaches (depending on the data set and the test metric) but does not improve the approaches based on supervised validation (recall that we observed in Figure 2 (left) that the validation metrics already perform well with 100 samples).
To test whether incorporating the label information during training is better than using it for validation only, we report in Figure 6 (left) how often each approach outperforms all the others on a random disentanglement metric and data set. We observe that semi-supervised training often outperforms supervised validation. In particular, β-TCVAE seems to improve the most, outperforming DIP-VAE-I, which was the best method for labeled examples. Using labeled examples, the semi-supervised approach already wins in 70.5% of the trials. In Appendix C, we observe similar trends even when we use the testing metrics for validation (based on the full testing set) for unsupervised training with supervised validation. The semi-supervised approach seems to overall improve training and to transfer well across the different disentanglement metrics. In Figure 1 we show the latent traversals for the best β-TCVAE using labeled examples. We observe that it achieves excellent disentanglement and that the unnecessary dimensions of the latent space are unused, as desired.
[Figure: Comparison of the median downstream performance after validation with 100 vs. 1000 examples on Cars3D. The downstream tasks are cross-validated Logistic Regression (LR) and Gradient Boosting classifier (GBT), both trained with four different numbers of examples.]
In their Figure 27, locatello2018challenging showed that increasing regularization in unsupervised methods does not imply that the matrix holding the mutual information between all pairs of entries of the representation becomes closer to diagonal (which can be seen as a proxy for improved disentanglement). For the semi-supervised approach, in contrast, we observe in Figure 5 (center) that this is actually the case.
Finally, we study the effect of semi-supervised training on the (natural) downstream task of predicting the ground-truth factors of variation from the latent representation. We use four different training set sizes for this downstream task. We train the same cross-validated logistic regression and gradient boosting classifier as used in locatello2018challenging . In Figure 5 we compare for each method the median downstream performance after validation with 100 vs. 1000 examples. We observe that, depending on the data set and number of samples used for the downstream task, having more labels upstream can improve the downstream performance of the semi-supervised methods. Furthermore, one can see from the results obtained for the fully supervised baseline that training without the unsupervised loss can significantly harm performance. Finally, we observe in Figure 5 (right) that semi-supervised methods often outperform unsupervised training with supervised validation in downstream performance.
Conclusions: The results presented in this section lead us to conclude that finding sample-efficient disentanglement metrics is an important research direction for practical applications of disentanglement. However, if sufficiently large amounts of labeled data are available, it seems better to use some of the labels during training and rely on a regular train/validation split for model selection. Finding robust and efficient semi-supervised methods is thus also a research direction that should be explored, especially when weaker forms of supervision are available.
4.3 How robust is semisupervised training to imprecise labels?
In this section, we explore the effect of binning the labels used in the semi-supervised methods and how it compares to binning the labels used for supervised validation. In Figure 6 (right) we observe that binning does not significantly worsen the performance of either the supervised validation or the semi-supervised training. Sometimes the regularization induced by simplifying the labels actually appears to improve generalization due to a reduction in overfitting. Comparing Figure 6 (left) and (center), we observe that the model selection metrics are slightly more robust than the semi-supervised loss, especially when only 100 labeled examples are available. However, the semi-supervised approaches still outperform supervised model selection in 64.8% of the cases (with 100 examples) even with binned labels.
Conclusion: These results show that not only methods based on sample-efficient disentanglement metrics but also semi-supervised methods are robust to imprecise observations of the factors of variation.
4.4 Ordering as an inductive bias
In this section, we verify that the supervised regularizer we considered relies on the inductive bias given by the ordinal information present in the labels. Note that all the continuous factors of variation are binned in the considered data sets. We analyze how much the performance of the semi-supervised approach degrades when the ordering information is removed. To this end, we permute the order of the values of the factors of variation. Note that after removing the ordering information the supervised loss will still be at its minimum if the regularized latent dimensions match the permuted labels. However, the ordering information is now useless and potentially detrimental as it does not reflect the natural ordering of the true generative factors. We also remark that none of the disentanglement metrics make use of the ordinal information, so the performance degradation cannot be explained by fitting the wrong labels. In Figure 7, we observe that the semi-supervised approaches heavily rely on the ordering information and that removing it significantly harms performance on the test disentanglement metrics, even though these metrics are blind to ordering.
Conclusions: Imposing a suitable inductive bias (ordinal structure) on the ground-truth generative model in the form of a supervised regularizer is useful for disentanglement if its assumptions are correct. If the assumptions are violated, there is no longer any benefit over unsupervised training with supervised model selection (which is invariant to the ordinal structure).
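The removal of ordering information described in this section can be sketched as follows. This is only an illustration under the assumption that a factor is stored as an integer index; the helper `permute_factor_values` is hypothetical and not part of the experimental code.

```python
import numpy as np


def permute_factor_values(labels, num_values, rng):
    """Relabel a factor with a random permutation of its possible values.

    Equal labels stay equal, so the labels remain informative,
    but their ordinal structure is destroyed.
    """
    perm = rng.permutation(num_values)
    return perm[np.asarray(labels)], perm


rng = np.random.default_rng(0)
labels = np.array([0, 1, 2, 3, 4, 4, 0])  # hypothetical factor with 5 values
permuted, perm = permute_factor_values(labels, num_values=5, rng=rng)
print(perm, permuted)
```

Since the mapping is a bijection on the factor values, any metric that treats the values as categorical is unaffected, whereas a regularizer that exploits their order receives a misleading signal.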
5 Conclusion
In this paper, we investigated whether a very small number of labels can be used to reliably learn disentangled representations. We found that existing disentanglement metrics can in fact be used to perform model selection on models trained in a completely unsupervised fashion even when the labels are few and imprecise. We further showed that one can obtain even better results if one incorporates the labels and inductive biases on the factors of variation (such as ordering) into the learning process using a simple supervised regularizer. In our opinion, these results provide the basis for further work in this setting: Different supervised regularizers should be explored, aiming to regularize towards a different type of structure in the latent space, and/or aiming to impose different inductive biases. Similarly, one could design different model selection techniques, for example by designing novel disentanglement metrics that work well in the regime where few labels are available. Finally, differentiable disentanglement metrics should be developed that could be used in both scenarios.
Acknowledgements
This research was partially supported by the Max Planck ETH Center for Learning Systems and by an ETH core grant (to Gunnar Rätsch). This work was partially done while Francesco Locatello was at Google Research Zurich.
References

 (1) Francis Bach and Michael Jordan. Kernel independent component analysis. Journal of Machine Learning Research, 3(7):1–48, 2002.
 (2) Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.
 (3) Yoshua Bengio, Yann LeCun, et al. Scaling learning algorithms towards AI. Large-scale Kernel Machines, 34(5):1–41, 2007.

 (4) Diane Bouchacourt, Ryota Tomioka, and Sebastian Nowozin. Multi-level variational autoencoder: Learning disentangled representations from grouped observations. In AAAI Conference on Artificial Intelligence, 2018.
 (5) Christopher P Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Desjardins, and Alexander Lerchner. Understanding disentangling in beta-VAE. arXiv preprint arXiv:1804.03599, 2018.
 (6) Tian Qi Chen, Xuechen Li, Roger Grosse, and David Duvenaud. Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, 2018.
 (7) Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, 2016.
 (8) Brian Cheung, Jesse A Livezey, Arjun K Bansal, and Bruno A Olshausen. Discovering hidden factors of variation in deep networks. arXiv preprint arXiv:1412.6583, 2014.
 (9) Taco Cohen and Max Welling. Learning the irreducible representations of commutative lie groups. In International Conference on Machine Learning, 2014.
 (10) Pierre Comon. Independent component analysis, a new concept? Signal Processing, 36(3):287–314, 1994.
 (11) Alexander D’Amour. On multi-cause approaches to causal inference with unobserved confounding: Two cautionary failure cases and a promising alternative. In International Conference on Artificial Intelligence and Statistics, 2019.

 (12) Zhiwei Deng, Rajitha Navarathna, Peter Carr, Stephan Mandt, Yisong Yue, Iain Matthews, and Greg Mori. Factorized variational autoencoders for modeling audience reactions to movies. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
 (13) Emily L Denton and Vighnesh Birodkar. Unsupervised learning of disentangled representations from video. In Advances in Neural Information Processing Systems, 2017.
 (14) Cian Eastwood and Christopher KI Williams. A framework for the quantitative evaluation of disentangled representations. In International Conference on Learning Representations, 2018.
 (15) Vincent Fortuin, Matthias Hüser, Francesco Locatello, Heiko Strathmann, and Gunnar Rätsch. Deep self-organization: Interpretable discrete representation learning on time series. In International Conference on Learning Representations, 2019.
 (16) Marco Fraccaro, Simon Kamronn, Ulrich Paquet, and Ole Winther. A disentangled recognition and nonlinear dynamics model for unsupervised learning. In Advances in Neural Information Processing Systems, 2017.
 (17) Ian Goodfellow, Honglak Lee, Quoc V Le, Andrew Saxe, and Andrew Y Ng. Measuring invariances in deep networks. In Advances in Neural Information Processing Systems, 2009.
 (18) Ross Goroshin, Michael F Mathieu, and Yann LeCun. Learning to linearize under uncertainty. In Advances in Neural Information Processing Systems, 2015.
 (19) Irina Higgins, David Amos, David Pfau, Sebastien Racaniere, Loic Matthey, Danilo Rezende, and Alexander Lerchner. Towards a definition of disentangled representations. arXiv preprint arXiv:1812.02230, 2018.
 (20) Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2017.
 (21) Irina Higgins, Arka Pal, Andrei Rusu, Loic Matthey, Christopher Burgess, Alexander Pritzel, Matthew Botvinick, Charles Blundell, and Alexander Lerchner. Darla: Improving zero-shot transfer in reinforcement learning. In International Conference on Machine Learning, 2017.
 (22) Irina Higgins, Nicolas Sonnerat, Loic Matthey, Arka Pal, Christopher P Burgess, Matko Bošnjak, Murray Shanahan, Matthew Botvinick, Demis Hassabis, and Alexander Lerchner. Scan: Learning hierarchical compositional visual concepts. In International Conference on Learning Representations, 2018.

 (23) Geoffrey E Hinton, Alex Krizhevsky, and Sida D Wang. Transforming autoencoders. In International Conference on Artificial Neural Networks, 2011.
 (24) Jun-Ting Hsieh, Bingbin Liu, De-An Huang, Li F Fei-Fei, and Juan Carlos Niebles. Learning to decompose and disentangle representations for video prediction. In Advances in Neural Information Processing Systems, 2018.
 (25) Wei-Ning Hsu, Yu Zhang, and James Glass. Unsupervised learning of disentangled and interpretable representations from sequential data. In Advances in Neural Information Processing Systems, 2017.

 (26) Aapo Hyvärinen and Hiroshi Morioka. Unsupervised feature extraction by time-contrastive learning and nonlinear ICA. In Advances in Neural Information Processing Systems, 2016.
 (27) Aapo Hyvärinen and Petteri Pajunen. Nonlinear independent component analysis: Existence and uniqueness results. Neural Networks, 1999.
 (28) Aapo Hyvärinen, Hiroaki Sasaki, and Richard E Turner. Nonlinear ICA using auxiliary variables and generalized contrastive learning. In International Conference on Artificial Intelligence and Statistics, 2019.
 (29) Christian Jutten and Juha Karhunen. Advances in nonlinear blind source separation. In International Symposium on Independent Component Analysis and Blind Signal Separation, pages 245–256, 2003.
 (30) Theofanis Karaletsos, Serge Belongie, and Gunnar Rätsch. Bayesian representation learning with oracle constraints. arXiv preprint arXiv:1506.05011, 2015.
 (31) Hyunjik Kim and Andriy Mnih. Disentangling by factorising. In International Conference on Machine Learning, 2018.
 (32) Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, 2014.
 (33) Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.
 (34) Jack Klys, Jake Snell, and Richard Zemel. Learning latent subspaces in variational autoencoders. In Advances in Neural Information Processing Systems, 2018.
 (35) Tejas D Kulkarni, William F Whitney, Pushmeet Kohli, and Josh Tenenbaum. Deep convolutional inverse graphics network. In Advances in Neural Information Processing Systems, 2015.
 (36) Abhishek Kumar, Prasanna Sattigeri, and Avinash Balakrishnan. Variational inference of disentangled latent concepts from unlabeled observations. In International Conference on Learning Representations, 2018.
 (37) Brenden M Lake, Tomer D Ullman, Joshua B Tenenbaum, and Samuel J Gershman. Building machines that learn and think like people. Behavioral and Brain Sciences, 40, 2017.
 (38) Adrien Laversanne-Finot, Alexandre Pere, and Pierre-Yves Oudeyer. Curiosity driven exploration of learned disentangled goal spaces. In Conference on Robot Learning, 2018.
 (39) Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.
 (40) Yann LeCun, Fu Jie Huang, and Leon Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In IEEE Conference on Computer Vision and Pattern Recognition, 2004.
 (41) Karel Lenc and Andrea Vedaldi. Understanding image representations by measuring their equivariance and equivalence. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.
 (42) Francesco Locatello, Stefan Bauer, Mario Lucic, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. Challenging common assumptions in the unsupervised learning of disentangled representations. In (To appear) International Conference on Machine Learning, 2019.
 (43) Francesco Locatello, Damien Vincent, Ilya Tolstikhin, Gunnar Rätsch, Sylvain Gelly, and Bernhard Schölkopf. Competitive training of mixtures of independent deep generative models. In Workshop at the 6th International Conference on Learning Representations (ICLR), 2018.
 (44) Emile Mathieu, Tom Rainforth, N. Siddharth, and Yee Whye Teh. Disentangling disentanglement in variational autoencoders. arXiv preprint arXiv:1812.02833, 2018.
 (45) Michael F Mathieu, Junbo J Zhao, Aditya Ramesh, Pablo Sprechmann, and Yann LeCun. Disentangling factors of variation in deep representation using adversarial training. In Advances in Neural Information Processing Systems, 2016.
 (46) Ashvin V Nair, Vitchyr Pong, Murtaza Dalal, Shikhar Bahl, Steven Lin, and Sergey Levine. Visual reinforcement learning with imagined goals. In Advances in Neural Information Processing Systems, 2018.
 (47) Siddharth Narayanaswamy, T Brooks Paige, Jan-Willem Van de Meent, Alban Desmaison, Noah Goodman, Pushmeet Kohli, Frank Wood, and Philip Torr. Learning disentangled representations with semi-supervised deep generative models. In Advances in Neural Information Processing Systems, 2017.
 (48) Judea Pearl. Causality. Cambridge University Press, 2009.
 (49) Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of Causal Inference - Foundations and Learning Algorithms. Adaptive Computation and Machine Learning Series. MIT Press, 2017.
 (50) Scott Reed, Kihyuk Sohn, Yuting Zhang, and Honglak Lee. Learning to disentangle factors of variation with manifold interaction. In International Conference on Machine Learning, 2014.
 (51) Scott Reed, Yi Zhang, Yuting Zhang, and Honglak Lee. Deep visual analogy-making. In Advances in Neural Information Processing Systems, 2015.
 (52) Karl Ridgeway and Michael C Mozer. Learning deep disentangled embeddings with the f-statistic loss. In Advances in Neural Information Processing Systems, 2018.
 (53) Michal Rolinek, Dominik Zietlow, and Georg Martius. Variational autoencoders recover PCA directions (by accident). In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
 (54) P. K. Rubenstein, B. Schölkopf, and I. Tolstikhin. Learning disentangled representations with wasserstein autoencoders. In Workshop at the 6th International Conference on Learning Representations (ICLR), 2018.
 (55) Adrià Ruiz, Oriol Martinez, Xavier Binefa, and Jakob Verbeek. Learning disentangled representations with reference-based variational autoencoders. arXiv preprint arXiv:1901.08534, 2019.
 (56) Jürgen Schmidhuber. Learning factorial codes by predictability minimization. Neural Computation, 4(6):863–879, 1992.
 (57) Bernhard Schölkopf, Dominik Janzing, Jonas Peters, Eleni Sgouritsa, Kun Zhang, and Joris Mooij. On causal and anticausal learning. In International Conference on Machine Learning, 2012.
 (58) P. Spirtes, C. Glymour, and R. Scheines. Causation, prediction, and search. MIT Press, 2000.
 (59) Xander Steenbrugge, Sam Leroux, Tim Verbelen, and Bart Dhoedt. Improving generalization for abstract reasoning tasks using disentangled feature representations. In Workshop on Relational Representation Learning at NeurIPS, 2018.
 (60) Raphael Suter, Djordje Miladinović, Stefan Bauer, and Bernhard Schölkopf. Interventional robustness of deep latent variable models, 2019.
 (61) Valentin Thomas, Emmanuel Bengio, William Fedus, Jules Pondard, Philippe Beaudoin, Hugo Larochelle, Joelle Pineau, Doina Precup, and Yoshua Bengio. Disentangling the independently controllable factors of variation by interacting with the world. Learning Disentangled Representations Workshop at NeurIPS, 2017.
 (62) Michael Tschannen, Olivier Bachem, and Mario Lucic. Recent advances in autoencoder-based representation learning. arXiv preprint arXiv:1812.05069, 2018.
 (63) William F Whitney, Michael Chang, Tejas Kulkarni, and Joshua B Tenenbaum. Understanding visual concepts with continuation learning. arXiv preprint arXiv:1602.06822, 2016.
 (64) Jimei Yang, Scott E Reed, Ming-Hsuan Yang, and Honglak Lee. Weakly-supervised disentangling with recurrent transformations for 3D view synthesis. In Advances in Neural Information Processing Systems, 2015.
 (65) Li Yingzhen and Stephan Mandt. Disentangled sequential autoencoder. In International Conference on Machine Learning, 2018.
Appendix A Architectures and detailed experimental design
The architecture shared across all methods is the default one in disentanglement_lib. For completeness, we describe it in Table 1, along with the other fixed hyperparameters in Table 3(a), the discriminator used for total correlation estimation in FactorVAE in Table 3(b), and its hyperparameters in Table 3(c). The hyperparameters that were swept for the different methods can be found in Table 2. All hyperparameters for which we report a single value were not varied and were selected based on the literature.
Encoder  Decoder 

Input: number of channels  Input: 
FC, 256 ReLU  
conv, 32 ReLU, stride 2  FC, ReLU 
conv, 64 ReLU, stride 2  upconv, 64 ReLU, stride 2 
conv, 64 ReLU, stride 2  upconv, 32 ReLU, stride 2 
FC 256, FC  upconv, 32 ReLU, stride 2 
upconv, number of channels, stride 2 
Model  Parameter  Values 

VAE  
VAE  
FactorVAE  
FactorVAE  
DIPVAEI  
DIPVAEI  
TCVAE  
TCVAE  



Appendix B Detailed plots for Section 3
In Figure 8, we compute the rank correlation between the validation metrics computed on samples and the test metrics on each data set. In Figure 9, we observe that the correlation improves if we consider labeled examples. These plots are the extended version of Figure 2 showing the results on all data sets.
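As an illustration of the rank correlation used here, the following minimal sketch computes a Spearman coefficient between validation and test scores of a set of models. The scores are made-up placeholders, the helper names are hypothetical, and ties are not handled (an established implementation such as `scipy.stats.spearmanr` would be preferable in practice).

```python
import numpy as np


def ranks(x):
    """Rank of each entry (0 = smallest). Assumption: no tied values."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="stable")
    r = np.empty(len(x), dtype=float)
    r[order] = np.arange(len(x))
    return r


def spearman(a, b):
    """Spearman rank correlation as the Pearson correlation of the ranks."""
    return float(np.corrcoef(ranks(a), ranks(b))[0, 1])


# Hypothetical scores of four models: validation metric vs. test metric.
validation_scores = [0.10, 0.25, 0.05, 0.40]
test_scores = [0.30, 0.55, 0.20, 0.90]
rho = spearman(validation_scores, test_scores)
print(rho)
```

A coefficient close to 1 means that ranking the models by the validation metric would select nearly the same models as ranking them by the test metric.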
In Figure 10, we plot for each unsupervised model its validation MIG with 100 samples against the DCI test score on dSprites. We can see that indeed there is a strong linear relationship.
Appendix C Detailed plots for Section 4
C.1 Does supervision help training?
In Figure 15 we plot the median of each score across the draws of the labeled subset achieved by the best models on each data set (using 100 vs 1000 examples). For the unsupervised models we use MIG for validation (MIG has a higher rank correlation with most of the test metrics than the other validation metrics, see Figure 2). This plot extends Figure 5 (left) to all data sets and test scores.
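In Table 4, we compute how often each semisupervised method outperforms the corresponding unsupervised method with supervised model selection on a random disentanglement metric and data set. We observe that the semisupervised method often outperforms the unsupervised one, especially when more labels are available.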
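A percentage of this kind can be computed as the fraction of paired comparisons won by one method. The sketch below assumes that both methods have been scored on the same (metric, data set) pairs; the function name `win_rate` and the scores are hypothetical.

```python
import numpy as np


def win_rate(scores_a, scores_b):
    """Fraction of paired comparisons in which method A scores strictly higher than B."""
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("scores must be paired")
    return float(np.mean(a > b))


# Hypothetical test scores on four (metric, data set) pairs.
scores_a = [0.5, 0.7, 0.2, 0.9]
scores_b = [0.4, 0.8, 0.1, 0.3]
rate = win_rate(scores_a, scores_b)
print(f"{100 * rate:.1f}%")
```

With strict inequality, ties count for neither method, so the two percentages of a pair sum to 100% only when no ties occur.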
In Figure 13, it can be observed that with samples the semisupervised method is often better than the corresponding unsupervised method, even when the latter uses the test MIG computed with samples for validation. We conclude that the semisupervised loss improves training and transfers better to different metrics than the MIG. In Figure 14, we observe similar trends if we use the test DCI Disentanglement with samples for validation of the unsupervised methods.
In Figure 16 we observe that increasing the supervised regularization makes the matrix holding the mutual information between all pairs of entries of the representation and the factors of variation closer to diagonal. This plot extends Figure 5 to all data sets.
In Figure 17 we compare the median downstream performance after validation with 100 vs 1000 samples. This plot extends Figure 5 to all data sets. Finally, we observe in Table 5 that the semisupervised methods often outperform the unsupervised ones in downstream performance, especially when more labels are available.
Method  Type  SAP 100  SAP 1000  MIG 100  MIG 1000  DCI 100  DCI 1000 

VAE  72.6%  79.2%  53.9%  74.2%  53.9%  69.2%  
27.4%  20.8%  46.1%  25.8%  46.1%  30.8%  
FactorVAE  71.5%  79.4%  64.5%  75.2%  68.5%  77.6%  
28.5%  20.6%  35.5%  24.8%  31.5%  22.4%  
TCVAE  79.5%  80.6%  58.5%  75.0%  62.9%  74.4%  
20.5%  19.4%  41.5%  25.0%  37.1%  25.6%  
DIPVAEI  81.6%  83.5%  64.9%  74.8%  67.7%  70.5%  
18.4%  16.5%  35.1%  25.2%  32.3%  29.5% 
on for each approach separately. The standard deviation is between 3% and 5% and can be computed as .
Method  Type  SAP 100  SAP 1000  MIG 100  MIG 1000  DCI 100  DCI 1000 

VAE  70.0%  75.6%  53.8%  75.0%  43.8%  71.9%  
30.0%  24.4%  46.2%  25.0%  56.2%  28.1%  
FactorVAE  61.2%  71.9%  62.5%  78.1%  63.8%  71.2%  
38.8%  28.1%  37.5%  21.9%  36.2%  28.8%  
TCVAE  70.6%  72.7%  51.9%  71.9%  55.6%  67.5%  
29.4%  27.3%  48.1%  28.1%  44.4%  32.5%  
DIPVAEI  71.9%  80.6%  50.6%  75.6%  50.6%  65.0%  
28.1%  19.4%  49.4%  24.4%  49.4%  35.0% 
C.2 What happens if we collect imprecise labels?
In Figures 18 and 19 we observe that binning does not significantly worsen the performance of either the supervised validation or the semisupervised training. These plots extend Figure 6 (right) to both sample sizes, all test scores and data sets.
In Table 6 we show how often each semisupervised method outperforms the corresponding unsupervised method with supervised model selection on a random disentanglement metric and data set with binned labels.
Method  Type  SAP 100  SAP 1000  MIG 100  MIG 1000  DCI 100  DCI 1000 

VAE  66.9%  76.3%  50.4%  75.2%  44.5%  74.6%  
33.1%  23.7%  49.6%  24.8%  55.5%  25.4%  
FactorVAE  72.4%  67.9%  60.5%  63.2%  56.8%  62.4%  
27.6%  32.1%  39.5%  36.8%  43.2%  37.6%  
TCVAE  79.2%  77.9%  58.5%  74.0%  61.2%  72.7%  
20.8%  22.1%  41.5%  26.0%  38.8%  27.3%  
DIPVAEI  67.7%  75.8%  57.4%  71.8%  53.4%  69.6%  
32.3%  24.2%  42.6%  28.2%  46.6%  30.4% 
C.3 Ordering as inductive bias
In Table 7 we compute how often each semisupervised method outperforms the corresponding unsupervised method with supervised model selection on a random disentanglement metric and data set with binned labels. We observe that in this case the unsupervised method is superior most of the time, but the gap shrinks with more labels.