Theory and Evaluation Metrics for Learning Disentangled Representations

08/26/2019 ∙ by Kien Do, et al.

We make two theoretical contributions to disentanglement learning by (a) defining precise semantics of disentangled representations, and (b) establishing robust metrics for evaluation. First, we characterize the concept of "disentangled representations" used in supervised and unsupervised methods along three dimensions, namely informativeness, separability and interpretability, which can be expressed and quantified explicitly using information-theoretic constructs. This characterization helps explain the behaviors of several well-known disentanglement learning models. We then propose robust metrics for measuring informativeness, separability and interpretability. Through a comprehensive suite of experiments, we show that our metrics correctly characterize the representations learned by different methods and are consistent with qualitative (visual) results. Thus, the metrics allow disentanglement learning methods to be compared on fair ground. We also empirically uncovered interesting new properties of VAE-based methods and interpreted them with our formulation. These findings are promising and will hopefully encourage the design of more theoretically driven models for learning disentangled representations.




1 Introduction

Disentanglement learning holds the key to understanding the world from observations, transferring knowledge across different tasks and domains, generating novel designs, and learning compositional concepts bengio2013representation ; higgins2017scan ; lake2017building ; peters2017elements ; schmidhuber1992learning . Assuming the observation x is generated from latent factors z via p(x|z), the goal of disentanglement learning is to correctly uncover a set of independent factors that give rise to the observation. While there has been considerable progress in recent years, common assumptions about disentangled representations appear to be inadequate locatello2019challenging .

Unsupervised disentangling methods are highly desirable as they assume no knowledge about the ground truth factors. These methods typically impose constraints to encourage independence among latent variables. Examples of constraints include forcing the variational posterior to be similar to a factorial prior burgess2018understanding ; higgins2017beta , forcing the aggregated variational posterior to be similar to the prior makhzani2015adversarial , adding a total correlation loss kim2018disentangling , forcing the covariance matrix of the representations to be close to the identity matrix kumar2017variational , and using a kernel-based measure of independence lopez2018information . However, it remains unclear how the independence constraint affects other properties of the representation. Indeed, more independence may lead to higher reconstruction error in some models higgins2017beta ; kim2018disentangling . Worse still, the independent representations may mismatch humans' predefined concepts locatello2019challenging . This suggests that supervised methods – which associate a representation (or a group of representations) with a particular ground truth factor – may be more adequate. However, most supervised methods have only been shown to perform well on toy datasets harsh2018disentangling ; kulkarni2015deep ; mathieu2016disentangling in which data are generated from a multiplicative combination of the ground truth factors. Their performance on real datasets remains unclear.

We believe that there are at least two major reasons for the current unsatisfying state of disentanglement learning: i) the lack of a formal notion of disentangled representations to support the design of proper objective functions tschannen2018recent ; locatello2019challenging , and ii) the lack of robust evaluation metrics to enable a fair comparison between models, regardless of their architectures or design purposes. To that end, we contribute by formally characterizing disentangled representations along three dimensions, namely informativeness, separability and interpretability, drawing from concepts in information theory (Section 3). We then design robust quantitative metrics for these properties and argue that an ideal method for disentanglement learning should achieve high performance on these metrics (Section 4).

We run a series of experiments to demonstrate how to compare different models using our proposed metrics, showing that the quantitative results provided by these metrics are consistent with visual results (Section 5). In the process, we gain important insights about some well-known disentanglement learning methods, namely FactorVAE kim2018disentangling and AAE makhzani2015adversarial .

2 Preliminaries

VAE-based methods

Variational Autoencoder (VAE)

kingma2013auto ; rezende2014stochastic is a class of latent variable models assuming the generative process p(x, z) = p(z) p_θ(x|z), where x denotes the observed data whose samples are drawn from the empirical distribution p_D(x) and z denotes the latent variable. Standard VAEs are trained by minimizing the variational upper bound of the expected negative log-likelihood over the data. However, this objective function does not encourage disentanglement in representation. A simple solution is β-VAE higgins2017beta , which modifies the objective as follows:

L_β = E_{p_D(x)} [ E_{q_φ(z|x)} [ −log p_θ(x|z) ] + β KL( q_φ(z|x) || p(z) ) ],

where β ≥ 1 and q_φ(z|x) is a parameterized variational estimator of the true posterior distribution p(z|x). When β = 1, L_β reduces to the standard VAE objective. Another proposal is FactorVAE kim2018disentangling , which adds a constraint to the standard VAE loss to explicitly impose factorization of the aggregated posterior q(z):

L_FactorVAE = L_VAE + λ TC(z),   (1)
TC(z) = KL( q(z) || ∏_j q(z_j) ),   (2)

where TC(z) is known as the total correlation (TC) of z. Intuitively, the coefficient λ can be large without affecting the mutual information between x and z, making FactorVAE more robust than β-VAE in learning disentangled representations. Other variants include β-TCVAE chen2018isolating and DIP-VAE kumar2017variational , which are technically equivalent to FactorVAE in the sense that they also force q(z) to be factorized.
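FactorVAE estimates the TC term adversarially with a density-ratio discriminator rather than in closed form, but for a Gaussian q(z) the total correlation has an exact expression, which makes the quantity concrete. A minimal sketch (function name ours):

```python
import math

def gaussian_tc_2d(var1, var2, cov):
    """Total correlation TC(z) = KL(q(z) || q(z1) q(z2)) for a zero-mean
    bivariate Gaussian q: for Gaussians this has the closed form
    0.5 * (log var1 + log var2 - log det(Sigma))."""
    det = var1 * var2 - cov * cov
    return 0.5 * (math.log(var1) + math.log(var2) - math.log(det))

# Independent dimensions: TC = 0.
print(gaussian_tc_2d(1.0, 1.0, 0.0))            # 0.0
# Correlated dimensions (rho = 0.8): TC = -0.5 * ln(1 - 0.8**2) ≈ 0.511 nats.
print(round(gaussian_tc_2d(1.0, 1.0, 0.8), 3))  # 0.511
```

For a bivariate Gaussian with correlation ρ, TC reduces to −0.5·ln(1 − ρ²), i.e., the mutual information between the two dimensions.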


Generative Adversarial Network (GAN) goodfellow2014generative is another class of generative models. GAN solves a minimax problem for a generator G and a discriminator D. InfoGAN chen2016infogan improves over GANs for learning disentangled representations. It assumes that the latent code vector is a concatenation of two parts: a factorial part c and a noisy part z, denoted as (c, z). InfoGAN learns to disentangle by maximizing the mutual information between c and the observed data, using the following objective function:

min_G max_D V(D, G) − λ L_I(G, Q),   (3)

where I(·; ·) denotes the mutual information; x̂ = G(c, z); and L_I(G, Q) is a variational lower bound of I(c; x̂) agakov2004algorithm . Here, L_I(G, Q) = E_{c ∼ p(c), x̂ ∼ G(c, z)} [ log Q(c | x̂) ] + H(c), where Q(c | x̂) is a variational estimator of the true conditional distribution p(c | x̂).

3 Rethinking Disentanglement

Inspired by bengio2013representation ; ridgeway2016survey , we adopt the notion of disentangled representation learning as "a process of decorrelating information in the data into separate informative representations, each of which corresponds to a concept defined by humans". This suggests three important properties of a disentangled representation: informativeness, separability and interpretability, which we quantify as follows:


Informativeness

We formulate the informativeness of a particular representation z_i w.r.t. the data x as the mutual information between x and z_i:

I(x; z_i) = H(z_i) − H(z_i | x),   (4)

where H(·) denotes the entropy. In order to represent the data faithfully, a representation should be informative of x, meaning I(x; z_i) should be large. Because I(x; z_i) = H(z_i) − H(z_i | x), a large value of I(x; z_i) means a small H(z_i | x), given that H(z_i) can be chosen to be relatively fixed. In other words, if z_i is informative w.r.t. x, q(z_i | x) usually has small variance. It is important to note that I(x; z_i) in Eq. 4 is defined on the variational encoder q(z_i | x) and does not require a decoder. This implies we do not need to minimize the reconstruction error over x (e.g., as in VAEs) to increase the informativeness of a particular z_i.

Separability and Independence

Two representations z_i, z_j are separable w.r.t. the data x if they do not share common information about x, which can be formulated as follows:

I(x; z_i; z_j) = 0,   (5)

where I(x; z_i; z_j) denotes the multivariate mutual information mcgill1954multivariate between x, z_i and z_j. It can be decomposed into standard bivariate mutual information terms as follows:

I(x; z_i; z_j) = I(x; z_i) + I(x; z_j) − I(x; z_i, z_j) = I(z_i; z_j) − I(z_i; z_j | x).

I(x; z_i; z_j) can be either positive or negative. It is positive if z_i and z_j contain redundant information about x. The meaning of a negative I(x; z_i; z_j) remains elusive bell2003co .
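The decomposition above can be verified numerically on small discrete distributions. A sketch (helper names ours), contrasting a redundant case (positive interaction information) with the classic XOR case (negative):

```python
import itertools
import math
from collections import defaultdict

def entropy(dist):
    """Shannon entropy in bits of a {outcome: prob} dict."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def marginal(joint, idxs):
    """Marginalize a {tuple: prob} joint onto the given coordinate indices."""
    m = defaultdict(float)
    for outcome, p in joint.items():
        m[tuple(outcome[i] for i in idxs)] += p
    return m

def mi(joint, a, b):
    """I(A; B) = H(A) + H(B) - H(A, B), computed from marginals of the joint."""
    return (entropy(marginal(joint, a)) + entropy(marginal(joint, b))
            - entropy(marginal(joint, a + b)))

def interaction(joint):
    """I(x; z_i; z_j) = I(x; z_i) + I(x; z_j) - I(x; (z_i, z_j)).
    Coordinate convention: 0 = x, 1 = z_i, 2 = z_j."""
    return mi(joint, (0,), (1,)) + mi(joint, (0,), (2,)) - mi(joint, (0,), (1, 2))

# Redundancy: z_i and z_j are both copies of a uniform bit x -> positive.
copy = {(b, b, b): 0.5 for b in (0, 1)}
print(interaction(copy))   # 1.0 (bit)

# Synergy: x = z_i XOR z_j with independent uniform z's -> negative.
xor = {(zi ^ zj, zi, zj): 0.25 for zi, zj in itertools.product((0, 1), repeat=2)}
print(interaction(xor))    # -1.0 (bit)
```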

Achieving separability with respect to x does not guarantee that z_i and z_j are separable in general. z_i and z_j are fully separable or statistically independent if and only if:

I(z_i; z_j) = 0.   (6)

Let us consider how FactorVAE and InfoGAN implement this independence requirement. FactorVAE enforces the independence in Eq. 6 for every pair of z_i, z_j via the TC term (see Eq. 1). In InfoGAN, the existence of such a condition is less clear. However, if we look closely into the term L_I(G, Q) in Eq. 3, it is actually the reconstruction error over codes c sampled from a factorial prior p(c) = ∏_i p(c_i). By minimizing this term, we force Q(c | x̂) close to p(c | x̂) and p(c | x̂) close to p(c), making c_i and c_j independent. Note that these quantities are derived from the assumption x̂ ∼ p_G(x̂), where p_G is the implicit generative distribution of x̂ obtained by transforming (c, z) via the generator G. The original GAN objective in InfoGAN assumes that p_G(x̂) matches the empirical data distribution. However, when this assumption does not hold, p(c | x̂) is not really grounded on real data, making it hard to interpret the independence.

Note that there is a trade-off between informativeness, independence and the number of latent variables, which we discuss in Appdx. A5.


Interpretability

Obtaining informative and independent representations does not guarantee interpretability by humans locatello2019challenging . We argue that in order to achieve interpretability, we should provide models with a set of predefined concepts. In this case, a representation z_i is interpretable with respect to a concept (or factor) y_k if it only contains information about y_k (given that y_k is separable from all other factors and all the z_i are distinct). Full interpretability can be formulated as follows:

I(z_i; y_k) = H(z_i) = H(y_k).   (7)

Eq. 7 is equivalent to the condition that z_i is an invertible function of y_k. If we want z_i to generalize beyond the observed y_k (i.e., H(z_i) ≥ H(y_k)), we can change the condition in Eq. 7 into:

I(z_i; y_k) = H(y_k),   (8)

which suggests that the model should accurately predict y_k given z_i. If z_i satisfies the condition in Eq. 8, it is said to be partially interpretable w.r.t. y_k.
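The two conditions can be checked numerically on a toy example using empirical entropies (helper names `H` and `I` are ours):

```python
import math
from collections import Counter

def H(samples):
    """Empirical Shannon entropy (bits) of a list of hashable outcomes."""
    n = len(samples)
    return -sum(c / n * math.log2(c / n) for c in Counter(samples).values())

def I(a, b):
    """Empirical mutual information I(A; B) = H(A) + H(B) - H(A, B)."""
    return H(a) + H(b) - H(list(zip(a, b)))

y = [0, 0, 1, 1]                        # a balanced binary factor, H(y) = 1 bit
z_full = list(y)                        # z is an invertible function of y
z_part = list(zip(y, [0, 1, 0, 1]))     # z carries y plus extra information

# Full interpretability (Eq. 7): I(z; y) = H(z) = H(y).
print(I(z_full, y), H(z_full), H(y))    # 1.0 1.0 1.0
# Partial interpretability (Eq. 8): I(z; y) = H(y) but H(z) > H(y).
print(I(z_part, y), H(z_part))          # 1.0 2.0
```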

In real data, the underlying factors of variation are usually correlated. For example, men usually have beards and short hair. Therefore, it is very difficult to match independent latent variables to different ground truth factors at the same time. We believe that in order to achieve good interpretability, we should isolate the factors and learn one at a time.

3.1 An information-theoretic definition of disentangled representations

Given a dataset , where each data point is associated with a set of labeled factors of variation . Assume that there exists a mapping of to

groups of hidden representation

which follows the distribution . Denoting and . We define disentangled representations for unsupervised cases as follows:

Definition 1 (Unsupervised).

A representation z_i (or a group of representations) is said to be "fully disentangled" w.r.t. a ground truth factor y_k if z_i is marginally independent of all other representations and is fully interpretable w.r.t. y_k. Mathematically, this can be written as:

I(z_i; z_j) = 0  ∀j ≠ i,
I(z_i; y_k) = H(z_i) = H(y_k).

The definition of disentangled representations for supervised cases is similar to the above, except that we now model q(z|x, y) instead of q(z|x).

Recently, there have been several works eastwood2018framework ; higgins2018towards ; ridgeway2018learning that attempted to define disentangled representations. Higgins et al. higgins2018towards proposed a definition based on group theory cohen2014learning which is (informally) stated as follows: "A representation is disentangled w.r.t a particular subgroup of a symmetry group if it can be decomposed into different subspaces, in which one subspace should be independent of all other representation subspaces and should only be affected by the action of that single subgroup and not by the other subgroups." Their definition shares similar observations with ours. However, it is less convenient for designing models and metrics than our information-theoretic definition.

Eastwood et al. eastwood2018framework did not provide any explicit definition of disentangled representations but characterized them along three dimensions, namely "disentanglement", "completeness", and "informativeness". A high "disentanglement" score for a latent variable indicates that it captures at most one factor. A high "completeness" score for a factor indicates that it is captured by at most one latent variable. A high "informativeness" score (in eastwood2018framework , the authors consider the prediction error of the factor given the representations instead; a high "informativeness" score means this error should be close to 0) indicates that all information of the factor is captured by the representations. Intuitively, when all three notions achieve optimal values, there should be only a single representation z_i that captures all information of the factor y_k but no information from other factors. However, even in that case, z_i is still not fully interpretable w.r.t. y_k since z_i may contain some information in x that does not appear in y_k. This makes their notions only applicable to toy datasets for which we know that the data are generated only from the predefined ground truth factors. Our definition can handle the situation where we only know some but not all factors of variation in the data. The notions in ridgeway2018learning follow those in eastwood2018framework and hence suffer from the same disadvantage.

3.2 Representations learned by FactorVAE

We empirically observed that FactorVAE learns the same set of disentangled representations across different runs with varying numbers of latent variables (see Appdx. A8). This behavior is akin to that of deterministic PCA, which uncovers a fixed set of linearly independent factors, or principal components. (When we mention factors in this context, they are not really factors of variation; they refer to the columns of the projection matrix in the case of PCA and the component encoding functions in the case of deep generative models.) Standard VAE is theoretically similar to probabilistic PCA (pPCA) tipping1999probabilistic as both assume the same generative process p(x, z) = p(z) p(x|z) with a factorial Gaussian prior. Unlike deterministic PCA, pPCA learns a rotation-invariant family of factors instead of an identifiable set of factors. However, in a particular pPCA model, the relative orthogonality among factors is still preserved. This means that the factors learned by different pPCA models are statistically equivalent. We hypothesize that by enforcing independence among latent variables, FactorVAE can also learn statistically equivalent factors which correspond to visually similar results. We provide a proof sketch for this hypothesis in Appdx. A6. We note that Rolinek et al. rolinek2018variational also discovered the same phenomenon in β-VAE.

4 Robust Evaluation Metrics

We argue that a robust metric for disentanglement should meet the following criteria: i) it supports both supervised and unsupervised models; ii) it can be applied to real datasets; iii) it is computationally straightforward, i.e., it does not require any training procedure; iv) it provides consistent results across different methods and different latent representations; and v) it agrees with qualitative (visual) results. Here we propose information-theoretic metrics to measure informativeness, independence and interpretability which meet all of these robustness criteria.

4.1 Metrics for informativeness

We measure the informativeness of a particular representation z_i w.r.t. x by computing I(x; z_i) in Eq. 4. The main challenges are estimating q(z_i|x) and computing the integral over z_i. We deal with these problems by quantizing z_i. To ensure that I(x; z_i) is consistent and comparable among different z_i as well as different models, we apply the same quantization range for all z_i. In practice, we choose a fixed range since most of the latent values fall within it. We divide the range into a set of equal-size bins and estimate I(x; z_i) as follows:

I(x; z_i) ≈ −∑_b p(b) log p(b) + E_{p_D(x)} [ ∑_b p(b|x) log p(b|x) ],   (9)

where p(b) and p(b|x) are the probability mass function and the conditional probability mass function of a particular bin b. Because p(b) = E_{p_D(x)}[p(b|x)] (note that we must take into account the whole quantized distribution p(·|x); simply counting the quantized mean of z_i for all x is totally wrong), we only have to compute p(b|x), which by definition is:

p(b|x) = ∫_{b⁻}^{b⁺} q(z_i|x) dz_i,   (10)

where b⁻, b⁺ are the two ends of the bin b.

There are two ways to compute p(b|x). In the first way, we simply consider the unnormalized p(b|x) as the area of a rectangle whose width is the bin size and whose height is q(z_i|x) evaluated at the center value of the bin b. Then, we normalize over all bins to get p(b|x). In the second way, if q(z_i|x) is approximately a Gaussian distribution, we can estimate the above integral with a closed-form function (see Appdx. A12 for details). After computing I(x; z_i), we can divide it by H(z_i) to normalize it to the range [0, 1]. However, this normalization will change the interpretation of the metric and may lead to a situation where a latent variable z_i is less informative than a variable z_j (i.e., I(x; z_i) < I(x; z_j)) but still has a higher rank than z_j because H(z_i) < H(z_j). A better way is to divide it by log(#bins), where #bins denotes the number of bins. An important note for implementation is that sometimes the standard deviation of q(z_i|x) is close to 0 (z_i is deterministic given x), causing p(b|x) to be close to 0 for all bins. In this case, we set p(b|x) = 1 if b is the bin that contains the mean of q(z_i|x) and 0 otherwise.
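The estimation procedure above can be sketched as follows, assuming Gaussian encoders q(z_i|x) = N(μ(x), σ²(x)); the quantization range [−4, 4], the bin count, and the toy posterior parameters are illustrative choices of ours:

```python
import math

def norm_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def bin_probs(mu, sigma, lo=-4.0, hi=4.0, n_bins=100):
    """p(b | x): mass of q(z_i | x) = N(mu, sigma^2) in each bin of the fixed
    quantization range, renormalized so the bins sum to one."""
    edges = [lo + (hi - lo) * k / n_bins for k in range(n_bins + 1)]
    p = [norm_cdf(edges[k + 1], mu, sigma) - norm_cdf(edges[k], mu, sigma)
         for k in range(n_bins)]
    total = sum(p)
    return [v / total for v in p]

def entropy(p):
    return -sum(v * math.log(v) for v in p if v > 0)

def informativeness(mus, sigmas, n_bins=100):
    """I(x; z_i) ~= H(p(b)) - mean over x of H(p(b | x)), in nats."""
    conds = [bin_probs(m, s, n_bins=n_bins) for m, s in zip(mus, sigmas)]
    marg = [sum(c[k] for c in conds) / len(conds) for k in range(n_bins)]
    return entropy(marg) - sum(entropy(c) for c in conds) / len(conds)

# An informative latent: posterior means spread out, small posterior variance.
mi_informative = informativeness([-2.0, -1.0, 0.0, 1.0, 2.0], [0.1] * 5)
# A "noisy" latent: q(z_i | x) = N(0, 1) regardless of x, so I(x; z_i) ~ 0.
mi_noisy = informativeness([0.0] * 5, [1.0] * 5)
print(mi_informative > 1.0, mi_noisy < 1e-6)   # True True
```

With well-separated, low-variance posteriors the estimate approaches log of the number of distinguishable inputs (here ln 5 ≈ 1.609 nats), while an uninformative latent scores essentially zero.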

4.2 Metrics for independence

We can compute the independence between two latent variables z_i, z_j based on I(z_i; z_j). However, a serious problem of I(z_i; z_j) is that it generates the following order among pairs of representations:

I(z̄_i; z̄_j) ≤ I(z_i; z̄_j) ≤ I(z_i; z_j),

where z_i, z_j are informative representations and z̄_i, z̄_j are uninformative (or noisy) representations. This means if we simply want two representations to be independent, the best scenario is that both are noisy and independent (e.g. I(z̄_i; z̄_j) = 0). Therefore, we propose a new metric for independence named MISJED (which stands for Mutual Information Sums Joint Entropy Difference), defined as follows:

MISJED(z_i, z_j) = I(z_i; z_j) + [ H(z_i, z_j) − H(μ_i, μ_j) ],   (11)

where μ_i and μ_j are the means of q(z_i|x) and q(z_j|x), respectively. Since μ_i, μ_j have less variance than z_i, z_j respectively, H(μ_i, μ_j) ≤ H(z_i, z_j), making MISJED(z_i, z_j) ≥ 0.

To achieve a small value of MISJED(z_i, z_j), i.e., a high degree of independence, we must have representations that are both independent and informative (or, in an extreme case, deterministic given x). Using the MISJED metric, we can ensure the following order: MISJED(z_i, z_j) ≤ MISJED(z_i, z̄_j) ≤ MISJED(z̄_i, z̄_j). Because MISJED(z_i, z_j) ≤ 2 log(#bins) under quantization, we can divide it by 2 log(#bins) to normalize it to [0, 1].
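A sketch of a MISJED-style computation on quantized Gaussian posteriors, under the reading MISJED(z_i, z_j) = I(z_i; z_j) + H(z_i, z_j) − H(μ_i, μ_j); this formula, the quantization range, and the toy numbers are our assumptions, not the paper's exact specification:

```python
import math
from collections import Counter

LO, HI = -4.0, 4.0   # illustrative quantization range

def norm_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def bin_probs(mu, sigma, n):
    """p(b | x) for q(z | x) = N(mu, sigma^2), renormalized over [LO, HI]."""
    edges = [LO + (HI - LO) * k / n for k in range(n + 1)]
    p = [norm_cdf(edges[k + 1], mu, sigma) - norm_cdf(edges[k], mu, sigma)
         for k in range(n)]
    s = sum(p)
    return [v / s for v in p]

def entropy(p):
    return -sum(v * math.log(v) for v in p if v > 0)

def bin_of(m, n):
    return min(n - 1, max(0, int((m - LO) / (HI - LO) * n)))

def misjed(mus_i, sig_i, mus_j, sig_j, n=20):
    """MISJED-style score: I(z_i; z_j) + H(z_i, z_j) - H(mu_i, mu_j).
    q(z | x) is factorial, so p(b_i, b_j | x) = p(b_i | x) p(b_j | x)."""
    N = len(mus_i)
    joint = [[0.0] * n for _ in range(n)]
    for mi_, si, mj, sj in zip(mus_i, sig_i, mus_j, sig_j):
        pi, pj = bin_probs(mi_, si, n), bin_probs(mj, sj, n)
        for a in range(n):
            for b in range(n):
                joint[a][b] += pi[a] * pj[b] / N
    h_i = entropy([sum(row) for row in joint])
    h_j = entropy([sum(joint[a][b] for a in range(n)) for b in range(n)])
    h_ij = entropy([v for row in joint for v in row])
    mutual = h_i + h_j - h_ij
    # Joint entropy of the quantized conditional means (deterministic given x).
    counts = Counter((bin_of(a, n), bin_of(b, n)) for a, b in zip(mus_i, mus_j))
    h_mu = entropy([c / N for c in counts.values()])
    return mutual + h_ij - h_mu

# Informative, nearly deterministic, empirically independent pair: small score.
small = misjed([-1, -1, 1, 1], [0.05] * 4, [-1, 1, -1, 1], [0.05] * 4)
# Two pure-noise latents (posterior equals prior for every x): large score,
# even though I(z_i; z_j) = 0.
big = misjed([0.0] * 4, [1.0] * 4, [0.0] * 4, [1.0] * 4)
print(small < big)   # True
```

The example reproduces the intended ordering: the informative, independent pair scores near zero, while the noisy pair is penalized despite having zero mutual information.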

4.3 Metrics for interpretability

Recently, several metrics have been proposed to quantitatively evaluate the interpretability of representations by examining the relationship between the representations and manually labeled factors of variation. The most popular ones are the Z-diff score higgins2017beta ; kim2018disentangling , SAP kumar2017variational and MIG chen2018isolating . Detailed analysis of these metrics is provided in Appdx. A9. Among them, only MIG is based on mutual information and, to some extent, matches the formulation of "interpretability" in Section 3. However, MIG has only been used for toy datasets like dSprites dsprites2017 . The main drawback comes from its probabilistic assumption (see Fig. 1). Note that p(x|y_k) is a distribution over the high-dimensional data space and is very hard to estimate robustly, but the authors simplified it to be uniform over the support set of a particular value y_k and 0 elsewhere. This equation only holds for toy datasets where we know exactly how x is generated from y_k. In addition, since the support set depends on the value of y_k, it will be problematic if y_k is continuous.

(a) Unsupervised
(b) Supervised
Figure 1: Differences in probabilistic assumption of MIG and Robust MIG.

Addressing the drawbacks of MIG, we propose RMIG (which stands for Robust MIG), formulated as follows:

RMIG(y_k) = I(y_k; z_i*) − I(y_k; z_j*),   (12)

where I(y_k; z_i*) and I(y_k; z_j*) are the highest and the second highest mutual information values computed between every z_i and y_k, and z_i*, z_j* are the corresponding latent variables. Like MIG, we can normalize RMIG(y_k) to [0, 1] by dividing it by H(y_k), but this will favor imbalanced factors (small H(y_k)). Details of computation are given in Appdx. A10.

RMIG inherits the idea of MIG but differs in the probabilistic assumption (and other technicalities). RMIG assumes that x is drawn from the empirical data distribution for unsupervised learning, and that (x, y_k) is drawn from the empirical joint distribution for supervised learning (see Fig. 1). Not only does this eliminate all the problems of MIG, but it also provides additional advantages. First, we can estimate I(y_k; z_i) using Monte Carlo sampling over x. Second, the metric is well defined for both discrete/continuous and deterministic/stochastic z_i. If z_i is continuous, we can quantize z_i. If q(z_i|x) is deterministic (i.e., a Dirac delta function), we simply set it to 1 for the value of z_i corresponding to x and 0 for other values of z_i. Our metric can also use a label predictor from an external expert model. Third, for any particular value y_k, we compute the required distributions using all x rather than just the x in the support of y_k, which gives more accurate results.


A high RMIG value for y_k means that there is a representation z_i* that captures the factor y_k. However, z_i* may also capture other factors of the data. To make sure that z_i* fits exactly to y_k, we provide another metric for interpretability named JEMMI (standing for Joint Entropy Minus Mutual Information), computed as follows:

JEMMI(y_k) = H(z_i*, y_k) − I(y_k; z_i*) + I(y_k; z_j*),   (13)

where z_i* and z_j* are defined in Eq. 12. JEMMI(y_k) is bounded by 0 and log(#bins) + H(y_k), so we can divide it by the latter to normalize it to [0, 1]. A small JEMMI score means that z_i* matches exactly to y_k and that y_k is not related to z_j*. Note that if we replace H(z_i*, y_k) by H(y_k) to account for the generalization of z_i* over y_k, we obtain a metric equivalent to RMIG (but in reverse order).
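A toy sketch of RMIG and a JEMMI-style score on already-quantized latents; the JEMMI normalizer H(y) + log(#bins) is our reading of the text, and all helper names are ours:

```python
import math
from collections import Counter

def H(samples):
    """Empirical entropy (nats) of a list of hashable outcomes."""
    n = len(samples)
    return -sum(c / n * math.log(c / n) for c in Counter(samples).values())

def I(a, b):
    """Empirical mutual information I(A; B) = H(A) + H(B) - H(A, B)."""
    return H(a) + H(b) - H(list(zip(a, b)))

def rmig_jemmi(y, latents, n_bins):
    """RMIG(y) = I(y; z*) - I(y; z**), the gap between the two latents most
    informative about y. JEMMI(y) = (H(z*, y) - I(y; z*) + I(y; z**)),
    normalized here by H(y) + log(n_bins)."""
    mis = [I(y, z) for z in latents]
    order = sorted(range(len(latents)), key=lambda i: mis[i], reverse=True)
    i1, i2 = order[0], order[1]
    rmig = mis[i1] - mis[i2]
    jemmi = (H(list(zip(latents[i1], y))) - mis[i1] + mis[i2]) / (H(y) + math.log(n_bins))
    return rmig, jemmi

# Toy check: y is a balanced binary factor; z1 copies it (already quantized
# into 2 "bins"), z2 is constant and captures nothing.
y = [0, 0, 1, 1] * 4
rmig, jemmi = rmig_jemmi(y, [list(y), [0] * len(y)], n_bins=2)
print(round(rmig, 3), round(jemmi, 3))   # 0.693 0.0
```

In this ideal case RMIG equals H(y) = ln 2 and the JEMMI-style score is 0, the best possible values for both metrics.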

5 Experiments

We evaluated the performance of FactorVAE kim2018disentangling , β-VAE higgins2017beta and AAE makhzani2015adversarial using our proposed metrics on the CelebA liu2015faceattributes , MNIST and dSprites dsprites2017 datasets. Details about the datasets and model settings are provided in Appdx. A1 and Appdx. A2, respectively. Due to space limits, we only report results on CelebA here, leaving the rest to the supplementary materials.


We sorted the representations of different models according to their informativeness scores in descending order and plot the results in Fig. 2. There are distinct patterns for different methods. AAE captures equally large amounts of information from the data while FactorVAE and β-VAE capture smaller and varying amounts. This is because FactorVAE and β-VAE penalize the informativeness of representations while AAE does not. Recall that I(x; z_i) = H(z_i) − H(z_i|x). For AAE, H(z_i|x) = 0 and I(x; z_i) is equal to the entropy of z_i. For FactorVAE and β-VAE, H(z_i|x) > 0 and I(x; z_i) is usually smaller than the entropy of z_i due to a narrow q(z_i|x). (Note that I(x; z_i) does not depend on whether z_i is zero-centered or not.)

(a) FactorVAE (TC=50)
(b) β-VAE (β=50)
(c) AAE (Gz=50)
Figure 2: Normalized informativeness scores (bins=100, 100% data) of all latent variables sorted in descending order.

In Fig. 2, we see a sudden drop of the scores to 0 for some of FactorVAE's and β-VAE's representations. These representations are totally random and contain no information about the data (i.e., I(x; z_i) = 0). We call them "noisy" representations and provide discussions in Appdx. A5.

We visualize the top 10 most informative representations for these models in Fig. 3. AAE's representations are more detailed than FactorVAE's and β-VAE's, suggesting the effect of high informativeness. However, AAE's representations mainly capture information within the support of the data distribution. This explains why we still see a face when interpolating AAE's representations. By contrast, FactorVAE's and β-VAE's representations usually contain information outside the support of the data distribution. Thus, when we interpolate these representations, we may see something not resembling a face.

(a) FactorVAE (TC=50)
(b) β-VAE (β=50)
(c) AAE (Gz=50)
Figure 3: Visualization of the top informative representations. Scores are unnormalized.

Table 1 reports MISJED scores (Section 4.2) for the top most informative representations. FactorVAE achieves the lowest MISJED scores, AAE comes next, and β-VAE is the worst. We argue that this is because FactorVAE learns independent and nearly deterministic representations, β-VAE learns strongly independent yet highly stochastic representations, and AAE, at the other extreme, learns strongly deterministic yet not very independent representations. From Table 1 and Fig. 4, it is clear that MISJED produces correct orders among pairs of representations according to their informativeness.

MISJED (unnormalized)
FactorVAE 0.008 0.009 2.476 2.443 4.858 4.892
β-VAE 0.113 0.131 3.413 3.401 6.661 6.739
AAE 0.022 0.023 0.022 0.021 0.021 0.020
Table 1: Unnormalized MISJED scores (#bins = 50, 10% data). The column pairs correspond to pairs drawn from the top 3 and the bottom 3 latent variables, sorted by the informativeness scores in descending order. Bold indicates best results.
(a) FactorVAE (TC=50)
(b) β-VAE (β=50)
(c) AAE (Gz=50)
Figure 4: Normalized MISJED scores of all latent pairs sorted by their informativeness.

We report the RMIG scores and JEMMI scores for several ground truth factors on the CelebA dataset in Tables 2 and 3, respectively. In general, FactorVAE learns representations that agree better with the ground truth factors than -VAE and AAE do. This is consistent with the qualitative results in Fig. 5. However, all models still perform poorly for interpretability since their RMIG and JEMMI scores are very far from 1 and 0, respectively.

RMIG (normalized)
Bangs Black Hair Eyeglasses Goatee Male Smiling
H=0.4256 H=0.5500 H=0.2395 H=0.2365 H=0.6801 H=0.6923
FactorVAE 0.1742 0.0430 0.0409 0.0343 0.0060 0.0962
β-VAE 0.0176 0.0223 0.0045 0.0325 0.0094 0.0184
AAE 0.0035 0.0276 0.0018 0.0069 0.0060 0.0099
Table 2: Normalized RMIG scores (#bins=100, 100% data) for some factors. Higher is better.
JEMMI (normalized)
Bangs Black Hair Eyeglasses Goatee Male Smiling
H=0.4256 H=0.5500 H=0.2395 H=0.2365 H=0.6801 H=0.6923
FactorVAE 0.6118 0.6334 0.6041 0.6616 0.6875 0.6150
β-VAE 0.8632 0.8620 0.8602 0.8600 0.8690 0.8699
AAE 0.8463 0.8613 0.8423 0.8496 0.8644 0.8575
Table 3: Normalized JEMMI scores (#bins=100, 100% data) for some factors. Lower is better.
(a) AAE (Gz=50)
(b) FactorVAE (TC=50)
Figure 5: Top 10 representations that are most correlated with some ground truth factors. For each representation, we show its mutual information with the ground truth factor.
Sensitivity to the number of bins

All metrics we propose in this paper require computing the mutual information (MI). To handle continuous cases, we use quantization. It is important to note that quantization is just a trick for computing MI, not the inherent problem of our metrics. With quantization, we need to specify the number of bins (#bins) in advance. Fig. 6 (left, middle) shows the effect of #bins on RMIG scores and JEMMI scores for different models.

We can see that when #bins is small, RMIG scores are low. This is because the quantized distributions p(z_i) and p(z_i|y_k) look similar, causing I(y_k; z_i*) and I(y_k; z_j*) to be similar as well. When #bins is large, the quantized distributions look more different, leading to higher RMIG scores. RMIG scores are stable when #bins > 200.

Unlike RMIG scores, JEMMI scores keep increasing when we increase #bins. Note that JEMMI only differs from RMIG in the appearance of H(z_i*, y_k). Finer quantizations of z_i* introduce more information about z_i*, hence always lead to a higher H(z_i*, y_k) (see Fig. 6 (right)). Larger JEMMI scores also reflect the fact that finer quantizations make z_i* look more continuous, and thus less interpretable w.r.t. the discrete factor y_k.

Despite the fact that #bins affects the RMIG and JEMMI scores of a single model, the relative order among different models remains the same. This suggests that once we fix #bins, we can use RMIG and JEMMI scores to compare different models.

Figure 6: Dependence of RMIG (normalized), JEMMI (normalized) and H(z_i*, y_k) on the number of bins. We examined different FactorVAE and β-VAE models on the dSprites dataset.

6 Discussion

We have proposed information-theoretic characterizations of disentangled representations, and designed robust metrics for evaluation, along three dimensions: informativeness, separability and interpretability. We examined three well-known representation learning models namely FactorVAE, -VAE and AAE on CelebA, MNIST and dSprites datasets. Under our metrics, FactorVAE is the best among the three, with reasonably good informativeness and very good MISJED scores. In addition, FactorVAE also learns consistent representations. However, all the examined models still perform poorly under our metric for interpretability, meaning that they have not met desirable requirements for disentanglement learning. Our work also shows that unsupervised disentanglement may not be possible, and that labels of ground truth factors should be provided during learning. Thus, we plan to investigate methods which support semi-supervised or few-shot learning in the future.


Appendix A Appendix

a.1 Datasets

We used the CelebA, MNIST and dSprites datasets. For CelebA, we resized the original images to 64×64. Dataset statistics are provided in Table 4.

Dataset | #Train | #Test | Image size
CelebA | 162,770 | 19,962 | 64×64×3
MNIST | 60,000 | 10,000 | 28×28×1
dSprites | 737,280 | 0 | 64×64×1
Table 4: Summary of datasets used in experiments.

a.2 Model settings

For FactorVAE, β-VAE and AAE, we used the same architectures for the encoder and decoder (see Table 5 and Table 6; only FactorVAE and AAE use a discriminator over z), following kim2018disentangling . We trained the models for 300 epochs with mini-batches of size 64, using separate learning rates for the encoder/decoder and for the discriminator over z. We used the Adam kingma2014adam optimizer. Unless explicitly mentioned, we fixed the following: the number of latent variables to 65, the coefficient for the TC term in FactorVAE to 50, the value of β in β-VAE to 50, and the coefficient for the generator loss over z in AAE to 50.

We note that the FactorVAE in kim2018disentangling only used 10 latent variables for learning factors of variation on the CelebA dataset so our results may look different from theirs. However, by using larger numbers of latent variables, we are able to discover that FactorVAE learns consistent representations (see Appdx. A.8).

Encoder | Decoder | Discriminator Z
dims: 64×64×3 | dim: 65 | dim: 65
conv (4, 4, 32), stride 2, ReLU | FC 1×1×256, ReLU | 5×[FC 1000, LReLU]
conv (4, 4, 32), stride 2, ReLU | deconv (4, 4, 64), stride 1, valid, ReLU | FC 1
conv (4, 4, 64), stride 2, ReLU | deconv (4, 4, 64), stride 2, ReLU | : 1
conv (4, 4, 64), stride 2, ReLU | deconv (4, 4, 32), stride 2, ReLU |
conv (4, 4, 256), stride 1, valid, ReLU | deconv (4, 4, 32), stride 2, ReLU |
FC 65 | deconv (4, 4, 3), stride 2, ReLU |
dim: 65 | dim: 64×64×3 |
Table 5: Model architectures for CelebA.
Encoder | Decoder | Discriminator Z
dims: 28×28×1 | dims: 65 | dims: 65
conv (4, 4, 64), stride 2, LReLU | FC 1024, BN, ReLU | 4×[FC 256, LReLU]
conv (4, 4, 128), stride 2, BN, LReLU | FC 7×7×128, BN, ReLU | FC 1
FC 1024, BN, LReLU | deconv (4, 4, 64), stride 2, BN, ReLU | : 1
FC 128, BN, LReLU | deconv (4, 4, 1), stride 2, sigmoid |
FC 65, BN, LReLU | dims: 28×28×1 |
dims: 65 | |
Table 6: Model architecture for MNIST.
Encoder | Decoder | Discriminator Z
dims: 64×64×1 | dim: 65 | dim: 65
conv (4, 4, 32), stride 2, ReLU | FC 128, ReLU | 5×[FC 1000, LReLU]
conv (4, 4, 32), stride 2, ReLU | FC 4×4×64, ReLU | FC 1
conv (4, 4, 64), stride 2, ReLU | deconv (4, 4, 64), stride 2, ReLU | : 1
conv (4, 4, 64), stride 2, ReLU | deconv (4, 4, 32), stride 2, ReLU |
FC 128, ReLU | deconv (4, 4, 32), stride 2, ReLU |
FC 65 | deconv (4, 4, 1), stride 2, ReLU |
dim: 65 | dim: 64×64×1 |
Table 7: Model architecture for dSprites.

a.3 Reviews of disentanglement learning methods

There have been many works that attempt to learn disentangled representations, differing significantly in approach and generality. However, there is a lack of consensus on many key aspects locatello2019challenging , including the definition itself higgins2018towards .

Supervised methods kulkarni2015deep ; zhu2014multi assume access to the ground truth factors. For example, DC-IGN kulkarni2015deep is a VAE whose latent variables correspond to different ground truth factors. At each training step, DC-IGN chooses a mini-batch in which one ground truth factor varies while the other factors are fixed. It then only allows the latent variable corresponding to the selected factor to capture the variation in the mini-batch, by replacing all other latent variables with their mean values over the mini-batch. This "clamping" strategy was also applied in reed2014learning to improve the disentanglement capability of a higher-order Boltzmann Machine. Mathieu et al. mathieu2016disentangling proposed a conditional VAE that models both labeled factors of variation and other unspecified latent representations. Since the labels are given, this model simply learns the unspecified representations, which are assumed to be entangled. To ensure the decoder does not ignore the labeled information, the authors swap the unspecified latent representations of two samples and use an additional GAN to force the images generated before and after swapping to be similar. The main problem of this method is that none of the generated images are fixed, which results in unstable training of the GAN, as reported in mathieu2016disentangling . Other methods derived from mathieu2016disentangling include harsh2018disentangling ; ruiz2019learning .

Unsupervised methods learn disentangled representations directly from raw data without using knowledge about the ground truth factors of variation. Desjardins et al. desjardins2012disentangling made an early attempt at unsupervised disentanglement learning by using a higher-order spike-and-slab RBM with block-sparse connectivity to model the multiplicative interactions between (unknown) factors of variation. Despite some success on the Toronto Face dataset, this method has two main drawbacks that make it impractical: its modeling complexity and its oversimplified assumption about the multiplicative interactions between factors. Current state-of-the-art unsupervised methods are based on powerful deep generative models such as GAN chen2016infogan or VAE chen2018isolating ; higgins2017beta ; kim2018disentangling ; kumar2017variational . They show promising disentanglement results on many real datasets and are scalable. The key idea behind these methods is to learn independent yet informative representations.

Semi-supervised methods have also been proposed. Kingma et al. kingma2014semi proposed two variants of VAE to solve the semi-supervised learning problem. One variant, denoted M1, adds a classifier on top of the latent representation to predict the label $y$. The other variant, denoted M2, assumes a generative model $p(x|y, z)$, with inference networks $q(y|x)$ and $q(z|x, y)$ for $y$ and $z$, respectively. This M2 model is able to separate style from content using a very small amount of labeled data (about 1-5% of the total data). Siddharth et al. siddharth2017learning replace the variational objective of the M2 model with the importance-weighted loss burda2015importance so that the model can have arbitrary conditional dependencies between $y$ and $z$ instead of just the decomposition used in kingma2014semi . However, this does not lead to any significant change in the model architecture.
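For reference, the factorization assumed by the M2 model of kingma2014semi can be written as:

```latex
\underbrace{p(x, y, z) = p(x \mid y, z)\, p(y)\, p(z)}_{\text{generative model}}
\qquad
\underbrace{q(y, z \mid x) = q(y \mid x)\, q(z \mid x, y)}_{\text{inference networks}}
```

The importance-weighted variant of siddharth2017learning relaxes this fixed inference factorization while keeping the same generative model.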

A.4 Evaluating independence with the correlation matrix

For every $x$ sampled from the training data, we generated latent samples $z \sim q(z|x)$ and built a correlation matrix from these samples for each of the models FactorVAE, β-VAE and AAE. We also built another version of the correlation matrix based on the conditional means $\mu(x)$ of $q(z|x)$ instead of samples from $q(z|x)$. Both are shown in Fig. 7. We can see that the correlation matrices computed from the conditional means incorrectly describe the independence between the representations of FactorVAE and β-VAE. AAE is not affected because it learns deterministic $z$ given $x$. Using the correlation matrix is therefore not a principled way to evaluate independence in disentanglement learning.
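This failure mode can be reproduced with a toy stochastic encoder (purely illustrative numbers, not a trained model): conditional means that are correlated across inputs but dominated by independent posterior noise yield a correlation matrix that looks entangled when computed from the means, yet nearly diagonal when computed from the samples:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Toy stochastic encoder: the conditional means of two latent dims are
# perfectly correlated but small in scale...
mu = rng.normal(size=(n, 1)) * np.array([0.1, 0.1])
# ...while the per-sample posterior noise (sigma = 1) is independent
# across dimensions and dominates the samples z ~ q(z|x).
z = mu + rng.normal(size=(n, 2))

corr_means = np.corrcoef(mu.T)[0, 1]    # ~1.0: means look entangled
corr_samples = np.corrcoef(z.T)[0, 1]   # near 0: samples look independent
print(corr_means > 0.99, abs(corr_samples) < 0.1)
```

The two estimators disagree badly, which is exactly what Fig. 7 shows for the stochastic models.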

(a) FactorVAE (stochastic)
(b) β-VAE (stochastic)
(c) AAE (stochastic)
(d) FactorVAE (deterministic)
(e) β-VAE (deterministic)
(f) AAE (deterministic)
Figure 7: Correlation matrices of representations learned by FactorVAE, β-VAE and AAE, computed from stochastic samples (a-c) and from the deterministic conditional means (d-f).

A.5 Trade-off between informativeness, independence and the number of latent variables

Before starting our discussion, we provide the following fact:

Fact 2.

Assume we try to fill a fixed-size pool with fixed-size balls, given that all the balls must be inside the pool. The only way to increase the number of balls without making them overlap is to reduce their size.

Figure 8: Illustration of representations learned by AAE and FactorVAE. The big red circle represents the total amount of information that $x$ contains, which is limited by the amount of training data. Blue circles are informative representations $z_i$; the size of each circle indicates the informativeness of $z_i$. Green circles are noisy representations. AAE does not contain noisy representations; only FactorVAE does.

In the context of representation learning, the pool is $x$, whose size depends on the training data. Balls are the representations $z_i$, whose size is their informativeness $I(x; z_i)$. Fact 2 reflects the situation of AAE (see Fig. 8, left). In AAE, all $z_i$ are deterministic given $x$, so the condition “all balls are inside the pool” is met; $I(x; z_i) = H(z_i)$, which is fixed because the aggregated posterior is matched to a fixed prior, so the condition “fixed-size balls” is also met. Therefore, when the number of latent variables in AAE increases, all $z_i$ must become less informative (i.e., $I(x; z_i)$ must decrease) given that the independence constraint on the latent variables is still satisfied. This is empirically verified in Fig. 9, where the distribution of $z_i$ over all $x$ becomes narrower when we increase the number of representations from 65 to 200. Also note that increasing the number of latent variables from 65 to 100 does not change the distribution. This suggests that 65 or 100 latent variables are still not enough to capture all information in the data.

FactorVAE, however, handles the increasing number of latent variables in a different way. Thanks to the KL term in the loss function, which forces $q(z|x)$ to be stochastic, FactorVAE can break the constraint in Fact 2 and allow the balls to stay outside the pool (see Fig. 8, right). If we increase the number of latent variables but still enforce the independence constraint on them, FactorVAE will keep a fixed number of informative representations and make all other representations “noisy”, with zero informativeness scores. We refer to this capability of FactorVAE as code compression.
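The mechanism is visible in the per-dimension KL of the usual diagonal Gaussian posterior against a standard normal prior (a sketch of the standard VAE parameterization, not the paper's code): an informative dimension pays a large KL cost, while a compressed “noisy” dimension collapses onto the prior, so its KL (and hence its informativeness) goes to zero.

```python
import numpy as np

def kl_to_std_normal(mu, sigma):
    """KL( N(mu, sigma^2) || N(0, 1) ), the per-dimension KL cost
    of a diagonal Gaussian posterior in a VAE."""
    return 0.5 * (mu**2 + sigma**2 - 1.0 - 2.0 * np.log(sigma))

# An informative dimension: the posterior deviates strongly from the prior.
print(kl_to_std_normal(mu=1.5, sigma=0.2))  # large KL, carries information

# A "noisy" dimension compressed away by the KL term: the posterior
# equals the prior, so KL = 0 and the dimension carries no information.
print(kl_to_std_normal(mu=0.0, sigma=1.0))  # 0.0
```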

(a) z_dim=65
(b) z_dim=100
(c) z_dim=200
Figure 9: Distribution of a particular representation $z_i$ over all $x$ for AAE models with different numbers of latent variables.

A.6 Why can FactorVAE learn consistent representations?

Inspired by the variational information bottleneck theory alemi2016deep , we rewrite the standard VAE objective in an equivalent form as follows:

$$\min_{q} \; \mathcal{L}_{\mathrm{recon}}(x) + \beta \, \mathrm{KL}\big(q(z|x)\,\|\,p(z)\big)$$

where $\mathcal{L}_{\mathrm{recon}}(x)$ denotes the reconstruction loss over $x$ and $\beta$ is a scalar.

In the case of FactorVAE, since all latent representations are independent, we can decompose $\mathrm{KL}\big(q(z|x)\,\|\,p(z)\big)$ into $\sum_i \mathrm{KL}\big(q(z_i|x)\,\|\,p(z_i)\big)$. Thus, we argue that FactorVAE optimizes the following information bottleneck objective:

$$\min_{q} \; \mathcal{L}_{\mathrm{recon}}(x) + \beta \sum_i \mathrm{KL}\big(q(z_i|x)\,\|\,p(z_i)\big) \qquad (14)$$
We assume that $p(z_i)$ represents a fixed condition on all $q(z_i|x)$. Because $\mathrm{KL}(q\,\|\,p)$ is a convex function of $(q, p)$ (see Appdx. A.7), minimizing Eq. 14 leads to unique solutions for all $q(z_i|x)$ (note that we do not count permutation invariance among the $z_i$ here).
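The decomposition of the KL term follows directly once both the posterior and the prior factorize over dimensions:

```latex
\mathrm{KL}\big(q(z|x)\,\|\,p(z)\big)
= \mathbb{E}_{q(z|x)}\left[\sum_i \log \frac{q(z_i|x)}{p(z_i)}\right]
= \sum_i \mathrm{KL}\big(q(z_i|x)\,\|\,p(z_i)\big)
```

assuming $q(z|x) = \prod_i q(z_i|x)$ (the diagonal posterior of a VAE) and $p(z) = \prod_i p(z_i)$.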

To make $p(z_i)$ a fixed condition on all $q(z_i|x)$, we can further optimize with $z$ sampled from a fixed distribution such as $\mathcal{N}(0, I)$. This suggests that we can add a GAN objective to the original FactorVAE objective to achieve more consistent representations.

A.7 $\mathrm{KL}(q\,\|\,p)$ is a convex function of $(q, p)$

Let us start with the definition of a convex function and some of its known properties.

Definition 3.

Let $\mathcal{X}$ be a convex set in a real vector space and $f: \mathcal{X} \to \mathbb{R}$ be a function that outputs a scalar. $f$ is convex if $\forall x_1, x_2 \in \mathcal{X}$ and $\forall t \in [0, 1]$, we have:

$$f\big(t x_1 + (1-t) x_2\big) \le t f(x_1) + (1-t) f(x_2)$$

Proposition 4.

A twice differentiable function is convex on an interval if and only if its second derivative is non-negative there.
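As a concrete example relevant to the convexity of the KL divergence:

```latex
f(x) = x \log x, \quad x \in (0, \infty):
\qquad f''(x) = \frac{1}{x} > 0,
```

so $f$ is convex on $(0, \infty)$ by Proposition 4.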

Proposition 5 (Jensen’s inequality).

Let $x_1, \ldots, x_n$ be real numbers and let $a_1, \ldots, a_n$ be positive weights such that $\sum_{i=1}^n a_i = 1$. If $f$ is a convex function on a domain containing the $x_i$, then

$$f\Big(\sum_{i=1}^n a_i x_i\Big) \le \sum_{i=1}^n a_i f(x_i)$$

Equality holds if and only if all $x_i$ are equal or $f$ is a linear function.

Proposition 6 (Log-sum inequality).

Let $a_1, \ldots, a_n$ and $b_1, \ldots, b_n$ be non-negative numbers. Denote $a = \sum_{i=1}^n a_i$ and $b = \sum_{i=1}^n b_i$.