Bridging Disentanglement with Independence and Conditional Independence via Mutual Information for Representation Learning

11/25/2019, by Xiaojiang Yang et al. (Shanghai Jiao Tong University, Microsoft)

Existing works on disentangled representation learning usually rely on a common assumption: all factors in a disentangled representation should be independent. This assumption concerns an internal property of disentangled representations while ignoring their relation to the external data. To tackle this problem, we propose another assumption that establishes an important relation between data and its disentangled representations via mutual information: the mutual information between each factor of a disentangled representation and the data should be invariant to the other factors. We formulate this assumption into mathematical equations and theoretically bridge it with independence and conditional independence of factors. Meanwhile, we show that conditional independence is satisfied in the encoders of VAEs due to the factorized noise in reparameterization. To highlight the importance of the proposed assumption, we show in experiments that violating it leads to a dramatic decline in disentanglement. Based on this assumption, we further propose to split the deeper layers in the encoder so that their parameters are not shared across factors. The proposed encoder, called the Split Encoder, can be applied to models that penalize the total correlation, and shows significant improvement in unsupervised learning of disentangled representations and in reconstructions.


1 Introduction

Learning disentangled representations has been considered an important step towards interpretable and more effective machine learning [3, 2, 18, 31, 35, 30]. Disentangled representations have been shown to be interpretable or semantically meaningful [7, 17], robust to adversarial attacks [1], more generalizable [34] and correlated with fairness [22]. They are also useful for many downstream tasks, including sequential data generation [37], reinforcement learning [13, 26], robot learning [19], transfer learning [21] and few-shot learning [14, 3]. Although there is no formal definition of a disentangled representation beyond some attempts [11], we adopt the one from Bengio et al. [3]: a representation in which each factor corresponds to a single factor of variation in data, and is meanwhile invariant to the other factors of variation.

There exist unsupervised approaches to learning disentangled representations based on generative adversarial nets (GANs) [10] and variational auto-encoders (VAEs) [16]. To some extent, the VAE-based models have become dominant for their stability. Different VAE-based models for learning disentangled representations have been proposed with different motivations, such as limiting the bottleneck capacity [12, 5], penalizing the total correlation (see Eq. 1) [15, 6], and matching factorized priors [17]; they can all be attributed to factorizing the distribution of representations [23, 24, 15, 6]. Therefore, these models rely on a common assumption (see Assumption 1): the distribution of a disentangled representation is factorized, i.e. the factors are independent.

However, according to Bengio's statement [3], disentanglement is a bijective mapping between factors in representations and factors of variation in data, which emphasizes the relation between data and representations. Considering that the common assumption focuses on the independence of factors, which is an internal property of representations, we argue that it is not sufficient for disentanglement. For a more comprehensive understanding of disentanglement, additional assumptions are needed to describe the relation between data and disentangled representations.

Based on the above observation, in this paper we propose to establish a specific relation between data and disentangled representations via mutual information: the mutual information between data and each factor in a disentangled representation is invariant to the other factors. This assumption is intuitively necessary but not sufficient for Bengio's statement; specifically, it is not sufficient to ensure the bijective mapping between factors in representations and factors of variation in data. Nevertheless, the proposed assumption does bring further understanding of disentanglement and induces an effective technique for disentangled representation learning, as will be shown in this paper.

We first formulate the proposed assumption into mathematical equations in terms of conditional mutual information (see Eq. 5 and Eq. 7), which opens a direction for studying the structured relation between data and representations in representation learning. Then we theoretically bridge it with independence and conditional independence of factors in representations, showing that conditional independence is also an important property for disentanglement. We also show that conditional independence is satisfied in the encoder of a VAE due to the factorized noise in reparameterization.

Motivated by the theoretical analysis, we propose the Split Encoder to encourage different factors in representations to learn different information from data. Specifically, the deeper layers in the encoder are split to ensure that the parameters in these layers are not shared across factors. Moreover, the split encoder also encourages learning more information from data, and thus contributes to improving reconstructions.

To verify our assumption, we perform experiments in which conditional independence is violated and find that this leads to a dramatic decline in disentanglement, which demonstrates the importance of conditional independence and supports our assumption. Furthermore, experiments show that the split encoder can significantly improve the disentanglement of representations learned by FactorVAE [15] and TC-VAE [6] on the dSprites [12], SmallNORB [20] and Cars3D [28] data sets, and meanwhile improve reconstructions.

Our main contributions can be summarized as follows:

  • We propose a fundamental assumption for disentangled representations and connect it with independence and conditional independence. We empirically show the importance of both conditional independence and our assumption for disentanglement. The mathematical results in this paper open a direction for understanding the structured relation between data and representations in representation learning.

  • Based on the theoretical analysis, we then develop a simple and effective architecture called split encoder to improve disentanglement and reconstructions. It can improve those models that penalize total correlation.

  • Experimental results on dSprites, SmallNORB and Cars3D show our approach combined with TC-VAE [6] and FactorVAE [15] can significantly improve disentanglement, and it also improves reconstructions.

2 Related Work

There are many works related to disentanglement with different objectives. One line of work focuses on disentangled representations for particular types of data [37]. Attention has also been paid to supervised [25] or semi-supervised learning of disentangled representations [33, 32], while other works explore different objectives [29, 38]. There are also works on the evaluation of disentanglement [9]. Since our assumption and analysis lie at the foundation of disentangled representation learning, in this section we focus on works related to the understanding of disentanglement and on basic unsupervised models for disentangled representation learning.

Some works on unsupervised learning of disentangled representations are based on early generative models. The authors in [30] propose a variant of the auto-encoder that learns disentangled representations by minimizing the predictability of one factor in the representation when the other factors are fixed; this model is clearly motivated by the independence of factors, i.e. the common assumption. The methods in [8] and [27] propose variants of (Restricted) Boltzmann Machines in which interactions act to entangle the factors.

Recent studies on unsupervised learning of disentangled representations are mainly based on GANs and VAEs. In the line of GANs, InfoGAN [7] maximizes the mutual information between latent codes and data, and qualitatively shows that different factors in representations correspond to different visual concepts. The authors in [4] propose to penalize, with a discriminator, the Jensen-Shannon divergence between the distribution of representations and its factorized counterpart, based on Independent Component Analysis.

The mainstream unsupervised models for disentangled representation learning are variants of the vanilla VAE due to its stability. β-VAE [12] encourages the encoder to learn disentangled representations by up-weighting the KL term in the vanilla VAE objective. AnnealedVAE [5] progressively increases the bottleneck capacity of the VAE to encourage the encoder to learn additional factors of variation as the capacity grows. FactorVAE [15] uses a discriminator to penalize the total correlation via the density-ratio trick, thereby enhancing the independence of factors in representations. DIP-VAE [17] matches the distribution of representations with disentangled priors. In TC-VAE [6], the authors decompose the VAE objective and argue that the total correlation term is the source of disentanglement; they then derive a mini-batch estimator of the total correlation term and penalize it to enhance disentanglement. Most VAE-based models can thus be attributed to penalizing the total correlation and thereby enhancing independence, which coincides with the common assumption.
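For concreteness, the density-ratio trick used by FactorVAE to penalize the total correlation can be sketched as follows. This is a minimal illustration under assumptions (the network sizes and variable names are ours, not the original implementation): a discriminator is trained to distinguish samples from q(z) from samples whose dimensions are permuted independently across the batch, which approximates the factorized distribution, and its logits then estimate the total correlation.

import torch
import torch.nn as nn

# Discriminator distinguishing "real" z ~ q(z) from dimension-permuted z.
discriminator = nn.Sequential(nn.Linear(10, 256), nn.LeakyReLU(0.2), nn.Linear(256, 2))

def permute_dims(z):
    # Shuffle each latent dimension independently across the batch,
    # approximating samples from the product of marginals prod_j q(z_j).
    return torch.stack([z[torch.randperm(z.size(0)), j] for j in range(z.size(1))], dim=1)

z = torch.randn(64, 10)                 # stand-in for a batch of sampled representations
z_perm = permute_dims(z)                # "fake" samples used to train the discriminator
logits = discriminator(z)
tc_estimate = (logits[:, 0] - logits[:, 1]).mean()   # approx. KL(q(z) || prod_j q(z_j))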

The impossibility theorem [23] is also an important result for unsupervised learning of disentangled representations. This theorem claims that without inductive biases on both models and data sets, it is impossible to learn disentangled representations in an unsupervised manner by simply ensuring the independence of factors. Note that the assumption proposed in this paper does not simply focus on the independence of factors, but whether it circumvents the impossibility theorem remains to be investigated.

3 Theoretical Analysis

In this section, we first discuss the intuition behind disentangled representation learning and highlight that the relation between data and representations is vital for disentanglement. Then we point out that the common assumption only describes an internal property of representations without concerning this relation, and hence is not sufficient for disentanglement. Furthermore, we analyze the difficulty of formulating the relation, revealing that it is mainly caused by the difficulty of expressing the factors of variation in data. Motivated by this, and specifically in order to avoid the factors of variation, we propose an assumption that describes one such relation via mutual information and formulate it into mathematical equations. Finally, we connect the proposed assumption with independence and conditional independence, and show its importance for disentanglement.

3.1 Existing Common Assumption

Although there is no formal definition of a disentangled representation, the key intuition is that different factors in a disentangled representation should correspond to different factors of variation in data. Specifically, a single factor in a disentangled representation is only sensitive to changes of a single factor of variation in data, while being relatively invariant when the other factors change. This statement emphasizes the relation between data and the factors in representations, and also implies the independence of factors in a disentangled representation. Therefore, this relation is vital for disentanglement.

As summarized in Section 2, most current unsupervised models for disentangled representation learning can be attributed to penalizing the total correlation:

\mathrm{TC}(z) = D_{KL}\Big(q(z)\,\Big\|\,\prod_{j=1}^{d} q(z_j)\Big), \quad (1)

where q(z) = \mathbb{E}_{p(x)}[q(z \mid x)] is the distribution of representations over the entire data set, p(x) is the distribution of real data, and d is the dimensionality of the representation. When the total correlation becomes zero, q(z) = \prod_{j=1}^{d} q(z_j) almost everywhere. In this case, the factors in representations are independent. Therefore, we can conclude that these models simply rely on the common assumption: factors in disentangled representations should be independent. For clarity, we summarize the common assumption as follows:

Assumption 1

(The Common Assumption) Suppose the representation z of the data variable x is disentangled. Then the factors in z are independent, i.e.

q(z) = \prod_{j=1}^{d} q(z_j) \quad (2)

almost everywhere.
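As a concrete illustration (our own example, not part of the original derivation), consider a two-dimensional Gaussian representation with unit variances and correlation coefficient \rho. The total correlation in Eq. 1 then has a closed form:

\mathrm{TC}(z) = D_{KL}\Big(\mathcal{N}(0, \Sigma)\,\Big\|\,\mathcal{N}(0, I_2)\Big) = -\tfrac{1}{2}\log(1 - \rho^2), \qquad \Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix},

which is zero if and only if \rho = 0, i.e. exactly when q(z) factorizes as in Eq. 2.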

Independence is an internal property of representations. However, as highlighted above, disentanglement also emphasizes the relation between data and representations. Therefore, the common assumption only describes a partial property of disentanglement and is thus not sufficient. Motivated by this, we aim at proposing an assumption that describes one relation between data and disentangled representations.

3.2 Proposed New Assumption

Our key insight is that we can establish an assumption via mutual information to describe the relation between data and disentangled representations. According to Bengio's statement [3], each factor in a disentangled representation corresponds to a single factor of variation in data, and is invariant to the other factors of variation. Therefore, we can derive that the mutual information between each factor and the data is only related to the corresponding factor of variation, and is invariant to the other factors of variation.

However, there is commonly no explicit expression for a useful factor of variation, which obstructs the formulation of the statement above. Fortunately, we find an assumption that avoids the factors of variation: the mutual information between data and each factor in a disentangled representation is invariant to the other factors. This assumption is obviously weaker than the statement above, as it can be derived from the statement but does not imply it. The most appealing aspect of this assumption is that it entirely avoids the factors of variation in data, which enables us to formulate it into mathematical equations.

To formulate the invariance of mutual information to other factors, we involve conditional mutual information. For clarity, we begin by introducing mutual information. Mutual information measures the information about one variable contained in another:

I(x; z_j) = H(z_j) - H(z_j \mid x), \quad (3)

where the entropy H(z_j) and the conditional entropy H(z_j \mid x) are measures of uncertainty. Hence mutual information is the reduction in uncertainty about one variable when another is given.

As for conditional mutual information, it measures the mutual information between two variables when other variables are given:

I(x; z_j \mid z_A) = H(z_j \mid z_A) - H(z_j \mid x, z_A), \quad (4)

where j \notin A, A is any subset of \{1, \dots, d\} \setminus \{j\}, and z_A denotes all factors whose subscript index is in A. Formally, the conditional mutual information is the difference of two conditional entropies. Specifically, the conditional mutual information above measures the mutual information between the factor z_j and the data x when any other factors z_A are given.

Using the concepts above, we can elegantly formulate the proposed assumption into mathematical equations. Here we restate the proposed assumption via conditional mutual information: the mutual information between data and each factor in a disentangled representation remains invariant when any other factors are given, i.e. the mutual information between any factor and the data is equal to the conditional mutual information conditioned on any other factors. In conclusion, the proposed assumption is formulated as:

Assumption 2

(The Proposed Assumption) Suppose the representation z of the data variable x is disentangled. Then for any single factor z_j, its mutual information with the data is invariant to the other factors z_A, i.e.

I(x; z_j) = I(x; z_j \mid z_A) \quad (5)

for any A \subseteq \{1, \dots, d\} \setminus \{j\}.

Note that conditional mutual information is closely related to the chain rule of mutual information. Using the chain rule, we derive another equivalent equation for the proposed assumption. First, we state a lemma (proof deferred to the appendix):

Lemma 1

I(x; z_j) = I(x; z_j \mid z_A) for any A is equivalent to the following equation:

I(x; z_{A \cup \{j\}}) = I(x; z_A) + I(x; z_j), \quad (6)

where A is any subset of \{1, \dots, d\} \setminus \{j\} and z_j is any single factor outside A.

For further analysis of the proposed assumption, we reformulate it into another equivalent form. As shown in Eq. 6, the proposed assumption (Eq. 5) is equivalent to an additive decomposition of mutual information. Iteratively applying Eq. 6, we finally arrive at another equivalent formulation of the proposed assumption:

Assumption 3

(The Proposed Assumption*) Suppose the representation z of the data variable x is disentangled. Then for any subset of factors z_A, their mutual information with the data is equal to the sum of the mutual information between each factor and the data, i.e.

I(x; z_A) = \sum_{j \in A} I(x; z_j). \quad (7)

This equation is more suitable for further analysis, as it avoids conditional mutual information and is hence cleaner and easier to work with.

Finally, we highlight that the proposed assumption intuitively describes one natural relation between data and disentangled representations, and is thus necessary for disentanglement. Meanwhile, note that the proposed assumption does not capture all such relations and cannot ensure that each factor corresponds to a single factor of variation. Therefore, it is necessary but not sufficient for disentanglement.
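As a toy illustration (our own example, not part of the original analysis), suppose the data has two independent factors of variation, x = (x_1, x_2). A representation that copies each factor satisfies Eq. 7, while a representation whose factors duplicate the same information violates it:

\text{If } z_1 = x_1,\ z_2 = x_2:\quad I(x; z_1, z_2) = H(x_1) + H(x_2) = I(x; z_1) + I(x; z_2),
\text{if } z_1 = z_2 = x_1:\quad I(x; z_1, z_2) = H(x_1) < 2\,H(x_1) = I(x; z_1) + I(x; z_2).

In the second case the two factors are redundant rather than disentangled, and Eq. 7 fails, in accordance with the proposed assumption.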

3.3 Connection with Independence and Conditional Independence

Based on the mathematical forms of the proposed assumption, we bridge it with independence and conditional independence. Specifically, we obtain the following theorem (proof deferred to the appendix):

Theorem 1

For any subset A of \{1, \dots, d\}, we have:

I(x; z_A) - \sum_{j \in A} I(x; z_j) = \mathbb{E}_{p(x)}\Big[ D_{KL}\Big( q(z_A \mid x) \,\Big\|\, \prod_{j \in A} q(z_j \mid x) \Big) \Big] - D_{KL}\Big( q(z_A) \,\Big\|\, \prod_{j \in A} q(z_j) \Big). \quad (8)

The theorem above bridges the proposed assumption with independence and conditional independence. In Eq. 8, the difference on the left-hand side becomes zero if and only if the proposed assumption is satisfied, while on the right-hand side, the two KL terms become zero if and only if the factors are conditionally independent and independent, respectively.

From the theoretical analysis above, we can obtain some properties of disentanglement. As mentioned above, the common assumption and the proposed assumption are necessary for disentanglement, thus we can assume they are satisfied in disentangled representations. In this case, both KL terms in Eq. 8 become zero: the second vanishes by the common assumption, and since the left-hand side vanishes by the proposed assumption, the first must vanish as well. To conclude, conditional independence and independence are necessary for disentanglement:

Proposition 1

Suppose the representation z of the data variable x is disentangled. Then the factors in z are conditionally independent and independent, i.e.

q(z \mid x) = \prod_{j=1}^{d} q(z_j \mid x) \quad (9)
q(z) = \prod_{j=1}^{d} q(z_j) \quad (10)

almost everywhere.

Finally, we summarize the importance of the proposed assumption for disentanglement. When the common assumption is satisfied, our assumption is equivalent to the first KL term in Eq. 8 being zero, i.e. to the factors in representations being conditionally independent. Therefore, compared with the common assumption, the proposed assumption highlights the importance of conditional independence, which is a necessary property of disentanglement.

4 Models

In this section, we first show that the factorized noise in reparameterization ensures conditional independence of the factors in representations. Conditioned on this, we reformulate the total correlation using the theoretical analysis of the proposed assumption, which leads to an information-theoretic understanding of the total correlation. Motivated by this intuition, we propose a simple architecture that encourages different factors to learn different information and meanwhile improves reconstructions.

(a) Vanilla encoder
(b) Split encoder
Figure 1: Structures of the vanilla encoder and the split encoder. The blue and red boxes denote the "shared-parameter block" and the "no-shared-parameter block", respectively.

4.1 Inductive Bias for Conditional Independence

As discussed in Section 3.3, both independence and conditional independence are necessary for disentanglement. In most existing works, independence is emphasized while conditional independence is usually ignored. However, they can still achieve less entangled representations. To explain this, we analyze the encoders in VAEs, and find that factorized noise in reparameterization exactly ensures conditional independence.

In the encoder of a VAE, the representation z is sampled from q(z \mid x) by reparameterization. For clarity, here we use the most common case q(z \mid x) = \mathcal{N}\big(\mu(x), \mathrm{diag}(\sigma^2(x))\big). Specifically, the data x is encoded into a mean \mu(x) and a variance \sigma^2(x); reparameterization then forms the representation from \mu(x), \sigma(x) and a noise \epsilon as follows:

z = \mu(x) + \sigma(x) \odot \epsilon. \quad (11)

Note that the noise \epsilon follows the standard Gaussian distribution \mathcal{N}(0, I), thus it is factorized, i.e. p(\epsilon) = \prod_{j=1}^{d} p(\epsilon_j).

When x is given, \mu(x) and \sigma(x) are fixed, so the randomness of z originates solely from the noise \epsilon. Hence the conditional distribution q(z \mid x) is determined by the noise distribution p(\epsilon). Obviously, if and only if p(\epsilon) = \prod_{j} p(\epsilon_j), we have q(z \mid x) = \prod_{j} q(z_j \mid x) for any x, i.e. the factors are conditionally independent. Therefore, in reparameterization, the factors in representations are conditionally independent if and only if the noise is factorized. A minimal sketch of this mechanism is given below.
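The following PyTorch-style sketch (our own illustration; tensor shapes and names are assumptions, not the paper's implementation) makes the point concrete: with factorized noise, the coordinates of z are independent given x.

import torch

def reparameterize(mu, log_var):
    # Standard VAE reparameterization: z = mu + sigma * eps with eps ~ N(0, I).
    # Because eps is factorized, q(z | x) = prod_j q(z_j | x): the factors of z
    # are conditionally independent given x.
    std = torch.exp(0.5 * log_var)
    eps = torch.randn_like(std)          # factorized standard Gaussian noise
    return mu + std * eps

# Example: a batch of 8 inputs encoded into 10-dimensional means / log-variances.
mu = torch.zeros(8, 10)
log_var = torch.zeros(8, 10)
z = reparameterize(mu, log_var)          # shape (8, 10)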

To conclude, the factorized noise is an inductive bias on the model that ensures conditional independence. Therefore, for representations learned by reparameterization with factorized noise, conditional independence is exactly satisfied. This result explains why ignoring conditional independence does not lead to highly entangled representations in existing works, and it also provides us with an approach to verify the importance of conditional independence and the proposed assumption, as shown in the experiments. In the following, we present the proposed split encoder.

4.2 The Proposed Split Encoder

As shown in Section 4.1, conditional independence is already satisfied by reparameterization. Conditioned on this, we can reformulate the total correlation using Eq. 8, which naturally leads to a simple architecture that encourages different factors to learn different information about the data:

Theorem 2

Suppose q(z \mid x) = \prod_{j=1}^{d} q(z_j \mid x) almost everywhere. Then we have:

\mathrm{TC}(z) = \sum_{j=1}^{d} I(x; z_j) - I(x; z). \quad (12)

This theorem demonstrates that when conditional independence is satisfied, \mathrm{TC}(z) = \sum_{j} I(x; z_j) - I(x; z), and penalizing the total correlation is equivalent to reducing the difference between the two terms on the right-hand side.

Note that the first term measures the sum of the information about the data contained in the individual factors, and the second term measures the total information about the data contained in the representation. Intuitively, when different factors retain different information about the data, the difference between the two terms becomes zero. Therefore, conditioned on conditional independence, lowering the total correlation is also equivalent to encouraging different factors to learn different information about the data, which coincides with the proposed assumption and thus improves disentanglement. Motivated by this, we can improve the effect of penalizing the total correlation by encouraging different factors to learn different information.

To achieve this, we propose the Split Encoder, in which the deeper layers of the encoder are split to ensure that their parameters are not shared across factors. For clarity, we call the block formed by these layers the "no-shared-parameter block", while the block formed by the remaining layers is called the "shared-parameter block". The structures of the vanilla encoder and the split encoder are shown in Fig. 1.

From Fig. 1, we can see that in the vanilla encoder, the no-shared-parameter block only contains a single fully-connected layer. This structure might not be sufficient for different factors to learn different information, which limits the effectiveness of penalizing the total correlation. In contrast, the capacity of the no-shared-parameter block in the split encoder is larger, which encourages different factors to learn different information and thereby improves the performance of penalizing the total correlation for disentangled representation learning. A minimal sketch of such an encoder is given below.
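The sketch below is our own illustration under assumptions: the layer sizes, the fully-connected trunk and the class name SplitEncoder are hypothetical, not the authors' exact architecture. It shows the idea of a shared-parameter block followed by one small per-factor head that outputs the mean and log-variance of a single factor.

import torch
import torch.nn as nn

class SplitEncoder(nn.Module):
    # Shared-parameter block followed by d per-factor heads whose parameters
    # are not shared across factors (the "no-shared-parameter block").
    def __init__(self, in_dim=4096, hidden=256, num_factors=10):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.heads = nn.ModuleList(
            [nn.Sequential(nn.Linear(hidden, 64), nn.ReLU(), nn.Linear(64, 2))
             for _ in range(num_factors)]
        )

    def forward(self, x):
        h = self.shared(x.flatten(start_dim=1))
        # Each head predicts (mu_j, log_var_j) for its own factor only.
        stats = torch.stack([head(h) for head in self.heads], dim=1)  # (B, d, 2)
        mu, log_var = stats[..., 0], stats[..., 1]
        return mu, log_var

# Usage: encode a batch of flattened 64x64 images into 10 factors.
enc = SplitEncoder(in_dim=64 * 64, num_factors=10)
mu, log_var = enc(torch.randn(8, 64 * 64))   # each of shape (8, 10)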

Another appealing property of the split encoder is that it encourages the representation to learn more information from the data. This property is natural, as different factors are encouraged to learn different rather than the same information, so the total information in the representation increases. Since more information in the representation leads to better reconstructions in VAEs, the split encoder also improves reconstructions.

5 Experiments

The objective of this section is not only to show that the proposed split encoder can improve unsupervised learning of disentangled representations, but also to verify the importance of the proposed assumption by investigating how disentanglement changes when conditional independence is violated.

Data sets: According to [23], the performance of any model can differ substantially across data sets. To evaluate the split encoder, we therefore choose three distinct data sets: dSprites [12], SmallNORB [20] and Cars3D [28]. dSprites is a set of 737,280 black-and-white 64×64 images generated from five independent latent factors. SmallNORB contains 24,300 image pairs of 50 3D toys captured by two cameras; each image pair is grey-scale with shape 2×96×96. Cars3D consists of 199 colorful 3D car models with shape 128×128×3×24×4.

Models: For the verification experiments, we select the vanilla VAE and FactorVAE as baselines, which penalize the total correlation weakly and strongly, respectively. We believe this choice is sufficient to show the importance of our assumption. For unsupervised learning of disentangled representations, we choose FactorVAE and TCVAE as baselines. As FactorVAE and TCVAE penalize the total correlation with different methods, and the proposed split encoder is designed to improve the effect of lowering the total correlation, this choice allows us to test its effectiveness.

Metrics: Following [36], we use the Mutual Information Gap (MIG) [6] to evaluate disentanglement for its usefulness and rationality. MIG is computed by first estimating the normalized mutual information between each factor in the representation and each ground-truth factor, then computing the gap between the two highest normalized mutual information values across the representation factors, and finally averaging this gap over the ground-truth factors. We also use the reconstruction error in the VAE objective to measure the information captured in the representation; a lower reconstruction error means that more meaningful information is learned. A minimal sketch of the MIG computation is given below.
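The following NumPy sketch (our own illustration; it assumes a precomputed matrix of mutual information estimates and the ground-truth factor entropies, which in practice come from a discretized MI estimator) mirrors the MIG definition above.

import numpy as np

def mig(mi_matrix, gt_entropies):
    # mi_matrix: (num_latents, num_gt_factors) mutual information estimates.
    # gt_entropies: (num_gt_factors,) entropies of the ground-truth factors.
    sorted_mi = np.sort(mi_matrix, axis=0)           # ascending along latents
    gaps = sorted_mi[-1, :] - sorted_mi[-2, :]        # top-1 minus top-2 MI per factor
    return np.mean(gaps / gt_entropies)               # normalize, then average

# Toy usage with 10 latents and 5 ground-truth factors.
rng = np.random.default_rng(0)
mi = rng.random((10, 5))
ent = np.ones(5)
print(mig(mi, ent))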

Hyperparameters: For fairness of comparison, we use the disentanglement_lib implementation introduced by [23] without tuning. The number of factors in the representation is set to 10. The weights of the penalties in FactorVAE and TCVAE are set to 35 and 6, respectively, in all experiments. The code for the split encoder can be found in the supplemental material. Each model is trained ten times on every data set without tuning, and we record the MIG scores and reconstruction errors.

For comprehensive comparison, we report the MIG score distributions with violin plots. A violin plot shows the distribution of points with a density estimate, the median value with a white dot, the interquartile range with a thick black bar, and the upper and lower adjacent values with a thin black line. Furthermore, we also present the MIG scores and reconstruction errors in scatter plots, in which the reconstruction errors can be compared clearly and the raw MIG scores are shown. Images of reconstructions and traversals can be found in the appendix.

(a) Vanilla VAE
(b) FactorVAE
Figure 2: MIG scores of the verification experiments with vanilla VAE and FactorVAE on dSprites. In each subfigure, the first violin plot is the baseline and the second one is the same model with correlated noise injected into the reparameterization.
(a) FactorVAE (+ split encoder)
(b) TCVAE (+ split encoder)
Figure 3: MIG scores of FactorVAE and TCVAE (with split encoder) on dSprites. In each subfigure, the first violin plot is the baseline and the second one is the model with the split encoder.
(a) FactorVAE (+ split encoder)
(b) TCVAE (+ split encoder)
Figure 4: MIG score vs. reconstruction error of FactorVAE and TCVAE (with split encoder) on dSprites. In each subfigure, the blue round dots are results of the baseline and the red triangles are results of the model with the split encoder.

5.1 Violating Conditional Independence

As established by our analysis, conditional independence is a necessary property of disentanglement, and it originates from the factorized noise in reparameterization. Therefore, to verify the importance and necessity of our assumption, in this section we inject a correlated noise into the reparameterization.

Specifically, we introduce a correlated noise as follows:

(13)

where \mathbf{1} denotes the all-ones column vector, the factors of the noise, which follow a Gaussian distribution with a non-diagonal covariance matrix, are highly correlated, and \alpha is the correlation weight. A larger \alpha corresponds to higher correlation, and when \alpha = 0 the noise is factorized. We set \alpha to a positive value for the comparisons, while in the baselines \alpha = 0.

As shown in Fig. 2, for both the vanilla VAE and FactorVAE, correlated noise leads to a dramatic decline of disentanglement in terms of the MIG score on dSprites. Even though independence is enforced in FactorVAE, with correlated noise the learned representations become highly entangled and unstable in terms of the MIG score. These results show that factorized noise in reparameterization is vital for disentanglement, i.e. conditional independence is a necessary property of disentanglement, and they also demonstrate the importance and necessity of the proposed assumption.

(a) FactorVAE (+ split encoder)
(b) TCVAE (+ split encoder)
Figure 5: MIG scores of FactorVAE and TCVAE (with split encoder) on SmallNORB. In each subfigure, the first violin plot is the baseline and the second one is the model with the split encoder.
(a) FactorVAE (+ split encoder)
(b) TCVAE (+ split encoder)
Figure 6: MIG score vs. reconstruction error of FactorVAE and TCVAE (with split encoder) on SmallNORB. In each subfigure, the blue round dots are results of the baseline and the red triangles are results of the model with the split encoder.

5.2 Unsupervised Learning of Disentangled Representations

In this subsection we evaluate the split encoder on disentangled representation learning and reconstruction, in terms of the MIG score and the reconstruction error.

The violin plots of the MIG scores are shown in Fig. 3, Fig. 5 and Fig. 7. For the baselines, the MIG score distributions differ across the three data sets: on dSprites, both FactorVAE and TCVAE perform relatively well and stably, while on SmallNORB and Cars3D their MIG scores are less stable and much worse.

Nevertheless, on all three data sets, using the split encoder in FactorVAE and TCVAE significantly improves their MIG scores. In individual runs, a model with the split encoder can produce a worse result than the baseline; however, the overall MIG score distributions are better, and both the median and the highest MIG scores of models with the split encoder are much better than those of the baselines. Therefore, the split encoder can significantly improve the disentanglement of FactorVAE and TCVAE in terms of the MIG score on different data sets. These results show its ability to improve disentanglement for models that penalize the total correlation, which, according to our theoretical analysis, is due to encouraging different factors to learn different information.

(a) FactorVAE (+ split encoder)
(b) TCVAE (+ split encoder)
Figure 7: MIG scores of FactorVAE and TCVAE (with split encoder) on Cars3D. In each subfigure, the first violin plot is the baseline and the second one is the model with the split encoder.
(a) FactorVAE (+ split encoder)
(b) TCVAE (+ split encoder)
Figure 8: MIG score vs. reconstruction error of FactorVAE and TCVAE (with split encoder) on Cars3D. In each subfigure, the blue round dots are results of the baseline and the red triangles are results of the model with the split encoder.

The scatter plots of MIG score vs. reconstruction error are shown in Fig. 4, Fig. 6 and Fig. 8. Note that the scale of the reconstruction error differs across data sets: roughly 46–55 on dSprites, 1965–1979 on SmallNORB and 1437–1510 on Cars3D. The high reconstruction errors on SmallNORB and Cars3D might be due to the higher dimensionality and complexity of the data points in these data sets; nevertheless, the reconstructed images appear to be of relatively high quality (see supplemental material).

Although the reconstruction errors differ across the three data sets, the split encoder significantly and consistently reduces the reconstruction errors of FactorVAE and TCVAE. This result demonstrates that the split encoder improves reconstructions, which indicates that it learns more information from the data, in line with our theoretical analysis of the proposed assumption.

In addition, the scatter plots show that the reconstruction error and disentanglement in terms of the MIG score are typically negatively correlated, i.e. lower reconstruction error tends to come with higher disentanglement. This indicates that when a representation with a fixed number of factors learns more information, it tends to be more disentangled, as also reported in [15].

To conclude, for models that penalize the total correlation, such as FactorVAE and TCVAE, the split encoder improves unsupervised learning of disentangled representations while also reducing the reconstruction error. These experimental results coincide well with our theoretical analysis and hence support the proposed assumption.

6 Conclusion and Outlook

We point out that the common assumption is not sufficient for disentanglement, and hence propose a new assumption: the mutual information between data and each factor in representations is invariant to the other factors. This assumption is necessary but not sufficient for disentanglement.

For further analysis of the proposed assumption, we formulate it into mathematical equations and bridge it with independence and conditional independence. This result emphasizes the importance of conditional independence for disentanglement. We then demonstrate that the factorized noise in reparameterization ensures conditional independence. Conditioned on this, we derive an equivalent formula for the total correlation from an information-theoretic perspective, which naturally leads to the split encoder; the split encoder improves the effectiveness of penalizing the total correlation for disentangled representation learning.

Experiments show that violating conditional independence leads to a dramatic decline of disentanglement in terms of the MIG score, which demonstrates the importance of conditional independence and the proposed assumption. Furthermore, the split encoder can be applied to models that penalize the total correlation, improving both their disentanglement and their reconstructions.

Our theoretical technique can be applied to many representation learning tasks, and our theoretical results can be used to analyze many models for disentangled representation learning. Based on these results, investigating an information-theoretic and practical definition of disentanglement is also an appealing direction.

References

  • [1] A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy (2017) Deep variational information bottleneck. international conference on learning representations. Cited by: §1.
  • [2] Y. L. Bengio (2007) Scaling learning algorithms towards ai. Large-scale Kernel Machines 34 (5), pp. 1–41. Cited by: §1.
  • [3] Y. Bengio, A. C. Courville, and P. Vincent (2013) Representation learning: a review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (8), pp. 1798–1828. Cited by: §1, §1, §3.2.
  • [4] P. Brakel and Y. Bengio (2018) Learning independent features with adversarial nets for non-linear ica. arXiv: Machine Learning. Cited by: §2.
  • [5] C. P. Burgess, I. Higgins, A. Pal, L. Matthey, N. Watters, G. Desjardins, and A. Lerchner (2018) Understanding disentangling in β-VAE. arXiv: Machine Learning. Cited by: §1, §2.
  • [6] T. Q. Chen, X. Li, R. B. Grosse, and D. Duvenaud (2018) Isolating sources of disentanglement in variational autoencoders. neural information processing systems, pp. 2610–2620. Cited by: 3rd item, §1, §1, §2, §5.
  • [7] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel (2016) InfoGAN: interpretable representation learning by information maximizing generative adversarial nets. neural information processing systems, pp. 2180–2188. Cited by: §1, §2.
  • [8] G. Desjardins, A. C. Courville, and Y. Bengio (2012) Disentangling factors of variation via generative entangling. arXiv: Machine Learning. Cited by: §2.
  • [9] C. Eastwood and C. K. I. Williams (2018) A framework for the quantitative evaluation of disentangled representations. international conference on learning representations. Cited by: §2.
  • [10] I. J. Goodfellow, J. Pougetabadie, M. Mirza, B. Xu, D. Wardefarley, S. Ozair, A. C. Courville, and Y. Bengio (2014) Generative adversarial nets. neural information processing systems, pp. 2672–2680. Cited by: §1.
  • [11] I. Higgins, D. Amos, D. Pfau, S. Racaniere, L. Matthey, D. J. Rezende, and A. Lerchner (2018) Towards a definition of disentangled representations.. arXiv: Learning. Cited by: §1.
  • [12] I. Higgins, L. Matthey, A. Pal, C. P. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner (2017) Beta-vae: learning basic visual concepts with a constrained variational framework. international conference on machine learning. Cited by: §1, §1, §2, §5.
  • [13] I. Higgins, A. Pal, A. A. Rusu, L. Matthey, C. P. Burgess, A. Pritzel, M. Botvinick, C. Blundell, and A. Lerchner (2017) DARLA: improving zero-shot transfer in reinforcement learning. international conference on machine learning, pp. 1480–1490. Cited by: §1.
  • [14] D. Janzing, J. Peters, E. Sgouritsa, K. Zhang, J. M. Mooij, and B. S. Lkopf (2012) On causal and anticausal learning. pp. 459–466. Cited by: §1.
  • [15] D. H. Kim and A. Mnih (2018) Disentangling by factorising. international conference on machine learning, pp. 2649–2658. Cited by: 3rd item, §1, §1, §2, §5.2.
  • [16] D. P. Kingma and M. Welling (2014) Auto-encoding variational bayes. international conference on learning representations. Cited by: §1.
  • [17] A. Kumar, P. Sattigeri, and A. Balakrishnan (2017) VARIATIONAL inference of disentangled latent concepts from unlabeled observations. international conference on learning representations. Cited by: §1, §1, §2.
  • [18] B. M. Lake, T. Ullman, J. B. Tenenbaum, and S. J. Gershman (2017) Building machines that learn and think like people. Behavioral and Brain Sciences 40, pp. 1–101. Cited by: §1.
  • [19] A. Laversanne-Finot, A. Péré, and P. Oudeyer (2018) Curiosity driven exploration of learned disentangled goal spaces. Conference on Robot Learning. Cited by: §1.
  • [20] Y. Lecun, F. J. Huang, and L. Bottou (2004) Learning methods for generic object recognition with invariance to pose and lighting. computer vision and pattern recognition 2, pp. 97–104. Cited by: §1, §5.
  • [21] A. H. Liu, Y. Liu, Y. Yeh, and Y. F. Wang (2018) A unified feature disentangler for multi-domain image translation and manipulation. neural information processing systems, pp. 2590–2599. Cited by: §1.
  • [22] F. Locatello, G. Abbati, T. Rainforth, S. Bauer, B. Scholkopf, and O. Bachem (2019) On the fairness of disentangled representations.. arXiv: Learning. Cited by: §1.
  • [23] F. Locatello, S. Bauer, M. Lucic, S. Gelly, and O. Bachem (2018) Challenging common assumptions in the unsupervised learning of disentangled representations. international conference on machine learning. Cited by: §1, §2, §5, §5.
  • [24] F. Locatello, D. Vincent, I. Tolstikhin, G. Ratsch, S. Gelly, and B. Scholkopf (2018) Competitive training of mixtures of independent deep generative models. arXiv: Learning. Cited by: §1.
  • [25] M. Mathieu, J. Zhao, P. Sprechmann, A. Ramesh, and Y. Lecun (2016) Disentangling factors of variation in deep representations using adversarial training. pp. 5047–5055. Cited by: §2.
  • [26] A. Nair, V. Pong, M. Dalal, S. Bahl, S. Lin, and S. Levine (2018) Visual reinforcement learning with imagined goals. neural information processing systems, pp. 9191–9200. Cited by: §1.
  • [27] S. E. Reed, K. Sohn, Y. Zhang, and H. Lee (2014) Learning to disentangle factors of variation with manifold interaction. international conference on learning representations, pp. 1431–1439. Cited by: §2.
  • [28] S. E. Reed, Y. Zhang, Y. Zhang, and H. Lee (2015) Deep visual analogy-making. neural information processing systems, pp. 1252–1260. Cited by: §1, §5.
  • [29] P. K. Rubenstein, B. Scholkopf, and I. Tolstikhin (2018) Learning disentangled representations with wasserstein auto-encoders.. international conference on learning representations. Cited by: §2.
  • [30] J. Schmidhuber (1992) Learning factorial codes by predictability minimization. Neural Computation 4 (6), pp. 863–879. Cited by: §1, §2.
  • [31] R. Shanmugam (2018) Elements of causal inference: foundations and learning algorithms. Journal of Statistical Computation and Simulation 88 (16), pp. 3248–3248. Cited by: §1.
  • [32] N. Siddharth, B. Paige, J. V. De Meent, A. Desmaison, N. D. Goodman, P. Kohli, F. Wood, and P. H. S. Torr (2017) Learning disentangled representations with semi-supervised deep generative models. neural information processing systems, pp. 5927–5937. Cited by: §2.
  • [33] A. Spurr, E. Aksan, and O. Hilliges (2017) Guiding infogan with semi-supervision. European conference on Machine Learning, pp. 119–134. Cited by: §2.
  • [34] X. Steenbrugge, S. Leroux, T. Verbelen, and B. Dhoedt (2018) Improving generalization for abstract reasoning tasks using disentangled feature representations.. arXiv: Learning. Cited by: §1.
  • [35] M. Tschannen, O. Bachem, and M. Lucic (2018) Recent advances in autoencoder-based representation learning. arXiv: Learning. Cited by: §1.
  • [36] N. Watters, L. Matthey, C. P. Burgess, and A. Lerchner (2019) Spatial broadcast decoder: a simple architecture for learning disentangled representations in vaes.. arXiv: Learning. Cited by: §5.
  • [37] L. Yingzhen and S. Mandt (2018) Disentangled sequential autoencoder. international conference on machine learning, pp. 5656–5665. Cited by: §1, §2.
  • [38] S. Zhao, J. Song, and S. Ermon (2019) InfoVAE: balancing learning and inference in variational autoencoders. pp. 5885–5892. Cited by: §2.

Appendix

A. A Stronger Formulation

We can formulate the proposed assumption into a stronger equation as follows:

From this equation we can obviously derive Eq. 5; thus it is stronger than Eq. 5. However, as will be shown in the following analysis, conditioned on the common assumption it leads to the same result as Eq. 5.

The second KL term above should be zero due to the common assumption; thus the stronger equation is equivalent to the first KL term above being equal to zero. To conclude, conditioned on the common assumption, the stronger equation is equivalent to:

If the distribution is continuous, this equation is further equivalent to conditional independence. Therefore, the stronger equation leads to the same results as Eq. 5.

B. Proofs of Lemmas and Theorems

Lemma 1. I(x; z_j) = I(x; z_j \mid z_A) for any A is equivalent to the following equation:

I(x; z_{A \cup \{j\}}) = I(x; z_A) + I(x; z_j),

where A is any subset of \{1, \dots, d\} \setminus \{j\} and z_j is any single factor outside A.

Proof. This conclusion is a straight corollary of the chain rule of mutual information. For further understanding, we derive it here via the chain rule of entropy:

I(x; z_{A \cup \{j\}}) = H(z_A, z_j) - H(z_A, z_j \mid x) = \big[H(z_A) - H(z_A \mid x)\big] + \big[H(z_j \mid z_A) - H(z_j \mid x, z_A)\big] = I(x; z_A) + I(x; z_j \mid z_A).

Hence we have: I(x; z_{A \cup \{j\}}) = I(x; z_A) + I(x; z_j) holds for all such A if and only if I(x; z_j \mid z_A) = I(x; z_j) for all such A.


Theorem 1. For any subset A of \{1, \dots, d\}, we have:

I(x; z_A) - \sum_{j \in A} I(x; z_j) = \mathbb{E}_{p(x)}\Big[ D_{KL}\Big( q(z_A \mid x) \,\Big\|\, \prod_{j \in A} q(z_j \mid x) \Big) \Big] - D_{KL}\Big( q(z_A) \,\Big\|\, \prod_{j \in A} q(z_j) \Big).

Proof.
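A sketch of the derivation, written in terms of the aggregated distributions q(z_A) = \mathbb{E}_{p(x)}[q(z_A \mid x)] (our reconstruction under assumptions, not the authors' original proof):

I(x; z_A) - \sum_{j \in A} I(x; z_j)
= \mathbb{E}_{q(x, z_A)}\!\left[\log \frac{q(z_A \mid x)}{q(z_A)}\right] - \sum_{j \in A} \mathbb{E}_{q(x, z_j)}\!\left[\log \frac{q(z_j \mid x)}{q(z_j)}\right]
= \mathbb{E}_{q(x, z_A)}\!\left[\log \frac{q(z_A \mid x)}{\prod_{j \in A} q(z_j \mid x)}\right] - \mathbb{E}_{q(x, z_A)}\!\left[\log \frac{q(z_A)}{\prod_{j \in A} q(z_j)}\right]
= \mathbb{E}_{p(x)}\!\left[ D_{KL}\!\left( q(z_A \mid x) \,\middle\|\, \prod_{j \in A} q(z_j \mid x) \right) \right] - D_{KL}\!\left( q(z_A) \,\middle\|\, \prod_{j \in A} q(z_j) \right).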

C. Reconstructions and Traversals

For an intuitive understanding, here we show the reconstructed images and traversals produced by FactorVAE and TCVAE on dSprites, SmallNORB and Cars3D. Traversals are generated from a data point which is first encoded into a representation; then, for each row, a single factor of the representation is varied over a fixed range while the others are kept fixed, and the result is decoded. For fairness of comparison, we show the reconstructed images and traversals from the model with the highest MIG score in each case.

(a) FactorVAE baseline
(b) FactorVAE + split encoder
(c) TCVAE baseline
(d) TCVAE + split encoder
Figure 9: Reconstructions and traversals of FactorVAE and TCVAE (+ split encoder) on dSprites.
(a) FactorVAE baseline
(b) FactorVAE + split encoder
(c) TCVAE baseline
(d) TCVAE + split encoder
Figure 10: Reconstructions and traversals of FactorVAE and TCVAE (+ split encoder) on SmallNORB.
(a) FactorVAE baseline
(b) FactorVAE + split encoder
(c) TCVAE baseline
(d) TCVAE + split encoder
Figure 11: Reconstructions and traversals of FactorVAE and TCVAE (+ split encoder) on Cars3D.