Disentangling Disentanglement

12/06/2018 ∙ by Emile Mathieu, et al. ∙ University of Oxford

We develop a generalised notion of disentanglement in Variational Auto-Encoders (VAEs) by casting it as a decomposition of the latent representation, characterised by i) enforcing an appropriate level of overlap in the latent encodings of the data, and ii) regularisation of the average encoding to a desired structure, represented through the prior. We motivate this by showing that a) the β-VAE disentangles purely through regularisation of the overlap in latent encodings, and through its average (Gaussian) encoder variance, and b) disentanglement, as independence between latents, can be cast as a regularisation of the aggregate posterior to a prior with specific characteristics. We validate this characterisation by showing that simple manipulations of these factors, such as using rotationally variant priors, can help improve disentanglement, and discuss how this characterisation provides a more general framework to incorporate notions of decomposition beyond just independence between the latents.




1 Introduction

An oft-stated motivation for learning disentangled representations of data with deep generative models is a desire to achieve interpretability [Bengio et al., 2013, Chen et al., 2016]—particularly the decomposability [see §3.2.1 in Lipton, 2016] of latent representations to admit intuitive explanations. Most work on disentanglement has constrained the form of this decomposition to capturing purely independent factors of variation [Chen et al., 2016, Burgess et al., 2018, Esmaeili et al., 2018, Kim and Mnih, 2018, Ansari and Soh, 2018, Xu and Durrett, 2018, Alemi et al., 2016, Chen et al., 2018, Higgins et al., 2016, Eastwood and Williams, 2018, Zhao et al., 2017], typically evaluating this using purpose-built, artificial, data [Eastwood and Williams, 2018, Chen et al., 2018, Higgins et al., 2016, Kim and Mnih, 2018], whose generative factors are themselves independent by construction. However, the high-level motivation for achieving decomposability places no a priori constraints on the form of the decompositions—just that they are captured effectively.

The conventional view of disentanglement, as recovering independence, has subsequently motivated the development of formal evaluation metrics for independence [Eastwood and Williams, 2018, Kim and Mnih, 2018], which in turn has driven the development of objectives that target these metrics, often by employing regularisers explicitly encouraging independence in the representations [Eastwood and Williams, 2018, Kim and Mnih, 2018, Esmaeili et al., 2018].

We argue that this methodological approach is not generalisable, and potentially even harmful, to learning decomposable representations for more complicated problems, wherein such simplistic representations will be unable to accurately mimic the generation of high dimensional data from low dimensional latent spaces. To see this, consider a typical measure of disentanglement-as-independence [e.g. Eastwood and Williams, 2018], computed as the extent to which a latent dimension $d \in D$ predicts a generative factor $k \in K$, with each latent capturing at most one generative factor. This implicitly assumes $|D| \geq |K|$, as otherwise the latents are not able to explain all of the generative factors. However, for real data, the association is more likely $|D| \ll |K|$, with one learning a low-dimensional abstraction of a complex process involving many factors. Such complexities necessitate richly structured dependencies between latent dimensions, as reflected in the motivation for a handful of approaches [Siddharth et al., 2017, Bouchacourt et al., 2018, Johnson et al., 2016, Esmaeili et al., 2018] that explore this through graphical models, although these employ mutually inconsistent, and not generalisable, interpretations of disentanglement.

Here, we develop a generalisation of disentanglement—decomposing latent representations—that can help avoid such pitfalls. Note that the typical assumption of independence implicitly makes a choice of decomposition—that the latent features are independent of one another. We make this explicit, and exploit it to provide improvement to disentanglement simply through judicious choices of structure in the prior, while also introducing a framework flexible enough to capture alternate, more complex, notions of decomposition such as sparsity [Tibshirani, 1996], hierarchical structuring, or independent subspaces.

2 Decomposition: A Generalisation of Disentanglement

We characterise the decomposition of latent spaces in VAEs to be the fulfilment of two factors:

  (a) An “appropriate” level of overlap in the latent space, ensuring that the range of latent values capable of encoding a particular datapoint is neither too small nor too large. This is, in general, dictated by the level of stochasticity in the encoder: the higher the encoder variance, the higher the number of datapoints which can plausibly give rise to a particular encoding.

  (b) The marginal posterior $q_\phi(z) \triangleq \mathbb{E}_{p_D(x)}[q_\phi(z|x)]$ (for encoder $q_\phi(z|x)$ and true data distribution $p_D(x)$) matching the prior $p(z)$, where the latter expresses the desired dependency structure between latents.

The overlap factor (a) is perhaps best understood by considering the extremes: too little, and the latent encodings effectively become a lookup table; too much, and the data and latents don’t convey information about each other. In both cases, the meaningfulness of the latent encodings is lost. Thus, without the appropriate level of overlap (dictated both by noise in the true generative process and by dataset size), it is not possible to enforce meaningful structure on the latent space.

The regularisation factor (b) enforces a congruence between the (aggregate) latent embeddings of data and the dependency structures expressed in the prior. We posit that such structure is best expressed in the prior, as opposed to explicit independence regularisation of the marginal posterior [Kim and Mnih, 2018, Chen et al., 2018], to enable the generative model to express the captured decomposition, and to avoid potentially violating the self-consistency between encoder, decoder, and true data generating distribution. Furthermore, the prior provides a rich and flexible means of expressing desired structure, by defining a generative process that encapsulates dependencies between variables, analogously to a graphical model.

Critically, neither factor is sufficient in isolation. An inappropriate level of overlap in the latent space (a) will impede interpretability, irrespective of how well the regularisation (b) goes, as the latent space need not be meaningful. On the other hand, without the pressure to regularise (b) to the prior, the latent space is under no constraint to exhibit the desired structure.

Deconstructing the β-VAE:

To show how existing approaches fit into our proposed framework, we now consider, as a case study, the β-VAE [Higgins et al., 2016], an adaptation of the VAE objective (ELBO) to learn better-disentangled representations. We introduce new theoretical results that show its empirical successes are purely down to controlling the level of overlap, i.e. factor (a). In particular, we have the following result, the proof of which is given in Appendix A, along with additional results.

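Theorem 1.

The β-VAE target

$$\mathcal{L}_\beta(x) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \beta\,\mathrm{KL}(q_\phi(z|x) \,\|\, p(z)) \tag{1}$$

can be interpreted in terms of the standard ELBO, $\mathcal{L}(x; \pi_{\theta,\beta}, q_\phi)$, for an adjusted target $\pi_{\theta,\beta}(x, z) \triangleq p_\theta(x|z) f_\beta(z)$ with annealed prior $f_\beta(z) \triangleq p(z)^\beta / F_\beta$ as

$$\mathcal{L}_\beta(x) = \mathcal{L}(x; \pi_{\theta,\beta}, q_\phi) + (\beta - 1) H_{q_\phi} + \log F_\beta \tag{2}$$

where $F_\beta \triangleq \int p(z)^\beta \, dz$ is constant given $\beta$, and $H_{q_\phi}$ is the entropy of $q_\phi(z|x)$.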

Clearly, the second term in Eq. 2, enforcing a maxent regulariser on the posterior $q_\phi(z|x)$, allows the value of $\beta$ to affect the overlap of encodings in the latent space; for Gaussian priors this effect is exactly equivalent to regularising the encoder to have higher variance. The annealed prior’s effect, though, is more subtle. While one could interpret its effect as simply inducing a fixed scaling on the parameters (cf. § A.1), which could be ignored and ‘fixed’ during learning, it actually has the effect of exactly counteracting the latent-space scaling due to the entropy regularisation, ensuring that the scaling of the marginal posterior matches that of the prior.
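The decomposition in Theorem 1 can be sanity-checked numerically. The sketch below is an illustration rather than part of the original derivation: it assumes a one-dimensional zero-mean Gaussian prior and a Gaussian encoder, for which every term has a closed form, and compares the β-weighted KL term with the sum of the annealed-prior KL, the entropy term, and the log normaliser. The reconstruction term is identical on both sides and is therefore omitted. All function names are hypothetical.

```python
import math


def gaussian_kl(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for 1-D Gaussians."""
    return 0.5 * (math.log(var_p / var_q)
                  + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)


def gaussian_entropy(var):
    """Differential entropy of a 1-D Gaussian with variance `var`."""
    return 0.5 * math.log(2.0 * math.pi * math.e * var)


def log_normaliser(var_p, beta):
    """log F_beta, with F_beta the integral of N(z; 0, var_p)**beta over z."""
    return 0.5 * (1.0 - beta) * math.log(2.0 * math.pi * var_p) - 0.5 * math.log(beta)


def beta_vae_regulariser(mu_q, var_q, var_p, beta):
    """Regulariser as it appears in the beta-VAE target: -beta * KL(q || p)."""
    return -beta * gaussian_kl(mu_q, var_q, 0.0, var_p)


def decomposed_regulariser(mu_q, var_q, var_p, beta):
    """Same quantity rewritten with the annealed prior f_beta = N(0, var_p / beta):
    -KL(q || f_beta) + (beta - 1) * H[q] + log F_beta."""
    return (-gaussian_kl(mu_q, var_q, 0.0, var_p / beta)
            + (beta - 1.0) * gaussian_entropy(var_q)
            + log_normaliser(var_p, beta))
```

For any encoder mean and variance and any positive β, the two functions should agree up to floating-point error, which is what the identity asserts in this special case.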

Taken together, these insights demonstrate that the β-VAE’s disentanglement is purely down to controlling the level of induced overlap: it places no additional direct pressure on the latents to be independent, it only helps avoid the pitfalls of inappropriate overlap. Amongst other things, this explains why larger values of $\beta$ are not universally beneficial for disentanglement, as the level of overlap can be increased too far. It also dispels the conjecture [Higgins et al., 2016, Burgess et al., 2018] that the β-VAE encourages the latent variables to take on meaningful representations when using the standard choice of an isotropic Gaussian prior: for this prior, each term in (2) is invariant to rotation of the latent space. Our results show that the β-VAE encourages the latent states to match true generative factors no more than it encourages them to match rotations of the true generative factors, with the latter capable of exhibiting strong correlations between the latents. This view is further supported by our empirical results (see Figure 1), calculated by averaging over a large number of independently trained networks, where we did not observe any gains in disentanglement (using the metric from Kim and Mnih [2018]) from increasing $\beta$ with an isotropic Gaussian prior trained on the 2D Shapes dataset [Matthey et al., 2017].

A new objective:

Given the characterisation set out above, we now develop an objective that incorporates the effect of both factors (a) and (b). From our analysis of the β-VAE, we see that its objective (Eq. 1) allows expressing overlap, i.e. factor (a). To additionally capture the regularisation (b) between the marginal posterior and the prior, we add a divergence term $D(q_\phi(z), p(z))$, yielding

$$\mathcal{L}_{\alpha,\beta}(x) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \beta\,\mathrm{KL}(q_\phi(z|x) \,\|\, p(z)) - \alpha\, D(q_\phi(z), p(z)) \tag{3}$$

where we can now control the extent to which factors (a) and (b) are enforced, through appropriate setting of $\beta$ and $\alpha$ respectively.
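As a rough illustration of how the three terms of this objective combine, the sketch below assumes a diagonal Gaussian encoder and a standard normal prior, so that the per-datapoint KL is available in closed form. The reconstruction log-likelihood and the divergence between marginal posterior and prior are taken as precomputed numbers, since the text leaves the choice of divergence open. The function names and argument layout are hypothetical.

```python
import math


def kl_to_standard_normal(mu, var):
    """KL( N(mu, diag(var)) || N(0, I) ) for a diagonal Gaussian encoder."""
    return 0.5 * sum(v + m * m - 1.0 - math.log(v) for m, v in zip(mu, var))


def decomposition_objective(log_lik, mu, var, divergence, alpha, beta):
    """Reconstruction term, minus beta times the encoder KL (overlap, factor (a)),
    minus alpha times the marginal-posterior divergence (structure, factor (b)).

    `log_lik` stands for E_q[log p(x|z)] and `divergence` for D(q(z), p(z));
    both are assumed to be estimated elsewhere.
    """
    return log_lik - beta * kl_to_standard_normal(mu, var) - alpha * divergence
```

Setting `alpha=0` recovers the β-VAE target, and additionally setting `beta=1` recovers the standard ELBO.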

Note that such an additional term has been previously considered by Kumar et al. [2017], with $D(q_\phi(z), p(z)) = \mathrm{KL}(q_\phi(z) \,\|\, p(z))$, although for the sake of tractability they rely instead on moment matching using covariances. There have also been a number of approaches that decompose the standard VAE objective in different ways [e.g. Hoffman and Johnson, 2016, Esmaeili et al., 2018, Anonymous, 2019] to expose $\mathrm{KL}(q_\phi(z) \,\|\, p(z))$ as a component, but, as we discuss in Appendix C, this is difficult to compute correctly in practice, with common previous approaches leading to highly biased estimates whose practical behaviour is very different from the divergence they are estimating. Wasserstein Auto-Encoders [Tolstikhin et al., 2017] formulate an objective that includes a general divergence term between the prior and marginal posterior, which is instantiated using either MMD or a variational formulation of the Jensen-Shannon divergence (a.k.a. GAN loss). However, we find empirically that choosing the MMD’s kernel and numerically stabilising its U-statistics estimator is tricky, and that designing and learning a GAN is cumbersome and unstable. Consequently, the problems of choosing an appropriate $D$ and generating reliable estimates for this choice are tightly coupled, with a general purpose solution remaining an important open problem in the field; further discussion is given in Appendix C.

Figure 1: Reconstruction loss vs disentanglement metric [Kim and Mnih, 2018] for the β-VAE (i.e. (3) with $\alpha = 0$) trained on the 2D Shapes dataset [Matthey et al., 2017]. Shaded areas represent 95% confidence intervals for the disentanglement metric estimate, calculated using separately trained networks. See Appendix B for details. [Left] Using an anisotropic Gaussian with diagonal covariance either fixed to the principal component values or learned during training. Point labels represent different values of $\beta$. [Right] Using a product of Student-t distributions as the prior, for different degrees of freedom $\nu$. Note that this prior tends to the isotropic Gaussian as $\nu \to \infty$, and reducing $\nu$ only incurs a minor increase in reconstruction loss.

3 Experiments

Prior for axis-aligned disentanglement

First, we show how subtle changes to the prior distribution can yield improvement in terms of a common notion of disentanglement [see §4 in Kim and Mnih, 2018]. The most common choice of prior, an isotropic Gaussian, has previously been justified by the correct assertion that the latents are independent under the prior [Higgins et al., 2016]. However, an isotropic Gaussian is also rotationally invariant and so does not constrain the axes of the latent space to capture any meaning. Figure 1 demonstrates that substantial improvements in disentanglement can be achieved by simply using either a non-isotropic Gaussian or a product of Student-t’s prior, both of which break the rotational invariance.
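The rotational argument can be made concrete with a small check. The sketch below is illustrative only and assumes a two-dimensional latent space: it evaluates the log-density of an isotropic Gaussian and of a product of Student-t distributions at a point and at a rotated copy of that point. The helper names are hypothetical.

```python
import math


def log_isotropic_gaussian(z):
    """Log-density of N(0, I) at the point z."""
    return sum(-0.5 * math.log(2.0 * math.pi) - 0.5 * zi * zi for zi in z)


def log_student_t_product(z, nu):
    """Log-density of a product of independent Student-t(nu) marginals at z."""
    log_norm = (math.lgamma((nu + 1.0) / 2.0) - math.lgamma(nu / 2.0)
                - 0.5 * math.log(nu * math.pi))
    return sum(log_norm - 0.5 * (nu + 1.0) * math.log1p(zi * zi / nu) for zi in z)


def rotate_2d(z, angle):
    """Rotate a 2-D point anticlockwise by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * z[0] - s * z[1], s * z[0] + c * z[1])
```

Under the Gaussian, a point on an axis and its 45-degree rotation have the same density, so nothing in the prior favours axis-aligned encodings. Under the Student-t product the two densities differ, which is the sense in which such a prior breaks the invariance.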

Clustered prior

We next consider an alternative decomposition one might wish to impose, namely a clustering of the latent space. For this, we use the “pinwheels” dataset from Johnson et al. [2016] and use a mixture of four equally-weighted Gaussians as our prior. We then conduct an ablation study to observe the effect of varying $\alpha$ and $\beta$ in $\mathcal{L}_{\alpha,\beta}(x)$ (as per (3)) on the learned representations, taking the divergence to be the inclusive KL, $\mathrm{KL}(p(z) \,\|\, q_\phi(z))$ (see Appendix B for details). As shown in Figure 2, our framework allows one to impose this alternate decomposition, allowing control of both the level of overlap and the form of the marginal posterior.

Figure 2: Density of aggregate posterior $q_\phi(z)$ for different values of $\alpha$ and $\beta$. [Top] Varying $\beta$. [Bottom] Varying $\alpha$. We see that increasing $\beta$ increases the level of overlap in $q_\phi(z)$, as a consequence of increasing the encoder variance for individual datapoints. When $\beta$ is too large, the encoding of a datapoint loses meaning. Also, as a single datapoint encodes to a Gaussian distribution, $q_\phi(z|x)$ is unable to match $p(z)$ exactly. Because $q_\phi(z|x) \to q_\phi(z)$ when $\beta \to \infty$, this in turn means that overly large values of $\beta$ actually cause a mismatch between $q_\phi(z)$ and $p(z)$ (see top right). Increasing $\alpha$ instead always improves the match between $q_\phi(z)$ and $p(z)$. Here, the finiteness of the dataset and the choice of divergence result in an increase in the overlap with increasing $\alpha$, but only up to the level required for a non-negligible overlap between the nearby datapoints, such that large values of $\alpha$ do not cause the encodings to lose significance.


EM, TR, YWT were supported in part by the European Research Council under the European Union’s Seventh Framework Programme (FP7/2007–2013) / ERC grant agreement no. 617071. EM was also supported by Microsoft Research through its PhD Scholarship Programme. NS was funded by EPSRC/MURI grant EP/N019474/1.


Appendix A Proofs and Additional Results for Disentangling the β-VAE

Hoffman et al. [2017] showed that the β-VAE target Eq. 1 can be interpreted as a standard ELBO with the alternative prior $r(z) \propto q_\phi(z)^{(1-\beta)} p(z)^{\beta}$, where $q_\phi(z)$ is the marginal posterior, along with a term down-weighting mutual information and another based on the prior’s normalising constant.

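We derive the following alternate expression for the β-VAE, as stated in Theorem 1.

Proof. Starting with Eq. 1, and writing $f_\beta(z) = p(z)^\beta / F_\beta$ for the annealed prior with normaliser $F_\beta = \int p(z)^\beta \, dz$, we have

$$\begin{aligned}
\mathcal{L}_\beta(x) &= \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] + \beta H_{q_\phi} + \beta\,\mathbb{E}_{q_\phi(z|x)}[\log p(z)] \\
&= \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] + (\beta - 1) H_{q_\phi} + H_{q_\phi} + \mathbb{E}_{q_\phi(z|x)}[\log f_\beta(z)] + \log F_\beta \\
&= \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \mathrm{KL}(q_\phi(z|x) \,\|\, f_\beta(z)) + (\beta - 1) H_{q_\phi} + \log F_\beta \\
&= \mathcal{L}(x; \pi_{\theta,\beta}, q_\phi) + (\beta - 1) H_{q_\phi} + \log F_\beta
\end{aligned}$$

as required. ∎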

A.1 Special Case – Gaussians

We analyse the effect of the adjusted target in Eq. 2 by studying the often-used Gaussian prior, $p(z) = \mathcal{N}(z; 0, \Sigma)$, where it is straightforward to see that annealing simply scales the latent space by $1/\sqrt{\beta}$, i.e. $f_\beta(z) = \mathcal{N}(z; 0, \Sigma/\beta)$. Given this, it is easy to see that a VAE trained with the adjusted target $\pi_{\theta,\beta}(x, z)$, but appropriately scaling the latent space, will behave identically to a VAE trained with the original target $p_\theta(x, z)$. They will also have identical ELBOs, as the expected reconstruction is trivially the same, while the KL between Gaussians is invariant to scaling both equally.
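The scaling invariance used here can be checked directly. The following sketch assumes a one-dimensional latent with a zero-mean Gaussian prior of variance `var_p`: it compares the KL from an encoder to the annealed prior with the KL from the rescaled encoder (mean multiplied by the square root of β, variance multiplied by β) to the original prior, and also reports how the encoder entropy shifts under the rescaling. The function names are hypothetical.

```python
import math


def gaussian_kl(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for 1-D Gaussians."""
    return 0.5 * (math.log(var_p / var_q)
                  + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)


def gaussian_entropy(var):
    """Differential entropy of a 1-D Gaussian with variance `var`."""
    return 0.5 * math.log(2.0 * math.pi * math.e * var)


def kl_to_annealed_prior(mu_q, var_q, var_p, beta):
    """KL from the encoder to the annealed prior N(0, var_p / beta)."""
    return gaussian_kl(mu_q, var_q, 0.0, var_p / beta)


def kl_after_rescaling(mu_q, var_q, var_p, beta):
    """KL from the rescaled encoder N(sqrt(beta)*mu, beta*var) to the prior N(0, var_p)."""
    return gaussian_kl(math.sqrt(beta) * mu_q, beta * var_q, 0.0, var_p)


def entropy_shift(var_q, beta):
    """Entropy of the rescaled encoder minus entropy of the original encoder."""
    return gaussian_entropy(beta * var_q) - gaussian_entropy(var_q)
```

The two KL values coincide, while the entropy changes by half the log of β per latent dimension, a constant that does not depend on the network parameters.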

In fact, including the entropy regulariser allows us to derive a specialisation of Eq. 2.

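Corollary 1.

If $p(z) = \mathcal{N}(z; 0, \Sigma)$ and $q_\phi(z|x) = \mathcal{N}(z; \mu_\phi(x), S_\phi(x))$, then,

$$\mathcal{L}_\beta(x; \theta, \phi) = \mathcal{L}(x; p_{\theta'}(x|z)\,p(z), q_{\phi'}) + (\beta - 1) H_{q_{\phi'}} + c \tag{4}$$

where $\theta'$ and $\phi'$ represent rescaled networks such that

$$p_{\theta'}(x|z) = p_\theta(x \mid z/\sqrt{\beta}), \qquad q_{\phi'}(z|x) = \mathcal{N}(z; \sqrt{\beta}\,\mu_\phi(x), \beta S_\phi(x)),$$

and where $c \triangleq \log F_\beta - \frac{D(\beta - 1)}{2} \log \beta$ is a constant, with $D$ denoting the dimensionality of $z$.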


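Proof. We start by noting that

$$\pi_{\theta,\beta}(x) = \mathbb{E}_{f_\beta(z)}[p_\theta(x|z)] = \mathbb{E}_{p(z)}[p_\theta(x \mid z/\sqrt{\beta})] = \mathbb{E}_{p(z)}[p_{\theta'}(x|z)] = p_{\theta'}(x).$$

Now considering an alternate form of $\mathcal{L}(x; \pi_{\theta,\beta}, q_\phi)$ in Eq. 2,

$$\mathcal{L}(x; \pi_{\theta,\beta}, q_\phi) = \log \pi_{\theta,\beta}(x) - \mathrm{KL}(q_\phi(z|x) \,\|\, \pi_{\theta,\beta}(z|x)) = \log p_{\theta'}(x) - \mathrm{KL}(q_\phi(z|x) \,\|\, \pi_{\theta,\beta}(z|x)). \tag{5}$$

We first simplify $\pi_{\theta,\beta}(z|x)$ as

$$\pi_{\theta,\beta}(z|x) = \frac{p_\theta(x|z)\, f_\beta(z)}{\pi_{\theta,\beta}(x)} = \frac{p_{\theta'}(x \mid \sqrt{\beta} z)\, f_\beta(z)}{p_{\theta'}(x)}.$$

Further, denoting $z' = \sqrt{\beta} z$, and noting that $f_\beta(z) = \beta^{D/2} p(z')$ and $q_\phi(z|x) = \beta^{D/2} q_{\phi'}(z'|x)$, we have

$$\pi_{\theta,\beta}(z|x) = \beta^{D/2}\, \frac{p_{\theta'}(x|z')\, p(z')}{p_{\theta'}(x)} = \beta^{D/2}\, p_{\theta'}(z'|x), \qquad \mathrm{KL}(q_\phi(z|x) \,\|\, \pi_{\theta,\beta}(z|x)) = \mathrm{KL}(q_{\phi'}(z'|x) \,\|\, p_{\theta'}(z'|x)).$$

Plugging these back in to Eq. 5, we have

$$\mathcal{L}(x; \pi_{\theta,\beta}, q_\phi) = \log p_{\theta'}(x) - \mathrm{KL}(q_{\phi'}(z|x) \,\|\, p_{\theta'}(z|x)) = \mathcal{L}(x; p_{\theta'}(x|z)\,p(z), q_{\phi'}),$$

showing that the ELBOs for the two setups are the same. For the entropy term, we note that

$$H_{q_\phi} = \frac{D}{2}(1 + \log 2\pi) + \frac{1}{2} \log |S_\phi(x)| = \frac{D}{2}(1 + \log 2\pi) + \frac{1}{2} \log |S_{\phi'}(x)| - \frac{D}{2} \log \beta = H_{q_{\phi'}} - \frac{D}{2} \log \beta.$$

Finally substituting for $H_{q_\phi}$ and $\mathcal{L}(x; \pi_{\theta,\beta}, q_\phi)$ in Eq. 2 gives the desired result. ∎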

Noting that $c$ is inconsequential to the training process, this result demonstrates an equivalence, up to the scaling of the latent space, between training using the β-VAE objective and a maximum-entropy regularised version of the standard ELBO

$$\mathcal{L}_{H,\beta}(x) \triangleq \mathcal{L}(x; p_\theta(x|z)\,p(z), q_\phi) + (\beta - 1) H_{q_\phi}, \tag{6}$$

whenever $p(z)$ and $q_\phi(z|x)$ are Gaussian. Note that we are here implicitly presuming suitable adjustment of neural-network hyper-parameters and the stochastic gradient scheme to account for the change of scaling in the optimal networks.

More formally we have the following, showing equivalence of all the local optima for the two objectives.

Corollary 2.

If then


Provided that and do not have any zeros distinct to those of , then (7) holding also implies .


The proof follows directly from Corollary 1 and the chain rule. ∎

What we now see is that optimising for (6) leads to a pair of networks equivalent to those from training to the β-VAE target, except that encodings are all scaled by a factor of $\sqrt{\beta}$. While it would be easy to doubt any tangible effects from the rescaling of the β-VAE, closer inspection shows that it still plays an important role: it ensures the scaling of the encodings matches that of the prior. Just adding the entropy regularisation term will increase the scaling of the latent space, as the higher variance it encourages will spread out the aggregate posterior $q_\phi(z)$. The rescaling of the β-VAE now cancels this effect, ensuring the scaling of $q_\phi(z)$ matches that of $p(z)$. This is perhaps easiest to see by considering what happens in the limit of large $\beta$ for the two targets. With the β-VAE, we see from the original formulation that the encoder must provide embeddings equivalent to sampling from the prior. The entropy-regularised VAE on the other hand will produce an encoder with infinite variance. The equivalence between them is apparent when we scale the encodings of the latter by a factor of $1/\sqrt{\beta}$, and recover the encodings of the former, i.e. samples from the prior.

Appendix B Experimental Details

Encoder                      | Decoder
Input 64 x 64 binary image   | Input
4x4 conv. 32 ReLU stride 2   | FC. 128 ReLU
4x4 conv. 32 ReLU stride 2   | FC. 4x4 x 64 ReLU
4x4 conv. 32 ReLU stride 2   | 4x4 upconv. 64 ReLU stride 2
4x4 conv. 64 ReLU stride 2   | 4x4 upconv. 64 ReLU stride 2
FC. 128                      | 4x4 upconv. 32 ReLU stride 2
FC. 2x10                     | 4x4 upconv. 1. stride 2
(a) 2D-shapes dataset.

Encoder                      | Decoder
Input                        | Input
FC. 100. ReLU                | FC. 100 ReLU
FC. 2x2                      | FC. 2x2
(b) Pinwheel dataset.

Table 1: Encoder and decoder architectures.
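The spatial sizes implied by the 2D-shapes architecture can be traced with the usual convolution arithmetic. The table does not state the padding, so the sketch below assumes a padding of 1 for every 4x4, stride-2 layer; under that assumption the encoder reduces a 64 x 64 image to 4 x 4 and the decoder maps it back. The helper names are hypothetical.

```python
def conv_output_size(size, kernel=4, stride=2, padding=1):
    """Spatial size after a strided convolution (padding=1 is an assumption)."""
    return (size + 2 * padding - kernel) // stride + 1


def upconv_output_size(size, kernel=4, stride=2, padding=1):
    """Spatial size after a transposed convolution with the same settings."""
    return (size - 1) * stride - 2 * padding + kernel


def trace_sizes(start, num_layers, layer_fn):
    """Return the list of spatial sizes seen while applying `layer_fn` repeatedly."""
    sizes = [start]
    for _ in range(num_layers):
        sizes.append(layer_fn(sizes[-1]))
    return sizes


encoder_sizes = trace_sizes(64, 4, conv_output_size)   # four stride-2 convolutions
decoder_sizes = trace_sizes(4, 4, upconv_output_size)  # four stride-2 up-convolutions
```

This is consistent with the decoder's fully connected layer producing a 4x4 x 64 tensor before the up-convolutions.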

The experiments from § 3 on the impact of the prior in terms of disentanglement are conducted on the 2D Shapes [Matthey et al., 2017] dataset, comprising 737,280 binary 64 x 64 images of 2D shapes with ground truth factors [number of values]: shape[3], scale[6], orientation[40], x-position[32], y-position[32]. We use a convolutional neural network for the encoder and a deconvolutional neural network for the decoder, whose architectures are described in Table 1(a). We use normalised data as targets for the mean of a Bernoulli distribution, using negative cross-entropy for $\log p_\theta(x|z)$. We rely on the Adam optimiser [Kingma and Ba, 2014, Reddi et al., 2018] with $\beta_1 = 0.9$ and $\beta_2 = 0.999$ to optimise the β-VAE objective from Eq. 2.
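The dataset size quoted above follows from the factor counts, since every combination of ground-truth factor values appears exactly once. A one-line check, using only the counts listed in the text:

```python
import math

# Number of values per ground-truth factor, as listed for the 2D Shapes dataset.
factor_sizes = {"shape": 3, "scale": 6, "orientation": 40, "x_position": 32, "y_position": 32}

# One image per combination of factor values.
num_images = math.prod(factor_sizes.values())
```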

When , experiments have been run with a batch size of and for epochs. When , experiments have been run with a batch size of and for epochs. In Figure 1, the PCA initialised anisotropic prior is initialised so that its standard deviations are set to be the leading singular values computed on the observations dataset. These are then mapped through a softmax function to ensure that the regularisation coefficient is not implicitly scaled compared to the isotropic case. For the learned anisotropic priors, standard deviations are first initialised as just described, and then learned along with the model through a log-variance parametrisation.

We rely on the metric presented in Section 4 and Appendix B of Kim and Mnih [2018] as a measure of axis-alignment of the latent encodings with respect to the true (known) generative factors. Confidence intervals in Figure 1 have been computed via the assumption of normally distributed samples with unknown mean and variance, over the independent runs of each model.


We generated spiral cluster data (http://hips.seas.harvard.edu/content/synthetic-pinwheel-data-matlab), with observations, clustered in 4 spirals, with radial and tangential standard deviations respectively of and , and a rate of . We use fully-connected neural networks for both the encoder and decoder, whose architectures are described in Table 1(b). We minimise the objective from Eq. 3, with $D$ chosen to be the inclusive KL, with $q_\phi(z)$ approximated by the aggregate encoding of the dataset. A Gaussian likelihood is used for the encoder. We trained the model for 500 epochs using the Adam optimiser [Kingma and Ba, 2014, Reddi et al., 2018], with and and a learning rate of . The batch size is set to .
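One way to picture the inclusive KL regulariser is as a Monte Carlo average over prior samples, with the marginal posterior replaced by an equally weighted mixture of the encoder distributions for the datapoints at hand. The sketch below follows that reading; it is an assumption about the general shape of such an estimator, not a transcription of the authors' implementation, and the prior means and standard deviations used in any call are placeholders rather than the values from the paper.

```python
import math
import random


def log_normal_diag(z, mean, std):
    """Log-density of a diagonal Gaussian at z."""
    return sum(-0.5 * math.log(2.0 * math.pi) - math.log(s) - 0.5 * ((zi - m) / s) ** 2
               for zi, m, s in zip(z, mean, std))


def log_sum_exp(values):
    """Numerically stable log of a sum of exponentials."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def log_mixture(z, components):
    """Log-density of an equally weighted mixture; components are (mean, std) pairs."""
    return log_sum_exp([log_normal_diag(z, m, s) for m, s in components]) - math.log(len(components))


def sample_mixture(components, rng):
    """Draw one sample from an equally weighted mixture of diagonal Gaussians."""
    mean, std = rng.choice(components)
    return tuple(rng.gauss(m, s) for m, s in zip(mean, std))


def inclusive_kl_estimate(prior, encodings, num_samples, rng):
    """Monte Carlo estimate of KL(p || q), sampling z from the prior p and
    approximating q(z) by the mixture of per-datapoint encoder Gaussians."""
    total = 0.0
    for _ in range(num_samples):
        z = sample_mixture(prior, rng)
        total += log_mixture(z, prior) - log_mixture(z, encodings)
    return total / num_samples
```

Because the samples come from the prior, this estimator penalises regions of the prior that no encoding covers, which is what pushes the aggregate posterior to fill every cluster.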

Figure 3: PDF of the Gaussian mixture model prior, i.e. $p(z)$ as per Eq. 8.

The mixture of Gaussian prior (c.f. Figure 3) is defined as


with , , , and .

Appendix C Posterior regularisation

The aggregate posterior regulariser is a little more subtle to analyse than the entropy regulariser, as it involves both the choice of divergence and potential difficulties in estimating that divergence. One possible choice is the exclusive KL divergence $\mathrm{KL}(q_\phi(z) \,\|\, p(z))$, as previously used (without additional entropy regularisation) by Esmaeili et al. [2018] and Anonymous [2019], but also implicitly by Kim and Mnih [2018] and Chen et al. [2018], through the use of a TC term. We now highlight a shortfall with this choice of divergence due to difficulties in its empirical estimation.

In short, the approaches used to estimate the entropy $H[q_\phi(z)]$ (noting that $\mathrm{KL}(q_\phi(z) \,\|\, p(z)) = -H[q_\phi(z)] - \mathbb{E}_{q_\phi(z)}[\log p(z)]$, where the latter term can be estimated reliably by a simple Monte Carlo estimate) exhibit very large biases that result in quite different effects from what was intended. In fact, our results suggest they will exhibit behaviour similar to the β-VAE. These biases arise from the effects of nesting estimators [Rainforth et al., 2018a], where the variance in the nested (inner) estimator for $q_\phi(z)$ induces a bias in the overall estimator. Specifically, for any positive random variable $\hat{Z}$,

$$\mathbb{E}[\log \hat{Z}] = \log \mathbb{E}[\hat{Z}] - \frac{\mathrm{Var}[\hat{Z}]}{2\,\mathbb{E}[\hat{Z}]^2} + O(\varepsilon) \tag{9}$$

where $O(\varepsilon)$ represents higher-order moments that get dominated asymptotically if $\hat{Z}$ is a Monte-Carlo estimator (see Proposition 1c in Maddison et al. [2017], Theorem 1 in Rainforth et al. [2018b], or Theorem 3 in Domke and Sheldon [2018]). In this setting, $\hat{Z} = \hat{q}_\phi(z)$ is the estimate used for $q_\phi(z)$. We thus see that if the variance of $\hat{q}_\phi(z)$ is large, this will induce a significant bias in our KL estimator.
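The size of this nesting bias can be computed exactly in a toy setting. The sketch below assumes the inner estimator is the mean of `n` independent draws from a two-valued distribution, which lets the expectation of its logarithm be evaluated by enumerating binomial outcomes rather than by simulation. It then compares the exact value with the second-order expression: the log of the mean minus the variance over twice the squared mean. The setup and names are illustrative assumptions.

```python
import math


def exact_log_mean_statistics(n, a, b, p):
    """Exact E[log Z_hat], E[Z_hat] and Var[Z_hat], where Z_hat is the mean of n
    i.i.d. draws that equal `a` with probability p and `b` otherwise (a, b > 0)."""
    expected_log = 0.0
    for k in range(n + 1):
        prob = math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)
        z_hat = (k * a + (n - k) * b) / n
        expected_log += prob * math.log(z_hat)
    mean = p * a + (1.0 - p) * b
    variance = p * (1.0 - p) * (a - b) ** 2 / n
    return expected_log, mean, variance


def second_order_approximation(mean, variance):
    """log E[Z_hat] - Var[Z_hat] / (2 E[Z_hat]^2)."""
    return math.log(mean) - variance / (2.0 * mean ** 2)


expected_log, mean, variance = exact_log_mean_statistics(n=20, a=2.0, b=1.0, p=0.5)
jensen_gap = math.log(mean) - expected_log           # bias of the naive log estimate
approx_error = abs(expected_log - second_order_approximation(mean, variance))
```

In this example the variance correction accounts for almost all of the gap between the log of the mean and the mean of the log, which illustrates why a high-variance inner estimate translates directly into bias.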

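To make things precise, we consider the estimator used for $\mathbb{E}_{q_\phi(z)}[\log q_\phi(z)] = -H[q_\phi(z)]$ in Esmaeili et al. [2018] and Anonymous [2019] (noting that the analysis applies equally to those of Chen et al. [2018]):

$$\mathbb{E}_{q_\phi(z)}[\log q_\phi(z)] \approx \frac{1}{B} \sum_{b=1}^{B} \log \hat{q}_\phi(z_b), \quad \text{where} \tag{10a}$$

$$\hat{q}_\phi(z_b) = \frac{q_\phi(z_b|x_b)}{N} + \frac{N - 1}{N(B - 1)} \sum_{b' \neq b} q_\phi(z_b|x_{b'}), \tag{10b}$$

each $z_b \sim q_\phi(z|x_b)$, and $\{x_1, \dots, x_B\}$ is the mini-batch of data being used for the current iteration and $N$ is the dataset size. Esmaeili et al. [2018] correctly show that $\mathbb{E}[\hat{q}_\phi(z_b)] = q_\phi(z_b)$, with the first term of Eq. 10b comprising an exact term in $q_\phi(z_b)$ and the second term of Eq. 10b being an unbiased Monte-Carlo estimate for the remaining terms in $q_\phi(z_b)$.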

To examine the practical behaviour of this estimator when $B \ll N$, we first note that the second term of Eq. 10b is, in practice, usually very small and dominated by the first term. This is borne out empirically in our own experiments, and also noted in Kim and Mnih [2018]. To see why this is the case, consider that given encodings of two independent data points, it is highly unlikely that the two encoding distributions will have any notable overlap (e.g. for a Gaussian encoder, the means will most likely be very many standard deviations apart), presuming a sensible latent space is being learned. Consequently, even though this second term is unbiased and may have an expectation comparable to or even larger than the first, it is heavily skewed: it is usually negligible, but occasionally large in the rare instances where there is substantial overlap between encodings.

Let the second term of Eq. 10b be denoted and the event that it is significant be denoted , such that . As explained above, it will typically be the case that . We now have

where the approximation relies firstly on our previous assumption that and also that . This second assumption will also generally hold in practice, firstly because the occurrence of is dominated by whether or not two similar datapoints are drawn (rather than by the value of ) and secondly because implies that

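Characterising the estimate in this case precisely is a little more challenging, but it can safely be assumed to be smaller than $\log q_\phi(z_b|x_b)$, which is approximately what would result from all the $x_{b'}$ being the same as $x_b$. We thus see that even when this event does occur, the resulting gradients should still be on a comparable scale to when it does not. Consequently, whenever the event is rare, the contribution from the cases where it does not occur should dominate and we thus have, with $N$ the dataset size,

$$\mathbb{E}[\log \hat{q}_\phi(z_b)] \approx \mathbb{E}[\log q_\phi(z_b|x_b)] - \log N. \tag{11}$$

More significantly, we see that the estimator mimics the VAE regularisation up to a constant factor $-\log N$, as adding the $-\mathbb{E}_{q_\phi(z)}[\log p(z)]$ term back in gives

$$\mathrm{KL}(q_\phi(z) \,\|\, p(z)) \approx \mathbb{E}_{p_D(x)}[\mathrm{KL}(q_\phi(z|x) \,\|\, p(z))] - \log N. \tag{12}$$

We should thus expect to empirically see training with this estimator as a regulariser to behave similarly to the VAE with the same regularisation term whenever the mini-batch size $B$ satisfies $B \ll N$. Note that the $-\log N$ constant factor will not impact the gradients, but does mean that it is possible, even likely, that negative estimates for $\mathrm{KL}(q_\phi(z) \,\|\, p(z))$ will be generated, even though we know the true value is positive.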
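The collapse of the mini-batch estimate onto its first term is easy to reproduce in a toy case. The sketch below assumes one-dimensional Gaussian encoders and an estimator of the form described around Eq. 10b, namely the sample's own encoder density divided by the dataset size plus a reweighted sum of the other encoders' densities in the batch. That form is an assumption reconstructed from the surrounding description, and the function names are hypothetical.

```python
import math


def normal_pdf(z, mean, std):
    """Density of a 1-D Gaussian at z."""
    return math.exp(-0.5 * ((z - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def minibatch_marginal_estimate(b, z_b, means, stds, dataset_size):
    """Estimate of q(z_b): own-encoder term q(z_b|x_b)/N plus the reweighted
    cross terms (N-1)/(N(B-1)) * sum over b' != b of q(z_b|x_b')."""
    batch_size = len(means)
    own_term = normal_pdf(z_b, means[b], stds[b]) / dataset_size
    cross_sum = sum(normal_pdf(z_b, means[j], stds[j])
                    for j in range(batch_size) if j != b)
    weight = (dataset_size - 1) / (dataset_size * (batch_size - 1))
    return own_term + weight * cross_sum


# Well-separated encodings: the cross terms are numerically zero.
means, stds, dataset_size = [0.0, 50.0, 100.0, 150.0], [1.0, 1.0, 1.0, 1.0], 10000
z_0 = 0.3
log_estimate = math.log(minibatch_marginal_estimate(0, z_0, means, stds, dataset_size))
log_own_minus_log_n = math.log(normal_pdf(z_0, means[0], stds[0])) - math.log(dataset_size)
```

With encodings this far apart, the log of the estimate equals the log encoder density minus the log of the dataset size, matching the approximation discussed above. Only when two encodings overlap substantially do the cross terms contribute.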