
A Closer Look at Disentangling in β-VAE

by Harshvardhan Sikka, et al.

In many data analysis tasks, it is beneficial to learn representations where each dimension is statistically independent and thus disentangled from the others. If data generating factors are also statistically independent, disentangled representations can be formed by Bayesian inference of latent variables. We examine a generalization of the Variational Autoencoder (VAE), β-VAE, for learning such representations using variational inference. β-VAE enforces conditional independence of its bottleneck neurons controlled by its hyperparameter β. This condition is in general not compatible with the statistical independence of latents. By providing analytical and numerical arguments, we show that this incompatibility leads to a non-monotonic inference performance in β-VAE with a finite optimal β.





I Introduction

Although data sampled from the natural world appear to be high-dimensional, their variations can usually be explained using a much smaller number of latent factors. Both biological and artificial information processing systems exploit such structure and learn explicit representations that are faithful to the data generative factors, known commonly as disentangled representations [1]. For example, sparse coding, an influential model of the primary visual cortex, proposes that visual cortex neurons code for latent variables of natural scenes: oriented edges [2]. A very popular method of extracting latent variables is by using the bottleneck neurons of deep autoencoders [3, 4]. In this paper, we examine unsupervised learning of disentangled representations in the context of variational inference and a generalization of the Variational Autoencoder (VAE), β-VAE, developed specifically for disentangled representation learning [5, 6].


We will adopt a probabilistic framework for latent-variable modeling of data [7], where a generative model for data $x$ and latent variables $z$ is assumed:

$p_\theta(x, z) = p_\theta(x|z)\, p_\theta(z).$ (1)

Here $\theta$ denotes the parameters of our model, $p_\theta(x|z)$ models the stochastic process that generates the data given the latent variables, and $p_\theta(z)$ is the prior on the latent variables. An interpretable and common choice for $p_\theta(z)$, and the subject of our paper, is a factorized distribution, $p_\theta(z) = \prod_i p_\theta(z_i)$, which implies statistical independence. Examples of models with independent priors include popular methods such as Independent Component Analysis [8, 9] and Principal Component Analysis [10].

While a common definition of learning disentangled representations has yet to be agreed upon [1, 5, 11, 12], extracting statistically independent latent factors is a natural choice [1, 8] and is the definition we adopt. Such a representation is efficient in that it carries no redundant information [13], while at the same time carrying sufficient information to generate the data.

In our probabilistic framework, the model posterior distribution $p_\theta(z|x)$ allows inference of the true latent variables. In principle, this could be used to form disentangled representations. However, the model posterior is often intractable [7], and variational methods are used to estimate it.

We focus on a state-of-the-art variational inference method for learning disentangled representations, β-VAE [6]. The β-VAE training objective includes a hyperparameter, β, encapsulating the original VAE [5] as a special case with the choice β = 1. When β is larger than unity, conditional independence of the learned representations at the bottleneck layer is enforced, corresponding to a conditional independence assumption on data generating latent variables, i.e., $p(z|x) = \prod_i p(z_i|x)$ [6]. However, as pointed out above, a more natural assumption on latents is full statistical independence. Further, statistically independent latents are in general not conditionally independent. Given the popularity of VAEs in representation learning, it is important to understand the role of the hyperparameter β in learning disentangled (statistically independent) latent variables.

Our main contributions are as follows:

  1. We provide general results about variational Bayesian inference in β-VAE. Specifically, we prove that the β-VAE objective is non-increasing with increasing β, leading to worse reconstruction performance but more conditionally independent representations. Further, we argue that latent variable inference performance generally tends to be non-monotonic in β.

  2. We introduce an analytically tractable model for β-VAE, specializing to statistically independent latent generative factors. We analytically calculate the optimality conditions for this model, and numerically find that there is an optimal β for the best inference of latent variables.

  3. We test our insights from the general theorems and the analytically tractable model using a realistic β-VAE architecture on a synthetic MNIST dataset. Simulations agree well with our theory.

The rest of this paper is organized as follows. In Section II, we provide a review of variational inference and β-VAE. In Section III, we prove several theorems about variational inference in the context of β-VAE. In Section IV, we introduce our analytical results. In Section V, we test our insights from the general theorems and the tractable models using a β-VAE architecture on a synthetic MNIST dataset. Finally, in Section VI we discuss our results and present our conclusions.

II Variational Inference and β-VAE

Inference of latent variables in probabilistic models is often an intractable calculation [5, 7]. Variational methods instead optimize over a set of tractable distributions, $q_\phi(z|x)$, to find the one that best approximates the model posterior $p_\theta(z|x)$. We will refer to $q_\phi(z|x)$ as the inference model. The difference between the two distributions can be quantified using the Kullback-Leibler (KL) divergence, which we call the Model Inference Error (MIE):

$\mathrm{MIE} \equiv D_{\mathrm{KL}}\!\left(q_\phi(z|x)\,\|\,p_\theta(z|x)\right).$ (2)

We distinguish between MIE and the True Inference Error (TIE),

$\mathrm{TIE} \equiv D_{\mathrm{KL}}\!\left(q_\phi(z|x)\,\|\,p^*(z|x)\right),$ (3)

which can only be known when one has access to the underlying ‘ground-truth’ data generative process and the ground-truth posterior, $p^*(z|x)$.

VAEs fit the parameters of the probabilistic model and the variational distribution simultaneously. A key identity in doing so is [14]

$\log p_\theta(x) - D_{\mathrm{KL}}\!\left(q_\phi(z|x)\,\|\,p_\theta(z|x)\right) = \mathbb{E}_{q_\phi(z|x)}\!\left[\log p_\theta(x|z)\right] - D_{\mathrm{KL}}\!\left(q_\phi(z|x)\,\|\,p_\theta(z)\right).$ (4)

Model fitting is done by maximizing the data log-likelihood, $\log p_\theta(x)$, under model parameters. Because the KL divergence is non-negative, the right-hand side of (4) serves as a lower bound for $\log p_\theta(x)$ and is called the Evidence Lower Bound (ELBO):

$\mathrm{ELBO} \equiv \mathbb{E}_{q_\phi(z|x)}\!\left[\log p_\theta(x|z)\right] - D_{\mathrm{KL}}\!\left(q_\phi(z|x)\,\|\,p_\theta(z)\right).$ (5)

VAE parameterizes the distributions $q_\phi(z|x)$ and $p_\theta(x|z)$ with neural networks, and maximizes the ELBO as a proxy for maximizing the data likelihood.
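As a quick sanity check of this identity, the following sketch verifies it numerically on a small discrete model (the distributions here are arbitrary illustrative choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy discrete model: 3 latent states, 4 observable states (illustrative values).
prior = np.array([0.5, 0.3, 0.2])                     # p(z)
lik = rng.dirichlet(np.ones(4), size=3)               # p(x|z), one row per z
q = np.array([0.6, 0.3, 0.1])                         # an arbitrary inference model q(z|x)

x = 2                                                 # a fixed observation
px = np.sum(prior * lik[:, x])                        # evidence p(x)
post = prior * lik[:, x] / px                         # model posterior p(z|x)

kl = lambda a, b: np.sum(a * np.log(a / b))           # discrete KL divergence

lhs = np.log(px) - kl(q, post)                        # log p(x) - KL(q || p(z|x))
elbo = np.sum(q * np.log(lik[:, x])) - kl(q, prior)   # E_q[log p(x|z)] - KL(q || p(z))

assert np.isclose(lhs, elbo)                          # the identity holds exactly
```

Because the identity is algebraic, the two sides agree to machine precision for any valid choice of prior, likelihood, and inference model.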

The neural network realization of $p_\theta(x|z)$ is referred to as a decoder [5]. Once the VAE is trained, the decoder can be used to generate new samples from the model data distribution [5, 15]. The term $\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$ measures the reconstruction performance of the generative model. We will refer to it as the reconstruction objective.

The neural network realization of the inference model $q_\phi(z|x)$ is referred to as an encoder [5]. Its outputs constitute a bottleneck layer and represent inferred latent variables. Note that the MIE calculated from this representation appears on the left-hand side of (4).

β-VAE is an extension of the traditional VAE, where an extra, adjustable hyperparameter β is placed in the training objective:

$\mathcal{L}_\beta(\theta, \phi; x) = \mathbb{E}_{q_\phi(z|x)}\!\left[\log p_\theta(x|z)\right] - \beta\, D_{\mathrm{KL}}\!\left(q_\phi(z|x)\,\|\,p_\theta(z)\right).$ (6)

Specifically, when β = 1, the β-VAE is equivalent to the VAE and $\mathcal{L}_1 = \mathrm{ELBO}$.

Higher values of β emphasize the KL divergence between the inference model and the independent prior in the objective (6). Smaller values of this KL divergence favor a conditionally independent inference model. This can be used to learn disentangled representations of conditionally independent latent variables, whose probability distributions factorize when conditioned on data [6, 16].
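The objective is straightforward to evaluate when, as is standard, the inference model is a diagonal Gaussian and the prior is $\mathcal{N}(0, I)$, so the KL term has a closed form. A minimal sketch (function and variable names are ours, not from the paper):

```python
import numpy as np

def beta_vae_objective(recon_loglik, mu, log_var, beta):
    """beta-VAE objective: reconstruction term minus beta times the KL of a
    diagonal Gaussian q(z|x) = N(mu, diag(exp(log_var))) from the prior N(0, I)."""
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)
    return recon_loglik - beta * kl

# When q equals the prior, the KL vanishes and the objective is just the
# reconstruction term; beta = 1 recovers the plain VAE (ELBO).
mu, log_var = np.zeros(5), np.zeros(5)
assert np.isclose(beta_vae_objective(-1.0, mu, log_var, beta=4.0), -1.0)
```

Raising `beta` above 1 penalizes any deviation of the inference model from the factorized prior more heavily, which is exactly the pressure toward conditional independence discussed above.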

However, as alluded to in our introduction, in many cases of interest and application [17, 18, 19], latent variables are conditionally dependent while being statistically independent [8], [10]. We will encounter an analytically tractable case in Section IV. In such cases, it is not clear whether a β different from 1 helps learn a disentangled representation that extracts statistically independent latent factors. Our goal in the remainder of this paper is to examine this case analytically and numerically.

For convenience, we also attach a table of terms and corresponding mathematical expressions used throughout the paper (Table I).

Term                            Mathematical Expression
Model Posterior                 $p_\theta(z|x)$
Ground-Truth Posterior          $p^*(z|x)$
Inference Model                 $q_\phi(z|x)$
Data Log-Likelihood             $\log p_\theta(x)$
Reconstruction Objective        $\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$
Conditional Independence Loss   $D_{\mathrm{KL}}\!\left(q_\phi(z|x)\,\|\,p_\theta(z)\right)$
Evidence Lower Bound (ELBO)     $\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{\mathrm{KL}}\!\left(q_\phi(z|x)\,\|\,p_\theta(z)\right)$
Table I: Table of terms and corresponding mathematical expressions.

III How β Affects Model Performance and Inference of Latent Variables

In this section, we provide general statements on the effect of the parameter β on the representation learning and the generative functions of β-VAE. We do this by proving propositions about how various terms in the identity (4) change as a function of β. Our first two propositions imply that increasing β worsens the quality of reconstructed samples while improving conditional disentangling. While these points have been shown in simulations [6, 16], here we provide analytical statements. Our last proposition gives a handle on understanding the behavior of MIE through the ELBO.

In the following, we will denote the optimal parameters of a β-VAE that maximize the objective (6) by $\theta^*(\beta)$ and $\phi^*(\beta)$. They are given as a solution to

$\theta^*(\beta),\ \phi^*(\beta) = \operatorname*{arg\,max}_{\theta, \phi}\ \mathcal{L}_\beta(\theta, \phi).$ (7)

We denote the value of the optimal objective by

$\mathcal{L}^*(\beta) \equiv \mathcal{L}_\beta\!\left(\theta^*(\beta), \phi^*(\beta)\right),$ (8)

and the value of ELBO at the optimal point by

$\mathrm{ELBO}^*(\beta) \equiv \mathrm{ELBO}\!\left(\theta^*(\beta), \phi^*(\beta)\right).$ (9)
Our first proposition concerns the behavior of $\mathcal{L}^*(\beta)$ as a function of β.

Proposition 1.

The optimal value of the β-VAE objective, $\mathcal{L}^*(\beta)$, is non-increasing with increasing β:

$\frac{d \mathcal{L}^*(\beta)}{d\beta} \le 0.$ (10)

Proof. This follows from an application of the chain rule, the optimality conditions (7), and the non-negativity of the KL divergence:

$\frac{d \mathcal{L}^*(\beta)}{d\beta} = \frac{\partial \mathcal{L}_\beta}{\partial \beta}\bigg|_{\theta^*(\beta), \phi^*(\beta)} = -D_{\mathrm{KL}}\!\left(q_{\phi^*(\beta)}(z|x)\,\|\,p_{\theta^*(\beta)}(z)\right) \le 0.$ (11) ∎


The next proposition shows how the two terms in $\mathcal{L}^*(\beta)$ change with β.

Proposition 2.

The KL divergence between the inference model and the prior is non-increasing with increasing β:

$\frac{d}{d\beta}\, D_{\mathrm{KL}}\!\left(q_{\phi^*(\beta)}(z|x)\,\|\,p_{\theta^*(\beta)}(z)\right) \le 0.$ (12)

Together with Proposition 1, this implies that the reconstruction objective is also non-increasing:

$\frac{d}{d\beta}\, \mathbb{E}_{q_{\phi^*(\beta)}(z|x)}\!\left[\log p_{\theta^*(\beta)}(x|z)\right] \le 0.$ (13)

Proof. See Appendix A. ∎

The next proposition is about the behavior of $\mathrm{ELBO}^*(\beta)$.

Proposition 3.

$\mathrm{ELBO}^*(\beta)$ is maximized at β = 1.

Proof. Note that by definition

$\mathrm{ELBO}^*(\beta) = \mathcal{L}^*(\beta) + (\beta - 1)\, D_{\mathrm{KL}}\!\left(q_{\phi^*(\beta)}(z|x)\,\|\,p_{\theta^*(\beta)}(z)\right).$ (14)

By differentiating (14) with respect to β, using the chain rule and (11), we get:

$\frac{d\, \mathrm{ELBO}^*(\beta)}{d\beta} = (\beta - 1)\, \frac{d}{d\beta}\, D_{\mathrm{KL}}\!\left(q_{\phi^*(\beta)}(z|x)\,\|\,p_{\theta^*(\beta)}(z)\right),$ (15)

which vanishes at β = 1. The proposition follows from this result and (12): by (12), the derivative (15) is non-negative for β < 1 and non-positive for β > 1. ∎

For simplicity of notation, we presented most of our formulas and propositions for a single data point. All our results generalize to the case where one averages over the data distribution $p^*(x)$, or over a finite training set.

Inference of latent variables, measured by MIE, is affected by β as well. In the limit β → ∞, the inference model becomes more and more conditionally independent, deviating from the model posterior. Is the behavior monotonic? While MIE is not explicitly calculable, we can get a hint of its behavior by rearranging (4) and evaluating it at the optimal β-VAE parameters:

$\mathrm{MIE}^*(\beta) = \log p_{\theta^*(\beta)}(x) - \mathrm{ELBO}^*(\beta).$ (16)

As reconstruction performance worsens with β, it is reasonable to expect that the data likelihood decreases with β. Because $\mathrm{ELBO}^*(\beta)$ is non-monotonic with a maximum, even if the data log-likelihood were monotonic in β, we can expect a non-monotonic behavior of MIE with an optimal β value. In the next section, we will see two specific examples of this.

IV Analytical Results

In this section, we demonstrate our general theory in two different analytically tractable cases.

IV-A β-VAE with a fixed decoder does not lead to better disentangling

A simple case is when the decoder of the β-VAE is not trained. In our notation, this amounts to θ being fixed. Then the β-VAE objective (6) only trains the encoder network, i.e., the inference model $q_\phi(z|x)$. We can deduce the behavior of MIE as a function of β from (16). The data likelihood, $\log p_\theta(x)$, does not change during training. $\mathrm{ELBO}^*(\beta)$ is maximized at β = 1 by Proposition 3, which can be seen to apply also when θ is fixed. This means MIE is minimized at β = 1. In this case, β = 1, i.e. the original VAE, is best at learning the true latent variables.

IV-B Optimal β values in an analytically tractable model

Next, we present a tractable β-VAE model, in which we can explicitly calculate the β-dependence of every term in the identity (4).

We assume that our data comes from mixing of ground-truth latent variables (or sources) $s$ through a mixing matrix $A$, then corrupted by noise $n$:

$x = A s + n.$ (17)

We assume $s \sim \mathcal{N}(0, I_m)$ and $n \sim \mathcal{N}(0, \sigma^2 I_d)$. The data distribution is found to be

$p^*(x) = \mathcal{N}\!\left(0,\; A A^\top + \sigma^2 I_d\right).$ (18)

We denote the $d \times d$ identity matrix by $I_d$. In this model we can calculate the ground-truth posterior exactly (see Appendix B-C for details):

$p^*(s|x) = \mathcal{N}\!\left(\left(A^\top A + \sigma^2 I_m\right)^{-1} A^\top x,\; \sigma^2 \left(A^\top A + \sigma^2 I_m\right)^{-1}\right).$ (19)

Note that the covariance matrix of the posterior is non-diagonal. Even though the latent factors are statistically independent, when conditioned on data they are dependent. Therefore, we expect a non-trivial dependence of MIE and TIE on the hyperparameter β.
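This conditional dependence of independent sources can be checked directly. The sketch below assumes the notation above ($s \sim \mathcal{N}(0, I_m)$, $n \sim \mathcal{N}(0, \sigma^2 I_d)$) and computes the ground-truth posterior covariance for a random mixing matrix; its off-diagonal entries are generically non-zero:

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, sigma2 = 8, 3, 0.5
A = rng.normal(size=(d, m))            # random mixing matrix

# Ground-truth posterior covariance of s given x = A s + n,
# with s ~ N(0, I_m) and n ~ N(0, sigma2 * I_d):
Sigma_post = sigma2 * np.linalg.inv(A.T @ A + sigma2 * np.eye(m))

# The prior over s is factorized, yet the posterior is NOT conditionally
# independent: its covariance has non-zero off-diagonal entries.
off_diag = Sigma_post - np.diag(np.diag(Sigma_post))
assert np.abs(off_diag).max() > 1e-6
```

A generic mixing matrix makes $A^\top A$ non-diagonal, so the posterior correlates the sources even though the prior does not.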

Our encoder contains a fully-connected layer with linear activation that codes for the mean of the latent variables, $\mu(x)$, and a fully-connected layer with exponential activation that codes for the diagonal part of the covariance matrix, $\Lambda(x) = \mathrm{diag}\!\left(\exp(\,\cdot\,)\right)$. Given an input $x$, we generate latent variables by

$z = \mu(x) + \Lambda(x)^{1/2}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I_m),$ (20)

where the $\mathrm{diag}(\cdot)$ operation maps vectors in $\mathbb{R}^m$ to the diagonal of a diagonal matrix in $\mathbb{R}^{m \times m}$. The exponential nonlinearity in the definition of the covariance matrix acts elementwise and prevents negative variances.
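A minimal sketch of this encoder and its sampling step (the weight matrices here are arbitrary placeholders, not trained parameters):

```python
import numpy as np

rng = np.random.default_rng(2)
d, m = 8, 3
W_mu = rng.normal(size=(m, d))           # placeholder weights of the mean head
W_rho = rng.normal(size=(m, d)) * 0.1    # placeholder weights of the variance head

def encode_and_sample(x):
    """Linear mean head and exponential variance head, as in the tractable
    encoder; samples via the reparameterization z = mu + Lambda^{1/2} eps."""
    mu = W_mu @ x                        # linear activation -> mean
    lam = np.exp(W_rho @ x)              # exponential activation -> diagonal covariance
    eps = rng.standard_normal(m)
    return mu + np.sqrt(lam) * eps       # one sample of z ~ N(mu, diag(lam))

x = rng.normal(size=d)
z = encode_and_sample(x)
assert z.shape == (m,)
```

Writing the sample as a deterministic function of `x` and an external noise `eps` is the reparameterization trick, which keeps the sampling step differentiable with respect to the encoder weights.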

Our decoder consists of a single fully-connected layer with linear activations. We assume the output is normally distributed, $p_\theta(x|z) = \mathcal{N}\!\left(W z,\; \gamma^2 I_d\right)$, where $W$ is the trained decoder weight matrix and $\gamma$ is a hyperparameter. Without loss of generality, from now on we choose $\gamma = 1$.

The decoder defines $p_\theta(x|z)$. The full data likelihood can be calculated using the prior through $p_\theta(x) = \int p_\theta(x|z)\, p(z)\, dz$. With this setup, our decoder is fully capable of modeling the data generative process (18), by choosing its parameters to match the mixing process, e.g. $W = A$. Any deviation from these parameters will be due to the encoder, i.e. the inference model, deviating from the ground-truth distribution.

In order to solve this model, we integrate out the data (i.e., performing $\mathbb{E}_{p^*(x)}[\cdot]$ using eq. (18)) in the β-VAE objective (6) to arrive at an averaged objective (see Appendix B-A for details).

We optimize over the network parameters, which amounts to setting the partial derivatives of the averaged objective with respect to the parameters to zero. Upon simplifying, we find a first optimality condition, eq. (22) (see Appendix B-B for details), together with the remaining optimality equations for the other network parameters.
We can calculate the model posterior distribution at the network optimum from eq. (22) and the remaining optimality equations. Using Bayes’ rule, we find the optimal model posterior (see Appendix B-D). Note that when β = 1, this model posterior matches the ground-truth posterior (19). We are interested in the inference errors MIE and TIE, eqs. (2) and (3). Upon integrating out the data, we find closed-form expressions of the same form for MIE and for TIE, differing only in the posterior against which the inference model is compared (see Appendix B-D for derivations).
Fig. 1: β-dependence of various quantities at the optimal parameter configuration of β-VAE. (A) ELBO as a function of β. (B) MIE/TIE as a function of β. (C) Reconstruction objective as a function of β. (D) Conditional independence loss as a function of β. In these plots, we averaged the plotted quantities over the data distribution.

As an example, we numerically solve the optimality equations for a particular choice of the mixing matrix and noise variance, and use the optimal network parameters to calculate the ELBO (Fig. 1(A)) and the inference errors (Fig. 1(B)). We see that the ELBO is maximized at β = 1, while the inference error is not monotonic and attains its minimum at a β different from 1. This confirms the theory we outlined earlier. Also, the data log-likelihood is monotonically decreasing with β (not shown). We further calculate the individual terms in the ELBO: the reconstruction objective (Fig. 1(C)) and the conditional independence loss (Fig. 1(D)). Indeed, both terms are monotonically decreasing with β, confirming our propositions.
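Since the inference model and both posteriors are Gaussian in this tractable setting, MIE and TIE reduce to KL divergences between multivariate Gaussians, which have a closed form. A sketch of that formula (our implementation, under the stated Gaussian assumptions):

```python
import numpy as np

def gaussian_kl(mu0, S0, mu1, S1):
    """Closed-form KL( N(mu0, S0) || N(mu1, S1) ) for multivariate Gaussians."""
    m = len(mu0)
    S1_inv = np.linalg.inv(S1)
    quad = (mu1 - mu0) @ S1_inv @ (mu1 - mu0)
    logdet = np.log(np.linalg.det(S1) / np.linalg.det(S0))
    return 0.5 * (np.trace(S1_inv @ S0) + quad - m + logdet)

mu = np.zeros(2)
assert np.isclose(gaussian_kl(mu, np.eye(2), mu, np.eye(2)), 0.0)  # identical Gaussians
assert gaussian_kl(mu, 2 * np.eye(2), mu, np.eye(2)) > 0.0         # distinct Gaussians
```

Evaluating this with a diagonal covariance for the inference model and the (non-diagonal) posterior covariances gives per-example MIE and TIE, which can then be averaged over data.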

V Numerical Simulations

In this section, we examine a deep, nonlinear β-VAE on a synthetic dataset. The dataset is generated according to eq. (17) by mixing 10 MNIST digits, arranged as columns of a mixing matrix $A$, with ground-truth sources $s$, and subsequently adding noise $n$. Other experimental setups and corresponding datasets that were explored are included in Appendix C (Fig. 3).

The encoder, $q_\phi(z|x)$, consists of three feed-forward fully-connected layers with tanh activations, ending in two separate output layers encoding the mean of the latent variables, $\mu(x)$, and the variance, $\Lambda(x)$. These are each parameterized by $m$ encoding units. The decoder, $p_\theta(x|z)$, consists of three feed-forward fully-connected layers with tanh activation functions; it takes its input from the encoder and outputs the reconstructed image. Model details are included in Appendix C.

After training, we calculate the individual terms in the β-VAE objective and demonstrate their dependence on β. These terms correspond to the reconstruction objective (Fig. 2(C)) and the conditional independence loss (Fig. 2(D)). As we observed in the analytically tractable case, and as predicted by our theory, these terms are decreasing with β. Correspondingly, after being maximized around β = 1, the entire ELBO term decreases with β (Fig. 2(A)). We also calculate the TIE for the β-VAE at various β, which follows a non-monotonic trend and has an optimal β (Fig. 2(B)).

Fig. 2: Values of the error terms across 100 random initializations of the network. The solid line represents the average. The dashed lines around the solid line represent the minimum and maximum values, and the vertical dashed line represents the extremum. (A) ELBO as a function of β. (B) TIE as a function of β. (C) Reconstruction objective as a function of β. (D) Conditional independence loss as a function of β.

VI Discussion and Conclusion

In this paper, we examined the learning of disentangled representations by extracting statistically independent latent variables in β-VAE. We proved general theorems on variational Bayesian inference in the context of β-VAE and introduced an analytically tractable β-VAE model. We also performed experiments on synthetic datasets to test our insights from the general theorems and the tractable model, and found good agreement.

β-VAE enforces conditional independence of its units at the bottleneck layer. This preference is not compatible with statistical independence of latent variables, and therefore may lead to an optimal value of β for latent variable inference.

There are other perspectives on what constitutes a disentangled representation that are not addressed in this paper [1, 16], including definitions that are not statistical in nature but instead take into account the manifold structure and symmetry transformations in data [1, 20, 12]. Other deep learning approaches to disentangling include adversarial settings [21, 22, 23]. Disentangled representations have also been studied in supervised and semi-supervised contexts [24].


Appendix A Proof of Proposition 2

We prove a more general version of eq. (12) given in Prop. 2. Eq. (13) follows from eq. (12) and Prop. 1.

Proposition 4.

Consider an objective function given by a sum of two terms,

$\mathcal{L}_\beta(\theta) = f(\theta) - \beta\, g(\theta),$

to be maximized over parameters $\theta$, where $\beta$ is a hyperparameter. Let $\theta^*(\beta) = \arg\max_\theta \mathcal{L}_\beta(\theta)$. As $\beta$ increases, $g(\theta^*(\beta))$ is non-increasing.


The proof uses contradiction. Let $\beta_2 > \beta_1$, and write $f_i = f(\theta^*(\beta_i))$ and $g_i = g(\theta^*(\beta_i))$. Optimality of $\theta^*(\beta_1)$ at $\beta_1$ gives

$f_1 - \beta_1 g_1 \ge f_2 - \beta_1 g_2,$ (28)

and optimality of $\theta^*(\beta_2)$ at $\beta_2$ gives

$f_2 - \beta_2 g_2 \ge f_1 - \beta_2 g_1.$ (29)

Now we assume $g_2 > g_1$, and see that this leads to a contradiction. Starting from (28),

$f_1 - \beta_2 g_1 = (f_1 - \beta_1 g_1) - (\beta_2 - \beta_1)\, g_1 \ge (f_2 - \beta_1 g_2) - (\beta_2 - \beta_1)\, g_1 > (f_2 - \beta_1 g_2) - (\beta_2 - \beta_1)\, g_2 = f_2 - \beta_2 g_2,$

where the first inequality follows from (28), and the strict inequality from our assumption $g_2 > g_1$ together with $\beta_2 > \beta_1$. The resulting strict inequality, $f_1 - \beta_2 g_1 > f_2 - \beta_2 g_2$, contradicts (29). Therefore, if $\beta_2 > \beta_1$, then $g_2 \le g_1$. ∎

Appendix B Details of the analytically tractable β-VAE model

B-A Integrating out the data from the objective

The full β-VAE objective is (6) averaged with respect to the data distribution $p^*(x)$:

$\bar{\mathcal{L}}_\beta = \mathbb{E}_{p^*(x)}\!\left[\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \beta\, D_{\mathrm{KL}}\!\left(q_\phi(z|x)\,\|\,p_\theta(z)\right)\right].$

We first calculate the reconstruction objective, $\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$. We use the reparametrization trick: for $z \sim \mathcal{N}(\mu(x), \Lambda(x))$, we can write $z = \mu(x) + \Lambda(x)^{1/2}\, \epsilon$ with $\epsilon \sim \mathcal{N}(0, I_m)$. Then,

The last term can be calculated by the following useful trick. Let us introduce a source term $j$ into the generating functional,


then differentiating with respect to the source,


On the other hand, we can perform the Gaussian integral in to obtain,


Then we arrive at


Eq. (37) is central to the calculations of many results presented in the text.
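While the generating-functional calculation is not reproduced here, Gaussian expectation identities of this kind are easy to check by Monte Carlo. The sketch below verifies one representative identity used in reconstruction-type terms, $\mathbb{E}_{z \sim \mathcal{N}(\mu, \Sigma)}\big[\lVert W z - a \rVert^2\big] = \lVert W \mu - a \rVert^2 + \mathrm{tr}(W \Sigma W^\top)$ (our illustrative example, not the paper's exact eq. (37)):

```python
import numpy as np

rng = np.random.default_rng(3)
m, d, N = 3, 4, 400_000
W = rng.normal(size=(d, m))
a = rng.normal(size=d)
mu = rng.normal(size=m)
L = rng.normal(size=(m, m)) * 0.3
Sigma = L @ L.T + 0.5 * np.eye(m)        # a valid (positive-definite) covariance

# Closed form: E ||W z - a||^2 = ||W mu - a||^2 + tr(W Sigma W^T)
closed = np.sum((W @ mu - a) ** 2) + np.trace(W @ Sigma @ W.T)

# Monte Carlo estimate from samples z ~ N(mu, Sigma)
z = mu + rng.standard_normal((N, m)) @ np.linalg.cholesky(Sigma).T
mc = np.mean(np.sum((z @ W.T - a) ** 2, axis=1))

assert abs(mc - closed) / closed < 0.02  # Monte Carlo agrees with the identity
```

Expectations of this form are exactly what appear when the Gaussian decoder log-likelihood is averaged over the reparameterized latent samples.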

Going back to the reconstruction objective, using eq. (37) we have (up to constants)


Similarly we can calculate the conditional independence loss,


Putting everything together, the objective function we want to maximize is (neglecting constant terms)


The expectation with respect to $p^*(x)$ amounts to performing Gaussian integrals in $x$, as $p^*(x)$ is itself Gaussian (eq. (18)), and thus can be done exactly. After plugging in the definition of $\Lambda(x)$ from eq. (20), and performing the integrals, we obtain the averaged objective used in Section IV-B.

B-B Taking derivatives of the objective

In order to take derivatives of the averaged objective, we unpack the indices (to ease the notation, we follow the Einstein summation convention: repeated indices are summed over unless the summation is explicitly specified)




From the resulting optimality equations we can immediately read off the relation quoted in the main text.

B-C Derivation of the ground-truth posterior

We observe that since both $s$ and $n$ are independently normally distributed in (17), the pair $(s, n)$ is jointly normal. However, note that $(x, s)$ is just $(s, n)$ up to a linear coordinate transformation, so $(x, s)$ is also normal. Also, both $s$ and $x$ have zero mean. We can therefore think of $(x, s)$ as a partitioned $(d + m)$-dimensional normal distribution. To find the conditional probability $p^*(s|x)$, we can just use the formula for conditioning a multivariate normal distribution:

$\mu_{s|x} = \mu_s + \Sigma_{sx}\, \Sigma_{xx}^{-1} (x - \mu_x), \qquad \Sigma_{s|x} = \Sigma_{ss} - \Sigma_{sx}\, \Sigma_{xx}^{-1}\, \Sigma_{xs}.$
Now specializing to our case (17), $\Sigma_{ss} = I_m$, $\Sigma_{xx} = A A^\top + \sigma^2 I_d$, and $\Sigma_{sx} = A^\top$. Note that $\mu_s = \mu_x = 0$; then

$\mu_{s|x} = A^\top \left(A A^\top + \sigma^2 I_d\right)^{-1} x = \left(A^\top A + \sigma^2 I_m\right)^{-1} A^\top x,$

where in the second equality we have used the matrix push-through identity: for any size-compatible matrices $U$, $V$ and scalar $\lambda > 0$,

$U \left(V U + \lambda I\right)^{-1} = \left(U V + \lambda I\right)^{-1} U.$
Now the covariance,

$\Sigma_{s|x} = I_m - A^\top \left(A A^\top + \sigma^2 I_d\right)^{-1} A = \sigma^2 \left(A^\top A + \sigma^2 I_m\right)^{-1},$

where in the second equality we have used the Woodbury matrix identity: for any invertible matrix $C$ and size-compatible matrices $U$ and $V$:

$\left(I + U C V\right)^{-1} = I - U \left(C^{-1} + V U\right)^{-1} V.$
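The push-through and Woodbury manipulations above can be checked numerically; the sketch below verifies the specific instance used for the posterior covariance (random matrices, our notation):

```python
import numpy as np

rng = np.random.default_rng(4)
m, d = 3, 6
A = rng.normal(size=(d, m))
sigma2 = 0.7

# Woodbury instance: (I_m + A^T A / sigma2)^{-1}
#                  = I_m - A^T (A A^T + sigma2 I_d)^{-1} A
lhs = np.linalg.inv(np.eye(m) + A.T @ A / sigma2)
rhs = np.eye(m) - A.T @ np.linalg.inv(A @ A.T + sigma2 * np.eye(d)) @ A

assert np.allclose(lhs, rhs)
```

Multiplying `lhs` by `sigma2` gives exactly the posterior covariance $\sigma^2 (A^\top A + \sigma^2 I_m)^{-1}$, which is why the identity appears in the derivation.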
B-D Derivation of the model posterior

Our goal is to use Bayes’ rule to calculate the model posterior, $p_\theta(z|x) = p_\theta(x|z)\, p_\theta(z) / p_\theta(x)$. In order to do so, we first need to calculate the evidence $p_\theta(x)$,


where in the third equality we have used eqs. (52) and (54) to simplify. Therefore,


After some simplifications using eqs. (52) and (54), we arrive at the model posterior stated in Section IV-B.


B-E Derivation of the inference errors

First let’s consider . Let


Then, we can write MIE as


Plugging in eq. (20) and performing the Gaussian integrals as in Appendix B-A, we arrive at the expression for MIE used in Section IV-B.

Note that at the network optimum, our model posterior equals the ground-truth posterior up to a replacement of the model parameters by the ground-truth ones. Therefore, we just need to make this replacement in the above derivation to obtain the results for TIE.

Appendix C Simulation Details

The deep neural network models used for the numerical experiments used the same overall architecture. The encoder is a feed-forward network with 3 hidden layers of 256, 200, and 200 units. Two parallel hidden layers of 2 neurons each parameterize the mean and the variance of the latent variables. The decoder consists of 3 feed-forward hidden layers with 200, 200, and 256 units, and then outputs the reconstructed image. The network was trained for 1000 epochs over the entire synthetic dataset, comprising 1000 examples. We used tanh activation functions along with the Adam optimizer [25] with a learning rate of 1e-3. Experiments were repeated across 300 realizations for each β value. Results shown were averaged over the whole set of realizations.

The reconstruction objective was calculated for each trained model by generating 1000 samples from the encoder, passing them to the decoder to approximately calculate $\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$, and averaging over the data. The conditional independence loss was calculated directly using the native KL divergence method of the TensorFlow Distributions library. The ELBO was calculated by numerically taking the difference of these two terms, and the β-VAE objective was an extension of this with the hyperparameter β included. The inference error was calculated numerically using the modeled mean and covariance and estimating from mini-batches.

In Fig. 3, we show results of another simulation consistent with our findings.

Fig. 3: Values of the error terms across 300 random initializations of the network for a synthetic dataset, which comprises a single MNIST digit localized at different locations on a blank canvas. The Cartesian coordinates of the digit in a sample from our data are determined by eq. (17), with