I Introduction
Although data sampled from the natural world appear to be high-dimensional, their variations can usually be explained using a much smaller number of latent factors. Both biological and artificial information processing systems exploit such structure and learn explicit representations that are faithful to the data generative factors, known commonly as disentangled representations [1]. For example, sparse coding, an influential model of the primary visual cortex, proposes that visual cortex neurons code for latent variables of natural scenes: oriented edges [2]. A very popular method of extracting latent variables is by using the bottleneck neurons of deep autoencoders [3, 4]
. In this paper, we examine unsupervised learning of disentangled representations in the context of variational inference and a generalization of the Variational Autoencoder (VAE) [5], the β-VAE, developed specifically for disentangled representation learning [6]. We adopt a probabilistic framework for latent-variable modeling of data [7], in which a generative model for the data x and the latent variables z is assumed:
p_θ(x, z) = p_θ(x|z) p_θ(z)    (1)
Here θ denotes the parameters of our model, p_θ(x|z) models the stochastic process that generates the data given the latent variables, and p_θ(z) is the prior on the latent variables. An interpretable and common choice for p_θ(z), and the subject of our paper, is a factorized distribution, p_θ(z) = ∏_i p_θ(z_i), which implies statistical independence of the latent variables. Examples of models with independent priors include popular methods such as Independent Component Analysis [8, 9] and Principal Component Analysis [10]. While a common definition of learning disentangled representations has yet to be agreed upon [1, 5, 11, 12], extracting statistically independent latent factors is a natural choice [1, 8] and is the definition we adopt. Such a representation is efficient in that it carries no redundant information [13], while at the same time carrying sufficient information to generate the data.
In our probabilistic framework, the model posterior distribution p_θ(z|x) allows inference of the true latent variables. In principle, it could be used to form disentangled representations. However, the model posterior is often intractable [7], and variational methods are used to approximate it.
We focus on a state-of-the-art variational inference method for learning disentangled representations, the β-VAE [6]. The β-VAE training objective includes a hyperparameter, β, and encapsulates the original VAE [5] as the special case β = 1. When β is larger than unity, conditional independence of the learned representations at the bottleneck layer is enforced, corresponding to a conditional independence assumption on the data generating latent variables, i.e. p(z|x) = ∏_i p(z_i|x) [6]. However, as pointed out above, a more natural assumption on the latents is full statistical independence. Further, statistically independent latents are in general not conditionally independent. Given the popularity of β-VAEs in representation learning, it is important to understand the role of the hyperparameter β in learning disentangled (statistically independent) latent variables.
Our main contributions are as follows:

We provide general results about variational Bayesian inference in the β-VAE. Specifically, we prove that the β-VAE objective is non-increasing with increasing β, leading to worse reconstruction performance but more conditionally independent representations. Further, we argue that latent-variable inference performance generally tends to be non-monotonic in β.

We introduce an analytically tractable model for the β-VAE, specializing to statistically independent latent generative factors. We analytically calculate the optimality conditions for this model, and numerically find that there is an optimal β for the best inference of latent variables.

We test our insights from the general theorems and the analytically tractable model using a realistic β-VAE architecture on a synthetic MNIST dataset. Simulations agree well with our theory.
The rest of this paper is organized as follows. In Section II, we provide a review of variational inference and the β-VAE. In Section III, we prove several theorems about variational inference in the context of the β-VAE. In Section IV, we introduce our analytical results. In Section V, we test our insights from the general theorems and the tractable models using a β-VAE architecture on a synthetic MNIST dataset. Finally, in Section VI we discuss our results and present our conclusions.
II Variational Inference and the β-VAE
Inference of latent variables in probabilistic models is often an intractable calculation [5, 7]. Variational methods instead optimize over a set of tractable distributions, q_φ(z|x), seeking the member that best approximates the model posterior p_θ(z|x). We refer to q_φ(z|x) as the inference model. The difference between the two distributions can be quantified using the Kullback-Leibler (KL) divergence, which we call the Model Inference Error (MIE):
MIE := D_KL( q_φ(z|x) ‖ p_θ(z|x) )    (2)
We distinguish between MIE and the True Inference Error (TIE),
TIE := D_KL( q_φ(z|x) ‖ p(z|x) )    (3)
which can only be known when one has access to the underlying ‘ground-truth’ data generative process and the ground-truth posterior, p(z|x).
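Both the MIE and the TIE are KL divergences between distributions that, in the Gaussian settings considered later in this paper, are multivariate normals. As an illustrative aside (a minimal sketch, not part of the paper's derivations), the Gaussian KL divergence has the following closed form:

```python
import numpy as np

def gaussian_kl(mu0, cov0, mu1, cov1):
    """Closed-form D_KL( N(mu0, cov0) || N(mu1, cov1) )."""
    d = mu0.shape[0]
    cov1_inv = np.linalg.inv(cov1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(cov1_inv @ cov0)
                  + diff @ cov1_inv @ diff
                  - d
                  + np.log(np.linalg.det(cov1) / np.linalg.det(cov0)))

# The divergence vanishes when the two distributions coincide
print(gaussian_kl(np.zeros(2), np.eye(2), np.zeros(2), np.eye(2)))  # 0.0
```

Inference errors such as the MIE and TIE can be computed with this formula whenever both the inference model and the relevant posterior are Gaussian.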
VAEs fit the parameters of the probabilistic model, θ, and of the variational distribution, φ, simultaneously. A key identity in doing so is [14]
log p_θ(x) = ELBO + D_KL( q_φ(z|x) ‖ p_θ(z|x) )    (4)
Model fitting is done by maximizing the data log-likelihood, log p_θ(x), over the model parameters. Because the KL divergence is non-negative, the first term on the right-hand side of (4) serves as a lower bound for log p_θ(x) and is called the Evidence Lower Bound (ELBO):
ELBO = E_{q_φ(z|x)}[ log p_θ(x|z) ] − D_KL( q_φ(z|x) ‖ p_θ(z) )    (5)
The VAE parameterizes the distributions p_θ(x|z) and q_φ(z|x) with neural networks, and maximizes the ELBO as a proxy for maximizing the data likelihood.
The neural network realization of p_θ(x|z) is referred to as a decoder [5]. Once the VAE is trained, the decoder can be used to generate new samples from the model data distribution [5, 15]. The term E_{q_φ(z|x)}[log p_θ(x|z)] measures the reconstruction performance of the generative model. We refer to it as the reconstruction objective.
The neural network realization of the inference model q_φ(z|x) is referred to as an encoder [5]. Its outputs constitute a bottleneck layer and represent the inferred latent variables. Note that the MIE calculated from this representation appears in the identity (4).
The β-VAE is an extension of the traditional VAE, in which an extra adjustable hyperparameter β is placed in the training objective:
L_β = E_{q_φ(z|x)}[ log p_θ(x|z) ] − β D_KL( q_φ(z|x) ‖ p_θ(z) )    (6)
Specifically, when β = 1, the β-VAE is equivalent to the VAE, and the objective (6) reduces to the ELBO of eq. (5).
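As a concrete illustration of the objective (6) (a minimal sketch, not the paper's implementation), for a diagonal-Gaussian encoder and a standard-normal prior the KL term has a well-known closed form, and the β-VAE objective is the reconstruction term minus β times that KL:

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    # D_KL( N(mu, diag(exp(logvar))) || N(0, I) ), closed form, per data point
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

def beta_vae_objective(recon_term, mu, logvar, beta):
    # E_q[log p(x|z)] - beta * D_KL(q(z|x) || p(z)); beta = 1 recovers the ELBO
    return recon_term - beta * kl_to_standard_normal(mu, logvar)
```

With `beta = 1` this is exactly the ELBO of eq. (5); for a fixed encoder with nonzero KL, larger `beta` strictly lowers the objective.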
Higher values of β emphasize the KL divergence between the inference model and the independent prior in the objective (6). Smaller values of this KL divergence favor a conditionally independent inference model. This can be used to learn disentangled representations of conditionally independent latent variables, whose probability distributions factorize when conditioned on data [6, 16]. However, as alluded to in our introduction, in many cases of interest and application [17, 18, 19], latent variables are conditionally dependent while being statistically independent [8, 10]. We will encounter an analytically tractable example in Section IV. In such cases, it is not clear whether a β different from 1 helps in learning a disentangled representation, one which extracts statistically independent latent factors. Our goal in the remainder of this paper is to examine this case analytically and numerically.
For convenience, we also attach a table of terms and corresponding mathematical expressions used throughout the paper (Table I).
Term  Mathematical Expression

Prior  p_θ(z)
Model Posterior  p_θ(z|x)
Ground-Truth Posterior  p(z|x)
Inference Model  q_φ(z|x)
Data Log-Likelihood  log p_θ(x)
Reconstruction Objective  E_{q_φ(z|x)}[log p_θ(x|z)]
Conditional Independence Loss  D_KL(q_φ(z|x) ‖ p_θ(z))
Evidence Lower Bound (ELBO)  E_{q_φ(z|x)}[log p_θ(x|z)] − D_KL(q_φ(z|x) ‖ p_θ(z))
III How β Affects Model Performance and Inference of Latent Variables
In this section, we provide general statements on the effect of the parameter β on the representation learning and generative functions of the β-VAE. We do this by proving propositions about how various terms in the identity (4) change as functions of β. Our first two propositions imply that increasing β worsens the quality of reconstructed samples while improving conditional disentangling. While these points have been shown in simulations [6, 16], here we provide analytical statements. Our last proposition gives a handle on understanding the behavior of the MIE through the ELBO.
In the following, we denote the optimal parameters of a β-VAE that maximize the objective (6) by θ*(β) and φ*(β). They are given as a solution to
(∂L_β/∂θ, ∂L_β/∂φ) |_{(θ*(β), φ*(β))} = 0    (7)
We denote the value of the optimal objective by
L*(β) := L_β(θ*(β), φ*(β))    (8)
and the value of ELBO at the optimal point by
ELBO*(β) := ELBO(θ*(β), φ*(β))    (9)
Our first proposition concerns the behavior of L*(β) as a function of β.
Proposition 1.
The optimal value of the β-VAE objective, L*(β), is non-increasing with increasing β:
dL*(β)/dβ ≤ 0    (10)
Proof.
Follows from an application of the chain rule, the optimality conditions (7), and the non-negativity of the KL divergence:
dL*(β)/dβ = (∂L_β/∂θ)·(dθ*/dβ) + (∂L_β/∂φ)·(dφ*/dβ) + ∂L_β/∂β = −D_KL( q_{φ*(β)}(z|x) ‖ p_{θ*(β)}(z) ) ≤ 0,    (11)
where all partial derivatives are evaluated at (θ*(β), φ*(β)), and the first two terms vanish by (7).
∎
The next proposition shows how the two terms in the objective change with β.
Proposition 2.
The KL divergence between the inference model and the prior is non-increasing with increasing β: for β_2 ≥ β_1,
D_KL( q_{φ*(β_2)}(z|x) ‖ p_{θ*(β_2)}(z) ) ≤ D_KL( q_{φ*(β_1)}(z|x) ‖ p_{θ*(β_1)}(z) )    (12)
Together with Proposition 1, this implies that
E_{q_{φ*(β_2)}(z|x)}[ log p_{θ*(β_2)}(x|z) ] ≤ E_{q_{φ*(β_1)}(z|x)}[ log p_{θ*(β_1)}(x|z) ]    (13)
Proof.
See Appendix A. ∎
The next proposition is about the behavior of ELBO*(β).
Proposition 3.
ELBO*(β) is maximized at β = 1.
Proof.
By definition (7), (θ*(1), φ*(1)) maximize L_1 = ELBO. Hence, for any β, ELBO*(β) = ELBO(θ*(β), φ*(β)) ≤ ELBO(θ*(1), φ*(1)) = ELBO*(1). ∎
For simplicity of notation, we presented most of our formulas and propositions for a single data point. All our results generalize to the case where one averages over the data distribution p(x), or over a finite training set.
Inference of latent variables, as measured by the MIE, is affected by β as well. In the limit β → ∞, the inference model becomes more and more conditionally independent, deviating from the model posterior. Is this behavior monotonic? While the MIE is not explicitly calculable, we can get a hint of its behavior by rearranging (4) and evaluating it at the optimal β-VAE parameters:
MIE*(β) = log p_{θ*(β)}(x) − ELBO*(β)    (16)
As reconstruction performance worsens with β, it is reasonable to expect that the data likelihood decreases with β. Because the ELBO is non-monotonic with a maximum at β = 1, even if the data log-likelihood were monotonic in β, we can expect non-monotonic behavior of the MIE, with an optimal β value. In the next section, we will see two specific examples of this.
IV Analytical Results
In this section we demonstrate our general theory for two different analytically tractable cases.
IV-A β-VAE with a fixed decoder does not lead to better disentangling
A simple case is when the decoder of the β-VAE is not trained. In our notation, this amounts to θ being fixed. Then the β-VAE objective (6) only trains the encoder network, i.e. the inference model q_φ(z|x). We can deduce the behavior of the MIE as a function of β from (16). The data likelihood, log p_θ(x), does not change during training. ELBO*(β) is maximized at β = 1 by Proposition 3, which can be seen to also apply for fixed θ. This means the MIE is minimized at β = 1. In this case, β = 1, i.e. the original VAE, is best at learning the true latent variables.
IV-B Optimal β values in an analytically tractable model
Next, we present a tractable β-VAE model in which we can explicitly calculate the β dependence of every term in eq. (4).
We assume that our data x come from mixing of ground-truth latent variables (or sources) s through a mixing matrix A, and are then corrupted by noise n,
x = A s + n    (17)
We assume s ∼ N(0, I_M) and n ∼ N(0, σ² I_N), with x ∈ R^N and s ∈ R^M. The data distribution is found to be,
p(x) = N( x; 0, A Aᵀ + σ² I_N )    (18)
We denote the N × N identity matrix by I_N (and similarly I_M). In this model we can calculate the ground-truth posterior exactly (see Appendix B-C for details):
p(s|x) = N( s; Aᵀ (A Aᵀ + σ² I_N)^{-1} x,  I_M − Aᵀ (A Aᵀ + σ² I_N)^{-1} A )    (19)
Note that the covariance matrix of this posterior is non-diagonal. Even though the latent factors are statistically independent, they become dependent when conditioned on the data. Therefore, we expect a nontrivial dependence of the MIE and TIE on the hyperparameter β.
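As a concrete check on eqs. (17)–(19), the following sketch (the dimensions and noise level are arbitrary illustrative choices, not values from the paper) samples data from the generative model, verifies the data covariance, and shows that the ground-truth posterior covariance has nonzero off-diagonal entries even though the sources are independent:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, sigma = 4, 2, 0.5          # arbitrary illustrative dimensions and noise level
A = rng.normal(size=(N, M))      # mixing matrix

# Sample x = A s + n with independent unit-variance sources (eq. 17)
S = rng.normal(size=(M, 200_000))
X = A @ S + sigma * rng.normal(size=(N, S.shape[1]))

# Empirical data covariance matches A A^T + sigma^2 I (eq. 18)
emp_cov = X @ X.T / S.shape[1]
analytic_cov = A @ A.T + sigma**2 * np.eye(N)

# Ground-truth posterior p(s|x) (eq. 19): mean map and covariance
mean_map = A.T @ np.linalg.inv(analytic_cov)            # E[s|x] = mean_map @ x
post_cov = np.eye(M) - A.T @ np.linalg.inv(analytic_cov) @ A

print(np.max(np.abs(emp_cov - analytic_cov)))  # small sampling error
print(post_cov)                                # off-diagonal entries are nonzero
```

The nonzero off-diagonal entries of `post_cov` are the "explaining-away" dependence between sources conditioned on data, which is the source of the tension with the conditional-independence preference of the β-VAE.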
Our encoder contains a fully-connected layer with linear activation that codes for the mean of the latent variables, μ(x) = W x, and a fully-connected layer with exponential activation that codes for the diagonal part of the covariance matrix, Λ(x) = diag(exp(V x)). Given an input x, we generate latent variables z by
z = W x + Λ(x)^{1/2} ε,   ε ∼ N(0, I_M)    (20)
where the diag(·) operation maps a vector in R^M to the diagonal of a diagonal matrix in R^{M×M}. The exponential nonlinearity in the definition of the covariance matrix acts elementwise and prevents negative variances. Our decoder consists of a single fully-connected layer with linear activations; we assume the output distribution p_θ(x|z) = N(x; U z, γ² I_N), where γ is a hyperparameter. Without loss of generality, from now on we choose γ = σ. The decoder defines p_θ(x|z), and the full data likelihood can be calculated using the prior through p_θ(x) = ∫ p_θ(x|z) p(z) dz. With this setup, our decoder is fully capable of modeling the data generative process (18) by choosing U = A. Any deviation from these parameters will be due to the encoder, i.e. the inference model, deviating from the ground-truth distribution.
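The reparameterized encoder sample of eq. (20) can be sketched as follows; the weight matrices `W` and `V` and all dimensions are hypothetical stand-ins for the trained mean and log-variance layers, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 4, 2                     # arbitrary data and latent dimensions
W = rng.normal(size=(M, N))     # hypothetical trained mean weights
V = rng.normal(size=(M, N))     # hypothetical trained log-variance weights

def encode_sample(x, eps):
    mu = W @ x                        # linear mean head
    var = np.exp(V @ x)               # exponential activation keeps variances positive
    return mu + np.sqrt(var) * eps    # reparameterization: z = mu + Lambda(x)^{1/2} eps

x = rng.normal(size=N)
z = encode_sample(x, rng.normal(size=M))   # one posterior sample of the latents
```

Sampling via a deterministic function of x and an external noise ε is what makes the objective differentiable with respect to the encoder weights.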
In order to solve this model, we integrate out the data (i.e., we perform the average E_{p(x)}[·], using eq. (18)) in the β-VAE objective (6) to arrive at (see Appendix B-A for details)
(21) 
We optimize over the network parameters, which amounts to setting the partial derivatives of the averaged objective with respect to the network parameters to zero. Upon simplifying, we find (see Appendix B-B for details)
(22) 
and the remaining optimality equations are:
(23) 
We can calculate the model posterior distribution at the network optimum, eqs. (22) and (23). Using Bayes’ rule, we find (see Appendix B-D)
(24) 
Note that when β = 1, eq. (24) reduces to eq. (19), and the model posterior matches the ground-truth posterior. We are interested in the inference errors MIE and TIE, eqs. (2) and (3). Upon integrating out the data, we find (see Appendix B-D for derivations)
(25) 
where for MIE
(26) 
and for TIE
(27) 
As an example, we numerically solve eq. (23), and use the optimal network parameters to calculate the ELBO (Fig. 1(A)) and the inference errors (Fig. 1(B)). We see that the ELBO is maximized at β = 1, while the inference error is not monotonic in β and has a minimum at some optimal β. This confirms the theory outlined earlier. The data log-likelihood is also monotonically decreasing with β (not shown). We further calculate the individual terms in the ELBO: the reconstruction objective (Fig. 1(C)) and the conditional independence loss (Fig. 1(D)). Indeed, both terms are monotonically decreasing with β, confirming our propositions.
V Numerical Simulations
In this section, we examine a deep, nonlinear β-VAE on a synthetic dataset. The dataset is generated according to eq. (17) by mixing 10 MNIST digits, arranged as columns of a mixing matrix A, with ground-truth sources s, and subsequently adding noise n. Other experimental setups and corresponding datasets that were explored are included in Appendix C (Fig. 3).
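A minimal sketch of this data-generation process follows; the flattened MNIST digit columns are replaced by random stand-ins, and the noise level is an assumed value, not one from the paper:

```python
import numpy as np

rng = np.random.default_rng(3)
n_pixels, n_sources, sigma = 784, 10, 0.1   # 28x28 images, 10 sources; noise level assumed

# Columns of A would be the 10 flattened MNIST digits; random stand-ins here
A = rng.normal(size=(n_pixels, n_sources))

def make_dataset(n_examples):
    S = rng.normal(size=(n_sources, n_examples))            # ground-truth sources
    noise = sigma * rng.normal(size=(n_pixels, n_examples))
    return A @ S + noise, S                                  # x = A s + n, eq. (17)

X, S = make_dataset(1000)
```

Keeping the ground-truth sources `S` alongside the observations `X` is what makes the TIE computable for this synthetic dataset.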
The encoder, q_φ(z|x), consists of three feedforward fully-connected layers with tanh activations, ending in two separate output layers that encode the mean and the variance of the latent variables, with one unit per latent variable. The decoder, p_θ(x|z), consists of three feedforward fully-connected layers with tanh activation functions; it takes its input from the encoder and outputs the reconstructed image. Model details are included in Appendix C.
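The encoder forward pass described above can be sketched as follows; the layer widths are taken from Appendix C, while the latent dimension and the (untrained) random weights are assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)
sizes = [784, 256, 200, 200]   # input size plus the three tanh hidden layers (Appendix C)
M = 10                         # latent dimension (assumed; one unit per latent variable)

Ws = [0.1 * rng.normal(size=(b, a)) for a, b in zip(sizes[:-1], sizes[1:])]
W_mu = 0.1 * rng.normal(size=(M, sizes[-1]))   # mean head
W_lv = 0.1 * rng.normal(size=(M, sizes[-1]))   # log-variance head

def encode(x):
    h = x
    for W in Ws:
        h = np.tanh(W @ h)                     # feedforward tanh layers
    return W_mu @ h, np.exp(W_lv @ h)          # mean and (positive) variance of q(z|x)

mu, var = encode(rng.normal(size=784))
```

Exponentiating the log-variance head mirrors the tractable model of Section IV-B, where the exponential activation guarantees positive variances.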
After training, we calculate the individual terms in the β-VAE objective and demonstrate their dependence on β. These terms correspond to the reconstruction objective (Fig. 2(C)) and the conditional independence loss (Fig. 2(D)). As we observed in the analytically tractable case, and as predicted by our theory, these terms decrease with β. Correspondingly, after being maximized around β = 1, the entire ELBO decreases with β (Fig. 2(A)). We also calculate the TIE for the β-VAE at various β; it follows a non-monotonic trend and has an optimal β (Fig. 2(B)).
VI Discussion and Conclusion
In this paper, we examined the learning of disentangled representations, in the sense of extracting statistically independent latent variables, in the β-VAE. We proved general theorems on variational Bayesian inference in the context of the β-VAE and introduced an analytically tractable β-VAE model. We also performed experiments on synthetic datasets to test our insights from the general theorems and the tractable model, and found good agreement.
The β-VAE enforces conditional independence of its units at the bottleneck layer. This preference is not compatible with full statistical independence of the latent variables, and may therefore lead to an optimal value of β for latent-variable inference.
There are other perspectives on what constitutes a disentangled representation that are not addressed in this paper [1, 16], including definitions that are not statistical in nature but instead take into account the manifold structure and symmetry transformations in data [1, 20, 12]. Other deep learning approaches to disentangling include the adversarial setting [21, 22, 23]. Disentangled representations have also been studied in supervised and semi-supervised contexts [24].
References
 [1] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.
 [2] B. A. Olshausen and D. J. Field, “Sparse coding with an overcomplete basis set: A strategy employed by V1?” Vision Research, vol. 37, no. 23, pp. 3311–3325, 1997.
 [3] G. E. Hinton and R. S. Zemel, “Autoencoders, minimum description length and helmholtz free energy,” in Advances in neural information processing systems, 1994, pp. 3–10.
 [4] A. Alemi, B. Poole, I. Fischer, J. Dillon, R. A. Saurous, and K. Murphy, “An information-theoretic analysis of deep latent-variable models,” 2018.
 [5] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” arXiv preprint arXiv:1312.6114, 2013.
 [6] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner, “beta-VAE: Learning basic visual concepts with a constrained variational framework,” ICLR, vol. 2, no. 5, p. 6, 2017.
 [7] D. P. Kingma and M. Welling, “An introduction to variational autoencoders,” arXiv preprint arXiv:1906.02691, 2019.
 [8] A. Hyvärinen and E. Oja, “Independent component analysis: algorithms and applications,” Neural Networks, vol. 13, no. 4–5, pp. 411–430, 2000.
 [9] I. Khemakhem, D. P. Kingma, and A. Hyvärinen, “Variational autoencoders and nonlinear ICA: A unifying framework,” arXiv preprint arXiv:1907.04809, 2019.
 [10] M. E. Tipping and C. M. Bishop, “Probabilistic principal component analysis,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 61, no. 3, pp. 611–622, 1999.
 [11] F. Locatello, S. Bauer, M. Lucic, S. Gelly, B. Schölkopf, and O. Bachem, “Challenging common assumptions in the unsupervised learning of disentangled representations,” arXiv preprint arXiv:1811.12359, 2018.
 [12] I. Higgins, D. Amos, D. Pfau, S. Racaniere, L. Matthey, D. Rezende, and A. Lerchner, “Towards a definition of disentangled representations,” arXiv preprint arXiv:1812.02230, 2018.
 [13] P. Dayan, L. F. Abbott et al., Theoretical neuroscience. Cambridge, MA: MIT Press, 2001, vol. 806.
 [14] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul, “An introduction to variational methods for graphical models,” Machine learning, vol. 37, no. 2, pp. 183–233, 1999.
 [15] C. Doersch, “Tutorial on variational autoencoders,” arXiv preprint arXiv:1606.05908, 2016.
 [16] C. P. Burgess, I. Higgins, A. Pal, L. Matthey, N. Watters, G. Desjardins, and A. Lerchner, “Understanding disentangling in β-VAE,” arXiv preprint arXiv:1804.03599, 2018.

 [17] X. Huang, M.-Y. Liu, S. Belongie, and J. Kautz, “Multimodal unsupervised image-to-image translation,” in The European Conference on Computer Vision (ECCV), September 2018.
 [18] G. Lample, N. Zeghidour, N. Usunier, A. Bordes, L. Denoyer, and M. Ranzato, “Fader networks: Manipulating images by sliding attributes,” in Advances in Neural Information Processing Systems, 2017, pp. 5967–5976.

 [19] T. Karras, S. Laine, and T. Aila, “A style-based generator architecture for generative adversarial networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 4401–4410.
 [20] J. J. DiCarlo and D. D. Cox, “Untangling invariant object recognition,” Trends in Cognitive Sciences, vol. 11, no. 8, pp. 333–341, 2007.
 [21] E. L. Denton et al., “Unsupervised learning of disentangled representations from video,” in Advances in neural information processing systems, 2017, pp. 4414–4423.

 [22] L. Tran, X. Yin, and X. Liu, “Disentangled representation learning GAN for pose-invariant face recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1415–1424.
 [23] V. John, L. Mou, H. Bahuleyan, and O. Vechtomova, “Disentangled representation learning for text style transfer,” arXiv preprint arXiv:1808.04339, 2018.
 [24] N. Siddharth, B. Paige, J.-W. van de Meent, A. Desmaison, N. Goodman, P. Kohli, F. Wood, and P. Torr, “Learning disentangled representations with semi-supervised deep generative models,” in Advances in Neural Information Processing Systems, 2017, pp. 5925–5935.
 [25] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
Appendix A Proof of Proposition 2
We prove a more general version of eq. (12) given in Prop. 2. Eq. (13) follows from eq. (12) and Prop. 1.
Proposition 4.
Consider an objective function given by a sum of two terms,
L_β(θ) = f_1(θ) − β f_2(θ),    (28)
to be maximized over parameters θ, where β ≥ 0 is a hyperparameter. Let θ*(β) = argmax_θ L_β(θ). As β increases, f_2(θ*(β)) is non-increasing.
Proof.
The proof is by contradiction. Let β_2 > β_1 and suppose that
f_2(θ*(β_2)) > f_2(θ*(β_1)).    (29)
Then
L_{β_2}(θ*(β_2)) = L_{β_1}(θ*(β_2)) − (β_2 − β_1) f_2(θ*(β_2))
  ≤ L_{β_1}(θ*(β_1)) − (β_2 − β_1) f_2(θ*(β_2))
  < L_{β_1}(θ*(β_1)) − (β_2 − β_1) f_2(θ*(β_1))
  = L_{β_2}(θ*(β_1)),    (30)
where the first line is an identity, the second line follows from the optimality of θ*(β_1) at β_1, and the third line follows from assumption (29). The last line contradicts the optimality of θ*(β_2) at β_2, completing the proof.
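Proposition 4 can be illustrated numerically: maximizing f_1(θ) − β f_2(θ) over a grid for increasing β, the penalized term f_2 evaluated at the optimum never increases. The two quadratic functions below are arbitrary examples, not part of the paper:

```python
import numpy as np

theta = np.linspace(-3.0, 3.0, 2001)
f1 = -(theta - 1.0) ** 2    # arbitrary example for the first term
f2 = theta ** 2             # arbitrary example for the penalized term (the KL's role)

betas = [0.0, 0.5, 1.0, 2.0, 4.0]
f2_at_opt = []
for beta in betas:
    best = np.argmax(f1 - beta * f2)   # theta*(beta) on the grid
    f2_at_opt.append(f2[best])

# f2(theta*(beta)) never increases as beta grows, as Proposition 4 states
assert all(a >= b for a, b in zip(f2_at_opt, f2_at_opt[1:]))
```

With f_2 playing the role of the KL divergence, this is exactly the mechanism behind eq. (12): a larger penalty weight can only push the penalized term down.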
Appendix B Details of the analytically tractable β-VAE model
B-A Integrating out the data from the objective
The full β-VAE objective is the objective (6) averaged with respect to the data distribution p(x):
(33) 
We first calculate the reconstruction objective. We use the reparametrization trick: for z ∼ q_φ(z|x), we can write z = μ(x) + Λ(x)^{1/2} ε with ε ∼ N(0, I_M), where μ and Λ are the mean and covariance of the inference model. Then,
The last term can be calculated by the following useful trick. Let’s introduce a source term into the generating functional,
(34) 
then differentiating with respect to the source,
(35) 
On the other hand, we can perform the Gaussian integral explicitly to obtain,
(36) 
Then we arrive at
(37) 
Eq. (37) is central to the calculations of many results presented in the text.
Going back to the reconstruction objective, using eq. (37) we have (up to constants)
(38) 
Similarly we can calculate the conditional independence loss,
(39) 
Putting everything together, the objective function we want to maximize is (neglecting constant terms)
(40) 
The expectation with respect to p(x) amounts to performing Gaussian integrals in x, since p(x) is Gaussian (eq. (18)), and thus can be done exactly. After plugging in the definition of the inference model from eq. (20) and performing the integrals, the result is given in eq. (21).
B-B Taking derivatives of the objective
In order to take derivatives of eq. (21), we unpack the indices. To ease the notation, we follow the Einstein summation convention: repeated indices are summed over unless the summation is explicitly specified.
(41) 
Then,
(42)  
(43)  
(44)  
(45)  
(46)  
(47) 
From the first two of these equations, eq. (22) immediately follows.
B-C Derivation of the ground-truth posterior
We observe that, since both s and n are independently normally distributed in (17), s and x are jointly normal, i.e., p(s, x) is a normal distribution. Indeed, (s, x) is just (s, n) up to a linear coordinate transformation, so p(s, x) is also normal. Further, both s and x have zero mean. We can therefore view s and x as a partition of an (M + N)-dimensional normal random vector. To find the conditional probability p(s|x), we can use the standard formula for conditioning a multivariate normal distribution:
p(a|b) = N( a; μ_a + Σ_{ab} Σ_{bb}^{-1} (b − μ_b),  Σ_{aa} − Σ_{ab} Σ_{bb}^{-1} Σ_{ba} )    (48)
where
Σ_{aa} = Cov[a, a],  Σ_{ab} = Cov[a, b],  Σ_{bb} = Cov[b, b],  μ_a = E[a],  μ_b = E[b].    (49)
Now specializing to our case (17), with a = s and b = x (both zero mean),
Σ_{ss} = I_M,  Σ_{sx} = Aᵀ,  Σ_{xx} = A Aᵀ + σ² I_N.    (50)
The posterior mean is then
E[s|x] = Aᵀ (A Aᵀ + σ² I_N)^{-1} x = (Aᵀ A + σ² I_M)^{-1} Aᵀ x,    (51)
where in the second equality we have used the matrix push-through identity: for any size-compatible matrices B and C,
B (C B + I)^{-1} = (B C + I)^{-1} B.    (52)
Now the covariance,
Cov[s|x] = I_M − Aᵀ (A Aᵀ + σ² I_N)^{-1} A = (I_M + Aᵀ A / σ²)^{-1},    (53)
where we have used the Woodbury matrix identity: for any invertible matrix B and size-compatible matrices U and V,
(B + U V)^{-1} = B^{-1} − B^{-1} U (I + V B^{-1} U)^{-1} V B^{-1}.    (54)
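Both matrix identities used in this appendix are easy to verify numerically; the dimensions and values below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(4, 2))     # arbitrary stand-in for the mixing matrix
sigma2 = 0.25
I_N, I_M = np.eye(4), np.eye(2)

# Push-through identity (52): A^T (A A^T + s I)^{-1} = (A^T A + s I)^{-1} A^T
lhs = A.T @ np.linalg.inv(A @ A.T + sigma2 * I_N)
rhs = np.linalg.inv(A.T @ A + sigma2 * I_M) @ A.T
assert np.allclose(lhs, rhs)

# Woodbury identity (54), in the form used for the posterior covariance (53)
lhs2 = np.linalg.inv(I_M + A.T @ A / sigma2)
rhs2 = I_M - A.T @ np.linalg.inv(A @ A.T + sigma2 * I_N) @ A
assert np.allclose(lhs2, rhs2)
```

Note that the push-through form (51) inverts an M × M matrix instead of an N × N one, which is cheaper when there are far fewer sources than observed dimensions.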
B-D Derivation of the model posterior
Our goal is to use Bayes’ rule, p_θ(z|x) = p_θ(x|z) p_θ(z) / p_θ(x), to calculate the model posterior.
B-E Derivation of the MIE and TIE
First, let us consider the MIE. Let
(60) 
Then, we can write the MIE as
(61) 
Plugging in eq. (20) and performing the Gaussian integrals as in Appendix B-A, we arrive at eq. (26).
Note that at the network optimum, our model posterior equals the ground-truth posterior up to a replacement of its parameters. Therefore, we just need to make the corresponding replacement in the above derivation to obtain the result for the TIE.
Appendix C Simulation Details
The deep neural network models used for the numerical experiments shared the same overall architecture. The encoder is a feedforward network with 3 hidden layers of 256, 200, and 200 units. Two parallel output layers, with one neuron per latent variable, parameterize the mean and the variance of the latent variables. The decoder consists of 3 feedforward hidden layers with 200, 200, and 256 units, and outputs the reconstructed image. The network was trained for 1000 epochs over the entire synthetic dataset, comprising 1000 examples. We used tanh activation functions along with the Adam optimizer [25] with a learning rate of 1e-3. Experiments were repeated across 300 realizations for each β value, and the results shown were averaged over the whole set of realizations.
The Reconstruction Objective was calculated for each trained model by generating 1000 samples from the encoder, passing them to the decoder to approximately calculate the expected reconstruction term, and averaging over the data. The Conditional Independence Loss was calculated directly using the TensorFlow Distributions library’s native KL divergence method. The ELBO was calculated by numerically taking the difference of these two terms, and the β-VAE objective was an extension of this with the hyperparameter β included. The inference error was calculated numerically using the modeled posterior distributions and estimating expectations from minibatches. In Fig. 3, we show results from another simulation consistent with our findings.