The generative process of a deep latent variable model entails drawing a number of latent factors from the prior and using a neural network to convert such factors to real data points. Maximum likelihood estimation of the parameters requires marginalizing out the latent factors, which is intractable for deep latent variable models. The influential work of kingma2013auto and rezende2014stochastic on Variational Autoencoders (VAEs) enables optimization of a tractable lower bound on the likelihood via a reparameterization of the Evidence Lower Bound (ELBO) (jordan1999introduction; blei2017variational). This has led to a surge of recent interest in automatic discovery of the latent factors of variation for a data distribution based on VAEs and principled probabilistic modeling (higgins2016beta; bowman2015generating; chen2018isolating; molauto18). (Code available at https://sites.google.com/view/dont-blame-the-elbo)
Unfortunately, the quality and the number of the latent factors learned is influenced by a phenomenon known as posterior collapse, where the generative model learns to ignore a subset of the latent variables. Most existing papers suggest that posterior collapse is caused by the KL-divergence term in the ELBO objective, which directly encourages the variational distribution to match the prior (bowman2015generating; kingma2016improved; sonderby2016ladder). Thus, a wide range of heuristic approaches in the literature have attempted to diminish the effect of the KL term in the ELBO to alleviate posterior collapse (bowman2015generating; razavi2018preventing; sonderby2016ladder; huang2018improving). While holding the KL term responsible for posterior collapse makes intuitive sense, the mathematical mechanism of this phenomenon is not well understood. In this paper, we investigate the connection between posterior collapse and spurious local maxima in the ELBO objective through the analysis of linear VAEs. Unexpectedly, we show that spurious local maxima may arise even in the optimization of exact marginal likelihood, and such local maxima are linked with a collapsed posterior.
While linear autoencoders (rumelhart1985learning) have been studied extensively (baldi1989neural; kunin2019loss), little attention has been given to their variational counterpart from a theoretical standpoint. A well-known relationship exists between linear autoencoders and PCA – the optimal solution of a linear autoencoder has decoder weight columns that span the same subspace as the one defined by the principal components (baldi1989neural). Similarly, the maximum likelihood solution of probabilistic PCA (pPCA) (tipping1999probabilistic) recovers the subspace of principal components. In this work, we show that a linear variational autoencoder can recover the solution of pPCA. In particular, by specifying a diagonal covariance structure on the variational distribution, one can recover an identifiable autoencoder, which at the global maximum of the ELBO recovers the exact principal components as the columns of the decoder’s weights. Importantly, we show that the ELBO objective for a linear VAE does not introduce any local maxima beyond the log marginal likelihood.
The study of linear VAEs gives us new insights into the cause of posterior collapse and the difficulty of VAE optimization more generally. Following the analysis of tipping1999probabilistic, we characterize the stationary points of pPCA and show that the variance of the observation model directly influences the stability of local stationary points corresponding to posterior collapse – it is only possible to escape these sub-optimal solutions by simultaneously reducing noise and learning better features. Our contributions include:
We verify that linear VAEs can recover the true posterior of pPCA. Further, we prove that the global optimum of the linear VAE recovers the principal components (not just their spanning sub-space). More importantly, we prove that using ELBO to train linear VAEs does not introduce any additional spurious local maxima relative to log marginal likelihood training.
While high-capacity decoders are often blamed for posterior collapse, we show that posterior collapse may occur when optimizing log marginal likelihood even without powerful decoders. Our experiments verify the analysis of the linear setting and show that these insights extend even to high-capacity non-linear VAEs. Specifically, we provide evidence that the observation noise in deep Gaussian VAEs plays a crucial role in overcoming local maxima corresponding to posterior collapse.
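The probabilistic PCA (pPCA) model is defined as follows. Suppose latent variables $\mathbf{z} \in \mathbb{R}^k$ generate data $\mathbf{x} \in \mathbb{R}^n$. A standard Gaussian prior is used for $\mathbf{z}$ and a linear generative model with a spherical Gaussian observation model for $\mathbf{x}$:
$$p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I}), \qquad p(\mathbf{x} \mid \mathbf{z}) = \mathcal{N}(\mathbf{W}\mathbf{z} + \boldsymbol{\mu}, \sigma^2 \mathbf{I}).$$
The pPCA model is a special case of factor analysis (bartholomew1987latent) that uses a spherical covariance $\sigma^2 \mathbf{I}$ instead of a full covariance matrix. As pPCA is fully Gaussian, both the marginal distribution for $\mathbf{x}$ and the posterior $p(\mathbf{z} \mid \mathbf{x})$ are Gaussian, and unlike factor analysis, the maximum likelihood estimates of $\mathbf{W}$ and $\sigma^2$ are tractable (tipping1999probabilistic).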
Recently, amortized variational inference has gained popularity as a means to learn complicated latent variable models. In these models, the log marginal likelihood, $\log p(\mathbf{x})$, is intractable but a variational distribution, denoted $q(\mathbf{z} \mid \mathbf{x})$, is used to approximate the posterior $p(\mathbf{z} \mid \mathbf{x})$, allowing tractable approximate inference using the Evidence Lower Bound (ELBO):
$$\log p(\mathbf{x}) \ge \mathbb{E}_{q(\mathbf{z} \mid \mathbf{x})}\left[\log p(\mathbf{x} \mid \mathbf{z})\right] - \mathrm{KL}\left(q(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z})\right).$$
The ELBO (jordan1999introduction; blei2017variational) consists of two terms, the KL divergence between the variational distribution, $q(\mathbf{z} \mid \mathbf{x})$, and prior, $p(\mathbf{z})$, and the expected conditional log-likelihood. The KL divergence forces the variational distribution towards the prior and so has reasonably been the focus of many attempts to alleviate posterior collapse. We hypothesize that the log marginal likelihood itself often encourages posterior collapse.
In Variational Autoencoders (VAEs), two neural networks are used to parameterize $q_\phi(\mathbf{z} \mid \mathbf{x})$ and $p_\theta(\mathbf{x} \mid \mathbf{z})$, where $\phi$ and $\theta$ denote two sets of neural network weights. The encoder maps an input to the parameters of the variational distribution, and then the decoder maps a sample from the variational distribution back to the inputs.
A dominant issue with VAE optimization is posterior collapse, in which the learned variational distribution is close to the prior. This reduces the capacity of the generative model, making it impossible for the decoder network to make use of the information content of all of the latent dimensions. While posterior collapse is widely acknowledged, formally defining it has remained a challenge. We introduce a formal definition in Section 6.2 which we use to measure posterior collapse in trained deep neural networks.
3 Related Work
dai2017hidden discuss the relationship between robust PCA methods (candes2011robust) and VAEs. They show that at stationary points the VAE objective locally aligns with pPCA under certain assumptions. We study the pPCA objective explicitly and show a direct correspondence with linear VAEs. dai2017hidden also showed that the covariance structure of the variational distribution may smooth out the loss landscape. This is an interesting result, and its interaction with ours is an exciting direction for future research.
he2018lagging motivate posterior collapse through an investigation of the learning dynamics of deep VAEs. They suggest that posterior collapse is caused by the inference network lagging behind the true posterior during the early stages of training. A related line of research studies issues arising from approximate inference causing a mismatch between the variational distribution and true posterior (cremer2018inference; kim2018semi; hjelm2016iterative). By contrast, we show that posterior collapse may exist even when the variational distribution matches the true posterior exactly.
alemi2017fixing used an information theoretic framework to study the representational properties of VAEs. They show that with infinite model capacity there are solutions with equal ELBO and log marginal likelihood which span a range of representations, including posterior collapse. We find that even with weak (linear) decoders, posterior collapse may occur. Moreover, we show that in the linear case this posterior collapse is due entirely to the log marginal likelihood.
The most common approach for dealing with posterior collapse is to anneal a weight on the KL term during training from 0 to 1 (bowman2015generating; sonderby2016ladder; maaloe2019biva; higgins2016beta; huang2018improving). Unfortunately, this means that during the annealing process, one is no longer optimizing a bound on the log-likelihood. Also, it is difficult to design these annealing schedules and we have found that once regular ELBO training resumes the posterior will typically collapse again (Section 6.2).
kingma2016improved propose a constraint on the KL term, termed "free-bits", where the gradient of the KL term per dimension is ignored if the KL is below a given threshold. Unfortunately, this method reportedly has some negative effects on training stability (razavi2018preventing; chen2016variational). Delta-VAEs (razavi2018preventing) instead choose prior and variational distributions such that the variational distribution can never exactly recover the prior, allocating free-bits implicitly. Several other papers have studied alternative formulations of the VAE objective (rezende2018taming; dai2018diagnosing; alemi2017fixing; ma2018mae; yeung2017tackling). dai2018diagnosing analyzed the VAE objective to improve image fidelity under Gaussian observation models and also discuss the importance of the observation noise. Other approaches have explored changing the VAE network architecture to help alleviate posterior collapse, for example by adding skip connections (maaloe2019biva; dieng2018avoiding).
rolinek2018variational observed that the diagonal covariance used in the variational distribution of VAEs encourages orthogonal representations. They use linearizations of deep networks to prove their results under a modification of the objective function by explicitly ignoring latent dimensions with posterior collapse. Our formulation is distinct in focusing on linear VAEs without modifying the objective function and proving an exact correspondence between the global solution of linear VAEs and the principal components.
kunin2019loss studied the optimization challenges in the linear autoencoder setting. They exposed an equivalence between pPCA and Bayesian autoencoders and point out that when $\sigma^2$ is too large information about the latent code is lost. A similar phenomenon is discussed in the supervised learning setting by chechik2005information. kunin2019loss also showed that suitable regularization allows the linear autoencoder to recover the principal components up to rotations. We show that linear VAEs with a diagonal covariance structure recover the principal components exactly.
4 Analysis of linear VAE
This section compares and analyzes the loss landscapes of both pPCA and linear variational autoencoders. We first discuss the stationary points of pPCA and then show that a simple linear VAE can recover the global optimum of pPCA. Moreover, when the data covariance eigenvalues are distinct, the linear VAE identifies the individual principal components, unlike pPCA, which recovers only the PCA subspace. Finally, we prove that ELBO does not introduce any additional spurious maxima to the loss landscape.
4.1 Probabilistic PCA Revisited
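The pPCA model (Eq. eqn:ppca) is a fully Gaussian linear model, thus we can compute both the marginal distribution for $\mathbf{x}$ and the posterior $p(\mathbf{z} \mid \mathbf{x})$ in closed form:
$$p(\mathbf{x}) = \mathcal{N}(\boldsymbol{\mu}, \mathbf{W}\mathbf{W}^T + \sigma^2 \mathbf{I}),$$
$$p(\mathbf{z} \mid \mathbf{x}) = \mathcal{N}(\mathbf{M}^{-1}\mathbf{W}^T(\mathbf{x} - \boldsymbol{\mu}), \sigma^2 \mathbf{M}^{-1}),$$
where $\mathbf{M} = \mathbf{W}^T\mathbf{W} + \sigma^2 \mathbf{I}$. This model is particularly interesting to analyze in the setting of variational inference, as the ELBO can also be computed in closed form (see Appendix C).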
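The closed forms for the pPCA marginal and posterior can be sanity-checked numerically. The sketch below assumes the standard pPCA expressions, marginal covariance $\mathbf{W}\mathbf{W}^T + \sigma^2\mathbf{I}$ and posterior $\mathcal{N}(\mathbf{M}^{-1}\mathbf{W}^T(\mathbf{x}-\boldsymbol{\mu}), \sigma^2\mathbf{M}^{-1})$ with $\mathbf{M} = \mathbf{W}^T\mathbf{W} + \sigma^2\mathbf{I}$, and verifies that they satisfy Bayes' rule at an arbitrary point. The sizes, the noise level, and the helper `gauss_logpdf` are illustrative choices, not taken from the paper.

```python
import numpy as np

def gauss_logpdf(x, mean, cov):
    """Log-density of a multivariate Gaussian N(mean, cov) evaluated at x."""
    d = x - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d @ np.linalg.solve(cov, d) + logdet + len(x) * np.log(2 * np.pi))

rng = np.random.default_rng(0)
n, k, sigma2 = 5, 2, 0.3          # illustrative sizes and noise level
W = rng.normal(size=(n, k))
mu = rng.normal(size=n)
x = rng.normal(size=n)            # arbitrary data point
z = rng.normal(size=k)            # arbitrary latent point

# Marginal: p(x) = N(mu, W W^T + sigma2 I)
C = W @ W.T + sigma2 * np.eye(n)
log_px = gauss_logpdf(x, mu, C)

# Posterior: p(z|x) = N(M^{-1} W^T (x - mu), sigma2 M^{-1}), M = W^T W + sigma2 I
M = W.T @ W + sigma2 * np.eye(k)
post_mean = np.linalg.solve(M, W.T @ (x - mu))
post_cov = sigma2 * np.linalg.inv(M)

# Bayes' rule: log p(x|z) + log p(z) - log p(z|x) - log p(x) should vanish.
log_joint = (gauss_logpdf(x, W @ z + mu, sigma2 * np.eye(n))
             + gauss_logpdf(z, np.zeros(k), np.eye(k)))
bayes_gap = log_joint - gauss_logpdf(z, post_mean, post_cov) - log_px
```

The residual `bayes_gap` should be zero up to floating-point error for any choice of `x` and `z`.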
Stationary points of pPCA
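We now characterize the stationary points of pPCA, largely repeating the thorough analysis of tipping1999probabilistic (see Appendix A of their paper). The maximum likelihood estimate of $\boldsymbol{\mu}$ is the mean of the data. We can compute $\mathbf{W}_{MLE}$ and $\sigma^2_{MLE}$ as follows:
$$\sigma^2_{MLE} = \frac{1}{n-k}\sum_{j=k+1}^{n} \lambda_j,$$
$$\mathbf{W}_{MLE} = \mathbf{U}_k(\mathbf{\Lambda}_k - \sigma^2_{MLE}\mathbf{I})^{1/2}\mathbf{R}.$$
Here $\mathbf{U}_k$ corresponds to the first $k$ principal components of the data with the corresponding eigenvalues $\lambda_1, \dots, \lambda_k$ stored in the $k \times k$ diagonal matrix $\mathbf{\Lambda}_k$. The matrix $\mathbf{R}$ is an arbitrary rotation matrix which accounts for weak identifiability in the model. We can interpret $\sigma^2_{MLE}$ as the average variance lost in the projection. The MLE solution is the global optimum. Other stationary points correspond to zeroing out columns of $\mathbf{W}$ (posterior collapse).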
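As a minimal sketch of this closed-form estimate, the function below follows the standard pPCA solution of tipping1999probabilistic with the rotation taken to be the identity: the noise variance is the mean of the discarded eigenvalues and the decoder columns are the leading eigenvectors scaled by $(\lambda_j - \sigma^2)^{1/2}$. The function name `ppca_mle` and the synthetic data are hypothetical.

```python
import numpy as np

def ppca_mle(X, k):
    """Closed-form pPCA maximum likelihood estimate, taking the rotation R = I."""
    mu = X.mean(axis=0)
    S = np.cov(X, rowvar=False, bias=True)        # sample covariance (1/N normalization)
    eigvals, eigvecs = np.linalg.eigh(S)          # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]             # re-sort to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    sigma2 = eigvals[k:].mean()                   # average variance of discarded directions
    W = eigvecs[:, :k] * np.sqrt(eigvals[:k] - sigma2)
    return mu, W, sigma2

rng = np.random.default_rng(0)
# Synthetic data with a decaying per-coordinate scale (illustrative only).
X = rng.normal(size=(500, 6)) * np.array([3.0, 2.0, 1.5, 0.5, 0.4, 0.3])
mu, W, sigma2 = ppca_mle(X, k=2)
gram = W.T @ W                                    # diagonal, since R = I
```

With the rotation fixed to the identity, `gram` is diagonal, which is the property the linear VAE analysis later relies on.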
In this section we consider $\sigma^2$ to be fixed and not necessarily equal to the MLE solution. Equation 8 remains a stationary point when the general $\sigma^2$ is swapped in. One surprising observation is that $\sigma^2$ directly controls the stability of the stationary points of the log marginal likelihood (see Appendix A). In Figure 1, we illustrate one such stationary point of pPCA for different values of $\sigma^2$. We computed this stationary point by taking $\mathbf{W}$ to have three principal component columns and zeros elsewhere. Each plot shows the same stationary point perturbed by two orthogonal vectors corresponding to other principal components.
The stability of the pPCA stationary points depends on the size of $\sigma^2$: as $\sigma^2$ increases the stationary point tends towards a stable local maximum so that we cannot learn the additional components. Intuitively, the model prefers to explain deviations in the data with the larger observation noise. Fortunately, decreasing $\sigma^2$ will increase likelihood at these stationary points so that when learning $\sigma^2$ simultaneously these stationary points are saddle points (tipping1999probabilistic). Therefore, learning $\sigma^2$ is necessary for gaining a full latent representation.
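The dependence of stability on the observation noise can be illustrated on a toy problem. The sketch below assumes a diagonal sample covariance with a made-up spectrum, builds a stationary point with one learned column and one collapsed (zero) column, and nudges the collapsed column along an unused eigenvector. The helper names and all numeric values are hypothetical.

```python
import numpy as np

def avg_loglik(W, sigma2, S):
    """Per-example pPCA log marginal likelihood given sample covariance S."""
    n = S.shape[0]
    C = W @ W.T + sigma2 * np.eye(n)
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + np.trace(np.linalg.solve(C, S)))

eigvals = np.array([4.0, 2.0, 1.0, 0.5])   # illustrative spectrum
U = np.eye(4)                              # eigenvectors of S (S is diagonal here)
S = np.diag(eigvals)

def escape_gain(sigma2, j=1, eps=1e-2):
    """Log-likelihood change when a collapsed column is nudged along eigenvector u_j."""
    W = np.zeros((4, 2))
    W[:, 0] = U[:, 0] * np.sqrt(eigvals[0] - sigma2)   # learned leading component
    W_pert = W.copy()
    W_pert[:, 1] = eps * U[:, j]                       # perturb the zero column
    return avg_loglik(W_pert, sigma2, S) - avg_loglik(W, sigma2, S)

gain_small_noise = escape_gain(sigma2=1.0)   # sigma2 below the eigenvalue 2.0
gain_large_noise = escape_gain(sigma2=3.0)   # sigma2 above the eigenvalue 2.0
```

In this toy setting the perturbation raises the likelihood only when the fixed $\sigma^2$ is smaller than the eigenvalue of the direction being added, so for large $\sigma^2$ the collapsed column sits at a local maximum.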
4.2 Linear VAEs recover pPCA
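We now show that linear VAEs can recover the globally optimal solution to Probabilistic PCA. We will consider the following VAE model,
$$p(\mathbf{x} \mid \mathbf{z}) = \mathcal{N}(\mathbf{W}\mathbf{z} + \boldsymbol{\mu}, \sigma^2 \mathbf{I}), \qquad q(\mathbf{z} \mid \mathbf{x}) = \mathcal{N}(\mathbf{V}(\mathbf{x} - \boldsymbol{\mu}), \mathbf{D}),$$
where $\mathbf{D}$ is a diagonal covariance matrix, used globally for all of the data points. While this is a significant restriction compared to typical VAE architectures, which define an amortized variance for each input point, this is sufficient to recover the global optimum of the probabilistic model.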
Lemma 1. The global maximum of the ELBO objective (Eq. eqn:elbo) for the linear VAE (Eq. eqn:linear_vae) is identical to the global maximum for the log marginal likelihood of pPCA (Eq. eqn:ppca_marginal).
Proof. Note that the global optimum of pPCA is defined up to an orthogonal transformation of the columns of $\mathbf{W}$, i.e., any rotation $\mathbf{R}$ in Eq. eqn:ppca_wmle results in a matrix $\mathbf{W}_{MLE}$ that given $\sigma^2_{MLE}$ attains maximum marginal likelihood. The linear VAE model defined in Eq. eqn:linear_vae is able to recover the global optimum of pPCA when $\mathbf{R} = \mathbf{I}$. Recall from Eq. eqn:ppca_posterior that $p(\mathbf{z} \mid \mathbf{x})$ is defined in terms of $\mathbf{M} = \mathbf{W}^T\mathbf{W} + \sigma^2\mathbf{I}$. When $\mathbf{R} = \mathbf{I}$, we obtain $\mathbf{M} = \mathbf{W}_{MLE}^T\mathbf{W}_{MLE} + \sigma^2_{MLE}\mathbf{I} = \mathbf{\Lambda}_k$, which is diagonal. Thus, setting $\mathbf{V} = \mathbf{M}^{-1}\mathbf{W}_{MLE}^T$ and $\mathbf{D} = \sigma^2_{MLE}\mathbf{M}^{-1} = \sigma^2_{MLE}\mathbf{\Lambda}_k^{-1}$ recovers the true posterior with diagonal covariance at the global optimum. In this case, the ELBO equals the log marginal likelihood and is maximized when the decoder has weights $\mathbf{W} = \mathbf{W}_{MLE}$. Because the ELBO lower bounds log-likelihood, the global maximum of the ELBO for the linear VAE is the same as the global maximum of the marginal likelihood for pPCA. ∎
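The tightness argument in this proof can be checked numerically. The sketch below assumes a linear VAE with encoder mean $\mathbf{V}(\mathbf{x}-\boldsymbol{\mu})$ and a global diagonal covariance, and uses the standard Gaussian closed forms for the KL and expected log-likelihood terms. It builds a decoder with orthogonal columns, sets the encoder to $\mathbf{M}^{-1}\mathbf{W}^T$ and the covariance to $\sigma^2\mathbf{M}^{-1}$, and compares the ELBO with the log marginal likelihood. The function names and the eigenvalues `lam` are hypothetical.

```python
import numpy as np

def log_marginal(d, W, sigma2):
    """log N(d; 0, W W^T + sigma2 I) for a centred data point d = x - mu."""
    n = W.shape[0]
    C = W @ W.T + sigma2 * np.eye(n)
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (d @ np.linalg.solve(C, d) + logdet + n * np.log(2 * np.pi))

def analytic_elbo(d, W, V, D_diag, sigma2):
    """Closed-form ELBO for q(z|x) = N(V d, diag(D_diag)), p(x|z) = N(W z + mu, sigma2 I)."""
    n, k = W.shape
    m = V @ d
    kl = 0.5 * (m @ m + np.sum(D_diag) - np.sum(np.log(D_diag)) - k)
    resid = W @ m - d
    trace_term = np.sum(D_diag * np.sum(W**2, axis=0))     # tr(W D W^T)
    exp_ll = (-(trace_term + resid @ resid) / (2 * sigma2)
              - 0.5 * n * np.log(2 * np.pi * sigma2))
    return exp_ll - kl

rng = np.random.default_rng(0)
n, k, sigma2 = 6, 3, 0.5
U, _ = np.linalg.qr(rng.normal(size=(n, k)))   # orthonormal directions
lam = np.array([4.0, 2.5, 1.5])                # hypothetical leading eigenvalues
W = U * np.sqrt(lam - sigma2)                  # decoder with orthogonal columns (R = I)
M_diag = np.sum(W**2, axis=0) + sigma2         # M = W^T W + sigma2 I is diagonal
V = (W / M_diag).T                             # V = M^{-1} W^T
D_diag = sigma2 / M_diag                       # D = sigma2 M^{-1}
d = rng.normal(size=n)

gap = log_marginal(d, W, sigma2) - analytic_elbo(d, W, V, D_diag, sigma2)
gap_bad_encoder = log_marginal(d, W, sigma2) - analytic_elbo(d, W, 0.5 * V, D_diag, sigma2)
```

Here `gap` vanishes up to floating-point error, while a mis-scaled encoder such as `0.5 * V` leaves a strictly positive gap.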
The result of Lemma 1 is somewhat expected because the posterior of pPCA is Gaussian. Further details are given in Appendix C. In addition, we prove a more surprising result that suggests restricting the variational distribution to a Gaussian with a diagonal covariance structure allows one to identify the principal components at the global optimum of ELBO.
Corollary 1. The global maximum of the ELBO objective (Eq. eqn:elbo) for the linear VAE (Eq. eqn:linear_vae) has the scaled principal components as the columns of the decoder network.
Proof. Follows directly from the proof of Lemma 1 and Eq. eqn:ppca_wmle. ∎
We discuss this result in Appendix B. This full identifiability is non-trivial and is not achieved even with the regularized linear autoencoder (kunin2019loss).
So far, we have shown that at its global optimum the linear VAE recovers the pPCA solution, which enforces orthogonality of the decoder weight columns. However, the VAE is trained with the ELBO rather than the log marginal likelihood — often using SGD. The majority of existing work suggests that the KL term in the ELBO objective is responsible for posterior collapse. So, we should ask whether this term introduces additional spurious local maxima. Surprisingly, for the linear VAE model the ELBO objective does not introduce any additional spurious local maxima. We provide a sketch of the proof below with full details in Appendix C.
Theorem 1. The ELBO objective for a linear VAE does not introduce any additional local maxima to the pPCA model.
Proof (Sketch). If the decoder has orthogonal columns, then the variational distribution recovers the true posterior at stationary points. Thus, the variational objective will exactly recover the log marginal likelihood. If the decoder does not have orthogonal columns then the variational distribution is no longer tight. However, the ELBO can always be increased by applying an infinitesimal rotation to the right-singular vectors of the decoder towards identity: $\mathbf{W}' = \mathbf{W}\mathbf{R}_\epsilon$ (so that the decoder columns are closer to orthogonal). This works because the variational distribution can fit the posterior more closely while the log marginal likelihood is invariant to rotations of the weight columns. Thus, any additional stationary points in the ELBO objective must necessarily be saddle points. ∎
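The two facts used in this sketch, rotation invariance of the marginal likelihood and a looser bound for non-orthogonal decoder columns, can be illustrated numerically. The code below assumes the encoder mean matches the true posterior mean and the diagonal covariance is the best diagonal fit, $\sigma^2(\mathrm{diag}(\mathbf{W}^T\mathbf{W}) + \sigma^2\mathbf{I})^{-1}$. Under those assumptions the posterior KL works out to $\tfrac{1}{2}(\log\det\tilde{\mathbf{M}} - \log\det\mathbf{M})$, where $\tilde{\mathbf{M}}$ keeps only the diagonal of $\mathbf{M}$. The matrices, the rotation angle, and the helper name are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, sigma2 = 6, 3, 0.5
U, _ = np.linalg.qr(rng.normal(size=(n, k)))
W = U * np.array([2.0, 1.5, 1.0])          # decoder with orthogonal columns

def posterior_kl_gap(W):
    """0.5 * (log det M_tilde - log det M): KL(q || true posterior) under the
    best diagonal covariance and an exact posterior mean (assumed setting)."""
    M = W.T @ W + sigma2 * np.eye(W.shape[1])
    M_tilde = np.diag(np.diag(M))
    return 0.5 * (np.linalg.slogdet(M_tilde)[1] - np.linalg.slogdet(M)[1])

# Rotate the decoder columns by a planar rotation R (angle chosen arbitrarily).
c, s = np.cos(0.3), np.sin(0.3)
R = np.eye(k)
R[:2, :2] = [[c, -s], [s, c]]
W_rot = W @ R

same_marginal = np.allclose(W_rot @ W_rot.T, W @ W.T)   # p(x) depends on W only via W W^T
gap_orth = posterior_kl_gap(W)       # tight bound for orthogonal columns
gap_rot = posterior_kl_gap(W_rot)    # strictly positive after rotation
```

Because the marginal covariance is unchanged by the rotation, any reduction of `gap_rot` obtained by rotating back towards orthogonal columns translates directly into a higher ELBO.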
The theoretical results presented in this section provide new intuition for posterior collapse in VAEs. In particular, the KL between the variational distribution and the prior is not entirely responsible for posterior collapse — log marginal likelihood has a role. The evidence for this is two-fold. We have shown that log marginal likelihood may have spurious local maxima but also that in the linear case the ELBO objective does not add any additional spurious local maxima. Rephrased, in the linear setting the problem lies entirely with the probabilistic model. We should then ask, to what extent do these results hold in the non-linear setting?
5 Deep Gaussian VAEs
The deep Gaussian VAE consists of a decoder $D_\theta$ and an encoder $E_\phi$. The ELBO objective can be expressed as,
$$\mathcal{L}(\theta, \phi; \mathbf{x}) = -\mathrm{KL}\left(q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z})\right) - \frac{1}{2\sigma^2}\mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}\left[\|D_\theta(\mathbf{z}) - \mathbf{x}\|^2\right] - \frac{n}{2}\log 2\pi\sigma^2.$$
The role of $\sigma^2$ in this objective invites a natural comparison to the $\beta$-VAE objective (higgins2016beta), where the KL term is weighted by $\beta$. alemi2017fixing propose using small $\beta$ values to force powerful decoders to utilize the latent variables, but this comes at the cost of poor ELBO. Practitioners must then use downstream task performance for model selection, thus sacrificing one of the primary benefits of likelihood-based models. However, for a given $\beta$, one can find a corresponding $\sigma^2$ (and a learning rate) such that the gradient updates to the network parameters are identical. Importantly, the Gaussian partition function for a Gaussian observation model (the last term on the RHS of Eq. eqn:gaussian_elbo) prevents ELBO from deviating from the $\beta$-VAE’s objective with a $\beta$-weighted KL term while maintaining the benefits to representation learning when $\sigma^2$ is small. For the Gaussian VAE, this helps connect the dots between the role of local maxima and observation noise in posterior collapse vs. heuristic approaches that attempted to alleviate posterior collapse by diminishing the effect of the KL term (bowman2015generating; razavi2018preventing; sonderby2016ladder; huang2018improving). In the following section, we will study the nonlinear VAE empirically and explore connections to the linear theory.
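The correspondence between the observation noise and the KL weight can be written out explicitly. The derivation below is a sketch under stated assumptions: the Gaussian ELBO has the usual squared-error form with a fixed $\sigma^2$, the $\beta$-VAE objective is written with unit observation variance, and the symbols $D_\theta$ and $q_\phi$ for decoder and encoder are notational assumptions.

```latex
\begin{align*}
\mathcal{L}_{\sigma^2}(\theta,\phi)
  &= -\mathrm{KL}\!\left(q_\phi(z \mid x)\,\|\,p(z)\right)
     - \frac{1}{2\sigma^2}\,\mathbb{E}_{q_\phi(z \mid x)}\!\left[\|D_\theta(z) - x\|^2\right]
     - \frac{n}{2}\log 2\pi\sigma^2, \\
\mathcal{L}_{\beta}(\theta,\phi)
  &= -\beta\,\mathrm{KL}\!\left(q_\phi(z \mid x)\,\|\,p(z)\right)
     - \frac{1}{2}\,\mathbb{E}_{q_\phi(z \mid x)}\!\left[\|D_\theta(z) - x\|^2\right]
     - \frac{n}{2}\log 2\pi, \\
\nabla_{\theta,\phi}\,\mathcal{L}_{\sigma^2}
  &= \frac{1}{\sigma^2}\,\nabla_{\theta,\phi}\,\mathcal{L}_{\beta}
  \qquad \text{when } \beta = \sigma^2 .
\end{align*}
```

Under these assumptions, plain gradient ascent on $\mathcal{L}_{\sigma^2}$ with learning rate $\eta$ produces the same parameter updates as gradient ascent on $\mathcal{L}_{\beta}$ with $\beta = \sigma^2$ and learning rate $\eta/\sigma^2$. The $\log 2\pi\sigma^2$ term does not enter these gradients, but it is what keeps $\mathcal{L}_{\sigma^2}$ a valid lower bound on the log-likelihood.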
In this section, we present empirical evidence found from studying two distinct claims. First, we verify our theoretical analysis of the linear VAE model. Second, we explore to what extent these insights apply to deep nonlinear VAEs.
6.1 Linear VAEs
We ran two sets of experiments on 1000 randomly chosen MNIST images. First, we trained linear VAEs with learnable $\sigma^2$ for a range of hidden dimensions (footnote: the VAEs were trained using the analytic ELBO, Appendix C.1, and without mini-batching gradients). For each model, we compared the final ELBO to the maximum likelihood of pPCA, finding them to be essentially indistinguishable (as predicted by Lemma 1 and Theorem 1). For the second set of experiments, we took the pPCA MLE solution for $\mathbf{W}$ for each number of hidden dimensions and computed the likelihood under the observation noise $\sigma^2$ which maximizes likelihood for 50 hidden dimensions. We observed that adding additional principal components (after 50) will initially improve likelihood but eventually adding more components (after 200) actually decreases the likelihood. In other words, the collapsed solution is actually preferred if the observation noise is not set correctly; we observe this theoretically through the stability of the stationary points (e.g. Figure 1).
Effect of stochastic ELBO estimates
In general, we are unable to compute the ELBO in closed form and so instead rely on unbiased Monte Carlo estimates using the reparameterization trick. These estimates add high-variance noise and can make optimization more challenging (kingma2013auto). In the linear model, we can compare the solutions obtained using the stochastic ELBO gradients versus the analytic ELBO (Figure 3). (Footnote: we use 1000 MNIST images, as before, to enable full-batch training so that the only source of noise is from the reparameterization trick (kingma2013auto).) Additional experimental details are in Appendix E. We found that stochastic optimization had slower convergence (when compared to analytic training with the same learning rate) and, unsurprisingly, reached a worse final training ELBO value (in other words, worse steady-state risk due to the gradient variance).
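To make the comparison between stochastic and analytic estimates concrete, the sketch below evaluates the expected log-likelihood term of a linear Gaussian model in two ways: in closed form, and by reparameterized sampling $\mathbf{z} = \mathbf{m} + \mathbf{D}^{1/2}\boldsymbol{\epsilon}$. The model sizes, parameter values, and sample count are arbitrary illustrative choices, and the closed form is the standard Gaussian expectation of a quadratic rather than a quotation of the paper's appendix.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, sigma2 = 5, 2, 0.5
W = rng.normal(size=(n, k))           # decoder weights (arbitrary)
V = 0.3 * rng.normal(size=(k, n))     # encoder weights (arbitrary)
D_diag = np.array([0.4, 0.7])         # diagonal of the variational covariance
d = rng.normal(size=n)                # centred data point x - mu

m = V @ d                             # variational mean
const = -0.5 * n * np.log(2 * np.pi * sigma2)

# Closed form: E||W z - d||^2 = ||W m - d||^2 + tr(W D W^T) for z ~ N(m, D).
resid = W @ m - d
analytic = -(np.sum(D_diag * np.sum(W**2, axis=0)) + resid @ resid) / (2 * sigma2) + const

# Reparameterized Monte Carlo estimate: z = m + sqrt(D) * eps, eps ~ N(0, I).
num_samples = 20000
noise = rng.standard_normal((num_samples, k))
z = m + np.sqrt(D_diag) * noise
err = z @ W.T - d
per_sample = -np.sum(err**2, axis=1) / (2 * sigma2) + const
mc_mean = per_sample.mean()
mc_stderr = per_sample.std(ddof=1) / np.sqrt(num_samples)
```

The Monte Carlo mean agrees with the closed form up to sampling error, while `per_sample.std()` gives a sense of how noisy a single-sample estimate is.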
With a linear decoder and nonlinear encoder, Lemma 1 still holds: the optimal variational distribution is unchanged because the true posterior has not changed. However, Corollary 1 and Theorem 1 no longer hold in general. Even a deep linear encoder will not have a unique global maximum and new stationary points (possibly maxima) may be introduced to ELBO in general. To investigate how deeper networks may impact optimization of the probabilistic model, we trained linear decoders with varying encoders using ELBO. We do not expect the linear encoder to be outperformed and indeed the empirical results support this (Figure 4).
6.2 Investigating posterior collapse in deep nonlinear VAEs
We explored how the analysis of the linear VAEs extends to deep nonlinear models. To do so, we trained VAEs with Gaussian observation models on the MNIST (lecun1998mnist) and CelebA (liu2015faceattributes) datasets. We apply uniform dequantization as in NIPS2017_6828 in each case. We also adopt the nonlinear logit preprocessing transformation from NIPS2017_6828 to provide fair comparisons with existing work. We also report results of models trained directly in pixel space in the appendix (there is no significant difference for the hypotheses we test).
Measuring posterior collapse
In order to measure the extent of posterior collapse, we introduce the following definition. We say that latent dimension $i$ has $(\epsilon, \delta)$-collapsed if $\mathbb{P}_{\mathbf{x} \sim p}\left[\mathrm{KL}\left(q(z_i \mid \mathbf{x}) \,\|\, p(z_i)\right) < \epsilon\right] \ge 1 - \delta$. Note that the linear VAE can suffer $(\epsilon, \delta)$-collapse. To estimate this practically, we compute the proportion of data samples which induce a variational distribution with KL divergence less than $\epsilon$ and finally report the percentage of dimensions which have $(\epsilon, \delta)$-collapsed. Throughout this work, we fix $\delta$ and vary $\epsilon$.
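The practical estimate described above can be sketched as follows. The code reads the definition as: a latent dimension counts as collapsed when at least a $1 - \delta$ share of data points has a per-dimension KL to the standard normal prior below $\epsilon$. This reading, the function names, the synthetic encoder outputs, and the thresholds used in the example are all assumptions for illustration.

```python
import numpy as np

def gaussian_kl_per_dim(mean, var):
    """KL(N(mean, var) || N(0, 1)) for each example and latent dimension.

    `mean` and `var` have shape (num_examples, num_latents)."""
    return 0.5 * (var + mean**2 - 1.0 - np.log(var))

def collapsed_fraction(kl, eps, delta):
    """Fraction of latent dimensions whose KL is below eps on at least a
    (1 - delta) share of the examples."""
    share_below = (kl < eps).mean(axis=0)          # per-dimension share of examples
    return float((share_below >= 1.0 - delta).mean())

rng = np.random.default_rng(0)
num_examples = 1000
# Hypothetical encoder outputs: dims 0-1 are informative, dims 2-3 equal the prior.
mean = np.concatenate([2.0 * rng.normal(size=(num_examples, 2)),
                       np.zeros((num_examples, 2))], axis=1)
var = np.concatenate([np.full((num_examples, 2), 0.1),
                      np.ones((num_examples, 2))], axis=1)

kl = gaussian_kl_per_dim(mean, var)
frac = collapsed_fraction(kl, eps=0.01, delta=0.01)   # illustrative thresholds
```

In this synthetic example exactly the two prior-matching dimensions are flagged, so `frac` is 0.5.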
We trained MNIST VAEs with 2 hidden layers in both the decoder and encoder, ReLU activations, and 200 latent dimensions. We first evaluated training with fixed values of the observation noise, $\sigma^2$. This mirrors many public VAE implementations where $\sigma^2$ is fixed to 1 throughout training (also observed by dai2018diagnosing). However, our linear analysis suggests that this is suboptimal. Then, we consider the setting where the observation noise and VAE weights are learned simultaneously.
In Table 1 we report the final ELBO of nonlinear VAEs trained on real-valued MNIST. For fixed $\sigma^2$, we found that the final models could have significant differences in ELBO which were maintained even after tuning $\sigma^2$ to the learned representations: the converged representations are worse when $\sigma^2$ is too large, as predicted by the linear model. Additionally, we report the final ELBO values when the model is trained while learning $\sigma^2$ with different initial values of $\sigma^2$. The gap in performance across different initializations is smaller than for fixed $\sigma^2$ but is still significant. The linear VAE does not predict this gap, which suggests that learning $\sigma^2$ correctly is more challenging in the nonlinear case.
Despite the large volume of work studying posterior collapse, it has not been measured (or even defined) in a consistent way. In Figure 5 and Figure 6 we measure posterior collapse for trained networks as described above (we chose a fixed $\delta$). By considering a range of $\epsilon$ values we found this was (moderately) robust to stochasticity in data preprocessing. We observed that for large choices of $\sigma^2$ initialization the variational distribution matches the prior closely. This was true even when $\sigma^2$ is learned, suggesting that local optima may contribute to posterior collapse in deep VAEs.
We trained deep convolutional VAEs with 500 hidden dimensions on images from the CelebA dataset (resized to 64x64). We trained the CelebA VAEs with different fixed values of $\sigma^2$ and compared the ELBO before and after tuning $\sigma^2$ to the learned representations (Table 1). Further, we explored training the CelebA VAE while learning $\sigma^2$ over varied initializations of the observation noise. The VAE is sensitive to the initialization of the observation noise even when $\sigma^2$ is learned (in particular, in terms of the number of collapsed dimensions).
By analyzing the correspondence between linear VAEs and pPCA, this paper makes significant progress towards understanding the causes of posterior collapse. We show that for simple linear VAEs posterior collapse is caused by ill-conditioning of the stationary points in the log marginal likelihood objective. We demonstrate empirically that the same optimization issues play a role in deep non-linear VAEs. Finally, we find that linear VAEs are useful theoretical test-cases for evaluating existing hypotheses on VAEs and we encourage researchers to consider studying their hypotheses in the linear VAE setting.
This work was guided by many conversations with and feedback from our colleagues. In particular, we thank Durk Kingma, Alex Alemi, and Guodong Zhang for invaluable feedback on early versions of this work.
Appendix A Stationary points of pPCA
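Here we briefly summarize the analysis of tipping1999probabilistic with some simple additional observations. We recommend that interested readers study Appendix A of tipping1999probabilistic for the full details. We begin by formulating the conditions for stationary points of the log marginal likelihood $\sum_i \log p(\mathbf{x}_i)$:
$$\mathbf{S}\mathbf{C}^{-1}\mathbf{W} = \mathbf{W},$$
where $\mathbf{S}$ denotes the sample covariance matrix (assuming we set $\boldsymbol{\mu} = \boldsymbol{\mu}_{MLE}$, which we do throughout), and $\mathbf{C} = \mathbf{W}\mathbf{W}^T + \sigma^2\mathbf{I}$ (note that the dimensionality is different to $\mathbf{M}$). There are three possible solutions to this equation, (1) $\mathbf{W} = \mathbf{0}$, (2) $\mathbf{C} = \mathbf{S}$, or (3) the more general solutions. (1) and (2) are not particularly interesting to us, so we focus herein on (3).
We can write $\mathbf{W} = \mathbf{U}\mathbf{L}\mathbf{R}^T$ using its singular value decomposition. Substituting back into the stationary points equation, we recover the following:
$$\mathbf{S}\mathbf{U}\mathbf{L} = \mathbf{U}(\sigma^2\mathbf{I} + \mathbf{L}^2)\mathbf{L}.$$
Noting that $\mathbf{L}$ is diagonal, if the $j$-th singular value ($l_j$) is non-zero, this gives $\mathbf{S}\mathbf{u}_j = (\sigma^2 + l_j^2)\mathbf{u}_j$, where $\mathbf{u}_j$ is the $j$-th column of $\mathbf{U}$. Thus, $\mathbf{u}_j$ is an eigenvector of $\mathbf{S}$ with eigenvalue $\lambda_j = \sigma^2 + l_j^2$. For $l_j = 0$, $\mathbf{u}_j$ is arbitrary.
Thus, all potential solutions can be written as, $\mathbf{W} = \mathbf{U}_q(\mathbf{K}_q - \sigma^2\mathbf{I})^{1/2}\mathbf{R}$, with the diagonal entries of $\mathbf{K}_q$ written as $k_j = \sigma^2$ or $k_j = \sigma^2 + l_j^2$ and with $\mathbf{R}$ representing an arbitrary orthogonal matrix.
From this formulation, one can show that the global optimum is attained with $\sigma^2 = \sigma^2_{MLE}$ and $\mathbf{U}_q$ and $\mathbf{K}_q$ chosen to match the leading singular vectors and values of $\mathbf{S}$.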
A.1 Stability of stationary point solutions
Consider stationary points of the form, $\mathbf{W} = \mathbf{U}_q(\mathbf{K}_q - \sigma^2\mathbf{I})^{1/2}$, where $\mathbf{U}_q$ contains arbitrary eigenvectors of $\mathbf{S}$. In the original pPCA paper they show that all solutions except the leading principal components correspond to saddle points in the optimization landscape. However, this analysis depends critically on $\sigma^2$ being set to the true maximum likelihood estimate. Here we repeat their analysis, considering other (fixed) values of $\sigma^2$.
We consider a small perturbation to a column of $\mathbf{W}$, of the form $\epsilon\mathbf{u}_j$. To analyze the stability of the perturbed solution, we check the sign of the dot-product of the perturbation with the likelihood gradient at $\mathbf{w}_i + \epsilon\mathbf{u}_j$. Ignoring terms in $\epsilon^2$ we can write the dot-product as,
$$\epsilon N\left(\lambda_j / k_i - 1\right)\mathbf{u}_j^T\mathbf{C}^{-1}\mathbf{u}_j,$$
where $N$ is the number of data points.
Now, $\mathbf{C}^{-1}$ is positive definite and so the sign depends only on $\lambda_j / k_i - 1$. The stationary point is stable (a local maximum) only if the sign is negative. If $k_i = \lambda_i$ then the maximum is stable only when $\lambda_i > \lambda_j$; in words, the top $q$ principal components are stable. However, we must also consider the case $k_i = \sigma^2$. tipping1999probabilistic show that if $\sigma^2 = \sigma^2_{MLE}$, then this also corresponds to a saddle point as $\sigma^2$ is the average of the smallest eigenvalues, meaning some perturbation will be unstable (except in a special case which is handled separately).
However, what happens if $\sigma^2$ is not set to be the maximum likelihood estimate? In this case, it is possible that there are no unstable perturbation directions (that is, $\lambda_j < \sigma^2$ for too many $j$). In this case, when $\sigma^2$ is fixed, there are local optima where $\mathbf{W}$ has zero-columns, the same solutions that we observe in non-linear VAEs corresponding to posterior collapse. Note that when $\sigma^2$ is learned in non-degenerate cases the local maxima presented above become saddle points where $\sigma^2$ is made smaller by its gradient. In practice, we find that even when $\sigma^2$ is learned in the non-linear case local maxima exist.
Appendix B Identifiability of the linear VAE
Linear autoencoders suffer from a lack of identifiability which causes the decoder columns to span the principal component subspace instead of recovering it. kunin2019loss showed that adding regularization to the linear autoencoder improves the identifiability — forcing the columns to be identified up to an arbitrary orthogonal transformation, as in pPCA. Here we show that linear VAEs are able to fully identify the principal components.
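We once again consider the linear VAE from Eq. eqn:linear_vae:
$$p(\mathbf{x} \mid \mathbf{z}) = \mathcal{N}(\mathbf{W}\mathbf{z} + \boldsymbol{\mu}, \sigma^2 \mathbf{I}), \qquad q(\mathbf{z} \mid \mathbf{x}) = \mathcal{N}(\mathbf{V}(\mathbf{x} - \boldsymbol{\mu}), \mathbf{D}).$$
The output of the VAE, $\tilde{\mathbf{x}}$, is distributed as,
$$\tilde{\mathbf{x}} \mid \mathbf{x} \sim \mathcal{N}(\mathbf{W}\mathbf{V}(\mathbf{x} - \boldsymbol{\mu}) + \boldsymbol{\mu}, \mathbf{W}\mathbf{D}\mathbf{W}^T).$$
Therefore, the output of the linear VAE is invariant to the following transformation:
$$\mathbf{W} \leftarrow \mathbf{W}\mathbf{A}, \qquad \mathbf{V} \leftarrow \mathbf{A}^{-1}\mathbf{V}, \qquad \mathbf{D} \leftarrow \mathbf{A}^{-1}\mathbf{D}\mathbf{A}^{-1},$$
where $\mathbf{A}$ is a diagonal matrix with non-zero entries so that $\mathbf{D}$ is well-defined. However, this transformation changes the variational distribution which affects the loss through the KL term. As argued in Corollary 1, this means that the global optimum is unique for ELBO up to ordering of the eigenvalues/eigenvectors.
At the global optimum, the ordering can be recovered by computing the squared Euclidean norm of the columns of $\mathbf{W}$ (which correspond to the singular values) and ordering according to these quantities. In other words, $\mathbf{R}$ is a permutation matrix which can be computed exactly.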
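The ordering step can be sketched in a few lines. The example assumes the decoder at the global optimum holds scaled principal components whose squared column norms equal $\lambda_j - \sigma^2$, as in the pPCA solution, but stored in an arbitrary column order. The eigenvalues, the permutation, and the variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
eigvals = np.array([5.0, 3.0, 2.0])     # hypothetical leading eigenvalues
sigma2 = 0.5
U, _ = np.linalg.qr(rng.normal(size=(6, 3)))
W_sorted = U * np.sqrt(eigvals - sigma2)    # scaled principal components, in order

perm = np.array([2, 0, 1])                  # arbitrary column shuffle
W = W_sorted[:, perm]                       # decoder as it might be learned

# Recover the ordering from squared column norms, largest first.
sq_norms = np.sum(W**2, axis=0)
order = np.argsort(-sq_norms)
W_recovered = W[:, order]
recovered_eigvals = sq_norms[order] + sigma2
```

Sorting by squared column norm undoes the permutation exactly, which is why the remaining ambiguity at the global optimum is harmless.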
Appendix C Stationary points of ELBO
Here we present details on the analysis of the stationary points of the ELBO objective. To begin, we first derive closed-form solutions to the components of the log marginal likelihood (including the ELBO). The VAE we focus on is the one presented in Eq. eqn:linear_vae, with a linear encoder, linear decoder, Gaussian prior, and Gaussian observation model.
C.1 Analytic ELBO of the Linear VAE
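Remember that one can express the log marginal likelihood as:
$$\log p(\mathbf{x}) = \underbrace{\mathrm{KL}\left(q(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z} \mid \mathbf{x})\right)}_{(A)} - \underbrace{\mathrm{KL}\left(q(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z})\right)}_{(B)} + \underbrace{\mathbb{E}_{q(\mathbf{z} \mid \mathbf{x})}\left[\log p(\mathbf{x} \mid \mathbf{z})\right]}_{(C)}.$$
Each of the terms (A-C) can be expressed in closed form for the linear VAE. Note that the KL term (A) is minimized when the variational distribution is exactly the true posterior distribution. This is possible when the columns of the decoder are orthogonal.
The term (B) can be expressed as,
$$\mathrm{KL}\left(q(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z})\right) = \frac{1}{2}\left(-\log\det\mathbf{D} + (\mathbf{x} - \boldsymbol{\mu})^T\mathbf{V}^T\mathbf{V}(\mathbf{x} - \boldsymbol{\mu}) + \mathrm{tr}(\mathbf{D}) - k\right).$$
The term (C) can be expressed as,
$$\mathbb{E}_{q(\mathbf{z} \mid \mathbf{x})}\left[\log p(\mathbf{x} \mid \mathbf{z})\right] = \mathbb{E}_{q(\mathbf{z} \mid \mathbf{x})}\left[-\frac{\|\mathbf{W}\mathbf{z} - (\mathbf{x} - \boldsymbol{\mu})\|^2}{2\sigma^2}\right] - \frac{n}{2}\log 2\pi\sigma^2.$$
Noting that $\mathbf{W}\mathbf{z} \sim \mathcal{N}(\mathbf{W}\mathbf{V}(\mathbf{x} - \boldsymbol{\mu}), \mathbf{W}\mathbf{D}\mathbf{W}^T)$, we can compute the expectation analytically and obtain,
$$\mathbb{E}_{q(\mathbf{z} \mid \mathbf{x})}\left[\log p(\mathbf{x} \mid \mathbf{z})\right] = \frac{1}{2\sigma^2}\left[-\mathrm{tr}(\mathbf{W}\mathbf{D}\mathbf{W}^T) - (\mathbf{x} - \boldsymbol{\mu})^T\mathbf{V}^T\mathbf{W}^T\mathbf{W}\mathbf{V}(\mathbf{x} - \boldsymbol{\mu}) + 2(\mathbf{x} - \boldsymbol{\mu})^T\mathbf{W}\mathbf{V}(\mathbf{x} - \boldsymbol{\mu}) - (\mathbf{x} - \boldsymbol{\mu})^T(\mathbf{x} - \boldsymbol{\mu})\right] - \frac{n}{2}\log 2\pi\sigma^2.$$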
C.2 Finding stationary points
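To compute the stationary points we must take derivatives with respect to $\boldsymbol{\mu}, \mathbf{D}, \mathbf{W}, \mathbf{V}, \sigma^2$. As before, we have $\boldsymbol{\mu} = \boldsymbol{\mu}_{MLE}$ at the global maximum and for simplicity we fix $\boldsymbol{\mu}$ here for the remainder of the analysis.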
Taking the marginal likelihood over the whole dataset, at the stationary points we have,
The above are computed using standard matrix derivative identities [petersen2008matrix]. These equations yield the expected solution for the variational distribution directly. From Eq. eqn:D_stationary we compute and , recovering the true posterior mean in all cases and getting the correct posterior covariance when the columns of are orthogonal. We will now proceed with the proof of Theorem 1.
If the columns of $W$ are orthogonal then the log marginal likelihood is recovered exactly at all stationary points. This is a direct consequence of the posterior mean and covariance being recovered exactly at all stationary points, so that the KL term (A) is zero.
We must give separate treatment to the case where there is a stationary point without orthogonal columns of $W$. Suppose we have such a stationary point. Using the singular value decomposition we can write $W = U\Sigma R^\top$, where $U$ and $R$ are orthogonal matrices. Note that $\log p(x)$ is invariant to the choice of $R$ [tipping1999probabilistic]. However, the choice of $R$ does affect the first term (A) of Eq. eqn:marginal_loglik: this term is minimized when $R = I$, and thus the ELBO must increase.
To formalize this argument, we compute (A) at a stationary point. From above, at every stationary point the mean of the variational distribution exactly matches the true posterior mean. Thus, substituting the optimal $D = \sigma^2\widetilde{M}^{-1}$, the KL simplifies to:
$$\mathrm{KL}\left(q(z|x)\,\|\,p(z|x)\right) = \frac{1}{2}\left(\frac{1}{\sigma^2}\mathrm{tr}(MD) - k + k\log\sigma^2 - \log\det M - \log\det D\right) = \frac{1}{2}\left(\log\det\widetilde{M} - \log\det M\right),$$
where $M = W^\top W + \sigma^2 I$ and $\widetilde{M} = \mathrm{diag}(W^\top W) + \sigma^2 I$. Now consider applying a small rotation to $W$: $W \mapsto WR_\epsilon$. As the optimal $D$ and $V$ are continuous functions of $W$, this corresponds to a small perturbation of these parameters too for a sufficiently small rotation. Importantly, $\log\det M$ remains fixed for any orthogonal choice of $R_\epsilon$ but $\log\det\widetilde{M}$ does not. Thus, we choose $R_\epsilon$ to minimize this term. In this manner, (A) shrinks, meaning that the ELBO, $-(B)+(C)$, must increase. Thus if the stationary point existed, it must have been a saddle point.
We now describe how to construct such a small rotation matrix. First note that without loss of generality we can assume that $\det(R) = 1$. (Otherwise, we can flip the sign of a column of $R$ and the corresponding column of $U$.) Additionally, we have $WR = U\Sigma$, which has orthogonal columns.
The special orthogonal group of determinant-1 orthogonal matrices is a compact, connected Lie group and therefore the exponential map from its Lie algebra is surjective. This means that we can find an upper-triangular matrix $B$ such that $R = \exp\{B - B^\top\}$. Consider $R_\epsilon = \exp\{\frac{1}{n(\epsilon)}(B - B^\top)\}$, where $n(\epsilon)$ is an integer chosen to ensure that the elements of $\frac{1}{n(\epsilon)}(B - B^\top)$ are within $\epsilon$ of zero. This matrix is a rotation in the direction of $R$ which we can make arbitrarily close to the identity by a suitable choice of $n(\epsilon)$. This is verified through the Taylor series expansion $R_\epsilon = I + \frac{1}{n(\epsilon)}(B - B^\top) + O(\epsilon^2)$. Thus, we have identified a small perturbation to $W$ (and $D$ and $V$) which decreases the posterior KL (A) but keeps the log marginal likelihood constant. Thus, the ELBO increases and the stationary point must be a saddle point. ∎
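The rotation argument can be illustrated in the simplest setting of two latent dimensions, where $\exp\{B - B^\top\}$ is an ordinary planar rotation. The sketch below is an illustration under assumptions rather than part of the proof: it takes a decoder $W = U\Sigma R^\top$ with a hypothetical rotation angle of 0.5 radians, applies the small rotation $R_\epsilon$ repeatedly, and tracks the posterior KL at the optimal encoder, computed as $\frac{1}{2}(\log\det\widetilde{M} - \log\det M)$ with $M = W^\top W + \sigma^2 I$ and $\widetilde{M}$ its diagonal part.

```python
import numpy as np

def rotation(theta):
    """2x2 rotation matrix; equals exp(B - B^T) for B = [[0, -theta], [0, 0]]."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def posterior_kl_gap(W, sigma2):
    """KL(q || p(z|x)) at the optimal diagonal-covariance encoder for decoder W."""
    k = W.shape[1]
    M = W.T @ W + sigma2 * np.eye(k)
    _, logdet_M = np.linalg.slogdet(M)
    logdet_M_tilde = np.sum(np.log(np.diag(M)))
    return 0.5 * (logdet_M_tilde - logdet_M)

def marginal_cov(W, sigma2):
    """Covariance of the marginal p(x), which determines log p(x)."""
    return W @ W.T + sigma2 * np.eye(W.shape[0])

rng = np.random.default_rng(0)
sigma2, theta, n_steps = 0.5, 0.5, 10           # illustrative values
U, _ = np.linalg.qr(rng.normal(size=(4, 2)))    # orthonormal columns
Sigma = np.diag([2.0, 1.0])
W = U @ Sigma @ rotation(theta).T               # decoder with non-orthogonal columns

R_eps = rotation(theta / n_steps)               # R_eps applied n_steps times equals R
gaps, W_t = [], W.copy()
for _ in range(n_steps + 1):
    gaps.append(posterior_kl_gap(W_t, sigma2))
    W_t = W_t @ R_eps
gaps = np.array(gaps)

W_final = W @ np.linalg.matrix_power(R_eps, n_steps)   # equals U @ Sigma
print(gaps)
```

In this example every small rotation lowers the KL gap while leaving the marginal covariance, and hence $\log p(x)$, unchanged. The angle was chosen below $\pi/4$; for larger angles the nearest orthogonal-column solution may be a different column permutation.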
C.3 Bernoulli Probabilistic PCA
We would like to extend our linear analysis to the case where we have a Bernoulli observation model, as this setting also suffers severely from posterior collapse. The analysis may also shed light on more general categorical observation models which have also been used. Typically, in these settings a continuous latent space is still used (for example, bowman2015generating).
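We will consider the following model,
$$p(z) = \mathcal{N}(0, I), \qquad p(x|z) = \mathrm{Bernoulli}(y), \qquad y = \sigma(Wz + \mu),$$
where $\sigma$ denotes the sigmoid function, $\sigma(a) = 1/(1 + \exp(-a))$, and we assume an independent Bernoulli observation model over $x$.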
Unfortunately, under this model it is difficult to reason about the stationary points. There is no closed-form solution for the marginal likelihood $p(x)$ or the posterior distribution $p(z|x)$. Numerical integration methods exist which may make it easy to evaluate these quantities in practice, but they will not immediately provide us with a good gradient signal.
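We can compute the density function for $y$ using the change of variables formula. Noting that $Wz + \mu \sim \mathcal{N}(\mu, WW^\top)$, we recover the following logit-Normal distribution:
$$f(y) = \frac{1}{\sqrt{\det(2\pi WW^\top)}}\,\frac{1}{\prod_i y_i(1-y_i)}\exp\left\{-\frac{1}{2}\left(\mathrm{logit}(y) - \mu\right)^\top (WW^\top)^{-1}\left(\mathrm{logit}(y) - \mu\right)\right\},$$
where $\mathrm{logit}(y) = \log\frac{y}{1-y}$ is applied elementwise and $WW^\top$ is assumed to be invertible.
We can write the marginal likelihood as,
$$p(x) = \int p(x|z)\,p(z)\,dz = \mathbb{E}_{z}\left[y(z)^{x}\left(1-y(z)\right)^{1-x}\right],$$
where the exponentiation $(\cdot)^x$ is taken to be elementwise, with a product over the resulting entries. Unfortunately, the expectation of a logit-normal distribution has no closed form [atchison1980logistic] and so we cannot tractably compute the marginal likelihood.
Similarly, under the ELBO we need to compute the expected reconstruction error. This can be written as,
$$\mathbb{E}_{q(z|x)}\left[\log p(x|z)\right] = \int \left[x^\top \log y(z) + (1-x)^\top \log\left(1 - y(z)\right)\right] q(z|x)\,dz,$$
another intractable integral.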
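Although these integrals have no closed form, they can be approximated by sampling. The sketch below is a simple Monte Carlo estimator of $\log p(x)$, assuming the model $p(z) = \mathcal{N}(0, I)$, $y = \sigma(Wz + \mu)$ with independent Bernoulli observations; the function name, sample count and test values are hypothetical, and this is not the procedure used in the paper.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mc_log_marginal_bernoulli(x, W, mu, num_samples, rng):
    """Monte Carlo estimate of log p(x) = log E_{z ~ N(0, I)}[p(x | z)]."""
    k = W.shape[1]
    z = rng.normal(size=(num_samples, k))
    logits = z @ W.T + mu
    # log p(x|z) = sum_i x_i log y_i + (1 - x_i) log(1 - y_i), written stably:
    # log sigmoid(s * a) = -log(1 + exp(-s * a)) with s = +1 for x_i = 1, -1 for x_i = 0.
    log_lik = -np.sum(np.logaddexp(0.0, -(2.0 * x - 1.0) * logits), axis=1)
    # log-mean-exp over samples
    m = log_lik.max()
    return m + np.log(np.mean(np.exp(log_lik - m)))

rng = np.random.default_rng(0)
x = np.array([1.0, 0.0, 1.0])
mu = np.array([0.2, -0.4, 1.0])
W = rng.normal(size=(3, 2))
estimate = mc_log_marginal_bernoulli(x, W, mu, 50_000, rng)

# Sanity check: with W = 0 the latent is ignored and p(x) factorizes exactly.
exact = np.sum(x * np.log(sigmoid(mu)) + (1.0 - x) * np.log(1.0 - sigmoid(mu)))
check = mc_log_marginal_bernoulli(x, np.zeros((3, 2)), mu, 1000, rng)
print(estimate, check, exact)
```

Such an estimator evaluates the objective but, as noted above, does not by itself give the closed-form gradients that made the Gaussian analysis possible.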
Appendix D Related Work (Extended)
Due to the large volume of work studying posterior collapse in variational autoencoders, we have included here an extended discussion of related work. We utilize this additional space to provide a more in-depth discussion of the related work presented in the main paper and to highlight additional work.
tomczak2017vae introduce the VampPrior, a hierarchical learned prior for VAEs. tomczak2017vae show empirically that such a learned prior can mitigate posterior collapse (which they refer to as inactive stochastic units). While the authors provide limited theoretical support for the efficacy of their method in reducing posterior collapse, they claim intuitively that by enabling multi-modal prior distributions the KL term is less likely to force inactive units — possibly by reducing the impact of local optima corresponding to posterior collapse.
In the main paper we discuss the work of dai2017hidden, who connect robust PCA methods and VAEs. In particular, Section 2 of their manuscript studies the case of a linear decoder and shows that, when the encoder takes the form of the optimal variational distribution, the ELBO of the resulting VAE collapses into the pPCA objective. We study the ELBO without optimality assumptions on the linear encoder and characterize the optimization landscape with no additional assumptions. They claim further that all minima of the (encoder-optimal) ELBO objective are globally optimal; we show that, in fact, for a linear encoder there is a fully identifiable global optimum.
dai2018diagnosing discuss the importance of the observation noise, and in fact show that under some assumptions the optimal observation noise should shrink to zero (Theorem 4 in their work). These assumptions amount to the number of latent dimensions exceeding the dimensionality of the true data manifold. However, in the linear model (whose latent dimensions do not exceed the input space dimensionality) the optimal variance does not shrink towards zero and is instead given by the average of the variance lost in the linear projection. Note that this does not violate the results of dai2018diagnosing, but highlights the need to consider model capacity against data complexity, as in alemi2017fixing.
Appendix E Experiment details
We used TensorFlow [tensorflow2015-whitepaper] for our experiments with linear and deep VAEs. In each case, the models were trained using a single GPU.
Visualizing stationary points of pPCA
For this experiment we computed the pPCA MLE using a subset of 1000 random training images from the MNIST dataset. We evaluate and plot the log marginal likelihood in closed form on this same subset. In this case, we did not dequantize or apply any nonlinear processing to the data.
Stochastic vs. Analytic VAE
We trained linear VAEs with 200 hidden dimensions. We used full-batch training with 1000 MNIST digits sampled randomly from the training set (the same data as used to produce Figure 2). We trained each model with the Adam optimizer and a fixed learning rate, grid searching over a range of learning rates to find the one which gave the best ELBO after 12000 training steps. For both models, 0.001 provided the best final ELBO.
The VAEs we trained on MNIST all had the same architecture: 784-1024-512-k-512-1024-784. The Gaussian likelihood is fairly uncommon for this dataset, which is nearly binary, but it provides a good setting for us to investigate our theoretical findings. To dequantize the data, we added uniform random noise and rescaled the pixel values to a bounded range. We then applied a nonlinear logistic transform as in [NIPS2017_6828]. The VAE parameters were optimized jointly using the Adam optimizer [kingma2014adam]. We trained the VAE for 1000 epochs total, keeping the learning rate fixed throughout. We performed a grid search over a range of learning rates and reported results for the model which achieved the best training ELBO.
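The preprocessing described above can be sketched as follows. This is a hedged illustration, not the exact pipeline: it assumes 8-bit pixels, rescaling to $[0, 1)$ after adding noise, and a logit transform of the form $\mathrm{logit}(\alpha + (1 - 2\alpha)x)$; the value of `alpha` and both function names are assumptions.

```python
import numpy as np

def dequantize_and_logit(pixels, rng, alpha=1e-6):
    """Add uniform noise to integer pixels, rescale to [0, 1), then apply a logit.

    alpha keeps the logit finite at the boundaries; its value here is an assumption.
    """
    noisy = (pixels.astype(np.float64) + rng.uniform(size=pixels.shape)) / 256.0
    squeezed = alpha + (1.0 - 2.0 * alpha) * noisy
    return np.log(squeezed) - np.log1p(-squeezed)

def invert_preprocessing(u, alpha=1e-6):
    """Map transformed values back to integer pixel intensities."""
    squeezed = 1.0 / (1.0 + np.exp(-u))
    noisy = (squeezed - alpha) / (1.0 - 2.0 * alpha)
    return np.clip(np.floor(noisy * 256.0), 0, 255).astype(np.int64)

rng = np.random.default_rng(0)
pixels = np.array([[0, 17, 128], [200, 254, 255]])
u = dequantize_and_logit(pixels, rng)
recovered = invert_preprocessing(u)
print(u)
print(recovered)
```

The transform maps bounded pixel intensities to the whole real line, which suits a Gaussian observation model better than raw, nearly binary pixels.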
We used the convolutional architecture proposed by higgins2016beta trained on 64x64 images from the CelebA dataset [liu2015faceattributes]. Otherwise, the experimental procedure followed that of the MNIST VAEs with the nonlinear preprocessing hyperparameters set as in [NIPS2017_6828].
E.1 Additional results
E.1.1 Evaluating KL Annealing
We found that KL-annealing may provide temporary relief from posterior collapse but that if $\sigma^2$ is not learned simultaneously then the collapsed solution is recovered. In Figure 7 we show the proportion of units collapsed by threshold for several fixed choices of $\sigma^2$ when the KL weight $\beta$ is annealed from 0 to 1 over the first 100 epochs. The solid lines correspond to the final model while the dashed line corresponds to the model at 80 epochs of training. KL-annealing was able to reduce posterior collapse initially but eventually fell back to the collapsed solution.
After finding that KL-annealing alone was insufficient to prevent posterior collapse, we explored KL-annealing while learning $\sigma^2$. Based on our analysis in the linear case we expect that this should work well: while $\beta$ is small the model should be able to learn to reduce $\sigma^2$. We trained using the same KL schedule and also with the standard ELBO while learning $\sigma^2$. The results are presented in Figure 8 and Figure 9. Under the ELBO objective, $\sigma^2$ is reduced somewhat but ultimately a large degree of posterior collapse is present. Using KL-annealing, the VAE is able to learn a much smaller value of $\sigma^2$ and ultimately reduces posterior collapse. This suggests that the non-linear VAE dynamics may be similar to the linear case when suitably conditioned.
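A KL-annealing schedule of the kind used here can be sketched in a few lines. The linear shape of the ramp is an assumption (the text only states that the weight goes from 0 to 1 over the first 100 epochs), and the function names are hypothetical.

```python
def kl_weight(epoch, warmup_epochs=100):
    """KL weight: ramps linearly from 0 to 1 over the warm-up, then stays at 1."""
    return min(1.0, max(0.0, epoch / warmup_epochs))

def annealed_objective(expected_log_lik, kl, epoch, warmup_epochs=100):
    """Annealed ELBO: reconstruction term minus the weighted KL term."""
    return expected_log_lik - kl_weight(epoch, warmup_epochs) * kl

for epoch in (0, 50, 100, 1000):
    print(epoch, kl_weight(epoch))
```

Once the warm-up ends the objective is the standard ELBO again, which is consistent with the observation above that annealing alone only delays collapse when the observation noise is held fixed.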
E.1.2 Full results tables
E.1.3 Qualitative Results
Reconstructions from the KL-Annealed CelebA model are shown in Figure 12. We also show the output of interpolating in the latent space in Figure 13. To produce the latter plot, we compute the variational mean of 3 input points (top left, top right, bottom left) and interpolate linearly on the plane between them. We also extrapolate out to a fourth point (bottom right), which lies on the plane defined by the other points.
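The interpolation grid can be sketched as below. This assumes the fourth corner is obtained by completing the parallelogram spanned by the three variational means, which is one natural reading of the description; the function name, grid size and latent vectors are hypothetical.

```python
import numpy as np

def latent_plane_grid(z_top_left, z_top_right, z_bottom_left, steps):
    """Grid of latent codes on the plane through three points.

    Returns an array of shape (steps, steps, k); rows run top to bottom and
    columns run left to right.
    """
    u = np.linspace(0.0, 1.0, steps)    # left-to-right coordinate
    v = np.linspace(0.0, 1.0, steps)    # top-to-bottom coordinate
    right = z_top_right - z_top_left
    down = z_bottom_left - z_top_left
    return z_top_left + v[:, None, None] * down + u[None, :, None] * right

rng = np.random.default_rng(0)
z_tl, z_tr, z_bl = rng.normal(size=(3, 8))   # stand-ins for encoder means
grid = latent_plane_grid(z_tl, z_tr, z_bl, steps=5)
print(grid.shape)
```

Each grid entry would then be passed through the decoder to produce one image of the interpolation plot.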