Closed Form Variances for Variational Auto-Encoders

12/21/2019
by   Graham Fyffe, et al.
Google

We propose a reformulation of Variational Auto-Encoders eliminating half of the network outputs (the variances) in a deep network setting. While it is well known that the posterior is in general intractable, we show that the variances of Gaussian posteriors and likelihoods may be solved in closed form, producing improved variational lower bounds over their learned counterparts in experiments. The closed forms reduce to remarkably simple expressions – in particular, one optimal choice for the posterior variance is simply the identity matrix. We arrive at these conclusions by analyzing the variational lower bound objective irrespective of any particular network architecture, deriving its partial derivatives and closed form solutions for all parameters but the posterior means. In deriving the closed form likelihood variance, we show that the objective is underdetermined, which we resolve by constraining the presumed information content of the data examples. Any of these modifications may be applied to simplify, and perhaps improve, any Variational Auto-Encoder.


1 Introduction

An Auto-Encoder encodes a data vector into a latent space and then decodes it back to the data space with as little corruption as possible. A Variational Auto-Encoder (VAE) 2013arXiv1312.6114K treats encoding and decoding as probabilistic operations: a given data vector is encoded to a distribution over the latent space, and a given latent vector is decoded to a distribution over the data space (Fig. 1).

(a) There.
(b) Back again.
Figure 1: Schematic of a Variational Auto-Encoder. (a) The encoder maps a data vector to a posterior distribution (dashed ellipse) parameterized by mean and variance in latent space. (b) The decoder maps a latent vector to a likelihood function (dashed ellipse) parameterized by mean and variance in data space. The dotted shape and circle represent the data distribution and prior.

The idea is that a vector from a population of data with a complex distribution may be the product of some unknown process operating on a latent vector from a much simpler distribution, and if we can model this process we open up a variety of statistical inference applications including dimensionality reduction and novel example generation. In Bayesian terms, the encoder produces a posterior distribution for the latent vector given the data vector and a latent prior distribution, and the decoder produces a likelihood function for the data vector given a latent vector. The encoder and decoder are learned together by maximizing the variational lower bound of the marginal likelihood of the data.

The VAE literature is thriving, with works exploring a multitude of applications, network architectures, and alternative objective functions. We go back to the original formulation 2013arXiv1312.6114K to look for new insights. By careful analysis of the variational lower bound objective irrespective of any network architecture, we find a fundamental simplification: the variances associated with Gaussian posteriors and Gaussian likelihoods need not be learned at all.

On the encoding side, VAEs typically employ a fixed prior distribution (e.g. standard normal), and the posterior is parameterized by a learned function of the data vector that outputs the posterior mean and variance. This follows the algorithm given by Kingma et al. 2013arXiv1312.6114K, which is presented as an example of a variational formulation of an Auto-Encoder, though other formulations are possible. In this work, we argue that rather than learning the posterior variance, an optimal solution is to simply employ a constant variance matrix (such as the identity matrix), which in turn admits a trivial closed form solution for the prior variance, as we detail in Section 5. In a deep network setting, this eliminates half of the outputs of the encoder network, since only the posterior means are learned. We demonstrate that the proposed method outperforms a standard normal prior with learned posterior variances, producing a higher variational lower bound in experiments. Further, employing constant posterior variances admits a simplified formulation of the variational lower bound as an expectation over a batch, which we call the Batch Information Lower Bound (BILBO), detailed in Section 5.2.

On the decoding side, VAEs may employ a likelihood function where the variance is determined by the mean (e.g. Bernoulli), or a parametric likelihood function (e.g. Gaussian) parameterized by a learned function of the latent vector that outputs the mean and variance. We show in Section 6 that the optimal variance for a Gaussian likelihood is tractable and need not be learned. In a deep network setting, this eliminates half of the outputs of the decoder network, since only the likelihood means are learned. Surprisingly, our solution implies that the log likelihood is unbounded, and we offer a simple method to bound the log likelihood using a measure of the aggregate information present in a batch, which we call Bounded Aggregate Information Sampling (BAGGINS), detailed in Section 6.2.

Finally we provide experiments in Section 8 supporting our claims. The proposed methods may be employed regardless of the particular network architecture, with BILBO applicable to the encoder whenever the prior and posterior are Gaussian (regardless of the decoder) and BAGGINS applicable to the decoder when it is also Gaussian. Combining the two, we construct Variational Auto-Encoder networks with only half the outputs of ordinary VAEs, with improved variational lower bounds and scale invariance relative to their learned counterparts.

2 Preliminaries

Variational Auto-Encoders 2013arXiv1312.6114K

We assume an observed value x results from an unobserved or latent variable z, typically of lower dimension, via a conditional distribution (or likelihood) p(x|z). We are interested in the posterior p(z|x), which is in general intractable. We estimate a tractable prior distribution p_θ(z) and likelihood p_θ(x|z), parameterized by θ. Further, we estimate an approximate posterior q_φ(z|x), parameterized by φ, rather than the intractable true posterior. The likelihood is often referred to interchangeably as a probabilistic decoder, and the approximate posterior as a probabilistic encoder, as these are the roles they play in mapping values from latent space to data space and back. Correspondingly, θ may be referred to as the generative parameters, and φ as the recognition parameters. We estimate these parameters by maximizing the so-called variational lower bound, or evidence lower bound (ELBO):

L(θ, φ; x) = −D_KL(q_φ(z|x) ‖ p_θ(z)) + E_{q_φ(z|x)}[log p_θ(x|z)]    (1)

where D_KL(q ‖ p) is the Kullback-Leibler divergence (or relative entropy) of q with respect to p, equal to E_q[log q − log p]. Intuitively, the KL term encourages the posterior to resemble the prior, while the log likelihood term encourages maximum likelihood. In common practice, the expectation over the approximate posterior is estimated via a single sample:

L(θ, φ; x) ≈ −D_KL(q_φ(z|x) ‖ p_θ(z)) + log p_θ(x|z̃),   with z̃ ∼ q_φ(z|x)    (2)

This formulation lends itself well to optimization schemes such as stochastic gradient ascent.
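To make the single-sample estimator concrete, here is a minimal numpy sketch assuming a diagonal-Gaussian posterior, a standard normal prior, and a unit-variance Gaussian likelihood; the `encode` and `decode` callables are hypothetical stand-ins for whatever networks (or other mappings) produce the posterior and likelihood means, not the paper's implementation.

```python
import numpy as np

def single_sample_elbo(x, encode, decode, rng):
    """Single-sample estimate of the ELBO (Eq. 2) assuming a diagonal-Gaussian
    posterior N(mu, diag(sigma^2)), a standard normal prior, and a
    unit-variance Gaussian likelihood."""
    mu, sigma = encode(x)                 # posterior parameters in latent space
    eps = rng.standard_normal(mu.shape)
    z = mu + sigma * eps                  # reparameterized sample z ~ q(z|x)

    # Closed-form KL( q(z|x) || N(0, I) ) for a diagonal Gaussian posterior.
    kl = 0.5 * np.sum(mu**2 + sigma**2 - 1.0 - 2.0 * np.log(sigma))

    # Gaussian log likelihood log p(x|z) with identity covariance.
    x_hat = decode(z)
    log_lik = -0.5 * np.sum((x - x_hat)**2) - 0.5 * x.size * np.log(2.0 * np.pi)
    return log_lik - kl

# Toy usage with linear maps standing in for the encoder and decoder networks.
rng = np.random.default_rng(0)
W = rng.standard_normal((2, 5))
encode = lambda x: (W @ x, np.ones(2))    # hypothetical encoder: mean, std
decode = lambda z: W.T @ z                # hypothetical decoder: mean only
print(single_sample_elbo(rng.standard_normal(5), encode, decode, rng))
```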

Map Notation

The approximate posterior may be viewed as a mapping f that maps a data vector x to the mean μ = f(x) of its distribution in latent space, together with an additional mapping for the other parameters of its distribution in latent space (e.g. variance). Similarly, the likelihood may be viewed as a mapping g that maps a latent vector z to the mean m = g(z) of its distribution in data space, together with an additional mapping for the other parameters of its distribution in data space (e.g. variance). In the following, we make references to these mappings, often dropping their arguments for clarity.

Gaussian Variational Auto-Encoders

Consider a Variational Auto-Encoder with Gaussian approximate posterior q_φ(z|x) = N(z; μ, Σ) and a standard normal prior p(z) = N(z; 0, I). Dropping the parameters θ and φ for clarity, Eq. 1 becomes:

L = (1/2)(log|Σ| − tr(Σ) − μᵀμ + k) + E_{q(z|x)}[log p(x|z)]    (3)

where |⋅| represents the determinant and k is the dimension of the latent space; typically Σ is taken to be diagonal. This formulation produces a relatively simple expression, and is popular in the literature. Yet it was originally presented as just one example of applying the Auto-Encoding Variational Bayes algorithm 2013arXiv1312.6114K, and other choices of priors and posteriors are equally valid. In the following, we consider scaling the overall latent space, letting the prior now be N(z; 0, T) for some symmetric positive definite T, yielding:

L = (1/2)(log|Σ| − log|T| − tr(T⁻¹Σ) − μᵀT⁻¹μ + k) + E_{q(z|x)}[log p(x|z)]    (4)

Note the scaling introduces a scale ambiguity, since scaling the entire latent space admits an equal ELBO via an inverse scaling in the decoder mapping inside the likelihood. We revisit this later.
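The scale ambiguity can be checked numerically. The sketch below is our own illustration, not code from the paper: it evaluates the KL term of Eq. 4 for a full-covariance posterior N(μ, Σ) against a prior N(0, T), and confirms that scaling μ by c and both Σ and T by c² leaves it unchanged.

```python
import numpy as np

def kl_gaussian(mu, Sigma, T):
    """KL( N(mu, Sigma) || N(0, T) ), i.e. the negative KL term of Eq. 4
    with the prior variance T left free."""
    k = mu.size
    T_inv = np.linalg.inv(T)
    return 0.5 * (np.trace(T_inv @ Sigma) + mu @ T_inv @ mu - k
                  + np.log(np.linalg.det(T)) - np.log(np.linalg.det(Sigma)))

rng = np.random.default_rng(0)
k = 3
mu = rng.standard_normal(k)
A = rng.standard_normal((k, k)); Sigma = A @ A.T + np.eye(k)
B = rng.standard_normal((k, k)); T = B @ B.T + np.eye(k)

c = 2.5  # scale the entire latent space
print(kl_gaussian(mu, Sigma, T))
print(kl_gaussian(c * mu, c**2 * Sigma, c**2 * T))  # identical: scale ambiguity
```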

3 Unraveling the Variational Auto-Encoder

To assist in the understanding of the Variational Auto-Encoder, we describe its action on a dataset of vectors in simplified terms and cartoons, with complete disregard for any network architecture or implementation detail. Before we dive in, we introduce a tool to help our analysis, which we call a loss-whitening mapping, under which likelihood functions all become unit normal.

We argue that any reasonable likelihood model should produce likelihood functions that are equal whenever their means are equal. The likelihood function then implicitly defines a loss function between pairs of vectors in data space. We further argue that a reasonable loss should depend strongly on the relationship between its two arguments, and only weakly on their absolute position in data space. This allows us to posit the existence of a loss-whitened space, reached via a smooth mapping that sends each data vector to a corresponding whitened vector such that the likelihood becomes unit normal about the whitened mean. (Refer to Fig. 2.) Take note this does not necessarily whiten the population; rather it whitens the loss. Note also that if the loss is an L2 loss, then the mapping is merely the identity function.

Figure 2: The loss-whitening mapping warps the data space (left dotted shape) to a loss-whitened space (right dotted shape) such that likelihood functions (dashed ellipses) become unit normal (dashed circles) and the likelihood of a vector near the likelihood mean is equal to the whitened likelihood of the corresponding vector (or very nearly equal).

We now unravel the action of a VAE as a sequence of transformations. (Refer to Fig. 3.) (a-b) A set of data vector examples drawn from a data distribution is mapped into a latent space via the encoder. This corresponds to encoding. Inflated into fuzzy blobs by the posterior variances, they collectively resemble the prior. We conceptualize decoding the resulting posterior distributions as blobs rather than individual vectors since the mapping may operate on the entire space, and we conceptually break decoding into two steps. (b-c) The blobs are mapped into the loss-whitened space via the composition of the decoder and the loss-whitening mapping. (c-d) The blobs are mapped back to data space via the inverse loss-whitening mapping. Decoding is complete, with the blobs now representing the distributions of the likelihood means (not the likelihood functions proper). Their centroids are close to the original data vectors, with a small residual offset. (d-e) During training, the expectation of the log likelihood of each data vector over its associated likelihood means distribution (i.e. blob) is computed. Conceptually, this may be carried out in the loss-whitened space, where the expected log likelihood reduces to the cross entropy of a unit normal distribution (centered on the data vector) relative to the decoded blob. We emphasize that loss-whitening mappings are not actually computed, but serve here to facilitate analysis.

(a)
(b)
(c)
(d)
(e)
Figure 3: The action of a Variational Auto-Encoder (cartoon). (a) Data vectors (black dots) are encoded by mapping to (b) latent space, collectively resembling the prior (dotted ellipse) when inflated by the posterior variance (dashed circles), and are decoded by (conceptually) mapping to (c) loss-whitened space and then back to (d) data space, with slight misalignment from the original data vectors (white dots). During training, the expected negative log likelihood is equivalent to (e) the cross entropy of a unit normal distribution centered on the data vector (solid circles) relative to the decoded distribution (dashed circles) in loss-whitened space. See the text for further details.

4 Derivatives of the Expected Variational Lower Bound

In the hopes of finding a stationary point of an objective, one usually considers its partial derivatives, which we produce here for the expected variational lower bound assuming Gaussian prior and posterior. When taking derivatives, it is helpful to consider the following expansion of Eq. 4:

(5)

The partial derivatives of the objective with respect to the prior variance, the posterior means and variances, and the likelihood mean mapping are as follows (using partial derivatives of the expected log likelihood detailed in Appendices B and A):

(6)
(7)
(8)
(9)

where N is the number of data examples in the population and J is the Jacobian of the decoder mean mapping g with respect to its latent argument, evaluated at the posterior mean. The errors in the approximations denoted by ≈ in Eqs. 8 and 7 are small, being due to treating the mapping as locally linear with Jacobian J when computing expectations over the posterior footprint. We show that not only is the optimal mapping g smooth over the posterior, but its Jacobian is equivalent to a moving least squares linear regression model windowed by the posterior. First, solving for a root of Eq. 9 admits a closed form solution for g:

(10)

which has the following Jacobian, which we derive referring to IMM2012-03274 (note that IMM2012-03274 Eq. (347) contains an error; the expression is transposed):

(11)

On the mild condition that the local linearization holds wherever the posterior is significant, we may write:

(12)

We recognize Eq. 12 as moving least squares linear regression windowed by the posterior. The tricky part is to find simultaneous roots of Eqs. 9, 8, 7 and 6 over all data examples. We might employ classical algorithms such as gradient ascent without even using any deep networks, but we would rather not.

Remark.

The closed form mapping in Eq. 10 indicates that the decoder stage of a VAE with Gaussian prior and posterior is tractable (a simple weighted mean) whenever the loss-whitening mapping is tractable. The only intractable stage remaining is the encoder.
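The remark above describes the optimal decoder mean as "a simple weighted mean". One plausible reading, sketched below purely for intuition, is a Nadaraya-Watson style estimator in which each (loss-whitened) data example is weighted by its posterior density at the query latent vector. This is an assumption on our part, not a verbatim transcription of Eq. 10.

```python
import numpy as np

def gaussian_pdf(z, mu, sigma2):
    """Isotropic Gaussian density N(z; mu, sigma2 * I)."""
    k = z.size
    d = z - mu
    return np.exp(-0.5 * d @ d / sigma2) / (2.0 * np.pi * sigma2) ** (k / 2)

def weighted_mean_decoder(z, post_means, post_var, targets):
    """Posterior-weighted mean of the (loss-whitened) data examples: a
    Nadaraya-Watson style regressor windowed by the posteriors. This is a
    hypothetical reading of 'a simple weighted mean', not the paper's exact Eq. 10."""
    w = np.array([gaussian_pdf(z, m, post_var) for m in post_means])
    w = w / w.sum()
    return w @ targets   # convex combination of whitened data vectors

# Toy usage: 4 examples with 2-D latents and 3-D (whitened) data vectors.
rng = np.random.default_rng(0)
post_means = rng.standard_normal((4, 2))
targets = rng.standard_normal((4, 3))
print(weighted_mean_decoder(np.zeros(2), post_means, 1.0, targets))
```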

5 Encoding with Closed Form Posterior Variances

In this section, we support our main proposition: In a Variational Auto-Encoder with Gaussian prior and posterior, there exists an optimal solution where the posterior variances are all equal.

Learning the posterior variances has roots in the mixture density estimation literature (e.g. NIPS1999_1673) where one models an unknown distribution as a Gaussian mixture model and aims to estimate its parameters. However, in the context of auto-encoding we are reshaping an unknown data distribution to an assumed (prior) latent distribution, hence we may make stronger statements about the variances. Consider running the sequence of actions in Fig. 3 in reverse. Presumably, the decoded distributions in (e) should line up pretty well with the unit normal distributions centered on the data vectors, since the cross entropy is minimized. Thus the decoded distributions should themselves be close to unit normal in loss-whitened space (though a bit smaller, due to the other terms in the objective). We disregard (d) and go straight back to (b) the posteriors in latent space. In the cartoon, we deliberately draw the posterior blobs as circles, as if they were merely translated on their journey from loss-whitened space to latent space, retaining their near unit variances. Our justification is the inclusion of the prior variance as a variable in Eq. 4 and the scale ambiguity it induces. We deliberately draw the prior as a somewhat elongated ellipse to illustrate that it may conform to fit the latent vectors rather than the other way around, without impacting the value of the objective. Further, we draw the posteriors so that no overlap is introduced from (c) to (b), almost as if the blobs are rigidly connected where they overlap more in (c) and non-rigidly connected where they overlap less, and we gather them all up into a ball (or ellipsoid) as tight as we can without introducing any new overlap, or not much. (If we are reducing dimensions, we further aim to flatten the ellipsoid into a pancake.)

This is a fine story, but there are two questions: does the math support it, and would there be any benefit to reshaping the variances while we are squishing the ball tighter? We prove several statements in Section 7 in support of our proposition, though a complete formal proof is elusive. However, we first demonstrate by experiment that learned posterior variances are inconsistent across different runs with the same data and parameters, and moreover, our constant posterior variance formulations produce improved variational lower bounds compared to their learned counterparts.

(a) Run 1.
(b) Run 2.
(c) Run 3.
Figure 4: Posterior means for a two-dimensional MNIST latent space colored by label, with learned posterior variances and Bernoulli likelihood, for three runs with identical parameters. Different runs produce different clusters because of the stochastic nature of the optimizer, but yield comparable variational lower bounds.
(a) Run 1.
(b) Run 2.
(c) Run 3.
Figure 5: Learned posterior variances for the same runs as Fig. 4, plotting the variance in each latent dimension against the posterior means, relative to the prior scale (1 in this experiment). Note the U shape when plotted against the same mean dimension, and the lack of correlation to the other dimension, consistent with inflating variances near the fringes rather than revealing intrinsic structure in the data.

Figs. 5 and 4 illustrate that learned variances are inflated near the fringes of the latent population, yet different runs with the same data and same parameters may place different data examples near the fringes. For example, the orange cluster corresponding to the label “7” is sometimes closer to the origin and sometimes closer to the fringes. Proximity to the origin seems to have more to do with how much the examples resemble the mean, such as the red cluster (“8”) and the cyan cluster (“3”), rather than some intrinsic structure in the data population. Fig. 6 illustrates that our proposed formulations slightly improve the lower bound, and Figs. 8 and 7 illustrate that generative samples and posterior distributions are qualitatively similar to those of learned posterior variances. See Section 8 for details.

(a) 2 latent dimensions.
(b) 100 latent dimensions.
Figure 6: Progress of the variational lower bound during training (mean of five runs), comparing three methods for determining posterior variance. “Learned”: Eq. 1 with learned posterior variances. “Constant”: proposed; Eq. 4 with constant posterior variance and prior variance from Eq. 13. “BILBO”: proposed; Eq. 15 with constant posterior variance and prior variance from Eq. 13. The proposed methods with constant variance and closed form prior produce a higher variational lower bound than the learned variances method.

(Panel rows: 2 and 100 latent dimensions. Panel columns: Learned, Constant, BILBO.)
Figure 7: Generative samples for MNIST, for three different methods of determining posterior variances. “Learned”: Eq. 1 with learned posterior variances. “Constant”: proposed; Eq. 4 with constant posterior variance and prior variance from Eq. 13. “BILBO”: proposed; Eq. 15 with constant posterior variance and prior variance from Eq. 13. Qualitatively, the three methods are indistinguishable.
(a) Learned.
(b) Constant.
(c) BILBO.
Figure 8: Posterior means for a two-dimensional MNIST latent space colored by label, for three different methods of determining posterior variances. (a) Eq. 1 with learned posterior variances. (b) Proposed; Eq. 4 with constant posterior variance and prior variance from Eq. 13. (c) Proposed; Eq. 15 with constant posterior variance and prior variance from Eq. 13. Note that different runs with the same settings will produce different clusters because of the stochastic nature of the optimizer. The methods with closed form variances (b) and (c) produce a higher variational lower bound than the learned variances method (a).

5.1 The Floating Prior Trick

With all posterior variances assumed equal, Eq. 6 appears to allow us to determine the posterior distribution during training simply by plugging in the desired prior variance and an estimate of the second moment of the posterior means, then trivially solving for the common posterior variance. For example, the standard normal prior leads to a particularly simple expression. Unfortunately, we quickly discover a problem: during early iterations of training there is no guarantee that the solved posterior variance will be positive definite, and any ad-hoc attempt to manipulate it might ruin the objective. Instead, rather than breaking the scale ambiguity by fixing the prior variance, we propose to fix the posterior variance and allow the prior to float, via:

T = E_x[μ μᵀ] + Σ    (13)

This solution is unsurprising since it is equal to the covariance of the latent vectors. We now have all the ingredients to compute the variational lower bound with closed form posterior variance. In practice, we may select any constant posterior variance we wish (with the identity matrix generally being a good choice; see Fig. 12), but we must estimate T. To obtain a robust estimate in a deep network training setting, we propose averaging over a batch, estimating the second moment of the posterior means from the batch and adding the constant posterior variance. This presumes T is a diagonal matrix, but this does not impact the objective as any non-diagonal matrix may be construed to be a diagonal matrix with a transform that preserves the variational lower bound. We may then evaluate Eq. 4 as usual. Figs. 8, 7 and 6 illustrate that employing a constant posterior variance in this way is quantitatively more effective than learned variances, and produces qualitatively similar distributions and generative samples.
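A minimal numpy sketch of the floating-prior estimate over a batch, assuming the diagonal form described above (the batch second moment of the posterior means plus the constant posterior variance); the function and variable names are ours, not the paper's.

```python
import numpy as np

def floating_prior_variance(mu_batch, sigma2=1.0):
    """Diagonal of the floating prior variance T (Section 5.1) for a batch of
    posterior means, with constant posterior variance sigma2 * I:
    T = E[mu mu^T] + sigma2 * I, kept diagonal as described in the text."""
    second_moment = np.mean(mu_batch**2, axis=0)   # E_B[mu o mu], per dimension
    return second_moment + sigma2                  # diagonal entries of T

# Toy usage on a batch of 128 two-dimensional posterior means.
rng = np.random.default_rng(0)
mu_batch = 3.0 * rng.standard_normal((128, 2))
print(floating_prior_variance(mu_batch, sigma2=1.0))
```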

5.2 Batch Information Lower Bound

Consider reformulating the expected variational lower bound as an expectation over a batch. Recalling the optimal prior variance T in Eq. 13, substitution into Eq. 4 with constant posterior variance yields:

(14)

As we must estimate T from the batch anyway, this motivates us to define a simpler batch formulation, the Batch Information Lower Bound (BILBO):

(15)

This formulation is especially economical, since its batch estimator of T can be computed with commonly available library functions. Along with the simple choice of unit posterior variance this yields:

(16)

In practice, batch expectations are averages over batches. Figs. 8, 7 and 6 illustrate that the BILBO is as effective as the variational lower bound, both quantitatively and qualitatively, despite its simplicity.
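As a concrete, hedged illustration: with constant posterior variance σ²I and the floating prior T = E[μμᵀ] + σ²I estimated on the batch, the batch-averaged KL term collapses to (1/2)(log|T| − k log σ²), since the trace and quadratic terms average to k. The sketch below evaluates a BILBO-style batch objective under these assumptions, with a unit-variance Gaussian reconstruction term standing in for the likelihood model; it is not a verbatim transcription of Eq. 15.

```python
import numpy as np

def batch_kl_floating_prior(mu_batch, sigma2=1.0):
    """Batch-averaged KL term with constant posterior variance sigma2 * I and
    the floating (diagonal) prior T = E_B[mu mu^T] + sigma2 * I. The average of
    tr(T^-1 (mu mu^T + sigma2 I)) over the batch equals k, so the KL collapses
    to 0.5 * (log|T| - k * log sigma2)."""
    k = mu_batch.shape[1]
    T_diag = np.mean(mu_batch**2, axis=0) + sigma2        # Eq. 13, diagonal form
    return 0.5 * (np.sum(np.log(T_diag)) - k * np.log(sigma2))

def bilbo_style_objective(x_batch, mu_batch, x_hat_batch, sigma2=1.0):
    """A BILBO-style batch objective: unit-variance Gaussian reconstruction log
    likelihood (a stand-in) minus the batch-averaged KL above."""
    n = x_batch.shape[1]
    recon = (-0.5 * np.mean(np.sum((x_batch - x_hat_batch)**2, axis=1))
             - 0.5 * n * np.log(2.0 * np.pi))
    return recon - batch_kl_floating_prior(mu_batch, sigma2)

# Toy usage with random "reconstructions" in place of a trained decoder.
rng = np.random.default_rng(0)
x = rng.standard_normal((64, 10))
mu = rng.standard_normal((64, 2))
print(bilbo_style_objective(x, mu, x + 0.1 * rng.standard_normal((64, 10))))
```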

6 Decoding with Closed Form Likelihood Variances

For some families of likelihood models, the variance is trivially constrained by the mean (e.g. for a Bernoulli distribution). In such cases, the variance is already in closed form. For other models, the variance may not be obvious, yet we show that closed forms are available. In the following, we consider a Gaussian likelihood with mean and variance given by mappings of the latent vector. Dropping the arguments to the mappings for clarity and writing m = g(z) for the likelihood mean and S for the likelihood variance, we write:

log p(x|z) = −(1/2)[(x − m)ᵀ S⁻¹ (x − m) + log|S| + n log 2π]    (17)

where n is the dimension of the data space. Noting that the loss-whitened space from Section 3 renders the likelihood unit normal, and recalling the Jacobian J from Section 4, we may treat the mapping g as locally linear over the posterior and write the expected log likelihood as:

E_{q(z|x)}[log p(x|z)] ≈ −(1/2)[(x − g(μ))ᵀ S⁻¹ (x − g(μ)) + tr(S⁻¹ J Σ Jᵀ) + log|S| + n log 2π]    (18)

Solving for the stationary point with respect to S yields the closed form likelihood variance:

S = (x − g(μ))(x − g(μ))ᵀ + J Σ Jᵀ    (19)
Remark.

The expected log likelihood Eq. 18 may also be written as:

E_{q(z|x)}[log p(x|z)] ≈ log N(x; g(μ), S) − (1/2) tr(S⁻¹ J Σ Jᵀ)    (20)

This resembles the equation for variable noise as smoothness regularization in rothfuss2019conditional and other works cited there, which argue that the second order Taylor approximation is applicable when the posterior variance is small. We strengthen this claim to state that Eq. 20 holds regardless, even for general likelihoods, due to the smoothness of the optimal mapping g.
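Eq. 20 is the standard delta-method expansion of an expected Gaussian log likelihood, and it is exact when the decoder is linear. The following numpy check, our own illustration under that linearity assumption, compares a Monte Carlo estimate against the closed form.

```python
import numpy as np

def gauss_logpdf(d, S):
    """Log density of N(d; 0, S) for one residual or a batch of residuals d."""
    d = np.atleast_2d(d)
    n = S.shape[0]
    maha = np.einsum('bi,ij,bj->b', d, np.linalg.inv(S), d)
    return -0.5 * (maha + np.log(np.linalg.det(S)) + n * np.log(2.0 * np.pi))

rng = np.random.default_rng(0)
k, n = 2, 4
J = rng.standard_normal((n, k))          # Jacobian of a linear decoder g(z) = J z
S = 0.5 * np.eye(n)                      # likelihood variance
mu_z, Sigma_z = rng.standard_normal(k), 0.1 * np.eye(k)
x = rng.standard_normal(n)

# Monte Carlo estimate of E_{z ~ N(mu_z, Sigma_z)}[ log N(x; J z, S) ].
z = rng.multivariate_normal(mu_z, Sigma_z, size=100_000)
mc = gauss_logpdf(x - z @ J.T, S).mean()

# Right hand side of Eq. 20: log N(x; g(mu_z), S) - 0.5 tr(S^-1 J Sigma_z J^T),
# exact here because the decoder is linear.
rhs = gauss_logpdf(x - J @ mu_z, S)[0] \
      - 0.5 * np.trace(np.linalg.solve(S, J @ Sigma_z @ J.T))
print(mc, rhs)   # the two values agree to Monte Carlo error
```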

6.1 The Variational Lower Bound is Unbounded

When the latent dimensionality is smaller than the data dimensionality, as is commonly the case, the closed form likelihood variance in Eq. 19 is singular, having rank equal to the latent dimensionality (or one more, depending on the residual). A training procedure that learns the likelihood variance is free to select arbitrarily small eigenvalues in the data subspace not spanned by the latent space, and therefore the objective as stated is unbounded. Fortunately, we may still evaluate a bounded log likelihood of a normal distribution with singular variance by generalizing Eq. 17 to use the pseudo-inverse and pseudo-determinant (product of nonzero eigenvalues) Holbrook:2018:DPD:

log p(x|z) = −(1/2)[(x − m)ᵀ S⁺ (x − m) + log pdet(S) + rank(S) log 2π]    (21)

where S⁺ represents the pseudo-inverse and pdet(S) the pseudo-determinant. The solution to this generalized form is still Eq. 19. If the Jacobian were available, we could compute S via Eq. 19 and plug it into Eq. 21. Otherwise, we note the following expectation:

E_{q(z|x)}[(x − g(z))(x − g(z))ᵀ] ≈ (x − g(μ))(x − g(μ))ᵀ + J Σ Jᵀ = S    (22)

We might thus consider estimating S by sampling, then computing pseudo-inverses and pseudo-determinants, but this would be expensive and numerically brittle. Moreover, even if S is positive definite, it may grow arbitrarily large and completely suppress the reconstruction residuals with only a modest logarithmic overhead, resulting in nonconvergence and poor reconstruction. We elaborate and resolve this problem in the next section. Surprisingly, this implies that without further constraint, the classic VAE formulation 2013arXiv1312.6114K is unbounded even in the case of equal latent and data dimensionality (k = n).
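For reference, the generalized log density of Eq. 21 can be evaluated directly from an eigendecomposition. The sketch below is our own illustration of the standard degenerate-Gaussian density: it keeps only the eigenvalues above a tolerance, uses their product as the pseudo-determinant, and projects the residual onto the support of the covariance.

```python
import numpy as np

def degenerate_gaussian_logpdf(x, mean, cov, tol=1e-10):
    """Log density of a (possibly singular) Gaussian using the Moore-Penrose
    pseudo-inverse and the pseudo-determinant (product of nonzero eigenvalues),
    as in the generalization of Eq. 17 to Eq. 21. x is assumed to lie in the
    affine support of cov."""
    w, V = np.linalg.eigh(cov)
    keep = w > tol * w.max()
    rank = int(np.sum(keep))
    pdet = np.prod(w[keep])                      # pseudo-determinant
    d = V[:, keep].T @ (x - mean)                # project residual onto the support
    maha = d @ (d / w[keep])                     # (x - m)^T cov^+ (x - m)
    return -0.5 * (maha + np.log(pdet) + rank * np.log(2.0 * np.pi))

# Toy usage: a rank-1 covariance in a 3-D data space.
u = np.array([1.0, 2.0, 0.0])
cov = np.outer(u, u)                             # singular likelihood variance
print(degenerate_gaussian_logpdf(0.3 * u, np.zeros(3), cov))
```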

This phenomenon has been noted in the literature pierrealex2018leveraging and is the reason that real-world VAE implementations impose artificial priors on the generative model, such as weight decay 2013arXiv1312.6114K, hand-tuned fixed variances DBLP:journals/corr/ZhaoSE17, or bounded eigenvalues (effectively Tikhonov regularization) pierrealex2018leveraging. Fig. 9 illustrates the failure of directly learning the likelihood variance, and the sensitivity of hand-tuned likelihood variances to the intrinsic scale of the data, producing blurry generative samples when the variance is too high, and unrecognizable generative samples when the variance is too low.

(Panel rows: 2 and 100 latent dimensions. Panel columns: learned and constant likelihood variances.)
Figure 9: Generative samples for MNIST with Gaussian likelihood, with learned and constant likelihood variances. The scale factor indicated for each panel is applied to the data vectors. Learned likelihood variances fail to converge without further constraint. For a given constant variance, the quality of generative samples is extremely sensitive to the overall data scale.

6.2 Bounded Aggregate Information Sampling

We now elaborate and resolve the problem of the underdetermined closed form likelihood variance. By Theorem 7.1, we know the posterior means image the residuals (with some scaling). With some manipulation we may state the following relationship to the likelihood variance (detailed in Appendix C):

(23)

Note the role the left hand side plays in the log likelihood (Eq. 17), which appears in the expectation over the posterior in Eq. 4. The problem is now revealed: fixing the posterior variance relies on the scale of the prior to constrain the scale of the likelihood variance, and likewise fixing the prior relies on the scale of the posterior, yet nothing constrains the overall scale of the likelihood variance itself. We therefore propose to both fix the posterior variance and constrain the likelihood variance by the following relationship with a free parameter:

(24)

where the free parameter is the target value, related to the information that we presume is present in our data examples. This is desirable since the information content of a dataset is invariant to scaling or transforming the data, and this particular formulation is stable over a range of values that is invariant to the number of latent dimensions. In practice, we propose the following single-sample estimator:

(25)

In a deep network setting, this allows us to aggregate the bounded observed information (negative log likelihood) over a batch, which we call Bounded Aggregate Information Sampling (BAGGINS). We form a scaling matrix for the batch via Eq. 13 in order to compute the likelihood variance using Eq. 25, which we then use to evaluate Eq. 17. This may be plugged into Eq. 4 as usual, or combined with Eq. 15 to yield:

(26)

with the estimator in Eq. 25 evaluated independently for each example in the batch. In a deep network setting, this eliminates half of the outputs of the likelihood model. Fig. 10 demonstrates invariance to data scaling, with generative samples comparable to hand-tuned likelihoods. Fig. 11 illustrates the effect that different values of the information constant have on the quality of the generative samples.

(Panels: 2 and 100 latent dimensions.)
Figure 10: Generative samples for MNIST with Gaussian likelihood, using BAGGINS to determine likelihood variances (Eq. 25). The scale factor indicated for each panel is applied to the data vectors. Qualitatively, the samples are stable regardless of the data scaling. (Contrast to Fig. 9.)

(Panels: 2 and 100 latent dimensions.)
Figure 11: Generative samples for MNIST with Gaussian likelihood, using BAGGINS to determine likelihood variances. The value indicated for each panel is the information factor, which affects the sample quality.

7 Detailed Analysis

We further support our claims with a few theorems, with discussion.

Theorem 7.1.

In an optimal Variational Auto-Encoder with Gaussian prior and posterior, the posterior means (relative to the prior) image the residuals, scaled by the transpose Jacobian of the mapping from latent space (relative to the prior) to data space.

Proof.

From Eq. 7 we immediately conclude, at optimality:

(27)

where the left hand side is the posterior mean relative to the prior, and the Jacobian is that of the mapping from latent space (relative to the prior) to data space. ∎

Discussion

Consider performing gradient ascent using Eq. 7. The first term imparts a global scaling of the posterior means toward zero, while the second term maintains a scaled image of the residuals of the local neighborhood. Note further that the optimal value of the prior variance scales along with the posterior magnitudes, in a sense normalizing the scaling from the first term. At equilibrium, this combination of forces causes the population of posterior means as a whole to gather together centered about the origin, while maintaining proximity between examples that are close in data space and separation between examples that are disparate in data space.

Theorem 7.2.

In a Variational Auto-Encoder with Gaussian prior and posterior, the posterior precision is equal to the prior precision plus the information (Hessian of the negative log likelihood).

Proof.

From Eq. 8 we immediately conclude, at optimality:

(28)

Treating the decoder as locally linear with Jacobian J in the loss-whitened space, the Hessian of the negative log likelihood with respect to the latent vector is JᵀJ. ∎

Discussion

Consider what happens if the posterior variance is equal for all data examples. Immediately, Eq. 28 implies that the information term is also equal for all data examples. Recalling the scale ambiguity present in Eq. 4, we may introduce a scale constraint without impacting the variational lower bound. For exposition, we break this scale ambiguity by asserting that the Jacobian is orthonormal; in other words the mapping is locally isometric in the neighborhood of each data example, where the local transform is an isometry such as a rotation. This formulation is intuitively satisfying, that the posterior of every data example images an equally varying distribution in data space, yet may locally rotate to fit the data since any isometry is admitted, as a sort of moving principal component analysis. We employ this interpretation in our story in Section 5, where the mapping from latent space to loss-whitened space involves only translation and rotation of vectors in groups based on their proximity.
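Theorem 7.2 is exact in a linear-Gaussian model, where it reduces to the classic identity that the posterior precision equals the prior precision plus JᵀS⁻¹J. A quick numerical check of that special case, as our own illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
k, n = 2, 5
A = rng.standard_normal((n, k))            # linear decoder: likelihood N(x; A z, S)
S = 0.3 * np.eye(n)                        # likelihood variance
T = np.diag([2.0, 0.5])                    # prior variance

# Posterior covariance of z given x, obtained by conditioning the joint Gaussian.
Sxx = A @ T @ A.T + S
Sigma_post = T - T @ A.T @ np.linalg.solve(Sxx, A @ T)

# Theorem 7.2: posterior precision = prior precision + Hessian of -log p(x|z),
# which for this model is A^T S^-1 A.
information = A.T @ np.linalg.solve(S, A)
print(np.allclose(np.linalg.inv(Sigma_post), np.linalg.inv(T) + information))  # True
```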

Theorem 7.3.

In a Variational Auto-Encoder with Gaussian prior and posterior, the posterior distribution is the image of a normal distribution in data space approaching a unit normal distribution in the large data variation limit, up to a loss-whitening mapping.

Proof.

(Sketch.) We recognize that Eq. 20 implies that the posterior is the image of a distribution in the loss-whitened space. Substituting Eq. 28 yields:

(29)

Noting that the information term is positive definite and inspecting Eq. 29, we conclude that the eigenvalues of the imaged variance are less than unity. While the Jacobian is rank-deficient when the latent dimension is smaller than the data dimension, we may posit a full-rank completion by introducing non-zero eigenvalues in its null space. (This is a matrix spectrum completion problem, but the specific completion strategy is not important.) We verify that the posterior is still the image of this full-rank completion, where the pullback employs the Moore-Penrose pseudo-inverse. Recalling the definition of the Jacobian in Eq. 12 as a moving least squares fit, we conclude that it models the variation present in the data population (within a subspace of dimension possibly less than the data dimension), as if the local fit were representative of the optimal mapping over the entire population. Therefore, as the variation in the data population increases, the posterior approaches the image of a unit normal distribution in the loss-whitened data space (as Eq. 29 approaches identity). ∎

Discussion

Eq. 29 approaching identity in the large data variation limit supports our story in Section 5 that decoded distributions should nearly align with unit distributions in the loss-whitened space. Further, it implies that all data examples are represented equally in data space. Intuitively, this suggests all data examples should be represented equally in latent space as well, by equal posterior variances imaging equal distributions.

8 Experiments

Though our analysis of the variational lower bound is agnostic to the particular optimization procedure employed, we perform experiments in a deep network setting using simple network architectures. In all experiments the encoder is a four layer densely connected network with a fixed number of hidden units, ReLU activations in the middle layers, and no activation on the output layer for posterior means. When learned posterior variances are employed, an additional output layer is attached in parallel with the first, with SoftPlus activations, and interpreted as standard deviations. The decoder is similar to the encoder (apart from the input and output shapes), with four dense layers, ReLU activations in the middle layers, and no activation on the output layer. The output is interpreted as logits when a Bernoulli likelihood model is employed, and as means when a Gaussian likelihood model is employed. When learned likelihood variances are employed, an additional output layer is attached in parallel with the first, with SoftPlus activations, and interpreted as standard deviations.
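For concreteness, a minimal Keras sketch of an encoder of the kind just described. The hidden width (512) and data dimension (784, for flattened MNIST) are placeholder values of ours, not necessarily the paper's, and the exact layer configuration may differ.

```python
import tensorflow as tf

def make_encoder(data_dim=784, latent_dim=2, hidden=512, learned_variance=True):
    """Four-layer dense encoder: ReLU activations in the middle layers, a linear
    output layer for the posterior means, and an optional parallel SoftPlus head
    interpreted as posterior standard deviations (Section 8). The hidden width
    is a hypothetical placeholder."""
    x = tf.keras.Input(shape=(data_dim,))
    h = x
    for _ in range(3):                                    # three hidden layers
        h = tf.keras.layers.Dense(hidden, activation='relu')(h)
    mean = tf.keras.layers.Dense(latent_dim)(h)           # posterior means (no activation)
    if not learned_variance:
        return tf.keras.Model(x, mean)                    # half the outputs eliminated
    std = tf.keras.layers.Dense(latent_dim, activation='softplus')(h)  # learned std devs
    return tf.keras.Model(x, [mean, std])

# With closed form posterior variances only the mean head is needed.
encoder = make_encoder(learned_variance=False)
```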

Whenever possible, probability density calculations are implemented using readily available TensorFlow libraries tensorflow2015-whitepaper in order to avoid unintentional subtle differences in implementation details across the various methods tested. For the MNIST dataset lecun-mnisthandwrittendigit-2010, training is performed using the training set of 60,000 examples and all reported lower bound estimates and scatter plots are produced using the test set of 10,000 examples. We run experiments with 2 and 100 latent dimensions using the Adam optimizer Adam with a fixed learning rate and batch size, except for the Gaussian likelihood experiments, which we ran for a different number of epochs. For the CelebA dataset liu2015faceattributes we use the network architecture from DBLP:journals/corr/Barron17 with the RGB Pixel model, trained for a fixed number of iterations.

8.1 Experiments with Posterior Variances

We first trained a VAE using standard techniques (with Eq. 1), with learned posterior variances and Bernoulli likelihood. Fig. 4 illustrates that different runs with identical parameters produce different clusters. This is not surprising since the optimizer is stochastic; however, the distance from the origin for any label cluster is also inconsistent across runs (apart from those most resembling the mean data vector; see Section 5), and it is this distance that relates to the magnitude of the posterior variance. Yet the variational lower bound remains consistent, as does the spatial distribution of variances in latent space (Fig. 5). This suggests that the optimal variational lower bound is not a product of assigning specifically meaningful posterior variances to specific data examples.

We then trained a VAE using a constant posterior variance (with Eq. 4), and another using the BILBO objective (Eq. 15), both using the closed form prior variance from Eq. 13. Fig. 6 graphs the variational lower bound comparing the three methods: learned posterior variance, constant posterior variance, and BILBO (effectively constant posterior variance). The learned method lags significantly behind the two proposed methods, validating the closed form solution. Fig. 7 shows little qualitative difference between generative samples from the three methods, and Fig. 8 shows qualitatively similar distributions over the latent space. (See Figs. 17, 16 and 13 for additional results.)

We noted that the optimizer was sensitive to the scale of the data space and latent space, even in cases where the objective ought to have mathematically equivalent stationary points. Fig. 12 graphs the progress of the variational lower bound for several runs that include a hard coded scaling at the end of the encoder network (and inverse scaling at the beginning of the decoder network), and scaling the constant posterior variance by several different factors. All of these runs ought to have identical variational lower bound, by Eq. 4. We observe that the performance of the optimizer is best when the posterior variance is scaled commensurately with the hard coded scaling, suggesting a sensitivity to the initial conditions. Further, we observe that the learned posterior variance method always performs most similarly to (but a little worse than) the unscaled constant posterior variance, which validates that the higher performance of the proposed method is not merely due to lucky initialization.

We performed further experiments on the more complex CelebA dataset. Figs. 21 and 20 illustrate results comparing the standard variational lower bound objective vs. the BILBO. All experiments yielded comparable ELBO and loss. We found that choosing unit posterior variance resulted in large optimal prior variances, which seems to slightly affect the quality of generative samples. We attribute this to the complexity of the dataset, hence choosing a smaller posterior variance is more appropriate. This difference is again related to the sensitivity of the optimizer to scale, despite mathematically equivalent stationary points. Thus we would suggest that unit posterior variance is likely a good choice for simple data, with lower values for more complex data.

(a) Unmodified network.
(b) Encoder scaled by one constant factor.
(c) Encoder scaled by another constant factor.
Figure 12: Progress of the variational lower bound during training (mean of three runs) with hard coded scaling in the encoder and decoder to highlight the impact of initialization. Each constant-variance curve indicates BILBO (Eq. 15) with the stated posterior variance; “learned” indicates Eq. 1 with learned posterior variances. Observe that the learned method is always most comparable to the unscaled constant variance regardless of the scaling factor, and the best performing method is always the constant posterior variance that matches the scaling factor, suggesting that the difference is due to the initialization of network weights rather than the objective.

8.2 Experiments with Likelihood Variances

We experimented on a VAE with Gaussian likelihood, training using the BILBO objective (Eq. 15) as it performed well in the experiments on posterior variances. We first trained a VAE with learned likelihood variances, without any additional constraint such as weight decay. These experiments did not converge, with the lower bound fluctuating up and down and the latent vector scatter plots resembling a whirling flock of starlings. Generative samples are very blurry and close to the mean (see Fig. 9, top row) in agreement with the prediction that reconstruction residuals are deemphasized. We then trained a VAE with constant isotropic likelihood variances, producing a bounded lower bound. We included an artificial scaling factor in the data vectors to test the sensitivity of the constant variance to changes in the scale of the data. (Note the artificial scaling would not be valid for a distribution with a fixed range, such as Bernoulli, but is valid for a Gaussian.) Fig. 9 illustrates that a modest change in the scale produces a dramatically different generative sample quality. This is most obvious for high latent dimensionality, perhaps due to the bottleneck of low latent dimensionality further regularizing the space.

We then repeated the same experiments, this time using BAGGINS to determine likelihood variances (Eq. 25). Fig. 10 illustrates that BAGGINS produces generative samples that are insensitive to the scale of the data. We selected the information parameter in Eq. 25 by performing a small parameter sweep and selecting one by inspection. Fig. 11 illustrates the effect of different values of the information parameter, with lower values appearing somewhat blurry and higher values breaking apart, presumably because the posteriors do not overlap enough examples to effectively interpolate between them. Again we see less impact when the latent dimensionality is low, further suggesting the bottleneck is providing some useful regularization. (See Figs. 19, 18 and 15 for additional results.)

In all experiments with Gaussian likelihood, generative samples are plausible but the high dimensional latent space results are not as realistic as the experiments with Bernoulli likelihood, which perhaps better models the perceptual distance for pen and ink. This difference is not apparent in the low latent dimensionality experiments.

9 Limitations and Future Work

We have only provided arguments and experiments supporting our proposition of constant posterior variances in the case of Gaussian prior and posterior. It may be possible to extend these arguments to other families of distributions. Further, we have been unable to complete a formal proof of the proposition, instead relying on the experiments to validate our arguments. A formal proof remains as future work.

While BILBO may be immediately applicable to many VAE implementations, BAGGINS is only applicable to Gaussian likelihood models. Extending the method to other likelihood models may be possible (such as perceptual losses or discriminative losses) since BAGGINS leverages the relationship between residuals (i.e. losses) and a posited loss-whitened space, which in principle exists for all well behaved likelihood models. Developing this into a general method may enable simpler training procedures that are less sensitive to hyperparameter tuning.

10 Conclusions

We have presented arguments and experiments supporting our proposition that the posterior variances (in the encoder) in a Variational Auto-Encoder may simply be constant without impacting the optimal variational lower bound, in the case of Gaussian prior and posterior. In a deep network setting, this eliminates half the outputs of the encoder network while producing an improved variational lower bound in experiments. Our proposition leads to a simplified reformulation for training VAEs in batches, which we call the Batch Information Lower Bound (BILBO).

We have also presented a closed form solution to the variances for Gaussian likelihood (in the decoder), showed that it is unbounded, and proposed a bounded solution based on a presumed information content in the data population. In a deep network setting, this eliminates half the outputs of the decoder network, with improved invariance to the scale of the data compared to constant likelihood variances. We have proposed an associated estimator for training in batches, which we call Bounded Aggregate Information Sampling (BAGGINS).

While our experiments employ basic network architectures and data sets, we hope our exposition provides useful insight into the action of Variational Auto-Encoders, which remain popular in recent state of the art literature.

Acknowledgements

We wish to thank Jon Barron, Michael Broxton, Paul Debevec, John Flynn, and Matt Hoffman.

Appendix A Derivative of Expected Log Likelihood w.r.t. Posterior Mean

We expand the derivative of the expected log likelihood with respect to the posterior mean as:

(30)

Note that one summation term vanishes by definition. We may change the expectation in Eq. 30 to the standard normal domain as follows:

(31)

Let a local linear model be defined as a fit of the mapping g windowed by the posterior. By the apparent smoothness of Eq. 10 we expect the error of this approximation to be small. Referring to IMM2012-03274 (note that IMM2012-03274 Eq. (394) contains an error; the expression is transposed), we recall the required Gaussian expectation identities and perform the following manipulations:

(32)

Therefore we may state:

(33)

Appendix B Derivative of Expected Log Likelihood w.r.t. Posterior Variance

We expand the derivative of the expected log likelihood with respect to the posterior variance as:

(34)

Note that one summation term vanishes by definition. We may change the expectation in Eq. 34 to the standard normal domain as follows:

(35)

Let a local linear model be defined as a fit of the mapping g windowed by the posterior, as in Appendix A. By the apparent smoothness of Eq. 10 we expect the error of this approximation to be small. Referring to IMM2012-03274 (note that IMM2012-03274 Eq. (348) contains an error; a factor of one half is missing, though it is present in the related Eq. (396)), we recall the required Gaussian expectation identities and perform the following manipulations:

(36)

Therefore we may state:

(37)

Appendix C Derivation of Eq. (23)

Starting from Eq. 7 at optimality and asserting a full rank Jacobian, we state:

(38)
(39)

The factor accounts for the reintroduction of the dimensions on the left hand side that were previously masked by the rank-deficient Jacobian. We leverage the Jacobian to evaluate the mapping at the posterior mean rather than at the sampled latent vector:

(40)

Assuming a constant posterior variance and equating the partial derivatives from Eqs. 8 and 6 yields, after some manipulation:

(41)

Discouragingly, the last term is undefined when the residual vanishes. However, observe that in the limit as the residual approaches zero, the numerator approaches zero at the same rate and hence the last term approaches unity. This suggests the following single-sample approximation:

(42)

We verify that the expectation agrees with the maximization of Eq. 17 with a full rank likelihood variance.

Appendix D Additional Figures

Figure 13: Posterior means for a two-dimensional MNIST latent space colored by label, with Bernoulli likelihood, trained using BILBO with three different values of the constant posterior variance. The distributions are qualitatively similar.
Figure 14: Posterior means for a two-dimensional MNIST latent space colored by label, with Gaussian likelihood with constant variance, for three scaling factors applied to the data vectors. Qualitatively, the distributions pinch towards the origin as the scaling factor increases, becoming more radial.
Figure 15: Posterior means for a two-dimensional MNIST latent space colored by label, with Gaussian likelihood using BAGGINS to determine likelihood variances, for three values of the information parameter. Qualitatively, the gaps between label clusters tighten up as the information parameter increases.
Figure 16: Latent manifold and scatter plots for a two-dimensional MNIST latent space with Bernoulli likelihood and learned posterior variances. The scatter plot at lower right is a single sample from the posterior, while the one at upper right is its mean.
Figure 17: Latent manifold and scatter plots for a two-dimensional MNIST latent space with Bernoulli likelihood, trained using BILBO with constant posterior variance. The scatter plot at lower right is a single sample from the posterior, while the one at upper right is its mean.
Figure 18: Latent manifold and scatter plots for a two-dimensional MNIST latent space with Gaussian likelihood using constant likelihood variance. The scatter plot at lower right is a single sample from the posterior, while the one at upper right is its mean.
Figure 19: Latent manifold and scatter plots for a two-dimensional MNIST latent space with Gaussian likelihood using BAGGINS to determine likelihood variances. The scatter plot at lower right is a single sample from the posterior, while the one at upper right is its mean.
(a)
(b)
(c)
(d)
Figure 20: Reconstructions for input data (a, c) using a VAE trained with (b) standard practice using ELBO and (d) BILBO. The two methods produce qualitatively similar reconstructions.
(a) Samples from standard normal prior; ELBO loss.
(b) Samples from optimal normal prior; ELBO loss.
(c) Samples from optimal prior; BILBO loss with unit posterior variance.
(d) Samples from optimal prior; BILBO loss with a smaller constant posterior variance.
Figure 21: Generative samples for CelebA for different methods of determining posterior variances and sample distributions. (a) Standard VAE practice using ELBO loss and sampling the standard normal prior. (b) ELBO loss but sampling from the optimal prior from Eq. 13. (c) BILBO loss with unit constant posterior variance. (d) BILBO loss with a smaller constant posterior variance. All methods produce comparable ELBO. Sampling the standard normal prior (a) exhibits less variation than the optimal normal prior (b), which is qualitatively more similar to the variation in the reconstructions (Fig. 20). Due to the significant variation in the data, unit posterior variance (c) produces a large optimal prior variance which appears to slightly exaggerate the samples, compared to a smaller constant posterior variance (d), which is qualitatively comparable to (b).