Variational autoencoders (VAE) are described by Goodfellow2016 as an “excellent manifold learning algorithm” because the model is forced “to learn a predictable coordinate system that the encoder can capture”. VAE achieve this by using a regularization term that drives the model towards low-energy regions. According to LeCun2020, regularization as in the VAE case helps to keep the energy function smooth, which is desirable for the model to learn meaningful dependencies (e.g. to fill in blanks). In contrast, maximum likelihood approaches push down the energy surface only at training sample regions. Therefore, their inherent objective is “to make the data manifold an infinitely deep and infinitely narrow canyon” (see LeCun2020).
Learning meaningful dependencies is a desirable concept for advancing deep learning. Hence, there is an interest in understanding and developing VAE. Recent work aims at explaining and overcoming well-known pitfalls of VAE, such as spurious global optima (see Dai2019), posterior collapse (see Lucas2019 and VandenOord2017) or prior-posterior mismatch (see Dai2019 and Ghosh). In these works, a specific decoder distribution is assumed in order to analyse the loss surface of the model or to study its training behaviour.
In this work, we answer the following research question:
Is there a way to generalize the loss analysis of VAE based on the decoder distribution?
For this, we establish a connection between VAE and generalized linear models (GLM) and provide a framework for analysing VAE based on the decoder distribution. By doing so, we generalize the works of Dai2018, Lucas2019 and Sicks2020. We provide an approximation to the evidence lower bound (ELBO), which is exact in the Gaussian distribution case (see also Dai2018 and Lucas2019) and a lower bound in the Bernoulli distribution case (see also Sicks2020). Further, we analyse the maximum likelihood estimators (MLE) of this approximation and find that the choice of decoder activation function does not affect the maximum. Using the MLE as initialization, we also show that the training performance of a VAE net can be enhanced.
As GLM are based on exponential dispersion families (EDF), the analysis rests on the distribution assumption for the decoder model. This is favourable as VAE are applied in a variety of fields with different distribution assumptions, e.g. anomaly detection using the Gaussian distribution (see Xu2018a), molecule representation using the Bernoulli distribution (see Blaschke2018), image compression using the Bernoulli distribution (see Duan2019) or multivariate spatial point processes using the Poisson distribution (see Yuan2020).
This work is structured as follows: In Section 2, we give a motivation as well as an overview of related work. In Section 3, we present the theoretical background and our results that are a consequence of connecting VAE and GLM. Afterwards in Section 4, we provide simulations validating our theoretical results and show that these can be used for deep VAE. In Section 5, we summarize our contributions and point out future research directions.
2 Motivation and Related Work
Throughout our theoretical analysis, we consider VAE with an arbitrary encoder and a one-layer decoder. Our main contribution is to interpret this one-layer decoder as a GLM (see Section 3.2). This perspective allows us to identify well-known activation functions as inverses of link functions of GLM. Therefore, we are able to provide a systematic generalization of the loss analysis for VAE based on the assumption that the distribution of the decoder belongs to an EDF.
Even though the decoder architecture is arguably simple, analysing the critical points and the loss landscapes of VAE helps to understand these models better. Using this architecture and approach, Dai2018 show connections to probabilistic principal component analysis (pPCA), Lucas2019 analyse posterior collapse and Sicks2020 bound the ELBO for a Bernoulli VAE from below.
Given a one-layer decoder with a Gaussian distribution assumption, the model (more specifically, the expressions defined as in Kingma) becomes tractable. Lucas2019 use this for their analysis and show that in this case the ELBO becomes exact. Still, the main motivation for VAE is the assumption of intractability (see Kingma), where the recognition model is used to approximate the intractable posterior. In the case of tractability, as for the Gaussian model, an established alternative is the EM algorithm.
Under the one-layer decoder assumption, the model does not necessarily become tractable. For example, for a Bernoulli decoder the model stays intractable (see Aitchison1980).
Our framework can serve as a stepping stone to theory for more rigorous models as it yields
known results for the “good to handle” Gaussian case. To be more specific, we provide an equivalent optimization target to those in Dai2018 and Lucas2019.
interpretational guidance for intractable settings like the Bernoulli case. Our results for the sigmoid activation (strictly speaking, the logistic function; since it has become common to call this special case the sigmoid function, we keep that term) are equivalent to the results in Sicks2020. Even though the resulting target is only approximative, our simulation in Section 4 for the Bernoulli case shows reasonable results on real data (i.e. the frey dataset).
the possibility to use GLM tools to further analyse VAE and to extend the analysis to include more EDF distributions (e.g. the Gamma distribution).
In the following, we give an overview of the literature on analysing autoencoders and VAE, as well as on GLM used in the context of neural nets.
show for autoencoders with linearised activations that the optimal solution is given by the solution of a singular value decomposition (SVD). Baldi1989a extend these results and analyse the squared error loss of autoencoders at all critical points.
Dai2018 analyse Gaussian VAE with the same architecture as in this work. They show connections to pPCA and robust PCA as well as smoothing effects for local optima of the loss landscape.
provide analytical forms for critical points and characterize the values of the corresponding loss functions as well as properties of the loss landscape for one-hidden layer ReLU autoencoders.
Dai2019 analyse deep Gaussian VAE. Assuming the existence of an invertible and differentiable mapping between low-rank manifolds in the sample space and the latent space, they show that spurious global optima exist which do not reflect the data-generating manifold appropriately.
Kunin2019 consider regularizations of linear autoencoders and analyse the loss landscape for different regularizations. They show that regularized linear autoencoders are capable of learning the principal directions and have connections to pPCA.
Lucas2019 extend the results of Dai2018 to analyse posterior collapse. They do this by analysing the difference between the true marginal and the ELBO, which is possible under Gaussian assumptions as the model becomes tractable. Furthermore, they provide experimental results on posterior collapse for deep non-linear cases.
Sicks2020 formulate a lower bound for the ELBO of a Bernoulli VAE with the same architecture as in this work. They use the MLE to derive an initialization scheme and empirically compare it on synthetic data.
describes connections between GLM and neural network regression models by interpreting the last layer of a neural net as a GLM. With this, he is able to use a regularized neural net to learn representative features that improve a standard GLM. Furthermore, favourable properties (such as, in an actuarial context, the “balance property”) are achieved with a proposed hybrid model.
3 Theoretical Background and Advancements
For observations of a random variable (r.v.), we consider a VAE as in Kingma with the ELBO given by
Interpreting the expectation in this expression as autoencoder yields the encoder and the decoder . We further assume that the encoder produces and , with , where and are arbitrary functions including affine transformations. The decoder is one-layered with a not necessarily linear activation function , weights and biases .
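As a concrete, purely illustrative sketch of this architecture, the following pure-Python snippet wires together an arbitrary encoder, the reparameterization z = mu + sigma * eps, and a one-layer decoder with a sigmoid activation. All dimensions, the affine encoder and every parameter value are hypothetical placeholders, not the paper's setup.

```python
import math
import random

random.seed(0)

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

# Hypothetical toy dimensions: d = 3 (data), k = 2 (latent).
d, k = 3, 2

# Arbitrary encoder: any functions mapping x to a latent mean and a
# positive scale (here a fixed affine map, chosen only for illustration).
def encoder(x):
    mu = [0.5 * x[0] - 0.1 * x[1], 0.2 * x[2]]
    sigma = [0.3, 0.3]
    return mu, sigma

# One-layer decoder: linear predictor eta = W z + b followed by a
# not necessarily linear activation (sigmoid here, as for Bernoulli data).
W = [[0.4, -0.2], [0.1, 0.3], [-0.5, 0.2]]  # d x k weights (placeholders)
b = [0.0, 0.1, -0.1]                        # d biases (placeholders)

def decoder(z):
    out = []
    for i in range(d):
        eta = sum(W[i][j] * z[j] for j in range(k)) + b[i]
        out.append(sigmoid(eta))
    return out

# Reparameterization: z = mu + sigma * eps with eps ~ N(0, I).
x = [1.0, 0.5, -0.5]
mu, sigma = encoder(x)
eps = [random.gauss(0.0, 1.0) for _ in range(k)]
z = [mu[j] + sigma[j] * eps[j] for j in range(k)]
x_hat = decoder(z)  # each entry lies in (0, 1): a Bernoulli mean
```

With a sigmoid activation every decoder output is a valid Bernoulli mean, which is the distributional reading developed in Section 3.2.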
In the following, we first provide the theoretical background on GLM and EDF. Then, we show how the decoder can be interpreted as GLM. Finally, given this new perspective on VAE, we present theoretical results and analyse MLE for the resulting approximation.
3.1 The GLM and EDF
introduce GLM, providing a generalization of linear statistical models and thus of well-known statistical tools, such as analysis of variance (ANOVA), deviance statistics and MLE (see also McCullagh1989). GLM consist of three parts: a random component with a distribution belonging to the EDF, a systematic component given as an affine mapping of features used to estimate , and a link function connecting these two components. Hereby, the EDF is defined by the structure of the density.
We say the distribution of given belongs to the exponential dispersion family (EDF), if the density can be written as
where and are one-dimensional functions. is also called the log-normalizer. It ensures that integration w.r.t. the density in (2) over the support of is equal to one. is the location parameter. is an open, convex space with . is called the dispersion parameter and independent of .
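The density structure referred to here is the standard EDF form; assuming the stripped symbols follow the common notation (e.g. McCullagh1989), it reads:

```latex
f(x \mid \theta, \phi)
  = \exp\!\left( \frac{x\,\theta - b(\theta)}{\phi} + c(x, \phi) \right),
```

where $b$ is the log-normalizer (log-partition function), $\theta \in \Theta$ the location parameter and $\phi > 0$ the dispersion parameter.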
The EDF is studied in BarndorffNielsen2014, Jorgensen1986 and Jorgensen1987a. Several well-known distributions, like the Gaussian, Bernoulli and Poisson distribution, belong to this family (see Table 1 in supplementary material Section A for the respective representations).
Let the distribution of a one-dimensional r.v. given belong to the EDF. Then, it holds and . Furthermore, the log-partition function is convex and possesses all derivatives.
The proof for the unconditional case is performed in Theorem 7.1, Corollary 7.1 and Theorem 8.1 in BarndorffNielsen2014. The statement for the conditional case is found analogously.
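As a quick numerical illustration of Lemma 1 (a sketch, not part of the paper's code), the following checks the EDF identities for the Bernoulli member, whose log-partition function is b(theta) = log(1 + e^theta) with dispersion phi = 1: the first derivative recovers the mean p and the second derivative the variance p * (1 - p).

```python
import math

# Bernoulli log-partition b(theta) = log(1 + e^theta); under the EDF
# identities of Lemma 1, E[X] = b'(theta) and Var(X) = phi * b''(theta)
# with phi = 1 in the Bernoulli case.
def b(theta):
    return math.log(1.0 + math.exp(theta))

def num_d1(f, t, h=1e-5):
    # central finite difference for the first derivative
    return (f(t + h) - f(t - h)) / (2.0 * h)

def num_d2(f, t, h=1e-4):
    # central finite difference for the second derivative
    return (f(t + h) - 2.0 * f(t) + f(t - h)) / (h * h)

theta = 0.7
p = 1.0 / (1.0 + math.exp(-theta))   # Bernoulli success probability

mean_from_b = num_d1(b, theta)       # close to p
var_from_b = num_d2(b, theta)        # close to p * (1 - p)
```

The same check works for any EDF member once its log-partition function is written down.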
3.2 The decoder as GLM
We interpret the decoder as a GLM. To this end, we assume that the independent and identical marginal distributions of given belong to an EDF, where they share the same . With a realization from the encoder, the parameters of the decoder are given by , with .
Part of the training objective of a VAE implementation is to construct outputs that resemble the inputs to a high degree. Therefore, the decoder reports the value
the (conditional) expectation given a latent representation created by the encoder. Training of the decoder is conducted by updating the parameters in the set . In accordance with GLM (see McCullagh1989), we define the linear predictor , with and . Given an activation function , we set
We apply functions with one-dimensional support to vectors, interpreted as element-wise operations.
and according to Lemma 1, we have
With the linear predictor introduced, we have for our decoder parameter set .
Obviously, is a special choice for which we have . We call this choice of the “canonical activation”. This name originates from the “canonical link function”. As mentioned in Section 3.1, for GLM we define a link function connecting the systematic component of the model to the random component . This function is called canonical if . Hence, the canonical activation is the inverse of the canonical link.
A canonical function has the advantage of an affine mapping from the net parameters to the EDF parameters. This will simplify the derivation of our theoretical results below. Apart from canonical activations, we want to provide theory for a more general case. We consider any activation that scales the systematic component to , s.t. for we have
We call activations satisfying this property “linearly canonical activations”. Obviously, for canonical activations we have .
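To make this concrete, the following sketch lists the canonical activations (inverse canonical links) for three EDF members and verifies the affine identity tanh(eta) = 2*sigmoid(2*eta) - 1, which is the kind of linear rescaling that makes tanh a linearly canonical activation for the Bernoulli case (cf. Example 2 in the supplementary material). The dictionary layout is ours, not the paper's.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def logit(p):
    # Canonical link of the Bernoulli EDF: theta = log(p / (1 - p)).
    return math.log(p / (1.0 - p))

# Canonical activations (inverse canonical links) for three EDF members:
canonical_activation = {
    "gaussian": lambda eta: eta,   # identity link
    "bernoulli": sigmoid,          # logit link
    "poisson": math.exp,           # log link
}

# The sigmoid is the inverse of the canonical (logit) link:
p = 0.3
assert abs(sigmoid(logit(p)) - p) < 1e-12

# tanh as a linearly canonical activation for the Bernoulli case:
# tanh(eta) = 2 * sigmoid(2 * eta) - 1, i.e. an affine transform of
# the sigmoid applied to a linearly rescaled predictor.
eta = 0.8
assert abs(math.tanh(eta) - (2.0 * sigmoid(2.0 * eta) - 1.0)) < 1e-12
```

The identity shows why the sigmoid and the tanh lead to equivalent theory: they differ only by an affine map of the predictor and of the output.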
Though this theory is not new, we want to stress that commonly used activation functions in machine learning are linearly canonical activations. Two examples in Section A of the supplementary material illustrate this for the sigmoid and the tanh activation.
Unfortunately, the canonical activation is not a desirable candidate for a VAE model for all EDF distributions. The canonical activation of the Gamma distribution (which also belongs to the EDF) is given by , with support in . It is not favourable to limit the support, since this would place restrictions on the linear predictor and hence on the weights and biases. A common workaround from GLM theory is to use the activation , but this function does not produce the desired linear mapping in (3).
3.3 Approximating the ELBO
In this section, the new interpretation of the decoder as a GLM allows us to approximate the ELBO (see Proposition 1) by linearising the log-partition function . We can use the results to make statements about the choice of the activation function (see Section 3.4), to derive weight and bias initializations (see Section C.3 of the supplementary material and the results in Section 4), or to monitor the training of an intractable VAE with a tractable reference point (see Section 4).
Consider N observations of a d-dimensional r.v. X. Assume the independent and identical marginals of given belong to the same EDF distribution with functions and as in (2). Each marginal distribution has a parameter () and is the same. For a VAE as in Section 3, we assume the one-layer decoder has a linearly canonical activation function , s.t. , where , , and .
Then, there exists an approximative representation for as in (1), around an arbitrary point , that admits an optimal solution for and , such that it can be written as
where and are constant in and .
Setting , , and , we have , and
where and denotes the sample mean.
See Section B.2 of the supplementary material for the proof. Usually, we set . See Table 2 in Section A of the supplementary material for different EDF distributions and linearly canonical activation functions as well as the parameters for Proposition 1 with this choice of .
To derive this alternative expression, we made use of a Taylor approximation. Therefore, we can quantify the approximation error via the corresponding remainder.
Under the assumptions of Proposition 1, the error to the original ELBO in (1) is given by , where denotes the Taylor remainder for approximating the log-partition function at the point with a 2nd-degree Taylor approximation. If we assume to originate from a Gaussian distribution, then for all . Consider : if we assume to originate from a Binomial distribution, then ; if we assume to originate from a Poisson distribution, then .
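For the Bernoulli case the sign of the remainder can be checked numerically. Writing b(theta) = log(1 + e^theta) = log 2 + theta/2 + log cosh(theta/2) and using log cosh(u) <= u^2 / 2, the 2nd-order expansion at 0 upper-bounds b everywhere, so replacing b by its expansion can only push the ELBO value down, in line with the lower-bound statement. The expansion point 0 is our illustrative choice here; this is a sketch, not the paper's code.

```python
import math

# Bernoulli log-partition and its 2nd-order Taylor expansion at theta0 = 0:
# b(theta) = log(1 + e^theta) = log 2 + theta/2 + log cosh(theta/2),
# T2(theta) = log 2 + theta/2 + theta^2 / 8.
# Since log cosh(u) <= u^2 / 2, we have b(theta) <= T2(theta) everywhere.
def b(theta):
    # numerically stable log(1 + e^theta)
    return max(theta, 0.0) + math.log1p(math.exp(-abs(theta)))

def taylor2(theta):
    return math.log(2.0) + theta / 2.0 + theta * theta / 8.0

thetas = [x / 10.0 for x in range(-80, 81)]
assert all(b(t) <= taylor2(t) + 1e-12 for t in thetas)

# The remainder vanishes at the expansion point itself:
assert abs(b(0.0) - taylor2(0.0)) < 1e-12
```

In the Gaussian case the log-partition function is exactly quadratic, so the same construction has zero remainder, matching the first statement of the corollary.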
See Section B.3 of the supplementary material for the proof. Since in our setting is Gaussian under , the moments for given parameters can be calculated straightforwardly.
Corollary 1 highlights how our theory generalizes the works of Dai2018, Lucas2019 and Sicks2020. Under the Gaussian assumption with a linearly canonical activation, is exact. Then, Proposition 1 yields a result equivalent to those given by Dai2018 and extended by Lucas2019. For the Bernoulli distribution, the expected difference is positive. Hence, with from Proposition 1 we approximate the ELBO in (1) from below. The result for the sigmoid activation is the same lower bound as reported in Sicks2020. Thus, the ELBO is bounded from both sides, as its values are naturally negative, and we have
3.4 MLE and optimal solution
In the following, we analyse the optimal values of and for and highlight important properties. To do so, we rewrite the objective and obtain a similar representation as in Tipping1999,
with and . According to Tipping1999, the MLE for is given by the sample mean of . Therefore, with , becomes the sample covariance (we denote it as ). For later purposes, we also define the sample covariance matrix of as . With we denote the (ordered) eigenvalues of the matrix , and the same for and . It holds for , with and from Proposition 1. In a similar way to Tipping1999, we can derive the MLE of as
where is composed of eigenvectors of the matrix . These eigenvectors are associated with the biggest eigenvalues . is a diagonal matrix with entries
is an arbitrary rotation matrix, which implies that our optimal solution is invariant to rotations. Dai2018 show this as well as invariance to permutations in their Theorem 2.
It is possible to have , and the term (for ) controls how many columns of the matrix are zero. We interpret this as a consequence of the auto-pruning property of VAE: if the data signal is not strong enough, it is pruned away.
Further, if exists (e.g. in the Gaussian case), the MLE for is given by . We interpret this as the variance lost due to the dimension reduction.
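The structure of this MLE mirrors the pPCA solution of Tipping1999. The following toy sketch, with a made-up 2x2 sample covariance, latent dimension q = 1 and the rotation set to the identity, illustrates it: sigma^2 is estimated by the average of the discarded eigenvalues and the weight column is the leading eigenvector scaled by sqrt(lambda_1 - sigma^2). For d = 2 and q = 1 the implied model covariance reproduces the sample covariance exactly.

```python
import math

# Made-up sample covariance S for d = 2 (not from the paper).
S = [[3.0, 1.0], [1.0, 2.0]]

# Eigenvalues of a symmetric 2x2 matrix via trace and determinant.
a, b_, c = S[0][0], S[0][1], S[1][1]
tr, det = a + c, a * c - b_ * b_
disc = math.sqrt(tr * tr - 4.0 * det)
lam1, lam2 = (tr + disc) / 2.0, (tr - disc) / 2.0  # ordered lam1 >= lam2

# Leading eigenvector (valid for b_ != 0), normalized.
v = [b_, lam1 - a]
n = math.hypot(v[0], v[1])
u1 = [v[0] / n, v[1] / n]

# pPCA-style MLE with q = 1: sigma^2 is the average of the d - q
# discarded eigenvalues; the weight column is u1 scaled by
# sqrt(lam1 - sigma^2) (rotation R taken as the identity).
sigma2 = lam2
scale = math.sqrt(lam1 - sigma2)
w = [u1[0] * scale, u1[1] * scale]

# Model covariance W W^T + sigma^2 I; for d = 2, q = 1 it equals S.
C = [[w[0] * w[0] + sigma2, w[0] * w[1]],
     [w[1] * w[0], w[1] * w[1] + sigma2]]
```

The retained direction carries the excess variance lam1 - sigma2, while sigma2 absorbs the variance lost through the dimension reduction, matching the interpretation given above.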
Given the MLE, the following result shows that the choice of activation function does not matter for the maximal value of in non-degenerate cases (i.e. for ). The same can be shown for the bounds in Corollary 1. It is notable that neither the maximal point nor the bounds depend on the choice of activation. We also observe this in our simulations. A possible reason could be that we restrict the theory to linearly canonical activations.
Assume that for the first eigenvalues of we have for . At the optimal points and , the maximum of in Proposition 1 becomes independent of and can be written as
If fewer than eigenvalues fulfil the assumption, depends on and is maximal for .
See Section B.4 of the supplementary material for the proof.
4 Simulation results
In this section, we provide simulation results to illustrate our theoretical results from Section 3. We focus on the Bernoulli case, popular for image data. According to Corollary 1, from Proposition 1 becomes a lower bound yielding (5). Therefore, we expect the ELBO of VAE with an according architecture to lie above . The essential messages of the simulations are the following:
It is reasonable to use to analyse the training performance on real life data sets.
The statement above is also valid for deep VAE architectures.
The MLE points from Section 3.4, used as initialization, enhance the training performance.
For training the nets, we use the Adam optimizer by Kingma2015 with learning rate and a batch size of . Training was done for a total of batch evaluations. The simulations ran on a dual Intel Xeon E5-2670 with 16 CPUs @ 2.6 GHz. The longest setup took about one hour of computing time. (Our code, together with a readme file for execution, can be found in the folder “vae_exp_fam_code” provided with the supplementary materials.)
By varying the following hyperparameters, we conduct a total of 36 different simulation setups:
Architecture: “canonical” or “deep”.
Latent dimension : 2, 5 or 20.
Last decoder activation: “sigmoid” or “tanh”.
Data: “synthetic”, “frey” or “mnist”.
We compare our initialization scheme (“MLE-B”) to a benchmark (“Bench”) given by He2015. The initialization schemes and the different hyperparameters are explained in detail in Section C of the supplementary material. Figure 1 shows the results of training the two differently initialized VAE on the frey dataset with the two architectures “deep” and “canonical”. We set and the decoder activation as the sigmoid function.
In Figure 1, the bound is reasonable and neither architecture performs significantly better than the other. The results of all simulation schemes can be found in Section C.4 of the supplementary material. Comparing these simulation results and also considering Figure 1, we observe the following:
For the canonical architecture, the initialization MLE-B converges directly, whereas the benchmark takes much more time; the end values are comparable. For the deep architecture, the two initialization methods mostly show a small initial advantage of MLE-B which, however, is not as clear as in the canonical architecture.
Different decoder activations yield similar results. This observation agrees with our theoretical result that the activation affects neither bound nor optimal (see Section 3.4).
In no simulation setup did a net overfit, not even for large values of with synthetic data, where a much smaller was needed. This property of VAE is known as auto-pruning.
5 Conclusion
We have established a new framework for VAE by interpreting the decoder of a VAE as a GLM. Given this framework, we derive and analyse an approximation to the ELBO. We derive MLE for this approximation and provide simulation results validating the theory on real-world datasets, such as the frey and mnist datasets. The results generalize previous work in this field.
Possible extensions and research directions based on our findings are:
A deep decoder as opposed to the one-layer version here, can be considered. A possible way would be to think of the last layer of a deep decoder as GLM, as in Wuthrich2020.
Distributions like the Gamma distribution also belong to the EDF but are not yet covered by our theory. Future research on possible data transformations should extend our results for this. Furthermore, it is also interesting to extend the GLM approach to cover popular activations such as ReLU and variations of it.
As our contributions are of a theoretical nature, we do not believe that the intent of this section applies to this work.
Appendix A Tables and Examples
Table 1: EDF representations for different distributions of X. Table 2: Distributions of X with the associated linearly canonical activation functions.
Example 1 (Bernoulli distribution - sigmoid activation).
For , we get . Hence for the canonical activation we have
which is the sigmoid activation.
Example 2 (Bernoulli distribution - tanh activation).
Assume and set the activation as
As in the previous example, , and it can be shown that
Appendix B Proofs
B.1 Auxiliary results
For a -dimensional random variable it holds
and the fact that
yields the statement. ∎
Let be symmetric positive definite matrices. Then it holds
B.2 Proof of Proposition 1
To prove Proposition 1, we change the perspective: instead of maximizing the average ELBO, we minimize the negative average ELBO given by
Looking at (11), we see two terms. For the KL-divergence we have that
and for the second term (with as abbreviation for ) we derive that
We see two terms, where the expectation of is taken into account. Using
for the first term in (13), it follows
For the other term in (13) first consider that the log-partition function possesses all derivatives and is convex as stated in Lemma 1. We use a second order Taylor approximation in of the log-partition function and get
Setting , and we have
We ignore and approximate the expression within the expectation of the second term in (13), and get
with . With Lemma 2 and the fact that
we can rewrite this as
All potential minima w.r.t. have to conform to
independent of , with . Note that this expression is well-defined since, per assumption, we have , and . To see that these are minima and not maxima, consider Lemma 3,
where we set and .
Given the minimal target function evaluated at this point becomes
By rearrangement this becomes
where denotes the element wise sample mean. For , we get as minimal points
with . This is well-defined, since and . The candidates for an optimal are minima since the second derivative is a positive constant times , which is positive definite. Given the optimal and our target function is only dependent on the parameters and . We get
Consider the singular value decomposition of , where and are unitary matrices and
and can write
with . For it follows that
where we denote . The justification of the last equation becomes apparent when we consider one respective diagonal element of the diagonal matrices in the equation. We have
We can further rewrite (19) as