1 Introduction
Variational autoencoders (VAE) are described by Goodfellow2016 as an “excellent manifold learning algorithm”, since the model is forced “to learn a predictable coordinate system that the encoder can capture”. VAE achieve this by using a regularization term that drives the model towards low-energy regions. According to LeCun2020, regularization as in the VAE case helps to keep the energy function smooth, which is desirable for a model that should learn meaningful dependencies (e.g. to fill in blanks). In contrast, maximum likelihood approaches push down the energy surface only at training sample regions. Therefore, their inherent objective is “to make the data manifold an infinitely deep and infinitely narrow canyon” (see LeCun2020).
Learning meaningful dependencies is a desirable concept for advancing deep learning. Hence, there exists an interest in understanding and developing VAE. Recent work aims at explaining and overcoming well-known pitfalls of VAE, such as spurious global optima (see Dai2019), posterior collapse (see Lucas2019 and VandenOord2017) or prior-posterior mismatch (see Dai2019 and Ghosh). In these works, a specific decoder distribution is assumed to analyse the loss surface of the model or the training behaviour. In this work, we answer the following research question:

Is there a way to generalize the loss analysis of VAE based on the decoder distribution?
For this, we establish a connection between VAE and generalized linear models (GLM) and provide a framework for analysing VAE based on the decoder distribution. By doing so, we generalize works of Dai2018, Lucas2019 and Sicks2020
. We provide an approximation to the evidence lower bound (ELBO), which is exact in the Gaussian distribution case (see also
Dai2018 and Lucas2019) and a lower bound for the Bernoulli distribution case (see also
Sicks2020). Further, we analyse the maximum likelihood estimators (MLE) of this approximation and find that the choice of decoder activation function does not affect the maximum. Using the MLE as initialization, we also show that the training performance of a VAE net can be enhanced. As GLM are based on exponential dispersion families (EDF), the analysis rests on the distribution assumption for the decoder model. This is favourable, as VAE are applied in various fields with different distribution assumptions, e.g.: anomaly detection using the Gaussian distribution (see
Xu2018a), molecules representation using Bernoulli distribution (see Blaschke2018), image compression using Bernoulli distribution (see Duan2019) or multivariate spatial point processes using Poisson distribution (see
Yuan2020). This work is structured as follows: In Section 2, we give a motivation as well as an overview of related work. In Section 3, we present the theoretical background and the results that follow from connecting VAE and GLM. Afterwards, in Section 4, we provide simulations validating our theoretical results and show that these can be used for deep VAE. In Section 5, we summarize our contributions and point out future research directions.
2 Motivation and Related Work
Throughout our theoretical analysis, we consider VAE with an arbitrary encoder and a one-layer decoder. Our main contribution is to interpret this one-layered decoder as a GLM (see Section 3.2). This perspective allows us to identify well-known activation functions as the inverses of link functions of GLM. Therefore, we are able to provide a systematic generalization of the loss analysis for VAE based on the assumption that the distribution of the decoder belongs to an EDF.
Even though the decoder architecture is arguably simple, analysing the critical points and the loss landscapes of VAE helps to understand these models better. Using this architecture and approach, Dai2018
show connections to probabilistic principal component analysis (pPCA),
Lucas2019 analyse posterior collapse and Sicks2020 bound the ELBO for a Bernoulli VAE from below. Given a one-layered decoder with a Gaussian distribution assumption, the model (more specifically, the expressions defined as in Kingma) becomes tractable. Lucas2019 use this for their analysis and show that in this case the ELBO becomes exact. Still, the main motivation for VAE is the assumption of intractability (see Kingma), where the recognition model is used to approximate the intractable posterior. In case of tractability, as for the Gaussian model, an established alternative is the EM algorithm.
Under the one-layer decoder assumption, the model does not necessarily become tractable. For example, for a Bernoulli decoder, the model stays intractable (see Aitchison1980).
Our framework can serve as a stepping stone to theory for more rigorous models as it yields

known results for the “good to handle” Gaussian case. To be more specific, we provide an equivalent optimization target to those in Dai2018 and Lucas2019.

interpretational guidance for intractable settings like the Bernoulli case. Our results for the sigmoid activation (strictly speaking, the logistic function; it has become common to use the term sigmoid for this special case, so we keep this terminology) are equivalent to the results in Sicks2020. Even though the resulting target is only approximative, our simulation in Section 4 for the Bernoulli case shows reasonable results on real data (i.e. the frey dataset).
the possibility to use GLM tools to further analyse VAE and to extend the analysis to include more EDF distributions (e.g. the Gamma distribution).
In the following, we give an overview of the literature on analysing autoencoders and VAE, as well as on GLM used in the context of neural nets.
Bourlard1988
show for autoencoders with linearised activations that the optimal solution is given by the solution of a singular value decomposition (SVD).
Baldi1989a extend these results and analyse the squared error loss of autoencoders at all critical points. Dai2018 analyse Gaussian VAE with the same architecture as in this work. They show connections to pPCA and robust PCA, as well as smoothing effects on local optima of the loss landscape.
Zhou2018
provide analytical forms for critical points and characterize the values of the corresponding loss functions as well as properties of the loss landscape for onehidden layer ReLU autoencoders.
Dai2019 analyse deep Gaussian VAE. Assuming the existence of an invertible and differentiable mapping between low-rank manifolds in the sample space and the latent space, they show that spurious global optima exist, which do not reflect the data-generating manifold appropriately.
Kunin2019 consider regularizations on linear autoencoders and analyse the loss landscape for different regularizations. They show that regularized linear autoencoders are capable of learning the principal directions and have connections to pPCA.
Lucas2019 extend results of Dai2018 to analyse posterior collapse. They do this by analysing the difference between the true marginal likelihood and the ELBO, which is possible under Gaussian assumptions, as the model becomes tractable. Furthermore, they provide experimental results on posterior collapse for deep nonlinear cases.
Sicks2020 formulate a lower bound for the ELBO of a Bernoulli VAE with the same architecture as in this work. They use the MLE to derive an initialization scheme and empirically compare it on synthetic data.
Wuthrich2020 describes connections between GLM and neural network regression models by interpreting the last layer of a neural net as a GLM. With this, he is able to use a regularized neural net to learn representative features that improve a standard GLM. Furthermore, favourable properties (such as, in an actuarial context, the “balance property”) are achieved with a proposed hybrid model.
3 Theoretical Background and Advancements
For realizations of a random variable (r.v.), we consider a VAE as in Kingma with the ELBO given by
(1)
Interpreting the expectation in this expression as an autoencoder yields the encoder and the decoder . We further assume that the encoder produces and , with , where and are arbitrary functions including affine transformations. The decoder is one-layered with a not necessarily linear activation function , weights and biases .
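As an illustrative sketch of this setup in numpy (the specific encoder form, dimensions and variable names below are our own hypothetical choices, not the paper's exact parameterization):

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(x, W_mu, b_mu, W_s, b_s):
    # Arbitrary encoder: one affine layer per output; exp keeps sigma positive.
    mu = x @ W_mu + b_mu
    sigma = np.exp(x @ W_s + b_s)
    return mu, sigma

def decoder(z, W, b, activation):
    # One-layer decoder: linear predictor eta = z W^T + b, then activation g.
    return activation(z @ W.T + b)

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

d, q, n = 5, 2, 4                            # data dim, latent dim, batch size
x = rng.random((n, d))
mu, sigma = encoder(x, rng.normal(size=(d, q)), np.zeros(q),
                    rng.normal(size=(d, q)), np.zeros(q))
z = mu + sigma * rng.normal(size=mu.shape)   # reparameterization trick
W, b = rng.normal(size=(d, q)), np.zeros(d)
x_hat = decoder(z, W, b, sigmoid)            # Bernoulli means in (0, 1)
```

With the sigmoid activation, the decoder output can directly be read as the mean parameter of a Bernoulli decoder distribution.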
In the following, we first provide the theoretical background on GLM and EDF. Then, we show how the decoder can be interpreted as GLM. Finally, given this new perspective on VAE, we present theoretical results and analyse MLE for the resulting approximation.
3.1 The GLM and EDF
Nelder1972
introduce GLM, providing a generalization of linear statistical models and thus of wellknown statistical tools, such as analysis of variance (ANOVA), deviance statistics and MLE (see also
McCullagh1989). GLM consist of three parts: a random component with a distribution belonging to the EDF, a systematic component given as an affine mapping of features used to estimate , and a link function connecting these two components. Hereby, the EDF is defined by the structure of the density.
Definition 1.
We say the distribution of given belongs to the exponential dispersion family (EDF), if the density can be written as
(2) 
where and are one-dimensional functions. is also called the log-normalizer. It ensures that the integral of the density in (2) over the support of is equal to one. is the location parameter. is an open, convex space with . is called the dispersion parameter and is independent of .
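As a concrete instance of (2): the Bernoulli distribution with success probability $p$ fits this structure (a standard GLM fact) with

```latex
f(x \mid \theta) = \exp\!\bigl(x\,\theta - \log(1 + e^{\theta})\bigr), \qquad
\theta = \log\tfrac{p}{1-p}, \quad \phi = 1, \quad
b(\theta) = \log(1 + e^{\theta}), \quad x \in \{0, 1\},
```

so the location parameter is the log-odds and the log-normalizer is $b(\theta) = \log(1 + e^{\theta})$.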
The EDF is studied in BarndorffNielsen2014, Jorgensen1986 and Jorgensen1987a. Several wellknown distributions, like the Gaussian, Bernoulli and Poisson distribution, belong to this family (see Table 1 in supplementary material Section A for the respective representations).
Lemma 1.
Let the distribution of a one-dimensional r.v. given belong to the EDF. Then, it holds that and . Furthermore, the log-partition function is convex and possesses derivatives of all orders.
The proof for the unconditional case is given in Theorem 7.1, Corollary 7.1 and Theorem 8.1 in BarndorffNielsen2014. The statement for the conditional case follows analogously.
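For the Bernoulli case, the identities of Lemma 1 (mean $b'(\theta)$, variance $\phi\, b''(\theta)$ with $\phi = 1$) can be verified numerically via finite differences; this is a sanity check, not part of the proof:

```python
import math

def b(theta):
    # Bernoulli log-partition function in EDF form
    return math.log(1.0 + math.exp(theta))

def diff(f, t, h=1e-4):
    # Central finite difference
    return (f(t + h) - f(t - h)) / (2.0 * h)

theta = 0.7
p = 1.0 / (1.0 + math.exp(-theta))   # success probability sigma(theta)
mean, var = p, p * (1.0 - p)         # known Bernoulli moments, phi = 1

b_prime = diff(b, theta)                       # should match the mean
b_second = diff(lambda t: diff(b, t), theta)   # should match the variance
```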
3.2 The decoder as GLM
We interpret the decoder as GLM. Therefore, we assume that the independent identical marginal distributions of given belong to an EDF, where they share the same . With a realization from the encoder, the parameters of the decoder are given by , with .
Part of the training objective of a VAE implementation is to construct outputs that resemble the inputs to a high degree. Therefore, the decoder reports the value of the (conditional) expectation given a latent representation created by the encoder. Training of the decoder is conducted by updating the parameters in the set . In accordance with GLM (see McCullagh1989), we define the linear predictor , with and . Given an activation function , we set (functions with one-dimensional support applied to vectors are interpreted as element-wise operations)
and according to Lemma 1, we have
With the linear predictor introduced, we have for our decoder parameter set .
Obviously, is a special choice for which we have . We call this choice of the “canonical activation”. This name originates from the “canonical link function”. As mentioned in Section 3.1, for GLM we define a link function connecting the systematic component of the model to the random component . This function is called canonical if . Hence, the canonical activation is the inverse of the canonical link.
A canonical function has the advantage of an affine mapping from the net parameters to the EDF parameters. This will simplify the derivation of our theoretical results below. Apart from canonical activations, we want to provide theory for a more general case. We consider any activation that scales the systematic component to , s.t. for we have
(3) 
We call activations satisfying this “linearly canonical activations”. Obviously, for canonical activations we have .
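Both notions can be checked numerically for the Bernoulli case (matching Examples 1 and 2 in the appendix): the sigmoid is canonical, i.e. $\theta = \eta$, while tanh is linearly canonical with $\theta = 2\eta$, since $\tanh(\eta) = 2\sigma(2\eta) - 1$:

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def logit(p):
    # Inverse of the Bernoulli mean map b', i.e. the canonical link
    return math.log(p / (1.0 - p))

etas = [-2.0, -0.5, 0.0, 0.7, 3.0]
# Canonical: sigmoid gives theta = logit(sigmoid(eta)) = eta.
canonical_thetas = [logit(sigmoid(e)) for e in etas]
# Linearly canonical: tanh gives theta = logit((tanh(eta) + 1) / 2) = 2 * eta.
tanh_thetas = [logit((math.tanh(e) + 1.0) / 2.0) for e in etas]
```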
Though this theory is not new, we want to stress that commonly used activation functions in machine learning are linearly canonical activations. Two examples in Section A of the supplementary material illustrate this for the sigmoid and the tanh activation. Unfortunately, the canonical activation is not a desirable candidate for a VAE model for all EDF distributions. The canonical activation of the Gamma distribution (which also belongs to the EDF) is given by , with support in . It is not favourable to limit the support, since this would place restrictions on the linear predictor and hence on the weights and biases. A common workaround from GLM theory is to use the activation , but this function does not produce the desired linear mapping in (3).
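Concretely, in the standard GLM convention for the Gamma distribution (a sketch; the symbols below reconstruct the elided formulas), the log-partition, canonical activation and log-link workaround read

```latex
b(\theta) = -\log(-\theta), \qquad
b'(\theta) = -\tfrac{1}{\theta}, \qquad
\Theta = (-\infty, 0); \qquad
g(\eta) = e^{\eta} \;\Rightarrow\; \theta(\eta) = -e^{-\eta},
```

so the canonical activation restricts the linear predictor to negative values, while the exponential workaround yields a $\theta(\eta)$ that is no longer affine in $\eta$.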
3.3 Approximating the ELBO
In this section, the new interpretation of the decoder as a GLM allows us to approximate the ELBO (see Proposition 1) by linearising the log-partition function . We can use the results to make statements on the choice of the activation function (see Section 3.4), to derive weight and bias initializations (see Section C.3 of the supplementary material and the results in Section 4) or to monitor the training of an intractable VAE with a tractable reference point (see Section 4).
Proposition 1.
Consider N observations of a d-dimensional r.v. X. Assume the independent identical marginals of given belong to the same EDF distribution with functions and as in (2). Each marginal distribution has a parameter () and is the same across marginals. For a VAE as in Section 3, we assume the one-layer decoder has a linearly canonical activation function , s.t. , where , , and .
Then, there exists an approximative representation for as in (1), around an arbitrary point , that admits an optimal solution for and , such that it can be written as
(4) 
where and are constant in and .
Setting , , and , we have , and
where and denotes the sample mean.
See Section B.2 of the supplementary material for the proof. Usually, we set . See Table 2 in Section A of the supplementary material for different EDF distributions and linearly canonical activation functions as well as the parameters for Proposition 1 with this choice of .
To derive this alternative expression, we made use of a Taylor approximation. Therefore, we can quantify the approximation error via the corresponding remainder.
Corollary 1.
Under the assumptions of Proposition 1, the error to the original ELBO in (1) is given by , where denotes the Taylor remainder for approximating the log-partition function at the point with a 2nd-degree Taylor approximation. If we assume to originate from a Gaussian distribution, then for all . Consider ; if we assume to originate from a Poisson distribution, then .
See Section B.3 of the supplementary material for the proof. Since in our setting is Gaussian under , the moments for given parameters can be calculated straightforwardly. Corollary 1 highlights how our theory generalizes the works of Dai2018, Lucas2019 and Sicks2020. Under the Gaussian assumption with a linearly canonical activation, is exact. Then, Proposition 1 yields a result equivalent to that given by Dai2018 and extended by Lucas2019. For the Bernoulli distribution, the expected difference is positive. Hence, with from Proposition 1 we approximate the ELBO in (1) from below. The result for the sigmoid activation is the same lower bound as reported in Sicks2020. Thus, the ELBO is bounded from both sides, as its values are naturally negative and we have
(5) 
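The Bernoulli direction of this bound can be illustrated numerically: the log-partition $b(\theta) = \log(1 + e^{\theta})$ has its maximal curvature at $0$, so a second-degree Taylor expansion around $\theta_0 = 0$ (the expansion point we assume here) dominates $b$ pointwise, which makes the approximation a lower bound on the ELBO:

```python
import math

def b(theta):
    # Bernoulli log-partition function
    return math.log(1.0 + math.exp(theta))

def taylor2(theta, theta0=0.0):
    # Second-degree Taylor expansion of b around theta0;
    # b'(theta0) = sigma(theta0), b''(theta0) = sigma(theta0)(1 - sigma(theta0))
    s = 1.0 / (1.0 + math.exp(-theta0))
    dt = theta - theta0
    return b(theta0) + s * dt + 0.5 * s * (1.0 - s) * dt * dt

# Since b''(theta) <= b''(0) = 1/4 everywhere, taylor2 dominates b, so
# replacing b by taylor2 inside -E[b(theta)] pushes the objective down,
# i.e. the approximation bounds the ELBO from below.
gaps = [taylor2(t) - b(t) for t in [x / 10.0 for x in range(-60, 61)]]
```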
3.4 MLE and optimal solution
In the following, we analyse the optimal values of and for and highlight important properties. To do so, we rewrite the objective and obtain a representation similar to that in Tipping1999,
(6) 
with and . According to Tipping1999, the MLE for is given by the sample mean of . Therefore, with becomes the sample covariance (we denote it as ). For later purposes, we also define the sample covariance matrix of as . With
we denote the (ordered) eigenvalues of the matrix
and the same for and . It holds for , with and from Proposition 1. In a similar way to Tipping1999, we can derive the MLE of as
(7)
where is composed of eigenvectors of the matrix . These eigenvectors are associated with the largest eigenvalues . is a diagonal matrix with entries
(8) 
is an arbitrary rotation matrix, which implies that our optimal solution is invariant to rotations. Dai2018 show this as well as invariance to permutations in their Theorem 2.
It is possible to have , and the term (for ) controls how many columns of the matrix are zero. We interpret this as a consequence of the auto-pruning property of VAE: if the data signal is not strong enough, it is pruned away.
Further, if exists (e.g. in the Gaussian case), the MLE for is given by . We interpret this as the variance lost due to the dimension reduction.
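A numeric sketch of the pPCA-style MLE and the auto-pruning effect, in the classical Tipping & Bishop parameterization (the scaling constants of Proposition 1 are absorbed here, the rotation is fixed to the identity, and all names are our own illustrative choices):

```python
import numpy as np

def mle_decoder_weights(S, q, sigma2):
    # pPCA-style MLE: W = U_q (Lambda_q - sigma2 I)^{1/2} R with R = I.
    # Columns whose eigenvalue falls below sigma2 are clipped to zero,
    # illustrating the auto-pruning of weak data signals.
    lam, U = np.linalg.eigh(S)               # ascending eigenvalues
    top = np.argsort(lam)[::-1][:q]          # indices of the q largest
    scale = np.sqrt(np.clip(lam[top] - sigma2, 0.0, None))
    return U[:, top] * scale

rng = np.random.default_rng(1)
# Data with two strong directions and two weak ones
X = rng.normal(size=(500, 4)) @ np.diag([3.0, 1.5, 0.2, 0.1])
S = np.cov(X, rowvar=False)                  # sample covariance
W = mle_decoder_weights(S, q=3, sigma2=1.0)  # third direction gets pruned
```

Here the third retained eigenvalue lies below the noise level, so the corresponding column of W is exactly zero.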
Given the MLE, the following result shows that the choice of the activation function does not matter for the maximal value of in non-degenerate cases (i.e. for ). The same can be shown for the bounds in Corollary 1. It is notable that neither the maximal point nor the bounds depend on the choice of the activation. We also observe this in our simulations. A possible reason could be that we restrict the theory to linearly canonical activations.
Proposition 2.
Assume that for the first eigenvalues of we have for . At the optimal points and , the maximum of in Proposition 1 becomes independent of and can be written as
(9) 
If fewer than eigenvalues fulfil the assumption, depends on and is maximal for .
See Section B.4 of the supplementary material for the proof.
4 Simulation results
In this section, we provide simulation results to illustrate our theoretical results from Section 3. We focus on the Bernoulli case, which is popular for image data. According to Corollary 1, from Proposition 1 becomes a lower bound yielding (5). Therefore, we expect the ELBO of a VAE with a corresponding architecture to lie above . The essential messages of the simulations are the following:

It is reasonable to use to analyse the training performance on real-life data sets.

The statement above is also valid for deep VAE architectures.

Using the MLE points from Section 3.4 as initialization enhances the training performance.
For training the nets, we use the Adam optimizer by Kingma2015 with learning rate and a batch size of . Training was done for a total of batch evaluations. The simulations ran on a dual Intel Xeon E5-2670 with 16 CPUs @ 2.6 GHz. The longest setup took about one hour of computing time. (Our code, together with a readme file for execution, can be found in the folder “vae_exp_fam_code” provided with the supplementary materials.)
By varying the following hyperparameters, we conduct a total of 36 different simulation setups:

Architecture: “canonical” or “deep”.

Latent dimension : 2, 5 or 20.

Last decoder activation: “sigmoid” or “tanh”.

Data: “synthetic”, “frey” or “mnist”.
We compare our initialization scheme (“MLEB”) to a benchmark (“Bench”) given by He2015. The initialization schemes and the different hyperparameters are explained in detail in Section C of the supplementary material. Figure 1 shows the result of training the two differently initialized VAE on the frey dataset with the two architectures “deep” and “canonical”. We set and the decoder activation as the sigmoid function.
In Figure 1, the bound is reasonable, and neither architecture performs significantly better than the other. The results of all simulation schemes can be found in Section C.4 of the supplementary material. Comparing these simulation results and also considering Figure 1, we observe the following:

For the canonical architecture, the initialization MLEB converges directly, whereas the benchmark takes much more time; the end values are comparable. For the deep architecture, the two initialization methods mostly show a small initial advantage of MLEB which, however, is not as clear as in the canonical architecture.

Different decoder activations yield similar results. This observation agrees with our theoretical result that the activation affects neither the bound nor the optimum (see Section 3.4).

In no simulation setup did a net overfit, not even for large values of with synthetic data, where a much smaller was needed. This property of VAE is known as auto-pruning.
5 Conclusion
We have established a new framework for VAE by interpreting the decoder of a VAE as a GLM. Within this framework, we derived and analysed an approximation to the ELBO, derived the MLE for this approximation, and provided simulation results validating the theory on real-world datasets such as frey and mnist. These results generalize previous work in this field.
Possible extensions and research directions based on our findings are:

A deep decoder, as opposed to the one-layer version here, could be analysed. A possible way would be to interpret the last layer of a deep decoder as a GLM, as in Wuthrich2020.

Distributions like the Gamma distribution also belong to the EDF but are not yet covered by our theory. Future research on possible data transformations should extend our results in this direction. Furthermore, it would also be interesting to extend the GLM approach to cover popular activations such as ReLU and its variations.
Broader Impact
As our contributions are of a theoretical nature, we do not believe that the intention of this section is applicable to this work.
References
Appendix A Tables and Examples
Table 1: EDF representations (parameters and functions as in (2)) for common distributions, such as the Gaussian, Bernoulli and Poisson distribution.
Table 2: EDF distributions with linearly canonical activation functions and the corresponding parameters for Proposition 1.
Example 1 (Bernoulli distribution: sigmoid activation).
For , we get . Hence for the canonical activation we have
which is the sigmoid activation.
Example 2 (Bernoulli distribution: tanh activation).
Assume and set the activation as
As in the example before and it can be shown that
Appendix B Appendix: Proofs
b.1 Auxiliary results
Lemma 2.
For a d-dimensional random variable, it holds
Proof.
Given
and the fact that
yields the statement. ∎
Lemma 3.
Let be symmetric positive definite matrices. Then it holds
and hence
Proof.
Define the two distributions and . We have
(10) 
Now, consider that for the Kullback-Leibler divergence with probability distributions and it holds: for all inputs, and if and only if almost everywhere. Hence, we conclude in the minimum. ∎
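Reading Lemma 3 as the classical Gaussian KL inequality $\operatorname{tr}(B^{-1}A) - \log\det(B^{-1}A) \ge d$, with equality iff $A = B$ (our reconstruction of the elided statement), a quick numeric sanity check:

```python
import numpy as np

rng = np.random.default_rng(2)

def random_spd(d):
    # Random symmetric positive definite matrix
    M = rng.normal(size=(d, d))
    return M @ M.T + d * np.eye(d)

d = 4
A, B = random_spd(d), random_spd(d)
C = np.linalg.solve(B, A)                     # B^{-1} A
# KL(N(0,A) || N(0,B)) >= 0 implies tr(B^{-1}A) - log det(B^{-1}A) >= d
val = np.trace(C) - np.log(np.linalg.det(C))
val_eq = np.trace(np.eye(d)) - 0.0            # A = B gives exactly d
```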
b.2 Proof of Proposition 1
Proof.
To prove Proposition 1, we change the perspective: instead of maximizing the average ELBO, we minimize the negative average ELBO given by
(11) 
Looking at (11), we see two terms. For the KL divergence we have that
(12) 
and for the second term (with as abbreviation for ) we derive that
(13) 
We see two terms, where the expectation of is taken into account. Using
for the first term in (13), it follows
(14) 
For the other term in (13), first consider that the log-partition function possesses all derivatives and is convex, as stated in Lemma 1. We use a second-order Taylor approximation in of the log-partition function and get
Setting , and we have
We ignore and approximate the expression within the expectation of the second argument in (13), and get
with . With Lemma 2 and the fact that
we can rewrite this as
(15) 
Putting the results from (14) and (15) together with (12), it follows that our target function (11) is approximated by
All potential minima w.r.t. have to conform to
(16) 
independent of , with . Note that this expression is well-defined since, per assumption, we have , and . To see that these are minima and not maxima, consider Lemma 3,
where we set and .
Given the minimal target function evaluated at this point becomes
By rearrangement this becomes
where denotes the element-wise sample mean. For , we get as minimal points
(17) 
with . This is well-defined, since and . The candidates for an optimal are minima, since the second derivative is a positive constant times , which is positive definite. Given the optimal values, our target function depends only on the parameters and . We get
(18)  
where .
Consider the singular value decomposition of , where and are unitary matrices and
We have
and can write
with . For it follows that
(19) 
where we denote . The justification of the last equation becomes apparent, when we consider one respective diagonal element of the diagonal matrices in the equation. We have
We can further rewrite (19) as