A Generalised Linear Model Framework for Variational Autoencoders based on Exponential Dispersion Families

06/11/2020, by Robert Sicks et al.

Although variational autoencoders (VAE) are successfully used to obtain meaningful low-dimensional representations for high-dimensional data, aspects of their loss function are not yet fully understood. We introduce a theoretical framework that is based on a connection between VAE and generalized linear models (GLM). The equality between the activation function of a VAE and the inverse of the link function of a GLM enables us to provide a systematic generalization of the loss analysis for VAE based on the assumption that the distribution of the decoder belongs to an exponential dispersion family (EDF). As a further result, we can initialize VAE nets by maximum likelihood estimates (MLE) that enhance the training performance on both synthetic and real-world data sets.


1 Introduction

Variational autoencoders (VAE) are described by Goodfellow2016 as an “excellent manifold learning algorithm” because the model is forced “to learn a predictable coordinate system that the encoder can capture”. VAE do so by using a regularization term in order to reach low-energy regions. According to LeCun2020, regularization as in the VAE case helps to keep the energy function smooth, which is desirable if the model is to learn meaningful dependencies (e.g. to fill in blanks). In contrast, maximum likelihood approaches push down the energy surface only at training sample regions. Therefore, their inherent objective is “to make the data manifold an infinitely deep and infinitely narrow canyon” (see LeCun2020).

Learning meaningful dependencies is a desirable concept for advancing deep learning. Hence, there is an interest in understanding and developing VAE. Recent work aims at explaining and overcoming well-known pitfalls of VAE, such as spurious global optima (see Dai2019), posterior collapse (see Lucas2019 and VandenOord2017) or prior-posterior mismatch (see Dai2019 and Ghosh). In these works, a specific decoder distribution is assumed in order to analyse the loss surface of the model or to study the training behaviour.

In this work, we answer the following research question:

  • Is there a way to generalize the loss analysis of VAE based on the decoder distribution?

For this, we establish a connection between VAE and generalized linear models (GLM) and provide a framework for analysing VAE based on the decoder distribution. By doing so, we generalize the works of Dai2018, Lucas2019 and Sicks2020. We provide an approximation to the evidence lower bound (ELBO), which is exact in the Gaussian distribution case (see also Dai2018 and Lucas2019) and a lower bound for the Bernoulli distribution case (see also Sicks2020). Further, we analyse the maximum likelihood estimators (MLE) of this approximation and find that the choice of decoder activation function does not affect the maximum. Using the MLE as initialization, we also show that the training performance of a VAE net can be enhanced.

As GLM are based on exponential dispersion families (EDF), the analysis is based on the distribution assumption for the decoder model. This is favourable as VAE are applied in various fields with different distribution assumptions, e.g. anomaly detection using the Gaussian distribution (see Xu2018a), molecule representation using the Bernoulli distribution (see Blaschke2018), image compression using the Bernoulli distribution (see Duan2019) or multivariate spatial point processes using the Poisson distribution (see Yuan2020).

This work is structured as follows: In Section 2, we give a motivation as well as an overview of related work. In Section 3, we present the theoretical background and our results that are a consequence of connecting VAE and GLM. Afterwards in Section 4, we provide simulations validating our theoretical results and show that these can be used for deep VAE. In Section 5, we summarize our contributions and point out future research directions.

2 Motivation and Related Work

Throughout our theoretical analysis, we consider VAE with an arbitrary encoder and a one-layer decoder. Our main contribution is to interpret this one-layer decoder as a GLM (see Section 3.2). This perspective allows us to identify well-known activation functions as the inverses of link functions of GLM. Therefore, we are able to provide a systematic generalization of the loss analysis for VAE based on the assumption that the distribution of the decoder belongs to an EDF.

Even though the decoder architecture is arguably simple, analysing the critical points and the loss landscapes of VAE helps to understand these models better. Using this architecture and approach, Dai2018 show connections to probabilistic principal component analysis (pPCA), Lucas2019 analyse posterior collapse and Sicks2020 bound the ELBO for a Bernoulli VAE from below.

Given a one-layer decoder with a Gaussian distribution assumption, the model becomes tractable (the relevant expressions are defined as in Kingma). Lucas2019 use this for their analysis and show that in this case the ELBO becomes exact, since the true posterior can be matched by the recognition model. Still, the main motivation for VAE is the assumption of intractability (see Kingma), where the recognition model is used to approximate the intractable posterior. In the tractable case, as for the Gaussian model, the EM algorithm is an established alternative.

The one-layer decoder assumption alone does not necessarily make the model tractable. For a Bernoulli decoder, for example, the model stays intractable (see Aitchison1980).

Our framework can serve as a stepping stone to theory for more rigorous models as it yields

  • known results for the “good to handle” Gaussian case. More specifically, we provide an optimization target equivalent to those in Dai2018 and Lucas2019.

  • interpretational guidance for intractable settings like the Bernoulli case. Our results for the sigmoid activation (strictly speaking the logistic function; it has become common to use the term sigmoid for this special case, so we keep using it) are equivalent to the results in Sicks2020. Even though the resulting target is only an approximation, our simulation in Section 4 for the Bernoulli case shows reasonable results on real data (the frey data).

  • the possibility to use GLM tools to further analyse VAE and to extend the analysis to include more EDF distributions (e.g. the Gamma distribution).

In the following, we give an overview of the literature on analysing autoencoders and VAE as well as on GLM used in the context of neural nets.

Bourlard1988 show for autoencoders with linearised activations that the optimal solution is given by the solution of a singular value decomposition (SVD).

Baldi1989a extend these results and analyse the squared error loss of autoencoders for all critical points.

Dai2018 analyse Gaussian VAE with the same architecture as in this work. They show connections to pPCA and robust PCA as well as smoothing effects for local optima of the loss landscape.

Zhou2018 provide analytical forms for critical points and characterize the values of the corresponding loss functions as well as properties of the loss landscape for one-hidden-layer ReLU autoencoders.

Dai2019 analyse deep Gaussian VAE. Assuming the existence of an invertible and differentiable mapping between low-rank manifolds in the sample space and the latent space, they show that spurious global optima exist which do not reflect the data-generating manifold appropriately.

Kunin2019 consider regularizations of linear autoencoders and analyse the loss landscape for different regularizations. They show that regularized linear autoencoders are capable of learning the principal directions and have connections to pPCA.

Lucas2019 extend the results of Dai2018 to analyse posterior collapse. They do this by analysing the difference between the true marginal likelihood and the ELBO, which is possible under Gaussian assumptions, as the marginal becomes tractable. Furthermore, they provide experimental results on posterior collapse for deep non-linear cases.

Sicks2020 formulate a lower bound for the ELBO of a Bernoulli VAE with the same architecture as in this work. They use the MLE to derive an initialization scheme and empirically compare it on synthetic data.

Wuthrich2020 describes connections between GLM and neural network regression models by interpreting the last layer of a neural net as a GLM. With this, he is able to use a regularized neural net to learn representative features that improve a standard GLM. Furthermore, favourable properties (such as, in an actuarial context, the “balance property”) are achieved with a proposed hybrid model.

3 Theoretical Background and Advancements

For realizations of a random variable (r.v.), we consider a VAE as in Kingma with the ELBO, given by

(1)

Interpreting the expectation in this expression as an autoencoder yields the encoder and the decoder. We further assume that the encoder produces and , with , where and are arbitrary functions including affine transformations. The decoder has one layer with a not necessarily linear activation function , weights and biases .
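
For reference, in the generic notation of Kingma (which may differ from the symbols used in (1)), the ELBO for a single observation x reads

\[
\mathcal{L}(\theta, \phi; x) \;=\; \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] \;-\; \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big),
\]

where q_\phi(z \mid x) is the encoder (recognition model) and p_\theta(x \mid z) the decoder. The expectation is the reconstruction term interpreted as an autoencoder above, and the KL term is the regularizer discussed in Section 1.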

In the following, we first provide the theoretical background on GLM and EDF. Then, we show how the decoder can be interpreted as GLM. Finally, given this new perspective on VAE, we present theoretical results and analyse MLE for the resulting approximation.

3.1 The GLM and EDF

Nelder1972 introduce GLM, providing a generalization of linear statistical models and thus of well-known statistical tools, such as analysis of variance (ANOVA), deviance statistics and MLE (see also McCullagh1989). GLM consist of three parts: a random component with a distribution belonging to the EDF, a systematic component given as an affine mapping of features used to estimate , and a link function connecting these two components. The EDF is defined by the structure of its density.

Definition 1.

We say the distribution of given belongs to the exponential dispersion family (EDF), if the density can be written as

(2)

where and are one-dimensional functions. The function is also called the log-normalizer; it ensures that integration w.r.t. the density in (2) over the support of is equal to one. The quantity is the location parameter, and is an open, convex space with . Finally, is called the dispersion parameter and is independent of .

The EDF is studied in BarndorffNielsen2014, Jorgensen1986 and Jorgensen1987a. Several well-known distributions, like the Gaussian, Bernoulli and Poisson distribution, belong to this family (see Table 1 in supplementary material Section A for the respective representations).
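
For concreteness, one common parametrisation of an EDF density (generic notation; the symbols in (2) may be arranged differently) is

\[
f(x \mid \theta, \phi) \;=\; \exp\!\left( \frac{x\theta - b(\theta)}{\phi} + c(x, \phi) \right),
\]

with location (natural) parameter \theta, log-partition function b and dispersion parameter \phi. The Bernoulli distribution, for instance, fits this form with \theta = \log\big(p/(1-p)\big), b(\theta) = \log(1 + e^{\theta}), \phi = 1 and c \equiv 0.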

Lemma 1.

Let the distribution of a one-dimensional r.v. given belong to the EDF. Then, it holds and . Furthermore, the log-partition function is convex and possesses all derivatives.

The proof for the unconditional case is performed in Theorem 7.1, Corollary 7.1 and Theorem 8.1 in BarndorffNielsen2014. The statement for the conditional case is found analogously.
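
In the parametrisation given above (again generic notation), the statement of Lemma 1 takes the familiar form

\[
\mathbb{E}[X \mid \theta] = b'(\theta), \qquad \operatorname{Var}(X \mid \theta) = \phi\, b''(\theta),
\]

so the conditional mean is the first derivative of the log-partition function and the conditional variance additionally scales with the dispersion parameter.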

3.2 The decoder as GLM

We interpret the decoder as a GLM. To this end, we assume that the independent identical marginal distributions of given belong to an EDF, where they share the same . With a realization from the encoder, the parameters of the decoder are given by , with .

Part of the training objective of a VAE implementation is to construct outputs that resemble the inputs to a high degree. Therefore, the decoder reports the value , the (conditional) expectation given a latent representation created by the encoder. Training of the decoder is conducted by updating the parameters in the set . In line with GLM (see McCullagh1989), we define the linear predictor , with and . Given an activation function , we set (we apply functions with one-dimensional support to vectors, interpreted as element-wise operations)

and according to Lemma 1, we have

With the linear predictor introduced, we have for our decoder parameter set .

Obviously, is a special choice for which we have . We call this choice of the “canonical activation”. This name originates from the “canonical link function”. As mentioned in Section 3.1, for GLM we define a link function connecting the systematic component of the model to the random component . This function is called canonical if . Hence, the canonical activation is the inverse of the canonical link.

A canonical function has the advantage of an affine mapping from the net parameters to the EDF parameters. This will simplify the derivation of our theoretical results below. Apart from canonical activations, we want to provide theory for a more general case. We consider any activation that scales the systematic component to , s.t. for we have

(3)

We call activations satisfying this condition “linearly canonical activations”. Obviously, for canonical activations we have .

Though this theory is not new, we want to stress the point that commonly used activation functions in machine learning are linearly canonical activations. Two examples in Section A of the supplementary material illustrate this for the sigmoid and the tanh activation.

Unfortunately, the canonical activation is not a desirable candidate for a VAE model for every EDF distribution. The canonical activation of the Gamma distribution (which also belongs to the EDF) is given by , with support in . It is not favourable to limit the support, since this would place restrictions on the linear predictor and hence on the weights and biases. A common workaround from GLM theory is to use the activation , but this function does not produce the desired linear mapping in (3).
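
As an illustration of this section, the following minimal sketch (our own, not part of the paper's code) pairs three common EDF members with their canonical activations, i.e. the inverses of the canonical link functions, and applies them to a linear predictor:

```python
import numpy as np

# Canonical activations b'(theta) for three common EDF members: they map the
# linear predictor eta directly to the conditional mean of x given z.
CANONICAL_ACTIVATIONS = {
    "gaussian": lambda eta: eta,                          # identity link
    "bernoulli": lambda eta: 1.0 / (1.0 + np.exp(-eta)),  # logit link -> sigmoid
    "poisson": lambda eta: np.exp(eta),                   # log link -> exp
}

def decode(z, W, b, family="bernoulli"):
    """One-layer decoder: returns the mean of x given the latent code z.

    W has shape (d, k), b has shape (d,), z has shape (k,).
    """
    eta = W @ z + b  # linear predictor, as in a GLM
    return CANONICAL_ACTIVATIONS[family](eta)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W, b, z = rng.normal(size=(4, 2)), np.zeros(4), rng.normal(size=2)
    print(decode(z, W, b, family="bernoulli"))  # entries in (0, 1): Bernoulli means
```

In each case the activation equals the derivative of the respective log-partition function, which is exactly the canonical choice described above.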

3.3 Approximating the ELBO

In this section, the new interpretation of the decoder as a GLM allows us to approximate the ELBO (see Proposition 1) by linearising the log-partition function . We can use the results to make statements on the choice of the activation function (see Section 3.4), to derive weight and bias initializations (see Section C.3 of the supplementary material and the results in Section 4) or to monitor the training of an intractable VAE with a tractable reference point (see Section 4).

Proposition 1.

Consider N observations of a d-dimensional r.v. X. Assume the independent identical marginals of given belong to the same EDF distribution with functions and as in (2). Each marginal distribution has a parameter () and is the same. For a VAE as in Section 3, we assume the one-layer decoder has a linearly canonical activation function , s.t. , where , , and .

Then, there exists an approximate representation for as in (1), around an arbitrary point , that admits an optimal solution for and , such that it can be written as

(4)

where and are constant in and .

Setting , , and , we have , and

where and denotes the sample mean.

See Section B.2 of the supplementary material for the proof. Usually, we set . See Table 2 in Section A of the supplementary material for different EDF distributions and linearly canonical activation functions as well as the parameters for Proposition 1 with this choice of .

To derive this alternative expression, we made use of a Taylor approximation. Therefore, we can quantify the approximation error via the corresponding remainder.
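
Concretely, the approximation rests on the standard second-order expansion of the log-partition function b around a point \theta_0 (generic notation), whose remainder is the quantity bounded in Corollary 1:

\[
b(\theta) \;=\; b(\theta_0) + b'(\theta_0)(\theta - \theta_0) + \tfrac{1}{2} b''(\theta_0)(\theta - \theta_0)^2 + R_2(\theta),
\qquad
R_2(\theta) \;=\; \tfrac{1}{6} b'''(\xi)(\theta - \theta_0)^3
\]

for some \xi between \theta_0 and \theta. In the Gaussian case b is quadratic, so the remainder vanishes and the approximation is exact.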

Corollary 1.

Under the assumptions of Proposition 1, the error relative to the original ELBO in (1) is given by , where denotes the Taylor remainder for approximating the log-partition function at the point with a second-order Taylor approximation. If we assume to originate

  • from a Gaussian distribution, then for all .

Consider , if we assume to originate

  • from a Binomial distribution, then

    .

  • from a Poisson distribution, then .

See Section B.3 of the supplementary material for the proof. Since in our setting is Gaussian under , the moments for given parameters can be calculated straightforwardly.

Corollary 1 highlights how our theory generalizes the works of Dai2018, Lucas2019 and Sicks2020. Under the Gaussian assumption with a linearly canonical activation, is exact. In this case, Proposition 1 yields a result equivalent to that given by Dai2018 and extended by Lucas2019. For the Bernoulli distribution, the expected difference is positive. Hence, with from Proposition 1 we approximate the ELBO in (1) from below. The result for the sigmoid activation is the same lower bound as reported in Sicks2020. Thus, the ELBO is bounded from both sides, as its values are naturally negative, and we have

(5)

Given the assumptions in Proposition 1, we retrieve the optimal and for as a direct consequence of the proof of Proposition 1 in the following corollary.

Corollary 2.

For an observation of , the optimal closed form solutions for and for in Proposition 1 are given by and , with and as in Proposition 1.

3.4 MLE and optimal solution

In the following, we analyse the optimal values of and for and highlight important properties. To do so, we rewrite the objective and obtain a representation similar to that in Tipping1999,

(6)

with and . According to Tipping1999, the MLE for is given by the sample mean of . Therefore, with becomes the sample covariance (we denote it as ). For later purposes, we also define the sample covariance matrix of as . With we denote the (ordered) eigenvalues of the matrix and the same for and . It holds for , with and from Proposition 1. In a similar way to Tipping1999, we can derive the MLE of as

(7)

where is composed of the eigenvectors of the matrix that are associated with the largest eigenvalues . is a diagonal matrix with entries

(8)

is an arbitrary rotation matrix, which implies that our optimal solution is invariant to rotations. Dai2018 show this as well as invariance to permutations in their Theorem 2.
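
For orientation, the classical pPCA solution of Tipping1999, whose structure (7) and (8) mirror, reads (in Tipping and Bishop's notation)

\[
W_{\mathrm{ML}} \;=\; U_k \big( \Lambda_k - \sigma^2 I \big)^{1/2} R,
\qquad
\sigma^2_{\mathrm{ML}} \;=\; \frac{1}{d - k} \sum_{j = k + 1}^{d} \lambda_j,
\]

where U_k holds the k leading eigenvectors of the sample covariance, \Lambda_k the corresponding eigenvalues \lambda_1 \geq \dots \geq \lambda_k, and R is an arbitrary rotation matrix.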

It is possible to have , and the term (for ) controls how many columns in the matrix are zero. We interpret this as a consequence of the auto-pruning property of VAE: if the data signal is not strong enough, it is pruned away.

Further, if exists (e.g. in the Gaussian case), the MLE for is given by . We interpret this as the variance lost due to the dimension reduction.

Given the MLE, the following result shows that the choice of the activation function does not matter for the maximal value of in non-degenerate cases (i.e. for ). The same can be shown for the bounds in Corollary 1. It is notable that neither the maximal point nor the bounds depend on the choice of the activation. We also observe this in our simulations. A possible reason could be that we restrict the theory to linearly canonical activations.

Proposition 2.

Assume that for the first eigenvalues of we have for . At the optimal points and , the maximum of in Proposition 1 becomes independent of and can be written as

(9)

If fewer than eigenvalues fulfil the assumption, depends on and is maximal for .

See Section B.4 of the supplementary material for the proof.

4 Simulation results

In this section, we provide simulation results to illustrate our theoretical results from Section 3. We focus on the Bernoulli case, which is popular for image data. According to Corollary 1, from Proposition 1 becomes a lower bound, yielding (5). Therefore, we expect the ELBO of a VAE with a corresponding architecture to lie above . The essential messages of the simulations are the following:

  • It is reasonable to use to analyse the training performance on real-life data sets.

  • The statement above is also valid for deep VAE architectures.

  • The MLE points from Section 3.4, used as initialization, enhance the training performance.

For training the nets, we use the Adam optimizer by Kingma2015 with learning rate and a batch size of . Training was done for a total of batch evaluations. The simulations ran on a dual Intel Xeon E5-2670 with 16 CPUs @ 2.6 GHz. The longest setup took about one hour of computing time. (Our code, together with a readme file for execution, can be found in the folder “vae_exp_fam_code” provided together with the supplementary materials.)

By varying the following hyperparameters, we conduct a total of 36 different simulation setups:

  • Architecture: “canonical” or “deep”.

  • Latent dimension : 2, 5 or 20.

  • Last decoder activation: “sigmoid” or “tanh”.

  • Data: “synthetic”, “frey” or “mnist”.

We compare our initialization scheme (“MLE-B”) to a benchmark (“Bench”) given by He2015. The initialization schemes and the different hyperparameters are explained in detail in Section C of the supplementary material. Figure 1 shows the result of training the two differently initialized VAE on the frey dataset with the two architectures “deep” and “canonical”. We set and choose the sigmoid function as decoder activation.
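
The following sketch illustrates the flavour of such an MLE-based initialization for a Bernoulli decoder with sigmoid activation; it is our own simplified reconstruction based on Section 3.4 (function and variable names are hypothetical), not the authors' exact MLE-B code:

```python
import numpy as np

def mle_init(X, k, eps=1e-6):
    """pPCA-style initialization for a one-layer sigmoid decoder.

    X: data matrix of shape (N, d) with entries in [0, 1]; k: latent dimension.
    Returns decoder weights W of shape (d, k) and bias b of shape (d,).
    """
    mean = X.mean(axis=0)
    S = np.cov(X, rowvar=False)                        # sample covariance, shape (d, d)
    lam, U = np.linalg.eigh(S)                         # eigenvalues in ascending order
    lam, U = lam[::-1], U[:, ::-1]                     # sort in descending order
    sigma2 = lam[k:].mean() if k < len(lam) else 0.0   # variance lost by the reduction
    scale = np.sqrt(np.clip(lam[:k] - sigma2, 0.0, None))  # auto-pruning: zero if signal too weak
    W = U[:, :k] * scale                               # columns scaled as in the pPCA-type MLE
    b = np.log(np.clip(mean, eps, 1.0 - eps) / np.clip(1.0 - mean, eps, 1.0 - eps))
    return W, b                                        # b is the logit of the sample mean
```

The bias is chosen so that, at z = 0, the sigmoid decoder reproduces the element-wise sample mean, while the weight columns follow the pPCA-type structure of (7) and (8).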

Figure 1: The figure shows the two setups, deep and canonical, with frey data and sigmoid activation. Displayed are the ELBOs of both initialisations, MLE-B and Bench, as well as the lower bound and the expected error provided by Corollary 1, calculated with the MLE. On the left, the values are calculated with the training data; on the right, with the test data.

In Figure 1, the bound is reasonable and both architectures do not perform significantly better. The results of all simulation schemes can be found in Section C.4 of the supplementary material. Comparing these simulation results and also considering Figure 1, we observe the following:

  • For the canonical architecture, the MLE-B initialization converges directly, whereas the benchmark takes much more time; the end values are comparable. For the deep architecture, the two initialization methods mostly show a small initial advantage of MLE-B which, however, is not as clear as in the canonical architecture.

  • Different decoder activations yield similar results. This observation agrees with our theoretical result that the activation affects neither bound nor optimal (see Section 3.4).

  • In no simulation setup did a net overfit, not even for large latent dimensions with synthetic data, where a much smaller one would have sufficed. This property of VAE is known as auto-pruning.

  • It seems that MLE-B needs only a very short burn-in period to perform according to Corollary 1. We believe that the offset at the beginning originates from hidden layers that are not yet suitably initialized (see Section C.1 of the supplementary material).

5 Conclusion

We have established a new framework for VAE by interpreting the decoder of a VAE as a GLM. Given this framework, we derive and analyse an approximation to the ELBO. We derive the MLE for this approximation and provide simulation results validating the theory on real-world datasets such as the frey and mnist datasets. These results generalize previous work in this field.

Possible extensions and research directions based on our findings are:

  • A deep decoder, as opposed to the one-layer version considered here, can be studied. A possible way would be to interpret the last layer of a deep decoder as a GLM, as in Wuthrich2020.

  • Distributions like the Gamma distribution also belong to the EDF but are not yet covered by our theory. Future research on possible data transformations should extend our results in this direction. Furthermore, it is also interesting to extend the GLM approach to cover popular activations such as ReLU and its variations.

Broader Impact

As our contributions are of a theoretical nature, we do not believe that the intention of this section applies to this work.


Appendix A Tables and Examples

Table 1: An overview of well-known distributions that apply to our theory in Section 3.3. The functions for the representation as an exponential family member, as well as and in terms of the natural parameters, are displayed.
Table 2: Different EDF distributions and linearly canonical activation functions, as well as the parameters for the approximation in Proposition 1, with .
Example 1 (Bernoulli distribution - sigmoid activation).

For , we get . Hence for the canonical activation we have

which is the sigmoid activation.
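
In the parametrisation sketched in Section A (our notation), the calculation behind Example 1 is the standard one:

\[
b(\theta) = \log\big(1 + e^{\theta}\big)
\quad\Longrightarrow\quad
b'(\theta) = \frac{e^{\theta}}{1 + e^{\theta}} = \frac{1}{1 + e^{-\theta}},
\]

so the canonical activation of the Bernoulli distribution is exactly the sigmoid (logistic) function.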

Example 2 (Bernoulli distribution - tanh activation).

Assume and set the activation as

As in the example before and it can be shown that
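
The relation behind this is presumably the elementary identity (generic notation)

\[
\tanh(x) \;=\; 2\,\sigma(2x) - 1
\quad\Longleftrightarrow\quad
\frac{\tanh(x) + 1}{2} \;=\; \sigma(2x),
\]

where \sigma denotes the sigmoid function: an affinely rescaled tanh equals the canonical (sigmoid) activation evaluated at a linearly rescaled predictor, which is precisely the linearly canonical property of (3).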

Appendix B: Proofs

B.1 Auxiliary results

Lemma 2.

For a -dimensional random variable it holds

Proof.

Given

and the fact that

yields the statement. ∎

Lemma 3.

Let be symmetric positive definite matrices. Then it holds

and hence

Proof.

Define the two distributions and . We have

(10)

Now, consider that for the Kullback-Leibler divergence between probability distributions and it holds:

  • It is non-negative for all inputs.

  • It is zero if and only if the two distributions agree almost everywhere.

Hence, we conclude equality in the minimum. ∎

B.2 Proof of Proposition 1

Proof.

To prove Proposition 1, we change the perspective. Instead of maximizing the average ELBO, we want to minimize the negative average ELBO given by

(11)

Looking at (11), we see two terms. For the KL-divergence we have that

(12)

and for the second term (with as abbreviation for ) we derive that

(13)
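
Before decomposing (13) further, note that for a diagonal Gaussian encoder and a standard normal prior, as in the usual VAE setting of Kingma, the KL term (12) has the well-known closed form (generic notation)

\[
\mathrm{KL}\big(\mathcal{N}(\mu, \operatorname{diag}(\sigma^2)) \,\big\|\, \mathcal{N}(0, I)\big)
\;=\; \frac{1}{2} \sum_{j=1}^{k} \big( \mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1 \big).
\]

The remaining work of the proof concerns the second term (13).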

We see two terms, where the expectation of is taken into account. Using for the first term in (13), it follows

(14)

For the other term in (13), first consider that the log-partition function possesses all derivatives and is convex, as stated in Lemma 1. We use a second-order Taylor approximation in of the log-partition function and get

Setting , and we have

We ignore the remainder, approximate the expression within the expectation of the second term in (13) accordingly, and get

with . With Lemma 2 and the fact that we can rewrite this as

(15)

Putting the results from (14) and (15) together with the KL term (12), it follows that our target function (11) is approximated by

All potential minima w.r.t. have to conform to

(16)

independent of , with . Note that this expression is well-defined since, by assumption, we have , and . To see that these are minima and not maxima, consider Lemma 3, where we set and .

Given the minimal target function evaluated at this point becomes

By rearrangement this becomes

where denotes the element-wise sample mean. For , we get as minimal points

(17)

with . This is well-defined, since and . The candidates for an optimal are minima, since the second derivative is a positive constant times , which is positive definite. Given the optimal and , our target function depends only on the parameters and . We get

(18)

where .

Consider the singular value decomposition of , where and are unitary matrices and

We have

and can write

with . For it follows that

(19)

where we denote . The justification of the last equation becomes apparent when we consider the respective diagonal elements of the diagonal matrices in the equation. We have

We can further rewrite (19) as