With the Variational Autoencoder (VAE), [Kingma] present a special form of autoencoder that incorporates randomness to the architecture. This form of autoencoder yields an interpretable latent space when it comes to images as input. [Goodfellow2016]
describe VAE as an “excellent manifold learning algorithm” due to the fact that the model is forced “to learn a predictable coordinate system that the encoder can capture”. Furthermore, VAE are suitable for generating new unobserved data by interpolating within the latent space (see the appendix A in[Kingma] for vivid examples of moving through the latent space).
In this paper we analyse VAE for the case of binary data. Let be such observations. A VAE consists of an encoder and a decoder. The encoder maps these multivariate binary observations to a low-dimensional numeric space of dimension . The decoder applies the inverse operation. [Kingma] introduce the “Bernoulli MLP decoder” for the binary setting used here. We call a VAE with this type of decoder “Bernoulli VAE”.
This work contributes to the understanding of VAE in three areas:
We derive an interpretable lower bound for the training objective of a Bernoulli VAE (the ELBO).
We propose an initialization scheme with an architecture change for Bernoulli VAE that results in faster convergence.
We derive a decision support for finding the parameter of the latent space dimension
with a principal component analysis (PCA).
Probably the most comparable work in this research field is [Lee2010]. They present a sparse logistic PCA, which has a target objective similar to the Bernoulli VAE: A balance between the reconstruction of the binary inputs and a regularization term. [Kunin2019] show that regularization helps the linear autoencoder to learn principal directions. Without regularization the linear autoencoder learns the subspace spanned by the top principal direction but not the directions themselves. [Dai2018] provide an interpretable term for the ELBO of a VAE with a “Gaussian MLP decoder” (as denoted in [Kingma]). This term is similar to our bound. [Tipping1999] derive an equivalent term, again under Gaussian assumptions but from the perspective of a PCA instead of a VAE.
For this paper, we assume given to be mutually independent and Bernoulli distributed. This assumption yields what is known as the “Cross Entropy Loss”. This loss finds application in different fields: [Blaschke2018] apply VAE on molecules represented in a so-called “SMILE”-format. [Germain2015] develop a modification for autoencoders to consider autoregressive constraints and focus on binary inputs. [Duan2019] apply the Bernoulli VAE to images.***Even though images are usually not represented in binary fashion, the pixel range can be transformed to be within zero and one. E.g. if the values originally range from 0 to 255 we divide them by 255.
This article is structured as follows: In section 2, we present related work and compare our mathematical problem in detail to the existing literature. Theoretical findings and discussion can be found in section 3. In section 4, we look at our simulation setup and present our results. Finally, we summarize in section 5 our contribution to the research of VAE and highlight further research topics.
2 Background and related work
The Bernoulli VAE yields a continuous representation of binary data, which can be viewed as a kind of logistic PCA. In this chapter, we have a look at related work and compare the mathematical formulations of sparse logistic PCA and Bernoulli VAE.
2.1 Related work: Logistic PCA
[Tipping1999] view PCA as a maximum likelihood procedure and introduce it as a probabilistic PCA (pPCA). We show connections of the Bernoulli VAE to pPCA based on our bound. [Collins2002] generalize pPCA and derive an optimization for the exponential family. They provide a generic algorithm for minimizing the loss. [DeLeeuw2006] presents logistic (and probit) forms of PCA and a majorization algorithm to fit the parameters. [Lee2010] propose a sparse logistic PCA by introducing regularization. This can be seen as the deterministic version of our proposed method here as we will see in section 2.3. [Landgraf2015] formulate the logistic PCA differently: For binary data their model is not based on matrix-factorization but on projection of the data.
2.2 Related work: Autoencoder
show for autoencoders with linearizable activations that the optimal solution is given by the solution of a Singular Value Decomposition.[Baldi1989a] extend these results and analyse the squared error loss of autoencoders for all critical points. [Saxe2014]
provide theory for learning deep linear neural networks for different non-linear dynamics.[Josse2016] present an autoencoder that provides a stable encoding when perturbed with bootstrap noise. Therefore, it can adapt to non-isotropic noise structures. [Pretorius2018]
analyse the learning dynamics of linear denoising autoencoders.[Dai2018] analyse the loss of the Gaussian VAE. They show connections to pPCA and robust PCA. The bound of the ELBO from this paper has an equivalent form to one of their results. [Kunin2019] consider regularizations in linear autoencoders and analyse the critical points for three targets.
VAE experience regularization over the Kullback-Leibler-Divergence (KL-Divergence) term (see below). Variations of autoencoders exists, which regularize the closeness of distributions in a different way. Adversarial Autoencoders (AAE) (see[Makhzani2015]) try to approximate the Jensen-Shannon-Divergence instead of using the KL-Divergence. [Zhao2019] motivate a unifying model for VAE and AAE.
2.3 Comparison of sparse logistic PCA and Bernoulli VAE
We are interested in a sparse low-rank representation of binary data. One possible approach is the logistic PCA introduced by [Lee2010]. Let be the observations of
, a multivariate Bernoulli distributed random variable, having a low-rank representation. Letdenote these representations with , i.e. there exists
and a location vectorsuch that
where denotes the probability vector. Finding the sparse low-rank representation (in the spirit of [Lee2010]) leads to maximizing the log-likelihood and a penalization term, given by
[Lee2010] introduced a penalty term to get sparse loading vectors
, similar to the LASSO regression from[Tibshirani1996].
We are going to analyse an alternative to logistic PCA using the VAE introduced by [Kingma]. For the VAE setup, we add a latent variable with density to this model. We change our perspective from
where the probability vector is a function of and the model parameters . We interpret this as replacing the deterministic low-rank representation by a stochastic .
Using the Bayes theorem, we get
which gives a lower bound for the log-likelihood function: The “Evidence Lower Bound” (ELBO). Since in our current model formulation the log-likelihood is intractable, we maximize the ELBO instead and assume the corresponding log-likelihood to increase as well.
The first part resembles the reconstruction error of an autoencoder: Given a sample , we integrate over the probabilities that an observation x was generated from that sample. The expression is the generating function that resembles the encoder in the autoencoder.
can be identified with the decoder of an autoencoder. Both, encoder and decoder are usually implemented as a multi layer perceptron (MLP). The second term is the Kullback-Leibler-Divergence that indicates how close the two inputs are. The further these two expressions are apart, the higher the penalization.
Figure 1 shows the implementation of a VAE as a neural net.
There are several similarities to the work of [Lee2010]:
The ELBO and the loss in (1) consist of two parts: The first part provides information on the goodness of fit, while the second part is a penalty term.
The probability vector of [Lee2010] and our assumed probability vector use sigmoid activation.
3 Theoretical results
In this section, we present our theoretical findings. Recall that
Our central result is the lower bound for the ELBO in proposition 1. So instead of the ELBO, we maximize the lower bound and assume that the corresponding log-likelihood increases.
We first state our setting and assumptions. Then we formulate proposition 1 and interpret the resulting lower bound of the ELBO. Based on these findings, we introduce an initialization and a change of architecture for Bernoulli VAE. Finally, we propose a way to determine the latent space dimension given the training data.
3.1 Assumptions and lower bound
Let denote the observations, where denotes the dimension. We assume a latent variable model: The values were generated given a latent (not observable) sample of a -dimensional variable . Furthermore, we assume and given are mutually independent. The distribution of given is
where the probability vector is a function of and the model parameters . We fix the probability vector the following way: For each entry of , we have
with , and the -th entry.
is the sigmoid function. Hencein our case. and can be identified with the weights and biases of the decoder of the VAE. In figure 1, these are the arrows between “Hidden Layer” and “Output Layer”.
We further expect the distribution of the variables to be given by the standard recognition model (see [Kingma]):
where we define and as functions of and . Furthermore, we define
as the density of a standard normal distribution.
The target of a VAE is the average ELBO, given by
for which we state the following lower bound:
Given the assumptions stated above, we further assume that for the encoder we have
where and are arbitrary functions that include affine transformations. Then, there exists a lower bound for the average ELBO in (3) that admits an optimal solution for and such that it can be written as
where with and .
The proof of this result can be found in the supplementary material. The advantage of this lower bound is the gain in interpretability as we will see. The assumptions of proposition 1 for and can be interpreted as “the encoder is always doing its job”. These are the same as in lemma 1 of [Dai2018].
The resulting objective is equivalent to the result of lemma 1 in [Dai2018] and the log-likelihood objective in [Tipping1999]. Contrary to our Bernoulli distribution assumption here, their objective originated from a Gaussian distribution. Nonetheless both assume a latent variable model, as we do here. The properties derived by [Dai2018] and [Tipping1999] apply to our target (4). Therefore, if we assume to be diagonal (which is a standard assumption for VAE) we get an upper bound for (4) (see [Dai2018]).
the sample covariance matrix (of ), since according to [Tipping1999] the maximum point for is given by the sample mean (of ), as follows:
The maximum point for w.r.t. is given by
where is composed of eigenvectors of the matrix S. These eigenvectors are associated with the
biggest eigenvalues. is a diagonal matrix with entries
is an arbitrary rotation matrix. [Kunin2019] obtain similar optimal values for the regularized linear autoencoders. We can derive several interesting features of :
It is possible to have . We use this observation in section 3.3 for the choice of the latent space size of the VAE.
The rotation matrix is arbitrary. This implies that our optimal solution is invariant to rotations. [Dai2018] show this as well as invariance to permutations in their theorem 2.
It can be shown that other candidate points of for represent saddle points (see appendix of [Tipping1999]). So, if we think of training a VAE, the optimizer should be robust to saddle points.
It is important that we assume given are mutually independent with (2) and not that the are mutually independent with
So, we have all the eigenvalues of readily at hand and only depends on for . The inequality in (8) is only fulfilled at , which results in . Thus assumption (9), with mutually independent, does not lead to meaningful solutions for the weights of the decoder.
Under assumption (2), with given mutually independent, we cannot calculate as above: are not mutually independent without knowing
. This fact stems from the common parent argument in Bayesian network theory.
We generate data in 4.1 according to assumption (2). For this data we observe eigenvalues that are bigger than the needed value of .†††E.g.: , , , … .
Apart from the optimal and we achieve representations for optimal and :
For an observation the optimal closed form solutions for and of the lower bound of the ELBO are given by
These optimal solutions are a by-product of the proof of proposition 1. They give us insight into what to expect from a trained neural net. We use this to propose a new initialization and a change of the VAE structure.
3.2 Initialization and change of net architecture for Bernoulli VAE
Given the optimal values for the encoder outputs in corollary 1 as well as for the decoder parameters in (7) and (6), we initialize the corresponding weights and biases of a VAE. This affects only the last layers of the encoder and decoder.
Then we introduce and discuss a slight change of the net architecture.
At last we explain our approach for over-parametrized nets in the form that we have more ingoing dimensions into the last layers of encoder and decoder than needed. This issue arises when we consider hidden layers with higher dimension than the input.
Our simulation results in 4.3 show a faster convergence and comparable results as produced by a VAE without this change.
The weights and biases for the last decoder layer are straightforward set as the optimal from (7) and from (6). For the last encoder layers we first note that there are two layers: the “”- and the “”-layer. We initialize the weights and biases of the “”-layer as follows:
The VAE-architecture is changed at the “”-layer: We decouple it from the layers before. Consequently, the output of this layer never changes. We set it to produce the of . This change is justified by the fact that in (10) is the same for all inputs . Apart from a better initialization than over random values, a benefit of this approach is that the distribution now is prohibited to become degenerated. E.g. is this the case when .
All weights and biases produced here only depend on and , which are easily obtainable: We just need to calculate the sample mean and a Singular Value Decomposition of the sample covariance.
In case of over-parametrized nets, more edges lead into the affected layers than we need. This problem only concerns the weights and not the biases. We solve this by initializing not needed dimensions of the weights with zero. This still allows the net to change these weights during training, but they have a starting point which is optimal in the sense of our bound from proposition 1.
3.3 Optimal latent space size
Given the structure of in (7), we propose a decision support for the choice of the latent space dimension
. The approach we propose here is based on PCA and is therefore not new. Usually we perform a Singular Value Decomposition of the estimated covariance matrix and consider the dimensions associated to the biggest eigenvalues. This way, we retain the variation present in the data up to a maximum amount. The parameterequals the number of eigenvalues we decide for.
The form of (7) shows which eigenvalues of the sample covariance matrix S are eligible for : Only those that are greater then . Hence, all dimensions with smaller associated eigenvalues can be discarded as this would only introduce a zero-row in .
Concluding, the value for can be chosen the following way:
Perform a Singular Value Decomposition of S.
Only consider eigenvalues as eligible that are greater than when choosing
For the equivalent terms in [Tipping1999], [Dai2018] and [Kunin2019], this approach is not directly possible. There, also depends on the parameter
, the variance of the observation distribution under isotropic Gaussian assumption. This parameter has to be estimated.
Given our theoretical results, we provide simulations that back up these findings. Therefore, in this section, we compare performances of VAEs. The essential messages of our simulations are the following:
Given a VAE, our proposed initialization of the net results in a faster training convergence.
Even without our initialization as starting points, the VAE approaches the theoretical bound after a longer training period.
With our initialization and architecture, the VAE is less prone to over-fitting.
For the simulations, we generate data and show that our theory applies to the general case. Practitioners are welcome to apply our proposed setting to their data.
We first describe the data generation in detail. Then, we show the architecture and the initialization of the VAE and how we train the two resulting nets. Given these two parts, we present the simulation setup and interpret the results.
4.1 Data generation
For , and we generate two matrices and . is identifiable with the principal components and B with the loading vectors of a PCA. is assumed to be sparse (see below).
We construct the matrices in the fashion of [Lee2010]: The two-dimensional principal components of are drawn from normal distributions, so that and . The sparse loading vectors are constructed by setting to zero except for and .
Given and we calculate
and the probability matrix , with
where we apply the sigmoid function element-wise. We then use the probabilities to independently draw samples
and with the data , we conduct the simulation.
Given the combinations of and , we get nine different data sets for our simulation.
Though the case
seems to be rather odd for training of neural nets,[Lee2010] generate data explicitly with and and denote it as challenging tasks for their logistic PCA. We therefore keep it in our simulation study to see how our bound from proposition 1 relates to the loss of the VAEs for the test data.
4.2 VAE architecture
The architecture of the VAE that we look at can be described as follows:
where the dots just indicate the line break. Each aspect denotes a layer and the value in the parentheses gives the dimension of this layer. So, denote the hidden layers of the encoder and those of the decoder. This architecture, the notation and the training parameters (see next section) are as in [Dai2018].
The hidden layers of the encoder and the decoder are implemented with “ReLU”-activation (see[nair2010rectified]), which is known to be highly expressive. The “”- and “”- layer have linear activations. As we need an output between zero and one and in consistency with our theoretical derivation from before, the last layer “
” has a sigmoid activation function.
The network weights and biases are initialized as proposed by [He2015]. This initialization particularly considers rectifier non-linearities.
Apart from the fact that we expect the net to provide a better loss than provided by our theoretical bound, the net should also be able to represent the optimal and diagonal entries of the optimal from corollary 1 if necessary.
We train a second version of VAE, called “VAE preinit”. This is equal to the one described above for all but the following two aspects:
We initialize the weights and biases of the “”- and “”-layer as proposed in 3.2. Dimensions that we do not need for this initialization are set to zero but can change during training.
We fix the “”-layer to always produce the diagonal entries of from corollary 1. The output of this layer is not affected by training. So the variance of “VAE preinit” does not depend on the input and behaves approximately as suggested from the theoretical results.
Obviously both autoencoders are over-parametrized for the simple simulation. Looking at the decoder part and comparing it with the data generation in 4.1, one layer suffices. Furthermore, our theoretical setting in section 3 manifests this view. We note that the simulation on such a “canonical” version of a VAE produced results that favoured our proposed initialization and architecture even more than those stated in 4.3
. The results for this case can be found in the supplementary materials. The intention behind the over-parametrized structures is to show that even for deep learning architectures, our initialization performs well and that our theoretical bound serves its purpose.
4.3 Training setting and results
For training of the nets we used the Adam optimizer by [Kingma2015] with learning rate and a batch size of . Training was done for a total of epochs each time. We split the data into training data and test data and calculate the optimal values and only on basis of the training data. Given these, we obtain a theoretical bound for training and test data. We also use these parameters for the initialization of “VAE preinit” described in 4.2.
We first present our simulation results given the values of the losses after the epochs. Afterwards, we compare the development of the two autoencoders during one training.
4.3.1 Simulation results for deep architectures
Given the possible combinations of and , we have nine different simulations. We generate data once and for each of the nine simulation settings we train the two VAEs with different (randomly) initialized starting points. The “VAE preinit” layers affected by our initialization keep their values. For this net we get different (randomly) initialized starting points for the remaining layers.
Each simulation (200 net trainings) was executed on a Dual Intel Xeon Gold 6132 ("Skylake") @ 2.6 GHz with 28 CPU cores. The longest setup ( and ) took about two and a half days of computing time. We used and modified the implementation of a VAE provided by [keras].‡‡‡Our code together with a readme file for execution can be found in the provided file “vae_analysis_binary.zip” in the supplementary materials.
Table 1 and table 2 display information on loss values at the end of epochs of training. We calculate the deviation from the bound value displayed in the same row and provide the interval of minimal and maximal deviation, displayed as “[Min%,Max%]”.§§§To recover the real minimal or maximal value calculate : e.g. The “[Min%,Max%]” intervals let us highlight the main aspects later on better than a “mean
std”-format. The mean and standard deviation values for the deep leaning architectures can be found in the supplementary material. We also present the simulation results (“[Min%,Max%]” and “meanstd”) for the “canonical” architectures there.
In table 1 and table 2, the nets perform better with smaller values. A negative percentage means that the final loss is smaller than the bound. As stated earlier, the deep architecture we look at here is actually not comparable to the bound. Hence, the loss values can be greater than the bound.
Comparing the test cases, we favour “VAE preinit” in all but one setup and .
For the simulation setups with both nets are over-fitting. For all cases the “VAE preinit” is less prone to this behaviour, as can explicitly be seen from the performance on the test data in table 1.
For the training data, the bound is often reached by both nets. For the test data though, the resulting loss is likely to lie above the bound.
Overall, the intervals are getting tighter for increasing : Not surprisingly more data results in better fits. The bound is more robust: The values for different are comparable.
4.3.2 Loss development for deep architectures
Figure 2 displays loss values of the two different VAEs for test and training data. Furthermore, the bound was calculated for these two sets and added to the figure. As we are looking at the losses (the negative average ELBOs) of neural nets, smaller is better. We note that the “VAE preinit” has a faster convergence for both training and test set. After epochs, the different net architectures produce comparable losses. We conclude that fixing the “”-layer in “VAE preinit” produces a comparable final loss value. For the training set, both net architectures admit the bound after enough steps of training. Even though our theory from section 3 produces this bound for less parametrized net-structures than we look at here.
5 Conclusion and possible extensions
The derived theoretical bound for VAE is easy to calculate and we show that even the performance of deep VAE-architectures (six layers) can be assessed with it. Therefore, we propose to use our bound to monitor the training of Bernoulli VAE.
Using our proposed initialization and the modification of the Bernoulli VAE architecture, a faster convergence rate is visible. Furthermore, we get more favourable results for the final loss values in all but one of our simulations. As a result, we motivate practitioners with real world binary data to apply a Bernoulli VAE with this initialization and architecture to their data.
The decision support for the optimal latent dimension reduces the set of possible values when using a PCA.
Appendix A Supplementary Material
a.1 Proof of proposition 1
To proof proposition 1, we will change the perspective. Instead of maximizing the ELBO, we want to minimize the negative ELBO given by
Looking at (12), we see two terms. For the KL-Divergence we have that
and for the second term (with as abbreviation for ) we derive that
For the first term we have
If we do a Taylor series expansion in the point for the function we get
This approximation is shown in figure 3. A degree 2 expansion seems reasonable and we see that it is approaching the original function from below. In fact, we have that for the remainder it holds
An detailed argument is given section A.2.
Leaving the remainder aside, we get for the expectation in (14) that
We can write , with . Therefore, if denotes the -th row of , for
it follows that
For the sum in (14), we get
with . We always approximate the sum from below and hence the negative ELBO from above. Putting these results together with (A.1) and (15) for our target function in (12), it follows that it is smaller than
All potential minima w.r.t. have to conform to
independent of . To see that these are also minima and not maxima, consider that it holds (see A.3) that
where we set and .
Given the candidates for the minimal target function are writeable as
For , we achieve as minimal points
with . These are minima since the second derivative is , which is positive definite. Given the optimal and our target function is only dependent on the parameters and . Hence
where and is fixed. Consider the Singular Value Decomposition of with and are unitary matrices and
and can therefore write
with . For the matrix expression in the middle of the term in (20) it follows that
where we denote . The justification of the last equation becomes apparent, when we consider one respective diagonal element of the diagonal matrices in the equation. We have
We can further rewrite (21) as
where we have introduced
So for our target function in (20), we get
By adding and subtracting the constant term , concluding
and together with