 # Least Square Variational Bayesian Autoencoder with Regularization

In recent years Variational Autoencoders (VAEs) have become one of the most popular approaches to unsupervised learning of complicated distributions. A VAE provides more efficient reconstructive performance than a traditional autoencoder, and variational autoencoders make better approximations than MCMC. The VAE defines a generative process in terms of ancestral sampling through a cascade of hidden stochastic layers; it is a directed graphical model. A variational autoencoder is trained to maximise the variational lower bound: we try to maximise the likelihood while at the same time making a good approximation of the data, which amounts to trading off the data log-likelihood against the KL divergence from the true posterior. This paper describes the scenario in which we wish to find a point estimate of the parameters θ of some parametric model in which each observation is generated by first sampling a local latent variable and then sampling the associated observation. Here we use a least-squares loss function with regularization for the reconstruction of the image; the least-squares loss function was found to give better reconstructed images and a faster training time.


## 1 Introduction

A Variational Autoencoder (VAE) provides more efficient reconstructive performance than a traditional autoencoder. VAEs are directed graphical models, trained to maximise the variational lower bound, which amounts to trading off the data log-likelihood against the KL divergence from the true posterior. This paper describes the scenario in which we wish to find a point estimate of the parameters $\theta$ of some parametric model in which each observation $x_i$ is generated by first sampling a “local” latent variable $z_i$ and then sampling the associated observation. The model is represented graphically in Figure 1.

The posterior is intractable for a continuous latent space whenever either the prior or the likelihood is non-Gaussian, meaning that approximate inference is required. To this end Auto-Encoding Variational Bayes (AEVB) makes two methodological contributions: it introduces a differentiable stochastic estimator for the variational lower bound to the model evidence, and it uses this estimator to learn a recognition model that provides a fast way to compute an approximate posterior distribution over “local” latent variables given observations.

## 2 Stochastic Variational Inference

The aim of variational inference is to provide a deterministic approximation $q_\phi(\theta)$ to an intractable posterior distribution $p(\theta \mid D)$ by finding parameters $\phi$ such that $D_{KL}\big(q_\phi(\theta)\,\|\,p(\theta \mid D)\big)$ is minimised. This is achieved by noting that

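$$
D_{KL}\big(q_\phi(\theta)\,\|\,p(\theta \mid D)\big) = \log p(D) + \mathbb{E}_{q_\phi(\theta)}\big[\log q_\phi(\theta) - \log p(\theta, D)\big] =: \log p(D) - \mathcal{L}(\phi; D). \tag{1}
$$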

Noting that $\log p(D)$ is constant w.r.t. $\phi$, we can now minimise the KL-divergence by maximising the evidence lower bound (ELBO) $\mathcal{L}(\phi; D)$ (that this is indeed a lower bound follows from the non-negativity of the KL-divergence). Aside from some notable exceptions, this quantity is not tractably point-wise evaluable. However, if $p(\theta, D)$ and $q_\phi(\theta)$ are point-wise evaluable, it can be approximated using Monte Carlo as

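$$
\mathcal{L}(\phi; D) \approx \frac{1}{L}\sum_{l=1}^{L}\big[\log p(\theta_l, D) - \log q_\phi(\theta_l)\big], \qquad \theta_l \sim q_\phi(\theta). \tag{2}
$$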
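As an illustration of the Monte Carlo estimate in equation (2), the following is a minimal sketch on a toy conjugate model that is not part of the paper: a prior θ ~ N(0, 1), a single observation x | θ ~ N(θ, 1), and a Gaussian q. The model, the function names and the sample counts are all assumptions chosen so that the exact evidence is available for comparison.

```python
import math
import random


def log_normal_pdf(value, mean, std):
    """Log density of a univariate Gaussian N(mean, std^2) at `value`."""
    return -0.5 * math.log(2.0 * math.pi) - math.log(std) - 0.5 * ((value - mean) / std) ** 2


def elbo_monte_carlo(x, q_mean, q_std, num_samples, rng):
    """Estimate the ELBO as in equation (2) for the toy model
    theta ~ N(0, 1), x | theta ~ N(theta, 1), with q(theta) = N(q_mean, q_std^2)."""
    total = 0.0
    for _ in range(num_samples):
        theta = rng.gauss(q_mean, q_std)  # theta_l ~ q_phi(theta)
        log_joint = log_normal_pdf(theta, 0.0, 1.0) + log_normal_pdf(x, theta, 1.0)
        total += log_joint - log_normal_pdf(theta, q_mean, q_std)
    return total / num_samples


rng = random.Random(0)
x_obs = 2.0
# For this toy model the evidence is available exactly: p(x) = N(x; 0, 2).
log_evidence = log_normal_pdf(x_obs, 0.0, math.sqrt(2.0))

# With q equal to the exact posterior N(x/2, 1/2) the bound is tight.
elbo_exact_q = elbo_monte_carlo(x_obs, x_obs / 2.0, math.sqrt(0.5), 100, rng)
# With a mismatched q the estimate falls below log p(x) on average.
elbo_poor_q = elbo_monte_carlo(x_obs, 0.0, 1.0, 5000, rng)
```

In this toy setting the gap between `log_evidence` and `elbo_poor_q` is an estimate of the KL-divergence in equation (1).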

This stochastic approximation to the ELBO is not differentiable w.r.t. $\phi$, as the distribution from which each $\theta_l$ is sampled itself depends upon $\phi$, meaning that the gradient of the log likelihood cannot be exploited to perform inference. One of the primary contributions of the paper being reviewed is to provide a differentiable estimator for $\mathcal{L}(\phi; D)$ that allows gradient information in the log likelihood to be exploited, resulting in an estimator with lower variance. In particular it notes that if there exists a tractable reparameterisation of the random variable $\theta \sim q_\phi(\theta)$ such that

$$
\theta = g_\phi(\epsilon), \qquad \epsilon \sim p(\epsilon), \tag{3}
$$

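then we can approximate the ELBO as

$$
\mathcal{L}(\phi; D) = \mathbb{E}_{p(\epsilon)}\big[\log p(\theta, D) - \log q_\phi(\theta)\big] \approx \frac{1}{L}\sum_{l=1}^{L}\big[\log p(\theta_l, D) - \log q_\phi(\theta_l)\big] =: \tilde{\mathcal{L}}_1(\phi; D), \tag{4}
$$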

where $\theta_l = g_\phi(\epsilon_l)$ and $\epsilon_l \sim p(\epsilon)$. Thus the dependence of the sampled parameters on $\phi$ has been removed, yielding a differentiable estimator provided that both $\log p(\theta, D)$ and $q_\phi(\theta)$ are themselves differentiable. Approximate inference can now be performed by computing the gradient of $\tilde{\mathcal{L}}_1$ w.r.t. $\phi$, either by hand or using one’s favourite reverse-mode automatic differentiation package (e.g. Autograd), and performing gradient-based stochastic optimisation to maximise the ELBO using, for example, AdaGrad. One can also re-express the ELBO in the following manner:

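$$
\mathcal{L}(\phi; D) = \mathbb{E}_{q_\phi(\theta)}\big[\log p(D \mid \theta)\big] - D_{KL}\big(q_\phi(\theta)\,\|\,p(\theta)\big). \tag{5}
$$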

This is useful as the KL-divergence between the variational approximation and the prior over the parameters has a tractable closed-form expression in a number of useful cases. This leads to a second estimator for the ELBO:

$$
\tilde{\mathcal{L}}_2(\phi; D) := \frac{1}{L}\sum_{l=1}^{L}\log p(D \mid \theta_l) - D_{KL}\big(q_\phi(\theta)\,\|\,p(\theta)\big). \tag{6}
$$

It seems probable that this estimator will in general have lower variance than $\tilde{\mathcal{L}}_1$.

So far Stochastic Variational Inference has been discussed only in a general parametric setting. The paper’s other primary contribution is to use a differentiable recognition network to learn to parameterise the posterior distribution over latent variables $z_i$ local to each observation $x_i$ in a parametric model. In particular, they assume that, given some global parameters $\theta$, $z_i \sim p_\theta(z)$ and $x_i \sim p_\theta(x \mid z_i)$. In the general case the posterior distribution over each $z_i$ will be intractable. Furthermore, the number of latent variables $z_i$ increases as the number of observations increases, meaning that under the framework discussed above we would have to optimise the variational objective with respect to each of them independently. This is potentially computationally intensive and quite wasteful, as it completely disregards any information about the posterior distribution over the $z_i$ provided by the similarities between input locations $x_i$ and the corresponding posteriors $p_\theta(z_i \mid x_i)$. To rectify this the recognition model $q_\phi(z_i \mid x_i)$ is introduced. Given the recognition model and a point estimate for $\theta$, the ELBO becomes

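$$
\begin{aligned}
\mathcal{L}(\theta, \phi; D) &= \mathbb{E}_{q_\phi(z_1 \mid x_1), \ldots, q_\phi(z_N \mid x_N)}\Big[\log \prod_{i=1}^{N} p_\theta(x_i, z_i) - \log \prod_{i=1}^{N} q_\phi(z_i \mid x_i)\Big] \\
&= \sum_{i=1}^{N} \mathbb{E}_{q_\phi(z_i \mid x_i)}\big[\log p_\theta(x_i, z_i) - \log q_\phi(z_i \mid x_i)\big].
\end{aligned} \tag{7}
$$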

For this ELBO we can derive a similar result to equation (4), where we do not assume a closed-form solution for the KL divergence between distributions, and include mini-batching to obtain an estimator for the ELBO from a mini-batch of observations:

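$$
\mathcal{L}(\theta, \phi; D) \approx \tilde{\mathcal{L}}^{A}(\theta, \phi; D) = \frac{N}{LM}\sum_{i=1}^{M}\sum_{l=1}^{L}\big[\log p_\theta(x_i, z_{i,l}) - \log q_\phi(z_{i,l} \mid x_i)\big], \qquad z_{i,l} = g_\phi(x_i, \epsilon_{i,l}), \quad \epsilon_{i,l} \sim p(\epsilon), \tag{8}
$$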

where the $M$ observations in the mini-batch are drawn uniformly from the data set comprised of $N$ observations, and for each observation we draw $L$ samples from the approximate posterior $q_\phi(z_i \mid x_i)$. Similarly, if $q_\phi(z_i \mid x_i)$ and $p_\theta(z_i)$ are such that the KL-divergence between them has a tractable closed-form solution, then we can use an approximate bound which we could reasonably expect to have a lower variance:

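$$
\begin{aligned}
\mathcal{L}(\theta, \phi; D) &= \mathbb{E}_{q_\phi(z_1 \mid x_1), \ldots, q_\phi(z_N \mid x_N)}\Big[\log \prod_{i=1}^{N} p_\theta(x_i \mid z_i)\Big] - D_{KL}\Big(\prod_{i=1}^{N} q_\phi(z_i \mid x_i)\,\Big\|\,\prod_{i=1}^{N} p_\theta(z_i)\Big) \\
&= \sum_{i=1}^{N}\Big(\mathbb{E}_{q_\phi(z_i \mid x_i)}\big[\log p_\theta(x_i \mid z_i)\big] - D_{KL}\big(q_\phi(z_i \mid x_i)\,\|\,p_\theta(z_i)\big)\Big) \\
&\approx \frac{N}{M}\sum_{i=1}^{M}\Big(\frac{1}{L}\sum_{l=1}^{L}\log p_\theta(x_i \mid z_{i,l}) - D_{KL}\big(q_\phi(z_i \mid x_i)\,\|\,p_\theta(z_i)\big)\Big) =: \tilde{\mathcal{L}}^{B}(\theta, \phi; D),
\end{aligned} \tag{9}
$$

where $z_{i,l} = g_\phi(x_i, \epsilon_{i,l})$ and $\epsilon_{i,l} \sim p(\epsilon)$.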
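The reparameterised estimator of equation (4) can be sketched on a toy problem where the answer is known. The example below assumes a model that is not in the paper (θ ~ N(0, 1), x | θ ~ N(θ, 1), one observation), a Gaussian q(θ) = N(m, s²), hand-derived gradients, and plain stochastic gradient ascent rather than AdaGrad; the function name, step size and sample count are illustrative choices only.

```python
import math
import random


def fit_gaussian_posterior(x, steps=2000, num_samples=16, lr=0.02, seed=0):
    """Maximise the ELBO of the toy model theta ~ N(0, 1), x | theta ~ N(theta, 1)
    over q(theta) = N(m, s^2) using reparameterised gradients.

    theta = g(eps) = m + s * eps with eps ~ N(0, 1), so gradients pass through theta.
    The standard deviation is stored as log_s so that it stays positive.
    """
    rng = random.Random(seed)
    m, log_s = 0.0, 0.0
    for _ in range(steps):
        s = math.exp(log_s)
        grad_m, grad_log_s = 0.0, 0.0
        for _ in range(num_samples):
            eps = rng.gauss(0.0, 1.0)
            theta = m + s * eps
            # d/dtheta log p(theta, x) = -theta + (x - theta)
            dlogp_dtheta = x - 2.0 * theta
            # dtheta/dm = 1, and -log q does not depend on m for fixed eps.
            grad_m += dlogp_dtheta
            # dtheta/ds = eps, and -log q contributes d/ds log s = 1/s;
            # the trailing factor s is the chain rule from s to log_s.
            grad_log_s += (dlogp_dtheta * eps + 1.0 / s) * s
        m += lr * grad_m / num_samples
        log_s += lr * grad_log_s / num_samples
    return m, math.exp(log_s)


m_hat, s_hat = fit_gaussian_posterior(2.0)
```

For this conjugate toy model the exact posterior is N(x/2, 1/2), so with x = 2 the fitted values should end up close to a mean of 1 and a standard deviation of about 0.71, up to stochastic-gradient noise.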

## 3 The Variational Autoencoder

Variational autoencoders are generative models belonging to the field of representation learning in artificial intelligence, in which an input is mapped into hidden representations. They combine deep learning techniques with Bayesian theory. A variational autoencoder learns the joint distribution of the data under some regularisation, and learns latent variables that account for the observed data points while remaining close to the prior distribution over the latent space. The variational autoencoder uses some of the methods described previously to optimise the objective function. We define a generative model in which we assume that the $i$-th observation was generated by first sampling a latent variable $z_i \sim \mathcal{N}(0, I)$ and that each observation is a real-valued vector $x_i \sim \mathcal{N}\big(\mu_\theta(z_i), \sigma^2_\theta(z_i)\big)$, where

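$$
\begin{aligned}
\mu_\theta(z_i) &= h_i W^{(dec)}_{\mu} + b^{(dec)}_{\mu}, && (10) \\
\log \sigma^2_\theta(z_i) &= h_i W^{(dec)}_{\sigma} + b^{(dec)}_{\sigma}. && (11)
\end{aligned}
$$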

Here $h_i$ is the output at the final hidden layer of the “decoder MLP”, and $W^{(dec)}_{\mu}, W^{(dec)}_{\sigma}$ are matrices mapping from the hidden units to the observation space. Similarly, $b^{(dec)}_{\mu}, b^{(dec)}_{\sigma}$ are row-vector biases. Note that the variances are parameterised implicitly through their logs to ensure that they take values only on the positive reals. The output at the final layer of the decoder, passed through a logistic sigmoid, is

$$
f_\theta(z_i) = \Big[1 + \exp\big(-h_i W^{(dec)} - b^{(dec)}\big)\Big]^{-1}, \tag{12}
$$

where $W^{(dec)}$ and $b^{(dec)}$ are the weights and biases of the output layer. The recognition model is given by

$$
q_\phi(z \mid x_i) = \mathcal{N}\big(z;\, \mu_\phi(x_i),\, \sigma^2_\phi(x_i)\big), \tag{13}
$$

where the parameters $\mu_\phi(x_i)$ and $\sigma^2_\phi(x_i)$ are again given by an MLP, which will be referred to as the encoder, whose input is $x_i$. There is therefore a closed-form solution for the KL-divergence between the variational posterior $q_\phi(z \mid x_i)$ and the prior $p_\theta(z)$. Hence we can estimate the lower bound using equation (9), with the reparameterisation

$$
g_\phi(x_i, \epsilon) = \mu_\phi(x_i) + \sigma_\phi(x_i) \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I). \tag{14}
$$
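The two ingredients used here, the reparameterised sample of equation (14) and the closed-form KL term between a diagonal Gaussian posterior and a standard normal prior, can be sketched as follows. This is a minimal illustration assuming a standard normal prior and a log-variance parameterisation; the function names and the example values for the encoder outputs are hypothetical.

```python
import math
import random


def reparameterise(mu, log_var, rng):
    """Draw z = mu + sigma * eps elementwise, eps ~ N(0, I), as in equation (14)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ) for a diagonal Gaussian."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


rng = random.Random(0)
# Hypothetical encoder outputs for one observation with a 2-D latent space.
mu, log_var = [1.0, -0.5], [0.0, math.log(0.25)]
z = reparameterise(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)
```

The KL term is zero exactly when the posterior equals the prior, which is what pulls the latent codes towards the prior during training.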

## 4 Replications and Extensions

### 4.1 Visualization

Here we show the learnt representations of the encoders for a two-dimensional latent space. We mapped the coordinates through the inverse CDF of the Gaussian prior to obtain values of the latent variable $z$, and then plotted the output of our decoder with the estimated parameters $\theta$. Figure 2 shows the results for two learned manifolds.

Figure 2: Visualisations of learned data manifolds using two-dimensional latent space, trained with AEVB and Gaussian prior over the latent variable z.

In the case of the MNIST data set, the learned data manifold can be seen clearly. In the case of the second data set, it is interesting that the first half of the manifold shows the face in left profile while the second half slowly transforms it into right profile.
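The inverse-CDF mapping used for these visualisations can be sketched as below. It assumes a standard normal prior on each latent dimension and an evenly spaced grid on the unit square; the grid size and the function name are illustrative and are not taken from the experiments.

```python
from statistics import NormalDist


def latent_grid(points_per_axis):
    """Map an evenly spaced grid on the open unit square through the inverse CDF
    of N(0, 1), giving latent coordinates equally spaced in prior probability."""
    std_normal = NormalDist(mu=0.0, sigma=1.0)
    # Cell midpoints avoid 0 and 1, where the inverse CDF is infinite.
    probs = [(k + 0.5) / points_per_axis for k in range(points_per_axis)]
    axis = [std_normal.inv_cdf(p) for p in probs]
    return [[(z1, z2) for z1 in axis] for z2 in axis]


grid = latent_grid(5)  # 5 x 5 latent points; each would be decoded into one image
```

Each pair in `grid` would then be passed through the trained decoder to produce one tile of the manifold plot.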

### 4.2 Number of Samples

Kingma and Welling found that the number of samples $L$ per data point can be set to $1$ as long as the mini-batch size $M$ was large enough, e.g. $M = 100$. However, they did not provide any empirical results, so we decided to compare the number of samples against the batch size in terms of optimising the lower bound. Due to the computational requirements (a separate model has to be trained for every combination), the experiment was run only on the Frey Face data set. Figure 3 presents the results across the sample sizes and batch sizes that were tried.

Figure 3: Heatmaps of lower bound for Frey Face data set.

In the case of the test data, the score was invariant to the sample size only once the batch size was large enough (see Figure 3). In the case of the training set, the highest lower bound was obtained at a batch size for which the sample size has a large influence. For larger batch sizes, the lower bound becomes invariant to the sample size. However, it is possible that models trained with a larger batch size need more time to converge, so we used a least-squares loss function with regularization, which provided better convergence.

### 4.3 Increasing the depth of the encoder

Extending the depth of neural networks has proved to be a very helpful way of increasing the power of neural architectures. We decided to add additional hidden layers to test how much can be gained in terms of optimising the lower bound while still obtaining robust results. The experiment was run on the MNIST data set with the size of the latent space set to 10, which seems to be the optimal value.

Figure 4: Comparison of performance between different encoder architectures in terms of optimising the lower bound, with the dimensionality of the latent space set to 10.

Additional hidden layers did not yield a substantial increase in performance; however, the encoder with two hidden layers performed slightly better than the original architecture. Presumably owing to the “vanishing gradients” problem, adding a fourth hidden layer resulted in the inability of the network to learn.

### 4.4 Noisy KL-divergence estimate

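Until now we have assumed that the KL-divergence term $D_{KL}\big(q_\phi(z_i \mid x_i)\,\|\,p_\theta(z_i)\big)$ can be obtained using a closed-form expression. However, for non-Gaussian distributions this is often impossible, and this term also requires estimation by sampling. This more generic SGVB estimator $\tilde{\mathcal{L}}^{A}$ is of the form:

$$
\tilde{\mathcal{L}}^{A}(\theta, \phi; x_i) = \frac{1}{L}\sum_{l=1}^{L}\big(\log p_\theta(x_i, z_{i,l}) - \log q_\phi(z_{i,l} \mid x_i)\big).
$$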

Naturally, this form will in general have greater variance than the previous estimator. We decided that it would be informative to compare the performance of both estimators using only one sample, i.e. $L = 1$, as this allows us to observe how much more robust the first estimator is. Figure 5 shows the results for the MNIST data set.

Figure 5: Comparison of two SGVB estimators in terms of optimizing the lower bound for different dimensionality of latent space (Nz).

As we can see, in each case the performance of the estimator with the closed-form KL term is substantially better. Moreover, this suggests that the generic estimator might be usable if we increase the sample size $L$ to reduce the variance of the estimator; however, this needs to be examined empirically, and it comes with higher computational costs.

Although, as expected, the ReLU function learns the fastest, there are no substantial gains over the tanh function. In each case, the sigmoid function learns the slowest and obtains the lowest score for the bound, and at the same time its training took more time than in the case of the two other functions.
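The two single-sample SGVB estimators compared in this subsection can be put side by side on a toy problem. The sketch below assumes a model that is not part of the experiments (z ~ N(0, 1), x | z ~ N(z, 1)) and a deliberately mismatched Gaussian q; all names and settings are illustrative. It checks only that both estimators target the same lower bound and reports their sample variances for this one configuration.

```python
import math
import random


def log_normal_pdf(value, mean, std):
    """Log density of a univariate Gaussian N(mean, std^2) at `value`."""
    return -0.5 * math.log(2.0 * math.pi) - math.log(std) - 0.5 * ((value - mean) / std) ** 2


def single_sample_estimates(x, q_mean, q_std, num_draws, seed=0):
    """Per-draw values of the generic estimator (A) and the closed-form-KL
    estimator (B), each with L = 1, for z ~ N(0, 1), x | z ~ N(z, 1)."""
    rng = random.Random(seed)
    # KL( N(q_mean, q_std^2) || N(0, 1) ) in closed form.
    kl = 0.5 * (q_mean ** 2 + q_std ** 2 - 1.0 - 2.0 * math.log(q_std))
    est_a, est_b = [], []
    for _ in range(num_draws):
        z = q_mean + q_std * rng.gauss(0.0, 1.0)  # reparameterised sample
        log_lik = log_normal_pdf(x, z, 1.0)
        est_a.append(log_lik + log_normal_pdf(z, 0.0, 1.0) - log_normal_pdf(z, q_mean, q_std))
        est_b.append(log_lik - kl)
    return est_a, est_b


def mean_and_variance(values):
    """Sample mean and unbiased sample variance."""
    mean = sum(values) / len(values)
    return mean, sum((v - mean) ** 2 for v in values) / (len(values) - 1)


est_a, est_b = single_sample_estimates(x=2.0, q_mean=0.0, q_std=2.0, num_draws=20000)
mean_a, var_a = mean_and_variance(est_a)
mean_b, var_b = mean_and_variance(est_b)
```

In this particular configuration the generic estimator shows the larger sample variance while the two means agree. This is a toy check under the stated assumptions, and the ordering of the variances can depend on the model and on how close q is to the true posterior, which is why the empirical comparison on real data is informative.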

## 5 Reconstruction

The VAE with L2 regularization (VAE) provides a probabilistic approach to modelling the data as a distribution around some underlying manifold. This allows us to define a posterior distribution over the latent space and a distribution over input space which can be used to score inputs, while being able to measure uncertainty over both estimates. Here we refer to the traditional variational autoencoder as AE and to the variational autoencoder with L2 regularisation as VAE. The AE takes an input vector and maps it to some latent space, but provides only a point estimate on some lower-dimensional manifold. AEs are trained by minimising the reconstruction error of training examples, usually measured by mean squared error (MSE), together with a weight regulariser:

$$
E(x, \theta) = \big\| x - \sigma_\theta(x) \big\|^2 + \lambda\, \ell(\theta), \tag{15}
$$

where $\sigma_\theta(x)$ represents the application of the autoencoder (encoding followed by decoding) to a particular data example $x$, $\lambda$ represents the weight-decay coefficient, and $\ell(\theta)$ represents the weight penalty. This can be seen to resemble the variational lower bound:

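$$
\mathcal{L}(\theta, \phi; x_i) = \mathbb{E}_{q_\phi(z_i \mid x_i)}\big[\log p_\theta(x_i \mid z_i)\big] - D_{KL}\big(q_\phi(z_i \mid x_i)\,\|\,p_\theta(z_i)\big), \tag{16}
$$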

in which the first term can be seen as the expected negative reconstruction error and the second acts as a regulariser pulling latent variables towards the prior. Rather than defining a distribution in latent variable space, the AE instead provides a point estimate in the bottleneck layer, but the two are analogous in that they provide a lower-dimensional representation of an input.

Next we add L2 regularisation to the loss function of the variational autoencoder. Since the latent representation in the VAE with L2 regularisation is a distribution, we have to choose how to use it to create a single decoded example. Both using the mean of the distribution and averaging the output of multiple samples were tried. Using the mean gave an absolute improvement in error of about 0.2%, and so that is the method used for further experiments.
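The regularised least-squares criterion of equation (15) can be written down directly. The sketch below assumes the weight penalty is the sum of squared weights (a plain L2 penalty) and uses made-up numbers; the function and argument names are illustrative and do not come from the implementation used in the experiments.

```python
def least_squares_loss(x, x_reconstructed, weights, weight_decay):
    """Squared reconstruction error plus an L2 weight penalty, as in equation (15):
    E = ||x - x_hat||^2 + lambda * sum(w^2)."""
    reconstruction = sum((a - b) ** 2 for a, b in zip(x, x_reconstructed))
    penalty = sum(w * w for w in weights)  # assumed form of the weight penalty
    return reconstruction + weight_decay * penalty


loss = least_squares_loss(
    x=[1.0, 2.0],
    x_reconstructed=[1.0, 0.0],
    weights=[3.0, 4.0],
    weight_decay=0.1,
)
```

Here the reconstruction term contributes 4 and the penalty contributes 0.1 × 25, so the weight-decay coefficient directly controls how strongly large weights are discouraged relative to reconstruction quality.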

Instantiations of each model type were trained using the same configuration: the encoder and decoder each have one hidden layer of size 500 with tanh activation functions, using the same regularisation penalty on the same portion of the MNIST data set. The resulting mean reconstruction errors for the two models for different sizes of the latent space/compression layer are shown in Figure 6.

Figure 6: Comparison of reconstruction error measured by MSE for variational auto-encoder (AE) and variational auto-encoder with L2 regularization (VAE) for various sizes of representation space

It can be seen that the variational auto-encoder with L2 regularization outperforms the normal variational auto-encoder for all reduced dimensional sizes. The difference in reconstruction error increases as the size of the latent space/bottleneck layer increases. This suggests that the difference in performance is due to the better generalisation afforded by the variational Bayesian approach.

Note that the AE is directly trained to minimise the criterion that we have used to compare the two models, whereas the VAE with L2 regularization is trained to minimise the expected reconstruction error, which for the discrete case is the binary cross entropy. So despite having an “unfair advantage” the AE still performs worse.

To further contrast the two approaches, Figure 7 shows specific examples of digits reconstructed by the AE and by the VAE with L2 regularization. It can be seen that the VAE generally produces images that are sharper and more similar to the original digit, with the exception of the number 9 in the two-dimensional case. It was initially speculated that this would correspond to a higher variance of the posterior; however, this was found not to be the case. Looking at the representation of the two-dimensional latent space, we can see that the further right we go the more rightward-slanting nines we have, so a leftward-slanting nine would push us away from the nines towards the region of latent space containing eights. In such a compressed latent space it seems reasonable to assume that there will be forms of certain digits that are unrepresentable; in this case we have found an unfortunate example on which the VAE performs poorly.

Figure 7: Examples of the quality of reconstruction of certain digits for VAE with L2 regularization and VAE

The reconstructed images of the VAE with L2 regularisation after more training epochs are shown below. It can be seen that the first row has sharper image quality than the second.

Figure 8: First row is the reconstructed image of VAE with L2 regularisation and second is the normal VAE

Figure 9: This is how the encoder/inference network learns to map the training set from the input data space to the latent space.

## 6 Full Variational Bayes

As well as providing a method of variational inference over the latent space, Kingma and Welling also detail a method of performing full variational Bayesian inference over the parameters. In this scheme we place a hyperprior $p_\alpha(\theta)$ over the parameters of the model. The variational lower bound of the marginal likelihood can then be written:

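$$
\mathcal{L}(\phi; X) = \mathbb{E}_{q_\phi(\theta)}\big[\log p_\theta(X)\big] - D_{KL}\big(q_\phi(\theta)\,\|\,p_\alpha(\theta)\big). \tag{17}
$$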

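By maximising this we encourage the model to reconstruct the data accurately, while constraining the form that the distribution of parameters can take. For a particular point $x$ we have a variational lower bound on the marginal likelihood:

$$
\log p_\theta(x) \geq \mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z) + \log p_\theta(z) - \log q_\phi(z \mid x)\big]. \tag{18}
$$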

Combining equations (17) and (18), using the same reparameterisation trick as before, $\tilde{z} = g_\phi(\epsilon, x)$ with $\epsilon \sim p(\epsilon)$, and using the same trick for the variational approximation to the posterior over parameters, $\tilde{\theta} = h_\phi(\zeta)$ with $\zeta \sim p(\zeta)$, we arrive at the differentiable Monte Carlo estimate for the variational lower bound of the marginal likelihood:

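$$
\mathcal{L}(\phi; X) \approx \frac{1}{L}\sum_{l=1}^{L} N \cdot \big(\log p_{\tilde{\theta}}(x \mid \tilde{z}) + \log p_{\tilde{\theta}}(\tilde{z}) - \log q_\phi(\tilde{z} \mid x)\big) + \log p_\alpha(\tilde{\theta}) - \log q_\phi(\tilde{\theta}) \tag{19}
$$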

which can be maximised by performing SGVB as before, differentiating with respect to $\phi$. They provide a concrete example of a realisation of the above model in which we assume standard normal distributions for the priors over the variables and latent space, and have variational approximations to the posteriors of the form:

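$$
q_\phi(\theta) = \mathcal{N}\big(\theta;\, \mu_\theta,\, \sigma^2_\theta I\big), \qquad q_\phi(z \mid x) = \mathcal{N}\big(z;\, \mu_z,\, \sigma^2_z I\big). \tag{20}
$$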

Those assumptions enable us to obtain closed-form solutions for the KL terms. This approach was implemented and tested on the MNIST and Frey Face data sets. Although the lower bound increased, progress was extremely slow, the training lower bound increased much faster than the validation lower bound, and evaluation of the reconstruction ability of the resulting models showed that no learning had taken place. The very slow progress towards an eventually poor model resembled the effects of starting in a poor region of neural network parameter space, and so the initial values of $\mu_\theta$ were seeded with the MAP solutions from a regular VAE trained to convergence, while the $\sigma_\theta$ were all set to the same fixed value, thereby hopefully encouraging the model to learn a distribution around a known good configuration of parameters. Nonetheless, this yielded identically poor results.

The purpose of performing inference over the parameters is to reduce overfitting and promote generalisation. However, in the scheme proposed it appears that the model underfits to the extent that it simply does not learn. There are a number of possible explanations for this. One problem faced in the implementation was negative values of variances. This was worked around by using a standard deviation which is then squared to yield a positive variance. A recent paper on variational inference over MLP parameters works around this with a different parameterisation of the standard deviation.

Despite yielding a closed-form solution, a standard normal prior over weights is perhaps too wide for MLP weight parameters, which typically have very low variance about zero. It has been reported that, despite not yielding a closed-form solution, a more complicated spike-and-slab-like prior, composed of a mixture of a high-variance and a low-variance Gaussian centred at 0, performed best. Performing full variational inference should allow robust weight estimates even in low-resource environments. The approach in the paper favours a neat analytical form of prior over analytically complicated priors that may induce more reliable weight estimates. The trade-off between precision of gradient estimates and efficacy of form is an interesting problem that requires further research.
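The spike-and-slab-like prior mentioned above, a mixture of a wide and a narrow zero-mean Gaussian, can be sketched as a log density. The mixing weight and the two standard deviations below are hypothetical values chosen for illustration; none of them is taken from the paper or from the work it refers to.

```python
import math


def log_normal_pdf(value, mean, std):
    """Log density of a univariate Gaussian N(mean, std^2) at `value`."""
    return -0.5 * math.log(2.0 * math.pi) - math.log(std) - 0.5 * ((value - mean) / std) ** 2


def log_scale_mixture_prior(w, mix=0.5, std_wide=1.0, std_narrow=0.05):
    """Log density of mix * N(w; 0, std_wide^2) + (1 - mix) * N(w; 0, std_narrow^2),
    computed with the log-sum-exp trick for numerical stability."""
    log_wide = math.log(mix) + log_normal_pdf(w, 0.0, std_wide)
    log_narrow = math.log(1.0 - mix) + log_normal_pdf(w, 0.0, std_narrow)
    top = max(log_wide, log_narrow)
    return top + math.log(math.exp(log_wide - top) + math.exp(log_narrow - top))


log_p_small = log_scale_mixture_prior(0.0)  # at zero the narrow component dominates
log_p_large = log_scale_mixture_prior(2.0)  # far from zero the wide component dominates
```

Because the KL-divergence to such a prior has no closed form, the prior term would have to be estimated by sampling, as in the generic estimator of Section 4.4.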

## 7 Future Works

An obvious extension to the paper is simply to change the form of the prior and the variational approximation in an attempt to induce a particular form of latent space. For example, a particularly interesting set-up would be to define a sparsity-inducing prior that encourages each dimension of the latent space to be approximately valued on $\{0, 1\}$. An obvious choice would be a set of sparse Beta distributions (i.e. ones in which the shape parameters $\alpha, \beta < 1$), but one could also use pairs of univariate Gaussians with means $0$ and $1$ and small variances. Such a prior would be useful for two reasons. Firstly, it would allow one to provide a binary encoding for a data set by truncating the posterior approximation for any particular observation to be exactly binary-valued, allowing for a large amount of lossy compression. Secondly, the posterior distribution over the parameters and latent values contains a rotational symmetry, which may affect the quality of the approximate inference if it attempts to place posterior mass over the entirety of this. Were a prior such as the one proposed used, this rotational symmetry would be destroyed and replaced with a “permutation symmetry”, similar to that found in a finite mixture model.

We currently assume a simple parametric form for the approximate posterior that allows the use of the reparameterisation trick. Although this yields a robust training regime, it limits the expressiveness of the model to a subset of potential distributions. If instead we drop this restriction, we can induce an arbitrarily complex posterior that would allow us to approximate any true posterior.

This idea has recently been realised using Gaussian processes, by drawing random latent input samples, pushing them through a non-linear mapping and then drawing posterior samples. If we were instead to use an MLP for this mapping we could, theoretically, model arbitrary posteriors. The problem then is to obtain a differentiable distribution over latent space, which could potentially be done by sampling multiple times to approximate the distribution and batching gradients over all samples. This is akin to a Monte Carlo estimate of the variational posterior.

One of the most popular approaches in unsupervised learning using autoencoding structures is to make the learned representation robust to partial corruption of the input pattern. This has also proved to be an effective step in the pre-training of deep neural architectures. Moreover, this method can be extended so that the network is trained with a schedule of gradually decreasing noise levels. This approach is motivated by a desire to encourage the network to learn a more diverse set of features, from coarse-grained to fine-grained ones. There was also a recent effort to inject noise into both the input and the stochastic hidden layer (denoising variational autoencoder, DVAE), which yields a better score in terms of optimising the log likelihood. In order to estimate the variational lower bound, the corrupted input $\tilde{x}$, obtained from a known corruption distribution $p(\tilde{x} \mid x)$ around $x$, needs to be integrated out, which is intractable. Thus, Im et al. arrived at a new form of the objective function, the denoising variational lower bound.

However, the noise in this case was set to a constant during the training procedure. To the best of our knowledge no one has analysed how the scheduling scheme might influence the learning of the auto-encoder’s structure, or the approximate form of the posterior of the latent variable. We believe that combining scheduled denoising training with the variational form of an auto-encoder should lead to gains in terms of optimising the lower bound and improving the reconstruction error.

## 8 Conclusions

In this paper we provide an introduction to a methodology for reconstructing images with variational inference using L2 regularisation. We obtained better results with L2 regularization on both data sets. We found the model structure very robust to changes in the parameters of the network. Moreover, our experiments show that the performance of AEVB with L2 regularization is superior to that of the traditional VAE architecture, and that it is resistant to superfluous latent variables thanks to automatic regularisation via the KL term. Our implementation of variational inference over both the parameters and the latent variables performed disappointingly poorly, which might be partially explained by an overly restrictive prior, and we plan to investigate this problem further.