1 Introduction
The Variational Autoencoder (VAE) provides more efficient reconstructive performance than a traditional autoencoder. It is a directed graphical model, trained to maximise a variational lower bound that trades off the data log-likelihood against the KL divergence from the true posterior. This paper describes the scenario in which we wish to find a point estimate of the parameters of some parametric model in which we generate each observation by first sampling a "local" latent variable and then sampling the associated observation. The model is represented graphically in Figure 1.
The posterior is intractable for a continuous latent space whenever either the prior or the likelihood is non-Gaussian, meaning that approximate inference is required. To this end, Auto-Encoding Variational Bayes makes two methodological contributions: it introduces a differentiable stochastic estimator for the variational lower bound on the model evidence, and it uses this to learn a recognition model that provides a fast method for computing an approximate posterior distribution over the "local" latent variables given the observations.
2 Stochastic Variational Inference
The aim of variational inference is to provide a deterministic approximation to an intractable posterior distribution p(z|x) by finding the parameters \phi of a tractable approximation q_\phi(z) such that \mathrm{KL}(q_\phi(z) \,\|\, p(z|x)) is minimised. This is achieved by noting that
\log p(x) = \mathcal{L}(\phi) + \mathrm{KL}(q_\phi(z) \,\|\, p(z|x)), \qquad \mathcal{L}(\phi) := \mathbb{E}_{q_\phi(z)}[\log p(x, z) - \log q_\phi(z)]. \quad (1)
Noting that \log p(x) is constant w.r.t. \phi, we can now minimise the KL-divergence by maximising the evidence lower bound (ELBO) \mathcal{L}(\phi) (that this is indeed a lower bound follows from the non-negativity of the KL-divergence). Aside from some notable exceptions (e.g. [1]), this quantity is not tractably pointwise evaluable. However, if p(x, z) and q_\phi(z) are pointwise evaluable, it can be approximated using Monte Carlo as
\mathcal{L}(\phi) \approx \hat{\mathcal{L}}(\phi) = \frac{1}{L} \sum_{l=1}^{L} \left[ \log p(x, z^{(l)}) - \log q_\phi(z^{(l)}) \right], \qquad z^{(l)} \sim q_\phi(z). \quad (2)
This stochastic approximation to the ELBO is not differentiable w.r.t. \phi, as the distribution from which each z^{(l)} is sampled itself depends upon \phi, meaning that the gradient of the log-likelihood cannot be exploited to perform inference. One of the primary contributions of the paper being reviewed is to provide a differentiable estimator for \mathcal{L}(\phi) that allows gradient information in the log-likelihood to be exploited, resulting in an estimator with lower variance. In particular, it notes that if there exists a tractable reparameterisation of the random variable z \sim q_\phi(z)
such that z = g_\phi(\epsilon), \quad \epsilon \sim p(\epsilon), \quad (3)
then we can approximate the gradient of the ELBO as
\nabla_\phi \mathcal{L}(\phi) \approx \frac{1}{L} \sum_{l=1}^{L} \nabla_\phi \left[ \log p(x, g_\phi(\epsilon^{(l)})) - \log q_\phi(g_\phi(\epsilon^{(l)})) \right], \quad (4)
where \epsilon^{(l)} \sim p(\epsilon) and z^{(l)} = g_\phi(\epsilon^{(l)}). Thus the dependence of the sampled values on \phi has been removed, yielding a differentiable estimator provided that both g_\phi and q_\phi are themselves differentiable. Approximate inference can now be performed by computing the gradient of \hat{\mathcal{L}} w.r.t. \phi, either by hand or using one's favourite reverse-mode automatic differentiation package (e.g. Autograd [6]), and performing gradient-based stochastic optimisation to maximise the ELBO using, for example, AdaGrad [2]. One can also re-express the ELBO in the following manner:
\mathcal{L}(\phi) = \mathbb{E}_{q_\phi(z)}[\log p(x | z)] - \mathrm{KL}(q_\phi(z) \,\|\, p(z)). \quad (5)
This is useful as the KL-divergence between the variational approximation and the prior has a tractable closed-form expression in a number of useful cases. This leads to a second estimator for the ELBO:
\hat{\mathcal{L}}(\phi) = \frac{1}{L} \sum_{l=1}^{L} \log p(x \mid g_\phi(\epsilon^{(l)})) - \mathrm{KL}(q_\phi(z) \,\|\, p(z)). \quad (6)
It seems probable that this estimator will in general have lower variance than (2), as the KL term is computed exactly rather than estimated by sampling. So far, stochastic variational inference has been discussed only in a general parametric setting. The paper's other primary contribution is to use a differentiable recognition network to learn to parameterise the posterior distribution over the latent variables local to each observation in a parametric model. In particular, they assume that, given some global parameters \theta, each z_i \sim p_\theta(z) and x_i \sim p_\theta(x | z_i). In the general case the posterior distribution over each z_i will be intractable. Furthermore, the number of latent variables increases as the number of observations increases, meaning that under the framework discussed above we would have to optimise the variational objective with respect to each of them independently. This is potentially computationally intensive and quite wasteful, as it completely disregards any information about the posterior distribution over the z_i provided by the similarities between input locations x_i and the corresponding posteriors p(z_i | x_i). To rectify this, the recognition model q_\phi(z | x) is introduced. Given the recognition model and a point estimate for \theta, the ELBO becomes

\mathcal{L}(\theta, \phi; x_i) = \mathbb{E}_{q_\phi(z | x_i)}[\log p_\theta(x_i, z) - \log q_\phi(z | x_i)]. \quad (7)

For this ELBO we can derive a result similar to (2), where we do not assume a closed-form solution for the KL divergence between the distributions, and include minibatching to obtain an estimator of the ELBO for a minibatch of observations:
\hat{\mathcal{L}}(\theta, \phi; X^M) = \frac{N}{M} \sum_{i=1}^{M} \frac{1}{L} \sum_{l=1}^{L} \left[ \log p_\theta(x_i, z_i^{(l)}) - \log q_\phi(z_i^{(l)} | x_i) \right], \quad (8)
where the M observations in the minibatch are drawn uniformly from the data set comprised of N observations, and for each observation we draw L samples from the approximate posterior q_\phi(z | x_i). Similarly, if q_\phi(z | x_i) and p_\theta(z) are such that the KL-divergence between them has a tractable closed-form solution, then we can use an approximate bound which we could reasonably expect to have lower variance:
\hat{\mathcal{L}}(\theta, \phi; X^M) = \frac{N}{M} \sum_{i=1}^{M} \left[ -\mathrm{KL}(q_\phi(z | x_i) \,\|\, p_\theta(z)) + \frac{1}{L} \sum_{l=1}^{L} \log p_\theta(x_i | z_i^{(l)}) \right], \quad (9)
where z_i^{(l)} = g_\phi(\epsilon^{(l)}, x_i) and \epsilon^{(l)} \sim p(\epsilon).
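To make the variance claim concrete, the following numpy sketch compares the two estimator styles on a toy conjugate-Gaussian model. The model, its parameters, and the (deliberately non-optimal) variational settings are all illustrative assumptions, not anything from the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal(x, mean, std):
    # log density of N(mean, std^2)
    return -0.5 * np.log(2 * np.pi) - np.log(std) - 0.5 * ((x - mean) / std) ** 2

# Toy model: prior z ~ N(0,1), likelihood x|z ~ N(z,1), and an approximate
# posterior q(z|x) = N(mu, sigma^2) reparameterised as z = mu + sigma * eps.
x, mu, sigma = 1.0, -0.5, 0.3

def elbo_generic(eps):
    # Eq. (8)-style integrand: Monte Carlo for everything, KL term included.
    z = mu + sigma * eps
    return log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0) - log_normal(z, mu, sigma)

def elbo_closed_kl(eps):
    # Eq. (9)-style integrand: closed-form Gaussian KL, Monte Carlo for the
    # likelihood term only.
    z = mu + sigma * eps
    kl = 0.5 * (mu**2 + sigma**2 - 1.0 - np.log(sigma**2))  # KL(q || N(0,1))
    return log_normal(x, z, 1.0) - kl

eps = rng.standard_normal(100_000)
g, h = elbo_generic(eps), elbo_closed_kl(eps)
print(g.mean(), h.mean())  # both are unbiased estimates of the same ELBO
print(g.var(), h.var())    # the closed-form-KL estimator has lower variance here
```

Note that the ordering of the variances depends on the variational parameters: at the exact posterior the generic estimator is actually constant, so the advantage of the closed-form KL shows up away from the optimum, as in this sketch.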
3 The Variational Autoencoder
The variational autoencoder is a generative model belonging to the field of representation learning in artificial intelligence, in which an input is mapped to a hidden representation. It combines deep learning techniques with Bayesian theory: it learns the joint distribution of the data by imposing some regularisation, learning latent variables that relate the observed data points to a prior distribution over the latent space. The variational autoencoder uses the methods described previously to optimise its objective function. We define a generative model in which we assume that each observation was generated by first sampling a latent variable z \sim p(z) = \mathcal{N}(z; 0, I), and that each observation is a real-valued vector distributed as x \sim \mathcal{N}(x; \mu, \sigma^2 I),
where
\mu = W_\mu h + b_\mu, \quad (10)
\log \sigma^2 = W_\sigma h + b_\sigma. \quad (11)
Here h is the output at the final hidden layer of the "decoder MLP", W_\mu and W_\sigma are matrices mapping from the hidden units to the output space, and b_\mu, b_\sigma are row-vector biases. Note that the variances are parameterised implicitly through their logs to ensure that they are correctly valued only on the positive reals. The output at the final hidden layer of the decoder is
h = \tanh(W_h z + b_h), \quad (12)
where W_h and b_h are the weights and biases of the decoder's hidden layer. The recognition model is given by
q_\phi(z \mid x) = \mathcal{N}(z; \mu_z, \sigma_z^2 I), \quad (13)
where the parameters \mu_z and \log \sigma_z^2 are again given by an MLP, which will be referred to as the encoder, whose input is x. There is then a closed-form solution for the KL-divergence between the variational posterior q_\phi(z \mid x) and the prior p(z). Hence we can estimate the lower bound using equation (9), which gives
\hat{\mathcal{L}}(\theta, \phi; x) = \frac{1}{2} \sum_{j=1}^{J} \left( 1 + \log \sigma_{z,j}^2 - \mu_{z,j}^2 - \sigma_{z,j}^2 \right) + \frac{1}{L} \sum_{l=1}^{L} \log p_\theta(x \mid z^{(l)}). \quad (14)
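As a concrete sketch of this model, the following numpy code computes a one-sample estimate of Eq. (14) for a Bernoulli decoder. All layer sizes, the random initialisation, and the fake binarised input are illustrative assumptions, not the paper's trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: input, hidden, and latent sizes.
D, H, J = 784, 500, 20

# Hypothetical randomly initialised weights; a real model would train these.
params = {k: rng.normal(0, 0.01, s) for k, s in {
    "W1": (H, D), "b1": (H,),   # encoder hidden layer
    "Wm": (J, H), "bm": (J,),   # encoder mean head
    "Ws": (J, H), "bs": (J,),   # encoder log-variance head
    "W2": (H, J), "b2": (H,),   # decoder hidden layer
    "Wy": (D, H), "by": (D,),   # decoder Bernoulli logits
}.items()}

def elbo(x, p, rng):
    """One-sample estimate of Eq. (14): closed-form KL + reconstruction term."""
    h = np.tanh(p["W1"] @ x + p["b1"])              # encoder MLP
    mu = p["Wm"] @ h + p["bm"]
    log_var = p["Ws"] @ h + p["bs"]                 # log sigma^2 keeps sigma > 0
    eps = rng.standard_normal(J)
    z = mu + np.exp(0.5 * log_var) * eps            # reparameterisation trick
    hd = np.tanh(p["W2"] @ z + p["b2"])             # decoder MLP
    logits = p["Wy"] @ hd + p["by"]
    # Bernoulli log-likelihood log p(x|z), numerically stable form.
    log_px = np.sum(x * logits - np.logaddexp(0, logits))
    # KL(N(mu, sigma^2 I) || N(0, I)) in closed form, as in Eq. (14).
    kl = -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))
    return log_px - kl

x = rng.integers(0, 2, D).astype(float)  # one fake binarised "image"
val = elbo(x, params, rng)
print(val)
```

In a real implementation the gradient of this quantity w.r.t. the weights would be taken with automatic differentiation and ascended with a stochastic optimiser.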
4 Replications and Extensions
4.1 Visualization
Here we show the learnt representations of the encoders by visualising the latent space. We mapped linearly spaced coordinates through the inverse CDF of the Gaussian prior, and then plotted the output of our decoder with the estimated parameters. Figure 2 shows the results for two learned manifolds.
In the case of the MNIST dataset, the visualisation of the learned data manifolds can be seen. In the case of the second dataset, it is interesting that the first half of the manifold shows the face in left profile, while the second half slowly transforms it to the right profile.
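The grid construction behind such manifold plots can be sketched as follows; the grid size and quantile range are illustrative assumptions, and `statistics.NormalDist` from the standard library supplies the inverse Gaussian CDF.

```python
import numpy as np
from statistics import NormalDist

# Linearly spaced quantiles on the unit square, mapped through the inverse
# Gaussian CDF so that the grid covers latent space in proportion to the
# prior N(0, I).
inv_cdf = NormalDist().inv_cdf  # quantile function of N(0, 1)
n = 20
quantiles = np.linspace(0.05, 0.95, n)
grid = np.array([[(inv_cdf(u), inv_cdf(v)) for u in quantiles]
                 for v in quantiles])
# grid[i, j] is a 2-D latent code; decoding each one with the trained decoder
# and tiling the outputs side by side produces the learned-manifold figure.
print(grid.shape)  # (20, 20, 2)
```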
4.2 Number of Samples
Kingma and Welling found that the number of samples L per data point can be set to 1 as long as the minibatch size M is large enough, e.g. M = 100. However, they did not provide any empirical results, so we decided to run a comparison between the number of samples and the batch size in terms of optimising the lower bound. Due to the computational requirements (a separate model is needed for each configuration), the experiment was run only on the Frey Face dataset. Figure 3 presents the results across a range of sample sizes and batch sizes.
In the case of the test data, the highest score was obtained with the largest batch size, and the score was invariant to the sample size only once the batch size was sufficiently large. In the case of the training set, the highest score for the lower bound was obtained with the smallest batch size, where the sample size has a big influence. For larger batch sizes, the score for the lower bound becomes invariant to the sample size. However, it is possible that the models trained with a larger batch size might need more time to converge, so we used a least-squares loss function with regularisation, which provided better convergence.
4.3 Increasing the depth of the encoder
Extending the depth of neural networks has proved to be a very helpful method of increasing the power of neural architectures. We decided to add additional hidden layers to test how much gain can be obtained in terms of optimising the lower bound while still obtaining robust results. The experiment was run on the MNIST data set, with the size of the latent space set to the value that appeared optimal. Additional hidden layers did not yield a substantial increase in performance; however, the encoder with two hidden layers performed slightly better than the original architecture. Presumably owing to the "vanishing gradients" problem, adding a fourth hidden layer resulted in the inability of the network to learn.
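A minimal sketch of an encoder with a configurable number of hidden layers, as varied in this experiment; the layer widths and the (untrained) random initialisation are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_encoder(sizes):
    """Stack of tanh layers followed by a linear output head.
    `sizes` lists layer widths, e.g. [784, 500, 500, 20] for two hidden layers."""
    Ws = [rng.normal(0, 0.01, (m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
    bs = [np.zeros(m) for m in sizes[1:]]
    def forward(x):
        h = x
        for W, b in zip(Ws[:-1], bs[:-1]):
            h = np.tanh(W @ h + b)      # hidden tanh layers
        return Ws[-1] @ h + bs[-1]      # linear output (e.g. latent means)
    return forward

enc = make_encoder([784, 500, 500, 20])  # two hidden layers
out = enc(rng.standard_normal(784))
print(out.shape)  # (20,)
```

Adding a layer is just one more entry in `sizes`, which makes the depth sweep described above easy to script.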
4.4 Noisy KL-divergence estimate
Until now we have assumed that the KL-divergence term \mathrm{KL}(q_\phi(z|x) \,\|\, p(z)) can be obtained using a closed-form expression. However, in the case of non-Gaussian distributions this is often impossible, and this term also requires estimation by sampling. The more generic SGVB estimator is of the form

\hat{\mathcal{L}}(\theta, \phi; x_i) = \frac{1}{L} \sum_{l=1}^{L} \left[ \log p_\theta(x_i, z_i^{(l)}) - \log q_\phi(z_i^{(l)} | x_i) \right].

Naturally, this form will in general have greater variance than the previous estimator. We decided that it would be informative to compare the performance of both estimators using only one sample, i.e. L = 1; this allows us to observe how much more robust the first estimator is. Figure 5 shows the results for the MNIST data set.
As we can see, in each case the performance of the closed-form-KL estimator is substantially better. This suggests that the generic estimator should only be used with an increased sample size L to reduce its variance; however, this needs to be examined empirically, and it comes with higher computational costs. Although, as expected, the ReLU activation learns the fastest, it offers no substantial gains over the tanh function. In each case the sigmoid function learns the slowest and obtains the lowest score for the bound, while at the same time training took considerably longer than with the two other functions.

5 Reconstruction
The VAE provides a probabilistic approach to modelling the data as a distribution around some underlying manifold. This allows us to define a posterior distribution over the latent space and a distribution over the input space which can be used to score inputs, while being able to measure uncertainty over both estimates. Here we denote the traditional autoencoder by AE and the variational autoencoder with L2 regularisation by VAE. The traditional autoencoder (AE) takes an input vector and maps it to some latent space, but provides only a point estimate on some lower-dimensional manifold. AEs are trained by minimising the reconstruction error of training examples, usually measured by mean squared error (MSE), plus a weight regulariser:
\mathcal{L}_{AE} = \frac{1}{N} \sum_{i=1}^{N} \| x_i - \hat{x}_i \|^2 + \lambda \| W \|_2^2, \quad (15)
where \hat{x}_i represents the application of the autoencoder (encoding followed by decoding) to a particular data example, \lambda represents a specific weight decay, and \| W \|_2^2 represents the weight penalty. This can be seen to resemble the variational lower bound:
\mathcal{L} = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x | z)] - \mathrm{KL}(q_\phi(z|x) \,\|\, p(z)), \quad (16)
in which the first term can be seen as the (negative) expected reconstruction error and the second acts as a regulariser pulling the latent variables towards the prior. Rather than defining a distribution over the latent variable space, the AE instead provides a point estimate in the bottleneck layer, but the two are analogous in that they both provide a lower-dimensional representation of an input.
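A minimal sketch of the AE objective of Eq. (15), with an illustrative weight-decay coefficient and a stand-in reconstruction; no trained network is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)

def ae_loss(x, x_hat, weights, lam=1e-3):
    """Eq. (15): mean squared reconstruction error plus an L2 weight penalty.
    `lam` is an illustrative weight-decay coefficient."""
    mse = np.mean((x - x_hat) ** 2)
    l2 = sum(np.sum(W ** 2) for W in weights)
    return mse + lam * l2

x = rng.random(784)
x_hat = x + rng.normal(0, 0.1, 784)  # stand-in for encode-then-decode output
weights = [rng.normal(0, 0.01, (500, 784)),  # hypothetical encoder weights
           rng.normal(0, 0.01, (784, 500))]  # hypothetical decoder weights
loss = ae_loss(x, x_hat, weights)
print(loss)
```

The `lam * l2` term plays the role that the KL term plays in Eq. (16): both pull the model away from pure reconstruction towards a regularised solution.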
Next we add L2 regularisation to the loss function of the variational autoencoder. Since the latent representation in the VAE with L2 regularisation is a distribution, we have to choose how to use it to create a single decoded example. We tried both using the mean of the distribution and averaging the output of multiple samples. Using the mean gave an absolute improvement in error of about 0.2%, so that is the method used for further experiments.
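The two decoding strategies can be illustrated with a toy one-element "decoder"; `decode` here is a hypothetical stand-in for the trained network, and the posterior parameters are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two ways of producing a single reconstruction from the posterior
# q(z|x) = N(mu, sigma^2 I): decode the mean, or average many sampled decodes.
decode = np.tanh                      # hypothetical stand-in decoder
mu = np.array([0.5, -1.0])
sigma = np.array([0.3, 0.3])

recon_from_mean = decode(mu)                               # decode the mean
samples = mu + sigma * rng.standard_normal((10_000, 2))    # reparameterised draws
recon_from_samples = decode(samples).mean(axis=0)          # average sampled decodes
print(recon_from_mean)
print(recon_from_samples)
```

For a nonlinear decoder the two strategies differ (by Jensen's inequality), which is why the choice matters; the experiments above found decoding the mean slightly more accurate.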
Instantiations of each model type were trained using the same configuration: the encoder and decoder each have one hidden layer of size 500 with tanh activation functions, and use the same normalisation penalty, on the same portion of the MNIST data set. The resulting mean reconstruction errors for the two models for different sizes of the latent space / compression layer are shown in Figure 9. It can be seen that the VAE outperforms the AE for all reduced dimensional sizes, and the difference in reconstruction error increases as the size of the latent space / bottleneck layer increases. This suggests that the difference in performance is due to the better generalisation afforded by the variational Bayesian approach.
Note that the AE is directly trained to minimise the criterion that we have used to compare the two models, whereas the VAE is trained to maximise the expected reconstruction likelihood, which for the discrete case is the binary cross-entropy. So despite having an "unfair advantage", the autoencoder still performs worse.
To further contrast the two approaches, Figure 7 shows specific examples of digits reconstructed by the AE and the VAE. It can be seen that the VAE generally produces images that are sharper and more similar to the original digit, with the exception of the number 9 in the two-dimensional case. It was initially speculated that this would correspond to a higher variance of the posterior; however, this was found not to be the case. Looking at the representation of the two-dimensional latent space, we can see that the further right we go, the more rightward-slanting the nines become, so a leftward-slanting nine would be pushed away from the nines towards the region of latent space containing eights. In such a compressed latent space it seems reasonable to assume that there will be forms of certain digits that are unrepresentable; in this case we have found an unfortunate example on which the VAE performs poorly.
The reconstructed images of the VAE with L2 regularisation after more training epochs are shown below. It can be seen that the first row has sharper image quality than the second.
6 Full Variational Bayes
As well as providing a method for variational inference over the latent variables, Kingma and Welling also detail a method for performing full variational Bayesian inference over the parameters. In this scheme we place a hyperprior p_\alpha(\theta) over the parameters of the model. The variational lower bound on the marginal likelihood can then be written:

\mathcal{L}(\phi; X) = \mathbb{E}_{q_\phi(\theta)}[\log p_\theta(X) + \log p_\alpha(\theta) - \log q_\phi(\theta)]. \quad (17)
By maximising this we encourage the model to reconstruct the data accurately, while constraining the form that the distribution of the parameters can take. For a particular data point x_i we have a variational lower bound on the marginal likelihood:
\log p_\theta(x_i) \geq \mathbb{E}_{q_\phi(z | x_i)}[\log p_\theta(x_i | z) + \log p_\theta(z) - \log q_\phi(z | x_i)]. \quad (18)
Combining Equations 17 and 18, using the same reparameterisation trick as before (z = g_\phi(\epsilon, x) with \epsilon \sim p(\epsilon)), and using the same trick for the variational approximation to the posterior over the parameters (\theta = h_\phi(\zeta) with \zeta \sim p(\zeta)), we arrive at the differentiable Monte Carlo estimate of the variational lower bound of the marginal likelihood:
\mathcal{L} \approx \frac{1}{L} \sum_{l=1}^{L} \left[ \log p_\alpha(\theta^{(l)}) - \log q_\phi(\theta^{(l)}) + \frac{N}{M} \sum_{i=1}^{M} \left( \log p_{\theta^{(l)}}(x_i | z_i^{(l)}) + \log p(z_i^{(l)}) - \log q_\phi(z_i^{(l)} | x_i) \right) \right], \quad (19)
which can be maximised by performing SGVB as before, differentiating with respect to the variational parameters. They provide a concrete example of a realisation of the above model in which standard normal distributions are assumed for the priors over the parameters and the latent space, with variational approximations to the posteriors of the form:
q_\phi(\theta) = \mathcal{N}(\theta; \mu_\theta, \sigma_\theta^2 I), \qquad q_\phi(z | x) = \mathcal{N}(z; \mu_z, \sigma_z^2 I). \quad (20)
Those assumptions enable us to obtain closed-form solutions for the KL terms. This approach was implemented and tested on the MNIST and Frey Face data sets. Although the lower bound increased during training, progress was extremely slow: the training lower bound increased much faster than that of the validation set, and evaluation of the reconstruction ability of the resulting models showed that no learning had taken place. The very slow progress towards an eventually poor model resembled the effect of starting in a poor region of neural network parameter space, so the initial values of \mu_\theta were seeded with the MAP solutions from a regular VAE trained to convergence, while the variances were all set to a small constant, thereby hopefully encouraging the model to learn a distribution around a known good configuration of parameters. Nonetheless, this yielded identically poor results. The purpose of performing inference over the parameters is to reduce overfitting and promote generalisation; however, in the proposed scheme the model underfits to the extent that it simply does not learn. There are a number of possible explanations for this. One problem faced in the implementation was negative values of the variances; this was worked around by parameterising a standard deviation which is then squared to yield a positive variance. In their recent paper on variational inference over MLP parameters, Blundell et al. [7] work around this by parameterising the standard deviation as \sigma = \log(1 + \exp(\rho)). Despite yielding a closed-form solution, a standard normal prior over the weights is perhaps too wide for MLP weight parameters, which typically have very low variance about zero. [7] found that, despite not yielding a closed-form solution, a complicated spike-and-slab-like prior, composed of a mixture of a high-variance and a low-variance Gaussian centred at 0, performed best. Performing full variational inference would allow robust weight estimates even in low-resource environments. The approach in the paper favours a neat analytical form of prior over analytically complicated priors that may induce more reliable weight estimates. The trade-off between the precision of gradient estimates and efficacy of form is an interesting problem that requires further research.

7 Future Work
An obvious extension to the paper is to change the form of the prior and of the variational approximation in an attempt to induce a particular form of latent space. For example, a particularly interesting set-up would be to define a sparsity-inducing prior that encourages each dimension of the latent space to be approximately binary valued. An obvious choice would be a set of sparse Beta distributions (i.e. ones in which the shape parameters are less than 1), but one could also use pairs of univariate Gaussians with means 0 and 1 and small variances. Such a prior would be useful for two reasons. Firstly, it would allow one to provide a binary encoding for a data set by truncating the posterior approximation for any particular observation to be exactly binary valued, allowing for a large amount of lossy compression. Secondly, the posterior distribution over the parameters and latent values contains a rotational symmetry which may affect the quality of the approximate inference if it attempts to place posterior mass over the entirety of it; were a prior such as the one proposed used, this rotational symmetry would be destroyed and replaced with a "permutation symmetry", similar to that found in a finite mixture model.

We currently assume a simple parametric form for the approximate posterior that allows the use of the reparameterisation trick. Although this yields a robust training regime, it limits the expressibility of the model to a subset of potential distributions. If we could instead sample from a learned, flexible mapping, we could induce an arbitrarily complex posterior that would allow us to approximate any true posterior.
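The sparsity-inducing Beta prior suggested above can be illustrated by sampling; the shape parameters here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# A Beta(a, b) prior with both shape parameters below 1 piles mass near 0
# and 1, pushing each latent dimension towards approximately binary values.
a = b = 0.5                        # illustrative "sparse" shape parameters
z = rng.beta(a, b, size=10_000)
near_binary = np.mean((z < 0.1) | (z > 0.9))
print(near_binary)                 # roughly 40% of samples sit near 0 or 1
```

Truncating such posterior samples to {0, 1} would then give the binary encoding described above.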
This idea has recently been realised using Gaussian processes by [8], who draw random latent input samples, push them through a nonlinear mapping, and then draw posterior samples. If we were instead to use an MLP to model this mapping, we could, theoretically, model arbitrary posteriors. The problem then is to yield a differentiable distribution over the latent space, which could potentially be achieved by drawing multiple samples to approximate the distribution and batching gradients over all of them; this is akin to a Monte Carlo estimate of the variational posterior. One of the most popular approaches to unsupervised learning with autoencoding structures is making the learned representation robust to partial corruption of the input pattern [9]. This has also proved to be an effective step in the pre-training of deep neural architectures. Moreover, this method can be extended by training the network with a schedule of gradually decreasing noise levels [10]. This approach is motivated by a desire to encourage the network to learn a more diverse set of features, from coarse-grained to fine-grained ones. Recently, there has also been an effort to inject noise into both the input and the stochastic hidden layer (the denoising variational autoencoder, DVAE), which yields better scores in terms of the optimised log-likelihood [11]. In order to estimate the variational lower bound, the corrupted input \tilde{x}, obtained from a known corruption distribution around x, needs to be integrated out, which is intractable. Thus, Im et al. arrive at a new form of the objective function, the denoising variational lower bound:
where \tilde{q}_\phi(z | x) denotes the approximate posterior marginalised over the corruption distribution. However, the noise in this case was set to a constant during the training procedure. To the best of our knowledge, no one has analysed how a noise schedule might influence the learning of the autoencoder's structure or the approximate form of the posterior over the latent variables. We believe that a combination of scheduled denoising training with the variational form of an autoencoder should lead to gains in terms of optimising the lower bound and improving the reconstruction error, as was the case in the earlier section.
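A sketch of the scheduled-denoising idea for this setting: a known corruption distribution plus a decreasing noise schedule. The salt-and-pepper corruption and the schedule values are illustrative assumptions, not those of [10] or [11].

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(x, noise_level, rng):
    """A known corruption distribution p(x_tilde | x): here, salt-and-pepper
    noise that flips each (binary) pixel with probability `noise_level`."""
    flip = rng.random(x.shape) < noise_level
    return np.where(flip, 1.0 - x, x)

# A hypothetical schedule of gradually decreasing noise levels, one per
# training phase, coarse-to-fine as in scheduled denoising training.
schedule = np.linspace(0.4, 0.0, 5)
x = rng.integers(0, 2, 784).astype(float)  # one fake binarised "image"
for noise in schedule:
    x_tilde = corrupt(x, noise, rng)
    # ... train the (variational) autoencoder on (x_tilde, x) pairs here ...
print([round(float(n), 1) for n in schedule])  # [0.4, 0.3, 0.2, 0.1, 0.0]
```

At the final, zero-noise phase the model sees the clean inputs, so the schedule anneals from coarse to fine features.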
8 Conclusions
In this report we provide a clear introduction to a new methodology for performing image reconstruction with variational inference and L2 regularisation. We obtained better results with L2 regularisation on both data sets. We found the model structure to be very robust to changes in the parameters of the network. Moreover, our experiments show that the performance of the VAE with L2 regularisation is superior to that of the traditional autoencoder architecture, and that it is resistant to superfluous latent variables thanks to automatic regularisation via the KL term. Our implementation of variational inference over both the parameters and the latent variables performed disappointingly poorly, which might be partially explained by an overly restrictive prior; we plan to investigate this problem further.
References
[1] Michalis K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. International Conference on Artificial Intelligence and Statistics.
[2] John Duchi, Elad Hazan and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research.
[3] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
[4] Prasoon Goyal, Zhiting Hu, Xiaodan Liang, Chenyu Wang and Eric Xing. Nonparametric variational autoencoders for hierarchical representation learning. arXiv preprint arXiv:1703.07027.
[5] Lars Mescheder, Sebastian Nowozin and Andreas Geiger. Adversarial variational Bayes: unifying variational autoencoders and generative adversarial networks. arXiv preprint arXiv:1701.04722.
[6] Dougal Maclaurin, David Duvenaud and Ryan P. Adams. Autograd: effortless gradients in Numpy.
[7] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu and Daan Wierstra. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424.
[8] Dustin Tran, Rajesh Ranganath and David M. Blei. The variational Gaussian process. arXiv preprint arXiv:1511.06499.
[9] Pascal Vincent, Hugo Larochelle, Yoshua Bengio and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. Proceedings of the 25th International Conference on Machine Learning.
[10] Krzysztof J. Geras and Charles Sutton. Scheduled denoising autoencoders. arXiv preprint arXiv:1406.326.
[11] Daniel Jiwoong Im, Sungjin Ahn, Roland Memisevic and Yoshua Bengio. Denoising criterion for variational auto-encoding framework. arXiv preprint arXiv:1511.06406.