1 Introduction
Generative modeling of images van2016conditional ; goodfellow_generative_2014 ; gregor2016towards ; kingma2016improved ; dinh2016density , audio oord2016wavenet ; mehri2016samplernn , and videos kalchbrenner2017video ; finn2016unsupervised has advanced remarkably in the past few years and has resulted in great applications ledig2017photo ; isola2017image . Many machine learning tasks such as domain adaptation hoffman2013efficient ; ghifary2014domain ; bousmalis2017unsupervised ; wang2018deep and few-shot learning ravi2016optimization ; santoro2016one ; snell2017prototypical rely on learned representations of observations. The goal of generative learning is to model the distribution of unseen observations. This task is achieved with some degree of success by latent variable models such as the variational autoencoder (VAE) kingma2013auto ; rezende2014stochastic ; salimans2015markov ; kingma2016improved , flow-based models kingma2018glow ; rezende2015variational ; dinh2014nice ; dinh2016density ; du2019implicit , autoregressive models such as PixelCNN salimans2017pixelcnn++ ; van2016conditional ; van2016pixel ; oord2016wavenet ; chen2017pixelsnail ; germain2015made , and, lately, by the combination of latent-variable and autoregressive models, PixelVAE gulrajani2016pixelvae ; chen2016variational .
While PixelCNN variants are able to model local correlations successfully, large models and self-attention mechanisms are necessary to capture long-range correlations chen2017pixelsnail . VAEs, on the other hand, are able to capture global information in the latent space but generally fail to model local structure variations. Modeling the local structure, while not important for downstream tasks such as classification bengio2013deep , is essential for building a powerful generative model. Early models with autoregressive decoders found that the autoregressive part learns all the features of the data and the latent variables are not used fabius2014variational ; bowman2015generating ; chung2014empirical ; serban2016building ; serban2017hierarchical ; fraccaro2016sequential . Using a shallow PixelCNN decoder, PixelVAE gulrajani2016pixelvae was able to achieve generative performance comparable with PixelCNN, with hierarchical latent variables that meaningfully control the global features of generated images. However, the authors of chen2016variational argued that, in principle, the optimal point of a PixelVAE is where all information is modeled by the autoregressive part, although in practice this may not be achievable.
In this manuscript, we present PixelVAE++, a model that improves on the performance of PixelCNN++ and achieves state-of-the-art performance on several data sets. Our contributions can be summarized as follows:

We combine the PixelCNN++ architecture with VAEs. We show that using latent variables improves performance.

We implement a hierarchical encoder that shares most of its parameters with the autoregressive decoder.

We show that the choice of prior is essential in building a useful latent variable model. In particular, we use discrete latent variables with an RBM prior.
2 Background
PixelCNN van2016conditional
is an autoregressive generative model with a tractable likelihood. The model fully factorizes the probability density function as follows:

$p(\mathbf{x}) = \prod_{i} p(x_i \mid \mathbf{x}_{<i})$,   (1)

where $\mathbf{x}_{<i}$ is the set of all $x_j$ with $j < i$. The conditional distributions $p(x_i \mid \mathbf{x}_{<i})$ are parameterized by convolutional neural networks. The functional forms of these conditionals are very flexible, making PixelCNN a powerful generative model. We use the most recent implementation of PixelCNN++
salimans2017pixelcnn++ , which removes the blind spots and achieves better performance than previous PixelCNN models.

PixelVAE gulrajani2016pixelvae is a VAE with a PixelCNN as its decoder. Since its log-likelihood is intractable for a large number of latent variables, the evidence lower bound (ELBO) is maximized as the objective function kingma2013auto :

$\mathcal{L}(\theta, \phi; \mathbf{x}) = \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}\left[\log p_\theta(\mathbf{x} \mid \mathbf{z})\right] - \mathrm{KL}\left(q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p_\theta(\mathbf{z})\right),   (2)

where $p_\theta(\mathbf{z})$ is the prior distribution over the latent variables $\mathbf{z}$, which together with the conditional probability $p_\theta(\mathbf{x} \mid \mathbf{z})$ forms the generative model with parameters $\theta$, and $q_\phi(\mathbf{z} \mid \mathbf{x})$ is the approximating posterior represented as a neural network with parameters $\phi$.
The conditional probability (decoder) and the approximating posterior (encoder) are trained end-to-end to optimize the generative and inference models jointly. Much research on VAE models deals with building more expressive distributions for all of their components kingma2016improved ; tomczak2017vae ; vahdat2018dvae++ ; gulrajani2016pixelvae ; chen2016variational . PixelVAE in particular adds more power to the structure of the decoder by replacing conditionally independent dimensions with a fully autoregressive structure.
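For illustration, the ELBO of Eq. (2) has a closed form when both the approximating posterior and the prior are factorized Gaussians (a simplification of the discrete setting used later in the paper; the function names below are ours):

```python
import numpy as np

def gaussian_kl(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over dims."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)

def elbo(log_px_given_z, mu, logvar):
    """Single-sample ELBO estimate: E_q[log p(x|z)] - KL(q(z|x) || p(z))."""
    return log_px_given_z - gaussian_kl(mu, logvar)
```

When the posterior matches the prior the KL term vanishes and the ELBO reduces to the reconstruction term alone.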
2.1 Discrete Variational Autoencoder (DVAE)
Restricted Boltzmann machines (RBMs) model multimodal distributions and can capture complex relationships in data le2008representational . Using an RBM as a prior in a generative model was pioneered by rolfe2016discrete and later improved in vahdat2018dvae++ ; vahdat2018dvae ; vahdat2019learning ; khoshaman2018gumbolt . In order to backpropagate through the approximating posterior of discrete binary variables, we employ a continuous relaxation estimator that trades bias for variance bengio2013estimating ; raiko2014techniques , in particular the Gumbel-Softmax estimator jang2016categorical ; maddison2016concrete . Reference khoshaman2018gumbolt showed that a relaxed (biased) objective can be used to train a DVAE with an RBM prior. The relaxed objective can be expressed as

$\tilde{\mathcal{L}}(\theta, \phi; \mathbf{x}) = \mathbb{E}_{\boldsymbol{\rho} \sim \mathcal{U}(0,1)}\left[\log p_\theta(\mathbf{x} \mid \boldsymbol{\zeta})\right] - \mathrm{KL}\left(q_\phi(\boldsymbol{\zeta} \mid \mathbf{x}) \,\|\, p_\theta(\boldsymbol{\zeta})\right),   (3)
where $\rho$ is a random variable drawn from a uniform distribution $\mathcal{U}(0, 1)$, and $\boldsymbol{\zeta}$ is the continuous relaxation of the discrete binary latent variables,

$\zeta = \sigma\!\left(\frac{l + \log \rho - \log(1 - \rho)}{\tau}\right),   (4)$

with $\sigma$ being the sigmoid function, $l$ the logit (the output of a neural network), and $\tau$ the temperature that controls the sharpness of $\sigma$. The continuous variable $\zeta$ is equal to the discrete variable $z$ in the limit $\tau \to 0$. At test time, only discrete variables are used, which can be obtained by setting the temperature to zero.

2.2 PixelVAE++ Architecture
For the decoder of PixelVAE++ we implement a model similar to PixelCNN++ salimans2017pixelcnn++ , using downward- and rightward-shifted images and feature maps. Similarly, we use 6 blocks of $n$ ResNet layers, with a larger $n$ for 32x32x3 inputs and a smaller one ($n = 3$) for 28x28 inputs. Downsampling is performed between the 1st/2nd and 2nd/3rd blocks using convolutions with strides of 2, and upsampling is performed between the 4th/5th and 5th/6th blocks. Similar to PixelCNN++, convolutional connections are introduced between early and late layers to ensure that details and fine structures are preserved at reconstruction time.
For the encoder, we use three groups of latent variables with factorial Bernoulli distributions. The encoding distribution of the first group, the concatenated latent variables, is parameterized by convolutional neural networks (CNNs) followed by a dense layer. Using deconvolutions with upsampling, the decoder transforms these stochastic variables to the size of the input and then concatenates them with the input, similar to the original PixelVAE gulrajani2016pixelvae .

The parameters of the second encoding distribution are obtained by a separate set of CNN and dense layers. In the decoder, these conditioning latent variables are added to the activations before the nonlinearities, similar to the conditional PixelCNN van2016conditional .

The architecture of the third group, the shared latent variables, leverages the autoencoding structure of PixelCNN++. There are layers up to the last downsampling stage, where $n$ is the number of ResNet layers per block. Each such layer, indexed by block and ResNet layer, with shape [batch size, height, width, filters], is reduced by a convolution and followed by a dense layer to 64 variables that produce the logits of the Bernoulli distribution. Each latent variable is then transformed by a dense layer, followed by upsampling to the same spatial size as the corresponding layer. The transformed stochastic variable is concatenated with that layer and passed to the corresponding upsampling block. A gated ResNet combines this layer and the transformed latent variables. A schematic drawing is presented in Fig. 1.
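The way latent variables are merged with spatial feature maps can be sketched as follows (a stand-in for the dense + deconvolution pathway of the concatenated latent variables; the real model uses learned deconvolutions rather than tiling, and all shapes and names here are ours):

```python
import numpy as np

def concat_latents_with_input(x, z):
    """Broadcast a latent vector z to the spatial size of the input x
    and concatenate along the channel axis.

    x: [N, H, W, C] input images; z: [N, D] latent samples.
    Returns an [N, H, W, C + D] tensor, mimicking how upsampled latent
    variables are concatenated with the input before the decoder."""
    n, h, w, c = x.shape
    z_maps = np.tile(z[:, None, None, :], (1, h, w, 1))   # [N, H, W, D]
    return np.concatenate([x, z_maps], axis=-1)           # [N, H, W, C + D]
```

A learned deconvolution stack would replace the `np.tile` broadcast, but the resulting shape and the concatenation step are the same.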
3 Experiments
We study the performance of PixelVAE++ for density estimation on 2D images. The experiments are performed on statically and dynamically binarized MNIST lecun1998gradient , Omniglot, Caltech-101 silhouettes marlin2010inductive , and CIFAR-10 krizhevsky2009learning . For all data sets, we use the standard allocation of training, validation, and test sets. We trained the models for 500 epochs on one GPU for the binary data sets and for 1000 epochs on 8 GPUs for CIFAR-10. Our goal is to determine whether we can improve the performance of PixelCNN++ with a VAE and to understand what effect, if any, the use of discrete binary latent variables has on the performance of PixelVAE++.
3.1 Performance of PixelVAE++
MNIST, Omniglot, and Caltech-101 silhouettes consist of binary pixel images. We use the same architecture for the MNIST and Omniglot data sets, with 3 ResNet layers per block, each having 64 feature maps. Because the Caltech-101 silhouettes data set is smaller, our model with a PixelCNN++ decoder using strides of 2 for downsampling and upsampling easily overfits. We remove the strides and reduce the number of feature maps to 32, but keep the number of ResNet layers the same. We found that for input sizes of 28x28, only three ResNet layers per block were needed to achieve optimal performance. The log-likelihood (LL) is evaluated using 1000 importance-weighted samples burda2015importance .

We were able to achieve better performance than previously published results on dynamically binarized MNIST using PixelCNN++, and we further improve the LL using PixelVAE++. The LL reaches values of 78.00 and 88.29 for MNIST (dynamic) and Omniglot, respectively. We obtained these results by including only the first group of latent variables, but we also experimented with including all three categories of latent variables. Only the first category, the concatenated variables, was necessary and sufficient for optimal LL on these data sets; adding the other groups did not improve results. We repeated the experiments for each data set three times and report the average. The standard deviation of the mean is such that the last digit is uncertain; however, we follow the norm of reporting results for the binary data sets to two decimal places.

For the binary data sets, we use 400 binary variables in the latent space, all of which belong to the concatenated latent variables. The prior consists of an RBM with 200 variables on each side of the bipartite graph. We sample from the RBM during training to compute the gradients of the log-partition function. We use 5000 samples obtained by annealed importance sampling (AIS) neal2001annealed with 1000 temperature steps and 50 MCMC updates per step. For the evaluation of the log-partition function, we increase the number of samples and temperature steps tenfold. For training, we use batch sizes of 128 samples.
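The AIS estimate of the log-partition function can be illustrated on a toy RBM (a minimal, self-contained sketch with a uniform base distribution, far smaller than the RBM prior used in the paper; all names and sizes are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def log_f(v, W, a, b, beta):
    """Unnormalised log-probability of visibles v with hiddens summed out,
    at inverse temperature beta (all RBM parameters scaled by beta)."""
    return beta * (v @ a) + np.sum(np.logaddexp(0.0, beta * (v @ W + b)), axis=-1)

def ais_log_z(W, a, b, n_chains=500, n_betas=100):
    """Annealed importance sampling estimate of log Z for a small RBM.

    The base distribution (beta = 0) is uniform over {0,1}^(n_v + n_h),
    so log Z_0 = (n_v + n_h) * log 2."""
    n_v, n_h = W.shape
    betas = np.linspace(0.0, 1.0, n_betas + 1)
    v = (rng.random((n_chains, n_v)) < 0.5).astype(float)  # sample from base
    log_w = np.zeros(n_chains)
    for b0, b1 in zip(betas[:-1], betas[1:]):
        # accumulate the importance weight, then one Gibbs sweep at beta = b1
        log_w += log_f(v, W, a, b, b1) - log_f(v, W, a, b, b0)
        h = (rng.random((n_chains, n_h)) < sigmoid(b1 * (v @ W + b))).astype(float)
        v = (rng.random((n_chains, n_v)) < sigmoid(b1 * (h @ W.T + a))).astype(float)
    log_z0 = (n_v + n_h) * np.log(2.0)
    m = log_w.max()                                        # log-mean-exp
    return log_z0 + m + np.log(np.mean(np.exp(log_w - m)))
```

On an RBM this small, the estimate can be checked against brute-force enumeration of all joint states; for the 200x200 RBM prior above, only the AIS estimate is feasible.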
Experiments on CIFAR-10 use the same set of hyperparameters and network architecture as PixelCNN++ salimans2017pixelcnn++ , except for the number of filters, which we reduce from 160 to 128. We use batch sizes of 8 samples per GPU, or 64 samples in total. We only add and tune VAE-related parameters and hyperparameters. Our model has 40M parameters, including 6M for the VAE, as compared to 54M in the original PixelCNN++. Despite the reduced number of parameters, the negative log-likelihood in bits per dimension (bpd) reaches 2.90. The decoder alone with 128 filters only reaches 2.95 bpd, higher than the 2.92 bpd with 160 filters. Increasing the number of filters back to 160 for the VAE model did not result in better performance than 2.90 bpd. The details of the implementation are outlined in Appendix A. The numbers are reported after averaging three independent runs. The standard deviation of the log-likelihood over different evaluations of the importance-weighted sum is negligible.
Table 1: LL and KL values of PixelVAE++ with Gaussian and RBM priors, compared with other models.

MNIST                  Static          Dynamic
                       LL      KL      LL      KL
PixelVAE++ Gaussian    78.66   6.86    78.01   4.2
PixelVAE++ RBM         78.65   7.62    78.00   5.05
VLAE                   79.03   --      78.53   --

                       Omniglot        Caltech 101
                       LL      KL      LL      KL
PixelVAE++ Gaussian    88.65   1.63    79.52   4.00
PixelVAE++ RBM         88.29   2.56    77.46   6.85
VLAE                   89.83   --      77.36   --

CIFAR-10 (bpd)         LL      KL
PixelVAE++ Gaussian    2.92    0.005
PixelVAE++ RBM         2.90    0.016
VLAE                   2.95    --
PixelCNN++             2.92    --
PixelSNAIL             2.85    --
For the CIFAR-10 data set, we use an architecture that includes all three groups of latent variables: 512 concatenated, 128 conditional, and the shared variables. To construct the RBM, we place the first two groups on one side and the shared variables on the other side of the bipartite graph. Due to the increased size of the RBM, and to reduce the computational cost of AIS, we use only 500 samples for training and 5000 samples for evaluation.
3.2 Conditional image generation
We perform experiments with PixelVAE++ to assess the generation of images conditional on the latent variables given by the approximating posterior. We generate samples using the latent variables inferred from test images. Figures 2 and 3 show reconstructed images from the MNIST and Omniglot test sets for PixelVAE++ trained with RBM and Gaussian priors. As evident from Fig. 2, the models with both continuous and discrete priors reconstruct the digits very well. Despite having similar LL, the model with the RBM prior has a sharp conditional distribution that captures small variations in digit shape, while the one with the Gaussian prior has a much broader conditional that can cover multiple digit classes. For the Omniglot data set (Fig. 3), the model with the RBM prior similarly has a sharper conditional, although in this case both the RBM and Gaussian prior models span multiple object classes.
The similarity of the reconstructed image to the original can be measured equivalently by the conditional probability, by the KL divergence of the approximating posterior from the prior (for a fixed ELBO), or by the mutual information between the observations and the latent variables, all of which change monotonically with respect to one another alemi2017fixing . We report the KL divergence in Table 1. It is evident from the images and the KL values that PixelVAE++ with an RBM prior learns models with sharper decoder distributions and a more informative latent space than the one with a Gaussian prior.
For CIFAR-10 images, the decoder distribution becomes rather broad, as illustrated in Fig. 4. To visually highlight how this distribution varies over the data set, we choose 128 data points, sample 8 images conditioned on the latent representation of each data point, and display those with the smallest mutual energy distance salimans2017pixelcnn++ in the left panel and those with the largest in the right panel. It is hardly possible to categorize images conditioned on the same latent variable as belonging to the same class, but in some cases one can argue that the images share some global attributes, such as color and composition.
4 Conclusions
We have presented a VAE model with an autoregressive decoder that performs well in generative tasks, in terms of LL, and makes use of its latent variables. This model achieves the best performance among latent variable models on the CIFAR-10 data set, reaching an LL of 2.90 bpd while using 25% fewer parameters than PixelCNN++.
For the binary data sets, the discrete latent variables capture global features of images (like digit class) while the decoder distribution models local variations.
For more complex natural images, the latent variables help the autoregressive model represent some global features, such as color and composition, and achieve better performance in terms of log-likelihood.
References
 (1) Aaron Van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with pixelcnn decoders. In Advances in Neural Information Processing Systems, pages 4790–4798, 2016.
 (2) Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. arXiv preprint arXiv:1406.2661, 2014.
 (3) Karol Gregor, Frederic Besse, Danilo Jimenez Rezende, Ivo Danihelka, and Daan Wierstra. Towards conceptual compression. In Advances In Neural Information Processing Systems, pages 3549–3557, 2016.
 (4) Durk P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in neural information processing systems, pages 4743–4751, 2016.
 (5) Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real nvp. arXiv preprint arXiv:1605.08803, 2016.
 (6) Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
 (7) Soroush Mehri, Kundan Kumar, Ishaan Gulrajani, Rithesh Kumar, Shubham Jain, Jose Sotelo, Aaron Courville, and Yoshua Bengio. Samplernn: An unconditional endtoend neural audio generation model. arXiv preprint arXiv:1612.07837, 2016.
 (8) Nal Kalchbrenner, Aäron van den Oord, Karen Simonyan, Ivo Danihelka, Oriol Vinyals, Alex Graves, and Koray Kavukcuoglu. Video pixel networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1771–1779. JMLR.org, 2017.
 (9) Chelsea Finn, Ian Goodfellow, and Sergey Levine. Unsupervised learning for physical interaction through video prediction. In Advances in neural information processing systems, pages 64–72, 2016.

 (10) Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4681–4690, 2017.
 (11) Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1125–1134, 2017.
 (12) Judy Hoffman, Erik Rodner, Jeff Donahue, Trevor Darrell, and Kate Saenko. Efficient learning of domaininvariant image representations. arXiv preprint arXiv:1301.3224, 2013.

 (13) Muhammad Ghifary, W Bastiaan Kleijn, and Mengjie Zhang. Domain adaptive neural networks for object recognition. In Pacific Rim international conference on artificial intelligence, pages 898–904. Springer, 2014.
 (14) Konstantinos Bousmalis, Nathan Silberman, David Dohan, Dumitru Erhan, and Dilip Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3722–3731, 2017.
 (15) Mei Wang and Weihong Deng. Deep visual domain adaptation: A survey. Neurocomputing, 312:135–153, 2018.
 (16) Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. 2016.
 (17) Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. One-shot learning with memory-augmented neural networks. arXiv preprint arXiv:1605.06065, 2016.
 (18) Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4077–4087, 2017.
 (19) Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
 (20) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
 (21) Tim Salimans, Diederik P Kingma, Max Welling, et al. Markov chain monte carlo and variational inference: Bridging the gap. In ICML, volume 37, pages 1218–1226, 2015.
 (22) Durk P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pages 10236–10245, 2018.
 (23) Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.
 (24) Laurent Dinh, David Krueger, and Yoshua Bengio. Nice: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.
 (25) Yilun Du and Igor Mordatch. Implicit generation and generalization in energy-based models. arXiv preprint arXiv:1903.08689, 2019.
 (26) Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P Kingma. PixelCNN++: Improving the pixelCNN with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517, 2017.

 (27) Aäron Van Den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. In Proceedings of the 33rd International Conference on Machine Learning, pages 1747–1756. JMLR.org, 2016.
 (28) Xi Chen, Nikhil Mishra, Mostafa Rohaninejad, and Pieter Abbeel. Pixelsnail: An improved autoregressive generative model. arXiv preprint arXiv:1712.09763, 2017.
 (29) Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. Made: Masked autoencoder for distribution estimation. In International Conference on Machine Learning, pages 881–889, 2015.
 (30) Ishaan Gulrajani, Kundan Kumar, Faruk Ahmed, Adrien Ali Taiga, Francesco Visin, David Vazquez, and Aaron Courville. Pixelvae: A latent variable model for natural images. arXiv preprint arXiv:1611.05013, 2016.
 (31) Xi Chen, Diederik P Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel. Variational lossy autoencoder. arXiv preprint arXiv:1611.02731, 2016.
 (32) Yoshua Bengio. Deep learning of representations: Looking forward. In International Conference on Statistical Language and Speech Processing, pages 1–37. Springer, 2013.
 (33) Otto Fabius and Joost R van Amersfoort. Variational recurrent autoencoders. arXiv preprint arXiv:1412.6581, 2014.
 (34) Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.
 (35) Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
 (36) Iulian V Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. Building end-to-end dialogue systems using generative hierarchical neural network models. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
 (37) Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron Courville, and Yoshua Bengio. A hierarchical latent variable encoder-decoder model for generating dialogues. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
 (38) Marco Fraccaro, Søren Kaae Sønderby, Ulrich Paquet, and Ole Winther. Sequential neural models with stochastic layers. In Advances in neural information processing systems, pages 2199–2207, 2016.
 (39) Jakub M Tomczak and Max Welling. Vae with a vampprior. arXiv preprint arXiv:1705.07120, 2017.
 (40) Arash Vahdat, William G Macready, Zhengbing Bian, Amir Khoshaman, and Evgeny Andriyash. Dvae++: Discrete variational autoencoders with overlapping transformations. arXiv preprint arXiv:1802.04920, 2018.

 (41) Nicolas Le Roux and Yoshua Bengio. Representational power of restricted boltzmann machines and deep belief networks. Neural computation, 20(6):1631–1649, 2008.
 (42) Jason Tyler Rolfe. Discrete variational autoencoders. arXiv preprint arXiv:1609.02200, 2016.
 (43) Arash Vahdat, Evgeny Andriyash, and William G Macready. DVAE#: Discrete variational autoencoders with relaxed Boltzmann priors. In Neural Information Processing Systems (NIPS), 2018.
 (44) Arash Vahdat, Evgeny Andriyash, and William G Macready. Learning undirected posteriors by backpropagation through mcmc updates. arXiv preprint arXiv:1901.03440, 2019.
 (45) Amir H Khoshaman and Mohammad Amin. Gumbolt: Extending gumbel trick to boltzmann priors. In Advances in Neural Information Processing Systems, pages 4065–4074, 2018.
 (46) Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
 (47) Tapani Raiko, Mathias Berglund, Guillaume Alain, and Laurent Dinh. Techniques for learning binary stochastic feedforward neural networks. arXiv preprint arXiv:1406.2989, 2014.
 (48) Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.
 (49) Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.
 (50) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 (51) Benjamin Marlin, Kevin Swersky, Bo Chen, and Nando de Freitas. Inductive principles for restricted Boltzmann machine learning. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 509–516, 2010.
 (52) Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
 (53) Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.
 (54) Radford M Neal. Annealed importance sampling. Statistics and computing, 11(2):125–139, 2001.
 (55) Alexander A Alemi, Ben Poole, Ian Fischer, Joshua V Dillon, Rif A Saurous, and Kevin Murphy. Fixing a broken elbo. arXiv preprint arXiv:1711.00464, 2017.
 (56) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.
 (57) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
 (58) Tim Salimans and Durk P Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pages 901–909, 2016.
 (59) Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. Ladder variational autoencoders. In Advances in neural information processing systems, pages 3738–3746, 2016.
 (60) Geoffrey Roeder, Yuhuai Wu, and David K Duvenaud. Sticking the landing: Simple, lower-variance gradient estimators for variational inference. In Advances in Neural Information Processing Systems, pages 6925–6934, 2017.
Appendix A Network architecture details
Table 2 outlines the details of the implementation for the CIFAR-10 experiment.
Module  Type  CIFAR10 

convolutional + dense  64, 64, 64, 64, 64, 64, 64, 64, 512 FC  
dense + deconvolutional  512 FC, 64, 64, 64, 64, 64, 64, 16, 16, 3  
convolutional + dense  1, 128 FC  
convolutional + dense  1(Kernel = Stride = 4), 64 FC  
convolutional + dense  1, 64 FC  
convolutional + dense  1, 64 FC  
dense + deconvolutional  64 FC, 128(Kernel = Stride = 4)  
dense + deconvolutional  64 FC, 128  
dense + deconvolutional  64 FC, 128 
Appendix B Improving optimization
Techniques such as batch normalization ioffe2015batch , dropout srivastava2014dropout , weight normalization salimans2016weight , and learning-rate decay can significantly improve the performance of a PixelVAE. We evaluate PixelVAE++ by comparing its training to that of the original PixelCNN++, both of which include these improvements. For a fair comparison, we apply only those techniques that were also used in salimans2017pixelcnn++ . We examine whether adding a VAE structure to PixelCNN++ improves performance.
For the VAE specifically, we use optimization methods such as KL annealing sonderby2016ladder , which has been shown to prevent the approximating posterior from falling into a local minimum.
When we use an RBM as the prior, the KL term of the VAE does not have a closed form. We reduce the variance of the optimization by neglecting derivatives of the form $\partial_\phi \log q_\phi(\mathbf{z} \mid \mathbf{x})$ (the score-function term); the encoder is instead trained only through the path derivatives roeder2017sticking .
Appendix C Temperature annealing
The temperature of the continuous relaxation (3) determines the trade-off between bias and variance. While a low-temperature relaxation is less biased, its derivative has high variance. We find that a temperature around 0.25 is sufficient to achieve the best performance in all experiments. If the temperature is too high (0.5), the mismatch between the smoothed posterior and the RBM grows rapidly. If the temperature is too low (0.1), the high variance of the gradient prevents convergence below 2.98 bpd in the CIFAR-10 experiment.
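For concreteness, the binary relaxation at a given temperature can be sketched as follows (our own minimal implementation of the Gumbel-softmax/concrete relaxation; names are ours):

```python
import numpy as np

def relaxed_bernoulli(logits, temperature, rng):
    """Binary Gumbel-softmax (concrete) relaxation:
    zeta = sigmoid((logits + log(rho) - log(1 - rho)) / temperature),
    with rho ~ Uniform(0, 1).  As the temperature approaches zero, zeta
    approaches an exact Bernoulli(sigmoid(logits)) sample; at high
    temperature the samples spread out over the unit interval."""
    rho = rng.random(np.shape(logits))
    logistic_noise = np.log(rho) - np.log1p(-rho)
    return 1.0 / (1.0 + np.exp(-(logits + logistic_noise) / temperature))
```

At a temperature of 0.01 almost all samples are within 1e-3 of 0 or 1, which illustrates why training at very low temperature behaves like (high-variance) discrete sampling.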
We experimented with annealing the temperature both downward and upward. The intuition for annealing toward a high temperature is to reduce variance near the end of optimization; annealing toward a low temperature aims to reduce bias near the end while keeping the variance high early in training to help escape local minima. In practice, however, we find that decreasing the temperature degrades performance and that increasing it has no effect compared to training with the final temperature throughout. This is perhaps because training VAEs is already high-variance, so the additional variance from a low temperature has no meaningful effect.
Appendix D Image generation
We evaluate samples generated conditional on the prior distribution. To sample from the RBM prior, we perform 50000 block Gibbs sampling updates. While the cost of these updates for 1792 variables may be too high for the training phase, it is negligible compared to autoregressive generation with deep neural networks of 40M parameters. With fixed RBM samples, we then generate multiple samples from the decoder.
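Block Gibbs sampling for an RBM can be sketched as follows (a toy-scale illustration of the updates described above; variable names are ours):

```python
import numpy as np

def gibbs_sample_rbm(W, a, b, n_steps, rng):
    """Block Gibbs sampling for an RBM with visible biases a and hidden
    biases b: alternately resample all hidden units given the visibles
    and all visible units given the hiddens."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    n_v, n_h = W.shape
    v = (rng.random(n_v) < 0.5).astype(float)   # random initial state
    h = (rng.random(n_h) < 0.5).astype(float)
    for _ in range(n_steps):
        h = (rng.random(n_h) < sigmoid(v @ W + b)).astype(float)
        v = (rng.random(n_v) < sigmoid(h @ W.T + a)).astype(float)
    return v, h
```

Because both conditionals factorize over units, each of the two block updates resamples an entire layer in parallel, which is what makes long chains (50000 updates here) affordable relative to the autoregressive decoding pass.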
For MNIST and Omniglot, the generated samples are fairly sharp (Figs. 5 and 6). As shown in Fig. 5, samples from the model with the RBM prior generally fall in the same mode but occasionally change; such changes occur more frequently when the prior is Gaussian.
For CIFAR-10, there is hardly any similarity between the images. To visualize any similarity, we generated 128 RBM samples and 8 images per sample in each row, and then sorted the rows by mutual energy distance salimans2017pixelcnn++ . We put the first 8 rows, with the smallest mutual energy distance, in the left panel and the last 8 rows in the right panel of Fig. 7. We also experimented with other distance metrics. In all cases there is hardly any similarity between the generated samples, as in Fig. 7. While discrete latent variables capture the structure of the data well in the smaller data sets (MNIST, Omniglot), neither discrete nor continuous variables capture the structure of the CIFAR-10 data set.