1 Introduction
Representation learning, besides data distribution estimation, is a principal component of generative models. The goal is to identify and disentangle the underlying causal factors, to tease apart the underlying dependencies of the data, so that it becomes easier to understand, to classify, or to perform other tasks
(Bengio et al., 2013). Among these generative models, the VAE (Kingma & Welling, 2014; Rezende et al., 2014) has gained popularity for its capability of estimating the densities of complex distributions while automatically learning meaningful (low-dimensional) representations from raw data. The VAE, as a member of the family of latent variable models (LVMs), defines the joint distribution between the observed data (visible variables) and a set of latent variables by factorizing it as the product of a prior over the latent variables and a conditional distribution of the visible variables given the latent ones (detailed in §2). VAEs are usually estimated by maximizing the likelihood of the observed data, marginalizing over the latent variables, typically via optimizing the evidence lower bound (ELBO). By learning a VAE with an appropriate hierarchical structure of latent variables, the hope is to uncover and untangle the causal sources of variation that we are interested in. A notorious problem of VAEs, however, is that the marginal likelihood may not guide the model to learn the intended latent variables; it may instead focus on explaining irrelevant but common correlations in the data (Ganchev et al., 2010). Extensive previous studies (Bowman et al., 2015; Chen et al., 2017a; Yang et al., 2017) showed that optimizing the ELBO objective is often completely disconnected from the goal of learning good representations. An extreme case, called KL vanishing, happens when using sufficiently expressive decoding distributions such as autoregressive ones: the latent variables are often completely ignored and the VAE regresses to a standard autoregressive model (Larochelle & Murray, 2011; Oord et al., 2016).
This problem has spawned significant interest in analyzing and solving it, from both theoretical and practical perspectives; we can only name a few works here due to space limits. Some previous work (Bowman et al., 2015; Sønderby et al., 2016b; Serban et al., 2017) attributed the KL-vanishing phenomenon to "optimization challenges" of VAEs, and proposed training methods including annealing the relative weight of the KL term in the ELBO (Bowman et al., 2015; Sønderby et al., 2016b) or adding free bits (Kingma et al., 2016; Chen et al., 2017a). However, Chen et al. (2017a) pointed out that this phenomenon arises not just from optimization challenges: even if we find the exact solution of the optimization problem, the latent code will still be ignored at the optimum. They proposed a solution that limits the capacity of the decoder, applying a PixelCNN (Oord et al., 2016) with small local receptive fields as the decoder of VAEs to model 2D images, achieving both impressive density estimation performance and informative latent representations. Yang et al. (2017) embraced a similar idea and applied VAEs to text modeling using a dilated CNN as the decoder. Unfortunately, these approaches require manual, problem-specific design of the decoder's architecture to learn meaningful representations. Other studies attempted to explore alternatives to the ELBO. Makhzani et al. (2015) proposed Adversarial Autoencoders (AAEs), replacing the KL-divergence between the posterior and prior distributions with the Jensen-Shannon divergence on the aggregated posterior distribution. InfoVAE (Zhao et al., 2017) generalized the Jensen-Shannon divergence in AAEs to a divergence family and linked its objective to the mutual information between the data and the latent variables.
However, directly optimizing these objectives is intractable, requiring advanced approximate learning methods such as adversarial learning or Maximum Mean Discrepancy (Gretton et al., 2007; Dziugaite et al., 2015; Li et al., 2015). Moreover, these models' performance on density estimation falls significantly behind state-of-the-art models (Salimans et al., 2017; Chen et al., 2017a).
In this paper, we propose to tackle the aforementioned representation learning challenges of VAEs by adding a data-dependent regularization to the ELBO objective. Our contributions are threefold: (1) Algorithmically, we introduce a mutual posterior-divergence regularization for VAEs, named MAE (§3.2), to control the geometry of the latent space during learning by encouraging the learned variational posteriors to be diverse (i.e., they are favored to be mutually "different" from each other), so as to achieve low-redundancy, interpretable representation learning. (2) Theoretically, we establish a close relation between MAE and InfoVAE, by showing that the mutual posterior-divergence regularization maximizes a symmetric version of the KL divergence involved in InfoVAE's mutual information term (§3.3). (3) Experimentally, on three benchmark image datasets, we demonstrate the effectiveness of MAE as a density estimator with state-of-the-art log-likelihood results on MNIST and OMNIGLOT and comparable results on CIFAR-10. Moreover, through image reconstruction, unsupervised clustering, and semi-supervised classification, we show that MAE is also capable of learning meaningful latent representations, even when combined with a sufficiently powerful decoder (§4).
2 Variational Autoencoders
2.1 Notations
Throughout, we use uppercase letters for random variables and lowercase letters for realizations of the corresponding random variables. Let $X \in \mathcal{X}$ be the random variable of the observed data; e.g., $x$ is an image or a sentence for image and text generation, respectively.
Let $P^\star$ denote the true distribution of the data, i.e., $X \sim P^\star$, and let $D = \{x_1, \ldots, x_N\}$ be our training sample, where $x_1, \ldots, x_N$ are usually i.i.d. samples of $X$. Let $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$ denote a parametric statistical model indexed by the parameter $\theta$, where $\Theta$ is the parameter space; $p_\theta(x)$ is used to denote the density of the corresponding distribution $P_\theta$. In the literature of deep generative models, deep neural networks are the most widely used parametric models. The goal of generative modeling is to learn the parameter $\theta$ such that $p_\theta(x)$ can best approximate the true distribution $P^\star$.
2.2 VAEs
In the framework of VAEs, or general LVMs, a set of latent variables $Z \in \mathcal{Z}$ is introduced to characterize the hidden patterns of $X$, and the model distribution $p_\theta(x)$ is defined as the marginal of the joint distribution between $X$ and $Z$:
$p_\theta(x) = \int_{\mathcal{Z}} p_\theta(x, z)\, d\mu(z) = \int_{\mathcal{Z}} p_\theta(x|z)\, p_\theta(z)\, d\mu(z)$   (1)
where the joint distribution $p_\theta(x, z)$ is factorized as the product of a prior $p_\theta(z)$ over the latent $Z$ and the "generative" distribution $p_\theta(x|z)$, and $\mu(z)$ is the base measure on the latent space $\mathcal{Z}$. Typically, the prior is modeled with a simple distribution like a multivariate Gaussian, or simple priors are transformed into complex ones by normalizing flows and variants (Rezende & Mohamed, 2015; Kingma et al., 2016; Sønderby et al., 2016a).
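To make the change-of-variables computation behind such flow-based priors concrete, here is a minimal sketch (ours, in plain numpy; a simple elementwise affine transform stands in for the richer flows cited above, and all function names are our own):

```python
import numpy as np

def standard_normal_logpdf(e):
    # log N(e; 0, I), summed over dimensions
    return -0.5 * np.sum(e**2 + np.log(2 * np.pi), axis=-1)

def affine_flow_logpdf(z, scale, shift):
    # z = scale * eps + shift with eps ~ N(0, I). By the change of variables,
    # log p(z) = log N(eps; 0, I) - sum_k log |scale_k|  (log-Jacobian term).
    eps = (z - shift) / scale
    return standard_normal_logpdf(eps) - np.sum(np.log(np.abs(scale)))

scale = np.array([2.0, 0.5])
shift = np.array([1.0, -1.0])
# At z = shift we have eps = 0, so the log-density is the standard-normal
# mode value minus the log-Jacobian of the transform.
lp = affine_flow_logpdf(shift, scale, shift)
```

For this particular `scale`, the log-Jacobian is log 2 + log 0.5 = 0, so `lp` equals the standard-normal mode log-density in two dimensions.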
To learn the parameters $\theta$, we wish to minimize the negative log-likelihood:
$\mathrm{NLL}(\theta) = \mathbb{E}_{\tilde{p}(x)}\left[-\log p_\theta(x)\right]$   (2)
where $\tilde{p}(x)$ is the empirical distribution derived from the training data $D$. In general, this marginal likelihood is intractable to compute or differentiate directly for a high-dimensional latent space $\mathcal{Z}$. Variational inference (Wainwright et al., 2008) provides a solution: optimize the evidence lower bound (ELBO), an alternative objective, by introducing a parametric inference model $q_\phi(z|x)$:
$\mathrm{NLL}(\theta) \le \mathbb{E}_{\tilde{p}(x)}\big[\mathbb{E}_{q_\phi(z|x)}[-\log p_\theta(x|z)] + \mathrm{KL}(q_\phi(z|x)\,\|\,p_\theta(z))\big] = \mathcal{L}_{\mathrm{ELBO}}(\theta, \phi)$   (3)
where $\mathcal{L}_{\mathrm{ELBO}}$ can be seen as an autoencoding loss, with $q_\phi(z|x)$ being the encoder and $p_\theta(x|z)$ being the decoder, and with the first term on the RHS of (3) as the reconstruction error.
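To make (3) concrete, the following sketch (ours, in plain numpy) computes a one-sample Monte Carlo estimate of the negative ELBO for a diagonal-Gaussian posterior, a standard-normal prior, and a factorized Bernoulli decoder; the toy `decode` function and all names are our assumptions, not the paper's architecture:

```python
import numpy as np

def gaussian_kl_to_standard_normal(mu, logvar):
    # Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) )
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

def bernoulli_nll(x, logits):
    # -log p(x|z) for a factorized Bernoulli decoder, numerically stable
    return np.sum(np.logaddexp(0.0, logits) - x * logits)

def neg_elbo(x, mu, logvar, decode, rng):
    # One-sample Monte Carlo estimate of the negative ELBO in Eq. (3):
    # reconstruction error plus KL(q(z|x) || p(z)).
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * logvar) * eps          # reparameterized sample
    return bernoulli_nll(x, decode(z)) + gaussian_kl_to_standard_normal(mu, logvar)

rng = np.random.default_rng(0)
x = (rng.random(784) > 0.5).astype(float)        # toy "binarized image"
mu, logvar = np.zeros(32), np.zeros(32)          # posterior equal to the prior
decode = lambda z: np.zeros(784)                 # toy decoder: p(x_i = 1) = 0.5
loss = neg_elbo(x, mu, logvar, decode, rng)
```

With the posterior equal to the prior, the KL term is zero, and the uniform decoder yields a reconstruction loss of exactly 784 log 2 nats.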
2.3 Autoencoding Problem in VAEs
As discussed in Chen et al. (2017a), without further assumptions, the ELBO objective in (3) may not guide the model towards the intended role of the latent variables $Z$; it may even learn an uninformative $Z$, observed as the KL term vanishing to zero. For example, suppose we use an autoregressive decoder $p_\theta(x|z)$ that is sufficiently expressive to model the data distribution without the assistance of $z$, i.e., $p_\theta(x|z) = p_\theta(x)$. In this case, the optimal $q_\phi(z|x)$ w.r.t. (3) is independent of $x$, with the inference model reducing to the prior, i.e., $q_\phi(z|x) = p_\theta(z)$.
The essential reason for this problem is that, in a fully unsupervised setting, the marginal-likelihood-based objective incorporates no (direct) supervision on the latent space to characterize the latent variables with properties preferred for representation learning. The main goal of this work is to explicitly control the geometry of the latent space, in the hope that preferred latent representations will be characterized and selected.
3 Mutual Posterior-Divergence Regularization
3.1 Geometric Properties of Meaningful Latent Space
Motivated by the Diversity-Inducing Mutual Angular Regularization (Xie et al., 2015), which is widely used in LVMs, we propose to regularize the posteriors of different data $x$, encouraging them to spread out diversely, smoothly, and evenly in the latent space $\mathcal{Z}$. The intuition is: (1) if the posteriors are mutually diverse, the patterns captured by different posteriors are likely to have less redundancy, hence characterizing and interpreting different data $x$; (2) if the posteriors are smoothly and evenly distributed over the whole space $\mathcal{Z}$, the shared patterns of similar data points are likely to be captured by their posteriors, avoiding isolating each data point from the others. By balancing the diversity and smoothness of the distribution of posteriors, the learned representations are encouraged to maintain global structured information while discarding detailed textures of local dependencies in the data.
3.2 MAEs
Measure of Diversity.
We propose to use the expectation of the mutual KL-divergence between the posteriors of a pair of data points to measure the diversity of posteriors. Specifically, the mutual posterior diversity is defined as:
$\mathrm{MPD}(\phi) = \mathbb{E}_{x, x' \sim \tilde{p}(x)}\big[\mathrm{KL}(q_\phi(z|x)\,\|\,q_\phi(z|x'))\big]$   (4)
There are two main reasons we use the KL-divergence, rather than other measures, as the measure of diversity: (1) the KL-divergence is transformation invariant, i.e., for an invertible smooth function $f$, the KL-divergence between two distributions is unchanged under the change of variables $z' = f(z)$. This makes the computation efficient for complex posteriors that are transformed from simple ones, such as by applying normalizing flows and variants (Rezende & Mohamed, 2015; Kingma et al., 2016; Sønderby et al., 2016a). (2) The KL-divergence has a close relation with mutual information, an important information-theoretic measure of the mutual dependence between two variables, which provides the theoretical justification of the proposed regularizer (detailed in §3.3).
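The transformation-invariance property can be checked numerically with one-dimensional Gaussians and an invertible affine map (a toy verification of ours, not from the paper):

```python
import numpy as np

def kl_gauss(m1, s1, m2, s2):
    # Closed-form KL( N(m1, s1^2) || N(m2, s2^2) )
    return np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

# Two "posteriors" q(z|x) and q(z|x') as 1-D Gaussians.
m1, s1 = 0.3, 1.2
m2, s2 = -1.0, 0.7

# The invertible affine map f(z) = a z + b pushes N(m, s^2) to N(a m + b, (|a| s)^2).
a, b = 2.5, -4.0
kl_before = kl_gauss(m1, s1, m2, s2)
kl_after = kl_gauss(a * m1 + b, abs(a) * s1, a * m2 + b, abs(a) * s2)
# kl_before equals kl_after up to floating point: the KL is invariant under f.
```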
In MAEs, we propose to maximize the mutual posterior diversity (MPD) in (4). A straightforward way is to add the negative MPD to the objective in (3) that VAEs attempt to minimize. There are, however, two practical issues: (1) The scale of MPD, particularly for continuous $Z$, is much larger than that of $\mathcal{L}_{\mathrm{ELBO}}$, so a hyperparameter would have to be chosen carefully to control the scale of MPD, making optimization much more challenging and unstable. (2) For a multivariate $Z$, e.g., a $K$-dimensional $Z$, due to the properties of the KL-divergence, MPD may be dominated by a small group of dimensions, leaving the others close to zero. In that case, most dimensions of $Z$ are uninformative, which is not a desired representation. To solve these two problems, in practice we propose to minimize an MPD-based loss $\mathcal{L}_{\mathrm{div}}$ instead of directly maximizing MPD itself:
(5) 
$\mathcal{L}_{\mathrm{div}}$ has two important properties: (1) it is bounded below, and (2) its minimum is attained only if every dimension of the latent $Z$ is mutually diverse. The first property sets a lower bound on $\mathcal{L}_{\mathrm{div}}$, making optimization much more stable. The second guarantees that all dimensions of the latent $Z$ need to be mutually diverse w.r.t. minimizing $\mathcal{L}_{\mathrm{div}}$.
Measure of Smoothness.
The smoothness of the distribution of posteriors is measured using the standard deviation of the mutual KL-divergence:
$\mathcal{L}_{\mathrm{std}}(\phi) = \mathrm{std}_{x, x' \sim \tilde{p}(x)}\big[\mathrm{KL}(q_\phi(z|x)\,\|\,q_\phi(z|x'))\big]$   (6)
where $\mathrm{std}[\cdot]$ stands for the standard deviation of a random variable.
Minimizing $\mathcal{L}_{\mathrm{std}}$ encourages the posteriors to spread out smoothly and evenly in different directions. Encouraging the standard deviation to be small prevents the posteriors from falling into several small groups that are isolated from each other. This is crucially important for unsupervised clustering tasks, in which we want to cluster similar data into one large group instead of splitting them into multiple separated small groups (see §4.1.2 for detailed experimental results). Figure 1 shows two sets of distributions of data points, where the mean of the pairwise distances of the first set (Figure 1(a)) is roughly the same as that of the second set (Figure 1(b)), but the standard deviation of the first set is larger.
In the framework of MAEs, the final objective to minimize is:
$\mathcal{L}(\theta, \phi) = \mathcal{L}_{\mathrm{ELBO}}(\theta, \phi) + \eta_1 \mathcal{L}_{\mathrm{div}}(\phi) + \eta_2 \mathcal{L}_{\mathrm{std}}(\phi)$   (7)
where $\eta_1$ and $\eta_2$ are regularization constants that balance the three losses in (7). Even though MAE introduces the two extra hyperparameters $\eta_1$ and $\eta_2$, we find them easy to tune, and MAE shows robust performance across different values of $\eta_1$ and $\eta_2$.
To solve (7), we can approximate $\mathcal{L}_{\mathrm{div}}$ and $\mathcal{L}_{\mathrm{std}}$ using Monte Carlo estimation within each minibatch:
(8) 
where $M$ is the number of valid pairs of data in each minibatch; $\mathcal{L}_{\mathrm{std}}$ is approximated in the same way.
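Since the exact forms of the losses in (5) and (8) are not reproduced here, the sketch below (ours, in plain numpy) estimates only the underlying pairwise quantities of (4) and (6): the minibatch mean and standard deviation of pairwise KL-divergences between diagonal-Gaussian posteriors.

```python
import numpy as np

def kl_diag_gauss(mu1, logvar1, mu2, logvar2):
    # KL between diagonal Gaussians, summed over latent dimensions
    v1, v2 = np.exp(logvar1), np.exp(logvar2)
    return 0.5 * np.sum(logvar2 - logvar1 + (v1 + (mu1 - mu2)**2) / v2 - 1.0, axis=-1)

def mpd_minibatch(mu, logvar):
    # Monte Carlo estimate of the MPD mean (Eq. 4) and its std (Eq. 6) over
    # all ordered pairs (i, j), i != j, of posteriors in a minibatch.
    n = mu.shape[0]
    kls = np.array([kl_diag_gauss(mu[i], logvar[i], mu[j], logvar[j])
                    for i in range(n) for j in range(n) if i != j])
    # M = n * (n - 1) valid pairs
    return kls.mean(), kls.std()

rng = np.random.default_rng(1)
mu = rng.standard_normal((8, 32))       # minibatch of 8 posterior means
logvar = np.zeros((8, 32))              # unit variances for simplicity
mpd_mean, mpd_std = mpd_minibatch(mu, logvar)
```

If all posteriors in the batch were identical, both estimates would be exactly zero, matching the intuition that diversity vanishes when posteriors collapse onto one another.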
3.3 Theoretical Justification
So far, our discussion has concentrated on the motivation and mathematical formulation of the proposed regularization method for VAEs. In this section, we provide theoretical justification by connecting the mutual posterior diversity (MPD) in (4) with the mutual information term defined in InfoVAE (Zhao et al., 2017). With this end goal in mind, we first review the mutual information (MI) term involved in the InfoVAE objective, which is central to linking MAE and InfoVAE.
Mutual Information Maximization.
InfoVAE introduces the mutual information by first defining the joint "inference distribution" $q_\phi(x, z) = p^\star(x)\, q_\phi(z|x)$, where $p^\star(x)$ is the density of the true data distribution $P^\star$. They then add to the standard $\mathcal{L}_{\mathrm{ELBO}}$ a term that prefers high mutual information between $X$ and $Z$ under $q_\phi(x, z)$,
and further proved that
$I_q(X; Z) = \mathbb{E}_{p^\star(x)}\big[\mathrm{KL}(q_\phi(z|x)\,\|\,q_\phi(z))\big]$   (9)
where $q_\phi(z)$ is the marginal of $q_\phi(x, z)$. Mutual-information-inspired objectives have been explored in GANs (Goodfellow et al., 2014; Chen et al., 2016), clustering (Hinton et al., 1995; Krause et al., 2010), and representation learning (Esmaeili et al., 2018; Hjelm et al., 2018).
Relation between MPD and MI.
The following theorem states our major result that reveals the relation between MPD and MI (proof in Appendix A):
Theorem 1.
Roughly, Theorem 1 states that maximizing MPD and maximizing MI achieve the same goal: maximizing the divergence between the posterior distribution $q_\phi(z|x)$ and the marginal $q_\phi(z)$. Note that the (approximate) computation of MPD, as described in (8), is much easier than that of MI, which is generally intractable and requires adversarial learning or Maximum Mean Discrepancy.
4 Experiments
In this paper, we choose the Variational Lossy Autoencoder (VLAE) (Chen et al., 2017a), a VAE with an autoregressive flow (AF) prior and an autoregressive decoder, as the basic architecture of our MAE models. More detailed descriptions, results, and analyses of the conducted experiments are provided in Appendix B.
4.1 Binary Images
We evaluate MAE on two binary image datasets that are commonly used for evaluating deep generative models: MNIST (LeCun et al., 1998) and OMNIGLOT (Lake et al., 2013; Burda et al., 2015), both in their dynamically binarized versions (Burda et al., 2015). The VLAE networks used on the binary image datasets are similar to those described in Chen et al. (2017a): a ResNet (He et al., 2016) encoder, the same as in ResNet VAE (Kingma et al., 2016); a PixelCNN (Oord et al., 2016) decoder with 6 layers of masked convolution; and a 32-dimensional latent code with an AF prior implemented with MADE (Germain et al., 2015). The only difference is that our PixelCNN decoder has varying filter sizes: two 7x7 layers, followed by two 5x5 layers, and finally two 3x3 layers, instead of the fixed 3x3 filter size used in Chen et al. (2017a). Hence, the decoder we use has a larger receptive field, ensuring that the decoder is sufficiently expressive. The same architecture is applied to all experiments on both datasets. For fair comparison, we re-implemented VLAE using the same architecture as in our MAE model. "Free bits" (Kingma et al., 2016) is used to improve the optimization stability of VLAE (not of MAE). For the hyperparameters $\eta_1$ and $\eta_2$, we explored a few configurations, selecting each from a small set of candidate values.


| | Unsupervised clustering | | | Semi-supervised classification | | | | | |
| | KMeans | | | KNN | | | Linear | | |
| Model | K=10 | K=20 | K=30 | 100 | 1000 | All | 100 | 1000 | All |
| ResNet VAE w. AF | 67.3 | 81.6 | 86.6 | 77.4 | 94.3 | 98.1 | 84.6 | 94.3 | 97.4 |
| VLAE | 68.1 | 74.0 | 79.1 | 75.7 | 90.0 | 95.6 | 86.4 | 93.7 | 96.1 |
| MAE: | 82.7 | 92.3 | 93.0 | 86.6 | 95.5 | 97.8 | 91.1 | 96.3 | 98.3 |
| MAE: | 84.7 | 92.6 | 93.2 | 86.3 | 96.3 | 98.0 | 90.6 | 96.1 | 98.1 |
| MAE: | 91.2 | 92.6 | 93.6 | 86.7 | 95.9 | 98.2 | 91.5 | 96.4 | 98.4 |
| MAE: | 78.2 | 92.0 | 92.8 | 85.5 | 96.4 | 98.2 | 90.7 | 96.0 | 98.0 |
| MAE: | 83.1 | 92.3 | 94.3 | 86.2 | 96.6 | 98.1 | 90.0 | 95.7 | 98.0 |
4.1.1 Density Estimation
We first evaluate the density estimation performance of MAE. Table 1 provides the results of MAE with different hyperparameter settings on MNIST, together with previous top systems for comparison. The reported marginal negative log-likelihood (NLL) is evaluated with 4096 importance samples (Burda et al., 2015). Our MAE achieves state-of-the-art performance on both datasets, exceeding all previous models as well as the re-implemented VLAE. Note that our re-implementation of VLAE obtains better performance than the original one in Chen et al. (2017a), demonstrating the effectiveness of increasing decoder expressiveness by enlarging its receptive field.
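The importance-sampling evaluation of the NLL averages importance weights in log space. A minimal sketch (ours; the toy model is chosen so the exact answer is known, and the function names are our own) is:

```python
import numpy as np

def log_mean_exp(a):
    # Numerically stable log( mean( exp(a) ) )
    m = a.max()
    return m + np.log(np.mean(np.exp(a - m)))

def iw_nll(log_p_xz, log_p_z, log_q_z):
    # -log p(x) ~= -log (1/S) sum_s p(x|z_s) p(z_s) / q(z_s|x),  z_s ~ q(z|x)
    # (the importance-weighted bound of Burda et al., 2015)
    return -log_mean_exp(log_p_xz + log_p_z - log_q_z)

# Toy check: if q(z|x) = p(z) and p(x|z) = c independent of z,
# the estimate recovers -log c exactly for any number of samples.
S = 4096
log_c = -100.0
log_p_z = np.full(S, -1.5)
est = iw_nll(np.full(S, log_c), log_p_z, log_p_z.copy())   # est == 100.0
```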
4.1.2 Representation Learning
In order to evaluate the quality of the learned latent representations, we conduct three sets of experiments: image reconstruction and generation, unsupervised clustering, and semi-supervised classification.
Image Reconstruction and Generation.
The visualization of image reconstruction and generation on MNIST and OMNIGLOT is shown in Figure 4 and Figure 7. For comparison, we also show the reconstructed images from VLAE. MAE achieves better reconstruction than VLAE, indicating that the latent code learned by MAE encodes more information from the data.
Unsupervised Clustering.
As discussed above, good latent representations need to capture global structured information and disentangle the underlying causal factors, rather than just memorize the data. From this perspective, good image reconstruction results alone cannot guarantee good representations. To further evaluate the quality of the representations learned by MAE, we conduct unsupervised clustering experiments on MNIST. We run the KMeans clustering algorithm (Hartigan & Wong, 1979) on the learned representations. The class label of each cluster is assigned by finding the training sample closest to the cluster head. Clustering accuracy is evaluated based on the assigned cluster labels. We run three experiments with $K = 10, 20, 30$.
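The clustering protocol above can be sketched as follows (ours, in plain numpy, on toy two-dimensional "representations"; a greedy farthest-point initialization replaces random restarts for determinism, and all names are our own):

```python
import numpy as np

def kmeans(X, k, iters=50):
    # Greedy farthest-point initialization, then Lloyd's iterations.
    centers = [X[0].copy()]
    for _ in range(k - 1):
        d = np.min([((X - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(X[d.argmax()].copy())
    centers = np.array(centers)
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)                 # nearest-center assignment
        for j in range(k):
            if (assign == j).any():
                centers[j] = X[assign == j].mean(0)   # centroid update
    return centers, assign

def cluster_labels(centers, X_train, y_train):
    # Label each cluster by the label of the training sample closest to its head.
    d = ((centers[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    return y_train[d.argmin(1)]

# Two well-separated toy "latent" clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
centers, assign = kmeans(X, 2)
pred = cluster_labels(centers, X, y)[assign]
acc = (pred == y).mean()
```

On these well-separated toy blobs the assigned cluster labels recover the true labels exactly.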
Table 2 (left section) shows the clustering performance. For a thorough comparison, we also re-implemented a VAE model with a factorized decoder and an AF prior, which has been shown to obtain remarkable reconstruction performance. This VAE model uses a ResNet (He et al., 2016) encoder and decoder similar to Chen et al. (2017a). From Table 2 we see that MAE significantly outperforms ResNet VAE and VLAE, especially when the number of clusters is small. Interestingly, as $K$ keeps increasing, the clustering accuracy of ResNet VAE increases rapidly, suggesting that in its latent space the data are split into many small groups.
In addition, regarding the effects of the two hyperparameters on the learned representations, MAEs with a larger weight on the diversity term obtain worse performance when the number of clusters is small. The reason might be that a large diversity weight encourages the posteriors to diverge from each other by splitting the data into small groups. Meanwhile, increasing the weight on the smoothness term is effective in preventing this phenomenon, showing that in practice the trade-off between latent-space diversity and smoothness needs consideration.
Semi-supervised Classification.
For semi-supervised classification, we re-implemented the M1 model as described in Kingma et al. (2014). To test the quality of the information encoded in the latent representations, we choose two simple classifiers with limited capacity: K-nearest neighbors (KNN) and linear logistic regression. For each classifier, we use different numbers of labeled data: 100, 1000, and all the training data from MNIST.
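A minimal K-nearest-neighbor classifier of the kind used here can be sketched as follows (ours, on toy representations; names and data are our own):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, X_test, k=3):
    # For each test point, take a majority vote over the labels of its
    # k nearest training points under Euclidean distance.
    preds = []
    for x in X_test:
        d = ((X_train - x) ** 2).sum(1)
        nearest = y_train[np.argsort(d)[:k]]
        preds.append(Counter(nearest.tolist()).most_common(1)[0][0])
    return np.array(preds)

# Toy "latent representations": two separated Gaussian blobs.
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0, 0.2, (20, 2)), rng.normal(4, 0.2, (20, 2))])
y_train = np.array([0] * 20 + [1] * 20)
X_test = np.array([[0.1, -0.1], [3.9, 4.2]])
pred = knn_predict(X_train, y_train, X_test, k=5)
```

Classifiers this simple succeed only if class structure is already linearly or locally separable in the latent space, which is why they probe representation quality rather than classifier capacity.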
From the results listed in Table 2 (right section), MAE obtains the best classification accuracy in all settings. Moreover, the improvements of MAE over ResNet VAE and VLAE are more significant when the number of labeled training examples is small, further demonstrating the meaningfulness of the representations learned by MAE.
| Model | bits/dim |
| --- | --- |
| Deep GMMs (Van den Oord & Schrauwen, 2014) | 4.00 |
| Real NVP (Dinh et al., 2016) | 3.49 |
| PixelCNN (Oord et al., 2016) | 3.14 |
| PixelRNN (Oord et al., 2016) | 3.00 |
| PixelCNN++ (Salimans et al., 2017) | 2.92 |
| PixelSNAIL (Chen et al., 2017b) | 2.85 |
| Conv DRAW (Gregor et al., 2016) | 3.50 |
| IAF VAE (Kingma et al., 2016) | 3.11 |
| VLAE (Chen et al., 2017a) | 2.95 |
| VLAE (re-impl) | 2.98 |
| MAE: | 2.95 |
| MAE: | 2.97 |
| MAE: | 2.96 |
4.2 Natural Images
In addition to the binary image datasets, we also applied MAE to the CIFAR-10 dataset (Krizhevsky & Hinton, 2009) of natural images. The VLAE with a DenseNet (Huang et al., 2017) encoder and a PixelCNN++ (Salimans et al., 2017) decoder described in Chen et al. (2017a) is used as the neural architecture of MAE. To ensure that the decoder is sufficiently expressive, the PixelCNN decoder has 5 blocks of 96 feature maps and a 7x4 receptive field. Hence, the PixelCNN decoder we use is both deeper and wider than the one used in Chen et al. (2017a).
4.3 Density Estimation
The density estimation performance on CIFAR-10 of MAEs with different hyperparameters is provided in Table 3, compared with the top-performing likelihood-based unconditional generative models (first section) and variationally trained latent-variable models (second section). MAE models obtain improvements over the VLAE re-implemented by us, and fall slightly behind the original one in Chen et al. (2017a). Compared with PixelSNAIL (Chen et al., 2017b), the state-of-the-art autoregressive generative model, the performance of the MAE models is around 0.11 bits/dim worse. Further improving the density estimation performance of MAEs on natural images is left to future work.
4.4 Image Reconstruction and Generation
We also investigate learning informative representations on the CIFAR-10 dataset. The visualization of image reconstruction and generation is shown in Figure 10, together with VLAE for comparison. It is interesting to note that MAE tends to preserve rather detailed shape information compared with VLAE, whereas color information, particularly the background color, is partially omitted. One reasonable explanation, as discussed in Chen et al. (2017a), is that color is locally predictable. This serves as one example showing that MAEs can capture global structured information from data while omitting common local correlations. Image samples from MAE are shown in Figure 10.
5 Conclusion
In this paper, we proposed a mutual posterior-divergence regularization for VAEs, which controls the geometry of the latent space during training. By connecting the mutual posterior diversity with mutual information, we formally studied the theoretical properties of the proposed MAE. Experiments on three benchmark image datasets show the capability of MAEs at both density estimation and representation learning, with state-of-the-art or comparable likelihoods, and superior performance on image reconstruction, unsupervised clustering, and semi-supervised classification against previous top-performing models.
One potential direction for future work is to extend MAE to other forms of data, in particular text, on which VAEs suffer a more serious KL-vanishing problem. Another exciting direction is to formally study the properties of the standard deviation of the mutual posterior KL-divergence used to measure smoothness, hence providing further justification of the proposed regularizer, or even introducing alternatives to further improve performance.
Acknowledgements
The authors thank Zihang Dai, Junxian He, Di Wang and Zhengzhong Liu for their helpful discussions. This research was supported in part by DARPA grant FA8750-18-2-0018 funded under the AIDA program. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DARPA.
References
 Bengio et al. (2013) Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.
 Bowman et al. (2015) Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.
 Burda et al. (2015) Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.
 Chen et al. (2016) Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2172–2180, 2016.
 Chen et al. (2017a) Xi Chen, Diederik P Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel. Variational lossy autoencoder. In Proceedings of the 5th International Conference on Learning Representations (ICLR-2017), Toulon, France, April 2017a.
 Chen et al. (2017b) Xi Chen, Nikhil Mishra, Mostafa Rohaninejad, and Pieter Abbeel. PixelSNAIL: An improved autoregressive generative model. arXiv preprint arXiv:1712.09763, 2017b.
 Dinh et al. (2016) Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.
 Dziugaite et al. (2015) Gintare Karolina Dziugaite, Daniel M Roy, and Zoubin Ghahramani. Training generative neural networks via maximum mean discrepancy optimization. arXiv preprint arXiv:1505.03906, 2015.
 Esmaeili et al. (2018) Babak Esmaeili, Hao Wu, Sarthak Jain, Alican Bozkurt, N Siddharth, Brooks Paige, Dana H Brooks, Jennifer Dy, and Jan-Willem van de Meent. Structured disentangled representations. stat, 1050:29, 2018.

 Ganchev et al. (2010) Kuzman Ganchev, Jennifer Gillenwater, Ben Taskar, et al. Posterior regularization for structured latent variable models. Journal of Machine Learning Research, 11(Jul):2001–2049, 2010.
 Germain et al. (2015) Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. MADE: Masked autoencoder for distribution estimation. In International Conference on Machine Learning, pp. 881–889, 2015.
 Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS-2014), pp. 2672–2680, 2014.
 Gregor et al. (2016) Karol Gregor, Frederic Besse, Danilo Jimenez Rezende, Ivo Danihelka, and Daan Wierstra. Towards conceptual compression. In Advances in Neural Information Processing Systems, pp. 3549–3557, 2016.
 Gretton et al. (2007) Arthur Gretton, Karsten M Borgwardt, Malte Rasch, Bernhard Schölkopf, and Alex J Smola. A kernel method for the two-sample problem. In Advances in Neural Information Processing Systems, pp. 513–520, 2007.
 Hartigan & Wong (1979) John A Hartigan and Manchek A Wong. Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society, Series C (Applied Statistics), 28(1):100–108, 1979.

 He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
 Hinton et al. (1995) Geoffrey E Hinton, Peter Dayan, Brendan J Frey, and Radford M Neal. The "wake-sleep" algorithm for unsupervised neural networks. Science, 268(5214):1158–1161, 1995.
 Hjelm et al. (2018) R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670, 2018.
 Huang et al. (2017) Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In CVPR, volume 1, pp. 3, 2017.
 Kim et al. (2018) Yoon Kim, Sam Wiseman, Andrew C Miller, David Sontag, and Alexander M Rush. Semi-amortized variational autoencoders. arXiv preprint arXiv:1802.02550, 2018.
 Kingma & Ba (2015) Diederik P Kingma and Jimmy Lei Ba. Adam: A method for stochastic optimization. In Proc. of ICLR, 2015.
 Kingma & Welling (2014) Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations (ICLR-2014), Banff, Canada, April 2014.
 Kingma et al. (2014) Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pp. 3581–3589, 2014.
 Kingma et al. (2016) Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pp. 4743–4751, 2016.
 Krause et al. (2010) Andreas Krause, Pietro Perona, and Ryan G Gomes. Discriminative clustering by regularized information maximization. In Advances in Neural Information Processing Systems, pp. 775–783, 2010.
 Krizhevsky & Hinton (2009) Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
 Lake et al. (2013) Brenden M Lake, Ruslan R Salakhutdinov, and Josh Tenenbaum. One-shot learning by inverting a compositional causal process. In Advances in Neural Information Processing Systems, pp. 2526–2534, 2013.

 Larochelle & Murray (2011) Hugo Larochelle and Iain Murray. The neural autoregressive distribution estimator. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS-2011), pp. 29–37, 2011.
 LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

 Li et al. (2015) Yujia Li, Kevin Swersky, and Rich Zemel. Generative moment matching networks. In Proceedings of International Conference on Machine Learning (ICML-2015), pp. 1718–1727, 2015.
 Maaten & Hinton (2008) Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
 Makhzani et al. (2015) Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.

 Oord et al. (2016) Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. In Proceedings of International Conference on Machine Learning (ICML-2016), 2016.
 Polyak & Juditsky (1992) Boris T Polyak and Anatoli B Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.
 Rezende & Mohamed (2015) Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.

 Rezende et al. (2014) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning (ICML-2014), pp. 1278–1286, Beijing, China, 22–24 Jun 2014.
 Rolfe (2016) Jason Tyler Rolfe. Discrete variational autoencoders. arXiv preprint arXiv:1609.02200, 2016.
 Salimans et al. (2017) Tim Salimans, Andrej Karpathy, Xi Chen, Diederik P Kingma, and Yaroslav Bulatov. Pixelcnn++: A pixelcnn implementation with discretized logistic mixture likelihood and other modifications. In International Conference on Learning Representations (ICLR), 2017.
 Serban et al. (2017) Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron C Courville, and Yoshua Bengio. A hierarchical latent variable encoder-decoder model for generating dialogues. In AAAI, pp. 3295–3301, 2017.
 Sønderby et al. (2016a) Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. Ladder variational autoencoders. In Advances in neural information processing systems, pp. 3738–3746, 2016a.
 Sønderby et al. (2016b) Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. How to train deep variational autoencoders and probabilistic ladder networks. arXiv preprint arXiv:1602.02282, 2016b.
 Tomczak & Welling (2018) Jakub Tomczak and Max Welling. VAE with a VampPrior. In International Conference on Artificial Intelligence and Statistics, pp. 1214–1223, 2018.

Van den Oord & Schrauwen (2014) Aaron Van den Oord and Benjamin Schrauwen. Factoring variations in natural images with deep Gaussian mixture models. In Advances in Neural Information Processing Systems, pp. 3518–3526, 2014.

Wainwright et al. (2008) Martin J Wainwright, Michael I Jordan, et al. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1–2):1–305, 2008.
 Xie et al. (2015) Pengtao Xie, Yuntian Deng, and Eric Xing. Latent variable modeling with diversity-inducing mutual angular regularization. arXiv preprint arXiv:1512.07336, 2015.
 Yang et al. (2017) Zichao Yang, Zhiting Hu, Ruslan Salakhutdinov, and Taylor Berg-Kirkpatrick. Improved variational autoencoders for text modeling using dilated convolutions. In Proceedings of International Conference on Machine Learning (ICML 2017), 2017.
 Zhao et al. (2017) Shengjia Zhao, Jiaming Song, and Stefano Ermon. InfoVAE: Information maximizing variational autoencoders. arXiv preprint arXiv:1706.02262, 2017.
Appendix: MAE: Mutual Posterior-Divergence Regularization for Variational Auto-Encoders
Appendix A Proof of Theorem 1
Proof.
where denotes the entropy. Then,
and
So we have,
∎
Appendix B Detailed Description of Experiments
B.1 Experiments for Binary Images
B.1.1 Neural Network Architectures and Training
The neural network architectures, including most of the hyperparameters, are the same as those in Chen et al. (2017a). The only difference in network architecture is the filter size of the PixelCNN decoder, which has been described in §4. For the ResNet VAE with AF, we use the same ResNet encoder but a symmetric ResNet architecture for the decoder. For the encoder, we use only one stochastic layer with 32 dimensions.
In terms of training, we use the Adam optimizer (Kingma & Ba, 2015) with learning rate 0.001, instead of the Adamax used in Chen et al. (2017a). A free-bits budget of 0.01 nats per data dimension was used in all experiments. In order to obtain a relatively accurate approximation of and , we used a much larger batch size of 100 in our experiments. Polyak averaging (Polyak & Juditsky, 1992) was used to compute the final parameters, with .
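The free-bits constraint and Polyak averaging described above can be sketched as follows. This is a minimal NumPy illustration; the function names and the decay constant `alpha` are assumptions for illustration, since the paper's averaging coefficient was lost in extraction.

```python
import numpy as np

def free_bits_kl(kl_per_dim, lam=0.01):
    # Free bits: clamp each latent dimension's KL at lam nats so the
    # optimizer cannot collapse any dimension below the budget.
    return np.maximum(kl_per_dim, lam).sum()

def polyak_average(param_history, alpha=0.999):
    # Exponential moving average of parameter snapshots, in the spirit of
    # Polyak & Juditsky (1992); alpha is an illustrative default.
    avg = param_history[0].astype(float).copy()
    for p in param_history[1:]:
        avg = alpha * avg + (1.0 - alpha) * p
    return avg
```

In a training loop, `free_bits_kl` would replace the raw KL term of the ELBO, and `polyak_average` would be applied to successive parameter snapshots to obtain the final evaluation parameters.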
Model  RE  KL  MPD  STD  ELBO  NLL  

ResNet VAE with AF  56.04  25.38  1,193.18  630.90  81.42  79.28 
VLAE (w/o free bits)  71.74  7.07  109.31  66.52  78.81  78.45 
VLAE (w/ free bits)  69.60  9.02  132.00  66.30  78.62  78.26 
MAE ()  69.58  9.57  99.65  22.03  79.15  78.04 
MAE ()  67.95  11.38  124.80  26.40  79.33  78.02 
MAE ()  66.83  12.67  148.84  29.88  79.50  78.02 
MAE ()  68.44  10.44  55.09  8.93  78.88  78.00 
MAE ()  67.40  11.54  79.71  12.69  78.94  77.98 
MAE ()  66.54  12.67  103.30  16.32  79.04  77.99 
MAE ()  71.08  8.41  30.64  4.78  79.49  78.36 
MAE ()  69.34  10.19  55.59  8.51  79.53  78.15 
MAE ()  68.03  11.65  80.87  12.19  79.68  78.06 
B.1.2 Detailed Results on Density Estimation
Table 4 shows the detailed results of density estimation on MNIST. We see that increasing always yields a more informative latent , but the NLL does not always improve. This supports the hypothesis that good representations should encode globally structured information in the data, rather than local dependencies. Interestingly, the effect of on the latent is inconsistent: increasing from 0.1 to 0.5 leads to a more informative (larger KL) and a better NLL, but a value that is too large (1.0) prevents the latent from learning more information from the data (smaller KL), resulting in a worse NLL. Hence, in practice one needs to consider the trade-off between diversity and smoothness of the latent space.
B.1.3 Detailed Results on Semi-supervised Classification with SVMs
Table 5 reports the performance of semi-supervised classification using SVMs with two different kernels, linear and RBF. MAE achieves the best classification accuracy in all settings. It should be noted that the accuracies of SVMs with nonlinear kernels fluctuate more strongly than those with linear ones, particularly when the number of labeled training examples is small.
SVM (Linear)  SVM (RBF)  
Model  100  1000  All  100  1000  All 
ResNet VAE w. AF  88.7±1.5  95.8±0.2  98.3±0.0  32.1±11.9  93.9±0.7  98.2±0.0 
VLAE  84.9±1.8  94.3±0.2  96.7±0.0  51.5±5.8  90.9±0.4  97.0±0.0 
MAE:  91.9±1.3  96.4±0.3  98.5±0.0  48.7±11.0  93.2±0.7  98.3±0.0 
MAE:  90.7±1.8  96.3±0.1  98.4±0.0  60.5±8.5  95.5±0.3  98.6±0.0 
MAE:  91.2±1.2  96.6±0.2  98.5±0.0  74.3±5.8  96.5±0.2  98.8±0.0 
MAE:  90.3±1.5  96.5±0.1  98.5±0.0  51.5±8.9  92.3±1.4  98.5±0.0 
MAE:  90.3±1.5  96.5±0.1  98.5±0.0  54.5±9.8  95.0±0.5  98.5±0.0 
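The evaluation protocol behind these numbers can be sketched with scikit-learn. Function and variable names here are illustrative assumptions; the latent codes `z_train`/`z_test` would come from the trained encoder's posterior means.

```python
import numpy as np
from sklearn.svm import SVC

def svm_accuracy(z_train, y_train, z_test, y_test, kernel="linear"):
    # Fit an SVM on latent codes of the small labeled subset, then
    # report classification accuracy on held-out codes.
    # kernel is "linear" or "rbf"; the RBF bandwidth is left at
    # sklearn's default, as the paper does not specify it.
    clf = SVC(kernel=kernel)
    clf.fit(z_train, y_train)
    return clf.score(z_test, y_test)
```

With 100 or 1000 labeled examples, `z_train` would be the codes of the labeled subset and `z_test` the codes of the remaining data, mirroring the column headings of Table 5.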
B.1.4 Latent Space Visualization
Figure 13 visualizes the latent spaces of VAEs and MAEs with different settings on MNIST, using t-Distributed Stochastic Neighbor Embedding (t-SNE) (Maaten & Hinton, 2008). The first row displays the visualizations of the ResNet VAE, VLAE without free-bits training, and VLAE with free-bits training. The following three rows display visualizations of MAEs with and . We see that a large encourages the posteriors to diverge from each other by splitting the data into small groups, while increasing is effective at preventing this phenomenon.
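Such a visualization can be reproduced with scikit-learn's t-SNE; this is a minimal sketch with an illustrative function name, where `z` would be the matrix of latent codes from the encoder.

```python
import numpy as np
from sklearn.manifold import TSNE

def visualize_latents(z, perplexity=30.0, seed=0):
    # Project latent codes z (n_samples x latent_dim) down to 2-D
    # coordinates suitable for a scatter plot colored by class label.
    return TSNE(n_components=2, perplexity=perplexity,
                random_state=seed).fit_transform(z)
```

Note that t-SNE requires the perplexity to be smaller than the number of samples, and that different random seeds can produce visually different embeddings of the same latent space.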
B.2 Experiments for CIFAR-10
Following Kingma et al. (2016) and Chen et al. (2017a), latent codes are represented by 16 feature maps of size 8×8. The prior distribution is a factorized Gaussian transformed by 8 autoregressive flows, each of which is implemented by 3-layer masked CNNs (Oord et al., 2016) with 128 feature maps. Between every other autoregressive flow, the ordering of stochastic units is reversed. PixelCNN++ (Salimans et al., 2017) with a 7×3 receptive field is used as the decoder. Due to the limitation of computational resources, we used a batch size of 64 in our experiments. As in the experiments on binary images, Polyak averaging (Polyak & Juditsky, 1992) was used to compute the final parameters, with .
Appendix C Generated Samples from VLAE and MAE
Figure 20 provides images generated by VLAE and MAE on MNIST, OMNIGLOT, and CIFAR-10.