1 Introduction
Deep latent factor models, such as variational autoencoders (VAEs) and adversarial autoencoders (AAEs), are becoming increasingly popular for various tasks, such as image generation (larsen2015autoencoding), unsupervised clustering (dilokthanakul2016deep; makhzani2015adversarial), and cross-domain translation (wu2016learning). These models involve specifying a prior distribution over latent variables and defining a deep generative network (i.e. the decoder) that maps latent variables to data space in a stochastic or deterministic fashion. Training such deep models usually requires learning a recognition network (i.e. the encoder) regularized by the prior.
Traditionally, a simple prior, such as the standard normal distribution (kingma2013auto), is used for tractability or simplicity, or simply because it is unclear what prior to use. The hope is that this simple prior will be transformed somewhere in the deep generative network into a form suitable for characterizing the data distribution. While this might hold true when the generative network has enough capacity, applying the standard normal prior often results in over-regularized models with only a few active latent dimensions (burda2015importance). Some recent works (hoffman2016elbo; goyal2017nonparametric; tomczak2017vae) suggest that the choice of the prior may have a profound impact on the expressiveness of the model. As an example, in learning a VAE with a simple encoder and decoder, Hoffman and Johnson (hoffman2016elbo) conjecture that multimodal priors can achieve a higher variational lower bound on the data log-likelihood than is possible with the standard normal prior. Tomczak and Welling (tomczak2017vae) confirm this conjecture by showing that their multimodal prior, a mixture of the variational posteriors, consistently outperforms simple priors on a number of datasets in terms of maximizing the data log-likelihood. Taking one step further, Goyal et al. (goyal2017nonparametric) learn a tree-structured nonparametric Bayesian prior for capturing the hierarchy of semantics present in the data. All these priors are learned under the VAE framework following the principle of maximum likelihood.
Along a similar line of thinking, we propose in this paper the notion of code generators for learning a prior from data for AAE. The objective is to learn a code generator network to transform a simple prior into one that, together with the generative network, can better characterize the data distribution. To this end, we generalize the framework of AAE in several significant ways:

We replace the simple prior with a learned prior by training the code generator to output latent variables that will minimize an adversarial loss in data space.

We employ a learned similarity metric (larsen2015autoencoding) in place of the default squared error in data space for training the autoencoder.

We maximize the mutual information between part of the code generator input and the decoder output for supervised and unsupervised training using a variational technique introduced in InfoGAN (chen2016infogan).
Extensive experiments confirm its effectiveness in generating better-quality images and learning better-disentangled representations than AAE in both supervised and unsupervised settings, particularly on complicated datasets. In addition, to the best of our knowledge, this is one of the first few works to introduce a learned prior for AAE.
The remainder of this paper is organized as follows: Section 2 reviews the background and related works. Section 3 presents the implementation details and the training procedure of the proposed code generator. Section 4 presents extensive experiments to show the superiority of our models over prior works. Section 5 showcases an application of our model to text-to-image synthesis. Lastly, we conclude this paper with remarks on future work.
2 Related Work
A latent factor model is a probabilistic model for describing the relationship between a set of latent variables $z$ and visible variables $x$. The model is usually specified by a prior distribution $p(z)$ over the latent variables and a conditional distribution $p_\theta(x|z)$ of the visible variables given the latent variables. The model parameters $\theta$ are often learned by maximizing the marginal log-likelihood of the data, $\log p_\theta(x)$.
Variational Autoencoders (VAEs).
To improve the model's expressiveness, it is common to make the conventional latent factor model deep by introducing a neural network to $p_\theta(x|z)$. One celebrated example is VAE (kingma2013auto), which assumes the following prior $p(z)$ and likelihood $p_\theta(x|z)$:

(1) $p(z) = \mathcal{N}(z; 0, I), \quad p_\theta(x|z) = \mathcal{N}(x; o_\theta(z), \sigma^2 I)$

where the mean $o_\theta(z)$ is modeled by the output of a neural network with parameters $\theta$. In this case, the marginal $p_\theta(x)$ becomes intractable; the model is thus trained by maximizing the log evidence lower bound (ELBO):

(2) $\mathcal{L}(\phi, \theta; x) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x) \,\|\, p(z))$

where $q_\phi(z|x)$ is the variational density, implemented by another neural network with parameters $\phi$, to approximate the posterior $p_\theta(z|x)$. When regarding $q_\phi(z|x)$ as a (stochastic) encoder and $p_\theta(x|z)$ as a (stochastic) decoder, Equation (2) bears an interpretation of training an autoencoder with the latent code $z$ regularized by the prior $p(z)$ through the KL-divergence.
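As a concrete illustration (ours, not part of the original formulation): when the variational posterior is a diagonal Gaussian and the prior is standard normal, the KL term in the ELBO has a closed form. A minimal NumPy sketch:

```python
import numpy as np

def kl_to_std_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dims."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)

# A posterior matching the prior incurs zero KL penalty;
# any mismatch in mean or variance is penalized.
print(kl_to_std_normal(np.zeros(8), np.zeros(8)))   # 0.0
print(kl_to_std_normal(np.ones(8), np.zeros(8)))    # 4.0
```

The full ELBO adds the expected reconstruction log-likelihood, estimated with the reparameterization trick.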
Adversarial Autoencoders (AAEs). Motivated by the observation that VAE is largely limited by the Gaussian prior assumption, i.e. $p(z) = \mathcal{N}(z; 0, I)$, Makhzani et al. (makhzani2015adversarial) relax this constraint by allowing $p(z)$ to be any distribution. Apparently, the KL-divergence becomes intractable when $p(z)$ is arbitrary. They thus replace the KL-divergence with an adversarial loss imposed on the encoder output, requiring that the latent code $z$ produced by the encoder have an aggregated posterior distribution the same as the prior $p(z)$. (The aggregated posterior is defined as $q(z) = \int q_\phi(z|x) \, p_d(x) \, dx$, where $p_d(x)$ denotes the empirical distribution of the training data.)
Nonparametric Variational Autoencoders (Nonparametric VAEs). While AAE allows the prior to be arbitrary, how to select a prior that can best characterize the data distribution remains an open issue. Goyal et al. (goyal2017nonparametric) make an attempt to learn a nonparametric prior based on the nested Chinese restaurant process for VAEs. Learning is achieved by fitting it to the aggregated posterior distribution, which amounts to maximization of the ELBO. The result induces a hierarchical structure of semantic concepts in latent space.
Variational Mixture of Posteriors (VampPrior). The VampPrior (tomczak2017vae) is a new type of prior for the VAE. It consists of a mixture of the variational posteriors conditioned on a set of learned pseudo-inputs $\{u_k\}_{k=1}^{K}$. In symbols, this prior is given by

(3) $p(z) = \frac{1}{K} \sum_{k=1}^{K} q_\phi(z \mid u_k)$

Its multimodal nature and coupling with the posterior give it an advantage over many other simple priors in terms of training complexity and expressiveness.
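To make Eq. (3) concrete, the sketch below (our illustration, not the authors' code) evaluates the log-density of such a uniform mixture of diagonal-Gaussian components, using a numerically stable log-mean-exp:

```python
import numpy as np

def vamp_log_prior(z, mus, logvars):
    """log p(z) for a uniform mixture of K diagonal Gaussians q(z | u_k).

    mus, logvars: arrays of shape (K, D) holding each component's mean and
    log-variance (as the encoder would produce on the K pseudo-inputs).
    """
    log_comp = -0.5 * np.sum(
        logvars + np.log(2 * np.pi) + (z - mus) ** 2 / np.exp(logvars),
        axis=1)                                   # log q(z | u_k), shape (K,)
    m = log_comp.max()
    return m + np.log(np.mean(np.exp(log_comp - m)))  # stable log-mean-exp
```

With a single component centered at the origin, this reduces to the standard normal log-density; adding far-away components lowers the density at the origin by roughly $\log K$, reflecting the multimodal mass-splitting.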
Inspired by these learned priors (goyal2017nonparametric; tomczak2017vae) for VAE, we propose in this paper the notion of code generators to learn a proper prior from data for AAE. The relationship of our work to these prior arts is illustrated in Fig. 1.
3 Method
In this paper, we propose to learn the prior from data instead of specifying it arbitrarily. Based on the framework of AAE, we introduce a neural network (which we call the code generator) to transform the manually-specified prior into a better form. Fig. 2 presents its role in the overall architecture and contrasts the architecture with that of AAE.
3.1 Learning the Prior
Because the code generator itself has to be learned, we need an objective function to shape the distribution at its output. Ideally, we wish to find a prior that, together with the decoder (see Fig. 2 (b)), maximizes the data likelihood. We are however faced with two challenges. First, the output of the code generator could follow any distribution, which may make the likelihood function and its variational lower bound intractable. Second, the decoder has to be learned simultaneously, which creates a moving target for the code generator.
To address the first challenge, we propose to impose an adversarial loss on the output of the decoder when training the code generator. That is, we want the code generator to produce a prior distribution that minimizes the adversarial loss at the decoder output. Consider the example in Fig. 3 (a). The decoder should generate images with a distribution that in principle matches the empirical distribution of real images in the training data, when driven by the output of the code generator. In symbols, this is to solve

(4) $\min_{CG} \max_{D_I} \; \mathbb{E}_{x \sim p_d(x)}[\log D_I(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D_I(dec(z_c)))]$

where $z_c = CG(z)$ is the output of the code generator $CG$ driven by a noise sample $z \sim p(z)$, $D_I$ is the discriminator in image space, and $dec(z_c)$ is the output of the decoder driven by $z_c$.
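In practice, Eq. (4) is optimized by alternating gradient steps on the discriminator and on the code generator/decoder path. A minimal sketch of the two scalar losses (our names; the generator side uses the customary non-saturating variant):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def discriminator_loss(logits_real, logits_fake):
    """Negative of the bracketed objective in Eq. (4): D_I maximizes it."""
    return -(np.mean(np.log(sigmoid(logits_real) + 1e-8))
             + np.mean(np.log(1.0 - sigmoid(logits_fake) + 1e-8)))

def generator_loss(logits_fake):
    """Non-saturating loss for the code-generator (and decoder) path."""
    return -np.mean(np.log(sigmoid(logits_fake) + 1e-8))
```

The same functional form, with the latent-space discriminator $D_C$ in place of $D_I$, implements the encoder regularization of Eq. (5).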
To address the second challenge, we propose to alternate the training of the code generator and the decoder/encoder until convergence. In one phase, termed the prior improvement phase (see Fig. 3 (a)), we update the code generator with the loss function in Eq. (4), by fixing the encoder. (Supposedly, the decoder also needs to be fixed in this phase; it is however found beneficial in terms of convergence to update the decoder as well.) In the other phase, termed the AAE phase (see Fig. 3 (b)), we fix the code generator and update the autoencoder following the training procedure of AAE. Specifically, the encoder output has to be regularized by minimizing the following adversarial loss:

(5) $\min_{enc} \max_{D_C} \; \mathbb{E}_{z \sim p(z)}[\log D_C(z_c)] + \mathbb{E}_{x \sim p_d(x)}[\log(1 - D_C(enc(x)))]$

where $z_c = CG(z)$ is the output of the code generator, $enc(x)$ is the encoder output given the input $x$, and $D_C$ is the discriminator in latent code space.
Because the decoder will be updated in both phases, the convergence of the decoder relies on consistent training objectives across the two phases. It is however noticed that the widely used pixel-wise squared error criterion in the AAE phase tends to produce blurry decoded images. This obviously conflicts with the adversarial objective in the prior improvement phase, which requires the decoder to produce sharp images. Inspired by the notion of learned similarity metrics (larsen2015autoencoding) and perceptual loss (johnson2016perceptual), we change the criterion of minimizing squared error from the pixel domain to the feature domain. Specifically, in the AAE phase, we require that a reconstructed image $dec(enc(x))$ minimize the squared error in feature domain with respect to its original input $x$. This loss is referred to as perceptual loss and is defined by

(6) $\mathcal{L}_{perceptual} = \| \mathcal{F}(dec(enc(x))) - \mathcal{F}(x) \|^2$

where $\mathcal{F}(\cdot)$ denotes the feature representation (usually the output of the last convolutional layer in the image discriminator $D_I$) of an image. With this, the decoder is driven consistently in both phases towards producing decoded images that closely resemble real images.
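Eq. (6) can be computed with any fixed feature extractor; the sketch below uses a random one-layer ReLU projection purely as a stand-in for the discriminator's last convolutional layer (the stand-in is our assumption, not the paper's architecture):

```python
import numpy as np

def perceptual_loss(feat, x, x_rec):
    """Squared error in feature space, as in Eq. (6): ||F(x_rec) - F(x)||^2."""
    return float(np.sum((feat(x_rec) - feat(x)) ** 2))

# Hypothetical stand-in feature map: one random linear layer plus ReLU.
rng = np.random.default_rng(0)
W = rng.standard_normal((16, 64))
feat = lambda v: np.maximum(W @ v, 0.0)

x = rng.standard_normal(64)
print(perceptual_loss(feat, x, x))   # 0.0 for a perfect reconstruction
```

A blurry reconstruction that matches $x$ on average but differs in high-level features is penalized here even when its pixel-wise error is small, which is the point of moving the criterion to feature space.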
3.2 Learning Conditional Priors
3.2.1 Supervised Setting
The architecture in Fig. 3 can be extended to learn conditional priors in a supervised manner. Such priors find applications in conditional data generation, e.g. conditional image generation in which the decoder generates images according to their class labels $y$. To this end, we make three major changes to the initial architecture:

Firstly, the code generator now takes as inputs a data label $y$ and a noise variable $z$ accounting for the intra-class variety, and produces a prior distribution conditioned on the label (see Fig. 4).

Secondly, the end-to-end mutual information between the label $y$ and the decoded image $dec(z_c)$ is maximized as part of our training objective, so that both the code generator and the decoder pick up the information carried by the label variable when generating the latent code $z_c$ and subsequently the decoded image. This is achieved by maximizing the variational lower bound of the mutual information (chen2016infogan), given by

(7) $I(y; dec(z_c)) \geq \mathbb{E}_{(y,z) \sim p(y)p(z)}[\log Q(y \mid dec(CG(y,z)))] + H(y)$

where $p(y)p(z)$ is the joint distribution of the label $y$ and the noise $z$, $CG$ is the code generator, and $Q$ is a classifier used to recover the label $y$ of the decoded image $dec(z_c)$.
Lastly, the discriminator in latent code space is additionally provided with the label $y$ as input, to implement class-dependent regularization at the encoder output during the AAE learning phase. That is,

(8) $\min_{enc} \max_{D_C} \; \mathbb{E}_{(y,z) \sim p(y)p(z)}[\log D_C(CG(y,z), y)] + \mathbb{E}_{x \sim p_d(x)}[\log(1 - D_C(enc(x), y))]$

where, in the second term, $y$ is the label associated with the input image $x$.
The fact that the label of an input image needs to be properly fed to different parts of the network during training indicates the supervised learning nature of the aforementioned procedure.
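In implementation, maximizing the variational bound in Eq. (7) amounts to minimizing the cross-entropy of the auxiliary classifier $Q$ on the label fed to the code generator, since $H(y)$ is a constant. A minimal sketch (our notation):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def info_loss(q_logits, y):
    """Cross-entropy of Q's prediction against the sampled label y.

    Minimizing this maximizes the mutual-information bound up to H(y)."""
    return -np.log(softmax(q_logits)[y] + 1e-8)

confident = np.array([8.0, 0.0, 0.0])   # Q confidently recovers label 0
uniform = np.zeros(3)                    # Q is uninformative
print(info_loss(confident, 0) < info_loss(uniform, 0))   # True
```

This term is added to the losses of the code generator, the decoder, and $Q$ itself, encouraging the label to remain decodable from the generated image.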
3.2.2 Unsupervised Setting
Taking one step further, we present in Fig. 5 a repurposed architecture to learn conditional priors in an unsupervised setting. Unlike the supervised setting, where the correspondence between the label and the image is explicit during training, the unsupervised setting learns this correspondence implicitly. Two slight changes are thus made to the architecture in Fig. 4: (1) the label $y$ at the input of the code generator is replaced with a label drawn randomly from a categorical distribution; and (2) the discriminator in the latent code space is made class-agnostic by removing the label input. The former is meant to produce a multimodal distribution in the latent space, while the latter is to align such a distribution with that at the encoder output. Remarkably, which mode (or class) of distribution an image is assigned to in latent code space is learned implicitly. In a sense, we hope the code generator can learn to discover the intriguing latent code structure inherent at the encoder output. It is worth pointing out that in the absence of any regularization or guidance, there is no guarantee that this learned assignment would be in line with the semantics attached artificially to each data sample.
Algorithm 1 details the training procedure.
4 Experiments
We first show the superiority of our learned priors over manuallyspecified priors, followed by an ablation study of individual components. In the end, we compare the performance of our model with AAE in image generation tasks. Unless stated otherwise, all the models adopt the same autoencoder for a fair comparison.
4.1 Comparison with Prior Works
| Method | Inception Score |
| --- | --- |
| AAE (makhzani2015adversarial) w/ a Gaussian prior | 2.15 |
| VAE (kingma2013auto) w/ a Gaussian prior | 3.00 |
| VampPrior (tomczak2017vae) | 2.88 |
| Our method w/ a learned prior | 6.52 |
Latent factor models with their priors learned from data rather than specified manually should better characterize the data distribution. To validate this, we compare the performance of our model with several prior arts, including AAE (makhzani2015adversarial), VAE (kingma2013auto), and VampPrior (tomczak2017vae), in terms of Inception Score (IS). Of these works, AAE chooses a Gaussian prior and regularizes the latent code distribution with an adversarial loss (goodfellow2014generative). VAE (kingma2013auto) likewise adopts a Gaussian prior yet uses the KL-divergence for regularization. VampPrior (tomczak2017vae) learns a Gaussian mixture prior for VAE. For the results of VampPrior, we run the released software of (tomczak2017vae) but replace their autoencoder with ours for a fair comparison.

Table 1 compares the Inception Score of these models for image generation on CIFAR-10 with a latent code size of 64. As expected, both AAE (makhzani2015adversarial) and VAE (kingma2013auto), which adopt manually-specified priors, have lower IS of 2.15 and 3.00, respectively. Somewhat surprisingly, VampPrior (tomczak2017vae), although using a learned prior, does not have an advantage over VAE (kingma2013auto) with a simple Gaussian prior in the present case. This may be attributed to the fact that its prior is limited to a Gaussian mixture distribution. Relaxing this constraint by modeling the prior with a neural network, our model achieves the highest IS of 6.52.
Fig. 6 further visualizes sample images generated with these models by driving the decoder with latent codes drawn from the prior or the code generator in our case. It is observed that our model produces much sharper images than the others. This confirms that a learned and flexible prior is beneficial to the characterization and generation of data.
To get a sense of how our model performs compared to other state-of-the-art generative models, Table 2 compares their Inception Scores on CIFAR-10. Caution must be exercised in interpreting these numbers, as these models adopt different decoders (or generative networks). With the current implementation, our model achieves a score comparable to other generative models. A few sample images from these models are provided in Fig. 7 for subjective evaluation.
| Method | Inception Score |
| --- | --- |
| BEGAN (berthelot2017began) | 5.62 |
| DCGAN (radford2015unsupervised) | 6.16 |
| LSGAN (mao2017least) | 5.98 |
| WGAN-GP (gulrajani2017improved) | 7.86 |
| Our method w/ a learned prior | 6.52 |
4.2 Ablation Study
| Method | A | B | C | IS |
| --- | --- | --- | --- | --- |
| AAE (makhzani2015adversarial) w/ a Gaussian prior and MSE loss | | | | 2.15 |
| AAE w/ a learned prior and MSE loss | ✓ | | | 3.04 |
| AAE w/ a learned prior and perceptual loss | ✓ | ✓ | | 3.69 |
| Ours (full) | ✓ | ✓ | ✓ | 6.52 |
In this section, we conduct an ablation study to understand the effect of (A) the learned prior, (B) the perceptual loss, and (C) the updating of the decoder in both phases on the Inception Score (salimans2016improved). To this end, we train an AAE (makhzani2015adversarial) model with a 64-D Gaussian prior on CIFAR-10 as the baseline. We then enable each of these design choices incrementally. For a fair comparison, all the models are equipped with an identical autoencoder architecture yet trained with their respective objectives.

From Table 3, we see that the baseline has the lowest IS and that replacing the manually-specified prior with our learned prior increases the IS by about 0.9. Furthermore, minimizing the perceptual loss instead of the conventional mean squared error in training the autoencoder achieves an even higher IS of 3.69. This suggests that the perceptual loss does help reconcile the training objectives for the decoder in the AAE and prior improvement phases. Under such circumstances, additionally allowing the decoder to be updated in both phases yields the highest IS.
4.3 In-depth Comparison with AAE
Since our model is inspired by AAE (makhzani2015adversarial), this section provides an in-depth comparison with it in terms of image generation. In this experiment, the autoencoder in our model is trained by minimizing the perceptual loss (i.e. the mean squared error in feature domain), whereas, by convention, AAE (makhzani2015adversarial) is trained by minimizing the mean squared error in data domain.
Fig. 8 displays side-by-side images generated by these models when trained on the MNIST and CIFAR-10 datasets. They are produced by the decoder driven by samples from the respective priors. In this experiment, two observations are immediate. First, our model can generate sharper images than AAE (makhzani2015adversarial) on both datasets. Second, AAE (makhzani2015adversarial) experiences problems in reconstructing visually plausible images on the more complicated CIFAR-10. These highlight the advantages of optimizing the autoencoder with the perceptual loss and learning the code generator through an adversarial loss, which in general produce subjectively sharper images.
Another advantage of our model is its better adaptability to a higher-dimensional latent code space. Fig. 9 presents images generated by the two models when the dimension of the latent code is increased significantly from 8 to 100 on MNIST, and from 64 to 2000 on CIFAR-10. Compared to Fig. 8, the increase in code dimension has little impact on our model, but exerts a strong influence on AAE (makhzani2015adversarial). In the present case, AAE (makhzani2015adversarial) can hardly produce recognizable images, particularly on CIFAR-10, even after the reparameterization trick has been applied to the output of the encoder as suggested in (makhzani2015adversarial). This emphasizes the importance of having a prior that can adapt automatically to changes in the dimensionality of the code space and the data.
4.4 Disentangled Representations
Learning disentangled representations is desirable in many applications. It refers generally to learning a data representation whose individual dimensions can capture independent factors of variation in the data. To demonstrate the ability of our model to learn disentangled representations and the merits of data-driven priors, we repeat the disentanglement tasks in (makhzani2015adversarial) and compare our performance with AAE (makhzani2015adversarial).
4.4.1 Supervised Learning
This section presents experimental results of using the network architecture in Fig. 4 to learn, in a supervised manner, a code generator that outputs a conditional prior given the image label $y$ for characterizing the image distribution. In particular, the remaining uncertainty about the image's appearance given its label $y$ is modeled by transforming a Gaussian noise $z$ through the code generator. By having the noise be independent of the label, we arrive at a disentangled representation of images. At test time, the generation of an image for a particular class is achieved by inputting the class label $y$ and a Gaussian noise $z$ to the code generator and then passing the resulting code $z_c$ through the decoder.

To see the sole contribution from the learned prior, the training of the AAE baseline (makhzani2015adversarial) also adopts the perceptual loss and the mutual information maximization; that is, the only difference from our model is the direct use of the label and the Gaussian noise as the conditional prior.
Fig. 10 displays images generated by our model and AAE (makhzani2015adversarial). Both models adopt a 10-D one-hot vector to specify the label $y$ and a 54-D Gaussian to generate the noise $z$. To be fair, the output of our code generator has a dimension (i.e. 64) identical to the latent prior of AAE (makhzani2015adversarial). Each row of Fig. 10 corresponds to images generated by varying the label $y$ while fixing the noise $z$. Likewise, each column shows images that share the same label yet with varied noise. On MNIST and SVHN, both models work well in separating the label information from the remaining (style) information. This is evidenced by the observation that along each row, the main digit changes with the label $y$ regardless of the noise $z$, and that along each column, the style varies without changing the main digit. On CIFAR-10, the two models behave differently. While both produce visually plausible images, ours generates more semantically discernible images that match their labels.
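The test-time sampling path is simply label + noise → code generator → decoder. The sketch below mimics this pipeline with hypothetical linear stand-ins (the real networks are convolutional; only the dimensions follow the text: 10-D one-hot label, 54-D noise, 64-D code):

```python
import numpy as np

rng = np.random.default_rng(1)
W_cg = rng.standard_normal((64, 10 + 54)) * 0.1   # stand-in code generator
W_dec = rng.standard_normal((784, 64)) * 0.1      # stand-in decoder

def one_hot(label, num_classes=10):
    v = np.zeros(num_classes)
    v[label] = 1.0
    return v

def generate(label, noise):
    """label + noise -> conditional code z_c -> decoded 28x28 'image' (flattened)."""
    z_c = np.tanh(W_cg @ np.concatenate([one_hot(label), noise]))
    return W_dec @ z_c

img = generate(3, rng.standard_normal(54))
print(img.shape)   # (784,)
```

Varying `label` with `noise` fixed moves across classes, while varying `noise` with `label` fixed explores intra-class style, mirroring the rows and columns of Fig. 10.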
Fig. 11 visualizes the output of the code generator with t-distributed stochastic neighbor embedding (t-SNE). It is seen that the code generator learns a distinct conditional distribution for each class of images. It is believed that the degree of inter-class separation reflects how difficult it is for the decoder to generate images of different classes. Moreover, the elliptic shape of the intra-class distributions on CIFAR-10 may be ascribed to the higher intra-class variability.
4.4.2 Unsupervised Learning
This section presents experimental results of using the network architecture in Fig. 5 to learn, in an unsupervised manner, a code generator that outputs a prior best characterizing the image distribution. In the present case, the label $y$ is drawn randomly from a categorical distribution and independently from the Gaussian input $z$, as shown in Fig. 5. The categorical distribution encodes our prior belief about the data clusters, with the number of distinct values over which it is defined specifying the presumed number of clusters in the data. The Gaussian serves to explain the data variability within each cluster. In regularizing the distribution at the encoder output, we want the code generator to make sense of these two degrees of freedom for a disentangled representation of images.

At test time, image generation is done similarly to the supervised case. We start by sampling $y$ and $z$, feed them into the code generator, and then pass the resulting code to the decoder. In this experiment, the categorical distribution is defined over 10-D one-hot vectors, and the Gaussian is 90-D. As in the supervised setting, after the model is trained, we alter the label variable $y$ or the Gaussian noise $z$ one at a time to verify whether the model has learned to cluster images unsupervisedly. We expect a good model to generate images with certain common properties (e.g. similar backgrounds or digit types) when the Gaussian part is altered while the label part remains fixed.
The results in Fig. 12 show that on MNIST, both our model and AAE successfully learn to disentangle the digit type from the remaining information. Following the same presentation order as in the supervised setting, we see that each column of images (which correspond to the same label $y$) does show images of the same digit. On the more complicated SVHN and CIFAR-10 datasets, each column mixes images from different digits/classes. It is however worth noting that both models have a tendency to cluster images with similar backgrounds according to the label variable $y$. Recall that without any semantic guidance, there is no guarantee that the clustering would be in line with the semantics attached artificially to each data sample.
Fig. 13 further visualizes the latent code distributions at the output of the code generator and the encoder. Several observations can be made. First, the encoder is regularized well to produce an aggregated posterior distribution similar to that at the code generator output. Second, the code generator learns distinct conditional distributions according to the categorical label input $y$. Third, quite by accident, the encoder learns to cluster images of the same digit on MNIST, as has been confirmed in Fig. 12. As expected, such semantic clustering in code space is not obvious on the more complicated SVHN and CIFAR-10, as is evident from the somewhat random assignment of latent codes to images of the same class or label.
5 Application: TexttoImage Synthesis
This section presents an application of our model to text-to-image synthesis. We show that the code generator can transform the embedding of a sentence into a prior suitable for synthesizing images that closely match the sentence's semantics. To this end, we learn, in a supervised manner, the correspondence between images and their descriptive sentences using the architecture in Fig. 4. Given an image-sentence pair, the sentence's embedding (a 200-D vector) generated by a pre-trained recurrent neural network is input to the code generator and the discriminator in image space as if it were the label information, while the image representation is learned through the autoencoder and regularized by the output of the code generator. As before, a 100-D Gaussian is placed at the input of the code generator to explain the variability of images given the sentence.
Fig. 14 presents images generated by our model when trained on the 102 Category Flower dataset (Nilsback08). The generation process is much the same as that described in Section 4.4.1. It is seen that most images match the text descriptions reasonably well. In Fig. 15, we further explore how the generated images change with variations of the color attribute in the text description. Again, most images agree with the text descriptions to a large degree.
6 Conclusion
In this paper, we propose to learn a proper prior from data for AAE. Built on the foundation of AAE, we introduce a code generator to transform the manually selected simple prior into one that can better fit the data distribution. We develop a training process that allows the autoencoder and the code generator to be learned jointly. We demonstrate its superior performance over AAE in image generation and in learning disentangled representations under supervised and unsupervised settings. We also show its ability to do cross-domain translation. Mode collapse and training instability are two major issues to be further investigated in future work.