Learning Priors for Adversarial Autoencoders

by Hui-Po Wang, et al.
National Chiao Tung University

Most deep latent factor models choose simple priors for tractability, for simplicity, or for lack of a better choice. Recent studies show that the choice of the prior may have a profound effect on the expressiveness of the model, especially when its generative network has limited capacity. In this paper, we propose to learn a proper prior from data for adversarial autoencoders (AAEs). We introduce the notion of code generators to transform manually selected simple priors into ones that can better characterize the data distribution. Experimental results show that the proposed model can generate images of better quality and learn better disentangled representations than AAEs in both supervised and unsupervised settings. Lastly, we demonstrate its ability to do cross-domain translation in a text-to-image synthesis task.








1 Introduction

Deep latent factor models, such as variational autoencoders (VAEs) and adversarial autoencoders (AAEs), are becoming increasingly popular in various tasks, such as image generation (larsen2015autoencoding), unsupervised clustering (dilokthanakul2016deep; makhzani2015adversarial), and cross-domain translation (wu2016learning). These models involve specifying a prior distribution over latent variables and defining a deep generative network (i.e. the decoder) that maps latent variables to data space in stochastic or deterministic fashion. Training such deep models usually requires learning a recognition network (i.e. the encoder) regularized by the prior.

Traditionally, a simple prior, such as the standard normal distribution (kingma2013auto), is used for tractability, simplicity, or not knowing what prior to use. It is hoped that this simple prior will be transformed somewhere in the deep generative network into a form suitable for characterizing the data distribution. While this might hold true when the generative network has enough capacity, applying the standard normal prior often results in over-regularized models with only a few active latent dimensions (burda2015importance).

Some recent works (hoffman2016elbo; goyal2017nonparametric; tomczak2017vae) suggest that the choice of the prior may have a profound impact on the expressiveness of the model. As an example, in learning the VAE with a simple encoder and decoder, Hoffman and Johnson (hoffman2016elbo) conjecture that multimodal priors can achieve a higher variational lower bound on the data log-likelihood than is possible with the standard normal prior. Tomczak and Welling (tomczak2017vae) confirm this conjecture by showing that their multimodal prior, a mixture of the variational posteriors, consistently outperforms simple priors on a number of datasets in terms of maximizing the data log-likelihood. Taking one step further, Goyal et al. (goyal2017nonparametric) learn a tree-structured nonparametric Bayesian prior for capturing the hierarchy of semantics present in the data. All these priors are learned under the VAE framework following the principle of maximum likelihood.

Along a similar line of thinking, we propose in this paper the notion of code generators for learning a prior from data for AAE. The objective is to learn a code generator network to transform a simple prior into one that, together with the generative network, can better characterize the data distribution. To this end, we generalize the framework of AAE in several significant ways:

  • We replace the simple prior with a learned prior by training the code generator to output latent variables that will minimize an adversarial loss in data space.

  • We employ a learned similarity metric (larsen2015autoencoding) in place of the default squared error in data space for training the autoencoder.

  • We maximize the mutual information between part of the code generator input and the decoder output for supervised and unsupervised training using a variational technique introduced in InfoGAN (chen2016infogan).

Extensive experiments confirm the effectiveness of our model in generating better-quality images and learning better disentangled representations than AAE in both supervised and unsupervised settings, particularly on complicated datasets. In addition, to the best of our knowledge, this is one of the first few works that attempt to introduce a learned prior for AAE.

The remainder of this paper is organized as follows: Section 2 reviews the background and related works. Section 3 presents the implementation details and the training procedure of the proposed code generator. Section 4 presents extensive experiments to show the superiority of our models over prior works. Section 5 showcases an application of our model to text-to-image synthesis. Lastly, we conclude this paper with remarks on future work.

2 Related Work

Figure 1: The relations of our work with prior arts.

A latent factor model is a probabilistic model for describing the relationship between a set of latent variables z and visible variables x. The model is usually specified by a prior distribution p(z) over the latent variables and a conditional distribution p_θ(x|z) of the visible variables given the latent variables. The model parameters θ are often learned by maximizing the marginal log-likelihood log p_θ(x) of the data.

Variational Autoencoders (VAEs).

To improve the model's expressiveness, it is common to make the conventional latent factor model deep by introducing a neural network for p_θ(x|z). One celebrated example is the VAE (kingma2013auto), which assumes the following prior p(z) and likelihood p_θ(x|z):

p(z) = N(z; 0, I),   p_θ(x|z) = N(x; o_θ(z), σ²I),   (1)

where the mean o_θ(z) is modeled by the output of a neural network with parameters θ. In this case, the marginal p_θ(x) becomes intractable; the model is thus trained by maximizing the log evidence lower-bound (ELBO):

log p_θ(x) ≥ E_{q_φ(z|x)}[log p_θ(x|z)] − D_KL(q_φ(z|x) ‖ p(z)),   (2)

where q_φ(z|x) is the variational density, implemented by another neural network with parameters φ, to approximate the posterior p_θ(z|x). When regarding q_φ(z|x) as a (stochastic) encoder and p_θ(x|z) as a (stochastic) decoder, Equation (2) bears an interpretation of training an autoencoder with the latent code z regularized by the prior p(z) through the KL-divergence.
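To make the bound concrete, both terms of Eq. (2) have closed forms when the encoder and decoder densities are diagonal Gaussians. The sketch below is illustrative only (numpy, diagonal covariances assumed); it is not any particular library's implementation:

```python
import numpy as np

def gaussian_elbo(x, x_hat, mu, log_var, sigma_x=1.0):
    """ELBO terms for a Gaussian encoder q(z|x) = N(mu, diag(exp(log_var)))
    and a Gaussian decoder p(x|z) = N(x_hat, sigma_x^2 I), against the
    standard normal prior p(z) = N(0, I).

    Returns (reconstruction, kl); the ELBO is reconstruction - kl
    (up to additive constants independent of the parameters)."""
    # decoder log-likelihood term, up to constants
    reconstruction = -0.5 * np.sum((x - x_hat) ** 2) / sigma_x ** 2
    # closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
    return reconstruction, kl
```

Note that when q_φ(z|x) equals the prior exactly (mu = 0, log_var = 0), the KL term vanishes, recovering the over-regularization intuition discussed in the introduction.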

Adversarial Autoencoders (AAEs). Motivated by the observation that the VAE is largely limited by the Gaussian prior assumption, i.e. p(z) = N(z; 0, I), Makhzani et al. (makhzani2015adversarial) relax this constraint by allowing the prior p(z) to be any distribution. Apparently, the KL-divergence becomes intractable when p(z) is arbitrary. They thus replace the KL-divergence with an adversarial loss imposed on the encoder output, requiring that the latent code produced by the encoder have an aggregated posterior distribution (defined as q(z) = ∫ q_φ(z|x) p_d(x) dx, where p_d(x) denotes the empirical distribution of the training data) the same as the prior p(z).

Non-parametric Variational Autoencoders (Non-parametric VAEs). While AAE allows the prior to be arbitrary, how to select a prior that can best characterize the data distribution remains an open issue. Goyal et al. (goyal2017nonparametric) make an attempt to learn a non-parametric prior based on the nested Chinese restaurant process for VAEs. Learning is achieved by fitting the prior to the aggregated posterior distribution, which amounts to maximization of the ELBO. The result induces a hierarchical structure of semantic concepts in latent space.

Variational Mixture of Posteriors (VampPrior). The VampPrior (tomczak2017vae) is a new type of prior for the VAE. It consists of a mixture of the variational posteriors conditioned on a set of K learned pseudo-inputs {u_k}. In symbols, this prior is given by

p(z) = (1/K) Σ_{k=1}^{K} q_φ(z | u_k).   (3)
Its multimodal nature and coupling with the posterior achieve superiority over many other simple priors in terms of training complexity and expressiveness.
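For reference, the density of such a uniform mixture of Gaussian posteriors can be evaluated with a standard log-sum-exp. The sketch below represents each pseudo-input directly by the mean and log-variance the encoder would produce for it; this is an illustrative simplification, not the released VampPrior code:

```python
import numpy as np

def vampprior_log_density(z, pseudo_mu, pseudo_log_var):
    """log p(z) for a VampPrior-style mixture: the uniform average of K
    Gaussian variational posteriors q(z | u_k), each obtained by pushing a
    learned pseudo-input u_k through the encoder (here represented directly
    by the resulting means (K, D) and log-variances (K, D))."""
    K, D = pseudo_mu.shape
    var = np.exp(pseudo_log_var)
    # per-component Gaussian log-densities, summed over the D dimensions
    log_comp = -0.5 * np.sum(
        np.log(2 * np.pi * var) + (z - pseudo_mu) ** 2 / var, axis=1)  # (K,)
    # numerically stable log of the uniform mixture: log (1/K) sum_k exp(.)
    m = log_comp.max()
    return m + np.log(np.exp(log_comp - m).sum()) - np.log(K)
```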

Inspired by these learned priors (goyal2017nonparametric; tomczak2017vae) for VAE, we propose in this paper the notion of code generators to learn a proper prior from data for AAE. The relations of our work with these prior arts are illustrated in Fig. 1.

3 Method

Figure 2: The architecture of AAE without (a) and with (b) the code generator.

In this paper, we propose to learn the prior from data instead of specifying it arbitrarily. Based on the framework of AAE, we introduce a neural network (which we call the code generator) to transform the manually-specified prior into a better form. Fig. 2 presents its role in the overall architecture, and contrasts the architectural difference relative to AAE.

3.1 Learning the Prior

Because the code generator itself has to be learned, we need an objective function to shape the distribution at its output. Ideally, we wish to find a prior that, together with the decoder (see Fig. 2 (b)), maximizes the data likelihood. We are however faced with two challenges. First, the output of the code generator could follow any distribution, which may make the likelihood function and its variational lower bound intractable. Second, the decoder has to be learned simultaneously, which creates a moving target for the code generator.

To address the first challenge, we propose to impose an adversarial loss on the output of the decoder when training the code generator. That is, we want the code generator to produce a prior distribution that minimizes the adversarial loss at the decoder output. Consider the example in Fig. 3 (a). The decoder should generate images with a distribution that in principle matches the empirical distribution of real images in the training data, when driven by the output of the code generator. In symbols, this is to optimize

min_{CG, dec} max_{D_I}  E_{x∼p_d(x)}[log D_I(x)] + E_{z∼p(z)}[log(1 − D_I(dec(z_c)))],   (4)

where z_c = CG(z) is the output of the code generator CG driven by a noise sample z ∼ p(z), D_I is the discriminator in image space, and dec(z_c) is the output of the decoder driven by z_c.
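Given only the discriminator's probability outputs on a mini-batch, the two sides of this minimax game reduce to simple log-losses. The sketch below is an illustrative numpy computation of those losses, not the authors' implementation:

```python
import numpy as np

def prior_improvement_losses(d_real, d_fake, eps=1e-8):
    """Minimax losses for the prior improvement phase (illustrative sketch).
    `d_real` holds the image discriminator's probabilities D_I(x) on real
    images; `d_fake` holds D_I(dec(CG(z))) on decoded images driven by the
    code generator.

    The discriminator ascends E[log D_I(x)] + E[log(1 - D_I(dec(CG(z))))];
    the code generator (and decoder) descend the second expectation."""
    d_loss = -np.mean(np.log(d_real + eps)) - np.mean(np.log(1.0 - d_fake + eps))
    g_loss = np.mean(np.log(1.0 - d_fake + eps))
    return d_loss, g_loss
```

In practice a non-saturating variant (maximizing log D_I on fakes for the generator side) is a common stabilization; the minimax form above follows the equation as written.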

Figure 3: Alternation of training phases: (a) the prior improvement phase and (b) the AAE phase. The shaded building blocks indicate the blocks to be updated.

To address the second challenge, we propose to alternate the training of the code generator and the decoder/encoder until convergence. In one phase, termed the prior improvement phase (see Fig. 3 (a)), we update the code generator with the loss function in Eq. (4), by fixing the encoder. (Supposedly, the decoder needs to be fixed in this phase; it is however found beneficial in terms of convergence to update the decoder as well.) In the other phase, termed the AAE phase (see Fig. 3 (b)), we fix the code generator and update the autoencoder following the training procedure of AAE. Specifically, the encoder output has to be regularized by the following adversarial loss:

min_{enc} max_{D_C}  E_{z∼p(z)}[log D_C(z_c)] + E_{x∼p_d(x)}[log(1 − D_C(enc(x)))],   (5)

where z_c = CG(z) is the output of the code generator, enc(x) is the encoder output given the input x, and D_C is the discriminator in latent code space.

Because the decoder will be updated in both phases, the convergence of the decoder relies on consistent training objectives across the two phases. It is however noticed that the widely used pixel-wise squared error criterion in the AAE phase tends to produce blurry decoded images. This obviously conflicts with the adversarial objective in the prior improvement phase, which requires the decoder to produce sharp images. Inspired by the notion of learning similarity metrics (larsen2015autoencoding) and perceptual loss (johnson2016perceptual), we move the squared error criterion from the pixel domain to the feature domain. Specifically, in the AAE phase, we require that a reconstructed image x̂ minimize the squared error in the feature domain with respect to its original input x. This loss is referred to as the perceptual loss and is defined by

L_perc(x, x̂) = ‖F(x) − F(x̂)‖²,   (6)

where F(·) denotes the feature representation of an image (usually the output of the last convolutional layer in the image discriminator D_I). With this, the decoder is driven consistently in both phases towards producing decoded images that closely resemble real images.
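The loss itself is just a squared error after mapping both images through the feature extractor. A minimal sketch, with `feature_fn` standing in for F(·) (in the paper, a layer of the image discriminator; here any callable):

```python
import numpy as np

def perceptual_loss(x, x_hat, feature_fn):
    """Squared error in feature space (illustrative sketch of Eq. (6)).
    `feature_fn` plays the role of F(.), e.g. the last convolutional layer
    of the image discriminator D_I; here it may be any callable."""
    diff = feature_fn(x) - feature_fn(x_hat)
    return float(np.sum(diff ** 2))
```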

Figure 4: Supervised learning architecture with the code generator.
Figure 5: Unsupervised learning architecture with the code generator.

3.2 Learning Conditional Priors

3.2.1 Supervised Setting

The architecture in Fig. 3 can be extended to learn conditional priors supervisedly. Such priors find applications in conditional data generation, e.g. conditional image generation in which the decoder generates images according to their class labels y. To this end, we make three major changes to the initial architecture:

  • Firstly, the code generator now takes as inputs a data label y and a noise variable z accounting for the intra-class variety, and produces a prior distribution conditioned on the label y (see Fig. 4).

  • Secondly, the end-to-end mutual information I(y; x̂) between the label y and the decoded image x̂ is maximized as part of our training objective, to have both the code generator and the decoder pick up the information carried by the label variable when generating the latent code z_c and subsequently the decoded image x̂. This is achieved by maximizing its variational lower bound (chen2016infogan), given by

I(y; x̂) ≥ E_{(y,z)∼p(y,z)}[log Q(y | x̂)] + H(y),   (7)

    where p(y, z) is the joint distribution of the label y and the noise z, x̂ = dec(CG(y, z)) is the decoded image, and Q is a classifier used to recover the label y of the decoded image x̂.

  • Lastly, the discriminator D_C in latent code space is additionally provided with the label y as input, to implement class-dependent regularization at the encoder output during the AAE learning phase. That is,

min_{enc} max_{D_C}  E_{(y,z)∼p(y,z)}[log D_C(CG(y, z), y)] + E_{x∼p_d(x)}[log(1 − D_C(enc(x), y_x))],   (8)

    where y_x is the label associated with the input image x.

The fact that the label of an input image needs to be properly fed to different parts of the network during training indicates the supervised learning nature of the aforementioned procedure.
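Dropping the label entropy H(y), which is a constant when y is drawn from a fixed categorical, the mutual information term reduces to the log-probability the auxiliary classifier Q assigns to the label that was fed to the code generator, i.e. a negated cross-entropy. An illustrative numpy sketch (not the authors' implementation):

```python
import numpy as np

def mutual_info_lower_bound(q_probs, labels, eps=1e-8):
    """Variational lower bound on I(y; x_hat), up to the constant H(y)
    (InfoGAN-style; illustrative sketch).  `q_probs[i]` is the class
    probability vector the auxiliary classifier Q assigns to the i-th
    decoded image; `labels[i]` is the integer label y that was fed to the
    code generator.  Maximizing this is minimizing a cross-entropy."""
    rows = np.arange(len(labels))
    return float(np.mean(np.log(q_probs[rows, labels] + eps)))
```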

3.2.2 Unsupervised Setting

Taking one step further, we present in Fig. 5 a re-purposed architecture to learn conditional priors under an unsupervised setting. Unlike the supervised setting, where the correspondence between the label and the image is explicit during training, the unsupervised setting learns this correspondence implicitly. Two slight changes are thus made to the architecture in Fig. 4: (1) the label at the input of the code generator is replaced with a label y drawn randomly from a categorical distribution; and (2) the discriminator in latent code space is made class-agnostic by removing the label input. The former is meant to produce a multimodal distribution in the latent space, while the latter aligns such a distribution with that at the encoder output. Remarkably, which mode (or class) of the distribution an image is assigned to in latent code space is learned implicitly. In a sense, we hope the code generator can learn to discover the intriguing latent code structure inherent at the encoder output. It is worth pointing out that in the absence of any regularization or guidance, there is no guarantee that this learned assignment will be in line with the semantics attached artificially to each data sample.

Algorithm 1 details the training procedure.

  Initialize the encoder enc, the decoder dec, the code generator CG, the latent-code discriminator D_C, the image discriminator D_I, and the classifier Q
  Repeat (for each epoch)
   Repeat (for each mini-batch x)
    // AAE phase
    Compute the perceptual loss in Eq. (6) and the latent adversarial loss in Eq. (5), conditioning D_C on the label y if it exists
    // Update network parameters of enc, dec, and D_C
    // Prior improvement phase
    Compute the image adversarial loss in Eq. (4)
    If label y exists then
     Compute the mutual information lower bound in Eq. (7)
    End If
    // Update network parameters of CG, dec, and D_I (and Q if label y exists)
   Until all mini-batches are processed
  Until termination
Algorithm 1 Training procedure.
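The alternating schedule can be summarized as a loop over mini-batches with one step per phase. The skeleton below is a structural sketch only: `aae_step` stands for one optimizer step on the reconstruction plus latent adversarial losses (updating encoder, decoder, and D_C), and `prior_step` for one step on the image adversarial loss plus, when labels exist, the mutual-information term (updating the code generator, decoder, and D_I). Both callables are hypothetical placeholders:

```python
def train_epoch(batches, aae_step, prior_step, labeled=False):
    """One epoch of the alternating procedure (structural sketch, not the
    authors' code).  `batches` yields x, or (x, y) pairs when `labeled`;
    each mini-batch receives one AAE-phase update followed by one
    prior-improvement-phase update."""
    for batch in batches:
        x, y = batch if labeled else (batch, None)
        aae_step(x, y)      # AAE phase: reconstruction + latent regularization
        prior_step(x, y)    # prior improvement phase: adversarial loss in image space
```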

4 Experiments

We first show the superiority of our learned priors over manually-specified priors, followed by an ablation study of individual components. In the end, we compare the performance of our model with AAE in image generation tasks. Unless stated otherwise, all the models adopt the same autoencoder for a fair comparison.

4.1 Comparison with Prior Works

Method Inception Score
AAE makhzani2015adversarial w/ a Gaussian prior 2.15
VAE kingma2013auto w/ a Gaussian prior 3.00
VampPrior tomczak2017vae 2.88
Our method w/ a learned prior 6.52
Table 1: Comparison with AAE, VAE, and VampPrior on CIFAR-10
(a) AAE makhzani2015adversarial
(b) VAE kingma2013auto
(c) VampPrior tomczak2017vae
(d) Proposed model
Figure 6: Sample images produced by (a) AAE, (b) VAE, (c) VampPrior, and (d) the proposed model.

Latent factor models with their priors learned from data rather than specified manually should better characterize the data distribution. To validate this, we compare the performance of our model with several prior arts, including AAE (makhzani2015adversarial), VAE (kingma2013auto), and VampPrior (tomczak2017vae), in terms of the Inception Score (IS). Of these works, AAE chooses a Gaussian prior and regularizes the latent code distribution with an adversarial loss (goodfellow2014generative). VAE (kingma2013auto) likewise adopts a Gaussian prior yet uses the KL-divergence for regularization. VampPrior (tomczak2017vae) learns a Gaussian mixture prior for the VAE. For the results of VampPrior, we run its released software but replace its autoencoder with ours for a fair comparison.

Table 1 compares the Inception Scores for image generation on CIFAR-10 with a latent code size of 64. As expected, both AAE (makhzani2015adversarial) and VAE (kingma2013auto), which adopt manually-specified priors, have lower IS of 2.15 and 3.00, respectively. Somewhat surprisingly, VampPrior (tomczak2017vae), although using a learned prior, does not have an advantage over VAE with a simple Gaussian prior in the present case. This may be attributed to the fact that its prior is limited to a Gaussian mixture distribution. Relaxing this constraint by modeling the prior with a neural network, our model achieves the highest IS of 6.52.

Fig. 6 further visualizes sample images generated with these models by driving the decoder with latent codes drawn from the prior or the code generator in our case. It is observed that our model produces much sharper images than the others. This confirms that a learned and flexible prior is beneficial to the characterization and generation of data.

To get a sense of how our model performs as compared to other state-of-the-art generative models, Table 2 compares their Inception Scores on CIFAR-10. Caution must be exercised in interpreting these numbers, as these models adopt different decoders (or generative networks). With the current implementation, our model achieves a score comparable to other generative models. A few sample images from these models are provided in Fig. 7 for subjective evaluation.

Method Inception Score
BEGAN berthelot2017began 5.62
DCGAN radford2015unsupervised 6.16
LSGAN mao2017least 5.98
WGAN-GP gulrajani2017improved 7.86
Our method w/ a learned prior 6.52
Table 2: Comparison with other state-of-the-art generative models on CIFAR-10
(a) BEGAN berthelot2017began
(b) DCGAN radford2015unsupervised
(c) LSGAN mao2017least
(d) WGAN-GP gulrajani2017improved
(e) Proposed model
Figure 7: Subjective quality evaluation of generated images produced by state-of-the-art generative models.

4.2 Ablation Study

Method A B C IS
AAE makhzani2015adversarial w/ a Gaussian prior and MSE loss - - - 2.15
AAE w/ a learned prior and MSE loss ✓ - - 3.04
AAE w/ a learned prior and perceptual loss ✓ ✓ - 3.69
Ours (full) ✓ ✓ ✓ 6.52
Table 3: Inception score of generated images with the models trained on CIFAR-10: A, B, and C denote respectively the design choices of enabling the learned prior, the perceptual loss, and the updating of the decoder in both phases.

In this section, we conduct an ablation study to understand the effect on the Inception Score (salimans2016improved) of (A) the learned prior, (B) the perceptual loss, and (C) the updating of the decoder in both phases. To this end, we train an AAE (makhzani2015adversarial) model with a 64-D Gaussian prior on CIFAR-10 as the baseline. We then enable each of these design choices incrementally. For a fair comparison, all the models are equipped with an identical autoencoder architecture yet trained with their respective objectives.

From Table 3, we see that the baseline has the lowest IS and that replacing the manually-specified prior with our learned prior increases the IS by about 0.9. Furthermore, minimizing the perceptual loss instead of the conventional mean squared error in training the autoencoder achieves an even higher IS of 3.69. This suggests that the perceptual loss does help align the training objectives for the decoder in the AAE and the prior improvement phases. Under such circumstances, additionally allowing the decoder to be updated in both phases yields the highest IS.

(a) Our model + 8-D latent code
(b) AAE makhzani2015adversarial + 8-D latent code
(c) Our model + 64-D latent code
(d) AAE makhzani2015adversarial + 64-D latent code
Figure 8: Images generated by our model and AAE trained on MNIST (upper) and CIFAR-10 (lower).
(a) Our model + 100-D latent code
(b) AAE makhzani2015adversarial + 100-D latent code
(c) Our model + 2000-D latent code
(d) AAE makhzani2015adversarial + 2000-D latent code
Figure 9: Images generated by our model and AAE trained on MNIST (upper) and CIFAR-10 (lower). In this experiment, the latent code dimension is increased significantly to 100-D and 2000-D for MNIST and CIFAR-10, respectively. For AAE, the re-parameterization trick is applied to the output of the encoder as suggested in (makhzani2015adversarial).

4.3 In-depth Comparison with AAE

Since our model is inspired by AAE (makhzani2015adversarial), this section provides an in-depth comparison with it in terms of image generation. In this experiment, the autoencoder in our model is trained based on minimizing the perceptual loss (i.e. the mean squared error in feature domain), whereas by convention, AAE (makhzani2015adversarial) is trained by minimizing the mean squared error in data domain.

Fig. 8 displays side-by-side images generated from these models when trained on MNIST and CIFAR-10 datasets. They are produced by the decoder driven by samples from their respective priors. In this experiment, two observations are immediate. First, our model can generate sharper images than AAE (makhzani2015adversarial) on both datasets. Second, AAE (makhzani2015adversarial) experiences problems in reconstructing visually-plausible images on the more complicated CIFAR-10. These highlight the advantages of optimizing the autoencoder with the perceptual loss and learning the code generator through an adversarial loss, which in general produces subjectively sharper images.

Another advantage of our model is its better adaptability to a higher-dimensional latent code space. Fig. 9 presents images generated by the two models when the dimension of the latent code is increased significantly from 8 to 100 on MNIST, and from 64 to 2000 on CIFAR-10. As compared to Fig. 8, it is seen that the increase in code dimension has little impact on our model, but exerts a strong influence on AAE (makhzani2015adversarial). In the present case, AAE can hardly produce recognizable images, particularly on CIFAR-10, even after the re-parameterization trick has been applied to the output of the encoder as suggested in (makhzani2015adversarial). This emphasizes the importance of having a prior that can adapt automatically to a change in the dimensionality of the code space and data.

4.4 Disentangled Representations

Learning disentangled representations is desirable in many applications. It refers generally to learning a data representation whose individual dimensions can capture independent factors of variation in the data. To demonstrate the ability of our model to learn disentangled representations and the merits of data-driven priors, we repeat the disentanglement tasks in (makhzani2015adversarial), and compare its performance with AAE (makhzani2015adversarial).

4.4.1 Supervised Learning

This section presents experimental results of using the network architecture in Fig. 4 to learn supervisedly a code generator that outputs a conditional prior given the image label for characterizing the image distribution. In particular, the remaining uncertainty about the image's appearance given its label y is modeled by transforming a Gaussian noise z through the code generator. By having the noise z be independent of the label y, we arrive at a disentangled representation of images. At test time, the generation of an image of a particular class is achieved by inputting the class label y and a Gaussian noise z to the code generator and then passing the resulting latent code z_c through the decoder.

To isolate the contribution of the learned prior, the training of the AAE baseline (makhzani2015adversarial) also adopts the perceptual loss and the mutual information maximization; that is, the only difference from our model is the direct use of the label y and the Gaussian noise z as the conditional prior.

Fig. 10 displays images generated by our model and AAE (makhzani2015adversarial). Both models adopt a 10-D one-hot vector to specify the label y and a 54-D Gaussian to generate the noise z. To be fair, the output of our code generator has a dimension identical (i.e. 64) to that of the latent prior of AAE (makhzani2015adversarial). Each row of Fig. 10 corresponds to images generated by varying the label y while fixing the noise z. Likewise, each column shows images that share the same label y yet with varied noise z.

On MNIST and SVHN, both models work well in separating the label information from the remaining (style) information. This is evidenced by the observation that along each row, the main digit changes with the label y regardless of the noise z, and that along each column, the style varies without changing the main digit. On CIFAR-10, the two models behave differently. While both produce visually plausible images, ours generates more semantically discernible images that match their labels.

(a) Our model
(b) AAE makhzani2015adversarial
(c) Our model
(d) AAE makhzani2015adversarial
(e) Our model
(f) AAE makhzani2015adversarial
Figure 10: Images generated by the proposed model (a)(c)(e) and AAE (b)(d)(f) trained on the MNIST, SVHN and CIFAR-10 datasets in the supervised setting. Each column of images shares the same label/class information but varied Gaussian noise; each row shares the same Gaussian noise but varied label/class variables.

Fig. 11 visualizes the output of the code generator with t-distributed stochastic neighbor embedding (t-SNE). It is seen that the code generator learns a distinct conditional distribution for each class of images. It is believed that a more apparent inter-class distinction reflects a greater difficulty for the decoder in generating images of different classes. Moreover, the elliptic shape of the intra-class distributions on CIFAR-10 may be ascribed to the higher intra-class variability.

(a) MNIST
(b) SVHN
(c) CIFAR-10
Figure 11: Visualization of the code generator output in the supervised setting.

4.4.2 Unsupervised Learning

This section presents experimental results of using the network architecture in Fig. 5 to learn unsupervisedly a code generator that outputs a prior best characterizing the image distribution. In the present case, the label y is drawn randomly from a categorical distribution and independently of the Gaussian input z, as shown in Fig. 5. The categorical distribution encodes our prior belief about data clusters, with the number of distinct values over which it is defined specifying the presumed number of clusters in the data. The Gaussian serves to explain the data variability within each cluster. In regularizing the distribution at the encoder output, we want the code generator to make sense of these two degrees of freedom for a disentangled representation of images.

At test time, image generation is done similarly to the supervised case. We start by sampling and , followed by feeding them into the code generator and then onwards to the decoder. In this experiment, the categorical distribution is defined over 10-D one-hot vectors and the Gaussian is 90-D. As in the supervised setting, after the model is trained, we alter the label variable or the Gaussian noise one at a time to verify whether the model has learned to cluster images unsupervisedly. We expect that a good model should generate images with certain common properties (e.g. similar backgrounds or digit types) when the Gaussian part is altered while the label part remains fixed.
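The sampling step described above amounts to concatenating a uniformly drawn one-hot label with a Gaussian noise vector before feeding the code generator. A minimal sketch with the dimensions used in this experiment (10 clusters, 90-D noise); the function name is ours, not the authors':

```python
import numpy as np

def sample_code_generator_input(n, num_classes=10, noise_dim=90, rng=None):
    """Draw the two inputs of the code generator in the unsupervised setting
    (illustrative sketch): a one-hot label y from a uniform categorical over
    `num_classes` presumed clusters, and a Gaussian noise vector z explaining
    the within-cluster variability."""
    if rng is None:
        rng = np.random.default_rng(0)
    y = np.zeros((n, num_classes))
    y[np.arange(n), rng.integers(num_classes, size=n)] = 1.0
    z = rng.standard_normal((n, noise_dim))
    return np.concatenate([y, z], axis=1)  # shape (n, num_classes + noise_dim)
```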

The results in Fig. 12 show that on MNIST, both our model and AAE successfully learn to disentangle the digit type from the remaining information. Following the same presentation order as in the supervised setting, we see that each column of images (which corresponds to the same label y) does show images of the same digit. On the more complicated SVHN and CIFAR-10 datasets, each column mixes images from different digits/classes. It is however worth noting that both models have a tendency to cluster images with similar backgrounds according to the label variable y. Recall that without any semantic guidance, there is no guarantee that the clustering will be in line with the semantics attached artificially to each data sample.

Fig. 13 further visualizes the latent code distributions at the output of the code generator and the encoder. Several observations can be made. First, the encoder is regularized well to produce an aggregated posterior distribution similar to that at the code generator output. Second, the code generator learns distinct conditional distributions according to the categorical label input . Third, quite by accident, the encoder learns to cluster images of the same digit on MNIST, as has been confirmed in Fig. 12. As expected, such semantic clustering in code space is not obvious on more complicated SVHN and CIFAR-10, as is evident from the somewhat random assignment of latent codes to images of the same class or label.

(a) Our model
(b) AAE
(c) Our model
(d) AAE
(e) Our model
(f) AAE
Figure 12: Images generated by the proposed model (a)(c)(e) and AAE (b)(d)(f) trained on the MNIST, SVHN and CIFAR-10 datasets in the unsupervised setting. Each column of images shares the same label/class information but varied Gaussian noise; each row shares the same Gaussian noise but varied label/class variables.
(a) Encoder (MNIST)
(b) Encoder (SVHN)
(c) Encoder (CIFAR-10)
(d) Code generator (MNIST)
(e) Code generator (SVHN)
(f) Code generator (CIFAR-10)
Figure 13: Visualization of the encoder output versus the code generator output in the unsupervised setting.

5 Application: Text-to-Image Synthesis

This section presents an application of our model to text-to-image synthesis. We show that the code generator can transform the embedding of a sentence into a prior suitable for synthesizing images that closely match the sentence's semantics. To this end, we learn supervisedly the correspondence between images and their descriptive sentences using the architecture in Fig. 4. Given an image-sentence pair, the sentence's embedding (a 200-D vector) generated by a pre-trained recurrent neural network is input to the code generator and the discriminator in image space as if it were the label information, while the image representation is learned through the autoencoder and regularized by the output of the code generator. As before, a 100-D Gaussian is placed at the input of the code generator to explain the variability of images given the sentence.

Fig. 14 presents images generated by our model when trained on the 102 Category Flower dataset (Nilsback08). The generation process is much the same as that described in Section 4.4.1. It is seen that most images match the text descriptions reasonably well. In Fig. 15, we further explore how the generated images change with the variation of the color attribute in the text description. We see that most images agree with the text descriptions to a large degree.

(a) This vibrant flower features lush red petals and a similar colored pistil and stamen
(b) This flower has white and crumpled petals with yellow stamen
Figure 14: Generated images from text descriptions.
Figure 15: Generated images in accordance with the varying color attribute in the text description "The flower is pink in color and has petals that are rounded in shape and ruffled." From left to right, the color attribute is set to pink, red, yellow, orange, purple, blue, white, green, and black, respectively. Note that there is no green or black flower in the dataset.

6 Conclusion

In this paper, we propose to learn a proper prior from data for AAE. Built on the foundation of AAE, we introduce a code generator to transform the manually selected simple prior into one that can better fit the data distribution. We develop a training process that learns the autoencoder and the code generator jointly. We demonstrate its superior performance over AAE in image generation and in learning disentangled representations in supervised and unsupervised settings. We also show its ability to do cross-domain translation. Mode collapse and training instability are two major issues to be further investigated in future work.