Flexible Prior Distributions for Deep Generative Models

10/31/2017 ∙ by Yannic Kilcher, et al. ∙ ETH Zurich

We consider the problem of training generative models with deep neural networks as generators, i.e. to map latent codes to data points. Whereas the dominant paradigm combines simple priors over codes with complex deterministic models, we argue that it might be advantageous to use more flexible code distributions. We demonstrate how these distributions can be induced directly from the data. The benefits include: more powerful generative models, better modeling of latent structure and explicit control of the degree of generalization.


1 Introduction

Generative models have recently moved to the center stage of deep learning in their own right. Most notable is the seminal work on Generative Adversarial Networks (GAN)

(Goodfellow et al., 2014)

as well as probabilistic architectures known as Variational Autoencoder (VAE)

(Kingma & Welling, 2013; Rezende et al., 2014)

. Here, the focus has moved away from density estimation and towards generative models that – informally speaking – produce samples that are perceptually indistinguishable from samples generated by nature. This is particularly relevant in the context of high-dimensional signals such as images, speech, or text.

Generative models like GANs typically define a generative model via a deterministic generative mechanism or generator $G_\theta: \mathcal{Z} \to \mathcal{X}$, $\mathbf{z} \mapsto \mathbf{x} = G_\theta(\mathbf{z})$, parametrized by $\theta$. The generator is often implemented as a deep neural network (DNN), which is hooked up to a code distribution $P_{\mathbf{z}}$ to induce a distribution over data points. It is known that under mild regularity conditions, by a suitable choice of generator, any target distribution over data points can be obtained from an arbitrary fixed $P_{\mathbf{z}}$ (Kallenberg, 2006). Relying on the power and flexibility of DNNs, this has led to the view that code distributions should be simple and a priori fixed, e.g. $P_{\mathbf{z}} = \mathcal{N}(\mathbf{0}, \mathbf{I})$. As shown in Arjovsky & Bottou (2017), however, the distribution induced by a DNN generator is supported on a countable union of manifolds of dimension at most $\dim(\mathcal{Z})$, which may pose challenges if $\dim(\mathcal{Z}) \ll \dim(\mathcal{X})$. Whereas a current line of research addresses this via alternative (non-MLE or KL-based) discrepancy measures between distributions (Dziugaite et al., 2015; Nowozin et al., 2016; Arjovsky et al., 2017), we investigate an orthogonal direction:

Claim 1.

It is advantageous to increase the modeling power of a generative model not only via the generator $G_\theta$, but also by using more flexible prior code distributions $P_{\mathbf{z}}$.

Another potential benefit of using a flexible latent prior is the ability to reveal richer structure (e.g. multimodality) in the latent space via $P_{\mathbf{z}}$, a view which is also supported by evidence on using more powerful posterior distributions (Mescheder et al., 2017). This argument can also be understood as follows. Denote by $Q_{\mathbf{x}}$ the distribution induced by the generator. Our goal is to ensure that $Q_{\mathbf{x}}$ matches the true data distribution $P_{\mathbf{x}}$. This brings us to consider the KL-divergence of the joint distributions, which can be decomposed as

$$\mathrm{KL}\left(P_{\mathbf{z},\mathbf{x}} \,\|\, Q_{\mathbf{z},\mathbf{x}}\right) = \mathrm{KL}\left(P_{\mathbf{z}} \,\|\, Q_{\mathbf{z}}\right) + \mathbb{E}_{\mathbf{z} \sim P_{\mathbf{z}}}\left[\mathrm{KL}\left(P_{\mathbf{x}|\mathbf{z}} \,\|\, Q_{\mathbf{x}|\mathbf{z}}\right)\right] \tag{1}$$

Assuming that the generator is powerful enough to closely approximate the data's generative process, the contribution of the conditional term $\mathbb{E}_{\mathbf{z} \sim P_{\mathbf{z}}}[\mathrm{KL}(P_{\mathbf{x}|\mathbf{z}} \,\|\, Q_{\mathbf{x}|\mathbf{z}})]$ vanishes or becomes extremely small, and what remains is the divergence between the priors. This means that, in light of using powerful neural networks to model $Q_{\mathbf{x}|\mathbf{z}}$, the prior agreement becomes a way to assess the quality of our learned model.

Empowered by this quantitative metric to evaluate the modeling power of a generative model, we will demonstrate some deficiencies of the assumption of an arbitrary fixed prior such as a Normal distribution. We will further validate this observation by demonstrating that a flexible prior can be learned from data by mapping the data points back to their latent space representations. This procedure relies on a (trained) generator $G_\theta$ to compute an approximate inverse map $G_\theta^{-1}: \mathcal{X} \to \mathcal{Z}$ such that $G_\theta\!\left(G_\theta^{-1}(\mathbf{x})\right) \approx \mathbf{x}$.

Claim 2.

The generator implicitly defines an approximate inverse, which can be computed with reasonable effort using gradient descent and without the need to co-train a recognition network. We call this approach generator reversal.

Note that, if the above argument holds, we can easily find latent vectors $\mathbf{z}_i = G_\theta^{-1}(\mathbf{x}_i)$ corresponding to given observations $\mathbf{x}_i$. This then induces an empirical distribution of "natural" codes. An extensive set of experiments presented in this paper reveals that this induced prior yields strong evidence of improved generative models. Our findings clearly suggest that further work is needed to develop flexible latent prior distributions in order to achieve generative models with greater modeling power.

2 Measuring the Modeling Power of the Latent Prior

2.1 Gradient–Based Reversal

Let us begin by making Claim 2 more precise. Given a data point $\mathbf{x}$, we aim to compute an approximate code $\mathbf{z}$ such that $G_\theta(\mathbf{z}) \approx \mathbf{x}$. We do so by simple gradient descent, starting from a random initialization of $\mathbf{z}$ (see Algorithm 1).

Require: data point $\mathbf{x}$, loss function $\ell$, initial value $\mathbf{z}_0$
1:  Initialize $\mathbf{z} \leftarrow \mathbf{z}_0$
2:  repeat
3:      $\hat{\mathbf{x}} \leftarrow G_\theta(\mathbf{z})$   {run generator}
4:      $\mathbf{z} \leftarrow \mathbf{z} - \eta \, \nabla_{\mathbf{z}}\, \ell(\hat{\mathbf{x}}, \mathbf{x})$   {backpropagate error}
5:  until converged
Output: latent code $\mathbf{z}$
Algorithm 1 Generator Reversal

Section B in the Appendix demonstrates that the generator reversal approach presented in Algorithm 1 ensures local convergence of gradient descent for a suitable choice of loss function.
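For concreteness, the following is a minimal sketch of Algorithm 1 in PyTorch, assuming a generator `G` implemented as a `torch.nn.Module` that maps latent codes of dimension `latent_dim` to data points; the function name `generator_reversal`, the plain SGD updates and the squared-error loss are illustrative choices rather than the exact implementation used in our experiments.

```python
import torch

def generator_reversal(G, x, latent_dim, steps=400, lr=0.1):
    """Approximate codes z such that G(z) ~= x by gradient descent (cf. Algorithm 1)."""
    for p in G.parameters():                # freeze the generator; only z is optimized
        p.requires_grad_(False)

    # Random initialization z_0 for a whole batch of targets x.
    z = torch.randn(x.shape[0], latent_dim, requires_grad=True)
    optimizer = torch.optim.SGD([z], lr=lr)

    for _ in range(steps):
        optimizer.zero_grad()
        x_hat = G(z)                        # run generator
        loss = ((x_hat - x) ** 2).mean()    # reconstruction loss ell(x_hat, x)
        loss.backward()                     # backpropagate the error to z
        optimizer.step()                    # gradient step on the latent codes

    return z.detach(), loss.item()
```

Only the latent codes are updated; the generator weights stay frozen throughout the reversal.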

Given the generator reversal procedure presented in Algorithm 1, a key question we would like to address is whether good (low-loss) code vectors $\mathbf{z}$ exist for given data points $\mathbf{x}$. First of all, whenever $\mathbf{x}$ was actually generated by $G_\theta$, then surely we know that a perfect, zero-loss pre-image exists. Of course, finding it exactly would require the exact inverse of the generator, but our experiments demonstrate that, in practice, an approximate answer is sufficient.

Secondly, if $\mathbf{x}$ is in the training data, then as $G_\theta$ is trained to mimic the true distribution, it would be suboptimal if any such $\mathbf{x}$ did not have a suitable pre-image. We thus conjecture that learning a good generator will also improve the quality of the generator reversal, at least for points of interest (generated or data). Note that we do not explicitly train the generator to produce pre-images that would further optimize the training objective. This would require backpropagating through the reversal process, which is certainly possible and would likely yield further improvements.

Anecdotally, we have found the generator reversal procedure to be quite effective at finding reasonable pre-images of data samples, even for generators initialized with random weights. Some empirical results are provided in Section C of the Appendix.

2.2 Measuring Prior Agreement

Modeling the distribution of a complex set of data requires a rich family of distributions, which is typically obtained by introducing latent variables $\mathbf{z}$ and modeling the data as $p(\mathbf{x}) = \int p(\mathbf{x} \mid \mathbf{z})\, p(\mathbf{z})\, d\mathbf{z}$. Often the prior is assumed to be a Normal distribution, i.e. $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.

This principle underlies most modern generative models such as GANs, effectively framing the generation process as mapping latent noise $\mathbf{z}$ to observed data $\mathbf{x}$ through a generative model $G$. In GANs, the generative model, or generator, is a neural network $G_\theta$ parametrized by $\theta$ and optimized to find the best set of parameters $\theta^*$. Note that an implicit assumption made by this modeling choice is that the generative process is adequate, i.e. sufficiently powerful to find a $G_{\theta^*}$ that reconstructs the data distribution. For GANs to work, this assumption has to hold despite the fact that the prior is kept fixed. In theory, such a generator does exist, as neural networks are universal approximators and can therefore approximate any (continuous) function. However, finding $\theta^*$ requires solving a high-dimensional and non-convex optimization problem and is therefore not guaranteed to succeed.

We here explore an orthogonal direction. We assume that we have found a suitable generative model that could produce the data distribution, but that the prior is not appropriate. We would like to quantify to what degree the assumed prior disagrees with the data-induced prior, thereby measuring how well our generated distribution agrees with the data distribution. Our goal in doing so is not to propose a new training criterion, but rather to assess the quality of our generative model by measuring the quality of the prior.

Prior Agreement (PAG) Score

We consider the standard case where $P_{\mathbf{z}}$ is modeled as a multivariate Normal distribution with uniform diagonal covariance, i.e. $\mathcal{N}(\mathbf{0}, \mathbf{I}_d)$. Our goal is to measure the disagreement between this prior and some more suitable prior. The latter not being known a priori, we instead settle for a multivariate Normal with diagonal covariance $\mathcal{N}(\mathbf{0}, \mathrm{diag}(\sigma_1^2, \ldots, \sigma_d^2))$, where the $\sigma_i$ are inferred from a trained generator as detailed below. This choice of prior allows for a simple computation of the divergence:

$$D_{\mathrm{PAG}} := \mathrm{KL}\!\left(\mathcal{N}(\mathbf{0}, \mathbf{I}_d) \,\big\|\, \mathcal{N}(\mathbf{0}, \mathrm{diag}(\sigma_1^2, \ldots, \sigma_d^2))\right) = \frac{1}{2} \sum_{i=1}^{d} \left( \frac{1}{\sigma_i^2} + \log \sigma_i^2 - 1 \right) \tag{2}$$

The divergence defined in Equation 2 defines the Prior Agreement (PAG) Score. It requires the quantities $\sigma_i$, which can easily be computed by mapping the data to the latent space using the reversal procedure described in Section 2.1 and then performing an SVD on the matrix $Z$ of resulting latent vectors. The $\sigma_i$ then correspond to the singular values obtained in the decomposition $Z = U \Sigma V^\top$.

Note that more complex choices as a substitute for the data induced prior would allow for a better characterization of the inadequacy of the Normal prior with uniform covariance. We will however demonstrate that the PAG score defined in Equation 2 is already effective at revealing surprising deficiencies in the choice of the Normal prior.
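As an illustration, the PAG score can be computed from a set of reversed latent codes in a few lines; the KL direction and the $\sqrt{n}$ normalization of the singular values below are assumptions made for this sketch, following the reconstruction of Equation 2 above.

```python
import numpy as np

def pag_score(latent_codes):
    """Compute the Prior Agreement (PAG) score from reversed latent codes.

    latent_codes -- array of shape (n_samples, d), e.g. obtained via generator reversal.
    """
    Z = latent_codes - latent_codes.mean(axis=0)        # center the codes
    # Singular values of the latent matrix act as the per-dimension scales sigma_i.
    # (We normalize by sqrt(n) so sigma_i^2 estimates a per-dimension variance; whether
    #  this normalization matches the original computation is an assumption.)
    s = np.linalg.svd(Z, compute_uv=False) / np.sqrt(Z.shape[0])
    sigma2 = s ** 2
    # KL( N(0, I) || N(0, diag(sigma^2)) ) for diagonal Gaussians.
    return 0.5 * np.sum(1.0 / sigma2 + np.log(sigma2) - 1.0)
```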

3 Learning the Data Induced Prior

Figure 1: A data induced prior distribution is learned using a secondary GAN named PGAN. This prior is then used to further train the original GAN.

So far, we have introduced a way to characterize the fit of a chosen prior $P_{\mathbf{z}}$ to the data distribution, given a trained generator $G_\theta$. Equipped with this agreement score, we now turn our attention to designing a method to address the potential problems that could arise from choosing an inappropriate prior. As shown in Figure 1, we suggest learning the data-induced prior distribution using a secondary GAN, named PGAN, which learns a mapping function $T: \mathcal{Z}' \to \mathcal{Z}$, where $\mathcal{Z}'$ is an auxiliary latent space with the same or higher dimension as the original latent space $\mathcal{Z}$. The mapping $T$ defines a transformation of the noise vectors in order to match the data-induced prior. Note that the mapping function $T$ is trained while keeping the original GAN unchanged; thus we only need to run the reversal process once for the dataset, and the reverted data in the latent space becomes the target for training $T$.

If the original prior over the latent space $\mathcal{Z}$ is indeed not a good choice to model the data, we should see evidence of a better generative model when sampling from the transformed latent space $T(\mathcal{Z}')$. Such evidence, including better-quality samples, fewer outliers and lower (i.e. better) PAG scores, is shown in Figure 5 (see details in Section 4). Note that having obtained this mapping opens the door to various schemes, such as multiple rounds of re-training the original GAN and training the PGAN using the newly learned prior, or training yet another PGAN to match the data-induced prior of the first PGAN. We leave these practical considerations to future work, as our goal is simply to provide a method to quantify and remedy the fundamental problem of prior disagreement. A sketch of how the PGAN could be wired up is given below.
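The sketch below shows one possible way to set up the PGAN of Figure 1, assuming the reversed latent codes are available as a tensor; the hypothetical helpers `mlp` and `train_pgan`, the layer widths and the standard non-saturating GAN loss are illustrative assumptions, not a description of our exact training code.

```python
import torch
import torch.nn as nn

def mlp(sizes):
    """Small fully connected network (sketch): linear layers with ReLU in between."""
    layers = []
    for i in range(len(sizes) - 1):
        layers.append(nn.Linear(sizes[i], sizes[i + 1]))
        if i < len(sizes) - 2:
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)

def train_pgan(reversed_codes, z_prime_dim=20, steps=10000, batch=100):
    """Learn a mapping T so that T(z') matches the empirical distribution of reversed codes."""
    d = reversed_codes.shape[1]
    T = mlp([z_prime_dim, 256, 256, 256, d])     # PGAN generator
    D = mlp([d, 256, 256, 256, 1])               # PGAN discriminator (outputs logits)
    opt_T = torch.optim.RMSprop(T.parameters(), lr=1e-4)
    opt_D = torch.optim.RMSprop(D.parameters(), lr=1e-4)
    bce = nn.BCEWithLogitsLoss()

    for _ in range(steps):
        # Discriminator step: real = reversed codes, fake = T(z').
        idx = torch.randint(0, reversed_codes.shape[0], (batch,))
        real = reversed_codes[idx]
        fake = T(torch.randn(batch, z_prime_dim)).detach()
        d_loss = bce(D(real), torch.ones(batch, 1)) + bce(D(fake), torch.zeros(batch, 1))
        opt_D.zero_grad(); d_loss.backward(); opt_D.step()

        # Generator (T) step: make T(z') indistinguishable from a reversed code.
        fake = T(torch.randn(batch, z_prime_dim))
        t_loss = bce(D(fake), torch.ones(batch, 1))
        opt_T.zero_grad(); t_loss.backward(); opt_T.step()

    return T
```

Once T is trained, improved samples are drawn as G(T(z')) with z' ~ N(0, I), leaving the original generator untouched.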

4 Experiments

The experimental results presented in this section are based on off-the-shelf GAN architectures whose details are provided in the appendix. We restrict the dimension of the latent space to 20. Although similar experimental results can be obtained with latent spaces of larger dimension, the low-dimensional setting is particularly interesting as it requires more compression, providing an ideal scenario to empirically verify the fitness of the fixed Gaussian prior with respect to the data-induced prior.

4.1 Mapping a dataset to the latent space

Given a generator network, we first map 1024 data points from the MNIST dataset to the latent space using the Generator Reversal procedure. We then use t-SNE to reduce the dimensionality of the latent vectors to 2 dimensions. We perform this procedure for both an untrained and a fully trained network and show the results in Figure 2. One can clearly see a multi-modal structure emerging after training the generator network, indicating that a unimodal Normal distribution is not an appropriate choice as a prior over the latent space.

Figure 2: Generator reversal on a sample of 1024 MNIST digits. Projections of data points with an untrained (left) and a fully trained GAN (right). Colors represent the respective class labels. The ratios of between-cluster distances to within-cluster distances are 0.1 (left) and 1.9 (right).
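A brief sketch of this visualization step, assuming the reversed codes are available as a NumPy array and using scikit-learn's t-SNE (the perplexity value is an arbitrary illustrative choice):

```python
from sklearn.manifold import TSNE

def embed_latents_2d(latent_codes, perplexity=30, seed=0):
    """Project reversed latent codes to 2D with t-SNE for visual inspection."""
    emb = TSNE(n_components=2, perplexity=perplexity, random_state=seed).fit_transform(latent_codes)
    return emb  # shape (n_samples, 2); scatter-plot and color by class label if available
```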

4.2 Prior-Data-Disagreement samples

In order to demonstrate that a simple Normal prior does not capture the data-induced prior, we sample points that are likely under the Normal prior but unlikely under the data-induced prior. This is achieved by sampling a batch of 1000 samples from the Normal prior and then ordering the samples according to their mean squared distance to the latent representations found for a batch of data. Figure 3 shows the top 20 samples in the data space (obtained by mapping the top 20 latent vectors to the data space using the generator). The poor quality of these samples indicates that the Normal prior assigns probability mass to samples that are unlikely under the data-induced prior.

(a) LSUN kitchen
(b) LSUN bedrooms
(c) CelebA
(d) CIFAR 10
Figure 3: Prior-data-disagreement samples. We visualize samples for which the likelihood under the GAN prior is high, but low under the data-induced prior. Note that most of these samples are of poor visual quality and contain numerous artifacts.
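A possible implementation of this selection procedure is sketched below, assuming a generator `G` and a tensor of reversed codes; ranking candidates by their mean squared distance to the whole batch of reversed codes is our reading of the procedure described above, and the helper name is hypothetical.

```python
import torch

def prior_disagreement_samples(G, reversed_codes, latent_dim=20, n_candidates=1000, top_k=20):
    """Pick Normal-prior samples that lie far from the reversed codes of real data."""
    z = torch.randn(n_candidates, latent_dim)
    # Pairwise squared distances between candidates and reversed data codes.
    d2 = torch.cdist(z, reversed_codes) ** 2        # (n_candidates, n_data)
    score = d2.mean(dim=1)                          # mean squared distance per candidate
    top = torch.topk(score, k=top_k).indices        # candidates most unlikely under the data-induced prior
    with torch.no_grad():
        return G(z[top])                            # decode to data space for inspection
```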

4.3 Evaluating the PAG Score

(a) LSUN kitchen. (PAG: 1547 / 2001)
(b) LSUN bedroom. (PAG: 1807 / 13309)
(c) CelebA. (PAG: 11390 / 39144, Inception: 4.35 / 2.97)
(d) CIFAR 10. (PAG: 5021 / 5352, Inception: 6.30 / 5.94)
Figure 4: Best / worst selection of samples (by visual inspection) from a number of different GAN models. We also report the PAG and Inception Scores (when available). Note that the PAG score agrees with the Inception Score, but does not require labeled data to be evaluated.

We now evaluate how the PAG score correlates with the visual quality of the samples produced by a generator. We first train a selection of 16 GAN models using different combinations of filter size, layer size and regularization constants. We then select the best and the worst model by visually inspecting the generated samples. We show the samples as well as the corresponding PAG scores in Figure 4. These results clearly demonstrate that the PAG score strongly correlates with the visual quality of the samples. We also report the Inception score for datasets that provide a class label, and observe a strong agreement with the PAG scores.

4.4 Learning the Data Induced Prior using a secondary GAN

Following the procedure presented in Section 3, we train a GAN until convergence and then use the Generator Reversal procedure to map the dataset to the latent space, thus obtaining the data-induced prior. We then train a secondary GAN (called PGAN) to learn this prior, from which we can then continue training the original GAN for a few steps.

We expect that the model trained with the data-induced prior will be better at capturing the true data distribution. This is empirically verified in Figure 5 by inspecting samples produced by the original and the re-trained model. We also report the PAG scores, which show a significant reduction, confirming our hypothesis that we obtain an improved generative model. Note that the data-induced prior yields more realistic and varied output samples, even though it uses a latent space of the same dimensionality as the original simple prior.

(a) LSUN kitchen. (PAG: 1861 / 779)
(b) LSUN bedroom. (PAG: 1810 / 746)
(c) CelebA. (PAG: 2492 / 863, Inception: 4.64 / 4.62)
(d) CIFAR 10. (PAG: 2485 / 1064, Inception: 6.24 / 6.59)
Figure 5: Samples before (left) and after (right) training with the data induced prior. Note the increased level of diversity in the samples obtained from the induced prior.

5 Related Work

Our generator reversal is similar in spirit to Kindermann & Linden (1990), but their intent differs, as they use this technique as a tool to visualize the information processing capability of a neural network. Unlike previous works that require the transfer function to be bijective (Baird et al., 2005; Rippel & Adams, 2013), our approach does not strictly have this requirement, although bijectivity could still be imposed by carefully selecting the architecture of the network, as shown in Dinh et al. (2016); Arjovsky et al. (2017).

In the context of GANs, other works have used a similar reversal mechanism as the one used in our approach, including e.g. Che et al. (2016); Dumoulin et al. (2016); Donahue et al. (2016). All these methods focus on training a separate encoder network in order to map a sample from the data space to its latent representation. Our goal is however different as the reversal procedure is here used to estimate a more flexible prior over the latent space.

Finally, we note that the importance of using an appropriate prior for GAN models has also been discussed in Han et al. (2016), which suggested inferring the continuous latent factors in order to maximize the data log-likelihood. However, this approach still makes use of a simple fixed prior distribution over the latent factors and does not use the inferred latent factors to construct a prior, as our approach suggests.

6 Conclusion

We started our discussion by arguing that it is advantageous to increase the modeling power of a generative model by using more flexible prior code distributions. We substantiated our claim by deriving a quantitative metric estimating the modeling power of a fixed prior, such as the Normal prior commonly used when training GAN models. Our experimental results confirm that this measure reveals that the standard choice of an arbitrary fixed prior is not always appropriate. In order to address this problem, we presented a novel approach to estimate a flexible prior over the latent codes given by a generator $G_\theta$. This was achieved through a reversal technique that reconstructs latent representations of data samples and uses these reconstructions to construct a prior over the latent codes. We empirically demonstrated that the resulting data-induced prior yields several benefits, including: more powerful generative models, better modeling of latent structure and semantically more appropriate output.

References

Appendix A Derivation of Equation 1

$$\begin{aligned}
\mathrm{KL}\left(P_{\mathbf{z},\mathbf{x}} \,\|\, Q_{\mathbf{z},\mathbf{x}}\right)
&= \mathbb{E}_{(\mathbf{z},\mathbf{x}) \sim P_{\mathbf{z},\mathbf{x}}}\left[\log \frac{P_{\mathbf{z}}(\mathbf{z})\,P_{\mathbf{x}|\mathbf{z}}(\mathbf{x}\mid\mathbf{z})}{Q_{\mathbf{z}}(\mathbf{z})\,Q_{\mathbf{x}|\mathbf{z}}(\mathbf{x}\mid\mathbf{z})}\right] \\
&= \mathbb{E}_{\mathbf{z} \sim P_{\mathbf{z}}}\left[\log \frac{P_{\mathbf{z}}(\mathbf{z})}{Q_{\mathbf{z}}(\mathbf{z})}\right]
 + \mathbb{E}_{(\mathbf{z},\mathbf{x}) \sim P_{\mathbf{z},\mathbf{x}}}\left[\log \frac{P_{\mathbf{x}|\mathbf{z}}(\mathbf{x}\mid\mathbf{z})}{Q_{\mathbf{x}|\mathbf{z}}(\mathbf{x}\mid\mathbf{z})}\right] \\
&= \mathrm{KL}\left(P_{\mathbf{z}} \,\|\, Q_{\mathbf{z}}\right)
 + \mathbb{E}_{\mathbf{z} \sim P_{\mathbf{z}}}\left[\mathrm{KL}\left(P_{\mathbf{x}|\mathbf{z}} \,\|\, Q_{\mathbf{x}|\mathbf{z}}\right)\right]
\end{aligned} \tag{3}$$

Appendix B Local Convergence of the Gradient–Based Reversal

Let us demonstrate that the generator reversal approach presented in Algorithm 1 ensures local convergence of gradient descent for a suitable choice of loss function.

Proposition 1.

We are given the squared reconstruction loss $\ell(\hat{\mathbf{x}}, \mathbf{x}) = \tfrac{1}{2}\|\hat{\mathbf{x}} - \mathbf{x}\|_2^2$ and a generator function $G$. Consider a point $\mathbf{z}_0 \in \mathcal{Z}$ with $\mathbf{x} = G(\mathbf{z}_0)$, and assume the function $G$ is locally invertible around $\mathbf{z}_0$.¹ Then the reconstruction problem $\min_{\mathbf{z}} \ell(G(\mathbf{z}), \mathbf{x})$ is locally convex at $\mathbf{z}_0$.

¹ Note that this is a less restrictive assumption than the diffeomorphism property required in Arjovsky & Bottou (2017).

Proof.

We prove the result stated above by showing that the Hessian of the reconstruction objective at $\mathbf{z}_0$ is positive semidefinite. The objective is

$$\ell(G(\mathbf{z}), \mathbf{x}) = \tfrac{1}{2}\,\|G(\mathbf{z}) - \mathbf{x}\|_2^2. \tag{4}$$

Let $J$ denote the Jacobian of $G$ at $\mathbf{z}_0$ and let us compute the Hessian of Equation 4 at $\mathbf{z}_0$:

$$\nabla^2_{\mathbf{z}}\, \ell(G(\mathbf{z}), \mathbf{x})\Big|_{\mathbf{z}=\mathbf{z}_0} = J^\top J + \sum_i \left(G_i(\mathbf{z}_0) - x_i\right) \nabla^2 G_i(\mathbf{z}_0) = J^\top J,$$

where the second term vanishes because $G(\mathbf{z}_0) = \mathbf{x}$. Since $G$ is assumed to be locally invertible around $\mathbf{z}_0$, its Jacobian $J$ has full column rank, so $J^\top J \succeq 0$ and the Hessian is therefore positive semidefinite. ∎

Note that one could also add an $\ell_2$ regularizer to Equation 4 in order to obtain a locally strongly convex function.

Appendix C Random Network Experiments

It is very hard to give quality guarantees for the approximations obtained via generator reversal. Here, we provide experimental evidence by showing that even a DNN generator with random weights can provide reasonable pre-images for data samples. As we argued above, we believe that actual training of $G_\theta$ will improve the quality of pre-images, so this is in a way a worst-case scenario.

Reconstruction errors for three different image data sets are shown in Figure 6, where we plot the average reconstruction error as a function of the number of gradient update steps. We observe that the error decreases steadily as the reconstruction progresses and reaches a low value very quickly. We also show randomly selected reconstructed samples in Figure 7, which reflect the fast decrease of the reconstruction error. After only 5 update steps, one can already recognize the general outline of the picture. This is perhaps even more surprising considering that these results were obtained using a generator with completely random weights. A similar finding was also reported in He et al. (2016), which constructed deep visualizations using untrained neural networks initialized with random weights.


Figure 6: Reconstruction loss in generator networks with random weights.

Figure 7: Reconstruction quality using generator networks with random weights. The left column is the original image, followed by reconstructions after 5, 20 and 400 steps.

Appendix D Latent space data distribution

Figure 8 shows the distribution of the obtained latent codes after Generator Reversal. Interestingly, although the distributions are different from the naive prior, they are not indicative of low rank latent data. This agrees with our expectations, as a well trained generator will make use of all available latent dimensions.


Figure 8: Distribution of singular values (in GANs using d latent dimensions) used in calculation of the PAG scores. For comparison, singular values of a sample of normally distributed latent codes in 100 dimensions are shown.

Appendix E Detailed Experiment Setup

Our experimental setup closely follows popular setups in GAN research in order to facilitate reproducibility and enable qualitative comparisons of results. Our network architectures are as follows:

The generator samples from a latent space of dimension 20, which is fed through a padded deconvolution to form an initial low-resolution intermediate representation. This is then fed through four layers of deconvolutions with 512, 256, 128 and 64 filters, followed by a last deconvolution to get to the desired output size and number of channels.

The discriminator consists of four convolutional layers with 512, 256, 128 and 64 filters, followed by a fully connected layer and a sigmoid classifier.

Both the generator and the discriminator use strided (de)convolution filters in order to up- and downscale the representations, respectively. The generator employs ReLU non-linearities, except for the last layer, which uses a hyperbolic tangent. The discriminator uses Leaky ReLU non-linearities with a small leak, which is standard in the GAN literature.
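For concreteness, a rough PyTorch sketch of a generator in this family is given below; the 4x4 initial spatial resolution, the kernel size of 5, the stride of 2 and the resulting output resolution are assumptions filled in where the description above leaves exact values unspecified, so the actual architecture may differ.

```python
import torch.nn as nn

class Generator(nn.Module):
    """DCGAN-style generator sketch: latent dim 20, deconv stack 512-256-128-64, tanh output.

    Kernel size, stride, padding and the 4x4 starting resolution are assumed values;
    with these choices the output resolution is 128x128 (adjust strides/layers for other sizes).
    """
    def __init__(self, latent_dim=20, out_channels=3):
        super().__init__()
        self.net = nn.Sequential(
            # padded deconvolution: latent vector -> initial 4x4 intermediate representation
            nn.ConvTranspose2d(latent_dim, 512, kernel_size=4, stride=1, padding=0),
            nn.ReLU(),
            # four deconvolution layers with 512, 256, 128 and 64 filters
            nn.ConvTranspose2d(512, 512, kernel_size=5, stride=2, padding=2, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(512, 256, kernel_size=5, stride=2, padding=2, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(256, 128, kernel_size=5, stride=2, padding=2, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(128, 64, kernel_size=5, stride=2, padding=2, output_padding=1),
            nn.ReLU(),
            # last deconvolution to the desired output size and number of channels
            nn.ConvTranspose2d(64, out_channels, kernel_size=5, stride=2, padding=2, output_padding=1),
            nn.Tanh(),
        )

    def forward(self, z):
        # reshape the latent vector to a 1x1 spatial map before deconvolving
        return self.net(z.view(z.size(0), -1, 1, 1))
```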

The PGAN consists of four layers of fully connected units in both the generator and discriminator. Apart from the layers being fully connected, the architecture is analogous to the original GAN.

We use RMSProp (Tieleman & Hinton, 2012) with a fixed step size and mini-batches of size 100 to optimize all networks.

For the generator reversal process, we use gradient descent with a fixed learning rate. The initial noise vectors are sampled from a normal distribution.

We train until we can no longer see any significant qualitative improvement in the generated images or any quantitative improvement in the inception score (if available).

For the CelebA dataset, we first crop the images to a square region and then resize them to the input resolution of the network.