Annealed Generative Adversarial Networks

05/21/2017
by Arash Mehrjou, et al.
Max Planck Society

We introduce a novel framework for adversarial training in which the target distribution is annealed between the uniform distribution and the data distribution. We posit a conjecture that learning under continuous annealing in the nonparametric regime is stable irrespective of the divergence measure in the objective function, and propose an algorithm, dubbed β-GAN, as a corollary. In this framework, the fact that the initial support of the generative network is the whole ambient space, combined with annealing, is key to balancing the minimax game. In our experiments on synthetic data, MNIST, and CelebA, β-GAN with a fixed annealing schedule was stable and did not suffer from mode collapse.


1 Introduction

Background—

One of the most fundamental problems in machine learning is the unsupervised learning of high-dimensional data. A class of problems in unsupervised learning is density estimation, where it is assumed that there exists a class of probabilistic models underlying the observed data, and the goal of learning is to infer the right model(s). The generative adversarial network (GAN) proposed by Goodfellow et al. (goodfellow2014generative) is an elegant framework that transforms the problem of density estimation into an adversarial process in a minimax game between a generative network G and a discriminative network D. However, despite their simplicity, GANs are notoriously difficult to train.

Mode collapse— There are different schools of thought in diagnosing and addressing the problems with training GANs, which have resulted in a variety of algorithms, network architectures, training procedures, and novel objective functions (radford2015unsupervised; salimans2016improved; zhao2016energy; arjovsky2017wasserstein; nowozin2016f). The roots of the problems in training GANs lie in the unbalanced nature of the game being played, the difficulty of high-dimensional minimax optimization, and the fact that the data manifold is highly structured in the ambient space. Perhaps the biggest challenge is that natural data in the world reside on a very low-dimensional manifold of their ambient space (narayanan2010sample). Early in training, the generative network is far off from this low-dimensional manifold, and the discriminative network quickly learns to reject the generated samples, leaving the generator little room to improve. This was analyzed in depth by Arjovsky & Bottou (arjovsky2017towards), who highlighted the deficiencies of f-divergences when the generative network has low-dimensional support. The other challenging issue is that GANs' optimal point is a saddle point. We have a good understanding of, and a variety of optimization methods for, finding local minima/maxima of objective functions, but minimax optimization in high-dimensional spaces has proven to be challenging. Because of these two obstacles, i.e. the nature of high-dimensional data and the nature of the optimization, GANs suffer from stability issues and the ubiquitous problem of mode collapse, where the generator completely ignores parts of the low-dimensional data manifold.

β-GAN— In this work, we address these two issues at the same time by lifting the minimax game: the initial objective is to find the GAN equilibrium in an easier game of learning to map the noise z to the uniform distribution over the ambient space X. Here, z is the noise variable corresponding to the latent space, and n is the dimension of the ambient space X. The β in β-GAN refers to the inverse temperature, which is defined in the next section. After arriving at the equilibrium for β = 0, we anneal the uniform distribution towards the data distribution while performing the adversarial training simultaneously. Our assumption in this work is that once the GAN is stable for the uniform distribution, it will remain stable in the continuous annealing limit irrespective of the divergence measure being used in the objective function. In this work, we used the original Jensen-Shannon formulation of Goodfellow et al. (goodfellow2014generative). The objective to learn the uniform distribution puts constraints on the architecture of the generative network, most importantly dim(z) = n, which has deep consequences for the adversarial training, as discussed below.

Related works— There are similarities between our approach and recent proposals for stabilizing GAN training by adding noise to samples from the generator and to the data points (kaae2016amortised; arjovsky2017towards). This was called instance noise in (kaae2016amortised). The key insight was provided in (arjovsky2017towards), where the role of noise was to enlarge the support of the generative network and of the data distribution, which leads to stronger learning signals for the generative network during training. The crucial difference in this work is that we approach the problem from the perspective of annealing distributions, and our starting point is to generate the uniform distribution, which has the support of the whole ambient space X. This simple starting point is a straightforward solution to the theoretical problems raised in (arjovsky2017towards) regarding the use of f-divergences for adversarial training, where it was assumed that the support of the generative network has measure zero in the ambient space. Since the uniform distribution is not normalizable in R^n, we assume X to be a finite n-dimensional box. A good physical picture to have is to imagine the data manifold diffusing to the uniform distribution like ink in an n-dimensional vase filled with water. What β-GAN achieves during annealing is to shape the space-filling samples, step by step, into samples that lie on the low-dimensional manifold of the data distribution. Therefore, in our framework, there is no need to add any noise to samples from the generator (in contrast to (kaae2016amortised; arjovsky2017towards)), since the generator support is initialized to be the ambient space. Finally, one can also motivate β-GAN from the perspective of curriculum learning (bengio2009curriculum), where learning the uniform distribution is the initial task in the curriculum.

2 β-GAN

In this section, we define the parameter β, which plays the role of an inverse temperature and parametrizes the annealing from the uniform distribution (β = 0) to the data distribution (β → ∞). We provide a new algorithm for training GANs based on a conjecture with stability guarantees in the continuous annealing limit. We use the Jensen-Shannon formulation of GANs (goodfellow2014generative) below, but the conjecture holds for other measures including f-divergences (nowozin2016f) and the Wasserstein metric (arjovsky2017wasserstein).

We assume the generative and discriminative networks G and D have very large capacity, parameterized by deep neural networks G(z; θ_g) and D(x; θ_d). Here, z is the (noise) input to the generative network G, and D is the discriminative network performing logistic regression. The discriminative network is trained with the binary classification label y = 1 for the observations x ~ p_data(x), and y = 0 otherwise. The GAN objective is to find G such that p_g = p_data. This is achieved at the Nash equilibrium of the following minimax objective:

    min_G max_D V(G, D),                                                           (1)
    V(G, D) = E_{x ~ p_data(x)}[log D(x)] + E_{z ~ p_z(z)}[log(1 − D(G(z)))],      (2)

where at the equilibrium p_g = p_data (goodfellow2014generative). One way to introduce β is to go back to the empirical distribution p_data(x) and rewrite it as a mixture of Gaussians with zero widths:

    p_data(x) = (1/N) Σ_{i=1}^{N} lim_{σ→0} N(x; x_i, σ² I)                        (3)

The heated data distribution at finite β is therefore given by:

    p_β(x) = (1/N) Σ_{i=1}^{N} N(x; x_i, β⁻¹ I)                                    (4)
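To make Eq. (4) concrete, here is a minimal NumPy sketch of drawing samples from the heated distribution p_β by perturbing randomly chosen data points with isotropic Gaussian noise of covariance β⁻¹I, as written above; the helper name, and the assumption that the data array is already rescaled to the box X, are ours.

```python
import numpy as np

def sample_heated_data(data, beta, num_samples, rng=np.random.default_rng()):
    """Draw samples from p_beta (Eq. 4): pick data points uniformly at random
    and add isotropic Gaussian noise with variance 1/beta.
    `data` is an (N, n) array assumed to be rescaled to the box X."""
    idx = rng.integers(0, len(data), size=num_samples)          # mixture components
    noise = rng.normal(scale=beta ** -0.5, size=(num_samples, data.shape[1]))
    return data[idx] + noise
```

As β → ∞ the noise vanishes and the samples approach the empirical distribution; for small β the samples spread out over the box, approximating the uniform distribution.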

The n-dimensional box— The starting point in β-GAN is to learn to sample from the uniform distribution. Since the uniform distribution is not normalizable in R^n, we set X to be a finite interval in each dimension. The uniform distribution sets the scale in our framework, and the data samples are rescaled to the same interval. This hard n-dimensional box for the data particles is thus assumed throughout the paper. Its presence is conceptually equivalent to a diffusion process of the data particles in the box X, where they diffuse to the uniform distribution like ink dropped in water (sohl2015deep). In this work, we parametrized the distributions with β instead of the diffusion time. We also mention a non-Gaussian path to the uniform distribution in the discussion section.

With this setup, the minimax optimization task at each β is the objective of Eqs. (1)-(2) with p_data(x) replaced by the heated distribution p_β(x):

    min_{θ_g} max_{θ_d} E_{x ~ p_β(x)}[log D(x)] + E_{z ~ p_z(z)}[log(1 − D(G(z)))]

Note that the optimal parameters θ_g*(β) and θ_d*(β) depend on β implicitly. In β-GAN, the first task is to learn to sample the uniform distribution. The network is then trained simultaneously as the uniform distribution is smoothly annealed to the empirical distribution by increasing β. We chose a simple fixed geometric schedule for the annealing in this work. The algorithm is given below (see Fig. 1 for the schematic):

   Train a GAN to generate the uniform distribution and obtain θ_g^(0) and θ_d^(0).
   Receive β_min, β_max, and N_β, where N_β is the number of cooling steps between/including β_min and β_max.
   Compute the geometric cooling factor: r = (β_max / β_min)^{1/(N_β − 1)}.
   Initialize β: β ← β_min.
   Initialize θ_g ← θ_g^(0) and θ_d ← θ_d^(0).
  for number of cooling steps N_β do
     for number of training steps k do
         Sample a minibatch of m noise samples {z^(1), …, z^(m)} from the noise prior p_z(z).
         Sample a minibatch of m examples {x^(1), …, x^(m)} from the heated data distribution p_β(x).
         Update the discriminator by ascending its stochastic gradient:
             ∇_{θ_d} (1/m) Σ_{i=1}^{m} [log D(x^(i)) + log(1 − D(G(z^(i))))]
         Sample a minibatch of m noise samples {z^(1), …, z^(m)} from the noise prior p_z(z).
         Update the generator by descending its stochastic gradient:
             ∇_{θ_g} (1/m) Σ_{i=1}^{m} log(1 − D(G(z^(i))))
     end for
      Increase β geometrically: β ← rβ.
  end for
   Switch from p_β(x) to the empirical distribution p_data(x) (β → ∞) for the final epochs.

Algorithm 1: Minibatch stochastic gradient descent training of annealed generative adversarial networks. The inner loop can be replaced with other GAN architectures and/or other divergence measures. The one given here uses the Jensen-Shannon formulation of Goodfellow et al. as the objective, as do all experiments in this paper.
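For concreteness, the following is a minimal PyTorch sketch of the annealing loop in Algorithm 1. It is a sketch under stated assumptions rather than the authors' implementation: D is assumed to output a sigmoid probability, p_β is sampled by adding Gaussian noise of variance 1/β to data points, the noise prior is taken to be uniform on [−1, 1]^n (the paper uses a uniform prior but does not pin the interval here), and the optimizer choice and hyperparameters are illustrative.

```python
import torch

def train_beta_gan(G, D, data, beta_min, beta_max, n_beta, k_steps, m=64, lr=2e-4):
    """Sketch of Algorithm 1: geometric annealing of beta with GAN updates at each step.
    `data` is an (N, n) tensor already rescaled to the box X; G and D are assumed to
    have been pre-trained to the uniform-distribution equilibrium, with dim(z) = n."""
    opt_g = torch.optim.Adam(G.parameters(), lr=lr)
    opt_d = torch.optim.Adam(D.parameters(), lr=lr)
    r = (beta_max / beta_min) ** (1.0 / (n_beta - 1))        # geometric cooling factor
    beta, n = beta_min, data.shape[1]
    for _ in range(n_beta):                                  # cooling steps
        for _ in range(k_steps):                             # GAN training steps at this beta
            x = data[torch.randint(len(data), (m,))]
            x_beta = x + beta ** -0.5 * torch.randn_like(x)  # sample from p_beta (Eq. 4)
            z = 2.0 * torch.rand(m, n) - 1.0                 # uniform noise prior (interval assumed)
            # Discriminator: ascend log D(x) + log(1 - D(G(z)))
            d_loss = -(torch.log(D(x_beta)) + torch.log(1.0 - D(G(z).detach()))).mean()
            opt_d.zero_grad(); d_loss.backward(); opt_d.step()
            # Generator: descend log(1 - D(G(z))), with no non-saturating modification
            z = 2.0 * torch.rand(m, n) - 1.0
            g_loss = torch.log(1.0 - D(G(z))).mean()
            opt_g.zero_grad(); g_loss.backward(); opt_g.step()
        beta *= r                                            # increase beta geometrically
    return G, D
```

A final phase that switches from p_β to the empirical distribution, as in the last line of Algorithm 1, would follow this loop.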

The convergence of the algorithm is based on the following conjecture:

In the continuous annealing limit from the uniform distribution to the data distribution, the GAN remains stable at the equilibrium, assuming G and D have large capacity and that they are initialized at the minimax equilibrium for generating the uniform distribution (this requires dim(z) = n) in the ambient space X.

Figure 1: The schematic of β-GAN— The GAN is initialized at β = 0, corresponding to the uniform distribution. An annealing schedule is chosen to take β from zero to infinity and the GAN training is performed simultaneously, where the parameters at each β are initialized by the optimal parameters found at the previous, smaller β. The notation x_β refers to samples that come from p_β(x).
Figure 2: Three-dimensional example — Top row: the performance of vanilla GAN on a mixture of five Gaussian components in three dimensions. Middle row: the performance of β-GAN on the same dataset. Bottom row: the performance of β-GAN on the synthesized mixture of two cubes. Blue/red dots are real/generated data. To compare the computational cost, we report the total number of gradient evaluations from the start. We use the architecture G:[z(3) | ReLU(128) | ReLU(128) | Linear(3)] and D:[x(3) | Tanh(128) | Tanh(128) | Tanh(128) | Sigmoid(1)] for the generator and discriminator, where the numbers in parentheses show the number of units in each layer. The annealing parameters are […].

3 Experiments

β-GAN starts by learning to generate the uniform distribution in the ambient space of the data. The mapping that transforms the uniform distribution (we used the uniform prior for z in all our experiments) to the uniform distribution of the same dimension is an affine function. We therefore used only the ReLU nonlinearity in the generative network to make the generator's job easier. The performance of the network in generating the uniform distribution was degraded by using smooth nonlinearities like Tanh: it led to immediate mode collapse to frozen noise instead of generating high-entropy noise (see Figure 4). The mode collapse to frozen noise was especially prominent in high dimensions.
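As an illustration of this piecewise-linear choice, here is the 3D toy architecture quoted in the Fig. 2 caption, written as a PyTorch sketch (the framework and module names are our choice, not the authors').

```python
import torch.nn as nn

# Generator: ReLU-only hidden layers and a linear output layer, so nothing
# squashes the space-filling (uniform) samples it must produce at beta = 0.
G = nn.Sequential(
    nn.Linear(3, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 3),                 # linear output, no Tanh
)

# Discriminator: Tanh hidden layers with a sigmoid output (logistic regression).
D = nn.Sequential(
    nn.Linear(3, 128), nn.Tanh(),
    nn.Linear(128, 128), nn.Tanh(),
    nn.Linear(128, 128), nn.Tanh(),
    nn.Linear(128, 1), nn.Sigmoid(),
)
```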

3.1 Toy examples

To check the stability of β-GAN, we ran experiments on mixtures of 1D, 2D, and 3D Gaussians, and on a mixture of two cubic frames in 3D. The 3D results are presented here. The reported results for vanilla GAN (top row of Fig. 2) were the best among many runs; in most experiments vanilla GAN captured only one mode or failed to capture any mode. In contrast, β-GAN produced similar results consistently. In addition, vanilla GAN requires modifying the generator loss to −log D(G(z)) to avoid saturation of the discriminator (goodfellow2014generative), while in β-GAN we did not make any modification, staying with the generator loss log(1 − D(G(z))). In the experiments, the total number of training iterations in β-GAN was the same as for vanilla GAN, but distributed over many intermediate temperatures, thus curbing the computational cost. We characterize the computational cost by the total number of gradient evaluations reported in Fig. 2. We also compared the training curves of β-GAN and vanilla GAN for mixtures of five and ten Gaussians (see Fig. 3).
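The two generator objectives mentioned above, written out as a short PyTorch sketch for clarity (the function names are ours); `d_fake` stands for D(G(z)) on a batch of generated samples.

```python
import torch

def generator_loss_saturating(d_fake):
    # log(1 - D(G(z))): the original objective, kept unmodified in beta-GAN.
    return torch.log(1.0 - d_fake).mean()

def generator_loss_nonsaturating(d_fake):
    # -log D(G(z)): the common heuristic vanilla GAN needs to avoid saturation.
    return -torch.log(d_fake).mean()
```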

We also synthesized a dataset that is a mixture of two cubic frames, one enclosed by the other. This dataset is interesting since the data are located on disjoint 1D manifolds within the 3D ambient space. β-GAN performed well in this case in every run of the algorithm (see the bottom row of Fig. 2).

We should emphasize that different GAN architectures can easily be augmented with β-GAN as the outer loop. In the 3D experiments here, we chose the original generative adversarial network architecture from (goodfellow2014generative) as the inner loop (see Algorithm 1). In the next section we show the results for more sophisticated GAN architectures.

3.2 High-dimensional examples

To check the performance of our method in higher dimensions, we applied β-GAN to the MNIST dataset (lecun1998gradient), with ambient dimension n = 784, and to the CelebA dataset (liu2015faceattributes). Once again, we start from generating the uniform distribution in the ambient space of the data, and we use only piecewise linear activation functions for the generative network because of the frozen-noise mode collapse discussed earlier.

The performance of β-GAN on the MNIST dataset with a fully connected network is shown in Fig. 5. As β gradually increases, the network learns to generate noisy images corresponding to each temperature. The results converge to clean MNIST images in the last epochs of training, when the data distribution is cooled down at high values of β. During intermediate epochs, noisy digits are generated, which are nonetheless diverse. This behavior is in contrast to the training of vanilla GAN, where collapsing to a single mode is common in intermediate iterations. The same experiment was performed for the CelebA dataset with the same annealing procedure, starting from the uniform distribution and annealing to the data distribution. The results are reported in Figure 6.

Regarding annealing from the uniform distribution to the data distribution, we used the same annealing schedule in all our experiments – for mixtures of Gaussians (with different numbers of modes), the mixture of interlaced cubes, MNIST, and CelebA – and we consistently achieved the results reported here. This highlights the stability of β-GAN. We attribute this stability to the β-GAN conjecture (see Section 2), even though the annealing is not continuous in the experiments.

We emphasize that both MNIST and CelebA images were generated with dim(z) equal to the dimension of their respective ambient spaces. At the beginning, the support of the generated distribution (i.e. the uniform distribution) is the whole ambient space. β-GAN learns during annealing, step by step, to shape the space-filling samples into samples that lie on the manifold of MNIST digits and CelebA faces.
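For reference, the fully connected MNIST architecture listed in the Fig. 5 caption can be sketched as follows; the interpretation of "BNReLU(k)" as a k-unit linear layer followed by batch normalization and ReLU, and the PyTorch rendering, are our assumptions.

```python
import torch.nn as nn

def bn_relu(n_in, n_out):
    # "BNReLU(n_out)": linear layer, then batch normalization, then ReLU.
    return nn.Sequential(nn.Linear(n_in, n_out), nn.BatchNorm1d(n_out), nn.ReLU())

# Generator: dim(z) = 784 = dim(x), with a linear output layer (no squashing).
G = nn.Sequential(
    bn_relu(784, 256), bn_relu(256, 256), bn_relu(256, 256),
    nn.Linear(256, 784),
)

# Discriminator: logistic regression head on top of BNReLU blocks.
D = nn.Sequential(
    bn_relu(784, 256), bn_relu(256, 512), bn_relu(512, 512),
    nn.Linear(512, 1), nn.Sigmoid(),
)
```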

(a) β-GAN for MoG with 5 modes
(b) β-GAN for MoG with 10 modes
(c) Vanilla GAN for MoG with 5 modes
(d) Vanilla GAN for MoG with 10 modes
Figure 3: Training curves —

The curves shown here are the outputs of the discriminator (which is a classifier in this case) for the real and generated samples. For β-GAN the training curves show more stable behavior and more robustness to the complexity of the input data (a, b). In contrast, as the data get more complex, vanilla GAN's performance worsens, as signified by the growing gap between D(x) and D(G(z)) (c, d).

4 Discussion

In this work, we took a departure from current practice in training adversarial networks by giving the generative network the capacity to fill the ambient space in the form of the uniform distribution. The uniform distribution was motivated from statistical mechanics, where we imagined the data particles diffusing like ink dropped in water. The parameter β can be thought of as a surrogate for this diffusion process. There are in fact many ways to transform the data distribution to the uniform distribution. A non-Gaussian approach is to flip bits randomly in the bit representation (saremi2013hierarchical; saremi2016correlated) – this process will take any distribution to the uniform distribution in the limit of many bit flips. The starting point in β-GAN has deep consequences for the adversarial training. It is a straightforward solution to the theoretical problems raised in (arjovsky2017towards), since the results there were based on the generator having a support of measure zero in the ambient space. However, despite β-GAN's success in our experiments, the brute-force choice dim(z) = n may not be practical in very large dimensions. We are working on ideas to incorporate multi-scale representations (denton2015deep) into this framework, and are considering dimensionality reduction as a pre-processing step before feeding data into β-GAN. To emphasize the robustness of β-GAN, we reported results with a fixed annealing schedule, but we have also explored ideas from feedback control (BerthelotSM17BEGAN) to make the annealing adaptive.
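The random bit-flip path to the uniform distribution mentioned above can be illustrated with a small NumPy sketch; the function name and the 8-bit pixel encoding are assumptions for the sake of the example.

```python
import numpy as np

def flip_bits(images_uint8, p, rng=np.random.default_rng()):
    """Flip each bit of an 8-bit image independently with probability p.
    Applied repeatedly (or with p -> 1/2), this drives any pixel distribution
    toward the uniform distribution over {0, ..., 255} per coordinate."""
    flip_mask = rng.random(images_uint8.shape + (8,)) < p      # which of the 8 bits to flip
    flip_bytes = np.packbits(flip_mask, axis=-1).squeeze(-1)   # collapse the bit mask to uint8
    return images_uint8 ^ flip_bytes                           # XOR flips exactly those bits
```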

(a) Mode collapse to frozen noise
(b) Samples from uniform distribution
Figure 4: Uniform distribution generation performance — (a) The frozen-noise pattern that we observed in training when using smooth nonlinearities in the generative network (here Tanh). (b) The mode collapse to frozen noise was resolved by using the piecewise linear ReLU activation in the generator.
(a) Generated samples for
(b) Generated samples for
(c) Generated samples for
(d) Generated samples for
Figure 5: β-GAN trained on MNIST — Samples generated from MNIST during the annealing procedure. The network starts from generating the uniform distribution and gradually generates samples corresponding to each value of β. We use the fully connected architecture G:[z(784) | BNReLU(256) | BNReLU(256) | BNReLU(256) | Linear(784)] and D:[x(784) | BNReLU(256) | BNReLU(512) | BNReLU(512) | Sigmoid(1)] for the generator and discriminator, where the numbers in parentheses show the number of units in each layer. BNReLU is batch normalization (IoffeS15batchnormal) followed by the ReLU activation. The annealing parameters are the same as in the 3D experiment in Fig. 2.
(a) Generated samples for
(b) Generated samples for
(c) Generated samples for
(d) Generated samples for
Figure 6: β-GAN trained on CelebA — Samples generated from the CelebA dataset during the annealing procedure. The network starts from generating the uniform distribution and gradually generates samples corresponding to each value of β. We borrowed the DCGAN architecture from (radford2015unsupervised), except that the input noise of the generative network has the dimension of the data and the output layer is changed to linear instead of Tanh. The annealing parameters are the same as in the 3D experiment in Fig. 2.
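A minimal sketch of the output-layer change described in the Fig. 6 caption, assuming a standard DCGAN-style generator in PyTorch; the channel sizes are illustrative, and the only point is that the final Tanh is dropped in favor of a linear output.

```python
import torch.nn as nn

# Final block of a DCGAN-style generator, modified as described above:
# the ConvTranspose2d output is left linear (no Tanh squashing).
final_block = nn.Sequential(
    nn.ConvTranspose2d(128, 3, kernel_size=4, stride=2, padding=1),
    # nn.Tanh(),  # removed: beta-GAN uses a linear output layer
)
```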

Acknowledgments

SS acknowledges support from CIFAR. We also acknowledge comments by Brian Cheung on the manuscript.

References