Improved Training with Curriculum GANs

07/24/2018 ∙ by Rishi Sharma, et al. ∙ Stanford University

In this paper we introduce Curriculum GANs, a curriculum learning strategy for training Generative Adversarial Networks that increases the strength of the discriminator over the course of training, thereby making the learning task progressively more difficult for the generator. We demonstrate that this strategy is key to obtaining state-of-the-art results in image generation. We also show evidence that this strategy may be broadly applicable to improving GAN training in other data modalities.


1 Introduction

Generative Adversarial Networks (GANs) are an innovative approach to generative modeling that cast the problem of producing synthetic data as a game between two adversaries: a generator, which seeks to produce samples from the same distribution as the data, and a discriminator, whose job is to distinguish between real and generated data (Goodfellow et al., 2014). In practice, by implementing the generator and discriminator as competing deep neural networks, each trained via stochastic gradient methods, GANs are capable of producing plausible synthetic data across a wide diversity of data modalities, including natural images (Radford et al., 2015; Karras et al., 2017), natural language (Yu et al., 2016; Press et al., 2017; Fedus et al., 2018), medical records (Esteban et al., 2017) and molecules (Sanchez-Lengeling et al., 2017).

Despite these successes, training GANs via stochastic gradient methods remains unstable and prone to a variety of failure modes. This has led to a proliferation of work that focuses on improving the quality of the output of GANs by stabilizing the training procedure (Salimans et al., 2016; Poole et al., 2016; Warde-Farley and Bengio, 2016; Gulrajani et al., 2017; Karras et al., 2017). Through incremental successes, the deep generative modeling community has amassed a set of tips, tricks and hacks that have made training GANs easier (see, e.g., https://github.com/soumith/ganhacks and https://medium.com/@utk.is.here/keep-calm-and-train-a-gan-pitfalls-and-tips-on-training-generative-adversarial-networks-edd529764aa9). A patchwork of methods has emerged; specialized techniques—uniquely suited to different domains—have enabled GANs both to expand to a diversity of new domains and to continuously improve on the state-of-the-art in classical areas such as image generation. These efforts have been successful, as evidenced by the improved quality of GAN-generated faces in recent years shown in Figure 1.

Figure 1: Progress in GANs from 2014 to 2017 (image source: Brundage et al. (2018))

Nevertheless, a lack of strong evaluative metrics for GANs (Theis et al., 2015; Barratt and Sharma, 2018) has made it difficult to isolate precisely which methods have produced improvements or to reliably predict the regimes in which those improvements will hold. In this paper, we clarify which elements in the GAN “bag of tricks” have improved the application of GANs to image generation. In particular, we focus on the work of Karras et al. (2017), which gained widespread attention for its high-quality rendering of fake celebrity images via the layer-wise growing of a deep convolutional network. We simplify their model by introducing the concept of a GAN curriculum, and we argue that a well-crafted curriculum, one that gradually increases the capabilities of the discriminator, is the key to obtaining their state-of-the-art results. We remove the need for the complicated layer-wise training in their model, dramatically reducing the complexity of their setup, and still obtain results of the same (high) quality. As such, we argue that their instantiation of our framework was the primary contributing factor to their high quality images.

We also obtain preliminary results that indicate this technique may be generally applicable in broader GAN settings and capable of improving GANs beyond image generation. In Section 4, we point to literature in natural language generation and text-to-image synthesis that takes advantage of related techniques, suggesting that informal variants of the method we formalize have already led to success in training GANs in other areas.

A recent paper by Lucic et al. (2017) rebuked many of the claims of improved GAN training and performance by conducting a large-scale, multi-faceted study and finding little evidence that newer training setups outperform the original GAN of Goodfellow et al. (2014). The paper accurately argued that improving GANs generally is an altogether more difficult task than improving GANs designed for a specific purpose, e.g., image generation. Despite the advances made by GANs using the “bag of tricks”, the field has so far failed to build training procedures that outperform the original setup broadly, across many data modalities and tasks. Understanding why the bag of tricks has worked in specific environments is essential to eventually succeeding in creating training methods that in fact improve GANs in the general setting. As such, in addition to their utility in image generation, we believe that our findings represent an important step towards improving GANs generally.

2 Designing a Curriculum for Generative Adversarial Networks

In this section, we describe a general method for GAN training that helps to prevent instabilities during training and thus improve the quality of the final learned generator parameters. The main idea behind our method is to construct a training regimen for the generator that consists of increasingly difficult tasks. This allows the sophistication of the generator to gradually increase throughout training, rather than aiming for full sophistication at the outset. This method is similar to that of a curriculum in supervised learning, where one orders the training examples presented to a learning algorithm according to some measure of difficulty (Bengio et al., 2009). Despite the conceptual similarity, the methods are in fact quite different. Under our approach, it is not the difficulty of the training examples presented to either network, but rather the capacity, and hence strength, of the discriminator that is increased throughout training.

2.1 Preliminaries

The goal in generative modeling is to learn the probability distribution $p_x$ of some random variable $x$ from a dataset of samples drawn from $p_x$. In GANs (and other deep generative models), it is common practice to define a random variable $z$ with a known, fixed distribution $p_z$. The generator is then defined as a parametric function $G_\theta$ that transforms $z$ into artificial samples $G_\theta(z)$. Implicitly, $G_\theta(z)$ is now a random variable with a distribution that we denote $p_\theta$. It is easy to sample from $p_\theta$, as all we need to do is sample $z \sim p_z$ and then emit $G_\theta(z)$. The goal is to learn the parameters $\theta$ of the generator so that $p_\theta$ is as similar as possible to $p_x$. To do this, we first define a discriminator $D_\phi$ that maps (real and artificial) samples to a real number that normally corresponds to some estimate of the distance between $p_\theta$ and $p_x$, for example, the Kullback-Leibler divergence or Wasserstein distance. Then we set up a two-player game, wherein we alternate between learning the distance between $p_\theta$ and $p_x$ and taking an (unbiased) gradient step to decrease that distance.

2.2 WGAN

We now describe the Wasserstein GAN (WGAN) formulation of GAN training (Arjovsky et al., 2017), beginning with the definition of the earth-mover distance between two probability distributions.

Definition 2.1 (Earth-Mover (EM) Distance).

The Earth-Mover (EM) distance, or Wasserstein-1 distance, is defined as

$$W(p_x, p_\theta) = \inf_{\gamma \in \Pi(p_x, p_\theta)} \mathbb{E}_{(x, \tilde{x}) \sim \gamma}\left[\|x - \tilde{x}\|\right],$$

where $\Pi(p_x, p_\theta)$ denotes the set of all joint distributions $\gamma(x, \tilde{x})$ whose marginals are $p_x$ and $p_\theta$. We will refer to this quantity as the EM distance and Wasserstein distance interchangeably.

Roughly, the Wasserstein distance corresponds to the amount of “effort” required to transform one probability distribution to the other. One major benefit of the EM distance is that it is well-defined when the supports of the distributions are non-overlapping. By the Kantorovich-Rubinstein duality (Villani, 2008), we know that the Wasserstein distance satisfies

$$W(p_x, p_\theta) = \sup_{\|f\|_L \leq 1} \; \mathbb{E}_{x \sim p_x}[f(x)] - \mathbb{E}_{\tilde{x} \sim p_\theta}[f(\tilde{x})], \qquad (1)$$

where the supremum is over all the 1-Lipschitz functions $f$. Therefore, the optimal value of the following optimization problem

$$\begin{array}{ll} \underset{\phi}{\text{maximize}} & \mathbb{E}_{x \sim p_x}[D_\phi(x)] - \mathbb{E}_{z \sim p_z}[D_\phi(G_\theta(z))] \\ \text{subject to} & \|\nabla_x D_\phi(x)\|_2 \leq 1 \ \text{for all } x \end{array} \qquad (2)$$

is the Wasserstein distance, provided the optimum attains the supremum in (1). Here, we have used the fact that a differentiable function is 1-Lipschitz if and only if it has a gradient norm of at most 1 everywhere. If we solve (2), we can then take stochastic gradients of the Wasserstein distance, because

$$\nabla_\theta W(p_x, p_\theta) = -\mathbb{E}_{z \sim p_z}\left[\nabla_\theta D_\phi(G_\theta(z))\right]. \qquad (3)$$

See Theorem 3 in (Arjovsky et al., 2017) for a proof. The main issue with this process is that the gradient constraint in (2) is challenging to enforce, because it needs to be satisfied for all $x \in \mathbb{R}^n$, but a compromise is to use what is called the gradient penalty method (Gulrajani et al., 2017). Our problem then becomes the unconstrained problem

$$\underset{\phi}{\text{maximize}} \quad \mathbb{E}_{x \sim p_x}[D_\phi(x)] - \mathbb{E}_{z \sim p_z}[D_\phi(G_\theta(z))] - \beta \, \mathbb{E}_{\hat{x} \sim p_{\hat{x}}}\left[\left(\|\nabla_{\hat{x}} D_\phi(\hat{x})\|_2 - 1\right)^2\right], \qquad (4)$$

where $p_{\hat{x}}$ is some sampling distribution for $\hat{x}$ and $\beta$ is the penalty parameter. We can solve this optimization problem with a stochastic gradient method, and then solve the outer optimization problem of minimizing the EM distance by another stochastic gradient method using the gradient in (3).
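To make the penalty term in (4) concrete, here is a minimal PyTorch sketch of the gradient penalty, with $\hat{x}$ sampled on lines between paired real and generated samples as in Gulrajani et al. (2017). The function name and interface are our own illustration, not the authors' released code:

```python
import torch

def gradient_penalty(D, real, fake, beta=10.0):
    # Sample x_hat uniformly on segments between paired real and fake
    # samples; this is the sampling distribution p_xhat used in (4).
    eps = torch.rand(real.size(0), *([1] * (real.dim() - 1)), device=real.device)
    x_hat = (eps * real + (1.0 - eps) * fake.detach()).requires_grad_(True)
    d_hat = D(x_hat)
    # Gradient of the critic output with respect to x_hat.
    grads, = torch.autograd.grad(outputs=d_hat.sum(), inputs=x_hat,
                                 create_graph=True)
    grad_norm = grads.flatten(1).norm(2, dim=1)
    # Penalize deviation of the gradient norm from 1 (the 1-Lipschitz target).
    return beta * ((grad_norm - 1.0) ** 2).mean()
```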

2.3 Curriculum WGAN

In our formulation, instead of fixing one discriminator $D$, we consider convex combinations of a pre-defined set of discriminators $D_1, \dots, D_k$, with the weighting denoted by $\lambda \in \mathbb{R}^k_{+}$ such that $\sum_{i=1}^k \lambda_i = 1$. This means that our discriminator function can be written as $D(x) = \sum_{i=1}^k \lambda_i D_i(x)$. We also impose that the $D_i$ come from increasingly large function classes, or that $D_i \in \mathcal{F}_i$ with $\mathcal{F}_1 \subseteq \mathcal{F}_2 \subseteq \cdots \subseteq \mathcal{F}_k$. Intuitively, one can view the weight $\lambda$ as modulating the “strength” of the discriminator. One can also interpret $\lambda$ as an attention mechanism on the overall discriminator.
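As a concrete illustration (a sketch with hypothetical names, not the authors' implementation), the effective discriminator is simply a weighted sum of sub-discriminator outputs:

```python
def combined_critic(discriminators, lam):
    """Return the effective critic D(x) = sum_i lam[i] * discriminators[i](x).

    `discriminators` is a list [D_1, ..., D_k]; `lam` is a list of
    nonnegative weights summing to 1.
    """
    def D(x):
        # Skip zero-weight terms so inactive discriminators get no gradient.
        return sum(w * d(x) for w, d in zip(lam, discriminators) if w != 0.0)
    return D
```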

We further impose that each $\mathcal{F}_i$ is a convex set of functions (note that this is different from a set of convex functions).

Definition 2.2 (Convex set of functions).

A set of functions $\mathcal{F}$ is convex if for all $f_1, f_2 \in \mathcal{F}$ and $\alpha \in [0, 1]$, we have that $\alpha f_1 + (1 - \alpha) f_2 \in \mathcal{F}$.

It will become clear later why we need this assumption, and we will also see that it is satisfied by the neural network discriminators we use. Given these assumptions, we can now define a partial order on the weight vector $\lambda$. As we will see, this ordering corresponds to the strength of the discriminator, or equivalently, the difficulty the generator will have in lowering the Wasserstein distance.

Definition 2.3 (Partial Ordering on $\lambda$).

We write $\lambda \succeq \lambda'$ if the set of functions $\{\sum_{i=1}^k \lambda_i f_i : f_i \in \mathcal{F}_i\}$ contains the set $\{\sum_{i=1}^k \lambda'_i f_i : f_i \in \mathcal{F}_i\}$. We also equivalently say that $\lambda$ dominates $\lambda'$. If neither $\lambda \succeq \lambda'$ nor $\lambda' \succeq \lambda$, we write $\lambda \parallel \lambda'$, meaning that neither $\lambda$ nor $\lambda'$ dominates the other.

A sufficient condition for $\lambda$ to dominate $\lambda'$ is that $\lambda$'s backwards cumulative sum is always greater than $\lambda'$'s backwards cumulative sum, or

$$\sum_{i=j}^{k} \lambda_i \;\geq\; \sum_{i=j}^{k} \lambda'_i \qquad (5)$$

for all $j = 1, \dots, k$. The following example illustrates why this is a sufficient condition, at least for $k = 2$.

Example 2.1.

Let

$$D(x) = \lambda_1 f_1(x) + \lambda_2 f_2(x), \qquad (6)$$

where $f_1 \in \mathcal{F}_1$, $f_2 \in \mathcal{F}_2$, and $\mathcal{F}_1$ and $\mathcal{F}_2$ are convex sets of functions such that $\mathcal{F}_1 \subseteq \mathcal{F}_2$. Here, $k = 2$ and $\lambda = (\lambda_1, \lambda_2)$, and we have that $\lambda \succeq \lambda'$ if $\lambda_2 \geq \lambda'_2$. Suppose that in fact $\lambda_2 \geq \lambda'_2$, and consider any $D'(x) = \lambda'_1 f_1(x) + \lambda'_2 f_2(x)$. Then let $g_1 = f_1$ and $g_2 = \frac{(\lambda'_1 - \lambda_1) f_1 + \lambda'_2 f_2}{\lambda_2}$, which is in $\mathcal{F}_2$ because $\mathcal{F}_2$ contains all convex combinations of functions inside it, $f_1, f_2 \in \mathcal{F}_2$, and $(\lambda'_1 - \lambda_1) + \lambda'_2 = \lambda_2$. Then $\lambda_1 g_1 + \lambda_2 g_2 = D'$, and we have shown that $D'$ is representable by $\lambda$.

Continuing the logic in Example 2.1 by induction, it is easy to see that (5) is a sufficient condition.
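For intuition, condition (5) is easy to check numerically. The following small sketch is our own illustration, not from the paper:

```python
def dominates(lam, lam_prime, tol=1e-12):
    """Check the sufficient condition (5): every backwards cumulative
    sum of lam is at least the corresponding sum of lam_prime."""
    tail, tail_p = 0.0, 0.0
    for a, b in zip(reversed(lam), reversed(lam_prime)):
        tail += a
        tail_p += b
        if tail < tail_p - tol:
            return False
    return True

# Shifting weight toward the stronger (higher-index) discriminator dominates.
assert dominates([0.2, 0.8], [0.5, 0.5])
assert not dominates([0.5, 0.5], [0.2, 0.8])
```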

1: Require: $\alpha$, optimization algorithm parameters; $\beta$, gradient penalty parameter; $m$, the batch size; $n_{\text{critic}}$, number of inner critic iterations.
2: Require: $\phi_0$, $\theta_0$, initial network parameters; $(\lambda_t)$, a sequence where $\lambda_{t+1} \succeq \lambda_t$.
3: while $\theta$ has not converged do
4:      $\lambda \leftarrow$ next $\lambda_t$.
5:     Let $D = \sum_{i} \lambda_i D_i$.
6:     for $j = 1, \dots, n_{\text{critic}}$ do
7:         Sample a batch of real data $\{x^{(i)}\}_{i=1}^m \sim p_x$.
8:         Sample a batch of prior samples $\{z^{(i)}\}_{i=1}^m \sim p_z$.
9:         Sample a batch of random weights $\{\epsilon^{(i)}\}_{i=1}^m \sim U[0, 1]$.
10:         $\tilde{x}^{(i)} \leftarrow G_\theta(z^{(i)})$
11:         $\hat{x}^{(i)} \leftarrow \epsilon^{(i)} x^{(i)} + (1 - \epsilon^{(i)}) \tilde{x}^{(i)}$
12:         $L \leftarrow \frac{1}{m} \sum_{i=1}^m \left[ D(\tilde{x}^{(i)}) - D(x^{(i)}) + \beta \left( \|\nabla_{\hat{x}^{(i)}} D(\hat{x}^{(i)})\|_2 - 1 \right)^2 \right]$
13:         $\phi \leftarrow \mathrm{Adam}(\nabla_\phi L, \phi, \alpha)$
14:     end for
15:     Sample a batch of prior samples $\{z^{(i)}\}_{i=1}^m \sim p_z$.
16:     $g_\theta \leftarrow -\nabla_\theta \frac{1}{m} \sum_{i=1}^m D(G_\theta(z^{(i)}))$
17:     $\theta \leftarrow \mathrm{Adam}(g_\theta, \theta, \alpha)$
18: end while
Algorithm 1 Curriculum WGAN

The WGAN Curriculum (WGAN-C) algorithm that we propose is summarized in Algorithm 1. It is essentially the Improved WGAN algorithm (Gulrajani et al., 2017), altered to include the curriculum as defined by an increase of $\lambda$ (in the partial order $\succeq$) on each iteration to strengthen the discriminator over the course of training (although we include the Wasserstein GAN algorithm here, this technique is also compatible with the original GAN objective). The basic idea is to define a curriculum of increasingly difficult $\lambda$s, made quantitative by the constraint that $\lambda_{t+1} \succeq \lambda_t$. Our thesis is that slowly unshackling the discriminator, and thus increasing the difficulty of the learning task presented to the generator, will lead to a more stable learning algorithm. (See the Supplementary Materials for a rough connection of our method to trust-region methods in optimization.)
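A compressed PyTorch sketch of Algorithm 1 follows, reusing `combined_critic` and `gradient_penalty` from the sketches above. The data and prior samplers, schedule object, and optimizer settings are illustrative assumptions, not the authors' configuration:

```python
import torch

def train_wgan_c(G, discriminators, lam_schedule, batches, sample_z,
                 n_critic=5, beta=10.0, lr=1e-4, steps=10_000):
    # `batches` yields real-data batches indefinitely; `sample_z(m)` draws
    # m prior samples; `lam_schedule(t)` returns lambda_t with
    # lambda_{t+1} >= lambda_t in the partial order. All names hypothetical.
    d_params = [p for d in discriminators for p in d.parameters()]
    opt_d = torch.optim.Adam(d_params, lr=lr, betas=(0.0, 0.9))
    opt_g = torch.optim.Adam(G.parameters(), lr=lr, betas=(0.0, 0.9))
    for t in range(steps):
        lam = lam_schedule(t)
        D = combined_critic(discriminators, lam)
        for _ in range(n_critic):            # inner critic iterations
            x = next(batches)
            x_tilde = G(sample_z(x.size(0))).detach()
            loss_d = (D(x_tilde) - D(x)).mean() \
                     + gradient_penalty(D, x, x_tilde, beta)
            opt_d.zero_grad()
            loss_d.backward()
            opt_d.step()
        # Generator step: descend the (approximate) EM distance.
        loss_g = -D(G(sample_z(x.size(0)))).mean()
        opt_g.zero_grad()
        loss_g.backward()
        opt_g.step()
    return G
```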

We are now able to rigorously define the strength of the discriminator, or equivalently the difficulty level for the generator.

Definition 2.4 ($\epsilon$-fooling the discriminator).

We say that a generator $G_\theta$ $\epsilon$-fools a discriminator $D$ for some $\epsilon > 0$ if the optimal value of the maximization problem in (2) is less than $\epsilon$. That is to say, if the learned discriminator results in a Wasserstein distance of less than $\epsilon$, we consider it $\epsilon$-fooled.

Theorem 2.1 (Generator Curriculum).

Suppose $\lambda \succeq \lambda'$, and let $D = \sum_i \lambda_i D_i$ and $D' = \sum_i \lambda'_i D_i$. If $D$ is $\epsilon$-fooled by a generator $G_\theta$, then $D'$ is also $\epsilon$-fooled by $G_\theta$.

Proof: Since the set of functions representable by $\lambda'$ is contained in the set representable by $\lambda$, the maximum value attained when optimizing over the latter set will necessarily be at least that attained over the former. Thus, if the optimal discriminator in the latter set is $\epsilon$-fooled, then the optimal discriminator in the former set is necessarily $\epsilon$-fooled.

This means that at iteration $t$ of the algorithm, the discriminator can produce a lower bound on the Wasserstein distance, which we denote $\widetilde{W}_t$. Because of our ordering of discriminators, the $\widetilde{W}_t$ are monotonically increasing throughout the algorithm, or $\widetilde{W}_{t+1} \geq \widetilde{W}_t$ for a fixed generator by Theorem 2.1. Thus, even though at iteration $t$ we calculate an (approximate) Wasserstein distance, our generator can safely minimize $\widetilde{W}_t$ at each iteration, because it is necessarily a lower bound on the actual Wasserstein distance. Because we are always minimizing a lower bound on the true objective, this makes the optimization algorithm more stable. Intuitively, as more capacity is allowed to the discriminator throughout the algorithm, the task of minimizing the lower bounds becomes more and more challenging. However, the generator has already (approximately) minimized all of the previous lower bounds, so it is well suited for the harder task.

3 Experiments

We evaluate the Curriculum WGAN on a sinusoid generation task and a celebrity image synthesis task. We refer the reader to Section A in the Supplementary Materials for more details on how to design a curriculum of generators for images and time series data, as well as an argument for why neural networks form a convex set of functions.

3.1 Sinusoids

In the sinusoid generation task, the goal is to generate one-dimensional sine waves. In this case, the generator and discriminators are two-layer neural networks. The generator attempts to output sine waves of a fixed length $T$. We define a sequence of discriminators $D_1, \dots, D_k$, where $D_i$ acts only on the first $iT/k$ points of the input, for $i = 1, \dots, k$. This guarantees that $\mathcal{F}_i \subseteq \mathcal{F}_{i+1}$, as the $i$th discriminator can only act on fewer points of the input than the $(i+1)$th. There are lots of options, then, for the curriculum of $\lambda$s. The simplest schedule (that we use in our experiments) is to set $\lambda = e_1$ for some fixed number of iterations, then set $\lambda = e_2$ for the same number of iterations, all the way up to $\lambda = e_k$, where $e_i$ denotes the $i$th standard basis vector. This also has the advantage of sparsity; at any one time we are updating only one discriminator. In our experiments, when we update $\lambda$, we simply randomly initialize a new discriminator network with the appropriately sized input.
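The schedule just described is straightforward to express in code. A sketch under the paper's setup, with hypothetical names:

```python
def sinusoid_lambda(step, steps_per_phase, k):
    """All weight on discriminator i during phase i (lambda = e_i),
    advancing every `steps_per_phase` iterations, up to e_k."""
    phase = min(step // steps_per_phase, k - 1)
    lam = [0.0] * k
    lam[phase] = 1.0
    return lam

def prefix_for(i, series, k):
    """Input seen by the (i+1)-th discriminator (0-indexed i): the first
    (i+1) * T / k points of the series, so that F_1 <= F_2 <= ... <= F_k."""
    T = series.shape[-1]
    return series[..., : (i + 1) * T // k]
```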

Figure 2: The plot on the left shows sinusoids generated with a progressive lengthening strategy to create a curriculum GAN. On the right are sinusoids generated with the same setup but no progressive lengthening. The generated sinusoids improve significantly when discriminator attention is grown (via progressive lengthening) instead of training the discriminator on full length sequences to start.

We run the WGAN-C algorithm with the schedule as described above, and contrast it to the WGAN algorithm (without curriculum). In fact, WGAN-C runs much faster, as the sequence length increases throughout training. Output from WGAN-C is shown side-by-side with WGAN in Figure 2. It is easy to see visually that WGAN-C outperforms WGAN. To the best of our knowledge, no experiment on progressive lengthening of time-series data has been undertaken to date. We note that a similar experiment was performed in (Esteban et al., 2017), but without a curriculum. We have also run the same experiments with the original GAN objective (Goodfellow et al., 2014), and the results indicate that the improvements afforded by curriculum training are present independent of the choice of GAN training algorithm.

In addition to the visual comparison, we measured the average $\ell_2$ error of the generated waves to the closest sine wave in the dataset (by discretizing the range of sinusoids that generate the dataset). At the end of training, the average minimum $\ell_2$ distance from an element of the training dataset is 33.6% lower for the sinusoids generated with a progressive lengthening strategy than for those generated without one.
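The evaluation metric can be sketched as follows, assuming (hypothetically) that the dataset sinusoids vary over a known discretized grid of frequencies and phases; the exact grid used in the experiments is not specified here:

```python
import numpy as np

def min_l2_to_sine_family(sample, t, freqs, phases):
    """Minimum l2 distance from a generated wave `sample` (evaluated at
    times `t`) to the closest sine wave in a discretized family.
    Averaged over generated samples, this gives the reported error."""
    best = np.inf
    for f in freqs:
        for p in phases:
            ref = np.sin(2 * np.pi * f * t + p)
            best = min(best, float(np.linalg.norm(sample - ref)))
    return best
```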

3.2 CelebA-HQ

Figure 3: Our training setup for CelebA-HQ. Discriminators $D_1, \dots, D_5$ operate on successively downsampled versions of the real and fake images. In this figure, each downsample operation reduces the image size by a factor of 2. Our curriculum uses 5 discriminators, and the downsampling factor is 2. Thus, the discriminators range from operating on $64 \times 64$ to $4 \times 4$ images.

The image synthesis task provides a direct comparison to the work of Karras et al. (2017), which uses a progressive growing strategy to deliver state-of-the-art image synthesis results on the CelebA-HQ dataset. We duplicate their training setup on CelebA-HQ, replacing their strategy of progressively growing the size of the generator and discriminator networks with our simpler method of progressively growing the discriminator attention to create a curriculum. We achieve very similar results, which indicates that their method is a special case of our more general technique. See the discussion in Section 4 for more.

The CelebA-HQ dataset is a refinement of the CelebA dataset (Liu et al., 2015) that consists of high-quality celebrity images that have been centered and cropped to maximize the view of the face. We use the training setup depicted in Figure 3, in which the curriculum is determined by the image average downsampling operator defined in Definition A.1. We have a separate discriminator for each successively downsampled version of the final image, down to the coarsest $4 \times 4$ scale. The full image size output by the generator is $64 \times 64$. We begin with $\lambda = e_1$, meaning that the effective discriminator only considers the most heavily downsampled versions of the generated outputs and training data. We slowly change $\lambda$ to focus on higher fidelity versions of the images.
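A sketch of the effective image discriminator under this setup (our own illustration, with hypothetical names; the downsampling operator is Definition A.1, implemented here with average pooling):

```python
import torch.nn.functional as F

def multiscale_critic(discriminators, lam, x):
    """Effective image critic: discriminators[0] sees the coarsest scale
    (e.g., 4x4) and discriminators[-1] the full image (e.g., 64x64);
    `lam` holds the curriculum weights."""
    n = len(discriminators)
    score = 0.0
    for i, (w, d) in enumerate(zip(lam, discriminators)):
        if w == 0.0:
            continue
        factor = 2 ** (n - 1 - i)   # e.g., 16, 8, 4, 2, 1 for five scales
        xi = F.avg_pool2d(x, factor) if factor > 1 else x
        score = score + w * d(xi)
    return score
```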

Throughout training, we follow the same schedule for switching to discriminators that operate on larger images (less downsampled versions of the output) that Karras et al. (2017) use to progressively grow the output size of their GAN, and our results come out substantially similar. Our findings suggest that their results are achieved by the effective use of a curriculum to slowly build the strength of the generator, rather than from the more complicated layerwise pretraining aspect of their method.

The bottom row in Figure 4 shows the results of Karras et al. (2017), and the top row shows our results for $64 \times 64$ images. In Figure 5, the bottom row is the converged smaller images generated by Karras et al. (2017), and the top row is our corresponding downsampled images.

We would note that Karras et al. (2017) were able to output $1024 \times 1024$ images, due in part to the sheer quantity of images their network was able to see, as a consequence of their method beginning with smaller-sized outputs and gradually increasing output size. They are able to train with much larger batch sizes in the early stages, when the output size of the network is small. We output $64 \times 64$ images in order to obtain our results in a reasonable amount of time (4 days of training on an NVIDIA GeForce GTX 1080 Ti). Due to the batch size constraints imposed by GPU memory, our method, which always outputs full-sized images, takes comparatively longer to produce final outputs. In regimes where GPU memory is not highly restrictive, our method should perform equally well at generating larger images on a similar time scale. We have initiated higher-resolution image experiments and plan to include the results as soon as they are available; for comparison purposes, the Karras et al. (2017) experiments took 20 days to run on an NVIDIA Tesla P100.

Figure 4: The bottom row shows the final output of Karras et al. (2017) for $64 \times 64$ images, and the top row shows our output images. These images were not cherry-picked for quality, though some especially bad images were discarded during random selection.
Figure 5: The bottom row shows the converged smaller images generated by Karras et al. (2017) over the course of training. The top row shows our corresponding downsampled images at the same points in training. The two methods behave very similarly throughout the training process, in addition to having similar outputs.

4 Discussion and Related Work

Bengio et al. (2009) introduced curriculum learning in the context of machine learning, formalizing the intuition that agents learn better when presented with a curriculum, i.e., a series of tasks of increasing difficulty. Their work primarily explored curriculum strategies in the context of classification and sequence prediction, showing that curriculum strategies help more quickly find local minima of non-convex loss functions. Curriculum strategies have since found particularly widespread use in training recurrent neural networks (Zaremba and Sutskever, 2014; Bengio et al., 2015), and it is typical to see some variant of “teacher forcing” or “scheduled sampling” in applications where recurrent architectures are used.

Although it was not explicitly expressed as such, we view (Karras et al., 2017) in part as an application of curriculum learning to the training of a GAN in the context of image modeling. This view motivated our work to strip extraneous elements from the progressive growing strategy and test our framework.

When GANs have been applied to language modeling, they have frequently inherited the recurrent neural network architectures common in natural language tasks. Press et al. (2017) use a recurrent architecture for language generation, combining a GAN with a progressive lengthening curriculum. We see this as another instantiation of our framework of GAN curricula, contributing to the success of their model on the language generation task. Fedus et al. (2018) also use a recurrent architecture along with a GAN for the task of filling in the blank in sentences. The size of the blank in the sentence is grown over the course of training, until ultimately the language model simply outputs natural language. Their model is made somewhat more complex by the use of reinforcement learning for training on a discrete output space (Sutton and Barto, 1998), which makes it harder to read as a pure instantiation of our framework, but the method nonetheless makes significant use of a curriculum strategy.

Though these works in natural language generation used a variant of curriculum learning during training, the method has largely been inherited along with the recurrent networks they use, despite not being properly motivated in the context of GANs. The training setup of GANs bears little resemblance to the supervised curriculum learning studied in (Bengio et al., 2009), which further motivated our work to establish empirically the efficacy of a curriculum strategy when using GANs.

Our work may also be connected to state-of-the-art work in text-to-image synthesis, though the link is more tenuous. Zhang et al. (2017) use multiple discriminators and multiple generators for different components of the text-to-image task, breaking the generation process into easier constituent parts and tasking a separate GAN with learning each component. The reason this method works may be connected to the reason curriculum strategies are effective: both start by teaching the generator simpler tasks and build upon the initial successes to formulate the final outputs.

As discussed earlier, Lucic et al. (2017) demonstrated in their recent paper that GAN improvements in areas such as image generation cannot be universally applied to improve GANs for other purposes. Though the value of improving GANs for specific domains should not be diminished, it would obviously be more desirable to find general principles for improving all GANs. With an eye to discovering these general principles, our results provide a clear demonstration of the positive effects of curriculum learning on image generation by GANs. Along with the prevalence of curriculum learning in training methods that have shown success across other data modalities, our formalization of curriculum learning as a standalone training method for GANs calls for further investigation into its relevance to GANs generally. We intend to develop training procedures based on curriculum GANs in other contexts, with the aim of uncovering a broadly applicable training method. Thus, we hope that we have taken an important first step towards meeting the challenge that Lucic et al. (2017) present: to discover training methods that can be applied to enhance all GANs.

References

Appendix A Designing Discriminators

WGAN-C requires sets of functions $\mathcal{F}_1, \dots, \mathcal{F}_k$ such that each $\mathcal{F}_i$ is a convex set of functions and $\mathcal{F}_1 \subseteq \mathcal{F}_2 \subseteq \cdots \subseteq \mathcal{F}_k$. Normally, these sets will be neural networks operating on different versions of the input; however, any variation that satisfies the constraints we lay out fits in our framework.

We now show that when $\mathcal{F}_i$ is the set of functions representable by a neural network, $\mathcal{F}_i$ is a convex set of functions, assuming our neural networks have enough capacity. Recall the universal approximation theorem: a neural network with enough capacity can approximate any continuous (and hence any differentiable) function arbitrarily well on a closed, bounded subset of $\mathbb{R}^n$ (Hornik et al., 1989). Because $\mathcal{F}_i$ then includes all continuous functions, and because a convex combination of a finite number of continuous functions is continuous, we can conclude that $\mathcal{F}_i$ is a convex set of functions.

We now describe how our framework can be (and, in disguise, already has been) applied to several application areas.

A.1 Images

When $x$ is an image, we have that $x \in \mathbb{R}^{w \times h \times c}$, where $w$ is the width, $h$ is the height, and $c$ is the number of channels. One option for our discriminators, in this case, is for them to be (deep) convolutional networks applied to downsampled versions of the image.

Definition A.1 (Image Average Downsampling).

Given an image $x \in \mathbb{R}^{w \times h \times c}$, its downsampled version $\mathrm{ds}(x) \in \mathbb{R}^{(w/2) \times (h/2) \times c}$, defined for even $w$ and $h$, is given by taking the average of every non-overlapping $2 \times 2$ block of the image. Thus, we have that

$$\mathrm{ds}(x)_{ijl} = \tfrac{1}{4}\left(x_{2i-1,\,2j-1,\,l} + x_{2i-1,\,2j,\,l} + x_{2i,\,2j-1,\,l} + x_{2i,\,2j,\,l}\right).$$

Assuming that our image dimensions are the same ($w = h$) and that $w = 2^m$ for some integer $m$, which can be accomplished with interpolation, we can define a sequence of downsampled images, given by $\mathrm{ds}^{m}(x), \mathrm{ds}^{m-1}(x), \dots, \mathrm{ds}(x), x$. The first of these is the average of the entire image, and the last is the original image. Our discriminators, then, can be (deep) convolutional networks applied to each of these downsampled images. It is important to note that image average downsampling is differentiable, which means that we can backpropagate through it. An illustration of this training pipeline is displayed in Figure 3.
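As a sanity check, repeated average downsampling as defined above can be written with `avg_pool2d`, which is differentiable. A minimal sketch, with assumed tensor shapes:

```python
import torch
import torch.nn.functional as F

def downsample_sequence(x):
    """Given a (batch, c, w, w) image tensor with w a power of two, return
    [ds^m(x), ..., ds(x), x]: the 1x1 global average first, the original last."""
    seq = [x]
    while x.shape[-1] > 1:
        x = F.avg_pool2d(x, 2)   # average each non-overlapping 2x2 block
        seq.append(x)
    return seq[::-1]

x = torch.randn(8, 3, 64, 64, requires_grad=True)
scales = downsample_sequence(x)
assert scales[0].shape[-1] == 1 and scales[-1] is x
scales[0].sum().backward()       # gradients flow through the downsampling
```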

A.2 Sequences

In the domain of natural language, a natural way to design a curriculum for the generator is to train it to produce progressively longer sequences of words that are processed by the discriminator. It is already common to use this strategy when the generator is implemented by a recurrent architecture, but this holds independent of the architecture of the generator. With time series data, a natural way to design a curriculum for the generator is to train it to produce progressively longer time series that are processed by the discriminator.

Appendix B Connection to Trust-Region Methods

Trust-region methods in numerical optimization define a bounded region around the current iterate within which they trust a simplified model to be an adequate representation of the objective function, and then choose an update that is the approximate minimizer of the model in this trust region (Nocedal and Wright, 2006). Our method is (roughly) a trust-region method for GAN training, as we gradually increase the class of functions (the trust region) our discriminator can take on throughout training, which in turn increases the possible gradient updates for the generator.

Appendix C Sinusoid Generation Wasserstein Distance Plots

Figure 6: Wasserstein distance over the course of training for WGAN-C. Note the spikes once $\lambda$ is changed.
Figure 7: Wasserstein Distance over the course of training for standard WGAN.