Stackelberg GAN: Towards Provable Minimax Equilibrium via Multi-Generator Architectures

11/19/2018 ∙ by Hongyang Zhang, et al. ∙ Petuum, Inc. Carnegie Mellon University berkeley college 8

We study the problem of alleviating the instability issue in the GAN training procedure via new architecture design. The discrepancy between the minimax and maximin objective values could serve as a proxy for the difficulties that alternating gradient descent encounters in the optimization of GANs. In this work, we give new results on the benefits of the multi-generator architecture of GANs. We show that the minimax gap shrinks to ϵ as the number of generators increases with rate O(1/ϵ). This improves over the best-known result of O(1/ϵ^2). At the core of our techniques is a novel application of the Shapley-Folkman lemma to the generic minimax problem, where in the literature the technique was only known to work when the objective function is restricted to the Lagrangian function of a constrained optimization problem. Our proposed Stackelberg GAN performs well experimentally on both synthetic and real-world datasets, improving Fréchet Inception Distance by 14.61% over the previous multi-generator GANs on the benchmark datasets.


1 Introduction

Generative Adversarial Nets (GANs) are emerging objects of study in machine learning, computer vision, natural language processing, and many other domains. In machine learning, study of such a framework has led to significant advances in adversarial defenses [28, 24] and machine security [4, 24]. In computer vision and natural language processing, GANs have resulted in improved performance over standard generative models for images and texts [13], such as variational autoencoder [16] and deep Boltzmann machine [22]. A main technique to achieve this goal is to play a minimax two-player game between generator and discriminator under the design that the generator tries to confuse the discriminator with its generated contents and the discriminator tries to distinguish real images/texts from what the generator creates.

Despite the large number of GAN variants, many fundamental questions remain unresolved. One of the long-standing challenges is designing universal, easy-to-implement architectures that alleviate the instability issue of GAN training. Ideally, GANs are supposed to solve the minimax optimization problem [13], but in practice alternating gradient descent methods do not clearly privilege minimax over maximin or vice versa (page 35, [12]), which may lead to instability in training if there exists a large discrepancy between the minimax and maximin objective values. The focus of this work is on improving the stability of this minimax game in the training process of GANs.

To alleviate the issues caused by the large minimax gap, our study is motivated by the zero-sum Stackelberg competition [25] in the domain of game theory. In the Stackelberg leadership model, the players of this game are one leader and multiple followers, where the leader firm moves first and then the follower firms move sequentially. It is known that the Stackelberg model can be solved to find a subgame perfect Nash equilibrium. We apply this idea of the Stackelberg leadership model to the architecture design of GANs. That is, we design an improved GAN architecture with multiple generators (followers) which team up to play against the discriminator (leader). We therefore name our model Stackelberg GAN. Our theoretical and experimental results establish that GANs with multi-generator architecture have a smaller minimax gap and enjoy more stable training performance.

Figure 1: Left Figure, Top Row: Standard GAN training on a toy 2D mixture of 8 Gaussians. Left Figure, Bottom Row: Stackelberg GAN training with an ensemble of 8 generators, each of which is denoted by one color. Right Figure: Stackelberg GAN training with an ensemble of 10 generators on the Fashion-MNIST dataset without cherry-picking, where each row corresponds to one generator.

Our Contributions. This paper tackles the problem of instability during the GAN training procedure with both theoretical and experimental results. We study this problem by new architecture design.

  • We propose the Stackelberg GAN framework of multiple generators in the GAN architecture. Our framework is general since it can be applied to all variants of GANs, e.g., vanilla GAN, Wasserstein GAN, etc. It is built upon the idea of jointly optimizing an ensemble of GAN losses w.r.t. all pairs of discriminator and generator.

    Differences from prior work. Although the idea of having multiple generators in the GAN architecture is not totally new, e.g., MIX+GAN [2], MGAN [15], MAD-GAN [11] and GMAN [10], there are key differences between Stackelberg GAN and prior work. a) In MGAN [15] and MAD-GAN [11], various generators are combined as a mixture of probabilistic models under the assumption that the generators and discriminator have infinite capacity. Also, they require that the generators share common network parameters. In contrast, in the Stackelberg GAN model we allow various sampling schemes beyond the mixture model, e.g., each generator samples a fixed but unequal number of data points independently. Furthermore, each generator has free parameters. We also make no assumption on the model capacity in our analysis. Analysis without capacity assumptions is an important research question raised by [3]. b) In MIX+GAN [2], the losses are ensembled with learned weights and an extra regularization term, which discourages the weights from being too far away from uniform. We find this unnecessary because the expressive power of each generator already allows implicit scaling of each generator. In the Stackelberg GAN, we apply equal weights for all generators and obtain improved guarantees. c) In GMAN [10], there are multiple discriminators, while it is unclear in theory why the multi-discriminator architecture works well. In this paper, we provide formal guarantees for our model.

  • We prove that the minimax duality gap shrinks as the number of generators increases (see Theorem 1 and Corollary 2). Unlike previous work, our result makes no assumption on the expressive power of generators and discriminator, but instead depends on their non-convexity. With an extra condition on the expressive power of generators, we show that Stackelberg GAN is able to achieve an ϵ-approximate equilibrium with O(1/ϵ) generators (see Theorem 3). This improves over the best-known result in [2], which requires O(1/ϵ^2) generators. At the core of our techniques is a novel application of the Shapley-Folkman lemma to the generic minimax problem, where in the literature the shrunken duality gap was only known to happen when the objective function is restricted to the Lagrangian function of a constrained optimization problem [29, 5]. This results in tighter bounds than those of the covering number argument in [2]. We also note that MIX+GAN is a heuristic model which does not exactly match the theoretical analysis in [2], while this paper provides formal guarantees for the exact model of Stackelberg GAN.

  • We empirically study the performance of Stackelberg GAN on various synthetic and real datasets. We observe that without any human assignment, surprisingly, each generator automatically learns a balanced number of modes without any mode being dropped (see Figure 1). Compared with other multi-generator GANs with the same network capacity, our experiments show that Stackelberg GAN achieves a Fréchet Inception Distance of 26.76 on the CIFAR-10 dataset while the best prior result is 31.34 (smaller is better), an improvement of 14.61%.

2 Stackelberg GAN

Before proceeding, we introduce our notation and formalize our model setup in this section.

Figure 2: Architecture of Stackelberg GAN. We ensemble the losses of various generator and discriminator pairs with equal weights.

Notations.

We will use bold lower-case letters to represent vectors and lower-case letters to represent scalars. Specifically, we denote by $\theta \in \mathbb{R}^t$ the parameter vector of the discriminator and by $\gamma \in \mathbb{R}^g$ the parameter vector of the generator. Let $D_\theta(x)$ be the output probability of the discriminator given input $x$, and let $G_\gamma(z)$ represent the generated vector given random input $z$. For any function $f(u)$, we denote by $f^*(v) := \sup_u \{u^T v - f(u)\}$ the conjugate function of $f$. Let $\breve{\mathrm{cl}}\, f$ be the convex closure of $f$, which is defined as the function whose epigraph is the convex closed hull of that of function $f$. We define $\widehat{\mathrm{cl}}\, f := -\breve{\mathrm{cl}}(-f)$. We will use $I$ to represent the number of generators.

2.1 Model Setup

Preliminaries. The key ingredient in the standard GAN is to play a zero-sum two-player game between a discriminator and a generator (which are often parametrized by deep neural networks in practice) such that the goal of the generator is to map random noise $z$ to some plausible images/texts and the discriminator aims at distinguishing the real images/texts from what the generator creates.

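For every parameter implementation $\gamma$ and $\theta$ of the generator and discriminator, respectively, denote by the payoff value

$$\phi(\gamma; \theta) := \mathbb{E}_{x \sim P_d} f(D_\theta(x)) + \mathbb{E}_{z \sim P_z} f(1 - D_\theta(G_\gamma(z))),$$

where $f(\cdot)$ is some concave, increasing function. Hereby, $P_d$ is the distribution of true images/texts and $P_z$ is a noise distribution such as Gaussian or uniform distribution. The standard GAN thus solves the following saddle point problems:

$$\min_{\gamma} \max_{\theta} \phi(\gamma; \theta), \quad \text{and} \quad \max_{\theta} \min_{\gamma} \phi(\gamma; \theta). \tag{1}$$

For different choices of function $f$, problem (1) leads to various variants of GAN. For example, when $f(t) = \log t$, problem (1) is the classic GAN; when $f(t) = t$, it reduces to the Wasserstein GAN. We refer interested readers to the paper of [20] for more variants of GANs.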
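As a rough illustration of how the choice of the function f changes the payoff, the following minimal sketch estimates the payoff by Monte Carlo, assuming it has the usual form E[f(D(x))] + E[f(1 − D(G(z)))]. The helper names (`payoff`, `make_logistic_disc`, `make_affine_gen`) and the one-dimensional toy models are hypothetical and are not part of the paper.

```python
import math
import random

def payoff(f, disc, gen, real_samples, noise_samples):
    """Monte Carlo estimate of E[f(D(x))] + E[f(1 - D(G(z)))] (assumed form)."""
    real_term = sum(f(disc(x)) for x in real_samples) / len(real_samples)
    fake_term = sum(f(1.0 - disc(gen(z))) for z in noise_samples) / len(noise_samples)
    return real_term + fake_term

def make_logistic_disc(w, b):
    # Hypothetical 1D discriminator: logistic regression on a scalar input.
    return lambda x: 1.0 / (1.0 + math.exp(-(w * x + b)))

def make_affine_gen(scale, shift):
    # Hypothetical 1D generator: affine map of the noise.
    return lambda z: scale * z + shift

rng = random.Random(0)
real = [rng.gauss(2.0, 0.5) for _ in range(1000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(1000)]

blind_disc = make_logistic_disc(0.0, 0.0)  # always outputs 0.5
gen = make_affine_gen(0.5, 2.0)

classic = payoff(math.log, blind_disc, gen, real, noise)          # f(t) = log t
wasserstein = payoff(lambda t: t, blind_disc, gen, real, noise)   # f(t) = t
print(classic, wasserstein)
```

With a discriminator that always outputs 0.5, the classic payoff equals 2·log(0.5) = −log 4, the familiar value of the classic GAN game when the discriminator cannot tell real from generated samples.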

Stackelberg GAN. Our model of Stackelberg GAN is inspired by the Stackelberg competition in the domain of game theory. Instead of playing a two-player game as in the standard GAN, in Stackelberg GAN there are $I+1$ players with two firms: one discriminator and $I$ generators. One can make an analogy between the discriminator (generators) in the Stackelberg GAN and the leader (followers) in the Stackelberg competition.

Stackelberg GAN is a general framework which can be built on top of all variants of standard GANs. The objective function is simply an ensemble of losses w.r.t. all possible pairs of generators and discriminator: $\Phi(\gamma_1, \ldots, \gamma_I; \theta) := \sum_{i=1}^{I} \phi(\gamma_i; \theta)$. Thus it is very easy to implement. The Stackelberg GAN therefore solves the following saddle point problems:

$$w^* := \min_{\gamma_1, \ldots, \gamma_I} \max_{\theta} \frac{1}{I} \Phi(\gamma_1, \ldots, \gamma_I; \theta), \quad \text{and} \quad q^* := \max_{\theta} \min_{\gamma_1, \ldots, \gamma_I} \frac{1}{I} \Phi(\gamma_1, \ldots, \gamma_I; \theta).$$

We term $w^* - q^*$ the minimax (duality) gap. We note that there are key differences between the naïve ensembling model and ours. In the naïve ensembling model, one trains multiple GAN models independently and averages their outputs. In contrast, our Stackelberg GAN shares a single discriminator across the various generators, and thus requires joint training. Figure 2 shows the architecture of our Stackelberg GAN.
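The equal-weight ensembling of per-pair losses with one shared discriminator can be sketched as follows. This is a minimal illustration under the assumption that each pair's payoff has the form E[f(D(x))] + E[f(1 − D(G(z)))]; the function names and the scalar toy models are hypothetical, and no training loop is shown.

```python
import math

def payoff(f, disc, gen, real_samples, noise_samples):
    """Payoff of one (generator, discriminator) pair (assumed form)."""
    real_term = sum(f(disc(x)) for x in real_samples) / len(real_samples)
    fake_term = sum(f(1.0 - disc(gen(z))) for z in noise_samples) / len(noise_samples)
    return real_term + fake_term

def stackelberg_objective(f, disc, gens, real_samples, noise_samples):
    """Equal-weight average of the payoffs of all generators against ONE shared discriminator."""
    total = sum(payoff(f, disc, gen, real_samples, noise_samples) for gen in gens)
    return total / len(gens)

def disc(x):
    # Hypothetical shared discriminator: a fixed logistic function of a scalar.
    return 1.0 / (1.0 + math.exp(-x))

real = [1.5, 2.0, 2.5]
noise = [-1.0, 0.0, 1.0]
gens = [lambda z: z - 2.0, lambda z: z + 2.0]  # two hypothetical generators

value = stackelberg_objective(math.log, disc, gens, real, noise)
print(value)
```

In joint training, the discriminator would ascend this single averaged objective while every generator descends it, which is what distinguishes the setup from training several GANs independently.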

How to generate samples from Stackelberg GAN? In the Stackelberg GAN, we expect that each generator learns only a few modes. In order to generate a sample that may come from any mode, we use a mixed model. In particular, we generate a uniformly random value $i$ from $1$ to $I$ and use the $i$-th generator to obtain a new sample. Note that this procedure is independent of the training procedure.
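The sampling procedure described above can be sketched in a few lines. The generators below are hypothetical stand-ins (each one covers a single 1D mode), and indices are 0-based in the code.

```python
import random

def sample_from_generators(generators, num_samples, rng):
    """Draw each sample by picking a generator uniformly at random, then feeding it fresh noise."""
    samples = []
    for _ in range(num_samples):
        i = rng.randrange(len(generators))  # uniform index over the generators
        z = rng.gauss(0.0, 1.0)             # noise input
        samples.append((i, generators[i](z)))
    return samples

# Hypothetical trained generators, each concentrated around one mode.
centers = [-3.0, 0.0, 3.0]
generators = [lambda z, c=c: c + 0.1 * z for c in centers]

rng = random.Random(0)
draws = sample_from_generators(generators, 3000, rng)
counts = [sum(1 for i, _ in draws if i == k) for k in range(len(generators))]
print(counts)
```

Because the index is drawn uniformly, every generator contributes roughly the same number of samples, so modes covered by different generators all appear in the output.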

3 Analysis of Stackelberg GAN

In this section, we develop our theoretical contributions and compare our results with the prior work.

3.1 Minimax Duality Gap

We begin with studying the minimax gap of Stackelberg GAN. Our main results show that the minimax gap shrinks as the number of generators increases.
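To build intuition for why more generators can shrink the gap, here is a toy game of our own construction (it is not taken from the paper). The generator parameter takes two values, the discriminator parameter θ lies in [0, 1], and the payoff is linear (hence concave) in θ. The sketch brute-forces the minimax and maximin values of the averaged objective.

```python
from itertools import product

def phi(gamma, theta):
    """Toy matching-pennies payoff: linear, hence concave, in theta."""
    sign = 1.0 if gamma == 0 else -1.0
    return sign * (2.0 * theta - 1.0)

THETAS = [j / 100 for j in range(101)]  # grid over the discriminator parameter

def averaged(gammas, theta):
    return sum(phi(g, theta) for g in gammas) / len(gammas)

def minimax_value(num_gen):
    # min over generator tuples of the max over theta
    return min(
        max(averaged(gammas, th) for th in THETAS)
        for gammas in product((0, 1), repeat=num_gen)
    )

def maximin_value(num_gen):
    # max over theta of the min over generator tuples
    return max(
        min(averaged(gammas, th) for gammas in product((0, 1), repeat=num_gen))
        for th in THETAS
    )

def duality_gap(num_gen):
    return minimax_value(num_gen) - maximin_value(num_gen)

for num_gen in (1, 2, 3, 4, 5):
    print(num_gen, round(duality_gap(num_gen), 4))
```

In this toy game the gap is 1 with a single generator, vanishes for an even number of generators, and equals 1/I for odd I, so it decays as the number of generators grows. The concavity in θ matters: if θ were restricted to {0, 1}, the gap would not vanish.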

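To proceed, denote by $h_i(u_i) := \inf_{\gamma_i \in \mathbb{R}^g} (-\phi(\gamma_i; \cdot))^*(u_i)$, where the conjugate operation is w.r.t. the second argument of $\phi(\gamma_i; \cdot)$. We clarify here that the subscript $i$ in $h_i$ indicates that the function $h_i$ is derived from the $i$-th generator. The argument of $h_i$ should depend on $i$, so we denote it by $u_i$. Intuitively, $h_i$ serves as an approximate convexification of $-\phi(\gamma_i; \cdot)$ w.r.t. the second argument due to the conjugate operation. Denote by $\breve{\mathrm{cl}}\, h_i$ the convex closure of $h_i$.

$\breve{\mathrm{cl}}\, h_i$ represents the convex relaxation of $h_i$ because the epigraph of $\breve{\mathrm{cl}}\, h_i$ is exactly the convex hull of the epigraph of $h_i$ by the definition of $\breve{\mathrm{cl}}\, h_i$. Let

$$\Delta_\theta^{\mathrm{minimax}} := \min_{\gamma_1, \ldots, \gamma_I} \max_{\theta} \frac{1}{I} \Phi(\gamma_1, \ldots, \gamma_I; \theta) - \min_{\gamma_1, \ldots, \gamma_I} \max_{\theta} \frac{1}{I} \widetilde{\Phi}(\gamma_1, \ldots, \gamma_I; \theta),$$

and

$$\Delta_\theta^{\mathrm{maximin}} := \max_{\theta} \min_{\gamma_1, \ldots, \gamma_I} \frac{1}{I} \widetilde{\Phi}(\gamma_1, \ldots, \gamma_I; \theta) - \max_{\theta} \min_{\gamma_1, \ldots, \gamma_I} \frac{1}{I} \Phi(\gamma_1, \ldots, \gamma_I; \theta),$$

where $\widetilde{\Phi}(\gamma_1, \ldots, \gamma_I; \theta) := \sum_{i=1}^{I} \widehat{\mathrm{cl}}\, \phi(\gamma_i; \theta)$ and $-\widehat{\mathrm{cl}}\, \phi(\gamma_i; \theta)$ is the convex closure of $-\phi(\gamma_i; \theta)$ w.r.t. argument $\theta$. Therefore, $\Delta_\theta^{\mathrm{maximin}} + \Delta_\theta^{\mathrm{minimax}}$ measures the non-convexity of the objective function w.r.t. argument $\theta$. For example, it is equal to $0$ if and only if $\phi(\gamma_i; \theta)$ is concave and closed w.r.t. discriminator parameter $\theta$.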

We have the following guarantees on the minimax gap of Stackelberg GAN.

Theorem 1.

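Let $\Delta_\gamma^i := \sup_{u \in \mathbb{R}^t} \{h_i(u) - \breve{\mathrm{cl}}\, h_i(u)\}$ and $\Delta_\gamma^{\mathrm{worst}} := \max_{i \in [I]} \Delta_\gamma^i$. Denote by $t$ the number of parameters of the discriminator, i.e., $\theta \in \mathbb{R}^t$. Suppose that $h_i(\cdot)$ is continuous and $\mathrm{dom}\, h_i$ is compact and convex. Then the duality gap can be bounded by

$$0 \le w^* - q^* \le \Delta_\theta^{\mathrm{minimax}} + \Delta_\theta^{\mathrm{maximin}} + \epsilon,$$

provided that the number of generators $I > \frac{t+1}{\epsilon} \Delta_\gamma^{\mathrm{worst}}$.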

Remark 1.

Theorem 1 makes a mild assumption on the continuity of the loss and no assumption on the model capacity of the discriminator and generators. The analysis instead depends on their non-convexity as being parametrized by deep neural networks. In particular, $\Delta_\gamma^i$ measures the divergence between the function value of $h_i$ and its convex relaxation $\breve{\mathrm{cl}}\, h_i$; when $\phi(\gamma_i; \theta)$ is convex w.r.t. argument $\gamma_i$, $\Delta_\gamma^i$ is exactly $0$. The constant $\Delta_\gamma^{\mathrm{worst}}$ is the maximal divergence among all generators, which does not grow with the increase of $I$. This is because $\Delta_\gamma^{\mathrm{worst}}$ measures the divergence of only one generator and when each generator, for example, has the same architecture, we have $\Delta_\gamma^{\mathrm{worst}} = \Delta_\gamma^1 = \cdots = \Delta_\gamma^I$. Similarly, the terms $\Delta_\theta^{\mathrm{minimax}}$ and $\Delta_\theta^{\mathrm{maximin}}$ characterize the non-convexity of the discriminator. When the discriminator is concave, such as logistic regression and support vector machine, $\Delta_\theta^{\mathrm{minimax}} = \Delta_\theta^{\mathrm{maximin}} = 0$ and we have the following straightforward corollary about the minimax duality gap of Stackelberg GAN.

Corollary 2.

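Under the settings of Theorem 1, when $\phi(\gamma_i; \theta)$ is concave and closed w.r.t. discriminator parameter $\theta$ and the number of generators $I > \frac{t+1}{\epsilon} \Delta_\gamma^{\mathrm{worst}}$, we have $0 \le w^* - q^* \le \epsilon$.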

3.2 Existence of Approximate Equilibrium

The results of Theorem 1 and Corollary 2 are independent of the model capacity of generators and discriminator. When we make assumptions on the expressive power of the generator as in [2], we have the following guarantee (2) on the existence of an ϵ-approximate equilibrium.

Theorem 3.

Under the settings of Theorem 1, suppose that for any , there exists a generator such that . Let the discriminator and generators be -Lipschitz w.r.t. inputs and parameters, and let be -Lipschitz. Then for any , there exist generators and a discriminator such that for some value ,

(2)

Related Work. While many efforts have been devoted to empirically investigating the performance of multi-generator GAN, little is known about how many generators are needed so as to achieve certain equilibrium guarantees. Probably the most relevant prior work to Theorem 3 is that of [2]. In particular, [2] showed that there exist generators and one discriminator such that -approximate equilibrium can be achieved, provided that for all and any , there exists a generator such that . Hereby, is a global upper bound of function , i.e., . In comparison, Theorem 3 improves over this result in two aspects: a) the assumption on the expressive power of generators in [2] implies our condition . Thus our assumption is weaker. b) The required number of generators in Theorem 3 is as small as . We note that by the definition of . Therefore, Theorem 3 requires much fewer generators than that of [2].

4 Architecture, Capacity and Mode Collapse/Dropping

In this section, we empirically investigate the effect of network architecture and capacity on the mode collapse/dropping issues for various multi-generator architecture designs. Hereby, the mode dropping refers to the phenomenon that generative models simply ignore some hard-to-represent modes of real distributions, and the mode collapse means that some modes of real distributions are "averaged" by generative models. For GAN, it is widely believed that the two issues are caused by the large gap between the minimax and maximin objective function values (see page 35, [12]).

Our experiments verify that network capacity (change of width and depth) is not very crucial for resolving the mode collapse issue, though it can alleviate the mode dropping in certain senses. Instead, the choice of architecture of generators plays a key role. To visualize this discovery, we test the performance of varying architectures of GANs on a synthetic mixture of Gaussians dataset with 8 modes and 0.01 standard deviation. We observe the following phenomena:

Naïvely increasing capacity of one-generator architecture does not alleviate mode collapse. It shows that the multi-generator architecture in the Stackelberg GAN effectively alleviates the mode collapse issue. Though naïvely increasing capacity of one-generator architecture alleviates mode dropping issue, for more challenging mode collapse issue, the effect is not obvious (see Figure 3).

(a) GAN with 1 generator of architecture 2-128-2.
(b) GAN with 1 generator of architecture 2-128-256-512-1024-2.
(c) Stackelberg GAN with 8 generators of architecture 2-16-2.
Figure 3: Comparison of mode collapse/dropping issue of one-generator and multi-generator architectures with varying model capacities. (a) and (b) show that increasing the model capacity can alleviate the mode dropping issue, though it does not alleviate the mode collapse issue. (c) Multi-generator architecture with even small capacity resolves the mode collapse issue.

Stackelberg GAN outperforms multi-branch models. We compare performance of multi-branch GAN and Stackelberg GAN with objective functions:

Hereby, the multi-branch GAN has made use of the extra information that the real distribution is a Gaussian mixture model with a known probability distribution function, so that each generator branch tries to fit one component. Even so, we observe that with the same model capacity, Stackelberg GAN significantly outperforms multi-branch GAN (see Figure 4 (a)(c)) without access to the extra information. The performance of Stackelberg GAN is also better than that of a multi-branch GAN of much larger capacity (see Figure 4 (b)(c)).

(a) 8-branch GAN with generator architecture 2-16-2.
(b) 8-branch GAN with generator architecture 2-128-256-512-1024-2.
(c) Stackelberg GAN with 8 generators of architecture 2-16-2.
Figure 4: Comparison of mode collapse issue of multi-branch and multi-generator architectures with varying model capacities. (a) and (b) show that increasing the model capacity can alleviate the mode dropping issue, though it does not alleviate the mode collapse issue. (c) Multi-generator architecture with much smaller capacity resolves the mode collapse issue.

Generators tend to learn balanced number of modes when they have same capacity. We observe that for varying number of generators, each generator in the Stackelberg GAN tends to learn equal number of modes when the modes are symmetric and every generator has same capacity (see Figure 5).

(a) Two generators.
(b) Four generators.
(c) Six generators.
Figure 5: Stackelberg GAN with varying number of generators of architecture 2-128-256-512-1024-2.
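The synthetic benchmark in this section (8 Gaussian modes with standard deviation 0.01) and a simple diagnostic for the two failure types can be sketched as follows. The circular layout with radius 1, the nearest-mode assignment, and the distance threshold are our assumptions; the text only specifies the number of modes and the standard deviation. A mode with zero assigned samples is read as dropped, and samples far from every mode are read as a sign of modes being "averaged".

```python
import math
import random

def mode_centers(num_modes=8, radius=1.0):
    # Assumed layout: modes evenly spaced on a circle.
    return [
        (radius * math.cos(2 * math.pi * k / num_modes),
         radius * math.sin(2 * math.pi * k / num_modes))
        for k in range(num_modes)
    ]

def sample_mixture(n, centers, std, rng):
    points = []
    for _ in range(n):
        cx, cy = rng.choice(centers)
        points.append((rng.gauss(cx, std), rng.gauss(cy, std)))
    return points

def mode_histogram(points, centers, max_dist):
    """Count samples per nearest mode; samples farther than max_dist from all modes are strays."""
    counts = [0] * len(centers)
    strays = 0
    for x, y in points:
        dists = [math.hypot(x - cx, y - cy) for cx, cy in centers]
        k = min(range(len(centers)), key=dists.__getitem__)
        if dists[k] <= max_dist:
            counts[k] += 1
        else:
            strays += 1
    return counts, strays

rng = random.Random(0)
centers = mode_centers()
data = sample_mixture(2000, centers, 0.01, rng)
counts, strays = mode_histogram(data, centers, max_dist=0.1)
print(counts, strays)

# A degenerate "generator" that outputs only the average of all modes.
collapsed_counts, collapsed_strays = mode_histogram([(0.0, 0.0)] * 100, centers, max_dist=0.1)
print(collapsed_counts, collapsed_strays)
```

Applied to generated samples instead of real data, the per-mode counts indicate how balanced the coverage is, and the stray count indicates how much mass falls between modes.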

5 Experiments

In this section, we validate our theoretical contributions through experiments.

5.1 MNIST Dataset

We first show that Stackelberg GAN generates more diverse images on the MNIST dataset [18] than classic GAN. We follow the standard preprocessing step that each pixel is normalized via subtracting it by 0.5 and dividing it by . The detailed network setups of discriminator and generators are in Table 4.

Figure 6 shows the diversity of digits generated by Stackelberg GAN with a varying number of generators. When there is only one generator, the digits are not very diverse, with many "1"s and far fewer "2"s. As the number of generators increases, the images tend to be more diverse. In particular, for the 10-generator Stackelberg GAN, each generator is associated with one or two digits without any digit being missed.

Figure 6: Standard GAN vs. Stackelberg GAN on the MNIST dataset without cherry pick. Left Figure: Digits generated by the standard GAN. It shows that the standard GAN generates many "1"’s which are not very diverse. Middle Figure: Digits generated by the Stackelberg GAN with 5 generators, where every two rows correspond to one generator. Right Figure: Digits generated by the Stackelberg GAN with 10 generators, where each row corresponds to one generator.

5.2 Fashion-MNIST Dataset

We also observe better performance by the Stackelberg GAN on the Fashion-MNIST dataset. Fashion-MNIST is a dataset which consists of 60,000 examples. Each example is a grayscale image associated with a label from 10 classes. We follow the standard preprocessing step in which each pixel is normalized by subtracting 0.5 from it and dividing by a fixed scale factor. We specify the detailed network setups of discriminator and generators in Table 4.

Figure 7 shows the diversity of fashion items generated by Stackelberg GAN with a varying number of generators. When there is only one generator, the generated images are not very diverse, and no "bags" are found. However, as the number of generators increases, the generated images tend to be more diverse. In particular, for the 10-generator Stackelberg GAN, each generator is associated with one class without any class being missed.

Figure 7: Generated samples by Stackelberg GAN on the Fashion-MNIST dataset without cherry-picking. Left Figure: Examples generated by the standard GAN. It shows that the standard GAN fails to generate bags. Middle Figure: Examples generated by the Stackelberg GAN with 5 generators, where every two rows correspond to one generator. Right Figure: Examples generated by the Stackelberg GAN with 10 generators, where each row corresponds to one generator.

5.3 CIFAR-10 Dataset

We then implement Stackelberg GAN on the CIFAR-10 dataset. CIFAR-10 includes 60,000 32×32 images, which fall into 10 classes [17]. The architecture of generators and discriminator follows the design of DCGAN in [21]. We train models with 5, 10, and 20 fixed-size generators. The results show that the model with 10 generators performs the best. We also train 10-generator models where each generator has 2, 3 and 4 convolution layers. We find that the generator with 2 convolution layers, which is the shallowest one, performs the best. So we report the results obtained from the model with 10 generators containing 2 convolution layers. Figure 8(a) shows the samples produced by different generators. The samples are randomly drawn rather than cherry-picked, to demonstrate the quality of images generated by our model.

For quantitative evaluation, we use Inception score and Fréchet Inception Distance (FID) to measure the difference between images generated by models and real images.

Results of Inception Score. The Inception score measures the quality of a generated image and correlates well with human judgment [23]. We report the Inception score obtained by our Stackelberg GAN and other baseline methods in Table 1. For fair comparison, we only consider baseline models which are completely unsupervised and do not need any label information. Instead of directly using the Inception scores reported by the original papers, we replicate the experiment of MGAN using the code, architectures and parameters reported by the original paper, and evaluate the scores based on the new experimental results. Table 1 shows that our model achieves a score of 7.62 on the CIFAR-10 dataset, which outperforms the state-of-the-art models. For fairness, we configure our Stackelberg GAN with the same capacity as MGAN, that is, the two models have a comparable number of total parameters. When the capacity of our Stackelberg GAN is as small as DCGAN, our model improves over DCGAN significantly.

Results of Fréchet Inception Distance. We then evaluate the performance of models on the CIFAR-10 dataset using the Fréchet Inception Distance (FID), which better captures the similarity between generated images and real ones [14]. As Table 1 shows, under the same capacity as DCGAN, our model reduces the FID by 20.74%. Meanwhile, under the same capacity as MGAN, our model reduces the FID by 14.61%. This improvement further indicates that our Stackelberg GAN with multiple light-weight generators helps improve the quality of the generated images.

Model | Inception Score | Fréchet Inception Distance
Real data | | -
WGAN [1] | | -
MIX+WGAN [2] | | -
Improved-GAN [23] | | -
ALI [9] | | -
BEGAN [6] | | -
MAGAN [27] | | -
GMAN [10] | | -
DCGAN [21] | | 37.7
Ours (capacity as DCGAN) | | 29.88
D2GAN [19] | | -
MAD-GAN (our run, capacity as MGAN) [11] | | 34.10
MGAN (our run) [15] | | 31.34
Ours (capacity as MGAN) | 7.62 | 26.76
Table 1: Quantitative evaluation of various GANs on CIFAR-10 dataset. All results are either reported by the authors themselves or run by us with codes provided by the authors. Every model is trained without label. Methods with higher inception score and lower Fréchet Inception Distance are better.
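The relative FID improvements follow directly from the FID column of Table 1. The short check below recomputes them; the dictionary keys are our own labels for the table rows.

```python
def relative_reduction(baseline, ours):
    """Percentage by which `ours` is lower than `baseline` (lower FID is better)."""
    return 100.0 * (baseline - ours) / baseline

# FID values copied from Table 1.
fid = {
    "dcgan": 37.7,
    "ours_dcgan_capacity": 29.88,
    "mgan_our_run": 31.34,
    "ours_mgan_capacity": 26.76,
}

vs_dcgan = relative_reduction(fid["dcgan"], fid["ours_dcgan_capacity"])
vs_mgan = relative_reduction(fid["mgan_our_run"], fid["ours_mgan_capacity"])
print(f"vs DCGAN: {vs_dcgan:.2f}%  vs MGAN: {vs_mgan:.2f}%")
```

The second figure matches the 14.61% improvement over previous multi-generator GANs quoted in the abstract.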

5.4 Tiny ImageNet Dataset

We also evaluate the performance of Stackelberg GAN on the Tiny ImageNet dataset. Tiny ImageNet is a large image dataset, where each image is labelled to indicate the class of the object inside the image. We resize the images to a smaller resolution following the procedure described in [8]. Figure 8(b) shows randomly picked samples generated by Stackelberg GAN. Each row has samples generated from one generator. Since some of the image types in Tiny ImageNet are also included in CIFAR-10, we order the rows in a similar way to Figure 8(a).

(a) Samples on CIFAR-10.
(b) Samples on Tiny ImageNet.
Figure 8: Examples generated by Stackelberg GAN on CIFAR-10 (left) and Tiny ImageNet (right) without cherry pick, where each row corresponds to samples from one generator.

6 Conclusions

In this work, we tackle the problem of instability during the GAN training procedure, which is caused by the huge gap between minimax and maximin objective values. The core of our techniques is a multi-generator architecture. We show that the minimax gap shrinks to ϵ as the number of generators increases with rate O(1/ϵ), when the maximization problem w.r.t. the discriminator is concave. This improves over the best-known result of O(1/ϵ^2). Experiments verify the effectiveness of our proposed methods.

Acknowledgements. Part of this work was done while H.Z. and S.X. were summer interns at Petuum Inc. We thank Maria-Florina Balcan, Yingyu Liang, and David P. Woodruff for their useful discussions.

References

Appendix A Supplementary Experiments

(a) Step 0.
(b) Step 6k.
(c) Step 13k.
(d) Step 19k.
(e) Step 25k.
(f) Step 0.
(g) Step 6k.
(h) Step 13k.
(i) Step 19k.
(j) Step 25k.
(k) Step 0.
(l) Step 6k.
(m) Step 13k.
(n) Step 19k.
(o) Step 25k.
Figure 9: Effects of generator architecture of Stackelberg GAN on a toy 2D mixture of Gaussians, where the number of generators is set to be 8. Top Row: The generators have one hidden layer. Middle Row: The generators have two hidden layers. Bottom Row: The generators have three hidden layers. It shows that as the number of hidden layers increases, each generator tends to learn more modes. However, mode collapse does not happen for any of the three architectures.

Figure 9 shows how the architecture of generators affects the distribution of samples produced by each generator. The enlarged versions of samples generated by Stackelberg GAN with architectures shown in Table 5 and Table 6 are deferred to Figures 10, 11, 12 and 13.

Appendix B Proofs of Main Results

B.1 Proofs of Theorem 1 and Corollary 2: Minimax Duality Gap

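Theorem 1 (restated). Let $\Delta_\gamma^i := \sup_{u \in \mathbb{R}^t} \{h_i(u) - \breve{\mathrm{cl}}\, h_i(u)\}$ and $\Delta_\gamma^{\mathrm{worst}} := \max_{i \in [I]} \Delta_\gamma^i$. Denote by $t$ the number of parameters of the discriminator, i.e., $\theta \in \mathbb{R}^t$. Suppose that $h_i(\cdot)$ is continuous and $\mathrm{dom}\, h_i$ is compact and convex. Then the duality gap can be bounded by

$$0 \le w^* - q^* \le \Delta_\theta^{\mathrm{minimax}} + \Delta_\theta^{\mathrm{maximin}} + \epsilon,$$

provided that the number of generators $I > \frac{t+1}{\epsilon} \Delta_\gamma^{\mathrm{worst}}$.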

Proof.

The statement $0 \le w^* - q^*$ follows from weak duality. Thus it suffices to prove the other side of the inequality. All notation in this section is defined in Section 3.1.

We first show that

Denote by

We have the following lemma.

Lemma 4.

We have

Proof.

By the definition of , we have . Since is the convex closure of function (a.k.a. weak duality theorem), we have . We now show that Note that , where

and that

(3)

So we have

as desired.

By Lemma 4, it suffices to show . We have the following lemma.

Lemma 5.

Under the assumption in Theorem 1,

Proof.

We note that

where . Therefore,

Consider the subset of :

Define the vector summation

Since is continuous and is compact, the set

is compact. So , , , and , are all compact sets. According to the definition of and the standard duality argument [7], we have

and

We are going to apply the following Shapley-Folkman lemma.

Lemma 6 (Shapley-Folkman, [26]).

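Let $A_i$, $i \in [I]$, be a collection of subsets of $\mathbb{R}^m$. Then for every $x \in \mathrm{conv}\left(\sum_{i=1}^{I} A_i\right)$, there is a subset $\mathcal{I}(x) \subseteq [I]$ of size at most $m$ such that

$$x \in \sum_{i \notin \mathcal{I}(x)} A_i + \sum_{i \in \mathcal{I}(x)} \mathrm{conv}(A_i).$$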

We apply Lemma 6 to prove Lemma 5 with . Let be such that

Applying the above Shapley-Folkman lemma to the set , we have that there are a subset of size and vectors

such that

(4)
(5)

Representing elements of the convex hull of by Carathéodory theorem, we have that for each , there are vectors and scalars such that

(6)

Recall that we define

and . We have for ,

(7)

Thus, by Eqns. (4) and (6), we have

(8)

Therefore, we have

as desired. ∎

By Lemmas 4 and 5, we have proved that

To prove Theorem 1, we note that

as desired. ∎

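Corollary 2 (restated). Under the settings of Theorem 1, when $\phi(\gamma_i; \theta)$ is concave and closed w.r.t. discriminator parameter $\theta$ and the number of generators $I > \frac{t+1}{\epsilon} \Delta_\gamma^{\mathrm{worst}}$, we have $0 \le w^* - q^* \le \epsilon$.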

Proof.

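When $\phi(\gamma_i; \theta)$ is concave and closed w.r.t. discriminator parameter $\theta$, we have $\widehat{\mathrm{cl}}\, \phi = \phi$. Thus, $\Delta_\theta^{\mathrm{minimax}} = \Delta_\theta^{\mathrm{maximin}} = 0$ and $0 \le w^* - q^* \le \epsilon$. ∎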

B.2 Proofs of Theorem 3: Existence of Approximate Equilibrium

Theorem 3 (restated). Under the settings of Theorem 1, suppose that for any , there exists a generator such that