Towards Better Understanding of Adaptive Gradient Algorithms in Generative Adversarial Nets

12/26/2019 · Mingrui Liu et al. · The University of Iowa

Adaptive gradient algorithms perform gradient-based updates using the history of gradients and are ubiquitous in training deep neural networks. While the theory of adaptive gradient methods is well understood for minimization problems, the underlying factors driving their empirical success in min-max problems such as GANs remain unclear. In this paper, we aim to bridge this gap from both theoretical and empirical perspectives. First, we analyze a variant of Optimistic Stochastic Gradient (OSG) proposed in (Daskalakis et al., 2017) for solving a class of non-convex non-concave min-max problems and establish O(ϵ^-4) complexity for finding an ϵ-first-order stationary point; the algorithm only requires invoking one stochastic first-order oracle per iteration while enjoying the state-of-the-art iteration complexity achieved by the stochastic extragradient method of (Iusem et al., 2017). We then propose an adaptive variant of OSG named Optimistic Adagrad (OAdagrad) and reveal an improved adaptive complexity Õ(ϵ^-2/(1-α)), where α characterizes the growth rate of the cumulative stochastic gradient and 0 ≤ α ≤ 1/2. To the best of our knowledge, this is the first work establishing adaptive complexity in non-convex non-concave min-max optimization. Empirically, our experiments show that adaptive gradient algorithms indeed outperform their non-adaptive counterparts in GAN training. Moreover, this observation can be explained by the empirically observed slow growth rate of the cumulative stochastic gradient.


1 Introduction

Adaptive gradient algorithms (Duchi et al., 2011; Tieleman and Hinton, 2012; Kingma and Ba, 2014; Reddi et al., 2019) are very popular in training deep neural networks due to their computational efficiency and minimal need for hyper-parameter tuning (Kingma and Ba, 2014). For example, Adagrad (Duchi et al., 2011) automatically adjusts the learning rate for each dimension of the model parameter according to the history of gradients, while its computational cost is almost the same as that of Stochastic Gradient Descent (SGD). However, in supervised deep learning (for example, image classification with a deep convolutional neural network), there is not enough evidence that adaptive gradient methods converge faster than their non-adaptive counterpart (i.e., SGD) on benchmark datasets. For example, Wilson et al. (2017) argue that adaptive gradient methods often find solutions with worse performance than SGD; specifically, they observed that Adagrad converges more slowly than SGD in terms of both training and testing error when training VGG (Simonyan and Zisserman, 2014) on CIFAR10.

GANs (Goodfellow et al., 2014) are a popular class of generative models. In a nutshell, they consist of a generator and a discriminator, both of which are defined by deep neural networks. The generator and the discriminator are trained under an adversarial cost, corresponding to a non-convex non-concave min-max problem. GANs are notoriously difficult to train. In practice, Adam (Kingma and Ba, 2014) is the de facto optimizer used for GAN training. The common optimization strategy is to alternately update the discriminator and the generator (Arjovsky et al., 2017; Gulrajani et al., 2017). Using Adam is important in GAN training, since replacing it with a non-adaptive method (e.g. SGD) would significantly deteriorate the performance. This paper studies and attempts to answer the following question:

Why do adaptive gradient methods outperform their non-adaptive counterparts in GAN training?

We analyze a variant of Optimistic Stochastic Gradient (OSG) in (Daskalakis and Panageas, 2018) and propose an adaptive variant named Optimistic Adagrad (OAdagrad) for solving a class of non-convex non-concave min-max problems. Both are shown to enjoy state-of-the-art complexities. We further prove that the convergence rate of OAdagrad to an ϵ-first-order stationary point depends on the growth rate of the cumulative stochastic gradient. In our experiments, we observed an interesting phenomenon while using adaptive gradient methods for training GANs: the cumulative stochastic gradient grows at a slow rate. This observation is in line with the prediction of our theory, which suggests an improved convergence rate for OAdagrad in GAN training when the growth rate of the cumulative stochastic gradient is slow.

Since GAN training is a min-max optimization problem in nature, our problem of interest is to solve the following stochastic optimization problem:

min_{x∈X} max_{y∈Y} f(x, y) := E_ξ[F(x, y; ξ)],   (1)

where X ⊆ R^m and Y ⊆ R^n are closed and convex sets, and f(x, y) is possibly non-convex in x and non-concave in y. ξ is a random variable following an unknown distribution D. In GAN training, x and y represent the parameters of the generator and the discriminator, respectively.

The ideal goal for solving (1) is to find a saddle point (x*, y*) ∈ X × Y such that f(x*, y) ≤ f(x*, y*) ≤ f(x, y*) for ∀x ∈ X, y ∈ Y.

To achieve this goal, the typical assumption made is that the objective function is convex-concave. When f is convex in x and concave in y, non-asymptotic guarantees in terms of the duality gap are well established by a series of works (Nemirovski and Yudin, 1978; Nemirovski, 2004; Nesterov, 2007; Nemirovski et al., 2009; Juditsky et al., 2011). However, when f is non-convex in x and non-concave in y, finding a saddle point is NP-hard in general. Instead, we focus on finding a first-order stationary point, provided that the objective function is smooth; i.e., we aim to find (x, y) such that ‖∇_x f(x, y)‖ ≤ ϵ and ‖∇_y f(x, y)‖ ≤ ϵ. Note that this is a necessary condition for being a (local) saddle point.

Related Work. Several works designed iterative first-order deterministic (Dang and Lan, 2015) and stochastic (Iusem et al., 2017; Lin et al., 2018) algorithms for achieving an ϵ-first-order stationary point with a non-asymptotic guarantee. The goal is to find u = (x, y) such that ‖T(u)‖ ≤ ϵ or E‖T(u)‖ ≤ ϵ, where the first-order oracle is defined as T(u) = (∇_x f(x, y), −∇_y f(x, y)) with u = (x, y), and the first-order stochastic oracle T(u; ξ) is a noisy observation of T(u), i.e. E[T(u; ξ)] = T(u). For instance, Dang and Lan (2015) focus on the deterministic setting. On the other hand, (Iusem et al., 2017) develop a stochastic extragradient algorithm that enjoys O(ϵ^-4) iteration complexity. The extragradient method requires two stochastic first-order oracle calls in one iteration, which can be computationally expensive in deep learning applications such as GANs. The inexact proximal point method developed in (Lin et al., 2018) has O(ϵ^-6) iteration complexity for finding an ϵ-first-order stationary point. (The result in (Lin et al., 2018) assumes the first-order oracle is a weakly-monotone operator, which is milder than the Lipschitz-continuity assumption of Iusem et al. (2017); however, simply applying the Lipschitz-continuity condition in their proof does not change their iteration complexity.)

To avoid the cost of the additional oracle call in the extragradient step, several studies (Chiang et al., 2012; Rakhlin and Sridharan, 2013; Daskalakis et al., 2017; Gidel et al., 2018; Xu et al., 2019) proposed single-call variants of the extragradient algorithm. Some of them focus on the convex setting (e.g. (Chiang et al., 2012; Rakhlin and Sridharan, 2013)), while others focus on the non-convex setting (Xu et al., 2019). Closest to our work are (Daskalakis et al., 2017; Gidel et al., 2018), where the min-max setting and GAN training are considered. However, the convergence of those algorithms is only shown for a class of bilinear problems in (Daskalakis et al., 2017) and for monotone variational inequalities in (Gidel et al., 2018). Hence a big gap remains between the specific settings studied in (Daskalakis et al., 2017; Gidel et al., 2018) and more general non-convex non-concave min-max problems. Table 1 provides a complete overview of our results and existing results. It is hard to do justice to the large body of work on min-max optimization here, so we refer the interested reader to Appendix B, which gives a comprehensive survey of related previous methods not covered in this table.

Algorithm | Assumption | Setting | IC | PC | Guarantee
--------- | ---------- | ------- | -- | -- | ---------
Extragradient (Iusem et al., 2017) | pseudo-monotonicity (a) | stochastic | O(ϵ^-4) | 2𝒯 | ϵ-SP
OMD (Daskalakis et al., 2017) | bilinear | deterministic | N/A | 𝒯 | asymptotic
AvgPastExtraSGD (Gidel et al., 2018) | monotonicity | stochastic | O(ϵ^-2) | 𝒯 | ϵ-DG
OMD (Mertikopoulos et al., 2018) | coherence | stochastic | N/A | 𝒯 | asymptotic
IPP (Lin et al., 2018) | MVI has solution | stochastic | O(ϵ^-6) | 𝒯 | ϵ-SP
Alternating Gradient (Gidel et al., 2019) | bilinear game (b) | deterministic | O(log(1/ϵ)) | 𝒯 | ϵ-optim
SVRE (Chavdarova et al., 2019) | strong-monotonicity, finite sum | stochastic, finite sum | O(log(1/ϵ)) (c) | 𝒯 | ϵ-optim
Extragradient (Azizian et al., 2019) | strong-monotonicity | deterministic | O(log(1/ϵ)) | 2𝒯 | ϵ-optim
OSG (this work) | MVI has solution | stochastic | O(ϵ^-4) | 𝒯 | ϵ-SP
OAdagrad (this work) | MVI has solution | stochastic | Õ(ϵ^-2/(1-α)) | 𝒯 | ϵ-SP

(a) Note that the pseudo-monotonicity assumption used by (Iusem et al., 2017) can also be replaced by our MVI assumption in their proof. The main difference between our OSG and the stochastic extragradient method of (Iusem et al., 2017) is the number of stochastic gradient calculations in each iteration.
(b) Here the bilinear game is defined as min_x max_y x⊤Ay + x⊤b + c⊤y, where the smallest singular value of A is positive, x ∈ R^n, y ∈ R^n.
(c) Here n, L, μ denote the number of components in the finite-sum structure, the Lipschitz constant, and the strong-monotonicity parameter of the operator of the variational inequality, respectively; the O(log(1/ϵ)) rate hides a dependence on these quantities.

Table 1: Summary of different algorithms with IC (Iteration Complexity) and PC (Per-iteration Complexity) to find an ϵ-SP (ϵ-first-order Stationary Point), an ϵ-DG (ϵ-Duality Gap, i.e. a point (x̄, ȳ) such that max_y f(x̄, y) − min_x f(x, ȳ) ≤ ϵ), or ϵ-optim (a point ϵ-close to the set of optimal solutions). 𝒯 stands for the time complexity of invoking one stochastic first-order oracle, and 0 ≤ α ≤ 1/2 characterizes the growth rate of the cumulative stochastic gradient.

Our main goal is to design stochastic first-order algorithms that have low iteration complexity and low per-iteration cost, and that are suitable for a general class of non-convex non-concave min-max problems. The main tool we use in our analysis is the variational inequality.

Let T : X → R^d be an operator and X ⊆ R^d a closed convex set. The Stampacchia Variational Inequality (SVI) problem (Hartman and Stampacchia, 1966) is defined by the operator T and the set X and is denoted by SVI(T, X). It consists of finding u* ∈ X such that ⟨T(u*), u − u*⟩ ≥ 0 for ∀u ∈ X. A closely related problem is the Minty Variational Inequality (MVI) problem (Minty et al., 1962), denoted by MVI(T, X), which consists of finding u* ∈ X such that ⟨T(u), u − u*⟩ ≥ 0 for ∀u ∈ X. Min-max optimization is closely related to variational inequalities: the SVI and MVI corresponding to the min-max problem (1) are defined through the operator T(u) = (∇_x f(x, y), −∇_y f(x, y)) with u = (x, y).
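To make these definitions concrete, here is a minimal numerical check (our own toy example, not from the paper's experiments) on the bilinear game f(x, y) = xy, whose operator is T(u) = (y, −x) and whose saddle point is u* = (0, 0):

```python
import numpy as np

# Toy bilinear min-max game: f(x, y) = x * y, a standard illustrative example.
# The associated operator is T(u) = (df/dx, -df/dy) = (y, -x) for u = (x, y).
def T(u):
    x, y = u
    return np.array([y, -x])

u_star = np.array([0.0, 0.0])  # candidate MVI solution

# MVI(T, X) requires <T(u), u - u_star> >= 0 for all u; spot-check random points.
rng = np.random.default_rng(0)
for _ in range(1000):
    u = rng.normal(size=2)
    # here <(y, -x), (x, y)> = y*x - x*y = 0, so the condition holds with equality
    assert T(u) @ (u - u_star) >= 0
print("u* = (0, 0) satisfies the MVI condition at all sampled points.")
```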

Our main contributions are summarized as follows:

  • Following (Daskalakis et al., 2017), we extend the optimistic stochastic gradient (OSG) analysis beyond the bilinear and unconstrained case, by assuming Lipschitz continuity of the operator T and the existence of a solution for the variational inequality MVI(T, X). These conditions were considered in the analysis of the stochastic extragradient algorithm in (Iusem et al., 2017). We analyze a variant of Optimistic Stochastic Gradient (OSG) under these conditions, inspired by the analysis of (Iusem et al., 2017). We show that OSG achieves the state-of-the-art iteration complexity O(ϵ^-4) for finding an ϵ-first-order stationary point. Note that our OSG variant only requires invoking one stochastic first-order oracle per iteration while enjoying the state-of-the-art iteration complexity achieved by the stochastic extragradient method (Iusem et al., 2017).

  • Under the same conditions, we design an adaptive gradient algorithm named Optimistic Adagrad (OAdagrad), and show that it enjoys a better adaptive complexity Õ(ϵ^-2/(1-α)), where α characterizes the growth rate of the cumulative stochastic gradient and 0 ≤ α ≤ 1/2. Similar to Adagrad (Duchi et al., 2011), our main innovation is in considering variable metrics according to the geometry of the data in order to achieve a potentially faster convergence rate for a class of nonconvex-nonconcave min-max games. Note that this adaptive complexity improves upon the non-adaptive one (i.e. O(ϵ^-4)) achieved by OSG. To the best of our knowledge, we establish the first known adaptive complexity for adaptive gradient algorithms in a class of non-convex non-concave min-max problems.

  • We demonstrate the effectiveness of our algorithms in GAN training on CIFAR10 data. The empirical results identify an important reason why adaptive gradient methods behave well in GANs, namely that the cumulative stochastic gradient grows at a slow rate. We also show that OAdagrad outperforms Simultaneous Adam in sample quality for ImageNet generation using Self-Attention GANs (Zhang et al., 2018). This confirms the superiority of OAdagrad in min-max optimization.

2 Preliminaries and Notations

In this section, we fix some notations and give formal definitions of variational inequalities, and their relationship to the min-max problem (1).

Notations. Let X ⊆ R^d be a closed convex set and ‖·‖ the Euclidean norm. We denote by Π_X the projection operator onto X, i.e. Π_X(u) = argmin_{v∈X} ‖u − v‖². Define T(u) = (∇_x f(x, y), −∇_y f(x, y)) with u = (x, y) in problem (1). At a point u, we do not have access to T(u); we only have access to a noisy observation of T(u), namely T(u; ξ), where ξ is a random variable with distribution D. For ease of presentation, we use the terms stochastic gradient and stochastic first-order oracle interchangeably to refer to T(u; ξ) in the min-max setting.

We now give formal definitions of monotone operators and of the ϵ-first-order stationary point.

Definition 1 (Monotonicity).

An operator T is monotone if ⟨T(u) − T(v), u − v⟩ ≥ 0 for ∀u, v ∈ X. An operator T is pseudo-monotone if ⟨T(v), u − v⟩ ≥ 0 implies ⟨T(u), u − v⟩ ≥ 0 for ∀u, v ∈ X. An operator T is γ-strongly-monotone if ⟨T(u) − T(v), u − v⟩ ≥ γ‖u − v‖² for ∀u, v ∈ X.

Definition 2 (ϵ-First-Order Stationary Point).

A point u ∈ X is called an ϵ-first-order stationary point if ‖T(u)‖ ≤ ϵ.

Remark: We make the following observations:

  • From the definitions, it is evident that strong-monotonicity ⟹ monotonicity ⟹ pseudo-monotonicity. Pseudo-monotonicity of the operator together with the existence of a solution to SVI(T, X) implies that MVI(T, X) has a solution. To see this, assume that SVI(T, X) has a nonempty solution set, i.e. there exists u* such that ⟨T(u*), u − u*⟩ ≥ 0 for any u ∈ X. Noting that pseudo-monotonicity means that for every u, v, ⟨T(v), u − v⟩ ≥ 0 implies ⟨T(u), u − v⟩ ≥ 0, we have ⟨T(u), u − u*⟩ ≥ 0 for any u ∈ X, which means that u* is a solution of the Minty variational inequality. Note that the reverse may not be true; an example is provided in Appendix G.

  • For the min-max problem (1), when f is convex in x and concave in y, T is monotone, and solving SVI(T, X) is then equivalent to solving (1). When T is not monotone, by assuming T is Lipschitz continuous, it can be shown that the solution set of (1) is a subset of the solution set of SVI(T, X). However, even solving SVI(T, X) is NP-hard in general, and hence we resort to finding an ϵ-first-order stationary point.

Throughout the paper, we make the following assumption:

Assumption 1.
  • (i) T is L-Lipschitz continuous, i.e. ‖T(u) − T(v)‖ ≤ L‖u − v‖ for ∀u, v ∈ X.

  • (ii) MVI(T, X) has a solution, i.e. there exists u* such that ⟨T(u), u − u*⟩ ≥ 0 for ∀u ∈ X.

  • (iii) For ∀u ∈ X, E[T(u; ξ)] = T(u) and E‖T(u; ξ) − T(u)‖² ≤ σ².

Remark: Assumptions (i) and (iii) are commonly used in the literature on variational inequalities and non-convex optimization (Juditsky et al., 2011; Ghadimi and Lan, 2013; Iusem et al., 2017). Assumption (ii) is used frequently in previous work on algorithms for non-monotone variational inequalities (Iusem et al., 2017; Lin et al., 2018; Mertikopoulos et al., 2018). Assumption (ii) is weaker than other assumptions usually considered, such as pseudo-monotonicity, monotonicity, or the coherence condition assumed in (Mertikopoulos et al., 2018). For non-convex minimization problems, it has been shown that this assumption holds when using SGD to learn neural networks (Li and Yuan, 2017; Kleinberg et al., 2018; Zhou et al., 2019).

3 Optimistic Stochastic Gradient

This section serves as a warm-up and motivation for our main theoretical contribution presented in the next section. Inspired by (Iusem et al., 2017), we present an algorithm called Optimistic Stochastic Gradient (OSG) that saves the cost of the additional oracle call required in (Iusem et al., 2017) while maintaining the same iteration complexity. The main algorithm is described in Algorithm 1, where m_k denotes the minibatch size used for estimating the first-order oracle at iteration k. It is worth mentioning that Algorithm 1 becomes the stochastic extragradient method if one changes the query point of the oracle in line 3 from x_{k−1} to z_{k−1}. The stochastic extragradient method needs to compute stochastic gradients over both sequences {x_k} and {z_k}. In contrast, {z_k} is an ancillary sequence in OSG, and stochastic gradients are computed only over the sequence {x_k}. Thus the stochastic extragradient method is twice as expensive as OSG in each iteration. In tasks where stochastic gradient computation is expensive (e.g. training GANs), OSG is numerically more appealing.

1:  Input: x_0 = z_0 ∈ X
2:  for k = 1, …, N do
3:     x_k = Π_X[z_{k−1} − η · (1/m_{k−1}) Σ_{i=1}^{m_{k−1}} T(x_{k−1}; ξ_{k−1}^i)]
4:     z_k = Π_X[z_{k−1} − η · (1/m_k) Σ_{i=1}^{m_k} T(x_k; ξ_k^i)]
5:  end for
Algorithm 1 Optimistic Stochastic Gradient (OSG)

Remark: When X = R^d, the update in Algorithm 1 becomes the update of the algorithm in (Daskalakis et al., 2017), i.e.

x_{k+1} = x_k − 2η ĝ(x_k) + η ĝ(x_{k−1}),   (2)

where ĝ(x_k) = (1/m_k) Σ_{i=1}^{m_k} T(x_k; ξ_k^i). The detailed derivation of (2) can be found in Appendix F.
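To illustrate the single-oracle-call structure, the following is a minimal sketch of the unconstrained update (2). The toy bilinear game, the noise level, and all hyper-parameter values are our own illustrative choices, not the paper's experimental setup:

```python
import numpy as np

def osg(T_hat, u0, eta=0.1, n_iters=500):
    """Optimistic stochastic gradient in the unconstrained form (2):
    u_{k+1} = u_k - 2*eta*g(u_k) + eta*g(u_{k-1}),
    where T_hat returns a (possibly noisy) evaluation of the operator T."""
    u, g_prev = u0.copy(), T_hat(u0)
    for _ in range(n_iters):
        g = T_hat(u)                        # single oracle call per iteration
        u = u - 2 * eta * g + eta * g_prev  # correct with the remembered gradient
        g_prev = g
    return u

# Toy bilinear game f(x, y) = x*y, so T(u) = (y, -x); (0, 0) is the saddle point.
rng = np.random.default_rng(0)
T_hat = lambda u: np.array([u[1], -u[0]]) + 0.01 * rng.normal(size=2)
print(osg(np.array([1.0, 1.0])))  # ends near (0, 0) up to noise, unlike plain SGDA
```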

Theorem 1.

Suppose that Assumption 1 holds and let u* be a solution of MVI(T, X). Run Algorithm 1 for N iterations with a constant step size η = O(1/L). Then the iterates satisfy a stationarity bound of the form

(1/N) Σ_{k=1}^N E‖r_η(x_k)‖² ≤ O( L²‖x_0 − u*‖²/N + (σ²/N) Σ_{k=1}^N 1/m_k ),

where r_η denotes the projected-gradient residual associated with step size η; the full statement with explicit constants is proved in Appendix C.

Corollary 1.

Consider the unconstrained case where X = R^d, so that r_η(x) = T(x). With η = O(1/L) we have

(1/N) Σ_{k=1}^N E‖T(x_k)‖² ≤ O( L²‖x_0 − u*‖²/N + (σ²/N) Σ_{k=1}^N 1/m_k ).   (3)

Remark: There are two implications of Corollary 1 (a worked complexity calculation follows the list).

  • (Increasing Minibatch Size) Let m_k = k. To guarantee (1/N) Σ_{k=1}^N E‖T(x_k)‖² ≤ ϵ², the total number of iterations is N = Õ(ϵ^-2), and the total complexity (number of stochastic oracle calls) is Õ(ϵ^-4), where Õ hides a logarithmic factor of 1/ϵ.

  • (Constant Minibatch Size) Let m_k = m = O(σ²ϵ^-2). To guarantee (1/N) Σ_{k=1}^N E‖T(x_k)‖² ≤ ϵ², the total number of iterations is N = O(ϵ^-2), and the total complexity is O(ϵ^-4).
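To spell out where the totals come from, the following arithmetic (ours, using the two schedules above) counts the stochastic oracle calls:

```latex
% Increasing minibatch m_k = k with N = \tilde{O}(\epsilon^{-2}) iterations:
\sum_{k=1}^{N} m_k \;=\; \sum_{k=1}^{N} k \;=\; \frac{N(N+1)}{2} \;=\; O(N^2) \;=\; \tilde{O}(\epsilon^{-4}).
% Constant minibatch m = O(\sigma^2\epsilon^{-2}) with N = O(\epsilon^{-2}) iterations:
N \cdot m \;=\; O(\epsilon^{-2}) \cdot O(\sigma^2\epsilon^{-2}) \;=\; O(\epsilon^{-4}).
```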

4 Optimistic Adagrad

4.1 Adagrad for Minimization Problems

Before introducing Optimistic Adagrad, we give a quick overview of Adagrad (Duchi et al., 2011). The main objective in Adagrad is to solve the following minimization problem:

min_{x∈R^d} f(x) := E_ξ[F(x; ξ)],   (4)

where x ∈ R^d is the model parameter and ξ is a random variable following distribution D. The update rule of Adagrad is

x_{t+1} = x_t − η H_t^{-1} g_t,   (5)

where g_t = ∇F(x_t; ξ_t) is a stochastic gradient and H_t = δI + diag(√(Σ_{τ=1}^t g_τ ⊙ g_τ)), with ⊙ denoting the Hadamard (element-wise) product and the square root taken element-wise. Adagrad reduces to SGD when taking H_t = I. Different from SGD, Adagrad dynamically incorporates knowledge of the history of gradients to perform more informative gradient-based learning. When solving a convex minimization problem with sparse gradients, Adagrad converges faster than SGD. There are several variants of Adagrad, including Adam (Kingma and Ba, 2014), RMSProp (Tieleman and Hinton, 2012), and AmsGrad (Reddi et al., 2019). All of them share the same spirit, in that they take advantage of the information provided by the history of gradients. Wilson et al. (2017) provide a complete overview of different adaptive gradient methods in a unified framework. It is worth mentioning that Adagrad cannot be directly applied to solve non-convex non-concave min-max problems with a provable guarantee.
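As a reference point for the min-max variant below, here is a minimal sketch of the Adagrad update (5) for the minimization setting; the toy quadratic objective and all hyper-parameter values are our own illustrative choices:

```python
import numpy as np

def adagrad_step(x, g, v, eta=0.1, delta=1e-8):
    """One Adagrad update in the style of (5): accumulate squared gradients
    coordinate-wise and divide the step by their running square root."""
    v = v + g * g                           # v_t = sum of g ⊙ g (Hadamard product)
    x = x - eta * g / (np.sqrt(v) + delta)  # per-coordinate learning rate
    return x, v

# Usage on a toy quadratic f(x) = 0.5*||x||^2 with noisy gradients.
rng = np.random.default_rng(0)
x, v = np.ones(4), np.zeros(4)
for _ in range(200):
    g = x + 0.01 * rng.normal(size=4)  # stochastic gradient of f at x
    x, v = adagrad_step(x, g, v)
print(x)  # near the minimizer 0
```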

4.2 Optimistic Adagrad for min-max optimization

Our second algorithm, named Optimistic Adagrad (OAdagrad), is an adaptive variant of OSG that also updates the minimization variable and the maximization variable simultaneously. The key difference between OSG and OAdagrad is that OAdagrad inherits ideas from Adagrad to construct a variable metric based on the history of gradients, while OSG only uses a fixed metric. This difference helps us establish faster adaptive convergence under some mild assumptions. Note that for OAdagrad we only consider the unconstrained case, i.e. X = R^d.

Assumption 2.
  • (i) There exists G > 0 such that ‖T(u)‖_∞ ≤ G and ‖T(u; ξ)‖_∞ ≤ G for all u ∈ X almost surely.

  • (ii) There exists a universal constant D > 0 such that the iterates and the solution stay bounded, i.e. ‖x_k‖ ≤ D/2 for ∀k and ‖u*‖ ≤ D/2.

Remark: Assumption 2 (i) is a standard one often made in the literature (Duchi et al., 2011). Assumption 2 (ii) holds when we use normalization layers in the discriminator and generator, such as spectral normalization of the weights (Miyato et al., 2018; Zhang et al., 2018), which keeps the norms of the weights bounded. Regularization techniques such as weight decay also ensure that the weights of the networks remain bounded throughout training.

Let ĝ(x_k) = (1/m) Σ_{i=1}^m T(x_k; ξ_k^i) denote the minibatch stochastic gradient at x_k. Denote by ĝ_{1:t} = [ĝ(x_1), …, ĝ(x_t)] the concatenation of the stochastic gradients up to iteration t, and denote by ĝ_{1:t,i} the i-th row of ĝ_{1:t}.

1:  Input: x_0 = z_0, H_0 = δI
2:  for k = 1, …, N do
3:     x_k = z_{k−1} − η H_{k−1}^{-1} ĝ(x_{k−1})
4:     z_k = z_{k−1} − η H_{k−1}^{-1} ĝ(x_k)
5:     Update ĝ_{1:k} = [ĝ_{1:k−1}, ĝ(x_k)], s_{k,i} = ‖ĝ_{1:k,i}‖ for i = 1, …, d, and set H_k = δI + diag(s_{k,1}, …, s_{k,d})
6:  end for
Algorithm 2 Optimistic AdaGrad (OAdagrad)
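The following is a minimal runnable sketch of Algorithm 2 in the unconstrained case. The toy strongly monotone game, the step size, and the choice to seed the accumulator with the first gradient are our own illustrative choices, not the paper's experimental setup:

```python
import numpy as np

def oadagrad(T_hat, u0, eta=0.1, delta=0.1, n_iters=1000):
    """Sketch of Algorithm 2 (unconstrained case): the OSG two-sequence update
    with the fixed step replaced by an Adagrad-style diagonal metric whose i-th
    entry is delta + ||g_{1:k,i}|| (l2-norm of coordinate i's gradient history)."""
    z = u0.copy()
    g_prev = T_hat(z)
    s_sq = g_prev ** 2                 # per-coordinate sum of squared gradients
    for _ in range(n_iters):
        H = delta + np.sqrt(s_sq)      # diagonal of the current metric
        x = z - eta * g_prev / H       # line 3: optimistic step, stale gradient
        g = T_hat(x)                   # the single stochastic oracle call
        z = z - eta * g / H            # line 4: update the ancillary sequence
        s_sq += g ** 2                 # line 5: extend the gradient history
        g_prev = g
    return x

# Toy strongly monotone game f(x, y) = 0.5x^2 + xy - 0.5y^2, T(u) = (x+y, y-x).
print(oadagrad(lambda u: np.array([u[0] + u[1], u[1] - u[0]]),
               np.array([1.0, 1.0])))  # approaches the saddle point (0, 0)
```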
Theorem 2.

Suppose Assumptions 1 and 2 hold, and suppose ‖ĝ_{1:k,i}‖ ≤ δ_k with δ_k = O(k^α) for every i ∈ {1, …, d} and every k, where 0 ≤ α ≤ 1/2. With an appropriately chosen constant step size η, after running Algorithm 2 for N iterations, we have

(1/N) Σ_{k=1}^N E‖T(x_k)‖² ≤ O(N^{α−1}),   (6)

where the O(·) hides constants depending on the problem parameters. To make sure (1/N) Σ_{k=1}^N E‖T(x_k)‖² ≤ ϵ², the number of iterations is N = Õ(ϵ^-2/(1-α)), where Õ hides a logarithmic factor of 1/ϵ.

Remark:

  • We refer to Σ_{i=1}^d ‖ĝ_{1:N,i}‖ as the cumulative stochastic gradient, where α characterizes its growth rate in terms of the i-th coordinate. In our proof, a key quantity is Σ_{i=1}^d ‖ĝ_{1:N,i}‖, which crucially affects the computational complexity of Algorithm 2. Since ‖ĝ_{1:N,i}‖ ≤ G√N by Assumption 2 (i), in the worst case α = 1/2. But in practice the stochastic gradient is usually sparse, and hence α can be strictly smaller than 1/2.

  • As shown in Theorem 2, the minibatch size used in Algorithm 2 for estimating the first-order oracle can be any positive constant independent of ϵ. This is more practical than the results established in Theorem 1, since the minibatch size in Theorem 1 either increases with the number of iterations or depends on ϵ. When α = 1/2, the complexity of Algorithm 2 is Õ(ϵ^-4), which matches the complexity stated in Theorem 1. When α < 1/2, the complexity of OAdagrad given in Algorithm 2 is Õ(ϵ^-2/(1-α)), i.e., strictly better than that of OSG given in Algorithm 1 (a worked comparison follows the list).
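To make the comparison concrete, solving N^{α−1} = ϵ² for N recovers the stated iteration counts; the last line uses a hypothetical value α = 1/4 as an example:

```latex
N^{\alpha - 1} = \epsilon^{2} \;\Longrightarrow\; N = \epsilon^{-2/(1-\alpha)},
\qquad \alpha = \tfrac{1}{2} \Rightarrow N = \epsilon^{-4},
\qquad \alpha = 0 \Rightarrow N = \epsilon^{-2},
\qquad \alpha = \tfrac{1}{4} \Rightarrow N = \epsilon^{-8/3} \ll \epsilon^{-4}.
```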

Comparison with Alternating Adam and Optimistic Adam

Alternating Adam is very popular in GAN training (Goodfellow et al., 2014; Arjovsky et al., 2017; Gulrajani et al., 2017; Brock et al., 2018). In Alternating Adam, one alternates between multiple steps of Adam on the discriminator and a single step of Adam on the generator. The key difference between OAdagrad and Alternating Adam is that OAdagrad updates the discriminator and generator simultaneously. It is worth mentioning that OAdagrad naturally fits into the framework of Optimistic Adam proposed in (Daskalakis et al., 2017): instantiating the adaptive step sizes in their Algorithm 1 with Adagrad-style accumulators reduces it to OAdagrad with an annealing learning rate. To the best of our knowledge, there is no convergence proof for Alternating Adam for non-convex non-concave problems. Our convergence proof for OAdagrad provides a theoretical justification of a special case of Optimistic Adam.

5 Experiments

Figure 1: OAdagrad, OSG and Alternating Adam for WGAN-GP on CIFAR10 data
Figure 2: Cumulative stochastic gradient as a function of the number of iterations, where netD and netG stand for the discriminator and generator respectively. The blue curve and red curve show the growth of the cumulative stochastic gradient for OAdagrad and its corresponding tightest polynomial growth upper bound, respectively.
(a) Inception Score
(b) FID
Figure 3:

Self-Attention GAN on ImageNet, with evaluation using the official TensorFlow Inception Score and official TensorFlow FID. We see that OAdagrad indeed outperforms Simultaneous Adam in terms of the (TensorFlow) Inception Score (higher is better) and the (TensorFlow) Fréchet Inception Distance (lower is better). We do not report Alternating Adam since it collapsed in our runs.

WGAN-GP on CIFAR10

In the first experiment, we verify the effectiveness of the proposed algorithms in GAN training using the PyTorch framework (Paszke et al., 2017). We use Wasserstein GAN with gradient penalty (WGAN-GP) (Gulrajani et al., 2017) and the CIFAR10 dataset. The architectures of the discriminator and generator, and the penalty parameter in WGAN-GP, are set to be the same as in the original paper. We compare Alternating Adam, OSG and OAdagrad, where Alternating Adam runs several steps of Adam on the discriminator before performing one step of Adam on the generator. We try different batch sizes for each algorithm. For each algorithm, we tune the learning rate when using batch size 64, and use the same learning rate for the other batch sizes. We report the Inception Score (IS) (Salimans et al., 2016) as a function of the number of iterations. Figure 1 suggests that OAdagrad performs better than OSG and Alternating Adam, resulting in higher IS. We compare the generated CIFAR10 images associated with these three methods in Appendix A, and provide additional experimental results comparing the algorithms across minibatch sizes in Appendix E.

Growth Rate of Cumulative Stochastic Gradient

In the second experiment, we employ OAdagrad to train GANs and study the growth rate of the cumulative stochastic gradient (i.e., Σ_{i=1}^d ‖ĝ_{1:t,i}‖). We tune the learning rate and choose the batch size to be 64. In Figure 2, the blue curve and red curve stand for the growth rate for OAdagrad and its corresponding tightest polynomial growth upper bound, respectively; the multiplicative constant of the upper bound is chosen so that the red curve and blue curve overlap at the starting point of training, and the degree of the polynomial is determined using binary search. We can see that the cumulative stochastic gradient grows very slowly in GANs: the fitted polynomial degree is far below the worst case, both for WGAN-GP on CIFAR10 and for WGAN on the LSUN Bedroom dataset. As predicted by our theory, this behavior explains the faster convergence of OAdagrad versus OSG, consistent with what is observed empirically in Figure 1. One way to implement the degree estimation is sketched below.
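The following sketch shows one way to reproduce the fitting procedure described above; the logging name cum_grad and the synthetic curve are hypothetical, and the paper's exact implementation may differ:

```python
import numpy as np

def fit_growth_degree(cum_grad, lo=0.0, hi=1.0, tol=1e-3):
    """Binary-search the smallest degree a such that c * t**a upper-bounds the
    cumulative-gradient curve, with c chosen so both curves agree at the first
    point (mirroring how the red curves in Figure 2 are constructed)."""
    t = np.arange(1, len(cum_grad) + 1)
    while hi - lo > tol:
        a = (lo + hi) / 2
        c = cum_grad[0] / t[0] ** a          # match the two curves at t = 1
        if np.all(c * t ** a >= cum_grad):
            hi = a                           # bound holds: try a smaller degree
        else:
            lo = a                           # bound violated: degree too small
    return hi

# Hypothetical usage: cum_grad[k] logged as sum_i ||g_{1:k,i}|| during training.
cum_grad = 2.0 * np.arange(1, 5001) ** 0.2   # synthetic slow-growth curve
print(fit_growth_degree(cum_grad))           # recovers a degree of about 0.2
```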

Self-Attention GAN on ImageNet

In the third experiment, we consider GAN training on a large-scale dataset. We use the model from Self-Attention GAN (SA-GAN) (Zhang et al., 2018) with ImageNet as our dataset. Note that in this setting the boundedness of both generator and discriminator weights is ensured by spectral normalization of both networks. Three separate experiments are performed, covering Alternating Adam (baseline), Simultaneous Adam (Mescheder et al., 2017), and OAdagrad. The update rule of Simultaneous Adam performs Adam-type updates for the discriminator and generator simultaneously. Training is performed with batch size 128 for all experiments.

For the baseline experiment (Alternating Adam) we use the default settings and hyper-parameters reported in SA-GAN (Zhang et al., 2018) (note that we are not using the same batch size of 256 as in (Zhang et al., 2018), due to limited computational resources). In our experience, Alternating Adam training with batch size 128 and the same learning rates as in SA-GAN collapsed. This does not mean that Alternating Adam fails; it simply needs more tuning to find the correct range of learning rates for the particular batch size we use. With the hyper-parameter ranges we tried, Alternating Adam collapsed; with extra tuning effort and an expensive computational budget, Alternating Adam would eventually succeed. This is in line with the large-scale study in (Lucic et al., 2018), which states that given a large computational budget for tuning hyper-parameters, most GAN trainings succeed equally.

For both OAdagrad and Simultaneous Adam, we use different learning rates for the generator and the discriminator, as suggested in (Heusel et al., 2017). We report both the Inception Score (IS) and the Fréchet Inception Distance (FID) (Heusel et al., 2017) as functions of the number of iterations.

We compare the generated ImageNet images associated with the three optimization methods in Appendix A (Figure 5). Since Alternating Adam collapsed, we do not report its Inception Score or FID. As can be seen in Figure 3 and Figure 5, OAdagrad outperforms Simultaneous Adam in quantitative metrics (IS and FID) and in sample quality. Future work will investigate whether OAdagrad would benefit from training with larger batch sizes, in order to achieve state-of-the-art results.

6 Conclusion

In this paper, we explain the effectiveness of adaptive gradient methods in training GANs from both theoretical and empirical perspectives. Theoretically, we provide two efficient stochastic algorithms for solving a class of non-convex non-concave min-max problems with state-of-the-art computational complexities. We also establish adaptive complexity results for an Adagrad-style algorithm that uses coordinate-wise step sizes determined by the geometry of the gradient history. The algorithm is proven to enjoy faster adaptive convergence than its non-adaptive counterpart when the gradient is sparse, similar to Adagrad applied to convex minimization problems. We have conducted extensive empirical studies to verify our theoretical findings. In addition, our experimental results suggest that the reason adaptive gradient methods deliver good practical performance for GAN training is the slow growth rate of the cumulative stochastic gradient.

Acknowledgments

The authors thank the anonymous reviewers for their helpful comments. M. Liu and T. Yang are partially supported by National Science Foundation (IIS-1545995). M. Liu would like to thank Xiufan Yu from Pennsylvania State University for helpful discussions.

References

  • M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein GAN. arXiv preprint arXiv:1701.07875.
  • W. Azizian, I. Mitliagkas, S. Lacoste-Julien, and G. Gidel (2019) A tight and unified analysis of extragradient for a whole spectrum of differentiable games. arXiv preprint arXiv:1906.05945.
  • F. Bach and K. Y. Levy (2019) A universal algorithm for variational inequalities adaptive to smoothness and noise. arXiv preprint arXiv:1902.01637.
  • A. Brock, J. Donahue, and K. Simonyan (2018) Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096.
  • T. Chavdarova, G. Gidel, F. Fleuret, and S. Lacoste-Julien (2019) Reducing noise in GAN training with variance reduced extragradient. arXiv preprint arXiv:1904.08598.
  • C. Chiang, T. Yang, C. Lee, M. Mahdavi, C. Lu, R. Jin, and S. Zhu (2012) Online optimization with gradual variations. In Conference on Learning Theory, pp. 6–1.
  • C. D. Dang and G. Lan (2015) On the convergence properties of non-Euclidean extragradient methods for variational inequalities with generalized monotone operators. Computational Optimization and Applications 60 (2), pp. 277–310.
  • C. Daskalakis, A. Ilyas, V. Syrgkanis, and H. Zeng (2017) Training GANs with optimism. arXiv preprint arXiv:1711.00141.
  • C. Daskalakis and I. Panageas (2018) The limit points of (optimistic) gradient descent in min-max optimization. In Advances in Neural Information Processing Systems, pp. 9236–9246.
  • J. Duchi, E. Hazan, and Y. Singer (2011) Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12 (Jul), pp. 2121–2159.
  • S. Ghadimi and G. Lan (2013) Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization 23 (4), pp. 2341–2368.
  • G. Gidel, H. Berard, G. Vignoud, P. Vincent, and S. Lacoste-Julien (2018) A variational inequality perspective on generative adversarial networks. arXiv preprint arXiv:1802.10551.
  • G. Gidel, R. A. Hemmat, M. Pezeshki, R. L. Priol, G. Huang, S. Lacoste-Julien, and I. Mitliagkas (2019) Negative momentum for improved game dynamics. In Proceedings of Machine Learning Research, Vol. 89, pp. 1802–1811.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680.
  • P. Grnarova, K. Y. Levy, A. Lucchi, T. Hofmann, and A. Krause (2017) An online learning approach to generative adversarial networks. arXiv preprint arXiv:1706.03269.
  • I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville (2017) Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pp. 5767–5777.
  • P. Hartman and G. Stampacchia (1966) On some non-linear elliptic differential-functional equations. Acta Mathematica 115 (1), pp. 271–310.
  • M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637.
  • A. Iusem, A. Jofré, R. I. Oliveira, and P. Thompson (2017) Extragradient method with variance reduction for stochastic variational inequalities. SIAM Journal on Optimization 27 (2), pp. 686–724.
  • A. Juditsky, A. Nemirovski, and C. Tauvel (2011) Solving variational inequalities with stochastic mirror-prox algorithm. Stochastic Systems 1 (1), pp. 17–58.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • R. Kleinberg, Y. Li, and Y. Yuan (2018) An alternative view: when does SGD escape local minima?. arXiv preprint arXiv:1802.06175.
  • G. Korpelevich (1976) The extragradient method for finding saddle points and other problems. Matecon 12, pp. 747–756.
  • Y. Li and Y. Yuan (2017) Convergence analysis of two-layer neural networks with ReLU activation. In Advances in Neural Information Processing Systems, pp. 597–607.
  • Q. Lin, M. Liu, H. Rafique, and T. Yang (2018) Solving weakly-convex-weakly-concave saddle-point problems as weakly-monotone variational inequality. arXiv preprint arXiv:1810.10207.
  • T. Lin, C. Jin, and M. I. Jordan (2019) On gradient descent ascent for nonconvex-concave minimax problems. arXiv preprint arXiv:1906.00331.
  • M. Liu, Z. Yuan, Y. Ying, and T. Yang (2020) Stochastic AUC maximization with deep neural networks. In International Conference on Learning Representations.
  • S. Lu, I. Tsaknakis, and M. Hong (2019) Block alternating optimization for non-convex min-max problems: algorithms and applications in signal processing and communications. In ICASSP 2019, pp. 4754–4758.
  • M. Lucic, K. Kurach, M. Michalski, S. Gelly, and O. Bousquet (2018) Are GANs created equal? A large-scale study. In Advances in Neural Information Processing Systems, pp. 700–709.
  • E. V. Mazumdar, M. I. Jordan, and S. S. Sastry (2019) On finding local Nash equilibria (and only local Nash equilibria) in zero-sum games. arXiv preprint arXiv:1901.00838.
  • P. Mertikopoulos, H. Zenati, B. Lecouat, C. Foo, V. Chandrasekhar, and G. Piliouras (2018) Mirror descent in saddle-point problems: going the extra (gradient) mile. arXiv preprint arXiv:1807.02629.
  • L. Mescheder, S. Nowozin, and A. Geiger (2017) The numerics of GANs. In Advances in Neural Information Processing Systems, pp. 1825–1835.
  • G. J. Minty et al. (1962) Monotone (nonlinear) operators in Hilbert space. Duke Mathematical Journal 29 (3), pp. 341–346.
  • T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida (2018) Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957.
  • V. Nagarajan and J. Z. Kolter (2017) Gradient descent GAN optimization is locally stable. In Advances in Neural Information Processing Systems, pp. 5585–5595.
  • A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro (2009) Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization 19 (4), pp. 1574–1609.
  • A. Nemirovski and D. Yudin (1978) On Cezari's convergence of the steepest descent method for approximating saddle points of convex-concave functions. In Soviet Math. Dokl., Vol. 19, pp. 258–269.
  • A. Nemirovski (2004) Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization 15 (1), pp. 229–251.
  • A. S. Nemirovsky and D. B. Yudin (1983) Problem complexity and method efficiency in optimization.
  • Y. Nesterov (2007) Dual extrapolation and its applications to solving variational inequalities and related problems. Mathematical Programming 109 (2-3), pp. 319–344.
  • A. Paszke, S. Gross, S. Chintala, and G. Chanan (2017) PyTorch: tensors and dynamic neural networks in Python with strong GPU acceleration.
  • B. T. Polyak (1969) Minimization of unsmooth functionals. USSR Computational Mathematics and Mathematical Physics 9 (3), pp. 14–29.
  • H. Rafique, M. Liu, Q. Lin, and T. Yang (2018) Non-convex min-max optimization: provable algorithms and applications in machine learning. arXiv preprint arXiv:1810.02060.
  • S. Rakhlin and K. Sridharan (2013) Optimization, learning, and games with predictable sequences. In Advances in Neural Information Processing Systems, pp. 3066–3074.
  • S. J. Reddi, S. Kale, and S. Kumar (2019) On the convergence of Adam and beyond. arXiv preprint arXiv:1904.09237.
  • T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen (2016) Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pp. 2234–2242.
  • M. Sanjabi, M. Razaviyayn, and J. D. Lee (2018) Solving non-convex non-concave min-max games under Polyak-Łojasiewicz condition. arXiv preprint arXiv:1812.02878.
  • K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
  • T. Tieleman and G. Hinton (2012) Lecture 6.5-RMSProp, COURSERA: neural networks for machine learning. University of Toronto, Technical Report.
  • A. C. Wilson, R. Roelofs, M. Stern, N. Srebro, and B. Recht (2017) The marginal value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems, pp. 4148–4158.
  • Y. Xu, Z. Yuan, S. Yang, R. Jin, and T. Yang (2019) On the convergence of (stochastic) gradient descent with extrapolation for non-convex optimization. arXiv preprint arXiv:1901.10682.
  • A. Yadav, S. Shah, Z. Xu, D. Jacobs, and T. Goldstein (2017) Stabilizing adversarial nets with prediction methods. arXiv preprint arXiv:1705.07364.
  • H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena (2018) Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318.
  • R. Zhao (2019) Optimal stochastic algorithms for convex-concave saddle-point problems. arXiv preprint arXiv:1903.01687.
  • Y. Zhou, J. Yang, H. Zhang, Y. Liang, and V. Tarokh (2019) SGD converges to global minimum in deep learning via star-convex path. arXiv preprint arXiv:1901.00451.

Appendix A More Experimental Results

Comparison of Generated CIFAR10 Images by Different Optimization Methods. In this section, we report the generated CIFAR10 images during the training of WGAN-GP by three optimization methods (OSG, OAdagrad, Alternating Adam). Every method uses batch size 64, and one iteration corresponds to computing the stochastic gradient with minibatch size 64 once. Figure 4 shows images generated by the three optimization methods at iteration 8000. Visually we can see that OAdagrad is better than Alternating Adam, and both are significantly better than OSG. This is consistent with the Inception Score results reported in Figure 1, and it also illustrates the substantial benefits delivered by adaptive gradient methods when training GANs.

(a) OSG
(b) OAdagrad
(c) Alternating Adam
Figure 4: WGAN-GP: Generated CIFAR10 images using different optimization methods at iteration 8000.

Comparison of Generated ImageNet Images by Different Optimization Methods. In this section, we report the generated ImageNet images during the training of Self-Attention GAN by three optimization methods (OAdagrad, Simultaneous Adam, Alternating Adam). Every method uses batch size 128, and one iteration corresponds to computing the stochastic gradient with minibatch size 128 once. Figure 5 shows images generated by the three optimization methods at iteration 135000. Visually it is apparent that OAdagrad is better than Simultaneous Adam, and both are significantly better than Alternating Adam.

(a) OAdagrad
(b) Simultaneous Adam
(c) Alternating Adam
Figure 5: Self-Attention GAN (SA-GAN): generated ImageNet images using different optimization methods at iteration 135000. OAdagrad produces better-quality images than Simultaneous Adam. For both OAdagrad and Simultaneous Adam we use the same learning rates for the generator and the discriminator. Alternating Adam, in our experience, collapsed with the same learning rates as in SA-GAN. Note that our setting is different from SA-GAN, since our batch size is 128 while it is 256 in SA-GAN. It was also noted in SA-GAN that Alternating Adam is hard to train.

Unofficial PyTorch Inception Score and FID results for SA-GAN on ImageNet

(a) Inception Score
(b) FID
Figure 6: Self-Attention GAN on ImageNet, with evaluation using the unofficial PyTorch Inception Score and unofficial PyTorch FID. We see that OAdagrad indeed outperforms Simultaneous Adam in terms of the (PyTorch) Inception Score (higher is better) and the (PyTorch) Fréchet Inception Distance (lower is better). We do not report Alternating Adam since it collapsed in our runs.

Appendix B Related Work

Min-max Optimization and GAN Training

For convex-concave min-max optimization, the extragradient method was first proposed by (Korpelevich, 1976). Later on, under a gradient Lipschitz condition, Nemirovski (2004) extended the idea of extragradient to mirror-prox and obtained an O(1/N) convergence rate in terms of the duality gap (see also (Nesterov, 2007)), where N is the number of iterations. When only a stochastic first-order oracle is available, stochastic mirror-prox was analyzed by (Juditsky et al., 2011). The convergence rates for both deterministic and stochastic mirror-prox are optimal (Nemirovsky and Yudin, 1983). Recently, Zhao (2019) developed a nearly-optimal stochastic first-order algorithm for the case where the objective is strongly convex in the primal variable. Bach and Levy (2019) proposed a universal algorithm that is adaptive to smoothness and noise, and simultaneously achieves the optimal convergence rate.

There is a plethora of work analyzing the one-sided non-convex min-max problem, where the objective function is non-convex in the minimization variable but concave in the maximization variable. When the function is weakly convex in the minimization variable, Rafique et al. (2018) propose a stage-wise stochastic algorithm that approximately solves a convex-concave subproblem obtained by adding a quadratic regularizer, and show first-order convergence of the equivalent minimization problem. Under the same setting, Lu et al. (2019) utilize a block-based optimization strategy and show convergence of the stationarity gap. By further assuming that the function is smooth in the minimization variable, Lin et al. (2019) show that (stochastic) gradient descent ascent converges to a first-order stationary point of the equivalent minimization problem. Liu et al. (2020) cast stochastic AUC maximization with deep neural networks as a nonconvex-concave min-max problem, show that the PL (Polyak-Łojasiewicz) condition holds for the objective of the outer minimization problem, and propose an algorithm with an established fast convergence rate.

A more challenging problem is the non-convex non-concave min-max problem. Dang and Lan (2015) demonstrate that the deterministic extragradient method converges to an ϵ-first-order stationary point with a non-asymptotic guarantee. Under the condition that the objective function is weakly convex and weakly concave, Lin et al. (2018) design a stage-wise algorithm, where in each stage a strongly-convex strongly-concave subproblem is constructed by adding quadratic terms and appropriate stochastic algorithms can be employed to approximately solve it; they also show convergence to a stationary point. Sanjabi et al. (2018) design an alternating deterministic optimization algorithm, in which multiple steps of gradient ascent for the dual variable are conducted before one step of gradient descent for the primal variable is performed; they show convergence to a stationary point under the assumption that the inner maximization problem satisfies the PL condition (Polyak, 1969). Our work differs from these previous methods in several aspects. In comparison to (Lin et al., 2018), our result does not need a bounded-domain assumption; furthermore, our iteration complexity is O(ϵ^-4) for achieving an ϵ-first-order stationary point, while the corresponding complexity in (Lin et al., 2018) is O(ϵ^-6). In comparison to (Sanjabi et al., 2018), we do not assume that the PL (Polyak-Łojasiewicz) condition holds; additionally, our algorithm is stochastic and not restricted to the deterministic case. The most closely related work to ours is (Iusem et al., 2017). The stochastic extragradient method analyzed in (Iusem et al., 2017) requires computing two stochastic gradients per iteration, while our algorithm only needs one, since it memorizes the stochastic gradient from the previous iteration to guide the update in the current iteration. Nevertheless, we achieve the same iteration complexity as (Iusem et al., 2017).

There is a body of work analyzing the convergence behavior of min-max optimization algorithms and their application to training GANs (Heusel et al., 2017; Daskalakis and Panageas, 2018; Nagarajan and Kolter, 2017; Grnarova et al., 2017; Yadav et al., 2017; Gidel et al., 2018; Mertikopoulos et al., 2018; Mazumdar et al., 2019). A few of them (Heusel et al., 2017; Daskalakis and Panageas, 2018; Mazumdar et al., 2019) only establish asymptotic convergence. Others (Nagarajan and Kolter, 2017; Grnarova et al., 2017; Daskalakis et al., 2017; Yadav et al., 2017; Gidel et al., 2018; Mertikopoulos et al., 2018) focus on more restricted settings. For example, Nagarajan and Kolter (2017) and Grnarova et al. (2017) require concavity of the objective function in the dual variable. Yadav et al. (2017) and Gidel et al. (2018) assume the objective to be convex-concave. Mertikopoulos et al. (2018) impose the so-called coherence condition, which is stronger than our assumption. Daskalakis et al. (2017) analyze last-iterate convergence for bilinear problems. Recently, Gidel et al. (2019) analyzed the benefits of using negative momentum in alternating gradient descent to improve the training of a bilinear game. Chavdarova et al. (2019) develop a variance-reduced extragradient method and show its linear convergence under strong-monotonicity and finite-sum structure assumptions. Azizian et al. (2019) provide a unified analysis of extragradient for bilinear games, the strongly monotone case, and their intermediate cases. However, none of them give non-asymptotic convergence results for the class of non-convex non-concave min-max problems considered in our paper.

Appendix C Proof of Theorem 1

C.1 Facts

Suppose X is a closed and convex set. Then we have the following two standard properties of the Euclidean projection Π_X (stated here in the form used below; see e.g. (Iusem et al., 2017)):

Fact 1.

For all u ∈ X and v ∈ R^d, ‖Π_X(v) − u‖² ≤ ‖v − u‖² − ‖Π_X(v) − v‖².

Fact 2.

For all u ∈ X and v ∈ R^d, ⟨v − Π_X(v), u − Π_X(v)⟩ ≤ 0.

C.2 Lemmas

Lemma 1.

For any u ∈ X, we have

(7)
Proof.

Let u* ∈ X*, where X* is the set of solutions of MVI(T, X), i.e. ⟨T(u), u − u*⟩ ≥ 0 holds for ∀u ∈ X. For any u ∈ X, we have

(8)

where (a) holds by Fact 1. Note that

(9)

where the last inequality holds because ⟨T(x_k), x_k − u*⟩ ≥ 0, since u* is a solution of MVI(T, X). Note that

(10)

where (a) holds by Fact 2, the update rules of the algorithm, and the Cauchy-Schwarz inequality; (b) holds by the update rules of x_k and z_k; (c) holds by the non-expansiveness of the projection operator; (d) holds since T is L-Lipschitz continuous; and (e) holds by the choice of η.

Taking u = u* in (8) and combining it with (9) and (10), we have

(11)

Rearranging terms in (11) yields

(12)

Taking the summation over k = 1, …, N in (12), we obtain

(13)

Choosing η as in the statement of the lemma then gives the result. ∎

C.3 Main Proof of Theorem 1

Proof.

Define