# Towards Better Understanding of Adaptive Gradient Algorithms in Generative Adversarial Nets

## 1 Introduction

Adaptive gradient algorithms (Duchi et al., 2011; Tieleman and Hinton, 2012; Kingma and Ba, 2014; Reddi et al., 2019) are very popular for training deep neural networks due to their computational efficiency and minimal need for hyper-parameter tuning (Kingma and Ba, 2014). For example, Adagrad (Duchi et al., 2011) automatically adjusts the learning rate for each dimension of the model parameter based on the history of gradients, while its computational cost is almost the same as that of Stochastic Gradient Descent (SGD). However, in supervised deep learning (for example, image classification with a deep convolutional neural network), there is not enough evidence that adaptive gradient methods converge faster than their non-adaptive counterpart (i.e., SGD) on benchmark datasets. For example, it is argued in (Wilson et al., 2017) that adaptive gradient methods often find solutions with worse performance than SGD. Specifically, Wilson et al. (2017) observed that Adagrad converges more slowly than SGD in terms of both training and testing error when training VGG (Simonyan and Zisserman, 2014) on CIFAR10.

GANs (Goodfellow et al., 2014) are a popular class of generative models. In a nutshell, they consist of a generator and a discriminator, both defined by deep neural networks. The generator and the discriminator are trained under an adversarial cost, corresponding to a non-convex non-concave min-max problem. GANs are notoriously difficult to train. In practice, Adam (Kingma and Ba, 2014) is the de facto optimizer for GAN training. The common optimization strategy is to alternately update the discriminator and the generator (Arjovsky et al., 2017; Gulrajani et al., 2017). Using Adam is important in GAN training, since replacing it with a non-adaptive method (e.g., SGD) significantly deteriorates the performance. This paper studies and attempts to answer the question of why adaptive gradient methods work so well for training GANs.

Since GAN is a min-max optimization problem in nature, our problem of interest is to solve the following stochastic optimization problem:

$$\min_{u\in\mathcal{U}}\max_{v\in\mathcal{V}} F(u,v) := \mathbb{E}_{\xi\sim\mathcal{D}}\left[f(u,v;\xi)\right], \qquad (1)$$

where $\mathcal{U}$, $\mathcal{V}$ are closed and convex sets, $F$ is possibly non-convex in $u$ and non-concave in $v$, and $\xi$ is a random variable following an unknown distribution $\mathcal{D}$. In GAN training, $u$ and $v$ represent the parameters of the generator and the discriminator, respectively.

The ideal goal for solving (1) is to find a saddle point $(u^*, v^*)$ such that $F(u^*, v) \le F(u^*, v^*) \le F(u, v^*)$ for all $u\in\mathcal{U}$, $v\in\mathcal{V}$.

To achieve this goal, the typical assumption is that the objective function is convex-concave. When $F$ is convex in $u$ and concave in $v$, non-asymptotic guarantees in terms of the duality gap are well established by a series of works (Nemirovski and Yudin, 1978; Nemirovski, 2004; Nesterov, 2007; Nemirovski et al., 2009; Juditsky et al., 2011). However, when $F$ is non-convex in $u$ and non-concave in $v$, finding a saddle point is NP-hard in general. Instead, we focus on finding a first-order stationary point, provided that the objective function is smooth. That is, we aim to find $(u,v)$ such that $\|\nabla_u F(u,v)\| \le \epsilon$ and $\|\nabla_v F(u,v)\| \le \epsilon$. Note that this is a necessary condition for a (local) saddle point.

Related Work. Several works designed iterative first-order deterministic (Dang and Lan, 2015) and stochastic (Iusem et al., 2017; Lin et al., 2018) algorithms for achieving an $\epsilon$-first-order stationary point with a non-asymptotic guarantee. The goal is to find $z=(u,v)$ such that $\|T(z)\| \le \epsilon$ or $\mathbb{E}\|T(z)\| \le \epsilon$, where the first-order oracle is defined as $T(z) = (\nabla_u F(u,v), -\nabla_v F(u,v))$ with $z=(u,v)$, and the first-order stochastic oracle $\hat T(z;\xi)$ is a noisy observation of $T(z)$. For instance, Dang and Lan (2015) focus on the deterministic setting. On the other hand, (Iusem et al., 2017) develops a stochastic extra-gradient algorithm with a non-asymptotic iteration-complexity guarantee. The extra-gradient method requires two stochastic first-order oracle calls per iteration, which can be computationally expensive in deep learning applications such as GANs. The inexact proximal point method developed in (Lin et al., 2018) also comes with an iteration-complexity guarantee for finding an $\epsilon$-first-order stationary point. (The result in (Lin et al., 2018) assumes the first-order oracle is a weakly-monotone operator, which is milder than the Lipschitz-continuity assumption of Iusem et al. (2017); however, simply applying the Lipschitz-continuity condition in their proof does not change their iteration complexity.)

To avoid the cost of an additional oracle call in the extragradient step, several studies (Chiang et al., 2012; Rakhlin and Sridharan, 2013; Daskalakis et al., 2017; Gidel et al., 2018; Xu et al., 2019) proposed single-call variants of the extragradient algorithm. Some of them focus on the convex setting (e.g., (Chiang et al., 2012; Rakhlin and Sridharan, 2013)), while others focus on the non-convex setting (Xu et al., 2019). Closest to our work are (Daskalakis et al., 2017; Gidel et al., 2018), where the min-max setting and GAN training are considered. However, the convergence of those algorithms is only shown for a class of bilinear problems in (Daskalakis et al., 2017) and for monotone variational inequalities in (Gidel et al., 2018). Hence a big gap remains between the specific settings studied in (Daskalakis et al., 2017; Gidel et al., 2018) and more general non-convex non-concave min-max problems. Table 1 provides a complete overview of our results and existing results. It is hard to do justice to the large body of work on min-max optimization, so we refer the interested reader to Appendix B, which gives a comprehensive survey of related previous methods not covered in this table.

Our main goal is to design stochastic first-order algorithms with low iteration complexity and low per-iteration cost that are suitable for a general class of non-convex non-concave min-max problems. The main tool in our analysis is the theory of variational inequalities.

Let $T:\mathcal{Z}\to\mathbb{R}^d$ be an operator and $\mathcal{Z}$ a closed convex set. The Stampacchia Variational Inequality (SVI) problem (Hartman and Stampacchia, 1966) is defined by the operator $T$ and the set $\mathcal{Z}$ and is denoted by $\mathrm{SVI}(T,\mathcal{Z})$. It consists of finding $z^*\in\mathcal{Z}$ such that $\langle T(z^*), z - z^*\rangle \ge 0$ for all $z\in\mathcal{Z}$. A related problem is the Minty Variational Inequality (MVI) problem (Minty and others, 1962), denoted by $\mathrm{MVI}(T,\mathcal{Z})$, which consists of finding $z^*\in\mathcal{Z}$ such that $\langle T(z), z - z^*\rangle \ge 0$ for all $z\in\mathcal{Z}$. Min-max optimization is closely related to variational inequalities: the SVI and MVI corresponding to the min-max problem (1) are defined through the operator $T(z) = (\nabla_u F(u,v), -\nabla_v F(u,v))$ with $z = (u,v)$.
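As a concrete illustration of these definitions, the following sketch (ours, not from the paper) builds the operator $T$ for the toy bilinear game $f(u,v) = uv$ and checks the SVI and MVI conditions numerically at the saddle point $z^* = (0,0)$:

```python
import numpy as np

# Toy bilinear game f(u, v) = u * v (a classic hard case for simultaneous
# gradient descent ascent). The min-max operator stacks the gradient in u
# with the NEGATED gradient in v:  T(z) = (df/du, -df/dv) at z = (u, v).
def T(z):
    u, v = z
    return np.array([v, -u])   # df/du = v,  -df/dv = -u

# The unique saddle point is z* = (0, 0); T(z*) = 0, so z* solves SVI(T, Z).
z_star = np.array([0.0, 0.0])
assert np.allclose(T(z_star), 0.0)

# MVI check: <T(z), z - z*> >= 0 for all z. Here <T(z), z> = v*u - u*v = 0,
# so the MVI holds (with equality) everywhere for this operator.
rng = np.random.default_rng(0)
for _ in range(5):
    z = rng.normal(size=2)
    assert T(z) @ (z - z_star) >= -1e-12
```

The sign flip on the $v$-gradient is what turns the max over $v$ into a descent direction, so a single operator captures both players.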

Our main contributions are summarized as follows:

• Following (Daskalakis et al., 2017), we extend the analysis of optimistic stochastic gradient (OSG) beyond the bilinear and unconstrained case by assuming Lipschitz continuity of the operator $T$ and the existence of a solution of the Minty variational inequality $\mathrm{MVI}(T,\mathcal{Z})$. These conditions were considered in the analysis of the stochastic extragradient algorithm in (Iusem et al., 2017). We analyze a variant of Optimistic Stochastic Gradient (OSG) under these conditions, inspired by the analysis of (Iusem et al., 2017), and show that it achieves state-of-the-art iteration complexity for finding an $\epsilon$-first-order stationary point. Note that our OSG variant invokes only one stochastic first-order oracle per iteration while matching the state-of-the-art iteration complexity of the stochastic extragradient method (Iusem et al., 2017).

• We demonstrate the effectiveness of our algorithms in GAN training on CIFAR10 data. The empirical results identify an important reason why adaptive gradient methods perform well in GANs: the cumulative stochastic gradient grows at a slow rate. We also show that OAdagrad outperforms Simultaneous Adam in sample quality for ImageNet generation using self-attention GANs (Zhang et al., 2018). This confirms the superiority of OAdagrad in min-max optimization.

## 2 Preliminaries and Notations

In this section, we fix notation and give formal definitions of variational inequalities and their relationship to the min-max problem (1).

Notations. Let $\mathcal{Z}$ be a closed convex set and $\|\cdot\|$ the Euclidean norm. We denote by $\Pi_{\mathcal{Z}}$ the projection operator onto $\mathcal{Z}$, i.e., $\Pi_{\mathcal{Z}}(z) = \arg\min_{z'\in\mathcal{Z}} \|z' - z\|^2$. Define $T(z) = (\nabla_u F(u,v), -\nabla_v F(u,v))$ with $z=(u,v)$ in problem (1). At a point $z$ we do not have access to $T(z)$; we have access only to a noisy observation of $T$, namely $\hat T(z;\xi)$, where $\xi$ is a random variable with distribution $\mathcal{D}$. For ease of presentation, we use the terms stochastic gradient and stochastic first-order oracle interchangeably to stand for $\hat T(z;\xi)$ in the min-max setting.

We give here formal definitions of monotone operators and of the $\epsilon$-first-order stationary point.

###### Definition 1 (Monotonicity).

An operator $T$ is monotone if $\langle T(z) - T(z'), z - z'\rangle \ge 0$ for all $z, z'\in\mathcal{Z}$. An operator $T$ is pseudo-monotone if $\langle T(z'), z - z'\rangle \ge 0$ implies $\langle T(z), z - z'\rangle \ge 0$ for all $z, z'\in\mathcal{Z}$. An operator $T$ is $\lambda$-strongly-monotone if $\langle T(z) - T(z'), z - z'\rangle \ge \lambda\|z - z'\|^2$ for all $z, z'\in\mathcal{Z}$.

###### Definition 2 ($\epsilon$-First-Order Stationary Point).

A point $z\in\mathcal{Z}$ is called an $\epsilon$-first-order stationary point if $\|T(z)\| \le \epsilon$.

Remark: We make the following observations:

• From the definitions, it is evident that strong-monotonicity $\Rightarrow$ monotonicity $\Rightarrow$ pseudo-monotonicity. If $\mathrm{SVI}(T,\mathcal{Z})$ has a solution and $T$ is pseudo-monotone, then $\mathrm{MVI}(T,\mathcal{Z})$ has a solution. To see this, assume that the solution set of SVI is nonempty, i.e., there exists $z^*$ such that $\langle T(z^*), z - z^*\rangle \ge 0$ for any $z\in\mathcal{Z}$. Since pseudo-monotonicity means that $\langle T(z'), z - z'\rangle \ge 0$ implies $\langle T(z), z - z'\rangle \ge 0$ for every $z, z'$, we have $\langle T(z), z - z^*\rangle \ge 0$ for any $z\in\mathcal{Z}$, which means that $z^*$ solves the Minty variational inequality. The reverse may not be true; an example is provided in Appendix G.

• For the min-max problem (1), when $F$ is convex in $u$ and concave in $v$, the operator $T$ is monotone, and therefore solving $\mathrm{SVI}(T,\mathcal{Z})$ is equivalent to solving (1). When $T$ is not monotone, assuming $T$ is Lipschitz continuous, it can be shown that the solution set of (1) is a subset of the solution set of $\mathrm{SVI}(T,\mathcal{Z})$. However, even solving $\mathrm{SVI}(T,\mathcal{Z})$ is NP-hard in general, and hence we resort to finding an $\epsilon$-first-order stationary point.

Throughout the paper, we make the following assumption:

###### Assumption 1.
• $T$ is $L$-Lipschitz continuous, i.e., $\|T(z) - T(z')\| \le L\|z - z'\|$ for all $z, z'\in\mathcal{Z}$.

• $\mathrm{MVI}(T,\mathcal{Z})$ has a solution, i.e., there exists $z^*\in\mathcal{Z}$ such that $\langle T(z), z - z^*\rangle \ge 0$ for all $z\in\mathcal{Z}$.

• For all $z\in\mathcal{Z}$, $\mathbb{E}[\hat T(z;\xi)] = T(z)$ and $\mathbb{E}\|\hat T(z;\xi) - T(z)\|^2 \le \sigma^2$.

Remark: Assumptions (i) and (iii) are commonly used in the literature on variational inequalities and non-convex optimization (Juditsky et al., 2011; Ghadimi and Lan, 2013; Iusem et al., 2017). Assumption (ii) is used frequently in previous work analyzing algorithms for non-monotone variational inequalities (Iusem et al., 2017; Lin et al., 2018; Mertikopoulos et al., 2018), and is weaker than other assumptions usually considered, such as pseudo-monotonicity, monotonicity, or the coherence condition assumed in (Mertikopoulos et al., 2018). For non-convex minimization problems, it has been shown that this assumption holds when using SGD to learn neural networks (Li and Yuan, 2017; Kleinberg et al., 2018; Zhou et al., 2019).

## 3 Optimistic Stochastic Gradient

This section serves as a warm-up and motivation for our main theoretical contribution presented in the next section. Inspired by (Iusem et al., 2017), we present an algorithm called Optimistic Stochastic Gradient (OSG) that saves the cost of the additional oracle call required in (Iusem et al., 2017) while maintaining the same iteration complexity. The main algorithm is described in Algorithm 1, where $m_k$ denotes the minibatch size used to estimate the first-order oracle at iteration $k$. It is worth mentioning that Algorithm 1 becomes the stochastic extragradient method if the stale stochastic gradient in line 3 is replaced by a fresh one evaluated at the current iterate. The stochastic extragradient method must compute stochastic gradients over both the sequence $\{x_k\}$ and the sequence $\{z_k\}$. In contrast, $\{x_k\}$ is an ancillary sequence in OSG and the stochastic gradient is computed only over the sequence $\{z_k\}$. Thus, the stochastic extragradient method is twice as expensive as OSG per iteration. In tasks where the stochastic gradient computation is expensive (e.g., training GANs), OSG is numerically more appealing.

Remark: When $\mathcal{Z} = \mathbb{R}^d$, the update in Algorithm 1 reduces to the algorithm of (Daskalakis et al., 2017), i.e.,

$$z_{k+1} = z_k - 2\eta\cdot\frac{1}{m_k}\sum_{i=1}^{m_k}\hat T(z_k;\xi_i^k) + \eta\cdot\frac{1}{m_{k-1}}\sum_{i=1}^{m_{k-1}}\hat T(z_{k-1};\xi_i^{k-1}). \qquad (2)$$

The detailed derivation of (2) can be found in Appendix F.
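The unconstrained update (2) can be sketched as follows. This is our own illustration on the toy bilinear game $f(u,v) = uv$, with exact gradients standing in for the minibatch stochastic oracle:

```python
import numpy as np

# Toy bilinear game f(u, v) = u*v with exact operator T(z) = (v, -u);
# a deterministic stand-in for the minibatch oracle in Eq. (2).
def T(z):
    return np.array([z[1], -z[0]])

eta = 0.3
z_prev = z = np.array([1.0, 1.0])
for _ in range(500):
    # z_{k+1} = z_k - 2*eta*T(z_k) + eta*T(z_{k-1})
    z_prev, z = z, z - 2 * eta * T(z) + eta * T(z_prev)

# The iterates spiral into the saddle point (0, 0), whereas plain
# simultaneous gradient descent ascent diverges on this game for any
# constant step size.
```

The single extra term $\eta\,T(z_{k-1})$ is what distinguishes the optimistic update from plain gradient descent ascent, at no extra oracle cost.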

###### Theorem 1.

Suppose that Assumption 1 holds, and let $z^*$ be a solution of $\mathrm{MVI}(T,\mathcal{Z})$ as in Assumption 1 (ii). Choose the step size $\eta$ sufficiently small relative to $1/L$, and run Algorithm 1 for $N$ iterations. Then we have

$$\frac{1}{N}\sum_{k=1}^{N}\mathbb{E}\left[r_\eta^2(z_k)\right] \le \frac{8\|x_0 - x^*\|^2}{N} + \frac{100\eta^2}{N}\sum_{k=0}^{N}\frac{\sigma^2}{m_k},$$

where $r_\eta(z_k)$ is the step-size-scaled residual used to measure stationarity in the constrained case.
###### Corollary 1.

Consider the unconstrained case where $\mathcal{Z} = \mathbb{R}^d$. Under the setting of Theorem 1, we have

$$\frac{1}{N}\sum_{k=1}^{N}\mathbb{E}\|T(z_k)\|_2^2 \le \frac{8\|x_0 - x^*\|^2}{\eta^2 N} + \frac{100}{N}\sum_{k=0}^{N}\frac{\sigma^2}{m_k}. \qquad (3)$$

Remark: There are two implications of Corollary 1.

• (Increasing Minibatch Size) Take $m_k$ growing linearly in $k$. By (3), to guarantee $\frac{1}{N}\sum_{k=1}^N \mathbb{E}\|T(z_k)\|_2^2 \le \epsilon^2$, the total number of iterations is $\widetilde{O}(1/\epsilon^2)$, and the total complexity (the number of stochastic oracle calls $\sum_k m_k$) is $\widetilde{O}(1/\epsilon^4)$, where $\widetilde{O}(\cdot)$ hides a logarithmic factor of $1/\epsilon$.

• (Constant Minibatch Size) Take $m_k = O(1/\epsilon^2)$ for all $k$. To guarantee $\frac{1}{N}\sum_{k=1}^N \mathbb{E}\|T(z_k)\|_2^2 \le \epsilon^2$, the total number of iterations is $O(1/\epsilon^2)$, and the total complexity is $O(1/\epsilon^4)$.

 minw∈RdF(w)=Eζ∼Pf(w;ζ) (4)

where is the model parameter, and is an random variable following distribution . The update rule of Adagrad is

 wt+1=wt−ηH−1t^gt, (5)

(Tieleman and Hinton, 2012), and AmsGrad (Reddi et al., 2019). All of them share the spirit, as they take advantage of the information provided by the history of gradients. Wilson et al. (2017) provide a complete overview of different adaptive gradient methods in a unified framework. It is worth mentioning that Adagrad can not be directly applied to solve non-convex non-concave min-max problems with provable guarantee.
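To make the per-coordinate scaling in Eq. (5) concrete, here is a minimal Adagrad sketch; the badly scaled quadratic objective below is a hypothetical example of ours, not from the paper:

```python
import numpy as np

# Minimal Adagrad sketch of Eq. (5): w_{t+1} = w_t - eta * H_t^{-1} g_t,
# with H_t the diagonal matrix of root-accumulated squared gradients.
def adagrad(grad, w0, eta=1.0, delta=1e-8, steps=200):
    w = w0.astype(float).copy()
    accum = np.zeros_like(w)           # running sum of squared gradients
    for _ in range(steps):
        g = grad(w)
        accum += g * g
        w -= eta * g / (np.sqrt(accum) + delta)   # per-coordinate step size
    return w

# F(w) = 0.5 * w^T diag(1, 100) w: badly scaled, so a per-coordinate
# learning rate helps compared with a single global one.
grad = lambda w: np.array([1.0, 100.0]) * w
w = adagrad(grad, np.array([5.0, 5.0]))
```

Because each coordinate is divided by its own accumulated gradient magnitude, the steep direction (curvature 100) automatically receives a proportionally smaller step than the flat one.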

###### Assumption 2.
• There exist constants $G > 0$ and $\delta > 0$ such that $\|T(z)\|_\infty \le G$ and $\|\hat T(z;\xi)\|_\infty \le G$ for all $z\in\mathcal{Z}$, almost surely.

• There exists a universal constant $D > 0$ such that the iterates remain bounded, i.e., $\|x_k\| \le D$ and $\|z_k\| \le D$ for all $k$.

Remark: Assumption 2 (i) is a standard one often made in the literature (Duchi et al., 2011). Assumption 2 (ii) holds when normalization layers are used in the discriminator and the generator, such as spectral normalization of the weights (Miyato et al., 2018; Zhang et al., 2018), which keeps the norms of the weights bounded. Regularization techniques such as weight decay also ensure that the weights of the networks remain bounded throughout training.

Let $\hat g_k$ denote the stochastic gradient used at iteration $k$ of Algorithm 2. Denote by $\hat g_{1:k}$ the concatenation of $\hat g_1,\dots,\hat g_k$, and denote by $\hat g_{1:k,i}$ the $i$-th row of $\hat g_{1:k}$.

###### Theorem 2.

Suppose Assumptions 1 and 2 hold, and suppose $\|\hat g_{1:k,i}\|_2 \le \delta k^\alpha$ with $0 \le \alpha \le 1/2$ for every coordinate $i$ and every $k$. For a suitably small step size $\eta$, after running Algorithm 2 for $N$ iterations, we have

$$\frac{1}{N}\sum_{k=1}^{N}\mathbb{E}\|T(z_k)\|_2^2 \le \frac{8\log(N+1)\,D^2\delta^2\left(1 + d(N-1)^{\alpha}\right)}{\eta^2 N} + \frac{100\log(N+1)\left(\sigma^2/m + d\left(2\delta^2 N^{\alpha} + G^2\right)\right)}{N}. \qquad (6)$$

To make sure $\frac{1}{N}\sum_{k=1}^N\mathbb{E}\|T(z_k)\|_2^2 \le \epsilon^2$, the number of iterations is $\widetilde{O}\left(1/\epsilon^{2/(1-\alpha)}\right)$, where $\widetilde{O}(\cdot)$ hides a logarithmic factor of $1/\epsilon$.

Remark:

• We denote by $\hat g_{1:N}$ the cumulative stochastic gradient, where $\alpha$ characterizes the growth rate of its $i$-th coordinate, i.e., $\|\hat g_{1:N,i}\|_2 \le \delta N^\alpha$. In our proof, the growth rate $\alpha$ is a key quantity that crucially affects the computational complexity of Algorithm 2. Since $\alpha \le 1/2$, in the worst case $\alpha = 1/2$. But in practice the stochastic gradient is usually sparse, and hence $\alpha$ can be strictly smaller than $1/2$.

• As shown in Theorem 2, the minibatch size $m$ used in Algorithm 2 for estimating the first-order oracle can be any positive constant independent of $\epsilon$. This is more practical than the results established in Theorem 1, since the minibatch size in Theorem 1 either increases with the number of iterations or depends on $\epsilon$. When $\alpha = 1/2$, the complexity of Algorithm 2 is $\widetilde{O}(1/\epsilon^4)$, which matches the complexity stated in Theorem 1. When $\alpha < 1/2$, the complexity of OAdagrad given in Algorithm 2 is $\widetilde{O}(1/\epsilon^{2/(1-\alpha)})$, i.e., strictly better than that of OSG given in Algorithm 1.
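A hedged sketch of an OAdagrad-style update follows (we do not reproduce Algorithm 2 itself): it combines the optimistic step of Eq. (2) with the per-coordinate Adagrad scaling of Eq. (5). The quadratic game $f(u,v) = \tfrac{1}{2}u^2 + uv - \tfrac{1}{2}v^2$ is our own toy example, chosen because its operator is strongly monotone:

```python
import numpy as np

# Toy game f(u, v) = 0.5*u^2 + u*v - 0.5*v^2, whose min-max operator
# T(z) = (u + v, v - u) is strongly monotone: <T(z), z> = u^2 + v^2.
def T(z):
    u, v = z
    return np.array([u + v, v - u])

eta, delta = 0.3, 1e-8
z = np.array([1.0, 1.0])
g_prev = np.zeros(2)
accum = np.zeros(2)            # per-coordinate sum of squared gradients

for _ in range(500):
    g = T(z)
    accum += g * g
    # Optimistic step with an Adagrad-style diagonal preconditioner.
    z = z - eta * (2 * g - g_prev) / (np.sqrt(accum) + delta)
    g_prev = g

# The iterates approach the saddle point (0, 0) while the per-coordinate
# step sizes shrink automatically as gradients accumulate.
```

This illustrates the mechanism only; the convergence guarantee of Theorem 2 is for the full Algorithm 2 under Assumptions 1 and 2, not for this toy.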

## 5 Experiments

#### WGAN-GP on CIFAR10

In the first experiment, we verify the effectiveness of the proposed algorithms for GAN training using the PyTorch framework (Paszke et al., 2017).

#### Growth Rate of Cumulative Stochastic Gradient

In the second experiment, we employ OAdagrad to train GANs and study the growth rate of the cumulative stochastic gradient (i.e., $\max_i \|\hat g_{1:t,i}\|_2$). We tune the learning rate and choose the batch size to be 64. In Figure 2, the blue curve and the red curve stand for the growth rate achieved by OAdagrad and its corresponding tightest polynomial growth upper bound $c\,t^\alpha$, respectively, where $t$ is the number of iterations and $c$ is a multiplicative constant chosen so that the red and blue curves overlap at the starting point of training. The degree of the polynomial is determined using binary search. We can see that the cumulative stochastic gradient grows very slowly in GANs: the worst-case polynomial degree is $1/2$, but the fitted degree is much smaller for WGAN-GP on CIFAR10 and for WGAN on the LSUN Bedroom dataset. As predicted by our theory, this behavior explains the faster convergence of OAdagrad versus OSG, consistent with what is observed empirically in Figure 1.
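The growth-degree fit can be sketched as follows. We use a least-squares fit in log-log space instead of the paper's binary search, and the norm sequence with known degree 0.2 is our own synthetic data:

```python
import numpy as np

# If the cumulative gradient norm grows like ||g_{1:t}|| ~ c * t^alpha, then
# log ||g_{1:t}|| ~ log c + alpha * log t, so alpha is the slope of a
# least-squares line in log-log space.
def growth_degree(cum_norms):
    t = np.arange(1, len(cum_norms) + 1)
    slope, _ = np.polyfit(np.log(t), np.log(cum_norms), 1)
    return slope

# Synthetic check: norms generated with a known degree of 0.2 are recovered.
t = np.arange(1, 10001)
alpha_hat = growth_degree(3.0 * t ** 0.2)
```

On real training logs, `cum_norms` would be the recorded values of $\max_i \|\hat g_{1:t,i}\|_2$ per iteration; a fitted slope well below $1/2$ corresponds to the slow-growth regime in which Theorem 2 predicts OAdagrad beats OSG.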

#### Self-Attention GAN on ImageNet

In the third experiment, we consider GAN training on a large-scale dataset. We use the Self-Attention GAN model (Zhang et al., 2018) (SA-GAN) with ImageNet as our dataset. Note that in this setting the boundedness of both the generator and the discriminator parameters is ensured by spectral normalization of both networks. Three separate experiments are performed: Alternating Adam (baseline), Simultaneous Adam (Mescheder et al., 2017), and OAdagrad. The update rule of Simultaneous Adam performs the Adam-type update for the discriminator and the generator simultaneously. Training is performed with batch size 128 for all experiments.

For the baseline experiment (Alternating Adam), we use the default settings and hyper-parameters reported in SA-GAN (Zhang et al., 2018) (note that we are not using the same, larger batch size as in (Zhang et al., 2018) due to limited computational resources). In our experience, Alternating Adam training with batch size 128 and the same learning rates as in SA-GAN collapsed. This does not mean that Alternating Adam fails; it simply needs more tuning to find the correct range of learning rates for this particular batch size. With the hyper-parameter ranges we tried, Alternating Adam collapsed; with extra tuning effort and an expensive computational budget, Alternating Adam would eventually succeed. This is in line with the large-scale study in (Lucic et al., 2018), which states that given a large computational budget for tuning hyper-parameters, most GAN trainings succeed equally.

For both OAdagrad and Simultaneous Adam, we use different learning rates for the generator and the discriminator, as suggested in (Heusel et al., 2017). We report both the Inception Score (IS) and the Fréchet Inception Distance (FID) (Heusel et al., 2017) as a function of the number of iterations.

We compare the generated ImageNet images associated with the three optimization methods in Appendix 4. Since Alternating Adam collapsed, we do not report its Inception Score or FID. As can be seen in Figure 3 and Appendix 4, OAdagrad outperforms Simultaneous Adam in the quantitative metrics (IS and FID) and in sample quality. Future work includes investigating whether OAdagrad would benefit from training with a larger batch size, in order to achieve state-of-the-art results.

## Acknowledgments

The authors thank the anonymous reviewers for their helpful comments. M. Liu and T. Yang are partially supported by National Science Foundation (IIS-1545995). M. Liu would like to thank Xiufan Yu from Pennsylvania State University for helpful discussions.

## References

• M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein gan. arXiv preprint arXiv:1701.07875. Cited by: §1, §4.2.
• W. Azizian, I. Mitliagkas, S. Lacoste-Julien, and G. Gidel (2019) A tight and unified analysis of extragradient for a whole spectrum of differentiable games. arXiv preprint arXiv:1906.05945. Cited by: Appendix B, Table 1.
• F. Bach and K. Y. Levy (2019) A universal algorithm for variational inequalities adaptive to smoothness and noise. arXiv preprint arXiv:1902.01637. Cited by: Appendix B.
• A. Brock, J. Donahue, and K. Simonyan (2018) Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096. Cited by: §4.2.
• T. Chavdarova, G. Gidel, F. Fleuret, and S. Lacoste-Julien (2019) Reducing noise in gan training with variance reduced extragradient. arXiv preprint arXiv:1904.08598. Cited by: Appendix B, Table 1.
• C. Chiang, T. Yang, C. Lee, M. Mahdavi, C. Lu, R. Jin, and S. Zhu (2012) Online optimization with gradual variations. In Conference on Learning Theory, pp. 6–1. Cited by: §1.
• C. D. Dang and G. Lan (2015) On the convergence properties of non-euclidean extragradient methods for variational inequalities with generalized monotone operators. Computational Optimization and applications 60 (2), pp. 277–310. Cited by: Appendix B, §1.
• C. Daskalakis, A. Ilyas, V. Syrgkanis, and H. Zeng (2017) Training gans with optimism. arXiv preprint arXiv:1711.00141. Cited by: Appendix B, Appendix F, Towards Better Understanding of Adaptive Gradient Algorithms in Generative Adversarial Nets, 1st item, Table 1, §1, §3, §4.2.
• C. Daskalakis and I. Panageas (2018) The limit points of (optimistic) gradient descent in min-max optimization. In Advances in Neural Information Processing Systems, pp. 9236–9246. Cited by: Appendix B, §1.
• J. Duchi, E. Hazan, and Y. Singer (2011) Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12 (Jul), pp. 2121–2159. Cited by: §D.1, 2nd item, §1, §4.1, §4.2.
• S. Ghadimi and G. Lan (2013) Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization 23 (4), pp. 2341–2368. Cited by: §2.
• G. Gidel, H. Berard, G. Vignoud, P. Vincent, and S. Lacoste-Julien (2018) A variational inequality perspective on generative adversarial networks. arXiv preprint arXiv:1802.10551. Cited by: Appendix B, Table 1, §1.
• G. Gidel, R. A. Hemmat, M. Pezeshki, R. L. Priol, G. Huang, S. Lacoste-Julien, and I. Mitliagkas (2019) Negative momentum for improved game dynamics. In Proceedings of Machine Learning Research, K. Chaudhuri and M. Sugiyama (Eds.), Proceedings of Machine Learning Research, Vol. 89, , pp. 1802–1811. External Links: Link Cited by: Appendix B, Table 1.
• I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §1, §4.2.
• P. Grnarova, K. Y. Levy, A. Lucchi, T. Hofmann, and A. Krause (2017) An online learning approach to generative adversarial networks. arXiv preprint arXiv:1706.03269. Cited by: Appendix B.
• I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville (2017) Improved training of wasserstein gans. In Advances in neural information processing systems, pp. 5767–5777. Cited by: §1, §4.2, §5.
• P. Hartman and G. Stampacchia (1966) On some non-linear elliptic differential-functional equations. Acta mathematica 115 (1), pp. 271–310. Cited by: §1.
• M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637. Cited by: Appendix B, §5.
• A. Iusem, A. Jofré, R. I. Oliveira, and P. Thompson (2017) Extragradient method with variance reduction for stochastic variational inequalities. SIAM Journal on Optimization 27 (2), pp. 686–724. Cited by: Appendix B, Towards Better Understanding of Adaptive Gradient Algorithms in Generative Adversarial Nets, 1st item, Table 1, §1, §2, §3, footnote 2, footnote 3.
• A. Juditsky, A. Nemirovski, C. Tauvel, et al. (2011) Solving variational inequalities with stochastic mirror-prox algorithm. Stochastic Systems 1 (1), pp. 17–58. Cited by: Appendix B, §1, §2.
• D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §1, §1, §4.1.
• R. Kleinberg, Y. Li, and Y. Yuan (2018) An alternative view: when does sgd escape local minima?. arXiv preprint arXiv:1802.06175. Cited by: §2.
• G. Korpelevich (1976) The extragradient method for finding saddle points and other problems. Matecon 12, pp. 747–756. Cited by: Appendix B.
• Y. Li and Y. Yuan (2017) Convergence analysis of two-layer neural networks with relu activation. In Advances in Neural Information Processing Systems, pp. 597–607. Cited by: §2.
• Q. Lin, M. Liu, H. Rafique, and T. Yang (2018) Solving weakly-convex-weakly-concave saddle-point problems as weakly-monotone variational inequality. arXiv preprint arXiv:1810.10207. Cited by: Appendix B, Table 1, §1, §2, footnote 2.
• T. Lin, C. Jin, and M. I. Jordan (2019) On gradient descent ascent for nonconvex-concave minimax problems. arXiv preprint arXiv:1906.00331. Cited by: Appendix B.
• M. Liu, Z. Yuan, Y. Ying, and T. Yang (2020) Stochastic auc maximization with deep neural networks. In International Conference on Learning Representations, External Links: Link Cited by: Appendix B.
• S. Lu, I. Tsaknakis, and M. Hong (2019) Block alternating optimization for non-convex min-max problems: algorithms and applications in signal processing and communications. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4754–4758. Cited by: Appendix B.
• M. Lucic, K. Kurach, M. Michalski, S. Gelly, and O. Bousquet (2018) Are gans created equal? a large-scale study. In Advances in neural information processing systems, pp. 700–709. Cited by: §5.
• E. V. Mazumdar, M. I. Jordan, and S. S. Sastry (2019) On finding local nash equilibria (and only local nash equilibria) in zero-sum games. arXiv preprint arXiv:1901.00838. Cited by: Appendix B.
• P. Mertikopoulos, H. Zenati, B. Lecouat, C. Foo, V. Chandrasekhar, and G. Piliouras (2018) Mirror descent in saddle-point problems: going the extra (gradient) mile. arXiv preprint arXiv:1807.02629. Cited by: Appendix B, Table 1, §2.
• L. Mescheder, S. Nowozin, and A. Geiger (2017) The numerics of gans. In Advances in Neural Information Processing Systems, pp. 1825–1835. Cited by: §5.
• G. J. Minty et al. (1962) Monotone (nonlinear) operators in hilbert space. Duke Mathematical Journal 29 (3), pp. 341–346. Cited by: §1.
• T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida (2018) Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957. Cited by: §4.2.
• V. Nagarajan and J. Z. Kolter (2017) Gradient descent gan optimization is locally stable. In Advances in Neural Information Processing Systems, pp. 5585–5595. Cited by: Appendix B.
• A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro (2009) Robust stochastic approximation approach to stochastic programming. SIAM Journal on optimization 19 (4), pp. 1574–1609. Cited by: §1.
• A. Nemirovski and D. Yudin (1978) On Cezari's convergence of the steepest descent method for approximating saddle point of convex-concave functions. In Soviet Math. Dokl, Vol. 19, pp. 258–269. Cited by: §1.
• A. Nemirovski (2004) Prox-method with rate of convergence o (1/t) for variational inequalities with lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization 15 (1), pp. 229–251. Cited by: Appendix B, §1.
• A. S. Nemirovsky and D. B. Yudin (1983) Problem complexity and method efficiency in optimization.. Cited by: Appendix B.
• Y. Nesterov (2007) Dual extrapolation and its applications to solving variational inequalities and related problems. Mathematical Programming 109 (2-3), pp. 319–344. Cited by: Appendix B, §1.
• A. Paszke, S. Gross, S. Chintala, and G. Chanan (2017) PyTorch: tensors and dynamic neural networks in python with strong gpu acceleration. Team, Pytorch Core 6. Cited by: §5.
• B. T. Polyak (1969) Minimization of unsmooth functionals. USSR Computational Mathematics and Mathematical Physics 9 (3), pp. 14–29. Cited by: Appendix B.
• H. Rafique, M. Liu, Q. Lin, and T. Yang (2018) Non-convex min-max optimization: provable algorithms and applications in machine learning. arXiv preprint arXiv:1810.02060. Cited by: Appendix B.
• S. Rakhlin and K. Sridharan (2013) Optimization, learning, and games with predictable sequences. In Advances in Neural Information Processing Systems, pp. 3066–3074. Cited by: §1.
• S. J. Reddi, S. Kale, and S. Kumar (2019) On the convergence of adam and beyond. arXiv preprint arXiv:1904.09237. Cited by: §1, §4.1.
• T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen (2016) Improved techniques for training gans. In Advances in neural information processing systems, pp. 2234–2242. Cited by: §5.
• M. Sanjabi, M. Razaviyayn, and J. D. Lee (2018) Solving non-convex non-concave min-max games under polyak-l ojasiewicz condition. arXiv preprint arXiv:1812.02878. Cited by: Appendix B.
• K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §1.
• T. Tieleman and G. Hinton (2012) Lecture 6.5-rmsprop, coursera: neural networks for machine learning. University of Toronto, Technical Report. Cited by: §1, §4.1.
• A. C. Wilson, R. Roelofs, M. Stern, N. Srebro, and B. Recht (2017) The marginal value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems, pp. 4148–4158. Cited by: §1, §4.1.
• Y. Xu, Z. Yuan, S. Yang, R. Jin, and T. Yang (2019) On the convergence of (stochastic) gradient descent with extrapolation for non-convex optimization. arXiv preprint arXiv:1901.10682. Cited by: §1.
• A. Yadav, S. Shah, Z. Xu, D. Jacobs, and T. Goldstein (2017) Stabilizing adversarial nets with prediction methods. arXiv preprint arXiv:1705.07364. Cited by: Appendix B.
• H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena (2018) Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318. Cited by: 3rd item, §4.2, §5, §5.
• R. Zhao (2019) Optimal stochastic algorithms for convex-concave saddle-point problems. arXiv preprint arXiv:1903.01687. Cited by: Appendix B.
• Y. Zhou, J. Yang, H. Zhang, Y. Liang, and V. Tarokh (2019) SGD converges to global minimum in deep learning via star-convex path. arXiv preprint arXiv:1901.00451. Cited by: §2.

## Appendix A More Experimental Results

Comparison of Generated CIFAR10 Images by Different Optimization Methods In this section, we report the CIFAR10 images generated during the training of WGAN-GP by three optimization methods (OSG, OAdagrad, Alternating Adam). Every method uses batch size 64, and one iteration represents computing the stochastic gradient with minibatch size 64 once. Figure 4 shows images generated by the three optimization methods at iteration 8000. Visually, OAdagrad is better than Alternating Adam, and both are significantly better than OSG. This is consistent with the inception score results reported in Figure 1, and it also illustrates the substantial benefits delivered by adaptive gradient methods when training GANs.

Comparison of Generated ImageNet Images by Different Optimization Methods In this section, we report the ImageNet images generated during the training of Self-Attention GAN by three optimization methods (OAdagrad, Simultaneous Adam, Alternating Adam). Every method uses batch size 128, and one iteration represents computing the stochastic gradient with minibatch size 128 once. Figure 5 shows images generated by the three optimization methods at iteration 135000. Visually, it is apparent that OAdagrad is better than Simultaneous Adam, and both are significantly better than Alternating Adam.

Unofficial PyTorch Inception Score and FID results for SA-GAN on ImageNet

## Appendix B Related Work

#### Min-max Optimization and GAN Training

For convex-concave min-max optimization, the extragradient method was first proposed by Korpelevich (1976). Later on, under a gradient Lipschitz condition, Nemirovski (2004) extended the idea of extragradient to mirror-prox and obtained an O(1/N) convergence rate in terms of the duality gap (see also (Nesterov, 2007)), where N is the number of iterations. When only a stochastic first-order oracle is available, stochastic mirror-prox was analyzed by (Juditsky et al., 2011). The convergence rates of both deterministic and stochastic mirror-prox are optimal (Nemirovsky and Yudin, 1983). Recently, Zhao (2019) developed a nearly-optimal stochastic first-order algorithm for the case where the objective is strongly convex in the primal variable. Bach and Levy (2019) proposed a universal algorithm that is adaptive to smoothness and noise, and simultaneously achieves the optimal convergence rate.
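In the Euclidean unconstrained case, the extragradient update can be sketched in a few lines: evaluate the operator at an extrapolated point, then step from the current iterate using that evaluation. The bilinear operator below is our illustrative choice of a monotone, Lipschitz T:

```python
import numpy as np

def extragradient(T, u0, eta=0.5, iters=100):
    """Extragradient (Korpelevich, 1976): u <- u - eta * T(u - eta * T(u)).

    T is the operator of the min-max problem, e.g. for min_x max_y f(x, y),
    T(x, y) = (grad_x f(x, y), -grad_y f(x, y)).
    """
    u = np.asarray(u0, dtype=float)
    for _ in range(iters):
        z = u - eta * T(u)   # extrapolation (probe) step
        u = u - eta * T(z)   # update with the operator evaluated at the probe
    return u

# Operator of the bilinear saddle problem min_x max_y x*y; its solution is (0, 0).
T_bilinear = lambda u: np.array([u[1], -u[0]])
```

On this example, `extragradient(T_bilinear, [1.0, 1.0])` converges to the saddle point, whereas plain simultaneous gradient descent ascent with the same operator diverges.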

There is a plethora of work analyzing the one-sided nonconvex min-max problem, where the objective function is nonconvex in the minimization variable but concave in the maximization variable. When the function is weakly-convex in terms of the minimization variable, Rafique et al. (2018) propose a stage-wise stochastic algorithm that approximately solves a convex-concave subproblem by adding a quadratic regularizer, and they show the first-order convergence of the equivalent minimization problem. Under the same setting, Lu et al. (2019) utilize a block-based optimization strategy and show the convergence of the stationarity gap. By further assuming that the function is smooth in the minimization variable, Lin et al. (2019) show that (stochastic) gradient descent ascent is able to converge to a first-order stationary point of the equivalent minimization problem. Liu et al. (2020) cast the problem of stochastic AUC maximization with deep neural networks into a nonconvex-concave min-max problem, show that the PL (Polyak-Łojasiewicz) condition holds for the objective of the outer minimization problem, and propose an algorithm with an established fast convergence rate.

A more challenging problem is the non-convex non-concave min-max problem. Dang and Lan (2015) demonstrate that the deterministic extragradient method is able to converge to an ε-first-order stationary point with a non-asymptotic guarantee. Under the condition that the objective function is weakly-convex and weakly-concave, Lin et al. (2018) design a stage-wise algorithm, where in each stage a strongly-convex strongly-concave subproblem is constructed by adding quadratic terms, and appropriate stochastic algorithms can be employed to approximately solve it; they also show convergence to a stationary point. Sanjabi et al. (2018) design an alternating deterministic optimization algorithm, in which multiple steps of gradient ascent for the dual variable are conducted before one step of gradient descent for the primal variable is performed. They show convergence to a stationary point under the assumption that the inner maximization problem satisfies the PL condition (Polyak, 1969). Our work is different from these previous methods in several aspects. In comparison to (Lin et al., 2018), our result does not need the bounded domain assumption. Furthermore, our iteration complexity to achieve an ε-first-order stationary point is lower than the corresponding complexity in (Lin et al., 2018). When comparing to (Sanjabi et al., 2018), we do not assume that the PL (Polyak-Łojasiewicz) condition holds. Additionally, our algorithm is stochastic and not restricted to the deterministic case. Arguably the most related work to the present one is (Iusem et al., 2017). The stochastic extragradient method analyzed in (Iusem et al., 2017) requires calculating two stochastic gradients per iteration, while the present algorithm only needs one, since it memorizes the stochastic gradient from the previous iteration to guide the update in the current iteration. Nevertheless, we achieve the same iteration complexity as (Iusem et al., 2017).
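The gradient-reuse trick described above can be sketched as follows in the deterministic unconstrained case: the operator evaluation from the previous iteration is remembered and used for the extrapolation step, so each iteration makes only one fresh evaluation (names and step size are our illustrative choices):

```python
import numpy as np

def one_call_extragradient(T, x0, eta=0.3, iters=300):
    """Extragradient with gradient memory: one operator evaluation per iteration.

    z_k = x_{k-1} - eta * T(z_{k-1})   # uses the stored evaluation from last round
    x_k = x_{k-1} - eta * T(z_k)       # the only fresh evaluation this round
    """
    x = np.asarray(x0, dtype=float)
    g_prev = T(x)                      # memory initialized with T(z_0), z_0 = x_0
    for _ in range(iters):
        z = x - eta * g_prev           # extrapolate using the remembered evaluation
        g_prev = T(z)                  # single new evaluation, kept for the next round
        x = x - eta * g_prev
    return x
```

For the bilinear operator `T = lambda u: np.array([u[1], -u[0]])` this converges to the saddle point (0, 0) while evaluating T once per iteration, versus twice for the classical extragradient method.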

There is a body of work analyzing the convergence behavior of min-max optimization algorithms and their applications in training GANs (Heusel et al., 2017; Daskalakis and Panageas, 2018; Nagarajan and Kolter, 2017; Grnarova et al., 2017; Yadav et al., 2017; Gidel et al., 2018; Mertikopoulos et al., 2018; Mazumdar et al., 2019). A few of them (Heusel et al., 2017; Daskalakis and Panageas, 2018; Mazumdar et al., 2019) only have asymptotic convergence. Others (Nagarajan and Kolter, 2017; Grnarova et al., 2017; Daskalakis et al., 2017; Yadav et al., 2017; Gidel et al., 2018; Mertikopoulos et al., 2018) focus on more restricted settings. For example, Nagarajan and Kolter (2017) and Grnarova et al. (2017) require concavity of the objective function in the dual variable. Yadav et al. (2017) and Gidel et al. (2018) assume the objective to be convex-concave. Mertikopoulos et al. (2018) impose the so-called coherence condition, which is stronger than our assumption. Daskalakis et al. (2017) analyze last-iterate convergence for the bilinear problem. Recently, Gidel et al. (2019) analyze the benefits of using negative momentum in alternating gradient descent to improve the training of a bilinear game. Chavdarova et al. (2019) develop a variance-reduced extragradient method and show its linear convergence under strong monotonicity and finite-sum structure assumptions. Azizian et al. (2019) provide a unified analysis of extragradient for the bilinear game, the strongly monotone case, and their intermediate cases. However, none of them give non-asymptotic convergence results for the class of non-convex non-concave min-max problems considered in our paper.

## Appendix C Proof of Theorem 1

### C.1 Facts

Suppose $X \subseteq \mathbb{R}^d$ is a closed and convex set and $\Pi_X$ denotes the Euclidean projection onto $X$. Then we have:

Fact 1. For all $x \in X$ and $y \in \mathbb{R}^d$, $\|\Pi_X(y)-x\|^2 \le \|y-x\|^2 - \|y-\Pi_X(y)\|^2$.

Fact 2. For all $x \in X$ and $y \in \mathbb{R}^d$, $\langle \Pi_X(y)-y,\, x-\Pi_X(y)\rangle \ge 0$.
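Facts 1 and 2 are the standard projection inequalities for a closed convex set: the projection is closer to every point of the set than the pre-projection point is (by at least the squared residual), and the residual makes an obtuse angle with every direction into the set. A quick numerical check, using projection onto the Euclidean unit ball as our illustrative choice of X:

```python
import numpy as np

def proj_unit_ball(y):
    """Euclidean projection onto X = {x : ||x|| <= 1}."""
    n = np.linalg.norm(y)
    return y if n <= 1.0 else y / n

rng = np.random.default_rng(0)
for _ in range(1000):
    y = 3.0 * rng.normal(size=3)            # arbitrary point, possibly outside X
    x = proj_unit_ball(rng.normal(size=3))  # arbitrary point inside X
    p = proj_unit_ball(y)
    # Fact 1: ||Pi_X(y) - x||^2 <= ||y - x||^2 - ||y - Pi_X(y)||^2
    assert np.sum((p - x) ** 2) <= np.sum((y - x) ** 2) - np.sum((y - p) ** 2) + 1e-9
    # Fact 2: <Pi_X(y) - y, x - Pi_X(y)> >= 0
    assert np.dot(p - y, x - p) >= -1e-9
```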

### C.2 Lemmas

###### Lemma 1.

For $\eta \le \frac{1}{6\sqrt{2}L}$ (so that $1-18\eta^2L^2 \ge \frac{1}{2}$ and $1-36\eta^2L^2 \ge \frac{1}{2}$), we have

$$\frac{1}{2}\sum_{k=1}^{N}\|x_{k-1}-z_k\|^2+\frac{1}{2}\sum_{k=1}^{N}\|x_k-z_k\|^2 \le \|x_0-x^*\|^2-\|x_N-x^*\|^2+12\eta^2\sum_{k=0}^{N}\|\epsilon_k\|^2+\sum_{k=1}^{N}\Lambda_k \tag{7}$$
###### Proof.

Let $x^* \in X^*$, where $X^*$ is the set of optimal solutions of MVI, i.e., $\langle T(z), z-x^*\rangle \ge 0$ holds for any $z \in X$. Define $\hat{T}(\epsilon_k, z_k) = T(z_k)+\epsilon_k$, where $\epsilon_k$ denotes the stochastic gradient noise, and recall the update rules $z_k = \Pi_X(x_{k-1}-\eta\hat{T}(\epsilon_{k-1}, z_{k-1}))$ and $x_k = \Pi_X(x_{k-1}-\eta\hat{T}(\epsilon_k, z_k))$. For any $x \in X$, we have

$$\begin{aligned}
\|x_k-x\|^2 &= \|\Pi_X(x_{k-1}-\eta\hat{T}(\epsilon_k,z_k))-x\|^2 \\
&\overset{(a)}{\le} \|x_{k-1}-\eta\hat{T}(\epsilon_k,z_k)-x\|^2-\|x_{k-1}-\eta\hat{T}(\epsilon_k,z_k)-x_k\|^2 \\
&= \|x_{k-1}-x\|^2-\|x_{k-1}-x_k\|^2+2\langle x-x_k,\eta\hat{T}(\epsilon_k,z_k)\rangle \\
&= \|x_{k-1}-x\|^2-\|x_{k-1}-x_k\|^2+2\langle x-z_k,\eta\hat{T}(\epsilon_k,z_k)\rangle+2\langle z_k-x_k,\eta\hat{T}(\epsilon_k,z_k)\rangle \\
&= \|x_{k-1}-x\|^2-\|x_{k-1}-z_k+z_k-x_k\|^2+2\langle x-z_k,\eta\hat{T}(\epsilon_k,z_k)\rangle+2\langle z_k-x_k,\eta\hat{T}(\epsilon_k,z_k)\rangle \\
&= \|x_{k-1}-x\|^2-\|x_{k-1}-z_k\|^2-\|z_k-x_k\|^2-2\langle x_{k-1}-z_k,z_k-x_k\rangle+2\langle x-z_k,\eta\hat{T}(\epsilon_k,z_k)\rangle+2\langle z_k-x_k,\eta\hat{T}(\epsilon_k,z_k)\rangle \\
&= \|x_{k-1}-x\|^2-\|x_{k-1}-z_k\|^2-\|z_k-x_k\|^2+2\langle x-z_k,\eta\hat{T}(\epsilon_k,z_k)\rangle+2\langle x_k-z_k,x_{k-1}-\eta\hat{T}(\epsilon_k,z_k)-z_k\rangle
\end{aligned}\tag{8}$$

where (a) holds by using Fact 1. Note that

$$2\langle x^*-z_k,\eta\hat{T}(\epsilon_k,z_k)\rangle = 2\eta\langle x^*-z_k,T(z_k)\rangle+2\eta\langle x^*-z_k,\epsilon_k\rangle \le 2\eta\langle x^*-z_k,\epsilon_k\rangle \tag{9}$$

where the last inequality holds by the fact that $\langle T(z_k), z_k-x^*\rangle \ge 0$, since $x^*$ is a solution of MVI. Note that

$$\begin{aligned}
&2\langle x_k-z_k,x_{k-1}-\eta\hat{T}(\epsilon_k,z_k)-z_k\rangle \\
&= 2\langle x_k-z_k,x_{k-1}-\eta\hat{T}(\epsilon_{k-1},z_{k-1})-z_k\rangle+2\langle x_k-z_k,\eta(\hat{T}(\epsilon_{k-1},z_{k-1})-\hat{T}(\epsilon_k,z_k))\rangle \\
&\overset{(a)}{\le} 2\eta\|x_k-z_k\|\cdot\|\hat{T}(\epsilon_{k-1},z_{k-1})-\hat{T}(\epsilon_k,z_k)\| \\
&\overset{(b)}{=} 2\eta\|\Pi_X(x_{k-1}-\eta\hat{T}(\epsilon_k,z_k))-\Pi_X(x_{k-1}-\eta\hat{T}(\epsilon_{k-1},z_{k-1}))\|\cdot\|\hat{T}(\epsilon_{k-1},z_{k-1})-\hat{T}(\epsilon_k,z_k)\| \\
&\overset{(c)}{\le} 2\eta^2\|\hat{T}(\epsilon_{k-1},z_{k-1})-\hat{T}(\epsilon_k,z_k)\|^2 = 2\eta^2\|T(z_{k-1})+\epsilon_{k-1}-(T(z_k)+\epsilon_k)\|^2 \\
&\le 2\eta^2\left(\|T(z_{k-1})-T(z_k)\|+\|\epsilon_{k-1}\|+\|\epsilon_k\|\right)^2 \overset{(d)}{\le} 2\eta^2\left(L\|z_{k-1}-z_k\|+\|\epsilon_{k-1}\|+\|\epsilon_k\|\right)^2 \\
&\overset{(e)}{\le} 6\eta^2L^2\|z_{k-1}-z_k\|^2+6\eta^2\|\epsilon_{k-1}\|^2+6\eta^2\|\epsilon_k\|^2
\end{aligned}\tag{10}$$

where (a) holds by $\langle x_k-z_k,\, x_{k-1}-\eta\hat{T}(\epsilon_{k-1},z_{k-1})-z_k\rangle \le 0$ and the Cauchy–Schwarz inequality, where the former inequality comes from Fact 2 and the update rules of the algorithm, (b) holds by the update rules of $x_k$ and $z_k$, (c) holds by the nonexpansion property of the projection operator, (d) holds since $T$ is $L$-Lipschitz continuous, and (e) holds since $(a+b+c)^2 \le 3(a^2+b^2+c^2)$.

Define $\Lambda_k = 2\eta\langle x^*-z_k,\epsilon_k\rangle$. Taking $x=x^*$ in (8) and combining (9) and (10), we have

$$\|x_k-x^*\|^2 \le \|x_{k-1}-x^*\|^2-\|x_{k-1}-z_k\|^2-\|z_k-x_k\|^2+6\eta^2L^2\|z_{k-1}-z_k\|^2+6\eta^2\|\epsilon_{k-1}\|^2+6\eta^2\|\epsilon_k\|^2+\Lambda_k \tag{11}$$

Noting that

$$\|z_{k-1}-z_k\|^2 = \|z_{k-1}-x_{k-1}+x_{k-1}-x_k+x_k-z_k\|^2 \le 3\|z_{k-1}-x_{k-1}\|^2+3\|x_{k-1}-x_k\|^2+3\|x_k-z_k\|^2,$$

we rearrange terms in (11), which yields

$$\|x_{k-1}-z_k\|^2+\|z_k-x_k\|^2-6\eta^2L^2\left(3\|z_{k-1}-x_{k-1}\|^2+3\|x_{k-1}-z_k\|^2+3\|z_k-x_k\|^2\right) \le \|x_{k-1}-x^*\|^2-\|x_k-x^*\|^2+6\eta^2\|\epsilon_{k-1}\|^2+6\eta^2\|\epsilon_k\|^2+\Lambda_k \tag{12}$$

Taking the summation over $k=1,\dots,N$ in (12) and noting that $x_0=z_0$ yields

$$(1-18\eta^2L^2)\sum_{k=1}^{N}\|x_{k-1}-z_k\|^2+(1-36\eta^2L^2)\sum_{k=1}^{N}\|x_k-z_k\|^2 \le \|x_0-x^*\|^2-\|x_N-x^*\|^2+12\eta^2\sum_{k=0}^{N}\|\epsilon_k\|^2+\sum_{k=1}^{N}\Lambda_k \tag{13}$$

By taking $\eta \le \frac{1}{6\sqrt{2}L}$, we have $1-18\eta^2L^2 \ge \frac{1}{2}$ and $1-36\eta^2L^2 \ge \frac{1}{2}$, and the result follows. ∎
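As a sanity check, inequality (7) can be verified numerically in the noise-free case (all $\epsilon_k = 0$, hence $\Lambda_k = 0$) on a toy monotone problem; the bilinear operator, the step size, and $X = \mathbb{R}^2$ (so the projections are identities) are our illustrative choices:

```python
import numpy as np

# Bilinear saddle operator T(x, y) = (y, -x): monotone, L-Lipschitz with L = 1,
# and x* = (0, 0) satisfies the MVI condition <T(z), z - x*> >= 0 for all z.
T = lambda u: np.array([u[1], -u[0]])

eta, N = 0.05, 50                   # eta <= 1/(6*sqrt(2)*L), as Lemma 1 requires
x = z = np.array([1.0, 1.0])        # x_0 = z_0
x_star = np.zeros(2)

lhs, x0 = 0.0, x.copy()
for _ in range(N):
    z_new = x - eta * T(z)          # z_k = x_{k-1} - eta * T(z_{k-1})
    x_new = x - eta * T(z_new)      # x_k = x_{k-1} - eta * T(z_k)
    lhs += 0.5 * np.sum((x - z_new) ** 2) + 0.5 * np.sum((x_new - z_new) ** 2)
    x, z = x_new, z_new

rhs = np.sum((x0 - x_star) ** 2) - np.sum((x - x_star) ** 2)
assert lhs <= rhs                   # inequality (7) with all eps_k = 0
```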

### C.3 Main Proof of Theorem 1

###### Proof.

Define . Our goal is to get a bound on . We have: