A Variational Inequality Perspective on Generative Adversarial Nets

Gauthier Gidel et al., 02/28/2018

Stability has been a recurrent issue in training generative adversarial networks (GANs). One common way to tackle this issue has been to propose new formulations of the GAN objective. Yet, surprisingly few studies have looked at optimization methods specifically designed for this adversarial training. In this work, we review the "variational inequality" framework, which contains most formulations of the GAN objective introduced so far. Tapping into the mathematical programming literature, we counter some common misconceptions about the difficulties of saddle point optimization, propose to extend standard methods designed for variational inequalities to GAN training, such as a stochastic version of the extragradient method, and empirically investigate their behavior on GANs.


1 Introduction

Generative adversarial networks (GANs) (Goodfellow et al., 2014) form a generative modeling approach known for producing realistic natural images (Karras et al., 2018) as well as high-quality super-resolution (Ledig et al., 2017) and style transfer (Zhu et al., 2017). Nevertheless, GANs are also known to be difficult to train, often displaying an unstable behavior (Goodfellow, 2016). Much recent work has tried to tackle these training difficulties, usually by proposing new formulations of the GAN objective (Nowozin et al., 2016; Arjovsky et al., 2017). Each of these formulations can be understood as a two-player game, in the sense of game theory (Von Neumann and Morgenstern, 1944), and can be addressed as a variational inequality problem (VIP) (Harker and Pang, 1990), a framework that encompasses traditional saddle point optimization algorithms (Korpelevich, 1976).

Solving such GAN games is traditionally approached by running variants of stochastic gradient descent (SGD) initially developed for optimizing supervised neural network objectives. Yet it is known that for some games (Goodfellow, 2016, §8.2) SGD exhibits oscillatory behavior and fails to converge. This oscillatory behavior, which does not arise from stochasticity, highlights a fundamental problem: while a direct application of basic gradient descent is an appropriate method for regular minimization problems, it is not a sound optimization algorithm for the kind of two-player games underlying GANs. This constitutes a fundamental issue for GAN training, and calls for the use of more principled methods with more reassuring convergence guarantees.

Contributions.

We point out that multi-player games can be cast as variational inequality problems, and consequently the same applies to any GAN formulation posed as a minimax or non-zero-sum game. We present two techniques from this literature, namely averaging and extrapolation, widely used to solve variational inequality problems (VIPs) but which had not been explored in the context of GANs before.¹

¹ Independent works (Mertikopoulos et al., 2018) and (Yazıcı et al., 2018) respectively explored extrapolation and averaging in the context of GANs. More details in the related work section, §6.

We extend standard GAN training methods such as SGD or Adam into variants that incorporate these techniques (Alg. 3 and 4 are new). We also show that the oscillations of basic SGD for GAN training previously noticed (Goodfellow, 2016) are explained by standard variational inequality optimization results, and we illustrate how averaging and extrapolation can fix this issue.

We introduce a new technique, called extrapolation from the past, that only requires one gradient computation per iteration, whereas extrapolation requires two. We prove its convergence in the stochastic variational inequality setting, i.e. when applied to SGD.

Finally, we test these techniques in the context of standard GAN training. We observe a 4%-6% improvement in the inception score (Salimans et al., 2016) of WGAN (Arjovsky et al., 2017) and WGAN-GP (Gulrajani et al., 2017) on the CIFAR-10 dataset.

Outline.

§2 presents the background on GANs and optimization, and shows how to cast this optimization as a VIP. §3 presents standard techniques to optimize variational inequalities in a batch setting, as well as our new one, extrapolation from the past. §4 considers these methods in the stochastic setting, yielding three corresponding variants of SGD, and provides their respective convergence rates. §5 develops how to combine these techniques with already existing algorithms. §6 discusses related work and §7 presents experimental results.

2 GAN optimization as a variational inequality problem

2.1 GAN formulations

The purpose of generative modeling is to generate samples from a distribution that best matches the true distribution of the data. The generative adversarial network training strategy can be understood as a game between two players, called the generator and the discriminator. The former produces samples that the latter has to classify as real or fake data. The final goal is to build a generator able to produce samples realistic enough to fool the discriminator.

In the original GAN paper (Goodfellow et al., 2014), the GAN objective is formulated as a zero-sum game where the cost function of the discriminator $D_\varphi$ is given by the negative log-likelihood of the binary classification task between real data and fake data produced by the generator $G_\theta$,

$$\min_\theta \max_\varphi \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D_\varphi(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D_\varphi(G_\theta(z)))] \,. \qquad (1)$$

However, Goodfellow et al. (2014) recommend using in practice a second formulation, called the non-saturating GAN. This formulation is a non-zero-sum game where the two players aim to jointly minimize:

$$\mathcal{L}^{(D)}(\theta, \varphi) = -\mathbb{E}_{x \sim p_{\text{data}}}[\log D_\varphi(x)] - \mathbb{E}_{z \sim p_z}[\log(1 - D_\varphi(G_\theta(z)))]\,, \quad \mathcal{L}^{(G)}(\theta, \varphi) = -\mathbb{E}_{z \sim p_z}[\log D_\varphi(G_\theta(z))] \,. \qquad (2)$$

The dynamics of this formulation have the same stationary points as the zero-sum one (1) but are claimed to provide “much stronger gradients early in learning” (Goodfellow et al., 2014).

2.2 Equilibrium

The minimax formulation (1) is theoretically convenient because a large literature on games studies this problem and provides guarantees on the existence of equilibria. Nevertheless, practical considerations lead the GAN literature to consider a different objective for each player, as formulated in (2). In that case, the two-player game problem (Von Neumann and Morgenstern, 1944) consists in finding the following Nash equilibrium:

$$\theta^* \in \operatorname*{arg\,min}_{\theta \in \Theta} \mathcal{L}^{(G)}(\theta, \varphi^*) \quad \text{and} \quad \varphi^* \in \operatorname*{arg\,min}_{\varphi \in \Phi} \mathcal{L}^{(D)}(\theta^*, \varphi) \,. \qquad (3)$$

Only when $\mathcal{L}^{(G)} = -\mathcal{L}^{(D)}$ is the game called a zero-sum game, in which case (3) can be formulated as a minimax problem. One important point to notice is that the two optimization problems in (3) are coupled and have to be considered jointly from an optimization point of view.

Standard GAN objectives are non-convex (i.e. each cost function is non-convex), and thus such (pure) equilibria may not exist. To our knowledge, little is known about the existence of these equilibria for non-convex losses (see Heusel et al. (2017) and references therein for some results). In our theoretical analysis in §4, our assumptions (monotonicity (24) of the operator and convexity of the constraint set) imply the existence of an equilibrium.

In this paper, we focus on ways to optimize these games, assuming that an equilibrium exists. As is often standard in non-convex optimization, we also focus on finding points satisfying the necessary stationary conditions. As we mentioned previously, one difficulty that emerges in the optimization of such games is that the two different cost functions of (3) have to be minimized jointly in $\theta$ and $\varphi$. Fortunately, the optimization literature has for a long time studied so-called variational inequality problems, which generalize the stationary conditions for two-player game problems.

2.3 Variational inequality problem formulation

We first consider the local necessary conditions that characterize a solution of the smooth two-player game (3), defining stationary points; these will motivate the definition of a variational inequality. In the unconstrained setting, a stationary point is a couple $(\theta^*, \varphi^*)$ with zero gradients:

$$\nabla_\theta \mathcal{L}^{(G)}(\theta^*, \varphi^*) = 0 \quad \text{and} \quad \nabla_\varphi \mathcal{L}^{(D)}(\theta^*, \varphi^*) = 0 \,. \qquad (4)$$

When constraints are present,² a stationary point is such that the directional derivative of each cost function is non-negative in any feasible direction (i.e. there is no feasible descent direction):

$$\nabla_\theta \mathcal{L}^{(G)}(\theta^*, \varphi^*)^\top (\theta - \theta^*) \ge 0 \;\; \forall \theta \in \Theta \quad \text{and} \quad \nabla_\varphi \mathcal{L}^{(D)}(\theta^*, \varphi^*)^\top (\varphi - \varphi^*) \ge 0 \;\; \forall \varphi \in \Phi \,. \qquad (5)$$

² An example of a constraint for GANs is to clip the parameters of the discriminator (Arjovsky et al., 2017).

Defining $\omega \triangleq (\theta, \varphi)$, $\Omega \triangleq \Theta \times \Phi$ and $F(\omega) \triangleq \big(\nabla_\theta \mathcal{L}^{(G)}(\theta, \varphi),\, \nabla_\varphi \mathcal{L}^{(D)}(\theta, \varphi)\big)$, Eq. (5) can be compactly formulated as:

$$F(\omega^*)^\top (\omega - \omega^*) \ge 0\,, \quad \forall \omega \in \Omega \,. \qquad (6)$$

These stationary conditions can be generalized to any continuous vector field: let $\Omega \subset \mathbb{R}^d$ and let $F : \Omega \to \mathbb{R}^d$ be a continuous mapping. The variational inequality problem (Harker and Pang, 1990) (depending on $F$ and $\Omega$) is:

$$\text{find } \omega^* \in \Omega \;\text{ such that }\; F(\omega^*)^\top (\omega - \omega^*) \ge 0\,, \;\; \forall \omega \in \Omega \,. \qquad \text{(VIP)}$$

We call the optimal set $\Omega^*$ the set of $\omega \in \Omega$ verifying (VIP). The intuition behind it is that any $\omega^* \in \Omega^*$ is a fixed point of the dynamics induced by $-F$ (constrained to $\Omega$).
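To make this reduction concrete, here is a small PyTorch-style sketch (our illustration, not code from the paper; `loss_G` and `loss_D` are hypothetical placeholder costs) that assembles the joint vector field $F(\omega) = (\nabla_\theta \mathcal{L}^{(G)}, \nabla_\varphi \mathcal{L}^{(D)})$ of Eq. (6) from the two players' cost functions:

```python
import torch

theta = torch.randn(3, requires_grad=True)  # generator parameters
phi = torch.randn(3, requires_grad=True)    # discriminator parameters

def loss_G(theta, phi):  # hypothetical generator cost L^(G)
    return -(theta * phi).sum()

def loss_D(theta, phi):  # hypothetical discriminator cost L^(D)
    return (theta * phi).sum()

def F(theta, phi):
    """Joint vector field F(w) = (grad_theta L^(G), grad_phi L^(D)) of Eq. (6)."""
    g_theta, = torch.autograd.grad(loss_G(theta, phi), theta)
    g_phi, = torch.autograd.grad(loss_D(theta, phi), phi)
    return torch.cat([g_theta, g_phi])

# The basic gradient method on the game then iterates w <- w - eta * F(w).
print(F(theta, phi))
```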

We have thus shown that both saddle point optimization and non-zero-sum game optimization, which together encompass the large majority of GAN variants proposed in the literature, can be cast as variational inequality problems. In the following section, we turn to suitable optimization techniques for such problems.

3 Optimization of Variational Inequalities (batch setting)

Let us begin by looking at techniques that were developed in the optimization literature to solve (VIP). We present the intuitions behind them, as well as their performance on a simple bilinear problem (see Fig. 1). Our goal here is to provide mathematical insights into the techniques of averaging (§3.1) and extrapolation (§3.2), to inspire their application to extending other optimization algorithms. We then propose a novel variant of the extrapolation technique in §3.3: extrapolation from the past. Here we treat the batch setting, i.e. we consider that the operator $F$ defined in Eq. (6) yields an exact full gradient. We present extensions of these techniques to the stochastic setting later in §4.

The two standard methods studied in the VIP literature are the gradient method (Bruck, 1977) and the extragradient method (Korpelevich, 1976). The iterates of the basic gradient method are given by $\omega_{t+1} = P_\Omega[\omega_t - \eta F(\omega_t)]$, where $P_\Omega$ is the projection onto the constraint set (if constraints are present) associated to (VIP). These iterates are known to converge linearly under an additional assumption on the operator³ (Chen and Rockafellar, 1997), but oscillate for a bilinear operator, as shown in Fig. 1. On the other hand, the uniform average of these iterates converges for any bounded monotone operator with an $O(1/\sqrt{t})$ rate (Nedić and Ozdaglar, 2009), motivating the presentation of averaging in §3.1. By contrast, the extragradient method (extrapolated gradient) does not require any averaging to converge for monotone operators (in the batch setting), and can even converge at the faster $O(1/t)$ rate (Nesterov, 2007). The idea of this method is to compute a lookahead step (see intuition on extrapolation in §3.2) in order to compute a more stable direction to follow.

³ Strong monotonicity, a generalization of strong convexity. See §A.

3.1 Averaging

More generally, we consider a weighted averaging scheme with weights $\rho_t \ge 0$. This weighted averaging scheme was first proposed for (batch) VIPs by Bruck (1977),

$$\bar{\omega}_T = \frac{\sum_{t=0}^{T-1} \rho_t\, \omega_t}{\sum_{t=0}^{T-1} \rho_t} \,. \qquad (7)$$

Averaging schemes can be efficiently implemented in an online fashion by noticing that,

$$\bar{\omega}_t = (1 - \gamma_t)\, \bar{\omega}_{t-1} + \gamma_t\, \omega_t\,, \quad 0 \le \gamma_t \le 1 \,. \qquad (8)$$

For instance, setting $\gamma_t = 1/t$ provides uniform averaging ($\rho_t = 1$), and a constant $\gamma_t = 1 - \beta < 1$ provides geometric averaging, also known as exponential moving averaging ($\rho_t \propto \beta^{T-1-t}$). Averaging is experimentally compared with the other techniques presented in this section in Fig. 1.
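As a concrete illustration, here is a minimal Python sketch (our code, not from the paper) of the online averaging update (8); the uniform and EMA variants differ only in the choice of $\gamma_t$:

```python
import numpy as np

def online_average(iterates, mode="uniform", beta=0.9):
    """Online weighted averaging, Eq. (8): avg <- (1 - g_t) * avg + g_t * w_t."""
    avg = None
    for t, w in enumerate(iterates, start=1):
        g = 1.0 / t if mode == "uniform" else (1.0 - beta)  # uniform vs. EMA
        avg = w if avg is None else (1.0 - g) * avg + g * w
    return avg

# Example: average an oscillating sequence of 2-D iterates.
ws = [np.array([np.cos(0.5 * t), np.sin(0.5 * t)]) for t in range(200)]
print(online_average(ws, "uniform"))  # close to the center of oscillation (0, 0)
```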

In order to illustrate how averaging tackles the oscillatory behavior in game optimization, we consider a toy example where the discriminator and the generator are linear: $D_\varphi(x) = \varphi^\top x$ and $G_\theta(z) = z + \theta$ (implicitly defining the generated distribution $q_\theta$). By replacing these expressions in the WGAN objective,⁴ we get the following bilinear objective:

$$\min_\theta \max_\varphi \; \mathbb{E}_{x \sim p_{\text{data}}}[\varphi^\top x] - \mathbb{E}_{z \sim p_z}[\varphi^\top (z + \theta)] \,. \qquad (9)$$

⁴ The Wasserstein GAN (WGAN) proposed by Arjovsky et al. (2017) boils down to the following minimax formulation: $\min_\theta \max_{\varphi : \|D_\varphi\|_L \le 1} \mathbb{E}_{x \sim p_{\text{data}}}[D_\varphi(x)] - \mathbb{E}_{x' \sim q_\theta}[D_\varphi(x')]$.

A similar task was presented by Nagarajan and Kolter (2017), who consider a quadratic discriminator instead of a linear one and show that gradient descent is not necessarily asymptotically stable. The bilinear objective has been extensively used (Goodfellow, 2016; Mescheder et al., 2018; Yadav et al., 2018) to highlight the difficulties of gradient descent for saddle point optimization. Yet, ways to cope with this issue were proposed decades ago in the context of mathematical programming. Simplifying further by setting the dimension to 1 and centering the equilibrium at the origin, Eq. (9) becomes:

$$\min_\theta \max_\varphi \; \theta \cdot \varphi \,. \qquad (10)$$

The operator associated with this minimax game is $F(\theta, \varphi) = (\varphi, -\theta)$. There are several ways to compute the discrete updates of these dynamics. The two most common ones are the simultaneous and the alternated gradient update rules,

$$\text{Simultaneous:} \;\; \theta_{t+1} = \theta_t - \eta\, \varphi_t\,, \;\; \varphi_{t+1} = \varphi_t + \eta\, \theta_t\,; \qquad \text{Alternated:} \;\; \theta_{t+1} = \theta_t - \eta\, \varphi_t\,, \;\; \varphi_{t+1} = \varphi_t + \eta\, \theta_{t+1} \,. \qquad (11)$$

Interestingly, these two choices give rise to completely different behaviors. The norm of the simultaneous iterates diverges geometrically, whereas the alternated iterates are bounded but do not converge to the equilibrium. As a consequence, their respective uniform averages behave differently, as highlighted in the following proposition (more details and proof in §B.1):

Proposition 1.

The simultaneous iterates diverge geometrically and the alternated iterates defined in (11) are bounded but do not converge to 0, as

$$\theta_t^2 + \varphi_t^2 = (1 + \eta^2)^t (\theta_0^2 + \varphi_0^2) \;\; \text{(simultaneous)} \quad \text{and} \quad \theta_t^2 + \varphi_t^2 = \Theta(1) \;\; \text{(alternated)}\,, \qquad (12)$$

where $\eta > 0$ is the step-size and $(\theta_0, \varphi_0) \neq (0, 0)$ is the initialization.

The uniform average $(\bar{\theta}_t, \bar{\varphi}_t)$ of the simultaneous updates (resp. the alternated updates) diverges (resp. converges to 0) as,

$$\bar{\theta}_t^2 + \bar{\varphi}_t^2 = \Theta\!\big((1 + \eta^2)^t / t^2\big) \quad \big(\text{resp. } O(1/t^2)\big) \,. \qquad (13)$$

This sublinear convergence result, proved in §B, underlines the benefits of averaging when the sequence of iterates is bounded (i.e. for the alternated update rule). When the sequence of iterates is not bounded (i.e. for simultaneous updates), averaging fails to ensure convergence. This proposition also shows how alternated updates may have better convergence properties than simultaneous updates.
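The following self-contained sketch (our illustration, not code from the paper) simulates the simultaneous and alternated updates (11) on $\min_\theta \max_\varphi \theta \varphi$ and reports the norms of the last iterate and of the uniform average, matching the behavior described in Proposition 1:

```python
import numpy as np

def run(eta=0.1, T=2000, alternated=False):
    theta, phi, avg = 1.0, 1.0, np.zeros(2)
    for t in range(1, T + 1):
        if alternated:
            theta = theta - eta * phi  # generator step first
            phi = phi + eta * theta    # discriminator step uses the new theta
        else:
            theta, phi = theta - eta * phi, phi + eta * theta  # simultaneous
        avg += (np.array([theta, phi]) - avg) / t  # uniform online average, Eq. (8)
    return np.hypot(theta, phi), np.hypot(*avg)

for alt in (False, True):
    last, mean = run(alternated=alt)
    print(f"alternated={alt}: |last iterate|={last:.2e}, |average|={mean:.2e}")
# Simultaneous: both diverge. Alternated: last iterate oscillates, average -> 0.
```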

3.2 Extrapolation

Another technique used in the variational inequality literature to prevent oscillations is extrapolation. This concept predates the extragradient method, since Korpelevich (1976) mentions that the idea of extrapolated “prices” providing “stability” had already been formulated by Polyak (1963, Chap. II). The idea behind this technique is to compute the gradient at an (extrapolated) point different from the current point from which the update is performed, stabilizing the dynamics:

Compute extrapolated point: $\omega'_t = P_\Omega[\omega_t - \eta F(\omega_t)]$ (14)
Perform update step: $\omega_{t+1} = P_\Omega[\omega_t - \eta F(\omega'_t)]$ (15)

Note that, even in the unconstrained case, this method is intrinsically different from Nesterov’s momentum⁵ (Nesterov, 1983, Eq. 2.2.9) because of this lookahead step for the gradient computation:

$$\omega_{t+1} = \omega'_t - \eta F(\omega'_t)\,, \quad \text{where} \quad \omega'_t = \omega_t + \beta(\omega_t - \omega_{t-1}) \,. \qquad (16)$$

⁵ Sutskever (2013, §7.2) showed the equivalence between “standard momentum” and Nesterov’s formulation.

Nesterov’s method does not converge when trying to optimize (10). One intuition explaining why extrapolation provides better convergence properties than the standard gradient method comes from the framework of Euler’s method (see for instance (Atkinson, 2003) for more details on that topic). Indeed, if we consider a first-order approximation of $F(\omega'_t)$, we have $F(\omega'_t) = F(\omega_t - \eta F(\omega_t)) \approx F(\omega_{t+1})$, and consequently the update step (15) is close to an implicit method step:

$$\omega_{t+1} \approx \omega_t - \eta F(\omega_{t+1}) \,. \qquad (17)$$

In the literature on Euler’s method, implicit methods are known to be more stable and to benefit from better convergence properties (Atkinson, 2003) than explicit methods. They are not often used in practice, though, since they require solving a potentially non-linear system at each iteration.

Going back to the simplified WGAN toy example (10) from §3.1, we get the following update rules,

$$\text{Implicit:} \;\; \theta_{t+1} = \theta_t - \eta\, \varphi_{t+1}\,, \;\; \varphi_{t+1} = \varphi_t + \eta\, \theta_{t+1}\,; \qquad \text{Extrapolation:} \;\; \theta_{t+1} = \theta_t - \eta(\varphi_t + \eta\, \theta_t)\,, \;\; \varphi_{t+1} = \varphi_t + \eta(\theta_t - \eta\, \varphi_t) \,. \qquad (18)$$

In the following proposition, we will see that the respective convergence rates of the implicit method and of extrapolation are very similar. Keeping in mind that the latter has the major advantage of being more practical, this proposition clearly underlines the benefits of extrapolation (more details and proof in §B.2):

Proposition 2.

The squared norm of the iterates, $\theta_t^2 + \varphi_t^2$, where the updates of $\theta_t$ and $\varphi_t$ are defined in (18), decreases geometrically for any $0 < \eta < 1$ as,

$$\theta_{t+1}^2 + \varphi_{t+1}^2 = (1 + \eta^2)^{-1} (\theta_t^2 + \varphi_t^2) \;\; \text{(implicit)} \quad \text{and} \quad \theta_{t+1}^2 + \varphi_{t+1}^2 = (1 - \eta^2 + \eta^4)(\theta_t^2 + \varphi_t^2) \;\; \text{(extrapolation)} \,. \qquad (19)$$
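A minimal numerical check (our code) of the extrapolation update in (18) on the toy problem (10); the squared norm contracts by the factor $1 - \eta^2 + \eta^4$ per iteration:

```python
import numpy as np

eta, T = 0.5, 60
w = np.array([1.0, 1.0])                      # w = (theta, phi)
F = lambda v: np.array([v[1], -v[0]])         # F(theta, phi) = (phi, -theta)
for _ in range(T):
    w_look = w - eta * F(w)                   # extrapolation (lookahead), Eq. (14)
    w = w - eta * F(w_look)                   # update from the current point, Eq. (15)
print(w @ w, (1 - eta**2 + eta**4) ** T * 2)  # the two values match
```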

3.3 Extrapolation from the past

One issue with extrapolation is that the algorithm “wastes” a gradient (14): the gradient must be computed at two different points for every single update of the parameters. We thus propose a new technique, which we call extrapolation from the past, that only requires computing one gradient for every update. The idea of this technique is to store and re-use the previous extrapolated gradient to compute the new extrapolation point:

Extrapolation from the past: $\omega'_t = P_\Omega[\omega_t - \eta F(\omega'_{t-1})]$ (20)
Perform update step: $\omega_{t+1} = P_\Omega[\omega_t - \eta F(\omega'_t)]$ (21)

This update scheme can be related to optimistic mirror descent (Rakhlin and Sridharan, 2013; Daskalakis et al., 2018): in the unconstrained case, (20) and (21) reduce to,

$$\omega'_{t+1} = \omega'_t - 2\eta F(\omega'_t) + \eta F(\omega'_{t-1}) \,. \qquad (22)$$

However, our technique comes from a different perspective: it was motivated by VIPs and inspired by the extragradient method. Furthermore, our technique extends to constrained optimization, as shown in (20) and (21). It is not clear whether a single projection added to (22) provides a provably convergent algorithm. Using the VIP point of view, we are able to prove a linear convergence rate for a projected version of extrapolation from the past (see details and proof of Theorem 1 in §B.3). We also extend these results to the stochastic operator setting in §4.

Theorem 1 (Linear convergence of extrapolation from the past).

If $F$ is $\mu$-strongly monotone (see §A for the definition of strong monotonicity) and $L$-Lipschitz, then the updates (20) and (21) with a step-size $\eta = 1/(4L)$ provide linearly converging iterates,

(23)

In comparison to the results of Daskalakis et al. (2018), which hold only for a bilinear objective, we provide a faster convergence rate (linear vs. sublinear) on the last iterate, for a general (strongly monotone) operator and any projection onto a convex $\Omega$. One thing to notice is that the operator of a bilinear objective is not strongly monotone, but in that case one can use the standard extrapolation method (14), which converges linearly for a (constrained or unconstrained) bilinear game (Tseng, 1995, Cor. 3.3).
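An unconstrained sketch (our code) of extrapolation from the past (20)-(21) on the bilinear toy operator: a single fresh gradient per iteration, with the previous extrapolated gradient stored and reused:

```python
import numpy as np

eta, T = 0.3, 300
w = np.array([1.0, 1.0])
F = lambda v: np.array([v[1], -v[0]])  # bilinear toy operator
g_prev = np.zeros_like(w)              # stored gradient F(w'_{t-1})
for _ in range(T):
    w_look = w - eta * g_prev          # extrapolate using the *past* gradient, Eq. (20)
    g_prev = F(w_look)                 # the only gradient computation this iteration
    w = w - eta * g_prev               # update step, Eq. (21)
print(np.linalg.norm(w))               # decreases towards the equilibrium (0, 0)
```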

Figure 1: Comparison of the basic gradient method (as well as Adam) with the techniques presented in §3 on the optimization of (9). Only the algorithms advocated in this paper (Averaging, Extrapolation and Extrapolation from the past) converge quickly to the solution. Each marker represents 20 iterations. We compare these algorithms on a non-convex objective in §G.1.

4 Optimization of VIP with stochastic gradients

  Let $\omega_0 \in \Omega$
  for $t = 0 \ldots T-1$ do
     Sample $\xi_t \sim P$
     $\omega_{t+1} \leftarrow P_\Omega[\omega_t - \eta_t F(\omega_t, \xi_t)]$
  end for
  Return $\bar{\omega}_T = \sum_{t=0}^{T-1} \eta_t \omega_t \big/ \sum_{t=0}^{T-1} \eta_t$
Algorithm 1 AvgSGD
  Let $\omega_0 \in \Omega$
  for $t = 0 \ldots T-1$ do
     Sample $\xi_t \sim P$
     $\omega'_t \leftarrow P_\Omega[\omega_t - \eta_t F(\omega_t, \xi_t)]$
     Sample $\xi'_t \sim P$
     $\omega_{t+1} \leftarrow P_\Omega[\omega_t - \eta_t F(\omega'_t, \xi'_t)]$
  end for
  Return $\bar{\omega}'_T = \sum_{t=0}^{T-1} \eta_t \omega'_t \big/ \sum_{t=0}^{T-1} \eta_t$
Algorithm 2 AvgExtraSGD
  Let $\omega_0 \in \Omega$, $g_{-1} = 0$
  for $t = 0 \ldots T-1$ do
     $\omega'_t \leftarrow P_\Omega[\omega_t - \eta_t g_{t-1}]$
     Sample $\xi_t \sim P$ and set $g_t \leftarrow F(\omega'_t, \xi_t)$
     $\omega_{t+1} \leftarrow P_\Omega[\omega_t - \eta_t g_t]$
  end for
  Return $\bar{\omega}'_T = \sum_{t=0}^{T-1} \eta_t \omega'_t \big/ \sum_{t=0}^{T-1} \eta_t$
Algorithm 3 AvgPastExtraSGD
Figure 2: Three variants of SGD using the techniques introduced in §3.

In this section, we consider extensions of the techniques presented in §3 for optimizing (VIP) to the context of a stochastic operator. In this case, at each time step we no longer have access to the exact value $F(\omega)$, but to an unbiased stochastic estimate of it, $F(\omega, \xi)$, where $\xi \sim P$ and $F(\omega) = \mathbb{E}_{\xi \sim P}[F(\omega, \xi)]$. This is motivated by the GAN formulation, where we only have access to a finite-sample estimate of the expected gradient, computed on a mini-batch. For GANs, $\xi$ is thus a mini-batch of points coming from the true data distribution $p_{\text{data}}$ and the generator distribution $q_\theta$.

For our analysis, we require at least one of the two following assumptions on the stochastic operator:

Assumption 1.

Bounded variance by $\sigma^2$: $\;\mathbb{E}_\xi\big[\|F(\omega, \xi) - F(\omega)\|^2\big] \le \sigma^2\,, \;\; \forall \omega \in \Omega$.

Assumption 2.

Bounded expected squared norm by $M^2$: $\;\mathbb{E}_\xi\big[\|F(\omega, \xi)\|^2\big] \le M^2\,, \;\; \forall \omega \in \Omega$.

Assump. 1 is standard in stochastic variational analysis, while Assump. 2 is a stronger assumption sometimes made in stochastic convex optimization. To illustrate how strong Assump. 2 is, note that it does not hold for an unconstrained bilinear objective such as our example (10) in §3. It is thus mainly reasonable for bounded constraint sets. Note that in practice we have $\sigma \le M$.

We now present and analyze three algorithms, variants of SGD that are appropriate to solve (VIP). The first one, Alg. 1 (AvgSGD), is the stochastic extension of the gradient method for solving (VIP); Alg. 2 (AvgExtraSGD) uses extrapolation, and Alg. 3 (AvgPastExtraSGD) uses extrapolation from the past. A fourth variant (Alg. 5) is proposed in §D. These three algorithms return an average of the iterates. The proofs of the theorems presented in this section are in §F.
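To make the sampling pattern explicit, the loops below (our sketch of the unconstrained case; `operator` and `sample_batch` are hypothetical stand-ins for the stochastic operator $F(\omega, \xi)$ and the mini-batch sampler, and we average the extrapolated points following the mirror-prox convention) contrast Alg. 2, which draws two independent mini-batches per iteration, with Alg. 3, which reuses the stored stochastic gradient:

```python
def avg_extra_sgd(w, operator, sample_batch, eta, T):
    """Sketch of Alg. 2 (AvgExtraSGD): two mini-batches per iteration."""
    avg = w.copy()
    for t in range(1, T + 1):
        w_look = w - eta * operator(w, sample_batch())  # extrapolation step
        w = w - eta * operator(w_look, sample_batch())  # update, fresh batch
        avg += (w_look - avg) / t                       # running uniform average
    return avg

def avg_past_extra_sgd(w, operator, sample_batch, eta, T):
    """Sketch of Alg. 3 (AvgPastExtraSGD): one mini-batch per iteration."""
    avg, g = w.copy(), 0.0
    for t in range(1, T + 1):
        w_look = w - eta * g                   # reuse the stored stochastic gradient
        g = operator(w_look, sample_batch())   # single gradient computation
        w = w - eta * g
        avg += (w_look - avg) / t
    return avg
```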

To handle constraints such as parameter clipping (Arjovsky et al., 2017), we present projected versions of these algorithms, where $P_\Omega(\omega')$ denotes the projection of $\omega'$ onto $\Omega$ (see §A). Note that when $\Omega = \mathbb{R}^d$, the projection is the identity mapping (unconstrained setting). In order to prove the convergence of these three algorithms, we will assume that $F$ is monotone:

$$(F(\omega) - F(\omega'))^\top (\omega - \omega') \ge 0\,, \quad \forall \omega, \omega' \in \Omega \,. \qquad (24)$$

If $F$ can be written as in (6), monotonicity implies that the cost functions are convex.⁶

⁶ The convexity of the cost functions in (3) is a necessary condition (not sufficient) for the operator $F$ to be monotone. In the context of a zero-sum game, convexity of the cost functions is a sufficient condition.

Assumption 3.

$F$ is monotone and $\Omega$ is a compact convex set, such that $\max_{\omega, \omega' \in \Omega} \|\omega - \omega'\|^2 \le R^2$.

In that setting, the quantity $\mathrm{Err}_{\mathrm{VI}}(\omega) \triangleq \max_{\omega' \in \Omega} F(\omega')^\top (\omega - \omega')$ is well defined and is equal to 0 if and only if $\omega$ is a solution of (VIP). Moreover, if we are optimizing a zero-sum game, we have $\mathcal{L}^{(G)} = -\mathcal{L}^{(D)} \triangleq \mathcal{L}$ and $F(\omega) = (\nabla_\theta \mathcal{L}(\theta, \varphi), -\nabla_\varphi \mathcal{L}(\theta, \varphi))$. Hence, the quantity $\max_{\varphi' \in \Phi} \mathcal{L}(\theta, \varphi') - \min_{\theta' \in \Theta} \mathcal{L}(\theta', \varphi)$ is well defined and equal to 0 if and only if $(\theta, \varphi)$ is a Nash equilibrium of the game. Both quantities are called merit functions (more details on the concept of merit functions in §C). In the following, we call,

$$\mathrm{Err}(\omega) \triangleq \begin{cases} \max_{\varphi' \in \Phi} \mathcal{L}(\theta, \varphi') - \min_{\theta' \in \Theta} \mathcal{L}(\theta', \varphi) & \text{for zero-sum games,} \\ \mathrm{Err}_{\mathrm{VI}}(\omega) & \text{otherwise.} \end{cases} \qquad (25)$$
Averaging.

Alg. 1 (AvgSGD) presents the stochastic gradient method with averaging; it reduces to the standard (simultaneous) SGD updates for the two-player games used in the GAN literature, but returns an average of the iterates.

Theorem 2.

Under Assump. 1, 2 and 3, SGD with averaging (Alg. 1) with a constant step-size $\eta$ gives,

(26)

Thm. 2 uses a proof similar to (Nemirovski et al., 2009). The constant term in (26) is called the variance term. This type of bound is standard in stochastic optimization. We also provide in §F a similar rate with an extra log factor for decreasing step-sizes $\eta_t \propto 1/\sqrt{t}$. We show in §E that this variance term is smaller than the one of the SGD-with-prediction method (Yadav et al., 2018).

Extrapolations.

Alg. 2 (AvgExtraSGD) adds an extrapolation step compared to Alg. 1 in order to reduce the oscillations due to the game between the two players. A theoretical consequence is that it has a smaller variance term than (26). As discussed previously, Assump. 2, made in Thm. 2 for the convergence of Alg. 1, is very strong in the unbounded setting. One advantage of SGD with extrapolation is that Thm. 3 does not require this assumption.

Theorem 3.

(Juditsky et al., 2011, Thm. 1) Under Assump. 1 and 3, if $F$ is $L$-Lipschitz, then SGD with extrapolation and averaging (Alg. 2) using a constant step-size $\eta$ gives,

(27)

Since in practice $\sigma \ll M$, the variance term in (27) is significantly smaller than the one in (26). To summarize, SGD with extrapolation provides better convergence guarantees, but requires two gradient computations and samples per iteration. This motivates our new method, Alg. 3 (AvgPastExtraSGD), which uses extrapolation from the past and achieves the best of both worlds.

Theorem 4.

Under Assump. 1 and 3, if $F$ is $L$-Lipschitz, then SGD with extrapolation from the past using a constant step-size $\eta$ gives that the averaged iterates converge as,

(28)

The bound is similar to the one provided in Thm. 3, but each iteration of Alg. 3 is computationally half the cost of an iteration of Alg. 2.

5 Combining the techniques with established algorithms

In the previous sections, we presented several techniques that converge on a simple bilinear example. In practice, these techniques can be combined with existing algorithms. We propose to combine them with two standard algorithms used for training deep neural networks: the Adam optimizer (Kingma and Ba, 2015) and the SGD optimizer (Robbins and Monro, 1951). Note that in the case of a two-player game (3), the previous results can be generalized to gradient updates with a different step-size for each player, by simply rescaling the objectives $\mathcal{L}^{(G)}$ and $\mathcal{L}^{(D)}$ by a different scaling factor. A detailed pseudo-code for Adam with extrapolation step (Extra-Adam) is given in Algorithm 4.

Algorithm 4 Extra-Adam: proposed Adam with extrapolation step.
  input: step-size $\eta$, decay rates for moment estimates $\beta_1, \beta_2$, access to the stochastic gradients $\nabla \ell_t(\cdot)$ and to the projection $P_\Omega[\cdot]$ onto the constraint set $\Omega$, initial parameter $\omega_0$, averaging scheme $(\rho_t)$
  for $t = 0 \ldots T-1$ do
     Option 1: Standard extrapolation. Sample new mini-batch and compute stochastic gradient: $g_t \leftarrow \nabla \ell_t(\omega_t)$
     Option 2: Extrapolation from the past. Load previously saved stochastic gradient: $g_t \leftarrow \nabla \ell_{t-1/2}(\omega_{t-1/2})$
     Update estimate of first moment for extrapolation: $m_{t-1/2} \leftarrow \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$
     Update estimate of second moment for extrapolation: $v_{t-1/2} \leftarrow \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$
     Correct the bias for the moments: $\hat{m}_{t-1/2} \leftarrow m_{t-1/2}/(1 - \beta_1^{2t+1})$, $\hat{v}_{t-1/2} \leftarrow v_{t-1/2}/(1 - \beta_2^{2t+1})$
     Perform extrapolation step from iterate at time $t$: $\omega_{t+1/2} \leftarrow P_\Omega\big[\omega_t - \eta\, \hat{m}_{t-1/2} / (\sqrt{\hat{v}_{t-1/2}} + \epsilon)\big]$
     Sample new mini-batch and compute stochastic gradient: $g_{t+1/2} \leftarrow \nabla \ell_{t+1/2}(\omega_{t+1/2})$
     Update estimate of first moment: $m_t \leftarrow \beta_1 m_{t-1/2} + (1 - \beta_1)\, g_{t+1/2}$
     Update estimate of second moment: $v_t \leftarrow \beta_2 v_{t-1/2} + (1 - \beta_2)\, g_{t+1/2}^2$
     Compute bias-corrected first and second moments: $\hat{m}_t \leftarrow m_t/(1 - \beta_1^{2t+2})$, $\hat{v}_t \leftarrow v_t/(1 - \beta_2^{2t+2})$
     Perform update step from the iterate at time $t$: $\omega_{t+1} \leftarrow P_\Omega\big[\omega_t - \eta\, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)\big]$
  end for
  Output: $\omega_{T-1/2}$, $\omega_T$, or $\bar{\omega}_T = \sum_{t=0}^{T-1} \rho_t\, \omega_{t+1/2} \big/ \sum_{t=0}^{T-1} \rho_t$ (see (8) for online averaging)
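A hedged PyTorch-style simplification of Alg. 4 (our code, not the paper's implementation): instead of maintaining the moment estimates explicitly as above, we let a standard `torch.optim.Adam` instance update its moments at both half-steps; `loss_fn` is a hypothetical closure that samples a fresh mini-batch and returns the loss:

```python
import torch

def extra_adam_step(params, optimizer, loss_fn):
    """One iteration of Adam with extrapolation (simplified sketch of Alg. 4)."""
    backup = [p.detach().clone() for p in params]
    # Extrapolation step: gradient at w_t moves us to the lookahead point w_{t+1/2}.
    optimizer.zero_grad(); loss_fn().backward(); optimizer.step()
    # Gradient at the extrapolated point (new mini-batch sampled by loss_fn).
    optimizer.zero_grad(); loss_fn().backward()
    # Restore w_t, then perform the actual update from w_t with that gradient.
    with torch.no_grad():
        for p, b in zip(params, backup):
            p.copy_(b)
    optimizer.step()
```

In a GAN, one such step is performed jointly on the generator and discriminator parameters (one optimizer each), matching the simultaneous updates analyzed in §4.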

6 Related Work

The extragradient method is the standard algorithm to optimize variational inequalities. It was originally introduced by Korpelevich (1976) and extended by Nesterov (2007) and Nemirovski (2004). Stochastic versions of the extragradient method have recently been analyzed (Juditsky et al., 2011; Yousefian et al., 2014; Iusem et al., 2017) for stochastic variational inequalities with bounded constraints. A linearly convergent variance-reduced version of the stochastic gradient method has been proposed by Palaniappan and Bach (2016) for strongly monotone variational inequalities.

Several methods to stabilize GANs consist in transforming a zero-sum formulation into a more general game that can no longer be cast as a saddle point problem. This is the case of the non-saturating formulation of GANs (Goodfellow et al., 2014; Fedus et al., 2018), the DCGANs (Radford et al., 2016), and the gradient penalty⁷ for WGANs (Gulrajani et al., 2017). Yadav et al. (2018) propose an optimization method for GANs based on AltSGD using a momentum-based step on the generator. Daskalakis et al. (2018) proposed a method inspired by game theory. Li et al. (2017) suggest dualizing the GAN objective to reformulate it as a maximization problem, and Mescheder et al. (2017) propose to add the norm of the gradient to the objective and provide an interesting perspective on GANs, interpreting the training as the search for a two-player game equilibrium. A study of the continuous version of two-player games has been conducted by Ratliff et al. (2016). Interesting non-convex results were proved, for a new notion of regret minimization, by Hazan et al. (2017), and in the context of GANs by Grnarova et al. (2018).

⁷ The gradient penalty is only added to the discriminator cost function. Since this gradient penalty also depends on the generator, WGAN-GP cannot be cast as a saddle point problem and is actually a non-zero-sum game.

The technique of unrolling steps proposed by Metz et al. (2017) can be confused with extrapolation but is actually fundamentally different: the perspective there is to try to construct the “true generator objective function” by unrolling the updates for $k$ steps before updating the other player. Nevertheless, the fact that this “true generator function” may not be found with satisfying accuracy may lead to a behavior different from the one expected.

Regarding the averaging technique, some recent works appear to have already successfully used geometric averaging (7) for GANs in practice, but only briefly mention it (Karras et al., 2018; Mescheder et al., 2018). By contrast, the present work formally motivates and justifies the use of averaging for GANs by relating it to the VIP perspective, and sheds light on its underlying intuitions in §3.1. Another independent work (Yazıcı et al., 2018) made a similar attempt, but in the context of regret minimization in games. Mertikopoulos et al. (2018) also independently explored extrapolation, providing asymptotic convergence results (i.e. without any rate of convergence) in the context of coherent saddle points. The coherence assumption is slightly weaker than monotonicity.

Model: WGAN (no averaging | uniform avg | EMA) and WGAN-GP (no averaging | uniform avg). Methods compared: SimAdam, AltAdam5, ExtraAdam, PastExtraAdam, and OptimAdam (WGAN only).

Table 1: Best inception scores (averaged over 5 runs) achieved on CIFAR10 for every considered Adam variant. OptimAdam is the related Optimistic Adam (Daskalakis et al., 2018) algorithm. EMA denotes exponential moving average (see Eq. (8)). We see that the techniques of extrapolation and averaging consistently enable improvements over the baselines (in italic).

7 Experiments

Our goal in this experimental section is not to provide new state-of-the-art results via architectural improvements or a new GAN formulation, but to show that the techniques introduced earlier (which come with theoretical guarantees in the monotone case) allow us to optimize standard GANs better. These techniques are orthogonal to the design of new formulations of GAN optimization objectives and to architectural choices, and can potentially be used for the training of any type of GAN. We compare the following optimization algorithms: the baselines are SGD and Adam, using either simultaneous updates on the generator and on the discriminator (denoted SimAdam and SimSGD) or $k$ updates on the discriminator alternated with 1 update on the generator (denoted AltSGD and AltAdam).⁸ Variants that use extrapolation are denoted ExtraSGD (Alg. 2) and ExtraAdam (Alg. 4). Variants using extrapolation from the past are PastExtraSGD (Alg. 3) and PastExtraAdam (Alg. 4). We also present results using the averaged iterates as output, adding Avg as a prefix of the algorithm name when we use (uniform) averaging.

⁸ In the original WGAN paper (Arjovsky et al., 2017), the authors use $k = 5$.

7.1 Bilinear saddle point (stochastic)

Figure 3: Performance of the considered stochastic optimization algorithms on the bilinear problem (29). Each method uses its respective optimal step-size found by grid-search.

We evaluate the performance of the various stochastic algorithms first on a simple finite-sum bilinear objective (a monotone operator) constrained to a compact set:

$$\min_{\theta \in \Theta} \max_{\varphi \in \Phi} \; \frac{1}{n} \sum_{i=1}^{n} \big( \theta^\top M_i \varphi + \theta^\top a_i + \varphi^\top b_i \big) \,, \qquad (29)$$

where the matrices $M_i$ and the vectors $a_i, b_i$ were randomly generated, ensuring that the solution $(\theta^*, \varphi^*)$ would belong to the constraint set. Results are shown in Fig. 3. We can see that AvgSGD and AvgPastExtraSGD perform the best on this task.
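A sketch (our code, with arbitrary sizes, assuming our reconstruction of (29)) of how such a finite-sum bilinear problem and its stochastic operator can be instantiated; the resulting `F_stoch` and `project` plug directly into the loops sketched in §4:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10, 5
M = rng.normal(size=(n, d, d))                    # bilinear coupling terms
a, b = rng.normal(size=(n, d)), rng.normal(size=(n, d))

def F_stoch(w):
    """Unbiased stochastic operator at w = (theta, phi): one random term i."""
    theta, phi = w[:d], w[d:]
    i = rng.integers(n)
    return np.concatenate([M[i] @ phi + a[i],           # grad_theta of term i
                           -(M[i].T @ theta) - b[i]])   # -grad_phi of term i

def project(w, radius=5.0):
    """Projection onto a Euclidean ball, standing in for the constraint set."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)
```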

7.2 WGAN and WGAN-GP on CIFAR10

We now evaluate the proposed techniques in the context of GAN training, which is a challenging stochastic optimization problem where the objectives of both players are non-convex. We focus on the more advanced Adam variants of the optimization algorithms (see Alg. 4 for Adam with extrapolation) and compare them for training a fixed DCGAN architecture (Radford et al., 2016) on the CIFAR10 dataset (Krizhevsky and Hinton, 2009) with two different training objectives: WGAN with weight clipping (constrained) (Arjovsky et al., 2017), and a WGAN-GP objective (Gulrajani et al., 2017) (a non-zero-sum game). Models are evaluated using the inception score (Salimans et al., 2016). For each algorithm we did an extensive search over the hyperparameters of Adam (the same values performed best for all methods), and ran each method with 5 different random seeds for 500,000 iterations.

Table 1 reports the best inception score achieved on this problem by each considered method. We see that the techniques of extrapolation and averaging consistently enable improvements over the baselines (see §G.3 for more experiments on averaging). Fig. 4 shows training curves for each method (for their best performing learning rate), as well as samples from an ExtraAdam-trained WGAN. For training WGAN, using an extrapolation step with Adam (ExtraAdam) outperformed all other methods. For training WGAN-GP, the best results are achieved with uniform averaging of AltAdam5; however, its iterations require updating the discriminator 5 times for every generator update. With a small drop in best final score, ExtraAdam can train WGAN-GP significantly faster (see Fig. 4, right), as the discriminator and generator are each updated only twice per iteration. We also observed that methods based on extrapolation are less sensitive to the choice of learning rate and can be used with higher learning rates with less degradation; see App. §G.2 for more details.

Figure 4: Left: Mean and standard deviation of the inception score computed over 5 runs for each method on WGAN trained on CIFAR10. To keep the graph readable we show only SimAdam, but AltAdam performs similarly. Middle: Samples from a generator trained as a WGAN using ExtraAdam. Right: WGAN-GP trained on CIFAR10: mean and standard deviation of the inception score computed over 5 runs for each method using the best performing learning rate, plotted over wall-clock time; all experiments were run on an NVIDIA Quadro GP100 GPU. We see that ExtraAdam converges faster than the Adam baselines.

8 Conclusion

We newly addressed GAN objectives in the framework of variational inequalities. We tapped into the optimization literature to provide more principled techniques to optimize such games. We leveraged these techniques to develop practical optimization algorithms suitable for a wide range of GAN training objectives (including non-zero-sum games and projections onto constraints). We experimentally verified that this could yield better-trained models, achieving, to our knowledge, the best inception score when optimizing a WGAN objective on the reference unmodified DCGAN architecture (Radford et al., 2016). The presented techniques address a fundamental problem in GAN training in a principled way, and are orthogonal to the design of new GAN architectures and objectives. They are thus likely to be widely applicable, and to benefit future development of GANs.

Acknowledgments

This research was partially supported by the Canada Excellence Research Chair in “Data Science for Realtime Decision-making” and by the NSERC Discovery Grant RGPIN-2017-06936. Gauthier Gidel would like to acknowledge Benoît Joly and Florestan Martin-Baillon for bringing a fresh point of view on the proof of Proposition 1.

References

  • Arjovsky et al. (2017) M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In ICML, 2017.
  • Atkinson (2003) K. E. Atkinson. An introduction to numerical analysis. John Wiley & Sons, 2003.
  • Boyd and Vandenberghe (2004) S. Boyd and L. Vandenberghe. Convex optimization. Cambridge university press, 2004.
  • Bruck (1977) R. E. Bruck. On the weak convergence of an ergodic iteration for the solution of variational inequalities for monotone operators in Hilbert space. Journal of Mathematical Analysis and Applications, 1977.
  • Chen and Rockafellar (1997) G. H. Chen and R. T. Rockafellar. Convergence rates in forward–backward splitting. SIAM Journal on Optimization, 1997.
  • Crespi et al. (2005) G. P. Crespi, A. Guerraggio, and M. Rocca. Minty variational inequality and optimization: Scalar and vector case. In Generalized Convexity, Generalized Monotonicity and Applications, 2005.
  • Daskalakis et al. (2018) C. Daskalakis, A. Ilyas, V. Syrgkanis, and H. Zeng. Training GANs with optimism. In ICLR, 2018.
  • Fedus et al. (2018) W. Fedus, M. Rosca, B. Lakshminarayanan, A. M. Dai, S. Mohamed, and I. Goodfellow. Many paths to equilibrium: GANs do not need to decrease a divergence at every step. In ICLR, 2018.
  • Gidel et al. (2017) G. Gidel, T. Jebara, and S. Lacoste-Julien. Frank-wolfe algorithms for saddle point problems. In AISTATS, 2017.
  • Goodfellow (2016) I. Goodfellow. NIPS 2016 tutorial: Generative adversarial networks. arXiv:1701.00160, 2016.
  • Goodfellow et al. (2014) I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
  • Grnarova et al. (2018) P. Grnarova, K. Y. Levy, A. Lucchi, T. Hofmann, and A. Krause. An online learning approach to generative adversarial networks. In ICLR, 2018.
  • Gulrajani et al. (2017) I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of Wasserstein GANs. In NIPS, 2017.
  • Harker and Pang (1990) P. T. Harker and J.-S. Pang. Finite-dimensional variational inequality and nonlinear complementarity problems: a survey of theory, algorithms and applications. Mathematical programming, 1990.
  • Hazan et al. (2017) E. Hazan, K. Singh, and C. Zhang. Efficient regret minimization in non-convex games. In ICML, 2017.
  • Heusel et al. (2017) M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NIPS, 2017.
  • Iusem et al. (2017) A. Iusem, A. Jofré, R. I. Oliveira, and P. Thompson. Extragradient method with variance reduction for stochastic variational inequalities. SIAM Journal on Optimization, 2017.
  • Juditsky et al. (2011) A. Juditsky, A. Nemirovski, and C. Tauvel. Solving variational inequalities with stochastic mirror-prox algorithm. Stochastic Systems, 2011.
  • Karras et al. (2018) T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In ICLR, 2018.
  • Kingma and Ba (2015) D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  • Korpelevich (1976) G. Korpelevich. The extragradient method for finding saddle points and other problems. Matecon, 12, 1976.
  • Krizhevsky and Hinton (2009) A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Master’s thesis, University of Toronto, Canada, 2009.
  • Larsson and Patriksson (1994) T. Larsson and M. Patriksson. A class of gap functions for variational inequalities. Math. Program., 1994.
  • Ledig et al. (2017) C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. P. Aitken, A. Tejani, J. Totz, Z. Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 2017.
  • Li et al. (2017) Y. Li, A. Schwing, K.-C. Wang, and R. Zemel. Dualing GANs. In NIPS, 2017.
  • Mertikopoulos et al. (2018) P. Mertikopoulos, H. Zenati, B. Lecouat, C.-S. Foo, V. Chandrasekhar, and G. Piliouras. Mirror descent in saddle-point problems: Going the extra (gradient) mile. arXiv, 2018.
  • Mescheder et al. (2017) L. Mescheder, S. Nowozin, and A. Geiger. The numerics of GANs. In NIPS, 2017.
  • Mescheder et al. (2018) L. Mescheder, A. Geiger, and S. Nowozin. Which training methods for GANs do actually converge? In ICML, 2018.
  • Metz et al. (2017) L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein. Unrolled generative adversarial networks. In ICLR, 2017.
  • Nagarajan and Kolter (2017) V. Nagarajan and J. Z. Kolter. Gradient descent GAN optimization is locally stable. In NIPS, 2017.
  • Nedić and Ozdaglar (2009) A. Nedić and A. Ozdaglar. Subgradient methods for saddle-point problems. J Optim Theory Appl, 2009.
  • Nemirovski (2004) A. Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM J. Optim., 2004.
  • Nemirovski et al. (2009) A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on optimization, 2009.
  • Nesterov (1983) Y. Nesterov. Introductory Lectures On Convex Optimization. Springer, 1983.
  • Nesterov (2007) Y. Nesterov. Dual extrapolation and its applications to solving variational inequalities and related problems. Math. Program., 2007.
  • Nowozin et al. (2016) S. Nowozin, B. Cseke, and R. Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In NIPS, 2016.
  • Palaniappan and Bach (2016) B. Palaniappan and F. Bach. Stochastic variance reduction methods for saddle-point problems. In NIPS, 2016.
  • Polyak (1963) B. T. Polyak. Gradient methods for minimizing functionals. Zhurnal Vychislitel’noi Matematiki i Matematicheskoi Fiziki, 1963.
  • Radford et al. (2016) A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.
  • Rakhlin and Sridharan (2013) A. Rakhlin and K. Sridharan. Online learning with predictable sequences. In COLT, 2013.
  • Ratliff et al. (2016) L. J. Ratliff, S. A. Burden, and S. S. Sastry. On the characterization of local Nash equilibria in continuous games. IEEE Transactions on Automatic Control, 2016.
  • Robbins and Monro (1951) H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 1951.
  • Salimans et al. (2016) T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In NIPS, 2016.
  • Sutskever (2013) I. Sutskever. Training recurrent neural networks. PhD thesis, University of Toronto, 2013.
  • Tseng (1995) P. Tseng. On linear convergence of iterative methods for the variational inequality problem. Journal of Computational and Applied Mathematics, 1995.
  • Von Neumann and Morgenstern (1944) J. Von Neumann and O. Morgenstern. Theory of games and economic behavior. Princeton University Press, 1944.
  • Yadav et al. (2018) A. Yadav, S. Shah, Z. Xu, D. Jacobs, and T. Goldstein. Stabilizing adversarial nets with prediction methods. In ICLR, 2018.
  • Yazıcı et al. (2018) Y. Yazıcı, C.-S. Foo, S. Winkler, K.-H. Yap, G. Piliouras, and V. Chandrasekhar. The unusual effectiveness of averaging in gan training. arXiv preprint arXiv:1806.04498, 2018.
  • Yousefian et al. (2014) F. Yousefian, A. Nedić, and U. V. Shanbhag. Optimal robust smoothing extragradient algorithms for stochastic variational inequality problems. In CDC. IEEE, 2014.
  • Zhu et al. (2017) J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.

Appendix A Definitions

In this section we recall usual definitions and lemmas from convex analysis. We start with the definitions and lemmas regarding the projection mapping.

A.1 Projection mapping

Definition 1.

The projection $P_\Omega$ onto $\Omega$ is defined as,

$$P_\Omega(\omega') \triangleq \operatorname*{arg\,min}_{\omega \in \Omega} \|\omega - \omega'\|^2 \,. \qquad (30)$$

When $\Omega$ is a convex set, this projection is unique. This is a consequence of the following lemma, which we will use in the following sections: the non-expansiveness of the projection onto a convex set.

Lemma 1.

Let $\Omega$ be a convex set; the projection mapping $P_\Omega$ is nonexpansive, i.e.,

$$\|P_\Omega(\omega) - P_\Omega(\omega')\| \le \|\omega - \omega'\|\,, \quad \forall \omega, \omega' \,. \qquad (31)$$

This is a standard convex analysis result which can be found for instance in [Boyd and Vandenberghe, 2004]. The following lemma is also standard in convex analysis, and its proof uses arguments similar to those of the proof of Lemma 1.

Lemma 2.

Let $\omega \in \Omega$ and $\omega^+ = P_\Omega(\omega - u)$; then for all $\omega' \in \Omega$ we have,

$$\|\omega^+ - \omega'\|^2 \le \|\omega - \omega'\|^2 - 2\, u^\top (\omega^+ - \omega') - \|\omega^+ - \omega\|^2 \,. \qquad (32)$$
Proof of Lemma 2.

We start by simply developing,

$$\|\omega^+ - \omega'\|^2 = \|\omega - \omega'\|^2 - \|\omega^+ - \omega\|^2 + 2\, (\omega^+ - \omega)^\top (\omega^+ - \omega') \,.$$

Then, since $\omega^+$ is the projection of $\omega - u$ onto the convex set $\Omega$, we have that $(\omega^+ - \omega + u)^\top (\omega^+ - \omega') \le 0$, leading to the result of the Lemma. ∎

A.2 Smoothness and Monotonicity of the operator

Another important property used is the Lipschitz continuity of an operator.

Definition 2.

A mapping $F : \Omega \to \mathbb{R}^d$ is said to be $L$-Lipschitz if,

$$\|F(\omega) - F(\omega')\| \le L\, \|\omega - \omega'\|\,, \quad \forall \omega, \omega' \in \Omega \,. \qquad (33)$$
Definition 3.

A differentiable function $f : \Omega \to \mathbb{R}$ is said to be $\mu$-strongly convex if

$$f(\omega) \ge f(\omega') + \nabla f(\omega')^\top (\omega - \omega') + \frac{\mu}{2}\, \|\omega - \omega'\|^2\,, \quad \forall \omega, \omega' \in \Omega \,. \qquad (34)$$
Definition 4.

A function $\mathcal{L} : \Theta \times \Phi \to \mathbb{R}$ is said to be convex-concave if $\mathcal{L}(\cdot, \varphi)$ is convex for all $\varphi \in \Phi$ and $\mathcal{L}(\theta, \cdot)$ is concave for all $\theta \in \Theta$. $\mathcal{L}$ is said to be $\mu$-strongly convex-concave if $(\theta, \varphi) \mapsto \mathcal{L}(\theta, \varphi) - \frac{\mu}{2}\|\theta\|^2 + \frac{\mu}{2}\|\varphi\|^2$ is convex-concave.

Definition 5.

For $\mu > 0$, an operator $F : \Omega \to \mathbb{R}^d$ is said to be $\mu$-strongly monotone if

$$(F(\omega) - F(\omega'))^\top (\omega - \omega') \ge \mu\, \|\omega - \omega'\|^2\,, \quad \forall \omega, \omega' \in \Omega \,. \qquad (35)$$

Appendix B Gradient methods on unconstrained bilinear games

In this section we prove the results stated in §3, namely Proposition 1, Proposition 2 and Theorem 1. For Propositions 1 and 2, let us recall the context: we want to derive properties of some gradient methods on the following simple illustrative example,

$$\min_\theta \max_\varphi \; \theta \cdot \varphi \,. \qquad (36)$$

B.1 Proof of Proposition 1

Let us first recall the proposition:

Proposition’ 1.

The simultaneous iterates diverge geometrically and the alternated iterates defined in (11) are bounded but do not converge to 0, as

$$\theta_t^2 + \varphi_t^2 = (1 + \eta^2)^t (\theta_0^2 + \varphi_0^2) \;\; \text{(simultaneous)} \quad \text{and} \quad \theta_t^2 + \varphi_t^2 = \Theta(1) \;\; \text{(alternated)}\,, \qquad (37)$$

where $\eta > 0$ is the step-size and $(\theta_0, \varphi_0) \neq (0, 0)$ is the initialization.

The uniform average of the simultaneous updates (resp. the alternated updates) diverges (resp. converges to 0) as,

$$\bar{\theta}_t^2 + \bar{\varphi}_t^2 = \Theta\!\big((1 + \eta^2)^t / t^2\big) \quad \big(\text{resp. } O(1/t^2)\big) \,. \qquad (38)$$
Proof.

Let us start with the simultaneous update rule:

$$\theta_{t+1} = \theta_t - \eta\, \varphi_t\,, \qquad \varphi_{t+1} = \varphi_t + \eta\, \theta_t \,. \qquad (39)$$

Then we have,

$$\theta_{t+1}^2 + \varphi_{t+1}^2 = (\theta_t - \eta \varphi_t)^2 + (\varphi_t + \eta \theta_t)^2 = (1 + \eta^2)(\theta_t^2 + \varphi_t^2)\,, \qquad (40)$$
$$\text{and thus} \quad \theta_t^2 + \varphi_t^2 = (1 + \eta^2)^t (\theta_0^2 + \varphi_0^2) \,. \qquad (41)$$

The update rule (39) also gives us,

$$\eta\, \varphi_t = \theta_t - \theta_{t+1}\,, \qquad \eta\, \theta_t = \varphi_{t+1} - \varphi_t \,. \qquad (42)$$

Summing these equations for