1 Introduction
A growing number of models in machine learning require optimizing over multiple interacting objectives. This is the case of generative adversarial networks goodfellow2014generative, imaginative agents racaniere2017imagination, hierarchical reinforcement learning wayne2014hierarchical,vezhnevets2017feudal and, more generally, multi-agent reinforcement learning bu2008comprehensive. Solving saddle-point problems (see, e.g., rockafellar_monotone_1970), which is key in robust learning kim_robust_2006 and image reconstruction chambolle_firstorder_2011, also falls in this category. All these examples can be cast as games where players are modules that compete or cooperate to minimize their own objective functions.
Optimizing over several objectives is challenging. To define a principled solution to a multi-objective optimization problem, we may rely on the notion of Nash equilibrium nash1951non. At a Nash equilibrium, no player can improve its objective by unilaterally changing its strategy. In general games, finding a Nash equilibrium is known to be PPAD-complete daskalakis2009complexity. The theoretical section of this paper considers the class of convex player games, for which Nash equilibria exist nemirovski2010accuracy. Finding a Nash equilibrium in this setting is equivalent to solving a variational inequality problem (VI) with a monotone operator harker1990finite,nemirovski2010accuracy. This VI can be solved using first-order methods, which are prevalent in single-objective optimization for machine learning. Stochastic gradient descent (the simplest first-order method) is indeed known to converge to local minima under mild conditions met by ML problems bottou2008tradeoffs,lee2016gradient. Yet, while gradient descent can be applied simultaneously to different objectives, it may fail to find a Nash equilibrium in very simple settings (see, e.g., sos,gidel2018variational). One of two modifications of gradient descent is needed to solve the VI (hence Nash) problem: averaging magnanti1997averaging,nedic2009subgradient, or extrapolation with averaging (introduced as the extragradient method korpelevich1976extragradient), which is faster nemirovski2004prox. Extrapolation corresponds to an opponent shaping step: each player anticipates its opponents’ next moves to update its strategy.
In player games, extragradient computes single player gradients before performing a parameter update. Whether in massive or simple two-player games, this may be an inefficient update strategy: early gradient information, computed at the beginning of each iteration, could be used to perform eager updates or extrapolations, similar to how alternated training (e.g., for GANs, goodfellow2014generative) would behave. In this paper, we introduce and analyse new extragradient algorithms that extrapolate and update random or carefully selected subsets of players at each iteration (Fig. 1). Our contributions are as follows.


We review the extragradient algorithm for convex games and outline its shortcomings (§3.1). We propose a doubly-stochastic extragradient (DSEG) algorithm (§3.2) that relies on partially observed gradients of players. It performs faster but noisier updates than the original extragradient method. We introduce a variance reduction method to attenuate the added noise for smooth games. We describe an importance sampling scheme and a cyclic sampling scheme that improve convergence speed.

We demonstrate the performance of player-sampled extragradient in controlled settings (quadratic games, §5), showing how our approach outperforms vanilla extragradient, especially when using cyclic player selection. Most interestingly, compared to vanilla extragradient, our approach (with cyclic sampling) is also more efficient for GAN training (CIFAR-10, ResNet architecture).
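[Table 1: Convergence rates for nonsmooth and smooth games, comparing the stochastic extragradient of solving with doubly-stochastic extragradient. The rate expressions did not survive extraction; the caption defines a bound on the noise in gradient estimation and the diameter of the parameter space.]
2 Related work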
Extragradient method.
In this paper, we focus on finding the Nash equilibrium in convex player games (1), or equivalently solving the variational inequality problem (5) harker1990finite,nemirovski2010accuracy. This can be done using extrapolated gradient korpelevich1976extragradient, a “cautious” gradient descent approach (described in (4)) that was promoted by nemirovski2004prox and nesterov2007dual under the name mirror-prox; we review this work in §3.1. solving propose a stochastic variant of mirror-prox that assumes access to a noisy gradient oracle. Recently, universal2019bach described a smoothness-adaptive variant of this algorithm similar to AdaGrad duchi_adaptive_2011, an approach that can be combined with ours. yousefian2018stochastic consider multi-agent games on networks and analyze a stochastic variant of extragradient that consists of randomly extrapolating and updating a single player. Compared to them, we analyse more general player sampling strategies. Moreover, our analysis holds for nonsmooth losses, and provides better rates for smooth losses through variance reduction. We also analyse precisely the reasons why player sampling is useful (see §3.2 and comments on rates in §4), which is an original endeavor.
Finding Nash equilibria in nonconvex settings.
A number of algorithms have been proposed in the nonconvex setting under restricted assumptions on the game, for example WoLF in two-player two-action games bowling2001rational, policy prediction in two-player two-action bimatrix games zhanglesser, AWESOME in repeated games conitzer2007awesome, Optimistic Mirror Descent in two-player bilinear zero-sum games daskalakis2018training and Consensus Optimization in two-player zero-sum games mescheder2017numerics. optimistic proved asymptotic convergence results for extragradient without averaging in a slightly nonconvex setting. gidel2018variational demonstrated the effectiveness of extragradient for nonconvex GAN training; in §5, we demonstrate that player sampling improves training speed and effectiveness in the GAN setting.
Opponent shaping and gradient adjustment.
Extragradient can also be understood as an opponent shaping method: in the extrapolation step, the player looks one step into the future and anticipates the next moves of its opponents. Several recent works proposed algorithms that make use of the opponents’ information to converge to an equilibrium zhanglesser,lola,sos. In particular, the “Learning with opponent-learning awareness” (LOLA) algorithm is known for encouraging cooperation in cooperative games lola. Lastly, some recent works proposed algorithms that modify the dynamics of simultaneous gradient descent by adding an adjustment term in order to converge to the Nash equilibrium mazumdar2019finding and avoid oscillations mechanics,mescheder2017numerics. One caveat of these works is that they need to estimate the Jacobian of the simultaneous gradient, which may be expensive in large-scale systems or even impossible when dealing with nonsmooth losses, as we consider in our setting. This is orthogonal to our approach, which finds solutions of the original VI problem (5).
3 Solving convex games with partial first-order information
We review the framework of Cartesian convex games and the extragradient method in §3.1. Building on these, we propose to augment extragradient with player sampling and variance reduction in §3.2.
3.1 Solving convex player games with gradients
Each player observes a loss that depends on the independent parameters of all other players.
Definition 1.
A Cartesian player game is given by a set of players with parameters where decomposes into a Cartesian product . Each player’s parameter lives in , where . Each player is given a loss function .
For example, generative adversarial network (GAN) training can be cast as a Cartesian game between a generator and discriminator that do not share parameters. We make the following assumption over the geometry of losses and constraints, that corresponds to the convexity assumption for one player.
Assumption 1.
The parameter spaces are compact, convex and nonempty. Each player’s loss is convex in its parameter and concave in , where contains all other players’ parameters. Moreover, is convex in .
Ass. 1 implies that has a diameter . Note that the losses may be non-differentiable. A simple example of Cartesian convex games satisfying Ass. 1, which we study empirically in §5, is given by matrix games (e.g., rock-paper-scissors) defined by a positive payoff matrix , with parameters corresponding to mixed strategies lying in the probability simplex .
Nash equilibria.
Joint solutions to minimizing losses are naturally defined as the set of Nash equilibria nash1951non of the game. In this setting, the goal of multiobjective optimization becomes
(1) 
Intuitively, a Nash equilibrium is a point where no player can benefit by changing its strategy while the other players keep theirs unchanged. Ass. 1 implies the existence of a Nash equilibrium rosen1964existence. We quantify the inaccuracy of a solution by the functional Nash error nemirovski2004prox
(2) 
This error, computable through convex optimization, quantifies the gain that each player can obtain when deviating alone from the current strategy. In particular, if and only if is a Nash equilibrium; thus constitutes a proper indication of convergence for a sequence of iterates seeking a Nash equilibrium. It is the value we bound in our convergence analysis (see §4).
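As an illustration of this error measure, the sketch below evaluates it in closed form for a two-player zero-sum matrix game on the simplex. It assumes the error sums, over players, the gain of the best unilateral deviation; the function name `nash_error_zero_sum` and the rock-paper-scissors payoff are illustrative choices, not taken from the paper.

```python
import numpy as np

def nash_error_zero_sum(payoff, x, y):
    """Sum of best-response gains in a two-player zero-sum matrix game.

    Assumed setup: player 1 minimizes x^T A y and player 2 minimizes
    -x^T A y, both over the probability simplex.
    """
    value = x @ payoff @ y
    # Best deviation of player 1: all mass on the smallest entry of A y.
    gain_1 = value - np.min(payoff @ y)
    # Best deviation of player 2: all mass on the largest entry of A^T x.
    gain_2 = np.max(payoff.T @ x) - value
    return gain_1 + gain_2

# Rock-paper-scissors payoff (skew-symmetric), used only as a toy example.
rps = np.array([[0.0, 1.0, -1.0],
                [-1.0, 0.0, 1.0],
                [1.0, -1.0, 0.0]])
uniform = np.ones(3) / 3
rock = np.array([1.0, 0.0, 0.0])

error_at_equilibrium = nash_error_zero_sum(rps, uniform, uniform)
error_off_equilibrium = nash_error_zero_sum(rps, rock, uniform)
```

The error vanishes at the uniform strategy pair and is strictly positive when player 1 always plays rock, since player 2 could then gain by switching to paper.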
First-order methods and extrapolation.
We consider (sub)differentiable losses forming a convex game. The Nash equilibrium can be found using first-order methods, which access the gradients of . We define the simultaneous gradient of the game to be
(3) 
where we write . It corresponds to the concatenation of the gradients of each player’s loss with respect to its own parameters. The losses may be nonsmooth, in which case the gradients should be replaced by subgradients. Simultaneous gradient descent simply approximates the flow of the simultaneous gradient. It fails to converge in very simple settings, in particular in any matrix game for which the payoff is skew-symmetric. An alternative approach with better guarantees is the extragradient method korpelevich1976extragradient, which forms the basis for the algorithms presented in this paper. It has been extensively analyzed under several settings (see §2). In particular, nemirovski2004prox provides convergence results when gradients are exact, and solving when gradients are accessed through a noisy oracle.
Extragradient consists of two steps: first, we take a gradient step to go to an extrapolated point. We then use the gradient at the extrapolated point to perform a gradient step from the original point:
(4) 
where is the Euclidean projection onto the constraint set , i.e. . This "cautious" approach makes it possible to escape cycling orbits of the simultaneous gradient flow, which may arise around equilibrium points with skew-symmetric Hessians (see Fig. 1). The generalization of extragradient to general Banach spaces equipped with a Bregman divergence was introduced as the mirror-prox algorithm nemirovski2004prox. All new results from §4 extend to this mirror setting (see §A.1). As recalled in Table 1, solving provide rates of convergence for the average iterate . Those rates are introduced for the equivalent variational inequality (VI) problem:
(5) 
where Ass. 1 ensures that the simultaneous gradient is a monotone operator (see §A.2 for the link between Nash equilibria and solutions of the VI).
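The contrast between simultaneous gradient descent and the two-step extragradient update can be sketched on a toy problem. The example below assumes an unconstrained two-player bilinear game with losses x*y and -x*y, so the projection is omitted; the step size and iteration count are arbitrary illustrative values.

```python
import numpy as np

def simultaneous_gradient(z):
    """Simultaneous gradient of the toy game l1(x, y) = x * y, l2(x, y) = -x * y."""
    x, y = z
    return np.array([y, -x])

def gradient_step(z, step):
    # Plain simultaneous gradient descent.
    return z - step * simultaneous_gradient(z)

def extragradient_step(z, step):
    # Extrapolate from z, then update z with the gradient at the extrapolated point.
    z_half = z - step * simultaneous_gradient(z)
    return z - step * simultaneous_gradient(z_half)

step = 0.1
z_gd = np.array([1.0, 1.0])
z_eg = np.array([1.0, 1.0])
for _ in range(500):
    z_gd = gradient_step(z_gd, step)
    z_eg = extragradient_step(z_eg, step)

gd_distance = np.linalg.norm(z_gd)  # distance to the equilibrium (0, 0)
eg_distance = np.linalg.norm(z_eg)
```

On this game each gradient step multiplies the distance to the equilibrium by sqrt(1 + step^2), so the iterates spiral outwards, whereas each extragradient step multiplies it by sqrt(1 - step^2 + step^4), which is below one for step sizes smaller than one.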
Computational caveats.
In systems with a large number of players, an extragradient step may be computationally expensive due to the high number of backward passes necessary for gradient computations. Namely, at each iteration, we are required to compute gradients before performing a first update. This is likely to be inefficient, as we could use the first computed gradients to perform a first extrapolation or update. This remains true for games down to two players. In a different setting, stochastic gradient descent robbins_stochastic_1951 updates model parameters before observing the whole data, assuming that partial observation is sufficient for progress in the optimization loop. Similarly, partial gradient observation should be sufficient to perform extrapolation and updates toward the Nash equilibrium. We therefore propose to compute a few random player gradients at each iteration.
3.2 Partial extrapolation and update for extragradient
We present our main algorithmic contribution in this section. While standard extragradient requires two full passes over players, we propose to compute doubly-stochastic simultaneous gradient estimates. This corresponds to evaluating a simultaneous gradient that is affected by two sources of noise. We sample a minibatch of players of size , and compute the gradients for this minibatch only. Furthermore, we assume that the gradients are noisy estimates, e.g., with noise coming from data sampling. We then compute a doubly-stochastic simultaneous gradient estimate as
(6) 
where is a noisy unbiased estimate of . The factor in (6) ensures that the doubly-stochastic simultaneous gradient estimate is an unbiased estimator of the simultaneous gradient. Doubly-stochastic extragradient replaces the full update (4) by oracle (6), as detailed in Alg. 1.
Motivation.
Sampling over players introduces a further source of noise in the average iterate sequence . The convergence of this sequence is already slowed down by noisy gradients or by the nonsmoothness of the losses, which both introduce a term in in the convergence bounds. It is therefore appealing to introduce a further source of noise, hoping that the computational speedups provided at each iteration mitigate the approximation errors introduced by player subsampling.
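The role of the rescaling factor in the player-sampled estimate (6) can be checked numerically. The sketch below is a minimal illustration that covers only the player-sampling noise (gradients are taken as exact); it assumes one gradient row per player, uniform sampling over subsets of a fixed size, and a rescaling by the number of players divided by the minibatch size. The names `player_sampled_gradient`, `n_players` and `batch_size` are hypothetical.

```python
import itertools
import numpy as np

def player_sampled_gradient(player_gradients, sampled_players):
    """Keep the gradients of the sampled players, rescaled by n / b; zero elsewhere."""
    n_players = player_gradients.shape[0]
    idx = list(sampled_players)
    estimate = np.zeros_like(player_gradients)
    estimate[idx] = (n_players / len(idx)) * player_gradients[idx]
    return estimate

rng = np.random.default_rng(0)
n_players, dim, batch_size = 5, 3, 2
# Stand-in for the per-player gradients (one row per player).
gradients = rng.normal(size=(n_players, dim))

# Averaging over every subset of size b is the exact expectation under
# uniform sampling of subsets.
subsets = list(itertools.combinations(range(n_players), batch_size))
mean_estimate = np.mean(
    [player_sampled_gradient(gradients, s) for s in subsets], axis=0
)
```

Each player belongs to a fraction b/n of the subsets, so the rescaling by n/b makes the average estimate coincide with the full simultaneous gradient.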
Variance reduction for player noise.
To obtain faster rates in convex games with smooth losses, we propose to compute a variance-reduced estimate of the simultaneous gradient. This mitigates the noise due to player sampling. Variance reduction is a technique known to accelerate convergence under smoothness assumptions in similar settings. While palaniappan2016stochastic,iusem2017extragradient,reducing2019chavdarova apply variance reduction on the noise coming from the gradient estimates, we apply it to the noise coming from the sampling over the players. We implement this idea in Alg. 2. We keep an estimate of for each player in a table , which we use to compute unbiased gradient estimates with lower variance, similar to SAGA defazio2014saga.
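A SAGA-like estimator of this kind can be sketched as follows. This is an illustration under assumptions and may differ from Alg. 2 in its details: the table stores the last gradient seen for each player, the estimate adds a rescaled correction on the sampled players only, and the function name `variance_reduced_gradient` is hypothetical.

```python
import itertools
import numpy as np

def variance_reduced_gradient(player_gradients, table, sampled_players):
    """SAGA-like estimate: table + (n / b) * (fresh - table) on sampled players."""
    n_players = player_gradients.shape[0]
    idx = list(sampled_players)
    estimate = table.copy()
    estimate[idx] += (n_players / len(idx)) * (player_gradients[idx] - table[idx])
    # Refresh the stored gradients of the sampled players.
    new_table = table.copy()
    new_table[idx] = player_gradients[idx]
    return estimate, new_table

rng = np.random.default_rng(0)
n_players, dim, batch_size = 5, 3, 2
gradients = rng.normal(size=(n_players, dim))
# A slightly stale table, as one might expect once the iterates move slowly.
table = gradients + 0.01 * rng.normal(size=gradients.shape)

subsets = list(itertools.combinations(range(n_players), batch_size))
reduced = np.array(
    [variance_reduced_gradient(gradients, table, s)[0] for s in subsets]
)
# Plain player-sampled estimate (no table), for comparison.
plain = np.array([
    (n_players / batch_size)
    * np.where(np.isin(np.arange(n_players), s)[:, None], gradients, 0.0)
    for s in subsets
])

mean_reduced = reduced.mean(axis=0)
reduced_variance = ((reduced - gradients) ** 2).sum(axis=(1, 2)).mean()
plain_variance = ((plain - gradients) ** 2).sum(axis=(1, 2)).mean()
```

Both estimates are unbiased over uniform subsets, but the error of the table-based one scales with the gap between stored and current gradients rather than with the gradients themselves.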
Sampling strategies.
In the basic version of the algorithm, the sampling over players can be performed using any distribution with uniform marginals, i.e., such that all players have equal probability of being sampled. Sampling uniformly over subsets of is a reasonable way to fulfill this condition as all players have probability of being chosen. One faster alternative is to perform importance sampling. Namely, we sample each player with a probability proportional to the uniform bound of . This technique achieves faster convergence (see §B.3) when the gradient bounds for the different losses differ.
As a strategy to accelerate convergence, we propose to cycle over the pairs of different players (with ). At each iteration, we extrapolate the first player of the pair and update the second one. We shuffle the order of pairs once the block has been entirely seen. By excluding pairs, we avoid players extrapolating themselves, which is never useful to reduce . This scheme bridges extrapolation and alternated gradient descent: for GANs, it corresponds to extrapolating the generator before updating the discriminator, and vice versa, cyclically. Sampling over players proves powerful for quadratic games (§5.1) and GANs (§5.2). In App. C, we provide a first explanation for this fact, based on studying the spectral radius of recursion operators (echoing recent work on understanding cyclic coordinate descent li_faster_2018).
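The cyclic schedule described above can be sketched as a small generator. It assumes that a block consists of all ordered pairs of distinct players and that the block is reshuffled each time it has been fully visited; the function name `cyclic_pairs` and the player labels are illustrative.

```python
import itertools
import random

def cyclic_pairs(n_players, n_iterations, seed=0):
    """Yield (extrapolated, updated) player pairs, reshuffled after each block."""
    rng = random.Random(seed)
    # All ordered pairs (i, j) with i != j: a player never extrapolates itself.
    block = list(itertools.permutations(range(n_players), 2))
    produced = 0
    while produced < n_iterations:
        rng.shuffle(block)
        for pair in block:
            if produced == n_iterations:
                return
            yield pair
            produced += 1

# Two players, e.g. discriminator = 0 and generator = 1 (labels are illustrative).
gan_schedule = list(cyclic_pairs(n_players=2, n_iterations=4))
three_player_block = list(cyclic_pairs(n_players=3, n_iterations=6))
```

With two players, every block contains the two pairs (0, 1) and (1, 0), so each player is extrapolated once and updated once per block.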
4 Sharp analysis of convergence rates
We state our main convergence results in this section. As announced, we derive rates for the algorithms mentioned in §3.2 following the analysis by solving. We compare them with the rates achieved by stochastic extragradient introduced by solving, which also assumes noisy gradients but no player subsampling. While in the main paper the theorems are provided in the Euclidean setting, the proofs in the appendices are written in the mirror setting. In the analysis, we separately consider the two following assumptions on the losses.
Assumption 2a (Nonsmoothness).
For each the loss has a bounded subgradient, namely for all . In this case, we also define the quantity
Assumption 2b (Smoothness).
For each the loss is once-differentiable and smooth, i.e. for
Classically, similar to solving, robbins_stochastic_1951, we assume unbiasedness and boundedness of the variance.
Assumption 3.
For each player , the noisy gradient is unbiased and has bounded variance:
(7) 
In stochastic gradient-based methods, comparing rates in terms of number of iterations is not appropriate since the complexity per iteration increases with the size of player minibatches. Instead, we define as the number of gradient estimates computed up to iteration . At each iteration in Alg. 1, the doubly-stochastic simultaneous gradient estimate is computed twice and requires gradient estimates. Therefore, which implies the number of iterations in terms of gradient computations is . We give the rates in terms of in the statement of the theorems. We first state the convergence result for doubly-stochastic extragradient under Ass. 2a.
Theorem 1.
The following result holds when the losses are once-differentiable and smooth (Ass. 2b).
Theorem 2.
Those rates should be compared to the rate of [solving, Corollary 1], which we recall in §A.3 and Table 1. Corollaries 3 and 7 in §B.2.2 and §B.4 contain the statements of Theorems 1 and 2 in more detail.


Under Ass. 2a, Alg. 1 performs with a rate similar to stochastic extragradient. In both cases the rate is , and the subgradient bound and noise bound appear in the numerator. Doubly-stochastic extragradient is more robust to noisy gradient estimates, because the dependency of its rate on is weaker than for full extragradient.

Under Ass. 2b, the deterministic term of the rate is times larger compared to stochastic extragradient while the noisy term is times smaller. For long runs (large ), the noise term dominates the deterministic one, which advocates for the use of small batch sizes: when the rate is asymptotically times smaller. Setting to zero in the noise term, doubly-stochastic extragradient with variance reduction recovers the rate from nemirovski2004prox.
To sum up, doubly-stochastic extragradient provides better convergence guarantees than stochastic extragradient under high levels of noise (), while it delivers similar or slightly worse theoretical results in the non-noisy regime. Player randomness can be considered in the framework from solving by including it in the noisy unbiased estimate (increasing in Ass. 3 accordingly). This coarse approach does not yield the sharp bounds of Theorems 1 and 2 (see §A.4).
Importance sampling.
Using importance sampling when choosing player minibatches yields a better bound by a constant factor (see §B.3). In the nonsmooth case, this replaces the constant with the strictly smaller , which is useful when the gradient magnitudes are skewed.
5 Applications
We show the performance of doubly-stochastic extragradient in the setting of quadratic games over the simplex, and in the practical context of GAN training. A PyTorch/NumPy package is attached.
5.1 Random quadratic games
We consider a game where players can play actions, with payoffs provided by a matrix , a horizontal stack of matrices (one for each player). The loss function of each player is defined as its expected payoff given the mixed strategies , i.e.
(10) 
where is a regularization parameter that introduces nonsmoothness and pushes strategies to snap to the simplex center. The positivity of is equivalent to Ass. 1, i.e. for all .
Experiments.
We sample as the weighted sum of a random symmetric positive definite matrix and a skew matrix. We compare the convergence speeds of extragradient algorithms, with or without player subsampling. We vary three parameters: the variance of the noise in the gradient oracle (we add Gaussian noise on each gradient coordinate, similar to Langevin dynamics neal2011mcmc), the nonsmoothness of the loss, and the skewness of the matrix. We consider small games and large games (). We use the (simplex-adapted) mirror variant of doubly-stochastic extragradient, and a constant stepsize, selected among a grid (see App. D). We use variance reduction when (smooth case). We compare random fixed-size sampling with cyclic sampling (§3.2).
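One way to generate such a game matrix is sketched below. The convex-combination weighting, the `skewness` parameter and the function name `sample_game_matrix` are assumptions made for illustration; the exact sampling procedure used in the experiments is not specified here.

```python
import numpy as np

def sample_game_matrix(size, skewness, seed=0):
    """Mix a random symmetric positive definite matrix with a random skew one."""
    rng = np.random.default_rng(seed)
    root = rng.normal(size=(size, size))
    spd = root @ root.T + 1e-3 * np.eye(size)  # symmetric positive definite
    raw = rng.normal(size=(size, size))
    skew = raw - raw.T                         # skew-symmetric: skew.T == -skew
    # Hypothetical weighting: skewness = 0 is purely symmetric, 1 purely skew.
    return (1.0 - skewness) * spd + skewness * skew

matrix = sample_game_matrix(size=6, skewness=0.5)
symmetric_part = 0.5 * (matrix + matrix.T)
smallest_eigenvalue = np.linalg.eigvalsh(symmetric_part).min()

# The skew part contributes nothing to the quadratic form x^T A x.
x = np.random.default_rng(1).normal(size=6)
quadratic_form = x @ matrix @ x
```

Since the skew part cancels in the quadratic form, positivity of the mixture only depends on its symmetric part, which stays positive definite for any skewness strictly below one.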
Results.
Fig. 2 compares the convergence speed of player-sampled extragradient for the various settings and sampling schemes. As predicted by Theorems 1 and 2, rates of convergence are comparable with and without subsampling. Randomly subsampling players always brings a benefit in the convergence constant (Fig. 2(a)), especially in the smooth noisy regime, using variance reduction (Fig. 2(a), column 2). Most interestingly, cyclic player selection brings a significant improvement over random sampling for a small number of players, allowing a larger gain in the rate constants (Fig. 2(a)).
Fig. 2(c) highlights the tradeoffs in Theorem 2: as the noise increases, the size of player batches should be reduced. Note that for skew games with many players (Fig. 2(b), col. 3), which are the hardest games to solve as averaging is needed optimistic, our approach only becomes beneficial in the high-noise regime (more relevant in ML). Full extragradient should be favored in the non-noisy regime (see App. D).
Spectral effect of sampling.
To better understand the benefit of the cyclic selection scheme, we study the linear “algorithm operator” such that in unconstrained two-player bilinear games. The convergence speed of is governed by the spectral radius of , in light of Gelfand’s formula gelfand1941normierte. In App. C Fig. 4, we consider random matrix games. For these, the algorithm operator of extragradient with cyclic player selection has on average a lower spectral radius than with random selection and a fortiori full selection. This leads to faster convergence of cyclic schemes.
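The kind of computation involved can be sketched for the simplest case, full extragradient on an unconstrained two-player bilinear game. The sketch below builds the linear operator of one step and compares its spectral radius to the empirical decay rate of its powers; the coupling matrix, its singular values and the step size are illustrative assumptions, and the operators for random or cyclic player selection are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 3
# Coupling matrix M of the bilinear game, built with singular values 1, 2, 3.
left, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
right, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
coupling = left @ np.diag([1.0, 2.0, 3.0]) @ right.T

# Simultaneous gradient of l1 = x^T M y, l2 = -x^T M y is the linear map z -> J z.
jacobian = np.block([[np.zeros((dim, dim)), coupling],
                     [-coupling.T, np.zeros((dim, dim))]])
step = 0.5 / np.linalg.norm(coupling, 2)  # half the inverse top singular value

identity = np.eye(2 * dim)
# One full extragradient step: z -> z - step * J (z - step * J z).
operator = identity - step * jacobian @ (identity - step * jacobian)

spectral_radius = np.max(np.abs(np.linalg.eigvals(operator)))
n_steps = 200
power_norm = np.linalg.norm(np.linalg.matrix_power(operator, n_steps), 2)
empirical_rate = power_norm ** (1.0 / n_steps)
```

Gelfand's formula states that the norm of the k-th power, raised to 1/k, tends to the spectral radius; here the operator is normal, so the two quantities already agree for finite k up to rounding.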
5.2 Generative adversarial networks (WGAN-GP + ResNet)
We evaluate the performance of the player sampling approach to train a generative model on CIFAR-10 krizhevsky2009learning. We use the WGAN-GP loss gulrajani_improved_2017, that defines a nonconvex two-player game. We compare the full extragradient approach advocated by gidel2018variational to the cyclic sampling scheme proposed in §3.2 (i.e. extra. D, upd. G, extra. G, upd. D). We use the ResNet he2016deep architecture from gidel2018variational, and select the best performing stepsizes among a grid (see App. D). We use the Adam kingma_adam_2014 refinement of extragradient gidel2018variational for both the baseline and proposed methods. We evaluate the Inception Score salimans_improved_2016 and Fréchet Inception Distance (FID) heusel2017gans along training.
Results.
We report training curves versus wall-clock time in Fig. 3. Cyclic sampling allows faster and better training, especially with respect to FID, which is more correlated to human appreciation heusel2017gans. Table 2 compares our result to full extragradient with uniform averaging. It shows substantial improvements in FID, with results less sensitive to randomness. Note that scores could be slightly improved by leaving more time for training.
Interpretation.
Without extrapolation, alternated training is known to perform better than simultaneous updates in WGAN-GP gulrajani_improved_2017. Our approach makes it possible to add extrapolation while keeping an alternated schedule. It thus performs better than extrapolating with simultaneous updates. This remains true across every learning rate we tested. Deterministic sampling is crucial for performance, as random player selection performs poorly (best score 6.2 IS). This echoes the good results of cyclic sampling in §5.1.
6 Discussion and conclusion
We propose and analyse a doubly-stochastic extragradient approach for finding Nash equilibria. According to our convergence analysis, updating/extrapolating subsets of players only is useful in high noise or nonsmooth settings, and equivalent otherwise. Numerically, doubly-stochastic extragradient leads to speedups and improvements in convex and nonconvex settings, especially using noisy gradients (as with GANs). Our approach hence combines the advantages of alternated and extrapolation methods over simultaneous gradient descent, and we recommend it for training GANs.
Beyond demonstrating the usefulness of sampling, numerical experiments show the importance of sampling schemes. We take a first step towards understanding the good performance of cyclic player extrapolation and update. A better theoretical analysis of this phenomenon is left for future work.
We foresee interesting developments using player sampling and extrapolation in reinforcement learning: the policy gradients obtained using multiagent actor critic methods mnih2016asynchronous,lowe2017multi are highly noisy estimates, a setting in which sampling over players proves beneficial.
7 Acknowledgements
This work was partially supported by NSF grant RIIIS 1816753, NSF CAREER CIF 1845360, the Alfred P. Sloan Fellowship and Samsung Electronics. The work of C. Domingo Enrich was partially supported by CFIS (UPC). The work of A. Mensch was supported by the European Research Council (ERC project NORIA).
A. Mensch thanks Guillaume Garrigos and Timothée Lacroix for helpful comments.
The appendices are structured as follows: App. A presents the setting and the existing results. In particular, we start by introducing the setting of the mirror-prox algorithm in §A.1. After detailing the relation between solving this problem and finding Nash equilibria in convex player games (§A.2), we recall the rates for stochastic mirror-prox obtained by solving in §A.3. We then present the proofs of our theorems in App. B. We analyze the doubly-stochastic algorithm (Alg. 1) and separately study two variants of the latter, adding importance sampling (§B.3) and variance reduction (§B.4). App. C investigates the difference between random and cyclic player sampling. App. D presents further experimental results and details.
Appendix A Existing results
A.1 Mirror-prox
Mirror-prox and mirror descent are the formulation of the extragradient method and gradient descent for non-Euclidean (Banach) spaces. bubeck_monograph (which is a good reference for this subsection) and solving study extragradient/mirror-prox in this setting. We provide an introduction to the topic for completeness.
Setting and notations.
We consider a Banach space and a compact set We define an open convex set such that is included in its closure, that is and The Banach space is characterized by a norm . Its conjugate norm is defined as For simplicity, we assume .
We assume the existence of a mirror map for , which is defined as a function that is differentiable and strongly convex i.e.
(11) 
We can define the Bregman divergence in terms of the mirror map.
Definition 2.
Given a mirror map , the Bregman divergence is defined as
(12) 
Note that is always nonnegative. For more properties, see e.g. nemirovsky1983problem and references therein. Given that is a compact convex space, we define . Lastly, for and , we define the prox-mapping as
(13) 
The mirror-prox algorithm is the most well-known algorithm to solve convex player games in the mirror setting (and variational inequalities, see §A.2). An iteration of mirror-prox consists of:
(14) 
Note that the extragradient algorithm defined in equation (4) corresponds to mirror-prox (14) when choosing
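To make the link between mirror maps and divergences concrete, the sketch below evaluates the Bregman divergence for two standard mirror maps. It assumes the usual convention in which the gradient of the mirror map is taken at the second argument (the argument order in (12) may differ), and the helper names are illustrative. With half the squared Euclidean norm the divergence reduces to half the squared distance, which is the Euclidean setting of extragradient; with the negative entropy on the simplex it reduces to the Kullback-Leibler divergence.

```python
import numpy as np

def bregman_divergence(mirror_map, mirror_grad, x, y):
    """D(x, y) = phi(x) - phi(y) - <grad phi(y), x - y> (assumed convention)."""
    return mirror_map(x) - mirror_map(y) - mirror_grad(y) @ (x - y)

def half_squared_norm(x):
    return 0.5 * x @ x

def negative_entropy(x):
    return np.sum(x * np.log(x))

# Two points in the interior of the probability simplex.
x = np.array([0.2, 0.3, 0.5])
y = np.array([0.4, 0.4, 0.2])

euclidean = bregman_divergence(half_squared_norm, lambda v: v, x, y)
entropic = bregman_divergence(negative_entropy, lambda v: np.log(v) + 1.0, x, y)

half_squared_distance = 0.5 * np.sum((x - y) ** 2)
kl_divergence = np.sum(x * np.log(x / y))
```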
Lemma 1.
By using the proximal mapping notation (13), the mirror-prox updates are equivalent to:
(15) 
Proof.
We just show that , as the second part is analogous.
The mirror framework is particularly well-suited for simplex constraints, i.e. when the parameter of each player is a probability vector. Such constraints usually arise in matrix games. If is the simplex, we express the negative entropy for player as
(16)
We can then define and the mirror map as
(17) 
We used this mirror map in the experiments for random quadratic games (§5.1).
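With the negative-entropy mirror map, the prox-mapping on the simplex has a well-known closed form: a multiplicative update followed by a normalization. The sketch below illustrates it on a two-player zero-sum matrix game; the function names, the rock-paper-scissors payoff, the step size and the starting points are illustrative assumptions, not the experimental setup of §5.1.

```python
import numpy as np

def entropic_prox(point, gradient):
    """Prox-mapping on the simplex for the negative-entropy mirror map.

    Minimizing <gradient, u> + KL(u, point) over the simplex gives
    u proportional to point * exp(-gradient).
    """
    shifted = gradient - gradient.min()  # constant shift, cancels after normalization
    unnormalized = point * np.exp(-shifted)
    return unnormalized / unnormalized.sum()

def entropic_mirror_prox_step(x, y, payoff, step):
    """One mirror-prox step for l1 = x^T A y, l2 = -x^T A y on two simplices."""
    # Extrapolation from (x, y).
    x_half = entropic_prox(x, step * (payoff @ y))
    y_half = entropic_prox(y, -step * (payoff.T @ x))
    # Update from (x, y) using gradients at the extrapolated point.
    x_next = entropic_prox(x, step * (payoff @ y_half))
    y_next = entropic_prox(y, -step * (payoff.T @ x_half))
    return x_next, y_next

rps = np.array([[0.0, 1.0, -1.0], [-1.0, 0.0, 1.0], [1.0, -1.0, 0.0]])
x = np.array([0.6, 0.3, 0.1])
y = np.array([0.2, 0.2, 0.6])
for _ in range(100):
    x, y = entropic_mirror_prox_step(x, y, rps, step=0.1)
```

No explicit projection is needed: the iterates remain strictly positive and sum to one by construction.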
A.2 Link between convex games and variational inequalities
Finding a Nash equilibrium in a convex player game is related to solving a variational inequality (VI). Consider a space of parameters that is compact and convex, and consider a scalar product in . The strong form of the VI associated to the operator is
(18) 
The weak form of the VI is
(19) 
We define the concept of monotone operator.
Definition 3.
An operator is monotone if
If is monotone, a solution of the strong form of the VI is a solution of the weak form. The converse is true when is continuous.
For convex player games (Ass. 1), the simultaneous (sub)gradient (Eq. 3) is a monotone operator. Moreover, if we assume continuity of the losses , the set of weak solutions to the VI (19) coincides with the set of Nash equilibria. Solving the VI is therefore sufficient to find Nash equilibria harker1990finite,nemirovski2010accuracy. The intuition behind this result is that equation (18) corresponds to the first-order necessary optimality condition applied to the losses of players.
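Monotonicity can be checked numerically on a simple instance. The sketch below assumes a linear operator whose matrix is the sum of a positive semi-definite part and a skew-symmetric part, as would arise from a quadratic game of the kind studied in §5.1; the construction and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4
root = rng.normal(size=(dim, dim))
raw = rng.normal(size=(dim, dim))
# Linear operator F(z) = (S + K) z with S positive semi-definite, K skew-symmetric.
operator_matrix = root @ root.T + (raw - raw.T)

def operator(z):
    return operator_matrix @ z

def monotonicity_gap(z1, z2):
    """<F(z1) - F(z2), z1 - z2>, which is nonnegative for a monotone operator."""
    return (operator(z1) - operator(z2)) @ (z1 - z2)

gaps = [
    monotonicity_gap(rng.normal(size=dim), rng.normal(size=dim))
    for _ in range(1000)
]
smallest_gap = min(gaps)
```

The skew-symmetric part contributes nothing to the gap, so a purely skew operator (as in a bilinear zero-sum game) is monotone without being strictly monotone.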
The quantity that is used in the literature to quantify the inaccuracy of a solution is the dual VI gap defined as However, the functional Nash error (2) is the usual performance measure for convex games. In this article we give the convergence rates for the functional Nash error but they also apply to the dual VI gap. That is because the bound in Lemma 4 applies to the dual VI gap as well.
A.3 Convergence rates for stochastic mirror-prox
In this section, we recall the stochastic mirror-prox algorithm and its analysis by solving. Stochastic mirror-prox corresponds to Alg. 1 without subsampling over players, i.e. setting the minibatch size . We start by giving the rates in terms of the number of iterations under Ass. 2a and Ass. 2b.
Theorem 3 (From solving).
To obtain a fair comparison with our results, we state these results in terms of the number of full gradient computations .
Corollary 1 (From solving).
A.4 Player randomness as noise
The easiest way to treat player randomness on the theoretical level is to incorporate it in the unbiased gradient estimate. Indeed, in equation (6) is an unbiased estimate of .
(24) 
If has variance bounded by , we can bound the variance of .
(25)  
(26)  
(27)  
(28)  
(29)  
(30) 
Substituting by on equations (22) and (23) yields:
(31) 
(32)  
(33) 
These bounds are clearly worse than the ones in Corollary 1 when , which motivates the theoretical work in App. B that yields Theorems 1 and 2.
Appendix B Proofs and mirror-setting algorithms
B.1 Useful lemmas
In this section, we present lemmas that will be frequently used in the analysis of the algorithms in §B.2, §B.3 and §B.4. We first present the following two technical lemmas that are used and proven by solving.
Lemma 2.
Let be a point in , let be two points in the dual , let and Then,
(34) 
Moreover, for all , one has
(35) 
Lemma 3.
Let be a sequence of elements of . Define the sequence in as follows:
Then is a measurable function of and such that:
(36) 
The following lemma provides an upper bound on the functional Nash error.
Lemma 4.
We consider a convex player game with players losses where . Let a sequence of points , the stepsizes . We define the average iterate . The functional Nash error evaluated in is upper bounded by
(37) 
Proof.
By using the convexity of in its parameter and its concavity in the other parameters, and applying Jensen’s inequality, we obtain:
(38)  
(39) 
As a consequence of the convexity of with respect to its parameter, we have where . Remark that . By plugging this inequality in (39), we obtain
Lemma 5.
Let be a sequence in and . For any , we define the function to be
(41) 
Then, it attains its minimum for when both terms are equal. Let us call the point at which the minimum is reached. The value of evaluated at is
(42) 
Proof.
It is sufficient to derive the firstorder optimality condition of :
(43) 
and the result follows. ∎
Lemma 6.
Let be Banach spaces where for each , is the norm associated to . The Cartesian product is and has a norm defined for as
(44) 
It is known that is a Banach space. Moreover, we define the dual spaces The dual space of is and has a norm . Then, for any , the following inequality holds
(45) 
Proof.
We first prove that the LHS is smaller than the RHS. By definition of the dual norm, we have
(46) 
where we used the Cauchy-Schwarz inequality. By applying this inequality again in (46), we obtain
(47) 
which proves the result. To prove the other inequality we define .