Extra-gradient with player sampling for provable fast convergence in n-player games

Data-driven model training increasingly relies on finding Nash equilibria with provable techniques, e.g., for GANs and multi-agent RL. In this paper, we analyse a new extra-gradient method that performs gradient extrapolations and updates on a random subset of players at each iteration. This approach provably exhibits the same rate of convergence as full extra-gradient in non-smooth convex games. We propose an additional variance reduction mechanism for this to hold for smooth convex games. Our approach makes extrapolation amenable to massive multiplayer settings, and brings empirical speed-ups, in particular when using cyclic sampling schemes. We demonstrate the efficiency of player sampling on large-scale non-smooth and non-strictly convex games. We show that the joint use of extrapolation and player sampling makes it possible to train better GANs on CIFAR10.

1 Introduction

A growing number of models in machine learning require optimizing over multiple interacting objectives. This is the case of generative adversarial networks goodfellow2014generative, imaginative agents racaniere2017imagination, hierarchical reinforcement learning wayne2014hierarchical,vezhnevets2017feudal and more generally multi-agent reinforcement learning bu2008comprehensive. Solving saddle-point problems (see e.g. rockafellar_monotone_1970), which is key in robust learning kim_robust_2006 and image reconstruction chambolle_firstorder_2011, also falls in this category. All these examples can be cast as games where players are modules that compete or cooperate to minimize their own objective functions.

Optimizing over several objectives is challenging. To define a principled solution to a multi-objective optimization problem, we may rely on the notion of Nash equilibrium nash1951non. At a Nash equilibrium, no player can improve its objective by unilaterally changing its strategy. In general games, finding a Nash equilibrium is known to be PPAD-complete daskalakis2009complexity. The theoretical section of this paper considers the class of convex $n$-player games, for which Nash equilibria exist nemirovski2010accuracy. Finding a Nash equilibrium in this setting is equivalent to solving a variational inequality problem (VI) with a monotone operator harker1990finite,nemirovski2010accuracy. This VI can be solved using first-order methods, which are prevalent in single-objective optimization for machine learning. Stochastic gradient descent (the simplest first-order method) is indeed known to converge to local minima under mild conditions met by ML problems bottou2008tradeoffs,lee2016gradient. Yet, while gradient descent can be applied simultaneously to different objectives, it may fail to find a Nash equilibrium in very simple settings (see e.g. sos,gidel2018variational). Two alternative modifications of gradient descent are necessary to solve the VI (hence Nash) problem: averaging magnanti1997averaging,nedic2009subgradient and extrapolation with averaging (introduced as the extra-gradient method korpelevich1976extragradient), which is faster nemirovski2004prox. Extrapolation corresponds to an opponent shaping step: each player anticipates its opponents’ next moves to update its strategy.

In $n$-player games, extra-gradient computes $2n$ single-player gradients before performing a parameter update. Whether in massive or simple two-player games, this may be an inefficient update strategy: early gradient information, computed at the beginning of each iteration, could be used to perform eager updates or extrapolations, similar to how alternated training (e.g., for GANs goodfellow2014generative) would behave. In this paper, we introduce and analyse new extra-gradient algorithms that extrapolate and update random or carefully selected subsets of players at each iteration (Fig. 1). Our contributions are as follows.

• We review the extra-gradient algorithm for convex games and outline its shortcomings (§3.1). We propose a doubly-stochastic extra-gradient (DSEG) algorithm (§3.2) that relies on partially observed gradients of players. It performs faster but noisier updates than original extra-gradient descent. We introduce a variance reduction method to attenuate the added noise for smooth games. We describe an importance and a cyclic sampling scheme that improve convergence speed.

• We propose a sharp analysis of our method’s convergence rates (§4), as outlined in Table 1, for the various proposed variants. These rates outline the trade-offs of player sampling.

• We demonstrate the performance of player-sampled extra-gradient in controlled settings (quadratic games, §5), showing how our approach outperforms vanilla extra-gradient, especially using cyclic player selection. Most interestingly, compared to vanilla extra-gradient, our approach (with cyclic sampling) is also more efficient for GAN training (CIFAR10, ResNet architecture).

2 Related work

Finding Nash equilibria in non-convex settings.

A number of algorithms have been proposed in the non-convex setting under restricted assumptions on the game, for example WoLF in two-player two-action games bowling2001rational, policy prediction in two-player two-action bi-matrix games zhanglesser, AWESOME in repeated games conitzer2007awesome, Optimistic Mirror Descent in two-player bilinear zero-sum games daskalakis2018training and Consensus Optimization in two-player zero-sum games mescheder2017numerics. optimistic proved asymptotic convergence results for extra-gradient without averaging in a slightly non-convex setting. gidel2018variational demonstrated the effectiveness of extra-gradient for non-convex GAN training; in §5, we demonstrate that player sampling improves training speed and effectiveness in the GAN setting.

Extra-gradient can also be understood as an opponent shaping method: in the extrapolation step, the player looks one step in the future and anticipates the next moves of his opponents. Several recent works proposed algorithms that make use of the opponents’ information to converge to an equilibrium zhanglesser,lola,sos. In particular, the “Learning with opponent-learning awareness” (LOLA) algorithm is known for encouraging cooperation in cooperative games lola. Lastly, some recent works proposed algorithms to modify the dynamics of simultaneous gradient descent by adding an adjustment term in order to converge to the Nash equilibrium mazumdar2019finding and avoid oscillations mechanics,mescheder2017numerics. One caveat of these works is that they need to estimate the Jacobian of the simultaneous gradient, which may be expensive in large-scale systems or even impossible when dealing with non-smooth losses as we consider in our setting. This is orthogonal to our approach that finds solutions of the original VI problem (5).

3 Solving convex games with partial first-order information

We review the framework of Cartesian convex games and the extra-gradient method in §3.1. Building on these, we propose to augment extra-gradient with player sampling and variance reduction in §3.2.

3.1 Solving convex n-player games with gradients

Each player observes a loss that depends on the independent parameters of all other players.

Definition 1.

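A Cartesian $n$-player game is given by a set of $n$ players with parameters $\theta = (\theta^1, \dots, \theta^n) \in \Theta$, where $\Theta$ decomposes into a Cartesian product $\Theta^1 \times \dots \times \Theta^n$. Each player’s parameter $\theta^i$ lives in $\Theta^i \subset \mathbb{R}^{d_i}$, where $d = \sum_{i=1}^n d_i$. Each player is given a loss function $\ell_i : \Theta \to \mathbb{R}$.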

For example, generative adversarial network (GAN) training can be cast as a Cartesian game between a generator and discriminator that do not share parameters. We make the following assumption over the geometry of losses and constraints, that corresponds to the convexity assumption for one player.

Assumption 1.

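The parameter spaces $\Theta^1, \dots, \Theta^n$ are compact, convex and non-empty. Each player’s loss $\ell_i(\theta^i, \theta^{-i})$ is convex in its parameter $\theta^i$ and concave in $\theta^{-i}$, where $\theta^{-i}$ contains all other players’ parameters. Moreover, $\sum_{i=1}^n \ell_i(\theta)$ is convex in $\theta$.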

Ass. 1 implies that $\Theta$ has a diameter $\Omega$. Note that the losses may be non-differentiable. A simple example of Cartesian convex games satisfying Ass. 1, which we will empirically study in §5, are matrix games (e.g., rock-paper-scissors) defined by a positive payoff matrix $A$, with parameters $\theta^i$ corresponding to mixed strategies lying in the probability simplex $\triangle^{d_i}$.

Nash equilibria.

Joint solutions to minimizing losses are naturally defined as the set of Nash equilibria nash1951non of the game. In this setting, the goal of multi-objective optimization becomes

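$$\text{Find } \theta_\star \in \Theta \text{ such that } \forall i \in [n], \quad \ell_i(\theta^i_\star, \theta^{-i}_\star) = \min_{\theta^i \in \Theta^i} \ell_i(\theta^i, \theta^{-i}_\star). \tag{1}$$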

Intuitively, a Nash equilibrium is a point where no player can benefit by changing his strategy while the other players keep theirs unchanged. Ass. 1 implies the existence of a Nash equilibrium rosen1964existence. We quantify the inaccuracy of a solution by the functional Nash error nemirovski2004prox

$$\mathrm{Err}_N(\theta) \triangleq \sum_{i=1}^{n} \Big[ \ell_i(\theta) - \min_{z \in \Theta^i} \ell_i(z, \theta^{-i}) \Big]. \tag{2}$$

This error, computable through convex optimization, quantifies the gain that each player can obtain when deviating alone from the current strategy. In particular, $\mathrm{Err}_N(\theta) = 0$ if and only if $\theta$ is a Nash equilibrium; thus $\mathrm{Err}_N$ constitutes a proper indication of convergence for a sequence of iterates seeking a Nash equilibrium. It is the value we bound in our convergence analysis (see §4).
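As a minimal illustration of the Nash error (2), the sketch below evaluates it for a two-player zero-sum matrix game on the simplex, where the inner minimization over a simplex reduces to picking the smallest coordinate of a linear form. The function names, the sign convention (the row player minimizes $x^\top A y$, the column player minimizes $-x^\top A y$) and the rock-paper-scissors matrix are assumptions made for this example only.

```python
def mat_vec(A, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def nash_error_zero_sum(A, x, y):
    """Nash error (2) for l1 = x^T A y (row player) and l2 = -x^T A y (column player)."""
    Ay = mat_vec(A, y)
    ATx = mat_vec([list(col) for col in zip(*A)], x)
    value = sum(xi * ai for xi, ai in zip(x, Ay))
    gain_row = value - min(Ay)    # l1(theta) - min_z l1(z, y), z in the simplex
    gain_col = max(ATx) - value   # l2(theta) - min_w l2(x, w), w in the simplex
    return gain_row + gain_col


# Rock-paper-scissors payoff (skew-symmetric), used here as a toy instance.
RPS = [[0, 1, -1], [-1, 0, 1], [1, -1, 0]]
uniform = [1 / 3, 1 / 3, 1 / 3]
pure_rock = [1.0, 0.0, 0.0]

err_equilibrium = nash_error_zero_sum(RPS, uniform, uniform)   # both players mix uniformly
err_pure = nash_error_zero_sum(RPS, pure_rock, pure_rock)      # both players always play rock
print(err_equilibrium, err_pure)
```

The error vanishes at the uniform mixed strategy, which is the equilibrium of this game, and is strictly positive when both players commit to a pure strategy, since each could then gain by deviating alone.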

First-order methods and extrapolation.

We consider (sub)differentiable losses $\ell_i$ forming a convex game. The Nash equilibrium can be found using first-order methods, which access the gradients of $(\ell_i)_{i \in [n]}$. We define the simultaneous gradient of the game to be

$$F \triangleq (\nabla_1 \ell_1, \dots, \nabla_n \ell_n)^\top \in \mathbb{R}^d, \tag{3}$$

where we write $\nabla_i \ell_i \triangleq \nabla_{\theta^i} \ell_i$. It corresponds to the concatenation of the gradients of each player’s loss with respect to its own parameters. The losses may be non-smooth, in which case the gradients $\nabla_i \ell_i$ should be replaced by subgradients. Simultaneous gradient descent simply approximates the flow of the simultaneous gradient. It fails to converge in very simple settings, in particular in any matrix game for which the payoff is skew-symmetric. An alternative approach with better guarantees is the extra-gradient method korpelevich1976extragradient, which forms the basis for the algorithms presented in this paper. It has been extensively analyzed under several settings (see §2). In particular, nemirovski2004prox provides convergence results when gradients are exact, and solving when gradients are accessed through a noisy oracle.

Extra-gradient consists of two steps: first, we take a gradient step to go to an extrapolated point. We then use the gradient at the extrapolated point to perform a gradient step from the original point:

$$\text{At iteration } \tau: \quad \text{(extrapolation)} \;\; \theta_{\tau+1/2} = p_\Theta[\theta_\tau - \gamma_\tau F(\theta_\tau)], \qquad \text{(update)} \;\; \theta_{\tau+1} = p_\Theta[\theta_\tau - \gamma_\tau F(\theta_{\tau+1/2})], \tag{4}$$

where $p_\Theta[\cdot]$ is the Euclidean projection onto the constraint set $\Theta$, i.e. $p_\Theta[z] = \operatorname{argmin}_{\theta \in \Theta} \|\theta - z\|_2^2$. This "cautious" approach makes it possible to escape cycling orbits of the simultaneous gradient flow, which may arise around equilibrium points with skew-symmetric Hessians (see Fig. 1). The generalization of extra-gradient to general Banach spaces equipped with a Bregman divergence was introduced as the mirror-prox algorithm nemirovski2004prox. All new results from §4 extend to this mirror setting (see §A.1). As recalled in Table 1, solving provides rates of convergence for the average iterate $\hat\theta_t$. These rates are introduced for the equivalent variational inequality (VI) problem:

$$\text{Find } \theta_\star \in \Theta \text{ such that } F(\theta_\star)^\top (\theta - \theta_\star) \geqslant 0 \quad \forall \theta \in \Theta, \tag{5}$$

where Ass. 1 ensures that the simultaneous gradient is a monotone operator (see §A.2 for the link between Nash equilibria and solutions of the VI).
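The difference between simultaneous gradient descent and the extra-gradient update (4) can be seen on a toy example. The sketch below is an illustration under our own assumptions: the bilinear game $\ell_1(x, y) = xy$, $\ell_2(x, y) = -xy$, a box constraint $[-2, 2]^2$ playing the role of $\Theta$, a constant stepsize, and hypothetical function names. It tracks the last iterate rather than the averaged iterate that the convergence rates refer to.

```python
import math


def simultaneous_gradient(theta):
    """F(theta) for l1(x, y) = x * y and l2(x, y) = -x * y."""
    x, y = theta
    return (y, -x)


def project_box(theta, radius=2.0):
    """Euclidean projection onto the box [-radius, radius]^2."""
    return tuple(max(-radius, min(radius, c)) for c in theta)


def gradient_step(theta, direction, gamma):
    """Projected step from theta along -direction."""
    return project_box(tuple(t - gamma * g for t, g in zip(theta, direction)))


def run(method, theta=(1.0, 1.0), gamma=0.1, n_iter=2000):
    for _ in range(n_iter):
        if method == "extra-gradient":
            # Extrapolate, then update from the original point with the extrapolated gradient.
            half = gradient_step(theta, simultaneous_gradient(theta), gamma)
            theta = gradient_step(theta, simultaneous_gradient(half), gamma)
        else:
            # Simultaneous gradient descent: a single step with the current gradient.
            theta = gradient_step(theta, simultaneous_gradient(theta), gamma)
    return theta


norm_gd = math.hypot(*run("simultaneous"))
norm_eg = math.hypot(*run("extra-gradient"))
print(norm_gd, norm_eg)
```

In this sketch the extra-gradient iterates spiral towards the equilibrium $(0, 0)$, whereas the simultaneous gradient iterates drift outwards until the projection stops them, which is the cycling behaviour described above.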

Computational caveats.

In systems with a large number of players, an extra-gradient step may be computationally expensive due to the high number of backward passes necessary for gradient computations. Namely, at each iteration, we are required to compute $2n$ gradients before performing a first update. This is likely to be inefficient, as we could use the first computed gradients to perform a first extrapolation or update. This remains true for games down to two players. In a different setting, stochastic gradient descent robbins_stochastic_1951 updates model parameters before observing the whole data, assuming that partial observation is sufficient for progress in the optimization loop. Similarly, partial gradient observation should be sufficient to perform extrapolation and updates toward the Nash equilibrium. We therefore propose to compute a few random player gradients at each iteration.

3.2 Partial extrapolation and update for extra-gradient

We present our main algorithmic contribution in this section. While standard extra-gradient requires two full passes over players, we propose to compute doubly-stochastic simultaneous gradient estimates. This corresponds to evaluating a simultaneous gradient that is affected by two sources of noise. We sample a mini-batch $\mathcal{P}$ of players of size $b \leqslant n$, and compute the gradients for this mini-batch only. Furthermore, we assume that the gradients are noisy estimates, e.g., with noise coming from data sampling. We then compute a doubly-stochastic simultaneous gradient estimate $\tilde F$ as

$$\tilde F \triangleq (\tilde F^{(1)}, \dots, \tilde F^{(n)})^\top \in \mathbb{R}^d \quad \text{where} \quad \tilde F^{(i)}(\theta, \mathcal{P}) \triangleq \begin{cases} \frac{n}{b} \cdot g_i(\theta) & \text{if } i \in \mathcal{P}, \\ 0_{d_i} & \text{otherwise,} \end{cases} \tag{6}$$

where $g_i(\theta)$ is a noisy unbiased estimate of $\nabla_i \ell_i(\theta)$. The factor $n/b$ in (6) ensures that the doubly-stochastic simultaneous gradient estimate is an unbiased estimator of the simultaneous gradient. Doubly-stochastic extra-gradient replaces the full gradient in update (4) by the oracle (6), as detailed in Alg. 1.
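The estimate (6) and its unbiasedness under uniform sampling of player subsets can be checked numerically. The sketch below is illustrative only: the function name, the toy per-player gradients and the choice of exact (noise-free) gradients are assumptions, so that the only randomness left is the player sampling.

```python
import itertools
import random


def doubly_stochastic_gradient(player_gradients, batch, n):
    """Estimate (6): rescale sampled players' gradients by n / b, zero out the others."""
    b = len(batch)
    return [
        [n / b * g for g in grad] if i in batch else [0.0] * len(grad)
        for i, grad in enumerate(player_gradients)
    ]


# Hypothetical per-player gradients g_i(theta) for n = 4 players of different dimensions.
gradients = [[1.0, -2.0], [0.5], [3.0, 0.0, 1.0], [-1.0]]
n, b = len(gradients), 2

# One random draw of the estimate, as would be used inside an iteration.
sampled = doubly_stochastic_gradient(gradients, set(random.sample(range(n), b)), n)

# Averaging over all b-subsets (i.e. uniform sampling) recovers the full gradient.
subsets = list(itertools.combinations(range(n), b))
mean_estimate = [[0.0] * len(g) for g in gradients]
for subset in subsets:
    estimate = doubly_stochastic_gradient(gradients, set(subset), n)
    for i, block in enumerate(estimate):
        for j, value in enumerate(block):
            mean_estimate[i][j] += value / len(subsets)
print(mean_estimate)
```

Each player belongs to a fraction $b/n$ of the subsets, so the factor $n/b$ exactly compensates for the zeroed-out blocks in expectation.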

Motivation.

Sampling over players introduces a further source of noise in the average iterate sequence $(\hat\theta_t)_t$. The convergence of this sequence is already slowed down by noisy gradients or by the non-smoothness of the losses, which both introduce a term in $O(1/\sqrt{t})$ in the convergence bounds. It is therefore appealing to introduce a further source of noise, hoping that the computational speed-ups provided at each iteration mitigate the approximation errors introduced by player subsampling.

Variance reduction for player noise.

To obtain faster rates in convex games with smooth losses, we propose to compute a variance-reduced estimate of the simultaneous gradient. This mitigates the noise due to player sampling. Variance reduction is a technique known to accelerate convergence under smoothness assumptions in similar settings. While palaniappan2016stochastic,iusem2017extragradient,reducing2019chavdarova apply variance reduction to the noise coming from the gradient estimates, we apply it to the noise coming from the sampling over the players. We implement this idea in Alg. 2. We keep an estimate of $\nabla_i \ell_i$ for each player in a table, which we use to compute unbiased gradient estimates with lower variance, similar to SAGA defazio2014saga.
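A SAGA-style estimator over players can be sketched as follows. This is a hedged illustration and may differ from Alg. 2 in its details: the function names, the form of the correction term and the toy values are our assumptions. The idea is to return the stored gradient for every player and to add a rescaled correction only for the sampled players, which keeps the estimate unbiased while shrinking its variance when the table is close to the current gradients.

```python
import itertools


def variance_reduced_gradient(player_gradients, table, batch):
    """SAGA-style estimate: stored value plus a rescaled correction for sampled players."""
    n, b = len(player_gradients), len(batch)
    estimate = []
    for i, (grad, stored) in enumerate(zip(player_gradients, table)):
        if i in batch:
            estimate.append([s + n / b * (g - s) for g, s in zip(grad, stored)])
        else:
            estimate.append(list(stored))
    return estimate


def refresh_table(player_gradients, table, batch):
    """Overwrite the stored gradients of the sampled players, keep the others."""
    return [list(g) if i in batch else list(s)
            for i, (g, s) in enumerate(zip(player_gradients, table))]


# Hypothetical current gradients and (slightly stale) table entries for 4 players.
gradients = [[1.0, -2.0], [0.5], [3.0, 0.0, 1.0], [-1.0]]
table = [[0.8, -1.5], [0.0], [2.0, 0.5, 1.0], [-1.2]]

# Unbiasedness check: average the estimate over all 2-subsets of players.
subsets = list(itertools.combinations(range(len(gradients)), 2))
mean_estimate = [[0.0] * len(g) for g in gradients]
for subset in subsets:
    for i, block in enumerate(variance_reduced_gradient(gradients, table, set(subset))):
        for j, value in enumerate(block):
            mean_estimate[i][j] += value / len(subsets)

new_table = refresh_table(gradients, table, {0})
print(mean_estimate, new_table)
```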

Sampling strategies.

In the basic version of the algorithm, the sampling over players can be performed using any distribution with uniform marginals, i.e. such that all players have equal probability of being sampled. Sampling uniformly over $b$-subsets of $[n]$ is a reasonable way to fulfill this condition as all players have probability $b/n$ of being chosen. One faster alternative is to perform importance sampling. Namely, we sample each player with a probability proportional to the uniform bound of $\nabla_i \ell_i$. This technique achieves faster convergence (see §B.3) when the gradient bounds for the different losses differ.

As a strategy to accelerate convergence, we propose to cycle over the pairs of different players (with ). At each iteration, we extrapolate the first player of the pair and update the second one. We shuffle the order of pairs once the block has been entirely seen. By excluding pairs, we avoid players extrapolating themselves, which is never useful to reduce . This scheme bridges extrapolation and alternated gradient descent: for GANs, it corresponds to extrapolate the generator before updating the discriminator, and vice-versa, cyclically. Sampling over players proves powerful for quadratic games (§5.1) and GANs (§5.2). In App. C, we provide a first explanation for this fact, based on studying the spectral radius of recursion operators (echoing recent work on understanding cyclic coordinate descent li_faster_2018).
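The cyclic scheme can be sketched as a schedule of (extrapolated player, updated player) pairs. The generator below is a minimal illustration with hypothetical names; it enumerates all ordered pairs of distinct players, shuffles them, and reshuffles after each full pass. Shuffling before the first pass as well is our choice for the sketch.

```python
import itertools
import random


def cyclic_pair_schedule(n_players, n_iterations, seed=0):
    """Yield (extrapolated player, updated player) pairs with distinct players."""
    rng = random.Random(seed)
    pairs = list(itertools.permutations(range(n_players), 2))  # ordered pairs, i != j
    produced = 0
    while produced < n_iterations:
        rng.shuffle(pairs)  # new order for each pass over the block of pairs
        for pair in pairs:
            if produced == n_iterations:
                return
            yield pair
            produced += 1


schedule = list(cyclic_pair_schedule(n_players=3, n_iterations=12))
print(schedule)
```

For two players (e.g. a generator indexed 0 and a discriminator indexed 1), the block contains only the pairs (0, 1) and (1, 0), which matches the alternation described above.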

4 Sharp analysis of convergence rates

We state our main convergence results in this section. As announced, we derive rates for the algorithms mentioned in §3.2 following the analysis by solving. We compare them with the rates achieved by stochastic extra-gradient introduced by solving, which also assumes noisy gradients but no player subsampling. While in the main paper the theorems are provided in the Euclidean setting, the proofs in the appendices are written in the mirror setting. In the analysis, we separately consider the two following assumptions on the losses.

Assumption 2a (Non-smoothness).

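For each $i \in [n]$, the loss $\ell_i$ has a bounded subgradient, namely $\max_{h \in \partial_i \ell_i(\theta)} \|h\|_2 \leqslant G_i$ for all $\theta \in \Theta$. In this case, we also define the quantity $G = \sqrt{\sum_{i=1}^n G_i^2 / n}$.

Assumption 2b (Smoothness).

For each $i \in [n]$, the loss $\ell_i$ is once-differentiable and $L$-smooth, i.e. $\|\nabla_i \ell_i(\theta) - \nabla_i \ell_i(\theta')\|_2 \leqslant L \|\theta - \theta'\|_2$ for $\theta, \theta' \in \Theta$.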

Classically, similar to solving, robbins_stochastic_1951, we assume unbiasedness and boundedness of the variance.

Assumption 3.

For each player $i$, the noisy gradient $g_i$ is unbiased and has bounded variance:

$$\forall \theta \in \Theta, \quad \mathbb{E}[g_i(\theta)] = \nabla_i \ell_i(\theta), \qquad \mathbb{E}\big[\|g_i(\theta) - \nabla_i \ell_i(\theta)\|_2^2\big] \leqslant \sigma^2. \tag{7}$$

In stochastic gradient-based methods, comparing rates in terms of number of iterations is not appropriate since the complexity per iteration increases with the size of player mini-batches. Instead, we define $k(t)$ as the number of gradient estimates computed up to iteration $t$. At each iteration in Alg. 1, the doubly-stochastic simultaneous gradient estimate is computed twice, which requires $2b$ gradient estimates. Therefore, $k(t) = 2bt$, which implies that the number of iterations in terms of gradient computations is $t(k) = k/(2b)$. We give the rates in terms of $k$ in the statement of the theorems. We first state the convergence result for doubly-stochastic extra-gradient under Ass. 2a.

Theorem 1.

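We consider a convex $n$-player game where Ass. 2a holds. Assume that Alg. 1 is run without variance reduction and with constant stepsize $\gamma$. The expected Nash error is upper bounded as

$$\mathbb{E}\big[\mathrm{Err}_N(\hat\theta_{t(k)})\big] \leqslant O\left(\sqrt{\frac{\Omega n}{k}\,(nG^2 + b\sigma^2)}\right), \quad \text{setting } \gamma \in O\left(\sqrt{\frac{b\,\Omega}{n(nG^2 + b\sigma^2)\,t(k)}}\right). \tag{8}$$

The following result holds when the losses are once-differentiable and smooth (Ass. 2b).

Theorem 2.

We consider a convex $n$-player game where Ass. 2b holds. Assume that we run Alg. 1 with variance reduction and constant stepsize $\gamma$. The expected Nash error is upper bounded as

$$\mathbb{E}\big[\mathrm{Err}_N(\hat\theta_{t(k)})\big] \leqslant O\left(\frac{\Omega L n^2}{\sqrt{b}\,k} + \sqrt{\frac{\Omega n b \sigma^2}{k}}\right), \quad \text{setting } \gamma \in \min\left\{O\left(\frac{b^{3/2}}{L n^2}\right), O\left(\sqrt{\frac{\Omega}{n \sigma^2\, t(k)}}\right)\right\}. \tag{9}$$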

These rates should be compared to the rate of [solving, Corollary 1], which we recall in §A.3 and Table 1. Corollaries 3 and 7 in §B.2.2 and §B.4 contain the statements of Theorems 1 and 2 in more detail.

1. Under Ass. 2a, Alg. 1 performs with a rate similar to stochastic extra-gradient. In both cases the rate is $O(1/\sqrt{k})$, and the subgradient bound $G$ and noise bound $\sigma$ appear in the numerator. Doubly-stochastic extra-gradient is more robust to noisy gradient estimates, because the dependency of its rate on $\sigma$ is weaker than for full extra-gradient.

2. Under Ass. 2b, the deterministic term of the rate is $n/\sqrt{b}$ times larger compared to stochastic extra-gradient while the noisy term is $\sqrt{n/b}$ times smaller. For long runs (large $k$), the noise term dominates the deterministic one, which advocates for the use of small batch sizes: when $b = 1$ the rate is asymptotically $\sqrt{n}$ times smaller. Setting $\sigma$ to zero in the noise term, doubly-stochastic extra-gradient with variance reduction recovers the rate from nemirovski2004prox.

To sum up, doubly-stochastic extra-gradient provides better convergence guarantees than stochastic extra-gradient under high levels of noise, while it delivers similar or slightly worse theoretical results in the non-noisy regime. Player randomness can be considered in the framework from solving by including it in the noisy unbiased estimate (increasing $\sigma^2$ in Ass. 3 accordingly). This coarse approach does not yield the sharp bounds of Theorems 1 and 2 (see §A.4).

Importance sampling.

Using importance sampling when choosing player mini-batches yields a better bound by a constant factor (see §B.3). In the non-smooth case, this replaces the constant with the strictly smaller , which is useful when the gradient magnitudes are skewed.

5 Applications

We show the performance of doubly-stochastic extra-gradient in the setting of quadratic games over the simplex, and in the practical context of GAN training. A PyTorch/NumPy package is attached.

5.1 Random quadratic games

We consider a game where $n$ players can play $d$ actions, with payoffs provided by a matrix $A$, a horizontal stack of matrices $A_i$ (one for each player). The loss function of each player is defined as its expected payoff given the mixed strategies $(\theta^1, \dots, \theta^n)$, i.e.

$$\forall i \in [n], \; \forall \theta \in \Theta = \triangle^{d_1} \times \dots \times \triangle^{d_n}, \quad \ell_i(\theta^i, \theta^{-i}) = {\theta^i}^\top A_i \theta + \lambda \left\| \theta^i - \tfrac{1}{d} \right\|_1, \tag{10}$$

where $\lambda$ is a regularization parameter that introduces non-smoothness and pushes strategies to snap to the simplex center. The positivity of $A$ is equivalent to Ass. 1, i.e. $\theta^\top A \theta \geqslant 0$ for all $\theta \in \Theta$.
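To make the loss (10) concrete, the sketch below evaluates it together with one subgradient with respect to the player's own strategy. It is an illustration under stated assumptions: the joint parameter is stored as the concatenation of the $n$ strategies, each player's matrix has $d$ rows and $nd$ columns, the simplex constraint is ignored when differentiating, and all names and numerical values are hypothetical.

```python
def player_loss(A_i, theta, i, d, lam):
    """Loss (10) of player i; theta concatenates the n strategies of size d."""
    theta_i = theta[i * d:(i + 1) * d]
    payoff = sum(t * sum(a * x for a, x in zip(row, theta)) for t, row in zip(theta_i, A_i))
    return payoff + lam * sum(abs(t - 1.0 / d) for t in theta_i)


def player_subgradient(A_i, theta, i, d, lam):
    """One subgradient of the loss of player i with respect to its own strategy."""
    theta_i = theta[i * d:(i + 1) * d]
    sign = lambda v: (v > 0) - (v < 0)  # returns 0 at the kink, a valid subgradient of |.|
    subgradient = []
    for j in range(d):
        cross = sum(a * x for a, x in zip(A_i[j], theta))            # (A_i theta)_j
        own = sum(A_i[r][i * d + j] * theta_i[r] for r in range(d))  # theta_i also appears in theta
        subgradient.append(cross + own + lam * sign(theta_i[j] - 1.0 / d))
    return subgradient


# Hypothetical two-player, two-action instance.
A_0 = [[1.0, 0.5, 0.0, -1.0], [0.5, 2.0, 1.0, 0.0]]
theta = [0.7, 0.3, 0.4, 0.6]
loss_0 = player_loss(A_0, theta, i=0, d=2, lam=0.1)
subgradient_0 = player_subgradient(A_0, theta, i=0, d=2, lam=0.1)
print(loss_0, subgradient_0)
```

The `own` term accounts for the fact that player $i$'s strategy appears on both sides of the quadratic form; omitting it is an easy mistake when $A_i$ has a non-zero diagonal block.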

Experiments.

We sample $A$ as the weighted sum of a random symmetric positive definite matrix and a skew matrix. We compare the convergence speeds of extra-gradient algorithms, with or without player subsampling. We vary three parameters: the variance of the noise in the gradient oracle (we add Gaussian noise to each gradient coordinate, similar to Langevin dynamics neal2011mcmc), the non-smoothness of the loss, and the skewness of the matrix. We consider small and large games. We use the (simplex-adapted) mirror variant of doubly-stochastic extra-gradient, and a constant stepsize, selected among a grid (see App. D). We use variance reduction when $\lambda = 0$ (smooth case). We compare random fixed-size sampling with cyclic sampling (§3.2).

Results.

Fig. 2 compares the convergence speed of player-sampled extra-gradient for the various settings and sampling schemes. As predicted by Theorems 1 and 2, rates of convergence are comparable with and without subsampling. Randomly subsampling players always brings a benefit in the convergence constant (Fig. 2(a)-(b)), especially in the smooth noisy regime, using variance reduction (Fig. 2(a), column 2). Most interestingly, cyclic player selection brings a significant improvement over random sampling for a small number of players, allowing larger gains in the rate constants (Fig. 2(a)).

Fig. 2(c) highlights the trade-offs in Theorem 2: as the noise increases, the size of player batches should be reduced. Note that for skew-games with many players (Fig. 2(b), col. 3), which are the hardest games to solve as averaging is needed optimistic, our approach only becomes beneficial in the high-noise regime (more relevant in ML). Full extra-gradient should be favored in the non-noisy regime (see App. D).

Spectral effect of sampling.

To better understand the benefit of the cyclic selection scheme, we study the linear “algorithm operator” $\mathcal{A}$ such that $\theta_{\tau+1} = \mathcal{A}(\theta_\tau)$ in unconstrained two-player bilinear games. The convergence speed of $(\theta_\tau)_\tau$ is governed by the spectral radius of $\mathcal{A}$, in light of Gelfand’s formula gelfand1941normierte. In App. C (Fig. 4), we consider random matrix games. For these, the algorithm operator of extra-gradient with cyclic player selection has on average a lower spectral radius than with random selection and a fortiori full selection. This leads to faster convergence of cyclic schemes.
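As a small worked illustration of what such an operator looks like, consider full (not cyclic) selection on the scalar bilinear game $\ell_1(x, y) = a x y$, $\ell_2(x, y) = -a x y$ with a constant stepsize $\gamma$. This toy game and the matrix name $M$ are our own choices for the sketch and are not taken from App. C. The simultaneous gradient is linear, $F(\theta) = M\theta$, so both simultaneous gradient descent and extra-gradient are linear recursions whose spectral radii can be computed by hand:

$$M = \begin{pmatrix} 0 & a \\ -a & 0 \end{pmatrix}, \qquad M^2 = -a^2 I, \qquad \text{eigenvalues of } M: \; \pm \mathrm{i} a.$$

$$\text{Simultaneous gradient:} \quad \theta_{\tau+1} = (I - \gamma M)\,\theta_\tau, \qquad \rho(I - \gamma M) = |1 \mp \mathrm{i}\gamma a| = \sqrt{1 + \gamma^2 a^2} > 1.$$

$$\text{Extra-gradient:} \quad \theta_{\tau+1} = \theta_\tau - \gamma M(\theta_\tau - \gamma M \theta_\tau) = \big((1 - \gamma^2 a^2) I - \gamma M\big)\,\theta_\tau, \qquad \rho = \sqrt{(1 - \gamma^2 a^2)^2 + \gamma^2 a^2} = \sqrt{1 - \gamma^2 a^2 + \gamma^4 a^4}.$$

The extra-gradient radius is below one exactly when $0 < \gamma |a| < 1$, whereas the simultaneous gradient radius is always above one. The cyclic and random schemes lead to different operators, which are the ones compared in App. C.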

5.2 Generative adversarial networks (WGAN-GP + ResNet)

We evaluate the performance of the player sampling approach to train a generative model on CIFAR10 krizhevsky2009learning. We use the WGAN-GP loss gulrajani_improved_2017, that defines a non-convex two-player game. We compare the full extra-gradient approach advocated by gidel2018variational to the cyclic sampling scheme proposed in §3.2 (i.e. extra. D, upd. G, extra. G, upd. D). We use the ResNet he2016deep architecture from gidel2018variational, and select the best performing stepsizes among a grid (see App. D). We use the Adam kingma_adam_2014 refinement of extra-gradient gidel2018variational for both the baseline and proposed methods. We evaluate the Inception Score salimans_improved_2016 and Fréchet Inception Distance (FID) heusel2017gans along training.

Results.

We report training curves versus wall-clock time in Fig. 3. Cyclic sampling allows faster and better training, especially with respect to FID, which is more correlated to human appreciation heusel2017gans. Table 2 compares our result to full extra-gradient with uniform averaging. It shows substantial improvements in FID, with results less sensitive to randomness. Note that scores could be slightly improved by leaving more time for training.

Interpretation.

Without extrapolation, alternated training is known to perform better than simultaneous updates in WGAN-GP gulrajani_improved_2017. Our approach makes it possible to add extrapolation while keeping an alternated schedule. It thus performs better than extrapolating with simultaneous updates. This remains true across every learning rate we tested. Deterministic sampling is crucial for performance, as random player selection performs poorly (best score 6.2 IS). This echoes the good results of cyclic sampling in §5.1.

6 Discussion and conclusion

We propose and analyse a doubly-stochastic extra-gradient approach for finding Nash equilibria. According to our convergence analysis, updating/extrapolating subsets of players only is useful in high noise or non-smooth settings, and equivalent otherwise. Numerically, doubly-stochastic extra-gradient leads to speed-ups and improvements in convex and non-convex settings, especially using noisy gradients (as with GANs). Our approach hence combines the advantages of alternated and extrapolation methods over simultaneous gradient descent—we recommend it for training GANs.

Beyond demonstrating the usefulness of sampling, numerical experiments show the importance of sampling schemes. We take a first step towards understanding the good performance of cyclic player extrapolation and update. A better theoretical analysis of this phenomenon is left for future work.

We foresee interesting developments using player sampling and extrapolation in reinforcement learning: the policy gradients obtained using multi-agent actor critic methods mnih2016asynchronous,lowe2017multi are highly noisy estimates, a setting in which sampling over players proves beneficial.

7 Acknowledgements

This work was partially supported by NSF grant RI-IIS 1816753, NSF CAREER CIF 1845360, the Alfred P. Sloan Fellowship and Samsung Electronics. The work of C. Domingo Enrich was partially supported by CFIS (UPC). The work of A. Mensch was supported by the European Research Council (ERC project NORIA).

The appendices are structured as follows: App. A presents the setting and the existing results. In particular, we start by introducing the setting of the mirror-prox algorithm in §A.1. After detailing the relation between solving this problem and finding Nash equilibria in convex $n$-player games in §A.2, we recall the rates for stochastic mirror-prox obtained by solving in §A.3. We then present the proofs of our theorems in App. B. We analyze the doubly-stochastic algorithm (Alg. 1) and separately study two variants of the latter, adding importance sampling (§B.3) and variance reduction (§B.4). App. C investigates the difference between random and cyclic player sampling. App. D presents further experimental results and details.

Appendix A Existing results

A.1 Mirror-prox

Mirror-prox and mirror descent are the formulations of the extra-gradient method and gradient descent for non-Euclidean (Banach) spaces. bubeck_monograph (which is a good reference for this subsection) and solving study extra-gradient/mirror-prox in this setting. We provide an introduction to the topic for completeness.

Setting and notations.

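We consider a Banach space $E$ and a compact set $\Theta \subset E$. We define an open convex set $\mathcal{D}$ such that $\Theta$ is included in its closure, that is $\Theta \subseteq \bar{\mathcal{D}}$ and $\mathcal{D} \cap \Theta \neq \emptyset$. The Banach space $E$ is characterized by a norm $\|\cdot\|$. Its conjugate norm $\|\cdot\|_*$ is defined as $\|\xi\|_* = \max_{z : \|z\| \leqslant 1} \langle \xi, z \rangle$. For simplicity, we assume $E = \mathbb{R}^d$.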

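We assume the existence of a mirror map $\Phi$ for $\Theta$, which is defined as a function $\Phi : \mathcal{D} \to \mathbb{R}$ that is differentiable and $\mu$-strongly convex, i.e.

$$\forall x, y \in \mathcal{D}, \quad \langle \nabla\Phi(x) - \nabla\Phi(y), x - y \rangle \geqslant \mu \|x - y\|^2. \tag{11}$$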

We can define the Bregman divergence in terms of the mirror map.

Definition 2.

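Given a mirror map $\Phi$, the Bregman divergence $D : \mathcal{D} \times \mathcal{D} \to \mathbb{R}$ is defined as

$$D(x, y) \triangleq \Phi(x) - \Phi(y) - \langle \nabla\Phi(y), x - y \rangle. \tag{12}$$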

Note that is always non-negative. For more properties, see e.g. nemirovsky1983problem and references therein. Given that is compact convex space, we define Lastly, for and we define the prox-mapping as

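$$P_z(\xi) \triangleq \operatorname*{argmin}_{u \in \mathcal{D} \cap \Theta} \big\{ \Phi(u) + \langle \xi - \nabla\Phi(z), u \rangle \big\} = \operatorname*{argmin}_{u \in \mathcal{D} \cap \Theta} \big\{ D(u, z) + \langle \xi, u \rangle \big\}. \tag{13}$$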

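The mirror-prox algorithm is the most well-known algorithm to solve convex $n$-player games in the mirror setting (and variational inequalities, see §A.2). An iteration of mirror-prox consists of:

$$\begin{aligned} &\text{Compute the extrapolated point:} && \begin{cases} \nabla\Phi(y_{\tau+1/2}) = \nabla\Phi(\theta_\tau) - \gamma F(\theta_\tau), \\ \theta_{\tau+1/2} = \operatorname*{argmin}_{x \in \mathcal{D} \cap \Theta} D(x, y_{\tau+1/2}), \end{cases} \\ &\text{Compute a gradient step:} && \begin{cases} \nabla\Phi(y_{\tau+1}) = \nabla\Phi(\theta_\tau) - \gamma F(\theta_{\tau+1/2}), \\ \theta_{\tau+1} = \operatorname*{argmin}_{x \in \mathcal{D} \cap \Theta} D(x, y_{\tau+1}). \end{cases} \end{aligned} \tag{14}$$

Remark that the extra-gradient algorithm defined in equation (4) corresponds to the mirror-prox (14) when choosing $\Phi(x) = \frac{1}{2}\|x\|_2^2$.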
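As a quick check of this correspondence, assume the Euclidean mirror map $\Phi(x) = \frac{1}{2}\|x\|_2^2$ with $\mathcal{D} = \mathbb{R}^d$ (this choice of $\mathcal{D}$ is an assumption made for the illustration). Then the gradient of the mirror map is the identity and the Bregman divergence is half the squared Euclidean distance:

$$\nabla\Phi(x) = x, \qquad D(x, y) = \tfrac{1}{2}\|x\|_2^2 - \tfrac{1}{2}\|y\|_2^2 - \langle y, x - y \rangle = \tfrac{1}{2}\|x - y\|_2^2.$$

Plugging this into the extrapolation step of (14) gives

$$y_{\tau+1/2} = \theta_\tau - \gamma F(\theta_\tau), \qquad \theta_{\tau+1/2} = \operatorname*{argmin}_{x \in \Theta} \tfrac{1}{2}\|x - y_{\tau+1/2}\|_2^2 = p_\Theta\big[\theta_\tau - \gamma F(\theta_\tau)\big],$$

which is the extrapolation step of (4) with a constant stepsize. The update step follows in the same way, with $F(\theta_{\tau+1/2})$ in place of $F(\theta_\tau)$.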

Lemma 1.

By using the proximal mapping notation (13), the mirror-prox updates are equivalent to:

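$$\text{Compute the extrapolated point: } \theta_{\tau+1/2} = P_{\theta_\tau}(\gamma F(\theta_\tau)), \qquad \text{Compute a gradient step: } \theta_{\tau+1} = P_{\theta_\tau}(\gamma F(\theta_{\tau+1/2})). \tag{15}$$

Proof.

We just show that $\theta_{\tau+1/2} = P_{\theta_\tau}(\gamma F(\theta_\tau))$, as the second part is analogous.

$$\begin{aligned} \theta_{\tau+1/2} &= \operatorname*{argmin}_{x \in \mathcal{D} \cap \Theta} D(x, y_{\tau+1/2}) \\ &= \operatorname*{argmin}_{x \in \mathcal{D} \cap \Theta} \Phi(x) - \langle \nabla\Phi(y_{\tau+1/2}), x \rangle \\ &= \operatorname*{argmin}_{x \in \mathcal{D} \cap \Theta} \Phi(x) - \langle \nabla\Phi(\theta_\tau) - \gamma F(\theta_\tau), x \rangle \\ &= \operatorname*{argmin}_{x \in \mathcal{D} \cap \Theta} \langle \gamma F(\theta_\tau), x \rangle + D(x, \theta_\tau) = P_{\theta_\tau}(\gamma F(\theta_\tau)). \qquad \blacksquare \end{aligned}$$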

The mirror framework is particularly well-suited for simplex constraints, i.e. when the parameter of each player is a probability vector. Such constraints usually arise in matrix games. If $\Theta^i$ is the $d_i$-simplex, we express the negative entropy for player $i$ as

$$\Phi_i(\theta^i) = \sum_{j=1}^{d_i} \theta^i(j) \log \theta^i(j). \tag{16}$$

We can then define and the mirror map as

$$\Phi(\theta) = \sum_{i=1}^{n} \Phi_i(\theta^i). \tag{17}$$

We used this mirror map in the experiments for random quadratic games (§5.1).
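For a single simplex block, the prox-mapping (13) with the negative-entropy mirror map has a closed form: the new point is proportional to $z_j \exp(-\xi_j)$, i.e. a multiplicative-weights update. The sketch below illustrates this for one player; the function name and the toy vectors are assumptions, and the input $z$ is taken in the interior of the simplex.

```python
import math


def entropic_prox(z, xi):
    """Prox-mapping (13) on the simplex for the negative-entropy mirror map."""
    shift = min(xi)  # subtracting a constant from xi does not change the normalized result
    weights = [zj * math.exp(-(x - shift)) for zj, x in zip(z, xi)]
    total = sum(weights)
    return [w / total for w in weights]


uniform = [1 / 3, 1 / 3, 1 / 3]
# A larger (scaled) gradient coordinate removes probability mass from that action.
updated = entropic_prox(uniform, [1.0, 0.0, 0.0])
print(updated)
```

With the separable mirror map (17), the joint prox-mapping applies this update independently to each player's block, so no Euclidean projection onto the simplex is needed.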

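A.2 Link between convex games and variational inequalities

Finding a Nash equilibrium in a convex $n$-player game is related to solving a variational inequality (VI). Consider a space of parameters $\Theta \subseteq \mathbb{R}^d$ that is compact and convex, and consider a scalar product $\langle \cdot, \cdot \rangle$ in $\mathbb{R}^d$. The strong form of the VI associated to the operator $F : \Theta \to \mathbb{R}^d$ is

$$\text{find } \theta_* \in \Theta \text{ such that } \langle F(\theta_*), \theta - \theta_* \rangle \geqslant 0 \quad \forall \theta \in \Theta. \tag{18}$$

The weak form of the VI is

$$\text{find } \theta_* \in \Theta \text{ such that } \langle F(\theta), \theta - \theta_* \rangle \geqslant 0 \quad \forall \theta \in \Theta. \tag{19}$$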

We define the concept of monotone operator.

Definition 3.

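An operator $F$ is monotone if, for all $\theta, \theta' \in \Theta$,

$$\langle F(\theta) - F(\theta'), \theta - \theta' \rangle \geqslant 0.$$

If $F$ is monotone, a solution of the strong form of the VI is a solution of the weak form: monotonicity gives $\langle F(\theta), \theta - \theta_* \rangle \geqslant \langle F(\theta_*), \theta - \theta_* \rangle \geqslant 0$. The converse holds when $F$ is continuous.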

For convex $n$-player games (Ass. 1), the simultaneous (sub)gradient $F$ (Eq. 3) is a monotone operator. Moreover, if we assume continuity of the losses $\ell_i$, the set of weak solutions to the VI (19) coincides with the set of Nash equilibria. Solving the VI is therefore sufficient to find Nash equilibria harker1990finite,nemirovski2010accuracy. The intuition behind this result is that equation (18) corresponds to the first-order necessary optimality condition applied to the loss of each player.

The quantity used in the literature to quantify the inaccuracy of a solution $\theta$ is the dual VI gap, defined as $\max_{u \in \Theta} \langle F(u), \theta - u \rangle$. However, the functional Nash error (2) is the usual performance measure for convex games. In this article we give the convergence rates for the functional Nash error, but they also apply to the dual VI gap, because the bound in Lemma 4 holds for the dual VI gap as well.

A.3 Convergence rates for the stochastic mirror-prox

In this section, we recall the stochastic mirror-prox algorithm and its analysis by solving. Stochastic mirror-prox corresponds to Alg. 1 without subsampling over players, i.e. setting the mini-batch size $b = n$. We start by giving the rates in terms of the number of iterations $t$ under Ass. 2a and Ass. 2b.

Theorem 3 (From solving).

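We consider a convex $n$-player game where Ass. 2a and Ass. 3 hold. Assume that Alg. 1 is run for $t$ iterations without subsampling ($b = n$) and with the optimal constant stepsize. Then the expected functional Nash error is upper bounded as

$$\mathbb{E}\big[\mathrm{Err}_N(\hat\theta_t)\big] \leqslant 7 \sqrt{\frac{2 \Omega n}{3 t} \big(G^2 + 2\sigma^2\big)}. \tag{20}$$

Assuming Ass. 2b (instead of Ass. 2a) and setting the optimal constant stepsize,

$$\mathbb{E}\big[\mathrm{Err}_N(\hat\theta_t)\big] \leqslant \max\left\{ \frac{7}{2} \frac{\Omega L}{t},\; 14 \sqrt{\frac{\Omega n \sigma^2}{3 t}} \right\}. \tag{21}$$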

To obtain a fair comparison with our results, we restate these bounds in terms of the number of full gradient computations $k$.

Corollary 1 (From solving).

We consider a convex $n$-player game where Ass. 2a and Ass. 3 hold. Assume that Alg. 1 is run for $t(k)$ iterations without subsampling ($b = n$) and with the optimal constant stepsize. Then the expected functional Nash error is upper bounded as

$$\mathbb{E}\big[\mathrm{Err}_N(\hat\theta_{t(k)})\big] \leqslant 14 n \sqrt{\frac{\Omega}{3 k} \big(G^2 + 2\sigma^2\big)}. \tag{22}$$

Assuming Ass. 2b (instead of Ass. 2a) and setting the optimal constant stepsize,

 (23)
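As a consistency check between Theorem 3 and Corollary 1, the short derivation below shows that bound (20) turns into bound (22) under the change of variable $k = 2nt$. This relation is an assumption inferred from comparing the two bounds (two gradient evaluations per iteration for each of the $n$ players); it is not stated explicitly in this section.

```latex
% Assumption: k = 2 n t, i.e. t = k / (2n).
\begin{align*}
7 \sqrt{\frac{2 \Omega n}{3 t} \big(G^2 + 2\sigma^2\big)}
  &= 7 \sqrt{\frac{2 \Omega n \cdot 2n}{3 k} \big(G^2 + 2\sigma^2\big)} \\
  &= 7 \sqrt{\frac{4 n^2 \Omega}{3 k} \big(G^2 + 2\sigma^2\big)} \\
  &= 14 n \sqrt{\frac{\Omega}{3 k} \big(G^2 + 2\sigma^2\big)}.
\end{align*}
```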

A.4 Player randomness as noise

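The easiest way to treat player randomness on the theoretical level is to incorporate it in the unbiased gradient estimate. Indeed, $\tilde F_i(\theta, P)$ in equation (6) is an unbiased estimate of $\nabla_i \ell_i(\theta)$:

$$\mathbb{E}\big[\tilde F_i(\theta, P)\big] = \mathrm{Prob}(i \in P)\, \frac{n}{b}\, \mathbb{E}\big[g_i(\theta)\big] = \mathbb{E}\big[g_i(\theta)\big] = \nabla_i \ell_i(\theta). \tag{24}$$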

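If $g_i$ has variance bounded by $\sigma^2$, we can bound the variance of $\tilde F_i$:

\begin{align}
\mathbb{E}\big[\|\tilde F_i(\theta, P) - \nabla_i \ell_i(\theta)\|^2\big]
&= \mathbb{E}\big[\|\tilde F_i(\theta, P) - g_i(\theta) + g_i(\theta) - \nabla_i \ell_i(\theta)\|^2\big] \tag{25} \\
&\leqslant 2\, \mathbb{E}\big[\|\tilde F_i(\theta, P) - g_i(\theta)\|^2\big] + 2\, \mathbb{E}\big[\|g_i(\theta) - \nabla_i \ell_i(\theta)\|^2\big] \tag{26} \\
&\leqslant 2\, \mathbb{E}\big[\|\tilde F_i(\theta, P) - g_i(\theta)\|^2\big] + 2\sigma^2 \tag{27} \\
&= 2\, \mathbb{E}\left[\frac{b}{n} \left\|\left(\frac{n}{b} - 1\right) g_i(\theta)\right\|^2 + \left(1 - \frac{b}{n}\right) \big\|g_i(\theta)\big\|^2\right] + 2\sigma^2 \tag{28} \\
&= 2\, \frac{n - b}{b}\, \mathbb{E}\big[\|g_i(\theta)\|^2\big] + 2\sigma^2 \tag{29} \\
&\leqslant 2\, \frac{n - b}{b}\, G^2 + 2\sigma^2. \tag{30}
\end{align}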

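Substituting $\sigma^2$ by $2\frac{n-b}{b} G^2 + 2\sigma^2$ in equations (22) and (23) yields:

\begin{align}
\mathbb{E}\big[\mathrm{Err}_N(\hat\theta_{t(k)})\big] &\leqslant 14 n \sqrt{\frac{\Omega}{3 k} \left(\frac{4n - 3b}{b} G^2 + 4\sigma^2\right)} = O\left(n \sqrt{\frac{\Omega}{k} \left(\frac{n}{b} G^2 + \sigma^2\right)}\right), \tag{31} \\
\mathbb{E}\big[\mathrm{Err}_N(\hat\theta_{t(k)})\big] &\leqslant \max\left\{ \frac{7 \Omega L n^{3/2}}{k},\; 28 n \sqrt{\frac{\Omega \big((n - b) G^2 + b \sigma^2\big)}{3 k b}} \right\} \tag{32} \\
&= O\left( \frac{\Omega L n^{3/2}}{k} + n \sqrt{\frac{\Omega \big(n G^2 + b \sigma^2\big)}{k b}} \right). \tag{33}
\end{align}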

These bounds are clearly worse than the ones in Corollary 1 when , which motivates the theoretical work in App. B that yields Theorem 1 and 2.
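The unbiasedness property (24) and the factor $\frac{n-b}{b}$ in (29) can be checked exactly on a small instance by enumerating all player subsets. The sketch below is only illustrative: it assumes that $P$ is drawn uniformly among subsets of size $b$ and that the estimate equals $\frac{n}{b} g_i(\theta)$ for sampled players and $0$ otherwise, which is how (24) and (28) read; the gradients are held fixed so that only the player randomness is measured, and all names are hypothetical.

```python
import itertools
import numpy as np

def sampled_estimate(g, players, n, b):
    """Estimate whose i-th block is (n/b) * g_i if player i is sampled, else 0."""
    est = np.zeros_like(g)
    idx = list(players)
    est[idx] = (n / b) * g[idx]
    return est

n, b, d = 5, 2, 3                  # players, sampled players, dimension per player
rng = np.random.default_rng(0)
g = rng.normal(size=(n, d))        # fixed gradients g_i(theta), one row per player

# Enumerate every subset of size b: this is the exact expectation over P.
subsets = list(itertools.combinations(range(n), b))
estimates = np.array([sampled_estimate(g, P, n, b) for P in subsets])

mean_estimate = estimates.mean(axis=0)                     # E_P[estimate]
sq_dev = ((estimates - g) ** 2).sum(axis=2).mean(axis=0)   # E_P ||estimate_i - g_i||^2
predicted = (n - b) / b * (g ** 2).sum(axis=1)             # (n - b) / b * ||g_i||^2

print(np.allclose(mean_estimate, g), np.allclose(sq_dev, predicted))
```

Enumerating subsets rather than sampling them keeps the check free of Monte Carlo noise, at the price of being limited to small $n$.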

Appendix B Proofs and mirror-setting algorithms

B.1 Useful lemmas

In this section, we present lemmas that will be frequently used in the analysis of the algorithms in §B.2, §B.3 and §B.4. We first present the following two technical lemmas that are used and proven by solving.

Lemma 2.

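Let $z$ be a point in the primal domain, let $\chi, \eta$ be two points in the dual space, and let $w = P_z(\chi)$ and $r_+ = P_z(\eta)$. Then,

$$\|w - r_+\| \leqslant \|\chi - \eta\|_*. \tag{34}$$

Moreover, for all $u$ in the feasible set, one has

$$D(u, r_+) - D(u, z) \leqslant \langle \eta, u - w \rangle + \frac{1}{2} \|\chi - \eta\|_*^2 - \frac{1}{2} \|w - z\|^2. \tag{35}$$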
Lemma 3.

Let be a sequence of elements of . Define the sequence in as follows:

$$y_\tau = P_{y_{\tau-1}}(\xi_\tau).$$

Then is a measurable function of and such that:

 (36)

The following lemma provides an upper bound on the functional Nash error. The same upper bound holds for the dual VI gap (see §A.2).

Lemma 4.

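We consider a convex $n$-player game with player losses $\ell_i$ for $i = 1, \dots, n$. Let $(z_\tau)_{0 \leqslant \tau \leqslant t}$ be a sequence of points in $Z$ and $(\gamma_\tau)_{0 \leqslant \tau \leqslant t}$ the associated positive stepsizes. We define the average iterate $\hat z_t = \big(\sum_{\tau=0}^{t} \gamma_\tau\big)^{-1} \sum_{\tau=0}^{t} \gamma_\tau z_\tau$. The functional Nash error evaluated at $\hat z_t$ is upper bounded by

$$\mathrm{Err}_N(\hat z_t) \triangleq \sup_{u \in Z} \sum_{i=1}^{n} \Big[\ell_i(\hat z_t) - \ell_i(u^i, \hat z_t^{-i})\Big] \leqslant \sup_{u \in Z} \left(\sum_{\tau=0}^{t} \gamma_\tau\right)^{-1} \sum_{\tau=0}^{t} \langle \gamma_\tau F(z_\tau), z_\tau - u \rangle. \tag{37}$$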
Proof.

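By using the convexity of $\ell_i$ in its own parameter and its concavity in the other players' parameters, and applying Jensen's inequality, we obtain:

\begin{align}
\sum_{i=1}^{n} \Big[\ell_i(\hat z_t) - \ell_i(u^i, \hat z_t^{-i})\Big]
&= \sum_{i=1}^{n} \left[\ell_i\left(\frac{\sum_{\tau=0}^{t} \gamma_\tau z_\tau}{\sum_{\tau=0}^{t} \gamma_\tau}\right) - \ell_i\left(u^i, \frac{\sum_{\tau=0}^{t} \gamma_\tau z_\tau^{-i}}{\sum_{\tau=0}^{t} \gamma_\tau}\right)\right] \tag{38} \\
&\leqslant \left(\sum_{\tau=0}^{t} \gamma_\tau\right)^{-1} \sum_{\tau=0}^{t} \gamma_\tau \sum_{i=1}^{n} \Big[\ell_i(z_\tau) - \ell_i(u^i, z_\tau^{-i})\Big]. \tag{39}
\end{align}

As a consequence of the convexity of $\ell_i$ with respect to its $i$-th parameter, we have $\ell_i(z_\tau) - \ell_i(u^i, z_\tau^{-i}) \leqslant \langle h_i(z_\tau), z_\tau^i - u^i \rangle$, where $h_i(z_\tau)$ is a subgradient of $\ell_i$ with respect to its $i$-th parameter at $z_\tau$. Remark that $F(z_\tau) = (h_1(z_\tau), \dots, h_n(z_\tau))$. By plugging this inequality into (39), we obtain

$$\begin{aligned}
\sum_{i=1}^{n} \Big[\ell_i(\hat z_t) - \ell_i(u^i, \hat z_t^{-i})\Big]
&\leqslant \left(\sum_{\tau=0}^{t} \gamma_\tau\right)^{-1} \sum_{\tau=0}^{t} \sum_{i=1}^{n} \langle \gamma_\tau h_i(z_\tau), z_\tau^i - u^i \rangle \\
&= \left(\sum_{\tau=0}^{t} \gamma_\tau\right)^{-1} \sum_{\tau=0}^{t} \langle \gamma_\tau F(z_\tau), z_\tau - u \rangle. \qed
\end{aligned}$$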
Lemma 5.

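Let $(\gamma_\tau)_{\tau \geqslant 0}$ be a sequence of positive reals and $A, B > 0$. For any $t$, we define the function $f_t$ to be

$$f_t(\alpha) \triangleq \frac{A}{\sum_{\tau=0}^{t} \alpha \gamma_\tau} + \frac{B \sum_{\tau=0}^{t} (\alpha \gamma_\tau)^2}{\sum_{\tau=0}^{t} \alpha \gamma_\tau}. \tag{41}$$

Then, $f_t$ attains its minimum over $\alpha > 0$ when both terms are equal. Let us call $\alpha_*$ the point at which the minimum is reached. The value of $f_t$ evaluated at $\alpha_*$ is

$$f_t(\alpha_*) = f_t\left(\sqrt{\frac{A}{B \sum_{\tau=0}^{t} \gamma_\tau^2}}\right) = \frac{2 \sqrt{A B \sum_{\tau=0}^{t} \gamma_\tau^2}}{\sum_{\tau=0}^{t} \gamma_\tau}. \tag{42}$$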
Proof.

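It is sufficient to derive the first-order optimality condition of $f_t$:

$$-\frac{1}{\alpha_*^2} \frac{A}{\sum_{\tau=0}^{t} \gamma_\tau} + \frac{B \sum_{\tau=0}^{t} \gamma_\tau^2}{\sum_{\tau=0}^{t} \gamma_\tau} = 0, \tag{43}$$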

and the result follows. ∎
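The closed form (42) can be sanity-checked numerically. The following sketch evaluates $f_t$ from (41) on a grid of values of $\alpha$ and compares the smallest value found with the closed-form minimum; the stepsizes and the constants $A$, $B$ are arbitrary illustrative choices, not values from the paper.

```python
import numpy as np

def f_t(alpha, gammas, A, B):
    """f_t(alpha) from (41) for stepsizes gammas and constants A, B > 0."""
    s1 = np.sum(alpha * gammas)
    s2 = np.sum((alpha * gammas) ** 2)
    return A / s1 + B * s2 / s1

def alpha_star(gammas, A, B):
    """Minimiser of f_t given by the first-order condition (43)."""
    return np.sqrt(A / (B * np.sum(gammas ** 2)))

gammas = np.array([1.0, 0.5, 0.25, 0.125])  # hypothetical stepsizes
A, B = 3.0, 2.0                             # hypothetical constants

a_star = alpha_star(gammas, A, B)
closed_form = 2.0 * np.sqrt(A * B * np.sum(gammas ** 2)) / np.sum(gammas)

# Brute-force comparison on a grid of alpha values.
grid = np.linspace(0.05, 5.0, 1000)
grid_min = min(f_t(a, gammas, A, B) for a in grid)

print(f_t(a_star, gammas, A, B), closed_form, grid_min)
```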

Lemma 6.

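Let $(X_i, \|\cdot\|_{X_i})_{1 \leqslant i \leqslant n}$ be Banach spaces where for each $i$, $\|\cdot\|_{X_i}$ is the norm associated to $X_i$. The Cartesian product is $X = X_1 \times \dots \times X_n$ and has a norm $\|\cdot\|_X$ defined for $y = (y_1, \dots, y_n) \in X$ as

$$\|y\|_X \triangleq \sqrt{\sum_{i=1}^{n} \|y_i\|_{X_i}^2}. \tag{44}$$

It is known that $(X, \|\cdot\|_X)$ is a Banach space. Moreover, we define the dual spaces $(X_i^*, \|\cdot\|_{X_i^*})_{1 \leqslant i \leqslant n}$. The dual space of $X$ is $X^* = X_1^* \times \dots \times X_n^*$ and has a norm $\|\cdot\|_{X^*}$. Then, for any $a = (a_1, \dots, a_n) \in X^*$, the following equality holds

$$\|a\|_{X^*}^2 = \sum_{i=1}^{n} \|a_i\|_{X_i^*}^2. \tag{45}$$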
Proof.

We first prove that the LHS is smaller than the RHS. By definition of the dual norm, we have

 (46)

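where we used the Cauchy-Schwarz inequality. By applying this inequality again in (46), we obtain

$$\|a\|_{X^*}^2 \leqslant \sup_{y \in X} \frac{\left(\sum_{i=1}^{n} \|a_i\|_{X_i^*}^2\right) \left(\sum_{i=1}^{n} \|y_i\|_{X_i}^2\right)}{\|y\|_X^2} = \sum_{i=1}^{n} \|a_i\|_{X_i^*}^2, \tag{47}$$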

which proves the result. To prove the other inequality we define .

 ∥a∥2X∗ =supy∈X|ay|2∥y∥2X ⩾supy∈Z1×⋯×Zn|ay|2∥y∥2X =(∑