1 Introduction
In multi-agent online learning (Cesa-Bianchi and Lugosi, 2006b; Shoham and Leyton-Brown, 2008; Busoniu et al., 2010), a set of agents repeatedly make decisions and accumulate rewards over time, where each agent’s action impacts not only its own reward, but also that of the others. However, the mechanism of this interaction – the underlying game that specifies how an agent’s reward depends on the joint action of all – is unknown to the agents, who may not even be aware that such a game exists. As such, from its own perspective, each agent is simply engaged in an online decision-making process in which the environment consists of all the other agents, who are simultaneously making sequential decisions of consequence to everyone.
In the past two decades, the above problem has actively engaged researchers from two fields: machine learning (and online learning in particular), which aims to develop single-agent online learning algorithms that are no-regret in an arbitrarily time-varying and/or adversarial environment (Blum, 1998; Shalev-Shwartz, 2007; Shalev-Shwartz et al., 2012; Arora et al., 2012; Hazan, 2016); and game theory, which aims to develop (ideally distributed) algorithms (see Fudenberg and Levine (1998b) and references therein) that efficiently compute a Nash equilibrium (a joint outcome where no one can do better by deviating unilaterally) for games with special structure^{1}^{1}1Computing a Nash equilibrium is in general computationally intractable; indeed, this problem is PPAD-complete (Daskalakis et al., 2009).. Although these two research threads initially developed separately, they have subsequently merged and formed the core of multi-agent/game-theoretical online learning, whose main research agenda can be phrased as follows: will joint no-regret learning lead to a Nash equilibrium, thereby reaping both the transient benefits (conferred by low finite-time regret) and the long-run benefits (conferred by Nash equilibria)? More specifically, through the online learning lens, the agent’s reward function at each round – viewed as a function solely of its own action – is chosen by the environment, and the agent needs to select an action before this function – or any other feedback associated with it – is revealed. In this context, no-regret algorithms ensure that the difference between the cumulative performance of the best fixed action and that of the learning algorithm, a widely adopted metric known as regret, grows sublinearly in the horizon $T$. This problem has been extensively studied; in particular, when gradient feedback is available – i.e., the gradient can be observed after the action is selected – the minimax optimal regret is $\Theta(\sqrt{T})$ for convex and $\Theta(\log T)$ for strongly convex loss functions. Further, several algorithms have been developed that achieve these optimal regret bounds, including follow-the-regularized-leader (Kalai and Vempala, 2005), online gradient descent (Zinkevich, 2003b), multiplicative/exponential weights (Arora et al., 2012) and online mirror descent (Shalev-Shwartz and Singer, 2007).
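To make the regret metric concrete, here is a minimal self-contained sketch (our own toy example, not an algorithm from this paper): online gradient descent with step-size $1/t$ on a stream of strongly convex quadratic losses over $[-1, 1]$, whose regret against the best fixed action grows only logarithmically in $T$.

```python
import numpy as np

# Online gradient descent on strongly convex quadratic losses
# f_t(x) = 0.5*(x - c_t)^2 over X = [-1, 1]. With step-size 1/t (matched to
# the strong-convexity parameter), the regret grows as O(log T).
rng = np.random.default_rng(0)
T = 2000
centers = rng.uniform(-1.0, 1.0, size=T)   # the (oblivious) adversary's choices

x = 0.0
losses = []
for t in range(1, T + 1):
    c = centers[t - 1]
    losses.append(0.5 * (x - c) ** 2)
    grad = x - c                           # gradient feedback at the played point
    x = float(np.clip(x - grad / t, -1.0, 1.0))

# Regret against the best fixed action in hindsight (the mean of the centers).
best = centers.mean()
best_loss = 0.5 * ((best - centers) ** 2).sum()
regret = sum(losses) - best_loss
print(regret, regret / T)
```

With this step-size schedule the iterate is the running mean of past centers, so the average regret per round vanishes as $T$ grows.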
As these algorithms provide optimal regret bounds and hence naturally raise high expectations in terms of performance guarantees, a recent line of work has investigated how the joint action evolves when all agents apply a no-regret learning algorithm, and in particular, whether the joint action converges in last iterate to a Nash equilibrium (if one exists).
These questions turn out to be difficult and had remained open until a few years ago, mainly because the traditional Nash-seeking algorithms in the economics literature are mostly either not no-regret, or only exhibit convergence in time-average (ergodic convergence), or both. Despite the challenging landscape, in the past five years, affirmative answers have emerged from a fruitful line of work and the new analysis tools developed therein: first on qualitative last-iterate convergence (Krichene et al., 2015; Balandat et al., 2016; Zhou et al., 2017, 2018; Mertikopoulos et al., 2019; Mertikopoulos and Zhou, 2019; Golowich et al., 2020b, a) in different classes of continuous games (such as variationally stable games), and then on quantitative last-iterate convergence rates in more specially structured games such as cocoercive games or strongly monotone games (Lin et al., 2020; Zhou et al., 2021). In particular, very recently, Zhou et al. (2021) showed that if each agent applies a version of online gradient descent, then the joint action converges in last iterate to the unique Nash equilibrium of a strongly monotone game at an optimal rate of $\Theta(1/T)$.
Despite this remarkably pioneering line of work, which has elegantly bridged no-regret learning with convergence to Nash equilibria in continuous games and thus renewed the excitement of game-theoretical learning, a significant impediment still limits its practical impact. Specifically, in the multi-agent learning setting, an agent is rarely able to observe gradient feedback. Instead, in most cases, only bandit feedback is available: each agent observes only its own reward after choosing an action each round (rather than the gradient at the chosen action). This consideration of practically feasible algorithms then brings us to a more challenging and less explored desideratum in multi-agent learning: if each agent applies a no-regret online bandit convex optimization algorithm, would the joint action still converge to a Nash equilibrium? At what rate, and in what class of games?
Related work.
To appreciate the difficulty and the broad scope of this research agenda, we start by describing the existing related literature. First of all, we note that single-agent bandit convex optimization algorithms – and their theoretical regret characterizations – are not as well developed as their gradient-feedback counterparts. More specifically, Flaxman et al. (2005) and Kleinberg (2004) provided the first bandit convex optimization algorithm – known as FKM – which achieved a regret bound of $O(T^{3/4})$ for convex and Lipschitz loss functions. However, it was unclear whether this bound is optimal. Subsequently, Saha and Tewari (2011) developed a barrier-based online convex bandit optimization algorithm and established an $O(T^{2/3})$ regret bound for convex and smooth cost functions, a result that has further been improved to $O(T^{5/8})$ (Dekel et al., 2015) via a variant of the algorithm and a new analysis. More recently, progress has been made on developing algorithms that achieve minimax-optimal regret. In particular, Bubeck et al. (2015) and Bubeck and Eldan (2016) provided nonconstructive arguments showing that the minimax regret bound of $\tilde{O}(\sqrt{T})$ is achievable in one and in high dimensions, respectively, without providing any algorithm. Later, Bubeck et al. (2017) developed a kernel-method-based bandit convex optimization algorithm which attains a $\tilde{O}(\sqrt{T})$ bound. Independently, Hazan and Li (2016) considered an ellipsoid method for bandit convex optimization that also achieves $\tilde{O}(\sqrt{T})$ regret. When the cost function is strongly convex and smooth, Agarwal et al. (2010) showed that the FKM algorithm achieves an improved regret bound of $O(T^{2/3})$. Later, Hazan and Levy (2014) established that another variant of the barrier-based online bandit algorithm given in Saha and Tewari (2011) achieves the minimax-optimal regret of $\tilde{O}(\sqrt{T})$. For an overview of the relevant theory and applications, we refer to the recent surveys (Bubeck and Cesa-Bianchi, 2012; Lattimore and Szepesvári, 2020).
However, much remains unknown about the convergence of these no-regret bandit convex optimization algorithms to Nash equilibria. Bervoets et al. (2020) developed a specialized distributed payoff-based algorithm that asymptotically converges to the unique Nash equilibrium in the class of strictly monotone games. However, the algorithm is not known to be no-regret, and no rate is given. Héliou et al. (2020) considered a variant of the FKM algorithm and showed that it is no-regret even under delays; further, provided the delays are not too large, joint FKM learning converges to the unique Nash equilibrium in strictly monotone games (again without rates). At this writing, the most relevant and state-of-the-art result on this topic is presented in Bravo et al. (2018), which showed that if each agent applies the FKM algorithm in a strongly monotone game, then last-iterate convergence to the unique Nash equilibrium is guaranteed at a rate of $O(1/T^{1/3})$. Per Bravo et al. (2018), the analysis itself is unlikely to be improved to yield any tighter rate. However, a sizable gap still exists between this bound and the best known lower bound given in Shamir (2013), which established that in optimization problems with strongly convex and smooth objectives (a one-player Nash-seeking problem), no algorithm that uses only bandit feedback (i.e., a zeroth-order oracle) can compute the optimal solution at a rate faster than $\Omega(1/\sqrt{T})$. Consequently, it remains unknown whether other algorithms can improve the $O(1/T^{1/3})$ rate, as well as what the true optimal convergence rate is. In particular, since the lower bound is established for the special case of optimization problems, it is plausible that in the multi-agent setting – where a natural potential function such as the objective in optimization does not exist – the problem is inherently more difficult, and hence the convergence intrinsically slower.
Further, note that the lower bound in Shamir (2013) is established against the class of all bandit optimization algorithms, not necessarily no-regret ones; a priori, that could mean a larger lower bound when the algorithms are further restricted to be no-regret. As such, it has been a challenging open problem to close the gap.
Our contributions.
We tackle the problem of no-regret learning in strongly monotone games with bandit feedback and settle the above open problem by establishing that a convergence rate of $\tilde{O}(1/\sqrt{T})$ – and hence minimax optimality (up to log factors) – is achievable. More specifically, we start by studying (in Section 2) single-agent learning with bandit feedback – in particular bandit convex optimization – and develop an eager variant of the barrier-based family of online convex bandit optimization algorithms (Saha and Tewari, 2011; Hazan and Levy, 2014; Dekel et al., 2015). We establish that the algorithm achieves the minimax-optimal regret bound of $\tilde{O}(\sqrt{T})$. Next, extending to multi-agent learning settings, we show that if all agents employ this optimal no-regret learning algorithm (see Algorithm 2 in Section 4), then the joint action converges in last iterate to the unique Nash equilibrium at a rate of $\tilde{O}(1/\sqrt{T})$. As such, we provide the first online convex bandit learning algorithm (with continuous actions) that is doubly optimal (up to log factors): it achieves the optimal regret in single-agent settings under strongly concave reward functions and the optimal convergence rate to Nash equilibria in multi-agent settings under strongly monotone games. Finally, we conduct extensive experiments on Cournot competition, Kelly auctions and distributed regularized logistic regression in Section 5. The numerical results demonstrate that our algorithm outperforms the state-of-the-art multi-agent FKM algorithm.
Organization.
The remainder of the paper is organized as follows. In Section 3, we provide the basic setup for multi-agent bandit learning in strongly monotone games and review background material on regret minimization and Nash equilibria. In Section 2, we develop a single-agent bandit learning algorithm with eager projection and prove a near-optimal regret bound when the reward function is smooth and strongly concave. In Section 4, we extend the algorithm to the multi-agent setting and prove a near-optimal last-iterate convergence rate for smooth and strongly monotone games. We also consider the multi-agent setting in which the bandit feedback is noisy but bounded and prove the same convergence result. We conduct extensive experiments on Cournot competition, Kelly auctions and distributed regularized logistic regression in Section 5; the numerical results demonstrate that our algorithm outperforms the state-of-the-art multi-agent FKM algorithm. Finally, we conclude the paper in Section 6. For ease of presentation, we defer all proof details to the appendix.
2 Single-Agent Learning with Bandit Feedback
In this section, we provide a simple single-agent bandit learning algorithm that players could employ to increase their individual rewards in an online manner, and prove that it achieves a near-optimal regret bound for bandit concave optimization (BCO)^{2}^{2}2This setting is the same as bandit convex optimization in the literature; we consider maximization and concave reward functions instead of minimization and convex loss functions..
In BCO, an adversary first chooses a sequence of concave reward functions $u_1, \ldots, u_T$, each defined over a closed convex subset $\mathcal{X}$ of $\mathbb{R}^d$. At each round $t$, a (randomized) decision maker has to choose a point $x_t \in \mathcal{X}$ and receives the reward $u_t(x_t)$ after committing to her decision. Her expected reward (where the expectation is taken with respect to her random choices) is $\mathbb{E}[u_t(x_t)]$, and the corresponding regret is defined by $\mathrm{Reg}(T) = \max_{x \in \mathcal{X}} \sum_{t=1}^{T} u_t(x) - \mathbb{E}[\sum_{t=1}^{T} u_t(x_t)]$. In the bandit setting, the feedback is limited to the reward at the point that she has chosen, i.e., $u_t(x_t)$.
Algorithm 1 is inspired by the existing online learning algorithms developed in Saha and Tewari (2011) and Hazan and Levy (2014); the main difference lies in eager projection (ours) vs. lazy projection for updating the iterates. This modification is crucial to the last-iterate convergence analysis when we extend Algorithm 1 to the multi-agent setting. In what follows, we present the individual algorithm components.
2.1 Self-Concordant Barrier
Existing bandit learning algorithms can be interpreted under the framework of mirror descent (Cesa-Bianchi and Lugosi, 2006a), and the common choice of regularizer is a self-concordant barrier, which is a key ingredient in regret-optimal bandit algorithms when the loss function is linear (Abernethy et al., 2008) or smooth and strongly convex (Hazan and Levy, 2014). Here we provide an overview and refer to Nesterov and Nemirovskii (1994) for details.
Definition 2.1
A function $R : \operatorname{int}(\mathcal{K}) \rightarrow \mathbb{R}$ is called a $\vartheta$-self-concordant barrier for a closed convex set $\mathcal{K} \subseteq \mathbb{R}^d$ if (i) $R$ is three times continuously differentiable, (ii) $R(x) \rightarrow +\infty$ if $x \rightarrow \partial\mathcal{K}$, and (iii) for $x \in \operatorname{int}(\mathcal{K})$ and $h \in \mathbb{R}^d$, we have $|D^3 R(x)[h, h, h]| \leq 2\,(D^2 R(x)[h, h])^{3/2}$ and $|D R(x)[h]| \leq \sqrt{\vartheta}\,(D^2 R(x)[h, h])^{1/2}$, where $D^k R(x)[h, \ldots, h]$ denotes the $k$-th directional derivative of $R$ at $x$ along $h$.
Similar to the existing online bandit learning algorithms (Abernethy et al., 2008; Saha and Tewari, 2011; Hazan and Levy, 2014; Dekel et al., 2015), our algorithm requires a self-concordant barrier over the constraint set; see Algorithm 1. However, this does not weaken its generality. Indeed, it is known that any convex and compact set in $\mathbb{R}^d$ admits a nondegenerate self-concordant barrier (Nesterov and Nemirovskii, 1994) with $\vartheta = O(d)$, and such a barrier can be efficiently represented and evaluated for the various constraint sets arising in game-theoretical settings. For example, $-\log(b - a^{\top}x)$ is a self-concordant barrier for the linear constraint $a^{\top}x \leq b$, and summing such terms yields a self-concordant barrier for the $d$-dimensional simplex or a cube. For the $d$-dimensional Euclidean ball, $-\log(1 - \|x\|^2)$ is a self-concordant barrier with $\vartheta = 1$, which is even independent of the dimension.
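The ball barrier above is easy to probe numerically. The following sketch (illustration only; the sampling scheme and tolerances are our own choices) verifies that the Dikin ellipsoid induced by the Hessian of $-\log(1 - \|x\|^2)$ stays inside the unit ball, a containment property used repeatedly in barrier-based analyses:

```python
import numpy as np

# Numerical check: R(x) = -log(1 - ||x||^2) is a 1-self-concordant barrier
# for the unit ball, and the Dikin ellipsoid {y : ||y - x||_x <= 1} at any
# interior point x stays inside the ball.
def barrier_hessian(x):
    # Hessian of R(x) = -log(1 - ||x||^2) is 2/s * I + 4/s^2 * x x^T, s = 1 - ||x||^2.
    s = 1.0 - x @ x
    return 2.0 / s * np.eye(len(x)) + 4.0 / s ** 2 * np.outer(x, x)

rng = np.random.default_rng(1)
inside = True
for _ in range(200):
    x = rng.uniform(-0.5, 0.5, size=3)          # interior points of the ball
    H = barrier_hessian(x)
    L = np.linalg.cholesky(np.linalg.inv(H))    # maps the unit sphere onto
    for _ in range(20):                         # the Dikin ellipsoid boundary
        u = rng.normal(size=3)
        u /= np.linalg.norm(u)
        y = x + L @ u                           # a boundary point of the ellipsoid
        if y @ y > 1.0 + 1e-9:
            inside = False
print(inside)
```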
The above definition is only given for the sake of completeness, and our analysis relies on some useful facts about self-concordant barriers. In particular, the Hessian of a self-concordant barrier $R$ induces a local norm at each $x \in \operatorname{int}(\mathcal{K})$; that is, $\|h\|_x = \sqrt{h^{\top}\nabla^2 R(x) h}$ and $\|h\|_{x,*} = \sqrt{h^{\top}(\nabla^2 R(x))^{-1} h}$ for all $h \in \mathbb{R}^d$. The nondegeneracy of $R$ implies that $\|\cdot\|_x$ and $\|\cdot\|_{x,*}$ are both norms. To this end, we define the Dikin ellipsoid at any $x \in \operatorname{int}(\mathcal{K})$ as $\mathcal{W}_1(x) = \{y \in \mathbb{R}^d : \|y - x\|_x \leq 1\}$
and present some nontrivial facts (see Nesterov and Nemirovskii (1994, Theorem 2.1.1) for a proof):
Lemma 2.2
Let $\mathcal{W}_1(x)$ be the Dikin ellipsoid at any $x \in \operatorname{int}(\mathcal{K})$. The following statements hold true:

(i) $\mathcal{W}_1(x) \subseteq \mathcal{K}$ for every $x \in \operatorname{int}(\mathcal{K})$;

(ii) For $y \in \mathcal{W}_1(x)$, we have $(1 - \|y - x\|_x)^2\, \nabla^2 R(x) \preceq \nabla^2 R(y) \preceq (1 - \|y - x\|_x)^{-2}\, \nabla^2 R(x)$.
Remark 2.3
Second, we define the Minkowsky function (Nesterov and Nemirovskii, 1994, Page 34) of $\mathcal{K}$, parametrized by a point $x_1$, as $\pi_{x_1}(x) = \inf\{t > 0 : x_1 + t^{-1}(x - x_1) \in \mathcal{K}\}$, together with a scaled version of it obtained by shrinking $\mathcal{K}$ toward $x_1$, where $x_1$ is a “center” of $\mathcal{K}$ satisfying $\nabla R(x_1) = 0$, and $R$ is a self-concordant barrier function for $\mathcal{K}$. The following lemma shows that $R$ is quite flat at points that are far from the boundary (see Nesterov and Nemirovskii (1994, Propositions 2.3.2 and 2.3.3) for a proof):
Lemma 2.4
Suppose that $\mathcal{K}$ is a closed convex set, $R$ is a $\vartheta$-self-concordant barrier function for $\mathcal{K}$, and $x_1$ is a center of $\mathcal{K}$. Then, for any $x \in \operatorname{int}(\mathcal{K})$, we have $R(x) - R(x_1) \leq \vartheta \log\big(\tfrac{1}{1 - \pi_{x_1}(x)}\big)$; in particular, $R$ is bounded by $\vartheta \log(1/\epsilon)$ over the set of points with $\pi_{x_1}(x) \leq 1 - \epsilon$.
Finally, we define the Newton decrement for a self-concordant function $F$ (not necessarily a barrier function) as $\lambda(x, F) = \|\nabla F(x)\|_{x,*} = \sqrt{\nabla F(x)^{\top}(\nabla^2 F(x))^{-1}\nabla F(x)}$ (recall that $\|\cdot\|_x$ and $\|\cdot\|_{x,*}$ are a local norm and its dual norm), which can be used to measure roughly how far a point is from a global optimum of $F$. Formally, we summarize the results in the following lemma (see Nemirovski and Todd (2008) for a proof):
Lemma 2.5
For any self-concordant function $F$, whenever $\lambda(x, F) \leq \tfrac{1}{2}$, we have $\|x - x^{\star}\|_x \leq 2\lambda(x, F)$, where $x^{\star}$ is the minimizer of $F$ and the local norm is defined with respect to $F$, i.e., $\|h\|_x = \sqrt{h^{\top}\nabla^2 F(x) h}$.
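The lemma admits a quick one-dimensional sanity check. For the self-concordant function $F(x) = x - \log x$ on $(0, \infty)$ (a linear term plus a log-barrier; our own example, not from the paper), the minimizer is $x^{\star} = 1$ and both the Newton decrement and the local distance are available in closed form:

```python
import numpy as np

# For F(x) = x - log(x): F'(x) = 1 - 1/x, F''(x) = 1/x^2, minimizer x* = 1.
# Whenever the Newton decrement lambda(x) = |F'(x)| / sqrt(F''(x)) is at most
# 1/2, the local distance sqrt(F''(x)) * |x - x*| should be at most 2*lambda(x).
Fp = lambda x: 1.0 - 1.0 / x           # F'(x)
Fpp = lambda x: 1.0 / x ** 2           # F''(x)
x_star = 1.0

ok = True
for x in np.linspace(0.55, 3.0, 200):
    lam = abs(Fp(x)) / np.sqrt(Fpp(x))
    if lam <= 0.5:
        local_dist = np.sqrt(Fpp(x)) * abs(x - x_star)
        if local_dist > 2.0 * lam + 1e-12:
            ok = False
print(ok)
```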
2.2 Single-Shot Ellipsoidal Estimator
It was Flaxman et al. (2005) who introduced a single-shot spherical estimator into the BCO literature and cleverly combined it with online gradient descent (Zinkevich, 2003a). In particular, let $f$ be a function, $\delta > 0$, and $z$ be drawn uniformly from the unit sphere $\mathbb{S}^{d-1}$; a single-shot spherical estimator is defined by
(1) $g = \frac{d}{\delta}\, f(x + \delta z)\, z$.
This estimator is an unbiased prediction of the gradient of a smoothed version of $f$; that is, $\mathbb{E}[g] = \nabla \hat{f}_{\delta}(x)$, where $\hat{f}_{\delta}(x) = \mathbb{E}_{w \sim \mathrm{Unif}(\mathbb{B}^d)}[f(x + \delta w)]$. As $\delta \rightarrow 0$, the bias caused by the difference between $\hat{f}_{\delta}$ and $f$ vanishes while the variability of $g$ explodes. This manifestation of the bias-variance dilemma plays a key role in designing bandit learning algorithms, and a single-shot spherical estimator is known to be suboptimal in terms of the bias-variance tradeoff and hence regret minimization. This gap is closed by using a more sophisticated single-shot ellipsoidal estimator based on the self-concordant barrier function of the constraint set (Saha and Tewari, 2011; Hazan and Levy, 2014). Compared to the spherical estimator in Eq. (1), which is based on uniform sampling, Saha and Tewari (2011) and Hazan and Levy (2014) proposed to sample nonuniformly over all directions and create an unbiased gradient estimate of a scaled smoothed version. In particular, a single-shot ellipsoidal estimator with respect to an invertible matrix $A$ is defined by
(2) $g = d\, f(x + A z)\, A^{-1} z$.
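Both estimators are easy to validate by Monte Carlo. In the sketch below (our own toy function; the matrix $A$ is an arbitrary invertible choice), the reward $f(x) = -\|x\|^2$ is quadratic, so smoothing shifts it only by a constant and both estimators should average to the true gradient $-2x$:

```python
import numpy as np

# Monte-Carlo sanity check: for f(x) = -||x||^2, the spherical estimator
# (d/delta) f(x + delta*z) z and the ellipsoidal estimator d f(x + A z) A^{-1} z
# are unbiased for the gradient of a smoothed f, which here equals -2x.
rng = np.random.default_rng(2)
d, delta, n = 3, 0.1, 400_000
x = np.array([0.3, -0.2, 0.5])
grad_true = -2.0 * x

Z = rng.normal(size=(n, d))
Z /= np.linalg.norm(Z, axis=1, keepdims=True)     # uniform on the unit sphere

f_sph = -np.sum((x + delta * Z) ** 2, axis=1)     # f at the queries x + delta*z
sph = (d / delta) * (f_sph[:, None] * Z).mean(axis=0)

A = np.diag([0.05, 0.1, 0.2])                     # an invertible sampling matrix
A_inv = np.linalg.inv(A)
f_ell = -np.sum((x + Z @ A.T) ** 2, axis=1)       # f at the queries x + A z
ell = d * (f_ell[:, None] * (Z @ A_inv.T)).mean(axis=0)
print(sph, ell, grad_true)
```

Both Monte-Carlo averages should land near $(-0.6, 0.4, -1.0)$ up to sampling noise.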
The following lemma is a simple modification of (Hazan and Levy, 2014, Corollary 6 and Lemma 7).
Lemma 2.6
Suppose that $f$ is a continuous and concave function, $A$ is an invertible matrix, and $z$ is drawn uniformly from the unit sphere $\mathbb{S}^{d-1}$; we define the smoothed version of $f$ with respect to $A$ by $\hat{f}(x) = \mathbb{E}_{w \sim \mathrm{Unif}(\mathbb{B}^d)}[f(x + A w)]$. Then, the following holds:

(i) $\mathbb{E}[g] = \nabla \hat{f}(x)$ for the estimator $g$ in Eq. (2);

(ii) If $f$ is strongly concave, then so is $\hat{f}$.
2.3 Eager Mirror Descent
The idea of eager mirror descent^{3}^{3}3In utility maximization, we shall use mirror ascent instead of mirror descent because players seek to maximize their rewards (as opposed to minimizing their losses). Nonetheless, we keep the term “descent” throughout because, despite the role reversal, it is the standard name associated with the method. (Nemirovskij and Yudin, 1983) is to generate a new feasible point by taking a “mirror step” from a starting point along an “approximate gradient” direction. We let $h$ be a continuous and strictly convex distance-generating (or regularizer) function on $\mathcal{X}$, and assume that $h$ is continuously differentiable on the interior of $\mathcal{X}$. Then, we obtain a Bregman divergence on $\mathcal{X}$ via the relation
(3) $D(p, x) = h(p) - h(x) - \langle \nabla h(x), p - x \rangle$
for all $p \in \mathcal{X}$ and all $x \in \operatorname{int}(\mathcal{X})$, which can fail to be symmetric and/or to satisfy the triangle inequality. Nevertheless, $D(p, x) \geq 0$ with equality if and only if $x = p$, so the asymptotic convergence of a sequence $x_t$ to $p$ can be certified by showing that $D(p, x_t) \rightarrow 0$. For technical reasons, it will be convenient to assume the converse as well, i.e., $D(p, x_t) \rightarrow 0$ whenever $x_t \rightarrow p$. This condition is known in the literature as “Bregman reciprocity” (Chen and Teboulle, 1993). We continue with some basic relations connecting the Bregman divergence relative to a target point before and after a prox-map. The key ingredient is the “three-point identity” (Chen and Teboulle, 1993), which generalizes the law of cosines and is widely used in the literature (Beck and Teboulle, 2003; Nemirovski, 2004).
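The three-point identity referenced here can be checked numerically. The sketch below (an illustration with the negative-entropy regularizer on the simplex, our own choice) verifies $D(p, x') = D(p, x) + D(x, x') + \langle \nabla h(x) - \nabla h(x'), p - x \rangle$ on random points:

```python
import numpy as np

# Numerical check of the three-point identity with the negative-entropy
# regularizer h(x) = sum_i x_i log x_i on the 4-dimensional simplex.
def h(x):
    return float(np.sum(x * np.log(x)))

def grad_h(x):
    return np.log(x) + 1.0

def breg(p, x):
    # Bregman divergence D(p, x) = h(p) - h(x) - <grad h(x), p - x>.
    return h(p) - h(x) - float(grad_h(x) @ (p - x))

rng = np.random.default_rng(3)
ok = True
for _ in range(100):
    p, x, xp = (rng.dirichlet(5.0 * np.ones(4)) for _ in range(3))
    lhs = breg(p, xp)
    rhs = breg(p, x) + breg(x, xp) + float((grad_h(x) - grad_h(xp)) @ (p - x))
    if abs(lhs - rhs) > 1e-8:
        ok = False
print(ok)
```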
Lemma 2.8
Let $h$ be a regularizer on $\mathcal{X}$. Then, for all $p \in \mathcal{X}$ and all $x, x' \in \operatorname{int}(\mathcal{X})$, we have $D(p, x') = D(p, x) + D(x, x') + \langle \nabla h(x) - \nabla h(x'), p - x \rangle$.
Remark 2.9
In Algorithm 1, we set $h$ to be a self-concordant barrier function for $\mathcal{X}$, and it is easy to check that it satisfies the aforementioned Bregman reciprocity for various constraint sets, e.g., the $d$-dimensional simplex, a cube or the $d$-dimensional Euclidean ball.
The key notion associated with a Bregman divergence is the induced prox-mapping, given by
(4) $P_x(y) = \operatorname*{argmax}_{x' \in \mathcal{X}}\, \{\langle y, x' - x \rangle - D(x', x)\}$
for all $x \in \operatorname{int}(\mathcal{X})$ and all $y$, which reflects the intuition behind eager mirror descent. Indeed, it starts with a point $x$ and steps along the dual (gradient-like) vector $y$ to generate a new feasible point $P_x(y)$. However, the prox-mapping in Eq. (4) considers a general Bregman divergence, which exploits neither the structure of strongly concave payoff functions nor that of the constraint sets. In response to the above issues, we propose to use the prox-mapping given by
(5) $P_x(y) = \operatorname*{argmax}_{x' \in \mathcal{X}}\, \{\langle y, x' - x \rangle - \tfrac{\beta}{2}\|x' - x\|^2 - D_R(x', x)\}$
for all $x \in \operatorname{int}(\mathcal{X})$ and all $y$, where $D_R$ is the Bregman divergence of the self-concordant barrier $R$. We remark that the above prox-mapping explicitly incorporates the problem structure by considering a mixed Bregman divergence: the first term is the Euclidean distance with a coefficient $\beta$ proportional to the strong-concavity parameter, and the second term is the Bregman divergence defined with the self-concordant barrier function for the constraint set $\mathcal{X}$. It is also worth noting that we consider eager projection in Eq. (5), making our prox-mapping different from the one used in Hazan and Levy (2014), which also exploits the structure of strongly concave payoff functions and constraint sets but uses lazy projection.
With all these in mind, the eager self-concordant barrier bandit learning algorithm is given by the recursion $x_{t+1} = P_{x_t}(\eta_t \hat{g}_t)$, in which $\eta_t$ is a step-size and $\hat{g}_t$ is the estimated-gradient feedback. In Algorithm 1, we generate $\hat{g}_t$ by a single-shot ellipsoidal estimator, as mentioned before.
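To convey the flavor of this recursion without the barrier machinery, here is a deliberately simplified one-dimensional sketch (a stand-in for Algorithm 1, not the algorithm itself: the prox-mapping is replaced by plain clipping, and all constants are our own choices):

```python
import numpy as np

# Simplified 1-D single-shot bandit learning with an eager (in-domain) update.
# The reward u(x) = -(x - 0.6)^2 is smooth and strongly concave on X = [0, 1];
# only one reward evaluation is observed per round.
rng = np.random.default_rng(4)
u = lambda y: -(y - 0.6) ** 2
T = 40_000
x, x_star = 0.2, 0.6
for t in range(1, T + 1):
    delta = min(0.05, 0.5 * t ** (-0.25))      # shrinking exploration radius
    z = rng.choice([-1.0, 1.0])                # 1-D "sphere" sample
    g_hat = u(x + delta * z) * z / delta       # single-shot gradient estimate
    x = float(np.clip(x + g_hat / t, 1e-3, 1.0 - 1e-3))  # eager projection
err = abs(x - x_star)
print(x, err)
```

Queries $x + \delta z$ may land slightly outside $[0, 1]$; the quadratic reward extends naturally, so the sketch ignores this boundary detail, which the barrier handles in the real algorithm.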
2.4 Regret Bound
We consider the single-agent setting where the adversary is limited to choosing smooth and strongly concave functions. The following theorem shows that Algorithm 1 achieves a $\tilde{O}(\sqrt{T})$ regret bound, matching (up to log factors) the lower bound in Shamir (2013).
Theorem 2.10
Suppose that the adversary is limited to choosing smooth and strongly concave functions, each of which is Lipschitz continuous with bounded gradients over $\mathcal{X}$. If a horizon $T$ is fixed and the player follows Algorithm 1 with appropriately chosen parameters, we have $\mathbb{E}[\mathrm{Reg}(T)] = O(\sqrt{T}\log T)$.
Remark 2.11
Theorem 2.10 shows that Algorithm 1 is a near-regret-optimal bandit learning algorithm when the adversary is limited to choosing smooth and strongly concave functions. The algorithmic scheme is based on eager mirror descent and thus differs from the existing one presented in Hazan and Levy (2014).
Lemma 2.12
Suppose that the iterates are generated by Algorithm 1 and each function satisfies the assumptions of Theorem 2.10. Then the bound stated above holds, where the step-sizes form a nonincreasing sequence.
3 Multi-Agent Learning with Bandit Feedback
In this section, we consider multi-agent learning with bandit feedback and characterize the behavior of the system when each agent applies the eager self-concordant barrier bandit learning algorithm. We first present basic definitions, then discuss a few important classes of games that are strongly monotone, and finally discuss the multi-agent version of the learning algorithm.
3.1 Basic Definition and Notation
We focus on games played by a finite set of players $\mathcal{N}$. At each iteration of the game, each player $i \in \mathcal{N}$ selects an action $x_i$ from a convex subset $\mathcal{X}_i$ of a finite-dimensional vector space, and her reward is determined by the profile $x = (x_i)_{i \in \mathcal{N}}$ of all players’ actions; subsequently, each player receives her reward, and the process repeats. We denote by $\|\cdot\|$ the Euclidean norm (in the corresponding vector space); other norms can easily be accommodated in our framework (and different $\mathcal{X}_i$’s can in general carry different norms), but we do not pursue this since we neither use nor benefit from more complicated geometries.
Definition 3.1
A smooth concave game is a tuple $\mathcal{G} = (\mathcal{N}, \{\mathcal{X}_i\}_{i \in \mathcal{N}}, \{u_i\}_{i \in \mathcal{N}})$, where $\mathcal{N}$ is the set of players, $\mathcal{X}_i$ is a convex and compact subset of a finite-dimensional vector space representing the action space of player $i$, and $u_i$ is the $i$th player’s payoff function satisfying: (i) $u_i(x)$ is continuous in $x$ and concave in $x_i$ for all fixed $x_{-i}$; (ii) $u_i$ is continuously differentiable in $x$ and the individual payoff gradient $v_i(x) = \nabla_{x_i} u_i(x)$ is Lipschitz continuous.
A commonly used solution concept for noncooperative games is the Nash equilibrium (NE). For the smooth concave games considered in this paper, we are interested in pure-strategy Nash equilibria. Indeed, for finite games, a mixed-strategy NE is a probability distribution over pure strategies; in our setting the action sets are continuous and convex, so each action already lives in a continuum, and pursuing pure-strategy Nash equilibria is sufficient.
Definition 3.2
An action profile $x^{\star}$ is called a (pure-strategy) Nash equilibrium of a game if it is resilient to unilateral deviations; that is, $u_i(x^{\star}) \geq u_i(x_i, x^{\star}_{-i})$ for all $x_i \in \mathcal{X}_i$ and all $i \in \mathcal{N}$.
It is known that every smooth concave game admits at least one Nash equilibrium when all action sets are compact (Debreu, 1952), and Nash equilibria admit a variational characterization. We summarize this result in the following proposition.
Proposition 3.3
In a smooth concave game $\mathcal{G}$, the action profile $x^{\star}$ is a Nash equilibrium if and only if $\langle v_i(x^{\star}), x_i - x^{\star}_i \rangle \leq 0$ for all $x_i \in \mathcal{X}_i$ and all $i \in \mathcal{N}$.
Proposition 3.3 shows that the Nash equilibria of a smooth concave game can be precisely characterized as the solution set of a variational inequality (VI), so the existence results also follow from standard results in the VI literature (Facchinei and Pang, 2007). We omit the proof here and refer to Mertikopoulos and Zhou (2019) for details.
3.2 Strongly Monotone Games
The study of (strongly) monotone games dates to Rosen (1965), with many subsequent developments; see, e.g., Facchinei and Pang (2007). Specifically, Rosen (1965) considered a class of games that satisfy the diagonal strict concavity (DSC) condition and proved that they admit a unique Nash equilibrium. Further work in this vein appeared in Sandholm (2015) and Sorin and Wan (2016), where games that satisfy DSC are referred to as “contractive” and “dissipative”, respectively. These conditions are equivalent to (strict) monotonicity in convex analysis (Bauschke and Combettes, 2011). To avoid confusion, we provide the formal definition of the strongly monotone games considered in this paper.
Definition 3.4
A smooth concave game is called strongly monotone if there exists a constant $\beta > 0$ such that $\sum_{i \in \mathcal{N}} \langle v_i(x) - v_i(x'), x_i - x'_i \rangle \leq -\beta \|x - x'\|^2$ for any $x, x' \in \mathcal{X}$.
The notion of (strong) monotonicity, which will play a crucial role in the subsequent analysis, is not a theoretical artifact but encompasses a very rich class of games. We present three typical examples which satisfy the conditions in Definition 3.4 (see Bravo et al. (2018) and Mertikopoulos and Zhou (2019) for more details).
Example 3.1 (Cournot Competition)
In the Cournot oligopoly model, there is a finite set of firms, each supplying the market with a quantity $x_i$ of some good (or service) up to the firm’s production capacity, given here by a positive scalar $C_i$. The good is then priced as a decreasing function of the total supply to the market, as determined by each firm’s production; for concreteness, we focus on the standard linear model $P(x) = a - b \sum_{j} x_j$, where $a$ and $b$ are positive constants. In this model, the utility of firm $i$ (considered here as a player) is given by
$u_i(x) = x_i P(x) - c_i(x_i)$,
where $c_i$ represents the production cost function of firm $i$ and is assumed to be strongly convex, i.e., the income obtained by producing $x_i$ units of the good in question minus the corresponding production cost. Letting $\mathcal{X}_i = [0, C_i]$ denote the space of possible production values for each firm, the resulting game is strongly monotone.
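A small numerical instance (parameters are our own, chosen for illustration) shows how the unique equilibrium of such a Cournot game can be located by best-response iteration and then certified by checking that no firm gains from a unilateral deviation:

```python
import numpy as np

# Illustrative Cournot instance: price P(x) = a - b*sum(x), quadratic costs
# c_i * x_i^2 (strongly convex), capacities X_i = [0, cap].
a, b = 10.0, 1.0
c = np.array([1.0, 2.0, 3.0])      # cost curvatures, one per firm
cap = 5.0

def utility(xi, i, x):
    rest = x.sum() - x[i]
    return xi * (a - b * (xi + rest)) - c[i] * xi ** 2

def best_response(x, i):
    rest = x.sum() - x[i]
    xi = (a - b * rest) / (2.0 * (b + c[i]))   # unconstrained maximizer
    return float(np.clip(xi, 0.0, cap))

x = np.zeros(3)
for _ in range(200):                            # Jacobi best-response iteration
    x = np.array([best_response(x, i) for i in range(3)])

# Certify the equilibrium: no firm gains by deviating unilaterally.
max_gain = 0.0
for i in range(3):
    for xi in np.linspace(0.0, cap, 501):
        max_gain = max(max_gain, utility(xi, i, x) - utility(x[i], i, x))
print(x, max_gain)
```

Best-response iteration converges here because the best-response map is a contraction for these parameters; strong monotonicity guarantees the equilibrium it finds is unique.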
Example 3.2 (Strongly Concave Potential Games)
A game is called a potential game (Monderer and Shapley, 1996; Sandholm, 2001) if there exists a potential function $\Phi$ such that
$u_i(x_i, x_{-i}) - u_i(x'_i, x_{-i}) = \Phi(x_i, x_{-i}) - \Phi(x'_i, x_{-i})$
for all $x_{-i}$, all $x_i, x'_i \in \mathcal{X}_i$ and all $i \in \mathcal{N}$. A potential game is called a strongly concave potential game if the potential function is strongly concave. Since the gradient of a strongly concave function is strongly monotone (Bauschke and Combettes, 2011), a strongly concave potential game is a strongly monotone game.
Example 3.3 (Kelly Auctions)
Consider a service provider with a number of splittable resources $r \in \mathcal{R}$ (representing, e.g., bandwidth, server time, ad space on a website, etc.). These resources can be leased to a set of bidders (players) who can place monetary bids $b_{ir}$ for the utilization of each resource, up to each player’s total budget $B_i$, i.e., $\sum_{r \in \mathcal{R}} b_{ir} \leq B_i$. A popular and widely used mechanism to allocate resources in this case is the so-called Kelly mechanism (Kelly et al., 1998), whereby resources are allocated proportionally to each player’s bid, i.e., player $i$ gets
$\rho_{ir} = \frac{q_r b_{ir}}{d_r + \sum_{j} b_{jr}}$
units of the $r$th resource (in the above, $q_r$ denotes the available units of said resource and $d_r$ is the “entry barrier” for bidding on it). A simple model for the utility of player $i$ is then given by
$u_i(b) = \sum_{r \in \mathcal{R}} \left[ g_i \rho_{ir} - c(b_{ir}) \right]$,
where $g_i$ denotes the player’s marginal gain from acquiring a unit slice of resources and $c(b_{ir})$ denotes the cost incurred when player $i$ bids $b_{ir}$ monetary units for the utilization of resource $r$ (the function $c$ is assumed to be strongly convex). If we write $\mathcal{X}_i$ for the space of possible bids of player $i$ on the set of resources $\mathcal{R}$, we obtain a strongly monotone game.
There are many other applications that can be cast as strongly monotone games (Orda et al., 1993; Cesa-Bianchi and Lugosi, 2006a; Sandholm, 2015; Sorin and Wan, 2016; Mertikopoulos et al., 2017). Typical examples include strongly-convex-strongly-concave zero-sum games, congestion games (Mertikopoulos and Zhou, 2019), wireless network games (Weeraddana et al., 2012; Tan, 2014; Zhou et al., 2021) and a wide range of online decision-making problems (Cesa-Bianchi and Lugosi, 2006a). From an economic point of view, one appealing feature of strongly monotone games is that last-iterate convergence can be achieved by standard learning algorithms (Zhou et al., 2021), which is more natural than time-average-iterate convergence (Fudenberg and Levine, 1998a; Cesa-Bianchi and Lugosi, 2006a; Lin et al., 2020). The finite-time convergence rate is derived in terms of the distance between the realized action and the unique Nash equilibrium (which exists under convex and compact action sets). In view of all this (and unless explicitly stated otherwise), we will focus throughout on strongly monotone games.
3.3 Multi-Agent Bandit Learning
In multi-agent learning with bandit feedback, at each round $t$, each (possibly randomized^{4}^{4}4Randomization plays an important role in the online game playing literature. For example, the classical Follow The Leader (FTL) algorithm does not attain any nontrivial regret guarantee for linear cost functions (in the worst case its regret can be $\Omega(T)$ if the cost functions are chosen adversarially). However, Hannan (1957) proposed a randomized variant of FTL, called follow-the-perturbed-leader, which attains an optimal regret of $O(\sqrt{T})$ for linear functions over the simplex.) decision maker selects an action $x^t_i \in \mathcal{X}_i$. The reward is realized after all decision makers have chosen their actions. In addition to regret minimization, convergence to Nash equilibria is an important criterion for measuring the performance of learning algorithms. In multi-agent bandit learning, each player’s feedback is limited to the reward at the point that she has chosen, i.e., $u_i(x^t)$. Here, we propose a simple multi-agent bandit learning algorithm (see Algorithm 2) in which each player chooses her action using Algorithm 1. Algorithm 2 is a straightforward extension of Algorithm 1 from the single-agent setting to the multi-agent setting. It differs from Bravo et al. (2018, Algorithm 1) in two aspects: the ellipsoidal SPSA estimator vs. the spherical SPSA estimator, and the self-concordant Bregman divergence vs. a general Bregman divergence. We discuss these two crucial components next.
Single-shot ellipsoidal SPSA estimator.
Recently, Bravo et al. (2018) have extended the spherical estimator in Eq. (1) to multiagent setting by positing instead that players rely on a simultaneous perturbation stochastic approximation (SPSA) approach (Spall, 1997) that allows them to estimate their individual payoff gradients based off a single function evaluation. In particular, let be a payoff function, and the query directions be drawn independently across players, a singleshot spherical SPSA estimator is defined by
(6) 
The estimator in Eq. (6) is an unbiased estimate of the partial gradient of a smoothed version of the payoff, obtained by averaging it over a ball of radius $\delta$. We can easily see the bias-variance dilemma here: as $\delta \to 0$, the estimate becomes more accurate, since the smoothed payoff approaches the true payoff, while its variability grows unbounded, since its second moment grows as $O(1/\delta^2)$. By carefully choosing $\delta$, Bravo et al. (2018) provided the best-known last-iterate convergence rate of $O(T^{-1/3})$, which almost matches the lower bound of $\Omega(T^{-1/2})$ in Shamir (2013). However, a gap still remains, and they believed it could be closed by using a more sophisticated single-shot estimator.

We now provide a single-shot ellipsoidal SPSA estimator by extending the estimator in Eq. (2) to the multiagent setting. Using the SPSA approach, we have
$\hat{v}_i \;=\; \frac{d_i}{\delta}\, u_i\big(x_1 + \delta A_1 z_1, \ldots, x_N + \delta A_N z_N\big)\, A_i^{-1} z_i,$ (7)

where each $z_i$ is drawn uniformly from the unit sphere and $A_i$ is the shape matrix induced by player $i$'s self-concordant regularizer.
The following lemma provides some results for the single-shot ellipsoidal SPSA estimator; the proof is given in Appendix C.
Lemma 3.5
The single-shot ellipsoidal SPSA estimator given by Eq. (7) satisfies
with where for all . Moreover, if is Lipschitz continuous and is the largest eigenvalue of , we have
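The geometric idea behind the ellipsoidal estimator can be sketched numerically. The example below shapes the query ellipsoid with the Hessian of a log-barrier on the box $(0,1)^d$, which is one standard self-concordant choice; the paper's exact shape matrices may differ, so treat this purely as an illustration of why ellipsoidal sampling keeps queries feasible even near the boundary while remaining unbiased for a smoothed gradient (the payoff and all constants are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, delta = 2, 0.1

def barrier_shape(x):
    # Hessian of the log-barrier h(x) = -sum log x_i - sum log(1 - x_i)
    # on (0,1)^d; A = H^{-1/2} defines the (Dikin) sampling ellipsoid.
    H = 1.0 / x**2 + 1.0 / (1.0 - x)**2
    return np.diag(H ** -0.5)

# 1) Feasibility: queries x + A z (||z|| = 1) stay strictly inside (0,1)^d
#    even near the boundary, because the Dikin ellipsoid does.
x_edge = np.array([0.05, 0.95])
A_edge = barrier_shape(x_edge)
Z = rng.normal(size=(5000, d))
Z /= np.linalg.norm(Z, axis=1, keepdims=True)
queries = x_edge + Z @ A_edge
feasible = bool(np.all((queries > 0.0) & (queries < 1.0)))

# 2) Unbiasedness: g = (d/delta) f(x + delta A z) A^{-1} z averages to the
#    gradient of a smoothed f (equal to the exact gradient for a quadratic).
x = np.array([0.2, 0.6])
A = barrier_shape(x)
A_inv = np.linalg.inv(A)
N = 400_000
Z = rng.normal(size=(N, d))
Z /= np.linalg.norm(Z, axis=1, keepdims=True)
X = x + delta * (Z @ A)               # A is diagonal here
vals = X @ np.array([1.0, -1.0]) + np.einsum('ni,ni->n', X, X)
est = (d / delta) * ((vals[:, None] * Z) @ A_inv).mean(axis=0)
grad = np.array([1.0, -1.0]) + 2.0 * x   # exact gradient of f(x) = b'x + ||x||^2
```

Near the boundary the ellipsoid automatically flattens in the constrained direction, which is exactly what a fixed spherical radius cannot do.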
Remark 3.6
Eager mirror descent.
The prox-mapping in Eq. (4) was used by Bravo et al. (2018) to construct their multiagent bandit learning algorithm, which achieves suboptimal regret minimization and last-iterate convergence. One possible reason for these suboptimal guarantees is that the prox-mapping defined in Eq. (4) accommodates a general Bregman divergence and exploits neither the strongly monotone payoff gradients nor the geometry of the constraint sets. In Algorithm 2, we let each player update using the prox-mapping in Eq. (5), and the rule is given by
(8) 
for all players and all rounds. With all these in mind, the main step of Algorithm 2 is given by this recursion, in which the first quantity is a step-size and the second is the feedback of estimated gradients for each player (generated by the single-shot ellipsoidal SPSA estimator).
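To illustrate how the pieces fit together, here is a stripped-down simulation of single-shot bandit feedback driving last-iterate convergence in a strongly monotone game. For brevity it uses Euclidean projection in place of the self-concordant Bregman machinery of Eq. (5) and the spherical rather than the ellipsoidal estimator; the game, step-sizes, and query radii are illustrative choices, so this is a sketch of the feedback loop, not the paper's Algorithm 2:

```python
import numpy as np

rng = np.random.default_rng(0)

# A 2-player strongly monotone game with linear payoff gradients
# v(x) = b - M x, M positive definite; the Nash equilibrium solves M x* = b.
M = np.array([[2.0, 0.5],
              [0.5, 2.0]])
b = np.array([1.5, 1.0])
x_star = np.linalg.solve(M, b)     # interior equilibrium [2/3, 1/3]

def payoff(i, x):
    # u_i(x) = b_i x_i - 0.5 M_ii x_i^2 - sum_{j != i} M_ij x_i x_j,
    # so that grad_i u_i(x) = b_i - (M x)_i.
    return b[i] * x[i] - 0.5 * M[i, i] * x[i] ** 2 \
        - sum(M[i, j] * x[i] * x[j] for j in range(2) if j != i)

T = 20_000
x = np.array([0.1, 0.9])               # initial joint action
err0 = np.linalg.norm(x - x_star)
for t in range(1, T + 1):
    eta = 1.0 / (t + 10)               # vanishing step-size
    delta = 0.4 * (t + 10) ** -0.25    # vanishing query radius
    z = rng.choice([-1.0, 1.0], size=2)          # 1-D "sphere" directions
    query = np.clip(x, delta, 1 - delta) + delta * z
    # Each player observes only her own realized payoff at the joint query
    # and forms the single-shot gradient estimate (1/delta) * u_i * z_i.
    v_hat = np.array([payoff(i, query) / delta * z[i] for i in range(2)])
    x = np.clip(x + eta * v_hat, 0.0, 1.0)       # projected gradient ascent
err_T = np.linalg.norm(x - x_star)
```

Despite each player seeing only a single scalar payoff per round, the joint last iterate drifts toward the unique Nash equilibrium.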
4 Finite-Time Convergence Rate
In this section, we establish that Algorithm 2 achieves a near-optimal last-iterate convergence rate for smooth and strongly monotone games when perfect bandit feedback is available, improving the best-known last-iterate convergence rate of Bravo et al. (2018). More precisely, the following theorem shows that Algorithm 2 achieves a rate of $\tilde{O}(T^{-1/2})$ last-iterate convergence to the unique Nash equilibrium under perfect (noiseless) bandit feedback, improving the best-known rate of $O(T^{-1/3})$ (Bravo et al., 2018) and matching the lower bound of $\Omega(T^{-1/2})$ (Shamir, 2013).
Theorem 4.1
Suppose that the underlying smooth and strongly monotone game admits a unique Nash equilibrium, and that each payoff function satisfies the boundedness condition for all rounds. If each player follows Algorithm 2 with the prescribed parameters, we have
Remark 4.2
Theorem 4.1 shows that Algorithm 2 attains a near-optimal rate of last-iterate convergence in smooth and strongly monotone games. It extends Algorithm 1 from the single-agent setting to the multiagent setting, providing the first doubly optimal bandit learning algorithm, in that it achieves (up to logarithmic factors) both the optimal regret in single-agent learning and the optimal last-iterate convergence rate in multiagent learning. In contrast, it remains unclear whether the multiagent extension of Hazan and Levy (2014, Algorithm 1) can achieve a near-optimal last-iterate convergence rate.
Lemma 4.3
Suppose that the iterates are generated by Algorithm 2 and that each payoff function satisfies the above condition for all rounds; then we have
where the step-size sequence is nonincreasing and satisfies the stated condition.
5 Numerical Experiments
In this section, we conduct experiments on three different tasks: Cournot competition, Kelly auction, and distributed regularized logistic regression. We compare Algorithm 2 with Bravo et al. (2018, Algorithm 1). The implementation is done in MATLAB R2020b on a MacBook Pro with an Intel Core i9 2.4GHz (8 cores and 16 threads) and 16GB memory.
5.1 Cournot Competition
We show that Cournot competition is strongly monotone in the sense of Definition 3.4. In particular, each player's payoff function is given by
Taking the derivative of with respect to , we have
This implies that
Therefore, we conclude that Cournot competition is strongly monotone with this modulus.
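The monotonicity claim can be sanity-checked numerically. Below we take the standard linear Cournot model, $u_i(x) = x_i\,(a - b\sum_j x_j) - c_i x_i$, as an illustrative stand-in for the payoff above (the constants $a$, $b$, and $c_i$ are arbitrary assumptions), and verify the strong monotonicity inequality $\langle v(x)-v(y),\, x-y\rangle \le -b\,\|x-y\|^2$ on random pairs of production profiles:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5                               # number of firms
a, b_coef = 10.0, 1.5               # inverse-demand intercept and slope
c = rng.uniform(0.5, 2.0, size=n)   # per-firm marginal costs

def v(x):
    # Individual payoff gradients of u_i(x) = x_i (a - b sum_j x_j) - c_i x_i:
    # v_i(x) = a - c_i - b x_i - b sum_j x_j
    return a - c - b_coef * x - b_coef * x.sum()

# Strong monotonicity check: <v(x) - v(y), x - y> <= -b ||x - y||^2.
worst_gap = -np.inf
for _ in range(1000):
    x = rng.uniform(0.0, 5.0, size=n)
    y = rng.uniform(0.0, 5.0, size=n)
    lhs = (v(x) - v(y)) @ (x - y)
    worst_gap = max(worst_gap, lhs + b_coef * np.linalg.norm(x - y) ** 2)
```

Indeed, for this model $\langle v(x)-v(y),\, x-y\rangle = -b\|x-y\|^2 - b\big(\sum_i (x_i-y_i)\big)^2 \le -b\|x-y\|^2$, so the gap above is never positive and the modulus is exactly the demand slope $b$.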
                  Bravo et al. (2018, Algorithm 1)    Our Algorithm
(10, 10, 0.05)    1.9e-01 ± 4.1e-02                   1.3e-03 ± 3.6e-04
(10, 10, 0.10)    9.1e-02 ± 3.9e-02                   1.4e-03 ± 3.3e-04
(10, 20, 0.05)    3.0e-01 ± 4.9e-02                   8.9e-04 ± 5.7e-04
(10, 20, 0.10)    1.5e-01 ± 5.0e-02                   7.0e-04 ± 2.2e-04
(20, 10, 0.05)    2.1e-01 ± 4.1e-02                   1.7e-03 ± 2.7e-04
(20, 10, 0.10)    9.6e-02 ± 1.7e-02                   1.9e-03 ± 4.1e-04
(20, 20, 0.05)    3.4e-01 ± 7.4e-02                   9.4e-04 ± 2.4e-04
(20, 20, 0.10)    1.9e-01