Optimal No-Regret Learning in Strongly Monotone Games with Bandit Feedback

by Tianyi Lin, et al.

We consider online no-regret learning in unknown games with bandit feedback, where each agent observes only its reward at each time – determined by all players' current joint action – rather than its gradient. We focus on the class of smooth and strongly monotone games and study optimal no-regret learning therein. Leveraging self-concordant barrier functions, we first construct an online bandit convex optimization algorithm and show that it achieves the single-agent optimal regret of Θ̃(√(T)) under smooth and strongly-concave payoff functions. We then show that if each agent applies this no-regret learning algorithm in strongly monotone games, the joint action converges in last iterate to the unique Nash equilibrium at a rate of Θ̃(1/√(T)). Prior to our work, the best-known convergence rate in the same class of games was O(1/T^1/3) (achieved by a different algorithm), leaving open the problem of optimal no-regret learning algorithms (since the known lower bound is Ω(1/√(T))). Our results thus settle this open problem and contribute to the broad landscape of bandit game-theoretical learning by identifying the first doubly optimal bandit learning algorithm, in that it achieves (up to log factors) both optimal regret in single-agent learning and the optimal last-iterate convergence rate in multi-agent learning. We also present results from several simulation studies – Cournot competition, Kelly auctions, and distributed regularized logistic regression – to demonstrate the efficacy of our algorithm.








1 Introduction

In multi-agent online learning (Cesa-Bianchi and Lugosi, 2006b; Shoham and Leyton-Brown, 2008; Busoniu et al., 2010), a set of agents repeatedly make decisions and accumulate rewards over time, where each agent's action impacts not only its own reward but also those of the others. However, the mechanism of this interaction – the underlying game that specifies how an agent's reward depends on the joint action of all – is unknown to the agents, who may not even be aware that such a game exists. As such, from its own perspective, each agent is simply engaged in an online decision-making process, where the environment consists of all the other agents, who are simultaneously making such sequential decisions of consequence to all.

In the past two decades, the above problem has actively engaged researchers from two fields: machine learning (and online learning in particular), which aims to develop single-agent online learning algorithms that are no-regret in an arbitrarily time-varying and/or adversarial environment (Blum, 1998; Shalev-Shwartz, 2007; Shalev-Shwartz et al., 2012; Arora et al., 2012; Hazan, 2016); and game theory, which aims to develop (ideally distributed) algorithms (see Fudenberg and Levine (1998b) and references therein) that efficiently compute a Nash equilibrium (a joint optimal outcome where no one can do better by deviating unilaterally) for games with special structures (computing a Nash equilibrium is in general computationally intractable; indeed, this problem is PPAD-complete (Daskalakis et al., 2009)). Although these two research threads initially developed separately, they have subsequently merged and now form the core of multi-agent/game-theoretical online learning, whose main research agenda can be phrased as follows: will joint no-regret learning lead to a Nash equilibrium, thereby reaping both the transient benefits (conferred by low finite-time regret) and the long-run benefits (conferred by Nash equilibria)?

More specifically, through the online learning lens, agent i's reward function at time t – viewed as a function solely of its own action – is u_t(·), and the agent needs to select an action x_t before u_t – or any other feedback associated with it – is revealed. In this context, no-regret algorithms ensure that the difference between the cumulative performance of the best fixed action and that of the learning algorithm, a widely adopted metric known as regret, denoted Reg(T), grows sublinearly in the horizon T. This problem has been extensively studied; in particular, when gradient feedback is available – that is, ∇u_t(x_t) can be observed after x_t is selected – the minimax optimal regret is Θ(√T) for convex and Θ(log T) for strongly convex loss functions. Further, several algorithms have been developed that achieve these optimal regret bounds, including "follow-the-regularized-leader" (Kalai and Vempala, 2005), online gradient descent (Zinkevich, 2003b), multiplicative/exponential weights (Arora et al., 2012) and online mirror descent (Shalev-Shwartz and Singer, 2007). As these algorithms provide optimal regret bounds and hence naturally raise high expectations in terms of performance guarantees, a recent line of work has investigated what the evolution of the joint action would be when all agents apply a no-regret learning algorithm, and in particular, whether the joint action converges in last iterate to a Nash equilibrium (if one exists).
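To make the gradient-feedback baseline concrete, here is a minimal sketch (not from the paper) of online gradient descent with 1/(μt) step sizes on strongly convex losses; the loss sequence, the interval constraint, and all numeric parameters are illustrative assumptions:

```python
def ogd(grads, project, x0, T, mu=1.0):
    """Online gradient descent with step sizes eta_t = 1/(mu*t),
    suited to mu-strongly convex losses (illustrative sketch)."""
    x, iterates = x0, []
    for t in range(1, T + 1):
        iterates.append(x)
        g = grads(t, x)                # gradient feedback at the played point
        x = project(x - g / (mu * t))  # descent step, then projection
    return iterates

# Example: losses l_t(x) = 0.5*(x - c_t)^2 with targets c_t in [0, 1].
targets = [0.3, 0.7, 0.5, 0.4, 0.6] * 200
grad = lambda t, x: x - targets[t - 1]
proj = lambda x: min(1.0, max(0.0, x))
xs = ogd(grad, proj, x0=0.0, T=len(targets))

# Regret against the best fixed action (here the mean target).
best = sum(targets) / len(targets)
reg = sum(0.5*(x - c)**2 - 0.5*(best - c)**2 for x, c in zip(xs, targets))
print(reg)  # grows only logarithmically in the horizon for this setting
```

With these quadratic losses the recursion reduces to a running average of the targets, which illustrates why the regret stays logarithmic in T.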

These questions turn out to be difficult and had remained open until a few years ago, mainly because the traditional Nash-seeking algorithms in the economics literature are mostly either not no-regret, or only exhibit convergence in time-average (ergodic convergence), or both. Despite the challenging landscape, in the past five years, affirmative answers have emerged from a fruitful line of work and the new analysis tools developed therein: first on qualitative last-iterate convergence (Krichene et al., 2015; Balandat et al., 2016; Zhou et al., 2017, 2018; Mertikopoulos et al., 2019; Mertikopoulos and Zhou, 2019; Golowich et al., 2020b, a) in different classes of continuous games (such as variationally stable games), and then on quantitative last-iterate convergence rates in more specially structured games such as co-coercive games or strongly monotone games (Lin et al., 2020; Zhou et al., 2021). In particular, very recently, Zhou et al. (2021) showed that if each agent applies a version of online gradient descent, then the joint action converges in last iterate to the unique Nash equilibrium of a strongly monotone game at an optimal rate of Θ(1/T).

Despite this remarkably pioneering line of work, which has elegantly bridged no-regret learning with convergence to Nash in continuous games and thus renewed the excitement of game-theoretical learning, a significant impediment still limits its practical impact. Specifically, in the multi-agent learning setting, an agent is rarely able to observe gradient feedback. Instead, in most cases, only bandit feedback is available: each agent observes only its own reward after choosing an action each time (rather than the gradient at the chosen action). This consideration of practically feasible algorithms then brings us to a more challenging and less explored desideratum in multi-agent learning: if each agent applies a no-regret online bandit convex optimization algorithm, would the joint action still converge to a Nash equilibrium? At what rate, and in what class of games?

Related work.

To appreciate the difficulty and the broad scope of this research agenda, we start by describing the existing related literature. First of all, we note that single-agent bandit convex optimization algorithms – and their theoretical regret characterizations – are not as well developed as their gradient counterparts. More specifically, Flaxman et al. (2005) and Kleinberg (2004) provided the first bandit convex optimization algorithm – known as FKM – which achieved a regret bound of O(T^(3/4)) for convex and Lipschitz loss functions. However, it was unclear whether this bound is optimal. Subsequently, Saha and Tewari (2011) developed a barrier-based online bandit convex optimization algorithm and established an Õ(T^(2/3)) regret bound for convex and smooth cost functions, a result that was further improved to Õ(T^(5/8)) (Dekel et al., 2015) via a variant of the algorithm and a new analysis. More recently, progress has been made on developing algorithms that achieve minimax-optimal regret. In particular, Bubeck et al. (2015) and Bubeck and Eldan (2016) provided non-constructive arguments showing that the minimax regret bound of Õ(√T) is achievable in one dimension and in high dimension, respectively, without providing any algorithm. Later, Bubeck et al. (2017) developed a kernel-method-based bandit convex optimization algorithm that attains the bound of Õ(√T). Independently, Hazan and Li (2016) considered an ellipsoid method for bandit convex optimization that also achieves Õ(√T).

When the cost function is strongly convex and smooth, Agarwal et al. (2010) showed that the FKM algorithm achieves an improved regret bound of Õ(T^(2/3)). Later, Hazan and Levy (2014) established that another variant of the barrier-based online bandit algorithm given in Saha and Tewari (2011) achieves the minimax-optimal regret of Õ(√T). For an overview of the relevant theory and applications, we refer to the recent surveys (Bubeck and Cesa-Bianchi, 2012; Lattimore and Szepesvári, 2020).

However, much remains unknown about the convergence of these no-regret bandit convex optimization algorithms to Nash equilibria. Bervoets et al. (2020) developed a specialized distributed payoff-based algorithm that asymptotically converges to the unique Nash equilibrium in the class of strictly monotone games. However, the algorithm is not known to be no-regret, and no rate is given. Héliou et al. (2020) considered a variant of the FKM algorithm and showed that it is no-regret even under delays; further, provided the delays are not too large, joint FKM learning converges to the unique Nash equilibrium in strictly monotone games (again without rates). At this writing, the most relevant state-of-the-art result on this topic is presented in Bravo et al. (2018), which showed that if each agent applies the FKM algorithm in strongly monotone games, then last-iterate convergence to the unique Nash equilibrium is guaranteed at a rate of O(1/T^(1/3)). Per Bravo et al. (2018), the analysis itself is unlikely to be improved to yield any tighter rate. However, a sizable gap still exists between this bound and the best known lower bound given in Shamir (2013), which established that in optimization problems with strongly convex and smooth objectives (a one-player Nash-seeking problem), no algorithm that uses only bandit feedback (i.e., a zeroth-order oracle) can compute the unique optimal solution at a rate faster than Ω(1/√T). Consequently, it remained unknown whether other algorithms can improve the O(1/T^(1/3)) rate, and what the true optimal convergence rate is. In particular, since the lower bound is established for the special case of optimization problems, it is plausible that in the multi-agent setting – where a natural potential function such as the cost in optimization does not exist – the problem is inherently more difficult, and hence the convergence intrinsically slower. Further, note that the lower bound in Shamir (2013) is established against the class of all bandit optimization algorithms, not necessarily no-regret ones; a priori, that could mean a larger lower bound when the algorithms are further restricted to be no-regret. As such, closing the gap has been a challenging open problem.

Our contributions.

We tackle the problem of no-regret learning in strongly monotone games with bandit feedback and settle the above open problem by establishing that the convergence rate of Θ̃(1/√T) – and hence minimax optimality (up to log factors) – is achievable. More specifically, we start by studying (in Section 2) single-agent learning with bandit feedback – in particular, bandit convex optimization – and develop an eager variant of the barrier-based family of online bandit convex optimization algorithms (Saha and Tewari, 2011; Hazan and Levy, 2014; Dekel et al., 2015). We establish that the algorithm achieves the minimax optimal regret bound of Θ̃(√T). Next, extending to multi-agent learning settings, we show that if all agents employ this optimal no-regret learning algorithm (see Algorithm 2 in Section 4), then the joint action converges in last iterate to the unique Nash equilibrium at a rate of Θ̃(1/√T). As such, we provide the first online convex bandit learning algorithm (with continuous actions) that is doubly optimal (up to log factors): it achieves optimal regret in single-agent settings under strongly convex cost functions and the optimal convergence rate to Nash in multi-agent settings under strongly monotone games. Finally, we conduct extensive experiments on Cournot competition, Kelly auctions and distributed regularized logistic regression in Section 5. The numerical results demonstrate that our algorithm outperforms the state-of-the-art multi-agent FKM algorithm.


The remainder of the paper is organized as follows. In Section 3, we provide the basic setup for multi-agent bandit learning in strongly monotone games and review background material on regret minimization and Nash equilibria. In Section 2, we develop a single-agent bandit learning algorithm with eager projection and prove a near-optimal regret bound when the reward function is smooth and strongly concave. In Section 4, we extend the algorithm to the multi-agent setting and prove a near-optimal last-iterate convergence rate for smooth and strongly monotone games. We also consider the multi-agent setting in which the bandit feedback is noisy but bounded and prove the same convergence result. We conduct extensive experiments on Cournot competition, Kelly auctions and distributed regularized logistic regression in Section 5. Finally, we conclude the paper in Section 6. For ease of presentation, we defer all proof details to the appendix.

2 Single-Agent Learning with Bandit Feedback

In this section, we provide a simple single-agent bandit learning algorithm that players can employ to increase their individual rewards in an online manner, and we prove that it achieves a near-optimal regret guarantee for bandit concave optimization (BCO). (This setting is the same as bandit convex optimization in the literature; we simply consider maximization with concave reward functions instead of minimization with convex loss functions.)

In BCO, an adversary first chooses a sequence of concave reward functions u_1, …, u_T defined on X, where X is a closed convex subset of ℝ^d. At each round t, a (randomized) decision maker has to choose a point x_t ∈ X and receives the reward u_t(x_t) after committing to her decision. Her expected reward (where the expectation is taken with respect to her random choice) is E[u_t(x_t)], and the corresponding regret is defined by Reg(T) = max_{x∈X} Σ_{t=1}^T u_t(x) − Σ_{t=1}^T E[u_t(x_t)]. In the bandit setting, the feedback is limited to the reward at the point that she has chosen, i.e., u_t(x_t).
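The regret definition can be written down directly. The following sketch (with an illustrative 1-D action set discretized by a grid – an assumption for the example, not part of the paper) compares the learner's realized reward with the best fixed action in hindsight:

```python
def regret(actions, rewards, grid):
    """Reg(T) = max_{x in X} sum_t u_t(x) - sum_t u_t(x_t), with the
    maximum over X approximated on a finite grid (illustrative)."""
    best_fixed = max(sum(u(x) for u in rewards) for x in grid)
    realized = sum(u(x) for u, x in zip(rewards, actions))
    return best_fixed - realized

# Two rounds of concave rewards on X = [0, 1] and the actions played.
rewards = [lambda x: -(x - 0.2)**2, lambda x: -(x - 0.8)**2]
actions = [0.9, 0.1]
grid = [i/100 for i in range(101)]
print(round(regret(actions, rewards, grid), 6))  # 0.8
```

Here the best fixed action is x = 0.5, and the played actions 0.9 and 0.1 each incur an extra loss of 0.49 − 0.09 = 0.4 per round.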

Algorithm 1 is inspired by the existing online learning algorithms developed in Saha and Tewari (2011) and Hazan and Levy (2014); the main difference lies in eager projection (ours) vs. lazy projection for updating the pivot point. This modification is crucial to the last-iterate convergence analysis when we extend Algorithm 1 to the multi-agent setting. In what follows, we present the individual algorithm components.

1:  Input: step size , module and barrier .
2:  Initialization: .
3:  for  do
4:     set . # scaling matrix
5:     draw . # perturbation direction
6:     play . # choose action
7:     receive . # get payoff
8:     set . # estimate gradient
9:     update . # update pivot
Algorithm 1 Eager Self-Concordant Barrier Bandit Learning
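Since the display above lost its mathematical detail in extraction, the following 1-D sketch fills in the template from the Hazan–Levy (2014) line of work rather than the paper's exact constants: a log-barrier-based scaling, a single-shot one-point gradient estimate, and an eager (mixed-divergence) update solved by ternary search. All numeric parameters and the exact form of the scaling are illustrative assumptions, not the paper's specification:

```python
import math, random
random.seed(0)

# Log barrier for X = (0, 1) and its derivatives.
R  = lambda x: -math.log(x) - math.log(1 - x)
R1 = lambda x: -1/x + 1/(1 - x)
R2 = lambda x: 1/x**2 + 1/(1 - x)**2
breg = lambda x, y: R(x) - R(y) - R1(y)*(x - y)

def eager_bandit(u, T, eta=0.05, alpha=4.0, x0=0.3):
    """1-D sketch of Algorithm 1: sample inside the Dikin ellipsoid,
    build a one-point gradient estimate, take an eager prox step."""
    x, played = x0, []
    for t in range(1, T + 1):
        A = (R2(x) + eta*alpha*t) ** -0.5  # scaling (illustrative form)
        z = random.choice((-1.0, 1.0))     # perturbation direction
        xh = x + A*z                       # play: feasible by Lemma 2.2
        played.append(xh)
        g = u(xh) * z / A                  # one-point gradient estimate (d = 1)
        # Eager prox step: maximize the mixed-divergence objective.
        F = lambda y: eta*g*y - 0.5*eta*alpha*(y - x)**2 - breg(y, x)
        lo, hi = 1e-9, 1 - 1e-9
        for _ in range(100):               # ternary search (F is concave)
            m1, m2 = lo + (hi - lo)/3, hi - (hi - lo)/3
            if F(m1) < F(m2): lo = m1
            else: hi = m2
        x = (lo + hi) / 2
    return x, played

u = lambda x: 1.0 - 2.0*(x - 0.6)**2       # smooth, strongly concave reward
x_final, played = eager_bandit(u, T=2000)
print(0 < min(played) and max(played) < 1)  # True: every action is feasible
```

Feasibility of the played points never requires an explicit projection: the sampling radius is at most the unit Dikin radius, so the perturbed action stays inside X by construction.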

2.1 Self-Concordant Barrier

Existing bandit learning algorithms can be interpreted within the framework of mirror descent (Cesa-Bianchi and Lugosi, 2006a), and the common choice of regularization is a self-concordant barrier, a key ingredient in regret-optimal bandit algorithms when the loss function is linear (Abernethy et al., 2008) or smooth and strongly convex (Hazan and Levy, 2014). Here we provide an overview and refer to Nesterov and Nemirovskii (1994) for details.

Definition 2.1

A function R : int(X) → ℝ is called a ν-self-concordant barrier for a closed convex set X ⊆ ℝ^d if (i) R is three times continuously differentiable, (ii) R(x) → +∞ if x → ∂X, and (iii) for every x ∈ int(X) and h ∈ ℝ^d, we have |∇³R(x)[h, h, h]| ≤ 2‖h‖_x³ and |⟨∇R(x), h⟩| ≤ √ν ‖h‖_x, where ‖h‖_x = (h^⊤∇²R(x)h)^(1/2).

Similar to the existing online bandit learning algorithms (Abernethy et al., 2008; Saha and Tewari, 2011; Hazan and Levy, 2014; Dekel et al., 2015), our algorithm requires a ν-self-concordant barrier over X; see Algorithm 1. However, this does not weaken its generality. Indeed, it is known that any convex and compact set in ℝ^d admits a non-degenerate ν-self-concordant barrier (Nesterov and Nemirovskii, 1994) with ν = O(d), and such a barrier can be efficiently represented and evaluated for various choices of X arising in game-theoretical settings. For example, R(x) = −Σ_{i=1}^m log(b_i − a_i^⊤x) is an m-self-concordant barrier for a set defined by m linear constraints a_i^⊤x ≤ b_i, and a similar logarithmic barrier is an O(d)-self-concordant barrier for the d-dimensional simplex or a cube. For the d-dimensional Euclidean ball, R(x) = −log(1 − ‖x‖²) is a 1-self-concordant barrier, so ν is even independent of the dimension.
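As a quick numerical sanity check, in one dimension (the ball barrier specialized to the interval (−1, 1), an illustrative case) condition (iii) of Definition 2.1 reduces to (R′(x))² ≤ ν·R″(x), which can be verified on a grid:

```python
def check_barrier_1d(R1, R2, nu, xs):
    """Numerically check the 1-D self-concordant barrier condition
    (R'(x))^2 <= nu * R''(x) at the given interior points."""
    return all(R1(x)**2 <= nu * R2(x) + 1e-9 for x in xs)

# Log barrier for the interval (-1, 1): R(x) = -log(1 - x^2), nu = 1.
R1 = lambda x: 2*x / (1 - x*x)             # R'(x)
R2 = lambda x: 2*(1 + x*x) / (1 - x*x)**2  # R''(x)
grid = [i/100 for i in range(-99, 100)]
print(check_barrier_1d(R1, R2, nu=1, xs=grid))  # True
```

Indeed, (2x/(1−x²))² ≤ 2(1+x²)/(1−x²)² holds exactly when x² ≤ 1, confirming ν = 1 on this interval.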

The above definition is given only for the sake of completeness; our analysis relies on a few useful facts about self-concordant barriers. In particular, the Hessian of a self-concordant barrier R induces a local norm at each x ∈ int(X); that is, ‖h‖_x = (h^⊤∇²R(x)h)^(1/2) and its dual ‖h‖_x,* = (h^⊤(∇²R(x))^(−1)h)^(1/2) for all h ∈ ℝ^d. The nondegeneracy of ∇²R implies that ‖·‖_x and ‖·‖_x,* are both norms. To this end, we define the Dikin ellipsoid at any x ∈ int(X) as

W_1(x) = {y ∈ ℝ^d : ‖y − x‖_x ≤ 1},

and present some nontrivial facts (see Nesterov and Nemirovskii (1994, Theorem 2.1.1) for a proof):

Lemma 2.2

Let W_1(x) be the Dikin ellipsoid at any x ∈ int(X). The following statements hold true:

  1. W_1(x) ⊆ X for every x ∈ int(X);

  2. For y ∈ W_1(x), we have (1 − ‖y − x‖_x)‖h‖_x ≤ ‖h‖_y ≤ (1 − ‖y − x‖_x)^(−1)‖h‖_x for all h ∈ ℝ^d.
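Part 1 of the lemma can be checked numerically in one dimension: for the log barrier on (0, 1) (an illustrative choice), the unit Dikin "ellipsoid" is the interval of radius 1/√(R″(x)) around x, and it stays strictly inside the feasible set:

```python
import math

def dikin_radius(x):
    """Unit Dikin radius 1/sqrt(R''(x)) for the log barrier
    R(x) = -log(x) - log(1-x) on (0, 1)."""
    return 1.0 / math.sqrt(1/x**2 + 1/(1-x)**2)

# Lemma 2.2(1): the Dikin ellipsoid is contained in the feasible set.
grid = [i/1000 for i in range(1, 1000)]
ok = all(0 < x - dikin_radius(x) and x + dikin_radius(x) < 1 for x in grid)
print(ok)  # True
```

This containment is exactly what lets Algorithm 1 play perturbed actions without any explicit projection step.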

Remark 2.3

Based on Lemma 2.2, the update of x̂_t in Algorithm 1 is well-posed, since ‖x̂_t − x_t‖_{x_t} ≤ 1 implies that x̂_t ∈ W_1(x_t) ⊆ X.

Second, we define the Minkowski function (Nesterov and Nemirovskii, 1994, Page 34) on X, parametrized by a point x ∈ int(X), as

π_x(y) = inf{t > 0 : x + t^(−1)(y − x) ∈ X},

and a scaled version R̂ of R by R̂(x) = R(x) − R(x_c), where x_c is a "center" of X satisfying ∇R(x_c) = 0, and R is a ν-self-concordant barrier function for X. The following lemma shows that R is quite flat at points that are far from the boundary (see Nesterov and Nemirovskii (1994, Propositions 2.3.2 and 2.3.3) for a proof):

Lemma 2.4

Suppose that X is a closed convex set, R is a ν-self-concordant barrier function for X, and x_c is a center of X. Then, for any y ∈ int(X), we have

R(y) − R(x_c) ≤ ν log(1/(1 − π_{x_c}(y))).

In particular, for any y ∈ X and ε ∈ [0, 1), we have R(x_c + ε(y − x_c)) − R(x_c) ≤ ν log(1/(1 − ε)).

Finally, we define the Newton decrement for a self-concordant function f (not necessarily a barrier function) as

λ(x, f) = ‖∇f(x)‖_x,*

(recall that ‖·‖_x and ‖·‖_x,* are a local norm and its dual norm), which can be used to measure roughly how far a point is from a global optimum of f. Formally, we summarize the result in the following lemma (see Nemirovski and Todd (2008) for a proof):

Lemma 2.5

For any self-concordant function f, whenever λ(x, f) ≤ 1/2, we have

‖x − x*‖_x ≤ 2λ(x, f),

where x* = argmin f and the local norm is defined with respect to f, i.e., ‖h‖_x = (h^⊤∇²f(x)h)^(1/2).

2.2 Single-Shot Ellipsoidal Estimator

Flaxman et al. (2005) introduced a single-shot spherical estimator into the BCO literature and cleverly combined it with online gradient descent (Zinkevich, 2003a). In particular, let f : ℝ^d → ℝ be a function and δ > 0; a single-shot spherical estimator is defined by

g = (d/δ) f(x + δz) z,   where z is drawn uniformly from the unit sphere.   (1)

This estimator is an unbiased prediction of the gradient of a smoothed version of f; that is, E[g] = ∇f_δ(x), where f_δ(x) = E[f(x + δv)] with v drawn uniformly from the unit ball. As δ → 0, the bias caused by the difference between f_δ and f vanishes while the variability of g explodes. This manifestation of the bias-variance dilemma plays a key role in designing bandit learning algorithms, and the single-shot spherical estimator is known to be suboptimal in terms of the bias-variance trade-off and hence regret minimization. This gap is closed by using a more sophisticated single-shot ellipsoidal estimator based on the self-concordant barrier function of X (Saha and Tewari, 2011; Hazan and Levy, 2014). Compared to the spherical estimator in Eq. (1), which is based on uniform sampling, Saha and Tewari (2011) and Hazan and Levy (2014) proposed to sample nonuniformly over all directions and create an unbiased gradient estimate of a scaled smoothed version. In particular, a single-shot ellipsoidal estimator with respect to an invertible matrix A is defined by

g = d · f(x + Az) A^(−1) z,   where z is drawn uniformly from the unit sphere.   (2)
The following lemma is a simple modification of (Hazan and Levy, 2014, Corollary 6 and Lemma 7).

Lemma 2.6

Suppose that f is a continuous and concave function, A is an invertible matrix, and z is drawn uniformly from the unit sphere; define the smoothed version of f with respect to A by f̂(x) = E[f(x + Av)], where v is drawn uniformly from the unit ball. Then, the following holds:

  1. E[g] = ∇f̂(x), where g is the estimator in Eq. (2).

  2. If f is α-strongly concave, then so is f̂.

  3. If f is Lipschitz continuous with parameter L and λ_max(A) is the largest eigenvalue of A, then |f̂(x) − f(x)| ≤ L·λ_max(A).
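Part 1 of the lemma can be checked by Monte Carlo. In the sketch below, f is an illustrative quadratic and A a diagonal matrix (both assumptions for the example); for quadratics, smoothing does not change the gradient, so the averaged estimator should approach ∇f(x):

```python
import math, random
random.seed(0)

def grad_est(f, x, A, d=2):
    """Single-shot ellipsoidal gradient estimator (Eq. (2)):
    g = d * f(x + A z) * A^{-1} z,  z uniform on the unit circle."""
    th = random.uniform(0, 2*math.pi)
    z = (math.cos(th), math.sin(th))
    xp = (x[0] + A[0]*z[0], x[1] + A[1]*z[1])  # A = diag(A0, A1)
    fv = f(xp)
    return (d*fv*z[0]/A[0], d*fv*z[1]/A[1])

f = lambda p: 3*p[0]**2 + p[1]**2 + 2*p[0] - p[1]   # grad at (0.5, 0.5) is (5, 0)
x, A, N = (0.5, 0.5), (0.1, 0.2), 200000
g0 = g1 = 0.0
for _ in range(N):
    e = grad_est(f, x, A)
    g0 += e[0]/N; g1 += e[1]/N
print(g0, g1)  # close to (5.0, 0.0), the true gradient at x
```

Shrinking A drives the smoothing bias to zero for general f, but blows up the single-sample variance – the bias-variance trade-off discussed above.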

Remark 2.7

Lemma 2.6 shows that the estimator in Eq. (2) is an unbiased estimate of ∇f̂(x). In Algorithm 1, we set A_t using the self-concordant barrier function for X and perform shrinking sampling (Hazan and Levy, 2014), which makes an explicit smoothing-parameter setup unnecessary in the update, leading to a better bias-variance trade-off and near-optimal regret minimization.

2.3 Eager Mirror Descent

The idea of eager mirror descent (Nemirovskij and Yudin, 1983) is to generate a new feasible point by taking a "mirror step" from a starting point along an "approximate gradient" direction. (In utility maximization, we should use mirror ascent instead of mirror descent because players seek to maximize their rewards rather than minimize their losses. Nonetheless, we keep the term "descent" throughout because, despite the role reversal, it is the standard name associated with the method.) By abuse of notation, we let R be a continuous and strictly convex distance-generating (or regularizer) function on X, and we assume that R is continuously differentiable on int(X). Then, we obtain a Bregman divergence on X via the relation

D(x, y) = R(x) − R(y) − ⟨∇R(y), x − y⟩

for all x ∈ X and y ∈ int(X). The Bregman divergence can fail to be symmetric and/or satisfy the triangle inequality. Nevertheless, D(x, y) ≥ 0 with equality if and only if x = y, so the asymptotic convergence of a sequence x_t to a point p can be checked by showing that D(p, x_t) → 0. For technical reasons, it will be convenient to assume the converse, i.e., D(p, x_t) → 0 when x_t → p. This condition is known in the literature as "Bregman reciprocity" (Chen and Teboulle, 1993). We continue with some basic relations connecting the Bregman divergence relative to a target point before and after a prox-mapping. The key ingredient is the "three-point identity" (Chen and Teboulle, 1993), which generalizes the law of cosines and is widely used in the literature (Beck and Teboulle, 2003; Nemirovski, 2004).
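The basic properties of the Bregman divergence are easy to verify numerically. The sketch below uses the log barrier on (0, 1) as the regularizer (the choice made later in Remark 2.9; the evaluation points are illustrative):

```python
import math

R  = lambda x: -math.log(x) - math.log(1 - x)  # barrier regularizer on (0, 1)
R1 = lambda x: -1/x + 1/(1 - x)                # R'(x)

def breg(x, y):
    """Bregman divergence D(x, y) = R(x) - R(y) - <R'(y), x - y>."""
    return R(x) - R(y) - R1(y)*(x - y)

print(breg(0.3, 0.3))                     # 0.0: D(x, x) = 0
print(breg(0.2, 0.6) > 0)                 # True: nonnegativity
print(breg(0.2, 0.6) == breg(0.6, 0.2))   # False: not symmetric in general
```

The asymmetry in the last line is exactly why D(·, ·) is a divergence rather than a distance.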

Lemma 2.8

Let R be a regularizer on X. Then, for all p ∈ X and all x, x' ∈ int(X), we have

D(p, x') = D(p, x) + D(x, x') + ⟨∇R(x) − ∇R(x'), p − x⟩.

Remark 2.9

In Algorithm 1, we set R as a self-concordant barrier function for X, and it is easy to check that it satisfies the aforementioned Bregman reciprocity for various constraint sets, e.g., the d-dimensional simplex, a cube, or the d-dimensional ball.

The key notion associated with the Bregman divergence is the induced prox-mapping, given by

P_x(g) = argmax_{x'∈X} {⟨g, x' − x⟩ − D(x', x)}   (4)

for all x ∈ int(X) and all g ∈ ℝ^d, which reflects the intuition behind eager mirror descent. Indeed, it starts with a point x ∈ int(X) and steps along the dual (gradient-like) vector g to generate a new feasible point x⁺ = P_x(g). However, the prox-mapping in Eq. (4) considers a general Bregman divergence, which exploits neither the structure of strongly concave payoff functions nor that of the constraint sets. In response to these issues, we propose to use the prox-mapping given by

P_x(g) = argmax_{x'∈X} {⟨g, x' − x⟩ − (α/2)‖x' − x‖² − (1/η)D(x', x)}   (5)

for all x ∈ int(X) and all g ∈ ℝ^d. We remark that the above prox-mapping explicitly incorporates the problem-structure information by considering a mixed Bregman divergence: the first term is the Euclidean distance with a coefficient proportional to the strong concavity parameter α, and the second term is the Bregman divergence defined with the self-concordant barrier function for the constraint set X. It is also worth noting that we consider eager projection in Eq. (5), making our prox-mapping different from the one used in Hazan and Levy (2014), which also exploits the structure of strongly concave payoff functions and constraint sets but uses lazy projection.
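In one dimension, the mixed prox-mapping can be computed directly, since its objective is strictly concave. The sketch below uses the log barrier on (0, 1) and ternary search; the placement of the step size and all numeric parameters are illustrative assumptions rather than the paper's exact constants:

```python
import math

R  = lambda x: -math.log(x) - math.log(1 - x)
R1 = lambda x: -1/x + 1/(1 - x)
breg = lambda x, y: R(x) - R(y) - R1(y)*(x - y)

def prox(xt, g, eta, alpha, iters=200):
    """Mixed prox-mapping (1-D sketch): maximize
    g*(x - xt) - (alpha/2)(x - xt)^2 - (1/eta)*D_R(x, xt) over (0, 1)
    by ternary search (the objective is strictly concave)."""
    F = lambda x: g*(x - xt) - 0.5*alpha*(x - xt)**2 - breg(x, xt)/eta
    lo, hi = 1e-9, 1 - 1e-9
    for _ in range(iters):
        m1, m2 = lo + (hi - lo)/3, hi - (hi - lo)/3
        if F(m1) < F(m2): lo = m1
        else: hi = m2
    return (lo + hi) / 2

print(round(prox(0.5, 0.0, 0.1, 1.0), 3))  # 0.5: zero gradient keeps the pivot
print(prox(0.5, 1.0, 0.1, 1.0) > 0.5)      # True: positive gradient moves it right
```

Note that the barrier term keeps the output strictly inside (0, 1) automatically, which is the "eager projection" at work: the new pivot is always feasible.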

With all these in mind, the eager self-concordant barrier bandit learning algorithm is given by the recursion x_{t+1} = P_{x_t}(g_t), in which η is a step size (appearing in Eq. (5)) and g_t is the estimated gradient feedback. In Algorithm 1, we generate g_t by the single-shot ellipsoidal estimator mentioned above.

2.4 Regret Bound

We consider the single-agent setting in which the adversary is limited to choosing smooth and strongly concave functions. The following theorem shows that Algorithm 1 achieves a Θ̃(√T) regret bound, matching the lower bound in Shamir (2013).

Theorem 2.10

Suppose that the adversary is limited to choosing smooth and α-strongly concave functions u_1, …, u_T, each of which is Lipschitz continuous and uniformly bounded on X. If the horizon T is fixed and the player follows Algorithm 1 with appropriately chosen parameters, we have

Remark 2.11

Theorem 2.10 shows that Algorithm 1 is a near-regret-optimal bandit learning algorithm when the adversary is limited to choosing smooth and strongly concave functions. The algorithmic scheme is based on eager mirror descent and thus differs from the existing one presented in Hazan and Levy (2014).

To prove Theorem 2.10, we present our main descent lemma for the iterates generated by Algorithm 1.

Lemma 2.12

Suppose that the iterate is generated by Algorithm 1 and each function satisfies that for all , we have

where and is a nonincreasing sequence satisfying that .

The proofs of Lemma 2.12 and Theorem 2.10 are given in Appendices A and B, respectively.

3 Multi-Agent Learning with Bandit Feedback

In this section, we consider multi-agent learning with bandit feedback and characterize the behavior of the system when each agent applies the eager self-concordant barrier bandit learning algorithm. We first present basic definitions, then discuss a few important classes of games that are strongly monotone, and finally discuss the multi-agent version of the learning algorithm.

3.1 Basic Definition and Notation

We focus on games played by a finite set of players N = {1, 2, …, N}. At each iteration of the game, each player i ∈ N selects an action x_i from a convex subset X_i of a finite-dimensional vector space, and its reward is determined by the profile x = (x_1, …, x_N) of all players' actions; subsequently, each player receives its reward and repeats this process. We denote by ‖·‖ the Euclidean norm (in the corresponding vector space); other norms can easily be accommodated in our framework (and different X_i's can in general be equipped with different norms), although we will not pursue this, since we neither use nor benefit from more complicated geometries.

Definition 3.1

A smooth concave game is a tuple G = (N, {X_i}_{i∈N}, {u_i}_{i∈N}), where N is the set of players, X_i is a convex and compact subset of a finite-dimensional vector space representing the action space of player i, and u_i : X → ℝ is the i-th player's payoff function satisfying: (i) u_i(x) is continuous in x and concave in x_i for all fixed x_{−i}; (ii) u_i is continuously differentiable in x and the individual payoff gradient v_i(x) = ∇_{x_i}u_i(x) is Lipschitz continuous.

A commonly used solution concept for non-cooperative games is the Nash equilibrium (NE). For the smooth concave games considered in this paper, we are interested in pure-strategy Nash equilibria. Indeed, for finite games, a mixed-strategy NE is a probability distribution over pure strategies; our setting instead assumes continuous and convex action sets, where each action already lives in a continuum, so pursuing pure-strategy Nash equilibria is sufficient.

Definition 3.2

An action profile x* ∈ X is called a (pure-strategy) Nash equilibrium of a game G if it is resilient to unilateral deviations; that is, u_i(x_i*; x_{−i}*) ≥ u_i(x_i; x_{−i}*) for all x_i ∈ X_i and all i ∈ N.

It is known that every smooth concave game admits at least one Nash equilibrium when all action sets are compact (Debreu, 1952), and Nash equilibria admit a variational characterization. We summarize this result in the following proposition.

Proposition 3.3

In a smooth concave game G, the action profile x* is a Nash equilibrium if and only if ⟨v_i(x*), x_i − x_i*⟩ ≤ 0 for all x_i ∈ X_i and all i ∈ N.

Proposition 3.3 shows that the Nash equilibria of a smooth concave game can be precisely characterized as the solution set of a variational inequality (VI), so the existence results also follow from standard results in the VI literature (Facchinei and Pang, 2007). We omit the proof here and refer to Mertikopoulos and Zhou (2019) for the details.

3.2 Strongly Monotone Games

The study of (strongly) monotone games dates to Rosen (1965), with many subsequent developments; see, e.g., Facchinei and Pang (2007). Specifically, Rosen (1965) considered a class of games that satisfy the diagonal strict concavity (DSC) condition and proved that they admit a unique Nash equilibrium. Further work in this vein appeared in Sandholm (2015) and Sorin and Wan (2016), where games that satisfy DSC are referred to as "contractive" and "dissipative", respectively. These conditions are equivalent to (strict) monotonicity in convex analysis (Bauschke and Combettes, 2011). To avoid confusion, we provide the formal definition of the strongly monotone games considered in this paper.

Definition 3.4

A smooth concave game G is called λ-strongly monotone if there exists a constant λ > 0 such that ⟨v(x) − v(y), x − y⟩ ≤ −λ‖x − y‖² for any x, y ∈ X, where v(x) = (v_1(x), …, v_N(x)) denotes the profile of individual payoff gradients.

The notion of (strong) monotonicity, which will play a crucial role in the subsequent analysis, is not a theoretical artifact; it encompasses a very rich class of games. We present three typical examples that satisfy the conditions in Definition 3.4 (see Bravo et al. (2018) and Mertikopoulos and Zhou (2019) for more details).

Example 3.1 (Cournot Competition)

In the Cournot oligopoly model, there is a finite set of firms, each supplying the market with a quantity x_i of some good (or service) up to the firm's production capacity, given here by a positive scalar C_i. The good is then priced as a decreasing function of the total supply to the market, as determined by each firm's production; for concreteness, we focus on the standard linear model P(x) = a − b Σ_i x_i, where a and b are positive constants. In this model, the utility of firm i (considered here as a player) is given by

u_i(x) = x_i P(x) − c_i(x_i),

where c_i represents the production cost function of firm i and is assumed to be strongly convex; i.e., u_i is the income obtained by producing x_i units of the good in question minus the corresponding production cost. Letting X_i = [0, C_i] denote the space of possible production values for each firm, the resulting game is strongly monotone.
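Strong monotonicity of the linear Cournot model can be checked numerically. In the sketch below, the quadratic costs c_i(x) = s_i·x²/2, the market parameters, and the claimed modulus λ = b + min_i s_i are all illustrative assumptions:

```python
import random
random.seed(1)

a, b = 10.0, 0.5
s = [1.0, 2.0, 1.5]           # quadratic cost coefficients (strongly convex)

def v(x):
    """Individual payoff gradients v_i(x) = du_i/dx_i for
    u_i(x) = x_i*(a - b*sum(x)) - s_i*x_i^2/2."""
    tot = sum(x)
    return [a - b*tot - b*xi - si*xi for xi, si in zip(x, s)]

lam = b + min(s)              # candidate strong monotonicity modulus
ok = True
for _ in range(1000):         # test <v(x)-v(y), x-y> <= -lam*||x-y||^2
    x = [random.uniform(0, 4) for _ in s]
    y = [random.uniform(0, 4) for _ in s]
    inner = sum((vx - vy)*(xi - yi)
                for vx, vy, xi, yi in zip(v(x), v(y), x, y))
    dist2 = sum((xi - yi)**2 for xi, yi in zip(x, y))
    ok = ok and inner <= -lam*dist2 + 1e-9
print(ok)  # True
```

Indeed, here v(x) − v(y) = −b·1·Σ(x−y) − b(x−y) − diag(s)(x−y), so the inner product is at most −(b + min_i s_i)‖x − y‖², confirming the check.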

Example 3.2 (Strongly Concave Potential Games)

A game is called a potential game (Monderer and Shapley, 1996; Sandholm, 2001) if there exists a potential function Φ such that

u_i(x_i; x_{−i}) − u_i(x_i'; x_{−i}) = Φ(x_i; x_{−i}) − Φ(x_i'; x_{−i})

for all i ∈ N, all x_i, x_i' ∈ X_i and all x_{−i}. A potential game is called a strongly concave potential game if the potential function is strongly concave. Since the gradient of a strongly concave function is strongly monotone (Bauschke and Combettes, 2011), a strongly concave potential game is a strongly monotone game.

Example 3.3 (Kelly Auctions)

Consider a service provider with a number of splittable resources r ∈ R (representing, e.g., bandwidth, server time, ad space on a website, etc.). These resources can be leased to a set of bidders (players) who can place monetary bids b_{i,r} ≥ 0 for the utilization of each resource up to each player's total budget, i.e., ∑_{r∈R} b_{i,r} ≤ b_i. A popular and widely used mechanism to allocate resources in this case is the so-called Kelly mechanism (Kelly et al., 1998) whereby resources are allocated proportionally to each player's bid, i.e., player i gets

ρ_{i,r} = q_r b_{i,r} / (d_r + ∑_j b_{j,r})

units of the r-th resource (in the above, q_r denotes the available units of said resource and d_r > 0 is the “entry barrier” for bidding on it). A simple model for the utility of player i is then given by

u_i(b) = ∑_{r∈R} [g_i ρ_{i,r} − c_i(b_{i,r})],

where g_i denotes the player's marginal gain from acquiring a unit slice of resources and c_i(b_{i,r}) denotes the cost incurred when player i bids b_{i,r} for the utilization of resource r (the function c_i is assumed to be strongly convex). If we write B_i for the space of possible bids of player i on the set of resources R, we obtain a strongly monotone game.
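The proportional-allocation rule itself is easy to state in code; the capacities, entry barrier, and bid profile below are illustrative values, not from the paper.

```python
# Sketch of the Kelly proportional-allocation rule: bidder i receives
# q * b_i / (d + sum_j b_j) units of a resource with capacity q and entry
# barrier d. The numbers below are illustrative, not from the paper.

def kelly_share(bids, i, q=1.0, d=0.1):
    """Units of the resource allocated to bidder i under the Kelly mechanism."""
    return q * bids[i] / (d + sum(bids))

bids = [0.5, 0.3, 0.2]
shares = [kelly_share(bids, i) for i in range(len(bids))]
```

The entry barrier d > 0 keeps the allocation well defined even at zero bids and ensures the provider never hands out the full capacity q.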

There are many other application problems that can be cast into strongly monotone games (Orda et al., 1993; Cesa-Bianchi and Lugosi, 2006a; Sandholm, 2015; Sorin and Wan, 2016; Mertikopoulos et al., 2017). Typical examples include strongly-convex-strongly-concave zero-sum games, congestion games (Mertikopoulos and Zhou, 2019), wireless network games (Weeraddana et al., 2012; Tan, 2014; Zhou et al., 2021) and a wide range of online decision-making problems (Cesa-Bianchi and Lugosi, 2006a). From an economic point of view, one appealing feature of strongly monotone games is that last-iterate convergence can be achieved by some standard learning algorithms (Zhou et al., 2021), which is more natural than time-average convergence (Fudenberg and Levine, 1998a; Cesa-Bianchi and Lugosi, 2006a; Lin et al., 2020). The finite-time convergence rate is derived in terms of the distance between x_t and x⋆, where x_t is the realized action and x⋆ is the unique Nash equilibrium (under convex and compact action sets). In view of all this (and unless explicitly stated otherwise), we will focus throughout on strongly monotone games.

3.3 Multi-Agent Bandit Learning

In multi-agent learning with bandit feedback, at each round t, each (possibly randomized⁴) decision maker selects an action. The reward is realized after all decision makers have chosen their actions. In addition to regret minimization, convergence to Nash equilibria is an important criterion for measuring the performance of learning algorithms. In multi-agent bandit learning, the feedback of each player is limited to the reward at the point that she has chosen. Here, we propose a simple multi-agent bandit learning algorithm (see Algorithm 2) in which each player chooses her action using Algorithm 1. Algorithm 2 is a straightforward extension of Algorithm 1 from the single-agent setting to the multi-agent setting. It differs from Bravo et al. (2018, Algorithm 1) in two respects: an ellipsoidal rather than spherical SPSA estimator, and a self-concordant rather than general Bregman divergence. We discuss these two crucial components next.

⁴ Randomization plays an important role in the online game playing literature. For example, the classical Follow The Leader (FTL) algorithm does not attain any non-trivial regret guarantee for linear cost functions (in the worst case its regret can be linear in T if the cost functions are chosen adversarially). However, Hannan (1957) proposed a randomized variant of FTL, called follow-the-perturbed-leader, which attains an optimal regret of O(√T) for linear functions over the simplex.

Single-shot ellipsoidal SPSA estimator.

Recently, Bravo et al. (2018) extended the spherical estimator in Eq. (1) to the multi-agent setting by positing instead that players rely on a simultaneous perturbation stochastic approximation (SPSA) approach (Spall, 1997) that allows them to estimate their individual payoff gradients based on a single function evaluation. In particular, let u_i be a payoff function and let the query directions z^i be drawn independently across players, uniformly from the unit sphere; a single-shot spherical SPSA estimator is then defined by

v̂^i = (n_i/δ) u_i(x^1 + δz^1, …, x^N + δz^N) z^i.

This estimator is an unbiased estimate of the partial gradient of a smoothed version û_i^δ of u_i; that is, E[v̂^i] = ∇_i û_i^δ(x), where û_i^δ → u_i as δ → 0. We can easily see the bias–variance dilemma here: as δ → 0, the estimator becomes less biased since û_i^δ becomes a more accurate approximation of u_i, while the variability of v̂^i grows unbounded since its second moment grows as O(1/δ²). By carefully choosing δ, Bravo et al. (2018) obtained the best-known last-iterate convergence rate of O(1/T^(1/3)), which almost matches the Ω(1/√T) lower bound in Shamir (2013). However, a gap still remains, and the authors conjectured that it could be closed by using a more sophisticated single-shot estimator.

1:  Input: step size , weight , module , and barrier .
2:  Initialization: .
3:  for  do
4:     for  do
5:        set . # scaling matrix
6:        draw . # perturbation direction
7:        play . # choose action
8:     receive for all . # get payoff
9:     for  do
10:        set . # estimate gradient
11:        update . # update pivot
Algorithm 2 Multi-Agent Eager Self-Concordant Barrier Bandit Learning
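To convey the overall shape of the loop, here is a heavily simplified one-dimensional sketch in Python. The log-barrier, the clipping of the pivot, and all parameter schedules are our own illustrative stand-ins for the barrier, prox-mapping, and tuning of Algorithm 2, not the paper's exact specification.

```python
# Simplified 1-D sketch of the loop of Algorithm 2 for two Cournot players.
# The log-barrier R(x) = -log(x - lo) - log(hi - x), the pivot clipping, and
# every parameter schedule below are illustrative stand-ins, NOT the paper's
# exact prox-mapping or tuning.

import math
import random

LO, HI = 0.05, 3.0

def payoff(x, i, a=4.0, b=1.0, c=1.0):
    return x[i] * (a - b * sum(x)) - c * x[i]

def barrier_scale(x):
    # A = H^{-1/2} for the log-barrier Hessian H = 1/(x-lo)^2 + 1/(hi-x)^2;
    # note A < min(x - LO, HI - x), so perturbed plays stay feasible.
    h = 1.0 / (x - LO) ** 2 + 1.0 / (HI - x) ** 2
    return 1.0 / math.sqrt(h)

rng = random.Random(0)
pivot = [0.5, 0.5]
plays = []
for t in range(1, 3001):
    delta = min(0.6, t ** -0.25)                       # shrinking perturbation
    scales = [barrier_scale(xi) for xi in pivot]       # scaling matrices (1-D)
    signs = [rng.choice((-1.0, 1.0)) for _ in pivot]   # 1-D "unit sphere"
    played = [xi + delta * Ai * zi for xi, Ai, zi in zip(pivot, scales, signs)]
    plays.append(list(played))
    for i in range(2):
        # single-shot ellipsoidal estimate: (1/delta) * u_i(played) * A_i^{-1} * z_i
        v_hat = payoff(played, i) * signs[i] / (delta * scales[i])
        step = v_hat / (t + 100.0)                     # decaying step (stand-in)
        pivot[i] = min(max(pivot[i] + step, 0.3), 2.7) # keep pivot interior
```

Because the barrier scale is dominated by the distance to the boundary, every played action automatically lands strictly inside (LO, HI) without any projection; tracking the pivot against the Nash equilibrium (1, 1) of this toy game illustrates the noisy last-iterate convergence that the analysis formalizes.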

We now provide a single-shot ellipsoidal SPSA estimator by extending the estimator in Eq. (2) to the multi-agent setting. Using the SPSA approach, each player perturbs her pivot along the ellipsoid induced by her scaling matrix and we have


The following lemma provides guarantees for the single-shot ellipsoidal SPSA estimator; the proof is given in Appendix C.

Lemma 3.5

The single-shot ellipsoidal SPSA estimator given by Eq. (7) satisfies

with where for all . Moreover, if is -Lipschitz continuous and is the largest eigenvalue of , we have

Remark 3.6

Lemma 3.5 generalizes Bravo et al. (2018, Lemma C.1), which is proved for a single-shot spherical SPSA estimator. In Algorithm 2, we use a self-concordant barrier function and perform shrinking sampling, leading to a last-iterate convergence rate that is optimal up to a log factor.

Eager mirror descent.

The prox-mapping in Eq. (4) was used in Bravo et al. (2018) to construct their multi-agent bandit learning algorithm, which achieves suboptimal regret and last-iterate convergence guarantees. One possible reason for these suboptimal guarantees is that the prox-mapping defined in Eq. (4) relies on a general Bregman divergence and exploits neither the strongly monotone payoff gradient nor the structure of the constraint sets. In Algorithm 2, we instead let each player update her pivot using the prox-mapping in Eq. (5) at every round. With all these in mind, the main step of Algorithm 2 is given by this recursion, in which η_t is a step size and v̂_t^i is the estimated gradient feedback for player i (generated by the single-shot ellipsoidal SPSA estimator).
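For intuition, a one-dimensional prox step with a log-barrier Bregman divergence can be computed by bisection. This is our own illustrative sketch (interval constraint, log-barrier, bisection solver), not the paper's exact Eq. (5).

```python
# Illustrative 1-D prox-mapping with a log-barrier Bregman divergence:
#   x_next = argmax_{x in (lo, hi)} { eta_v * x - D_R(x, x_c) },
# where R(x) = -log(x - lo) - log(hi - x). Setting the derivative to zero
# gives R'(x) = eta_v + R'(x_c), solved below by bisection (R' is increasing).
# This is a sketch of the idea, not the paper's exact Eq. (5).

def barrier_prime(x, lo, hi):
    return -1.0 / (x - lo) + 1.0 / (hi - x)

def barrier_prox(x_c, eta_v, lo=0.0, hi=1.0):
    target = eta_v + barrier_prime(x_c, lo, hi)
    a, b = lo + 1e-12, hi - 1e-12
    for _ in range(200):     # bisection: R' runs from -inf to +inf on (lo, hi)
        m = 0.5 * (a + b)
        if barrier_prime(m, lo, hi) < target:
            a = m
        else:
            b = m
    return 0.5 * (a + b)
```

With a zero gradient signal the step returns the current point, and no matter how large the (estimated) gradient is, the barrier keeps the next iterate strictly inside the feasible interval; this is precisely the property that lets the method dispense with explicit projections.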

4 Finite-Time Convergence Rate

In this section, we establish that Algorithm 2 achieves a near-optimal last-iterate convergence rate for smooth and strongly monotone games when perfect bandit feedback is available. The following theorem shows that Algorithm 2 converges in last iterate to the unique Nash equilibrium at a rate of Õ(1/√T), improving the best-known rate of O(1/T^(1/3)) (Bravo et al., 2018) and matching the Ω(1/√T) lower bound (Shamir, 2013) up to log factors.

Theorem 4.1

Suppose that is a unique Nash equilibrium of a smooth and -strongly monotone game. Each payoff function satisfies that for all . If each player follows Algorithm 2 with parameters , we have

Remark 4.2

Theorem 4.1 shows that Algorithm 2 attains a near-optimal last-iterate convergence rate in smooth and strongly monotone games. It extends Algorithm 1 from the single-agent setting to the multi-agent setting, providing the first doubly optimal bandit learning algorithm, in that it achieves (up to log factors) both the optimal regret in single-agent learning and the optimal last-iterate convergence rate in multi-agent learning. In contrast, it remains unclear whether the multi-agent extension of Hazan and Levy (2014, Algorithm 1) can achieve a near-optimal last-iterate convergence rate.

To prove Theorem 4.1, we present our main descent lemma for the iterates generated by Algorithm 2.

Lemma 4.3

Suppose that the iterate is generated by Algorithm 2 and each function satisfies that for all , we have

where and is a nonincreasing sequence satisfying that .

See the proofs of Lemma 4.3 and Theorem 4.1 in Appendices D and E, respectively. We also provide convergence results under the imperfect bandit feedback setting in Appendix F.

5 Numerical Experiments

In this section, we conduct experiments on three different tasks: Cournot competition, Kelly auctions, and distributed regularized logistic regression. We compare Algorithm 2 with Bravo et al. (2018, Algorithm 1). The implementation is done in MATLAB R2020b on a MacBook Pro with an Intel Core i9 2.4GHz (8 cores and 16 threads) and 16GB memory.

5.1 Cournot Competition

We show that Cournot competition is strongly monotone in the sense of Definition 3.4. In particular, each player's payoff function is given by

u_i(x) = x_i (a − b ∑_j x_j) − c_i(x_i).

Taking the derivative of u_i with respect to x_i, we have

v_i(x) = a − b ∑_j x_j − b x_i − c_i′(x_i).

This implies that, for any two action profiles x and x′,

⟨v(x) − v(x′), x − x′⟩ = −b (∑_i (x_i − x_i′))² − b ‖x − x′‖² − ∑_i (c_i′(x_i) − c_i′(x_i′))(x_i − x_i′) ≤ −b ‖x − x′‖²,

where the last inequality uses the convexity of each c_i. Therefore, we conclude that Cournot competition is strongly monotone with the modulus b.
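In the special case of linear costs c_i(x_i) = c·x_i (an illustrative choice on our part, not necessarily the paper's experimental setup), the symmetric Nash equilibrium has a closed form, which the following snippet verifies against the first-order conditions:

```python
# For linear costs c_i(x_i) = c * x_i (an illustrative special case), the
# symmetric Nash equilibrium of the linear Cournot game is
#   x_star = (a - c) / ((N + 1) * b),
# since the first-order condition a - b*sum(x) - b*x_i - c = 0 must hold for
# every firm. We verify the condition numerically below.

def cournot_gradient(x, i, a=10.0, b=1.0, c=1.0):
    """Partial payoff gradient v_i(x) = a - b*sum(x) - b*x_i - c."""
    return a - b * sum(x) - b * x[i] - c

N, a, b, c = 3, 10.0, 1.0, 1.0
x_star = (a - c) / ((N + 1) * b)
profile = [x_star] * N
grads = [cournot_gradient(profile, i, a, b, c) for i in range(N)]
```

All partial gradients vanish at the symmetric profile, confirming it is the unique Nash equilibrium guaranteed by strong monotonicity.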

Setting          Bravo et al. (2018, Algorithm 1)    Our Algorithm
(10, 10, 0.05)   1.9e-01    4.1e-02                  1.3e-03    3.6e-04
(10, 10, 0.10)   9.1e-02    3.9e-02                  1.4e-03    3.3e-04
(10, 20, 0.05)   3.0e-01    4.9e-02                  8.9e-04    5.7e-04
(10, 20, 0.10)   1.5e-01    5.0e-02                  7.0e-04    2.2e-04
(20, 10, 0.05)   2.1e-01    4.1e-02                  1.7e-03    2.7e-04
(20, 10, 0.10)   9.6e-02    1.7e-02                  1.9e-03    4.1e-04
(20, 20, 0.05)   3.4e-01    7.4e-02                  9.4e-04    2.4e-04
(20, 20, 0.10)   1.9e-01