Learning in Cournot Games with Limited Information Feedback

06/15/2019, by Yuanyuan Shi et al., University of Washington

In this work, we study the interaction of strategic players in continuous-action Cournot games with limited information feedback. The Cournot game is an essential market model for many socio-economic systems in which players learn and compete without full knowledge of the system or of each other. In this setting, it becomes important to understand the dynamics and limiting behavior of the players. We consider concave Cournot games and two widely used learning strategies in the form of no-regret algorithms and policy gradient. We prove that if the players all adopt one of the two algorithms, their joint action converges to the unique Nash equilibrium of the game. Notably, we show that myopic algorithms such as policy gradient can achieve an exponential convergence rate, while no-regret algorithms only converge (sub)linearly. Together, our work presents significantly sharper convergence results and shows how exploiting the structure of the game can lead to much faster convergence rates.




I Introduction

Game-theoretic models have been used to describe the cooperative and competitive behaviors of a group of players in a wide range of systems from robotics, distributed control to resource allocation [6, 33, 27, 21]. In this paper, we study the interaction of strategic players in Cournot games [9], which is one of the best studied models for competition between self-interested agents in a system [17]: the Cournot competition is the essential market model for many socio-economic systems such as energy [28], transportation [11] and healthcare systems [15]. This competition can be thought of as multiple agents competing to satisfy an elastic demand by changing their production levels. For example, most of the US electricity market is built upon a Cournot competition model [20], where energy producers bid to serve the loads in the grid, and a market price is cleared based on the total supply and demand. Each producer’s payoff is based on the market price multiplied by its share of the supply. The goal of producers is then to maximize their individual payoffs by strategically choosing the production levels.

Previous works on Cournot games mostly focused on analyzing the equilibrium behavior, especially the Nash equilibria. However, this raises the question of how the players would reach an equilibrium when they do not start at one. This question was already considered by Cournot in the original 1838 paper [9] for the case of two players. Since then, a rich set of results has generalized the conditions under which the players can reach a Nash equilibrium. For example, [25, 26] proposed a dynamic and adaptive behavior rule that reaches a Nash equilibrium for an arbitrary number of players and encompasses both best-response dynamics and Bayesian learning. However, most of the previously known results require full information of other players’ actions and the exact market price function, which are not usually available in practice. For many applications of interest, providing full feedback is either impractical (e.g., distributed control [24]) or explicitly disallowed due to privacy and market power concerns (e.g., energy markets [29]). Therefore, in this work we move away from both the static view of assuming players are at a Nash equilibrium and the assumption that they have full knowledge of the system. Instead, we analyze the long-run outcome of the learning dynamics under limited information feedback, and ask the following two fundamental questions:

  1. Will strategic learning agents reach an equilibrium in Cournot games?

  2. If so, how quickly do they converge to the equilibrium?

Reasoning about these questions requires specifying the dynamics, which describe how players act before achieving the equilibrium. In particular, we consider player dynamics induced by learning algorithms that attempt to find the best strategies according to some objective. We consider two well-known families of algorithms. One class is based on regret minimization (i.e., no-regret algorithms [1, 14, 19]), which treats the environment and other players as adversarial. The other class is myopic learning algorithms (i.e., policy gradient [34, 18]), which seek to maximize the individual payoff in a stochastic system. It should be noted that these two classes of algorithms, constructed from distinct starting points, may lead to very different behavior patterns and system dynamics. The first class of algorithms is more risk-averse and focuses on minimizing the hindsight regret, while the second class is less conservative and tries to maximize the expected payoff.

In terms of the information structure, we consider bandit feedback, where each player receives only limited information. In the Cournot game, this means that the players only receive the price (a single number) from the system and nothing else. Therefore they must make decisions based only on the observed history of the price and their own actions. This is an important feature of many socio-economic games where each player only has access to local information (i.e., the realized payoffs) and is not informed about the attributes of the other participants.

I-A Related Work

Studying the behavior of players under limited feedback is a challenging problem which has started to receive some recent attention. Understandably, most works focus on no-regret algorithms [8, 4, 40, 13, 35, 10, 3] because of their inherent robustness. In particular, the definition of a player's no-regret dynamics translates directly into the coarse correlated equilibrium condition [12] for a wide range of algorithms (e.g., multiplicative weights [1], online mirror descent [14], Follow-the-Regularized-Leader [19]).

In terms of the convergence speed, [35] and [10] showed that the players’ time-averaged history of joint play converges to the set of coarse correlated equilibria at a rate of $O(1/T)$ for smooth games. As for convergence in distribution (rather than of the time-average), [4, 3, 40] showed that joint actions generated by mirror descent algorithms converge to the Nash equilibrium in the class of variationally stable games, and [8] showed that the joint actions produced by no-regret learning with exponential weights converge to the Nash equilibrium in potential games. However, each of the aforementioned papers focused on a specific class of no-regret algorithms, and the results do not generalize to all no-regret dynamics. Consequently, establishing convergence to finer notions of equilibria (e.g., Nash) for a broad class of no-regret algorithms remains open.

While the theoretical properties of no-regret algorithms can be attractive, they also limit the applicability of these algorithms. In practice, systems and agents are often not adversarial to each other, and the competition is often designed to have specific structures. The electricity market serves as a good example. The market clearing mechanism is designed specifically to be convex, so that the dual variables can be used as prices (the so-called locational marginal prices [20]), and different generators only care about maximizing their own payoffs [39]. Therefore no-regret algorithms can lead to poor performance, since they cannot take advantage of the structure of the game.

Much less is known in the limited-feedback setting beyond no-regret dynamics. In many games, it is more natural for players to use myopic policies that aim directly for profit maximization [38]. Since the action spaces in Cournot games are continuous, the action policy is commonly modeled and trained using policy gradient reinforcement learning [34]. These algorithms can lead to much better performance than no-regret algorithms, but proving their convergence has proven challenging [5], since the coupling between the (continuous) actions of the players must be carefully analyzed. Attempts have been made to discretize the action space and then study the resulting discrete game, but the dimensionality quickly grows and important features (e.g., convexity) are hard to retain [22].

I-B Our Contributions

In this work, we study the dynamics of both no-regret learning algorithms and policy gradient in concave Cournot games with bandit feedback. Our major contributions are threefold.


First, we prove that under assumptions where a unique Nash equilibrium (NE) exists, the joint distribution induced by any no-regret algorithm converges to the NE. This is a much sharper result compared to the standard convergence result of the time-averaged joint distribution converging to a coarse correlated equilibrium.


Second, under the same assumptions, we show that the joint distribution induced by policy gradient, where players use Gaussian policies with at least two degrees of freedom (i.e., mean and variance), also converges to the NE. This is the first result (to the best of our knowledge) on the convergence property of algorithms with continuous action spaces that do not fall in the no-regret class.

Third, we show that the convergence rate of policy gradient occurs at an exponential rate, which is much faster than the linear rate obtained via no-regret algorithms [35].

II Problem setup and preliminaries

II-A Model of Cournot Competition

In this section, we first review the classical Cournot game setup and then provide two motivating examples of its applications in modeling competition in infrastructure systems.

Definition 1 (Cournot Game).

Consider $N$ players with homogeneous products in a limited market, where the strategy space of player $i$ is its production level $x_i \in \mathcal{X}_i = [0, \bar{x}_i]$. The utility function of player $i$ is denoted as $\pi_i(x_i, x_{-i}) = x_i\, p\big(\textstyle\sum_{j=1}^{N} x_j\big) - c_i(x_i)$, where $p(\cdot)$ is the market clearing price (inverse demand) function that maps the total production quantity to a price and $c_i(\cdot)$ is the cost function of player $i$.

The goal of each player $i$ in the Cournot game is to choose the production quantity $x_i$ that maximizes his own utility $\pi_i$. An important concept in game theory is the Nash equilibrium, at which no player can increase his expected payoff via a unilateral deviation. A Nash equilibrium of the Cournot game defined by Definition 1 is a vector $x^\star = (x_1^\star, \dots, x_N^\star)$ such that for all $i$:

$$\pi_i(x_i^\star, x_{-i}^\star) \ge \pi_i(x_i, x_{-i}^\star), \quad \forall x_i \in \mathcal{X}_i,$$

where $x_{-i}$ denotes the actions of all players except $i$. To ensure the existence and uniqueness of the Nash equilibrium in Cournot games, we make the following assumptions throughout the paper:

Assumption 1.

We assume a Cournot game has the following properties:

  1. The individual strategy set $\mathcal{X}_i = [0, \bar{x}_i]$ of each player is convex and compact, i.e., $\bar{x}_i$ is finite. (A1)

  2. The price function $p(\cdot)$ is concave, strictly decreasing and twice differentiable. (A2)

  3. The individual cost function $c_i(\cdot)$ of each player is convex. (A3)

These assumptions are common in the literature (e.g., see [17]). It is straightforward to show that a Cournot game satisfying the above assumptions (A1)-(A3) is a concave N-player game [30], and thus a unique Nash equilibrium exists. Below, we briefly discuss two applications of the Cournot game in socio-economic systems.

Example 1 (Wholesale Electricity Market)

The Cournot model is the most widely adopted framework for electricity market design [20]. Suppose there are $N$ electricity producers, each supplying the market with $x_i$ units of energy, up to the player’s production capacity $\bar{x}_i$. In an uncongested grid, the electricity is priced as a decreasing function $p(\cdot)$ of the total amount $\sum_j x_j$. In practice, linear price and cost functions are commonly adopted, and the profit of generator $i$ is $\pi_i = x_i\, p\big(\sum_j x_j\big) - c_i x_i$, where $c_i$ represents the marginal production cost of generator $i$.
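For the linear case, the equilibrium can be computed in closed form from the first-order conditions. Below is a minimal sketch (the function name and the illustrative parameters are ours, not the paper's), assuming inverse demand $p(x) = a - bx$ and marginal costs $c_i$:

```python
import numpy as np

def linear_cournot_ne(a, b, c):
    """Closed-form interior Nash equilibrium of a linear Cournot game.

    Inverse demand p(x) = a - b * sum(x); player i's cost is c[i] * x_i.
    First-order condition of player i: a - b*S - b*x_i - c[i] = 0, where
    S is the total production. Summing the FOCs over all players solves
    for S, after which each x_i follows directly.
    """
    c = np.asarray(c, dtype=float)
    n = len(c)
    S = (n * a - c.sum()) / (b * (n + 1))   # equilibrium total output
    x = (a - c - b * S) / b                 # individual equilibrium outputs
    return x

# Three symmetric zero-cost players with a = b = 1:
# the classic result x_i* = a / (b*(N+1)) = 0.25 each.
x_star = linear_cournot_ne(1.0, 1.0, [0.0, 0.0, 0.0])
```

The same system-of-FOCs approach extends to heterogeneous costs, as long as the resulting quantities stay nonnegative (interior equilibrium).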

Example 2 (City Parking Planning [39])

Cournot competition can also be used to study oligopsonies, where demand competes for supply. For example, businesses compete for a fixed set of on-street parking spots in cities. Suppose business $i$ is allocated $x_i$ parking spaces, and the number of vehicles that try to visit it is a random variable. If a vehicle successfully finds parking when customers try to visit, the business derives a certain amount of utility; if parking is not available, it gets no utility. Business $i$'s expected utility from its own spaces is therefore a function $u_i(x_i)$, concave w.r.t. $x_i$. However, an increase in parking spaces is correlated with the total traffic into downtown areas and may increase congestion, which we model with a price function $p\big(\sum_j x_j\big)$; the expected payoff of business $i$ is then its utility net of this congestion cost. It turns out that there is an exact transformation between Cournot oligopolies and oligopsonies, and a unique Nash equilibrium exists when $p(\cdot)$ is convex [17].

II-B Review of learning dynamics

We now turn to review the two families of learning methods that players could employ to adjust their actions, namely the no-regret learning algorithms and policy gradient.

No-regret algorithms

An online algorithm $\mathcal{A}$ is called no-regret (or no-external-regret) if the difference between the total payoff it receives and that of the best fixed decision in hindsight is sublinear as a function of time [14]. Formally, fix a sequence of payoff functions $u^1, \dots, u^T$; the regret of $\mathcal{A}$ after $T$ steps is:

$$R(T) = \max_{x \in \mathcal{X}} \sum_{t=1}^{T} u^t(x) - \sum_{t=1}^{T} u^t(x^t).$$

An algorithm is said to have no regret if, for every sequence of payoff functions, the regret satisfies $R(T) = o(T)$. Consider the case of $N$ players that engage in a repeated Cournot game, and suppose $x^1, x^2, \dots$ is a sequence of joint actions, where $x^t = (x_1^t, \dots, x_N^t)$ represents the production levels set by all players at time $t$. The regret of player $i$ at time $T$ is defined as $R_i(T) = \max_{x \in \mathcal{X}_i} \sum_{t=1}^{T} \pi_i(x, x_{-i}^t) - \sum_{t=1}^{T} \pi_i(x_i^t, x_{-i}^t)$, where $\pi_i$ is the utility function of player $i$ and $\mathcal{X}_i$ is player $i$'s action set. A collection of algorithms satisfy this regret-minimization property, e.g., EXP3 [2], Online Mirror Descent [14], and Follow the Regularized/Perturbed Leader [19].

Policy gradient

Policy gradient is a widely used reinforcement learning algorithm for problems with continuous action spaces [32]. The policy is usually modeled with a parameterized distribution $f_\theta(a \mid s)$, where $a$ is the action, $s$ is the state and $\theta$ is the policy parameter. In a repeated Cournot game, since the game is reset at each iteration, the policy reduces to a stateless version $f_{\theta_i}(x_i)$, where we slightly abuse notation and use $\theta_i$ to denote the parameters associated with player $i$. Assuming all players follow the policy gradient algorithm (an important assumption for the convergence analysis; we provide some robustness analysis in Section V, where a small portion of players do not follow their policies and the convergence still holds) and take actions independently, the expected reward of player $i$ is:

$$J_i(\theta) = \mathbb{E}_{x_j \sim f_{\theta_j},\ j = 1, \dots, N}\big[\pi_i(x_1, \dots, x_N)\big].$$

Player $i$ aims to find the best parameter $\theta_i$ that maximizes the expected reward, with the following update rule:

$$\theta_i^{t+1} = \theta_i^t + \eta\, \nabla_{\theta_i} J_i(\theta^t).$$

Following the policy gradient theorem [34], the gradient with respect to $\theta_i$ can be rewritten as:

$$\nabla_{\theta_i} J_i(\theta) = \mathbb{E}\big[\pi_i(x)\, \nabla_{\theta_i} \log f_{\theta_i}(x_i)\big],$$

where $\pi_i(x)$ is the observed payoff; the sample average of this expression forms an unbiased estimator if all players follow their current policies to take actions.

III Convergence of no-regret algorithms in Cournot games

Given $N$ players with joint actions $x = (x_1, \dots, x_N)$, suppose $\sigma$ is some joint distribution over the action space of all players. Then we can define a coarse correlated equilibrium (CCE) notion based on $\sigma$: for every player $i$ and every deviation $x_i' \in \mathcal{X}_i$,

$$\mathbb{E}_{x \sim \sigma}\big[\pi_i(x_i, x_{-i})\big] \ge \mathbb{E}_{x \sim \sigma}\big[\pi_i(x_i', x_{-i})\big].$$

It is a well-known result that the above coarse correlated equilibrium is learnable if all players follow no-regret algorithms [13]. However, these results concern the convergence of the time-averaged (empirical) distribution, not the outcome distribution at any particular given time.
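For intuition, the CCE condition can be checked empirically from samples of the joint play. The sketch below estimates the largest violation of the inequality over a grid of fixed deviations (the function name, utility signature, and grids are our illustrative choices):

```python
import numpy as np

def cce_gap(samples, utilities, grids):
    """Largest empirical CCE violation of a joint distribution.

    samples   : (T, N) array of joint actions drawn from the distribution.
    utilities : list of N callables u_i(x_i, others_total).
    grids     : per-player candidate fixed deviations.
    Returns max_i max_{x'} E[u_i(x', x_-i)] - E[u_i(x_i, x_-i)];
    a value <= 0 (up to sampling error) certifies the CCE condition.
    """
    samples = np.asarray(samples, dtype=float)
    gaps = []
    for i, (u, grid) in enumerate(zip(utilities, grids)):
        others = samples.sum(axis=1) - samples[:, i]   # opponents' totals
        on_path = u(samples[:, i], others).mean()
        best_dev = max(u(x, others).mean() for x in grid)
        gaps.append(best_dev - on_path)
    return max(gaps)
```

A point mass on the Nash equilibrium yields a zero gap, while a point mass elsewhere exposes a profitable fixed deviation.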

We first state a stronger result by strengthening the equilibrium notion to that of the correlated equilibrium (CE) for Cournot games satisfying (A1)-(A3):

Lemma 1.

If each player chooses actions using a no-regret algorithm in a Cournot game, then the empirical distribution $\sigma$ of all the players’ actions converges to a correlated equilibrium, i.e., for every player $i$:

$$\mathbb{E}_{x \sim \sigma}\big[\pi_i(x_i, x_{-i})\big] \ge \mathbb{E}_{x \sim \sigma}\big[\pi_i(\phi(x_i), x_{-i})\big]$$

for any measurable function $\phi : \mathcal{X}_i \to \mathcal{X}_i$.

In this lemma and the rest of the paper, probabilistic convergence is always in the sense of convergence in distribution [37]. In general games, we know that no-regret learning procedures converge to the set of CCE, the largest equilibrium set of a game. Lemma 1 shows that for Cournot games with assumptions (A1)-(A3), every CCE is also a CE. The next theorem tightens the result even further by showing that the Nash equilibrium is the only correlated equilibrium in the defined Cournot games.

Fig. 1: Hierarchy of equilibria [31]
Theorem 1.

An N-player Cournot game satisfying (A1)-(A3) has a unique correlated equilibrium, which places probability one on the Nash equilibrium. If all players in the game follow no-regret algorithms, the actual sequence of play converges to the Nash equilibrium.

The detailed proofs of Lemma 1 and Theorem 1 are given in Appendix A. In short, we prove the theorem by contradiction: we first assume there exists a strategy that places non-zero probability on a measurable set not containing the NE, and then show that such an assumption contradicts the definition of correlated equilibrium. The convergence result applies to all no-regret algorithms, without restriction to any specific subclass of algorithms.

Convergence rate.

The players’ joint actions generally converge to the set of CCE (the NE for our Cournot games) at a sublinear rate of $O(1/\sqrt{T})$ [12] when using common no-regret algorithms such as EXP3 [2], Online Mirror Descent [14] and Follow the Regularized Leader [19] in their vanilla forms. The best previously known convergence rate is $O(1/T)$ [35], which holds for the class of no-regret algorithms with the RVU property defined in [35].

IV Convergence of policy gradient in Cournot games

Although the convergence result for no-regret algorithms is neat, in practice the players in Cournot games are more likely to face other similar “self-interested” players rather than adversarial ones. Instead of following regret-minimization schemes, a more natural choice is to learn a decision policy that directly maximizes individual payoff, which leads to the following theorem.

Theorem 2.

For an N-player Cournot game, suppose all players follow Gaussian policies with at least two degrees of freedom (i.e., mean and variance) and use policy gradient to update the policy parameters. Starting from any initial parameters, the variance of the Gaussian policies will shrink to zero and the actual sequence of play converges to the Nash equilibrium at an exponential rate.

The Gaussian policy is a natural choice for continuous action spaces [34, 32], and this form also includes popular neural network policies [23], where the mean is parameterized by a neural network. In cases where the Gaussian policy is too restrictive, one could use a more complex distribution, e.g., a mixture of Gaussians or a heavy-tailed distribution. In essence, for policy gradient to converge to the Nash equilibrium, we need a parameterized policy whose mean and variance can adapt independently. In the remainder of this section, we first walk through the proof on a toy example with two players and a linear price in Section IV-A, which provides intuition for the convergence behavior. We then generalize the analysis to concave Cournot games in Section IV-B.

IV-A System dynamics with linear price functions

Consider a two-player Cournot game where each player uses a Gaussian policy to sample its action: $x_1 \sim \mathcal{N}(\mu_1, \sigma_1^2)$ and $x_2 \sim \mathcal{N}(\mu_2, \sigma_2^2)$. Suppose the price function is linear, $p(x) = a - b(x_1 + x_2)$, and there is no production cost. Then the payoff of player 1 is $\pi_1 = x_1\big(a - b(x_1 + x_2)\big)$ and the payoff of player 2 is $\pi_2 = x_2\big(a - b(x_1 + x_2)\big)$. The expected payoffs of the players are:

$$\mathbb{E}[\pi_1] = a\mu_1 - b(\mu_1^2 + \sigma_1^2) - b\mu_1\mu_2, \qquad \mathbb{E}[\pi_2] = a\mu_2 - b(\mu_2^2 + \sigma_2^2) - b\mu_1\mu_2.$$

Therefore, running policy gradient with a fixed step size $\eta$ leads to the following dynamics:

$$\mu_i^{t+1} = \mu_i^t + \eta\big(a - 2b\mu_i^t - b\mu_{-i}^t\big), \qquad (8)$$

$$\sigma_i^{t+1} = \sigma_i^t(1 - 2\eta b), \qquad i = 1, 2. \qquad (9)$$

We can write the above dynamics in a state-space representation $z^{t+1} = A z^t + \eta\,[a, a, 0, 0]^\top$, in which we define the system state $z = (\mu_1, \mu_2, \sigma_1, \sigma_2)$, and

$$A = \begin{pmatrix} 1 - 2\eta b & -\eta b & 0 & 0 \\ -\eta b & 1 - 2\eta b & 0 & 0 \\ 0 & 0 & 1 - 2\eta b & 0 \\ 0 & 0 & 0 & 1 - 2\eta b \end{pmatrix}$$

is the state matrix. Solving the system characteristic equation, the eigenvalues of $A$ are $\lambda_1 = 1 - \eta b$, $\lambda_2 = 1 - 3\eta b$, $\lambda_3 = \lambda_4 = 1 - 2\eta b$. A discrete-time linear system is asymptotically stable if and only if all eigenvalues of the state matrix are inside the unit circle, i.e., $|\lambda_i| < 1$ for all $i$. Suppose we choose a proper step size (here, $0 < \eta < 2/(3b)$) such that the system is asymptotically stable; it is then straightforward to check that the NE of the Cournot game, $\mu_1 = \mu_2 = a/(3b)$ with $\sigma_1 = \sigma_2 = 0$, is the only solution to the system equilibrium condition $z = Az + \eta\,[a, a, 0, 0]^\top$, and is therefore the unique system equilibrium.

Convergence rate

Next, we analyze the convergence rate of the above dynamical system (8)-(9). Following standard linear system analysis, the distance between the system state at any time $t$ and the equilibrium point $z^\star$ is bounded by:

$$\|z^t - z^\star\| \le C\,|\lambda_{\max}(A)|^t\, \|z^0 - z^\star\|, \qquad (10)$$

where $\lambda_{\max}(A)$ denotes the largest eigenvalue of $A$ (in terms of absolute value), $C$ is a constant, and $|\lambda_{\max}(A)| < 1$ when $0 < \eta < 2/(3b)$. The state variable therefore converges to the system equilibrium (which coincides with the Nash equilibrium) at an exponential rate. We find this convergence result particularly appealing for practical applications because it is much faster than the best previously known convergence rate of no-regret algorithms [35]. A step-by-step derivation of the convergence rate (10) and the N-player linear price case are provided in Appendix B1 and B2.
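The stability condition can also be verified numerically by forming the state matrix and computing its spectrum. The matrix below follows the two-player linear-price model with no costs, with illustrative $b$ and $\eta$ (our choices); the eigenvalues match the closed forms $1 - \eta b$, $1 - 3\eta b$, $1 - 2\eta b$ (twice) and lie inside the unit circle:

```python
import numpy as np

# State matrix of the linear dynamics z^{t+1} = A z^t + const, for
# z = (mu_1, mu_2, sigma_1, sigma_2); illustrative b and step size eta.
b, eta = 1.0, 0.1
A = np.array([
    [1 - 2*eta*b,   -eta*b,      0.0,         0.0        ],
    [  -eta*b,    1 - 2*eta*b,   0.0,         0.0        ],
    [   0.0,         0.0,     1 - 2*eta*b,    0.0        ],
    [   0.0,         0.0,        0.0,      1 - 2*eta*b   ],
])
eigs = np.linalg.eigvals(A)
rho = np.max(np.abs(eigs))   # spectral radius; stability needs rho < 1
```

With $b = 1$ and $\eta = 0.1$ this gives eigenvalues $\{0.9, 0.7, 0.8, 0.8\}$, consistent with the stability window $0 < \eta < 2/(3b)$.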

IV-B System dynamics with general concave price functions

Here, we extend the previous convergence results to general Cournot games satisfying (A1)-(A3). The analysis is in two stages: we first show that the variance of the policy shrinks to zero; then we prove the mean converges to the Nash equilibrium.

Lemma 2 (Variance shrinkage).

For an N-player Cournot game, suppose all players follow Gaussian policies with at least two degrees of freedom (i.e., mean and variance) and use policy gradient to update the policy parameters. Starting from any initial parameters, the variance of the Gaussian policies will converge to zero at an exponential rate.

The proof of Lemma 2 can be found in Appendix B3. Lemma 2 basically says that after sufficiently many iterations, each player follows a deterministic policy with standard deviation equal to zero. Next, we analyze the dynamics of the mean $\mu_i$ (equivalently, the actual action $x_i$).

Suppose each player changes his strategy at a rate proportional to the gradient of his payoff function, and stack the dynamics of all players together:

$$x^{t+1} = x^t + \eta\, g(x^t), \quad \text{where } g(x) = \big(\nabla_{x_1}\pi_1(x), \dots, \nabla_{x_N}\pi_N(x)\big), \qquad (11)$$

where $\eta$ is the step size to be selected later. By the mean value theorem, we have

$$g(x^t) - g(x^\star) = J(\xi)\,(x^t - x^\star), \qquad (12)$$

where $J(\xi)$ is the Jacobian matrix of $g$ evaluated at some point $\xi$. Combining (11) and (12), and noting $g(x^\star) = 0$, leads to

$$x^{t+1} - x^\star = \big(I + \eta J(\xi)\big)(x^t - x^\star). \qquad (13)$$
We would need the following lemma for the stability analysis, and the detailed proof is deferred to Appendix B4.

Lemma 3.

For an N-player Cournot game satisfying (A1)-(A3), let $J(x)$ denote the Jacobian matrix of the system dynamics $g(x)$; then $J(x)$ is negative definite for all $x$.

Similar to Theorem 10 in [30], since $J$ is negative definite by Lemma 3, a proper choice of the step size $\eta$ makes the induced matrix norm $\|I + \eta J\|$ strictly less than one. Using such a step size gives:

$$\|x^{t+1} - x^\star\| \le \rho\, \|x^t - x^\star\|, \quad \text{with } \rho = \|I + \eta J\| < 1. \qquad (14)$$

It therefore follows from (14) that $x^t \to x^\star$, where $x^\star$ is an equilibrium point satisfying $g(x^\star) = 0$. Meanwhile, the system equilibrium condition $\nabla_{x_i}\pi_i(x^\star) = 0$ for all $i$ is exactly the (first-order) Nash equilibrium condition. Theorem 2 follows by combining this result with Lemma 2 on the variances of the policies.

In terms of convergence rate, the evolution of $x^t - x^\star$ follows the linear dynamics (13); thus the convergence happens at an exponential rate [7], which depends on the eigenvalues of the system state matrix (similar to the convergence rate analysis in Section IV-A).
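A small sketch of the gradient-play dynamics for a concave, strictly decreasing price function; the specific price, initial point, and step size are our illustrative choices. At convergence, the first-order Nash condition $g(x^\star) = 0$ holds componentwise:

```python
import numpy as np

# Simultaneous gradient play x^{t+1} = x^t + eta * g(x^t), where
#   g_i(x) = d pi_i / d x_i = p(S) + x_i * p'(S)   (zero production costs),
# for an illustrative concave, strictly decreasing price p(s) = 2 - s - 0.1*s^2.
p  = lambda s: 2 - s - 0.1 * s**2
dp = lambda s: -1 - 0.2 * s

x = np.array([1.5, 0.2, 0.8])   # arbitrary initial productions
eta = 0.1
for _ in range(2000):
    S = x.sum()
    g = p(S) + x * dp(S)                  # payoff gradients of all players
    x = np.clip(x + eta * g, 0.0, None)   # stay in the feasible set

# At the fixed point, g(x*) = 0: the first-order Nash condition.
```

Since the players are symmetric here, the iterates also converge to a symmetric equilibrium, consistent with the negative-definiteness argument of Lemma 3.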

We close this section with two remarks. First, it should be noted that in practice, players only observe their realized payoffs, rather than the exact form of the payoff gradient. Following the policy gradient theorem [34], the sample average of the observed payoffs forms an unbiased estimator of the gradient if all players follow their own policies and take actions independently. The sample-average estimate can be imperfect and accompanied by noise (random error, systematic error, or otherwise); a detailed discussion of convergence robustness to noisy gradient estimation can be found in [4]. Second, some players may decide not to follow a policy, or may act in adversarial manners. We provide empirical evaluations of the system's robustness in Section V, by assuming a small portion of players act randomly or adversarially.

V Numerical experiments

In this section, we look at the performance of the two classes of learning algorithms in various Cournot games. We first verify the convergence behavior and compare the convergence rates of the two algorithms. Next, we examine the robustness of policy gradient against random players with different strategies.

Setup We consider three-player games under different price and individual cost settings. G1: a linear price function without cost. G2: a quadratic price function without cost. G3: a linear price function with a cost for each player. In each game, the unique Nash equilibrium can be computed in closed form. In all of the games, each player simultaneously picks a production level; the price is determined by the corresponding price function and broadcast back to all players. This game is repeated multiple times, with all players using either no-regret algorithms or policy gradient. For the no-regret algorithm, we implemented EXP3 [2], a classic multiplicative-weights algorithm. For the policy gradient method, we assume all players use the natural policy gradient [18] with Gaussian policies. Algorithm implementation details are provided in Appendix C.
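For reference, a minimal EXP3 sketch over a discretized production grid, in the spirit of the no-regret baseline used here (the class name, grid size, exploration rate, and payoff rescaling are our choices and may differ from the actual implementation in Appendix C):

```python
import numpy as np

rng = np.random.default_rng(1)

class EXP3:
    """EXP3 over a discretized production grid (illustrative parameters).

    Payoffs must be rescaled to [0, 1] before the weight update; the
    importance-weighted estimate reward / prob keeps the update unbiased
    under bandit feedback.
    """
    def __init__(self, grid, gamma=0.05):
        self.grid = np.asarray(grid, dtype=float)
        self.gamma = gamma
        self.w = np.ones(len(self.grid))

    def act(self):
        probs = ((1 - self.gamma) * self.w / self.w.sum()
                 + self.gamma / len(self.w))          # mix in exploration
        self.k = rng.choice(len(self.w), p=probs)
        self.p = probs[self.k]
        return self.grid[self.k]

    def update(self, reward01):
        est = reward01 / self.p                       # importance weighting
        self.w[self.k] *= np.exp(self.gamma * est / len(self.w))

# Self-play sketch on a G1-style game: linear price, zero cost.
players = [EXP3(np.linspace(0, 1, 11)) for _ in range(3)]
for _ in range(200):
    acts = [pl.act() for pl in players]
    price = max(0.0, 1 - sum(acts))                   # bandit feedback only
    for pl, xi in zip(players, acts):
        pl.update(min(1.0, xi * price))
```

Consistent with the experiments, the discretized EXP3 play is noisy and converges slowly relative to policy gradient.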

Fig. 2: Convergence behavior of various 3-player Cournot games. G1: (a)-(d); G2: (e)-(h); G3: (i)-(l). The first two columns plot the actual action sequences of EXP3 and policy gradient, respectively. The third column shows the accumulated regret of all players, where the solid line is the average regret and the shaded area represents the variation across players. The final column shows the robustness of policy gradient against a random player with different strategies.

Convergence verification Fig. 2 shows that the joint actions converge to the Nash equilibrium under both algorithms. However, the actions under the no-regret algorithm are highly random and converge rather slowly (Fig. 2, first column). Under policy gradient, the actions converge much more quickly to the NE (Fig. 2, second column), confirming the theoretical results on convergence and rate.

Accumulated regret We now compare the dynamics of the two algorithms in terms of the accumulated regret (Fig. 2, third column). Note that here we define “regret” as the difference between the payoff collected by always playing the Nash equilibrium and that of the deployed algorithm. The accumulated regret under policy gradient quickly converges to a nearly constant level, while the accumulated regret of EXP3 continues to grow, with large variance across players.

Robustness In the proof of Theorem 2, we assume that every player in the game follows the policy gradient algorithm to bid. In practice, however, some players may deviate from their policies. The last column of Fig. 2 shows what happens when a player deviates from its “rational” policy: for all game settings, the remaining players still converge to the Nash equilibrium of the reduced Cournot game among themselves. There may also be settings where players are malicious, but designing optimal adversarial tactics and the corresponding detection algorithms are advanced topics with a vast body of literature of their own, and are beyond the scope of this work.

VI Conclusion

In this paper, we study the interaction of strategic players in Cournot games. We prove the convergence of two widely used classes of algorithms, namely no-regret algorithms and policy gradient, to the Nash equilibrium. In addition, we compare the convergence rates of these two classes of algorithms, and demonstrate that by taking advantage of the game structure, myopic algorithms such as policy gradient can achieve a much faster (exponential) convergence rate, compared with the (sub)linear rate of no-regret algorithms.


  • [1] S. Arora, E. Hazan, and S. Kale (2012) The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing 8 (1), pp. 121–164. Cited by: §I-A, §I.
  • [2] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire (2002) The nonstochastic multiarmed bandit problem. SIAM journal on computing 32 (1), pp. 48–77. Cited by: §II-B, §III, §V, Appendix C. Details on algorithm implementation for Section 5.
  • [3] S. Bervoets, M. Bravo, and M. Faure (2018) Learning with minimal information in continuous games. arXiv preprint arXiv:1806.11506. Cited by: §I-A, §I-A.
  • [4] M. Bravo, D. Leslie, and P. Mertikopoulos (2018) Bandit learning in concave n-person games. In Proceedings of the 32Nd International Conference on Neural Information Processing Systems, NIPS’18, , pp. 5666–5676. External Links: Link Cited by: §I-A, §I-A, §IV-B.
  • [5] L. Bu, R. Babu, B. De Schutter, et al. (2008) A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 38 (2), pp. 156–172. Cited by: §I-A.
  • [6] L. Buşoniu, R. Babuška, and B. De Schutter (2010) Multi-agent reinforcement learning: an overview. In Innovations in Multi-Agent Systems and Applications - 1, D. Srinivasan and L. C. Jain (Eds.), pp. 183–221. External Links: ISBN 978-3-642-14435-6, Document, Link Cited by: §I.
  • [7] C. Chen (1998) Linear system theory and design. Oxford University Press, Inc.. Cited by: §IV-B.
  • [8] J. Cohen, A. Héliou, and P. Mertikopoulos (2017) Learning with bandit feedback in potential games. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, , pp. 6372–6381. External Links: ISBN 978-1-5108-6096-4, Link Cited by: §I-A, §I-A.
  • [9] A. Cournot (1838) Recherches sur les principes mathématiques de la théorie des richesses par augustin cournot. chez L. Hachette. Cited by: §I, §I.
  • [10] D. J. Foster, Z. Li, T. Lykouris, K. Sridharan, and É. Tardos (2016) Learning in games: robustness of fast convergence. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, pp. 4734–4742. External Links: ISBN 978-1-5108-3881-9, Link Cited by: §I-A, §I-A.
  • [11] M. Guériau, R. Billot, N. El Faouzi, S. Hassas, and F. Armetta (2015) Multi-agent dynamic coupling for cooperative vehicles modeling. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI’15, pp. 4276–4277. External Links: ISBN 0-262-51129-0, Link Cited by: §I.
  • [12] S. Hart and A. Mas-Colell (2000) A simple adaptive procedure leading to correlated equilibrium. Econometrica 68 (5), pp. 1127–1150. Cited by: §I-A, §III.
  • [13] E. Hazan and S. Kale (2007) Computational equivalence of fixed points and no regret algorithms, and convergence to equilibria. In Proceedings of the 20th International Conference on Neural Information Processing Systems, NIPS’07, pp. 625–632. External Links: ISBN 978-1-60560-352-0, Link Cited by: §I-A, §III.
  • [14] E. Hazan et al. (2016) Introduction to online convex optimization. Foundations and Trends® in Optimization 2 (3-4), pp. 157–325. Cited by: §I-A, §I, §II-B, §III.
  • [15] D. Isern and A. Moreno (2016) A systematic literature review of agents applied in healthcare. Journal of medical systems 40 (2), pp. 43. Cited by: §I.
  • [16] K. Jamieson (February 2018) Non-stochastic bandits. https://courses.cs.washington.edu/courses/cse599i/18wi/resources/lecture5/lecture5.pdf. Cited by: Appendix C. Details on algorithm implementation for Section 5.
  • [17] R. Johari and J. N. Tsitsiklis (2005) Efficiency loss in cournot games. Harvard University. Cited by: §I, §II-A, §II-A.
  • [18] S. Kakade (2001) A natural policy gradient. In Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic, NIPS’01, , pp. 1531–1538. External Links: Link Cited by: §I, §V, Appendix C. Details on algorithm implementation for Section 5.
  • [19] A. Kalai and S. Vempala (2005) Efficient algorithms for online decision problems. Journal of Computer and System Sciences 71 (3), pp. 291–307. Cited by: §I-A, §I, §II-B, §III.
  • [20] D. S. Kirschen and G. Strbac (2004) Fundamentals of power system economics. Vol. 1, Wiley Online Library. Cited by: §I-A, §I, §II-A.
  • [21] M. Lanctot, V. Zambaldi, A. Gruslys, A. Lazaridou, K. Tuyls, J. Pérolat, D. Silver, and T. Graepel (2017) A unified game-theoretic approach to multiagent reinforcement learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, pp. 4193–4206. External Links: ISBN 978-1-5108-6096-4, Link Cited by: §I.
  • [22] D. S. Leslie and E. J. Collins (2005) Individual q-learning in normal form games. SIAM Journal on Control and Optimization 44 (2), pp. 495–514. Cited by: §I-A.
  • [23] S. Levine and V. Koltun (2013) Guided policy search. In Proceedings of the 30th International Conference on Machine Learning, ICML’13, pp. 1–9. Cited by: §IV.
  • [24] F. Lian, J. Moyne, and D. Tilbury (2002) Network design consideration for distributed control systems. IEEE Transactions on Control Systems Technology 10 (2), pp. 297–307. Cited by: §I.
  • [25] P. Milgrom and J. Roberts (1990) Rationalizability, learning, and equilibrium in games with strategic complementarities. Econometrica 58 (6), pp. 1255–77. Cited by: §I.
  • [26] P. Milgrom and J. Roberts (1991) Adaptive and sophisticated learning in repeated normal form games. In Games and Economic Behavior, pp. 82–100. Cited by: §I.
  • [27] R. M. Murray (2007-01) Consensus and cooperation in networked multi-agent systems. Proceedings of the IEEE 95 (1), pp. 215–233. External Links: Document, ISSN 0018-9219 Cited by: §I.
  • [28] M. Pipattanasomporn, H. Feroze, and S. Rahman (2009) Multi-agent systems in a distributed smart grid: design and implementation. In 2009 IEEE/PES Power Systems Conference and Exposition, pp. 1–8. Cited by: §I.
  • [29] E. L. Quinn (2009) Privacy and the new energy infrastructure. Available at SSRN 1370731. Cited by: §I.
  • [30] J. B. Rosen (1965) Existence and uniqueness of equilibrium points for concave n-person games. Econometrica: Journal of the Econometric Society, pp. 520–534. Cited by: §II-A, §IV-B, A2. Proof of Theorem 1.
  • [31] T. Roughgarden (2016) Twenty lectures on algorithmic game theory. Cambridge University Press. Cited by: Fig. 1.
  • [32] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller (2014) Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning, ICML’14, pp. 387–395. External Links: Link Cited by: §II-B, §IV.
  • [33] P. Stone and M. Veloso (2000-06) Multiagent systems: a survey from a machine learning perspective. Autonomous Robots 8 (3), pp. 345–383. External Links: ISSN 0929-5593, Link, Document Cited by: §I.
  • [34] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour (1999) Policy gradient methods for reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Neural Information Processing Systems, NIPS’99, pp. 1057–1063. External Links: Link Cited by: §I-A, §I, §II-B, §IV-B, §IV.
  • [35] V. Syrgkanis, A. Agarwal, H. Luo, and R. E. Schapire (2015) Fast convergence of regularized learning in games. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS’15, pp. 2989–2997. External Links: Link Cited by: §I-A, §I-A, §I-B, §III, §IV-A.
  • [36] T. Ui (2008) Correlated equilibrium and concave games. International Journal of Game Theory 37 (1), pp. 1–13. Cited by: A2. Proof of Theorem 1.
  • [37] A. W. Van der Vaart (2000) Asymptotic statistics. Vol. 3, Cambridge university press. Cited by: §III.
  • [38] H. Wang and B. Zhang (2018) Energy storage arbitrage in real-time markets via reinforcement learning. In 2018 IEEE Power & Energy Society General Meeting (PESGM), pp. 1–5. Cited by: §I-A.
  • [39] B. Zhang, R. Johari, and R. Rajagopal (2015) Competition and coalition formation of renewable power producers. IEEE Transactions on Power Systems 30 (3), pp. 1624–1632. Cited by: §I-A, §II-A.
  • [40] Z. Zhou, P. Mertikopoulos, S. Athey, N. Bambos, P. Glynn, and Y. Ye (2018) Learning in games with lossy feedback. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, pp. 5140–5150. External Links: Link Cited by: §I-A, §I-A.


Appendix A. Proofs for Section 3: Convergence of no-regret algorithms

A1. Proof of Lemma 1


Consider the first player. In each game iteration $t$, let $a^t = (a_1^t, \dots, a_N^t)$ be the moves played by all the players, and let $a_{-1}^t$ denote the actions played by all players except player 1. From player 1's point of view, the payoff obtained at time $t$ is the following:

By the definition of regret,

Rewriting this in terms of the original utility function and scaling by the number of iterations, we get,


In addition, for every swap function, it follows that:


The inequality holds because the payoff function is concave with respect to the player's own action, and the players' actions are independent by assumption.

Denote by $\sigma_T$ the empirical distribution of the played strategies up to iteration $T$, i.e., the distribution that puts a probability mass of $1/T$ on each joint action $a^t$ for $t = 1, \dots, T$. Then the above two inequalities can be combined as,


Similar inequalities hold for all the other players. Since we assume that all players use no-regret algorithms, each player's average regret vanishes as the number of iterations grows. Thus the empirical distribution of play converges to a correlated equilibrium that satisfies the following condition:
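To make the premise of this argument concrete, the following sketch runs gradient play, which is no-regret for concave payoffs, in a hypothetical 2-player Cournot game with linear price p(q) = a − b·q and linear cost c·x, and measures player 1's time-averaged regret against the best fixed action in hindsight. The parameters a, b, c, the step size, and the horizon are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Illustrative 2-player Cournot game with linear price p(q) = a - b*q and
# linear cost c*x; all parameters below are assumptions for this sketch.
a, b, c, eta, T = 10.0, 1.0, 2.0, 0.05, 2000

def payoff(x, y):
    # Player 1's utility: own quantity times market price, minus cost.
    return x * (a - b * (x + y)) - c * x

x, y = 0.0, 5.0
ys, realized = [], []
for t in range(T):
    # Gradient play: each player ascends its own payoff gradient.
    gx = a - c - b * (2 * x + y)
    gy = a - c - b * (2 * y + x)
    x, y = x + eta * gx, y + eta * gy
    ys.append(y)
    realized.append(payoff(x, y))

# Best fixed action in hindsight for player 1 (a concave quadratic in x,
# since the payoff is linear in the opponent's action).
best_fixed = (a - c - b * np.mean(ys)) / (2 * b)
hindsight = np.mean([payoff(best_fixed, yt) for yt in ys])
avg_regret = hindsight - np.mean(realized)
print(f"average regret after T={T}: {avg_regret:.6f}")
```

As the horizon grows, the average regret shrinks toward zero, consistent with the empirical distribution of play approaching the correlated equilibrium set.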


A2. Proof of Theorem 1


Consider an N-player Cournot game satisfying assumptions (A1)-(A3), that is, the price function is concave and strictly decreasing, and each individual cost function is convex. The payoff function of each player is the following:

which is strictly concave in the player's own action. This can be shown by taking the second derivative of the payoff function with respect to that action:

By Rosen’s definition in [30], if all players’ payoff functions are strictly concave and their strategy sets are convex and compact, then the game is a concave N-player game that has a unique Nash equilibrium.
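For the special case of a linear price function and identical linear costs, the unique Nash equilibrium admits a closed form, which offers a quick numerical check of the concavity argument. The sketch below assumes p(q) = a − b·q and cost c·x_i; these parameters and the symmetric solution x_i* = (a − c)/(b(N+1)) are standard for this linear special case, not taken from the paper.

```python
import numpy as np

# Hypothetical N-player Cournot game with linear price p(q) = a - b*q and
# identical linear costs c*x_i; parameters are assumptions for illustration.
N, a, b, c = 5, 10.0, 1.0, 2.0

# Closed-form symmetric Nash equilibrium of this linear game.
x_star = np.full(N, (a - c) / (b * (N + 1)))

# First-order condition: each player's own-action gradient,
# a - c - b*sum(x) - b*x_i, vanishes at x_star.
for i in range(N):
    grad_i = a - c - b * x_star.sum() - b * x_star[i]
    assert abs(grad_i) < 1e-9

# The second derivative w.r.t. a player's own action is -2b < 0,
# so each payoff is strictly concave in that action.
print("NE quantity per player:", x_star[0])
```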

For any joint action other than the Nash equilibrium, since the payoff function is strictly concave,


Therefore, there exists a player for which the inequality is strict. Rearranging the terms, we have,


where at the Nash equilibrium, no player could gain by unilaterally changing its action. Thus there exists a player such that:


We prove by contradiction that the Nash equilibrium is the only correlated equilibrium in concave Cournot games. Define a probability density function over the random joint action that assigns positive probability to some measurable set not containing the NE. We first assume this density is a correlated equilibrium, and then show that such an assumption leads to a contradiction with the definition of correlated equilibrium.

Following (21), for each Cournot game satisfying (A1)-(A3), there exists a player such that:


If a function is differentiable, its directional derivative equals the limit of the corresponding difference quotients. Applying this to the payoff function, the following equation holds:


Therefore, there exists a sufficiently small constant such that


Fix such a constant and define a measurable set,

We define an indicator function that takes value 1 on this set and 0 otherwise. Then, it follows that


Next, we define the deviation function as,


Therefore, we have

Equivalently, we have found a measurable deviation function such that,


which contradicts the definition of correlated equilibrium. Thus the assumed distribution (and any distribution that assigns non-zero probability to a measurable set not containing the NE) is not a correlated equilibrium. ∎

Paper [36] states a similar theorem under a weaker condition (the utility function is weighted monotonically decreasing). Our proof is in the same spirit as the proofs of Propositions 4 and 5 in [36], but is more concise when applied to Cournot games under assumptions (A1)-(A3).

Appendix B. Proofs for Section 4: Convergence of policy gradient algorithm

B1. Convergence rate analysis for 2-player Cournot game with linear price (Section 4.1)

We analyze the convergence rate of the linear system $x^{t+1} = A x^t + w$, where $x^t$ is the system state, $A$ is the system matrix, and $w$ is a constant input vector.

The system state at time $t$ is:


If $A$ is stable, i.e., all eigenvalues of $A$ are within the unit circle, we have,


From the system equilibrium condition, we have $x^* = A x^* + w$. Together with (28)-(29),


Thus we can bound the distance between the state $x^t$ and the equilibrium $x^*$,


where the constant and the geometric decay rate are determined by the eigenvalues of $A$ and the initial condition.
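The geometric rate can be checked numerically. The sketch below assumes the 2-player gradient-play dynamics with linear price slope b take the form x_{t+1} = A x_t + w with A = I − ηb(I + 11ᵀ); this specific form of A and all parameter values are assumptions for illustration. The observed per-step contraction of the distance to equilibrium matches the spectral radius of A.

```python
import numpy as np

# Assumed 2-player linear system from gradient play with linear price
# p(q) = a - b*q and linear cost c*x; parameters are illustrative only.
a, b, c, eta = 10.0, 1.0, 2.0, 0.05
A = np.eye(2) - eta * b * (np.eye(2) + np.ones((2, 2)))
w = eta * (a - c) * np.ones(2)

x_star = np.linalg.solve(np.eye(2) - A, w)   # equilibrium: x* = A x* + w
rho = np.max(np.abs(np.linalg.eigvals(A)))   # spectral radius < 1 => stable

x = np.array([0.0, 5.0])
dists = []
for t in range(200):
    x = A @ x + w
    dists.append(np.linalg.norm(x - x_star))

# Distances shrink geometrically; the late-run ratio approaches rho.
print("spectral radius:", rho)
print("late-run distance ratio:", dists[-1] / dists[-2])
```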

B2. Convergence analysis of N-player Cournot game with linear price function (Section 4.1)

Similar to the 2-player Cournot game example, suppose there are N players, and define the system state accordingly. The system dynamics equation is:

where the system matrix $A$ follows,

and $w$ is the constant input vector.

$A$ is a block matrix, and thus the eigenvalues of $A$ are the union of the eigenvalues of the upper left block and the eigenvalues of the lower right block.

The eigenvalues of the upper left block are a single repeated value. To calculate the eigenvalues of the lower right block, we have


where $I$ is the $N$-by-$N$ identity matrix, and $\mathbf{1}$ is the vector with all components equal to 1. Combining the above two equations, we have,


So every vector orthogonal to the all-ones vector $\mathbf{1}$ is an eigenvector of the lower right block, and the space spanned by these eigenvectors has dimension $N-1$. Besides, $\mathbf{1}$ itself is also an eigenvector, with a different eigenvalue. For the system to be asymptotically stable, all the eigenvalues should lie within the unit circle, i.e.,

Choosing a step size that satisfies the above inequalities ensures that the system is asymptotically stable.
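The predicted spectrum can be verified directly. The sketch below assumes the N-player gradient-play block takes the form I − ηb(I + 11ᵀ) with linear price slope b; this form, the step-size range η < 2/(b(N+1)), and the parameter values are assumptions for illustration.

```python
import numpy as np

# Assumed N-player stability check for the block I - eta*b*(I + 1 1^T);
# N, b, and the step size below are illustrative choices.
N, b = 8, 1.0
eta = 1.9 / (b * (N + 1))        # just inside the assumed stability range
A = np.eye(N) - eta * b * (np.eye(N) + np.ones((N, N)))

eigs = np.sort(np.linalg.eigvals(A).real)
# Predicted spectrum: 1 - eta*b*(N+1) once (eigenvector 1), and
# 1 - eta*b with multiplicity N-1 (eigenvectors orthogonal to 1).
predicted = np.sort([1 - eta * b * (N + 1)] + [1 - eta * b] * (N - 1))
print("max |eigenvalue|:", np.max(np.abs(eigs)))
```

With this step size the largest eigenvalue magnitude stays below one, so the dynamics are asymptotically stable; pushing η past 2/(b(N+1)) drives the eigenvalue associated with 𝟏 out of the unit circle.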

B3. Proof of Lemma 2: Variance shrinkage (Section 4.2)


We first write out player $i$'s expected payoff function using a Taylor series expansion:
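As background for the gradient estimate underlying this lemma, the following sketch illustrates the score-function (policy gradient) estimator for a Gaussian policy on a one-dimensional toy payoff. The payoff u(x) = −(x − 3)² and all parameters are illustrative assumptions, not the paper's Cournot payoff; the point is only that the sampled estimator matches the analytic gradient of the expected payoff.

```python
import numpy as np

# Score-function estimator for a Gaussian policy x ~ N(mu, sigma^2) on a
# toy concave payoff u(x) = -(x - 3)^2 (an assumption for this sketch).
rng = np.random.default_rng(0)
mu, sigma, M = 1.0, 0.5, 200_000

x = rng.normal(mu, sigma, M)
u = -(x - 3.0) ** 2
# Score of mu: d/dmu log N(x; mu, sigma^2) = (x - mu) / sigma^2
grad_est = np.mean(u * (x - mu) / sigma ** 2)

# Analytic gradient of the expected payoff: d/dmu E[u] = -2*(mu - 3)
grad_true = -2.0 * (mu - 3.0)
print("estimated:", grad_est, "analytic:", grad_true)
```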