I Introduction
Game-theoretic models have been used to describe the cooperative and competitive behaviors of a group of players in a wide range of systems, from robotics and distributed control to resource allocation [6, 33, 27, 21]. In this paper, we study the interaction of strategic players in Cournot games [9], one of the best-studied models of competition between self-interested agents in a system [17]. Cournot competition is the essential market model for many socioeconomic systems such as energy [28], transportation [11] and healthcare systems [15]. This competition can be thought of as multiple agents competing to satisfy an elastic demand by changing their production levels. For example, most of the US electricity market is built upon a Cournot competition model [20], where energy producers bid to serve the loads in the grid, and a market price is cleared based on the total supply and demand. Each producer’s payoff is the market price multiplied by its share of the supply. The goal of each producer is then to maximize its individual payoff by strategically choosing its production level.
Previous work on Cournot games has mostly focused on analyzing equilibrium behavior, especially Nash equilibria. However, this raises the question of how the players would reach an equilibrium when they do not start at one. This question was considered by Cournot himself in the original 1838 paper [9] for the case of two players. Since then, a rich set of results has generalized the conditions under which the players can reach a Nash equilibrium. For example, [25, 26] proposed a dynamic and adaptive behavior rule that reaches a Nash equilibrium for an arbitrary number of players and encompasses both best-response dynamics and Bayesian learning. However, most of the previously known results require full information about other players’ actions and the exact market price function, which is not usually available in practice. For many applications of interest, providing full feedback is either impractical (e.g., distributed control [24]) or explicitly disallowed due to privacy and market power concerns (e.g., energy markets [29]). Therefore, in this work, we move away from both the static view of assuming players are at a Nash equilibrium and the assumption that they have full knowledge of the system. Instead, we analyze the long-run outcome of the learning dynamics under limited information feedback, and ask the following two fundamental questions:

Will strategic learning agents reach an equilibrium in Cournot games?

If so, how quickly do they converge to the equilibrium?
Reasoning about these questions requires specifying the dynamics, which describe how players act before reaching the equilibrium. In particular, we consider player dynamics induced by learning algorithms that attempt to find the best strategies according to some objective. We consider two well-known families of algorithms. One class is based on regret minimization (i.e., no-regret algorithms [1, 14, 19]), which treats the environment and other players as adversarial. The other class is myopic learning algorithms (i.e., policy gradient [34, 18]), which seek to maximize the individual payoff in a stochastic system. It should be noted that these two classes of algorithms, as they are constructed from distinct starting points, may lead to very different behavior patterns and system dynamics. The first class of algorithms is more risk-averse and focuses on minimizing the hindsight regret, while the second class is less conservative and tries to maximize the expected payoff.
In terms of the information structure, we consider bandit feedback, where each player receives only limited information. In the Cournot game, this means that the players receive only the price (a single number) from the system and nothing else. They must therefore make decisions based only on the observed history of the price and their own actions. This is an important feature of many socioeconomic games, where each player has access only to local information (i.e., the realized payoffs) and is not informed about the attributes of the other participants.
I-A Related Work
Studying the behavior of players under limited feedback is a challenging problem that has recently started to receive attention. Understandably, most works focus on no-regret algorithms [8, 4, 40, 13, 35, 10, 3] because of their inherent robustness. In particular, the definition of a player’s no-regret dynamics translates directly into the coarse correlated equilibrium condition [12] for a wide range of algorithms (e.g., multiplicative weights [1], online mirror descent [14], Follow-the-Regularized-Leader [19]).
In terms of convergence speed, [35] and [10] showed that the players’ time-averaged history of joint play converges to the set of coarse correlated equilibria at a rate of for smooth games. As for convergence in distribution (rather than of the time average), [4, 3, 40] showed that joint actions generated by mirror descent algorithms converge to the Nash equilibrium in the class of variationally stable games, and [8] showed that the joint actions produced by no-regret learning with exponential weights converge to the Nash equilibrium in potential games. However, each of the aforementioned papers focuses on a specific class of no-regret algorithms and does not generalize to all no-regret dynamics. Consequently, establishing convergence to finer notions of equilibrium (e.g., Nash) for a broad class of no-regret algorithms remains open.
While the theoretical properties of no-regret algorithms can be attractive, they also limit the applicability of these algorithms. In practice, systems and agents are often not adversarial to each other, and the competition is often designed to have specific structures. The electricity market serves as a good example. The market clearing mechanism is designed specifically to be convex so that the dual variables can be used as prices (so-called locational marginal prices [20]), and different generators only care about maximizing their own payoffs [39]. No-regret algorithms can therefore lead to poor performance, since they cannot take advantage of the structure of the game.
Much less is known under the limited feedback situation beyond no-regret dynamics. In many games, it is more natural for players to use myopic policies that aim for profit maximization directly [38]. Since the action spaces in Cournot games are continuous, the action policy is commonly modeled and trained using policy gradient reinforcement learning [34]. These algorithms can lead to much better performance than no-regret algorithms, but proving their convergence has been challenging [5], since the coupling between the (continuous) actions of the players must be carefully analyzed. Attempts have been made to discretize the space and then study the resulting discrete game, but the dimensionality grows quickly and important features (e.g., convexity) are hard to retain [22].
I-B Our Contributions
In this work, we study the dynamics of both no-regret learning algorithms and policy gradient in concave Cournot games with bandit feedback. Our major contributions are in the following three aspects.
First, we prove that under assumptions where a unique Nash equilibrium (NE) exists, the joint distribution induced by any no-regret algorithm converges to the NE. This is a much sharper result than the standard one, in which the time-averaged joint distribution converges to a coarse correlated equilibrium.
Second, under the same assumptions, we show that the joint distribution induced by policy gradient, where players use Gaussian policies with at least two degrees of freedom (i.e., mean and variance), also converges to the NE. This is the first result (to the best of our knowledge) on the convergence of algorithms with continuous action spaces that do not fall in the no-regret class.
Third, we show that policy gradient converges at an exponential rate, which is much faster than the linear rate obtained via no-regret algorithms [35].
II Problem setup and preliminaries
II-A Model of Cournot Competition
In this section, we first review the classical Cournot game setup and then provide two motivating examples of its applications in modeling competition in infrastructure systems.
Definition 1 (Cournot Game).
Consider $N$ players with homogeneous products in a limited market, where the strategy space of player $i$ is its production level $x_i \in \mathcal{X}_i$. The utility function of player $i$ is denoted as $u_i(x_i, x_{-i}) = p\big(\sum_{j=1}^{N} x_j\big)\, x_i - C_i(x_i)$, where $p(\cdot)$ is the market clearing price (inverse demand) function that maps the total production quantity to a price and $C_i(\cdot)$ is the cost function of player $i$.
The goal of each player $i$ in the Cournot game is to choose the production quantity $x_i$ that maximizes its own utility $u_i$. An important concept in game theory is the Nash equilibrium, a state at which no player can increase its expected payoff via a unilateral deviation. A Nash equilibrium of the Cournot game is a vector $x^* = (x_1^*, \dots, x_N^*)$ such that for all $i$:
$$u_i(x_i^*, x_{-i}^*) \ge u_i(x_i, x_{-i}^*), \quad \forall x_i \in \mathcal{X}_i, \tag{1}$$
where $\mathcal{X}_i$ is the strategy set of player $i$ and $x_{-i}$ denotes the actions of all players except $i$. To ensure the existence and uniqueness of the Nash equilibrium in Cournot games, we make the following assumptions throughout the paper:
Assumption 1.
We assume a Cournot game has the following properties:

The individual strategy set $\mathcal{X}_i$ of each player $i$ is convex and compact, i.e., $\mathcal{X}_i = [0, y_i]$ where $y_i$ is finite. (A1)

The price function is concave, strictly decreasing and twice differentiable. (A2)

The individual cost function for each player is convex. (A3)
These assumptions are common in the literature (e.g., see [17]). It is straightforward to show that a Cournot game satisfying assumptions (A1)-(A3) is a concave N-player game [30], and thus a unique Nash equilibrium exists. Below, we briefly discuss two applications of the Cournot game in socioeconomic systems.
Example 1 (Wholesale Electricity Market)
The Cournot model is the most widely adopted framework for electricity market design [20]. Suppose there are $N$ electricity producers, each supplying the market with $x_i$ units of energy, up to the player’s production capacity $y_i$. In an uncongested grid, the electricity is priced as a decreasing function of the total amount $\sum_{j=1}^{N} x_j$. In practice, linear price and cost functions are commonly adopted, and the profit of generator $i$ is $u_i = \big(a - b \sum_{j=1}^{N} x_j\big)\, x_i - c_i x_i$, where $a, b > 0$ are the intercept and slope of the linear price function and $c_i$ represents the marginal production cost of generator $i$.
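As a small illustration of the linear case in Example 1, the sketch below computes the interior Nash equilibrium in closed form and evaluates a producer's profit. The price form `a - b * X`, the parameter names, and the function names are assumptions made for this example, and capacity limits are ignored, so the formula only applies when every resulting quantity is positive and below the producer's capacity.

```python
from typing import List


def linear_cournot_nash(a: float, b: float, costs: List[float]) -> List[float]:
    """Interior Nash equilibrium of a linear Cournot game (illustrative).

    Assumed model: price p(X) = a - b * X with X the total output, and
    cost c_i * x_i for producer i. Setting each producer's first-order
    condition a - b*X - b*x_i - c_i = 0 and summing over producers gives
    x_i = (a - (N + 1) * c_i + sum_j c_j) / (b * (N + 1)).
    Capacity limits are ignored.
    """
    n = len(costs)
    total_cost = sum(costs)
    return [(a - (n + 1) * c + total_cost) / (b * (n + 1)) for c in costs]


def profit(i: int, x: List[float], a: float, b: float, costs: List[float]) -> float:
    """Profit of producer i: market price times own output minus linear cost."""
    price = a - b * sum(x)
    return price * x[i] - costs[i] * x[i]
```

For instance, with `a = 10`, `b = 1` and marginal costs `[1, 2, 3]`, the first-order conditions are satisfied at quantities `[3, 2, 1]`, and any unilateral change by one producer lowers that producer's profit.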
Example 2 (City Parking Planning [39])
Cournot competition can also be used to study oligopsonies, where demand competes for supply. For example, businesses compete for a fixed set of on-street parking spots in cities. For business , suppose it is allocated number of parking spaces. The number of vehicles that visit it is a random variable denoted by . If a vehicle successfully finds parking when customers try to visit, the business derives a certain amount of utility. If parking is not available, it gets no utility. Therefore business ’s utility is denoted as (concave w.r.t. ). However, an increase in parking spaces is correlated with the total traffic into downtown areas and may increase congestion, which we model with a price function . The expected payoff for business is then It turns out that there is an exact transformation between Cournot oligopolies and oligopsonies, and a unique Nash equilibrium exists when is convex [17].
II-B Review of learning dynamics
We now review the two families of learning methods that players could employ to adjust their actions, namely no-regret learning algorithms and policy gradient.
No-regret algorithms
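An online algorithm is called no-regret (or no-external-regret) if the difference between the total payoff it receives and that of the best fixed decision in hindsight is sublinear as a function of time [14]. Formally, fix a sequence of payoff vectors $r^1, \dots, r^T$; the regret of an algorithm $\mathcal{A}$ that plays $x^t$ at time $t$ is, after $T$ steps:
$$R_{\mathcal{A}}(T) = \max_{x \in \mathcal{X}} \sum_{t=1}^{T} r^t(x) - \sum_{t=1}^{T} r^t(x^t). \tag{2}$$
An algorithm is said to have no regret if, for every sequence of payoff vectors, the regret is sublinear, i.e., $R_{\mathcal{A}}(T) = o(T)$. Consider the case of $N$ players that engage in a repeated Cournot game, and suppose $x^1, \dots, x^T$ is a sequence of actions, where $x^t = (x_1^t, \dots, x_N^t)$ represents the production levels set by all players at time $t$. The regret of player $i$ at time $T$ is defined as $R_i(T) = \max_{x_i \in \mathcal{X}_i} \sum_{t=1}^{T} u_i(x_i, x_{-i}^t) - \sum_{t=1}^{T} u_i(x_i^t, x_{-i}^t)$, where $u_i$ is the utility function of player $i$ and $\mathcal{X}_i$ is player $i$’s action set, i.e., $\mathcal{X}_i = [0, y_i]$. A collection of algorithms satisfies the regret minimization property, e.g., EXP3 [2], Online Mirror Descent [14], and Follow the Regularized/Perturbed Leader [19].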
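To make the regret definition concrete, the following sketch evaluates a single player's external regret for a recorded history of play. It is an illustration rather than part of any learning algorithm: the function name and arguments are hypothetical, and the maximum over the continuous action set is approximated by a search over a finite list of candidate actions.

```python
from typing import Callable, Sequence


def external_regret(
    utility: Callable[[float, Sequence[float]], float],
    own_actions: Sequence[float],
    others_actions: Sequence[Sequence[float]],
    candidates: Sequence[float],
) -> float:
    """External regret of one player after T rounds of a repeated game.

    utility(x_i, x_minus_i) is the player's payoff; own_actions[t] and
    others_actions[t] are what was actually played in round t. The best
    fixed action in hindsight is searched over a finite candidate list,
    which only approximates the max over a continuous action set.
    """
    realized = sum(
        utility(x, others) for x, others in zip(own_actions, others_actions)
    )
    best_fixed = max(
        sum(utility(c, others) for others in others_actions) for c in candidates
    )
    return best_fixed - realized
```

A no-regret algorithm is one for which this quantity, divided by the number of rounds, goes to zero for every sequence of opponents' actions.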
Policy gradient
Policy gradient is a widely used reinforcement learning algorithm for problems with continuous action spaces [32]. The policy is usually modeled with a parameterized function $\pi_\theta(x \mid s)$, where $x$ is the action, $s$ is the state and $\theta$ is the policy parameter. In a repeated Cournot game, since the game is reset at each iteration, the policy can be reduced to a stateless version where we aim to find a stationary decision law $\pi_\theta(x)$. We abuse the notation slightly and use $\theta_i$ to denote the parameters associated with player $i$. Assuming all players follow the policy gradient algorithm$^{1}$ and take actions independently, the expected reward of player $i$ is:
$$J_i(\theta_i, \theta_{-i}) = \mathbb{E}_{x_i \sim \pi_{\theta_i},\, x_{-i} \sim \pi_{\theta_{-i}}}\big[u_i(x_i, x_{-i})\big], \tag{3}$$
where $u_i$ is the utility function of player $i$. Player $i$ aims to find the best parameter $\theta_i$, which maximizes the expected reward, with the following update rule with step size $\eta$:
$$\theta_i^{t+1} = \theta_i^{t} + \eta\, \nabla_{\theta_i} J_i(\theta_i^{t}, \theta_{-i}^{t}). \tag{4}$$
Following the policy gradient theorem [34], the gradient with respect to $\theta_i$ can be rewritten as:
$$\nabla_{\theta_i} J_i(\theta_i, \theta_{-i}) = \mathbb{E}\big[\nabla_{\theta_i} \log \pi_{\theta_i}(x_i)\, r_i\big], \tag{5}$$
where $r_i = u_i(x_i, x_{-i})$ is the observed payoff, and the sample average forms an unbiased estimator if all players follow their current policies to take actions.
$^{1}$ This is an important assumption for the convergence analysis. We provide some robustness analysis in Section V by assuming a small portion of players do not follow their policies, and the convergence still holds.
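The score-function form of the policy gradient can be sketched for a stateless Gaussian policy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name is hypothetical, the opponents' actions are folded into the `payoff` callable, and a sample-mean baseline is subtracted to reduce variance, which is a common heuristic that introduces a bias of order one over the number of samples.

```python
import random
from typing import Callable, Tuple


def gaussian_score_gradient(
    payoff: Callable[[float], float],
    mean: float,
    std: float,
    num_samples: int,
    rng: random.Random,
) -> Tuple[float, float]:
    """Monte Carlo policy-gradient estimate for a Gaussian policy N(mean, std^2).

    Uses the likelihood-ratio identity grad E[r] = E[r * grad log pi(x)]
    and returns estimates of dE[r]/dmean and dE[r]/dstd. Only sampled
    payoffs are used, never the analytic form of the payoff function.
    """
    samples = [rng.gauss(mean, std) for _ in range(num_samples)]
    rewards = [payoff(x) for x in samples]
    baseline = sum(rewards) / num_samples  # variance-reduction heuristic
    grad_mean = 0.0
    grad_std = 0.0
    for x, r in zip(samples, rewards):
        z = (x - mean) / std
        # d log pi / d mean = (x - mean) / std^2 = z / std
        grad_mean += (r - baseline) * z / std
        # d log pi / d std = ((x - mean)^2 - std^2) / std^3 = (z^2 - 1) / std
        grad_std += (r - baseline) * (z * z - 1.0) / std
    return grad_mean / num_samples, grad_std / num_samples
```

As a sanity check, for the quadratic payoff `(7 - x) * x` the expected reward is `7*mean - mean**2 - std**2`, so the exact gradients are `7 - 2*mean` and `-2*std`, and the estimates should approach these values as the number of samples grows.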
III Convergence of no-regret algorithms in Cournot games
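Consider $N$ players with actions $x = (x_1, \dots, x_N)$, and suppose that $D$ is some joint distribution over the action space of all players. We call $D$ a coarse correlated equilibrium (CCE) if, for every player $i$ and every fixed action $x_i' \in \mathcal{X}_i$:
$$\mathbb{E}_{x \sim D}\big[u_i(x_i, x_{-i})\big] \ge \mathbb{E}_{x \sim D}\big[u_i(x_i', x_{-i})\big], \tag{6}$$
where $u_i$ is the utility function of player $i$ and $\mathcal{X}_i$ is its action set. It is a well-known result that the above coarse correlated equilibrium is learnable if all players follow no-regret algorithms [13]. However, these results concern the convergence of the time-averaged (empirical) distribution, not the outcome distribution at any particular time.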
We first state a stronger result by strengthening the equilibrium notion to that of the correlated equilibrium (CE) for Cournot games satisfying (A1)-(A3):
Lemma 1.
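If each player chooses actions using a no-regret algorithm in a Cournot game, then the empirical distribution of all the players’ actions converges to a correlated equilibrium, i.e., to a distribution $D$ such that for every player $i$:
$$\mathbb{E}_{x \sim D}\big[u_i(x_i, x_{-i})\big] \ge \mathbb{E}_{x \sim D}\big[u_i(f(x_i), x_{-i})\big] \tag{7}$$
for any measurable function $f: \mathcal{X}_i \to \mathcal{X}_i$, where $u_i$ and $\mathcal{X}_i$ are the utility function and action set of player $i$.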
In this lemma and the rest of the paper, probabilistic convergence is always meant in the sense of convergence in distribution [37]. In general games, we know that no-regret learning procedures converge to the set of CCE, the largest equilibrium set of a game. Lemma 1 shows that for Cournot games with assumptions (A1)-(A3), every CCE is also a CE. The next theorem tightens the result even more by showing that the Nash equilibrium is the only correlated equilibrium in the defined Cournot games.
Theorem 1.
An N-player Cournot game satisfying (A1)-(A3) has a unique correlated equilibrium, which places probability one on the Nash equilibrium. If all players in the game follow no-regret algorithms, the actual sequence of play converges to the Nash equilibrium.
The detailed proofs of Lemma 1 and Theorem 1 are given in Appendix A. In short, we prove the above theorem by contradiction. We first assume that there exists a correlated equilibrium that assigns nonzero probability to a measurable set not containing the NE, and then show that such an assumption contradicts the definition of correlated equilibrium. The convergence result applies to all no-regret algorithms, without restriction to any specific subclass of algorithms.
Convergence rate.
The players’ joint actions generally converge to the set of CCE (the NE for Cournot games) at a sublinear rate [12] when using common no-regret algorithms such as EXP3 [2], Online Mirror Descent [14] and Follow the Regularized Leader [19] in their vanilla forms. The best previously known convergence result is [35], which holds for a class of no-regret algorithms with the RVU property defined in [35].
IV Convergence of policy gradient in Cournot games
Although the convergence result for no-regret algorithms is neat, in practice the players in Cournot games are more likely to face other similar “self-interested” players rather than adversarial ones. Instead of following regret-minimization schemes, a more natural choice would be to learn a decision policy that directly maximizes individual payoff, which leads to the following theorem.
Theorem 2.
For an N-player Cournot game, suppose all players follow Gaussian policies with at least two degrees of freedom (i.e., mean and variance) and use policy gradient to update the policy parameters. Starting from any initial parameters, the variance of the Gaussian policies will shrink to zero and the actual sequence of play converges to the Nash equilibrium at an exponential rate.
A Gaussian policy is a natural choice for continuous action spaces [34, 32], and this form also includes popular neural network policies [23], where the mean can be parameterized via a neural network. In cases where the Gaussian policy is too restrictive, one could use a more complex distribution, e.g., a mixture of Gaussians or a heavy-tailed distribution. In essence, for the convergence of policy gradient to the Nash equilibrium, we need a parameterized policy whose mean and variance are able to adapt independently. For the remaining part of this section, we first walk through the proof on a toy example with two players and a linear price in Section IV-A, which provides intuition on the convergence behavior. Then we generalize our analysis to concave Cournot games in Section IV-B.
IV-A System dynamics with linear price functions
Consider a two-player Cournot game, where each player uses a Gaussian policy to sample their actions: and . Suppose the price function is linear and there is no production cost. Then the payoff for player 1 is and the payoff for player 2 is . The expected payoffs of the players are:
Therefore, running policy gradient with fixed step size leads to the following dynamics,
(8)  
(9) 
We can write the above dynamics in a state-space representation , in which we define the system state , and is the state matrix. Solving the system characteristic equation, the eigenvalues of are , , , . A discrete-time linear system is asymptotically stable if and only if all eigenvalues of the state matrix are inside the unit circle, i.e., . Suppose we choose a proper step size such that the system is asymptotically stable. It is straightforward to check that the NE of the Cournot game is the only solution to the system equilibrium condition ; therefore it is the unique system equilibrium.
Convergence rate
Next, we analyze the convergence rate of the above dynamical system (8)-(9). Following linear system analysis, the distance between the system state at any given time and the equilibrium point is bounded by:
(10) 
where denotes the largest eigenvalue of (in terms of absolute value), and when . The state variable converges to the system equilibrium (which coincides with the Nash equilibrium) at an exponential rate. This convergence result is particularly appealing for practical applications because it is much faster than the best previously known convergence rate of no-regret algorithms [35]. A step-by-step derivation of the convergence rate (10) and the N-player linear price case are provided in Appendix B1 and B2.
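The linear two-player case can be simulated directly. The sketch below assumes, for illustration only, a price of the form $p = a - b(x_1 + x_2)$ with zero cost and Gaussian policies $x_i \sim \mathcal{N}(\mu_i, \sigma_i^2)$; the parameter names, the state ordering, and the use of exact expected gradients (instead of sampled ones) are choices made for this example. Under these assumptions $\mathbb{E}[u_1] = a\mu_1 - b(\mu_1^2 + \sigma_1^2) - b\mu_1\mu_2$, so the update is linear in the state $(\mu_1, \mu_2, \sigma_1, \sigma_2)$.

```python
from typing import List


def simulate_linear_duopoly(
    a: float, b: float, eta: float, state: List[float], steps: int
) -> List[List[float]]:
    """Exact-gradient policy dynamics for a two-player linear Cournot game.

    Assumed setup (illustrative): price a - b * (x1 + x2), zero cost,
    Gaussian policies, state = [mu_1, mu_2, sigma_1, sigma_2], fixed
    step size eta, and no projection onto the action set.
    """
    mu1, mu2, s1, s2 = state
    trajectory = [[mu1, mu2, s1, s2]]
    for _ in range(steps):
        g1 = a - 2.0 * b * mu1 - b * mu2  # dE[u_1]/dmu_1
        g2 = a - 2.0 * b * mu2 - b * mu1  # dE[u_2]/dmu_2
        mu1, mu2 = mu1 + eta * g1, mu2 + eta * g2
        # dE[u_i]/dsigma_i = -2 * b * sigma_i, so each sigma shrinks geometrically
        s1, s2 = s1 - 2.0 * b * eta * s1, s2 - 2.0 * b * eta * s2
        trajectory.append([mu1, mu2, s1, s2])
    return trajectory


def spectral_radius(b: float, eta: float) -> float:
    """Largest absolute eigenvalue of the state matrix under the assumed setup.

    The mean block I - eta*b*[[2, 1], [1, 2]] has eigenvalues 1 - 3*b*eta
    and 1 - b*eta; each sigma evolves with eigenvalue 1 - 2*b*eta.
    """
    return max(abs(1 - 3 * b * eta), abs(1 - b * eta), abs(1 - 2 * b * eta))
```

With `a = b = 1` and `eta = 0.2`, the spectral radius is 0.8, the means approach the equilibrium quantity `a / (3 * b)`, and the distance to the equilibrium shrinks by at least a factor of 0.8 per step, which is the exponential behavior described above.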
IV-B System dynamics with general concave price functions
Here, we extend the previous convergence results to general Cournot games satisfying (A1)-(A3). The analysis is in two stages: we first show that the variance of the policy shrinks to zero; then we prove that the mean converges to the Nash equilibrium.
Lemma 2 (Variance shrinkage).
For an N-player Cournot game, suppose all players follow Gaussian policies with at least two degrees of freedom (i.e., mean and variance) and use policy gradient to update the policy parameters. Starting from any initial parameters, the variance of the Gaussian policies will converge to zero at an exponential rate.
The proof of Lemma 2 can be found in Appendix B3. Lemma 2 suggests that after sufficiently many iterations, each player follows a deterministic policy with standard deviation equal to zero. Next, we analyze the dynamics of the mean (equivalently, the actual action ).
Suppose each player changes his strategy at a rate proportional to the gradient of his payoff function , and we stack the dynamics of all players together:
(11) 
where is the step size to be selected later. By the mean value theorem, we have
(12) 
where is the Jacobian matrix of evaluated at some point. Combining (11) and (12) leads to
(13) 
We need the following lemma for the stability analysis; the detailed proof is deferred to Appendix B4.
Lemma 3.
For an N-player Cournot game satisfying (A1)-(A3), let denote the Jacobian matrix of the system dynamics . Then is negative definite for all .
Similar to Theorem 10 in [30], the norm of is minimized by the choice , where is negative definite as proved in Lemma 3. Using this step size gives:
(14) 
It therefore follows from (14) that , and where is an equilibrium point that satisfies . Meanwhile, the system equilibrium point also satisfies , which is exactly the Nash equilibrium condition. Theorem 2 follows by combining this result with the result in Lemma 2 on the variances of the policies.
In terms of convergence rate, the evolution of follows the linear dynamics (13), thus the convergence happens at an exponential rate [7], which depends on the eigenvalues of the system state matrix (similar to the convergence rate analysis in Section IV-A).
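The mean dynamics for a general concave price can be illustrated with simultaneous gradient play, in which each player moves along the gradient of its own payoff. The sketch below is an assumption-laden example rather than the step-size rule of the analysis: it uses zero cost, a fixed step size, a common capacity for all players, and clipping to keep actions inside the action set; the function name is hypothetical.

```python
from typing import Callable, List


def gradient_play(
    price: Callable[[float], float],
    dprice: Callable[[float], float],
    x0: List[float],
    capacity: float,
    eta: float,
    steps: int,
) -> List[float]:
    """Simultaneous projected gradient ascent on each player's own payoff.

    Player i's payoff is price(X) * x_i with X = sum(x) (zero cost is
    assumed), so its partial derivative with respect to x_i is
    price(X) + x_i * dprice(X). Actions are clipped to [0, capacity]
    after every update.
    """
    x = list(x0)
    for _ in range(steps):
        total = sum(x)
        grads = [price(total) + xi * dprice(total) for xi in x]
        x = [min(capacity, max(0.0, xi + eta * g)) for xi, g in zip(x, grads)]
    return x
```

For example, with three players and the concave, decreasing price `1 - X**2`, the symmetric first-order condition `1 - 9*x**2 - 6*x**2 = 0` gives an equilibrium quantity of `sqrt(1/15)` per player, and the iterates approach that point for a small enough step size.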
We close this section with two remarks. First, it should be noted that in practice, players only observe their realized payoffs, rather than the exact form of the payoff gradient . Following the policy gradient theorem [34], the sample average of the observed payoffs forms an unbiased estimator of the gradient if all players follow their own policies and take actions independently. The sample average estimate could be imperfect and accompanied by noisy information (random error, systematic error, or otherwise). A detailed discussion of the robustness of convergence to noisy gradient estimation can be found in [4]. Second, some players may decide not to follow a policy or may act in an adversarial manner. We provide some empirical evaluations of the system robustness in Section V, by assuming a small portion of players are acting randomly or adversarially.
V Numerical experiments
In this section, we look at the performance of the two classes of learning algorithms in various Cournot games. We first verify the convergence behavior and compare the convergence rates of the two algorithms. Next, we examine the robustness of policy gradient against random players with different strategies.
Setup We consider three-player games under different price and individual cost settings. G1: linear price function without cost . The Nash equilibrium is . G2: quadratic price function without cost. The Nash equilibrium is . G3: linear price function with cost for player . The Nash equilibrium is . In all of the games, each player simultaneously picks a production level. The price is determined by the corresponding price function and broadcast back to all players. This game is repeated multiple times with all players using either no-regret algorithms or policy gradient. For the no-regret algorithm, we implemented EXP3 [2], a classic multiplicative weights algorithm. For the policy gradient method, we assume all players use the natural policy gradient [18] with Gaussian policies. The algorithm implementation details are provided in Appendix C.
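For readers who want to experiment with the no-regret side of this setup, the following is a minimal sketch of EXP3 applied to a discretized Cournot game. It is not the implementation from Appendix C: the action grid, the linear price `1 - X` with zero cost, the reward normalization to [0, 1], and all names are assumptions made for illustration.

```python
import math
import random
from typing import List


class Exp3:
    """EXP3 over a finite grid of production levels (rewards must lie in [0, 1])."""

    def __init__(self, num_actions: int, gamma: float, rng: random.Random) -> None:
        self.k = num_actions
        self.gamma = gamma
        self.rng = rng
        self.weights = [1.0] * num_actions

    def probabilities(self) -> List[float]:
        total = sum(self.weights)
        return [(1 - self.gamma) * w / total + self.gamma / self.k for w in self.weights]

    def act(self) -> int:
        return self.rng.choices(range(self.k), weights=self.probabilities())[0]

    def update(self, action: int, reward: float) -> None:
        estimate = reward / self.probabilities()[action]  # importance weighting
        self.weights[action] *= math.exp(self.gamma * estimate / self.k)
        top = max(self.weights)  # rescaling leaves the probabilities unchanged
        self.weights = [w / top for w in self.weights]


def simulate_exp3(
    num_players: int, grid: List[float], rounds: int, gamma: float, seed: int
) -> List[List[float]]:
    """Repeated game where each learner sees only its own realized payoff."""
    rng = random.Random(seed)
    learners = [Exp3(len(grid), gamma, rng) for _ in range(num_players)]
    x_max = max(grid)
    low = min(0.0, (1.0 - num_players * x_max) * x_max)  # worst-case payoff
    high = x_max  # payoff <= price * x <= 1 * x_max
    for _ in range(rounds):
        choices = [learner.act() for learner in learners]
        price = 1.0 - sum(grid[c] for c in choices)  # assumed linear price
        for learner, c in zip(learners, choices):
            payoff = price * grid[c]
            learner.update(c, (payoff - low) / (high - low))
    return [learner.probabilities() for learner in learners]
```

The function returns each player's final mixed strategy over the grid; plotting the sampled actions over time gives a rough picture of how noisy the play of such a learner is.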
Convergence verification Fig. 2 shows that the joint actions converge to the Nash equilibrium for both algorithms. However, the actions under the no-regret algorithm are extremely random and converge rather slowly (Fig. 2, first column). For policy gradient, the actions converge much more quickly to the NE (Fig. 2, second column), confirming the theoretical results on convergence and rate.
Accumulated regret We now compare the dynamics of the two algorithms in terms of the accumulated regret (Fig. 2, third column). Note that here we define the “regret” as the difference between the payoffs collected by always playing at the Nash equilibrium and those collected by the deployed algorithm. The accumulated regret under policy gradient quickly converges to a nearly constant level, while the accumulated regret of EXP3 continues to grow, with large variance across players.
Robustness In the proof of Theorem 2, we assume that every player in the game follows the policy gradient algorithm to bid. However, in practice, there may exist some players who deviate from their policies. The last column of Fig. 2 shows what happens when a player deviates from its “rational” policy. For all game settings, the remaining players still converge to the Nash equilibrium of the player Cournot game. There may also be settings where players are malicious, but designing optimal adversarial tactics and detection algorithms are advanced topics with a vast body of literature of their own, and are beyond the scope of this work.
VI Conclusion
In this paper, we study the interaction of strategic players in Cournot games. We prove the convergence of two widely used classes of algorithms, namely no-regret algorithms and policy gradient, to the Nash equilibrium. In addition, we compare the convergence rates of these two classes of algorithms, and demonstrate that by taking advantage of the game structure, myopic algorithms such as policy gradient can achieve a much faster (exponential) convergence rate, compared with the (sub)linear rate of no-regret algorithms.
References
 [1] (2012) The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing 8 (1), pp. 121–164.
 [2] (2002) The nonstochastic multiarmed bandit problem. SIAM Journal on Computing 32 (1), pp. 48–77.
 [3] (2018) Learning with minimal information in continuous games. arXiv preprint arXiv:1806.11506.
 [4] (2018) Bandit learning in concave n-person games. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, pp. 5666–5676.
 [5] (2008) A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 38 (2), pp. 156–172.
 [6] (2010) Multiagent reinforcement learning: an overview. In Innovations in Multi-Agent Systems and Applications - 1, D. Srinivasan and L. C. Jain (Eds.), pp. 183–221. ISBN 9783642144356.
 [7] (1998) Linear system theory and design. Oxford University Press, Inc.
 [8] (2017) Learning with bandit feedback in potential games. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, pp. 6372–6381. ISBN 9781510860964.
 [9] (1838) Recherches sur les principes mathématiques de la théorie des richesses par augustin cournot. chez L. Hachette.
 [10] (2016) Learning in games: robustness of fast convergence. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, pp. 4734–4742. ISBN 9781510838819.

 [11] (2015) Multiagent dynamic coupling for cooperative vehicles modeling. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI’15, pp. 4276–4277. ISBN 0262511290.
 [12] (2000) A simple adaptive procedure leading to correlated equilibrium. Econometrica 68 (5), pp. 1127–1150.
 [13] (2007) Computational equivalence of fixed points and no regret algorithms, and convergence to equilibria. In Proceedings of the 20th International Conference on Neural Information Processing Systems, NIPS’07, pp. 625–632. ISBN 9781605603520.
 [14] (2016) Introduction to online convex optimization. Foundations and Trends® in Optimization 2 (3-4), pp. 157–325.
 [15] (2016) A systematic literature review of agents applied in healthcare. Journal of Medical Systems 40 (2), pp. 43.
 [16] (February 2018) Nonstochastic bandits. https://courses.cs.washington.edu/courses/cse599i/18wi/resources/lecture5/lecture5.pdf.
 [17] (2005) Efficiency loss in Cournot games. Harvard University.
 [18] (2001) A natural policy gradient. In Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic, NIPS’01, pp. 1531–1538.
 [19] (2005) Efficient algorithms for online decision problems. Journal of Computer and System Sciences 71 (3), pp. 291–307.
 [20] (2004) Fundamentals of power system economics. Vol. 1, Wiley Online Library.
 [21] (2017) A unified game-theoretic approach to multiagent reinforcement learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, pp. 4193–4206. ISBN 9781510860964.
 [22] (2005) Individual Q-learning in normal form games. SIAM Journal on Control and Optimization 44 (2), pp. 495–514.

 [23] (2013) Guided policy search. In Proceedings of 30th International Conference on International Conference on Machine Learning, ICML’13, pp. 1–9.
 [24] (2002) Network design consideration for distributed control systems. IEEE Transactions on Control Systems Technology 10 (2), pp. 297–307.
 [25] (1990) Rationalizability, learning, and equilibrium in games with strategic complementarities. Econometrica 58 (6), pp. 1255–77. Cited by: §I.
 [26] (1991) Adaptive and sophisticated learning in repeated normal form games. In Games and Economic Behavior, pp. 82–100. Cited by: §I.
 [27] (200701) Consensus and cooperation in networked multiagent systems. Proceedings of the IEEE 95 (1), pp. 215–233. External Links: Document, ISSN 00189219 Cited by: §I.
 [28] (2009) Multiagent systems in a distributed smart grid: design and implementation. In 2009 IEEE/PES Power Systems Conference and Exposition, pp. 1–8. Cited by: §I.
 [29] (2009) Privacy and the new energy infrastructure. Available at SSRN 1370731. Cited by: §I.
 [30] (1965) Existence and uniqueness of equilibrium points for concave nperson games. Econometrica: Journal of the Econometric Society, pp. 520–534. Cited by: §IIA, §IVB, A2. Proof of Theorem 1.
 [31] (2016) Twenty lectures on algorithmic game theory. Cambridge University Press. Cited by: Fig. 1.
 [32] (2014) Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on International Conference on Machine Learning, ICML’14, pp. 387–395. External Links: Link Cited by: §IIB, §IV.
 [33] (200006) Multiagent systems: a survey from a machine learning perspective. Autonomous Robots 8 (3), pp. 345–383. External Links: ISSN 09295593, Link, Document Cited by: §I.
 [34] (1999) Policy gradient methods for reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Neural Information Processing Systems, NIPS’99, , pp. 1057–1063. External Links: Link Cited by: §IA, §I, §IIB, §IVB, §IV.
 [35] (2015) Fast convergence of regularized learning in games. In Proceedings of the 28th International Conference on Neural Information Processing Systems  Volume 2, NIPS’15, pp. 2989–2997. External Links: Link Cited by: §IA, §IA, §IB, §III, §IVA.
 [36] (2008) Correlated equilibrium and concave games. International Journal of Game Theory 37 (1), pp. 1–13. Cited by: A2. Proof of Theorem 1.
 [37] (2000) Asymptotic statistics. Vol. 3, Cambridge university press. Cited by: §III.
 [38] (2018) Energy storage arbitrage in real-time markets via reinforcement learning. In 2018 IEEE Power & Energy Society General Meeting (PESGM), pp. 1–5. Cited by: §IA.
 [39] (2015) Competition and coalition formation of renewable power producers. IEEE Transactions on Power Systems 30 (3), pp. 1624–1632. Cited by: §IA, §IIA.
 [40] (2018) Learning in games with lossy feedback. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, pp. 5140–5150. Cited by: §IA.
Appendix
Appendix A. Proofs for Section 3: Convergence of no-regret algorithms
A1. Proof of Lemma 1
Proof.
Consider the first player. In each game iteration t, let be the moves played by all the players. We use to denote the actions played by all players except player . From player 1’s point of view, the payoff he obtains at time is the following:
By the definition of regret,
Rewriting this in terms of the original utility function, and scaling by the number of iterations we get,
(15) 
In addition, for every swap function , it follows:
(16) 
where . The inequality holds because is concave with respect to , and all are independent by assumption.
Denote by the empirical distribution of the strategies played up to iteration , i.e., the distribution which puts a probability mass of on all pairs for . Then inequalities (15) and (16) can be combined as
(17) 
Similar inequalities hold for all other players . Since we assume that all players use no-regret algorithms, it follows that . Thus converges to a correlated equilibrium that satisfies the following condition:
(18) 
∎
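As an illustration of the lemma, the following minimal sketch runs projected online gradient ascent, a standard no-regret algorithm for concave payoffs, in a 2-player Cournot game. Everything specific here is an assumption made for the example and is not taken from the paper: the linear price `a - b * sum(x)`, the linear cost `c * x_i`, the parameter values, the step size `eta0 / sqrt(t)`, and the use of exact gradient feedback (a stronger feedback model than the limited feedback studied in the main text). Names such as `run_gradient_ascent` are hypothetical.

```python
import numpy as np


def cournot_gradients(x, a, b, c):
    """Derivative of each player's payoff x_i * (a - b * sum(x)) - c * x_i
    with respect to that player's own production x_i (assumed linear model)."""
    return a - b * x.sum() - b * x - c


def run_gradient_ascent(x0, a, b, c, x_max, n_iters, eta0):
    """Simultaneous projected online gradient ascent for all players.

    Returns the last joint action and the history of played actions,
    whose empirical distribution is the object studied in Lemma 1.
    """
    x = np.asarray(x0, dtype=float).copy()
    history = np.empty((n_iters, x.size))
    for t in range(1, n_iters + 1):
        step = eta0 / np.sqrt(t)  # decaying step size (an assumed schedule)
        # Every player moves along its own payoff gradient, then the joint
        # action is projected back onto the box [0, x_max].
        x = np.clip(x + step * cournot_gradients(x, a, b, c), 0.0, x_max)
        history[t - 1] = x
    return x, history


# Assumed example parameters: price 10 - Q, unit cost 1, two players.
a, b, c = 10.0, 1.0, 1.0
final_x, history = run_gradient_ascent(
    [0.0, 8.0], a, b, c, x_max=10.0, n_iters=5000, eta0=0.5
)
# Symmetric Nash equilibrium of the assumed 2-player linear game.
nash = (a - c) / (3.0 * b)
```

In this assumed setting, both the last iterate `final_x` and the mean of `history` can be compared against `nash` to check that the empirical play concentrates on the unique equilibrium.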
A2. Proof of Theorem 1
Proof.
Consider an N-player Cournot game satisfying assumptions (A1)–(A3), that is, the price function is concave and strictly decreasing, and each individual cost function is convex. The payoff function of each player is
which is strictly concave. This can be shown by taking the second derivative of w.r.t. ,
By Rosen’s result in [30], if all players’ payoff functions are strictly concave, and their strategy sets are convex and compact, then the game is a concave N-player game that has a unique Nash equilibrium .
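For reference, the second-derivative computation behind the strict-concavity claim can be sketched as follows. The notation is assumed for this sketch rather than taken from the source: $x_i \ge 0$ is player $i$'s production level, $p$ is the price function, $C_i$ is player $i$'s cost function, and both $p$ and $C_i$ are taken to be twice differentiable.

```latex
\pi_i(x_i, x_{-i}) = p\Big(\sum_{j=1}^{N} x_j\Big)\, x_i - C_i(x_i),
\qquad
\frac{\partial^2 \pi_i}{\partial x_i^2}
  = \underbrace{p''\Big(\sum_{j} x_j\Big)\, x_i}_{\le 0}
  + \underbrace{2\, p'\Big(\sum_{j} x_j\Big)}_{< 0}
  - \underbrace{C_i''(x_i)}_{\ge 0}
  < 0 .
```

The first term is nonpositive because $p$ is concave and $x_i \ge 0$, the second is strictly negative because $p$ is strictly decreasing, and the third is nonnegative because $C_i$ is convex.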
For any joint action , since is strictly concave,
(19) 
Therefore, there exists an such that . Rearranging the terms, we have
(20) 
where (at the Nash equilibrium, no one could gain by unilaterally changing his action). Thus there exists a player , such that:
(21) 
We prove by contradiction that the Nash equilibrium is the only correlated equilibrium in concave Cournot games. Define as a probability density function over the r.v. , such that there exists for some measurable set not containing the NE . We first assume that can be a correlated equilibrium, and then show that this assumption contradicts the definition of correlated equilibrium. Following (21), for each Cournot game satisfying (A1)–(A3), there exists a player such that:
(22) 
If a function is differentiable, we have . By setting , the following equation holds:
(23) 
Therefore, there exists , such that
(24) 
Set . Define a measurable set,
We define as an indicator function, where for and otherwise. Then, it follows
(25) 
Define as
(26) 
Therefore, we have
Equivalently, we have found a measurable function such that
(27) 
which contradicts the definition of correlated equilibrium. Thus the distribution (and any distribution that assigns nonzero probability to a measurable set not containing the NE ) is not a correlated equilibrium. ∎
Appendix B. Proofs for Section 4: Convergence of policy gradient algorithm
B1. Convergence rate analysis for 2-player Cournot game with linear price (Section 4.1)
We analyze the convergence rate of system , where the system state , system matrix and .
B2. Convergence analysis of N-player Cournot game with linear price function (Section 4.1)
Similar to the 2-player Cournot game example, suppose there are N players, and we define the system state . The system dynamics equation is:
where follows,
and .
is a block matrix of the form , and thus the eigenvalues of are the union of the eigenvalues of the upper left matrix and the eigenvalues of the lower right matrix .
The eigenvalues of are ( repeats). To calculate the eigenvalues of , we have
and
where is the by identity matrix, and is the vector with all components 1. Combining the above two equations, we have
(32) 
So every vector orthogonal to the all-ones vector is an eigenvector of with eigenvalue , and the space spanned by these eigenvectors has dimension . Besides, the all-ones vector is also an eigenvector of with eigenvalue . For the system to be asymptotically stable, all the eigenvalues must lie within the unit circle, i.e.,
Choosing the step size to satisfy would ensure that the above inequalities hold and that the system is asymptotically stable.
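The eigenvalue structure described above (one eigenvalue along the all-ones vector and a repeated eigenvalue on its orthogonal complement) can be checked numerically. The sketch below is a stand-in under explicit assumptions and is not the paper's system matrix: it uses plain gradient play `x <- x + step * grad` with a linear price `a - slope * sum(x)` and linear costs, whose update matrix is `I - step * slope * (I + 1 1^T)`. The function names and parameter values are hypothetical, and the resulting step-size bound may differ from the one derived in the paper.

```python
import numpy as np


def gradient_play_matrix(n_players, slope, step):
    """Update matrix of x <- x + step * grad for the assumed linear-price game.

    The Jacobian of the payoff gradients is -slope * (I + 1 1^T), so the
    linear dynamics matrix is I - step * slope * (I + 1 1^T).
    """
    identity = np.eye(n_players)
    all_ones = np.ones((n_players, n_players))  # the rank-one matrix 1 1^T
    return identity - step * slope * (identity + all_ones)


def predicted_eigenvalues(n_players, slope, step):
    """Closed-form eigenvalues: one along the all-ones vector, N - 1 orthogonal to it."""
    along_ones = 1.0 - step * slope * (n_players + 1)
    orthogonal = 1.0 - step * slope
    return np.sort([along_ones] + [orthogonal] * (n_players - 1))


def is_stable(n_players, slope, step):
    """True if every eigenvalue lies strictly inside the unit circle."""
    eigenvalues = np.linalg.eigvalsh(gradient_play_matrix(n_players, slope, step))
    return bool(np.max(np.abs(eigenvalues)) < 1.0)


# Assumed example: 5 players, price slope 1, step size 0.1.
numeric = np.sort(np.linalg.eigvalsh(gradient_play_matrix(5, 1.0, 0.1)))
predicted = predicted_eigenvalues(5, 1.0, 0.1)
```

For this assumed model the stability condition reduces to `0 < step < 2 / (slope * (n_players + 1))`, so `is_stable(5, 1.0, 0.1)` holds while a step of `0.4` violates the bound.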
B3. Proof of Lemma 2 Variance shrinkage (Section 4.2)
Proof.
We first write out player i’s expected payoff function using Taylor series: