1 Introduction
Scalable policy evaluation and learning have been long-standing challenges in multi-agent reinforcement learning (MARL), with two difficulties obstructing progress. First, joint-strategy spaces explode exponentially when a large number of strategic decision-makers is considered. Second, the underlying game dynamics may exhibit cyclic behavior (e.g., the game of Rock-Paper-Scissors), rendering an appropriate evaluation criterion non-trivial.
Focusing on the second challenge, much work in multi-agent systems has followed a game-theoretic treatment, proposing fixed points, e.g., the Nash equilibrium (Nash et al., 1950), as potentially valid evaluation metrics. Though appealing, such measures are normative only when prescribing behaviors of perfectly rational agents, an assumption rarely met in reality (Grau-Moya et al., 2018; Wen et al., 2019). In fact, many game dynamics have been proven not to converge to any fixed-point equilibria (Hart & Mas-Colell, 2003; Viossat, 2007), but rather to limit cycles (Palaiopanos et al., 2017; Bowling & Veloso, 2001). Apart from these inconsistencies, solving for a Nash equilibrium even in "simple" settings, e.g., two-player games, is known to be PPAD-complete (Chen & Deng, 2005), a demanding complexity class in terms of computational requirements.
To address some of the above limitations, Omidshafiei et al. (2019) recently proposed α-Rank as a graph-based game-theoretic solution to multi-agent evaluation. α-Rank adopts Markov Conley Chains to highlight the presence of cycles in game dynamics and attempts to compute stationary distributions as a means for strategy-profile ranking. Though successful in small-scale applications, α-Rank severely suffers in scalability, contrary to the polynomial-time claims made in Omidshafiei et al. (2019). In fact, we show that α-Rank exhibits exponential time and memory complexities, shedding light on the small-scale empirical study conducted in Omidshafiei et al. (2019), whereby the largest reported game included only four agents with four available strategies each.
In this work, we put forward αα-Rank as a scalable alternative for multi-agent evaluation with linear time and memory demands. Our method combines numerical optimization with evolutionary game theory for a scalable solver capable of handling large joint spaces with millions of strategy profiles. To handle even larger profiles, e.g., tens to hundreds of millions, we further introduce an oracle (McMahan et al., 2003) mechanism transforming joint evaluation into a sequence of incremental sub-games of varying sizes. Given our algorithmic advancements, we justify our claims in a large-scale empirical study involving systems with large numbers of possible strategy profiles. We first demonstrate the computational advantages of αα-Rank on stochastic matrices of varying sizes against other implementations in NumPy, PyTorch, and OpenSpiel (Lanctot et al., 2019). With these successes, we then consider experiments unsolvable by current techniques. Precisely, we evaluate multi-agent systems in self-driving and Ising-model scenarios, each exhibiting a prohibitively large strategy space (i.e., of the order of thousands for the former and tens of millions for the latter). Here, we again show that αα-Rank is capable of recovering correct strategy rankings in such complex domains.
2 α-Rank & Its Limitations
In α-Rank, strategy profiles of agents are evaluated through an evolutionary process of mutation and selection. Initially, agent populations are constructed by creating multiple copies of each learner, assuming that all agents in one population execute the same unified policy. With this, α-Rank then simulates a multi-agent game played by randomly sampled learners from each population. Upon game termination, each participating agent receives a payoff to be used in policy mutation and selection after its return to the population. Here, the agent faces a probabilistic choice between switching to the mutation policy, continuing to follow its current policy, or randomly selecting a novel policy (other than the previous two) from the pool. This process repeats with the goal of determining an evolutionarily strong profile that spreads across the population of agents. Each of the above three phases is demonstrated in Fig. 1 on a simple example of three agents, depicted by different symbols, each equipped with three strategies, depicted by the colors.
2.1 Mathematical Formalisation of α-Rank
We next formalize the process posed by α-Rank, which will reveal its limitations and also pave the way for our own proposed solution. We consider a set of agents, each having access to a strategy set of a given size. At each round of the evaluation process, we denote the strategy profile for an agent, representing the allowed policy of that learner over its set of states and its set of actions. With this, we define a joint strategy profile for all participating agents as the policies belonging to the joint strategy pool.
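To make the size of this joint pool concrete, it can be enumerated as a Cartesian product. The sketch below is our own illustration (function names are ours, not the paper's); its size grows as the per-agent strategy count raised to the number of agents, which is the source of the exponential blow-up discussed in Section 2.2.

```python
import itertools

def joint_pool(n_strategies: int, n_agents: int):
    """Enumerate the joint strategy pool: every assignment of one of
    n_strategies to each of n_agents, i.e. n_strategies ** n_agents profiles."""
    return list(itertools.product(range(n_strategies), repeat=n_agents))

# Four agents with four strategies each (the largest game reported for
# alpha-Rank) already yields 4 ** 4 = 256 joint profiles; ten agents with
# ten strategies each would yield 10 ** 10.
```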
To evaluate performance, we assume each agent is additionally equipped with a payoff (reward) function. Crucially, the domain of this payoff function is the pool of joint strategies so as to accommodate the effect of other learners on each player's performance, further complicating the evaluation process. Finally, given a joint profile, we define the corresponding joint payoff to be the collection of all individual payoff functions.
After attaining rewards from the environment, each agent returns to its population and faces a choice between switching to a mutation policy, exploring a novel policy, or sticking to the current one. Such a choice is probabilistic and defined proportional to rewards. Precisely, the agent adopts a switching rule governed by an exploration parameter (heuristically set in the original paper to a small positive constant so as to ensure at most two varying policies per population; theoretical justification can be found in Fudenberg & Imhof (2006)), the policies followed by the other agents at the current round, and an intensity ranking parameter. As noted in Omidshafiei et al. (2019), one can relate the above switching process to a random walk on a Markov chain with states defined as elements of the joint strategy pool
and transition probabilities through payoff functions. In particular, each entry in the transition probability matrix
refers to the probability of one agent switching from one policy to another in relation to attained payoffs. Precisely, consider any two joint strategy profiles that differ in only one individual strategy, i.e., there exists a unique agent whose strategy differs between the two profiles while all other agents' strategies coincide. We set the corresponding entry through the probability that one copy of this agent playing the new strategy invades the population in which all other agents (in that population) play the old one. Following Pinsky & Karlin (2010), such a probability is formalized as:
(1)
where the remaining parameter denotes the size of the population. So far, we have presented the relevant derivations for entries of the state-transition matrix when exactly one agent differs in exactly one strategy. Having one policy change, however, represents only a subset of the allowed variations, and two more cases need to be considered. First, we restrict our attention to variations in joint policies involving changes in more than one individual strategy; here we set the corresponding transition probability to zero (this assumption significantly reduces the analysis complexity, as detailed in Fudenberg & Imhof (2006)). Consequently, the remaining event of self-transitions can thus be written as one minus the sum of all outgoing transition probabilities. Summarizing the above three cases, we can write the entries of the Markov chain's transition matrix as:
(2)
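The three transition cases above can be sketched for a small game. This is our own illustrative construction: the invasion probability below uses a commonly assumed closed form for the fixation probability (in the spirit of Fudenberg & Imhof (2006)); the mutation proposal `eta`, the intensity `alpha`, and the population size `m` are our placeholder parameters, not the paper's exact choices.

```python
import itertools
import numpy as np

def invasion_prob(f_new, f_old, alpha=1.0, m=50):
    """Probability that one mutant copy playing the new strategy takes over
    a population of size m (Eqn. 1-style fixation probability; assumed form)."""
    d = f_new - f_old
    if abs(d) < 1e-12:
        return 1.0 / m  # neutral drift
    return (1.0 - np.exp(-alpha * d)) / (1.0 - np.exp(-alpha * m * d))

def transition_matrix(payoffs, n_strats, n_agents, alpha=1.0, m=50):
    """Markov chain over joint profiles (Eqn. 2): non-zero transitions only
    between profiles differing in exactly one agent's strategy; the diagonal
    absorbs the remaining probability mass (self-transitions)."""
    profiles = list(itertools.product(range(n_strats), repeat=n_agents))
    idx = {p: i for i, p in enumerate(profiles)}
    N = len(profiles)
    eta = 1.0 / (n_agents * (n_strats - 1))  # uniform proposal over deviations
    T = np.zeros((N, N))
    for p in profiles:
        i = idx[p]
        for k in range(n_agents):
            for s in range(n_strats):
                if s == p[k]:
                    continue
                q = p[:k] + (s,) + p[k + 1:]  # unilateral deviation by agent k
                T[i, idx[q]] = eta * invasion_prob(payoffs[k][q], payoffs[k][p],
                                                   alpha, m)
        T[i, i] = 1.0 - T[i].sum()  # self-transition
    return np.array(profiles), T
```

For a two-agent coordination game (payoff 1 when both pick the same strategy), each row of the resulting matrix sums to one, as a stochastic matrix must.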
The goal in α-Rank is to establish an ordering over policy profiles dependent on the evolutionary stability of each joint strategy. In other words, higher-ranked strategies are those that are prevalent in populations for longer average times. Formally, such a notion can be derived as the limiting vector of our Markov chain when evolving from an initial distribution. Knowing that the limiting vector is a stationary distribution, one can calculate strategy rankings as the solution to the following eigenvector problem:
(3)
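For intuition, the eigenvector problem in Eqn. 3 can be solved on small chains by plain power iteration, which is also the first traditional baseline discussed in Section 3. This is a generic sketch of ours (it assumes the chain is aperiodic, which holds here thanks to the self-transitions):

```python
import numpy as np

def stationary_power(T, tol=1e-10, max_iter=10_000):
    """Solve pi = T^T pi (Eqn. 3) by power iteration on the row-stochastic
    matrix T, renormalising at every step."""
    n = T.shape[0]
    pi = np.full(n, 1.0 / n)  # uniform initial distribution
    for _ in range(max_iter):
        nxt = pi @ T          # one step of the chain
        nxt /= nxt.sum()
        if np.abs(nxt - pi).max() < tol:
            return nxt
        pi = nxt
    return pi
```

On a toy two-state chain with transition matrix [[0.9, 0.1], [0.5, 0.5]], the iteration converges to the stationary distribution [5/6, 1/6].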
2.2 Limitations of α-Rank
Though the work in Omidshafiei et al. (2019) seeks to determine a solution to the above problem, α-Rank suffers from one major drawback, namely scalability, which we remedy in this paper. The solution methodology in α-Rank is in fact unscalable to settings involving more than a handful of agents. In particular, the authors claim polynomial complexity for their solution to the problem in Eqn. 3. Such a complexity, however, is polynomial in an exponential search space, i.e., the space of joint strategy profiles. As such, the polynomial-complexity claim is not well grounded and needs to be investigated. In short, α-Rank exhibits exponential (in the number of agents) complexity for determining a ranking, rendering it inapplicable to settings involving more than a small number of agents.
Before we look into the scalability issue of α-Rank, it is worth mentioning that our proposed method is still bounded by another major limitation inherent to α-Rank: it prohibits behavioral improvement through strategy adaptation. A natural and simple extension to α-Rank is to allow policy refinements by introducing Policy-Space Response Oracles (PSRO) (Lanctot et al., 2017), inducing PSRO-α-Rank. The idea of PSRO-α-Rank is that after each round of α-Rank evaluation, one can augment the strategy space of each agent by finding, through reinforcement-learning algorithms, the best response to the other agents under the top-ranked strategy profile. However, as we show in Appendix C.3, this idea shows only minor advantages over the other PSRO baselines, including PSRO-Replicator Dynamics (Lanctot et al., 2017) and PSRO-Nash (Balduzzi et al., 2019). Consequently, we believe the major issue of α-Rank remains scalability. Since our proposed αα-Rank benefits all PSRO extensions built on α-Rank, we leave the experimental exploration of PSRO-αα-Rank for future work.
In what follows, we first discuss traditional approaches that could help solve Eqn. 3; we soon realize that an off-the-shelf solution is unavailable. Hence, we proceed to propose an efficient evaluation algorithm, αα-Rank, based on stochastic optimization with suitable complexities and rigorous theoretical guarantees. Finally, we propose a search heuristic to further scale up our method by introducing oracles, which we name αα-Oracle.
3 Scalable Evaluation for Multi-Agent Systems
The problem of computing stationary distributions is a long-standing classical problem in linear algebra. Various techniques, including the power method, PageRank, eigenvalue decomposition, and mirror descent, can be utilized for solving the problem in Eqn. 3. As we demonstrate next, any such implementation scales exponentially in the number of learners, as summarized in Table 1.
3.1 Traditional Approaches
Power Method.
One of the most common approaches to computing the solution in Eqn. 3 is the power method, which computes the stationary vector by constructing a sequence of iterates from a non-zero initial vector through repeated multiplication by the transition matrix. Though viable, we first note that the power method exhibits an exponential memory complexity in terms of the number of agents. To formally derive the bound, consider the total number of joint strategy profiles and the total number of transitions between the states of the Markov chain in Section 2. By construction, each row and column of the transition matrix contains a bounded number of non-zero elements, and hence the memory complexity of such an implementation is of the order of the number of transitions.
Analyzing its time complexity, on the other hand, requires a careful consideration linking convergence rates with the graph topology of the resulting Markov chain. Precisely, the convergence rate of the power method is dictated by the second-smallest eigenvalue of the normalized Laplacian of the graph associated with the Markov chain in Section 2. Hence, as long as this second-smallest eigenvalue is well behaved, one would expect suitable time-complexity guarantees. To this end, we prove the following lemma:
Lemma: [Second-Smallest Eigenvalue] Consider the Markov chain defined in Section 2 with states in and transition probability matrix . The second-smallest eigenvalue of the normalized Laplacian of the graph associated with the Markov chain is given by:
PageRank.
Inspired by the ranking of web pages on the internet, one can consider PageRank (Page et al., 1999) for computing the solution to the eigenvalue problem in Eqn. 3. Applied to our setting, we first realize that the memory complexity is analogous to that of the power method, and the time complexity is likewise exponential in the number of agents.
Eigenvalue Decomposition.
Apart from the above, we can also treat the problem as a standard eigenvalue-decomposition task (which is also how the original α-Rank is implemented, according to Lanctot et al. (2019)) and adopt the method of Coppersmith & Winograd (1990) to compute the stationary distribution. Unfortunately, state-of-the-art techniques for eigenvalue decomposition also require exponential memory and exhibit a super-linear time complexity in the number of joint profiles. Clearly, these bounds restrict α-Rank to a small number of agents.
Mirror Descent.
The ordered-subsets mirror descent (Ben-Tal et al., 2001) requires at each iteration a projection onto the standard simplex. As stated in that paper, computing this projection takes time growing with the simplex dimension, which in our setting is the total number of joint strategy profiles. Hence, the projection step is exponential in the number of agents, making mirror descent inapplicable when the number of agents is large.
Method | Time | Memory
---|---|---
Power Method | |
PageRank | |
Eig. Decomp. | |
OSMD | |
Our Method | |
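The simplex projection that dominates the per-iteration cost of ordered-subsets mirror descent can be made concrete. The sketch below implements the standard sort-based Euclidean projection onto the probability simplex (a widely known algorithm; not code from the paper), whose cost is dominated by an O(n log n) sort over the n-dimensional vector, with n the number of joint profiles in our setting:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex
    {x : x >= 0, sum(x) = 1}, via the classic sort-and-threshold method."""
    u = np.sort(v)[::-1]                # sort descending: the O(n log n) step
    css = np.cumsum(u)
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u * k > css - 1.0)[0][-1]  # last index satisfying the rule
    theta = (css[rho] - 1.0) / (rho + 1)        # shared shift
    return np.maximum(v - theta, 0.0)
```

Projecting a point already inside the simplex returns it unchanged; a point outside is shifted and clipped so its entries are non-negative and sum to one.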
3.2 Our Proposal: An Optimization-Based Solution
Rather than seeking an exact solution to the problem in Eqn. 3, one can consider approximate solvers by defining a constrained optimization objective:
(4)
The constrained objective in Eqn. 4 simply seeks a vector minimizing the distance between the vector itself and its image under the transition matrix (i.e., attempting to solve the fixed-point equation) while ensuring that the vector lies on the probability simplex (i.e., its entries are non-negative and sum to one). Due to the time and memory complexities required for computing exact solutions, we focus on determining an approximate vector defined as the solution to the following relaxed version of Eqn. 4:
(5)
The optimization problem in Eqn. 5 can be solved using a barrier-like technique that we detail below. Before that, it is instructive to clarify the connection between the original and relaxed problems:
Proposition: [Connections to Markov Chain] Any solution to the relaxed optimization problem in Eqn. 5 is the stationary distribution of the Markov chain in Section 2.
Importantly, the above proposition allows us to focus on solving the problem in Eqn. 5, which exhibits only inequality constraints. Problems of this nature can be solved by considering a barrier function, leading to an unconstrained finite-sum minimization problem. To do so, we rewrite the objective row by row and introduce logarithmic barrier functions with a penalty parameter, arriving at
(6)
Eqn. 6 is a standard finite-sum minimization problem that can be solved using any off-the-shelf stochastic optimization algorithm, e.g., stochastic gradient descent or ADAM (Kingma & Ba, 2014), among others. A stochastic-gradient execution involves sampling a strategy profile at each iteration and then executing a descent step with a sub-sampled gradient of Eqn. 6 and a scheduled penalty parameter that decays over iterations:
(7)
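A minimal sketch of this barrier-SGD scheme follows. It is our own illustrative instantiation, not the paper's exact objective: we assume the relaxation penalizes the residual of the fixed-point equation plus a squared sum-to-one penalty, with a log barrier keeping the iterate positive, and we sample one row-term of the finite sum per step; the step size, barrier weight, and decay schedule are placeholder choices.

```python
import numpy as np

def barrier_sgd(T, iters=20_000, lr=0.05, lam0=0.1, decay=1e-3, seed=0):
    """Sketch of Eqn. 6: minimise ||T^T x - x||^2 + (1^T x - 1)^2
    - lam * sum(log x) with a decaying barrier weight lam (assumed objective)."""
    rng = np.random.default_rng(seed)
    n = T.shape[0]
    A = T.T - np.eye(n)                   # residual operator: want A @ x = 0
    x = np.full(n, 1.0 / n)
    for t in range(1, iters + 1):
        lam = lam0 / (1.0 + decay * t)    # scheduled penalty parameter
        j = rng.integers(n)               # sample one term of the finite sum
        g = 2.0 * n * (A[j] @ x) * A[j]   # unbiased estimate of grad ||A x||^2
        g += 2.0 * (x.sum() - 1.0)        # sum-to-one penalty gradient
        g -= lam / x                      # log-barrier gradient keeps x > 0
        x = np.maximum(x - lr * g, 1e-12)
    return x / x.sum()
```

On the toy two-state chain [[0.9, 0.1], [0.5, 0.5]], the iterate ends close to the stationary distribution [5/6, 1/6]: the fixed-point residual shrinks substantially relative to the uniform starting point.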
See Phase I in Algorithm 1 for the pseudo-code. We can further derive the following convergence theorem:
Theorem: [Convergence of Barrier Method] Let be the output of a gradient algorithm descending in the objective in Eqn. 6, after iterations, then
where the expectation is taken w.r.t. all randomness of a stochastic gradient implementation, and the remaining constant is a decay rate for the penalty parameter.
The proof of the above theorem (see the full proof in Appendix A.2) is interesting in itself, but a more important aspect is the memory- and time-complexity implications of our algorithm. Theorem 2 implies that, after sufficiently many iterations governed by a precision parameter, our algorithm outputs a vector such that
Moreover, one can easily see (more details on these derivations can be found in Appendix A.3) that the overall time and memory complexities of our update rules are linear in the number of agents (see the comparison in Table 1). Hence, our algorithm achieves an exponential reduction, in terms of the number of agents, in both memory and time complexities.
3.3 Heuristic Search by Introducing Oracles
So far, we have presented scalable multi-agent evaluation through stochastic optimization. We can further boost the scalability of our method (to tens of millions of joint profiles) by introducing an oracle mechanism. The heuristic of oracles was first introduced for solving large-scale zero-sum matrix games (McMahan et al., 2003). The idea is to first create a restricted sub-game in which all players may only play a restricted number of strategies; the strategy sets are then expanded by incorporating each player's best response to its opponents, and the sub-game is replayed with the agents' augmented strategy pools before a new round of best responses is computed. The worst-case scenario of introducing oracles is having to solve the original evaluation problem at full size. The best response is assumed to be given by an oracle, which can be implemented simply by a grid search. Precisely, given the top-ranked profile at the current iteration, the goal for each agent is to select the optimal strategy from its pre-defined strategy pool so as to maximize its reward:
(8)
where the symbols denote the state and the actions of the agent and its opponents, respectively. The heuristic of solving the full game through restricted sub-games is crucial especially when it is prohibitively expensive to list all joint strategy profiles, e.g., in scenarios involving tens of millions of joint profiles.
For a complete exposition, we summarize the pseudo-code in Algorithm 1. In the first phase, vanilla αα-Rank is executed (lines 4-9), while in the second (lines 11-13), αα-Rank with the oracle (if turned on) is computed. To avoid confusion, we refer to the latter as αα-Oracle. Note that although in two-player zero-sum games the oracle algorithm (McMahan et al., 2003) is guaranteed to converge to the minimax equilibrium, providing valid convergence guarantees for αα-Oracle is an interesting direction for future work. In this paper, we instead demonstrate the effectiveness of such an approach in a large-scale empirical study, as shown in Section 4.
4 Experiments
In this section, we evaluate the scalability properties of αα-Rank (all experiments are run on a single machine with an Intel I9-9900X CPU). Precisely, we demonstrate that our method is capable of successfully recovering optimal policies in self-driving car simulations and in the Ising model, where strategy spaces are of the order of up to tens of millions of possible strategies. We note that these sizes are well beyond the capabilities of state-of-the-art methods, e.g., α-Rank (Omidshafiei et al., 2019), which considers at most four agents with four strategies, or AlphaStar, whose strategy-space size is detailed in Vinyals et al. (2019).
Sparsity Data Structures. During the implementation phase, we realised that the transition probability matrix of the Markov chain induces a sparsity pattern (each row and column contains only a small number of non-zero elements; check Section 3.2) that, if exploited, can lead to significant speed-ups. To fully leverage such sparsity, we tailored a novel data structure for the sparse storage and computations needed by Algorithm 1. More details can be found in Appendix B.1.
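The paper's tailored data structure is in Appendix B.1; as a rough illustration of the kind of CSR-style sparse storage such a pattern admits (our own sketch, not the paper's structure), one can keep only the non-zero entries per row and implement the single kernel the optimizer needs, the vector-matrix product:

```python
import numpy as np

class SparseRows:
    """Minimal CSR-style storage for a transition matrix: only non-zero
    entries are kept, exploiting the sparsity pattern noted above."""
    def __init__(self, dense):
        self.n = dense.shape[0]
        self.indptr = [0]                 # row start offsets
        self.indices, self.data = [], []  # column indices and values
        for row in dense:
            nz = np.nonzero(row)[0]
            self.indices.extend(nz.tolist())
            self.data.extend(row[nz].tolist())
            self.indptr.append(len(self.indices))

    def rmatvec(self, x):
        """Compute x @ T touching only stored non-zeros."""
        out = np.zeros(self.n)
        for i in range(self.n):
            for k in range(self.indptr[i], self.indptr[i + 1]):
                out[self.indices[k]] += x[i] * self.data[k]
        return out
```

The product agrees with the dense computation while storing and traversing only the non-zero entries.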
Correctness of Ranking Results. Before conducting large-scale experiments, it is instructive to validate the correctness of our results on simple cases, especially those reported by Omidshafiei et al. (2019). We therefore test on three normal-form games. Due to space constraints, we defer the full description of these tasks to Appendix B.2. Fig. 2 shows that the results generated by αα-Rank (Phase I of Algorithm 1) are indeed consistent with α-Rank's results.
Complexity Results on Random Matrices. We measured the time and memory needed by our method for computing the stationary distribution on simulated random matrices of varying sizes. Baselines include eigenvalue decomposition from NumPy, optimization tools in PyTorch, and α-Rank from OpenSpiel (Lanctot et al., 2019). For our algorithm, we terminated execution once gradient norms fell below a predefined threshold. According to Fig. 3, αα-Rank achieves a three-orders-of-magnitude reduction in time compared to eigenvalue decomposition. Most importantly, the performance gap keeps growing with increasing matrix size.
Autonomous Driving on Highway: Highway (Leurent, 2018) provides an environment for simulating self-driving scenarios, with social vehicles designed to mimic real-world traffic flow serving as strategy pools. We conducted a ranking experiment involving five agents, each with several strategies, yielding thousands of possible strategy profiles. Agent strategies varied between "rational" and "dangerous" drivers, which we encoded using different reward functions during training (complete details on the reward functions can be found in Appendix C.2). Under this setting, we know upfront that the optimal profile corresponds to all five agents being rational drivers. Cars were trained using value iteration, and rewards averaged over 200 test trials are reported. Due to the size of the strategy space, we considered both αα-Rank and αα-Oracle. We set αα-Oracle to run a fixed number of gradient-update iterations when solving for the top-ranked strategy profile (Phase I in Algorithm 1). Results depicted in Fig. 4(a) clearly demonstrate that both our implementations recover the correct highest-ranking strategy profile. We also note that, although such sizes are feasible using α-Rank and the power method, our results achieve a four-orders-of-magnitude reduction in the total number of iterations.
Ising Model Experiment: The Ising model (Ising, 1925) is a model describing ferromagnetism in statistical mechanics. It assumes a system of magnetic spins, where each spin is either up or down. The system energy is defined through pairwise couplings and an external field with constant coefficients, and the probability of a spin configuration follows the Boltzmann distribution at the environmental temperature. Finding the equilibrium of the system is notoriously hard because it requires enumerating all possible configurations to compute the normalization constant. Traditional approaches include Markov Chain Monte Carlo (MCMC). An interesting phenomenon is the phase change: the spins reach an equilibrium at low temperatures, and as the temperature increases, this equilibrium suddenly breaks and the system becomes chaotic.
Here we observe the phase change through multi-agent evaluation methods. We treat each spin as an agent, take the reward to be derived from the energy, and set the ranking-intensity parameter so as to build the link between Eqn. 1 and the configuration probability. We consider the top-ranked strategy profile from αα-Oracle as the system equilibrium and compare it against the ground truth from MCMC. We consider a five-by-five 2D model, which induces a prohibitively large strategy space of tens of millions of configurations, to which the existing baselines, including α-Rank on a single machine, are inapplicable. Fig. 4(b) illustrates that our method identifies the same phase change as MCMC suggests. Fig. 4(c) shows an example of how αα-Oracle's top-ranked profile finds the system equilibrium at a low temperature.
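The MCMC ground truth referenced above can be sketched with a standard single-spin-flip Metropolis sampler; this is a generic textbook implementation of ours (open boundaries, illustrative coupling and field coefficients), not the paper's code.

```python
import numpy as np

def energy(spins, J=1.0, h=0.0):
    """Ising energy on a 2D grid: nearest-neighbour coupling J, field h,
    open boundaries (coefficients here are illustrative placeholders)."""
    e = -h * spins.sum()
    e -= J * (spins[1:, :] * spins[:-1, :]).sum()   # vertical bonds
    e -= J * (spins[:, 1:] * spins[:, :-1]).sum()   # horizontal bonds
    return e

def metropolis(L=5, temp=1.0, steps=20_000, seed=0):
    """Single-spin-flip Metropolis sampling of the Boltzmann distribution."""
    rng = np.random.default_rng(seed)
    s = rng.choice([-1, 1], size=(L, L))
    for _ in range(steps):
        i, j = rng.integers(L), rng.integers(L)
        e_old = energy(s)
        s[i, j] *= -1                                 # propose a flip
        accept = rng.random() < np.exp(min(0.0, -(energy(s) - e_old) / temp))
        if not accept:
            s[i, j] *= -1                             # reject: undo the flip
    return s
```

At low temperature the sampler drives the grid toward aligned (low-energy) configurations, which is the ordered phase the experiment compares against.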
5 Conclusion
In this paper, we demonstrated that the approach in Omidshafiei et al. (2019) exhibits exponential time and memory complexities. We then proposed αα-Rank as a scalable solution for multi-agent evaluation with linear time and memory demands. In a set of experiments, we demonstrated that our method is truly scalable, capable of handling large strategy spaces.
There are many interesting avenues for future research. First, we plan to theoretically analyze the convergence properties of the resulting oracle algorithm and to further introduce policy learning through oracles. Second, we plan to take our method to the real world by conducting multi-robot experiments.
References
- Abdolmaleki et al. (2018) Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, and Martin Riedmiller. Maximum a posteriori policy optimisation. arXiv preprint arXiv:1806.06920, 2018.
- Balduzzi et al. (2019) David Balduzzi, Marta Garnelo, Yoram Bachrach, Wojciech M Czarnecki, Julien Perolat, Max Jaderberg, and Thore Graepel. Open-ended learning in symmetric zero-sum games. arXiv preprint arXiv:1901.08106, 2019.
- Barik et al. (2015) Sasmita Barik, Ravindra Bapat, and Sukanta Pati. On the laplacian spectra of product graphs. Applicable Analysis and Discrete Mathematics, 9, 04 2015. doi: 10.2298/AADM150218006B.
- Ben-Tal et al. (2001) Aharon Ben-Tal, Tamar Margalit, and Arkadi Nemirovski. The ordered subsets mirror descent optimization method with applications to tomography. SIAM Journal on Optimization, 12(1):79–108, 2001.
- Bowling & Veloso (2001) Michael Bowling and Manuela Veloso. Convergence of gradient dynamics with a variable learning rate. In ICML, pp. 27–34, 2001.
- Chen & Deng (2005) X Chen and X Deng. Settling the complexity of 2-player Nash equilibrium. Technical report, Electronic Colloquium on Computational Complexity (ECCC), 2005.
- Coppersmith & Winograd (1990) Don Coppersmith and Shmuel Winograd. Matrix multiplication via arithmetic progressions. Journal of symbolic computation, 9(3):251–280, 1990.
- Fudenberg & Imhof (2006) Drew Fudenberg and Lorens A Imhof. Imitation processes with small mutations. Journal of Economic Theory, 131(1):251–262, 2006.
- Grau-Moya et al. (2018) Jordi Grau-Moya, Felix Leibfried, and Haitham Bou-Ammar. Balancing two-player stochastic games with soft q-learning. arXiv preprint arXiv:1802.03216, 2018.
- Hart & Mas-Colell (2003) Sergiu Hart and Andreu Mas-Colell. Uncoupled dynamics do not lead to nash equilibrium. American Economic Review, 93(5):1830–1836, 2003.
- Ising (1925) Ernst Ising. Beitrag zur theorie des ferromagnetismus. Zeitschrift für Physik A Hadrons and Nuclei, 31(1):253–258, 1925.
- Kingma & Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Lanctot et al. (2017) Marc Lanctot, Vinicius Zambaldi, Audrunas Gruslys, Angeliki Lazaridou, Karl Tuyls, Julien Pérolat, David Silver, and Thore Graepel. A unified game-theoretic approach to multiagent reinforcement learning. In Advances in Neural Information Processing Systems, pp. 4190–4203, 2017.
- Lanctot et al. (2019) Marc Lanctot, Edward Lockhart, Jean-Baptiste Lespiau, Vinicius Zambaldi, Satyaki Upadhyay, Julien Pérolat, Sriram Srinivasan, Finbarr Timbers, Karl Tuyls, Shayegan Omidshafiei, et al. Openspiel: A framework for reinforcement learning in games. arXiv preprint arXiv:1908.09453, 2019.
- Leurent (2018) Edouard Leurent. An environment for autonomous driving decision-making. https://github.com/eleurent/highway-env, 2018.
- McMahan et al. (2003) H Brendan McMahan, Geoffrey J Gordon, and Avrim Blum. Planning in the presence of cost functions controlled by an adversary. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 536–543, 2003.
- Nash et al. (1950) John F Nash et al. Equilibrium points in n-person games. Proceedings of the national academy of sciences, 36(1):48–49, 1950.
- Nocedal & Wright (2006) Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer, New York, NY, USA, second edition, 2006.
- Omidshafiei et al. (2019) Shayegan Omidshafiei, Christos Papadimitriou, Georgios Piliouras, Karl Tuyls, Mark Rowland, Jean-Baptiste Lespiau, Wojciech M Czarnecki, Marc Lanctot, Julien Perolat, and Remi Munos. α-rank: Multi-agent evaluation by evolution. Scientific Reports, Nature, 2019.
- Page et al. (1999) Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The pagerank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab, 1999.
- Palaiopanos et al. (2017) Gerasimos Palaiopanos, Ioannis Panageas, and Georgios Piliouras. Multiplicative weights update with constant step-size in congestion games: Convergence, limit cycles and chaos. In Advances in Neural Information Processing Systems, pp. 5872–5882, 2017.
- Pinsky & Karlin (2010) Mark Pinsky and Samuel Karlin. An introduction to stochastic modeling. Academic press, 2010.
- Vinyals et al. (2019) Oriol Vinyals, Igor Babuschkin, Junyoung Chung, Michael Mathieu, Max Jaderberg, Wojciech M Czarnecki, Andrew Dudzik, Aja Huang, Petko Georgiev, Richard Powell, et al. Alphastar: Mastering the real-time strategy game starcraft ii. DeepMind Blog, 2019.
- Viossat (2007) Yannick Viossat. The replicator dynamics does not lead to correlated equilibria. Games and Economic Behavior, 59(2):397–407, 2007.
- Wang et al. (2016) Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Hasselt, Marc Lanctot, and Nando Freitas. Dueling network architectures for deep reinforcement learning. In International Conference on Machine Learning, pp. 1995–2003, 2016.
- Wen et al. (2019) Ying Wen, Yaodong Yang, Rui Luo, Jun Wang, and Wei Pan. Probabilistic recursive reasoning for multi-agent reinforcement learning. arXiv preprint arXiv:1901.09207, 2019.
- Wu et al. (2017) Cathy Wu, Aboudy Kreidieh, Eugene Vinitsky, and Alexandre M Bayen. Emergent behaviors in mixed-autonomy traffic. In Conference on Robot Learning, pp. 398–407, 2017.
Appendix A Comprehensive Proofs
A.1 Lemma of Second-Smallest Eigenvalue
Lemma: [Second-Smallest Eigenvalue] Consider the Markov chain defined in Section 2 with states in and transition probability matrix . The second-smallest eigenvalue of the normalized Laplacian of the graph associated with the Markov chain is given by:
Proof: For simplicity, we drop the round index in the derivation below. Notice that the underlying graph of the constructed Markov chain can be represented as a Cartesian product of complete graphs (each factor being a complete graph over one agent's strategy set):
(9)
Indeed, two vertices are connected by an edge if and only if the corresponding joint strategy profiles differ in at most one individual strategy. Hence, the spectral properties of the product graph can be described in terms of the spectral properties of its factors as follows (Barik et al., 2015):
where the listed quantities are the eigenvalues of the unnormalized Laplacian of the complete graph and the corresponding eigenvectors (the eigen-relation holding for each factor and index). The spectrum of the unnormalized Laplacian of a complete graph consists of zero and the number of nodes, and the only eigenvector corresponding to the zero eigenvalue is the all-ones vector. Therefore, the minimum non-zero eigenvalue of the unnormalized Laplacian of the product graph equals the per-factor node count. Finally, since the product graph is regular, the smallest non-zero eigenvalue of its normalized Laplacian is obtained by dividing by the common degree.
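This Cartesian-product argument can be verified numerically: the Laplacian of a Cartesian product is the Kronecker sum of the factor Laplacians, and the complete graph on n nodes contributes the spectrum {0, n}. The check below is our own verification, not code from the paper:

```python
import numpy as np

def complete_laplacian(n):
    """Unnormalised Laplacian of the complete graph K_n; spectrum {0, n}."""
    return n * np.eye(n) - np.ones((n, n))

def product_laplacian(n, N):
    """Laplacian of the N-fold Cartesian product of K_n, built as a
    Kronecker sum: L(G x H) = L(G) kron I + I kron L(H)."""
    L = np.zeros((1, 1))
    for _ in range(N):
        L = np.kron(L, np.eye(n)) + np.kron(np.eye(L.shape[0]),
                                            complete_laplacian(n))
    return L

# For N = 2 agents with n = 3 strategies each, the smallest non-zero
# eigenvalue of the unnormalised Laplacian is n = 3; dividing by the common
# degree N * (n - 1) = 4 gives a normalised second-smallest eigenvalue of 3/4.
```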
Given this result, the overall time complexity of the power method follows from the eigenvalue expression above and is exponential in the number of agents. As for memory, the power method has the same requirements as the PageRank algorithm (due to the necessity of storing the transition matrix). These results imply that the power method scales exponentially with the number of agents and is therefore inapplicable when that number is large.
A.2 Theorem of Convergence of Barrier Method
Log-Barrier Stochastic Gradient Descent
Theorem: [Convergence of Barrier Method] Let be the output of a gradient algorithm descending in the objective in Eqn. 6, after iterations, then
where the expectation is taken w.r.t. all randomness of a stochastic gradient implementation, and the remaining constant is a decay rate for the penalty parameter. See Algorithm 2.
Proof: Let the quantities above be the solutions of Eqn. (5) and Eqn. (6), respectively. The convergence guarantees for the logarithmic barrier method (Nocedal & Wright, 2006) with the given penalty and barrier parameters give:
(10)
and substituting into (10) gives:
(11)
Applying the convergence guarantees of the stochastic gradient descent method to the convex function gives:
Using the definition of function :