1 Introduction
Scalable policy evaluation and learning have been long-standing challenges in multi-agent reinforcement learning (MARL), with two difficulties obstructing progress. First, joint-strategy spaces explode exponentially as the number of strategic decision-makers grows, and second, the underlying game dynamics may exhibit cyclic behaviour (e.g., the game of Rock-Paper-Scissors), rendering the choice of an appropriate evaluation criterion non-trivial. Focusing on the second challenge, much work in multi-agent systems has followed a game-theoretic treatment, proposing fixed points, e.g., the Nash
[nash1950equilibrium] equilibrium, as potentially valid evaluation metrics. Though appealing, such measures are normative only when prescribing behaviours of perfectly rational agents – an assumption rarely met in reality [grau2018balancing, wen2019probabilistic, Felix, wen2019multi]. In fact, many game dynamics have been proven not to converge to any fixed-point equilibria [hart2003uncoupled, viossat2007replicator], but rather to limit cycles [palaiopanos2017multiplicative, bowling2001convergence]. Apart from these challenges, solving for a Nash equilibrium even in “simple” settings, e.g., two-player games, is known to be PPAD-complete [chen2005settling] – a demanding complexity class when it comes to computational requirements.

To address some of the above limitations, [omidshafiei2019alpha] recently proposed α-Rank as a graph-based game-theoretic solution to multi-agent evaluation. α-Rank adopts Markov Conley Chains to highlight the presence of cycles in game dynamics and attempts to compute stationary distributions as a means of ranking strategy profiles. In a novel attempt, the authors reduce multi-agent evaluation to computing the stationary distribution of a Markov chain. Namely, considering a set of N agents each having a strategy pool of size k, a Markov chain is first defined over the graph of joint strategy profiles with a transition matrix T, and then a stationary distribution v is computed by solving

v^⊤ T = v^⊤.

The probability mass in v then represents the ranking of each joint-strategy profile.

Extensions of α-Rank have been developed in various directions. [rowl2019multiagent] adapted α-Rank to model games with incomplete information. [muller2019generalized] combined α-Rank with policy-space response oracles (PSRO) [lanctot2017unified] and claimed their method to be a generalised training approach for multi-agent learning. Unsurprisingly, these works inherit the same claim of tractability from α-Rank. For example, the abstract of [muller2019generalized] reads “α-Rank, which is unique (thus faces no equilibrium selection issues, unlike Nash) and tractable to compute in general-sum, many-player settings.”
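The core computation above – ranking joint-strategy profiles by the stationary distribution of a Markov chain – can be sketched in a few lines. The 3×3 transition matrix below is invented purely for illustration:

```python
# Toy illustration of alpha-Rank's core computation: rank joint-strategy
# profiles by the stationary distribution of a Markov chain over profiles.

def stationary_distribution(T, iters=10_000):
    """Power iteration v_{t+1} = v_t @ T from a uniform start."""
    n = len(T)
    v = [1.0 / n] * n
    for _ in range(iters):
        v = [sum(v[i] * T[i][j] for i in range(n)) for j in range(n)]
    return v

# An invented row-stochastic matrix over three joint profiles.
T = [
    [0.5, 0.4, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
]
v = stationary_distribution(T)
ranking = sorted(range(3), key=lambda j: -v[j])  # highest mass first
```

The profile with the largest stationary mass is ranked first; for this particular matrix that is the second profile.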
In this work, we refine the claims made in α-Rank depending on its input type. We thoroughly argue that α-Rank exhibits a prohibitive computational and memory bottleneck that is hard to remedy even if payoff matrices were provided as inputs. We quantify this restriction using money spent as a non-refutable metric for the scale at which α-Rank remains viable. With this in mind, we then present a stochastic solver, which we title αα-Rank, as a scalable and memory-efficient alternative. Our method reduces memory constraints and makes use of an oracle mechanism for reductions in joint-strategy spaces. This, in turn, allows us to run large-scale multi-player experiments, including evaluation on self-driving cars and Ising models, where the maximum size involves tens of millions of joint strategies.
2 A Review of α-Rank
In α-Rank, the strategy profiles of agents are evaluated through an evolutionary process of mutation and selection. Initially, agent populations are constructed by creating multiple copies of each learner, assuming that all agents in one population execute the same unified policy. With this, α-Rank then simulates a multi-agent game played by learners sampled randomly from each population. Upon game termination, each participating agent receives a payoff to be used for policy mutation and selection after its return to the population. Here, the agent faces a probabilistic choice between switching to the mutation policy, continuing to follow its current policy, or randomly selecting a novel policy (other than the previous two) from the pool. This process repeats with the goal of determining an evolutionarily dominant profile that spreads across the population of agents. Fig. 1 demonstrates a simple example of a three-player game, each player having three strategies.
Mathematical Formulation:
To formalise α-Rank, we consider N agents, each denoted by l ∈ {1, …, N} and having access to a set of strategies of size k_l. We refer to the strategy set of agent l by S^l = {π^l_1, …, π^l_{k_l}}, with π^l_i representing the i-th allowed policy of the l-th learner, i.e., a mapping from the set of states to the set of actions of agent l. A joint strategy profile is a set of policies for all participating agents in the joint strategy space, i.e., π = (π^1, …, π^N) ∈ S, with S = S^1 × ⋯ × S^N and π^l ∈ S^l. We assume k_l = k for all agents hereafter.
To evaluate performance, we assume each agent l is additionally equipped with a payoff (reward) function P^l : S → ℝ. Crucially, the domain of P^l is the pool of joint strategies so as to accommodate the effect of other learners on the player’s performance. Finally, given a joint profile π, we define the corresponding joint payoff to be the collection of all individual payoff functions, i.e., P(π) = (P^1(π), …, P^N(π)). After attaining payoffs from the environment, each agent returns to its population and faces a choice between switching the whole population to a mutation policy, exploring a novel policy, or sticking to the current one. Such a choice is probabilistic and defined proportionally to rewards,
with ε an exploration parameter², π^{−l} representing the policies followed by the other agents, and α a ranking-intensity parameter. A large α ensures that the probability of a sub-optimal strategy overtaking a better strategy is close to zero. (²Note that in the original paper, ε is heuristically set to a small positive constant so as to ensure at most two varying policies per population; theoretical justification can be found in [fudenberg2006imitation].)

As noted in [omidshafiei2019alpha], one can relate the above switching process to a random walk on a Markov chain with states defined as elements of S. Essentially, the Markov chain models the sink strongly connected components (SSCCs) of the response graph associated with the game. The response graph of a game is a directed graph in which each node corresponds to a joint strategy profile, and a directed edge connects two profiles if the deviating player’s new strategy is a better response for that player; the SSCCs of a directed graph are the (groups of) nodes with no outgoing edges.
Each entry of the transition probability matrix T of the Markov chain refers to the probability of one agent switching from one policy to another in relation to the attained payoffs. Consider any two joint strategy profiles π and π̄ that differ in only one individual strategy, i.e., there exists a unique agent l such that π^l ≠ π̄^l and π^{−l} = π̄^{−l}. For such pairs, we set T[π, π̄] = η ρ(π, π̄), with η = (Σ_l (k_l − 1))^{−1} a normalising factor and ρ(π, π̄) the probability that one copy of agent l with strategy π̄^l invades the population in which all other agents (in that population) play π^l. Following [pinsky2010introduction], for P^l(π̄) ≠ P^l(π), such a probability is formalised as

ρ(π, π̄) = (1 − exp(−α[P^l(π̄) − P^l(π)])) / (1 − exp(−mα[P^l(π̄) − P^l(π)])),   (1)
and ρ(π, π̄) = 1/m otherwise, with m the size of the population. So far, we have presented the relevant derivations for the entries of the state-transition matrix when exactly one agent differs in exactly one strategy. Having one policy change, however, represents only a subset of the allowed variations; two more cases need to be considered. First, we restrict our attention to variations in joint policies involving more than one individual strategy, for which we set³ T[π, π̄] = 0. (³This assumption significantly reduces the analysis complexity, as detailed in [fudenberg2006imitation].) Consequently, the remaining event of self-transitions can be written as T[π, π] = 1 − Σ_{π̄ ≠ π} T[π, π̄]. Summarising the above three cases, we can write the (π, π̄) entry of the Markov chain’s transition matrix as:

T[π, π̄] = η ρ(π, π̄) if π and π̄ differ in exactly one individual strategy,
T[π, π̄] = 0 if π and π̄ differ in more than one individual strategy,
T[π, π] = 1 − Σ_{π̄ ≠ π} T[π, π̄] for self-transitions.   (2)
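To make Eqns. 1 & 2 concrete, the following sketch assembles the full transition matrix for a toy game; the `payoff` callable and the default values of α and the population size m are illustrative placeholders, not the paper’s experimental settings:

```python
# Sketch: assemble alpha-Rank's transition matrix T (Eqns. 1 & 2) for a toy
# game. `payoff`, alpha, and m below are illustrative placeholders.
import math
from itertools import product

def rho(p_new, p_old, alpha=1.0, m=50):
    """Fixation probability (Eqn. 1); equals 1/m for equal payoffs."""
    d = p_new - p_old
    if abs(d) < 1e-12:
        return 1.0 / m
    x = alpha * d
    if -m * x > 700:  # limit of the formula for very unfavourable mutations
        return 0.0
    return (1.0 - math.exp(-x)) / (1.0 - math.exp(-m * x))

def build_transition_matrix(payoff, N, k, alpha=1.0, m=50):
    """Assemble T over all k**N joint profiles (Eqn. 2).
    payoff(l, profile) -> payoff of agent l under `profile`."""
    profiles = list(product(range(k), repeat=N))
    idx = {p: i for i, p in enumerate(profiles)}
    eta = 1.0 / (N * (k - 1))       # normalising factor
    T = [[0.0] * len(profiles) for _ in profiles]
    for p in profiles:
        i = idx[p]
        for l in range(N):          # the single deviating agent l
            for s in range(k):
                if s == p[l]:
                    continue
                q = p[:l] + (s,) + p[l + 1:]
                T[i][idx[q]] = eta * rho(payoff(l, q), payoff(l, p), alpha, m)
        T[i][i] = 1.0 - sum(T[i])   # self-transition absorbs the rest
    return profiles, T
```

Note how every row already contains k^N entries: even this direct construction is exponential in N, which is exactly the bottleneck discussed in Section 3.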
The goal in α-Rank is to establish an ordering of policy profiles dependent on the evolutionary stability of each joint strategy. In other words, higher-ranked strategies are those that are prevalent in populations with a higher average time of survival. Formally, such a notion can be derived as the limiting vector v of our Markov chain when evolving from an initial distribution. Knowing that the limiting vector is a stationary distribution, one can in fact calculate v as the solution to the following eigenvector problem:

v^⊤ T = v^⊤, with v ≥ 0 and 1^⊤ v = 1.   (3)
We summarise the pseudocode of α-Rank in Algorithm 1. As the input to α-Rank is unclear and turns out to be controversial later, we point readers to the original description in Section 3.1.1 of [omidshafiei2019alpha], and to the practical implementation of α-Rank in [lanctot2019openspiel], so they may judge for themselves. In what comes next, we demonstrate that the tractability claim of α-Rank needs to be relaxed, as the algorithm exhibits exponential time and memory complexities in the number of players, dependent on the input type considered. This, consequently, renders α-Rank inapplicable to large-scale multi-agent systems, contrary to the original presentation.
3 Claims & Refinements
The original presentation of α-Rank claims tractability in the sense that the algorithm runs in polynomial time with respect to the total number of joint-strategy profiles. Unfortunately, such a claim is not clear without a formal specification of the inputs to Algorithm 3.1.1 in [omidshafiei2019alpha]. In fact, we next demonstrate that α-Rank can easily exhibit exponential complexity under an input given as an N × k table of strategy pools, rendering it inapplicable beyond a small finite number of players. We also present a conjecture stating that determining the top-rank joint strategy profile in α-Rank is in fact NP-hard.
3.1 On α-Rank’s Computational Complexity
Before diving into the details of our arguments, it is instructive to note that tractable algorithms are those that exhibit a worst-case polynomial running time in the size of their input [papadimitriou2003computational]. Mathematically, for an input of size x, a polynomial-time algorithm adheres to an O(x^c) complexity for some constant c independent of x.
Following the presentation in Section 3.1.1 of [omidshafiei2019alpha], α-Rank assumes the availability of a game simulator to construct a payoff matrix quantifying the performance of joint strategy profiles. As such, we deem the necessary input for such a construction to be of size N × k, where N is the total number of agents and k is the number of strategies per agent (assuming, for simplicity, k_l = k for all agents).
Following the definition above, if α-Rank possessed polynomial complexity, it would attain a running time proportional to O((Nk)^c), with c a constant independent of N and k. As the algorithm requires computing a stationary distribution of a Markov chain described by a transition matrix with k^N rows and columns, the time complexity of α-Rank amounts to at least O(k^N). Clearly, this result demonstrates exponential – and thus intractable – complexity in the number of agents N. In fact, we conjecture that determining the top-rank joint strategy profile using α-Rank with an N × k input is NP-hard.
Conjecture: [α-Rank is NP-hard] Consider N agents, each with k strategies. Computing the top-rank joint strategy profile with respect to the stationary distribution of the Markov chain’s transition matrix T is NP-hard.
Reasoning: To illustrate the point of the conjecture above, imagine N agents, each with k strategies. Following the certificate argument for determining complexity classes, we ask the question:

“Given a joint strategy profile π, is π top-rank w.r.t. the stationary distribution of the Markov chain?”

To determine an answer to the above question, one requires an evaluation mechanism of some sort. If the time complexity of this mechanism is polynomial with respect to the input size, i.e., O(poly(Nk)), then one can claim that the problem belongs to the NP complexity class. However, if the aforementioned mechanism exhibits an exponential time complexity, then the problem belongs to the NP-hard complexity class. When it comes to α-Rank, we believe a mechanism answering the above question would require computing a holistic solution of the problem, which, unfortunately, is exponential (i.e., O(k^N)). Crucially, if our conjecture proves correct, we do not see how α-Rank can handle more than a small finite number of agents. ∎
3.2 On OptimisationBased Techniques
Given the exponential complexity derived above, we can resort to approximations of stationary distributions that aim at determining an ε-close solution for some precision parameter ε > 0. Here, we note that a problem of this type is a long-standing classical problem in linear algebra. Various techniques, including the power method, PageRank, eigenvalue decomposition, and mirror descent, can be utilised. Briefly surveying this literature, we demonstrate that any such implementation (unfortunately) scales exponentially in the number of players. For a quick summary, please consult Table 1.

Power Method.

One of the most common approaches to computing a stationary distribution is the power method, which computes the stationary vector by constructing a sequence v_{t+1} = T^⊤ v_t from a non-zero initialisation v_0. Though viable, we first note that the power method exhibits an exponential memory complexity in terms of the number of agents. To formally derive the bound, define n to represent the total number of joint strategy profiles, i.e., n = k^N, and m the total number of transitions between the states of the Markov chain. By construction, one can easily see that m = O(nN(k − 1)), as each row and column in T contains N(k − 1) + 1 non-zero elements. Hence, the memory complexity of such an implementation is of the order O(N(k − 1)k^N).

The time complexity of the power method, furthermore, is given by O(T_ε m), where T_ε is the total number of iterations needed to reach precision ε. Since m is of the order O(N(k − 1)k^N), the total complexity of such an implementation is also exponential.
PageRank.
Inspired by the ranking of webpages on the internet, one can consider PageRank [page1999pagerank] for computing the solution to the eigenvalue problem presented above. Applied to our setting, we first realise that the memory complexity is analogous to that of the power method⁴, i.e., O(N(k − 1)k^N), and the time complexity is likewise of the order O(T_ε N(k − 1)k^N). (⁴Some works equate the power method with the PageRank algorithm.)
Method  Time  Memory

Power Method  O(T_ε N(k − 1)k^N)  O(N(k − 1)k^N)
PageRank  O(T_ε N(k − 1)k^N)  O(N(k − 1)k^N)
Eig. Decomp.  O(k^{ωN}), ω ≈ 2.376  O(k^{2N})
Mirror Descent  O(T_ε k^N log k^N)  O(N(k − 1)k^N)
Eigenvalue Decomposition.
Apart from the above, we can also treat the problem as a standard eigenvalue decomposition task (which is also how α-Rank is implemented in [lanctot2019openspiel]) and adopt the method in [coppersmith1990matrix] to compute the stationary distribution. Unfortunately, state-of-the-art techniques for eigenvalue decomposition also require exponential memory, O(k^{2N}), and exhibit a time complexity of the form O(k^{ωN}) with ω ≈ 2.376 [coppersmith1990matrix]. Clearly, these bounds restrict α-Rank to a small number of agents N.
Mirror Descent.
Another optimisation-based alternative is the ordered-subsets mirror descent algorithm [ben2001ordered]. This is an iterative procedure requiring a projection step onto the standard (k^N − 1)-dimensional simplex at every iteration. As mentioned in [ben2001ordered], computing this projection requires O(k^N log k^N) time. Hence, the projection step is exponential in the number of agents N. This makes mirror descent inapplicable to α-Rank when N is large.
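The O(n log n) cost of that projection comes from sorting. A minimal sketch of the standard sort-based Euclidean projection onto the probability simplex – the per-iteration bottleneck discussed above – is:

```python
# Sort-based Euclidean projection onto the probability simplex,
# running in O(n log n) time (dominated by the sort).

def project_to_simplex(y):
    """Return the point on {x : x >= 0, sum(x) = 1} closest to y."""
    n = len(y)
    u = sorted(y, reverse=True)   # the O(n log n) step
    css = 0.0                     # running cumulative sum of u
    theta = 0.0
    for i in range(n):
        css += u[i]
        t = (css - 1.0) / (i + 1)
        if u[i] - t > 0:          # condition is monotone in i
            theta = t             # keep threshold from the last valid index
    return [max(v - theta, 0.0) for v in y]
```

With n = k^N coordinates, even this single sort is exponential in the number of agents, which is the point made above.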
Apart from the methods listed above, we are aware of other approaches that can compute the leading eigenvector of large matrices, for example the online learning approach [garber2015online], sketching methods [tropp2017practical], and subspace iteration with Rayleigh–Ritz acceleration [golub2000eigenvalue]. The trade-off is that these methods usually assume special structure in the matrix, such as being Hermitian or at least positive semi-definite, which α-Rank’s transition matrix does not satisfy. Importantly, they cannot offer any advantage in time complexity either.
4 Reconsidering α-Rank’s Inputs
Having discussed our results with the authors, we were advised that the “inputs” to α-Rank are exponentially-sized payoff matrices, i.e., treating line 2 of Algorithm 1 as an input⁵. Though polynomial in this exponentially-sized input, such a consideration does not resolve the problems mentioned above. In this section, we demonstrate additional theoretical and practical problems arising when considering the “input” advised by the authors. (⁵We were also promised that such claims would be clarified in subsequent submissions.)
4.1 On the Definition of Agents
α-Rank redefines a strategy to correspond to the agents under evaluation, differentiating them from the players in the game (see line 4 in Section 3.1.1 and also Fig. 2a in [omidshafiei2019alpha]). Complexity results are then given in terms of these “agents”, for which tractability is claimed. We would like to clarify that such definitions do not necessarily reflect the true underlying time complexity, and, without a formal definition of the inputs, it is difficult to claim tractability.
To illustrate, consider solving a travelling salesman problem in which a traveller needs to visit a set of cities and return to the origin following the shortest route. Although it is well known that the travelling salesman problem is NP-hard, following the line of thought presented in α-Rank, one can show that such a problem reduces to a polynomial-time (linear, i.e., tractable) problem in the number of “meta-cities” – which is not a valid claim.
So what are these “meta-cities”, and what is wrong with the above argument?
A strategy in the travelling salesman problem corresponds to a permutation of the order of cities. Rather than operating with the number of cities, following α-Rank, we can construct the space of all permutations, calling each a “meta-city” (or agent)⁶. Having enumerated all permutations, somehow, searching for the shortest route can be performed in polynomial time. Even though one can state that solving the travelling salesman problem is polynomial in the number of permutations, it is incorrect to claim that any such algorithm is tractable. The exact same argument can be made for α-Rank, whereby having a polynomial-time algorithm in an exponentially-sized space does not at all imply tractability⁷. It is for this reason that complexity results need to be reported with respect to the size of the input without any redefinition (we believe these are agents in multi-agent systems according to Section 3.1.1 in [omidshafiei2019alpha], and cities in the travelling salesman problem). (⁶How to enumerate all these permutations is an interesting question. Analogously, if the N × k table was not the input to α-Rank, enumerating an exponentially-sized matrix is an equally interesting question. ⁷Note that this claim does not apply to the complexity of solving for a Nash equilibrium. For example, in solving zero-sum games, polynomial tractability is never claimed in the number of players, whereas α-Rank claims tractability in the number of players.)
Game Env.  PetaFlop/sdays  Cost ($)  Time (days) 

AlphaZero Go [silver2017mastering]  
AlphaGo Zero [silver2016mastering]  
AlphaZero Chess [silver2017mastering]  
MuJoCo Soccer [liu2019emergent]  
Leduc Poker [lanctot2017unified]  
Kuhn Poker [heinrich2015fictitious]  
AlphaStar [vinyals2019grandmaster] 
As is clear so far, the inputs to α-Rank lack clarity. Given this confusion about the form of the input, we realise that we are left with two choices: 1) a list of all k^N joint strategy profiles, or 2) a table of size N × k – the collection of all players’ strategy pools. If we follow the first direction, the claims made in the paper are of course correct; however, this by no means resolves the problem, as it is not clear how one would construct such an input in a tractable manner. Precisely, given an N × k table (the collection of all players’ strategy pools) as input, constructing the aforementioned list requires exponential time, O(k^N). In other words, providing α-Rank with such a list only hides the exponential complexity burden in a preprocessing step. Analogously, applying this idea to the travelling salesman problem described above would hide the exponential complexity in a preprocessing step used to construct all possible permutations. Provided these as inputs, the travelling salesman problem can now be solved in linear time – transforming an intractable problem into a tractable one by a mere redefinition.
4.2 Dollars Spent: A Non-Refutable Metric
Admittedly, our arguments so far have been mostly theoretical and can become controversial depending on the setting one considers. To dispel any doubts, we followed the advice given by the authors and considered the input of α-Rank to be exponentially-sized payoff matrices. We then conducted an experiment measuring the dollars spent to evaluate the scalability of running just line 3 in Algorithm 1, while considering the tasks reported in [omidshafiei2019alpha].
Assuming the payoff matrix is given at no cost, the total number of floating-point operations (FLOPs) needed for constructing T as given in Eqn. 2 is of the order O(N(k − 1)k^N). In terms of the monetary cost of just building T, we plot the dollar amount in Fig. 2 considering the Nvidia Tesla K80 GPU⁸ at its maximum single-precision throughput and its hourly price on AWS⁹. Clearly, Fig. 2 shows that, because α-Rank needs to construct a Markov chain of a size exponential in the number of agents, it is only “money-feasible” on tasks with at most tens of agents. It is also worth noting that our analysis is optimistic, in the sense that we have not considered the costs of storing T, nor of computing stationary distributions. (⁸https://en.wikipedia.org/wiki/Nvidia_Tesla ⁹https://aws.amazon.com/ec2/instancetypes/p2/)
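The back-of-envelope accounting can be reproduced as follows; the FLOP-per-entry count, GPU throughput, and hourly price below are illustrative placeholders rather than the exact figures used for Fig. 2:

```python
# Back-of-envelope dollar cost of just *building* T. The per-entry FLOP
# count, GPU throughput, and hourly price are illustrative placeholders.

def cost_of_transition_matrix(N, k, flops_per_entry=10,
                              gpu_flops=8.7e12, usd_per_hour=0.90):
    """Estimated dollars to compute every non-zero entry of T once."""
    entries = (k ** N) * (N * (k - 1) + 1)     # non-zeros of T
    seconds = entries * flops_per_entry / gpu_flops
    return usd_per_hour * seconds / 3600.0

# The cost explodes exponentially with the number of agents N.
small = cost_of_transition_matrix(N=5, k=5)
large = cost_of_transition_matrix(N=20, k=5)
```

Under any reasonable choice of constants, moving from a handful of agents to tens of agents multiplies the bill by many orders of magnitude, which is the phenomenon Fig. 2 depicts.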
Conclusion I: Given exponentially-sized payoff matrices, merely constructing the transition matrices in α-Rank for tens of agents, each with a handful of strategies, requires about one trillion dollars in budget.
Though assumed given, in reality the payoff values come at a non-trivial cost themselves, which is particularly true in reinforcement learning tasks [silver2016mastering]. Here, we take a closer look at the amount of money it takes to attain the payoff matrices for the experiments listed in [omidshafiei2019alpha], which we present in Table 2. Following the methodology here, we first count the total FLOPs each model uses, in units of PetaFlop/s-days, i.e., 10^15 operations per second sustained for one day. For each experiment, if the answer to “how many GPUs were used for training, and for how long?” was not available, we traced back to the neural architecture used and counted the operations needed for both forward and backward propagation. The cost in time was then converted from PetaFlop/s-days using the Tesla K80 as discussed above. In addition, we also list the cost of attaining payoff values for the most recent AlphaStar model [vinyals2019grandmaster]. It is obvious that although α-Rank could take the payoff values as “input” at a hefty price, the cost of acquiring such values is far from negligible; e.g., the payoff values for Go alone would require a single GPU to run for more than five thousand years¹⁰! (¹⁰It is worth mentioning that here we only count running each experiment once per payoff value. In practice, exact payoff values are hard to know since game outcomes are noisy; therefore, multiple samples are often needed (check Theorem 3.2 in [rowl2019multiagent]), which would make the numbers in Table 2 even larger.)
Conclusion II: Acquiring the necessary inputs to α-Rank easily becomes intractable, giving credence to our arguments in Section 4.1.
5 A Practical Solution to α-Rank
One can consider approximate solutions to the problem in Eqn. 3. As briefly surveyed in Section 3, most current methods unfortunately require exponential time and memory. We believe that achieving a solution with reduced time complexity is an interesting open question in linear algebra in general, and we leave such a study to future work. Here, we rather contribute a stochastic optimisation method that attains a solution through random sampling of payoff matrices, without the need to store the exponentially-sized input. Contrary to the memory requirements reported in Table 1, our method requires only a linear (in the number of agents) per-iteration memory complexity of the form O(N(k − 1)). It is worth noting that most other techniques need to store exponentially-sized matrices before commencing with any numerical instructions. Though we do not theoretically contribute to reductions in time complexity, we do augment our algorithm with a double-oracle heuristic for joint-strategy-space reduction. In fact, our experiments reveal that αα-Rank can converge to the correct top-rank strategies within hundreds of iterations in large strategy spaces, i.e., spaces with 33 million profiles.
Optimisation Problem Formulation:
Computing the stationary distribution can be rewritten as an optimisation problem:

min_x ‖T^⊤x − x‖₂²  subject to  1^⊤x = 1, x ≥ 0,   (4)

where the constrained objective in Eqn. 4 simply seeks a vector x minimising the distance between T^⊤x and x itself, while ensuring that x lies on the (k^N − 1)-dimensional probability simplex. To handle the exponential complexity needed for acquiring exact solutions, we pose a relaxation of the problem in Eqn. 4 and focus on computing an approximate solution vector x* instead, where x* solves:

min_x ‖T^⊤x − x‖₂² + (1^⊤x − 1)²  subject to  x ≥ 0.   (5)
Before proceeding, however, it is worth investigating the relation between the solutions of the original (Eqn. 4) and relaxed (Eqn. 5) problems. We summarise such a relation in the following proposition, which shows that determining x* suffices for computing the stationary distribution of α-Rank’s Markov chain:

Proposition: [Connection to the Markov Chain] Let x* be a solution to the relaxed optimisation problem in Eqn. 5. Then x* is the stationary distribution of Eqn. 3 in Section 2.
Importantly, the above proposition additionally allows us to focus on solving the problem in Eqn. 5, which only exhibits inequality constraints. Problems of this nature can be solved by introducing a barrier function, leading to an unconstrained finite-sum minimisation problem. Denoting by t_i the i-th column of T (so that (T^⊤x)_i = t_i^⊤x) and introducing logarithmic barrier functions with penalty parameter λ, we have:

min_x Σ_{i=1}^{k^N} (t_i^⊤x − x_i)² + (1^⊤x − 1)² − λ Σ_{i=1}^{k^N} log x_i.   (6)
Eqn. 6 represents a standard finite-sum minimisation problem, which can be solved using any off-the-shelf stochastic optimisation method, e.g., stochastic gradient descent or ADAM [kingma2014adam]. A stochastic gradient execution involves sampling a strategy profile index i_t at iteration t, and then executing a descent step x_{t+1} = x_t − η_t g̃_t, with g̃_t a sub-sampled gradient of Eqn. 6 and λ_t a scheduled penalty parameter with λ_t → 0 as t → ∞:

g̃_t = 2k^N (t_{i_t}^⊤x_t − x_t[i_t]) (t_{i_t} − e_{i_t}) + 2(1^⊤x_t − 1)·1 − λ_t (1/x_t),   (7)

where e_i denotes the i-th standard basis vector and 1/x_t is taken element-wise.
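As a minimal sketch of optimising the barrier objective in Eqn. 6, the snippet below runs plain deterministic gradient descent on a tiny invented 3-state chain (the paper's solver instead sub-samples columns as in Eqn. 7); it approximately recovers the chain's stationary distribution:

```python
# Minimal sketch: gradient descent on the barrier objective of Eqn. 6,
#   f(x) = 0.5*||T^T x - x||^2 + 0.5*(1^T x - 1)^2 - mu * sum_i log(x_i),
# with a decaying barrier parameter mu. Deterministic (full gradient) for
# clarity; the paper's method sub-samples columns of T instead (Eqn. 7).

def barrier_gradient_solver(T, iters=5000, lr=0.1, mu0=0.01):
    n = len(T)
    x = [1.0 / n] * n
    for t in range(1, iters + 1):
        mu = mu0 / t                 # decaying barrier penalty
        # residual r_j = (T^T x)_j - x_j
        r = [sum(T[i][j] * x[i] for i in range(n)) - x[j] for j in range(n)]
        s = sum(x) - 1.0             # violation of the simplex-sum term
        g = [
            sum(r[j] * (T[i][j] - (1.0 if i == j else 0.0)) for j in range(n))
            + s - mu / x[i]          # penalty + log-barrier gradients
            for i in range(n)
        ]
        x = [max(x[i] - lr * g[i], 1e-12) for i in range(n)]
    return x

# An invented row-stochastic transition matrix over three joint profiles.
T = [
    [0.5, 0.4, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
]
x = barrier_gradient_solver(T)  # approx. stationary distribution of T
```

The returned vector is close to the stationary distribution of T, in line with the proposition above.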
To avoid any confusion, we name the above stochastic approach to solving α-Rank via Eqns. 6 & 7 αα-Rank, and present its pseudocode in Algorithm 2. When comparing our algorithm to those reported in Table 1, it is worth highlighting that computing updates using Eqn. 7 requires no storage of the full transition or payoff matrices, as updates are performed using only sub-sampled columns, as shown in line 11 of Algorithm 2.
5.1 αα-Rank with Efficient Exploration & Oracles
Stochastic sampling makes it possible to solve αα-Rank without storing the transition matrix T; however, the size of the solution vector (i.e., k^N) can still be prohibitively large. Here we further boost the scalability of our method by introducing an oracle mechanism. The heuristic of oracles was first proposed for solving large-scale zero-sum matrix games [mcmahan2003planning]. The idea is to first create a restricted sub-game in which all players are only allowed to play a restricted number of strategies; the strategy sets are then expanded by adding each player’s best response to its opponents, and the sub-game is replayed with the agents’ augmented strategy sets before a new round of best responses is computed.
The best response is assumed to be given by an oracle, which can be implemented simply by a grid search: given the top-rank profile π_t at iteration t, the goal for agent l is to select the optimal π^l from a predefined strategy set so as to maximise its reward:

π^l_{t+1} ∈ argmax_{π^l ∈ S^l} E[Σ_s R^l(s, a^l, a^{−l})],   (8)

with s denoting the state, and a^l, a^{−l} denoting the actions of agent l and its opponents, respectively. Though the worst-case scenario of introducing oracles would require solving the original evaluation problem, our experimental results on large-scale systems demonstrate efficiency through early convergence.
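A grid-search oracle of the kind in Eqn. 8 can be sketched as follows, with `payoff` a stand-in evaluation function (not the paper's simulator):

```python
# Sketch of a grid-search best-response oracle (Eqn. 8): agent `agent`
# scans a candidate strategy pool and keeps whichever entry maximises its
# own payoff while the opponents' strategies in `profile` stay fixed.
# `payoff(agent, profile)` is a stand-in evaluation function.

def best_response(payoff, profile, agent, candidate_strategies):
    best_s, best_r = None, float("-inf")
    for s in candidate_strategies:
        trial = profile[:agent] + (s,) + profile[agent + 1:]
        r = payoff(agent, trial)
        if r > best_r:
            best_s, best_r = s, r
    return best_s
```

In the double-oracle loop, each agent's best response against the current top-rank profile is appended to its restricted strategy set before the sub-game is re-solved.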
For a complete exposition, we summarise the pseudocode of our proposed method, named αα-Oracle, in Algorithm 1. αα-Oracle degenerates to αα-Rank if one initialises the strategy sets of all agents at their full size from the beginning, i.e., S^l_0 = S^l.
Providing a valid convergence guarantee for αα-Oracle is an interesting direction for future work. In fact, [muller2019generalized] recently proposed the closely related idea of adopting an oracle mechanism within α-Rank, albeit without a stochastic solver. Interestingly, they report that bad initialisation can lead to failures in recovering top-rank strategies. Contrary to the results reported in [muller2019generalized], we demonstrate the effectiveness of our approach by running multiple trials with different initialisations. In addition, we believe the stochastic nature of αα-Oracle potentially prevents it from being trapped in the local minima induced by sub-games.
6 Experiments
In this section, we demonstrate the scalability of αα-Rank in successfully recovering optimal policies in self-driving car simulations and in the Ising model – a setting with tens of millions of possible strategy profiles. We note that these sizes are far beyond the capability of state-of-the-art methods; α-Rank [omidshafiei2019alpha] considers only a handful of agents and strategies at maximum. All of our experiments were run on only a single commodity machine with a multi-core Intel i9 CPU.
Sparsity Data Structure:
During the implementation phase, we realised that the transition probability matrix T of the Markov chain induces a sparsity pattern (each row and column of T contains only N(k − 1) + 1 non-zero elements; see Section 5) which, if exploited, leads to significant speed-ups. To fully leverage such sparsity, we tailored a novel data structure for the sparse storage and computations needed by Algorithm 2. More details are given in Appendix 1.1.
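One simple layout exploiting this sparsity – and the one-column access pattern of Algorithm 2 – is to store T column-wise as hash maps of its non-zero entries; this is an illustrative sketch, not the exact structure of Appendix 1.1:

```python
# Illustrative sparse layout: since each column of T has only N(k-1)+1
# non-zeros, store T column-wise as hash maps {row_index: value}, so the
# single column needed by a stochastic update is O(N(k-1)) to fetch.
# This is a sketch, not the exact structure of Appendix 1.1.

class SparseColumns:
    def __init__(self):
        self.cols = {}                        # column index -> {row: value}

    def set(self, i, j, value):
        """Record entry T[i, j] = value."""
        self.cols.setdefault(j, {})[i] = value

    def column(self, j):
        """Non-zero entries of column j, as a dict (empty if none stored)."""
        return self.cols.get(j, {})
```

Because only the sampled column is ever materialised, the per-iteration memory matches the linear O(N(k − 1)) figure claimed in Section 5.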
Correctness of Ranking Results:
As Algorithm 2 is a generalisation (in terms of scalability) of α-Rank, it is instructive to validate the correctness of our results on three simple matrix games. Due to space constraints, we defer the full description of these tasks to Appendix 1.2. Fig. 3, however, shows that the results generated by αα-Rank are indeed consistent with those reported in [omidshafiei2019alpha].
Complexity Comparisons on Random Matrices:
To further assess scalability, we measured the time and memory needed by our method to compute stationary distributions for simulated random matrices of varying sizes. Baselines included eigenvalue decomposition from NumPy, optimisation tools from PyTorch, and α-Rank from OpenSpiel [lanctot2019openspiel]. We terminated the execution of αα-Rank when gradient norms fell below a predefined threshold of 0.01. According to Fig. 4, αα-Rank achieves a three-orders-of-magnitude reduction in time (i.e., about a thousand times faster) compared to the default α-Rank implementation from [lanctot2019openspiel]. Memory-wise, our method uses only half the space on the largest matrices considered.

Autonomous Driving on Highway:
Having assessed correctness and scalability, we now present novel application domains for large-scale multi-agent/multi-player systems. For this, we made use of highway [highwayenv], an environment for simulating self-driving scenarios with social vehicles, designed to mimic real-world traffic flow. We conducted a ranking experiment involving five agents, each with a pool of trained driving strategies, yielding a joint-strategy space that is exponential in the number of agents. Agent strategies varied between “rational” and “dangerous” drivers, which we encoded using different reward functions during training (complete details of the reward functions are in Appendix 2.2). Under this setting, we knew upfront that the optimal profile corresponds to all five agents being rational drivers. Cars were trained using value iteration, and we report rewards averaged over 200 test trials.
We considered both αα-Rank and αα-Oracle, and report results across multiple random seeds. We set αα-Oracle to run a fixed number of gradient-update iterations when solving for the top-rank strategy profile (Algorithm 2). The results depicted in Fig. 5(a) clearly demonstrate that both of our proposed methods are capable of recovering the correct highest-ranking strategy profile. αα-Oracle converges faster than αα-Rank, which we believe is due to the oracle mechanism saving the time otherwise spent inefficiently exploring “dangerous” drivers. We also note that although problems of this size are feasible for α-Rank and the power method, our results achieve a four-orders-of-magnitude reduction in the number of iterations.
Ising Model Experiment:
We repeated the above experiment on the Ising model [ising1925beitrag], which is typically used for describing ferromagnetism in statistical mechanics. It assumes a system of $M$ magnetic spins, where each spin $s_i$ is either up ($s_i = +1$) or down ($s_i = -1$). The system energy is defined by $E(\mathbf{s}) = -\sum_i h_i s_i - \frac{\lambda}{2}\sum_{i \neq j} s_i s_j$, with $h_i$ and $\lambda$ being constant coefficients. The probability of one spin configuration is $P(\mathbf{s}) = \exp(-E(\mathbf{s})/\tau)/Z$, where $\tau$ is the environmental temperature and $Z = \sum_{\mathbf{s}} \exp(-E(\mathbf{s})/\tau)$ is the partition function. Finding the equilibrium of the system is notoriously hard because computing $Z$ requires enumerating all $2^M$ possible configurations. Traditional approaches include Markov Chain Monte Carlo (MCMC). An interesting phenomenon is the phase change: the spins reach an equilibrium at low temperatures, but as $\tau$ increases, this equilibrium suddenly breaks and the system becomes chaotic.
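To make the intractability of $Z$ concrete, a brute-force sketch under the energy defined above might look as follows; the coefficients and system size are illustrative, and the enumeration cost grows as $2^M$:

```python
import itertools
import numpy as np

def energy(s, h, lam):
    # E(s) = -sum_j h_j s_j - (lam/2) * sum_{i != j} s_i s_j
    s = np.asarray(s, dtype=float)
    pair = s.sum() ** 2 - (s ** 2).sum()   # sum over all i != j of s_i s_j
    return -h @ s - 0.5 * lam * pair

def boltzmann(h, lam, tau):
    """Exact configuration probabilities by enumerating all 2^M spin
    configurations -- precisely the cost that makes large systems hard."""
    M = len(h)
    configs = list(itertools.product([-1, 1], repeat=M))
    weights = np.array([np.exp(-energy(s, h, lam) / tau) for s in configs])
    Z = weights.sum()                      # partition function
    return configs, weights / Z

h = np.zeros(4)                            # no external field, M = 4 spins
configs, probs = boltzmann(h, lam=1.0, tau=1.0)
assert np.isclose(probs.sum(), 1.0)
# With no field, flipping every spin leaves the energy unchanged, so each
# configuration and its all-flipped mirror are equally likely.
assert np.isclose(probs[0], probs[-1])
```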
Here we try to observe the phase change through multi-agent evaluation methods. We treat each spin as an agent, with the reward of agent $i$ given by its local energy contribution, $r_i(\mathbf{s}) = h_i s_i + \frac{\lambda}{2}\sum_{j \neq i} s_i s_j$. We consider the top-rank strategy profile from our Oracle method as the system equilibrium and compare it against the ground truth from MCMC. We consider a 2D model whose induced strategy space is prohibitively large, to which existing methods are inapplicable. Fig. 5(b) illustrates that our method identifies the same phase change as MCMC does. We also show an example of how the Oracle's top-ranked profile finds the system's equilibrium at a low temperature in Fig. 5(c). Note that a problem with this many agents, each with two strategies (spin up or down), goes far beyond the capability of α-Rank on a single machine (billions of elements in the transition matrix); we therefore do not list its performance here.
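The MCMC ground truth can be sketched with a standard Metropolis sampler on a small lattice, using the common nearest-neighbour variant of the model started from an ordered state; the lattice size, coupling, and temperatures below are illustrative choices, not the paper's exact setup:

```python
import numpy as np

def metropolis_magnetisation(L=10, tau=1.0, J=1.0, steps=20_000, seed=0):
    """Metropolis sampling on an L x L Ising lattice (periodic boundaries),
    starting from the all-up state; returns |magnetisation| per spin."""
    rng = np.random.default_rng(seed)
    s = np.ones((L, L), dtype=int)
    for _ in range(steps):
        i, j = rng.integers(0, L, size=2)
        # Energy change of flipping spin (i, j): dE = 2 J s_ij * (neighbour sum)
        nb = (s[(i + 1) % L, j] + s[(i - 1) % L, j]
              + s[i, (j + 1) % L] + s[i, (j - 1) % L])
        dE = 2.0 * J * s[i, j] * nb
        if dE <= 0 or rng.random() < np.exp(-dE / tau):
            s[i, j] = -s[i, j]
    return abs(s.mean())

# Below the critical temperature the spins stay aligned (ordered equilibrium);
# well above it the system disorders -- the phase change discussed above.
assert metropolis_magnetisation(tau=1.0) > metropolis_magnetisation(tau=10.0)
```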
7 Conclusions & Future Work
In this paper, we presented the major bottlenecks prohibiting α-Rank from scaling beyond tens of agents. Depending on the type of input, α-Rank's time and memory complexities can easily become exponential. We further argued that notions introduced in α-Rank can lead to confusing tractability results on notoriously difficult NP-hard problems. To eradicate any doubts, we empirically validated our claims, presenting dollars spent as an irrefutable metric.
Realising these problems, we proposed a scalable alternative for multi-agent evaluation based on stochastic optimisation and double oracles, along with rigorous scalability results on a variety of benchmarks. For future work, we first plan to study the relation between α-Rank's solution and the Nash equilibrium. Second, we will attempt a theoretical study of the convergence of our proposed Oracle algorithm.
References
Appendix
1 Implementation of Rank
Experiments  Max. Iteration  

NFG (without selfplay)  50  1.0  1000  0.9  0.5  0.1  
NFG (selfplay)  50  0.03  1000  0.9  0.5  0.1  
Random Matrix  n/a  n/a  0.01  1000  n/a  0.1  0.01 
Car Experiment (SGD)  40  1.0  15.0  2000  0.999  0.5  0.1 
Car Experiment (Oracle)  40  1.0  1.0  200  0.999  0.5  0.1 
Ising Model  40  90.0  0.01  4000  0.999  0.5  0.1 
1.1 The Data Structure for Sparsity
The transition probability matrix in α-Rank is sparse; each of its rows and columns contains only a few non-zero elements (see Section 5). To fully leverage this sparsity, we design a new data structure (see Fig. 6) for storage and computation. Compared to standard techniques (e.g., COO, CSR, and CRS; see https://docs.scipy.org/doc/scipy/reference/sparse.html) that store (row, column, value) triplets of a sparse vector, our data structure adopts a more efficient protocol that stores (defaults, positions, biases), leading to improvements in computational efficiency. We overload the operations for this data structure, including addition, scalar multiplication, dot product, element-wise square root, and L1 norm. We show the example of addition in Fig. 6.
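As an illustration, a minimal Python sketch of such a (default, positions, biases) vector with overloaded addition and dot product might look as follows; the class and method names are ours, and this is a sketch of the idea rather than the exact implementation:

```python
import numpy as np

class DefaultSparseVector:
    """Every entry equals `default`, except at `positions`, where the entry
    equals `default + bias`. Useful when a vector is dense but nearly
    constant, as in alpha-Rank's transition columns."""

    def __init__(self, n, default, positions, biases):
        self.n = n
        self.default = float(default)
        self.data = dict(zip(positions, biases))   # position -> bias

    def __add__(self, other):
        assert self.n == other.n
        # Defaults add; biases add wherever either vector deviates.
        merged = {p: self.data.get(p, 0.0) + other.data.get(p, 0.0)
                  for p in set(self.data) | set(other.data)}
        return DefaultSparseVector(self.n, self.default + other.default,
                                   merged.keys(), merged.values())

    def dot(self, other):
        assert self.n == other.n
        # Start from the dense default-only product, then correct deviations.
        total = self.n * self.default * other.default
        for p in set(self.data) | set(other.data):
            a = self.default + self.data.get(p, 0.0)
            b = other.default + other.data.get(p, 0.0)
            total += a * b - self.default * other.default
        return total

    def to_dense(self):
        v = np.full(self.n, self.default)
        for p, b in self.data.items():
            v[p] += b
        return v

x = DefaultSparseVector(5, 0.1, [1, 3], [0.5, -0.05])
y = DefaultSparseVector(5, 0.2, [3], [0.3])
assert np.allclose((x + y).to_dense(), x.to_dense() + y.to_dense())
assert np.isclose(x.dot(y), x.to_dense() @ y.to_dense())
```

Both operations touch only the deviating positions, so their cost is proportional to the number of biases rather than to the full dimension.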
1.2 Validity Check on Normal-form Games
Our algorithm provides the expected ranking in all three normal-form games shown in Fig. 3, consistent with the results of α-Rank [omidshafiei2019alpha].
Battle of the Sexes. Battle of the Sexes is an asymmetric two-player game. α-Rank suggests that populations would spend an equal amount of time on the profiles (O, O) and (M, M) during the evolution. The distribution mass of (M, O) drops to zero faster than that of (O, M); this is because deviating from (M, O) yields a larger gain for either player than deviating from (O, M).
Biased Rock-Paper-Scissors. We consider a biased RPS game. As it is a single-population game, we adopt the transition probability matrix of Eqn. 11 in [omidshafiei2019alpha]. The game has the inherent structure that Rock/Paper/Scissors are equally likely to be invaded by a mutant; e.g., the Scissors population will always be fixated by the Rock population. Therefore, our method suggests the long-term survival rates of all three strategies are the same (1/3 each). Note this differs from the Nash equilibrium solution, which is non-uniform due to the bias.
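The uniform survival result can be reproduced on any single-population chain with this cyclic fixation structure. The toy transition matrix below is illustrative (its values are not Eqn. 11 itself): each monomorphic population is invaded, with some probability, by the strategy that beats it.

```python
import numpy as np

# Cyclic fixation over (Rock, Paper, Scissors): with probability mu a
# monomorphic population is fixated by the strategy that beats it
# (Rock -> Paper -> Scissors -> Rock). Values are illustrative.
mu = 0.2
C = np.array([
    [1 - mu, mu,     0.0   ],   # Rock is fixated by Paper
    [0.0,    1 - mu, mu    ],   # Paper is fixated by Scissors
    [mu,     0.0,    1 - mu],   # Scissors is fixated by Rock
])

# Stationary distribution: left eigenvector of C for eigenvalue 1.
vals, vecs = np.linalg.eig(C.T)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
pi /= pi.sum()
assert np.allclose(pi, [1/3, 1/3, 1/3])   # equal long-term survival
```

Because the chain is doubly stochastic, its stationary distribution is uniform regardless of mu, matching the 1/3 survival rate above.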
Prisoner's Dilemma. In the Prisoner's Dilemma, cooperation is an evolutionarily transient strategy, since a cooperating player can always be exploited by a defecting player. Our method thus yields (Defect, Defect) as the only strategy profile that survives long-term evolution.
2 Additional Details for Experiments
2.1 Hyperparameters Settings
For all of our experiments, the gradient updates include two phases: a warm-up phase and an Adam [kingma2014adam] phase. In the warm-up phase, we use standard stochastic gradient descent; after that, we replace SGD with Adam until convergence. In practice, we find this yields faster convergence than plain stochastic gradient descent. As our algorithm performs column sampling on the stochastic matrix (i.e., the batch size equals one), adding a momentum term intuitively helps stabilise learning. The number of warm-up steps is the same for all experiments. We also implement the infinite-α setting of [lanctot2019openspiel] when calculating the transition matrix (or its columns), where a small noise term is added.
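A minimal sketch of this warm-up-then-Adam schedule on a toy objective follows; the hyperparameters, the 1/sqrt(k) step decay in the Adam phase, and all names are illustrative rather than the experiments' exact settings:

```python
import numpy as np

def two_phase_minimise(grad, x0, warmup_steps=100, total_steps=1000,
                       lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Two-phase schedule: plain SGD during warm-up, then Adam updates
    (with a simple 1/sqrt(k) step decay so the iterates settle)."""
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)              # Adam first-moment (momentum) estimate
    v = np.zeros_like(x)              # Adam second-moment estimate
    for t in range(1, total_steps + 1):
        g = grad(x)
        if t <= warmup_steps:
            x = x - lr * g                        # warm-up: vanilla SGD
        else:
            k = t - warmup_steps                  # Adam step counter
            m = beta1 * m + (1 - beta1) * g
            v = beta2 * v + (1 - beta2) * g ** 2
            m_hat = m / (1 - beta1 ** k)          # bias corrections
            v_hat = v / (1 - beta2 ** k)
            x = x - (lr / np.sqrt(k)) * m_hat / (np.sqrt(v_hat) + eps)
    return x

# Toy deterministic quadratic with minimum at [1, -2] (loose tolerance:
# Adam behaves like sign-descent on deterministic gradients).
grad = lambda x: 2.0 * (x - np.array([1.0, -2.0]))
x_star = two_phase_minimise(grad, x0=[5.0, 5.0])
assert np.allclose(x_star, [1.0, -2.0], atol=0.1)
```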
For most of our experiments that involve ranking, the terminating condition is met when the gradient norm falls below a fixed threshold; for the Random Matrix experiment the terminating gradient norm is set separately. For the Random Matrix experiment we also choose:
- the learning rate from between 15 and 17;
- α (the ranking intensity) from between 1 and 2.5;
- the population size from between 25 and 55 (in integers).
For all of the Adam experiments, after the warm-up phase we decay β₁ and β₂ at a fixed interval of time steps. In the speed and memory experiment, we choose a different decay rate.
List of symbols and names:
- Population size
- Ranking intensity: α
- Learning rate: η
2.2 Selfdriving Car Experiment
Driver type  Collision Reward  Speed Reward

Rational driver  2.0  0.4 
Dangerous driver 1  10.0  10.0 
Dangerous driver 2  20.0  10.0 
Dangerous driver 3  30.0  10.0 
Dangerous driver 4  40.0  10.0 
The environmental reward given to each agent combines the collision reward and the speed reward above. The collision reward is incurred when an agent collides with either a social car or another agent. All of our value-iteration agents are based on the [highwayenv] environment discretisation, which represents the environment as a time-to-collision MDP, assuming the other agents move at constant speed. For all experiments, we run value iteration for a fixed number of steps with a fixed discounting factor. For each controllable car, the default speed is randomised within a fixed range, as is the speed of each social car. We define five types of driving behaviours (one rational plus four dangerous) by letting each controlled car have a different ego reward function during training (though the reward we report is the environmental reward, which cannot be changed). This ensures we know upfront that the best joint strategy is for all cars to drive rationally.
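For reference, value iteration on a discretised MDP can be sketched as below; the 3-state chain is a hypothetical stand-in, not the time-to-collision MDP itself:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, iters=500):
    """Standard value iteration: P[a] is the transition matrix under action a,
    R[a] the per-state reward. Returns state values and the greedy policy."""
    n_actions, n_states = R.shape
    V = np.zeros(n_states)
    for _ in range(iters):
        Q = R + gamma * np.einsum('asn,n->as', P, V)   # Q[a, s]
        V = Q.max(axis=0)
    return V, Q.argmax(axis=0)

# Hypothetical 3-state chain: action 1 moves right toward rewarding state 2.
P = np.zeros((2, 3, 3))
P[0] = np.eye(3)                        # action 0: stay put
P[1] = np.array([[0, 1, 0],             # action 1: move right
                 [0, 0, 1],
                 [0, 0, 1]])
R = np.array([[0.0, 0.0, 1.0],          # reward only accrues in state 2
              [0.0, 0.0, 1.0]])
V, policy = value_iteration(P, R)
assert policy[0] == 1 and policy[1] == 1   # drive toward the rewarding state
```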