α^α-Rank: Practically Scaling α-Rank through Stochastic Optimisation

09/25/2019
by   Yaodong Yang, et al.

Recently, α-Rank, a graph-based algorithm, has been proposed as a solution for ranking joint policy profiles in large-scale multi-agent systems. α-Rank claimed tractability through a polynomial-time implementation with respect to the total number of pure strategy profiles. Here, we note that the inputs to the algorithm were not clearly specified in the original presentation; as such, we deem the complexity claims ungrounded, and conjecture that solving α-Rank is NP-hard. The authors of α-Rank suggested that the input to α-Rank can be an exponentially-sized payoff matrix, a claim promised to be clarified in subsequent manuscripts. Even though α-Rank exhibits a polynomial-time solution with respect to such an input, we highlight further critical problems. We demonstrate that, due to the need of constructing an exponentially large Markov chain, α-Rank is infeasible beyond a small finite number of agents. We ground these claims by adopting the amount of dollars spent as a non-refutable evaluation metric. Realising this scalability issue, we present a stochastic implementation of α-Rank with a double oracle mechanism allowing for reductions in joint strategy spaces. Our method, α^α-Rank, does not need to store the exponentially-large transition matrix, and can terminate early under a required precision. Although our method exhibits similar worst-case complexity guarantees to α-Rank in theory, it allows us, for the first time, to practically conduct large-scale multi-agent evaluations. On 10^4×10^4 random matrices, we achieve a 1000x speedup. Furthermore, we also show successful results on large joint strategy profiles with a maximum size in the order of O(2^25) (33 million joint strategies) – a setting not evaluable using α-Rank with a reasonable computational budget.



1 Introduction

Scalable policy evaluation and learning have been long-standing challenges in multi-agent reinforcement learning (MARL), with two difficulties obstructing progress. First, joint-strategy spaces explode exponentially when a large number of strategic decision-makers is considered, and second, the underlying game dynamics may exhibit cyclic behaviour (e.g., the game of Rock-Paper-Scissors), rendering an appropriate evaluation criterion non-trivial. Focusing on the second challenge, much work in multi-agent systems followed a game-theoretic treatment proposing fixed points, e.g., the Nash equilibrium [nash1950equilibrium], as potentially valid evaluation metrics. Though appealing, such measures are normative only when prescribing behaviours of perfectly rational agents – an assumption rarely met in reality [grau2018balancing, wen2019probabilistic, Felix, wen2019multi]. In fact, many game dynamics have been proven not to converge to any fixed-point equilibria [hart2003uncoupled, viossat2007replicator], but rather to limit cycles [palaiopanos2017multiplicative, bowling2001convergence]. Apart from these challenges, solving for a Nash equilibrium even in "simple" settings, e.g., two-player games, is known to be PPAD-complete [chen2005settling] – a demanding complexity class when it comes to computational requirements.

To address some of the above limitations, [omidshafiei2019alpha] recently proposed α-Rank as a graph-based game-theoretic solution to multi-agent evaluation. α-Rank adopts Markov Conley Chains to highlight the presence of cycles in game dynamics, and attempts to compute stationary distributions as a means for strategy-profile ranking. In a novel attempt, the authors reduce multi-agent evaluation to computing a stationary distribution of a Markov chain. Namely, considering a set of N agents each having a strategy pool of size k, a Markov chain is first defined over the graph of joint strategy profiles with a transition matrix T, and then a stationary distribution x is computed by solving x^T T = x^T. The probability mass x_π then represents the ranking of the joint-strategy profile π.

Extensions of α-Rank have been developed in various directions. [rowl2019multiagent] adapted α-Rank to model games with incomplete information. [muller2019generalized] combined α-Rank with policy-space response oracles (PSRO) [lanctot2017unified] and claimed their method to be a generalised training approach for multi-agent learning. Unsurprisingly, these works inherit the same claim of tractability from α-Rank. For example, the abstract in [muller2019generalized] reads "α-Rank, which is unique (thus faces no equilibrium selection issues, unlike Nash) and tractable to compute in general-sum, many-player settings."

In this work, we refine the claims made in α-Rank depending on its input type. We argue thoroughly that α-Rank exhibits a prohibitive computational and memory bottleneck that is hard to remedy even if payoff matrices were provided as inputs. We measure such a restriction using money spent as a non-refutable metric to assess the scale at which α-Rank remains valid. With this in mind, we then present a stochastic solver, which we title α^α-Rank, as a scalable and memory-efficient alternative. Our method reduces memory constraints, and makes use of a double oracle mechanism for reductions in joint strategy spaces. This, in turn, allows us to run large-scale multi-player experiments, including evaluation on self-driving cars and Ising models where the maximum size involves tens of millions of joint strategies.

2 A Review of α-Rank

Figure 1: Example of population-based evaluation on three players (star, triangle, circle), each with three strategies (denoted by the colours) and multiple copies per population. a) Each population obtains a fitness value depending on the strategies chosen, b) one mutant strategy (red star) occurs, and c) the population either retains the original strategy, or is fixated by the mutant strategy.

In α-Rank, strategy profiles of N agents are evaluated through an evolutionary process of mutation and selection. Initially, agent populations are constructed by creating multiple copies of each learner, assuming that all agents (in one population) execute the same unified policy. With this, α-Rank then simulates a multi-agent game played by randomly sampled learners from each population. Upon game termination, each participating agent receives a payoff to be used in policy mutation and selection after its return to the population. Here, the agent is faced with a probabilistic choice between switching to the mutant policy, continuing to follow its current policy, or randomly selecting a novel policy (other than the previous two) from the pool. This process repeats with the goal of determining an evolutionarily dominant profile that spreads across the population of agents. Fig. 1 demonstrates a simple example of a three-player game, each player having three strategies.

Mathematical Formulation:

To formalise α-Rank, we consider N agents, each denoted by i ∈ {1, ..., N}, having access to a set of strategies of size k_i. We refer to the strategy set of agent i by S_i = {π_i^(1), ..., π_i^(k_i)}, with π_i^(j) representing the j-th allowed policy of the i-th learner, defined over the set of states and the set of actions available to agent i. A joint strategy profile is a set of policies for all participating agents in the joint strategy set, i.e., π = (π_1, ..., π_N), with π_i ∈ S_i and π ∈ S_1 × ... × S_N. We assume k_i = k for all agents hereafter.

To evaluate performance, we assume each agent i is additionally equipped with a payoff (reward) function P_i. Crucially, the domain of P_i is the pool of joint strategies so as to accommodate the effect of other learners on the i-th player's performance. Finally, given a joint profile π, we define the corresponding joint payoff to be the collection of all individual payoff functions, i.e., P(π) = (P_1(π), ..., P_N(π)). After attaining payoffs from the environment, each agent returns to its population and faces a choice between switching the whole population to a mutant policy, exploring a novel policy, or sticking to the current one. Such a choice is probabilistic and defined proportionally to the attained rewards,

with μ being an exploration parameter (in the original paper, μ is heuristically set to a small positive constant to ensure at most two varying policies per population; theoretical justification can be found in [fudenberg2006imitation]), π_{-i} representing the policies followed by the other agents, and α a ranking-intensity parameter. A large α ensures that the probability that a sub-optimal strategy overtakes a better strategy is close to zero.

As noted in [omidshafiei2019alpha], one can relate the above switching process to a random walk on a Markov chain whose states are the elements of the joint strategy set. Essentially, the Markov chain models the sink strongly connected components (SSCCs) of the response graph associated with the game. The response graph of a game is a directed graph in which each node corresponds to a joint strategy profile, with a directed edge between two profiles that differ in a single player's strategy whenever the deviating player's new strategy is a better response for that player; the SSCCs of a directed graph are the (groups of) nodes with no out-going edges.

Each entry in the transition probability matrix T of the Markov chain refers to the probability of one agent switching from one policy to another in relation to the attained payoffs. Consider any two joint strategy profiles π and σ that differ in only one individual strategy, i.e., there exists a unique agent i such that π_i ≠ σ_i and π_{-i} = σ_{-i}. We set T(π, σ) = η ρ_{π_i → σ_i}(π_{-i}), with η = 1 / (Σ_l (k_l − 1)) and ρ_{π_i → σ_i}(π_{-i}) defining the probability that one copy of agent i with strategy σ_i invades the population with all other agents (in that population) playing π_i. Following [pinsky2010introduction], for P_i(σ) ≠ P_i(π), such a probability is formalised as

ρ_{π_i → σ_i}(π_{-i}) = (1 − exp(−α [P_i(σ) − P_i(π)])) / (1 − exp(−m α [P_i(σ) − P_i(π)])),     (1)

and ρ_{π_i → σ_i}(π_{-i}) = 1/m otherwise, with m being the size of the population. So far, we presented the relevant derivation for an entry of the state transition matrix when exactly one agent differs in exactly one strategy. Having one policy change, however, only represents a subset of allowed variations; two more cases need to be considered. First, for variations in joint policies involving more than one individual strategy change, we set T(π, σ) = 0 (this assumption significantly reduces the analysis complexity, as detailed in [fudenberg2006imitation]). Consequently, the remaining event of self-transitions can thus be written as T(π, π) = 1 − Σ_{σ ≠ π} T(π, σ). Summarising the above three cases, we can write the (π, σ) entry of the Markov chain's transition matrix as:

T(π, σ) = η ρ_{π_i → σ_i}(π_{-i})    if π and σ differ in only agent i's strategy,
T(π, σ) = 0                          if π and σ differ in more than one agent's strategy,
T(π, π) = 1 − Σ_{σ ≠ π} T(π, σ)      for self-transitions.     (2)
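To make Eqn. 1 concrete, here is a minimal Python sketch of the fixation probability; the function name and the example values of the ranking intensity α and population size m are illustrative choices, not constants from the paper.

import numpy as np

def fixation_probability(payoff_new, payoff_old, alpha=10.0, m=50):
    # Probability that a single mutant playing the new strategy takes over a
    # population of m copies playing the old strategy (sketch of Eqn. 1).
    delta = payoff_new - payoff_old
    if np.isclose(delta, 0.0):
        return 1.0 / m                      # neutral drift: fixation w.p. 1/m
    return (1.0 - np.exp(-alpha * delta)) / (1.0 - np.exp(-m * alpha * delta))

# A strongly better mutant fixates almost surely for a large ranking intensity.
print(fixation_probability(payoff_new=1.0, payoff_old=0.0))   # ~0.99995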
1: (Unspecified) Inputs: strategy sets S_1, ..., S_N, Multi-agent Simulator
2: List all possible joint-strategy profiles; for each profile π, run the multi-agent simulator to get the payoff values P_i(π) for all players i.
3: Construct the Markov chain's transition matrix T by Eqn. 2.
4: Compute the stationary distribution x by Eqn. 3.
5: Rank all joint-strategy profiles π based on their probability masses x_π.
6: Outputs: The ranked list of joint-strategy profiles (the mass of each element refers to the time that players spend playing that profile during evolution).
Algorithm 1 α-Rank (see Section 3.1.1 in [omidshafiei2019alpha])

The goal in α-Rank is to establish an ordering of policy profiles dependent on the evolutionary stability of each joint strategy. In other words, higher-ranked strategies are those that are prevalent in populations with a higher average time of survival. Formally, such a notion can be derived as the limiting vector x of our Markov chain when evolving from an initial distribution x_0. Knowing that the limiting vector is a stationary distribution, one can in fact calculate it as the solution to the following eigenvector problem:

x^T T = x^T,  with x ≥ 0 and Σ_π x_π = 1.     (3)

We summarise the pseudo-code of α-Rank in Algorithm 1. As the input to α-Rank is unclear and turns out to be controversial later, we point the reader to the original description in Section 3.1.1 of [omidshafiei2019alpha], and to the practical implementation of α-Rank in [lanctot2019openspiel], for self-judgement. In what comes next, we demonstrate that the tractability claim of α-Rank needs to be relaxed, as the algorithm exhibits exponential time and memory complexities in the number of players depending on the input type considered. This, consequently, renders α-Rank inapplicable to large-scale multi-agent systems, contrary to the original presentation.
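As an illustration of Algorithm 1 at toy scale, the sketch below constructs the transition matrix of Eqn. 2 for two agents with two strategies each and solves the eigenvector problem of Eqn. 3 directly; the coordination-game payoffs, population size and ranking intensity are made up for the example. The dense k^N x k^N construction used here is exactly what becomes infeasible as N grows.

import itertools
import numpy as np

k, N, m, alpha = 2, 2, 10, 1.0
payoff = {  # payoff[profile] = (payoff of agent 0, payoff of agent 1)
    (0, 0): (1.0, 1.0), (1, 1): (1.0, 1.0),
    (0, 1): (0.0, 0.0), (1, 0): (0.0, 0.0),
}
profiles = list(itertools.product(range(k), repeat=N))
idx = {p: i for i, p in enumerate(profiles)}
eta = 1.0 / (N * (k - 1))                 # uniform weight over unilateral deviations

def rho(p_new, p_old):                    # fixation probability, Eqn. 1
    d = p_new - p_old
    if np.isclose(d, 0.0):
        return 1.0 / m
    return (1 - np.exp(-alpha * d)) / (1 - np.exp(-m * alpha * d))

T = np.zeros((len(profiles), len(profiles)))
for p in profiles:
    for agent in range(N):
        for s in range(k):
            if s == p[agent]:
                continue
            q = tuple(s if j == agent else p[j] for j in range(N))
            T[idx[p], idx[q]] = eta * rho(payoff[q][agent], payoff[p][agent])
    T[idx[p], idx[p]] = 1.0 - T[idx[p]].sum()       # self-transition, Eqn. 2

# Stationary distribution: left eigenvector of T for eigenvalue 1 (Eqn. 3).
vals, vecs = np.linalg.eig(T.T)
x = np.real(vecs[:, np.argmax(np.real(vals))])
x = x / x.sum()
for p, mass in sorted(zip(profiles, x), key=lambda pair: -pair[1]):
    print(p, round(float(mass), 3))       # (0,0) and (1,1) share the top rank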

3 Claims & Refinements

The original presentation of α-Rank claims tractability in the sense that the algorithm runs in polynomial time with respect to the total number of joint-strategy profiles. Unfortunately, such a claim is not clear without a formal specification of the inputs to the algorithm in Section 3.1.1 of [omidshafiei2019alpha]. In fact, we next demonstrate that α-Rank can easily exhibit exponential complexity under the input of an N × k table of strategy sets, rendering it inapplicable beyond a finite small number of players. We also present a conjecture stating that determining the top-rank joint strategy profile in α-Rank is in fact NP-hard.

3.1 On -Rank’s Computational Complexity

Before diving into the details of our arguments, it is first instructive to note that tractable algorithms are those that exhibit a worst-case polynomial running time in the size of their input [papadimitriou2003computational]. Mathematically, for an input of size n, a polynomial-time algorithm adheres to an O(n^c) complexity for some constant c independent of n.

Following the presentation in Section 3.1.1 in [omidshafiei2019alpha], α-Rank assumes the availability of a game simulator to construct a payoff matrix quantifying the performance of joint strategy profiles. As such, we deem the necessary input for such a construction to be of size N × k, where N is the total number of agents and k is the total number of strategies per agent (we assume k_i = k for all agents for simplicity).

Following the definition above, if α-Rank possessed polynomial complexity, then it should attain a running time proportional to O((Nk)^c) with c a constant independent of N and k. As the algorithm requires computing a stationary distribution of a Markov chain described by a transition matrix with k^N rows and columns, the time complexity of α-Rank amounts to at least O(k^N). Clearly, this result demonstrates exponential, thus intractable, complexity in the number of agents N. In fact, we conjecture that determining the top-rank joint strategy profile using α-Rank with an N × k input is NP-hard.
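To make the growth concrete, consider a worked example with N = 10 agents and k = 10 strategies each (numbers chosen purely for illustration): the chain has k^N = 10^10 states, so T has (k^N)^2 = 10^20 entries. Storing T densely in double precision would require on the order of 8 x 10^20 bytes, and even a single pass over its entries at 10^12 operations per second would take more than three years.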

Conjecture: [α-Rank is NP-hard] Consider N agents each with k strategies. Computing the top-rank joint strategy profile with respect to the stationary distribution of the Markov chain's transition matrix T is NP-hard.

Reasoning: To illustrate the point of the conjecture above, imagine N agents each with k strategies. Following the certificate argument for determining complexity classes, we ask the question:

"Assume we are given a joint strategy profile π; is π top-ranked w.r.t. the stationary distribution of the Markov chain?"

To determine an answer to the above question, one requires an evaluation mechanism of some sort. If the time complexity of this mechanism is polynomial with respect to the input size, i.e., O((Nk)^c), then one can claim that the problem belongs to the NP complexity class. However, if the aforementioned mechanism exhibits an exponential time complexity, then the problem belongs to the NP-hard complexity class. When it comes to α-Rank, we believe a mechanism answering the above question would require computing a holistic solution of the problem, which, unfortunately, is exponential (i.e., O(k^N)). Crucially, if our conjecture proves correct, we do not see how α-Rank can handle more than a finite small number of agents. ∎

3.2 On Optimisation-Based Techniques

Given the exponential complexity derived above, one can resort to approximations of stationary distributions that aim at determining an ε-close solution for some precision parameter ε. Here, we note that a problem of this type is a long-standing classical problem in linear algebra. Various techniques, including the power method, PageRank, eigenvalue decomposition, and mirror descent, can be utilised. Briefly surveying this literature, we demonstrate that any such implementation (unfortunately) scales exponentially in the number of players. For a quick summary, please consult Table 1.

Power Method.

One of the most common approaches to computing a stationary distribution is the power method, which computes the stationary vector by constructing a sequence x_{t+1} = T^T x_t / ||T^T x_t||_1 from a non-zero initialisation x_0. Though viable, we first note that the power method exhibits an exponential memory complexity in terms of the number of agents. To formally derive the bound, define n to represent the total number of joint strategy profiles, i.e., n = k^N, and m the total number of transitions between the states of the Markov chain. By construction, one can easily see that m = n[N(k − 1) + 1], as each row and column in T contains at most N(k − 1) + 1 non-zero elements. Hence, the memory complexity of such an implementation is in the order of O(Nk · k^N). The time complexity of the power method, furthermore, is given by O(m · T_total), where T_total is the total number of iterations. Since m is of the order O(Nk · k^N), the total complexity of such an implementation is also exponential.
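For reference, a minimal dense power-method sketch is shown below on a toy 2-state chain; the routine itself is standard, and the point is that for α-Rank both the vector and the matrix would have k^N rows, which is precisely the memory bottleneck.

import numpy as np

def power_method(T, iters=1000, tol=1e-10):
    # T is row-stochastic; iterate x <- x T until the stationary vector is reached.
    x = np.full(T.shape[0], 1.0 / T.shape[0])       # uniform initialisation
    for _ in range(iters):
        x_new = x @ T                               # one application of T^T to x
        x_new /= x_new.sum()                        # re-normalise onto the simplex
        if np.linalg.norm(x_new - x, 1) < tol:
            return x_new
        x = x_new
    return x

print(power_method(np.array([[0.9, 0.1], [0.2, 0.8]])))   # approx. [2/3, 1/3]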

PageRank.

Inspired by ranking web pages on the internet, one can consider PageRank [page1999pagerank] for computing the solution to the eigenvalue problem presented above. Applied to our setting, we first realise that the memory requirement is analogous to that of the power method (in some works, the power method is in fact equated with the PageRank algorithm), i.e., O(Nk · k^N), and the time complexity is likewise exponential in the number of agents N.

Method           Time                           Memory
Power Method     O(Nk · k^N · T_total)          O(Nk · k^N)
PageRank         O(Nk · k^N · T_total)          O(Nk · k^N)
Eig. Decomp.     O(k^(ωN)), ω ≈ 2.376           O(k^(2N))
Mirror Descent   O(k^N log k^N) per iteration   O(Nk · k^N)
Table 1: Time and space complexity comparison given an N × k table as input.

Eigenvalue Decomposition.

Apart from the above, we can also consider the problem as a standard eigenvalue decomposition task (this is also what is used to implement α-Rank in [lanctot2019openspiel]) and adopt the method in [coppersmith1990matrix] to compute the stationary distribution. Unfortunately, state-of-the-art techniques for eigenvalue decomposition also require exponential memory (O(k^(2N))) and exhibit a time complexity of the form O(k^(ωN)) with ω ≈ 2.376 [coppersmith1990matrix]. Clearly, these bounds restrict α-Rank to a small number of agents N.

Mirror Descent.

Another optimisation-based alternative is the ordered subsets mirror descent algorithm [ben2001ordered]. This is an iterative procedure requiring a projection step onto the standard (k^N − 1)-dimensional simplex at every iteration. As mentioned in [ben2001ordered], computing this projection requires O(k^N log k^N) time. Hence, the projection step alone is exponential in the number of agents N. This makes mirror descent inapplicable to α-Rank when N is large.
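For completeness, the standard O(n log n) Euclidean projection onto the probability simplex (the sort dominates the cost) can be sketched as follows; for α-Rank, n = k^N, which is why even this single step is exponential in the number of agents.

import numpy as np

def project_to_simplex(v):
    # Projection of v onto {x : x >= 0, sum(x) = 1}; cost is dominated by the sort.
    n = v.size
    u = np.sort(v)[::-1]                                        # sort descending
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, n + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(v + theta, 0.0)

print(project_to_simplex(np.array([0.5, 1.2, -0.3])))           # sums to one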

Apart from the methods listed above, we are aware of other approaches that can solve for the leading eigenvector of large matrices, for example the online learning approach [garber2015online], sketching methods [tropp2017practical], and subspace iteration with Rayleigh-Ritz acceleration [golub2000eigenvalue]. The trade-off of these methods is that they usually assume a special structure of the matrix, such as being Hermitian or at least positive semi-definite, which α-Rank's transition matrix does not fit. Importantly, they cannot offer any advantages in time complexity either.

4 Reconsidering -Rank’s Inputs

Having discussed our results with the authors, we were advised that the "inputs" to α-Rank are exponentially-sized payoff matrices, i.e., treating line 2 of Algorithm 1 as an input (we were also promised that such claims would be clarified in subsequent submissions). Though polynomial in an exponentially-sized input, this consideration does not resolve the problems mentioned above. In this section, we further demonstrate additional theoretical and practical problems when considering the "input" advised by the authors.

4.1 On the Definition of Agents

α-Rank redefines a strategy to correspond to the agents under evaluation, differentiating them from players in the game (see line 4 in Section 3.1.1 and also Fig. 2a in [omidshafiei2019alpha]). Complexity results are then given in terms of these "agents", for which tractability is claimed. We would like to clarify that such definitions do not necessarily reflect the true underlying time complexity, whereby without formal input definitions, it is difficult to claim tractability.

To illustrate, consider solving a travelling salesman problem in which a traveller needs to visit a set of cities while returning to the origin following the shortest route. Although it is well known that the travelling salesman problem is NP-hard, following the line of thought presented in α-Rank, one could show that such a problem reduces to a polynomial-time (linear, i.e., tractable) problem in the number of "meta-cities", which is not a valid claim.

So what are the “meta-cities”, and what is wrong with the above argument?

A strategy in the travelling salesman problem corresponds to a permutation of the order of cities. Rather than operating with the number of cities, following α-Rank, we can construct the space of all permutations, calling each a "meta-city" (or agent). (How to enumerate all these permutations is an interesting question; analogously, if the exponentially-sized payoff matrix was not the input to α-Rank, enumerating such a matrix is also an interesting question.) Having enumerated all permutations, somehow, searching for the shortest route can be performed in polynomial time. Even though one can state that solving the travelling salesman problem is polynomial in the number of permutations, it is incorrect to claim that any such algorithm is tractable. The exact same argument can be made for α-Rank, whereby having a polynomial-time algorithm in an exponentially-sized space does not at all imply tractability. (Note that this claim does not apply to the complexity of solving for a Nash equilibrium; for example, in solving zero-sum games, polynomial tractability is never claimed in the number of players, whereas α-Rank claims tractability in the number of players.) It is for this reason that reporting complexity results needs to be done with respect to the size of the input without any redefinition (we believe these are the agents in multi-agent systems according to Section 3.1.1 in [omidshafiei2019alpha], and the cities in the travelling salesman problem).

Figure 2: Money cost of constructing the transition matrix T in computing α-Rank (line 3 in Algorithm 1). Note that one trillion dollars is the world's total hardware budget. The projected contours show that, due to the exponentially-growing size of α-Rank's "input", it is infeasible under a reasonable budget to handle multi-agent evaluations with more than about ten agents.
Game Env. PetaFlop/s-days Cost ($) Time (days)
AlphaZero Go [silver2017mastering]
AlphaGo Zero [silver2016mastering]
AlphaZero Chess [silver2017mastering]
MuJoCo Soccer [liu2019emergent]
Leduc Poker [lanctot2017unified]
Kuhn Poker [heinrich2015fictitious]
AlphaStar [vinyals2019grandmaster]
Table 2: Cost of obtaining the payoff table (line 2 in Algorithm 1) for the experiments conducted in [omidshafiei2019alpha]. We compute the numbers as the cost of running one joint-strategy profile multiplied by the number of joint-strategy profiles considered. Detailed computation can be found here.

As is clear so far, the inputs to α-Rank lack clarity. Confused about the form of the input, we realise that we are left with two choices: 1) a list of all k^N joint strategy profiles together with their payoffs, or 2) a table of size N × k – the collection of all of the players' strategy pools. If we are to follow the first direction, the claims made in the paper are of course correct; however, this by no means resolves the problem, as it is not clear how one would construct such an input in a tractable manner. Precisely, given an N × k table (the collection of all of the players' strategy pools) as input, constructing the aforementioned list requires exponential time (O(k^N)). In other words, providing α-Rank with such a list only hides the exponential complexity burden in a pre-processing step. Analogously, applying this idea to the travelling salesman problem described above would hide the exponential complexity in a pre-processing step used to construct all possible permutations. Provided these as inputs, the travelling salesman problem can now be "solved" in linear time, i.e., an intractable problem is transformed into a tractable one by a mere redefinition.

4.2 Dollars Spent: A Non-Refutable Metric

Admittedly, our arguments so far have been mostly theoretical and can become controversial depending on the setting one considers. To dispel any doubts, we followed the advice given by the authors and considered the input to α-Rank to be exponentially-sized payoff matrices. We then conducted an experiment measuring the dollars spent to evaluate the scalability of running just line 3 in Algorithm 1, while considering the tasks reported in [omidshafiei2019alpha].

Assuming the payoff matrices are given at no cost, the total amount of floating point operations (FLOPs) needed for constructing T as given in Eqn. 2 still grows exponentially with the number of agents. In terms of the monetary cost of just building T, we plot the dollar amount in Fig. 2, considering the Nvidia Tesla K80 GPU (https://en.wikipedia.org/wiki/Nvidia_Tesla) running at its maximum single-precision throughput, priced per hour on AWS (https://aws.amazon.com/ec2/instance-types/p2/). Clearly, Fig. 2 shows that, due to the fact that α-Rank needs to construct a Markov chain with a size exponential in the number of agents, it is only "money-feasible" on tasks with at most tens of agents. It is also worth noting that our analysis is optimistic in the sense that we have not considered the costs of storing T nor of computing stationary distributions.
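A back-of-the-envelope sketch of this cost model is given below; the FLOPs-per-entry, GPU throughput and hourly price are placeholder assumptions rather than the exact constants behind Fig. 2, so absolute dollar figures will differ, but the k^N scaling that drives the conclusion is the same.

def transition_cost_dollars(N, k, flops_per_entry=10.0,
                            flops_per_sec=5e12, dollars_per_hour=0.9):
    # Rough cost of filling the non-zero entries of T (Eqn. 2) for N agents
    # with k strategies each; all three constants are placeholder assumptions.
    num_profiles = k ** N
    nonzeros = num_profiles * (N * (k - 1) + 1)
    seconds = nonzeros * flops_per_entry / flops_per_sec
    return seconds / 3600.0 * dollars_per_hour

# Adding a single agent multiplies the cost by roughly a factor of k.
print(transition_cost_dollars(N=11, k=10) / transition_cost_dollars(N=10, k=10))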

Conclusion I: Given exponentially-sized payoff matrices, constructing the transition matrices in α-Rank for settings with more than about ten agents requires about one trillion dollars in budget.

Though assumed given, in reality the payoff values come at a non-trivial cost themselves, which is particularly true in reinforcement learning tasks [silver2016mastering]. Here, we take a closer look at the amount of money it takes to attain the payoff matrices for the experiments listed in [omidshafiei2019alpha], which we present in Table 2. Following the methodology described here, we first count the total FLOPs each model uses in units of PetaFlop/s-days, where one PetaFlop/s-day consists of performing 10^15 operations per second for one day. For each experiment, if the answer to "how many GPUs were trained and for how long" was not available, we traced back to the neural architecture used and counted the operations needed for both forward and backward propagation. The cost in time was then derived from the PetaFlop/s-days using the Tesla K80 as discussed above. In addition, we also list the cost of attaining payoff values from the most recent AlphaStar model [vinyals2019grandmaster]. It is obvious that, although α-Rank could take the payoff values as "input" at a hefty price, the cost of acquiring such values is far from negligible, e.g., the payoff values for Go would require a single GPU to run for more than five thousand years! (It is worth mentioning that here we only count running each experiment once to get each payoff value. In practice, the exact payoff values are hard to know since the game outcomes are noisy; therefore, multiple samples are often needed (see Theorem 3.2 in [rowl2019multiagent]), which would make the numbers in Table 2 even larger.)

Conclusion II: Acquiring the necessary inputs to α-Rank easily becomes intractable, giving credence to our arguments in Section 4.1.

1: Inputs: Number of trials R, total number of iterations T_max, decaying learning rate η_t, penalty parameter λ, decay rate γ, and a constraint relaxation term δ; initialise the trial counter r = 0.
2: while r < R do:
3:    Set the counter of oracle rounds, o = 0
4:    Initialise the restricted strategy sets S_i^0 by sub-sampling from the full strategy sets S_i
5:    while the strategy sets keep growing AND the oracle budget is not exhausted do:
6:       Compute the total number of joint profiles n = Π_i |S_i^o|
7:       Initialise a vector x_0 ∈ R^n
8:       for t = 1, ..., T_max do:           ▷ α^α-Rank update
9:          Uniformly sample one joint strategy profile index j_t
10:         Construct T_{:,j_t} as the j_t-th column of T
11:         Update x_t by Eqn. 7
12:         Set t = t + 1
13:      Get the top-ranked profile π^o by ranking the probability mass of x_T
14:      Set o = o + 1
15:      for each agent i do:            ▷ The oracles (Section 5.1)
16:         Compute the best response π_i^BR to π^o by Eqn. 8
17:         Update the strategy set by S_i^{o+1} = S_i^o ∪ {π_i^BR}
18:   Set r = r + 1
19: Return: The best performing joint-strategy profile among the R trials.
Algorithm 2 α^α-Oracle: Practical Multi-Agent Evaluation

5 A Practical Solution to α-Rank

One can consider approximate solutions to the problem in Eqn. 3. As briefly surveyed in Section 3, most current methods unfortunately require exponential time and memory. We believe achieving a solution that reduces the time complexity is an interesting open question in linear algebra in general, and we leave such a study to future work. Here, we rather contribute a stochastic optimisation method that can attain a solution through random sampling of payoff matrices without the need to store the exponentially-sized input. Contrary to the memory requirements reported in Table 1, our method requires a linear (in the number of agents) per-iteration complexity of the form O(Nk). It is worth noting that most other techniques need to store exponentially-sized matrices before commencing with any numerical instructions. Though we do not theoretically contribute to reductions in time complexity, we do, however, augment our algorithm with a double-oracle heuristic for joint strategy space reduction. In fact, our experiments reveal that α^α-Rank can converge to the correct top-rank strategies within hundreds of iterations in large strategy spaces, i.e., spaces with 33 million profiles.

Optimisation Problem Formulation:

Computing the stationary distribution can be rewritten as an optimisation problem:

min_x ||T^T x − x||_2^2,  subject to x ≥ 0 and 1^T x = 1,     (4)

where the constrained objective in Eqn. 4 simply seeks a vector x minimising the distance between T^T x and x itself, while ensuring that x lies on the (k^N − 1)-dimensional probability simplex. To handle the exponential complexity needed for acquiring exact solutions, we pose a relaxation of the problem in Eqn. 4 and focus on computing an approximate solution vector x* instead, where x* solves:

min_x ||T^T x − x||_2^2,  subject to x ≥ 0 and 1^T x ≥ 1.     (5)

Before proceeding, however, it is worth investigating the relation between the solutions of the original (Eqn. 4) and relaxed (Eqn. 5) problems. We summarise such a relation in the following proposition, which shows that determining x* suffices for computing the stationary distribution of α-Rank's Markov chain:

Proposition: [Connections to the Markov Chain] Let x* be a solution to the relaxed optimisation problem in Eqn. 5. Then x* / ||x*||_1 is the stationary distribution of Eqn. 3 in Section 2.

Importantly, the above proposition additionally allows us to focus on solving the problem in Eqn. 5, which only exhibits inequality constraints. Problems of this nature can be solved by considering barrier functions, leading to an unconstrained finite-sum minimisation problem. Denoting by T_{:,j} the j-th column of T (equivalently, the j-th row of T^T), we can write ||T^T x − x||_2^2 = Σ_j (T_{:,j}^T x − x_j)^2. Introducing logarithmic barrier functions, with λ being a penalty parameter, we have:

min_x  Σ_j (T_{:,j}^T x − x_j)^2 − λ Σ_j log(x_j) − λ log(1^T x − 1).     (6)

Eqn. 6 represents a standard finite-sum minimisation problem, which can be solved using any off-the-shelf stochastic optimisation method, e.g., stochastic gradient descent or ADAM [kingma2014adam]. A stochastic gradient execution involves sampling a joint strategy profile j_t at iteration t, and then executing a descent step on the corresponding summand of Eqn. 6:

x_{t+1} = x_t − η_t g_{j_t}(x_t, λ_t),     (7)

with g_{j_t} being the sub-sampled gradient of Eqn. 6 computed from the column T_{:,j_t} alone, and λ_t a scheduled penalty parameter with λ_{t+1} = γ λ_t for some decay rate γ ∈ (0, 1).

To avoid any confusion, we name the above stochastic approach of solving α-Rank via Eqns. 6-7 as α^α-Rank, and present its pseudo-code in Algorithm 2. When comparing our algorithm to those reported in Table 1, it is worth highlighting that computing updates using Eqn. 7 requires no storage of the full transition or payoff matrices, as updates are performed using only sub-sampled columns, as shown in line 11 of Algorithm 2.
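The sketch below illustrates the resulting update at toy scale; the barrier handling, step size and penalty schedule are simplified stand-ins for Eqn. 7, and the column oracle hides however the columns of T are produced (here a hard-coded 2-state chain), so this is a sketch of the idea rather than the exact implementation.

import numpy as np

def stochastic_stationary(column_oracle, n, iters=20000,
                          eta=0.05, lam=0.1, gamma=0.999, seed=0):
    # Minimise sum_j (T[:, j] @ x - x[j])^2 with a log-barrier keeping x > 0,
    # sampling one joint profile j per iteration so that only one column of T
    # is ever materialised; normalise at the end (see the Proposition above).
    rng = np.random.default_rng(seed)
    x = np.full(n, 1.0 / n)
    for _ in range(iters):
        j = rng.integers(n)                      # sample one joint profile
        t_col = column_oracle(j)                 # j-th column of T, built on the fly
        resid = t_col @ x - x[j]
        grad = 2.0 * n * resid * t_col           # sub-sampled gradient of
        grad[j] -= 2.0 * n * resid               # (T[:, j] @ x - x[j])^2
        grad -= lam / np.maximum(x, 1e-12)       # barrier term keeps x positive
        x = np.maximum(x - eta * grad, 1e-12)
        lam *= gamma                             # anneal the penalty
    return x / x.sum()

T = np.array([[0.9, 0.1], [0.2, 0.8]])           # toy stand-in for the chain
print(stochastic_stationary(lambda j: T[:, j], n=2))   # approx. [2/3, 1/3]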

Figure 3: Ranking intensity sweep on (a) Battle of Sexes (b) Biased RPS (c) Prisoner’s Dilemma.
Figure 4: Comparisons of time and memory complexities on varying sizes of random matrices.

5.1 α^α-Rank with Efficient Exploration & Oracles

Stochastic sampling enables solving α-Rank with no need to store the transition matrix T; however, the size of a single column (i.e., k^N) can still be prohibitively large. Here we further boost the scalability of our method by introducing an oracle mechanism. The heuristic of oracles was first proposed for solving large-scale zero-sum matrix games [mcmahan2003planning]. The idea is to first create a sub-game in which all players are only allowed to play a restricted number of strategies, which are then expanded by adding each of the players' best responses to their opponents; the sub-game is replayed with the agents' augmented strategy sets before a new round of best responses is computed.

The best response is assumed to be given by an oracle, which can be simply implemented by a grid search: given the top-ranked profile π^t at iteration t, the goal for agent i is to select the optimal π_i from a pre-defined strategy set S_i so as to maximise its reward:

π_i^BR = argmax_{π_i ∈ S_i}  E_{s, a_i ∼ π_i, a_{-i} ∼ π_{-i}^t} [ R_i(s, a_i, a_{-i}) ],     (8)

with s denoting the state, and a_i, a_{-i} denoting the actions of agent i and of the opponents, respectively. Though the worst-case scenario of introducing oracles would require solving the original evaluation problem, our experimental results on large-scale systems demonstrate efficiency through early convergence.
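A minimal grid-search oracle in the spirit of Eqn. 8 is sketched below; payoff_fn stands in for an assumed black-box simulator (or payoff look-up) returning agent i's average return for a joint profile, and is not part of the original implementation.

def best_response(agent_i, top_profile, candidate_pool, payoff_fn):
    # Scan agent i's candidate strategies against the fixed top-ranked opponents
    # and keep whichever strategy maximises the estimated payoff.
    best_strategy, best_value = None, float("-inf")
    for candidate in candidate_pool:
        trial = list(top_profile)
        trial[agent_i] = candidate                 # deviate only agent i
        value = payoff_fn(agent_i, tuple(trial))   # simulate / look up the payoff
        if value > best_value:
            best_strategy, best_value = candidate, value
    return best_strategy

# Toy usage: in a matching game the best response to an opponent playing 1 is 1.
payoffs = {(0, 0): 1.0, (1, 1): 1.0, (0, 1): 0.0, (1, 0): 0.0}
print(best_response(0, (0, 1), candidate_pool=[0, 1],
                    payoff_fn=lambda i, prof: payoffs[prof]))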

Figure 5: Large-scale multi-agent evaluations. (a) Convergence to the optimal joint-strategy profile in the self-driving simulation. (b) Status of the Ising-model equilibrium as the temperature τ varies. (c) Change of the top-rank profile from α^α-Oracle under a fixed temperature τ.

For a complete exposition, we summarise the pseudo-code of our proposed method, named α^α-Oracle, in Algorithm 2. α^α-Oracle degenerates to α^α-Rank (the inner stochastic update loop) if one initialises the strategy sets of the agents to their full size at the beginning, i.e., S_i^0 = S_i.

Providing a valid convergence guarantee for α^α-Oracle is an interesting direction for future work. In fact, [muller2019generalized] recently proposed a closely related idea of adopting an oracle mechanism on top of α-Rank, though without any stochastic solver. Interestingly, they report that bad initialisation can lead to failures in recovering top-rank strategies. Contrary to the results reported in [muller2019generalized], we demonstrate the effectiveness of our approach by running multiple trials with different initialisations of the restricted strategy sets. In addition, we believe the stochastic nature of α^α-Oracle potentially prevents the solver from being trapped in local optima induced by the sub-games.

6 Experiments

In this section, we demonstrate the scalability of α^α-Rank in successfully recovering optimal policies in self-driving car simulations and in the Ising model – a setting with tens of millions of possible joint strategy profiles. We note that these sizes are far beyond the capability of state-of-the-art methods; α-Rank [omidshafiei2019alpha] considers at most a handful of agents, each with a small number of strategies. All of our experiments were run on a single machine with a multi-core Intel Core i9 CPU.

Sparsity Data Structure:

During the implementation phase, we realised that the transition probability matrix T of the Markov chain induces a sparsity pattern (each row and column in T contains only N(k − 1) + 1 non-zero elements; check Section 5) that, if exploited, can lead to significant speed-ups. To fully leverage such sparsity, we tailored a novel data structure for the sparse storage and computations needed by Algorithm 2. More details are given in Appendix 1.1.

Correctness of Ranking Results:

As Algorithm 2 is a generalisation (in terms of scalability) of α-Rank, it is instructive to validate the correctness of our results on three simple matrix games. Due to space constraints, we defer the full description of these tasks to Appendix 1.2. Fig. 3, however, shows that the results generated by α^α-Rank are consistent with those reported in [omidshafiei2019alpha].

Complexity Comparisons on Random Matrices:

To further assess scalability, we measured the time and memory needed by our method for computing stationary distributions of simulated random matrices of varying sizes. Baselines include eigenvalue decomposition from NumPy, optimisation tools from PyTorch, and α-Rank from OpenSpiel [lanctot2019openspiel]. We terminated the execution of α^α-Rank when the gradient norm fell below a predefined threshold of 0.01. According to Fig. 4, α^α-Rank achieves a three-orders-of-magnitude reduction in time (i.e., 1000x faster) compared to the default α-Rank implementation from [lanctot2019openspiel]. Memory-wise, our method uses only half of the space when considering, for instance, matrices of size 10^4 × 10^4.

Autonomous Driving on Highway:

Having assessed correctness and scalability, we now present novel application domains for large-scale multi-agent/multi-player systems. For that, we made use of highway-env [highway-env], an environment for simulating self-driving scenarios with social vehicles designed to mimic real-world traffic flow. We conducted a ranking experiment involving five agents, each with five strategies, i.e., a joint strategy space of 5^5 = 3125 possible strategy profiles. Agent strategies varied between "rational" and "dangerous" drivers, which we encoded using different reward functions during training (complete details of the reward functions are in Appendix 2.2). Under this setting, we knew upfront that the optimal profile corresponds to all five agents being rational drivers. Cars were trained using value iteration, and we report rewards averaged over 200 test trials.

We considered both α^α-Rank and α^α-Oracle, and report the results across multiple random seeds. We set α^α-Oracle to run a fixed number of gradient-update iterations when solving for the top-rank strategy profile (the inner loop of Algorithm 2). The results depicted in Fig. 5(a) clearly demonstrate that both of our proposed methods are capable of recovering the correct highest-ranking strategy profile. α^α-Oracle converges faster than α^α-Rank, which we believe is due to the oracle mechanism avoiding inefficient exploration of "dangerous" drivers. We also note that although problems of this size are feasible for α-Rank and the power method, our results achieve a four-orders-of-magnitude reduction in the number of iterations.

Ising Model Experiment:

We repeated the above experiment on the Ising model [ising1925beitrag], which is typically used for describing ferromagnetism in statistical mechanics. It assumes a system of magnetic spins, where each spin j is either an up-spin, σ_j = +1, or a down-spin, σ_j = −1. The system energy is defined by E(σ) = −Σ_j h σ_j − J Σ_{⟨i,j⟩} σ_i σ_j, with h and J being constant coefficients. The probability of one spin configuration σ is P(σ) = exp(−E(σ)/τ) / Z, where τ is the environmental temperature and Z the partition function. Finding the equilibrium of the system is notoriously hard because one needs to enumerate all possible configurations to compute Z. Traditional approaches include Markov Chain Monte Carlo (MCMC). An interesting phenomenon is the phase change: the spins reach an equilibrium at low temperatures, and as τ increases, that equilibrium suddenly breaks and the system becomes chaotic.
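For concreteness, the sketch below evaluates the Ising energy and the Boltzmann probability of a configuration on a tiny 2x2 lattice with open boundaries; the constants J, h and the lattice size are illustrative, and brute-force enumeration of Z is only feasible at this toy scale, which is precisely why the full 2^25 setting is hard.

import itertools
import numpy as np

J, h, L = 1.0, 0.0, 2                        # illustrative coupling, field, lattice size

def energy(spins):                           # spins: L x L array of +1/-1 values
    e = -h * spins.sum()
    e -= J * (spins[:, :-1] * spins[:, 1:]).sum()   # horizontal bonds
    e -= J * (spins[:-1, :] * spins[1:, :]).sum()   # vertical bonds
    return e

def boltzmann(spins, tau):
    configs = [np.array(c).reshape(L, L)
               for c in itertools.product([-1, 1], repeat=L * L)]
    Z = sum(np.exp(-energy(c) / tau) for c in configs)     # partition function
    return np.exp(-energy(spins) / tau) / Z

aligned = np.ones((L, L))
print(boltzmann(aligned, tau=0.5), boltzmann(aligned, tau=10.0))  # ordered vs. chaotic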

Here we try to observe the phase change through multi-agent evaluation methods. We treat each spin as an agent, and define its reward through its contribution to the (negative) system energy. We consider the top-rank strategy profile from α^α-Oracle as the system equilibrium and compare it against the ground truth from MCMC. We consider a 2D model with 25 spins, which induces a prohibitively large joint strategy space of size 2^25 (about 33 million joint profiles), to which existing methods are inapplicable. Fig. 5(b) illustrates that our method identifies the same phase change as that of MCMC. We also show an example of how α^α-Oracle's top-ranked profile finds the system's equilibrium at a low temperature in Fig. 5(c). Note that a problem of 25 agents with 2 strategies each goes far beyond the capability of α-Rank on a single machine (billions of elements in T); we therefore do not report its performance here.

7 Conclusions & Future Work

In this paper, we presented the major bottlenecks prohibiting α-Rank from scaling beyond tens of agents. Depending on the type of input, α-Rank's time and memory complexities can easily become exponential. We further argued that the notions introduced in α-Rank can lead to confusing tractability results on notoriously difficult NP-hard problems. To eradicate any doubts, we empirically validated our claims by presenting dollars spent as a non-refutable metric.

Realising these problems, we proposed a scalable alternative for multi-agent evaluation based on stochastic optimisation and double oracles, along with rigorous scalability results on a variety of benchmarks. For future work, we first plan to understand the relation between α-Rank's solution and that of a Nash equilibrium. Second, we will attempt a theoretical study on the convergence of our proposed α^α-Oracle algorithm.

References

Appendix

1 Implementation of α^α-Rank

Experiments Max. Iteration
NFG (without self-play) 50 1.0 1000 0.9 0.5 0.1
NFG (self-play) 50 0.03 1000 0.9 0.5 0.1
Random Matrix n/a n/a 0.01 1000 n/a 0.1 0.01
Car Experiment (SGD) 40 1.0 15.0 2000 0.999 0.5 0.1
Car Experiment (Oracle) 40 1.0 1.0 200 0.999 0.5 0.1
Ising Model 40 90.0 0.01 4000 0.999 0.5 0.1
Table 3: Hyper-parameter settings for the experiments

1.1 The Data Structure for Sparsity

The transition probability matrix T in α-Rank is sparse; each row and column in T contains only N(k − 1) + 1 non-zero elements (see Section 5). To fully leverage such sparsity, we designed a new data structure (see Fig. 6) for storage and computation. Compared to standard techniques (e.g., COO, CSR, and CRS; see https://docs.scipy.org/doc/scipy/reference/sparse.html) that store (row, column, value) triplets of a sparse vector, our data structure adopts a more efficient protocol that stores (defaults, positions, biases), giving us additional advantages in computational efficiency. We overload the operations for this data structure, including addition, scalar multiplication, dot product, element-wise square root, and L1 norm. We show the example of addition in Fig. 6.
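A minimal sketch of the (defaults, positions, biases) idea is given below; the class name and API are illustrative rather than the exact implementation. The vector equals a default value everywhere except at a few positions, where a bias is added, so addition only touches the union of the two position sets.

import numpy as np

class SparseVector:
    def __init__(self, size, default=0.0, positions=None, biases=None):
        self.size, self.default = size, default
        self.data = dict(zip(positions or [], biases or []))   # position -> bias

    def __add__(self, other):
        assert self.size == other.size
        out = SparseVector(self.size, self.default + other.default)
        for pos in set(self.data) | set(other.data):
            out.data[pos] = self.data.get(pos, 0.0) + other.data.get(pos, 0.0)
        return out

    def to_dense(self):
        dense = np.full(self.size, self.default)
        for pos, bias in self.data.items():
            dense[pos] += bias
        return dense

a = SparseVector(5, default=1.0, positions=[2], biases=[0.5])
b = SparseVector(5, default=0.0, positions=[2, 4], biases=[1.0, -0.25])
print((a + b).to_dense())        # [1.   1.   2.5  1.   0.75]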

1.2 Validity Check on Normal-form Games

Our algorithm provides the expected ranking in all three normal-form games shown in Fig. 3, consistent with the results of α-Rank [omidshafiei2019alpha].

Battle of the Sexes. Battle of the Sexes is an asymmetric game. α^α-Rank suggests that the populations would spend an equal amount of time on the profiles (O,O) and (M,M) during the evolution. The distribution mass of (M,O) drops to zero faster than that of (O,M); this is because deviating from (M,O) yields a larger payoff gain for either player than deviating from (O,M).

Biased Rock-Paper-Scissors. We consider the biased RPS game. As it is a single-population game, we adopt the transition probability matrix of Eqn. 11 in [omidshafiei2019alpha]. This game has the inherent structure that Rock/Paper/Scissors are equally likely to be invaded by a mutant, e.g., the Scissors population will always be fixated by the Rock population; therefore, our method suggests that the long-term survival rates of all three strategies are the same (1/3 each). Note that this is different from the (non-uniform) Nash equilibrium solution of the biased game.

Prison’s Dilemma. In prison’s dilemma , cooperation is an evolutionary transient strategy since the cooperation player can always be exploited by the defection player. Our method thus yields as the only strategy profile that could survive in the long-term evolution.

2 Additional Details for Experiments

Figure 6: Sparse vector representation in α^α-Rank.

2.1 Hyper-parameters Settings

For all of our experiments, the gradient updates include two phases: a warm-up phase and an Adam [kingma2014adam] phase. In the warm-up phase, we use standard stochastic gradient descent; after that, we replace SGD with Adam until convergence. In practice, we find this yields faster convergence than plain stochastic gradient descent. As our algorithm performs column sampling on the stochastic matrix (i.e., the batch size equals one), adding a momentum term intuitively helps stabilise the learning. The same warm-up step count is used for all experiments.

We also implement the infinite-α limit [lanctot2019openspiel] when calculating the transition matrix (or its columns), where the noise term is set to a small constant.

For most of our experiments that involve α^α-Rank, we set the terminating condition to be when the gradient norm falls below a fixed threshold. For the random matrix experiment, we set the terminating gradient norm separately, and randomise the remaining hyper-parameters as follows:

  • Learning rate: between 15 and 17

  • Alpha (ranking intensity): between 1 and 2.5

  • Population size: between 25 and 55 (integer)

For all of the Adam experiments, after the warm-up step we decay the learning rate η and the penalty parameter λ at a fixed rate every fixed number of time steps, keeping the learning rate above a minimum value; λ likewise starts at a fixed initial value. For the speed and memory experiment, however, we choose a different decay rate.

List of symbols and names

  • Population size: m

  • Ranking intensity: α

  • Learning rate: η

2.2 Self-driving Car Experiment

Collision Reward Speed Reward
Rational driver -2.0 0.4
Dangerous driver 1 10.0 10.0
Dangerous driver 2 20.0 10.0
Dangerous driver 3 30.0 10.0
Dangerous driver 4 40.0 10.0
Table 4: Reward settings in Self-driving Car Simulation.

The environmental reward given to each agent is calculated as the sum of a collision term and a speed term (see Table 4).

The Collision Reward is applied when an agent collides with either a social car or another agent. All of our value-iteration agents are based on the [highway-env] environment discretisation, which represents the environment in terms of a time-to-collision MDP, assuming that the other agents move at constant speed. For all experiments, we run value iteration for a fixed number of steps with a fixed discount factor. For each controllable car, the default speed is randomised within a fixed range, as is the speed of the social cars. We define five types of driving behaviour (one rational + four dangerous) by letting each controlled car have a different ego reward function during training (though the reward we report is the environmental reward, which cannot be changed). By this construction, we ensure upfront that the best joint strategy is for all cars to drive rationally.