# Open-ended Learning in Symmetric Zero-sum Games

Zero-sum games such as chess and poker are, abstractly, functions that evaluate pairs of agents, for example labeling them 'winner' and 'loser'. If the game is approximately transitive, then self-play generates sequences of agents of increasing strength. However, nontransitive games, such as rock-paper-scissors, can exhibit strategic cycles, and there is no longer a clear objective -- we want agents to increase in strength, but against whom is unclear. In this paper, we introduce a geometric framework for formulating agent objectives in zero-sum games, in order to construct adaptive sequences of objectives that yield open-ended learning. The framework allows us to reason about population performance in nontransitive games, and enables the development of a new algorithm (rectified Nash response, PSRO_rN) that uses game-theoretic niching to construct diverse populations of effective agents, producing a stronger set of agents than existing algorithms. We apply PSRO_rN to two highly nontransitive resource allocation games and find that PSRO_rN consistently outperforms the existing alternatives.


## 1 Introduction

A story goes that a Cambridge tutor in the mid-19th century once proclaimed: “I’m teaching the smartest boy in Britain.” His colleague retorted: “I’m teaching the best test-taker.” Depending on the version of the story, the first boy was either Lord Kelvin or James Clerk Maxwell. The second boy, who indeed scored highest on the Tripos, is long forgotten.

Modern learning algorithms are outstanding test-takers: once a problem is packaged into a suitable objective, deep (reinforcement) learning algorithms often find a good solution. However, in many multi-agent domains, the question of what test to take, or what objective to optimize, is not clear. This paper proposes algorithms that adaptively and continually pose new, useful objectives which result in open-ended learning in two-player zero-sum games. This setting has a large scope of applications and is general enough to include function optimization as a special case.

Learning in games is often conservatively formulated as training agents that tie or beat, on average, a fixed set of opponents. However, the dual task, that of generating useful opponents to train and evaluate against, is under-studied. It is not enough to beat the agents you know; it is also important to generate better opponents, which exhibit behaviours that you don’t know.

There are very successful examples of algorithms that pose and solve a series of increasingly difficult problems for themselves through forms of self-play (Silver et al., 2018; Jaderberg et al., 2018; Bansal et al., 2018; Tesauro, 1995). Unfortunately, it is easy to encounter nontransitive games where self-play cycles through agents without improving overall agent strength – simultaneously improving against one opponent and worsening against another. In this paper, we develop a mathematical framework for analyzing nontransitive games, and present algorithms that systematically uncover and solve the latent problems embedded in a game.

Overview. The paper starts in Section 2 by introducing functional-form games (FFGs) as a mathematical model of zero-sum games played by parametrized agents such as neural networks. Theorem 1 decomposes any FFG into a sum of transitive and cyclic components. Transitive games, and the closely related monotonic games, are the natural setting for self-play; but the cyclic components, present in nontransitive games, require more sophisticated algorithms, which motivates the remainder of the paper.

The main problem in tackling nontransitive games, where there is not necessarily a best agent, is understanding what the objective should be. In Section 3, we formulate the global objective in terms of gamescapes – convex polytopes that encode the interactions between agents in a game. If the game is transitive or monotonic, then the gamescape degenerates to a one-dimensional landscape. In nontransitive games, the gamescape can be high-dimensional because training against one agent can be fundamentally different from training against another.

Measuring the performance of individual agents is vexed in nontransitive games. Therefore, in Section 3, we develop tools to analyze populations of agents, including a population-level measure of performance, definition 3. An important property of population-level performance is that it increases transitively as the gamescape polytope expands in a nontransitive game. Thus, we reformulate the problem of learning in games from finding the best agent to growing the gamescape. We consider two approaches to do so, one directly performance related, and the other focusing on a measure of diversity, definition 4. Crucially, the measure quantifies diverse effective behaviors – we are not interested in differences in policies that do not lead to differences in outcomes, nor in agents that lose in new and surprising ways.

Section 4 presents two algorithms, one old and one new, for growing the gamescape. The algorithms can be seen as specializations of the policy space response oracle (PSRO) introduced in Lanctot et al. (2017). The first algorithm is the Nash response (PSRO_N), which extends the double oracle algorithm of McMahan et al. (2003) to functional-form games. Given a population, the Nash response creates an objective to train against by averaging over the Nash equilibrium. The Nash serves as a proxy for the notion of ‘best agent’, which is not guaranteed to exist in general zero-sum games. A second, complementary algorithm is the rectified Nash response (PSRO_rN). The algorithm amplifies strategic diversity in populations of agents by adaptively constructing game-theoretic niches that encourage agents to ‘play to their strengths and ignore their weaknesses’.

Finally, in Section 5, we investigate the performance of these algorithms in Colonel Blotto (Borel, 1921; Tukey, 1949; Roberson, 2006) and a differentiable analog we refer to as differentiable Lotto. Blotto-style games involve allocating limited resources, and are highly nontransitive. We find that PSRO_rN outperforms PSRO_N, both of which greatly outperform self-play in these domains. We also compare against an algorithm that responds to the uniform distribution (PSRO_U), which performs comparably to PSRO_N.

Related work. There is a large literature on novelty search, open-ended evolution, and curiosity, which aim to continually expand the frontiers of game knowledge within an agent (Lehman & Stanley, 2008; Taylor et al., 2016; Banzhaf et al., 2016; Brant & Stanley, 2017; Pathak et al., 2017; Wang et al., 2019). A common thread is that of adaptive objectives which force agents to keep improving. For example, in novelty search, the target objective constantly changes – and so cannot be reduced to a fixed objective to be optimized once-and-for-all.

We draw heavily on prior work on learning in games, especially Heinrich et al. (2015); Lanctot et al. (2017) which are discussed below. Our setting resembles multiobjective optimization (Fonseca & Fleming, 1993; Miettinen, 1998). However, unlike multiobjective optimization, we are concerned with both generating and optimizing objectives. Generative adversarial networks (Goodfellow et al., 2014) are zero-sum games that do not fall under the scope of this paper due to lack of symmetry, see appendix F.3.

Notation. Vectors are columns. The constant vectors of zeros and ones are **0** and **1**. We sometimes use v_i to denote the i-th entry of vector v. Proofs are in the appendix.

## 2 Functional-form games (FFGs)

Suppose that, given any pair of agents, we can compute the probability of one beating the other in a game such as Go, Chess, or StarCraft. The setup is formalized as follows.

###### Definition 1.

Let W be a set of agents parametrized by, say, the weights of a neural net. A symmetric zero-sum functional-form game (FFG) is an antisymmetric function, ϕ(v, w) = −ϕ(w, v), that evaluates pairs of agents

 ϕ: W × W → ℝ. (1)

The higher ϕ(v, w), the better for agent v. We refer to ϕ(v, w) > 0, ϕ(v, w) < 0, and ϕ(v, w) = 0 as wins, losses and ties for v.

Note that the parametrization of the agents is folded into ϕ, so the game is a composite of the agents’ architecture and the environment itself.

Suppose the probability of v beating w, denoted P(v ≻ w), can be computed or estimated. Win/loss probabilities can be rendered into antisymmetric form via ϕ(v, w) := P(v ≻ w) − 1/2 or ϕ(v, w) := log P(v ≻ w) − log P(w ≻ v).

Tools for FFGs. Solving FFGs requires different methods from those used for normal-form games (Shoham & Leyton-Brown, 2008) due to their continuous nature. We therefore develop the following basic tools.

First, the curry operator converts a two-player game into a function from agents to objectives:

 [ϕ: W × W → ℝ]  −curry→  [W → [W → ℝ]],  where  w ↦ ϕ_w(•) := ϕ(•, w). (2)

Second, an approximate best-response oracle that, given agent v and objective ϕ_w(•), returns a new agent v′ with ϕ(v′, w) > ϕ(v, w) + ε, if possible. The oracle could use gradients, reinforcement learning or evolutionary algorithms.

Third, given a population P of agents, the antisymmetric evaluation matrix is

 A_P := {ϕ(w_i, w_j) : (w_i, w_j) ∈ P × P} =: ϕ(P ⊗ P). (4)

Fourth, we will use the (not necessarily unique) Nash equilibrium on the zero-sum matrix game specified by A_P.

Finally, we use the following game decomposition. Suppose W is a compact set equipped with a probability measure. The set of integrable antisymmetric functions on W × W then forms a vector space. Appendix C shows the following:

###### Theorem 1 (game decomposition).

Every FFG decomposes into a sum of a transitive and a cyclic game

 FFG=transitive game⊕cyclic game. (5)

with respect to a suitably defined inner product.

Transitive and cyclic games are discussed below. Few games are purely transitive or cyclic. Nevertheless, understanding these cases is important since general algorithms should, at the very least, work in both special cases.
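As a concrete sketch of these tools, the evaluation matrix and an empirical Nash equilibrium can be computed as follows. The helper names `evaluation_matrix` and `nash_of` are our own, and the linear program is the standard formulation for zero-sum matrix games (for an antisymmetric A_P the game value is zero):

```python
import numpy as np
from scipy.optimize import linprog

def evaluation_matrix(population, phi):
    """A_P[i, j] = phi(w_i, w_j); antisymmetric for a symmetric zero-sum FFG."""
    n = len(population)
    return np.array([[phi(population[i], population[j]) for j in range(n)]
                     for i in range(n)])

def nash_of(A):
    """A Nash equilibrium p of the zero-sum matrix game A: p >= 0,
    sum(p) = 1 and p^T A >= 0. Solved as the LP: maximise v subject
    to A^T p >= v * 1, expressed in scipy's minimisation form."""
    n = A.shape[0]
    c = np.zeros(n + 1)
    c[-1] = -1.0                               # minimise -v
    A_ub = np.hstack([-A.T, np.ones((n, 1))])  # v*1 - A^T p <= 0
    b_ub = np.zeros(n)
    A_eq = np.append(np.ones(n), 0.0).reshape(1, -1)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * n + [(None, None)])
    return res.x[:n]
```

For rock-paper-scissors the solver recovers the uniform mixture, the game's unique Nash equilibrium.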

### 2.1 Transitive games

A game is transitive if there is a ‘rating function’ f such that performance on the game is the difference in ratings:

 ϕ(v, w) = f(v) − f(w). (6)

In other words, if ϕ admits a ‘subtractive factorization’.

Optimization (training against a fixed opponent). Solving a transitive game reduces to finding

 v* := argmax_{v∈W} ϕ_w(v) = argmax_{v∈W} f(v). (7)

Crucially, the choice of opponent makes no difference to the solution. The simplest learning algorithm is thus to train against a fixed opponent, see algorithm 1.

Monotonic games generalize transitive games. An FFG is monotonic if there is a monotonic function σ such that

 ϕ(v,w)=σ(f(v)−f(w)). (8)

For example, Elo (1978) models the probability of one agent beating another by

 P(v ≻ w) = σ(f(v) − f(w))  for  σ(x) = 1 / (1 + e^{−α·x}) (9)

for some α > 0, where f assigns Elo ratings to agents. The model is widely used in Chess, Go and other games.

Optimizing against a fixed opponent fares badly in monotonic games. Concretely, if Elo’s model holds, then training against a much weaker opponent yields no learning signal, because the gradient vanishes once the sigmoid saturates when f(v) ≫ f(w).

Self-play (algorithm 2) generates a sequence of opponents. Training against a sequence of opponents of increasing strength prevents gradients from vanishing due to large skill differentials, so self-play is well-suited to games modeled by eq. (8). Self-play has proven effective in Chess, Go and other games (Silver et al., 2018; Al-Shedivat et al., 2018).
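The contrast between the two algorithms can be sketched in one dimension. Assuming Elo's model with f(v) = v and α = 1, gradient ascent against a fixed opponent stalls once the sigmoid saturates, while self-play keeps the skill differential at zero and the gradient at its maximum (the specific constants below are illustrative):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def grad_vs(v, w):
    """d/dv sigma(f(v) - f(w)) with f(v) = v: vanishes once v >> w."""
    s = sigmoid(v - w)
    return s * (1 - s)

# Algorithm 1: train against a fixed (equal-strength, then weaker) opponent.
v, w, lr = 0.0, 0.0, 1.0
for _ in range(1000):
    v += lr * grad_vs(v, w)      # gradient shrinks as the sigmoid saturates
stalled_grad = grad_vs(v, w)

# Algorithm 2: self-play -- the opponent is always the latest agent, so the
# skill differential stays zero and the gradient never vanishes.
v2 = 0.0
for _ in range(1000):
    v2 += lr * grad_vs(v2, v2)   # gradient at zero differential = 0.25
```

After the same number of oracle steps, the fixed-opponent learner has essentially no learning signal left, while the self-play rating keeps growing linearly.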

Self-play is an open-ended learning algorithm: it poses and masters a sequence of objectives, rather than optimizing a pre-specified objective. However, self-play assumes transitivity: that local improvements (v_{t+1} beats v_t) imply global improvements (v_{t+1} beats all v_s for s ≤ t). The assumption fails in nontransitive games, such as the disc game below. Since performance is nontransitive, improving against one agent does not guarantee improvements against others.

### 2.2 Cyclic games

A game is cyclic if

 ∫_W ϕ(v, w) · dw = 0  for all v ∈ W. (10)

In other words, wins against some agents are necessarily counterbalanced with losses against others. Strategic cycles often arise when agents play imperfect information games such as rock-paper-scissors, poker, or StarCraft.

###### Example 1 (Disc game).

Let W = {w ∈ ℝ² : ‖w‖₂ ≤ 1} be the unit disc with uniform measure and set

 ϕ(v, w) := v₁w₂ − v₂w₁. (11)

The game is cyclic, see figure 2A.

###### Example 2 (Rock-paper-scissors embeds in disc game).

Set r_ε = ε·(1, 0), p_ε = ε·(−1/2, −√3/2) and s_ε = ε·(−1/2, √3/2) to obtain

 ϕ(p_ε, r_ε) = ϕ(s_ε, p_ε) = ϕ(r_ε, s_ε) = (√3/2)·ε². (12)

Varying ε yields a family of rock–paper–scissors interactions that trend deterministic as ε increases, see figure 2B.
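The embedding can be checked numerically. The sketch below uses the disc-game payoff ϕ(v, w) = v₁w₂ − v₂w₁ and one consistent choice of three points spaced 120 degrees apart on a circle of radius ε:

```python
import numpy as np

def phi_disc(v, w):
    """Disc game payoff: phi(v, w) = v1*w2 - v2*w1 (antisymmetric)."""
    return v[0] * w[1] - v[1] * w[0]

eps = 0.5
# Rock, paper, scissors: three agents spaced 120 degrees apart on a
# circle of radius eps (one consistent choice of points).
r = eps * np.array([1.0, 0.0])
p = eps * np.array([-0.5, -np.sqrt(3) / 2])
s = eps * np.array([-0.5,  np.sqrt(3) / 2])
```

All three payoffs equal (√3/2)·ε², so the interactions form a cycle whose magnitude grows quadratically with the radius.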

Our goal is to extend self-play to general FFGs. The success of optimization and self-play derives from (i) repeatedly applying a local operation that (ii) improves a transitive measure. If the measure is not transitive, then applying a sequence of local improvements can result in no improvement at all. Our goal is thus to find practical substitutes for (i) and (ii) in general FFGs.

## 3 Functional and Empirical Gamescapes

Rather than trying to find a single dominant agent which may not exist, we seek to find all the atomic components in strategy space of a zero-sum game. That is, we aim to discover the underlying strategic dimensions of the game, and the best ways of executing them. Given such knowledge, when faced with a new opponent, we will not only be able to react to its behavior conservatively (using the Nash mixture to guarantee a tie), but will also be able to optimally exploit the opponent. As opposed to typical game-theoretic solutions, we do not seek a single agent or mixture, but rather a population that embodies a complete understanding of the strategic dimensions of the game.

To formalize these ideas we introduce gamescapes, which geometrically represent strategies in functional-form games. We show some general properties of these objects to build the reader’s intuition. Finally we introduce two critical concepts: population performance, which measures the progress in performance of populations, and effective diversity, which quantifies the coverage of the gamescape spanned by a population. Equipped with these tools we present algorithms that guarantee iterative improvements in FFGs.

###### Definition 2.

The functional gamescape (FGS) of ϕ is the convex set

 G_ϕ := hull({ϕ_w(•) : w ∈ W}) ⊂ C(W, ℝ), (13)

where C(W, ℝ) is the space of real-valued functions on W.

Given population P of agents with evaluation matrix A_P, the corresponding empirical gamescape (EGS) is

 GP:={convex mixtures of rows of AP}. (14)

The FGS represents all the mixtures of objectives implicit in the game. We cannot work with the FGS directly because we cannot compute ϕ_w(•) for infinitely many agents. The EGS is a tractable proxy (Wellman, 2006). The two gamescapes represent, respectively, all the ways agents could in principle interact and all the ways they are actually observed to interact. The remainder of this section collects basic facts about gamescapes.

Optimization landscapes are a special case of gamescapes. If ϕ(v, w) = f(v) − f(w) then the FGS is, modulo constant terms, the single function f. The game degenerates into a landscape where, for each agent v, there is a unique direction in weight space, ∇f(v), which gives the steepest performance increase against all opponents. In a monotonic game, the gradient is ∇_v ϕ(v, w) = σ′(f(v) − f(w)) · ∇f(v). There is again a single steepest direction, ∇f(v), with its tendency to vanish controlled by the ratings differential f(v) − f(w).

Redundancy. First, we argue that gamescapes are more fundamental than evaluation matrices. Consider

 ⎡ 0  1 −1 ⎤       ⎡ 0  1 −1 −1 ⎤
 ⎢−1  0  1 ⎥  and  ⎢−1  0  1  1 ⎥  (15)
 ⎣ 1 −1  0 ⎦       ⎢ 1 −1  0  0 ⎥
                   ⎣ 1 −1  0  0 ⎦

The first matrix encodes rock–paper–scissors interactions; the second is the same, but with two copies of scissors. The matrices are difficult to compare since their dimensions are different. Nevertheless, the gamescapes are equivalent triangles embedded in ℝ³ and ℝ⁴ respectively.
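The equivalence is visible in the ranks of the two matrices, which can be checked directly (a minimal sketch using the matrices above):

```python
import numpy as np

A3 = np.array([[0, 1, -1],
               [-1, 0, 1],
               [1, -1, 0]])        # rock-paper-scissors

A4 = np.array([[0, 1, -1, -1],
               [-1, 0, 1, 1],
               [1, -1, 0, 0],
               [1, -1, 0, 0]])     # same game, with a duplicated scissors

# The duplicated agent's row is an exact copy of another's (a convex
# mixture in the sense of Proposition 2), so it adds no dimension.
duplicate_row = np.array_equal(A4[2], A4[3])
```

Both gamescapes span the same two-dimensional triangle, despite the matrices having different sizes.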

###### Proposition 2.

An agent in a population is redundant if it is behaviorally identical to a convex mixture of other agents. The EGS is invariant to redundant agents.

Invariance is explained in appendix D.

Dimensionality. The dimension of the gamescape is an indicator of the complexity of both the game and the agents playing it. In practice we find many FFGs have a low-dimensional latent structure.

Figure 1 depicts evaluation matrices of four populations of 40 agents. Although the gamescapes could be 40-dimensional, they turn out to have one- and two-dimensional representations. The dimension of the EGS is determined by the rank of the evaluation matrix.

###### Proposition 3.

The EGS of the n agents in population P can be represented in ℝ^d, where d = rank(A_P) ≤ n.

A low-dimensional representation of the EGS can be constructed via the Schur decomposition, which is the analog of PCA for antisymmetric matrices (Balduzzi et al., 2018b). The length of the longest strategic cycle in a game gives a lower bound on the dimension of its gamescape:

###### Example 3 (latent dimension of long cycles).

Suppose n agents form a long cycle, w₁ ≻ w₂ ≻ ⋯ ≻ wₙ ≻ w₁, with all other pairs tying. Then rank(A_P) is n − 2 if n is even and n − 1 if n is odd.
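The rank pattern can be verified numerically, assuming (as above) that non-adjacent agents in the cycle tie:

```python
import numpy as np

def long_cycle_matrix(n):
    """Evaluation matrix where w1 > w2 > ... > wn > w1 and all other
    pairs tie (a minimal length-n strategic cycle)."""
    A = np.zeros((n, n))
    for i in range(n):
        A[i, (i + 1) % n] = 1.0    # each agent beats its successor
        A[(i + 1) % n, i] = -1.0
    return A
```

Since antisymmetric matrices always have even rank, the lower bound alternates between n − 1 (n odd) and n − 2 (n even).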

Nash equilibria in a symmetric zero-sum game are (mixtures of) agents that beat or tie all other agents. Loosely speaking, they replace the notion of best agent in games where there is no best agent. Functional Nash equilibria, in the FGS, are computationally intractable, so we work with empirical Nash equilibria over the evaluation matrix A_P.

###### Proposition 4.

Given population P, the empirical Nash equilibria are

 N_P = {distribution p : p⊺ · A_P ⪰ 0}. (16)

In other words, Nash equilibria correspond to points in the empirical gamescape that intersect the positive quadrant {x ∈ ℝⁿ : x ⪰ 0}. The positive quadrant thus provides a set of directions in weight space to aim for when training new agents, see below.

The gap between the FGS and EGS. Observing rock–paper interactions yields different conclusions from observing rock–paper–scissors interactions; it is always possible that an agent that appears to be dominated is actually part of a partially observed cycle. Without further assumptions about the structure of ϕ, it is impossible to draw strong conclusions about the nature of the FGS from the EGS computed from a finite population. The gap is analogous to the exploration problem in reinforcement learning. To discover unobserved dimensions of the FGS, one could train against randomized distributions over opponents, which would eventually find them all.

### 3.1 Population performance

If ϕ(v, w) = f(v) − f(w) then improving the performance of agent v reduces to increasing f(v). In a cyclic game, the performance of individual agents is meaningless: beating one agent entails losing against another by eq. (10). We therefore propose a population performance measure.

###### Definition 3.

Given populations P and Q of sizes n₁ and n₂, let (p, q) be a Nash equilibrium of the zero-sum game on A_{P,Q} := ϕ(P ⊗ Q). The relative population performance is

 v(P, Q) := p⊺ · A_{P,Q} · q = ∑_{i=1}^{n₁} ∑_{j=1}^{n₂} A_{ij} · p_i q_j. (17)
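The definition is a single bilinear form once a Nash equilibrium is in hand. A minimal sketch, using hypothetical ratings in a transitive game where the Nash of each side simply plays its highest-rated agent:

```python
import numpy as np

def relative_performance(A, p, q):
    """v(P, Q) = p^T A_{P,Q} q, where (p, q) is a Nash equilibrium of the
    zero-sum game on the inter-population evaluation matrix A_{P,Q}."""
    return p @ A @ q

# Transitive game with hypothetical ratings: phi(v, w) = f(v) - f(w).
f_P = np.array([1.0, 3.0, 2.0])
f_Q = np.array([0.5, 1.5])
A_PQ = f_P[:, None] - f_Q[None, :]   # A_{P,Q}[i, j] = f(v_i) - f(w_j)

# Nash of each side: pure mass on its highest-rated agent.
p = np.array([0.0, 1.0, 0.0])
q = np.array([0.0, 1.0])
```

The result matches property (ii) of Proposition 5: performance reduces to the difference between the best ratings of the two populations.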
###### Proposition 5.

(i) Performance is independent of the choice of Nash equilibrium. (ii) If ϕ is monotonic then performance compares the best agents in each population

 v(P, Q) = max_{v∈P} f(v) − max_{w∈Q} f(w). (18)

(iii) If G_P ⊆ G_Q then v(P, Q) ≤ 0 and v(P, R) ≤ v(Q, R) for any population R.

The first two properties are sanity checks. Property (iii) implies growing the polytope spanned by a population improves its performance against any other population.

Consider the concentric rock–paper–scissors populations in figure 2B and example 2. The Nash equilibrium is the uniform distribution (1/3, 1/3, 1/3) over any one of the populations. Thus, the relative performance of any two populations is zero. However, the outer population is better than the inner population at exploiting an opponent that only plays, say, rock, because the outer version of paper wins more deterministically than the inner version.

Finding a population that contains the Nash equilibrium is necessary but not sufficient to fully solve an FFG. For example, adding the ability to always force a tie to an FFG makes finding the Nash trivial. However, the game can still exhibit rich strategies and counter-strategies that are worth discovering.

### 3.2 Effective diversity

Measures of diversity typically quantify differences in weights or behavior of agents but ignore performance. Effective diversity measures the variety of effective agents (agents with support under Nash):

###### Definition 4.

Denote the rectifier by ⌊x⌋₊ := x if x ≥ 0 and ⌊x⌋₊ := 0 otherwise. Given population P, let p be a Nash equilibrium on A_P. The effective diversity of the population is:

 d(P) := p⊺ · ⌊A_P⌋₊ · p = ∑_{i,j=1}^{n} ⌊ϕ(w_i, w_j)⌋₊ · p_i p_j. (19)
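A minimal sketch of the diversity computation, contrasting rock-paper-scissors (three mutually exploiting Nash-supported agents) with a population that has a dominant agent:

```python
import numpy as np

def effective_diversity(A, p):
    """d(P) = p^T floor(A)_+ p, with p a Nash equilibrium on A_P."""
    rectified = np.maximum(A, 0.0)   # the rectifier: keep wins, drop losses
    return p @ rectified @ p

rps = np.array([[0., 1., -1.],
                [-1., 0., 1.],
                [1., -1., 0.]])
p_rps = np.full(3, 1 / 3)            # Nash of rock-paper-scissors

dom = np.array([[0., 1.],
                [-1., 0.]])
p_dom = np.array([1.0, 0.0])         # Nash mass on the dominant agent
```

For rock-paper-scissors the diversity is 1/3; with a dominant agent the rectified Nash-weighted interactions vanish and diversity is zero, as stated above.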

Diversity quantifies how the best agents (those with support in the maximum entropy Nash) exploit each other. If there is a dominant agent then diversity is zero.

Effective diversity is a matrix norm, see appendix E.2. It measures the volume spanned by Nash-supported agents. In figure 2B, there are four populations spanning concentric gamescapes: the Nash at the origin and three variants of rock–paper–scissors. Going outward to larger gamescapes yields agents that are more diverse and better exploiters.

## 4 Algorithms

We now turn attention to constructing objectives that, when trained against, produce new, effective agents. We present two algorithms that construct a sequence of fruitful local objectives which, when solved, iteratively add up to transitive population-level progress. Importantly, these algorithms output populations, unlike self-play, which outputs single agents.

Concretely, we present algorithms that expand the empirical gamescape in useful directions. Following Lanctot et al. (2017), we assume access to a subroutine, or oracle, that finds an approximate best response to any mixture of objectives. The subroutine could be a gradient-based, reinforcement learning or evolutionary algorithm. The subroutine returns a vector in weight-space, in which existing agents can be shifted to create new agents. Any mixture constitutes a valid training objective. However, many mixtures do not grow the gamescape, because the vector could point towards redundant or weak agents.

### 4.1 Response to the Nash (PSRO_N)

Since the notion of ‘the best agent’ – one agent that beats all others – does not necessarily exist in nontransitive games, a natural substitute is the mixture over the Nash equilibrium on the most recent population. The policy space response to the Nash (PSRO_N) iteratively generates new agents that are approximate best responses to the Nash mixture. If the game is transitive then PSRO_N degenerates to self-play. The algorithm is an extension of the double oracle algorithm (McMahan et al., 2003) to FFGs, see also (Zinkevich et al., 2007; Hansen et al., 2008).

The following result shows that PSRO_N strictly enlarges the empirical gamescape:

###### Proposition 6.

If p is a Nash equilibrium on A_P and v is an agent with ϕ(v, ∑_i p_i · w_i) > 0, then adding v to P strictly enlarges the empirical gamescape: G_P ⊊ G_{P ∪ {v}}.

A failure mode of PSRO_N arises when the Nash equilibrium of the game is contained in the empirical gamescape. For example, in the disc game in figure 2, the Nash equilibrium of the entire FFG is the agent at the origin, w* = (0, 0). If a population’s gamescape contains w* – which is true of any rock–paper–scissors subpopulation – then PSRO_N will not expand the gamescape, because there is no best response to w*. The next section presents an algorithm that uses niching to meaningfully grow the gamescape, even after finding the Nash equilibrium of the FFG.

Response to the uniform distribution (PSRO_U). A closely related algorithm is fictitious (self-)play (Brown, 1951; Leslie & Collins, 2006; Heinrich et al., 2015). The algorithm finds an approximate best response to the uniform distribution over agents in the current population. PSRO_U has guarantees in matrix-form games and performs well empirically (see below). However, we do not currently understand its effect on the gamescape.
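The common loop behind these variants can be sketched as follows. The names `oracle` and `meta_solver` are our own stand-ins for the best-response subroutine and the meta-strategy solver: with a Nash solver this is PSRO_N, and with the uniform distribution it is PSRO_U:

```python
import numpy as np

def psro(phi, oracle, meta_solver, w0, iters):
    """Sketch of the PSRO loop (after Lanctot et al., 2017).
    oracle(objective) is assumed to return an approximate maximiser of the
    objective over agents; meta_solver(A) returns a distribution over the
    rows of the evaluation matrix A."""
    population = [w0]
    for _ in range(iters):
        A = np.array([[phi(v, w) for w in population] for v in population])
        p = meta_solver(A)                 # distribution over current agents
        # New objective: performance against the p-weighted opponent mixture.
        objective = lambda v: sum(pi * phi(v, w)
                                  for pi, w in zip(p, population))
        population.append(oracle(objective))
    return population
```

For instance, on the disc game one can use a grid-search oracle over points on the unit circle and the uniform meta-solver; each iteration appends one new agent to the population.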

### 4.2 Response to the rectified Nash (PSRO_rN)

Response to the rectified Nash (PSRO_rN) introduces game-theoretic niches. Each effective agent – that is, each agent with support under the Nash equilibrium – is trained against the Nash-weighted mixture of agents that it beats or ties. Intuitively, the idea is to encourage agents to ‘amplify their strengths and ignore their weaknesses’.

A special case of PSRO_rN arises when there is a dominant agent in the population that beats all other agents. The Nash equilibrium is then concentrated on the dominant agent, and PSRO_rN degenerates to training against the best agent in the population, which can be thought of as a form of self-play (assuming the best agent is the most recent).

###### Proposition 7.

Summed over the Nash-supported agents, the objectives constructed by the rectified Nash response yield effective diversity, definition 4.

Thus, PSRO_rN amplifies the positive coordinates of the Nash-supported agents in their rows of the evaluation matrix. A pathological mode of PSRO_rN arises when there are many extremely local niches; that is, when every agent admits a specific exploit that does not generalize to other agents. PSRO_rN will grow the gamescape by finding these exploits, generating a large population of highly specialized agents.
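The construction of the rectified objectives can be sketched as follows; `rectified_nash_objectives` is a hypothetical helper name, and the rectification keeps only the opponents each Nash-supported agent beats or ties:

```python
import numpy as np

def rectified_nash_objectives(A, p, phi, population):
    """Sketch of the PSRO_rN objectives: each Nash-supported agent w_i is
    trained against the Nash-weighted mixture of agents it beats or ties;
    losses are rectified away ('amplify strengths, ignore weaknesses')."""
    objectives = {}
    for i, pi in enumerate(p):
        if pi <= 0:
            continue                      # only effective agents get a niche
        def objective(v, i=i):
            return sum(pj * phi(v, population[j])
                       for j, pj in enumerate(p)
                       if A[i, j] >= 0)   # keep only opponents w_i beats/ties
        objectives[i] = objective
    return objectives
```

On rock-paper-scissors every agent is Nash-supported, so each of the three agents receives its own niche objective built from the single opponent it beats.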

Rectified Nash responses in the disc game (example 1). The disc game embeds many subpopulations with rock–paper–scissors dynamics. As the polytope they span expands outwards, the interactions go from noisy to deterministic. The disc game is differentiable, so we can use gradient-based learning for the oracle in PSRO_rN. Figure 3B depicts the gradients resulting from training each of rock, paper and scissors against the agent it beats. Since the gradients point outside the polytope, training against the rectified Nash mixtures expands the gamescape.

Why ignore weaknesses? A natural modification of PSRO_rN is to train effective agents against the effective agents that they lose to. In other words, to force agents to improve their weak points whilst taking their strengths for granted. Figure 3C shows the gradients that would be applied to each of rock, paper and scissors under this algorithm. They point inwards, contracting the gamescape. Training rock against paper would make it more like scissors; similarly, training paper against scissors would make it more like rock, and so on. Perhaps counter-intuitively, building objectives out of the weak points of agents does not encourage diverse niches.

## 5 Experiments

We investigated the performance of the proposed algorithms in two highly nontransitive resource allocation games.

Colonel Blotto is a resource allocation game that is often used as a model for electoral competition. Each of two players has a budget of coins, which they simultaneously distribute over a fixed number of areas. Each area is won by the player that deploys more coins on it. The player that wins the most areas wins the game. Since Blotto is not differentiable, we use maximum a posteriori policy optimization (MPO) (Abdolmaleki et al., 2018) as the best-response oracle. MPO is an inference-based policy optimization algorithm; many other reinforcement learning algorithms could be used.

Differentiable Lotto is inspired by continuous Lotto (Hart, 2008). The game is defined over a fixed set of c ‘customers’, each being a point in ℝ². An agent’s strategy, (p, v), distributes one unit of mass over k servers, where each server v_j is also a point in ℝ². Roughly, given the strategies (p, v) and (q, w) of two players, customers are softly assigned to the nearest servers, determining the agents’ payoffs. More formally, the payoff is

 ϕ((p, v), (q, w)) := ∑_{i=1}^{c} ∑_{j=1}^{k} (p_j v_{ij} − q_j w_{ij}), (20)

where the scalars v_{ij} and w_{ij} depend on the distance between customer c_i and the servers:

 (v_{i1}, …, v_{ik}, w_{i1}, …, w_{ik}) := softmax(−‖c_i − v_1‖², …, −‖c_i − v_k‖², −‖c_i − w_1‖², …, −‖c_i − w_k‖²). (21)

The width of a cloud of points is the expected distance from its barycenter. We constrain agents to have width equal to one. We use gradient ascent as our oracle.
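The payoff in eqs. (20-21) can be sketched directly; `lotto_payoff` is our own helper name, and the width constraint is omitted for brevity:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())   # shift for numerical stability
    return e / e.sum()

def lotto_payoff(p, v, q, w, customers):
    """Differentiable Lotto payoff (sketch of eqs. 20-21): each customer is
    softly assigned across both players' servers by negative squared
    distance; each player scores its mass-weighted share of every customer."""
    k = len(p)
    total = 0.0
    for c in customers:
        d = np.concatenate([-np.sum((c - v) ** 2, axis=1),
                            -np.sum((c - w) ** 2, axis=1)])
        a = softmax(d)                   # soft assignment over all 2k servers
        total += p @ a[:k] - q @ a[k:]
    return total
```

The payoff is antisymmetric under swapping the players, as required of an FFG, and a player whose server sits closer to a customer captures more of that customer's assignment.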

Experimental setups. The experiments examine the performance of self-play, PSRO_U, PSRO_N, and PSRO_rN. We investigate performance under a fixed computational budget. Specifically, we track queries made to the oracle, following the standard model of computational cost in convex optimization (Nemirovski & Yudin, 1983). To compare two algorithms, we report the relative population performance (definition 3) of the populations they output. Computing evaluation matrices is expensive for large populations, requiring O(n²) pairwise evaluations. This cost is not included in our computational model since populations are relatively small. The relative cost of evaluations and queries to the oracle depends on the game.

In Blotto, we investigate performance for three areas and ten coins, averaging outcomes over repeated games. An agent outputs a vector in ℝ³, which is passed through a softmax and discretized to obtain three integers summing to ten. Differentiable Lotto experiments likewise average over repeated games, with customers chosen uniformly at random in a fixed square.

Results. Fig 4 shows the relative population performance (definition 3) between PSRO_rN and each of PSRO_N, PSRO_U and self-play: the more positive the number, the more PSRO_rN outperforms the alternative method. We find that PSRO_rN outperforms the other approaches across a wide range of compute budgets. PSRO_N and PSRO_U perform comparably, and self-play performs the worst. Self-play, algorithm 2, outputs a single agent, so the above comparison only considers the final agent. If we upgrade self-play to a population algorithm (by tracking all agents produced over training), then it still performs the worst in differentiable Lotto, but by a smaller margin. In Blotto, surprisingly, it slightly outperforms PSRO_N and PSRO_U.

Figure 5 shows how gamescapes develop during training. From the left panel, we see that PSRO_rN grows the polytope in a more uniform manner than the other algorithms. The right panel shows the area of the empirical gamescapes generated by the algorithms (the areas of the convex hulls). All algorithms increase the area, but PSRO_rN is the only method that increases the area at every iteration.

## 6 Conclusion

We have proposed a framework for open-ended learning in two-player symmetric zero-sum games, where strategies have a differentiable parametrization. We propose that the goal of learning should be (i) to discover the underlying strategic components that constitute the game and (ii) to master each of them. We formalized these ideas using gamescapes, which geometrically represent the latent objectives in games, and provided tools to analyze them. Finally, we proposed and empirically validated a new algorithm, PSRO_rN, for uncovering strategic diversity within functional-form games.

The algorithms discussed here are simple and generic, providing the foundations for methods that unify modern gradient and reinforcement-based learning with the adaptive objectives derived from game-theoretic considerations. Future work lies in expanding this understanding and applying it to develop practical algorithms for more complex games.

Acknowledgements. We thank Marc Lanctot and Simon Osindero for useful feedback.

## References

• Abdolmaleki et al. (2018) Abdolmaleki, A., Springenberg, J. T., Tassa, Y., Munos, R., Heess, N., and Riedmiller, M. Maximum a Posteriori Policy Optimisation. In ICLR, 2018.
• Al-Shedivat et al. (2018) Al-Shedivat, M., Bansal, T., Burda, Y., Sutskever, I., Mordatch, I., and Abbeel, P. Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments. In ICLR, 2018.
• Baker (2018) Baker, M. Hodge theory in combinatorics. Bull. AMS, 55(1):57–80, 2018.
• Balduzzi et al. (2018a) Balduzzi, D., Racanière, S., Martens, J., Foerster, J., Tuyls, K., and Graepel, T. The mechanics of n-player differentiable games. In ICML, 2018a.
• Balduzzi et al. (2018b) Balduzzi, D., Tuyls, K., Perolat, J., and Graepel, T. Re-evaluating Evaluation. In NeurIPS, 2018b.
• Bansal et al. (2018) Bansal, T., Pachocki, J., Sidor, S., Sutskever, I., and Mordatch, I. Emergent complexity via multi-agent competition. ICLR, 2018.
• Banzhaf, W., Baumgaertner, B., Beslon, G., Doursat, R., Foster, J. A., McMullin, B., de Melo, V. V., Miconi, T., Spector, L., Stepney, S., and White, R. Defining and Simulating Open-Ended Novelty: Requirements, Guidelines, and Challenges. Theory in Biosciences, 2016.
• Borel (1921) Borel, E. La théorie du jeu et les équations intégrales à noyau symétrique. Comptes rendus de l’Académie des Sciences, 1921.
• Brant & Stanley (2017) Brant, J. C. and Stanley, K. O. Minimal Criterion Coevolution: A New Approach to Open-Ended Search. In GECCO, 2017.
• Brown (1951) Brown, G. Iterative Solutions of Games by Fictitious Play. In Koopmans, T. C. (ed.), Activity Analysis of Production and Allocation. Wiley, 1951.
• Candogan et al. (2011) Candogan, O., Menache, I., Ozdaglar, A., and Parrilo, P. A. Flows and Decompositions of Games: Harmonic and Potential Games. Mathematics of Operations Research, 36(3):474–503, 2011.
• Elo (1978) Elo, A. E. The Rating of Chess players, Past and Present. Ishi Press International, 1978.
• Fonseca & Fleming (1993) Fonseca, C. and Fleming, P. Genetic Algorithms for Multiobjective Optimization: Formulation, Discussion and Generalization. In GECCO, 1993.
• Goodfellow et al. (2014) Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative Adversarial Nets. In NeurIPS, 2014.
• Hansen et al. (2008) Hansen, T. D., Miltersen, P. B., and Sørensen, T. B. On Range of Skill. In AAAI, 2008.
• Hart (2008) Hart, S. Discrete Colonel Blotto and General Lotto games. Int J Game Theory, 36:441–460, 2008.
• Heinrich et al. (2015) Heinrich, J., Lanctot, M., and Silver, D. Fictitious Self-Play in Extensive-Form Games. In ICML, 2015.
• Jaderberg et al. (2018) Jaderberg, M., Czarnecki, W. M., Dunning, I., Marris, L., Lever, G., Castaneda, A. G., Beattie, C., Rabinowitz, N. C., Morcos, A. S., Ruderman, A., Sonnerat, N., Green, T., Deason, L., Leibo, J. Z., Silver, D., Hassabis, D., Kavukcuoglu, K., and Graepel, T. Human-level performance in first-person multiplayer games with population-based deep reinforcement learning. arXiv:1807.01281, 2018.
• Jiang et al. (2011) Jiang, X., Lim, L.-H., Yao, Y., and Ye, Y. Statistical ranking and combinatorial Hodge theory. Math. Program., Ser. B, 127:203–244, 2011.
• Lanctot et al. (2017) Lanctot, M., Zambaldi, V., Gruslys, A., Lazaridou, A., Tuyls, K., Perolat, J., Silver, D., and Graepel, T. A Unified Game-Theoretic Approach to Multiagent Reinforcement Learning. In NeurIPS, 2017.
• Lehman & Stanley (2008) Lehman, J. and Stanley, K. O. Exploiting Open-Endedness to Solve Problems Through the Search for Novelty. In ALIFE, 2008.
• Leslie & Collins (2006) Leslie, D. and Collins, E. J. Generalised weakened fictitious play. Games and Economic Behavior, 56(2):285–298, 2006.
• McMahan et al. (2003) McMahan, H. B., Gordon, G., and Blum, A. Planning in the presence of cost functions controlled by an adversary. In ICML, 2003.
• Miettinen (1998) Miettinen, K. Nonlinear Multiobjective Optimization. Springer, 1998.
• Nemirovski & Yudin (1983) Nemirovski, A. and Yudin, D. Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience, 1983.
• Pathak et al. (2017) Pathak, D., Agrawal, P., Efros, A. A., and Darrell, T. Curiosity-driven Exploration by Self-supervised Prediction. In ICML, 2017.
• Roberson (2006) Roberson, B. The Colonel Blotto game. Economic Theory, 29(1), 2006.
• Shoham & Leyton-Brown (2008) Shoham, Y. and Leyton-Brown, K. Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations. Cambridge University Press, 2008.
• Silver et al. (2018) Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., Lillicrap, T., Simonyan, K., and Hassabis, D. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362:1140–1144, 2018.
• Taylor et al. (2016) Taylor, T., Bedau, M., Channon, A., Ackley, D., Banzhaf, W., Beslon, G., Dolson, E., Froese, T., Hickinbotham, S., Ikegami, T., McMullin, B., Packard, N., Rasmussen, S., Virgo, N., Agmon, E., McGregor, E. C. S., Ofria, C., Ropella, G., Spector, L., Stanley, K. O., Stanton, A., Timperley, C., Vostinar, A., and Wiser, M. Open-Ended Evolution: Perspectives from the OEE Workshop in York. Artificial Life, 22:408–423, 2016.
• Tesauro (1995) Tesauro, G. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3):58–68, 1995.
• Tukey (1949) Tukey, J. W. A problem of strategy. Econometrica, 17, 1949.
• Wang et al. (2019) Wang, R., Lehman, J., Clune, J., and Stanley, K. O. Paired Open-Ended Trailblazer (POET): Endlessly Generating Increasingly Complex and Diverse Learning Environments and Their Solutions. arXiv:1901.01753, 2019.
• Wellman (2006) Wellman, M. P. Methods for empirical game-theoretic analysis. In AAAI, pp. 1552–1556, 2006.
• Zinkevich et al. (2007) Zinkevich, M., Bowling, M., and Burch, N. A New Algorithm for Generating Equilibria in Massive Zero-Sum Games. In AAAI, 2007.

## A Behaviorism

The paper adopts a strict form of behaviorism: an agent is what it does. More precisely, an agent is all that it can do. Agents are operationally characterized by all the ways they can interact with all other agents. The functional gamescape captures all possible interactions; the empirical gamescape captures all observed interactions.

Here, we restrict to the single facet of pairwise interactions reported by the function ϕ. One could always delve deeper in specific instances. For example, instead of just reporting win/loss probabilities, one could also report summary statistics regarding which units are built or which pieces are taken.

## B Convex geometry

A set C contained in some vector space is convex if the line segments

 L_{v,w} = {α⋅v + (1−α)⋅w : 0 ≤ α ≤ 1} (22)

are subsets of C for all v, w ∈ C.

###### Definition 5.

The convex hull of a set A is the intersection of all convex sets containing A. Alternatively, the convex hull of A is

 hull(A) := {Σ_{i=1}^n α_i⋅w_i : α ⪰ 0, α⊺1 = 1, w_i ∈ A}. (23)

The convex hull of a set is necessarily convex.
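Membership in a convex hull, as in equation (23), reduces to a small feasibility linear program: x ∈ hull(A) iff there are weights α ⪰ 0 with α⊺1 = 1 and Σ_i α_i⋅w_i = x. A minimal Python sketch (ours, not from the paper; `in_hull` is our name):

```python
import numpy as np
from scipy.optimize import linprog

def in_hull(points, x):
    """Is x a convex combination of the given points (definition 5)?"""
    n = len(points)
    # Feasibility LP: find alpha >= 0 with alpha^T 1 = 1 and sum_i alpha_i w_i = x.
    A_eq = np.vstack([np.asarray(points).T, np.ones(n)])
    b_eq = np.append(x, 1.0)
    res = linprog(np.zeros(n), A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * n)
    return res.status == 0  # status 0: a feasible mixture was found

square = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])
assert in_hull(square, np.array([0.5, 0.5]))      # interior point
assert not in_hull(square, np.array([1.5, 0.5]))  # lies outside the square
```

The same primitive underlies empirical gamescape computations, where the points are rows of an evaluation matrix.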

###### Theorem 8 (Krein-Milman).

An extreme point of a convex set C is a point in C which does not lie in any open line segment joining two points of C. If C is convex and compact in a normed space, then C is the closed convex hull of its extreme points.

#### Row versus column mixtures.

In the main text we define the empirical gamescape to be all convex mixtures of rows of an antisymmetric matrix A_P. It follows by antisymmetry of A_P that working with convex mixtures of columns obtains the same polytope up to sign.

The same holds for the functional gamescape, where we chose to work with convex combinations of functions of the form ϕ(·, w), but could have equivalently (up to sign) worked with convex combinations of functions of the form ϕ(w, ·).
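The sign flip is immediate to check numerically, since a column mixture A⋅p equals −(p⊺A)⊺ when A = −A⊺. A small sketch (ours):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(4, 4))
A = B - B.T                      # antisymmetric evaluation matrix: A = -A^T
p = rng.dirichlet(np.ones(4))    # an arbitrary convex mixture

row_mix = p @ A                  # mixture of rows
col_mix = A @ p                  # mixture of columns
# Antisymmetry flips the sign: the column polytope is the row polytope negated.
assert np.allclose(col_mix, -row_mix)
```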

## C Proof of theorem 1

We generalize some results from combinatorial Hodge theory (Jiang et al., 2011; Candogan et al., 2011; Balduzzi et al., 2018b) to the setting where flows are antisymmetric functions instead of antisymmetric matrices.

### c.1 Combinatorial Hodge theory

Combinatorial Hodge theory is a combinatorial analog of differential geometry, see Jiang et al. (2011); Candogan et al. (2011); Baker (2018). Consider a fully connected graph with vertex set [n] = {1, …, n}. Assign a flow ϕ_{ij} to each edge of the graph. The flow in the opposite direction is ϕ_{ji} = −ϕ_{ij}, so flows are just antisymmetric matrices. The flow on a graph is analogous to a vector field on a manifold.

The combinatorial gradient of an n-vector f is the flow grad(f)_{ij} := f_i − f_j. Flow ϕ is a gradient flow if ϕ = grad(f) for some f, or equivalently if ϕ_{ij} = f_i − f_j for all i, j. The combinatorial divergence of a flow is the n-vector div(ϕ)_i := (1/n)⋅Σ_j ϕ_{ij}. The divergence measures the contribution to the flow of each vertex, considered as a source. The curl of a flow is the three-tensor curl(ϕ)_{ijk} := ϕ_{ij} + ϕ_{jk} − ϕ_{ik}.
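The three combinatorial operators (gradient, divergence, curl) are one-liners on antisymmetric matrices. A numpy sketch (ours), taking the uniform measure so divergence is a row mean; it also verifies the finite-case identities stated in lemma 9 below:

```python
import numpy as np

def grad(f):
    """Combinatorial gradient of an n-vector: grad(f)_ij = f_i - f_j."""
    return f[:, None] - f[None, :]

def div(phi):
    """Divergence of a flow under the uniform measure: mean of each row."""
    return phi.mean(axis=1)

def curl(phi):
    """Curl three-tensor: curl(phi)_ijk = phi_ij + phi_jk - phi_ik."""
    return phi[:, :, None] + phi[None, :, :] - phi[:, None, :]

f = np.array([1., -2., 0.5, 0.5])   # a zero-mean "rating" vector
assert np.isclose(f.mean(), 0)
phi = grad(f)
assert np.allclose(phi, -phi.T)     # flows are antisymmetric
assert np.allclose(div(phi), f)     # div o grad = id on zero-mean vectors
assert np.allclose(curl(phi), 0)    # gradient flows are curl-free
```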

### c.2 Functional Hodge theory

We now extend the basic tools of combinatorial Hodge theory to the functional setting.

Let W be a compact set equipped with a probability measure. Given a function ϕ: W × W → ℝ, let ϕ(v, w) prescribe the flow from v to w. The opposite flow, from w to v, is ϕ(w, v) = −ϕ(v, w), so ϕ is antisymmetric. The combinatorial setting above arises as the special case where the compact set is finite.

The combinatorial gradient¹ of a function f: W → ℝ is the flow given by

 grad(f)(v, w) := f(v) − f(w). (24)

¹Our use of 'gradient' and 'divergence' does not coincide with the usual usage in multivariate calculus. This is forced on us by the terminological discrepancy between combinatorial Hodge theory and multivariate calculus. No confusion should result; we use grad for combinatorial gradients and ∇ for calculus gradients.

The divergence of a flow ϕ is the function given by

 div(ϕ)(v) := ∫_W ϕ(v, w)⋅dw. (25)

The curl of a flow ϕ is the three-tensor given by

 curl(ϕ)(u, v, w) := ϕ(u, v) + ϕ(v, w) − ϕ(u, w). (26)

The following proposition lays the groundwork for the Hodge decomposition, proved in the next section.

###### Lemma 9.

Let F := {f: W → ℝ : ∫_W f(w)⋅dw = 0} be the vector space of functions with zero expectation. Then (i) div∘grad = id_F; (ii) ∫_W div(ϕ)(v)⋅dv = 0 for all flows ϕ; and (iii) curl∘grad(f) = 0 for all f.

###### Proof.

(i) Observe that

 div∘grad(f)(v) = ∫_W (f(v) − f(w))⋅dw (27)
               = f(v)⋅∫_W dw − ∫_W f(w)⋅dw (28)
               = f(v) − 0 (29)

where ∫_W dw = 1 because the measure is a probability measure and ∫_W f(w)⋅dw = 0 by assumption.

(ii) Direct computation obtains that

 ∫_W div(ϕ)(v)⋅dv = ∫_{W×W} ϕ(v, w)⋅dv⋅dw = 0 (30)

by antisymmetry of ϕ.

(iii) Direct computation shows that curl∘grad(f)(u, v, w) is

 (f(u) − f(v)) + (f(v) − f(w)) − (f(u) − f(w)) = 0,

as required. ∎

A basic result in combinatorial Hodge theory is the Hodge decomposition, which decomposes the space of flows into gradient-flows and curl-free flows (Jiang et al., 2011; Balduzzi et al., 2018b). The result is analogous to the Helmholtz decomposition in electrodynamics (Balduzzi et al., 2018a). Here, we prove a variant of the Hodge decomposition in the functional setting.

###### Theorem 1 (Hodge decomposition).

The vector space of games admits an orthogonal decomposition into transitive and cyclic components,

 {games} = im(grad) ⊕ ker(div),

with respect to the inner product on games ⟨ϕ, ψ⟩ := ∫_{W×W} ϕ(v, w)⋅ψ(v, w)⋅dv⋅dw.

###### Proof.

First we show that gradient flows and divergence-free flows are orthogonal. Suppose ϕ = grad(f) and div(ψ) = 0. Then

 ⟨ϕ, ψ⟩ = ∫_{W×W} (f(v) − f(w))⋅ψ(v, w)⋅dv⋅dw (34)
        = ∫_W f(v)⋅div(ψ)(v)⋅dv (35)
        + ∫_W f(w)⋅div(ψ)(w)⋅dw (36)
        ≡ 0, (37)

where the second equality integrates out w in the first term and v in the second (using antisymmetry of ψ), and both terms vanish because div(ψ) = 0.

Second, observe that any flow ϕ can be written as

 ϕ = grad(div(ϕ)) + ψ, (38)

where grad(div(ϕ)) is a gradient flow and ψ := ϕ − grad∘div(ϕ) satisfies

 div(ψ) = div(ϕ) − div∘grad∘div(ϕ) = 0 (39)

because div(ϕ) has zero expectation by lemma 9 (ii), so div∘grad∘div(ϕ) = div(ϕ) by lemma 9 (i). ∎
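In the finite case the construction in the proof is directly computable: the divergence of a flow is a rating for each agent, its gradient is the transitive part, and the residual is divergence-free and orthogonal to it. A numpy sketch (ours), under the uniform measure:

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(5, 5))
phi = B - B.T                            # an arbitrary flow (antisymmetric matrix)

f = phi.mean(axis=1)                     # div(phi): a rating per agent
transitive = f[:, None] - f[None, :]     # grad(div(phi)): the transitive part
cyclic = phi - transitive                # the residual: divergence-free

assert np.allclose(cyclic.mean(axis=1), 0)         # div(cyclic) = 0
assert np.isclose(np.sum(transitive * cyclic), 0)  # the two parts are orthogonal
```

Rock-paper-scissors has zero divergence, so it is purely cyclic; an Elo-style rated game is purely transitive.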

## D Proofs of propositions

Notation. The dot product is ⟨u, v⟩ := u⊺v. Subscripts can indicate the dimensions of vectors and matrices, v_n or A_{m×n}, or their entries, v_i or A_{ij}; no confusion should result. The unit vector with a 1 in coordinate i is e_i.

#### Proof of proposition 2.

Before proving the proposition, we first tighten up the terminology.

Two agents v and w are behaviorally identical if ϕ(v, u) = ϕ(w, u) for all u ∈ W. Given population P = {w_1, …, w_n}, two agents are empirically indistinguishable if ϕ(v, w_i) = ϕ(w, w_i) for all w_i ∈ P.

###### Definition 6.

Population Q is redundant relative to population P if every agent in Q is empirically indistinguishable from a convex mixture of agents in P.

More formally, for all v in Q there is a convex mixture Σ_i α_i⋅w_i of agents w_i ∈ P such that ϕ(v, u) = Σ_i α_i⋅ϕ(w_i, u) for all u ∈ Q.

Next, we need a notion of equivalence that applies to objects of different dimensions.

###### Definition 7.

Two polytopes P and Q are equivalent if there is an orthogonal projection π with left inverse ξ (i.e. satisfying ξ⋅π = I) such that

 P = Q⋅π and Q = P⋅ξ. (41)
###### Proposition 2.

Suppose Q is redundant relative to P ⊂ Q. Then the gamescapes of P and Q are equivalent.

###### Proof.

Suppose that P has m elements and Q has m + n elements. We assume, without loss of generality, that the elements of Q are ordered such that the first m coincide with P. The evaluation matrix A_Q thus contains the evaluation matrix A_P as a 'top left' sub-block.

By assumption (since the additional agents in Q are, empirically, convex combinations of agents in P), the evaluation matrix has the block form

 A_Q = [ A_P     −A_P⊺⋅M⊺
         M⋅A_P   −M⋅A_P⊺⋅M⊺ ] = [ I_{m×m} ; M_{n×m} ]⋅A_P⋅[ I_{m×m} ; M_{n×m} ]⊺ (42)

where the row-stochastic matrix M_{n×m} specifies the convex combinations. It follows that the gamescape of Q is generated by convex mixtures of the first m rows of A_Q:

 [ A_P   −A_P⊺⋅M⊺ ]. (43)

Now, let

 B := [ A_P   −A_P⊺⋅M⊺ ] = A_P⋅[ I_{m×m} ; M ]⊺, (44)

and let

 ξ := [ I_{m×m}   M⊺ ] and π := [ I_{m×m} ; 0 ]. (45)

Then ξ⋅π = I_{m×m}, and the gamescapes of P and Q are equivalent under definition 7. It follows that the gamescape of P is the orthogonal projection under π of the gamescape of Q, and that the gamescape of Q can be recovered from that of P by applying ξ, the left inverse of π. ∎

#### Proof of proposition 3.

###### Proposition 3.

The empirical gamescape of population P can be represented in ℝ^{2k}, where 2k = rank(A_P).

###### Proof.

The rank of an antisymmetric matrix is even, so let 2k := rank(A_P). The Schur decomposition, see (Balduzzi et al., 2018b), factorizes an antisymmetric matrix as

 A_{n×2k} = W_{n×2k}⋅J_{2k×2k}⋅W⊺_{2k×n} (46)

where J is block-diagonal with 2×2 blocks of the form (0, λ_i; −λ_i, 0) for λ_i > 0, and the columns of W are orthogonal.

Let B := W⋅J. We claim the empirical gamescape, given by convex mixtures of rows of A, is isomorphic to the polytope given by convex mixtures of rows of B, which lives in ℝ^{2k}. To see this, observe that A = B⋅W⊺ and the columns of W are orthogonal, so right-multiplication by W⊺ embeds the polytope into ℝ^n without distorting its shape. ∎
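Both facts (even rank, and a lossless low-dimensional representation of the rows) can be sanity-checked numerically. A sketch (ours), using an SVD basis for the row space in place of the Schur decomposition:

```python
import numpy as np

rng = np.random.default_rng(2)
# A rank-2 antisymmetric matrix: the "one-cycle" direction u v^T - v u^T.
u, v = rng.normal(size=(2, 6))
A = np.outer(u, v) - np.outer(v, u)

assert np.allclose(A, -A.T)
assert np.linalg.matrix_rank(A) == 2    # rank of an antisymmetric matrix is even

# A 2-d representation of the gamescape: express each row in a basis
# of the row space, here the top right-singular vectors.
U, S, Vt = np.linalg.svd(A)
coords = A @ Vt[:2].T                   # each agent's row, as a point in R^2
assert np.allclose(coords @ Vt[:2], A)  # no information is lost
```

So the six-agent gamescape here is (isometric to) a polygon in the plane, matching proposition 3 with 2k = 2.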

#### Proof of proposition 4.

###### Proposition 4.

Geometrically, the empirical Nash equilibria are

 N_P = {p a distribution : p⊺A_P ⪰ 0}. (47)
###### Proof.

Recall that the Nash equilibria of the column player in a two-player zero-sum game specified by matrix A are the solutions to the linear program:

 max_{v∈ℝ} v (48)
 s.t. p⊺A ⪰ v⋅1 (49)
 p ⪰ 0 and p⊺1 = 1 (50)

where the resulting v is the value of the game. Since A_P is antisymmetric, the value of the game is zero and the result follows. ∎
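Proposition 4 gives a practical recipe: the Nash equilibria of a symmetric zero-sum game are exactly the feasible points of a small linear program. A scipy sketch (ours, not the paper's code; `symmetric_nash` is our name):

```python
import numpy as np
from scipy.optimize import linprog

def symmetric_nash(A):
    """A Nash equilibrium of the symmetric zero-sum game with antisymmetric
    evaluation matrix A: find p >= 0 with p^T 1 = 1 and p^T A >= 0
    (the value of the game is zero, so any feasible p is a Nash)."""
    n = A.shape[0]
    res = linprog(np.zeros(n),                  # pure feasibility problem
                  A_ub=-A.T, b_ub=np.zeros(n),  # p^T A >= 0  <=>  -A^T p <= 0
                  A_eq=np.ones((1, n)), b_eq=[1.0],
                  bounds=[(0, None)] * n)
    return res.x

# Rock-paper-scissors: the unique symmetric Nash mixes uniformly.
rps = np.array([[0., -1., 1.], [1., 0., -1.], [-1., 1., 0.]])
p = symmetric_nash(rps)
assert np.allclose(p, 1/3, atol=1e-6)
assert np.all(p @ rps >= -1e-6)       # proposition 4: p^T A >= 0
```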

#### Proof of proposition 5.

###### Proposition 5.

(i) Performance v(P, Q) is independent of the choice of Nash equilibrium. (ii) If ϕ(v, w) = σ(f(v) − f(w)) with σ monotonic, then performance compares the best agents in each population:

 v(P, Q) = max_{v∈P} f(v) − max_{w∈Q} f(w). (51)

(iii) If P ⊂ Q then v(P, Q) ≤ 0, and v(P, R) ≤ v(Q, R) for any population R.

###### Proof.

(i) The value of a two-player zero-sum matrix game is independent of the choice of Nash equilibrium.

(ii) If ϕ(v, w) = σ(f(v) − f(w)) then

 A_P = (σ(f(w_i) − f(w_j)))_{i,j=1}^n (52)

and the Nash equilibrium for each player is to concentrate its mass on the set

 argmax_{i∈[n]} f(w_i). (53)

The result follows immediately.

(iii) If P ⊂ Q then the Nash distribution on Q of the row player can be reconstructed by the column player (since every mixture of agents in P is also a mixture of agents in Q). Thus, v(P, Q) is at most zero.

Similarly, if P ⊂ Q then every mixture available to P is also available to Q, so P's performance against any other population R will be the same or worse than the performance of Q: v(P, R) ≤ v(Q, R). ∎

#### Proof of proposition 6.

###### Proposition 6.

If p is a Nash equilibrium on population P and agent v satisfies ϕ(v, p) := Σ_i p_i⋅ϕ(v, w_i) > 0, then adding v to P strictly enlarges the gamescape: the gamescape of P is strictly contained in the gamescape of P ∪ {v}.

###### Proof.

We are required to show that v is not a convex combination of agents in P. This follows by contradiction: if v were a convex combination of agents in P then it would either tie or lose to the Nash distribution p, whereas v beats the Nash by assumption. ∎

#### Proof of proposition 7

###### Proposition 7.

The objective constructed by rectified Nash response, PSRO_rN, is effective diversity, definition 4.

###### Proof.

Rewrite effective diversity as

 d(P) = Σ_{i,j=1}^n ⌊ϕ(w_i, w_j)⌋₊⋅p_i⋅p_j (54)
      = Σ_i p_i Σ_j p_j⋅⌊ϕ_{w_j}(w_i)⌋₊ =: Σ_i p_i⋅h_i(w_i) (55)

where h_i(v) := Σ_j p_j⋅⌊ϕ_{w_j}(v)⌋₊. Recall that the rectified Nash response objective for agent w_i is

 max_v Σ_j p_j⋅⌊ϕ_{w_j}(v)⌋₊. (56)

Finally, notice that this objective is exactly h_i(v), so each agent maximizes its own summand of the effective diversity, as required. ∎
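On an empirical evaluation matrix, the rectified payoffs and effective diversity are a few lines of numpy. A sketch (ours; `rectified_objectives` and `effective_diversity` are our names):

```python
import numpy as np

def rectified_objectives(A, p):
    """Per-agent rectified Nash payoffs h_i = sum_j p_j * max(A_ij, 0):
    amplify strengths against the Nash support, ignore weaknesses."""
    return np.maximum(A, 0) @ p

def effective_diversity(A, p):
    """d(P) = sum_ij max(A_ij, 0) * p_i * p_j = p . h (definition 4)."""
    p = np.asarray(p)
    return float(p @ rectified_objectives(A, p))

rps = np.array([[0., -1., 1.], [1., 0., -1.], [-1., 1., 0.]])
uniform = np.ones(3) / 3
assert np.isclose(effective_diversity(rps, uniform), 1/3)
# A redundant population (three behaviorally identical agents) has no diversity:
assert effective_diversity(np.zeros((3, 3)), uniform) == 0.0
```

Raising each agent's `rectified_objectives` entry raises its own term in `effective_diversity`, which is the content of proposition 7.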

## E Rectified Nash response and a reduction to rock-paper-scissors

In this section, we present some basic analysis of the geometry of the gamescape and its relation to Nash equilibria.

### e.1 Nash reweighting of evaluation matrices

Let A be an antisymmetric matrix with Nash equilibrium p (not necessarily unique).

###### Lemma 10.

The Nash reweighted matrix p⊙A⊙p, defined by setting its (i, j) entry to

 (p⊙A⊙p)_{ij} := A_{ij}⋅p_i⋅p_j (57)

is (i) antisymmetric and (ii) satisfies (p⊙A⊙p)⋅1 = 0 and 1⊺⋅(p⊙A⊙p) = 0. That is, all its rows and columns sum to zero.

###### Proof.

(i) Antisymmetry holds since A_{ij}⋅p_i⋅p_j = −A_{ji}⋅p_j⋅p_i by antisymmetry of A.

(ii) We show that all the entries of the vector 1⊺⋅(p⊙A⊙p) are zero, by showing that they are all nonnegative and that they sum to zero.

Direct computation obtains that the j-th entry of 1⊺⋅(p⊙A⊙p) is p_j⋅(p⊺A)_j. Recall that p⊺A ⪰ 0 since p is a Nash equilibrium and the value of the game is zero (since A is antisymmetric). Thus, all the entries of 1⊺⋅(p⊙A⊙p) are nonnegative. Finally, the entries sum to p⊺A⋅p, which is zero by antisymmetry of A; hence every entry is zero, and the rows also sum to zero by antisymmetry (i).
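Lemma 10 can be verified directly on rock-paper-scissors, whose Nash equilibrium is the uniform mixture. A small numpy sketch (ours):

```python
import numpy as np

def nash_reweight(A, p):
    """Nash reweighted matrix: (p . A . p)_ij = A_ij * p_i * p_j (equation 57)."""
    return A * np.outer(p, p)

rps = np.array([[0., -1., 1.], [1., 0., -1.], [-1., 1., 0.]])
p = np.ones(3) / 3                    # Nash equilibrium of rock-paper-scissors
R = nash_reweight(rps, p)

assert np.allclose(R, -R.T)           # (i) still antisymmetric
assert np.allclose(R.sum(axis=0), 0)  # (ii) columns sum to zero
assert np.allclose(R.sum(axis=1), 0)  #      rows sum to zero
```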