Real World Games Look Like Spinning Tops

04/20/2020 ∙ by Wojciech Marian Czarnecki, et al.

This paper investigates the geometrical properties of real world games (e.g. Tic-Tac-Toe, Go, StarCraft II). We hypothesise that their geometrical structure resembles a spinning top, with the upright axis representing transitive strength, and the radial axis representing the non-transitive dimension, which corresponds to the number of cycles that exist at a particular transitive strength. We prove the existence of this geometry for a wide class of real world games, exposing their temporal nature. Additionally, we show that this unique structure also has consequences for learning - it clarifies why populations of strategies are necessary for training of agents, and how population size relates to the structure of the game. Finally, we empirically validate these claims by using a selection of nine real world two-player zero-sum symmetric games, showing 1) the spinning top structure is revealed and can be easily reconstructed by using a new method of Nash clustering to measure the interaction between transitive and cyclical strategy behaviour, and 2) the effect that population size has on the convergence in these games.


1 Introduction

Figure 1: High-level visualisation of the geometry of Games of Skill. It shows a strong transitive dimension, accompanied by highly cyclic dimensions, which gradually diminish as skill grows towards the Nash equilibrium (upward), and diminish as skill degrades towards the worst possible strategies (downward). The simplest example of non-transitive behaviour is a cycle of length 3, found e.g. in the Rock Paper Scissors game.

Game theory has been used as a formal framework to describe and analyse many naturally emerging strategic interactions (Smith, 1982; Harsanyi et al., 1988; Gibbons, 1992; Sigmund, 1993; Morrow, 1994; Jackson, 2008; David and Jon, 2010). It is general enough to describe very complex interactions between agents, including classic real world games like Tic-Tac-Toe, Chess, Go, and modern computer-based games like Quake, DOTA or StarCraft II. Simultaneously, game theory formalisms apply to abstract games that are not necessarily interesting for humans to play, but were created for different purposes. In this paper we ask the following question: Is there a common structure underlying the games that humans find interesting and engaging?

Why is it important to understand the geometry of real world games? Games have been used as benchmarks and challenges for the development of artificial intelligence for decades, starting with Shannon's interest in Chess (Shannon, 1950), through to the first reinforcement learning success in Backgammon (Tesauro, 1995), IBM DeepBlue (Campbell et al., 2002) developed for Chess, and the more recent achievements of AlphaGo (Silver et al., 2016) mastering the game of Go, FTW (Jaderberg et al., 2019) for Quake III: Capture the Flag, AlphaStar (Vinyals et al., 2019) for StarCraft II, OpenAI Five (OpenAI et al., 2019) for DOTA 2, and Pluribus (Brown and Sandholm, 2019) for no-limit Texas Hold 'Em Poker. We argue that grasping the common structure of these real world games is essential to understand why specific solution methods work, and can additionally provide us with tools to develop AI based on a deeper understanding of the scope and limits of solutions to previously tackled problems. The analysis of non-transitive behaviour has been critical for algorithm development in general game-theoretic settings in the past Lanctot et al. (2017); Balduzzi et al. (2018, 2019). A good tool to have would therefore be a formalisation of non-transitive behaviour in real world games, and a method of dealing with the notion of transitive progress built on top of it.

We propose the Game of Skill hypothesis (Fig. 1) and argue that strategies in Games of Skill exhibit a geometry that resembles a spinning top, where the upright axis represents the transitive strength and the radial axis corresponds to cyclic, non-transitive dynamics. We focus on two aspects. Firstly, we theoretically and empirically validate whether the Games of Skill geometry materialises in real world games. Secondly, we unpack some of the key practical consequences of the hypothesis, in particular investigating the implications for training agents.

Some of the above listed projects use multi-agent training techniques that are not guaranteed to work in all games. In fact, there are conceptually simple, yet surprisingly difficult cyclic games that cannot be solved by these techniques (Balduzzi et al., 2019). This suggests that real world games form a subclass of games that is strictly smaller than the class of 2-player symmetric zero-sum games (games where one side wins while the other loses, e.g. Go, Chess, 1v1 StarCraft, 1v1 DOTA) that is often used as a formalisation. The Game of Skill hypothesis provides such a class, and makes specific predictions about how strategies behave. One clear prediction is the existence of tremendously long cycles, which permeate the space of relatively weak strategies in each such game. Theorem 1 proves the existence of long cycles in a rich class of real world games that includes all the examples above. Additionally, we perform an empirical analysis of nine real world games, and establish that the hypothesised Games of Skill geometry is indeed observed in each of them.

Finally, we analyse the implications of the Game of Skill hypothesis for learning. In many of the projects tackling real world games Jaderberg et al. (2019); Vinyals et al. (2019); OpenAI et al. (2019) some form of population-based training Jaderberg et al. (2017); Lanctot et al. (2017) is used, where a collection of agents is gathered and trained against. We establish theorems connecting population size and diversity with transitive improvement guarantees, underlining the importance of the population-based training techniques used in much of the games-related research above, as well as the notion of diversity-seeking behaviours. We also confirm these claims with simple learning experiments over empirical games derived from nine real world games.

In summary, our contributions are three-fold: i) we define a game class that models real world games, including those studied in recent AI breakthroughs (e.g. Go, StarCraft II, DOTA 2); ii) we show both theoretically and empirically that a spinning top geometry can be observed; iii) we provide theoretical arguments that elucidate why specific state-of-the-art algorithms lead to consistent improvements in such games, with an outlook on developing new population-based training methods.

Proofs of propositions are provided in Supplementary Materials B, together with details on implementations of empirical experiments (E, G, H), additional data (F), and algorithms used (A, C, D, I, J).

2 Game of Skill hypothesis

We argue that real world games have two critical features that make them Games of Skill. The first feature is the notion of progress. Players that regularly practice need to have a sense that they will improve and start beating less experienced players. This is a very natural property to keep people engaged, as there is a notion of skill involved. From a game theory perspective, this translates to a strong transitive component of the underlying game structure.

A game of pure Rock Paper Scissors (RPS) does not follow this principle, and humans essentially never play it in a standalone fashion as a means of measuring strategic skill (without at least knowing the identity of their opponent and having some sense of their opponent's previous strategies or biases).

The second feature is the availability of diverse game styles. A game is interesting if there are many qualitatively different strategies (Deterding, 2015; Lazzaro, 2009; Wang and Sun, 2011) that have their own strengths and weaknesses, whilst on average performing on a similar level in the population. Examples include the various openings in games like Chess and Go, which do not provide a universal advantage against all opponents, but rather work well against other specific openings. It follows that players with approximately the same transitive skill level can still have imbalanced win rates against specific individuals within the group – their strategies and game styles will counter one another. This creates interesting dynamics, providing players, especially at lower levels of skill, with direct information on where they can improve. Crucially, this richness gradually disappears as players get stronger, so at the highest level of play the outcome relies mostly on skill and less on game style. From a game theory perspective, this translates to non-transitive components that rapidly decrease in magnitude relative to the transitive component as skill improves.

These two features combined would lead to a cone-like shape of the game geometry, with a wide, highly cyclic base, and a narrow top of highly skilled strategies. However, while players usually play the game to win, the strategy space includes many strategies whose goal is to lose. While there is often an asymmetry between seeking wins and losses (it is often easier to lose than it is to win), the overall geometry will be analogous - with very few strategies that lose against every other strategy, thus creating a peaky shape at the bottom of our hypothesised geometry. This leads to a spinning top (Figure 1) – a geometry, where, as we travel across the transitive dimension, the non-transitivity first rapidly increases, and then, after reaching a potentially huge quantity (more formally detailed later), quickly reduces as we approach the strongest strategies. We refer to games that exhibit such underlying geometry as Games of Skill.

3 Preliminaries

We first establish preliminaries related to game theory and the assumptions made herein. We refer to the options available to any player of the game as a strategy, in the game-theoretic sense. Moreover, we focus on finite normal-form games (i.e. wherein the outcomes of a game are represented as a payoff tensor), unless otherwise stated.

We use $S$ to denote the set of all strategies in a given game, with $s \in S$ denoting a single pure strategy. We further focus on symmetric, deterministic, zero-sum games, where the payoff (outcome of a game) is denoted by $f(s_1, s_2)$. We say that $s_1$ beats $s_2$ when $f(s_1, s_2) > 0$, draws when $f(s_1, s_2) = 0$ and loses otherwise. For games which are not fully symmetric (e.g. all turn-based games) we symmetrise them by considering a game we play once as player 1 and once as player 2. Many games we talk about have an underlying time-dependent structure; thus, it might be more natural to think about them in the so-called extensive form, wherein player decision points are expressed in a temporal manner. To simplify our analysis, we cast all such games to the normal form, though we still exploit some of their time-dependent characteristics. Consequently, when we refer to a specific game (e.g. Tic-Tac-Toe), we also analyse the rules of the game itself, which might provide additional properties and insights into the geometry of the payoffs $f$. In such situations, we explicitly mention that the property/insight comes from the game rules rather than its payoff structure $f$. This is somewhat different from a typical game-theoretical analysis (for normal-form games) that might equate the game and $f$. We use a standard tree representation of temporally extended games, where a node represents a state of the game (e.g. the board at any given time in the game of Tic-Tac-Toe), and edges represent what the next game state is when a player takes a specific action (e.g. spaces where a player can mark their X or O). A node is called terminal when it is an end of the game, and it provides an outcome $f$. In this view a strategy is a deterministic mapping from every state to an action, and the outcome between two strategies is simply the outcome of the terminal state they reach when they play against each other. Figure 2 visualises these views on an exemplary three-step game.

Figure 2: Left – extensive form/game tree representation of a simple 3-step game, where in each state a player can choose one of two actions, and after exactly 3 moves one of the players wins. Player 1 takes actions in circle nodes, and player 2 in diamond nodes. Outcomes are presented from the perspective of player 1. Middle – a partial normal form representation of this game, presenting outcomes for 4 strategies, colour coded on the graph representation. Right – a symmetrised version, where two colours denote which strategy one follows as player 1 and which as player 2.

We call a game monotonic when $f(s_1, s_2) > 0$ and $f(s_2, s_3) > 0$ implies $f(s_1, s_3) > 0$. In other words, the relation of one strategy beating another is transitive in the set-theoretic sense. We say that a set of strategies $\{s_1, \dots, s_k\}$ forms a cycle of length $k$ when for each $i < k$ we have $f(s_{i+1}, s_i) > 0$ and $f(s_1, s_k) > 0$. For example, in the game of Rock Paper Scissors we have a cycle of length 3: Paper beats Rock, Scissors beat Paper, and Rock beats Scissors. Arguably no real world game is monotonic or purely cyclic; rather, these two components are present at the same time. There are various ways in which one could define a decomposition of a given game into transitive and non-transitive components Balduzzi et al. (2019). In this paper, we establish several notions that could be used to describe these phenomena, depending on the assumptions about the game structure. Specifically, in Section 5, the transitive component corresponds to the index of a layer (cluster of strategies) that we define, while the non-transitivity corresponds to the size of this layer. In Section 6 this notion is relaxed to any game, where the transitive component becomes the index of a newly introduced Nash cluster, and the non-transitivity corresponds to the size of this cluster.
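To make these definitions concrete, here is a minimal numpy sketch (the helper names is_monotonic and find_3_cycles are ours, not from the paper) that checks transitivity of the "beats" relation and enumerates length-3 cycles in a small antisymmetric payoff, using Rock Paper Scissors as the example.

import numpy as np
from itertools import permutations

# Antisymmetric payoff of Rock Paper Scissors: f[i, j] > 0 means strategy i beats strategy j.
RPS = np.array([[ 0, -1,  1],
                [ 1,  0, -1],
                [-1,  1,  0]])

def is_monotonic(f):
    # Transitivity of the "beats" relation: f[a,b] > 0 and f[b,c] > 0 must imply f[a,c] > 0.
    n = len(f)
    return all(not (f[a, b] > 0 and f[b, c] > 0) or f[a, c] > 0
               for a in range(n) for b in range(n) for c in range(n))

def find_3_cycles(f):
    # Ordered triples (a, b, c) with b beating a, c beating b, and a beating c.
    n = len(f)
    return [(a, b, c) for a, b, c in permutations(range(n), 3)
            if f[b, a] > 0 and f[c, b] > 0 and f[a, c] > 0]

print(is_monotonic(RPS))   # False: RPS is not monotonic
print(find_3_cycles(RPS))  # contains (0, 1, 2): Paper beats Rock, Scissors beat Paper, Rock beats Scissors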

The manner in which we study the geometry of games in this paper is motivated by the structural properties that AI practitioners have exploited to build competent agents for real world games Vinyals et al. (2019); Silver et al. (2016); OpenAI et al. (2019), using reinforcement learning (RL). Specifically, consider an empirical game-theoretic outlook on the training of policies in a game (e.g. Tic-Tac-Toe), where each trained policy (e.g. neural network) for a player is considered as a strategy of the empirical game. In other words, an empirical game is a normal-form game wherein AI policies are synonymous with strategies. Each of these policies, when deployed on the true underlying game, yields an outcome (e.g. win/loss) captured by the payoff in the empirical game. Thus, in each step of training, the underlying RL algorithm produces an approximate best response in the actual underlying (multi-step, extensive-form) game; this approximate best response is then added to the set of policies (strategies) in the empirical game, iteratively expanding it.

This AI training process is also often hierarchical – there is some form of multi-agent scheduling process that selects a set of agents to be beaten at a given iteration (e.g. playing against a previous version of an agent in self-play like fashion Silver et al. (2016), or against some distribution of agents generated in the past Vinyals et al. (2019)), and the underlying RL algorithm used for training new policies performs optimisation to find an agent that satisfies this constraint. Because of this structure, there is a risk that the RL algorithm finds very weak strategies that satisfy the constraint (e.g. strategies that are highly exploitable), or that rely on some properties of the opponent's behaviour that do not generalise to the wider set of opponents one may be interested in. For example, consider a neural network trained to beat a deterministic policy in Chess, which always opens with the same piece. A policy found by such an approach likely has little chance of generalising to any other opening; even though it might play brilliantly against this particular opponent, it might lose against even the weakest opponents who open differently. Issues like this have been observed in various large-scale projects (e.g. exploits that human players found in OpenAI Five OpenAI et al. (2019), or exploiters in the League Training of AlphaStar Vinyals et al. (2019)). This exemplifies some of the challenges of creating AI agents, which are not the same as those humans face when they play a specific game. Given these insights, we argue that algorithms can be disproportionately affected by the existence of various non-transitive geometries, in contrast to humans.

4 Real world games are complex

The spinning top hypothesis implies that at some relatively low level of transitive strength, one should expect very long cycles in any Game of Skill. We now prove that, in a large class of games (ranging from board games such as Go and Chess to modern computer games such as DOTA, StarCraft or first-person games like Quake), one can find tremendously long cycles, as well as any other non-transitive geometries.

We first introduce the notion of $n$-bit communicative games, which provides a mechanism for lower bounding the number of cyclic strategies. For a given game with payoff $f$, we define its win-draw-loss version, with the same rules and a payoff $\mathrm{sign}(f)$, which simply removes the score value and collapses all wins, draws, and losses onto +1, 0, and -1 respectively. Importantly, this transformation affects neither winning nor the notion of cycles (though it could, for example, change Nash equilibria).

Definition 1.

Consider the extensive-form view of the win-draw-loss version of any underlying game; the underlying game is called $n$-bit communicative if each player can transmit $n$ bits of information to the other player before reaching a node whereafter at least one of the outcomes 'win' or 'loss' is no longer attainable.

For example the game in Figure 2 is 1-bit communicative, as each player can take one out of two actions before their actions would predetermine the outcome. Given this definition we can show that, as games become more communicative, the set of strategies that form non-transitive interactions grows exponentially:

Theorem 1.

For every game that is at least $n$-bit communicative, and every antisymmetric win-loss payoff matrix $P \in \{-1, 0, 1\}^{2^n \times 2^n}$, there exists a set of $2^n$ distinct pure strategies $\{s_1, \dots, s_{2^n}\}$ such that the win-draw-loss payoff satisfies $\mathrm{sign}(f(s_i, s_j)) = P_{ij}$.

Proof.

Let us assume we are given such a matrix $P$. We define corresponding strategies $s_1, \dots, s_{2^n}$ such that each $s_i$ starts by transmitting its ID $i$ as a binary vector using $n$ bits. Afterwards, strategy $s_i$ reads out $P_{ij}$ based on its own ID $i$, as well as the decoded ID $j$ of the opponent, and since we assumed each win-draw-loss outcome can still be reached in the game tree, the players then play to win, draw or lose, depending on the value of $P_{ij}$. We choose $s_i$ and $s_j$ to follow the first strategy in a lexicographic ordering (to deal with partially observable/concurrent-move games) over sequences of actions that lead to the required outcome; the ordering over actions is arbitrary and fixed. Since identities are transmitted using binary codes of length $n$, there are $2^n$ possible ones. ∎

In particular, this means that if we pick $P$ to be cyclic – where for each $i < 2^n$ we put $P_{i+1, i} = 1$ and $P_{i, i+1} = -1$ (so that $s_{i+1}$ beats $s_i$), and for the last strategy we do the same, apart from making it lose to strategy 1 by putting $P_{2^n, 1} = -1$ – we obtain a constructive proof of a cycle of length $2^n$, since $s_2$ beats $s_1$, $s_3$ beats $s_2$, ..., and $s_1$ beats $s_{2^n}$. In practice, the longest cycles can be much longer (see the example of the Parity Game of Skill in the supplementary materials) and thus the above result should be treated as a lower bound.
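As an illustration of this construction (a sketch with our own naming; cyclic_payoff is not from the paper), the following snippet builds such a cyclic win-loss matrix over 2^n strategies and checks the two properties Theorem 1 asks for; entries outside the cycle are filled in transitively purely for concreteness, since the proof leaves them free.

import numpy as np

def cyclic_payoff(n_bits):
    # Cyclic antisymmetric win-loss matrix over 2**n_bits strategies: strategy i+1 beats
    # strategy i, the first strategy beats the last one, and every remaining pair is
    # filled in transitively (higher index wins) just to make the matrix complete.
    m = 2 ** n_bits
    P = np.sign(np.subtract.outer(np.arange(m), np.arange(m)))  # P[i, j] = sign(i - j)
    P[0, m - 1], P[m - 1, 0] = 1, -1                            # close the cycle
    return P

P = cyclic_payoff(3)                                 # 2**3 = 8 strategies
cycle = [(i + 1, i) for i in range(7)] + [(0, 7)]    # s2 beats s1, ..., s1 beats s8
assert all(P[winner, loser] == 1 for winner, loser in cycle)
assert np.all(P == -P.T) and np.all(np.abs(P) <= 1)  # antisymmetric win-loss matrix, as in Theorem 1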

Note that strategies composing these long cycles will be very weak in terms of their transitive performance, though of course not as weak as strategies that actively seek to lose; thus, in the hypothesised geometry they would occupy the thick, middle level of the spinning top. Since such strategies do not particularly target winning or losing, they are unlikely to be executed by a human playing the game. Despite this, we use them to exemplify the most extreme part of the underlying geometry, and given that at both extremes of very strong and very weak policies we expect non-transitivities to be much smaller, we hypothesise that games behave approximately monotonically in both these directions.

Interestingly, for many games, we can compute $n$ exactly by traversing the game tree. The intuition behind this procedure is simple and uses basic recursive reasoning, as detailed below.

Remark 1.

Consider any fully observable game where each legal action leads to a different state. All the states that, after taking one action, have a determined outcome are 0-bit communicative. For the remaining ones, we can derive the recursive relation

$$c(s) = \max_{T \subseteq \mathcal{T}(s)} \left[ \log_2 |T| + \min_{s' \in T} c(s') \right]$$

for state $s$, with $\mathcal{T}(s)$ being the set of all states to which we can transition by taking an action in $s$. With this notation, $n$ equals $c$ of the initial state. In more detail, the logarithm simply measures how many bits of information we can transmit (since the game is fully observable, the opponent perceives the action we took) if we restrict ourselves to selecting an action only from $T$, which is a subset of all possible transitions. Since we are interested in states where we can guarantee transmitting a given amount of information, we need to take a minimum over how communicative the selected successor states are. We need to keep track of these quantities independently for each player, and simply remember to take the minimum w.r.t. both communication channels (as we are interested in both players communicating the same amount at each state). The overall time and memory complexity of this method scales with the size of the state space $S$ and the branching factor $b$ (the maximum number of actions per state). Details and pseudocode are provided in the Supplementary Materials.
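The following Python sketch implements a simplified, single-channel version of this recursion on a toy game tree: it computes a lower bound on the bits one designated player can transmit, taking a conservative minimum over the opponent's moves (the paper's Algorithms 2 and 3 in the appendix track both channels simultaneously). The Node/bits_for names and the example tree are ours.

import math
from dataclasses import dataclass, field

@dataclass
class Node:
    player: int = 0                  # whose move it is (ignored at terminal states)
    children: list = field(default_factory=list)
    outcome: int | None = None       # +1 / -1 / 0 at terminal states, None otherwise

def reachable_outcomes(s):
    if s.outcome is not None:
        return {s.outcome}
    return set().union(*(reachable_outcomes(c) for c in s.children))

def bits_for(s, p):
    # Lower bound on the bits player p can transmit from state s before the outcome is forced.
    live = [c for c in s.children if {+1, -1} <= reachable_outcomes(c)]
    if s.outcome is not None or not live:
        return 0.0                   # outcome (almost) determined: 0-bit, as in Definition 1
    vals = sorted((bits_for(c, p) for c in live), reverse=True)
    if s.player != p:                # opponent moves: the bits must survive every branch
        return min(vals)
    # player p moves: the best subset of k children is the k with the largest future values,
    # contributing log2(k) bits now (the greedy argument behind Algorithm 1 in the appendix)
    return max(math.log2(k) + vals[k - 1] for k in range(1, len(vals) + 1))

# A three-ply game in the spirit of Figure 2: each player freely picks one of two actions
# before the final move forces the outcome, so each player can transmit exactly one bit.
last = lambda: Node(player=0, children=[Node(outcome=+1), Node(outcome=-1)])
mid = lambda: Node(player=1, children=[last(), last()])
root = Node(player=0, children=[mid(), mid()])
print(bits_for(root, 0), bits_for(root, 1))   # 1.0 1.0 -> (at least) 1-bit communicative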

Running the above procedure on the game of Tic-Tac-Toe yields its exact communicativeness $n$ (which means that any antisymmetric win-loss payoff matrix of size $2^n \times 2^n$ is realised by some set of its pure strategies). Additionally, all 1-step games (e.g. RPS) are 0-bit communicative, as all actions immediately prescribe the outcome without the ability to communicate any information.

For games where the state space is too large to be traversed, we can rely on an analogous procedure, but consider some heuristic choice of the transition subsets, thus providing a lower bound on $n$ instead of the exact value. In the game of Go, we can choose the heuristic of playing stones only on one half of the board, and show the following.

Proposition 1.

The game of Go is at least $\lfloor \log_2(180!) \rfloor$-bit communicative, thus there exists a cycle of length at least $2^{\lfloor \log_2(180!) \rfloor} \approx 180!$, which is far larger than the number of atoms in the observable universe.

Proof.

Since Go has a resign action, one can use the entire state space for information encoding, whilst still being able to reach both winning and losing outcomes. The game is played on a 19×19 board – if we split it in half we get 180 places to put stones per side, such that the middle point is still empty, and thus any placement of a player's stones on their half is legal and no stones die. The order in which these 180 fields are filled gives each player the ability to transfer $\log_2(180!)$ bits, and according to Theorem 1 we thus have a cycle of length at least $2^{\lfloor \log_2(180!) \rfloor} \approx 180!$. Figure 7 provides a visualisation of this construction. ∎

Proposition 2.

Modern games, such as StarCraft, DOTA or Quake, are $n$-bit communicative for any $n$, unless a time limit is applied to the game. Even with a reasonable time limit of 10 minutes, these games are at least 36,000-bit communicative.

The above analysis shows that real world games have an extraordinarily complex structure, which is not commonly analysed in classical game theory. The sequential, multi-step aspect of these games makes a substantial difference: even though one could simply view each of them in a normal-form way (Myerson, 2013), this would hide the true structure exposed by our analysis.

Naturally, the above does not prove that real world games follow the Games of Skill geometry. To validate the merit of this hypothesis, we follow the well-established path of empirically testing hypothesised models, as in the natural sciences (e.g. physics). Notably, the rich non-transitive structure (located somewhere in the middle of the transitive dimension) exposed by this analysis is a key property that the hypothesised Game of Skill geometry implies. More concretely, in Section 8 we conduct an empirical game theory-based analysis of a wide range of real world games to show that the hypothesised spinning top geometry can, indeed, be observed.

5 Layered game geometry

The practical consequences of huge sets of non-transitive strategies are two-fold. First, building naive multi-agent training regimes that try to deal with non-transitivity by asking agents to form a cycle (e.g. by losing to some opponents) is likely to fail – there are just too many ways in which one can lose without providing any transitive improvement for other agents trained against it. Second, there exists a shared geometry and structure across many games, which we should exploit when designing multi-agent training algorithms. In particular, we show how these properties justify some of the recent training techniques involving population-level play and the League Training used in AlphaStar (Vinyals et al., 2019). In this section, we investigate the implications of such a game geometry on the training of agents, starting with a simplified variant that enables building of intuitions and algorithmic insights.

Definition 2 ($k$-layered finite Game of Skill).

A finite game is a $k$-layered finite Game of Skill if the set of strategies $S$ can be factorised into layers $L_1, \dots, L_k$ such that $\bigcup_i L_i = S$ and $L_i \cap L_j = \emptyset$ for $i \neq j$, and layers are fully transitive in the sense that for each $i < j$, every $s_a \in L_i$ and $s_b \in L_j$ satisfy $f(s_a, s_b) > 0$.

More intuitively, in these $k$-layered games, all the non-transitive interactions take place within each layer $L_i$, whilst the skill (or transitive) component of the game corresponds to the layer index.

Note that for every finite game there exists some $k$ for which it is a $k$-layered game (though when $k = 1$ this structure is not useful). Moreover, every monotonic game has as many layers as there are strategies in the game.
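As a concrete toy instance of Definition 2 (our construction, not the paper's), one can stack Rock-Paper-Scissors copies transitively: each layer is internally cyclic, while earlier (stronger) layers beat all later ones.

import numpy as np

RPS = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])

def layered_game(num_layers):
    # Each layer is an RPS triple; every strategy in an earlier (stronger) layer
    # beats every strategy in all later (weaker) layers.
    n = 3 * num_layers
    f = np.zeros((n, n), dtype=int)
    for i in range(num_layers):
        block = slice(3 * i, 3 * i + 3)
        f[block, block] = RPS            # within-layer non-transitive (cyclic) interactions
        f[block, 3 * i + 3:] = 1         # layer i beats every later layer ...
        f[3 * i + 3:, block] = -1        # ... which in turn loses to it
    return f

f = layered_game(3)
assert np.all(f == -f.T)                 # antisymmetric zero-sum payoff
assert np.all(f[0:3, 3:] == 1)           # the first (strongest) layer beats both weaker layers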

Layered games can be challenging for many training algorithms used in practice (OpenAI et al., 2019; Silver et al., 2016; Jaderberg et al., 2019), such as naive self-play. However, a simple form of fictitious play with a hard limit on population size will converge independently of the oracle used (the oracle being the underlying algorithm that returns a new policy satisfying a given improvement criterion):

Proposition 3.

Fixed-size population fictitious play, where at iteration $t$ one replaces some (e.g. the oldest) strategy in the population $P_t$ with a new strategy $s_{t+1}$ such that $f(s_{t+1}, s) > 0$ for every $s \in P_t$, converges in layered Games of Skill, if the population is not smaller than the size of the lowest layer occupied by at least one strategy in the population and at least one strategy is above the widest layer. If all strategies are below the widest layer, then the required population size is that of the widest layer.

Intuitively, to guarantee transitive improvements over time it is important to cover all possible strategies of the currently occupied layer. In particular, a population improvement variant of the above procedure, where one requires finding a set of unique strategies that all satisfy the same criterion, would guarantee transitive improvement at each iteration, using an identical proof method. This proposition also recovers the known result that just one strategy in the population (e.g. self-play) suffices to keep improving in monotonic games (Balduzzi et al., 2019).

Proposition 3 also shows an important intuition related to how modern AI systems are built – the complexity of the non-transitivity discovery/handling methodology decreases as the overall transitive strength of the population grows. Various agent priors (e.g. search, architectural choices for parametric models such as neural networks, smart initialisation such as imitation learning, etc.) will initialise in higher parts of the spinning top, and also restrict the set of representable strategies to the transitively stronger ones. This means that there exists a form of balance between the priors one builds into an AI system and the amount of multi-agent learning complexity required (see Figure 3 for a comparison of various recent state-of-the-art AI systems).

Figure 3: Visualization of various state of the art approaches for solving real world games, with respect to the multi-agent algorithm and agent modules used (on the left). Under the assumption that these projects led to the approximately best agents possible, and that the Game of Skill hypothesis is true for these games, we can predict what part of the spinning top each of them had to explore (represented as intervals on the right). This comes from the complexity of the multi-agent algorithm (the method of dealing with non-transitivity) that was employed – the more complex the algorithm, the larger the region of the top that was likely represented by the strategies using the specific agent stack. This analysis does not expose which approach is better or worse. Instead, it provides intuition into how the development of training pipelines used in the literature enables simplification of non-transitivity avoidance techniques, as it provides an initial set of strategies high enough in the spinning top.

From a practical perspective, there is no simple way of knowing the layer structure without traversing the entire game tree. Consequently, this property is not directly transferable to the design of an efficient algorithm (as, if one had access to a full game tree traversal, one could simply use Min-Max to solve the game). Instead, this analysis provides an intuitive mechanism explaining why finite-memory fictitious self-play can work well in practice.

6 Relaxation of a layered game

In practice, covering all cycles is not feasible, and the assumption of perfect dominance between layers in $k$-layered games is too strong. Thus, we seek a more tractable notion of transitivity and non-transitivity. In this section, we show that games undergo a natural decomposition related to Nash equilibria, which also describes the Game of Skill hypothesis.

The idea behind this approach, called Nash clustering, is to first find a mixed Nash equilibrium of the game payoff $f$ over the set of pure strategies (we use the notation $\mathrm{Nash}(f|_A)$ to denote the equilibrium for payoff $f$ when restricted only to strategies in $A$), and form the first cluster by taking all the pure strategies in the support of this mixture. Then, we restrict the game to the remaining strategies, repeating the process until no strategies remain.

Definition 3 (Nash clustering).

We define the Nash clustering $\mathcal{C} = (C_1, C_2, \dots)$ of the finite zero-sum symmetric game strategy set $S$ by setting

$$C_{k+1} = \mathrm{supp}\left(\mathrm{Nash}\left(f\big|_{S \setminus \bigcup_{i \leq k} C_i}\right)\right) \quad\quad (1)$$

for $k \geq 0$, with $C_0 = \emptyset$.

Since a Nash equilibrium always exists, each cluster is well defined, and since each Nash of a non-empty game has non-empty support, each cluster has non-zero size until some $K$ for which $\bigcup_{i \leq K} C_i = S$. By construction we also have $C_i \cap C_j = \emptyset$ for $i \neq j$, thus ensuring a valid clustering.

While there might be many Nash clusterings per game, there exists a unique maximum entropy Nash clustering, where at each iteration we select the Nash equilibrium with maximum Shannon entropy, which is guaranteed to be unique (Ortiz et al., 2007) due to the convexity of the objective.
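A minimal sketch of Nash clustering follows, assuming a plain linear-programming Nash solver instead of the maximum-entropy solver the paper uses (so the resulting clustering is valid but not necessarily the unique maximum-entropy one); symmetric_nash and nash_clustering are our illustrative names.

import numpy as np
from scipy.optimize import linprog

def symmetric_nash(f):
    # One Nash equilibrium of a symmetric zero-sum game with antisymmetric payoff f,
    # found by linear programming (maximise the game value v subject to v <= (p^T f)_j).
    n = len(f)
    c = np.zeros(n + 1)
    c[-1] = -1.0                                             # minimise -v, i.e. maximise v
    A_ub = np.hstack([-f.T, np.ones((n, 1))])                # v - (p^T f)_j <= 0 for every column j
    A_eq = np.hstack([np.ones((1, n)), np.zeros((1, 1))])    # p sums to one
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n), A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * n + [(None, None)], method="highs")
    return res.x[:n]

def nash_clustering(f, tol=1e-6):
    # Definition 3: repeatedly take the support of a Nash of the game restricted to the
    # remaining strategies, remove that support, and continue until nothing is left.
    remaining = list(range(len(f)))
    clusters = []
    while remaining:
        p = symmetric_nash(f[np.ix_(remaining, remaining)])
        support = [remaining[i] for i in range(len(remaining)) if p[i] > tol]
        clusters.append(support)
        remaining = [s for s in remaining if s not in support]
    return clusters

RPS = np.array([[0.0, -1.0, 1.0], [1.0, 0.0, -1.0], [-1.0, 1.0, 0.0]])
print(nash_clustering(RPS))   # [[0, 1, 2]]: a fully cyclic game collapses to a single cluster

On a strictly monotonic payoff the same procedure returns one singleton cluster per strategy, matching the two extreme cases of fully transitive and fully cyclic games discussed below.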

Nash clustering induces a form of monotonic clustering in the sense of Relative Population Performance (RPP) (Balduzzi et al., 2019), which is defined for two sets of agents $A$ and $B$, with a corresponding Nash equilibrium $(p, q)$ of the asymmetric game $f$ restricted to $A \times B$, as $\mathrm{RPP}(A, B) = \sum_{a \in A} \sum_{b \in B} p(a)\, f(a, b)\, q(b)$.

Theorem 2.

Nash clustering satisfies $\mathrm{RPP}(C_i, C_j) \geq 0$ for each $i \leq j$.

Proof.

By definition, for each $j \geq i$ we have $C_j \subseteq S \setminus \bigcup_{l < i} C_l$; thus, for the Nash mixture $p_i$ over $C_i$ (the Nash equilibrium of the game restricted to $S \setminus \bigcup_{l < i} C_l$) and every $s \in C_j$ we have $\sum_{a \in C_i} p_i(a) f(a, s) \geq 0$, and consequently, for any mixture $q$ over $C_j$,

$$\mathrm{RPP}(C_i, C_j) \geq \min_{q} \sum_{a \in C_i} \sum_{b \in C_j} p_i(a) f(a, b) q(b) \geq 0. \quad\quad (2)$$
∎

For a $k$-layered game, Nash clusters do not cross the layers' boundaries, but can split layers into smaller clusters. We view them as a way of generalising layers to arbitrary games, by relaxing the notion of transitivity and thus relaxing the notion of $k$-layered games. In fully transitive games, Nash clustering creates one cluster per strategy, and in games that are fully cyclic (e.g. RPS) it creates a single cluster containing the entire strategy set.

As in the previous construction, we show that a diverse population that spans an entire cluster (layer) guarantees transitive improvement, despite not having access to any weaker policies, nor explicit knowledge that the cluster has been covered.

Theorem 3.

If at any point in time the training population $P$ includes a full Nash cluster $C_i$, then training against it by finding an $s$ such that $f(s, s') > 0$ for every $s' \in P$ guarantees transitive improvement in terms of the Nash clustering, i.e. $s \in C_j$ for some $j < i$.

Proof.

Let us assume that $s \in C_j$ for some $j > i$. This means that

$$\sum_{a \in C_i} p_i(a) f(a, s) < 0, \quad\quad (3)$$

where the inequality comes from the fact that $f(s, a) > 0$ for every $a \in C_i \subseteq P$, and antisymmetry implies $f(a, s) < 0$. Since $s$ belongs to $S \setminus \bigcup_{l < i} C_l$, this leads to a contradiction with $p_i$ being a Nash equilibrium of the game restricted to that set, and thus $s \in C_j$ for some $j \leq i$. Finally, $s$ cannot belong to $C_i$ itself since $f(s, s) = 0$. ∎

Consequently, in order to keep improving transitively, it is helpful to seek wide coverage of strategies around the current transitive strength (inside the cluster). This high-level idea has been applied to single-player domains, where various notions of diversity were studied (Eysenbach et al., 2018; Pathak et al., 2017), but also in some multi-player games such as soccer (Le et al., 2017) and, more recently, StarCraft II: AlphaStar (Vinyals et al., 2019) explicitly attempts to cover the non-transitivities using exploiters, which implicitly try to expand on the current Nash. With the Game of Skill geometry, one can rely on this required coverage becoming smaller over time (as agents get stronger). Thus, forcing the new generation of agents to be the weakest ones that beat the previous one would be sufficient to keep covering cluster after cluster, until reaching the final one.

7 Random Games of Skill

We show that random games also exhibit a spinning top geometry and provide a possible model for Games of Skill, which admits more detailed theoretical analysis.

Definition 4 (Random Game of Skill).

We define the payoff of a Random Game of Skill as a random antisymmetric matrix, where each entry equals

$$f(s_i, s_j) = x_i - x_j + \epsilon_{ij}, \qquad \epsilon_{ij} = -\epsilon_{ji},$$

where the $x_i$ and $\epsilon_{ij}$ are iid samples of $\mathcal{N}(0, \sigma_x^2)$ and $\mathcal{N}(0, \sigma_\epsilon^2)$ respectively, with $\sigma_x, \sigma_\epsilon > 0$.

The intuition behind this construction is that $x_i$ captures part of the transitive strength of strategy $s_i$. If all the $\epsilon_{ij}$ components were removed, then the game would be fully monotonic; it can be seen as a linear version of the common Elo model Elo (1978), where each player is assigned a single rating which is used to estimate winning probabilities. On the other hand, $\epsilon_{ij}$ is responsible for encoding all interactions that are specific to playing against $s_j$, and thus can represent various non-transitive interactions (i.e. cycles), but, due to randomness, can also sometimes be transitive.
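A small sampler of this construction follows, assuming the reconstructed form above (per-strategy strengths plus an antisymmetric pairwise term); the function name and parameters are ours.

import numpy as np

def random_game_of_skill(n, sigma_x=1.0, sigma_eps=1.0, seed=0):
    # Payoff f_ij = x_i - x_j + eps_ij with eps antisymmetric (the reconstructed form of Definition 4).
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, sigma_x, size=n)                  # per-strategy transitive strength
    e = rng.normal(0.0, sigma_eps, size=(n, n))
    eps = (e - e.T) / np.sqrt(2)                          # antisymmetric pairwise (cyclic) part
    return np.subtract.outer(x, x) + eps, x

f, x = random_game_of_skill(500)
mean_payoff = f.mean(axis=1)
print(np.corrcoef(mean_payoff, x)[0, 1])   # close to 1: the mean payoff tracks x_i, cf. Proposition 5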

Let us first show that the above construction indeed yields a Game of Skill, by taking an instance of this game of size $n$.

Proposition 4.

If $|\epsilon_{ij}| \leq c$ for all $i, j$, then the difference between the maximal and minimal transitive strength $x_i$ in each Nash cluster is bounded by $2c$.

First, let us note that as the ratio of $\sigma_x$ to $\sigma_\epsilon$ grows, the number of Nash clusters grows, as each of them has an upper-bounded spread in $x$ that depends on the magnitude of the $\epsilon$ terms, while a high value of $\sigma_x$ guarantees that there are strategies with big differences in their corresponding $x_i$'s. This constitutes the transitive component of the random game. To see that the largest clusters concentrate around zero transitive strength, note that because of the zero-mean assumption on $x$, this is where the majority of the $x_i$'s are sampled from. As a result, there is a higher chance of forming cycles there than in less densely populated regions of the $x$ scale. With these two properties in place, the geometry resembles a spinning top. Figure 4 further visualises this geometry.

Figure 4: Game profile of the random Game of Skill. Upper left: payoff matrix; Upper right: relation between fraction of strategies beaten for each strategy and number of RPS cycles it belongs to (colour shows which Nash cluster this strategy belongs to); Lower left: payoff between Nash clusters in terms of RPP (Balduzzi et al., 2019); Lower right: relation between fraction of clusters beaten wrt. RPP and the size of each Nash cluster. Payoffs are sorted for easier visual inspection.

This shape can also be seen by considering the limiting distribution of mean strengths.

Proposition 5.

As the game size $n$ grows, for any given strategy $s_i$ the average payoff $\frac{1}{n}\sum_j f(s_i, s_j)$ concentrates around $x_i$, with fluctuations vanishing as $n$ grows.

Now, let us focus our attention on training in such a game, given access to a uniform improvement oracle which, given a set of opponents, returns a strategy selected uniformly from the strategy space among the ones that beat all of the opponents. We analyse the probability of improving the average transitive strength of our population at time $t$, denoted by $\bar{x}_t$.

Theorem 4.

Given a uniform improvement oracle, the change in the average transitive strength decomposes as $\bar{x}_{t+1} - \bar{x}_t = \Delta_t + \xi_t$, where $\Delta_t \geq 0$ and $\xi_t$ is a random variable of zero mean whose variance shrinks as the population size $K$ grows. Moreover, the probability of transitive improvement, $\Pr(\bar{x}_{t+1} > \bar{x}_t)$, increases with $K$.

The theorem shows that the size of the population against which we are training has a strong effect on the probability of transitive improvement, as it reduces the variance of $\xi_t$ at a quartic rate. This result concludes our analysis of random Games of Skill; we now follow with an empirical confirmation of both the geometry and the properties predicted above.

Table 1:

Game profiles of empirical game geometries obtained when sampling strategies in various real world games, such as Connect Four, Tic-Tac-Toe and even StarCraft II. The first three rows clearly show the Game of Skill geometry, while the last row shows the geometry of games that are not Games of Skill and clearly do not follow it. Rows of the payoffs are sorted by mean win rate for easier visual inspection. The pink curve shows a fitted skewed Gaussian to expose the spinning top shape; details are provided in the Supplementary Materials.

8 Real world games

Table 2: Learning curves in empirical games, using a naive population training method for various population sizes: the oldest strategy in the population is replaced with one that beats the whole population on average, using an adversarial oracle (returning the weakest strategy satisfying this goal). For Games of Skill (top) there is a phase change in behaviour for most games: once the population is big enough to deal with the non-transitivity, the system converges to the strongest policy. On the other hand, in other games (bottom), either no population size avoids cycling (the Disc game), or, for fully transitive games (the Elo game), even naive self-play converges.

In order to empirically validate the spinning top geometry, we consider a selection of two-player zero-sum games available in the OpenSpiel library (Lanctot et al., 2019). Unfortunately, the strategy space is enormous even for the simplest of real world games. For example, the number of behaviourally unique pure strategies in Tic-Tac-Toe is astronomically large (see supplementary materials). A full enumeration-based analysis is therefore computationally infeasible. Instead, we rely on empirical game-theoretic analysis, an experimental paradigm that relies on simulation and sampling of strategies to construct abstracted counterparts of complex underlying games, which are more amenable to analysis (Walsh et al., 2002, 2003; Phelps et al., 2004; Wellman, 2006; Phelps et al., 2007; Tuyls and Parsons, 2007). Specifically, we look for strategy sampling that covers the strategy space as uniformly as possible, so that the underlying geometry of the game (as exposed by the empirical counterpart) is minimally biased. A simple and intuitive procedure for strategy sampling is as follows. First, apply a tree-search method, in the form of Alpha-Beta (Newell and Simon, 1976) and MCTS (Brügmann, 1993), and select a range of parameters that control the transitive strength of these algorithms (depth of search for Alpha-Beta and number of simulations for MCTS) to ensure coverage of the transitive dimension. Second, for each such strategy we create multiple instances with varied random number seeds, thus causing them to behave differently. We additionally include Alpha-Beta agents that actively seek to lose, to ensure discovery of the lower cone of the hypothesised spinning top geometry. While this procedure does not guarantee uniform sampling of strategies, it at least provides decent coverage of the transitive dimension. In total, this yields approximately 1000 agents per game. Finally, following strategy sampling, we form an empirical payoff table with entries evaluating the payoffs of all strategy match-ups, remove all duplicate agents, and use this matrix to approximate the underlying game of interest.

Table 1 summarises the empirical analysis which, for the sake of completeness, includes both Games of Skill and games that are not Games of Skill, such as the Disc game (Balduzzi et al., 2019), a purely transitive Elo game, and the Blotto game. Overall, all real world game results show the hypothesised spinning top geometry. If we look at e.g. Go (3x3) we notice that the Nash cluster-induced payoffs look monotonic, and the sizes of these clusters are maximal around the middle of the transitive strength, quickly decreasing as transitive strength either increases or decreases. At the level of the strongest strategies, we still have non-trivial Nash clusters, showing that even in this empirical approximation of the game of Go on a small board, one still needs some diversity of play styles. This is to be expected due to various symmetries of the game rules. At the same time, various games that were created to study game theory rather than for humans to play fail to exhibit the hypothesised geometry. If we look at Blotto, we see that the size of the Nash clusters keeps increasing, as the number of strategies one needs to mix at higher and higher levels of play in this game keeps growing. This is a desired property for the purpose of studying the complexity of games, but arguably not a desired property for a game that is simply played for enjoyment. In particular, the game of Blotto requires players to mix uniformly over all possible permutations to be unexploitable (since the game is invariant to permutations), which is a very hard thing to do for a human player.

We tested the population size claims of Nash coverage as follows. First, we construct empirical games coming from the sampling of agents defined above; this acts as an approximation of the underlying games. Second, we define a simple learning algorithm, where we start with the $K$ (size of the population) weakest strategies (w.r.t. mean win rate) and iteratively replace the oldest one with a strategy $s$ that beats the whole population $P$ on average, meaning that $\frac{1}{|P|}\sum_{s' \in P} f(s, s') > 0$. To pick a strategy we use the most pessimistic oracle, which selects the weakest strategy satisfying the win-rate condition. This counters the bias towards sampling stronger strategies. As a result, we hope to get a fairer approximation of typical greedy learning methods, such as gradient-based methods or reinforcement learning.
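A sketch of this learning procedure on an empirical payoff matrix f follows (our code; the exclusion of current population members from the candidate set is an implementation choice the text does not specify).

import numpy as np

def pessimistic_population_training(f, pop_size, steps=200):
    # Start from the pop_size weakest strategies (by mean win rate) and repeatedly replace
    # the oldest member with the weakest strategy that beats the population on average.
    mean_strength = f.mean(axis=1)
    weakest_first = list(np.argsort(mean_strength))
    population = weakest_first[:pop_size]
    history = [mean_strength[population].mean()]
    for _ in range(steps):
        avg_vs_pop = f[:, population].mean(axis=1)         # average payoff of every strategy vs population
        candidates = [s for s in weakest_first             # scanned from weakest to strongest ...
                      if avg_vs_pop[s] > 0 and s not in population]
        if not candidates:                                  # nothing beats the population on average
            break
        population = population[1:] + [candidates[0]]       # ... so candidates[0] is the pessimistic pick
        history.append(mean_strength[population].mean())
    return history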

For small population sizes training does not converge and cycles in all games (Table 2). As the population grows, strength increases but saturates in various suboptimal cycles. However, when the population exceeds a critical size, training converges to the best strategies in almost all experiments. For games that are not real world games we observe quite different behaviour: either cycling keeps occurring despite growth of the population size (e.g. the Disc game), or convergence is guaranteed even with a population of size 1 (the Elo game, which is monotonic).

9 Conclusions

In this paper we have introduced Games of Skill, a class of games that, as shown both theoretically and empirically, includes many real world games, including Tic-Tac-Toe, Chess, Go and even StarCraft II and DOTA. In particular, we showed that multi-step games have tremendously long cycles, and provided both mathematical and algorithmic methods to estimate their length. We showed that Games of Skill have a geometry resembling a spinning top, which can be used to reason about their learning dynamics. In particular, our insights provide useful guidance for research into population-based learning techniques building on League training (Vinyals et al., 2019) and PBT (Jaderberg et al., 2019), especially when enriched with notions of diversity seeking (Balduzzi et al., 2019). Interestingly, we showed that many games from classical game theory are not Games of Skill, and as such might pose challenges that are not necessarily relevant to developing AI methods for real world games. We hope that this work will encourage researchers to study the structure of real world games, to build better AI techniques that can exploit their unique geometries.

Appendix A Computing n in n-bit communicative games

Our goal is to be able to encode the identity of a pure strategy in the actions it takes, in such a way that the opponent will be able to decode it. We focus on fully observable, turn-based games. Note that with a pure policy and a fully observable game, the only way to send information to the other player is by taking an action (which is observed). Consequently, if at a given state one considers $m$ actions, then by choosing one of them we can transmit $\log_2 m$ bits. We build our argument recursively, by considering subtrees of the game tree. Naturally, a subtree is a tree of some game. Since the assumption of $n$-bit communicativeness is that we can transmit $n$ bits of information before the outcomes become predetermined, it is easy to note that a subtree for which we cannot find terminal nodes with both outcomes (-1, +1) is 0-bit communicative. Let us remove these nodes from the tree. In the new tree, all the leaf nodes are still 0-bit communicative, as they are now "one action away" from making the outcome deterministic. Let us define a function $c_p(s)$ per state $s$ that outputs how many bits each player $p$ can transmit before the game becomes deterministic, with $c_p(s) = 0$ for each such leaf.

The crucial element is how to deal with a decision node. Let us use $\mathrm{ch}(s)$ to denote the set of all children states of $s$, which we assume correspond to taking the actions available in this state. If many actions would lead to the same state, we just pretend only one such action exists. From the perspective of the player $p$ moving in $s$, what we can do is select a subset $T$ of the states that are reachable from $s$. If we do so, we will be able to encode $\log_2 |T|$ bits in this move, plus whatever we can encode in the future, which is simply $\min_{s' \in T} c_p(s')$, as we need to guarantee being able to transmit this number of bits no matter which path is taken.

However, our argument is symmetric, meaning that not only do we need to transmit bits as player $p$, but our opponent needs to as well, and to do so we need to consider the minimum over the players' respective communication channels, $c(s) = \min_p c_p(s)$.

It is easy to notice that for the starting state $s_0$ the game is then $\lfloor c(s_0) \rfloor$-bit communicative. The last recursive equation might look intractable, due to the iteration over subsets of children states. However, we can easily compute quantities of this form. Let us take the general form

$$\max_{T \subseteq X} \left[ \log_2 |T| + \min_{x \in T} h(x) \right] \quad\quad (4)$$

for a finite set $X$ and a function $h$ over it, and let us consider Alg. 1.

  Input: function h, finite set X
  begin
  best ← −∞ {Eq. 4}
  sort X = (x_1, ..., x_m) in descending order of h
  for k = 1 to m do
     if log2(k) + h(x_k) > best then
        best ← log2(k) + h(x_k) {h(x_k) is the minimum of h over the first k elements}
     end if
  end for
  return best
Algorithm 1 Solver for Eq. 4 in O(|X| log |X|) time.

To prove that it outputs the maximum of Eq. 4, note that for a subset of size $k$, the value $\log_2 k + \min_{x \in T} h(x)$ is maximised by picking the $k$ elements with the highest values of $h$: swapping any element of such a prefix for one with a lower $h$ can only decrease the minimum, and thus decrease the function value. Hence it suffices to scan the prefixes of the sorted order, which is exactly what the algorithm does, concluding the optimality proof.

We provide pseudocode in Alg. 2 for the two-player, turn-based case with deterministic transitions. An analogous construction will work for $n$-player and simultaneous-move games, as well as games with chance nodes (one just needs to define what should happen there: taking the minimum will guarantee transmission of bits, while taking an expectation will compute the expected number of bits instead).

An exemplary execution at some state of Tic-Tac-Toe is provided in Figure 6. Figure 7 shows the construction from Proposition 1 for the game of Go.

We can use exactly the same procedure to compute $n$-communicativeness over a restricted set of policies. For example, let us consider strategies using the MinMax algorithm to a fixed depth, between 0 and 9. Furthermore, we restrict what kind of first move they can make (e.g. only in the centre, or in a way that is rotationally invariant). Each such class simply defines a new "leaf" labelling of our tree, or set of available children. Once we reach a state after which the considered policy is deterministic, by definition its communicativeness is zero, so we put $c = 0$ there. Then we again run the recursive procedure. Running this analysis on the game of Tic-Tac-Toe (Fig. 5) reveals the spinning top like geometry w.r.t. the class of policies used. As the MinMax depth grows, the cycle length bound from Theorem 1 decreases rapidly. Similarly, introducing more inductive bias, in the form of selecting what good first moves are, affects the shape in an analogous way.

Figure 5: Visualisation of cycle length bounds coming from Theorem 1, when applied to the game of Tic-Tac-Toe over a restricted set of policies – the y axis corresponds to the depth of MinMax search (encoding transitive strength), and colour and line style correspond to the restricted first move (encoding better and better inductive priors over how to play this game).

This example has two important properties. First, it shows the behaviour of the cyclic dimensions over the whole policy space, as we do not rely on any sampling, but rather consider the entire space, restricting the transitive strength and using Theorem 1 as a proxy for non-transitivity. Second, it exemplifies the claim that various inductive biases restrict the part of the spinning top one needs to deal with when developing an AI for a specific game.

Figure 6: Partial execution of the n-communicativeness algorithm for Tic-Tac-Toe. Black nodes represent states that can no longer reach all possible outcomes. Green ones are the last states before all children nodes would either be terminal or coloured black. The selected children states (building the subset $T$) are encoded in green (for crosses) and blue (for circles), with each edge captioned with the number of bits transmitted (the logarithm of the number of selected children), the minimum number of bits one can transmit afterwards, and the minimum number of bits for the other player (because it is a turn-based game). The communicativeness at each node is the minimum over both players' channels, while for the player making a move in state $s$ we apply the recursion of Eq. 4. Red states are the ones not selected in the parent node by the maximisation over subsets.
Figure 7: Visualisation of the construction from Proposition 1. Left) split of the 19×19 Go board into regions where black stones (red) and white stones (blue) will play. Each player has 180 possible moves. Centre) Exemplary first 7 moves; intuitively, the ordering of the stones encodes a permutation over 180 elements, which corresponds to $\log_2(180!)$ bits being transmitted. Right) After exactly 360 moves, the board will always look like this, at which point, depending on the required outcome, the black player will resign (if it is supposed to lose) or play the centre stone (if it is supposed to win).
  Input: Game tree encoded with:
  - states: S
  - value of a state: v(s) (the game outcome at terminal states)
  - set of children states: ch(s)
  - set of parent states: pa(s)
  - which player moves: p(s)
  begin {Remove states with deterministic outcomes}
  S ← S \ {s : not both outcomes −1 and +1 are reachable from s}
  update ch and pa accordingly
  c1(s) ← 0, c2(s) ← 0 for every leaf s of the pruned tree
  Q ← {s : every child of s is such a leaf} {Init with leaves}
  while Q is not empty do
     s ← Q.dequeue()
     (c1(s), c2(s)) ← Agg(s) {Alg. 3}
     for s' ∈ pa(s) do
        if every child of s' has its values computed then
           Q.enqueue(s') {Enqueue a parent if all its children were analysed}
        end if
     end for
  end while
  return ⌊min(c1(s0), c2(s0))⌋ for the initial state s0
Algorithm 2 Main algorithm to compute n for which a given fully observable two-player zero-sum game is n-bit communicative.
  Input: State s where player p moves; children ch(s) with computed values (c1, c2)
  begin
  h(s') ← min(c1(s'), c2(s')) for each s' ∈ ch(s) {min over players}
  cq ← min over s' ∈ ch(s) of the other player's value {other player bits}
  sort ch(s) = (s_1, ..., s_m) in decreasing order of h {Order by decreasing communicativeness}
  order the values h in the same order
  best ← 0
  for k = 1 to m do
     v ← log2(k) + h(s_k)
     {h(s_k) is the minimum of h over the first k children, as the list is sorted}
     if v > best then
        best ← v {Update maximum}
     end if
  end for
  return (best, cq) as (bits for the moving player p, bits for the other player)
Algorithm 3 Aggregate (Agg) - helper function for Alg. 2

Appendix B Proofs

Proposition 2. Modern games, such as StarCraft, DOTA or Quake, are $n$-bit communicative for any $n$, unless a time limit is applied. Even with a reasonable time limit of 10 minutes, these games are at least 36,000-bit communicative.

Proof.

With modern games running at 60Hz, as long as agents can "meet" in some place and execute 60 actions per second that do not change their visibility (such as tiny rotations), they can transmit $60 \times 60 \times 10 = 36{,}000$ bits of information per 10-minute encounter. Note that this is a very loose lower bound, as we are only transmitting one bit of information per action, while this could be significantly enriched if we allow for the use of multiple actions (such as jumping, moving multiple units, etc.). ∎

Proposition 3. Fixed-size population fictitious play, where at iteration $t$ one replaces some (e.g. the oldest) strategy in the population $P_t$ with a new strategy $s_{t+1}$ such that $f(s_{t+1}, s) > 0$ for every $s \in P_t$, converges in layered Games of Skill, if the population is not smaller than the size of the lowest layer occupied by at least one strategy in the population and at least one strategy is above the widest layer. If all strategies are below the widest layer, then the required population size is that of the widest layer.

Proof.

Let’s assume at least one strategy is above . We will prove, that there will be at most consecutive iterations where algorithm will not improve transitively (defined as a new strategy being part of where is smaller than the lowest number of all that have non empty intersections with ). Since we require the new strategy added at time to beat all previous strategies, it has to occupy at least a level, that is occupied by the strongest strategy in . Let’s denote this level by , then improves transitively, meaning that there exists such that , or it belongs to itself. Since by construction , this can happen at most times, as each strategy in needs to be beaten by and . By the analogous argument, if all the strategies are below , one can have at most consecutive iterations without transitive improvement. ∎

Proposition 4. If $|\epsilon_{ij}| \leq c$ for all $i, j$, then the clusters of the Nash clustering have transitive spread bounded by $2c$ (the difference between the maximal and minimal $x_i$ within a cluster is at most $2c$).

Proof.

Let us hypothesise otherwise, so we have a Nash cluster containing strategies $s_i$ and $s_j$ such that $x_i - x_j > 2c$. Let us show that $s_i$ has to achieve a better outcome against every strategy $s_k$ than $s_j$ does:

$$f(s_i, s_k) - f(s_j, s_k) = x_i - x_j + \epsilon_{ik} - \epsilon_{jk} > 2c - 2c = 0, \quad\quad (5)$$

consequently $s_j$ cannot be part of the Nash support, a contradiction.

Furthermore, Nash supports will be largest around zero transitive strength, where most of the probability mass of the distribution of $x$ is centred, and will shrink towards single strategies as $|x|$ grows. ∎

Proposition 5. As the game size $n$ grows, for any given strategy $s_i$ the average payoff $\frac{1}{n}\sum_j f(s_i, s_j)$ concentrates around $x_i$, with fluctuations vanishing as $n$ grows.

Proof.

Using the central limit theorem, the fact that $\mathbb{E}[x_j] = \mathbb{E}[\epsilon_{ij}] = 0$, and the fact that these variables have a variance bounded by $\max(\sigma_x^2, \sigma_\epsilon^2)$. ∎

Theorem 4. Given a uniform improvement oracle, the change in the average transitive strength decomposes as $\bar{x}_{t+1} - \bar{x}_t = \Delta_t + \xi_t$, where $\Delta_t \geq 0$ and $\xi_t$ is a random variable of zero mean whose variance shrinks as the population size $K$ grows. Moreover, the probability of transitive improvement, $\Pr(\bar{x}_{t+1} > \bar{x}_t)$, increases with $K$.

Proof.

The uniform improvement oracle, given the set $I_t$ of indices of strategies (the current members of our population), returns an index $j$ selected uniformly among all indices satisfying $f(s_j, s_k) > 0$ for every $k \in I_t$, and creates $I_{t+1}$ by replacing a uniformly picked $r \in I_t$ by $j$. If the oracle cannot return such an index, then the training process stops. What we care about is the average skill of the population, described by $\bar{x}_t = \frac{1}{K} \sum_{k \in I_t} x_k$, where $K = |I_t|$. By the definition of the uniform improvement oracle, the returned strategy must, on average, be transitively strong enough to beat every member of the population, which lower bounds its expected strength. Thus, if we call $j$ the index returned by the oracle and $r$ the index of the replaced strategy, we get

$$\bar{x}_{t+1} - \bar{x}_t = \frac{1}{K}\left(x_j - x_r\right),$$

which decomposes into the non-negative drift term $\Delta_t$ and the zero-mean fluctuation term $\xi_t$. This concludes the first part of the theorem. For the second part we notice that, since the strategy in $I_t$ is replaced uniformly at random and the $x_k$ are independent with bounded variance, the variance of $\xi_t$ shrinks as the population size $K$ grows. Finally, taking the expectation conditioned on $\bar{x}_t$, we obtain the claimed bound on the probability of transitive improvement. ∎

Appendix C Cycles counting

In general, even the problem of deciding whether a graph has a simple path of length higher than some (large) $k$ is NP-hard. Consequently we focus our attention only on cycles of length 3 (which embed Rock-Paper-Scissors dynamics). For this problem, we can take the adjacency matrix $A$ of the "beats" relation and simply compute $\mathrm{diag}(A^3)$, which gives the number of length-3 cycles that pass through each node. Note that this technique no longer works for longer cycles, as it computes the number of closed walks instead of closed paths (in other words, nodes could be repeated). For length 3 these concepts coincide though.
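For example (a small numpy sketch with our naming):

import numpy as np

def three_cycles_per_node(f):
    # diag(A^3) for the "beats" adjacency matrix A counts, for every strategy, the number of
    # Rock-Paper-Scissors style 3-cycles passing through it.
    A = (f > 0).astype(int)          # A[i, j] = 1 iff strategy i beats strategy j
    return np.diagonal(np.linalg.matrix_power(A, 3))

RPS = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])
print(three_cycles_per_node(RPS))    # [1 1 1]: each strategy lies on the single RPS cycle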

Appendix D Nash computation

We use an iterative maximum entropy Nash solver for both Nash clustering and RPP Balduzzi et al. (2019) computation. Since we use numerical solvers, the mixtures found are not exactly Nash equilibria. To ensure that they are "good enough" we find a best response and check whether the outcome against it is bigger than -1e-4. If this check fails, we continue iterating until it is satisfied. For the data considered, this procedure always terminated. While usage of the maximum entropy Nash might lead to unnecessarily "heavy" tops of the spinning top geometry (since we could equivalently pick minimum entropy ones, which would form more peaky tops), it guarantees determinism of all the procedures (as the maximum entropy Nash is unique).

Appendix E Games/payoffs definition

After construction of each empirical payoff $f$, we first symmetrise it (so that the ordering of players does not matter), and then standardise it for the analysis and plotting, to keep all the scales easy to compare. This has no effect on Nash equilibria or transitive strength, and is only used for consistent presentation of the results. For most of the games this was an identity operation (as for most of them the payoffs already take values in $\{-1, 0, 1\}$), and it was mostly useful for the various random games and Blotto, which have a wider range of outcomes.
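A minimal sketch of this preprocessing (the exact rescaling constant is not specified in the text; we normalise to unit standard deviation as one reasonable choice):

import numpy as np

def preprocess_payoff(raw):
    # Antisymmetrise the raw payoff (so player ordering does not matter) and rescale it
    # to unit standard deviation, purely for consistent presentation across games.
    f = (raw - raw.T) / 2.0
    std = f.std()
    return f / std if std > 0 else f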

e.1 Real world games

We use OpenSpiel Lanctot et al. (2019) implementations of all the games analysed in this paper, with the following setups (a loading sketch is given after this list):

  • Hex 3X3: hex(board_size=3)

  • Go 3X3: go(board_size=3,komi=6.5)

  • Go 4X4: go(board_size=4,komi=6.5)

  • Quoridor 3X3: quoridor(board_size=3)

  • Quoridor 4X4: quoridor(board_size=4)

  • Tic Tac Toe: tic_tac_toe()

  • Misere Tic Tac Toe (a game of Tic Tac Toe where one wins if and only if the opponent makes a line): misere(game=tic_tac_toe())

  • Connect Four: connect_four()
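These game strings can be passed directly to OpenSpiel's loader; a minimal sketch:

```python
import pyspiel

GAME_STRINGS = [
    "hex(board_size=3)",
    "go(board_size=3,komi=6.5)",
    "go(board_size=4,komi=6.5)",
    "quoridor(board_size=3)",
    "quoridor(board_size=4)",
    "tic_tac_toe()",
    "misere(game=tic_tac_toe())",
    "connect_four()",
]

for game_string in GAME_STRINGS:
    game = pyspiel.load_game(game_string)
    # Print the registered short name and the size of the action space.
    print(game.get_type().short_name, game.num_distinct_actions())
```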

e.2 StarCraft II (AlphaStar)

We use the payoff matrix of the League of the AlphaStar Final Vinyals et al. (2019), which represents a large population (900 agents) playing at a wide range of skill levels, using all 3 races of the game, and playing it without any simplifications. We did not run any of the StarCraft experiments ourselves. The sampling of these strategies is the least controlled, as it comes from the unique way in which the AlphaStar system was trained. However, it appears very well aligned with the goal of covering the strategy space (thanks to the inclusion of what the authors call exploiters), and as such fits our analysis well.

e.3 Rock Paper Scissor (RPS)

We use the standard Rock-Paper-Scissors payoff of the form (rows/columns ordered Rock, Paper, Scissors; each entry is the row player's payoff):

     0  -1   1
     1   0  -1
    -1   1   0

This game is fully cyclic, and there is no pure-strategy Nash (the only Nash equilibrium is the uniform mixture of strategies).

Perhaps surprisingly, people do play RPS competitively; however, it is important to note that in “real life” the game of RPS is much richer than its game-theoretic counterpart. First, it often involves repeated trials, which means one starts to reason about the strategy the opponent is employing, and tries to exploit it while not being exploited oneself. Second, the identity of the opponent is often known, and since players are humans, they have inherent biases in the form of not being able to play completely randomly, having beliefs, preferences and other properties that can be analysed (based on historical matches) and exploited. Finally, since the game is often played in a physical environment, there might be various subconscious tells for a given player that inform the opponent about which move they are going to play, akin to the Clever Hans phenomenon.

e.4 Disc Game

We use the definition of the random game from the “Open-ended learning in symmetric zero-sum games” paper Balduzzi et al. (2019). We first sample  points uniformly in the unit circle and then put

Similarly to RPS, this game is fully cyclic.
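A sketch of this construction as we read it from the cited paper; the particular rotation matrix and sampling routine are our assumptions:

```python
import numpy as np

def disc_game_payoff(n: int = 1000, seed: int = 0) -> np.ndarray:
    """Sample `n` points in the unit disc and build the cyclic disc-game payoff.

    The payoff of point x against point y is taken to be the signed area
    x1*y2 - x2*y1 = x^T R y with R a 90-degree rotation; this yields an
    antisymmetric (zero-sum) and fully cyclic game.
    """
    rng = np.random.default_rng(seed)
    points = []
    while len(points) < n:                    # rejection-sample the unit disc
        p = rng.uniform(-1.0, 1.0, size=2)
        if p @ p <= 1.0:
            points.append(p)
    X = np.stack(points)
    R = np.array([[0.0, 1.0], [-1.0, 0.0]])   # x^T R y = x1*y2 - x2*y1
    return X @ R @ X.T
```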

e.5 Elo game

We sample an Elo rating Elo (1978) per player , and then put , which is equivalent to using a scaled difference in strength squashed through a sigmoid function. It is easy to see that this game is monotonic, meaning that . We use  samples.

e.6 Noisy Elo games

For a given noise level  we first build an Elo game, and then take independent samples from  and add them to the corresponding entries of , creating . After that, we symmetrise the payoff by putting .
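A sketch covering both constructions; the exact rating scale and the mapping from win probability to payoff are our assumptions:

```python
import numpy as np

def elo_payoff(ratings: np.ndarray) -> np.ndarray:
    """Monotone Elo game: a sigmoid of the rating difference, mapped to [-1, 1]."""
    diff = ratings[:, None] - ratings[None, :]
    win_prob = 1.0 / (1.0 + 10.0 ** (-diff / 400.0))   # standard Elo prediction
    return 2.0 * win_prob - 1.0                         # antisymmetric payoff

def noisy_elo_payoff(ratings: np.ndarray, noise: float, seed: int = 0) -> np.ndarray:
    """Add i.i.d. Gaussian noise to every entry, then restore antisymmetry."""
    rng = np.random.default_rng(seed)
    noisy = elo_payoff(ratings) + rng.normal(scale=noise, size=(len(ratings),) * 2)
    return (noisy - noisy.T) / 2.0
```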

e.7 Random Game of Skill

We put where each of the random variables comes from . We use samples.

e.8 Blotto

Blotto is a two-player symmetric zero-sum game, where each player selects a way to place N units onto K fields. The outcome of the game is simply the number of fields where a player has more units than the opponent, minus the symmetric quantity. We choose N=10, K=5, which creates around 1000 pure strategies, but analogous results were obtained for the various other setups we tested. One could ask why Blotto becomes more non-transitive as strength increases. One simple answer is that the game is permutation invariant, which forces the optimal strategy to spread uniformly over all possible permutations, making the Nash support grow. Real world games, on the other hand, are almost always ordered and sequential in nature.
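A sketch of the payoff computation under the description above (enumeration order and names are illustrative):

```python
import numpy as np

def blotto_strategies(units: int = 10, fields: int = 5):
    """Enumerate all placements of `units` indistinguishable units onto `fields` fields."""
    def rec(remaining, fields_left):
        if fields_left == 1:
            yield (remaining,)
            return
        for k in range(remaining + 1):
            for rest in rec(remaining - k, fields_left - 1):
                yield (k,) + rest
    return list(rec(units, fields))

def blotto_payoff(units: int = 10, fields: int = 5) -> np.ndarray:
    """Payoff = fields won minus fields lost, for every pair of pure strategies."""
    strategies = blotto_strategies(units, fields)
    P = np.zeros((len(strategies), len(strategies)))
    for i, a in enumerate(strategies):
        for j, b in enumerate(strategies):
            won = sum(x > y for x, y in zip(a, b))
            lost = sum(x < y for x, y in zip(a, b))
            P[i, j] = won - lost
    return P

# N=10, K=5 gives C(14, 4) = 1001 placements, matching "around 1000 pure strategies".
print(len(blotto_strategies(10, 5)))  # -> 1001
```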

e.9 Kuhn Poker

Kuhn Poker (Kuhn, 1950) is a two-player, sequential-move, asymmetric game with 12 information states (6 per player). Each player starts the game with 2 chips, antes a single chip to play, and then receives a face-down card from a deck of 3 cards. At each information state, each player has the choice of two actions: betting or passing. We use the implementation of this game in the OpenSpiel library (Lanctot et al., 2019). To construct the empirical payoff matrices, we enumerate all possible policies of each player, noting that some of the enumerated policies of player 1 may yield identical outcomes depending on the policy of player 2, as certain information states may not be reachable by player 1 in such situations. Due to the randomness involved in the card deals, we compute the average payoffs using 100 simulations per pair of policy match-ups for players 1 and 2. This yields an asymmetric payoff matrix (due to the sequential-move nature of the game), which we then symmetrise to conduct our subsequent analysis.

e.10 Parity Game of Skill

Let us define a simple -step game (per player) that has the Game of Skill geometry. It is a two-player, fully observable, turn-based game that lasts at most  steps. The game state is a single bit with initial value 0. At each step, the player can choose to: 1) flip the bit (); 2) guess that the bit is equal to 0 (); 3) guess that the bit is equal to 1 (); 4) keep the bit as it is (). At the final (per-player) step the only legal actions are 2) and 3). If either of these two actions is taken, the game ends, and a player wins iff it guessed correctly. Since the game is fully observable, there is no real “guessing” here; agents know exactly what the state is, but we use this construction to be able to study the underlying geometry in the simplest way possible. First, we note that this game is -bit communicative, as at each turn agents can transmit  bits of information, the game lasts for  steps, and the last one cannot be used to transfer information. According to Theorem 1, this means that every antisymmetric payoff of size  can be realised.

Figure 8: Game profile of Parity Game of Skill with 3 steps. Note that its Nash clusters are of size 40, and number of cycles exceeds 140, despite being only -bit communicative.

Figure 8 shows that this game with  has hundreds of cycles and Nash clusters of size 40, strongly exceeding the lower bounds from Theorem 1. Since there are just 161 pure strategies, we do not have to rely on sampling, and we can clearly see the spinning top-like shape in the game profile.

Appendix F Other games that are not Games of Skill

Table 3 shows a few Noisy Elo Games, which cause Nashes to grow significantly along the transitive dimension. We also ran the analysis on Kuhn Poker, with 64 pure policies, which seems to exhibit a geometry analogous to the Blotto game. Finally, there is also the pure Rock Paper Scissors example, where everything degenerates to a single point.

Table 3: Top row, from left: Noisy Elo games with respectively. Middle row, from left: Blotto with equal ; , and respectively. Bottom row, from left: Kuhn-Poker and Rock Paper Scissors.

Appendix G Empirical Game Strategy Sampling

We use OpenSpiel Lanctot et al. (2019) implementations of AlphaBeta and MCTS players as the basis of our experiments. We extend the AlphaBeta player to MinMax(d, s), which runs the AlphaBeta algorithm up to depth , and if it does not succeed (the game is deeper than ) it executes a random action using seed  instead. We also define MaxMin(d, s), which acts in exactly the same way but uses the flipped payoff (so it seeks to lose). We also include MinMax’(d, s) and MaxMin’(d, s), which act in the same way as before, but if some branches of the game tree are longer than , then these branches are assumed to have value 0 (in other words, they use the value function that is constantly equal to 0). Finally, we define MCTS(k, s), which runs  simulations, with randomness controlled by seed . With these 3 types of players, we create a set of agents to evaluate of the form:

  • MinMax(d,s) for each combination of

  • MinMax’(d,s) for each combination of

  • MaxMin(d,s) for each combination of

  • MaxMin’(d,s) for each combination of

  • MCTS(k,s) for each combination of

This gives us 2000 pure strategies that span the transitive axis. The addition of MCTS is motivated by the fact that many of our games are too hard for AlphaBeta with depth 9 to yield strong policies. Also, MinMax(0, s) is equivalent to a completely random policy with seed , and thus acts as a sort of baseline for randomly initialised neural networks. Each of the players constructed this way encodes a pure strategy (since, thanks to seeding, they act in a deterministic way).
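A sketch of the MinMax(d, s) idea on top of OpenSpiel's state API; this simplified version omits the alpha-beta pruning and other details of the actual implementation:

```python
import random
import pyspiel

def minmax_action(state, depth: int, seed: int, maximize: bool = True):
    """Depth-limited minimax move with a seeded random fallback.

    Mirrors MinMax(d, s): if the subtree below `state` is deeper than `depth`,
    the search is treated as inconclusive and a seeded (hence deterministic)
    random action is played instead. Setting maximize=False gives MaxMin(d, s).
    """
    me = state.current_player()

    def value(s, d):
        if s.is_terminal():
            return s.returns()[me]
        if d == 0:
            return None                      # ran out of depth: inconclusive
        values = []
        for a in s.legal_actions():
            child = s.clone()
            child.apply_action(a)
            v = value(child, d - 1)
            if v is None:
                return None
            values.append(v)
        return max(values) if s.current_player() == me else min(values)

    best_action, best_value = None, None
    for a in state.legal_actions():
        child = state.clone()
        child.apply_action(a)
        v = value(child, depth - 1)
        if v is None:
            best_action = None               # whole search counts as failed
            break
        if best_value is None or (v > best_value if maximize else v < best_value):
            best_action, best_value = a, v
    if best_action is None:                  # fall back to a seeded random action
        best_action = random.Random(seed).choice(state.legal_actions())
    return best_action

# Usage sketch:
game = pyspiel.load_game("tic_tac_toe")
print(minmax_action(game.new_initial_state(), depth=3, seed=0))
```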

Appendix H Empirical Game Payoff computation

For each game and pair of corresponding pure strategies, we play 2 matches, swapping which player goes first. We report the payoff as the average of these two situations; thus we effectively symmetrise games that are not purely symmetric (due to their turn-based nature). After this step, we check whether there are any duplicate rows, meaning that two strategies have exactly the same payoff against every other strategy. We remove them from the game, treating this as a side effect of strategy sampling, which does not guarantee uniqueness (e.g. if the game has fewer than 2000 pure strategies, then naturally we need to sample some multiple times). Consequently each empirical game has a payoff not bigger than , and on average they are closer to .
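A sketch of the symmetrisation and de-duplication steps; the per-game match playing itself is omitted (names are illustrative):

```python
import numpy as np

def symmetrise_and_deduplicate(P_first: np.ndarray, P_second: np.ndarray) -> np.ndarray:
    """Average over who moves first, then drop strategies with duplicate payoff rows.

    P_first[i, j]  is i's outcome when i moves first against j;
    P_second[i, j] is i's outcome when i moves second against j.
    For a zero-sum game the averaged matrix is already antisymmetric.
    """
    P = (P_first + P_second) / 2.0
    # Keep the first occurrence of each distinct payoff row (rounded to absorb noise).
    _, keep = np.unique(np.round(P, 8), axis=0, return_index=True)
    keep = np.sort(keep)
    return P[np.ix_(keep, keep)]
```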

Appendix I Fitting spinning top profile

For each plot relating mean RPP to the size of Nash clusters, we construct a dataset

Next, we use the Skewed Normal pdf as a parametric model:

where  is the pdf of a standard Gaussian and  its cdf. We further compose this model with a simple affine transformation, since our targets are not normalised and are not guaranteed to equal 0 at infinity:

and find parameters minimising

In general, the probability of the data under the MLE skewed normal distribution model could be used as a measure of “game-of-skillness”, but its applications and analysis are left for future research.
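A sketch of such a fit with standard tools; the exact parametrisation of the skew-normal and the least-squares objective are our reading of the description above:

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

def spinning_top_profile(x, loc, scale, alpha, a, b):
    """Skew-normal pdf (location, scale, shape) composed with an affine map a * pdf + b."""
    z = (x - loc) / scale
    pdf = (2.0 / scale) * norm.pdf(z) * norm.cdf(alpha * z)
    return a * pdf + b

def fit_profile(transitive_strength, nash_cluster_sizes):
    """Least-squares fit of the profile to (mean RPP, Nash cluster size) pairs."""
    x = np.asarray(transitive_strength, dtype=float)
    y = np.asarray(nash_cluster_sizes, dtype=float)
    p0 = [x.mean(), x.std() + 1e-6, 0.0, max(y.max(), 1.0), y.min()]  # crude initialisation
    params, _ = curve_fit(spinning_top_profile, x, y, p0=p0, maxfev=20000)
    return params
```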

Appendix J Counting pure strategies

For a given 2-player turn-based game we can compute the number of behaviourally different pure strategies by traversing the game tree and again using a recursive argument. Using notation from previous sections, and  to denote the number of pure strategies for player , we put, for each state  such that :

where the second equation comes from the fact that two pure strategies are behaviourally different if there exists a state that both reach when facing some opponent, and they take a different action there. So to count pure strategies, we simply sum over all our actions, but need to take a product over the opponent actions that follow, as our strategy must be defined after each possible opponent move, and for each such move we multiply in the number of ways we can continue from there, completing the recursion. If we now ask our strategies to be able to play as both players (since roles in turn-based games are asymmetric), we simply report the product of the two counts, since each combination of behaviour as the first and the second player is a different pure strategy.

For Tic-Tac-Toe  and , so in total we have approximately  pure strategies that are behaviourally different. Note that behavioural difference does not imply a difference in terms of payoff; however, a difference in payoff implies behavioural difference. Consequently this is an upper bound on the size of the minimal payoff matrix describing Tic-Tac-Toe as a normal form game.
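A sketch of this recursion on top of OpenSpiel; it follows the tree recursion described above (transpositions are not merged, and games with chance nodes are not handled):

```python
import pyspiel

def count_pure_strategies(game_string: str, player: int) -> int:
    """Count behaviourally distinct pure strategies for `player` in a turn-based game."""
    game = pyspiel.load_game(game_string)

    def count(state) -> int:
        if state.is_terminal():
            return 1
        total_sum, total_product = 0, 1
        for a in state.legal_actions():
            child = state.clone()
            child.apply_action(a)
            c = count(child)
            total_sum += c        # our node: we commit to exactly one action
            total_product *= c    # opponent node: we must be defined after every action
        return total_sum if state.current_player() == player else total_product

    return count(game.new_initial_state())

# The resulting counts for Tic-Tac-Toe are astronomically large; their product
# bounds the size of the minimal normal-form payoff matrix.
```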

Appendix K Deterministic strategies and neural network based agents

Even though neural network based agents are technically often mixed strategies in the game theory sense (as they involve stochasticity coming either from Monte Carlo Tree Search, or at least from the use of a softmax-based parametrisation of the policy), in practice they were found to become almost purely deterministic as training progresses Mnih et al. (2016), so modelling them as pure strategies has empirical justification. However, the study and extension of the presented results to the mixed-strategy regime is an important future research direction.

References

  • Balduzzi et al. [2018] David Balduzzi, Karl Tuyls, Julien Perolat, and Thore Graepel. Re-evaluating evaluation. In Advances in Neural Information Processing Systems, pages 3268–3279, 2018.
  • Balduzzi et al. [2019] David Balduzzi, Marta Garnelo, Yoram Bachrach, Wojciech M Czarnecki, Julien Perolat, Max Jaderberg, and Thore Graepel. Open-ended learning in symmetric zero-sum games. ICML, 2019.
  • Brown and Sandholm [2019] Noam Brown and Tuomas Sandholm. Superhuman ai for multiplayer poker. Science, 365(6456):885–890, 2019.
  • Brügmann [1993] Bernd Brügmann. Monte carlo go. Technical report, Citeseer, 1993.
  • Campbell et al. [2002] Murray Campbell, A Joseph Hoane Jr, and Feng-hsiung Hsu. Deep blue. Artificial intelligence, 134(1-2):57–83, 2002.
  • David and Jon [2010] Easley David and Kleinberg Jon. Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press, USA, 2010. ISBN 0521195330.
  • Deterding [2015] Sebastian Deterding. The lens of intrinsic skill atoms: A method for gameful design. Human–Computer Interaction, 30(3-4):294–335, 2015.
  • Elo [1978] Arpad E Elo. The rating of chessplayers, past and present. Arco Pub., 1978.
  • Eysenbach et al. [2018] Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function. arXiv preprint arXiv:1802.06070, 2018.
  • Gibbons [1992] Robert S Gibbons. Game theory for applied economists. Princeton University Press, 1992.
  • Harsanyi et al. [1988] John C Harsanyi, Reinhard Selten, et al. A general theory of equilibrium selection in games. MIT Press Books, 1, 1988.
  • Jackson [2008] Matthew O. Jackson. Social and Economic Networks. Princeton University Press, USA, 2008. ISBN 0691134405.
  • Jaderberg et al. [2017] Max Jaderberg, Valentin Dalibard, Simon Osindero, Wojciech M Czarnecki, Jeff Donahue, Ali Razavi, Oriol Vinyals, Tim Green, Iain Dunning, Karen Simonyan, et al. Population based training of neural networks. arXiv preprint arXiv:1711.09846, 2017.
  • Jaderberg et al. [2019] Max Jaderberg, Wojciech M Czarnecki, Iain Dunning, Luke Marris, Guy Lever, Antonio Garcia Castaneda, Charles Beattie, Neil C Rabinowitz, Ari S Morcos, Avraham Ruderman, et al. Human-level performance in 3d multiplayer games with population-based reinforcement learning. Science, 364(6443):859–865, 2019.
  • Kuhn [1950] Harold W Kuhn. A simplified two-person poker. Contributions to the Theory of Games, 1:97–103, 1950.
  • Lanctot et al. [2017] Marc Lanctot, Vinicius Zambaldi, Audrunas Gruslys, Angeliki Lazaridou, Karl Tuyls, Julien Pérolat, David Silver, and Thore Graepel. A unified game-theoretic approach to multiagent reinforcement learning. In Advances in Neural Information Processing Systems, pages 4190–4203, 2017.
  • Lanctot et al. [2019] Marc Lanctot, Edward Lockhart, Jean-Baptiste Lespiau, Vinicius Zambaldi, Satyaki Upadhyay, Julien Pérolat, Sriram Srinivasan, Finbarr Timbers, Karl Tuyls, Shayegan Omidshafiei, Daniel Hennes, Dustin Morrill, Paul Muller, Timo Ewalds, Ryan Faulkner, János Kramár, Bart De Vylder, Brennan Saeta, James Bradbury, David Ding, Sebastian Borgeaud, Matthew Lai, Julian Schrittwieser, Thomas Anthony, Edward Hughes, Ivo Danihelka, and Jonah Ryan-Davis. OpenSpiel: A framework for reinforcement learning in games. CoRR, abs/1908.09453, 2019. URL http://arxiv.org/abs/1908.09453.
  • Lazzaro [2009] Nicole Lazzaro. Why we play: affect and the fun of games. Human-computer interaction: Designing for diverse users and domains, 155:679–700, 2009.
  • Le et al. [2017] Hoang M Le, Yisong Yue, Peter Carr, and Patrick Lucey. Coordinated multi-agent imitation learning. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1995–2003. JMLR.org, 2017.
  • Mnih et al. [2016] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages 1928–1937, 2016.
  • Morrow [1994] James D Morrow. Game theory for political scientists. Princeton University Press, 1994.
  • Myerson [2013] Roger B Myerson. Game theory. Harvard university press, 2013.
  • Newell and Simon [1976] Allen Newell and Herbert A Simon. Computer science as empirical inquiry: Symbols and search. In ACM Turing award lectures. 1976.
  • OpenAI et al. [2019] OpenAI, Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Dębiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, Rafal Józefowicz, Scott Gray, Catherine Olsson, Jakub Pachocki, Michael Petrov, Henrique Pondé de Oliveira Pinto, Jonathan Raiman, Tim Salimans, Jeremy Schlatter, Jonas Schneider, Szymon Sidor, Ilya Sutskever, Jie Tang, Filip Wolski, and Susan Zhang. Dota 2 with large scale deep reinforcement learning. arXiv, 2019. URL https://arxiv.org/abs/1912.06680.
  • Ortiz et al. [2007] Luis E Ortiz, Robert E Schapire, and Sham M Kakade. Maximum entropy correlated equilibria. In Artificial Intelligence and Statistics, pages 347–354, 2007.
  • Pathak et al. [2017] Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 16–17, 2017.
  • Phelps et al. [2004] Steve Phelps, Simon Parsons, and Peter McBurney. An evolutionary game-theoretic comparison of two double-auction market designs. In Agent-Mediated Electronic Commerce VI, Theories for and Engineering of Distributed Mechanisms and Systems, AAMAS 2004 Workshop, AMEC 2004, New York, NY, USA, July 19, 2004, Revised Selected Papers, pages 101–114, 2004.
  • Phelps et al. [2007] Steve Phelps, Kai Cai, Peter McBurney, Jinzhong Niu, Simon Parsons, and Elizabeth Sklar. Auctions, evolution, and multi-agent learning. In Adaptive Agents and Multi-Agent Systems III. Adaptation and Multi-Agent Learning, 5th, 6th, and 7th European Symposium, ALAMAS 2005-2007 on Adaptive and Learning Agents and Multi-Agent Systems, Revised Selected Papers, pages 188–210, 2007.
  • Shannon [1950] Claude E Shannon. Xxii. programming a computer for playing chess. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 41(314):256–275, 1950.
  • Sigmund [1993] Karl Sigmund. Games of Life: Explorations in Ecology, Evolution and Behaviour. Oxford University Press, Inc., USA, 1993. ISBN 0198546653.
  • Silver et al. [2016] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484, 2016.
  • Smith [1982] John Maynard Smith. Evolution and the Theory of Games. Cambridge University Press, 1982. doi: 10.1017/CBO9780511806292.
  • Tesauro [1995] Gerald Tesauro. Temporal difference learning and td-gammon. Communications of the ACM, 38(3):58–68, 1995.
  • Tuyls and Parsons [2007] Karl Tuyls and Simon Parsons. What evolutionary game theory tells us about multiagent learning. Artif. Intell., 171(7):406–416, 2007.
  • Vinyals et al. [2019] Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.
  • Walsh et al. [2002] William E Walsh, Rajarshi Das, Gerald Tesauro, and Jeffrey O Kephart. Analyzing complex strategic interactions in multi-agent systems. In AAAI-02 Workshop on Game-Theoretic and Decision-Theoretic Agents, pages 109–118, 2002.
  • Walsh et al. [2003] William E Walsh, David C Parkes, and Rajarshi Das. Choosing samples to compute heuristic-strategy nash equilibrium. In International Workshop on Agent-Mediated Electronic Commerce, pages 109–123. Springer, 2003.
  • Wang and Sun [2011] Hao Wang and Chuen-Tsai Sun. Game reward systems: Gaming experiences and social meanings. In DiGRA conference, volume 114, 2011.
  • Wellman [2006] Michael P Wellman. Methods for empirical game-theoretic analysis. In AAAI, pages 1552–1556, 2006.