1 OpenSpiel Overview
1.1 Disclaimer and Notes on Citations
This is a living document. The team intends to periodically update this document to reflect the current state of the code. At any given time, this document might be (slightly) out-of-date; please refer to the OpenSpiel GitHub page for reference. As a result, we request that authors not cite specific sections, tables, or figures of this document as they may change over time.
1.2 Acknowledgments
OpenSpiel has been possible due to a team of contributors. For a full list of all the contributors, please see the list of authors on GitHub.
We would also like to thank the following people, who helped and supported the development of OpenSpiel:

Remi Munos

Michael Bowling

Thore Graepel

Shibl Mourad

Nathalie Beauguerlange

Ellen Clancy

Louise Deason

Andreas Fidjeland

Martin Schmid

Neil Burch

Damien Boudot

Adam Cain
1.3 OpenSpiel At a Glance
We provide an intentionally brief overview here. For details, please see Section 3.
OpenSpiel provides a framework for writing games and algorithms and evaluating them on a variety of benchmark games. OpenSpiel contains implementations of over 20 different games of various sorts (perfect information, simultaneous move, imperfect information, gridworld games, an auction game, and several normal-form / matrix games). Game implementations are in C++ and wrapped in Python. Algorithms are implemented in C++ and/or Python. The API is almost identical in the two languages, so code can easily be translated if needed. A subset of the library has also been ported to Swift. Most of the learning algorithms written in Python use TensorFlow [1], though we are actively seeking examples and other support for PyTorch [53] and JAX (https://github.com/google/jax). OpenSpiel has been tested on Linux. We have not tested on macOS nor Windows, but since the code uses freely available tools, we do not anticipate any (major) problems compiling and running under other major platforms. Patches and instructions would be appreciated.
Components of OpenSpiel are listed in Tables 1 and 2. There are two levels of status: one indicates a thoroughly-tested implementation, in many cases verified against known values and/or reproducing results from papers and used for papers; the other indicates an implementation that is lightly tested.
Game  Reference(s)

Backgammon  Wikipedia
Breakthrough  Wikipedia
Bridge bidding  Wikipedia
Coin Game  [56]
Connect Four  Wikipedia
Cooperative Box-Pushing  [62]
Chess  Wikipedia
First-price Sealed-bid Auction  Wikipedia
Go  Wikipedia
Goofspiel  Wikipedia
Havannah  Wikipedia
Hex  Wikipedia
Kuhn poker  Wikipedia, [33]
Leduc poker  [65]
Liar's Dice  Wikipedia
Markov Soccer  [37, 24]
Matching Pennies (three-player)  [28]
Matrix Games  [63]
Oshi-Zumo  [17, 8, 54]
Oware  Wikipedia
Pentago  Wikipedia
Phantom Tic-Tac-Toe  [35]
Pig  [48]
Tic-Tac-Toe  Wikipedia
Tiny Bridge
Y  Wikipedia
Catch (Python-only)  [43] and [51, Appendix A]
CliffWalking (Python-only)  [67, Chapter 6]
Algorithm  Category  Reference(s)

Minimax (and Alpha-Beta) Search  Search  Wikipedia, Wikipedia, [29]
Monte Carlo tree search  Search  Wikipedia, [30, 18, 16]
Sequence-form linear programming  Opt.  [31, 63]
Counterfactual Regret Minimization (CFR)  Tabular  [78, 47]
(Tabular) Exploitability  Tabular  [78]
External sampling Monte Carlo CFR  Tabular  [34, 35]
Outcome sampling Monte Carlo CFR  Tabular  [34, 35]
Q-learning  Tabular  [67]
Value Iteration  Tabular  [67]
Advantage Actor-Critic (A2C)  RL  [42]
Deep Q-networks (DQN)  RL  [44]
Ephemeral Value Adjustments (EVA)  RL  [22]
Deep CFR  MARL  [14]
Exploitability Descent (ED)  MARL  [38]
(Extensive-form) Fictitious Play (XFP)  MARL  [25]
Neural Fictitious Self-Play (NFSP)  MARL  [26]
Neural Replicator Dynamics (NeuRD)  MARL  [49]
Regret Policy Gradients (RPG, RMPG)  MARL  [66]
Policy-Space Response Oracles (PSRO)  MARL  [36]
Q-based "all-actions" Policy Gradients (QPG)  MARL  [68, 55, 66]
Regression CFR (RCFR)  MARL  [73, 46]
Rectified Nash Response (PSRO_rN)  MARL  [3]
α-Rank  Eval / Viz  [50]
Replicator / Evolutionary Dynamics  Eval / Viz  [27, 61]
2 Getting Started
2.1 Getting and Building OpenSpiel
The following commands will clone the repository and build OpenSpiel on Debian or Ubuntu Linux. This is the fastest way to install OpenSpiel; however, there is at least one known problem: on Debian 10 and Ubuntu 19.04, the following error message is shown when running the pip3 command: "tensorboard 1.14.0 has requirement setuptools>=41.0.0, but you'll have setuptools 40.8.0 which is incompatible". The code builds and the tests pass, but this may affect the usability of tensorboard. Please see the recommended installation instructions using virtualenv for more detail.
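The command sequence is roughly as follows (a sketch based on the repository's scripts at the time of writing; script names and steps may change, so consult the README for the current commands):

```shell
# Clone the repository, install system and Python dependencies via the
# bundled installer, then build and run the tests.
git clone https://github.com/deepmind/open_spiel.git
cd open_spiel
./install.sh
pip3 install --upgrade -r requirements.txt
./open_spiel/scripts/build_and_run_tests.sh
```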
Note that at this time, we have not tested OpenSpiel on any platform other than Linux. Also, some of the scripts and instructions currently assume Debian-based distributions (i.e., Debian, Ubuntu, etc.). All of the dependencies exist on other distributions, but they may have different names, and package managers differ. Please see install.sh for the necessary dependencies.
2.1.1 Setting PYTHONPATH
To be able to import the Python code (both the C++ binding pyspiel and the rest) from any location, you will need to add to your PYTHONPATH the root directory and the open_spiel directory. Add the following in your .bashrc or .profile:
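For example, with <path_to_open_spiel> standing in for the location of your clone (the exact directories are version-dependent; treat these lines as a sketch):

```shell
# Make pyspiel and the open_spiel Python packages importable anywhere.
export PYTHONPATH=$PYTHONPATH:/<path_to_open_spiel>
export PYTHONPATH=$PYTHONPATH:/<path_to_open_spiel>/build/python
```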
2.2 Running the First Example
After having built OpenSpiel following Sec 2.1, run the example from the build directory without any arguments:
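Assuming the example binary is built under examples/ in the build directory (the binary name is an assumption inferred from the source layout):

```shell
# From the build directory: list registered games and print usage.
./examples/example
```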
This prints out a list of registered games and the usage. Now, let’s play a game of TicTacToe with uniform random players:
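Assuming the short game name tic_tac_toe (inferred from the C++ source file names):

```shell
# From the build directory: tic-tac-toe with uniform random players.
./examples/example --game=tic_tac_toe
```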
Wow – how exhilarating! Now, why not try one of your favorite games?
Note that the structure in the build directory mirrors that of the source, so the example is found in open_spiel/examples/example.cc. At this stage you can run one of many binaries created, such as games/backgammon_test or algorithms/external_sampling_mccfr_test.
Once you have set your PYTHONPATH as explained in Sec 2.1.1, you can similarly run the python examples:
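For instance (the script path and flag are assumptions based on the python/examples/ layout):

```shell
# From the repository root, with PYTHONPATH set as in Sec 2.1.1:
python3 open_spiel/python/examples/example.py --game=breakthrough
```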
Nice!
2.3 Adding a New Game
We describe here only the simplest and fastest way to add a new game. It is ideal to first be aware of the general API, which is described at a high level in Section 3, on GitHub, and via comments in spiel.h.

Choose a game to copy from in games/. Suggested games: Tic-Tac-Toe and Breakthrough for perfect information games without chance events, Backgammon or Pig for perfect information games with chance events, Goofspiel and Oshi-Zumo for simultaneous move games, and Leduc poker and Liar's Dice for imperfect information games. For the rest of these steps, we assume Tic-Tac-Toe.

Copy the header and source: tic_tac_toe.h, tic_tac_toe.cc, and tic_tac_toe_test.cc to new_game.h, new_game.cc, and new_game_test.cc.

Add the new game’s source files to games/CMakeLists.txt.

Add the new game's test target to games/CMakeLists.txt.

In new_game.h, rename the header guard at the top and bottom of the file.

In the new files, rename the innermost namespace from tic_tac_toe to new_game.

In the new files, rename TicTacToeGame and TicTacToeState to NewGameGame and NewGameState.

At the top of new_game.cc, change the short name to new_game and include the new game’s header.

Add the short name to the list of excluded games in integration_tests/api_test.py.

Add the short name to the list of expected games in python/tests/pyspiel_test.py.

You should now have a duplicate game of Tic-Tac-Toe under a different name. It should build and its test should run, which can be verified by rebuilding and running the example from Section 2.2.

Now, change the implementations of the functions in NewGameGame and NewGameState to reflect your new game’s logic. Most API functions should be clear from the game you copied from. If not, each API function that is overridden will be fully documented in superclasses in spiel.h. See also the description of extensiveform games in Section 3.1 which closely matches the API.

Once done, rebuild and rerun the tests from Sec 2.1 to ensure everything passes (including your new game's test!).
2.4 Adding a New Algorithm
Adding a new algorithm is fairly straightforward. Like adding a game, it is easiest to copy and start from one of the existing algorithms. If adding a C++ algorithm, choose one from algorithms/. If adding a Python algorithm, choose one from python/algorithms/. For appropriate matches, see Table 2.
Unlike games, there is no specific structure or API that must be followed for an algorithm. If the algorithm is one in a class of existing algorithms, then we advise keeping the style and design similar to the ones in the same class, reusing functions or modules where possible.
The algorithms themselves are not binaries, but classes or functions that can be used externally. The best way to show an example of an algorithm’s use is via a test. However, there are also binary executables in examples/ and python/examples/.
3 Design and API
The purpose of OpenSpiel is to promote general multiagent reinforcement learning across many different game types, in a similar way as general game-playing [21] but with a heavy emphasis on learning and not in competition form. We hope that OpenSpiel could have a similar effect on general RL in games as the Arcade Learning Environment [6, 39] has had on single-agent RL.
OpenSpiel provides a general API with a C++ foundation, which is exposed through Python bindings (via pybind11). Games are written in C++. This allows for fast or memory-efficient implementations of basic algorithms that might need the efficiency. Some custom RL environments are also implemented in Python. Most algorithms that require machine learning are implemented in Python.
Above all, OpenSpiel is designed to be easy to install and use, easy to understand, easy to extend ("hackable"), and general/broad. OpenSpiel is built around two main design criteria:

Keep it simple. Simple choices are preferred to more complex ones. The code should be readable, usable, and extendable by non-experts in the programming language(s), especially by researchers from potentially different fields. OpenSpiel provides reference implementations to learn from and prototype with, rather than fully-optimized / high-performance code that would require additional assumptions (narrowing the scope / breadth) or advanced (or lower-level) language features.

Keep it light. Dependencies can be problematic for long-term compatibility, maintenance, and ease-of-use. Unless there is strong justification, we tend to avoid introducing dependencies to keep things portable and easy to install.
3.1 ExtensiveForm Games
There are several formalisms and corresponding research communities for representing multiagent interactions. It is beyond the scope of this paper to survey the various formalisms, so we describe the ones most relevant to our implementations. There have been recent efforts to harmonize the terminology and make useful associations among algorithms between computational game theory and reinforcement learning [66, 38, 32], so we base our terminology on classical concepts and these recent papers.
Games in OpenSpiel are represented as procedural extensive-form games [52, 63], though in some cases they can also be cyclic, such as in Markov Decision Processes [67] and Markov games [37]. We first give the classical definitions, then describe some extensions, and explain some equivalent notions between the fields of reinforcement learning and games. An extensive-form game is a tuple (N, A, H, Z, u, τ, S), where

N = {1, 2, …, n} is a finite set of players (the player IDs range from 0 to n - 1 in the implementations). There is also a special player c, called chance.

A is a finite set of actions that players can take. This is a global set of state-independent actions; generally, only a subset of legal actions is available when agents decide.

H is a finite set of histories. Each history h ∈ H is a sequence of actions that were taken from the start of the game.

Z ⊆ H is a subset of terminal histories, each of which represents a completely played game.

u : Z → U^n ⊆ ℝ^n, where U = [u_min, u_max], is the utility function assigning each player a utility at terminal histories, and u_min and u_max are constants representing the minimum and maximum utility.

τ : H → N ∪ {c} is a player identity function; τ(h) identifies which player acts at h.

S is a set of states. In general, S is a partition of H such that each state s ∈ S contains histories that cannot be distinguished by the player to act, τ(s). Decisions are made by players at these states. There are several ways to precisely define S, as described below.
We denote the legal actions available at state s as A(s) ⊆ A. Importantly, a history h represents the true ground/world state: when agents act, they change this history, but depending on how the partition S is chosen, some actions (including chance's) may be private and not revealed to some players.
We will extend this formalism further on to more easily describe how games are represented in OpenSpiel. However, we can already state some important categories of games:

A constant-sum (k-sum) game is one where Σ_{i ∈ N} u_i(z) = k for all z ∈ Z.

A zero-sum game is a constant-sum game with k = 0.

An identical interest game is one where u_i(z) = u_j(z) for all players i, j ∈ N and all z ∈ Z.

A general-sum game is one without any constraint on the sum of the utilities.
In other words: constant-sum games are strictly competitive, identical interest games are strictly cooperative, and general-sum games are neither or somewhere in between. Also,

A perfect information game is one where there is only one history per state: |s| = 1 for all s ∈ S.

An imperfect information game is one where there is generally more than one history per state: |s| > 1 for some s ∈ S.
Chess, Go, and Breakthrough are examples of perfect information games without chance events (no chance player). Backgammon and Pig are examples of perfect information games with chance events. Leduc poker, Kuhn poker, Liar's Dice, and Phantom Tic-Tac-Toe are examples of imperfect information games. Every one of these example games is zero-sum.
Definition 1.
A chance node (or chance event) is a history h such that τ(h) = c.
In zero-sum perfect information games, minimax and alpha-beta search are classical search algorithms for making decisions using heuristic value functions [29]. The analogs for perfect information games with chance events are expectiminimax [41] and *-minimax [5].
3.1.1 Extension: Simultaneous-Move Games
We can augment the extensive-form game with a special kind of player, the simultaneous move player. When τ(h) is the simultaneous move player, each player i ∈ N has a set of legal actions A_i(s), and all players act simultaneously, choosing a joint action a = (a_1, a_2, …, a_n) ∈ A_1(s) × ⋯ × A_n(s). Histories in these games are then sequences of joint actions, and transitions take the form h' = ha. The rest of the properties of extensive-form games still hold.
Definition 2.
A normal-form (or one-shot) game is a simultaneous-move game with a single state: |S| = 1. A matrix game is a normal-form game with n = 2 players.
Fact 1.
A simultaneousmove game can be represented as a specific type of extensiveform game with imperfect information.
To see why this is true, consider the game of Rock, Paper, Scissors, where each player chooses a single action and the choices are revealed simultaneously. An equivalent turn-based game is the following: the first player writes their action on a piece of paper and places it face down; then the second player does the same; then the choices are revealed simultaneously. The players acted at separate times, but the second player did not know the choice made by the first player (and hence could be in one of three histories), and the game has two states instead of one. In a game with many states, the same idea can simply be repeated for every state.
Why, then, represent these games differently? There are several reasons:

They have historically been treated as separate in the multiagent RL literature.

They can sometimes be solved using Bellmanstyle dynamic programming, unlike general imperfect information games.

They are slightly more general. In fact, one can represent a turn-based game using a simultaneous-move game, simply by setting A_i(s) to a singleton for each player i who is not acting at s, or by adding a special pass move as the only legal action when it is not a player's turn.
We elaborate on each of these points in the following section, when we relate simultaneousmove games to existing multiagent RL formalisms.
3.1.2 Policies, Objectives, and Multiagent Reinforcement Learning
We now add the last necessary ingredients for designing decisionmaking and learning algorithms, and bring in the remaining standard RL terms.
Definition 3.
A policy π : S → Δ(A), where Δ(A) represents the set of probability distributions over actions (with support restricted to the legal actions A(s)), describes agent behavior. An agent acts by selecting actions from its policy: a ∼ π(s). A deterministic policy is one where, at each state, the distribution over actions assigns probability 1 to one action and zero to the others. A policy that is not (necessarily) deterministic is called stochastic. In games, the chance player is special because it always plays with a fixed (stochastic) policy π_c.
Definition 4.
A transition function T : S × A → Δ(S) defines a probability distribution over successor states s' when choosing action a from state s.
Fact 2.
A transition function can be equivalently represented using intermediate chance nodes between the histories of the predecessor and successor states, s and s'. The transition function is then determined by the chance player's policy π_c at these nodes.
Definition 5.
A player, or agent, has perfect recall if the state does not lose the information about the past decisions made by the player. Formally, all histories h, h' ∈ s contain the same sequence of actions of the current player: let h_i denote the subsequence of h containing only player i's state-action pairs. Player i = τ(s) has perfect recall if h_i = h'_i for all h, h' ∈ s.
In poker, a player acts from an information state, and the histories corresponding to such an information state differ only in the chance event outcomes that correspond to the opponents' private cards. In these partially-observable games, a state s is normally called an information state to emphasize that the agent's perception of the state (s) is different from the true underlying world state (one of the histories h ∈ s).
The property of perfect recall turns out to be a very important criterion for determining convergence guarantees for exact tabular algorithms, as we show in Section 3.2.
Definition 6.
An observation is a partial view of the information state, containing strictly less information than the information state itself. To be valid, the sequence of observations and actions of all players must contain at least as much information as the information state. Formally: let Ω be a finite set of observations, let O_i : H → Ω be an observation function for player i, and denote o_i = O_i(h) as player i's observation at h. Since a state s contains histories h, we write O_i(s) = o if O_i(h) = o for all h ∈ s. A valid observation function is one where the mapping from a history to the sequence of all players' observations and actions along it defines a partition of the history space that is a sub-partition of S.
In a multiplayer game, we define a per-step reward to player i for a transition (s, a, s') as r_i(s, a, s'), with r(s, a, s') representing the vector of rewards to all players. In most OpenSpiel games, these rewards are 0 until s' is terminal, ending the episode; these values are obtained by the State::Rewards and State::PlayerReward functions called on s'. Player interaction over an episode generates a trajectory ρ = (s_0, a_0, s_1, a_1, …, s_T) whose length is T. We define a return to player i as G_i = Σ_t r_{i,t}, with G representing the vector of returns to all players, as with per-step rewards. In OpenSpiel, the State::Returns function provides G and State::PlayerReturn provides G_i. Note that we do not use a discount factor when defining rewards here because most games are episodic; learning agents are free to discount rewards however they like, if necessary. Note also that the standard (undiscounted) return is a random variable.
Each agent's objective is to maximize its own return G_i, or an expected return E[G_i]. However, note that the sampled trajectory depends not just on player i's policy but on every other player's policy! So, an agent cannot maximize its return in isolation: it must consider the other agents as part of its optimization problem. This is fundamentally different from traditional (single-agent) reinforcement learning, and it is the main challenge of multiagent RL.
3.2 Algorithms and Results
Here, we give an overview of the algorithms implemented within OpenSpiel.
3.2.1 Basic Algorithms
Suppose the players are playing with a joint policy π. The expected returns algorithm computes E[G_i] for all players exactly, by doing a tree traversal over the game and querying the policy π(s) at each state s. Similarly, for small enough games, one can get all of the states (S) in a game by doing a tree traversal and indexing each state by its information state string description.
The trajectories algorithm runs a batch of episodes by following a joint policy π, collecting various data such as the states visited, state policies, actions sampled, returns, and episode lengths, which could form the basis of the data collection for various RL algorithms.
There is a simple implementation of value iteration. In single-agent games, it is identical to the standard algorithm [67]. In two-player turn-taking zero-sum games, the value for a state s, V(s), is stored from the view of the player to play at s, τ(s); the zero-sum property lets the opponent's value at s be recovered as -V(s), so backups take a negamax form, maximizing over -V at successor states.
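As a toy illustration of this storage convention, here is a negamax-style value iteration on a hypothetical Nim-like game (the game and code are illustrative assumptions, not OpenSpiel's value iteration API):

```python
# Negamax value iteration on a toy turn-taking zero-sum game: a pile of
# n stones, players alternately remove 1 or 2, and taking the last stone
# wins. V[n] is the value from the view of the player to move at n,
# mirroring how values are stored in view of tau(s).

def value_iteration(num_stones):
    # V[0] = -1: the player to move at an empty pile has already lost.
    V = {n: -1.0 if n == 0 else 0.0 for n in range(num_stones + 1)}
    while True:
        delta = 0.0
        for n in range(1, num_stones + 1):
            # The successor is viewed by the opponent, hence the negation.
            best = max(-V[n - a] for a in (1, 2) if a <= n)
            delta = max(delta, abs(best - V[n]))
            V[n] = best
        if delta == 0.0:  # converged (the toy game is acyclic)
            return V
```

Under these rules the player to move loses exactly when the pile size is a multiple of three.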
3.2.2 Search Algorithms
There are two classical search algorithms for zerosum turntaking games of perfect information: minimax (and alphabeta) search [29, 59], and Monte Carlo tree search (MCTS) [18, 30, 16].
Suppose one wants to choose an action at some root state s. Given a heuristic value function v_i(s) (representing the value of state s to player i) and some depth d, minimax search computes a policy that assigns probability 1 to an action maximizing the following depth-limited adversarial multi-step value backup:

V(s, d) = v_i(s) if d = 0 or s is terminal,
V(s, d) = max_{a ∈ A(s)} V(sa, d - 1) if τ(s) = i,
V(s, d) = min_{a ∈ A(s)} V(sa, d - 1) otherwise,

where sa here denotes the (deterministic) successor state reached by taking action a in state s.
The Python implementation of minimax includes expectiminimax [41] as well, which also backs up expected values at chance nodes. Alpha-beta-style cut-offs could also be applied in the presence of chance nodes using *-minimax [5], but it is not currently implemented.
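A minimal sketch of the depth-limited minimax backup above, using a hypothetical toy game interface (NimGame and its methods are assumptions for illustration, not OpenSpiel's API):

```python
# Depth-limited minimax: values are from the maximizing player's ("me")
# point of view; the opponent minimizes.

class NimGame:
    # Toy zero-sum turn-taking game: remove 1 or 2 stones; taking the
    # last stone wins. A state is (stones_left, player_to_move).
    def to_play(self, state):
        return state[1]
    def legal_actions(self, state):
        return [a for a in (1, 2) if a <= state[0]]
    def apply(self, state, action):
        return (state[0] - action, 1 - state[1])
    def is_terminal(self, state):
        return state[0] == 0
    def value(self, state, me):
        # At a terminal, the player who just moved took the last stone.
        winner = 1 - state[1]
        return 1.0 if winner == me else -1.0

def minimax(game, state, depth, me):
    if game.is_terminal(state):
        return game.value(state, me)
    if depth == 0:
        return 0.0  # heuristic value at the depth limit (neutral here)
    values = [minimax(game, game.apply(state, a), depth - 1, me)
              for a in game.legal_actions(state)]
    return max(values) if game.to_play(state) == me else min(values)
```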
The implementations of MCTS are vanilla UCT with random playouts. Chance nodes are supported and represented explicitly in the tree: at chance nodes, the tree policy is always to sample according to the chance node's probability distribution.
3.2.3 Optimization Algorithms
OpenSpiel includes some basic optimization algorithms applied to games, such as solving zero-sum matrix games ([63, Section 4], [37]) and sequence-form linear programming for two-player zero-sum extensive-form games ([31] and [63, Section 5]), as well as an algorithm to check whether an action is dominated by a mixture of other strategies in a normal-form game [63, Sec 4.5.2].
3.2.4 Traditional SingleAgent RL Algorithms
We currently have three algorithms usable for traditional (singleagent) RL: Deep QNetworks (DQN) [44], Advantage ActorCritic (A2C) [42], and Ephemeral Value Adjustments (EVA) [22]. Each algorithm will operate as the standard one in singleagent environments.
Each of these algorithms can also be run in the multiagent setting, in various ways. The default is that each player is independently running a copy of the algorithm with states and observations that include what other players did. The other way to use these algorithms is to compute an approximate best response to a fixed set of other players’ policies, described in Section 3.2.5.
The main difference between the implementations of these algorithms and other standard ones is that they are aware that only a subset of actions is legal at each state. So, for example, in Q-learning the value update for a transition (s, a, r, s') and the policy update are:

Q(s, a) ← Q(s, a) + α [ r + γ max_{a' ∈ A(s')} Q(s', a') - Q(s, a) ]   (1)

π(s) = argmax_{a ∈ A(s)} Q(s, a)   (2)

Note that the actions are restricted to the sets of legal actions A(s) and A(s'), rather than assuming that every action is legal at every state. For policy gradient methods, a masked softmax is used to set the logits of the illegal actions to -∞, forcing the policy to assign probability zero to illegal actions.
3.2.5 Partially-Observable (Imperfect Information) Games
There are many algorithms for reinforcement learning in partiallyobservable (zerosum) games, as this is the focus of the core team’s research interests.
Best Response and NashConv
Suppose π is a joint policy. A best response policy for player i is a policy that maximizes player i's return against the other players' policies, π_{-i}. There may be many best responses; we denote the set of such best responses BR(π_{-i}), with b_i(π_{-i}) ∈ BR(π_{-i}) a particular one.
Let δ_i(π) = u_i(b_i(π_{-i}), π_{-i}) - u_i(π) be the incentive for player i to deviate to one of its best responses. An approximate Nash equilibrium is a joint policy such that δ_i(π) ≤ ε for all i ∈ N, where a Nash equilibrium is obtained at ε = 0.
A common metric for determining the rates of convergence (to equilibria) of algorithms in practice is:

NashConv(π) = Σ_{i ∈ N} δ_i(π).

In two-player constant-sum (i.e., k-sum) games, a similar metric has been used:

Exploitability(π) = (Σ_i u_i(b_i(π_{-i}), π_{-i}) - k) / 2,

where k = u_1(z) + u_2(z) for all z ∈ Z. Nash equilibria are often considered optimal in two-player zero-sum games, because they guarantee maximal worst-case returns against any other opponent policy. This is also true for approximate equilibria, so convergence to equilibria has been a focus in this class of games.
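For intuition, NashConv can be computed directly in a small two-player zero-sum matrix game; the following stand-alone sketch (a hypothetical helper, not OpenSpiel's exploitability module) evaluates best-response incentives against a joint policy:

```python
# NashConv for a two-player zero-sum matrix game. payoff[a][b] is
# player 1's utility; player 2's utility is its negation.

def nash_conv(payoff, pi1, pi2):
    n = len(pi1)
    # Expected value to player 1 under the joint policy.
    u1 = sum(pi1[a] * pi2[b] * payoff[a][b]
             for a in range(n) for b in range(n))
    # Best-response values: deviating to the best pure action.
    br1 = max(sum(pi2[b] * payoff[a][b] for b in range(n)) for a in range(n))
    br2 = max(sum(pi1[a] * -payoff[a][b] for a in range(n)) for b in range(n))
    # Sum of deviation incentives delta_i over both players.
    return (br1 - u1) + (br2 - (-u1))
```

For Rock, Paper, Scissors, the uniform joint policy has NashConv 0, while a pure "rock" policy against a uniform opponent has NashConv 1.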
Fictitious Play and Best ResponseBased Iterative Algorithms
Fictitious play (FP) is a classic iterative procedure for computing policies in (normal-form) games [12, 57]. Start with a uniform random policy at time t = 0. Then, for t = 1, 2, …, do:

Each player computes a best response to the opponents' average policy: π_i^t = b_i(π̄_{-i}^{t-1}).

Each player updates their average policy: π̄_i^t = ((t - 1) π̄_i^{t-1} + π_i^t) / t.
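The two steps above can be sketched for a two-player zero-sum matrix game as follows (a stand-alone toy; OpenSpiel's XFP instead operates on extensive-form games):

```python
# Fictitious play via best-response counts: each player's average policy
# is the empirical frequency of its past best responses. payoff[a][b] is
# player 1's utility; player 2's is the negation.

def fictitious_play(payoff, iterations):
    n = len(payoff)
    counts1, counts2 = [1.0] * n, [1.0] * n  # uniform initial policy
    for _ in range(iterations):
        avg1 = [c / sum(counts1) for c in counts1]
        avg2 = [c / sum(counts2) for c in counts2]
        # Pure best responses to the opponent's average policy.
        br1 = max(range(n),
                  key=lambda a: sum(avg2[b] * payoff[a][b] for b in range(n)))
        br2 = max(range(n),
                  key=lambda b: sum(avg1[a] * -payoff[a][b] for a in range(n)))
        counts1[br1] += 1.0
        counts2[br2] += 1.0
    return ([c / sum(counts1) for c in counts1],
            [c / sum(counts2) for c in counts2])
```

In zero-sum games the average policies converge to a Nash equilibrium; on Matching Pennies they approach (1/2, 1/2).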
OpenSpiel has an implementation of extensive-form fictitious play (XFP) [25], which is equivalent to the classical fictitious play. To run it on normal-form games, the game needs to be transformed into a turn-based game using TurnBasedSimultaneousGame in game_transforms/. Fictitious Self-Play is a sample-based RL version of XFP that uses supervised learning to learn the average policy and reinforcement learning to compute approximate best responses. Neural Fictitious Self-Play (NFSP) scales these ideas using neural networks and a reservoir-sampled buffer to maintain a uniform sample of experience to train the average policy [26].
The average policy in fictitious play can be described equivalently as a meta-policy that assigns uniform weight over all the previous best response policies, and each iteration computes a best response to the opponents' meta-policies. Policy-Space Response Oracles (PSRO) generalizes fictitious play and the double-oracle algorithm [36, 40] by analyzing this meta-game using empirical game-theoretic analysis [74]. Exploitability Descent replaces the second step of fictitious play with a policy gradient ascent against the state-action values given that the opponents play their best responses [38]. This one change allows convergence of the policies themselves rather than having to maintain an average policy; in addition, it makes the optimization of the policies amenable to RL-style general function approximation.
Convergence curves for XFP and ED are shown in Figure 1. A convergence curve for NFSP in 2-player Leduc poker is found below (Figure 3), included with the policy gradient methods.
Counterfactual Regret Minimization
Counterfactual regret (CFR) minimization is a policy iteration algorithm for computing approximate equilibria in two-player zero-sum games [78]. It has revolutionized Poker AI research [58, 60], leading to the largest variants of poker being solved and to competitive policies that have beaten top human professionals [10, 45, 13, 15].
CFR does two main things: (a) define a new notion of stateaction value, the counterfactual value, and (b) define a decomposed regret minimization procedure (based on these values) at every information state that, together, leads to minimization of overall average regret. This means that the average policy of two CFR players approaches an approximate equilibrium.
Define Z(s) as the set of terminal histories that pass through s, paired with the prefix h ⊑ z of each terminal z. Define a reach probability η^π(h) to be the product of all players' probabilities of the state-action pairs along h (including chance's), which can be decomposed into player i's contribution and the opponents' contributions: η^π(h) = η_i^π(h) η_{-i}^π(h). Define η^π(h, z) similarly from h to z, and let ha denote the history h appended with action a. The counterfactual state-action value for i = τ(s) is:

q_i^{c,π}(s, a) = Σ_{(h,z) ∈ Z(s)} η_{-i}^π(h) η^π(ha, z) u_i(z).

The state value is then v_i^{c,π}(s) = Σ_{a ∈ A(s)} π(s, a) q_i^{c,π}(s, a).
CFR starts with a uniform random policy and proceeds by applying regret minimization at every information state independently. Define r^π(s, a) = q_i^{c,π}(s, a) - v_i^{c,π}(s) to be the instantaneous counterfactual regret. CFR proceeds by minimizing this regret, typically using regret-matching [23]. A table of cumulative regrets is maintained, R^T(s, a) = Σ_{t=1}^{T} r^{π^t}(s, a), and the policy at each state is updated using:

π^{T+1}(s, a) = R^{T,+}(s, a) / Σ_{b ∈ A(s)} R^{T,+}(s, b) if the denominator is positive (and uniform otherwise),

where x^+ = max(x, 0).
In addition to basic CFR, OpenSpiel contains a few variants of Monte Carlo CFR [34] such as outcome sampling and external sampling, and CFR+ [69].
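The regret-matching update at a single information state can be sketched as follows; the one-shot Rock, Paper, Scissors self-play loop is a toy stand-in, not OpenSpiel's CFR solver:

```python
# Regret-matching: play each action proportionally to its positive
# cumulative regret (uniform if no positive regret).

def regret_matching_policy(cum_regret):
    pos = [max(r, 0.0) for r in cum_regret]
    z = sum(pos)
    n = len(cum_regret)
    return [p / z for p in pos] if z > 0 else [1.0 / n] * n

def rps_self_play(payoff, iterations):
    # Both players run regret-matching; returns player 1's average policy.
    n = len(payoff)
    regret1, regret2 = [0.0] * n, [0.0] * n
    avg1 = [0.0] * n
    for _ in range(iterations):
        p1 = regret_matching_policy(regret1)
        p2 = regret_matching_policy(regret2)
        for a in range(n):
            avg1[a] += p1[a]
        u1 = sum(p1[a] * p2[b] * payoff[a][b]
                 for a in range(n) for b in range(n))
        # Instantaneous regrets: action value minus current state value.
        for a in range(n):
            qa = sum(p2[b] * payoff[a][b] for b in range(n))
            regret1[a] += qa - u1
        for b in range(n):
            qb = sum(p1[a] * -payoff[a][b] for a in range(n))
            regret2[b] += qb - (-u1)
    return [x / iterations for x in avg1]
```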
Regression CFR
Regression CFR (RCFR) was the first variant to combine RL-style function approximation with CFR techniques [73, 46]. The main idea is to train a regressor to predict the cumulative or average counterfactual regrets at each information state, instead of reading them from a table. The original paper used domain-specific features and regression trees. The implementation in OpenSpiel uses neural networks with raw inputs obtained from each game's InformationSetAsNormalizedVector bit string.
Figure 2 shows the convergence rate of RCFR compared to a tabular CFR.
Deep CFR [14] applies these ideas to a significantly larger game using convolutional networks, external sampling Monte Carlo CFR, and, like NFSP, a reservoir-sampled buffer.
Regret Policy Gradients
Value-based RL algorithms, such as temporal-difference learning and Q-learning, evaluate a policy π by computing or estimating state (or state-action) values that represent the expected return conditioned on having reached state s. Policies are improved by choosing the actions that lead to higher-valued states or higher-valued returns.
In episodic partially-observable games, when agents have perfect recall (Def. 5), there is an important connection between the traditional values of value-based RL and counterfactual values [66, Section 3.2]:

q_i^π(s, a) = q_i^{c,π}(s, a) / η_{-i}^π(s),

where η_{-i}^π(s) = Σ_{h ∈ s} η_{-i}^π(h) is the Bayes normalization term that ensures Pr(h | s) is a probability distribution. CFR can then be interpreted as a (tabular) all-actions policy gradient algorithm with generalized infinitesimal gradient ascent (GIGA) at each state [66], inspiring new RL variants for partially observable games.
These variants, Q-based "all-actions" Policy Gradient (QPG), Regret Policy Gradients (RPG), and Regret-Matching Policy Gradients (RMPG), are included in OpenSpiel, along with classic batched A2C. RPG differs from QPG in that the policy is optimized toward a no-regret region, minimizing a loss based on the regrets; the motivation is that a policy with zero regret is, by definition, an equilibrium policy. Convergence results for these algorithms are shown in Figure 3.
Neural Replicator Dynamics
Neural Replicator Dynamics (NeuRD) [49] takes the policy gradient connection to CFR a step further: in [66], the relationship between policy gradients and CFR was possible via GIGA [77]; however, this requires projections of policies after the gradient step. NeuRD, on the other hand, works directly with the common softmaxbased policy representations. Instead of differentiating through the softmax as policy gradient does, NeuRD differentiates only with respect to the logits. This is equivalent to updating the policy of a parameterized replicator dynamics from evolutionary game theory [27, 61] using an Euler discretization. The resulting update reduces to the wellknown multiplicative weights update algorithm or Hedge [20], which minimizes regret. Hence, NeuRD in partiallyobservable games can replace regretmatching in CFR and retain convergence guarantees in the tabular case since that algorithm reduces to CFR with Hedge.
One practical benefit is that the NeuRD policy updates are not weighted by the policy the way policy gradient updates are. As a result, in nonstationary domains, NeuRD is also more adaptive to changes in the environment. Results for NeuRD are shown in Figures 4 and 5.
(Figure panels: Kuhn poker, Goofspiel, Leduc poker.)
3.3 Tools and Evaluation
OpenSpiel has a few tools for visualization and evaluation, though some would also be considered algorithms (such as α-Rank). The best response algorithm is also a tool in some sense, but is listed in Section 2 due to its association with partially-observable games.
For now, all the tools and evaluation techniques we mention in this section are contained under the egt/ subdirectory of the code base. We expect this to change over time, as that subdirectory is currently the home of techniques inspired by evolutionary game theory.
3.3.1 Visualization of Evolutionary and Policy Learning Dynamics
One common visualization tool in the multiagent learning literature (especially in games) is a phase portrait showing a vector field and/or particle trajectories that depict local changes to the policy under specific update dynamics [64, 72, 11, 71, 9, 74, 2, 76, 75, 7, 70].
For example, consider the well-known single-population replicator dynamic for symmetric games, where each player follows a learning dynamic described by

$$\dot{\pi}(a) = \pi(a)\,\big[u(a, \pi) - \bar{u}(\pi)\big],$$

where $u(a, \pi)$ represents the expected utility of playing action $a$ against the full policy $\pi$, and $\bar{u}(\pi) = \sum_{a'} \pi(a')\, u(a', \pi)$ is the expected value over all actions $a'$.
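The dynamic above can be simulated directly; the following sketch integrates single-population replicator dynamics on Rock–Paper–Scissors with a simple Euler step (the payoff matrix, step size, and starting point are illustrative choices, not OpenSpiel's visualization code):

```python
import numpy as np

# Payoff of the row action against the column action in Rock-Paper-Scissors.
M = np.array([[ 0., -1.,  1.],
              [ 1.,  0., -1.],
              [-1.,  1.,  0.]])

def replicator_step(pi, dt=0.01):
    """One Euler step of the single-population replicator dynamic."""
    u = M @ pi       # u[a]: expected utility of action a against policy pi
    u_bar = pi @ u   # expected value over all actions
    return pi + dt * pi * (u - u_bar)

pi = np.array([0.6, 0.3, 0.1])  # arbitrary interior starting policy
trajectory = [pi]
for _ in range(1000):
    pi = replicator_step(pi)
    trajectory.append(pi)
```

Plotting such trajectories on the simplex produces phase portraits like the ones in Figures 6 and 7; for zero-sum Rock–Paper–Scissors the trajectories cycle around the uniform policy.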
Figure 6 shows plots generated from OpenSpiel for replicator dynamics in the game of Rock–Paper–Scissors. Figure 7 shows plots generated from OpenSpiel for four common bimatrix games.
3.3.2 α-Rank
α-Rank [50] is an algorithm that leverages evolutionary game theory to rank AI agents interacting in multiplayer games. Specifically, α-Rank defines a Markov transition matrix with states corresponding to the profiles of agents being used by the players (i.e., tuples of AI agents), and transitions informed by a specific evolutionary model that ensures correspondence of the rankings to a game-theoretic solution concept known as a Markov-Conley chain. A key benefit of α-Rank is that it can rank agents in scenarios involving intransitive agent relations (e.g., the agents Rock, Paper, and Scissors in the eponymous game), unlike the Elo rating system [4]; an additional practical benefit is that it is also tractable to compute in general games, unlike ranking systems relying on Nash equilibria [19].
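To illustrate the mechanics, here is a simplified single-population sketch of the α-Rank Markov chain (the payoff matrix, `alpha`, `pop_size`, and the small-mutation-limit simplifications are illustrative assumptions; see [50] and OpenSpiel's egt/ code for the full model):

```python
import numpy as np

def alpharank_single_population(payoffs, alpha=10.0, pop_size=50):
    """Stationary distribution of a simplified single-population alpha-Rank chain.

    States are monomorphic populations, one per pure strategy; transitions use
    the Fermi-style fixation probability of a single mutant (small-mutation limit).
    """
    k = payoffs.shape[0]
    transition = np.zeros((k, k))
    for i in range(k):          # resident strategy
        for j in range(k):      # mutant strategy
            if i == j:
                continue
            # Fitness advantage of mutant j over resident i in an i-population.
            df = payoffs[j, i] - payoffs[i, i]
            if abs(alpha * df) < 1e-12:
                rho = 1.0 / pop_size  # neutral-drift fixation probability
            else:
                with np.errstate(over="ignore"):  # inf here yields the correct 0.0 limit
                    rho = (1.0 - np.exp(-alpha * df)) / (1.0 - np.exp(-alpha * pop_size * df))
            transition[i, j] = rho / (k - 1)      # uniform mutation, then fixation
        transition[i, i] = 1.0 - transition[i].sum()
    # Stationary distribution: left eigenvector for eigenvalue 1.
    eigvals, eigvecs = np.linalg.eig(transition.T)
    stationary = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
    return stationary / stationary.sum()

# Rock-Paper-Scissors: by symmetry, all three strategies rank equally.
rps = np.array([[0., -1., 1.], [1., 0., -1.], [-1., 1., 0.]])
ranking = alpharank_single_population(rps)
```

The resulting stationary distribution is the ranking: mass flows cyclically Rock → Paper → Scissors, so each strategy receives equal mass, exactly the intransitive case that defeats Elo.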
OpenSpiel currently supports using α-Rank for both single-population (symmetric) and multi-population games. Specifically, users may specify games via payoff tables (or tensors for the case of more than two players) as well as Heuristic Payoff Tables (HPTs). Note that here we only include an overview of the technique and visualizations; for a tour through the usage and code, please see the α-Rank doc on the web site. Figure 8(a) shows a visualization of the Markov transition matrix of α-Rank run on the Rock, Paper, Scissors game. The next example demonstrates computing α-Rank on an asymmetric 3-player meta-game, constructed by computing utilities for Kuhn poker agents from the best response policies generated in the first few rounds of extensive-form fictitious play (XFP) [25]. The result is shown in Figure 8(b).
[Figure 8: (a) α-Rank on Rock, Paper, Scissors; (b) α-Rank on the Kuhn poker meta-game]
One may choose to conduct a sweep over the ranking-intensity parameter α (as opposed to choosing a fixed α). This is useful in general games where bounds on utilities may be unknown, and where the ranking computed by α-Rank should use a sufficiently high value of α (to ensure correspondence to the underlying Markov-Conley chain solution concept). In such cases, the following interface can be used both to visualize the sweep and to obtain the final rankings computed. The result is shown in Figure 9.
4 Guide to Contributing
If you are looking for ideas on potential contributions or want to see a rough road map for the future of OpenSpiel, please visit the Roadmap and Call for Contributions on github.
Before making a contribution to OpenSpiel, please read the design philosophy in Section 3. We also kindly request that you contact us before writing any large piece of code, in case (a) we are already working on it and/or (b) it’s something we have already considered and may have some design advice on its implementation. Please also note that some games may have copyrights which could require legal approval(s). Otherwise, happy hacking!
4.1 Contacting Us
If you would like to contact us regarding anything related to OpenSpiel, please create an issue on the github site so that the team is notified, and so that the responses are visible to everyone.
References
 [1] Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.
 [2] Sherief Abdallah and Victor Lesser. A multiagent reinforcement learning algorithm with nonlinear dynamics. JAIR, 33(1):521–549, 2008.
 [3] David Balduzzi, Marta Garnelo, Yoram Bachrach, Wojciech M. Czarnecki, Julien Pérolat, Max Jaderberg, and Thore Graepel. Open-ended learning in symmetric zero-sum games. CoRR, abs/1901.08106, 2019. http://arxiv.org/abs/1901.08106.
 [4] David Balduzzi, Karl Tuyls, Julien Perolat, and Thore Graepel. Reevaluating evaluation. In Advances in Neural Information Processing Systems, pages 3268–3279, 2018. Also available at http://arxiv.org/abs/1806.02643.
 [5] B. W. Ballard. The *-minimax search procedure for trees containing chance nodes. Artificial Intelligence, 21(3):327–350, 1983.
 [6] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, June 2013.
 [7] Daan Bloembergen, Karl Tuyls, Daniel Hennes, and Michael Kaisers. Evolutionary dynamics of multiagent learning: A survey. J. Artif. Intell. Res. (JAIR), 53:659–697, 2015.
 [8] Branislav Bošanský, Viliam Lisý, Marc Lanctot, Jiří Čermák, and Mark H.M. Winands. Algorithms for computing strategies in two-player simultaneous move games. Artificial Intelligence, 237:1–40, 2016.
 [9] Michael Bowling. Convergence and no-regret in multiagent learning. In Advances in Neural Information Processing Systems 17 (NIPS), pages 209–216, 2005.
 [10] Michael Bowling, Neil Burch, Michael Johanson, and Oskari Tammelin. Heads-up Limit Hold’em Poker is solved. Science, 347(6218):145–149, January 2015.
 [11] Michael Bowling and Manuela Veloso. Multiagent learning using a variable learning rate. Artificial Intelligence, 136:215–250, 2002.
 [12] G. W. Brown. Iterative solutions of games by fictitious play. In T.C. Koopmans, editor, Activity Analysis of Production and Allocation, pages 374–376. John Wiley & Sons, Inc., 1951.
 [13] Noam Brown, Christian Kroer, and Tuomas Sandholm. Dynamic thresholding and pruning for regret minimization. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2017.
 [14] Noam Brown, Adam Lerer, Sam Gross, and Tuomas Sandholm. Deep counterfactual regret minimization. CoRR, abs/1811.00164, 2018. http://arxiv.org/abs/1811.00164.
 [15] Noam Brown and Tuomas Sandholm. Superhuman AI for multiplayer poker. Science, 2019.
 [16] C.B. Browne, E. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1–43, March 2012.
 [17] M. Buro. Solving the oshizumo game. In Van Den Herik H.J., Iida H., and Heinz E.A., editors, Advances in Computer Games, volume 135 of IFIP: The International Federation for Information Processing. Springer, 2004.
 [18] R. Coulom. Efficient selectivity and backup operators in Monte-Carlo tree search. In Proceedings of the 5th international conference on Computers and games, volume 4630 of CG’06, pages 72–83, Berlin, Heidelberg, 2007. Springer-Verlag.
 [19] Constantinos Daskalakis, Paul W Goldberg, and Christos H Papadimitriou. The complexity of computing a Nash equilibrium. SIAM Journal on Computing, 39(1):195–259, 2009.
 [20] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In Computational Learning Theory: Second European Conference (EuroCOLT’95), pages 23–37. Springer-Verlag, 1995.
 [21] M. Genesereth, N. Love, and B. Pell. General gameplaying: Overview of the AAAI competition. AI Magazine, 26:62–72, 2005.
 [22] Steven Hansen, Pablo Sprechmann, Alexander Pritzel, André Barreto, and Charles Blundell. Fast deep reinforcement learning using online adjustments from the past. CoRR, abs/1810.08163, 2018. http://arxiv.org/abs/1810.08163.
 [23] S. Hart and A. MasColell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5):1127–1150, 2000.
 [24] He He, Jordan L. BoydGraber, Kevin Kwok, and Hal Daumé III. Opponent modeling in deep reinforcement learning. In Proceedings of The 33rd International Conference on Machine Learning (ICML 2016), 2016. Preprint available at https://arxiv.org/abs/1609.05559.
 [25] Johannes Heinrich, Marc Lanctot, and David Silver. Fictitious selfplay in extensiveform games. In Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), 2015.
 [26] Johannes Heinrich and David Silver. Deep reinforcement learning from self-play in imperfect-information games. CoRR, abs/1603.01121, 2016.
 [27] Josef Hofbauer and Karl Sigmund. Evolutionary Games and Population Dynamics. Cambridge University Press, 1998.
 [28] J. S. Jordan. Three problems in learning mixed-strategy Nash equilibria. Games and Economic Behavior, 5:368–386, 1993.
 [29] Donald E. Knuth and Ronald W. Moore. An analysis of alpha-beta pruning. Artificial Intelligence, 6(4):293–326, 1975.
 [30] L. Kocsis and C. Szepesvári. Bandit-based Monte Carlo planning. In 15th European Conference on Machine Learning, volume 4212 of LNCS, pages 282–293, 2006.
 [31] D. Koller, N. Megiddo, and B. von Stengel. Fast algorithms for finding randomized strategies in game trees. In Proceedings of the 26th ACM Symposium on Theory of Computing (STOC ’94), pages 750–759, 1994.
 [32] Vojtech Kovarík, Martin Schmid, Neil Burch, Michael Bowling, and Viliam Lisý. Rethinking formal models of partially observable multiagent decision making. CoRR, abs/1906.11110, 2019. http://arxiv.org/abs/1906.11110.
 [33] H. W. Kuhn. Simplified two-person Poker. Contributions to the Theory of Games, 1:97–103, 1950.
 [34] M. Lanctot, K. Waugh, M. Bowling, and M. Zinkevich. Sampling for regret minimization in extensive games. In Advances in Neural Information Processing Systems (NIPS 2009), pages 1078–1086, 2009.
 [35] Marc Lanctot. Monte Carlo Sampling and Regret Minimization for Equilibrium Computation and Decision-Making in Large Extensive Form Games. PhD thesis, Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada, June 2013.
 [36] Marc Lanctot, Vinicius Zambaldi, Audrunas Gruslys, Angeliki Lazaridou, Karl Tuyls, Julien Perolat, David Silver, and Thore Graepel. A unified game-theoretic approach to multiagent reinforcement learning. In Advances in Neural Information Processing Systems, 2017.
 [37] Michael L. Littman. Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the Eleventh International Conference on Machine Learning, pages 157–163. Morgan Kaufmann, 1994.
 [38] Edward Lockhart, Marc Lanctot, Julien Pérolat, Jean-Baptiste Lespiau, Dustin Morrill, Finbarr Timbers, and Karl Tuyls. Computing approximate equilibria in sequential adversarial games by exploitability descent. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2019. See also full version at https://arxiv.org/abs/1903.05614.
 [39] Marlos C. Machado, Marc G. Bellemare, Erik Talvitie, Joel Veness, Matthew J. Hausknecht, and Michael Bowling. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. Journal of Artificial Intelligence Research, 61:523–562, 2018.
 [40] H. McMahan, G. Gordon, and A. Blum. Planning in the presence of cost functions controlled by an adversary. In Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), 2003.
 [41] D. Michie. Game-playing and game-learning automata. Advances in Programming and Non-Numerical Computation, pages 183–200, 1966.
 [42] Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pages 1928–1937, 2016.
 [43] Volodymyr Mnih, Nicolas Heess, Alex Graves, and Koray Kavukcuoglu. Recurrent models of visual attention. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2204–2212. Curran Associates, Inc., 2014.
 [44] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518:529–533, 2015.
 [45] Matej Moravčík, Martin Schmid, Neil Burch, Viliam Lisý, Dustin Morrill, Nolan Bard, Trevor Davis, Kevin Waugh, Michael Johanson, and Michael Bowling. Deepstack: Expertlevel artificial intelligence in headsup nolimit poker. Science, 358(6362), October 2017.
 [46] Dustin Morrill. Using regret estimation to solve games compactly. Master’s thesis, Computing Science Department, University of Alberta, April 2016.
 [47] Todd W. Neller and Marc Lanctot. An introduction to counterfactual regret minimization. In Proceedings of Model AI Assignments, The Fourth Symposium on Educational Advances in Artificial Intelligence (EAAI-2013), 2013. http://modelai.gettysburg.edu/2013/cfr/index.html.
 [48] Todd W. Neller and Clifton G.M. Presser. Optimal play of the dice game Pig. The UMAP Journal, 25(1):25–47, 2004.
 [49] Shayegan Omidshafiei, Daniel Hennes, Dustin Morrill, Rémi Munos, Julien Pérolat, Marc Lanctot, Audrunas Gruslys, Jean-Baptiste Lespiau, and Karl Tuyls. Neural replicator dynamics. CoRR, abs/1906.00190, 2019. http://arxiv.org/abs/1906.00190.
 [50] Shayegan Omidshafiei, Christos Papadimitriou, Georgios Piliouras, Karl Tuyls, Mark Rowland, Jean-Baptiste Lespiau, Wojciech M. Czarnecki, Marc Lanctot, Julien Perolat, and Remi Munos. α-Rank: Multi-agent evaluation by evolution. Scientific Reports, 9(1):9937, 2019.
 [51] Ian Osband, Yotam Doron, Matteo Hessel, John Aslanides, Eren Sezener, Andre Saraiva, Katrina McKinney, Tor Lattimore, Csaba Szepesvari, Satinder Singh, Benjamin Van Roy, Richard Sutton, David Silver, and Hado Van Hasselt. Behaviour suite for reinforcement learning. CoRR, abs/1908.03568, 2019. https://arxiv.org/abs/1908.03568.
 [52] M.J. Osborne and A. Rubinstein. A Course in Game Theory. MIT Press, 1994.
 [53] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.
 [54] Julien Pérolat, Bilal Piot, Matthieu Geist, Bruno Scherrer, and Olivier Pietquin. Softened approximate policy iteration for Markov games. In Proceedings of the 33rd International Conference on Machine Learning, ICML’16, pages 1860–1868. JMLR.org, 2016.
 [55] Jan Peters. Policy gradient methods for control applications. Technical Report TRCLMC20071, University of Southern California, 2002.
 [56] Roberta Raileanu, Emily Denton, Arthur Szlam, and Rob Fergus. Modeling others using oneself in multiagent reinforcement learning. CoRR, abs/1802.09640, 2018. http://arxiv.org/abs/1802.09640.
 [57] J Robinson. An iterative method of solving a game. Annals of Mathematics, 54:296–301, 1951.
 [58] J. Rubin and I. Watson. Computer poker: A review. Artificial Intelligence, 175(5–6):958–987, 2011.
 [59] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 3rd edition, 2009.
 [60] T. Sandholm. The state of solving large incomplete-information games, and application to poker. AI Magazine, 31(4):13–32, 2010.
 [61] William H. Sandholm. Population Games and Evolutionary Dynamics. MIT Press, 2010.
 [62] Sven Seuken and Shlomo Zilberstein. Improved memory-bounded dynamic programming for decentralized POMDPs. CoRR, abs/1206.5295, 2012. http://arxiv.org/abs/1206.5295.
 [63] Y. Shoham and K. LeytonBrown. Multiagent Systems: Algorithmic, GameTheoretic, and Logical Foundations. Cambridge University Press, 2009.
 [64] Satinder P. Singh, Michael J. Kearns, and Yishay Mansour. Nash convergence of gradient dynamics in generalsum games. In Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, UAI ’00, pages 541–548, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc.
 [65] Finnegan Southey, Michael Bowling, Bryce Larson, Carmelo Piccione, Neil Burch, Darse Billings, and Chris Rayner. Bayes’ bluff: Opponent modelling in poker. In Proceedings of the 21st Annual Conference on Uncertainty in Artificial Intelligence (UAI), pages 550–558, 2005.
 [66] Sriram Srinivasan, Marc Lanctot, Vinicius Zambaldi, Julien Perolat, Karl Tuyls, Remi Munos, and Michael Bowling. Actor-critic policy optimization in partially observable multiagent environments. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 3422–3435. Curran Associates, Inc., 2018. Full version available at https://arxiv.org/abs/1810.09026.
 [67] R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press, 2nd edition, 2018.
 [68] Richard S. Sutton, Satinder Singh, and David McAllester. Comparing policy-gradient algorithms, 2001. Unpublished.
 [69] Oskari Tammelin, Neil Burch, Michael Johanson, and Michael Bowling. Solving heads-up limit Texas Hold’em. In Proceedings of the 24th International Joint Conference on Artificial Intelligence, 2015.
 [70] Karl Tuyls, Julien Perolat, Marc Lanctot, Joel Z Leibo, and Thore Graepel. A generalised method for empirical game theoretic analysis. In AAMAS, 2018.
 [71] W. E. Walsh, D. C. Parkes, and R. Das. Choosing samples to compute heuristic-strategy Nash equilibrium. In Proceedings of the Fifth Workshop on Agent-Mediated Electronic Commerce, 2003.
 [72] William E Walsh, Rajarshi Das, Gerald Tesauro, and Jeffrey O Kephart. Analyzing complex strategic interactions in multi-agent systems. In AAAI, 2002.
 [73] Kevin Waugh, Dustin Morrill, J. Andrew Bagnell, and Michael Bowling. Solving games with functional regret estimation. In Proceedings of the AAAI Conference on Artificial Intelligence, 2015. https://arxiv.org/abs/1411.7974.
 [74] Michael P. Wellman. Methods for empirical game-theoretic analysis. In Proceedings, The Twenty-First National Conference on Artificial Intelligence and the Eighteenth Innovative Applications of Artificial Intelligence Conference, pages 1552–1556, 2006.
 [75] Michael Wunder, Michael Littman, and Monica Babes. Classes of multiagent Q-learning dynamics with greedy exploration. In Proceedings of the 27th International Conference on Machine Learning, ICML’10, pages 1167–1174, 2010.
 [76] Chongjie Zhang and Victor Lesser. Multi-agent learning with policy prediction. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, pages 927–934, 2010.
 [77] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), 2003.
 [78] M. Zinkevich, M. Johanson, M. Bowling, and C. Piccione. Regret minimization in games with incomplete information. In Advances in Neural Information Processing Systems 20 (NIPS 2007), 2008.