
For Learning in Symmetric Teams, Local Optima are Global Nash Equilibria

07/07/2022
by   Scott Emmons, et al.

Although it has been known since the 1970s that a globally optimal strategy profile in a common-payoff game is a Nash equilibrium, global optimality is a strict requirement that limits the result's applicability. In this work, we show that any locally optimal symmetric strategy profile is also a (global) Nash equilibrium. Furthermore, we show that this result is robust to perturbations to the common payoff and to the local optimum. Applied to machine learning, our result provides a global guarantee for any gradient method that finds a local optimum in symmetric strategy space. While this result indicates stability to unilateral deviation, we nevertheless identify broad classes of games where mixed local optima are unstable under joint, asymmetric deviations. We analyze the prevalence of instability by running learning algorithms in a suite of symmetric games, and we conclude by discussing the applicability of our results to multi-agent RL, cooperative inverse RL, and decentralized POMDPs.


1 Introduction

We consider common-payoff games (also known as identical interest games (Ui, 2009)), in which the payoff to all players is always the same. Such games model a wide range of situations involving cooperative action towards a common goal. Under the heading of team theory, they form an important branch of economics (Marschak, 1955; Marschak & Radner, 1972). In cooperative AI (Dafoe et al., 2021), the common-payoff assumption holds in Dec-POMDPs (Oliehoek et al., 2016), where multiple agents operate independently according to policies designed centrally to achieve a common objective. Many applications of multiagent reinforcement learning also assume a common payoff (Foerster et al., 2016, 2018; Gupta et al., 2017). Finally, in assistance games (Russell, 2019) (also known as cooperative inverse reinforcement learning or CIRL games (Hadfield-Menell et al., 2017)), which include at least one human and one or more “robots,” it is assumed that the robots’ payoffs are exactly the human’s payoff, even if the robots must learn it.

Our focus is on symmetric strategy profiles in common-payoff games. Loosely speaking, a symmetric strategy profile is one in which some subset of players share the same strategy; Section 3 defines this in a precise sense. For example, in Dec-POMDPs, an offline solution search may consider only symmetric strategies as a way of reducing the search space. (Notice that this does not lead to identical behavior, because strategies are state-dependent.) In common-payoff multiagent reinforcement learning, each agent may collect percepts and rewards independently, but the reinforcement learning updates can be pooled to learn a single parameterized policy that all agents share; prior work has found experimentally that “parameter sharing is crucial for reaching the optimal protocol” (Foerster et al., 2016). In team theory, it is common to develop a strategy that can be implemented by every employee in a given category and leads to high payoff for the company. In civic contexts, symmetry commonly arises through notions of fairness and justice. In treaty negotiations and legislation that mandates how parties behave, for example, there is often a constraint that all parties be treated equally.

For the purposes of this paper, we consider Nash equilibria—strategy profiles for all players from which no individual player has an incentive to deviate—as a reasonable solution concept. Marschak & Radner (1972) make the obvious point that a globally optimal (possibly asymmetric) strategy profile—one that achieves the highest common payoff—is necessarily a Nash equilibrium. Moreover, it can be found in time linear in the size of the payoff matrix.

(a) Taxis have different permits

            Mobile
             H   W
Auto   H     1   0
       W     2   1

(b) Taxis are identical

            Mobile
             H   W
Auto   H     1   2
       W     2   1

(c) Large groups need both taxis

            Mobile
             H   W
Auto   H     1   0
       W     0   1

Table 1: Three versions of the self-driving taxi game. Rows give Auto's action and columns give Mobile's action. Solutions are described in the text.

In any sufficiently complex game, however, we should not expect to be able to find a globally optimal strategy profile. For example, matrix games have size exponential in the number of players, and the matrix representation of a game tree has size exponential in the depth of the tree. Therefore, global search over all possible contingency plans is infeasible for all but the smallest of games. This is why some of the most effective methods in machine learning, such as gradient methods, employ local search over strategy space.

Lacking global guarantees, local search methods may converge only to locally optimal strategy profiles. Roughly speaking, a locally optimal strategy profile is a strategy profile from which no group of players has an incentive to slightly deviate. Obviously, a locally optimal profile may not be a Nash equilibrium, as a player may still have an incentive to deviate to some more distant point in strategy space. Nonetheless, Ratliff et al. (2016) argue that a local Nash equilibrium may still be stable in a practical sense if agents are computationally unable to find a better strategy.

The central question of this work is: what can we say about the (global) properties of locally optimal symmetric strategy profiles? Our first main result, informally stated, is that in a symmetric, common-payoff game, every local optimum in symmetric strategies is a (global) Nash equilibrium. Section 4 states the result more precisely and gives an example illustrating its generality. Section 4.2 shows that the result is robust to perturbations to the common payoff and to the local optimum. Section 3.5 elaborates on the symmetry required by the result, illustrating how the theorem applies even when the physical environment is asymmetric and when players have differing capabilities. Complete proofs for all of our results are in the appendices.

Despite decades of research on symmetry in common-payoff games (Sandholm, 2001; Brandt et al., 2009), our result appears to be novel. There are some echoes of the result in the literature on single-agent decision making (Piccione & Rubinstein, 1997; Briggs, 2010; Schwarz, 2015), which can be connected to symmetric solutions of common-payoff games by treating all players jointly as a single agent, but our result appears more general than published results. Perhaps closest to our work is Piccione & Rubinstein (1997), which establishes an equilibrium-of-sorts among the “modified multi-selves” of a single player’s information set. The proof we give of our result has similarities with the proof (of a related but different result) in Taylor (2016).

In the second half of our paper, we turn to the thorny question of stability. Instability, if not handled carefully, might lead to major coordination failures in practice (Bostrom et al., 2016). It is already known that local strict optima in a totally symmetric team game attain one type of stability, but the issue is complex because there are several ways of enforcing (or not enforcing) strict symmetries in payoffs and strategies (Milchtaich, 2016). Whereas our first main result implies stability to unilateral deviation, our second main result establishes when stability exists to joint, possibly-asymmetric, deviation. We prove for a non-degenerate class of games that local optima in symmetric strategy space fail to be local optima in asymmetric strategy space if and only if at least one player is mixing, and we experimentally quantify how often mixing occurs for learning algorithms in the GAMUT suite of games (Nudelman et al., 2004).

2 Motivating Examples

Figure 1: The strategy profile landscape of the symmetric self-driving taxi game (Table 1(b)). Lighter color is higher expected utility. Although the symmetric optimum has lower expected utility than the unrestricted optima, total symmetry of the game implies that the symmetric optimum is a Nash equilibrium; this is a special case of Theorem 4.0.2.

To gain some intuition for these concepts and claims, let us consider a situation with two self-driving taxis, Auto and Mobile. Two groups of people need rides: one group needs to go home (H), and the other needs to go to work (W). It is evident that a symmetric strategy profile—both taxis driving home or both driving to work—is not ideal, because the other trip will not get made.

The first version of the game, whose payoffs are shown in Table 1(a), is asymmetric: Auto only has a work entrance permit, whereas Mobile only has a home entrance permit. Here, as Marschak & Radner (1972) pointed out, the strategy profile (Auto: W, Mobile: H) is both globally optimal and a Nash equilibrium. If we posit a mixed (randomized) strategy profile in which Auto and Mobile have work probabilities p and q respectively, the gradients of the expected payoff with respect to p and q are +1 and −1, driving the solution to (p, q) = (1, 0).

In the second version of the game (Table 1(b)), both taxis have both permits, and symmetry is restored. The pure profiles (H, W) and (W, H) are (asymmetric) globally optimal solutions and hence Nash equilibria. Figure 1 shows the entire payoff landscape as a function of p and q: looking just at symmetric strategy profiles, it turns out that there is a local optimum at p = q = 1/2, i.e., where Auto and Mobile toss fair coins to decide what to do. Although the expected payoff of this solution is lower than that of the asymmetric optima, the local optimum is, nonetheless, a Nash equilibrium. All unilateral deviations from the symmetric local optimum result in the same expected payoff because if one taxi is tossing a coin, the other taxi can do nothing to improve the final outcome.
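As a concrete illustration (a minimal numpy sketch, not code from the paper), the expected common payoff of Table 1(b) can be evaluated directly; restricting to p = q recovers the symmetric local optimum at 1/2, and fixing one taxi at 1/2 makes the other taxi's payoff flat, which is exactly the Nash property described above.

```python
import numpy as np

# Common payoff for Table 1(b): rows = Auto's action (H, W), columns = Mobile's action (H, W).
U = np.array([[1.0, 2.0],
              [2.0, 1.0]])

def expected_payoff(p, q):
    """Expected common payoff when Auto plays W with probability p and Mobile with probability q."""
    auto = np.array([1 - p, p])    # Auto's distribution over (H, W)
    mobile = np.array([1 - q, q])  # Mobile's distribution over (H, W)
    return auto @ U @ mobile

# Restrict to symmetric profiles (p = q) and find the best one on a grid.
grid = np.linspace(0, 1, 101)
symmetric_payoffs = [expected_payoff(p, p) for p in grid]
best_sym = grid[int(np.argmax(symmetric_payoffs))]
print(best_sym)                                  # 0.5: the symmetric local optimum
print(expected_payoff(best_sym, best_sym))       # 1.5, below the asymmetric optimum of 2.0

# Nash check: with Mobile fixed at q = 1/2, Auto's payoff is constant in p,
# so no unilateral deviation helps (and symmetrically for Mobile).
print([round(expected_payoff(p, 0.5), 3) for p in (0.0, 0.25, 0.5, 1.0)])   # all 1.5
```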

In the third version of the game (Table 1(c)), both the home and work groups of people are large and need both taxis. In this case, there is again a Nash equilibrium at p = q = 1/2, but it is a local minimum rather than a local maximum in symmetric strategy space. Thus, not all symmetric Nash equilibria are symmetric local optima; this is because Nash equilibria depend on unilateral deviations, whereas symmetric local optima depend on joint deviations that maintain symmetry.

2.1 Complex Coordination Example where a Simple Symmetric Strategy is Best

Consider 10 robots that must each choose among 3 actions, a, b, and c. If all robots play action a, they receive a reward of 1. If exactly one robot plays action b while the rest play action c, they receive a somewhat larger reward; otherwise, the reward is lower. When the bonus for this asymmetric outcome is small enough, the optimal symmetric policy is for all robots to play action a. Here, trying to coordinate in symmetric strategies to reach the asymmetric optimum is suboptimal—the best symmetric strategy is the simple one. Furthermore, our subsequent theory shows that the best symmetric strategy is stable; it is locally optimal even when considering joint (possibly asymmetric) deviations.

3 Preliminaries: Games and Symmetries

3.1 Normal-form Games

Throughout, we consider normal-form games G = (N, A, u) defined by a finite set N of n players, a finite set of action profiles A = A_1 × ⋯ × A_n with A_i specifying the actions available to player i, and the utility function u = (u_1, …, u_n) with u_i : A → R giving the utility for each player i (Shoham & Leyton-Brown, 2008). We call G common-payoff if u_i(a) = u_j(a) for all action profiles a ∈ A and all players i, j ∈ N. In common-payoff games we may omit the player subscript from utility functions.

We model each player i as employing a (mixed) strategy s_i ∈ Δ(A_i), a probability distribution over actions. We denote the support of the probability distribution s_i by supp(s_i). Given a (mixed) strategy profile s = (s_1, …, s_n) that specifies a strategy for each player, player i's expected utility is u_i(s) = E_{a ∼ s}[u_i(a)]. If a strategy s_i for player i maximizes expected utility given the strategies s_{−i} of all the other players, i.e., if s_i ∈ argmax_{s_i′} u_i(s_i′, s_{−i}), we call s_i a best response to s_{−i}. If each strategy s_i in a strategy profile s is a best response to s_{−i}, we call s a Nash equilibrium. A Nash equilibrium is strict if every s_i is the unique best response to s_{−i}.

Note that, while we have chosen to use the normal-form game representation for simplicity, normal-form games are highly expressive. Normal-form games can represent mixed strategies in all finite games, including games with sequential actions, stochastic transitions, and partial observation such as imperfect-information extensive form games with perfect recall, Markov games, and Dec-POMDPs. To represent a sequential game in normal form, one simply lets each normal-form action be a complete strategy (contingency plan) accounting for every potential game decision.
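To make the contingency-plan reduction concrete, here is a small sketch with a hypothetical two-observation, two-action decision (an illustrative example, not one from the paper); each normal-form action is a complete mapping from observations to actions:

```python
from itertools import product

# Hypothetical sequential decision: a player first observes one of two signals ("sunny" or "rainy")
# and then picks one of two actions ("park" or "museum"). A normal-form action is a complete
# contingency plan: a mapping from every possible observation to an action.
observations = ["sunny", "rainy"]
actions = ["park", "museum"]

# Each plan assigns an action to each observation, so there are |actions| ** |observations| plans.
contingency_plans = [dict(zip(observations, choice))
                     for choice in product(actions, repeat=len(observations))]

for plan in contingency_plans:
    print(plan)
# 4 plans, e.g. {'sunny': 'park', 'rainy': 'museum'}. This exponential blow-up in the number of
# decision points is why global search over normal-form actions is infeasible for large games.
```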

3.2 Symmetry in Game Structure

We adopt the fairly general group-theoretic notions of symmetry introduced by von Neumann & Morgenstern (1944) and Nash (1951), and we borrow notation from Plan (2017). More recent work has analyzed narrower notions of symmetry (Reny, 1999; Vester, 2012; Milchtaich, 2016; Li & Wellman, 2020). For example, Daskalakis & Papadimitriou (2007) study “anonymous games” and show that anonymity substantially reduces the complexity of finding solutions. Additionally, Ham (2013) generalizes the player-based notion of symmetry to include further symmetries revealed by renamings of actions. We conjecture our results extend to this more general case, at some cost in notational complexity, but we leave this to future work.

Our basic building block is a symmetry of a game:

Definition 3.2.1.

Call a permutation π of the player indices a symmetry of a game if, for all strategy profiles s, permuting the strategy profile permutes the expected payoffs: u_{π(i)}(π · s) = u_i(s) for every player i, where π · s denotes the permuted profile with (π · s)_{π(j)} = s_j.

Note that, when we speak of a symmetry of a game, we implicitly assume A_i = A_{π(i)} for all i with π(i) ≠ i, so that permuting the strategy profile is well-defined. (We make this choice to ease notational burden, but we conjecture that our results can be generalized to allow for mappings between actions (Ham, 2013), which we leave for future work.)
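In code, Definition 3.2.1 can be checked on pure strategy profiles alone, because expected payoffs are multilinear in the mixed strategies. The sketch below (our own illustration, with a hypothetical payoff tensor) tests the permuted-payoff condition:

```python
import itertools
import numpy as np

def is_symmetry(payoffs, perm):
    """Check Definition 3.2.1 on pure profiles (sufficient, since expected payoffs are multilinear).

    payoffs: array of shape (|A_1|, ..., |A_n|, n); payoffs[a][i] is player i's payoff at profile a.
    perm: tuple where perm[i] is the player index that player i is mapped to.
    """
    n = payoffs.shape[-1]
    action_counts = payoffs.shape[:-1]
    for a in itertools.product(*map(range, action_counts)):
        # Permuted profile: player perm[j] plays what player j played, i.e. a_perm[perm[j]] = a[j].
        a_perm = [None] * n
        for j in range(n):
            a_perm[perm[j]] = a[j]
        for i in range(n):
            if payoffs[a][i] != payoffs[tuple(a_perm)][perm[i]]:
                return False
    return True

# Example: the identical-taxis game of Table 1(b) as a common-payoff tensor.
U = np.array([[1.0, 2.0], [2.0, 1.0]])
payoffs = np.stack([U, U], axis=-1)     # both players receive the common payoff
print(is_symmetry(payoffs, (1, 0)))     # True: swapping the two taxis is a symmetry
```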

We characterize the symmetric structure of a game by its set of game symmetries:

Definition 3.2.2.

Denote the set of all symmetries of a game G by Sym(G) = {π : π is a symmetry of G}.

A spectrum of game symmetries is possible. On one end of the spectrum, the identity permutation might be the only symmetry for a given game. On the other end of the spectrum, all possible permutations might be symmetries for a given game. Following the terminology of von Neumann & Morgenstern (1944), we call the former case totally unsymmetric and the latter case totally symmetric:

Definition 3.2.3.

If Sym(G) is the full symmetric group, i.e., every permutation of the players is a symmetry, we call the game totally symmetric. If Sym(G) contains only the identity permutation, we call the game totally unsymmetric.

Let Φ ⊆ Sym(G) be any subset of the game symmetries. Because Sym(G) is closed under composition, we can repeatedly apply permutations in Φ to yield a group of game symmetries ⟨Φ⟩:

Definition 3.2.4.

Let Φ ⊆ Sym(G) be a subset of the game symmetries. The group generated by Φ, denoted ⟨Φ⟩, is the set of all permutations that can result from (possibly repeated) composition of permutations in Φ: ⟨Φ⟩ = {π_1 ∘ ⋯ ∘ π_k : k ≥ 0 and π_1, …, π_k ∈ Φ}.

Group theory tells us that ⟨Φ⟩ is indeed a group: permutation composition defines a closed binary operation on ⟨Φ⟩ with an identity and inverse maps, and ⟨Φ⟩ is the closure of Φ under function composition.

With a subset Φ of game symmetries in hand, we can use the permutations in ⟨Φ⟩ to carry one player index to another. For each player i, we give a name to the set of player indices to which permutations in ⟨Φ⟩ can carry i: we call it player i's orbit.

Definition 3.2.5.

Let Φ be a subset of the game symmetries Sym(G). The orbit of player i under Φ is the set of all player indices that ⟨Φ⟩ can assign to i: O_Φ(i) = {π(i) : π ∈ ⟨Φ⟩}.

By standard group theory, the orbits of a group action on a set partition the set’s elements, so:

Proposition 3.2.6.

Let Φ ⊆ Sym(G). The orbits of Φ partition the game's players.

Proposition 3.2.6 tells us each Φ yields an equivalence relation among the players. To gain intuition for this equivalence relation, consider two extreme cases. In a totally unsymmetric game, ⟨Φ⟩ contains only the identity permutation, in which case each player is in its own orbit of Φ; the equivalence relation induced by the orbit partition shows that no players are equivalent. In a totally symmetric game, by contrast, every permutation is a symmetry, i.e., Sym(G) is the full symmetric group; taking Φ = Sym(G), all the players share the same orbit of Φ, and the equivalence relation induced by the orbit partition shows that all the players are equivalent.

We leverage the orbit structure of an arbitrary Φ ⊆ Sym(G) to define an equivalence relation among players because it adapts to however much or little symmetry is present in the game. Between the extreme cases of no symmetry (n orbits) and total symmetry (1 orbit) mentioned above, there could be any intermediate number of orbits of Φ. Furthermore, it might not be the case that players who share an orbit can be swapped in arbitrary ways. For an example of this, see Appendix C.
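For intuition, orbits can be computed directly from a generating set Φ without enumerating ⟨Φ⟩, since repeatedly following the generators reaches every index the generated group can reach. A small sketch (our own notation, not the paper's code):

```python
def orbits(n_players, generators):
    """Partition players {0, ..., n_players-1} into orbits under the group generated by `generators`.

    generators: list of permutations, each a tuple where perm[i] is the image of player i.
    A flood fill over the generators suffices: in a finite permutation group, the inverse of a
    generator is one of its powers, so we never need to build the generated group explicitly.
    """
    unassigned = set(range(n_players))
    result = []
    while unassigned:
        start = unassigned.pop()
        orbit, frontier = {start}, [start]
        while frontier:
            i = frontier.pop()
            for perm in generators:
                j = perm[i]
                if j not in orbit:
                    orbit.add(j)
                    frontier.append(j)
        unassigned -= orbit
        result.append(sorted(orbit))
    return sorted(result)

# Four robot butlers on a square (Appendix C): a single clockwise rotation puts everyone in one
# orbit, whereas the 180-degree rotation only pairs up opposite vertices.
print(orbits(4, [(1, 2, 3, 0)]))   # [[0, 1, 2, 3]]
print(orbits(4, [(2, 3, 0, 1)]))   # [[0, 2], [1, 3]]
```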

3.3 Symmetry in Strategy Profiles

Having formalized a symmetry of a game in the preceding section, we follow Nash (1951) and define symmetry in strategy profiles with respect to symmetry in game structure:

Definition 3.3.1.

Let Φ be a subset of the game symmetries Sym(G). We call a strategy profile s Φ-invariant if π · s = s for all π ∈ Φ.

The equivalence relation among players induced by the orbit structure of Φ is fundamental to our definition of symmetry in strategy profiles by the following proposition:

Proposition 3.3.2.

A strategy profile s is Φ-invariant if and only if s_i = s_j for each pair of players i and j with j ∈ O_Φ(i).

To state Proposition 3.3.2 another way, a strategy profile is Φ-invariant if and only if all pairs of players i and j that are equivalent under the orbits of Φ play the same strategy.

Figure 2: Various self-driving taxi grid-world games that satisfy our symmetry requirement. (a) Symmetric agents in a symmetric environment. (b) Although the environment is asymmetric, the game is still symmetric because the agents have the same initial condition. (c) When agents must be programmed before knowing their initial conditions (e.g., location, type), symmetry holds behind the veil of ignorance (Section 3.4) even with nonidentical agents and asymmetric environments.

3.4 Symmetry via the Veil of Ignorance

Sometimes strategies must be specified for all players before knowing the players' roles and initial conditions. Consider writing laws or programming household robots; all players are treated equally in specifying situation-dependent contingency plans. When all players have equal likelihood of ending up in any given situation (e.g., when all players have the same initial state distribution), the game of choosing contingency plans a priori is totally symmetric. (Appendix A gives an example.) After its analog in the philosophy of Rawls (1971) and Harsanyi (1975), we call this situation the veil of ignorance.

3.5 What do Symmetric Games Look Like?

To illustrate types of symmetry in games, Figure 2 presents symmetric variants of a self-driving taxi grid-world game inspired by the motivating example of Section 2. The taxis can move to adjacent grid cells, and they are on a team to drive people around a town with a residential area and a business area.

An idealized symmetric environment is shown in Figure 2(a). Here, the self-driving taxis are identical, and the environment is perfectly symmetric; the symmetry of the game is clear. This is the sort of symmetry that might be found in highly controlled environments such as factories.

Identical agents in an asymmetric environment are shown in Figure 2(b). Because the self-driving taxis are identical and have the same initial condition, their action sequences can be swapped without changing the outcome of the game. Thus, the game is symmetric even though the environment is asymmetric. While it is impossible for real-world agents to have the exact same physical location, it suffices for them to have the same distribution over initial conditions. Furthermore, we expect that virtual agents (such as customer service chatbots or nodes in a compute cluster) may have identical initial conditions.

Nonidentical agents in an asymmetric environment are shown in Figure 2(c). If we assume that the type and / or the initial location of each self-driving taxi is assigned at random, with each taxi equally likely to receive each type and location, then the game of choosing contingency plans behind the veil of ignorance (Section 3.4) is totally symmetric. We expect this case of symmetry to be common when AI uses the same source code or the same learned parameters. In fact, weight sharing is a common practice in multi-agent RL (Foerster et al., 2016).

4 Local Symmetric Optima are (Global) Nash Equilibria

After the formal definitions of symmetry in the previous section, we are almost ready to formally state the first of our main results. The only remaining definition is that of a local symmetric optimum:

Definition 4.0.1.

Call s a locally optimal Φ-invariant strategy profile of a common-payoff game if: (i) s is Φ-invariant, and (ii) for some ε > 0, no Φ-invariant strategy profile s′ with u(s′) > u(s) can be formed by adding or subtracting at most ε to the probability of taking any given action. If, furthermore, condition (ii) holds for all ε > 0, we call s a globally optimal Φ-invariant strategy profile or simply an optimal Φ-invariant strategy profile.

Now we can state our first main theorem, that local symmetric optima are (global) Nash equilibria:

Theorem 4.0.2.

Let G be a common-payoff normal-form game, and let Φ be a subset of the game symmetries Sym(G). Any locally optimal Φ-invariant strategy profile is a Nash equilibrium.

Proof.

We provide a sketch here and full details in Appendix B. Suppose, for the sake of contradiction, that an individual player i could beneficially deviate to some action a* (if a beneficial deviation exists, then there is one to a pure strategy). Then, consider instead a collective change to a symmetric strategy profile in which all the players in i's orbit shift slightly more probability to a*. By making the amount of probability shifted ever smaller, the probability that this change affects exactly one agent's realized action (making it a* when it would not have been before) can be made arbitrarily larger than the probability that it affects multiple agents' realized actions. Moreover, if this change causes exactly one agent's realized action to change, this must be in expectation beneficial, since the original unilateral deviation was in expectation beneficial. Hence, the original strategy profile cannot have been locally optimal. ∎

4.1 Applications of the Theorem

First, we provide an example of applying Theorem 4.0.2 to multi-agent RL.

Example 4.1.1.

Consider a cooperative multi-agent RL environment where all agents have the same initial state distribution. Suppose, as is typical practice (Foerster et al., 2016), that we use a gradient method to train the parameters of a policy that all agents will share. Assume that the gradient method reaches a symmetric local optimum in mixed strategy space. If we wanted to improve upon this symmetric local optimum, we might lift the symmetry requirement and perform iterative best response, i.e., continue learning by updating the parameters of just one agent. However, by Theorem 4.0.2, the symmetric local optimum is a Nash equilibrium. Thus, updating the parameters of a single agent cannot improve the common payoff; updating the parameters of at least two agents is necessary.
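As a toy rendering of this parameter-sharing setup (a minimal sketch; the paper's experiments use GAMUT and different tooling), the following pools REINFORCE gradients from all agents into one shared softmax policy on the identical-taxis game of Table 1(b), and drifts toward the uniform symmetric optimum of Figure 1:

```python
import numpy as np

rng = np.random.default_rng(0)

# Common payoff for the identical-taxis game of Table 1(b); both agents share one policy.
U = np.array([[1.0, 2.0],
              [2.0, 1.0]])
theta = np.array([2.0, -2.0])   # shared logits over (H, W), starting biased toward H
alpha = 0.05                    # learning rate

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

for step in range(5000):
    probs = softmax(theta)
    a = [rng.choice(2, p=probs) for _ in range(2)]   # both agents sample from the shared policy
    r = U[a[0], a[1]]                                # common reward
    # Pooled REINFORCE update: sum the score-function gradients of both agents.
    grad = np.zeros(2)
    for ai in a:
        grad += (np.eye(2)[ai] - probs) * r
    theta += alpha * grad

print(softmax(theta))   # tends toward roughly uniform mixing, the symmetric local optimum of Figure 1
```

By Theorem 4.0.2, once such a run settles near the symmetric local optimum, continuing to train only one agent's copy of the parameters cannot improve the common payoff.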

The preceding example assumes that a gradient method in multi-agent RL reaches a symmetric local optimum in mixed strategy space. In practice, agents may employ behavioral strategies, and it may not be possible to verify how close a symmetric strategy profile is to a local optimum.

In Appendix C, we give another example that shows how Theorem 4.0.2 is more general than the case of total symmetry. The example illustrates the existence of rotational symmetry without total symmetry, and it illustrates how picking different Φ leads to different optimal Φ-invariant strategies and thus different Φ-invariant Nash equilibria by Theorem 4.0.2.

4.2 Robustness to Payoff and Strategy Perturbations

Theorem 4.0.2 assumes that all players' payoffs are exactly the same, and it applies to strategy profiles that are exact local optima. If we relax these assumptions, the theorem still holds approximately. If all players' payoffs are equal up to an additive ε, or if a strategy profile is within distance δ of a symmetric local optimum, then a robust version of Theorem 4.0.2 guarantees an approximate Nash equilibrium whose approximation error grows linearly in ε or δ with a game-dependent constant. See Appendix D for a precise treatment of these robustness results.

While the results of this section concern Nash equilibria, we note that Nash equilibria, by definition, consider the possibility of only a single agent deviating. In the next section, we investigate when multiple agents might have an incentive to simultaneously deviate by studying the optimality of symmetric strategy profiles in possibly-asymmetric strategy space.

4.3 Extending the Theorem to Multiple Teams

So far, we have considered the cooperation of a single team. We can also study the interaction of many different teams. Suppose each team shares a common payoff while the interaction between the different teams is general sum. For example, we could extend the self-driving taxi game of Section 2 to have multiple self-driving taxi companies. In prior work, the special case of zero-sum interaction between one team and a single adversary is called an adversarial team game (von Stengel & Koller, 1997; Celli & Gatti, 2018; Carminati et al., 2022).

Theorem 4.0.2 directly translates to this setting with multiple teams. To see why, consider a metagame with one player for each team. In the metagame, each of the metaplayers controls the strategy profile of their team. Suppose we are at a (local) Nash equilibrium in this metagame and that each team is playing a strategy profile that is Φ-invariant in the original game. Now consider just one team trying to update to improve its payoff. If we leave the strategies of the other teams fixed, then this becomes a single-team, common-payoff game. So Theorem 4.0.2 applies, and no individual player can deviate to improve their payoff. By repeating this argument for every team, we see that no individual player on any team can deviate to improve their payoff. Therefore, the individual players of the original game are in a (global) Nash equilibrium.

5 When are Local Optima in Symmetric Strategy Space also Local Optima in Possibly-asymmetric Strategy Space?

Our preceding theory applies to locally optimal Φ-invariant, i.e., symmetric, strategy profiles. This leaves open the question of how well locally optimal symmetric strategy profiles perform when considered in the broader, possibly-asymmetric strategy space. When are locally optimal Φ-invariant strategy profiles also locally optimal in possibly-asymmetric strategy space? This question is important in machine learning (ML) applications where users of symmetrically optimal ML systems might be motivated to make modifications to the systems, even for purposes of a common payoff.

To address this precisely, we formally define a local optimum in possibly-asymmetric strategy space:

Definition 5.0.1.

A strategy profile s of a common-payoff normal-form game is locally optimal among possibly-asymmetric strategy profiles, or, equivalently, a local optimum in possibly-asymmetric strategy space, if for some ε > 0, no strategy profile s′ with u(s′) > u(s) can be formed by changing s in such a way that the probability of taking any given action for any player changes by at most ε.

Definition 5.0.1 relates to notions of stability under dynamics, such as those with perturbations or stochasticity, that allow multiple players to make asymmetric deviations. In particular, if s is not a local maximum in asymmetric strategy space, this means that there is some set of players and a strategy profile s′ arbitrarily close to s such that, if those players were to play s′ (by mistake or due to stochasticity), some player i would develop a strict preference over the support of s_i. To illustrate this, we return to the motivating example of self-driving taxis.

Example 5.0.2.

Consider again the game of Table 1(b). As Figure 1 illustrates, the symmetric optimum is for both Auto and Mobile to randomize uniformly between H and W. While this is a Nash equilibrium, it is not a local optimum in possibly-asymmetric strategy space. If one player deviates from uniformly randomizing, the other player develops a strict preference for either H or W.
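A quick numerical check of this instability (a small illustrative snippet, not the paper's code): perturb Mobile slightly away from uniform and recompute Auto's action values; a strict preference appears, so a nearby joint, asymmetric deviation improves the common payoff.

```python
import numpy as np

U = np.array([[1.0, 2.0],   # rows: Auto plays H or W; columns: Mobile plays H or W
              [2.0, 1.0]])

def action_values_for_auto(q):
    """Auto's expected payoff for pure H and pure W when Mobile plays W with probability q."""
    mobile = np.array([1 - q, q])
    return U @ mobile

print(action_values_for_auto(0.5))    # [1.5, 1.5]: Auto is indifferent, so (1/2, 1/2) is a Nash equilibrium
print(action_values_for_auto(0.55))   # [1.55, 1.45]: Auto now strictly prefers H
# The symmetric optimum is therefore unstable under joint asymmetric deviations, as Theorem 5.0.4
# predicts for a mixed symmetric optimum in a non-degenerate game.
```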

To generalize the phenomenon of Example 5.0.2, we use the following degeneracy condition. (We note that “degenerate” is already an established term in the game-theoretic literature, where it is often applied only to two-player games; see, e.g., von Stengel, 2007, Definition 3.2. While similar to the established notion of degeneracy, our definition is stronger, which makes our statements about non-degenerate games more general. See Appendix E for details.)

Definition 5.0.3.

Let s be a Nash equilibrium of a game G: (i) If s is deterministic, i.e., if every s_i is a Dirac delta function on some a_i ∈ A_i, then s is degenerate if at least two players are indifferent between a_i and some other action in A_i. (ii) Otherwise, if s is mixed, then s is degenerate if, for all players i and all a_i ∈ supp(s_i), the expected payoff of playing a_i is constant across the actions in the supports of the other players' strategies.

We call a game degenerate if it has at least one degenerate Nash equilibrium.

Intuitively, our definition says that a deterministic Nash equilibrium is non-degenerate when it is strict or almost strict (excepting at most one player, who may be indifferent over available actions). A mixed Nash equilibrium, on the other hand, is non-degenerate when mixing matters.

In non-degenerate games, our next theorem shows that a local symmetric optimum is a local optimum in possibly-asymmetric strategy space if and only if it is deterministic. Formally:

Theorem 5.0.4.

Let G be a non-degenerate common-payoff normal-form game, and let Φ be a subset of the game symmetries Sym(G). A locally optimal Φ-invariant strategy profile is locally optimal among possibly-asymmetric strategy profiles if and only if it is deterministic.

        a     b     c
a       1     1     1
b       1   -10    >1
c       1    >1   -10

To see why the non-degeneracy condition is needed in Theorem 5.0.4, we provide an example of a degenerate game:

Example 5.0.5.

Consider the 3x3 symmetric common-payoff game shown above. Here, the profile (a, a) is the unique global optimum in symmetric strategy space. By Theorem 4.0.2, it is therefore also a Nash equilibrium. However, it is a degenerate Nash equilibrium and not locally optimal in asymmetric strategy space: the payoff can be improved by, e.g., the row player shifting a small probability to b and the column player shifting a small probability to c.

We have already seen an example of a non-degenerate deterministic equilibrium. The symmetric optimum from Section 2.1, even though it is not the global asymmetric optimum, is nevertheless locally optimal in possibly-asymmetric strategy space by Theorem 5.0.4.

6 Learning Symmetric Strategies in GAMUT

Theorem 5.0.4 shows that, in non-degenerate games, a locally optimal symmetric strategy profile is stable in the sense of Definition 5.0.1 if and only if it is pure. For those concerned about stability, this raises the question: how often are optimal strategies pure, and how often are they mixed?

To answer this question, we present an empirical analysis of learning symmetric strategy profiles in the GAMUT suite of game generators (Nudelman et al., 2004). We are interested both in how centralized optimization algorithms (such as gradient methods) search for symmetric strategies and in how decentralized populations of agents evolve symmetric strategies. To study the former, we run Sequential Least SQuares Programming (SLSQP) (Kraft, 1988; Virtanen et al., 2020), a local search method for constrained optimization. To study the latter, we simulate the replicator dynamics (Fudenberg & Levine, 1998), an update rule from evolutionary game theory with connections to reinforcement learning (Börgers & Sarin, 1997; Tuyls et al., 2003a, b). (See Appendix F.3 for details.)
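For concreteness, here is a minimal discrete-time replicator-dynamics sketch for a single shared strategy in a symmetric common-payoff game (an illustration of the update rule, not the exact variant used in the experiments; see Appendix F.3 for that):

```python
import numpy as np

def replicator_step(x, payoff_fn, dt=0.1):
    """One discrete-time replicator update for a shared mixed strategy x over k actions.

    payoff_fn(a, x) returns the expected common payoff to a player who plays pure action a
    while every other player plays the shared mixed strategy x.
    """
    fitness = np.array([payoff_fn(a, x) for a in range(len(x))])
    average = fitness @ x
    x = x + dt * x * (fitness - average)   # grow actions that beat the population average
    return x / x.sum()                     # renormalize against numerical drift

# Two-player example: the identical-taxis game of Table 1(b), shared strategy over (H, W).
U = np.array([[1.0, 2.0], [2.0, 1.0]])
payoff_fn = lambda a, x: U[a] @ x

x = np.array([0.9, 0.1])
for _ in range(2000):
    x = replicator_step(x, payoff_fn)
print(x)   # approaches [0.5, 0.5], the symmetric optimum of this game
```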

6.1 Experimental Setup

We ran experiments in all three classes of symmetric GAMUT games: RandomGame, CoordinationGame, and CollaborationGame. (While other classes of GAMUT games, such as the prisoner’s dilemma, exist, they cannot be turned into a symmetric, common-payoff game without losing their essential structure.) Intuitively, a RandomGame draws all payoffs uniformly at random, whereas in a CoordinationGame and a CollaborationGame, the highest payoffs are always for outcomes where all players choose the same action. (See Appendix F.1 for details.) Because CoordinationGame and CollaborationGame have such similar game structures, our experimental results in the two games are nearly identical. To avoid redundancy, we only include experimental results for CoordinationGame.

For each game class, we sweep the parameters of the game from 2 to 5 players and 2 to 5 actions, i.e., with n, |A_i| ∈ {2, 3, 4, 5}. We sample 100 games at each parameter setting and then attempt to calculate the global symmetric optimum using (i) 10 runs of SLSQP and (ii) 10 runs of the replicator dynamic (each with a different initialization drawn uniformly at random over the simplex), resulting in 10 + 10 = 20 solution attempts per game. Because we do not have ground truth for the globally optimal solution of the game (which is NP-hard to compute), we instead use the best of our 20 solution attempts, which we call the “best solution.”
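A minimal version of the centralized optimization step might look as follows (an illustrative sketch that assumes the game is given as a common-payoff tensor; it is not the exact experiment code). SLSQP maximizes the expected payoff of a single shared mixed strategy subject to simplex constraints, with random restarts:

```python
import numpy as np
from scipy.optimize import minimize

def symmetric_value(x, payoffs):
    """Expected common payoff when every one of the n players uses the shared mixed strategy x."""
    value = payoffs
    for _ in range(payoffs.ndim):
        value = value @ x          # contract one player axis at a time
    return value

def best_symmetric_strategy(payoffs, n_restarts=10, seed=0):
    rng = np.random.default_rng(seed)
    k = payoffs.shape[0]
    best = None
    for _ in range(n_restarts):
        x0 = rng.dirichlet(np.ones(k))                    # random starting point on the simplex
        res = minimize(lambda x: -symmetric_value(x, payoffs), x0, method="SLSQP",
                       bounds=[(0.0, 1.0)] * k,
                       constraints=[{"type": "eq", "fun": lambda x: x.sum() - 1.0}])
        if best is None or res.fun < best.fun:
            best = res
    return best.x, -best.fun

# Example: a random 2-player, 3-action symmetric common-payoff game.
rng = np.random.default_rng(1)
A = rng.uniform(size=(3, 3))
payoffs = (A + A.T) / 2            # symmetrize so that swapping the two players is a game symmetry
x, v = best_symmetric_strategy(payoffs)
print(np.round(x, 3), round(v, 3))
```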

6.2 How Often are Symmetric Optima Local Optima among Possibly-asymmetric Strategies?

Here, we try to get a sense for how often symmetric optima are stable in the sense that they are also local optima in possibly-asymmetric strategy space (see Definition 5.0.1). In Appendix Table 3, we show in what fraction of games the best solution of our 20 optimization attempts is mixed; by Theorem 5.0.4 and Proposition F.2.1 from the Appendix, this is the fraction of games whose symmetric optima are not local optima in possibly-asymmetric strategy space. In CoordinationGames, the symmetric optimum is always (by construction) for all players to choose the same action, leading to stability. By contrast, we see that 36% to 60% of RandomGames are unstable. We conclude that if real-world games do not have the special structure of CoordinationGames, then instability may be common.

6.3 How Often do SLSQP and the Replicator Dynamic Find an Optimal Solution?

As sequential least squares programming and the replicator dynamic are not guaranteed to converge to a global optimum, we test empirically how often each run converges to the best solution of our 20 optimization runs. In Appendix Table 4 / Table 6, we show what fraction of the time any single SLSQP / replicator dynamics run finds the best solution, and in Appendix Table 5 / Table 7, we show what fraction of the time at least 1 of 10 SLSQP / replicator dynamics runs finds the best solution. First, we note that the tables for SLSQP and the replicator dynamics are quite similar, differing by no more than a few percentage points in all cases. So the replicator dynamics, which are used as a model for how populations evolve strategies, can also be used as an effective optimization algorithm. Second, we see that individual runs of each algorithm are up to 93% likely to find the best solution in small RandomGames, but they are less likely (as little as 24% likely) to find the best solution in larger RandomGames and in CoordinationGames. The best of 10 runs, however, finds the best solution in the large majority of games; so random algorithm restarts benefit symmetric strategy optimization.

7 Conclusion

There are a variety of reasons we expect to see symmetric games in machine learning systems. The first is mass hardware production, which will proliferate identical robots, such as self-driving cars, that require ad hoc cooperation (Stone et al., 2010). The second is interaction over the internet, where websites treat all users equally. The third is anonymous protocols, such as voting, which depend on symmetry. As Figure 2 shows, symmetric games can still arise even when agents and the environment are asymmetric.

Similarly, there are a variety of reasons we expect to see symmetric strategies in practice. The first is software copies: we expect many artificial agents will run the same source code. The second is optimization: enforcing symmetric strategies exponentially reduces the joint-strategy space. The third is parameter sharing between different neural networks, which can be critical to success in multi-agent RL (Foerster et al., 2016) and may occur as a result of pretraining on large datasets (Dasari et al., 2020). The fourth is communication: symmetry (and symmetry breaking) is a key component of zero-shot coordination with other agents and humans (Hu et al., 2020; Treutlein et al., 2021). The fifth is that a single-player game with imperfect recall can be interpreted as a multi-agent game in symmetric strategies (Aumann et al., 1997).

When cooperative AI is deployed in the world with symmetric strategy profiles, it raises questions about the properties of such profiles. Would individual agents (or the users they serve) want to deviate from these profiles? Are they robust to small changes in the game or in the executed strategies? Could there be better asymmetric strategy profiles nearby?

Our results yield a mix of good and bad news. Theorems 4.0.2 and D.0.3 are good news for stability, showing that even local optima in symmetric strategy space are (global) Nash equilibria in a robust sense. So, with respect to unilateral deviations among team members, symmetric optima are relatively stable. On the other hand, this may be bad news for optimization because unilateral deviation cannot improve on a local symmetric optimum (Example 4.1.1). Furthermore, Theorem 5.0.4 is perhaps bad news, showing that a broad class of symmetric local optima are unstable when considering joint deviations in asymmetric strategy space (Definition 5.0.1). Empirically, our results with learning algorithms in GAMUT suggest that these unstable solutions may not be uncommon in practice (Section 6.2).

Future work could build on our analysis in a few different ways. First, we focus on mixed strategy space. However, future work may wish to deal with behavioral strategy space. While mixed and behavioral strategies are equivalent in games of perfect recall (Kuhn, 1953; Aumann, 1961), they are not equivalent for games of imperfect recall (Piccione & Rubinstein, 1997). Second, we focus on players who can play an arbitrary mixed strategy over discrete actions. Future work could consider continuous action space and players who act according to a learned probability distribution. We expect learned probability distributions to pose an additional challenge because, for some probability distributions, our proof of Theorem 4.0.2 in Appendix B will not directly transfer. Our proof requires that agents can always adjust a strategy by moving arbitrarily small probability onto a single action. However, this is not possible with many distributions such as Gaussian distributions. Finally, our experimental results focus on the normal-form representation of games in GAMUT (Nudelman et al., 2004). It would be interesting to see what experimental properties symmetric optima have in sequential decision making benchmarks.

Acknowledgements

We thank Stephen Casper, Lawrence Chan, Michael Dennis, Frances Ding, Daniel Filan, Rachel Freedman, Jakob Foerster, Adam Gleave, Rohin Shah, Sam Toyer, Alex Turner, and the anonymous reviewers for helpful feedback on this work.

We are grateful for the support we received for this work. It includes NSF Award IIS-1814056, funding from the DOE CSGF under grant number DE-SC0020347, and funding from the Cooperative AI Foundation, the Center for Emerging Risk Research, the Berkeley Existential Risk Initiative, and the Open Philanthropy Foundation. We also appreciate the Leverhulme Trust’s support for the Centre for the Future of Intelligence.

References

  • Aumann (1961) Aumann, R. J. Mixed and behavior strategies in infinite extensive games. Technical report, Princeton Univ NJ, 1961.
  • Aumann et al. (1997) Aumann, R. J., Hart, S., and Perry, M. The absent-minded driver. Games and Economic Behavior, 20:102–116, 1997.
  • Börgers & Sarin (1997) Börgers, T. and Sarin, R. Learning through reinforcement and replicator dynamics. Journal of economic theory, 77(1):1–14, 1997.
  • Bostrom et al. (2016) Bostrom, N., Douglas, T., and Sandberg, A. The unilateralist’s curse and the case for a principle of conformity. Social epistemology, 30(4):350–371, 2016.
  • Brandt et al. (2009) Brandt, F., Fischer, F., and Holzer, M. Symmetries and the complexity of pure nash equilibrium. Journal of computer and system sciences, 75(3):163–177, 2009.
  • Briggs (2010) Briggs, R. Putting a value on beauty. In Oxford Studies in Epistemology, volume 3, pp. 3–24. Oxford University Press, 2010.
  • Carminati et al. (2022) Carminati, L., Cacciamani, F., Ciccone, M., and Gatti, N. Public information representation for adversarial team games. arXiv preprint arXiv:2201.10377, 2022.
  • Celli & Gatti (2018) Celli, A. and Gatti, N. Computational results for extensive-form adversarial team games. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
  • Dafoe et al. (2021) Dafoe, A., Bachrach, Y., Hadfield, G., Horvitz, E., Larson, K., and Graepel, T. Cooperative ai: machines must learn to find common ground, 2021.
  • Dasari et al. (2020) Dasari, S., Ebert, F., Tian, S., Nair, S., Bucher, B., Schmeckpeper, K., Singh, S., Levine, S., and Finn, C. Robonet: Large-scale multi-robot learning. In Conference on Robot Learning, pp. 885–897. PMLR, 2020.
  • Daskalakis & Papadimitriou (2007) Daskalakis, C. and Papadimitriou, C. H. Computing equilibria in anonymous games. In FOCS, 2007.
  • Foerster et al. (2016) Foerster, J., Assael, I. A., De Freitas, N., and Whiteson, S. Learning to communicate with deep multi-agent reinforcement learning. In Advances in neural information processing systems, pp. 2137–2145, 2016.
  • Foerster et al. (2018) Foerster, J. N., Farquhar, G., Afouras, T., Nardelli, N., and Whiteson, S. Counterfactual multi-agent policy gradients. In Thirty-second AAAI conference on artificial intelligence, 2018.
  • Fudenberg & Levine (1998) Fudenberg, D. and Levine, D. K. The theory of learning in games, volume 2. MIT press, 1998.
  • Gupta et al. (2017) Gupta, J. K., Egorov, M., and Kochenderfer, M. Cooperative multi-agent control using deep reinforcement learning. In International Conference on Autonomous Agents and Multiagent Systems, pp. 66–83. Springer, 2017.
  • Hadfield-Menell et al. (2017) Hadfield-Menell, D., Dragan, A. D., Abbeel, P., and Russell, S. J. Cooperative inverse reinforcement learning. In Advances in Neural Information Processing 29, 2017.
  • Ham (2013) Ham, N. Notions of symmetry for finite strategic-form games. arXiv:1311.4766, 2013.
  • Harris et al. (2020) Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D., Wieser, E., Taylor, J., Berg, S., Smith, N. J., Kern, R., Picus, M., Hoyer, S., van Kerkwijk, M. H., Brett, M., Haldane, A., del Río, J. F., Wiebe, M., Peterson, P., Gérard-Marchant, P., Sheppard, K., Reddy, T., Weckesser, W., Abbasi, H., Gohlke, C., and Oliphant, T. E. Array programming with NumPy. Nature, 585(7825):357–362, September 2020. doi: 10.1038/s41586-020-2649-2. URL https://doi.org/10.1038/s41586-020-2649-2.
  • Harsanyi (1975) Harsanyi, J. C. Can the maximin principle serve as a basis for morality? a critique of john rawls’s theory. American political science review, 69(2):594–606, 1975.
  • Hu et al. (2020) Hu, H., Lerer, A., Peysakhovich, A., and Foerster, J. “other-play” for zero-shot coordination. In International Conference on Machine Learning, pp. 4399–4410. PMLR, 2020.
  • Hunter (2007) Hunter, J. D. Matplotlib: A 2d graphics environment. Computing in Science & Engineering, 9(3):90–95, 2007. doi: 10.1109/MCSE.2007.55.
  • Kraft (1988) Kraft, D. A software package for sequential quadratic programming. Technical Report DFVLR-FB 88-28, DLR German Aerospace Center – Institute for Flight Mechanics, Koln, Germany, 1988.
  • Kuhn (1953) Kuhn, H. W. Extensive games and the problem of information, contributions to the theory of games ii. Annals of Mathematics Studies, 28:193–216, 1953.
  • Li & Wellman (2020) Li, Z. and Wellman, M. Structure learning for approximate solution of many-player games. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 2119–2127, 2020.
  • Marschak (1955) Marschak, J. Elements for a theory of teams. Management Science, 1(2):127–137, 1955.
  • Marschak & Radner (1972) Marschak, J. and Radner, R. Economic Theory of Teams. Yale University Press, 1972.
  • Meurer et al. (2017) Meurer, A., Smith, C. P., Paprocki, M., Čertík, O., Kirpichev, S. B., Rocklin, M., Kumar, A., Ivanov, S., Moore, J. K., Singh, S., Rathnayake, T., Vig, S., Granger, B. E., Muller, R. P., Bonazzi, F., Gupta, H., Vats, S., Johansson, F., Pedregosa, F., Curry, M. J., Terrel, A. R., Roučka, v., Saboo, A., Fernando, I., Kulal, S., Cimrman, R., and Scopatz, A. Sympy: symbolic computing in python. PeerJ Computer Science, 3:e103, January 2017. ISSN 2376-5992. doi: 10.7717/peerj-cs.103. URL https://doi.org/10.7717/peerj-cs.103.
  • Milchtaich (2016) Milchtaich, I. Static stability in symmetric and population games. Technical report, Bar-Ilan University, November 2016. URL https://www.biu.ac.il/soc/ec/wp/2008-04/2008-04.pdf.
  • Nash (1951) Nash, J. Non-cooperative games. Annals of Mathematics, pp. 286–295, 1951.
  • Nudelman et al. (2004) Nudelman, E., Wortman, J., Shoham, Y., and Leyton-Brown, K. Run the gamut: a comprehensive approach to evaluating game-theoretic algorithms. In Proceedings of the Third International Joint Conference on Autonomous Agents and Multiagent Systems, 2004. AAMAS 2004., pp. 880–887. IEEE, 2004.
  • Oliehoek et al. (2016) Oliehoek, F. A., Amato, C., et al. A Concise Introduction to Decentralized POMDPs. Springer, 2016.
  • Piccione & Rubinstein (1997) Piccione, M. and Rubinstein, A. On the interpretation of decision problems with imperfect recall. Games and Economic Behavior, 20(1):3–24, 1997.
  • Plan (2017) Plan, A. Symmetric n-player games. Working paper, available at asafplan.com, 2017.
  • Ratliff et al. (2016) Ratliff, L. J., Burden, S. A., and Sastry, S. S. On the characterization of local nash equilibria in continuous games. IEEE Transactions on Automatic Control, 61(8):2301–2307, Aug. 2016. ISSN 0018-9286. doi: 10.1109/TAC.2016.2583518.
  • Rawls (1971) Rawls, J. A Theory of Justice. Belknap Press, Cambridge, Massachusetts, 1971.
  • Reback et al. (2021) Reback, J., McKinney, W., jbrockmendel, den Bossche, J. V., Augspurger, T., Cloud, P., Hawkins, S., gfyoung, Sinhrks, Roeschke, M., Klein, A., Petersen, T., Tratner, J., She, C., Ayd, W., Naveh, S., patrick, Garcia, M., Schendel, J., Hayden, A., Saxton, D., Jancauskas, V., Gorelli, M., Shadrach, R., McMaster, A., Battiston, P., Seabold, S., Dong, K., chris b1, and h vetinari. pandas-dev/pandas: Pandas 1.2.4, April 2021. URL https://doi.org/10.5281/zenodo.4681666.
  • Reny (1999) Reny, P. J. On the existence of pure and mixed strategy Nash equilibria in discontinuous games. Econometrica, 67(5):1029–1056, 1999.
  • Russell (2019) Russell, S. Human Compatible: Artificial Intelligence and the Problem of Control. Viking Press, 2019.
  • Sandholm (2001) Sandholm, W. H. Potential games with continuous player sets. Journal of Economic theory, 97(1):81–108, 2001.
  • Schwarz (2015) Schwarz, W. Lost memories and useless coins: revisiting the absentminded driver. Synthese, 192:3011–3036, 2015.
  • Shoham & Leyton-Brown (2008) Shoham, Y. and Leyton-Brown, K. Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations. Cambridge University Press, 2008. ISBN 0521899435. URL http://www.masfoundations.org/.
  • Stone et al. (2010) Stone, P., Kaminka, G., Kraus, S., and Rosenschein, J. Ad hoc autonomous agent teams: Collaboration without pre-coordination. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 24, 2010.
  • Taylor (2016) Taylor, J. In memoryless Cartesian environments, every UDT policy is a CDT+SIA policy, June 2016. URL https://www.alignmentforum.org/posts/5bd75cc58225bf06703751b2/in-memoryless-cartesian-environments-every-udt-policy-is-a-cdt-sia-policy.
  • Treutlein et al. (2021) Treutlein, J., Dennis, M., Oesterheld, C., and Foerster, J. A new formalism, method and open issues for zero-shot coordination. In Proceedings of the Thirty-eighth International Conference on Machine Learning (ICML’21). 2021.
  • Tsybakov (2009) Tsybakov, A. Introduction to Nonparametric Estimation. Springer, 2009.
  • Tuyls et al. (2003a) Tuyls, K., Heytens, D., Nowe, A., and Manderick, B. Extended replicator dynamics as a key to reinforcement learning in multi-agent systems. In European Conference on Machine Learning, pp. 421–431. Springer, 2003a.
  • Tuyls et al. (2003b) Tuyls, K., Verbeeck, K., and Lenaerts, T. A selection-mutation model for q-learning in multi-agent systems. In Proceedings of the second international joint conference on Autonomous agents and multiagent systems, pp. 693–700, 2003b.
  • Ui (2009) Ui, T. Bayesian potentials and information structures: Team decision problems revisited. International Journal of Economic Theory, 5(3):271–291, 2009.
  • Vester (2012) Vester, S. Symmetric Nash Equilibria. PhD thesis, Ecole Normale Superieure de Cachan, 2012.
  • Virtanen et al. (2020) Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T., Cournapeau, D., Burovski, E., Peterson, P., Weckesser, W., Bright, J., van der Walt, S. J., Brett, M., Wilson, J., Millman, K. J., Mayorov, N., Nelson, A. R. J., Jones, E., Kern, R., Larson, E., Carey, C. J., Polat, İ., Feng, Y., Moore, E. W., VanderPlas, J., Laxalde, D., Perktold, J., Cimrman, R., Henriksen, I., Quintero, E. A., Harris, C. R., Archibald, A. M., Ribeiro, A. H., Pedregosa, F., van Mulbregt, P., and SciPy 1.0 Contributors. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17:261–272, 2020. doi: 10.1038/s41592-019-0686-2.
  • von Neumann & Morgenstern (1944) von Neumann, J. and Morgenstern, O. Theory of Games and Economic Behavior. Princeton University Press, 1944.
  • von Stengel (2007) von Stengel, B. Equilibrium computation for two-player games in strategic and extensive form. In Algorithmic Game Theory. Cambridge University Press, 2007.
  • von Stengel & Koller (1997) von Stengel, B. and Koller, D. Team-maxmin equilibria. Games and Economic Behavior, 21(1-2):309–321, 1997.
  • Wes McKinney (2010) Wes McKinney. Data Structures for Statistical Computing in Python. In Stéfan van der Walt and Jarrod Millman (eds.), Proceedings of the 9th Python in Science Conference, pp. 56 – 61, 2010. doi: 10.25080/Majora-92bf1922-00a.

Appendix A Veil of Ignorance Example

Two robots arrive at a resource that can be used by only one of them. They can choose as their action either Cautious or Aggressive. If both choose C, one of them gets the resource at random. If exactly one chooses A, that one gets the resource. If both choose A, the resource is destroyed and neither gets it (utility 0).

Each robot privately knows whether it has High or Low need for the resource (each type occurs independently with probability 1/2). A robot that has High need values the resource at 6; one that has Low need values it at 4. Robots are on the same team and care about the sum of utilities.

From behind the veil of ignorance, the optimal symmetric strategy (contingency plan) is: when having type L, always play C; when having type H, play A with probability 1/6 (and C otherwise). Note, as guaranteed by Theorem 4.0.2, that this is a Nash equilibrium. To verify this, observe that from the perspective of a robot with type H, the expected team utility for playing A (when the other robot follows the given strategy and plays A with probability p whenever it has type H) is 6 − 3p, and for playing C it is 11/2, and if p = 1/6 these are equal. In contrast, from the perspective of a robot with type L, the expected team utility for playing A (when the other follows the given strategy) is 11/3, and for playing C it is 55/12, so C is strictly preferred.

Overall, this optimal symmetric strategy results in an expected team utility of 121/24 ≈ 5.04. Compare this with an asymmetric strategy where robot 1 plays A when it has type H but otherwise C is always played by both robots, which results in a team utility of 21/4 = 5.25. (If types were not private knowledge, 11/2 = 5.5 would be possible.)
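A quick script can check these numbers under the values stated above (with the type probability of 1/2 taken as given); it recomputes the symmetric team utility as a function of the High type's aggression probability p and recovers the indifference point p = 1/6:

```python
import numpy as np

P_HIGH, V_HIGH, V_LOW = 0.5, 6.0, 4.0   # type probability and resource values from the example

def team_utility(p):
    """Expected team utility when both robots play: type L -> C, type H -> A with probability p."""
    p_a = P_HIGH * p                      # probability a robot plays A (only High types play A)
    p_c = 1 - p_a
    # Expected resource value of a robot conditional on it playing C (it may be either type).
    v_given_c = ((1 - P_HIGH) * V_LOW + P_HIGH * (1 - p) * V_HIGH) / p_c
    one_a = 2 * p_a * p_c                 # exactly one A: the lone aggressor is a High type and wins
    both_c = p_c ** 2                     # both C: a uniformly random robot wins
    return one_a * V_HIGH + both_c * v_given_c   # both A destroys the resource (utility 0)

grid = np.linspace(0, 1, 100001)
values = np.array([team_utility(p) for p in grid])
print(grid[values.argmax()])   # ~1/6, the High type's optimal mixing probability
print(values.max())            # ~121/24 ~ 5.04, the symmetric optimum's expected team utility
```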

In this example, we see how players can coordinate using symmetric strategies from behind the veil of ignorance. Although it is possible to achieve a higher payoff using asymmetric strategies, the optimal symmetric strategy is nonetheless a Nash equilibrium by Theorem 4.0.2.

Appendix B Proofs of Section 4 Results

Theorem 4.0.2 (restated). Let G be a common-payoff normal-form game, and let Φ be a subset of the game symmetries Sym(G). Any locally optimal Φ-invariant strategy profile is a Nash equilibrium.

Proof.

We proceed by contradiction. Suppose s is locally optimal among Φ-invariant strategy profiles but is not a Nash equilibrium. We will construct an s′ arbitrarily close to s with u(s′) > u(s).

Without loss of generality, suppose s_i is not a best response to s_{−i} but that the pure strategy of always playing some action a* is a best response to s_{−i}. For an arbitrary probability ε ∈ (0, 1), consider the modified strategy s_i^ε that plays action a* with probability ε and follows s_i with probability 1 − ε. Now, construct s′ as follows: every player j in i's orbit O_Φ(i) plays s_i^ε, and every other player keeps their strategy from s. (Because s is Φ-invariant, every member of i's orbit plays s_i in s, so this assignment is well-defined.)

In words, s′ modifies s by having the members of player i's orbit mix in a probability ε of playing a*. We claim for all sufficiently small ε that u(s′) > u(s).

To establish this claim, we break up the expected utility of s′ according to how many players in i's orbit play the action a* because of mixing it in with probability ε. Writing m = |O_Φ(i)|, we observe

u(s′) = B(0; m, ε) · u(s) + B(1; m, ε) · u(a*, s_{−i}) + Σ_{k=2}^{m} B(k; m, ε) · v_k,

where B(k; m, ε) is the probability of k successes for a binomial random variable on m independent events that each have success probability ε, and where each v_k, the expected utility conditional on exactly k orbit members mixing in a*, is arbitrary. Note that the crucial step in writing this expression is grouping the terms with the coefficient B(1; m, ε): whenever exactly one orbit member j mixes in a*, the resulting expected utility is u(a*_j, s_{−j}) = u(a*, s_{−i}). We can do this because for any player j ∈ O_Φ(i), there exists a symmetry π ∈ ⟨Φ⟩ with π(i) = j.

Now, to achieve u(s′) > u(s), we require

B(1; m, ε) · [u(a*, s_{−i}) − u(s)] + Σ_{k=2}^{m} B(k; m, ε) · [v_k − u(s)] > 0.

We know u(a*, s_{−i}) − u(s) > 0, but we must deal with the case when the terms v_k − u(s) are negative. Because B(1; m, ε) / Σ_{k=2}^{m} B(k; m, ε) → ∞ as ε → 0, by making ε sufficiently small, the positive first term can be made greater in magnitude than the second term by an arbitrarily large ratio. The result follows. ∎

Appendix C Example of General Symmetry in Theorem 4.0.2

Example C.0.1.

There are four groups of partygoers positioned in a square. We number these 1,2,3,4 clockwise, such that, e.g., 1 neighbors 4 and 2. There is also a robot butler at each vertex of the square. The partygoers can fetch refreshments from the robot butler at their vertex of the square and from the robot butler at adjacent vertices of the square, but it is too far of a walk for them to fetch refreshments from the robot at the opposite vertex.

The game has each robot butler choose what refreshment to hold. For simplicity, suppose each robot butler can hold food or drink. The common payoff of the game is the sum of the utilities of the four groups of partygoers. For each group, if the group cannot fetch drink, the payoff for that group is 0. If the group can only fetch drink, the payoff is 1, and if the group can fetch food and drink, the group’s payoff is 2.

The symmetries of the game include the set of permutations generated by rotating the robot butlers once clockwise. In standard cycle notation for permutations, this rotation is (1 2 3 4).

First, consider applying the theorem to Φ = {(1 2 3 4)}. In this case, the constraint of Φ-invariance requires that all the robot butlers play the same strategy because all of them are in the same orbit. As we show in the proof below, the optimal Φ-invariant strategy is then for each robot to hold food with probability √2 − 1 ≈ 0.41. Theorem 4.0.2 tells us that this optimal Φ-invariant strategy profile is a Nash equilibrium. The proof below also shows how to verify this without the use of Theorem 4.0.2.

Second, consider applying the theorem to the case where Φ consists only of the rotation twice clockwise, i.e., the permutation which maps each robot onto the robot on the opposite vertex of the square. In standard cycle notation, Φ = {(1 3)(2 4)}. Now, the constraint of Φ-invariance requires robot butlers at opposite vertices of the square to play the same strategy. However, neighboring robots can hold different refreshments. The optimal Φ-invariant strategy is for one pair of opposite-vertex robots, e.g., 1 and 3, to hold food and for the other pair of robots, 2 and 4, to hold drink. While it turns out to be immediate that this optimal Φ-invariant strategy is a Nash equilibrium because it achieves the globally optimal outcome, we could have applied Theorem 4.0.2 to know that this optimal Φ-invariant strategy profile is a Nash equilibrium even without knowing what the optimal Φ-invariant strategy was.

Proof.

We here calculate the optimal rotation-invariant strategy profile for Example C.0.1. Let $p$ be the probability with which each robot holds drink. By symmetry of the game and linearity of expectation, the expected utility given $p$ is simply four times the expected utility of any one group of partygoers. The utility of one group of partygoers is $0$ with probability $(1-p)^3$ (none of the three robots it can reach holds drink), is $1$ with probability $p^3$ (all three hold drink, so there is no food), and is $2$ with the remaining probability. Hence, the expected utility of a single group of partygoers is
$$p^3 + 2\bigl(1 - p^3 - (1-p)^3\bigr).$$
The maximum of this expression (and thus the maximum of the overall utility summed over all four groups) can be found by any computer algebra system: it is attained at $p = 2 - \sqrt{2} \approx 0.59$, which gives an expected utility of $4(\sqrt{2} - 1) \approx 1.66$ per group.

To double-check, we can also calculate the symmetric Nash equilibrium of this game. It is easy to see that this Nash equilibrium must be mixed and must therefore make each robot butler indifferent about what to hold. So let $p$ again be the probability with which each robot butler holds drink. Relative to holding nothing, holding drink contributes an expected utility of $2(1-p)^2$ for each of the three groups a robot can serve. (Holding drink lifts the utility of a group of partygoers from 0 to 2 if they cannot already fetch drink from another robot; otherwise, it does not help to hold drink.) Relative to holding nothing, holding food contributes an expected utility of simply $p^2$ per group. We can find the symmetric Nash equilibrium by setting
$$2(1-p)^2 = p^2,$$
which gives us the same solution $p = 2 - \sqrt{2}$ as before. ∎
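As a sanity check on this calculation, the following SymPy sketch (illustrative only, not part of the released code) maximizes the per-group expected utility derived above:

```python
import sympy as sp

p = sp.symbols("p", positive=True)

# Expected utility of one group of partygoers when every robot holds drink
# with probability p: 0 if no reachable robot holds drink, 1 if all three
# hold drink, and 2 otherwise.
utility = p**3 + 2 * (1 - p**3 - (1 - p)**3)

# The maximizer on [0, 1] is the critical point of this (concave) expression.
critical = [c for c in sp.solve(sp.diff(utility, p), p) if 0 <= float(c) <= 1]
print(critical)                                    # [2 - sqrt(2)]
print(sp.simplify(utility.subs(p, critical[0])))   # 4*sqrt(2) - 4, about 1.66
```

The maximizer $2 - \sqrt{2}$ is the drink probability, so the food probability is $\sqrt{2} - 1 \approx 0.41$, matching the indifference condition above.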

Appendix D Robustness of Theorem 4.0.2 to Payoff and Strategy Perturbations

The first type of robustness we consider is robustness to perturbations in the game's payoff function. Formally, we define an $\epsilon$-perturbation of a game as follows:

Definition D.0.1.

Let $G$ be a normal-form game with utility functions $u_i$. For some $\epsilon > 0$, we call $G'$ an $\epsilon$-perturbation of $G$ if $G'$ has utility functions $u'_i$ satisfying $|u_i(a) - u'_i(a)| \le \epsilon$ for every player $i$ and every action profile $a$.

There are a variety of reasons why $\epsilon$-perturbations might arise in practice. Our game model may contain errors such as the game not being perfectly symmetric; the players’ preferences might fluctuate over time; or we might have used function approximation to learn the game’s payoffs. With Proposition D.0.2, we note a generic observation about Nash equilibria showing that our main result, Theorem 4.0.2, is robust in the sense of degrading linearly in the payoff perturbation’s size:

Proposition D.0.2.

Let $G$ be a common-payoff normal-form game, and let $\sigma$ be a locally optimal strategy profile that is invariant under some subset of the game's symmetries. Suppose $G'$ is an $\epsilon$-perturbation of $G$. Then $\sigma$ is a $2\epsilon$-Nash equilibrium in $G'$.

Proof.

By Theorem 4.0.2, $\sigma$ is a Nash equilibrium in $G$. After perturbing $G$ by at most $\epsilon$ to form $G'$, every payoff has increased or decreased by at most $\epsilon$, so the difference between any two actions' expected payoffs has changed by at most $2\epsilon$. ∎
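For concreteness, the argument can be spelled out as the following chain, writing $u$ for the common payoff in $G$, $u'_i$ for player $i$'s payoff in $G'$, and $a_i$ for an arbitrary deviation (notation introduced here only for illustration):
$$u'_i(\sigma) \;\ge\; u(\sigma) - \epsilon \;\ge\; u(a_i, \sigma_{-i}) - \epsilon \;\ge\; u'_i(a_i, \sigma_{-i}) - 2\epsilon,$$
where the middle step uses that $\sigma$ is a Nash equilibrium in $G$, and the outer steps use that expected payoffs under $u$ and $u'_i$ differ by at most $\epsilon$.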

The second type of robustness we consider is robustness to symmetric solutions that are only approximate. For example, we might try to find a symmetric local optimum through an approximate optimization method, or the evolutionary dynamics among players’ strategies might lead them to approximate local symmetric optima. Again, a generic result about Nash equilibria shows that the guarantee of Theorem 4.0.2 degrades linearly in this case:

Theorem D.0.3.

Let $G$ be a common-payoff normal-form game, and let $\sigma$ be a locally optimal strategy profile that is invariant under some subset of the game's symmetries. Suppose $\sigma'$ is a strategy profile whose per-player total variation distance from $\sigma$ is at most $\delta$. Then $\sigma'$ is an $\epsilon$-Nash equilibrium, where $\epsilon$ is linear in $\delta$ (with a constant determined by the game's payoffs).

Proof.

Consider the perspective of an arbitrary player $i$. Because the opponent profiles $\sigma_{-i}$ and $\sigma'_{-i}$ are close in total variation, the expected utility of playing any fixed action against $\sigma'_{-i}$ differs from its expected utility against $\sigma_{-i}$ by an amount proportional to $\delta$ times the largest payoff magnitude in the game. In particular, let $a$ be an action in the support of $\sigma_i$, and let $a'$ be any other action. Then, using the above, the advantage of deviating to $a'$ against $\sigma'_{-i}$ exceeds the advantage of deviating to $a'$ against $\sigma_{-i}$ by at most an amount linear in $\delta$, and the latter advantage is at most $0$ because $\sigma$ is a Nash equilibrium by Theorem 4.0.2. ∎

By Theorem D.0.3, we have a robustness guarantee in terms of the total variation distance between an approximate local symmetric optimum and a true local symmetric optimum. Without much difficulty, we can also convert this into a robustness guarantee in terms of the Kullback-Leibler divergence:

Corollary D.0.4.

Let $G$ be a common-payoff normal-form game, and let $\sigma$ be a locally optimal strategy profile that is invariant under some subset of the game's symmetries. Suppose $\sigma'$ is a strategy profile such that, for every player $i$, the Kullback-Leibler divergence satisfies $D_{\mathrm{KL}}(\sigma_i \,\|\, \sigma'_i) \le \delta_{\mathrm{KL}}$ or $D_{\mathrm{KL}}(\sigma'_i \,\|\, \sigma_i) \le \delta_{\mathrm{KL}}$. Then $\sigma'$ is an $\epsilon$-Nash equilibrium, with $\epsilon$ obtained from Theorem D.0.3 at total variation distance $\delta = \sqrt{\delta_{\mathrm{KL}}/2}$.

Proof.

By Pinsker’s inequality (Tsybakov, 2009), we have
$$d_{\mathrm{TV}}(\sigma_i, \sigma'_i) \;\le\; \sqrt{\tfrac{1}{2}\, D_{\mathrm{KL}}(\sigma_i \,\|\, \sigma'_i)}.$$
The case in which only $D_{\mathrm{KL}}(\sigma'_i \,\|\, \sigma_i)$ is bounded is handled by a similar application of Pinsker’s inequality. Either way, by assumption we have $d_{\mathrm{TV}}(\sigma_i, \sigma'_i) \le \sqrt{\delta_{\mathrm{KL}}/2}$ for every player $i$. Applying Theorem D.0.3 with $\delta = \sqrt{\delta_{\mathrm{KL}}/2}$ yields the result. ∎

Appendix E Proof of Section 5 Results

First, we clarify how our notion of non-degeneracy compares to the existing literature. If a two-player game is non-degenerate in the usual sense from the literature, it is non-degenerate in the sense of Definition 5.0.3. Moreover, if a game $G$ is common-payoff, then for each player $i$, we can define a two-player game played by $i$ and a single additional player who controls the strategies of all the remaining players. If, for every player $i$, these two-player games are non-degenerate in the established sense, then $G$ is non-degenerate in the sense of Definition 5.0.3.

Now, we proceed with the proof of Section 5 results:

See Theorem 5.0.4.

Proof.

Let $\sigma$ be a locally optimal invariant strategy profile. By Theorem 4.0.2, $\sigma$ is a Nash equilibrium. Because the game is non-degenerate, this equilibrium is non-degenerate as well. We prove the claim by proving that (1) if $\sigma$ is deterministic, it is locally optimal in asymmetric strategy space; and (2) if $\sigma$ is mixed, then it is not locally optimal in asymmetric strategy space.

(1) The deterministic case: Let $\sigma$ be deterministic. Now consider a potentially asymmetric strategy profile $\sigma'$. We must show that, once $\sigma'$ is sufficiently close to $\sigma$, we have $u(\sigma') \le u(\sigma)$.

For each player $j$, let $\epsilon_j$ and $\tau_j$ be such that $\sigma'_j$ can be interpreted as following $\tau_j$ with probability $\epsilon_j$ and following $\sigma_j$ with probability $1 - \epsilon_j$, where $\epsilon_j \to 0$ as $\sigma' \to \sigma$. Then (similar to the proof of Theorem 4.0.2), we can write
$$
\begin{aligned}
u(\sigma') \;=\;& \sum_{S \subseteq \{1, \dots, N\}} \Bigl(\prod_{j \in S} \epsilon_j\Bigr) \Bigl(\prod_{j \notin S} (1 - \epsilon_j)\Bigr)\, u(\tau_S, \sigma_{-S}) \\
=\;& \Bigl(\prod_{j} (1 - \epsilon_j)\Bigr)\, u(\sigma) \\
&+ \sum_{j} \epsilon_j \Bigl(\prod_{l \neq j} (1 - \epsilon_l)\Bigr)\, u(\tau_j, \sigma_{-j}) \\
&+ \cdots
\end{aligned}
$$
The second line is the expected value if everyone plays $\sigma$, the third line is the sum over the possibilities of exactly one player deviating to their $\tau_j$, and so forth. We now make two observations. First, because $\sigma$ is a Nash equilibrium, the expected utilities $u(\tau_j, \sigma_{-j})$ in the third line are all at most as big as $u(\sigma)$. Now consider any later term corresponding to the deviation of some set $S$ containing at least two players $j$ and $l$. Note that it may be that $u(\tau_S, \sigma_{-S}) > u(\sigma)$. However, this term is multiplied by $\prod_{j' \in S} \epsilon_{j'}$. Thus, as the $\epsilon_j$ go to $0$, the significance of this term in the average vanishes in comparison to that of both the terms corresponding to the deviation of just $j$ and just $l$, which are multiplied only by $\epsilon_j$ and $\epsilon_l$ (times factors close to $1$), respectively. By non-degeneracy, $u(\tau_j, \sigma_{-j}) < u(\sigma)$ or $u(\tau_l, \sigma_{-l}) < u(\sigma)$. Thus, if the $\epsilon_j$ are small enough, the overall sum is at most $u(\sigma)$.

(2) The mixed case: Let $\sigma$ be mixed. We proceed by constructing a strategy profile $\sigma'$ that is arbitrarily close to $\sigma$ with $u(\sigma') > u(\sigma)$.

Let $k$ be the largest integer such that, for all subsets $S$ of players with $|S| \le k$, the expected payoff is constant across all joint deviations of the players in $S$ to actions in the supports of their strategies, i.e., $u(a_S, \sigma_{-S}) = u(\sigma)$ for every $a_S$ with $a_j \in \operatorname{supp}(\sigma_j)$ for all $j \in S$. As $\sigma$ is a non-degenerate Nash equilibrium, $1 \le k < N$.

By definition of $k$, there exists a subset of players $S$ with $|S| = k + 1$ and a choice of support actions $a_S$ for which the expected payoff is not constant across the available actions of some player $i \in S$. Denote player $i$'s best response to the joint deviation of the other members of $S$ as $a_i^*$, and note that $u(a_i^*, a_{S \setminus \{i\}}, \sigma_{-S}) > u(\sigma)$.

To construct $\sigma'$, modify $\sigma$ by letting player $i$ mix according to $\sigma_i$ with probability $1 - \epsilon$ and play action $a_i^*$ with probability $\epsilon$. Similarly, let each other player $j \in S$ mix according to $\sigma_j$ with probability $1 - \epsilon$ and play their action specified by $a_S$ with probability $\epsilon$. Because we allow $\epsilon$ to be arbitrarily small, all we have left to show is $u(\sigma') > u(\sigma)$.

Observe as before that we can break up $u(\sigma')$ into cases based on the number of players in $S$ who deviate according to the modified probability $\epsilon$:
$$u(\sigma') \;=\; \sum_{T \subseteq S} \epsilon^{|T|} (1 - \epsilon)^{|S| - |T|}\; u(a^*_T, \sigma_{-T}),$$
where $a^*_T$ denotes the prescribed deviation restricted to the players in $T$ (action $a_i^*$ for player $i$ and the actions from $a_S$ for the others). By construction, every value in this expected value calculation is equal to $u(\sigma)$ except for the last value, $u(a_i^*, a_{S \setminus \{i\}}, \sigma_{-S})$, which is greater than $u(\sigma)$. We conclude $u(\sigma') > u(\sigma)$. ∎

Appendix F GAMUT Details and Additional Experiments

F.1 GAMUT Games

In Section 6.1, we analyzed all three classes of symmetric GAMUT games: RandomGame, CoordinationGame, and CollaborationGame. Below, we give a formal definition of these game classes:

Definition F.1.1.

A RandomGame with $N$ players and $A$ actions gives every player the same set of $A$ actions and draws a payoff independently from a fixed continuous distribution for each unordered action profile.

Definition F.1.2.

A CoordinationGame with $N$ players and $A$ actions likewise gives every player the same set of $A$ actions. For each unordered action profile in which all players choose the same action, it draws a payoff from one continuous distribution; for all other unordered action profiles, it draws a payoff from a second continuous distribution over lower values, so that coordinating on a common action is rewarded.

Definition F.1.3.

A CollaborationGame with $N$ players and $A$ actions likewise gives every player the same set of $A$ actions. For each unordered action profile in which all players choose the same action, the payoff is 100; for all other unordered action profiles, it draws a payoff from a continuous distribution whose support excludes 100.

Note that these games define payoffs for each unordered action profile because the games are totally symmetric (Definition 3.2.3). Table 2 gives illustrative examples.

                          Player 2 plays action 1   Player 2 plays action 2
Player 1 plays action 1              a                          b
Player 1 plays action 2              b                          c

Table 2: A payoff matrix with N = 2 and A = 2 to illustrate GAMUT games, writing a, b, and c for the common payoffs of the three unordered action profiles. In a RandomGame, a, b, and c are i.i.d. draws from the payoff distribution. In a CoordinationGame, a and c are i.i.d. draws from the higher-valued distribution while b is a draw from the lower-valued one. In a CollaborationGame, a = c = 100, and b is a draw from the distribution whose support excludes 100.
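For illustration, a minimal sampler for these classes might look as follows; the uniform payoff ranges are placeholders of our own choosing rather than the exact distributions used in the experiments:

```python
import itertools
import random

def unordered_profiles(num_players, num_actions):
    """All unordered action profiles (multisets) of a totally symmetric game."""
    return list(itertools.combinations_with_replacement(range(num_actions), num_players))

def sample_random_game(num_players, num_actions, low=-100.0, high=100.0):
    """RandomGame: one independent continuous payoff per unordered profile."""
    return {p: random.uniform(low, high)
            for p in unordered_profiles(num_players, num_actions)}

def sample_collaboration_game(num_players, num_actions, low=-100.0, high=99.0):
    """CollaborationGame: payoff 100 when all players match, random otherwise."""
    return {p: 100.0 if len(set(p)) == 1 else random.uniform(low, high)
            for p in unordered_profiles(num_players, num_actions)}

def common_payoff(payoffs, ordered_profile):
    """Common payoff of an ordered action profile, looked up by its sorted key."""
    return payoffs[tuple(sorted(ordered_profile))]

game = sample_random_game(num_players=3, num_actions=2)
print(common_payoff(game, (1, 0, 1)))
```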

F.2 Proof of Non-degeneracy in GAMUT

Proposition F.2.1.

Drawing a degenerate game is a measure-zero event in RandomGames, CoordinationGames, and CollaborationGames, i.e., these games are almost surely non-degenerate.

Proof.

By Definition 5.0.3, in order for a game to be degenerate, there must exist a player , a set of actions for the other players , and a pair of actions with . In RandomGames, CoordinationGames, and CollaborationGames, and

are continuous random variables that are independent of each other. (Or, in the case of a CollaborationGame,

may be a fixed value outside of the support of .) So is a measure-zero event. ∎
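As a quick empirical illustration, the following sketch (again using the unordered-profile representation and a placeholder uniform distribution) checks a sampled RandomGame for the ties that Definition 5.0.3 forbids; with continuous draws, it should almost surely report no degeneracy:

```python
import itertools
import random

def is_degenerate_symmetric(payoffs, num_players, num_actions):
    """Totally symmetric case of Definition 5.0.3: by symmetry it suffices to
    look for two own actions that tie against some pure profile of the others."""
    for others in itertools.combinations_with_replacement(range(num_actions), num_players - 1):
        values = [payoffs[tuple(sorted(others + (a,)))] for a in range(num_actions)]
        if len(set(values)) < len(values):
            return True
    return False

payoffs = {p: random.uniform(-100, 100)
           for p in itertools.combinations_with_replacement(range(2), 3)}
print(is_degenerate_symmetric(payoffs, num_players=3, num_actions=2))  # almost surely False
```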

F.3 Replicator Dynamics

Consider a game where all players share the same action set $\mathcal{A}$, and consider a totally symmetric strategy profile in which every player plays the same mixed strategy $x$ over $\mathcal{A}$. In the replicator dynamic, each action can be viewed as a species, and $x$ defines the distribution of each individual species (action) in the overall population (of actions). At each iteration of the replicator dynamic, the prevalence of an individual species (action) grows in proportion to its relative fitness in the overall population (of actions). In particular, the replicator dynamic evolves $x$ over time, for each action $a \in \mathcal{A}$, as follows:
$$\dot{x}(a) \;=\; x(a)\,\bigl(u(a, x_{-i}) - u(x)\bigr),$$
where $u(a, x_{-i})$ is the expected common payoff when one player plays $a$ and all other players play $x$, and $u(x)$ is the expected common payoff when every player plays $x$.
To simulate the replicator dynamic with Euler’s method, we need to choose a stepsize and a total number of iterations. Experimentally, we found the fastest convergence with a stepsize of 1, and we found that 100 iterations sufficed for convergence; see Figure 4. For good measure, we ran 10,000 iterations of the replicator dynamic in all of our experiments.

Figure 3: The magnitude of the replicator dynamics update step, averaged over 10,000 RandomGames. (In this simulation only, we rescaled the payoffs of the RandomGames.) Although this plot indicates that the replicator dynamics converge by 100 iterations, we ran 10,000 iterations for good measure in all of our experiments.
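For concreteness, here is a minimal sketch of the Euler simulation described above, for a totally symmetric game stored as a dictionary over unordered action profiles (as in Appendix F.1). This is an illustration rather than the released experiment code, and the simplex projection at the end is our own safeguard:

```python
import itertools
import numpy as np

def payoff_of_action(payoffs, strategy, num_players, action):
    """Expected common payoff when one player plays `action` and the rest play `strategy`."""
    value = 0.0
    for rest in itertools.product(range(len(strategy)), repeat=num_players - 1):
        prob = float(np.prod([strategy[a] for a in rest]))
        value += prob * payoffs[tuple(sorted((action,) + rest))]
    return value

def replicator_step(payoffs, strategy, num_players, stepsize=1.0):
    """One Euler step of the single-population replicator dynamic."""
    fitness = np.array([payoff_of_action(payoffs, strategy, num_players, a)
                        for a in range(len(strategy))])
    average = float(strategy @ fitness)   # expected payoff of the current mixture
    new = strategy + stepsize * strategy * (fitness - average)
    new = np.clip(new, 0.0, None)         # safeguard: project back onto the simplex
    return new / new.sum()

# Usage: run the dynamic on a small random symmetric game.
rng = np.random.default_rng(0)
profiles = itertools.combinations_with_replacement(range(3), 2)
payoffs = {p: rng.uniform(-100, 100) for p in profiles}
x = np.ones(3) / 3
for _ in range(100):
    x = replicator_step(payoffs, x, num_players=2)
print(x)
```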

We are interested in the replicator dynamic for two reasons. First, it is a model for how agents in the real world may collectively arrive at a symmetric solution to a game (e.g., through evolutionary pressure). Second, it is a learning algorithm that performs local search in the space of symmetric strategies. In our experiments of Appendix F.5, we find that using the replicator dynamic as an optimization algorithm is competitive with Sequential Least SQuares Programming (SLSQP), a local search method from the constrained optimization literature (Kraft, 1988; Virtanen et al., 2020).
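As one possible way to set up such a local search over symmetric strategies, the following sketch runs SciPy's SLSQP over the probability simplex; it reuses the unordered-profile payoff representation assumed above and is an illustration, not the released code:

```python
import itertools
import numpy as np
from scipy.optimize import minimize

def symmetric_payoff(payoffs, strategy, num_players):
    """Expected common payoff when every player plays the mixed strategy `strategy`."""
    value = 0.0
    for profile in itertools.product(range(len(strategy)), repeat=num_players):
        prob = float(np.prod([strategy[a] for a in profile]))
        value += prob * payoffs[tuple(sorted(profile))]
    return value

def slsqp_symmetric_optimum(payoffs, num_players, num_actions, seed=0):
    """Locally maximize the symmetric payoff over the probability simplex with SLSQP."""
    x0 = np.random.default_rng(seed).dirichlet(np.ones(num_actions))  # random interior start
    result = minimize(
        lambda x: -symmetric_payoff(payoffs, x, num_players),
        x0,
        method="SLSQP",
        bounds=[(0.0, 1.0)] * num_actions,
        constraints=[{"type": "eq", "fun": lambda x: np.sum(x) - 1.0}],
    )
    return result.x, -result.fun
```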

F.4 What Fraction of Symmetric Optima are Local Optima in Possibly-asymmetric Strategy Space?

As discussed in Section 6.2, we would like to get a sense for how often symmetric optima are stable in the sense that they are also local optima in possibly-asymmetric strategy space (see Definition 5.0.1). Table 3 shows in what fraction of games the best solution we found is unstable.

RandomGame
       A=2    A=3    A=4    A=5
N=2   0.36   0.44   0.44   0.50
N=3   0.38   0.49   0.59   0.60
N=4   0.42   0.45   0.46   0.46
N=5   0.45   0.48   0.49   0.47

CoordinationGame
       A=2    A=3    A=4    A=5
N=2   0      0      0      0
N=3   0      0      0      0
N=4   0      0      0      0
N=5   0      0      0      0

Table 3: The fraction of games whose symmetric optima are mixed. By Theorem 5.0.4, these symmetric equilibria are the ones unstable in the sense of Definition 5.0.1. Numbers in the table were empirically determined from 100 randomly sampled games per GAMUT class.

F.5 How Often do SLSQP and the Replicator Dynamic Find an Optimal Solution?

As discussed in Section 6.3, Table 4 and Table 5 show how often SLSQP finds an optimal solution, while Table 6 and Table 7 show how often the replicator dynamic finds an optimal solution.

RandomGame
       A=2    A=3    A=4    A=5
N=2   0.92   0.81   0.70   0.64
N=3   0.80   0.69   0.57   0.48
N=4   0.75   0.57   0.40   0.35
N=5   0.70   0.45   0.36   0.31

CoordinationGame
       A=2    A=3    A=4    A=5
N=2   0.59   0.50   0.40   0.33
N=3   0.53   0.38   0.28   0.29
N=4   0.53   0.37   0.29   0.26
N=5   0.53   0.36   0.33   0.25

Table 4: The fraction of single SLSQP runs that achieve the best solution found in our 20 total optimization attempts. Numbers in the table were empirically determined from 100 randomly sampled games per GAMUT class.

RandomGame
       A=2    A=3    A=4    A=5
N=2   1.00   0.99   0.99   0.98
N=3   1.00   0.99   1.00   0.96
N=4   1.00   0.96   0.94   0.88
N=5   0.98   0.90   0.88   0.91

CoordinationGame
       A=2    A=3    A=4    A=5
N=2   0.99   1.00   0.98   0.97
N=3   1.00   0.99   0.93   0.95
N=4   1.00   0.97   0.97   0.93
N=5   0.99   1.00   0.95   0.92

Table 5: The fraction of games in which at least 1 of 10 SLSQP runs achieves the best solution found in our 20 total optimization attempts. Numbers in the table were empirically determined from 100 randomly sampled games per GAMUT class.

RandomGame
       A=2    A=3    A=4    A=5
N=2   0.93   0.81   0.68   0.65
N=3   0.81   0.70   0.58   0.46
N=4   0.76   0.58   0.36   0.34
N=5   0.69   0.43   0.36   0.30

CoordinationGame
       A=2    A=3    A=4    A=5
N=2   0.58   0.45   0.40   0.33
N=3   0.57   0.35   0.29   0.27
N=4   0.53   0.37   0.28   0.25
N=5   0.51   0.33   0.33   0.24

Table 6: The fraction of single replicator dynamics runs that achieve the best solution found in our 20 total optimization attempts. Numbers in the table were empirically determined from 100 randomly sampled games per GAMUT class.

RandomGame
       A=2    A=3    A=4    A=5
N=2   1.00   1.00   1.00   1.00
N=3   0.99   1.00   0.95   0.96
N=4   1.00   0.98   0.91   0.91
N=5   0.98   0.97   0.92   0.87

CoordinationGame
       A=2    A=3    A=4    A=5
N=2   1.00   1.00   0.99   0.94
N=3   1.00   0.97   0.93   0.96
N=4   0.99   1.00   0.93   0.92
N=5   1.00   0.98   0.96   0.90

Table 7: The fraction of games in which at least 1 of 10 replicator dynamics runs achieves the best solution found in our 20 total optimization attempts. Numbers in the table were empirically determined from 100 randomly sampled games per GAMUT class.

F.6 How Costly is Payoff Perturbation under the Simultaneous Best Response Dynamic?

When a game's symmetric optimum is not stable in the sense of Definition 5.0.1, we would like to understand how costly the worst-case $\epsilon$-perturbation of the game can be. (See Definition D.0.1 for the definition of an $\epsilon$-perturbation of a game.) In particular, we study the case when individuals simultaneously update their strategies in possibly-asymmetric ways by defining the following simultaneous best response dynamic:

Definition F.6.1.

The simultaneous best response dynamic at $\sigma$ updates from strategy profile $\sigma$ to strategy profile $\sigma'$ with every $\sigma'_i$ a best response to $\sigma_{-i}$.

For each of the RandomGames in Section 6.2 whose symmetric optimum is not a local optimum in possibly-asymmetric strategy space, we compute the worst-case payoff perturbation for infinitesimal $\epsilon$. Then, we update each player's strategy according to the simultaneous best response dynamic at the symmetric optimum, evaluated in the perturbed game. This necessarily leads to a decrease in the original common payoff because the players take simultaneous updates on an objective that, after payoff perturbation, is no longer common. Table 8 reports the average percentage decrease in expected utility, which ranges from 55% to 89%. Our results indicate that simultaneous best responses after payoff perturbation in RandomGames can be quite costly.

RandomGame
       A=2     A=3     A=4     A=5
N=2   58.9%   55.9%   61.8%   64.6%
N=3   73.7%   70.9%   73.4%   73.7%
N=4   74.1%   77.4%   78.4%   82.5%
N=5   77.4%   84.9%   89.9%   87.5%

Table 8: The average decrease in expected utility that worst-case infinitesimal asymmetric payoff perturbations cause to unstable symmetric optima. To get these numbers, we first perturb payoffs in the 100 RandomGames from Section 6.2 whose symmetric optima are not local optima in possibly-asymmetric strategy space. Then, in each perturbed game, we compute a simultaneous best-response update to the symmetric optimum and record its decrease in expected utility.
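For concreteness, a minimal sketch of one step of the dynamic in Definition F.6.1 might look as follows, assuming the perturbed game is given as one payoff tensor per player; this is an illustration, not the released experiment code:

```python
import numpy as np

def simultaneous_best_response(payoff_tensors, strategies):
    """One step of the simultaneous best response dynamic (Definition F.6.1).

    payoff_tensors[i][a_1, ..., a_N] is player i's payoff at the ordered pure
    profile (a_1, ..., a_N); strategies[i] is player i's current mixed strategy.
    Every player simultaneously switches to a pure best response against the
    others' *current* strategies.
    """
    new_strategies = []
    for i, tensor in enumerate(payoff_tensors):
        expected = tensor
        # Average out the other players' strategies, highest axis first so that
        # the remaining axis indices stay aligned with player indices.
        for j in reversed(range(len(strategies))):
            if j != i:
                expected = np.tensordot(expected, strategies[j], axes=([j], [0]))
        best = int(np.argmax(expected))       # pure best response for player i
        pure = np.zeros(len(strategies[i]))
        pure[best] = 1.0
        new_strategies.append(pure)
    return new_strategies
```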

Appendix G Code and Computational Resources

All of our code is available at https://github.com/scottemmons/coordination under the MIT License. With a reduced number of random seeds, we expect that it would be possible to reproduce the experiments in this paper on a modern laptop. To test a large number of random seeds, we ran our experiments for a few days on an Amazon Web Services c5.24xlarge instance.

Our code uses the following Python libraries:

  • Matplotlib (Hunter, 2007), released under “a nonexclusive, royalty-free, world-wide license,”

  • NumPy (Harris et al., 2020), released under the BSD 3-Clause “New” or “Revised” License,

  • pandas (Reback et al., 2021; Wes McKinney, 2010), released under the BSD 3-Clause “New” or “Revised” License,

  • SciPy (Virtanen et al., 2020), released under the BSD 3-Clause “New” or “Revised” License, and

  • SymPy (Meurer et al., 2017), released under the New BSD License.