1 Introduction
Most interesting real-world games and tasks involve separate and potentially competing objectives (they are multiagent in nature) and do not admit a single winning strategy that beats all others (i.e. they have some nontransitive element). Training agents that master these types of games poses several challenges. For example, in multiagent games performance is only defined in relation to other players rather than absolutely. Additionally, as a result of the nontransitive nature of these games, performance against one opponent can often be uninformative or even misleading about performance against other opponents.
In this paper we show that a number of the challenges associated with nontransitive multiagent environments can be addressed by ensuring the strategy space of our agents is maximally diverse. At a higher level, agents that are strategically diverse are able to estimate other agents’ performance more accurately, are better opponents to train other agents against and, overall, perform better at test time.
Given the importance of strategic diversity we thus pose the question of how to methodically train such diverse agents. A key observation is that training the same agent against different opponents results in varying exploration behaviour during training as well as different final performance of said agent. We use this insight to our advantage and train strategically diverse agents by systematically selecting the opponents that an agent encounters during training.
We consider two-player games and train populations of agents by setting each agent against a specific subset of the rest of the population. In order to characterise these different sets of opponents we introduce the notion of interaction graphs, which describe the training interactions between agents in a population. Depending on the properties of the graph the resulting population will exhibit varying levels of strategic diversity and performance. We study this effect for a number of different graphs on a modified version of Rock-Paper-Scissors, Blotto and StarCraft.
It is important to note that the setup analysed here is qualitatively different from a standard approach, where a best response is found with respect to a distribution of previously trained (fixed) agents [16, 26]. Instead, an interaction graph describes a matchmaking schedule for co-training players, and thus imposes a continuous dynamical system on their evolution.
The contributions of this paper are threefold:

We provide evidence of the importance of diversity at various levels of training in multiagent, nontransitive games.

We introduce the interaction graph framework to describe the control of training interactions in populations.

We analyse the effect of different population graphs on the resulting populations for three different games.
2 Related work
The research in this paper shares many ideas with early work on coevolution, where agents are trained and evaluated against other agents that are being trained at the same time [22, 20, 4]. Within coevolution, a concept that is particularly related is that of Nash memory [6, 21]. One could think of our approach as an amortised version of Nash memory with a single fixed-sized memory of strategies. In recent years coevolution has gained a lot of interest in the context of population-based training [14, 18, 19].
Also related to our line of work is research on diversity in reinforcement learning, which is often studied in the context of intrinsic motivation and skill discovery [5, 8, 13, 11]. The ability to discover diverse sets of options has also recently been considered from a generalisation and meta-learning point of view [17, 9, 7]. The notion of strong diversity that we are interested in has also been studied in evolutionary computation under the name of quality diversity [23].
Making use of graphs to capture interactions between agents in a population has also been proposed by Guestrin in the context of coordination graphs [12]. However, this method assumes a graph of interactions between agents in a population is known and makes use of it to approximate a global value function. Our approach on the other hand only models separate value functions and does not require prior knowledge about the roles of agents in a population or the structure of their interactions.
Finally, some of the graphs we suggest can be seen as co-training versions of previously introduced training scheduling algorithms such as self-play [25], PSRO [16] and more recent work on PSRO rectified Nash [1]. As such, interaction graphs can be seen as a unifying language to describe agent matching algorithms in population-based learning.
3 Definitions
Populations of agents
In this paper we focus on population-based methods. Our general setup consists of fixed-sized populations of deterministic reinforcement learning (RL) agents that we train through pairwise interactions on two-player games. From a game-theoretic point of view these deterministic agents represent strategies, and mixtures of strategies can therefore be obtained as mixtures of agents.
Empirical evaluation matrix
Following [1], we define the antisymmetric function $\phi(v, w)$ that evaluates the performance of any pair of opponents $v$ and $w$ on zero-sum two-player games. We say $v$ beats $w$ if $\phi(v, w) > 0$, and we capture all pairwise evaluations between the agents $v_i$ of a population $P$ and the agents $w_j$ of a second population $Q$ in the empirical evaluation matrix $A_{P,Q}$ with entries $\phi(v_i, w_j)$.
Relative population performance
From the evaluation matrix $A_{P,Q}$ we can calculate the relative population performance of $P$ against $Q$ as:
$$v(P, Q) = p^\top A_{P,Q} \, q$$
where $p$ and $q$ are the Nash equilibria (see appendix for definition) for $P$ and $Q$ calculated from $A_{P,Q}$.
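As a rough illustration of how the relative population performance could be computed, the sketch below approximates a Nash equilibrium of a zero-sum evaluation matrix with fictitious play and then evaluates the bilinear form $p^\top A q$. The function names, the choice of fictitious play as the solver, and the iteration count are assumptions made for this example; the paper does not state which solver it uses, and a linear-programming solver would give an exact equilibrium.

```python
def fictitious_play(A, iters=5000):
    """Approximate a Nash equilibrium (p, q) of the zero-sum matrix game A.

    The row player maximises p^T A q, the column player minimises it.
    The returned mixtures are empirical play frequencies, so they are
    only approximate after a finite number of iterations.
    """
    n, m = len(A), len(A[0])
    row_counts = [0] * n
    col_counts = [0] * m
    # Arbitrary initial play: both players start with their first strategy.
    row_counts[0] += 1
    col_counts[0] += 1
    for _ in range(iters):
        # Payoff of each row strategy against the empirical column mixture.
        row_values = [sum(A[i][j] * col_counts[j] for j in range(m)) for i in range(n)]
        # Payoff conceded by each column strategy to the empirical row mixture.
        col_values = [sum(A[i][j] * row_counts[i] for i in range(n)) for j in range(m)]
        row_counts[row_values.index(max(row_values))] += 1
        col_counts[col_values.index(min(col_values))] += 1
    total_r, total_c = sum(row_counts), sum(col_counts)
    return [c / total_r for c in row_counts], [c / total_c for c in col_counts]


def relative_population_performance(A, p, q):
    """Evaluate p^T A q for row mixture p and column mixture q."""
    return sum(p[i] * A[i][j] * q[j] for i in range(len(p)) for j in range(len(q)))


# Rock-Paper-Scissors evaluated against itself: neither population is ahead.
rps = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]
p, q = fictitious_play(rps)
v_rps = relative_population_performance(rps, p, q)
```

For the antisymmetric Rock-Paper-Scissors matrix the value stays close to zero, which matches the intuition that a population has no advantage over itself.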
Effective diversity
There are typically far more ways of acting incompetently than there are of acting effectively. Therefore, it is necessary to couple the notion of diversity with a measure of effectiveness. Effective diversity [1] measures the ‘useful’ diversity of a population of agents and is defined as:
$$d(P) = p^\top \lfloor A_P \rfloor_+ \, p$$
where $A_P$ is the payoff matrix between all of the agents in $P$ against each other and $p$ is the Nash equilibrium of $A_P$. Computing the rectifier $\lfloor x \rfloor_+ = \max(x, 0)$ elementwise implies that effective diversity is, roughly, a measure of the variety of ways agents can beat each other. Using the Nash distribution – rather than, say, the uniform distribution – ensures that effective diversity is relative to the best performing or least exploitable agents. In particular, if there is a single dominant agent then effective diversity is zero.
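The following sketch shows one way to evaluate effective diversity, assuming the definition from [1]: the rectified payoff matrix weighted on both sides by the Nash distribution. The function name is hypothetical and the Nash distribution is passed in as an argument rather than solved for; for the two toy matrices used here it is known in closed form.

```python
def effective_diversity(A, nash):
    """Return nash^T relu(A) nash for a square payoff matrix A."""
    n = len(A)
    return sum(
        nash[i] * max(A[i][j], 0.0) * nash[j]
        for i in range(n)
        for j in range(n)
    )


# Rock-Paper-Scissors: the Nash is uniform and every agent beats one other.
rps = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]
d_rps = effective_diversity(rps, [1 / 3, 1 / 3, 1 / 3])

# Transitive population: agent 0 beats everyone, so the Nash puts all of
# its mass on agent 0 and no "win" is counted.
dominated = [[0, 1, 1], [-1, 0, 1], [-1, -1, 0]]
d_dominated = effective_diversity(dominated, [1.0, 0.0, 0.0])
```

The cyclic population has three positive payoffs, each weighted by 1/9, while the population with a single dominant agent has zero effective diversity, in line with the remark above.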
Effective diversity in non-zero-sum games
We can extend the effective diversity measure to non-zero-sum games by subtracting the relative population performance from the payoff matrix first:
Intuitively, this measure ‘counts’ the number of exploits (payoffs above the relative performance of the population) that occur in the Nash.
4 Strategic Diversity and Multi-Agent training
Consider the task of training an agent that will compete in an online gaming league. In multiagent environments we can distinguish between three functionally different roles that will be involved in training such an agent.

Learner: this is the agent we are ultimately interested in and that will compete online. Our goal is to learn the right parameters for this agent from interactions with other agents.

Trainers: a single agent or a set of agents that play against the learner to generate the experience that the learner uses for its updates. For example, a set of previously trained agents or human players.

Evaluators: a single agent or a set of agents that are used to evaluate the performance of the agent. In the case of the online game this could be all other players worldwide, who the learner agent is ranked against.
Given that in this paper we will be working with populations we will refer to these as learner population, trainer population and evaluator population instead of as agents from now on, but the same principles hold. While these three are functionally different, in practice a population can carry out several roles (often the trainers are used to evaluate the agent and in some cases the trainers are learning themselves). In the following we set out to prove that strategic diversity is necessary across all three types of populations in order to obtain strong final performance of the learner population.
4.1 Why evaluators should be strategically diverse
In multiplayer games the outcome is a function of several players. As such, we can only evaluate the performance of an agent or population of agents relative to an opponent . One way to define a measure of absolute performance is to evaluate against the entire agent space and choose the worst-case performance.
(1) 
This performance metric implies the learning goal is to find the least exploitable population, i.e. the population that maximises the objective (1) (as is the case in the context of the minimax theorem in game theory).
In practice we can only approximate by comparing the performance of population to a finite set of evaluator populations . A diverse evaluator population may contain an approximate best response to , i.e., for any given there exists an agent in that can exploit approximately as well as any agent of the entire space . The quality of the approximate evaluation of against can be viewed as a distance in functional space . As such, in order to be able to estimate the performance of our populations correctly, our goal is to find a that minimises it. We therefore argue that strategic diversity of the evaluator population is instrumental for something as fundamental as defining the performance of the learner population in the first place.
4.2 Why trainers should be strategically diverse
If our goal is to train agents that perform well according to the notion of performance introduced in (1) then we aim at solving a maximin problem of the form
(2) 
This optimization problem is challenging as it has been shown that standard gradient methods may not converge even on very simple domains [2]. One way to circumvent this issue is to consider the training of a main agent against a fixed population,
(3) 
As argued in Section 4.1, with a diverse enough population the latter objective is a reasonable approximation of the original optimization problem (2). Moreover, almost everywhere, the gradients of can be computed in a tractable way by using the identity
where is a best response against . Assuming that the payoff is differentiable, standard gradient-based algorithms can be used to find an approximate solution of (2). This type of population-based training is called prioritised fictitious self-play and was introduced by [26] to successfully train agents to play the game of StarCraft II. The particular instance of this method used by [26] solves a smooth version of (3) where the agents are sampled according to a softmax of their performance against a ‘main agent’. [26] put a major emphasis on obtaining a diverse population of ‘main agents’ and ‘exploiters’ to successfully train the ‘main agents’ via prioritised fictitious self-play.
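The smoothed opponent selection described above can be sketched as follows. This is only an illustration of sampling opponents from a softmax over their performance against the main agent; the function names and the `temperature` parameter are assumptions made for this example and are not taken from [26].

```python
import math
import random


def softmax_opponent_weights(opponent_scores, temperature=1.0):
    """Softmax over each opponent's performance against the main agent.

    opponent_scores[k] is how well opponent k does against the main agent
    (higher means a harder opponent). `temperature` is a hypothetical
    smoothing parameter: small values approach the hard worst case.
    """
    top = max(opponent_scores)
    # Subtract the maximum before exponentiating for numerical stability.
    exps = [math.exp((s - top) / temperature) for s in opponent_scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_opponent(opponent_scores, rng, temperature=1.0):
    """Draw the index of the next training opponent."""
    weights = softmax_opponent_weights(opponent_scores, temperature)
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


scores = [0.2, 0.5, 0.9]  # illustrative win rates of three opponents
smooth = softmax_opponent_weights(scores, temperature=1.0)
sharp = softmax_opponent_weights(scores, temperature=0.01)
opponent = sample_opponent(scores, random.Random(0))
```

With a large temperature all opponents are sampled at similar rates; as the temperature shrinks, almost all probability mass moves to the opponent that exploits the main agent best, which recovers the hard minimum over the population.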
4.3 Why learners should be strategically diverse
The need for learners to be strategically diverse is a result of the cyclic nature of nontransitive games. A game is nontransitive if, whenever player $u$ beats player $v$ and $v$ beats player $w$, it does not follow that $u$ beats $w$ (rock, paper, scissors is perhaps the best-known example of a nontransitive game). These games involve strategic tradeoffs: following a particular strategy will beat a subset of the remaining strategies but also lose to a different subset. As there is no "best" strategy, it is necessary for a population to cover a diverse set of strategies in order to perform well against the evaluator population at test time.
5 Interaction Graphs
5.1 Influencing agent training in multiplayer games
Suppose we have a reinforcement learning agent whose policy is parametrised by some nonlinear function approximator with parameters . The parameter updates of are computed from the rewards collected by . In the case of multiagent environments this reward will be some function of the agent’s policy as well as that of its opponent . As a result, there are two ways of influencing the parameter updates of : (1) at an individual agent level via the parametrisation and update rule of the policy or (2) at a population level by choosing different opponents .
In this paper we opt for the second, population-level approach and define structured population-level objectives via interaction graphs that specify the objectives of each agent in the population in terms of (mixtures of) other agents. As a small motivating example of how strongly training against different opponents can affect an agent, we carry out the following experiment: we train two populations of RL agents on a nontransitive game. Each of the agents is trained either against every other agent in the population (setting 1) or only against itself (setting 2). We can visualise the different policies that the agents learn during training as trajectories in a 2D strategy space (Fig 1). As shown in the plots, different sets of opponents lead to radically different exploration behaviours of the populations. In the case of multiagent training it is therefore imperative to consider the population-level interactions in addition to the agent-level objectives.
5.2 Interaction graphs
We have argued that high effective diversity is a desirable property of learner, trainer and evaluator populations. Our goal is thus to develop systematic tools for training such diverse populations. In practice, when training a new learner population one often does not have access to a trainer population. In these cases the learners themselves are used as trainers. Training every agent in a population against all other agents, however, would mean all agents have the same objective, which is not conducive to diverse behaviour.
Instead, we propose to train each agent against a specific subset of the agents in the same population that will act as that agent’s particular trainer population. We express these relations via interaction graphs. The nodes of the interaction graph correspond to the agents of the population and the weights of the edges indicate to what extent the experience obtained against another agent $j$ in the population is used to update the parameters of agent $i$. Crucially, the graph can be directed, meaning that $i$ might care about beating $j$, while the reverse might not be the case. This allows for specialisation. In addition the weights of the edges can change dynamically over training, such that the agents’ objectives can be adapted to the strategies covered by the population. Previous population-based training regimes can be expressed as a graph as well (all-to-all, self-play, PSRO, etc.). Other lines of research that are related to the concept of interaction graphs include [15, 16, 24].
5.3 Restricting the Information Flow in Populations
By choosing only a subset of the population as the trainer for each agent we are restricting the information flow between the agents of the population. In the following we show why this restriction can be beneficial for games that have nontransitive strategy spaces.
Suppose that the evaluation function $\phi$ is differentiable, and introduce the shorthand $\phi_w(v) := \phi(v, w)$ to emphasise that the weights of the opponent $w$ are fixed and only $v$ is to undergo training. Consider the gradients $\nabla_v \phi_{w_i}(v)$ obtained against a set of opponents $\{w_1, \dots, w_n\}$. Further, we say a game is monotone if $\phi$ factorises as $\phi(v, w) = \sigma(f(v) - f(w))$ where $f$ is a rating function that specifies the skill of each agent, and $\sigma$ is a monotone increasing function. If a game is monotone, then the performance of agents simply comes down to the difference in their ratings. The Elo rating system, widely used in Chess, assumes monotonicity.
Proposition 1.
If the game is monotone then gradients against all opponents have nonnegative inner product with one another:
$$\langle \nabla_v \phi_{w_i}(v), \nabla_v \phi_{w_j}(v) \rangle \geq 0 \quad \text{for all } i, j. \tag{4}$$
The inner product is zero if and only if $v$ is at a local maximum of the rating function $f$.
Proof.
By the chain rule, $\nabla_v \phi_w(v) = \sigma'(f(v) - f(w)) \cdot \nabla f(v)$. Since $\sigma$ is monotone increasing, we have that $\sigma'$ is always positive, and the result follows. ∎
Proposition 1 shows that there are no strategic tradeoffs in monotone games. It makes little difference which opponent you train against, because the gradients all point in the same direction: infinitesimally improving against one opponent leads to infinitesimal improvement against all. The only difference arises from the magnitude of $\sigma'$. For example, if $\sigma$ is the sigmoid, then training against much better opponents does not help because $\sigma$ saturates and the gradient vanishes.
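Proposition 1 can be checked numerically on a toy monotone game. The sketch below is an illustration under stated assumptions: the quadratic rating function, the centred sigmoid and the specific opponents are all hypothetical choices for this example, and gradients are estimated with central finite differences.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def rating(v):
    # Hypothetical rating function with a single maximum at (1, 2).
    return -((v[0] - 1.0) ** 2 + (v[1] - 2.0) ** 2)


def payoff(v, w):
    # Monotone game: an increasing function of the rating difference.
    # Subtracting 0.5 keeps the payoff antisymmetric.
    return sigmoid(rating(v) - rating(w)) - 0.5


def grad_wrt_v(v, w, eps=1e-6):
    """Central finite-difference gradient of payoff(v, w) with respect to v."""
    grad = []
    for k in range(len(v)):
        hi, lo = list(v), list(v)
        hi[k] += eps
        lo[k] -= eps
        grad.append((payoff(hi, w) - payoff(lo, w)) / (2 * eps))
    return grad


def inner(a, b):
    return sum(x * y for x, y in zip(a, b))


opponents = [[1.0, 2.0], [3.0, -1.0], [-2.0, 0.5]]

# Away from the rating maximum all gradients point the same way.
v = [0.0, 0.0]
grads = [grad_wrt_v(v, w) for w in opponents]
inner_products = [inner(gi, gj) for gi in grads for gj in grads]

# At the rating maximum the gradients vanish against every opponent.
grads_at_max = [grad_wrt_v([1.0, 2.0], w) for w in opponents]
```

Every pairwise inner product is positive at the first point, and the gradients vanish at the maximum of the rating function, as the proposition states. The gradient magnitudes differ across opponents because the sigmoid saturates for large rating differences.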
In contrast, in nontransitive games, training with – and improving performance against – one opponent can worsen performance against other opponents. Roughly, training against rock makes you more like paper and so you perform worse against scissors, see [1] for a detailed example. It follows that, in nontransitive games, training against mixtures can cause gradients to cancel out. It is therefore necessary to carefully control which gradients – and so which opponents – agents in a population are exposed to during training.
6 Experiments
6.1 Graphs
We compare nine different graph structures to start characterising the effect of restricting the training interactions within populations. We distinguish between fixed and adaptive interaction graphs:
Fixed interaction graphs are defined at the beginning of training and remain fixed throughout:

All-to-all: A fully connected graph, where every agent of the population trains against every other agent including itself. The objectives of all agents are the same and the information flow between agents is maximal. An example of agent populations trained with this type of interaction graph is [14].

Self-play: Every agent in the population trains independently against a past version of itself. The flow of information is minimal. The AlphaGo agent [25], for example, was trained using self-play.

Cycle: One directed cycle of the same length as the number of agents in the population. The motivation behind this type of graph is to reflect the cyclic nature of the game in the training regime.

Hierarchical cycle: All agents except for one are part of a directed cycle and the final agent has directed connections to all. This includes the cyclic structure of the Cycle graph, and an additional agent that learns a best response to everyone in the cycle.

PSRO: All agents in the population are numbered and every agent plays all the agents with a smaller index than itself. The idea behind this hierarchical structure is to have increasing levels of competence [16].
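The fixed graphs listed above can be written down as weight matrices. The sketch below is one possible encoding, not the paper's implementation: it assumes that `W[i][j]` is the weight with which agent `i` trains against agent `j`, that the cycle runs from `i` to `i + 1`, and that the first PSRO agent (which has no lower-indexed opponent) is left without edges. Self-play is encoded as a self-loop and does not model the "past version" aspect.

```python
def all_to_all(n):
    """Every agent trains against every agent, including itself."""
    return [[1.0] * n for _ in range(n)]


def self_play(n):
    """Every agent trains only against itself."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def cycle(n):
    """Agent i trains against agent (i + 1) mod n (direction is an assumption)."""
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        W[i][(i + 1) % n] = 1.0
    return W


def hierarchical_cycle(n):
    """Agents 0..n-2 form a directed cycle; agent n-1 trains against all of them."""
    W = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        W[i][(i + 1) % (n - 1)] = 1.0
    for j in range(n - 1):
        W[n - 1][j] = 1.0
    return W


def psro(n):
    """Agent i trains against every agent with a smaller index."""
    return [[1.0 if j < i else 0.0 for j in range(n)] for i in range(n)]


graphs = {name: build(4) for name, build in [
    ("all_to_all", all_to_all), ("self_play", self_play), ("cycle", cycle),
    ("hierarchical_cycle", hierarchical_cycle), ("psro", psro),
]}
```

Using a dense matrix keeps the five cases comparable: they differ only in which entries are non-zero, which makes the difference in information flow between the graphs explicit.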
Adaptive interaction graphs start out fully connected and the edges are then continuously updated during training according to some metric such as relative performance against the other agents in the population:

Play better: every agent only trains against those agents in the population that it is losing against as reflected in the payoff matrix.

Play worse: every agent only trains against those agents in the population that it is beating according to the payoff matrix.

Play worse and self: Same as play worse, but agents also train against themselves.

PSRO Rectified Nash: introduced in [1] and essentially play worse, but only for those agents that have support in the Nash. The rest do self-play.
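The adaptive graphs above can be derived from the current payoff matrix at every graph update. The sketch below assumes an antisymmetric payoff matrix `A` with `A[i][j] > 0` when agent `i` beats agent `j`, binary edge weights, and a Nash distribution supplied by the caller. The fallback to self-play for a supported agent that beats nobody is an assumption of this example, as the text does not specify that case.

```python
def play_better(A):
    """Agent i trains against the agents it is currently losing to."""
    n = len(A)
    return [[1.0 if A[i][j] < 0 else 0.0 for j in range(n)] for i in range(n)]


def play_worse(A, include_self=False):
    """Agent i trains against the agents it is currently beating.

    With include_self=True this becomes the 'play worse and self' graph.
    """
    n = len(A)
    return [
        [1.0 if A[i][j] > 0 or (include_self and i == j) else 0.0 for j in range(n)]
        for i in range(n)
    ]


def rectified_nash(A, nash, tol=1e-9):
    """Play worse for agents with Nash support, self-play for the rest."""
    n = len(A)
    W = []
    for i in range(n):
        row = [0.0] * n
        if nash[i] > tol:
            row = [1.0 if A[i][j] > 0 else 0.0 for j in range(n)]
        if not any(row):
            # No support, or no beaten opponent (assumed fallback): self-play.
            row[i] = 1.0
        W.append(row)
    return W


rps = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]
dominated = [[0, 1, 1], [-1, 0, 1], [-1, -1, 0]]
better = play_better(rps)
worse = play_worse(rps)
rect = rectified_nash(dominated, [1.0, 0.0, 0.0])
```

On Rock-Paper-Scissors, play better and play worse both produce directed cycles that run in opposite directions, while on the transitive example rectified Nash lets only the dominant agent train against others.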
6.2 Environments
GMMRPS
The motivation behind this continuous variant of the classic Rock-Paper-Scissors (RPS) game is to create a simple game that combines cyclic and transitive components. At a higher level, the game has cyclical dynamics just like RPS. In addition there is a transitive element of strength: for example, a stronger rock can beat a weaker rock. The game is illustrated in Fig 2; for a more detailed description see Section 9.2 in the appendix. We distinguish between GMMRPS(3), which has 3 modes (like rock, paper and scissors in RPS), and GMMRPS(7), which has 7 modes and as a result a stronger cyclic component than GMMRPS(3) as the modes lie closer to each other.
Colonel Blotto.
Colonel Blotto is a two-player, zero-sum resource distribution game. Two players are given tokens that they distribute simultaneously over areas. Whichever player has more tokens on an area wins that area and the player that wins the most areas wins the overall game. Colonel Blotto is well studied in game theory, where it is of particular interest because of its highly nontransitive strategy space. In this paper we consider a variant of Blotto with and .
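The rules above translate directly into a payoff function. The sketch below is a minimal illustration; the function name and the example with 6 tokens over 3 areas are hypothetical choices and are not the variant used in the experiments.

```python
def blotto_payoff(a, b):
    """Return +1 if allocation a wins, -1 if b wins and 0 for a draw.

    a and b list the tokens each player places on every area; both
    players must use the same number of areas and the same token budget.
    """
    if len(a) != len(b) or sum(a) != sum(b):
        raise ValueError("allocations must cover the same areas and budget")
    areas_a = sum(x > y for x, y in zip(a, b))  # areas won by player a
    areas_b = sum(y > x for x, y in zip(a, b))  # areas won by player b
    return (areas_a > areas_b) - (areas_a < areas_b)


# Three allocations of 6 tokens over 3 areas that beat each other in a cycle.
a, b, c = (3, 3, 0), (4, 0, 2), (0, 2, 4)
cycle_results = (blotto_payoff(b, a), blotto_payoff(c, b), blotto_payoff(a, c))
```

Here `b` beats `a`, `c` beats `b` and `a` beats `c`, a small instance of the nontransitive structure that makes Blotto a useful benchmark.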
StarCraft.
Finally, we include some experiments in the StarCraft II environment [27]. StarCraft II is a real-time, two-player strategy game with highly nontransitive game dynamics. In addition to being significantly more complex than the previously described games it also has a temporal element that the other two lack. The motivation behind including this benchmark is to test whether any of the findings obtained on the simple one-step games translate to more complex temporal games.
6.3 Evaluation metrics
In order to characterise the behaviours of populations obtained with different interaction graphs we consider several metrics. On a qualitative level we analyse the training trajectories of the agents by plotting the policy of the agent as it evolves over training iterations. In the case of GMMRPS the strategies live in 2D space so the plots are easily interpretable. On a quantitative level we measure the effective diversity and relative population performance (RPP) of the different populations. In order to measure the RPP we need an evaluator baseline. We use learned populations with high and low measured effective diversity as well as a ‘ground truth’ population containing all the strategies in the Nash if they are known. Finally, we also measure the convergence of the agents’ policies to evaluate whether they get stuck in cycles.
6.3.1 StarCraft
Given the complexity of StarCraft we evaluate the populations on different but related metrics.
Performance
To measure overall performance, we report the maximum (over agents in the population) average win rate against the test set . That is, the performance of a population is the maximum of the performances of its agents. It can be seen as an RPP metric where the test population is a single agent that plays a uniform mixture of the test set.
(5)  
We use this version of the performance because none of our populations was able to achieve a non-zero win rate against the entire test set, and so for each .
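The performance metric described above reduces to a maximum over per-agent averages. The sketch below assumes the win rates are stored as a nested list with one row per agent and one column per test agent; the function name and the numbers are illustrative only.

```python
def population_performance(win_rates):
    """Maximum over agents of the average win rate against the test set.

    win_rates[i][k] is the win rate of population agent i against test agent k.
    """
    return max(sum(row) / len(row) for row in win_rates)


# Two agents evaluated against two test agents (made-up values).
example_win_rates = [[0.5, 0.1], [0.4, 0.4]]
performance = population_performance(example_win_rates)
```

Averaging before taking the maximum corresponds to playing each agent against a uniform mixture of the test set, as stated in the text.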
Diversity
We first define the marginal distributions of whether given units were produced over the last 7e8 steps (there are Protoss units in the game of SC2, see Fig. 7 in the Appendix) by each agent in the population . We then use a proxy for diversity, where the distance between two agents and is simply the maximum squared distance over unit production probabilities. To define a population-level measure we take the minimum of this distance over all possible pairs of agents.
(6) 
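A minimal sketch of this diversity proxy, assuming each agent is summarised by a list of per-unit production probabilities (the function names and example values are hypothetical):

```python
from itertools import combinations


def agent_distance(probs_a, probs_b):
    """Maximum squared difference between two agents' unit production probabilities."""
    return max((x - y) ** 2 for x, y in zip(probs_a, probs_b))


def population_diversity(production_probs):
    """Minimum pairwise agent distance over the whole population."""
    return min(agent_distance(a, b) for a, b in combinations(production_probs, 2))


# Two agents that differ only in how often they build the first unit.
two_agents = [[0.0, 1.0], [0.5, 1.0]]
diversity_two = population_diversity(two_agents)

# Adding a duplicate of an existing agent drives the measure to zero.
with_duplicate = [[0.2, 0.8], [0.2, 0.8], [1.0, 0.0]]
diversity_dup = population_diversity(with_duplicate)
```

Taking the minimum over pairs means that a single pair of near-identical agents is enough to give the population a low score, which is consistent with the 0% entry reported for self-play in Table 1.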
Coverage
Given the complexity of the game of SC2 it is really hard to derive a meaningful measure of strategic coverage. However, we argue that in order to be prepared for every possible scenario in the game, agents at least need to perceive every unit in the game. We thus look at a simple proxy, the fraction of units that a population as a whole is creating over a series of episodes, to get a rough notion of strategic coverage. We count a unit as present if, over the last 7e8 training steps, it was created at least 5% of the time.
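The coverage proxy can be sketched as follows. The text does not say how per-agent statistics are pooled, so this example assumes a unit counts as present when at least one agent in the population created it in at least 5% of cases; the function name and data are hypothetical.

```python
def coverage(production_freqs, threshold=0.05):
    """Fraction of units produced often enough by at least one agent.

    production_freqs[i][u] is the fraction of the time agent i created unit u.
    Pooling with `any` over agents is an assumption of this sketch.
    """
    n_units = len(production_freqs[0])
    present = [
        any(agent[u] >= threshold for agent in production_freqs)
        for u in range(n_units)
    ]
    return sum(present) / n_units


# Two agents, four unit types: only the first two clear the 5% threshold.
freqs = [[0.9, 0.0, 0.01, 0.0], [0.0, 0.06, 0.04, 0.0]]
population_coverage = coverage(freqs)
```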
6.4 General population training
We train populations of four A2C agents on the GMMRPS and Blotto environments. Every 10000 frames sampled from environment interactions we update the interaction graph which determines the opponents for the actor. To sample opponents from the graph we first select agent 1 from the population with uniform probability and then sample agent 2 with probability proportional to the weights of the edges that go into the node of agent 1. This experience is then only used to update the weights of agent 1.
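The two-step opponent sampling described above can be sketched as follows. The sketch assumes the graph is stored as a weight matrix in which row `i` holds the weights of the edges going into the node of agent `i`; the function name is hypothetical and the handling of rows without positive weight is a choice made for this example.

```python
import random


def sample_match(W, rng):
    """Sample (learner, opponent) from the interaction graph W.

    The learner is drawn uniformly; the opponent is drawn with probability
    proportional to the learner's edge weights. Only the learner would be
    updated with the resulting experience.
    """
    learner = rng.randrange(len(W))
    if sum(W[learner]) <= 0:
        raise ValueError(f"agent {learner} has no opponents in the graph")
    opponent = rng.choices(range(len(W)), weights=W[learner], k=1)[0]
    return learner, opponent


# Directed cycle over four agents: agent i always meets agent i + 1.
cycle_graph = [[0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [1, 0, 0, 0]]
rng = random.Random(0)
matches = [sample_match(cycle_graph, rng) for _ in range(20)]
```

Because only the learner is updated, an edge from `i` to `j` does not imply that `j` ever learns from games against `i`, which is what makes directed graphs meaningful here.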
6.4.1 StarCraft
For the StarCraft experiments we use the PySC2 environment [27]. Each experiment uses a population of 4 agents, trained according to the population graph (as opposed to the setup of [26] we do not have checkpoints or exploiters). We use exactly the same architecture, action and observation spaces as well as pretraining procedure as [26]. However, we only trained agents to play one race (Protoss) on just one map (Kairos Junction) to limit the game complexity. Each job was run for 3 billion learner steps.
We use a set of pretrained simple agents that we refer to as test agents, that were hard coded to follow various strategies, following the description of [26].
7 Results
7.1 Training trajectories
We record the actions produced by the policy at the time of the interaction graph update throughout training. These actions can then be plotted on the 2D plane for the GMMRPS games to visualise how the agents’ policies are being updated over time. We plot the actions of each individual agent as a line and mark the last training step with a circle as shown in Fig. 3 for each interaction graph on the game of GMMRPS(3). In order to show the consistency of the observed exploration behaviour across runs we plot four repetitions of a population trained using the same interaction graph in Figure 8 in the Appendix. Overall the behaviour observed for each of the graphs is consistent. We can also see from these plots that the graphs that encourage specialisation (such as rectified Nash or Hierarchical Cycle) can also end in cycling behaviour. In this figure we also show the trajectories for GMMRPS(7). The cyclic component is stronger in this environment as the different modes are closer together. This is also reflected in the training trajectories as there is a higher number of populations that end up cycling. The graphs that encourage specialisation also display cycling behaviour but rather than cycling together what we observe is that the agents switch modes one at a time, while still covering a wider spread of the modes.
7.2 Diversity and performance
We train 15 populations on each of the nine interaction graphs described in section 6.1. Out of these 15 populations we choose the 10 with the highest effective diversity for each interaction graph. We calculate the effective diversity for each of the 90 populations using the equation from Section 3 and we take the average over the last 50 logged training iterations. The effective diversity results for the different interaction graphs on GMMRPS(3) and Blotto are shown in Figure 3 and on GMMRPS(7) in Figure 6 in the Appendix.
Similarly, we measure the relative population performance as described in section 3. When measuring the relative performance, however, we also require an evaluation population. As in the previous section we consider three evaluation populations: ground truth, high diversity and low diversity populations. We only include the ground truth population for the GMMRPS games as the strategy space of Blotto is too large to enumerate. We plot the relative population performance against the effective diversity in Figure 4 for GMMRPS(3) and in Figure 6 in the Appendix for GMMRPS(7).
7.3 Convergence vs Coverage
As discussed in Section 8, convergence of the individual agents is not as important for performance as population-level convergence. This means individual agents may still cycle as long as the population as a whole covers all strong strategies. To test this we quantify the coverage of different populations. For GMMRPS(3) in particular we measure how many of the Gaussian modes a population covers as a whole. As shown in Fig 3C, for this particular game populations that have good coverage tend to consist of agents which have individually converged. This, however, is not always the case. Populations trained with the ‘self-play’ interaction graph, for example, achieve comparable coverage despite all agents cycling throughout training. On the other hand there are also populations with high convergence that cover only a single mode. Furthermore, among the populations with the highest coverage there are varying degrees of convergence. In these cases most agents have converged, but a few are still cycling (in the case of the ‘cycle’ graph, for example, all agents are slowly cycling but only one at a time, which results in a high effective convergence, despite the fact that agents are still cycling). We have also plotted the convergence for the three different games (Fig 3D). As expected, populations that cycle in GMMRPS(3) cycle even more in GMMRPS(7) as the cyclic component is stronger. In addition, the interaction graphs that encourage convergence of individual agents in GMMRPS also do so in Blotto.
7.4 StarCraft
The learning curves of the populations trained on the different interaction graphs are shown in Fig. 5. The graphs that resulted in high performance on the simpler games do not necessarily perform well on the significantly more complex game of StarCraft. All-to-all, for example, performs very well, whereas a graph as simple as the Cycle does not perform well on a game with high strategic complexity. The scores for performance as well as diversity and coverage are summarised in Table 1. From these results we can also see that in the case of StarCraft plain diversity does not directly translate to performance. This is reflected both in the coverage and in the diversity numbers, which are not necessarily correlated with the performance scores obtained.
8 Discussion
We characterise the differences across all nine interaction graphs by looking at the evaluation metrics mentioned above. In the following we summarise the main findings.
Diverse evaluator populations estimate performance more accurately.
As shown in Figure 4 the performance of the 90 populations evaluated by an evaluator population with high effective diversity matches the ground truth (absolute) performance better than that of a low diversity evaluator population. This effect is even stronger as the game increases in complexity (see results on GMMRPS(7) in Figure 6 in the appendix). These results align with our claims in Section 4.1.
Table 1: Performance, coverage and diversity on StarCraft for each interaction graph.
Method        Performance   Coverage   Diversity
All-to-all    47%           89%        39%
Hier. Cycle   46%           74%        29%
Cycle         30%           74%        12%
Rect. Nash    43%           63%        3%
Self-play     44%           53%        0%
Diverse learner populations perform better.
Graphs influence the training behaviour of populations.
The training trajectories that result from training on different interaction graphs vary drastically as shown in Figure 3 for GMMRPS(3) (more trajectories as well as the trajectories on GMMRPS(7) can be found in Figure 8 in the appendix). Given the simplicity of the game most populations display one of two possible behaviours: the agents either all synchronise within a population or they cover different areas of the action space independently. In general, interaction graphs that allow for cyclic training interactions cover all modes, while those that don’t contain cycles end up cycling.
Graphs influence the effective diversity of populations.
We can further quantify this behaviour by measuring the average effective diversity obtained by populations trained on the different interaction graphs. The right hand plots in Figure 3 confirm the trend observed in the trajectory plots to the left, whereby interaction graphs that can contain cyclic relations between agents have a larger spread across action space. For GMMRPS and Blotto this spread also translates to higher effective diversity.
Graphs with cycles encourage specialisation and increase effective diversity in simple nontransitive games. As reflected in the training trajectories (Fig. 3) the directed nature of cycles allows individual agents to focus on a subset of the population (that does not necessarily focus on them) and thus to specialise. Populations trained with undirected graphs, on the other hand, tend to collapse to the same strategy as the symmetry in the connections means agents have the same objective. As a result populations trained with cyclic interaction graphs have higher effective diversity (Fig. 3B).
A fixed graph structure is powerful when it matches the underlying game structure, otherwise adapting graphs might be a better choice. The hierarchical cycle, for example, works well on the RPSlike games as it matches the underlying structure. It does not, however, perform as well on Blotto which has a richer strategy structure (compare effective diversity (Fig. 3) and RPP (Fig. 4) across games). The adaptive graphs, on the other hand, find a good approximation in either case.
Focusing on those that are better than you makes you less exploitable and focusing on those that are worse than you makes you a better exploiter. Focusing on opponents that beat you involves learning a best response against a more diverse set of strategies, which encourages agents to be more robust. Playing against those you are already beating, on the other hand, allows agents to specialise. As a result, populations might become more exploitable, as agents might get stuck on weak enemies. The rectified Nash interaction graph could be one way of remedying this, as agents only specialise on other agents that have support in the Nash.
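The three opponent-selection rules mentioned in this paragraph can be sketched from an empirical evaluation matrix. This is an illustrative reading rather than the exact procedure used in the experiments: it assumes `payoffs[i][j] > 0` means agent `i` beats agent `j`, and all function names and the toy matrix are hypothetical.

```python
def stronger_opponents(payoffs, i):
    """Opponents that currently beat agent i (focus here to become robust)."""
    return [j for j, v in enumerate(payoffs[i]) if j != i and v < 0]


def weaker_opponents(payoffs, i):
    """Opponents that agent i currently beats (focus here to specialise)."""
    return [j for j, v in enumerate(payoffs[i]) if j != i and v > 0]


def rectified_nash_opponents(payoffs, nash, i, tol=1e-9):
    """Opponents that agent i beats AND that have support in the Nash."""
    return [j for j in weaker_opponents(payoffs, i) if nash[j] > tol]


# Toy population: agents 0-2 form an RPS cycle, agent 3 loses to everyone.
payoffs = [[0, -1, 1, 1],
           [1, 0, -1, 1],
           [-1, 1, 0, 1],
           [-1, -1, -1, 0]]
nash = [1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0, 0.0]  # agent 3 has no Nash support

beaten_by = stronger_opponents(payoffs, 0)
beats = weaker_opponents(payoffs, 0)
rectified = rectified_nash_opponents(payoffs, nash, 0)
```

Here agent 0 beats both agent 2 and the weak agent 3, but the rectified rule drops agent 3 because it carries no Nash mass, which is how it avoids getting stuck on weak enemies.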
Individual convergence is not as important as population-level convergence for diversity and coverage. Individual agents may cycle as long as all important strategies are covered. We can see this in Figure 3C where the ‘One cycle’ interaction graph has low individual convergence but good coverage of the three GMMRPS modes at all times.
When moving to significantly more complex environments some fundamental insights hold, but some do not. We have chosen simple games as a starting point for our analysis. While most insights hold across these games, they may not translate to significantly more complex games such as StarCraft. In fact, some intuitions (e.g. the usefulness of directed graphs, the fact that the wrong fixed graph can hinder learning, or that focusing on agents you beat allows you to specialise) seem to agree with the results obtained on StarCraft. However, it is also clear that one should be careful when translating graphs or particular methods directly from very simple environments to more complex ones. Stark differences in game dynamics might lead to unexpected failure modes (e.g. the collapse of rectified Nash onto a single Nash agent that it cannot recover from) or unforeseen successes (e.g. the ability of ‘All-to-all’ to explore the strategy space).
8.1 Outlook
A large part of current machine learning algorithms rely on optimisation. As a result, the community has developed a powerful repertoire of tools – such as gradient descent, reinforcement learning, and evolutionary algorithms – for minimizing losses and maximizing discounted future rewards. Multiagent systems break this paradigm, because there is no longer a fixed objective. Performance depends on the behavior of the other agents or opponents in the system and there are often many ways of behaving well. It is therefore necessary to design algorithms that search the space of possible behaviors and find diverse, effective strategies. In this paper, we have explored population-based learning algorithms where the objectives of agents in the population are specified by interaction graphs. We find that training with certain graphs consistently yields populations of agents that are both stronger and more diverse than naive baselines. Interaction graphs provide a basic framework for reasoning about effective population-level objectives that encourage diversity and improve overall performance.
References
 [1] David Balduzzi, Marta Garnelo, Yoram Bachrach, Wojciech M Czarnecki, Julien Perolat, Max Jaderberg, and Thore Graepel. Open-ended learning in symmetric zero-sum games. International Conference on Machine Learning, 2019.
 [2] David Balduzzi, Sebastien Racaniere, James Martens, Jakob Foerster, Karl Tuyls, and Thore Graepel. The mechanics of n-player differentiable games. arXiv preprint arXiv:1802.05642, 2018.
 [3] David Balduzzi, Karl Tuyls, Julien Perolat, and Thore Graepel. Re-evaluating evaluation. In Advances in Neural Information Processing Systems, pages 3268–3279, 2018.
 [4] Rafał Dreżewski. A model of coevolution in multiagent system. In International Central and Eastern European Conference on MultiAgent Systems, pages 314–323. Springer, 2003.
 [5] Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function. International Conference on Learning Representations, 2018.
 [6] Sevan G Ficici and Jordan B Pollack. A game-theoretic memory mechanism for coevolution. In Genetic and Evolutionary Computation Conference, pages 286–297. Springer, 2003.
 [7] Chelsea Finn, Tianhe Yu, Tianhao Zhang, Pieter Abbeel, and Sergey Levine. One-shot visual imitation learning via meta-learning. In Conference on Robot Learning, pages 357–368, 2017.
 [8] Carlos Florensa, Yan Duan, and Pieter Abbeel. Stochastic neural networks for hierarchical reinforcement learning. International Conference on Learning Representations, 2016.
 [9] Kevin Frans, Jonathan Ho, Xi Chen, Pieter Abbeel, and John Schulman. Meta learning shared hierarchies. arXiv preprint arXiv:1710.09767, 2017.
 [10] Marta Garnelo, Wojciech Marian Czarnecki, Siqi Liu, Dhruva Tirumala, Junhyuk Oh, Gauthier Gidel, Hado van Hasselt, and David Balduzzi. Pick your battles: Interaction graphs as population-level objectives for strategic diversity. In Proceedings of the 20th International Conference on Autonomous Agents and MultiAgent Systems, pages 1501–1503, 2021.
 [11] Karol Gregor, Danilo Jimenez Rezende, and Daan Wierstra. Variational intrinsic control. International Conference on Learning Representations, 2017.
 [12] Carlos Guestrin, Michail Lagoudakis, and Ronald Parr. Coordinated reinforcement learning. In ICML, volume 2, pages 227–234. Citeseer, 2002.
 [13] Karol Hausman, Jost Tobias Springenberg, Ziyu Wang, Nicolas Heess, and Martin Riedmiller. Learning an embedding space for transferable robot skills. International Conference on Learning Representations, 2018.
 [14] Max Jaderberg, Wojciech M Czarnecki, Iain Dunning, Luke Marris, Guy Lever, Antonio Garcia Castaneda, Charles Beattie, Neil C Rabinowitz, Ari S Morcos, Avraham Ruderman, et al. Human-level performance in 3d multiplayer games with population-based reinforcement learning. Science, 364(6443):859–865, 2019.
 [15] Michael Kearns, Michael L Littman, and Satinder Singh. Graphical models for game theory. arXiv preprint arXiv:1301.2281, 2013.
 [16] Marc Lanctot, Vinicius Zambaldi, Audrunas Gruslys, Angeliki Lazaridou, Karl Tuyls, Julien Pérolat, David Silver, and Thore Graepel. A unified game-theoretic approach to multiagent reinforcement learning. In Advances in Neural Information Processing Systems, pages 4190–4203, 2017.
 [17] Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M Hospedales. Learning to generalize: Meta-learning for domain generalization. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
 [18] Siqi Liu, Guy Lever, Josh Merel, Saran Tunyasuvunakool, Nicolas Heess, and Thore Graepel. Emergent coordination through competition. International Conference on Learning Representations, 2018.
 [19] Kevin R. McKee, Ian Gemp, Brian McWilliams, Edgar A. Duéñez-Guzmán, Edward Hughes, and Joel Z. Leibo. Social diversity and social preferences in mixed-motive reinforcement learning. arXiv preprint arXiv:2002.02325, 2020.
 [20] Jason Morrison and Franz Oppacher. A general model of coevolution for genetic algorithms. In Artificial Neural Nets and Genetic Algorithms, pages 262–268. Springer, 1999.
 [21] Frans A Oliehoek, Edwin D De Jong, and Nikos Vlassis. The parallel nash memory for asymmetric games. In Proceedings of the 8th annual conference on Genetic and evolutionary computation, pages 337–344, 2006.
 [22] Jan Paredis. Coevolutionary computation. Artificial life, 2(4):355–375, 1995.
 [23] Justin K Pugh, Lisa B Soros, and Kenneth O Stanley. Quality diversity: A new frontier for evolutionary computation. Frontiers in Robotics and AI, 3:40, 2016.
 [24] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362(6419):1140–1144, 2018.
 [25] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354–359, 2017.
 [26] Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in starcraft ii using multiagent reinforcement learning. Nature, 575(7782):350–354, 2019.
 [27] Oriol Vinyals, Timo Ewalds, Sergey Bartunov, Petko Georgiev, Alexander Sasha Vezhnevets, Michelle Yeo, Alireza Makhzani, Heinrich Küttler, John Agapiou, Julian Schrittwieser, et al. Starcraft ii: A new challenge for reinforcement learning. arXiv preprint arXiv:1708.04782, 2017.
9 Appendix
9.1 Definitions
Nash on empirical games
Given an empirical evaluation matrix $\mathbf{A}$ as above, define the Nash as the minmax solution to the matrix game:
$$\max_{\mathbf{p}} \min_{\mathbf{q}} \; \mathbf{p}^{\top} \mathbf{A}\, \mathbf{q} \qquad (7)$$
where we write $(\mathbf{p}^{*}, \mathbf{q}^{*})$ for the corresponding distributions. In general there could be more than one pair of distributions that form a Nash equilibrium. There does exist a unique maxent Nash equilibrium with nice properties; see [3] for details. In practice, we find it convenient to use the Nash returned by an LP solver. If the matrix $\mathbf{A}$ is antisymmetric (which occurs, for example, when $\mathbf{A}$ is the empirical evaluation matrix of a population playing against itself) then it is shown in [3] that there are symmetric Nash equilibria of the form $(\mathbf{p}^{*}, \mathbf{p}^{*})$, i.e. where the Nash involves both meta-players choosing the same distribution.
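To make the symmetric case concrete, here is a small dependency-free sketch. It does not use an LP solver; instead it swaps in fictitious play (repeated best responses to the empirical mixture), which only approximates a symmetric Nash of an antisymmetric matrix game, and then checks the candidate by its exploitability. The function names, the step count and the toy rock-paper-scissors matrix are assumptions made for illustration.

```python
def exploitability(payoffs, p):
    """Best pure-response payoff against the mixture p.

    For an antisymmetric matrix, (p, p) is a Nash exactly when this is <= 0.
    """
    return max(sum(a * q for a, q in zip(row, p)) for row in payoffs)


def fictitious_play(payoffs, steps=20000):
    """Approximate a symmetric Nash by best-responding to past play."""
    n = len(payoffs)
    counts = [0] * n
    counts[0] = 1  # arbitrary initial action
    for _ in range(steps):
        # Payoff of each pure strategy against the (unnormalised) history.
        values = [sum(a * c for a, c in zip(row, counts)) for row in payoffs]
        counts[values.index(max(values))] += 1
    total = sum(counts)
    return [c / total for c in counts]


# Toy antisymmetric evaluation matrix: plain rock-paper-scissors.
rps = [[0.0, -1.0, 1.0],
       [1.0, 0.0, -1.0],
       [-1.0, 1.0, 0.0]]

approx_nash = fictitious_play(rps)
gap = exploitability(rps, approx_nash)
```

For this toy game the empirical mixture drifts towards the uniform distribution and the exploitability shrinks towards zero as the number of steps grows; an LP solver would return the equilibrium directly.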
9.2 GMMRPS Environment
The motivation behind this version of continuous RPS is to create an environment that has both a cyclic and a transitive component. The game consists of a 2D plane with three equidistant bivariate Gaussians that represent rock, paper and scissors respectively. A strategy corresponds to a point on that plane, defined by a pair of 2D coordinates. In Figure 2 we visualise three strategies A, B and C on the 2D plane along with their coordinates. These 2D coordinates are translated into rock-paper-scissors (RPS) proportions by measuring the pdf under each of the three Gaussians for every point (the RPS weights for A, B and C are shown on the right). In this particular setup A is predominantly a rock strategy, B an equal mixture between rock, paper and scissors and C an equal mixture between rock and paper. GMMRPS(3) deviates from continuous RPS in two ways:

We define a nonlinear mapping from $\mathbb{R}^2$ to $\mathbb{R}^3$ between the agent’s actions and the continuous RPS strategies. The idea behind this is to obtain some local optima around the pure strategies of rock, paper and scissors. More specifically, we define the action space to be a two-dimensional plane and place three equidistant bivariate Gaussians on this plane that each represent one of the pure strategies (rock, paper and scissors). Every action is therefore a point on the plane that is mapped into the space of continuous RPS by measuring the probability density function under each of the three Gaussians (see Fig. 2).
We also add a transitive component to the game by defining the game matrix to be:
(8)
The 0.5 on the diagonal means that a strategy with a stronger rock, for example, will beat a weaker rock.
GMMRPS(7) is equivalent but with seven Gaussians instead of three, so the mapping is from $\mathbb{R}^2$ to $\mathbb{R}^7$, and every strategy loses against half of the remaining strategies and beats the rest.
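The mapping from a 2D action to mode weights can be sketched as follows. The placement of the means evenly on a unit circle, the isotropic covariance and the value of `sigma` are assumptions chosen for illustration (the text only states that the Gaussians are equidistant), and the function names are hypothetical. The game matrix itself is not reproduced here.

```python
import math


def gaussian_means(n_modes, radius=1.0):
    """Place n_modes means evenly on a circle (assumed layout)."""
    return [(radius * math.cos(2.0 * math.pi * k / n_modes + math.pi / 2.0),
             radius * math.sin(2.0 * math.pi * k / n_modes + math.pi / 2.0))
            for k in range(n_modes)]


def mode_weights(action, means, sigma=0.5):
    """Density of a 2D action under each isotropic bivariate Gaussian."""
    x, y = action
    norm = 1.0 / (2.0 * math.pi * sigma ** 2)
    return [norm * math.exp(-((x - mx) ** 2 + (y - my) ** 2)
                            / (2.0 * sigma ** 2))
            for mx, my in means]


means3 = gaussian_means(3)                       # rock, paper, scissors
at_rock = mode_weights(means3[0], means3)        # action on the rock mean
at_centre = mode_weights((0.0, 0.0), means3)     # action at the centroid
weights7 = mode_weights((0.0, 0.0), gaussian_means(7))  # GMMRPS(7) analogue
```

An action placed on the rock mean is dominated by the rock weight, while an action at the centroid receives equal weight from all modes, mirroring strategies A and B in the description of Figure 2.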
9.3 Additional Results
9.3.1 GMMRPS(7)
We run the same experiments that were carried out for GMMRPS(3) on GMMRPS(7). As in the main text, we report the average effective diversity obtained with each interaction graph as well as a plot of the RPP against effective diversity with different evaluator populations (see Figure 6). The results obtained on this environment match those observed with GMMRPS(3), whereby graphs that allow for cycles have a higher diversity and higher diversity is correlated with higher RPP. As before, evaluator populations that are more diverse themselves are better at estimating the performance of learner populations than non-diverse ones.
9.3.2 Diversity in StarCraft
We visualise the diversity obtained with different interaction graphs by plotting the marginal distribution of units produced by each agent in Fig. 7.
9.3.3 Additional trajectories for GMMRPS
We show the training trajectories for several repetitions of training a population of agents with the different interaction graphs in Fig. 8. In these plots every row corresponds to a different experiment and every column to a different interaction graph.