In twoplayer zerosum (2p0s) settings, principled selfplay algorithms converge to a minimax equilibrium, which in a balanced game ensures that a player will not lose in expectation regardless of the opponent’s strategy (neumann1928theorie). This fact has allowed selfplay, even without human data, to achieve remarkable success in 2p0s games like chess (silver2018general), Go (silver2017mastering), poker (bowling2015heads; brown2017superhuman), and Dota 2 (berner2019dota).^{1}^{1}1Dota 2 is a twoteam zerosum game, but the presence of full information sharing between teammates makes it equivalent to 2p0s. Beyond 2p0s settings, selfplay algorithms have also proven successful in highly adversarial games like sixplayer poker brown2019superhuman. In principle, any finite 2p0s game can be solved via selfplay given sufficient compute and memory. However, in games involving cooperation, selfplay alone no longer guarantees good performance when playing with humans, even with infinite compute and memory. This is because in complex domains there may be arbitrarily many conventions and expectations for how to cooperate, of which humans may use only a small subset (lerer2019learning). The clearest example of this is language. A selfplay agent trained from scratch without human data in a cooperative game involving freeform communication channels would almost certainly not converge to using English as the medium of communication. Obviously, such an agent would perform poorly when paired with a human English speaker. Indeed, prior work has shown that naïve extensions of selfplay from scratch without human data perform poorly when playing with humans or humanlike agents even in dialoguefree domains that involve cooperation rather than just competition, such as the benchmark games nopress Diplomacy (bakhtin2021no) and Hanabi (siu2021evaluation; cui2021k).
Recently, jacob2022modeling introduced piKL, which models human behavior in many games better than pure behavioral cloning (BC) on human data by regularizing inference-time planning toward a BC policy. In this work, we introduce an extension of piKL, called DiL-piKL, that replaces piKL's single fixed regularization parameter λ with a probability distribution over λ parameters. We then show how DiL-piKL can be combined with self-play reinforcement learning, allowing us to train a strong agent that performs well with humans. We call this algorithm RL-DiL-piKL.

Using RL-DiL-piKL we trained an agent, Diplodocus, to play no-press Diplomacy, a difficult benchmark for multi-agent AI that has been actively studied in recent years (paquette2019no; anthony2020learning; gray2020human; bakhtin2021no; jacob2022modeling). We conducted a 200-game no-press Diplomacy tournament with a diverse pool of human players, including expert humans, in which we tested two versions of Diplodocus using different RL-DiL-piKL settings, as well as other baseline agents. All games consisted of one bot and six humans, with all players being anonymous for the duration of the game. The two versions of Diplodocus achieved the top two average scores in the tournament among all 48 participants who played more than two games, and ranked first and third overall among all participants according to an Elo ratings model.
2 Background and Prior work
Diplomacy is a benchmark 7-player mixed cooperative/competitive game featuring simultaneous moves and a heavy emphasis on negotiation and coordination. In the no-press variant of the game, there is no cheap-talk communication. Instead, players communicate only implicitly through their moves.
In the game, seven players compete for majority control of 34 "supply centers" (SCs) on a map. On each turn, players simultaneously choose actions consisting of an order for each of their units to hold, move, support, or convoy another unit. If no player controls a majority of SCs and either all remaining players agree to a draw or a turn limit is reached, then the game ends in a draw. In this case, we use a common scoring system in which the score of player i is C_i² / Σ_j C_j², where C_i is the number of SCs player i owns. A more detailed description of the rules is provided in Appendix B.
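For concreteness, the sum-of-squares scoring rule above can be sketched as follows (a minimal illustration; the function name is ours, not the paper's code):

```python
def sum_of_squares_scores(sc_counts):
    """Each player's score is C_i^2 / sum_j C_j^2, where C_i is their SC count."""
    total = sum(c * c for c in sc_counts)
    return [c * c / total for c in sc_counts]

# A drawn three-player position: the leader's score grows quadratically
# in supply centers, rewarding larger positions in a draw.
scores = sum_of_squares_scores([4, 2, 2])
assert abs(sum(scores) - 1.0) < 1e-9
assert abs(scores[0] - 16 / 24) < 1e-9
```

Note how a player with twice the centers receives four times the score of each rival, which is what makes SoS reward aggressive expansion even when an outright win is out of reach.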
Most recent successes in no-press Diplomacy use deep learning to imitate human behavior given a corpus of human games. The first Diplomacy agent to leverage deep imitation learning was paquette2019no. Subsequent work on no-press Diplomacy has mostly relied on a similar architecture with some modeling improvements (gray2020human; anthony2020learning; bakhtin2021no). gray2020human proposed an agent that plays an improved policy via one-ply search, using policy and value functions trained on human data to conduct search through regret minimization.
Several works explored applying self-play to compute improved policies. paquette2019no applied an actor-critic approach and found that while the agent plays more strongly in populations of other self-play agents, it plays worse against a population of human-imitation agents. anthony2020learning used a self-play approach based on a modification of fictitious play in order to reduce drift from human conventions. The resulting policy is stronger than pure imitation learning in both 1-vs-6 and 6-vs-1 settings, but weaker than agents that use search. Most recently, bakhtin2021no combined one-ply search based on equilibrium computation with value iteration to produce an agent called DORA. DORA achieved superhuman performance in a 2p0s version of Diplomacy without human data, but in the full 7-player game it plays poorly with agents other than itself.
jacob2022modeling showed that regularizing inference-time search techniques can produce agents that are not only strong but can also model human behavior well. In the domain of no-press Diplomacy, they show that regularizing hedge (an equilibrium-finding algorithm) with a KL-divergence penalty toward a human imitation learning policy can match or exceed the human action prediction accuracy of imitation learning while being substantially stronger. KL-regularization toward human behavioral policies has previously been proposed in various forms in single- and multi-agent RL algorithms (nair2018overcoming; siegel2020keep; nair2020accelerating), and was notably employed in AlphaStar (vinyals2019grandmaster), but it has typically been used to improve sample efficiency and aid exploration rather than to better model and coordinate with human play.
An alternative line of research has attempted to build human-compatible agents without relying on human data (hu2020other; hu2021off; strouse2021collaborating). These techniques have shown some success in simplified settings, but have not been shown to be competitive with humans in large-scale collaborative environments.
2.1 Markov Games
In this work, we focus on multiplayer Markov games (shapley1953stochastic).
Definition.
An N-player Markov game is a tuple (S, A, r, p), where S is the state space, A = A_1 × ⋯ × A_N with A_i the action space of player i (i ∈ {1, …, N}), r = (r_1, …, r_N) with r_i the reward function for player i, and p is the transition function.
The goal of each player i is to choose a policy π_i that maximizes the expected reward for that player, given the policies of all other players. In the case of N = 1, a Markov game reduces to a Markov Decision Process (MDP), where an agent interacts with a fixed environment.
At each state s, each player i simultaneously chooses an action a_i from a set of actions A_i. We denote the actions of all players other than i as a_{−i}. Players may also choose a probability distribution over actions, where the probability of action a_i is denoted π_i(a_i) or π(a_i), and the vector of probabilities is denoted π_i or π.
2.2 Hedge
Hedge (littlestone1994weighted; freund1997decision) is an iterative algorithm that converges to an equilibrium. We use variants of hedge for planning, computing an equilibrium policy on each turn of the game and then playing that policy.
Assume that after player i chooses an action a_i and all other players choose actions a_{−i}, player i receives a reward of u_i(a_i, a_{−i}), where u_i will come from our RL-trained value function. We denote the average reward in hindsight for action a_i up to iteration t as Q_i^t(a_i) = (1/t) Σ_{τ ≤ t} u_i(a_i, a_{−i}^τ).
On each iteration t+1 of hedge, the policy is set according to

    π_i^{t+1}(a_i) ∝ exp( Q_i^t(a_i) / κ_t ),

where κ_t is a temperature parameter.[2]

[2] We use κ_t rather than the 1/η_t used in jacob2022modeling in order to clean up notation.
It is proven that if κ_t is set proportional to 1/√t, then as t → ∞ the average policy over all iterations converges to a coarse correlated equilibrium, though in practice it often comes close to a Nash equilibrium as well. In all experiments we set κ_t proportional to σ_t/√t on iteration t, where σ_t is the observed standard deviation of the player's utility up to iteration t, based on a heuristic from brown2017dynamic. A simpler choice is to set κ_t = 0, which makes the algorithm equivalent to fictitious play (brown1951iterative).

Regret matching (RM) (blackwell1956analog; hart2000simple) is an alternative equilibrium-finding algorithm that has similar theoretical guarantees to hedge and was used in previous work on Diplomacy (gray2020human; bakhtin2021no). We do not use this algorithm, but we do evaluate baseline agents that use RM.
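As a concrete illustration of the hedge policy update, here is a minimal sketch for a single player (the toy reward values and temperatures are illustrative assumptions, not the paper's implementation):

```python
import math

def hedge_policy(avg_rewards, kappa):
    """Softmax of the hindsight average rewards Q^t with temperature kappa."""
    # Subtract the max logit for numerical stability before exponentiating.
    logits = [q / kappa for q in avg_rewards]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Toy example: two actions whose average rewards in hindsight are 1.0 and 0.0.
avg_q = [1.0, 0.0]

# High temperature -> near-uniform play; low temperature -> near-greedy play.
near_uniform = hedge_policy(avg_q, kappa=100.0)
near_greedy = hedge_policy(avg_q, kappa=0.01)
assert abs(near_uniform[0] - 0.5) < 0.01
assert near_greedy[0] > 0.99
```

The annealing schedule for κ_t then interpolates between these regimes as iterations accumulate: early iterations explore broadly, late iterations concentrate on the empirically best actions.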
2.3 DORA: Selfplay learning in Markov games
Our approach draws significantly from DORA (bakhtin2021no), which we describe in more detail here. In this approach, the authors run an algorithm that is similar to past model-based reinforcement learning methods such as AlphaZero (silver2018general), except that in place of Monte Carlo tree search, which is unsound in simultaneous-action games such as Diplomacy and other imperfect-information games, it uses an equilibrium-finding algorithm such as hedge or RM to iteratively approximate a Nash equilibrium for the current state (i.e., one-step lookahead search). A deep neural net trained to predict the policy is used to sample plausible actions for all players, reducing the large action space in Diplomacy to a tractable subset for the equilibrium-finding procedure, and a deep neural net trained to predict state values is used to evaluate the results of joint actions sampled by this procedure. Beginning with a policy and value network randomly initialized from scratch, a large number of self-play games are played, and the resulting equilibrium policies and the improved one-step value estimates computed on every turn from equilibrium finding are added to a replay buffer used for subsequently improving the policy and value. Additionally, a double-oracle method (mcmahan2003doubleoracle) was used to allow the policy to explore and discover additional actions, and the same equilibrium-finding procedure was also used at test time.

For the core update step, bakhtin2021no propose Deep Nash Value Iteration (DNVI), a value iteration procedure similar to Nash Q-Learning (hu2003nash), which is a generalization of Q-learning (watkins1989learning) from MDPs to stochastic games. The idea of Nash-Q is to compute equilibrium policies in a subgame where the actions correspond to the possible actions in the current state and the payoffs are defined using the current approximation of the value function. bakhtin2021no propose an equivalent update that uses a state value function V instead of a state-action value function Q:
    V_i(s) ← (1 − α) V_i(s) + α Σ_{a ∈ A} π*(a) [ r_i(s, a) + V_i(p(s, a)) ]   (3)

where α is the learning rate, π*(a) is the probability of joint action a = (a_1, …, a_N) in the equilibrium, and p is the transition function. For 2p0s games and certain other game classes, this algorithm converges to a Nash equilibrium in the original stochastic game, under the assumption that an exploration policy is used such that each state is visited infinitely often.
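A tabular sketch of this value-iteration update may help make it concrete. The tiny two-state example, the equilibrium probabilities, and the function name below are illustrative assumptions (in Diplomacy itself, rewards are only received at the end of the game):

```python
def nash_value_update(V, s, joint_actions, eq_probs, reward, transition, alpha):
    """V_i(s) <- (1 - alpha) V_i(s) + alpha * sum_a pi*(a) [r_i(s,a) + V_i(p(s,a))]."""
    target = sum(prob * (reward(s, a) + V[transition(s, a)])
                 for a, prob in zip(joint_actions, eq_probs))
    V[s] = (1 - alpha) * V[s] + alpha * target

# Toy example: one decision state "s0" leading to terminal states.
V = {"s0": 0.0, "win": 1.0, "draw": 0.5}
joint_actions = [("attack", "defend"), ("attack", "retreat")]
eq_probs = [0.5, 0.5]                            # equilibrium over joint actions
reward = lambda s, a: 0.0                        # no intermediate reward
transition = lambda s, a: "win" if a[1] == "retreat" else "draw"

nash_value_update(V, "s0", joint_actions, eq_probs, reward, transition, alpha=0.5)
# target = 0.5 * 0.5 + 0.5 * 1.0 = 0.75, so V[s0] = 0.5 * 0 + 0.5 * 0.75 = 0.375
assert abs(V["s0"] - 0.375) < 1e-9
```

The key difference from ordinary value iteration is that the backup averages over the *joint-action equilibrium* π* computed in the current state, rather than over a single player's greedy action.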
The tabular approach of Nash-Q does not scale to large games such as Diplomacy. DNVI replaces the explicit value function table and the update rule in Equation 3 with a value function V_θ parameterized by a neural network, and uses gradient descent to update it using the following loss:

    L_V(θ) = ( Σ_{a ∈ A} π*(a) [ r_i(s, a) + V_θ(p(s, a)) ] − V_θ(s) )²   (4)
The summation used in Equation 4 is not feasible in games with large action spaces, as the number of joint actions grows exponentially with the number of players. bakhtin2021no address this issue by considering only a subset of actions at each step. An auxiliary function, a policy proposal network π_θ, models the probability that an action of player i is in the support of the equilibrium π*. Only the top sampled actions from this distribution are considered when solving for the equilibrium policy and computing the above value loss. Once the equilibrium is computed, the equilibrium policy is also used to further train the policy proposal network using a cross-entropy loss:

    L_π(θ) = − Σ_{a_i} π_i*(a_i) log π_θ(a_i | s)   (5)
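In plain Python, the shapes of these two losses might look like the following sketch (a pure-Python stand-in for the neural-network training step; the names and toy values are illustrative, not the paper's code):

```python
import math

def value_loss(v_s, eq_probs, rewards, next_values):
    """Squared error between V_theta(s) and the one-step equilibrium backup (Eq. 4)."""
    target = sum(p * (r + v) for p, r, v in zip(eq_probs, rewards, next_values))
    return (target - v_s) ** 2

def policy_proposal_loss(eq_policy, proposal_probs):
    """Cross entropy between the computed equilibrium and the proposal net (Eq. 5)."""
    return -sum(p * math.log(q) for p, q in zip(eq_policy, proposal_probs) if p > 0)

# A proposal network that matches the equilibrium exactly has lower loss
# than a uniform proposal, so gradient descent pulls it toward pi*.
eq = [0.25, 0.75]
assert policy_proposal_loss(eq, eq) <= policy_proposal_loss(eq, [0.5, 0.5])
# A value estimate equal to the equilibrium backup has zero loss.
assert value_loss(0.75, [0.5, 0.5], [0.0, 0.0], [0.5, 1.0]) == 0.0
```

In practice both losses are computed over mini-batches from the replay buffer, with gradients taken with respect to the network parameters θ.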
bakhtin2021no report that the resulting agent, DORA, does very well when playing with other copies of itself. However, DORA performs poorly in games with six human-like agents.
2.4 piKL: Modeling humans with imitationanchored planning
Behavioral cloning (BC) is the standard approach for modeling human behavior given data. It learns a policy that maximizes the likelihood of the human data via gradient descent on a cross-entropy loss. However, as observed and discussed in jacob2022modeling, BC often falls short of accurately modeling or matching human-level performance, with BC models underperforming the human players they are trained to imitate in games such as chess, Go, and Diplomacy. Intuitively, it might seem that initializing self-play with an imitation-learned policy would result in an agent that is both strong and human-like. Indeed, bakhtin2021no showed improved performance against human-like agents when initializing the DORA training procedure from a human imitation policy and value, rather than starting from scratch. However, we show in subsection 5.3 that such an approach still results in policies that deviate from human-compatible equilibria.
jacob2022modeling found that an effective solution was to perform search with a regularization penalty proportional to the KL divergence from a human imitation policy. This algorithm is referred to as piKL. The form of piKL we focus on in this paper is a variant of hedge called piKL-hedge, in which each player i seeks to maximize expected reward while at the same time playing "close" to a fixed anchor policy τ_i. The two goals can be reconciled by defining a composite utility function that adds a penalty based on the "distance" between the player's policy and their anchor policy, with a coefficient λ scaling the penalty.
For each player i, we define i's utility as a function of the agent's policy π_i given the policies π_{−i} of all other agents:

    u_i^λ(π_i, π_{−i}) = u_i(π_i, π_{−i}) − λ D_KL(π_i ‖ τ_i)   (6)

This results in a modification of hedge such that on each iteration t+1, π_i^{t+1} is set according to

    π_i^{t+1}(a_i) ∝ exp( (Q_i^t(a_i) + λ log τ_i(a_i)) / (κ_t + λ) )   (7)
When λ is large, the utility function is dominated by the KL-divergence term λ D_KL(π_i ‖ τ_i), and so the agent will naturally tend to play a policy close to the anchor policy τ_i. When λ is small, the dominating term is the reward, and so the agent will tend to maximize reward without as closely matching the anchor policy τ_i.
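The behavior of the piKL-hedge update in Equation 7 at the two extremes of λ can be sketched as follows (toy numbers; the anchor policy and temperature values are illustrative assumptions):

```python
import math

def pikl_hedge_policy(avg_rewards, anchor, lam, kappa):
    """pi(a) proportional to exp((Q(a) + lam * log anchor(a)) / (kappa + lam))."""
    logits = [(q + lam * math.log(p)) / (kappa + lam)
              for q, p in zip(avg_rewards, anchor)]
    m = max(logits)  # subtract max logit for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

anchor = [0.9, 0.1]   # imitation-learned policy strongly prefers action 0
avg_q = [0.0, 1.0]    # but action 1 has the higher average reward in hindsight

# Large lambda: the policy stays close to the anchor.
assert pikl_hedge_policy(avg_q, anchor, lam=1000.0, kappa=1.0)[0] > 0.85
# Small lambda: the policy approximately maximizes reward instead.
assert pikl_hedge_policy(avg_q, anchor, lam=0.001, kappa=0.1)[1] > 0.99
```

Intermediate values of λ interpolate smoothly between these two extremes, which is exactly the trade-off the next section turns into a distribution over agent types.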
3 Distributional Lambda piKL (DiLpiKL)
piKL trades off between the strength of the agent and closeness to the anchor policy using a single fixed λ parameter. In practice, we find that sampling λ from a probability distribution each iteration produces better performance. In this section, we introduce distributional lambda piKL (DiL-piKL), which replaces the single λ parameter in piKL with a probability distribution β over λ values. On each iteration, each player i samples a value λ from β_i and then chooses a policy based on Equation 7 using that sampled λ. Figure 1 highlights the difference between piKL and DiL-piKL.
[Algorithm 1: DiL-piKL from the perspective of player i. Data: the set of actions for player i; the reward function for player i; a set of λ values to consider for player i; a belief distribution over λ values for player i (the last two inputs are the additions relative to piKL-hedge). Initialize: initialize the statistics for each action. Play (iteration t): sample λ from the belief distribution and let π be the policy given by Equation 7 for that sampled λ.]
One interpretation of DiL-piKL is that each choice of λ is an agent type, where agent types with high λ choose policies closer to τ_i, while agent types with low λ choose policies that are more "optimal" and less constrained to a common-knowledge anchor policy. A priori, each player's type is randomly sampled from this population of agent types, and the distribution β represents the common-knowledge uncertainty about which of the agent types player i may be. Another interpretation is that piKL assumed an exponential relation between action EV and likelihood, whereas DiL-piKL results in a fatter-tailed distribution that may more robustly model different playing styles or game situations.
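The sampling step that distinguishes DiL-piKL from piKL can be sketched in a few lines (a simplified single-player illustration of Algorithm 1; the λ values, beliefs, and toy policies are assumptions, not the paper's settings):

```python
import math
import random

def dil_pikl_iteration(avg_rewards, anchor, lambdas, beliefs, kappa, rng):
    """One DiL-piKL iteration: sample lambda from the belief distribution,
    then play the piKL-hedge policy (Equation 7) for that sampled lambda."""
    lam = rng.choices(lambdas, weights=beliefs)[0]
    logits = [(q + lam * math.log(p)) / (kappa + lam)
              for q, p in zip(avg_rewards, anchor)]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return lam, [e / z for e in exps]

rng = random.Random(0)
anchor = [0.7, 0.3]
avg_q = [0.0, 0.5]
# Two agent types: a weakly regularized one and a strongly regularized one.
lam, policy = dil_pikl_iteration(avg_q, anchor, lambdas=[0.01, 10.0],
                                 beliefs=[0.5, 0.5], kappa=0.1, rng=rng)
assert lam in (0.01, 10.0)
assert abs(sum(policy) - 1.0) < 1e-9
```

Across many iterations, the averaged play thus mixes the behaviors of the whole population of types rather than committing to a single regularization strength.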
3.1 Coordinating with piKL policies
While piKL and DiL-piKL are intended to model human behavior, an optimal policy in cooperative environments should be closer to a best response to this distribution. Selecting different λ values for the common-knowledge population versus the policy the agent actually plays allows us to interpolate between BC, a best response to BC, and equilibrium policies (Figure 2). In practice, our agent samples λ from β during equilibrium computation but ultimately plays a low-λ policy, modeling the fact that other players are unaware of our agent's true type.
3.2 Theoretical Properties of DiLpiKL
DiL-piKL can be understood as a sampled form of follow-the-regularized-leader (FTRL). Specifically, one can think of Algorithm 1 as an instantiation of FTRL over the Bayesian game induced by the set of types Λ_i and the regularized utilities u_i^λ of each player i. In the appendix we show that when a player learns using DiL-piKL, the policies of each type λ are no-regret with respect to the regularized utilities defined in (6). Formally:
Theorem 1 (abridged).
Let B be a bound on the maximum absolute value of any payoff in the game. Then, for any player i, type λ ∈ Λ_i, and number of iterations T, the regret accumulated by the policies of type λ grows only logarithmically in T, with constants depending on B, λ, and the anchor policy.
The traditional analysis of FTRL is not applicable to DiL-piKL because the utility functions, as well as their gradients, can be unbounded due to the non-smoothness of the regularization term that appears in the regularized utility function u_i^λ, and therefore a more sophisticated analysis needs to be carried out. Furthermore, even in the special case of a single type (i.e., a singleton set Λ_i), where DiL-piKL coincides with piKL, the above guarantee significantly refines the analysis of piKL in two ways. First, it holds no matter the choice of stepsize, thus implying a regret bound under weaker assumptions. Second, in the case in which λ is tiny, an appropriate choice of stepsize still recovers a sublinear guarantee on the regret.
In 2p0s games, the logarithmic regret of Theorem 1 immediately implies that the average policy of each player is an O((log T)/T)-approximate Bayes-Nash equilibrium strategy. In fact, a strong guarantee on the last-iterate convergence of the algorithm can be obtained too:
Theorem 2 (abridged; last-iterate convergence of DiL-piKL in 2p0s games).
When both players in a 2p0s game learn using DiL-piKL for T iterations, their policies converge almost surely to the unique Bayes-Nash equilibrium of the regularized game defined by the utilities (6).
The last-iterate guarantee stated in Theorem 2 crucially relies on the strong convexity of the regularized utilities, and conceptually belongs with related efforts in showing last-iterate convergence of online learning methods. However, a key difficulty that sets Theorem 2 apart is the fact that the learning agents observe sampled actions from the opponents, which makes the proof of the result (as well as the obtained convergence rate) different from prior approaches.
4 Description of Diplodocus
By replacing the equilibrium-finding algorithm used in DORA with DiL-piKL, we hypothesize that we can learn a strong and human-compatible policy, as well as a value function that can accurately evaluate game states under strong and human-like continuation policies. We call this self-play algorithm RL-DiL-piKL. We use RL-DiL-piKL to train the value and policy proposal networks, and use DiL-piKL during test-time search.
4.1 Training
Our training algorithm closely follows that of DORA, described in Section 2.3. The loss functions used are identical to DORA and the training procedure is largely the same, except that in place of RM to compute the equilibrium policy on each turn of a game during self-play, we use DiL-piKL with a distribution β and a human imitation anchor policy that are fixed for all of training. See Appendix H for a detailed description of the differences between DORA and RL-DiL-piKL.

4.2 Test-Time Search
Following bakhtin2021no, at evaluation time we perform 1-ply lookahead where on each turn we sample up to 30 of the most likely actions for each player from the RL policy proposal network. However, rather than using RM to compute the equilibrium, we apply DiL-piKL.
As mentioned in Section 3, while our agent samples λ from the probability distribution β when computing the DiL-piKL equilibrium, the agent chooses its own action to actually play using a fixed low λ. For all experiments, including all ablations, the agent uses the same BC anchor policy. For DiL-piKL experiments, for each player we set β to be uniform over the set of λ values and play according to a low λ, except on the first turn of the game. On the first turn we instead sample λ from β and play according to that sampled λ, so that the agent plays more diverse openings, which more closely resemble those that humans play.
5 Experiments
We first compare the performance of two variants of Diplodocus in a population of prior agents and other baseline agents. We then report results of Diplodocus playing in a tournament with humans.
5.1 Experimental setup
In order to measure the ability of agents to play well against a diverse set of opponents, we play many games between AI agents where each of the seven players is either sampled randomly from a population of baselines (listed in Appendix D) or is the agent to be tested. We report scores for each of the following algorithms against the baseline population:
Diplodocus-Low and Diplodocus-High are the proposed agents that use RL-DiL-piKL during training with two player types, using a lower and a higher λ respectively.
DORA is an agent trained via self-play that uses RM as the search algorithm during training and at test time. Both the policy and the value function are randomly initialized at the start of training.
DNVI is similar to DORA, but the policy proposal and value networks are initialized from human BC pretraining.
DNVI-NPU is similar to DNVI, but during training only the RL value network is updated. The policy proposal network is still trained, but never fed back to the self-play workers, in order to limit self-play drift from human conventions. The final RL policy proposal network is used only at the end, at test time (along with the RL value network).
BRBot is an approximate best response to the BC policy. It was trained in the same way as Diplodocus, except that during training the agent plays one distinguished player each game with a very low λ, while all other players use a very high λ (i.e., play essentially the BC policy).
SearchBot is a one-step lookahead equilibrium search agent from gray2020human, evaluated using their published model.
HedgeBot is an agent similar to SearchBot (gray2020human), but using our latest architecture and hedge rather than RM as the equilibrium-finding algorithm.
FPPI-2 and SL are two agents from anthony2020learning, evaluated using their published models.
After computing these population scores, as a final evaluation we organized a tournament in which we evaluated four agents for 50 games each in a population of online human participants: two baseline agents, BRBot and DORA, and two of our new agents, Diplodocus-Low and Diplodocus-High.
In order to limit the duration of games to only a few hours, these games used a time limit of 5 minutes per turn and a stochastic game-end rule: at the beginning of each game year between 1909 and 1912, the game ends immediately with 20% probability, increasing in 1913 to a 40% probability. Players were not told which turn the game would end on for a specific game, but were told the distribution it was sampled from. Our agents were also trained based on this distribution.[3] Players were recruited from Diplomacy mailing lists and from webdiplomacy.net. In order to mitigate the risk of cheating by collusion, players were paid hourly rather than based on in-game performance. Each game had exactly one agent and six humans. The players were informed that there was an AI agent in each game, but did not know which player was the bot in each particular game. In total, 62 human participants played 200 games, with 44 human participants playing more than two games and 39 human participants playing at least 5 games.

[3] Games were run by a third-party contractor. In contradiction of the criteria we specified, the contractor ended games artificially early for the first 80 games played in the tournament, with end dates of 1909-1911 being more common than they should have been. We immediately corrected this problem once it was identified.
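The stochastic game-end rule can be sketched as a simple sampling procedure (an illustration only, not the tournament code; in particular, the assumption that the 40% per-year probability continues beyond 1913 is ours):

```python
import random

def sample_end_year(rng):
    """Sample a game-end year: 20% chance per year in 1909-1912, then 40%
    per year from 1913 onward (continuation past 1913 is an assumption)."""
    year = 1909
    while True:
        p = 0.2 if year <= 1912 else 0.4
        if rng.random() < p:
            return year
        year += 1

rng = random.Random(42)
years = [sample_end_year(rng) for _ in range(10000)]
assert min(years) >= 1909
# Roughly 20% of games should end in the very first eligible year.
frac_1909 = years.count(1909) / len(years)
assert 0.17 < frac_1909 < 0.23
```

Because the hazard rate rises sharply in 1913, most games end within a five-year window, which is what keeps tournament games to a few hours.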
5.2 Experimental Results
Agent | Score against population
Diplodocus-Low | 29% ± 1%
Diplodocus-High | 28% ± 1%
DNVI-NPU (retrained) (bakhtin2021no) | 20% ± 1%
BRBot | 18% ± 1%
DNVI (retrained) (bakhtin2021no) | 15% ± 1%
HedgeBot (retrained) (jacob2022modeling) | 14% ± 1%
DORA (retrained) (bakhtin2021no) | 13% ± 1%
FPPI-2 (anthony2020learning) | 9% ± 1%
SearchBot (gray2020human) | 7% ± 1%
SL (anthony2020learning) | 6% ± 1%
(± shows one standard error.)
We first report results for our agents against the fixed population described in Appendix D. The results, shown in Table 1, show that Diplodocus-Low and Diplodocus-High perform best by a wide margin.
We next report results for the human tournament in Table 2. For each listed player, we report their average score, Elo rating, and rank within the tournament based on Elo among players who played at least 5 games. Elo ratings were computed using a standard generalization of BayesElo (coulom2005bayeselo) to multiple players (hunter2004mmbt) (see Appendix I for details). This gives similar rankings as average score, but also attempts to correct both for the average strength of the opponents, since some games may have stronger or weaker opposition, and for which of the seven European powers a player was assigned in each game, since some starting positions in Diplomacy are advantaged over others. To regularize the model, a weak Bayesian prior was applied such that each player's rating was normally distributed around 0 with a standard deviation of around 350 Elo.
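As a rough sketch of the ingredients of such a rating model, the following shows a standard logistic Elo win probability together with a Gaussian log-prior penalty (the multi-player generalization and per-power corrections used in the paper are more involved; the function names are ours):

```python
import math

def elo_win_prob(r_a, r_b):
    """Standard logistic Elo: probability that player A beats player B."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

def log_prior(rating, sigma=350.0):
    """Gaussian log-prior centered at 0 (up to a constant); this shrinks the
    ratings of players with few games toward the population mean."""
    return -0.5 * (rating / sigma) ** 2

# Equal ratings -> 50% win probability; a 400-point edge -> 10:1 odds.
assert abs(elo_win_prob(0, 0) - 0.5) < 1e-9
assert abs(elo_win_prob(400, 0) - 10 / 11) < 1e-9
# The prior penalizes extreme ratings, so a player with two lucky games
# cannot jump to an enormous Elo.
assert log_prior(700.0) < log_prior(100.0)
```

In a maximum-a-posteriori fit, each player's rating maximizes the sum of the game-outcome log-likelihoods and this log-prior, which is why players with very few games end up pulled toward 0.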
The results show that Diplodocus-High performed best among all players by both Elo and average score. Diplodocus-Low followed closely behind, ranking second by average score and third by Elo. BRBot performed relatively well, but ranked below both DiL-piKL agents and several humans. DORA performed relatively poorly.
Two participants achieved a higher average score than the Diplodocus agents: a player averaging 35% who played only two games, and a player with a score of 29% who played only one game.
We note that given the large statistical error margins, the results in Table 2 do not conclusively demonstrate that Diplodocus outperforms the best human players, nor do they alone demonstrate an unambiguous separation between Diplodocus and BRBot. However, the results do indicate that Diplodocus performs at least at the level of expert players in this population of players with diverse skill levels. Additionally, the superior performance of both Diplodocus agents compared to BRBot is consistent with the results from the agent population experiments in Table 1.
Player | Rank | Elo | Avg Score | # Games
Diplodocus-High | 1 | 181 | 27% ± 4% | 50
Human | 2 | 162 | 25% ± 6% | 13
Diplodocus-Low | 3 | 152 | 26% ± 4% | 50
Human | 4 | 138 | 22% ± 9% | 7
Human | 5 | 136 | 22% ± 3% | 57
BRBot | 6 | 119 | 23% ± 4% | 50
Human | 7 | 102 | 18% ± 8% | 8
Human | 8 | 96 | 17% ± 3% | 51
DORA | 32 | 20 | 13% ± 3% | 50
Human | 43 | −187 | 1% ± 1% | 7
In addition to the tournament, we asked three expert human players to evaluate the strength of the agents in the tournament games based on the quality of their actions. Games were presented to these experts with anonymized labels, so that when judging an agent's strategy the experts did not know which agent was which in each game. All of the experts picked a Diplodocus agent as the strongest agent, though they disagreed about whether Diplodocus-High or Diplodocus-Low was best. Additionally, all experts indicated one of the Diplodocus agents as the one they would most like to cooperate with in a game. We provide detailed responses in Appendix C.
5.3 RL training comparison
Figure 3 compares different RL agents across the course of training. To simplify the comparison, we vary the training methods for the value and policy proposal networks, but use the same search setting at evaluation time.
As a proxy for agent strength, we measure the average score of an agent versus six copies of HedgeBot. As a proxy for modeling humans, we compute the prediction accuracy of human moves on a validation dataset of roughly 630 games held out from the training of the human BC model, i.e., how often the most probable action under the policy corresponds to the one chosen by a human. Similar to bakhtin2021no, we found that agents without biasing techniques (DORA and DNVI) diverge from human play as training progresses. By contrast, Diplodocus-High achieves a significant improvement in score while keeping the human prediction accuracy high.
6 Discussion
In this work we describe RL-DiL-piKL and use it to train an agent for no-press Diplomacy that placed first in a human tournament. We ascribe Diplodocus's success in Diplomacy to two ideas.
First, DiL-piKL models a population of player types with different amounts of regularization toward a human policy, while ultimately playing a strong (low-λ) policy itself. This improves upon simply playing a best response to a BC policy by accounting for the fact that humans are less likely to play highly suboptimal actions, and by reducing overfitting of the best response to the BC policy. Second, incorporating DiL-piKL in self-play allows us to learn an accurate value function across the diversity of situations that arise among strong and human-like players. Furthermore, this value function assumes a human continuation policy that makes fewer blunders than the BC policy, allowing us to correctly estimate the values of positions that require accurate play (such as stalemate lines).
In conclusion, combining human imitation, planning, and RL presents a promising avenue for building agents for complex cooperative and mixedmotive environments. Further work could explore regularized search policies that condition on more complex human behavior, including dialogue.
References
Appendix A Author Contributions
A. Bakhtin primarily contributed to RL, infrastructure, experimentation, and direction. D. J. Wu primarily contributed to RL and infrastructure. A. Lerer primarily contributed to DiLpiKL, infrastructure, and direction. J. Gray primarily contributed to infrastructure. A. P. Jacob primarily contributed to DiLpiKL and experimentation. G. Farina primarily contributed to theory and DiLpiKL. A. H. Miller primarily contributed to experimentation. N. Brown primarily contributed to DiLpiKL, experimentation, and direction.
Appendix B Description of Diplomacy
The rules of no-press Diplomacy are complex; a full description is provided by paquette2019no. No-press Diplomacy is a seven-player zero-sum board game in which a map of Europe is divided into 75 provinces. 34 of these provinces contain supply centers (SCs), and the goal of the game is for a player to control a majority (18) of the SCs. Each player begins the game controlling three or four SCs and an equal number of units.
The game consists of three types of phases: movement phases in which each player assigns an order to each unit they control, retreat phases in which defeated units retreat to a neighboring province, and adjustment phases in which new units are built or existing units are destroyed.
During a movement phase, a player assigns an order to each unit they control. A unit’s order may be to hold (defend its province), move to a neighboring province, convoy a unit over water, or support a neighboring unit’s hold or move order. Support may be provided to units of any player. We refer to a tuple of orders, one order for each of a player’s units, as an action. That is, each player chooses one action each turn. There are an average of 26 valid orders for each unit (paquette2019no), so the game’s branching factor is massive and on some turns enumerating all actions is intractable.
Importantly, all actions occur simultaneously. In live games, players write down their orders and then reveal them at the same time. This makes Diplomacy an imperfect-information game in which an optimal policy may need to be stochastic in order to prevent predictability.
Diplomacy is designed in such a way that cooperation with other players is almost essential in order to achieve victory, even though only one player can ultimately win.
A game may end in a draw on any turn if all remaining players agree. Draws are a common outcome among experienced players, because players will often coordinate to prevent any individual from reaching 18 centers. The two most common scoring systems for draws are draw-size scoring (DSS), in which all surviving players equally split a win, and sum-of-squares scoring (SoS), in which player $i$ receives a score of $C_i^2 / \sum_j C_j^2$, where $C_i$ is the number of SCs that player $i$ controls (fogel2020whom). Throughout this paper we use SoS scoring, except in anonymous games against humans, where the human host chooses the scoring system.
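As a concrete sketch of SoS draw scoring, a minimal Python rendering of the formula above (the function name is ours):

```python
def sos_scores(sc_counts):
    """Sum-of-squares (SoS) draw scoring: player i's score is the square of
    their supply-center count divided by the sum of squares over all players."""
    total = sum(c * c for c in sc_counts)
    if total == 0:
        return [0.0 for _ in sc_counts]
    return [c * c / total for c in sc_counts]

# Two surviving players holding 3 SCs each split the draw evenly.
print(sos_scores([3, 3, 0]))  # → [0.5, 0.5, 0.0]
```

Note that, unlike DSS, SoS rewards larger players disproportionately, so surviving with more centers matters even in a draw.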
Appendix C Expert evaluation of the agents
The anonymous format of the tournament was aimed at reducing possible biases of players towards the agent, e.g., trying to collectively eliminate the agents, as targeting the agent is a simple way to break the symmetry. At the same time, a significant aspect of Diplomacy is knowing the play styles of different players and using this knowledge to decide whom to trust and whom to choose as an ally. To evaluate this aspect of gameplay, we asked for qualitative feedback from three Diplomacy experts. Each expert was given 7 games (one per power) from each of the 4 different agents that played in the tournament. The games evaluated by each expert were disjoint from the games evaluated by the other experts. The games were anonymized such that the experts could not tell which agent played in a game based on the username or the date. We asked a few questions about the gameplay of each agent independently and then asked the experts to choose the best agent for strength and for humanlike behavior. The experts referred to the agents as Agent1, …, Agent4, but we deanonymized the agents in the answers below.
C.1 Overall
What is the strongest agent?
Expert 1
I think DiplodocusLow was the strongest, then BRBot closely followed by DiplodocusHigh. DORA is a distant third.
Expert 2
DiplodocusHigh.
Expert 3
DiplodocusLow. This feels stronger than a human in certain ways while still being very humanlike.
What is the most humanlike/botlike agent?
Expert 1
Most humanlike is DiplodocusHigh. A boring human, but a human nonetheless. DiplodocusLow is not far behind, then BRBot and DORA, both of which are very non-human, albeit in very different ways.
Expert 2
DiplodocusHigh.
Expert 3
DiplodocusLow.
What is the agent you’d like to cooperate with?
Expert 1
This is the most interesting question. I think DiplodocusLow, because I like how it plays (we'd "vibe"), but also because I think it is quite predictable in what motivates it to change alliances. That's potentially exploitable, even with the strong tactics it has. I'd least like to work with DiplodocusHigh, as it seems to be very much in it for itself. I suspect it would be quite unpleasant to play against, as it is tactically excellent and seems hard to work with.
I’d love to be on a board with DORA, as I’d expect my chances to solo to go up dramatically! It would be a very fun game so long as you weren’t on the receiving end of some of its play.
Expert 2
DiplodocusHigh.
Expert 3
DiplodocusLow. DiplodocusHigh is also strong, but seems much less interesting to play with, because of the way it commits to alliances without taking into account who is actually willing to work with it. This limits what a human can do to change their situation quite a lot and would be fairly frustrating in the position of a neighbour being attacked by it.
BRBot and DORA feel too weak to be particularly interesting.
C.2 DORA
How would you evaluate the overall strength of the agent?
Expert 1
Not great. There's a lot to criticize here, from bad opening play (Russia = bonkers), to poor defense (Turkey), and just generally bad tactics and strategy compared to the other agents (France attacking Italy when Italy is their only ally was an egregious example of this).
Expert 2
Very weak. Seemed to invite its own demise with the way it routinely picked fights in theaters it had no business being in and failed to cooperate with neighbors.
Expert 3
Poor. It is bad at working with players, and it makes easily avoidable blunders even when working alone.
How would you evaluate the ability of the agent to cooperate?
Expert 1
It seems to make efforts, but it also seems to misjudge what humans are likely to do. There are indicative support orders, and they're pretty good, but it also doesn't seem to understand or account for vindictiveness over always playing best. The Turkey game, where it repeatedly seems to expect Russia not to attack, is an example of this.
Expert 2
Poor. Seemed to pick fights without seeing or soliciting support necessary to win, failed to support potential allies in useful ways to take advantage of their position.
Expert 3
Middling to Poor. It very occasionally enters good supports but it often enters bad ones, and has a habit of attacking too many people at once (and not considering that attacking those people will turn them against it). It has a habit of annoying many players and doing badly as a result.
C.3 BRBot
How would you evaluate the overall strength of the agent?
Expert 1
The agent has solid, at least human-level tactics and clearly sees opportunities to advance and acts accordingly. Sometimes this is to the detriment of the strategic position, but the balance is fair given the gunboat nature of the games. Overall, the bot naturally feels to be in the "better than average human" range rather than superhuman, but the results indicate that it performs at a higher level than the "feeling" it gives. It has a major opportunity for improvement, discussed in the next point.
Expert 2
Overall, seemed fairly weak and seemed to be able to succeed most frequently when benefiting from severe mistakes from neighboring agents. That being said it was able to exploit those mistakes somewhat decently in some cases and at least grow to some degree off of it.
Expert 3
Middling. It is tactically strong when not having to work with other players and when it has a considerable number of units, but is quite weak when attempting to cooperate with other players. Its defensive strength varies quite significantly too, possibly also based on unit count: when it had relatively few units it missed very obvious defensive tactics.
How would you evaluate the ability of the agent to cooperate?
Expert 1
The bot is hyperactively trying to coordinate and signal to the other players that it wants to work with them. Sometimes this is in the form of ridiculous orders that probably indicate desperation more than a mutually beneficial alliance, and this backfires as you may expect. At its best it makes exceptional signaling moves (RUSSIA game [DOUBLE BLIND]: War - Mos in Fall 1901 is exceptional), but at worst it is embarrassingly bad, and this leads to it getting attacked (TURKEY game [DOUBLE BLIND]: supporting convoys from Gre - Smy, or supporting other powers moving to Armenia). The other weakness is that it tends to make moves like these facing all other powers; this is not optimal, as indicating to all other powers that you want to work with them is equivalent to not indicating anything at all; if anything, it seems a little duplicitous. This is especially true when the bot is still inclined to stab when the opportunity presents itself, which means the signaling is superficial and unlikely to work repeatedly. Overall, the orders show the ability to cooperate, signal, and work together, but the hyperactivity of the bot limits the effectiveness of these tools to achieve the best results.
Expert 2
Poor. Random support orders seemed to be thrown without an overarching strategy behind them. Moves didn’t seem to suggest long term thoughts of collaboration.
Expert 3
Poor. When attempting to work with another player, it almost always gives them the upper hand, and even issues supports that suggest it is okay with that player taking its SCs when it should not be. It sometimes matches supports to human moves, but does not seem to do this very often. The nonsensical supports are much more common.
C.4 DiplodocusHigh
How would you evaluate the overall strength of the agent?
Expert 1
The tactics are unadventurous and sometimes seem below human standards (for example, the train of army units in the Italy game; the whole Turkey game), but conversely they also have a longer view of the game (see also the Italy game: the trained bounces don't matter strategically). There's less nonsense too; if I were to sum the bot up in two words, it would be "practical" and "boring".
Expert 2
Seemed to be strong. Wrote generally good tactical orders, showed generally good strategic sense. Showed discipline and a willingness to let allies survive in weak positions while having units that could theoretically stab for dots with ease remaining right next to that weak ally.
There were some highly questionable moments in early 1901 strategy as both Italy and France, which seemed to heavily harm their ability to get out of the box.
The Austrian game was particularly impressive in terms of its ability to handle odd scenarios and achieve the solo despite receiving pressure on multiple occasions on multiple fronts.
Expert 3
Generally strong. It is good at signalling and forming alliances, is tactically strong when in its favoured alliance, and is especially strong when ahead. Its main weakness seems to be an inability to adapt  if its favoured alliance is declined, it will often keep trying to ‘pitch’ that same alliance instead of working towards alternatives.
How would you evaluate the ability of the agent to cooperate?
Expert 1
Low. It doesn't put much effort into this. In the French game, for example, the bot just seems to accept that it is being attacked and fights through it. It's so boring and tactical, and shows little care for cooperation. Many great gunboat players do this, but it will not hold up in press games. What it does seem to do is capitalize on other players' mistakes: see the Austrian game, where it sneaks into Scandinavia and optimizes to get to 18 (there can't be a lot of training data for that!).
Expert 2
Very strong ability to cooperate as seen in the Turkish game, but in other games seemed to try and pick fights against the entire world in ways that were ultimately selfdefeating.
Expert 3
Good. It can work well with human players, matching supports and even using signalling supports in ways humans would. It frequently attempts to side with a player who is attacking it, though, so it seems to have a problem with identifying which player to work with.
C.5 DiplodocusLow
How would you evaluate the overall strength of the agent?
Expert 1
Exceptional. Very strong tactics and a clear directionality to what it does: it seems to understand what the strategic value of a position is, and it acts with efficiency to achieve its strategic goals. It has great results (time drawn out of a few wins!) but also fights back from "losing" positions extremely well, which makes it quantifiably strong, and it also just plays a beautiful and effective game. Very strong indeed. It does sometimes act too aggressively for tournament play (Austria is the example where this came home to roost); the high-risk/reward plays are generally but not always correct in single games, but for tournament play it goes for broke a bit too much. (This is outside the scope of the agent, I suspect, as it is playing to the game scoring system, not a tournament scoring system.) Against human players who may not see the longer-term impact of their play, it results in games like this one, which is ugly both for Austria and for everyone else except Turkey.
Expert 2
Very weak. Seemed to abandon its own position in many cases to pursue questionable adventures. Sometimes they worked out, but generally they failed, resulting in things like a Germany under siege holding Edi while they, as England, are off in Portugal and are holding onto their home centers only because FG were under siege by the south.
Expert 3
Very strong. It can signal alliances very well and generally chooses the correct allies, seems strong tactically even on defence, and makes some plays you would not expect from a human player but which are outright stronger than a human player would make.
How would you evaluate the ability of the agent to cooperate?
Expert 1
Pretty good. It sends signaling moves and makes efforts to support other players quite a lot (see in particular Russia). I particularly like the skills being shown to work together tactically and try to support other units; this is both effective and quite human. This is my favorite bot by some distance when it comes to cooperating with the other players. There is a weakness in that it does seem to reassess alliances every turn, which means sometimes the excellent work indicating and supporting is undone without getting the chance to realize the gains (examples with Russia and Italy).
Expert 2
Poor. Didn't seem to give meaningful support orders when they would have helped, and gave plenty of meaningless signaling supports and some questionable ones, like supporting the English into SKA in F1901 as Germany, among other oddities.
Expert 3
Good. It signals alliances in very human ways, through clear signalling builds, accurate support moves where it makes sense, and support holds otherwise. It also seems to match supports with its allies well.
Appendix D Population based evaluation
In general-sum games like Diplomacy, win rate in head-to-head matches against a previous version of an agent may be less informative because of non-transitivity between agents. For example, exploitative agents such as best-response-to-BC may do particularly well against BC or other pure imitation-learning agents, and less well against all other agents. Additionally, bakhtin2021no found that a pair of independently and equally well-trained RL agents may each appear very weak in a population composed of the other, due to converging to incompatible equilibria. Many agents also varied significantly in how well they performed against other search-based agents.
Therefore, we resort to playing against a population of previously trained agents, as was done in jacob2022modeling; this is intended to measure more broadly how well an agent does on average against a wide suite of humanlike agents.
More precisely, we define a fixed set of baseline agents as a population. To determine an agent’s average population score, we add that agent into the population and then play games where in each game, all 7 players are uniformly randomly sampled from the population with replacement, keeping only games where the agent to be tested was sampled at least once. Note that unlike jacob2022modeling, we run a separate population test for each new agent to be tested, rather than combining all agents to be tested within a single population.
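The sampling procedure above can be sketched as follows; this is a minimal Python rendering under our reading of the protocol, and the function and agent names are ours:

```python
import random

def sample_population_game(test_agent, baseline_agents, n_players=7, rng=random):
    """Draw one population game: sample all seats uniformly at random with
    replacement from the pool (baselines + test agent), rejecting any game
    in which the test agent was not sampled at least once."""
    pool = list(baseline_agents) + [test_agent]
    while True:
        seats = [rng.choice(pool) for _ in range(n_players)]
        if test_agent in seats:
            return seats

# Example: one sampled 7-player game guaranteed to contain the tested agent.
seats = sample_population_game("tested", ["bc", "br", "dil-pikl"],
                               rng=random.Random(0))
```

The tested agent's average population score is then its mean score over such games (1000 games per population test in our experiments).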
For the experiments in Table 1 and subsection 5.1 we used the following 8 baseline agents:

A single-turn BR agent that assumes everyone else plays the BC policy.

An agent performing RM search with BC policy and value functions. We use 2 copies of this agent, trained on different subsets of the data.

A DiL-piKL agent with BC policy and value functions. We use 4 versions of this agent, with different training data and model architectures.

A DiL-piKL agent whose policy and value functions are trained with self-play using ReinforcedPiKL with a high regularization parameter $\lambda$.
For the experiments in this paper we used 1000 games for each such population test.
Appendix E Theoretical Properties of DiL-piKL
In this section we study the last-iterate convergence of DiL-piKL, establishing that in two-player zero-sum games DiL-piKL converges to the (unique) Bayes-Nash equilibrium of the regularized Bayesian game. As a corollary (in the case in which each player has exactly one type), we conclude that piKL converges to the Nash equilibrium of the regularized game in two-player zero-sum games. We start with a technical result. In all that follows, we will always let be a shorthand for the vector .
Lemma 1.
Proof.
If , then the result follows from direct inspection: is the uniform policy (and so for any ), and so the statement reduces to the first-order optimality conditions for the problem . So, we now focus on the case . The iterates and produced by DiL-piKL are respectively the solutions to the optimization problem
where we let the average utility vectors be
Since the negative-entropy regularizer is Legendre, the policies and are in the relative interior of the probability simplex, and therefore the first-order optimality conditions for and are respectively
Taking the difference between the equalities, we find
We now use the fact that
to further write
From \eqref{eq:vi_for_t_plus_1} we find
and so, plugging the previous relationship back into \eqref{eq:diff_of_vis}, we can write, for all ,
Dividing by yields the statement. ∎
Corollary 1.
Proof.
Since Lemma 1 holds for all , we can in particular set , and obtain
Using the three-point identity
in \eqref{eq:pre_threepoint} yields
Multiplying by yields the statement. ∎
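For completeness, the three-point identity used in the proof above is presumably the standard one for the KL divergence: for distributions $\mathbf{a}, \mathbf{b}, \mathbf{c}$ in the relative interior of the simplex,

```latex
\mathrm{KL}(\mathbf{a} \,\|\, \mathbf{c})
  = \mathrm{KL}(\mathbf{a} \,\|\, \mathbf{b})
  + \mathrm{KL}(\mathbf{b} \,\|\, \mathbf{c})
  + \langle \log \mathbf{b} - \log \mathbf{c},\ \mathbf{a} - \mathbf{b} \rangle,
```

which follows by expanding each KL term and cancelling the $\log \mathbf{a}$ contributions.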
E.1 Regret Analysis
Let be the regularized utility of agent type
Observation 1.
We note the following:

For any and , the function satisfies

Furthermore,
Using Corollary 1, we obtain the following lemma.
Lemma 2.
For any player and type ,
Proof.
From Lemma 1,
Rearranging, we find
We now upper bound the term in \eqref{eq:increment_term} using the convexity of the function , as follows:
Substituting the above bound into \eqref{eq:pre_flip} yields
where the second inequality follows from Young’s inequality. Finally, by using the strong convexity of the KL divergence between points and , that is,
yields the statement. ∎
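The two standard facts invoked in this proof are presumably Young's inequality and Pinsker's inequality (the latter expressing the strong convexity of the KL divergence with respect to the $\ell_1$ norm); for reference, for any vectors $\mathbf{u}, \mathbf{v}$, scalar $\alpha > 0$, and distributions $\mathbf{p}, \mathbf{q}$:

```latex
\langle \mathbf{u}, \mathbf{v} \rangle
  \le \frac{\alpha}{2}\,\|\mathbf{u}\|^2 + \frac{1}{2\alpha}\,\|\mathbf{v}\|^2,
\qquad
\mathrm{KL}(\mathbf{p} \,\|\, \mathbf{q}) \ge \frac{1}{2}\,\|\mathbf{p} - \mathbf{q}\|_1^2.
```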
Noting that the righthand side of Lemma 2 is telescopic, we immediately have the following.
Theorem 3.
For any player and type , and policy , the following regret bound holds at all times :
Proof.
From Lemma 2 we have that
where the second inequality follows from the fact that and the fact that is the uniform strategy. ∎
E.2 Last-Iterate Convergence in Two-Player Zero-Sum Games
In a two-player zero-sum game with payoff matrix for Player , a Bayes-Nash equilibrium of the regularized game is a collection of policies such that, for any supported type of Player , the policy is a best response to the average policy of the opponent. In symbols,
Denoting , the first-order optimality conditions for the best-response problems above are
We also mention the following standard lemma.
Lemma 3.
Let be the unique Bayes-Nash equilibrium of the regularized game. Let policies be arbitrary, and let:

;

;

.
Then,
The following potential function will be key in the analysis:
Proposition 1.
At all times , let
The potential satisfies the inequality
Proof.
By multiplying both sides of Corollary 1 for the choice , taking expectations over , and summing over the player , we find
We now proceed to analyze the last summation on the right-hand side. First,
Using Lemma 3 we can immediately write
By manipulating the inner product in \eqref{eq:stepA3}, we have
where the last inequality follows from the fact that for all choices of and . Substituting the individual bounds into \eqref{eq:stepA_decomposition} yields
Finally, plugging the above bound into \eqref{eq:step0} and rearranging terms yields
as we wanted to show. ∎
Theorem 4.
Proof.
Using the bound on given by Proposition 1 we obtain
We can now bound
On the other hand, note that
where the last inequality follows from expanding the definition of the KL divergence and using the fact that is the uniform strategy. Combining the inequalities and dividing by yields
Finally, using the fact that yields the statement. ∎