
Mastering the Game of No-Press Diplomacy via Human-Regularized Reinforcement Learning and Planning

by Anton Bakhtin, et al.

No-press Diplomacy is a complex strategy game involving both cooperation and competition that has served as a benchmark for multi-agent AI research. While self-play reinforcement learning has resulted in numerous successes in purely adversarial games like chess, Go, and poker, self-play alone is insufficient for achieving optimal performance in domains involving cooperation with humans. We address this shortcoming by first introducing a planning algorithm we call DiL-piKL that regularizes a reward-maximizing policy toward a human imitation-learned policy. We prove that this is a no-regret learning algorithm under a modified utility function. We then show that DiL-piKL can be extended into a self-play reinforcement learning algorithm we call RL-DiL-piKL that provides a model of human play while simultaneously training an agent that responds well to this human model. We used RL-DiL-piKL to train an agent we name Diplodocus. In a 200-game no-press Diplomacy tournament involving 62 human participants spanning skill levels from beginner to expert, two Diplodocus agents both achieved a higher average score than all other participants who played more than two games, and ranked first and third according to an Elo ratings model.





1 Introduction

In two-player zero-sum (2p0s) settings, principled self-play algorithms converge to a minimax equilibrium, which in a balanced game ensures that a player will not lose in expectation regardless of the opponent’s strategy (neumann1928theorie). This fact has allowed self-play, even without human data, to achieve remarkable success in 2p0s games like chess (silver2018general), Go (silver2017mastering), poker (bowling2015heads; brown2017superhuman), and Dota 2 (berner2019dota). (Dota 2 is a two-team zero-sum game, but the presence of full information sharing between teammates makes it equivalent to 2p0s.) Beyond 2p0s settings, self-play algorithms have also proven successful in highly adversarial games like six-player poker (brown2019superhuman). In principle, any finite 2p0s game can be solved via self-play given sufficient compute and memory. However, in games involving cooperation, self-play alone no longer guarantees good performance when playing with humans, even with infinite compute and memory. This is because in complex domains there may be arbitrarily many conventions and expectations for how to cooperate, of which humans may use only a small subset (lerer2019learning). The clearest example of this is language. A self-play agent trained from scratch without human data in a cooperative game involving free-form communication channels would almost certainly not converge to using English as the medium of communication. Obviously, such an agent would perform poorly when paired with a human English speaker. Indeed, prior work has shown that naïve extensions of self-play from scratch without human data perform poorly when playing with humans or human-like agents even in dialogue-free domains that involve cooperation rather than just competition, such as the benchmark games no-press Diplomacy (bakhtin2021no) and Hanabi (siu2021evaluation; cui2021k).

Recently, jacob2022modeling introduced piKL, which models human behavior in many games better than pure behavioral cloning (BC) on human data by regularizing inference-time planning toward a BC policy. In this work, we introduce an extension of piKL, called DiL-piKL, that replaces piKL’s single fixed regularization parameter λ with a probability distribution over λ parameters. We then show how DiL-piKL can be combined with self-play reinforcement learning, allowing us to train a strong agent that performs well with humans. We call this algorithm RL-DiL-piKL.

Using RL-DiL-piKL we trained an agent, Diplodocus, to play no-press Diplomacy, a difficult benchmark for multi-agent AI that has been actively studied in recent years (paquette2019no; anthony2020learning; gray2020human; bakhtin2021no; jacob2022modeling). We conducted a 200-game no-press Diplomacy tournament with a diverse pool of human players, including expert humans, in which we tested two versions of Diplodocus using different RL-DiL-piKL settings, and other baseline agents. All games consisted of one bot and six humans, with all players being anonymous for the duration of the game. These two versions of Diplodocus achieved the top two average scores in the tournament among all 48 participants who played more than two games, and ranked first and third overall among all participants according to an Elo ratings model.

2 Background and Prior work

Diplomacy is a benchmark 7-player mixed cooperative/competitive game featuring simultaneous moves and a heavy emphasis on negotiation and coordination. In the no-press variant of the game, there is no cheap talk communication. Instead, players only implicitly communicate through moves.

In the game, seven players compete for majority control of 34 “supply centers” (SCs) on a map. On each turn, players simultaneously choose actions consisting of an order for each of their units to hold, move, support, or convoy another unit. If no player controls a majority of SCs and either all remaining players agree to a draw or a turn limit is reached, then the game ends in a draw. In this case, we use a common scoring system in which the score of player i is C_i^2 / Σ_j C_j^2, where C_i is the number of SCs player i owns. A more detailed description of the rules is provided in Appendix B.
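The sum-of-squares scoring rule above can be sketched in a few lines; this is an illustrative helper for scoring drawn games from final supply-center counts, not code from the paper:

```python
# Illustrative sketch of the sum-of-squares scoring rule described above:
# in a draw, player i's score is C_i^2 / sum_j C_j^2.
def sos_scores(sc_counts):
    """Return each player's score share given final supply-center counts."""
    total = sum(c * c for c in sc_counts)
    if total == 0:
        return [0.0] * len(sc_counts)
    return [c * c / total for c in sc_counts]

# Example: a seven-player draw with unequal center counts.
print(sos_scores([10, 8, 6, 4, 3, 2, 1]))
```

Note how the quadratic numerator rewards consolidating centers: a player with twice the centers earns four times the raw weight.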

Most recent successes in no-press Diplomacy use deep learning to imitate human behavior given a corpus of human games. The first Diplomacy agent to leverage deep imitation learning was paquette2019no. Subsequent work on no-press Diplomacy has mostly relied on a similar architecture with some modeling improvements (gray2020human; anthony2020learning; bakhtin2021no).

gray2020human proposed an agent that plays an improved policy via one-ply search. It uses policy and value functions trained on human data to conduct search using regret minimization.

Several works explored applying self-play to compute improved policies. paquette2019no applied an actor-critic approach and found that while the resulting agent is stronger in populations of other self-play agents, it plays worse against a population of human-imitation agents. anthony2020learning used a self-play approach based on a modification of fictitious play in order to reduce drift from human conventions. The resulting policy is stronger than pure imitation learning in both 1vs6 and 6vs1 settings but weaker than agents that use search. Most recently, bakhtin2021no combined one-ply search based on equilibrium computation with value iteration to produce an agent called DORA. DORA achieved superhuman performance in a 2p0s version of Diplomacy without human data, but in the full 7-player game plays poorly with agents other than itself.

jacob2022modeling showed that regularizing inference-time search techniques can produce agents that are not only strong but can also model human behavior well. In the domain of no-press Diplomacy, they show that regularizing hedge (an equilibrium-finding algorithm) with a KL-divergence penalty toward a human imitation learning policy can match or exceed the human action prediction accuracy of imitation learning while being substantially stronger. KL-regularization toward human behavioral policies has previously been proposed in various forms in single- and multi-agent RL algorithms (nair2018overcoming; siegel2020keep; nair2020accelerating), and was notably employed in AlphaStar (vinyals2019grandmaster), but it has typically been used to improve sample efficiency and aid exploration rather than to better model and coordinate with human play.

An alternative line of research has attempted to build human-compatible agents without relying on human data (hu2020other; hu2021off; strouse2021collaborating). These techniques have shown some success in simplified settings but have not been shown to be competitive with humans in large-scale collaborative environments.

2.1 Markov Games

In this work, we focus on multiplayer Markov games (shapley1953stochastic).


An N-player Markov game is a tuple (S, A_1, …, A_N, r_1, …, r_N, p), where S is the state space, A_i is the action space of player i (i = 1, …, N), r_i : S × A_1 × ⋯ × A_N → R is the reward function for player i, and p : S × A_1 × ⋯ × A_N → S is the transition function.

The goal of each player i is to choose a policy π_i that maximizes the expected reward for that player, given the policies of all other players. In the case of N = 1, a Markov game reduces to a Markov Decision Process (MDP), where an agent interacts with a fixed environment.
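The tuple in the definition above can be mirrored by a minimal container; the names below are illustrative, not from the paper's code:

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Minimal container mirroring the Markov-game tuple defined above
# (field names are illustrative assumptions, not the paper's notation).
@dataclass
class MarkovGame:
    states: Sequence                 # state space S
    action_spaces: Sequence          # one action set A_i per player
    rewards: Sequence[Callable]      # r_i(s, joint_action) per player
    transition: Callable             # p(s, joint_action) -> next state

    @property
    def num_players(self) -> int:
        return len(self.action_spaces)
```

With N = 1 the same container describes an MDP, matching the remark above.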

At each state s, each player i simultaneously chooses an action a_i from a set of actions A_i. We denote the actions of all players other than i as a_{-i}. Players may also choose a probability distribution over actions, where the probability of action a_i is denoted π_i(a_i) or π(a_i), and the vector of probabilities is denoted π_i or π.

2.2 Hedge

Hedge (littlestone1994weighted; freund1997decision) is an iterative algorithm that converges to an equilibrium. We use variants of hedge for planning: on each turn of the game we compute an equilibrium policy and then play that policy.

Assume that after player i chooses an action a_i and all other players choose actions a_{-i}, player i receives a reward of u_i(a_i, a_{-i}), where u_i will come from our RL-trained value function. We denote the average reward in hindsight for action a_i up to iteration t as Q_i^t(a_i) = (1/t) Σ_{t'=1}^{t} u_i(a_i, a_{-i}^{t'}).

On each iteration t of hedge, the policy π_i^{t+1} is set according to π_i^{t+1}(a) ∝ exp(Q_i^t(a)/κ_t), where κ_t is a temperature parameter. (Our temperature notation κ_t differs from that used in jacob2022modeling in order to clean up notation.)

It is proven that if κ_t is set proportional to 1/√t, then as t → ∞ the average policy over all iterations converges to a coarse correlated equilibrium, though in practice it often comes close to a Nash equilibrium as well. In all experiments we set κ_t proportional to σ̂_t/√t on iteration t, where σ̂_t is the observed standard deviation of the player’s utility up to iteration t, based on a heuristic from brown2017dynamic. A simpler choice is to let κ_t → 0, which makes the algorithm equivalent to fictitious play (brown1951iterative).
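The hedge update above can be sketched in a few lines of numpy; the toy payoffs and the plain 1/√t temperature schedule are illustrative stand-ins (the paper scales the schedule by the observed standard deviation of utilities):

```python
import numpy as np

# Toy sketch of hedge: Q holds the average reward in hindsight per action,
# and the policy is a softmax of Q at temperature kappa_t.
def hedge_policy(Q, kappa):
    z = (Q - Q.max()) / max(kappa, 1e-8)   # stabilized softmax
    p = np.exp(z)
    return p / p.sum()

rng = np.random.default_rng(0)
mean_rewards = np.array([0.0, 0.5, 1.0])   # toy values; action 2 is best
Q = np.zeros(3)
for t in range(1, 501):
    pi = hedge_policy(Q, kappa=1.0 / np.sqrt(t))  # anneal the temperature
    u = mean_rewards + rng.normal(0.0, 0.1, 3)    # full-feedback toy payoffs
    Q += (u - Q) / t                              # incremental hindsight average
# As kappa decays, the policy concentrates on the best action in hindsight.
```

In the planning setting described above, the payoff vector u would come from evaluating joint actions with the RL-trained value function rather than from a fixed toy distribution.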

Regret matching (RM) (blackwell1956analog; hart2000simple) is an alternative equilibrium-finding algorithm that has similar theoretical guarantees to hedge and was used in previous work on Diplomacy (gray2020human; bakhtin2021no). We do not use this algorithm, but we do evaluate baseline agents that use RM.

2.3 DORA: Self-play learning in Markov games

Our approach draws significantly from DORA (bakhtin2021no), which we describe in more detail here. In this approach, the authors run an algorithm that is similar to past model-based reinforcement-learning methods such as AlphaZero (silver2018general), except in place of Monte Carlo tree search, which is unsound in simultaneous-action games such as Diplomacy and other imperfect-information games, it uses an equilibrium-finding algorithm such as hedge or RM to iteratively approximate a Nash equilibrium for the current state (i.e., one-step lookahead search). A deep neural net trained to predict the policy is used to sample plausible actions for all players, reducing the large action space in Diplomacy to a tractable subset for the equilibrium-finding procedure, and a deep neural net trained to predict state values is used to evaluate the results of joint actions sampled by this procedure. Beginning with a policy and value network randomly initialized from scratch, a large number of self-play games are played, and the resulting equilibrium policies and the improved 1-step value estimates computed on every turn from equilibrium-finding are added to a replay buffer used for subsequently improving the policy and value. Additionally, a double-oracle method (mcmahan2003doubleoracle) was used to allow the policy to explore and discover additional actions, and the same equilibrium-finding procedure was also used at test time.

For the core update step, bakhtin2021no propose Deep Nash Value Iteration (DNVI), a value iteration procedure similar to Nash Q-Learning (hu2003nash), which is a generalization of Q-learning (watkins1989learning) from MDPs to stochastic games. The idea of Nash-Q is to compute equilibrium policies in a subgame where the actions correspond to the possible actions in the current state and the payoffs are defined using the current approximation of the value function. bakhtin2021no propose an equivalent update that uses a state value function V instead of a state-action value function Q:

    V_i(s) ← (1 − α) V_i(s) + α Σ_a σ(a) [ r_i(s, a) + V_i(T(s, a)) ]     (3)

where α is the learning rate, σ(a) is the probability of joint action a in the equilibrium, a is a joint action, and T is the transition function. For 2p0s games and certain other game classes, this algorithm converges to a Nash equilibrium in the original stochastic game under the assumption that an exploration policy is used such that each state is visited infinitely often.
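One step of this value update can be made concrete on a toy game; all states, payoffs, and the equilibrium σ below are illustrative stand-ins, not values from the paper:

```python
# Toy sketch of the Nash-value-iteration update described above, for one
# player in a tiny game with a known equilibrium sigma over joint actions.
V = {"s0": 0.0, "s1": 1.0}           # current value estimates
alpha = 0.1                           # learning rate

joint_actions = ["aa", "ab"]
sigma = {"aa": 0.75, "ab": 0.25}      # equilibrium prob. of each joint action
reward = {"aa": 0.0, "ab": 0.2}       # this player's immediate reward
T = {"aa": "s1", "ab": "s0"}          # deterministic transition function

# Expected one-step value under the equilibrium policy.
target = sum(sigma[a] * (reward[a] + V[T[a]]) for a in joint_actions)
# Blend the old estimate toward the equilibrium-backed target.
V["s0"] = (1 - alpha) * V["s0"] + alpha * target
```

The tabular form shown here is exactly what the neural parameterization below replaces.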

The tabular approach of Nash-Q does not scale to large games such as Diplomacy. DNVI replaces the explicit value function table and the update rule in Equation 3 with a value function parameterized by a neural network, V_θ, and uses gradient descent to update it with the following loss:

    L_V(θ) = ( Σ_a σ(a) [ r_i(s, a) + V_θ(T(s, a)) ] − V_θ(s) )²     (4)
The summation in Equation 4 is not feasible in games with large action spaces, as the number of joint actions grows exponentially with the number of players. bakhtin2021no address this issue by considering only a subset of actions at each step. An auxiliary function, a policy proposal network π_θ, models the probability that an action of player i is in the support of the equilibrium σ_i. Only the top-k sampled actions from this distribution are considered when solving for the equilibrium policy and computing the above value loss. Once the equilibrium is computed, the equilibrium policy σ_i is also used to further train the policy proposal network using a cross-entropy loss:

    L_π(θ) = − Σ_a σ_i(a) log π_θ(a | s)     (5)
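The cross-entropy target for the proposal network can be sketched directly; the logits and the three-action equilibrium below are illustrative values, not from the paper:

```python
import numpy as np

# Sketch of the proposal-network cross-entropy: the network's distribution
# over the sampled top-k actions is pushed toward the equilibrium sigma.
sigma = np.array([0.6, 0.3, 0.1])      # computed equilibrium over k actions
logits = np.array([1.2, 0.4, -0.5])    # proposal-network logits (toy values)

# Log-softmax of the logits, then cross-entropy H(sigma, pi_theta).
log_probs = logits - np.log(np.exp(logits).sum())
loss = -(sigma * log_probs).sum()
```

In training, the gradient of this loss with respect to the logits moves the proposal distribution toward σ, so future equilibrium solves sample better candidate actions.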
bakhtin2021no report that the resulting agent, DORA, does very well when playing with other copies of itself. However, DORA performs poorly in games with six human-like agents.

2.4 piKL: Modeling humans with imitation-anchored planning

Behavioral cloning (BC) is the standard approach for modeling human behavior given data. BC learns a policy that maximizes the likelihood of the human data by gradient descent on a cross-entropy loss. However, as observed and discussed in jacob2022modeling, BC often falls short of accurately modeling or matching human-level performance, with BC models underperforming the human players they are trained to imitate in games such as chess, Go, and Diplomacy. Intuitively, it might seem that initializing self-play with an imitation-learned policy would result in an agent that is both strong and human-like. Indeed, bakhtin2021no showed improved performance against human-like agents when initializing the DORA training procedure from a human imitation policy and value, rather than starting from scratch. However, we show in subsection 5.3 that such an approach still results in policies that deviate from human-compatible equilibria.

jacob2022modeling found that an effective solution was to perform search with a regularization penalty proportional to the KL divergence from a human imitation policy. This algorithm is referred to as piKL. The form of piKL we focus on in this paper is a variant of hedge called piKL-hedge, in which each player i seeks to maximize expected reward while at the same time playing “close” to a fixed anchor policy τ_i. The two goals can be reconciled by defining a composite utility function that adds a penalty based on the “distance” between the player policy and their anchor policy, with a coefficient λ scaling the penalty.

For each player i, we define i’s utility as a function of the agent policy π_i given the policies of all other agents π_{-i}:

    U_i(π_i, π_{-i}) = u_i(π_i, π_{-i}) − λ D_KL(π_i ‖ τ_i)     (6)
This results in a modification of hedge such that on each iteration t, π_i^{t+1} is set according to

    π_i^{t+1}(a) ∝ exp( (Q_i^t(a) + λ log τ_i(a)) / (κ_t + λ) )     (7)

When λ is large, the utility function is dominated by the KL-divergence term D_KL(π_i ‖ τ_i), and so the agent will naturally tend to play a policy close to the anchor policy τ_i. When λ is small, the dominating term is the reward, and so the agent will tend to maximize reward without as closely matching the anchor policy τ_i.
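The two limiting behaviors above are easy to see numerically. This sketch implements the piKL-hedge softmax shift toward an anchor policy; the rewards, anchor, and λ/κ values are illustrative:

```python
import numpy as np

# Sketch of the piKL-hedge policy update: hedge's softmax over hindsight
# rewards Q is shifted toward the anchor policy tau, with lambda trading
# off reward maximization against the KL penalty.
def pikl_policy(Q, tau, lam, kappa):
    z = (Q + lam * np.log(tau)) / (kappa + lam)
    z -= z.max()                      # numerical stabilization
    p = np.exp(z)
    return p / p.sum()

Q = np.array([1.0, 0.0, 0.0])         # hindsight rewards favor action 0
tau = np.array([0.1, 0.8, 0.1])       # anchor (imitation) policy favors action 1

print(pikl_policy(Q, tau, lam=1e-3, kappa=0.1))   # small lambda: ~reward-maximizing
print(pikl_policy(Q, tau, lam=10.0, kappa=0.1))   # large lambda: ~anchor policy
```

With λ near zero the policy concentrates on the high-reward action; with λ large it reproduces the anchor almost exactly.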

3 Distributional Lambda piKL (DiL-piKL)

piKL trades off between the strength of the agent and the closeness to the anchor policy using a single fixed λ parameter. In practice, we find that sampling λ from a probability distribution each iteration produces better performance. In this section, we introduce distributional lambda piKL (DiL-piKL), which replaces the single λ parameter in piKL with a probability distribution over λ values. On each iteration, each player i samples a value of λ from this distribution and then chooses a policy based on Equation 7 using that sampled λ. Figure 1 highlights the difference between piKL and DiL-piKL.


Data: • A_i: set of actions for Player i;
 • u_i: reward function for Player i;
 • Λ_i: a set of λ values to consider for Player i;
 • P_i: a belief distribution over λ values for Player i.

function Initialize()
    for each action a in A_i do Q_i(a) ← 0

function Play(t)
    sample λ ~ P_i
    let π be the policy such that π(a) ∝ exp( (Q_i(a) + λ log τ_i(a)) / (κ_t + λ) )
    sample an action a ~ π; play a and observe the actions a_{-i} played by the opponents
    for each a' in A_i do update Q_i(a') using the reward u_i(a', a_{-i})

Algorithm 1: DiL-piKL (for Player i)
Figure 1: DiL-piKL algorithm. The highlighted lines (sampling λ from the belief distribution) show the main differences between this algorithm and the piKL-hedge algorithm proposed in jacob2022modeling.
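Algorithm 1 can be sketched end to end as a self-contained toy; the payoffs, anchor policy, λ types, and belief distribution are all illustrative stand-ins:

```python
import numpy as np

# Self-contained toy sketch of one DiL-piKL player (Algorithm 1): on each
# iteration a lambda is sampled from the belief distribution before the
# piKL-style softmax update.
rng = np.random.default_rng(0)

lams = np.array([0.5, 3.0])         # set of lambda types for this player
beliefs = np.array([0.5, 0.5])      # belief distribution over lambda types
tau = np.array([0.1, 0.8, 0.1])     # anchor (imitation) policy
true_u = np.array([1.0, 0.0, 0.2])  # expected reward per action (toy)

Q = np.zeros(3)                     # hindsight average rewards
for t in range(1, 201):
    lam = rng.choice(lams, p=beliefs)           # sample an agent type
    kappa = 1.0 / np.sqrt(t)                    # annealed temperature
    z = (Q + lam * np.log(tau)) / (kappa + lam)
    pi = np.exp(z - z.max()); pi /= pi.sum()    # piKL-style policy
    a = rng.choice(3, p=pi)                     # play and observe
    u = true_u + rng.normal(0.0, 0.1, 3)        # full-feedback toy payoffs
    Q += (u - Q) / t                            # update hindsight averages
```

Sampling λ afresh each iteration is the only change relative to the piKL-hedge sketch; everything else is the same update.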


Figure 2: One axis represents the common-knowledge belief about the λ parameter or distribution used by all players; the other represents the λ value actually used by the agent to determine its policy. By having the agent’s own λ differ from the common-knowledge λ, DiL-piKL interpolates between an equilibrium under the utility function U_i, behavioral cloning, and a best response to behavioral-cloning policies. piKL assumed a common λ, which moved it along one axis of this space. Our agent models and coordinates with high-λ players while playing a lower λ itself.

One interpretation of DiL-piKL is that each choice of λ is an agent type, where agent types with high λ choose policies closer to the anchor policy, while agent types with low λ choose policies that are more “optimal” and less constrained to a common-knowledge anchor policy. A priori, each player is randomly sampled from this population of agent types, and the λ distribution represents the common-knowledge uncertainty about which of the agent types player i may be. Another interpretation is that piKL assumed an exponential relation between action EV and likelihood, whereas DiL-piKL results in a fatter-tailed distribution that may more robustly model different playing styles or game situations.

3.1 Coordinating with piKL policies

While piKL and DiL-piKL are intended to model human behavior, an optimal policy in cooperative environments should be closer to a best response to this distribution. Selecting different λ values for the common-knowledge population versus the policy the agent actually plays allows us to interpolate between BC, best response to BC, and equilibrium policies (Figure 2). In practice, our agent samples λ from the common-knowledge distribution during equilibrium computation but ultimately plays a low-λ policy, modeling the fact that other players are unaware of our agent’s true type.

3.2 Theoretical Properties of DiL-piKL

DiL-piKL can be understood as a sampled form of follow-the-regularized-leader (FTRL). Specifically, one can think of Algorithm 1 as an instantiation of FTRL over the Bayesian game induced by the set of types and the regularized utilities of each player. In the appendix we show that when a player i learns using DiL-piKL, the policies played for any type λ are no-regret with respect to the regularized utilities defined in (6). Formally:

Theorem 1 (abridged).

Let B be a bound on the maximum absolute value of any payoff in the game. Then, for any player i, type λ, and number of iterations T, the cumulative regret can be upper bounded by a quantity that grows only logarithmically in T, with a game constant that depends on B, λ, and the number of actions.

The traditional analysis of FTRL is not applicable to DiL-piKL because the utility functions, as well as their gradients, can be unbounded due to the nonsmoothness of the regularization term that appears in the regularized utility function U_i, and therefore a more sophisticated analysis needs to be carried out. Furthermore, even in the special case of a single type (i.e., a singleton set of λ values), where DiL-piKL coincides with piKL, the above guarantee significantly refines the analysis of piKL in two ways. First, it holds no matter the choice of stepsize. Second, in the cases in which λ is tiny, by choosing the stepsize appropriately we recover a sublinear guarantee on the regret.

In 2p0s games, the logarithmic regret of Theorem 1 immediately implies that the average policy of each player is an O((log T)/T)-approximate Bayes-Nash equilibrium strategy. In fact, a strong guarantee on the last-iterate convergence of the algorithm can be obtained too:
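The first claim follows from the standard online-to-batch averaging argument; a sketch, writing ũ_i for the regularized utilities in (6) and Reg_i^T for the cumulative regret of Theorem 1 (this notation is ours):

```latex
% Logarithmic regret implies the average policy is an
% O((\log T)/T)-approximate equilibrium of the regularized game.
\max_{\pi_i'} \frac{1}{T}\sum_{t=1}^{T}
  \Big( \widetilde{u}_i(\pi_i',\pi_{-i}^t)
      - \widetilde{u}_i(\pi_i^t,\pi_{-i}^t) \Big)
  \;=\; \frac{\mathrm{Reg}_i^T}{T}
  \;=\; O\!\left(\frac{\log T}{T}\right)
```

Since this holds for both players simultaneously, no player can gain more than O((log T)/T) by deviating from the average policies, which is the definition of an O((log T)/T)-approximate equilibrium of the regularized game.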

Theorem 2 (abridged; Last-iterate convergence of piKL in 2p0s games).

When both players in a 2p0s game learn using DiL-piKL for T iterations, their policies converge almost surely to the unique Bayes-Nash equilibrium of the regularized game defined by the utilities in (6).

The last-iterate guarantee stated in Theorem 2 crucially relies on the strong convexity of the regularized utilities, and conceptually belongs with related efforts in showing last-iterate convergence of online learning methods. However, a key difficulty that sets apart Theorem 2 is the fact that the learning agents observe sampled actions from the opponents, which makes the proof of the result (as well as the obtained convergence rate) different from prior approaches.

4 Description of Diplodocus

By replacing the equilibrium-finding algorithm used in DORA with DiL-piKL, we hypothesize that we can learn a strong and human-compatible policy as well as a value function that can accurately evaluate game states, assuming strong and human-like continuation policies. We call this self-play algorithm RL-DiL-piKL. We use RL-DiL-piKL to train value and policy proposal networks and use DiL-piKL during test-time search.

4.1 Training

Our training algorithm closely follows that of DORA, described in Section 2.3. The loss functions used are identical to DORA and the training procedure is largely the same, except in place of RM to compute the equilibrium policy on each turn of a game during self-play, we use DiL-piKL with a distribution over λ values and a human imitation anchor policy that is fixed for all of training. See Appendix H for a detailed description of the differences between DORA and RL-DiL-piKL.

4.2 Test-Time Search

Following bakhtin2021no, at evaluation time we perform 1-ply lookahead where on each turn we sample up to 30 of the most likely actions for each player from the RL policy proposal network. However, rather than using RM to compute the equilibrium, we apply DiL-piKL.

As also mentioned previously in Section 3, while our agent samples λ from the common-knowledge probability distribution when computing the DiL-piKL equilibrium, the agent chooses its own action to actually play using a fixed low λ. For all experiments, including all ablations, the agent uses the same BC anchor policy. For DiL-piKL experiments, for each player i we set the λ distribution to be uniform over the set of λ types and play according to the low λ, except for the first turn of the game. On the first turn we instead sample λ from the distribution and play according to the sampled λ, so that the agent plays more diverse openings, which more closely resemble those that humans play.

5 Experiments

We first compare the performance of two variants of Diplodocus in a population of prior agents and other baseline agents. We then report results of Diplodocus playing in a tournament with humans.

5.1 Experimental setup

In order to measure the ability of agents to play well against a diverse set of opponents, we play many games between AI agents where each of the seven players are sampled randomly from a population of baselines (listed in Appendix D) or the agent to be tested. We report scores for each of the following algorithms against the baseline population:

Diplodocus-Low and Diplodocus-High are the proposed agents that use RL-DiL-piKL during training with two player types each; the two versions differ in the λ values used, with Diplodocus-Low using smaller λs and Diplodocus-High larger ones.
DORA is an agent that is trained via self-play and uses RM as the search algorithm during training and test-time. Both the policy and the value function are randomly initialized at the start of training.
DNVI is similar to DORA, but the policy proposal and value networks are initialized from human BC pretraining.
DNVI-NPU is similar to DNVI, but during training only the RL value network is updated. The policy proposal network is still trained but never fed back to self-play workers, to limit self-play drift from human conventions. The final RL policy proposal network is only used at the end, at test time (along with the RL value network).
BRBot is an approximate best response to the BC policy. It was trained the same as Diplodocus, except that during training one distinguished player each game plays with a near-zero λ (i.e., essentially unregularized) while all other players use a very large λ (i.e., essentially the BC policy).
SearchBot is a one-step lookahead equilibrium search agent from (gray2020human), evaluated using their published model.
HedgeBot is an agent similar to SearchBot (gray2020human) but using our latest architecture and using hedge rather than RM as the equilibrium-finding algorithm.
FPPI-2 and SL are two agents from (anthony2020learning), evaluated using their published models.

After computing these population scores, as a final evaluation we organized a tournament where we evaluated four agents for 50 games each in a population of online human participants. We evaluated two baseline agents, BRBot and DORA, and two of our new agents, Diplodocus-Low and Diplodocus-High.

In order to limit the duration of games to only a few hours, these games used a time limit of 5 minutes per turn and a stochastic game-end rule: at the beginning of each game year between 1909 and 1912 the game ends immediately with 20% chance, increasing to a 40% chance in 1913. Players were not told which turn a specific game would end on, but were told the distribution it was sampled from. Our agents were also trained based on this distribution. (Games were run by a third-party contractor. In contradiction of the criteria we specified, the contractor ended games artificially early for the first 80 games played in the tournament, with end dates of 1909-1911 being more common than they should have been. We immediately corrected this problem once it was identified.) Players were recruited from Diplomacy mailing lists. In order to mitigate the risk of cheating by collusion, players were paid hourly rather than based on in-game performance. Each game had exactly one agent and six humans. The players were informed that there was an AI agent in each game, but did not know which player was the bot in each particular game. In total, 62 human participants played 200 games, with 44 human participants playing more than two games and 39 human participants playing at least 5 games.

5.2 Experimental Results

Agent Score against population
Diplodocus-Low 29% ± 1%
Diplodocus-High 28% ± 1%
DNVI-NPU (retrained) (bakhtin2021no) 20% ± 1%
BRBot 18% ± 1%
DNVI (retrained) (bakhtin2021no) 15% ± 1%
HedgeBot (retrained) (jacob2022modeling) 14% ± 1%
DORA (retrained) (bakhtin2021no) 13% ± 1%
FPPI-2 (anthony2020learning) 9% ± 1%
SearchBot (gray2020human) 7% ± 1%
SL (anthony2020learning) 6% ± 1%
Table 1: Performance of different agents in a population of various agents. Agents above the line were trained using identical neural network architectures. Agents below the line were evaluated using the models and the parameters provided by the authors. The ± shows one standard error.

We first report results for our agents in the fixed population described in Appendix D. The results, shown in Table 1, indicate that Diplodocus-Low and Diplodocus-High perform best by a wide margin.

We next report results for the human tournament in Table 2. For each listed player, we report their average score, Elo rating, and rank within the tournament based on Elo among players who played at least 5 games. Elo ratings were computed using a standard generalization of BayesElo (coulom2005bayeselo) to multiple players (hunter2004mmbt) (see Appendix I for details). This gives similar rankings as average score, but also attempts to correct for both the average strength of the opponents, since some games may have stronger or weaker opposition, as well as for which of the seven European powers a player was assigned in each game, since some starting positions in Diplomacy are advantaged over others. To regularize the model, a weak Bayesian prior was applied such that each player’s rating was normally distributed around 0 with a standard deviation of around 350 Elo.
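A simplified multiplayer Elo-style model conveys the idea: expected score shares follow a softmax over ratings, and a Gaussian prior keeps ratings near 0. This is an illustrative simplification, not the BayesElo generalization of hunter2004mmbt actually used, and it omits the per-power corrections:

```python
import math

# Illustrative multiplayer Elo-style model: each player's expected score
# share in a game is a softmax over ratings on the standard Elo scale
# (400 rating points = 10x odds). Toy simplification, not the paper's model.
SCALE = 400.0 / math.log(10.0)

def expected_shares(ratings):
    exps = [math.exp(r / SCALE) for r in ratings]
    z = sum(exps)
    return [e / z for e in exps]

def log_posterior(ratings, observed_shares, prior_sd=350.0):
    """Log-likelihood of observed score shares plus a Gaussian rating prior."""
    pred = expected_shares(ratings)
    ll = sum(s * math.log(p) for s, p in zip(observed_shares, pred) if s > 0)
    prior = sum(-(r * r) / (2.0 * prior_sd ** 2) for r in ratings)
    return ll + prior
```

Fitting ratings then amounts to maximizing `log_posterior` summed over all games a player appeared in, which is how the prior around 0 with standard deviation 350 regularizes sparse players.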

The results show that Diplodocus-High performed best among all the humans by both Elo and average score. Diplodocus-Low followed closely behind, ranking second according to average score and third by Elo. BRBot performed relatively well, but ranked below both Diplodocus agents and several humans. DORA performed relatively poorly.

Two participants achieved a higher average score than the Diplodocus agents: a player averaging 35% who played only two games, and a player scoring 29% who played only one game.

We note that given the large statistical error margins, the results in Table 2 do not conclusively demonstrate that Diplodocus outperforms the best human players, nor do they alone demonstrate an unambiguous separation between Diplodocus and BRBot. However, the results do indicate that Diplodocus performs at least at the level of expert players in this population of players with diverse skill levels. Additionally, the superior performance of both Diplodocus agents compared to BRBot is consistent with the results from the agent population experiments in Table 1.

Player Rank Elo Avg Score # Games
Diplodocus-High 1 181 27% ± 4% 50
Human 2 162 25% ± 6% 13
Diplodocus-Low 3 152 26% ± 4% 50
Human 4 138 22% ± 9% 7
Human 5 136 22% ± 3% 57
BRBot 6 119 23% ± 4% 50
Human 7 102 18% ± 8% 8
Human 8 96 17% ± 3% 51
DORA 32 -20 13% ± 3% 50
Human 43 -187 1% ± 1% 7
Table 2: Performance of four different agents in a population of human players, ranked by Elo, among all 43 participants who played at least 5 games. The ± shows one standard error.

In addition to the tournament, we asked three expert human players to evaluate the strength of the agents in the tournament games based on the quality of their actions. Games were presented to these experts with anonymized labels so that the experts were not aware of which agent was which in each game when judging that agent’s strategy. All the experts picked a Diplodocus agent as the strongest agent, though they disagreed about whether Diplodocus-High or Diplodocus-Low was best. Additionally, all experts indicated one of the Diplodocus agents as the one they would most like to cooperate with in a game. We provide detailed responses in Appendix C.

5.3 RL training comparison

Figure 3 compares different RL agents across the course of training. To simplify the comparison, we vary the training methods for the value and policy proposal networks, but use the same search setting at evaluation time.

As a proxy for agent strength, we measure the average score of an agent versus 6 copies of HedgeBot. As a proxy for modeling humans, we compute the prediction accuracy of human moves on a validation dataset of roughly 630 games held out from training of the human BC model, i.e., how often the most probable action under the policy corresponds to the one chosen by a human. Similar to bakhtin2021no, we found that agents without biasing techniques (DORA and DNVI) diverge from human play as training progresses. By contrast, Diplodocus-High achieves a significant improvement in score while keeping human prediction accuracy high.
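The human-prediction proxy above amounts to top-1 accuracy of a policy against held-out human actions. A minimal sketch follows; the policy interface and toy data are illustrative, not the paper's actual code.

```python
# Sketch of the human-prediction-accuracy proxy: top-1 accuracy of a
# policy against held-out human actions. Names here are illustrative.

def top1_accuracy(policy, validation_set):
    """Fraction of positions where the policy's most probable action
    matches the action the human actually played."""
    hits = 0
    for state, human_action in validation_set:
        dist = policy(state)                    # dict: action -> probability
        predicted = max(dist, key=dist.get)     # argmax action
        hits += predicted == human_action
    return hits / len(validation_set)

# Toy policy over two states and a tiny "validation set":
toy_policy = lambda s: {"hold": 0.6, "move": 0.4} if s == "A" else {"hold": 0.2, "move": 0.8}
data = [("A", "hold"), ("B", "move"), ("B", "hold")]
print(top1_accuracy(toy_policy, data))  # 2 of 3 predictions match
```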


Figure 3: Performance of different agents as a function of the number of RL training steps. Left: Scores against 6 human-like HedgeBot agents. The gray dotted line at a score of 1/7 corresponds to tying HedgeBot. The error bars show one standard error. Right: Order prediction accuracy of each agent’s raw RL policy on a held-out set of human games. The gray dotted line corresponds to the behavioral cloning policy. Overall: Diplodocus-High achieves a high score while also maintaining high prediction accuracy. The unregularized agents DNVI and DORA do far worse on both metrics.

6 Discussion

In this work we describe RL-DiL-piKL and use it to train an agent for no-press Diplomacy that placed first in a human tournament. We ascribe Diplodocus’s success in Diplomacy to two ideas.

First, DiL-piKL models a population of player types with different amounts of regularization toward a human policy while ultimately playing a strong (low-λ) policy itself. This improves upon simply playing a best response to a BC policy by accounting for the fact that humans are less likely to play highly suboptimal actions and by reducing overfitting of the best response to the BC policy. Second, incorporating DiL-piKL in self-play allows us to learn an accurate value function in the diversity of situations that arise from strong and human-like players. Furthermore, this value function assumes a human continuation policy that makes fewer blunders than the BC policy, allowing us to correctly estimate the values of positions that require accurate play (such as stalemate lines).
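A standard way to write this kind of anchoring is as a KL-regularized utility; the notation below is a sketch and may not match the paper's exact formulation:

```latex
% Regularized utility for player i with human anchor (BC) policy tau_i:
\tilde{u}_i(\pi_i, \pi_{-i}) \;=\;
    u_i(\pi_i, \pi_{-i})
    \;-\; \lambda_i \, D_{\mathrm{KL}}\!\left(\pi_i \,\|\, \tau_i\right)
```

A small λ_i yields near reward-maximizing play, while a large λ_i keeps the policy close to the human anchor τ_i; DiL-piKL maintains a distribution over such λ types.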

In conclusion, combining human imitation, planning, and RL presents a promising avenue for building agents for complex cooperative and mixed-motive environments. Further work could explore regularized search policies that condition on more complex human behavior, including dialogue.


Appendix A Author Contributions

A. Bakhtin primarily contributed to RL, infrastructure, experimentation, and direction. D. J. Wu primarily contributed to RL and infrastructure. A. Lerer primarily contributed to DiL-piKL, infrastructure, and direction. J. Gray primarily contributed to infrastructure. A. P. Jacob primarily contributed to DiL-piKL and experimentation. G. Farina primarily contributed to theory and DiL-piKL. A. H. Miller primarily contributed to experimentation. N. Brown primarily contributed to DiL-piKL, experimentation, and direction.

Appendix B Description of Diplomacy

The rules of no-press Diplomacy are complex; a full description is provided by paquette2019no. No-press Diplomacy is a seven-player zero-sum board game in which a map of Europe is divided into 75 provinces. 34 of these provinces contain supply centers (SCs), and the goal of the game is for a player to control a majority (18) of the SCs. Each player begins the game controlling three or four SCs and an equal number of units.

The game consists of three types of phases: movement phases in which each player assigns an order to each unit they control, retreat phases in which defeated units retreat to a neighboring province, and adjustment phases in which new units are built or existing units are destroyed.

During a movement phase, a player assigns an order to each unit they control. A unit’s order may be to hold (defend its province), move to a neighboring province, convoy a unit over water, or support a neighboring unit’s hold or move order. Support may be provided to units of any player. We refer to a tuple of orders, one order for each of a player’s units, as an action. That is, each player chooses one action each turn. Each unit has an average of 26 valid orders (paquette2019no), so the game’s branching factor is massive, and on some turns enumerating all actions is intractable.
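To get a feel for the branching factor, a back-of-the-envelope calculation using the reported average of 26 valid orders per unit; the unit count below is an illustrative assumption, not a figure from the paper.

```python
# Back-of-the-envelope branching factor for a single movement phase.
# The 26-orders-per-unit figure is from paquette2019no; the unit count
# is an assumed mid-game value for illustration only.
orders_per_unit = 26
units_per_player = 17              # assumed unit count for one player

# Actions available to one player (one order per unit):
actions_per_player = orders_per_unit ** units_per_player

# All seven players move simultaneously, so the joint branching factor is:
joint_branching = actions_per_player ** 7

print(f"~10^{len(str(actions_per_player)) - 1} actions per player")
print(f"~10^{len(str(joint_branching)) - 1} joint action profiles")
```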

Importantly, all actions occur simultaneously. In live games, players write down their orders and then reveal them at the same time. This makes Diplomacy an imperfect-information game in which an optimal policy may need to be stochastic in order to prevent predictability.

Diplomacy is designed in such a way that cooperation with other players is almost essential in order to achieve victory, even though only one player can ultimately win.

A game may end in a draw on any turn if all remaining players agree. Draws are a common outcome among experienced players because players will often coordinate to prevent any individual from reaching 18 centers. The two most common scoring systems for draws are draw-size scoring (DSS), in which all surviving players equally split a win, and sum-of-squares scoring (SoS), in which player i receives a score of C_i^2 / Σ_j C_j^2, where C_i is the number of SCs that player i controls (fogel2020whom). Throughout this paper we use SoS scoring except in anonymous games against humans, where the human host chooses the scoring system.
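The two draw-scoring rules can be sketched as follows; the final SC counts in the example are hypothetical.

```python
# Sketch of the two draw-scoring rules described above.

def sos_scores(centers):
    """Sum-of-squares: player i receives C_i^2 / sum_j C_j^2."""
    total = sum(c * c for c in centers)
    return [c * c / total for c in centers]

def dss_scores(centers):
    """Draw-size: surviving players (>= 1 SC) split the win equally."""
    survivors = sum(1 for c in centers if c > 0)
    return [1 / survivors if c > 0 else 0.0 for c in centers]

centers = [10, 8, 8, 5, 3, 0, 0]    # hypothetical final SC counts
print(sos_scores(centers))           # leader gets 100/262 ~ 0.38
print(dss_scores(centers))           # five survivors each get 0.20
```

Note that SoS rewards a larger center count even within a draw, while DSS treats all survivors identically, which changes how aggressively players should contest centers late in the game.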

Appendix C Expert evaluation of the agents

The anonymous format of the tournament aimed to reduce possible biases of players toward the agents, e.g., trying to collectively eliminate the agents, since targeting the agent is a simple way to break the symmetry. At the same time, a significant part of Diplomacy is knowing the play styles of different players and using this knowledge to decide whom to trust and whom to choose as an ally. To evaluate this aspect of gameplay, we asked for qualitative feedback from three Diplomacy experts. Each expert was given 7 games (one per power) from each of the 4 different agents that played in the tournament. The games evaluated by each expert were disjoint from the games evaluated by the other experts. The games were anonymized such that the experts were not able to tell which agent played in a game based on the username or the date. We asked a few questions about the gameplay of each agent independently and then asked the experts to choose the best agent for strength and human-like behavior. The experts referred to the agents as Agent1, …, Agent4, but we de-anonymized the agents in the answers below.

C.1 Overall

What is the strongest agent?

Expert 1

I think Diplodocus-Low was the strongest, then BRBot closely followed by Diplodocus-High. DORA is a distant third.

Expert 2


Expert 3

Diplodocus-Low. This feels stronger than a human in certain ways while still being very human-like.

What is the most human-like/bot-like agent?

Expert 1

Most human-like is Diplodocus-High. A boring human, but a human nonetheless. Diplodocus-Low is not far behind, then BRBot and DORA both of which are very non-human albeit in very different ways.

Expert 2


Expert 3


What is the agent you’d like to cooperate with?

Expert 1

This is the most interesting question. I think Diplodocus-Low, because I like how it plays - we’d “vibe” - but also because I think it is quite predictable in what motivates it to change alliances. That’s potentially exploitable, even with the strong tactics it has. I’d least like to work with Diplodocus-High as it seems to be very much in it for itself. I suspect it would be quite unpleasant to play against as it is tactically excellent and seems hard to work with.

I’d love to be on a board with DORA, as I’d expect my chances to solo to go up dramatically! It would be a very fun game so long as you weren’t on the receiving end of some of its play.

Expert 2


Expert 3

Diplodocus-Low. Diplodocus-High is also strong, but seems much less interesting to play with, because of the way it commits to alliances without taking into account who is actually willing to work with it. This limits what a human can do to change their situation quite a lot and would be fairly frustrating in the position of a neighbour being attacked by it.

BRBot and DORA feel too weak to be particularly interesting.

C.2 DORA

How would you evaluate the overall strength of the agent?

Expert 1

Not great. There’s a lot to criticize here - from bad opening play (Russia = bonkers), to poor defense (Turkey) and just generally bad tactics and strategy compared to the other agents (France attacking Italy when Italy is their only ally was an egregious example of this).

Expert 2

Very weak. Seemed to invite its own demise with the way it routinely picked fights in theaters it had no business being in and failed to cooperate with neighbors.

Expert 3

Poor. It is bad at working with players, and it makes easily avoidable blunders even when working alone.

How would you evaluate the ability of the agent to cooperate?

Expert 1

It seems to make efforts, but it also seems to misjudge what humans are likely to do. There are indicative support orders and they’re pretty good, but it doesn’t seem to account for vindictiveness, assuming opponents always play their best. The Turkey game, where it repeatedly seems to expect Russia not to attack, is an example of this.

Expert 2

Poor. Seemed to pick fights without seeing or soliciting support necessary to win, failed to support potential allies in useful ways to take advantage of their position.

Expert 3

Middling to Poor. It very occasionally enters good supports but it often enters bad ones, and has a habit of attacking too many people at once (and not considering that attacking those people will turn them against it). It has a habit of annoying many players and doing badly as a result.

C.3 BRBot

How would you evaluate the overall strength of the agent?

Expert 1

The agent has solid, at least human-level tactics and clearly sees opportunities to advance and acts accordingly. Sometimes this is to the detriment of the strategic position, but the balance is fair given the gunboat nature of the games. Overall, the bot naturally feels to be in the “better than average human” range rather than superhuman, but the results indicate that it performs at a higher level than the “feeling” it gives. It has a major opportunity for improvement, discussed in the next point.

Expert 2

Overall, seemed fairly weak and seemed to be able to succeed most frequently when benefiting from severe mistakes from neighboring agents. That being said it was able to exploit those mistakes somewhat decently in some cases and at least grow to some degree off of it.

Expert 3

Middling. It is tactically strong when not having to work with other players and when it has a considerable number of units, but is quite weak when attempting to cooperate with other players. Its defensive strength varies quite significantly too, possibly also based on unit count - when it had relatively few units it missed very obvious defensive tactics.

How would you evaluate the ability of the agent to cooperate?

Expert 1

The bot is hyperactively trying to coordinate and signal to the other players that it wants to work with them. Sometimes this is in the form of ridiculous orders that probably indicate desperation more than a mutually beneficial alliance, and this backfires as you may expect. At its best it makes exceptional signaling moves (in the RUSSIA game, War - Mos in Fall 1901 is exceptional) but at worst it is embarrassingly bad and leads to it getting attacked (in the TURKEY game, supporting convoys from Gre - Smy or supporting other powers moving to Armenia). The other weakness is that it tends to make moves like these facing all other powers - this is not optimal, as indicating to all other powers that you want to work with them is equivalent to not indicating anything at all - if anything it seems a little duplicitous. This is especially true when the bot is still inclined to stab when the opportunity presents itself, which means the signaling is superficial and unlikely to work repeatedly. Overall, the orders show the ability to cooperate, signal, and work together, but the hyperactivity of the bot is limiting the effectiveness of the tools to achieve the best results.

Expert 2

Poor. Random support orders seemed to be thrown without an overarching strategy behind them. Moves didn’t seem to suggest long term thoughts of collaboration.

Expert 3

Poor. When attempting to work with another player, it almost always gives them the upper hand, and even issues supports that suggest it is okay with that player taking its SCs when it should not be. It sometimes matches supports to human moves, but does not seem to do this very often. The nonsensical supports are much more common.

C.4 Diplodocus-High

How would you evaluate the overall strength of the agent?

Expert 1

The tactics are unadventurous and sometimes seem below human standards (for example, the train of army units in the Italy game; the whole Turkey game) but conversely they also have a longer view of the game (see also: Italy game - the trained bounces don’t matter strategically). There’s less nonsense too; if I were to sum the bot up in two words it would be “practical” and “boring”.

Expert 2

Seemed to be strong. Wrote generally good tactical orders, showed generally good strategic sense. Showed discipline and a willingness to let allies survive in weak positions while having units that could theoretically stab for dots with ease remaining right next to that weak ally.

There were some highly questionable moments in its early 1901 strategy as both Italy and France, which seemed to heavily harm its ability to get out of the box.

The Austrian game was particularly impressive in terms of its ability to handle odd scenarios and achieve the solo despite receiving pressure on multiple occasions on multiple fronts.

Expert 3

Generally strong. It is good at signalling and forming alliances, is tactically strong when in its favoured alliance, and is especially strong when ahead. Its main weakness seems to be an inability to adapt - if its favoured alliance is declined, it will often keep trying to ‘pitch’ that same alliance instead of working towards alternatives.

How would you evaluate the ability of the agent to cooperate?

Expert 1

Low. It doesn’t put much effort into this. In the French game, for example, the bot just seems to accept that it is being attacked and fights through it. It’s boring and tactical and shows little care for cooperation. Many great gunboat players do this, but it will not hold up in press games. What it does seem to do is capitalize on other players’ mistakes - see the Austrian game, where it sneaks into Scandinavia and optimizes to get to 18 (there can’t be a lot of training data for that!).

Expert 2

Very strong ability to cooperate as seen in the Turkish game, but in other games seemed to try and pick fights against the entire world in ways that were ultimately self-defeating.

Expert 3

Good. It can work well with human players, matching supports and even using signalling supports in ways humans would. It frequently attempts to side with a player who is attacking it, though, so it seems to have a problem with identifying which player to work with.

C.5 Diplodocus-Low

How would you evaluate the overall strength of the agent?

Expert 1

Exceptional. Very strong tactics and a clear directionality to what it does - it seems to understand what the strategic value of a position is and it acts with efficiency to achieve the strategic goals. It has great results (time drawn out of a few wins!) but also fights back from “losing” positions extremely well, which makes it quantifiably strong, but it also just plays a beautiful and effective game. Very strong indeed. It does sometimes act too aggressively for tournament play (Austria is the example where this came home to roost) - the high risk/reward plays are generally but not always correct in single games, but for tournament play it goes for broke a bit too much (this is outside the scope of the agent, I suspect, as it is playing to the game’s scoring system, not the tournament scoring system). Against human players who may not see the longer-term impact of their play, it results in games like this one… which is ugly both for Austria and for everyone else except Turkey.

Expert 2

Very weak. Seemed to abandon its own position in many cases to pursue questionable adventures. Sometimes they worked out but generally they failed, resulting in things like a Germany under siege holding Edi while they as England are off in Portugal and are holding onto their home centers only because FG were under siege by the south.

Expert 3

Very strong. It can signal alliances very well and generally chooses the correct allies, seems strong tactically even on defence, and makes some plays you would not expect from a human player but which are outright stronger than a human player would make.

How would you evaluate the ability of the agent to cooperate?

Expert 1

Pretty good. It sends signaling moves and makes efforts to support other players quite a lot (see in particular Russia). I particularly like the skills being shown to work together tactically and try and support other units - this is both effective and quite human. This is my favorite bot by some distance when it comes to cooperating with the other players. There is a weakness in that it does seem to reassess alliances every turn, which means sometimes the excellent work indicating and supporting is undone without getting the chance to realize the gains (Examples with Russia and Italy).

Expert 2

Poor. Didn’t seem to give meaningful support orders when they would have helped, and gave plenty of meaningless signaling supports and some questionable ones, like supporting the English into SKA in F1901 as Germany, among other oddities.

Expert 3

Good. It signals alliances in very human ways, through clear signalling builds, accurate support moves where it makes sense, and support holds otherwise. It also seems to match supports with its allies well.

Appendix D Population based evaluation

In general-sum games like Diplomacy, winrate in head-to-head matches against a previous version of an agent may not be as informative because of nontransitivity between agents. For example, exploitative agents such as best-response-to-BC may do particularly well against BC or other pure imitation-learning agents, and less well against all other agents. Additionally, bakhtin2021no found that a pair of independently and equally-well-trained RL agents may each appear very weak in a population composed of the other due to converging to incompatible equilibria. Many agents also varied significantly in how well they performed against other search-based agents.

Therefore, we evaluate agents by playing against a population of previously trained agents, as was done in jacob2022modeling; this is intended to measure more broadly how well an agent does on average against a wider suite of human-like agents.

More precisely, we define a fixed set of baseline agents as a population. To determine an agent’s average population score, we add that agent into the population and then play games where in each game, all 7 players are uniformly randomly sampled from the population with replacement, keeping only games where the agent to be tested was sampled at least once. Note that unlike jacob2022modeling, we run a separate population test for each new agent to be tested, rather than combining all agents to be tested within a single population.
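The sampling procedure above can be sketched as rejection sampling; the function and agent names below are illustrative, not the paper's actual code.

```python
import random

def sample_population_game(baselines, test_agent, rng, seats=7):
    """Sample one evaluation game: all seats are drawn uniformly with
    replacement from the population plus the tested agent, and games
    that do not include the tested agent are rejected, matching the
    'sampled at least once' condition. Names here are illustrative."""
    pool = baselines + [test_agent]
    while True:
        lineup = [rng.choice(pool) for _ in range(seats)]
        if test_agent in lineup:
            return lineup

rng = random.Random(0)
baselines = [f"baseline_{i}" for i in range(8)]   # the 8 baseline agents
games = [sample_population_game(baselines, "tested", rng) for _ in range(1000)]
assert all("tested" in g and len(g) == 7 for g in games)
```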

For the experiments in Table 1 and subsection 5.1 we used the following 8 baseline agents:

  • A single-turn best-response (BR) agent that assumes all other players play the BC policy.

  • An agent using regret-matching (RM) search with BC policy and value functions. We use 2 copies of this agent trained on different subsets of data.

  • A DiL-piKL agent with BC policy and value functions. We use 4 different versions of this agent with different training data and model architectures.

  • A DiL-piKL agent where the policy and value functions are trained with self-play using Reinforced-PiKL with a high lambda.

For the experiments in this paper we used 1000 games for each such population test.

Appendix E Theoretical Properties of DiL-piKL

In this section we study the last-iterate convergence of DiL-piKL, establishing that in two-player zero-sum games DiL-piKL converges to the (unique) Bayes-Nash equilibrium of the regularized Bayesian game. As a corollary (in the case in which each player has exactly one type), we conclude that piKL converges to the Nash equilibrium of the regularized game in two-player zero-sum games. We start from a technical result. In all that follows, we will always let be a shorthand for the vector .

Lemma 1.

Fix any player , , and . For all , the iterates and defined in Line 1 of Algorithm 1 satisfy


If , then the result follows from direct inspection: is the uniform policy (and so for any , and so the statement reduces to the first-order optimality conditions for the problem . So, we now focus on the case . The iterates and produced by DiL-piKL are respectively the solutions to the optimization problem

where we let the average utility vectors be

Since the negative-entropy regularizer is Legendre, the policies and are in the relative interior of the probability simplex, and therefore the first-order optimality conditions for and are respectively

Taking the difference between the equalities, we find

We now use the fact that

to further write

From LABEL:eq:vi_for_t_plus_1 we find

and so, plugging back the previous relationship in LABEL:eq:diff_of_vis we can write, for all ,

Dividing by yields the statement. ∎

Corollary 1.

Fix any player , , and . For all , the iterates and defined in Line 1 of Algorithm 1 satisfy


Since Lemma 1 holds for all , we can in particular set , and obtain

Using the three-point identity

in LABEL:eq:pre_threepoint yields

Multiplying by yields the statement. ∎

e.1 Regret Analysis

Let be the regularized utility of agent type

Observation 1.

We note the following:

  • For any and , the function satisfies

  • Furthermore,

Using Corollary 1, we have the following.

Lemma 2.

For any player and type ,


From Lemma 1,

Rearranging, we find

We now upper bound the term in (LABEL:eq:increment_term) using convexity of the function , as follows:

Substituting the above bound into (LABEL:eq:pre_flip) yields

where the second inequality follows from Young’s inequality. Finally, by using the strong convexity of the KL divergence between points and , that is,

yields the statement. ∎

Noting that the right-hand side of Lemma 2 is telescopic, we immediately have the following.

Theorem 3.

For any player and type , and policy , the following regret bound holds at all times :


From Lemma 2 we have that

where the second inequality follows from the fact that and the fact that is the uniform strategy. ∎

e.2 Last-Iterate Convergence in Two-Player Zero-Sum Games

In a two-player game with payoff matrix for Player , a Bayes-Nash equilibrium of the regularized game is a collection of policies such that for any supported type of Player , the policy is a best response to the average policy of the opponent. In symbols,

Denoting , the first-order optimality conditions for the best response problems above are

We also mention the following standard lemma.

Lemma 3.

Let be the unique Bayes-Nash equilibrium of the regularized game. Let policies be arbitrary, and let:

  • ;

  • ;

  • .


The following potential function will be key in the analysis:

Proposition 1.

At all times , let

The potential satisfies the inequality


By multiplying both sides of Corollary 1 for the choice , taking expectations over , and summing over the player , we find

We now proceed to analyze the last summation on the right-hand side. First,

Using Lemma 3 we can immediately write

By manipulating the inner product in (LABEL:eq:stepA3), we have

where the last inequality follows from the fact that for all choices of and . Substituting the individual bounds into (LABEL:eq:stepA_decomposition) yields

Finally, plugging the above bound into (LABEL:eq:step0) and rearranging terms yields

as we wanted to show. ∎

Theorem 4.

As in Proposition 1, let

Let be the notion of distance defined as

At all times ,



Using the bound on given by Proposition 1 we obtain

We can now bound

On the other hand, note that

where the last inequality follows from expanding the definition of the KL divergence and using the fact that is the uniform strategy. Combining the inequalities and dividing by yields

Finally, using the fact that yields the statement. ∎

Theorem 5 (Last-iterate convergence of DiL-piKL in two-player zero-sum games).

Let be as in the statement of Theorem 4. When both players in a zero-sum game learn using DiL-piKL for iterations, their policies converge to the unique Bayes-Nash equilibrium of the regularized game defined by utilities (6), in the following senses:

  1. In expectation: for all