K-level Reasoning for Zero-Shot Coordination in Hanabi

by Brandon Cui, et al.
University of Oxford

The standard problem setting in cooperative multi-agent settings is self-play (SP), where the goal is to train a team of agents that works well together. However, optimal SP policies commonly contain arbitrary conventions ("handshakes") and are not compatible with other, independently trained agents or humans. This latter desideratum was recently formalized by Hu et al. (2020) as the zero-shot coordination (ZSC) setting and partially addressed with their Other-Play (OP) algorithm, which showed improved ZSC and human-AI performance in the card game Hanabi. OP assumes access to the symmetries of the environment and prevents agents from breaking these in a mutually incompatible way during training. However, as the authors point out, discovering symmetries for a given environment is a computationally hard problem. Instead, we show that through a simple adaptation of k-level reasoning (KLR) (Costa Gomes et al. 2006), synchronously training all levels, we can obtain competitive ZSC and ad-hoc teamplay performance in Hanabi, including when paired with a human-like proxy bot. We also introduce a new method, synchronous-k-level reasoning with a best response (SyKLRBR), which further improves on our synchronous KLR by co-training a best response.



1 Introduction

Research into multi-agent reinforcement learning (MARL) has recently seen a flurry of activity, ranging from large-scale multiplayer zero-sum settings such as StarCraft Vinyals et al. (2017) to partially observable, fully cooperative settings, such as Hanabi Bard et al. (2020). The latter (cooperative) setting is of particular interest, as it covers human-AI coordination, one of the longstanding goals of AI research Engelbart (1962); Carter and Nielsen (2017). However, most work in the cooperative setting, typically modeled as a Dec-POMDP, has approached the problem in the self-play (SP) setting, where the only goal is to find a team of agents that works well together. Unfortunately, optimal SP policies in Dec-POMDPs commonly communicate information through arbitrary handshakes (or conventions), which fail to generalize to other, independently trained AI agents or humans at test time.

To address this, the zero-shot coordination setting Hu et al. (2020) was recently introduced, where the goal is to find training strategies that allow independently trained agents to coordinate at test time. The main idea of this line of work is to develop learning algorithms that can use the structure of the Dec-POMDP itself to independently find mutually compatible policies, a necessary step towards human-AI coordination. Related coordination problems have also been studied by different communities, in particular behavioural game theory. One of the best-known approaches in this area is the cognitive-hierarchies (CH) framework Camerer et al. (2004), in which a hierarchy of agents is trained. For this method, an agent at level-$k$ models other agents as coming from a distribution over levels up to $k-1$ and best-responds accordingly. The CH framework has been shown to model human behavior in games for which equilibrium theory does not match empirical data Camerer et al. (2004); thus, in principle, the CH framework could be leveraged to facilitate human-AI coordination in complex settings. A specific instance of CH that is relevant to our work is K-level reasoning (KLR) Costa-Gomes and Crawford (2006), wherein the level-$k$ agent models the other agents as level-$(k-1)$. However, KLR, like many of the ideas developed in these works, has not been successfully scaled to large-scale coordination problems Hu et al. (2020).

Figure 1: Visualization of various hierarchical training schemas, including sequential KLR, synchronous KLR, synchronous CH, and our new SyKLRBR for 4 levels. Thicker arrows indicate a greater proportion of games played with the level. Additionally, red boxes indicate an actively trained agent, while grey boxes indicate a fixed agent. Typically the level-0 policy $\pi_0$ is a uniform random agent.

In this paper we show that k-level reasoning can indeed be scaled to large partially observable coordination problems, like Hanabi. We identify two key innovations that both increase training speed and improve the performance of the method. First, rather than training the different levels of the hierarchy sequentially, as would be suggested by a literal interpretation of the method (as was done as a baseline in Hu et al. (2020)), we instead develop a synchronous version, where all levels are trained in parallel (see figure 1). The obvious advantage is that the wall-clock time can be reduced from linear in the number of levels to constant, taking advantage of parallel training. The more surprising finding is that synchronous training also acts as a regularizer on the policies, stabilizing training.

The second innovation is that in parallel we also train a best response (BR) to the entire KLR hierarchy, with more weight being placed on the highest two levels. This constitutes a hybrid approach between CH and KLR, and the resulting BR is our final test time policy. Our method, synchronous-k-level reasoning with a best response (SyKLRBR), obtains high scores when evaluating independently trained agents in cross-play (XP). Importantly, this method also improves ad-hoc teamplay performance, indicating a robust policy that plays well with various conventions.

Lastly, we evaluate our SyKLRBR agents paired with a proxy human policy and establish new state-of-the-art performance, beating recent strong algorithms that, in contrast to our approach, require additional information beyond the game-provided observations Hu et al. (2020, 2021b).

Our results show that indeed KLR can be adapted to address large-scale coordination problems, in particular those in which the main challenge is to prevent information exchange through arbitrary conventions. Our analysis shows that synchronous training regularizes the training process and prevents level-$k$ from overfitting to the now changing policy at level-$(k-1)$. In contrast, in sequential training each agent overfits to the static agent at the level below, leading to arbitrary handshakes and brittle conventions. Furthermore, training a best response to the entire hierarchy improves the final ZSC performance and robustness in ad-hoc settings. This is intuitive since the BR can carry out on-the-fly adaptation in the ZSC setting.

Our results show that the exact graph structure used, which was similarly studied in Garnelo et al. (2021), and the type of training regime (synchronous vs sequential) can have a major impact on the final outcome when adapting ideas from the cognitive hierarchy literature to the deep MARL setting. We hope our findings will encourage other practitioners to seek inspiration in the game theory literature and to scale those ideas to high-dimensional problems, even when there is precedent of unsuccessful attempts in prior work.

2 Related Work

A significant portion of research in MARL has been focused on creating agents that do well in fully cooperative, partially observable settings. A standard approach is through variations of self-play (SP) methods Devlin et al. (2011); Devlin and Kudenko (2016); Foerster et al. (2019); Lerer and Peysakhovich (2019); Hu and Foerster (2020); however, as shown in Hu et al. (2020) generally optimal SP agents learn highly arbitrary policies, which are incompatible with independently trained agents. Clearly, test-time coordination with other, independently trained agents including humans is an important requirement for AI agents that is not captured in the SP problem setting. To address this, Hu et al. (2020) introduced the zero-shot coordination (ZSC) setting, where the explicit goal is to develop learning algorithms that allow independently trained agents to collaborate at test time.

Another recent area of work trains an RL agent separately and then evaluates its performance in a new group of AI agents or humans, assuming access to a small amount of test-time data Lerer and Peysakhovich (2019); Tucker et al. (2020). These methods build SP policies that are compatible with the test-time agents by guiding the learning to the nearest equilibria Lerer and Peysakhovich (2019); Tucker et al. (2020). Other methods use human data to build a human model and then train an approximate best response to this human model, making it compatible with human play Carroll et al. (2019). While this presents a promising near-term approach for learning human-like policies in specific settings where we have enough data, it does not enable us to understand the fundamental principles of coordination that lead to this behavior in the first place, which is the goal of the ZSC setting. Our paper shows that even relatively simple training methods can lead to drastically improved ZSC performance when combined with modern engineering best practices and, importantly, that the performance gains directly translate into better coordination with a human-like proxy and ad-hoc teamplay (without requiring human data in the training process).

Our approach for scaling KLR to Hanabi relies heavily on the parallel training of all of the different levels, where each level is trained on one machine and the models are exchanged periodically through a central server. This architecture draws inspiration from population based training (PBT), which was first popularized for hyperparameter tuning Jaderberg et al. (2017) and then applied in the multi-agent context to train more robust policies in two-player zero-sum team settings Jaderberg et al. (2019). PBT has also been used to obtain better self-play policies in Hanabi, both in Foerster et al. (2019) and Bard et al. (2020). In contrast to prior work, we do not use PBT to avoid local optima or train less exploitable agents but instead leverage this framework to implement a KLR and a best response to this KLR that is geared towards ZSC and coordination with human-like proxies.

There are a few other methods directly addressing the ZSC framework. The first, other-play (OP) Hu et al. (2020), requires access to the ground truth symmetries of the problem setting and then learns policies that avoid breaking these symmetries during training. OP has previously been applied to Hanabi and KLR compares favorably to the OP results (see Section 4). We also note that KLR does not require access to the symmetries and can be applied in settings where no symmetries are present. The next method, Ridge Rider (RR) Parker-Holder et al. (2020), uses the connection between symmetries in an environment and repeated eigenvalues of the Hessian to solve ZSC problems. Like KLR, RR does not require prior ground truth access. However, unlike KLR, RR is extremely computationally expensive and has not been scaled to large-scale RL problems. Life-Long Learning (LLL) has also been studied for ZSC Nekoei et al. (2021). However, LLL requires access to a pool of pre-trained agents, and in that work the agents had access to symmetries, whereas our method never requires access to such symmetries and compares favorably in the ZSC setting. Lastly, Off-Belief Learning (OBL) Hu et al. (2021b) has been shown to provide well-grounded play in Hanabi and strong results in the ZSC setting, but requires simulator access to train. We note that KLR does not require simulator access and also matches or even outperforms OBL on various metrics.

3 Background

3.1 Dec-POMDPs

This work considers a class of problems, Dec-POMDPs Bernstein et al. (2002), where $N$ agents interact in a partially observable environment. The partial observability implies that every agent $i \in \{1, \dots, N\}$ has an observation $o^i_t = O(s_t)$, obtained via the observation function $O$ from the underlying state $s_t$. In our setting, at each timestep $t$ the acting agent $i$ samples an action $u^i_t$ from policy $\pi^i_\theta$, $u^i_t \sim \pi^i_\theta(u^i \mid \tau^i_t)$, where $\theta$ are the weights of the neural networks, and all other agents take no-op actions. Here we use action-observation histories (AOH), which we denote as $\tau^i_T = \{o^i_0, u^i_0, r_1, \dots, u^i_{T-1}, r_T, o^i_T\}$, where $T$ is the length of the trajectory, and $r_t$ is the common reward at timestep $t$ defined by the reward function $R(s, u)$. The goal of the agents is to maximize the total reward $\mathbb{E}_{\tau \sim P(\tau \mid \pi)}[R_t(\tau)]$; here we consider $R_t(\tau)$ to be the discounted sum of rewards, i.e. $R_t(\tau) = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$, where $\gamma$ is the discount factor. Additionally, the environments in this work are turn-based and bounded in length at $t_{\max}$.
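As a small illustration of the discounted sum of rewards described above, the sketch below computes the return for every timestep of a finite episode. The function name and the example reward sequence are illustrative assumptions and are not taken from the paper's codebase.

```python
from typing import List


def discounted_returns(rewards: List[float], gamma: float) -> List[float]:
    """Return R_t = sum_{t' >= t} gamma^(t' - t) * r_{t'} for every t of a finite episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Sweep backwards so each return reuses the next one: R_t = r_t + gamma * R_{t+1}.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


if __name__ == "__main__":
    # Hypothetical episode in which a point is scored at steps 1 and 3.
    print(discounted_returns([0.0, 1.0, 0.0, 1.0], gamma=0.5))
```

The backward sweep keeps the computation linear in the episode length, which is convenient because episodes here are bounded.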

3.2 Deep Multi-Agent Reinforcement Learning

Deep reinforcement learning has been applied to a multitude of multi-agent learning problems with great success. Cooperative MARL is readily addressed with extensions of Deep Q-learning Mnih et al. (2015), where the Q-function is parameterized by neural networks to learn to predict the expected return based on the current state $s_t$ and action $u_t$, $Q(s_t, u_t)$. Our work also builds on the state-of-the-art algorithm Recurrent Replay Distributed Deep Q-Network (R2D2) Kapturowski et al. (2019). R2D2 also incorporates other recent advancements such as a dueling network architecture Wang et al. (2016), prioritized experience replay Schaul et al. (2016), and double DQN learning van Hasselt et al. (2016). Additionally, we use an architecture similar to the one proposed in Horgan et al. (2018) and run many environments in parallel, each of which has actors with varying exploration rates that add to a centralized replay buffer.
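To make the double DQN ingredient mentioned above concrete, here is a minimal one-step sketch of its bootstrap target. It is a simplification under stated assumptions: R2D2 itself is recurrent and distributed, none of which is modeled here, and the function name and argument layout are hypothetical.

```python
from typing import Sequence


def double_dqn_target(
    reward: float,
    gamma: float,
    next_q_online: Sequence[float],
    next_q_target: Sequence[float],
    terminal: bool,
) -> float:
    """One-step double DQN target.

    The online network selects the next action and the target network
    evaluates it, which reduces the overestimation of plain Q-learning.
    """
    if terminal:
        # No bootstrapping past the end of the episode.
        return reward
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]


if __name__ == "__main__":
    # Online net prefers action 1; the target net's value for action 1 is used.
    print(double_dqn_target(1.0, 0.5, [0.1, 0.9], [2.0, 4.0], terminal=False))
```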

The simplest way to adapt deep Q-learning to the Dec-POMDP setting is through Independent Q-learning (IQL) as proposed by Tan (1993). In IQL, every agent individually estimates the total return and treats other agents as part of the environment. There are other methods that explicitly account for the multi-agent structure by taking advantage of the centralized training with decentralized control regime Sunehag et al. (2018); Rashid et al. (2018). However, since our work is based on learning best responses, here we only consider IQL.

3.3 Zero-Shot Coordination Setting

Many past works have focused on solving the self-play (SP) case for Dec-POMDPs. However, as shown in Hu et al. (2020), these policies typically lead to arbitrary handshakes that work well within a team when jointly training agents together, but fail when evaluated with other independently trained agents from the same algorithm or humans. Many real-world problems, in contrast, require interaction with never-before-seen AI agents and humans.

This desideratum was formalized as the zero-shot coordination (ZSC) setting by Hu et al. (2020), in which the goal is to develop algorithms that allow independently trained agents to coordinate at test time. ZSC requires agents not to rely on arbitrary conventions, as they lead to mutually incompatible policies across different training runs and implementations of the same algorithm. While extended episodes allow for agents to adapt to each other, this must happen at test time within the episode. Crucially, the ZSC setting is a stepping stone towards human-AI coordination, since it aims to uncover the fundamental principles underlying coordination in complex, partially observable, fully cooperative settings.

Lastly, the ZSC setting addresses some of the shortcomings of the ad-hoc team play Stone et al. (2010) problem setting, where the goal is to do well when paired with any well performing SP policy at test time. As Hanabi shows, this fails in settings where there is little overlap between good SP policies and those that are suitable for coordination. So notably in our ad-hoc experiments we do not use SP policies but instead ones that can be meaningfully coordinated with.

4 Cognitive Hierarchies for ZSC

The methods we investigate and improve upon in this work are multi-agent RL adaptations of behavioral game theory's Camerer et al. (2003) cognitive hierarchies (CH), where level-$k$ agents are a BR to all preceding levels $\{0, \dots, k-1\}$; we define a CH as a Poisson distribution over all previously trained levels. We consider k-level reasoning (KLR) to be a hierarchy where level-$k$ agents are trained as an approximate BR to level-$(k-1)$ agents Costa-Gomes and Crawford (2006). Lastly, we propose a new type of hierarchy, SyKLRBR, which is a hybrid of the two, where we train a BR to a Poisson distribution over all levels of a KLR (see appendix A.1 for more details).
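The three hierarchies differ only in which levels a learner is paired with during training. The sketch below writes those partner distributions down explicitly. It is a hedged illustration: the Poisson rate `lam`, the truncation and renormalisation, and the choice to count the SyKLRBR Poisson index downwards from the top level are all assumptions made here; the paper defers the exact parameterisation to its appendix A.1.

```python
import math
from typing import Dict, List


def truncated_poisson(n: int, lam: float) -> List[float]:
    """Poisson pmf on 0..n-1 with rate lam, renormalised to sum to 1."""
    raw = [math.exp(-lam) * lam**j / math.factorial(j) for j in range(n)]
    total = sum(raw)
    return [w / total for w in raw]


def klr_partners(k: int) -> Dict[int, float]:
    """KLR: level k best-responds to level k-1 only."""
    return {k - 1: 1.0}


def ch_partners(k: int, lam: float) -> Dict[int, float]:
    """CH: level k best-responds to a Poisson mixture over levels 0..k-1.

    The rate lam is a hypothetical parameter for this sketch.
    """
    return dict(enumerate(truncated_poisson(k, lam)))


def syklrbr_partners(num_levels: int, lam: float) -> Dict[int, float]:
    """SyKLRBR best response: Poisson mixture over KLR levels 1..num_levels.

    Assumption: the Poisson index counts steps down from the top level,
    so with lam = 1 the two highest levels receive the most weight.
    """
    weights = truncated_poisson(num_levels, lam)
    return {num_levels - j: w for j, w in enumerate(weights)}


if __name__ == "__main__":
    print(klr_partners(3))
    print(ch_partners(4, lam=1.0))
    print(syklrbr_partners(5, lam=1.0))
```

Under these assumptions a CH learner spends most of its games with low levels, while the SyKLRBR best response spends most of its games with the top of the KLR hierarchy.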

For all hierarchies, we start training the first level of the hierarchy as an approximate BR to a uniform random policy over all legal actions, $\pi_0$. (In Hanabi there are some illegal moves, e.g., an agent cannot provide a hint when the given color or rank is not present in the hand of the teammate.) The main idea of this choice is that it prevents the agent from communicating any information through its actions, beyond the grounded information revealed by the environment (see Hu et al. (2021b) for more info). It thus forces the agent to only play based on this grounded information provided, without any conventions.
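A uniform random policy over legal actions can be sketched in a few lines. The boolean legality mask and the function name are assumptions made for illustration; how a Hanabi environment actually exposes legal moves is not specified in the text.

```python
import random
from typing import List, Optional


def sample_uniform_legal(legal_mask: List[bool], rng: Optional[random.Random] = None) -> int:
    """Sample an action index uniformly from the actions marked legal."""
    source = rng if rng is not None else random
    legal = [action for action, ok in enumerate(legal_mask) if ok]
    if not legal:
        raise ValueError("no legal action available")
    return source.choice(legal)


if __name__ == "__main__":
    # Hypothetical mask: actions 0 and 3 are legal, 1 and 2 are not.
    rng = random.Random(0)
    print([sample_uniform_legal([True, False, False, True], rng) for _ in range(5)])
```

Because the choice ignores the observation entirely, the action carries no information about the acting player's view beyond what the rules themselves reveal.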

Furthermore, it is a natural choice for solving zero-shot coordination problems since it makes the fewest assumptions about a specific policy and certainly does not break any potential symmetries in the environment. Crucially, as is shown in Camerer et al. (2003), in principle, the convergence point of CH and KLR should be a deterministic function of $\pi_0$, and thus a common-knowledge $\pi_0$ should allow for zero-shot coordination between two independently trained agents.

A typical implementation of these training schemas is to train all levels sequentially, one level at a time, until the given level has converged. We also draw inspiration from Lanctot et al. (2017) and their deep cognitive hierarchies framework (DCH) to instead train all levels simultaneously. To do so, we use a central server to store the policies of a given hierarchy and periodically refresh these policies by sending updated policies to the server and retrieving policies we are best responding to from the central server.

We implement the sequential training as follows: We halt the training of the policy $\pi_k$ at a given level $k$ after 24 hours and start training the next level, $\pi_{k+1}$, as a BR to the trained set of policies $\Pi_{k+1}$, where $\Pi_{k+1} = \{\pi_0, \dots, \pi_k\}$ for the CH case and $\Pi_{k+1} = \{\pi_k\}$ in the KLR case. This is the standard implementation of KLR and CH, as it was unsuccessfully explored in Hanabi by Hu et al. (2020).

For synchronous training we train all levels in parallel under a client-server implementation (see algorithm 1). Here all policies are initialized randomly on the server. A client training a given level $k$ fetches a policy $\pi_k$ and the corresponding set of partner policies $\Pi_k$, and trains $\pi_k$ as an approximate BR to $\Pi_k$. Periodically, the client sends a copy of its updated policy $\pi_k$ to the server, fetches an updated $\Pi_k$ and then continues to train $\pi_k$. The entire hierarchy is synchronously trained for 24 hours, the same amount of time as a single level is trained in the sequential case.

Inputs: a level $k$
Initialization: from the server, retrieve a trainable policy $\pi_k$ and the corresponding set of collaborative policies $\Pi_k$: $\Pi_k = \{\pi_{k-1}\}$ for k-level reasoning, $\Pi_k = \{\pi_0, \dots, \pi_{k-1}\}$ for cognitive hierarchies, and the levels of the KLR hierarchy for the Best Response Agent in SyKLRBR.
iteration = 0
for each epoch do
    for each iter in the epoch do
        if iter is a periodic synchronization step then
            UpdateWeightsOnServer($\pi_k$)
            $\Pi_k$ = RetrieveServerWeights()
        end if
        Update weights for $\pi_k$ towards a best response to $\Pi_k$
    end for
end for
Algorithm 1 Client-Server Implementation of k-level reasoning, cognitive hierarchies, and SyKLRBR.
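The client loop of Algorithm 1 can be sketched as ordinary Python. This is a toy, single-process stand-in: `PolicyServer`, `run_client`, `klr_partner_levels`, the `sync_every` parameter and the `train_step` callback are all hypothetical names introduced here, weights are plain lists of floats, and the real system trains each level on its own machine with a deep RL update in place of `train_step`.

```python
from typing import Callable, Dict, List

Weights = List[float]


class PolicyServer:
    """In-memory stand-in for the central server holding one policy per level."""

    def __init__(self, num_levels: int, init: Callable[[int], Weights]):
        self._store: Dict[int, Weights] = {k: init(k) for k in range(num_levels + 1)}

    def push(self, level: int, weights: Weights) -> None:
        self._store[level] = list(weights)

    def pull(self, levels: List[int]) -> Dict[int, Weights]:
        return {k: list(self._store[k]) for k in levels}


def klr_partner_levels(k: int) -> List[int]:
    """KLR: the client for level k trains against level k-1."""
    return [k - 1]


def run_client(
    server: PolicyServer,
    level: int,
    partner_levels: Callable[[int], List[int]],
    train_step: Callable[[Weights, Dict[int, Weights]], Weights],
    num_epochs: int,
    iters_per_epoch: int,
    sync_every: int,
) -> Weights:
    """Train one level, periodically exchanging weights with the server."""
    weights = server.pull([level])[level]
    partners = server.pull(partner_levels(level))
    for _ in range(num_epochs):
        for it in range(iters_per_epoch):
            if it % sync_every == 0:
                server.push(level, weights)                    # UpdateWeightsOnServer
                partners = server.pull(partner_levels(level))  # RetrieveServerWeights
            weights = train_step(weights, partners)            # step towards a BR
    # Extra push, not part of Algorithm 1, so the final weights reach the server.
    server.push(level, weights)
    return weights


if __name__ == "__main__":
    server = PolicyServer(num_levels=2, init=lambda k: [0.0])
    dummy_step = lambda w, partners: [x + 1.0 for x in w]  # placeholder update
    print(run_client(server, 1, klr_partner_levels, dummy_step, 2, 3, 2))
```

Running one such client per level in parallel gives the synchronous scheme; swapping `klr_partner_levels` for a function that returns all lower levels gives the CH variant.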

5 Experimental Setup

5.1 Hanabi Setup

Hanabi is a cooperative card game that has been established as a complex benchmark for fully cooperative, partially observable multi-agent decision making Bard et al. (2020). In Hanabi, each player can see every player’s hand but their own. As a result, players can receive information about their hand either by receiving direct (grounded) “hints” from other players, or by doing counterfactual reasoning to interpret other players’ actions. In 5-card Hanabi, the variant considered in this paper, there are 5 colors (G, B, W, Y, R) and 5 ranks (1 to 5). A “hint” can be of color or rank and will reveal all cards of the underlying color or rank in the target player’s hand; an example hint is, “your first and fourth cards are 1s.” A caveat is that each hint costs a scarce information token, which can only be recovered by discarding a card.

The goal in Hanabi is to complete 5 stacks, one for each of the 5 colors, each stack starting with the “1” and ending with the “5”. At one point per card the maximum score is 25. To add to a stack players “play” cards and cards played out of order cost a life token. Once the deck is exhausted or the team loses all 3 lives (“bombs out”), the game will terminate.
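The scoring rule can be written as a short function. This is a sketch under assumptions: the dictionary representation of the stacks and the function name are made up for illustration, and the convention that a team that bombs out scores zero follows the description of "bombing out" given later in the paper.

```python
from typing import Dict

COLORS = ("G", "B", "W", "Y", "R")
MAX_RANK = 5


def hanabi_score(stacks: Dict[str, int], lives_left: int) -> int:
    """Score a finished game: one point per card on the stacks, 0 after bombing out."""
    for color in COLORS:
        height = stacks.get(color, 0)
        if not 0 <= height <= MAX_RANK:
            raise ValueError(f"invalid stack height for {color}: {height}")
    if lives_left <= 0:
        # All life tokens lost: the team loses all points.
        return 0
    return sum(stacks.get(color, 0) for color in COLORS)


if __name__ == "__main__":
    print(hanabi_score({"G": 5, "B": 5, "W": 5, "Y": 5, "R": 5}, lives_left=3))
    print(hanabi_score({"G": 3, "B": 2, "W": 0, "Y": 1, "R": 4}, lives_left=0))
```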

5.2 Training Details

For a single agent we utilize distributed deep recurrent Q-Networks with prioritized experience replay Kapturowski et al. (2019). Thus, during training there are a large number of simultaneously running environments calling deep Q-networks to generate and add trajectories to a centralized replay buffer, which are then used to update the model. The network calls are dynamically batched in order to run efficiently on GPUs Espeholt et al. (2020). This agent training schema for Hanabi was first used in Hu and Foerster (2020), and achieved strong results in the self-play setting. Please see Appendix A for complete training details.

5.3 Evaluation

We evaluate our methods and baselines in the self-play (SP), zero-shot coordination (ZSC), ad-hoc teamplay and human-AI settings. For zero-shot coordination, we follow the problem definition from Hu et al. (2020) and evaluate models through cross-play (XP), where we repeat training 5 times with different seeds and pair the independently trained agents with each other.
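The difference between SP and XP evaluation comes down to which pairs of seeds are averaged. The sketch below assumes a hypothetical `play_game(a, b)` callback that plays one game between two agents and returns its score; the function name `evaluate_sp_xp` and the number of games per pair are likewise illustrative.

```python
from statistics import mean
from typing import Callable, List, Sequence, Tuple, TypeVar

Agent = TypeVar("Agent")


def evaluate_sp_xp(
    agents: Sequence[Agent],
    play_game: Callable[[Agent, Agent], float],
    games_per_pair: int,
) -> Tuple[float, float]:
    """Return (self-play, cross-play) mean scores over independently trained agents.

    Self-play pairs each agent with itself; cross-play pairs agents
    that come from different training runs.
    """
    if len(agents) < 2:
        raise ValueError("cross-play needs at least two independently trained agents")
    sp_scores: List[float] = []
    xp_scores: List[float] = []
    for i, a in enumerate(agents):
        for j, b in enumerate(agents):
            pair_mean = mean(play_game(a, b) for _ in range(games_per_pair))
            (sp_scores if i == j else xp_scores).append(pair_mean)
    return mean(sp_scores), mean(xp_scores)


if __name__ == "__main__":
    # Toy stand-in: agents are ints, and only identical agents coordinate perfectly.
    toy_game = lambda a, b: 25.0 if a == b else 10.0
    print(evaluate_sp_xp([0, 1, 2, 3, 4], toy_game, games_per_pair=3))
```

A large gap between the two numbers, as in the toy example, is the signature of seed-specific conventions.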

To test our models’ performance against a diverse set of unseen, novel partners (ad-hoc team play Stone et al. (2010)), we next use RL to train two types of agents that use distinct conventions. The first RL agent is trained with Other-Play and almost always hints for the rank of the playable card to communicate with its partner. For example, in a case where “Red 1” and “Red 2” have been played and the partner has just drawn a new “Red 3”, the other agent will hint 3 and the partner will then play that card, inferring from the convention that this 3 is red. This agent is therefore referred to as Rank Bot. The second RL agent is a color-based equivalent of Rank Bot, produced by adding extra reward for color hinting during the early stage of training to encourage color hinting behavior. This agent is called Color Bot. More details are in the appendix.

We also train a supervised bot (Clone Bot) on human data, as a proxy evaluation for zero-shot human-AI coordination. We used a dataset of games obtained from Board Game Arena (https://en.boardgamearena.com/). During training, we duplicate the games so that there is a training example from the perspective of each player, for a total of examples; that is, observations contain visible private information for exactly one of the players (the other being hidden). Using this dataset, we trained an agent to reproduce human moves by applying behavioral cloning. The agent is trained by minimizing cross-entropy loss on the actions of the visible player. After each epoch, the agent performs 1000 games of self-play, and we keep the model with the highest self-play score across all epochs.
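The behavioral cloning objective is the usual cross-entropy between the policy's action distribution and the recorded human action. A dependency-free sketch follows; the function name and the list-of-logits input format are assumptions, and a real implementation would compute this inside a deep learning framework.

```python
import math
from typing import List, Sequence


def cloning_cross_entropy(logits_batch: Sequence[Sequence[float]], human_actions: List[int]) -> float:
    """Mean negative log-likelihood of the human actions under softmax(logits)."""
    if len(logits_batch) != len(human_actions) or not human_actions:
        raise ValueError("need one human action per non-empty batch row")
    total = 0.0
    for logits, action in zip(logits_batch, human_actions):
        # Log-sum-exp with max subtraction for numerical stability.
        top = max(logits)
        log_z = top + math.log(sum(math.exp(x - top) for x in logits))
        total += log_z - logits[action]
    return total / len(human_actions)


if __name__ == "__main__":
    # An uninformed policy over 4 actions pays log(4) nats per move.
    print(cloning_cross_entropy([[0.0, 0.0, 0.0, 0.0]], [2]))
```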

Level | Self-play | Cross-Play | w/ (k-1)th level | XP w/ (k-1)th level | w/ Color Bot | w/ Rank Bot | w/ Clone Bot
Table 1: Performance of sequentially trained KLR for Self-play (SP), Cross-Play (XP), with the $(k-1)$th level, XP with the $(k-1)$th level, with Color Bot, with Rank Bot, and with Clone Bot. We find that the score with level $k-1$ drops sharply when moving to XP. This indicates that in the sequential training, each level-$k$ can overfit to the static level-$(k-1)$ and thus develop arbitrary handshakes that propagate along the hierarchy.
Level | Self-play | Cross-Play | w/ (k-1)th level | XP w/ (k-1)th level | w/ Color Bot | w/ Rank Bot | w/ Clone Bot
Table 2: Synchronously trained KLR performance for Self-play (SP), Cross-Play (XP), with the $(k-1)$th level, XP with the $(k-1)$th level, with Color Bot, with Rank Bot, and with Clone Bot. Synchronous training produces extremely stable outcomes across the different runs, as indicated by the close correspondence between SP and XP scores. The fact that all levels are changing during training regularizes the process and prevents overfitting to level $k-1$.

6 Results and Discussion

In this section we present the main results and analysis of our work, for sequential training, synchronous training, and SyKLRBR. For each variant/level we present self-play, cross-play, ad-hoc teamplay and human-proxy results. Although we present self-play numbers, the purpose of this paper is not to produce good self-play scores, rather we are optimizing for the ZSC and ad-hoc settings. Therefore, our analysis focuses on the cross-play and ad-hoc teamplay settings, including the human-proxy results. We demonstrate that simply training the KLR synchronously achieves significant improvement over its sequentially trained counterpart in the ZSC setting. We also demonstrate that our new method SyKLRBR is able to further improve upon the synchronous KLR results and achieve SOTA results in certain metrics e.g. scores with clone bot. We also provide analysis into the issues with sequential training and how synchronous training addresses them.

6.1 XP Performance

Table 3 shows the XP scores for other-play, OBL, sequential KLR, synchronous KLR, and SyKLRBR. Changing the training schema from sequential to synchronous significantly increases the XP score to the state-of-the-art XP score for methods that don’t use access to the environment or known symmetries. Thus, by synchronously training the KLR, we are able to achieve strong results in the ZSC setting without requiring underlying game symmetries (other-play) or using simulator access (OBL). SyKLRBR improves upon this result by synchronously training the BR and the KLR, yielding even better XP results. Additionally, tables 1 and 2 show the performance of all levels of KLR. A KLR trained sequentially or synchronously is able to achieve good scores with the $(k-1)$th level, as level $k$ is explicitly optimized to be an approximate best response to level $k-1$. However, the sequential KLR has a significant dropoff for the XP score with the $(k-1)$th level, indicating that sequential KLRs have large inconsistencies across runs. This also indicates that the sequentially trained hierarchy is overfitting to the exact $(k-1)$th level it was trained with. In contrast, the synchronously trained hierarchy keeps its score with the $(k-1)$th level close to the XP score with the $(k-1)$th level. Thus, by synchronously training the hierarchy we are able to minimize overfitting. For more analysis on overfitting see section 6.4.

6.2 Ad-hoc Teamplay

Table 3 shows the scores in the ad-hoc teamplay setting i.e. evaluation with color and rank bot, where the synchronously trained KLR outperforms the sequentially trained KLR for both bots. Similarly, our SyKLRBR further improves performance with both rank and color bot. Thus, the benefits from synchronous training and from training a BR measured in the ZSC setting translate to improvements in ad-hoc teamplay.

Method | Self-play | Cross-Play | w/ Color Bot | w/ Rank Bot | w/ Clone Bot | Limitations
Other-Play - Sym
OBL (level 4) Env
Sequential -
Synchronous -
Table 3: Other-Play, OBL level 4, level 5 Sequential KLR, level 5 Synchronous KLR, and SyKLRBR for Self-play (SP), Cross-Play (XP), ad-hoc play with Color Bot, Rank Bot, and Clone Bot. We include methodological limitations, requiring underlying game symmetries (sym) or requiring access to the simulator (env). Both synchronous training and our SyKLRBR improve upon XP, ad-hoc teamplay, and clone-bot scores. Also SyKLRBR achieves state-of-the-art results with clone-bot.

6.3 Zero-Shot Coordination with Human Proxies

Up to now we have focused on AI agents playing well with each other. Next we measure the performance of our bots when playing with a bot trained on human data, which serves as a human proxy, specifically the Clone Bot described in Section 5.3. In table 3, we present the overall performance of agents trained under OP, OBL, sequential KLR, synchronous KLR, and SyKLRBR. As a reference, the OP agent was trained using the algorithm of Hu et al. (2020) and then paired with Clone Bot.

When synchronously trained, KLR monotonically improves its score with Clone Bot. By level 5 the synchronously trained KLR achieves a significantly higher Clone Bot score than the sequentially trained KLR (table 3). Additionally, the synchronously trained KLR Clone Bot score is comparable to that of the more algorithmically complex OBL bot, which furthermore requires access to the simulator. Lastly, our new method, SyKLRBR, is able to achieve state-of-the-art results in coordination with human proxies. Therefore, simply by synchronously training KLR we are able to produce bots that cooperate well with human-like proxy policies at test time, and by co-training a BR we obtain state-of-the-art results.

Figure 2: A plot of probability distributions of actions an agent at level $k$ will play with level $k-1$ for KLR trained sequentially and synchronously. At lower levels, the synchronous KLR is stochastic, but at higher levels it stabilizes. The stochasticity in the lower levels broadens the range of policies seen and robustifies lower levels, which propagates upwards and leads to stable, robust policies.
Training Schema | % Bomb out with (k-1)th level | % Bomb out in XP with (k-1)th level
Table 4: Percentage of bombing out for the level 5 agent playing with the level 4 agent it trained with ((k-1)th level) or the other level 4 agents unseen at training time (XP with (k-1)th level). By synchronously training we prevent overfitting to the policy distribution of a fixed agent. This allows us to perform better off-distribution, which significantly reduces bombing out in the XP case.

6.4 Observations of Training Behaviors

We plot the probability that an agent from level $k$ will take a given type of action when playing with an agent from level $k-1$ in Figure 2. At low levels of the hierarchy (levels 2, 3), the synchronous hierarchy is trained as an approximate best response to a set of changing policies. Higher up in the hierarchy the change in the policies gets attenuated, leading to stable policies towards the end of training. By synchronously training the hierarchy, we allow each policy to see a wider distribution of states and ensure it is robust to different policies at test time. This robustness is reflected in the improved ZSC, XP with $(k-1)$th levels, ad-hoc teamplay, and human-proxy coordination.

In table 4 we present the percentage of “bombing out” for the level 5 agent playing with the level 4 agent it trained with or level 4 agents from other seeds of our KLRs. “Bombing out” is a failure case in which too many unplayable cards have been played, leading to the agent losing all points in the game. Both the sequential and synchronous KLRs rarely bomb out when paired with their training partners. Only the sequential KLR bombs out significantly more in XP, roughly 20% compared with <1% with the agent it trained with. This high rate illustrates that the agent is making a large number of mistakes, indicating that it is off-distribution in XP. We verified this by checking the Q-values of the action the agent takes when it bombs out. In the vast majority of cases the agent has a positive Q-value for its play action when it bombs out and negative Q-values associated with other actions (discarding and hinting). Since the play action is causing the large negative reward, while the other actions are safe, these Q-values are clearly nonsensical, another indicator that the agent is off-distribution. All of this illustrates that the “bomb out” rate is a good proxy for being off-distribution, which shows that the synchronously trained KLR agents are more on-distribution during XP testing.

6.5 Understanding Synchronous Training

At each training step, every level of the synchronous KLR is trained towards a BR to the current snapshot of the level below it. There are a few reasons why synchronous training helps regularize training. First, weights are typically initialized such that Q-values at the beginning of training are small, so under a softmax all initial policies are close to uniform. Second, over the course of training the entropy of each policy decreases, as Q-values become more accurate and drift apart, so the final policy will have the lowest entropy. Lastly, the entropy of the average policy across a set of snapshots is higher than the average of the entropies of the same set (e.g. the average of two distinct deterministic policies is stochastic, but the average entropy of the two policies is 0). Therefore, by playing against a changing distribution over stochastic policies, we significantly broaden the support of our policy.
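The last point of the entropy argument above can be checked numerically. The following is a minimal sketch, not taken from the paper: the two snapshots and the three-action space are hypothetical, and `entropy` and `average_policy` are illustrative helper names.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return sum(-x * math.log(x) for x in p if x > 0.0)


def average_policy(policies):
    """Uniform mixture of several action distributions over the same actions."""
    n = len(policies)
    return [sum(p[a] for p in policies) / n for a in range(len(policies[0]))]


# Two hypothetical deterministic snapshots of a partner policy (assumption).
snapshot_a = [1.0, 0.0, 0.0]
snapshot_b = [0.0, 1.0, 0.0]

# Average of the individual entropies: both snapshots are deterministic.
mean_of_entropies = (entropy(snapshot_a) + entropy(snapshot_b)) / 2

# Entropy of the averaged policy: the mixture is stochastic.
entropy_of_mean = entropy(average_policy([snapshot_a, snapshot_b]))
```

Here the averaged policy puts probability 0.5 on each of two actions, so its entropy is strictly positive even though each snapshot has zero entropy.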

Entropy in the partner policy has two effects. First, it increases robustness by exposing the best response to more states during training. Second, more entropic (i.e. random) policies will generally induce less informative posterior (counterfactual) beliefs (a fully random policy is the extreme case, with a completely uninformative posterior). As a consequence, the BR to a set of different snapshots of a policy is both more robust and less able to extract information from the actions of its partner than the BR to only the final policy. This forces the policy to rely to a greater extent on grounded information in the game, rather than on arbitrary conventions.
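The link between policy entropy and posterior beliefs can be illustrated with a small Bayes-rule sketch. This is a toy example under stated assumptions, not the belief model used in the paper: the two hidden states, the two actions, and all three partner policies are hypothetical.

```python
def posterior(prior, policy, action):
    """Bayes rule: P(state | action) is proportional to P(action | state) * P(state)."""
    joint = [prior[s] * policy[s][action] for s in range(len(prior))]
    total = sum(joint)
    if total == 0.0:
        raise ValueError("action has zero probability under this policy")
    return [j / total for j in joint]


# Uniform prior over two hypothetical hidden states (assumption).
prior = [0.5, 0.5]

# policy[s][a] is the probability of action a in hidden state s.
convention = [[1.0, 0.0], [0.0, 1.0]]    # the action fully reveals the state
fully_random = [[0.5, 0.5], [0.5, 0.5]]  # the action carries no information
mixture = [[0.75, 0.25], [0.25, 0.75]]   # average of the two policies above

post_convention = posterior(prior, convention, action=0)
post_random = posterior(prior, fully_random, action=0)
post_mixture = posterior(prior, mixture, action=0)
```

Under the deterministic convention the posterior collapses onto one state, under the fully random policy it equals the prior, and under the mixture it lies in between, which is the sense in which a more entropic partner yields a less informative posterior.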

Empirically, we show this effect by training one belief model on a final policy and another on a set of snapshots of that policy taken during training. The cross entropy of the belief model for the set of snapshots is substantially higher than the cross entropy of the belief model for the final policy (both averaged over 3 seeds).

6.6 Cognitive Hierarchies (CH)

We also use our synchronous setup to train a CH (i.e., a best response to a Poisson sample of lower levels) and present the results in Table 5. The scores for the synchronous CH are lower than those of the synchronous KLR in terms of SP, XP, ad-hoc teamplay, and human-proxy coordination. This is likely because, even at higher levels, the majority of the partner agents come from lower levels; as a result, the performance is similar to that of KLR level 3. Additionally, computing a best response to a mix of lower-level agents makes the hints provided less reliable and disincentivizes the agent from hinting.

Level Self-play Cross-Play w/ Color Bot w/ Rank Bot w/ Clone Bot
Table 5: CH synchronously trained performance for Self-play (SP), Cross-Play (XP), and ad-hoc play with Color Bot, Rank Bot, and Clone Bot. In the CH setting we are unable to obtain very strong results, regardless of setting.

7 Conclusion

How to coordinate with independently trained agents is one of the great challenges of multi-agent learning. In this paper we show that a classical method from the behavioral game-theory literature, k-level reasoning, can easily be adapted to the deep MARL setting to obtain strong performance on the challenging benchmark task of two-player Hanabi. Crucially, we show that a simple engineering decision, training the different levels of the hierarchy at the same time, makes a large difference to the final performance of the method. Our analysis shows that this difference is due to the changing policies at lower levels regularizing the training process, which prevents the higher levels from overfitting to specific policies. We have also developed a new method, SyKLRBR, which further improves on our synchronous training schema and achieves state-of-the-art results for ad-hoc teamplay performance.

This raises a number of possibilities for follow-up work: What other ideas have been tried unsuccessfully and abandoned too early? Where else can we use synchronous training as a regularizer? Another interesting avenue is to investigate whether the different levels of the hierarchy are evaluated off-distribution during training and how this can be addressed. Level-$k$ is only trained on the distribution induced when paired with level-$(k-1)$, but evaluated on the distribution induced from playing with level-$(k+1)$. Furthermore, extending the work of Garnelo et al. (2021) and searching for the optimal graph structure during training is a promising avenue for future work.

8 Limitations

Although our synchronous training schema does alleviate overfitting in the KLR case, there is still a large gap between cross-play and playing with the $(k-1)$-th level. This indicates that there still exist some unfavorable dynamics in the hierarchy. Similarly, although our work does provide steps towards human-AI cooperation, the policy can still be brittle with unseen bots, resulting in lower scores.

9 Broader Impact

We have demonstrated that synchronously training a KLR greatly improves on sequentially training a KLR in the complex Dec-POMDP setting of Hanabi. This is in essence a simple engineering decision, but it raises performance to a level competitive with strong existing methods. Our method, SyKLRBR, synchronously trains a BR to the KLR, which resulted in SOTA performance for coordination with human proxies through Clone Bot. We have found that our method works because it provides distributional robustness in the trained policies. As a result, it can be a positive step towards improving human-AI cooperation. No technology is safe from being used for malicious purposes, and this also applies to our research. However, fully-cooperative settings clearly target benevolent applications.


  • N. Bard, J. N. Foerster, S. Chandar, N. Burch, M. Lanctot, H. F. Song, E. Parisotto, V. Dumoulin, S. Moitra, E. Hughes, I. Dunning, S. Mourad, H. Larochelle, M. G. Bellemare, and M. Bowling (2020) The hanabi challenge: a new frontier for ai research. Artificial Intelligence 280, pp. 103216. External Links: ISSN 0004-3702 Cited by: §1, §2, §5.1.
  • D. S. Bernstein, R. Givan, N. Immerman, and S. Zilberstein (2002) The complexity of decentralized control of markov decision processes. Math. Oper. Res. 27 (4), pp. 819–840. External Links: ISSN 0364-765X, Link, Document Cited by: §3.1.
  • C. F. Camerer, T. Ho, and J. Chong (2004) A cognitive hierarchy model of games. The Quarterly Journal of Economics 119 (3), pp. 861–898. Cited by: §A.1, §1.
  • C. Camerer, T. Ho, and J. Chong (2003) A cognitive hierarchy theory of one-shot games and experimental analysis. Social Science Research Network, pp. . External Links: Link Cited by: §4, §4.
  • M. Carroll, R. Shah, M. Ho, T. Griffiths, S. Seshia, P. Abbeel, and A. Dragan (2019) On the utility of learning about humans for human-ai coordination. Advances in Neural Information Processing Systems, pp. . Cited by: §2.
  • S. Carter and M. Nielsen (2017) Using artificial intelligence to augment human intelligence. Distill. Note: https://distill.pub/2017/aia External Links: Document Cited by: §1.
  • M. Costa-Gomes and V. P. Crawford (2006) Cognition and behavior in two-person guessing games: an experimental study. In American Economic Review, Vol. 96(5), pp. 1737–1768. Cited by: K-level Reasoning for Zero-Shot Coordination in Hanabi, §1, §4.
  • S. Devlin, D. Kudenko, and M. Grzes (2011) An empirical study of potential-based reward shaping and advice in complex, multi-agent systems.. Advances in Complex Systems 14, pp. 251–278. External Links: Document Cited by: §2.
  • S. Devlin and D. Kudenko (2016) Plan-based reward shaping for multi-agent reinforcement learning. The Knowledge Engineering Review 31 (1), pp. 44–58. External Links: Document Cited by: §2.
  • D. C. Engelbart (1962) Augmenting Human Intellect: A Conceptual Framework. Note: Air Force Office of Scientific Research, AFOSR-3233, www.bootstrap.org/augdocs/friedewald030402/augmentinghumanintellect/ahi62index.html Cited by: §1.
  • L. Espeholt, R. Marinier, P. Stanczyk, K. Wang, and M. Michalski (2020) SEED rl: scalable and efficient deep-rl with accelerated central inference. In International Conference on Learning Representations, Cited by: §5.2.
  • J. Foerster, F. Song, E. Hughes, N. Burch, I. Dunning, S. Whiteson, M. Botvinick, and M. Bowling (2019) Bayesian action decoder for deep multi-agent reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, pp. 1942–1951. Cited by: §2, §2.
  • M. Garnelo, W. M. Czarnecki, S. Liu, D. Tirumala, J. Oh, G. Gidel, H. van Hasselt, and D. Balduzzi (2021) Pick your battles: interaction graphs as population-level objectives for strategic diversity. In Proceedings of the 20th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS ’21, Richland, SC, pp. 1501–1503. External Links: ISBN 9781450383073 Cited by: §1, §7.
  • D. Horgan, J. Quan, D. Budden, G. Barth-Maron, M. Hessel, H. van Hasselt, and D. Silver (2018) Distributed prioritized experience replay. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, External Links: Link Cited by: §3.2.
  • H. Hu and J. N. Foerster (2020) Simplified action decoder for deep multi-agent reinforcement learning. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, Cited by: §2, §5.2.
  • H. Hu, A. Lerer, N. Brown, and J. N. Foerster (2021a) Learned belief search: efficiently improving policies in partially observable settings. External Links: Link Cited by: Appendix A.
  • H. Hu, A. Lerer, B. Cui, L. Pineda, D. Wu, N. Brown, and J. N. Foerster (2021b) Off-belief learning. (To Appear) ICML. External Links: Link, 2103.04000 Cited by: §1, §2, §4.
  • H. Hu, A. Lerer, A. Peysakhovich, and J. Foerster (2020) “Other-play” for zero-shot coordination. In Proceedings of the 37th International Conference on Machine Learning, H. D. III and A. Singh (Eds.), Proceedings of Machine Learning Research, Vol. 119, pp. 4399–4410. Cited by: Appendix B, K-level Reasoning for Zero-Shot Coordination in Hanabi, §1, §1, §1, §2, §2, §3.3, §3.3, §4, §5.3, §6.3.
  • M. Jaderberg, W. M. Czarnecki, I. Dunning, L. Marris, G. Lever, A. G. Castaneda, C. Beattie, N. C. Rabinowitz, A. S. Morcos, A. Ruderman, et al. (2019) Human-level performance in 3d multiplayer games with population-based reinforcement learning. Science 364 (6443), pp. 859–865. Cited by: §2.
  • M. Jaderberg, V. Dalibard, S. Osindero, W. M. Czarnecki, J. Donahue, A. Razavi, O. Vinyals, T. Green, I. Dunning, K. Simonyan, et al. (2017) Population based training of neural networks. arXiv preprint arXiv:1711.09846. Cited by: §2.
  • S. Kapturowski, G. Ostrovski, J. Quan, R. Munos, and W. Dabney (2019) Recurrent experience replay in distributed reinforcement learning. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, Cited by: Appendix A, §3.2, §5.2.
  • D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), External Links: Link Cited by: Table 6.
  • M. Lanctot, V. Zambaldi, A. Gruslys, A. Lazaridou, K. Tuyls, J. Perolat, D. Silver, and T. Graepel (2017) A unified game-theoretic approach to multiagent reinforcement learning. In Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30, pp. 4190–4203. Cited by: §4.
  • A. Lerer and A. Peysakhovich (2019) Learning existing social conventions via observationally augmented self-play. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, AIES ’19, New York, NY, USA, pp. 107–114. External Links: ISBN 9781450363242 Cited by: §2, §2.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. Rusu, J. Veness, M. Bellemare, A. Graves, M. Riedmiller, A. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533. Cited by: §3.2.
  • H. Nekoei, A. Badrinaaraayanan, A. Courville, and S. Chandar (2021) Continuous coordination as a realistic scenario for lifelong learning. In Proceedings of the 38th International Conference on Machine Learning, M. Meila and T. Zhang (Eds.), Proceedings of Machine Learning Research, Vol. 139, pp. 8016–8024. External Links: Link Cited by: §2.
  • J. Parker-Holder, L. Metz, C. Resnick, H. Hu, A. Lerer, A. Letcher, A. Peysakhovich, A. Pacchiano, and J. Foerster (2020) Ridge rider: finding diverse solutions by following eigenvectors of the hessian. Advances in Neural Information Processing Systems 33, pp. 753–765. External Links: Link Cited by: §2.
  • T. Rashid, M. Samvelyan, C. S. de Witt, G. Farquhar, J. N. Foerster, and S. Whiteson (2018) QMIX: monotonic value function factorisation for deep multi-agent reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, J. G. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, pp. 4292–4301. External Links: Link Cited by: §3.2.
  • T. Schaul, J. Quan, I. Antonoglou, and D. Silver (2016) Prioritized experience replay. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), External Links: Link Cited by: Appendix A, §3.2.
  • P. Stone, G. A. Kaminka, S. Kraus, and J. S. Rosenschein (2010) Ad hoc autonomous agent teams: collaboration without pre-coordination. In Twenty-Fourth AAAI Conference on Artificial Intelligence, Cited by: §3.3, §5.3.
  • P. Sunehag, G. Lever, A. Gruslys, W. M. Czarnecki, V. Zambaldi, M. Jaderberg, M. Lanctot, N. Sonnerat, J. Z. Leibo, K. Tuyls, et al. (2018) Value-decomposition networks for cooperative multi-agent learning based on team reward. In Proceedings of the 17th international conference on autonomous agents and multiagent systems, pp. 2085–2087. Cited by: §3.2.
  • M. Tan (1993) Multi-agent reinforcement learning: independent vs. cooperative agents. In Proceedings of the tenth international conference on machine learning, pp. 330–337. Cited by: §3.2.
  • M. Tucker, Y. Zhou, and J. Shah (2020) Adversarially guided self-play for adopting social conventions. ArXiv abs/2001.05994. Cited by: §2.
  • H. van Hasselt, A. Guez, and D. Silver (2016) Deep reinforcement learning with double q-learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA, D. Schuurmans and M. P. Wellman (Eds.), pp. 2094–2100. External Links: Link Cited by: Appendix A, §3.2.
  • O. Vinyals, T. Ewalds, S. Bartunov, P. Georgiev, A. S. Vezhnevets, M. Yeo, A. Makhzani, H. Küttler, J. Agapiou, J. Schrittwieser, J. Quan, S. Gaffney, S. Petersen, K. Simonyan, T. Schaul, H. V. Hasselt, D. Silver, T. Lillicrap, K. Calderone, P. Keet, A. Brunasso, D. Lawrence, A. Ekermo, J. Repp, and R. Tsing (2017) StarCraft ii: a new challenge for reinforcement learning. ArXiv abs/1708.04782. Cited by: §1.
  • Z. Wang, T. Schaul, M. Hessel, H. van Hasselt, M. Lanctot, and N. de Freitas (2016) Dueling network architectures for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, M. Balcan and K. Q. Weinberger (Eds.), JMLR Workshop and Conference Proceedings, Vol. 48, pp. 1995–2003. External Links: Link Cited by: Appendix A, §3.2.



  1. For all authors…

    1. Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?

    2. Did you describe the limitations of your work? We discuss limitations in section 8.

    3. Did you discuss any potential negative societal impacts of your work? Discussed in section 9.

    4. Have you read the ethics review guidelines and ensured that your paper conforms to them?

  2. If you are including theoretical results…

    1. Did you state the full set of assumptions of all theoretical results?

    2. Did you include complete proofs of all theoretical results?

  3. If you ran experiments…

    1. Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? We included all information needed to reproduce the results in section 4 and appendix A. We will release an open source version of our code and copies of our trained agents later.

    2. Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? They are included in appendix A

    3. Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)?

    4. Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? Reported in appendix A.

  4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…

    1. If your work uses existing assets, did you cite the creators?

    2. Did you mention the license of the assets?

    3. Did you include any new assets either in the supplemental material or as a URL?

    4. Did you discuss whether and how consent was obtained from people whose data you’re using/curating?

    5. Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content?

  5. If you used crowdsourcing or conducted research with human subjects…

    1. Did you include the full text of instructions given to participants and screenshots, if applicable?

    2. Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable?

    3. Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation?

Appendix A Experimental Details

When training every agent, we use a distributed framework for simulation and training. For simulation, we run 6400 Hanabi environments in parallel, and the trajectories are batched together for efficient GPU computation. This is efficient because every thread can hold many environments in which many agents interact. Every agent chooses actions based on neural network calls, which are more compute-intensive and are executed on GPUs. Making these calls asynchronously allows a thread to serve multiple environments while waiting for prior agents’ actions to be computed. Therefore, by stacking multiple environments into a thread and utilizing multiple threads, we are able to maximize GPU utilization and generate a massive amount of data on the simulation side. Every environment is in a permanent simulation loop: at the end of each game, the entire action-observation history, consisting of actions, observations, and rewards, is aggregated into a trajectory, padded to a length of 80, and then added to a centralized replay buffer as done in [29].

We compute the priority of each trajectory from its per-step TD errors, following [21]. On the training side, a training loop continuously samples trajectories from the replay buffer and updates the model based on the TD error. The simulation policies are synchronized with the training policy at a fixed interval of gradient steps. We use epsilon exploration for the training agents. At the beginning of every simulated game we generate a new value of epsilon for that game. For our entire training and inference infrastructure, we use a machine with 30 CPU cores and 2 GPUs: one GPU for training and one for simulation.
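The padding and prioritization steps described above can be sketched as follows. This is an illustrative sketch only: the mixing rule (a weighted sum of the maximum and the mean absolute TD error) is the trajectory priority used in the recurrent replay work cited as [21], and the weight `eta=0.9` is an assumption here rather than a value confirmed by this text. The function names and the `None` padding value are hypothetical.

```python
MAX_LEN = 80  # maximum trajectory length, as listed in Table 6


def pad_trajectory(steps, max_len=MAX_LEN, pad_value=None):
    """Right-pad a list of per-step records to a fixed length."""
    if len(steps) > max_len:
        raise ValueError("trajectory is longer than max_len")
    return list(steps) + [pad_value] * (max_len - len(steps))


def trajectory_priority(td_errors, eta=0.9):
    """Mix the max and the mean absolute TD error over the real (unpadded) steps.

    eta is an assumed mixing weight, not a value taken from this paper.
    """
    abs_err = [abs(e) for e in td_errors]
    return eta * max(abs_err) + (1.0 - eta) * sum(abs_err) / len(abs_err)


# Hypothetical three-step game with made-up TD errors.
padded = pad_trajectory(["step0", "step1", "step2"])
priority = trajectory_priority([1.0, -3.0])
```

Computing the priority over the unpadded steps keeps padding from diluting the mean term; whether the original implementation masks padding in the same way is not stated in the text.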

We use the same network architecture as described in [16]. We follow their design choices of using a 3-layer feedforward neural network to encode the entire observation and a one-layer feedforward neural network followed by an LSTM to encode only the public observation. We combine these two outputs with element-wise multiplication and use a dueling architecture [36] to obtain the final Q-values. We also use double DQN, as done in [34]. Other relevant hyper-parameters are presented in Table 6.
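The way the pieces named above fit together can be sketched in plain Python. This is a simplified sketch under stated assumptions, not the paper's implementation: it uses the standard dueling aggregation (value plus mean-centred advantages) and a one-step double DQN target, whereas the paper's training uses 3-step returns (Table 6). All function names and inputs are hypothetical.

```python
def combine(private_enc, public_enc):
    """Element-wise product of the two encoder outputs."""
    if len(private_enc) != len(public_enc):
        raise ValueError("encodings must have the same length")
    return [a * b for a, b in zip(private_enc, public_enc)]


def dueling_q(value, advantages):
    """Dueling aggregation: Q(a) = V + A(a) - mean(A)."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]


def double_dqn_target(reward, gamma, q_online_next, q_target_next, terminal=False):
    """One-step double DQN target: the online net picks the action,
    the target net evaluates it."""
    if terminal:
        return reward
    best = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    return reward + gamma * q_target_next[best]


# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
features = combine([1.0, 2.0], [3.0, 4.0])
q_values = dueling_q(1.0, [1.0, 2.0, 3.0])
target = double_dqn_target(1.0, 0.5, [0.0, 2.0], [4.0, 6.0])
```

Decoupling action selection (online network) from evaluation (target network) is what distinguishes the double DQN target from the standard one.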

Hyper-parameters Value
Replay Buffer Parameters
burn-in-frames 10000
replay buffer size
priority exponent 0.9
priority weight 0.6
maximum trajectory length 80
Optimization Parameters
optimizer Adam [22]
gradient clip 5
batchsize 128
Q-learning Parameters
n step 3
discount factor 0.999
num gradient steps sync target net 2500
Table 6: Hyper-parameters for Hanabi agent training

For synchronous hierarchy training, at a fixed interval of gradient steps, each client sends the weights of the policy it is training to the server and queries the server for the updated weights of the policies to which it is being trained as an approximate best response.
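The client-server exchange described above might look roughly like the following sketch. It is an in-memory, single-process stand-in under stated assumptions: `PolicyServer`, `train_level`, `sync_every`, and `update_fn` are hypothetical names, weights are opaque objects, and in the real system the clients for all levels would run concurrently rather than one at a time.

```python
class PolicyServer:
    """In-memory stand-in for the server holding the latest weights per level."""

    def __init__(self, initial_weights):
        self._weights = dict(initial_weights)  # level -> latest weights snapshot

    def push(self, level, weights):
        self._weights[level] = weights

    def pull(self, level):
        return self._weights[level]


def train_level(server, level, num_steps, sync_every, update_fn):
    """Train one level, refreshing its partner (level - 1) every sync_every steps."""
    weights = server.pull(level)
    partner = server.pull(level - 1)
    for step in range(1, num_steps + 1):
        # One gradient step towards a best response to the current partner snapshot.
        weights = update_fn(weights, partner)
        if step % sync_every == 0:
            server.push(level, weights)       # publish own weights
            partner = server.pull(level - 1)  # fetch the updated partner
    return weights


# Toy usage: "weights" are step counters, so the update simply counts steps.
server = PolicyServer({0: 0, 1: 0})
final = train_level(server, level=1, num_steps=10, sync_every=5,
                    update_fn=lambda w, p: w + 1)
```

Because every level keeps pulling fresh partner weights, each level is trained against a moving sequence of snapshots rather than a single converged policy.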

A.1 Poisson Distribution Details

For CH and SyKLRBR, each policy responds to a Poisson distribution over some set of agents. Concretely, each of the games played simultaneously has a partner agent from a set level. We use a Poisson distribution with a PMF of $P(X = x) = \lambda^{x} e^{-\lambda} / x!$. For SyKLRBR, the argument of the PMF is determined by the partner level and the number of levels in the hierarchy. Therefore, a BR to a 5 level KLR draws its actors from level 5, level 4, level 3, level 2, level 1, and level 0 in the proportions given by this PMF.

Similarly, for CH we use a value of $\lambda$ that is standard for CHs, as noted by [3]. Thus, for a CH at a given level, the partner level determines the argument of the Poisson PMF (excluding level 0). Therefore, for a 5 level cognitive hierarchy, the actors are drawn from level 1, level 2, level 3, and level 4 in the proportions given by the PMF.
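A truncated Poisson distribution over partner levels can be sketched as below. This is a hedged illustration: the rate `lam=1.5` is a placeholder and not the value used in the paper, the mapping from a level to the PMF argument (`level - offset`) is an assumption, and the function names are hypothetical. Only the Poisson PMF itself is standard.

```python
import math
import random


def poisson_pmf(x, lam):
    """Poisson probability mass function: lam**x * exp(-lam) / x!."""
    return lam ** x * math.exp(-lam) / math.factorial(x)


def partner_level_probs(levels, lam, offset=0):
    """Poisson weights over the given partner levels, renormalised to sum to 1.

    The PMF argument for a level is (level - offset); this mapping is an assumption.
    """
    weights = [poisson_pmf(level - offset, lam) for level in levels]
    total = sum(weights)
    return {level: w / total for level, w in zip(levels, weights)}


def sample_partner_levels(probs, num_games, rng):
    """Assign one partner level to each simultaneously played game."""
    levels = list(probs)
    return rng.choices(levels, weights=[probs[lv] for lv in levels], k=num_games)


# Placeholder rate (assumption) over the partner levels of a 5 level CH.
probs = partner_level_probs([1, 2, 3, 4], lam=1.5)
assignment = sample_partner_levels(probs, num_games=6400, rng=random.Random(0))
```

Renormalising after truncation keeps the weights a valid distribution when only a finite set of levels is available as partners.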

Appendix B Details on Rank Bot and Color Bot

We train two distinct policies to test the ad-hoc teamplay performance of our agents. Both policies use the same network design as our KLR policies. The first policy is trained with the Other-Play [18] technique, in which one of the two players always observes the world, i.e. both the input observation and the output action space, in a randomly permuted color space. The color permutation is sampled once at the beginning of each episode. This method prevents the agent from learning arbitrary conventions and previously achieved the best zero-shot coordination score in Hanabi. Empirically, policies trained with Other-Play tend to use a rank-based convention: the agent hints the rank of a playable card to indicate play, and the partner will often safely play a rank-hinted card without knowing its color. We therefore refer to this policy as Rank Bot. One might expect a color-based equivalent of Rank Bot, but in practice we find it difficult to learn such a policy naturally. We instead use a reward shaping technique in which we give an extra reward of 0.25 when the agent hints a color. To wash out the artifacts of the reward shaping, we first train the agent with reward shaping until convergence, then disable the extra reward and train it for another 24 hours. However, we find that reward shaping may lead to inconsistent training results across different runs, which makes them hard to reproduce. To stabilize learning, we use a simple trick of zeroing out the last-action field of the observation. Note that the last action is a shortcut for learning arbitrary conventions, but it is redundant in our setting, since an agent with an RNN can infer the last action from the board. The policy trained this way predominantly uses color-based conventions and is referred to as Color Bot.
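The color permutation used for Rank Bot and the reward shaping used for Color Bot can be sketched as follows. This is a minimal illustration under stated assumptions: cards are represented as hypothetical `(color, rank)` tuples with integer color indices, the action-type label `"hint_color"` is made up for the example, and none of the function names come from the original code.

```python
import random

NUM_COLORS = 5  # standard Hanabi has five colors


def sample_color_permutation(rng):
    """Sample one color permutation; called once at the start of each episode."""
    perm = list(range(NUM_COLORS))
    rng.shuffle(perm)
    return perm


def invert(perm):
    """Inverse permutation, used to map actions back to the real game."""
    inv = [0] * len(perm)
    for src, dst in enumerate(perm):
        inv[dst] = src
    return inv


def permute_cards(cards, perm):
    """Relabel the color of each (color, rank) card as seen by the permuted player."""
    return [(perm[color], rank) for color, rank in cards]


def unpermute_color_hint(hinted_color, perm):
    """Map a color hint chosen in the permuted space back to the true color."""
    return invert(perm)[hinted_color]


def shaped_reward(env_reward, action_type, bonus=0.25, shaping_on=True):
    """Add the extra reward for color hints; disabled in the final training phase."""
    if shaping_on and action_type == "hint_color":
        return env_reward + bonus
    return env_reward


# Fixed example permutation so the relabelling is easy to follow.
perm = [1, 2, 3, 4, 0]
seen = permute_cards([(0, 1), (4, 5)], perm)
true_color = unpermute_color_hint(perm[3], perm)
```

Applying the same permutation to observations and its inverse to color-hint actions keeps the permuted player's view internally consistent within an episode.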