Multi-Agent Generalized Recursive Reasoning

by   Ying Wen, et al.

We propose a new reasoning protocol called generalized recursive reasoning (GR2), and embed it into the multi-agent reinforcement learning (MARL) framework. The GR2 model defines reasoning categories: a level-0 agent acts randomly, and a level-$k$ agent takes the best response to a mixed type of agents that are distributed over levels 0 to $k-1$. GR2 learners can take bounded rationality into account, and they do not need the assumption that the opponent agents play the Nash strategy in all stage games, which many MARL algorithms require. We prove that when the level $k$ is large, the GR2 learners will converge to at least one Nash Equilibrium (NE). In addition, if lower-level agents play the NE, higher-level agents will surely follow as well. We evaluate the GR2 Soft Actor-Critic algorithms in a series of games and high-dimensional environments; results show that the GR2 methods converge faster than strong MARL baselines.


1 Introduction

In the studies of multi-agent reinforcement learning (MARL), one fundamental assumption is that agents behave rationally (Hu & Wellman, 2003; Shapley, 1953). When interacting with others, perfect rationality means that an agent holds correct beliefs about other agents’ behaviors and only chooses actions that maximize its utility function. In reality, however, agents’ rationality is often bounded by practical limitations (Simon, 1972), e.g. imperfect precedents or maintenance cost. Such bounded rationality can lead to an agent’s incorrect estimation of the behaviors of others, and thus make real behaviors deviate considerably from the theoretical outcome. In fact, in many real-world games, the assumption of perfect rationality cannot precisely predict agents’ behaviors (Moulin, 1986; Crawford et al., 2013; Gracia-Lázaro et al., 2017). It is therefore critical for MARL to acknowledge bounded rationality and to enhance the model capacity for the cases where agents are not perfectly rational.

Under the rationality assumption, game theorists believe agents’ behaviors will end up in some kind of equilibrium state (Nash et al., 1950; Aumann, 1987). Following this, much MARL research takes the Nash Equilibrium (NE) as the essential assumption in both the learning algorithm and the evaluation criterion (Hu & Wellman, 2003; Littman, 2001; Greenwald et al., 2003).

However, the assumption that agents always play the NE in multi-agent learning is untenable, and the Nash strategy does not always have prescriptive force. One of the most-cited psychological experiments that NE fails to describe is the Keynes Beauty Contest (Keynes, 1936). In a simpler version, all players are asked to pick one number from $0$ to $100$, and the rule stipulates that the player whose guess is closest to $1/2$ of the average is the winner. The theory of NE suggests that the only rational choice for each player is $0$ because all the players will reason as follows: “if all players guess randomly, the average of those guesses would be $50$ (level-0), I therefore should guess no more than $1/2 \times 50 = 25$ (level-1), and then if the other players think similarly as me, I should guess no more than $1/2 \times 25 \approx 13$ (level-2), …”. The level of recursion can keep developing until all players guess $0$, which is the only NE in this game. This theoretical result is however inconsistent with the experimental finding that most human players choose numbers between $13$ and $25$ (Coricelli & Nagel, 2009). The underlying mechanism is that not all human players exhibit perfect rationality; they are bounded by the number of levels they are willing to reason recursively. An interesting finding is that even with a bounded level of reasoning capability, human players can still eventually reach the NE after a few rounds of playing and learning in the game (Camerer, 2011).
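The chain of reasoning in the quotation can be traced numerically. The following is a minimal sketch, assuming guesses in [0, 100], a target fraction of 1/2, and a level-0 player whose uniformly random guess averages 50; the function name `level_k_guess` is illustrative and not part of the paper.

```python
def level_k_guess(k, p=0.5, level0_mean=50.0):
    """Guess of a level-k player who best responds to level-(k-1) players.

    Assumptions (illustrative): level-0 guesses uniformly on [0, 100], so its
    average is 50, and each higher level multiplies the level below by p.
    """
    guess = level0_mean
    for _ in range(k):
        guess *= p
    return guess


if __name__ == "__main__":
    for k in range(6):
        print(f"level-{k}: {level_k_guess(k):.2f}")
```

Under these assumptions the guesses shrink geometrically toward 0, the NE of the game, while one or two steps of reasoning land at 25 and 12.5.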

Inspired by the above finding in behavioral game theory, in this work we propose a novel learning protocol, Generalized Recursive Reasoning (GR2), that acknowledges bounded rationality and requires no assumption that agents must play the Nash strategy in all the stage games encountered. Following cognitive hierarchy theory (Camerer et al., 2004), GR2 develops a recursive model of reasoning by steps. In GR2, agent types are drawn from a hierarchy of reasoning capability with arbitrary thinking depth. It begins with level-0 (L0 for short) type agents who do not assume anything about the other agents. Level-1 thinkers believe all their opponents are at level-0; level-2 thinkers believe all opponents are at either level-0 or level-1, and they take these beliefs into consideration before making the best decision. The GR2 agents are all self-interested in the sense of best responding to their beliefs about others, i.e. level-$k$ agents take the best response to either level-$(k-1)$ thinkers or a mixture of agents with levels ranging from $0$ to $k-1$.

In this paper, we embed the generalized recursive reasoning (GR2) protocol into the framework of MARL, and develop the GR2 Soft Actor-Critic algorithm. Because it recognizes agents’ bounded rationality, the GR2 MARL algorithm is adept at capturing non-equilibrium behaviors when forming expectations about others, which ultimately helps the convergence to the equilibrium. We prove that when the level of reasoning is large, GR2 methods can converge to at least one NE. In addition, if the lower-level agent plays the Nash strategy, then all higher-level agents will follow. We evaluate our methods against multiple strong MARL baselines on the Keynes Beauty Contest and matrix games. Results support our assumption that the GR2 method is able to incorporate the irrational behaviors of opponents, because of which the learning algorithm that GR2 induces converges to the equilibrium much faster than the others. In the high-dimensional particle games, GR2 methods present superior performance over the baselines that do not have recursive reasoning modules.

2 Related Work

Assigning agents different levels of reasoning power, a.k.a. the level-$k$ model, was first introduced by Stahl (1993), Stahl II & Wilson (1994) and Nagel (1995). In the level-$k$ model, agents anchor their beliefs through thought experiments with iterated best responses in a chain style, i.e. L1 best responds to L0, L2 to L1, and so on. Agents’ levels are heterogeneous and are drawn from a common distribution. An agent is expected to exhibit a pattern of behavior whose complexity grows as its cognitive level increases (Gracia-Lázaro et al., 2017). Interestingly, across different games in the real world (Camerer et al., 2004; Benndorf et al., 2017), human beings tend to reason with 1-2 levels of recursion. This explains why people commonly guess between $13$ and $25$ in the Keynes Beauty Contest (see the previous section).

However, the level-$k$ models are shown to exhibit extreme stereotype bias due to concentrating on only one level below (Chong et al., 2016). Camerer et al. (2004) extended the level-$k$ model by making L$k$ best respond not only to level $k-1$, but to a mixture of lower types of agents as well, and named it the cognitive hierarchy model (CHM). The mixture over the lower hierarchy is usually approximated by a discrete probability distribution, with its parameter estimated from empirical experimental data. CHM presents distinct advantages over the level-$k$ model in making robust predictions in a series of games (Crawford et al., 2013). Our method, GR2, is inspired by CHM. We bring the idea of mixing the cognitive hierarchy into the deep reinforcement learning (RL) framework, and develop algorithms that can both incorporate opponents’ non-equilibrium behaviors in the games and converge to the NE faster. This is different from prior work (Hu & Wellman, 2003; Yang et al., 2018; Wen et al., 2019), in which the convergence to NE requires agents to solve for the NE at each stage of the game.

GR2, which models the opponents in a recursive manner, i.e. “I believe how you would believe about how I believe”, can be regarded as a special type of opponent modeling (Albrecht & Stone, 2018). Recursive modeling was first introduced by game theorists (Harsanyi, 1962, 1967). The study on Theory of Mind (ToM) (Goldman et al., 2012; Rabinowitz et al., 2018) introduced uncertainty in the agent’s belief about opponents’ behaviors, their payoffs, and the recursion depths. Gmytrasiewicz & Doshi (2005) further implemented the idea of ToM in the RL setting. Their interactive POMDP (I-POMDP) extends the POMDP (Sondik, 1971) by an extra dimension that models the opponent state. Despite the added flexibility, I-POMDP has limitations in its solvability (Seuken & Zilberstein, 2008). Haarnoja et al. (2017) enable a hierarchical structure for the agent policy, but it is for the single-agent case and needs layer-wise pre-training.

By contrast, we embed GR2 into the RL framework via probabilistic inference. Levine (2018) built the connection between inferring the optimal policy in RL and probabilistic inference on graphical models. Within this framework, Haarnoja et al. (2017, 2018) designed the soft Q-learning and soft actor-critic methods. Grau-Moya et al. (2018) and Wei et al. (2018) further generalized soft Q-learning to the multi-agent scenario. These methods, however, optimize towards the optimal Q-function of the joint actions, which implies that the opponents will always collaborate with the central agent to reach its best equilibrium. Forming such incorrect beliefs about the opponents is detrimental in many real-world applications (Albrecht & Stone, 2018).

Our GR2 method is most related to the probabilistic recursive reasoning (PR2) method (Wen et al., 2019), which also implements the idea of conducting recursive reasoning for MARL tasks. However, PR2 only considers the case of level-1 reasoning. Here we extend the recursion depth to an arbitrary number and design the learning protocol so that agents can make the best response to a mixture of agents at different reasoning levels. We believe recursive reasoning with higher-order beliefs can better incorporate non-equilibrium behavior during the game, especially where the traditional theory of NE cannot explain the observed behavior (e.g. the Keynes Beauty Contest). In the meantime, we also show that GR2 has better convergence properties.

3 Preliminaries

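Our GR2 method is built upon the framework of the $n$-agent stochastic game (Shapley, 1953), denoted by $\langle \mathcal{S}, \mathcal{A}^1, \dots, \mathcal{A}^n, r^1, \dots, r^n, p, \gamma \rangle$, where $\mathcal{S}$ represents the state space, $p$ is the transition probability of the environment, $\gamma$ is the discount factor for future rewards, and $\mathcal{A}^i$ and $r^i$ are the action space and the reward function for agent $i$ respectively. Agent $i$ chooses its action $a^i \in \mathcal{A}^i$ based on its policy $\pi^i_{\theta^i}(a^i \mid s)$, which is parameterized by $\theta^i$, and $a^{-i}$ represents its opponents’ behaviors. Each agent tries to find the optimal policy that maximizes its expected cumulative reward, $\mathbb{E}\big[\sum_{t} \gamma^t r^i(s_t, a^i_t, a^{-i}_t)\big]$, with $(a^i_t, a^{-i}_t)$ sampled from $(\pi^i, \pi^{-i})$. The state-action $Q$-function for each agent follows the Bellman equation $Q^i(s_t, a^i_t, a^{-i}_t) = r^i(s_t, a^i_t, a^{-i}_t) + \gamma\, \mathbb{E}_{s_{t+1}, a^i_{t+1}, a^{-i}_{t+1}}\big[Q^i(s_{t+1}, a^i_{t+1}, a^{-i}_{t+1})\big]$. We define the trajectory by $\tau = [(s_1, a^i_1, a^{-i}_1), \dots, (s_T, a^i_T, a^{-i}_T)]$, and $\mathcal{D}^i$ to be the replay buffer that stores those trajectories for agent $i$.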

3.1 Nash Equilibrium

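Agents are self-interested and learn only the optimal policy in the game. The variable $\mathcal{O}^i_t$ represents how optimal agent $i$ is at time $t$; its probability is proportional to the exponentiated reward, $P(\mathcal{O}^i_t = 1 \mid s_t, a^i_t, a^{-i}_t) \propto \exp\big(r^i(s_t, a^i_t, a^{-i}_t)\big)$ (Levine, 2018). In other words, it can describe how likely the trajectory $\tau$ is to be observed, e.g. if agents were egocentric, altruistic actions are less likely to be observed. The optimal action for agent $i$ is then obtained by inference conditioned on $\mathcal{O}^i_t = 1$.

The best response to the opponent policy $\pi^{-i}$ is the policy $\pi^i_*$ s.t. $V^i(s; \pi^i_*, \pi^{-i}) \ge V^i(s; \pi^i, \pi^{-i})$ for all valid $\pi^i$. When all the agents act in their best response, $(\pi^1_*, \dots, \pi^n_*)$ forms a Nash Equilibrium (Nash et al., 1950).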

3.2 Multi-Agent Soft Learning

Soft Q-learning, also called maximum entropy RL, is the outcome if one applies variational inference to solve the optimal policy on the graphical model of RL (Levine, 2018). Compared to the normal Q-learning, it has one additional entropy term in the objective. When the soft Q-learning is adapted into the multi-agent case, the entropy term is instead applied on the joint policy (Grau-Moya et al., 2018). The soft value function therefore reads as follows:


Figure 1: Graphical model of the level-$k$ recursive reasoning. Note that the subscript here stands for the level of thinking, not the timestep. The unobservable opponent policies are approximated by $\rho^{-i}$. The omitted level-0 model considers opponents fully randomized. Agent $i$ rolls out the recursive reasoning about opponents in its mind (grey area). In the recursion, agents with higher-level beliefs take the best response to the lower-level thinkers’ actions. Higher-level models conduct all the computations that the lower-level models have done, e.g. level-$k$ contains level-$(k-1)$.

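For policy evaluation, a modified Bellman operator can be applied, $\mathcal{T}^{\pi} Q^i(s, a^i, a^{-i}) \triangleq r^i(s, a^i, a^{-i}) + \gamma\, \mathbb{E}_{s'}\big[V^i(s')\big]$. Compared to the $\max$ operator in the Bellman equation for $Q$ in the case without the entropy term, it is soft because the value behaves like a smooth (log-sum-exp) approximation of the maximum, $\log \int \exp\big(Q(s, a)\big)\,\mathrm{d}a \approx \max_a Q(s, a)$. Correspondingly, the independent policy can be learned by minimizing the KL divergence between the policy and the distribution induced by the soft $Q$-function, which leads to Soft RL (Haarnoja et al., 2017, 2018).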
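Why the log-sum-exp is called soft can be seen with a few numbers. This is a generic sketch of the log-sum-exp relation for a discrete action set, not a reproduction of the paper's Eq. 1; the `temperature` parameter and the function name are assumptions introduced for illustration.

```python
import math


def soft_max_value(q_values, temperature=1.0):
    """Temperature-scaled log-sum-exp of a list of Q-values (numerically stable)."""
    m = max(q_values)
    total = sum(math.exp((q - m) / temperature) for q in q_values)
    return m + temperature * math.log(total)


if __name__ == "__main__":
    q = [1.0, 2.0, 3.0]
    for t in (1.0, 0.1, 0.01):
        # As the temperature shrinks, the soft value approaches max(q) = 3.
        print(t, soft_max_value(q, t))
```

The soft value is never below the hard maximum and approaches it as the temperature goes to zero, which is the sense in which the operator relaxes the usual $\max$.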

However, policy improvement becomes tricky in multi-agent soft Q-learning because the Q-function guides the improvement direction for the joint policy rather than for each individual agent. One naive approach is to discard the opponents’ actions sampled from the joint policy and take the marginalized output (Wei et al., 2018).

3.3 Probabilistic Recursive Reasoning

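PR2 (Wen et al., 2019) adopted the so-called correlated factorization on the joint policy, i.e. $\pi_\theta(a^i, a^{-i} \mid s) = \pi^i_{\theta^i}(a^i \mid s)\, \pi^{-i}_{\theta^{-i}}(a^{-i} \mid s, a^i)$, where $\pi^{-i}_{\theta^{-i}}(a^{-i} \mid s, a^i)$ represents the level-1 reasoning process of the opponent’s consideration of the action from agent $i$. PR2 approximates the actual opponent conditional policy via a best-fit model $\rho^{-i}_{\phi^{-i}}(a^{-i} \mid s, a^i)$ from a family of distributions parameterized by $\phi^{-i}$. Through variational inference, the optimal opponent conditional policy in PR2 Q-learning is described to be

$$\rho^{-i}_{\phi^{-i}}(a^{-i} \mid s, a^i) = \frac{1}{Z} \exp\big(Q^i(s, a^i, a^{-i}) - Q^i(s, a^i)\big). \qquad (2)$$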

Figure 2: Reasoning paths of the level-$k$ policy.

Based on the optimal opponent model given by Eq. 2, agent $i$ can learn the best response policy according to the optimal $Q$-function. Together with the Bellman equation for policy evaluation based on Eq. 1, PR2 is proved to converge to the unique NE.

4 Generalized Recursive Reasoning

Our goal is to bring a more general and flexible form of recursive reasoning into MARL. In this section, we first introduce the protocol of level-$k$ recursive reasoning (GR2-L), and then propose a mixture model that takes into account agents with different hierarchical levels of policies (GR2-M).

4.1 Level-$k$ Recursive Reasoning

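We start by constructing a level-$k$ recursive reasoning process, naming it GR2-L. A level-$k$ policy $\pi^i_k$ for agent $i$ is the best response to the opponents’ policy at level $k-1$, $\pi^{-i}_{k-1}$ (see Fig. 1). We assume $\pi^{-i}_{k-1}$ is not directly observable, and agents have to approximate it by $\rho^{-i}_{k-1}$. The models on $\pi^i_k$ and $\rho^{-i}_{k-1}$ can be regarded as latent variable models, with the marginal policies $\pi^i_k(a^i_k \mid s)$, $\rho^{-i}_{k-1}(a^{-i}_{k-1} \mid s)$ serving as the prior, and the conditional policies $\pi^i_k(a^i_k \mid s, a^{-i}_{k-1})$, $\rho^{-i}_{k-1}(a^{-i}_{k-1} \mid s, a^i_{k-2})$ as the posterior. Actions can be sampled in a sequential manner. The level-$k$ policy is therefore constructed by integrating over best responses from all lower-level policies,

$$\pi^i_k(a^i_k \mid s) \propto \int_{a^{-i}_{k-1}} \Big\{ \pi^i_k(a^i_k \mid s, a^{-i}_{k-1}) \int_{a^i_{k-2}} \big[ \rho^{-i}_{k-1}(a^{-i}_{k-1} \mid s, a^i_{k-2})\, \pi^i_{k-2}(a^i_{k-2} \mid s) \big]\, \mathrm{d}a^i_{k-2} \Big\}\, \mathrm{d}a^{-i}_{k-1}. \qquad (3)$$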


From agent $i$’s perspective, it believes that the opponents will take the best response to its own (fictitious) action $a^i_{k-2}$ that is two levels below, $\rho^{-i}_{k-1}(a^{-i}_{k-1} \mid s, a^i_{k-2})$, even though the beliefs $\rho^{-i}_{k-1}$ that agent $i$ holds about its opponents might still be inaccurate. $\pi^i_{k-2}$ can be further expanded recursively until meeting $\pi_0$, which is assumed uniformly distributed. For brevity, we mostly omit the subscript level $k$ in actions from now on while acknowledging that the actions in the lower levels are imagined (fictitiously taken). Considering the computational feasibility, we assume that each of the agents adopts the same functional form for its policy across the reasoning process at different depths; as shown in Fig. 1, the parameters of $\pi^i$ and of $\rho^{-i}$ are shared across levels.
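The recursion described in this subsection can be unrolled explicitly in a small two-player matrix game. This is only a sketch under simplifying assumptions that are not the paper's setup: actions are discrete, every level plays a deterministic best response, and level 0 is replaced by a fixed anchor action rather than a uniform distribution. The payoff matrix in the demo is hypothetical.

```python
def best_response(payoff, opponent_action):
    """Own action maximizing payoff[own][opponent_action]."""
    rows = range(len(payoff))
    return max(rows, key=lambda a: payoff[a][opponent_action])


def level_k_action(k, payoff_self, payoff_opp, level0_action=0):
    """Action of a level-k agent, obtained by unrolling the recursion.

    The agent imagines a level-(k-1) opponent, who in turn imagines a
    level-(k-2) copy of the agent, and so on down to level 0.
    Assumption: level 0 plays a fixed anchor action instead of a uniform
    distribution, so that every step is a deterministic best response.
    Each payoff matrix is indexed as payoff[own_action][other_action].
    """
    if k == 0:
        return level0_action
    # The opponent at level k-1 reasons about us with the roles swapped.
    opp_action = level_k_action(k - 1, payoff_opp, payoff_self, level0_action)
    return best_response(payoff_self, opp_action)


if __name__ == "__main__":
    # Hypothetical Prisoner's Dilemma payoffs: action 0 = cooperate, 1 = defect.
    pd = [[3, 0], [5, 1]]
    print([level_k_action(k, pd, pd, level0_action=0) for k in range(4)])
```

Note how a level-$k$ call performs all the computation of the levels below it, which mirrors the nesting shown in Fig. 1.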

4.2 Mixture of Hierarchy Recursive Reasoning

While the level-$k$ policy assumes agents best respond to the level $k-1$ opponents, we can further generalize the model to let each agent best respond to a mixture of agents that are distributed over the lower hierarchies. We name it GR2-M.

We assume that the level-$k$ agents have an accurate guess about the relative proportion of agents who are doing less thinking than them. This specification implies that even though the agent could maintain an inaccurate belief about the probability distribution of the agents’ recursive depths, as $k$ increases, the deviation between the agent’s belief and the actual distribution will shrink. This further suggests that when $k$ is very large, there is no benefit for level-$k$ thinkers to reason even harder, as they will have the same beliefs, and subsequently make the same decisions, as agents at level $k-1$. Since more steps of computation are required as $k$ grows, it is reasonable to assume that with increasing $k$, fewer agents are willing to reason beyond $k$. This assumption can also be justified by the limited amount of working memory available to human beings in strategic thinking (Devetag & Warglien, 2003).

In order to meet this assumption, we model the proportion of recursive depth by the Poisson distribution $f(k) = e^{-\lambda} \lambda^{k} / k!$, where the mean, also the variance, of the distribution is $\lambda$. A nice property of Poisson is that $f(k)/f(k-1)$ is inversely proportional to $k$, which satisfies our previous assumption. Interestingly, a study on humans suggests that on average people tend to act with $\lambda \approx 1.5$ (Camerer et al., 2004).
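The ratio property of the Poisson distribution can be verified directly. A minimal sketch, with `lam` standing for the Poisson mean; the value 1.5 used in the demo is only an example.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability of reasoning depth k with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


if __name__ == "__main__":
    lam = 1.5
    for k in range(1, 6):
        ratio = poisson_pmf(k, lam) / poisson_pmf(k - 1, lam)
        # The ratio of consecutive probabilities equals lam / k.
        print(k, round(ratio, 4), round(lam / k, 4))
```

Because consecutive probabilities differ by a factor of `lam / k`, deeper levels become rarer once $k$ exceeds the mean.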

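With the mixture of hierarchy modeled by Poisson, we can now mix all levels’ thinking into one perception, and build each agent’s belief on its opponents by

$$\pi^{i,\lambda}_k(a^i_k \mid s, a^{-i}_{0:k-1}) := \frac{e^{-\lambda}}{Z}\Big(\frac{\lambda^0}{0!}\,\pi^i_0(a^i_0 \mid s) + \dots + \frac{\lambda^k}{k!}\,\pi^i_k(a^i_k \mid s, a^{-i}_{0:k-1})\Big), \qquad (4)$$

where $Z = \sum_{n=0}^{k} e^{-\lambda} \lambda^n / n!$ is a normalization term. In the reinforcement learning framework to be shown later, $\lambda$ can be set as a hyper-parameter, similar to TD-$\lambda$ (Tesauro, 1995). Note that GR2-L is in fact a special case of GR2-M. When the mixture of depth is Poisson distributed, we have $f(k)/f(k-1) = \lambda/k$; the model will put most of its weight on the highest level in the mixture if $\lambda \gg k$; that is to say, a level-$k$ agent will act as if all the opponents are reasoning at level $k-1$.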
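The effect of the Poisson mean on the mixture can be illustrated with a truncated, renormalized Poisson over the available levels. This is a hedged sketch: the function names are illustrative, and averaging scalar per-level predictions is a simplification of mixing full policies.

```python
import math


def mixture_weights(k, lam):
    """Poisson weights over levels 0..k, renormalized to sum to one."""
    raw = [math.exp(-lam) * lam ** n / math.factorial(n) for n in range(k + 1)]
    z = sum(raw)  # normalization term for the truncated distribution
    return [w / z for w in raw]


def mixed_prediction(level_predictions, lam):
    """Weighted average of per-level predictions under the mixture weights."""
    weights = mixture_weights(len(level_predictions) - 1, lam)
    return sum(w * x for w, x in zip(weights, level_predictions))


if __name__ == "__main__":
    print(mixture_weights(3, 1.5))    # mass spread over all four levels
    print(mixture_weights(3, 100.0))  # nearly all mass on the top level
    print(mixed_prediction([50.0, 25.0, 12.5], 1.5))
```

With a mean much larger than the number of levels, almost all of the weight falls on the highest level, so the mixture behaves like a single-level model.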

Figure 3: Inter-level policy improvement. Higher-order strategies should weakly dominate lower-order strategies, e.g. with the opponent behaviors unchanged, the level-$k$ policy should perform at least as well as a lower-level policy.

4.3 Nash Convergence when $k \to \infty$

Even though GR2-M does not intentionally move towards the Nash Equilibrium as its target, we present the main theorem that in the extreme case when the level $k$ goes to $\infty$, GR2-M will reach at least one NE. We start with the following proposition.

Proposition 1.

In both the GR2-L and GR2-M models, if the agents play pure strategies, once a level-$k$ agent reaches a Nash Equilibrium, all higher-level agents will follow it too.

Proof: See Appendix A.

In GR2, since high-order thinkers will always conduct at least the same amount of computations as the lower-order thinkers, with Prop. 1, we can have the corollary that

Corollary 1.

In both GR2-L and GR2-M, higher-order strategies weakly dominate lower-order strategies (see Fig. 3).

Theorem 1.

In the GR2-M model, if the agents play pure strategies, as $k \to \infty$, the prediction of GR2-M will reach at least one Nash Equilibrium (if it exists).

Proof: See Appendix B.

The property of converging to NE enables GR2-M not only to model the non-equilibrium behaviors in the games, but also to induce multi-agent learning algorithms that can potentially be more efficient.

5 Multi-Agent GR2 Reinforcement Learning

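  Set hyper-parameters $\lambda$, $k$ and the learning rates.
  Initialize the policy, opponent-model and $Q$-function parameters for each agent $i$. Assign target parameters from the $Q$-function parameters. $\mathcal{D}^i \leftarrow$ empty replay buffer for each agent $i$.
  for each episode do
     for each step $t$ do
        For each agent $i$, sample action $a^i$ according to
          $\pi^i_k$ in Eq. 3 or $\pi^{i,\lambda}_k$ in Eq. 4.
        Sample next state: $s' \sim p(s' \mid s, a^i, a^{-i})$.
        Add the tuple $(s, a^i, a^{-i}, r^i, s')$ to $\mathcal{D}^i$.
        for each agent $i$ do
           Sample a minibatch of tuples from $\mathcal{D}^i$.
           Roll out the policy from level $k$ to get the lower-level actions,
            with each level taking the best response.
           Record inter-level results
            for the auxiliary objective in Sec. 5.2.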
           Sample for each , .
        end for
        Update the target network for each agent $i$.
     end for
  end for
Algorithm 1 GR2-L/M Soft Actor-Critic Algorithm

In this section, we incorporate the GR2 models into the MARL, in particular the framework of soft learning in Sec. 3.2, and propose the GR2 Soft Actor-Critic algorithm.

5.1 GR2 Soft Actor-Critic Algorithm

To evaluate the level-$k$ policy, since agent $i$ cannot access the joint soft $Q$-function, as it depends on the exact opponent policies that are unknown, we instead compute the independent soft $Q$-function by marginalizing the joint $Q$-function via the estimated opponent model $\rho^{-i}$:


With the marginalized $Q$-function, combined with the soft value function defined in Eq. 1, we can obtain the value function of the level-$k$ policy by


Each agent rolls out the recursive reasoning to level $k$, either through GR2-L or through GR2-M, and then samples the best response. With the value function in Eq. 6, we can train the parameters of the joint soft $Q$-function by minimizing the soft Bellman residual,


where $s'$ is the next state, and


The level-$k$ policy parameters can be learned by the following objective, which is in fact equivalent to minimizing the KL divergence between the policy and the distribution induced by the soft $Q$-function (Haarnoja et al., 2017):


In the multi-agent soft learning setting, the optimal opponent model still follows Eq. 2. We can therefore update the opponent model parameters by minimizing the KL divergence between the current estimated policy and the advantage function:


In practice, we can maintain two approximated $Q$-functions, the joint one and the marginalized one, separately, and then iteratively update the opponent model via SVGD (Haarnoja et al., 2017; Liu & Wang, 2016).

We summarize the whole algorithm in Algo. 1.
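Two of the steps in this subsection, marginalizing the joint $Q$-function through the opponent model and forming a Bellman residual, can be sketched for a single state with discrete actions. The paper's exact objectives are not reproduced here; the plain expectation, the tabular representation, and the discount value `gamma=0.95` are all assumptions made for illustration.

```python
def marginal_q(joint_q, opponent_model, own_action):
    """Expected joint Q over opponent actions drawn from the opponent model.

    joint_q[a][b]: agent's Q-value when it plays a and the opponent plays b.
    opponent_model[a][b]: estimated probability that the opponent plays b
    given that the agent plays a (a tabular stand-in for the learned model).
    """
    probs = opponent_model[own_action]
    return sum(p * joint_q[own_action][b] for b, p in enumerate(probs))


def bellman_residual(q_value, reward, next_value, gamma=0.95):
    """Half squared error between Q and the one-step target r + gamma * V(s')."""
    target = reward + gamma * next_value
    return 0.5 * (q_value - target) ** 2


if __name__ == "__main__":
    joint_q = [[3.0, 0.0], [5.0, 1.0]]         # hypothetical values
    opponent_model = [[0.5, 0.5], [0.0, 1.0]]  # hypothetical beliefs
    print(marginal_q(joint_q, opponent_model, 0))
    print(marginal_q(joint_q, opponent_model, 1))
```

In this toy case the second action looks best under the joint values, yet its marginal value is lower once the believed opponent reaction is taken into account, which is the reason for evaluating actions through the opponent model.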

5.2 Inter-level Policy Improvement

As pointed out by Corollary 1, a higher-level policy should respond at least as well as a lower-level policy to the opponent actions. In fact, the highest-level policy directly receives the feedback reward from the environment and then back-propagates it to the lower-level policies. The intermediate-level policies then receive the reward signal through the chain. As shown in Fig. 3, the value of a higher-level action should be no smaller than that of a lower-level action against the same opponent behavior. In order to maintain this property, we introduce an auxiliary objective during training:


As shown in Sec. 6.1, this auxiliary objective plays a critical role in helping convergence.
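One simple way to penalize violations of the inter-level ordering is a hinge term on adjacent levels. The hinge form below is an assumption chosen for illustration and is not claimed to be the paper's auxiliary objective; `q_by_level` is a hypothetical list of value estimates, one per reasoning level.

```python
def inter_level_penalty(q_by_level):
    """Sum of hinge penalties for adjacent levels whose value ordering is violated.

    q_by_level[j] is the estimated value of the level-j action against fixed
    opponent behavior; a higher level should not score below a lower one.
    """
    penalty = 0.0
    for lower, higher in zip(q_by_level, q_by_level[1:]):
        penalty += max(0.0, lower - higher)
    return penalty


if __name__ == "__main__":
    print(inter_level_penalty([1.0, 2.0, 3.0]))  # ordering respected
    print(inter_level_penalty([1.0, 0.5, 2.0]))  # level 1 falls below level 0
```

The penalty is zero whenever values are non-decreasing in the level, so it only affects training when a higher level does worse than the one below it.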

5.3 Best Response as Deterministic Strategy

Considering the computational feasibility, we formulate the best response in the form of a deterministic strategy. Fig. 2 shows an example of the possible combinations of paths at the $k$-th level of reasoning. As the reasoning process involves iterated usage of $\pi^i$ and $\rho^{-i}$, should they be stochastic policies, the cost of integrating over all possible lower-level actions would be unsustainable as $k$ increases. Besides, the reasoning process is affected by the stochasticity of the environment, and the variance would be further amplified by stochastic policies.

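Our approach to modeling the deterministic strategy is based on bijective transformations, which Dinh et al. (2016) first applied to deep generative models. Let $x \sim p_X(x)$ be the random variable and $y = f(x)$ with $f$ a bijection; by employing the change of variable rule, we have

$$p_Y(y) = p_X(x) \left| \det \frac{\partial f(x)}{\partial x} \right|^{-1}.$$

The computation of the determinant can be further simplified by choosing special bijective transformations that have a friendly Jacobian matrix. For example, in the GR2-L model, level-0 is usually assumed to be uniformly distributed, so the densities of the higher levels follow by applying the change of variable rule along the chain of best responses.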
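The change of variable rule can be checked on the simplest bijection, an affine map of a uniform variable. This is a one-dimensional illustration only; the map, its parameters `scale` and `shift`, and the function name are assumptions and not the transformation used in the paper.

```python
def affine_pushforward_density(y, scale, shift):
    """Density at y of Y = scale * X + shift, where X is uniform on [0, 1].

    Change of variables: p_Y(y) = p_X(x) / |dy/dx| with x = (y - shift) / scale.
    """
    if scale == 0:
        raise ValueError("scale must be non-zero for the map to be bijective")
    x = (y - shift) / scale
    base_density = 1.0 if 0.0 <= x <= 1.0 else 0.0
    return base_density / abs(scale)


if __name__ == "__main__":
    # Stretching [0, 1] to [0, 2] halves the density.
    print(affine_pushforward_density(1.0, scale=2.0, shift=0.0))
```

For an affine map the Jacobian determinant is just `scale`, which is the kind of cheap determinant that makes such transformations attractive.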

6 Experiments

recursive depth Level 3 Level 2 Level 1 Level 0

Table 1: The converged equilibrium on the Keynes Beauty Contest with different $p$ and agent number $n$ settings.

We compare the proposed GR2-L/GR2-M Soft Actor-Critic algorithm with a series of baselines including PR2 (Wen et al., 2019), MASQL (Wei et al., 2018; Grau-Moya et al., 2018), MADDPG (Lowe et al., 2017) and independent learner via DDPG (Lillicrap et al., 2015). To compare against traditional opponent modeling methods, similar to (Rabinowitz et al., 2018; He et al., 2016), we implement an additional baseline of DDPG with an opponent module that is trained online with supervision in order to capture the latest opponent behaviors, called DDPG-OM.

The Recursive Level. We regard DDPG, DDPG-OM, MASQL and MADDPG as level-0 reasoning models because, at the policy level, they do not explicitly model the impact of one agent’s action on the other agents or consider the reactions of the other agents. Even though the value function of the joint policy is learned in MASQL and MADDPG, they conduct a non-correlated factorization (Wen et al., 2019) when it comes to each individual agent’s policy. PR2 is in fact a level-1 reasoning model, but note that the level-1 model in GR2 stands for agent $i$ best responding to level-0 opponents, while the level-1 model in PR2 starts from the opponent’s angle, that is, the opponents best respond to the action of agent $i$.

Experimental Settings. We denote by $k$ the highest order of reasoning in GR2-L$k$ and GR2-M$k$. All the policies and $Q$-functions are parameterized by the MLP with hidden layers, where each has

units with the ReLU activation. In the Keynes Beauty Contest, we train all the methods including the baselines for

iterations with steps per iteration. In the matrix game, we train the agents for iterations with steps per iteration. And for the particle games, all the models are trained up to steps with maximum episode length. In the actor-critic algorithms, we set the exploration noise to in the first steps. The annealing parameters in soft algorithms are set to to balance the exploration/exploitation. We adopt and set , and leave the more detailed ablation study on the hyper-parameter settings of and in Appendix C.

Figure 4: Learning curves with or without the auxiliary loss of Sec. 5.2.

6.1 Keynes Beauty Contest

In the Keynes Beauty Contest game, all agents pick a number between $0$ and $100$. The winner is the agent whose “guess” is closest to $p$ times the average number. The reward is set as the absolute difference to the winner’s guess. We evaluate the effect of reward shaping in Appendix D.
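One round of the game can be written down in a few lines. This is a sketch under stated assumptions: the reward is taken to be the negative absolute distance to the winning guess (the sign convention is not specified in the text), ties are broken in favor of the first such guess, and the function name is illustrative.

```python
def beauty_contest_rewards(guesses, p):
    """Rewards for one round of the p-beauty contest.

    The winner is the guess closest to p times the average guess. As an
    assumption, each reward is the negative absolute distance to the
    winning guess, so the winner receives 0 and everyone else less.
    """
    target = p * sum(guesses) / len(guesses)
    winner = min(guesses, key=lambda g: abs(g - target))
    return [-abs(g - winner) for g in guesses]


if __name__ == "__main__":
    # Average is 40, target is 20, so the middle player wins.
    print(beauty_contest_rewards([10, 20, 90], p=0.5))
```

Under this convention every agent is pulled toward the winning guess, which itself moves toward the equilibrium as all agents lower (or raise) their guesses together.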

We first conduct the experiments with different $p$ and $n$ values; Table 1 presents the results. The GR2-L algorithms effectively approach the equilibrium, while the other baselines struggle to reach it. We believe this is because the level-0 methods are unable to model the explicit dependency between agents’ actions, so the synergy of guessing smaller/larger together does not emerge. In addition, as the maximum level of recursive reasoning grows, the equilibrium that GR2-L finds gets increasingly closer to the true NE, which is expected. In the case of , the equilibrium is first reached by GR2-L, and the other higher-order models follow the same equilibrium afterwards; this is in line with Proposition 1. The equilibrium found by GR2-L3 is when ; the error is due to the saturated reward design, which stops agents from changing their actions any further.

Furthermore, we analyze the effectiveness of the auxiliary loss in Sec. 5.2, which aims to guarantee the inter-level policy improvement. In Fig. 4, we find that although the GR2-L model without the auxiliary loss learns fast in the beginning, it fails to reach a better equilibrium. We believe adding the auxiliary loss to the objective helps the joint $Q$-function to guide agents’ reasoning in a better direction.

6.2 Learning Matrix Games

Figure 5: Learning curves on the matrix games. (a) Stag Hunt Game. (b) Prisoner’s Dilemma.

Figure 6: Performance comparison between different models in the self-play competitive environments.

We then evaluate on two matrix games: Prisoner’s Dilemma (PD) and Stag Hunt (SH). The reward matrix of PD is . Agents can choose to cooperate (C) or defect (D). Defection is the dominant strategy for both agents, while there exists a Pareto optimum (C, C). In SH, the reward matrix is . Agents can either cooperate to hunt the large stag (S) or chase a small rabbit (P) alone. In this game, there is no dominant strategy, and it comes with two NEs, i.e. (S, S), which is also Pareto-optimal, and the deficient equilibrium (P, P). To make the matrix games compatible with the actor-critic setting, we make the actor output the probability of choosing one action and define the reward based on the payoff matrix. We define the state at time $t$ to be .
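The equilibrium structure described for the two games can be checked by enumerating pure-strategy profiles. The payoff values below are hypothetical stand-ins with the usual PD and SH orderings, not the matrices used in the paper; action 0 denotes C or S and action 1 denotes D or P.

```python
def pure_nash_equilibria(payoff_row, payoff_col):
    """All pure-strategy Nash equilibria (row action, column action) of a bimatrix game.

    payoff_row[a][b] and payoff_col[a][b] are the payoffs of the row and
    column player when the row player picks a and the column player picks b.
    """
    n_rows, n_cols = len(payoff_row), len(payoff_row[0])
    equilibria = []
    for a in range(n_rows):
        for b in range(n_cols):
            row_ok = all(payoff_row[a][b] >= payoff_row[x][b] for x in range(n_rows))
            col_ok = all(payoff_col[a][b] >= payoff_col[a][y] for y in range(n_cols))
            if row_ok and col_ok:
                equilibria.append((a, b))
    return equilibria


# Hypothetical payoffs (not the paper's matrices). Action 0 = C / S, 1 = D / P.
PD_ROW = [[3, 0], [5, 1]]
PD_COL = [[3, 5], [0, 1]]
SH_ROW = [[4, 0], [3, 2]]
SH_COL = [[4, 3], [0, 2]]

if __name__ == "__main__":
    print("PD:", pure_nash_equilibria(PD_ROW, PD_COL))
    print("SH:", pure_nash_equilibria(SH_ROW, SH_COL))
```

With these stand-in payoffs, PD has the single equilibrium (D, D) while SH has both (S, S) and (P, P), matching the qualitative description above.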

Figure 7: Performance comparison in the cooperative game.

In SH, we expect the agents to reach the Pareto optimum, so that both agents receive the maximum reward. Fig. 5(a) compares the average training reward for a list of algorithms. GR2-L/GR2-M models, together with PR2, can easily reach the Pareto optimum with maximum average return , whereas other models are trapped in the deficient equilibrium with average return . On the convergence speed, higher-order thinkers are faster than lower-order thinkers, and mixture models are faster than level-$k$ models. SH is a coordination game that motivates players to cooperate, but the challenging part is the conflict between individual safety and social welfare. Without knowing the choice of the other, GR2 has to anchor the belief that opponents would like to choose the social welfare, and then reinforce this belief through the high-order reasoning to finally build trust among agents. Since there is no dominant strategy in SH, agents will in the end stay fixed in that strategy. The level-0 methods cannot build such synergy because they cannot discriminate individual safety from social welfare; both NEs can saturate the value function of the joint action.

PD is different from SH in that there is a dominant strategy of defection, which makes the Pareto optimum unstable. In Fig. 5(b), although GR2-L3 could reach the Pareto optimum within the first steps, it soon chooses to defect. Since defection is dominant, a level-1 thinker will surely defect; such reasoning is passed on to the level-2 thinker, and GR2-L3 soon realizes this and best responds with defection immediately. GR2-M3 under-performs GR2-L3 because it is a mixture strategy, which still considers lower-level thinkers who chase cooperation. As for the rest of the baselines, they do not present such behavioral changes at all; meanwhile, they converge to a worse equilibrium.

6.3 Particle World Environments

We further adopt the Particle World Environments (Lowe et al., 2017) to test our method on high-dimensional state/action space scenarios. It includes four testing scenarios: 1) Cooperative Navigation with agents and landmarks. Agents are collectively rewarded based on the proximity of any agent to each landmark while avoiding collisions; 2) Physical Deception with an adversary, good agents, and landmarks. All agents observe the positions of landmarks and other agents. Only one landmark is the true target landmark. Good agents are rewarded based on how close any of them is to the target landmark, and how well they deceive the adversary; 3) Keep-away with an agent, an adversary, and a landmark. The agent is rewarded based on its distance to the landmark. The adversary is rewarded if it pushes the agent away from the landmark; 4) Predator-prey with a prey agent that moves faster and tries to run away from adversary predators that move slower but are motivated to catch the prey cooperatively. Task 1 is fully cooperative, while tasks 2-4 are cooperative-competitive.

We measure the performance of agents trained by different algorithms in each environment. The results are shown in Figs. 6 & 7; we report the rewards of the good agents and normalize them to - for tasks -. In all scenarios, the GR2-M methods achieve the highest score. The level- models can outperform the lower-level models. In the cooperative game, the GR2-L/M methods have critical advantages over traditional baselines; this is in line with the finding in SH that GR2 methods are good at finding the Pareto optimum.
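One common way to put scores from different algorithms on a comparable scale is min-max normalization. The sketch below assumes a target range of [0, 1] and uses placeholder method names and scores; the normalization range used for the figures is not recoverable from the text here, so this is an illustration rather than the exact procedure.

```python
def min_max_normalize(scores):
    """Rescale a mapping of method name to raw score onto [0, 1].

    The [0, 1] target range is an assumption made for this example.
    """
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All methods tie, so there is no spread to rescale.
        return {name: 0.0 for name in scores}
    return {name: (s - lo) / (hi - lo) for name, s in scores.items()}


# Placeholder names and values, not results from the experiments.
raw_scores = {"method_a": -10.0, "method_b": -5.0, "method_c": 0.0}
normalized = min_max_normalize(raw_scores)
```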

7 Conclusion

In this paper, following the study of bounded rationality (Simon, 1972) and behavioral game theory (Camerer, 2003), we introduce the protocol of generalized recursive reasoning (GR2), which allows agents to have different levels of recursive reasoning capability when dealing with other agents. In GR2, each agent takes the best response to a mixed type of agents that think and behave at lower levels. We integrate GR2 into the MARL framework. The induced GR2 Soft Actor-Critic algorithm shows superior performance in building correct beliefs about the opponents' behaviors, even when they are not at equilibrium. We prove in theory that GR2 can reach NE as the level approaches infinity. Empirically, compared to conventional opponent models (level 0), learning with higher-level recursive reasoning capability leads to faster convergence.
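The idea of best responding to a mixture of lower-level thinkers can be sketched for a symmetric matrix game as follows. This is a simplified illustration, not the GR2 Soft Actor-Critic algorithm: the truncated Poisson weighting over levels is an assumption borrowed from cognitive hierarchy models (Camerer et al., 2004), and the payoff matrix, the rate `lam`, and all function names are hypothetical.

```python
import math


def poisson_weights(k, lam=1.5):
    """Truncated Poisson weights over levels 0..k-1, renormalized to sum to 1.

    The Poisson form and the rate lam are assumptions for this sketch.
    """
    raw = [math.exp(-lam) * lam ** j / math.factorial(j) for j in range(k)]
    total = sum(raw)
    return [w / total for w in raw]


def best_response(opponent_policy, row_payoff):
    """Return a one-hot policy maximizing expected payoff against the opponent."""
    expected = [
        sum(p * u for p, u in zip(opponent_policy, row)) for row in row_payoff
    ]
    best = max(range(len(expected)), key=expected.__getitem__)
    return [1.0 if a == best else 0.0 for a in range(len(expected))]


def mixture_level_policies(max_level, row_payoff, lam=1.5):
    """Policies for levels 0..max_level in a symmetric game.

    Level 0 is uniform random; level k best responds to the weighted
    mixture of the policies at levels 0..k-1.
    """
    n = len(row_payoff)
    policies = [[1.0 / n] * n]
    for k in range(1, max_level + 1):
        weights = poisson_weights(k, lam)
        mixture = [
            sum(w * pi[a] for w, pi in zip(weights, policies)) for a in range(n)
        ]
        policies.append(best_response(mixture, row_payoff))
    return policies


# Hypothetical symmetric coordination game (row player's payoffs).
GAME = [
    [5, 0],
    [2, 1],
]
policies = mixture_level_policies(3, GAME)
```

In this toy game the level-1 agent already prefers the first action against a uniform level-0 opponent, and higher levels keep that choice because the mixture they face is increasingly concentrated on it.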


  • Albrecht & Stone (2018) Albrecht, S. V. and Stone, P. Autonomous agents modelling other agents: A comprehensive survey and open problems. Artificial Intelligence, 258:66–95, 2018.
  • Aumann (1987) Aumann, R. J. Correlated equilibrium as an expression of bayesian rationality. Econometrica: Journal of the Econometric Society, pp. 1–18, 1987.
  • Benndorf et al. (2017) Benndorf, V., Kübler, D., and Normann, H.-T. Depth of reasoning and information revelation: An experiment on the distribution of k-levels. International Game Theory Review, 19(04):1750021, 2017.
  • Camerer (2003) Camerer, C. F. Behavioural studies of strategic thinking in games. Trends in cognitive sciences, 7(5):225–231, 2003.
  • Camerer (2011) Camerer, C. F. Behavioral game theory: Experiments in strategic interaction. Princeton University Press, 2011.
  • Camerer et al. (2004) Camerer, C. F., Ho, T.-H., and Chong, J.-K. A cognitive hierarchy model of games. The Quarterly Journal of Economics, 119(3):861–898, 2004.
  • Chong et al. (2016) Chong, J.-K., Ho, T.-H., and Camerer, C. A generalized cognitive hierarchy model of games. Games and Economic Behavior, 99:257–274, 2016.
  • Coricelli & Nagel (2009) Coricelli, G. and Nagel, R. Neural correlates of depth of strategic reasoning in medial prefrontal cortex. Proceedings of the National Academy of Sciences, 106(23):9163–9168, 2009.
  • Crawford et al. (2013) Crawford, V. P., Costa-Gomes, M. A., and Iriberri, N. Structural models of nonequilibrium strategic thinking: Theory, evidence, and applications. Journal of Economic Literature, 51(1):5–62, 2013.
  • Devetag & Warglien (2003) Devetag, G. and Warglien, M. Games and phone numbers: Do short-term memory bounds affect strategic behavior? Journal of Economic Psychology, 24(2):189–202, 2003.
  • Dinh et al. (2016) Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using real nvp. arXiv preprint arXiv:1605.08803, 2016.
  • Gmytrasiewicz & Doshi (2005) Gmytrasiewicz, P. J. and Doshi, P. A framework for sequential planning in multi-agent settings. Journal of Artificial Intelligence Research, 24:49–79, 2005.
  • Goldman et al. (2012) Goldman, A. I. et al. Theory of mind. The Oxford handbook of philosophy of cognitive science, pp. 402–424, 2012.
  • Gracia-Lázaro et al. (2017) Gracia-Lázaro, C., Floría, L. M., and Moreno, Y. Cognitive hierarchy theory and two-person games. Games, 8(1):1, 2017.
  • Grau-Moya et al. (2018) Grau-Moya, J., Leibfried, F., and Bou-Ammar, H. Balancing two-player stochastic games with soft q-learning. IJCAI, 2018.
  • Greenwald et al. (2003) Greenwald, A., Hall, K., and Serrano, R. Correlated q-learning. In ICML, volume 3, pp. 242–249, 2003.
  • Haarnoja et al. (2017) Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. Reinforcement learning with deep energy-based policies. arXiv preprint arXiv:1702.08165, 2017.
  • Haarnoja et al. (2018) Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.
  • Harsanyi (1962) Harsanyi, J. C. Bargaining in ignorance of the opponent’s utility function. Journal of Conflict Resolution, 6(1):29–38, 1962.
  • Harsanyi (1967) Harsanyi, J. C. Games with incomplete information played by bayesian players, i–iii part i. the basic model. Management science, 14(3):159–182, 1967.
  • He et al. (2016) He, H., Boyd-Graber, J., Kwok, K., and Daumé III, H. Opponent modeling in deep reinforcement learning. In International Conference on Machine Learning, pp. 1804–1813, 2016.
  • Hu & Wellman (2003) Hu, J. and Wellman, M. P. Nash q-learning for general-sum stochastic games. Journal of machine learning research, 4(Nov):1039–1069, 2003.
  • Keynes (1936) Keynes, J. M. The General Theory of Employment, Interest and Money. Macmillan, 1936. 14th edition, 1973.
  • Levine (2018) Levine, S. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909, 2018.
  • Lillicrap et al. (2015) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
  • Littman (2001) Littman, M. L. Friend-or-foe q-learning in general-sum games. In ICML, volume 1, pp. 322–328, 2001.
  • Liu & Wang (2016) Liu, Q. and Wang, D. Stein variational gradient descent: A general purpose bayesian inference algorithm. In Advances In Neural Information Processing Systems, pp. 2378–2386, 2016.
  • Lowe et al. (2017) Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, O. P., and Mordatch, I. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, pp. 6379–6390, 2017.
  • Moulin (1986) Moulin, H. Game theory for the social sciences. NYU press, 1986.
  • Nagel (1995) Nagel, R. Unraveling in guessing games: An experimental study. The American Economic Review, 85(5):1313–1326, 1995.
  • Nash et al. (1950) Nash, J. F. et al. Equilibrium points in n-person games. Proceedings of the national academy of sciences, 36(1):48–49, 1950.
  • Rabinowitz et al. (2018) Rabinowitz, N. C., Perbet, F., Song, H. F., Zhang, C., Eslami, S., and Botvinick, M. Machine theory of mind. arXiv preprint arXiv:1802.07740, 2018.
  • Seuken & Zilberstein (2008) Seuken, S. and Zilberstein, S. Formal models and algorithms for decentralized decision making under uncertainty. Autonomous Agents and Multi-Agent Systems, 17(2):190–250, 2008.
  • Shapley (1953) Shapley, L. S. Stochastic games. Proceedings of the national academy of sciences, 39(10):1095–1100, 1953.
  • Simon (1972) Simon, H. A. Theories of bounded rationality. Decision and organization, 1(1):161–176, 1972.
  • Sondik (1971) Sondik, E. J. The optimal control of partially observable markov processes. Technical report, STANFORD UNIV CALIF STANFORD ELECTRONICS LABS, 1971.
  • Stahl (1993) Stahl, D. O. Evolution of smart_n players. Games and Economic Behavior, 5(4):604–617, 1993.
  • Stahl II & Wilson (1994) Stahl II, D. O. and Wilson, P. W. Experimental evidence on players’ models of other players. Journal of economic behavior & organization, 25(3):309–327, 1994.
  • Tesauro (1995) Tesauro, G. Temporal difference learning and td-gammon. Communications of the ACM, 38(3):58–68, 1995.
  • Wei et al. (2018) Wei, E., Wicke, D., Freelan, D., and Luke, S. Multiagent soft q-learning. AAAI, 2018.
  • Wen et al. (2019) Wen, Y., Yang, Y., Luo, R., Wang, J., and Pan, W. Probabilistic recursive reasoning for multi-agent reinforcement learning. In International Conference on Learning Representations, 2019.
  • Yang et al. (2018) Yang, Y., Luo, R., Li, M., Zhou, M., Zhang, W., and Wang, J. Mean field multi-agent reinforcement learning. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 5571–5580, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR.