1 Introduction
In the study of multi-agent reinforcement learning (MARL), one fundamental assumption is that agents behave rationally (Hu & Wellman, 2003; Shapley, 1953). When interacting with others, perfect rationality means that an agent holds correct beliefs about other agents’ behaviors and chooses only the actions that maximize its utility function. In reality, however, agents’ rationality is often bounded by practical limitations (Simon, 1972), e.g. imperfect precedents or maintenance cost. Such bounded rationality can lead an agent to estimate the behaviors of others incorrectly, so that real behaviors deviate considerably from the theoretical outcome. In fact, in many real-world games, the assumption of perfect rationality cannot precisely predict agents’ behaviors (Moulin, 1986; Crawford et al., 2013; Gracia-Lázaro et al., 2017). It is therefore critical for MARL to acknowledge bounded rationality and to enhance model capacity for the cases in which agents are not perfectly rational.
Under the rationality assumption, game theorists believe that agents’ behaviors will end up in some kind of equilibrium state (Nash et al., 1950; Aumann, 1987). Following this, MARL research has focused on the Nash Equilibrium (NE) as the essential assumption in both the learning algorithm and the evaluation criterion (Hu & Wellman, 2003; Littman, 2001; Greenwald et al., 2003).
However, the justification for assuming that agents always play NE in multi-agent learning is untenable, and the Nash strategy does not always have prescriptive force. One most-cited psychological experiment that NE fails to describe is the Keynes Beauty Contest (Keynes, 1936). In a simpler version, all players are asked to pick one number from 0 to 100, and the rule stipulates that the player whose guess is closest to 1/2 of the average is the winner. The theory of NE suggests that the only rational choice for each player is 0, because all the players will reason as follows: “if all players guess randomly, the average of those guesses would be 50 (level 0); I therefore should guess no more than 25 (level 1); and then, if the other players think similarly to me, I should guess no more than 13 (level 2), …”. The level of recursion can keep developing until all players guess 0, which is the only NE in this game. This theoretical result is, however, inconsistent with the experimental finding that most human players choose numbers between 13 and 25 (Coricelli & Nagel, 2009). The underlying mechanism is that not all human players exhibit perfect rationality; they are bounded by the number of levels they are willing to reason recursively. An interesting finding is that even with a bounded level of reasoning capability, human players can still eventually reach NE after a few rounds of playing and learning in the game (Camerer, 2011).
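The level-by-level argument above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name `level_k_guesses`, the level-0 anchor of 50 (the mean of a uniform guess on the range 0 to 100), and the simplification that a player ignores its own effect on the average are all assumptions made here.

```python
def level_k_guesses(p, max_level, level0_guess=50.0):
    """Return the guesses of level-0 .. level-max_level reasoners.

    Assumption: a level-k player believes everyone else reasons at
    level k-1 and answers with p times that guess, ignoring its own
    small effect on the average.
    """
    guesses = [level0_guess]
    for _ in range(max_level):
        guesses.append(p * guesses[-1])
    return guesses


if __name__ == "__main__":
    # p = 0.5 is an illustrative choice of the target fraction.
    for level, guess in enumerate(level_k_guesses(p=0.5, max_level=5)):
        print(f"level {level}: guess {guess:.2f}")
```

For any fraction p below 1 the sequence shrinks geometrically towards 0, which is why unbounded recursion recovers the equilibrium while one or two levels of reasoning stay well above it.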
Inspired by the above finding in behavioral game theory, in this work we propose a novel learning protocol, Generalized Recursive Reasoning (GR2), that acknowledges bounded rationality and requires no assumption that agents must play the Nash strategy in all the stage games encountered. Following cognitive hierarchy theory (Camerer et al., 2004), GR2 develops a recursive model of reasoning by steps. In GR2, agent types are drawn from a hierarchy of reasoning capability with arbitrary thinking depth. It begins with level-0 (L0 for short) agents, who do not assume anything about the other agents. Level-1 thinkers believe all their opponents are at level 0; level-2 thinkers believe all opponents are at either level 0 or level 1; and they take these beliefs into consideration before making the best decision. The GR2 agents are all self-interested in the sense of best responding to their beliefs about others, i.e. level-$k$ agents take the best response to either level-$(k-1)$ thinkers or a mixture of agents with levels ranging from 0 to $k-1$.
In this paper, we integrate the generalized recursive reasoning (GR2) protocol into the framework of MARL, and develop the GR2 Soft Actor-Critic algorithm. Because it recognizes agents’ bounded rationality, the GR2 MARL algorithm is adept at capturing non-equilibrium behaviors when forming expectations about others, which ultimately helps convergence to the equilibrium. We prove that when the level of reasoning is large, GR2 methods can converge to at least one NE. In addition, if the lower-level agent plays the Nash strategy, then all higher-level agents will follow. We evaluate our methods against multiple strong MARL baselines on the Keynes Beauty Contest and matrix games. The results support our assumption that the GR2 method is able to incorporate the irrational behaviors of opponents, because of which the learning algorithm that GR2 induces converges to the equilibrium much faster than the others. In the high-dimensional particle games, GR2 methods present superior performance over baselines that do not have recursive reasoning modules.
2 Related Work
Assigning agents different levels of reasoning power, also known as the level-$k$ model, was first introduced by Stahl (1993), Stahl II & Wilson (1994), and Nagel (1995). In the level-$k$ model, agents anchor their beliefs through thought experiments with iterated best response in a chain style, i.e. L1 best responds to L0, L2 to L1, and so on. Agents’ levels are heterogeneous and are drawn from a common distribution. An agent is expected to exhibit a pattern of behavior whose complexity grows as its cognitive level increases (Gracia-Lázaro et al., 2017). Interestingly, across different games in the real world (Camerer et al., 2004; Benndorf et al., 2017), human beings tend to reason with one to two levels of recursion. This explains why people commonly guess between 13 and 25 in the Keynes Beauty Contest (see the previous section).
However, level-$k$ models are shown to exhibit extreme stereotype bias because they concentrate on only one level below (Chong et al., 2016). Camerer et al. (2004) extended the level-$k$ model by letting the level-$k$ agent best respond not only to level $k-1$ but to a mixture of lower types of agents as well, and named it the cognitive hierarchy model (CHM). The mixture of lower hierarchy is usually approximated by a discrete probability distribution, with its parameter estimated from empirical experimental data. CHM presents distinct advantages over the level-$k$ model in making robust predictions in a series of games (Crawford et al., 2013). Our method, GR2, is inspired by CHM. We bring the idea of mixing the cognitive hierarchy into the deep reinforcement learning (RL) framework, and develop algorithms that can both incorporate opponents’ non-equilibrium behaviors in the games and converge to the NE faster. This differs from prior work (Hu & Wellman, 2003; Yang et al., 2018; Wen et al., 2019) in which convergence to NE requires agents to solve for the NE at each stage of the game.
GR2, which models the opponents in a recursive manner, i.e. “I believe how you would believe about how I believe”, can be regarded as a special type of opponent modeling (Albrecht & Stone, 2018). Recursive reasoning was first introduced by game theorists (Harsanyi, 1962, 1967). The study of Theory of Mind (ToM) (Goldman et al., 2012; Rabinowitz et al., 2018) introduced uncertainty in the agent’s belief about opponents’ behaviors, their payoffs, and the recursion depths. Gmytrasiewicz & Doshi (2005) further implemented the idea of ToM in the RL setting with the I-POMDP, which extends the POMDP (Sondik, 1971) by an extra dimension that models the opponent state. Despite the added flexibility, I-POMDP has limitations in its solvability (Seuken & Zilberstein, 2008). Haarnoja et al. (2017) enable a hierarchical structure for the agent policy, but only for the single-agent case, and the approach needs layer-wise pretraining.
By contrast, we embed GR2 into the RL framework via probabilistic inference. Levine (2018) built the connection between inferring the optimal policy in RL and probabilistic inference on graphical models. Within this framework, Haarnoja et al. (2017, 2018) designed the soft Q-learning and soft actor-critic methods. Grau-Moya et al. (2018) and Wei et al. (2018) further generalized soft Q-learning to the multi-agent scenario; however, these methods optimize towards the optimal Q-function of the joint actions, which implies that the opponents will always collaborate with the central agent to reach its best equilibrium. Forming such incorrect beliefs about the opponents is detrimental in many real-world applications (Albrecht & Stone, 2018).
Our GR2 method is most closely related to the probabilistic recursive reasoning (PR2) method (Wen et al., 2019), which also implements the idea of conducting recursive reasoning for MARL tasks. However, PR2 only considers the case of level-1 reasoning. Here we extend the recursion depth to an arbitrary number and design the learning protocol so that agents can make the best response to a mixture of agents at different reasoning levels. We believe recursive reasoning with higher-order beliefs can better incorporate non-equilibrium behavior during the game, especially where the traditional theory of NE cannot explain the observed behavior (e.g. the Keynes Beauty Contest). We also show that GR2 has a better convergence property.
3 Preliminaries
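Our GR2 method is built upon the framework of the $n$-agent stochastic game (Shapley, 1953), denoted by $(\mathcal{S}, \mathcal{A}^1, \dots, \mathcal{A}^n, r^1, \dots, r^n, p, \gamma)$, where $\mathcal{S}$ represents the state space, $p$ is the transition probability of the environment, $\gamma$ is the discount factor for future rewards, and $\mathcal{A}^i$ and $r^i$ are the action space and the reward function for agent $i$ respectively. Agent $i$ chooses its action $a^i \in \mathcal{A}^i$ based on its policy $\pi^i_{\theta^i}(a^i \mid s)$, which is parameterized by $\theta^i$, and $a^{-i}$ represents its opponents’ behaviors. Each agent tries to find the optimal policy that maximizes its expected cumulative reward $\mathbb{E}\big[\sum_{t} \gamma^{t} r^i(s_t, a^i_t, a^{-i}_t)\big]$, with $(a^i_t, a^{-i}_t)$ sampled from the joint policy $\pi_\theta$. The state-action function ($Q$-function) for each agent follows the Bellman equation $Q^i(s_t, a^i_t, a^{-i}_t) = r^i(s_t, a^i_t, a^{-i}_t) + \gamma\, \mathbb{E}\big[Q^i(s_{t+1}, a^i_{t+1}, a^{-i}_{t+1})\big]$. We define the trajectory by $\tau = [(s_1, a^i_1, a^{-i}_1), \dots, (s_T, a^i_T, a^{-i}_T)]$, and $\mathcal{D}^i$ to be the replay buffer that stores those trajectories for agent $i$.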
3.1 Nash Equilibrium
Agents are self-interested and aim to learn only the optimal policy in the game. The variable represents how optimal agent $i$ is at time $t$; it is proportional to the reward (Levine, 2018). In other words, can describe how likely the trajectory will be observed, e.g. if agents were egocentric, altruistic actions are less likely to be observed. The optimal action for agent $i$ is then stated as .
The best response to the opponent policy $\pi^{-i}$ is the policy $\pi^i_*$ such that agent $i$’s expected cumulative reward under $(\pi^i_*, \pi^{-i})$ is at least as large as under $(\pi^i, \pi^{-i})$ for all valid $\pi^i$. When all the agents act in their best response, $(\pi^i_*, \pi^{-i}_*)$ forms a Nash Equilibrium (Nash et al., 1950).
3.2 Multi-Agent Soft Learning
Soft Q-learning, also called maximum entropy RL, is the outcome of applying variational inference to solve for the optimal policy on the graphical model of RL (Levine, 2018). Compared to standard Q-learning, it has one additional entropy term in the objective. When soft Q-learning is adapted to the multi-agent case, the entropy term is instead applied to the joint policy (Grau-Moya et al., 2018). The soft value function therefore reads as follows:
(1) 
For policy evaluation, a modified Bellman operator can be applied. Compared to the $\max$ operator in the Bellman equation for the case without the entropy term, it is soft because . Correspondingly, the independent policy can be learned by minimizing , which leads to the Soft RL (Haarnoja et al., 2017, 2018).
However, policy improvement becomes tricky in multi-agent soft Q-learning because the Q-function guides the improvement direction for the joint policy rather than for each individual agent. One naive approach is to discard the opponents’ actions sampled from the joint policy and take the marginalized output (Wei et al., 2018).
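The sense in which the operator is "soft" can be illustrated with the log-sum-exp form that is standard in maximum entropy RL. The sketch below is a generic illustration over a flat list of Q-values (for example, one entry per joint action); the names `soft_value` and `soft_policy` and the explicit `temperature` argument are assumptions made here, not notation from the paper.

```python
import math


def soft_value(q_values, temperature=1.0):
    """Soft value: temperature * log(sum_a exp(Q(a) / temperature))."""
    m = max(q_values)
    # Subtract the maximum before exponentiating for numerical stability.
    total = sum(math.exp((q - m) / temperature) for q in q_values)
    return m + temperature * math.log(total)


def soft_policy(q_values, temperature=1.0):
    """Boltzmann policy induced by the soft value: exp((Q(a) - V) / temperature)."""
    v = soft_value(q_values, temperature)
    return [math.exp((q - v) / temperature) for q in q_values]


if __name__ == "__main__":
    q = [1.0, 2.0, 3.0]
    for t in (1.0, 0.1, 0.01):
        print(t, soft_value(q, t), soft_policy(q, t))
```

The soft value is never below the hard maximum and approaches it as the temperature goes to zero, while the induced policy keeps some probability on every action, which is the exploration benefit of the entropy term.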
3.3 Probabilistic Recursive Reasoning
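PR2 (Wen et al., 2019) adopted the so-called correlated factorization of the joint policy, i.e. $\pi_\theta(a^i, a^{-i} \mid s) = \pi^i_{\theta^i}(a^i \mid s)\, \rho^{-i}_{\theta^{-i}}(a^{-i} \mid s, a^i)$, where $\rho^{-i}_{\theta^{-i}}(a^{-i} \mid s, a^i)$ represents the level-1 reasoning process of the opponent’s consideration of the action from agent $i$. PR2 approximates the actual opponent conditional policy $\rho^{-i}_{\theta^{-i}}$ via a best-fit model $\rho^{-i}_{\phi^{-i}}$ from a family of distributions parameterized by $\phi^{-i}$. Through variational inference, the optimal opponent conditional policy in PR2 Q-learning is described to be
(2) $\rho^{-i}_{\phi^{-i}}(a^{-i} \mid s, a^i) \propto \exp\big(Q^i(s, a^i, a^{-i}) - Q^i(s, a^i)\big)$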
4 Generalized Recursive Reasoning
Our goal is to bring more general and flexible recursive reasoning into MARL. In this section, we first introduce the protocol of level-$k$ recursive reasoning (GR2-L), and then propose a mixture model that takes into account agents with different hierarchical levels of policies (GR2-M).
4.1 Level-$k$ Recursive Reasoning
We start by constructing a level-$k$ recursive reasoning process, naming it GR2-L. A level-$k$ policy $\pi^i_k$ for agent $i$ is the best response to the opponents’ policy at level $k-1$, $\pi^{-i}_{k-1}$ (see Fig. 1). We assume $\pi^{-i}_{k-1}$ is not directly observable, and agents have to approximate it by $\rho^{-i}_{k-1}$. The models on $\pi^i_k$ and $\rho^{-i}_{k-1}$ can be regarded as latent variable models, with the marginal policies , serving as the prior, and the conditional policies , as the posterior. Actions can be sampled in a sequential manner. The level-$k$ policy is therefore constructed by integrating over best responses from all lower-level policies,
(3) 
From agent $i$’s perspective, it believes that the opponents will take the best response to its own (fictitious) action that is two levels below: , even though the beliefs, , that agent $i$ holds about its opponents might not yet be accurate. can be further expanded recursively until reaching the level-0 policy, which is assumed to be uniformly distributed. For brevity, we mostly omit the level subscript $k$ in actions from now on, while acknowledging that the actions in the lower levels are imagined (fictitiously taken). Considering computational feasibility, we assume that each agent adopts the same functional form for its policy across the reasoning process at different depths; as shown in Fig. 1, we maintain , and .
4.2 Mixture of Hierarchy Recursive Reasoning
While the level-$k$ policy assumes that agents best respond to level-$(k-1)$ opponents, we can further generalize the model to let an agent best respond to a mixture of agents that are distributed over the lower hierarchies. We name it GR2-M.
We assume that the level-$k$ agents have an accurate guess about the relative proportion of agents who are doing less thinking than them. This specification implies that even though the agent may maintain an inaccurate belief about the probability distribution of the agents’ recursive depths, as $k$ increases, the deviation between the agent’s belief and the actual distribution profile will shrink. This further suggests that when $k$ is very large, there is no benefit for level-$k$ thinkers to reason even harder, as they will have the same beliefs and subsequently make the same decisions as agents at level $k-1$. Since more steps of computation are required as $k$ grows, it is reasonable to assume that with increasing $k$, fewer agents are willing to reason beyond level $k$. Such an assumption can also be justified by the limited amount of working memory available for strategic thinking in human beings (Devetag & Warglien, 2003).
In order to meet this assumption, we model the proportion of recursive depth by the Poisson distribution $f(k) = e^{-\lambda} \lambda^{k} / k!$, where the mean, also the variance, of the distribution is $\lambda$. A nice property of the Poisson distribution is that $f(k)/f(k-1) = \lambda/k$ is inversely proportional to $k$, which satisfies our previous assumption. Interestingly, a study on humans suggests that on average people tend to act with $\lambda \approx 1.5$ (Camerer et al., 2004).
With the mixture of hierarchy modeled by the Poisson distribution, we can now mix all levels’ thinking into one perception, and build each agent’s belief about its opponents by
(4) 
where $Z$ is a normalization term. In the reinforcement learning framework to be shown later, $\lambda$ can be set as a hyperparameter, similar to TD-$\lambda$ (Tesauro, 1995). Note that GR2-L is in fact a special case of GR2-M. When the mixture of depth is Poisson distributed, we have $f(k)/f(k-1) = \lambda/k$; the model will therefore put heavy weight on level $k$ if $\lambda \gg k$; that is to say, a level-$k$ agent will act as if all the opponents are reasoning at level $k-1$.
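The Poisson mixture described above can be sketched numerically. The following is a minimal illustration under stated assumptions: policies are represented as plain lists of action probabilities, the weights are the Poisson probabilities truncated at the highest level and renormalized, and the function names `poisson_level_weights` and `mix_policies` are hypothetical.

```python
import math


def poisson_level_weights(lam, k):
    """Normalized Poisson weights over reasoning levels 0..k."""
    raw = [math.exp(-lam) * lam ** n / math.factorial(n) for n in range(k + 1)]
    z = sum(raw)  # normalization term for the truncated distribution
    return [w / z for w in raw]


def mix_policies(level_policies, lam):
    """Mix per-level action distributions with Poisson weights.

    level_policies[n] is the action distribution of the level-n policy.
    """
    k = len(level_policies) - 1
    weights = poisson_level_weights(lam, k)
    num_actions = len(level_policies[0])
    return [
        sum(w * policy[a] for w, policy in zip(weights, level_policies))
        for a in range(num_actions)
    ]


if __name__ == "__main__":
    print(poisson_level_weights(lam=1.5, k=3))
    print(poisson_level_weights(lam=50.0, k=3))  # mass concentrates on the top level
```

With a moderate rate the lower levels keep substantial weight, whereas a rate much larger than the maximum level pushes almost all weight to the highest level, which is the sense in which the single-level model is a limiting case of the mixture.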
4.3 Nash Convergence when $k \to \infty$
Even though GR2-M does not intentionally move towards the Nash Equilibrium as a target, we present the main theorem that in the extreme case when the level $k$ goes to infinity, GR2-M will reach at least one NE. We start with the following proposition.
Proposition 1.
In both the GR2-L and GR2-M models, if the agents play pure strategies, once a level-$k$ agent reaches a Nash Equilibrium, all higher-level agents will follow it too.
Proof: See Appendix A.
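The intuition behind Proposition 1 can be illustrated on a toy example. The sketch below is not the proof in Appendix A; it assumes a symmetric two-player game with pure strategies, a chain in which level $k$ best responds to the level-$(k-1)$ action, and hypothetical Prisoner's Dilemma payoffs chosen only for illustration.

```python
def best_response(payoff, opponent_action):
    """Index of the own action that maximizes payoff against a fixed opponent action."""
    return max(range(len(payoff)), key=lambda a: payoff[a][opponent_action])


def level_k_actions(payoff, level0_action, max_level):
    """Pure-strategy chain: the level-k action best responds to the level-(k-1) action."""
    actions = [level0_action]
    for _ in range(max_level):
        actions.append(best_response(payoff, actions[-1]))
    return actions


# Hypothetical payoffs for the row player (not taken from the paper).
# Action 0 = cooperate, action 1 = defect; payoff[a][b] is the row payoff.
PD = [[3, 0],
      [5, 1]]

if __name__ == "__main__":
    print(level_k_actions(PD, level0_action=0, max_level=4))
```

Once two consecutive levels choose the same action, that action is a best response to itself, so the profile is a symmetric equilibrium and every higher level repeats it.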
In GR2, since high-order thinkers always conduct at least the same amount of computation as lower-order thinkers, Prop. 1 gives the following corollary.
Corollary 1.
In both GR2-L and GR2-M, higher-order strategies weakly dominate lower-order strategies (see Fig. 3).
Theorem 1.
In the GR2-M model, if the agents play pure strategies, then as $k \to \infty$ the prediction of GR2-M will reach at least one Nash Equilibrium (if it exists).
Proof: See Appendix B.
The property of converging to NE enables GR2-M not only to model non-equilibrium behaviors in the games, but also to induce multi-agent learning algorithms that can potentially be more efficient.
5 Multi-Agent GR2 Reinforcement Learning
In this section, we incorporate the GR2 models into MARL, in particular the framework of soft learning in Sec. 3.2, and propose the GR2 Soft Actor-Critic algorithm.
5.1 GR2 Soft Actor-Critic Algorithm
To evaluate the level-$k$ policy, since agent $i$ cannot access the joint soft $Q$-function, which depends on the exact opponent policies that are unknown, we instead compute the independent soft $Q$-function by marginalizing the joint $Q$-function via the estimated opponent model:
(5) 
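Marginalizing a joint $Q$-function through an opponent model can be sketched for discrete actions as follows. The soft (log-expectation-exp) form used here is an assumption borrowed from the maximum entropy literature and is not claimed to be the exact form of Eq. 5; the plain expectation is included for comparison, and all names are hypothetical.

```python
import math


def soft_marginal_q(joint_q_row, opponent_probs):
    """log(sum_b rho(b) * exp(Q(a, b))) for one fixed own action a.

    joint_q_row[b] is the joint Q-value when the opponent plays b, and
    opponent_probs[b] is the opponent model's probability of b.
    """
    m = max(joint_q_row)
    # Shift by the maximum for numerical stability before exponentiating.
    total = sum(p * math.exp(q - m) for p, q in zip(opponent_probs, joint_q_row))
    return m + math.log(total)


def expected_q(joint_q_row, opponent_probs):
    """Plain expectation sum_b rho(b) * Q(a, b), shown for comparison."""
    return sum(p * q for p, q in zip(opponent_probs, joint_q_row))


if __name__ == "__main__":
    row = [1.0, 2.0]
    rho = [0.5, 0.5]
    print(soft_marginal_q(row, rho), expected_q(row, rho))
```

By Jensen's inequality the soft marginal is never below the plain expectation, and the two coincide when the opponent model is deterministic.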
With the marginalized $Q$-function, combined with the soft value function defined in Eq. 1, we can obtain the value function of the level-$k$ policy by
(6) 
Each agent rolls out the recursive reasoning to level $k$, either through GR2-L or through GR2-M, and then samples the best response. With the value function in Eq. 6, we can train the parameter of the joint soft $Q$-function by minimizing the soft Bellman residual,
(7)  
where is next state, and
(8) 
The level-$k$ policy parameter can be learned by the following objective, which is in fact equivalent to minimizing the KL divergence between and (Haarnoja et al., 2017):
(9) 
In the multi-agent soft learning setting, the optimal opponent model still follows Eq. 2. We can therefore update the opponent model by minimizing the KL divergence between the current estimated policy and the advantage function:
(10) 
In practice, we can maintain two approximated $Q$-functions, the joint one and the marginalized one, separately, and then iteratively update the opponent model via SVGD (Haarnoja et al., 2017; Liu & Wang, 2016).
We summarize the whole algorithm in Algo. 1.
5.2 Inter-level Policy Improvement
As pointed out by Corollary 1, a higher-level policy should respond at least as well as a lower-level policy to the opponents’ actions. In fact, the highest-level policy directly receives the feedback reward from the environment and then back-propagates it to the lower-level policies. The intermediate-level policies then receive the reward signal through the chain. As shown in Fig. 3, we shall have for . In order to maintain this property, we introduce an auxiliary objective during training, that is, for ,
(11) 
As shown in a later section, this auxiliary objective plays a critical role in helping convergence.
5.3 Best Response as Deterministic Strategy
Considering computational feasibility, we formulate the best response in the form of a deterministic strategy. Fig. 2 shows an example of the possible combinations of paths at the $k$-th level of reasoning. As the reasoning process involves iterated use of and , should they be stochastic policies, the cost of integrating over all possible lower-level actions would be unsustainable as $k$ increases. Besides, because the reasoning process is also affected by the stochasticity of the environment, the variance would be further amplified with stochastic policies.
Our approach to modeling the deterministic strategy is based on a bijective transformation, which Dinh et al. (2016) first applied to deep generative models. Let be the random variable; by employing the change-of-variable rule, we have
(12) 
The computation of the determinant can be further simplified by choosing special bijective transformations that have a friendly Jacobian matrix. For example, in the GR2-L model, level 0 is usually assumed to be uniformly distributed, and we have:
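The change-of-variable rule used in this subsection can be checked on a one-dimensional toy case. The sketch below assumes a uniform base density on $[-1, 1]$ and an affine bijection $y = \text{scale} \cdot x + \text{shift}$, for which the Jacobian determinant of the inverse map is simply $1/|\text{scale}|$; the function names are hypothetical and the setup is illustrative rather than the paper's parameterization.

```python
def uniform_pdf(x, low=-1.0, high=1.0):
    """Density of a uniform random variable on [low, high]."""
    return 1.0 / (high - low) if low <= x <= high else 0.0


def affine_pushforward_pdf(y, scale, shift):
    """Density of y = scale * x + shift when x is uniform on [-1, 1].

    Change of variables: p_y(y) = p_x(f^{-1}(y)) * |d f^{-1} / d y|.
    """
    if scale == 0:
        raise ValueError("scale must be non-zero for the map to be a bijection")
    x = (y - shift) / scale              # inverse of the bijection
    return uniform_pdf(x) * abs(1.0 / scale)


if __name__ == "__main__":
    # Stretching the support by a factor of 2 halves the density.
    print(affine_pushforward_pdf(1.0, scale=2.0, shift=1.0))
```

Because the density stays constant on the transformed interval, the total mass is the density times the interval width, which remains 1 as required.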
6 Experiments
Recursive depth |  | Level 3 | Level 2 | Level 1 | Level 1 | Level 0 | Level 0 | Level 0 | Level 0
Equilibrium | Nash | GR2-L3 | GR2-L2 | GR2-L1 | PR2 | MADDPG | DDPG-OM | MASQL | DDPG
We compare the proposed GR2-L/GR2-M Soft Actor-Critic algorithm with a series of baselines including PR2 (Wen et al., 2019), MASQL (Wei et al., 2018; Grau-Moya et al., 2018), MADDPG (Lowe et al., 2017), and an independent learner via DDPG (Lillicrap et al., 2015). To compare against traditional opponent modeling methods, similar to (Rabinowitz et al., 2018; He et al., 2016), we implement an additional baseline, called DDPG-OM, which is DDPG with an opponent module trained online with supervision in order to capture the latest opponent behaviors.
The Recursive Level. We regard DDPG, DDPG-OM, MASQL, and MADDPG as level-0 reasoning models because, at the policy level, they do not explicitly model the impact of one agent’s action on the other agents or consider the reactions of the other agents. Even though the value function of the joint policy is learned in MASQL and MADDPG, they conduct a non-correlated factorization (Wen et al., 2019) when it comes to each individual agent’s policy. PR2 is in fact a level-1 reasoning model, but note that the level-1 model in GR2 stands for , while the level-1 model in PR2 starts from the opponent’s angle, that is .
Experimental Settings. We describe as the highest order of reasoning in GR2-L and GR2-M. All the policies and $Q$-functions are parameterized by an MLP with hidden layers, where each has units with the ReLU activation. In the Keynes Beauty Contest, we train all the methods, including the baselines, for iterations with steps per iteration. In the matrix game, we train the agents for iterations with steps per iteration. For the particle games, all the models are trained up to steps with maximum episode length. In the actor-critic algorithms, we set the exploration noise to in the first steps. The annealing parameters in the soft algorithms are set to to balance exploration and exploitation. We adopt and set , and leave the more detailed ablation study on the hyperparameter settings of and to Appendix C.
6.1 Keynes Beauty Contest
In the Keynes Beauty Contest game, all agents pick a number between 0 and 100. The winner is the agent whose guess is closest to $p$ times the average number. The reward is set as the absolute difference to the winning guess. We evaluate the effect of reward shaping in Appendix D.
We first conduct the experiments with different values of $p$ and $n$. Table 1 presents the results. We notice that the GR2-L algorithms can effectively approach the equilibrium, while the other baselines struggle to reach it. We believe this is because the level-0 methods are unable to model the explicit dependency between agents’ actions, so the synergy of guessing smaller or larger together does not happen. In addition, as the maximum level of recursive reasoning grows, the equilibrium that GR2-L finds gets increasingly closer to the true NE, which is expected. In the case of , the equilibrium is first reached by GR2-L, and the other higher-order models follow the same equilibrium afterwards; this is in line with Proposition 1. The equilibrium found by GR2-L3 is when ; the error is due to the saturated reward design, which stops agents from changing their actions any further.
Furthermore, we analyze the effectiveness of the auxiliary loss in Eq. 11 (Sec. 5.2), which aims to guarantee inter-level policy improvement. In Fig. 4, we find that although the GR2-L model without the auxiliary loss learns fast in the beginning, it fails to reach a better equilibrium. We believe adding the auxiliary loss to the objective helps the joint $Q$-function guide agents’ reasoning in a better direction.
6.2 Learning Matrix Games
We then evaluate on two matrix games: Prisoner’s Dilemma (PD) and Stag Hunt (SH). The reward matrix of PD is . Agents can choose to cooperate (C) or defect (D). Defection is the dominant strategy for both agents, while there exists a Pareto optimum (C, C). In SH, the reward matrix is . Agents can either cooperate to hunt the large stag (S) or chase a small rabbit (P) alone. In this game, there is no dominant strategy, and there are two NEs, i.e. (S, S), which is also Pareto-optimal, and the deficient equilibrium (P, P). To make the matrix games compatible with the actor-critic setting, we let the actor output the probability of choosing one action and define the reward based on the payoff matrix. We define the state at time $t$ to be .
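The equilibrium structure described for the two games can be verified by brute force. The sketch below enumerates pure-strategy Nash equilibria of a two-player game; the payoff values are hypothetical stand-ins with the qualitative structure described above (they are not the paper's reward matrices), and the function name `pure_nash_equilibria` is an assumption.

```python
def pure_nash_equilibria(row_payoff, col_payoff):
    """All pure profiles (a, b) where neither player gains by deviating alone."""
    n_rows, n_cols = len(row_payoff), len(row_payoff[0])
    equilibria = []
    for a in range(n_rows):
        for b in range(n_cols):
            row_ok = all(row_payoff[a][b] >= row_payoff[a2][b] for a2 in range(n_rows))
            col_ok = all(col_payoff[a][b] >= col_payoff[a][b2] for b2 in range(n_cols))
            if row_ok and col_ok:
                equilibria.append((a, b))
    return equilibria


# Hypothetical payoffs, indexed [row action][column action].
# Prisoner's Dilemma: 0 = cooperate (C), 1 = defect (D).
PD_ROW = [[3, 0], [5, 1]]
PD_COL = [[3, 5], [0, 1]]
# Stag Hunt: 0 = stag (S), 1 = rabbit (P).
SH_ROW = [[4, 0], [3, 2]]
SH_COL = [[4, 3], [0, 2]]

if __name__ == "__main__":
    print("PD:", pure_nash_equilibria(PD_ROW, PD_COL))
    print("SH:", pure_nash_equilibria(SH_ROW, SH_COL))
```

With these stand-in payoffs, mutual defection is the only pure equilibrium of PD even though mutual cooperation pays both players more, while SH has two pure equilibria, one of which is better for both players.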
In SH, we expect the agents to reach the Pareto optimum, so that both agents receive the maximum reward. Fig. 4(a) compares the average training reward for a list of algorithms. GR2-L/GR2-M models, together with PR2, can easily reach the Pareto optimum with maximum average return , whereas the other models are trapped in the deficient equilibrium with average return . In terms of convergence speed, higher-order thinkers are faster than lower-order thinkers, and mixture models are faster than level-$k$ models. SH is a coordination game that motivates players to cooperate, but the challenging part is the conflict between individual safety and social welfare. Without knowing the choice of the other, GR2 has to anchor the belief that opponents would like to choose social welfare, and then reinforce this belief through high-order reasoning to finally build trust among agents. Since there is no dominant strategy in SH, agents will in the end choose to stay fixed in that strategy. The level-0 methods cannot build such synergy because they cannot discriminate individual safety from social welfare; both NEs can saturate the value function of the joint action.
PD is different from SH in that there is a dominant strategy of defection, which makes the Pareto optimum unstable. In Fig. 4(b), although GR2-L3 can reach the Pareto optimum within the first steps, it soon chooses to defect. Since defection is dominant, a low-level thinker will surely defect; such reasoning is passed on to the higher-level thinkers, and GR2-L3 soon realizes this and best responds with defection. GR2-M3 underperforms GR2-L3 because it is a mixture strategy, which still considers the lower-level thinker who still pursues cooperation. The rest of the baselines do not present such behavioral changes at all, and they converge to a worse equilibrium.
6.3 Particle World Environments
We further adopt the Particle World Environments (Lowe et al., 2017) to test our method in scenarios with high-dimensional state and action spaces. They include four testing scenarios: 1) Cooperative Navigation with agents and landmarks. Agents are collectively rewarded based on the proximity of any agent to each landmark while avoiding collisions; 2) Physical Deception with adversary, good agents, and landmarks. All agents observe the positions of landmarks and other agents. Only one landmark is the true target landmark. Good agents are rewarded based on how close any of them is to the target landmark, and on how well they deceive the adversary; 3) Keep-away with agent, adversary, and landmark. The agent is rewarded based on its distance to the landmark. The adversary is rewarded if it pushes the agent away from the landmark; 4) Predator-prey with prey agent, who moves faster and tries to run away from adversary predators, who move slower but are motivated to catch the prey cooperatively. Task 1 is fully cooperative, while tasks 2–4 are cooperative-competitive.
We measure the performance of agents trained by different algorithms in each of the environments. The results are shown in Figs. 6 & 7; we report the rewards of the good agents and normalize them to for tasks . We notice that in all the scenarios, the GR2-M methods achieve the highest score, and higher-level models generally outperform lower-level models. In the cooperative game, the GR2-L/M methods have critical advantages over the traditional baselines; this is in line with the finding in SH that GR2 methods are good at finding the Pareto optimum.
7 Conclusion
In this paper, following the study of bounded rationality (Simon, 1972) and behavioral game theory (Camerer, 2003), we introduce the protocol of generalized recursive reasoning (GR2), which allows agents to have different levels of recursive reasoning capability when dealing with other agents. In GR2, each agent takes the best response to a mixed type of agents that think and behave at lower levels. We integrate GR2 into the MARL framework. The induced GR2 Soft Actor-Critic algorithm shows superior performance in building correct beliefs about the opponents’ behaviors even when they are non-equilibrium. We prove in theory that GR2 can reach NE when the level approaches infinity. Empirically, when compared to conventional opponent models (level 0), learning with higher-level recursive reasoning capability leads to faster convergence.
References
 Albrecht & Stone (2018) Albrecht, S. V. and Stone, P. Autonomous agents modelling other agents: A comprehensive survey and open problems. Artificial Intelligence, 258:66–95, 2018.
 Aumann (1987) Aumann, R. J. Correlated equilibrium as an expression of bayesian rationality. Econometrica: Journal of the Econometric Society, pp. 1–18, 1987.
 Benndorf et al. (2017) Benndorf, V., Kübler, D., and Normann, H.-T. Depth of reasoning and information revelation: An experiment on the distribution of k-levels. International Game Theory Review, 19(04):1750021, 2017.
 Camerer (2003) Camerer, C. F. Behavioural studies of strategic thinking in games. Trends in cognitive sciences, 7(5):225–231, 2003.
 Camerer (2011) Camerer, C. F. Behavioral game theory: Experiments in strategic interaction. Princeton University Press, 2011.
 Camerer et al. (2004) Camerer, C. F., Ho, T.-H., and Chong, J.-K. A cognitive hierarchy model of games. The Quarterly Journal of Economics, 119(3):861–898, 2004.
 Chong et al. (2016) Chong, J.-K., Ho, T.-H., and Camerer, C. A generalized cognitive hierarchy model of games. Games and Economic Behavior, 99:257–274, 2016.
 Coricelli & Nagel (2009) Coricelli, G. and Nagel, R. Neural correlates of depth of strategic reasoning in medial prefrontal cortex. Proceedings of the National Academy of Sciences, 106(23):9163–9168, 2009.
 Crawford et al. (2013) Crawford, V. P., Costa-Gomes, M. A., and Iriberri, N. Structural models of nonequilibrium strategic thinking: Theory, evidence, and applications. Journal of Economic Literature, 51(1):5–62, 2013.
 Devetag & Warglien (2003) Devetag, G. and Warglien, M. Games and phone numbers: Do short-term memory bounds affect strategic behavior? Journal of Economic Psychology, 24(2):189–202, 2003.
 Dinh et al. (2016) Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.
 Gmytrasiewicz & Doshi (2005) Gmytrasiewicz, P. J. and Doshi, P. A framework for sequential planning in multi-agent settings. Journal of Artificial Intelligence Research, 24:49–79, 2005.
 Goldman et al. (2012) Goldman, A. I. et al. Theory of mind. The Oxford handbook of philosophy of cognitive science, pp. 402–424, 2012.
 Gracia-Lázaro et al. (2017) Gracia-Lázaro, C., Floría, L. M., and Moreno, Y. Cognitive hierarchy theory and two-person games. Games, 8(1):1, 2017.
 Grau-Moya et al. (2018) Grau-Moya, J., Leibfried, F., and Bou-Ammar, H. Balancing two-player stochastic games with soft Q-learning. IJCAI, 2018.
 Greenwald et al. (2003) Greenwald, A., Hall, K., and Serrano, R. Correlated Q-learning. In ICML, volume 3, pp. 242–249, 2003.
 Haarnoja et al. (2017) Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. Reinforcement learning with deep energy-based policies. arXiv preprint arXiv:1702.08165, 2017.
 Haarnoja et al. (2018) Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.
 Harsanyi (1962) Harsanyi, J. C. Bargaining in ignorance of the opponent’s utility function. Journal of Conflict Resolution, 6(1):29–38, 1962.
 Harsanyi (1967) Harsanyi, J. C. Games with incomplete information played by Bayesian players, I–III. Part I. The basic model. Management Science, 14(3):159–182, 1967.
 He et al. (2016) He, H., Boyd-Graber, J., Kwok, K., and Daumé III, H. Opponent modeling in deep reinforcement learning. In International Conference on Machine Learning, pp. 1804–1813, 2016.
 Hu & Wellman (2003) Hu, J. and Wellman, M. P. Nash Q-learning for general-sum stochastic games. Journal of Machine Learning Research, 4(Nov):1039–1069, 2003.
 Keynes (1936) Keynes, J. M. The General Theory of Employment, Interest and Money. Macmillan, 1936. 14th edition, 1973.
 Levine (2018) Levine, S. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909, 2018.
 Lillicrap et al. (2015) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
 Littman (2001) Littman, M. L. Friend-or-foe Q-learning in general-sum games. In ICML, volume 1, pp. 322–328, 2001.

 Liu & Wang (2016) Liu, Q. and Wang, D. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, pp. 2378–2386, 2016.
 Lowe et al. (2017) Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, O. P., and Mordatch, I. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, pp. 6379–6390, 2017.
 Moulin (1986) Moulin, H. Game theory for the social sciences. NYU Press, 1986.
 Nagel (1995) Nagel, R. Unraveling in guessing games: An experimental study. The American Economic Review, 85(5):1313–1326, 1995.
 Nash et al. (1950) Nash, J. F. et al. Equilibrium points in n-person games. Proceedings of the National Academy of Sciences, 36(1):48–49, 1950.
 Rabinowitz et al. (2018) Rabinowitz, N. C., Perbet, F., Song, H. F., Zhang, C., Eslami, S., and Botvinick, M. Machine theory of mind. arXiv preprint arXiv:1802.07740, 2018.
 Seuken & Zilberstein (2008) Seuken, S. and Zilberstein, S. Formal models and algorithms for decentralized decision making under uncertainty. Autonomous Agents and Multi-Agent Systems, 17(2):190–250, 2008.
 Shapley (1953) Shapley, L. S. Stochastic games. Proceedings of the National Academy of Sciences, 39(10):1095–1100, 1953.
 Simon (1972) Simon, H. A. Theories of bounded rationality. Decision and Organization, 1(1):161–176, 1972.
 Sondik (1971) Sondik, E. J. The optimal control of partially observable Markov processes. Technical report, Stanford University, Stanford Electronics Laboratories, 1971.
 Stahl (1993) Stahl, D. O. Evolution of smart_n players. Games and Economic Behavior, 5(4):604–617, 1993.
 Stahl II & Wilson (1994) Stahl II, D. O. and Wilson, P. W. Experimental evidence on players’ models of other players. Journal of Economic Behavior & Organization, 25(3):309–327, 1994.
 Tesauro (1995) Tesauro, G. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3):58–68, 1995.
 Wei et al. (2018) Wei, E., Wicke, D., Freelan, D., and Luke, S. Multiagent soft Q-learning. AAAI, 2018.
 Wen et al. (2019) Wen, Y., Yang, Y., Luo, R., Wang, J., and Pan, W. Probabilistic recursive reasoning for multiagent reinforcement learning. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rkl6As0cF7.
 Yang et al. (2018) Yang, Y., Luo, R., Li, M., Zhou, M., Zhang, W., and Wang, J. Mean field multi-agent reinforcement learning. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 5571–5580, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR.