1 Introduction
Casting decision making and optimal control as an inference problem has a long history, dating back to [Kalman1960], where Kalman smoothing is used to solve optimal control in linear dynamics with quadratic cost. Bayesian methods can capture the uncertainties regarding the transition probabilities, the reward functions of the environment, or other agents’ policies. This distributional information can be used to formulate a more structured exploration/exploitation strategy than those commonly used in classical RL, e.g. ε-greedy. A common approach in many works [Toussaint and Storkey2006, Rawlik et al.2013, Levine and Koltun2013, Abdolmaleki et al.2018] for framing RL as an inference problem is to introduce a binary random variable which represents “optimality”. In this way, RL problems lend themselves to powerful inference tools [Levine2018]. However, the Bayesian approach in a multiagent environment is less well studied.
In many single-agent works, maximizing entropy is part of a training agent’s objective for resolving ambiguities in inverse reinforcement learning [Ziebart et al.] and for improving the diversity [Florensa et al.2017], robustness [Fox et al.2015] and compositionality [Haarnoja et al.2018a] of the learned policy. In Bayesian RL, it often appears in the evidence lower bound (ELBO) for the log likelihood of optimality [Haarnoja et al.2017, Schulman et al.2017, Haarnoja et al.2018b], commonly known as the maximum entropy objective (MEO), which encourages the optimal policy to maximize both the expected return and the long-term entropy.
In MARL, there is more than one agent interacting with a stationary environment. In contrast with the single-agent environment, an agent’s reward depends not only on the current environment state and the agent’s own action but also on the actions of others. The existence of other agents increases the uncertainty in the environment. Therefore, the capability of reasoning about other agents’ beliefs, private information, behavior, strategy, and other characteristics is crucial. A reasoning model can be used in many different ways, but the most common case is where an agent utilizes its reasoning model to help its own decision making [Brown1951, Heinrich and Silver2016, He et al.2016, Raileanu et al.2018, Wen et al.2019, Tian et al.2018]. In this work, we use the word “opponent” when referring to another agent in the environment irrespective of the environment’s cooperative or adversarial nature.
In this work, we reformulate the MARL problem as Bayesian inference and derive a multiagent version of MEO, which we call the regularized opponent model with maximum entropy objective (ROMMEO). Optimizing this objective with respect to one agent’s opponent model gives rise to a new perspective on opponent modeling. We present two off-policy RL algorithms for optimizing ROMMEO in MARL. ROMMEO-Q is applied in the discrete action case and comes with a proof of convergence. For complex and continuous action environments, we propose ROMMEO Actor-Critic (ROMMEO-AC), which approximates the former procedure and extends it to continuous problems. We evaluate these two approaches on the matrix game and the differential game against strong baselines and show that our methods can outperform all the baselines in terms of overall performance and speed of convergence.
2 Method
2.1 Stochastic Games
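For an $n$-agent stochastic game [Shapley1953], we define a tuple $(\mathcal{S}, p_0, \gamma, \mathcal{A}^1, \dots, \mathcal{A}^n, R^1, \dots, R^n)$, where $\mathcal{S}$ denotes the state space, $p_0$ is the distribution of the initial state, $\gamma$ is the discount factor for future rewards, and $\mathcal{A}^i$ and $R^i = R^i(s, a^i, a^{-i})$ are the action space and the reward function for agent $i \in \{1, \dots, n\}$ respectively. Agent $i$ chooses its action $a^i \in \mathcal{A}^i$ according to the policy $\pi^i_{\theta^i}(a^i \mid s)$ parameterized by $\theta^i$, conditioning on some given state $s \in \mathcal{S}$. Let us define the joint policy as the collection of all agents’ policies $\pi_\theta$, with $\theta$ representing the joint parameter. It is convenient to interpret the joint policy from the perspective of agent $i$ such that $\pi_\theta = (\pi^i_{\theta^i}(a^i \mid s), \pi^{-i}_{\theta^{-i}}(a^{-i} \mid s))$, where $a^{-i} = (a^j)_{j \neq i}$, $\theta^{-i} = (\theta^j)_{j \neq i}$, and $\pi^{-i}_{\theta^{-i}}(a^{-i} \mid s)$ is a compact representation of the joint policy of all complementary agents of $i$. At each stage of the game, actions are taken simultaneously.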
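In fully cooperative games, different agents have the same reward function $R$. Therefore, each agent’s objective is to maximize the shared expected return:
(1) $\max_{\theta} \; \mathbb{E}\left[\sum_{t} \gamma^{t} R(s_t, a^i_t, a^{-i}_t)\right]$
with $(a^i_t, a^{-i}_t)$ sampled from $(\pi^i_{\theta^i}, \pi^{-i}_{\theta^{-i}})$.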
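As a concrete reading of the shared expected return, the following minimal sketch computes the discounted return of one sampled trajectory of shared rewards and averages it over trajectories. The function names and the convention that the first reward is weighted by $\gamma^0$ are assumptions made for illustration only.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one trajectory (t starts at 0 by assumption)."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


def estimate_expected_return(trajectories, gamma):
    """Monte Carlo estimate: average discounted return over sampled trajectories."""
    returns = [discounted_return(tr, gamma) for tr in trajectories]
    return sum(returns) / len(returns)
```

Because the game is fully cooperative, every agent would evaluate the same quantity on the same trajectory of joint actions.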
2.2 A Variational Lower Bound for Multi-Agent Reinforcement Learning Problems
We embed the control problem into a graphical model by introducing a binary random variable $O^i_t$ which serves as the indicator of “optimality” for each agent $i$ at each time step $t$. Recall that in the single-agent problem, the reward is bounded, but whether the maximum reward is achieved given the action is unknown. Therefore, in the single-agent case, $O_t$ indicates the optimality of achieving the bounded maximum reward. It can thus be regarded as a random variable and we have $P(O_t = 1 \mid s_t, a_t) \propto \exp(R(s_t, a_t))$. However, the definition of “optimality” in the multiagent case is subtly different from the one in the single-agent situation.
In cooperative multiagent reinforcement learning (CMARL), to define agent $i$’s optimality, we first introduce the definitions of the optimum and the optimal policy:
Definition 1.
In cooperative multiagent reinforcement learning, optimum is a strategy profile such that:
(2) 
where and Agent ’s optimal policy is .
We also define a best response function as:
Definition 2.
For agent $i$, the best response function to an arbitrary policy profile $\pi^{-i}$ of its opponents is the policy which produces the highest reward:
(3) 
where .
Therefore, we can easily see that:
Proposition 1.
For player let be its best response function. Given the optimal policy profile , then is the optimal policies of its opponent if and only if the best response to is agent ’s optimal policy:
(4) 
In CMARL, a single agent’s “optimality” cannot imply that it obtains the maximum reward, because the reward depends on the joint actions of all agents $(a^i, a^{-i})$. Therefore, we define $O^i_t = 1$ to indicate only that agent $i$’s policy at time step $t$ is optimal. For the posterior probability of agent $i$’s optimality given its action $a^i_t$ we have:
(5) 
We assume that, given the other players’ actions $a^{-i}_t$, the posterior probability of agent $i$’s optimality is proportional to its exponential reward:
(6) 
The conditional probabilities of “optimality” in the CMARL and single-agent cases have similar forms. However, it is worth mentioning that “optimality” in CMARL has a different interpretation from the one in the single-agent case.
For cooperative games, if all agents play optimally, then the agents receive the maximum reward, which is the optimum of the game. Therefore, given that the other agents are playing their optimal policies, the probability that agent $i$ also plays its optimal policy is the probability of obtaining the maximum reward from agent $i$’s perspective. We therefore define agent $i$’s objective as:
(7) 
The graphical model is summarized in Fig. 1. Without loss of generality, we derive the solution for agent $i$, but the same solution can be derived for the other agents. By using a variational distribution, we derive a lower bound on the likelihood of optimality of agent $i$:
(8)  
(9) 
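Written out in full, $\rho(a^{-i}_t \mid s_t, O^{-i}_t = 1)$ is agent $i$’s opponent model estimating the optimal policies of its opponents, $\pi(a^i_t \mid s_t, a^{-i}_t, O^i_t = 1, O^{-i}_t = 1)$ is agent $i$’s conditional policy at the optimum ($O^i_t = O^{-i}_t = 1$), and $P(a^{-i}_t \mid s_t, O^{-i}_t = 1)$ is the prior of the optimal policy of the opponents. In our work, we set the prior equal to the observed empirical distribution of opponents’ actions given states. As we are only interested in the case where $O^i_t = O^{-i}_t = 1$, we drop the optimality variables in $\pi$, $\rho$ and $P$ thereafter. Eq. 8 is a variational lower bound of the log likelihood of agent $i$’s optimality, and the derivation is deferred to Appendix B.1.
2.3 The Learning of Opponent Model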
We can further expand Eq. 8 into Eq. 9, and we find that it resembles the maximum entropy objective in single-agent reinforcement learning [Kappen2005, Todorov2007, Ziebart et al., Haarnoja et al.2017]. We denote agent $i$’s expectation of reward plus the entropy of the conditional policy $\pi$ as agent $i$’s maximum entropy objective (MEO). In the multiagent version, however, it is worth noting that optimizing the MEO will lead to the optimization of the opponent model $\rho$. This can be counterintuitive at first sight, as opponent behaviour models are normally trained with only past state-action data to predict opponents’ actions.
However, recall that $\rho$ models the opponents’ optimal policies in our work. By Prop. 1, we know that the estimated opponents’ policies are their optimal policies if and only if agent $i$’s best response to $\rho$ is agent $i$’s optimal policy:
(10) 
Then, probabilistically, we have:
(11) 
From Eq. 11, we can derive that:
(12) 
Proof.
See Appendix B.2. ∎
Therefore, agent $i$’s maximum entropy objective is a lower bound of the probability that its opponent model is the optimal policies of its opponents. With agent $i$’s policy fixed, optimizing this lower bound updates agent $i$’s opponent model $\rho$ in the direction of higher shared reward and a more stochastic conditional policy $\pi$, making it closer to the real optimal policies of the opponents.
Without any regularization, at each iteration, agent $i$ can freely learn a new opponent model $\rho$ which is the closest to the optimal opponent policies from its perspective given its current conditional policy $\pi$. Next, agent $i$ can optimize the lower bound with respect to $\pi$ given the updated $\rho$. This yields an EM-like iterative training procedure, which can be shown to monotonically increase the probability that the opponent model is the optimal policies of the opponents. Then, by acting optimally with respect to the converged opponent model, we can recover agent $i$’s optimal policy.
However, it is unrealistic to learn such an opponent model. As the opponents have no access to agent $i$’s conditional policy $\pi$, the learning of their policies can differ from the learning of agent $i$’s opponent model. The actual opponent policies can then be very different from agent $i$’s converged opponent model learned in the above way given agent $i$’s conditional policy $\pi$. Therefore, acting optimally with respect to an opponent model far from the real opponents’ policies can lead to poor performance.
The last term in Eq. 9 can prevent agent $i$ from building an unrealistic opponent model. The Kullback-Leibler (KL) divergence between the opponent model $\rho$ and a prior $P$ can act as a regularizer of $\rho$. By setting the prior to the empirical distribution of the opponents’ past behaviour, the KL divergence heavily penalizes $\rho$ if it deviates too much from the empirical distribution. As the objective in Eq. 9 can be seen as a maximum entropy objective for one agent’s policy and opponent model with regularization on the opponent model, we call this objective the Regularized Opponent Model with Maximum Entropy Objective (ROMMEO).
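To make the three ingredients of the objective concrete (expected reward, entropy of the conditional policy, and KL regularization of the opponent model), the sketch below evaluates them for a single stateless step with discrete actions. The function names, the array layout and the restriction to one step are assumptions for illustration; the `alpha` argument anticipates the entropy weight introduced in Sec. 3.1.

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a discrete distribution (natural log)."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log(p[nz])))

def kl_divergence(q, p):
    """KL(q || p) for discrete distributions; assumes p > 0 wherever q > 0."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    nz = q > 0
    return float(np.sum(q[nz] * (np.log(q[nz]) - np.log(p[nz]))))

def rommeo_step_value(reward, pi, rho, prior, alpha=1.0):
    """One-step value: E[R] + alpha * E_rho[H(pi(.|b))] - KL(rho || prior).

    reward[a, b]: shared reward for own action a and opponent action b.
    pi[a, b]:     conditional policy pi(a | b); each column sums to 1.
    rho[b]:       opponent model; prior[b]: empirical opponent distribution.
    """
    reward = np.asarray(reward, dtype=float)
    pi = np.asarray(pi, dtype=float)
    rho = np.asarray(rho, dtype=float)
    expected_reward = float(np.sum(rho[None, :] * pi * reward))
    expected_entropy = sum(rho[b] * entropy(pi[:, b]) for b in range(len(rho)))
    return expected_reward + alpha * expected_entropy - kl_divergence(rho, prior)
```

In this sketch, an opponent model that equals the prior pays no KL penalty, while one that concentrates on a single opponent action pays the log of the inverse prior probability of that action.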
3 Multi-Agent Soft Actor Critic
To optimize the ROMMEO in Eq. 9 derived in the previous section, we propose two reinforcement learning algorithms. We first introduce an exact tabular Q-iteration method with a proof of convergence. For practical implementation in complex continuous environments, we then propose the ROMMEO actor critic (ROMMEO-AC), which is an approximation to this procedure.
3.1 Regularized Opponent Model with Maximum Entropy Objective Q-Iteration
In this section, we derive a multiagent version of the Soft Q-iteration algorithm proposed in [Haarnoja et al.2017], and we name our algorithm ROMMEO-Q. The derivation follows a logic similar to [Haarnoja et al.2017], but the extension of Soft Q-learning to MARL is still nontrivial. From this section on, we slightly modify the objective in Eq. 9 by adding a weighting factor $\alpha$ for the entropy term; the original objective can be recovered by setting $\alpha = 1$.
We first define the multiagent soft Q-function and V-function respectively. Then we can show that the conditional policy and opponent model defined in Eq. 15 and 16 below are optimal solutions with respect to the objective defined in Eq. 9:
Theorem 1.
We define the soft stateaction value function of agent as
(13) 
and soft state value function as
(14) 
Then the optimal conditional policy and opponent model for Eq. 8 are
(15) 
and
(16) 
Proof.
See Appendix C.2. ∎
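The structure of the optimal solution can be checked numerically for a single state. Assuming the per-state objective is $\mathbb{E}_{\rho}\left[\mathbb{E}_{\pi}[Q] + \alpha H(\pi)\right] - D_{KL}(\rho \,\|\, P)$, maximizing over $\pi$ gives a softmax over the agent’s own actions, and maximizing over $\rho$ gives the prior reweighted by the exponentiated inner value. The sketch below implements this closed form; it is our own derivation under that assumption rather than a transcription of Eq. 13 to 16, and the names `soft_optimum` and `objective` are hypothetical.

```python
import numpy as np

def soft_optimum(q, prior, alpha=1.0):
    """Closed-form maximizers for a single state.

    q[a, b]:  soft Q-value for own action a and opponent action b.
    prior[b]: prior over opponent actions.
    Returns (pi, rho, v): pi[a, b] = pi(a | b), rho[b], and the soft value v.
    """
    q = np.asarray(q, dtype=float)
    prior = np.asarray(prior, dtype=float)
    # f[b] = alpha * log sum_a exp(q[a, b] / alpha), computed stably.
    m = q.max(axis=0)
    f = m + alpha * np.log(np.exp((q - m) / alpha).sum(axis=0))
    pi = np.exp((q - f) / alpha)       # softmax over own actions, per column
    w = prior * np.exp(f - f.max())
    rho = w / w.sum()                  # prior reweighted by exp(f)
    v = f.max() + np.log(w.sum())      # log sum_b prior[b] * exp(f[b])
    return pi, rho, v

def objective(q, pi, rho, prior, alpha=1.0):
    """E_rho E_pi [q - alpha * log pi] - KL(rho || prior); assumes full support."""
    inner = (pi * (q - alpha * np.log(pi))).sum(axis=0)
    return float((rho * (inner - np.log(rho) + np.log(prior))).sum())
```

Evaluating `objective` at the returned `pi` and `rho` reproduces `v`, and any other pair of distributions should score no higher, which is a quick sanity check of the closed form.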
Following from Theorem 1, we can find the optimal solution of Eq. 9 by first learning the soft multiagent Q-function and then recovering the optimal policy and opponent model by Equations 15 and 16. To learn the Q-function, we show that it satisfies a Bellman-like equation, which we call the multiagent soft Bellman equation:
Theorem 2.
We define the soft multiagent Bellman equation for the soft stateaction value function of agent as
(17) 
Proof.
See Appendix C.3. ∎
With the Bellman equation defined above, we can derive a solution to Eq. 17 with a fixed-point iteration, which we call ROMMEO Q-iteration (ROMMEO-Q). Additionally, we can show that it converges to the optimal $Q^*$ and $V^*$ under certain restrictions as stated in [Wen et al.2019]:
Theorem 3.
ROMMEO Qiteration. In a symmetric game with only one global optimum, i.e. , where is the optimal strategy profile. Let and be bounded and assume
and that exists. Then the fixedpoint iteration
(18) 
(19) 
converges to and respectively.
Proof.
See Appendix C.3. ∎
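A tabular version of such a fixed-point iteration can be sketched as follows, assuming a known reward table, a known transition table and a fixed prior over opponent actions. The soft value used here is the closed form obtained by maximizing reward plus $\alpha$-weighted entropy minus the KL term, so the sketch illustrates the iteration pattern rather than reproducing Eq. 18 and 19 exactly. All names and the array layout are assumptions; with the prior held fixed, the update is a contraction for $\gamma < 1$, which is a simpler setting than the one covered by Theorem 3.

```python
import numpy as np

def soft_value(q, prior, alpha):
    """V[s] = log sum_b prior[s, b] * (sum_a exp(q[s, a, b] / alpha)) ** alpha."""
    m = q.max(axis=1)                                   # shape (S, B)
    f = m + alpha * np.log(np.exp((q - m[:, None, :]) / alpha).sum(axis=1))
    fmax = f.max(axis=1, keepdims=True)
    return fmax[:, 0] + np.log((prior * np.exp(f - fmax)).sum(axis=1))

def soft_q_iteration(reward, transition, prior, gamma=0.9, alpha=1.0,
                     tol=1e-8, max_iters=10_000):
    """Iterate Q <- R + gamma * E_{s'}[V(s')] until the update is below tol.

    reward[s, a, b], transition[s, a, b, s'] and prior[s, b] are known tables.
    """
    q = np.zeros_like(reward, dtype=float)
    for _ in range(max_iters):
        v = soft_value(q, prior, alpha)
        q_new = reward + gamma * transition @ v
        if np.max(np.abs(q_new - q)) < tol:
            return q_new, soft_value(q_new, prior, alpha)
        q = q_new
    return q, soft_value(q, prior, alpha)
```

The returned Q table can then be turned into a conditional policy and an opponent model per state with the softmax and prior-reweighting forms described after Theorem 1.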
3.2 Regularized Opponent Model with Maximum Entropy Objective Actor Critic
ROMMEO-Q assumes we have a model of the environment and is impractical to implement in high-dimensional continuous problems. To solve these problems, we propose the ROMMEO actor critic (ROMMEO-AC), which is a model-free method with parameterized policy and Q-functions. We use neural networks (NNs) as function approximators for the conditional policy, opponent model and Q-function and learn these functions by stochastic gradient. We parameterize the Q-function, conditional policy and opponent model by , and respectively.
Without access to the environment model, we replace the Q-iteration with Q-learning. Therefore, we can train to minimize:
(20) 
with
(21) 
where are target functions for providing relatively stable target values. We use $\hat{a}^{-i}$ to denote the action sampled from agent $i$’s opponent model, and it should be distinguished from $a^{-i}$, which is the real action taken by agent $i$’s opponent. Eq. 21 can be derived from Eq. 15 and 16.
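A single-sample version of such a target can be sketched as below. We assume the soft value of the next state is estimated from one action pair sampled from the opponent model and the conditional policy, as $Q - \alpha \log \pi - \log \rho + \log P$, which is the form implied by an objective of reward plus $\alpha$-weighted entropy minus KL to the prior. The function names and the squared-error loss with a factor of one half are illustrative assumptions, not the exact Eq. 20 and 21.

```python
import numpy as np

def td_target(r, gamma, q_next, log_pi, log_rho, log_prior, alpha=1.0):
    """Single-sample target: r + gamma * (Q' - alpha*log pi - log rho + log prior).

    q_next is a target-network value at (s', a_hat, b_hat), where b_hat is drawn
    from the opponent model and a_hat from the conditional policy given b_hat.
    The three log-probabilities are evaluated at those same sampled actions.
    """
    soft_v = q_next - alpha * log_pi - log_rho + log_prior
    return r + gamma * soft_v

def q_loss(q_pred, targets):
    """Half mean squared Bellman error over a batch."""
    q_pred = np.asarray(q_pred, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return float(0.5 * np.mean((q_pred - targets) ** 2))
```

Note that the reward and current joint action in the loss come from the replay data, including the opponent’s real action, whereas the next-state term uses actions sampled from the learned models.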
To recover the optimal conditional policy and opponent model, we follow the method in [Haarnoja et al.2018b], where the conditional policy and the opponent model are trained to minimize the KL divergence:
(22) 
(23) 
By using the reparameterization trick: and , we can rewrite the objectives above as:
(24) 
(25) 
The gradients of Eq. 20, 24 and 25 with respect to the corresponding parameters are listed below:
(26) 
(27) 
(28) 
We list the pseudocode of ROMMEOQ and ROMMEOAC in Appendix A.
4 Related Works
In early works, the maximum entropy principle has been used in policy search in linear dynamics [Todorov2010, Toussaint2009, Levine and Koltun2013] and path integral control in general dynamics [Kappen2005, Theodorou et al.2010]. Recently, offpolicy methods [Haarnoja et al.2017, Schulman et al.2017, Nachum et al.2017] have been proposed to improve the sample efficiency in optimizing MEO. However, in a continuous environment, the complex approximate inference may be needed for sampling actions. To avoid the complex sampling procedure, training a policy in supervised fashion is employed in [Haarnoja et al.2018b]. Our work is closely related to this series of recent works because ROMMEO is an extension of MEO to MARL.
A few works related to ours have been conducted in multiagent soft Q-learning [Wen et al.2019, Wei et al.2018, GrauMoya et al.2018], where variants of soft Q-learning are applied to solve different problems in MARL. However, unlike previous works, we do not take soft Q-learning as given and apply it to MARL problems with modifications. In our work, we first establish a novel objective, ROMMEO, and ROMMEO-Q is only one off-policy method, derived with a complete convergence proof, that can optimize the objective. There are other ways of optimizing ROMMEO, for example on-policy gradient-based methods, but they are not included in the paper. ROMMEO-Q resembles soft Q-learning because ROMMEO is an extension of MEO to MARL, not because it is a modification of soft Q-learning applied to MARL problems.
There has been substantial progress in combining RL with probabilistic inference. However, most of the existing works focus on the single-agent case, and the literature on Bayesian methods in MARL is limited. Among these are methods for cooperative games with prior knowledge of distributions over the game model and the possible strategies of others [Chalkiadakis and Boutilier2003], or of policy parameters and possible roles of other agents [Wilson et al.2010]. In our work, we assume very limited prior knowledge of the environment model, optimal policy, opponents or the observations during play. In addition, our algorithms are fully decentralized at training and execution, which is more challenging than the setting of centralized training [Foerster et al.2017, Lowe et al.2017, Rashid et al.2018].
5 Experiments
5.1 Iterated Matrix Games
We first present the proof-of-principle result of ROMMEO-Q on iterated matrix games where players need to cooperate to achieve the shared maximum reward (the experiment code is available at https://github.com/rommeoijcai2019/rommeo). To this end, we study the iterated climbing game (ICG), which is a classic purely cooperative two-player stateless iterated matrix game.
The climbing game (CG) is a fully cooperative game proposed in [Claus and Boutilier1998] whose payoff matrix is summarized as follows: . It is a challenging benchmark because of the difficulty of converging to its global optimum. There are two Nash equilibria and but one global optimum . The punishment for miscoordination by choosing a certain action increases in the order of . The safest action is and the miscoordination punishment is the most severe for . Therefore it is very difficult for agents to converge to the global optimum in ICG.
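The structure of the game can be inspected with a few lines of code. The payoff table below is the standard climbing game from Claus and Boutilier, which we assume is the one used here; actions are indexed 0, 1, 2 purely for illustration, and the helper names are hypothetical.

```python
import numpy as np

# Assumed payoff: the climbing game of Claus and Boutilier (shared reward).
# Rows are the first player's actions, columns the second player's actions.
CLIMBING = np.array([[11, -30, 0],
                     [-30, 7, 6],
                     [0, 0, 5]])

def pure_nash_equilibria(payoff):
    """Joint actions where neither player gains by deviating unilaterally."""
    equilibria = []
    for i in range(payoff.shape[0]):
        for j in range(payoff.shape[1]):
            best_row = payoff[i, j] >= payoff[:, j].max()
            best_col = payoff[i, j] >= payoff[i, :].max()
            if best_row and best_col:
                equilibria.append((i, j))
    return equilibria

def global_optimum(payoff):
    """Joint action with the highest shared reward."""
    i, j = np.unravel_index(np.argmax(payoff), payoff.shape)
    return int(i), int(j)
```

Under this assumed table, the two pure equilibria are the joint actions with rewards 11 and 7, and only the first is the global optimum; a unilateral deviation toward it from the second equilibrium costs 37 in reward, which is why independent learners tend to settle on safer actions.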
We compare our method to a series of strong baselines in MARL, including Joint Action Learner (JAL) [Claus and Boutilier1998], WoLF Policy Hillclimbing (WoLFPHC) [Bowling and Veloso2001], Frequency Maximum Q (FMQ) [Kapetanakis and Kudenko2002] and Probabilistic Recursive Reasoning (PR2) [Wen et al.2019]. ROMMEOQEMP is an ablation study to evaluate the effectiveness of our proposed opponent model learning process, where we replace our opponent model with empirical frequency. Fig. 1(a) shows the learning curves on ICG for different algorithms. The difference of rewards between ROMMEOQ and FMQc10 may seem small because of the small reward margin between the global optimum and the local one. However, ROMMEOQ actually outperforms all baselines significantly in terms of converging to the global optimum, which is shown in Fig. 1(b). To further analyze the opponent modeling described in Sec. 2.3, we visualize the probability of agent taking the optimal action estimated by agent ’s opponent model and its opponent true policy in Fig. 1(c). Agent ’s opponent model “thinks ahead” of agent and converges to agent ’s optimal policy before agent itself converges to the optimal policy. This helps agent to respond to its opponent model optimally by choosing action , which in turn leads to the improvement of agent ’s opponent model and policy. Therefore, the game converges to the global optimum. To note, the big drop of for both policies and opponent models at the beginning of the training comes from the severe punishment of miscoordination associated with action .
5.2 Differential Games
We adopt the differential Max of Two Quadratic Game [Wei et al.2018] for continuous case. The agents have continuous action space of . Each agent’s reward depends on the joint action following the equations: where . We compare the algorithm with a series of baselines including PR2 [Wen et al.2019], MASQL [Wei et al.2018, GrauMoya et al.2018], MADDPG [Lowe et al.2017] and independent learner via DDPG [Lillicrap et al.2015]. To compare against traditional opponent modeling methods, similar to [Rabinowitz et al.2018, He et al.2016], we implement an additional baseline of DDPG with an opponent module that is trained online with supervision in order to capture the latest opponent behaviors, called DDPGOM. We trained all agents for 200 iterations with 25 steps per iteration.
This is a challenging task for most continuous gradient-based RL algorithms because the gradient update tends to direct the training agent to the suboptimal point. The reward surface is provided in Fig. 2(a); there is a local maximum at and a global maximum at , with a deep valley in between. If the agents’ policies are initialized to (the red starred point), which lies within the basin of the left local maximum, gradient-based methods tend to fail to find the global maximum equilibrium point due to the valley blocking the upper right area.
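The difficulty can be reproduced with plain gradient ascent on a reward of this shape. The constants below (a wide hill of height 0 centred at $(-5, -5)$, a narrow hill of height 10 centred at $(5, 5)$, actions clipped to $[-10, 10]$, and a start at the origin) are assumptions chosen to match the usual statement of the Max of Two Quadratic game and are not taken from this section.

```python
import numpy as np

def reward(a1, a2):
    """Assumed Max of Two Quadratic reward: a wide low hill and a narrow high one."""
    f1 = 0.8 * (-((a1 + 5.0) / 3.0) ** 2 - ((a2 + 5.0) / 3.0) ** 2)
    f2 = 1.0 * (-((a1 - 5.0) / 1.0) ** 2 - ((a2 - 5.0) / 1.0) ** 2) + 10.0
    return max(f1, f2)

def gradient_ascent(start, lr=0.1, steps=500, eps=1e-5):
    """Plain joint gradient ascent using central finite differences."""
    a = np.array(start, dtype=float)
    for _ in range(steps):
        g1 = (reward(a[0] + eps, a[1]) - reward(a[0] - eps, a[1])) / (2 * eps)
        g2 = (reward(a[0], a[1] + eps) - reward(a[0], a[1] - eps)) / (2 * eps)
        a = np.clip(a + lr * np.array([g1, g2]), -10.0, 10.0)
    return a
```

Starting from the origin, the wide hill dominates the maximum, so the gradient points toward the lower peak and the iterate never sees the higher one. This is the failure mode that a stochastic policy with a learned opponent model is meant to avoid.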
A learning path of ROMMEO-AC is summarized in Fig. 2(a), and the solid bright circle in the right corner indicates convergence to the global optimum. The learning curve is presented in Fig. 2(b). ROMMEO-AC shows the capability of converging to the global optimum in a very small number of steps, while most of the baselines can only reach the suboptimal point. An exception is PR2-AC, which can also achieve the global optimum but requires many more steps to explore and learn. Additionally, fine-tuning of the exploration noise or a separate exploration stage is required for deterministic RL methods (MADDPG, DDPG, DDPG-OM, PR2-AC), and the learning outcome of the energy-based RL method (MASQL) is extremely sensitive to the annealing scheme for the temperature. In contrast, ROMMEO-AC employs a stochastic policy and controls the exploration level by the weighting factor $\alpha$. It does not need a separate exploration stage at the beginning of training or a delicately designed annealing scheme for $\alpha$.
Furthermore, we analyze the learning path of the policy and the modeled opponent policy during training; the results are shown in Fig. 2(c). The red and orange lines are the means of the modeled opponent policy $\rho$, which always learns to approach the optimum ahead of the policy $\pi$ (dashed blue and green lines). This helps the agents to establish trust and converge to the optimum quickly, which further justifies the effectiveness and benefits of the regularized opponent model proposed in Sec. 2.3.
6 Conclusion
In this paper, we use Bayesian inference to formulate the MARL problem and derive a novel objective, ROMMEO, which gives rise to a new perspective on opponent modeling. We design an off-policy algorithm, ROMMEO-Q, with a complete convergence proof for optimizing ROMMEO. For better generality, we also propose ROMMEO-AC, an actor critic algorithm powered by NNs to solve complex and continuous problems. We also give a theoretical and empirical analysis of the effect of the new opponent-model learning process on the agent’s performance in MARL. We evaluate our methods on the challenging matrix game and differential game and show that they can outperform a series of strong baselines.
References
 [Abdolmaleki et al.2018] Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Rémi Munos, Nicolas Heess, and Martin A. Riedmiller. Maximum a posteriori policy optimisation. CoRR, abs/1806.06920, 2018.
 [Bowling and Veloso2001] Michael Bowling and Manuela Veloso. Rational and convergent learning in stochastic games. In IJCAI, San Francisco, CA, USA, 2001.
 [Brown1951] George W. Brown. Iterative solution of games by fictitious play. In AAPA. New York, 1951.
 [Chalkiadakis and Boutilier2003] Georgios Chalkiadakis and Craig Boutilier. Coordination in multiagent reinforcement learning: A bayesian approach. In AAMAS, pages 709–716, New York, NY, USA, 2003. ACM.

 [Claus and Boutilier1998] Caroline Claus and Craig Boutilier. The dynamics of reinforcement learning in cooperative multiagent systems. AAAI ’98/IAAI ’98, Menlo Park, CA, USA, 1998. American Association for Artificial Intelligence.
 [Florensa et al.2017] Carlos Florensa, Yan Duan, and Pieter Abbeel. Stochastic Neural Networks for Hierarchical Reinforcement Learning. arXiv eprints, April 2017.
 [Foerster et al.2017] Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual MultiAgent Policy Gradients. arXiv eprints, page arXiv:1705.08926, May 2017.
 [Fox et al.2015] Roy Fox, Ari Pakman, and Naftali Tishby. Taming the Noise in Reinforcement Learning via Soft Updates. arXiv eprints, page arXiv:1512.08562, December 2015.
 [GrauMoya et al.2018] Jordi GrauMoya, Felix Leibfried, and Haitham BouAmmar. Balancing TwoPlayer Stochastic Games with Soft QLearning. arXiv eprints, page arXiv:1802.03216, February 2018.
 [Haarnoja et al.2017] Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energybased policies. CoRR, abs/1702.08165, 2017.
 [Haarnoja et al.2018a] Tuomas Haarnoja, Vitchyr Pong, Aurick Zhou, Murtaza Dalal, Pieter Abbeel, and Sergey Levine. Composable Deep Reinforcement Learning for Robotic Manipulation. arXiv eprints, page arXiv:1803.06773, March 2018.
 [Haarnoja et al.2018b] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft ActorCritic: OffPolicy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. arXiv eprints, page arXiv:1801.01290, January 2018.
 [He et al.2016] H. He, J. BoydGraber, K. Kwok, and H. Daumé III. Opponent modeling in deep reinforcement learning. In ICML ’16, volume 48, 2016.
 [Heinrich and Silver2016] Johannes Heinrich and David Silver. Deep reinforcement learning from selfplay in imperfectinformation games. CoRR, abs/1603.01121, 2016.
 [Kalman1960] Rudolf Kalman. A new approach to linear filtering and prediction problems. Transactions of the ASME  Journal of basic Engineering, 82:35–45, 01 1960.
 [Kapetanakis and Kudenko2002] Spiros Kapetanakis and Daniel Kudenko. Reinforcement learning of coordination in cooperative multiagent systems. In Eighteenth National Conference on Artificial Intelligence, Menlo Park, CA, USA, 2002.
 [Kappen2005] H J Kappen. Path integrals and symmetry breaking for optimal control theory. Journal of Statistical Mechanics: Theory and Experiment, 2005(11), 2005.
 [Levine and Koltun2013] Sergey Levine and Vladlen Koltun. Variational policy search via trajectory optimization. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, NIPS ’13. 2013.
 [Levine2018] Sergey Levine. Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review. arXiv eprints, page arXiv:1805.00909, May 2018.
 [Lillicrap et al.2015] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
 [Lowe et al.2017] Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. MultiAgent ActorCritic for Mixed CooperativeCompetitive Environments. arXiv eprints, page arXiv:1706.02275, June 2017.
 [Nachum et al.2017] Ofir Nachum, Mohammad Norouzi, Kelvin Xu, and Dale Schuurmans. Bridging the Gap Between Value and Policy Based Reinforcement Learning. arXiv eprints, page arXiv:1702.08892, February 2017.
 [Rabinowitz et al.2018] Neil C Rabinowitz, Frank Perbet, H Francis Song, Chiyuan Zhang, SM Eslami, and Matthew Botvinick. Machine theory of mind. arXiv preprint arXiv:1802.07740, 2018.
 [Raileanu et al.2018] R. Raileanu, E. Denton, A. Szlam, and R. Fergus. Modeling others using oneself in multiagent reinforcement learning. CoRR, abs/1802.09640, 2018.
 [Rashid et al.2018] Tabish Rashid, Mikayel Samvelyan, Christian Schroeder de Witt, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. QMIX: Monotonic Value Function Factorisation for Deep MultiAgent Reinforcement Learning. arXiv eprints, page arXiv:1803.11485, March 2018.
 [Rawlik et al.2013] Konrad Rawlik, Marc Toussaint, and Sethu Vijayakumar. On stochastic optimal control and reinforcement learning by approximate inference (extended abstract). IJCAI ’13, pages 3052–3056, 2013.
 [Schulman et al.2017] John Schulman, Xi Chen, and Pieter Abbeel. Equivalence Between Policy Gradients and Soft QLearning. arXiv eprints, page arXiv:1704.06440, April 2017.
 [Shapley1953] Lloyd S Shapley. Stochastic games. Proceedings of the national academy of sciences, 39(10):1095–1100, 1953.
 [Theodorou et al.2010] Evangelos Theodorou, Jonas Buchli, and Stefan Schaal. A generalized path integral control approach to reinforcement learning. J. Mach. Learn. Res., 11, December 2010.
 [Tian et al.2018] Zheng Tian, Shihao Zou, Tim Warr, Lisheng Wu, and Jun Wang. Learning Multiagent Implicit Communication Through Actions: A Case Study in Contract Bridge, a Collaborative ImperfectInformation Game. arXiv eprints, page arXiv:1810.04444, October 2018.
 [Todorov2007] Emanuel Todorov. Linearlysolvable markov decision problems. In B. Schölkopf, J. C. Platt, and T. Hoffman, editors, NIPS ’19. 2007.
 [Todorov2010] Emanuel Todorov. Policy gradients in linearlysolvable mdps. In J. D. Lafferty, C. K. I. Williams, J. ShaweTaylor, R. S. Zemel, and A. Culotta, editors, NIPS. 2010.

 [Toussaint and Storkey2006] Marc Toussaint and Amos Storkey. Probabilistic inference for solving discrete and continuous state markov decision processes. ICML ’06, pages 945–952, New York, NY, USA, 2006. ACM.
 [Toussaint2009] Marc Toussaint. Robot trajectory optimization using approximate inference. ICML ’09, pages 1049–1056, New York, NY, USA, 2009. ACM.
 [Wei et al.2018] Ermo Wei, Drew Wicke, David Freelan, and Sean Luke. Multiagent soft qlearning. AAAI, 2018.
 [Wen et al.2019] Ying Wen, Yaodong Yang, Rui Luo, Jun Wang, and Wei Pan. Probabilistic recursive reasoning for multiagent reinforcement learning. In ICLR, 2019.
 [Wilson et al.2010] Aaron Wilson, Alan Fern, and Prasad Tadepalli. Bayesian policy search for multiagent role discovery. AAAI’10, pages 624–629. AAAI Press, 2010.
 [Ziebart et al.] Brian D. Ziebart, Andrew Maas, J. Andrew Bagnell, and Anind K. Dey. Maximum entropy inverse reinforcement learning. In AAAI ’08.