1 Introduction
Reinforcement learning (RL) methods are becoming increasingly successful at optimizing reward signals in complex, high-dimensional environments (Espeholt et al., 2018). A key limitation of RL, however, is the difficulty of designing suitable reward functions for complex and not well-specified tasks (Hadfield-Menell et al., 2017; Amodei et al., 2016). If the reward function does not cover all important aspects of the task, the agent could easily learn undesirable behaviors (Amodei and Clark, 2016). This problem is further exacerbated in multi-agent scenarios, such as multiplayer games (Peng et al., 2017), multi-robot control (Matignon et al., 2012) and social interactions (Leibo et al., 2017); in these cases, agents do not necessarily share the same reward function, especially in competitive settings where the agents might have conflicting rewards.
Imitation learning methods address these problems via expert demonstrations (Ziebart et al., 2008; Englert and Toussaint, 2015; Finn et al., 2016; Stadie et al., 2017); the agent directly learns desirable behaviors by imitating an expert. Notably, inverse reinforcement learning (IRL) frameworks assume that the expert is (approximately) optimizing an underlying reward function, and attempt to recover a reward function that rationalizes the demonstrations; an agent policy is subsequently learned through RL (Ng et al., 2000; Abbeel and Ng, 2004). Unfortunately, this paradigm is not suitable for general multi-agent settings: the environment is non-stationary from the perspective of any individual agent (Lowe et al., 2017), and there may exist multiple equilibrium solutions (Hu et al., 1998). The optimal policy of one agent could depend on the policies of other agents, and vice versa, so there could exist multiple solutions in which each agent's policy is the optimal response to the others.
In this paper, we propose a new framework for multi-agent imitation learning: provided with demonstrations of a set of experts interacting with each other within the same environment, we aim to learn multiple parametrized policies that imitate the behavior of each expert respectively. Using the framework of Markov games, we integrate multi-agent RL with a suitable extension of multi-agent inverse RL. The resulting procedure strictly generalizes Generative Adversarial Imitation Learning (GAIL; Ho and Ermon, 2016) in the single-agent case. Imitation learning corresponds to a two-player game between a generator and a discriminator. The generator controls the policies of all the agents in a distributed way, and the discriminator contains a classifier for each agent that is trained to distinguish that agent's behavior from that of the corresponding expert. Upon training, the behaviors produced by the policies are indistinguishable from the training data for the discriminator. We can incorporate prior knowledge into the discriminators, including the presence of cooperative or competitive agents. In addition, we propose a novel multi-agent natural policy gradient algorithm that addresses the issue of high-variance gradient estimates commonly observed in reinforcement learning
(Lowe et al., 2017; Foerster et al., 2016). Empirical results demonstrate that our method can imitate complex behaviors in high-dimensional environments, such as particle environments and cooperative robotic control tasks, with multiple cooperative or competitive agents; the imitated behaviors are close to the expert behaviors with respect to "true" reward functions to which the agents do not have access during training.

2 Preliminaries
2.1 Markov games
We consider an extension of Markov decision processes (MDPs) called Markov games
(Littman, 1994). A Markov game (MG) for $N$ agents is defined via a set of states $\mathcal{S}$ and sets of actions $\{\mathcal{A}_i\}_{i=1}^{N}$. The function $P : \mathcal{S} \times \mathcal{A}_1 \times \cdots \times \mathcal{A}_N \to \mathcal{P}(\mathcal{S})$ describes the (stochastic) transition process between states, where $\mathcal{P}(\mathcal{S})$ denotes the set of probability distributions over the set $\mathcal{S}$.
Given that we are in state $s_t$ at time $t$, the agents take actions $(a_1, \ldots, a_N)$ and the state transitions to $s_{t+1}$ with probability $P(s_{t+1} \mid s_t, a_1, \ldots, a_N)$. Each agent $i$ obtains a (bounded) reward given by a function $r_i : \mathcal{S} \times \mathcal{A}_1 \times \cdots \times \mathcal{A}_N \to \mathbb{R}$. Each agent $i$ aims to maximize its own total expected return $\mathbb{E}\big[\sum_{t=0}^{T} \gamma^t r_{i,t}\big]$, where $\gamma$ is the discount factor and $T$ is the time horizon, by selecting actions through a (stationary and Markovian) stochastic policy $\pi_i : \mathcal{S} \times \mathcal{A}_i \to [0, 1]$. The initial states are determined by a distribution $\eta : \mathcal{S} \to [0, 1]$.
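As a concrete illustration of these definitions, the sketch below samples trajectories from a small tabular two-agent Markov game. All names here (`P`, `policies`, `eta`, `rollout`) are illustrative assumptions, not part of the paper's formalism:

```python
import numpy as np

# Minimal tabular Markov game: 2 agents, 2 states, 2 actions each.
# P[s, a1, a2] is the next-state distribution; policies[i][s] is pi_i(.|s).
rng = np.random.default_rng(0)
n_states, n_actions, n_agents = 2, 2, 2
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions, n_actions))
policies = [rng.dirichlet(np.ones(n_actions), size=n_states) for _ in range(n_agents)]
eta = np.array([0.5, 0.5])  # initial state distribution


def rollout(T):
    """Sample a length-T trajectory; each agent acts independently given s."""
    s = rng.choice(n_states, p=eta)
    traj = []
    for _ in range(T):
        a = tuple(rng.choice(n_actions, p=policies[i][s]) for i in range(n_agents))
        traj.append((s, a))
        s = rng.choice(n_states, p=P[s][a])
    return traj


traj = rollout(10)
```

Each entry of `traj` is a pair $(s_t, (a_1, \ldots, a_N))$, matching the sampling process described above.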
The joint policy is defined as $\pi(a \mid s) \triangleq \prod_{i=1}^{N} \pi_i(a_i \mid s)$, where we use bold variables without subscript $i$ to denote the concatenation of all variables for all agents (e.g., $\pi$ denotes the joint policy in a multi-agent setting, $r$ denotes all rewards, and $a$ denotes the actions of all agents).
We use expectation with respect to a policy to denote an expectation with respect to the trajectories it generates. For example,
$$\mathbb{E}_{\pi}[r(s, a)] \triangleq \mathbb{E}\Big[\sum_{t=0}^{T} \gamma^t r(s_t, a_t)\Big]$$
denotes the following sample process for the right-hand side: $s_0 \sim \eta$, $a_t \sim \pi(\cdot \mid s_t)$, $s_{t+1} \sim P(\cdot \mid s_t, a_t)$; yet if we do not take the expectation over the state $s$, then $\mathbb{E}_{\pi(\cdot \mid s)}[r(s, a)]$
assumes the policy samples only the next-step action $a \sim \pi(\cdot \mid s)$.
We use the subscript $-i$ to denote all agents except for agent $i$. For example, $(a_i, a_{-i})$ represents $(a_1, \ldots, a_N)$, the actions of all $N$ agents.
2.2 Reinforcement learning and Nash equilibrium
In reinforcement learning (RL), the goal of each agent is to maximize its total expected return given access to the reward signal $r$. In single-agent RL, an optimal Markovian policy exists, but the optimal policy might not be unique (e.g., all policies are optimal for an identically zero reward; see Sutton and Barto (1998), Chapter 3.8). An entropy regularizer can be introduced to resolve this ambiguity. The optimal policy is found via the following RL procedure:
$$\text{RL}(r) = \arg\max_{\pi \in \Pi} \; H(\pi) + \mathbb{E}_{\pi}[r(s, a)] \qquad (1)$$
where $H(\pi)$ is the discounted causal entropy (Bloem and Bambos, 2014) of policy $\pi$.

Definition (discounted causal entropy). The discounted causal entropy for a policy $\pi$ is defined as follows:
$$H(\pi) \triangleq \mathbb{E}_{\pi}[-\log \pi(a_t \mid s_t)] = \mathbb{E}_{\pi}\Big[-\sum_{t=0}^{T} \gamma^t \log \pi(a_t \mid s_t)\Big].$$
If we scale the reward function by any positive value, the addition of $H(\pi)$ resolves this ambiguity by selecting, among the set of optimal policies, the one with the highest causal entropy¹ – the policy with both the highest reward and the highest entropy is unique because the entropy function is concave and the set of optimal policies is convex. (¹For the remainder of the paper, we use the term "entropy" to denote the discounted causal entropy for policies.)
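To make the definition above concrete, the discounted causal entropy can be estimated by Monte Carlo from sampled trajectories. A minimal sketch, assuming a tabular policy `pi[s, a]` and trajectories given as lists of (state, action) pairs (both illustrative names):

```python
import numpy as np


def discounted_causal_entropy(trajectories, pi, gamma):
    """Monte Carlo estimate of H(pi) = E_pi[-sum_t gamma^t log pi(a_t|s_t)]."""
    total = 0.0
    for traj in trajectories:
        total += sum(-(gamma ** t) * np.log(pi[s, a])
                     for t, (s, a) in enumerate(traj))
    return total / len(trajectories)


# Uniform policy over 2 actions: step t contributes gamma^t * log(2).
pi = np.full((1, 2), 0.5)
h = discounted_causal_entropy([[(0, 0), (0, 1), (0, 0)]], pi, gamma=0.5)
# h = (1 + 0.5 + 0.25) * log(2)
```

The scale-invariance argument in the text can be checked directly: doubling a reward function leaves this quantity unchanged, since it depends only on the policy.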
In Markov games, however, the optimal policy of an agent depends on the other agents' policies. One approach is to use an equilibrium solution concept, such as the Nash equilibrium (Hu et al., 1998). Informally, a set of policies $(\pi_1, \ldots, \pi_N)$ is a Nash equilibrium if no agent can achieve a higher reward by unilaterally changing its policy, i.e., $\mathbb{E}_{\pi_i, \pi_{-i}}[r_i] \ge \mathbb{E}_{\hat{\pi}_i, \pi_{-i}}[r_i]$ for every agent $i$ and every alternative policy $\hat{\pi}_i$. The process of finding a Nash equilibrium can be defined as a constrained optimization problem (Filar and Vrieze (2012), Theorem 3.7.2):
$$\min_{\pi, v} \; f_r(\pi, v) = \sum_{i=1}^{N} \sum_{s \in \mathcal{S}} \Big( v_i(s) - \mathbb{E}_{a_i \sim \pi_i(\cdot \mid s)}\big[ q_i(s, a_i) \big] \Big) \qquad (2)$$
$$\text{s.t.} \quad v_i(s) \ge q_i(s, a_i) \triangleq \mathbb{E}_{a_{-i} \sim \pi_{-i}(\cdot \mid s)}\Big[ r_i(s, a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\big[ v_i(s') \big] \Big] \quad \forall i \in [N],\ s \in \mathcal{S},\ a_i \in \mathcal{A}_i \qquad (3)$$
where the joint action $a = (a_i, a_{-i})$ includes the action $a_i$ sampled from $\pi_i$ and $a_{-i}$ sampled from $\pi_{-i}$. Intuitively, $v_i$ could represent some estimated value function for each state, and $q_i$ represents the $Q$-function that corresponds to $v_i$. The constraints enforce the Nash equilibrium condition – when the constraints are satisfied, $v_i(s) - \mathbb{E}_{a_i \sim \pi_i(\cdot \mid s)}[q_i(s, a_i)]$ is non-negative for every $i$ and $s$. Hence $f_r(\pi, v)$ is always non-negative for a feasible $(\pi, v)$. Moreover, this objective has a global minimum of zero if a Nash equilibrium exists, and $\pi$ forms a Nash equilibrium if and only if $f_r(\pi, v)$ reaches zero while $(\pi, v)$ is a feasible solution (Prasad and Bhatnagar (2015), Theorem 2.4).
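This constrained program can be checked numerically on a small example. The sketch below instantiates it for a one-shot, single-state game (so there is no continuation term in $q_i$): for matching pennies, the mixed profile $(1/2, 1/2)$ attains objective zero with feasible value functions, while a deviating profile does not. All function and variable names are illustrative:

```python
import numpy as np

r1 = np.array([[1., -1.], [-1., 1.]])  # agent 1 payoffs (matching pennies)
r2 = -r1                               # agent 2 payoffs (zero-sum)


def nash_objective(p1, p2):
    """Objective f_r and feasibility, using the tightest v_i = max_{a_i} q_i."""
    q1 = r1 @ p2        # q1[a1] = E_{a2 ~ p2} r1[a1, a2]
    q2 = r2.T @ p1      # q2[a2] = E_{a1 ~ p1} r2[a1, a2]
    v1, v2 = q1.max(), q2.max()
    f = (v1 - p1 @ q1) + (v2 - p2 @ q2)
    feasible = np.all(v1 >= q1 - 1e-12) and np.all(v2 >= q2 - 1e-12)
    return f, feasible


f_eq, ok = nash_objective(np.array([0.5, 0.5]), np.array([0.5, 0.5]))
f_dev, _ = nash_objective(np.array([1.0, 0.0]), np.array([0.5, 0.5]))
# f_eq is 0 at the Nash equilibrium; f_dev is strictly positive away from it.
```

This mirrors the statement above: the objective is non-negative on the feasible set and reaches zero exactly at an equilibrium.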
2.3 Inverse reinforcement learning
Suppose we do not have access to the reward signal $r$, but have demonstrations $\mathcal{D}$ provided by an expert ($N$ expert agents in Markov games). Imitation learning aims to learn policies that behave similarly to these demonstrations. In Markov games, we assume all experts/players operate in the same environment, and the demonstrations $\mathcal{D}$ are collected by sampling $s_0 \sim \eta$, $a_t \sim \pi_E(\cdot \mid s_t)$, $s_{t+1} \sim P(\cdot \mid s_t, a_t)$; we assume knowledge of $N$, $\gamma$, $\mathcal{S}$, and $\mathcal{A}$, as well as access to $P$ and $\eta$ as black boxes. We further assume that once we obtain $\mathcal{D}$, we cannot ask for additional expert interactions with the environment (unlike in DAgger (Ross et al., 2011) or CIRL (Hadfield-Menell et al., 2016)).
Let us first consider imitation in Markov decision processes (as a special case of Markov games) and the framework of single-agent maximum entropy IRL (Ziebart et al., 2008; Ho and Ermon, 2016), where the goal is to recover a reward function $r$ that rationalizes the expert behavior $\pi_E$:
$$\text{IRL}(\pi_E) = \arg\max_{r} \; \mathbb{E}_{\pi_E}[r(s, a)] - \Big( \max_{\pi \in \Pi} H(\pi) + \mathbb{E}_{\pi}[r(s, a)] \Big)$$
In practice, expectations with respect to $\pi_E$ are evaluated using samples from $\mathcal{D}$.
The IRL objective is ill-defined (Ng et al., 2000; Finn et al., 2016), and there are often multiple valid solutions to the problem when we consider all $r \in \mathbb{R}^{\mathcal{S} \times \mathcal{A}}$. For example, we can assign the reward for trajectories that are not visited by the expert arbitrarily, so long as these trajectories yield lower rewards than the expert trajectories. To resolve this ambiguity, Ho and Ermon (2016) introduce a convex reward function regularizer $\psi$, which can be used to restrict rewards to be linear in a predetermined set of features (Ho and Ermon, 2016):
$$\text{IRL}_{\psi}(\pi_E) = \arg\max_{r} \; -\psi(r) + \mathbb{E}_{\pi_E}[r(s, a)] - \Big( \max_{\pi \in \Pi} H(\pi) + \mathbb{E}_{\pi}[r(s, a)] \Big) \qquad (4)$$
2.4 Imitation by matching occupancy measures
Ho and Ermon (2016) interpret the imitation learning problem as matching two occupancy measures, i.e., the distribution over states and actions encountered when navigating the environment with a policy. Formally, for a policy $\pi \in \Pi$, the occupancy measure is defined as $\rho_{\pi}(s, a) = \pi(a \mid s) \sum_{t=0}^{T} \gamma^t P(s_t = s \mid \pi)$. Ho and Ermon (2016) draw a connection between IRL and occupancy measure matching, showing that the former is a dual of the latter:
Proposition (Proposition 3.1 in Ho and Ermon (2016)).
$$\text{RL} \circ \text{IRL}_{\psi}(\pi_E) = \arg\min_{\pi \in \Pi} \; -H(\pi) + \psi^{*}(\rho_{\pi} - \rho_{\pi_E})$$
Here $\psi^{*}$ is the convex conjugate of $\psi$, which could be interpreted as a measure of similarity between the occupancy measures of the expert policy and the agent's policy. One instance of $\psi$, $\psi_{GA}$, gives rise to the Generative Adversarial Imitation Learning (GAIL) method:
$$\psi_{GA}^{*}(\rho_{\pi} - \rho_{\pi_E}) = \max_{D \in (0, 1)^{\mathcal{S} \times \mathcal{A}}} \; \mathbb{E}_{\pi}[\log D(s, a)] + \mathbb{E}_{\pi_E}[\log(1 - D(s, a))] \qquad (5)$$
The resulting imitation learning method from Proposition 2.4 involves a discriminator (a classifier $D$) competing with a generator (a policy $\pi$). The discriminator attempts to distinguish real vs. synthetic trajectories (produced by $\pi$) by optimizing (5). The generator, on the other hand, aims to perform optimally under the reward function defined by the discriminator, thus "fooling" the discriminator with synthetic trajectories that are difficult to distinguish from the expert ones.
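The occupancy measure at the heart of this duality is straightforward to estimate empirically: discounted visitation counts, normalized by the number of trajectories. A minimal sketch with illustrative names:

```python
import numpy as np


def occupancy_measure(trajectories, n_states, n_actions, gamma):
    """Empirical discounted occupancy: rho[s, a] ~ sum_t gamma^t P(s_t=s, a_t=a)."""
    rho = np.zeros((n_states, n_actions))
    for traj in trajectories:
        for t, (s, a) in enumerate(traj):
            rho[s, a] += gamma ** t
    return rho / len(trajectories)


trajs = [[(0, 1), (1, 0)], [(0, 1), (1, 1)]]
rho = occupancy_measure(trajs, n_states=2, n_actions=2, gamma=0.5)
# rho[0, 1] = 1.0; rho[1, 0] = rho[1, 1] = 0.25
```

Occupancy-measure matching then amounts to driving the learner's estimate of $\rho_\pi$ toward the expert's $\rho_{\pi_E}$, as measured by $\psi^{*}$.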
3 Generalizing IRL to Markov games
Extending imitation learning to multi-agent settings is difficult because there are multiple rewards (one for each agent) and the notion of optimality is complicated by the need to consider an equilibrium solution (Hu et al., 1998). We use $\text{MARL}(r)$ to denote the set of (stationary and Markovian) policies that form a Nash equilibrium under $r$ and have the maximum discounted causal entropy (among all equilibria):
$$\text{MARL}(r) = \arg\max_{\pi, v} \; H(\pi) \quad \text{s.t.} \quad v_i(s) \ge q_i(s, a_i) \ \ \forall i, s, a_i; \qquad f_r(\pi, v) = 0 \qquad (6)$$
where $q_i$ is defined as in Eq. 3. Our goal is to define a suitable inverse operator MAIRL, in analogy to $\text{IRL}_{\psi}$ in Eq. 4. The key idea of Eq. 4 is to choose a reward that creates a margin between the expert and every other policy. However, the constraints in the Nash equilibrium optimization (Eq. 6) can make this challenging. To that end, we derive an equivalent Lagrangian formulation of (6), where we "move" the constraints into the objective function, so that we can define a margin between the expected rewards of two sets of policies that captures their "difference".
3.1 Equivalent constraints via temporal difference learning
Intuitively, the Nash equilibrium constraints imply that no agent can improve via 1-step temporal difference learning; if the condition in Equation 3 is not satisfied for some $i$, $s$, and $a_i$, this would suggest that we can update the policy of agent $i$ and its value function. Based on this notion, we can derive equivalent versions of the constraints corresponding to $t$-step temporal difference (TD) learning. For a certain policy $\pi$ and reward $r$, let $\hat{v}_i(s; \pi, r)$ be the unique solution to the Bellman equation:
$$\hat{v}_i(s; \pi, r) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\Big[ r_i(s, a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\big[ \hat{v}_i(s'; \pi, r) \big] \Big].$$
Denote $\hat{q}_i^{(t)}\big(\{(s^{(j)}, a_i^{(j)})\}_{j=0}^{t-1}, s^{(t)}; \pi, r\big)$ as the discounted expected return for the $i$-th agent conditioned on visiting the states $\{s^{(j)}\}_{j=0}^{t}$ in the first $t$ steps and choosing action $a_i^{(j)}$ at the $j$-th step, when the other agents use policy $\pi_{-i}$:
$$\hat{q}_i^{(t)}\big(\{(s^{(j)}, a_i^{(j)})\}_{j=0}^{t-1}, s^{(t)}; \pi, r\big) \triangleq \mathbb{E}_{a_{-i}^{(j)} \sim \pi_{-i}(\cdot \mid s^{(j)})}\Big[ \sum_{j=0}^{t-1} \gamma^j r_i(s^{(j)}, a^{(j)}) \Big] + \gamma^t\, \hat{v}_i(s^{(t)}; \pi, r)$$
Then $\pi$ is a Nash equilibrium if and only if, for all $i \in [N]$, all $t \in \mathbb{N}_{+}$, and all such length-$t$ trajectories,
$$\hat{v}_i(s^{(0)}; \pi, r) \ge \hat{q}_i^{(t)}\big(\{(s^{(j)}, a_i^{(j)})\}_{j=0}^{t-1}, s^{(t)}; \pi, r\big) \qquad (7)$$
Intuitively, Theorem 3.1 states that if we replace the 1-step constraints with $t$-step constraints, we obtain the same solution as $\text{MARL}(r)$, since $t$-step TD updates (over one agent at a time) are still stationary with respect to a Nash equilibrium solution. So the constraints can be unrolled for $t$ steps and rewritten as in Equation 7.
3.2 Multiagent inverse reinforcement learning
We are now ready to construct the Lagrangian dual of the primal problem in Equation 6, using the equivalent formulation from Theorem 3.1. The first observation is that $f_r(\pi, \hat{v}) = 0$ for any policy $\pi$, when $\hat{v}$ is defined as in Theorem 3.1 (see Lemma A.1 in the appendix). Therefore, we only need to consider the "unrolled" constraints from Theorem 3.1, obtaining the following dual problem:
$$\max_{\lambda \ge 0} \min_{\pi} \; L_r^{(t)}(\pi, \lambda) = \sum_{i=1}^{N} \sum_{\tau_i \in \mathcal{T}_i^{(t)}} \lambda(\tau_i) \Big( \hat{q}_i^{(t)}(\tau_i; \pi, r) - \hat{v}_i(s^{(0)}; \pi, r) \Big) \qquad (8)$$
where $\mathcal{T}_i^{(t)}$ is the set of all length-$t$ trajectories of the form $\{(s^{(j)}, a_i^{(j)})\}_{j=0}^{t-1} \cup \{s^{(t)}\}$, with $s^{(0)}$ as the initial state, $\lambda$ is a vector of Lagrange multipliers (one per agent and trajectory), and $\hat{q}_i^{(t)}$ and $\hat{v}_i$ are defined as in Theorem 3.1. This dual formulation is a sum over agents and trajectories, which uniquely corresponds to the constraints in Equation 7.

In the following theorem, we show that for a specific choice of $\lambda$ we can recover the difference of the sums of expected rewards between two policies, a performance gap similar to the one used in single-agent IRL in Eq. 4. This amounts to "relaxing" the primal problem.
Theorem 3.2. For any two policies $\pi^{*}$ and $\pi$, let
$$\lambda^{*}(\tau_i) = \eta(s^{(0)}) \prod_{j=0}^{t-1} \pi_i(a_i^{(j)} \mid s^{(j)})\, P(s^{(j+1)} \mid s^{(j)}, a_i^{(j)}; \pi^{*}_{-i})$$
be the probability of generating the sequence $\tau_i$ using policy $\pi_i$ and $\pi^{*}_{-i}$, where $P(s^{(j+1)} \mid s^{(j)}, a_i^{(j)}; \pi^{*}_{-i}) = \mathbb{E}_{a_{-i}^{(j)} \sim \pi^{*}_{-i}(\cdot \mid s^{(j)})}\big[ P(s^{(j+1)} \mid s^{(j)}, a^{(j)}) \big]$. Then
$$\lim_{t \to \infty} L_r^{(t)}(\pi^{*}, \lambda^{*}) = \sum_{i=1}^{N} \Big( \mathbb{E}_{\pi_i, \pi^{*}_{-i}}[r_i(s, a)] - \mathbb{E}_{\pi^{*}}[r_i(s, a)] \Big) \qquad (9)$$
where $L_r^{(t)}(\pi^{*}, \lambda^{*})$ corresponds to the dual function where the multipliers are the probabilities of generating their respective trajectories of length $t$.
We provide a proof in Appendix A.3. Intuitively, the weights $\lambda^{*}(\tau_i)$ correspond to the probability of generating trajectory $\tau_i$ when the policy is $\pi_i$ for agent $i$ and $\pi^{*}_{-i}$ for the other agents. As $t \to \infty$, the first term of the left-hand side of Equation 9, $\sum_{\tau_i} \lambda^{*}(\tau_i)\, \hat{q}_i^{(t)}(\tau_i; \pi^{*}, r)$, converges to the expected total reward $\mathbb{E}_{\pi_i, \pi^{*}_{-i}}[r_i]$, which is the first term of the right-hand side. The marginal of $\lambda^{*}$ over the initial states is the initial state distribution, so the second term of the left-hand side, $\sum_{\tau_i} \lambda^{*}(\tau_i)\, \hat{v}_i(s^{(0)}; \pi^{*}, r)$, converges to $\mathbb{E}_{\pi^{*}}[r_i]$, which is the second term of the right-hand side. Thus, the left-hand side and right-hand side of Equation 9 coincide as $t \to \infty$.
Theorem 3.2 motivates the following definition of multi-agent IRL with regularizer $\psi$:
$$\text{MAIRL}_{\psi}(\pi_E) = \arg\max_{r} \; -\psi(r) + \sum_{i=1}^{N} \Big( \mathbb{E}_{\pi_E}[r_i(s, a)] - \max_{\pi_i} \big( \beta \bar{H}(\pi_i) + \mathbb{E}_{\pi_i, \pi_{E,-i}}[r_i(s, a)] \big) \Big) \qquad (10)$$
where $\bar{H}(\pi_i)$ is the discounted causal entropy for policy $\pi_i$ when the other agents follow $\pi_{E,-i}$, and $\beta$ is a hyperparameter controlling the strength of the entropy regularization term, as in (Ho and Ermon, 2016). This formulation is a strict generalization of the single-agent IRL in (Ho and Ermon, 2016): if $N = 1$ and $\beta = 1$, then $\text{MAIRL}_{\psi} = \text{IRL}_{\psi}$. Furthermore, if the regularization $\psi$ is additively separable, and for each agent $i$, $\pi_{E,i}$ is the unique optimal response to the other experts $\pi_{E,-i}$, we obtain the following:

Theorem. Assume that $\psi(r) = \sum_{i=1}^{N} \psi_i(r_i)$, that $\psi_i$ is convex for each $i \in [N]$, and that $\text{MARL}(r)$ has a unique solution² for all $r \in \text{MAIRL}_{\psi}(\pi_E)$; then
$$\text{MARL} \circ \text{MAIRL}_{\psi}(\pi_E) = \arg\min_{\pi \in \Pi} \; \sum_{i=1}^{N} -\beta \bar{H}(\pi_i) + \psi_i^{*}\big(\rho_{\pi_i, E_{-i}} - \rho_{\pi_E}\big)$$
where $\pi_i, E_{-i}$ denotes the joint policy where agent $i$ uses $\pi_i$ and the other agents use $\pi_{E,-i}$. (²The set of Nash equilibria is not always convex, so we have to assume $\text{MARL}(r)$ returns a unique solution.)
The above theorem suggests that $\psi$-regularized multi-agent inverse reinforcement learning seeks, for each agent $i$, a policy whose occupancy measure is close to one where we replace policy $\pi_i$ with the expert policy $\pi_{E,i}$, as measured by the convex function $\psi_i^{*}$.
However, we do not assume access to the expert policy $\pi_E$ during training, so it is not possible to obtain $\rho_{\pi_i, E_{-i}}$. In the settings of this paper, we consider an alternative approach where we match the occupancy measure between $\rho_{\pi}$ and $\rho_{\pi_E}$ instead. We can obtain our practical algorithm if we select an adversarial reward function regularizer and remove the effect of the entropy regularizers.
Proposition 3.2.
If $\beta = 0$ and $\psi(r) = \sum_{i=1}^{N} \psi_i(r_i)$, where $\psi_i(r_i) = \mathbb{E}_{\pi_E}[g(r_i(s, a))]$ with $g(x) = -x - \log(1 - e^{x})$ if $x < 0$ and $g(x) = +\infty$ otherwise,
then
$$\text{MARL} \circ \text{MAIRL}_{\psi}(\pi_E) = \arg\min_{\pi \in \Pi} \; \sum_{i=1}^{N} \psi_i^{*}\big(\rho_{\pi} - \rho_{\pi_E}\big)$$
Theorem 3.2 and Proposition 3.2 highlight the differences from the single-agent scenario. On the one hand, in Theorem 3.2 we make the assumption that $\text{MARL}(r)$ has a unique solution, which is always true in the single-agent case due to the convexity of the space of optimal policies. On the other hand, in Proposition 3.2 we remove the entropy regularizer, because here the causal entropy for $\pi_i$ may depend on the policies of the other agents, so the entropy regularizers on the two sides are not the same quantity. Specifically, the entropy for the left-hand side conditions on $\pi_{E,-i}$, while the entropy for the right-hand side conditions on $\pi_{-i}$ (a distinction that disappears in the single-agent case).
4 Practical multiagent imitation learning
Despite recent successes in deep RL, it is notoriously hard to train policies with RL algorithms because of high-variance gradient estimates. This is further exacerbated in Markov games, since an agent's optimal policy depends on the other agents (Lowe et al., 2017; Foerster et al., 2016). In this section, we address these problems and propose practical algorithms for multi-agent imitation.
4.1 Multiagent generative adversarial imitation learning
We select $\psi_{GA}$ to be our reward function regularizer in Proposition 3.2; this corresponds to the two-player game introduced in Generative Adversarial Imitation Learning (GAIL; Ho and Ermon, 2016). For each agent $i$, we have a discriminator (denoted $D_{\omega_i}$) mapping state-action pairs to scores, optimized to discriminate expert demonstrations from behaviors produced by $\pi_i$. Implicitly, $D_{\omega_i}$ plays the role of a reward function for the generator, which in turn attempts to train the agent to maximize its reward, thus fooling the discriminator. We optimize the following objective:
$$\min_{\theta} \max_{\omega} \; \sum_{i=1}^{N} \mathbb{E}_{\pi_{\theta_i}}\big[\log D_{\omega_i}(s, a_i)\big] + \mathbb{E}_{\pi_E}\big[\log\big(1 - D_{\omega_i}(s, a_i)\big)\big] \qquad (11)$$
We update $\pi_{\theta}$ through reinforcement learning, where we also use a baseline to reduce variance. We outline the resulting algorithm, Multi-Agent GAIL (MAGAIL), in Appendix B.
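For concreteness, the discriminator side of Eq. (11) can be sketched as a sum of per-agent binary cross-entropy terms. The sign convention below (high $D_i$ on generator samples, low on expert samples) is one common choice, and all names are illustrative rather than taken from the paper's implementation:

```python
import numpy as np


def magail_discriminator_objective(d_policy, d_expert):
    """Sum over agents of E_pi[log D_i(s, a_i)] + E_piE[log(1 - D_i(s, a_i))]."""
    return sum(np.mean(np.log(dp)) + np.mean(np.log(1.0 - de))
               for dp, de in zip(d_policy, d_expert))


# Two agents; a discriminator that separates policy data from expert data
# attains an objective near 0 (the maximum), while a confused one scores lower.
sharp = magail_discriminator_objective(
    [np.array([0.95, 0.9]), np.array([0.9, 0.85])],
    [np.array([0.05, 0.1]), np.array([0.1, 0.15])])
confused = magail_discriminator_objective(
    [np.array([0.5, 0.5])] * 2, [np.array([0.5, 0.5])] * 2)
```

In practice each $D_{\omega_i}$ is a neural network trained by gradient ascent on this objective, alternating with policy updates for the generator.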
We can augment the reward regularizer $\psi$ using an indicator denoting whether the reward $r$ fits our prior knowledge; the augmented reward regularizer then equals $\psi(r)$ when the indicator holds and $+\infty$ otherwise. We introduce three types of prior knowledge for common settings.
Centralized
The easiest case is to assume that the agents are fully cooperative, i.e., they share the same reward function. Here $r_1 = \cdots = r_N$, and a single shared discriminator is used for all the agents. One could argue this corresponds to the GAIL case, where the RL procedure operates on multiple agents (a joint policy).
Decentralized
We make no prior assumptions over the correlation between the rewards. Here each agent $i$ has its own reward $r_i$ and its own discriminator $D_{\omega_i}$ operating on $(s, a_i)$. This corresponds to one discriminator for each agent, which discriminates the trajectories as observed by agent $i$. However, these discriminators are not learned independently, as they interact indirectly via the environment.
Zero-Sum
Assume there are two agents that receive opposite rewards, so $r_1 = -r_2$. As such, $\psi$ is no longer additively separable. Nevertheless, an adversarial training procedure can be designed using the following fact:
$$v(\pi_1, \pi_{E,2}) \le v(\pi_{E,1}, \pi_{E,2}) \le v(\pi_{E,1}, \pi_2)$$
where $v(\pi_1, \pi_2)$ is the expected outcome for agent 1 when the agents play $(\pi_1, \pi_2)$. The discriminator could maximize the reward for trajectories in $(\pi_{E,1}, \pi_2)$ and minimize the reward for trajectories in $(\pi_1, \pi_{E,2})$.
These three settings are summarized in Figure 1.
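The way each prior changes the reward the generator maximizes can be sketched as follows. The $-\log D_i$ shaping and the centralized averaging below are illustrative implementation choices under assumed names, not prescriptions from the paper:

```python
import numpy as np


def generator_rewards(d, setting):
    """Per-agent generator rewards from discriminator outputs D_i in (0, 1)."""
    d = np.asarray(d, dtype=float)
    if setting == "decentralized":
        return -np.log(d)                          # one reward per agent
    if setting == "centralized":
        return np.full_like(d, -np.log(d).mean())  # one shared reward
    if setting == "zero-sum":                      # two agents, r_2 = -r_1
        r1 = -np.log(d[0])
        return np.array([r1, -r1])
    raise ValueError(f"unknown setting: {setting}")


zs = generator_rewards([0.9, 0.1], "zero-sum")
ce = generator_rewards([0.9, 0.1], "centralized")
```

Decentralized rewards are independent per agent, centralized rewards are identical across agents, and zero-sum rewards sum to zero by construction.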
4.2 Multiagent actorcritic with Kronecker factors
To optimize over the generator parameters $\theta$ in Eq. 11, we wish to use an algorithm for multi-agent RL that has good sample efficiency in practice. Our algorithm, which we refer to as Multi-agent Actor-Critic with Kronecker-factors (MACK), is based on Actor-Critic with Kronecker-factored Trust Region (ACKTR; Wu et al., 2017), a state-of-the-art natural policy gradient (Amari, 1998; Kakade, 2002) method in deep RL. MACK uses the framework of centralized training with decentralized execution (Foerster et al., 2016); policies are trained with additional information to reduce variance, but such information is not used at execution time. We let the advantage function of every agent be a function of all agents' observations and actions:
$$A_i(s_t, a_t) = \sum_{j=0}^{k-1} \gamma^j r_i(s_{t+j}, a_{t+j}) + \gamma^k V_{\phi_i}(s_{t+k}, a_{-i,t+k}) - V_{\phi_i}(s_t, a_{-i,t}) \qquad (12)$$
where $V_{\phi_i}(s_t, a_{-i,t})$ is the baseline for agent $i$, utilizing the additional information $a_{-i,t}$ for variance reduction. We use (approximated) natural policy gradients to update both $\theta_i$ and $\phi_i$, but without trust regions to schedule the learning rate – a linear decay learning rate schedule achieves similar empirical performance.
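A minimal numpy sketch of a k-step advantage estimate of this form, with `values[t]` standing in for the centralized baseline $V_{\phi_i}(s_t, a_{-i,t})$ (the array names and the truncation near the episode end are illustrative assumptions):

```python
import numpy as np


def k_step_advantage(rewards, values, gamma, k):
    """A_t = sum_{j<h} gamma^j r_{t+j} + gamma^h V_{t+h} - V_t, h = min(k, T-1-t)."""
    T = len(rewards)
    adv = np.zeros(T)
    for t in range(T):
        h = min(k, T - 1 - t)  # truncate the lookahead near the end
        ret = sum(gamma ** j * rewards[t + j] for j in range(h))
        ret += gamma ** h * values[t + h]
        adv[t] = ret - values[t]
    return adv


adv = k_step_advantage(np.ones(5), np.zeros(5), gamma=1.0, k=2)
# With unit rewards and a zero baseline: [2, 2, 2, 1, 0]
```

Conditioning the baseline on $a_{-i,t}$ removes the variance contributed by the other agents' action choices while leaving the policy gradient unbiased.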
MACK has some notable differences from Multi-Agent Deep Deterministic Policy Gradient (MADDPG; Lowe et al., 2017). On the one hand, MACK does not assume knowledge of the other agents' policies, nor does it try to infer them; the value estimator merely collects experience from the other agents (and treats them as black boxes). On the other hand, MACK does not require gradient estimators such as Gumbel-softmax (Jang et al., 2016; Maddison et al., 2016) to optimize over discrete actions, which is necessary for DDPG (Lillicrap et al., 2015).
5 Experiments
We evaluate the performance of the centralized, decentralized, and zero-sum versions of MAGAIL under two types of environments. One is a particle environment, which allows for complex interactions and behaviors; the other is a control task, where multiple agents try to cooperate and move a plank forward. We collect results by averaging over 5 random seeds. Our implementation is based on OpenAI baselines (Dhariwal et al., 2017); please refer to Appendix C for implementation details.
We compare our methods (centralized, decentralized, and zero-sum MAGAIL) with two baselines. The first is behavior cloning (BC), which learns a maximum likelihood estimate of the expert action given each state and does not require actions from other agents. The second baseline is a GAIL IRL baseline that operates on each agent separately: for each agent, we first pretrain the other agents with BC, and then train the agent with GAIL; we then gather the trained GAIL policies from all the agents and evaluate their performance.
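The BC baseline reduces to maximum-likelihood estimation of the expert's conditional action distribution. In the tabular case this is just counting with smoothing; a minimal sketch with illustrative names (deep-network BC replaces the counts with a fitted classifier):

```python
import numpy as np


def behavior_cloning(pairs, n_states, n_actions, smoothing=1e-3):
    """MLE of pi(a | s) from expert (s, a) pairs, with additive smoothing."""
    counts = np.full((n_states, n_actions), smoothing)
    for s, a in pairs:
        counts[s, a] += 1.0
    return counts / counts.sum(axis=1, keepdims=True)


pi = behavior_cloning([(0, 1), (0, 1), (0, 0)], n_states=2, n_actions=2)
# pi[0] puts most mass on action 1; pi[1] stays uniform (state 1 unvisited).
```

The smoothing term keeps the policy well-defined on unvisited states, which is exactly where BC suffers from compounding errors at evaluation time.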
5.1 Particle environments
We first consider the particle environment proposed in (Lowe et al., 2017), which consists of several agents and landmarks. We consider two cooperative environments and two competitive ones. All environments have an underlying true reward function that allows us to evaluate the performance of learned agents.
The environments include: Cooperative Communication – two agents must cooperate to reach one of three colored landmarks. One agent ("speaker") knows the goal but cannot move, so it must convey the message to the other agent ("listener"), which moves but does not observe the goal. Cooperative Navigation – three agents must cooperate through physical actions to reach three landmarks; ideally, each agent should cover a single landmark. Keep-Away – two agents have contradictory goals: agent 1 tries to reach one of the two targeted landmarks, while agent 2 (the adversary) tries to keep agent 1 from reaching its target. The adversary does not observe the target, so it must act based on agent 1's actions. Predator-Prey – three slower cooperating adversaries must chase the faster agent in a randomly generated environment with obstacles; the adversaries are rewarded for touching the agent, while the agent is penalized.
For the cooperative tasks, we use an analytic expression to define the expert policy; for the competitive tasks, we use MACK to train expert policies based on the true underlying rewards (using larger policy and value networks than the ones we use for imitation). We then use the expert policies to simulate trajectories $\mathcal{D}$, and then perform imitation learning on $\mathcal{D}$ as demonstrations, where we assume the underlying rewards are unknown. Following (Li et al., 2017), we pretrain our Multi-Agent GAIL methods and the GAIL baseline using behavior cloning as initialization, to reduce the sample complexity of exploration. We consider 100 to 400 episodes of expert demonstrations, each with 50 timesteps, which is close to the number of timesteps used for the control tasks in Ho and Ermon (2016). Moreover, we randomly sample the starting positions of the agents and landmarks each episode, so our policies have to learn to generalize when they encounter new settings.
5.1.1 Cooperative tasks
We evaluate performance in cooperative tasks via the average expected reward obtained by all the agents in an episode. In this environment, the starting state is randomly initialized, so generalization is crucial. We do not consider the zerosum case, since it violates the cooperative nature of the task. We display the performance of centralized, decentralized, GAIL and BC in Figure 2.
Naturally, the performance of BC and MAGAIL increases with more expert demonstrations. MAGAIL performs consistently better than BC in all the settings; interestingly, in the cooperative communication task, centralized MAGAIL is able to achieve expertlevel performance with only 200 demonstrations, but BC fails to come close even with 400 trajectories. Moreover, the centralized MAGAIL performs slightly better than decentralized MAGAIL due to the better prior, but decentralized MAGAIL still learns a highly correlated reward between two agents.
Table 1: Agent rewards in the competitive tasks, pitting agents trained by one method against adversaries trained by another (BC = behavior cloning; G = GAIL; C, D, ZS = centralized, decentralized, and zero-sum MAGAIL).

Task: Predator-Prey
Agent:      BC      BC      BC      BC      BC      G       C       D       ZS
Adversary:  BC      G       C       D       ZS      BC      BC      BC      BC
Rewards:    93.20   93.71   93.75   95.22   95.48   90.55   91.36   85.00   89.4

Task: Keep-Away
Agent:      BC      BC      BC      BC      BC      G       C       D       ZS
Adversary:  BC      G       C       D       ZS      BC      BC      BC      BC
Rewards:    24.22   24.04   23.28   23.56   23.19   26.22   26.61   28.73   27.80
5.1.2 Competitive tasks
We consider all three types of Multi-Agent GAIL (centralized, decentralized, zero-sum) and BC in both competitive tasks. Since there are two opposing sides, it is hard to measure performance directly. Therefore, we compare by letting (agents trained by) BC play against (adversaries trained by) other methods, and vice versa. From Table 1, decentralized and zero-sum MAGAIL often perform better than centralized MAGAIL and BC, which suggests that the selection of a suitable prior is important for good empirical performance. More details for all the particle environments are in the appendix.
5.2 Cooperative control
In some cases, we are presented with sub-optimal expert demonstrations because the environment has changed; we consider this case in a cooperative control task (K. Gupta and Egorov, 2017), where bipedal walkers cooperate to move a long plank forward; the agents have an incentive to collaborate, since the plank is much longer than any of the agents. The expert demonstrates its policy in an environment with no bumps on the ground and heavy weights, while we perform imitation in a new environment with bumps and lighter weights (so one is likely to use too much force). Agents trained with BC tend to act more aggressively and fail, whereas agents trained with centralized MAGAIL can adapt to the new environment. With 10 (imperfect) expert demonstrations, BC agents fail with much higher probability (with a reward of 1.26), while centralized MAGAIL agents fail far less often (with a reward of 26.57). We show videos of the respective policies in the supplementary material.
6 Related work and discussion
There is a vast literature on single-agent imitation learning (Bagnell, 2015). Behavior cloning (BC) learns the policy through supervised learning (Pomerleau, 1991). Inverse reinforcement learning (IRL) assumes the expert policy optimizes some unknown reward, recovers that reward, and learns the policy through reinforcement learning (RL). BC does not require knowledge of the transition probabilities or access to the environment, but suffers from compounding errors and covariate shift (Ross and Bagnell, 2010; Ross et al., 2011).

Most existing work in multi-agent imitation learning assumes the agents have very specific reward structures. The most common case is fully cooperative agents, where the challenges mainly lie in other factors, such as unknown role assignments (Le et al., 2017), scalability to swarm systems (Šošic et al., 2016), and agents with partial observations (Bogert and Doshi, 2014). In non-cooperative settings, Lin et al. (2014) consider the case of IRL for two-player zero-sum games and cast the IRL problem as Bayesian inference, while Reddy et al. (2012) assume agents are non-cooperative but the reward function is a linear combination of pre-specified features.

Our work is the first to propose a general multi-agent IRL framework that bridges the gap between state-of-the-art multi-agent reinforcement learning methods (Lowe et al., 2017; Foerster et al., 2016) and implicit generative models such as generative adversarial networks (Goodfellow et al., 2014). Experimental results demonstrate that it is able to imitate complex behaviors in high-dimensional environments with both cooperative and adversarial interactions. An interesting research direction is to explore new techniques for gathering expert demonstrations; for example, when the expert is allowed to aid the agents by participating in part of the agent's learning process (Hadfield-Menell et al., 2016).
References

 Abbeel and Ng (2004) Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, page 1. ACM, 2004.  Amari (1998) Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural computation, 10(2):251–276, 1998.
 Amodei and Clark (2016) Dario Amodei and Jack Clark. Faulty reward functions in the wild, 2016.
 Amodei et al. (2016) Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.
 Bagnell (2015) J Andrew Bagnell. An invitation to imitation. Technical report, Carnegie Mellon University, Robotics Institute, Pittsburgh, PA, 2015.
 Bloem and Bambos (2014) Michael Bloem and Nicholas Bambos. Infinite time horizon maximum causal entropy inverse reinforcement learning. In Decision and Control (CDC), 2014 IEEE 53rd Annual Conference on, pages 4911–4916. IEEE, 2014.
 Bogert and Doshi (2014) Kenneth Bogert and Prashant Doshi. Multi-robot inverse reinforcement learning under occlusion with interactions. In Proceedings of the 2014 international conference on Autonomous agents and multi-agent systems, pages 173–180. International Foundation for Autonomous Agents and Multiagent Systems, 2014.
 Dhariwal et al. (2017) Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, and Yuhuai Wu. OpenAI Baselines. https://github.com/openai/baselines, 2017.
 Englert and Toussaint (2015) Peter Englert and Marc Toussaint. Inverse KKT: Learning cost functions of manipulation tasks from demonstrations. In Proceedings of the International Symposium of Robotics Research, 2015.
 Espeholt et al. (2018) Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymir Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, Shane Legg, and Koray Kavukcuoglu. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561, 2018.
 Filar and Vrieze (2012) Jerzy Filar and Koos Vrieze. Competitive Markov decision processes. Springer Science & Business Media, 2012.
 Finn et al. (2016) Chelsea Finn, Sergey Levine, and Pieter Abbeel. Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning, pages 49–58, 2016.
 Foerster et al. (2016) Jakob Foerster, Yannis Assael, Nando de Freitas, and Shimon Whiteson. Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, pages 2137–2145, 2016.
 Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
 Hadfield-Menell et al. (2016) Dylan Hadfield-Menell, Stuart J Russell, Pieter Abbeel, and Anca Dragan. Cooperative inverse reinforcement learning. In Advances in neural information processing systems, pages 3909–3917, 2016.
 Hadfield-Menell et al. (2017) Dylan Hadfield-Menell, Smitha Milli, Pieter Abbeel, Stuart J Russell, and Anca Dragan. Inverse reward design. In Advances in Neural Information Processing Systems, pages 6768–6777, 2017.
 Ho and Ermon (2016) Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pages 4565–4573, 2016.
 Hu et al. (1998) Junling Hu, Michael P Wellman, et al. Multiagent reinforcement learning: theoretical framework and an algorithm. In ICML, volume 98, pages 242–250. Citeseer, 1998.
 Jang et al. (2016) Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.
 K. Gupta and Egorov (2017) Jayesh K. Gupta and Maxim Egorov. Multi-agent deep reinforcement learning environment. https://github.com/sisl/madrl, 2017.
 Kakade (2002) Sham M Kakade. A natural policy gradient. In Advances in neural information processing systems, pages 1531–1538, 2002.
 Le et al. (2017) Hoang M Le, Yisong Yue, and Peter Carr. Coordinated multi-agent imitation learning. arXiv preprint arXiv:1703.03121, 2017.
 Leibo et al. (2017) Joel Z Leibo, Vinicius Zambaldi, Marc Lanctot, Janusz Marecki, and Thore Graepel. Multi-agent reinforcement learning in sequential social dilemmas. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, pages 464–473. International Foundation for Autonomous Agents and Multiagent Systems, 2017.
 Li et al. (2017) Yunzhu Li, Jiaming Song, and Stefano Ermon. InfoGAIL: Interpretable imitation learning from visual demonstrations. arXiv preprint arXiv:1703.08840, 2017.
 Lillicrap et al. (2015) Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
 Lin et al. (2014) Xiaomin Lin, Peter A Beling, and Randy Cogill. Multi-agent inverse reinforcement learning for zero-sum games. arXiv preprint arXiv:1403.6508, 2014.
 Littman (1994) Michael L Littman. Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the Eleventh International Conference on Machine Learning, volume 157, pages 157–163, 1994.
 Lowe et al. (2017) Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. arXiv preprint arXiv:1706.02275, 2017.
 Maddison et al. (2016) Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.

 Martens and Grosse (2015) James Martens and Roger Grosse. Optimizing neural networks with kronecker-factored approximate curvature. In International Conference on Machine Learning, pages 2408–2417, 2015.
 Matignon et al. (2012) Laëtitia Matignon, Laurent Jeanpierre, Abdel-Illah Mouaddib, et al. Coordinated multi-robot exploration under communication constraints using decentralized Markov decision processes. In AAAI, 2012.
 Ng et al. (2000) Andrew Y Ng, Stuart J Russell, et al. Algorithms for inverse reinforcement learning. In ICML, pages 663–670, 2000.
 Peng et al. (2017) Peng Peng, Quan Yuan, Ying Wen, Yaodong Yang, Zhenkun Tang, Haitao Long, and Jun Wang. Multiagent bidirectionally-coordinated nets for learning to play StarCraft combat games. arXiv preprint arXiv:1703.10069, 2017.
 Pomerleau (1991) Dean A Pomerleau. Efficient training of artificial neural networks for autonomous navigation. Neural Computation, 3(1):88–97, 1991.
 Prasad and Bhatnagar (2015) HL Prasad and Shalabh Bhatnagar. A study of gradient descent schemes for general-sum stochastic games. arXiv preprint arXiv:1507.00093, 2015.
 Reddy et al. (2012) Tummalapalli Sudhamsh Reddy, Vamsikrishna Gopikrishna, Gergely Zaruba, and Manfred Huber. Inverse reinforcement learning for decentralized non-cooperative multiagent systems. In Systems, Man, and Cybernetics (SMC), 2012 IEEE International Conference on, pages 1930–1935. IEEE, 2012.
 Ross and Bagnell (2010) Stéphane Ross and Drew Bagnell. Efficient reductions for imitation learning. In AISTATS, pages 3–5, 2010.
 Ross et al. (2011) Stéphane Ross, Geoffrey J Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS, page 6, 2011.
 Šošić et al. (2016) Adrian Šošić, Wasiur R KhudaBukhsh, Abdelhak M Zoubir, and Heinz Koeppl. Inverse reinforcement learning in swarm systems. stat, 1050:17, 2016.
 Stadie et al. (2017) Bradly Stadie, Pieter Abbeel, and Ilya Sutskever. Third person imitation learning. In ICLR, 2017.
 Sutton and Barto (1998) Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.
 Wu et al. (2017) Yuhuai Wu, Elman Mansimov, Roger B Grosse, Shun Liao, and Jimmy Ba. Scalable trust-region method for deep reinforcement learning using kronecker-factored approximation. In Advances in Neural Information Processing Systems, pages 5285–5294, 2017.
 Ziebart et al. (2008) Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438. Chicago, IL, USA, 2008.
Appendix A Proofs
We use , and to represent , and , where we implicitly assume dependency over and .
A.1 Proof of Lemma A.1
For any policy , when is the value function of (due to the Bellman equations). However, only policies that form a Nash equilibrium satisfy the constraints in Eq. 2; we formalize this in the following lemma. Let be the solution to the Bellman equation
and . Then for any ,
Furthermore, is a Nash equilibrium under if and only if for all .
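Although the symbols in the lemma statement are omitted here, the per-agent Bellman equation it refers to can be sketched in standard Markov-game notation; all symbols below ($q_i$, $v_i$, $r_i$, $\pi_{-i}$) are illustrative placeholders rather than the paper's own definitions:

```latex
% Illustrative notation: q_i and v_i are agent i's action-value and value
% under joint policy \pi; \pi_{-i} collects the policies of all other agents.
\begin{align*}
q_i(s, a_i) &= \mathbb{E}_{a_{-i} \sim \pi_{-i}(\cdot \mid s)}
  \Big[ r_i(s, a_i, a_{-i})
        + \gamma \, \mathbb{E}_{s' \sim P(\cdot \mid s, a_i, a_{-i})}
          \big[ v_i(s') \big] \Big], \\
v_i(s)      &= \mathbb{E}_{a_i \sim \pi_i(\cdot \mid s)}
  \big[ q_i(s, a_i) \big].
\end{align*}
```

In this notation, the Nash equilibrium constraints read $v_i(s) \ge q_i(s, a_i)$ for every agent $i$, state $s$, and action $a_i$: no agent can gain by deviating unilaterally at any state.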
By definition of we have:
which uses the fact that and are independent conditioned on . Hence immediately follows.
If is a Nash equilibrium, and at least one of the constraints does not hold, i.e. there exists some and such that , then agent can achieve a strictly higher expected return if it chooses to take actions whenever it encounters state and follows for the rest of the states, which violates the Nash equilibrium assumption.
If the constraints hold, i.e. for all and , then
so value iteration over converges. If we can find another policy such that , then there must be at least one violation of the constraints, since must be a convex combination (expectation) over actions . Therefore, for any policy and action of any agent , always holds, so is the optimal response to , and constitutes a Nash equilibrium when we repeat this argument for all agents.
Notably, Theorem 3.8.2 in Filar and Vrieze (2012) discusses the equivalence by assuming for some ; if satisfies the assumptions, then .
A.2 Proof of Theorem 3.1
If is a Nash equilibrium, and at least one of the constraints does not hold, i.e. there exists some and , such that
Then agent can achieve a strictly higher expected return on its own if it chooses a particular sequence of actions by taking whenever it encounters state , and following for the remaining states. We note that this is in expectation over the policies of the other agents. Hence, we have constructed a policy for agent with strictly higher value than without modifying , which contradicts the definition of Nash equilibrium.
If the constraints hold, i.e. for all and ,
then we can construct any via a convex combination by taking the expectation over :
where the expectation over is taken over actions (the expectation over states is contained in the inner expectation over ). Therefore, ,
and we recover the constraints in Eq. 2. By Lemma A.1, is a Nash equilibrium.
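The convex-combination step can be made concrete in illustrative notation (again, $q_i$, $v_i$, and $\hat\pi_i$ are placeholder symbols, not the paper's own): since any policy's value at a state is an expectation of action-values, it can never exceed a value function that already dominates every individual action:

```latex
% \hat\pi_i is an arbitrary alternative policy for agent i; the final
% inequality uses the constraints v_i(s) >= q_i(s, a_i) for all a_i.
\begin{align*}
v_i^{\hat\pi_i}(s)
  \;=\; \mathbb{E}_{a_i \sim \hat\pi_i(\cdot \mid s)}
        \big[ q_i(s, a_i) \big]
  \;\le\; \max_{a_i} \, q_i(s, a_i)
  \;\le\; v_i(s).
\end{align*}
```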
A.3 Proof of Theorem 3.2
We use to denote the , and quantities defined for policy . For the two terms in we have:
(13) 
For any agent , we note that
which amounts to using for agent for the first steps and using for the remaining steps, whereas other agents follow . As , this converges to since and is bounded. Moreover, for , we have
Combining the two we have
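The limiting step in this proof relies only on discounting: switching (or truncating) a policy after step $t$ perturbs the expected return by at most a geometric tail term. Writing $R_{\max}$ for a bound on the per-step reward (a placeholder symbol), one version of the estimate is:

```latex
% Switching policies after step t only affects rewards from step t onward,
% each discounted by at least \gamma^t, so the difference vanishes as t grows.
\begin{align*}
\Big| \, \mathbb{E}\Big[ \sum_{k=0}^{\infty} \gamma^k r_k \Big]
      - \mathbb{E}\Big[ \sum_{k=0}^{t-1} \gamma^k r_k \Big] \Big|
  \;\le\; \sum_{k=t}^{\infty} \gamma^k R_{\max}
  \;=\; \frac{\gamma^t}{1 - \gamma} \, R_{\max}
  \;\xrightarrow{\; t \to \infty \;}\; 0 .
\end{align*}
```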