1 Introduction
Reinforcement learning (RL) is a general and powerful framework for decision making under uncertainty. Recent advances in deep learning have enabled a variety of RL applications such as games (Silver et al., 2016; Mnih et al., 2015), robotics (Gu et al., 2017; Levine et al., 2016), automated machine learning (Zoph & Le, 2016) and generative modeling (Yu et al., 2017). RL algorithms are also showing promise in multiagent systems, where multiple agents interact with each other, such as multiplayer games (Peng et al., 2017), social interactions (Leibo et al., 2017) and multirobot control systems (Matignon et al., 2012). However, the success of RL crucially depends on careful reward design (Amodei et al., 2016). As reinforcement learning agents are prone to undesired behaviors due to reward misspecification (Amodei & Clark, 2016), designing suitable reward functions can be challenging in many real-world applications (Hadfield-Menell et al., 2017). In multiagent systems, since different agents may have completely different goals and state-action representations, hand-tuning reward functions becomes increasingly challenging as we take more agents into consideration.
Imitation learning presents a direct approach to programming agents with expert demonstrations, where agents learn to produce behaviors similar to the demonstrations. However, imitation learning algorithms, such as behavior cloning (Pomerleau, 1991) and generative adversarial imitation learning (Ho & Ermon, 2016; Ho et al., 2016), typically sidestep the problem of inferring an explicit representation for the underlying reward functions. Because the reward function is often considered the most succinct, robust and transferable representation of a task (Abbeel & Ng, 2004; Fu et al., 2017), it is important to consider the problem of inferring reward functions from expert demonstrations, which we refer to as inverse reinforcement learning (IRL). IRL can offer many advantages compared to direct policy imitation, such as analyzing and debugging an imitation learning algorithm, inferring agents' intentions and re-optimizing rewards in new environments (Ng et al., 2000).
However, IRL is ill-defined, as many policies can be optimal for a given reward and many reward functions can explain a set of demonstrations. Maximum entropy inverse reinforcement learning (MaxEnt IRL) (Ziebart et al., 2008) provides a general probabilistic framework to resolve the ambiguity by finding the trajectory distribution with maximum entropy that matches the reward expectation of the experts. As MaxEnt IRL requires solving an integral over all possible trajectories for computing the partition function, it is only suitable for small-scale problems with known dynamics. Adversarial IRL (Finn et al., 2016a; Fu et al., 2017) scales MaxEnt IRL to large and continuous problems by drawing an analogy between a sampling-based approximation of MaxEnt IRL and Generative Adversarial Networks (Goodfellow et al., 2014) with a particular discriminator structure. However, the approach is restricted to single-agent settings.
In this paper, we consider the IRL problem in multiagent environments with high-dimensional continuous state-action space and unknown dynamics. Generalizing MaxEnt IRL and Adversarial IRL to multiagent systems is challenging. Since each agent's optimal policy depends on other agents' policies, the notion of optimality, central to Markov decision processes, has to be replaced by an appropriate equilibrium solution concept. Nash equilibrium (Hu et al., 1998) is the most popular solution concept for multiagent RL, where each agent's policy is the best response to others. However, Nash equilibrium is incompatible with MaxEnt RL in the sense that it assumes the agents never take suboptimal actions. Thus imitation learning and inverse reinforcement learning methods based on Nash equilibrium or correlated equilibrium (Aumann, 1974) might lack the ability to handle irrational (or computationally bounded) experts.
In this paper, inspired by logistic quantal response equilibrium (McKelvey & Palfrey, 1995, 1998) and Gibbs sampling (Hastings, 1970), we propose a new solution concept termed logistic stochastic best response equilibrium (LSBRE), which allows us to characterize the trajectory distribution induced by parameterized reward functions and handle the bounded rationality of expert demonstrations in a principled manner. Specifically, by uncovering the close relationship between LSBRE and MaxEnt RL, and bridging the optimization of joint likelihood and conditional likelihood with maximum pseudolikelihood estimation, we propose Multi-Agent Adversarial Inverse Reinforcement Learning (MAAIRL), a novel MaxEnt IRL framework for Markov games. MAAIRL is effective and scalable to large high-dimensional Markov games with unknown dynamics, which are not amenable to previous methods relying on tabular representation and linear or quadratic programming (Natarajan et al., 2010; Waugh et al., 2013; Lin et al., 2014, 2018). We experimentally demonstrate that MAAIRL is able to recover reward functions that are highly correlated with the ground truth rewards, while simultaneously learning policies that significantly outperform state-of-the-art multiagent imitation learning algorithms (Song et al., 2018) in mixed cooperative and competitive tasks (Lowe et al., 2017).
2 Preliminaries
2.1 Markov Games
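Markov games (Littman, 1994) are generalizations of Markov decision processes (MDPs) to the case of $N$ interacting agents. A Markov game $(\mathcal{S}, \mathcal{A}, P, \eta, \boldsymbol{r})$ is defined via a set of states $\mathcal{S}$, and $N$ sets of actions $\{\mathcal{A}_i\}_{i=1}^N$. The function $P: \mathcal{S} \times \mathcal{A}_1 \times \cdots \times \mathcal{A}_N \to \mathcal{P}(\mathcal{S})$ describes the (stochastic) transition process between states, where $\mathcal{P}(\mathcal{S})$ denotes the set of probability distributions over the set $\mathcal{S}$. Given that we are in state $s^t$ at time $t$ and the agents take actions $(a_1, \ldots, a_N)$, the state transitions to $s^{t+1}$ with probability $P(s^{t+1} \mid s^t, a_1, \ldots, a_N)$. Each agent $i$ obtains a (bounded) reward given by a function $r_i: \mathcal{S} \times \mathcal{A}_1 \times \cdots \times \mathcal{A}_N \to \mathbb{R}$. The function $\eta \in \mathcal{P}(\mathcal{S})$ specifies the distribution of the initial state. We use bold variables without subscript $i$ to denote the concatenation of all variables for all agents (e.g., $\boldsymbol{\pi}$ denotes the joint policy, $\boldsymbol{r}$ denotes all rewards and $\boldsymbol{a}$ denotes actions of all agents in a multiagent setting). We use subscript $-i$ to denote all agents except for $i$. For example, $(a_i, \boldsymbol{a}_{-i})$ represents $(a_1, \ldots, a_N)$, the actions of all $N$ agents. The objective of each agent $i$ is to maximize its own expected return (i.e., the expected sum of discounted rewards) $\mathbb{E}_{\boldsymbol{\pi}}\left[\sum_{t=1}^{T} \gamma^t r_{i,t}\right]$, where $\gamma$ is the discount factor and $r_{i,t}$ is the reward received $t$ steps into the future. Each agent achieves its own objective by selecting actions through a stochastic policy $\pi_i: \mathcal{S} \to \mathcal{P}(\mathcal{A}_i)$. Depending on the context, the policies can be Markovian (i.e., depend only on the state) or require additional coordination signals. For each agent $i$, we further define the expected return for a state-action pair as:
$$\mathrm{ExpRet}_i^{\pi_i, \boldsymbol{\pi}_{-i}}(s^t, \boldsymbol{a}^t) = \mathbb{E}_{s^{t+1:T}, \boldsymbol{a}^{t+1:T}}\left[\sum_{l \ge t} \gamma^{l-t} r_i(s^l, \boldsymbol{a}^l) \,\middle|\, s^t, \boldsymbol{a}^t, \boldsymbol{\pi}\right]$$
2.2 Solution Concepts for Markov Games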
A correlated equilibrium (CE) for a Markov game (Ziebart et al., 2011) is a joint strategy profile in which no agent can achieve higher expected reward by unilaterally changing its own policy. CE, first introduced by Aumann (1974, 1987), is a more general solution concept than the well-known Nash equilibrium (NE) (Hu et al., 1998), which further requires agents' actions in each state to be independent, i.e. $\boldsymbol{\pi}(\boldsymbol{a} \mid s) = \prod_{i=1}^N \pi_i(a_i \mid s)$. It has been shown that many decentralized, adaptive strategies will converge to CE instead of a more restrictive equilibrium such as NE (Gordon et al., 2008; Hart & Mas-Colell, 2000). To take bounded rationality into consideration, McKelvey & Palfrey (1995, 1998) further propose logistic quantal response equilibrium (LQRE) as a stochastic generalization of NE and CE.
Definition 1.
A logistic quantal response equilibrium for a Markov game corresponds to any strategy profile satisfying a set of constraints, where for each state and action, the constraint is given by:
$$\pi_i(a_i \mid s) = \frac{\exp\left(\lambda\, \mathrm{ExpRet}_i^{\boldsymbol{\pi}}(s, a_i, \boldsymbol{a}_{-i})\right)}{\sum_{a_i'} \exp\left(\lambda\, \mathrm{ExpRet}_i^{\boldsymbol{\pi}}(s, a_i', \boldsymbol{a}_{-i})\right)}$$
Intuitively, in LQRE, agents choose actions with higher expected return with higher probability.
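The intuition above can be illustrated with a minimal sketch of a logistic (softmax) response. The function name `logistic_response`, the rationality parameter `lam` and the three expected returns are hypothetical choices made for illustration only; they are not taken from the paper's code.

```python
import math

def logistic_response(expected_returns, lam):
    """Return action probabilities proportional to exp(lam * expected return).

    Subtracting the maximum before exponentiating avoids overflow and does
    not change the resulting probabilities.
    """
    top = max(expected_returns)
    weights = [math.exp(lam * (value - top)) for value in expected_returns]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical expected returns of one agent for three actions.
returns = [1.0, 2.0, 0.5]

uniform = logistic_response(returns, 0.0)    # lam = 0: every action equally likely
soft = logistic_response(returns, 1.0)       # higher return, higher probability
greedy = logistic_response(returns, 50.0)    # large lam: close to a best response
```

With `lam = 0` the response ignores the returns entirely, and as `lam` grows the probability mass concentrates on the action with the highest expected return, which is one way to read the bounded-rationality interpretation.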
2.3 Learning from Expert Demonstrations
Suppose we do not have access to the ground truth reward signal $\boldsymbol{r}$, but have demonstrations $\mathcal{D}$ provided by an expert ($N$ expert agents in Markov games). $\mathcal{D}$ is a set of trajectories $\{\tau_j\}_{j=1}^M$, where $\tau_j = \{(s_j^t, \boldsymbol{a}_j^t)\}_{t=1}^T$ is an expert trajectory collected by sampling $s^1 \sim \eta(s)$, $\boldsymbol{a}^t \sim \boldsymbol{\pi}_E(\boldsymbol{a}^t \mid s^t)$, $s^{t+1} \sim P(s^{t+1} \mid s^t, \boldsymbol{a}^t)$. $\mathcal{D}$ contains the entire supervision to the learning algorithm, i.e., we assume we cannot ask for additional interactions with the experts during training. Given $\mathcal{D}$, imitation learning (IL) aims to directly learn policies that behave similarly to these demonstrations, whereas inverse reinforcement learning (IRL) (Russell, 1998; Ng et al., 2000) seeks to infer the underlying reward functions which induce the expert policies.
The MaxEnt IRL framework (Ziebart et al., 2008) aims to recover a reward function that rationalizes the expert behaviors with the least commitment, denoted as $\mathrm{IRL}(\pi_E)$:
$$\mathrm{IRL}(\pi_E) = \arg\max_{r \in \mathbb{R}^{\mathcal{S} \times \mathcal{A}}} \mathbb{E}_{\pi_E}[r(s, a)] - \mathrm{RL}(r), \qquad \mathrm{RL}(r) = \max_{\pi} \mathcal{H}(\pi) + \mathbb{E}_{\pi}[r(s, a)]$$
where $\mathcal{H}(\pi) = \mathbb{E}_{\pi}[-\log \pi(a \mid s)]$ is the policy entropy. However, MaxEnt IRL is generally considered less efficient and scalable than direct imitation, as we need to solve a forward RL problem in the inner loop. In the context of imitation learning, Ho & Ermon (2016) proposed to use generative adversarial training (Goodfellow et al., 2014) to learn the policies characterized by $\mathrm{RL} \circ \mathrm{IRL}(\pi_E)$ directly, leading to the Generative Adversarial Imitation Learning (GAIL) algorithm:
$$\min_{\theta} \max_{\omega} \; \mathbb{E}_{\pi_E}[\log D_\omega(s, a)] + \mathbb{E}_{\pi_\theta}[\log(1 - D_\omega(s, a))]$$
where $D_\omega$ is a discriminator that classifies expert and policy trajectories, and $\pi_\theta$ is the parameterized policy that tries to maximize its score under $D_\omega$. According to Goodfellow et al. (2014), with infinite data and infinite computation, at optimality, the distribution of generated state-action pairs should exactly match the distribution of demonstrated state-action pairs under the GAIL objective. The downside to this approach, however, is that we bypass the intermediate step of recovering rewards. Specifically, note that we cannot extract reward functions from the discriminator, as $D_\omega(s, a)$ will converge to 0.5 for all $(s, a)$ pairs.
2.4 Adversarial Inverse Reinforcement Learning
Besides resolving the ambiguity that many optimal rewards can explain a set of demonstrations, another advantage of MaxEnt IRL is that it can be interpreted as solving the following maximum likelihood estimation (MLE) problem:
$$p_\omega(\tau) \propto \left[\eta(s^1) \prod_{t=1}^T P(s^{t+1} \mid s^t, a^t)\right] \exp\left(\sum_{t=1}^T r_\omega(s^t, a^t)\right)$$
$$\max_{\omega} \; \mathbb{E}_{\pi_E}[\log p_\omega(\tau)] = \mathbb{E}_{\tau \sim \pi_E}\left[\sum_{t=1}^T r_\omega(s^t, a^t)\right] - \log Z_\omega \tag{1}$$
Here, $\omega$ are the parameters of the reward function and $Z_\omega$ is the partition function, i.e. an integral over all possible trajectories consistent with the environment dynamics. $Z_\omega$ is intractable to compute when the state-action spaces are large or continuous, and the environment dynamics are unknown.
Combining Guided Cost Learning (GCL) (Finn et al., 2016b) and generative adversarial training, Finn et al. (2016a) and Fu et al. (2017) proposed the adversarial IRL framework as an efficient sampling-based approximation to MaxEnt IRL, where the discriminator takes on a particular form:
$$D_\omega(s, a) = \frac{\exp(f_\omega(s, a))}{\exp(f_\omega(s, a)) + q(a \mid s)}$$
where $f_\omega(s, a)$ is the learned function, $q(a \mid s)$ is the probability of the adaptive sampler pre-computed as an input to the discriminator, and the policy is trained to maximize $\log D - \log(1 - D)$.
To alleviate the reward shaping ambiguity (Ng et al., 1999), where many reward functions can explain an optimal policy, Fu et al. (2017) further restricted $f$ to a reward estimator $g_\omega$ and a potential shaping function $h_\phi$:
$$f_{\omega, \phi}(s^t, a^t, s^{t+1}) = g_\omega(s^t, a^t) + \gamma h_\phi(s^{t+1}) - h_\phi(s^t)$$
It has been shown that under suitable assumptions, $g_\omega$ and $h_\phi$ will recover the true reward and value function up to a constant.
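As a small numerical check of the discriminator structure described in this section, the sketch below assumes the adversarial IRL form D = exp(f) / (exp(f) + q) from Fu et al. (2017) and verifies that the policy's training signal log D - log(1 - D) reduces to f - log q. The function names and the two input numbers are hypothetical.

```python
import math

def discriminator(f_value, q_prob):
    """Assumed adversarial IRL discriminator: D = exp(f) / (exp(f) + q)."""
    scaled = math.exp(f_value)
    return scaled / (scaled + q_prob)

def policy_signal(f_value, q_prob):
    """Training signal for the sampler: log D - log(1 - D)."""
    d = discriminator(f_value, q_prob)
    return math.log(d) - math.log(1.0 - d)

# Hypothetical values: learned function output f(s, a) and sampler density q(a | s).
f_value, q_prob = 0.3, 0.25

signal = policy_signal(f_value, q_prob)
closed_form = f_value - math.log(q_prob)       # f - log q

# When exp(f) equals q, the discriminator cannot tell the two sources apart.
undecided = discriminator(math.log(q_prob), q_prob)
```

Under this form the sampler is rewarded with an entropy-regularized version of f, which is why f can be read as a reward-like quantity, unlike an unstructured discriminator.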
3 Method
3.1 Logistic Stochastic Best Response Equilibrium
To extend MaxEnt IRL to Markov games, we need to be able to characterize the trajectory distribution induced by a set of (parameterized) reward functions (analogous to Equation (1)). However, the existing optimality notions introduced in Section 2.2 do not explicitly define a tractable joint strategy profile that we can use to maximize the likelihood of expert demonstrations (as a function of the rewards); they do so implicitly as the solution to a set of constraints.
Motivated by Gibbs sampling (Hastings, 1970), dependency networks (Heckerman et al., 2000), best response dynamics (Nisan et al., 2011; Gandhi, 2012) and LQRE, we propose a new solution concept that allows us to characterize rational (joint) policies induced from a set of reward functions. Intuitively, our solution concept corresponds to the result of repeatedly applying a stochastic (entropy-regularized) best response mechanism, where each agent (in turn) attempts to optimize its actions while keeping the other agents' actions fixed.
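To begin with, let us first consider a stateless single-shot normal-form game with $N$ players and a reward function $r_i: \mathcal{A}_1 \times \cdots \times \mathcal{A}_N \to \mathbb{R}$ for each player $i$. We consider the following Markov chain over $(\mathcal{A}_1 \times \cdots \times \mathcal{A}_N)$, where the state of the Markov chain at step $k$ is denoted $\boldsymbol{z}^{(k)} = (z_1, \ldots, z_N)^{(k)}$, with each random variable $z_i^{(k)}$ taking values in $\mathcal{A}_i$. The transition kernel of the Markov chain is defined by the following equations:
$$z_i^{(k+1)} \sim P_i\left(a_i \mid \boldsymbol{a}_{-i} = \boldsymbol{z}_{-i}^{(k)}\right) = \frac{\exp\left(\lambda r_i(a_i, \boldsymbol{z}_{-i}^{(k)})\right)}{\sum_{a_i'} \exp\left(\lambda r_i(a_i', \boldsymbol{z}_{-i}^{(k)})\right)} \tag{2}$$
and each agent $i$ is updated in scan order. Given all other players' actions $\boldsymbol{z}_{-i}^{(k)}$, the $i$-th player picks an action proportionally to $\exp(\lambda r_i(a_i, \boldsymbol{z}_{-i}^{(k)}))$, where $\lambda > 0$ is a parameter that controls the level of rationality of the agents. For $\lambda \to 0$, the agent will select actions uniformly at random, while for $\lambda \to \infty$, the agent will select actions greedily (best response). Because the Markov chain is ergodic, it admits a unique stationary distribution which we denote $\boldsymbol{\pi}(\boldsymbol{a})$. Interpreting this stationary distribution over $(\mathcal{A}_1 \times \cdots \times \mathcal{A}_N)$ as a policy, we call this stationary joint policy $\boldsymbol{\pi} \in \mathcal{P}(\mathcal{A}_1 \times \cdots \times \mathcal{A}_N)$ a logistic stochastic best response equilibrium for normal-form games.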
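To make the normal-form construction concrete, here is a minimal sketch for a two-player, two-action game. It assumes the logistic conditional in which a player resamples its action with probability proportional to exp(lam * r_i), builds the exact transition kernel of one scan-order sweep, and approximates the stationary joint distribution by power iteration. The helper names and the matching-pennies style rewards are hypothetical.

```python
import math
from itertools import product

def conditional(reward_i, player, other_action, n_actions, lam):
    """Logistic response of `player` given the other player's action."""
    def joint(own_action):
        return (own_action, other_action) if player == 0 else (other_action, own_action)
    weights = [math.exp(lam * reward_i[joint(a)]) for a in range(n_actions)]
    total = sum(weights)
    return [w / total for w in weights]

def sweep_kernel(rewards, n_actions, lam):
    """Exact kernel of one sweep: player 0 resamples first, then player 1."""
    states = list(product(range(n_actions), repeat=2))
    kernel = {}
    for (a0, a1) in states:
        p0 = conditional(rewards[0], 0, a1, n_actions, lam)
        for b0 in range(n_actions):
            # Player 1 responds to player 0's freshly sampled action b0.
            p1 = conditional(rewards[1], 1, b0, n_actions, lam)
            for b1 in range(n_actions):
                kernel[((a0, a1), (b0, b1))] = p0[b0] * p1[b1]
    return states, kernel

def stationary(states, kernel, iters=500):
    """Power iteration from the uniform distribution."""
    dist = {s: 1.0 / len(states) for s in states}
    for _ in range(iters):
        dist = {t: sum(dist[s] * kernel[(s, t)] for s in states) for t in states}
    return dist

# Hypothetical zero-sum rewards: player 0 wants to match, player 1 to mismatch.
r0 = {(0, 0): 1.0, (0, 1): -1.0, (1, 0): -1.0, (1, 1): 1.0}
r1 = {a: -v for a, v in r0.items()}

states, kernel = sweep_kernel([r0, r1], n_actions=2, lam=1.0)
pi = stationary(states, kernel)
```

Because every entry of the sweep kernel is strictly positive for finite `lam`, the chain is ergodic and the power iteration converges to a single stationary joint distribution regardless of the starting point.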
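Now let us generalize the solution concept to Markov games. For each agent $i$, let $\{\pi_i^t: \mathcal{S} \to \mathcal{P}(\mathcal{A}_i)\}_{t=1}^T$ denote a set of time-dependent policies. First we define the state-action value function for each agent $i$. Starting from the base case:
$$Q_i^{\boldsymbol{\pi}^{T}}(s^T, a_i^T, \boldsymbol{a}_{-i}^T) = r_i(s^T, a_i^T, \boldsymbol{a}_{-i}^T)$$
then we recursively define:
$$Q_i^{\boldsymbol{\pi}^{t:T}}(s^{t-1}, a_i^{t-1}, \boldsymbol{a}_{-i}^{t-1}) = r_i(s^{t-1}, a_i^{t-1}, \boldsymbol{a}_{-i}^{t-1}) + \mathbb{E}_{s^t \sim P(\cdot \mid s^{t-1}, \boldsymbol{a}^{t-1})}\left[\mathcal{H}(\pi_i^t(\cdot \mid s^t)) + \mathbb{E}_{\boldsymbol{a}^t \sim \boldsymbol{\pi}^t(\cdot \mid s^t)}\left[Q_i^{\boldsymbol{\pi}^{t+1:T}}(s^t, \boldsymbol{a}^t)\right]\right]$$
which generalizes the standard state-action value function in single-agent RL ($\boldsymbol{a}_{-i} = \emptyset$ when $N = 1$).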
Definition 2.
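Given a Markov game with horizon $T$, the logistic stochastic best response equilibrium (LSBRE) is a sequence of $T$ stochastic policies $\{\boldsymbol{\pi}^t\}_{t=1}^T$ constructed by the following process. Consider $T$ Markov chains over $(\mathcal{A}_1 \times \cdots \times \mathcal{A}_N)^{|\mathcal{S}|}$, where the state of the $t$-th Markov chain at step $k$ is $\{z_i^{t,(k)}: \mathcal{S} \to \mathcal{A}_i\}_{i=1}^N$, with each random variable $z_i^{t,(k)}(s^t)$ taking values in $\mathcal{A}_i$. For $t \in [T, \ldots, 1]$, we recursively define the stationary joint distribution $\boldsymbol{\pi}^t(\boldsymbol{a} \mid s^t)$ of the $t$-th Markov chain in terms of $\{\boldsymbol{\pi}^{\ell}\}_{\ell=t+1}^T$ as follows. For $s^t \in \mathcal{S}$ and $i \in [1, \ldots, N]$, we update the state of the Markov chain as:
$$z_i^{t,(k+1)}(s^t) \sim P_i^t\left(a_i^t \mid \boldsymbol{a}_{-i}^t = \boldsymbol{z}_{-i}^{t,(k)}(s^t), s^t\right) = \frac{\exp\left(\lambda Q_i^{\boldsymbol{\pi}^{t+1:T}}(s^t, a_i^t, \boldsymbol{z}_{-i}^{t,(k)}(s^t))\right)}{\sum_{a_i'} \exp\left(\lambda Q_i^{\boldsymbol{\pi}^{t+1:T}}(s^t, a_i', \boldsymbol{z}_{-i}^{t,(k)}(s^t))\right)} \tag{3}$$
where parameter $\lambda \in \mathbb{R}^{+}$ controls the level of rationality of the agents, and $\{P_i^t\}_{i=1}^N$ specifies a set of conditional distributions. LSBRE for a Markov game is the sequence of $T$ joint stochastic policies $\{\boldsymbol{\pi}^t\}_{t=1}^T$. Each joint policy $\boldsymbol{\pi}^t: \mathcal{S} \to \mathcal{P}(\mathcal{A}_1 \times \cdots \times \mathcal{A}_N)$ is given by:
$$\boldsymbol{\pi}^t(a_1, \ldots, a_N \mid s^t) = P\left(\bigcap_i \left\{z_i^{t,(k)}(s^t) = a_i\right\}\right) \tag{4}$$
where the probability is taken with respect to the unique stationary distribution of the $t$-th Markov chain.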
When the conditionals in Equation (2) are compatible (in the sense that each conditional can be inferred from the same joint distribution (Arnold & Press, 1989)), the above process corresponds to a Gibbs sampler, which will converge to a stationary joint distribution $\boldsymbol{\pi}(\boldsymbol{a})$, whose conditional distributions are consistent with the ones used during sampling, namely Equation (2). This is the case, for example, if the agents are cooperative, i.e., they share the same reward function $r$. In general, $\boldsymbol{\pi}(\boldsymbol{a})$ is the distribution specified by the dependency network (Heckerman et al., 2000) defined via the conditionals in Equation (2). The same argument can be made for the Markov chains in Definition 2 with respect to the conditionals in Equation (3).
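The cooperative case can be checked numerically. The sketch below assumes two agents with two actions each and a shared reward table, takes the Boltzmann distribution proportional to exp(lam * r) as a candidate joint, and compares its conditionals with the logistic conditionals used during sampling. The reward values and helper names are hypothetical.

```python
import math

ACTIONS = (0, 1)
LAM = 1.0
# Hypothetical shared reward for the joint action (a_0, a_1).
SHARED_REWARD = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): -0.5, (1, 1): 2.0}

def boltzmann_joint():
    """Candidate joint distribution proportional to exp(LAM * reward)."""
    weights = {a: math.exp(LAM * r) for a, r in SHARED_REWARD.items()}
    z = sum(weights.values())
    return {a: w / z for a, w in weights.items()}

def with_action(joint_action, player, new_action):
    """Replace `player`'s entry of a joint action."""
    if player == 0:
        return (new_action, joint_action[1])
    return (joint_action[0], new_action)

def conditional_from_joint(joint, player, joint_action):
    """pi(a_i | a_-i) obtained by normalising the joint over player's actions."""
    norm = sum(joint[with_action(joint_action, player, b)] for b in ACTIONS)
    return joint[joint_action] / norm

def logistic_conditional(player, joint_action):
    """Conditional used by the sampler: proportional to exp(LAM * reward)."""
    weights = [math.exp(LAM * SHARED_REWARD[with_action(joint_action, player, b)])
               for b in ACTIONS]
    return weights[joint_action[player]] / sum(weights)

joint = boltzmann_joint()
max_gap = max(abs(conditional_from_joint(joint, p, a) - logistic_conditional(p, a))
              for p in (0, 1) for a in joint)
```

The gap is zero up to floating-point error because the joint normaliser cancels in each conditional, which is exactly the compatibility property that makes the sampler a proper Gibbs sampler in this case.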
When the conditionals in Equations (2) and (3) are incompatible, the procedure is called a pseudo Gibbs sampler. As discussed in the literature on dependency networks (Heckerman et al., 2000; Chen et al., 2011; Chen & Ip, 2015), when the conditionals are learned from a sufficiently large dataset, the pseudo Gibbs sampler asymptotically works well in the sense that the conditionals of the stationary joint distribution are nearly consistent with the conditionals used during sampling. Under some conditions, theoretical bounds on the approximation can be obtained (Heckerman et al., 2000).
3.2 Trajectory Distributions Induced by LSBRE
Following Fu et al. (2017) and Levine (2018), without loss of generality, in the remainder of this paper we consider the case where $\lambda = 1$. First, we note that there is a connection between the notion of LSBRE and maximum causal entropy reinforcement learning (Ziebart, 2010). Specifically, we can characterize the trajectory distribution induced by LSBRE policies with an energy-based formulation, where the probability of a trajectory increases exponentially as the sum of rewards increases. Formally, with LSBRE policies, the probability of generating a certain trajectory can be characterized with the following theorem:
Theorem 1.
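Given a joint policy $\{\boldsymbol{\pi}^t(\boldsymbol{a}^t \mid s^t)\}_{t=1}^T$ specified by LSBRE, for each agent $i$, let $\{\boldsymbol{\pi}_{-i}^t(\boldsymbol{a}_{-i}^t \mid s^t)\}_{t=1}^T$ denote other agents' marginal distribution and $\{\pi_i^t(a_i^t \mid \boldsymbol{a}_{-i}^t, s^t)\}_{t=1}^T$ denote agent $i$'s conditional distribution, both obtained from the LSBRE joint policies. Then the LSBRE conditional distributions are the optimal solution to the following optimization problem:
$$\min_{\hat{\pi}^{1:T}} D_{\mathrm{KL}}\left(\hat{p}(\tau) \,\|\, p(\tau)\right), \qquad \hat{p}(\tau) = \left[\eta(s^1) \prod_{t=1}^T P(s^{t+1} \mid s^t, \boldsymbol{a}^t)\, \boldsymbol{\pi}_{-i}^t(\boldsymbol{a}_{-i}^t \mid s^t)\right] \prod_{t=1}^T \hat{\pi}_i^t(a_i^t \mid \boldsymbol{a}_{-i}^t, s^t) \tag{5}$$
$$p(\tau) \propto \left[\eta(s^1) \prod_{t=1}^T P(s^{t+1} \mid s^t, \boldsymbol{a}^t)\, \boldsymbol{\pi}_{-i}^t(\boldsymbol{a}_{-i}^t \mid s^t)\right] \exp\left(\sum_{t=1}^T r_i(s^t, a_i^t, \boldsymbol{a}_{-i}^t)\right) \tag{6}$$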
Proof.
See Appendix A.1. ∎
Intuitively, for single-shot normal-form games, the above statement follows directly from the definition in Equation (2). For Markov games, similar to the process introduced in Definition 2, we can employ a dynamic programming algorithm to find the conditional policies that minimize Equation (5). Specifically, we first construct the base case of $t = T$ as a normal-form game, then recursively construct the conditional policy for each time step $t$, based on the policies from $t+1$ to $T$ that have already been constructed. It can be shown that the constructed optimal policy, which minimizes the KL divergence between its trajectory distribution and the trajectory distribution defined in Equation (6), corresponds to the set of conditional policies in LSBRE.
3.3 Multi-Agent Adversarial IRL
In the remainder of this paper, we assume that the expert policies form a unique LSBRE under some unknown (parameterized) reward functions, according to Definition 2. By adopting LSBRE as the optimality notion, we are able to rationalize the demonstrations by maximizing the likelihood of the expert trajectories with respect to the LSBRE stationary distribution, which is in turn induced by the parameterized reward functions $\{r_i(s, \boldsymbol{a}; \omega_i)\}_{i=1}^N$.
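The probability of a trajectory $\tau = \{s^t, \boldsymbol{a}^t\}_{t=1}^T$ generated by LSBRE policies in a Markov game is defined by the following generative process:
$$p(\tau) = \eta(s^1) \cdot \prod_{t=1}^T \boldsymbol{\pi}^t(\boldsymbol{a}^t \mid s^t; \boldsymbol{\omega}) \cdot \prod_{t=1}^T P(s^{t+1} \mid s^t, \boldsymbol{a}^t) \tag{7}$$
where $\boldsymbol{\pi}^t(\boldsymbol{a}^t \mid s^t; \boldsymbol{\omega})$ are the unique stationary joint distributions for the LSBRE induced by $\{r_i(s, \boldsymbol{a}; \omega_i)\}_{i=1}^N$. The initial state distribution $\eta(s^1)$ and transition dynamics $P(s^{t+1} \mid s^t, \boldsymbol{a}^t)$ are specified by the Markov game.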
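As mentioned in Section 2.4, the MaxEnt IRL framework interprets finding suitable reward functions as maximum likelihood over the expert trajectories in the distribution defined in Equation (7), which can be reduced to:
$$\max_{\boldsymbol{\omega}} \; \mathbb{E}_{\tau \sim \boldsymbol{\pi}_E}\left[\sum_{t=1}^T \log \boldsymbol{\pi}^t(\boldsymbol{a}^t \mid s^t; \boldsymbol{\omega})\right] \tag{8}$$
since the initial state distribution and transition dynamics do not depend on the parameterized rewards.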
Note that $\boldsymbol{\pi}^t(\boldsymbol{a}^t \mid s^t; \boldsymbol{\omega})$ in Equation (8) is the joint policy defined in Equation (4), whose conditional distributions are given by Equation (3). From Section 3.1, we know that given a set of $\boldsymbol{\omega}$-parameterized reward functions, we are able to characterize the conditional policies $\{\pi_i^t(a_i^t \mid \boldsymbol{a}_{-i}^t, s^t)\}_{t=1}^T$ for each agent $i$. However, direct optimization over the joint MLE objective in Equation (8) is intractable, as we cannot obtain a closed form for the stationary joint policy. Fortunately, we are able to construct an asymptotically consistent estimator by approximating the joint likelihood $\boldsymbol{\pi}^t(\boldsymbol{a}^t \mid s^t)$ with a product of the conditionals $\prod_{i=1}^N \pi_i^t(a_i^t \mid \boldsymbol{a}_{-i}^t, s^t)$, which is termed a pseudolikelihood (Besag, 1975).
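The pseudolikelihood idea can be illustrated on a toy stateless example. The sketch below assumes two cooperative agents whose joint action distribution is proportional to exp(omega * feature), uses the exact expert distribution in place of sampled demonstrations, and searches a grid for the parameter that maximizes the expected sum of log conditionals. All names, the feature table and the grid are hypothetical, and the compatible (cooperative) case is chosen so that the true parameter is known to be recoverable.

```python
import math

ACTIONS = (0, 1)
# Hypothetical shared feature of the joint action; reward is omega * feature.
FEATURE = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): -0.5, (1, 1): 2.0}

def joint(omega):
    """Joint action distribution proportional to exp(omega * feature)."""
    weights = {a: math.exp(omega * f) for a, f in FEATURE.items()}
    z = sum(weights.values())
    return {a: w / z for a, w in weights.items()}

def log_conditional(omega, player, a):
    """log pi_i(a_i | a_-i; omega); only a sum over one agent's actions is needed."""
    def swap(b):
        return (b, a[1]) if player == 0 else (a[0], b)
    log_norm = math.log(sum(math.exp(omega * FEATURE[swap(b)]) for b in ACTIONS))
    return omega * FEATURE[a] - log_norm

def expected_pseudo_loglik(omega, expert):
    """Expected sum over agents of log conditionals under the expert distribution."""
    return sum(prob * sum(log_conditional(omega, i, a) for i in (0, 1))
               for a, prob in expert.items())

true_omega = 1.0
expert = joint(true_omega)
grid = [k / 10 for k in range(0, 31)]
best = max(grid, key=lambda w: expected_pseudo_loglik(w, expert))
```

Each conditional only normalizes over a single agent's actions, so the objective never touches the joint normalizer, and in this compatible example the grid maximizer coincides with `true_omega`.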
With the asymptotic consistency property of the maximum pseudolikelihood estimation (Besag, 1975; Lehmann & Casella, 2006), we have the following theorem:
Theorem 2.
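Let demonstrations $\tau_1, \ldots, \tau_M$ be independent and identically distributed (sampled from LSBRE induced by some unknown reward functions), and suppose that for all $t \in [1, \ldots, T]$ and $a_i^t \in \mathcal{A}_i$, $\pi_i^t(a_i^t \mid \boldsymbol{a}_{-i}^t, s^t; \omega_i)$ is differentiable with respect to $\omega_i$. Then, with probability tending to 1 as $M \to \infty$, the equation
$$\frac{\partial}{\partial \boldsymbol{\omega}} \sum_{m=1}^M \sum_{t=1}^T \sum_{i=1}^N \log \pi_i^t(a_i^{m,t} \mid \boldsymbol{a}_{-i}^{m,t}, s^{m,t}; \omega_i) = 0 \tag{9}$$
has a root $\hat{\boldsymbol{\omega}}_M$ such that $\hat{\boldsymbol{\omega}}_M$ tends to the maximizer of the joint likelihood in Equation (8).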
Proof.
See Appendix A.2. ∎
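Theorem 2 bridges the gap between optimizing the joint likelihood and each conditional likelihood. Now we are able to maximize the objective in Equation (8) as:
$$\mathbb{E}_{\boldsymbol{\pi}_E}\left[\sum_{i=1}^N \sum_{t=1}^T \log \pi_i^t(a_i^t \mid \boldsymbol{a}_{-i}^t, s^t; \omega_i)\right] \tag{10}$$
To optimize the maximum pseudolikelihood objective in Equation (10), we can instead optimize the following surrogate loss, which is a variational approximation to the pseudolikelihood objective (from Theorem 1):
$$\mathbb{E}_{\boldsymbol{\pi}_E}\left[\sum_{i=1}^N \sum_{t=1}^T r_i(s^t, \boldsymbol{a}^t; \omega_i)\right] - \sum_{i=1}^N \log Z_{\omega_i}$$
where $Z_{\omega_i}$ is the partition function of the distribution in Equation (6). It is generally intractable to exactly compute and optimize the partition function $Z_{\omega_i}$, which involves an integral over all trajectories. Similar to GCL (Finn et al., 2016b) and single-agent AIRL (Fu et al., 2017), we employ importance sampling to estimate the partition function with adaptive samplers $\boldsymbol{q}_\theta$. Now we are ready to introduce our practical Multi-Agent Adversarial IRL (MAAIRL) framework, where we train the $\boldsymbol{\omega}$-parameterized discriminators as:
$$\max_{\boldsymbol{\omega}} \; \mathbb{E}_{\boldsymbol{\pi}_E}\left[\sum_{i=1}^N \log \frac{\exp(f_{\omega_i}(s, \boldsymbol{a}))}{\exp(f_{\omega_i}(s, \boldsymbol{a})) + q_{\theta_i}(a_i \mid s)}\right] + \mathbb{E}_{\boldsymbol{q}_\theta}\left[\sum_{i=1}^N \log \frac{q_{\theta_i}(a_i \mid s)}{\exp(f_{\omega_i}(s, \boldsymbol{a})) + q_{\theta_i}(a_i \mid s)}\right] \tag{11}$$
and we train the $\boldsymbol{\theta}$-parameterized generators as:
$$\max_{\boldsymbol{\theta}} \; \mathbb{E}_{\boldsymbol{q}_\theta}\left[\sum_{i=1}^N \log D_{\omega_i}(s, \boldsymbol{a}) - \log(1 - D_{\omega_i}(s, \boldsymbol{a}))\right] = \mathbb{E}_{\boldsymbol{q}_\theta}\left[\sum_{i=1}^N f_{\omega_i}(s, \boldsymbol{a}) - \log q_{\theta_i}(a_i \mid s)\right] \tag{12}$$
Specifically, for each agent $i$, we have a discriminator with a particular structure $\frac{\exp(f_\omega)}{\exp(f_\omega) + q_\theta}$ for a binary classification, and a generator as an adaptive importance sampler for estimating the partition function. Intuitively, $q_\theta$ is trained to minimize the KL divergence between its trajectory distribution and that induced by the reward functions, for reducing the variance of importance sampling, while $f_\omega$ in the discriminator is trained to estimate the reward function. At optimality, $f_\omega$ will approximate the advantage function for the expert policy and $q_\theta$ will approximate the expert policy.
3.4 Solving Reward Ambiguity in Multi-Agent IRL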
For single-agent reinforcement learning, Ng et al. (1999) show that for any state-only potential function $\Phi: \mathcal{S} \to \mathbb{R}$, potential-based reward shaping defined as:
$$r'(s^t, a^t, s^{t+1}) = r(s^t, a^t, s^{t+1}) + \gamma \Phi(s^{t+1}) - \Phi(s^t)$$
is a necessary and sufficient condition to guarantee invariance of the optimal policy in both finite and infinite horizon MDPs. In other words, given a set of expert demonstrations, there is a class of reward functions, all of which can explain the demonstrated expert behaviors. Thus without further assumptions, it would be impossible to identify the ground-truth reward that induces the expert policy within this class. Similar issues also exist when we consider multiagent scenarios. Devlin & Kudenko (2011) show that in multiagent systems, using the same reward shaping for one or more agents will not alter the set of Nash equilibria. It is possible to extend this result to other solution concepts such as CE and LSBRE. For example, in the case of LSBRE, after specifying the level of rationality $\lambda$, for any policy $\pi_i$, we have:
$$\mathbb{E}_{\pi_i, \boldsymbol{\pi}_{-i}^{\mathrm{LSBRE}}}[r_i(s, \boldsymbol{a})] + \frac{1}{\lambda} \mathcal{H}(\pi_i) \le \mathbb{E}_{\pi_i^{\mathrm{LSBRE}}, \boldsymbol{\pi}_{-i}^{\mathrm{LSBRE}}}[r_i(s, \boldsymbol{a})] + \frac{1}{\lambda} \mathcal{H}(\pi_i^{\mathrm{LSBRE}}) \tag{13}$$
since each individual LSBRE conditional policy is the optimal solution to the corresponding entropy-regularized RL problem (see Appendix A.1). It can also be shown that any policy that satisfies the inequality in Equation (13) will still satisfy the inequality after reward shaping (Devlin & Kudenko, 2011).
To mitigate the reward shaping effect and recover reward functions with higher linear correlation to the ground truth reward, as in Fu et al. (2017), we further assume the functions $f_{\omega_i}(s, \boldsymbol{a})$ in Equation (11) have a specific structure:
$$f_{\omega_i, \phi_i}(s^t, \boldsymbol{a}^t, s^{t+1}) = g_{\omega_i}(s^t, \boldsymbol{a}^t) + \gamma h_{\phi_i}(s^{t+1}) - h_{\phi_i}(s^t)$$
where $g_{\omega_i}$ is a reward estimator and $h_{\phi_i}$ is a potential function. We summarize the MAAIRL training procedure in Algorithm 1.
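The reward shaping ambiguity discussed in this section can be seen in a few lines. The sketch assumes the standard potential-based form in which each reward gains gamma * Phi(next state) - Phi(current state), and checks that the discounted return of any trajectory changes only by a boundary term that depends on its first and last state. The trajectory, rewards and potential values are hypothetical.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t with t starting at 0."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def shaped_rewards(rewards, states, potential, gamma):
    """Potential-based shaping; `states` has one more entry than `rewards`."""
    return [r + gamma * potential[states[t + 1]] - potential[states[t]]
            for t, r in enumerate(rewards)]

gamma = 0.9
states = ["A", "B", "C", "A"]                 # hypothetical visited states
rewards = [1.0, -2.0, 0.5]                    # hypothetical per-step rewards
potential = {"A": 3.0, "B": -1.0, "C": 0.25}  # hypothetical potential function

shaped = shaped_rewards(rewards, states, potential, gamma)
gap = discounted_return(shaped, gamma) - discounted_return(rewards, gamma)

# The shaping terms telescope, leaving only the two end points.
boundary = gamma ** len(rewards) * potential[states[-1]] - potential[states[0]]
```

Because `gap` does not depend on the actions taken along the way, trajectories that share a start state are ranked identically under both reward functions (up to the discounted terminal term), so demonstrations alone cannot distinguish the two. This is the motivation for separating a reward estimator from a potential term.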
4 Related Work
A vast number of methods and paradigms have been proposed for single-agent imitation learning and inverse reinforcement learning. However, multiagent scenarios are less commonly investigated, and most existing works assume specific reward structures. These include fully cooperative games (Barrett et al., 2017; Le et al., 2017; Šošić et al., 2017; Bogert & Doshi, 2014), two-player zero-sum games (Lin et al., 2014), and rewards as linear combinations of pre-specified features (Reddy et al., 2012; Waugh et al., 2013). Recently, Song et al. (2018) proposed MAGAIL, a multiagent extension of GAIL which works on general Markov games.
While both MAAIRL and MAGAIL are based on adversarial training, the methods are inherently different. MAGAIL is based on the notion of Nash equilibrium, and is motivated via a specific choice of Lagrange multipliers for a constrained optimization problem. MAAIRL, on the other hand, is derived from MaxEnt RL and LSBRE, and aims to obtain an MLE solution for the joint trajectories; we connect this with a set of conditionals via pseudolikelihood, which are then solved with the adversarial reward learning framework. From a reward learning perspective, the discriminators' outputs in MAGAIL will converge to an uninformative uniform distribution, while MAAIRL allows us to recover reward functions from the optimal discriminators.
5 Experiments
We seek to answer the following questions via empirical evaluation: (1) Can MAAIRL efficiently recover the expert policies for each individual agent from the expert demonstrations (policy imitation)? (2) Can MAAIRL effectively recover the underlying reward functions, for which the expert policies form an LSBRE (reward recovery)?
Task Description. To answer these questions, we evaluate our MAAIRL algorithm on a series of simulated particle environments (Lowe et al., 2017). Specifically, we consider the following scenarios: cooperative navigation, where three agents cooperate through physical actions to reach three landmarks; cooperative communication, where two agents, a speaker and a listener, cooperate to navigate to a particular landmark; and competitive keep-away, where one agent tries to reach a target landmark, while an adversary, without knowing the target a priori, tries to infer the target from the agent's behaviors and prevent it from reaching the goal through physical interactions.
In our experiments, for generality, the learning algorithms will not leverage any prior knowledge on the types of interactions (cooperative or competitive). Thus for all the tasks described above, the learning algorithms will take a decentralized form and we will not utilize additional reward regularization, besides penalizing the $\ell_2$ norm of the reward parameters to mitigate overfitting (Ziebart, 2010; Kalakrishnan et al., 2013).
Training Procedure. In the simulated environments, we have access to the ground-truth reward functions, which enables us to accurately evaluate the quality of both recovered policies and reward functions. We use a multiagent version of ACKTR (Wu et al., 2017; Song et al., 2018), an efficient model-free policy gradient algorithm, for training the experts as well as the adaptive samplers in MAAIRL. The supervision signals for the experts come from the ground-truth rewards, while the reward signals for the adaptive samplers come from the discriminators. Specifically, we first obtain expert policies induced by the ground-truth rewards, then we use them to generate demonstrations, from which the learning algorithms will try to recover the policies as well as the underlying reward functions. We compare MAAIRL against the state-of-the-art multiagent imitation learning algorithm, MAGAIL (Song et al., 2018), which is a generalization of GAIL to Markov games. Following Li et al. (2017) and Song et al. (2018), we use behavior cloning to pretrain MAAIRL and MAGAIL to reduce sample complexity for exploration, and we use 200 episodes of expert demonstrations, each with 50 time steps, which is close to the number of time steps used in Ho & Ermon (2016). The codebase for this work can be found at https://github.com/ermongroup/MAAIRL.
5.1 Policy Imitation
Although MAGAIL achieved superior performance compared with behavior cloning (Song et al., 2018), it only aims to recover policies via distribution matching. Moreover, the training signal for the policy will become less informative as training progresses; according to Goodfellow et al. (2014), with infinite data and computational resources the discriminator outputs will converge to 0.5 for all state-action pairs, which could potentially hinder the robustness of the policy towards the end of training. To empirically verify our claims, we compare the quality of the learned policies in terms of the expected return received by each agent.
In the cooperative environment, we directly use the ground-truth rewards from the environment as the oracle metric, since all agents share the same reward. In the competitive environment, we follow the evaluation procedure in Song et al. (2018), where we place the experts and learned policies in the same environment. A learned policy is considered "better" if it receives a higher expected return while its opponent receives a lower expected return. The results for cooperative and competitive environments are shown in Tables 1 and 2 respectively. MAAIRL consistently performs better than MAGAIL in terms of the received reward in all the considered environments, suggesting a superior ability to imitate the experts.
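Table 1: Expected returns in the cooperative environments.
| Algorithm | Nav. ExpRet | Comm. ExpRet |
|---|---|---|
| Expert | -43.195 ± 2.659 | -12.712 ± 1.613 |
| Random | -391.314 ± 10.092 | -125.825 ± 3.4906 |
| MAGAIL | -52.810 ± 2.981 | -12.811 ± 1.604 |
| MAAIRL | -47.515 ± 2.549 | -12.727 ± 1.557 |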
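Table 2: Expected returns of Agent #1 in the competitive environment.
| Agent #1 | Agent #2 | Agent #1 ExpRet |
|---|---|---|
| Expert | Expert | -6.804 ± 0.316 |
| MAGAIL | Expert | -6.978 ± 0.305 |
| MAAIRL | Expert | -6.785 ± 0.312 |
| Expert | MAGAIL | -6.919 ± 0.298 |
| Expert | MAAIRL | -7.367 ± 0.311 |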
5.2 Reward Recovery
The second question we seek to answer concerns the reward recovery problem in inverse reinforcement learning: is the algorithm able to recover the ground truth reward functions with expert demonstrations being the only source of supervision? To answer this question, we evaluate the statistical correlations between the ground truth rewards (which the learning algorithms have no access to) and the inferred rewards for the same state-action pairs.
Specifically, we consider two types of statistical correlations: Pearson's correlation coefficient (PCC), which measures the linear correlation between two random variables; and Spearman's rank correlation coefficient (SCC), which measures the statistical dependence between the rankings of two random variables. Higher SCC suggests that two reward functions have a stronger monotonic relationship and higher PCC suggests a stronger linear correlation. For each trajectory, we compare the ground-truth return from the environment with the supervision signals from the discriminators, which correspond to $g_\omega$ in MAAIRL and $\log(D_\omega)$ in MAGAIL.
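The difference between the two correlation measures can be sketched with the standard textbook definitions. The implementation below uses only the standard library; the "inferred" values are a hypothetical monotone but nonlinear transformation of made-up ground-truth returns, not data from the experiments.

```python
import math

def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = average_rank
        pos = end + 1
    return result

def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical per-trajectory returns and a monotone, nonlinear "inferred" signal.
true_returns = [-5.0, -3.0, -1.0, 0.0, 2.0]
inferred = [math.exp(v) for v in true_returns]

pcc = pearson(true_returns, inferred)
scc = spearman(true_returns, inferred)
```

Here the rankings agree perfectly, so the rank correlation is at its maximum, while the linear correlation is noticeably lower. This is why reporting both statistics gives a fuller picture of how well a recovered reward tracks the ground truth.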
Tables 3 and 4 provide the SCC and PCC statistics for cooperative and competitive environments respectively. In the cooperative case, compared to MAGAIL, MAAIRL achieves a much higher PCC and SCC, which could facilitate policy learning. The statistical correlations between reward signals gathered from discriminators for each agent are also quite high, suggesting that although we do not reveal that the agents are cooperative, MAAIRL is able to discover high correlations between the agents' reward functions. In the competitive case, the reward functions learned by MAAIRL also significantly outperform MAGAIL in terms of SCC and PCC statistics. In Figure 1, we further show the changes of PCC statistics with respect to training time steps for MAGAIL and MAAIRL. The reward functions recovered by MAGAIL initially have a high correlation with the ground truth, yet that correlation decreases dramatically as training continues, whereas the functions learned by MAAIRL maintain a high correlation throughout the course of training, which is in line with the theoretical analysis that in MAGAIL, reward signals from the discriminators will become less informative towards convergence.
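Table 3: Statistical correlations between the ground truth rewards and the learned reward signals in the cooperative tasks.
| Task | Metric | MAGAIL | MAAIRL |
|---|---|---|---|
| Nav. | SCC | 0.792 ± 0.085 | 0.934 ± 0.015 |
| Nav. | PCC | 0.556 ± 0.081 | 0.882 ± 0.028 |
| Comm. | SCC | 0.879 ± 0.059 | 0.936 ± 0.080 |
| Comm. | PCC | 0.612 ± 0.093 | 0.848 ± 0.099 |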
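Table 4: Statistical correlations between the ground truth rewards and the learned reward signals in the competitive task.
| Algorithm | MAGAIL | MAAIRL |
|---|---|---|
| SCC #1 | 0.424 | 0.534 |
| SCC #2 | 0.653 | 0.907 |
| Average SCC | 0.538 | 0.721 |
| PCC #1 | 0.497 | 0.720 |
| PCC #2 | 0.392 | 0.667 |
| Average PCC | 0.445 | 0.694 |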
6 Discussion and Future Work
We propose MAAIRL, the first multiagent MaxEnt IRL framework that is effective and scalable to Markov games with high-dimensional state-action space and unknown dynamics. We derive our algorithm based on a solution concept termed LSBRE and we employ maximum pseudolikelihood estimation to achieve tractability. Experimental results demonstrate that MAAIRL is able to imitate expert behaviors in high-dimensional complex environments, as well as learn reward functions that are highly correlated with the ground truth rewards. An exciting avenue for future work is to include reward regularization to mitigate overfitting and leverage prior knowledge of the task structure.
Acknowledgments
This research was supported by Toyota Research Institute, NSF (#1651565, #1522054, #1733686), ONR (N000141912145), AFOSR (FA95501910024), Amazon AWS, and Qualcomm.
References
 Abbeel & Ng (2004) Abbeel, P. and Ng, A. Y. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twentyfirst international conference on Machine learning, pp. 1, 2004.
 Amodei & Clark (2016) Amodei, D. and Clark, J. Faulty reward functions in the wild, 2016.
 Amodei et al. (2016) Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., and Mané, D. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.
 Arnold & Press (1989) Arnold, B. C. and Press, S. J. Compatible conditional distributions. Journal of the American Statistical Association, 84(405):152–156, 1989.
 Aumann (1974) Aumann, R. J. Subjectivity and correlation in randomized strategies. Journal of mathematical Economics, 1(1):67–96, 1974.
 Aumann (1987) Aumann, R. J. Correlated equilibrium as an expression of bayesian rationality. Econometrica: Journal of the Econometric Society, pp. 1–18, 1987.
 Barrett et al. (2017) Barrett, S., Rosenfeld, A., Kraus, S., and Stone, P. Making friends on the fly: Cooperating with new teammates. Artificial Intelligence, 242:132–171, 2017.
 Besag (1975) Besag, J. Statistical analysis of nonlattice data. The statistician, pp. 179–195, 1975.
 Bogert & Doshi (2014) Bogert, K. and Doshi, P. Multirobot inverse reinforcement learning under occlusion with interactions. In Proceedings of the 2014 international conference on Autonomous agents and multiagent systems, pp. 173–180, 2014.
 Chen & Ip (2015) Chen, S.H. and Ip, E. H. Behaviour of the gibbs sampler when conditional distributions are potentially incompatible. Journal of statistical computation and simulation, 85(16):3266–3275, 2015.
 Chen et al. (2011) Chen, S.H., Ip, E. H., and Wang, Y. J. Gibbs ensembles for nearly compatible and incompatible conditional models. Computational statistics & data analysis, 55(4):1760–1769, 2011.
 Dawid & Musio (2014) Dawid, A. P. and Musio, M. Theory and applications of proper scoring rules. Metron, 72(2):169–183, 2014.
 Devlin & Kudenko (2011) Devlin, S. and Kudenko, D. Theoretical considerations of potentialbased reward shaping for multiagent systems. In The 10th International Conference on Autonomous Agents and Multiagent Systems, Volume 1, AAMAS ’11, pp. 225–232, Richland, SC, 2011. International Foundation for Autonomous Agents and Multiagent Systems. ISBN 9780982657157.
 Finn et al. (2016a) Finn, C., Christiano, P., Abbeel, P., and Levine, S. A connection between generative adversarial networks, inverse reinforcement learning, and energybased models. arXiv preprint arXiv:1611.03852, 2016a.
 Finn et al. (2016b) Finn, C., Levine, S., and Abbeel, P. Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning, pp. 49–58, June 2016b.
 Fu et al. (2017) Fu, J., Luo, K., and Levine, S. Learning robust rewards with adversarial inverse reinforcement learning. arXiv preprint arXiv:1710.11248, 2017.
 Gandhi (2012) Gandhi, A. The stochastic response dynamic: A new approach to learning and computing equilibrium in continuous games. Technical Report, 2012.
 Goodfellow et al. (2014) Goodfellow, I., PougetAbadie, J., Mirza, M., Xu, B., WardeFarley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.
 Gordon et al. (2008) Gordon, G. J., Greenwald, A., and Marks, C. Noregret learning in convex games. In Proceedings of the 25th international conference on Machine learning, pp. 360–367. ACM, 2008.
 Gu et al. (2017) Gu, S., Holly, E., Lillicrap, T., and Levine, S. Deep reinforcement learning for robotic manipulation with asynchronous offpolicy updates. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pp. 3389–3396. IEEE, 2017.
 HadfieldMenell et al. (2017) HadfieldMenell, D., Milli, S., Abbeel, P., Russell, S. J., and Dragan, A. Inverse reward design. In Advances in Neural Information Processing Systems, pp. 6765–6774, 2017.
 Hart & MasColell (2000) Hart, S. and MasColell, A. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5):1127–1150, 2000.
 Hastings (1970) Hastings, W. K. Monte carlo sampling methods using markov chains and their applications. 1970.
 Heckerman et al. (2000) Heckerman, D., Chickering, D. M., Meek, C., Rounthwaite, R., and Kadie, C. Dependency networks for inference, collaborative filtering, and data visualization. Journal of Machine Learning Research, 1(Oct):49–75, 2000.
 Ho & Ermon (2016) Ho, J. and Ermon, S. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pp. 4565–4573, 2016.
 Ho et al. (2016) Ho, J., Gupta, J., and Ermon, S. Modelfree imitation learning with policy optimization. In International Conference on Machine Learning, pp. 2760–2769, 2016.
 Hu et al. (1998) Hu, J., Wellman, M. P., and Others. Multiagent reinforcement learning: theoretical framework and an algorithm. In ICML, volume 98, pp. 242–250, 1998.
 Kalakrishnan et al. (2013) Kalakrishnan, M., Pastor, P., Righetti, L., and Schaal, S. Learning objective functions for manipulation. In Robotics and Automation (ICRA), 2013 IEEE International Conference on, pp. 1331–1336, 2013.
 Le et al. (2017) Le, H. M., Yue, Y., and Carr, P. Coordinated MultiAgent imitation learning. arXiv preprint arXiv:1703.03121, March 2017.
 Lehmann & Casella (2006) Lehmann, E. L. and Casella, G. Theory of point estimation. Springer Science & Business Media, 2006.
 Leibo et al. (2017) Leibo, J. Z., Zambaldi, V., Lanctot, M., Marecki, J., and Graepel, T. Multiagent reinforcement learning in sequential social dilemmas. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, pp. 464–473, 2017.
 Levine (2018) Levine, S. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909, 2018.
 Levine et al. (2016) Levine, S., Finn, C., Darrell, T., and Abbeel, P. Endtoend training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
 Li et al. (2017) Li, Y., Song, J., and Ermon, S. InfoGAIL: Interpretable imitation learning from visual demonstrations. arXiv preprint arXiv:1703.08840, 2017.
 Lin et al. (2014) Lin, X., Beling, P. A., and Cogill, R. Multiagent inverse reinforcement learning for zerosum games. arXiv preprint arXiv:1403.6508, 2014.
 Lin et al. (2018) Lin, X., Adams, S. C., and Beling, P. A. Multiagent inverse reinforcement learning for generalsum stochastic games. arXiv preprint arXiv:1806.09795, 2018.
 Littman (1994) Littman, M. L. Markov games as a framework for multiagent reinforcement learning. In Proceedings of the eleventh international conference on machine learning, volume 157, pp. 157–163, 1994.
 Lowe et al. (2017) Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, P., and Mordatch, I. MultiAgent ActorCritic for mixed CooperativeCompetitive environments. arXiv preprint arXiv:1706.02275, June 2017.
 Matignon et al. (2012) Matignon, L., Jeanpierre, L., Mouaddib, A.I., and Others. Coordinated MultiRobot exploration under communication constraints using decentralized markov decision processes. In AAAI, 2012.
 McKelvey & Palfrey (1995) McKelvey, R. D. and Palfrey, T. R. Quantal response equilibria for normal form games. Games and economic behavior, 10(1):6–38, 1995.
 McKelvey & Palfrey (1998) McKelvey, R. D. and Palfrey, T. R. Quantal response equilibria for extensive form games. Experimental economics, 1(1):9–41, 1998.
 Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Humanlevel control through deep reinforcement learning. Nature, 518(7540):529, 2015.
 Natarajan et al. (2010) Natarajan, S., Kunapuli, G., Judah, K., Tadepalli, P., Kersting, K., and Shavlik, J. Multiagent inverse reinforcement learning. In Machine Learning and Applications (ICMLA), 2010 Ninth International Conference on, pp. 395–400, 2010.
 Ng et al. (1999) Ng, A. Y., Harada, D., and Russell, S. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, pp. 278–287, 1999.
 Ng et al. (2000) Ng, A. Y., Russell, S. J., et al. Algorithms for inverse reinforcement learning. In ICML, pp. 663–670, 2000.
 Nisan et al. (2011) Nisan, N., Schapira, M., Valiant, G., and Zohar, A. Bestresponse mechanisms. In ICS, pp. 155–165, 2011.
 Peng et al. (2017) Peng, P., Yuan, Q., Wen, Y., Yang, Y., Tang, Z., Long, H., and Wang, J. Multiagent BidirectionallyCoordinated nets for learning to play StarCraft combat games. arXiv preprint arXiv:1703.10069, 2017.
 Pomerleau (1991) Pomerleau, D. A. Efficient training of artificial neural networks for autonomous navigation. Neural computation, 3(1):88–97, 1991. ISSN 08997667.
 Reddy et al. (2012) Reddy, T. S., Gopikrishna, V., Zaruba, G., and Huber, M. Inverse reinforcement learning for decentralized noncooperative multiagent systems. In Systems, Man, and Cybernetics (SMC), 2012 IEEE International Conference on, pp. 1930–1935, 2012.
 Russell (1998) Russell, S. Learning agents for uncertain environments. In Proceedings of the eleventh annual conference on Computational learning theory, pp. 101–103. ACM, 1998.
 Silver et al. (2016) Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
 Song et al. (2018) Song, J., Ren, H., Sadigh, D., and Ermon, S. Multiagent generative adversarial imitation learning. 2018.
 Šošić et al. (2017) Šošić, A., KhudaBukhsh, W. R., Zoubir, A. M., and Koeppl, H. Inverse reinforcement learning in swarm systems. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, pp. 1413–1421. International Foundation for Autonomous Agents and Multiagent Systems, 2017.
 Waugh et al. (2013) Waugh, K., Ziebart, B. D., and Andrew Bagnell, J. Computational rationalization: The inverse equilibrium problem. arXiv preprint arXiv:1308.3506, August 2013.
 Wu et al. (2017) Wu, Y., Mansimov, E., Liao, S., Grosse, R., and Ba, J. Scalable trustregion method for deep reinforcement learning using kroneckerfactored approximation. arXiv preprint arXiv:1708.05144, August 2017.
 Yu et al. (2017) Yu, L., Zhang, W., Wang, J., and Yu, Y. Seqgan: Sequence generative adversarial nets with policy gradient. In AAAI, pp. 2852–2858, 2017.
 Ziebart (2010) Ziebart, B. D. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. 2010.
 Ziebart et al. (2008) Ziebart, B. D., Maas, A. L., Bagnell, J. A., and Dey, A. K. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pp. 1433–1438, 2008.
 Ziebart et al. (2011) Ziebart, B. D., Bagnell, J. A., and Dey, A. K. Maximum causal entropy correlated equilibria for markov games. In The 10th International Conference on Autonomous Agents and Multiagent Systems, Volume 1, AAMAS ’11, pp. 207–214, Richland, SC, 2011. International Foundation for Autonomous Agents and Multiagent Systems. ISBN 9780982657157.
 Zoph & Le (2016) Zoph, B. and Le, Q. V. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.
Appendix A
A.1 Trajectory Distribution Induced by Logistic Stochastic Best Response Equilibrium
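Let $\{\pi_{-i}^t(a_{-i}^t \mid s^t)\}_{t=1}^T$ denote the other agents’ marginal LSBRE policies, and let $\{\pi_i^t(a_i^t \mid a_{-i}^t, s^t)\}_{t=1}^T$ denote agent $i$’s conditional policy. By the chain rule, the induced trajectory distribution is given by:
$$
\hat{p}(\tau) = \left[ \eta(s^1) \prod_{t=1}^T P(s^{t+1} \mid s^t, a^t)\, \pi_{-i}^t(a_{-i}^t \mid s^t) \right] \prod_{t=1}^T \pi_i^t(a_i^t \mid a_{-i}^t, s^t) \tag{14}
$$
Suppose the desired distribution is given by:
$$
p(\tau) \propto \left[ \eta(s^1) \prod_{t=1}^T P(s^{t+1} \mid s^t, a^t)\, \pi_{-i}^t(a_{-i}^t \mid s^t) \right] \exp\left( \sum_{t=1}^T r_i(s^t, a_i^t, a_{-i}^t) \right) \tag{15}
$$
Now we will show that the optimal solution to the following optimization problem corresponds to the LSBRE conditional policies:
$$
\min_{\pi_i^{1:T}} D_{\mathrm{KL}}\left( \hat{p}(\tau) \,\|\, p(\tau) \right) \tag{16}
$$
The optimization problem in Equation (16) is equivalent to the following (the partition function of the desired distribution is a constant with respect to the optimized policies, and the initial-state, transition, and $\pi_{-i}^t$ terms cancel between $\log p(\tau)$ and $\log \hat{p}(\tau)$):
$$
\max_{\pi_i^{1:T}} \mathbb{E}_{\tau \sim \hat{p}(\tau)} \left[ \sum_{t=1}^T r_i(s^t, a^t) - \log \pi_i^t(a_i^t \mid a_{-i}^t, s^t) \right]
= \sum_{t=1}^T \mathbb{E}_{(s^t, a^t) \sim \hat{p}(s^t, a^t)} \left[ r_i(s^t, a^t) - \log \pi_i^t(a_i^t \mid a_{-i}^t, s^t) \right] \tag{17}
$$
To maximize this objective, we can use a dynamic programming procedure. Let us first consider the base case of optimizing $\pi_i^T(a_i^T \mid a_{-i}^T, s^T)$:
$$
\begin{aligned}
& \mathbb{E}_{(s^T, a^T) \sim \hat{p}(s^T, a^T)} \left[ r_i(s^T, a^T) - \log \pi_i^T(a_i^T \mid a_{-i}^T, s^T) \right] \\
&= \mathbb{E}_{s^T \sim \hat{p}(s^T),\, a_{-i}^T \sim \pi_{-i}^T(\cdot \mid s^T)} \left[ - D_{\mathrm{KL}}\left( \pi_i^T(a_i^T \mid a_{-i}^T, s^T) \,\Big\|\, \frac{\exp(r_i(s^T, a_i^T, a_{-i}^T))}{\exp(V_i(s^T, a_{-i}^T))} \right) + V_i(s^T, a_{-i}^T) \right]
\end{aligned} \tag{18}
$$
where $\exp(V_i(s^T, a_{-i}^T))$ is the partition function and $V_i(s^T, a_{-i}^T) = \log \sum_{a_i'} \exp(r_i(s^T, a_i', a_{-i}^T))$. The optimal policy is given by:
$$
\pi_i^T(a_i^T \mid a_{-i}^T, s^T) = \exp\left( r_i(s^T, a_i^T, a_{-i}^T) - V_i(s^T, a_{-i}^T) \right) \tag{19}
$$
With the optimal policy in Equation (19), Equation (18) is equivalent to (with the KL divergence being zero):
$$
\mathbb{E}_{(s^T, a^T) \sim \hat{p}(s^T, a^T)} \left[ r_i(s^T, a^T) - \log \pi_i^T(a_i^T \mid a_{-i}^T, s^T) \right]
= \mathbb{E}_{s^T \sim \hat{p}(s^T),\, a_{-i}^T \sim \pi_{-i}^T(\cdot \mid s^T)} \left[ V_i(s^T, a_{-i}^T) \right] \tag{20}
$$
Then recursively, for a given time step $t$, $\pi_i^t(a_i^t \mid a_{-i}^t, s^t)$ must maximize:
$$
\mathbb{E}_{(s^t, a^t) \sim \hat{p}(s^t, a^t)} \left[ r_i(s^t, a^t) - \log \pi_i^t(a_i^t \mid a_{-i}^t, s^t) + \mathbb{E}_{s^{t+1} \sim P(\cdot \mid s^t, a^t),\, a_{-i}^{t+1} \sim \pi_{-i}^{t+1}(\cdot \mid s^{t+1})} \left[ V_i^{\pi^{t+1:T}}(s^{t+1}, a_{-i}^{t+1}) \right] \right] \tag{21}
$$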
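As a numerical sanity check of the base case in Equations (18)–(20), the following minimal sketch fixes one state and one joint action of the other agents, and treats agent i's reward over a small discrete action set as a list of numbers. It computes the log-partition value V as a log-sum-exp of the rewards, forms the policy exp(r − V), and compares the entropy-regularized objective E[r − log π] against V. The function names and the reward values are illustrative assumptions, not part of the original derivation.

```python
import math


def soft_value(rewards):
    """V = log sum_a exp(r(a)), computed stably by subtracting the maximum."""
    m = max(rewards)
    return m + math.log(sum(math.exp(r - m) for r in rewards))


def optimal_conditional_policy(rewards):
    """pi(a) = exp(r(a) - V): a softmax over agent i's own actions."""
    v = soft_value(rewards)
    return [math.exp(r - v) for r in rewards]


def entropy_regularized_objective(policy, rewards):
    """E_{a ~ pi}[r(a) - log pi(a)], with the convention 0 * log 0 = 0."""
    return sum(p * (r - math.log(p)) for p, r in zip(policy, rewards) if p > 0)


# Hypothetical rewards r_i(s^T, a_i, a_{-i}^T) for three actions a_i,
# with the state s^T and the other agents' actions a_{-i}^T held fixed.
rewards = [1.0, 0.5, -2.0]

v = soft_value(rewards)
pi_star = optimal_conditional_policy(rewards)
uniform = [1.0 / len(rewards)] * len(rewards)
greedy = [1.0, 0.0, 0.0]

print("V:", v)
print("objective at pi*:", entropy_regularized_objective(pi_star, rewards))
print("objective at uniform:", entropy_regularized_objective(uniform, rewards))
print("objective at greedy:", entropy_regularized_objective(greedy, rewards))
```

For the softmax policy the objective matches V, since the KL term vanishes. The uniform and greedy policies both score strictly lower, as expected when the KL term is positive.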