Multi-Agent Generative Adversarial Imitation Learning

07/26/2018, by Jiaming Song et al., Stanford University

Imitation learning algorithms can be used to learn a policy from expert demonstrations without access to a reward signal. However, most existing approaches are not applicable in multi-agent settings due to the existence of multiple (Nash) equilibria and non-stationary environments. We propose a new framework for multi-agent imitation learning for general Markov games, where we build upon a generalized notion of inverse reinforcement learning. We further introduce a practical multi-agent actor-critic algorithm with good empirical performance. Our method can be used to imitate complex behaviors in high-dimensional environments with multiple cooperative or competing agents.


1 Introduction

Reinforcement learning (RL) methods are becoming increasingly successful at optimizing reward signals in complex, high dimensional environments (Espeholt et al., 2018). A key limitation of RL, however, is the difficulty of designing suitable reward functions for complex and not well-specified tasks (Hadfield-Menell et al., 2017; Amodei et al., 2016). If the reward function does not cover all important aspects of the task, the agent could easily learn undesirable behaviors (Amodei and Clark, 2016). This problem is further exacerbated in multi-agent scenarios, such as multiplayer games (Peng et al., 2017), multi-robot control (Matignon et al., 2012) and social interactions (Leibo et al., 2017); in these cases, agents do not even necessarily share the same reward function, especially in competitive settings where the agents might have conflicting rewards.

Imitation learning methods address these problems via expert demonstrations (Ziebart et al., 2008; Englert and Toussaint, 2015; Finn et al., 2016; Stadie et al., 2017); the agent directly learns desirable behaviors by imitating an expert. Notably, inverse reinforcement learning (IRL) frameworks assume that the expert is (approximately) optimizing an underlying reward function, and attempt to recover a reward function that rationalizes the demonstrations; an agent policy is subsequently learned through RL (Ng et al., 2000; Abbeel and Ng, 2004). Unfortunately, this paradigm is not suitable for general multi-agent settings due to the environment being non-stationary from the perspective of individual agents (Lowe et al., 2017) and the existence of multiple equilibrium solutions (Hu et al., 1998). The optimal policy of one agent could depend on the policies of other agents, and vice versa, so there could exist multiple solutions in which each agent's policy is the optimal response to the others.

In this paper, we propose a new framework for multi-agent imitation learning: provided with demonstrations of a set of experts interacting with each other within the same environment, we aim to learn multiple parametrized policies that imitate the behavior of each expert respectively. Using the framework of Markov games, we integrate multi-agent RL with a suitable extension of multi-agent inverse RL. The resulting procedure strictly generalizes Generative Adversarial Imitation Learning (GAIL; Ho and Ermon, 2016) in the single-agent case. Imitation learning corresponds to a two-player game between a generator and a discriminator. The generator controls the policies of all the agents in a distributed way, and the discriminator contains a classifier for each agent that is trained to distinguish that agent's behavior from that of the corresponding expert. Upon training, the behaviors produced by the policies are indistinguishable from the training data through the discriminator. We can incorporate prior knowledge into the discriminators, including the presence of cooperative or competitive agents. In addition, we propose a novel multi-agent natural policy gradient algorithm that addresses the issue of high-variance gradient estimates commonly observed in reinforcement learning (Lowe et al., 2017; Foerster et al., 2016). Empirical results demonstrate that our method can imitate complex behaviors in high-dimensional environments, such as particle environments and cooperative robotic control tasks, with multiple cooperative or competitive agents; the imitated behaviors are close to the expert behaviors with respect to "true" reward functions which the agents do not have access to during training.

2 Preliminaries

2.1 Markov games

We consider an extension of Markov decision processes (MDPs) called Markov games (Littman, 1994). A Markov game (MG) for $N$ agents is defined via a set of states $\mathcal{S}$ and $N$ sets of actions $\{\mathcal{A}_i\}_{i=1}^{N}$. The function $P : \mathcal{S} \times \mathcal{A}_1 \times \cdots \times \mathcal{A}_N \to \mathcal{P}(\mathcal{S})$ describes the (stochastic) transition process between states, where $\mathcal{P}(\mathcal{S})$ denotes the set of probability distributions over the set $\mathcal{S}$. Given that we are in state $s^t$ at time $t$, the agents take actions $(a_1, \ldots, a_N)$ and the state transitions to $s^{t+1}$ with probability $P(s^{t+1} \mid s^t, a_1, \ldots, a_N)$.

Each agent $i$ obtains a (bounded) reward given by a function $r_i : \mathcal{S} \times \mathcal{A}_1 \times \cdots \times \mathcal{A}_N \to \mathbb{R}$. Each agent $i$ aims to maximize its own total expected return $R_i = \mathbb{E}\left[\sum_{t=0}^{T} \gamma^t r_i^{(t)}\right]$, where $\gamma$ is the discount factor and $T$ is the time horizon, by selecting actions through a (stationary and Markovian) stochastic policy $\pi_i : \mathcal{S} \times \mathcal{A}_i \to [0, 1]$. The initial states are determined by a distribution $\eta : \mathcal{S} \to [0, 1]$.

The joint policy is defined as $\boldsymbol{\pi}(\boldsymbol{a} \mid s) \triangleq \prod_{i=1}^{N} \pi_i(a_i \mid s)$, where we use bold variables without subscript $i$ to denote the concatenation of the corresponding variables for all agents (e.g., $\boldsymbol{\pi}$ denotes the joint policy in a multi-agent setting, $\boldsymbol{r}$ denotes all rewards, and $\boldsymbol{a}$ denotes the actions of all agents).

We use expectation with respect to a policy to denote an expectation with respect to the trajectories it generates. For example,

$$\mathbb{E}_{\boldsymbol{\pi}}\left[r(s, \boldsymbol{a})\right] \triangleq \mathbb{E}_{s^t, \boldsymbol{a}^t}\left[\sum_{t=0}^{T} \gamma^t r(s^t, \boldsymbol{a}^t)\right]$$

denotes the following sampling process for the right-hand side: $s^0 \sim \eta$, $\boldsymbol{a}^t \sim \boldsymbol{\pi}(\cdot \mid s^t)$, $s^{t+1} \sim P(\cdot \mid s^t, \boldsymbol{a}^t)$. If we do not take the expectation over the state $s$, then $\mathbb{E}_{\boldsymbol{a} \sim \boldsymbol{\pi}(\cdot \mid s)}[\cdot]$ assumes the policy samples only the next-step action $\boldsymbol{a}$.

We use the subscript $-i$ to denote all agents except agent $i$. For example, $(a_i, a_{-i})$ represents $(a_1, \ldots, a_N)$, the actions of all $N$ agents.
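To make the trajectory notation concrete, the following is a minimal sketch of sampling a joint trajectory and accumulating each agent's discounted return; the `env` and `policies` interfaces are assumptions for illustration and not part of any specific library.

```python
import numpy as np

def sample_returns(env, policies, horizon, gamma=0.99):
    """Roll out the joint policy pi = (pi_1, ..., pi_N) once and return the
    per-agent discounted returns sum_t gamma^t r_i(s^t, a^t)."""
    n_agents = len(policies)
    returns = np.zeros(n_agents)
    state = env.reset()                                  # s^0 ~ eta
    for t in range(horizon):
        actions = [pi.sample(state) for pi in policies]  # a^t ~ pi(. | s^t)
        state, rewards, done = env.step(actions)         # s^{t+1} ~ P(. | s^t, a^t)
        returns += (gamma ** t) * np.asarray(rewards)
        if done:
            break
    return returns
```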

2.2 Reinforcement learning and Nash equilibrium

In reinforcement learning (RL), the goal of each agent is to maximize the total expected return $\mathbb{E}_{\pi}[r(s, a)]$ given access to the reward signal $r$. In single-agent RL, an optimal Markovian policy exists, but the optimal policy might not be unique (e.g., all policies are optimal for an identically zero reward; see Sutton and Barto (1998), Chapter 3.8). An entropy regularizer can be introduced to resolve this ambiguity. The optimal policy is found via the following RL procedure:

$$\mathrm{RL}(r) = \arg\max_{\pi \in \Pi} \; H(\pi) + \mathbb{E}_{\pi}[r(s, a)] \tag{1}$$

where $H(\pi)$ is the $\gamma$-discounted causal entropy (Bloem and Bambos, 2014) of policy $\pi \in \Pi$.

Definition ($\gamma$-discounted causal entropy). The $\gamma$-discounted causal entropy for a policy $\pi$ is defined as

$$H(\pi) \triangleq \mathbb{E}_{\pi}\left[-\log \pi(a \mid s)\right] = \mathbb{E}_{s^t, a^t}\left[-\sum_{t=0}^{T} \gamma^t \log \pi(a^t \mid s^t)\right].$$

If we scale the reward function by any positive value, the addition of $H(\pi)$ resolves the ambiguity by selecting, among the set of optimal policies, the one with the highest causal entropy (for the remainder of the paper, we use the term "entropy" to denote the $\gamma$-discounted causal entropy of a policy). The policy with both the highest reward and the highest entropy is unique because the entropy function is concave with respect to $\pi$ and the set of optimal policies is convex.
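As a concrete illustration, the $\gamma$-discounted causal entropy can be estimated from sampled trajectories by averaging discounted negative log-probabilities of the chosen actions; the `policy.log_prob(state, action)` interface below is a hypothetical stand-in.

```python
import numpy as np

def discounted_causal_entropy(trajectories, policy, gamma=0.99):
    """Monte Carlo estimate of H(pi) = E_pi[-sum_t gamma^t log pi(a^t | s^t)]."""
    estimates = []
    for traj in trajectories:                       # traj: list of (state, action)
        total = 0.0
        for t, (s, a) in enumerate(traj):
            total -= (gamma ** t) * policy.log_prob(s, a)
        estimates.append(total)
    return float(np.mean(estimates))
```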

In Markov games, however, the optimal policy of an agent depends on the other agents' policies. One approach is to use an equilibrium solution concept, such as Nash equilibrium (Hu et al., 1998). Informally, a set of policies $\{\pi_i\}_{i=1}^{N}$ is a Nash equilibrium if no agent can achieve a higher reward by unilaterally changing its policy, i.e., $\forall i \in [N], \forall \hat{\pi}_i \neq \pi_i : \mathbb{E}_{\pi_i, \pi_{-i}}[r_i] \geq \mathbb{E}_{\hat{\pi}_i, \pi_{-i}}[r_i]$. The process of finding a Nash equilibrium can be defined as a constrained optimization problem (Filar and Vrieze (2012), Theorem 3.7.2):

$$\min_{\boldsymbol{\pi}, v} \; f_r(\boldsymbol{\pi}, v) = \sum_{i=1}^{N} \sum_{s \in \mathcal{S}} \left( v_i(s) - \mathbb{E}_{a_i \sim \pi_i(\cdot \mid s)}\left[q_i(s, a_i)\right] \right) \tag{2}$$

$$\text{s.t.} \quad v_i(s) \geq q_i(s, a_i) \triangleq \mathbb{E}_{a_{-i} \sim \pi_{-i}(\cdot \mid s)}\left[ r_i(s, a_i, a_{-i}) + \gamma \, \mathbb{E}_{s' \sim P(\cdot \mid s, \boldsymbol{a})}\left[v_i(s')\right] \right] \quad \forall i \in [N],\ s \in \mathcal{S},\ a_i \in \mathcal{A}_i \tag{3}$$

where the joint action $\boldsymbol{a} = (a_i, a_{-i})$ includes the action $a_i$ sampled from $\pi_i$ and the actions $a_{-i}$ sampled from $\pi_{-i}$. Intuitively, $v_i(s)$ could represent some estimated value function for each state, and $q_i(s, a_i)$ represents the $Q$-function that corresponds to $v_i$. The constraints enforce the Nash equilibrium condition: when the constraints are satisfied, $v_i(s) - \mathbb{E}_{a_i \sim \pi_i(\cdot \mid s)}[q_i(s, a_i)]$ is non-negative for every $i$ and $s$. Hence $f_r(\boldsymbol{\pi}, v)$ is always non-negative for a feasible $(\boldsymbol{\pi}, v)$. Moreover, this objective has a global minimum of zero if a Nash equilibrium exists, and $\boldsymbol{\pi}$ forms a Nash equilibrium if and only if $f_r(\boldsymbol{\pi}, v)$ reaches zero while $(\boldsymbol{\pi}, v)$ is a feasible solution (Prasad and Bhatnagar (2015), Theorem 2.4).
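As a small sanity check of this characterization (a sketch of ours, not the authors' code), consider a single-state two-player matrix game: if each $v_i$ is set to the stationary value of $\boldsymbol{\pi}$ (the Bellman solution), the objective term vanishes, so testing for a Nash equilibrium reduces to checking the feasibility constraints, i.e., that no unilateral pure-action deviation improves an agent's expected reward.

```python
import numpy as np

def nash_gap(R1, R2, p1, p2, gamma=0.9):
    """R1, R2: payoff matrices (rows: agent 1's actions, cols: agent 2's actions).
    p1, p2: mixed strategies. Returns (f_r, constraints_satisfied), where v_i is
    chosen as the stationary value of (p1, p2), so f_r is zero by construction
    and the Nash test is the feasibility check."""
    exp1, exp2 = float(p1 @ R1 @ p2), float(p1 @ R2 @ p2)
    v1, v2 = exp1 / (1 - gamma), exp2 / (1 - gamma)
    q1 = R1 @ p2 + gamma * v1        # q_1(a_1) for each pure action of agent 1
    q2 = R2.T @ p1 + gamma * v2      # q_2(a_2) for each pure action of agent 2
    feasible = bool(np.all(v1 >= q1 - 1e-8) and np.all(v2 >= q2 - 1e-8))
    f_r = (v1 - float(p1 @ q1)) + (v2 - float(p2 @ q2))
    return f_r, feasible

# Matching pennies: the uniform mixed strategy is the unique Nash equilibrium.
R1 = np.array([[1., -1.], [-1., 1.]])
print(nash_gap(R1, -R1, np.array([0.5, 0.5]), np.array([0.5, 0.5])))  # (~0.0, True)
print(nash_gap(R1, -R1, np.array([1.0, 0.0]), np.array([1.0, 0.0])))  # (~0.0, False)
```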

2.3 Inverse reinforcement learning

Suppose we do not have access to the reward signal $r$, but have demonstrations $\mathcal{D}$ provided by an expert ($N$ expert agents in Markov games). Imitation learning aims to learn policies that behave similarly to these demonstrations. In Markov games, we assume all experts/players operate in the same environment, and the demonstrations $\mathcal{D} = \{(s^j, \boldsymbol{a}^j)\}_{j=1}^{M}$ are collected by sampling $s^0 \sim \eta$, $\boldsymbol{a}^t \sim \boldsymbol{\pi}_E(\cdot \mid s^t)$, $s^{t+1} \sim P(\cdot \mid s^t, \boldsymbol{a}^t)$; we assume knowledge of $N$, $\gamma$, $\mathcal{S}$, and $\mathcal{A}$, as well as access to $P$ and $\eta$ as black boxes. We further assume that once we obtain $\mathcal{D}$, we cannot ask for additional expert interactions with the environment (unlike in DAgger (Ross et al., 2011) or CIRL (Hadfield-Menell et al., 2016)).

Let us first consider imitation in Markov decision processes (as a special case of Markov games) and the framework of single-agent Maximum Entropy IRL (Ziebart et al., 2008; Ho and Ermon, 2016), where the goal is to recover a reward function $r$ that rationalizes the expert behavior $\pi_E$:

$$\mathrm{IRL}(\pi_E) = \arg\max_{r \in \mathbb{R}^{\mathcal{S} \times \mathcal{A}}} \; \mathbb{E}_{\pi_E}\left[r(s, a)\right] - \left( \max_{\pi \in \Pi} H(\pi) + \mathbb{E}_{\pi}\left[r(s, a)\right] \right)$$

In practice, expectations with respect to $\pi_E$ are evaluated using samples from $\mathcal{D}$.

The IRL objective is ill-defined (Ng et al., 2000; Finn et al., 2016), and there are often multiple valid solutions when we consider all $r \in \mathbb{R}^{\mathcal{S} \times \mathcal{A}}$. For example, we can assign arbitrary rewards to trajectories that are not visited by the expert, so long as these trajectories yield lower rewards than the expert trajectories. To resolve this ambiguity, Ho and Ermon (2016) introduce a convex reward function regularizer $\psi$, which can be used to restrict rewards to be linear in a pre-determined set of features (Ho and Ermon, 2016):

$$\mathrm{IRL}_\psi(\pi_E) = \arg\max_{r} \; -\psi(r) + \mathbb{E}_{\pi_E}\left[r(s, a)\right] - \left( \max_{\pi \in \Pi} H(\pi) + \mathbb{E}_{\pi}\left[r(s, a)\right] \right) \tag{4}$$

2.4 Imitation by matching occupancy measures

Ho and Ermon (2016) interpret the imitation learning problem as matching two occupancy measures, i.e., the distributions over states and actions encountered when navigating the environment with a policy. Formally, for a policy $\pi \in \Pi$, the occupancy measure is defined as $\rho_\pi(s, a) = \pi(a \mid s) \sum_{t=0}^{\infty} \gamma^t P(s^t = s \mid \pi)$. Ho and Ermon (2016) draw a connection between IRL and occupancy measure matching, showing that the former is a dual of the latter:

Proposition (Proposition 3.1 in (Ho and Ermon, 2016))

$$\mathrm{RL} \circ \mathrm{IRL}_\psi(\pi_E) = \arg\min_{\pi \in \Pi} \; -H(\pi) + \psi^*(\rho_\pi - \rho_{\pi_E})$$

Here $\psi^*$ is the convex conjugate of $\psi$, which can be interpreted as a measure of similarity between the occupancy measures of the expert policy and the agent's policy. One instance of $\psi$, denoted $\psi_{\mathrm{GA}}$, gives rise to the Generative Adversarial Imitation Learning (GAIL) method:

$$\psi^*_{\mathrm{GA}}(\rho_\pi - \rho_{\pi_E}) = \max_{D \in (0,1)^{\mathcal{S} \times \mathcal{A}}} \; \mathbb{E}_{\pi}\left[\log D(s, a)\right] + \mathbb{E}_{\pi_E}\left[\log(1 - D(s, a))\right] \tag{5}$$

The resulting imitation learning method from Proposition 2.4 involves a discriminator (a classifier $D$) competing with a generator (a policy $\pi$). The discriminator attempts to distinguish real vs. synthetic trajectories (produced by $\pi$) by optimizing (5). The generator, on the other hand, aims to perform optimally under the reward function defined by the discriminator, thus "fooling" the discriminator with synthetic trajectories that are difficult to distinguish from the expert ones.
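As a sketch of the inner maximization in (5), the discriminator can be trained with a standard binary cross-entropy loss on policy versus expert state-action pairs, and its output can then serve as a surrogate reward for the generator; the NumPy snippet below is an illustration under these conventions, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def discriminator_loss(logits_policy, logits_expert, eps=1e-8):
    """Maximizing E_pi[log D] + E_piE[log(1 - D)] in (5) is equivalent to
    minimizing this cross-entropy (policy samples labeled 1, expert samples 0)."""
    d_pi, d_exp = sigmoid(logits_policy), sigmoid(logits_expert)
    return -(np.mean(np.log(d_pi + eps)) + np.mean(np.log(1.0 - d_exp + eps)))

def generator_reward(logits_policy, eps=1e-8):
    """One common surrogate reward for the policy update, -log D(s, a): the
    generator is rewarded when its pairs look expert-like to the discriminator."""
    return -np.log(sigmoid(logits_policy) + eps)
```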

3 Generalizing IRL to Markov games

Extending imitation learning to multi-agent settings is difficult because there are multiple rewards (one for each agent) and the notion of optimality is complicated by the need to consider an equilibrium solution (Hu et al., 1998). We use $\mathrm{MARL}(r)$ to denote the set of (stationary and Markovian) policies that form a Nash equilibrium under $r$ and have the maximum $\gamma$-discounted causal entropy (among all equilibria):

$$\mathrm{MARL}(r) = \arg\max_{\boldsymbol{\pi}, v} \; \sum_{i=1}^{N} H(\pi_i) \quad \text{s.t.} \quad v_i(s) \geq q_i(s, a_i)\ \ \forall i \in [N],\ s \in \mathcal{S},\ a_i \in \mathcal{A}_i; \qquad f_r(\boldsymbol{\pi}, v) = 0 \tag{6}$$

where $q$ is defined as in Eq. 3. Our goal is to define a suitable inverse operator MAIRL, in analogy to $\mathrm{IRL}_\psi$ in Eq. 4. The key idea of Eq. 4 is to choose a reward that creates a margin between the expert and every other policy. However, the constraints in the Nash equilibrium optimization (Eq. 6) make this challenging. To that end, we derive an equivalent Lagrangian formulation of (6), where we "move" the constraints into the objective function, so that we can define a margin between the expected rewards of two sets of policies that captures their "difference".

3.1 Equivalent constraints via temporal difference learning

Intuitively, the Nash equilibrium constraints imply that no agent can improve via 1-step temporal difference learning; if the condition in Equation 3 is not satisfied for some $i$, $s$, and $a_i$, this would suggest that we can update the policy of agent $i$ and its value function. Based on this notion, we can derive equivalent versions of the constraints corresponding to $t$-step temporal difference (TD) learning. For a certain policy $\boldsymbol{\pi}$ and reward $r$, let $\hat{v}_i(s; \boldsymbol{\pi}, r)$ be the unique solution to the Bellman equation:

$$\hat{v}_i(s; \boldsymbol{\pi}, r) = \mathbb{E}_{\boldsymbol{a} \sim \boldsymbol{\pi}(\cdot \mid s)}\left[ r_i(s, \boldsymbol{a}) + \gamma \, \mathbb{E}_{s' \sim P(\cdot \mid s, \boldsymbol{a})}\left[\hat{v}_i(s'; \boldsymbol{\pi}, r)\right] \right].$$

Denote by $\hat{Q}_i^{(t)}\left(\{s^{(j)}, a_i^{(j)}\}_{j=0}^{t}; \boldsymbol{\pi}, r\right)$ the discounted expected return for the $i$-th agent, conditioned on visiting the trajectory $\{s^{(j)}, a_i^{(j)}\}_{j=0}^{t-1}, s^{(t)}$ in the first $t$ steps and choosing action $a_i^{(t)}$ at the $t$-th step, when the other agents use policy $\pi_{-i}$:

$$\hat{Q}_i^{(t)}\left(\{s^{(j)}, a_i^{(j)}\}_{j=0}^{t}; \boldsymbol{\pi}, r\right) = \mathbb{E}_{a_{-i} \sim \pi_{-i}}\left[ \sum_{j=0}^{t} \gamma^j r_i(s^{(j)}, \boldsymbol{a}^{(j)}) \right] + \gamma^{t+1} \, \mathbb{E}_{s^{(t+1)}}\left[ \hat{v}_i(s^{(t+1)}; \boldsymbol{\pi}, r) \right].$$

Then $\boldsymbol{\pi}$ is a Nash equilibrium if and only if, for all $i \in [N]$, all $t \in \mathbb{N}_+$, and all length-$t$ state-action sequences of agent $i$,

$$\hat{v}_i\left(s^{(0)}; \boldsymbol{\pi}, r\right) \geq \hat{Q}_i^{(t)}\left(\{s^{(j)}, a_i^{(j)}\}_{j=0}^{t}; \boldsymbol{\pi}, r\right). \tag{7}$$

Intuitively, Theorem 3.1 states that if we replace the 1-step constraints with $t$-step constraints, we obtain the same set of solutions as $\mathrm{MARL}(r)$, since $t$-step TD updates (over one agent at a time) are still stationary with respect to a Nash equilibrium solution. So the constraints can be unrolled for $t$ steps and rewritten as in Equation 7.

3.2 Multi-agent inverse reinforcement learning

We are now ready to construct the Lagrangian dual of the primal problem in Equation 6, using the equivalent formulation from Theorem 3.1. The first observation is that $f_r(\boldsymbol{\pi}, \hat{v}) = 0$ for any policy $\boldsymbol{\pi}$ when $\hat{v}$ is defined as in Theorem 3.1 (see Lemma A.1 in the appendix). Therefore, we only need to consider the "unrolled" constraints from Theorem 3.1, obtaining the following dual problem

$$L^{(t)}_r(\boldsymbol{\pi}, \lambda) = \sum_{i=1}^{N} \sum_{\tau_i \in \mathcal{T}^{(t)}_i} \lambda(\tau_i) \left( \hat{Q}_i^{(t)}\left(\tau_i; \boldsymbol{\pi}, r\right) - \hat{v}_i\left(s^{(0)}; \boldsymbol{\pi}, r\right) \right) \tag{8}$$

where $\mathcal{T}^{(t)}_i$ is the set of all length-$t$ trajectories of the form $\{s^{(j)}, a_i^{(j)}\}_{j=0}^{t}$, with $s^{(0)}$ as the initial state, $\lambda$ is a vector of Lagrange multipliers (one per constraint), and $\hat{v}$ is defined as in Theorem 3.1. This dual formulation is a sum over agents and trajectories, which corresponds one-to-one with the constraints in Equation 7.

In the following theorem, we show that for a specific choice of $\lambda$ we can recover the difference between the sums of expected rewards of two sets of policies, a performance gap similar to the one used in single-agent IRL in Eq. (4). This amounts to "relaxing" the primal problem.

For any two policies $\boldsymbol{\pi}^*$ and $\boldsymbol{\pi}$, let $\lambda^*_{\pi_i}(\tau_i)$ be the probability of generating the length-$t$ sequence $\tau_i = \{s^{(j)}, a_i^{(j)}\}_{j=0}^{t}$ using policy $\pi_i$ for agent $i$ and $\pi^*_{-i}$ for the other agents. Then

$$\lim_{t \to \infty} L^{(t)}_r\left(\boldsymbol{\pi}^*, \lambda^*\right) = \sum_{i=1}^{N} \left( \mathbb{E}_{\pi_i, \pi^*_{-i}}\left[r_i(s, \boldsymbol{a})\right] - \mathbb{E}_{\boldsymbol{\pi}^*}\left[r_i(s, \boldsymbol{a})\right] \right) \tag{9}$$

where $L^{(t)}_r$ corresponds to the dual function in (8) and the multipliers $\lambda^*$ are the probabilities of generating their respective trajectories of length $t$.

We provide a proof in Appendix A.3. Intuitively, the weights $\lambda^*$ correspond to the probability of generating trajectory $\tau_i$ when the policy is $\pi_i$ for agent $i$ and $\pi^*_{-i}$ for the other agents. As $t \to \infty$, the first term on the left-hand side of Equation 9, $\sum_{\tau_i} \lambda^*(\tau_i) \hat{Q}_i^{(t)}(\tau_i; \boldsymbol{\pi}^*, r)$, converges to the expected total reward $\mathbb{E}_{\pi_i, \pi^*_{-i}}[r_i]$, which is the first term on the right-hand side. The marginal of $\lambda^*$ over the initial states is the initial state distribution, so the second term on the left-hand side, $\sum_{\tau_i} \lambda^*(\tau_i) \hat{v}_i(s^{(0)}; \boldsymbol{\pi}^*, r)$, converges to $\mathbb{E}_{\boldsymbol{\pi}^*}[r_i]$, which is the second term on the right-hand side. Thus, the left-hand side and the right-hand side of Equation 9 coincide as $t \to \infty$.

Theorem 3.2 motivates the following definition of multi-agent IRL with regularizer $\psi$:

$$\mathrm{MAIRL}_\psi(\boldsymbol{\pi}_E) = \arg\max_{r} \; -\psi(r) + \sum_{i=1}^{N} \left( \mathbb{E}_{\boldsymbol{\pi}_E}\left[r_i(s, \boldsymbol{a})\right] - \max_{\pi_i} \left( \beta H_i(\pi_i) + \mathbb{E}_{\pi_i, \pi_{E_{-i}}}\left[r_i(s, \boldsymbol{a})\right] \right) \right) \tag{10}$$

where $H_i(\pi_i)$ is the discounted causal entropy for policy $\pi_i$ when the other agents follow $\pi_{E_{-i}}$, and $\beta$ is a hyper-parameter controlling the strength of the entropy regularization term, as in (Ho and Ermon, 2016). This formulation is a strict generalization of the single-agent IRL in (Ho and Ermon, 2016): if $N = 1$ and $\beta = 1$, then $\mathrm{MAIRL}_\psi(\pi_E) = \mathrm{IRL}_\psi(\pi_E)$. Furthermore, if the regularization $\psi$ is additively separable and, for each agent $i$, $\pi_{E_i}$ is the unique optimal response to the other experts $\pi_{E_{-i}}$, we obtain the following result. Assume that $\psi(r) = \sum_{i=1}^{N} \psi_i(r_i)$, that $\psi_i$ is convex for each $i \in [N]$, and that $\mathrm{MARL}(r)$ has a unique solution (the set of Nash equilibria is not always convex, so we have to assume $\mathrm{MARL}$ returns a unique solution) for all $r \in \mathrm{MAIRL}_\psi(\boldsymbol{\pi}_E)$; then

$$\mathrm{MARL} \circ \mathrm{MAIRL}_\psi(\boldsymbol{\pi}_E) = \arg\min_{\boldsymbol{\pi} \in \Pi} \sum_{i=1}^{N} -\beta H_i(\pi_i) + \psi_i^*\left(\rho_{\pi_i, \boldsymbol{\pi}_{E_{-i}}} - \rho_{\boldsymbol{\pi}_E}\right)$$

where $\rho_{\pi_i, \boldsymbol{\pi}_{E_{-i}}}$ denotes the occupancy measure of the joint policy in which agent $i$ uses $\pi_i$ and the other agents use $\pi_{E_{-i}}$.

The above theorem suggests that $\psi$-regularized multi-agent inverse reinforcement learning is seeking, for each agent $i$, a policy whose occupancy measure is close to the one obtained by replacing policy $\pi_i$ with expert $\pi_{E_i}$, as measured by the convex function $\psi_i^*$.

However, we do not assume access to the expert policies $\boldsymbol{\pi}_E$ during training, so it is not possible to obtain $\rho_{\pi_i, \boldsymbol{\pi}_{E_{-i}}}$. In the settings of this paper, we consider an alternative approach where we match the occupancy measures of $\boldsymbol{\pi}$ and $\boldsymbol{\pi}_E$ instead. We obtain our practical algorithm by selecting an adversarial reward function regularizer and removing the effect of the entropy regularizers.

Proposition

If $\beta = 0$, $\psi(r) = \sum_{i=1}^{N} \psi_i(r_i)$, and $\psi_i(r_i) = \mathbb{E}_{\boldsymbol{\pi}_E}\left[g(r_i(s, \boldsymbol{a}))\right]$ where

$$g(x) = \begin{cases} -x - \log(1 - e^{x}) & \text{if } x < 0 \\ +\infty & \text{otherwise,} \end{cases}$$

then

$$\mathrm{MARL} \circ \mathrm{MAIRL}_\psi(\boldsymbol{\pi}_E) = \arg\min_{\boldsymbol{\pi} \in \Pi} \sum_{i=1}^{N} \psi_i^*\left(\rho_{\boldsymbol{\pi}} - \rho_{\boldsymbol{\pi}_E}\right).$$

Theorem 3.2 and Proposition 3.2 highlight the differences from the single-agent scenario. On the one hand, in Theorem 3.2 we assume that $\mathrm{MARL}(r)$ has a unique solution, which is always true in the single-agent case due to the convexity of the space of optimal policies. On the other hand, in Proposition 3.2 we remove the entropy regularizer, because here the causal entropy for $\pi_i$ may depend on the policies of the other agents, so the entropy regularizers on the two sides are not the same quantity. Specifically, the entropy on the left-hand side conditions on $\pi_{E_{-i}}$ and the entropy on the right-hand side conditions on $\pi_{-i}$ (a distinction that disappears in the single-agent case).

4 Practical multi-agent imitation learning

Despite the recent successes of deep RL, it is notoriously hard to train policies with RL algorithms because of high-variance gradient estimates. This is further exacerbated in Markov games, since an agent's optimal policy depends on the other agents (Lowe et al., 2017; Foerster et al., 2016). In this section, we address these problems and propose practical algorithms for multi-agent imitation.

4.1 Multi-agent generative adversarial imitation learning

We select the $\psi_i$ from Proposition 3.2 to be our reward function regularizer; this corresponds to the two-player game introduced in Generative Adversarial Imitation Learning (GAIL; Ho and Ermon, 2016). For each agent $i$, we have a discriminator (denoted $D_{\omega_i}$) mapping state-action pairs to scores, optimized to discriminate expert demonstrations from behaviors produced by $\pi_i$. Implicitly, $D_{\omega_i}$ plays the role of a reward function for the generator, which in turn attempts to train the agent to maximize its reward, thus fooling the discriminator. We optimize the following objective:

$$\min_{\theta} \max_{\omega} \; \sum_{i=1}^{N} \mathbb{E}_{\boldsymbol{\pi}_\theta}\left[\log D_{\omega_i}(s, a_i)\right] + \mathbb{E}_{\boldsymbol{\pi}_E}\left[\log\left(1 - D_{\omega_i}(s, a_i)\right)\right] \tag{11}$$

We update $\boldsymbol{\pi}_\theta$ through reinforcement learning, where we also use a baseline $V_\phi$ to reduce variance. We outline the resulting algorithm, Multi-Agent GAIL (MAGAIL), in Appendix B.
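A minimal sketch of one decentralized MAGAIL iteration is given below; the `Discriminator` and `Policy` method names (`bce_loss`, `step`, `reward`, `policy_gradient_step`) are hypothetical stand-ins for whatever framework is used, and the surrogate reward is one common choice rather than a prescribed one.

```python
def magail_update(policies, discriminators, expert_batches, states, actions):
    """One decentralized MAGAIL iteration (sketch). `states` holds the shared
    states of a batch of joint rollouts and `actions[i]` the matching actions
    of agent i; `expert_batches[i]` holds (s, a_i) pairs from expert i."""
    n_agents = len(policies)

    # 1) Update each agent's discriminator to separate its (s, a_i) pairs
    #    from the corresponding expert's pairs.
    for i in range(n_agents):
        d_loss = discriminators[i].bce_loss(policy_pairs=(states, actions[i]),
                                            expert_pairs=expert_batches[i])
        discriminators[i].step(d_loss)

    # 2) Update each policy with RL on the surrogate reward provided by its
    #    discriminator (e.g. -log D_i(s, a_i)); the paper uses MACK for this step.
    for i in range(n_agents):
        rewards_i = discriminators[i].reward(states, actions[i])
        policies[i].policy_gradient_step(states, actions[i], rewards_i)
```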

We can augment the reward regularizer $\psi_i$ using an indicator denoting whether the rewards fit our prior knowledge; the augmented reward regularizer is then equal to $\psi_i(r_i)$ if the indicator is satisfied and $+\infty$ otherwise. We introduce three types of prior for common settings.

Centralized

The simplest case is to assume that the agents are fully cooperative, i.e., they share the same reward function. Here the indicator requires $r_1 = r_2 = \cdots = r_N$, and the corresponding $\psi^*$ is realized by a single (centralized) discriminator over the joint states and actions of all agents. One could argue this corresponds to the GAIL case, where the RL procedure operates on multiple agents (a joint policy).

Decentralized

We make no prior assumptions about the correlation between the rewards. Here the indicator places no restriction on the rewards, and $\psi^*$ decomposes into one discriminator $D_{\omega_i}(s, a_i)$ for each agent $i$, which discriminates the trajectories as observed by agent $i$. However, these discriminators are not learned independently, as they interact indirectly via the environment.

Zero Sum

Assume there are two agents that receive opposite rewards, so $r_1 = -r_2$. As such, $\psi$ is no longer additively separable. Nevertheless, an adversarial training procedure can be designed using the following fact:

$$\mathbb{E}_{\pi_{E_1}, \pi_2}\left[r_1(s, \boldsymbol{a})\right] \geq \mathbb{E}_{\pi_{E_1}, \pi_{E_2}}\left[r_1(s, \boldsymbol{a})\right] \geq \mathbb{E}_{\pi_1, \pi_{E_2}}\left[r_1(s, \boldsymbol{a})\right]$$

where $\mathbb{E}[r_1]$ is the expected outcome for agent 1. The discriminator could maximize the reward for trajectories from $(\pi_{E_1}, \pi_2)$ and minimize the reward for trajectories from $(\pi_1, \pi_{E_2})$.

(a) Centralized (Cooperative)
(b) Decentralized (Mixed)
(c) Zero-sum (Competitive)
Figure 1: Different MAGAIL algorithms obtained with different priors on the reward structure. The discriminator tries to assign higher rewards to the top row and lower rewards to the bottom row. In the centralized and decentralized settings, the policies interact with the environment to match the expert rewards. In the zero-sum setting, the policies do not interact with the environment; expert and policy trajectories are paired together as input to the discriminator.

These three settings are summarized in Figure 1.
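To make the three priors concrete, the sketch below shows how per-agent surrogate rewards could be derived from discriminator scores under each assumption; the score interfaces and the $-\log D$ reward are illustrative choices, and the actual zero-sum construction additionally pairs expert and policy trajectories as discriminator input, as in Figure 1(c).

```python
import numpy as np

def surrogate_rewards(d_joint, d_per_agent, n_agents, prior="decentralized", eps=1e-8):
    """d_joint: score of a single discriminator on the joint (s, a);
    d_per_agent: list of per-agent discriminator scores D_i(s, a_i) in (0, 1)."""
    if prior == "centralized":          # fully cooperative: one shared reward
        shared = -np.log(d_joint + eps)
        return [shared] * n_agents
    if prior == "decentralized":        # no assumed correlation between rewards
        return [-np.log(d + eps) for d in d_per_agent]
    if prior == "zero_sum":             # two agents with r_2 = -r_1
        r1 = -np.log(d_per_agent[0] + eps)
        return [r1, -r1]
    raise ValueError(prior)
```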

4.2 Multi-agent actor-critic with Kronecker factors

To optimize over the generator parameters in Eq. (11), we wish to use an algorithm for multi-agent RL that has good sample efficiency in practice. Our algorithm, which we refer to as Multi-agent Actor-Critic with Kronecker-factors (MACK), is based on Actor-Critic with Kronecker-factored Trust Region (ACKTR, (Wu et al., 2017)), a state-of-the-art natural policy gradient (Amari, 1998; Kakade, 2002) method in deep RL. MACK uses the framework of centralized training with decentralized execution (Foerster et al., 2016): policies are trained with additional information to reduce variance, but such information is not used at execution time. We let the advantage function of every agent be a function of all agents' observations and actions:

$$A_i^{\boldsymbol{\pi}}\left(s^{(t)}, \boldsymbol{a}^{(t)}\right) = \sum_{j=0}^{k-1} \gamma^j r_i\left(s^{(t+j)}, \boldsymbol{a}^{(t+j)}\right) + \gamma^k V_{\phi_i}\left(s^{(t+k)}, a_{-i}^{(t+k)}\right) - V_{\phi_i}\left(s^{(t)}, a_{-i}^{(t)}\right) \tag{12}$$

where $V_{\phi_i}(s, a_{-i})$ is the baseline for agent $i$, utilizing the additional information $a_{-i}$ for variance reduction. We use (approximated) natural policy gradients to update both $\theta$ and $\phi$, but without trust regions to schedule the learning rate; a linear decay learning rate schedule achieves similar empirical performance.
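A sketch of the resulting advantage estimate is given below: the baseline conditions on the other agents' actions (the extra information available during centralized training), while each actor still acts from its own observation; the k-step form shown is one common choice rather than the only one.

```python
import numpy as np

def k_step_advantage(rewards_i, states, other_actions, baseline_i, gamma=0.99, k=5):
    """A_i(t) ~= sum_{j<k} gamma^j r_i(t+j) + gamma^k b_i(t+k) - b_i(t), where
    the baseline b_i(s, a_{-i}) also sees the other agents' actions; it is only
    used during training, never at execution time."""
    T = len(rewards_i)
    b = np.array([baseline_i(states[t], other_actions[t]) for t in range(T)])
    adv = np.zeros(T)
    for t in range(T):
        steps = min(k, T - t)
        ret = sum(gamma ** j * rewards_i[t + j] for j in range(steps))
        if t + steps < T:
            ret += gamma ** steps * b[t + steps]  # bootstrap from the baseline
        adv[t] = ret - b[t]
    return adv
```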

MACK has some notable differences from Multi-Agent Deep Deterministic Policy Gradient (MADDPG; Lowe et al., 2017). On the one hand, MACK does not assume knowledge of the other agents' policies, nor does it try to infer them; the value estimator merely collects experience from other agents (and treats them as black boxes). On the other hand, MACK does not require gradient estimators such as Gumbel-softmax (Jang et al., 2016; Maddison et al., 2016) to optimize over discrete actions, which is necessary for DDPG (Lillicrap et al., 2015).

5 Experiments

We evaluate the performance of the centralized, decentralized, and zero-sum versions of MAGAIL under two types of environments. One is a particle environment that allows for complex interactions and behaviors; the other is a control task in which multiple agents cooperate to move a plank forward. We collect results by averaging over 5 random seeds. Our implementation is based on OpenAI baselines (Dhariwal et al., 2017); please refer to Appendix C for implementation details.

We compare our methods (centralized, decentralized, and zero-sum MAGAIL) with two baselines. The first is behavior cloning (BC), which learns a maximum likelihood estimate of $a_i$ given each state $s$ and does not require actions from other agents. The second is a GAIL IRL baseline that operates on each agent separately: for each agent, we first pretrain the other agents with BC, and then train that agent with GAIL; we then gather the trained GAIL policies from all the agents and evaluate their performance.
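The BC baseline reduces to per-agent supervised learning; a minimal sketch (assuming a differentiable `log_prob` on the policy) is:

```python
def behavior_cloning_loss(policy_i, demo_states, demo_actions_i):
    """Average negative log-likelihood of expert actions for agent i; no other
    agents' actions are required."""
    nll = -sum(policy_i.log_prob(s, a) for s, a in zip(demo_states, demo_actions_i))
    return nll / len(demo_states)
```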

5.1 Particle environments

We first consider the particle environment proposed in (Lowe et al., 2017), which consists of several agents and landmarks. We consider two cooperative environments and two competitive ones. All environments have an underlying true reward function that allows us to evaluate the performance of learned agents.

The environments include: Cooperative Communication – two agents must cooperate to reach one of three colored landmarks. One agent (“speaker”) knows the goal but cannot move, so it must convey the message to the other agent (“listener”) that moves but does not observe the goal. Cooperative Navigation – three agents must cooperate through physical actions to reach three landmarks; ideally, each agent should cover a single landmark. Keep-Away – two agents have contradictory goals, where agent 1 tries to reach one of the two targeted landmarks, while agent 2 (the adversary) tries to keep agent 1 from reaching its target. The adversary does not observe the target, so it must act based on agent 1’s actions. Predator-Prey – three slower cooperating adversaries must chase the faster agent in a randomly generated environment with obstacles; the adversaries are rewarded by touching the agent while the agent is penalized.

For the cooperative tasks, we use an analytic expression to define the expert policies; for the competitive tasks, we use MACK to train expert policies based on the true underlying rewards (using larger policy and value networks than the ones we use for imitation). We then use the expert policies to simulate trajectories $\mathcal{D}$, and perform imitation learning on $\mathcal{D}$ as demonstrations, assuming the underlying rewards are unknown. Following (Li et al., 2017), we pretrain our multi-agent GAIL methods and the GAIL baseline using behavior cloning as initialization to reduce the sample complexity of exploration. We consider 100 to 400 episodes of expert demonstrations, each with 50 timesteps, which is close to the number of timesteps used for the control tasks in Ho and Ermon (2016). Moreover, we randomly sample the starting positions of the agents and landmarks in each episode, so our policies have to learn to generalize when they encounter new configurations.

5.1.1 Cooperative tasks

We evaluate performance in cooperative tasks via the average expected reward obtained by all the agents in an episode. In this environment, the starting state is randomly initialized, so generalization is crucial. We do not consider the zero-sum case, since it violates the cooperative nature of the task. We display the performance of centralized, decentralized, GAIL and BC in Figure 2.

Naturally, the performance of BC and MAGAIL increases with more expert demonstrations. MAGAIL performs consistently better than BC in all the settings; interestingly, in the cooperative communication task, centralized MAGAIL is able to achieve expert-level performance with only 200 demonstrations, whereas BC fails to come close even with 400 trajectories. Moreover, centralized MAGAIL performs slightly better than decentralized MAGAIL due to the better prior, but decentralized MAGAIL still learns a highly correlated reward between the two agents.

Figure 2: Average true reward from cooperative tasks. Performance of experts and random policies are normalized to one and zero respectively. We use inverse log scale for better comparison.
Task: Predator-Prey
  Agent trained with BC, adversary trained with {BC, G, C, D, ZS}: -93.20, -93.71, -93.75, -95.22, -95.48
  Agent trained with {G, C, D, ZS}, adversary trained with BC: -90.55, -91.36, -85.00, -89.4
Task: Keep-Away
  Agent trained with BC, adversary trained with {BC, G, C, D, ZS}: 24.22, 24.04, 23.28, 23.56, 23.19
  Agent trained with {G, C, D, ZS}, adversary trained with BC: 26.22, 26.61, 28.73, 27.80
Table 1: Average agent rewards in competitive tasks. We compare behavior cloning (BC), GAIL (G), centralized (C), decentralized (D), and zero-sum (ZS) methods. Whether higher or lower rewards are preferable depends on whether the method controls the agent or the adversary.

5.1.2 Competitive tasks

We consider all three types of MAGAIL (centralized, decentralized, zero-sum) and BC in both competitive tasks. Since there are two opposing sides, it is hard to measure performance directly. Therefore, we evaluate methods by letting (agents trained by) BC play against (adversaries trained by) the other methods, and vice versa. From Table 1, decentralized and zero-sum MAGAIL often perform better than centralized MAGAIL and BC, which suggests that selecting a suitable prior is important for good empirical performance. More details for all the particle environments are in the appendix.

5.2 Cooperative control

In some cases we are presented with sub-optimal expert demonstrations because the environment has changed; we consider this case in a cooperative control task (K. Gupta and Egorov, 2017), in which bipedal walkers cooperate to move a long plank forward; the agents have an incentive to collaborate since the plank is much longer than any of the agents. The expert demonstrates its policy in an environment with no bumps on the ground and heavy weights, while we perform imitation in a new environment with bumps and lighter weights (so a naive imitator is likely to use too much force). Agents trained with BC tend to act too aggressively and fail, whereas agents trained with centralized MAGAIL can adapt to the new environment. With 10 (imperfect) expert demonstrations, BC agents fail far more often (with a reward of 1.26), while centralized MAGAIL agents fail much less frequently (with a reward of 26.57). We show videos of the respective policies in the supplementary material.

6 Related work and discussion

There is a vast literature on single-agent imitation learning (Bagnell, 2015). Behavior Cloning (BC) learns the policy through supervised learning (Pomerleau, 1991). Inverse Reinforcement Learning (IRL) assumes the expert policy optimizes some unknown reward, recovers that reward, and then learns the policy through reinforcement learning (RL). BC does not require knowledge of the transition probabilities or access to the environment, but it suffers from compounding errors and covariate shift (Ross and Bagnell, 2010; Ross et al., 2011).

Most existing work in multi-agent imitation learning assumes the agents have very specific reward structures. The most common case is fully cooperative agents, where the challenges mainly lie in other factors, such as unknown role assignments (Le et al., 2017), scalability to swarm systems (Šošić et al., 2016), and agents with partial observations (Bogert and Doshi, 2014). In non-cooperative settings, Lin et al. (2014) consider the case of IRL for two-player zero-sum games and cast the IRL problem as Bayesian inference, while Reddy et al. (2012) assume agents are non-cooperative but the reward function is a linear combination of pre-specified features.

Our work is the first to propose a general multi-agent IRL framework that bridges the gap between state-of-the-art multi-agent reinforcement learning methods (Lowe et al., 2017; Foerster et al., 2016) and implicit generative models such as generative adversarial networks (Goodfellow et al., 2014). Experimental results demonstrate that it is able to imitate complex behaviors in high-dimensional environments with both cooperative and adversarial interactions. An interesting research direction is to explore new techniques for gathering expert demonstrations; for example, when the expert is allowed to aid the agents by participating in part of the agents' learning process (Hadfield-Menell et al., 2016).

References

Appendix A Proofs

We use $\hat{v}_i$, $\hat{q}_i$, and $\hat{Q}_i^{(t)}$ to represent $\hat{v}_i(\cdot; \boldsymbol{\pi}, r)$, $\hat{q}_i(\cdot; \boldsymbol{\pi}, r)$, and $\hat{Q}_i^{(t)}(\cdot; \boldsymbol{\pi}, r)$, where we implicitly assume dependency on $\boldsymbol{\pi}$ and $r$.

A.1 Proof of Lemma A.1

For any policy $\boldsymbol{\pi}$, $f_r(\boldsymbol{\pi}, \hat{v}) = 0$ when $\hat{v}$ is the value function of $\boldsymbol{\pi}$ (due to the Bellman equations). However, only policies that form a Nash equilibrium satisfy the constraints in Eq. 2; we formalize this in the following lemma. Let $\hat{v}_i$ be the solution to the Bellman equation

$$\hat{v}_i(s) = \mathbb{E}_{\boldsymbol{a} \sim \boldsymbol{\pi}(\cdot \mid s)}\left[ r_i(s, \boldsymbol{a}) + \gamma \, \mathbb{E}_{s' \sim P(\cdot \mid s, \boldsymbol{a})}\left[\hat{v}_i(s')\right] \right]$$

and $\hat{q}_i(s, a_i) \triangleq \mathbb{E}_{a_{-i} \sim \pi_{-i}(\cdot \mid s)}\left[ r_i(s, \boldsymbol{a}) + \gamma \, \mathbb{E}_{s' \sim P(\cdot \mid s, \boldsymbol{a})}\left[\hat{v}_i(s')\right] \right]$. Then $f_r(\boldsymbol{\pi}, \hat{v}) = 0$ for any $\boldsymbol{\pi}$.

Furthermore, $\boldsymbol{\pi}$ is a Nash equilibrium under $r$ if and only if $\hat{v}_i(s) \geq \hat{q}_i(s, a_i)$ for all $i$, $s$, and $a_i$.

By definition of $\hat{v}_i$ and $\hat{q}_i$ we have:

$$\hat{v}_i(s) = \mathbb{E}_{a_i \sim \pi_i(\cdot \mid s)}\left[\hat{q}_i(s, a_i)\right],$$

which uses the fact that $a_i$ and $a_{-i}$ are independent conditioned on $s$. Hence $f_r(\boldsymbol{\pi}, \hat{v}) = 0$ immediately follows.

If $\boldsymbol{\pi}$ is a Nash equilibrium and at least one of the constraints does not hold, i.e., there exists some $i$, $s$, and $a_i$ such that $\hat{v}_i(s) < \hat{q}_i(s, a_i)$, then agent $i$ can achieve a strictly higher expected return if it chooses to take action $a_i$ whenever it encounters state $s$ and follows $\pi_i$ for the rest of the states, which violates the Nash equilibrium assumption.

If the constraints hold, i.e., $\hat{v}_i(s) \geq \hat{q}_i(s, a_i)$ for all $i$, $s$, and $a_i$, then

$$\hat{v}_i(s) \geq \max_{a_i} \hat{q}_i(s, a_i) \geq \mathbb{E}_{a_i \sim \pi_i(\cdot \mid s)}\left[\hat{q}_i(s, a_i)\right] = \hat{v}_i(s),$$

so value iteration over $\hat{v}_i$ converges. If we could find another policy $\hat{\pi}_i$ for agent $i$ whose value is strictly higher than $\hat{v}_i$, then there would be at least one violation of the constraints, since $\hat{v}_i(s)$ must be a convex combination (expectation) over actions $a_i \sim \pi_i(\cdot \mid s)$ of $\hat{q}_i(s, a_i)$. Therefore, $\hat{v}_i(s) \geq \hat{q}_i(s, a_i)$ always holds for any policy and action of any agent $i$, so $\pi_i$ is the optimal response to $\pi_{-i}$, and $\boldsymbol{\pi}$ constitutes a Nash equilibrium when we repeat this argument for all agents.

Notably, Theorem 3.8.2 in Filar and Vrieze (2012) discusses this equivalence under additional assumptions on the rewards; when those assumptions are satisfied, the two characterizations coincide.

A.2 Proof of Theorem 3.1

If $\boldsymbol{\pi}$ is a Nash equilibrium and at least one of the constraints does not hold, i.e., there exists some $i$, $t$, and a length-$t$ trajectory $\{s^{(j)}, a_i^{(j)}\}_{j=0}^{t}$ such that

$$\hat{v}_i\left(s^{(0)}; \boldsymbol{\pi}, r\right) < \hat{Q}_i^{(t)}\left(\{s^{(j)}, a_i^{(j)}\}_{j=0}^{t}; \boldsymbol{\pi}, r\right),$$

then agent $i$ can achieve a strictly higher expected return on its own if it chooses the particular sequence of actions that takes $a_i^{(j)}$ whenever it encounters state $s^{(j)}$, and follows $\pi_i$ for the remaining states. We note that this is in expectation over the policies of the other agents. Hence, we can construct a policy for agent $i$ that has a strictly higher value than $\pi_i$ without modifying $\pi_{-i}$, which contradicts the definition of Nash equilibrium.

If the constraints hold, i.e., for all $i$, $t$, and length-$t$ trajectories,

$$\hat{v}_i\left(s^{(0)}; \boldsymbol{\pi}, r\right) \geq \hat{Q}_i^{(t)}\left(\{s^{(j)}, a_i^{(j)}\}_{j=0}^{t}; \boldsymbol{\pi}, r\right),$$

then we can construct any $\hat{q}_i(s, a_i)$ via a convex combination by taking the expectation of the corresponding $\hat{Q}_i^{(t)}$ terms over the actions $a_i^{(j)} \sim \pi_i(\cdot \mid s^{(j)})$, where the expectation is taken over actions only (the expectation over states is contained in the inner expectation of $\hat{Q}_i^{(t)}$). Therefore, $\hat{v}_i(s) \geq \hat{q}_i(s, a_i)$ for all $i$, $s$, and $a_i$,

and we recover the constraints in Eq. 2. By Lemma A.1, $\boldsymbol{\pi}$ is a Nash equilibrium.

A.3 Proof of Theorem 3.2

We use $\hat{v}_i^{\boldsymbol{\pi}}$, $\hat{q}_i^{\boldsymbol{\pi}}$, and $\hat{Q}_i^{\boldsymbol{\pi}}$ to denote the $\hat{v}_i$, $\hat{q}_i$, and $\hat{Q}_i^{(t)}$ quantities defined for policy $\boldsymbol{\pi}$. For the two terms in $L^{(t)}_r(\boldsymbol{\pi}^*, \lambda^*)$ we have:

$$L^{(t)}_r\left(\boldsymbol{\pi}^*, \lambda^*\right) = \sum_{i=1}^{N} \sum_{\tau_i \in \mathcal{T}^{(t)}_i} \lambda^*(\tau_i)\, \hat{Q}_i^{(t)}\left(\tau_i; \boldsymbol{\pi}^*, r\right) - \sum_{i=1}^{N} \sum_{\tau_i \in \mathcal{T}^{(t)}_i} \lambda^*(\tau_i)\, \hat{v}_i\left(s^{(0)}; \boldsymbol{\pi}^*, r\right) \tag{13}$$

For any agent $i$, we note that the first term, $\sum_{\tau_i} \lambda^*(\tau_i)\, \hat{Q}_i^{(t)}(\tau_i; \boldsymbol{\pi}^*, r)$,

amounts to using $\pi_i$ for agent $i$ for the first $t$ steps and using $\pi_i^*$ for the remaining steps, whereas the other agents follow $\pi_{-i}^*$. As $t \to \infty$, this converges to $\mathbb{E}_{\pi_i, \pi^*_{-i}}[r_i(s, \boldsymbol{a})]$, since $\gamma^t \to 0$ and $r$ is bounded. Moreover, for the second term, we have

$$\sum_{\tau_i} \lambda^*(\tau_i)\, \hat{v}_i\left(s^{(0)}; \boldsymbol{\pi}^*, r\right) = \mathbb{E}_{s^{(0)} \sim \eta}\left[\hat{v}_i\left(s^{(0)}; \boldsymbol{\pi}^*, r\right)\right] = \mathbb{E}_{\boldsymbol{\pi}^*}\left[r_i(s, \boldsymbol{a})\right],$$

since the marginal of $\lambda^*$ over initial states is the initial state distribution $\eta$. Combining the two, we obtain Equation 9.