Scalable Multi-Agent Inverse Reinforcement Learning via Actor-Attention-Critic

02/24/2020 · Wonseok Jeon et al.

Multi-agent adversarial inverse reinforcement learning (MA-AIRL) is a recent approach that applies single-agent AIRL to multi-agent problems where we seek to recover both policies for our agents and reward functions that promote expert-like behavior. While MA-AIRL has promising results on cooperative and competitive tasks, it is sample-inefficient and has only been validated empirically for small numbers of agents – its ability to scale to many agents remains an open question. We propose a multi-agent inverse RL algorithm that is more sample-efficient and scalable than previous works. Specifically, we employ multi-agent actor-attention-critic (MAAC) – an off-policy multi-agent RL (MARL) method – for the RL inner loop of the inverse RL procedure. In doing so, we are able to increase sample efficiency compared to state-of-the-art baselines, across both small- and large-scale tasks. Moreover, the RL agents trained on the rewards recovered by our method better match the experts than those trained on the rewards derived from the baselines. Finally, our method requires far fewer agent-environment interactions, particularly as the number of agents increases.


1 Introduction

Inverse reinforcement learning (IRL) (ng2000algorithms) is the problem of inferring a reward function that reflects the objective of an expert from limited observations of the expert's behavior. Traditionally, IRL required a planning algorithm as an inner step (ziebart2008maximum), which made IRL expensive in high-dimensional control tasks. Later work alleviates this by using adversarial training objectives (ho2016generative; fu2018learning), inspired by generative adversarial networks (GANs) (goodfellow2014generative). Essentially, these methods iteratively train a discriminator to measure the difference between the agent's and the expert's behavior, and optimize a policy via reinforcement learning to reduce this difference. Combined with modern deep RL algorithms (schulman2015trust), adversarial imitation and IRL show improved scalability to high-dimensional tasks.

Recently, adversarial imitation and IRL have been extended to multi-agent imitation (song2018multi) and multi-agent IRL (yu2019multi), respectively, where agents in a shared environment aim to learn rewards or policies from multiple experts' demonstrations. Both of these works establish strong theoretical relationships between single-agent and multi-agent learning methods and demonstrate that the proposed algorithms outperform baselines. However, some questions remain about their empirical performance. First, both song2018multi and yu2019multi focus on the performance after convergence, while the sample efficiency of their methods in terms of agent-environment interactions has not been examined rigorously. Second, both use MACK, the multi-agent extension of ACKTR (wu2017scalable), which is built on the centralized-training, decentralized-execution framework (lowe2017multi; foerster2018counterfactual) and uses centralized critics to stabilize training. If such centralized critics are not carefully designed, the resulting MARL algorithm may scale poorly with the number of agents due to the curse of dimensionality: the joint observation-action space grows exponentially with the number of agents.

In this work, we propose multi-agent discriminator-actor-attention-critic (MA-DAAC), a multi-agent algorithm capable of sample-efficient imitation and inverse reward learning regardless of the number of agents. MA-DAAC uses multi-agent actor-attention-critic (MAAC) (iqbal2019actor), which scales to large numbers of agents thanks to a shared attention-based critic network (vaswani2017attention). We verify that MA-DAAC is more sample-efficient than the multi-agent imitation and IRL baselines, demonstrate that the reward functions learned by MA-DAAC lead to better empirical performance, and show that MA-DAAC is more robust to smaller numbers of expert demonstrations.

2 Preliminaries

2.1 Markov Games and Notations

In a Markov Game (littman1994markov), multiple agents observe a shared environment state and take individual actions based on their observations. Then, each agent receives a reward and the environment transitions to a new state. Mathematically, a Markov Game is defined by a tuple $(N, \mathcal{S}, \{\mathcal{A}_i\}_{i=1}^N, \{r_i\}_{i=1}^N, P, \eta, \gamma)$, where $N$ is the number of agents, $\mathcal{S}$ is the state space, $\mathcal{A}_i$ is the action space of agent $i$, $r_i$ is the reward function of agent $i$, $P$ is the state-transition distribution, $\eta$ is the initial-state distribution, and $\gamma \in [0, 1)$ is the discount factor. The policy $\pi_i(a_i \mid s)$ is the probability of the $i$-th agent choosing action $a_i$ in state $s$. For succinct notation, we use bars to indicate joint quantities over the agents, e.g., $\bar{\mathcal{A}} = \mathcal{A}_1 \times \cdots \times \mathcal{A}_N$ is the joint action space, $\bar{a} = (a_1, \dots, a_N)$ is a joint action, $\bar{\pi} = (\pi_1, \dots, \pi_N)$ is a joint policy, and $\bar{r} = (r_1, \dots, r_N)$ is a joint reward. The value function of the $i$-th agent with respect to $\bar{\pi}$ is defined by $V_i^{\bar{\pi}}(s) = \mathbb{E}^{\bar{\pi}}\big[\sum_{t=0}^{\infty} \gamma^t r_i(s_t, \bar{a}_t) \mid s_0 = s\big]$, where the superscript on the expectation indicates that states and joint actions are sampled from $P$ and $\bar{\pi}$. Additionally, the $\gamma$-discounted state occupancy measure of the joint policy is defined by $\rho_{\bar{\pi}}(s) = \mathbb{E}^{\bar{\pi}}\big[\sum_{t=0}^{\infty} \gamma^t \mathbb{1}(s_t = s)\big]$, where $\mathbb{1}$ is an indicator function.

In this work, we consider a partially observable Markov Game, where each agent $i$ can only use its own observation $o_i$ derived from the environment state $s$. Due to the partial observability, we consider policies $\pi_i(a_i \mid o_i)$, i.e., the probability that the $i$-th agent chooses action $a_i$ after observing $o_i$. We also consider value functions that take the joint observation $\bar{o} = (o_1, \dots, o_N)$ instead of the state $s$, as is commonly done in previous work (lowe2017multi; iqbal2019actor).

2.2 Multi-Agent Adversarial Imitation and IRL

In the multi-agent IRL problem, $N$ agents respectively try to mimic the experts' policies. Each agent cannot access its target expert's policy directly and must rely on a limited amount of expert demonstrations. There are two possible objectives in this problem: (1) policy imitation, i.e., learning policies close to those of the experts, and (2) reward learning, i.e., recovering reward functions that lead to expert-like behavior.

Multi-agent generative adversarial imitation learning (MA-GAIL) enables agents to learn the experts' policies by optimizing a GAN-style mini-max objective between discriminators and policies (song2018multi). In practice, MA-GAIL iteratively optimizes discriminators and policies, where the discriminators are trained to classify whether state-action pairs come from the agents or from the experts, and the policies are optimized with MARL methods using reward functions defined by the discriminators. Although MA-GAIL successfully imitates experts, its learned rewards cannot be reused as reward functions: at optimality the discriminators converge to $1/2$, so the induced rewards become uninformative (goodfellow2014generative; fu2018learning).

MA-AIRL addresses this issue by modifying two parts of MA-GAIL. First, a structured form of discriminator motivated by the logistic stochastic best response equilibrium (LSBRE) is used (yu2019multi):
\[
D_{\omega_i}(s, a_i) = \frac{\exp\big(f_{\omega_i}(s, a_i)\big)}{\exp\big(f_{\omega_i}(s, a_i)\big) + \pi_i(a_i \mid s)}.
\]
In addition, MA-AIRL uses $\log D_{\omega_i} - \log(1 - D_{\omega_i})$ as the reward function instead of the discriminator-derived rewards of MA-GAIL. It turns out that either $f_{\omega_i}$ or $\log D_{\omega_i} - \log(1 - D_{\omega_i})$ can recover reward functions that lead to the experts' behavior.
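To make the relation between these two reward choices concrete, the AIRL-style parameterization above (in our reconstructed notation) gives a one-line identity:
\[
\log D_{\omega_i}(s, a_i) - \log\big(1 - D_{\omega_i}(s, a_i)\big)
= \log \frac{\exp\big(f_{\omega_i}(s, a_i)\big)}{\pi_i(a_i \mid s)}
= f_{\omega_i}(s, a_i) - \log \pi_i(a_i \mid s).
\]
Hence, up to the $-\log \pi_i$ term, using $f_{\omega_i}$ or $\log D_{\omega_i} - \log(1 - D_{\omega_i})$ as the reward signal is equivalent, whereas a reward that is a fixed transformation of $D_{\omega_i}$ alone carries no information once $D_{\omega_i}$ converges to $1/2$.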

3 Related Works

3.1 Sample-Efficient Adversarial Imitation Learning

In kostrikov2018discriminator, TD3 (fujimoto2018addressing) was combined with a discriminator for off-policy adversarial imitation. To stabilize the algorithm, the authors proposed learning terminal-state values with the discriminator and using them in the RL inner loop of imitation learning, whereas conventional off-policy reinforcement learning algorithms implicitly assume zero terminal-state values and do not account for them. In sasaki2018sample, another sample-efficient imitation learning algorithm was proposed. In contrast to prior works, their method avoids discriminators by exploiting the Bellman consistency of the imitation-learning reward signal. Using off-policy actor-critic (degris2012off), the proposed method was shown to be much more sample-efficient than GAIL.

3.2 Scalable Multi-Agent Learning

Achieving coordinated behavior among a large number of agents has long been regarded as a challenging problem for MARL. Although existing works using centralized critics, such as MADDPG (lowe2017multi) and COMA (foerster2018counterfactual), can coordinate a handful of agents, they struggle when the number of agents increases. This is mainly due to the growth of the critic inputs with the number of agents, which is exponential in the joint observation-action space and also increases the variance of critic training. MAAC (iqbal2019actor) addresses this issue by using an attention mechanism (vaswani2017attention) and a shared critic network, and it outperforms existing algorithms for large numbers of agents. Thanks to the attention mechanism, MAAC learns to focus only on the relevant parts of the joint observations and actions, which leads to rapid and efficient training. Mean-field MARL (yang2018mean) is another approach to the scalability issue of MARL, based on a mean-field approximation of the centralized critics. However, its applicability is restricted to settings where all agents are homogeneous, whereas MAAC can be applied to more general scenarios with non-homogeneous agents.

Meanwhile, several multi-agent imitation learning algorithms have been applied to large-scale environments. le2017coordinated proposed multi-agent imitation learning in the index-free control setting, where agents are not allowed to know the index of their own expert. The proposed method trains a model that infers and assigns each agent's role from rollout trajectories and the given expert demonstrations, and exploits that model to produce highly coordinated behavior. sanghvi2019modeling uses multi-agent imitation learning to learn social group communication among agents with a single policy network shared across agents. However, they used multi-agent behavioral cloning and focused on environments with homogeneous agents. In contrast to those works, our algorithm can also deal with non-homogeneous agents. In addition, it has been reported in the existing literature (kostrikov2018discriminator; sasaki2018sample; song2018multi) that behavioral cloning performs poorly when only a small number of expert demonstrations is available, due to the covariate shift problem (ross2011reduction). For these reasons, we narrow our scope to MA-GAIL and MA-AIRL in this work.

1:  Input: a replay buffer $\mathcal{B}$ for the agents' rollout trajectories, expert demonstrations $\mathcal{D}_E$, policy networks $\pi_{\theta_1}, \dots, \pi_{\theta_N}$, a shared attention critic $\bar{Q}_\psi$, and discriminators $D_{\phi_1}, \dots, D_{\phi_N}$
2:  for each iteration do
3:     Sample rollout trajectories with the current policies $\pi_{\theta_1}, \dots, \pi_{\theta_N}$.
4:     Add the sampled trajectories to $\mathcal{B}$.
5:     for each training iteration do
6:        Sample a mini-batch of agent transitions from $\mathcal{B}$.
7:        Sample a mini-batch of expert transitions from $\mathcal{D}_E$.
8:        Update the rewards of the sampled agent transitions using the discriminators.
9:        Update the critic and the policies with (1) and (2).   // policy learning via MAAC
10:       Update the discriminators with (3).   // reward learning
11:     end for
12:  end for
13:  Output: learned policies $\pi_{\theta_1}, \dots, \pi_{\theta_N}$ and learned reward functions
Algorithm 1 Multi-Agent Discriminator-Actor-Attention-Critic (MA-DAAC)

4 Our Method

We consider multi-agent IRL problems where agents aim to learn the experts' behavior as well as reward functions that lead to such behavior. Note that the agents cannot access the experts directly but are allowed to use a limited amount of expert demonstrations. In this setting, we introduce MA-DAAC, our multi-agent IRL method outlined in Algorithm 1. Similar to MA-GAIL and MA-AIRL, our method iteratively trains discriminators and policies using expert demonstrations and the agents' rollout trajectories, optimizing the policies with MARL against a reward signal modeled by the discriminators.
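For readers who prefer code, the sketch below mirrors the structure of Algorithm 1 in Python. All names (collect_rollouts, replay_buffer, expert_demos, maac, discriminators, and their methods) are hypothetical placeholders we introduce for illustration of the control flow; they are assumed interfaces, not the released implementation.

# Hypothetical skeleton of Algorithm 1 (MA-DAAC); interfaces are assumed, not the authors' code.
def train_madaac(env, policies, maac, discriminators, replay_buffer, expert_demos,
                 num_iterations, updates_per_iteration, batch_size):
    for _ in range(num_iterations):                          # Alg. 1, line 2
        trajectories = collect_rollouts(env, policies)        # line 3: on-policy rollouts
        replay_buffer.add(trajectories)                       # line 4

        for _ in range(updates_per_iteration):                # line 5
            agent_batch = replay_buffer.sample(batch_size)    # line 6: off-policy samples
            expert_batch = expert_demos.sample(batch_size)    # line 7

            # line 8: relabel rewards with the reward estimator g_i from (4)
            rewards = [d.reward(agent_batch.obs[i], agent_batch.act[i])
                       for i, d in enumerate(discriminators)]

            # line 9: policy learning via MAAC, i.e., critic loss (1) and policy gradient (2)
            maac.update_critic(agent_batch, rewards)
            maac.update_policies(agent_batch)

            # line 10: reward learning, i.e., ascend the discriminator objective (3)
            for i, d in enumerate(discriminators):
                d.update(agent_batch.agent(i), expert_batch.agent(i))
    return policies, discriminators                           # line 13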

MARL Algorithm. In our method, we use MAAC, an off-policy MARL algorithm that has been shown to be sample-efficient and scalable to large numbers of agents (iqbal2019actor). We summarize it as follows. Assuming discrete action spaces, let $\bar{Q}_\psi = (Q_\psi^1, \dots, Q_\psi^N)$ denote the action values, where each $Q_\psi^i$ is the (vector-valued, one entry per action) action value of the corresponding agent and $\psi$ is the set of neural network parameters of the critic network. In MAAC, the $i$-th agent's action value is modeled as a neural network
\[
Q_\psi^i(\bar{o}, \bar{a}) = f_i\big(g_i(o_i, a_i), x_i\big),
\]
where $g_i$ is a network looking at the $i$-th agent's local observation and action, $x_i$ is produced by another network that considers the other agents' observations and actions, and $f_i$ is a network that takes into account the features extracted by both of the previous networks. The main idea of MAAC is to model $x_i$ with an attention mechanism (vaswani2017attention) and to share this network among agents, i.e., for the $j$-th agent's embedding $e_j = g_j(o_j, a_j)$,
\[
x_i = \sum_{j \neq i} \alpha_{ij}\, h(V e_j), \qquad \alpha_{ij} \propto \exp\big(e_j^\top W_k^\top W_q\, e_i\big),
\]
where $h$ is an element-wise non-linear activation and the value, key, and query matrices $V$, $W_k$, and $W_q$ are neural network parameters shared among the agents. The objective of critic training is to minimize the sum of temporal-difference (TD) errors:

\[
\mathcal{L}_Q(\psi) = \sum_{i=1}^{N} \mathbb{E}_{(\bar{o}, \bar{a}, \bar{o}') \sim \mathcal{B}} \Big[ \big( Q_\psi^i(\bar{o}, \bar{a}) - y_i \big)^2 \Big], \qquad
y_i = r_i(\bar{o}, \bar{a}) + \gamma\, \mathbb{E}_{\bar{a}' \sim \bar{\pi}_{\theta^-}(\cdot \mid \bar{o}')} \Big[ Q_{\psi^-}^i(\bar{o}', \bar{a}') - \alpha \log \pi_{\theta_i^-}(a_i' \mid o_i') \Big] \qquad (1)
\]

where $y_i$ is the target value for $Q_\psi^i$, the transitions $(\bar{o}, \bar{a}, \bar{o}')$ are sampled from an experience replay buffer $\mathcal{B}$, $\theta^-$ is the target policy parameter, $\psi^-$ is the target critic parameter, $\alpha$ is the entropy-regularization coefficient, and $r_i$ is the $i$-th agent's reward function. For policy updates, the policy gradient

\[
\nabla_{\theta_i} J(\bar{\pi}_\theta) = \mathbb{E}_{\bar{o} \sim \mathcal{B},\, \bar{a} \sim \bar{\pi}_\theta} \Big[ \nabla_{\theta_i} \log \pi_{\theta_i}(a_i \mid o_i)\, \big( Q_\psi^i(\bar{o}, \bar{a}) - b_i(\bar{o}, \bar{a}_{\setminus i}) - \alpha \log \pi_{\theta_i}(a_i \mid o_i) \big) \Big] \qquad (2)
\]

is used, where the counterfactual baseline $b_i(\bar{o}, \bar{a}_{\setminus i}) = \mathbb{E}_{a_i' \sim \pi_{\theta_i}(\cdot \mid o_i)} \big[ Q_\psi^i(\bar{o}, (a_i', \bar{a}_{\setminus i})) \big]$ is obtained by changing the $i$-th action in $\bar{a}$ to $a_i'$ and averaging over $\pi_{\theta_i}$.
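As a concrete illustration of the shared attention mechanism described above, the following is a minimal PyTorch sketch of how the other agents' embeddings can be aggregated into $x_i$. It is a simplified single-head version written for this article (the reference MAAC implementation uses multiple heads and additional encoders); the class name and tensor layout are our own.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedAttention(nn.Module):
    """Single-head attention over the other agents' embeddings, shared by all agents."""
    def __init__(self, embed_dim: int, attend_dim: int):
        super().__init__()
        self.query = nn.Linear(embed_dim, attend_dim, bias=False)   # W_q, shared
        self.key = nn.Linear(embed_dim, attend_dim, bias=False)     # W_k, shared
        self.value = nn.Linear(embed_dim, attend_dim, bias=False)   # V, shared

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (batch, n_agents, embed_dim), row j holds e_j = g_j(o_j, a_j)
        q = self.query(embeddings)                    # (batch, n_agents, attend_dim)
        k = self.key(embeddings)
        v = F.leaky_relu(self.value(embeddings))      # h(V e_j), element-wise nonlinearity
        scores = torch.matmul(q, k.transpose(1, 2)) / k.shape[-1] ** 0.5  # (W_q e_i) . (W_k e_j)
        # mask the diagonal so that x_i only attends to the other agents (j != i)
        n = embeddings.shape[1]
        mask = torch.eye(n, dtype=torch.bool, device=embeddings.device)
        scores = scores.masked_fill(mask, float("-inf"))
        alpha = torch.softmax(scores, dim=-1)         # attention weights alpha_{ij}
        return torch.matmul(alpha, v)                 # row i is x_i = sum_{j != i} alpha_{ij} h(V e_j)

Given per-agent embeddings $e_j$, the $i$-th row of the output is $x_i$, which is then combined with $g_i(o_i, a_i)$ in the head $f_i$ to produce $Q_\psi^i$.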

Discriminator. We consider the two types of discriminator models studied in MA-GAIL (song2018multi): a centralized discriminator, which takes all agents' observations and actions as input and outputs a multi-head classification, one head per agent, and a decentralized discriminator, which takes each agent's local observations and actions and outputs a single-head classification result. It should be noted that neither the sample efficiency nor the scalability of the two discriminator types has been rigorously analyzed in MA-GAIL and MA-AIRL (song2018multi; yu2019multi).

Let $\bar{D}_\phi(\bar{o}, \bar{a}) = (D_\phi^1, \dots, D_\phi^N)$ denote the vector-valued discriminator output. This expression covers both types of discriminators: the centralized discriminator computes a shared feature of $(\bar{o}, \bar{a})$ used by all heads $D_\phi^i$, while decentralized discriminators do not share features and ignore the other agents' observations and actions, i.e., $D_\phi^i(\bar{o}, \bar{a}) = D_\phi^i(o_i, a_i)$. For each training iteration, we train the discriminators by maximizing

\[
\mathcal{L}_D(\phi) = \sum_{i=1}^{N} \mathbb{E}_{(\bar{o}, \bar{a}) \sim \bar{\pi}_\theta} \big[ \log \big( 1 - D_\phi^i(\bar{o}, \bar{a}) \big) \big] + \mathbb{E}_{(\bar{o}, \bar{a}) \sim \mathcal{D}_E} \big[ \log D_\phi^i(\bar{o}, \bar{a}) \big]. \qquad (3)
\]

Intuitively, the discriminators are trained so that expert-like behavior receives higher values, whereas non-expert-like behavior receives lower values. Note that the first expectation in the objective is taken over rollout trajectories sampled from the agents' policies. In practice, however, we use samples from the replay buffer without off-policy correction, which improves the sample efficiency of discriminator training via sample reuse. As shown in our experiments, this reuse of samples does not harm performance, similar to the findings of kostrikov2018discriminator. As in MA-AIRL, we assume the discriminators are parameterized neural networks such that

\[
D_\phi^i = \frac{\exp\big(f_\phi^i\big)}{\exp\big(f_\phi^i\big) + \pi_{\theta_i}(a_i \mid o_i)}, \qquad
f_\phi^i(o, a_i, o') = g_\phi^i(o, a_i) + \gamma\, h_\phi^i(o') - h_\phi^i(o), \qquad (4)
\]

where $(o, a_i, o')$ denotes the discriminator's input (joint for centralized discriminators, local for decentralized ones), $g_\phi^i$ is the reward estimator, and $h_\phi^i$ is the potential shaping function.

During training, we use $g_\phi^i$ in (4) as the $i$-th agent's reward function. For centralized discriminators in particular, we use observation-only rewards (fu2018learning) that ignore the action inputs of the discriminators; these lead to slightly better performance than rewards that use the action inputs. We discuss this in detail in Section 6.
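The following is a minimal PyTorch sketch of a decentralized AIRL-style discriminator in the spirit of (3)-(4). The class and method names are our own, the two-layer MLP architecture follows the appendix, the batches are assumed to be (obs, act, next_obs) tensor tuples with matching shapes for the policy log-probabilities, and the whole snippet is an illustration rather than the exact released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

def mlp(in_dim: int, hidden: int = 128, out_dim: int = 1) -> nn.Sequential:
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.LeakyReLU(),
                         nn.Linear(hidden, hidden), nn.LeakyReLU(),
                         nn.Linear(hidden, out_dim))

class DecentralizedAIRLDiscriminator(nn.Module):
    """D_i = exp(f_i) / (exp(f_i) + pi_i(a_i|o_i)), with f_i = g_i(o,a) + gamma*h_i(o') - h_i(o)."""
    def __init__(self, obs_dim: int, act_dim: int, gamma: float = 0.995):
        super().__init__()
        self.g = mlp(obs_dim + act_dim)   # reward estimator g_i
        self.h = mlp(obs_dim)             # potential shaping function h_i
        self.gamma = gamma

    def f(self, obs, act, next_obs):
        reward = self.g(torch.cat([obs, act], dim=-1))
        return reward + self.gamma * self.h(next_obs) - self.h(obs)

    def reward(self, obs, act):
        # learned reward used to train the policies; the potential terms are dropped
        with torch.no_grad():
            return self.g(torch.cat([obs, act], dim=-1))

    def loss(self, agent_batch, expert_batch, log_pi_agent, log_pi_expert):
        # Since D_i = sigmoid(f_i - log pi_i), binary cross-entropy with experts labeled 1
        # and agent samples labeled 0 is equivalent to maximizing objective (3).
        logit_agent = self.f(*agent_batch) - log_pi_agent
        logit_expert = self.f(*expert_batch) - log_pi_expert
        loss_agent = F.binary_cross_entropy_with_logits(logit_agent, torch.zeros_like(logit_agent))
        loss_expert = F.binary_cross_entropy_with_logits(logit_expert, torch.ones_like(logit_expert))
        return loss_agent + loss_expert

Even when $D_\phi^i$ tends toward $1/2$ on matched data, the separately parameterized $g_\phi^i$ remains a usable reward, which is the property exploited for reward learning above.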

Figure 1: Imitation performance and reward-learning performance in the small-scale environments. The results in the same row share their environment: Keep Away, Cooperative Communication, and Cooperative Navigation from top to bottom. For the first three columns, we report the NSS of the policies during training, where 50, 100, and 200 expert demonstrations were used from left to right. Note that MA-DAAC converges sample-efficiently to the best performance among all methods. For the last column, we report the NSS of policies retrained with the learned reward functions. The results show that MA-DAAC always recovers better reward functions than the baselines. Means and 95% confidence intervals over 10 runs are shown.

Figure 2: Imitation performance and reward-learning performance in the Rover Tower tasks. The results in the same row share the number of agents: 8, 12, and 16 from top to bottom. For the first three columns, we report the NSS of the policies during training, where 50, 100, and 200 expert demonstrations were used from left to right. Regardless of the number of agents, MA-DAAC converges much faster than the baselines, and its convergence is not affected much by the number of agents, whereas that of the baselines becomes slower as the number of agents increases. For the last column, we report the NSS of policies retrained with the learned reward functions. The policies retrained with the rewards from MA-DAAC achieve higher NSS. Means and 99% confidence intervals over 10 runs are shown.

5 Experiments

Our experiments are designed to answer the following questions:

1. Is MA-DAAC capable of recovering multi-agent reward functions effectively?
2. Is MA-DAAC sample-efficient in terms of the number of agent-environment interactions and the amount of available expert demonstrations?
3. Is MA-DAAC scalable to environments with many agents?

We evaluate our method from two perspectives: policy imitation and reward learning. We briefly summarize our experimental setup in the following sections and provide more detailed information in the Appendix.

5.1 Experiment Setup

Tasks. We consider two classes of environments, covering small-scale and large-scale settings. All of them run on the OpenAI Multi-Agent Particle Environment (MPE) (mordatch2017emergence). The small-scale environments include:

Keep Away There are 2 agents, a "reacher" and a "pusher": the reacher tries to reach the goal, and the pusher tries to push the reacher away from the goal.

Cooperative Communication There are 2 agents, a "speaker" and a "listener". One of three landmarks is randomly chosen as the target in each episode, and its location can only be seen by the speaker. The speaker cannot move, whereas the listener can observe the speaker's message and move toward the target landmark.

Cooperative Navigation There are 3 agents and 3 landmarks, and the goal of the agents is to cover as many landmarks as possible.

We also consider a large-scale environment proposed in MAAC (iqbal2019actor) to measure the scalability of MA-DAAC and the existing methods:

Rover Tower (iqbal2019actor) There is an even number of agents (8, 12, or 16), half of which are "rovers" and the others "towers". In each episode, rovers and towers are randomly paired, and each tower has its own goal destination. Similar to Cooperative Communication, towers cannot move but can communicate with rovers so that the rovers can move toward their corresponding goals.

Experts. For the experts' policies, we trained policies using MAAC for either 50,000 episodes (Keep Away, Cooperative Communication, Cooperative Navigation) or 100,000 episodes (Rover Tower). We then treated these trained policies as experts and generated demonstration trajectories from them, always taking the action with the largest probability. Throughout our experiments, we vary the number of available demonstrations among 50, 100, and 200.

Performance Measure. In multi-agent IRL problems, we need a proper measure of the gap between the learned agents and the experts during training. However, the episodic-score-based measures widely used in single-agent learning (ho2016generative; kostrikov2018discriminator; sasaki2018sample) cannot be used directly in our problems, because each agent has its own reward function and the score scales are unnormalized. Therefore, we define the normalized score similarity (NSS) as
\[
\mathrm{NSS} = \frac{1}{N} \sum_{i=1}^{N} \frac{S_i - S_i^{\mathrm{random}}}{S_i^{\mathrm{expert}} - S_i^{\mathrm{random}}}.
\]
Here, $S_i$ is the $i$-th agent's episode score during training, $S_i^{\mathrm{expert}}$ is the $i$-th expert's average score over the experts' demonstrations, and $S_i^{\mathrm{random}}$ is the average score of the $i$-th agent when uniformly random actions are taken by all agents. Intuitively, NSS gets close to 1 if every agent shows expert-like behavior, since such behavior leads to the experts' scores. One advantage of NSS is that it can evaluate multi-agent imitation performance on both competitive and cooperative tasks. Our experiments show that it is an effective measure.
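As a concrete reference, a per-run NSS computation might look like the following Python function; it is written for this article, and the per-agent averaging reflects our reading of the definition above.

import numpy as np

def normalized_score_similarity(agent_scores, expert_scores, random_scores):
    """NSS: mean over agents of (S_i - S_i^random) / (S_i^expert - S_i^random).

    Each argument is an array with one entry per agent, e.g. episode scores
    averaged over evaluation episodes.
    """
    agent_scores = np.asarray(agent_scores, dtype=float)
    expert_scores = np.asarray(expert_scores, dtype=float)
    random_scores = np.asarray(random_scores, dtype=float)
    return float(np.mean((agent_scores - random_scores) / (expert_scores - random_scores)))

# Example: agents exactly halfway between random and expert give NSS = 0.5.
print(normalized_score_similarity([5.0, -2.0], [10.0, 0.0], [0.0, -4.0]))  # -> 0.5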

5.2 Small-Scale Environments

Policy Imitation. The results in the small-scale environments are summarized in Figure 1 (columns 1-3). In all small-scale environments, MA-DAAC converges faster than the baselines. This is due to the use of MAAC, an off-policy MARL method, rather than MACK, the on-policy MARL method used by song2018multi. Also, there is only a negligible gap between the performance with centralized discriminators and with decentralized discriminators. It should be noted that in song2018multi, imitation learning with centralized discriminators leads to slightly better performance than its decentralized counterpart, whereas we obtain comparable results for both types of discriminators. We believe this small difference comes from using different MACK implementations and expert demonstrations.

Reward Learning. For all imitation and IRL methods, we first train both policies and rewards for 50,000 episodes and then retrain policies with MACK on the learned rewards for another 50,000 episodes. For the rewards from MA-DAAC and MA-AIRL, we use $g_\phi^i$ in (4), i.e., the learned reward without potential functions (fu2018learning; yu2019multi). For MA-GAIL, we use the reward induced by its discriminator, following the comparison in yu2019multi. The reward-learning results are shown in Figure 1 (column 4). Note that the policies trained by either the imitation or the IRL methods attain comparable mean NSS for a given number of expert demonstrations and environment. Nevertheless, the rewards learned by either MA-DAAC or MA-AIRL with decentralized discriminators achieve the best retraining performance. One interesting observation is that rewards from MA-DAAC with centralized discriminators lead to better performance than those from MA-AIRL with centralized discriminators. We believe the use of off-policy samples makes the reward training of MA-DAAC more robust, since MA-AIRL is trained on on-policy samples and can easily overfit to the latest rollouts. Additional results in the small-scale environments are given in the Appendix.

Figure 3: The number of trainable parameters for each method with decentralized discriminators in the Rover Tower tasks. The number of parameters increases linearly for MA-DAAC, whereas it grows much faster for MA-AIRL. Note that MA-DAAC performs better than MA-AIRL on both policy imitation and reward learning with fewer parameters in the environments with 12 or 16 agents.

5.3 Large-Scale Environments

Policy Imitation. The imitation performance in the large-scale environments is depicted in Figure 2 (columns 1-3). Here, the sample efficiency of the methods with decentralized discriminators is far higher than that of the methods with centralized discriminators, in contrast to the small-scale environments, where both were comparable. We attribute this to the higher variance of training centralized discriminators in the large-scale environments. Among the methods with decentralized discriminators, MA-DAAC learns much faster than the baselines. In particular, MA-DAAC robustly achieves the best NSS irrespective of the number of agents, whereas the convergence of the baselines becomes slower as the number of agents increases. This is because MA-DAAC uses a shared attention-based critic as well as off-policy samples via replay.

Reward Learning. For the large-scale environments, we train policies and rewards for 100,000 episodes and retrain policies with MAAC on the learned rewards for another 100,000 episodes, using the same reward models as in the small-scale environments. The reward-learning results are shown in Figure 2 (column 4). Among all methods, the rewards learned by MA-DAAC with decentralized discriminators lead to the best NSS. The performance of the retrained policies decreases as the number of agents increases.

Number of Learnable Parameters. MA-DAAC is also more efficient than the baselines in terms of the number of parameters. As depicted in Figure 3, the number of MA-DAAC's parameters increases linearly with the number of agents, while the number of the baselines' parameters grows exponentially. This exponential increase comes from the fact that MACK does not share critic networks among the agents. Note that the number of discriminator and policy parameters increases linearly in all cases. Additional results for the large-scale environments are given in the Appendix.

# Agents Algorithm # Expert Traj.
50 100 200
8 MA-AIRL
MA-DAAC
12 MA-AIRL
MA-DAAC
16 MA-AIRL
MA-DAAC
Table 1: Imitation learning performance relative to that of MA-GAIL in the Rover Tower tasks. Note that the gain becomes much larger as the amount of available expert demonstrations decreases. The evaluation scores after training for 100,000 episodes were used.
# Agents Algorithm # Expert Traj.
50 100 200
8 MA-AIRL
MA-DAAC
12 MA-AIRL
MA-DAAC
16 MA-AIRL
MA-DAAC
Table 2: Reward learning performance relative to that of MA-GAIL in the Rover Tower tasks. MA-DAAC consistently performs better than the baselines. The evaluation scores after training for 100,000 episodes were used.

5.4 Effect of Number of Experts’ Demonstration

For both the small-scale and large-scale environments, we vary the number of available expert demonstrations among 50, 100, and 200 and check its effect on the policy imitation and reward learning performance (see Figures 1 and 2). In the large-scale environments, the performance is highly affected by the number of expert demonstrations, whereas the effect is negligible in the small-scale environments. We believe this comes from the different sizes of the joint observation-action spaces. Specifically, for a fixed amount of expert demonstrations, the effective amount of training data, i.e., the number of expert demonstrations relative to the input dimension, decreases as the number of agents increases, which makes the discriminators more biased. As a result, learning in the large-scale environments is more difficult than in the small-scale environments.

For learning with decentralized discriminators in the large-scale environments in particular, we demonstrate that MA-DAAC performs better than the baselines when only a small amount of expert demonstrations is available (Tables 1 and 2). For policy imitation (Table 1), we observe that the relative score of MA-DAAC becomes larger as the number of available demonstrations decreases, irrespective of the number of agents, whereas the relative score of MA-AIRL is much smaller than that of MA-DAAC. This result supports the claim that our method is much more robust than the baselines with respect to the number of expert demonstrations. For reward learning (Table 2), we again observe that the relative score of MA-DAAC is consistently higher than that of MA-AIRL.

6 Discussion

Why do decentralized discriminators work well? Consider the sources of coordination that can lead to successful learning with decentralized discriminators. The first is the centralized critic, which takes joint observations and actions as input, as in many MARL algorithms. The second comes from sampling the experts' joint observations and actions that occurred at the same time step: since the experts' joint trajectories contain information about how to coordinate at each specific time step, decentralized discriminators can still be trained toward the experts' coordination. These two mechanisms allow decentralized discriminators to focus on local experience, which greatly reduces their input spaces and leads to better scalability.
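To illustrate the second mechanism, the snippet below shows one way to sample time-aligned joint expert transitions and hand each agent's local slice to its decentralized discriminator. The data layout and names are hypothetical, chosen only to make the point about synchronized sampling.

import random

def sample_joint_expert_transitions(expert_episodes, batch_size):
    """Sample expert transitions so that all agents' (obs, act) come from the SAME time step.

    expert_episodes: list of episodes; each episode is a list over time steps of
                     per-agent (obs, act) tuples, i.e. episode[t][i] == (obs_i_t, act_i_t).
    Returns a list of per-agent batches that preserves the cross-agent alignment.
    """
    num_agents = len(expert_episodes[0][0])
    per_agent_batches = [[] for _ in range(num_agents)]
    for _ in range(batch_size):
        episode = random.choice(expert_episodes)
        t = random.randrange(len(episode))               # one shared time index for all agents
        for i in range(num_agents):
            per_agent_batches[i].append(episode[t][i])   # local slice for agent i's discriminator
    return per_agent_batches

# Sampling each agent's expert data with independent time indices would instead discard
# the implicit coordination information carried by the shared time step.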

Why don't we use observation-only decentralized discriminators? In single-agent AIRL (fu2018learning), it was shown that a discriminator model that ignores action inputs can recover a reward function that is robust to changes in the environment. In multi-agent problems, however, we find that observation-only discriminators may fail to imitate well, depending on the task. In Cooperative Communication, for example, the speaker's observation is fixed to the color of the goal landmark for the whole episode, and the speaker's action (the message) at time $t$ becomes part of the listener's observation at time $t+1$. In this setting, if observation-only decentralized discriminators are used, the speaker cannot learn how to send a correct message, since its own observation transition contains no information about the message. Due to the incorrect messages from the speaker, the listener's observations become noisy, which results in poor performance for the listener as well. In contrast, an observation-only centralized discriminator does not suffer from this issue, since it can exploit the full joint observation transition of both speaker and listener and match it to the transitions in the experts' demonstrations. This problem, which stems from the partially observable nature of multi-agent settings, opens a new challenge: learning reward functions that are both scalable and robust. We leave this problem to future work.

Figure 4: Pearson's correlation coefficient (PCC) between the ground-truth rewards and the learned rewards (left), and NSS during retraining with the learned rewards (right). The results imply that the experts' behavior can be recovered by learned rewards that have low correlation with the ground truth.

Is the correlation between learned rewards and ground-truth rewards meaningful? In the experiments of MA-AIRL (yu2019multi), the statistical correlation between the ground-truth rewards, i.e., the rewards used to train the experts' policies, and the learned rewards was used as the performance measure for reward learning. We also compute these correlations in our implementation, but we could not reproduce the reward-recovery performance reported in yu2019multi; the discrepancies may come from differences between our implementation and theirs. Moreover, we find that high correlation between the learned rewards and the ground-truth rewards is not a necessary condition for the learned rewards to induce the experts' behavior, as depicted in Figure 4. It is well known that IRL problems are ill-posed, so there can exist multiple reward functions consistent with the experts' observed trajectories.

7 Conclusion

We propose MA-DAAC, a multi-agent IRL method that is much more scalable and sample-efficient than existing works. We extensively analyze the performance of MA-DAAC and compare it to the baselines in terms of sample efficiency and retraining score, using a newly defined measure (NSS) and both types of discriminators (decentralized and centralized). We show that MA-DAAC with decentralized discriminators outperforms the other methods. One interesting future direction is scalable multi-agent IRL with a centralized discriminator, so that the sequential behavior of a large number of agents can be interpreted efficiently through the resulting centralized reward functions.

References

Appendix A Details of Experiments

Tasks. In our experiments, we consider four tasks built on OpenAI's Multi-Agent Particle Environments (mordatch2017emergence; lowe2017multi; iqbal2019actor): Keep Away, Cooperative Communication, Cooperative Navigation, and Rover Tower. For all tasks, the length of each episode is set to 25. For Rover Tower, the number of agents is chosen among 8, 12, and 16.

MARL implementations. We implement both MACK (song2018multi) and MAAC (iqbal2019actor) in PyTorch (PyTorch) by refactoring the code released by the respective authors (https://github.com/ermongroup/multiagent-gail, https://github.com/shariqiqbal2810/MAAC). To implement MACK in PyTorch, we use generalized advantage estimation (GAE) (schulman2015high) and the KFAC optimizer from a PyTorch ACKTR implementation (https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail). Additionally, we apply the advantage-normalization technique used in OpenAI Baselines (https://github.com/openai/baselines), which is not present in the original MACK implementation but greatly stabilizes and improves the performance of MACK. For all policies and the value baselines in MACK, we use two-layer neural networks with 128 hidden units and LeakyReLU activations. For the attention critic in MAAC, we use the same network architecture as in the released code. For all methods, we divide the rewards by the length of the episodes.
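For reference, the advantage-normalization step mentioned above usually amounts to standardizing the advantage estimates within each batch before the policy update; a minimal PyTorch version (our own sketch, not copied from either codebase) is:

import torch

def normalize_advantages(advantages: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Standardize a batch of advantage estimates to zero mean and unit variance."""
    return (advantages - advantages.mean()) / (advantages.std() + eps)

# The normalized advantages are then used in place of the raw GAE estimates
# when computing the policy-gradient loss.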

Experts. We use our MAAC implementation to train the experts' policies for 50,000 episodes. We normalize the inputs and rewards of each agent using running estimates of the mean and standard deviation, as in OpenAI Baselines (https://github.com/openai/baselines/blob/master/baselines/common/running_mean_std.py). The other hyperparameters are summarized in Table 3. After training the experts, we sample 500 episodes using the learned expert policies and use the average score of each agent over these 500 episodes as $S_i^{\mathrm{expert}}$ in the NSS.

hyperparameters value
discount factor 0.995
buffer size 50,000
policy learning rate 0.001
target policy update rate 0.01
policy entropy regularization coefficient 0.01
critic learning rate 0.001
target critic update rate 0.01
critic gradient norm clipping 1.0

critic loss function Huber loss
batch size 1,000
update period 100
Table 3: Hyperparameters for training experts with MAAC

Random Agents. We sample 500 episodes with uniformly random actions and use the average score of each agent over these 500 episodes as $S_i^{\mathrm{random}}$ in the NSS.

Multi-agent inverse RL. For inverse RL, unlike when training the experts' policies, we do not use any normalization. While song2018multi and yu2019multi initialize policies with behavioral cloning, we use randomly initialized policies, as is done in existing works on single-agent adversarial imitation learning (ho2016generative; kostrikov2018discriminator; sasaki2018sample). Although we have not measured the effect of behavioral-cloning initialization on scalability, we suspect such initialization would not give a significant gain to either MA-GAIL or MA-AIRL, since both use MACK as the MARL algorithm. For the discriminators of MA-GAIL, we use two-layer neural networks with 128 hidden units and LeakyReLU activations to model $D_\phi^i$. For the discriminators of MA-AIRL and MA-DAAC, we use two-layer neural networks with 128 hidden units and LeakyReLU activations for both the reward estimator $g_\phi^i$ and the potential shaping function $h_\phi^i$. For discriminator training, we use the discriminator entropy regularization from ho2016generative (https://github.com/openai/imitation) instead of the L2 regularization used in the released code (https://github.com/ermongroup/MA-AIRL); this empirically leads to much better imitation and reward-learning performance than L2 regularization. The other hyperparameters for MA-GAIL and MA-AIRL are given in Table 4, and those for MA-DAAC in Table 5.
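The discriminator entropy regularization referred to above penalizes overconfident discriminator outputs by adding the binary entropy of the discriminator's prediction to its objective; a small PyTorch sketch of the idea (our own illustration, assuming the discriminator produces logits) is:

import torch
import torch.nn.functional as F

def discriminator_entropy_bonus(logits: torch.Tensor) -> torch.Tensor:
    """Mean binary entropy H(D) = -[D log D + (1 - D) log(1 - D)] of the discriminator outputs.

    Subtracting coef * bonus from the discriminator loss (equivalently, adding it to the
    objective) discourages the discriminator from becoming overconfident too quickly.
    """
    probs = torch.sigmoid(logits)
    log_probs = F.logsigmoid(logits)          # log D, numerically stable
    log_one_minus = F.logsigmoid(-logits)     # log (1 - D)
    entropy = -(probs * log_probs + (1.0 - probs) * log_one_minus)
    return entropy.mean()

# usage: loss = bce_loss - entropy_coef * discriminator_entropy_bonus(logits)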

hyperparameters value
for GAE 0.95
discount factor 0.995
policy learning rate 0.001
policy target update rate 0.0005
policy entropy regularization coefficient 0.01
critic learning rate 0.001
critic target update rate 0.001
critic gradient norm clipping 10
critic loss function Huber loss
discriminator learning rate 0.01
discriminator entropy regularization coefficient 0.01
discriminator gradient norm clipping 10
batch size 1,000
Table 4: Hyperparameters for both MA-GAIL and MA-AIRL
hyperparameters value
discount factor 0.995
buffer size 1,250,000
policy learning rate 0.001
target policy update rate 0.0005
policy entropy regularization coefficient 0.01
critic learning rate 0.001
target critic update rate 0.0005
critic gradient norm clipping 1.0
critic loss function Huber loss
discriminator learning rate 0.0005
discriminator entropy regularization coefficient 0.01
discriminator gradient norm clipping 10
batch size 1,000
update period 100
Table 5: Hyperparameters for MA-DAAC

Appendix B Results in Small-Scale Environments

The scores of learned policies are given in Table 6, and the scores of policies retrained with learned rewards are given in Table 7. For both cases, we train policies for 50,000 episodes.

Number of Experts’ Demonstration Algorithm Discriminator Type Keep Away Cooperative Communication Cooperative       Navigation
Pusher Reacher
Random
Expert
50 MA-GAIL Decentralized
Centralized
MA-AIRL Decentralized
Centralized
MA-DAAC Decentralized
Centralized
100 MA-GAIL Decentralized
Centralized
MA-AIRL Decentralized
Centralized
MA-DAAC Decentralized
Centralized
200 MA-GAIL Decentralized
Centralized
MA-AIRL Decentralized
Centralized
MA-DAAC Decentralized
Centralized
Table 6: 95% confidence intervals of scores after either imitation learning (MA-GAIL) or inverse reinforcement learning (MA-AIRL, MA-DAAC) in the small-scale environments. Means and confidence intervals are calculated over 10 runs.
Number of Experts’ Demonstration Algorithm Discriminator Type Keep Away Cooperative Communication Cooperative       Navigation
Pusher Reacher
Random
Expert
50 MA-GAIL Decentralized
Centralized
MA-AIRL Decentralized
Centralized
MA-DAAC Decentralized
Centralized
100 MA-GAIL Decentralized
Centralized
MA-AIRL Decentralized
Centralized
MA-DAAC Decentralized
Centralized
200 MA-GAIL Decentralized
Centralized
MA-AIRL Decentralized
Centralized
MA-DAAC Decentralized
Centralized
Table 7: 95% confidence intervals of scores after retraining with either the discriminator-derived rewards (MA-GAIL) or the learned reward functions (MA-AIRL, MA-DAAC) in the small-scale environments. Means and confidence intervals are calculated over 10 runs.

Appendix C Results in Large-Scale Environments

The scores of learned policies are given in Table 8, and the scores of policies retrained with learned rewards are given in Table 9. For both cases, we train policies for 100,000 episodes.

Number of Experts’ Demonstration Algorithm Discriminator Type Rover Tower
8 Agents 12 Agents 16 Agents

Random
Expert
50 MA-GAIL Decentralized
Centralized
MA-AIRL Decentralized
Centralized
MA-DAAC Decentralized
Centralized
100 MA-GAIL Decentralized
Centralized
MA-AIRL Decentralized
Centralized
MA-DAAC Decentralized
Centralized
200 MA-GAIL Decentralized
Centralized
MA-AIRL Decentralized
Centralized
MA-DAAC Decentralized
Centralized
Table 8: 95% confidence intervals of scores after either imitation learning (MA-GAIL) or inverse reinforcement learning (MA-AIRL, MA-DAAC) in the large-scale environments. Means and confidence intervals are calculated over 10 runs.
Number of Experts’ Demonstration Algorithm Discriminator Type Rover Tower
8 Agents 12 Agents 16 Agents

Random
Expert
50 MA-GAIL Decentralized
Centralized
MA-AIRL Decentralized
Centralized
MA-DAAC Decentralized
Centralized
100 MA-GAIL Decentralized
Centralized
MA-AIRL Decentralized
Centralized
MA-DAAC Decentralized
Centralized
200 MA-GAIL Decentralized
Centralized
MA-AIRL Decentralized
Centralized
MA-DAAC Decentralized
Centralized
Table 9: 95% confidence intervals of scores after retraining with either the discriminator-derived rewards (MA-GAIL) or the learned reward functions (MA-AIRL, MA-DAAC) in the large-scale environments. Means and confidence intervals are calculated over 10 runs.