
PIC: Permutation Invariant Critic for Multi-Agent Deep Reinforcement Learning

10/31/2019
by   Iou-Jen Liu, et al.

Sample efficiency and scalability to a large number of agents are two important goals for multi-agent reinforcement learning systems. Recent works got us closer to those goals, addressing non-stationarity of the environment from a single agent's perspective by utilizing a deep net critic which depends on all observations and actions. The critic input concatenates agent observations and actions in a user-specified order. However, since deep nets aren't permutation invariant, a permuted input changes the critic output despite the environment remaining identical. To avoid this inefficiency, we propose a 'permutation invariant critic' (PIC), which yields identical output irrespective of the agent permutation. This consistent representation enables our model to scale to 30 times more agents and to achieve improvements of test episode reward between 15% and 50% on the challenging multi-agent particle environment (MPE).


1 Introduction

Single-agent deep reinforcement learning has achieved impressive performance in many domains, including playing Go [Silver16, Silver17] and Atari games [dqn1, dqn2]. However, many real world problems, such as traffic congestion reduction [Bazzan08, Sunehag18], antenna tilt control [Dandanov17], and dynamic resource allocation [Nguyen18] are more naturally modeled as multi-agent systems. Unfortunately, directly deploying single-agent reinforcement learning to each agent in a multi-agent system does not result in satisfying performance [Tang93, maddpg]. Particularly, in multi-agent reinforcement learning [Nguyen18, maddpg, Foerster17, Foerster18, Iqbal19, Jiang18, Das19, Foerster16, Kim19, Shu19, Han19], estimating the value function is challenging, because the environment is non-stationary from the perspective of an individual agent [maddpg, Foerster17]. To alleviate the issue, multi-agent deep deterministic policy gradient (MADDPG) [maddpg] recently proposed a centralized critic whose input is the concatenation of all agents’ observations and actions. Similar to MADDPG, Foerster17, Foerster18, Kim19, Jiang18, Das19, Iqbal19, and Yang18 also deploy centralized critics to handle a non-stationary environment.

However, concatenating all agents’ observations and actions assigns an implicit order, i.e., the placement of an agent’s observations and actions will make a difference in the predicted outcome. Consider the case of two homogeneous agents and let us denote the action and observation of the two agents with ‘A’ and ‘B.’ There exist two equally valid permutations (AB) and (BA) which represent the environment. Using a permuted input in classical deep nets will result in a different output. Consequently, referring to the same underlying state of the environment with two different vector representations makes learning of the critic sample-inefficient: the deep net needs to learn that both representations are identical. Because the number of possible permutations grows with the number of agents, this representational inconsistency worsens as more agents are added.

To address this concern, we propose the ‘permutation invariant critic’ (PIC). Due to the permutation invariance property of PICs, the same environment state will result in the same critic output, irrespective of the agent ordering (as shown in Figure 1). In addition, to tackle environments with homogeneous and heterogeneous agents (i.e., agents which have different action spaces, observation spaces, or play different roles in a task), we augment PICs with attributes. This enables the proposed PIC to model the relation between heterogeneous agents.

For rigorous results we follow the strict evaluation protocol proposed by Henderson17 and Colas18 when performing experiments in multi-agent particle environments (MPEs) [maddpg, Mordatch17]. We found that permutation invariant critics result in 15% to 50% higher average test episode rewards than the MADDPG baseline [maddpg]. Furthermore, we scaled the MPE to 200 agents. Our permutation invariant critic successfully learns the desired policy in environments with a large number of agents, while the baseline MADDPG [maddpg] fails to develop any useful strategies.

In summary, our main contributions are as follows: a) We develop a permutation invariant critic (PIC) for multi-agent reinforcement learning algorithms. Compared with classic MLP critics, the PIC achieves better sample efficiency and scalability. b) To deal with heterogeneous agents we study adding attributes. c) We speed up the multi-agent particle environment (MPE) [maddpg, Mordatch17] by a factor of 30. This permits scaling the number of agents to 200, 30 times more than the number used in the original MPE environment (6 agents). Code is available at https://github.com/IouJenLiu/PIC.

Figure 1: Consider an environment with four homogeneous cooperative agents. Different permutations of the agents’ observations and actions refer to the same underlying environment state. However, MLP critics result in different outputs for the same environment state. In contrast, permutation invariant critics yield the same output value for all equivalent permutations.

2 Related Work

We briefly review graph neural nets, which are the building block of PICs, and multi-agent deep reinforcement learning algorithms with centralized critics.

Graph Neural Networks. Graph neural networks are deep nets which operate on graph structured data [scarselli2009graph]. Input to the network is hence a set of node vectors and connectivity information about the nodes. More notably, these graph networks are permutation equivariant, i.e., the ordering of the nodes in a vector representation does not change the underlying graph [zaheer2017deep]. Many variants of graph networks exist, for example, Graph Convolutional Nets (GCN) [kipf2017semi], the Message Passing Network [gilmer2017neural], and others [zaheer2017deep, qi2017pointnet]. The relations and differences between these approaches are reviewed in [battaglia2018relational]. The effectiveness of graph nets has been shown on tasks such as link prediction [schlichtkrull2018modeling, zhang2018link], node classification [hamilton2017inductive, kipf2017semi], language and vision [NarasimhanNIPS2018, SchwartzCVPR2019a], graph classification [yeh2019diverse, duvenaud2015convolutional, zhang2018end, ying2018hierarchical], etc. Graph nets also excel on point sets, e.g., [zaheer2017deep, qi2017pointnet]. Most relevant to our multi-agent reinforcement learning setting, graph networks have been shown to be effective in modeling and reasoning about physical systems [battaglia2016interaction] and multi-agent sports dynamics [hoshen2017vain, kipf2018neural, yeh2019diverse]. Different from these works, here, we study the effectiveness of graph nets for multi-agent reinforcement learning.

Multi-agent Reinforcement Learning. To deal with non-stationary environments from the perspective of a single agent, MADDPG [maddpg] uses a centralized critic that operates on all agents’ observations and actions. Similar to MADDPG, Foerster18 use a centralized critic. In addition, to handle the credit assignment problem [Nguyen18, Panait05, Chang03], a counterfactual baseline has been proposed to marginalize one agent’s action and keep the other agents’ actions fixed. In “Monotonic Value Function Factorisation” (QMIX) [qmix] each agent maintains its own value function which conditions only on the agent’s local observation. The overall value function is estimated via a non-linear combination of the individual agents’ value functions. Iqbal19 propose an attention mechanism which enables the centralized critic to select relevant information for each agent. However, as discussed in Section 1, the output of centralized critics parameterized by classic deep nets differs if the same environment state is encoded with a permuted vector. This makes learning inefficient.

“Graph convolutional RL” (DGN) [Jiang19] is concurrent work on arXiv which uses a nearest-neighbor graph net as the Q-function of a deep Q-network (DQN) [dqn1, dqn2]. However, the nearest-neighbor graph net only has access to local information of an environment. Consequently, the Q-function in DGN is not a fully centralized critic. Therefore, it suffers from the non-stationary environment issue [maddpg, Foerster17]. In addition, due to the DQN formulation, DGN can only be used in environments with discrete action spaces. Note, DGN considers homogeneous cooperative agents and leaves environments with heterogeneous cooperative agents to future work. In contrast, our permutation invariant critic is fully centralized and can be scaled to a large number of agents. Thanks to different node attributes, our approach can handle environments with heterogeneous cooperative agents. In addition, our approach is designed for continuous state and action spaces.

3 Preliminaries

3.1 Deep Deterministic Policy Gradient

In classic single-agent reinforcement learning, an agent interacts with the environment and collects rewards over time. Formally, at each timestep $t \in \{1, \ldots, H\}$, with $H$ the horizon, the agent finds itself in state $s^t \in \mathcal{S}$ and selects an action $a^t \in \mathcal{A}$ according to a deterministic policy $\mu$. Hereby, $\mathcal{S}$ is the state space, and $\mathcal{A}$ is the action space. Upon executing an action, the agent runs into the next state $s^{t+1} \in \mathcal{S}$ and obtains a scalar reward $r^t \in \mathbb{R}$. Given a trajectory $\{s^t, a^t, r^t\}_{t=1}^{H}$ of length $H$ collected by following a policy, we obtain the discounted return $R = \sum_{t=1}^{H} \gamma^t r^t$, where $\gamma \in (0, 1]$ is the discount factor. The goal of reinforcement learning is to find a policy which maximizes the return $R$.
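As a small worked illustration of the discounted return, the sketch below assumes the return is $\sum_{t=1}^{H} \gamma^t r^t$ with timesteps counted from 1. The function name `discounted_return` and the reward values are hypothetical and chosen only for this example.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r^t for t = 1..H (timesteps counted from 1)."""
    return sum(gamma ** t * r for t, r in enumerate(rewards, start=1))


# Hypothetical three-step trajectory with a reward of 1 at every step.
example_return = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
print(example_return)  # 0.5 + 0.25 + 0.125 = 0.875
```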

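Deep deterministic policy gradient (DDPG) [ddpg] is a widely used deep reinforcement learning algorithm for continuous control. In DDPG, a deterministic policy $\mu_\theta$, which is parameterized by $\theta$, maps a state $s$ to an action $a$. A critic $Q_\phi$, which is parameterized by $\phi$, is deployed to estimate the return of taking action $a$ at state $s$. The parameters $\theta$ of the actor policy are updated iteratively so that the action $a = \mu_\theta(s)$ maximizes the critic $Q_\phi$, i.e.,

$$\max_\theta J(\theta) := \mathbb{E}_{s \sim \mathcal{D}} \left[ Q_\phi(s, a) \big|_{a = \mu_\theta(s)} \right]. \tag{1}$$

Here $s$ is drawn from the replay buffer $\mathcal{D}$ which stores experience tuples $(s, a, r, s')$, i.e., $\mathcal{D} = \{(s, a, r, s')\}$. Using the chain rule, the gradient is

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \mathcal{D}} \left[ \nabla_\theta \mu_\theta(s) \, \nabla_a Q_\phi(s, a) \big|_{a = \mu_\theta(s)} \right].$$

To optimize $\phi$, similar to deep Q-learning [dqn1, dqn2], we minimize the loss

$$\mathcal{L}(\phi) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}} \left[ \left( Q_\phi(s, a) - y \right)^2 \right]. \tag{2}$$

The experience tuple $(s, a, r, s')$ is drawn from the replay buffer $\mathcal{D}$ and the target value $y$ is defined as

$$y = r + \gamma \, Q_{\phi'}(s', a') \big|_{a' = \mu_{\theta'}(s')}, \tag{3}$$

where $Q_{\phi'}$ is a recent Q-network parameterized by a past $\phi'$, and $\mu_{\theta'}$ is the corresponding past policy.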
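To make the critic update concrete, the following sketch evaluates the target value and the squared-error loss on a tiny batch. It assumes the standard DDPG forms, i.e., a target of reward plus discounted target-network value and a mean squared error loss. The helper names and all numbers are hypothetical, and the critic values are given directly instead of being produced by a network.

```python
import numpy as np


def td_target(r, q_next, gamma):
    """Target y = r + gamma * Q_target(s', a') for a batch of transitions."""
    return r + gamma * q_next


def critic_loss(q_pred, y):
    """Mean squared difference between critic predictions and targets."""
    return float(np.mean((q_pred - y) ** 2))


# Hypothetical batch of three transitions.
r = np.array([1.0, 0.0, -1.0])        # rewards
q_next = np.array([2.0, 4.0, 0.0])    # target-network values at (s', a')
q_pred = np.array([1.0, 2.0, 0.0])    # current critic values at (s, a)

y = td_target(r, q_next, gamma=0.5)   # [2.0, 2.0, -1.0]
loss = critic_loss(q_pred, y)         # (1 + 0 + 1) / 3
print(y, loss)
```

In an actual implementation the gradient of this loss with respect to the critic parameters would be taken by an automatic differentiation library, with the target treated as a constant.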

3.2 Multi-agent Markov Decision Process

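To extend single-agent reinforcement learning to the multi-agent setting, we first define the multi-agent Markov decision process (MDP). We consider partially observable multi-agent MDPs [Littman94]. An $N$-agent partially observable multi-agent MDP is defined by a transition function $\mathcal{T}$, a set of reward functions $\{R_1, \ldots, R_N\}$, a state space $\mathcal{S}$, a set of observation spaces $\{\mathcal{O}_1, \ldots, \mathcal{O}_N\}$, and a set of action spaces $\{\mathcal{A}_1, \ldots, \mathcal{A}_N\}$. Action space $\mathcal{A}_i$, observation space $\mathcal{O}_i$ and reward function $R_i$ correspond to agent $i \in \{1, \ldots, N\}$. The transition function $\mathcal{T}$ maps the current state and the actions taken by all the agents to a next state, i.e., $\mathcal{T}: \mathcal{S} \times \mathcal{A}_1 \times \cdots \times \mathcal{A}_N \to \mathcal{S}$. Each agent receives reward $R_i: \mathcal{S} \times \mathcal{A}_1 \times \cdots \times \mathcal{A}_N \to \mathbb{R}$ and observation $o_i$ that is related to the state, i.e., $o_i: \mathcal{S} \to \mathcal{O}_i$.

The goal of agent $i$ is to maximize the expected return $\mathbb{E}\left[\sum_{t=1}^{H} \gamma^t r_i^t\right]$. Note that the goal of cooperative agents is to maximize the collective expected return $\mathbb{E}\left[\sum_{t=1}^{H} \gamma^t \sum_{i=1}^{N} r_i^t\right]$.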

3.3 Multi-agent Deep Deterministic Policy Gradient

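In this paper, we study multi-agent reinforcement learning using the decentralized execution and centralized training framework [Foerster16, maddpg]. Multi-agent deep deterministic policy gradient (MADDPG) [maddpg] is a well-established algorithm for this framework. Consider $N$ agents with policies $\{\mu_{\theta_1}, \ldots, \mu_{\theta_N}\}$, which are parameterized by $\{\theta_1, \ldots, \theta_N\}$. The centralized critics $\{Q_{\phi_1}, \ldots, Q_{\phi_N}\}$ associated with the agents are parameterized by $\{\phi_1, \ldots, \phi_N\}$. Following DDPG, the parameters $\theta_i$ for policy $\mu_{\theta_i}$ are updated iteratively so that the associated critic $Q_{\phi_i}$ is maximized via

$$\max_{\theta_i} J(\theta_i) := \mathbb{E}_{x, a \sim \mathcal{D}} \left[ Q_{\phi_i}(x, a) \big|_{a_i = \mu_{\theta_i}(o_i)} \right], \tag{4}$$

where $x$ and $a$ are the concatenation of all agents’ observations and actions at timestep $t$, i.e., $x = (o_1^t, \ldots, o_N^t)$ and $a = (a_1^t, \ldots, a_N^t)$. Note, $o_i^t$ is the observation received by agent $i$ at timestep $t$. Using the chain rule, the gradient is derived as follows:

$$\nabla_{\theta_i} J(\theta_i) = \mathbb{E}_{x, a \sim \mathcal{D}} \left[ \nabla_{\theta_i} \mu_{\theta_i}(o_i) \, \nabla_{a_i} Q_{\phi_i}(x, a) \big|_{a_i = \mu_{\theta_i}(o_i)} \right]. \tag{5}$$

Following DDPG, the centralized critic parameters $\phi_i$ are optimized by minimizing the loss

$$\mathcal{L}(\phi_i) = \mathbb{E}_{(x, a, r, x') \sim \mathcal{D}} \left[ \left( Q_{\phi_i}(x, a) - y_i \right)^2 \right], \tag{6}$$

where $r$ is the concatenation of rewards received by all agents at timestep $t$, i.e., $r = (r_1^t, \ldots, r_N^t)$. The target value $y_i$ is defined as follows

$$y_i = r_i + \gamma \, Q_{\phi_i'}(x', a') \big|_{a_j' = \mu_{\theta_j'}(o_j')}, \tag{7}$$

where $Q_{\phi_i'}$ is a Q-network parameterized by a past $\phi_i'$, and $\mu_{\theta_j'}$ is the corresponding past policy of agent $j$.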

4 Permutation Invariant Critic (PIC) and Environment Improvements

We first describe the proposed permutation invariant critic (PIC), then show improvements for MPE.

4.1 Permutation Invariant Critic (PIC)

Consider training $N$ homogeneous cooperative agents using a centralized critic $Q$ [Nguyen18, maddpg, Foerster17, Foerster18, Iqbal19, Jiang18, Das19, Foerster16, Kim19, Shu19, Han19]. As discussed in Section 3, the input to the centralized critic is the concatenation of all agents’ observations and actions. Let $x^t \in \mathbb{R}^{N \times K_o}$ denote the concatenation of all agents’ observations at timestep $t$, where $K_o$ is the dimension of each observation $o_i^t$. Similarly, we represent the concatenation of all agents’ actions at timestep $t$ via $a^t \in \mathbb{R}^{N \times K_a}$, where $K_a$ is the dimension of each agent’s action $a_i^t$. Note that concatenating observations and actions implicitly imposes an agent ordering. Any agent ordering seems plausible. Importantly, shuffling the agents’ observations and actions doesn’t change the state of the environment. One would therefore expect the centralized critic to return the same output if the input is concatenated in a different order. Formally, this property is called permutation invariance, i.e., we strive for a critic such that

$$Q(M_1 [x^t, a^t]) = Q(M_2 [x^t, a^t]) \quad \forall M_1, M_2 \in \mathcal{M},$$

where $M_1$ and $M_2$ are two permutation matrices from the set of all possible $N \times N$ permutation matrices $\mathcal{M}$.
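The property can be checked numerically. The sketch below contrasts a toy MLP critic acting on the flattened input with a toy critic that applies shared per-agent weights followed by a sum over agents. Both critics, their random weights, and the sizes are hypothetical stand-ins chosen for illustration and are not the architecture proposed in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 4, 3                            # hypothetical: 4 agents, 3 features per agent
X = rng.normal(size=(N, K))            # row i holds agent i's observation-action features

W_mlp = rng.normal(size=(N * K, 8))    # toy MLP critic weights on the flattened input
W_set = rng.normal(size=(K, 8))        # toy weights shared across agents
w_out = rng.normal(size=8)


def mlp_critic(X):
    """Flatten all agents into one vector, so the agent order is baked in."""
    return float(np.tanh(X.reshape(-1) @ W_mlp) @ w_out)


def pooled_critic(X):
    """Shared weights per agent, then a sum over agents, so order cannot matter."""
    return float(np.tanh(X @ W_set).sum(axis=0) @ w_out)


perm = np.array([2, 0, 3, 1])          # one fixed reordering of the agents
M = np.eye(N)[perm]                    # the corresponding permutation matrix

print(mlp_critic(X), mlp_critic(M @ X))         # generally two different values
print(pooled_critic(X), pooled_critic(M @ X))   # equal up to floating-point rounding
```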

To achieve permutation invariance, we propose to use a graph convolutional neural net (GCN) as the centralized critic. In the remainder of the section, we describe the GCN model in detail and discuss how to deploy the permutation invariant critic to environments with homogeneous and heterogeneous agents.

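Permutation Invariant Critic. We model an $N$-agent environment as a graph. Each node represents an agent, and the edges capture the relations between agents. To compute the scalar output of the critic we use $L$ graph convolution layers $\{f^{(0)}_{\mathrm{GCN}}, \ldots, f^{(L-1)}_{\mathrm{GCN}}\}$. A graph convolution layer takes node representations and the graph’s adjacency matrix as input and computes a new representation for each node. More specifically, $f^{(l)}_{\mathrm{GCN}}$ maps the input $h^{(l)} \in \mathbb{R}^{N \times K^{(l)}}$ to the output $h^{(l+1)} \in \mathbb{R}^{N \times K^{(l+1)}}$, where $K^{(l)}$ and $K^{(l+1)}$ are the input and output node representation’s dimension for layer $l$. Formally,

$$h^{(l+1)} = f^{(l)}_{\mathrm{GCN}}(h^{(l)}) := \sigma\left( \frac{1}{N} A_{\mathrm{adj}} \, h^{(l)} W^{(l)}_{\mathrm{other}} + I_N \, h^{(l)} W^{(l)}_{\mathrm{self}} \right), \tag{8}$$

where $A_{\mathrm{adj}} \in \mathbb{R}^{N \times N}$ is the graph’s adjacency matrix, $W^{(l)}_{\mathrm{self}}, W^{(l)}_{\mathrm{other}} \in \mathbb{R}^{K^{(l)} \times K^{(l+1)}}$ are the layer’s trainable weight matrices, $\sigma$ is an element-wise non-linear activation function, and $I_N$ is an identity matrix of size $N \times N$.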

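Note that in Equation 8, each agent’s representation, i.e., each row of $h^{(l)}$, is multiplied with the same set of weights $W^{(l)}_{\mathrm{self}}$ and $W^{(l)}_{\mathrm{other}}$. Due to this weight sharing scheme, a permutation matrix applied at the input is equivalent to applying it at the output, i.e., for every permutation matrix $M \in \mathcal{M}$,

$$M f^{(l)}_{\mathrm{GCN}}(h^{(l)}) = f^{(l)}_{\mathrm{GCN}}(M h^{(l)}). \tag{9}$$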
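The layer and its equivariance can be sketched in a few lines of NumPy. The sketch assumes the layer has the form $\sigma(\frac{1}{N} A_{\mathrm{adj}} h W_{\mathrm{other}} + h W_{\mathrm{self}})$ with a ReLU activation and a complete-graph adjacency matrix (all ones with a zero diagonal). The function name `gcn_layer`, the sizes, and the random weights are hypothetical.

```python
import numpy as np


def gcn_layer(h, A_adj, W_self, W_other):
    """One graph convolution layer: h is (N, K_in), the result is (N, K_out)."""
    n = h.shape[0]
    neighbor_term = (A_adj @ h @ W_other) / n   # aggregated messages from other nodes
    self_term = h @ W_self                      # identity matrix times h is h itself
    return np.maximum(neighbor_term + self_term, 0.0)  # ReLU


rng = np.random.default_rng(0)
N, K_in, K_out = 5, 3, 4
A_adj = np.ones((N, N)) - np.eye(N)             # complete graph, zeros on the diagonal
W_self = rng.normal(size=(K_in, K_out))
W_other = rng.normal(size=(K_in, K_out))
h = rng.normal(size=(N, K_in))

M = np.eye(N)[rng.permutation(N)]               # a random permutation matrix
lhs = gcn_layer(M @ h, A_adj, W_self, W_other)  # permute first, then apply the layer
rhs = M @ gcn_layer(h, A_adj, W_self, W_other)  # apply the layer, then permute
print(np.allclose(lhs, rhs))
```

For the complete graph the adjacency matrix is unchanged when rows and columns are permuted together, which is why the check succeeds without permuting `A_adj`.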

Another advantage of the weight sharing scheme is that the number of trainable parameters of PIC does not increase with the number of agents.
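The effect on model size can be illustrated with a simple parameter count. The sketch assumes an MLP critic with two hidden layers on the flattened input and a PIC with two graph convolution layers (two bias-free weight matrices each) plus a linear readout. These architectural details, the function names, and the per-agent feature dimension of 31 are assumptions made for illustration, and the counts are not claimed to reproduce the figures reported in the experiments.

```python
def mlp_critic_param_count(n_agents, feat_dim, hidden=128):
    """Weights and biases of an MLP: (n_agents * feat_dim) -> hidden -> hidden -> 1."""
    d_in = n_agents * feat_dim
    return (d_in * hidden + hidden) + (hidden * hidden + hidden) + (hidden + 1)


def pic_param_count(feat_dim, hidden=128):
    """Two GCN layers with W_self and W_other each, plus a linear readout."""
    layer1 = 2 * feat_dim * hidden
    layer2 = 2 * hidden * hidden
    readout = hidden + 1
    return layer1 + layer2 + readout


for n in (8, 100):
    print(n, mlp_critic_param_count(n, feat_dim=31), pic_param_count(feat_dim=31))
```

Under these assumptions the MLP count grows linearly with the number of agents, while the PIC count depends only on the per-agent feature dimension.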

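Subsequently, a pooling layer is applied to the output representation $h^{(L)}$ of the last graph convolutional layer. Pooling is performed over the agents’ representations, i.e., over the rows of $h^{(L)}$. We refer to the output of the pooling layer as $v \in \mathbb{R}^{K^{(L)}}$. Either average pooling or max pooling is suitable. Average pooling, subsequently denoted $f_{\mathrm{avg}}$, averages the node representations, i.e., $v = \frac{1}{N} \sum_{i=1}^{N} h^{(L)}_i$. Max pooling, subsequently referred to as $f_{\max}$, takes the maximum value across the rows for each of the $K^{(L)}$ columns, i.e., $v_j = \max_i h^{(L)}_{i,j}$ for $j \in \{1, \ldots, K^{(L)}\}$. Both max pooling and average pooling satisfy the permutation invariance property as summation and element-wise maximization are commutative operations. Therefore, an $L$-layer graph convolutional net is obtained via $f_{\mathrm{GCN}} := f_{\mathrm{pool}} \circ f^{(L-1)}_{\mathrm{GCN}} \circ \cdots \circ f^{(0)}_{\mathrm{GCN}}$, with $f_{\mathrm{pool}} \in \{f_{\mathrm{avg}}, f_{\max}\}$.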

Homogeneous setting. If agents in an environment are homogeneous, we first concatenate the observations $x$ and actions $a$ into a matrix $X \in \mathbb{R}^{N \times (K_o + K_a)}$, i.e., $X = [x, a]$. Setting $h^{(0)} = X$ and $K^{(0)} = K_o + K_a$, we construct a permutation invariant critic as follows:

$$Q(X) = \rho\left( f_{\mathrm{GCN}}(X) \right). \tag{10}$$

Hereby $\rho$ maps the output of the graph nets to a real number, which is the estimated scalar critic value for the environment observation $x$ and action $a$. We model $\rho$ with a standard fully connected layer, which maintains permutation invariance. To ensure that the permutation invariant critic is fully centralized, i.e., to ensure that we consider all agents’ actions and observations, we use an adjacency matrix $A_{\mathrm{adj}}$ corresponding to a complete graph, i.e., $A_{\mathrm{adj}}$ is a matrix of all ones with zeros on the diagonal. As later mentioned in Section 5, we also study other settings but found a complete graph to yield the best results.
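Putting the pieces together, a forward pass of a permutation invariant critic can be sketched as below. This is a minimal NumPy illustration under stated assumptions (two graph convolution layers of the form sketched for Equation 8, ReLU activations, max pooling, a linear readout, and a complete graph). The names `init_params` and `pic_critic`, the initialization, and all sizes are hypothetical and do not reproduce the released implementation.

```python
import numpy as np


def gcn_layer(h, A_adj, W_self, W_other):
    """One graph convolution layer with ReLU (assumed form of Equation 8)."""
    return np.maximum((A_adj @ h @ W_other) / h.shape[0] + h @ W_self, 0.0)


def init_params(k_in, hidden, rng):
    """Random weights for two GCN layers and a linear readout (hypothetical init)."""
    dims = [k_in, hidden, hidden]
    layers = [(0.1 * rng.normal(size=(dims[i], dims[i + 1])),
               0.1 * rng.normal(size=(dims[i], dims[i + 1]))) for i in range(2)]
    return {"layers": layers, "w_out": rng.normal(size=hidden), "b_out": 0.0}


def pic_critic(obs, act, params):
    """obs is (N, K_o), act is (N, K_a); returns a scalar value estimate."""
    X = np.concatenate([obs, act], axis=1)       # one row per agent
    n = X.shape[0]
    A_adj = np.ones((n, n)) - np.eye(n)          # complete graph, zeros on the diagonal
    h = X
    for W_self, W_other in params["layers"]:
        h = gcn_layer(h, A_adj, W_self, W_other)
    v = h.max(axis=0)                            # max pooling over agents
    return float(v @ params["w_out"] + params["b_out"])


rng = np.random.default_rng(0)
K_o, K_a = 4, 2
params = init_params(K_o + K_a, hidden=8, rng=rng)
obs, act = rng.normal(size=(5, K_o)), rng.normal(size=(5, K_a))
perm = rng.permutation(5)

q_original = pic_critic(obs, act, params)
q_permuted = pic_critic(obs[perm], act[perm], params)
print(np.isclose(q_original, q_permuted))

# The same parameters also accept a different number of agents.
q_nine_agents = pic_critic(rng.normal(size=(9, K_o)), rng.normal(size=(9, K_a)), params)
print(q_nine_agents)
```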

Heterogeneous setting. Consider $N$ cooperative agents, which are divided into multiple groups. In this heterogeneous setting, agents in different groups have different characteristics, e.g., size and speed, or play different roles in the task. In a heterogeneous setting, a permutation invariant critic is not directly applicable, because the relation between two heterogeneous agents differs from the relation between two homogeneous agents. For instance, the interaction between two fast-moving agents differs from the interaction between a fast-moving and a slow-moving agent. However, in the aforementioned critic $Q$, relations between all agents are modeled equivalently.

To address this concern, for the heterogeneous setting, we propose to add node attributes to the PIC. With node attributes, the PIC can distinguish agents from different groups. Specifically, for group $j$, we construct a group attribute $g_j \in \mathbb{R}^{K_g}$, where $K_g$ denotes the dimension of the group attribute. Let $x_i \in \mathbb{R}^{K_o + K_a}$ denote the input representation of agent $i$, i.e., the $i$-th row of $X$. We obtain the augmented representation via $\tilde{x}_i = [x_i, g_{c(i)}]$, where $c(i)$ denotes the group index of agent $i$. We perform the augmentation for each agent to obtain the augmented representation $\tilde{X} \in \mathbb{R}^{N \times (K_o + K_a + K_g)}$. Using Equation 10 and setting $h^{(0)} = \tilde{X}$ and $K^{(0)} = K_o + K_a + K_g$ results in a PIC that can handle environments with heterogeneous agents.
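The augmentation step amounts to looking up each agent's group attribute and appending it to that agent's feature row. A minimal sketch follows. The function name, the two-group assignment, and the attribute values are hypothetical, and the attributes are fixed here for readability whereas a trained model would learn them.

```python
import numpy as np


def augment_with_group_attributes(X, group_index, group_attr):
    """Append each agent's group attribute to its row of X.

    X is (N, K), group_index is (N,) with integer group ids,
    and group_attr is (num_groups, K_g).
    """
    return np.concatenate([X, group_attr[group_index]], axis=1)


X = np.zeros((4, 3))                              # 4 agents, 3 features each
group_index = np.array([0, 0, 1, 1])              # two agents per group
group_attr = np.array([[1.0, 0.0], [0.0, 1.0]])   # K_g = 2, hypothetical values

X_aug = augment_with_group_attributes(X, group_index, group_attr)
print(X_aug.shape)  # (4, 5)
```

Because the attribute travels with the agent's row, permuting the agents permutes the augmented rows in the same way, so the critic stays invariant to agent order while still seeing which group each agent belongs to.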

4.2 Improved MPE Environment

The multiple particle environment (MPE) [maddpg, Mordatch17] is a multi-agent environment which contains a variety of tasks and provides a challenging open source platform for our community to evaluate new algorithms. However, the MPE targets only settings with a small number of agents. Specifically, we found it challenging to train more than 30 agents as it takes more than 100 hours. To scale the MPE to more agents, we improve the implementation of the MPE. More specifically, we develop vectorized versions for many of the computations, such as computing the forces between agents and detecting collisions between agents. Moreover, for tasks with global rewards, instead of computing rewards for each agent, we only compute the reward once and send it to all agents. With this improved MPE, we can train up to 200 agents within one day.
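The kind of vectorization described above can be illustrated with pairwise collision detection. The sketch compares a nested Python loop with a broadcast NumPy version. It assumes 2D positions and a single shared agent radius, and it is a hypothetical illustration, not code taken from the MPE.

```python
import numpy as np


def collisions_loop(pos, radius):
    """Pairwise collision flags with explicit Python loops."""
    n = len(pos)
    hit = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(n):
            if i != j:
                dist = np.sqrt(((pos[i] - pos[j]) ** 2).sum())
                hit[i, j] = dist < 2 * radius
    return hit


def collisions_vectorized(pos, radius):
    """The same flags computed for all pairs at once via broadcasting."""
    diff = pos[:, None, :] - pos[None, :, :]      # (n, n, 2) pairwise offsets
    dist = np.sqrt((diff ** 2).sum(axis=-1))      # (n, n) pairwise distances
    hit = dist < 2 * radius
    np.fill_diagonal(hit, False)                  # an agent never collides with itself
    return hit


rng = np.random.default_rng(0)
pos = rng.uniform(-1.0, 1.0, size=(50, 2))        # hypothetical agent positions
same = np.array_equal(collisions_loop(pos, 0.1), collisions_vectorized(pos, 0.1))
print(same)
```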

5 Experiments

In this section, we first introduce tasks in the multiple particle environment (MPE) [maddpg, Mordatch17]. We then present the details of the experimental setup, evaluation protocol, and our results.

                        MLP Critic              MLP Critic +            Ours: Permutation
                                                Data Augmentation       Invariant Critic
# of agents             final      absolute     final      absolute     final      absolute

Cooperative Navigation
N=3                     -362.73    -362.71      -361.76    -361.55      -355.99    -355.74
N=6                     -3943.2    -3933.3      -4025.4    -4016.2      -3383.2    -3381.8
N=15                    -6489.7    -6394.3      -6280.0    -6222.4      -1999.1    -1977.6
N=30                    -20722     -20583       -21205     -20840       -11363     -11294
N=100                   -128100    -128086      -128024    -128013      -71495     -71074
N=200                   -502349    -502348      -509963    -507457      -436215    -433846

Prey & Predator
N=3                     38.52      40.91        43.78      44.99        65.16      67.69
N=6                     26.15      30.34        -24.84     -17.49       176.70     184.22
N=15                    3982.80    4416.23      4198.41    4401.64      10139      10239
N=30                    -377.19    -386.49      -93.86     -75.59       6662.1     6745.6
N=100                   19894      21114        28741      29347        99391      100812

Cooperative Push
N=3                     -171.26    -170.20      -171.96    -171.74      -155.17    -155.10
N=6                     -561.73    -542.29      -672.80    -672.27      -401.79    -395.70
N=15                    -2538.3    -2536.0      -2645.2    -2610.1      -2231.1    -2225.7
N=30                    -3499.7    -3465.9      -3761.5    -3688.3      -3117.1    -3094.3

Heterogeneous Navigation
N=4                     -100.74    -100.35      -98.96     -98.67       -83.84     -83.54
N=8                     -398.31    -397.66      -683.24    -684.39      -398.31    -397.66
N=16                    -3410.0    -3405.9      -3479.0    -3470.8      -1825.8    -1820.4
N=30                    -12944     -12934       -12812     -12809       -6293      -6270
N=100                   -121563    -121472      -130028    -129436      -94324     -93996

Table 1: Average rewards of our approach and two MADDPG baselines. ‘final’ represents the final metric [Colas18], which is the average reward over the last 10,000 evaluation episodes, i.e., 1,000 episodes for each of the last ten policies during training. ‘absolute’ represents the absolute metric [Colas18], which is the best policy’s average reward over 1,000 evaluation episodes.

Environment. We evaluate the proposed approach on an improved version of MPE [maddpg, Mordatch17]. We consider the following four tasks:


  • Cooperative navigation: agents move cooperatively to cover landmarks in the environment. The reward encourages the agents to get close to landmarks.

  • Prey and predator: slower predators work together to chase fast-moving prey. The predators get a positive reward when colliding with prey. Prey are environment controlled.

  • Cooperative push: agents work together to push a large ball to a landmark. The agents get rewarded when the large ball approaches the landmark.

  • Heterogeneous navigation: small and fast agents and big and slow agents work cooperatively to cover landmarks. The reward encourages the agents to get close to landmarks. If a small agent collides with a big agent, a large negative reward is received.

Figure 2: Comparison of average episode reward.
Figure 3: Comparison of average Q loss (Equation 6).

Note that cooperative navigation, prey and predator, and cooperative push are environments with homogeneous agents. In each task, the reward is global for cooperative agents, i.e., cooperative agents always receive the same reward in each timestep.

Experimental Setup. We use MADDPG [maddpg] with a classic MLP critic as the baseline. We implement MADDPG and the proposed approach in PyTorch [pytorch]. To ensure the correctness of our implementation, we compare it with the official MADDPG code [maddpg_code] on MPE. Our implementation reproduces the results of the official code. Please see Table A1 in the supplementary for the results. Following MADDPG [maddpg], the actor policy is parameterized by a two-layer MLP with 128 hidden units per layer and ReLU activation functions. The MLP critic of the MADDPG baseline has the same architecture as the actor. Our permutation invariant critic (PIC) is a two-layer graph convolution net (GCN) with 128 hidden units per layer and max pooling at the top. The activation function for the GCN is also a ReLU. Following MADDPG, the Adam optimizer is used. The actor, the MLP critic, and our permutation invariant critic are trained with a learning rate that is linearly decreased to zero at the end of training. Agents are trained for the same number of episodes in all tasks, with one of two fixed episode lengths. The size of the replay buffer is one million, and the batch size and the discount factor $\gamma$ are kept fixed. In an eight-agent environment, the MLP critic and the PIC have around 44k and 40k trainable parameters, respectively. In a 100-agent environment, the MLP critic has 413k trainable parameters while the PIC has only 46k parameters.

Evaluation Protocol. To ensure a fair and rigorous evaluation, we follow the strict evaluation protocols suggested by Colas18 and Henderson17. For each experiment, we report final metrics and absolute metrics. The final metric [Colas18] is the average reward over the last 10,000 evaluation episodes, i.e., 1,000 episodes for each of the last ten policies during training. The absolute metric [Colas18] is the best policy’s average reward over 1,000 evaluation episodes. To analyze the significance of the reported improvement over the baseline, we perform a two-sample t-test and bootstrapped estimation of the confidence interval for the mean reward difference obtained by the baseline and our approach. We use the SciPy [scipy] implementation for the t-test, and the Facebook Bootstrapped implementation for confidence interval estimation. All experiments are repeated for five runs with different random seeds.
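The two metrics can be written down compactly. The sketch below assumes the evaluation rewards are stored as a matrix with one row per saved policy and one column per evaluation episode. The function names and the tiny reward matrix are hypothetical and only serve to show the computation.

```python
import numpy as np


def final_metric(policy_rewards, last_k=10):
    """Mean reward over all evaluation episodes of the last `last_k` policies."""
    return float(np.mean(policy_rewards[-last_k:]))


def absolute_metric(policy_rewards):
    """Mean evaluation reward of the best policy."""
    return float(np.max(np.mean(policy_rewards, axis=1)))


# Hypothetical rewards: 3 saved policies, 2 evaluation episodes each.
policy_rewards = np.array([[0.0, 2.0],
                           [4.0, 6.0],
                           [1.0, 1.0]])

print(final_metric(policy_rewards, last_k=2))   # mean of [4, 6, 1, 1] = 3.0
print(absolute_metric(policy_rewards))          # best per-policy mean = 5.0
```

With the protocol described above, `policy_rewards` would hold 1,000 episodes per policy and `last_k` would be ten.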

                    t-test                  Bootstrap C.I.
                    t-stat. (p-value)       mean (C.I.)
N=3      abs.       1.42 (1.9e-1)           5.81 (-1.47, 13.09)
         final      1.41 (1.9e-1)           5.76 (-1.65, 13.18)
N=6      abs.       9.46 (3.2e-5)           634.4 (487.9, 757.1)
         final      9.34 (4.1e-5)           642.2 (496.1, 762.3)
N=15     abs.       21.04 (2.5e-5)          4244 (3819, 4575)
         final      20.6 (2.8e-5)           4280 (3854, 4616)
N=30     abs.       3.17 (2.2e-2)           9546 (6719, 12639)
         final      3.10 (2.1e-2)           9841 (6692, 13201)
N=100    abs.       17.8 (2.9e-3)           56939 (51842, 62764)
         final      17.3 (3.0e-3)           56529 (51153, 62343)
N=200    abs.       2.72 (4.9e-2)           73611 (25930, 116198)
         final      2.66 (4.8e-2)           73747 (26557, 116457)

Table 2: T-test and bootstrap confidence interval of mean difference between MLP critic and our permutation invariant critic on cooperative navigation.

Figure 4: Top: Average critic loss ratio (MLP / Ours). Bottom: Training time comparison.

Results. We compare the proposed permutation invariant critic (PIC) with an MLP critic and an MLP critic with data augmentation. Data augmentation shuffles the order of agents’ observations and actions when training the MLP critic, which is considered to be a simple way to alleviate the ordering issue. The results are summarized in Table 1, where $N$ denotes the number of agents, and ‘final’ and ‘absolute’ are the final metric and absolute metric, respectively. We observe that data augmentation does not boost the performance of an MLP critic much. In some cases, it even deteriorates the results. In contrast, as shown in Table 1, a permutation invariant critic outperforms the MLP critic and the MLP critic with data augmentation by about 15% to 50% in all tasks.
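The data augmentation baseline can be sketched as a single shuffling step applied to each training sample. The sketch assumes observations and actions are stored with one row per agent and that the same random permutation is applied to both so that each agent's observation stays paired with its action. The function name and the toy data are hypothetical.

```python
import numpy as np


def shuffle_agents(obs, act, rng):
    """Apply one random agent permutation to both observations and actions."""
    perm = rng.permutation(obs.shape[0])
    return obs[perm], act[perm]


rng = np.random.default_rng(0)
obs = np.arange(5, dtype=float).reshape(5, 1)   # agent i observes the value i
act = 10.0 * obs                                # agent i acts with the value 10 * i

obs_shuffled, act_shuffled = shuffle_agents(obs, act, rng)
# The MLP critic would then be trained on the flattened, shuffled input.
mlp_input = np.concatenate([obs_shuffled.reshape(-1), act_shuffled.reshape(-1)])
print(mlp_input.shape)  # (10,)
```

The MLP critic still has to learn from data that all orderings share one value, whereas a permutation invariant critic has this property by construction.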

The training curves are given in Figure 2 and Figure 3. As shown in Figure 3, the loss (Equation 6) of the permutation invariant critic is lower than that of MLP critics. Please see the supplementary material for more training curves. Figure 4 (top) shows the ratio of the MLP critic’s average loss to the permutation invariant critic’s loss on cooperative navigation environments. We observed that the ratio grows when the number of agents increases. This implies that a permutation invariant approach gives much more accurate value estimation than the MLP critic, particularly when the number of agents is large.

To confirm that the performance gain of our permutation invariant critic is significant, we report the two-sample t-test results and the bootstrapped confidence interval on the mean difference of our approach and the baseline MADDPG with MLP critic. For the two-sample t-test, the t-statistic and the p-value are reported. The difference is considered significant when the p-value falls below the chosen significance threshold. Table 2 summarizes the analysis on cooperative navigation environments. As shown in Table 2, the p-value is much smaller than this threshold, which suggests the improvement is significant. In addition, positive confidence intervals and means suggest that our approach achieves higher rewards than the baseline. Please see the supplementary material for additional analysis.
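The two statistics can be sketched with NumPy alone. This is a hand-written illustration and not the SciPy or Bootstrapped code used for the reported numbers. It assumes the equal-variance form of the two-sample t-statistic and a percentile bootstrap, and the defaults (`num_samples`, `alpha`, `seed`) as well as the reward values are hypothetical. The p-value is omitted because it requires the t-distribution.

```python
import numpy as np


def two_sample_t_statistic(a, b):
    """Equal-variance two-sample t-statistic for the difference of means."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n_a, n_b = a.size, b.size
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return float((a.mean() - b.mean()) / np.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b)))


def bootstrap_ci_mean_diff(a, b, num_samples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(a) - mean(b)."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    diffs = np.empty(num_samples)
    for s in range(num_samples):
        # Resample each group with replacement and record the mean difference.
        diffs[s] = rng.choice(a, size=a.size).mean() - rng.choice(b, size=b.size).mean()
    lo, hi = np.percentile(diffs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lo), float(hi)


# Hypothetical per-run rewards for two methods (five runs each).
pic_rewards = [10.0, 11.0, 12.0, 13.0, 14.0]
mlp_rewards = [0.0, 1.0, 2.0, 3.0, 4.0]

t_stat = two_sample_t_statistic(pic_rewards, mlp_rewards)
ci_low, ci_high = bootstrap_ci_mean_diff(pic_rewards, mlp_rewards)
print(t_stat, (ci_low, ci_high))
```

A confidence interval that lies entirely above zero indicates that the first method's mean reward is higher than the second's.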

In addition to fully connected graphs we also tested $k$-nearest neighbor graphs, i.e., each node is connected only to its $k$ nearest neighbors. The distance between two nodes is the physical distance between the corresponding agents in the environment. We found that using fully connected graphs achieves better results than using $k$-nearest neighbor graphs. We compared absolute rewards for several values of $k$, including the fully connected graph, on cooperative navigation, prey and predator, cooperative push, and heterogeneous navigation.
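A nearest-neighbor adjacency matrix of this kind can be built from agent positions as sketched below. The sketch assumes Euclidean distance between 2D positions and a directed graph in which row $i$ marks the $k$ agents closest to agent $i$. The function name and the positions are hypothetical.

```python
import numpy as np


def knn_adjacency(pos, k):
    """Row i has ones at the k agents closest to agent i, zeros elsewhere."""
    n = pos.shape[0]
    dist = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)               # an agent is not its own neighbor
    nearest = np.argsort(dist, axis=1)[:, :k]    # indices of the k closest agents
    A = np.zeros((n, n))
    A[np.arange(n)[:, None], nearest] = 1.0
    return A


# Hypothetical positions of four agents on a line.
pos = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [7.0, 0.0]])
A_one = knn_adjacency(pos, k=1)
A_full = knn_adjacency(pos, k=3)    # k = N - 1 recovers the complete graph
print(A_one)
```

Note that the resulting matrix is generally not symmetric, since agent $i$ being among the nearest neighbors of agent $j$ does not imply the reverse.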

Figure 4 (bottom) compares the training time of MADDPG on the original MPE and our improved MPE. MADDPG is trained with a PIC. As shown in Figure 4 (bottom), training 30 agents in the original MPE environment takes more than 100 hours. In contrast, with the improved MPE environment, we can train 30 agents within five hours, and scale to 200 agents within a day of training.

6 Conclusion

We propose and study permutation invariant critics to estimate a consistent value for all permutations of the agents’ order in multi-agent reinforcement learning. Empirically, we demonstrate that a permutation invariant critic outperforms classical deep nets on a variety of multi-agent environments, both homogeneous and heterogeneous. Permutation invariant critics lead to better sample complexity and permit scaling learning to environments with a large number of agents. We think that the permutation invariance property is important for deep multi-agent reinforcement learning. Going forward we will study other designs to scale to an even larger number of agents.

Acknowledgements: This work is supported in part by NSF under Grant 1718221 and MRI 1725729, UIUC, Samsung, 3M, Cisco Systems Inc. (Gift Award CG 1377144), Adobe, and a Google PhD Fellowship to RY. We thank NVIDIA for providing GPUs used for this work and Cisco for access to the Arcetri cluster.

References

Appendix A MADDPG Baseline

We implement the MADDPG baseline in PyTorch. To ensure we implement MADDPG correctly, we compare the performance of our implementation with MADDPG’s official code on the MPE. As shown in Table A1, our implementation reproduces the results of MADDPG in all environments.

                        Cooperative Navigation    Prey and Predator       Push
                        good                      adversary   good        adversary   good
MADDPG official code    -379.57                   2.91        5.48        -7.05       -1.67
Our implementation      -367.75                   5.97        14.79       -6.54       -1.81

Table A1: Comparison of the official MADDPG code [maddpg_code] and our MADDPG implementation on MPE.

Appendix B Environment Details and Group Embedding

In this section, we first provide details of observation and action space in each environment we considered in our experiments. Subsequently, we discuss a PIC’s group embedding.

In all four environments, the action dimension is five. One dimension is no-op. The other four dimensions represent the left, right, forward, and backward force applied to the particle agent. An agent’s observation always contains its location and velocity. Depending on the environment, the observation may contain the relative location and velocity of neighboring agents and landmarks. In our experiments, the number of visible neighbors in an agent’s observation is equal to or less than ten because we empirically found that a larger number of visible neighbors does not boost the performance of MADDPG but rather slows down training. Note that the number of visible neighbors in an agent’s observation is different from the $k$-nearest neighbor graph discussed in Section 5. The $k$ in the $k$-nearest neighbor graph refers to the number of agent observations and actions which are used as input to the centralized critic, while the number of visible neighbors in an agent’s observation is a characteristic of an environment. The details of the observation representation for each environment are as follows:

  • Cooperative navigation: An agent observes its location and velocity, and the relative locations of the nearest landmarks and agents. The number of nearest neighbors observed is 2, 5, 5, 5, 5, 5 for environments with 3, 6, 15, 30, 100, 200 agents, respectively. As a result, the observation dimension is 14, 26, 26, 26, 26, 26, respectively.

  • Prey and predator: A predator observes its location and velocity, the relative locations of the nearest landmarks and fellow predators, and the relative locations and velocities of the nearest prey. The numbers of nearest neighbors observed are (2, 1), (3, 2), (5, 5), (5, 5), (5, 5) for environments with 3, 6, 15, 30, 100 agents, respectively. As a result, the observation dimension is 16, 28, 34, 34, 34, respectively.

  • Cooperative push: An agent observes its location and velocity, the relative locations of the target landmark and the large ball, and the relative locations of the nearest agents. The number of nearest agents observed is 2, 5, 10, 10 for environments with 3, 6, 15, 30 agents, respectively. As a result, the observation dimension is 12, 18, 28, 28, respectively.

  • Heterogeneous navigation: An agent observes its location and velocity, and the relative locations of the nearest landmarks and agents. The number of nearest neighbors observed is 2, 5, 5, 5, 5 for environments with 3, 6, 15, 30, 100 agents, respectively. Consequently, the observation dimension is 14, 26, 26, 26, 26, respectively.
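To make the five-dimensional action described before the list concrete, the following minimal sketch collapses such an action into a two-dimensional force. The index order (no-op, left, right, forward, backward), the sign convention, and the name `net_force` are assumptions made for illustration; the text only states which quantities the five dimensions represent.

```python
from typing import Sequence, Tuple

# Assumed index layout (not specified in the text): the order follows the
# sentence describing the action, i.e. [no-op, left, right, forward, backward].
NO_OP, LEFT, RIGHT, FORWARD, BACKWARD = range(5)


def net_force(action: Sequence[float]) -> Tuple[float, float]:
    """Collapse a five-dimensional action into a 2D force (x, y).

    Assumed sign convention: right and forward are the positive directions.
    """
    if len(action) != 5:
        raise ValueError("expected a five-dimensional action")
    # Opposing directions cancel; the no-op entry contributes no force.
    force_x = action[RIGHT] - action[LEFT]
    force_y = action[FORWARD] - action[BACKWARD]
    return (force_x, force_y)
```

Under these assumptions, an action that puts all of its mass on the no-op dimension yields a zero force, and equal left and right components cancel.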

In the heterogeneous navigation environment, there are two groups of agents with different characteristics. For the PIC, the group embedding of each group in the environment is a two-dimensional vector. We train the embeddings along with the network parameters. Each group embedding is randomly initialized from a normal distribution.
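As an illustration of the group embedding described above, the sketch below draws one two-dimensional vector per group from a normal distribution and appends it to an agent's observation-action vector. The mean and standard deviation, the concatenation step, and the names `init_group_embeddings` and `agent_feature` are assumptions made for this example; the text states only that the embedding is two-dimensional, normally initialized, and trained with the network.

```python
import random
from typing import Dict, List, Sequence

EMBED_DIM = 2   # two-dimensional group embedding, as stated in the text
NUM_GROUPS = 2  # two groups in the heterogeneous navigation environment


def init_group_embeddings(
    seed: int = 0, mean: float = 0.0, std: float = 1.0
) -> Dict[int, List[float]]:
    """Draw one embedding per group from a normal distribution.

    mean and std are placeholder values: the distribution's parameters
    are not given in the text.
    """
    rng = random.Random(seed)
    return {
        group: [rng.gauss(mean, std) for _ in range(EMBED_DIM)]
        for group in range(NUM_GROUPS)
    }


def agent_feature(
    obs: Sequence[float],
    action: Sequence[float],
    group_id: int,
    embeddings: Dict[int, List[float]],
) -> List[float]:
    """Assumed usage: append the agent's group embedding to its
    observation-action vector before it is passed to the critic."""
    return list(obs) + list(action) + embeddings[group_id]
```

With the 14-dimensional observation of the three-agent heterogeneous navigation setting and a five-dimensional action, this gives a 21-dimensional per-agent feature. The sketch covers initialization only. In a PyTorch implementation, a trainable table of this shape could be held in `torch.nn.Embedding(2, 2)` so that it is updated together with the critic; whether the authors did so is not stated.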

Appendix C T-test and Training Curves

The two-sample t-tests and the confidence intervals of the mean difference between the PIC and the MLP critic are summarized in Table A2, Table A3, Table A4, and Table A5. p-values below the significance threshold and positive confidence intervals indicate that the PIC’s improvement over the MLP critic is significant. The training curves for MADDPG with the MLP critic and with our PIC are shown in Figure A1 and Figure A2. The PIC outperforms the MLP critic in all environment settings.
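The two statistics named above can be sketched as follows, using only the Python standard library. The unequal-variance (Welch) form of the t statistic, the percentile bootstrap, the number of resamples, the confidence level, and both function names are assumptions made for illustration, since the text does not specify which variants were used.

```python
import math
import random
import statistics
from typing import List, Sequence, Tuple


def welch_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample t statistic without assuming equal variances (Welch)."""
    var_a = statistics.variance(a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(b)
    standard_error = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


def bootstrap_mean_difference_ci(
    a: Sequence[float],
    b: Sequence[float],
    num_resamples: int = 10000,
    confidence: float = 0.95,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for mean(a) - mean(b)."""
    rng = random.Random(seed)
    diffs: List[float] = []
    for _ in range(num_resamples):
        # Resample each group independently, with replacement.
        resample_a = rng.choices(a, k=len(a))
        resample_b = rng.choices(b, k=len(b))
        diffs.append(statistics.mean(resample_a) - statistics.mean(resample_b))
    diffs.sort()
    tail = (1.0 - confidence) / 2.0
    lower = diffs[int(tail * num_resamples)]
    upper = diffs[int((1.0 - tail) * num_resamples) - 1]
    return (lower, upper)
```

Here `a` would hold the per-seed rewards of one method and `b` those of the other. An interval that lies entirely above zero corresponds to the "positive confidence interval" criterion in the text. The p-value also needs the CDF of the t distribution, which the standard library does not provide; `scipy.stats.ttest_ind(a, b, equal_var=False)` returns both the statistic and the p-value.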

| N | | t-test: t-stat. (p-value) | Bootstrap C.I.: mean (C.I.) |
|---|---|---|---|
| 3 | abs. | 1.42 (1.9e-1) | 5.81 (-1.47, 13.09) |
| 3 | final | 1.41 (1.9e-1) | 5.76 (-1.65, 13.18) |
| 6 | abs. | 9.46 (3.2e-5) | 634.4 (487.9, 757.1) |
| 6 | final | 9.34 (4.1e-5) | 642.2 (496.1, 762.3) |
| 15 | abs. | 21.04 (2.5e-5) | 4244 (3819, 4575) |
| 15 | final | 20.6 (2.8e-5) | 4280 (3854, 4616) |
| 30 | abs. | 3.17 (2.2e-2) | 9546 (6719, 12639) |
| 30 | final | 3.10 (2.1e-2) | 9841 (6692, 13201) |
| 100 | abs. | 17.8 (2.9e-3) | 56939 (51842, 62764) |
| 100 | final | 17.3 (3.0e-3) | 56529 (51153, 62343) |
| 200 | abs. | 2.72 (4.9e-2) | 73611 (25930, 116198) |
| 200 | final | 2.66 (4.8e-2) | 73747 (26557, 116457) |

Table A2: Ours (PIC) vs. MLP critic in the cooperative navigation environment.

| N | | t-test: t-stat. (p-value) | Bootstrap C.I.: mean (C.I.) |
|---|---|---|---|
| 3 | abs. | 6.61 (1.6e-4) | 22.7 (15.2, 29.7) |
| 3 | final | 5.38 (8.0e-4) | 21.3 (13.6, 29.6) |
| 6 | abs. | 3.14 (3.4e-2) | 201 (88.2, 316) |
| 6 | final | 3.11 (3.5e-2) | 201 (86.4, 318) |
| 15 | abs. | 20.34 (1.7e-6) | 5838 (5468, 6248) |
| 15 | final | 20.29 (5.7e-7) | 5941 (5609, 6303) |
| 30 | abs. | 5.91 (4.0e-3) | 6821 (5176, 9006) |
| 30 | final | 5.82 (4.3e-3) | 6755 (5106, 8925) |
| 100 | abs. | 6.92 (7.3e-2) | 92753 (82974, 102532) |
| 100 | final | 7.44 (7.1e-2) | 89980 (80796, 99163) |

Table A3: Ours (PIC) vs. MLP critic in the prey and predator environment.

| N | | t-test: t-stat. (p-value) | Bootstrap C.I.: mean (C.I.) |
|---|---|---|---|
| 3 | abs. | 4.03 (1.4e-2) | 16.6 (10.3, 24.4) |
| 3 | final | 4.04 (1.4e-2) | 16.7 (10.4, 24.6) |
| 6 | abs. | 66.4 (1.0e-10) | 276.5 (269, 287) |
| 6 | final | 67.2 (1.3e-11) | 271.0 (263, 281) |
| 15 | abs. | 4.90 (6.3e-3) | 384 (250, 507) |
| 15 | final | 5.12 (5.2e-3) | 414 (280, 546) |
| 30 | abs. | 4.41 (7.1e-3) | 593 (400, 778.316) |
| 30 | final | 4.64 (5.4e-3) | 644 (419, 844) |

Table A4: Ours (PIC) vs. MLP critic in the cooperative push environment.

| N | | t-test: t-stat. (p-value) | Bootstrap C.I.: mean (C.I.) |
|---|---|---|---|
| 3 | abs. | 7.07 (1.7e-3) | 15.1 (11.8, 18.4) |
| 3 | final | 6.76 (2.2e-3) | 15.1 (11.6, 18.6) |
| 6 | abs. | 19.7 (2.1e-5) | 285 (259, 317) |
| 6 | final | 19.6 (2.2e-5) | 286 (260, 318) |
| 15 | abs. | 62.2 (3.0e-9) | 1650 (1619, 1678) |
| 15 | final | 61.9 (6.5e-9) | 1653 (1619, 1684) |
| 30 | abs. | 74.1 (3.7e-9) | 6538 (6380, 6725) |
| 30 | final | 76.0 (2.6e-9) | 6518 (6362, 6695) |
| 100 | abs. | 5.82 (1.9e-2) | 36327 (27272, 50292) |
| 100 | final | 5.65 (1.8e-2) | 36618 (26339, 51579) |

Table A5: Ours (PIC) vs. MLP critic in the heterogeneous navigation environment.
Figure A1: Average episode reward comparison. Our permutation invariant critic (PIC) outperforms the MLP critic in all environment settings.
Figure A2: Average episode reward comparison. Our permutation invariant critic (PIC) outperforms the MLP critic in all environment settings.