Supplementary Material
Software and demos supplementing this paper: https://github.com/BaoqianWang/IROS22_DARL1N
I Introduction
Recent years have witnessed tremendous success of reinforcement learning (RL) in challenging decision-making problems, such as robot control and video games. Research efforts are currently focused on multi-agent settings, such as cooperative robot navigation, multi-player games, and traffic management. A direct application of RL techniques in a multi-agent setting, by simultaneously running a single-agent algorithm at each agent, exhibits poor performance [14]. This is because, without considering interactions among the agents, the environment becomes non-stationary from the perspective of a single agent.
Multi-agent reinforcement learning (MARL) [4] addresses the aforementioned challenge by considering all agent dynamics collectively when learning the policy of an individual agent. This is achieved by learning a centralized value or action-value (Q) function that involves the states and actions of all agents. Most effective MARL algorithms, such as multi-agent deep deterministic policy gradient (MADDPG) [14], counterfactual multi-agent (COMA) [7], and multi-actor-attention-critic (MAAC) [9], adopt this strategy. However, learning a joint Q function is challenging because the size of the joint state and action space grows exponentially with the number of agents [15]. A policy obtained by parametrizing the joint Q function directly performs poorly in large-scale settings, as shown in [22, 13].
Recently, MARL algorithms that reduce the representation complexity of the Q function have been shown to significantly improve the quality of the learned policies in large-scale multi-agent settings. Successful methods include value factorization algorithms, such as mean-field MARL [22], evolutionary population curriculum (EPC) [13], and scalable actor-critic (SAC) [15]. While these methods achieve excellent performance, their training time can be exceedingly slow as the number of agents increases, because they require simultaneous state transitions for all agents. This requirement prevents fully distributed training over a computing cluster, because each compute node would need to receive the state transitions of all agents simultaneously.
Our contribution is a distributed MARL training method called Distributed multi-Agent Reinforcement Learning with One-hop Neighbors (DARL1N). Its main advantage over state-of-the-art methods is a fully distributed training procedure, in which each compute node simulates only a very small subset of the agents locally. This is made possible by representing the agent topology as a proximity graph and approximating the Q function over one-hop neighborhoods. When agent interactions are restricted to one-hop neighbors, training the Q function of an agent requires simulating only the agent itself and its potential two-hop neighbors. This enables fully distributed training and greatly accelerates the training of large-scale MARL policies.
II Related Work
State-of-the-art MARL algorithms like MADDPG [14], COMA [7], MAAC [9], and MAPPO [23] learn joint Q/value functions over all agent states and actions. As the number of agents increases, the exponential growth of the joint state and action spaces drastically increases the representation complexity required to model the Q function. Moreover, the need to perform simultaneous state transitions for all agents prevents distributed or parallel training.
Two directions have been explored in the literature to scale up MARL. The first aims to reduce the complexity of the Q function representation by factorizing it into local value functions that depend only on the states and actions of a few agents. The second considers distributed or parallel computing architectures to speed up MARL training.
We first review factorization techniques for the Q and policy functions in MARL. VDN [19] decomposes the joint Q function into a sum of local value functions. QMIX [16] improves on VDN by combining the local value functions monotonically through a mixing neural network, which provides a more expressive Q function. QTRAN [18] further extends VDN and QMIX by factorizing an alternative Q function that has the same optimal actions as the original Q function, without requiring additivity or monotonicity assumptions. The Q function can also be factorized according to a coordination graph specifying the agent interactions. For instance, [11, 8, 10, 24] decompose the global Q function into a set of local value functions whose dependencies are specified by agents connected in a static undirected graph. Dynamic coordination graphs for cooperative MARL were considered in [21]. Böhmer et al. [2] use deep neural networks to approximate the local value functions. Mean-field MARL techniques [22], including mean-field actor-critic (MFAC) and mean-field Q (MFQ), factorize the Q function into a weighted sum of local Q functions, each depending only on one agent's action and a mean value of the actions of its neighboring agents. SAC [15] approximates the Q function of each agent using the states and actions of its hop-limited neighbors. DARL1N adopts SAC's Q and policy factorization restricted to one-hop neighbors. The main difference with respect to SAC is that DARL1N also uses the graph structure to devise a distributed training approach. While training SAC requires simultaneous state transitions, we show that the Q function of an agent can be trained off-policy using state transitions only for the agent itself and its potential two-hop neighbors. The relationship between SAC and DARL1N is discussed in more detail in Sec. VI-A after introducing DARL1N.

While value factorization techniques reduce the representation complexity of the MARL policy and value functions, they still require simultaneous simulation and parameter updates for all agents in a single compute node. Distributed and parallel computing is a promising way to speed up MARL training, but it has rarely been studied in this context; only a few such approaches have been proposed.
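To make the factorization styles reviewed above concrete, here is a minimal sketch (not the implementation of any cited method) of VDN's additive decomposition and the mean-field local Q; `local_q_fn` is a hypothetical per-agent utility function:

```python
import numpy as np

def vdn_joint_q(local_qs):
    """VDN-style factorization: the joint Q value is the plain sum of
    per-agent local utilities."""
    return float(sum(local_qs))

def mean_field_q(local_q_fn, own_action, neighbor_actions):
    """Mean-field-style local Q: depends only on the agent's own action
    and the mean of its neighbors' actions."""
    return local_q_fn(own_action, float(np.mean(neighbor_actions)))
```

Both styles shrink the input of each learned function from the joint action space to a fixed-size summary, which is what makes large teams tractable.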
Multi-agent A3C was explored in [17], where each compute node, running in parallel, performs independent simulation and training for all agents, and asynchronously updates and fetches the parameters stored in the central controller. EPC [13] applies curriculum learning and adopts population-invariant policy and Q functions to support varying numbers of agents in different learning stages. EPC was implemented on a parallel computing architecture consisting of multiple compute nodes. Each compute node simulates all agents in multiple independent environments, and each environment and agent runs in an independent, parallel process. Also of interest is the distributed architecture presented in [5], designed to reduce the communication overhead between compute nodes and the central controller. Although these methods improve training efficiency by leveraging distributed or parallel computing, their training costs remain high when the number of agents is large, as they require each compute node to simulate all agents.
III Background
We consider the MARL problem in which agents learn to optimize their behaviors by interacting with the environment. Denote the state and action of agent by and , respectively, where and are the corresponding state and action spaces. Let and denote the joint state and action of all agents. At time , a joint action applied at state triggers a transition to a new state according to a conditional probability density function (pdf) . After each transition, each agent receives a reward , determined by the joint state and action according to the function . The objective of each agent is to choose a deterministic policy to maximize the expected cumulative discounted reward, where denotes the joint policy of all agents and is a discount factor. The function is known as the value function of agent associated with the joint policy .
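In standard notation (reconstructed here under assumed symbols, not taken verbatim from the paper: $\mathbf{s}$ for the joint state, $\mathbf{a}$ for the joint action, $r_i$ for agent $i$'s reward, $\gamma$ for the discount factor, and $\boldsymbol{\mu}$ for the joint policy), this objective reads:

```latex
V_i^{\boldsymbol{\mu}}(\mathbf{s}) \;=\;
\mathbb{E}\left[\,\sum_{t=0}^{\infty} \gamma^{t}\,
r_i\big(\mathbf{s}(t), \mathbf{a}(t)\big)
\;\middle|\;
\mathbf{s}(0) = \mathbf{s},\;
\mathbf{a}(t) = \boldsymbol{\mu}\big(\mathbf{s}(t)\big)\right].
```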
An optimal policy for agent can also be obtained by maximizing the action-value (Q) function: and setting , where denotes the actions of all agents except . In the rest of the paper, we omit the time notation for simplicity when there is no risk of confusion.
IV Problem Statement
To develop a distributed MARL algorithm, we impose additional structure on the MARL problem. Assume that all agents share a common state space, i.e., , . Let be a distance metric on the homogeneous state space. We introduce a proximity graph [3] to model the topology of the agent team. A disk proximity graph is defined as a mapping that associates the joint state with an undirected graph such that and . Define the set of one-hop neighbors of agent as . We make the following regularity assumption about the agents' motion.
Assumption 1.
The distance between two consecutive states, and , of agent is bounded, i.e., , for some .
This assumption is justified in many problems of interest where, e.g., due to physical limitations, the agent states can only change by a bounded amount in a single time step. Following this assumption, we define the set of potential neighbors of agent at time as , which captures the set of agents that may become one-hop neighbors of agent at time .
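The two neighbor sets can be sketched as follows for point agents in Euclidean space; `d` (the neighbor radius) and `eps` (the per-step motion bound of Assumption 1) are assumed parameter names:

```python
import numpy as np

def one_hop_neighbors(states, i, d):
    """One-hop neighbors of agent i: all agents within distance d
    (each agent is its own neighbor)."""
    dists = np.linalg.norm(states - states[i], axis=1)
    return set(np.flatnonzero(dists <= d))

def potential_neighbors(states, i, d, eps):
    """Potential neighbors of agent i: agents that could become one-hop
    neighbors after one step. Under Assumption 1 both agents may move at
    most eps per step, so the reach radius grows to d + 2*eps."""
    dists = np.linalg.norm(states - states[i], axis=1)
    return set(np.flatnonzero(dists <= d + 2 * eps))
```

For example, with `d = 1.5` and `eps = 0.5`, an agent 2.2 units away is not a one-hop neighbor but is still a potential neighbor.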
Denote the joint state and action of the one-hop neighbors of agent by and , respectively, where . Our key idea is to let agent 's policy, , depend only on the one-hop neighbor states . The intuition is that agents that are far away from agent at time have little impact on its current action . To support this policy model, we make two additional assumptions on the problem structure. To emphasize that the output of a function is affected only by a subset of the input dimensions, we use the notation for and .
Assumption 2.
The reward of agent can be fully specified using its one-hop neighbor states and actions , i.e., , and its absolute value is upper bounded by , for some .
Assumption 2 is satisfied in many multiagent problems where the reward of one agent is determined only by the states and actions of nearby agents. Examples are provided in Sec. VI. Similar assumptions are adopted in [15, 6, 12].
Assumption 3.
The transition model of agent depends only on its action and its one-hop neighbor states , i.e., .
Assumption 3 is common for multiagent networked systems as in [15, 12]. As a result, the joint state transition pdf decomposes as:
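A reconstruction of this decomposition in standard notation (symbols assumed here, e.g., $p_i$ for agent $i$'s local transition pdf, $\mathcal{N}_i$ for its one-hop neighborhood, and $N$ agents in total):

```latex
p\big(\mathbf{s}' \mid \mathbf{s}, \mathbf{a}\big)
\;=\; \prod_{i=1}^{N} p_i\big(s_i' \mid \mathbf{s}_{\mathcal{N}_i},\, a_i\big).
```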
The objective of each agent is to obtain an optimal policy by solving the following problem:
where is the optimal action-value (Q) function introduced in the previous section.
V Distributed MARL with One-Hop Neighbors
In this section, we develop the DARL1N algorithm to solve the MARL problem with proximity-graph structure. DARL1N limits the interactions among agents to one-hop neighbors, significantly reducing the learning complexity. To further speed up training, DARL1N adopts a distributed training framework that exploits local interactions to decompose and distribute the computation load.
V-A One-Hop Neighborhood Value and Policy Factorization
The Q function of each agent associated with the MARL problem over the proximity graph can be written as
where denote the joint states and actions of all agents except the one-hop neighbors of agent . Inspired by the SAC algorithm [15], we approximate the Q function by that depends only on one-hop neighbor states and actions:
where the weights satisfy . The approximation error is given in the following lemma with proof provided in Appendix A.
We then parameterize the approximated Q function and the policy by and , respectively. To handle the varying sizes of and , in the implementation we let the input dimension of be the largest possible dimension of , and apply zero-padding for agents that are not in the one-hop neighborhood of agent . The same procedure is applied to represent . More implementation details are provided in Appendix C-B.

To learn the approximated Q function , instead of the incremental on-policy updates to the Q function used in SAC, we adopt off-policy temporal-difference learning with a replay buffer, similar to MADDPG. The parameters of the approximated Q function are updated by minimizing:
(1) 
where is the replay buffer for agent , which contains information only from , the one-hop neighbors of agent at the current and next time steps, and the one-hop neighbors for . To stabilize training, a target Q function with parameters and a target policy function with parameters are used. The parameters and are updated using Polyak averaging, and , where is a hyperparameter. In contrast to MADDPG, the replay buffer for agent only needs to store its local interactions with nearby agents. Note that is used to calculate . Also, in contrast to SAC, each agent only needs to collect its own training data by simulating local two-hop interactions. This enables the efficient, distributed training framework explained in the next subsection.

Agent 's policy parameters are updated using gradients from the policy gradient theorem [20]:
(2) 
where, again, only data from local interactions is needed.
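The two numeric ingredients of these updates, the one-step TD target behind the loss in (1) and the Polyak averaging of target parameters, can be sketched as follows; the default values of `gamma` and `tau` are illustrative, not the paper's settings:

```python
def td_target(reward, next_q, gamma=0.95):
    """One-step TD target y = r + gamma * Q'(next state, target action),
    i.e., the regression target the critic loss pulls Q toward."""
    return reward + gamma * next_q

def polyak_update(target_params, params, tau=0.01):
    """Soft target update: theta_target <- tau*theta + (1-tau)*theta_target.
    Parameters are represented here as flat lists of scalars for clarity."""
    return [tau * p + (1.0 - tau) * tp for tp, p in zip(target_params, params)]
```

Because the target moves slowly (small `tau`), the regression target in (1) stays nearly stationary between updates, which stabilizes training.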
V-B Distributed Training with Local Interactions
To implement the parameter updates proposed above, agent needs training data from its one-hop neighbors at the current and next time steps, whose dynamics obey the following proposition (see proof in Appendix B).
Proposition 1.
Under Assumption 1, if an agent is not a potential neighbor of agent at time , i.e., , it will not be a one-hop neighbor of agent at time , i.e., .
Proposition 1 allows us to decouple the global interactions among agents and limit message exchanges or observations to one-hop neighbors. It also allows parallel training on a distributed computing architecture, where each compute node only needs to simulate a small subset of the agents. This leads to significant training efficiency gains, as demonstrated in Sec. VI.
To collect training data, at each time step, agent first interacts with its one-hop neighbors to obtain their states and actions and compute its reward . To obtain for all , we first determine agent 's one-hop neighbors at the next time step, . Using Proposition 1, we let each potential neighbor perform a transition to a new state , which is sufficient to determine . Then, we let the potential neighbors of each new neighbor perform transitions to determine and obtain . Fig. 1(a) illustrates the data collection process. At time , agent obtains , , and for . Then, the potential neighbors of agent , , proceed to their next states at time . This is sufficient to determine that and obtain . Finally, we let agent , which belongs to set , perform a transition to determine that and obtain .
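As a sketch under assumed notation (agents as points in a metric space, hypothetical parameters `d` for the neighbor radius and `eps` for the motion bound), a conservative superset of the agents one learner must simulate for a single transition of agent `i` can be computed as:

```python
import numpy as np

def agents_to_simulate(states, i, d, eps):
    """Agents learner i must simulate for one transition of agent i:
    agent i's potential neighbors (enough to determine its next one-hop
    set), plus the potential neighbors of each current one-hop neighbor.
    This is a subset of the two-hop potential neighborhood of agent i,
    never the whole team."""
    def within(j, r):
        return set(np.flatnonzero(np.linalg.norm(states - states[j], axis=1) <= r))
    sim = within(i, d + 2 * eps)       # potential neighbors of agent i
    for j in within(i, d):             # each current one-hop neighbor...
        sim |= within(j, d + 2 * eps)  # ...contributes its potential neighbors
    return sim
```

For a team of many agents, this set stays small whenever the agents are spatially spread out, which is exactly what makes each learner's simulation cheap.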
We now describe a distributed learning framework that exploits the local interactions among the agents to optimize the policy and Q function parameters. Our training framework consists of a central controller and learners, each training a different agent. The central controller stores a copy of all policy and target policy parameters, , , . In each training iteration, the central controller broadcasts the parameters to all learners. Each learner updates its own policy parameters , and returns the updated values to the central controller.
Each learner maintains the parameters and of agent 's approximated Q and target Q functions. In each training iteration, learner uses policies with parameters received from the central controller. Transitions are simulated only for agent , its potential neighbors , and the potential neighbors of each new neighbor, as described above. DARL1N achieves significant computation savings because (i) the Q function parameters and are stored locally at each learner and do not need to be communicated, and (ii) the agent transition simulation occurs only over small groups of agents, distributed among the learners, instead of centralized over all agents at the central controller. The interaction data are stored in the replay buffer . Finally, learner updates and using (1) and (2), respectively, and the target network parameters and via Polyak averaging. Fig. 1(b) illustrates the distributed training procedure. The pseudocode of DARL1N is provided in Alg. 1.
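The controller-learner exchange above can be sketched as one round of the loop; the structure (broadcast a snapshot, each learner returns only its own updated policy) follows the text, while the representation of parameters and learners here is purely illustrative:

```python
def train_iteration(policy_params, learners):
    """One DARL1N iteration, sketched: the controller broadcasts a
    snapshot of all policy parameters; learner i reads the snapshot,
    updates only its own agent's policy, and returns it. Critic (Q)
    parameters stay on the learner and are never communicated.
    Learners run in parallel in practice; the list comprehension
    below stands in for that."""
    broadcast = list(policy_params)                 # controller broadcast
    return [learner(broadcast) for learner in learners]  # gather updates
```

Note that only the small policy parameter vectors cross the network each iteration; the replay buffers, simulation state, and critic parameters are all learner-local.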
VI Experiments
In this section, we conduct experiments to evaluate the performance of DARL1N.
VI-A Experiment Settings
Environments
Benchmarks
We compare DARL1N with three state-of-the-art MARL algorithms: MADDPG [14], MFAC [22], and EPC [13]. All benchmark methods need the states of all agents, through observation or communication, during both training and execution. In contrast, DARL1N only needs the states of one-hop neighbors during execution and of two-hop potential neighbors during training. While the method most closely related to DARL1N is SAC [15], we do not compare against it because DARL1N is a distributed training version of SAC with the same Q function factorization. DARL1N utilizes off-policy training and allows each compute node to simulate state transitions only for one agent and its potential two-hop neighbors. In contrast, SAC is an on-policy approach running on a single compute node without parallel processing, but with the same Q function factorization.
Evaluation Metrics
We evaluate all methods using two criteria: training efficiency and policy quality. To measure training efficiency, we use two metrics: 1) the average training time spent to run a specified number of training iterations and 2) the convergence time. The convergence time is defined as the time at which the variance of the average total training reward over 90 consecutive iterations does not exceed 2% of the absolute mean reward, where the average total training reward is the total reward of all agents averaged over 10 episodes in three training runs with different random seeds. To measure policy quality, we use the convergence reward, which is the average total training reward at the convergence time.

Experiment Configurations
We run our experiments on Amazon EC2 computing clusters [1]. To understand the scalability of each method, for each environment we consider four scenarios with an increasing number of agents. The numbers of agents in the Ising Model and Food Collection environments are set to and , respectively. In the Grassland and Adversarial Battle environments, the numbers of agents are set to . In the experiments, the numbers of adversary agents, grass pellets, and resource units are all set to , and all adversary agents adopt policies trained by MADDPG. More experiment settings, including training parameters, Q function and policy function representation, and neighborhood configurations, are described in Appendix C.
To evaluate the training efficiency, we configure the computing resources used to train each method so that DARL1N utilizes roughly the same or fewer resources, measured by the money spent per hour on the Amazon EC2 computing cluster. In particular, to train DARL1N, Amazon EC2 instance is used in all scenarios for all environments. To train MADDPG and MFAC, instance is used in the first scenario () for Ising Model and in the first two scenarios for Food Collection, Grassland, and Adversarial Battle. In the other scenarios, instance is used. To train EPC, we use instance in all scenarios for Food Collection and in the first three scenarios for Ising Model, Grassland, and Adversarial Battle. The other scenarios adopt instance . To configure the parallel computing architecture in EPC, we set the number of parallel computing instances and the number of independent environments to and , respectively. The configurations of the Amazon EC2 instances used as compute nodes are summarized in Tab. I.
Instances  CPU cores  CPU frequency  Memory  Network  Hourly price 
2  3.4 GHz  5.3 GB  25 Gb  $ 0.108  
12  4 GHz  96 GB  10 Gb  $ 1.116  
24  4 GHz  192 GB  10 Gb  $ 2.232  
48  3.6 GHz  96 GB  12 Gb  $ 2.04  
72  3.6 GHz  144 GB  25 Gb  $ 3.06 
VI-B Experiment Results
Ising Model
Method  Convergence Time (s)  Convergence Reward  
MADDPG  62  263  810  1996  460  819  1280  1831 
MFAC  63  274  851  2003  468  814  1276  1751 
EPC  101  26  51  62  468  831  1278  3321 
EPC Scratch  101  412  993  2995  468  826  1275  2503 
DARL1N  38  102  210  110  465  828  1279  2282 
Tab. II shows the convergence reward and convergence time of the different methods. When the number of agents is small (), all methods achieve roughly the same reward. DARL1N takes the least time to converge, while EPC takes the longest. When the number of agents increases, EPC converges immediately, and the convergence reward it achieves when is much higher than that of the other methods. The reason is that, in the Ising Model, each agent only needs information from its four fixed neighbors, so in EPC the policy obtained in the previous stage can be applied directly to the current stage. The other methods train the agents from scratch without curriculum learning. For illustration, we also show the convergence reward and convergence time achieved by training EPC from scratch without curriculum learning (denoted EPC Scratch in Tab. II). The results show that EPC Scratch converges much more slowly than EPC as the number of agents increases. Note that when the number of agents is 9, EPC and EPC Scratch are the same. Moreover, DARL1N achieves a reward comparable to that of EPC Scratch but converges much faster. Fig. 2 shows the average time taken to train each method for 10 iterations in the different scenarios. DARL1N requires much less time to perform a training iteration than the benchmark methods.
Food Collection
Method  Convergence Time (s)  Convergence Reward  
MADDPG  501  1102  4883  2005  24  24  112  364 
MFAC  512  832  4924  2013  20  23  115  362 
EPC  1314  723  2900  8104  19  11  8  2 
DARL1N  502  480  310  730  14  25  43  61 
Method  Convergence Time (s)  Convergence Reward  
MADDPG  423  6271  2827  1121  21  11  302  612 
MFAC  431  7124  3156  1025  23  9  311  608 
EPC  4883  2006  3324  15221  12  38  105  205 
DARL1N  103  402  1752  5221  18  46  113  210 
The convergence rewards and convergence times in this environment are shown in Tab. III. When the problem scale is small, DARL1N, MADDPG, and MFAC achieve similar policy quality. As the problem scale increases, the performance of MADDPG and MFAC degrades significantly and becomes much worse than that of DARL1N or EPC when and , as also shown in Fig. 3. The convergence reward achieved by DARL1N is comparable to, and sometimes higher than, that achieved by EPC. Moreover, DARL1N converges fastest among all methods in all scenarios.
Fig. 2 shows the average training time for running 30 iterations. Similar to the results in the Ising Model, DARL1N achieves the highest training efficiency, and its training time grows linearly as the number of agents increases. When , EPC takes the longest training time. This is due to EPC's complex policy and Q network architectures, whose input dimensions grow linearly and quadratically, respectively, with the number of agents.
Grassland
Similar to the results in the Food Collection environment, the policy generated by DARL1N is as good as or better than those generated by the benchmark methods, as shown in Tab. IV and Fig. 2, especially when the problem scale is large. DARL1N also converges fastest and takes the shortest time to run a training iteration.
Adversarial Battle
Method  Convergence Time (s)  Convergence Reward  
MADDPG  452  1331  1521  7600  72  211  725  1321 
MFAC  463  1721  1624  6234  73  221  694  1201 
EPC  1512  1432  2041  9210  75  215  405  642 
DARL1N  121  756  1123  3110  71  212  410  682 
In this environment, DARL1N again achieves good performance in terms of policy quality and training efficiency compared to the baseline methods, as shown in Tab. V and Fig. 2. For illustration, the states of a subset of agents trained by the different methods during an episode are shown in Fig. 4. Both DARL1N and EPC agents successfully collect resource units and kill agents from the other team, while MADDPG and MFAC agents fail to do so. To further evaluate performance, we reconsider the last scenario () and train the good agents and adversary agents with two different methods. The trained good agents and adversary agents then compete against each other in the environment. We apply min-max normalization to measure the normalized total reward achieved by the agents on each side in an episode. To reduce uncertainty, we generate 10 episodes and record the mean values and standard deviations. As shown in Fig. 5, DARL1N achieves the best performance, and both DARL1N and EPC significantly outperform MADDPG and MFAC.
VII Conclusion
This paper introduces DARL1N, a scalable MARL algorithm. DARL1N features a novel training scheme that breaks the curse of dimensionality in action-value function approximation by restricting the interactions among agents to one-hop neighborhoods. This reduces the learning complexity and enables fully distributed and parallel training, in which individual compute nodes only simulate interactions among a small subset of agents. To demonstrate the scalability and training efficiency of DARL1N, we conducted a comprehensive evaluation against three state-of-the-art MARL algorithms: MADDPG, MFAC, and EPC. The results show that DARL1N generates equally good or better policies in almost all scenarios, with significantly higher training efficiency than the benchmark methods, especially in large-scale problem settings.
References
[1] Amazon EC2. https://aws.amazon.com/ec2/. Accessed: 2022-01-13.
[2] Deep coordination graphs. In International Conference on Machine Learning, Online, 2020.
[3] Distributed Control of Robotic Networks: A Mathematical Approach to Motion Coordination Algorithms. Princeton University Press, 2009.
[4] Multi-agent reinforcement learning: an overview. Innovations in Multi-agent Systems and Applications, pp. 183-221, 2010.
[5] Communication-efficient policy gradient methods for distributed reinforcement learning. IEEE Transactions on Control of Network Systems, 2021.
[6] Multi-agent deep reinforcement learning for large-scale traffic signal control. IEEE Transactions on Intelligent Transportation Systems 21(3), pp. 1086-1095, 2019.
[7] Counterfactual multi-agent policy gradients. In AAAI Conference on Artificial Intelligence, Louisiana, USA, 2018.
[8] Coordinated reinforcement learning. In International Conference on Machine Learning, Sydney, Australia, 2002.
[9] Actor-attention-critic for multi-agent reinforcement learning. In International Conference on Machine Learning, CA, USA, 2019.
[10] Collaborative multiagent reinforcement learning by payoff propagation. Journal of Machine Learning Research 7, pp. 1789-1828, 2006.
[11] Multiagent reinforcement learning for urban traffic control using coordination graphs. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Antwerp, Belgium, 2008.
[12] Distributed reinforcement learning in multi-agent networked systems. arXiv preprint arXiv:2006.06555, 2020.
[13] Evolutionary population curriculum for scaling multi-agent reinforcement learning. In International Conference on Learning Representations (ICLR), Online, 2020.
[14] Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, CA, USA, 2017.
[15] Scalable reinforcement learning of localized policies for multi-agent networked systems. In Learning for Dynamics and Control, Online, 2020.
[16] QMIX: monotonic value function factorisation for deep multi-agent reinforcement learning. In International Conference on Machine Learning, Stockholm, Sweden, 2018.
[17] Multi-agent actor centralized-critic with communication. Neurocomputing, 2020.
[18] QTRAN: learning to factorize with transformation for cooperative multi-agent reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning, CA, USA, 2019.
[19] Value-decomposition networks for cooperative multi-agent learning based on team reward. In International Conference on Autonomous Agents and MultiAgent Systems, Stockholm, Sweden, 2018.
[20] Reinforcement Learning: An Introduction. MIT Press, 2018.
[21] Context-aware sparse deep coordination graphs. arXiv preprint arXiv:2106.02886, 2021.
[22] Mean field multi-agent reinforcement learning. In International Conference on Machine Learning, 2018.
[23] The surprising effectiveness of PPO in cooperative, multi-agent games. arXiv preprint arXiv:2103.01955, 2021.
[24] Coordinating multi-agent reinforcement learning with limited communication. In International Conference on Autonomous Agents and MultiAgent Systems, MN, USA, 2013.
Appendix
A Proof of Lemma 1
We first prove the following inequality
(3) 
where and . In particular, letting and denote and , respectively, we have:
(4) 
where follows from the fact that are part of both and . In the above equations, we have omitted the subscript of the expectation for simplicity; it should be . Then, we have
∎ 
B Proof of Proposition 1
If agent , then, by the definition of potential neighbors, . By the triangle inequality, , and by Assumption 1, . Therefore, . Furthermore, using the triangle inequality again, we obtain . Since , we have . Therefore, agent will not be a one-hop neighbor of agent at time . ∎
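Under assumed symbols (a reconstruction, not the paper's exact notation: $d(\cdot,\cdot)$ for the metric, $d$ for the neighbor radius, and $\epsilon$ for the motion bound of Assumption 1), the proof's chain of inequalities reads:

```latex
d\big(s_i(t{+}1),\, s_j(t{+}1)\big)
\;\ge\;
d\big(s_i(t),\, s_j(t)\big)
- d\big(s_i(t),\, s_i(t{+}1)\big)
- d\big(s_j(t),\, s_j(t{+}1)\big)
\;>\; (d + 2\epsilon) - \epsilon - \epsilon \;=\; d,
```

so agent $j$ lies outside the neighbor radius of agent $i$ at time $t{+}1$.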
C Experiment Settings
Training Parameters
All environments adopt the same training parameters. In particular, the Adam optimizer is used to update the policy and Q function parameters with a learning rate of 0.01. The parameter in the Polyak averaging update of the target policy and target Q functions is set to . The discount factor is set to . The size of the replay buffer is set to . The parameters are updated after every episodes. The in Alg. 1 of DARL1N is set to times the length of one episode. In the Ising Model and Food Collection environments, the length of each episode is set to 25 in all scenarios. In the Grassland and Adversarial Battle environments, the length of an episode is set to and for the scenarios of and , respectively. The mini-batch size is set to in the Ising Model and in the other environments.
Q Function and Policy Function Representation
In the implementations of DARL1N, MADDPG, and MFAC, we use fully connected neural networks to represent the approximated Q function and policy function. The neural networks have three hidden layers, each with 64 units and ReLU activations. To handle the varying sizes of and in the approximated Q function of DARL1N, we let the input dimension of the approximated Q function be the size of the joint state and action space of the maximum number of agents that can be in , and apply zero-padding for agents that are not in the one-hop neighborhood of agent . In particular, in the Ising Model, the maximum number of one-hop neighbors of an agent is 5, which is fixed. The input dimension of the approximated Q function for agent is then . For the other environments, the maximum number of one-hop neighbors of an agent is the total number of agents. EPC adopts a population-invariant neural network architecture with attention modules to support an arbitrary number of agents in different stages of training the Q function and policy function.

Environment and Neighborhood Configurations
The one-hop neighbors of an agent are defined over the agent's state space using a distance metric, which is prior knowledge about the environment. In the Ising Model, the topology of the agents is fixed, and the one-hop neighbors of an agent are its vertically and horizontally adjacent agents and itself. In the other environments, the Euclidean distance between two agents in the 2D space is used as the distance metric, and the neighbor distance is set to , , , , when , respectively. The bound is determined by the maximum velocity and the time interval between two consecutive time steps, and is set to , , , , when , respectively. The size of the agents' activity space is set to , , , when , respectively.
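The zero-padding scheme used to give the Q network a fixed input dimension (described in the representation paragraph above) can be sketched as follows; the helper name and argument layout are hypothetical:

```python
import numpy as np

def pad_neighborhood(neighbor_feats, max_neighbors, feat_dim):
    """Zero-pad a variable-size list of per-neighbor feature vectors
    (e.g., concatenated state-action entries) into a fixed-dimension
    flat vector, so a fixed-input-size Q network can consume it."""
    out = np.zeros((max_neighbors, feat_dim))
    k = min(len(neighbor_feats), max_neighbors)
    if k:
        out[:k] = np.asarray(neighbor_feats, dtype=float)[:k]
    return out.ravel()
```

For instance, with one neighbor carrying a 2-dimensional feature and room for three neighbors, the remaining four entries are zero.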