
DARL1N: Distributed multi-Agent Reinforcement Learning with One-hop Neighbors

Most existing multi-agent reinforcement learning (MARL) methods are limited in the scale of problems they can handle. Particularly, with the increase of the number of agents, their training costs grow exponentially. In this paper, we address this limitation by introducing a scalable MARL method called Distributed multi-Agent Reinforcement Learning with One-hop Neighbors (DARL1N). DARL1N is an off-policy actor-critic method that breaks the curse of dimensionality by decoupling the global interactions among agents and restricting information exchanges to one-hop neighbors. Each agent optimizes its action value and policy functions over a one-hop neighborhood, significantly reducing the learning complexity, yet maintaining expressiveness by training with varying numbers and states of neighbors. This structure allows us to formulate a distributed learning framework to further speed up the training procedure. Comparisons with state-of-the-art MARL methods show that DARL1N significantly reduces training time without sacrificing policy quality and is scalable as the number of agents increases.




I Introduction

Recent years have witnessed tremendous success of reinforcement learning (RL) in challenging decision making problems, such as robot control and video games. Research efforts are currently focused on multi-agent settings, such as cooperative robot navigation, multi-player games, and traffic management. A direct application of RL techniques in a multi-agent setting by simultaneously running a single-agent algorithm at each agent exhibits poor performance [14]. This is because, without considering interactions among the agents, the environment becomes non-stationary from the perspective of a single agent.

Multi-agent reinforcement learning (MARL) [4] addresses the aforementioned challenge by considering all agent dynamics collectively when learning the policy of an individual agent. This is achieved by learning a centralized value or action-value (Q) function that involves the states and actions of all agents. Most effective MARL algorithms, such as multi-agent deep deterministic policy gradient (MADDPG) [14], counterfactual multi-agent (COMA) [7], and multi actor attention critic (MAAC) [9], adopt this strategy. However, learning a joint Q function is challenging due to the exponentially growing size of the joint state and action space with the increasing number of agents [15]. A policy obtained by parametrizing the joint Q function directly has poor performance in large-scale settings as shown in [22, 13].

Recently, MARL algorithms that reduce the representation complexity of the Q function have been shown to significantly improve the quality of the learned policies for large-scale multi-agent settings. Successful methods include value factorization algorithms, such as mean-field MARL [22], evolutionary population curriculum (EPC) [13] and scalable actor critic (SAC) [15]. While these methods achieve excellent performance, their training time can be exceedingly slow as the number of agents increases because they require simultaneous state transitions for all agents. This requirement prevents fully distributed training over a computing cluster because the compute nodes need to receive the state transitions of all agents simultaneously.

Our contribution is a distributed MARL training method called Distributed multi-Agent Reinforcement Learning with One-hop Neighbors (DARL1N). Its main advantage over state-of-the-art methods is a fully distributed training procedure, in which each compute node only simulates a very small subset of the agents locally. This is made possible by representing the agent topology as a proximity graph and approximating the Q function over one-hop neighborhoods. When agent interactions are restricted to one-hop neighbors, training the Q function of an agent requires simulation only of the agent itself and its potential two-hop neighbors. This enables fully distributed training and greatly accelerates the training of large-scale MARL policies.

II Related Work

State-of-the-art MARL algorithms like MADDPG [14], COMA [7], MAAC [9] and MAPPO [23] learn joint Q/value functions over all agent states and actions. As the number of agents increases, the exponential growth of the joint state and action spaces drastically increases the representation complexity required to model the Q functions. Moreover, the need to perform simultaneous state transitions for all agents prevents distributed or parallel training.

Two directions have been explored in the literature to scale up MARL. The first aims to reduce the complexity of the Q function representation by factorizing it into local value functions that depend only on the states and actions of some agents. The second considers distributed or parallel computing architectures to speed up MARL training.

We first review factorization techniques for the Q and policy functions in MARL. VDN [19] proposes a decomposition of the joint Q function into a sum of local value functions. QMIX [16] improves VDN by combining the local value functions monotonically using a mixing neural network, which provides a more expressive Q function. QTRAN [18] further extends VDN and QMIX by factorizing an alternative Q function that has the same optimal actions as the original Q function, without requiring additivity or monotonicity assumptions. The Q function can also be factorized according to a coordination graph specifying the agent interactions. For instance, [11, 8, 10, 24] decompose the global Q function into a set of local value functions with dependencies specified by agents that are connected in a static undirected graph. Dynamic coordination graphs for cooperative MARL were considered in [21]. To approximate the local value functions, Böhmer et al. [2] used deep neural networks. Mean-field MARL techniques [22], including Mean-Field Actor-Critic (MFAC) and Mean-Field Q (MF-Q), factorize the Q function into a weighted sum of local Q functions, each depending only on one agent's action and the mean action of its neighboring agents. SAC [15] approximates the Q function of each agent using the states and actions of its κ-hop neighbors. DARL1N adopts the Q and policy factorization of SAC with κ = 1. The main difference with respect to SAC is that DARL1N also uses the graph structure to devise a distributed training approach. While training SAC requires simultaneous state transitions, we show that the Q function of an agent can be trained off-policy using state transitions only for the agent itself and its potential two-hop neighbors. The relationship between SAC and DARL1N is discussed in more detail in Sec. VI-A, after DARL1N is introduced.

While value factorization techniques reduce the representation complexity of the MARL policy and value functions, they still require simultaneous simulation and parameter updates for all agents on a single compute node. Distributed and parallel computing is a promising way to speed up MARL training, but it has rarely been studied in this context. Multi-agent A3C was explored in [17], where each compute node, running in parallel, performs independent simulation and training for all agents, and asynchronously updates and fetches the parameters stored at the central controller. EPC [13] applies curriculum learning and adopts population-invariant policy and Q functions to support varying numbers of agents in different learning stages. EPC was implemented on a parallel computing architecture consisting of multiple compute nodes. Each compute node simulates all agents in multiple independent environments, and each environment and agent runs in an independent, parallel process. Also of interest is the distributed architecture presented in [5], which was designed to reduce the communication overhead between compute nodes and the central controller. Although these methods improve training efficiency by leveraging distributed or parallel computing, their training costs remain high when the number of agents is large, because they require each compute node to simulate all agents.

III Background

We consider the MARL problem in which n agents learn to optimize their behaviors by interacting with the environment. Denote the state and action of agent i by s_i ∈ S_i and a_i ∈ A_i, respectively, where S_i and A_i are the corresponding state and action spaces. Let s := (s_1, …, s_n) and a := (a_1, …, a_n) denote the joint state and action of all agents. At time t, a joint action a(t) applied at state s(t) triggers a transition to a new state s(t+1) according to a conditional probability density function (pdf) p(s(t+1) | s(t), a(t)). After each transition, each agent i receives a reward r_i(s(t), a(t)), determined by the joint state and action according to the function r_i. The objective of each agent i is to choose a deterministic policy μ_i to maximize the expected cumulative discounted reward:

V_i^μ(s) = E[ Σ_{t=0}^∞ γ^t r_i(s(t), a(t)) | s(0) = s, a(t) = μ(s(t)) ],

where μ := (μ_1, …, μ_n) denotes the joint policy of all agents and γ ∈ (0, 1) is a discount factor. The function V_i^μ is known as the value function of agent i associated with the joint policy μ.

An optimal policy for agent i can also be obtained by maximizing the action-value (Q) function:

Q_i^μ(s, a) = r_i(s, a) + γ E[ V_i^μ(s(t+1)) ],

and setting μ_i^*(s) ∈ argmax_{a_i} Q_i^*(s, a_i, a_{-i}), where a = (a_i, a_{-i}) and a_{-i} denotes the actions of all agents except agent i. In the rest of the paper, we omit the time notation t for simplicity, when there is no risk of confusion.
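As a small concrete illustration of the value function definition, the sketch below (reward values and discount factor are hypothetical, chosen only for illustration) computes the discounted return whose expectation V_i^μ represents:

```python
# Sketch: discounted cumulative reward, as in the value function definition.
# The reward sequence below is hypothetical, for illustration only.

def discounted_return(rewards, gamma=0.95):
    """Sum_t gamma^t * r(t) for one agent's reward sequence."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

rewards_agent_i = [1.0, 0.0, 2.0, 1.0]  # r_i(s(t), a(t)) along one episode
print(round(discounted_return(rewards_agent_i), 4))  # → 3.6624
```

In the actual objective, this quantity is averaged over the randomness of the transitions, which is what the expectation in V_i^μ captures.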

IV Problem Statement

To develop a distributed MARL algorithm, we impose additional structure on the MARL problem. Assume that all agents share a common state space, i.e., S_i = S for all i. Let dist be a distance metric on the homogeneous state space. We introduce a proximity graph [3] to model the topology of the agent team. A d-disk proximity graph is defined as a mapping that associates the joint state s with an undirected graph G(s) = (V, E) such that V = {1, …, n} and E = {(i, j) : dist(s_i, s_j) ≤ d, i ≠ j}. Define the set of one-hop neighbors of agent i as N_i := {j : (i, j) ∈ E} ∪ {i}. We make the following regularity assumption about the agents' motion.

Assumption 1.

The distance between two consecutive states, s_i(t) and s_i(t+1), of agent i is bounded, i.e., dist(s_i(t), s_i(t+1)) ≤ ε, for some ε > 0.

This assumption is justified in many problems of interest where, e.g., due to physical limitations, the agent states can only change by a bounded amount in a single time step. Following this assumption, we define the set of potential neighbors of agent i at time t as P_i(t) := {j : dist(s_i(t), s_j(t)) ≤ d + 2ε}, which captures the set of agents that may become one-hop neighbors of agent i at time t + 1.
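These neighborhood constructs can be sketched directly. The snippet below (scalar states and the helper names are our own, for illustration) computes one-hop neighbors and potential neighbors in a d-disk proximity graph:

```python
# Sketch of the d-disk proximity graph constructs (hypothetical 1-D states).
# One-hop neighbors N_i: agents within distance d of agent i (including i).
# Potential neighbors P_i: agents within d + 2*eps, since both endpoints can
# move at most eps per step (Assumption 1).

def one_hop_neighbors(states, i, d):
    return {j for j in range(len(states)) if abs(states[i] - states[j]) <= d}

def potential_neighbors(states, i, d, eps):
    return {j for j in range(len(states))
            if abs(states[i] - states[j]) <= d + 2 * eps}

states = [0.0, 0.8, 2.5, 1.9]   # hypothetical scalar positions
print(sorted(one_hop_neighbors(states, 0, d=1.0)))            # → [0, 1]
print(sorted(potential_neighbors(states, 0, d=1.0, eps=0.5)))  # → [0, 1, 3]
```

Agent 3 is outside agent 0's one-hop neighborhood but inside its potential neighborhood, so it could become a one-hop neighbor after one bounded-motion step.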

Denote the joint state and action of the one-hop neighbors of agent i by s_{N_i} and a_{N_i}, respectively, where N_i is determined by the proximity graph G(s). Our key idea is to let agent i's policy, μ_i(s_{N_i}), depend only on the one-hop neighbor states s_{N_i}. The intuition is that agents that are far away from agent i at time t have little impact on its current action a_i(t). To support this policy model, we make two additional assumptions on the problem structure. To emphasize that the output of a function is affected only by a subset of the input dimensions, we use the notation x_A := (x_j)_{j ∈ A} for a subset A ⊆ {1, …, n} and x = (x_1, …, x_n).

Assumption 2.

The reward of agent i can be fully specified by its one-hop neighbor states s_{N_i} and actions a_{N_i}, i.e., r_i(s, a) = r_i(s_{N_i}, a_{N_i}), and its absolute value is upper bounded by r̄, for some r̄ > 0.

Assumption 2 is satisfied in many multi-agent problems where the reward of one agent is determined only by the states and actions of nearby agents. Examples are provided in Sec. VI. Similar assumptions are adopted in [15, 6, 12].

Assumption 3.

The transition model of agent i depends only on its own action and its one-hop neighbor states, i.e., the transition pdf of s_i factorizes as p_i(s_i(t+1) | s_{N_i}(t), a_i(t)).

Assumption 3 is common for multi-agent networked systems, as in [15, 12]. As a result, the joint state transition pdf decomposes as:

p(s(t+1) | s(t), a(t)) = ∏_{i=1}^{n} p_i(s_i(t+1) | s_{N_i}(t), a_i(t)).

The objective of each agent i is to obtain an optimal policy μ_i^*(s_{N_i}), which depends only on the one-hop neighbor states, by maximizing Q_i^*, the optimal action-value (Q) function introduced in the previous section.

V Distributed MARL with One-hop Neighbors

In this section, we develop the DARL1N algorithm to solve the MARL problem with the proximity-graph structure introduced above. DARL1N restricts the interactions among agents to one-hop neighborhoods, significantly reducing the learning complexity. To further speed up training, DARL1N adopts a distributed training framework that exploits the local interactions to decompose and distribute the computation load.

V-A One-hop Neighborhood Value and Policy Factorization

The Q function of each agent i associated with the MARL problem over the proximity graph can be written as

Q_i^μ(s, a) = Q_i^μ(s_{N_i}, a_{N_i}, s_{V∖N_i}, a_{V∖N_i}),

where s_{V∖N_i} and a_{V∖N_i} denote the joint states and actions of all agents except the one-hop neighbors of agent i. Inspired by the SAC algorithm [15], we approximate the Q function by a function Q̃_i that depends only on the one-hop neighbor states and actions:

Q̃_i(s_{N_i}, a_{N_i}) := Σ_{s_{V∖N_i}, a_{V∖N_i}} w(s_{V∖N_i}, a_{V∖N_i}; s_{N_i}, a_{N_i}) Q_i^μ(s, a),

where the weights w satisfy Σ_{s_{V∖N_i}, a_{V∖N_i}} w(s_{V∖N_i}, a_{V∖N_i}; s_{N_i}, a_{N_i}) = 1. The approximation error is given in the following lemma, with proof provided in Appendix A.

Lemma 1.

Under Assumptions 2 and 3, the approximation error between Q_i^μ(s, a) and Q̃_i(s_{N_i}, a_{N_i}) is bounded; the explicit bound is derived in Appendix A.

We then parameterize the approximated Q function Q̃_i and the policy μ_i by θ_i and φ_i, respectively. To handle the varying sizes of s_{N_i} and a_{N_i}, in the implementation we let the input dimension of Q̃_i be the largest possible dimension of (s_{N_i}, a_{N_i}), and apply zero-padding for agents that are not in the one-hop neighborhood of agent i. The same procedure is applied to represent μ_i. More implementation details are provided in Appendix C.b.
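A minimal sketch of this zero-padding scheme (the function name and dimensions are ours, not the paper's): the network input is sized for the largest possible neighborhood, and absent neighbors contribute zero entries.

```python
# Sketch: fixed-size input from a variable-size one-hop neighborhood.
# Hypothetical helper, for illustration of the zero-padding described above.

def pad_neighborhood(neighbor_feats, max_neighbors, feat_dim):
    """Flatten per-neighbor feature vectors into a fixed-size input."""
    out = []
    for k in range(max_neighbors):
        if k < len(neighbor_feats):
            out.extend(neighbor_feats[k])
        else:
            out.extend([0.0] * feat_dim)    # zero-pad missing neighbors
    return out                              # length: max_neighbors * feat_dim

x = pad_neighborhood([[1.0, 2.0]], max_neighbors=3, feat_dim=2)
print(len(x))  # → 6, regardless of how many neighbors are present
```

This keeps the Q and policy network input dimensions constant even as the proximity graph, and hence the number of one-hop neighbors, changes over time.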

To learn the approximated Q function Q̃_i, instead of the incremental on-policy updates to the Q function used in SAC, we adopt off-policy temporal-difference learning with a replay buffer, similar to MADDPG. The parameters θ_i of the approximated Q function are updated by minimizing:

L(θ_i) = E_{(s_{N_i}, a_{N_i}, r_i, s′) ∼ D_i} [ ( Q̃_i(s_{N_i}, a_{N_i}; θ_i) − y_i )² ],  y_i = r_i + γ Q̃_i(s′_{N′_i}, a′_{N′_i}; θ̄_i) with a′_j = μ̄_j(s′_{N′_j}) for j ∈ N′_i,

where D_i is the replay buffer for agent i, which contains information only from N_i and N′_i, the one-hop neighbors of agent i at the current and next time step, and the one-hop neighbors N′_j for j ∈ N′_i. To stabilize the training, a target Q function with parameters θ̄_i and a target policy function with parameters φ̄_i are used. The parameters θ̄_i and φ̄_i are updated using Polyak averaging, θ̄_i ← τ θ_i + (1 − τ) θ̄_i and φ̄_i ← τ φ_i + (1 − τ) φ̄_i, where τ is a hyperparameter. In contrast to MADDPG, the replay buffer D_i for agent i only needs to store its local interactions with nearby agents. Note that s′_{N′_j} is used to calculate a′_j = μ̄_j(s′_{N′_j}). Also, in contrast to SAC, each agent only needs to collect its own training data by simulating local two-hop interactions. This allows an efficient and distributed training framework, as we explain in the next subsection.

Agent i's policy parameters φ_i are updated using gradients from the policy gradient theorem [20]:

∇_{φ_i} J(μ_i) = E_{D_i} [ ∇_{φ_i} μ_i(s_{N_i}) ∇_{a_i} Q̃_i(s_{N_i}, a_{N_i}; θ_i) |_{a_i = μ_i(s_{N_i})} ],

where again only data from local interactions is needed.
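The critic update described above can be sketched in a simplified form. The snippet below uses a linear Q approximator and a single hypothetical transition rather than the paper's neural networks, purely to show the TD target computed with target parameters and the subsequent Polyak averaging step:

```python
import random

# Minimal single-agent sketch of the critic update (linear function
# approximation, toy data; not the paper's implementation).
random.seed(0)
dim, gamma, tau, lr = 4, 0.95, 0.01, 0.1

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

theta = [random.gauss(0, 1) for _ in range(dim)]   # critic parameters
theta_bar = list(theta)                            # target critic parameters

# One TD update on a hypothetical transition; feats encodes (s_Ni, a_Ni).
feats = [random.gauss(0, 1) for _ in range(dim)]
feats_next = [random.gauss(0, 1) for _ in range(dim)]
r = 1.0

y = r + gamma * dot(theta_bar, feats_next)         # target uses theta_bar
td_error = dot(theta, feats) - y
theta = [p - lr * td_error * f for p, f in zip(theta, feats)]  # grad step

# Polyak averaging of the target parameters.
theta_bar = [tau * p + (1 - tau) * pb for p, pb in zip(theta, theta_bar)]
```

The key points carried over from the text are that the target y is computed with the slowly moving parameters θ̄_i, and that the target parameters track the learned ones through the τ-weighted average rather than being copied directly.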

V-B Distributed Training with Local Interactions

To implement the parameter updates proposed above, agent i needs training data from its one-hop neighbors at the current and next time steps, whose dynamics obey the following proposition (see proof in Appendix B).

Proposition 1.

Under Assumption 1, if an agent j is not a potential neighbor of agent i at time t, i.e., j ∉ P_i(t), then it will not be a one-hop neighbor of agent i at time t + 1, i.e., j ∉ N_i(t + 1).

Proposition 1 allows us to decouple the global interactions among agents and limit message exchanges or observations to be among one-hop neighbors. It also allows parallel training on a distributed computing architecture, where each compute node only needs to simulate a small subset of the agents. This leads to significant training efficiency gains as demonstrated in Sec. VI.

To collect training data, at each time step t, agent i first interacts with its one-hop neighbors to obtain their states s_{N_i(t)} and actions a_{N_i(t)} and to compute its reward r_i. To obtain the next-step actions a′_j for all j ∈ N_i(t+1), we first determine agent i's one-hop neighbors at the next time step, N_i(t+1). Using Proposition 1, we let each potential neighbor j ∈ P_i(t) perform a transition to a new state s_j(t+1), which is sufficient to determine N_i(t+1). Then, we let the potential neighbors of each new neighbor j ∈ N_i(t+1) perform transitions, which determines N_j(t+1) and hence a′_j. Fig. 1(a) illustrates this data collection process over two consecutive time steps.
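Proposition 1, on which this data collection relies, can be checked empirically. The sketch below (hypothetical 1-D states with bounded random motion per Assumption 1, helper name ours) verifies that no agent outside P_i(t) ever enters N_i(t+1):

```python
import random

# Empirical check of Proposition 1 on a hypothetical 1-D random walk with
# per-step displacement at most eps (Assumption 1): an agent outside
# P_i(t) = {j : |s_i - s_j| <= d + 2*eps} never appears in
# N_i(t+1) = {j : |s_i - s_j| <= d}.

def new_neighbors_within_potential(s, s_next, d, eps):
    """True if, for every agent i, N_i(t+1) is a subset of P_i(t)."""
    n = len(s)
    for i in range(n):
        potential = {j for j in range(n) if abs(s[i] - s[j]) <= d + 2 * eps}
        next_hop = {j for j in range(n) if abs(s_next[i] - s_next[j]) <= d}
        if not next_hop <= potential:
            return False
    return True

random.seed(1)
d, eps, n = 1.0, 0.3, 20
for _ in range(1000):
    s = [random.uniform(0, 10) for _ in range(n)]
    s_next = [x + random.uniform(-eps, eps) for x in s]  # bounded motion
    assert new_neighbors_within_potential(s, s_next, d, eps)
```

The subset relation follows from the triangle inequality: if both agents move at most ε, their distance changes by at most 2ε in one step, so a new one-hop neighbor must already have been within d + 2ε.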

Fig. 1: (a) One-hop neighbor transitions from one time step to the next in a d-disk proximity graph; (b) Distributed MARL with local interactions.

We now describe a distributed learning framework that exploits the local interactions among the agents to optimize the policy and Q function parameters. Our training framework consists of a central controller and n learners, each training a different agent. The central controller stores a copy of all policy and target policy parameters, φ_i and φ̄_i, i = 1, …, n. In each training iteration, the central controller broadcasts the parameters to all learners. Each learner i updates its own policy parameters φ_i and returns the updated values to the central controller.

Each learner i maintains the parameters θ_i and θ̄_i of agent i's approximated Q and target Q functions. In each training iteration, learner i uses policies with the parameters received from the central controller. Transitions are simulated only for agent i, its potential neighbors, and the potential neighbors of each new neighbor, as described above. DARL1N achieves significant computation savings because (i) the Q function parameters θ_i and θ̄_i are stored locally at each learner and do not need to be communicated, and (ii) the agent transition simulation occurs only over small groups of agents, distributed among the learners instead of centralized over all agents at the central controller. The interaction data are stored in the replay buffer D_i. Finally, learner i updates θ_i and φ_i using the critic and policy updates in Sec. V-A, and the target network parameters θ̄_i and φ̄_i via Polyak averaging. Fig. 1(b) illustrates the distributed training procedure. The pseudocode of DARL1N is provided in Alg. 1.

// Central controller:
1 Initialize policy and target policy parameters φ_i, φ̄_i for i = 1, …, n.
2 Broadcast φ_i, φ̄_i, i = 1, …, n, to the learners.
3 do
4        Listen to channel and collect updated φ_i, φ̄_i from the learners.
5 while updated φ_i, φ̄_i are not received from every learner;
// Learner i:
6 Initialize parameters θ_i, θ̄_i of Q̃_i, and replay buffer D_i.
7 for each training iteration do
8        Listen to channel.
9        if φ_i, φ̄_i, i = 1, …, n, received from the central controller then
               // Local interactions:
10               for each data collection step do
11                      Randomly initialize all agent states.
12                      Simulate one-step transitions for the potential neighbors of agent i to determine N_i(t+1).
13                      Simulate one-step transitions for the potential neighbors of each agent j ∈ N_i(t+1) to get a′_j.
14                      Store the obtained local interactions in the buffer D_i.
              // One-hop neighbor-based learning:
15               Sample a mini-batch from D_i and update θ_i, φ_i, θ̄_i, φ̄_i.
16       Send updated φ_i, φ̄_i to the central controller.
Algorithm 1 DARL1N: Distributed multi-Agent Reinforcement Learning with One-hop Neighbors
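The controller–learner message pattern of Alg. 1 can be sketched schematically. The snippet below is serial and uses scalar stand-ins for the policy parameters and for the learner's local update, purely to show which side owns and exchanges which parameters:

```python
# Schematic (serial) sketch of Alg. 1's message pattern: the controller owns
# a copy of all policy parameters phi_i, each learner i owns theta_i locally
# and returns only its updated phi_i. Parameters are plain floats here,
# purely for illustration.

def learner_update(i, all_phi):
    # ... local interactions + one-hop neighbor-based learning go here ...
    return all_phi[i] + 0.1        # stand-in for a gradient step on phi_i

n_agents, n_iters = 4, 3
phi = [0.0] * n_agents             # controller's copy of all policy params

for _ in range(n_iters):
    broadcast = list(phi)                          # controller -> learners
    updates = {i: learner_update(i, broadcast) for i in range(n_agents)}
    for i, new_phi in updates.items():             # learners -> controller
        phi[i] = new_phi

print(phi)  # each phi_i advanced by 0.3 after three iterations
```

Only the policy parameters cross the network in either direction; the critic parameters θ_i, θ̄_i and the replay buffer D_i would live entirely inside `learner_update`, which is the source of the communication savings described above.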

VI Experiments

In this section, we conduct experiments to evaluate the performance of DARL1N.

VI-A Experiment Settings


We evaluate DARL1N in four environments: Ising Model [22], Food Collection, Grassland, and Adversarial Battle [13], which cover cooperative and mixed cooperative-competitive games.


We compare DARL1N with three state-of-the-art MARL algorithms: MADDPG [14], MFAC [22], and EPC [13]. All benchmark methods need states of all agents through observation or communication during training and execution. In contrast, DARL1N only needs states of one-hop neighbors during execution and two-hop potential neighbors during training. While the most closely related method to DARL1N is SAC [15], we do not compare to SAC because DARL1N is a distributed training version of SAC with the same Q function factorization. DARL1N utilizes off-policy training and allows each compute node to simulate state transitions only for one agent and its potential two-hop neighbors. In contrast, SAC is an on-policy approach running in a single compute node without parallel processing but with the same Q function factorization.

Evaluation Metrics

We evaluate all methods using two criteria: training efficiency and policy quality. To measure training efficiency, we use two metrics: 1) average training time spent to run a specified number of training iterations and 2) convergence time. The convergence time is defined as the time when the variance of the average total training reward over 90 consecutive iterations does not exceed 2% of the absolute mean reward, where the average total training reward is the total reward of all agents averaged over 10 episodes in three training runs with different random seeds. To measure policy quality, we use the convergence reward, which is the average total training reward at the convergence time.
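The convergence criterion above can be stated as a short test. The sketch below (function name is ours) flags convergence once the variance of the last 90 average rewards is at most 2% of the absolute mean:

```python
import statistics

# Sketch of the convergence criterion described above: converged once the
# variance of the average total training reward over 90 consecutive
# iterations is at most 2% of the absolute mean reward.

def converged(rewards, window=90, tol=0.02):
    if len(rewards) < window:
        return False
    recent = rewards[-window:]
    mean = statistics.fmean(recent)
    return statistics.pvariance(recent) <= tol * abs(mean)

flat = [100.0] * 90                 # hypothetical flat reward curve
noisy = [100.0, -100.0] * 45        # hypothetical oscillating curve
print(converged(flat), converged(noisy))  # → True False
```

Note that the threshold scales with the magnitude of the reward, so the same 2% tolerance applies across environments with very different reward scales.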

Experiment Configurations

We run our experiments on Amazon EC2 computing clusters [1]. To understand the scalability of each method, for each environment, we consider four scenarios with increasing number of agents. The number of agents in the Ising Model and Food Collection environments are set to and , respectively. In the Grassland and Adversarial Battle environments, the number of agents are set to . In the experiments, the number of adversary agents, grass pellets and resource units are all set to , and all adversary agents adopt policies trained by MADDPG. More experiment settings including training parameters, Q function and policy function representation, and neighborhood configurations are described in Appendix C.

To evaluate the training efficiency, we configure the computing resources used to train each method in a way so that DARL1N utilizes roughly the same or fewer resources measured by the money spent per hour on Amazon EC2 computing cluster. In particular, to train DARL1N, Amazon EC2 instance is used in all scenarios for all environments. To train MADDPG and MFAC, instance is used in the first scenario () for Ising Model and in the first two scenarios for Food Collection, Grassland and Adversarial Battle. In the other scenarios, instance is used. To train EPC, we use instance in all scenarios for Food Collection and in the first three scenarios for Ising Model, Grassland and Adversarial Battle. The other scenarios adopt instance . To configure the parallel computing architecture in EPC, we set the number of parallel computing instances and the number of independent environments to and , respectively. The configurations of Amazon EC2 instances as the compute nodes are summarized in Tab. I.

Instances CPU cores CPU frequency Memory Network Hourly price
2 3.4 GHz 5.3 GB 25 Gb $ 0.108
12 4 GHz 96 GB 10 Gb $ 1.116
24 4 GHz 192 GB 10 Gb $ 2.232
48 3.6 GHz 96 GB 12 Gb $ 2.04
72 3.6 GHz 144 GB 25 Gb $ 3.06
TABLE I: Configurations of Amazon EC2 instances

Vi-B Experiment Results

Fig. 2: Average training time of different methods to run (a) 10 iterations in the Ising Model, (b) 30 iterations in the Food Collection, (c) 30 iterations in the Grassland, and (d) 30 iterations in the Adversarial Battle environments.

Ising Model

Method         Convergence Time (s)        Convergence Reward
MADDPG         62 / 263 / 810 / 1996       460 / 819 / 1280 / 1831
MFAC           63 / 274 / 851 / 2003       468 / 814 / 1276 / 1751
EPC            101 / 26 / 51 / 62          468 / 831 / 1278 / 3321
EPC Scratch    101 / 412 / 993 / 2995      468 / 826 / 1275 / 2503
DARL1N         38 / 102 / 210 / 110        465 / 828 / 1279 / 2282
TABLE II: Convergence time and convergence reward of different methods in the Ising Model environment (four scenarios with increasing numbers of agents).

Tab. II shows the convergence reward and convergence time of different methods. When the number of agents is small (9), all methods achieve roughly the same reward. DARL1N takes the least amount of time to converge, while EPC takes the longest. As the number of agents increases, EPC converges almost immediately, and the convergence reward it achieves in the largest scenario is much higher than that of the other methods. The reason is that, in the Ising Model, each agent only needs information from its four fixed neighbors, so in EPC the policy obtained from the previous stage can be applied directly to the current stage. The other methods train the agents from scratch without curriculum learning. For comparison, we also show the convergence reward and convergence time achieved by training EPC from scratch without curriculum learning (denoted EPC Scratch in Tab. II). The results show that EPC Scratch converges much more slowly than EPC as the number of agents increases. Note that when the number of agents is 9, EPC and EPC Scratch are the same. Moreover, DARL1N achieves a reward comparable to that of EPC Scratch but converges much faster. Fig. 2 shows the average time taken to train each method for 10 iterations in different scenarios. DARL1N requires much less time per training iteration than the benchmark methods.

Food Collection

Method         Convergence Time (s)        Convergence Reward
MADDPG         501 / 1102 / 4883 / 2005    24 / 24 / -112 / -364
MFAC           512 / 832 / 4924 / 2013     20 / 23 / -115 / -362
EPC            1314 / 723 / 2900 / 8104    19 / 11 / 8 / -2
DARL1N         502 / 480 / 310 / 730       14 / 25 / 43 / 61
TABLE III: Convergence time and convergence reward of different methods in the Food Collection environment (four scenarios with increasing numbers of agents).
Fig. 3: Average total training reward of different methods in two large-scale scenarios of the Food Collection environment.
Method         Convergence Time (s)        Convergence Reward
MADDPG         423 / 6271 / 2827 / 1121    21 / 11 / -302 / -612
MFAC           431 / 7124 / 3156 / 1025    23 / 9 / -311 / -608
EPC            4883 / 2006 / 3324 / 15221  12 / 38 / 105 / 205
DARL1N         103 / 402 / 1752 / 5221     18 / 46 / 113 / 210
TABLE IV: Convergence time and convergence reward of different methods in the Grassland environment (four scenarios with increasing numbers of agents).

The convergence rewards and convergence times in this environment are shown in Tab. III. When the problem scale is small, DARL1N, MADDPG, and MFAC achieve similar policy quality. As the problem scale increases, the performance of MADDPG and MFAC degrades significantly and becomes much worse than that of DARL1N or EPC in the two largest scenarios, as also shown in Fig. 3. The convergence reward achieved by DARL1N is comparable to, and sometimes higher than, that achieved by EPC. Moreover, the convergence speed of DARL1N is the highest among all methods in all scenarios.

Fig. 2 shows the average training time for running 30 iterations. Similar to the results obtained in the Ising Model, DARL1N achieves the highest training efficiency, and its training time grows linearly as the number of agents increases. In the largest scenario, EPC takes the longest training time. This is because of the complex policy and Q network architectures in EPC, whose input dimensions grow linearly and quadratically, respectively, with the number of agents.


Grassland

Similar to the results in the Food Collection environment, the policy generated by DARL1N is equally good or better than those generated by the benchmark methods, as shown in Tab. IV and Fig. 2, especially when the problem scale is large. DARL1N also has the fastest convergence speed and takes the shortest time to run a training iteration.

Adversarial Battle

Fig. 4: States of a subset of agents during an episode in the Adversarial Battle environment, with agents trained by different methods.
Method         Convergence Time (s)        Convergence Reward
MADDPG         452 / 1331 / 1521 / 7600    -72 / -211 / -725 / -1321
MFAC           463 / 1721 / 1624 / 6234    -73 / -221 / -694 / -1201
EPC            1512 / 1432 / 2041 / 9210   -75 / -215 / -405 / -642
DARL1N         121 / 756 / 1123 / 3110     -71 / -212 / -410 / -682
TABLE V: Convergence time and convergence reward of different methods in the Adversarial Battle environment (four scenarios with increasing numbers of agents).
Fig. 5: Mean and standard deviation of the normalized total reward of competing agents trained by different methods in the Adversarial Battle environment.


In this environment, DARL1N again achieves good performance in terms of policy quality and training efficiency compared to the baseline methods, as shown in Tab. V and Fig. 2. For illustration, the states of a subset of agents trained by different methods during an episode are shown in Fig. 4. Both DARL1N and EPC agents successfully collect resource units and eliminate agents from the other team, while MADDPG and MFAC agents fail to do so. To further evaluate performance, we reconsider the largest scenario and train the good agents and the adversary agents using two different methods. The trained good agents and adversary agents then compete with each other in the environment. We apply Min-Max normalization to measure the normalized total reward of the agents on each side in an episode. To reduce uncertainty, we generate 10 episodes and record the mean values and standard deviations. As shown in Fig. 5, DARL1N achieves the best performance, and both DARL1N and EPC significantly outperform MADDPG and MFAC.

VII Conclusion

This paper introduces DARL1N, a scalable MARL algorithm. DARL1N features a novel training scheme that breaks the curse of dimensionality in action-value function approximation by restricting interactions among agents to one-hop neighborhoods. This reduces the learning complexity and enables fully distributed and parallel training, in which individual compute nodes only simulate interactions among a small subset of agents. To demonstrate the scalability and training efficiency of DARL1N, we conducted a comprehensive evaluation against three state-of-the-art MARL algorithms: MADDPG, MFAC, and EPC. The results show that DARL1N generates equally good or better policies in almost all scenarios, with significantly higher training efficiency than the benchmark methods, especially in large-scale problem settings.


  • [1] AWS (2022) Amazon ec2. Note: 2022-01-13 Cited by: §VI-A.
  • [2] W. Böhmer, V. Kurin, and S. Whiteson (2020-04) Deep coordination graphs. In

    International Conference on Machine Learning

    Online. Cited by: §II.
  • [3] F. Bullo, J. Cortés, and S. Martinez (2009) Distributed control of robotic networks: a mathematical approach to motion coordination algorithms. Princeton University Press. Cited by: §IV.
  • [4] L. Buşoniu, R. Babuška, and B. De Schutter (2010) Multi-agent reinforcement learning: an overview. Innovations in Multi-agent Systems and Applications, pp. 183–221. Cited by: §I.
  • [5] T. Chen, K. Zhang, G. B. Giannakis, and T. Başar (2021) Communication-efficient policy gradient methods for distributed reinforcement learning. IEEE Transactions on Control of Network Systems. Cited by: §II.
  • [6] T. Chu, J. Wang, L. Codecà, and Z. Li (2019) Multi-agent deep reinforcement learning for large-scale traffic signal control. IEEE Transactions on Intelligent Transportation Systems 21 (3), pp. 1086–1095. Cited by: §IV.
  • [7] J. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson (2018-02) Counterfactual multi-agent policy gradients. In

    AAAI Conference on Artificial Intelligence

    Louisiana, USA. Cited by: §I, §II.
  • [8] C. Guestrin, M. Lagoudakis, and R. Parr (2002-07) Coordinated reinforcement learning. In International Conference on Machine Learning, Sydney, Australia. Cited by: §II.
  • [9] S. Iqbal and F. Sha (2019-06) Actor-attention-critic for multi-agent reinforcement learning. In International Conference on Machine Learning, CA, USA. Cited by: §I, §II.
  • [10] J. R. Kok and N. Vlassis (2006) Collaborative multiagent reinforcement learning by payoff propagation. Journal of Machine Learning Research 7, pp. 1789–1828. Cited by: §II.
  • [11] L. Kuyer, S. Whiteson, B. Bakker, and N. Vlassis (2008-09) Multiagent reinforcement learning for urban traffic control using coordination graphs. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Antwerp, Belgium. Cited by: §II.
  • [12] Y. Lin, G. Qu, L. Huang, and A. Wierman (2020) Distributed reinforcement learning in multi-agent networked systems. arXiv preprint arXiv:2006.06555. Cited by: §IV, §IV.
  • [13] Q. Long, Z. Zhou, A. Gupta, F. Fang, Y. Wu, and X. Wang (2020-04) Evolutionary population curriculum for scaling multi-agent reinforcement learning. In International Conference on Learning Representations (ICLR), Online. Cited by: §I, §I, §II, §VI-A, §VI-A.
  • [14] R. Lowe, Y. I. Wu, A. Tamar, J. Harb, O. P. Abbeel, and I. Mordatch (2017-12) Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, CA, USA. Cited by: §I, §I, §II, §VI-A.
  • [15] G. Qu, A. Wierman, and N. Li (2020-06) Scalable reinforcement learning of localized policies for multi-agent networked systems. In Learning for Dynamics and Control, Online. Cited by: §I, §I, §II, §IV, §IV, §V-A, §VI-A.
  • [16] T. Rashid, M. Samvelyan, C. Schroeder, G. Farquhar, J. Foerster, and S. Whiteson (2018-07) Qmix: monotonic value function factorisation for deep multi-agent reinforcement learning. In International Conference on Machine Learning, Stockholm, Sweden. Cited by: §II.
  • [17] D. Simões, N. Lau, and L. P. Reis (2020) Multi-agent actor centralized-critic with communication. Neurocomputing. Cited by: §II.
  • [18] K. Son, D. Kim, W. J. Kang, D. E. Hostallero, and Y. Yi (2019-06) Qtran: learning to factorize with transformation for cooperative multi-agent reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning, CA, USA. Cited by: §II.
  • [19] P. Sunehag, G. Lever, A. Gruslys, W. M. Czarnecki, V. Zambaldi, M. Jaderberg, M. Lanctot, N. Sonnerat, J. Z. Leibo, K. Tuyls, and T. Graepel (2018-07) Value-decomposition networks for cooperative multi-agent learning based on team reward. In International Conference on Autonomous Agents and Multi-Agent Systems, Stockholm, Sweden. Cited by: §II.
  • [20] R. S. Sutton and A. G. Barto (2018) Reinforcement Learning: An Introduction. MIT Press. Cited by: §V-A.
  • [21] T. Wang, L. Zeng, W. Dong, Q. Yang, Y. Yu, and C. Zhang (2021) Context-aware sparse deep coordination graphs. arXiv preprint arXiv:2106.02886. Cited by: §II.
  • [22] Y. Yang, R. Luo, M. Li, M. Zhou, W. Zhang, and J. Wang (2018-06) Mean field multi-agent reinforcement learning. In International Conference on Machine Learning. Cited by: §I, §I, §II, §VI-A, §VI-A.
  • [23] C. Yu, A. Velu, E. Vinitsky, Y. Wang, A. Bayen, and Y. Wu (2021) The surprising effectiveness of ppo in cooperative, multi-agent games. arXiv preprint arXiv:2103.01955. Cited by: §II.
  • [24] C. Zhang and V. Lesser (2013-05) Coordinating multi-agent reinforcement learning with limited communication. In International Conference on Autonomous Agents and Multi-Agent Systems, MN, USA. Cited by: §II.

Appendix A Appendix

A-A Proof of Lemma 1

We first prove the following inequality


where and . In particular, letting and denote and , respectively, we have:


where derives from the fact that are part of both and . In the above equations, we have omitted the subscript of the expectation operator for simplicity, which should be . Then, we have

A-B Proof of Proposition 1

If agent , then based on the definition of potential neighbors, we have . By the triangle inequality, , and by Assumption 1, . Therefore, . Applying the triangle inequality again, we obtain . Since , we have . Therefore, agent will not be a one-hop neighbor of agent at time . ∎
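The inline symbols in this proof were lost in extraction. Under the assumed notation that s_i^t is the state of agent i at time t, d is the neighbor distance, and ε is the per-step motion bound from Assumption 1, the triangle-inequality step can be sketched as follows (a reconstruction, not the paper's verbatim derivation):

```latex
\|s_i^{t+1} - s_j^{t+1}\|
  \;\geq\; \|s_i^{t} - s_j^{t}\|
         - \|s_i^{t+1} - s_i^{t}\|
         - \|s_j^{t+1} - s_j^{t}\|
  \;\geq\; (d + 2\epsilon) - \epsilon - \epsilon \;=\; d,
```

so an agent outside the potential-neighbor set at time t remains farther than the neighbor distance d at time t+1.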

A-C Experiment Settings

Training Parameters

All environments adopt the same training parameters. In particular, the Adam optimizer is used to update the policy and Q function parameters with a learning rate of 0.01. The parameter in the Polyak averaging algorithm for updating the target policy and target Q functions is set to . The discount factor is set to . The size of the replay buffer is set to . The parameters are updated after every episodes. The in Alg. 1 of DARL1N is set to times the length of one episode. In the Ising Model and Food Collection environments, the length of each episode is set to 25 in all scenarios. In the Grassland and Adversarial Battle environments, the length of an episode is set to and for the scenarios of and , respectively. The mini-batch size is set to in the Ising Model and in the other environments.
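For concreteness, the Polyak averaging update used for the target policy and target Q networks can be sketched as below. This is a minimal illustration, not the paper's implementation; the value of `tau` here is illustrative, since the actual value did not survive extraction.

```python
def polyak_update(target_params, params, tau):
    """Soft-update each target parameter:
    target <- (1 - tau) * target + tau * online."""
    return [(1.0 - tau) * t + tau * p for t, p in zip(target_params, params)]

# Toy usage with scalar "parameters" and an illustrative tau of 0.5.
target = [1.0, 2.0]
online = [3.0, 4.0]
updated = polyak_update(target, online, tau=0.5)  # -> [2.0, 3.0]
```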

Q Function and Policy Function Representation

In the implementations of DARL1N, MADDPG, and MFAC, we use fully connected neural networks to represent the approximated Q function and policy function. Each network has three hidden layers of 64 units with ReLU activations. To handle the varying sizes of and in the approximated Q function in DARL1N, we set the input dimension of the approximated Q function to the size of the joint state and action space of the maximum number of agents that can be in , and apply zero padding for agents that are not in the one-hop neighborhood of agent . In particular, in the Ising Model, the maximum number of one-hop neighbors of an agent is fixed at 5. The input dimension of the approximated Q function for agent is then . For the other environments, the maximum number of one-hop neighbors of an agent is the total number of agents. EPC adopts a population-invariant neural network architecture with attention modules to support an arbitrary number of agents across training stages for both the Q function and the policy function.
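The zero-padding scheme for the Q function input can be sketched as follows. The helper name and feature layout are illustrative assumptions; the idea is only that per-neighbor (state, action) features are stacked into a fixed-capacity buffer so the Q network always sees the same input dimension regardless of how many one-hop neighbors are present.

```python
import numpy as np

def pad_neighbor_input(neighbor_feats, max_neighbors, feat_dim):
    """Stack per-neighbor feature vectors and zero-pad up to the
    maximum neighborhood size, yielding a fixed-length flat input."""
    x = np.zeros((max_neighbors, feat_dim), dtype=np.float32)
    for k, f in enumerate(neighbor_feats[:max_neighbors]):
        x[k] = f
    return x.reshape(-1)

# Two neighbors present, capacity of 4 neighbors, 3 features each:
v = pad_neighbor_input([np.ones(3), 2 * np.ones(3)],
                       max_neighbors=4, feat_dim=3)
# v has length 4 * 3 = 12; slots after the second neighbor stay zero.
```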

Environment and Neighborhood Configurations

The one-hop neighbors of an agent are defined over the agent’s state space using a distance metric, which is assumed to be prior knowledge of the environment. In the Ising Model, the topology of the agents is fixed, and the one-hop neighbors of an agent are its vertically and horizontally adjacent agents together with itself. In the other environments, the Euclidean distance between two agents in the 2-D space is used as the distance metric, and the neighbor distance is set to , , , , when , respectively. The bound is determined by the maximum velocity and the time interval between two consecutive time steps, and is set to , , , , when , respectively. The size of the agents’ activity space is set to , , , when , respectively.
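The Euclidean-distance neighborhood rule described above can be sketched as a small helper. The function name and the sample values are illustrative, not from the paper; the neighbor distance threshold `d` stands in for the environment-specific values listed in the text.

```python
import numpy as np

def one_hop_neighbors(positions, i, d):
    """Return indices of all agents within Euclidean distance d of
    agent i in 2-D space (the agent itself is included)."""
    pos = np.asarray(positions, dtype=float)
    dists = np.linalg.norm(pos - pos[i], axis=1)
    return [j for j, dj in enumerate(dists) if dj <= d]

# Three agents; with threshold d = 0.25 only agents 0 and 1 are mutual
# one-hop neighbors, while agent 2 is too far away.
pts = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0)]
nbrs = one_hop_neighbors(pts, 0, d=0.25)  # -> [0, 1]
```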