Counterfactual Multi-Agent Reinforcement Learning with Graph Convolution Communication

by   Jianyu Su, et al.
University of Virginia

We consider a fully cooperative multi-agent system in which agents cooperate to maximize a system's utility in a partially observable environment. We propose that multi-agent systems must have the ability to (1) communicate and understand the interplay between agents and (2) correctly distribute rewards based on an individual agent's contribution. In contrast, most prior work in this setting considers only one of these abilities. In this study, we develop an architecture that allows for communication among agents and tailors the system's reward to each individual agent. Our architecture represents agent communication through graph convolution and applies an existing credit assignment structure, counterfactual multi-agent policy gradient (COMA), to help agents learn communication by back-propagation. The flexibility of the graph structure makes our method applicable to a variety of multi-agent systems, e.g. dynamic systems that consist of varying numbers of agents and static systems with a fixed number of agents. We evaluate our method on a range of tasks, demonstrating the advantage of marrying communication with credit assignment. In the experiments, our proposed method yields better performance than state-of-the-art methods, including COMA. Moreover, we show that the learned communication strategies offer insights into, and interpretability of, the system's cooperative policies.






1 Introduction

Communication, taking many forms in different scenarios, is closely associated with cooperation. For example, many species use vocal communication for different cooperative tasks, e.g. mating and warning of predators. The ability to communicate can be vital for solving cooperative multi-agent reinforcement learning (MARL) problems, such as the coordination of semi- or fully autonomous vehicles and the coordination of machines on a production line. By surveying recent advances in the field of deep MARL, we identified two challenges in this domain.

One challenge of MARL is the uncertainty of other agents' strategies throughout the training process, which makes it hard for agents to understand the interplay between agents and achieve cooperation. Communication methods emerged as solutions to overcome this challenge [5, 11, 16, 19]. However, information sharing among all agents can be problematic in that it poses difficulties for agents in extracting valuable information from the large volume of shared information. In addition, in real-world applications it can be expensive for all agents in the system to broadcast their information globally. Hence, recent communication frameworks have moved away from fixed communication protocols toward flexible and targeted communication. The attentional communication model (ATOC) allows agents to select "collaborators" and communicate with them [11]. Targeted multi-agent communication (TarMAC) enables targeted communication between agents [5].

Another key challenge is credit assignment: without full knowledge of the true state and other agents' actions, agents have trouble deducing their contribution to the system's global reward [7]. It is possible to hand-engineer agent-level local reward functions in some scenarios [8]; however, this technique generalizes poorly to other complex problems and requires domain expertise. Counterfactual multi-agent (COMA) policy gradient [7] has been proposed to solve this credit assignment problem in cooperative settings. COMA takes full advantage of centralised training by using a centralised critic that conditions on the joint action and all available state information, and has achieved good performance on the StarCraft environment.

COMA addresses the credit assignment problem but overlooks communication between agents. The communication methods discussed above provide additional state information through information sharing but do not fully address the problem of credit assignment. In this paper, we propose a new multi-agent framework that can be seen as a factorization of a centralized policy that outputs actions for all agents to optimize the expected global return. Combining communication with COMA, we allow agents to understand both the interplay between agents and their contributions to the success of the system. Our framework consists of two components: a communication module that promotes an agent's understanding of the interplay between agents, and a credit assignment framework that applies COMA to tailor the global reward to individual agents. We provide extensive empirical demonstration of the effectiveness and flexibility of our framework across a range of tasks, environments, and team sizes. We benchmark our framework on the traffic junction environment [5, 19] and show that agents are able to cooperate despite variations in team size. Further, through an investigation of the communicated messages, we empirically demonstrate that meaningful communication strategies can be learned under the credit assignment training paradigm. Finally, we introduce a new manufacturing environment that consists of heterogeneous agents and requires intensive cooperation among agents to maximize the profits of the manufacturing system over a limited time. In empirical tests, our method outperforms state-of-the-art methods from the literature on this team task.

This paper is organized as follows. In Section 2, we describe related work. In Section 3, we introduce preliminary information about MARL and RL. We describe our proposed method in detail in Section 4. In Section 5, we present competitive performance against baselines on cooperative environments. We provide conclusions and possible areas of future work in Section 6.

2 Related Work

MARL has benefited from recent developments in deep reinforcement learning, with the frameworks moving away from tabular methods [2] to end-to-end schemes [7]. Our work is related to recent advances in deep multi-agent reinforcement learning. Judging by whether agents are allowed to communicate and whether the framework takes advantage of a centralized training paradigm, we categorize the recent deep multi-agent reinforcement learning works into communication methods and centralized training methods.

Communication Methods: This line of work utilizes an end-to-end communication architecture to enable agents to learn extra information from each other, where communication is learned by back-propagation. We can further categorize communication methods into explicit and implicit communication methods, based on whether the message is an input to agents. Explicit communication methods generally follow a sequential process where messages from agents are generated at timestep $t$ and then fed as inputs to other agents at timestep $t+1$. [6] defined discrete symbols as messages. [5, 11] expand the communication vocabulary by utilizing continuous vector messages.

Explicit communication methods may suffer from communication lag because messages only convey information about agents from the previous timestep. In contrast, implicit communication methods, which integrate the message into the model's structure, allow for timely information sharing among agents. [16] utilizes a communication module to generate discrete communication features for the final policy layer that outputs agents' actions. CommNet [19] and BiCNet [17] communicate the encoding of the local observation among all agents. Similar to our communication module, DGN utilizes graph convolution to allow for targeted communication between agents [10]. Recent advances in communication methods, including our architecture, are often coupled with self-attention, which is referred to as a relation kernel. Among those works, TarMAC falls into the explicit communication category [5], and ATOC, which is categorized as an implicit communication method, does not use graph convolution. The closest work to our communication module is DGN [10], which also utilizes graph convolutions and self-attention. However, DGN considers a different problem setting in which local rewards are presented to agents, whereas our method provides only global rewards. Our method is further distinguished by the use of a recurrent structure in the policy network to take temporal factors into consideration.

Centralized Training Methods: Although communication methods also follow a centralized training paradigm, most of them are excluded from this category because they do not use true state information to help individual agents condition environment rewards on true states, and thus fail to take full advantage of the training paradigm. Centralized training methods often use critics to promote agents' understanding of their joint contribution to the overall system. [8] utilizes a critic that conditions on an agent's local information and a hand-crafted local reward. [5, 7, 13] use a centralised critic to evaluate the action-value of the joint actions. Among these methods, only COMA addresses the credit assignment problem. In the proposed method, we utilize the COMA framework to promote effective communication between agents by back-propagating the gradient evaluated by the central critic.

3 Technical Background

Decentralized Partially Observable Markov Decision Processes (Dec-POMDPs): Consider a fully cooperative multi-agent task with $n$ agents. The environment has a true state $s \in S$, a transition probability function $P(s' \mid s, \mathbf{u})$, and a global reward function $r(s, \mathbf{u})$. Each agent $a$ takes actions according to its own policy $\pi^a$, and receives partial state information in the form of an observation $o^a$. At each timestep, each agent simultaneously chooses an action $u^a$, forming a joint action $\mathbf{u}$ that induces a transition in the environment according to $P(s' \mid s, \mathbf{u})$. The joint action $\mathbf{u}$, determined by the joint policy $\boldsymbol{\pi}$, induces an overall environment reward $r(s, \mathbf{u})$. We denote joint quantities in bold, and use $\mathbf{u}^{-a}$ to denote the joint action other than agent $a$'s action $u^a$. As in single-agent RL, MARL aims to maximize the discounted return $R_t = \sum_{l=0}^{\infty} \gamma^l r_{t+l}$. The joint value function $V^{\boldsymbol{\pi}}(s_t) = \mathbb{E}[R_t \mid s_t]$ is the expected return for following the joint policy $\boldsymbol{\pi}$ from state $s_t$. The action-value function $Q^{\boldsymbol{\pi}}(s_t, \mathbf{u}_t) = \mathbb{E}[R_t \mid s_t, \mathbf{u}_t]$ is the expected return for selecting joint action $\mathbf{u}_t$ in state $s_t$ and thereafter following the joint policy $\boldsymbol{\pi}$.
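The discounted return above is the quantity every method in this paper ultimately optimizes. As a minimal sketch (the reward sequence and discount value below are illustrative, not from the paper), it can be computed with a backward sweep:

```python
# Discounted return R_0 = sum_t gamma^t * r_t for a finite reward
# sequence, computed by a backward recursion ret = r_t + gamma * ret.
def discounted_return(rewards, gamma=0.99):
    """rewards: list of scalar rewards r_0..r_T; returns R_0."""
    ret = 0.0
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret
```

The backward sweep avoids recomputing powers of $\gamma$ and runs in linear time in the episode length.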

Deep Q-Network (DQN): Combining reinforcement learning with deep neural networks, DQN demonstrated human-level control policies in many Atari games. DQN updates its parameters by minimizing a sequence of loss functions at each iteration $i$: $L_i(\theta_i) = \mathbb{E}\big[(y_i - Q(s, u; \theta_i))^2\big]$, where $y_i = r + \gamma \max_{u'} Q(s', u'; \theta_i^-)$. DQN is an off-policy algorithm that takes raw pixels as input and outputs actions. It utilizes a replay buffer that stores past experience as tuples $(s, u, r, s')$, which allows for efficient sampling and breaks the strong correlation between consecutive samples.
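The one-step target and squared TD loss above can be sketched in plain Python, with lists standing in for network outputs (the names `q_pred` and `q_next` are illustrative, not from the paper):

```python
# Squared TD error for one transition: y = r + gamma * max_a' Q(s', a')
# (or y = r at terminal states), loss = (y - Q(s, a))^2.
def dqn_loss(q_pred, reward, q_next, gamma=0.99, done=False):
    """q_pred: Q(s, a) for the taken action; q_next: list of Q(s', a')."""
    target = reward if done else reward + gamma * max(q_next)
    return (target - q_pred) ** 2
```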

Independent Q-Learning (IQL): Similar to DQN, which learns an action-value approximator $Q(s, u; \theta)$, IQL learns to approximate an action-value function $Q^a(o^a, u^a; \theta^a)$ with respect to each agent $a$ in the multi-agent setting [20]. [7] employed a parameter-sharing technique to modify this model to approximate $Q(o^a, u^a, a; \theta)$, where $a$ is a feature used to differentiate agents.

Policy Gradient Algorithms: These algorithms directly adjust the parameters $\theta$ of the policy in order to maximize the objective $J(\theta)$ by taking steps in the direction of $\nabla_\theta J(\theta)$. The gradient with respect to the policy parameters is $\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho^{\pi},\, u \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(u \mid s)\, Q^{\pi}(s, u)\big]$, where $\rho^{\pi}$ is the state distribution induced by following policy $\pi$, and $Q^{\pi}(s, u)$ is an action-value. Policy gradient algorithms differ in how they evaluate $Q^{\pi}(s, u)$; e.g., the REINFORCE algorithm [22] simply uses a sample return $R_t = \sum_{l=0}^{\infty} \gamma^l r_{t+l}$.
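For a softmax policy over logits, the log-probability gradient with respect to the logits has the closed form $\nabla \log \pi(u) = \mathrm{onehot}(u) - \pi$, so a REINFORCE update can be sketched in a few lines (the logits, action, and return below are illustrative):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

# REINFORCE gradient for one step: grad log pi(u) scaled by the
# sampled return R_t; for softmax logits, grad log pi = onehot - pi.
def reinforce_grad(logits, action, sample_return):
    probs = softmax(logits)
    return [sample_return * ((1.0 if i == action else 0.0) - p)
            for i, p in enumerate(probs)]
```

Because the components of $\mathrm{onehot}(u) - \pi$ sum to zero, the gradient raises the chosen action's logit and lowers the others in proportion to the return.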

Multi-Agent Policy Gradient Algorithms: Multi-agent policy gradient methods are extensions of policy gradient algorithms with a policy $\pi^a$ for each agent. As in single-agent RL, many multi-agent implementations utilize critics to reduce the gradient variance. Among those multi-agent critic algorithms, COMA stands out because it deals with credit assignment issues. In the proposed method, we use the COMA structure to assist agents in learning to communicate by back-propagating a tailored gradient. Details of COMA are presented in the next section.

4 Methods

As shown in Figure 3, our architecture can be described as a multi-agent actor-critic framework. In the Dec-POMDP environment described in the previous section, the centralized critic has access to the true state $s$ (or the concatenation of local observations if true states are not available), while actors only receive local partial observations $o^a$. The communication module, illustrated in Figure 1, is embedded in the actor network, and communication strategies are obtained by back-propagation without supervision. During training, the critic helps actors learn how to communicate and condition their individual actions on the true states. During testing, actors execute without the critic.

4.1 Communication Module

We now describe our multi-agent communication architecture in detail. Our communication kernel consists of graph convolutions and relation kernels.

Graph Convolution: Graph convolution extends the application of convolutional neural networks from the Euclidean domain to non-Euclidean domains such as graphs [24]. Specifically, the spectral filter slides from one central node to another to learn latent features from clusters of nodes pivoted around the central node. Whenever a node is encoded into latent vectors, the aggregated information from its neighbouring nodes is taken into account by the spectral filter. Hence, we represent the communication among nodes with graph convolutions. In this work, we represent the environment as a graph, where each agent is represented by a node and edges are defined by metrics that measure the relationship (closeness) between agents. Agents can communicate with each other if and only if their corresponding nodes are connected in the graph. Moreover, the graph configuration may vary over time based on the metrics defined by domain experts, which makes our communication module applicable to dynamic environments.

Figure 1: Graph convolution integrates information from neighbouring nodes into central nodes. This process can be viewed as information sharing between neighbouring and central nodes.

The communication module consists of three components: an observation encoder, convolution layers, and an RNN that outputs multinomial policies. Let agent $i$ be an arbitrary agent in the environment, and let $j$ index agents connected to agent $i$ in the graph. At timestep $t$, local observations $o_i$ and $o_j$ are encoded into feature vectors $h_i$ and $h_j$ by the observation encoder. Then, the convolution layer aggregates $h_j$ from the neighbourhood of agent $i$. This one-hop convolution process allows agent $i$ to communicate with its first-order neighbours. By stacking $L$ convolution layers, agent $i$ can aggregate information from nodes that are up to $L$ hops away. The convolution process can be written as:

$$h_i^{(l+1)} = \sigma\Big(\sum_{j \in \mathcal{N}(i) \cup \{i\}} \alpha_{ij}\, W^{(l)} h_j^{(l)}\Big), \qquad (1)$$

where $\sigma$ denotes an activation function, $h_i^{(l)}$ is agent $i$'s latent features generated by the $l$-th convolution layer, $W^{(l)}$ represents the trainable weights in the convolution layer, and $\alpha_{ij}$ is a relation weight that defines the amount of information agent $i$ takes from agent $j$. A range of relation kernels has been proposed [9, 12, 21]. The RNN takes the latent features from the convolution layers as input and outputs the agent's policy $\pi^i$.
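The one-hop aggregation step can be sketched in plain Python with scalar features (the feature values, relation weights, and shared weight below are illustrative placeholders for the learned quantities):

```python
# One-hop graph convolution: each agent's next-layer feature is an
# activation of the relation-weighted sum of its (self-inclusive)
# neighbourhood features through a shared weight.
def graph_conv_layer(h, alpha, W, act=lambda x: max(0.0, x)):
    """h: list of scalar features, alpha[i][j]: relation weight from
    agent j to agent i (zero for non-neighbours), W: shared weight."""
    n = len(h)
    return [act(W * sum(alpha[i][j] * h[j] for j in range(n)))
            for i in range(n)]
```

Stacking this layer $L$ times propagates information up to $L$ hops, matching the stacked-convolution description above.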

Relation Kernel: Inspired by [10], we use multi-head dot-product attention (MHDPA) as the relation kernel to perform inductive reasoning between nodes. MHDPA projects the input matrix to a set of matrices that represent queries, keys, and values. Relation weights are obtained by dot-product attention. Formally, the relation weights in the $l$-th convolution layer, $\alpha_{ij}^{(l)}$, can be written as:

$$\alpha_{ij}^{(l)} = \operatorname{softmax}_j\!\Big(\frac{\big(W_Q^{(l)} h_i^{(l)}\big) \cdot \big(W_K^{(l)} h_j^{(l)}\big)}{\sqrt{d_k}}\Big), \qquad (2)$$

where $W_Q^{(l)}$ and $W_K^{(l)}$ are trainable weights that project the inputs into query matrices and key matrices, respectively, $h^{(l)}$ denotes the input vectors, $i$ indexes central nodes, $j$ indexes neighbouring nodes, and $d_k$ is the dimension of the keys. By combining Equation 1 with 2, the latent features generated by the $l$-th convolution layer can be rewritten as:

$$h_i^{(l+1)} = \sigma\Big(\sum_{j \in \mathcal{N}(i) \cup \{i\}} \alpha_{ij}^{(l)}\, W_V^{(l)} h_j^{(l)}\Big), \qquad (3)$$

where $W_V^{(l)}$ are trainable weights that project the inputs into value matrices.
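A single-head version of the scaled dot-product attention weights can be sketched as follows (the query and key vectors are illustrative; a real relation kernel would obtain them from the learned $W_Q$, $W_K$ projections):

```python
import math

# Scaled dot-product attention weights:
# alpha_ij = softmax_j( (q_i . k_j) / sqrt(d_k) ).
def attention_weights(queries, keys):
    """queries, keys: lists of equal-length vectors; returns one row
    of attention weights per query, each row summing to 1."""
    d = len(keys[0])
    out = []
    for q in queries:
        logits = [sum(qc * kc for qc, kc in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        m = max(logits)  # subtract max for numerical stability
        exps = [math.exp(x - m) for x in logits]
        s = sum(exps)
        out.append([e / s for e in exps])
    return out
```

Multi-head attention simply runs several such heads with independent projections and concatenates their outputs.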

Depending on the application, the underlying graph that depicts the relationships between agents may change over time, making it difficult to achieve flexible and efficient graph convolution implementations. In this work, we utilize a mask $M_t = A_t + I$ to filter out nodes that are not related, where $A_t$ is the adjacency matrix at timestep $t$ and $I$ is an identity matrix. Then the relation weights $\alpha^{(l)}$ in the $l$-th graph convolution layer can be written in the following matrix form:

$$\alpha^{(l)} = \operatorname{softmax}\!\Big(\frac{Q K^{\top}}{\sqrt{d_k}} \odot M_t - \frac{1 - M_t}{\epsilon}\Big), \qquad (4)$$

where $\odot$ denotes the matrix element-wise product and $\epsilon$ is a small positive scalar, so that entries between unconnected nodes receive large negative logits and hence near-zero attention weights; rows index the central nodes and columns index the other nodes in the graph.
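The masking idea, one row at a time, can be sketched as follows: positions where the mask is zero are pushed to a large negative logit before the softmax, so unconnected nodes contribute (numerically) nothing. The logits and mask below are illustrative:

```python
import math

# Masked softmax over one row of attention logits: entries whose mask
# bit is 0 (unconnected nodes) get a large negative logit and thus an
# attention weight of effectively zero.
def masked_softmax_row(logits, mask_row, neg=-1e9):
    masked = [x if m else neg for x, m in zip(logits, mask_row)]
    mx = max(masked)
    exps = [math.exp(x - mx) for x in masked]
    s = sum(exps)
    return [e / s for e in exps]
```

Because the mask $M_t = A_t + I$ always includes the diagonal, every agent attends at least to itself even when it has no neighbours.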

Figure 2: Relational Kernel with flexible graph configurations

4.2 Training Paradigm

In the MARL environment, the reward conditions on the true state $s$ and joint actions $\mathbf{u}$ rather than on individual observations and actions. The global reward does not reveal each agent's contribution, and thus a gradient calculated from the global reward, such as $\nabla_\theta \log \pi^a(u^a \mid \tau^a)\, r(s, \mathbf{u})$, does not necessarily encourage an individual agent to choose actions for the greater good of the whole system. As pointed out by [23], one has to conduct counterfactual experiments to find out the contribution of individual agents. That is, given a state $s$, we change agent $a$'s action while holding the other agents' actions constant, and then evaluate how the reward changes for agent $a$. Difference rewards $D^a = r(s, \mathbf{u}) - r(s, (\mathbf{u}^{-a}, c^a))$, which are developed based on this counterfactual reasoning, reveal the contribution of agent $a$ to the overall success of the system by comparing the global reward change incurred by replacing agent $a$'s action with a default action $c^a$. Any action by agent $a$ that improves $D^a$ also improves the global reward $r(s, \mathbf{u})$.

Figure 3: Centralized training paradigm proposed in [7].

Inspired by difference rewards, COMA utilizes a centralized critic to evaluate counterfactual rewards for each agent instead of rolling out parallel simulations at each timestep. This yields the counterfactual advantage:

$$A^a(s, \mathbf{u}) = Q(s, \mathbf{u}) - \sum_{u'^a} \pi^a(u'^a \mid \tau^a)\, Q(s, (\mathbf{u}^{-a}, u'^a)), \qquad (5)$$

where $A^a(s, \mathbf{u})$ is the tailored advantage for agent $a$, and $Q(s, \mathbf{u})$ is the action-value of the joint action evaluated by the centralized critic. The second term on the right-hand side is a separate counterfactual baseline that marginalizes over agent $a$'s available actions with the action probabilities output by the decentralized actor.
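The counterfactual advantage for one agent can be sketched directly: given the critic's Q-values for each of agent $a$'s alternative actions (all other agents' actions held fixed) and the agent's policy probabilities, the baseline is their inner product. The inputs below are illustrative:

```python
# COMA counterfactual advantage for agent a:
# A^a = Q(s, u) - sum_{u'} pi^a(u') * Q(s, (u^{-a}, u')).
def coma_advantage(q_values, policy_probs, chosen_action):
    """q_values[u]: critic's Q for agent a taking action u with the
    other agents' actions fixed; policy_probs: agent a's policy."""
    baseline = sum(p * q for p, q in zip(policy_probs, q_values))
    return q_values[chosen_action] - baseline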

We follow COMA's training paradigm to further promote agents' understanding of the interplay between agents, with each agent obtaining tailored gradients from the centralized critic. Coupled with the communication module above, the actor network's gradient is given by:

$$g = \mathbb{E}_{\boldsymbol{\pi}}\Big[\sum_a \nabla_\theta \log \pi^a(u^a \mid \tau^a)\, A^a(s, \mathbf{u})\Big], \qquad (6)$$

where $\theta$ parameterizes the actor network. The critic, parameterized by $\phi$, minimizes the temporal difference (TD) error:

$$L_t(\phi) = \big(y_t^{(\lambda)} - Q_\phi(s_t, \mathbf{u}_t)\big)^2, \qquad (7)$$

where $y_t^{(\lambda)}$ is the TD($\lambda$) target.
Our implementation employs the A2C framework [14] to enable efficient on-policy sampling. That is, our model executes in multiple parallel environments, within which a temporary buffer collects experiences from the current run as tuples $(s_t, \mathbf{o}_t, \mathbf{u}_t, r_t)$, where $s_t$ denotes the true state of the environment, $\mathbf{u}_t$ denotes the joint action, $\mathbf{o}_t$ are the agents' local observations, and $r_t$ is the environment reward.
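The TD($\lambda$) targets the critic regresses toward can be computed over a finite episode with a backward recursion mixing n-step returns. A minimal sketch, with illustrative rewards and value estimates:

```python
# TD(lambda) targets via the backward recursion
# G_t = r_t + gamma * ((1 - lam) * V(s_{t+1}) + lam * G_{t+1}),
# bootstrapping from the final value estimate.
def td_lambda_targets(rewards, values, gamma=0.99, lam=0.8):
    """values[t] estimates V(s_t); len(values) == len(rewards) + 1
    (final entry is the bootstrap value). Returns one target per step."""
    targets = [0.0] * len(rewards)
    g = values[-1]
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * ((1 - lam) * values[t + 1] + lam * g)
        targets[t] = g
    return targets
```

With $\lambda = 0$ this reduces to one-step TD targets, and with $\lambda = 1$ to Monte Carlo returns, which is the usual bias-variance dial.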

5 Experiments

To evaluate the effectiveness of combining communication with individual-agent reward shaping, we compare our model with two baselines: COMA and IQL with the proposed communication module. In this section, these three models are evaluated on two environments. The following subsections present environment descriptions, model configurations, evaluations, and ablations.

All models used in the following section share the same configuration: RMSProp optimizer with a fixed learning rate, batch size, discount factor $\gamma$, and trace decay $\lambda$ for TD($\lambda$). All methods containing communication modules utilize stacked convolution layers with multi-head dot-product attention.

Model | Graph Convolution | Counterfactual Baseline | Description
IQL with Communication | Yes | No | No reward shaping
COMA | No | Yes | No communication
CCOMA | Yes | Yes | Our architecture
Table 1: Summary of the models tested in the experiments

5.1 Traffic Junction

The traffic junction environment, originally introduced in [19], consists of vehicles moving along pre-assigned routes. At each timestep, new cars enter the grid with a fixed arrival probability at designated starting positions. However, the total number of cars in the grid at any time is constrained by $N_{max}$. Each car's task is to move along its pre-assigned route without colliding with other cars. At every timestep, a car can either gas, which advances it by one cell on its route, or brake to stay in its current cell. A car is removed from the environment once it reaches its designated goal.

Two cars collide if they enter the same cell. A collision does not otherwise affect the simulation, but it incurs a penalty $r_{coll}$. The simulation also applies a time penalty $\tau_i r_{time}$ to each car at every timestep to discourage traffic jams, where $r_{time}$ is a negative reward and $\tau_i$ is the number of timesteps since car $i$ entered the grid. Therefore, the total reward at time $t$ can be written as:

$$r(t) = C^t r_{coll} + \sum_{i=1}^{N^t} \tau_i r_{time},$$

where $C^t$ is the number of collisions at timestep $t$, and $N^t$ is the number of cars present. The game terminates after 40 steps and is deemed a failure if one or more collisions have occurred.
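The step reward above is straightforward to compute; the sketch below uses illustrative placeholder constants for $r_{coll}$ and $r_{time}$, not values from the paper:

```python
# Traffic-junction step reward: collision penalty per collision plus a
# per-car time penalty growing with time spent in the grid.
# r_coll and r_time here are illustrative placeholders.
def junction_reward(n_collisions, car_times, r_coll=-10.0, r_time=-0.01):
    """car_times: list of tau_i values, one per car in the grid."""
    return n_collisions * r_coll + sum(tau * r_time for tau in car_times)
```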

There are two modes of the simulation: easy and hard. In the easy mode, the environment is a grid containing one junction of two one-way roads, as shown in Figure 4(a). The hard mode environment consists of 4 junctions, as shown in Figure 4(b). At every timestep, cars can enter the grid at 1 of 8 arrival points, and there are 7 different routes for each arrival point.

(a) Easy mode.
(b) Hard mode.
Figure 4: Traffic junction environment with two modes. The game continues for 40 steps and is deemed a failure if one or more collisions occur.

Each car is represented by a tuple, and it can only observe other cars within its vision range (its surrounding neighbourhood). For the centralized critic, we use a vector that tracks the ids of active vehicles in the simulation to denote the true state of the environment. We trained all methods for the same number of training steps, and the best policies are saved.

Table 2 shows the performance of each method in both easy and hard modes. The performance of our models in both modes exceeds that of CommNet and TarMAC as reported in [5]. The comparison also demonstrates that our proposed communication module is effective with or without a centralized critic. Note that although COMA agents do not communicate with each other, the centralized critic is able to help agents that cannot communicate learn good policies.

Model | Easy | Hard | Harder
IQL with comm | 100.0% | 98.6% | 94.3%
COMA | 100.0% | 99.1% | 99.1%
CCOMA | 100.0% | 99.6% | 99.3%
CommNet reported in [5] | 99.7% | 78.9% | -
TarMAC reported in [5] | 100.0% | 97.1% | -
Table 2: Success rates on the traffic junction task.

We then increased $N_{max}$ to create a harder version of the environment; the implementation follows the procedure described above. Our method achieved the highest win rate, with IQL with comm and COMA scoring 94.3% and 99.1%, respectively (Table 2). We investigated the agents' action selections based on their coordinates. Figure 5(a) shows that most brakes occur when agents are entering junctions or are in cells where vehicles may take a turn. Once agents exit junctions, the number of brakes drops. We also examined the continuous vectors agents broadcast at each location. Following [19], we compare the average vector norm of messages sent by agents at each location. As shown in Figure 5(b), the vector norms from opposite lanes differ, so an agent might be able to locate the source of a message.

(a) Most brakes happen when vehicles are entering junctions or are in cells where vehicles may turn. Once a vehicle crosses the junction, the brake probability drops. Examples of vehicles' trajectories are shown.
(b) Average message vector norm at each position. The eight white cells represent goal positions. The messages sent from left lanes differ from those of the right lanes; the vector norm might help an agent locate the source of a message.
Figure 5: Analysis of Traffic Junction

5.2 Manufacture Line

The traffic junction environment concerns homogeneous agents whose internal states are deterministic. In addition, the cooperation between agents is not intense in the sense that cars only need to cooperate when they are in close proximity. Inspired by [3, 4], we developed a manufacturing line environment that consists of machines of different types and requires constant cooperation between agents. As shown in Figure 6(a), the manufacturing process consists of two steps, where each step contains 3 homogeneous machines. At each timestep, agents take actions, incurring operation costs. The performance of the system depends heavily on the cooperation among machines; e.g., if all machines in step one shut down, machines in step two should also stop because few parts flow from step 1 to step 2. Otherwise, the system suffers meaningless operation costs.

The details of the environment are as follows. A product must go through both steps in order to be considered complete. There is a distributor between step 1 and step 2. The distributor can send partially completed products from an arbitrary machine in step 1 to any machine in step 2, and it can temporarily store partially completed products if step 1 works faster than step 2. Each machine chooses among the actions produce, stop, and conduct maintenance.

(a) The manufacturing line consists of a 2-step process. The distributor can temporarily store mid-products from step 1 and distribute parts based on the availability of machines in step 2.
(b) The life cycle of a machine consists of 4 stages. The health state of a machine can start from any point in this life cycle, e.g. the end of pre-mature, the beginning of severely-worn, etc.
Figure 6: Description of the manufacturing line environment.

Each agent has 4 internal states: pre-mature, mature, slightly-worn, and severely-worn. The length of each state is controlled by a gamma distribution with pre-defined mean and scale. This property produces stochastic inner-state transitions for each machine in each life cycle. At the beginning of each episode, every machine initializes its internal state randomly. Figure 6(b) illustrates a machine's inner-state transitions. At each timestep, the agent can choose from 3 actions: produce, stop, and conduct maintenance. If a machine is in the severely-worn state, the only available action is conduct maintenance; otherwise, the machine is free to take any action. If a machine takes the produce action, it incurs an operation cost $c_p$ and is capable of processing parts in that timestep. If a machine takes the stop action, it incurs an operation cost $c_s$. If the conduct maintenance action is selected, the machine goes under maintenance, which incurs a maintenance cost $c_m$; the length of maintenance is sampled from a gamma distribution, and the only action available during this period is conduct maintenance. In addition, a cost $c_b$ is incurred if a machine runs into the severely-worn state without conducting maintenance in advance. These rules yield the following reward function at a given timestep $t$:

$$r(t) = p\, n_{out} - c_p n_p - c_s n_s - c_m n_m - c_b n_b,$$

where $p$ is the expected profit from a single product, $n_{out}$ is the number of products coming out of the 2nd step of the process, $n_p$, $n_s$, and $n_m$ are the total numbers of machines that take the produce, stop, and conduct maintenance actions respectively, $n_b$ is the number of machines that run into the severely-worn state without conducting maintenance in advance, and $c_p$, $c_s$, $c_m$, and $c_b$ are the corresponding operation, stop, maintenance, and breakdown costs. An episode terminates after 48 steps. Since the reward function is the profit of the system, we evaluate the algorithms by the average cumulative profit at the end of test episodes.
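The profit-based step reward can be sketched directly; all cost constants below are illustrative placeholders, not values from the paper:

```python
# Manufacturing-line step reward: product profit minus per-action
# operation/maintenance costs and breakdown penalties. All constants
# here are illustrative placeholders.
def line_reward(n_out, n_produce, n_stop, n_maint, n_broken,
                profit=10.0, c_produce=1.0, c_stop=0.2, c_maint=3.0,
                c_broken=5.0):
    return (profit * n_out - c_produce * n_produce - c_stop * n_stop
            - c_maint * n_maint - c_broken * n_broken)
```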

Figure 7: Average cumulative reward per test episode during testing. Each method was run 5 times and the average performance is presented. The shaded regions correspond to the schedule of curriculum learning.

In this experiment, each actor is able to observe the internal state of its machine, e.g. the health state, machine id, the time spent in the current state, and the action taken at the last step. The true state is simply the concatenation of the observations from all agents. We utilized curriculum learning [1] to train our agents. Initially, all machines started from the "brand new" internal state. In the following stage, a subset of machines was uniformly sampled to initialize their internal states randomly at every episode. We continued to sample more machines at fixed intervals of training steps until all machines initialized their states randomly. The shaded regions in Figure 7 correspond to the schedule of our curriculum learning. As more machines are initialized randomly, the simulation becomes more difficult.
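A curriculum schedule of this kind can be sketched as a step function of the training step; the stage interval, stage increment, and machine count below are illustrative assumptions, not the paper's values:

```python
# Curriculum schedule sketch: the number of machines with randomly
# initialized internal states grows stepwise over training, capped at
# the total machine count. All constants are illustrative.
def n_random_machines(step, total=6, interval=1_000_000, per_stage=2):
    """0 random machines during the first interval, then +per_stage
    more each interval, up to `total`."""
    return min(total, (step // interval) * per_stage)
```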

To understand the effects of the simulation's randomness on the algorithms' learning and performance, we followed the evaluation procedure described in [18]: training is paused at fixed intervals, during which test episodes are run with the latest model performing greedy action selection. Figure 7 depicts the performance of each method over the training period. It can be seen that the introduction of randomness into the simulation corresponds with a drop in the performance of the algorithms. CCOMA and COMA both show drastic changes in performance when randomness is introduced. However, CCOMA eventually learns good cooperative policies while COMA fails. With the assistance of the centralized critic, CCOMA is able to pick up good policies, whereas IQL with Comm gets stuck in sub-optimal policies.

6 Conclusion

We introduced CCOMA, a multi-agent RL architecture that allows agents to interact and collaborate. Individually tailored rewards are made possible by a centralized critic that is capable of counterfactual reasoning, with the task-specific team reward as the sole feedback from the environment. Furthermore, our investigation of the communicated messages illustrated that agents learn meaningful communication strategies under the reward-tailoring training paradigm. Evaluations on diverse tasks show that our architecture is applicable to multi-agent systems with both dynamic sets of agents and a fixed number of agents. Empirically, our methods outperform state-of-the-art methods from the literature, demonstrating that combining communication with reward shaping is a viable solution to two key challenges of MARL. Future work will extend our method to environments with larger numbers of agents. We also aim to implement our framework in real-world applications such as semiconductor production plants and the merging of semi- or fully self-driving vehicles.


  • [1] Y. Bengio, J. Louradour, R. Collobert, and J. Weston (2009) Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pp. 41–48. Cited by: §5.2.
  • [2] L. Bu, R. Babu, B. De Schutter, et al. (2008) A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 38 (2), pp. 156–172. Cited by: §2.
  • [3] B. Y. Choo, S. Adams, and P. Beling (2017) Health-aware hierarchical control for smart manufacturing using reinforcement learning. In 2017 IEEE International Conference on Prognostics and Health Management (ICPHM), pp. 40–47. Cited by: §5.2.
  • [4] B. Y. Choo, S. C. Adams, B. A. Weiss, J. A. Marvel, and P. A. Beling (2016) Adaptive multi-scale prognostics and health management for smart manufacturing systems. International journal of prognostics and health management 7. Cited by: §5.2.
  • [5] A. Das, T. Gervet, J. Romoff, D. Batra, D. Parikh, M. Rabbat, and J. Pineau (2019) TarMAC: targeted multi-agent communication. In International Conference on Machine Learning, pp. 1538–1546. Cited by: §1, §1, §2, §2, §2, §5.1, Table 2.
  • [6] J. Foerster, I. A. Assael, N. De Freitas, and S. Whiteson (2016) Learning to communicate with deep multi-agent reinforcement learning. In Advances in neural information processing systems, pp. 2137–2145. Cited by: §2.
  • [7] J. N. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson (2018) Counterfactual multi-agent policy gradients. In Thirty-second AAAI conference on artificial intelligence. Cited by: §1, §2, §2, §3, Figure 3.
  • [8] J. K. Gupta, M. Egorov, and M. Kochenderfer (2017) Cooperative multi-agent control using deep reinforcement learning. In International Conference on Autonomous Agents and Multiagent Systems, pp. 66–83. Cited by: §1, §2.
  • [9] W. Hamilton, Z. Ying, and J. Leskovec (2017) Inductive representation learning on large graphs. In Advances in neural information processing systems, pp. 1024–1034. Cited by: §4.1.
  • [10] J. Jiang, C. Dun, and Z. Lu (2018) Graph convolutional reinforcement learning for multi-agent cooperation. arXiv preprint arXiv:1810.09202 2 (3). Cited by: §2, §4.1.
  • [11] J. Jiang and Z. Lu (2018) Learning attentional communication for multi-agent cooperation. In Advances in neural information processing systems, pp. 7254–7264. Cited by: §1, §2.
  • [12] T. N. Kipf and M. Welling (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §4.1.
  • [13] R. Lowe, Y. Wu, A. Tamar, J. Harb, O. P. Abbeel, and I. Mordatch (2017) Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in neural information processing systems, pp. 6379–6390. Cited by: §2.
  • [14] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu (2016) Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pp. 1928–1937. Cited by: §4.2.
  • [15] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529. Cited by: §3.
  • [16] I. Mordatch and P. Abbeel (2018) Emergence of grounded compositional language in multi-agent populations. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §1, §2.
  • [17] P. Peng, Q. Yuan, Y. Wen, Y. Yang, Z. Tang, H. Long, and J. Wang (2017) Multiagent bidirectionally-coordinated nets for learning to play starcraft combat games. arXiv preprint arXiv:1703.10069 2. Cited by: §2.
  • [18] M. Samvelyan, T. Rashid, C. Schroeder de Witt, G. Farquhar, N. Nardelli, T. G. Rudner, C. Hung, P. H. Torr, J. Foerster, and S. Whiteson (2019) The starcraft multi-agent challenge. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, pp. 2186–2188. Cited by: §5.2.
  • [19] S. Sukhbaatar, R. Fergus, et al. (2016) Learning multiagent communication with backpropagation. In Advances in neural information processing systems, pp. 2244–2252. Cited by: §1, §1, §2, §5.1, §5.1.
  • [20] M. Tan (1993) Multi-agent reinforcement learning: independent vs. cooperative agents. In Proceedings of the tenth international conference on machine learning, pp. 330–337. Cited by: §3.
  • [21] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio (2017) Graph attention networks. arXiv preprint arXiv:1710.10903. Cited by: §4.1.
  • [22] R. J. Williams (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8 (3-4), pp. 229–256. Cited by: §3.
  • [23] D. H. Wolpert and K. Tumer (2002) Optimal payoff functions for members of collectives. In Modeling complexity in economic and social systems, pp. 355–369. Cited by: §4.2.
  • [24] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu (2019) A comprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596. Cited by: §4.1.