Reinforcement Learning (RL) has achieved enormous successes in robotics and gaming in both single and multiagent settings. For example, deep reinforcement learning (DRL) achieved super-human performance in the two-player game Go, which has a very high-dimensional state-action space [27, 28]. However, in multiagent scenarios, the sizes of the state space, joint action space, and joint observation space grow exponentially with the number of agents. As a result of this high dimensionality, existing multiagent reinforcement learning (MARL) algorithms require significant computational resources to learn an optimal policy, which impedes the application of MARL to systems such as swarm robotics. Thus, improving the scalability of MARL is a necessary step towards building large-scale multiagent learning systems for real-world applications.
In MARL, the complexity of finding an optimal joint policy grows with the number of agents as a result of the coupled interactions between agents. However, in many multiagent scenarios, the interactions between agents are quite sparse. For example, in a soccer game, an agent typically only needs to pay attention to other nearby agents when dribbling, because agents far away are not able to intercept. The existence of such sparsity structures in the state transition dynamics (or the state-action-reward relationships) suggests that an agent may only need to attend to information from a small subset of the agents for near-optimal decision-making. Note that the other players that require attention might not be nearby, such as the receiver of a long pass in soccer. In such cases, the agent only needs to selectively attend to the agents that "matter the most". As a result, the agent can spatially and temporally reduce the scale of the planning problem.
In large-scale MARL, sample complexity is a bottleneck of scalability. To reduce the sample complexity, another feature we can exploit is the interchangeability of homogeneous agents: switching two agents' states/actions will not make any difference to the environment. This interchangeability implies permutation invariance of the multiagent state-action value function (a.k.a. the centralized Q-function) as well as interchangeability of agent policies. However, many MARL algorithms such as MADDPG, VDN, and QMIX do not exploit this symmetry and thus have to learn this interchangeability from experience, which increases the sample complexity unnecessarily.
A graph neural network (GNN) is a specific neural network architecture in which permutation invariance can be embedded via graph pooling operations, so this approach has been applied in MARL [1, 15, 14] to exploit the interchangeability. As MARL is a non-structural scenario in which the links/connections between the nodes/agents are ambiguous, a graph has to be created in advance in order to apply a GNN to MARL. Refs. [1, 15, 14] apply ad-hoc methods, such as k-nearest neighbors, hard thresholding, and random dropout, to obtain a graph structure. However, these methods require handcrafted metrics to measure the closeness between agents, which are scenario-specific and thus not general/principled. Inappropriately selecting neighbors based on a poorly designed closeness metric could lead to the failure of learning a useful policy.
While attention mechanisms could be applied to learn the strength of the connection between a pair of agents (i.e., a closeness metric) in a general and principled way, such strengths are often dense, leading to a nearly complete computation graph that does not benefit scalability. The dense attention arises because the softmax activation function, applied to the raw attention logits, generates a probability distribution with full support. One solution to enforce a sparse graph is top-k thresholding, which keeps the k largest attention scores and truncates the rest to zero. However, this truncation is a non-differentiable operation that may cause problems for gradient-based optimization algorithms, such as those used in end-to-end training. Therefore, a sparse attention mechanism that preserves the gradient flow necessary for gradient-based training is required.
To address the non-differentiability issue in sparse attention mechanisms, we generalize sparsemax and obtain a sparsity mechanism whose pattern is adaptive to the environment states. This sparsity mechanism can reduce the complexity of both the forward pass and the back-propagation of the policy and value networks, while preserving end-to-end trainability, in contrast to hard thresholding. With the introduction of GNNs and the generalized sparsemax, which preserve permutation invariance and promote sparsity, respectively, the scalability of MARL is improved.
The discussion so far has been restricted to homogeneous agents, for which permutation invariance is desirable. However, in heterogeneous multiagent systems or competitive environments, permutation invariance and interchangeability are no longer valid. For example, in soccer, switching the positions of two players from different sides can make a difference to the game. To address this heterogeneity, GNN-based MARL must distinguish the different semantic meanings of the connections between different agent pairs (e.g., a friend/friend relationship versus a friend/foe relationship). We address this requirement by applying a multi-relational graph convolutional network, which passes messages using different graph convolution layers on graph edges with different semantic meanings.
To summarize, we propose to learn an adaptive sparse communication graph within the GNN-based framework to improve the scalability of MARL, which applies to both homogeneous and heterogeneous multiagent systems in mixed cooperative-competitive scenarios.
I-A Related Work
One of the existing works exploiting structure in MARL is the mean-field reinforcement learning (MFRL) algorithm, which takes as input the observation and the mean action of neighboring agents to make a decision, and neglects the actions of all other agents. This simplification leads to good scalability. However, the mean action cannot distinguish among neighboring agents, and the locality approximation fails to capture information from a distant but important agent, which leads to sub-optimal policies. The Multi-Actor-Attention-Critic (MAAC) algorithm aggregates information from all other agents using an attention mechanism. Similarly, [1, 14, 7] also employ the attention mechanism to learn a representation for the action-value function. However, the communication graphs used there are either dense or ad hoc (k-nearest neighbors), which makes learning difficult.
Sparse attention mechanisms were first studied in the natural language processing community, where sparsemax was proposed as a sparse alternative to the softmax activation function. The basic idea is to project the attention logits onto the probability simplex, which can generate zero entries once the projection hits the boundary of the simplex. While generalized sparse attention mechanisms were further studied in [22, 3, 18], their sparsity patterns are not adaptive to the state in the context of MARL.
Given this state of the art, the contributions of this paper are twofold. First, we propose a new adaptive sparse attention mechanism in MARL to learn a sparse communication graph, which improves the scalability of MARL by lowering the sample complexity. Second, we extend our GNN-based MARL to heterogeneous systems in mixed cooperative-competitive settings using multi-relational GNN. The evaluations show that our algorithm significantly outperforms previous approaches on applications involving a large number of agents. This technique can be applied to empower large-scale autonomous systems such as swarm robotics.
II-A Multiagent Reinforcement Learning
As a multiagent extension of Markov decision processes (MDPs), a Markov game is defined as a tuple $\langle \mathcal{N}, \mathcal{S}, \mathcal{O}, \mathcal{A}, \mathcal{T}, \{r^i\}, \rho_0, \gamma \rangle$, where $\mathcal{N} = \{1, \dots, N\}$ is a set of agent indices, $\mathcal{S}$ is the set of states, and $\mathcal{O} = \times_i \mathcal{O}^i$ and $\mathcal{A} = \times_i \mathcal{A}^i$ are the joint observation and joint action sets, respectively. The $i$-th agent chooses actions via a stochastic policy $\pi^i(a^i \mid o^i)$, which leads to the next state according to the state transition function $\mathcal{T}(s' \mid s, a^1, \dots, a^N)$. The $i$-th agent also obtains a reward as a function of the state and the agent's action, $r^i : \mathcal{S} \times \mathcal{A}^i \to \mathbb{R}$, and receives a private observation correlated with the state, $o^i : \mathcal{S} \to \mathcal{O}^i$. The initial states are determined by a distribution $\rho_0 : \mathcal{S} \to [0, 1]$. The $i$-th agent aims to maximize its own total expected return $R^i = \mathbb{E}\big[\sum_{t=0}^{T} \gamma^t r^i_t\big]$, with discount factor $\gamma$ and time horizon $T$.
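The per-agent objective above is the standard discounted return; as a concrete illustration (ours, not code from the paper), it can be computed backwards in time:

```python
def discounted_return(rewards, gamma):
    """Total discounted return G = sum_t gamma^t * r_t,
    computed backwards so each step reuses the tail sum."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

For example, rewards [1, 1, 1] with gamma = 0.5 give 1 + 0.5 + 0.25 = 1.75.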
II-B Multi-head attention
The scaled dot-product attention mechanism was first proposed for natural language processing. An attention function maps a query and a set of key-value pairs to an output, which is a weighted sum of the values. The weight assigned to each value is calculated via a compatibility function of the query and the corresponding key. In the context of MARL, let $h_i$ be the representation of agent $i$. The key, query, and value of agent $i$ are defined as $k_i = W_K h_i$, $q_i = W_Q h_i$, and $v_i = W_V h_i$, respectively, where $W_K$, $W_Q$, and $W_V$ are parameter matrices. The output for agent $i$ is then
$$\mathrm{out}_i = \sum_{j} w_{ij} v_j,$$
where $w_{ij}$, the $j$-th entry of the weight vector $w_i$, is defined as
$$w_i = \mathcal{F}\left(\left[\tfrac{q_i^\top k_1}{\sqrt{d_k}}, \dots, \tfrac{q_i^\top k_N}{\sqrt{d_k}}\right]\right),$$
with $\mathcal{F}$ being the softmax function in previous works of GNN-based MARL. The weight vector is dense, as
$$\mathrm{softmax}_j(z) = \frac{e^{z_j}}{\sum_k e^{z_k}} > 0$$
for any vector $z$ and index $j$.
To increase expressiveness, multi-head attention is applied here by simply concatenating the outputs of multiple single attention functions.
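The scaled dot-product attention described above can be sketched in plain Python (helper names such as `matvec` and `attention_output` are ours; a real implementation would use batched tensor operations):

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(z):
    """Dense normalization: every output entry is strictly positive."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def attention_output(H, Wq, Wk, Wv):
    """Scaled dot-product attention over agent representations H.
    Each agent's output is a convex combination of all agents' values."""
    d = len(Wk)  # key dimension used for the 1/sqrt(d) scaling
    Q = [matvec(Wq, h) for h in H]
    K = [matvec(Wk, h) for h in H]
    V = [matvec(Wv, h) for h in H]
    outs = []
    for q in Q:
        logits = [sum(qt * kt for qt, kt in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(logits)  # swapping in a sparse alternative prunes edges
        outs.append([sum(wj * v[t] for wj, v in zip(w, V))
                     for t in range(len(V[0]))])
    return outs
```

With softmax every weight is positive, so every agent attends to every other agent; this is exactly the density that the sparse activation of Section III-A is designed to remove.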
II-C Relational GNN
In heterogeneous multiagent systems, different agent pairs can have different relations, such as friend or foe in a two-party zero-sum game. As a result, information aggregation from agents with different relations should use different parameters. The relational graph convolutional network was proposed to model multi-relational data. The forward-pass update of agent $i$ in a multi-relational graph is as follows:
$$h_i^{(l+1)} = \sigma\left( \sum_{r \in \mathcal{R}} \sum_{j \in \mathcal{N}_i^r} \frac{1}{c_{i,r}} W_r^{(l)} h_j^{(l)} + W_0^{(l)} h_i^{(l)} \right),$$
where $\mathcal{N}_i^r$ denotes the set of neighbor indices of agent $i$ under relation $r \in \mathcal{R}$ and $c_{i,r}$ is a normalization constant. To capture the heterogeneity in MARL, similar to this convolution-based multi-relational GNN, we apply different attention heads to agent pairs with different relations.
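A minimal sketch of this multi-relational update, with the per-relation normalization constant $c_{i,r}$ taken to be the neighbor count and ReLU as the nonlinearity (both illustrative choices; the paper replaces the convolution with per-relation attention heads):

```python
def rgcn_update(h, neighbors, W_rel, W_self):
    """One forward pass of a relational GCN layer.
    h: list of node feature vectors.
    neighbors[r][i]: neighbor indices of node i under relation r.
    W_rel[r]: weight matrix for relation r; W_self: self-connection weights."""
    def matvec(W, x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in W]

    def relu(v):
        return [max(0.0, x) for x in v]

    out = []
    for i in range(len(h)):
        acc = matvec(W_self, h[i])          # self-connection term W_0 h_i
        for r, nbrs in enumerate(neighbors):
            if nbrs[i]:
                c = len(nbrs[i])            # normalization constant c_{i,r}
                for j in nbrs[i]:
                    m = matvec(W_rel[r], h[j])
                    acc = [a + mi / c for a, mi in zip(acc, m)]
        out.append(relu(acc))
    return out
```

Using a separate weight matrix per relation is what lets the network treat friend edges and foe edges differently while still sharing parameters within each relation.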
In this section, we present our approach to exploiting sparsity in MARL by generalizing the dense softmax attention to adaptive sparse attention. Moreover, we introduce our approach of applying a multi-relational attention mechanism to heterogeneous games involving competitive agents.
III-A Learning a communication graph via adaptive sparse attention
The scaled dot-product attention is applied to learn the communication graph in MARL. If the attention weight between a pair of agents is zero, then there is no communication/message passing between them. Thus, the normalization function in (2) is critical for learning a communication graph. As is usual in attention mechanisms and classification, this function is typically set to softmax, which cannot induce sparsity. We propose an adaptive sparse activation function as an alternative to softmax.
Let $z \in \mathbb{R}^N$ be the raw attention logits and $p$ be the normalized attention strength in the $(N-1)$-dimensional probability simplex defined as $\Delta^{N-1} = \{p \in \mathbb{R}^N : \mathbf{1}^\top p = 1,\; p \geq 0\}$. We are interested in the mapping from $z$ to $p$. In other words, such a mapping transforms real weights into a probability distribution, i.e., the normalized attention strengths between pairs of agents. The classical softmax, used in most attention mechanisms, is defined component-wise as
$$\mathrm{softmax}_j(z) = \frac{e^{z_j}}{\sum_k e^{z_k}}.$$
A limitation of the softmax transformation is that the resulting probability distribution always has full support, which makes the communication graph dense, resulting in high complexity. In order to reduce the complexity, our idea is to replace the softmax activation function with a generalized activation function, which can adaptively be dense or sparse based on the state. To investigate alternative activation functions to softmax, consider the max operator defined as
$$\max(z) = \max_j z_j = \sup_{p \in \Delta^{N-1}} p^\top z.$$
The second equality comes from the fact that the supremum of a linear form over a simplex is always achieved at a vertex, i.e., one of the standard basis vectors $e_j$. As a result, the max operator puts all the probability mass onto a single element; in other words, only one entry of $p$ is nonzero, corresponding to the largest entry of $z$. For example, with $N = 2$, the probability distribution with respect to the logits is a step function, which is discontinuous at the point where the two logits are equal. This discontinuity of the step function is not amenable to gradient-based optimization algorithms for training deep neural networks. One solution to the discontinuity issue encountered in (6) is to add a regularizer $\Omega$ to the max operator:
$$\Pi_{\Omega}(z) = \arg\max_{p \in \Delta^{N-1}} \; p^\top z - \gamma \Omega(p).$$
Different regularizers produce different mappings with distinct properties (see the summary in Table I). Note that with $\Omega$ as the negative Shannon entropy, $\Pi_{\Omega}$ recovers softmax. As the states/observations evolve, the ideal mapping should be able to adapt both the degree of sparsity (controlled via $\gamma$) and the sparsity pattern (controlled via the selection of $\Omega$) accordingly.
Note that the Tsallis entropy and the generalized entropy in Table I do not lead to closed-form solutions, which increases the computational burden since iterative numerical algorithms must be employed. Sparsemax has a closed-form solution and can induce sparsity, but sparsemax is not adaptive and lacks flexibility, as it is unable to switch from one sparsity pattern to another when necessary. We aim to combine the advantages and avoid the disadvantages using the new formulation (7), in which the regularizer is parameterized by a learnable neural network and a scalar. By choosing different regularizer networks, the mapping can exhibit different sparsity patterns, including softmax and sparsemax. With the network fixed, the scalar parameter controls how sparse the output can be, similar to the temperature parameter in softmax. The summary in Table II shows that (7) leads to a general mapping that can combine properties such as translation and scaling invariance adaptively. Prior work proposed sparse-hourglass, which can adjust the trade-off between translation and scaling invariance via tunable parameters. However, it is unclear under which circumstances one property is more desirable than the other, so there is little to no prior knowledge on how to tune such parameters. In contrast, our formulation in (7) balances this trade-off via learning, whereas sparse-hourglass is based on a fixed functional form with tunable parameters.
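For concreteness, the closed-form sparsemax mentioned above is the Euclidean projection onto the probability simplex, computable by a standard sort-and-threshold routine (this sketches plain sparsemax, not the learnable generalization in (7)):

```python
def sparsemax(z):
    """Euclidean projection of logits z onto the probability simplex:
    p_j = max(z_j - tau, 0), with tau chosen so the entries of p sum to one.
    Entries falling below tau are truncated to exactly zero."""
    zs = sorted(z, reverse=True)
    cumsum, tau = 0.0, 0.0
    for k, zk in enumerate(zs, start=1):
        cumsum += zk
        if 1.0 + k * zk > cumsum:       # z_(k) is inside the support
            tau = (cumsum - 1.0) / k
    return [max(zi - tau, 0.0) for zi in z]
```

On logits [2.0, 0.0] the output is exactly [1.0, 0.0], a sparse distribution, whereas softmax would assign positive mass to both entries; on nearly equal logits the projection lands in the interior of the simplex and the output stays dense, which is why the sparsity pattern depends on the inputs.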
While we could let the neural network learn without any restrictions, there is prior knowledge that we can exploit, namely monotonicity: a larger attention logit should be mapped to a larger attention strength. As sparsemax is monotonic, it suffices that the learned transformation preserve the order of its inputs, i.e., the order of the input coincides with that of the output. To keep this property, the transformation is designed component-wise using neural networks with hidden layers. Note that each component should be coupled with all of the entries of the input rather than being a univariate function of a single entry, as demonstrated in Table II. Order preservation then reduces to the monotonicity of these component networks, which we enforce by keeping all of their weights positive, applying an absolute value function to the weights. This architecture can accelerate the learning process with extra prior knowledge, as it is monotonic by design.
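The monotonicity-by-construction idea can be illustrated with a tiny two-layer network whose weights pass through an absolute value, so the map is non-decreasing in every input coordinate (layer sizes and names here are ours, not the paper's exact architecture):

```python
def relu(v):
    return [max(0.0, x) for x in v]

def monotone_mlp(x, W1, b1, W2, b2):
    """Two-layer MLP made monotone by taking |w| for every weight.
    Since |w| >= 0 and ReLU is non-decreasing, increasing any input
    coordinate can never decrease any output coordinate."""
    h = relu([sum(abs(w) * xi for w, xi in zip(row, x)) + b
              for row, b in zip(W1, b1)])
    return [sum(abs(w) * hi for w, hi in zip(row, h)) + b
            for row, b in zip(W2, b2)]
```

The stored weights may be negative during training; only their absolute values enter the forward pass, so gradient updates remain unconstrained while the realized function stays monotone.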
III-B Message passing in MARL via GNN
We now present how information is aggregated to learn a representation for the per-agent value/policy networks using a graph neural network. The scaled dot-product attention mechanism (Section II-B), with our generalized sparsemax as the activation function (denoted sparse-Att), is applied to learn a communication graph and to pass messages through the connections in the graph.
We start with a homogeneous multiagent system, where the relation between any agent pair is identical. A graph is defined as $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where each node in $\mathcal{V}$ represents an agent and the cardinality of $\mathcal{V}$ is the number of agents $N$. Moreover, the edge $(i, j)$ belongs to $\mathcal{E}$ if agents $i$ and $j$ can communicate directly (or agent $j$ is observable to agent $i$). This is a restriction on the communication graph, and $\mathcal{E}$ is the set of all possible edges. Then sparse-Att aims to learn a subset of $\mathcal{E}$ via induced sparsity without compromising much optimality. For agent $i$, let $o_i$ and $e_i$ be its observation and entity encoding, respectively, where $e_i$ is produced from the local state by a learnable agent encoder network. Then the initial observation embedding of agent $i$, denoted $h_i^{(0)}$, is
$$h_i^{(0)} = f\left(o_i \,\|\, e_i\right),$$
where $f$ is another learnable network and the operator $\|$ denotes concatenation. Then at hop $l$ (the $l$-th round of message passing), agent $i$ aggregates information from its possible neighbors as follows
With multiple hops, the message passing enables an agent to obtain information from beyond its immediate neighbors. In the message aggregation, identical parameters are used for all agents, which enforces permutation invariance. This property is desirable because homogeneous agents are interchangeable.
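Generic multi-hop message passing with parameters shared across agents can be sketched as follows; a simple mean aggregator stands in for the paper's sparse-Att head, but the permutation-invariance argument is the same:

```python
def mean_agg(hi, msgs):
    """Average an agent's own embedding with its neighbors' messages.
    Symmetric in the neighbors, hence permutation-invariant."""
    vecs = [hi] + msgs
    return [sum(v[t] for v in vecs) / len(vecs) for t in range(len(hi))]

def message_pass(h, adj, aggregate, hops):
    """K-hop message passing: at every hop each agent applies the SAME
    aggregate function to its neighbors' embeddings, so relabeling the
    (homogeneous, interchangeable) agents relabels the outputs identically."""
    for _ in range(hops):
        h = [aggregate(h[i], [h[j] for j in range(len(h)) if adj[i][j]])
             for i in range(len(h))]
    return h
```

After `hops` rounds, each agent's embedding depends on agents up to `hops` edges away, which is why two-hop passing suffices on a star-shaped communication graph.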
However, interchangeability no longer applies to heterogeneous systems or mixed cooperative-competitive environments. For example, under a two-team partition of the agents, each agent cooperates with agents from its own team but competes against agents from the other team. For agent $i$, we define its teammate neighborhood and its enemy neighborhood; the edges connecting teammates and enemies are called positive and negative edges, respectively. Then, based on the multi-relational GNN, agent $i$ aggregates information at each hop using different attention heads for positive and negative edges. Additionally, balance theory suggests that "the teammate of my teammate is my teammate" and "the enemy of my enemy is my teammate." In a two-team competitive game, any walk (a sequence of nodes and edges of a graph) between an agent pair in the communication graph, comprising both positive and negative edges, leads to the same relation between that agent pair. This property eliminates the ambiguity that information aggregated from the same agent via different walks might carry a different teammate/enemy label.
The proposed algorithmic framework is illustrated in Fig. 1. After the rounds of message passing, each agent has an updated encoding. This encoding is then fed into the value network and the policy network, which estimate the state value and a probability distribution over all possible actions, respectively. As homogeneous agents are interchangeable, they share all of the parameters, including the entity encoder, policy, value, and message-passing networks. Proximal policy optimization (PPO) is employed to train the model in an end-to-end manner. As only local information is required, the proposed approach is decentralized. Moreover, our approach maintains the transferability of GNN-based approaches, as all of the network dimensions are invariant to the number of agents/entities in the system.
IV-A Task description
The proposed algorithm is evaluated in three swarm robotics tasks: Coverage, Formation, and ParticleSoccer, the first two of which are cooperative and the third competitive. The tasks are simulated in the Multiagent Particle Environment (MAPE, https://github.com/openai/multiagent-particle-envs). The agents in MAPE move in a 2-dimensional space following double-integrator dynamics. The action space of the agents is discretized, with each agent able to accelerate/decelerate in both the $x$ and $y$ directions. The three tasks are briefly introduced as follows.
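A double-integrator step with discretized acceleration commands might look like the following sketch (the timestep and damping constants are illustrative, not MAPE's exact values):

```python
def double_integrator_step(pos, vel, accel, dt=0.1, damping=0.25):
    """One step of double-integrator dynamics: the discrete action sets the
    acceleration, which integrates into velocity and then into position."""
    vel = [(1.0 - damping) * v + a * dt for v, a in zip(vel, accel)]
    pos = [p + v * dt for p, v in zip(pos, vel)]
    return pos, vel
```

Because the action only sets an acceleration, agents cannot stop or turn instantaneously, which is what makes anticipatory coordination (and hence communication) useful in these tasks.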
Coverage: There are agents (light purple) and landmarks (black) in the environment (see the illustration in Fig. 1(a)). The objective of the agents is to cover all landmarks in the smallest possible number of timesteps. Agents are not assigned to particular landmarks; instead, they have to figure out the assignment via communication so that the task is completed optimally.
Formation: There are an even number of agents (blue) and a landmark (black) in the environment (see the illustration in Fig. 1(b)). The agents need to split into two sub-teams of equal size, each building a formation of a regular pentagon. The two regular pentagons have different sizes and are both centered at the landmark.
ParticleSoccer: There are agents and 3 landmarks in the environment (see the illustration in Fig. 1(c)), with the bigger landmark being a movable ball and the two smaller ones being fixed goals. A team wins the game by pushing the black ball into the opposing team's goal. The goal of the light blue (red, resp.) team is colored blue (red, resp.).
IV-B Implementation specifications
The agent encoder and the entity encoder take the agent states and the entity states as input, respectively. The queries, keys, and values in all of the sparse attention mechanisms share a common dimension. The communication hop is 2. All neural networks are fully connected with the ReLU activation function. In the sparsity-promoting function (7), the learnable components each have one hidden layer. The absolute value function is used to keep the weights of the monotonicity-preserving neural networks positive.
Evaluation is performed at a fixed episode interval, and the PPO update is executed for a fixed number of epochs after each batch of experience is collected.
In the cooperative scenarios, i.e., Coverage and Formation, two metrics are used to evaluate the algorithms. The first is the average reward per step, and the second is the task success rate. Higher is better for both metrics.
We compare our algorithm with two baselines: GNN-based MARL with a dense attention mechanism and MAAC. These two algorithms are considered strong baselines, as they have reported advantageous results against algorithms including MADDPG, COMA, VDN, and QMIX. Their public repositories (https://github.com/sumitsk/matrl.git and https://github.com/shariqiqbal2810/MAAC) are used for comparison. As both repositories also apply their algorithms to MAPE, the default hyperparameters are used.
The learning curves demonstrate that our algorithm achieves higher rewards than the two baselines in fewer episodes. This validates that sparse-Att can accelerate the learning process by aggregating information from the agents that matter the most. Moreover, in terms of the second metric, i.e., success rate, our algorithm consistently outperforms the two baselines by a significant margin (with much smaller variance), as shown in Fig. 5. The evaluations of both metrics in the two scenarios provide strong support for the advantages of our algorithm.
Performance comparison of the three algorithms on the two scenarios. Multiple policies learned by each algorithm are evaluated, and the mean/standard deviation are plotted.
For the competitive ParticleSoccer task, the red team and the blue team have equal size. As this task is competitive, the two metrics above are no longer applicable. Instead, we let the red (blue, resp.) team of one algorithm play against the blue (red, resp.) team of another algorithm. Table III presents the results of this inter-algorithm competition. The overall score of each algorithm equals the sum of the evaluation episodes won by its red team and its blue team when playing against the blue and red teams, respectively, of the other algorithms. The overall scores in Table III show that our algorithm can learn strong policies.
IV-D Interpretability of the sparse communication graph
Let us proceed by considering the inherent sparsity in Formation and ParticleSoccer. As mentioned in the description of the Formation scenario, the formation of each pentagon involves only half of the agents, while the sub-team assignments need to be learned. In our implementation, the reward requires the agents closest to the landmark to build the inner pentagon and the remaining agents to build the outer pentagon. As the learning algorithm converges, once a sub-team partition is learned to complete the two sub-tasks, the learned agent indexing of each sub-team should not vary, due to the distance sorting and the fact that the two pentagons are relatively far apart. As a result, the reward for completing each sub-task is only related to the corresponding sub-team, and hence the two sub-teams are decoupled from each other. The adjacency matrix of the learned communication graph, shown in Fig. 5(a), validates that the inter-team communication is very sparse. This adjacency matrix is determined up to row/column permutation, as the indexing within each sub-team is learned rather than known a priori. Moreover, within a sub-team, the algorithm learns a communication graph similar to a star graph, which can be understood as each sub-team selecting a leader. As a star graph is a connected graph with the minimum possible number of edges, this communication protocol is both effective and efficient. Also, the length of the path between any agent pair in a star graph is no greater than 2, which echoes the two-hop communication used in the simulation: with two-hop message passing, agents can eventually communicate with agents as far as two edges away, which covers all of the agents in a star graph. Note that the sparsity of the diagonal entries of the communication graph does not mean that an agent's own information is neglected, as that information is concatenated separately; see (9).
Similarly, in the ParticleSoccer scenario, from each team's perspective, agents need to coordinate tightly within the team to push the ball toward the other team's goal, while attending to only a small number of agents from the other team. This leads to dense intra-team communication but relatively sparse inter-team communication, which is validated by the approximately block-diagonal adjacency matrix of the learned communication graph in Fig. 5(b).
V Conclusions and Future Work
This paper exploits sparsity to scale up Multi-Agent Reinforcement Learning (MARL), which is motivated by the fact that interactions are often sparse in multiagent systems. We propose a new general and adaptive sparsity-inducing activation function to empower an attention mechanism, which can learn a sparse communication graph among agents. The sparse communication graph can make the message-passing both effective and efficient such that the scalability of MARL is improved without compromising optimality. Our algorithm outperforms two baselines by a significant margin on three tasks. Moreover, for scenarios with inherent sparsity, it is shown that the sparsity of the learned communication graph is interpretable.
Future work will focus on combining evolutionary population curriculum learning and graph neural network to further improve the scalability. In addition, robust learning against evolving/learned adversarial attacks is also of great interest.
Research is supported by Scientific Systems Company, Inc. under research agreement SC-1661-04. Authors would like to thank Dong-Ki Kim, Samir Wadhwania and Michael Everett for their many useful discussions and Amazon Web Services for computation support.
- (2019) Learning transferable cooperative behavior in multi-agent teams. arXiv preprint arXiv:1906.01202.
- (2002) The complexity of decentralized control of Markov decision processes. Mathematics of Operations Research 27(4), pp. 819–840.
- (2018) Learning classifiers with Fenchel-Young losses: generalized entropies, margins, and algorithms. arXiv preprint arXiv:1805.09717.
- (2008) A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 38(2), pp. 156–172.
- (2015) Attention-based models for speech recognition. In Advances in Neural Information Processing Systems, pp. 577–585.
- (2019) Adaptively sparse transformers. arXiv preprint arXiv:1909.00015.
- (2018) TarMAC: targeted multi-agent communication. arXiv preprint arXiv:1810.11187.
- (2009) Incorporating functional knowledge in neural networks. Journal of Machine Learning Research 10(Jun), pp. 1239–1262.
- (2012) Networks, crowds, and markets: reasoning about a highly connected world. Significance 9, pp. 43–44.
- (2018) Counterfactual multi-agent policy gradients. In Thirty-Second AAAI Conference on Artificial Intelligence.
- (1946) Attitudes and cognitive organization. The Journal of Psychology 21(1), pp. 107–112.
- (2017) Guided deep reinforcement learning for swarm systems. arXiv preprint arXiv:1709.06011.
- (2018) Actor-attention-critic for multi-agent reinforcement learning. arXiv preprint arXiv:1810.02912.
- (2018) Graph convolutional reinforcement learning for multi-agent cooperation. arXiv preprint arXiv:1810.09202.
- (2019) Graph policy gradients for large scale robot control. arXiv preprint arXiv:1907.03822.
- (2013) Reinforcement learning in robotics: a survey. The International Journal of Robotics Research 32(11), pp. 1238–1274.
- (2013) Concepts and recent advances in generalized information measures and statistics. Bentham Science Publishers.
- (2018) On controllable sparse alternatives to softmax. In Advances in Neural Information Processing Systems, pp. 6422–6432.
- (2017) Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, pp. 6379–6390.
- (2016) From softmax to sparsemax: a sparse model of attention and multi-label classification. In International Conference on Machine Learning, pp. 1614–1623.
- (2015) Human-level control through deep reinforcement learning. Nature 518(7540), pp. 529–533.
- (2017) A regularized framework for sparse and structured neural attention. In Advances in Neural Information Processing Systems, pp. 3338–3348.
- (2018) QMIX: monotonic value function factorisation for deep multi-agent reinforcement learning. arXiv preprint arXiv:1803.11485.
- (2004) Swarm robotics: from sources of inspiration to domains of application. In International Workshop on Swarm Robotics, pp. 10–20.
- (2018) Modeling relational data with graph convolutional networks. In European Semantic Web Conference, pp. 593–607.
- (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529(7587), pp. 484–489.
- (2017) Mastering the game of Go without human knowledge. Nature 550(7676), pp. 354–359.
- (2017) Value-decomposition networks for cooperative multi-agent learning. arXiv preprint arXiv:1706.05296.
- (2018) Value-decomposition networks for cooperative multi-agent learning based on team reward. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pp. 2085–2087.
- (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
- (2018) Mean field multi-agent reinforcement learning. arXiv preprint arXiv:1802.05438.