S2RL: Do We Really Need to Perceive All States in Deep Multi-Agent Reinforcement Learning?

Collaborative multi-agent reinforcement learning (MARL) has been widely used in many practical applications, where each agent makes a decision based on its own observation. Most mainstream methods treat each local observation as a single whole when modeling the decentralized local utility functions. However, they ignore the fact that a local observation can be further divided into several entities, only some of which are helpful for model inference. Moreover, the importance of different entities may change over time. To improve the performance of decentralized policies, the attention mechanism is used to capture features of local information. Nevertheless, existing attention models rely on dense fully connected graphs and cannot single out the important states. To this end, we propose a sparse state based MARL (S2RL) framework, which utilizes a sparse attention mechanism to discard irrelevant information in local observations. The local utility functions are estimated through the self-attention and sparse attention mechanisms separately, and are then combined into a standard joint value function and an auxiliary joint value function in the central critic. We design the S2RL framework as a plug-and-play module, making it general enough to be applied to various methods. Extensive experiments on StarCraft II show that S2RL can significantly improve the performance of many state-of-the-art methods.



1. Introduction

Multi-agent reinforcement learning (MARL) provides a framework for multiple agents to solve complex sequential decision-making problems, with broad applications including robotics control (Li et al., 2021; Lillicrap et al., 2016), video gaming (Vinyals et al., 2019; Liu et al., 2019), traffic light control (Wu et al., 2017) and autonomous driving (Kiran et al., 2022; Cao et al., 2013). In the paradigm of centralized training with decentralized execution (CTDE) (Lowe et al., 2017; Rashid et al., 2018), each local agent models a policy that takes the local observation as input. However, the role of entities is underestimated by most mainstream methods. Entities are defined as fine-grained tokens of observations, e.g., obstacles, landmarks and enemies, which determine the inference process of the model. Specifically, these methods treat all observed entities as a whole, so that every entity contributes indiscriminately to the estimation of the value function. In some cases, however, the importance of each entity changes dynamically over time steps.

Figure 4. A visualization example of agent performance on the StarCraft II super-hard scenario 6h_vs_8z. (a) Six friendly Hydralisks face 8 enemy Zealots. The enemies closest to the green agent H3 are the red enemies Z0 and Z5, so the corresponding policy is that H3 only needs to focus on Z0 and Z5, which are more likely to be annihilated. (b) Dense attention distribution: the softmax attention over the observations of H3 still assigns some weight to irrelevant entities. (c) Sparse attention distribution: in contrast, the sparse attention only focuses on Z0 and Z5.

To better leverage the observation information, the attention mechanism (Vaswani et al., 2017) has been adopted for its ability to learn the interaction relationship among entities and dynamically focus on the crucial parts. Most existing attention mechanisms compute importance weights based on dense fully-connected graphs, where all participants are assigned scores according to their contribution to model decisions. In practice, however, not all entities are helpful for model inference, and discarding redundant entities can sometimes improve overall performance. Therefore, it is crucial for agents to learn to select valuable observations and exclude others. To illustrate this phenomenon, a visualization of a StarCraft II scene and the corresponding attention distribution is shown in Figure 4. The green agent H3 is very close to the red enemies Z0 and Z5, so it is more effective for H3 to focus only on Z0 and Z5. However, the traditional dense attention distribution of H3 assigns much attention to irrelevant entities. This redundancy matters because, in more complex MARL environments, the large state space makes policy learning considerably harder.

A natural way to solve this issue is to replace the traditional attention mechanism with sparse attention. From Figure 4(c), we can see that adopting sparse attention guides H3 to perceive only Z0 and Z5, reducing the observation space that the agent needs to perceive. However, simply applying sparse attention to local agents will corrupt the training. The main reason is that the network cannot distinguish which entity is more important at the beginning of training. If the agents focus only on seemingly critical entities from the start, they may explore the environment inadequately and thus converge to a suboptimal policy. More specifically, temporarily discarding some entities can be seen as a form of policy exploration. Meanwhile, local policies also need to execute their own exploration strategies. When these two kinds of exploration are executed simultaneously, it is difficult for the model to converge.

In this paper, we propose a Sparse State based MARL (S2RL) framework, where the sparse attention mechanism is utilized as an auxiliary for guiding the local agent training. In particular, we model the local utility function using a traditional self-attention mechanism. Then, we construct a corresponding auxiliary utility function for each agent, which is implemented by a sparse attention mechanism. The local utility and auxiliary utility functions form the joint value and auxiliary value functions, respectively, which are further used to train the entire network. Because the sparse attention mechanism acts only as an auxiliary, it does not corrupt the training process, and the auxiliary value function can also be used to update the entire framework. In this way, local agents can learn to focus on essential entities while ignoring redundant ones.

Our main contributions are summarized as follows:

  • To the best of our knowledge, this paper is the first attempt to use enhanced awareness of crucial states as an auxiliary in MARL to improve the convergence rate and performance.

  • We propose the S2RL framework for local agents to perceive crucial states while preserving all states. The proposed framework thus avoids the convergence failure that arises when agents rely on only a small subset of their partial observations.

  • We design the S2RL framework as a plug-and-play module, making it general enough to be applied to various methods in the CTDE paradigm.

  • Extensive experiments on StarCraft II show that S2RL brings remarkable improvements to existing methods, especially in complicated scenarios.

The remainder of the paper is organized as follows. In Section 2, we introduce the background of MARL and the CTDE framework. In Section 3, we propose our S2RL framework. Experimental results are presented in Section 4. Related works are presented in Section 5. Section 6 concludes the paper.

Figure 5. Illustration of the proposed S2RL method. We use the value-based MARL framework under the CTDE paradigm and apply the S2RL method to an agent utility network. The core of S2RL is composed of the dense attention module and sparse attention module, where sparse attention serves as an auxiliary for guiding the dense attention training.

2. Preliminaries

2.1. Dec-POMDP

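A fully cooperative multi-agent sequential task can be described as a decentralized partially observable Markov decision process (Dec-POMDP) (Oliehoek and Amato, 2016), which is canonically formulated by the tuple:

$$G = \langle \mathcal{N}, \mathcal{S}, \mathcal{A}, P, \Omega, O, r, \gamma \rangle. \tag{1}$$

In the process, $\mathcal{N} = \{1, \ldots, N\}$ is the finite set of agents and $s \in \mathcal{S}$ represents the global state of the environment. At each time step, each agent $i \in \mathcal{N}$ receives an individual partial observation $o_i \in \Omega$ according to the observation function $O(s, i)$ and selects an action $a_i \in \mathcal{A}$, forming a joint action $\boldsymbol{a} = (a_1, \ldots, a_N)$. This results in a transition to the next state $s'$ according to the state transition function $P(s' \mid s, \boldsymbol{a})$. All agents share the same global reward based on the reward function $r(s, \boldsymbol{a})$, and $\gamma \in [0, 1)$ is the discount factor. Due to the partially observable setting, each agent has an action-observation history $\tau_i$ and learns its individual policy $\pi_i(a_i \mid \tau_i)$ to jointly maximize the discounted return. The joint policy $\boldsymbol{\pi}$ induces a joint action-value function: $Q^{\boldsymbol{\pi}}(s, \boldsymbol{a}) = \mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^t r^t \mid s^0 = s, \boldsymbol{a}^0 = \boldsymbol{a}, \boldsymbol{\pi}\big]$.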
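As a small illustration of the discounted return that the agents jointly maximize, the following Python sketch sums a sequence of shared team rewards weighted by powers of the discount factor. The function name `discounted_return` and the reward values are illustrative assumptions and are not taken from the paper.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for one episode of shared team rewards."""
    total = 0.0
    # Iterate backwards so that each step applies one multiplication by gamma.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical episode: three shared rewards and a discount factor of 0.5.
episode_rewards = [1.0, 0.0, 4.0]
print(discounted_return(episode_rewards, gamma=0.5))  # 1 + 0.5 * 0 + 0.25 * 4 = 2.0
```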

2.2. CTDE Framework

Centralized training with decentralized execution (CTDE) is a popular paradigm in deep multi-agent reinforcement learning (Foerster et al., 2016; Rashid et al., 2018; Mao et al., 2020; Yang et al., 2020a; Foerster et al., 2018), which enables agents to learn their individual policies in a centralized way. During the centralized training process, the learning model can access the state and provide global information to assist the agents in exploring and training. However, each agent only makes decisions based on its local action-observation history during decentralized execution. The Individual-Global-Max principle (Son et al., 2019) guarantees the consistency between joint and local greedy action selections. Under this principle, agents can obtain the optimal global reward by maximizing the individual utility function of each agent. Thus a more robust individual value function can benefit the whole team in cooperative multi-agent tasks.
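The consistency between joint and local greedy action selection can be checked on a toy example. The sketch below is only an illustration under stated assumptions: it uses a plain sum as the mixing function (an additive, VDN-style factorization), the utility values are made up, and the helper names `greedy_actions` and `joint_greedy_action` are hypothetical.

```python
from itertools import product


def greedy_actions(per_agent_q):
    """Pick each agent's greedy action from its own utility values."""
    return tuple(max(range(len(q)), key=q.__getitem__) for q in per_agent_q)


def joint_greedy_action(per_agent_q, mix):
    """Brute-force the joint action that maximizes the mixed joint value."""
    action_ranges = [range(len(q)) for q in per_agent_q]
    return max(
        product(*action_ranges),
        key=lambda joint: mix([q[a] for q, a in zip(per_agent_q, joint)]),
    )


# Hypothetical utilities: 2 agents with 3 actions each.
utilities = [[0.1, 0.7, 0.3], [0.5, 0.2, 0.9]]
# With an additive mixing function, the two selections coincide.
assert greedy_actions(utilities) == joint_greedy_action(utilities, sum)
print(greedy_actions(utilities))  # (1, 2)
```

The brute-force search is exponential in the number of agents and is used here only to verify the property; decentralized execution relies on the per-agent argmax alone.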

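The global Q-function is calculated from all individual value functions: $Q_{tot}(\boldsymbol{\tau}, \boldsymbol{a}) = f_{\phi}\big(Q_1(\tau_1, a_1), \ldots, Q_N(\tau_N, a_N)\big)$, where $\boldsymbol{\tau}$ is a joint action-observation history, $\boldsymbol{a}$ is a joint action, and $f_{\phi}$ is the credit assignment function parameterized by $\phi$ to learn value function factorization. Each agent learns its own utility function $Q_i$ by maximizing the global value function $Q_{tot}$, which is trained end-to-end to minimise the following TD loss:

$$\mathcal{L}(\theta) = \mathbb{E}_{(\boldsymbol{\tau}, \boldsymbol{a}, r, \boldsymbol{\tau}') \sim \mathcal{D}}\Big[\big(r + \gamma \max_{\boldsymbol{a}'} Q_{tot}(\boldsymbol{\tau}', \boldsymbol{a}'; \theta^-) - Q_{tot}(\boldsymbol{\tau}, \boldsymbol{a}; \theta)\big)^2\Big], \tag{2}$$

where $\mathcal{D}$ is the replay buffer, and $\theta^-$ is the parameter of the target network (Mnih et al., 2015).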
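A minimal sketch of the TD loss described above is given below, assuming the joint values have already been computed by the online and target networks. The transition layout and the `done` flag for terminal transitions are assumptions made for illustration; the function name `td_loss` is hypothetical.

```python
def td_loss(batch, gamma):
    """Mean squared TD error over a batch of transitions.

    Each transition is (q_tot, reward, next_target_values, done), where
    next_target_values lists the target network's joint values for the
    candidate next joint actions.
    """
    squared_errors = []
    for q_tot, reward, next_target_values, done in batch:
        # Bootstrapped target: no future value after a terminal transition.
        bootstrap = 0.0 if done else gamma * max(next_target_values)
        target = reward + bootstrap
        squared_errors.append((target - q_tot) ** 2)
    return sum(squared_errors) / len(squared_errors)


batch = [
    (1.0, 0.5, [1.0, 2.0], False),  # target = 0.5 + 0.5 * 2.0 = 1.5
    (2.0, 1.0, [3.0, 0.0], True),   # target = 1.0 (terminal)
]
print(td_loss(batch, gamma=0.5))  # (0.5**2 + 1.0**2) / 2 = 0.625
```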

3. Sparse state based MARL

In this section, we propose a novel sparse state based MARL framework that is general enough to be plugged into any existing value-based multi-agent algorithm. As shown in Figure 5, our framework adopts the CTDE paradigm, where each agent learns its individual utility network by optimizing the TD loss of the mixing network. During the execution, the mixing network is removed, and each agent acts according to its local policy derived from its value function. Distinct from other value-based methods, our agents’ value functions or policies perform selection and discrimination according to the importance of the different entities of the state. To enable efficient and effective learning over the different entities of the state, our method can be described in three steps: 1) selection; 2) discrimination; 3) learning.

3.1. Selection and Discrimination

Assigning attention based on the contribution of the observed entities to the value estimation is a dynamic process. In our framework, we adopt the self-attention module (Vaswani et al., 2017) to capture the relational weights between the observed entities of the agents. In particular, an agent $i$ observes $m$ entities at time step $t$; the corresponding input of the utility network is defined as $o_i^t = [e_1^t, \ldots, e_m^t]^\top \in \mathbb{R}^{m \times d_e}$, with $d_e$ being the entity dimension and $e_j^t$ being the state information of the $j$-th ($j \in \{1, \ldots, m\}$) entity. All observed entities are embedded to dimension $d_x$ via an embedding function $f$ as follows:

$$X = [f(e_1^t), \ldots, f(e_m^t)]^\top \in \mathbb{R}^{m \times d_x}. \tag{3}$$

Then, the embedding feature of each agent is projected to query $Q = XW_Q$, key $K = XW_K$ and value $V = XW_V$ representations, where $W_Q, W_K, W_V \in \mathbb{R}^{d_x \times d_k}$ are trainable weight matrices. Then, $Q$, $K$, $V$ are input into the self-attention layer to calculate the entities’ importance for the model decision, which is given by

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\Big(\frac{QK^\top}{\sqrt{d_k}}\Big) V. \tag{4}$$
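The dense self-attention step over observed entities can be sketched in plain Python as follows. This is only an illustration under stated assumptions: it uses standard scaled dot-product attention with a row-wise softmax, the entity embeddings and projection matrices are made-up values, and the helper names (`matmul`, `softmax`, `dense_attention`) are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dense_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over the rows (entities) of x."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]  # transpose of the key matrix
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]
    return weights, matmul(weights, v)


# Hypothetical input: 3 embedded entities of dimension 2, identity projections.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
weights, output = dense_attention(x, identity, identity, identity)
```

Every entry of `weights` is strictly positive, which is the dense behaviour that the sparse variant is meant to avoid.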


One of the limitations of the softmax activation function is that the resulting probability weights are never exactly zero for any element, which leads to dense output probabilities. Nevertheless, to simplify the exploration space and select valuable observations, it is crucial for agents to reduce the number of entities to focus on. Hence, a sparse probability distribution is desired to distinguish between critical and irrelevant entities, which can accelerate convergence and improve performance. To start with, inspired by sparsemax (Martins and Astudillo, 2016; Peters et al., 2019), we consider introducing sparse states to enhance the perception of valuable entities in the agent observation and to neglect the others.

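We denote the scaled products of the query with all keys as $P = QK^\top / \sqrt{d_k} \in \mathbb{R}^{m \times m}$, which consists of $m$ rows with $p_i$ being the logits of the $i$-th row. Afterwards, we define a matrix sorting operator as follows:

$$\mathrm{Sort}(P) = [\mathrm{sort}(p_1), \ldots, \mathrm{sort}(p_m)]^\top, \tag{5}$$

where $\mathrm{sort}(p_i)$ sorts the elements of vector $p_i$ in descending order. Then we calculate

$$k_i = \max\Big\{ k \in \{1, \ldots, m\} : 1 + k\,\mathrm{sort}(p_i)_k > \sum_{j=1}^{k} \mathrm{sort}(p_i)_j \Big\}, \tag{6}$$

where $k_i$ is the maximal number of crucial elements in $p_i$ that we intend to preserve, while the other elements are set to zero in the subsequent operations. We define

$$c = [c_1, \ldots, c_m]^\top \tag{7}$$

with $c_i = \sum_{j=1}^{k_i} \mathrm{sort}(p_i)_j - 1$ and the scaling vector as

$$u = [1/k_1, \ldots, 1/k_m]^\top. \tag{8}$$

Then, the threshold matrix is calculated as

$$T = (c \odot u)\,\mathbf{1}^\top, \tag{9}$$

where $\mathbf{1} \in \mathbb{R}^{m}$ is an all-one vector and $\odot$ is the pointwise product. The sparse attention weights matrix is obtained by

$$\hat{A} = [P - T]_+, \tag{10}$$

where $[x]_+ = \max(0, x)$ is applied elementwise. Thus, the sparse attention is given by

$$\mathrm{SparseAttention}(Q, K, V) = \hat{A} V, \tag{11}$$

which can retain most of the essential properties of softmax while assigning zero probability to low-scoring choices. Therefore, the model will pay more attention to critical entities when making decisions, reducing the attention to other redundant entities.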
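To make the sort-and-threshold procedure concrete, the sketch below applies the standard sparsemax projection of Martins and Astudillo (2016) to a single row of logits. It is an assumption that this row-wise form matches the matrix formulation used in the paper; the function name `sparsemax` and the example logits are illustrative only.

```python
def sparsemax(logits):
    """Project one row of logits onto the probability simplex (sparsemax)."""
    sorted_logits = sorted(logits, reverse=True)
    cumulative = 0.0
    support_size, support_sum = 0, 0.0
    for k, value in enumerate(sorted_logits, start=1):
        cumulative += value
        # Entry k stays in the support while 1 + k * z_(k) > sum of the top k.
        if 1.0 + k * value > cumulative:
            support_size, support_sum = k, cumulative
    # Threshold shared by all entries of this row.
    threshold = (support_sum - 1.0) / support_size
    return [max(0.0, v - threshold) for v in logits]


# Hypothetical attention logits of one agent over four observed entities.
row = [2.0, 1.5, 0.1, -1.0]
print(sparsemax(row))  # [0.75, 0.25, 0.0, 0.0]
```

In this example only the two highest-scoring entities keep a non-zero weight, and the weights still sum to one, whereas a softmax over the same logits would assign a positive weight to all four.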

Algorithm 1 Sparse State based MARL Algorithm

Initialize: critic network $\phi$, target critic $\phi^-$ = $\phi$, agents’ Q-value networks $\theta$ and replay buffer $\mathcal{D}$
for each training episode do
    $t = 0$, $s^0$ = initial state, and $h_i^0 = \mathbf{0}$ for each agent $i$
    while $s^t \neq$ terminal and $t < T$ do
        $t = t + 1$
        for each agent $i$ do
            Calculate dense attention by (12)
            Calculate sparse attention by (13)
            Calculate trajectory encoding $h_i^t$ by (14)
            Obtain $Q_i$ by (15)
            Obtain $\hat{Q}_i$ by (16)
            Sample $a_i^t$ from $Q_i$
        end for
        Execute actions $\boldsymbol{a}^t$
        Receive reward $r^t$ and next state $s^{t+1}$
    end while
    Store episodes in replay buffer $\mathcal{D}$
    Sample a random minibatch of episodes from $\mathcal{D}$
    Dense Attention Loss:
        Compute $\mathcal{L}_{TD}$ by (17)
    Auxiliary Sparse Attention Loss:
        Compute $\mathcal{L}_{AUX}$ by (18)
    Update $\theta$ and $\phi$ by (19)
    Every $C$ episodes reset $\phi^-$ = $\phi$
end for

3.2. Learning with Sparse Loss

The sparse attention mechanism could be realized by directly replacing the traditional self-attention activation function with a sparse distribution function. However, the model cannot distinguish which entity is vital from the beginning, so directly adopting the sparse attention mechanism degrades performance. To address this issue, we design the structure shown on the right side of Figure 5 to guide the training of local agents, where we utilize two routes to exploit dense and sparse attention, respectively. Dense attention guarantees that the algorithm can converge, while sparse attention is a powerful auxiliary to enhance the agent’s perception of critical entities, thereby improving performance.

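To do this, the sparse attention module and dense attention module share the weight matrices $W_Q$, $W_K$, $W_V$ and the GRU module. Denote the parameters of these two networks by $\theta$. The projected matrices $Q$, $K$ and $V$ are fed into both dense and sparse attention. Then, we calculate the weighted sum of $V$ to obtain the outputs

$$Z = \mathrm{softmax}\Big(\frac{QK^\top}{\sqrt{d_k}}\Big) V, \tag{12}$$

$$\hat{Z} = \hat{A} V, \tag{13}$$

where $\hat{A}$ is the sparse attention weights matrix of Section 3.1.

In our implementation, the GRU (Cho et al., 2014) module is utilized to encode an agent’s history of observations and actions via

$$h_i^t = \mathrm{GRU}(o_i^t, a_i^{t-1}, h_i^{t-1}). \tag{14}$$

Then, $Z$ and $\hat{Z}$ are concatenated with the output of the GRU separately to estimate the individual value functions as follows:

$$Q_i(\tau_i, \cdot) = g_\theta([Z, h_i^t]), \tag{15}$$

$$\hat{Q}_i(\tau_i, \cdot) = g_\theta([\hat{Z}, h_i^t]), \tag{16}$$

where $[\cdot, \cdot]$ denotes concatenation and $g_\theta$ maps the concatenated features to action values.

Each agent selects the actions that maximize $Q_i$ and $\hat{Q}_i$ for subsequent computations in centralized training. In addition, the action selected by $Q_i$ is executed in the environment. For the exploration strategy, $\epsilon$-greedy is adopted, and the exploration rate $\epsilon$ decreases over time.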
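The exploration strategy can be sketched as an epsilon-greedy rule with a linearly annealed exploration rate. The schedule values used below (start, finish and annealing length) are placeholders chosen for illustration and are not taken from the paper; the function names are hypothetical.

```python
import random


def linear_epsilon(step, start, finish, anneal_steps):
    """Linearly anneal epsilon from start to finish, then hold it constant."""
    fraction = min(step / anneal_steps, 1.0)
    return start + fraction * (finish - start)


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a uniformly random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


rng = random.Random(0)
# Assumed schedule values, for illustration only.
eps = linear_epsilon(step=25_000, start=1.0, finish=0.05, anneal_steps=50_000)
# With epsilon = 0 the rule is purely greedy and returns the index of the largest value.
action = epsilon_greedy([0.2, 0.9, 0.1], epsilon=0.0, rng=rng)
```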

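To better learn the role of entities in credit assignment, we use a mixing network to estimate the global Q-values $Q_{tot}$ and $\hat{Q}_{tot}$, using the per-agent utilities $Q_i$ and $\hat{Q}_i$. Since the auxiliary estimation is calculated in the individual utility function, our proposed S2RL is seamlessly integrated with various value-based algorithms. For example, we can use the mixing network, a feed-forward neural network introduced by QMIX (Rashid et al., 2018). The mixing network mixes the agent network outputs monotonically. The parameters of the mixing network, denoted by $\phi$, are conditioned on the global state $s$ and are generated by a hyper-network. Then, we minimize the following TD loss to update the dense attention module:

$$\mathcal{L}_{TD}(\theta, \phi) = \mathbb{E}_{\mathcal{D}}\Big[\big(r + \gamma \max_{\boldsymbol{a}'} Q_{tot}^{-}(\boldsymbol{\tau}', \boldsymbol{a}', s') - Q_{tot}(\boldsymbol{\tau}, \boldsymbol{a}, s; \theta, \phi)\big)^2\Big], \tag{17}$$

where $Q_{tot}^{-}$ is the target network, and the expectation is estimated with uniform samples from the same replay buffer $\mathcal{D}$. Meanwhile, the AUX loss is given by

$$\mathcal{L}_{AUX}(\theta, \phi) = \mathbb{E}_{\mathcal{D}}\Big[\big(r + \gamma \max_{\boldsymbol{a}'} \hat{Q}_{tot}^{-}(\boldsymbol{\tau}', \boldsymbol{a}', s') - \hat{Q}_{tot}(\boldsymbol{\tau}, \boldsymbol{a}, s; \theta, \phi)\big)^2\Big], \tag{18}$$

where $\hat{Q}_{tot}^{-}$ is the auxiliary target network.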
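The monotonic mixing idea can be illustrated with a simplified, QMIX-style sketch in plain Python. It assumes a single hidden layer, linear hypernetworks and no bias terms, which is a simplification rather than the architecture used in the paper; all names and numeric values are hypothetical.

```python
import math


def linear(weights, vector):
    """Apply a weight matrix (list of rows) to a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def monotonic_mix(agent_qs, state, hyper_w1, hyper_w2):
    """Mix per-agent utilities with state-conditioned, non-negative weights."""
    n_agents = len(agent_qs)
    # Hypernetwork outputs pass through abs() so that the joint value is
    # non-decreasing in every per-agent utility.
    w1_flat = [abs(w) for w in linear(hyper_w1, state)]
    hidden_dim = len(w1_flat) // n_agents
    w2 = [abs(w) for w in linear(hyper_w2, state)]
    hidden = []
    for j in range(hidden_dim):
        pre = sum(agent_qs[i] * w1_flat[i * hidden_dim + j] for i in range(n_agents))
        hidden.append(pre if pre > 0 else math.expm1(pre))  # ELU activation
    return sum(h * w for h, w in zip(hidden, w2))


# Hypothetical setup: 2 agents, state of dimension 2, 2 hidden units.
state = [1.0, 2.0]
hyper_w1 = [[0.5, -1.0], [1.0, 0.2], [-0.3, 0.4], [0.7, 0.1]]
hyper_w2 = [[1.0, -0.5], [0.2, 0.3]]
q_low = monotonic_mix([0.1, 0.2], state, hyper_w1, hyper_w2)
q_high = monotonic_mix([0.5, 0.2], state, hyper_w1, hyper_w2)
```

Raising the first agent's utility from 0.1 to 0.5 cannot lower the mixed value, which is the property that keeps per-agent greedy actions consistent with the joint greedy action.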

Figure 12. Learning curves of our S2RL and baselines on one easy map (3s5z), one hard map (3s_vs_5z), and four super-hard maps (corridor, 5s10z, 6h_vs_8z, 3s5z_vs_3s6z). Panels: (a) 3s5z, (b) 3s_vs_5z, (c) 5s10z, (d) corridor, (e) 6h_vs_8z, (f) 3s5z_vs_3s6z. All experimental results are illustrated with the median performance and percentiles across 5 runs for a fair comparison.

In our framework, S2RL serves as a plug-in module in the agent utility networks. The outputs of the S2RL modules are directly used for subsequent network computations. Then, each agent is trained by minimizing the total loss

$$\mathcal{L}(\theta, \phi) = \mathcal{L}_{TD}(\theta, \phi) + \lambda\, \mathcal{L}_{AUX}(\theta, \phi), \tag{19}$$

where $\lambda$ is a regularization parameter that controls the level of attention to critical states. A larger $\lambda$ allows our algorithm to pay more attention to some critical states, while a smaller $\lambda$ allows for a more even distribution of attention. The overall framework is trained in an end-to-end centralized manner. The complete algorithm is summarized in Algorithm 1.
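The way the two losses are combined can be sketched as follows, assuming the joint values of the dense and sparse routes and their TD targets are already available. The names `mean_squared_td_error`, `s2rl_total_loss` and `lam` (the regularization parameter) are hypothetical, and the numbers are made up for illustration.

```python
def mean_squared_td_error(q_values, targets):
    """Mean squared difference between predicted joint values and TD targets."""
    return sum((t - q) ** 2 for q, t in zip(q_values, targets)) / len(q_values)


def s2rl_total_loss(dense_q, dense_targets, sparse_q, sparse_targets, lam):
    """Combine the dense TD loss with the auxiliary sparse loss weighted by lam."""
    dense_loss = mean_squared_td_error(dense_q, dense_targets)
    aux_loss = mean_squared_td_error(sparse_q, sparse_targets)
    return dense_loss + lam * aux_loss


# Hypothetical joint values and TD targets for a batch of two transitions.
loss = s2rl_total_loss([1.0, 2.0], [1.5, 2.0], [0.0, 1.0], [1.0, 1.0], lam=0.5)
print(loss)  # 0.125 + 0.5 * 0.5 = 0.375
```

Setting `lam` to zero recovers the dense TD loss alone, so the auxiliary term can be switched off without touching the rest of the training code.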

4. Experiments

We conduct experiments on the StarCraft Multi-Agent Challenge (SMAC) (Samvelyan et al., 2019) to demonstrate the effectiveness of the proposed sparse state based MARL (S2RL) method. (We use the SC2.4.10 version instead of the older SC2 version; performance is not comparable between different versions.) SMAC, which focuses on micromanagement challenges, has become a standard benchmark for evaluating state-of-the-art MARL methods. In SMAC, each ally entity is controlled by an individual learning agent, while the enemy entities are controlled by a built-in AI. At each time step, agents can move in four cardinal directions, stop, take no-operation, or choose an enemy to attack. Thus, if there are $n$ enemies in the scenario, the action space for each ally unit consists of $n + 6$ discrete actions. Agents aim to inflict maximum damage on enemy entities to win the game. Therefore, proper tactics such as focusing fire and covering attack are required during battles. Learning these diverse interaction behaviors under partial observation is a crucial yet challenging task. In what follows, we detail the compared methods and parameter settings and then present the qualitative and quantitative performance of different methods.
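The action count described above can be checked with a short Python sketch. The helper name `smac_action_count` is hypothetical; the breakdown (four moves, stop, no-operation, one attack action per enemy) follows the description in the text.

```python
def smac_action_count(n_enemies):
    """Number of discrete actions per ally unit, as described in the text."""
    move_actions = 4             # four cardinal directions
    stop_actions = 1             # stop
    noop_actions = 1             # no-operation
    attack_actions = n_enemies   # one attack action per enemy
    return move_actions + stop_actions + noop_actions + attack_actions


print(smac_action_count(8))  # a scenario with 8 enemies: 4 + 1 + 1 + 8 = 14
```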

4.1. Comparison Methods and Training Details

Our method is compared with several baseline methods, including IQL, VDN (Sunehag et al., 2018), QMIX (Rashid et al., 2018), QTRAN (Son et al., 2019), QPLEX (Wang et al., 2021a), CWQMIX and OWQMIX (Rashid et al., 2020). To verify its performance, we integrate S2RL into VDN, QMIX and QPLEX, and call the resulting variants S2RL (VDN), S2RL (QMIX) and S2RL (QPLEX). These three SOTA methods are chosen for their robust performance in different multi-agent scenarios, while S2RL can also be easily applied to other frameworks.

We adopt the Python MARL framework (PyMARL) (Samvelyan et al., 2019) to run all experiments. The hyperparameters of the baseline methods are the same as those in PyMARL to ensure comparability. The regularization parameter in (19) is set to . For all experiments, the optimization is conducted using RMSprop with a learning rate of , a total timestep of M, a smoothing constant of , and no momentum or weight decay. For exploration, we use $\epsilon$-greedy with $\epsilon$ annealed linearly from to over time steps and kept constant for the rest of the training. For four super-hard exploration maps (6h_vs_8z, 3s5z_vs_3s6z, corridor, 5s10z), we extend the epsilon annealing time to and the total timestep to , and three of them (6h_vs_8z, corridor, 5s10z) are optimized with Adam for both series of S2RL and all the baselines and ablations. Batches of episodes are sampled from the replay buffer, and all tested methods are trained end-to-end on fully unrolled episodes. All experiments on the SMAC benchmark use the default reward and observation settings of the SMAC benchmark (Samvelyan et al., 2019). All experiments in this section were carried out with different random seeds on NVIDIA Tesla V100 GPUs.

Figure 19. The performance comparison between the vanilla methods and their S2RL variants. We integrate the proposed S2RL framework with VDN, QMIX and QPLEX. Panels: (a) 3s5z, (b) 3s_vs_5z, (c) 5s10z, (d) corridor, (e) 6h_vs_8z, (f) 3s5z_vs_3s6z.

Figure 23. Ablation studies regarding the components of dense attention and auxiliary sparse attention. Panels: (a) 5s10z, (b) 6h_vs_8z, (c) 3s5z_vs_3s6z.

4.2. Overall Results

To demonstrate the efficiency of our proposed method, we conduct experiments on 6 challenging SMAC scenarios, which are classified into Easy (3s5z), Hard (3s_vs_5z) and Super-Hard (6h_vs_8z, 3s5z_vs_3s6z, corridor, 5s10z). All of these scenarios are heterogeneous, where each army is composed of more than one entity type. It is worth mentioning that MARL algorithms converge more slowly on hard and super-hard maps and therefore need to focus more on important entities to speed up convergence. We are therefore more interested in the performance of our method on these maps.

Figure 12 shows the overall performance of the tested algorithms in different scenarios. The results include the median performance, and the percentiles are shaded to avoid the effect of any outliers, as recommended in (Samvelyan et al., 2019). For the sake of demonstration, here we select the best plug-in method, referred to as S2RL in the following, to compare with other baseline algorithms. First of all, we can see that S2RL performs best on all six tasks, which means our proposed method can efficiently enhance the performance of agents in different scenarios. In the easy map, some algorithms have already achieved good performance, and our S2RL is not significantly ahead. This is because the number of entities in easy maps is small, all entities are critical, and the selection gain brought by the sparse attention mechanism is not apparent. In contrast, our S2RL significantly improves the learning efficiency and final performance compared to the baselines in some hard and super-hard scenarios. Specifically, in 6h_vs_8z and 3s5z_vs_3s6z, our S2RL consistently outperforms baselines by a large margin during training. When the situation becomes more complex and the agent needs to consider which entities are more critical to the decision, the benefits of the sparse attention mechanism are more pronounced.
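The aggregation of learning curves into a median with a shaded percentile band can be sketched as below. The 25th and 75th percentiles used as defaults are an assumption made for illustration, as are the win-rate values; the function names `percentile` and `summarize_runs` are hypothetical.

```python
def percentile(values, q):
    """Percentile q (0-100) of values using linear interpolation."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (position - lower) * (ordered[upper] - ordered[lower])


def summarize_runs(win_rates_per_run, low=25, high=75):
    """Per-evaluation-point (low, median, high) band across independent runs."""
    summary = []
    # zip(*runs) groups the values of all runs at the same evaluation point.
    for point in zip(*win_rates_per_run):
        summary.append((percentile(point, low), percentile(point, 50), percentile(point, high)))
    return summary


# Hypothetical win rates of 5 runs at 2 evaluation points.
runs = [[0.0, 0.2], [0.1, 0.4], [0.2, 0.6], [0.3, 0.8], [0.4, 1.0]]
print(summarize_runs(runs))
```

Reporting the median with a percentile band rather than the mean limits the influence of a single diverging run on the plotted curve.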

In addition, to test how well our method generalizes across value-based algorithms, we incorporate S2RL into VDN, QMIX and QPLEX respectively, and compare the final performance with the vanilla agent utility networks in Figure 19. In general, most of the learning curves of S2RL (VDN), S2RL (QMIX) and S2RL (QPLEX) are superior to those of VDN, QMIX and QPLEX. Besides, it is worth mentioning that our method achieves its largest margins on the most difficult tasks, demonstrating the effectiveness of S2RL. The experimental results show that in the super-hard map 6h_vs_8z, our proposed S2RL (QPLEX) improves the win rate by almost compared to the naive QPLEX. Even more encouraging is that S2RL (QMIX) can reach a win rate of while QMIX basically does not learn any strategy.

Furthermore, the improvement from incorporating S2RL into QMIX and QPLEX is larger than that for VDN, which reveals the importance of the mixing network. We hypothesize that the sparse attention mechanism enables the model to select critical entities and further clarify their contributions, which may strengthen credit assignment. Unlike QMIX and QPLEX, VDN represents the joint action-value as a summation of individual Q-functions, and this limited representational capacity of the mixing step makes it challenging to leverage the strengths of our approach.

4.3. Ablation Study

To evaluate the advantage of the sparse auxiliary loss in the agent training process, we conduct ablation studies on three super-hard maps (5s10z, 6h_vs_8z, 3s5z_vs_3s6z). Our S2RL mainly consists of two parts: (A) dense attention, denoted as Attn; (B) sparse attention as an auxiliary, denoted as S2RL. We apply these two components to the VDN, QMIX and QPLEX utility networks and compare their performance in Figure 23. The solid curves indicate that the agents use the dense attention module to calculate the importance of different entities. The dashed curves indicate that the agents learn to use the sparse attention module as an auxiliary to teach the dense attention module.

Generally speaking, the advantages of using S2RL gradually emerge in the middle and late stages of training. We assume that agents cannot distinguish which entity is more important at the beginning of training. As training progresses, agents explore more unknown states and are gradually able to distinguish which entities are more critical. Finally, the overall performance of agents is improved when they discard irrelevant entities. Furthermore, we find that using sparse attention achieves more significant improvements on 6h_vs_8z and corridor. In the 6h_vs_8z scenario, 6 Hydralisks face 8 enemy Zealots, while in the corridor scenario, 6 Zealots face 24 enemy Zerglings. The controllable agents in these scenarios are homogeneous, making it easier for them to explore cooperative strategies. Moreover, using the sparse attention module helps simplify the exploration space, making S2RL more advantageous in these scenarios.

(a) Strategy: Zealots leave the team separately to attract the attention of most enemies.
(b) Strategy: Zealots , , focus fire cooperatively and Zealots attack the distant enemy to rescue his teammates.
(c) Strategy: Zealots keep moving to avoid being attacked and others eliminate the scattered enemies.
Figure 27. A visualization example of the sophisticated strategies adopted by S2RL (QMIX) in the SMAC corridor scenario. In this super-hard map, ally units are 6 Zealots labeled by green circle, while enemy units are 24 Zerglings. Green and red shadows mark enemies attracted by ally units. Green arrows and red arrows indicate the direction in which ally units and enemy units will move, respectively. Yellow lines indicate enemies that ally units are attacking.

4.4. Action Representations

Figure 27 visualizes the final trained model S2RL (QMIX) on the SMAC corridor scenario to better explain why our method performs well. In this super-hard scenario, 6 friendly Zealots face 24 enemy Zerglings. The disparity in numbers means that our agents need to learn cooperative strategies such as moving, pulling and focusing fire. Otherwise, agents are doomed to lose if they gather together.

As shown in Figure 27(a), the game starts with one Zealot, highlighted in green, acting as a warrior and leaving the team to grab the attention of most of the enemies in the green oval. The other Zealots can thus eliminate the small number of enemies in the red oval with a high probability of winning. In Figure 27(b), we can see that three Zealots are focusing fire on the enemy, thus speeding up the eradication of the enemy. Meanwhile, another Zealot stands out to attack, from a distance, the enemies surrounding its teammates. These sophisticated strategies reflect that this Zealot has a better sense of the situation and knows what it should do to protect its teammates. In the next time step, we observe that one Zealot keeps moving to avoid being attacked, and the enemies marked by the red oval are successfully drawn away and walk towards our team (see Figure 27(c)). Although doomed to be sacrificed, this Zealot gives its teammates plenty of time to annihilate the scattered enemies and rescue their teammate. All in all, S2RL can effectively allow agents to immediately focus on critical entities and make decisions, especially in more intricate scenarios.

5. Related works

5.1. Value-based Methods in MARL

Recently, value-based methods have been applied to multi-agent scenarios to solve complex Markov games and have achieved significant algorithmic progress. VDN (Sunehag et al., 2018) represents the joint action-value as a summation of individual value functions. Because this additive factorization has limited expressiveness, QMIX (Rashid et al., 2018) improves on VDN by using a mixing network for nonlinear aggregation while maintaining the monotonic relationship between centralized and individual value functions. Moreover, weighted QMIX (Rashid et al., 2020) adopts a twin network and encourages underestimated actions to alleviate the risk of suboptimal outcomes. The monotonic constraints of QMIX and similar methods lead to provably poor exploration and suboptimality. To address these structural limitations, QTRAN (Son et al., 2019) constructs regularizations with linear constraints and relaxes them with an $L_2$-norm penalty to improve tractability, but its constraints are computationally intractable. MAVEN (Mahajan et al., 2019) relaxes QTRAN (Son et al., 2019) by two penalties and introduces a hierarchical model to coordinate diverse explorations among agents. In (Wang et al., 2021a), a duplex dueling network architecture is introduced for factoring joint value functions, which achieves state-of-the-art performance on a range of cooperative tasks. Additionally, some more advanced methods (Wang et al., 2020, 2021b) introduce role-oriented frameworks to decompose complex MARL tasks. In general, these methods mainly focus on aggregating local agent utility networks into a central critic network, while our method improves the structure of individual agent networks for more robust performance.

5.2. Attention Mechanism in MARL

Recently, attention models (Vaswani et al., 2017; Hu et al., 2018; Velickovic et al., 2018) have been increasingly adopted in MARL algorithms, since the attention mechanism is effective in extracting communication channels, representing relations, and incorporating information in large contexts. ATOC (Jiang and Lu, 2018) and MAAC (Iqbal and Sha, 2019) use an attention layer to process messages from other agents differently according to their state-dependent importance. SparseMAAC (Li et al., 2019) extends MAAC (Iqbal and Sha, 2019) with sparsity by directly replacing the softmax activation function in the attention mechanism with a sparsemax variant. In addition, TarMAC (Das et al., 2019) utilizes a sender-receiver soft attention mechanism and multiple rounds of cooperative reasoning to allow targeted continuous communication between agents. CollaQ (Zhang et al., 2020) uses attention mechanisms to handle a variable number of agents and to address the problem of dynamic reward distribution. Qatten (Yang et al., 2020b) employs an attention mechanism to compute the weights of local action-value functions and mix them to approximate the global Q-value. EPC (Long et al., 2020) utilizes an attention mechanism to combine embeddings from different observation-action encoders. REFIL (Iqbal et al., 2021) uses attention in QMIX to generate a random mask group of agents. UPDeT (Hu et al., 2021) decouples the policy distribution from intertwined input observations with the help of a transformer mechanism. Moreover, G2ANet (Liu et al., 2020) and HAMA (Ryu et al., 2020) model the relationships between agents as a graph and utilize attention mechanisms to learn them. However, most of these existing attention mechanisms compute importance weights for all entities. In this case, all participants are assigned scores according to a dense fully connected graph, which forces agents to perceive all entities.
SparseMAAC takes sparsity into account, but it ignores that directly applying a sparse attention mechanism disrupts sufficient exploration and pushes the algorithm towards suboptimal policies. In this paper, agents learn to perceive the more critical entities in their observations during decision-making, while all observation information is preserved.
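The contrast between dense and sparse attention weights can be sketched as follows. This is a minimal illustration using the sparsemax projection of Martins and Astudillo (2016); it is an assumption that this particular form is representative, and it is not claimed to be the exact activation used in SparseMAAC or in S2RL. The function names and the example scores are hypothetical.

```python
import math


def softmax(scores):
    """Dense attention weights: every entity receives a strictly positive weight."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sparsemax(scores):
    """Sparse attention weights: Euclidean projection onto the probability simplex.

    Low-scoring entities can receive exactly zero weight.
    """
    z_sorted = sorted(scores, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for j, z in enumerate(z_sorted, start=1):
        cumsum += z
        # Entity j stays in the support while 1 + j * z_(j) > sum of the top-j scores.
        if 1.0 + j * z > cumsum:
            tau = (cumsum - 1.0) / j
    return [max(s - tau, 0.0) for s in scores]


# Hypothetical attention scores for three observed entities.
scores = [0.5, 0.4, -1.0]
print(softmax(scores))    # all weights are positive
print(sparsemax(scores))  # the lowest-scoring entity is dropped
```

Both functions return weights that sum to one; the difference is that softmax forces the agent to attend to every entity, whereas sparsemax can discard some of them entirely.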

6. Conclusion

In this work, we investigate how cooperating MARL agents benefit from extracting significant entities from observations. We design a novel sparse state based MARL algorithm that utilizes a sparse attention mechanism as an auxiliary means of selecting critical entities and ignoring extraneous information. Moreover, S2RL can be easily integrated into various value-based architectures such as VDN, QMIX, and QPLEX. Experimental results on the StarCraft II micromanagement benchmark with different value-based backbones demonstrate that our method significantly outperforms existing collaborative MARL algorithms and achieves state-of-the-art performance. Notably, our method achieves large margins on complex tasks, demonstrating the effectiveness of S2RL. An interesting direction for future work is to investigate the grouping of cooperating agents through sparseness.

This work was supported by the National Key Research and Development Project of China (2021ZD0110400, 2018AAA0101900), National Natural Science Foundation of China (U19B2042), The University Synergy Innovation Program of Anhui Province (GXXT-2021-004), Zhejiang Lab (2021KE0AC02), Academy Of Social Governance Zhejiang University, Fundamental Research Funds for the Central Universities (226-2022-00064, 226-2022-00142), Artificial Intelligence Research Foundation of Baidu Inc., Program of ZJU and Tongdun Joint Research Lab, Shanghai AI Laboratory (P22KS00111).


  • Y. Cao, W. Yu, W. Ren, and G. Chen (2013) An overview of recent progress in the study of distributed multi-agent coordination. IEEE Trans. Ind. Informat. 9 (1), pp. 427–438. Cited by: §1.
  • K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio (2014) On the properties of neural machine translation: encoder-decoder approaches. arXiv preprint arXiv:1409.1259. Cited by: §3.2.
  • A. Das, T. Gervet, J. Romoff, D. Batra, D. Parikh, M. Rabbat, and J. Pineau (2019) TarMAC: targeted multi-agent communication. In ICML, Cited by: §5.2.
  • J. N. Foerster, Y. M. Assael, N. de Freitas, and S. Whiteson (2016) Learning to communicate with deep multi-agent reinforcement learning. In NeurIPS, Cited by: §2.2.
  • J. N. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson (2018) Counterfactual multi-agent policy gradients. In AAAI, Cited by: §2.2.
  • J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In CVPR, Cited by: §5.2.
  • S. Hu, F. Zhu, X. Chang, and X. Liang (2021) UPDeT: universal multi-agent RL via policy decoupling with transformers. In ICLR, Cited by: §5.2.
  • S. Iqbal, C. A. S. de Witt, B. Peng, W. Boehmer, S. Whiteson, and F. Sha (2021) Randomized entity-wise factorization for multi-agent reinforcement learning. In ICML, Cited by: §5.2.
  • S. Iqbal and F. Sha (2019) Actor-attention-critic for multi-agent reinforcement learning. In ICML, Cited by: §5.2.
  • J. Jiang and Z. Lu (2018) Learning attentional communication for multi-agent cooperation. In NeurIPS, Cited by: §5.2.
  • B. R. Kiran, I. Sobh, V. Talpaert, P. Mannion, A. A. Al Sallab, S. Yogamani, and P. Pérez (2022) Deep reinforcement learning for autonomous driving: a survey. IEEE Trans. Intell. Transp. Syst. 23 (6), pp. 4909–4926. Cited by: §1.
  • J. Li, K. Kuang, B. Wang, F. Liu, L. Chen, F. Wu, and J. Xiao (2021) Shapley counterfactual credits for multi-agent reinforcement learning. In SIGKDD, Cited by: §1.
  • W. Li, B. Jin, and X. Wang (2019) SparseMAAC: sparse attention for multi-agent reinforcement learning. In DASFAA, Cited by: §5.2.
  • T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2016) Continuous control with deep reinforcement learning. In ICLR, Cited by: §1.
  • S. Liu, G. Lever, J. Merel, S. Tunyasuvunakool, N. Heess, and T. Graepel (2019) Emergent coordination through competition. In ICLR, Cited by: §1.
  • Y. Liu, W. Wang, Y. Hu, J. Hao, X. Chen, and Y. Gao (2020) Multi-agent game abstraction via graph attention neural network. In AAAI, Cited by: §5.2.
  • Q. Long, Z. Zhou, A. Gupta, F. Fang, Y. Wu, and X. Wang (2020) Evolutionary population curriculum for scaling multi-agent reinforcement learning. In ICLR, Cited by: §5.2.
  • R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch (2017) Multi-agent actor-critic for mixed cooperative-competitive environments. In NeurIPS, Cited by: §1.
  • A. Mahajan, T. Rashid, M. Samvelyan, and S. Whiteson (2019) MAVEN: multi-agent variational exploration. In NeurIPS, Cited by: §5.1.
  • H. Mao, W. Liu, J. Hao, J. Luo, D. Li, Z. Zhang, J. Wang, and Z. Xiao (2020) Neighborhood cognition consistent multi-agent reinforcement learning. In AAAI, Cited by: §2.2.
  • A. F. T. Martins and R. F. Astudillo (2016) From softmax to sparsemax: A sparse model of attention and multi-label classification. In ICML, Cited by: §3.1.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, et al. (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533. Cited by: §2.2.
  • F. A. Oliehoek and C. Amato (2016) A concise introduction to decentralized POMDPs. Springer. Cited by: §2.1.
  • B. Peters, V. Niculae, and A. F. T. Martins (2019) Sparse sequence-to-sequence models. In ACL, Cited by: §3.1.
  • T. Rashid, G. Farquhar, B. Peng, and S. Whiteson (2020) Weighted QMIX: expanding monotonic value function factorisation for deep multi-agent reinforcement learning. In NeurIPS, Cited by: §4.1, §5.1.
  • T. Rashid, M. Samvelyan, C. S. de Witt, G. Farquhar, J. N. Foerster, and S. Whiteson (2018) QMIX: monotonic value function factorisation for deep multi-agent reinforcement learning. In ICML, Cited by: §1, §2.2, §3.2, §4.1, §5.1.
  • H. Ryu, H. Shin, and J. Park (2020) Multi-agent actor-critic with hierarchical graph attention network. In AAAI, Cited by: §5.2.
  • M. Samvelyan, T. Rashid, C. S. de Witt, G. Farquhar, N. Nardelli, T. G. J. Rudner, C. Hung, P. H. S. Torr, J. N. Foerster, and S. Whiteson (2019) The StarCraft multi-agent challenge. In AAMAS, Cited by: §4.1, §4.2, §4.
  • K. Son, D. Kim, W. J. Kang, D. Hostallero, and Y. Yi (2019) QTRAN: learning to factorize with transformation for cooperative multi-agent reinforcement learning. In ICML, Cited by: §2.2, §4.1, §5.1.
  • P. Sunehag, G. Lever, A. Gruslys, W. M. Czarnecki, et al. (2018) Value-decomposition networks for cooperative multi-agent learning based on team reward. In AAMAS, Cited by: §4.1, §5.1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NeurIPS, Cited by: §1, §3.1, §5.2.
  • P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio (2018) Graph attention networks. In ICLR, Cited by: §5.2.
  • O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, et al. (2019) Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575 (7782), pp. 350–354. Cited by: §1.
  • J. Wang, Z. Ren, T. Liu, Y. Yu, and C. Zhang (2021a) QPLEX: duplex dueling multi-agent Q-learning. In ICLR, Cited by: §4.1, §5.1.
  • T. Wang, H. Dong, V. R. Lesser, and C. Zhang (2020) ROMA: multi-agent reinforcement learning with emergent roles. In ICML, Cited by: §5.1.
  • T. Wang, T. Gupta, A. Mahajan, B. Peng, S. Whiteson, and C. Zhang (2021b) RODE: learning roles to decompose multi-agent tasks. In ICLR, Cited by: §5.1.
  • C. Wu, A. Kreidieh, E. Vinitsky, and A. M. Bayen (2017) Emergent behaviors in mixed-autonomy traffic. In CoRL, Cited by: §1.
  • Y. Yang, J. Hao, G. Chen, H. Tang, Y. Chen, Y. Hu, C. Fan, and Z. Wei (2020a) Q-value path decomposition for deep multiagent reinforcement learning. In ICML, Cited by: §2.2.
  • Y. Yang, J. Hao, B. Liao, K. Shao, G. Chen, W. Liu, and H. Tang (2020b) Qatten: a general framework for cooperative multiagent reinforcement learning. arXiv preprint arXiv:2002.03939. Cited by: §5.2.
  • T. Zhang, H. Xu, X. Wang, Y. Wu, K. Keutzer, J. E. Gonzalez, and Y. Tian (2020) Multi-agent collaboration via reward attribution decomposition. arXiv preprint arXiv:2010.08531. Cited by: §5.2.