
UPDeT: Universal Multi-agent Reinforcement Learning via Policy Decoupling with Transformers

by   Siyi Hu, et al.
Monash University

Recent advances in multi-agent reinforcement learning have been largely limited to training one model from scratch for every new task. The limitation is due to the restricted model architecture related to fixed input and output dimensions. This hinders the experience accumulation and transfer of the learned agent over tasks with diverse levels of difficulty (e.g. 3 vs 3 or 5 vs 6 multi-agent games). In this paper, we make the first attempt to explore a universal multi-agent reinforcement learning pipeline, designing one single architecture to fit tasks that require different observation and action configurations. Unlike previous RNN-based models, we utilize a transformer-based model to generate a flexible policy by decoupling the policy distribution from the intertwined input observation with an importance weight measured by the merits of the self-attention mechanism. Compared to a standard transformer block, the proposed model, named Universal Policy Decoupling Transformer (UPDeT), further relaxes the action restriction and makes the multi-agent task's decision process more explainable. UPDeT is general enough to be plugged into any multi-agent reinforcement learning pipeline and equip it with strong generalization abilities that enable the handling of multiple tasks at a time. Extensive experiments on large-scale SMAC multi-agent competitive games demonstrate that the proposed UPDeT-based multi-agent reinforcement learning achieves significant results relative to state-of-the-art approaches, demonstrating advantageous transfer capability in terms of both performance and training speed (10 times faster).



1 Introduction

Reinforcement Learning (RL) provides a framework for decision-making problems in an interactive environment, with applications including robotics control (hester2010generalized), video gaming (mnih2015human), auto-driving (bojarski2016end), person search (ChangHSLYH18) and vision-language navigation (zhu2020vision). Cooperative multi-agent reinforcement learning (MARL), a long-standing problem in the RL context, involves organizing multiple agents to achieve a goal, and is thus a key tool used to address many real-world problems, such as mastering multi-player video games (peng2017multiagent) and studying population dynamics (yang2017study).

A number of methods have been proposed that exploit an action-value function to learn a multi-agent model (sunehag2017value, rashid2018qmix, du2019liir, mahajan2019maven, hostallero2019learning, zhou2020learning, yang2020multi). However, current methods have poor representation learning ability and fail to exploit the common structure underlying the tasks. This is because they tend to treat observations from different entities in the environment as an integral part of the whole. Accordingly, they tacitly assume that neural networks are able to automatically decouple the observation to find the best mapping between the whole observation and the policy. Adopting this approach means that they treat all information from other agents or different parts of the environment in the same way. The most commonly used method involves concatenating the observations from each entity into a vector that is used as input (rashid2018qmix, du2019liir, zhou2020learning). In addition, current methods ignore the rich physical meanings behind each action. Multi-agent tasks feature a close relationship between the observation and output. If the model does not decouple the observation from the different agents, individual functions may be misguided and impede the centralized value function. Worse yet, conventional models require the input and the output dimensions to be fixed (shao2018starcraft, wang2020few), which makes zero-shot transfer impossible. Thus, the applicability of current methods to real-world problems is limited.

Our solution to these problems is to develop a multi-agent reinforcement learning (MARL) framework with no limitation on input or output dimension. Moreover, this model should be general enough to be applicable to any existing MARL methods. More importantly, the model should be explainable and capable of providing further improvement for both the final performance on single-task scenarios and transfer capability on multi-task scenarios.

Figure 1: An overview of the MARL framework. Our work replaces the widely used GRU/LSTM-based individual value function with a transformer-based function. Actions are separated into action groups according to observations.

Inspired by the self-attention mechanism (vaswani2017attention), we propose a transformer-based MARL framework, named Universal Policy Decoupling Transformer (UPDeT). There are four key advantages of this approach: 1) once trained, it can be universally deployed; 2) it provides a more robust representation with a policy decoupling strategy; 3) it is more explainable; 4) it is general enough to be applied to any MARL model. We further design a transformer-based function to handle various observation sizes by treating individual observations as “observation-entities”. We match each observation-entity with an action-group by separating the action space into several action-groups with reference to the corresponding observation-entity, which gives us a set of matched observation-entity/action-group pairs. We further use a self-attention mechanism to learn the relationship between the matched observation-entity and other observation-entities. Through the use of the self-attention map and the embedding of each observation-entity, UPDeT can optimize the policy at an action-group level. We refer to this strategy as Policy Decoupling. By combining the transformer and policy decoupling strategies, UPDeT significantly outperforms conventional RNN-based models.

In UPDeT, there is no need to introduce any new parameters for new tasks. We also demonstrate that it is only with a decoupled policy and matched observation-entity/action-group pairs that UPDeT can learn a strong representation with high transfer capability. Finally, our proposed UPDeT can be plugged into any existing method with almost no changes to the framework architecture required, while still bringing significant improvements to the final performance, especially in hard and complex multi-agent tasks.

The main contributions of this work are as follows: First, our UPDeT-based MARL framework outperforms RNN-based frameworks by a large margin in terms of final performance on state-of-the-art centralized functions. Second, our model has strong transfer capability and can handle a number of different tasks at a time. Third, our model accelerates the transfer learning speed (total steps cost) to make it roughly 10 times faster compared to RNN-based models in most scenarios.

2 Related Work

Attention mechanisms have become an integral part of models that capture global dependencies. In particular, self-attention (parikh2016decomposable) calculates the response at a specific position in a sequence by attending to all positions within this sequence. vaswani2017attention demonstrated that machine translation models can achieve state-of-the-art results solely by using a self-attention model. parmar2018image proposed an Image Transformer model that applies self-attention to image generation. wang2018non formalized self-attention as a non-local operation in order to model the spatial-temporal dependencies in video sequences. In spite of this, self-attention mechanisms have not yet been fully explored in multi-agent reinforcement learning.

Another line of research is multi-agent reinforcement learning (MARL). Existing work in MARL focuses primarily on building a centralized function to guide the training of individual value functions (lowe2017multi, sunehag2017value, rashid2018qmix, mahajan2019maven, hostallero2019learning, yang2020multi, zhou2020learning). Few works have attempted to build better individual functions with strong representation and transfer capability. In standard reinforcement learning, this kind of generalization has been extensively studied (taylor2009transfer, ammar2012reinforcement, parisotto2015actor, gupta2017learning, da2019survey), while multi-agent transfer learning has proven to be more difficult than the single-agent scenario (boutsioukis2011transfer, shao2018starcraft, vinyals2019grandmaster). However, the transfer capability of a multi-agent system is of greater significance due to the varying numbers of agents, observation sizes and policy distributions.

To the best of our knowledge, we are the first to develop a multi-agent framework capable of handling multiple tasks at a time. Moreover, we provide a policy decoupling strategy to further improve the model performance and facilitate multi-agent transfer learning, which is a significant step towards real-world multi-agent applications.

3 Method

Figure 2: Three variants on different policy decoupling method types (upper part) and two variants on different temporal unit types (bottom). ‘AR’, ‘MA’ and ‘EXP’ represent Action Restriction, Multi-task at A time and EXPlainable, respectively. $o$, $e$, $q$ and $h$ represent the observation, embedding, Q-value and hidden states with $n$ observation entities and $m$ available actions. $G$ represents the global hidden state and $t$ is the current time step. A black circle indicates that the variant possesses this attribute; moreover, variant (d) is our proposed UPDeT with best performance. Further details on all five variants can be found in Section 3.

We begin by introducing the notations and basic task settings necessary for our approach. We then describe a transformer-based individual function and policy decoupling strategy under MARL. Finally, we introduce different temporal units and assimilate our Universal Policy Decoupling Transformer (UPDeT) into Dec-POMDP.

3.1 Notations and Task Settings

Multi-agent Reinforcement Learning

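A cooperative multi-agent task is a decentralized partially observable Markov decision process (oliehoek2016concise) with a tuple $G = \langle S, A, U, P, r, Z, O, n, \gamma \rangle$. Let $S$ denote the global state of the environment, while $A$ represents the set of $n$ agents and $U$ is the action space. At each time step $t$, agent $a \in A \equiv \{1, \dots, n\}$ selects an action $u \in U$, forming a joint action $\mathbf{u} \in \mathbf{U} \equiv U^n$, which in turn causes a transition in the environment represented by the state transition function $P(s' \mid s, \mathbf{u}) : S \times \mathbf{U} \times S \rightarrow [0, 1]$. All agents share the same reward function $r(s, \mathbf{u}) : S \times \mathbf{U} \rightarrow \mathbb{R}$, while $\gamma \in [0, 1)$ is a discount factor. We consider a partially observable scenario in which each agent makes individual observations $z \in Z$ according to the observation function $O(s, a) : S \times A \rightarrow Z$. Each agent has an action-observation history $\tau_a \in T \equiv (Z \times U)^*$ that conditions a stochastic policy $\pi_a(u_a \mid \tau_a) : T \times U \rightarrow [0, 1]$, creating the following joint action value: $Q^{\pi}(s_t, \mathbf{u}_t) = \mathbb{E}_{s_{t+1:\infty}, \mathbf{u}_{t+1:\infty}}[R_t \mid s_t, \mathbf{u}_t]$, where $R_t = \sum_{i=0}^{\infty} \gamma^i r_{t+i}$ is the discounted return.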
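As a small worked illustration of the discounted return that closes the definition above, the following sketch sums a finite list of shared team rewards backwards in time. The function name and the example rewards are made up for illustration; only the discount factor and the form of the sum come from the text.

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted return of a finite reward sequence: sum_i gamma**i * rewards[i]."""
    ret = 0.0
    # Accumulate from the last reward backwards: R = r + gamma * R_next.
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret


# Hypothetical shared team rewards over three steps.
rewards = [1.0, 0.0, 2.0]
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5*0 + 0.25*2 = 1.5
```

Because all agents share the same reward function, this single scalar is the quantity every agent's policy is trained to maximize in expectation.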

Centralized training with decentralized execution

Centralized training with decentralized execution (CTDE) is a commonly used architecture in the MARL context. Each agent is conditioned only on its own action-observation history to make a decision using the learned policy. The centralized value function provides a centralized gradient to update the individual function based on its output. Therefore, a stronger individual value function can benefit the centralized training.

3.2 Transformer-based Individual Value Function

In this section, we present a mathematical formulation of our transformer-based model UPDeT. We describe the calculation of the global Q-function with the self-attention mechanism. First, the observation $O$ is embedded into a semantic embedding to handle the various observation spaces. For example, if an agent $a_i$ observes $k$ other entities $\{o_{i,1}, \dots, o_{i,k}\}$ at time step $t$, all observation entities are embedded via an embedding layer $E$ as follows:

$$e_i^t = \{E(o_{i,1}^t), \dots, E(o_{i,k}^t)\} \qquad (1)$$

Here, $i$ is the index of the agent, $i \in \{1, \dots, n\}$. Next, the value functions $\{Q_1, \dots, Q_n\}$ for the $n$ agents for each step are estimated as follows:

$$q_i^t = Q_i(h_i^{t-1}, e_i^t, u_i^t) \qquad (2)$$

We introduce $h_i^{t-1}$, the temporal hidden state at the last time step $t-1$, since the POMDP policy is highly dependent on the historical information. $e_i^t$ denotes the observation embedding, while $u_i^t$ is the candidate action, $u_i^t \in U$. $\theta_i$ is the parameter that defines $Q_i$. Finally, the global Q-function $Q^{\pi}$ is calculated from all individual value functions, as follows:

$$Q^{\pi}(s_t, \mathbf{u}_t) = F(q_1^t, \dots, q_n^t) \qquad (3)$$

$F_i$ is the credit assignment function defined by $\phi_i$ for each agent $a_i$, as utilized in rashid2018qmix and sunehag2017value. For example, in VDN, $F$ is a sum function that can be expressed as $F(q_1^t, \dots, q_n^t) = \sum_{i=1}^{n} q_i^t$.
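The VDN case mentioned above can be sketched in a few lines: each agent picks its own greedy action from its individual Q-values, and the global value is simply the sum of the selected entries. The helper names and the Q-value numbers below are hypothetical and only illustrate the sum-type credit assignment; QMIX and QTRAN use learned mixing functions instead.

```python
def greedy_actions(per_agent_q):
    """Each agent independently picks the index of its largest Q-value."""
    return [max(range(len(q)), key=q.__getitem__) for q in per_agent_q]


def vdn_total(per_agent_q, actions):
    """VDN-style credit assignment: the global value is the sum of chosen Q-values."""
    return sum(q[u] for q, u in zip(per_agent_q, actions))


# Hypothetical individual Q-values for 3 agents with 2 actions each.
per_agent_q = [[0.1, 0.7], [0.4, 0.2], [0.0, 0.3]]
actions = greedy_actions(per_agent_q)
q_total = vdn_total(per_agent_q, actions)
print(actions)  # [1, 0, 1]
```

With a sum as the mixing function, maximizing each individual value also maximizes the global value, which is what allows decentralized greedy execution.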

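Implement Q-function with Self-attention

vaswani2017attention adopts three matrices, $K$, $Q$ and $V$, representing a set of keys, queries and values respectively. The attention is computed as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V \qquad (4)$$

where $d_k$ is the dimension of the keys, so that $\sqrt{d_k}$ acts as the scaling factor. In our method, we adopt self-attention to learn the features and relationships from the observation entity embedding and the global temporal information. To learn the independent policy in decentralized multi-agent learning, we define $K_i$, $Q_i$ and $V_i$ as the key, query and value matrices for each agent $a_i$. We further derive the query, key and value from the same matrix $R_i^l$, where $l \in \{1, \dots, L\}$ indexes the layers of the transformer and $L$ is the number of layers. Thus, we formulate our transformer as follows:

$$R_i^1 = \{h_i^{t-1}, e_i^t\}$$
$$Q_i^l, K_i^l, V_i^l = LF_{Q,K,V}(R_i^l)$$
$$R_i^{l+1} = \mathrm{Attention}(Q_i^l, K_i^l, V_i^l) \qquad (5)$$

where $LF$ represents the linear functions used to compute $K_i^l$, $Q_i^l$ and $V_i^l$. Finally, we project the entity features of the last transformer layer $R_i^L$ to the output space of the value function $Q_i$. We implement the projection using a linear function $P$:

$$Q_i(h_i^{t-1}, e_i^t, u_i^t) = P(R_i^L, u_i^t) \qquad (6)$$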
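To make the attention step concrete, here is a minimal single-head sketch in plain Python. It is an illustration under simplifying assumptions rather than the model's implementation: one head, no residual connections or layer normalization, random weights, and a made-up feature size. The point it shows is that the same projection weights accept any number of input tokens (hidden state plus entity embeddings).

```python
import math
import random


def matmul(a, b):
    """Multiply an (n x d) list-of-lists by a (d x m) list-of-lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(r, w_q, w_k, w_v):
    """One attention layer where queries, keys and values all come from the same input r."""
    q, k, v = matmul(r, w_q), matmul(r, w_k), matmul(r, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]          # transpose of K
    scores = matmul(q, k_t)                        # Q K^T, one row per token
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


rng = random.Random(0)
DIM = 4  # hypothetical token dimension, chosen small for readability


def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


w_q, w_k, w_v = rand_matrix(DIM, DIM), rand_matrix(DIM, DIM), rand_matrix(DIM, DIM)
for n_entities in (3, 5):
    # One hidden-state token followed by n_entities observation-entity embeddings.
    tokens = rand_matrix(1 + n_entities, DIM)
    out, weights = self_attention(tokens, w_q, w_k, w_v)
    print(len(out), len(weights[0]))  # output tokens, attention row width
```

The attention weights returned alongside the output are the per-entity importance values that the later analysis section visualizes as an attention matrix.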


3.3 Policy Decoupling

A single transformer-based individual function with a self-attention mechanism is still unable to handle the various required policy distributions. A flexible mapping function in Eq. 6 is needed to deal with the various input and output dimensions and provide strong representation ability. Using the correlation between input and output, we design a strategy called policy decoupling, which is the key part of UPDeT.

Figure 3: The main pipeline of our proposed UPDeT, where $o$, $e$ and $q$ represent the observation entity, feature embedding and Q-value of each action respectively. Three operations are adopted to avoid introducing new parameters when forming the policy distribution, namely ‘preserve’, ‘aggregation’ and ‘abandon’. Details can be found in Section 3.3 and a real case can be found in Fig. 7.

The main idea behind the policy decoupling strategy can be summarized into three points:

  • Point 1: No restriction on policy dimension. The output dimension of a standard transformer block must be equal to or less than the input dimension. This is unacceptable in some MARL tasks, as the number of actions can be larger than the number of entities.

  • Point 2: Ability to handle multiple tasks at a time. This requires a fixed model architecture without new parameters being introduced for new tasks. Unfortunately, if point 1 is satisfied, point 2 becomes very problematic to achieve. The difficulty lies in how to reconcile points 1 and 2.

  • Point 3: Make the model more explainable. It would be preferable if we could replace the conventional RNN-based model with a more explainable policy generation structure.

Following the above three points, we propose three policy decoupling methods, namely Vanilla Transformer, Aggregation Transformer and Universal Policy Decoupling Transformer (UPDeT). The pipelines are illustrated in Fig. 2. The details of the Vanilla Transformer and Aggregation Transformer are presented in the experiment section and act as our baselines. In this section, we mainly discuss the mechanism of our proposed UPDeT.

Taking the entity features of the last transformer layer outlined in Eq. 5, the main challenge is to build a strong mapping between the features and the policy distribution. UPDeT first matches the input entity with the related output policy part. This correspondence is easy to find in the MARL task, as interactive action between two agents is quite common. Once we match the corresponding entity features and actions, we substantially reduce the burden of model learning representation using the self-attention mechanism. Moreover, considering that there might be more than one interactive action for the matched entity feature, we separate the action space into several action groups, each of which consists of several actions matched with one entity. The pipeline of this process is illustrated in the left part of Fig. 3. In the mapping function, to satisfy point 1 and point 2, we adopt two strategies. First, if the action-group of one entity feature contains more than one action, a shared fully connected layer is added to map the output to the action number dimension. Second, if one entity feature has no corresponding action, we abandon it. There is no danger of losing the information carried by this kind of entity feature, as the transformer has aggregated the information necessary to each output. The pipeline of UPDeT can be found in the right part of Fig. 3. With UPDeT, there is no action restriction and no new parameter introduced in new scenarios. A single model can be trained on multiple tasks and deployed universally. In addition, matching the corresponding entity feature and action-group satisfies point 3, as the policy is explainable using an attention heatmap, as we will discuss in Section 4.4.
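The mapping described above can be sketched as follows. This is an illustrative reading, not the released implementation: the entity labels, the feature size, the random weights and the choice of six basic actions (four moves, stop and none-operation, following the appendix) are assumptions made for the example. The agent's own feature goes through a shared layer to a multi-action group, each enemy feature yields one attack Q-value, and ally features with no matched action are dropped.

```python
import random


def linear(x, weight, bias):
    """Apply y = W x + b to a single feature vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def decoupled_q_values(features, kinds, basic_head, attack_head):
    """Map per-entity features to Q-values, one action group per matched entity."""
    q_values = []
    for feat, kind in zip(features, kinds):
        if kind == "self":
            # Action group with several actions: shared fully connected layer.
            q_values += linear(feat, *basic_head)
        elif kind == "enemy":
            # One 'attack this enemy' action per enemy feature, shared weights.
            q_values += linear(feat, *attack_head)
        # kind == "ally": no matched action, so the feature is abandoned.
    return q_values


rng = random.Random(0)
DIM, N_BASIC = 4, 6  # hypothetical feature size; 6 basic actions assumed


def rand_layer(n_out, n_in):
    weight = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    return weight, [0.0] * n_out


basic_head = rand_layer(N_BASIC, DIM)   # parameters reused in every scenario
attack_head = rand_layer(1, DIM)


def q_for_scenario(n_allies, n_enemies):
    kinds = ["self"] + ["ally"] * n_allies + ["enemy"] * n_enemies
    features = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in kinds]
    return decoupled_q_values(features, kinds, basic_head, attack_head)


# Same parameters, different scenario sizes: 3m-like and 5m_vs_6m-like layouts.
print(len(q_for_scenario(2, 3)), len(q_for_scenario(4, 6)))  # 9 12
```

The number of output Q-values grows with the number of enemy entities while the parameter count stays fixed, which is the property that lets one model serve several scenarios.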

3.4 Temporal Unit Structure

Notably, however, a transformer-based individual value function with a policy decoupling strategy cannot handle a partially observable decision process without trajectory or history information. In Dec-POMDP (oliehoek2016concise), each agent $a$ chooses an action according to $\pi_a(u_a \mid \tau_a)$, where $u$ and $\tau$ represent the action and the action-observation history respectively. In GRU and LSTM, we adopt a hidden state to hold the information of the action-observation history. However, the combination of a transformer block and a hidden state has not yet been fully studied. In this section, we provide two approaches to handling the hidden state in UPDeT:

1) Global temporal unit treats the hidden state as an additional input of the transformer block. The process is formulated in a similar way to Eq. 5 with the relation: $R^1 = \{h_G^{t-1}, e_1^t\}$ and $\{h_G^t, e_L^t\} = R^L$. Here, we ignore the subscript $i$ and instead use $G$ to represent ‘global’. The global temporal unit is simple but efficient, and provides us with robust performance in most scenarios.

2) Individual temporal unit treats the hidden state as the inner part of each entity. In other words, each input maintains its own hidden state, while each output projects a new hidden state for the next time step. The individual temporal unit uses a more precise approach to controlling history information as it splits the global hidden state into individual parts. We use $n$ to represent the number of entities. The relation of input and output is formulated as $R^1 = \{h_1^{t-1}, \dots, h_n^{t-1}, e_1^t\}$ and $\{h_1^t, \dots, h_n^t, e_L^t\} = R^L$. However, this method introduces the additional burden of learning the hidden state independently for each entity. In experiment Section 4.1.2, we test both variants and discuss them further.
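The global temporal unit can be pictured as a recurrence in which the previous hidden state is prepended as one extra token and the corresponding output token becomes the next hidden state. The sketch below is only a schematic: `mix_tokens` is a deliberately trivial stand-in for the transformer block (it is not attention), and all names and numbers are hypothetical.

```python
def mix_tokens(tokens):
    """Stand-in for a transformer block: blend every token with the mean token."""
    dim = len(tokens[0])
    mean = [sum(tok[d] for tok in tokens) / len(tokens) for d in range(dim)]
    return [[0.5 * tok[d] + 0.5 * mean[d] for d in range(dim)] for tok in tokens]


def global_temporal_step(hidden, entity_embeddings, block=mix_tokens):
    """Prepend the previous hidden state as a token and read the new one back out."""
    outputs = block([hidden] + entity_embeddings)
    new_hidden, entity_features = outputs[0], outputs[1:]
    return new_hidden, entity_features


hidden = [0.0, 0.0]  # initial hidden state
episode = [
    [[1.0, 0.0], [0.0, 1.0]],   # step 1: two observed entities
    [[2.0, 2.0], [0.0, 0.0]],   # step 2: two observed entities
]
for embeddings in episode:
    hidden, features = global_temporal_step(hidden, embeddings)
    print(len(features))  # one feature vector per entity
```

In the individual variant described in item 2, each entity token would instead carry and update its own hidden state, at the cost of learning one recurrence per entity.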

3.5 Optimization

We use the standard squared TD error in DQNs (mnih2015human) to optimize our entire framework as follows:

$$L(\theta) = \sum_{i=1}^{b} \left[ \left( y_i^{DQN} - Q(s, u; \theta) \right)^2 \right] \qquad (7)$$

Here, $b$ represents the batch size. In partially observable settings, agents can benefit from conditioning on action-observation history. hausknecht2015deep propose Deep Recurrent Q-networks (DRQN) for this sequential decision process. For our part, we replace the widely used GRU (chung2014empirical)/LSTM (hochreiter1997long) unit in DRQN with a transformer-based temporal unit and then train the whole model.
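For readers who want the loss spelled out, the following is a minimal sketch of a squared TD error summed over a batch. The form of the target (reward plus discounted maximum target-network value, zeroed at terminal steps) is the usual DQN choice and is an assumption here, since the text does not define it; the function names and the numbers are illustrative only.

```python
def td_targets(rewards, next_max_q, dones, gamma=0.99):
    """Assumed DQN-style targets: r + gamma * max_u' Q_target(s', u'), or r if terminal."""
    return [
        r + gamma * q_next * (0.0 if done else 1.0)
        for r, q_next, done in zip(rewards, next_max_q, dones)
    ]


def squared_td_loss(q_taken, targets):
    """Sum over the batch of (target - Q(s, u))^2."""
    return sum((y - q) ** 2 for y, q in zip(targets, q_taken))


# Hypothetical batch of two transitions.
targets = td_targets(rewards=[1.0, 0.0], next_max_q=[2.0, 3.0],
                     dones=[False, True], gamma=0.5)
loss = squared_td_loss(q_taken=[1.5, 0.5], targets=targets)
print(targets, loss)  # [2.0, 0.0] 0.5
```

In the multi-agent setting the value entering this loss is the global Q-function produced by the credit assignment function, so gradients flow through it into every agent's individual value function.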

4 StarCraft II Experiment

In this section, we evaluate UPDeT and its variants with different policy decoupling methods in the context of challenging micromanagement games in StarCraft II. We compare UPDeT with the RNN-based model on a single scenario and test the transfer capability on multiple-scenario transfer tasks. The experimental results show that UPDeT achieves significant improvement compared to the RNN-based model.

4.1 Single Scenario

In the single scenario experiments, we evaluate the model performance on different scenarios from SMAC (samvelyan2019starcraft). Specifically, the scenarios considered are as follows: 3 Marines vs 3 Marines (3m, Easy), 8 Marines vs 8 Marines (8m, Easy), 4 Marines vs 5 Marines (4m_vs_5m, Hard+) and 5 Marines vs 6 Marines (5m_vs_6m, Hard). In all these games, only the units on the player’s side are treated as agents. Dead enemy units are masked out from the action space to ensure that the executed action is valid. More detailed settings can be found in the SMAC environment (samvelyan2019starcraft).

4.1.1 Methods and Training Details

The MARL methods for evaluation include VDN (sunehag2017value), QMIX (rashid2018qmix) and QTRAN (hostallero2019learning). All three SOTA methods’ original implementation can be found at These methods were selected due to their robust performance across different multi-agent tasks. Other methods, including COMA (foerster2017counterfactual) and IQL (tan1993multi), do not perform stably across all tasks, as has been shown in several recent works (rashid2018qmix, mahajan2019maven, zhou2020learning). Therefore, we combine UPDeT with VDN, QMIX and QTRAN to show that our model can improve performance significantly compared to the GRU-based model.

(a) Policy variants
(b) Temporal variants
(c) MARL methods
(d) Easy scenarios
(e) Hard scenarios
(f) Mismatch experiment
Figure 4: Experimental results with different task settings. Details can be found in Section 4.1.2.

4.1.2 Result

The performance of the different policy decoupling methods can be found in Fig. 4(a). Vanilla Transformer is our baseline for all transformer-based models. This transformer only satisfies point 2. Each output embedding can either be projected to an action or abandoned. The vanilla transformer fails to beat the enemies in the experiment. Aggregation Transformer is a variant of the vanilla transformer in which the embeddings are aggregated into a global embedding and then projected to a policy distribution. This transformer only satisfies point 1. The performance of the aggregation transformer is worse than that of the GRU-based model. The result shows that it is only with a policy decoupling strategy that the transformer-based model can outperform the conventional RNN-based model.

Next, we adopt UPDeT to find the best temporal unit architecture in Fig. 4(b). The result shows that without a hidden state, the performance decreases significantly. The temporal unit with a global hidden state converges faster than the one with individual hidden states. However, the final performances are almost the same.

To test the generalization of our model, we combine UPDeT with VDN, QMIX and QTRAN respectively and compare the final performance with RNN-based methods in Fig. 4(c). We evaluate the model performance on the 5m_vs_6m (Hard) scenario. Combined with UPDeT, all three MARL methods improve by large margins compared to the GRU-based model. The result shows that our model can be injected into any existing state-of-the-art MARL method to yield better performance. Furthermore, we combine UPDeT with VDN and evaluate the model performance on different scenarios from Easy to Hard+ in Fig. 4(d) and Fig. 4(e). The results show that UPDeT performs stably on easy scenarios and significantly outperforms the GRU-based model on hard scenarios. In the 4m_vs_5m (Hard+) scenario, the performance improvement achieved by UPDeT relative to the GRU-based model is around 80%.

Finally, we conduct an ablation study on UPDeT with paired and unpaired observation-entity/action-group matching, the results of which are presented in Fig. 4(f). We disrupt the original correspondence between the ‘attack’ action and the enemy unit. The final performance decreases heavily compared to the original model, and is even worse than that of the GRU-based model. We accordingly conclude that only with policy decoupling and a paired observation-entity/action-group strategy can UPDeT learn a strong policy.

4.2 Multiple Scenarios

(a) Transfer from 7 marines to 3 marines
(b) Transfer from 3 marines to 7 marines
Figure 5: Experimental results on transfer learning with UPDeT (Uni-Transfer) and GRU unit (GRU-Transfer), along with UPDeT training from scratch (Uni-Scratch). At time step 0 and 500k, we load the model from the source scenario and finetune on the target scenarios. The circular points indicate the model performance on new scenarios without finetuning.

In this section, we discuss the transfer capability of UPDeT compared to the RNN-based model. We evaluate the model performance in a curriculum style. First, the model is trained on the 3m (3 Marines vs 3 Marines) scenario. We then use the pretrained 3m model to continue training on the 5m (5 Marines vs 5 Marines) and 7m (7 Marines vs 7 Marines) scenarios. We also conduct an experiment in reverse, from 7m to 3m. During transfer learning, the model architecture of UPDeT remains fixed. Considering that the RNN-based model cannot handle various input and output dimensions, we modify the architecture of the source RNN model when training on the target scenario. We preserve the parameters of the GRU cell and initialize the fully connected layer with proper input and output dimensions to fit the new scenario. The final results can be seen in Fig. 5(a) and Fig. 5(b). Our proposed UPDeT achieves significantly better results than the GRU-based model. Statistically, UPDeT’s total timestep cost to converge is at least 10 times less than that of the GRU-based model and 100 times less than training from scratch. Moreover, the model demonstrates a strong generalization ability without finetuning, indicating that UPDeT learns a robust policy with meta-level skill.

4.3 Extensive experiment on large-scale MAS

To evaluate the model performance in large-scale scenarios, we test our proposed UPDeT on the 10m_vs_11m and 20m_vs_21m scenarios from SMAC and a 64_vs_64 battle game in the MAgent Environment (zheng2017magent). The final results can be found in Appendix E.

4.4 Attention based Strategy: An analysis

The significant performance improvement achieved by UPDeT on the SMAC multi-agent challenge can be credited to the self-attention mechanism brought by both transformer blocks and the policy decoupling strategy in UPDeT. In this section, we mainly discuss how the attention mechanism assists in learning a much more robust and explainable strategy. Here, we use the 3 Marines vs 3 Marines game (therefore, the size of the raw attention matrix is 6x6) as an example to demonstrate how the attention mechanism works. As mentioned in the caption of Fig. 6, we simplify the raw complete attention matrix to a grouped attention matrix. Fig. 6(b) presents the three different stages in one episode, namely Game Start, Attack and Survive, with their corresponding attention matrices and strategies. In the Game Start stage, the highest attention is in line 1 col 3 of the matrix, indicating that the agent pays more attention to its allies than its enemies. This phenomenon can be interpreted as follows: in the startup stage of one game, all the allies are spawned at the left side of the map and are encouraged to find and attack the enemies on the right side. In the Attack stage, the highest attention is in line 2 col 2 of the matrix, which indicates that the enemy is now in the agent’s attack range; therefore, the agent will attack the enemy to get more rewards. Surprisingly, the agent chooses to attack the enemy with the lowest health value. This indicates that a long-term plan can be learned based on the attention mechanism, since killing the weakest enemy first can decrease the punishment from future enemy attacks. In the Survive stage, the agent’s health value is low, meaning that it needs to avoid being attacked. The highest attention is located in line 1 col 1, which clearly shows that the most important thing under the current circumstances is to stay alive. For as long as the agent is alive, there is still a chance for it to return to the front line and get more reward while enemies are attacking the allies instead of the agent itself.

In conclusion, the self-attention mechanism and policy decoupling strategy of UPDeT provide a strong and clear relation between attention weights and final strategies. This relation can help us better understand the policy generation based on the distribution of attention among different entities. An interesting idea presents itself here: namely, if we can find a strong mapping between attention matrix and final policy, the character of the agent could be modified in an unsupervised manner.

5 Conclusion

In this paper, we propose UPDeT, a universal policy decoupling transformer model that extends MARL to a much broader scenario. UPDeT is general enough to be plugged into any existing MARL method. Moreover, our experimental results show that, when combined with UPDeT, existing state-of-the-art MARL methods can achieve further significant improvements with the same training pipeline. On transfer learning tasks, our model is 100 times faster than training from scratch and 10 times faster than training using the RNN-based model. In the future, we aim to develop a centralized function based on UPDeT and apply the self-attention mechanism to the entire pipeline of MARL framework to yield further improvement.


This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grant No.U19A2073 and in part by the National Natural Science Foundation of China (NSFC) under Grant No.61976233 and No.61906109 and Australian Research Council Discovery Early Career Researcher Award (DE190100626), and Funding of “Leading Innovation Team of the Zhejiang Province” (2018R01017).


Appendix A Details of SMAC environment

The action space contains four movement directions, k attack actions (where k is the fixed maximum number of enemy units in a map), stop and none-operation. At each time step, the agents receive a joint team reward, which is defined by the total damage incurred by the agents and the total damage from the enemy side. Each agent is described by several attributes, including health points (HP), weapon cool down (CD), unit type, last action and the relative distance of the observed units. The enemy units are described in the same way except that CD is excluded. The partial observation of an agent comprises the attributes of the units, including both the agents and the enemy units, that exist within its view range, which is a circle with a specific radius.
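The size of this action space follows directly from the description: two utility actions, four moves and k attack slots, for a total of 6 + k. The sketch below enumerates such a layout; the action names and their ordering are hypothetical and chosen only for illustration, so they should not be read as the environment's actual indices.

```python
def action_names(n_enemies):
    """Hypothetical layout: none-operation, stop, four moves, then one attack per enemy slot."""
    moves = ["move_north", "move_south", "move_east", "move_west"]
    attacks = [f"attack_enemy_{i}" for i in range(n_enemies)]
    return ["no_op", "stop"] + moves + attacks


# k is the maximum number of enemy units in the map.
for scenario, k in (("3m", 3), ("5m_vs_6m", 6)):
    print(scenario, len(action_names(k)))  # 9 actions for 3m, 12 for 5m_vs_6m
```

This dependence of the action count on k is exactly why a network with a fixed output layer cannot be moved between scenarios without changing its final layer.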

Appendix B Details of Model

The transformer used in all experiments has 3 attention heads and 2 transformer layers. The other important training hyperparameters are as follows:

List of Hyperparameters
Name                                 Value
batch size                           32
test interval                        2000
gamma                                0.99
buffer size                          5000
token dimension (UPDeT)              32
channel dimension (UPDeT)            32
epsilon start                        1.0
epsilon end                          0.05
rnn hidden dimension                 64
target net update interval           200
mixing embedding dimension (QMIX)    32
hypernet layers (QMIX)               2
hypernet embedding (QMIX)            64
mixing embedding dimension (QTRAN)   32
opt loss (QTRAN)                     1
nopt min loss (QTRAN)                0.1
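For convenience, the values listed above can be collected in a single configuration object. The sketch below is one possible layout; the key names and the grouping by method are assumptions, and only the values come from this appendix.

```python
# Key names and nesting are illustrative; values are taken from Appendix B.
CONFIG = {
    "batch_size": 32,
    "test_interval": 2000,
    "gamma": 0.99,
    "buffer_size": 5000,
    "epsilon_start": 1.0,
    "epsilon_end": 0.05,
    "rnn_hidden_dim": 64,
    "target_update_interval": 200,
    # UPDeT-specific settings, including the transformer shape stated above.
    "updet": {"token_dim": 32, "channel_dim": 32, "heads": 3, "layers": 2},
    # Settings used only when the corresponding mixing framework is selected.
    "qmix": {"mixing_embed_dim": 32, "hypernet_layers": 2, "hypernet_embed": 64},
    "qtran": {"mixing_embed_dim": 32, "opt_loss": 1, "nopt_min_loss": 0.1},
}
```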

Appendix C SOTA MARL value-based Framework

The three SOTA methods can be briefly summarized as follows:

  • VDN (sunehag2017value): this method represents the joint Q-value as a sum of individual Q-value functions, each of which conditions only on an individual agent's observations and actions.

  • QMIX (rashid2018qmix): this method learns a decentralized Q-function for each agent, with the assumption that the centralized Q-value increases monotonically with the individual Q-values.

  • QTRAN (hostallero2019learning): this method formulates multi-agent learning as an optimization problem with linear constraints and relaxes it with L2 penalties for tractability.
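The difference between the first two value factorizations can be sketched in a few lines of Python. This is a deliberately simplified illustration: `monotonic_mix` is a hypothetical single-layer stand-in for the QMIX mixing network, which in the original method is produced by state-conditioned hypernetworks. It keeps only the property that matters here, namely that non-negative mixing weights make the joint value non-decreasing in every individual Q-value.

```python
def vdn_total(agent_qs):
    """VDN-style factorization: the joint Q-value is the sum of individual Q-values."""
    return sum(agent_qs)


def monotonic_mix(agent_qs, weights, bias):
    """Simplified QMIX-style mixing (assumed single linear layer).

    Taking the absolute value of each weight keeps it non-negative, so the
    joint Q-value can never decrease when an individual Q-value increases.
    """
    return sum(abs(w) * q for w, q in zip(weights, agent_qs)) + bias


qs = [1.0, 2.0, 3.0]
print(vdn_total(qs))  # 6.0
```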

(a) Attention Matrix
(b) Strategy with Attention
Figure 6: An analysis of the attention-based strategy of UPDeT. Part (a) visualizes a typical attention matrix. Part (b) uses the simplified attention matrix to describe the relationship between attention and the final strategy. Further discussion can be found in Section 4.4.

Appendix D UPDeT on SMAC: A real case

We take the 3 Marines vs 3 Marines challenge from SMAC with UPDeT as an example; more details can be found in Fig. 7. The observations are separated into 3 groups: the main agent, the two other ally agents and the three enemies. The policy output includes the basic actions, which correspond to the main agent’s observation, and the attack actions, one for each enemy observation. The hidden state is added after the embedding layer. The outputs for the other ally agents are discarded, as there is no corresponding action. Once an agent or enemy has died, we mask the corresponding unavailable actions in the action selection stage to ensure that only available actions are selected.
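The output assembly and masking described in this appendix can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the Q-values and the availability mask are all hypothetical, and the transformer that would produce the per-token outputs is omitted.

```python
import math


def decoupled_q_values(own_token_q, enemy_token_q):
    """Concatenate the basic-action Q-values (from the main agent's token)
    with one attack Q-value per enemy token. Outputs for ally tokens are
    simply not used, since no action corresponds to them."""
    return list(own_token_q) + list(enemy_token_q)


def masked_greedy_action(q_values, available):
    """Return the index of the highest-valued action among the available ones."""
    masked = [q if ok else -math.inf for q, ok in zip(q_values, available)]
    return max(range(len(masked)), key=masked.__getitem__)


# Hypothetical values for 3 Marines vs 3 Marines: 6 basic actions + 3 attacks.
own_q = [0.0, 0.1, 0.3, 0.2, 0.1, 0.0]
enemy_q = [0.9, 0.5, 0.7]
q = decoupled_q_values(own_q, enemy_q)

# Hypothetical mask: the first basic action is unavailable and the first enemy
# has died, so the attack on it (index 6) is masked out.
available = [False, True, True, True, True, True, False, True, True]
action = masked_greedy_action(q, available)
print(action)  # 8: attacking the third enemy is the best remaining action
```

Because every attack value is read from its own enemy token, the same model can in principle be applied to maps with a different number of enemies without changing the output layer.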

Figure 7: Real case on 3 Marines vs 3 Marines Challenge from SMAC.

Appendix E Results of Extensive Experiment on Large Scale

(a) Large-Scale Scenarios
(b) MAgent Battle: 64 vs 64
Figure 8: Experimental results on the large-scale MAS, including SMAC and MAgent.

We further test the robustness of UPDeT in a large-scale multi-agent system. To do so, we enlarge the game size in SMAC (samvelyan2019starcraft) to incorporate more agents and enemies on the battlefield. We use a 10 Marines vs 11 Marines game and a 20 Marines vs 21 Marines game to compare the performance of the UPDeT and GRU-based approaches. In the 20 Marines vs 21 Marines game, to accelerate training and satisfy hardware limitations, we decrease the batch size of both the GRU baseline and UPDeT from 32 to 24 in the training stage. The final results can be found in Fig. 8(a). The improvement is still significant in terms of both sample efficiency and final performance. Moreover, it is worth mentioning that the model size of UPDeT stays fixed, while the GRU-based model becomes larger in large-scale scenarios. In the 20 Marines vs 21 Marines game, the model size of GRU is almost double that of UPDeT. This indicates that UPDeT keeps the model lightweight while still maintaining good performance.
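One way to see why the model size behaves differently is to count the parameters of the input layer alone. The sketch below assumes that a GRU-based agent first passes the concatenated observation of all entities through a dense layer, while a token-based model shares one embedding across entities. The per-entity feature dimension of 5 is a hypothetical value, and the counts are not the model sizes measured in the experiments.

```python
def flat_input_params(n_entities, feat_dim, hidden_dim):
    """Dense layer over the concatenated observation: weights plus bias.
    The weight count grows linearly with the number of observed entities."""
    return n_entities * feat_dim * hidden_dim + hidden_dim


def token_embed_params(feat_dim, token_dim):
    """Embedding shared by all entity tokens: independent of the entity count."""
    return feat_dim * token_dim + token_dim


FEAT_DIM = 5  # hypothetical per-entity feature size

for n_entities in (21, 41):  # 10 vs 11 and 20 vs 21 Marines
    print(
        n_entities,
        flat_input_params(n_entities, FEAT_DIM, hidden_dim=64),
        token_embed_params(FEAT_DIM, token_dim=32),
    )
```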

We also test the model performance in the MAgent environment (zheng2017magent). The settings of MAgent are quite different from those of SMAC. First, the observation size and the number of available actions are not related to the number of agents. Second, the 64_vs_64 battle game we tested is a two-player zero-sum game, another active research area that combines MARL and game theory (GT); the most successful attempt in this area adopts a mean-field approximation of GT in MARL to accelerate self-play training (yang2018mean). Third, as for the model architecture, there is no need to use a recurrent network like GRU in MAgent, and the large observation size requires the use of a CNN for embedding. However, by treating UPDeT as a pure encoder without a recurrent architecture, we can still conduct experiments on MAgent; the final results can be found in Fig. 8(b). As the results show, UPDeT performs better than the DQN baseline, although the improvement is not as significant as it is in SMAC.