Reinforcement Learning (RL) provides a framework for decision-making problems in an interactive environment, with applications including robotics control (hester2010generalized), video gaming (mnih2015human), autonomous driving (bojarski2016end), person search (ChangHSLYH18) and vision-language navigation (zhu2020vision). Cooperative multi-agent reinforcement learning (MARL), a long-standing problem in the RL context, involves organizing multiple agents to achieve a goal, and is thus a key tool for addressing many real-world problems, such as mastering multi-player video games (peng2017multiagent) and studying population dynamics (yang2017study).
A number of methods have been proposed that exploit an action-value function to learn a multi-agent model (sunehag2017value, rashid2018qmix, du2019liir, mahajan2019maven, hostallero2019learning, zhou2020learning, yang2020multi). However, current methods have poor representation learning ability and fail to exploit the common structure underlying these tasks, because they tend to treat the observations from different entities in the environment as an integral part of a whole. Accordingly, they give tacit support to the assumption that neural networks are able to automatically decouple the observation and find the best mapping between the whole observation and the policy. Adopting this approach means that they treat all information from other agents or different parts of the environment in the same way. The most commonly used method involves concatenating the observations from each entity into a vector that is used as input (rashid2018qmix, du2019liir, zhou2020learning). In addition, current methods ignore the rich physical meanings behind each action. Multi-agent tasks feature a close relationship between the observation and the output. If the model does not decouple the observations from the different agents, the individual value functions may be misguided, which in turn impedes the centralized value function. Worse yet, conventional models require the input and output dimensions to be fixed (shao2018starcraft, wang2020few), which makes zero-shot transfer impossible. Thus, current methods are of limited use in real-world applications.
Our solution to these problems is to develop a multi-agent reinforcement learning (MARL) framework with no limitation on input or output dimension. Moreover, this model should be general enough to be applicable to any existing MARL methods. More importantly, the model should be explainable and capable of providing further improvement for both the final performance on single-task scenarios and transfer capability on multi-task scenarios.
Inspired by the self-attention mechanism (vaswani2017attention), we propose a transformer-based MARL framework named Universal Policy Decoupling Transformer (UPDeT). This approach has four key advantages: 1) once trained, it can be universally deployed; 2) it provides more robust representations via a policy decoupling strategy; 3) it is more explainable; and 4) it is general enough to be applied to any MARL model. We design a transformer-based function that handles various observation sizes by treating individual observations as "observation-entities". We match each observation-entity with an action-group by separating the action space into several action-groups with reference to the corresponding observation-entity, giving us a set of matched observation-entity and action-group pairs. We further use the self-attention mechanism to learn the relationship between the matched observation-entity and the other observation-entities. Through the self-attention map and the embedding of each observation-entity, UPDeT can optimize the policy at the action-group level. We refer to this strategy as Policy Decoupling. By combining the transformer and the policy decoupling strategy, UPDeT significantly outperforms conventional RNN-based models.
In UPDeT, there is no need to introduce any new parameters for new tasks. We also demonstrate that it is only with a decoupled policy and matched observation-entity and action-group pairs that UPDeT can learn a strong representation with high transfer capability. Finally, UPDeT can be plugged into any existing method with almost no changes to the framework architecture, while still bringing significant improvements to the final performance, especially in hard and complex multi-agent tasks.
The main contributions of this work are as follows: First, our UPDeT-based MARL framework outperforms RNN-based frameworks by a large margin in terms of final performance with state-of-the-art centralized functions. Second, our model has strong transfer capability and can handle a number of different tasks at a time. Third, our model accelerates transfer learning, converging roughly 10 times faster (in total environment steps) than RNN-based models in most scenarios.
2 Related Work
Attention mechanisms have become an integral part of models that capture global dependencies. In particular, self-attention (parikh2016decomposable) calculates the response at a specific position in a sequence by attending to all positions within this sequence. vaswani2017attention demonstrated that machine translation models can achieve state-of-the-art results solely by using self-attention. parmar2018image proposed an Image Transformer model that applies self-attention to image generation. wang2018non formalized self-attention as a non-local operation in order to model the spatial-temporal dependencies in video sequences. In spite of this, self-attention mechanisms have not yet been fully explored in multi-agent reinforcement learning.
Another line of research is multi-agent reinforcement learning (MARL). Existing work in MARL focuses primarily on building a centralized function to guide the training of the individual value functions (lowe2017multi, sunehag2017value, rashid2018qmix, mahajan2019maven, hostallero2019learning, yang2020multi, zhou2020learning). Few works have sought to build better individual functions with strong representation and transfer capability. In standard reinforcement learning, this kind of generalization has been studied extensively (taylor2009transfer, ammar2012reinforcement, parisotto2015actor, gupta2017learning, da2019survey), while multi-agent transfer learning has proven to be more difficult than the single-agent scenario (boutsioukis2011transfer, shao2018starcraft, vinyals2019grandmaster). However, the transfer capability of a multi-agent system is of greater significance due to the varying numbers of agents, observation sizes and policy distributions.
To the best of our knowledge, we are the first to develop a multi-agent framework capable of handling multiple tasks at a time. Moreover, we provide a policy decoupling strategy to further improve model performance and facilitate multi-agent transfer learning, which is a significant step towards real-world multi-agent applications.
We begin by introducing the notations and basic task settings necessary for our approach. We then describe a transformer-based individual function and policy decoupling strategy under MARL. Finally, we introduce different temporal units and assimilate our Universal Policy Decoupling Transformer (UPDeT) into Dec-POMDP.
3.1 Notations and Task Settings
Multi-agent Reinforcement Learning
A cooperative multi-agent task is a decentralized partially observable Markov decision process (oliehoek2016concise) with a tuple $G = \langle S, U, P, r, Z, O, n, \gamma \rangle$. Let $s \in S$ denote the global state of the environment, while $N \equiv \{1, \ldots, n\}$ represents the set of agents and $U$ is the action space. At each time step $t$, agent $i \in N$ selects an action $u_i^t \in U$, forming a joint action $\mathbf{u}^t \in U^n$, which in turn causes a transition in the environment represented by the state transition function $P(s^{t+1} \mid s^t, \mathbf{u}^t)$. All agents share the same reward function $r(s, \mathbf{u})$, while $\gamma \in [0, 1)$ is a discount factor. We consider a partially observable scenario in which each agent makes individual observations $z \in Z$ according to the observation function $O(s, i)$. Each agent has an action-observation history $\tau_i \in T \equiv (Z \times U)^*$ that conditions a stochastic policy $\pi_i(u_i \mid \tau_i)$, creating the following joint action value: $Q^{\boldsymbol{\pi}}(s^t, \mathbf{u}^t) = \mathbb{E}\left[R^t \mid s^t, \mathbf{u}^t\right]$, where $R^t = \sum_{l=0}^{\infty} \gamma^l r^{t+l}$ is the discounted return.
Centralized training with decentralized execution Centralized training with decentralized execution (CTDE) is a commonly used architecture in the MARL context. Each agent is conditioned only on its own action-observation history to make a decision using the learned policy. The centralized value function provides a centralized gradient to update the individual function based on its output. Therefore, a stronger individual value function can benefit the centralized training.
3.2 Transformer-based Individual Value Function
In this section, we present a mathematical formulation of our transformer-based model, UPDeT, and describe the calculation of the global Q-function with a self-attention mechanism. First, the observation is embedded into a semantic embedding to handle varying observation spaces. For example, if agent $i$ observes $k$ entities at time step $t$, all observation-entities are embedded via an embedding layer as follows:

$$e_{i,1}^t, \ldots, e_{i,k}^t = \mathrm{Embed}(o_{i,1}^t, \ldots, o_{i,k}^t; \theta_e). \quad (1)$$
Here, $i$ is the index of the agent, $i \in \{1, \ldots, n\}$. Next, the individual value functions for the $n$ agents at each time step $t$ are estimated as follows:

$$Q_i(o_i^t, h_i^{t-1}, u) = f(e_{i,1}^t, \ldots, e_{i,k}^t, h_i^{t-1}, u; \theta_i). \quad (2)$$
We introduce $h_i^{t-1}$, the temporal hidden state at the last time step $t-1$, since a POMDP policy is highly dependent on historical information. $e_{i,j}^t$ denotes the observation embedding, while $u$ is the candidate action, $u \in U$. $\theta_i$ is the parameter that defines $f$. Finally, the global Q-function is calculated from all the individual value functions, as follows:

$$Q_{tot} = F(Q_1, \ldots, Q_n; \theta_F). \quad (3)$$
$F$ is the credit assignment function for $Q_{tot}$, defined by $\theta_F$, over the individual value function $Q_i$ of each agent $i$, as utilized in rashid2018qmix and sunehag2017value. For example, in VDN, $F$ is a sum function that can be expressed as $Q_{tot} = \sum_{i=1}^{n} Q_i$.
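As a concrete illustration, the VDN instance of the credit assignment function $F$ reduces to a simple sum over the individual Q-values. The following is a minimal sketch under our own function names, not the original implementation:

```python
import numpy as np

# Minimal sketch of VDN-style credit assignment: the joint Q-value is
# the sum of the per-agent individual Q-values (function name is ours).
def vdn_mixing(individual_qs):
    """individual_qs: shape (n_agents,), one Q_i(tau_i, u_i) per agent."""
    return float(np.sum(individual_qs))

# Three agents' individual Q-values combine into one joint value.
q_tot = vdn_mixing(np.array([1.5, -0.5, 2.0]))  # -> 3.0
```

QMIX and QTRAN replace this sum with learned monotonic or constrained mixing networks, but the interface (individual Q-values in, joint Q-value out) is the same.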
Implementing the Q-function with self-attention vaswani2017attention adopts three matrices, $K$, $Q$ and $V$, representing a set of keys, queries and values respectively. Attention is computed as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V, \quad (4)$$
where $d_k$ is a scaling factor equal to the dimension of the key. In our method, we adopt self-attention to learn the features and relationships from the observation-entity embeddings and the global temporal information. To learn an independent policy in decentralized multi-agent learning, we define $K_i$, $Q_i$ and $V_i$ as the key, query and value matrices for each agent $i$. We further compute the query, key and value from the same input matrix $X_i^{(l-1)}$, where $l \in \{1, \ldots, L\}$ indexes the layers of the transformer. Thus, we formulate our transformer as follows:

$$X_i^{(l)} = \mathrm{Attention}(Q_i^{(l)}, K_i^{(l)}, V_i^{(l)}), \quad Q_i^{(l)}, K_i^{(l)}, V_i^{(l)} = G(X_i^{(l-1)}), \quad X_i^{(0)} = (e_{i,1}^t, \ldots, e_{i,k}^t, h_i^{t-1}), \quad (5)$$
where $G$ represents the linear functions used to compute $Q_i^{(l)}$, $K_i^{(l)}$ and $V_i^{(l)}$. Finally, we project the entity features $X_i^{(L)}$ of the last transformer layer to the output space of the value function $Q_i$. We implement the projection using a linear function $P$:

$$Q_i(o_i^t, h_i^{t-1}, \cdot) = P(X_i^{(L)}; \theta_P). \quad (6)$$
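The core of this computation can be sketched in a few lines of NumPy: one scaled dot-product self-attention layer over the entity embeddings (with the hidden state treated as an extra row), producing per-entity features that a linear head can then project to values. All variable names and weight shapes here are illustrative assumptions, not the paper's code:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """One scaled dot-product self-attention layer.
    X: (k, d) entity embeddings; returns attended features, also (k, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # scale by sqrt(d_k)
    return softmax(scores) @ V                # rows attend to all entities

rng = np.random.default_rng(0)
k, d = 4, 8                                   # 4 observation-entities, dim 8
X = rng.normal(size=(k, d))                   # embeddings (incl. hidden-state row)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
features = self_attention(X, Wq, Wk, Wv)      # entity features of this layer
```

Stacking this layer $L$ times and projecting `features` with a linear map corresponds to Eq. 5 and Eq. 6 respectively.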
3.3 Policy Decoupling
A single transformer-based individual value function with a self-attention mechanism is still unable to handle the variety of required policy distributions. A flexible mapping function in Eq. 6 is needed to deal with the various input and output dimensions and provide strong representation ability. Using the correlation between input and output, we design a strategy called policy decoupling, which is the key part of UPDeT.
The main idea behind the policy decoupling strategy can be summarized into three points:
Point 1: No restriction on policy dimension. The output dimension of a standard transformer block must be equal to or less than the input dimension. This is unacceptable in some MARL tasks, as the number of actions can be larger than the number of entities.
Point 2: Ability to handle multiple tasks at a time. This requires a fixed model architecture, with no new parameters being introduced for new tasks. Unfortunately, if point 1 is satisfied, point 2 becomes very problematic to achieve. The difficulty lies in how to reconcile points 1 and 2.
Point 3: Make the model more explainable. It would be preferable if we could replace the conventional RNN-based model with a more explainable policy generation structure.
Following the above three points, we propose three policy decoupling methods, namely Vanilla Transformer, Aggregation Transformer and Universal Policy Decoupling Transformer (UPDeT). The pipelines are illustrated in Fig. 2. The details of the Vanilla Transformer and Aggregation Transformer are presented in the experiment section and act as our baselines. In this section, we mainly discuss the mechanism of our proposed UPDeT.
Taking the entity features of the last transformer layer outlined in Eq. 5, the main challenge is to build a strong mapping between the features and the policy distribution. UPDeT first matches each input entity with the related part of the output policy. This correspondence is easy to find in MARL tasks, as interactive actions between two agents are quite common. Once we match the corresponding entity features and actions, we substantially reduce the burden on the model of learning representations with the self-attention mechanism. Moreover, considering that there might be more than one interactive action for a matched entity feature, we separate the action space into several action-groups, each of which consists of several actions matched with one entity. The pipeline of this process is illustrated in the left part of Fig. 3. In the mapping function, to satisfy points 1 and 2, we adopt two strategies. First, if the action-group of an entity feature contains more than one action, a shared fully connected layer is added to map the output to the dimension of the action number. Second, if an entity feature has no corresponding action, we abandon it; there is no danger of losing the information carried by this kind of entity feature, as the transformer has already aggregated the necessary information into each output. The pipeline of UPDeT can be found in the right part of Fig. 3. With UPDeT, there is no restriction on actions and no new parameters are introduced in new scenarios. A single model can be trained on multiple tasks and deployed universally. In addition, matching the corresponding entity features and action-groups satisfies point 3, as the policy is explainable via an attention heatmap, as we will discuss in Section 4.4.
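The decoupling step can be illustrated with a toy sketch (all shapes and names are our assumptions): the agent's own entity feature passes through a shared head for the basic action-group, each enemy feature yields one attack Q-value, and ally features with no matched action-group are dropped.

```python
import numpy as np

def decouple_policy(features, enemy_idx, W_basic, w_attack):
    """features: (k, d) entity features from the last transformer layer.
    Row 0 is assumed to be the agent's own entity; enemy_idx lists the
    rows matched with attack actions. Returns one flat Q-vector."""
    basic_q = features[0] @ W_basic                 # shared head: basic actions
    attack_q = np.array([features[i] @ w_attack     # one attack Q per enemy
                         for i in enemy_idx])
    # Ally features with no corresponding action-group are abandoned;
    # their information has already been aggregated by self-attention.
    return np.concatenate([basic_q, attack_q])

rng = np.random.default_rng(1)
feats = rng.normal(size=(6, 8))                       # self + 2 allies + 3 enemies
q = decouple_policy(feats, enemy_idx=[3, 4, 5],
                    W_basic=rng.normal(size=(8, 6)),  # 6 basic actions
                    w_attack=rng.normal(size=8))
```

Because the heads are shared across entities, the same parameters serve any number of enemies, which is what allows a single model to be deployed across scenarios of different sizes.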
3.4 Temporal Unit Structure
Notably, however, a transformer-based individual value function with a policy decoupling strategy cannot handle a partially observable decision process without trajectory or history information. In Dec-POMDP (oliehoek2016concise), each agent chooses an action according to $\pi_i(u_i \mid \tau_i)$, where $u_i$ and $\tau_i$ represent the action and the action-observation history respectively. In GRUs and LSTMs, a hidden state is adopted to hold the information of the action-observation history. However, the combination of a transformer block and a hidden state has not yet been fully studied. In this section, we provide two approaches to handling the hidden state in UPDeT:
1) The global temporal unit treats the hidden state as an additional input of the transformer block. The process is formulated in a similar way to Eq. 5, with the relations $X^{(0)} = (e_1^t, \ldots, e_k^t, h_g^{t-1})$ and $h_g^t = X^{(L)}_{k+1}$, where $X^{(0)}$ denotes the transformer input and $X^{(L)}$ its last-layer output; that is, the output feature at the hidden-state position becomes the new global hidden state. Here, we ignore the agent subscript $i$ and instead use $g$ to represent 'global'. The global temporal unit is simple but efficient, and provides us with robust performance in most scenarios.
2) The individual temporal unit treats the hidden state as an inner part of each entity. In other words, each input maintains its own hidden state, while each output projects a new hidden state for the next time step. The individual temporal unit uses a more precise approach to controlling history information, as it splits the global hidden state into individual parts. Using $m$ to represent the number of entities, the relation between input and output is formulated as $X^{(0)} = (e_1^t \oplus h_1^{t-1}, \ldots, e_m^t \oplus h_m^{t-1})$ and $h_j^t = X^{(L)}_j$, where $\oplus$ denotes fusing an entity embedding with its own hidden state. However, this method introduces the additional burden of learning the hidden state independently for each entity. In Section 4.1.2, we test both variants and discuss them further.
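The two variants can be contrasted with a small sketch. The names and the fusion choice (addition for the individual unit) are our assumptions; the paper's exact wiring may differ:

```python
import numpy as np

def global_temporal_input(entity_emb, h_global):
    """Global unit: the hidden state is appended as one extra input token.
    entity_emb: (k, d); h_global: (d,) -> returns (k + 1, d)."""
    return np.vstack([entity_emb, h_global])

def individual_temporal_input(entity_emb, h_entities):
    """Individual unit: each entity carries its own hidden state, fused
    here by addition. entity_emb, h_entities: (k, d) -> (k, d)."""
    return entity_emb + h_entities

emb = np.zeros((5, 32))
g_in = global_temporal_input(emb, np.zeros(32))        # shape (6, 32)
i_in = individual_temporal_input(emb, np.zeros((5, 32)))  # shape (5, 32)
```

After the transformer layers, the global unit reads the new hidden state from the appended row, while the individual unit reads one new hidden state per entity row.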
We use the standard squared TD loss of DQNs (mnih2015human) to optimize the entire framework as follows:

$$\mathcal{L}(\theta) = \sum_{i=1}^{b} \left( y_i^{tot} - Q_{tot}(\boldsymbol{\tau}, \mathbf{u}, s; \theta) \right)^2, \qquad y^{tot} = r + \gamma \max_{\mathbf{u}'} Q_{tot}(\boldsymbol{\tau}', \mathbf{u}', s'; \theta^-), \quad (7)$$

where $\theta^-$ are the parameters of a periodically updated target network.
Here, $b$ represents the batch size. In partially observable settings, agents can benefit from conditioning on the action-observation history. hausknecht2015deep propose Deep Recurrent Q-Networks (DRQN) for this sequential decision process. For our part, we replace the widely used GRU (chung2014empirical)/LSTM (hochreiter1997long) unit in DRQN with a transformer-based temporal unit and then train the whole model.
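The loss above can be sketched over a batch of transitions as follows (variable names are ours; we average rather than sum over the batch, which only rescales the gradient):

```python
import numpy as np

def td_loss(q_tot, rewards, q_tot_next_target, gamma=0.99, terminated=None):
    """Squared TD loss over a batch of b transitions.
    q_tot: (b,) joint Q-values for the taken joint actions;
    q_tot_next_target: (b,) max joint Q-values from the target network."""
    if terminated is None:
        terminated = np.zeros_like(rewards)
    # Bootstrapped target y^tot = r + gamma * max_u' Q_tot_target(s', u')
    targets = rewards + gamma * (1.0 - terminated) * q_tot_next_target
    return float(np.mean((targets - q_tot) ** 2))

# A prediction that exactly matches its target incurs zero loss.
loss = td_loss(np.array([1.0]), np.array([1.0]), np.array([0.0]))  # -> 0.0
```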
4 StarCraft II Experiment
In this section, we evaluate UPDeT and its variants with different policy decoupling methods in the context of challenging micromanagement games in StarCraft II. We compare UPDeT with the RNN-based model on a single scenario and test the transfer capability on multiple-scenario transfer tasks. The experimental results show that UPDeT achieves significant improvement compared to the RNN-based model.
4.1 Single Scenario
In the single-scenario experiments, we evaluate model performance on different scenarios from SMAC (samvelyan2019starcraft). Specifically, the scenarios considered are as follows: 3 Marines vs 3 Marines (3m, Easy), 8 Marines vs 8 Marines (8m, Easy), 4 Marines vs 5 Marines (4m_vs_5m, Hard+) and 5 Marines vs 6 Marines (5m_vs_6m, Hard). In all these games, only the units on the player's side are treated as agents. Dead enemy units are masked out of the action space to ensure that every executed action is valid. More detailed settings can be found in the SMAC environment (samvelyan2019starcraft).
4.1.1 Methods and Training Details
The MARL methods used for evaluation are VDN (sunehag2017value), QMIX (rashid2018qmix) and QTRAN (hostallero2019learning). The original implementations of all three SOTA methods can be found at https://github.com/oxwhirl/pymarl. These methods were selected due to their robust performance across different multi-agent tasks. Other methods, including COMA (foerster2017counterfactual) and IQL (tan1993multi), do not perform stably across all tasks, as has been shown in several recent works (rashid2018qmix, mahajan2019maven, zhou2020learning). We therefore combine UPDeT with VDN, QMIX and QTRAN to demonstrate that our model improves performance significantly compared to the GRU-based model.
The performance of the different policy decoupling methods can be found in Fig. 3(a). The Vanilla Transformer is our baseline for all transformer-based models; each output embedding is either projected to an action or abandoned, so this transformer satisfies only point 2. The vanilla transformer fails to beat the enemies in the experiment. The Aggregation Transformer is a variant of the vanilla transformer whose embeddings are aggregated into a global embedding and then projected to a policy distribution; this transformer satisfies only point 1. The performance of the aggregation transformer is worse than that of the GRU-based model. These results show that it is only with a policy decoupling strategy that a transformer-based model can outperform the conventional RNN-based model.

Next, we use UPDeT to find the best temporal unit architecture in Fig. 3(b). The results show that performance decreases significantly without a hidden state. The temporal unit with a global hidden state converges faster than the one with individual hidden states, although the final performance is almost the same.

To test the generality of our model, we combine UPDeT with VDN, QMIX and QTRAN respectively and compare the final performance with the corresponding RNN-based methods in Fig. 3(c), evaluating on the 5m_vs_6m (Hard) scenario. Combined with UPDeT, all three MARL methods obtain significant improvements by large margins compared to the GRU-based model. This proves that our model can be injected into any existing state-of-the-art MARL method to yield better performance. Furthermore, we combine UPDeT with VDN and evaluate performance on scenarios ranging from Easy to Hard+ in Fig. 3(d) and Fig. 3(e).
The results show that UPDeT performs stably on easy scenarios and significantly outperforms the GRU-based model on hard scenarios; in the 4m_vs_5m (Hard+) scenario, the performance improvement achieved by UPDeT relative to the GRU-based model is around 80%. Finally, we conduct an ablation study on UPDeT with paired and unpaired observation-entities and action-groups, the results of which are presented in Fig. 3(f). We disrupt the original correspondence between the 'attack' actions and the enemy units. The final performance decreases heavily compared to the original model, and is even worse than that of the GRU-based model. We accordingly conclude that only with policy decoupling and a paired observation-entity and action-group strategy can UPDeT learn a strong policy.
4.2 Multiple Scenarios
In this section, we discuss the transfer capability of UPDeT compared to the RNN-based model. We evaluate model performance in a curriculum style. First, the model is trained on the 3m (3 Marines vs 3 Marines) scenario. We then use the pretrained 3m model to continue training on the 5m (5 Marines vs 5 Marines) and 7m (7 Marines vs 7 Marines) scenarios. We also conduct an experiment in reverse, from 7m to 3m. During transfer learning, the model architecture of UPDeT remains fixed. Since the RNN-based model cannot handle varying input and output dimensions, we modify the architecture of the source RNN model when training on the target scenario: we preserve the parameters of the GRU cell and initialize the fully connected layers with the proper input and output dimensions to fit the new scenario. The final results can be seen in Fig. 4(a) and Fig. 4(b). Our proposed UPDeT achieves significantly better results than the GRU-based model. Statistically, UPDeT's total timestep cost to converge is at least 10 times lower than that of the GRU-based model and 100 times lower than training from scratch. Moreover, the model demonstrates a strong generalization ability even without finetuning, indicating that UPDeT learns a robust policy with meta-level skills.
4.3 Extensive experiment on large-scale MAS
To evaluate the model performance in large-scale scenarios, we test our proposed UPDeT on the 10m_vs_11m and 20m_vs_21m scenarios from SMAC and a 64_vs_64 battle game in the MAgent Environment (zheng2017magent). The final results can be found in Appendix E.
4.4 Attention based Strategy: An analysis
The significant performance improvement achieved by UPDeT on the SMAC multi-agent challenge can be credited to the self-attention mechanism brought by both the transformer blocks and the policy decoupling strategy in UPDeT. In this section, we mainly discuss how the attention mechanism assists in learning a much more robust and explainable strategy. Here, we use the 3 Marines vs 3 Marines game (hence the size of the raw attention matrix is 6x6) as an example to demonstrate how the attention mechanism works. As mentioned in the caption of Fig. 6, we simplify the raw complete attention matrix to a grouped attention matrix. Fig. 5(b) presents the three different stages in one episode, namely Game Start, Attack and Survive, with their corresponding attention matrices and strategies. In the Game Start stage, the highest attention is in line 1, col 3 of the matrix, indicating that the agent pays more attention to its allies than its enemies. This phenomenon can be interpreted as follows: in the startup stage of a game, all the allies are spawned on the left side of the map and are encouraged to find and attack the enemies on the right side. In the Attack stage, the highest attention is in line 2, col 2 of the matrix, which indicates that the enemy is now in the agent's attack range; the agent will therefore attack the enemy to get more reward. Surprisingly, the agent chooses to attack the enemy with the lowest health value. This indicates that a long-term plan can be learned based on the attention mechanism, since killing the weakest enemy first reduces the punishment from future enemy attacks. In the Survive stage, the agent's health value is low, meaning that it needs to avoid being attacked. The highest attention is located in line 1, col 1, which clearly shows that the most important thing under the current circumstances is to stay alive.
For as long as the agent is alive, there is still a chance for it to return to the front line and get more reward while enemies are attacking the allies instead of the agent itself.
In conclusion, the self-attention mechanism and policy decoupling strategy of UPDeT provide a strong and clear relation between attention weights and final strategies. This relation can help us better understand policy generation based on the distribution of attention among different entities. An interesting idea presents itself here: namely, if we can find a strong mapping between the attention matrix and the final policy, the character of an agent could be modified in an unsupervised manner.
In this paper, we propose UPDeT, a universal policy decoupling transformer model that extends MARL to much broader scenarios. UPDeT is general enough to be plugged into any existing MARL method. Moreover, our experimental results show that, when combined with UPDeT, existing state-of-the-art MARL methods achieve further significant improvements with the same training pipeline. On transfer learning tasks, our model converges 100 times faster than training from scratch and 10 times faster than training the RNN-based model. In the future, we aim to develop a centralized function based on UPDeT and apply the self-attention mechanism to the entire MARL pipeline to yield further improvements.
This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grant No.U19A2073 and in part by the National Natural Science Foundation of China (NSFC) under Grant No.61976233 and No.61906109 and Australian Research Council Discovery Early Career Researcher Award (DE190100626), and Funding of “Leading Innovation Team of the Zhejiang Province” (2018R01017).
Appendix A Details of SMAC environment
The action space contains four movement directions, k attack actions (where k is the fixed maximum number of enemy units in a map), stop and no-operation. At each time step, the agents receive a joint team reward, which is defined by the total damage incurred by the agents and the total damage from the enemy side. Each agent is described by several attributes, including health points (HP), weapon cooldown (CD), unit type, last action and the relative distance of the observed units. The enemy units are described in the same way, except that CD is excluded. The partial observation of an agent comprises the attributes of the units, both agents and enemy units, that exist within its view range, which is a circle with a specific radius.
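The per-agent action space described above can be enumerated with a short helper. This is a hypothetical sketch for clarity; SMAC's real implementation uses integer action IDs rather than strings:

```python
def build_action_space(k):
    """k: fixed maximum number of enemy units in the map."""
    moves = ["move_north", "move_south", "move_east", "move_west"]
    attacks = [f"attack_enemy_{i}" for i in range(k)]
    return ["no_op", "stop"] + moves + attacks

# 4 movement directions + k attack actions + stop + no-op
actions = build_action_space(5)  # 11 actions in total
```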
Appendix B Details of Model
The transformer used in all experiments consists of 2 transformer layers with 3 attention heads each. The other important training hyperparameters are as follows:
| Hyperparameter | Value |
| --- | --- |
| token dimension (UPDeT) | 32 |
| channel dimension (UPDeT) | 32 |
| RNN hidden dimension | 64 |
| target net update interval | 200 |
| mixing embedding dimension (QMIX) | 32 |
| hypernet layers (QMIX) | 2 |
| hypernet embedding (QMIX) | 64 |
| mixing embedding dimension (QTRAN) | 32 |
| opt loss (QTRAN) | 1 |
| nopt min loss (QTRAN) | 0.1 |
Appendix C SOTA MARL value-based Framework
The three SOTA methods can be briefly summarized as follows:
VDN (sunehag2017value): this method represents the joint Q-value as a sum of individual Q-value functions that condition only on individual observations and actions.
QMIX (rashid2018qmix): this method learns a decentralized Q-function for each agent, with the assumption that the centralized Q-value increases monotonically with the individual Q-values.
QTRAN (hostallero2019learning): this method formulates multi-agent learning as an optimization problem with linear constraints and relaxes it with L2 penalties for tractability.
Appendix D UPDeT on SMAC: A real case
We take the 3 Marines vs 3 Marines challenge from SMAC with UPDeT as an example; more details can be found in Fig. 7. The observations are separated into three groups: the main agent, two other ally agents and three enemies. The policy output includes the basic actions corresponding to the main agent's own observation and attack actions, one for each enemy observation. The hidden state is added after the embedding layer. The outputs of the other agents are abandoned, as there are no corresponding actions. Once an agent or enemy has died, we mask the corresponding unavailable actions in the action selection stage to ensure that only available actions are selected.
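The masking step at action selection can be sketched as follows (a hedged illustration with our own names):

```python
import numpy as np

def select_action(q_values, available_mask):
    """Greedy selection restricted to available actions.
    q_values: (n_actions,); available_mask: boolean, True = available."""
    masked = np.where(available_mask, q_values, -np.inf)
    return int(np.argmax(masked))

# The attack action of a dead enemy (index 1) is masked out, so the
# best *available* action (index 2) is chosen instead.
chosen = select_action(np.array([1.0, 5.0, 3.0]),
                       np.array([True, False, True]))  # -> 2
```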
Appendix E Results of Extensive Experiment on Large Scale
We further test the robustness of UPDeT in a large-scale multi-agent system. To do so, we enlarge the game size in SMAC (samvelyan2019starcraft) to incorporate more agents and enemies on the battlefield. We use a 10 Marines vs 11 Marines game and a 20 Marines vs 21 Marines game to compare the performance of the UPDeT and GRU-based approaches. In the 20 Marines vs 21 Marines game, to accelerate training and satisfy hardware limitations, we decrease the batch size of both the GRU baseline and UPDeT from 32 to 24 during training. The final results can be found in Fig. 7(a). The improvement is still significant in terms of both sample efficiency and final performance. Moreover, it is also worth mentioning that the model size of UPDeT stays fixed, while the GRU-based model becomes larger in large-scale scenarios. In the 20 Marines vs 21 Marines game, the model size of the GRU is almost double that of UPDeT. This indicates that UPDeT is able to remain lightweight while still maintaining good performance.
We also test model performance in the MAgent environment (zheng2017magent). The settings of MAgent are quite different from those of SMAC. First, the observation size and the number of available actions are not related to the number of agents. Second, the 64_vs_64 battle game we test is a two-player zero-sum game, which belongs to another active research area combining MARL and game theory (GT); the most successful attempt in this area adopts a mean-field approximation of GT in MARL to accelerate self-play training (yang2018mean). Third, as for the model architecture, there is no need to use a recurrent network like a GRU in MAgent, and the large observation size requires a CNN for embedding. However, by treating UPDeT as a pure encoder without a recurrent architecture, we can still conduct experiments on MAgent; the final results can be found in Fig. 7(b). As the results show, UPDeT performs better than the DQN baseline, although the improvement is not as significant as it is in SMAC.