1 Introduction
Cooperative multi-agent reinforcement learning has been widely adopted in various domains, such as autonomous cars
[1], sensor networks [2], and robot swarms [3, 4]. In these tasks, individual rewards are unavailable, so each agent must learn a decentralized policy from a shared team reward signal [5, 6]. Discriminative credit assignment among agents therefore plays an essential role in achieving effective cooperation. Recent years have witnessed great advances in cooperative multi-agent reinforcement learning methods for credit assignment. Among them, value-based methods have shown state-of-the-art performance on challenging tasks, e.g., unit micromanagement in StarCraft II [7]. Recently, value factorization, which builds on the centralized training with decentralized execution (CTDE) paradigm [8], has become particularly popular. Specifically, it factorizes the joint value function $Q_{tot}$ into individual value functions $Q_i$ during centralized training [5]. During execution, decentralized policies can be easily derived by greedily selecting individual actions from the local value functions $Q_i$. In this way, an implicit multi-agent credit assignment is realized because each $Q_i$ is learned by optimizing the total temporal-difference error on the single global reward signal.
As the first attempt at CTDE, VDN [5] factorizes $Q_{tot}$ by a summation over the individual $Q_i$ [9]. Despite its effectiveness in realizing scalable joint policy training and individual policy execution, VDN simply assigns equal credit to agents, which gives insufficient incentive for agents to learn cooperation. Considering the limitation of equal assignment, QMIX designs a mixing network that assigns non-negative learnable weights to the individual Q-values through a nonlinear function of the global state [10]. Following QMIX, more and more algorithms employ variants of the attention mechanism as the integration function [11, 12].
However, an open question is whether these mixing networks really achieve discriminative credit assignment. In this work, towards a quantitative evaluation of such discriminability, we employ the gradient entropy of $Q_{tot}$ over the $Q_i$ to measure the discrepancy among the credits assigned to agents. The empirical results of QMIX on gridworld environments and the benchmark environment SMAC show that the gradient entropy stays close to its maximum value during training, implying limited credit assignment in QMIX. Besides, we simplify the QMIX network by using the summation $\sum_i Q_i$ as the input of the mixing network instead of the individual $Q_i$, denoted QMIX-simple. In this way, the gradients of $Q_{tot}$ over the $Q_i$ are equal. Experiments show that QMIX-simple achieves comparable performance to QMIX, which also supports the above argument.
Based on the above observations, we improve QMIX by introducing gradient entropy regularization. Since a more indiscriminate credit assignment yields a larger regularization value, gradient entropy regularization explicitly pushes the system toward differentiated credit assignment. The experimental results demonstrate that gradient entropy regularization not only speeds up the training process but also improves the overall performance.
To sum up, the main contributions of this paper lie in three aspects:

We propose a metric, i.e., the gradient entropy of $Q_{tot}$ over the $Q_i$, to measure the discriminativeness of credit assignment among agents.

We empirically show that QMIX suffers from indiscriminability in credit assignment: (1) the gradient entropy is close to the maximum value, i.e., all agents are assigned nearly the same credit; (2) a simplified version of QMIX achieves comparable performance to QMIX.

We employ gradient entropy regularization to explicitly enhance the discriminability of credit assignment. The experimental results on the gridworld and benchmark environments demonstrate the efficacy of our method.
2 Related Work
In this section, we briefly review the related work on credit assignment in multi-agent reinforcement learning, which can be divided into Q-learning and policy-gradient methods.
2.1 Credit Assignment in Q-learning
Value decomposition is a popular credit assignment method in algorithms that adopt a centralized Q-value mixer. These methods strive to decompose a centralized Q-value into individual Q-values. VDN [5] uses a simple sum operation to generate the global Q-value, which treats the contributions of all agents equally. QMIX [10], as an extension of VDN, employs a mixing network that leverages state information to decide the transformation weights [13], allowing nonlinear operations to decompose the central Q-value. QATTEN [11] utilizes a multi-head attention structure to weight the individual Q-values based on the global state and the individual features, and then linearly integrates these values into the central action value. QPLEX [12] decomposes the central Q-value into the sum of individual value functions and a non-positive advantage function. By introducing a duplex dueling structure, QPLEX achieves the complete function class that satisfies the IGM principle. WQMIX [14] follows the QMIX paradigm and exploits a weighted QMIX operator to put more emphasis on better joint actions.
2.2 Credit Assignment for Policy Gradient
Existing credit assignment methods for policy gradient utilize a similar implementation of value decomposition. MADDPG [15] first introduces the DDPG [16] method to multi-agent tasks. A centralized Q-value estimator is used to aggregate agents' information. It achieves an implicit and weak credit assignment by giving the centralized critic a global view of all agents' actions. Counterfactual Multi-Agent Policy Gradients (COMA) [17] further gives a counterfactual baseline of the expectation of all agents' Q-values. Since the computation of the counterfactual baseline fixes all other agents' actions, it cannot reflect the contribution of each agent. Essentially, COMA solves the lazy-agent problem by encouraging agents to explore without hurting the effect of the joint policy. Off-policy multi-agent decomposed policy gradients (DOP) [18] takes advantage of both QMIX and MADDPG, and investigates how to implement value decomposition in the multi-agent actor-critic framework. LICA [19] integrates global information into the centralized critic and adds a regularization term to encourage stochasticity in policies. The above methods all realize credit assignment implicitly. To the best of our knowledge, there is no study on a metric to quantitatively evaluate the discriminability of the credits assigned by these methods. In this work, we propose a reasonable evaluation metric for the discriminability of credit assignment, and introduce gradient entropy regularization to help the mixing network realize a more discriminative credit assignment.
3 Preliminaries
3.1 Problem Formulation
A fully cooperative multi-agent problem is often described as a Dec-POMDP [20, 21], which consists of a tuple $G = \langle S, U, P, r, Z, O, n, \gamma \rangle$. $s \in S$ represents the global state of the environment. At each timestep, each agent $i \in \{1, \dots, n\}$ chooses an action $u_i \in U$, forming a joint action $\mathbf{u} \in U^n$. Given this joint action, the environment state transits into a new one according to the transition function $P(s' \mid s, \mathbf{u})$ [22, 23, 24]. The global reward $r(s, \mathbf{u})$ is shared by all the agents and $\gamma \in [0, 1)$ is a discount factor.
Due to communication constraints in practice, we consider the partially observable setting, where agents have no access to the global state for decision making. Each agent can only draw individual observations $z \in Z$ from the environment according to the observation function $O(s, i)$. Generally, each agent keeps an observation-action history $\tau_i$ to better derive global information from limited local observations. With $\tau_i$, agent $i$ determines its action with policy $\pi_i(u_i \mid \tau_i)$. The individual policies are gathered into a joint policy $\boldsymbol{\pi}$, which corresponds to a joint action-value function $Q^{\boldsymbol{\pi}}(s, \mathbf{u}) = \mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, \mathbf{u}_0 = \mathbf{u}, \boldsymbol{\pi}\big]$, i.e., the expected discounted return. The overall target of the problem is to maximize the value of the joint policy, formulated as $\max_{\boldsymbol{\pi}} \mathbb{E}_{s_0 \sim \rho_0}[V^{\boldsymbol{\pi}}(s_0)]$, where $\rho_0$ is the initial distribution of the global state.
Under the centralized training and decentralized execution paradigm, the policies of agents are only conditioned on local observation-action histories but are trained with global information, e.g., the global state $s$ and the joint action $\mathbf{u}$.
3.2 QMIX
As a representative value decomposition method, QMIX follows the CTDE paradigm. The core part of QMIX is the mixing network, which is responsible for credit assignment. In QMIX, each agent $i$ holds an individual Q-network $Q_i(\tau_i, u_i)$ [25]. The outputs of the individual Q-networks are passed through the mixing network, which implicitly assigns credits to each agent and generates the approximation of the global Q-value, $Q_{tot}$. Specifically, the weights of the mixing network are produced by separate hypernetworks that take the state $s$ as input. Each hypernetwork consists of a single linear layer, followed by an absolute-value activation function, to ensure that the mixing network weights are non-negative.
$Q_{tot}$ is calculated as follows:
$$Q_{tot} = W_2 \, f(W_1 \mathbf{q} + b_1) + b_2, \qquad (1)$$
where $\mathbf{q} = [Q_1, \dots, Q_n]^\top$ is the output of the individual Q-networks, $f$ is a nonlinear activation function, and $W_1$, $W_2$, $b_1$, and $b_2$ are weights produced by the hypernetworks. Since all elements in $W_1$ and $W_2$ are non-negative, QMIX satisfies the following condition:
$$\frac{\partial Q_{tot}}{\partial Q_i} \geq 0, \quad \forall i \in \{1, \dots, n\}. \qquad (2)$$
This property guarantees the Individual-Global Maximum (IGM) principle, which means the optimal individual actions jointly make the optimal joint action for $Q_{tot}$:
$$\arg\max_{\mathbf{u}} Q_{tot}(\boldsymbol{\tau}, \mathbf{u}) = \Big( \arg\max_{u_1} Q_1(\tau_1, u_1), \; \dots, \; \arg\max_{u_n} Q_n(\tau_n, u_n) \Big). \qquad (3)$$
In this way, each agent can choose its best local action based only on its own local observation-action history, and the resulting joint action is the best action for the whole system.
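As a concrete illustration of the monotonic mixing described above, the following is a minimal numpy sketch, not the paper's implementation: the hypernetworks are replaced by fixed non-negative weight matrices (`W1`, `W2` and biases are arbitrary stand-ins for state-conditioned hypernetwork outputs), and ELU is used as the activation, as in the QMIX paper.

```python
import numpy as np

def elu(x, alpha=1.0):
    """ELU activation used between the two mixing layers."""
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def mix(q, W1, b1, W2, b2):
    """Monotonic mixing: Q_tot = W2 @ f(W1 @ q + b1) + b2.

    W1 and W2 must be element-wise non-negative (in QMIX this is
    enforced by an absolute-value activation on the hypernetwork
    outputs), which makes dQ_tot/dQ_i >= 0 for every agent i.
    """
    assert (W1 >= 0).all() and (W2 >= 0).all()
    h = W1 @ q + b1                  # first mixing layer, shape (m,)
    return float(W2 @ elu(h) + b2)   # scalar Q_tot

# Toy example: 3 agents, hidden width 4; weights stand in for
# hypernetwork outputs conditioned on the state.
rng = np.random.default_rng(0)
q = rng.normal(size=3)                    # individual Q-values
W1 = np.abs(rng.normal(size=(4, 3)))      # non-negative weights
b1 = rng.normal(size=4)
W2 = np.abs(rng.normal(size=(1, 4)))
b2 = rng.normal(size=1)

# Monotonicity check: raising any Q_i never lowers Q_tot.
base = mix(q, W1, b1, W2, b2)
for i in range(3):
    bumped = q.copy(); bumped[i] += 0.1
    assert mix(bumped, W1, b1, W2, b2) >= base
```

The monotonicity check at the end is exactly the property that guarantees IGM: increasing any individual Q-value can only increase the mixed value.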
The system is updated by minimizing the squared TD loss on $Q_{tot}$, according to the following formulation:
$$\mathcal{L}_{TD} = \mathbb{E}\big[ (y^{tot} - Q_{tot}(\boldsymbol{\tau}, \mathbf{u}; \theta))^2 \big], \qquad (4)$$
where $y^{tot} = r + \gamma \max_{\mathbf{u}'} Q_{tot}(\boldsymbol{\tau}', \mathbf{u}'; \theta^-)$ is the TD target and $\theta^-$ denotes the parameters of the target network.
To sum up, QMIX takes advantage of global information in decomposing $Q_{tot}$ and guarantees consistency between the centralized and decentralized policies with the monotonic constraint. However, QMIX does not explicitly require discrimination in credit assignment, which might result in indiscriminability and negatively influence the performance.
4 Methodology
In this section, we propose two strategies to investigate the discriminability of credit assignment in the mixing network, i.e., an evaluation metric for discriminability measurement and a simplified version of QMIX. Moreover, we design a gradient entropy regularization to enhance the discriminability of credit assignment.
4.1 Discriminability Measurement of Credit Assignment via Gradient Entropy
In this section, we propose an evaluation metric to measure the discriminability of credit assignment, and explain the rationale behind it. Moreover, we employ this metric to evaluate the discriminability of credit assignment in QMIX.
The value decomposition paradigm approximates the global Q-value $Q_{tot}$ by integrating a set of individual Q-values $\{Q_i\}_{i=1}^{n}$. According to a first-order Taylor expansion, $Q_{tot}$ can be factorized as follows:
$$Q_{tot} \approx \sum_{i=1}^{n} \frac{\partial Q_{tot}}{\partial Q_i} Q_i + C, \qquad (5)$$
where $\frac{\partial Q_{tot}}{\partial Q_i}$ denotes the gradient of $Q_{tot}$ over $Q_i$ and $C$ is a term independent of the individual Q-values. The above equation indicates that the influence of each agent on $Q_{tot}$ is determined by the corresponding gradient. Moreover, a larger gradient means that the corresponding agent has a greater impact on $Q_{tot}$.
Considering that the gradient reflects each agent's contribution to the team, if the credits assigned to agents are discriminative, the gradient distribution over agents is unlikely to be uniform. To describe such non-uniformity, since the gradients are non-negative, we normalize them by Eq. (6) and adopt the entropy of the normalized gradients as a measure of the discriminability of the credits assigned to agents:
$$p_i = \frac{\partial Q_{tot} / \partial Q_i}{\sum_{j=1}^{n} \partial Q_{tot} / \partial Q_j}. \qquad (6)$$
Beyond that, considering the different complexity across tasks, we also normalize the entropy by the maximum entropy of $n$ random variables. Based on the above discussion, the normalized gradient entropy is calculated as follows:
$$\hat{H} = -\frac{1}{\log n} \sum_{i=1}^{n} p_i \log p_i, \qquad (7)$$
where $\log n$ is the maximum entropy of $n$ values. It can be seen that the smaller the entropy, the more discriminative the gradients, and vice versa. In other words, if the normalized gradient entropy is close to 1, the gradients of $Q_{tot}$ on the $Q_i$ will be similar, meaning a lack of discriminability in credit assignment.
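The metric in Eqs. (6)-(7) is straightforward to compute from a vector of per-agent gradients. A minimal numpy sketch (the function name and the small `eps` guard against `log 0` are ours):

```python
import numpy as np

def normalized_gradient_entropy(grads, eps=1e-12):
    """Normalized entropy of the non-negative gradients dQ_tot/dQ_i.

    Returns a value in [0, 1]: 1.0 means all agents receive identical
    credit; values near 0 mean highly discriminative credit assignment.
    """
    g = np.asarray(grads, dtype=float)
    p = g / (g.sum() + eps)              # Eq. (6): normalize to a distribution
    h = -(p * np.log(p + eps)).sum()     # Shannon entropy of the credits
    return h / np.log(len(g))            # Eq. (7): divide by max entropy log n

# Uniform gradients -> entropy at its maximum (1.0 after normalization).
print(round(normalized_gradient_entropy([0.5, 0.5, 0.5, 0.5]), 6))  # 1.0
# One dominant agent -> much lower normalized entropy.
print(normalized_gradient_entropy([10.0, 0.1, 0.1, 0.1]) < 0.5)     # True
```

Because of the division by $\log n$, the value is comparable across tasks with different numbers of agents, which is exactly the motivation given above.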
Given the definition of normalized gradient entropy, we employ it to measure the discriminability of the credits assigned by QMIX. It is worth noting that the normalized gradient entropy is applicable to any differentiable mixing network structure. According to the mixing network structure in QMIX, the gradients of $Q_{tot}$ over $\mathbf{q} = [Q_1, \dots, Q_n]^\top$ are calculated as follows:
$$\frac{\partial Q_{tot}}{\partial \mathbf{q}} = W_2 \, \mathrm{diag}\big(f'(\mathbf{h})\big) \, W_1, \qquad (8)$$
where $f'(\mathbf{h})$ is the gradient of the activation function $f$ and $h_k$ is the $k$-th element of the output of the first linear layer in the mixing network, i.e., $\mathbf{h} = W_1 \mathbf{q} + b_1$. As shown in Eq. (2), each element of the gradient vector is non-negative. The detailed derivation process can be found in the supplementary material.
4.2 Simplified QMIX
In addition to the discriminability measurement via gradient entropy, we conduct an ablation study to test the discriminability of QMIX in credit assignment. To this end, we propose a simplified version of QMIX, named QMIX-simple, as a baseline.
QMIX-simple follows a similar paradigm to QMIX but simplifies the mixing strategy. The overall framework of QMIX-simple is shown in Figure 2. Different from QMIX, which integrates individual Q-values via learnable non-negative weights, QMIX-simple applies a summation operation over the individual Q-values, followed by two scalar multiplications with an activation function in between. Specifically, the $Q_{tot}$ of QMIX-simple, denoted $Q'_{tot}$, is calculated with the following equation:
$$Q'_{tot} = w_2 \, f\Big(w_1 \sum_{i=1}^{n} Q_i + b_1\Big) + b_2, \qquad (9)$$
where the scalars $w_1$, $w_2$, $b_1$, and $b_2$ are produced by hypernetworks that take the global state as input. In this way, the gradients on the $Q_i$ are forced to be equal and credits are assigned equally.
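The equal-gradient property of Eq. (9) can be verified numerically. Below is an illustrative sketch, not the paper's code: the scalar parameters are arbitrary constants standing in for hypernetwork outputs, and `tanh` stands in for the activation function.

```python
import numpy as np

def qmix_simple(qs, w1, b1, w2, b2):
    """Q'_tot = w2 * f(w1 * sum_i Q_i + b1) + b2 (Eq. 9).

    Because the mixer only sees the sum of the individual Q-values,
    dQ'_tot/dQ_i = w2 * f'(w1 * sum(qs) + b1) * w1 is the same
    for every agent i, i.e. equal credit by construction.
    """
    return w2 * np.tanh(w1 * np.sum(qs) + b1) + b2

# Verify equal credits with finite differences.
qs = np.array([0.3, -0.7, 1.2])
w1, b1, w2, b2 = 0.8, 0.1, 1.5, -0.2
eps = 1e-6
grads = []
for i in range(len(qs)):
    bumped = qs.copy(); bumped[i] += eps
    grads.append((qmix_simple(bumped, w1, b1, w2, b2)
                  - qmix_simple(qs, w1, b1, w2, b2)) / eps)
print(np.allclose(grads, grads[0]))  # True: equal credit for all agents
```

Every finite-difference gradient is identical because each perturbation shifts the sum by the same amount, which is the defining property of QMIX-simple.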
Table 1: Comparison of VDN, QMIX, and QMIX-simple.

Model        | Equal Credits | Utilization of Global State
VDN          | yes           | no
QMIX         | no            | yes
QMIX-simple  | yes           | yes
Discussion: Here, we give a brief comparison of QMIX-simple, QMIX, and VDN, summarized in Table 1. VDN takes the sum of the individual Q-values to approximate the global Q-value, which also assigns equal weights to agents. The key difference between VDN and QMIX-simple is that QMIX-simple employs scalar multiplications and a nonlinear activation function to enrich the expressiveness of the mixing network. Besides, QMIX-simple takes advantage of the global state to obtain state-dependent multipliers while VDN ignores the global state. Compared to QMIX, QMIX-simple also utilizes the global state in the mixing network but explicitly assigns equal weights. If QMIX-simple achieves comparable performance to QMIX, this indicates that QMIX lacks discriminability in credit assignment.
4.3 Gradient Entropy Regularization
In this section, we propose a new method to guide QMIX to allocate different credits to different agents, named Gradient entropy REgularization (GRE).
In Section 4.1, we explained the rationale for employing gradient entropy to measure the discriminability of credit assignment. Thereby, one straightforward way to enhance the discriminability of credit assignment is to incorporate the gradient entropy regularization into QMIX. The overall loss of QMIX-GRE is the combination of the original TD loss and the regularization term, i.e.,
$$\mathcal{L} = \mathcal{L}_{TD} + \lambda \hat{H}, \qquad (10)$$
where $\lambda$ is a hyperparameter balancing the TD loss and the regularization term. Note that when $\lambda = 0$, the model degenerates to the original QMIX. A larger value of $\lambda$ means a greater penalty for indiscriminability. It is worth mentioning that optimizing the regularization term only updates the parameters in the mixing network.

Table 2: Percentiles of the normalized gradient entropy of QMIX during training. Even the 5% percentile of the gradient entropy is close to 1. Since the maximum entropy means equal probability, these results reveal the limited discriminability of credit assignment in QMIX.

Percentile | Lumberjacks | TrafficJunction10 | 3s5z  | MMM2  | 27m_vs_30m
5%         | 0.972       | 0.931             | 0.941 | 0.961 | 0.987
25%        | 0.992       | 0.961             | 0.975 | 0.980 | 0.992
50%        | 0.997       | 0.975             | 0.984 | 0.988 | 0.995
75%        | 0.999       | 0.985             | 0.989 | 0.993 | 0.997
95%        | 1.000       | 0.993             | 0.995 | 0.996 | 0.999
In our training process, the neural network minimizes the TD loss as well as the normalized gradient entropy of the mixing network, forcing it to assign different credits to different agents in an explicit way. As a result, the mixing network learns a more discriminative assignment mechanism.
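The combined objective of Eq. (10) can be sketched as follows. This is an illustrative numpy version, not the paper's training code: `gre_loss` and `lam` are our names, the TD error and mixer gradients are passed in as plain numbers, and in practice the entropy term would be differentiated through the mixing network.

```python
import numpy as np

def normalized_entropy(grads, eps=1e-12):
    """Normalized gradient entropy, Eqs. (6)-(7)."""
    p = np.asarray(grads, dtype=float)
    p = p / (p.sum() + eps)
    return -(p * np.log(p + eps)).sum() / np.log(len(p))

def gre_loss(td_error, mixer_grads, lam=0.1):
    """Overall QMIX-GRE objective, Eq. (10): squared TD loss plus
    lambda times the normalized gradient entropy of the mixer.
    lam = 0 recovers the original QMIX loss."""
    return td_error ** 2 + lam * normalized_entropy(mixer_grads)

# For the same TD error, indiscriminate credit (uniform gradients)
# is penalized more than discriminative credit (skewed gradients).
uniform = gre_loss(0.5, [1.0, 1.0, 1.0, 1.0])
skewed  = gre_loss(0.5, [5.0, 0.2, 0.2, 0.2])
print(uniform > skewed)  # True
```

Minimizing this loss therefore trades off TD accuracy against the entropy penalty, which matches the observation in Section 5.4 that the entropy under GRE decreases but does not collapse to zero.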
5 Experiments
To comprehensively understand the credit assignment in QMIX and evaluate our proposed gradient entropy regularization, we conduct experiments to answer the following questions in Sections 5.2, 5.3, and 5.4, respectively:
RQ1: Does QMIX achieve a discriminative credit assignment via mixing network?
RQ2: With the help of gradient entropy regularization, how does QMIX-GRE perform compared to QMIX?
RQ3: Can QMIX-GRE enhance the discriminability of credit assignment?
5.1 Experimental Settings
In this section, we briefly introduce the experimental environments and the hyperparameter settings.
In this paper, we conduct experiments on two gridworld environments and several SMAC environments. The two gridworld environments, named Lumberjacks and TrafficJunction10, come from [26]. SMAC is introduced by [7].
Lumberjacks: The agents are lumberjacks whose goal is to cut down all the trees in the map. Each tree has a specified strength when an episode starts. Once the number of agents in a tree's cell is equal to or greater than its strength, the tree is automatically cut down. If all of the trees are cut down or the maximum episode length is reached, the current episode ends. The possible actions of agents are moving in four directions (Down, Left, Up, Right) or doing nothing.
TrafficJunction10: The map is a 14 × 14 grid intersection and in each episode, there are 10 vehicles in total waiting to enter the intersection. At each time step, "new" cars enter the intersection from each of the four directions with a fixed probability and each is randomly assigned to one of three possible routes (go straight, turn left, and turn right). Once a car reaches the edge of the grid, it is removed. The total number of cars in the map at the same time is limited. Cars are required to keep to the right side of the road. Each car has two possible actions: move forward along its route or stay at the current cell. If the positions of two cars overlap, they collide and receive a team reward of −10, but this does not affect the simulation. The reward at each time step is the sum of the elapsed time steps of all the cars in the map, scaled by a negative factor, plus the collision reward.
SMAC: This environment is built on Blizzard's StarCraft II RTS game and is very popular in the MARL research field. There are many maps in the environment where two teams of agents try their best to defeat each other. In the game, agents have no access to the global state and their observations are strictly limited. The number and type of agents differ across SMAC maps. The actions of agents also depend on the scenario, with the number of actions ranging from 7 to 70. Once all the agents in a team die, the other team wins the game. Although the overall target is the highest win rate in each scenario, the environment offers a shaped reward signal to help training. This reward signal consists of the hit-point damage dealt and received by agents, the units killed, and the battle result.
Our implementation is based on the PyMARL [7] codebase and follows its hyperparameter settings. All the experiments in this paper are repeated five times with different random seeds. The models are trained for 2M steps and tested every 10K steps in SMAC, and trained for 200K steps and tested every 5K steps in the other two environments. Our code is available at https://github.com/sgzZ123/GRE.
5.2 Investigation of QMIX: RQ1
To evaluate the discriminability of the credits assigned by QMIX, we employ the normalized gradient entropy metric and compare the performance of QMIX and QMIX-simple.
To calculate the normalized gradient entropy, we collect the normalized entropy data on various SMAC environments and gridworld environments. Table 2 summarizes the 5%, 25%, 50%, 75%, and 95% percentiles of the entropy values for each map/environment. Across the five experimental environments, the differences between the 5% and 95% percentiles of the entropy values are small, which indicates that the gradient entropy in QMIX is consistently concentrated near its maximum value. Figure 3 shows the normalized entropy curve during training, where the y-axis ranges from 0.9 to 1.0. We find that the normalized entropy stays close to the maximal value except for the random exploration stage, where agents randomly select their actions. To sum up, the distribution of gradient entropy demonstrates that QMIX suffers from a discriminability issue in the assignment of credits to agents.
In addition to the discriminability measurement via gradient entropy, we conduct an ablation study to test the discriminability of QMIX in credit assignment. We conduct experiments on various SMAC environments and gridworld environments. In practice, we change the output of the hypernetworks from matrices to scalars and keep the other hyperparameters the same as in the original QMIX. Due to the page limit, we show the performance of QMIX-simple and QMIX on five representative maps/environments in Figure 4. The results on the remaining five SMAC maps can be found in the appendix. We find that the performance of QMIX-simple is extremely close to that of the original QMIX algorithm, except for a few environments with only a minor drop. The performance drop probably comes from the decrease in the number of hypernetwork parameters (from matrices to scalars).
In summary, both the large gradient entropy values and the comparable performance of QMIX-simple to QMIX demonstrate that the credit assignment in QMIX lacks discriminability.
5.3 Comparative Results: RQ2
[Figure 5: In (a) Lumberjacks and (b) TrafficJunction10 from gridworld, QMIX is able to train a good joint policy, although with a large variance; GRE helps the algorithm reach a good team policy more stably. In (c) 3s5z, (d) MMM2, and (e) 27m_vs_30m from SMAC, QMIX-GRE speeds up training and improves the overall performance.]

We test our method on both gridworld and several SMAC maps. The results are shown in Figure 5. More results on SMAC can be found in the appendix. In terms of overall performance, QMIX-GRE outperforms QMIX or at least achieves comparable performance. In terms of training efficiency, QMIX-GRE has a significant advantage over QMIX in the early stage, indicating that it can speed up the training process.
In simple gridworld environments like Lumberjacks and TrafficJunction10, QMIX-GRE achieves comparable performance. In such simple environments, QMIX itself can already solve the tasks with a high test return, so it is difficult to obtain further improvement. However, QMIX's return drops in the middle stage of training while QMIX-GRE improves gradually. We can see that QMIX-GRE greatly boosts the performance and achieves similar final performance on 3s5z. On MMM2 and 27m_vs_30m, QMIX-GRE not only speeds up training but also improves the overall performance. The results show that gradient entropy regularization is more beneficial in complex environments.
5.4 Discriminability of QMIX-GRE: RQ3
To further investigate whether gradient entropy regularization indeed helps QMIX improve the discriminability of credit assignment, we collect the normalized gradient entropy data over the whole training process and compare it with that of QMIX.
Due to the page limit, we only plot the entropy curve of MMM2 in SMAC, as shown in Figure 6. We find that in the random exploration stage of training, there is little difference between the entropy of QMIX and QMIX-GRE. As training progresses, the entropy of QMIX-GRE becomes significantly smaller than that of QMIX, which means that QMIX-GRE tends to realize a more discriminative credit assignment. It is worth mentioning that the gradient entropy does not become extremely small, which mainly results from the trade-off between the optimization of the TD loss and the regularization term.
6 Conclusion
This paper revisits QMIX and proposes a new measurement, based on gradient flow and normalized entropy, to quantitatively study the credit assignment mechanism in algorithms whose frameworks are similar to QMIX. With this measurement, we collect extensive data during the training process of QMIX in gridworld environments as well as various SMAC environments. Our results reveal that, in most environments, QMIX does not distribute credits discriminatively as expected. To this end, we propose Gradient entropy REgularization (GRE) to force the mixing network toward a more discriminative credit assignment. Experiments show that our method does help speed up training and improve the performance of QMIX.
In future work, we aim to conduct experiments on more tasks to verify our normalized entropy measurement and GRE method. Further, we hope to design more powerful algorithms based on the entropy measurement and investigate MARL mechanisms more deeply.
References
 [1] Yongcan Cao, Wenwu Yu, Wei Ren, and Guanrong Chen. An overview of recent progress in the study of distributed multi-agent coordination. IEEE Transactions on Industrial Informatics, 9(1):427–438, 2012.

 [2] Chongjie Zhang and Victor Lesser. Coordinated multi-agent reinforcement learning in networked distributed POMDPs. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), volume 25, 2011.
 [3] Maximilian Hüttenrauch, Adrian Šošić, and Gerhard Neumann. Guided deep reinforcement learning for swarm systems. arXiv preprint arXiv:1709.06011, 2017.
 [4] L. Busoniu, R. Babuska, and B. De Schutter. A comprehensive survey of multi-agent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 38(2):156–172, 2008.
 [5] Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinicius Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z Leibo, Karl Tuyls, et al. Value-decomposition networks for cooperative multi-agent learning. arXiv preprint arXiv:1706.05296, 2017.
 [6] Pablo Hernandez-Leal, Bilal Kartal, and Matthew E. Taylor. A survey and critique of multiagent deep reinforcement learning. Autonomous Agents and Multi-Agent Systems, 33(6):750–797, 2019.
 [7] Mikayel Samvelyan, Tabish Rashid, Christian Schroeder de Witt, Gregory Farquhar, Nantas Nardelli, Tim G. J. Rudner, Chia-Man Hung, Philip H. S. Torr, Jakob Foerster, and Shimon Whiteson. The StarCraft Multi-Agent Challenge. CoRR, abs/1902.04043, 2019.
 [8] Landon Kraemer and Bikramjit Banerjee. Multi-agent reinforcement learning as a rehearsal for decentralized planning. Neurocomputing, 190:82–94, 2016.

 [9] Stuart Russell and Andrew L. Zimdars. Q-decomposition for reinforcement learning agents. In Proceedings of the International Conference on Machine Learning (ICML), 2003.
 [10] Tabish Rashid, Mikayel Samvelyan, Christian Schroeder, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML), pages 4295–4304. PMLR, 2018.
 [11] Yaodong Yang, Jianye Hao, Ben Liao, Kun Shao, Guangyong Chen, Wulong Liu, and Hongyao Tang. Qatten: A general framework for cooperative multiagent reinforcement learning. arXiv preprint arXiv:2002.03939, 2020.
 [12] Jianhao Wang, Zhizhou Ren, Terry Liu, Yang Yu, and Chongjie Zhang. QPLEX: Duplex dueling multi-agent Q-learning. In Proceedings of the International Conference on Learning Representations (ICLR), 2021.
 [13] David Ha, Andrew Dai, and Quoc V. Le. HyperNetworks. arXiv preprint arXiv:1609.09106, 2016.
 [14] Tabish Rashid, Gregory Farquhar, Bei Peng, and Shimon Whiteson. Weighted QMIX: Expanding monotonic value function factorisation for deep multi-agent reinforcement learning. In Proceedings of the Neural Information Processing Systems (NeurIPS), 2020.
 [15] Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. arXiv preprint arXiv:1706.02275, 2017.
 [16] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In Proceedings of the International Conference on Machine Learning (ICML), pages 387–395. PMLR, 2014.
 [17] Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), volume 32, 2018.
 [18] Yihan Wang, Beining Han, Tonghan Wang, Heng Dong, and Chongjie Zhang. Off-policy multi-agent decomposed policy gradients. arXiv preprint arXiv:2007.12322, 2020.
 [19] Meng Zhou, Ziyu Liu, Pengwei Sui, Yixuan Li, and Yuk Ying Chung. Learning implicit credit assignment for cooperative multi-agent reinforcement learning. In Proceedings of the Neural Information Processing Systems (NeurIPS), 2020.
 [20] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, 1994.
 [21] Frans A. Oliehoek and Christopher Amato. A Concise Introduction to Decentralized POMDPs. Springer, 2016.
 [22] Michael L. Littman. Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML), 1994.
 [23] Junling Hu and Michael P. Wellman. Nash Q-learning for general-sum stochastic games. Journal of Machine Learning Research (JMLR), 4:1039–1069, 2003.
 [24] Lucian Busoniu, Robert Babuska, and B. De Schutter. A comprehensive survey of multi-agent reinforcement learning. In Systems, Man and Cybernetics, 2008.
 [25] Carlos Guestrin, Daphne Koller, and Ronald Parr. Multiagent planning with factored MDPs. In Proceedings of the Neural Information Processing Systems (NeurIPS), 2001.
 [26] Anurag Koul. ma-gym: Collection of multi-agent environments based on OpenAI Gym. https://github.com/koulanurag/ma-gym, 2019.
Appendix A
A.1 Derivation of the Gradient of $Q_{tot}$ on $\{Q_i\}$
As in Section 4.1, we need to calculate the gradients of $Q_{tot}$ on $\{Q_i\}$. The detailed process is listed here. Recall from Eq. (1) that
$$Q_{tot} = W_2 \, f(W_1 \mathbf{q} + b_1) + b_2, \qquad (11)$$
where $\mathbf{q} = [Q_1, \dots, Q_n]^\top$ is the output of the individual Q-networks, and $W_1$, $W_2$, $b_1$, and $b_2$ are weights produced by the hypernetworks.
Therefore, we have
$$\frac{\partial Q_{tot}}{\partial \mathbf{q}} = W_2 \, D \, W_1, \qquad (12)$$
where $D = \mathrm{diag}\big(f'(W_1 \mathbf{q} + b_1)\big)$ is a diagonal matrix.
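The closed form in Eq. (12) can be sanity-checked numerically against finite differences. The following is an illustrative numpy sketch (weights are arbitrary non-negative matrices standing in for hypernetwork outputs; ELU is used for $f$):

```python
import numpy as np

def elu(x):
    return np.where(x > 0, x, np.exp(x) - 1.0)

def delu(x):
    """Derivative of ELU: 1 for x > 0, exp(x) otherwise."""
    return np.where(x > 0, 1.0, np.exp(x))

rng = np.random.default_rng(1)
n, m = 3, 4                                   # agents, hidden width
q  = rng.normal(size=n)
W1 = np.abs(rng.normal(size=(m, n))); b1 = rng.normal(size=m)
W2 = np.abs(rng.normal(size=(1, m))); b2 = rng.normal(size=1)

def q_tot(q):
    return float(W2 @ elu(W1 @ q + b1) + b2)

# Closed form, Eq. (12): dQ_tot/dq = W2 @ diag(f'(h)) @ W1, h = W1 q + b1.
h = W1 @ q + b1
closed = (W2 @ np.diag(delu(h)) @ W1).ravel()

# Finite-difference approximation of the same gradient vector.
eps = 1e-6
numeric = np.array([(q_tot(q + eps * np.eye(n)[i]) - q_tot(q)) / eps
                    for i in range(n)])
print(np.allclose(closed, numeric, atol=1e-4))  # True
```

Note that every entry of `closed` is non-negative, since $W_1$, $W_2$, and $f'$ are all non-negative, which is exactly the monotonicity condition of Eq. (2).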
A.2 Experiments
We develop our code based on PyMARL framework [7]. For the environmental settings, we follow the default setting in the QMIX implementation of PyMARL.
Env \ Percentile  5%  25%  50%  75%  95%

Lumberjacks  0.972  0.992  0.997  0.999  1.000 
TrafficJunction10  0.931  0.961  0.975  0.985  0.993 
3s vs 5z  0.870  0.940  0.972  0.990  0.998 
3s5z  0.941  0.975  0.984  0.989  0.995 
27m vs 30m  0.987  0.992  0.995  0.997  0.999 
2s3z  0.924  0.961  0.977  0.988  0.996 
5m vs 6m  0.975  0.988  0.993  0.996  0.999 
10m vs 11m  0.984  0.991  0.994  0.996  0.998 
bane vs bane  0.917  0.970  0.983  0.990  0.995 
MMM2  0.961  0.980  0.988  0.993  0.996 
We first calculate the entropy during the whole training process in the two gridworld environments and several SMAC environments. The 5%, 25%, 50%, 75%, and 95% percentiles of the entropy values for each map/environment are shown in the table above; some of them have already been shown in the main body of this paper. The data is collected over the whole training process: in gridworld we store the gradients every 1K steps (200K steps in total) and in SMAC we save the gradients every 10K steps (2M steps in total). The percentiles are calculated over all the saved data.
Figure 7 shows the normalized entropy during the training process in these environments.
We also conduct QMIX-simple experiments in these environments; the results are shown in Figure 8, some of which have been presented in the main body of the paper.