1. Introduction
Multi-Agent Systems (MAS) have attracted substantial attention in recent years for many sequential decision problems, such as autonomous vehicle teams (Keviczky et al., 2007; Cao et al., 2012), robotics (Lillicrap et al., 2016; Ramchurn et al., 2010), scene graph generation (Chen et al., 2019), and network routing (Ye et al., 2015). Among the available approaches, Multi-Agent Reinforcement Learning (MARL) has grown in popularity because it can learn without knowing the world model. A classical way to solve cooperative games in MARL is to regard the entire MAS as a single agent and optimize a joint policy according to the joint observations and trajectories (Tan, 1993). Because the joint action space grows exponentially with the number of agents, and because of the constraints of partial observability, this classical method faces insurmountable obstacles. This motivates the Centralized Training with Decentralized Execution (CTDE) paradigm (Oliehoek et al., 2008; Kraemer and Banerjee, 2016), where a central critic is set up to estimate the joint value function, and the agents are trained with global information but execute based only on their local observations and histories.
The main challenge that restricts effective CTDE in MARL is credit assignment, that is, attributing the global reward signal according to the contribution of each agent. Recent studies that attempt to solve this challenge can be roughly divided into two branches.
1) Implicit methods (Sunehag et al., 2018; Rashid et al., 2018; Son et al., 2019; Yang et al., 2020b) treat the central critic and the local agents as a whole during training. A decomposition function (usually a neural network) is first set up to map the joint value function to local value functions. The central critic is then learned simultaneously with the decomposition function and the policy. Implicit methods suffer from inadequate decomposition, limited by the design of the decomposition function. They also lack interpretability for the distributed credits (Heuillet et al., 2021).
2) Explicit methods (Foerster et al., 2018; Wang et al., 2020; Yang et al., 2020a) train the central critic and the local actors separately. In each iteration, the critic is updated first, after which some strategy is used to compute the reward or the value function of each agent explicitly. These reward signals or value functions then guide the training of the local agents. Although explicit methods overcome many shortcomings of their implicit counterparts, one has to algorithmically characterize each individual agent’s contribution to the overall success, which can be very hard in the context of subtle coalitions under common goals.
We address this challenge by using a counterfactual method with Shapley Value. Shapley Value (Shapley, 1953) originates from cooperative game theory and is a gold standard for distributing benefits reasonably and fairly by accounting for the contribution of participating players. By treating the agents in MARL as the players in a cooperative game, ideal credit assignment can be obtained by computing the marginal contributions that make up Shapley Value. Inspired by this, Wang et al. (2020) proposed SQDDPG, which uses Shapley Value in deterministic policy gradient (Silver et al., 2014; Lowe et al., 2017) to guide the learning of local agents. However, the performance of SQDDPG relies heavily on the framework designed for estimating the marginal contribution, and this framework is limited by the assumption that agents take actions sequentially, which is often unrealistic. These restrictions make SQDDPG perform unsatisfactorily in many tasks.
To this end, we extend the explicit methods and propose a novel method that leverages Shapley Value to allocate credits to agents. We achieve this by using a counterfactual method to estimate what would have happened without the participation of a set of agents. The contribution of a set of agents is quantified as the change in the central critic value when their actions are set to a baseline. The changes in contribution caused by an agent joining different sets are then treated as marginal contributions, from which Shapley Value can be obtained. Finally, these values play the role of credits for the local policies and guide their training.
Nevertheless, the computational complexity of the original Shapley Value grows factorially as the number of players increases. In many contexts of interest, such as network games, distributed control, and computing economics, this number can be quite large, which makes Shapley Value intractable. To alleviate the computational burden, we approximate Shapley Value through Monte Carlo sampling, which maintains the majority of its desired properties. In our approach, the Shapley Value of each agent is computed from sampled subsets of its collaborators, which are resampled at each time step. Our approach reduces the computational complexity to polynomial in the number of players without much loss of the effectiveness of Shapley Value.
Our main contributions can be summarized as follows:

We leverage a counterfactual method with Shapley Value to address the problem of credit assignment in Multi-Agent Reinforcement Learning. The proposed Shapley Counterfactual Credits reasonably and fairly characterize the contributions of each local agent by fully considering their interactions.

We adopt a Monte Carlo sampling-based method to approximate Shapley Value and reduce the growth of its computational complexity from factorial to polynomial, which makes our algorithm viable for large-scale, complicated tasks.

Extensive experiments show that our proposed method outperforms existing cooperative MARL algorithms significantly and achieves state-of-the-art performance on StarCraft II benchmarks. The margin is especially large for more difficult tasks.
The rest of this paper is organized as follows. In Section 2, we briefly review related work. In Section 3, we introduce the preliminaries, including Dec-POMDPs, Shapley Value, and the explicit framework for MARL. The details of our proposed algorithm for credit assignment are introduced in Section 4. Experimental results and analyses are reported in Section 5. Finally, we conclude the paper and discuss future directions in Section 6.
2. Related Work
2.1. Implicit Credit Assignment
Most of the implicit methods follow the Individual-Global-Max (IGM) condition, which means that the optimal joint action of the agents is equivalent to the collection of the optimal actions of each local agent. VDN (Sunehag et al., 2018) assumes additivity and decomposes the joint Q-function into the sum of individual Q-functions. QMIX (Rashid et al., 2018) drops this assumption but adds a monotonicity restriction. LICA (Zhou et al., 2020) extends QMIX to the actor-critic setting and proposes an adaptive entropy regularization. Weighted QMIX adopts a twin network and encourages underestimated actions to alleviate the risk of suboptimal results. QTRAN (Son et al., 2019) avoids the limitations of VDN and QMIX by introducing two regularization terms but has been shown to behave poorly in many situations. Qatten (Yang et al., 2020b) employs a multi-head attention mechanism to compute the weights for the local action-value functions and mixes them to approximate the global Q-value. All of these methods aim to learn a value decomposition from the total reward signals to the individual value functions, and they suffer from several problems: (i) the performance of the model relies heavily on the decomposition function; (ii) the distributed credits lack interpretability; (iii) the joint policy runs a high risk of falling into suboptimal results (Williams and Peng, 1991; Mnih et al., 2016; Ahmed et al., 2019).
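To make the IGM condition concrete, the following minimal sketch uses VDN-style additive mixing on a toy problem and checks that the greedy joint action coincides with the per-agent greedy actions. The Q-values and the helper name `vdn_joint_q` are hypothetical and chosen only for illustration.

```python
import itertools
import numpy as np

# Hypothetical local Q-values: 2 agents (rows), 3 actions each (columns).
q_local = np.array([[0.1, 0.9, 0.3],
                    [0.7, 0.2, 0.4]])

def vdn_joint_q(q_local, joint_action):
    """VDN-style additive mixing: Q_tot is the sum of the chosen local Q-values."""
    return sum(q_local[i, a] for i, a in enumerate(joint_action))

n_agents, n_actions = q_local.shape

# Brute-force search over every joint action.
joint_actions = list(itertools.product(range(n_actions), repeat=n_agents))
best_joint = max(joint_actions, key=lambda ja: vdn_joint_q(q_local, ja))

# Decentralized execution: each agent maximizes its own Q-function.
greedy_local = tuple(int(a) for a in q_local.argmax(axis=1))

print(best_joint, greedy_local)
```

Under additive mixing the two action tuples agree, which is what allows each agent to act greedily on its own Q-function at execution time.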
2.2. Explicit Credit Assignment
Explicit methods attribute to each agent a contribution that is at least provably locally optimal. The most representative method is COMA (Foerster et al., 2018), which uses a counterfactual advantage baseline to guide the learning of local policies. However, it treats each agent as an independent unit and overlooks the complex correlations among agents, so it becomes inefficient in complex situations. SQDDPG (Wang et al., 2020) proposes a network to estimate the marginal contribution, which is then used to approximate Shapley Value, and Shapley Value in turn guides the learning of local agents. However, this estimate of the marginal contribution is unreliable in many situations because the network over-relies on the assumption that the agents take actions sequentially. QPD (Yang et al., 2020a) designs a multi-channel mixer critic and leverages integrated gradients to distribute credits along paths, which achieves state-of-the-art results in many tasks. Intuitively, mining the relations between agents is essential for the policy gradient in cooperative games, but these correlations are complicated and are often underestimated by the models. To this end, we propose a Shapley Counterfactual Critic for credit assignment in MARL. Thanks to Shapley Value, the relations between the agents are considered sufficiently without prior knowledge, which further promotes the learning of local agents. Unlike SQDDPG (Wang et al., 2020), we compute the marginal contributions with a counterfactual method rather than a dedicated network, which is more stable and efficient in complicated situations.
2.3. Shapley Value and Approximate SV
Shapley Value (Shapley, 1953; Bilbao and Edelman, 2000; Meng, 2012; Sundararajan and Najmi, 2020) originates from cooperative game theory in the 1950s and assigns a unique distribution of the total benefits generated by the coalition of all players. Shapley Value satisfies the properties of efficiency, symmetry, nullity, linearity and coherency. It is a unique and fair way to quantify the importance of each player in the overall cooperation and is widely used in economics. However, the computational complexity of Shapley Value grows factorially with respect to the number of participating players (Kumar et al., 2020). Thus, to reduce the computation, several recent studies approximate the exact Shapley Value (Fatima et al., 2008; Chen et al., 2018; Ghorbani and Zou, 2019; Wang et al., 2021) by sacrificing some of its properties. For example, Frye et al. (2020) and Heskes et al. (2020) use causal knowledge to simplify its calculation, which breaks the axiom of symmetry. L-Shapley and C-Shapley only consider the interactions among local and connected players, which slightly breaks the property of efficiency. DASP (Ancona et al., 2019) and Neuron Shapley (Ghorbani and Zou, 2020) adopt sampling methods to approximate Shapley Value, which also slightly breaks the properties of efficiency and symmetry.
3. Preliminaries
3.1. Dec-POMDPs
A fully cooperative multi-agent sequential decision-making task with $n$ agents can be modeled as a decentralised partially observable Markov decision process (Dec-POMDP) (Oliehoek and Amato, 2016; Bernstein et al., 2002; Busoniu et al., 2008; Gupta et al., 2017; Palmer et al., 2018). A Dec-POMDP is canonically formulated by the tuple:
$$G = \langle S, A, P, r, Z, O, n, \gamma \rangle.$$
In the process, $s \in S$ represents the true state of the environment. At each time step, each agent $i \in \mathcal{N} := \{1, \dots, n\}$ chooses an action $a_i \in A$ simultaneously to form a joint action $\mathbf{a} := [a_i]_{i=1}^{n} \in A^n$. The action produces a state transition on the environment, which is described by the Markov transition probability function $P(s' \mid s, \mathbf{a}) : S \times A^n \times S \to [0, 1]$. All of the agents share the same global reward function $r(s, \mathbf{a}) : S \times A^n \to \mathbb{R}$.
In the setting of partial observability, the observation $z_i \in Z$ of each agent is generated by an observation function $O(s, i) : S \times \mathcal{N} \to Z$. Each agent owns an action-observation history $\tau_i \in T := (Z \times A)^*$, where $T$ denotes the set of sequences of observation-action pairs with arbitrary length. On this history, each agent conditions a stochastic policy $\pi_i(a_i \mid \tau_i) : T \times A \to [0, 1]$. The common goal of all agents is to maximize the expected discounted return $\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t}\right]$.
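As a small illustration of the objective, the discounted return of a single episode can be computed from the shared team rewards as sketched below. The function name `discounted_return` and the reward values are illustrative assumptions, not part of the original formulation.

```python
def discounted_return(rewards, gamma=0.99):
    """Return sum_t gamma^t * r_t for one episode of shared team rewards."""
    total = 0.0
    # Accumulate backwards so each step applies one extra factor of gamma.
    for r in reversed(rewards):
        total = r + gamma * total
    return total

# Hypothetical three-step episode.
episode_rewards = [1.0, 0.0, 2.0]
print(discounted_return(episode_rewards, gamma=0.5))
```

The agents maximize the expectation of this quantity over trajectories; the sketch only evaluates it for one sampled episode.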
3.2. Shapley Value
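Assume a coalition $\mathcal{N}$ consists of $n$ players who cooperate with each other to achieve a common goal. For a particular player $i$, let $\mathcal{C}$ be a random set that contains player $i$, and let $\mathcal{C} \setminus \{i\}$ represent the set with $i$ absent. Then the marginal contribution of $i$ in $\mathcal{C}$ is defined as:
$$\Delta_i(\mathcal{C}) = v(\mathcal{C}) - v(\mathcal{C} \setminus \{i\}), \tag{1}$$
where $v(\cdot)$ refers to the value function for estimating the cooperative contribution of a set of players.
Then the Shapley Value of player $i$ is computed as the weighted average of its marginal contributions over all of the subsets of $\mathcal{N}$ that contain $i$:
$$\phi_i = \sum_{\mathcal{C} \subseteq \mathcal{N},\, i \in \mathcal{C}} \frac{(|\mathcal{C}| - 1)!\,(n - |\mathcal{C}|)!}{n!}\, \Delta_i(\mathcal{C}), \tag{2}$$
where $\mathcal{C}$ denotes a set of size $|\mathcal{C}|$ that contains the player $i$. Shapley Value satisfies the following properties: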

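Efficiency. The credit generated by the big coalition is equal to the sum of the Shapley Values of all of the participating players: $v(\mathcal{N}) = \sum_{i \in \mathcal{N}} \phi_i$, where $v$ is the value function and $\phi_i$ is the Shapley Value of player $i$.

Symmetry. If $v(\mathcal{C} \cup \{i\}) = v(\mathcal{C} \cup \{j\})$ for all subsets $\mathcal{C} \subseteq \mathcal{N} \setminus \{i, j\}$, then $\phi_i = \phi_j$.

Nullity. If $v(\mathcal{C} \cup \{i\}) = v(\mathcal{C})$ for all subsets $\mathcal{C} \subseteq \mathcal{N} \setminus \{i\}$, then $\phi_i = 0$.

Linearity. Let $v$ and $w$ represent the associated gain functions, then $\phi_i(v + w) = \phi_i(v) + \phi_i(w)$ for every $i \in \mathcal{N}$.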

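Coherency. When another value function $w$ is utilized to measure the marginal contribution of $i$, if $v(\mathcal{C} \cup \{i\}) - v(\mathcal{C}) \geq w(\mathcal{C} \cup \{i\}) - w(\mathcal{C})$ for all subsets $\mathcal{C} \subseteq \mathcal{N} \setminus \{i\}$, then $\phi_i(v) \geq \phi_i(w)$.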
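The definition and the efficiency and symmetry properties can be checked on a small example. The sketch below computes exact Shapley Values by enumerating every coalition that contains each player; the three-player game `toy_value` is a hypothetical illustration, and the enumeration is only practical for a handful of players.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley Values by enumerating every coalition containing each player."""
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for k in range(n):  # k = number of other players in the coalition
            for rest in combinations(others, k):
                without_i = frozenset(rest)
                with_i = without_i | {i}
                size = len(with_i)
                weight = factorial(size - 1) * factorial(n - size) / factorial(n)
                # Marginal contribution of i to this coalition.
                total += weight * (value(with_i) - value(without_i))
        phi[i] = total
    return phi

# Hypothetical game: players 1 and 2 earn 10 only together; player 3 adds 2 alone.
def toy_value(coalition):
    v = 0.0
    if {1, 2} <= coalition:
        v += 10.0
    if 3 in coalition:
        v += 2.0
    return v

phi = shapley_values([1, 2, 3], toy_value)
print(phi)
```

In this toy game players 1 and 2 are interchangeable and receive equal values, and the three values sum to the worth of the full coalition.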
3.3. Explicit Framework for MARL
Explicit methods make the allocated credits interpretable, which can reduce users’ doubts about the rationality of the learned local agents. For this reason, we extend the explicit methods (Foerster et al., 2018; Wang et al., 2020; Yang et al., 2020a), which first train the central critic according to the joint states and actions and then distribute the global reward signals according to the contributions of local agents to the critic.
Following QPD (Yang et al., 2020a), we model our critic network with three components, as shown in Figure 1: the feature extraction module, the feature fusion module, and the Q-function estimation module. The first module consists of 2 dense layers with ReLU nonlinearity and is used to extract the features of a particular agent’s observations and actions. Then the features of all agents are concatenated and thus merged into a global feature. Finally, the joint Q-value is computed from the global feature. As Yang et al. (2020a) illustrated, different agents may have the same attributes and can therefore be categorized into groups. For this reason, the agents within the same group are modeled using the same sub-network. Meanwhile, in order to simplify the network architecture and accelerate learning, the agents of the same group share the same parameters. We represent the central critic as:
$$Q_{tot} = Q(o_1, a_1, \dots, o_n, a_n), \tag{3}$$
where $o_i$ and $a_i$ denote the observation and the action of the $i$-th agent, respectively.
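The three-module critic described above can be sketched as a forward pass in NumPy. This is only an illustration under stated assumptions: the weights are random rather than trained, all agents are assumed to have the same observation and action dimensions, and the class name `CentralCritic` and the group assignment are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(in_dim, out_dim):
    """Randomly initialised dense layer (weights, bias); stands in for trained parameters."""
    return rng.normal(0.0, 0.1, size=(in_dim, out_dim)), np.zeros(out_dim)

def relu(x):
    return np.maximum(x, 0.0)

class CentralCritic:
    def __init__(self, obs_dim, act_dim, groups, hidden=64):
        # groups[i] is the group id of agent i; agents in one group share an extractor.
        self.groups = groups
        self.extractors = {
            g: (dense(obs_dim + act_dim, hidden), dense(hidden, hidden))
            for g in set(groups)
        }
        # Q-function estimation: one dense layer with a single output neuron.
        self.head = dense(hidden * len(groups), 1)

    def __call__(self, obs, actions):
        feats = []
        for o, a, g in zip(obs, actions, self.groups):
            (w1, b1), (w2, b2) = self.extractors[g]
            x = np.concatenate([o, a])
            # Feature extraction: 2 dense layers with ReLU.
            feats.append(relu(relu(x @ w1 + b1) @ w2 + b2))
        # Feature fusion: concatenate the per-agent features into a global feature.
        w, b = self.head
        return float((np.concatenate(feats) @ w + b)[0])

# Three agents: the first two belong to group 0, the third to group 1.
critic = CentralCritic(obs_dim=4, act_dim=3, groups=[0, 0, 1])
obs = rng.normal(size=(3, 4))
actions = np.eye(3)[[0, 2, 1]]  # one-hot actions
q_tot = critic(obs, actions)
print(q_tot)
```

Sharing one extractor per group keeps the parameter count independent of the number of agents within a group.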
In our implementation, each local agent is realized as a recurrent deep Q-network, which is composed of a Long Short-Term Memory (LSTM) layer and a Multi-Layer Perceptron (MLP). We represent the local agent as:
$$Q_i = Q_i(o_i, h_i), \tag{4}$$
where $h_i$ is the hidden state of the LSTM. For the exploration policy, $\epsilon$-greedy is adopted, and the exploration rate at episode $k$ is:
$$\epsilon(k) = \max(\epsilon_0 - k\delta, 0), \tag{5}$$
where $\epsilon_0$ is the initial exploration rate and $\delta$ represents the amount by which the rate decreases each episode.
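A possible realization of the exploration policy is sketched below. It assumes that the exploration rate is annealed linearly, which is consistent with the reported start rate of 1, end rate of 0, and 1000 exploration episodes, but the exact schedule is an assumption; the function and parameter names are hypothetical.

```python
import random

def exploration_rate(episode, eps_start=1.0, eps_end=0.0, explore_episodes=1000):
    """Linearly anneal epsilon from eps_start to eps_end over explore_episodes (assumed form)."""
    step = (eps_start - eps_end) / explore_episodes
    return max(eps_start - episode * step, eps_end)

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Hypothetical usage: fully greedy once the exploration phase has ended.
action = epsilon_greedy([0.1, 0.9, 0.3], exploration_rate(2000))
print(action)
```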
4. Shapley Counterfactual Credits for MARL
The framework of our approach is illustrated in Figure 1. First, the central critic takes the actions and observations of each agent as input and approximates the total Q value. Then the contributions of the individual agents are distributed by the counterfactual method with Shapley Value. Finally, the local agents update their parameters according to the credits they earned.
In this section, we systematically describe our “Shapley Counterfactual Credits” for Multi-Agent Reinforcement Learning. First, in Section 4.1, we introduce a counterfactual method with Shapley Value to address the problem of credit assignment, which can fully mine the correlations among the local agents. To reduce the computational complexity, we replace the true Shapley Value with its approximation, as discussed in Section 4.2. The details of the proposed algorithm and the loss function are introduced in Section 4.3.
4.1. Counterfactual Method with Shapley Value for Credit Assignment
The main challenge we need to address is how to measure the contribution of each agent. In other words, we need to quantify how the agents’ actions influence the output of the central critic. COMA (Foerster et al., 2018) proposed a special critic with a counterfactual baseline, which estimates this influence as the advantage of the action value over the expected value, but it shows poor performance on many tasks. Wolpert and Tumer (2002) computed the influence using difference rewards, which compare the global reward to the reward received when the action of an agent is replaced with a default action. Inspired by these ideas, we also propose a counterfactual method in our central critic to measure the effect of the actions taken by the agents.
We consider the contribution of an action taken by an agent to be the answer to the question “how will the output change when this action is absent?” We formulate the contribution of the action performed by the $i$-th agent to the central critic as:
$$v_i = Q(o_1, a_1, \dots, o_i, a_i, \dots, o_n, a_n) - Q(o_1, a_1, \dots, o_i, b_i, \dots, o_n, a_n), \tag{6}$$
where $b_i$ denotes a baseline, meaning that the action $a_i$ is replaced by a default one.
However, such an estimate of the contributions is insufficient, since the agents cooperate with each other and cannot be treated as independent units. We would like to quantify the credit earned by an agent precisely from the intricate relationships among agents, but the environment is complex, and there is no prior knowledge to indicate how the agents cooperate. To this end, we propose to use Shapley Value for credit assignment, as described next.
As mentioned before, Shapley Value distributes credits fairly by considering the contributions of the participating players and satisfies many good properties such as efficiency, additivity, and coherency. Thus, we use this tool to extend the counterfactual method.
For convenience, we abbreviate Equation (6) and generalize from a single agent to a set of agents $\mathcal{C}$:
$$v(\mathcal{C}) = Q(x_{\mathcal{N}}) - Q(x_{\mathcal{N} \setminus \mathcal{C}}, b_{\mathcal{C}}), \tag{7}$$
where $\mathcal{N}$ denotes all of the agents, $x_{\mathcal{C}}$ represents the actions and observations of $\mathcal{C}$, and $b_{\mathcal{C}}$ denotes that the actions of all agents in $\mathcal{C}$ are replaced with default actions.
To compute the Shapley Value of the $i$-th agent in the big coalition, we need its marginal contributions in all of the subsets of the big coalition $\mathcal{N}$ in which it takes part. We define the marginal contribution of the $i$-th agent in a subset $\mathcal{C}$ of $\mathcal{N}$ as:
$$\Delta_i(\mathcal{C}) = v(\mathcal{C}) - v(\mathcal{C} \setminus \{i\}), \tag{8}$$
where $\mathcal{C} \setminus \{i\}$ denotes $\mathcal{C}$ with the $i$-th agent removed.
After obtaining the marginal contributions, we compute the Shapley Counterfactual Credits as:
$$\phi_i = \sum_{\mathcal{C} \subseteq \mathcal{N},\, i \in \mathcal{C}} \frac{(|\mathcal{C}| - 1)!\,(n - |\mathcal{C}|)!}{n!}\, \Delta_i(\mathcal{C}), \tag{9}$$
where $\mathcal{C}$ denotes a set of agents of size $|\mathcal{C}|$ that contains the $i$-th agent, and $n$ is the total number of agents.
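The counterfactual credit computation in this subsection can be sketched as follows. The sketch makes several simplifying assumptions: the critic is a hypothetical function of scalar actions only (observations are held fixed), the baseline is 0, and the credits are computed exactly by enumeration rather than by sampling.

```python
from itertools import combinations
from math import factorial

def coalition_contribution(critic, actions, baseline, coalition):
    """Change in the critic output when the coalition's actions are set to the baseline."""
    masked = [baseline if i in coalition else a for i, a in enumerate(actions)]
    return critic(actions) - critic(masked)

def shapley_counterfactual_credits(critic, actions, baseline=0.0):
    """Exact Shapley-weighted average of each agent's counterfactual marginal contributions."""
    n = len(actions)
    credits = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for k in range(n):
            for rest in combinations(others, k):
                without_i = set(rest)
                with_i = without_i | {i}
                size = len(with_i)
                weight = factorial(size - 1) * factorial(n - size) / factorial(n)
                delta = (coalition_contribution(critic, actions, baseline, with_i)
                         - coalition_contribution(critic, actions, baseline, without_i))
                phi += weight * delta
        credits.append(phi)
    return credits

# Hypothetical critic: agents 0 and 1 only score when they act together.
def toy_critic(actions):
    return 4.0 * actions[0] * actions[1] + 1.0 * actions[2]

credits = shapley_counterfactual_credits(toy_critic, [1.0, 1.0, 1.0])
print(credits)
```

In this toy example, removing either of the two cooperating agents alone destroys the whole joint payoff of 4, yet the credits split that payoff equally between them instead of counting it twice.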
4.2. Approximation of Shapley Value
However, the main drawback of Shapley Value is that its computational complexity grows factorially as the number of agents increases (Kumar et al., 2020), which can make the exact computation unacceptably expensive. Recent studies therefore usually substitute an approximation of Shapley Value (Chen et al., 2018; Ghorbani and Zou, 2019; Wang et al., 2021). To alleviate the computational burden, we adopt the Monte Carlo sampling method to obtain the approximated Shapley Value:
$$\hat{\phi}_i = \frac{1}{M} \sum_{m=1}^{M} \Delta_i(\mathcal{C}_m), \tag{10}$$
where $M$ represents the number of Monte Carlo samples, $\mathcal{C}_m$ represents a subset of $\mathcal{N}$, sampled at the $m$-th draw, that contains the $i$-th agent, and $\Delta_i(\mathcal{C}_m)$ is the marginal contribution of the $i$-th agent in $\mathcal{C}_m$.
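A minimal sketch of the sampled estimate is given below. The text does not specify the distribution from which the subsets are drawn, so this sketch assumes permutation sampling, a common choice in which the coalition consists of the agent and everyone ahead of it in a random ordering. The function names and the toy game are hypothetical.

```python
import random

def monte_carlo_credit(contribution, n_agents, agent, n_samples=5, rng=None):
    """Average the agent's marginal contribution over randomly sampled coalitions."""
    rng = rng or random.Random()
    total = 0.0
    for _ in range(n_samples):
        order = list(range(n_agents))
        rng.shuffle(order)
        # Coalition = the agent plus everyone ahead of it in a random ordering (assumed scheme).
        with_agent = set(order[: order.index(agent) + 1])
        total += contribution(with_agent) - contribution(with_agent - {agent})
    return total / n_samples

# Hypothetical additive game: each agent contributes its own fixed worth.
worth = [3.0, 1.0, 2.0]

def additive_contribution(coalition):
    return sum(worth[j] for j in coalition)

estimate = monte_carlo_credit(additive_contribution, n_agents=3, agent=0,
                              rng=random.Random(0))
print(estimate)
```

Each call evaluates the contribution function twice per sample, so the cost per agent grows with the number of samples rather than with the number of possible coalitions.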
With this approximation, we reduce the computational complexity of the true Shapley Value of an agent from $\mathcal{O}(n!)$ to $\mathcal{O}(M)$, where $n$ is the number of agents, which may be very large in some situations, and $M$ is a hyperparameter that represents the number of Monte Carlo samples and can be a small positive integer. Note that such an approximation of Shapley Value might slightly break some of its properties, such as efficiency and symmetry. Recent literature has sacrificed these properties to varying degrees in exchange for an acceptable computational cost (Chen et al., 2018; Ghorbani and Zou, 2019; Wang et al., 2021). We deem such an approximation necessary and expect it to have little impact on the model’s performance.
4.3. Loss Function and Training Algorithm
We show the details of our algorithm in Algorithm 1. The local agents first interact with the environment and take actions according to their observations and histories. These actions and observations are then fed to the central critic to estimate the joint Q-function. The whole framework is updated in two stages. In the first stage, we update the central critic by minimizing the TD-loss:
$$\mathcal{L}(\theta) = \left(r + \gamma\, Q'(\mathbf{o}', \mathbf{a}') - Q(\mathbf{o}, \mathbf{a}; \theta)\right)^2, \tag{11}$$
where $\theta$ denotes the parameters of the central critic, $Q$ is the output of the central critic, $Q'$ represents the output of the target network of the central critic, $r$ is the global reward, $\gamma$ is the discount rate, and primes mark the joint observations and actions at the next time step.
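In the second stage, we first obtain the Shapley Counterfactual Credits of each agent according to Equation (10). Then each agent is trained by minimizing the loss:
$$\mathcal{L}(\theta_i) = \left(\hat{\phi}_i - Q_i(\theta_i)\right)^2, \tag{12}$$
where $\theta_i$ denotes the parameters of the $i$-th local agent, $Q_i$ is the output of the $i$-th agent, and $\hat{\phi}_i$ is its Shapley Counterfactual Credit.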
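The two training stages can be illustrated with the loss computations below. This is a sketch under assumptions rather than the exact objective: both losses are taken to be mean squared errors over a batch, the `done` mask for terminal transitions is an added convention, and all function and argument names are hypothetical.

```python
import numpy as np

def td_loss(reward, q_value, next_target_q, done, gamma=0.99):
    """Stage 1: mean squared TD error for the central critic over a batch."""
    # The target uses the target network's estimate for the next step; zero after termination.
    target = reward + gamma * (1.0 - done) * next_target_q
    return float(np.mean((target - q_value) ** 2))

def agent_loss(credit, agent_q):
    """Stage 2 (assumed form): squared error between an agent's credit and its own Q output."""
    return float(np.mean((credit - agent_q) ** 2))

# Hypothetical single-transition batch.
critic_loss = td_loss(reward=np.array([1.0]), q_value=np.array([2.0]),
                      next_target_q=np.array([1.0]), done=np.array([0.0]), gamma=0.5)
local_loss = agent_loss(credit=np.array([2.0]), agent_q=np.array([1.5]))
print(critic_loss, local_loss)
```

In a full implementation these values would be differentiated with respect to the critic and agent parameters; the sketch only shows how the targets are formed.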
The mean win rates of our method compared with others in different map scenarios of StarCraft II. The shaded areas represent the standard deviation.
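Each cell lists the median ($\tilde{m}$) / mean ($\bar{m}$) test win rate (%).

Map | VDN | QMIX | QTRAN | COMA | QPD | SQDDPG | OURS
3m | 100 / 100 | 100 / 100 | 100 / 100 | 95 / 96 | 99 / 99 | 64 / 65 | 99 / 99
8m | 100 / 100 | 100 / 100 | 100 / 100 | 100 / 100 | 95 / 95 | 92 / 90 | 98 / 97
2s3z | 100 / 100 | 100 / 100 | 92 / 91 | 45 / 45 | 99 / 98 | 60 / 55 | 100 / 100
1c3s5z | 88 / 85 | 95 / 90 | 40 / 41 | 15 / 15 | 77 / 72 | 2 / 2 | 61 / 60
3s5z | 80 / 69 | 80 / 67 | 12 / 13 | 5 / 3 | 79 / 80 | 1 / 1 | 92 / 90
3s5z_vs_3s6z | 0 / 0 | 0 / 0 | 0 / 0 | 0 / 0 | 3 / 5 | 0 / 0 | 20 / 20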
5. Experiments
We focus on addressing the problem of credit assignment in MARL with cooperative settings explicitly. We compare our proposed method with several baselines, including VDN (Sunehag et al., 2018), QMIX (Rashid et al., 2018), COMA (Foerster et al., 2018), QTRAN (Son et al., 2019), QPD (Yang et al., 2020a), and SQDDPG (Wang et al., 2020). The training configurations, experiment results, as well as the analysis will be described in detail in this section.
5.1. Experiment Settings
Environment
We perform extensive experiments on the StarCraft II (a real-time strategy game) micromanagement challenge, in which each allied unit is controlled by an agent that acts based on its local observations, and the opponent’s army is controlled by the hand-coded built-in StarCraft II AI. Each unit in StarCraft has a rich set of complex micro-actions, which allows the learning of complex interactions between cooperating agents. The overall goal is to maximize the accumulated reward in each battle scenario. The environment produces rewards based on the hit-point damage dealt and the enemy units killed. In addition, a bonus is given when the battle is won. At each time step, each agent receives only the local observations within its field of view; in particular, an agent can only observe other agents that are alive and located within its sight range. Moreover, all agents can only attack enemies within their shooting range, which is set to 6. The global state consists of the joint observations without the sight-range restriction and is used by the central critic during training. All features are normalized by their maximum values before being sent to the neural network. The StarCraft Multi-Agent Challenge (SMAC) environment (Samvelyan et al., 2019) is used as the testbed, and we set the difficulty of the game AI to the “very difficult” level.
Configurations
The central critic of our method is the same as in QPD (Yang et al., 2020a) and consists of the feature extraction layers, the feature fusion operation, and the Q-function estimation layers. First, the agents are grouped according to their attributes, and 2 dense layers are used to extract the features of their observations and actions. Each dense layer consists of 64 neurons for each channel. To accelerate learning, we adopt the parameter sharing technique (Yang et al., 2018; Iqbal and Sha, 2019), where the agents within the same group share the parameters of the feature extraction layers. Then, we concatenate the features of all agents to fuse them into a global feature. Finally, for the Q-function estimation, we adopt another dense layer with one output neuron. When computing Shapley Value, we adopt the Monte Carlo sampling method to sample 5 subsets for each agent at each time step. We set the counterfactual baseline in the central critic to the zero vector for convenience. We model the local agents with an LSTM layer and 2 fully connected layers. The dimension of the LSTM hidden state is set to 64, and the two fully connected layers have 64 and $|A|$ units respectively, where $|A|$ is the size of the action space. We set the discount rate for the TD-loss to 0.99. The replay buffer stores the most recent 1000 trajectories. During training, we update the central critic with Adam and the local agent networks with RMSProp. We copy the parameters of the central critic to its target network every 200 training episodes. The full hyperparameters of our Shapley Counterfactual Credits are shown in Table 2. The map 3s5z_vs_3s6z is much harder than the other maps, and the allied forces have one unit fewer than the enemy. During training, the win rates remain 0 even when the returns are relatively high. For this reason, we set the number of training episodes for map 3s5z_vs_3s6z to 50000, while the others are set to 20000.

Settings | Value
Batch size | 32
Replay buffer size | 1000
Training episodes | 20000
Exploration episodes | 1000
Start exploration rate | 1
End exploration rate | 0
TD-loss discount | 0.99
Target central critic update interval | 200 episodes
Evaluation interval | 100 episodes
Evaluation battle number | 100
Agent optimizer | RMSProp
Central critic optimizer | Adam
Agent learning rate | 0.005
Central critic learning rate | 0.01
Dense units | 64
LSTM hidden units | 64
Baseline for Shapley Value | 0 vector
Times for Monte Carlo Sampling | 5
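For reproduction purposes, the reported settings can be collected in a single configuration object, as sketched below. The key names and the helper `training_episodes` are hypothetical; the TD-loss discount uses the value 0.99 stated in the Configurations text.

```python
# Hyperparameters transcribed from the reported settings; key names are hypothetical.
CONFIG = {
    "batch_size": 32,
    "replay_buffer_size": 1000,        # most recent trajectories
    "training_episodes": 20000,        # 50000 for map 3s5z_vs_3s6z
    "exploration_episodes": 1000,
    "eps_start": 1.0,
    "eps_end": 0.0,
    "td_discount": 0.99,               # value stated in the Configurations text
    "target_update_interval": 200,     # episodes
    "eval_interval": 100,              # episodes
    "eval_battles": 100,
    "agent_optimizer": "RMSProp",
    "critic_optimizer": "Adam",
    "agent_lr": 0.005,
    "critic_lr": 0.01,
    "dense_units": 64,
    "lstm_hidden_units": 64,
    "shapley_baseline": 0.0,           # zero vector
    "mc_samples": 5,
}

def training_episodes(map_name, config=CONFIG):
    """Return the episode budget, using the longer run reported for the hardest map."""
    return 50000 if map_name == "3s5z_vs_3s6z" else config["training_episodes"]

print(training_episodes("3m"), training_episodes("3s5z_vs_3s6z"))
```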
5.2. Results and Analysis
To demonstrate the efficiency of our proposed method, we perform experiments on 6 maps of StarCraft II (3m, 8m, 2s3z, 1c3s5z, 3s5z, 3s5z_vs_3s6z), including both homogeneous and heterogeneous scenarios. Figure 2 depicts the curves of the mean win rates of our method compared to the baselines. The final results are reported in Table 1, where $\tilde{m}$ represents the median of the test win rates and $\bar{m}$ represents the mean test win rate.
All of the methods perform well on the three simple scenarios (3m, 8m, 2s3z), and our Shapley Counterfactual Credits algorithm is competitive with the state-of-the-art algorithms, achieving nearly 100% mean win rates. Both sides have 3 Marines in map 3m and 8 Marines in map 8m. As both sides field a single unit type in equal numbers, each agent only needs to focus on beating enemies and avoiding redundant actions. Concretely, the replays show that in maps 3m and 8m, units learned to stand in a line or semicircle in order to focus fire on the incoming enemies. Such a pattern is easy for models to learn, and agents hardly need to consider how to cooperate with friendly forces. In map 2s3z, both sides have 2 Stalkers and 3 Zealots. Since Zealots counter Stalkers, the Stalkers need to hide behind their own side’s Zealots. Such a small number of units does not pose much of a challenge for learning.
Our algorithm falls behind VDN, QMIX, and QPD in map 1c3s5z, where both sides have 3 Stalkers, 5 Zealots and a Colossus. The Colossus is more threatening and becomes the priority target, which reduces the difficulty of the game. Here, we divide the learned ability of an agent into personal ability and cooperative ability. For example, “kite the enemy” and “attack high-threat targets” belong to the former, and “move to protect the allies” belongs to the latter. In this map, all of the agents need to learn to attack the enemy’s Colossus first, which makes other actions less important. Since Shapley Value focuses more on mining the correlations between agents, our method does not perform very well in this scenario.
Our algorithm shows obvious advantages in the two maps 3s5z and 3s5z_vs_3s6z, which are much more difficult than the others. In map 3s5z, both sides have 3 Stalkers and 5 Zealots, and we obtain a mean win rate of 90%. In this scenario, the Stalker agents need not only to stand behind the allied Zealots but also to learn to attack the enemy Stalkers with high priority. Meanwhile, the allied Zealots need to protect the allied Stalkers as well as attack the nearest enemy Stalkers. In this complex situation, cooperation among agents is more important than before. Our counterfactual method with Shapley Value fully considers the correlations and interactions between units and distributes a moderate credit for the actions taken by each agent, and thus outperforms the baselines significantly. For instance, a “movement” of a Zealot may affect other friendly forces to varying degrees; we measure its contribution by considering how the results change when different kinds of correlations are absent. In particular, in map 3s5z_vs_3s6z, where the allies have 3 Stalkers and 5 Zealots while the enemy has 3 Stalkers and 6 Zealots, all of the existing methods except QPD obtain a mean win rate of zero. The reason for the poor performance of these methods is that cooperative behavior such as “block”, rather than “kite”, plays a more important role in such settings. The Zealots need to draw fire in order to protect the allied Stalkers, which is the only way to achieve the final victory. In this scenario, Shapley Value fully demonstrates its superiority: our method achieves a mean win rate of 20% and reaches the state of the art.
In conclusion, our proposed Shapley Counterfactual Credits algorithm shows its strength and beats all of the other methods in complicated scenarios where cooperation among agents plays an essential role. It also achieves results competitive with the state-of-the-art algorithms in the scenarios that depend more on personal ability.
5.3. Ablation Study
To demonstrate the advantage that Shapley Value (Shapley, 1953) brings to the counterfactual method, we perform an ablation study on three maps (3m, 8m, 2s3z). The difficulty of these three maps increases sequentially. The results are shown in Figure 3. The blue curves represent credits allocated by the counterfactual method without Shapley Value. The red curves represent credits distributed by Shapley Counterfactual Credits. To balance performance and computational cost, we set the number of Monte Carlo samples for approximating Shapley Value to 5; the analysis is given in the next subsection.
In maps 3m and 8m, the units need to learn to stand in a suitable position and fire at the same enemy unit together. Thus, the ability to cooperate is relatively important, and the use of Shapley Value improves the performance. In 2s3z, by contrast, the Stalkers need to “kite the Zealots” and the number of units is small, which means that personal ability is more important, so our method loses its advantage in this scenario. It is worth mentioning that the use of Shapley Value makes learning more stable and significantly reduces the standard deviation (the shaded part in the figure) of the win rates. This is because Shapley Value considers a variety of combinations among agents and measures the contribution of an agent via the weighted average of the counterfactual results of these combinations.
5.4. The Choice of Sample Times for Shapley Approximation
We approximate the true Shapley Value via Monte Carlo sampling. Concretely, at each time step, we randomly sample $M$ subsets for each agent $i$ and average the marginal contributions of $i$ in these subsets to obtain its approximated Shapley Value. However, a large $M$ still puts pressure on the computational cost, while a small $M$ leads to an inaccurate approximation. We performed extensive experiments to find a moderate value of this hyperparameter, and the results are depicted in Figure 4. We conclude that sampling 4 times is sufficient to reach an ideal result. To make the performance more stable, we set $M$ to 5 in our experiments.
6. Conclusion and Future Work
In this paper, we investigate the problem of credit assignment in Multi-Agent Reinforcement Learning. We extend the methods of explicit credit assignment and leverage a counterfactual method to measure the contributions of local agents to the central critic. To fully describe the relationships among the cooperative agents, we utilize Shapley Value and approximate it with Monte Carlo sampling, which decreases its computational complexity from factorial to polynomial. Experiments on the StarCraft II micromanagement tasks show the superiority of our method, which reaches state-of-the-art performance on various scenarios.
For future work, it would be interesting to investigate the causal knowledge among the cooperative agents. With this inferred knowledge, Shapley Value could be approximated more accurately and the credit assignment made more precise. Our method can also be extended to competitive settings, where variants of Shapley Value have proved to be effective.
Acknowledgements.
This work was supported by the National Key Research and Development Project of China (No. 2018AAA0101900), the National Natural Science Foundation of China (No. 61625107, U19B2043, 61976185, No. 62006207), Zhejiang Natural Science Foundation (LR19F020002), Key R&D Projects of the Ministry of Science and Technology (No. 2020YFC0832500), Zhejiang Innovation Foundation (2019R52002), the Fundamental Research Funds for the Central Universities, and Zhejiang Province Natural Science Foundation (No. LQ21F020020). Baoxiang Wang is partially supported by AC01202101031 and AC01202108001 from AIRS.
References
 Understanding the impact of entropy on policy optimization. In International Conference on Machine Learning, pp. 151–160. Cited by: §2.1.
 Explaining deep neural networks with a polynomial time algorithm for shapley value approximation. In International Conference on Machine Learning, pp. 272–281. Cited by: §2.3.
 The complexity of decentralized control of markov decision processes. Mathematics of operations research 27 (4), pp. 819–840. Cited by: §3.1.
 The shapley value on convex geometries. Discrete Applied Mathematics 103 (1–3), pp. 33–40. Cited by: §2.3.
 A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 38 (2), pp. 156–172. Cited by: §3.1.
 An overview of recent progress in the study of distributed multiagent coordination. IEEE Transactions on Industrial informatics 9 (1), pp. 427–438. Cited by: §1.
 L-shapley and c-shapley: efficient model interpretation for structured data. In International Conference on Learning Representations, Cited by: §2.3, §4.2, §4.2.
 Counterfactual critic multiagent training for scene graph generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4613–4623. Cited by: §1.
 A linear approximation method for the shapley value. Artificial Intelligence 172 (14), pp. 1673–1699. Cited by: §2.3.
 Counterfactual multiagent policy gradients. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32. Cited by: §1, §2.2, §3.3, §4.1, §5.
 Asymmetric shapley values: incorporating causal knowledge into model-agnostic explainability. Cited by: §2.3.
 Data shapley: equitable valuation of data for machine learning. In International Conference on Machine Learning, pp. 2242–2251. Cited by: §2.3, §4.2, §4.2.
 Neuron shapley: discovering the responsible neurons. arXiv preprint arXiv:2002.09815. Cited by: §2.3.
 Cooperative multiagent control using deep reinforcement learning. In International Conference on Autonomous Agents and Multiagent Systems, pp. 66–83. Cited by: §3.1.
 Causal shapley values: exploiting causal knowledge to explain individual predictions of complex models. Cited by: §2.3.
 Explainability in deep reinforcement learning. Knowledge-Based Systems 214, pp. 106685. Cited by: §1.
 Actor-attention-critic for multiagent reinforcement learning. In International Conference on Machine Learning, pp. 2961–2970. Cited by: §5.1.
 Decentralized receding horizon control and coordination of autonomous vehicle formations. IEEE Transactions on control systems technology 16 (1), pp. 19–33. Cited by: §1.
 Multiagent reinforcement learning as a rehearsal for decentralized planning. Neurocomputing 190, pp. 82–94. Cited by: §1.
 Problems with shapley-value-based explanations as feature importance measures. In International Conference on Machine Learning, pp. 5491–5500. Cited by: §2.3, §4.2.
 Continuous control with deep reinforcement learning. International Conference on Learning Representations. Cited by: §1.
 Multi-agent actor-critic for mixed cooperative-competitive environments. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 6382–6393. Cited by: §1.
 The core and shapley function for games on augmenting systems with a coalition structure. International Journal of Mathematical and Computational Sciences 6 (8), pp. 813–818. Cited by: §2.3.
 Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pp. 1928–1937. Cited by: §2.1.
 A concise introduction to decentralized pomdps. Springer. Cited by: §3.1.
 Optimal and approximate q-value functions for decentralized pomdps. Journal of Artificial Intelligence Research 32, pp. 289–353. Cited by: §1.
 Lenient multiagent deep reinforcement learning. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pp. 443–451. Cited by: §3.1.
 Decentralized coordination in robocup rescue. The Computer Journal 53 (9), pp. 1447–1461. Cited by: §1.
 Qmix: monotonic value function factorisation for deep multiagent reinforcement learning. In International Conference on Machine Learning, pp. 4295–4304. Cited by: §1, §2.1, §5.
 The starcraft multiagent challenge. Cited by: §5.1.
 A value for nperson games. Contributions to the Theory of Games. Cited by: §1, §2.3, §5.3.
 Deterministic policy gradient algorithms. In International conference on machine learning, pp. 387–395. Cited by: §1.
 Qtran: learning to factorize with transformation for cooperative multiagent reinforcement learning. In International Conference on Machine Learning, pp. 5887–5896. Cited by: §1, §2.1, §5.
 The many shapley values for model explanation. In International Conference on Machine Learning, pp. 9269–9278. Cited by: §2.3.
 Value-decomposition networks for cooperative multiagent learning based on team reward. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pp. 2085–2087. Cited by: §1, §2.1, §5.
 Multiagent reinforcement learning: independent vs. cooperative agents. In Proceedings of the tenth international conference on machine learning, pp. 330–337. Cited by: §1.
 Shapley q-value: a local reward approach to solve global reward games. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 7285–7292. Cited by: §1, §2.2, §3.3, §5.
 Shapley flow: a graph-based approach to interpreting model predictions. Cited by: §2.3, §4.2, §4.2.
 Function optimization using connectionist reinforcement learning algorithms. Connection Science 3 (3), pp. 241–268. Cited by: §2.1.
 Optimal payoff functions for members of collectives. In Modeling complexity in economic and social systems, pp. 355–369. Cited by: §4.1.
 Q-value path decomposition for deep multiagent reinforcement learning. In International Conference on Machine Learning, pp. 10706–10715. Cited by: §1, §2.2, §3.3, §3.3, §5.1, §5.
 Qatten: a general framework for cooperative multiagent reinforcement learning. arXiv e-prints. Cited by: §1, §2.1.
 Mean field multiagent reinforcement learning. In International Conference on Machine Learning, pp. 5571–5580. Cited by: §5.1.
 A multiagent framework for packet routing in wireless sensor networks. Sensors 15 (5), pp. 10026–10047. Cited by: §1.
 Learning implicit credit assignment for multiagent actor-critic. Cited by: §2.1.