Shapley Counterfactual Credits for Multi-Agent Reinforcement Learning

06/01/2021 · Jiahui Li, et al. · Zhejiang University; Huawei Technologies Co., Ltd.

Centralized Training with Decentralized Execution (CTDE) has been a popular paradigm in cooperative Multi-Agent Reinforcement Learning (MARL) settings and is widely used in many real applications. One of the major challenges in the training process is credit assignment, which aims to deduce the contribution of each agent from the global rewards. Existing credit assignment methods focus on either decomposing the joint value function into individual value functions or measuring the impact of local observations and actions on the global value function. These approaches lack a thorough consideration of the complicated interactions among multiple agents, leading to unsuitable credit assignment and, consequently, mediocre MARL results. We propose Shapley Counterfactual Credit Assignment, a novel method for explicit credit assignment that accounts for coalitions of agents. Specifically, Shapley Value and its desired properties are leveraged in deep MARL to credit any combination of agents, which grants us the capability to estimate the individual credit of each agent. Despite this capability, the main technical difficulty lies in the computational complexity of Shapley Value, which grows factorially with the number of agents. We instead utilize an approximation via Monte Carlo sampling, which reduces the sample complexity while maintaining effectiveness. We evaluate our method on StarCraft II benchmarks across different scenarios. Our method outperforms existing cooperative MARL algorithms significantly and achieves the state-of-the-art, with especially large margins on the more difficult tasks.

1. Introduction

Multi-Agent Systems (MAS) have attracted substantial attention in many sequential decision problems in recent years, such as autonomous vehicle teams (Keviczky et al., 2007; Cao et al., 2012), robotics (Lillicrap et al., 2016; Ramchurn et al., 2010), scene graph generation (Chen et al., 2019), and network routing (Ye et al., 2015). Among the approaches, Multi-Agent Reinforcement Learning (MARL) has grown in popularity thanks to its ability to learn without knowing the world model. A classical way in MARL to solve cooperative games is to regard the entire MAS as a single agent and optimize a joint policy according to the joint observations and trajectories (Tan, 1993). With the joint action space growing exponentially with the number of agents and the constraints of partial observability, this classical method faces insurmountable obstacles. This promotes the Centralized Training with Decentralized Execution (CTDE) (Oliehoek et al., 2008; Kraemer and Banerjee, 2016) paradigm, where a central critic is set up to estimate the joint value function, and the agents are trained with global information but execute based only on their local observations and histories.

The main challenge that restricts effective CTDE in MARL is credit assignment, which attributes the global reward signals according to the contributions of each agent. Recent studies that attempt to solve this challenge can be roughly divided into two branches. 1) Implicit methods (Sunehag et al., 2018; Rashid et al., 2018; Son et al., 2019; Yang et al., 2020b): they treat the central critic and the local agents as an entirety during the training procedure. A decomposition function (usually a neural network) is first set up to map the joint value function to local value functions. The central critic is then learned simultaneously with the decomposition function and the policy. Implicit methods suffer from inadequate decomposition limited by the design of the decomposition function, and they also lack interpretability for the distributed credits (Heuillet et al., 2021). 2) Explicit methods (Foerster et al., 2018; Wang et al., 2020; Yang et al., 2020a): they train the central critic and the local actors separately. In each iteration, the critic is first updated, after which some strategy is leveraged to compute the reward or the value function of each agent explicitly. Such reward signals or value functions are then used to guide the training of the local agents. Although explicit methods overcome many shortcomings of their implicit counterparts, one has to algorithmically characterize each individual agent's contribution to the overall success, which can be very hard in the context of subtle coalitions under common goals.

We address this challenge by using a counterfactual method with Shapley Value. Shapley Value (Shapley, 1953) originates from cooperative game theory and is a gold standard for distributing benefits reasonably and fairly by accounting for the contribution of each participating player. By treating the agents in MARL as the players in a cooperative game, ideal credit assignment can be obtained by computing the marginal contributions that constitute Shapley Value. Inspired by this, Wang et al. (Wang et al., 2020) proposed SQDDPG, which utilizes Shapley Value in deterministic policy gradient (Silver et al., 2014; Lowe et al., 2017) to guide the learning of local agents. However, the performance of SQDDPG relies heavily on the designed framework for estimating the marginal contributions, and this framework is limited by the assumption that the actions of agents are taken sequentially, which is often unrealistic. These restrictions make SQDDPG perform unsatisfactorily on many tasks. To this end, we extend the explicit methods and propose a novel method that leverages Shapley Value to allocate credits to agents. We achieve this with a counterfactual method that estimates what would have happened without the participation of a set of agents: the contribution of a set of agents is quantified as the change of the central critic's value when their actions are set to a baseline. The changes of these contributions caused by an agent across different set unions are treated as marginal contributions, from which Shapley Value can be obtained. Finally, these unified values play the role of credits for the local policies and guide their training.

Nevertheless, the computational complexity of the original Shapley Value grows factorially as the number of players increases. In many contexts of interest, such as network games, distributed control, and computing economics, this number can be quite large, which makes Shapley Value intractable. To alleviate the computational burden, we approximate Shapley Value through Monte Carlo sampling, which maintains the majority of its desired properties. In our approach, the Shapley Value of each agent is computed over sampled subsets of collaborators, re-sampled at each time step. Our approach reduces the computational complexity to polynomial in the number of players without much loss of the effectiveness of Shapley Value.

Our main contributions can be summarized as follows:

  1. We leverage a counterfactual method with Shapley Value to address the problem of credit assignment in Multi-Agent Reinforcement Learning. The proposed Shapley Counterfactual Credits reasonably and fairly characterize the contributions of each local agent by fully considering their interactions.

  2. We adopt a Monte Carlo sampling-based method to approximate Shapley Value and decrease its computational complexity growth from factorial to polynomial, which makes our algorithm viable for large-scale, complicated tasks.

  3. Extensive experiments show that our proposed method outperforms existing cooperative MARL algorithms significantly and achieves state-of-the-art performance on StarCraft II benchmarks. The margin is especially large for more difficult tasks.

The rest of this paper is organized as follows. In Section 2, we briefly review related work. In Section 3, we introduce the preliminaries, including Dec-POMDPs, Shapley Value, and the explicit framework for MARL. The details of our proposed algorithm for credit assignment are introduced in Section 4. Experimental results and analyses are reported in Section 5. Finally, we conclude the paper and discuss future directions in Section 6.

2. Related Work

2.1. Implicit Credit Assignment

Most of the implicit methods follow the Individual-Global-Max (IGM) condition, which requires that the optimal joint action of the agents be equivalent to the collection of each local agent's optimal action. VDN (Sunehag et al., 2018) makes an additivity hypothesis to decompose the joint Q-function into the sum of individual Q-functions. QMIX (Rashid et al., 2018) gets rid of this assumption but adds a monotonicity restriction. LICA (Zhou et al., 2020) extends QMIX to an actor-critic formulation and proposes an adaptive entropy regularization. Weighted QMIX adopts a twin network and encourages underestimated actions to alleviate the risk of suboptimal results. QTRAN (Son et al., 2019) avoids the limitations of VDN and QMIX by introducing two regularization terms but has been shown to behave poorly in many situations. Qatten (Yang et al., 2020b) employs a multi-head attention mechanism to compute the weights for the local action-value functions and mixes them to approximate the global Q-value. All of these methods aim to learn a value decomposition from the total reward signals to the individual value functions, which suffers from several problems: (i) the performance of the model relies heavily on the decomposition function; (ii) the distributed credits lack interpretability; (iii) the joint policy has a high risk of falling into suboptimal results (Williams and Peng, 1991; Mnih et al., 2016; Ahmed et al., 2019).

2.2. Explicit Credit Assignment

Explicit methods attribute to each agent contributions that are at least provably locally optimal. The most representative method is COMA (Foerster et al., 2018), which utilizes a counterfactual advantage baseline to guide the learning of local policies. However, it treats each agent as an independent unit and overlooks the complex correlations among agents, so it becomes inefficient in complex situations. SQDDPG (Wang et al., 2020) proposes a network to estimate the marginal contribution, which is further used to approximate Shapley Value; Shapley Value is then used to guide the learning of local agents. However, such an estimation of the marginal contribution is unreliable in many situations because the network over-relies on the assumption that the agents take actions sequentially. QPD (Yang et al., 2020a) designs a multi-channel mixer critic and leverages integrated gradients to distribute credits along paths, which achieves state-of-the-art results on many tasks. Intuitively, mining the relations between the agents is essential for the policy gradient in cooperative games, but these correlations are complicated and often underestimated by the models. To this end, we propose a Shapley Counterfactual Critic for credit assignment in MARL. Thanks to Shapley Value, the relations between the agents are considered sufficiently without prior knowledge, which further promotes the learning of local agents. Different from SQDDPG (Wang et al., 2020), we compute the marginal contributions with a counterfactual method rather than by building a network, which is more stable and efficient in complicated situations.

2.3. Shapley Value and Approximate SV

Shapley Value (Shapley, 1953; Bilbao and Edelman, 2000; Meng, 2012; Sundararajan and Najmi, 2020) originates from cooperative game theory in the 1950s and assigns a unique distribution of the total benefits generated by the coalition of all players. Shapley Value satisfies the properties of efficiency, symmetry, nullity, linearity, and coherency. It is a unique and fair way to quantify the importance of each player in the overall cooperation and is widely used in economics. However, the computational complexity of Shapley Value grows factorially with respect to the number of participating players (Kumar et al., 2020). Thus, in order to reduce the computation, several recent studies approximate the exact Shapley Value (Fatima et al., 2008; Chen et al., 2018; Ghorbani and Zou, 2019; Wang et al., 2021) by sacrificing some of its properties. For example, Frye et al. (Frye et al., 2020) and Heskes et al. (Heskes et al., 2020) utilize causal knowledge to simplify its calculation, which breaks the axiom of symmetry. L-Shapley and C-Shapley (Chen et al., 2018) only consider the interactions among local and connected players, which slightly breaks the property of efficiency. DASP (Ancona et al., 2019) and Neuron Shapley (Ghorbani and Zou, 2020) adopt sampling methods to approximate Shapley Value, which also slightly breaks the properties of efficiency and symmetry.

3. Preliminaries

3.1. Dec-POMDPs

A fully cooperative multi-agent sequential decision-making task with $n$ agents can be modeled as a decentralised partially observable Markov decision process (Dec-POMDP) (Oliehoek and Amato, 2016; Bernstein et al., 2002; Busoniu et al., 2008; Gupta et al., 2017; Palmer et al., 2018). A Dec-POMDP is canonically formulated by the tuple $G = \langle S, A, P, r, \Omega, O, n, \gamma \rangle$.

In the process, $s \in S$ represents the true state of the environment. At each time step, each agent $i \in N \equiv \{1, \dots, n\}$ chooses an action $a_i \in A$ simultaneously, forming a joint action $\mathbf{a} \in A^n$. The joint action produces a state transition on the environment, which is described by the Markov transition probability function $P(s' \mid s, \mathbf{a}): S \times A^n \times S \to [0, 1]$. All of the agents share a same global reward function $r(s, \mathbf{a}): S \times A^n \to \mathbb{R}$, and $\gamma \in [0, 1)$ is the discount factor.

In the setting of partial observability, the observation of each agent is generated by an observation function $O(s, i): S \times N \to \Omega$. Each agent owns an action-observation history $\tau_i \in T \equiv (\Omega \times A)^*$, where $T$ denotes the set of action-observation sequences of arbitrary length. Conditioned on this history, each agent follows a stochastic policy $\pi_i(a_i \mid \tau_i): T \times A \to [0, 1]$. The common goal of all agents is to maximize the expected discounted return $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$.
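To keep the notation above concrete, the following is a minimal Python sketch (not part of the paper) of how the components of the Dec-POMDP tuple could be bundled together; all names are illustrative.

from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class DecPOMDP:
    """Container for the Dec-POMDP tuple described above (illustrative only)."""
    states: Sequence       # S: set of environment states
    actions: Sequence      # A: per-agent action set
    transition: Callable   # P(s' | s, joint_action): state transition probability
    reward: Callable       # r(s, joint_action): global reward shared by all agents
    observe: Callable      # O(s, agent_id): local observation of an agent
    n_agents: int          # n: number of agents
    gamma: float           # discount factor used in the expected return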

3.2. Shapley Value

Assume a coalition $N = \{1, 2, \dots, n\}$ consists of $n$ players who cooperate with each other to achieve a common goal. For a particular player $i$, let $S \subseteq N$ be a set that contains player $i$, and let $S \setminus \{i\}$ represent the same set with the absence of $i$. The marginal contribution of $i$ in $S$ is then defined as:

$\Delta_i(S) = v(S) - v(S \setminus \{i\})$,   (1)

where $v(\cdot)$ refers to the value function that estimates the cooperative contribution of a set of players.

The Shapley Value of player $i$ is then computed as the weighted average of its marginal contributions over all subsets of $N$ that contain $i$:

$Sh_i(v) = \sum_{S \subseteq N,\, i \in S} \frac{(|S| - 1)!\,(n - |S|)!}{n!}\, \Delta_i(S)$,   (2)

where $S$ denotes a set of size $|S|$ that contains player $i$; a short computational sketch is given after the property list below. Shapley Value satisfies the following properties:


  • Efficiency. The credit generated by the grand coalition equals the sum of the Shapley Values of all participating players: $v(N) = \sum_{i=1}^{n} Sh_i(v)$.

  • Symmetry. If $v(S \cup \{i\}) = v(S \cup \{j\})$ for all subsets $S \subseteq N \setminus \{i, j\}$, then $Sh_i(v) = Sh_j(v)$.

  • Nullity. If $v(S \cup \{i\}) = v(S)$ for all subsets $S \subseteq N \setminus \{i\}$, then $Sh_i(v) = 0$.

  • Linearity. Let $v$ and $w$ represent two associated gain functions; then $Sh_i(v + w) = Sh_i(v) + Sh_i(w)$ for every player $i$.

  • Coherency. When another value function $u$ is utilized to measure the marginal contributions, if $\Delta_i^{v}(S) \ge \Delta_i^{u}(S)$ for all subsets $S$, then $Sh_i(v) \ge Sh_i(u)$.
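As a concrete illustration of Equation (2), the following minimal Python sketch (not from the paper) enumerates every coalition containing a player and averages the marginal contributions with the combinatorial weights; value_fn stands for the value function $v(\cdot)$ and its interface is an assumption made for illustration.

from itertools import combinations
from math import factorial

def exact_shapley(value_fn, players):
    """Exact Shapley Value of Equation (2). value_fn maps a frozenset of players to v(S)."""
    n = len(players)
    shapley = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for k in range(n):  # k = |S| - 1, the size of the coalition without player i
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s_without_i = frozenset(subset)
                s_with_i = s_without_i | {i}
                total += weight * (value_fn(s_with_i) - value_fn(s_without_i))
        shapley[i] = total
    return shapley

# Toy usage: a 3-player game whose coalition value is the squared coalition size.
v = lambda s: len(s) ** 2
print(exact_shapley(v, [0, 1, 2]))  # symmetric players, so each credit is about 3.0 and they sum to v(N) = 9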

3.3. Explicit Framework for MARL

Explicit methods are interpretable with respect to the allocated credits, which reduces users' suspicion about the rationality of the learned local agents. For this reason, we extend the explicit methods (Foerster et al., 2018; Wang et al., 2020; Yang et al., 2020a), which first train the central critic according to the joint states and actions and then distribute the global reward signals according to the contributions of the local agents to the critic.

Following QPD (Yang et al., 2020a), we model our critic network with three components, as shown in Figure 1: the feature extraction module, the feature fusion module, and the Q-function estimation module. The first module consists of 2 dense layers with ReLU non-linearity and extracts features from a particular agent's observations and actions. The features of all agents are then concatenated and merged into a global feature. Finally, the joint Q-value is computed from the global feature. As Yang et al. (Yang et al., 2020a) illustrate, different agents may share the same attributes and can thus be categorized into different groups. For this reason, the agents within the same group are modeled using the same sub-network. Meanwhile, in order to simplify the network architecture and accelerate the learning procedure, the agents of the same group share the same parameters. We represent the central critic as:

$Q_{tot} = Q\big((o_1, a_1), (o_2, a_2), \dots, (o_n, a_n)\big)$,   (3)

where $o_i$ and $a_i$ denote the observation and the action of the $i$-th agent, respectively.

In our implementation, each local agent is realized with a recurrent deep Q-network, which is composed of a Long Short-Term Memory (LSTM) layer and a Multi-Layer Perceptron (MLP). We represent the local agent as:

$Q_i = Q_i(o_i^t, a_i^t; h_i^{t-1})$,   (4)

where $h_i^{t-1}$ is the hidden state of the LSTM. For the exploration policy, $\epsilon$-greedy is adopted, and the exploration rate of episode $e$ is:

$\epsilon_e = \max(\epsilon_0 - e \cdot \Delta\epsilon,\ 0)$,   (5)

where $\epsilon_0$ is the initial exploration rate and $\Delta\epsilon$ represents the decrement applied at each episode.
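For readers who prefer code, below is a minimal PyTorch sketch of the kind of architecture described above: group-shared two-layer extractors feeding a concatenated global feature and a joint Q head, plus an LSTM-based local agent. Layer widths follow the text (64 units); class names, argument names, and input shapes are our own assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class CentralCritic(nn.Module):
    """Group-shared feature extraction -> concatenation -> joint Q-value (Eq. 3)."""
    def __init__(self, group_input_dims, agents_per_group):
        super().__init__()
        # One shared extractor per agent group (parameter sharing within a group).
        self.extractors = nn.ModuleList(
            nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU())
            for d in group_input_dims
        )
        self.q_head = nn.Linear(64 * sum(agents_per_group), 1)  # joint Q from the fused feature

    def forward(self, group_inputs):
        # group_inputs[g]: (batch, n_agents_in_group, obs_dim + act_dim)
        feats = [self.extractors[g](x).flatten(1) for g, x in enumerate(group_inputs)]
        return self.q_head(torch.cat(feats, dim=-1))  # (batch, 1)

class LocalAgent(nn.Module):
    """Recurrent Q-network: LSTM over observations followed by an MLP head (Eq. 4)."""
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, 64, batch_first=True)
        self.head = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, n_actions))

    def forward(self, obs_seq, hidden=None):
        out, hidden = self.lstm(obs_seq, hidden)
        return self.head(out), hidden  # per-step Q-values and the new LSTM hidden state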

Figure 1. The framework of our method. We adopt a two-stage way that trains the central critic and the local policies separately. First, the central critic is updated with TD-loss. Then the credits of each agent are calculated by our proposed counterfactual method with approximate Shapley Value. Finally, the local policies are updated using the Shapley Counterfactual Credits.

4. Shapley Counterfactual Credits for MARL

The framework of our approach is illustrated in Figure 1. First, the central critic takes the actions and observations of each agent as input and approximates the total Q value. Then the contributions of the individual agents are distributed by the counterfactual method with Shapley Value. Finally, the local agents update their parameters according to the credits they earned.

In this section, we systematically describe our “Shapley Counterfactual Credits” for Multi-Agent Reinforcement Learning. In Section 4.1, we introduce a counterfactual method with Shapley Value to address the problem of “credit assignment”, which can fully mine the correlations among the local agents. To reduce the computational complexity, we replace the exact Shapley Value with an approximation, which is discussed in Section 4.2. The details of the proposed algorithm and the loss functions are introduced in Section 4.3.

4.1. Counterfactual Method with Shapley Value for Credit Assignment

The main challenge we need to address is how to measure the contribution of each agent; in other words, we need to quantify how the agents' actions influence the output of the central critic. COMA (Foerster et al., 2018) proposes a specialized critic and utilizes a counterfactual baseline, estimating the advantage of an action's value over the expected value as this influence, but it shows poor performance on many tasks. Wolpert and Tumer (Wolpert and Tumer, 2002) compute the influence using difference rewards, which compare the global reward to the reward received when the action of an agent is replaced with a default action. Inspired by these ideas, we also propose a counterfactual method in our central critic to measure the effect of the actions taken by the agents.

We consider the contribution of an action taken by an agent to be the answer to “how would the output change if this action were absent?” We formulate the contribution of the action performed by the $i$-th agent to the central critic as:

$\Phi_i = Q\big(\mathbf{o}, (a_1, \dots, a_i, \dots, a_n)\big) - Q\big(\mathbf{o}, (a_1, \dots, b_i, \dots, a_n)\big)$,   (6)

where $b_i$ denotes a baseline, meaning that the action $a_i$ is replaced by a default one.

However, such an estimation of the contributions is insufficient, since the agents cooperate with each other and cannot be treated as independent units. We therefore wish to quantify the credit earned by an agent precisely from the intricate relationships among agents, but the environment is complex, and there is no prior knowledge to indicate how the agents cooperate with each other. To this end, we propose to utilize Shapley Value for credit assignment, as elaborated below.

As we mentioned before, Shapley Value distributes the credits fairly by considering the contributions of the participating players and satisfies many good properties such as efficiency, additivity, and coherency. Thus, we utilize this tool to extend the counterfactual method.

For convenience, we rewrite Equation (6) in shorthand and generalize it from a single agent to a set of agents $C \subseteq N$:

$v(C) = Q(\mathbf{o}_N, \mathbf{a}_N) - Q\big(\mathbf{o}_N, (\mathbf{a}_{N \setminus C}, \mathbf{b}_C)\big)$,   (7)

where $N$ denotes the set of all agents, $(\mathbf{o}_C, \mathbf{a}_C)$ represents the observations and actions of the agents in $C$, and $\mathbf{b}_C$ denotes that the actions of all agents in $C$ are replaced with default actions.

To compute the Shapley Value of the $i$-th agent in the grand coalition, we need its marginal contributions in all subsets of the grand coalition $N$ in which this agent plays a role. We define the marginal contribution of the $i$-th agent in a subset $S$ of $N$ as:

$\Delta_i(S) = v(S) - v(S \setminus \{i\})$,   (8)

where $S \setminus \{i\}$ denotes $S$ with the removal of the $i$-th agent.

Given the marginal contributions, we compute the Shapley Counterfactual Credit as:

$Sh_i = \sum_{S \subseteq N,\, i \in S} \frac{(|S| - 1)!\,(n - |S|)!}{n!}\, \Delta_i(S)$,   (9)

where $S$ denotes a set of agents of size $|S|$ that contains the $i$-th agent.
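The following is a minimal sketch (illustrative, not the authors' code) of Equations (7) and (8): the coalition value is obtained by replacing the actions of a set of agents with a default action in the central critic's input, and the marginal contribution is the difference of two such coalition values. It assumes a critic(obs, actions) callable that returns the joint Q-value and a per-agent action tensor; a zero vector is used as the default action, matching the baseline choice described in the experimental setup.

import torch

def coalition_value(critic, obs, actions, coalition):
    """Equation (7): v(C) = Q(o, a) - Q(o, a with agents in C switched to the baseline action)."""
    counterfactual = actions.clone()   # actions: (batch, n_agents, action_dim)
    for i in coalition:
        counterfactual[:, i] = 0.0     # zero vector as the default/baseline action
    with torch.no_grad():
        return critic(obs, actions) - critic(obs, counterfactual)

def marginal_contribution(critic, obs, actions, subset, agent_i):
    """Equation (8): Delta_i(S) = v(S) - v(S without agent i)."""
    subset_without_i = [j for j in subset if j != agent_i]
    return (coalition_value(critic, obs, actions, subset)
            - coalition_value(critic, obs, actions, subset_without_i))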

4.2. Approximation of Shapley Value

However, the main drawback of Shapley Value is that its computational complexity grows factorially as the number of agents increases (Kumar et al., 2020), so recent studies usually use an approximation of Shapley Value as a substitute (Chen et al., 2018; Ghorbani and Zou, 2019; Wang et al., 2021). Since the number of agents may incur an unacceptable computational cost, approximating Shapley Value is necessary to alleviate the computational burden. We therefore adopt Monte Carlo sampling to obtain the approximate Shapley Value:

$\widehat{Sh}_i = \frac{1}{M} \sum_{k=1}^{M} \Delta_i(S_k)$,   (10)

where $M$ represents the number of Monte Carlo samples and $S_k$ represents a subset of $N$, sampled at the $k$-th draw, that contains the $i$-th agent.

With this approximation, the computational cost of the Shapley Value of an agent drops from a factorial dependence on $n$ to $O(M)$ evaluations, where $n$ is the number of agents, which may be very large in some situations, and $M$ is a hyperparameter representing the number of Monte Carlo samples, which can be a small positive integer. Note that such an approximation of Shapley Value might slightly break some of its properties, such as efficiency and symmetry. Recent literature sacrifices these properties to varying degrees in exchange for acceptable computational costs (Chen et al., 2018; Ghorbani and Zou, 2019; Wang et al., 2021). We deem such an approximation necessary and find that it does not bring much impact to the model's performance.
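Below is a minimal sketch of the Monte Carlo approximation in Equation (10), reusing the marginal_contribution helper sketched in Section 4.1. For each agent it draws $M$ random coalitions containing that agent (a uniformly random size followed by a uniformly random subset of that size, which mirrors the permutation-based Shapley weighting) and averages the counterfactual marginal contributions; the function and argument names are illustrative assumptions.

import random

def approx_shapley_credits(critic, obs, actions, n_agents, n_samples=5):
    """Approximate Shapley Counterfactual Credits, one per agent (Equation (10))."""
    credits = []
    for i in range(n_agents):
        others = [j for j in range(n_agents) if j != i]
        total = 0.0
        for _ in range(n_samples):
            size = random.randint(0, len(others))        # coalition size drawn uniformly
            subset = random.sample(others, size) + [i]   # random coalition S that contains agent i
            total = total + marginal_contribution(critic, obs, actions, subset, i)
        credits.append(total / n_samples)                # average of the sampled marginal contributions
    return credits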

4.3. Loss Function and Training Algorithm

We show the details of our algorithm in Algorithm 1. Our whole framework is updated in two stages. First, the local agents interact with the environment and take actions according to their observations and histories. These actions and observations then serve as the input of the central critic to estimate the joint Q-function. In the first stage, we update the central critic by minimizing the TD-loss:

$\mathcal{L}_{TD}(\theta) = \big(r + \gamma\, \widehat{Q}_{tot}(\mathbf{o}', \mathbf{a}'; \theta^{-}) - Q_{tot}(\mathbf{o}, \mathbf{a}; \theta)\big)^2$,   (11)

where $\theta$ denotes the parameters of the central critic, $Q_{tot}$ is the output of the central critic, and $\widehat{Q}_{tot}$ represents the output of the target network of the central critic (with parameters $\theta^{-}$).

In the second stage, we first obtain the Shapley Counterfactual Credit $\widehat{Sh}_i$ of each agent according to Equation (10). Each agent is then trained by minimizing the loss:

$\mathcal{L}_i(\theta_i) = \big(\widehat{Sh}_i - Q_i\big)^2$,   (12)

where $\theta_i$ denotes the parameters of the $i$-th local agent and $Q_i$ is the output of the $i$-th agent.
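The two losses above can be sketched as follows: a minimal illustration under assumptions about tensor shapes and the critic interface, not the authors' implementation. The critic is fit to a one-step TD target produced by its target network, and each local agent regresses the Q-value of its executed action toward its (detached) Shapley Counterfactual Credit; per Table 2, Adam would be used for the critic and RMSProp for the agents.

import torch
import torch.nn.functional as F

def critic_td_loss(critic, target_critic, obs, actions, rewards, next_obs, next_actions, gamma=0.99):
    """Equation (11): TD regression of the central critic toward the target network's estimate."""
    with torch.no_grad():
        td_target = rewards + gamma * target_critic(next_obs, next_actions)
    return F.mse_loss(critic(obs, actions), td_target)

def agent_credit_loss(q_i_taken, shapley_credit):
    """Equation (12): regress the local agent's Q-value for the taken action onto its credit."""
    return F.mse_loss(q_i_taken, shapley_credit.detach())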

Initialize: Central critic network $Q_{tot}(\cdot; \theta)$, target central critic network $\widehat{Q}_{tot}(\cdot; \theta^{-})$, local agents' networks $\{Q_i(\cdot; \theta_i)\}_{i=1}^{n}$

1:for each training episode $e$ do
2:      $s_0$ = initial state, $t$ = 0, $h_i^0$ = 0 for each agent $i$
3:     while $s_t$ is not terminal and $t < T$ do
4:         $t = t + 1$
5:         for each agent $i$ do
6:              Compute $Q_i(o_i^t, \cdot; h_i^{t-1})$ and update the hidden state $h_i^t$
7:              Sample $a_i^t$ from the $\epsilon$-greedy policy
8:         Execute the joint action $\mathbf{a}^t$
9:         Get reward $r_t$ and next state $s_{t+1}$
10:     Add episode to replay buffer
11:     Collate episodes in buffer into a single batch
12:     for each episode in batch do
13:         for $t = 1$ to $T$ do
14:              Compute the targets $y_t$ using the central target network
15:     Update central critic network with (11)
16:     Every $C$ episodes reset $\theta^{-} = \theta$
17:     for each episode in batch do
18:         for $t = 1$ to $T$ do
19:              Compute credits $\widehat{Sh}_i$ for each agent via (10)
20:     Update the local agents with (12)
Algorithm 1 Shapley Counterfactual Credits Algorithm for MARL
Figure 2. The mean win rates of our method compared with the other methods on different StarCraft II map scenarios: (a) 3m, (b) 8m, (c) 2s3z, (d) 1c3s5z, (e) 3s5z, (f) 3s5z_vs_3s6z. The shaded areas represent the standard deviation.

Map          | VDN     | QMIX    | QTRAN   | COMA    | QPD   | SQDDPG | OURS
3m           | 100/100 | 100/100 | 100/100 | 95/96   | 99/99 | 64/65  | 99/99
8m           | 100/100 | 100/100 | 100/100 | 100/100 | 95/95 | 92/90  | 98/97
2s3z         | 100/100 | 100/100 | 92/91   | 45/45   | 99/98 | 60/55  | 100/100
1c3s5z       | 88/85   | 95/90   | 40/41   | 15/15   | 77/72 | 2/2    | 61/60
3s5z         | 80/69   | 80/67   | 12/13   | 5/3     | 79/80 | 1/1    | 92/90
3s5z_vs_3s6z | 0/0     | 0/0     | 0/0     | 0/0     | 3/5   | 0/0    | 20/20
Table 1. Test win rates (%) of our method compared with the other methods; each cell reports the median and mean test win rates (median/mean).
Figure 3. Ablation study of the Shapley Counterfactual Credits on maps (a) 3m, (b) 8m, and (c) 2s3z.

5. Experiments

We focus on explicitly addressing the problem of credit assignment in cooperative MARL settings. We compare our proposed method with several baselines, including VDN (Sunehag et al., 2018), QMIX (Rashid et al., 2018), COMA (Foerster et al., 2018), QTRAN (Son et al., 2019), QPD (Yang et al., 2020a), and SQDDPG (Wang et al., 2020). The training configurations, experimental results, and analyses are described in detail in this section.

5.1. Experiment Settings

Environment

We perform extensive experiments on the StarCraft II (a real-time strategy game) micromanagement challenge, in which each allied unit is controlled by an agent acting on its local observations, while the opponent's army is controlled by the hand-coded built-in StarCraft II AI. Each unit in StarCraft has a rich set of complex micro-actions, which allows the learning of complex interactions between cooperating agents. The overall goal is to maximize the accumulated reward in each battle scenario. The environment produces rewards based on the hit-point damage dealt and the enemy units killed, and an additional bonus is given when the battle is won. At each time step, each agent can only receive the local observations within its field of view; likewise, an agent can only observe other agents that are alive and located within its sight range. All agents can only attack enemies within their shooting range, which is set to 6. The global state consists of the joint observations without the restriction of the sight range and is used by the central critic during the training procedure. All features are normalized by their maximum values before being sent to the neural network. The StarCraft Multi-Agent Challenge (SMAC) environment (Samvelyan et al., 2019) is used as the testbed, and the difficulty of the built-in game AI is set to the “very difficult” level.

Configurations

The central critic of our method is the same as in QPD (Yang et al., 2020a), consisting of the feature extraction layers, the feature fusion operation, and the Q-function estimation layers. First, the agents are grouped according to their attributes, and 2 dense layers are used to extract the features of their observations and actions. Each dense layer consists of 64 neurons per channel. To accelerate the learning procedure, we adopt the parameter sharing technique (Yang et al., 2018; Iqbal and Sha, 2019), where the agents within the same group share the parameters of the feature extraction layers. Then, we concatenate the features of all agents to fuse them into a global feature. Finally, for the Q-function estimation, we adopt another dense layer with one output neuron. In the procedure of computing Shapley Value, we adopt the Monte Carlo sampling method to sample 5 subsets for each agent at each time step. We set the counterfactual baseline in the central critic to a zero vector for convenience. We model the local agents with an LSTM layer and 2 fully connected layers. The dimension of the hidden state in the LSTM is set to 64, and the units of the two fully connected layers are set to 64 and $|A|$, respectively, where $|A|$ is the size of the action space. We set the discount rate $\gamma$ for the TD-loss to 0.99. The replay buffer stores the most recent 1000 trajectories. During training, we update the central critic with Adam and the local agent networks with RMSProp. We copy the parameters of the central critic to its target network every 200 training episodes. The full hyperparameters of our Shapley Counterfactual Credits are shown in Table 2. The map 3s5z_vs_3s6z is much harder than the other maps, and the allied forces have one unit fewer than the enemy; during training, the win rates remain 0 even when the returns are relatively high. For this reason, we set the number of training episodes on map 3s5z_vs_3s6z to 50000, while the others are set to 20000.

Settings Value
Batch size 32
Replay buffer size 1000
Training episodes 20000
Exploration episodes 1000
Start exploration rate 1
End exploration rate 0
TD-loss discount (γ) 0.99
Target central critic update interval 200 episodes
Evaluation interval 100 episodes
Evaluation battle number 100
Agent optimizer RMSProp
Central Critic optimizer Adam
Agent learning rate 0.005
Central critic learning rate 0.01
Dense units 64
LSTM hidden units 64
Baseline for Shapley Value 0 vector
Times for Monte Carlo Sampling 5
Table 2. Hyperparameters of Shapley Counterfactual Credit Algorithm

5.2. Results and Analysis

To demonstrate the efficiency of our proposed method, we perform experiments on 6 maps of StarCraft II (3m, 8m, 2s3z, 1c3s5z, 3s5z, 3s5z_vs_3s6z), including both homogeneous and heterogeneous scenarios. Figure 2 depicts the curves of mean win rates of our method compared to the baselines. The final results are reported in Table 1, which lists the median and mean test win rates of each method.

All of the methods show high performance on the three simple scenarios (3m, 8m, 2s3z), and our Shapley Counterfactual Credits algorithm is competitive with the state-of-the-art algorithms, achieving nearly 100% mean win rates. Both sides have 3 Marines in map 3m and 8 Marines in map 8m. As both sides consist of a single unit type in equal numbers, each agent only needs to focus on beating the enemies and avoid taking redundant actions. Concretely, in the replays of maps 3m and 8m, units learn to stand in a line or a semicircle in order to focus fire on the incoming enemies. Such a pattern is easy for models to learn, and agents hardly need to consider how to cooperate with their friendly forces. In map 2s3z, both sides have 2 Stalkers and 3 Zealots. Since Zealots counter Stalkers, the Stalkers need to hide behind their own Zealots. Such a small number of units does not pose much of a challenge for learning.

Our algorithm falls behind the other methods in map 1c3s5z, where both sides have 3 Stalkers, 5 Zealots, and a Colossus. The Colossus is more threatening and becomes the priority target, which reduces the difficulty of the game. Here, we divide the learned ability of an agent into personal ability and cooperative ability: for example, “kiting the enemy” and “attacking high-threat targets” belong to the former, while “moving to protect the allies” belongs to the latter. In this map, all of the agents need to learn the pattern of attacking the enemy's Colossus first, which makes other actions less important. Since Shapley Value focuses more on mining the correlations between agents, our method does not perform very well in this scenario.

Our algorithm shows obvious advantages in the two maps 3s5z and 3s5z_vs_3s6z, which are much more difficult than the others. In map 3s5z, both sides have 3 Stalkers and 5 Zealots, and we achieve a mean win rate of 90%. In this scenario, the Stalkers not only need to stand behind the allied Zealots but also need to learn to attack the enemy Stalkers with high priority; meanwhile, the allied Zealots need to protect the allied Stalkers as well as attack the nearest enemy Stalkers. In this complex situation, cooperation among agents is more important than before. Our counterfactual method with Shapley Value fully considers the correlations and interactions between units and distributes a moderate credit to the actions taken by each agent, and thus outperforms the baselines significantly. For instance, a “movement” of a Zealot may affect other friendly forces to varying degrees; we measure its contribution by considering how the result changes when different kinds of correlations are absent. In particular, in map 3s5z_vs_3s6z, where the ally has 3 Stalkers and 5 Zealots while the enemy has 3 Stalkers and 6 Zealots, all of the baseline methods except QPD obtain a mean win rate of zero. The reason for their poor performance is that cooperative behaviors such as “blocking”, rather than “kiting”, play more important roles in this setting: the Zealots need to attract firepower in order to protect the allied Stalkers, which is the only way to achieve the final victory. In this scenario, Shapley Value fully demonstrates its superiority. Our method achieves a mean win rate of 20% and reaches the state-of-the-art.

In conclusion, our proposed Shapley Counterfactual Credits algorithm shows its strength and beats all of the other methods in complicated scenarios where cooperation among agents plays an essential role, while remaining competitive with the state-of-the-art algorithms in scenarios that depend more on personal ability.

5.3. Ablation Study

To demonstrate the advantage that Shapley Value (Shapley, 1953) brings to the counterfactual method, we perform an ablation study on three maps (3m, 8m, 2s3z), whose difficulty increases sequentially. The results are shown in Figure 3. The blue curves show the results when credits are allocated by the counterfactual method without Shapley Value, and the red curves show the results when credits are distributed by the Shapley Counterfactual Credits. To balance performance and computational cost, we set the number of Monte Carlo samples for approximating Shapley Value to 5; the corresponding analysis is given in the next subsection.

In maps 3m and 8m, the units need to learn to stand in suitable positions and focus fire on the same enemy unit together, so the ability to cooperate is relatively important, and the use of Shapley Value brings an improvement in performance. In 2s3z, by contrast, the Stalkers need to “kite” the Zealots and the number of units is small, which means personal ability is more important, so our method loses its advantage in this scenario. It is worth mentioning that the use of Shapley Value makes learning more stable and significantly reduces the standard deviation (the shaded part in the figure) of the win rates. This is because Shapley Value considers a variety of combinations among agents and measures the contribution of an agent as a weighted average of the counterfactual results over these combinations.

5.4. The Choice of Sample Times for Shapley Approximation

We approximate the exact Shapley Value via Monte Carlo sampling. Concretely, at each time step, we randomly sample $M$ subsets for each agent $i$ and average the marginal contributions of $i$ in these subsets to obtain its approximate Shapley Value. However, a large $M$ still puts pressure on the computational cost, while a small $M$ leads to an inaccurate approximation. We perform extensive experiments to find a moderate value of this hyperparameter, and the results are depicted in Figure 4. We conclude that 4 samples are sufficient to reach an ideal result, but to make the performance more stable, we set $M$ to 5 in our experiments.

Figure 4. The mean win rates of the approximated Shapley Counterfactual Credits with different sample times in map 2s3z.

6. Conclusion and Future Work

In this paper, we investigate the problem of credit assignment in Multi-Agent Reinforcement Learning. We extend the methods of explicit credit assignment and leverage a counterfactual method to measure the contributions of local agents to the central critic. To fully describe the relationships among the cooperative agents, Shapley Value is utilized, with a Monte Carlo sampling variant that decreases its computational complexity from factorial to polynomial. Experiments on the StarCraft II micromanagement tasks show the superiority of our method, which reaches the state-of-the-art on various scenarios.

For future work, it would be interesting to investigate the causal knowledge among the cooperative agents. With such inferred knowledge, Shapley Value can be approximated more accurately and the credit assignment can be more precise. Our method can also be extended to scenarios with competitive settings, where variants of Shapley Value have been proved effective.

Acknowledgements.
This work was supported by the National Key Research and Development Project of China (No.2018AAA0101900), the National Natural Science Foundation of China (No. 61625107, U19B2043, 61976185, No. 62006207), Zhejiang Natural Science Foundation (LR19F020002), Key R & D Projects of the Ministry of Science and Technology (No. 2020YFC0832500), Zhejiang Innovation Foundation(2019R52002), the Fundamental Research Funds for the Central Universities and Zhejiang Province Natural Science Foundation (No. LQ21F020020), Baoxiang Wang is partially supported by AC01202101031 and AC01202108001 from AIRS.

References

  • Z. Ahmed, N. Le Roux, M. Norouzi, and D. Schuurmans (2019) Understanding the impact of entropy on policy optimization. In International Conference on Machine Learning, pp. 151–160. Cited by: §2.1.
  • M. Ancona, C. Oztireli, and M. Gross (2019) Explaining deep neural networks with a polynomial time algorithm for shapley value approximation. In International Conference on Machine Learning, pp. 272–281. Cited by: §2.3.
  • D. S. Bernstein, R. Givan, N. Immerman, and S. Zilberstein (2002) The complexity of decentralized control of markov decision processes. Mathematics of operations research 27 (4), pp. 819–840. Cited by: §3.1.
  • J. M. Bilbao and P. H. Edelman (2000) The shapley value on convex geometries. Discrete Applied Mathematics 103 (1-3), pp. 33–40. Cited by: §2.3.
  • L. Busoniu, R. Babuska, and B. De Schutter (2008) A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 38 (2), pp. 156–172. Cited by: §3.1.
  • Y. Cao, W. Yu, W. Ren, and G. Chen (2012) An overview of recent progress in the study of distributed multi-agent coordination. IEEE Transactions on Industrial informatics 9 (1), pp. 427–438. Cited by: §1.
  • J. Chen, L. Song, M. J. Wainwright, and M. I. Jordan (2018) L-shapley and c-shapley: efficient model interpretation for structured data. In International Conference on Learning Representations, Cited by: §2.3, §4.2, §4.2.
  • L. Chen, H. Zhang, J. Xiao, X. He, S. Pu, and S. Chang (2019) Counterfactual critic multi-agent training for scene graph generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4613–4623. Cited by: §1.
  • S. S. Fatima, M. Wooldridge, and N. R. Jennings (2008) A linear approximation method for the shapley value. Artificial Intelligence 172 (14), pp. 1673–1699. Cited by: §2.3.
  • J. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson (2018) Counterfactual multi-agent policy gradients. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32. Cited by: §1, §2.2, §3.3, §4.1, §5.
  • C. Frye, I. Feige, and C. Rowat (2020) Asymmetric shapley values: incorporating causal knowledge into model-agnostic explainability. Cited by: §2.3.
  • A. Ghorbani and J. Zou (2019) Data shapley: equitable valuation of data for machine learning. In International Conference on Machine Learning, pp. 2242–2251. Cited by: §2.3, §4.2, §4.2.
  • A. Ghorbani and J. Zou (2020) Neuron shapley: discovering the responsible neurons. arXiv preprint arXiv:2002.09815. Cited by: §2.3.
  • J. K. Gupta, M. Egorov, and M. Kochenderfer (2017) Cooperative multi-agent control using deep reinforcement learning. In International Conference on Autonomous Agents and Multiagent Systems, pp. 66–83. Cited by: §3.1.
  • T. Heskes, E. Sijben, I. G. Bucur, and T. Claassen (2020) Causal shapley values: exploiting causal knowledge to explain individual predictions of complex models. Cited by: §2.3.
  • A. Heuillet, F. Couthouis, and N. Díaz-Rodríguez (2021) Explainability in deep reinforcement learning. Knowledge-Based Systems 214, pp. 106685. Cited by: §1.
  • S. Iqbal and F. Sha (2019) Actor-attention-critic for multi-agent reinforcement learning. In International Conference on Machine Learning, pp. 2961–2970. Cited by: §5.1.
  • T. Keviczky, F. Borrelli, K. Fregene, D. Godbole, and G. J. Balas (2007) Decentralized receding horizon control and coordination of autonomous vehicle formations. IEEE Transactions on control systems technology 16 (1), pp. 19–33. Cited by: §1.
  • L. Kraemer and B. Banerjee (2016) Multi-agent reinforcement learning as a rehearsal for decentralized planning. Neurocomputing 190, pp. 82–94. Cited by: §1.
  • I. E. Kumar, S. Venkatasubramanian, C. Scheidegger, and S. Friedler (2020) Problems with shapley-value-based explanations as feature importance measures. In International Conference on Machine Learning, pp. 5491–5500. Cited by: §2.3, §4.2.
  • T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2016) Continuous control with deep reinforcement learning. International Conference on Learning Representations. Cited by: §1.
  • R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch (2017) Multi-agent actor-critic for mixed cooperative-competitive environments. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 6382–6393. Cited by: §1.
  • F. Meng (2012) The core and shapley function for games on augmenting systems with a coalition structure. International Journal of Mathematical and Computational Sciences 6 (8), pp. 813–818. Cited by: §2.3.
  • V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu (2016) Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pp. 1928–1937. Cited by: §2.1.
  • F. A. Oliehoek and C. Amato (2016) A concise introduction to decentralized pomdps. Springer. Cited by: §3.1.
  • F. A. Oliehoek, M. T. Spaan, and N. Vlassis (2008) Optimal and approximate q-value functions for decentralized pomdps. Journal of Artificial Intelligence Research 32, pp. 289–353. Cited by: §1.
  • G. Palmer, K. Tuyls, D. Bloembergen, and R. Savani (2018) Lenient multi-agent deep reinforcement learning. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pp. 443–451. Cited by: §3.1.
  • S. D. Ramchurn, A. Farinelli, K. S. Macarthur, and N. R. Jennings (2010) Decentralized coordination in robocup rescue. The Computer Journal 53 (9), pp. 1447–1461. Cited by: §1.
  • T. Rashid, M. Samvelyan, C. Schroeder, G. Farquhar, J. Foerster, and S. Whiteson (2018) Qmix: monotonic value function factorisation for deep multi-agent reinforcement learning. In International Conference on Machine Learning, pp. 4295–4304. Cited by: §1, §2.1, §5.
  • M. Samvelyan, T. Rashid, C. S. De Witt, G. Farquhar, N. Nardelli, T. G. Rudner, C. Hung, P. H. Torr, J. Foerster, and S. Whiteson (2019) The starcraft multi-agent challenge. Cited by: §5.1.
  • L. S. Shapley (1953) A value for n-person games. Contributions to the Theory of Games. Cited by: §1, §2.3, §5.3.
  • D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller (2014) Deterministic policy gradient algorithms. In International conference on machine learning, pp. 387–395. Cited by: §1.
  • K. Son, D. Kim, W. J. Kang, D. E. Hostallero, and Y. Yi (2019) Qtran: learning to factorize with transformation for cooperative multi-agent reinforcement learning. In International Conference on Machine Learning, pp. 5887–5896. Cited by: §1, §2.1, §5.
  • M. Sundararajan and A. Najmi (2020) The many shapley values for model explanation. In International Conference on Machine Learning, pp. 9269–9278. Cited by: §2.3.
  • P. Sunehag, G. Lever, A. Gruslys, W. M. Czarnecki, V. Zambaldi, M. Jaderberg, M. Lanctot, N. Sonnerat, J. Z. Leibo, K. Tuyls, et al. (2018) Value-decomposition networks for cooperative multi-agent learning based on team reward. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pp. 2085–2087. Cited by: §1, §2.1, §5.
  • M. Tan (1993) Multi-agent reinforcement learning: independent vs. cooperative agents. In Proceedings of the tenth international conference on machine learning, pp. 330–337. Cited by: §1.
  • J. Wang, Y. Zhang, T. Kim, and Y. Gu (2020) Shapley q-value: a local reward approach to solve global reward games. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 7285–7292. Cited by: §1, §2.2, §3.3, §5.
  • J. Wang, J. Wiens, and S. Lundberg (2021) Shapley flow: a graph-based approach to interpreting model predictions. Cited by: §2.3, §4.2, §4.2.
  • R. J. Williams and J. Peng (1991) Function optimization using connectionist reinforcement learning algorithms. Connection Science 3 (3), pp. 241–268. Cited by: §2.1.
  • D. H. Wolpert and K. Tumer (2002) Optimal payoff functions for members of collectives. In Modeling complexity in economic and social systems, pp. 355–369. Cited by: §4.1.
  • Y. Yang, J. Hao, G. Chen, H. Tang, Y. Chen, Y. Hu, C. Fan, and Z. Wei (2020a) Q-value path decomposition for deep multiagent reinforcement learning. In International Conference on Machine Learning, pp. 10706–10715. Cited by: §1, §2.2, §3.3, §3.3, §5.1, §5.
  • Y. Yang, J. Hao, B. Liao, K. Shao, G. Chen, W. Liu, and H. Tang (2020b) Qatten: a general framework for cooperative multiagent reinforcement learning. arXiv e-prints. Cited by: §1, §2.1.
  • Y. Yang, R. Luo, M. Li, M. Zhou, W. Zhang, and J. Wang (2018) Mean field multi-agent reinforcement learning. In International Conference on Machine Learning, pp. 5571–5580. Cited by: §5.1.
  • D. Ye, M. Zhang, and Y. Yang (2015) A multi-agent framework for packet routing in wireless sensor networks. sensors 15 (5), pp. 10026–10047. Cited by: §1.
  • M. Zhou, Z. Liu, P. Sui, Y. Li, and Y. Y. Chung (2020) Learning implicit credit assignment for multi-agent actor-critic. Cited by: §2.1.