Related Work
MARL has benefited from recent developments in deep reinforcement learning, with frameworks moving away from tabular methods [bu2008comprehensive] toward deep neural networks [foerster2018counterfactual]. Our work is related to recent advances in CTDE deep multi-agent reinforcement learning.

The degree of training centralization varies across the current MARL literature. Independent Q-learning (IQL) [tan1993multi] and its deep neural network counterpart [tampuu2017multiagent] simply train an independent Q-learning model for each agent. Methods that attempt to directly learn decentralized policies often suffer from the non-stationarity of the environment induced by agents learning and exploring simultaneously. [foerster2017stabilising, usunier2016episodic] attempt to stabilize learning under the decentralized training paradigm. [gupta2017cooperative] propose a training paradigm that alternates between centralized training with global rewards and decentralized training with reshaped rewards.
Centralized methods, by contrast, naturally avoid the non-stationarity problem at the cost of scalability. COMA [foerster2018counterfactual] takes advantage of CTDE: actors are updated by following policy gradients that are tailored to each agent's contribution to the system by a central critic. Multi-agent deep deterministic policy gradient (MADDPG) [lowe2017multi] extends deep deterministic policy gradient (DDPG) [lillicrap2015continuous] to mitigate the high-variance gradient estimates exacerbated in multi-agent settings. [wei2018multiagent], building on MADDPG, propose multi-agent soft Q-learning in continuous action spaces to tackle the issue of relative overgeneralization. Probabilistic recursive reasoning (PR2) [wen2019probabilistic] uses a probabilistic recursive reasoning policy gradient that enables agents to recursively reason about what others believe about their own beliefs.

More recently, value-based methods, which lie between the extremes of IQL and COMA, have shown great success in solving complex multi-agent problems. VDN [sunehag2017value], which represents the joint action-value function as a summation of local action-value functions, allows for centralized learning; however, it does not make use of extra state information. QMIX [rashid2018qmix] utilizes a non-negative mixing network to represent a broader class of value-decomposition functions; additional state information is captured by hypernetworks that output the parameters of the mixing network. QTRAN [son2019qtran] is a generalized factorization method that can be applied to environments free of structural constraints. Other works, such as CommNet [foerster2016learning], TarMAC [das2019tarmac], ATOC [jiang2018learning], MAAC [iqbal2019actor], CCOMA [su2020counterfactual], and BiCNet [peng2017multiagent], exploit inter-agent communication.
Similar to QMIX and VDN, VDAC applies value decomposition; however, it differs in that VDAC is a policy-based method that decomposes global state values, whereas QMIX and VDN, which decompose global action values, belong to the Q-learning family. [nguyen2018credit] address the credit-assignment issue, but under a different MARL setting. COMA, which is also a policy gradient method inspired by difference rewards and has been tested on StarCraft II micromanagement games, represents the body of literature most closely related to this paper.
Background
Decentralized Partially Observable Markov Decision Processes (Dec-POMDPs)
: Consider a fully cooperative multi-agent task with $n$ agents. Each agent, identified by $a \in A \equiv \{1, \dots, n\}$, takes an action $u^a \in U$ simultaneously at every timestep, forming a joint action $\mathbf{u} \in \mathbf{U} \equiv U^n$. The environment has a true state $s \in S$, a transition probability function $P(s' \mid s, \mathbf{u})$, and a global reward function $r(s, \mathbf{u})$. In the partial observation setting, each agent draws observations $z \in Z$ from the observation function $O(s, a)$. Each agent conditions a stochastic policy $\pi^a(u^a \mid \tau^a)$ on its observation-action history $\tau^a$. Throughout this paper, quantities in bold represent joint quantities over all agents, and bold quantities with the superscript $-a$ denote joint quantities over agents other than a given agent $a$. As in single-agent RL, MARL aims to maximize the discounted return $R_t = \sum_{l=0}^{\infty} \gamma^l r_{t+l}$. The joint value function $V^{\boldsymbol{\pi}}(s_t) = \mathbb{E}[R_t \mid s_t]$ is the expected return for following the joint policy $\boldsymbol{\pi}$ from state $s_t$. The action-value function $Q^{\boldsymbol{\pi}}(s_t, \mathbf{u}_t) = \mathbb{E}[R_t \mid s_t, \mathbf{u}_t]$ defines the expected return for selecting joint action $\mathbf{u}_t$ in state $s_t$ and following the joint policy thereafter.

Single-Agent Policy Gradient Algorithms: In the single-agent RL setting, policy gradient methods directly adjust the parameters $\theta$ of the policy in order to maximize the objective $J(\theta)$ by taking steps in the direction of $\nabla_\theta J(\theta)$. The gradient with respect to the policy parameters is
$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho^{\pi}, u \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(u \mid s)\, Q^{\pi}(s, u)\right] \quad (2)$$
where $\rho^{\pi}$ is the state distribution induced by following policy $\pi$, and $Q^{\pi}(s, u)$ is an action value. Policy gradient algorithms differ in how they evaluate $Q^{\pi}(s, u)$; e.g., the REINFORCE algorithm [williams1992simple] simply uses a sample return $R_t$.
To reduce the variance of gradient estimates, a baseline $b(s)$ is introduced. In actor-critic approaches [konda2000actor], an actor is trained by following gradients that depend on a critic. This yields the advantage function $A(s, u) = Q^{\pi}(s, u) - b(s)$, where $b(s)$ is the baseline ($V^{\pi}(s)$ or another constant is commonly used as the baseline). The TD error $r + \gamma V^{\pi}(s') - V^{\pi}(s)$, which is an unbiased estimate of $A(s, u)$, is a common choice of advantage function. In practice, a TD error that utilizes an n-step return yields good performance [mnih2016asynchronous].
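As a concrete illustration, the n-step TD advantage described above can be computed backwards over a finite rollout. This is a minimal sketch, not the paper's implementation; the function name and array shapes are our own, and the effective n shrinks toward the end of the rollout, where the target bootstraps from the last state's value:

```python
import numpy as np

def n_step_advantages(rewards, values, bootstrap_value, gamma=0.99):
    """A_t = (sum_{i<n} gamma^i r_{t+i} + gamma^n V(s_{t+n})) - V(s_t),
    computed backwards over a rollout of length T; n = T - t here."""
    T = len(rewards)
    advantages = np.zeros(T)
    target = bootstrap_value          # V of the state after the rollout
    for t in reversed(range(T)):
        target = rewards[t] + gamma * target   # n-step return for step t
        advantages[t] = target - values[t]     # subtract the baseline V(s_t)
    return advantages
```

The backward recursion avoids recomputing the discounted sum for every t.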
Multi-Agent Policy Gradient (MAPG) Algorithms: Multi-agent policy gradient methods are extensions of policy gradient algorithms with a joint policy $\boldsymbol{\pi}$. Compared with policy gradient methods in single-agent RL settings, MAPG faces the issues of high-variance gradient estimates [lowe2017multi] and credit assignment [foerster2018counterfactual]. Perhaps the simplest multi-agent gradient can be written as:
$$\nabla_\theta J = \mathbb{E}_{\boldsymbol{\pi}}\left[\sum_a \nabla_\theta \log \pi^a(u^a \mid \tau^a)\, Q^{\boldsymbol{\pi}}(s, \mathbf{u})\right] \quad (3)$$
Multi-agent policy gradients in the current literature often take advantage of CTDE by using a central critic to obtain extra state information, and avoid the vanilla multi-agent policy gradient (Equation 3) due to its high variance. For instance, [lowe2017multi] utilize a central critic to estimate $Q(s, \mathbf{u})$ and optimize the parameters of deterministic actors $\mu^a$ by following a multi-agent DDPG gradient, which is derived from Equation 3:
$$\nabla_\theta J = \mathbb{E}\left[\nabla_\theta \mu^a(\tau^a)\, \nabla_{u^a} Q(s, u^1, \dots, u^n)\big|_{u^a = \mu^a(\tau^a)}\right] \quad (4)$$
Unlike most actor-critic frameworks, [foerster2018counterfactual] claim to solve the credit assignment issue by applying the following counterfactual policy gradient:
$$\nabla_\theta J = \mathbb{E}_{\boldsymbol{\pi}}\left[\sum_a \nabla_\theta \log \pi^a(u^a \mid \tau^a)\, A^a(s, \mathbf{u})\right] \quad (5)$$
where $A^a(s, \mathbf{u}) = Q(s, \mathbf{u}) - \sum_{u'^a} \pi^a(u'^a \mid \tau^a)\, Q(s, (\mathbf{u}^{-a}, u'^a))$ is the counterfactual advantage for agent $a$. Note that [foerster2018counterfactual] argue that the COMA gradient provides agents with tailored gradients, thus achieving credit assignment. However, they also prove that the COMA gradient is an unbiased estimate of the vanilla multi-agent policy gradient, and that COMA is therefore a variance-reduction technique.
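The counterfactual advantage in Equation 5 can be sketched for a single agent with a discrete action space. This is a hypothetical helper, not COMA's implementation: it assumes the critic's action values for all of agent $a$'s actions, with the other agents' actions held fixed at $\mathbf{u}^{-a}$, are already available as a vector:

```python
import numpy as np

def coma_advantage(q_values, policy, taken_action):
    """A^a(s,u) = Q(s,(u^{-a}, u^a)) - sum_{u'} pi^a(u'|tau^a) Q(s,(u^{-a}, u')).
    q_values[k]: critic estimate for agent a taking action k, others fixed.
    policy[k]:   pi^a(k | tau^a)."""
    baseline = np.dot(policy, q_values)          # counterfactual baseline
    return q_values[taken_action] - baseline
```

The baseline marginalizes out only agent $a$'s action, which is what makes the resulting gradient agent-specific.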
Methods
In addition to the previously outlined research questions, our goal in this work is to derive RL algorithms under the following constraints: (1) the learned policies are conditioned on agents' local action-observation histories (the environment is modeled as a Dec-POMDP); (2) a model of the environment dynamics is unknown (i.e., the proposed framework is task-free and model-free); (3) communication is not allowed between agents (i.e., we do not assume a differentiable communication channel as in [das2019tarmac]); and (4) the framework should enable parameter sharing among agents (namely, we do not train a different model for each agent as is done in [tan1993multi]). A method that meets the above criteria would constitute a general-purpose multi-agent learning algorithm that could be applied to a range of cooperative environments, with or without communication between agents. Hence, the following methods are proposed.
Algorithm  Central Critic  Value Decomposition  Policy Gradients
IAC [foerster2018counterfactual]  No  -  TD advantage
VDAC-sum  No  Linear  TD advantage
VDAC-mix  Yes  Nonlinear  TD advantage
Naive Critic  Yes  -  TD advantage
COMA [foerster2018counterfactual]  Yes  -  COMA advantage
Naive Central Critic Method
A naive central critic (naive critic) is proposed to answer the first research question: is a simple policy gradient sufficient to optimize multi-agent actor-critics? As shown in Figure 0(a), the naive critic's central critic shares a similar structure with COMA's critic. It takes the global state $s$ as input and outputs $V(s)$. Actors follow a rather simple policy gradient, the TD advantage policy gradient common in the RL literature, which is given by:
$$\nabla_\theta J = \mathbb{E}_{\boldsymbol{\pi}}\left[\sum_a \nabla_\theta \log \pi^a(u^a \mid \tau^a)\, A(s, \mathbf{u})\right] \quad (6)$$
where $A(s, \mathbf{u}) = r + \gamma V(s') - V(s)$. In the next section, we demonstrate that policy gradients taking the form of Equation 6, under our proposed actor-critic frameworks, are also unbiased estimates of the vanilla multi-agent policy gradient. The pseudocode is listed in the Appendix.
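A minimal single-sample sketch of one update step under this gradient; the function name and the single-transition form are our own simplification, not the paper's pseudocode:

```python
def naive_critic_update(log_probs, v_s, v_next, reward, gamma=0.99):
    """One-transition sketch of the Equation 6 estimator.
    log_probs: list of log pi^a(u^a | tau^a), one entry per agent.
    The same TD advantage, from the shared central critic, scales every
    agent's score function -- no per-agent counterfactual is needed."""
    advantage = reward + gamma * v_next - v_s   # TD advantage A(s, u)
    policy_loss = -sum(log_probs) * advantage   # minimizing this ascends Eq. 6
    critic_loss = advantage ** 2                # regress V(s) toward the target
    return policy_loss, critic_loss
```

In practice the advantage would be treated as a constant (detached) in the policy loss so gradients flow only through the log-probabilities.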
Value-Decomposition Actor-Critic
Difference rewards enable each agent to learn from a shaped reward $D^a = r(s, \mathbf{u}) - r(s, (\mathbf{u}^{-a}, c^a))$, defined as the change in the global reward incurred by replacing the original action $u^a$ with a default action $c^a$. Any action taken by agent $a$ that improves $D^a$ also improves the global reward, since the second term in the difference reward does not depend on $u^a$. Therefore, the global reward is monotonically increasing with $D^a$. Inspired by difference rewards, we propose to decompose the global state value $V_{tot}(s)$ into local state values $V^a(s^a)$ such that the following relationship holds:
$$\frac{\partial V_{tot}(s)}{\partial V^a(s^a)} \geq 0, \quad \forall a \in A \quad (7)$$
With Equation 7 enforced, given that the other agents stay at their local states by taking $\mathbf{u}^{-a}$, any action that leads agent $a$ to a local state with a higher value $V^a(s^a)$ will also improve the global state value $V_{tot}(s)$.
Two variants of value decomposition that satisfy Equation 7, VDAC-sum and VDAC-mix, are studied.
VDAC-sum
VDAC-sum simply assumes that the total state value $V_{tot}(s)$ is a summation of local state values $V^a(s^a)$:
$$V_{tot}(s) = \sum_{a=1}^{n} V^a(s^a) \quad (8)$$
This linear representation is sufficient to satisfy Equation 7. VDAC-sum's structure is shown in Figure 0(b). Note that each actor outputs both the local state value $V^a(s^a)$ and the policy $\pi^a(u^a \mid \tau^a)$. This is done by sharing the non-output layers between the distributed critics and the actors. In this paper, $\theta_v$ denotes the distributed critics' parameters and $\theta_\pi$ denotes the actors' parameters, for generality. The distributed critic is optimized by mini-batch gradient descent to minimize the following loss:
$$L_t(\theta_v) = \Big(y_t - V_{tot}(s_t)\Big)^2, \qquad y_t = \sum_{i=0}^{k-1} \gamma^i r_{t+i} + \gamma^k V_{tot}(s_{t+k}) \quad (9)$$
where $y_t$ is bootstrapped from the value of the last state and $k$ is upper-bounded by the rollout length $t_{max}$.
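A minimal numpy sketch of Equations 8 and 9 over one rollout; the function name and array shapes are assumptions for illustration, not the paper's code:

```python
import numpy as np

def vdac_sum_loss(local_values, rewards, bootstrap_value, gamma=0.99):
    """Critic loss for VDAC-sum on one rollout.
    local_values: shape (T, n_agents); V_tot(s_t) is the sum over agents.
    The target y_t bootstraps from the value of the state after the rollout."""
    v_tot = local_values.sum(axis=1)        # Equation 8: V_tot = sum_a V^a
    T = len(rewards)
    targets = np.zeros(T)
    target = bootstrap_value
    for t in reversed(range(T)):            # n-step targets, computed backwards
        target = rewards[t] + gamma * target
        targets[t] = target
    return np.mean((targets - v_tot) ** 2)  # Equation 9, averaged over the rollout
```

Because the sum is differentiable, gradients of this loss distribute to every agent's local value head.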
The policy network is trained by following the policy gradient:
$$\nabla_{\theta_\pi} J = \mathbb{E}_{\boldsymbol{\pi}}\Big[\sum_a \nabla_{\theta_\pi} \log \pi^a(u^a \mid \tau^a)\, A(s, \mathbf{u})\Big] \quad (10)$$
where $A(s, \mathbf{u}) = r + \gamma V_{tot}(s') - V_{tot}(s)$ is a simple TD advantage.
Similar to independent actor-critic (IAC), VDAC-sum does not make full use of CTDE in that it does not incorporate extra state information during training. Furthermore, it can only represent a limited class of centralized state-value functions.
VDAC-mix
To generalize the representation to a larger class of monotonic functions, we utilize a feedforward neural network that takes the local state values $V^a(s^a)$ as input and outputs the global state value $V_{tot}(s)$. To enforce Equation 7, the weights (not including biases) of the network are restricted to be non-negative. This allows the network to approximate any monotonic function arbitrarily well [dugas2009incorporating].

The weights of the mixing network are produced by separate hypernetworks [ha2016hypernetworks]. Following the practice in QMIX [rashid2018qmix], each hypernetwork takes the global state $s$ as input and generates the weights of one layer of the mixing network. Each hypernetwork consists of a single linear layer. An absolute-value activation function is utilized in the hypernetworks to ensure that the output weights are non-negative. The biases are not restricted to be non-negative; hence, the hypernetworks that produce the biases do not apply the absolute-value function. The final bias is produced by a 2-layer hypernetwork with a ReLU activation function following the first layer. Finally, the hypernetwork outputs are reshaped into matrices of appropriate size. Figure 0(c) illustrates the mixing network and the hypernetworks.

The whole mixing network structure (including the hypernetworks) can be seen as a central critic. Unlike the critics in [foerster2018counterfactual], this critic takes local state values as additional inputs besides the global state $s$. Similar to VDAC-sum, the distributed critics are optimized by minimizing the following loss:
$$L_t(\theta_v) = \Big(y_t - f\big(V^1(s^1), \dots, V^n(s^n); s\big)\Big)^2 \quad (11)$$
where $f$ denotes the mixing network. Let $\theta_h$ denote the parameters of the hypernetworks. The central critic is optimized by minimizing the same loss:
$$L_t(\theta_h) = \Big(y_t - f\big(V^1(s^1), \dots, V^n(s^n); s\big)\Big)^2 \quad (12)$$
The policy network is updated by following the same policy gradient as in Equation 6. The pseudocode is provided in the Appendix.
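To make the monotonicity constraint concrete, the following numpy sketch mixes local state values with state-conditioned non-negative weights. Random matrices stand in for the learned hypernetwork parameters, biases are omitted, and ReLU replaces the paper's ELU for brevity (both are monotone, so Equation 7 still holds); all shapes are our own assumptions:

```python
import numpy as np

def mix(local_values, state, hidden=8, seed=0):
    """Monotonic mixing sketch: abs() on state-conditioned weights
    enforces dV_tot / dV^a >= 0 for every agent (Equation 7)."""
    n_agents, state_dim = len(local_values), len(state)
    rng = np.random.default_rng(seed)          # stands in for learned parameters
    # hypernetworks: single linear layers mapping state -> mixing weights
    Hw1 = rng.standard_normal((state_dim, n_agents * hidden))
    Hw2 = rng.standard_normal((state_dim, hidden))
    w1 = np.abs(state @ Hw1).reshape(n_agents, hidden)  # non-negative weights
    w2 = np.abs(state @ Hw2)
    h = np.maximum(local_values @ w1, 0.0)     # hidden layer of the mixing net
    return float(h @ w2)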
Convergence of VDAC frameworks
[foerster2018counterfactual] establish the convergence of COMA based on the convergence proof of single-agent actor-critic algorithms [konda2000actor, sutton2000policy]. In the same manner, we utilize the following lemma to substantiate the convergence of VDACs to a locally optimal policy.
Lemma 1: For a VDAC algorithm with a compatible TD(1) critic following a policy gradient $g_k$ at each iteration $k$, $\liminf_k \|\nabla_\theta J\| = 0$ with probability 1.
Proof: The VDAC gradient is given by:
$$g = \mathbb{E}_{\boldsymbol{\pi}}\Big[\sum_a \nabla_\theta \log \pi^a(u^a \mid \tau^a)\,\big(r + \gamma V_{tot}(s') - V_{tot}(s)\big)\Big] \quad (13)$$
Similar to [foerster2018counterfactual], we first consider the expected contribution of the baseline $V_{tot}(s)$:
$$g_b = -\mathbb{E}_{\boldsymbol{\pi}}\Big[\sum_a \nabla_\theta \log \pi^a(u^a \mid \tau^a)\, V_{tot}(s)\Big] \quad (14)$$
where the expectation is with respect to the state-action distribution induced by the joint policy $\boldsymbol{\pi}$. Writing the joint policy as a product of independent actors:
$$g_b = -\mathbb{E}_{s}\Big[\sum_a \sum_{u^a} \pi^a(u^a \mid \tau^a)\, \nabla_\theta \log \pi^a(u^a \mid \tau^a)\, V_{tot}(s)\Big] \quad (15)$$
The total value $V_{tot}(s)$ does not depend on agent actions and is given by:
$$V_{tot}(s) = f\big(V^1(s^1), \dots, V^n(s^n); s\big) \quad (16)$$
where $f$ is a non-negative-weight (monotonic) function. This yields a single-agent actor-critic baseline:
$$g_b = -\mathbb{E}_{\boldsymbol{\pi}}\big[\nabla_\theta \log \boldsymbol{\pi}(\mathbf{u} \mid s)\, V_{tot}(s)\big] \quad (17)$$
Now let $d^{\boldsymbol{\pi}}(s)$ be the discounted ergodic state distribution as defined by [sutton2000policy]:
$$d^{\boldsymbol{\pi}}(s) = \sum_{t=0}^{\infty} \gamma^t \Pr(s_t = s \mid \boldsymbol{\pi}) \quad (18)$$
The remainder of the gradient is given by:
$$g_r = \sum_s d^{\boldsymbol{\pi}}(s) \sum_{\mathbf{u}} \nabla_\theta \boldsymbol{\pi}(\mathbf{u} \mid s)\, \big(r + \gamma V_{tot}(s')\big) \quad (19)$$
which, combined with the baseline, yields a standard single-agent actor-critic policy gradient:
$$g = \mathbb{E}_{\boldsymbol{\pi}}\big[\nabla_\theta \log \boldsymbol{\pi}(\mathbf{u} \mid s)\, \big(r + \gamma V_{tot}(s') - V_{tot}(s)\big)\big] \quad (20)$$
[konda2000actor] establish that an actor-critic following this gradient converges to a local maximum of the expected return $J$, subject to the assumptions included in their paper.
In the naive critic framework, $V(s)$ is evaluated by the central critic and does not depend on agent actions. Hence, by following the same steps as above, we can show that the expected contribution of the naive critic baseline takes the same single-agent form, which proves that the naive critic also converges to a locally optimal policy.
Experiments
In this section, we benchmark VDACs against the baseline algorithms listed in Table LABEL:table:_tj on a standardized decentralised StarCraft II micromanagement environment, SMAC [samvelyan2019starcraft]. SMAC consists of a set of StarCraft II micromanagement games that aim to evaluate how well independent agents are able to cooperate to solve complex tasks. In each scenario, algorithm-controlled ally units fight against enemy units controlled by the built-in game AI. An episode terminates when all units of either army have died or when the episode reaches the predefined time limit. A game is counted as a win only if all enemy units are eliminated. The goal is to maximize the win rate, i.e., the ratio of games won to games played.
The action space of agents consists of the following set of discrete actions: move[direction], attack[enemy id], stop, and no operation. Agents can only move in four directions: north, south, east, or west. A unit is allowed to perform the attack[enemy id] action only if the enemy is within its shooting range.
Each unit has a sight range that limits its ability to receive any information from outside that range. The sight range, which is bigger than the shooting range, makes the environment partially observable from the standpoint of each agent. Agents can only observe other agents if they are both alive and located within the sight range. The global state, which is only available to agents during centralised training, encapsulates information about all units on the map.
Note that all algorithms have access to the same partial observations and global state in our implementation (code available at https://github.com/hahayonghuming/VDACs). We consider the following maps in our experiments: 2s_vs_1sc, 2s3z, 3s5z, 1c3s5z, 8m, and bane_vs_bane. The detailed configuration of each map can be found in Table LABEL:table:_sc_desc in the Appendix.
Observation features and state features are consistent across all algorithms. All algorithms are trained under the A2C framework, where multiple episodes are rolled out independently in parallel during training. Refer to the Appendix for training details and hyperparameters.
Ablations
We perform the following ablations to answer the corresponding research questions:

Ablation 1: Is the TD advantage gradient sufficient to optimize multi-agent actor-critics? The comparison between the naive critic and COMA demonstrates the effectiveness of TD advantage policy gradients, because the only significant difference between the two methods is that the naive critic follows a TD advantage policy gradient whereas COMA follows the COMA gradient (Equation 5).

Ablation 2: Does applying state-value factorization improve the performance of actor-critic methods? VDAC-sum and IAC, neither of which has access to extra state information, share an identical structure. The only difference is that VDAC-sum applies a simple state-value factorization in which the global state value is a summation of local state values. The comparison between VDAC-sum and IAC reveals the necessity of applying state-value factorization.

Ablation 3: Compared with QMIX, does VDAC provide a reasonable trade-off between training efficiency and algorithm performance? We train VDAC and QMIX under the A2C training paradigm, which is adopted to promote training efficiency, and compare their performance.

Ablation 4: Which factors contribute to the performance of the proposed VDAC? We investigate the necessity of nonlinear value decomposition by removing the nonlinear activation function in the mixing network. The resulting algorithm is called VDAC-mix (linear), and it can be seen as a VDAC-sum with access to extra state information.
Overall Results
Overall results: win rates on a range of StarCraft II minigames. The black dashed line represents the heuristic AI's performance.
As suggested in [samvelyan2019starcraft], our main evaluation metric is the median win percentage of evaluation episodes as a function of environment steps observed over the course of training. Specifically, the performance of an algorithm is estimated by periodically running a fixed number of evaluation episodes (in our implementation, 32) during the course of training, with any exploratory behaviours disabled. The median performance as well as the 25-75% percentiles are obtained by repeating each experiment with 5 independent training runs. Figure 1 shows the comparison among actor-critics across 6 different maps.

In all scenarios, IAC fails to learn a policy that consistently defeats the enemy. In addition, its performance across training steps is highly unstable due to the non-stationarity of the environment and its lack of access to extra state information.
Noticeably, VDAC-mix consistently achieves the best performance across all tasks. On easy games (e.g., 8m), all algorithms generally perform well; this is because a simple strategy of attacking the nearest enemies, which is what the heuristic AI outputs, is sufficient to win. In harder games such as 3s5z and 2s3z, only VDAC-mix can match or outperform the heuristic AI.
It is worth noting that VDAC-sum, which cannot access extra state information, matches the naive critic's performance on most maps.
Ablation Results
Ablation 1
Consistent with [lowe2017multi], the comparison between the naive critic and IAC demonstrates the importance of incorporating extra state information, which is also revealed by the comparison between COMA and IAC (refer to Figure 1 for comparisons between the naive critic and COMA across different maps). As shown in Figure 2, the naive critic outperforms COMA across all tasks. This reveals that it is also viable to use a TD advantage policy gradient in multi-agent settings. In addition, COMA's training is unstable, as can be seen in Figures 1(a) and 1(b), which may arise due to its inability to predict accurate counterfactual action values for untaken actions.
Ablation 2
Despite the similarity in structure between VDAC-sum and IAC, VDAC-sum's median win rate at 2 million training steps exceeds IAC's consistently across all maps (refer to Figure 1 for comparisons between VDAC-sum and IAC across 6 different maps). This reveals that, by using a simple relationship to enforce Equation 7, we can drastically improve a multi-agent actor-critic's performance. Furthermore, VDAC-sum matches the naive critic on many tasks, as shown in Figure 1(c), demonstrating that actors trained without extra state information can achieve performance similar to the naive critic by simply enforcing Equation 7. In addition, it is noticeable that, compared with the naive critic, VDAC-sum's performance is more stable across training.
Ablation 3
Figures 2(a) and 2(b) show that, under the A2C training paradigm, VDAC-mix outperforms or matches QMIX on maps 2s_vs_1sc and 3s5z (refer to Figure 4 in the Appendix for comparisons between VDACs and QMIX over all maps). In easier games, QMIX's performance is comparable to VDAC-mix's. In harder games such as 2s_vs_1sc and 3s5z, VDAC-mix's median test win rates at 2 million training steps exceed QMIX's by 38% and 71%, respectively. Furthermore, QMIX's performance can be noticeably unstable across training steps on some maps, as shown in Figure 2(a).
Ablation 4
Finally, we introduce VDAC-mix (linear), which can be seen as a more general VDAC-sum that has access to extra state information. Consistent with our previous conclusion, the comparison between VDAC-mix (linear) and VDAC-sum shows that it is important to incorporate extra state information. In addition, the comparison between VDAC-mix and VDAC-mix (linear) shows the necessity of assuming a nonlinear relationship between the global state value and the local state values. Refer to Figure 5 in the Appendix for comparisons among the VDAC variants across all maps.
Conclusion
In this paper, we propose new credit-assignment actor-critic frameworks based on our observations about difference rewards, which imply a monotonic relationship between the global reward and the reshaped agent rewards. Theoretically, we establish the convergence of the proposed actor-critics to a locally optimal policy. Empirically, benchmark tests on StarCraft II micromanagement games demonstrate that our proposed actor-critics bridge the performance gap between multi-agent actor-critics and Q-learning, and that our methods provide a balanced trade-off between training efficiency and performance. Furthermore, we identify a set of key factors that contribute to the performance of the proposed algorithms via a set of ablation experiments. In the future, we aim to apply our framework to real-world applications such as highway on-ramp merging of semi- or fully self-driving vehicles.
References
Appendix A Appendix
SMAC
In this paper, we use all of the default settings in SMAC: the game difficulty is set to level 7 (very difficult), and the shooting range, sight range, etc., are consistent with the default settings. The observation vector also follows the default implementation in [samvelyan2019starcraft]: it contains the following attributes for both allied and enemy units within the sight range: distance, relative x, relative y, health, shield, and unit type. In addition, the observation vector includes the last actions of allied units that are in the field of view. Lastly, the terrain features surrounding agents within the sight range, in particular the values of eight points at a fixed radius indicating height and walkability, are also included. The state vector includes the coordinates of all agents relative to the center of the map, together with the units' observation feature vectors. Additionally, the energy of Medivacs and the cooldown of the rest of the allied units are stored in the state vector. Finally, the last actions of all agents are attached to the state vector.

Map Name  Ally Units  Enemy Units

2s_vs_1sc  2 Stalkers  1 Spine Crawler 
8m  8 Marines  8 Marines 
2s3z  2 Stalkers & 3 Zealots  2 Stalkers & 3 Zealots 
3s5z  3 Stalkers & 5 Zealots  3 Stalkers & 5 Zealots 
1c3s5z  1 Colossus, 3 Stalkers & 5 Zealots  1 Colossus, 3 Stalkers & 5 Zealots 
bane_vs_bane  20 Zerglings & 4 Banelings  20 Zerglings & 4 Banelings 
Training Details and Hyperparameters
The agent networks of all algorithms resemble a DRQN [hausknecht2015deep] with a recurrent layer comprised of a GRU [chung2014empirical] with a 64-dimensional hidden state, with a fully-connected layer before and after. The exceptions are that the IAC, VDAC-sum, and VDAC-mix agent networks contain an additional layer to output local state values, and that their policy networks output a stochastic policy rather than action values.
Algorithms are trained with RMSprop. During training, several games are initiated independently, from which episodes are sampled. The Q-learning replay buffer stores the latest 5000 episodes for each independent game. Target networks (where applicable) are updated periodically during training.

The architecture of the COMA critic is a feedforward fully-connected neural network with several equally sized hidden layers followed by a final output layer. The naive central critic shares the same architecture as the COMA critic, except for the width of its final layer.
The mixing networks in QMIX and VDAC-mix share an identical structure: a single hidden layer of 32 units whose parameters are output by hypernetworks, followed by an ELU activation function. The hypernetworks consist of feedforward networks with a single hidden layer of 64 units and a ReLU activation function.
For the naive central critic, IAC, and VDACs, the target $y_t$ is given by:
$$y_t = \sum_{i=0}^{k-1} \gamma^i r_{t+i} + \gamma^k V(s_{t+k}) \quad (21)$$
where $k$ can vary from state to state and is upper-bounded by the rollout length $t_{max}$.