MARL has benefited from recent developments in deep reinforcement learning, with frameworks moving away from tabular methods [bu2008comprehensive] toward deep neural networks [foerster2018counterfactual]. Our work is related to recent advances in CTDE deep multi-agent reinforcement learning.
The degree of training centralization varies in the current MARL literature. Independent Q-learning (IQL) [tan1993multi] and its deep neural network counterpart [tampuu2017multiagent] simply train an independent Q-learning model for each agent. Methods that attempt to directly learn decentralized policies often suffer from the non-stationarity of the environment induced by agents learning and exploring simultaneously. [foerster2017stabilising, usunier2016episodic] attempt to stabilize learning under the decentralized training paradigm, and [gupta2017cooperative] propose a training paradigm that alternates between centralized training with global rewards and decentralized training with reshaped rewards.
Centralized methods, by contrast, naturally avoid the non-stationarity problem at the cost of scalability. COMA [foerster2018counterfactual] takes advantage of CTDE: a central critic tailors each actor's policy gradient to that actor's contribution to the system. Multi-agent deep deterministic policy gradient (MADDPG) [lowe2017multi] extends deep deterministic policy gradient (DDPG) [lillicrap2015continuous] to mitigate the high-variance gradient estimates exacerbated in multi-agent settings. Building on MADDPG, [wei2018multiagent] propose multi-agent soft Q-learning in continuous action spaces to tackle relative overgeneralization. Probabilistic recursive reasoning (PR2) [wen2019probabilistic] uses a probabilistic recursive reasoning policy gradient that enables agents to recursively reason about what others believe about their own beliefs.
More recently, value-based methods, which lie between the extremes of IQL and COMA, have shown great success in solving complex multi-agent problems. VDN [sunehag2017value], which represents the joint action-value function as a summation of local action-value functions, allows for centralized learning; however, it does not make use of extra state information. QMIX [rashid2018qmix] utilizes a non-negative mixing network to represent a broader class of value-decomposition functions, and captures additional state information through hypernetworks that output the parameters of the mixing network. QTRAN [son2019qtran] is a generalized factorization method that can be applied to environments free from structural constraints. Other works, such as CommNet [foerster2016learning], TarMAC [das2019tarmac], ATOC [jiang2018learning], MAAC [iqbal2019actor], CCOMA [su2020counterfactual], and BiCNet [peng2017multiagent], exploit inter-agent communication.
Similar to QMIX and VDN, VDAC applies value decomposition; however, VDAC is a policy-based method that decomposes global state values, whereas QMIX and VDN, which decompose global action values, belong to the Q-learning family. [nguyen2018credit] address the credit-assignment issue, however, under a different MARL setting, Dec-POMDP. COMA, a policy gradient method that is inspired by difference rewards and has also been tested on StarCraft II micromanagement games, represents the body of literature most closely related to this paper.
Decentralized Partially Observable Markov Decision Processes (Dec-POMDPs)
Consider a fully cooperative multi-agent task with $n$ agents. Each agent $a \in A \equiv \{1, \dots, n\}$ takes an action $u^a \in U$ simultaneously at every timestep, forming a joint action $\mathbf{u} \in \mathbf{U} \equiv U^n$. The environment has a true state $s \in S$, a transition probability function $P(s' \mid s, \mathbf{u})$, and a global reward function $r(s, \mathbf{u})$. In the partial observation setting, each agent draws an observation $o^a \in Z$ from the observation function $O(s, a)$. Each agent conditions a stochastic policy $\pi^a(u^a \mid \tau^a)$ on its observation-action history $\tau^a \in T \equiv (Z \times U)^*$. Throughout this paper, quantities in bold represent joint quantities over agents, and bold quantities with the superscript $-a$ denote joint quantities over agents other than a given agent $a$. Similar to single-agent RL, MARL aims to maximize the discounted return $R_t = \sum_{l=0}^{\infty} \gamma^l r_{t+l}$. The joint value function $V^{\boldsymbol{\pi}}(s_t) = \mathbb{E}[R_t \mid s_t]$ is the expected return for following the joint policy $\boldsymbol{\pi}$ from state $s_t$. The action-value function $Q^{\boldsymbol{\pi}}(s_t, \mathbf{u}_t) = \mathbb{E}[R_t \mid s_t, \mathbf{u}_t]$ defines the expected return for selecting joint action $\mathbf{u}_t$ in state $s_t$ and following the joint policy $\boldsymbol{\pi}$ thereafter.
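As a concrete illustration of the discounted return defined above, the following sketch (an illustrative helper with hypothetical names, not part of the paper's implementation) computes the return of a finite episode:

```python
def discounted_return(rewards, gamma=0.99):
    """R_0 = sum over l of gamma^l * r_l for a finite reward sequence."""
    return sum((gamma ** l) * r for l, r in enumerate(rewards))
```

For instance, with `gamma = 0.5`, the reward sequence `[1.0, 1.0]` yields a return of 1.0 + 0.5 * 1.0 = 1.5.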
Single-Agent Policy Gradient Algorithms: In the single-agent RL setting, policy gradient methods directly adjust the parameters $\theta$ of the policy $\pi_\theta$ in order to maximize the objective $J(\theta) = \mathbb{E}_{s \sim \rho^\pi, u \sim \pi_\theta}[R_0]$ by taking steps in the direction of $\nabla_\theta J(\theta)$. The gradient with respect to the policy parameters is
$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho^\pi, u \sim \pi_\theta}\left[ \nabla_\theta \log \pi_\theta(u \mid s)\, Q^\pi(s, u) \right],$$
where $\rho^\pi$ is the state distribution induced by following policy $\pi_\theta$, and $Q^\pi(s, u)$ is an action-value. Policy gradient algorithms differ in how they evaluate $Q^\pi(s, u)$; e.g., the REINFORCE algorithm [williams1992simple] simply uses a sample return $R_t$.
To reduce the variance of gradient estimates, a baseline $b(s_t)$ is introduced. In actor-critic approaches [konda2000actor], an actor is trained by following gradients that depend on a critic. This yields the advantage function $A(s_t, u_t) = Q(s_t, u_t) - b(s_t)$, where $b(s_t)$ is the baseline ($V(s_t)$ or another constant is commonly used as the baseline). The TD error $r_t + \gamma V(s_{t+1}) - V(s_t)$, which is an unbiased estimate of $A(s_t, u_t)$, is a common choice of advantage function. In practice, a TD error that utilizes an n-step return yields good performance [mnih2016asynchronous].
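As a minimal sketch (illustrative helper names, not the paper's code), the one-step TD advantage over a rollout can be computed as:

```python
import numpy as np

def td_advantage(rewards, values, gamma=0.99):
    """One-step TD errors r_t + gamma * V(s_{t+1}) - V(s_t).

    `values` has length len(rewards) + 1: the value of every visited
    state plus a bootstrap value for the final state (0 if terminal).
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    return rewards + gamma * values[1:] - values[:-1]
```

Each entry serves as an unbiased sample of the advantage at that timestep.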
Multi-Agent Policy Gradient (MAPG) Algorithms: Multi-agent policy gradient methods are extensions of policy gradient algorithms with a joint policy $\boldsymbol{\pi}$. Compared with policy gradient methods in single-agent RL settings, MAPG faces the issues of high-variance gradient estimates [lowe2017multi] and credit assignment [foerster2018counterfactual]. Perhaps the simplest multi-agent gradient can be written as:
$$g = \mathbb{E}_{\boldsymbol{\pi}}\left[ \sum_a \nabla_\theta \log \pi^a(u^a \mid \tau^a)\, Q^{\boldsymbol{\pi}}(s, \mathbf{u}) \right]. \quad (3)$$
Multi-agent policy gradients in the current literature often take advantage of CTDE by using a central critic to obtain extra state information $s$, and avoid the vanilla multi-agent policy gradient (Equation 3) due to its high variance. For instance, [lowe2017multi] utilize a central critic to estimate $Q(s, \mathbf{u})$ and optimize the parameters of deterministic actors $\mu^a$ by following a multi-agent DDPG gradient derived from Equation 3:
$$\nabla_{\theta^a} J = \mathbb{E}\left[ \nabla_{\theta^a} \mu^a(\tau^a)\, \nabla_{u^a} Q^{\boldsymbol{\mu}}(s, \mathbf{u}) \big|_{u^a = \mu^a(\tau^a)} \right]. \quad (4)$$
Unlike most actor-critic frameworks, [foerster2018counterfactual] claim to solve the credit assignment issue by applying the following counterfactual policy gradient:
$$g = \mathbb{E}_{\boldsymbol{\pi}}\left[ \sum_a \nabla_\theta \log \pi^a(u^a \mid \tau^a)\, A^a(s, \mathbf{u}) \right], \quad (5)$$
where
$$A^a(s, \mathbf{u}) = Q(s, \mathbf{u}) - \sum_{u'^a} \pi^a(u'^a \mid \tau^a)\, Q\big(s, (\mathbf{u}^{-a}, u'^a)\big)$$
is the counterfactual advantage for agent $a$. Note that [foerster2018counterfactual] argue that the COMA gradients provide agents with tailored gradients, thus achieving credit assignment. However, they also prove that COMA gradients are unbiased estimates of the vanilla multi-agent policy gradients, and that COMA is a variance reduction technique.
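To make the counterfactual baseline concrete, the sketch below (hypothetical helper names; the central critic's outputs are assumed given) computes one agent's counterfactual advantage:

```python
import numpy as np

def coma_advantage(q_values, policy, taken_action):
    """A^a = Q(s, u) - sum over u' of pi^a(u' | tau^a) * Q(s, (u^{-a}, u')).

    `q_values[i]` is the central critic's estimate with agent a's action
    replaced by action i while the other agents' actions are held fixed;
    `policy` is agent a's current action distribution.
    """
    q_values = np.asarray(q_values, dtype=float)
    policy = np.asarray(policy, dtype=float)
    counterfactual_baseline = float(policy @ q_values)
    return float(q_values[taken_action]) - counterfactual_baseline
```

The baseline marginalizes out the agent's own action, so actions better than the agent's average choice receive positive advantage.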
In addition to the previously outlined research questions, our goal in this work is to derive RL algorithms under the following constraints: (1) the learned policies are conditioned on agents' local action-observation histories (the environment is modeled as a Dec-POMDP), (2) a model of the environment dynamics is unknown (i.e. the proposed framework is task-free and model-free), (3) communication is not allowed between agents (i.e. we do not assume a differentiable communication channel such as [das2019tarmac]), and (4) the framework should enable parameter sharing among agents (namely, we do not train a separate model for each agent as is done in [tan1993multi]). A method that meets the above criteria would constitute a general-purpose multi-agent learning algorithm that could be applied to a range of cooperative environments, with or without communication between agents. Hence, the following methods are proposed.
|Algorithm||Central Critic||Value Decomposition||Policy Gradients|
|IAC [foerster2018counterfactual]||No||-||TD advantage|
|Naive Critic||Yes||-||TD advantage|
|COMA [foerster2018counterfactual]||Yes||-||COMA advantage|
Naive Central Critic Method
A naive central critic (naive critic) is proposed to answer the first research question: is a simple policy gradient sufficient to optimize multi-agent actor-critics? As shown in Figure 0(a), the naive critic's central critic shares a similar structure with COMA's critic. It takes the global state $s_t$ as input and outputs a state value $V(s_t)$. Actors follow a rather simple policy gradient, the TD advantage policy gradient common in the RL literature, which is given by:
$$g = \mathbb{E}_{\boldsymbol{\pi}}\left[ \sum_a \nabla_\theta \log \pi^a(u^a \mid \tau^a)\, A(s_t, \mathbf{u}_t) \right], \quad (6)$$
where $A(s_t, \mathbf{u}_t) = r_t + \gamma V(s_{t+1}) - V(s_t)$. In the next section, we will demonstrate that policy gradients taking the form of Equation 6, under our proposed actor-critic frameworks, are also unbiased estimates of the naive multi-agent policy gradients. The pseudo code is listed in the Appendix.
Value Decomposition Actor-Critic
Difference rewards enable agents to learn from a shaped reward $D^a = r(s, \mathbf{u}) - r(s, (\mathbf{u}^{-a}, c^a))$, defined as the reward change incurred by replacing agent $a$'s action $u^a$ with a default action $c^a$. Any action taken by agent $a$ that improves $D^a$ also improves the global reward $r(s, \mathbf{u})$, since the second term in the difference reward does not depend on $u^a$. Therefore, the global reward is monotonically increasing with $D^a$. Inspired by difference rewards, we propose to decompose the global state value $V_{tot}(s)$ into local state values $V^a(o^a)$ such that the following relationship holds:
$$\frac{\partial V_{tot}(s)}{\partial V^a(o^a)} \geq 0, \quad \forall a \in \{1, \dots, n\}. \quad (7)$$
With Equation 7 enforced, given that the other agents stay at the same local states by taking $\mathbf{u}^{-a}$, any action $u^a$ that leads agent $a$ to a local state with a higher value $V^a(o^a)$ will also improve the global state value $V_{tot}(s)$.
Two variants of value-decomposition that satisfy Equation 7, VDAC-sum and VDAC-mix, are studied.
VDAC-sum simply assumes that the total state value $V_{tot}(s)$ is a summation of local state values $V^a(o^a)$:
$$V_{tot}(s) = \sum_{a=1}^{n} V^a(o^a).$$
This linear representation is sufficient to satisfy Equation 7. VDAC-sum's structure is shown in Figure 0(b). Note that the actor outputs both the local state value $V^a(o^a)$ and the policy $\pi^a(u^a \mid \tau^a)$; this is done by sharing the non-output layers between the distributed critics and the actors. In this paper, for generality, $\phi$ denotes the distributed critics' parameters and $\theta$ denotes the actors' parameters. The distributed critics are optimized by minibatch gradient descent to minimize the following loss:
$$L(\phi) = \sum_t \big( y_t - V_{tot}(s_t) \big)^2, \qquad y_t = \sum_{l=0}^{k-1} \gamma^l r_{t+l} + \gamma^k V_{tot}(s_{t+k}),$$
where $y_t$ is bootstrapped from the value of the last state and $k$ can vary from state to state, upper-bounded by the rollout length.
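The bootstrapped critic targets can be sketched as a backward recursion over a rollout (an illustrative helper; the rollout's rewards and the value of the state after the rollout are assumed given):

```python
import numpy as np

def n_step_targets(rewards, bootstrap_value, gamma=0.99):
    """y_t = r_t + gamma * r_{t+1} + ... + gamma^{T-t} * V(s_T),
    computed backward over a rollout of length T; `bootstrap_value`
    is the critic's estimate for the state following the rollout
    (0 if that state is terminal)."""
    targets = np.empty(len(rewards))
    running = bootstrap_value
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        targets[t] = running
    return targets
```

Each target mixes as many real rewards as remain in the rollout with a single bootstrapped value, so the effective n varies by timestep.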
The policy network is trained by following the policy gradient of Equation 6:
$$g = \mathbb{E}_{\boldsymbol{\pi}}\left[ \sum_a \nabla_\theta \log \pi^a(u^a \mid \tau^a)\, A(s_t, \mathbf{u}_t) \right],$$
where $A(s_t, \mathbf{u}_t) = r_t + \gamma V_{tot}(s_{t+1}) - V_{tot}(s_t)$ is a simple TD advantage.
Similar to independent actor-critic (IAC), VDAC-sum does not make full use of CTDE in that it does not incorporate state information during training. Furthermore, it can only represent a limited class of centralized state-value functions.
To generalize the representation to a larger class of monotonic functions, we utilize a feed-forward neural network that takes the local state values $V^a(o^a)$ as input and outputs the global state value $V_{tot}(s)$. To enforce Equation 7, the weights (not including biases) of the network are restricted to be non-negative. This allows the network to approximate any monotonic function arbitrarily well [dugas2009incorporating].
The weights of the mixing network are produced by separate hypernetworks [ha2016hypernetworks]. Following the practice in QMIX [rashid2018qmix], each hypernetwork takes the state $s$ as input and generates the weights of one layer of the mixing network. Each hypernetwork consists of a single linear layer. An absolute-value activation function is applied in the hypernetwork to ensure that the output weights are non-negative. The biases are not restricted to being non-negative; hence, the hypernetworks that produce the biases do not apply the absolute-value function. The final bias is produced by a 2-layer hypernetwork with a ReLU activation function following the first layer. Finally, hypernetwork outputs are reshaped into matrices of appropriate size. Figure 0(c) illustrates the mixing network and the hypernetworks.
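The monotonicity argument can be checked in a few lines. The sketch below is simplified relative to the paper's architecture (biases omitted, ReLU instead of ELU, arbitrary layer shapes, hypothetical names): a linear hypernetwork with an absolute-value activation produces non-negative mixing weights, so the mixed value is non-decreasing in every local value.

```python
import numpy as np

def hyper_weights(state, w_hyper):
    """Single linear hypernetwork layer with an absolute-value
    activation, guaranteeing non-negative mixing weights."""
    return np.abs(state @ w_hyper)

def mix(local_values, state, w1_hyper, w2_hyper, hidden_dim):
    """Two-layer monotonic mixer: V_tot as a non-decreasing function
    of each local state value V^a (biases omitted for brevity)."""
    n = local_values.shape[0]
    w1 = hyper_weights(state, w1_hyper).reshape(n, hidden_dim)
    w2 = hyper_weights(state, w2_hyper)
    hidden = np.maximum(local_values @ w1, 0.0)  # monotone activation
    return float(hidden @ w2)
```

Because every weight is non-negative and the activations are monotone, raising any local value can never decrease the mixed output.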
The whole mixing network structure (including the hypernetworks) can be seen as a central critic. Unlike the critics in [foerster2018counterfactual], this critic takes the local state values as additional inputs besides the global state $s$. Similar to VDAC-sum, the distributed critics are optimized by minimizing the following loss:
$$L(\phi) = \sum_t \Big( y_t - f_{mix}\big(V^1(o^1), \dots, V^n(o^n); s_t\big) \Big)^2,$$
where $f_{mix}$ denotes the mixing network. Let $\psi$ denote the parameters of the hypernetworks. The central critic is optimized by minimizing the same loss:
$$L(\psi) = \sum_t \Big( y_t - f_{mix}\big(V^1(o^1), \dots, V^n(o^n); s_t\big) \Big)^2.$$
The policy network is updated by following the same policy gradient in Equation 6. The pseudo code is provided in the Appendix.
Convergence of VDAC frameworks
[foerster2018counterfactual] establish the convergence of COMA based on the convergence proof of single-agent actor-critic algorithms [konda2000actor, sutton2000policy]. In the same manner, we utilize the following lemma to substantiate the convergence of VDACs to a locally optimal policy.
Lemma 1: For a VDAC algorithm with a compatible TD(1) critic following a policy gradient
$$g_k = \mathbb{E}_{\boldsymbol{\pi}}\left[ \sum_a \nabla_{\theta} \log \pi^a(u^a \mid \tau^a)\, \big( Q(s_t, \mathbf{u}_t) - V_{tot}(s_t) \big) \right]$$
at each iteration $k$, the policy converges to a locally optimal policy.
Proof: The VDAC gradient is given by:
$$g = \mathbb{E}_{\boldsymbol{\pi}}\left[ \sum_a \nabla_\theta \log \pi^a(u^a \mid \tau^a)\, \big( Q(s_t, \mathbf{u}_t) - V_{tot}(s_t) \big) \right].$$
We first consider the expected contribution of the baseline $V_{tot}(s_t)$:
$$g_b = -\mathbb{E}_{\boldsymbol{\pi}}\left[ \sum_a \nabla_\theta \log \pi^a(u^a \mid \tau^a)\, V_{tot}(s_t) \right],$$
where the expectation is with respect to the state-action distribution induced by the joint policy $\boldsymbol{\pi}$. Writing the joint policy as a product of independent actors, $\boldsymbol{\pi}(\mathbf{u} \mid s) = \prod_a \pi^a(u^a \mid \tau^a)$, each agent's baseline term reduces to an expectation over states of $V_{tot}(s) \sum_{u^a} \nabla_\theta \pi^a(u^a \mid \tau^a)$.
The total value does not depend on agent actions and is given by:
$$V_{tot}(s) = f_{mix}\big(V^1(o^1), \dots, V^n(o^n); s\big),$$
where $f_{mix}$ is a non-negative-weight (monotonic) function. Since $\sum_{u^a} \pi^a(u^a \mid \tau^a) = 1$ implies $\sum_{u^a} \nabla_\theta \pi^a(u^a \mid \tau^a) = 0$, the expected baseline contribution vanishes, $g_b = 0$. This yields a single-agent actor-critic baseline: $V_{tot}(s)$ plays exactly the role of $b(s)$ in the single-agent setting.
Now let $d^{\boldsymbol{\pi}}(s)$ be the discounted ergodic state distribution as defined by [sutton2000policy]. The remainder of the gradient is given by:
$$g = \sum_s d^{\boldsymbol{\pi}}(s) \sum_{\mathbf{u}} \boldsymbol{\pi}(\mathbf{u} \mid s) \left( \sum_a \nabla_\theta \log \pi^a(u^a \mid \tau^a) \right) Q(s, \mathbf{u}),$$
which, using $\sum_a \nabla_\theta \log \pi^a(u^a \mid \tau^a) = \nabla_\theta \log \boldsymbol{\pi}(\mathbf{u} \mid s)$, yields a standard single-agent actor-critic policy gradient:
$$g = \sum_s d^{\boldsymbol{\pi}}(s) \sum_{\mathbf{u}} \nabla_\theta \boldsymbol{\pi}(\mathbf{u} \mid s)\, Q(s, \mathbf{u}).$$
[konda2000actor] establish that an actor-critic following this gradient converges to a local maximum of the expected return $J$, subject to the assumptions included in their paper.
In the naive critic framework, $V(s_t)$ is evaluated by the central critic and does not depend on agent actions. Hence, by the same argument as above, the expectation of the naive critic baseline is also zero, which proves that the naive critic also converges to a locally optimal policy.
In this section, we benchmark VDACs against the baseline algorithms listed in Table LABEL:table:_tj on a standardized decentralised StarCraft II micromanagement environment, SMAC [samvelyan2019starcraft]. SMAC consists of a set of StarCraft II micromanagement games that evaluate how well independent agents are able to cooperate to solve complex tasks. In each scenario, algorithm-controlled ally units fight against enemy units controlled by the built-in game AI. An episode terminates when all units of either army have died or when the episode reaches a pre-defined time limit. A game is counted as a win only if the enemy units are eliminated within that limit. The goal is to maximize the win rate, i.e., the ratio of games won to games played.
The action space of agents consists of the following set of discrete actions: move[direction], attack[enemy id], stop, and no operation. Agents can only move in four directions: north, south, east, or west. A unit is allowed to perform the attack[enemy id] action only if the enemy is within its shooting range.
Each unit has a sight range that prevents it from receiving any information about units out of range. The sight range, which is larger than the shooting range, makes the environment partially observable from the standpoint of each agent. Agents can only observe other agents if both are alive and located within the sight range. The global state, which is only available to agents during centralised training, encapsulates information about all units on the map.
Note that all algorithms have access to the same partial observations and global state in our implementation (https://github.com/hahayonghuming/VDACs). We consider the following maps in our experiments: 2s_vs_1sc, 2s3z, 3s5z, 1c3s5z, 8m, and bane_vs_bane. The detailed configuration of each map can be found in Table LABEL:table:_sc_desc in the Appendix.
Observation features and state features are consistent across all algorithms. All algorithms are trained under the A2C framework, in which episodes are rolled out independently during training. Refer to the Appendix for training details and hyperparameters.
We perform the following ablations to answer the corresponding research questions:
Ablation 1: Is the TD advantage gradient sufficient to optimize multi-agent actor-critics? The comparison between the naive critic and COMA will demonstrate the effectiveness of TD advantage policy gradients because the only significant difference between those two methods is that the naive critic follows a TD advantage policy gradient whereas COMA follows the COMA gradient (Equation 5).
Ablation 2: Does applying state-value factorization improve the performance of actor-critic methods? VDAC-sum and IAC, neither of which has access to extra state information, share an identical structure. The only difference is that VDAC-sum applies a simple state-value factorization in which the global state value is a summation of local state values. The comparison between VDAC-sum and IAC will reveal the necessity of applying state-value factorization.
Ablation 3: Compared with QMIX, does VDAC provide a reasonable trade-off between training efficiency and algorithm performance? We train VDAC and QMIX under the A2C training paradigm, which is adopted to promote training efficiency, and compare their performance.
Ablation 4: What factors contribute to the performance of the proposed VDAC? We investigate the necessity of non-linear value decomposition by removing the non-linear activation function in the mixing network. The resulting algorithm is called VDAC-mix (linear), and it can be seen as a VDAC-sum with access to extra state information.
Overall results: win rates on a range of SC mini-games. The black dashed line represents the heuristic AI's performance.
As suggested in [samvelyan2019starcraft], our main evaluation metric is the median win percentage of evaluation episodes as a function of environment steps observed over the course of training. Specifically, the performance of an algorithm is estimated by periodically running a fixed number of evaluation episodes (in our implementation, 32) during the course of training, with any exploratory behaviours disabled. The median performance as well as the 25th-75th percentiles are obtained by repeating each experiment using 5 independent training runs. Figure 1 shows the comparison among actor-critics across 6 different maps.
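This metric can be computed as in the sketch below (illustrative names; it assumes a matrix of evaluation win rates with one row per independent training run):

```python
import numpy as np

def summarize_runs(win_rates):
    """Median and interquartile (25th/75th percentile) win rates
    across independent runs.

    `win_rates` has shape (n_runs, n_eval_points): each row holds
    one run's evaluation win rates over the course of training."""
    win_rates = np.asarray(win_rates, dtype=float)
    median = np.median(win_rates, axis=0)
    p25, p75 = np.percentile(win_rates, [25, 75], axis=0)
    return median, p25, p75
```

The median curve is what is plotted per algorithm, with the percentile band indicating run-to-run variability.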
In all scenarios, IAC fails to learn a policy that consistently defeats the enemy. In addition, its performance across training steps is highly unstable due to the non-stationarity of the environment and its lack of access to extra state information.
Noticeably, VDAC-mix consistently achieves the best performance across all tasks. On easy games (e.g., 8m), all algorithms generally perform well; a simple strategy of attacking the nearest enemies, which is what the heuristic AI outputs, is sufficient to win. In harder games such as 3s5z and 2s3z, only VDAC-mix can match or outperform the heuristic AI.
It is worth noting that VDAC-sum, which cannot access extra state information, matches the naive critic’s performance on most maps.
Consistent with [lowe2017multi], the comparison between the naive critic and IAC demonstrates the importance of incorporating extra state information, which is also revealed by the comparison between COMA and IAC (refer to Figure 1 for comparisons between the naive critic and COMA across different maps). As shown in Figure 2, the naive critic outperforms COMA across all tasks, which shows that a TD advantage policy gradient is also viable in multi-agent settings. In addition, COMA's training is unstable, as can be seen in Figures 1(a) and 1(b), which might arise from its inability to predict accurate counterfactual action-values for untaken actions.
Despite the structural similarity between VDAC-sum and IAC, VDAC-sum's median win rates at 2 million training steps consistently exceed IAC's across all maps (refer to Figure 1 for comparisons between VDAC-sum and IAC across 6 different maps). This reveals that, by using a simple relationship to enforce Equation 7, we can drastically improve multi-agent actor-critic performance. Furthermore, VDAC-sum matches the naive critic on many tasks, as shown in Figure 1(c), demonstrating that actors trained without extra state information can achieve similar performance to the naive critic by simply enforcing Equation 7. In addition, it is noticeable that, compared with the naive critic, VDAC-sum's performance is more stable across training.
Figures 2(a) and 2(b) show that, under the A2C training paradigm, VDAC-mix outperforms or matches QMIX on maps 2s_vs_1sc and 3s5z (refer to Figure 4 in the Appendix for comparisons between VDACs and QMIX over all maps). In easier games, QMIX's performance is comparable to VDAC-mix's. In harder games such as 2s_vs_1sc and 3s5z, VDAC-mix's median test win rates at 2 million training steps exceed QMIX's by 38% and 71%, respectively. Furthermore, QMIX's performance can be noticeably unstable across training steps on some maps, as shown in Figure 2(a).
Finally, we introduced VDAC-mix (linear), which can be seen as a more general VDAC-sum with access to extra state information. Consistent with our previous conclusion, the comparison between VDAC-mix (linear) and VDAC-sum shows that it is important to incorporate extra state information. In addition, the comparison between VDAC-mix and VDAC-mix (linear) shows the necessity of allowing a non-linear relationship between the global state value $V_{tot}(s)$ and the local state values $V^a(o^a)$. Refer to Figure 5 in the Appendix for comparisons between VDACs across all maps.
In this paper, we propose new credit-assignment actor-critic frameworks based on an observation about difference rewards, which implies a monotonic relationship between the global reward and the reshaped rewards of individual agents. Theoretically, we establish the convergence of the proposed actor-critics to a local optimum. Empirically, benchmark tests on StarCraft micromanagement games demonstrate that our proposed actor-critics bridge the performance gap between multi-agent actor-critics and Q-learning, and that our methods provide a balanced trade-off between training efficiency and performance. Furthermore, we identify a set of key factors that contribute to the performance of our proposed algorithms via a set of ablation experiments. In the future, we aim to apply our framework to real-world applications such as highway on-ramp merging of semi- or fully self-driving vehicles.
Appendix A Appendix
In this paper, we use all the default settings in SMAC: the game difficulty is set to level 7 (very difficult), and the shooting range, sight range, etc., are consistent with the default settings. The observation vector also follows the default implementation in [samvelyan2019starcraft]: it contains the following attributes for both allied and enemy units within the sight range: distance, relative x, relative y, health, shield, and unit type. In addition, the observation vector includes the last actions of allied units that are in the field of view. Lastly, the terrain features surrounding agents within the sight range, in particular the values of eight points at a fixed radius indicating height and walkability, are also included. The state vector includes the coordinates of all agents relative to the center of the map, together with the units' observation feature vectors. Additionally, the energy of Medivacs and the cooldown of the remaining allied units are stored in the state vector. Finally, the last actions of all agents are attached to the state vector.
|Map Name||Ally Units||Enemy Units|
|2s_vs_1sc||2 Stalkers||1 Spine Crawler|
|8m||8 Marines||8 Marines|
|2s3z||2 Stalkers & 3 Zealots||2 Stalkers & 3 Zealots|
|3s5z||3 Stalkers & 5 Zealots||3 Stalkers & 5 Zealots|
|1c3s5z||1 Colossus, 3 Stalkers & 5 Zealots||1 Colossus, 3 Stalkers & 5 Zealots|
|bane_vs_bane||20 Zerglings & 4 Banelings||20 Zerglings & 4 Banelings|
Training Details and Hyperparameters
The agent networks of all algorithms resemble a DRQN [hausknecht2015deep]: a recurrent layer comprised of a GRU [chung2014empirical] with a 64-dimensional hidden state, with a fully-connected layer before and after. The exceptions are that the IAC, VDAC-sum, and VDAC-mix agent networks contain an additional layer to output local state values, and that the policy network outputs a stochastic policy rather than action-values.
Algorithms are trained with RMSprop. During training, games are initiated independently, and episodes are sampled from them. The Q-learning replay buffer stores the latest 5000 episodes for each independent game. The discount factor and other coefficients, along with the target-network update interval (where target networks are used), follow our released implementation.
The architecture of the COMA critic is a feedforward fully-connected neural network with several hidden layers followed by a final layer that outputs an action-value for each action. The naive central critic shares the same architecture with the COMA critic, with the exception that its final layer outputs a single state value.
The mixing networks in QMIX and VDAC-mix share an identical structure. Each consists of a single hidden layer of 32 units, whose parameters are produced by hypernetworks. An ELU activation function follows the hidden layer in the mixing network. The hypernetworks consist of a feedforward network with a single hidden layer of 64 units with a ReLU activation function.
For the naive central critic, IAC, and VDACs, the target $y_t$ is given by:
$$y_t = \sum_{l=0}^{k-1} \gamma^l r_{t+l} + \gamma^k V(s_{t+k}),$$
where $k$ can vary from state to state and is upper-bounded by the rollout length.