Efficient Communication in Multi-Agent Reinforcement Learning via Variance Based Control

by   Sai Qian Zhang, et al.

Multi-agent reinforcement learning (MARL) has recently received considerable attention due to its applicability to a wide range of real-world applications. However, achieving efficient communication among agents has always been an overarching problem in MARL. In this work, we propose Variance Based Control (VBC), a simple yet efficient technique to improve communication efficiency in MARL. By limiting the variance of the exchanged messages between agents during the training phase, the noisy components of the messages can be eliminated effectively, while the useful parts can be preserved and utilized by the agents for better performance. Our evaluation using a challenging set of StarCraft II benchmarks indicates that our method achieves 2-10× lower communication overhead than state-of-the-art MARL algorithms, while allowing agents to collaborate better by developing sophisticated strategies.




1 Introduction

Many real-world applications (e.g., autonomous driving Shai et al. (2016), game playing Mnih et al. (2013) and robotics control Kober et al. (2013)) today require reinforcement learning tasks to be carried out in multi-agent settings. In MARL, multiple agents interact with each other in a shared environment. Each agent has access only to partial observations of the environment, and needs to make local decisions based on those observations as well as both direct and indirect interactions with the other agents. This complex interaction model introduces numerous challenges for MARL. In particular, during the training phase, each agent may dynamically change its strategy, making the environment non-stationary from the perspective of the other agents and destabilizing the training process. Worse still, each agent can easily overfit its strategy to the behaviours of the other agents Marc et al. (2017), which seriously deteriorates the overall performance.

In the research literature, there have been three lines of research that try to mitigate the instability and inefficiency caused by decentralized execution. The most common approach is independent Q-learning (IQL) Tan (1993), which breaks a multi-agent learning problem down into multiple independent single-agent learning problems, and makes each agent learn and act independently. Unfortunately, this approach does not account for the instability caused by environment dynamics, and therefore often leads to conflicting actions during execution. The second approach adopts the centralized training and decentralized execution paradigm Sunehag et al. (2017), where a joint action-value function is learned during the training phase to better coordinate the agents' behaviours. During execution, each agent acts independently without direct communication. The third approach introduces communication among agents during execution Sukhbaatar et al. (2016); Foerster et al. (2016). This approach allows each agent to dynamically adjust its strategy based on its local observation along with the information received from the other agents. Nonetheless, it introduces additional communication overhead in terms of latency and bandwidth during execution, and its success depends heavily on the usefulness of the received information.

In this work, we leverage the advantages of both the second and third approaches. Specifically, we consider a fully cooperative scenario where all the agents collaborate to achieve a common objective. The agents are trained in a centralized fashion within the multi-agent Q-learning framework, and are allowed to communicate with each other during execution. However, unlike previous work, we have a few key insights. First, for many applications, it is often superfluous for an agent to wait for feedback from all surrounding agents before making an action decision. For instance, when the front camera on an autonomous vehicle detects an obstacle within the danger distance limit, it triggers the 'brake' signal without waiting for feedback from the other parts of the vehicle. Second, the feedback received from the other agents may not always provide useful information. For example, the navigation system of the autonomous vehicle should pay more attention to the messages sent by the perception system (e.g., camera, radar), and less attention to the entertainment system inside the vehicle, before taking its action. Full communication among the agents leads to large communication overhead and latency, which is impractical for real systems with strict latency requirements and bandwidth limits (e.g., real-time traffic signal control, autonomous driving, etc.). In addition, as pointed out by Jiang and Lu (2018), an excessive amount of communication may introduce useless and even harmful information which can impair the convergence of learning.

Motivated by these observations, we design a novel deep MARL architecture that dramatically improves inter-agent communication efficiency. Specifically, we introduce Variance Based Control (VBC), a simple yet efficient approach to reduce the information transferred between the agents. By inserting an extra loss term on the variance of the exchanged information, the valuable and informative part inside the messages can be effectively extracted and leveraged to benefit the training of each individual agent. Unlike previous work, we do not introduce an extra decision module to dynamically adjust the communication pattern. This allows us to reduce the model complexity significantly. Instead, each agent first makes a preliminary decision based on its local information, and initiates communication only when its confidence level on this preliminary decision is low. Similarly, upon receiving the communication request, the agent replies to the request only when its message is informative. This architecture significantly improves the performance of the agents and reduces the communication overhead during the execution. Furthermore, it can be theoretically shown that the resulting training algorithm achieves guaranteed stability.

For evaluation, we test VBC on the StarCraft Multi-Agent Challenge Samvelyan et al. (2019). The results demonstrate that VBC achieves higher winning rates and lower communication overhead on average than the other benchmark algorithms. A video demo is available at 22 for better illustration of the VBC performance.

2 Related Work

The simplest training method for MARL is to make each agent learn independently using Independent Q-Learning (IQL) Tan (1993). Although IQL is successful in solving simple tasks such as Pong Tampuu et al. (2017), it ignores the environment dynamics arising from the interactions among the agents. As a result, it suffers from poor convergence, making it difficult to handle advanced tasks.

Given the recent success of deep Q-learning Mnih et al. (2013), some work explores the scheme of centralized training and decentralized execution. Sunehag et al. Sunehag et al. (2017) propose Value Decomposition Network (VDN), a method that acquires the joint action-value function by summing the action-value functions of all the agents. All the agents are trained as a whole by updating the joint action-value function iteratively. QMIX Rashid et al. (2018) builds on VDN, and utilizes a neural network to represent the joint action-value function as a function of the individual action-value functions and the global state information. The authors of Lowe et al. (2017) extend actor-critic methods to the multi-agent scenario. By performing centralized training and decentralized execution over the agents, the agents can better adapt to changes in the environment and collaborate with each other. Foerster et al. Foerster et al. (2018b) propose counterfactual multi-agent policy gradient (COMA), which employs a centralized critic to estimate the joint action-value function, and decentralized actors that let each agent execute independently. All the aforementioned methods assume no communication between the agents during execution. As a result, many subsequent approaches, including ours, can be applied to improve the performance of these methods.

Learning a communication pattern for MARL was first proposed by Sukhbaatar et al. Sukhbaatar et al. (2016). The authors introduce CommNet, a framework that adopts continuous communication for fully cooperative tasks. During execution, each agent takes its internal state as well as the mean of the internal states of the remaining agents as input to decide on its action. BiCNet Peng et al. (2017) uses a bidirectional coordinated network to connect the agents. However, both schemes require all-to-all communication among the agents, which can cause significant communication overhead and latency.

Several other proposals Foerster et al. (2016); Jiang and Lu (2018); Kim et al. (2019) use a selection module to dynamically adjust the communication pattern between the agents. In Foerster et al. (2016), the authors propose DIAL (Differentiable Inter-Agent Learning), where the messages produced by an agent are selectively sent to the neighboring agents by a discretise/regularise unit (DRU). By jointly training the DRU with the agent network, the communication overhead can be reduced efficiently. Jiang et al. Jiang and Lu (2018) propose an attentional communication model that learns when communication is required and how to aggregate the shared information. However, an agent can only talk to the agents within its observable field at each timestep, which limits the speed of information propagation and restricts the possible communication patterns when the local observable field is small. Kim et al. Kim et al. (2019) propose a communication scheduling scheme for wireless environments, where only a subset of the agents can broadcast their messages at each timestep. In comparison, our approach does not impose hard constraints on the communication pattern, which is advantageous to the learning process. Moreover, our method does not adopt an additional decision module for communication scheduling, which greatly reduces the model complexity.

3 Background

Deep Q-networks:

We consider a standard reinforcement learning problem based on a Markov Decision Process (MDP). At each timestep $t$, the agent observes the state $s_t$ and chooses an action $a_t$. It then receives a reward $r_t$ for its action and proceeds to the next state $s_{t+1}$. The goal is to maximize the total expected discounted reward $R_t = \sum_{k \geq 0} \gamma^k r_{t+k}$, where $\gamma \in [0,1)$ is the discount factor. Deep Q-Networks (DQN) use a deep neural network to represent the action-value function $Q(s_t, a_t; \theta)$, where $\theta$ represents the parameters of the neural network, and $R_t$ is the total reward received at and after $t$. During the training phase, a replay buffer is used to store the transition tuples $(s_t, a_t, r_t, s_{t+1})$. The action-value function can be trained recursively by minimizing the loss $L(\theta) = \mathbb{E}[(y_t - Q(s_t, a_t; \theta))^2]$, where $y_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^-)$ and $\theta^-$ represents the parameters of the target network. An action is usually selected with an $\epsilon$-greedy policy: the action with the maximum action value is selected with probability $1-\epsilon$, and a random action is chosen with probability $\epsilon$.
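The two mechanisms above, the bootstrapped TD target with a target network and $\epsilon$-greedy selection, can be sketched in a few lines of plain Python. The function names and list-based Q-values here are illustrative, not the paper's implementation:

```python
import random

def td_target(reward, next_q_values, gamma=0.99, done=False):
    """y_t = r_t + gamma * max_a' Q(s_{t+1}, a'; theta^-); no bootstrap at episode end."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick the argmax action with probability 1 - epsilon, else a uniform random action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```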

Multi-agent deep reinforcement learning: We consider an environment in which $N$ agents work cooperatively to fulfill a given task. At timestep $t$, each agent $i$ ($1 \leq i \leq N$) receives a local observation $o^i_t$ and executes an action $a^i_t$. The agents then receive a joint reward $r_t$ and proceed to the next state. We use a vector $a_t = (a^1_t, \dots, a^N_t)$ to represent the joint action taken by all the agents. The agents aim to maximize the total expected discounted joint reward by choosing the best joint action $a_t$ at each timestep $t$.

Deep recurrent Q-networks: Traditional DQN generates actions solely based on a limited number of local observations without considering prior knowledge. Hausknecht et al. Hausknecht and Stone (2015) introduce Deep Recurrent Q-Networks (DRQN), which model the action-value function with a recurrent neural network (RNN). DRQN leverages its recurrent structure to integrate previous observations and knowledge for better decision-making. At each timestep $t$, the DRQN takes the local observation $o_t$ and the hidden state $h_{t-1}$ from the previous step as input to yield action values.

Learning the joint Q-function: Recent research effort has been made on learning joint action-value functions for multi-agent Q-learning. Two representative works are VDN Sunehag et al. (2017) and QMIX Rashid et al. (2018). In VDN, the joint action-value function is assumed to be the sum of all the individual action-value functions, $Q_{tot}(o_t, h_{t-1}, a_t) = \sum_i Q_i(o^i_t, h^i_{t-1}, a^i_t)$, where $o_t$, $h_{t-1}$ and $a_t$ are the collections of the observations, hidden states and actions of all the agents at timestep $t$, respectively. QMIX employs a neural network to represent the joint value function $Q_{tot}$ as a nonlinear function of the per-agent $Q_i$ and the global state $s_t$.
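As a concrete illustration of the VDN decomposition, a minimal sketch, with per-agent Q-values as plain lists (names are ours, not VDN's API):

```python
def vdn_joint_q(per_agent_qs, joint_action):
    """VDN: Q_tot is the sum of each agent's value for its chosen action."""
    return sum(q_values[action] for q_values, action in zip(per_agent_qs, joint_action))
```

Because the sum is monotonic in each individual $Q_i$, every agent can greedily pick its own argmax during decentralized execution and still maximize the joint value.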

4 Variance Based Control

In this section, we present the detailed design of VBC in the context of multi-agent Q-learning. The main idea of VBC is to achieve superior performance and communication efficiency by limiting the variance of the transferred messages. Moreover, during execution, each agent communicates with the other agents only when its local decision is ambiguous. The degree of ambiguity is measured by the difference between the two largest action values. Upon receiving a communication request from another agent, an agent replies only if its feedback is informative, namely if the variance of the feedback is high.

4.1 Agent Network Design

Figure 1: (a) Agent network structure of agent 1. (b) Centralized training scheme.

The agent network consists of three components: a local action generator, message encoders and a combiner. Figure 1(a) describes the network architecture for agent $i$. The local action generator consists of a Gated Recurrent Unit (GRU) and a fully connected layer (FC). For the network of agent $i$, the GRU takes the local observation $o^i_t$ and the hidden state $h^i_{t-1}$ as inputs, and generates the intermediate result $c^i_t$. $c^i_t$ is then sent to the FC layer, which outputs the local action values $Q^i(o^i_t, h^i_{t-1}, a)$ for each action $a \in A$, where $A$ is the set of possible actions. The message encoder is a multi-layer perceptron (MLP) which contains two FC layers and a leaky ReLU layer. The agent network involves multiple independent message encoders, each of which accepts the intermediate result $c^j_t$ from another agent $j$ and outputs the encoded message $f_{enc}(c^j_t)$. The outputs of the local action generator and the message encoders are then sent to the combiner, which produces the global action-value function of agent $i$ by taking into account the global observation and global history. To simplify the design and reduce model complexity, we do not introduce extra parameters for the combiner. Instead, we make the dimension of each $f_{enc}(c^j_t)$ the same as that of the local action values, so the combiner can simply perform elementwise summation over its inputs, namely $Q_i(o_t, h_{t-1}, a) = Q^i(o^i_t, h^i_{t-1}, a) + \sum_{j \neq i} f_{enc}(c^j_t)$. The combiner chooses the action with the $\epsilon$-greedy policy. Let $\theta$ and $\theta_c$ denote the parameters of the local action generators and the message encoders, respectively. To prevent the lazy agent problem Sunehag et al. (2017) and decrease the model complexity, we share the local action generator parameters across all the agents, and likewise share the message encoder parameters. Accordingly, we can drop the agent indices and use $\theta$ and $f_{enc}$ to denote the agent network parameters and the message encoder.
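Because the combiner is parameter-free, it reduces to an elementwise sum over vectors of equal (action-space) dimension. A toy sketch under that assumption:

```python
def combine(local_q, encoded_messages):
    """Combiner: elementwise sum of the local action values and every
    message-encoder output; all vectors share the action-space dimension."""
    combined = list(local_q)
    for message in encoded_messages:
        combined = [q + m for q, m in zip(combined, message)]
    return combined
```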

4.2 Loss Function Definition

During the training phase, the message encoders and local action generators are learned jointly to produce the best estimate of the action values. More specifically, we employ a mixing network (shown in Figure 1(b)) to aggregate the action-value functions $Q_i$ of all the agents and yield the joint action-value function $Q_{tot}(o_t, h_{t-1}, a_t)$. To limit the variance of the messages from the other agents, we introduce an extra loss term on the variance of the outputs of the message encoders $f_{enc}(c^i_t)$. The loss function during the training phase is defined as:

$$L(\theta, \theta_c) = \sum_b \Big[ \big(y_b - Q_{tot}(o_t, h_{t-1}, a_t; \theta, \theta_c)\big)^2 + \lambda \sum_i Var\big(f_{enc}(c^i_t)\big) \Big] \qquad (1)$$

where $y_b = r_t + \gamma \max_{a'} Q_{tot}(o_{t+1}, h_t, a'; \theta^-, \theta_c^-)$, $\theta^-$ is the parameter of the target network which is copied from $\theta$ periodically, $Var(\cdot)$ is the variance function and $\lambda$ is the weight of the variance loss. $b$ is the batch index. The replay buffer is refreshed periodically by running each agent network and selecting the action which maximizes $Q_i(o_t, h_{t-1}, a)$.
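A toy version of the penalized objective, assuming squared TD errors and list-valued encoder outputs (function names are illustrative, not the paper's code):

```python
def variance(vector):
    """Population variance of a message-encoder output."""
    mean = sum(vector) / len(vector)
    return sum((v - mean) ** 2 for v in vector) / len(vector)

def vbc_loss(td_errors, encoder_outputs, lam):
    """Sum of squared TD errors over a batch, plus the lambda-weighted
    variance of each exchanged message, mirroring equation (1)."""
    td_term = sum(err ** 2 for err in td_errors)
    var_term = sum(variance(msg) for msg in encoder_outputs)
    return td_term + lam * var_term
```

Setting `lam` to zero recovers the unpenalized multi-agent Q-learning loss.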

4.3 Communication Protocol Design

During the execution phase, at every timestep $t$, each agent first computes its local action values $Q^i(o^i_t, h^i_{t-1}, a)$. It then measures the confidence level of its local decision by computing the difference between the largest and the second largest elements of the action values. An example is given in Figure 2(a). Assume agent 1 has three actions to select from, and the difference between the largest and the second largest action values produced by its local action generator is greater than the threshold $\delta_1$. Given that the variance of the message encoder outputs from agents 2 and 3 is relatively small, it is highly likely that the global action-value function attains its maximum at the same action. Agent 1 therefore does not have to talk to the other agents to acquire their messages. On the other hand, agent 1 broadcasts a request to ask for help if the confidence level of its local decision is low. Because the request does not contain any actual data, it consumes very little bandwidth. Upon receiving the request, only the agents whose messages have a large variance reply (Figure 2(b)), since only their messages may change the current action decision at agent 1. This protocol not only reduces the communication overhead considerably, but also eliminates less informative messages which may impair the performance. Each agent only consults the other agents when the confidence level of its local decision is low, and the other agents only reply when their messages can potentially change the final decision. The detailed protocol is summarized in Algorithm 1. Note that this communication protocol is also applied during the training stage to generate the action values; the complete training algorithm is presented in the appendix.

Figure 2: An example on communication protocol of the agents during execution.
1 Input: threshold $\delta_1$ on the confidence of local actions, threshold $\delta_2$ on the variance of the message encoder output.
2 Output: the final action-value function $Q_i(o_t, h_{t-1}, a)$.
3 for each timestep $t$ do
4        Transmitter end:
5        Compute the local action values $Q^i(o^i_t, h^i_{t-1}, a)$. Denote by $m_1, m_2$ the top two largest values of the local action values.
6        if $m_1 - m_2 > \delta_1$ then
7               Let $Q_i(o_t, h_{t-1}, a) = Q^i(o^i_t, h^i_{t-1}, a)$.
8       else
9               Broadcast a request to the other agents, and receive $f_{enc}(c^j_t)$ from the replying agents $j$.
10               Let $Q_i(o_t, h_{t-1}, a) = Q^i(o^i_t, h^i_{t-1}, a) + \sum_j f_{enc}(c^j_t)$.
11        Receiver end:
12        Calculate the variance of $f_{enc}(c^i_t)$; if $Var(f_{enc}(c^i_t)) > \delta_2$, store $f_{enc}(c^i_t)$.
13        if $Var(f_{enc}(c^i_t)) > \delta_2$ and a request from agent $j$ is received then
14               Reply to the request with $f_{enc}(c^i_t)$.
Algorithm 1 Communication protocol of agent $i$
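The two decision rules in Algorithm 1 can be sketched as simple predicates; the threshold names and list-based messages are illustrative:

```python
def should_broadcast_request(local_q, delta1):
    """Transmitter side: ask for help only when the gap between the
    top two action values is at most delta1 (low confidence)."""
    top = sorted(local_q, reverse=True)
    return (top[0] - top[1]) <= delta1

def should_reply(encoded_message, delta2):
    """Receiver side: reply only when the message variance exceeds delta2,
    i.e. when the message could change the requester's decision."""
    mean = sum(encoded_message) / len(encoded_message)
    var = sum((v - mean) ** 2 for v in encoded_message) / len(encoded_message)
    return var > delta2
```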

5 Convergence Analysis

In this section, we analyze the convergence of the learning process with the loss function defined in equation (1) under the tabular setting. For the sake of simplicity, we ignore the dependency of the action-value function on the previous knowledge $h_{t-1}$. To minimize equation (1), given the initial table $Q_0$, at iteration $t$ the values in the table are updated according to the following rule:

$$Q_{t+1}(s_t, a_t) = (1 - \alpha_t)\, Q_t(s_t, a_t) + \alpha_t \Big( r_t + \gamma \max_{a'} Q_t(s_{t+1}, a') - \lambda \frac{\partial\, Var_t}{\partial\, Q_t(s_t, a_t)} \Big) \qquad (2)$$

where $\alpha_t$ and $Q_t$ are the learning rate and the joint action-value function at iteration $t$ respectively, $Var_t$ denotes the variance penalty of equation (1), and $Q^*$ is the optimal joint action-value function. We have the following result on the convergence of the learning process. A detailed proof is given in the appendix.

Theorem 1.

Assume $0 \leq \alpha_t \leq 1$, $\sum_t \alpha_t = \infty$, $\sum_t \alpha_t^2 < \infty$. Also assume the number of possible actions and states is finite. By performing equation (2) iteratively, we have $\|Q_t - \hat{Q}\| \to 0$ as $t \to \infty$, where the fixed point $\hat{Q}$ satisfies a deviation bound $\|\hat{Q} - Q^*\|$ proportional to $\lambda$.
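With $\lambda = 0$ the update in equation (2) reduces to standard tabular Q-learning, which converges to $Q^*$ under the stated step-size conditions. A minimal sketch on a one-state MDP (the dict-of-dicts table is our own toy representation):

```python
def q_update(Q, s, a, r, s_next, alpha, gamma):
    """One tabular step: move Q(s, a) toward the bootstrapped target
    r + gamma * max_a' Q(s', a'), as in equation (2) with lambda = 0."""
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * target
    return Q

# Single state with a self-loop and reward 1: the fixed point solves
# Q = 1 + gamma * Q, i.e. Q* = 1 / (1 - gamma) = 2 for gamma = 0.5.
Q = {0: {0: 0.0}}
for _ in range(200):
    q_update(Q, 0, 0, 1.0, 0, alpha=0.5, gamma=0.5)
```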

6 Experiment

We evaluated the performance of VBC on the StarCraft Multi-Agent Challenge (SMAC) Samvelyan et al. (2019). StarCraft II 17 is a real-time strategy (RTS) game that has recently been utilized as a benchmark by the reinforcement learning community Rashid et al. (2018); Foerster et al. (2018b); Peng et al. (2017); Foerster et al. (2018a). In this work, we focus on the decentralized micromanagement problem in StarCraft II, which involves two armies, one controlled by the user (i.e., a group of agents), and the other controlled by the built-in StarCraft II AI. The goal of the user is to control its allied units to destroy all enemy units, while minimizing the damage received by each unit. We consider six different battle settings. Three of them are symmetrical battles, where both the user and the enemy groups consist of 2 Stalkers and 3 Zealots (2s3z), 3 Stalkers and 5 Zealots (3s5z), and 1 Medivac, 2 Marauders and 7 Marines (MMM), respectively. The other three are unsymmetrical battles, where the user and enemy groups have different army compositions: 3 Stalkers for the user versus 4 Zealots for the enemy (3s_vs_4z), 6 Hydralisks for the user versus 8 Zealots for the enemy (6h_vs_8z), and 6 Zealots for the user versus 24 Zerglings for the enemy (6z_vs_24zerg). The unsymmetrical battles are considered harder than the symmetrical battles because of the difference in army size.

At each timestep, each agent controls a single unit to perform an action, including move[direction], attack[enemy_id], stop and no-op. Each agent has a limited sight range and shooting range, where the shooting range is smaller than the sight range. The attack operation is available only when the enemies are within the shooting range. The joint reward received by the allied units equals the total damage inflicted on the enemy units. Additionally, the agents are rewarded 100 extra points for killing each enemy unit, and 200 extra points for killing the entire army. The user group wins the battle only if the allied units kill all the enemies within the time limit; otherwise the built-in AI wins. The input observation of each agent is a vector consisting of the following information for each allied and enemy unit within its sight range: relative $x$, $y$ coordinates, relative distance and unit type. For the detailed game settings, hyperparameters, and additional experimental evaluation on other test environments, please refer to the appendix.

6.1 Results

We compare VBC with other benchmark algorithms, including VDN Sunehag et al. (2017), QMIX Rashid et al. (2018) and SchedNet Kim et al. (2019), for controlling the allied units. We consider two variants of VBC obtained by adopting the mixing networks of VDN and QMIX, denoted VBC+VDN and VBC+QMIX. The mixing network of VDN simply computes the elementwise summation of all its inputs, and the mixing network of QMIX deploys a neural network whose weights are derived from the global state $s_t$. The detailed architecture of this mixing network can be found in Rashid et al. (2018). Additionally, we consider a baseline that removes the variance penalty in equation (1) and drops the variance thresholds during the execution phase, so that the agents always communicate. Its agents are trained with the same network architecture shown in Figure 1, using the mixing network of VDN. We call this algorithm FComm (full communication). For SchedNet, at every timestep only $k$ out of the $N$ agents can broadcast their messages using the Top(k) scheduling policy Kim et al. (2019). $k$ varies across battles, and we usually set $k$ close to $N/2$, that is, each time roughly half of the allied units can broadcast their messages. VBC is trained for different numbers of episodes depending on the difficulty of the battle, which we describe in detail next.

To measure the speed of convergence of each algorithm, we stop the training process and save the current model every 200 training episodes. We then run 20 test episodes and measure the winning rate over these 20 episodes. For VBC+VDN and VBC+QMIX, the winning rates are measured by running the communication protocol described in Algorithm 1. For the easy tasks, namely MMM and 2s3z, we train the algorithms for fewer episodes than for the other, harder tasks. Each algorithm is trained 15 times. Figure 3 shows the average winning rate and 95% confidence interval of each algorithm on all six tasks. For the hyperparameters used by VBC (i.e., $\lambda$ in equation (1) and $\delta_1, \delta_2$ in Algorithm 1), we first search for a coarse parameter range based on random trials, experience and message statistics. We then perform a random search within this smaller hyperparameter space. The best selections are shown in the legend of each figure.

We notice that the algorithms that involve communication (i.e., SchedNet, FComm, VBC) outperform the algorithms without communication (i.e., VDN, QMIX) on all six tasks, a clear indication that communication benefits performance. Moreover, both VBC+VDN and VBC+QMIX achieve better winning rates than SchedNet, because SchedNet only allows a fixed number of agents to talk at every timestep, which prevents some key information from being exchanged in a timely manner. Finally, VBC achieves performance similar to FComm and even outperforms FComm on some tasks (e.g., 2s3z, 6h_vs_8z, 6z_vs_24zerg). This is because a fair amount of the communication between the agents is noisy and redundant. By eliminating these undesired messages, VBC achieves both communication efficiency and a performance gain.

Figure 3: Winning rates for the six tasks, the shaded regions represent the 95% confidence intervals.
(a) Communication overhead
(b) QMIX and VDN (6h_vs_8z)
(c) VBC (6h_vs_8z)
(d) VBC (3svs4z)
(e) VBC (6z_vs_24zergs, stage 1)
(f) VBC (6z_vs_24zergs, stage 2)
Figure 4: Strategies and communication pattern for different scenarios

6.2 Communication Overhead

We now evaluate the communication overhead of VBC. To quantify the amount of communication, we run Algorithm 1 and count the number of pairs of agents that communicate at each timestep $t$, then divide by the total number of ordered pairs of agents in the user group, $N(N-1)$. For example, for the task 3s_vs_4z, the user controls 3 Stalkers, so there are $3 \times 2 = 6$ ordered pairs of agents; if $m$ of these pairs communicate at timestep $t$, the overhead at that timestep is $m/6$. Let $\beta$ denote the average communication overhead. Table 1 shows the $\beta$ of VBC+VDN, VBC+QMIX and SchedNet across all the test episodes at the end of the training phase of each battle. For SchedNet, $\beta$ simply equals the ratio between the number of allied agents that are allowed to talk and the total number of allied agents. As shown in Table 1, in contrast to SchedNet, VBC+VDN and VBC+QMIX produce 10× lower communication overhead for MMM and 2s3z, and considerably less traffic for the rest of the tasks.

Map           | VBC+VDN | VBC+QMIX | SchedNet
MMM           | 5.25%   | 5.36%    | 50%
2s3z          | 4.33%   | 4.68%    | 60%
3s5z          | 27.70%  | 28.13%   | 62.5%
3s_vs_4z      | 5.07%   | 5.19%    | 33.3%
6h_vs_8z      | 25.93%  | 24.62%   | 50%
6z_vs_24zerg  | 12.13%  | 13.35%   | 50%
Table 1: Communication overhead
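The overhead metric $\beta$ described above is straightforward to compute per timestep. A sketch assuming the communicating pairs are recorded as ordered (sender, receiver) tuples:

```python
def comm_overhead(communicating_pairs, num_agents):
    """beta at one timestep: fraction of ordered agent pairs that exchanged
    a message, out of the N * (N - 1) possible ordered pairs."""
    total_pairs = num_agents * (num_agents - 1)
    return len(communicating_pairs) / total_pairs
```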

6.3 Learned Strategy

In this section, we examine the behaviors of the agents in order to better understand the strategies adopted by the different algorithms. We have made a video demo available at 22 for better illustration.

For unsymmetrical battles, the allied units are outnumbered by the enemy units, and the agents are therefore prone to being besieged and killed by the enemies. This is exactly what happens to the QMIX and VDN agents on 6h_vs_8z, as shown in Figure 4(b). Figure 4(c) shows the strategy of VBC: all the Hydralisks line up in a row along the bottom edge of the map. Due to the limited size of the map, the Zealots cannot go beyond the edge to surround the Hydralisks. The Hydralisks then focus their fire to kill the Zealots one by one. Figure 4(a) shows the change of $\beta$ over a sample test episode. We observe that most of the communication occurs at the beginning of the episode, when the Hydralisks need to talk in order to form the row. After the arrangement is formed, no communication is needed until the arrangement is broken by the deaths of some Hydralisks, as indicated by the short spikes near the end of the episode. Finally, SchedNet and FComm employ a similar strategy to VBC. Nonetheless, due to the restrictions on the communication pattern, the row formed by the allied agents is usually not well formed, and can be easily broken by the enemies.

In the 3s_vs_4z scenario, the Stalkers have a larger attack range than the Zealots. All the algorithms adopt a kiting strategy in which the Stalkers form a group and attack the Zealots while kiting them. For VBC and FComm, at each timestep only the agents that are far from the enemies attack, while the rest of the agents (usually the healthier ones) serve as a shield to protect the firing agents (Figure 4(d)). Communication only occurs when the group is broken and needs to realign. In contrast, VDN and QMIX do not exhibit this attacking pattern: all the Stalkers always fire simultaneously, so the Stalkers closest to the Zealots get killed quickly. SchedNet also adopts a similar policy to VBC, but the attacking pattern of the Stalkers is less regular, i.e., the Stalkers close to the Zealots also fire occasionally.

6z_vs_24zerg is the toughest scenario in our experiments. For QMIX and VDN, the 6 Zealots are surrounded and killed by the 24 Zerglings shortly after the episode starts. In contrast, VBC first separates the agents into two groups of two and four Zealots respectively (Figure 4(e)). The two Zealots attract most of the Zerglings to a place far away from the other four Zealots, and are killed shortly after. Due to the Zerglings' limited sight range, they cannot find the remaining four Zealots. Meanwhile, the four Zealots easily kill the small group of Zerglings near them and then search for the rest. The four Zealots exploit the Zerglings' short sight range: each time, they adjust their positions so that they can be seen by only a small number of Zerglings, and the baited Zerglings are then killed easily (Figure 4(f)). For VBC, communication only occurs at the beginning of the episode, when the Zealots separate into two groups, and near the end of the episode, when the four Zealots adjust their positions. Both FComm and SchedNet learn the strategy of splitting the Zealots into two groups, but they fail to fine-tune their positions to kill the remaining Zerglings.

For symmetrical battles, the tasks are less challenging, and we see less disparity in the performance of the algorithms. For 2s3z and 3s5z, the VDN agents attack the enemies blindly without any cooperation, while the QMIX agents learn to focus fire and protect the Stalkers. The agents of VBC, FComm and SchedNet adopt a more aggressive policy: the allied Zealots try to surround and kill the enemy Zealots first, and then attack the enemy Stalkers in collaboration with the allied Stalkers. This is extremely effective because Zealots counter Stalkers, so it is important to kill the enemy Zealots before they damage the allied Stalkers. For VBC, communication occurs mostly while the allied Zealots try to surround the enemy Zealots. For MMM, almost all the methods learn the optimal policy, namely killing the Medivac first and then attacking the rest of the enemy units cooperatively.

7 Conclusion

In this work, we propose VBC, a simple and effective approach for achieving efficient communication among agents in MARL. By constraining the variance of the exchanged messages during the training phase, VBC improves communication efficiency while enabling better cooperation among the agents. The results on the StarCraft Multi-Agent Challenge indicate that VBC significantly outperforms other state-of-the-art methods in terms of both winning rate and communication overhead.


  • J. Foerster, I. A. Assael, N. de Freitas, and S. Whiteson (2016) Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2137–2145. Cited by: §1, §2.
  • J. N. Foerster, C. A. S. de Witt, G. Farquhar, P. H. Torr, W. Boehmer, and S. Whiteson (2018a) Multi-agent common knowledge reinforcement learning. arXiv preprint arXiv:1810.11702. Cited by: §6, §9.1.1.
  • J. N. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson (2018b) Counterfactual multi-agent policy gradients. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §2, §6, §9.1.1.
  • M. Hausknecht and P. Stone (2015) Deep recurrent q-learning for partially observable mdps. In 2015 AAAI Fall Symposium Series, Cited by: §3.
  • T. Jaakkola, M. I. Jordan, and S. P. Singh (1994) Convergence of stochastic iterative dynamic programming algorithms. In Advances in neural information processing systems, pp. 703–710. Cited by: §8.1.
  • J. Jiang and Z. Lu (2018) Learning attentional communication for multi-agent cooperation. In Advances in Neural Information Processing Systems, pp. 7254–7264. Cited by: §1, §2.
  • D. Kim, S. Moon, D. Hostallero, W. J. Kang, T. Lee, K. Son, and Y. Yi (2019) Learning to schedule communication in multi-agent reinforcement learning. arXiv preprint arXiv:1902.01554. Cited by: §2, §6.1.
  • J. Kober, J. A. Bagnell, and J. Peters (2013) Reinforcement learning in robotics: a survey. The International Journal of Robotics Research. Cited by: §1.
  • R. Lowe, Y. Wu, A. Tamar, J. Harb, O. P. Abbeel, and I. Mordatch (2017) Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, pp. 6379–6390. Cited by: §2.
  • M. Lanctot, V. Zambaldi, A. Gruslys, A. Lazaridou, K. Tuyls, J. Pérolat, D. Silver, and T. Graepel (2017) A unified game-theoretic approach to multiagent reinforcement learning. In Advances in Neural Information Processing Systems, Cited by: §1.
  • F. S. Melo (2001) Convergence of q-learning: a simple proof. Institute Of Systems and Robotics, Tech. Rep, pp. 1–4. Cited by: §8.1.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller (2013) Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602. Cited by: §1, §2.
  • P. Peng, Y. Wen, Y. Yang, Q. Yuan, Z. Tang, H. Long, and J. Wang (2017) Multiagent bidirectionally-coordinated nets: emergence of human-level coordination in learning to play starcraft combat games. arXiv preprint arXiv:1703.10069. Cited by: §2, §6, §9.1.1.
  • T. Rashid, M. Samvelyan, C. S. de Witt, G. Farquhar, J. Foerster, and S. Whiteson (2018) QMIX: monotonic value function factorisation for deep multi-agent reinforcement learning. arXiv preprint arXiv:1803.11485. Cited by: §2, §3, §6.1, §6, §9.1.1.
  • M. Samvelyan, T. Rashid, C. S. de Witt, G. Farquhar, N. Nardelli, T. G. J. Rudner, C. Hung, P. H. S. Torr, J. Foerster, and S. Whiteson (2019) The StarCraft Multi-Agent Challenge. CoRR abs/1902.04043. Cited by: §1, §6, §9.1.1.
  • S. Shalev-Shwartz, S. Shammah, and A. Shashua (2016) Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295. Cited by: §1.
  • [17] StarCraft official game site. Note: https://starcraft2.com/ Cited by: §6, §9.1.1.
  • S. Sukhbaatar, R. Fergus, et al. (2016) Learning multiagent communication with backpropagation. In Advances in Neural Information Processing Systems, pp. 2244–2252. Cited by: §1, §2.
  • P. Sunehag, G. Lever, A. Gruslys, W. M. Czarnecki, V. Zambaldi, M. Jaderberg, M. Lanctot, N. Sonnerat, J. Z. Leibo, K. Tuyls, et al. (2017) Value-decomposition networks for cooperative multi-agent learning. arXiv preprint arXiv:1706.05296. Cited by: §1, §2, §3, §4.1, §6.1.
  • A. Tampuu, T. Matiisen, D. Kodelja, I. Kuzovkin, K. Korjus, J. Aru, J. Aru, and R. Vicente (2017) Multiagent cooperation and competition with deep reinforcement learning. PloS one 12 (4), pp. e0172395. Cited by: §2.
  • M. Tan (1993) Multi-agent reinforcement learning: independent vs. cooperative agents. In Proceedings of the tenth international conference on machine learning. Cited by: §1, §2.
  • [22] Video demo. Note: https://bit.ly/2VFkvCZ Cited by: §1, §6.3.

8 Appendix

8.1 Proof of Theorem 1

Theorem 2.

Assume , , . Also assume the number of possible actions and states are finite. By performing equation 2 iteratively, we have , as , where satisfies .
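The inequalities in the theorem statement are not reproduced in this excerpt. As a point of reference only (an assumption on our part, not a reconstruction of the missing text), convergence results of this type, e.g. Jaakkola et al. [5], typically require the learning rates $\alpha_t$ to satisfy the standard stochastic-approximation conditions:

```latex
0 \le \alpha_t \le 1, \qquad
\sum_{t=0}^{\infty} \alpha_t = \infty, \qquad
\sum_{t=0}^{\infty} \alpha_t^2 < \infty,
```

together with bounded rewards and, as stated in the theorem, finite state and action spaces.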


The proof is based on [5] and  [11]. If we subtract from equation 2 in the paper, and rearrange the equation, we get:


where . Let , and we have:


Decompose into two random processes, and , where , we have the following two random iterative processes:


From Theorem 1 of [11], we know that converges to zero w.p. 1. From equation , we notice that:


Therefore we have . This is linear in  and converges to a nonpositive number as k approaches infinity. However, because  is always greater than or equal to zero, it must converge to a number between 0 and . Therefore we get:


as k approaches infinity. ∎

9 VBC training algorithm

The detailed training algorithm for VBC is shown in Algorithm 2.

1 Input: initial parameters for the agent network and the mixing network, the discount factor , and the minibatch size B.
2 for each training episode do
3       Observe the initial state .
4       for each timestep do
5             Run Algorithm 1 for each agent and obtain the action-value vector of each agent i.
6             Pick the actions that maximize  for each agent i. Execute the actions and observe the joint reward and the next state .
7             Store the tuple in the replay buffer.
8       Sample a minibatch S from the replay buffer.
9       For each sample in the minibatch, get the observation of each agent i and use the agent network to compute . Then use the action values as the input to the mixing network, which yields the joint action-value function at t.
10      Compute , and .
11      Calculate the loss . Update .
Algorithm 2 Training algorithm of VBC
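To make the structure of Algorithm 2 concrete, here is a minimal, self-contained sketch of the training loop. The environment, agent network, and mixing network are stand-in stubs: `ToyEnv`, `agent_q_values`, and `mix` are hypothetical names of our own, and the elementwise sum in `mix` is a VDN-style simplification, not the paper's learned mixing network.

```python
import random
from collections import deque

GAMMA = 0.99          # discount factor
BATCH_SIZE = 32       # minibatch size B (the paper's training batch size)
replay_buffer = deque(maxlen=5000)   # stores the most recent episodes

def agent_q_values(obs, agent_id):
    """Stub for the per-agent network (Algorithm 1 in the paper)."""
    return [random.random() for _ in range(4)]   # 4 toy actions

def mix(local_qs, state):
    """Stub mixer: a plain sum, i.e. a VDN-style simplification."""
    return sum(local_qs)

class ToyEnv:
    """Tiny two-agent environment used only to make the sketch runnable."""
    def __init__(self, n_agents=2, horizon=5):
        self.n_agents, self.horizon = n_agents, horizon
    def reset(self):
        self.t = 0
        return self.t
    def observe(self, i):
        return (self.t, i)
    def step(self, actions):
        self.t += 1
        return float(sum(actions)), self.t, self.t >= self.horizon

def run_episode(env, epsilon):
    """Roll out one episode (lines 2-7 of Algorithm 2) into the buffer."""
    episode, state = [], env.reset()
    done = False
    while not done:
        obs = [env.observe(i) for i in range(env.n_agents)]
        qs = [agent_q_values(obs[i], i) for i in range(env.n_agents)]
        # epsilon-greedy over each agent's local action values
        actions = [random.randrange(len(q)) if random.random() < epsilon
                   else max(range(len(q)), key=q.__getitem__) for q in qs]
        reward, next_state, done = env.step(actions)
        next_obs = [env.observe(i) for i in range(env.n_agents)]
        episode.append((state, obs, actions, reward,
                        next_state, next_obs, done))
        state = next_state
    replay_buffer.append(episode)

def train_step():
    """Squared TD loss over a sampled minibatch (lines 8-11)."""
    if len(replay_buffer) < BATCH_SIZE:
        return None
    loss = 0.0
    for episode in random.sample(list(replay_buffer), BATCH_SIZE):
        for state, obs, actions, reward, next_state, next_obs, done in episode:
            q_tot = mix([agent_q_values(o, i)[a]
                         for i, (o, a) in enumerate(zip(obs, actions))], state)
            next_max = mix([max(agent_q_values(o, i))
                            for i, o in enumerate(next_obs)], next_state)
            target = reward + (0.0 if done else GAMMA * next_max)
            loss += (target - q_tot) ** 2
    return loss   # in practice, backpropagated through agent and mixing nets
```

In a real implementation the stubs are replaced by the recurrent agent network and the mixing network, and the loss is minimized by gradient descent rather than merely computed.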

9.1 Experiment settings and hyperparameters

In this section, we describe in detail the experiment settings.

9.1.1 StarCraft micromanagement challenges

StarCraft II [17] is a real-time strategy (RTS) game that has recently been adopted as a challenging benchmark by the reinforcement learning community [14, 3, 13, 2]. In this work, we concentrate on the decentralized micromanagement problem in StarCraft II, in which the user controls an army that consists of several units (agents). We consider battle scenarios where two armies, one controlled by the user and the other by the built-in StarCraft II AI, are placed on the same map and try to defeat each other. The agent types can differ both between the two armies and within the same army. The goal of the user is to control the allied units wisely so as to kill all the enemy units while minimizing the damage to the health of each individual agent. The difficulty of the computer AI is set to Medium. We consider six battle settings, shown in Table 2. Three of them are symmetrical battles, where the user group and the enemy group are identical in the type and quantity of agents. The other three are asymmetric battles, where the agent types of the user group and the enemy group differ, and the user group contains fewer units than the enemy group. The asymmetric battles are considered harder than the symmetrical ones because of the difference in army size.

During execution, each agent is allowed to perform the following actions: move[direction], attack[enemyid], stop, and no-op. There are four directions for the move operation: east, west, south, and north. Each agent can only attack an enemy within its shooting range. Medivacs cannot attack; instead they heal a partner by performing heal[partnerid]. The number of possible actions per agent ranges from 11 (2s3z) to 30 (6zvs24zerg). Each agent has a limited sight range and can only receive information from the partners or enemies within that range. Furthermore, the shooting range is smaller than the sight range, so an agent cannot attack an opponent it does not observe, and the attack action is unavailable when the opponent is outside the shooting range. Finally, agents can only observe live agents within their sight range and cannot discern agents that are dead or out of range. At each timestep, the joint reward received by the allied units equals the total damage inflicted on the health of the enemy units. Additionally, the agents receive 100 extra points for killing each enemy unit, and 200 extra points for killing all the enemies. The user group wins the battle only if the allied units kill all the enemies within the time limit; it loses if all the allied units are killed or the time limit is reached. The time limits for the different battles are: 120 timesteps for 2s3z and MMM, 150 for 3s5z and 3svs4z, and 200 for 6hvs8z and 6zvs24zerg.
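The reward shaping just described can be made concrete with a short sketch. The function below is illustrative (the name `joint_reward` and the per-unit health dictionaries are our own, not from the SMAC code); it applies the per-timestep damage term plus the 100-point kill bonus and the 200-point victory bonus stated above:

```python
def joint_reward(prev_health, cur_health):
    """Joint reward for one timestep: total damage dealt to enemy units,
    plus 100 points per enemy killed this step, plus 200 points if all
    enemies are now dead. Arguments map enemy unit ids to health values
    before and after the step."""
    damage = sum(max(prev_health[u] - cur_health[u], 0.0)
                 for u in prev_health)
    kills = sum(1 for u in prev_health
                if prev_health[u] > 0 and cur_health[u] <= 0)
    reward = damage + 100.0 * kills
    if all(h <= 0 for h in cur_health.values()):
        reward += 200.0   # all enemy units killed: battle won
    return reward
```

For example, dealing 30 total damage while finishing off one of two enemies yields 130 points, while killing the last remaining enemy adds both the kill and victory bonuses.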

The input observation of each agent consists of a vector that includes the following information about each allied and enemy unit within its sight range: relative x, y coordinates, relative distance, and agent type. For the mixing network of QMIX, the global state vector contains the following elements:

  1. Shield levels, health levels and cooldown levels of all the units at .

  2. The actions taken by all the units at .

  3. The x,y coordinates of all the units relative to the center of the map at .

For all six battles, every agent (allied or enemy, regardless of type) has a sight range of 9 and a shooting range of 6. For additional information, please refer to [15].
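To illustrate the per-agent observation layout described above, here is a hedged sketch. The function name is ours, the one-hot type encoding is an illustrative choice, and the real SMAC feature layout contains additional fields (e.g. shield and health) not shown here:

```python
import math

SIGHT_RANGE = 9.0   # sight range shared by all agents in these battles

def build_observation(me, others, n_types):
    """Per-agent observation: for each other unit inside the sight
    range, append its relative x/y, relative distance, and agent type
    (one-hot encoded here). `me` and each element of `others` are
    (x, y, type) tuples. Units outside the sight range, or dead units
    (simply absent from `others`), contribute no entries at all."""
    mx, my, _ = me
    obs = []
    for (x, y, t) in others:
        dx, dy = x - mx, y - my
        dist = math.hypot(dx, dy)
        if dist > SIGHT_RANGE:
            continue   # invisible to this agent
        obs.extend([dx, dy, dist] + [1.0 if t == k else 0.0
                                     for k in range(n_types)])
    return obs
```

Note that the observation length varies with the number of visible units, which is why agents cannot distinguish dead units from units that are merely out of range.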

Symm.   | 2s3z                   | MMM                                 | 3s5z
User    | 2 Stalkers & 3 Zealots | 1 Medivac, 2 Marauders & 7 Marines  | 3 Stalkers & 5 Zealots
Enemy   | 2 Stalkers & 3 Zealots | 1 Medivac, 2 Marauders & 7 Marines  | 3 Stalkers & 5 Zealots
Unsymm. | 3s_vs_4z               | 6h_vs_8z                            | 6z_vs_24zerg
User    | 3 Stalkers             | 6 Hydralisks                        | 6 Zealots
Enemy   | 4 Zealots              | 8 Zealots                           | 24 Zerglings
Table 2: Agent types of the six battles

9.1.2 Hyperparameter

For the network of agent , at timestep , the raw observation  is first passed through a single-layer MLP, which outputs an intermediate result of size 64. The GRU then takes this intermediate result, as well as the hidden state from the previous timestep, and generates  and . Both have a size of 64. The  is then passed through an FC layer, which generates the local action-value function . The message encoder consists of two FC layers with 196 and 64 units respectively. The combiner performs elementwise summation over the outputs of the local action generator and the message encoders.
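A shape-level sketch of this forward pass follows. All weights are random stand-ins with illustrative names, and the message-encoder output is projected to the action dimension here so that the combiner's elementwise sum is well defined; that projection size is a simplifying assumption on our part, not the paper's specification:

```python
import numpy as np

rng = np.random.default_rng(0)
OBS, HID, ENC, ACT = 30, 64, 196, 11   # 64/196 from the text; OBS/ACT illustrative

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Random stand-in weights (a real implementation would learn these).
W_in = rng.standard_normal((OBS, HID)) * 0.1        # single-layer input MLP
W = {k: rng.standard_normal((2 * HID, HID)) * 0.1   # GRU gate weights
     for k in ("z", "r", "h")}
W_out = rng.standard_normal((HID, ACT)) * 0.1       # local action-value head
W_e1 = rng.standard_normal((HID, ENC)) * 0.1        # message encoder, 196 units
W_e2 = rng.standard_normal((ENC, ACT)) * 0.1        # projected to action dim

def gru_cell(x, h):
    """Minimal GRU cell over 1-D vectors."""
    xh = np.concatenate([x, h])
    z, r = sigmoid(xh @ W["z"]), sigmoid(xh @ W["r"])
    cand = np.tanh(np.concatenate([x, r * h]) @ W["h"])
    return (1.0 - z) * h + z * cand

def agent_forward(obs, h_prev, messages):
    """One agent's step: MLP -> GRU -> FC head, with incoming messages
    encoded and combined by elementwise summation."""
    h = gru_cell(relu(obs @ W_in), h_prev)
    local_q = h @ W_out
    encoded = sum(relu(m @ W_e1) @ W_e2 for m in messages)
    return local_q + encoded, h
```

The sketch exposes the key design point: the recurrent core is shared with the communication path only through the hidden state, so messages modulate the action values additively.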

During training,  decreases linearly from 1.0 to 0.05 over the first  timesteps and is kept constant for the rest of the learning process. The replay buffer stores the most recent 5000 episodes. We perform a test run every 200 training episodes to update the replay buffer. The training batch size is set to 32 and the test batch size to 8. We adopt an RMSprop optimizer with a learning rate of  and .
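The annealing schedule described above can be sketched as follows; the number of annealing timesteps is not reproduced in this excerpt, so it is left as a parameter `t_anneal` (the function name is ours):

```python
def epsilon_at(t, t_anneal, eps_start=1.0, eps_end=0.05):
    """Linear epsilon schedule: decay from eps_start to eps_end over
    the first t_anneal timesteps, then hold eps_end thereafter."""
    frac = min(t / t_anneal, 1.0)   # fraction of the annealing phase elapsed
    return eps_start + frac * (eps_end - eps_start)
```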