Cooperative multi-agent reinforcement learning (MARL) has achieved remarkable success in a variety of challenging problems, such as urban systems Singh et al. (2020), coordination of robot swarms Hüttenrauch et al. (2017) and real-time strategy (RTS) video games Vinyals et al. (2019). Under the paradigm of centralized training with decentralized execution (CTDE) Oliehoek et al. (2008); Kraemer and Banerjee (2016), recent years have seen great progress in MARL methods Lowe et al. (2017); Rashid et al. (2018); Wang et al. (2020a, 2021) aimed at improving scalability and dealing with non-stationarity. However, many CTDE-based MARL methods struggle in scenarios where communication between cooperative agents is necessary during decentralized execution, and miscoordination is notably widespread among such methods Wang et al. (2019). Recent studies on introducing inter-agent communication during the execution phase demonstrate that multi-agent coordination can be significantly improved when communication between agents is allowed, and many multi-agent communicative reinforcement learning (MACRL) methods have been proposed Foerster et al. (2016); Sukhbaatar et al. (2016); Jiang and Lu (2018); Das et al. (2019); Wang et al. (2019, 2020b).
Meanwhile, adversarial machine learning has received extensive attention. Researchers have shown that, with unnoticeable perturbations of input images, an image classifier can be easily fooled by adversaries Goodfellow et al. (2017); Szegedy et al. (2014). Recent works on the security of AI models show that ML Goodfellow et al. (2017) and RL Gleave et al. (2020); Lin et al. (2017) models are vulnerable and exposed to potential attacks, which can degrade their performance and even pose serious threats Nguyen et al. (2016). To improve the robustness of AI models, various attack methods, ranging from white-box to black-box attacks, as well as many defense methods, have been proposed in adversarial ML Carlini and Wagner (2017); Papernot et al. (2017); Tramèr et al. (2018).
Unfortunately, despite its obvious and growing importance, the adversarial problem in MACRL remains largely uninvestigated. Blumenkamp and Prorok (2020) considered adversarial competitive scenarios where agents have self-interests. However, that work focuses on attacking messages in a specific competitive MARL method; the vulnerability of communication in cooperative MARL methods was not discussed, and no defense method was proposed. Mitchell et al. (2020) propose a defensive method for MACRL that replaces the weights generated by the attention model Vaswani et al. (2017) with weights generated by a Gaussian process. However, this method is restricted to attention-based MACRL methods and neglects scenarios where the attacking strategy is learnable.
Motivated by the aforementioned problems, in this paper we consider deployment-time adversarial attacks which aim to disrupt the behaviour of trained MACRL models. Specifically, our contributions are threefold: (i) As a proof of concept, we take a first step towards conducting message attacks on MACRL methods under a formulation in which one member of the team is disguised as a cooperative agent but is in fact malicious and can send malicious messages to disrupt multi-agent coordination; (ii) We then propose a learnable two-stage message filter as our defense method: it first predicts which message sender is the malicious agent, and once the malicious agent is identified, it reconstructs the message. This two-stage filter can largely maintain coordination under message attack; (iii) Since the malicious agent can adapt to the changing and improving defensive policies of the benign agents, a defense model trained against a fixed attacker would eventually fail. To address the resulting arms race and improve the robustness of MACRL methods, we model the adversarial MACRL problem as a two-player zero-sum game and propose -MACRL, based on the Policy-Space Response Oracle (PSRO) Lanctot et al. (2017); Muller et al. (2019), to find an equilibrium in the joint policy space of attacker and defender. Empirically, we demonstrate that MACRL methods are vulnerable to message attacks, while our defense method and the game-theoretic training framework can effectively improve their robustness.
2 Related Work
There has been extensive research on encouraging effective communication between agents to improve performance on cooperative or competitive tasks Foerster et al. (2016); Sukhbaatar et al. (2016); Singh et al. (2018). Among recent advances, some works design communication mechanisms that address when to communicate Kim et al. (2019); Singh et al. (2018); Jiang and Lu (2018); other lines of work, e.g., TarMAC Das et al. (2019), focus on whom to communicate with. These works determine two fundamental elements of communication, i.e., the message sender and the receiver. Apart from these two elements, the message itself is a third crucial element, i.e., what to communicate: Jaques et al. (2019) propose to maximize the social influence of messages; Kim et al. (2021) encode messages such that they contain agents' intentions; other works learn to send succinct messages that meet the limitations of communication bandwidth Wang et al. (2020b, 2019); Zhang et al. (2020). In adversarial communication, the attacker instead needs to send messages that have the greatest damaging effect while keeping the probability of being detected low.
Adversarial training is a prevalent paradigm for training robust models to defend against potential attacks Goodfellow et al. (2017); Szegedy et al. (2014). Recent literature considers two types of attacks Carlini and Wagner (2017); Papernot et al. (2017); Tramèr et al. (2018): black-box and white-box. Under the black-box threat model, attackers do not have access to the parameters of the classification or prediction model, which is very common in real-world scenarios; in the white-box threat model, attackers have complete access to the architecture and parameters of the model, including any potential defense mechanisms. The latter assumption is often too idealistic and may not be applicable in many adversarial scenarios; we therefore consider the black-box attack setting in our problem formulation.
Recent advances in adversarial attacks motivate researchers to investigate adversarial problems and environment poisoning in RL Gleave et al. (2020); Lin et al. (2017); Xu et al. (2021). Lin et al. (2017) propose two tactics for adversarially attacking deep RL agents: the strategically-timed attack, which minimizes an agent's reward by injecting noise into states at selected time steps of an episode, and the enchanting attack, which lures the agent to a designated target state via generative planning methods. Gleave et al. (2020) consider attacking an RL agent simply by choosing an adversarial policy acting in a multi-agent environment so as to create natural observations that are adversarial. However, little work has been done on adversarial MACRL. As discussed in Sec. 1, recent works Blumenkamp and Prorok (2020); Mitchell et al. (2020) either focus on attacking a specific competitive setting with a simple attack method and no effective defense, or defend only a specific communication method. Our work fills this gap in MACRL.
3 Background

Dec-POMDP-Com. The decentralised partially observable Markov decision process with communication (Dec-POMDP-Com) is proposed to describe a fully cooperative MACRL problem Oliehoek et al. (2016). Concretely, a Dec-POMDP-Com can be formulated as a tuple ⟨N, S, A, M, P, R, O, Ω, γ⟩, where S denotes the state space of the environment and M is the message space. Each agent i ∈ N chooses an action a_i ∈ A at each time step, giving rise to a joint action vector a. P(s′ | s, a) is a Markovian transition function. Every agent shares the same joint reward function R(s, a), and γ ∈ [0, 1) is the discount factor. Due to partial observability, each agent only has an individual partial observation o_i ∈ Ω, drawn according to the observation function O(s, i). Each agent holds two policies, i.e., an action policy π_i(a_i | τ_i, m_i) and a message policy π_i^m(m_i′ | τ_i, m_i), both of which are conditioned on the agent's action-observation history τ_i and the incoming messages m_i aggregated by the communication protocol.
Adversarial Training. Attackers try to fool a DNN by corrupting the input Goodfellow et al. (2017) via an ℓ_p-norm attack. The attacker carefully generates artificial perturbations δ to manipulate the input of the model, aiming to fool the DNN into making incorrect predictions. The attacker finds the optimal perturbation through the optimization:

min_δ D(x, x + δ)  s.t.  f(x + δ) ≠ f(x),

where x is the input, f is the DNN model, D is the attack optimization objective, e.g., a metric for input similarity, and D(x, x + δ) measures the distance between x and x + δ. Though posing threats to the security of AI models, such adversarial attacks ignite research on robustness in ML, which can further improve the robustness and reliability of AI models in the deployment phase.
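For a one-step ℓ∞ attack on a toy linear classifier, the perturbation search above reduces to the familiar sign-of-gradient update. The sketch below is illustrative only (the model, the margin-based loss, and `eps` are our choices, not the paper's):

```python
import numpy as np

def fgsm_perturbation(x, w, b, y, eps):
    """One-step l_inf attack (FGSM-style) on a toy linear classifier.

    The classifier scores s = w @ x + b; the gradient of the margin
    (s_rival - s_y) w.r.t. x is available analytically for a linear
    model, and the input is nudged eps along its sign.
    """
    s = w @ x + b
    # runner-up class: highest score among classes != y
    rival = int(np.argmax(np.where(np.arange(len(s)) == y, -np.inf, s)))
    grad = w[rival] - w[y]          # d(s_rival - s_y)/dx
    return x + eps * np.sign(grad)

# toy example: 2-class linear model, input initially classified as class 0
w = np.array([[1.0, 0.0], [0.0, 1.0]])
b = np.zeros(2)
x = np.array([0.6, 0.4])
x_adv = fgsm_perturbation(x, w, b, y=0, eps=0.3)
print((w @ x).argmax(), (w @ x_adv).argmax())  # prediction flips: 0 -> 1
```

The perturbation stays inside the ℓ∞ ball of radius `eps` by construction, mirroring the distance constraint D(x, x + δ) in the formulation above.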
Two-player Zero-sum Games. We formulate the attack-defense problem in adversarial communication as a two-player zero-sum game. Formally, a two-player zero-sum game can be described by a tuple ⟨Π, U⟩, where Π = Π_1 × Π_2 is the set of policies (or strategies), one for each player, and U : Π → ℝ² is the utility function that calculates the payoff for each player given their joint policy Lanctot et al. (2017). For the game to be zero-sum, for any joint policy π = (π_1, π_2), the payoff to player 1 must be the negative of the payoff to player 2, i.e., U_1(π) = −U_2(π). Players try to maximize their own expected utility. The set of best responses to a policy π_{−i} is defined as the set of policies that maximally exploit it: BR(π_{−i}) = argmax_{π_i} U_i(π_i, π_{−i}). The exploitability of a joint policy π is defined as Expl(π) = Σ_i [U_i(BR(π_{−i}), π_{−i}) − U_i(π)]. A joint policy π is a Nash equilibrium (NE) if Expl(π) = 0, at which point no player has anything to gain by changing only their own policy.
4 Achieving Robustness in MACRL
In this section, we first present our problem formulation for adversarial MACRL. Next, we introduce the method for building an optimal attacking scheme in multi-agent communication. Then, we propose a learnable two-stage message filter to defend against possible attacks. Finally, to make the defense system robust, we formulate the attack-defense problem as a two-player zero-sum game and apply a game-theoretic framework, the Policy-Space Response Oracle (PSRO) Lanctot et al. (2017), to approximate a Nash equilibrium policy for the message filter (defender). In this way, we improve the worst-case performance of the message filter and thereby enhance its robustness.
4.1 Problem Formulation: Adversarial Communication in MARL
As introduced in Sec. 3, in Dec-POMDP-Com, each agent keeps two policies, i.e., the action policy and the message policy. In adversarial communication where there are malicious agents, we assume that each malicious agent holds a third, private policy, π^adv, which generates an adversarial message δ based on its action-observation history τ, its received messages and the message it originally intended to send. To reduce the likelihood of being detected, apart from the adversarial policy π^adv, malicious agents strictly follow their former action policy and message policy, trying to behave like benign agents. Malicious agents could send messages by convexly combining their original messages with adversarial messages, i.e., m̃ = α m + (1 − α) δ, or simply by summing the two up. Fig. 1(a) presents the overall attacking procedure. One agent is malicious and tries to generate adversarial messages to disrupt cooperation. The adversarial message it sends, together with the normal messages sent by the other agents, is processed by the (algorithm-related) communication protocol, generating a contaminated incoming message for each benign agent. From a benign agent's perspective, under such an attack the estimated Q-values will change, which may lead to incorrect decisions, e.g., the action with the highest Q-value shifts from the optimal action to a suboptimal one. We model the message attack task as a Markov decision process (MDP). The optimization objective of the attacker is to minimize the joint accumulated team rewards. We make the following assumptions to make the problem both practical and tractable.
Assumption 1 (Byzantine Failure Kirrmann (2015)). Agents have imperfect information on who the adversary is.
Assumption 2 (Concealment). Malicious agents only attack the messages in communication, but choose actions normally like benign agents in order to remain covert.
Assumption 3. For simplicity, we consider the situation where there is only one adversarial agent; the multi-adversary case is left to future work.
To defend against the attack, as in Fig. 1(b), we propose to add a defender to the communication protocol module. Instead of distributing the contaminated message directly to the benign agents, the defender first transforms the contaminated message into a recovered one. With such a transformation, the benign agents are able to make the correct decisions.
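The convex-combination attack with a distance budget, described in the formulation above, can be sketched as follows. The values of `alpha` and `eps`, and the ℓ2 projection used to enforce the budget, are our illustrative choices, not the paper's exact mechanism:

```python
import numpy as np

rng = np.random.default_rng(0)

def attack(m, delta, alpha=0.5, eps=0.5):
    """Convexly mix the honest message m with an adversarial message delta,
    then project back so the perturbed message stays within eps of m."""
    m_tilde = alpha * m + (1.0 - alpha) * delta
    gap = m_tilde - m
    norm = np.linalg.norm(gap)
    if norm > eps:                      # enforce the distance constraint
        m_tilde = m + gap * (eps / norm)
    return m_tilde

m = np.array([0.2, -0.1, 0.4])          # honest message (dimension 3)
delta = rng.normal(size=3) * 10.0       # large adversarial message
m_tilde = attack(m, delta)
print(np.linalg.norm(m_tilde - m))      # <= 0.5 by construction
```

Keeping the perturbed message close to the honest one is what makes the attack hard to detect, motivating the learned detector in Sec. 4.3.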
4.2 Learning the Attacking Scheme
Adversarial messages can interfere with agents' decision making, and multi-agent coordination will disintegrate as these messages spread. We train the attacker to model the optimal adversarial message distribution and perform attacks by sampling from that distribution. Concretely, adversarial messages δ are drawn from a multivariate Gaussian distribution whose parameters are given by the attacker's private policy, i.e., δ ∼ N(μ_θ(τ), Σ), where Σ is the covariance matrix. The optimization objective is to minimize the accumulated team rewards subject to a constraint on the distance between the original message m and the perturbed message m̃, i.e., D(m, m̃) ≤ ε. We apply PPO Schulman et al. (2017) to optimize the policy by maximizing the following objective:

J(θ) = E_t [ min( r_t(θ) Â_t, clip(r_t(θ), 1 − η, 1 + η) Â_t ) ],

where r_t(θ) = π_θ(a_t | s_t) / π_θ_old(a_t | s_t), π_θ_old is the policy of the last iteration, η is the clipping ratio, and Â_t is the advantage function. We define the reward of the attacker to be the negative team reward, i.e., r^adv_t = −r_t. We learn the value function V_φ by minimizing the squared error E_t[(V_φ(s_t) − R_t)²], where R_t is the return.
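The clipped surrogate objective can be checked numerically on a toy batch. The log-probabilities and advantages below are made up for illustration; only the clipping mechanics are the point:

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, adv, clip_ratio=0.2):
    """Clipped surrogate objective of PPO (to be maximized).

    For the attacker, adv is computed from the *negative* team reward,
    so maximizing this objective drives the team's return down.
    """
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - clip_ratio, 1.0 + clip_ratio)
    return np.mean(np.minimum(ratio * adv, clipped * adv))

logp_old = np.log(np.array([0.5, 0.5]))
adv = np.array([1.0, -1.0])
# no policy change: ratio = 1 everywhere, surrogate is just mean(adv) = 0
same = ppo_clip_objective(logp_old, logp_old, adv)
# a large policy change: gains beyond the clip range are capped
greedy = ppo_clip_objective(np.log(np.array([0.9, 0.1])), logp_old, adv)
print(same, greedy)  # 0.0 and 0.2
```

Without the clip, the second objective would be mean([1.8, -0.2]) = 0.8; the cap at 1 ± 0.2 keeps each update conservative, which is exactly why PPO is a convenient optimizer for both the attacker here and the defender in Sec. 4.3.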
4.3 Defending against Adversarial Communication
As aforementioned, adversarial communication can severely damage multi-agent cooperation. However, defending against it is non-trivial, because i) benign agents have no prior knowledge of which agent is malicious; ii) the malicious agent can inconsistently appear both cooperative and non-cooperative to avoid being detected; and iii) recovering useful information from a contaminated message is difficult. To address these challenges, we propose a two-stage message filter for the communication protocol module to perform the defense. As in Fig. 2, the message filter consists of an anomaly detector and a message reconstructor. The anomaly detector, parameterized by θ, outputs the probability that each incoming message needs to be recovered, i.e., a categorical distribution over message senders. We sample from this distribution to determine the anomalous message, which is then recovered by the reconstructor. The recovered message and the remaining normal messages are aggregated to determine the messages that each agent will receive.
The optimization objective of the message filter is to maximize the joint accumulated rewards under attack, i.e., the team reward obtained after performing the defending strategy. We utilize PPO to optimize the strategy. To mitigate the local optima induced by unguided exploration in the large message space, we introduce a regularizer which is trained with the guidance of the ground-truth labels of malicious messages. For the reconstructor, we train it by minimizing the distance between the messages before and after attacking. To improve the tolerance of the reconstructor to errors made by the detector, we also use benign messages as training data in addition to malicious ones, in which case the reconstructor should act as an identity mapping. Formally, the two-stage message filter is optimized by maximizing the following function:

J(θ, φ) = E_t [ min( r_t(θ) Â_t, clip(r_t(θ), 1 − η, 1 + η) Â_t ) ] − λ₁ CE(y, p_θ) − λ₂ ‖g_φ(m̃) − m‖²,

where r_t(θ) is the PPO probability ratio and Â_t is the advantage, y is the one-hot ground-truth label of the malicious messages, p_θ is the detector's output distribution, m̃ and m are the messages after and before being attacked, g_φ is the reconstructor, CE is the cross-entropy loss, and λ₁, λ₂ are regularization coefficients.
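The control flow of the two-stage filter can be sketched as below. The detector logits and the prototype-based reconstructor are hypothetical stand-ins for the learned networks; at execution time we pick the argmax of the detector's distribution rather than sampling:

```python
import numpy as np

def detect(scores):
    """Stage 1: softmax the per-sender anomaly logits and pick the most
    suspicious sender (argmax at execution; sampled during training)."""
    p = np.exp(scores - scores.max())
    p /= p.sum()
    return int(np.argmax(p)), p

def reconstruct(message, prototype):
    """Stage 2: placeholder reconstructor. A learned model would map the
    contaminated message back toward the benign distribution; here we
    simply fall back to a prototype (e.g., a mean benign message)."""
    return prototype

def filter_messages(messages, scores, prototype):
    k, _ = detect(scores)
    out = messages.copy()
    out[k] = reconstruct(messages[k], prototype)
    return out, k

msgs = np.array([[0.1, 0.2], [5.0, -4.0], [0.0, 0.3]])  # sender 1 attacked
logits = np.array([0.1, 3.0, 0.2])                      # detector output
clean, flagged = filter_messages(msgs, logits, prototype=np.array([0.1, 0.2]))
print(flagged)  # 1
```

Only the flagged message is rewritten; the untouched messages pass through unchanged, which is why training the reconstructor on benign inputs (as an identity mapping) makes the filter tolerant to detector mistakes.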
4.4 Learning a Less Exploitable Defending Scheme
Although we have introduced the learnable two-stage message filter to defend against adversarial messages, the effectiveness of the defense can rapidly degrade if the attacker becomes aware of the defender and adapts its attacking scheme. To mitigate this, we propose to improve the defending performance in the worst case, i.e., to make the defender less exploitable. Concretely, we formulate the attack-defense problem as a two-player zero-sum game and define the payoff of the defender as the joint accumulated rewards. We solve the game, i.e., approach a Nash equilibrium, by optimizing the following MaxMin objective:

max_{π_d ∈ Π_d} min_{π_a ∈ Π_a} U_d(π_a, π_d),

where π_a and π_d are the policies of the attacker and the defender respectively, Π_a and Π_d denote their policy spaces, and the inner minimizer is the attacker's best response to π_d, i.e., BR(π_d) = argmin_{π_a ∈ Π_a} U_d(π_a, π_d).
We propose -MACRL to approximate the NE of the adversarial communication game. -MACRL is developed upon the Policy-Space Response Oracle (PSRO) Lanctot et al. (2017), a deep learning-based extension of the classic Double Oracle algorithm McMahan et al. (2003) for finding NEs. The workflow of -MACRL is depicted in Fig. 3. At each iteration t, -MACRL keeps a population (policy set) Π^t = (Π_a^t, Π_d^t) of attacker and defender policies, from which we can evaluate the payoff table for the current iteration. Then, a Nash equilibrium, i.e., a meta-strategy σ = (σ_a, σ_d) over the policies in Π^t, is computed for the restricted game by, for example, linear programming or replicator dynamics. Next, we calculate a best response policy to this NE for each player, π_a′ and π_d′, and add them to the population: Π_i^{t+1} = Π_i^t ∪ {π_i′} for i ∈ {a, d}. After extending the population, we complete the payoff table and perform the next iteration. The algorithm converges when the best response policies are already in the population. Practically, -MACRL approximates the utility function by having each policy in one population play against each policy in the other and tracking the average utilities. The approximation of the best response policies is performed by maximizing the attacker's and the defender's objectives, whose full expressions are provided in Sec. 4.2 and Sec. 4.3, respectively. The overall algorithm of -MACRL is given in the Appendix.
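The population-growing loop above can be sketched in its tabular special case, the Double Oracle algorithm, on rock-paper-scissors. Fictitious play serves as the meta-solver and exact best responses as the oracles; the deep variant replaces both with learned approximations:

```python
import numpy as np

# Rock-Paper-Scissors payoff for the row player (zero-sum)
A = np.array([[0, -1, 1],
              [1, 0, -1],
              [-1, 1, 0]], dtype=float)

def restricted_nash(A, rows, cols, iters=2000):
    """Meta-solver: fictitious play on the game restricted to the population."""
    sub = A[np.ix_(rows, cols)]
    cr = np.zeros(len(rows)); cc = np.zeros(len(cols))
    cr[0] = cc[0] = 1.0
    for _ in range(iters):
        cr[np.argmax(sub @ (cc / cc.sum()))] += 1   # row best-responds
        cc[np.argmin((cr / cr.sum()) @ sub)] += 1   # column best-responds
    return cr / cr.sum(), cc / cc.sum()

def double_oracle(A):
    rows, cols = [0], [0]                  # start with one pure policy each
    while True:
        sr, sc = restricted_nash(A, rows, cols)
        x = np.zeros(A.shape[0]); x[rows] = sr   # meta-strategies over
        y = np.zeros(A.shape[1]); y[cols] = sc   # the full action sets
        br_row = int(np.argmax(A @ y))           # oracle best responses
        br_col = int(np.argmin(x @ A))
        grew = False
        if br_row not in rows: rows.append(br_row); grew = True
        if br_col not in cols: cols.append(br_col); grew = True
        if not grew:           # best responses already in the population
            return rows, cols, x, y

rows, cols, x, y = double_oracle(A)
print(sorted(rows), sorted(cols))  # all three actions enter the population
```

Starting from Rock alone, the loop adds Paper, then Scissors, and stops once the oracle's best responses are already in the population, at which point the meta-strategy approximates the uniform NE.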
5 Experiments

In this section, we conduct extensive experiments on different state-of-the-art MACRL methods to answer the following questions: Q1: Are MACRL methods vulnerable to message attacks when a malicious agent in the team sends disruptive messages during the deployment phase? Q2: Can our two-stage message filter effectively defend against message attacks and maintain team coordination when the malicious agent's attack strategy is fixed? Q3: As the malicious agent can adapt to the changing and improving defensive policies of the benign agents, can we learn robust defensive policies for benign agents by introducing -MACRL? We start by providing a taxonomy of MACRL methods, and then select representative algorithms in each category to answer these questions. Additional experimental results can be found in the Appendix.
5.1 Taxonomy of MACRL and Experimental Setup
Taxonomy of MACRL. MACRL methods are commonly categorized by whether communication is implicit or explicit Oliehoek et al. (2016); Ahilan and Dayan (2021). In implicit communication, one agent's actions influence the observations of the other agents, whereas in explicit communication, agents have a separate set of communication actions and exchange information through them. In this paper, we focus on explicit communication because in implicit communication, in order to carry out an attack, the attacker's behaviour is often bizarre, making the attack trivial and easily detectable. With different realistic problems being considered in MACRL, we consider the following categories:
Communication with delay (CD): Communication in the real world is usually not real-time. We can model this by assuming that it takes one or a few time steps for the messages to be received by the targeted agents Das et al. (2019).
Local communication (LC): Messages sometimes cannot be broadcast to all agents due to communication distance limitations or privacy concerns. Therefore, agents need to learn to communicate locally, affecting only those agents within the same communication group Böhmer et al. (2020).
Experimental Setup. Following the above taxonomy, we select representative, state-of-the-art MACRL methods in each category; since methods in a category share common merits and characteristics, exhaustively sweeping the less competitive methods is unnecessary. Our environmental setups are listed in Table 1. We first train the attacker without the presence of a defender. Then, we train our defender against the trained and fixed attacker. Finally, we conduct the training of -MACRL in the presence of both the attacker and the defender. All experiments are carried out with 5 random seeds, and results are shown with a 95% confidence interval. Additional details of the experimental setup can be found in the Appendix.
5.2 Attack and Defense Results
We first answer Q1 by presenting empirical results of message attacks. Then, to answer Q2, we present our defense results to demonstrate that our defense method can effectively defend against agnostic message attacks and maintain coordination.
CommNet Sukhbaatar et al. (2016). We perform a message attack on CommNet in a partially observable grid-world predator-prey (PP) task similar to that used by Böhmer et al. (2020). As depicted in Fig. 4, the decreasing test return under the attacker indicates that coordination between agents can be easily destroyed via deliberate message attacks, causing 40% and 33% decreases on the two punishment settings, respectively. When the performance decreases to preset test-return thresholds (23 and 20 for the two settings), we start training the defender with a trained, fixed-weight attacker attacking inside the team. The defender's performance steadily increases to converged values, which are even as good as the performance without attack.
TarMAC Das et al. (2019). We conduct a message attack on TarMAC in TrafficJunction (TJ) Sukhbaatar et al. (2016). As shown in Table 2, the success rate of TarMAC decreases in both the easy and hard TJ scenarios, demonstrating the vulnerability of TarMAC under malicious message attacks. Armed with the defender, the performance of TarMAC on both scenarios increases and surpasses the performance of TarMAC under attack, illustrating the merit of our defense method.
NDQ Wang et al. (2019). We conduct attacks on two typical SMAC Samvelyan et al. (2019) scenarios where communication between cooperative agents is necessary and the miscoordination problem of full factorization methods is notably widespread Wang et al. (2019). In Fig. 5, both the test won rate and the test return of NDQ decrease during training, demonstrating that NDQ is also vulnerable to malicious message attacks. In the first scenario, the temporary collapses in the defender's test won rate and test return show that the defender is exploring, after which the test won rate quickly converges. In the second scenario, the defender's test won rate likewise converges, demonstrating that our defense method can efficiently recover the coordination under message attack.
Given these extensive experimental results, we conclude that current MACRL methods are not robust when exposed to hostile message attacks, and that our defense method is a practical and effective means to combat them.
5.3 The Robustness of our Defense Method
The robustness of our defensive model is yet to be evaluated. We answer two questions: Q3 (a): whether malicious agents can actually adapt to benign agents’ defense policies and Q3 (b): whether benign agents are able to learn robust policies.
The adaptive attacking policy. To answer Q3 (a), we present the performance curves of the attacker on NDQ and CommNet during the training of the attacker, wherein trained, fixed defender policies are sampled from the defender policy set as introduced in Sec. 4.4. That is, the sampled defender policy is used to defend against the message attack, which serves to train an adaptive attacking policy and further improve the robustness of the defender. As depicted in Fig. 7, with a trained, fixed defender defending against message attacks, the attacker is able to learn adaptive attacking policies, at the cost of consuming more samples compared with training without defenders.
The robustness of the defending policy. We answer Q3 (b) by comparing defenders trained by -MACRL and by the vanilla training method, respectively. Unlike -MACRL, where a population of attacker policies is maintained and attacker policies are sampled for training defenders, in the vanilla training method only a single attacker policy is used to train the defender. We use the expected utility as the metric for comparing the robustness of defenders: the expected best-response utility of the defender trained by -MACRL versus the expected utility of the defender trained by vanilla training. When the former is larger than the latter, we conclude that the defender trained with -MACRL is more robust than its counterpart. Further details can be found in the Appendix.
Evaluated on six scenarios, as illustrated in Fig. 8 (the error bar indicates one standard deviation), the expected utilities of the defenders trained by -MACRL are greater than those of the vanilla-trained defenders on all scenarios, showing that -MACRL successfully trains robust defensive policies. This holds on the two SMAC scenarios where the subject is NDQ, and on the two PP settings where the CommNet models are pre-trained. In the complex TJ scenarios, where cars move along roads to complete assigned routes, a malicious message attack can cause route completion to fail, especially in the hard scenario with 20 cars; yet even in the easy scenario with only 4 cars, we still obtain an improvement. The consistent improvement in utility on all evaluated scenarios indicates that the defenders trained by -MACRL are robust; the detailed numbers are summarised in the Appendix. Intuitively, the robustness of the defender benefits greatly from exploring the policy space of the adversary via -MACRL, which enables the defender to maintain team coordination in the face of attacks.
5.4 How Does the Attacker Attack and How Does the Defender Defend?
We showcase how the attacker attacks and how the defender defends, using the attacker and the defender trained with NDQ, and take the Q-values of all agents and the messages as an example. As shown in Fig. 9, there are three agents: one is taken over and becomes the malicious agent, while the other two are benign. The action space of each agent is eight. White cells in the Q-value tables correspond to actions that are not available at the current step, and values in red are the Q-values of the optimal actions. Following our problem formulation, the malicious agent sends malicious messages (the message dimension is 3) to the two benign agents, changing their Q-values as well as their optimal actions, as shown in Fig. 9 (d) and Fig. 9 (a-b). The defender reconstructs the message (Fig. 9 (e)), and the Q-values of the benign agents are protected (Fig. 9 (c)). The reconstructed messages follow the same distribution as the original ones (Table 3), and the distance between them is small. Note that the partial ordering of the Q-values under defense (Fig. 9 (c)) is unchanged relative to the original (Fig. 9 (a)).
5.5 Ablation Studies

Our defense method consists of two components: the malicious-agent discriminator and the message reconstructor. Here, we conduct ablation studies on our proposed defense method to see how these two components cooperate to defend against message attacks effectively. We run experiments on three scenarios, each corresponding to one subject MACRL method. We train our defense method with the vanilla training method, i.e., we first train an attacker policy without any defense and then train the defender against the trained and fixed attacker policy. In Fig. 10, we first observe that directly replacing the poisonous message with randomly generated messages cannot defend effectively, especially on harder scenarios such as TJ (Hard). Then, as depicted in Fig. 11, we show that without the discriminator, the message reconstructor cannot provide reliable defense performance, especially in PP, where the predators can be fooled and fail to cooperatively catch the prey when exposed to message attacks.
6 Conclusion and Future Work
In this paper, we consider the adversarial MACRL problem. To address it, we first propose a problem formulation for adversarial MACRL; we then propose a message attack model to investigate the vulnerability of communication under attack. To protect multi-agent coordination, we propose a two-stage message filter as an effective defense method. Finally, we propose -MACRL to achieve communication robustness. Empirically, we demonstrate that MACRL methods are not robust to message attacks, and that our defense method and the game-theoretic framework can effectively improve the robustness of MACRL.
Security and robustness are two important issues in MACRL that still remain largely uninvestigated. For future work, more message attacking methods can be proposed to gain a deeper understanding of how messages actually contribute to agents' decentralized execution, and consequently more defense methods can be investigated. Theoretical analysis of the game-theoretic equilibrium between the attack method and the defense method in MACRL is also a future direction.
- Correcting experience replay for multi-agent communication. ICLR. Cited by: §5.1.
- The emergence of adversarial communication in multi-agent reinforcement learning. The 4th Conference on Robot Learning (CoRL 2020). Cited by: §1, §2.
- Deep coordination graphs. In ICML, Cited by: 3rd item, §5.1, §5.2.
Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57. Cited by: §1, §2.
- Tarmac: targeted multi-agent communication. In ICML, pp. 1538–1546. Cited by: Figure 13, §B.2, §1, §2, 1st item, §5.1, §5.2.
- Learning to communicate with deep multi-agent reinforcement learning. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 2145–2153. Cited by: §1, §2.
- Adversarial policies: attacking deep reinforcement learning. In International Conference on Learning Representations, External Links: Cited by: §1, §2.
- Attacking machine learning with adversarial examples. OpenAI. https://blog. openai. com/adversarial-example-research. Cited by: §1, §2, §3.
- Guided deep reinforcement learning for swarm systems. AAMAS Workshop. Cited by: §1.
- Social influence as intrinsic motivation for multi-agent deep reinforcement learning. In ICML, Cited by: §2.
- Learning attentional communication for multi-agent cooperation. In NIPS, pp. 7254–7264. Cited by: §1, §2.
- Learning to schedule communication in multi-agent reinforcement learning. ICLR. Cited by: §2.
- COMMUNICATION in multi-agent reinforcement learning: intention sharing. ICLR. Cited by: §2.
- Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: Appendix D.
- Fault tolerant computing in industrial automation. Switzerland: ABB Research Center. Cited by: Assumption 1.
- Multi-agent reinforcement learning as a rehearsal for decentralized planning. Neurocomputing 190, pp. 82–94. Cited by: §1.
- A unified game-theoretic approach to multiagent reinforcement learning. In NIPS, Cited by: Appendix A, §1, §3, §4.4, §4.
- Tactics of adversarial attack on deep reinforcement learning agents. In IJCAI, Cited by: §1, §2.
- Multi-agent actor-critic for mixed cooperative-competitive environments. In NIPS, Cited by: §1.
- Planning in the presence of cost functions controlled by an adversary. In ICML, Cited by: §4.4.
- Gaussian process based message filtering for robust multi-agent cooperation in the presence of adversarial communication. arXiv preprint arXiv:2012.00508. Cited by: §1, §2.
- A generalized training approach for multiagent learning. In ICML, Cited by: §1.
- Towards a science of security games. In Mathematical Sciences with Multidisciplinary Applications, pp. 347–381. Cited by: §1.
- A concise introduction to decentralized POMDPs. Vol. 1, Springer. Cited by: §3, §5.1.
- Optimal and approximate q-value functions for decentralized POMDPs. JAIR 32, pp. 289–353. Cited by: §1.
- Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia conference on computer and communications security, pp. 506–519. Cited by: §1, §2.
- PyTorch: an imperative style, high-performance deep learning library. In NeurIPS, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.). Cited by: Appendix D.
- QMIX: monotonic value function factorisation for deep multi-agent reinforcement learning. In ICML, Cited by: §1.
- The StarCraft multi-agent challenge. CoRR abs/1902.04043. Cited by: §C.3, §5.1, §5.2.
- Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §4.2.
- Learning when to communicate at scale in multiagent cooperative and competitive tasks. ICLR. Cited by: §2.
- Hierarchical multiagent reinforcement learning for maritime traffic management. In AAMAS, Cited by: §1.
- Learning multiagent communication with backpropagation. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 2252–2260. Cited by: Figure 12, §B.2, §1, §2, §5.1, §5.2.
- Intriguing properties of neural networks. In ICLR, Cited by: §1, §2.
- Ensemble adversarial training: attacks and defenses. In International Conference on Learning Representations. Cited by: §1, §2.
- Attention is all you need. In NIPS, Cited by: §B.2, §1.
- Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575 (7782), pp. 350–354. Cited by: §1.
- QPLEX: duplex dueling multi-agent q-learning. arXiv preprint arXiv:2008.01062. Cited by: §1.
- Learning efficient multi-agent communication: an information bottleneck approach. In ICML, Cited by: §1, §2.
- Learning nearly decomposable value functions via communication minimization. In ICML, Cited by: Figure 14, §B.2, §1, §2, 2nd item, §5.1, §5.2.
- DOP: off-policy multi-agent decomposed policy gradients. In ICML. Cited by: §1.
- Building generalizable agents with a realistic and rich 3d environment. arXiv preprint arXiv:1801.02209. Cited by: §C.2.
- Transferable environment poisoning: training-time attack on reinforcement learning. In AAMAS, Cited by: §2.
- Succinct and robust multi-agent communication with temporal message control. arXiv preprint arXiv:2010.14391. Cited by: §2, 2nd item.
Appendix A Algorithm
We present Algorithm 1, which is mainly built on top of PSRO, to train -MACRL. At each training step, we train the attacker and the defender alternately by exploiting their counterpart's policies via sampling; a new policy is then trained and added to the policy set of each of the two players. We then compute the missing entries of the empirical payoff matrix and finally solve the resulting matrix game.
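As a toy illustration, the alternating loop above can be sketched as follows. The integer policy space and the best-response and payoff functions are hypothetical simplifications; the actual method trains neural attacker/defender policies and applies a meta-solver to the empirical game.

```python
def psro_loop(payoff_fn, init_attacker, init_defender,
              br_attacker, br_defender, iters=3):
    """Double-oracle / PSRO-style training sketch: each iteration adds a
    best response against the opponent's current policy set, then the
    missing entries of the empirical payoff matrix are filled in."""
    attackers, defenders = [init_attacker], [init_defender]
    payoffs = [[payoff_fn(init_attacker, init_defender)]]
    for _ in range(iters):
        # Train (here: select) best responses by exploiting the opponent's set
        attackers.append(br_attacker(defenders))
        defenders.append(br_defender(attackers))
        # Compute the missing entries of the empirical payoff matrix
        payoffs = [[payoff_fn(a, d) for d in defenders] for a in attackers]
        # A meta-solver (e.g. a Nash solver) would be applied to `payoffs`
        # here to obtain a mixture over each player's policy set.
    return attackers, defenders, payoffs
```

Each call to `br_attacker`/`br_defender` stands in for a full inner-loop RL training run in the real algorithm.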
Appendix B Additional Introduction
B.1 Evaluation Metrics
We denote by $u^*$ the expected maximum best response utility of the defender, obtained by evaluating against the attacker policies $\pi_a$ from the attacker policy set $\Pi_a$ during -MACRL training:

$$u^* = \sum_{\pi_a \in \Pi_a} p(\pi_a)\, R\big(\mathrm{br}(\pi_a), \pi_a\big),$$

where $p(\pi_a)$ is the probability of policy $\pi_a$ being chosen from the policy set $\Pi_a$, and $R(\mathrm{br}(\pi_a), \pi_a)$ is the total return averaged over the 16 evaluation episodes, given the best response defender policy $\mathrm{br}(\pi_a)$ and the chosen attacker policy $\pi_a$. In the vanilla training framework there is only one attacker policy to compare with.
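A minimal computation of this quantity might look as follows (function and argument names are illustrative):

```python
def expected_best_response_utility(probs, br_returns):
    """Weighted sum over attacker policies: attacker policy i is chosen
    with probability probs[i], and br_returns[i] is the defender's total
    return (averaged over evaluation episodes, 16 in the paper) under the
    best response to that attacker policy."""
    assert abs(sum(probs) - 1.0) < 1e-8, "mixture weights must sum to 1"
    return sum(p * r for p, r in zip(probs, br_returns))
```

With a single attacker policy the mixture is degenerate and the metric reduces to the plain best-response return.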
B.2 Introduction to the Subject MACRL Methods
CommNet is a simple yet effective multi-agent communicative algorithm in which each agent's incoming message is generated by averaging the messages sent by all agents at the last time step. We present the architecture of CommNet in Fig. 12.
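A minimal sketch of this averaging step, assuming messages are given as an `(n_agents, msg_dim)` list of lists:

```python
def commnet_incoming(messages):
    """CommNet-style aggregation: every agent receives the element-wise
    mean of the messages sent by all agents at the last time step, so
    all agents share the same incoming message."""
    n, d = len(messages), len(messages[0])
    mean_msg = [sum(m[k] for m in messages) / n for k in range(d)]
    return [list(mean_msg) for _ in range(n)]
```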
TarMAC extends CommNet by allowing agents to attend to important parts of the incoming messages via an attention network. Concretely, each agent generates signature and query vectors when sending a message. In the message receiving phase, attention weights for the incoming messages are calculated by comparing the similarity between the receiver's query vector and the signature vector of each incoming message. A weighted sum over all incoming messages is then performed to determine the message an agent receives for decentralized execution. The architecture of TarMAC is shown in Fig. 13.
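The receiving step can be sketched as standard dot-product attention for a single receiving agent (a simplified version; vector names and shapes are assumptions):

```python
import math

def tarmac_receive(query, signatures, values):
    """TarMAC-style message aggregation for one receiver: similarity
    scores between the receiver's query and each sender's signature are
    softmax-normalized, and the incoming message is the weighted sum of
    the senders' message values."""
    scores = [sum(q * s for q, s in zip(query, sig)) for sig in signatures]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    return [sum(w * v[k] for w, v in zip(weights, values)) for k in range(dim)]
```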
NDQ achieves nearly decomposable Q-functions via communication minimization. Specifically, it trains the communication model by maximizing the mutual information between agents' action selection and communication messages while minimizing the entropy of messages between agents. Each agent broadcasts messages to all other agents. The sent messages are sampled from learned distributions optimized by the mutual information objective. The overview of NDQ is in Fig. 14.
Appendix C Experimental Setup
C.1 Predator-Prey

There are 3 predators trying to catch 6 prey on a grid. Each predator has a local observation of a sub-grid centered around it and can move in one of the 4 compass directions, remain still, or perform the catch action. The prey move randomly; a prey is caught if at least two near-by predators try to catch it simultaneously. The predators obtain a team reward of 10 for each successful catch and are punished for each failed catch action. A prey is removed from the grid after being caught. An episode ends after 60 steps or when all prey have been caught.
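The capture rule described above might be implemented as follows (the coordinate scheme and the adjacency test are assumptions; "near-by" is taken to mean Manhattan distance 1):

```python
def prey_caught(prey_pos, catching_predators):
    """A prey is caught only if at least two predators adjacent to it
    perform the catch action simultaneously.

    prey_pos: (row, col) of the prey.
    catching_predators: positions of predators taking the catch action."""
    def adjacent(p, q):
        return abs(p[0] - q[0]) + abs(p[1] - q[1]) == 1
    catchers = sum(1 for pos in catching_predators if adjacent(pos, prey_pos))
    return catchers >= 2
```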
C.2 Traffic Junction

In TJ (Traffic Junction) scenarios, there are multiple cars, and each car's task is to complete its pre-assigned route. Each car has a limited observation of a region around it, but is free to communicate with all other cars. The action space for each car at every timestep is gas and brake. The reward at each timestep combines a time penalty proportional to $\tau$, the number of timesteps since the activation of the car, and a collision penalty. Once a car completes its route, it becomes available to be sampled and spawned in the environment with a new route assignment. We do not carry out attacks on House3D due to copyright disputes (https://futurism.com/tech-suing-facebook-princeton-data).
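Under the description above, the per-timestep reward could be sketched as follows (the constant values are illustrative assumptions, not values taken from this paper):

```python
def tj_reward(tau, collided, r_time=-0.01, r_coll=-10.0):
    """Traffic-junction reward sketch: a time penalty that grows with
    tau, the number of timesteps since the car's activation, plus a
    collision penalty whenever the car is in a collision.
    r_time and r_coll are assumed example constants."""
    return tau * r_time + (r_coll if collided else 0.0)
```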
C.3 StarCraft II
We consider SMAC combat scenarios where the enemy units are controlled by the StarCraft II built-in AI (difficulty level set to medium), and each of the ally units is controlled by a learning agent. The units of the two groups can be asymmetric. The action space is discrete: noop, move[direction], attack[enemy id], and stop. Agents receive a globally shared reward equal to the total damage dealt to the enemy units at each timestep.
bane_vs_hM: 3 Banelings try to kill a Hydralisk assisted by a Medivac. Three Banelings together are just enough to blow up the Hydralisk. Therefore, to win the game, the 3 Baneling ally units must not give the Hydralisk chances to rest while the Medivac restores its health.
1o2r_vs_4r: Ally units consist of 1 Overseer and 2 Roaches. The Overseer can find the 4 Reapers and then notify its teammates to kill the invading enemy units, the Reapers, in order to win the game. At the start of an episode, ally units spawn at a random point on the map while enemy units are initialized at another random point. Given that only the Overseer knows the position of the enemy units, learning to deliver this information to its teammates is vital for effectively winning the combat.
Appendix D Hyperparameters
We train all the subject MACRL methods with the hyperparameters used in their papers to reproduce their communicative performance. We use the Adam optimizer with the best learning rate to train the attacker model as well as the discriminator and the message reconstructor of the defender model, keeping the default hyperparameters of the Adam optimizer. We use PyTorch to implement all the deep neural network models. We carry out all experiments on a machine with 8 NVIDIA Tesla V100 16GB GPUs and 44 Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz cores. We use 5 random seeds in our experiments.
Appendix E Experimental Results
We present the performance of the best response strategies on NDQ, CommNet and TarMAC in Table 4.
Table 4 reports the test win rate (Test Won %) and the test success rate (Test Success %) for each method.
We show the threshold of the message perturbation used in each scenario of each subject MACRL method in Tables 5, 6 and 7. The subject MACRL methods use different threshold values for each scenario. Some perturbation threshold values are large, for example those in Tables 6 and 7. The reason is that the range of the means of the original message distributions is large; for instance, the mean of the message distribution in NDQ can span a large range.