Mis-spoke or mis-lead: Achieving Robustness in Multi-Agent Communicative Reinforcement Learning

by   Wanqi Xue, et al.
Nanyang Technological University

Recent studies in multi-agent communicative reinforcement learning (MACRL) demonstrate that multi-agent coordination can be significantly improved when communication between agents is allowed. Meanwhile, advances in adversarial machine learning (ML) have shown that ML and reinforcement learning (RL) models are vulnerable to a variety of attacks that significantly degrade the performance of learned behaviours. However, despite its obvious and growing importance, the combination of adversarial ML and MACRL remains largely uninvestigated. In this paper, we make the first step towards conducting message attacks on MACRL methods. In our formulation, one agent in the cooperating group is taken over by an adversary and can send malicious messages to disrupt a MACRL-based coordinated strategy during the deployment phase. We further our study by developing a defence method via message reconstruction. Finally, we address the resulting arms race, i.e., we consider the ability of the malicious agent to adapt to the changing and improving defensive communicative policies of the benign agents. Specifically, we model the adversarial MACRL problem as a two-player zero-sum game and then utilize Policy-Space Response Oracle to achieve communication robustness. Empirically, we demonstrate that MACRL methods are vulnerable to message attacks, while our defence method and the game-theoretic framework can effectively improve the robustness of MACRL.




1 Introduction

Cooperative multi-agent reinforcement learning (MARL) has achieved remarkable successes in a variety of challenging problems such as urban systems Singh et al. (2020), coordination of robot swarms Hüttenrauch et al. (2017) and real-time strategy (RTS) video games Vinyals et al. (2019). With the paradigm of centralized training with decentralized execution (CTDE) Oliehoek et al. (2008); Kraemer and Banerjee (2016), recent years have seen great progress in MARL methods Lowe et al. (2017); Rashid et al. (2018); Wang et al. (2020a, 2021) that aim to improve scalability and deal with non-stationarity. However, many CTDE-based MARL methods struggle in scenarios where communication between cooperative agents is necessary during decentralized execution, and miscoordination is notably widespread in such scenarios Wang et al. (2019). Recent studies on inter-agent communication during the execution phase demonstrate that multi-agent coordination can be significantly improved when communication between agents is allowed, and many multi-agent communicative reinforcement learning (MACRL) methods have been proposed Foerster et al. (2016); Sukhbaatar et al. (2016); Jiang and Lu (2018); Das et al. (2019); Wang et al. (2019, 2020b).

Meanwhile, adversarial machine learning has received extensive attention. Researchers have shown that, with unnoticeable perturbations on input images, an image classifier can be easily fooled by adversaries Goodfellow et al. (2017); Szegedy et al. (2014). Recent works on security issues of AI models show that ML Goodfellow et al. (2017) and RL Gleave et al. (2020); Lin et al. (2017) models are vulnerable and can be easily exposed to potential attacks, which can degrade the models' performance and even pose further threats Nguyen et al. (2016). To improve the robustness of AI models, various attack methods, ranging from white-box attacks to black-box attacks, as well as many defense methods in adversarial ML have been proposed Carlini and Wagner (2017); Papernot et al. (2017); Tramèr et al. (2018).

Unfortunately, despite its obvious and growing importance, the adversarial problem in MACRL remains largely uninvestigated. Blumenkamp and Prorok (2020) considered adversarial competitive scenarios where agents had self-interests. However, their work focuses on attacking messages in a specific competitive MARL method; the vulnerability of communication in cooperative MARL methods was not discussed, and no defense methods were proposed. Mitchell et al. (2020) propose a defensive method in MACRL that replaces the weights generated by the attention model Vaswani et al. (2017) in a MACRL method with weights generated by a Gaussian process. However, this method is restricted to attention-based MACRL methods and neglects scenarios where attacking strategies are learnable.

Motivated by the aforementioned problems, in this paper we consider deployment-time adversarial attacks which aim to disrupt the behaviour of trained models in MACRL. Specifically, our contributions are threefold: (i) As a proof of concept, we make a first step towards conducting message attacks on MACRL methods in our proposed formulation, where one of the cooperative agents is in fact malicious, disguises itself as a cooperative agent, and sends malicious messages to disrupt the multi-agent coordination; (ii) We then propose a learnable two-stage message filter as our defense method. It first predicts which message sender is the malicious agent and, once the malicious agent is identified, reconstructs its message. This two-stage filter can largely maintain coordination under message attack; (iii) As the malicious agent can adapt to the changing and improving defensive policies of the benign agents, a defense model trained against a fixed attacker would eventually fail to defend. To address the resulting arms race and improve the robustness of MACRL methods, we model the adversarial MACRL problem as a two-player zero-sum game and propose -MACRL, which is based on Policy-Space Response Oracle (PSRO) Lanctot et al. (2017); Muller et al. (2019), to find the equilibrium in the policy space of both attacker and defender. Empirically, we demonstrate that MACRL methods are vulnerable to message attacks, while our defense method and the game-theoretic training framework can effectively improve the robustness of MACRL methods.

2 Related Work

There has been extensive research on encouraging effective communication between agents for improving performance on cooperative or competitive tasks Foerster et al. (2016); Sukhbaatar et al. (2016); Singh et al. (2018). Among the recent advances, some design communication mechanisms to address the problem of when to communicate Kim et al. (2019); Singh et al. (2018); Jiang and Lu (2018); other lines of work, e.g., TarMAC Das et al. (2019), focus on whom to communicate with. These works determine the two fundamental elements in communication, i.e., the message sender and the receiver. Apart from these two elements, the message itself is another crucial element, i.e., what to communicate: Jaques et al. (2019) propose to maximize the social influence of messages. Kim et al. (2021) encode messages such that they contain agents' intentions. Some other works learn to send succinct messages to meet the limitations of communication bandwidth Wang et al. (2020b, 2019); Zhang et al. (2020). In adversarial communication, the attacker needs to send messages that have the greatest damaging effect and a low probability of being detected.

Adversarial training is a prevalent paradigm for training robust models to defend against potential attacks Goodfellow et al. (2017); Szegedy et al. (2014). Recent literature has considered two types of attacks Carlini and Wagner (2017); Papernot et al. (2017); Tramèr et al. (2018): black-box and white-box attacks. Under the black-box attack model, attackers do not have access to the parameters of the classification or prediction model, which is very common in real-world scenarios. Under the white-box attack model, attackers have complete access to the architecture and parameters of the model, including potential defense mechanisms against adversaries; this assumption is too idealistic and may not be applicable to many adversarial scenarios. We consider the black-box attack setting in our proposed problem formulation.

Recent advances in adversarial attacks motivate researchers to investigate adversarial problems and environment poisoning in RL Gleave et al. (2020); Lin et al. (2017); Xu et al. (2021). Lin et al. (2017) proposed two tactics of adversarial attack on deep RL agents. One tactic is the strategically-timed attack, which minimizes an agent's reward by injecting noise into states at selected time steps in an episode. The other tactic is the enchanting attack, which lures the agent to a designated target state via generative planning methods. Gleave et al. (2020) considered the problem of attacking an RL agent simply by choosing an adversarial policy acting in a multi-agent environment in order to create natural observations that are adversarial. However, little work has been done on adversarial MACRL. As discussed in Sec. 1, recent works Blumenkamp and Prorok (2020); Mitchell et al. (2020) either attack a specific competitive setting with a simple attack method and no effective defense, or defend a specific communication method. Our work fills this gap in MACRL.

3 Preliminaries


Decentralised partially observable Markov decision process with communication (Dec-POMDP-Com) is proposed to describe a fully cooperative MACRL problem Oliehoek et al. (2016). Concretely, a Dec-POMDP-Com can be formulated as a tuple $\langle \mathcal{N}, \mathcal{S}, \mathcal{A}, \mathcal{M}, \mathcal{T}, R, \Omega, O, \gamma \rangle$, where $s \in \mathcal{S}$ denotes the state of the environment and $\mathcal{M}$ is the message space. Each agent $i \in \mathcal{N}$ chooses an action $a_i \in \mathcal{A}$ at each time step, giving rise to a joint action vector $\mathbf{a}$. $\mathcal{T}(s' \mid s, \mathbf{a})$ is a Markovian transition function. Every agent shares the same joint reward function $R(s, \mathbf{a})$, and $\gamma \in [0, 1)$ is the discount factor. Due to partial observability, each agent has an individual partial observation $o_i \in \Omega$, according to the observation function $O(s, i)$. Each agent holds two policies, i.e., an action policy $\pi_i^{a}$ and a message policy $\pi_i^{m}$, both of which are conditioned on the action-observation history $\tau_i$ and the incoming messages $m_i^{in}$ aggregated by the communication protocol.

Adversarial Training. Attackers try to fool a DNN by corrupting its input Goodfellow et al. (2017) via norm-based attacks. The attacker carefully generates artificial perturbations to manipulate the input of the model and thereby fool the DNN into making incorrect predictions. The optimal perturbation is found through optimization: the attacker optimizes an attack objective evaluated on the DNN's output for the perturbed input, while controlling the distance (e.g., a metric for input similarity) between the original and the perturbed input. Though posing threats to the security of AI models, such adversarial attacks have ignited research on robustness in ML, which can further improve the robustness and reliability of AI models in the deployment phase.

Two-player Zero-sum Games. We formulate the attack-defense problem in adversarial communication as a two-player zero-sum game. Formally, a two-player zero-sum game can be described by a tuple $(\Pi, U)$, where $\Pi = (\Pi_1, \Pi_2)$ is the set of policies (or strategies), one for each player, and $U: \Pi \to \mathbb{R}^2$ is the utility function that calculates the payoff for each player given their joint policy Lanctot et al. (2017). For the game to be zero-sum, for any pair of policies $\pi = (\pi_1, \pi_2)$, the payoff $U_i(\pi)$ to player $i$ must be the negative of the payoff $U_{-i}(\pi)$ to the other player $-i$. Players try to maximize their own expected utility. The set of best responses to a policy $\pi_{-i}$ is defined as the set of policies that maximally exploit it: $\mathrm{BR}(\pi_{-i}) = \arg\max_{\pi_i'} U_i(\pi_i', \pi_{-i})$. The exploitability of a joint policy $\pi$ aggregates, over both players, the gain $U_i(\mathrm{BR}(\pi_{-i}), \pi_{-i}) - U_i(\pi)$ that player $i$ could obtain by switching to a best response. A joint policy is a Nash equilibrium (NE) if its exploitability is zero, at which point no player has anything to gain by changing only their own policy.
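To make best responses and exploitability concrete, the following is a minimal sketch on a zero-sum matrix game rather than on learned policies. The rock-paper-scissors payoff matrix, the function names, and the choice to average the two players' gains (the factor 0.5) are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def best_response_value(payoff, opponent_mix):
    """Highest expected payoff a maximizing row player can get against a fixed mix."""
    return float(np.max(payoff @ opponent_mix))

def exploitability(payoff, row_mix, col_mix):
    """Average best-response gain of the two players in a zero-sum matrix game.

    payoff[i, j] is the row player's utility; the column player gets -payoff[i, j].
    The 0.5 normalization is an assumption made for this sketch.
    """
    value = float(row_mix @ payoff @ col_mix)
    row_gain = best_response_value(payoff, col_mix) - value
    # From the column player's point of view the payoff matrix is -payoff.T.
    col_gain = best_response_value(-payoff.T, row_mix) - (-value)
    return 0.5 * (row_gain + col_gain)

# Hypothetical example game: rock-paper-scissors.
rps = np.array([[0.0, -1.0, 1.0],
                [1.0, 0.0, -1.0],
                [-1.0, 1.0, 0.0]])
uniform = np.ones(3) / 3
always_rock = np.array([1.0, 0.0, 0.0])

print(exploitability(rps, uniform, uniform))      # uniform play is the NE
print(exploitability(rps, always_rock, uniform))  # a pure strategy is exploitable
```

The uniform joint policy has zero exploitability, while a row player that always plays rock can be exploited by the column player switching to paper.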

4 Achieving Robustness in MACRL

In this section, we first propose our problem formulation for adversarial MACRL. Next, we introduce the method used to build an optimal attacking scheme in multi-agent communication. Then, we propose a learnable two-stage message filter to defend against possible attacks. Finally, to make the defence system robust, we formulate the attack-defense problem as a two-player zero-sum game and apply a game-theoretic framework, i.e., Policy-Space Response Oracle (PSRO) Lanctot et al. (2017), to approximate a Nash equilibrium policy for the message filter (defender). In this way, we improve the worst-case performance of the message filter and thereby enhance its robustness.

4.1 Problem Formulation: Adversarial Communication in MARL

As introduced in Sec. 3, in Dec-POMDP-Com, each agent keeps two policies, i.e., the action policy and the message policy. In adversarial communication where there are malicious agents, we assume that each malicious agent holds a third, private adversarial policy, which generates an adversarial message based on its action-observation history, the messages it has received, and the message it originally intended to send. To reduce the likelihood of being detected, apart from the adversarial policy, malicious agents strictly follow their former action policy and message policy, trying to behave like benign agents. Malicious agents could send messages by convexly combining their original messages with adversarial messages, or by simply summing the two. Fig. 1(a) presents the overall attacking procedure. The malicious agent tries to generate an adversarial message to disrupt cooperation. The adversarial message it sends, together with the normal messages sent by other agents, is processed by the communication protocol (algorithm-related), generating a contaminated incoming message for a benign agent. From the benign agent's perspective, the estimated Q-values change under such an attack, which may lead to incorrect decisions, e.g., the action with the highest Q-value shifts from one action to another, resulting in suboptimality. We model the message attack task as a Markov decision process (MDP). The optimization objective is to minimize the joint accumulated rewards. We make the following assumptions to make the problem both practical and tractable.
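The two ways of forming the message that is actually sent can be sketched as follows. The function name `contaminate`, the mixing weight, and the example vectors are hypothetical and only illustrate the convex-combination and summation options described above.

```python
import numpy as np

def contaminate(original, adversarial, weight=0.5, mode="convex"):
    """Combine the intended message with an adversarial one.

    mode="convex": (1 - weight) * original + weight * adversarial, weight in [0, 1].
    mode="sum":    original + adversarial.
    """
    original = np.asarray(original, dtype=float)
    adversarial = np.asarray(adversarial, dtype=float)
    if mode == "convex":
        if not 0.0 <= weight <= 1.0:
            raise ValueError("weight must lie in [0, 1] for a convex combination")
        return (1.0 - weight) * original + weight * adversarial
    if mode == "sum":
        return original + adversarial
    raise ValueError(f"unknown mode: {mode}")

# Hypothetical 3-dimensional messages.
intended = [1.0, 1.0, 1.0]
malicious = [3.0, -1.0, 0.0]
sent_convex = contaminate(intended, malicious, weight=0.25)
sent_sum = contaminate(intended, malicious, mode="sum")
```

A small mixing weight keeps the sent message close to the intended one, which is in the spirit of the concealment requirement.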

Assumption 1.

(Byzantine Failure Kirrmann (2015)) Agents have imperfect information on who is the adversary.

Assumption 2.

(Concealment) Malicious agents only attack the message in communication, but choose actions normally like benign agents in order to be covert.

Assumption 3.

For simplicity, we consider the situation where there is only one adversarial agent. We leave the case of multiple adversarial agents to future work.

To defend against the attack, as in Fig. 1(b), we propose to add a defender to the communication protocol module. Instead of distributing the contaminated message directly to the benign agent, the defender first transforms it into a recovered message. With this transformation, the benign agent is able to make the correct decision.

Figure 1: The general framework of our attack (left) and defense (right) schemes. Q-values (in red) in the tables correspond to the actions the benign agent executes with or without attack and with defense. Further discussion can be found in Sec. 4.1.

4.2 Learning the Attacking Scheme

Adversarial messages can interfere with agents' decision making, and multi-agent coordination disintegrates as these messages spread. We train the attacker to model the optimal adversarial message distribution and perform attacks by sampling from it. Concretely, adversarial messages are drawn from a multivariate Gaussian distribution whose parameters are given by the attacker's private policy, with covariance matrix $\Sigma$. The optimization objective is to minimize the accumulated team rewards subject to a constraint on the distance between the original message and the adversarial message. We apply PPO Schulman et al. (2017) to optimize the policy by maximizing the following objective:

$$L(\theta) = \mathbb{E}_t\left[\min\left(\rho_t(\theta) A_t,\ \mathrm{clip}\left(\rho_t(\theta), 1-\epsilon, 1+\epsilon\right) A_t\right)\right],$$

where $\rho_t(\theta) = \pi_\theta / \pi_{\theta_{old}}$ is the probability ratio, $\pi_{\theta_{old}}$ is the policy in the last iteration, $\epsilon$ is the clipping ratio, and $A_t$ is the advantage function. We define the reward of the attacker to be the negative team reward, and the value function is learned by minimizing a separate value loss.
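The sampling step and the clipped PPO surrogate can be sketched in NumPy as below. This is only an illustration under stated assumptions: the mean vector stands in for the output of the attacker's policy network, the covariance is a fixed hypothetical value, and the log-probabilities and advantages are made-up numbers rather than quantities from the paper.

```python
import numpy as np

def sample_adversarial_message(mean, cov, rng):
    """Draw one adversarial message from a multivariate Gaussian."""
    return rng.multivariate_normal(mean, cov)

def clipped_surrogate(new_logp, old_logp, advantages, clip=0.2):
    """PPO clipped objective (to be maximized), averaged over samples."""
    ratio = np.exp(new_logp - old_logp)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip, 1.0 + clip) * advantages
    return float(np.mean(np.minimum(unclipped, clipped)))

rng = np.random.default_rng(0)
# Hypothetical policy output for a 3-dimensional message.
message = sample_adversarial_message(np.zeros(3), 0.1 * np.eye(3), rng)

# The attacker's reward is the negative team reward.
team_rewards = np.array([1.0, 0.5, 2.0])
attacker_rewards = -team_rewards

# Made-up log-probabilities and advantages for two samples.
objective = clipped_surrogate(
    new_logp=np.log(np.array([1.5, 0.5])),
    old_logp=np.log(np.array([1.0, 1.0])),
    advantages=np.array([1.0, -1.0]),
)
```

In the example, both probability ratios fall outside the clipping range, so the clipped terms determine the value of the objective.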

4.3 Defending against Adversarial Communication

Figure 2: The communication protocol with the message filter (defender).

As mentioned above, adversarial communication seriously damages multi-agent cooperation. However, defending against it is non-trivial, because i) benign agents have no prior knowledge of which agent is malicious; ii) the malicious agent can inconsistently appear both cooperative and non-cooperative to avoid being detected; and iii) recovering useful information from the contaminated message is difficult. To address these challenges, we propose a two-stage message filter for the communication protocol module. As in Fig. 2, the message filter consists of an anomaly detector and a message reconstructor. The anomaly detector outputs, for each incoming message, the probability that it needs to be recovered; these probabilities define a categorical distribution over the incoming messages. We sample from this distribution to determine the anomalous message, which is then recovered by the reconstructor. The recovered message and the other, normal messages are aggregated to determine the messages that each agent will receive.

The optimization objective of the message filter is to maximize the joint accumulated rewards under attack, where the team reward is the one obtained after performing the defending strategy. We utilize PPO to optimize the strategy. To mitigate the local optima induced by unguided exploration in a large message space, we introduce a regularizer which is trained with the guidance of the ground-truth labels of malicious messages. For the reconstructor, we train it by minimizing the distance between the messages before and after the attack. To improve the tolerance of the reconstructor to errors made by the detector, apart from malicious messages, we also use benign messages as training data, in which case the reconstructor should act as an identity mapping. Formally, the two-stage message filter is optimized with an objective that combines three terms: the PPO objective computed from the defender's advantages, the regularizer that matches the detector's output to the one-hot ground-truth labels of the malicious messages, and the reconstruction loss between the reconstructed messages and the messages that have not been attacked. The terms are weighted by hyperparameters.
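The forward pass of the two-stage filter and its two supervised terms can be sketched as follows. The detector logits, the stand-in reconstructor, and the specific losses (cross-entropy for the detector regularizer, mean squared error for reconstruction) are assumptions chosen for illustration; the text above only states that a label-guided regularizer and a distance-based reconstruction loss are used, and the PPO term is omitted here.

```python
import numpy as np

def softmax(logits):
    shifted = np.exp(logits - np.max(logits))
    return shifted / shifted.sum()

def filter_messages(messages, detector_logits, reconstruct, rng):
    """Stage 1: sample the suspected message; stage 2: rebuild only that message."""
    probs = softmax(np.asarray(detector_logits, dtype=float))
    suspect = int(rng.choice(len(probs), p=probs))
    cleaned = np.array(messages, dtype=float)  # copy, the input stays untouched
    cleaned[suspect] = reconstruct(cleaned[suspect])
    return cleaned, suspect

def supervised_terms(probs, malicious_index, reconstructed, clean):
    """Assumed losses: cross-entropy against the one-hot label, MSE to the clean message."""
    detection_loss = -float(np.log(probs[malicious_index]))
    reconstruction_loss = float(np.mean((reconstructed - clean) ** 2))
    return detection_loss, reconstruction_loss

rng = np.random.default_rng(0)
# Hypothetical incoming messages; the second one is contaminated.
messages = np.array([[0.1, 0.2, 0.3],
                     [9.0, -9.0, 9.0],
                     [0.0, 0.1, 0.2]])
cleaned, suspect = filter_messages(
    messages, [-50.0, 50.0, -50.0], lambda m: np.zeros_like(m), rng
)
```

Because only the sampled message is rewritten, a reconstructor that behaves as an identity mapping on benign messages limits the damage when the detector picks the wrong sender.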

4.4 Learning Less Exploitable Defending Scheme

Although we have introduced the learnable two-stage message filter to defend against adversarial messages, the effectiveness of the defence system can rapidly decrease if the attacker becomes aware of the defender and adapts the attacking scheme. To mitigate this, we propose to improve the defending performance in the worst case, i.e., to make it less exploitable. Concretely, we formulate the attack and defence problem as a two-player zero-sum game and define the payoff of the defender, $U_d$, as the joint accumulated rewards. We solve the game, i.e., approach the Nash equilibrium, by optimizing the following MaxMin objective:

$$\max_{\pi_d \in \Pi_d} U_d\left(\pi_d, \mathrm{BR}(\pi_d)\right) = \max_{\pi_d \in \Pi_d} \min_{\pi_a \in \Pi_a} U_d(\pi_d, \pi_a),$$

where $\pi_a$ and $\pi_d$ are the policies of the attacker and the defender respectively, $\Pi_a$ and $\Pi_d$ denote the policy spaces, and $\mathrm{BR}(\pi_d)$ is the attacker's best response policy to $\pi_d$, i.e., $\mathrm{BR}(\pi_d) = \arg\min_{\pi_a \in \Pi_a} U_d(\pi_d, \pi_a)$.

We propose -MACRL to approximate the NE in the adversarial communication game. -MACRL is developed upon Policy-Space Response Oracle (PSRO) Lanctot et al. (2017), a deep learning-based extension of the classic Double Oracle algorithm McMahan et al. (2003) for finding NEs. The workflow of -MACRL is depicted in Fig. 3. At each iteration, -MACRL keeps a population (policy set) of policies for each player, i.e., one for the attacker and one for the defender, and evaluates the payoffs for the current populations. Then, a Nash equilibrium is computed for the game restricted to the policies in the populations by, for example, linear programming or replicator dynamics. This restricted-game equilibrium is the meta-strategy, a categorical distribution over the policies in each population. Next, we calculate a best response policy to this NE for each player and add it to that player's population. After extending the populations, we complete the payoff table and perform the next iteration. The algorithm converges if the best response policies are already in the population. Practically, -MACRL approximates the utility function by having each policy in one population play against each policy in the other and tracking the average utilities. The best response policies are approximated by maximizing the training objectives of the attacker and the defender, whose full expressions are provided in Sec. 4.2 and Sec. 4.3, respectively. We summarize the overall algorithm of -MACRL in the Appendix.

Figure 3: The workflow of -MACRL.
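The population-growing loop can be sketched on a small matrix game, where exact best responses replace the learned (PPO) oracles. This is a double-oracle style illustration under assumptions, not the paper's algorithm: the payoff matrix is hypothetical, policies are pure strategy indices, and fictitious play is swapped in as the meta-solver in place of linear programming or replicator dynamics.

```python
import numpy as np

def solve_restricted(payoff, iters=2000):
    """Approximate NE of a zero-sum matrix game by fictitious play (row maximizes)."""
    rows, cols = payoff.shape
    row_counts = np.zeros(rows)
    col_counts = np.zeros(cols)
    row_counts[0] = col_counts[0] = 1.0
    for _ in range(iters):
        row_counts[np.argmax(payoff @ col_counts)] += 1.0
        col_counts[np.argmin(row_counts @ payoff)] += 1.0
    return row_counts / row_counts.sum(), col_counts / col_counts.sum()

def double_oracle(payoff, max_iters=50):
    """Grow defender (row) and attacker (column) populations until no new best response."""
    defenders, attackers = [0], [0]
    for _ in range(max_iters):
        restricted = payoff[np.ix_(defenders, attackers)]
        d_mix, a_mix = solve_restricted(restricted)
        # Best responses are searched over the full strategy sets.
        br_defender = int(np.argmax(payoff[:, attackers] @ a_mix))
        br_attacker = int(np.argmin(d_mix @ payoff[defenders, :]))
        if br_defender in defenders and br_attacker in attackers:
            break  # both best responses are already in the populations
        if br_defender not in defenders:
            defenders.append(br_defender)
        if br_attacker not in attackers:
            attackers.append(br_attacker)
    return defenders, attackers, d_mix, a_mix

# Hypothetical defender payoffs (rock-paper-scissors); the attacker receives the negative.
payoff = np.array([[0.0, -1.0, 1.0],
                   [1.0, 0.0, -1.0],
                   [-1.0, 1.0, 0.0]])
defenders, attackers, d_mix, a_mix = double_oracle(payoff)
```

Starting from a single policy per player, the loop adds one best response at a time and stops once the best responses to the current meta-strategy are already contained in the populations.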

5 Experiments

In this section, we conduct extensive experiments on different state-of-the-art MACRL methods to answer the following questions: Q1: Are MACRL methods vulnerable to message attacks when a malicious agent in the team sends disruptive messages during the deployment phase? Q2: Can our two-stage message filter effectively defend against message attacks and maintain team coordination when the malicious agent's attack strategy is fixed? Q3: As the malicious agent can adapt to the changing and improving defensive communicative policies of the benign agents, can we learn a robust defensive policy for benign agents by introducing -MACRL? We start by providing a taxonomy of MACRL methods, and then select representative algorithms in each category to answer the questions. Additional experimental results can be found in the Appendix.

5.1 Taxonomy of MACRL and Experimental Setup

Taxonomy of MACRL. MACRL methods are commonly categorized by whether communication is implicit or explicit Oliehoek et al. (2016); Ahilan and Dayan (2021). In implicit communication, one agent's action influences the observations of the other agents, whereas in explicit communication, agents have a separate set of communication actions and exchange information via these actions. In this paper, we focus on explicit communication because, in implicit communication, the attacker's behaviour needed to carry out an attack is often bizarre, making the attack trivial and easily detectable. With different realistic problems being considered in MACRL, we consider the following categories:

  • Communication with delay (CD): Communication in the real world is usually not real-time. We can model this by assuming that it takes one or a few time steps for messages to be received by the targeted agents Das et al. (2019).

  • Communication with cost or limited bandwidth (CC): Agents should avoid sending redundant messages because communication in real world is costly and communication channels often have a limited bandwidth Wang et al. (2019); Zhang et al. (2020).

  • Local communication (LC): Messages sometimes cannot be broadcast to all agents due to communication distance limitations or privacy concerns. Therefore, agents need to learn to communicate locally, affecting only those agents within the same communication group Böhmer et al. (2020).

Category   Method                              Scenario
CD         CommNet Sukhbaatar et al. (2016)    Predator-Prey Böhmer et al. (2020)
LC         TarMAC Das et al. (2019)            TrafficJunction Sukhbaatar et al. (2016)
CC         NDQ Wang et al. (2019)              StarCraft II Samvelyan et al. (2019)

Table 1: The representative MACRL methods of different categories in our experiments.

Figure 4: Attack and defense performance on PP.

Experimental Setup. Following the above taxonomy, we select representative, state-of-the-art MACRL methods in each category. Because methods within a category share common merits and characteristics, it is unnecessary to sweep through the less competitive methods. Our environmental setups are listed in Table 1. We first train the attacker without the presence of a defender. Then, we train our defender against a trained and fixed attacker. Finally, we conduct the training of -MACRL in the presence of both the attacker and the defender. All experiments are carried out with 5 random seeds and results are shown with a 95% confidence interval. Additional experimental setups can be found in the Appendix.
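As an illustration of reporting a 95% confidence interval over 5 seeds, the sketch below uses a Student-t interval. The text does not state how the intervals were computed, so the t-based formula is an assumption, and the returns are made-up numbers rather than results from the paper.

```python
import math
import statistics

# Two-sided 95% critical value of the Student-t distribution with 4 degrees of freedom.
T_CRIT_DF4 = 2.776

def mean_ci95_five_seeds(values):
    """Return (mean, half_width) of a t-based 95% confidence interval for 5 runs."""
    if len(values) != 5:
        raise ValueError("this helper is written for exactly 5 seeds")
    mean = statistics.fmean(values)
    half_width = T_CRIT_DF4 * statistics.stdev(values) / math.sqrt(len(values))
    return mean, half_width

# Hypothetical test returns from 5 seeds.
mean, half_width = mean_ci95_five_seeds([1.0, 2.0, 3.0, 4.0, 5.0])
print(f"{mean:.2f} +/- {half_width:.2f}")
```

With only 5 runs the t critical value (2.776) is noticeably larger than the normal value of 1.96, so a normal approximation would understate the interval width.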

5.2 The Attacking and Defending Results

We first answer Q1 by presenting empirical results of message attacking. Then, to answer Q2, we present our defense results to demonstrate that our defense method can effectively defend against agnostic message attack and maintain coordination.

CommNet Sukhbaatar et al. (2016). We attack CommNet on a partially observable grid-world predator-prey (PP) task similar to that used by Böhmer et al. (2020). As depicted in Fig. 4, the decreasing test return under attack indicates that coordination between agents can be easily destroyed via deliberate message attacks, causing decreases of 40% and 33% in the two scenarios, which differ in their punishment value. When the test return decreases to the preset thresholds (23 and 20 for the two scenarios, respectively), we start training the defender against a trained, fixed-weight attacker inside the team. The defender's performance steadily increases and converges to values as good as the performance without attack.

Table 2: Success rates of TarMAC on TJ (Easy and Hard scenarios).

TarMAC Das et al. (2019). We conduct message attacks on TrafficJunction (TJ) Sukhbaatar et al. (2016). As shown in Table 2, the success rate of TarMAC decreases in both the easy and the hard TJ scenarios, demonstrating the vulnerability of TarMAC under malicious message attack. Armed with the defender, the performance of TarMAC in both scenarios increases and surpasses its performance under attack, illustrating the merit of our defense method.

Figure 5: Attack and defense performance (test won rate and test return) on NDQ in two SMAC scenarios. Pink dotted lines indicate the best performance without attack.

NDQ Wang et al. (2019). We attack NDQ on two typical SMAC Samvelyan et al. (2019) scenarios in which communication between cooperative agents is necessary and in which full factorization methods are notably prone to miscoordination Wang et al. (2019). In Fig. 5, both the test won rate and the test return of NDQ decrease as the attacker is trained, demonstrating that NDQ is also vulnerable to malicious message attack. In the first scenario, the initial collapses in the test won rate and the test return of the defender show that the defender is exploring; the test won rate then quickly converges. In the second scenario, the test won rate of the defender also converges, demonstrating that our defense method can efficiently recover coordination under message attack.

Given these experimental results, we conclude that the MACRL methods we evaluated are not robust when exposed to hostile message attacks, and that our defense method is a practical and effective way to combat them.

5.3 The Robustness of our Defense Method

The robustness of our defensive model is yet to be evaluated. We answer two questions: Q3 (a): whether malicious agents can actually adapt to benign agents’ defense policies and Q3 (b): whether benign agents are able to learn robust policies.

The adaptive attacking policy. To answer Q3 (a), we present the performance curves of the attacker on NDQ and CommNet during attacker training, where the trained, fixed defender policies are sampled from the defender policy set introduced in Sec. 4.4. That is, the sampled defender policy is used to defend against message attacks, so that the attacker learns an adaptive attacking policy, which in turn serves to further improve the robustness of the defender. As depicted in Fig. 6 and Fig. 7, even with a trained, fixed defender defending against message attacks, the attacker is able to learn adaptive attacking policies, at the cost of consuming more samples than when training without defenders.

Figure 6: Test won rate on NDQ under attack.
Figure 7: Test return on CommNet under attack.

The robustness of the defending policy. We answer Q3 (b) by comparing defenders trained by -MACRL with defenders trained by the vanilla training method. With -MACRL, a population of attacker policies is maintained and attacker policies are sampled from it for training defenders; in the vanilla training method, by contrast, only one attacker policy is available for training the defender. We use the expected utility as the metric for comparing the robustness of defenders: the expected maximum best response utility of the defender trained by -MACRL is compared with the expected utility of the defender trained by vanilla training. When the former is larger than the latter, we can conclude that the defender trained with -MACRL is more robust than its counterpart. Further details can be found in the Appendix.

Figure 8: Comparisons between defenders trained with -MACRL and vanilla training method.

Evaluated on six scenarios, as illustrated in Fig. 8 (the error bars indicate one standard deviation), the utilities of defenders trained with -MACRL are greater than those of vanilla-trained defenders on all scenarios, showing that -MACRL has successfully trained robust defensive policies. This holds in the two scenarios where the subject is NDQ and in the two PP settings where CommNet models are pre-trained. In the complex TJ scenarios, where cars move along the road to complete assigned routes, malicious message attacks can cause route completion to fail, especially in the hard scenario with 20 cars; here too the defender trained with -MACRL achieves the higher utility. Interestingly, in the easy scenario with only 4 cars, we still obtain an improvement. The improvement in utility on all evaluated scenarios indicates that defenders trained by -MACRL are robust. We summarise the performance in the Appendix. Intuitively, the robustness of the defender benefits greatly from exploring the policy space of the adversary via -MACRL, which enables the defender to maintain team coordination in the face of attacks.

5.4 How Does the Attacker Attack and the Defender Defend?

Table 3: Distributions of messages (columns: Agent 1, Agent 2).

We showcase how the attacker attacks and how the defender defends, using the attacker and the defender trained with NDQ, and taking the Q values and messages of all agents as an example. As shown in Fig. 9, there are three agents: one has been taken over and has become the malicious agent, while the other two are benign. Each agent has eight actions. White cells in the Q value tables correspond to actions that are not available at the current step, and values in red are the Q values of the optimal actions. Following our problem formulation, the malicious agent sends malicious messages (the message dimension is 3) to the two benign agents, changing their Q values as well as their optimal actions, as shown in Fig. 9 (d) and Fig. 9 (a-b), respectively. The defender reconstructs the message (Fig. 9 (e)) and the Q values of the benign agents are protected (Fig. 9 (c)). The reconstructed messages follow the same distribution (Table 3) and the distance between them is small. Note that the ordering of the Q values under defense (Fig. 9 (c)) is unchanged relative to the case without attack (Fig. 9 (a)).

Figure 9: Message attack and defense: an example (subject method: NDQ).

5.5 Ablations

Our defense method consists of two components: the malicious agent discriminator and the message reconstructor. Here, we conduct ablation studies on our proposed defense method to see how these two components cooperate to defend against message attacks effectively. We run experiments on three scenarios, each corresponding to one subject MACRL method. We train our defense method with the vanilla training method, i.e., we first train an attacker policy without any defense and then train the defender against the trained and fixed attacker policy. From Fig. 10, we first conclude that directly replacing the poisonous message with randomly generated messages cannot defend effectively, especially on and TJ (Hard). Then, as depicted in Fig. 11, we show that without the discriminator, the message reconstructor cannot deliver reliable defensive performance, especially in PP (), where predators can be fooled and fail to cooperatively catch the prey when exposed to message attacks.
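As a rough illustration of how the two components fit together, the sketch below passes each incoming message through a discriminator and replaces only the flagged ones with a reconstruction. In the paper both components are learned models; here `toy_discriminator` and `toy_reconstructor` are hypothetical rule-based stand-ins, and the message values are invented.

```python
def filter_messages(messages, is_malicious, reconstruct):
    """Two-stage filter: detect suspicious messages, then replace them.

    messages: dict mapping sender id -> message vector (list of floats).
    is_malicious, reconstruct: callables standing in for learned models.
    """
    cleaned = {}
    for sender, msg in messages.items():
        if is_malicious(sender, msg):
            cleaned[sender] = reconstruct(sender, msg)
        else:
            cleaned[sender] = msg
    return cleaned


# Hypothetical stand-ins (assumptions, not the paper's models).
def toy_discriminator(sender, msg):
    # Flag a message when any component leaves an assumed benign range.
    return any(abs(x) > 1.0 for x in msg)


def toy_reconstructor(sender, msg):
    # Replace a flagged message with a neutral vector of the same size.
    return [0.0] * len(msg)


incoming = {0: [0.1, 0.2, -0.3], 1: [5.0, -7.0, 2.5], 2: [0.3, 0.0, 0.4]}
cleaned = filter_messages(incoming, toy_discriminator, toy_reconstructor)
```

The two ablations correspond to swapping one stage: Fig. 10 replaces the reconstructor with random messages, while Fig. 11 removes the discriminator so that messages are replaced at random.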

Figure 10: Replacing potential malicious messages with randomly generated messages.
Figure 11: Randomly replacing potential malicious messages via the learned message reconstructor.

6 Conclusion and Future Work

In this paper, we consider the adversarial MACRL problem. To address this problem, we first propose our problem formulation on adversarial MACRL, then we propose our message attack model in order to investigate the vulnerability of communication under attack. To protect the multi-agent coordination, we propose our two-stage message filter as an effective defense method. Finally, we propose -MACRL to achieve communication robustness. Empirically, we demonstrate that MACRL methods are not robust to message attacks, and our defence method and the game-theoretic framework can effectively improve the robustness of MACRL.

Security and robustness are two important issues in MACRL that remain largely uninvestigated. For future work, more message attack methods can be proposed to gain a deeper understanding of how messages actually contribute to agents' decentralized execution, and consequently more defense methods can be investigated. Theoretical analyses of the game-theoretic equilibrium between the attack method and the defense method in MACRL are also future directions.


  • S. Ahilan and P. Dayan (2021) Correcting experience replay for multi-agent communication. ICLR. Cited by: §5.1.
  • J. Blumenkamp and A. Prorok (2020) The emergence of adversarial communication in multi-agent reinforcement learning. The 4th Conference on Robot Learning (CoRL 2020). Cited by: §1, §2.
  • W. Böhmer, V. Kurin, and S. Whiteson (2020) Deep coordination graphs. In ICML, Cited by: 3rd item, §5.1, §5.2.
  • N. Carlini and D. Wagner (2017) Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57. Cited by: §1, §2.
  • A. Das, T. Gervet, J. Romoff, D. Batra, D. Parikh, M. Rabbat, and J. Pineau (2019) Tarmac: targeted multi-agent communication. In ICML, pp. 1538–1546. Cited by: Figure 13, §B.2, §1, §2, 1st item, §5.1, §5.2.
  • J. N. Foerster, Y. M. Assael, N. de Freitas, and S. Whiteson (2016) Learning to communicate with deep multi-agent reinforcement learning. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 2145–2153. Cited by: §1, §2.
  • A. Gleave, M. Dennis, C. Wild, N. Kant, S. Levine, and S. Russell (2020) Adversarial policies: attacking deep reinforcement learning. In International Conference on Learning Representations, Cited by: §1, §2.
  • I. Goodfellow, N. Papernot, S. Huang, Y. Duan, P. Abbeel, and J. Clark (2017) Attacking machine learning with adversarial examples. OpenAI. https://blog.openai.com/adversarial-example-research. Cited by: §1, §2, §3.
  • M. Hüttenrauch, A. Šošić, and G. Neumann (2017) Guided deep reinforcement learning for swarm systems. AAMAS Workshop. Cited by: §1.
  • N. Jaques, A. Lazaridou, E. Hughes, C. Gulcehre, P. Ortega, D. Strouse, J. Z. Leibo, and N. De Freitas (2019) Social influence as intrinsic motivation for multi-agent deep reinforcement learning. In ICML, Cited by: §2.
  • J. Jiang and Z. Lu (2018) Learning attentional communication for multi-agent cooperation. In NIPS, pp. 7254–7264. Cited by: §1, §2.
  • D. Kim, S. Moon, D. Hostallero, W. J. Kang, T. Lee, K. Son, and Y. Yi (2019) Learning to schedule communication in multi-agent reinforcement learning. ICLR. Cited by: §2.
  • W. Kim, J. Park, and Y. Sung (2021) Communication in multi-agent reinforcement learning: intention sharing. ICLR. Cited by: §2.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: Appendix D.
  • H. Kirrmann (2015) Fault tolerant computing in industrial automation. Switzerland: ABB Research Center. Cited by: Assumption 1.
  • L. Kraemer and B. Banerjee (2016) Multi-agent reinforcement learning as a rehearsal for decentralized planning. Neurocomputing 190, pp. 82–94. Cited by: §1.
  • M. Lanctot, V. Zambaldi, A. Gruslys, A. Lazaridou, K. Tuyls, J. Pérolat, D. Silver, and T. Graepel (2017) A unified game-theoretic approach to multiagent reinforcement learning. In NIPS, Cited by: Appendix A, §1, §3, §4.4, §4.
  • Y. Lin, Z. Hong, Y. Liao, M. Shih, M. Liu, and M. Sun (2017) Tactics of adversarial attack on deep reinforcement learning agents. In IJCAI, Cited by: §1, §2.
  • R. Lowe, Y. Wu, A. Tamar, J. Harb, O. P. Abbeel, and I. Mordatch (2017) Multi-agent actor-critic for mixed cooperative-competitive environments. In NIPS, Cited by: §1.
  • H. B. McMahan, G. J. Gordon, and A. Blum (2003) Planning in the presence of cost functions controlled by an adversary. In ICML, Cited by: §4.4.
  • R. Mitchell, J. Blumenkamp, and A. Prorok (2020) Gaussian process based message filtering for robust multi-agent cooperation in the presence of adversarial communication. arXiv preprint arXiv:2012.00508. Cited by: §1, §2.
  • P. Muller, S. Omidshafiei, M. Rowland, K. Tuyls, J. Perolat, S. Liu, D. Hennes, L. Marris, M. Lanctot, E. Hughes, et al. (2019) A generalized training approach for multiagent learning. In ICML, Cited by: §1.
  • T. H. Nguyen, D. Kar, M. Brown, A. Sinha, A. X. Jiang, and M. Tambe (2016) Towards a science of security games. In Mathematical Sciences with Multidisciplinary Applications, pp. 347–381. Cited by: §1.
  • F. A. Oliehoek, C. Amato, et al. (2016) A concise introduction to decentralized POMDPs. Vol. 1, Springer. Cited by: §3, §5.1.
  • F. A. Oliehoek, M. T. Spaan, and N. Vlassis (2008) Optimal and approximate q-value functions for decentralized POMDPs. JAIR 32, pp. 289–353. Cited by: §1.
  • N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, and A. Swami (2017) Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia conference on computer and communications security, pp. 506–519. Cited by: §1, §2.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In NeurIPS, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), Cited by: Appendix D.
  • T. Rashid, M. Samvelyan, C. Schroeder, G. Farquhar, J. Foerster, and S. Whiteson (2018) QMIX: monotonic value function factorisation for deep multi-agent reinforcement learning. In ICML, Cited by: §1.
  • M. Samvelyan, T. Rashid, C. S. de Witt, G. Farquhar, N. Nardelli, T. G. J. Rudner, C. Hung, P. H. S. Torr, J. Foerster, and S. Whiteson (2019) The StarCraft multi-agent challenge. CoRR abs/1902.04043. Cited by: §C.3, §5.1, §5.2.
  • J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §4.2.
  • A. Singh, T. Jain, and S. Sukhbaatar (2018) Learning when to communicate at scale in multiagent cooperative and competitive tasks. ICLR. Cited by: §2.
  • A. J. Singh, A. Kumar, and H. C. Lau (2020) Hierarchical multiagent reinforcement learning for maritime traffic management. In AAMAS, Cited by: §1.
  • S. Sukhbaatar, A. Szlam, and R. Fergus (2016) Learning multiagent communication with backpropagation. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 2252–2260. Cited by: Figure 12, §B.2, §1, §2, §5.1, §5.2, §5.2.
  • C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2014) Intriguing properties of neural networks. In ICLR, Cited by: §1, §2.
  • F. Tramèr, A. Kurakin, N. Papernot, I. Goodfellow, D. Boneh, and P. McDaniel (2018) Ensemble adversarial training: attacks and defenses. In International Conference on Learning Representations, Cited by: §1, §2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NIPS, Cited by: §B.2, §1.
  • O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, et al. (2019) Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575 (7782), pp. 350–354. Cited by: §1.
  • J. Wang, Z. Ren, T. Liu, Y. Yu, and C. Zhang (2020a) QPLEX: duplex dueling multi-agent q-learning. arXiv preprint arXiv:2008.01062. Cited by: §1.
  • R. Wang, X. He, R. Yu, W. Qiu, B. An, and Z. Rabinovich (2020b) Learning efficient multi-agent communication: an information bottleneck approach. In ICML, Cited by: §1, §2.
  • T. Wang, J. Wang, C. Zheng, and C. Zhang (2019) Learning nearly decomposable value functions via communication minimization. In ICML, Cited by: Figure 14, §B.2, §1, §2, 2nd item, §5.1, §5.2.
  • Y. Wang, B. Han, T. Wang, H. Dong, and C. Zhang (2021) DOP: off-policy multi-agent decomposed policy gradients. In ICML, Cited by: §1.
  • Y. Wu, Y. Wu, G. Gkioxari, and Y. Tian (2018) Building generalizable agents with a realistic and rich 3d environment. arXiv preprint arXiv:1801.02209. Cited by: §C.2.
  • H. Xu, R. Wang, L. Raizman, and Z. Rabinovich (2021) Transferable environment poisoning: training-time attack on reinforcement learning. In AAMAS, Cited by: §2.
  • S. Q. Zhang, J. Lin, and Q. Zhang (2020) Succinct and robust multi-agent communication with temporal message control. arXiv preprint arXiv:2010.14391, pp. 17271–17282. Cited by: §2, 2nd item.

Appendix A Algorithm

We present Algorithm 1, which is mainly built on top of PSRO [17], to train -MACRL. At each training step, we train the attacker and the defender alternately, each exploiting its counterpart's policies via sampling (line 7); the newly trained policy is then added to that player's policy set (line 8). We then compute the missing entries in (line 10) and finally solve the matrix game (line 11).

1 Input: initial policy sets for both players , maximum number of iterations ;
2 Empirically estimate utilities for each joint policy ;
3 Initialize mixed strategies for both players  for ;
4 Initialize number of iterations ;
5 while not converged and  do
6       for player  do
7             Train by letting it exploit ;
8             Extend policy set ;
9       Check convergence and ;
10       Estimate missing entries in ;
11       Compute mixed strategies by solving the matrix game defined by ;
12 Output: mixed strategies for both players ;
Algorithm 1 -MACRL
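The loop structure of Algorithm 1 can be sketched on a toy problem. This is a minimal illustration under strong assumptions, not the authors' implementation: a "policy" is just an index into a fixed zero-sum payoff table (rock-paper-scissors), the reinforcement-learning best-response oracle is replaced by exhaustive search, and the matrix game is solved approximately by fictitious play. All names (`PAYOFF`, `psro`, `solve_matrix_game`, and so on) are hypothetical.

```python
# Toy PSRO loop. Assumption: policies are indices into a fixed payoff table.
PAYOFF = [  # defender utility; rows: defender policy, cols: attacker policy
    [0.0, -1.0, 1.0],
    [1.0, 0.0, -1.0],
    [-1.0, 1.0, 0.0],
]
ALL_POLICIES = range(3)


def defender_utility(d, a):
    """Zero-sum game: the attacker receives the negative of this value."""
    return PAYOFF[d][a]


def solve_matrix_game(def_set, att_set, iters=5000):
    """Approximate a mixed equilibrium of the restricted game by fictitious play."""
    d_counts = [1.0] * len(def_set)
    a_counts = [1.0] * len(att_set)
    for _ in range(iters):
        # Each player best-responds to the opponent's empirical play so far.
        d_vals = [sum(c * defender_utility(d, a) for a, c in zip(att_set, a_counts))
                  for d in def_set]
        a_vals = [sum(-c * defender_utility(d, a) for d, c in zip(def_set, d_counts))
                  for a in att_set]
        d_counts[d_vals.index(max(d_vals))] += 1.0
        a_counts[a_vals.index(max(a_vals))] += 1.0
    return ([c / sum(d_counts) for c in d_counts],
            [c / sum(a_counts) for c in a_counts])


def best_response_defender(att_set, sigma_a):
    # Stand-in for "train the defender against the attacker's mixed strategy".
    return max(ALL_POLICIES, key=lambda d: sum(
        p * defender_utility(d, a) for a, p in zip(att_set, sigma_a)))


def best_response_attacker(def_set, sigma_d):
    # Stand-in for "train the attacker against the defender's mixed strategy".
    return max(ALL_POLICIES, key=lambda a: sum(
        -p * defender_utility(d, a) for d, p in zip(def_set, sigma_d)))


def psro(max_iters=10):
    def_set, att_set = [0], [0]        # initial policy sets
    sigma_d, sigma_a = [1.0], [1.0]    # initial mixed strategies
    for _ in range(max_iters):
        new_d = best_response_defender(att_set, sigma_a)
        new_a = best_response_attacker(def_set, sigma_d)
        grew = False
        if new_d not in def_set:
            def_set.append(new_d)
            grew = True
        if new_a not in att_set:
            att_set.append(new_a)
            grew = True
        if not grew:                   # no new best response: treat as converged
            break
        sigma_d, sigma_a = solve_matrix_game(def_set, att_set)
    return def_set, sigma_d, att_set, sigma_a


def_set, sigma_d, att_set, sigma_a = psro()
```

On this toy game both policy sets grow to all three policies and the mixed strategies approach the uniform distribution. In the actual setting, each best-response call would be a full training run and each payoff entry an empirical estimate from evaluation episodes.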

Appendix B Additional Introduction

B.1 Evaluation Metrics

We denote the expected maximum best-response utility of the defender, evaluated against the attacker policies from the attacker policy set during -MACRL training, as :

where is the probability of policy being chosen from the policy set . is the total return averaged over the 16 evaluation episodes given the best-response defender policy and the chosen attacker policy . For , there is only one attacker policy to compare with in the vanilla training framework.
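Read literally, the description above is a probability-weighted average of per-policy mean returns. The following sketch computes such a quantity; the function name, argument names, and all numbers are hypothetical, and the exact notation of the original equation is not reproduced here.

```python
def expected_best_response_utility(attacker_probs, episode_returns):
    """Probability-weighted mean return of the best-response defender.

    attacker_probs[i]: probability of attacker policy i under its mixed strategy.
    episode_returns[i]: total returns of the evaluation episodes played by the
        best-response defender against attacker policy i.
    """
    if abs(sum(attacker_probs) - 1.0) > 1e-6:
        raise ValueError("attacker probabilities must sum to 1")
    total = 0.0
    for prob, returns in zip(attacker_probs, episode_returns):
        total += prob * (sum(returns) / len(returns))
    return total


# Illustrative numbers only (two attacker policies, two episodes each).
mixed = expected_best_response_utility([0.25, 0.75], [[10.0, 20.0], [4.0, 4.0]])

# Vanilla training: a single attacker policy chosen with probability 1.
vanilla = expected_best_response_utility([1.0], [[8.0, 12.0]])
```

In the experiments each inner list would hold 16 returns, one per evaluation episode.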

Figure 12: The architecture of CommNet from [33]. An overview of our CommNet model. Left: view of module for a single agent . Note that the parameters are shared across all agents. Middle: a single communication step, where each agent's module propagates its internal state , as well as broadcasting a communication vector on a common channel (shown in red). Right: full model , showing input states for each agent, two communication steps and the output actions for each agent.

B.2 Introduction to the subject MACRL methods

CommNet [33] is a simple yet effective multi-agent communicative algorithm in which the incoming message of each agent is generated by averaging the messages sent by all agents in the last time step. We present the architecture of CommNet in Fig. 12.
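The averaging step can be written in a few lines. This is a minimal sketch of the aggregation only, with no learned modules. The `exclude_self` flag is an added assumption that covers the variant in which an agent averages only the other agents' messages.

```python
def commnet_incoming(messages, exclude_self=False):
    """Incoming message of each agent as the element-wise mean of sent messages.

    messages: list of equal-length vectors, one per agent, from the last step.
    exclude_self: if True, agent i averages only the other agents' messages.
    """
    n = len(messages)
    dim = len(messages[0])
    incoming = []
    for i in range(n):
        senders = [m for j, m in enumerate(messages) if not (exclude_self and j == i)]
        incoming.append([sum(m[k] for m in senders) / len(senders) for k in range(dim)])
    return incoming


sent = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
mean_all = commnet_incoming(sent)
mean_others = commnet_incoming(sent, exclude_self=True)
```

Because every message enters the mean with equal weight, a single malicious sender shifts the incoming message of every receiver, which is one way to see why such a channel is exposed to message attacks.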

TarMAC [5] extends CommNet by allowing agents to attend to the important parts of the incoming messages via an attention [36] network. Concretely, each agent generates signature and query vectors when sending a message. In the message receiving phase, the attention weights of the incoming messages are calculated by comparing the similarity between the query vector and the signature vector of each incoming message. Then a weighted sum over all incoming messages is performed to determine the message an agent will receive for decentralized execution. The architecture of TarMAC is in Fig. 13.

Figure 13: Overview of TarMAC from [5]. Left: At every timestep, each agent policy gets a local observation and aggregated message as input, and predicts an environment action and a targeted communication message . Right: Targeted communication between agents is implemented as a signature-based soft attention mechanism. Each agent broadcasts a message consisting of a signature , which can be used to encode agent-specific information and a value , which contains the actual message. At the next timestep, each receiving agent gets as input a convex combination of message values, where the attention weights are obtained by a dot product between sender’s signature and a query vector predicted from the receiver’s hidden state.
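The signature-based soft attention described above can be sketched as follows. This is a simplified, single-receiver illustration using a plain dot product and a softmax; the learned projections that produce queries, signatures and values are omitted, and all vectors are invented.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_messages(query, signatures, values):
    """Convex combination of message values weighted by query-signature similarity.

    query: vector predicted from the receiver's hidden state.
    signatures, values: one signature and one value vector per sender.
    """
    scores = [sum(q * s for q, s in zip(query, sig)) for sig in signatures]
    weights = softmax(scores)
    dim = len(values[0])
    combined = [sum(w * v[k] for w, v in zip(weights, values)) for k in range(dim)]
    return weights, combined


# Two senders with identical signatures receive equal attention.
w_equal, m_equal = aggregate_messages([1.0, 0.0], [[1.0, 0.0], [1.0, 0.0]],
                                      [[2.0, 0.0], [0.0, 2.0]])

# A signature aligned with the query dominates the combination.
w_sharp, m_sharp = aggregate_messages([10.0, 0.0], [[1.0, 0.0], [-1.0, 0.0]],
                                      [[2.0, 0.0], [0.0, 2.0]])
```

Since the weights depend on the sender's signature, a malicious sender could in principle craft a signature that attracts attention, which is one reason attention-based channels are not robust by construction.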

NDQ [40] achieves nearly decomposable Q-functions via communication minimization. Specifically, it trains the communication model by maximizing the mutual information between agents' action selection and communication messages while minimizing the entropy of messages between agents. Each agent broadcasts messages to all other agents. The sent messages are sampled from learned distributions which are optimized by mutual information. The overview of NDQ is in Fig. 14.

Figure 14: Overview of NDQ from [40]. The message encoder generates an embedding distribution that is sampled and concatenated with the current local history to serve as an input to the local action-value function. Local action values are fed into a mixing network to get an estimation of the global action value.

Appendix C Experimental Setup

C.1 Predator-and-Prey

There are 3 predators trying to catch 6 prey in a grid. Each predator has a local observation of a sub-grid centered around it and can move in one of the 4 compass directions, remain still, or perform a catch action. The prey move randomly, and a prey is caught if at least two nearby predators try to catch it simultaneously. The predators obtain a team reward of 10 for each successful catch and are punished for each failed catch action. A prey is removed from the grid after being caught. An episode ends after 60 steps or once all prey have been caught.
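The catch rule can be sketched as a small reward function. The penalty for a failed catch is not given in the text above, so it is exposed as a parameter `p`; treating the scenario labels p=0 and p=-0.5 in the result tables as this penalty is an assumption, as is applying it once per failed attempt on a prey.

```python
def catch_outcome(num_catchers, p, catch_reward=10.0, required=2):
    """Team reward and capture flag for one prey at one time step.

    num_catchers: nearby predators performing the catch action on this prey.
    p: penalty for a failed catch (assumed to be the scenario parameter).
    """
    if num_catchers >= required:
        return catch_reward, True     # successful cooperative catch
    if num_catchers > 0:
        return p, False               # a lone predator tried and failed
    return 0.0, False                 # nobody attempted a catch


success = catch_outcome(2, p=-0.5)
lone_attempt = catch_outcome(1, p=-0.5)
no_attempt = catch_outcome(0, p=-0.5)
```

With a nonzero penalty, a predator misled by a malicious message into attempting a catch alone is directly punished, which makes coordination failures visible in the return.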

C.2 TrafficJunction

Figure 15: TJ

In TJ scenarios, there are cars and each car's task is to complete its pre-assigned route. Each car has limited observation of a region around it, but is free to communicate with all other cars. The action space of each car at every timestep consists of gas and brake. The reward function is , where is the number of timesteps since the activation of the car, and is a collision penalty. Once a car completes its route, it becomes available to be sampled and spawned in the environment with a new route assignment. We do not carry out attacks on House3D [42] due to copyright disputes (see https://futurism.com/tech-suing-facebook-princeton-data).

C.3 StarCraft II

(a) bane_vs_hM
(b) 1o_2r_vs_4r
Figure 16: Snapshots of the StarCraft II scenarios.

We consider SMAC [29] combat scenarios where the enemy units are controlled by StarCraft II built-in AI (difficulty level is set to medium), and each of the ally units is controlled by a learning agent. The units of the two groups can be asymmetric. The action space is discrete: noop, move[direction], attack[enemy id], and stop. Agents receive a globally shared reward which is equal to the total damage dealt on the enemy units at each timestep.

bane_vs_hM: 3 Banelings try to kill a Hydralisk assisted by a Medivac. The 3 Banelings together can just blow up the Hydralisk. Therefore, to win the game, the 3 allied Banelings should not give the Hydralisk a chance to rest, during which the Medivac can restore its health.

1o_2r_vs_4r: Ally units consist of 1 Overseer and 2 Roaches. The Overseer can find the 4 Reapers and then notify its teammates to kill the invading enemy units, the Reapers, in order to win the game. At the start of an episode, ally units spawn at a random point on the map while enemy units are initialized at another random point. Given that only the Overseer knows the position of the enemy units, the ability to learn to deliver this message to its teammates is vital for winning the combat.

Appendix D Hyperparameters

We train all the subject MACRL methods with the hyperparameters used in their papers to reproduce their communicative performance. is . We use the Adam [14] optimizer with the best learning rate to train the attacker model as well as the discriminator and the message reconstructor of the defender model. We use the default hyperparameters of the Adam optimizer. and are and . The clip ratio of the advantage of PPO is . We use 3 MLP layers, each with 64 neurons and ReLU activation, for the attacker model and the defender models (the discriminator and the message reconstructor). We use PyTorch [27] to implement all the deep neural network models. We carry out all experiments on a machine with 8 NVIDIA Tesla V100 16GB GPUs and 44 Intel(R) Xeon(R) E5-2699 v4 @ 2.20GHz CPU cores. We use 5 random seeds in our experiments.

Appendix E Experimental Results

We present the performance of the best-response strategies on NDQ, CommNet and TarMAC in Table 4.

Scenarios bane_vs_hM 1o_2r_vs_4r
Test Won %
Scenarios p=0 p=-0.5
Test Return
Scenarios Easy Hard
Test Success %
Table 4: The performance of best-response strategies: NDQ (top), CommNet (middle) and TarMAC (bottom). The defender can effectively recover and maintain the coordinated behaviour.

We show the threshold of the message perturbation used in each scenario of each subject MACRL method in Tables 5, 6 and 7. The subject MACRL methods have different threshold values for each scenario. Some perturbation threshold values are large, for example the in Tables 6 and 7. The reason is that the range of the mean of the distribution of the original messages is large; for instance, the range of the mean of the message distribution in NDQ can be .
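One common way to enforce a perturbation threshold is to clip each component of the perturbation before adding it to the message. The sketch below assumes an element-wise (L-infinity) bound, which is an assumption for illustration and may differ from the constraint actually used in the experiments. `eps` stands for the per-scenario threshold.

```python
def perturb_message(message, delta, eps):
    """Add a perturbation to a message, clipped element-wise to [-eps, eps].

    message: original message vector.
    delta: raw perturbation proposed by the attacker.
    eps: perturbation threshold (assumed to be an L-infinity bound).
    """
    clipped = [max(-eps, min(eps, d)) for d in delta]
    return [m + d for m, d in zip(message, clipped)]


# Illustrative numbers only.
attacked = perturb_message([1.0, 2.0, 3.0], [0.5, -4.0, 9.0], eps=1.0)
```

Under such a bound, a threshold that looks large in absolute terms can still be small relative to the messages when the message means themselves span a wide range, which matches the explanation given above.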

Scenarios Easy Hard
Table 5: TarMAC
Scenarios p=0 p=-0.5
Table 6: CommNet
Scenarios bane_vs_hM 1o_2r_vs_4r
Table 7: NDQ