TarMAC: Targeted Multi-Agent Communication

We explore a collaborative multi-agent reinforcement learning setting where a team of agents attempts to solve cooperative tasks in partially-observable environments. In this scenario, learning an effective communication protocol is key. We propose a communication architecture that allows for targeted communication, where agents learn both what messages to send and who to send them to, solely from downstream task-specific reward without any communication supervision. Additionally, we introduce a multi-stage communication approach where the agents co-ordinate via multiple rounds of communication before taking actions in the environment. We evaluate our approach on a diverse set of cooperative multi-agent tasks, of varying difficulties, with varying number of agents, in a variety of environments ranging from 2D grid layouts of shapes and simulated traffic junctions to complex 3D indoor environments. We demonstrate the benefits of targeted as well as multi-stage communication. Moreover, we show that the targeted communication strategies learned by agents are both interpretable and intuitive.


1 Introduction

Effective communication is a key ability for collaborative multi-agent systems. Indeed, intelligent agents (humans or artificial) in real-world scenarios can significantly benefit from exchanging information that enables them to coordinate, strategize, and utilize their combined sensory experiences to act in the physical world. The ability to communicate has wide-ranging applications for artificial agents – from multi-player gameplay in simulated games (DoTA, Quake, StarCraft) or physical worlds (robot soccer), to networks of self-driving cars communicating with each other to achieve safe and swift transport, to teams of robots on search-and-rescue missions deployed in hostile and fast-evolving environments.

A salient property of human communication is the ability to hold targeted interactions. Rather than the ‘one-size-fits-all’ approach of broadcasting messages to all participating agents, as has been previously explored (Sukhbaatar et al., 2016; Foerster et al., 2016), it can be useful to direct certain messages to specific recipients. This enables a more flexible collaboration strategy in complex environments. For example, within a team of search-and-rescue robots with a diverse set of roles and goals, a message for a fire-fighter (“smoke is coming from the kitchen”) is largely meaningless for a bomb-defuser.

In this work we develop a collaborative multi-agent deep reinforcement learning approach that supports targeted communication. Crucially, each individual agent actively selects which other agents to send messages to. This targeted communication behavior is operationalized via a simple signature-based soft attention mechanism: along with the message, the sender broadcasts a key which is used by receivers to gauge the relevance of the message. This communication mechanism is learned implicitly, without any attention supervision, as a result of end-to-end training using a downstream task-specific team reward.

The inductive bias provided by soft attention in the communication architecture is sufficient to enable agents to 1) communicate agent-goal-specific messages (guide fire-fighter towards fire, bomb-defuser towards bomb, etc.), 2) be adaptive to variable team sizes (the size of the local neighborhood a self-driving car can communicate with changes as it moves), and 3) be interpretable through predicted attention probabilities that allow for inspection of which agent is communicating what message and to whom.

Our results however show that just using targeted communication is not enough. Complex real-world tasks might require large populations of agents to go through multiple stages of collaborative communication and reasoning, involving large amounts of information to be persistent in memory and exchanged via high-bandwidth communication channels. To this end, our actor-critic framework combines centralized training with decentralized execution (Lowe et al., 2017), thus enabling scaling to a large number of agents. In this context, our inter-agent communication architecture supports multiple stages of targeted interactions at every time-step, and the agents’ recurrent policies support persistent relevant information in internal states.

While natural language, a finite set of discrete tokens with pre-specified human-conventionalized meanings, may seem like an intuitive protocol for inter-agent communication – one that enables human-interpretability of interactions – forcing machines to communicate among themselves in discrete tokens presents additional training challenges. Since our work focuses on machine-only multi-agent teams, we allow agents to communicate via continuous vectors (rather than discrete symbols), and via the learning process, agents have the flexibility to discover and optimize their communication protocol as per task requirements.

We provide extensive empirical demonstration of the efficacy of our approach across a range of tasks, environments, and team sizes. We begin by benchmarking multi-agent communication with and without attention on a cooperative navigation task derived from the SHAPES environment (Andreas et al., 2016). We show that agents learn intuitive attention behavior across a spectrum of task difficulties. Next, we evaluate the same targeted multi-agent communication architecture on the traffic junction environment (Sukhbaatar et al., 2016), and show that agents are able to adaptively focus on ‘active’ agents in the case of varying team sizes. Finally, we demonstrate effective multi-agent communication in 3D environments on a cooperative first-person point-goal navigation task in the rich House3D environment (Wu et al., 2018).

2 Related Work

Multi-agent systems fall at the intersection of game theory, distributed systems, and Artificial Intelligence in general (Shoham & Leyton-Brown, 2008), and thus have a rich and diverse literature. Our work builds on and is related to prior work in deep multi-agent reinforcement learning, the centralized training and decentralized execution paradigm, and emergent communication protocols.

Multi-Agent Reinforcement Learning (MARL). Within MARL (see Busoniu et al. (2008) for a survey), our work is related to recent efforts on using recurrent neural networks to approximate agent policies (Hausknecht & Stone, 2015), algorithms stabilizing multi-agent training (Lowe et al., 2017; Foerster et al., 2018), and tasks in novel application domains such as coordination and navigation in 3D simulated environments (Peng et al., 2017; OpenAI, 2018; Jaderberg et al., 2018).

Centralized Training & Decentralized Execution. Both Sukhbaatar et al. (2016) and Hoshen (2017) adopt a fully centralized framework at both training and test time – a central controller processes local observations from all agents and outputs a probability distribution over joint actions. In this setting, any controller (e.g. a fully-connected network) can be viewed as implicitly encoding communication.

Sukhbaatar et al. (2016) present an efficient architecture to learn a centralized controller invariant to agent permutations – by sharing weights and averaging as in Zaheer et al. (2017). Meanwhile Hoshen (2017) proposes to replace averaging by an attentional mechanism to allow targeted interactions between agents. While closely related to our communication architecture, his work only considers fully supervised one-next-step prediction tasks, while we tackle the full reinforcement learning problem with tasks requiring planning over long time horizons.

Moreover, a centralized controller quickly becomes intractable in real-world tasks with many agents and high-dimensional observation spaces (navigation in House3D (Wu et al., 2018)). To address these weaknesses, we adopt the framework of centralized learning but decentralized execution (following Foerster et al. (2016); Lowe et al. (2017)) and further relax it by allowing agents to communicate. While agents can use extra information during training, at test time, they pick actions solely based on local observations and communication messages received from other agents.

Finally, we note that fully decentralized execution at test time without communication is very restrictive. It means 1) each agent must act myopically based solely on its local observation and 2) agents cannot coordinate their actions. In our setting, communication between agents offers a reasonable trade-off between allowing agents to globally coordinate while retaining tractability (since the communicated messages are much lower-dimensional than the observation space).

Emergent Communication Protocols. Our work is also related to recent work on learning communication protocols in a completely end-to-end manner with reinforcement learning – from perceptual input (pixels) to communication symbols (discrete or continuous) to actions (navigating in an environment). While Foerster et al. (2016), Jorge et al. (2016), Das et al. (2017), Kottur et al. (2017), Mordatch & Abbeel (2017), and Lazaridou et al. (2017) constrain agents to communicate with discrete symbols with the explicit goal of studying the emergence of language, our work operates in the paradigm of learning a continuous communication protocol in order to solve a downstream task (Sukhbaatar et al., 2016; Hoshen, 2017; Jiang & Lu, 2018). While Jiang & Lu (2018) also operate in a decentralized execution setting and use an attentional communication mechanism, their setup is significantly different from ours as they use attention to decide when to communicate, not who to communicate with (‘who’ depends on a hand-tuned neighborhood parameter in their work). Table 1 summarizes the main axes of comparison between our work and previous efforts in this exciting space.

Method                              Decentralized  Targeted       Multi-Stage  Reinforcement
                                    Execution      Communication  Decisions    Learning
DIAL (Foerster et al., 2016)        Yes            No             No           Yes (Q-Learning)
CommNets (Sukhbaatar et al., 2016)  No             No             Yes          Yes (REINFORCE)
VAIN (Hoshen, 2017)                 No             Yes            Yes          No (Supervised)
ATOC (Jiang & Lu, 2018)             Yes            No             No           Yes (Actor-Critic)
TarMAC (this paper)                 Yes            Yes            Yes          Yes (Actor-Critic)
Table 1: Comparison with previous work on collaborative multi-agent communication with continuous vectors.

3 Technical Background

Decentralized Partially Observable Markov Decision Processes (Dec-POMDPs). A Dec-POMDP is a cooperative multi-agent extension of a partially observable Markov decision process (Oliehoek, 2012). For $N$ agents, it is defined by a set of states $S$ describing possible configurations of all agents, a global reward function $R$, a transition probability function $T$, and for each agent $i \in \{1, \dots, N\}$ a set of allowed actions $A_i$, a set of possible observations $\Omega_i$ and an observation function $O_i$. Operationally, at each time step every agent picks an action $a_i$ based on its local observation $\omega_i$ following its own stochastic policy $\pi_{\theta_i}(a_i \mid \omega_i)$. The system randomly transitions to the next state $s'$ given the current state and joint action, $T(s' \mid s, a_1, \dots, a_N)$. The agent team receives a global reward $r = R(s, a_1, \dots, a_N)$ while each agent receives a local observation of the new state, $O_i(\omega_i \mid s')$. Agents aim to maximize the total expected return $J = \sum_{t=0}^{T} \gamma^t r_t$, where $\gamma$ is a discount factor and $T$ is the episode time horizon.
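As a small illustration of the return that agents maximize, the following sketch computes the discounted sum of global team rewards for one episode. The function name and the example reward sequence are hypothetical and chosen only for illustration.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * r_t for one episode of global team rewards."""
    total = 0.0
    # Accumulate from the last step backwards: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical 3-step episode with discount factor 0.5.
team_rewards = [1.0, 0.0, 2.0]
print(discounted_return(team_rewards, gamma=0.5))  # 1 + 0.5*0 + 0.25*2 = 1.5
```

Because the reward is global, every agent in the team would see the same value for a given episode.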

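Actor-Critic Algorithms. Policy gradient methods directly adjust the parameters $\theta$ of the policy in order to maximize the objective $J(\theta) = \mathbb{E}_{s \sim p^{\pi}, a \sim \pi_\theta(s)}[R(s, a)]$ by taking steps in the direction of $\nabla_\theta J(\theta)$. We can write the gradient with respect to the policy parameters as

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim p^{\pi}, a \sim \pi_\theta(s)}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi}(s, a)\right],$$

where $Q^{\pi}(s, a)$ is called the action-value: it is the expected remaining discounted reward if we take action $a$ in state $s$ and follow policy $\pi$ thereafter. Actor-Critic algorithms learn an approximation $\hat{Q}(s, a)$ of the unknown true action-value function by e.g. temporal-difference learning (Sutton & Barto, 1998). This $\hat{Q}(s, a)$ is called the Critic while the policy $\pi_\theta$ is called the Actor.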
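To make the actor update concrete, here is a minimal sketch of a single-sample policy-gradient estimate for a softmax policy over discrete actions. It assumes the policy parameters are the action logits themselves, which is a simplification for illustration and not the recurrent policy used in this paper; the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def actor_gradient(logits, action, q_value):
    """Single-sample estimate of grad_logits log pi(action) * Q(s, action)."""
    probs = softmax(logits)
    # d/d logit_k of log softmax(logits)[action] is 1[k == action] - probs[k].
    return [((1.0 if k == action else 0.0) - p) * q_value
            for k, p in enumerate(probs)]


# Two equally likely actions; the critic scores the sampled action 1 at 2.0.
print(actor_gradient(logits=[0.0, 0.0], action=1, q_value=2.0))  # [-1.0, 1.0]
```

A positive critic value raises the logit of the sampled action and lowers the others, which is the intuition behind "taking steps in the direction of" the gradient.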

Multi-Agent Actor-Critic. Lowe et al. (2017) propose a multi-agent Actor-Critic algorithm adapted to centralized learning and decentralized execution. Each agent learns its own individual policy $\pi_{\theta_i}(a_i \mid \omega_i)$ conditioned on local observation $\omega_i$, using a centralized Critic which estimates the joint action-value $Q(s, a_1, \dots, a_N)$.


4 TarMAC: Targeted Multi-Agent Communication

Figure 1: Overview of our multi-agent architecture with targeted communication. Left: At every timestep, each agent policy gets a local observation $\omega_i^t$ and aggregated message $c_i^t$ as input, and predicts an environment action $a_i^t$ and a targeted communication message $m_i^t$. Right: Targeted communication between agents is implemented as a signature-based soft attention mechanism. Each agent broadcasts a message $m_i^t$ consisting of a signature $k_i^t$, which can be used to encode agent-specific information, and a value $v_i^t$, which contains the actual message. At the next timestep, each receiving agent gets as input a convex combination of message values, where the attention weights are obtained by a dot product between the sender’s signature (key) and a query vector predicted from the receiver’s hidden state.

We now describe our multi-agent communication architecture in detail. Recall that we have $N$ agents with policies $\{\pi_1, \dots, \pi_N\}$, respectively parameterized by $\{\theta_1, \dots, \theta_N\}$, jointly performing a cooperative task. At every timestep $t$, the $i$th agent (for all $i \in \{1, \dots, N\}$) sees a local observation $\omega_i^t$, and must select a discrete environment action $a_i^t \sim \pi_{\theta_i}$ and a continuous communication message $m_i^t$, received by other agents at the next timestep, in order to maximize global reward $r_t \sim R$. Since no agent has access to the underlying state of the environment $s_t$, there is incentive in communicating with each other and being mutually helpful to do better as a team.

Policies and Decentralized Execution. Each agent is essentially modeled as a Dec-POMDP augmented with communication. Each agent’s policy $\pi_{\theta_i}$ is implemented as a Gated Recurrent Unit (GRU) (Cho et al., 2014). At every timestep, the local observation $\omega_i^t$ and a vector $c_i^t$ aggregating messages sent by all agents at the previous timestep (described in more detail below) are used to update the hidden state $h_i^t$ of the GRU, which encodes the entire message-action-observation history up to time $t$. From this internal state representation, the agent’s policy $\pi_{\theta_i}(a_i^t \mid h_i^t)$ predicts a categorical distribution over the space of actions, and another output head produces an outgoing message vector $m_i^t$. Note that for all our experiments, agents are symmetric and policies are instantiated from the same set of shared parameters, i.e. $\theta_1 = \dots = \theta_N$. This considerably speeds up learning.

Centralized Critic. Following prior work (Lowe et al., 2017), we operate under the centralized learning and decentralized execution paradigm wherein during training, a centralized critic guides the optimization of individual agent policies. The centralized Critic takes as input predicted actions $\{a_1^t, \dots, a_N^t\}$ and internal state representations $\{h_1^t, \dots, h_N^t\}$ from all agents to estimate the joint action-value $\hat{Q}_t$ at every timestep. The centralized Critic is learned by temporal difference (Sutton & Barto, 1998) and the gradient of the expected return $J(\theta_i) = \mathbb{E}[R]$ with respect to policy parameters is approximated by:

$$\nabla_{\theta_i} J(\theta_i) = \mathbb{E}\left[\nabla_{\theta_i} \log \pi_{\theta_i}(a_i^t \mid h_i^t)\, \hat{Q}_t(h_1^t, \dots, h_N^t, a_1^t, \dots, a_N^t)\right]$$

Note that compared to an individual critic $\hat{Q}_i(h_i^t, a_i^t)$ for each agent, having a centralized critic leads to considerably lower variance in policy gradient estimates since it takes into account actions from all agents. At test time, the critic is not needed anymore and policy execution is fully decentralized.

Targeted, Multi-Stage Communication. Establishing complex collaboration strategies requires targeted communication, i.e. the ability to send specific messages to specific agents, as well as multi-stage communication, i.e. multiple rounds of back-and-forth interactions between agents. We use a signature-based soft-attention mechanism in our communication structure to enable targeting. Each message $m_i^t$ consists of two parts – a signature $k_i^t \in \mathbb{R}^{d_k}$ to target recipients, and a value $v_i^t \in \mathbb{R}^{d_v}$:

$$m_i^t = [\, k_i^t,\; v_i^t \,] \qquad (1)$$

At the receiving end, each agent (indexed by $j$) predicts a query vector $q_j^{t+1} \in \mathbb{R}^{d_k}$ from its hidden state $h_j^{t+1}$ and uses it to compute a dot product with signatures of all $N$ messages. This is scaled by $1/\sqrt{d_k}$ followed by a softmax to obtain attention weight $\alpha_{ji}$ for each message value vector:

$$\alpha_j = \mathrm{softmax}\left[ \frac{q_j^{t+1\,\top} k_1^t}{\sqrt{d_k}}, \; \dots, \; \frac{q_j^{t+1\,\top} k_i^t}{\sqrt{d_k}}, \; \dots, \; \frac{q_j^{t+1\,\top} k_N^t}{\sqrt{d_k}} \right] \qquad (2)$$

$$c_j^{t+1} = \sum_{i=1}^{N} \alpha_{ji}\, v_i^t \qquad (3)$$

Note that equation 2 also includes $\alpha_{ii}$, corresponding to the ability to self-attend (Vaswani et al., 2017), which we empirically found to improve performance, especially in situations when an agent has found the goal in a coordinated navigation task and all it is required to do is stay at the goal, so others benefit from attending to this agent’s message but return communication is not needed.

For multiple stages of communication, aggregated message vector $c_j^{t+1}$ and internal state $h_j^t$ are first used to predict the next internal state $h_j'^{\,t}$ taking into account a first round of communication:

$$h_j'^{\,t} = \tanh\left(W_{h \to h'} \left[\, c_j^{t+1} \,\|\, h_j^t \,\right]\right) \qquad (4)$$

Next, $h_j'^{\,t}$ is used to predict signature, value, and query, followed by repeating Eqns 1-4 for multiple rounds until we get a new final aggregated message vector $c_j^{t+1}$ to be used as input at the next timestep.
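The receiver-side aggregation described above can be sketched in a few lines of plain Python. This is a minimal illustration of one round of signature-based soft attention (scaled dot product, softmax, convex combination of values), not the authors' implementation: the function names are hypothetical, the queries, signatures, and values are given directly rather than predicted by a network, and the multi-stage hidden-state update is omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def aggregate_messages(queries, signatures, values):
    """One communication round.

    queries[j]    : query vector of receiver j (length d_k)
    signatures[i] : signature (key) broadcast by sender i (length d_k)
    values[i]     : value vector broadcast by sender i (length d_v)
    Returns (attention weights per receiver, aggregated message per receiver).
    """
    d_k = len(signatures[0])
    d_v = len(values[0])
    scale = 1.0 / math.sqrt(d_k)
    weights, aggregated = [], []
    for q in queries:
        # Scaled dot product with every sender's signature, self included.
        alpha = softmax([dot(q, k) * scale for k in signatures])
        # Convex combination of the senders' value vectors.
        c = [sum(a * v[d] for a, v in zip(alpha, values)) for d in range(d_v)]
        weights.append(alpha)
        aggregated.append(c)
    return weights, aggregated


# Three agents; receiver 0's query matches sender 0's signature,
# while receiver 1's all-zero query attends uniformly.
signatures = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
queries = [[4.0, 0.0], [0.0, 0.0]]
weights, messages = aggregate_messages(queries, signatures, values)
for alpha in weights:
    print([round(a, 3) for a in alpha])
```

Because the softmax runs over whatever list of signatures is supplied, the same routine works unchanged when the number of senders varies, which is the property that lets attention adapt to changing team sizes.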

5 Experiments

We evaluate our targeted multi-agent communication architecture on a variety of tasks and environments. All our models were trained with a batched synchronous version of the multi-agent Actor-Critic described above, using RMSProp, with fixed settings for the learning rate, RMSProp smoothing constant, batch size, discount factor, and entropy regularization coefficient for agent policies.

5.1 Shapes

The SHAPES dataset was introduced by Andreas et al. (2016) (github.com/jacobandreas/nmn2/tree/shapes), and originally created for testing compositional visual reasoning for the task of visual question answering. It consists of synthetic images of 2D colored shapes arranged in a grid along with corresponding question-answer pairs. There are 3 shapes (circle, square, triangle), 3 colors (red, green, blue), and 2 sizes (small, big) in total (see Figure 2).

We convert each image from the SHAPES dataset into an active environment where agents can now be spawned at different regions of the image, observe only a local patch around them, and take actions to move around – up, down, left, right, stay. Each agent is tasked with navigating to a specified goal state in the environment – ‘red’, ‘blue square’, ‘small green circle’, etc. – and the reward for each agent at every timestep is based on team performance.

(a) Four agents have to find red, red, green, and blue respectively. The panels progress from the initial spawn locations: once one agent has been on red, the agents that have to find red attend to its messages, while the agent that has found its goal (green) self-attends; an agent attends to messages from a teammate that is on its target, blue; when an agent finds red, the others searching for red shift attention to it; finally, all agents are at their respective goal locations and primarily self-attending.
(b) Agents have to find red on a large environment. One agent finds red and signals all other agents; all agents then make their way to its location and eventually converge around red.
Figure 2: Visualizations of learned targeted communication in SHAPES. Figure best viewed in color.
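Table 2: Success rates on different settings of cooperative navigation in the SHAPES environment. Columns are three configurations, each given by environment size, number of agents, and goals (find[red], find[red], and find[red, red, green, blue]); rows include the no-communication and no-attention baselines.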

Having a symmetric, team-based reward incentivizes agents to cooperate with each other in finding each agent’s goal. For example, as shown in Figure 2(a), if one agent’s goal is to find red and another agent’s goal is to find blue, it is in the second agent’s interest to let the first know if it passes by red during its exploration / quest for blue, and vice versa. SHAPES serves as a flexible testbed for carefully controlling and analyzing the effect of changing the size of the environment, number of agents, goal configurations, etc. Figure 2 visualizes learned protocols from two different configurations, and Table 2 reports quantitative evaluation for three different configurations. Benefits of communication and attention increase with task complexity (a change in environment size, and moving from find[red] to find[red, red, green, blue]).

5.2 Traffic Junction

Table 3: Success rates on the easy and hard variants of the traffic junction task for no communication, CommNets (Sukhbaatar et al., 2016), TarMAC 1-stage, and TarMAC 2-stage. Our targeted 2-stage communication architecture significantly outperforms Sukhbaatar et al. (2016) on the ‘hard’ variant of the task. Note that 1-stage and 2-stage refer to the number of rounds of communication between actions (Equation 4).
(a) Brake probabilities at different locations on the hard traffic junction environment. Cars tend to brake close to or right before entering junctions.
(b) Attention probabilities at different locations. Cars are most attended to in the ‘internal grid’ – right after the 1st junction and before the 2nd.
(c) No. of cars being attended to. 1) is positively correlated with total cars, indicating that TarMAC is adaptive to dynamic team sizes, and 2) is slightly right-shifted, since it takes a few steps of communication to adapt.
Figure 3: Results on the traffic junction environment.

Environment and Task. The simulated traffic junction environments from Sukhbaatar et al. (2016) consist of cars moving along pre-assigned, potentially intersecting routes on one or more road junctions. The total number of cars is fixed at $N_{max}$, and at every timestep new cars get added to the environment with probability $p_{arrive}$. Once a car completes its route, it becomes available to be sampled and added back to the environment with a different route assignment. Each car has limited visibility of a region around it, but is free to communicate with all other cars. The action space for each car at every timestep is gas and brake, and the reward consists of a linear time penalty in $\tau$, where $\tau$ is the number of timesteps since the car has been active, and a collision penalty.
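The per-car reward described above can be sketched as follows. The function name, its parameters, and the coefficient values in the usage example are hypothetical placeholders for illustration; they are not the values used in the experiments.

```python
def car_reward(steps_active, collided, time_coef, collision_penalty):
    """Per-timestep reward for one car: linear time penalty plus collision penalty.

    steps_active      : timesteps since this car became active
    collided          : whether the car is in a collision at this timestep
    time_coef         : negative coefficient on steps_active (assumed)
    collision_penalty : negative reward added on collision (assumed)
    """
    reward = time_coef * steps_active
    if collided:
        reward += collision_penalty
    return reward


# Placeholder coefficients, chosen only for illustration.
print(car_reward(steps_active=5, collided=False, time_coef=-0.01, collision_penalty=-10.0))
print(car_reward(steps_active=5, collided=True, time_coef=-0.01, collision_penalty=-10.0))
```

The time penalty grows the longer a car stays active, which discourages cars from braking indefinitely, while the collision penalty discourages driving through junctions blindly.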

Quantitative Results. We compare our approach with CommNets (Sukhbaatar et al., 2016) on the easy and hard difficulties of the traffic junction environment. The easy task has one junction of two one-way roads, while the hard task has four connected junctions of two-way roads; the two variants also differ in grid size, $N_{max}$, and $p_{arrive}$. See Figure 3(a), 3(b) for an example of the four two-way junctions in the hard task. Table 3 reports success rates, including a no-communication baseline, on easy and hard. On easy, CommNets and TarMAC reach nearly the same success rate. On hard, TarMAC with 1-stage communication significantly outperforms CommNets, while 2-stage further improves on this, which is an 18% absolute improvement over CommNets.

Model Interpretation. Interpreting the learned policies, Figure 3(a) shows braking probabilities at different locations: cars tend to brake close to or right before entering traffic junctions, which is reasonable since junctions have the highest chances for collisions.

Turning to attention probabilities (Figure 3(b)), we can see that cars are most attended to when in the ‘internal grid’ – right after crossing the 1st junction and before hitting the 2nd junction. These attention probabilities are intuitive: cars learn to attend to specific sensitive locations with the most relevant local observations to avoid collisions.

Finally, Figure 3(c) compares the total number of cars in the environment vs. the number of cars being attended to with probability above a threshold at any time. Interestingly, these are (loosely) positively correlated under Spearman’s rank correlation, which shows that TarMAC is able to adapt to a variable number of agents. Crucially, agents learn this dynamic targeting behavior purely from task rewards with no hand-coding! Note that the right shift between the two curves is expected, as it takes a few timesteps of communication for team size changes to propagate. When one curve is shifted in time to account for this lag, the Spearman’s rank correlation between the two curves goes up.
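The lagged correlation analysis can be reproduced in spirit with a short standard-library sketch. It computes Spearman's rank correlation (Pearson correlation of average ranks) between two time series, optionally shifting the second series by a lag. The function names and the toy series are hypothetical, and the sketch assumes neither series is constant.

```python
import math


def ranks(xs):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r


def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def spearman(a, b):
    return pearson(ranks(a), ranks(b))


def lagged_spearman(total_cars, attended_cars, lag):
    """Correlate total_cars[t] with attended_cars[t + lag] for lag >= 0."""
    if lag > 0:
        return spearman(total_cars[:-lag], attended_cars[lag:])
    return spearman(total_cars, attended_cars)


# Toy series: the attended count trails the total count by one timestep.
total = [1, 2, 3, 4, 3, 2, 1, 2]
attended = [1, 1, 2, 3, 4, 3, 2, 1]
print(round(lagged_spearman(total, attended, lag=0), 3))
print(round(lagged_spearman(total, attended, lag=1), 3))
```

In the toy data the correlation is higher at a lag of one step than at zero lag, mirroring the right shift between the two curves discussed above.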

5.3 House3D

Finally, we benchmark TarMAC on a cooperative point-goal navigation task in House3D (Wu et al., 2018). House3D provides a rich and diverse set of publicly available (github.com/facebookresearch/house3d) 3D indoor environments, wherein agents do not have access to the top-down map and must navigate purely from first-person vision. Similar to SHAPES, the agents are tasked with finding a specified goal (such as ‘fireplace’), spawned at random locations in the environment and allowed to communicate with each other and move around. Each agent gets a shaped reward based on progress towards the specified target. An episode is successful if all agents end within a fixed distance of the target object within a fixed budget of navigation steps.

Table 4 shows success rates on a find[fireplace] task in House3D. Compared to a no-communication navigation policy trained with the same reward structure, mean-pooled communication (no attention) performs slightly better, and TarMAC achieves the best success rate. Figure 4 visualizes predicted navigation trajectories of agents. Note that the communication vectors are significantly more compact than the high-dimensional observation space, making our approach particularly attractive for scaling to large teams.

Table 4: Success rates on a multi-agent cooperative find[fireplace] navigation task in House3D, with rows including the no-communication and no-attention baselines.
Figure 4: Agents navigating to the fireplace in House3D (marked in yellow). Note in particular that one agent is spawned facing away from it. It communicates with others, turns to face the fireplace, and moves towards it.

6 Conclusions and Future Work

We introduced TarMAC, an architecture for multi-agent reinforcement learning which allows targeted interactions between agents and multiple stages of collaborative reasoning at every timestep. Evaluation on three diverse environments shows that our model is able to learn intuitive attention behavior and improves performance, with downstream task-specific team reward as sole supervision.

While multi-agent navigation experiments in House3D show promising performance, we aim to exhaustively benchmark TarMAC on more challenging 3D navigation tasks because we believe this is where decentralized targeted communication can have the most impact – as it allows scaling to a large number of agents with large observation spaces. Given that the 3D navigation problem is hard in and of itself, it would be particularly interesting to investigate combinations of the TarMAC framework with recent advances orthogonal to our approach (spatial memory, planning networks).