Effective communication is a key ability for collaborative multi-agent systems. Indeed, intelligent agents (humans or artificial) in real-world scenarios can significantly benefit from exchanging information that enables them to coordinate, strategize, and utilize their combined sensory experiences to act in the physical world. The ability to communicate has wide-ranging applications for artificial agents – from multi-player gameplay in simulated games (DoTA, Quake, StarCraft) or physical worlds (robot soccer), to networks of self-driving cars communicating with each other to achieve safe and swift transport, to teams of robots on search-and-rescue missions deployed in hostile and fast-evolving environments.
A salient property of human communication is the ability to hold targeted interactions. Rather than the ‘one-size-fits-all’ approach of broadcasting messages to all participating agents, as has been previously explored (Sukhbaatar et al., 2016; Foerster et al., 2016), it can be useful to direct certain messages to specific recipients. This enables a more flexible collaboration strategy in complex environments. For example, within a team of search-and-rescue robots with a diverse set of roles and goals, a message for a fire-fighter (“smoke is coming from the kitchen”) is largely meaningless for a bomb-defuser.
In this work we develop a collaborative multi-agent deep reinforcement learning approach that supports targeted communication. Crucially, each individual agent actively selects which other agents to send messages to. This targeted communication behavior is operationalized via a simple signature-based soft attention mechanism: along with the message, the sender broadcasts a key which is used by receivers to gauge the relevance of the message. This communication mechanism is learned implicitly, without any attention supervision, as a result of end-to-end training using a downstream task-specific team reward.
The inductive bias provided by soft attention in the communication architecture is sufficient to enable agents to 1) communicate agent-goal-specific messages (e.g. guide a fire-fighter towards the fire and a bomb-defuser towards the bomb), 2) be adaptive to variable team sizes (e.g. the size of the local neighborhood a self-driving car can communicate with changes as it moves), and 3) be interpretable through predicted attention probabilities that allow for inspection of which agent is communicating what message and to whom.
However, our results show that targeted communication alone is not enough. Complex real-world tasks might require large populations of agents to go through multiple stages of collaborative communication and reasoning, requiring large amounts of information to persist in memory and be exchanged via high-bandwidth communication channels. To this end, our actor-critic framework combines centralized training with decentralized execution (Lowe et al., 2017), thus enabling scaling to a large number of agents. In this context, our inter-agent communication architecture supports multiple stages of targeted interactions at every time-step, and the agents' recurrent policies help persist relevant information in their internal states.
While natural language, a finite set of discrete tokens with pre-specified human-conventionalized meanings, may seem like an intuitive protocol for inter-agent communication – one that enables human-interpretability of interactions – forcing machines to communicate among themselves in discrete tokens presents additional training challenges. Since our work focuses on machine-only multi-agent teams, we allow agents to communicate via continuous vectors (rather than discrete symbols), and via the learning process, agents have the flexibility to discover and optimize their communication protocol as per task requirements.
We provide extensive empirical demonstration of the efficacy of our approach across a range of tasks, environments, and team sizes. We begin by benchmarking multi-agent communication with and without attention on a cooperative navigation task derived from the SHAPES environment (Andreas et al., 2016). We show that agents learn intuitive attention behavior across a spectrum of task difficulties. Next, we evaluate the same targeted multi-agent communication architecture on the traffic junction environment (Sukhbaatar et al., 2016), and show that agents are able to adaptively focus on ‘active’ agents in the case of varying team sizes. Finally, we demonstrate effective multi-agent communication in 3D environments on a cooperative first-person point-goal navigation task in the rich House3D environment (Wu et al., 2018).
2 Related Work
Multi-Agent Reinforcement Learning (MARL). Within MARL (see Busoniu et al. (2008)
for a survey), our work is related to recent efforts on using recurrent neural networks to approximate agent policies (Hausknecht & Stone, 2015), algorithms stabilizing multi-agent training (Lowe et al., 2017; Foerster et al., 2018), and tasks in novel application domains such as coordination and navigation in 3D simulated environments (Peng et al., 2017; OpenAI, 2018; Jaderberg et al., 2018).
Some prior approaches (Sukhbaatar et al., 2016; Hoshen, 2017) adopt a fully centralized framework at both training and test time – a central controller processes local observations from all agents and outputs a probability distribution over joint actions. In this setting, any controller (e.g. a fully-connected network) can be viewed as implicitly encoding communication. Sukhbaatar et al. (2016) present an efficient architecture to learn a centralized controller invariant to agent permutations – by sharing weights and averaging, as in Zaheer et al. (2017). Meanwhile, Hoshen (2017) proposes to replace averaging by an attentional mechanism to allow targeted interactions between agents. While closely related to our communication architecture, his work only considers fully supervised one-next-step prediction tasks, while we tackle the full reinforcement learning problem with tasks requiring planning over long time horizons.
Moreover, a centralized controller quickly becomes intractable in real-world tasks with many agents and high-dimensional observation spaces (e.g. navigation in House3D (Wu et al., 2018)). To address these weaknesses, we adopt the framework of centralized learning but decentralized execution (following Foerster et al. (2016); Lowe et al. (2017)) and further relax it by allowing agents to communicate. While agents can use extra information during training, at test time they pick actions solely based on local observations and communication messages received from other agents.
Finally, we note that fully decentralized execution at test time without communication is very restrictive. It means 1) each agent must act myopically based solely on its local observation and 2) agents cannot coordinate their actions. In our setting, communication between agents offers a reasonable trade-off between allowing agents to globally coordinate while retaining tractability (since the communicated messages are much lower-dimensional than the observation space).
Emergent Communication Protocols. Our work is also related to recent work on learning communication protocols in a completely end-to-end manner with reinforcement learning – from perceptual input (pixels) to communication symbols (discrete or continuous) to actions (navigating in an environment). While (Foerster et al., 2016; Jorge et al., 2016; Das et al., 2017; Kottur et al., 2017; Mordatch & Abbeel, 2017; Lazaridou et al., 2017) constrain agents to communicate with discrete symbols with the explicit goal to study emergence of language, our work operates in the paradigm of learning a continuous communication protocol in order to solve a downstream task (Sukhbaatar et al., 2016; Hoshen, 2017; Jiang & Lu, 2018). While (Jiang & Lu, 2018) also operate in a decentralized execution setting and use an attentional communication mechanism, their setup is significantly different from ours as they use attention to decide when to communicate, not who to communicate with (‘who’ depends on a hand-tuned neighborhood parameter in their work). Table 1 summarizes the main axes of comparison between our work and previous efforts in this exciting space.
| Approach | Decentralized Execution | Targeted Communication | Multi-Stage Communication | Reinforcement Learning |
|---|---|---|---|---|
| DIAL (Foerster et al., 2016) | Yes | No | No | Yes (Q-Learning) |
| CommNets (Sukhbaatar et al., 2016) | No | No | Yes | Yes (REINFORCE) |
| VAIN (Hoshen, 2017) | No | Yes | Yes | No (Supervised) |
| ATOC (Jiang & Lu, 2018) | Yes | No | No | Yes (Actor-Critic) |
| TarMAC (this paper) | Yes | Yes | Yes | Yes (Actor-Critic) |
3 Technical Background
Decentralized Partially Observable Markov Decision Processes (Dec-POMDPs).
A Dec-POMDP is a cooperative multi-agent extension of a partially observable Markov decision process (Oliehoek, 2012). For $N$ agents, it is defined by a set of states $\mathcal{S}$ describing possible configurations of all agents, a global reward function $R$, a transition probability function $\mathcal{T}$, and for each agent $i \in \{1, ..., N\}$ a set of allowed actions $\mathcal{A}_i$, a set of possible observations $\Omega_i$ and an observation function $O_i$. Operationally, at each time step $t$ every agent picks an action $a_t^i$ based on its local observation $\omega_t^i$ following its own stochastic policy $\pi_{\theta_i}(a_t^i | \omega_t^i)$. The system randomly transitions to the next state $s_{t+1}$ given the current state and joint action, following $\mathcal{T}(s_{t+1} | s_t, a_t^1, ..., a_t^N)$. The agent team receives a global reward $r_t = R(s_t, a_t^1, ..., a_t^N)$ while each agent receives a local observation of the new state, $\omega_{t+1}^i = O_i(s_{t+1})$. Agents aim to maximize the total expected return $J = \sum_{t=0}^{T} \gamma^t r_t$, where $\gamma \in [0, 1)$ is a discount factor and $T$ is the episode time horizon.
Actor-Critic Algorithms. Policy gradient methods directly adjust the parameters $\theta$ of the policy in order to maximize the objective $J(\theta) = \mathbb{E}_{s \sim p^\pi, a \sim \pi_\theta(s)}[R(s, a)]$ by taking steps in the direction of $\nabla_\theta J(\theta)$. We can write the gradient with respect to the policy parameters as

$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim p^\pi, a \sim \pi_\theta(s)}\left[\nabla_\theta \log \pi_\theta(a | s) \, Q^\pi(s, a)\right],$

where $Q^\pi(s, a)$ is called the action-value; it is the expected remaining discounted reward if we take action $a$ in state $s$ and follow policy $\pi$ thereafter. Actor-Critic algorithms learn an approximation $\hat{Q}(s, a)$ of the unknown true action-value function by e.g. temporal-difference learning (Sutton & Barto, 1998). This $\hat{Q}(s, a)$ is called the Critic, while the policy $\pi_\theta$ is called the Actor.
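As a concrete illustration, the policy-gradient estimator above has a closed form for a softmax policy over discrete actions. The sketch below is ours (function names are not from the paper) and shows a single-sample gradient with respect to the policy logits:

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def policy_gradient_wrt_logits(logits, action, q_value):
    """Single-sample estimate of grad_logits [log pi(action) * Q(s, action)]
    for a softmax policy: grad log softmax(logits)[a] = onehot(a) - softmax(logits)."""
    probs = softmax(logits)
    one_hot = np.zeros_like(probs)
    one_hot[action] = 1.0
    return (one_hot - probs) * q_value

grad = policy_gradient_wrt_logits(np.zeros(2), action=0, q_value=2.0)
# With uniform probs [0.5, 0.5], the gradient is 2.0 * [0.5, -0.5] = [1.0, -1.0]
```

Note that the gradient always sums to zero across logits, since increasing one action's probability necessarily decreases the others'.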
4 TarMAC: Targeted Multi-Agent Communication
We now describe our multi-agent communication architecture in detail. Recall that we have $N$ agents with policies $\{\pi_1, ..., \pi_N\}$, respectively parameterized by $\theta = \{\theta_1, ..., \theta_N\}$, jointly performing a cooperative task. At every timestep $t$, the $i$th agent (for all $i \in \{1, ..., N\}$) sees a local observation $\omega_t^i$, and must select a discrete environment action $a_t^i$ and a continuous communication message $m_t^i$, received by other agents at the next timestep, in order to maximize the global reward $r_t$. Since no agent has access to the underlying state of the environment $s_t$, there is an incentive to communicate with each other and be mutually helpful to do better as a team.
Policies and Decentralized Execution. Each agent is essentially modeled as a Dec-POMDP augmented with communication. Each agent's policy $\pi_{\theta_i}$ is implemented as a 1-layer Gated Recurrent Unit (Cho et al., 2014). At every timestep, the local observation $\omega_t^i$ and a vector $c_t^i$ aggregating messages sent by all agents at the previous timestep (described in more detail below) are used to update the hidden state $h_t^i$ of the GRU, which encodes the entire message-action-observation history up to time $t$. From this internal state representation, the agent's policy predicts a categorical distribution over the space of actions, and another output head produces an outgoing message vector $m_t^i$. Note that for all our experiments, agents are symmetric and policies are instantiated from the same set of shared parameters, $\theta_1 = ... = \theta_N$. This considerably speeds up learning.
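A minimal numpy sketch of one policy step as described above. Weight shapes, names, and dimensions here are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
d_obs, d_msg, d_hid, n_actions = 8, 4, 16, 5

# one shared parameter set for all agents (theta_1 = ... = theta_N)
d_in = d_obs + d_msg
W_z = rng.normal(scale=0.1, size=(d_hid, d_in + d_hid))  # update gate
W_r = rng.normal(scale=0.1, size=(d_hid, d_in + d_hid))  # reset gate
W_h = rng.normal(scale=0.1, size=(d_hid, d_in + d_hid))  # candidate state
W_pi = rng.normal(scale=0.1, size=(n_actions, d_hid))    # action head
W_m = rng.normal(scale=0.1, size=(d_msg, d_hid))         # message head

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def policy_step(obs, agg_msg, h):
    """One GRU update from (local observation, aggregated incoming message,
    hidden state); returns action probabilities, outgoing message, new hidden state."""
    x = np.concatenate([obs, agg_msg])
    z = sigmoid(W_z @ np.concatenate([x, h]))
    r = sigmoid(W_r @ np.concatenate([x, h]))
    h_tilde = np.tanh(W_h @ np.concatenate([x, r * h]))
    h_new = (1.0 - z) * h + z * h_tilde
    logits = W_pi @ h_new
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return probs, W_m @ h_new, h_new

probs, msg, h = policy_step(rng.normal(size=d_obs), np.zeros(d_msg), np.zeros(d_hid))
```

Because the same weights serve every agent, team size can vary without any change to the parameterization.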
Centralized Critic. Following prior work (Lowe et al., 2017), we operate under the centralized learning and decentralized execution paradigm wherein during training, a centralized critic guides the optimization of individual agent policies. The centralized Critic takes as input predicted actions and internal state representations from all agents to estimate the joint action-value $\hat{Q}_t$ at every timestep. The centralized Critic is learned by temporal difference (Sutton & Barto, 1998) and the gradient of the expected return with respect to policy parameters is approximated by:

$\nabla_{\theta_i} J(\theta_i) \approx \mathbb{E}\left[\nabla_{\theta_i} \log \pi_{\theta_i}(a_t^i | \omega_t^i, c_t^i)\, \hat{Q}_t(h_t^1, ..., h_t^N, a_t^1, ..., a_t^N)\right].$
Note that compared to an individual critic for each agent, having a centralized critic leads to considerably lower variance in policy gradient estimates, since it takes into account actions from all agents. At test time, the critic is no longer needed and policy execution is fully decentralized.
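For concreteness, a centralized critic of this kind can be sketched as a small network over the concatenation of all agents' internal states and one-hot actions. The architecture below is a hypothetical stand-in, not the paper's exact network:

```python
import numpy as np

rng = np.random.default_rng(1)
n_agents, d_hid, n_actions, d_critic = 3, 16, 5, 32

W1 = rng.normal(scale=0.1, size=(d_critic, n_agents * (d_hid + n_actions)))
w2 = rng.normal(scale=0.1, size=d_critic)

def centralized_q(hidden_states, actions):
    """Joint action-value from every agent's hidden state and chosen action.
    hidden_states: (n_agents, d_hid); actions: list of n_agents action indices."""
    one_hots = np.zeros((n_agents, n_actions))
    one_hots[np.arange(n_agents), actions] = 1.0
    x = np.concatenate([hidden_states.ravel(), one_hots.ravel()])
    return float(w2 @ np.tanh(W1 @ x))  # scalar Q(h^1..h^N, a^1..a^N)

hs = rng.normal(size=(n_agents, d_hid))
q = centralized_q(hs, [0, 2, 4])
```

Since the critic conditions on the joint action, changing any single agent's action changes the estimated joint value, which is what reduces the variance relative to per-agent critics.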
Targeted, Multi-Stage Communication. Establishing complex collaboration strategies requires targeted communication, i.e. the ability to send specific messages to specific agents, as well as multi-stage communication, i.e. multiple rounds of back-and-forth interactions between agents. We use a signature-based soft-attention mechanism in our communication structure to enable targeting. Each message $m_t^i$ consists of two parts – a signature $k_t^i \in \mathbb{R}^{d_k}$ to target recipients, and a value $v_t^i \in \mathbb{R}^{d_v}$ containing the actual message:

$m_t^i = \left[\, k_t^i \,;\, v_t^i \,\right].$
At the receiving end, each agent (indexed by $j$) predicts a query vector $q_{t+1}^j \in \mathbb{R}^{d_k}$ from its hidden state and uses it to compute a dot product with the signatures of all messages. This is scaled by $1/\sqrt{d_k}$, followed by a softmax to obtain the attention weight $\alpha_{ij}$ for each message value vector:

$\alpha_{ij} = \operatorname{softmax}_i\!\left[\frac{(q_{t+1}^j)^\top k_t^i}{\sqrt{d_k}}\right], \qquad c_{t+1}^j = \sum_{i=1}^{N} \alpha_{ij}\, v_t^i.$
Note that equation 2 also includes $\alpha_{jj}$, corresponding to the ability to self-attend (Vaswani et al., 2017), which we empirically found to improve performance. This is especially useful in situations where an agent has found the goal in a coordinated navigation task and all it needs to do is stay at the goal, so that others benefit from attending to this agent's message while return communication is not needed.
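The targeting mechanism above amounts to scaled dot-product attention over all messages, including each agent's own. A minimal sketch (function and variable names are ours):

```python
import numpy as np

def aggregate_messages(queries, signatures, values):
    """queries: (N, d_k), one per receiving agent; signatures: (N, d_k) and
    values: (N, d_v), one per sender (self included). Returns the aggregated
    message per receiver (N, d_v) and the (receiver, sender) attention weights."""
    d_k = signatures.shape[1]
    scores = queries @ signatures.T / np.sqrt(d_k)   # (receiver, sender)
    scores -= scores.max(axis=1, keepdims=True)      # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)          # softmax over senders
    return attn @ values, attn

# Receiver 0's query aligns with sender 1's signature, so it mostly receives value 1
q = np.array([[0.0, 10.0], [10.0, 0.0]])
k = np.array([[10.0, 0.0], [0.0, 10.0]])
v = np.array([[1.0, 0.0], [0.0, 1.0]])
agg, attn = aggregate_messages(q, k, v)
```

Each row of the attention matrix sums to 1, so every receiver mixes the broadcast values according to how well its query matches each sender's signature.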
For multiple stages of communication, the aggregated message vector $c_t^j$ and internal state $h_t^j$ are first used to predict the next internal state, taking into account the first round of communication:

$h_t'^j = \tanh\!\left(W_{h'}\left[\, h_t^j \,;\, c_t^j \,\right]\right),$

and the updated state $h_t'^j$ is then used to produce the queries, signatures, and values for the next round.
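Putting the pieces together, multiple communication stages within a timestep can be sketched as a loop alternating attention with a state update. All weights, dimensions, and the tanh update below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
n_agents, d_hid, d_k, d_v = 4, 16, 8, 8

W_q = rng.normal(scale=0.1, size=(d_hid, d_k))           # query head
W_k = rng.normal(scale=0.1, size=(d_hid, d_k))           # signature head
W_v = rng.normal(scale=0.1, size=(d_hid, d_v))           # value head
W_u = rng.normal(scale=0.1, size=(d_hid, d_hid + d_v))   # state update

def communicate(h, n_stages=2):
    """Run n_stages rounds of targeted communication, updating hidden states."""
    for _ in range(n_stages):
        q, k, v = h @ W_q, h @ W_k, h @ W_v
        scores = q @ k.T / np.sqrt(d_k)                  # (receiver, sender)
        scores -= scores.max(axis=1, keepdims=True)
        attn = np.exp(scores)
        attn /= attn.sum(axis=1, keepdims=True)
        c = attn @ v                                     # aggregated messages
        h = np.tanh(np.concatenate([h, c], axis=1) @ W_u.T)
    return h

h = communicate(rng.normal(size=(n_agents, d_hid)))
```

Because every round re-derives queries, signatures, and values from the updated states, later rounds can react to what was communicated in earlier ones.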
5 Experiments

We evaluate our targeted multi-agent communication architecture on a variety of tasks and environments. All our models were trained with a batched synchronous version of the multi-agent Actor-Critic described above, using RMSProp with a fixed learning rate, batch size, discount factor $\gamma$, and entropy regularization coefficient for agent policies.
5.1 SHAPES

The SHAPES dataset was introduced by Andreas et al. (2016) (github.com/jacobandreas/nmn2/tree/shapes), and was originally created for testing compositional visual reasoning for the task of visual question answering. It consists of synthetic images of 2D colored shapes arranged in a grid, along with corresponding question-answer pairs. There are 3 shapes (circle, square, triangle), 3 colors (red, green, blue), and 2 sizes (small, big) in total (see Figure 2).
We convert each image from the SHAPES dataset into an active environment where agents can now be spawned at different regions of the image, observe only a local patch around them, and take actions to move around: up, down, left, right, stay. Each agent is tasked with navigating to a specified goal state in the environment ('red', 'blue square', 'small green circle', etc.), and the reward for each agent at every timestep is based on team performance.
[Table: success rates for three SHAPES configurations of increasing difficulty: find[red] on two environment sizes, and find[red,red,green,blue].]
Having a symmetric, team-based reward incentivizes agents to cooperate with each other in finding each agent's goal. For example, as shown in Figure 1(a), if one agent's goal is to find red and another agent's goal is to find blue, it is in the second agent's interest to let the first know if it passes by red during its exploration / quest for blue, and vice versa. SHAPES serves as a flexible testbed for carefully controlling and analyzing the effect of changing the size of the environment, the number of agents, goal configurations, etc. Figure 2 visualizes learned protocols from two different configurations, and the table above reports quantitative evaluation for three different configurations. Benefits of communication and attention increase with task complexity (from find[red] to find[red,red,green,blue]).
5.2 Traffic Junction
Environment and Task. The simulated traffic junction environments from Sukhbaatar et al. (2016) consist of cars moving along pre-assigned, potentially intersecting routes on one or more road junctions. The total number of cars is capped at a fixed maximum, and at every timestep new cars get added to the environment with a fixed probability. Once a car completes its route, it becomes available to be sampled and added back to the environment with a different route assignment. Each car has limited visibility of a region around it, but is free to communicate with all other cars. The action space for each car at every timestep is gas and brake, and the reward consists of a linear time penalty proportional to the number of timesteps since the car became active, and a large collision penalty.
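The per-car reward described above can be sketched as follows; the penalty magnitudes are placeholder constants of ours, not the environment's exact values:

```python
def car_reward(timesteps_active, collided,
               time_penalty=0.01, collision_penalty=10.0):
    """Per-car reward: a linear time penalty for every timestep spent active,
    plus a large penalty on collision. Constants are illustrative."""
    reward = -time_penalty * timesteps_active
    if collided:
        reward -= collision_penalty
    return reward
```

The linear time penalty pushes cars to keep moving, while the collision penalty dominates it, so braking near junctions is usually worth the delay.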
Quantitative Results. We compare our approach with CommNets (Sukhbaatar et al., 2016) on the easy and hard difficulties of the traffic junction environment. The easy task has one junction of two one-way roads, while the hard task has four connected junctions of two-way roads on a larger grid with more cars. See Figures 2(a), 2(b) for an example of the four two-way junctions in the hard task. A baseline without communication reaches substantially lower success rates on both difficulties. On easy, both CommNets and TarMAC get close to perfect success. On hard, TarMAC with 1-stage communication significantly outperforms CommNets, while 2-stage communication further improves on this, an 18% absolute improvement over CommNets.
Model Interpretation. To interpret the learned policies, Figure 2(a) shows braking probabilities at different locations: cars tend to brake close to or right before entering traffic junctions, which is reasonable since junctions have the highest chance of collisions.
Turning to the attention probabilities (Figure 2(b)), we see that cars are most attended to when in the 'internal grid', right after crossing the 1st junction and before hitting the 2nd junction. These attention probabilities are intuitive: cars learn to attend to specific sensitive locations with the most relevant local observations to avoid collisions.
Finally, Figure 2(c) compares the total number of cars in the environment against the number of cars being attended to with non-negligible probability at any time. Interestingly, these are (loosely) positively correlated under Spearman's rank correlation, which shows that TarMAC is able to adapt to a variable number of agents. Crucially, agents learn this dynamic targeting behavior purely from task rewards with no hand-coding! Note that the right shift between the two curves is expected, as it takes a few timesteps of communication for team-size changes to propagate; accounting for this relative time shift further increases the Spearman's rank correlation between the two curves.
5.3 House3D

Finally, we benchmark TarMAC on a cooperative point-goal navigation task in House3D (Wu et al., 2018). House3D provides a rich and diverse set of publicly-available (github.com/facebookresearch/house3d) 3D indoor environments, wherein agents do not have access to the top-down map and must navigate purely from first-person vision. Similar to SHAPES, the agents are tasked with finding a specified goal (such as 'fireplace'); they are spawned at random locations in the environment and allowed to communicate with each other and move around. Each agent gets a shaped reward based on progress towards the specified target. An episode is successful if all agents end up within a fixed distance of the target object within the maximum number of navigation steps.
On a find[fireplace] task in House3D, a no-communication navigation policy trained with the same reward structure performs worst, mean-pooled communication (no attention) performs slightly better, and TarMAC achieves the best success rate. Figure 4 visualizes predicted navigation trajectories of the agents. Note that the communication vectors are significantly more compact than the high-dimensional observation space, making our approach particularly attractive for scaling to large teams.
6 Conclusions and Future Work
We introduced TarMAC, an architecture for multi-agent reinforcement learning which allows targeted interactions between agents and multiple stages of collaborative reasoning at every timestep. Evaluation on three diverse environments shows that our model is able to learn intuitive attention behavior and improves performance, with downstream task-specific team reward as the sole supervision.
While multi-agent navigation experiments in House3D show promising performance, we aim to exhaustively benchmark TarMAC on more challenging 3D navigation tasks, because we believe this is where decentralized targeted communication can have the most impact: it allows scaling to a large number of agents with large observation spaces. Given that the 3D navigation problem is hard in and of itself, it would be particularly interesting to combine recent advances orthogonal to our approach (e.g. spatial memory, planning networks) with the TarMAC framework.
- Andreas et al. (2016) Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Neural Module Networks. In CVPR, 2016.
- Busoniu et al. (2008) L. Busoniu, R. Babuska, and B. De Schutter. A Comprehensive Survey of Multiagent Reinforcement Learning. Trans. Sys. Man Cyber Part C, 2008.
- Cho et al. (2014) Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, 2014.
- Das et al. (2017) Abhishek Das, Satwik Kottur, José M.F. Moura, Stefan Lee, and Dhruv Batra. Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning. In ICCV, 2017.
- Foerster et al. (2016) Jakob Foerster, Yannis M Assael, Nando de Freitas, and Shimon Whiteson. Learning to communicate with deep multi-agent reinforcement learning. In NIPS, 2016.
- Foerster et al. (2018) Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. In AAAI, 2018.
- Hausknecht & Stone (2015) Matthew Hausknecht and Peter Stone. Deep Recurrent Q-Learning for Partially Observable MDPs. In AAAI, 2015.
- Hoshen (2017) Yedid Hoshen. VAIN: Attentional multi-agent predictive modeling. In NIPS. 2017.
- Jaderberg et al. (2018) Max Jaderberg, Wojciech M. Czarnecki, Iain Dunning, Luke Marris, Guy Lever, Antonio Garcia Castaneda, Charles Beattie, Neil C. Rabinowitz, Ari S. Morcos, Avraham Ruderman, Nicolas Sonnerat, Tim Green, Louise Deason, Joel Z. Leibo, David Silver, Demis Hassabis, Koray Kavukcuoglu, and Thore Graepel. Human-level performance in first-person multiplayer games with population-based deep reinforcement learning. arXiv preprint arXiv:1807.01281, 2018.
- Jiang & Lu (2018) Jiechuan Jiang and Zongqing Lu. Learning attentional communication for multi-agent cooperation. CoRR, 2018.
- Jorge et al. (2016) Emilio Jorge, Mikael Kågebäck, and Emil Gustavsson. Learning to play guess who? and inventing a grounded language as a consequence. In NIPS workshop on Deep Reinforcement Learning, 2016.
- Kottur et al. (2017) Satwik Kottur, José MF Moura, Stefan Lee, and Dhruv Batra. Natural language does not emerge ‘naturally’ in multi-agent dialog. In EMNLP, 2017.
- Lazaridou et al. (2017) Angeliki Lazaridou, Alexander Peysakhovich, and Marco Baroni. Multi-agent cooperation and the emergence of (natural) language. In ICLR, 2017.
- Lowe et al. (2017) Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In NIPS, 2017.
- Mordatch & Abbeel (2017) Igor Mordatch and Pieter Abbeel. Emergence of grounded compositional language in multi-agent populations. arXiv preprint arXiv:1703.04908, 2017.
- Oliehoek (2012) Frans A. Oliehoek. Decentralized POMDPs. In Reinforcement Learning: State of the Art. Springer Berlin Heidelberg, 2012.
- OpenAI (2018) OpenAI. OpenAI Five. https://blog.openai.com/openai-five/, 2018.
- Peng et al. (2017) Peng Peng, Quan Yuan, Ying Wen, Yaodong Yang, Zhenkun Tang, Haitao Long, and Jun Wang. Multiagent bidirectionally-coordinated nets for learning to play starcraft combat games. arXiv preprint arXiv:1703.10069, 2017.
- Shoham & Leyton-Brown (2008) Yoav Shoham and Kevin Leyton-Brown. Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations. Cambridge University Press, 2008.
- Sukhbaatar et al. (2016) Sainbayar Sukhbaatar, Arthur Szlam, and Rob Fergus. Learning multiagent communication with backpropagation. In NIPS, 2016.
- Sutton & Barto (1998) Richard S. Sutton and Andrew G. Barto. Introduction to Reinforcement Learning. MIT Press, 1998.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.
- Wu et al. (2018) Yi Wu, Yuxin Wu, Georgia Gkioxari, and Yuandong Tian. Building Generalizable Agents With a Realistic And Rich 3D Environment. arXiv preprint arXiv:1801.02209, 2018.
- Zaheer et al. (2017) Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Ruslan R Salakhutdinov, and Alexander J Smola. Deep sets. In NIPS, 2017.