
Improving Coordination in Multi-Agent Deep Reinforcement Learning through Memory-driven Communication

Deep reinforcement learning algorithms have recently been used to train multiple interacting agents in a centralised manner whilst keeping their execution decentralised. When the agents can only acquire partial observations and are faced with a task requiring coordination and synchronisation skills, inter-agent communication plays an essential role. In this work, we propose a framework for multi-agent training using deep deterministic policy gradients that enables the concurrent, end-to-end learning of an explicit communication protocol through a memory device. During training, the agents learn to perform read and write operations enabling them to infer a shared representation of the world. We empirically demonstrate that concurrent learning of the communication device and individual policies can improve inter-agent coordination and performance, and illustrate how different communication patterns can emerge for different tasks.





1 Introduction

Reinforcement Learning (RL) allows agents to learn how to map observations to actions through feedback reward signals (Sutton and Barto, 1998). Recently, deep neural networks (LeCun et al., 2015; Schmidhuber, 2015) have had a noticeable impact on RL (Li, 2017) as they provide flexible models for learning value functions and policies. They also make it possible to overcome difficulties related to large state spaces, and eliminate the need for hand-crafted features and ad-hoc heuristics (Cortes et al., 2002; Parker et al., 2003; Olfati-Saber et al., 2007). Deep reinforcement learning (DRL) algorithms have been successfully applied in single-agent systems, including video game playing (Mnih et al., 2015), robot locomotion (Lillicrap et al., 2015), object localisation (Caicedo and Lazebnik, 2015) and data-centre cooling (Evans and Gao, 2016).

Following the uptake of DRL in single-agent domains, there is now a need to develop learning algorithms for multi-agent (MA) systems, as these present additional challenges. Markov Decision Processes, upon which DRL methods rely, assume that the reward distribution and dynamics are stationary (Hernandez-Leal et al., 2017). When multiple learners interact with each other, this property is violated because the reward that an agent receives also depends on other agents' actions (Laurent et al., 2011). This is known as the moving-target problem (Tuyls and Weiss, 2012): convergence guarantees are lost and additional learning instabilities arise. Further difficulties arise from environments characterised by partial observability (Singh et al., 1994; Chu and Ye, 2017; Peshkin et al., 2000), where the agents do not have full access to the world state and where coordination skills are essential.

An important challenge in multi-agent DRL is how to facilitate communication between the interacting agents. Communication is widely known to play a critical role in promoting coordination between humans (Számadó, 2010). When coordination is required and no common language exists, simple communication protocols are likely to emerge (Selten and Warglien, 2007). The relation between communication and coordination has been widely discussed (Vorobeychik et al., 2017; Demichelis and Weibull, 2008; Miller and Moser, 2004; Kearns, 2012). For instance, in two-player games, allowing the players to exchange messages has resulted in improved coordination (Cooper et al., 1989).

Analogously, the importance of communication has been recognised when designing artificial MA learning systems, especially in tasks requiring synchronisation (Fox et al., 2000; Scardovi and Sepulchre, 2008; Wen et al., 2012) and group strategy coordination (Wunder et al., 2009; Itō et al., 2011). A wide range of MA applications have benefitted from inter-agent message passing, including distributed smart grid control (Pipattanasomporn et al., 2009), consensus in networks (You and Xie, 2011), multi-robot control (Ren and Sorensen, 2008), autonomous vehicle driving (Petrillo et al., 2018), elevator control (Crites and Barto, 1998) and soccer-playing robots (Stone and Veloso, 1998).

Recently, Lowe et al. (2017) have proposed MADDPG (Multi-Agent Deep Deterministic Policy Gradient). Their approach extends the actor-critic algorithm (Degris et al., 2012) in which each agent has an actor to select actions and a critic to evaluate them. MADDPG embraces the centralised learning and decentralised execution paradigm (CLDE) (Foerster et al., 2016; Kraemer and Banerjee, 2016; Oliehoek and Vlassis, 2007). During centralised training, the critics receive observations and actions from all the agents whilst the actors only see their local observations. On the other hand, the execution only relies on actors. This approach has been designed to address the emergence of environment non-stationarity (Tuyls and Weiss, 2012; Laurent et al., 2011) and it has been shown to perform well in a number of mixed competitive and cooperative environments. However, in MADDPG, the agents can only share each other’s actions and observations during training through their critics, but do not have the means to develop an explicit form of communication through their experiences.

In this article, we consider tasks requiring strong coordination and synchronization skills. In these cases, being able to communicate information beyond the private observations, and infer a shared representation of the world through interactions, becomes essential. Ideally, an agent should be able to remember its current and past experience generated when interacting with the environment, learn how to compactly represent these experiences in an appropriate manner, and share this information for others to benefit from. Analogously, an agent should be able to learn how to decode the information generated by other agents and leverage it under every environmental state.

The above requirements are captured here by introducing a communication mechanism facilitating information sharing within the CLDE paradigm. Specifically, we provide the agents with a shared communication device that can be used to learn from their collective private observations and share relevant messages with others. Each agent also learns how to decode the memory content in order to improve its own policy. Both the read and write operations are implemented as parametrised, non-linear gating mechanisms that are learned concurrently to the individual policies. When the underlying task to be solved demands complex coordination skills, we demonstrate that our approach can achieve higher performance compared to the MADDPG baseline. Furthermore, we demonstrate that learning a communication protocol end-to-end, concurrently with the policies, can also improve upon a meta-agent approach whereby all the agents perfectly share all their observations and actions in both training and execution. We provide a visualisation of the communication patterns that have emerged when training two-agent systems and demonstrate how these patterns correlate with the underlying tasks being learned.

2 Related Work

The problem of RL in cooperative environments has been studied extensively (Littman, 1994; Schmidhuber, 1996; Panait and Luke, 2005; Matignon et al., 2007). Early attempts exploited single-agent techniques like Q-learning to train all agents independently (Tan, 1993), but suffered from the excessive size of the state space resulting from having multiple agents. Subsequent improvements were obtained using variations of Q-learning (Ono and Fukumoto, 1996; Guestrin et al., 2002) and distributed approaches (Lauer and Riedmiller, 2000). More recently, DRL techniques like DQN (Mnih et al., 2013) have led to superior performance in single-agent settings by approximating policies through deep neural networks. Tampuu et al. (2017) have demonstrated that an extension of the DQN is able to train multiple agents independently to solve the Pong game. Gupta et al. (2017) have analysed the performance of popular DRL algorithms, including DQN, DDPG (Lillicrap et al., 2015), TRPO (Schulman et al., 2015) and actor-critic, on different MA environments, and have introduced a curriculum learning approach to increase scalability. Foerster et al. (2017) have suggested using a centralised critic for all agents that marginalises out a single agent's action while the other agents' actions are kept fixed.

The role of communication in cooperative settings has also been explored, and the proposed methods differ in how the communication channels are formulated using DRL techniques. Many approaches rely on implicit communication mechanisms whereby the weights of the neural networks used to implement policies or action-value functions are shared across agents or modelled to allow inter-agent information flow. For instance, in CommNet (Sukhbaatar et al., 2016), the policies are implemented through subsets of units of a large feed-forward neural network mapping the inputs of all agents to actions. At any given time step, the hidden states of each agent are used as messages, averaged and sent as input to the next layer. In BiCNet (Peng et al., 2017), the agents' policies and value networks are connected through bidirectional neural networks, and trained using an actor-critic approach. Jiang and Lu (2018) proposed an attention mechanism that, when a need for communication emerges, selects which subsets of agents should communicate; the hidden states of their policy networks are integrated through an LSTM (Long Short-Term Memory) (Hochreiter and Schmidhuber, 1997) to generate a message that is used as input for the next layer of the policy network. Kong et al. (2017) introduce a master-slave architecture whereby a master agent provides high-level instructions to organise the slave agents in an attempt to achieve fine-grained optimality.

In our work, we introduce a mechanism to generate explicit messages capturing relevant aspects of the world, which the agents are able to collectively learn from their observations and interactions. The messages are then sent and received to complement their private observations when making decisions. A recently proposed approach, DIAL (Differentiable Inter-Agent Learning) (Foerster et al., 2016), is somewhat related in that the communication is enabled by differentiable channels allowing the gradient of the Q-function to bring feedback from one agent to the other. Like DIAL, we would like the agents to share explicit messages. However, whereas DIAL uses simple and pre-determined protocols, we would like to provide the agents with the ability to infer complex protocols from experience, without necessarily relying on pre-defined ones, and utilise those to learn better policies.

3 Memory-driven MADDPG

3.1 Problem setup

We consider a system of $N$ interacting agents and adopt a multi-agent extension of partially observable Markov decision processes (Littman, 1994). This formulation assumes a set $\mathcal{S}$ containing all the states characterising the environment, a sequence $\{\mathcal{A}_1, \dots, \mathcal{A}_N\}$ where each $\mathcal{A}_i$ is the set of possible actions for agent $i$, and a sequence $\{\mathcal{O}_1, \dots, \mathcal{O}_N\}$ where each $\mathcal{O}_i$ contains the observations available to agent $i$. Each $o_i \in \mathcal{O}_i$ provides a partial characterisation of the current state and is private for that agent. Every action $a_i \in \mathcal{A}_i$ is deterministically chosen according to a policy function $\mu_{\theta_i} : \mathcal{O}_i \mapsto \mathcal{A}_i$, parametrised by $\theta_i$. The environment generates a next state according to a transition function $\mathcal{T} : \mathcal{S} \times \mathcal{A}_1 \times \dots \times \mathcal{A}_N \mapsto \mathcal{S}$ that considers the current state and the actions taken by the agents. The reward received by an agent, $r_i : \mathcal{S} \times \mathcal{A}_1 \times \dots \times \mathcal{A}_N \mapsto \mathbb{R}$, is a function of states and actions. Each agent learns a policy that maximises the expected discounted future rewards over a period of $T$ time steps, $J_i = \mathbb{E}[R_i]$, where $R_i = \sum_{t=0}^{T} \gamma^t r_{i,t}$ is the $\gamma$-discounted sum of future rewards. During training, we would like an agent to learn by using not only its own observations, but through a collectively learned representation of the world that accumulates through experiences coming from all the agents. At the same time, each agent should develop the ability to interpret this shared knowledge in its own unique way as needed to optimise its policy. Finally, the information sharing mechanism would need to be designed in such a way that it can be used in both training and execution.

3.2 Memory-driven communication

We introduce a shared communication mechanism enabling agents to establish a communication protocol through a memory device of capacity $M$ (Figure 1). The device is designed to store a message $\mathbf{m} \in \mathbb{R}^M$ which progressively captures the collective knowledge of the agents as they interact. Each agent's policy then becomes $\mu_{\theta_i}(o_i, \mathbf{m})$, i.e. dependent on its private observation as well as the collective memory. Before taking an action, each agent accesses the memory device to initially retrieve and interpret the message left by others. After reading the message, the agent performs a writing operation that updates the memory content. During training, these operations are learned without any a priori constraints about the nature of the messages other than the size, $M$, of the device. During execution, the agents use the communication protocol that they have learned to read and write the memory over an entire episode. We aim to build an end-to-end model trainable only through reward signals, and use neural networks as function approximators for policies, and learnable gated functions to characterise an agent's interactions with the memory. The chosen parametrisations of these operations are listed below.

Figure 1: The MD-MADDPG framework

Encoding operation.

Upon receiving its private observation $o_i$, each agent maps it onto an embedding representing its current vision of the state:

$\mathbf{e}_i = \phi^e_i(o_i), \qquad (1)$

where $\phi^e_i$ is a neural network parametrised by $\theta^e_i$. The embedding $\mathbf{e}_i$ plays a fundamental role in selecting a new action and in the reading and writing phases.
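As a concrete sketch, the encoding step can be implemented as a small feed-forward network. The layer sizes and the ReLU/tanh activation choices below are illustrative assumptions, not specified by the paper:

```python
import numpy as np

def encode(o, W1, b1, W2, b2):
    """Two-layer MLP encoder mapping a private observation o to an embedding e.

    The weight shapes and activations are illustrative assumptions.
    """
    h = np.maximum(0.0, W1 @ o + b1)  # ReLU hidden layer
    return np.tanh(W2 @ h + b2)       # embedding e, squashed to (-1, 1)
```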

Read operation.

After encoding the current information, the agent performs a read operation allowing it to extract and interpret relevant knowledge that has been previously captured in the shared memory $\mathbf{m}$. By interpreting this information content, the agent has access to what other agents have learned. A context vector $\mathbf{h}_i$ is generated to capture spatio-temporal information previously encoded in $\mathbf{m}$ through a linear mapping,

$\mathbf{h}_i = W^h_i \mathbf{m},$

where $W^h_i$ represents the learnable weights of the linear projection. While $\phi^e_i$ is defined as a general observation encoder, $W^h_i$ is specifically designed to extract features for the reading operation. An ablation study showing the effectiveness of $W^h_i$ is provided in the Supplementary Material. The agent observation embedding $\mathbf{e}_i$, the reading context vector $\mathbf{h}_i$ and the current memory $\mathbf{m}$ contain different types of information that are used jointly as inputs to learn a gating mechanism,

$\mathbf{k}_i = \sigma\left( W^k_i [\mathbf{e}_i ; \mathbf{h}_i ; \mathbf{m}] \right),$

where $\sigma(\cdot)$ is the sigmoid function and $[\cdot \,;\, \cdot \,;\, \cdot]$ means that the three vectors are concatenated. The values of $\mathbf{k}_i$ are used as weights to modulate the memory content and extract information from it, i.e.

$\mathbf{r}_i = \mathbf{k}_i \odot \mathbf{m},$

where $\odot$ represents the Hadamard product. $\mathbf{k}_i$ takes values in $(0,1)$ and its role is to potentially downgrade the information stored in memory or even completely discard the current content. Learning agent-specific weights $W^h_i$ and $W^k_i$ means that each agent is able to interpret $\mathbf{m}$ in its own unique way. As the reading operation strongly depends on the current observation, the interpretation of $\mathbf{m}$ can change from time to time depending on what an agent sees during an episode. Given that $\mathbf{r}_i$ depends on $\mathbf{e}_i$ and $\mathbf{m}$ (with $\mathbf{e}_i$ obtained from $o_i$ in Eq. 1), we lump all the adjustable parameters into $\theta^r_i = \{W^h_i, W^k_i\}$ and write

$\mathbf{r}_i = \phi^r_i(o_i, \mathbf{m}).$
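The read operation above can be sketched numerically as follows; the vector dimensions and weight matrices are assumptions for illustration only, standing in for the learned parameters:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def read(e, m, W_h, W_k):
    """Gated read of the shared memory m, given the observation embedding e.

    W_h projects the memory to a context vector; W_k produces gate values
    in (0, 1) that modulate the memory content element-wise.
    """
    h = W_h @ m                                   # context vector from memory
    k = sigmoid(W_k @ np.concatenate([e, h, m]))  # gate values in (0, 1)
    return k * m                                  # Hadamard-weighted read
```

Because the gate is a sigmoid, the read vector can only downscale (or nearly discard) each memory component, matching the role described in the text.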


Write operation.

In the writing phase, an agent decides what information to share and how to properly update the content of the memory whilst taking into account the other agents. The write operation is loosely inspired by the LSTM (Hochreiter and Schmidhuber, 1997), in which the content of the memory is updated through gated functions regulating what information is kept and what is discarded. Initially, the agent generates a candidate memory content, $\mathbf{c}_i$, which depends on its own encoded observations and the current shared memory through a non-linear mapping,

$\mathbf{c}_i = \tanh\left( W^c_i [\mathbf{e}_i ; \mathbf{m}] \right),$

where $W^c_i$ are weights to learn. An input gate, $\mathbf{g}^{in}_i$, contains the values used to regulate the content of this candidate, while a forget gate, $\mathbf{g}^{f}_i$, is used to decide what to keep and what to discard from the current $\mathbf{m}$. These operations are described as follows:

$\mathbf{g}^{in}_i = \sigma\left( W^{in}_i [\mathbf{e}_i ; \mathbf{m}] \right), \qquad \mathbf{g}^{f}_i = \sigma\left( W^{f}_i [\mathbf{e}_i ; \mathbf{m}] \right).$

The agent then finally generates an updated message $\mathbf{m}'$ as a linear combination of the weighted old and new messages:

$\mathbf{m}' = \mathbf{g}^{in}_i \odot \mathbf{c}_i + \mathbf{g}^{f}_i \odot \mathbf{m}.$

The update $\mathbf{m}'$ is stored in memory and made accessible to the other agents. At each time step, agents sequentially read and write the content of the memory using the above procedure. Since $\mathbf{m}'$ depends on $\mathbf{e}_i$ and $\mathbf{m}$ (with $\mathbf{e}_i$ derived from $o_i$ in Eq. 1), we collect all the parameters into $\theta^w_i = \{W^c_i, W^{in}_i, W^{f}_i\}$ and write the writing operation as:

$\mathbf{m}' = \phi^w_i(o_i, \mathbf{m}).$
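A minimal sketch of the gated write, with the weight names mirroring the candidate, input-gate and forget-gate roles described above; the dimensions are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def write(e, m, W_c, W_in, W_f):
    """LSTM-style gated write: blend a candidate message with the old memory."""
    z = np.concatenate([e, m])
    c = np.tanh(W_c @ z)       # candidate memory content
    g_in = sigmoid(W_in @ z)   # how much of the candidate to admit
    g_f = sigmoid(W_f @ z)     # how much of the old memory to keep
    return g_in * c + g_f * m  # updated message
```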


Action selector.

Upon completing both read and write operations, the agent is able to take an action, $a_i$, which depends on the current encoding of its observations, its own interpretation of the current memory content and its updated version, that is

$a_i = \phi^a_i(\mathbf{e}_i, \mathbf{r}_i, \mathbf{m}'),$

where $\phi^a_i$ is a neural network parametrised by $\theta^a_i$. The resulting policy function can be written as a composition of functions:

$\mu_{\theta_i}(o_i, \mathbf{m}) = \phi^a_i\left( \phi^e_i(o_i), \phi^r_i(o_i, \mathbf{m}), \phi^w_i(o_i, \mathbf{m}) \right),$

in which $\theta_i = \{\theta^e_i, \theta^r_i, \theta^w_i, \theta^a_i\}$ contains all the relevant parameters.
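Putting the pieces together, a single agent step (encode, read, write, act) with two agents sequentially accessing one shared memory can be sketched as below. All weights are random placeholders standing in for learned parameters, and every dimension is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

dim_o, dim_e, dim_m = 8, 6, 4  # assumed sizes, not taken from the paper

# Placeholder parameters for one agent (learned in the real framework).
W_e = rng.normal(size=(dim_e, dim_o))                    # encoder
W_h = rng.normal(size=(dim_e, dim_m))                    # read: context projection
W_k = rng.normal(size=(dim_m, dim_e + dim_e + dim_m))    # read: gate
W_c = rng.normal(size=(dim_m, dim_e + dim_m))            # write: candidate
W_i = rng.normal(size=(dim_m, dim_e + dim_m))            # write: input gate
W_f = rng.normal(size=(dim_m, dim_e + dim_m))            # write: forget gate
W_a = rng.normal(size=(2, dim_e + dim_m + dim_m))        # action head (2-D action)

def act(o, m):
    """One pass of the policy composition: encode, gated read, gated write, act."""
    e = np.tanh(W_e @ o)                                  # observation embedding
    h = W_h @ m                                           # context vector
    r = sigmoid(W_k @ np.concatenate([e, h, m])) * m      # gated read
    c = np.tanh(W_c @ np.concatenate([e, m]))             # candidate content
    g_i = sigmoid(W_i @ np.concatenate([e, m]))           # input gate
    g_f = sigmoid(W_f @ np.concatenate([e, m]))           # forget gate
    m_new = g_i * c + g_f * m                             # updated message
    a = np.tanh(W_a @ np.concatenate([e, r, m_new]))      # bounded action
    return a, m_new

# Two agents access the memory one after the other within a time step.
m = np.zeros(dim_m)
for o in (rng.normal(size=dim_o), rng.normal(size=dim_o)):
    a, m = act(o, m)
```

In practice each agent would hold its own weights; a single set is reused here only to keep the sketch short.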

Learning algorithm.

We optimise $\theta_i$ to maximise $J_i$ by adapting the MADDPG formulation (Lowe et al., 2017). In our framework the gradient becomes

$\nabla_{\theta_i} J(\mu_i) = \mathbb{E}_{x, \mathbf{m}, a \sim \mathcal{D}} \left[ \nabla_{\theta_i} \mu_i(o_i, \mathbf{m}) \, \nabla_{a_i} Q^{\mu}_i(x, a_1, \dots, a_N) \big|_{a_i = \mu_i(o_i, \mathbf{m})} \right],$

where $\mathcal{D}$ is a replay buffer which contains transitions in the form of $(x, x', \mathbf{m}, a_1, \dots, a_N, r_1, \dots, r_N)$, in which $x = (o_1, \dots, o_N)$ are all the current observations and $x' = (o'_1, \dots, o'_N)$ all the next observations. The action-value function $Q^{\mu}_i$ is updated as follows:

$\mathcal{L}(\theta_i) = \mathbb{E}_{x, \mathbf{m}, a, r, x'} \left[ \left( Q^{\mu}_i(x, a_1, \dots, a_N) - y \right)^2 \right], \qquad y = r_i + \gamma \, Q^{\mu'}_i(x', a'_1, \dots, a'_N),$

where $a'_1, \dots, a'_N$ are all the next actions and $Q^{\mu'}_i$ is a target network whose parameters are periodically updated with the current parameters of $Q^{\mu}_i$. We call the resulting algorithm MD-MADDPG (Memory-driven MADDPG). All the details about the centralised learning, decentralised execution and pseudo-code are provided in the Supplementary Material.
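The critic update can be illustrated in a few lines: a one-step TD target, the squared TD error minimised for the centralised critic, and a Polyak-style soft update for the target network. The paper only states that the target parameters are updated periodically, so the soft update and the value of `tau` are assumptions borrowed from common DDPG practice:

```python
import numpy as np

def td_target(r_i, q_next, gamma=0.95):
    """One-step TD target y = r_i + gamma * Q'(x', a') for agent i."""
    return r_i + gamma * q_next

def critic_loss(q_pred, y):
    """Mean squared TD error minimised when updating the centralised critic."""
    return np.mean((q_pred - y) ** 2)

def soft_update(target_params, params, tau=0.01):
    """Polyak averaging of target-network parameters (illustrative choice)."""
    return {k: tau * params[k] + (1.0 - tau) * target_params[k]
            for k in target_params}
```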

4 Experimental Settings and Results

4.1 Environments

In this section, we present a battery of six 2D scenarios, with continuous space and discrete time, of increasing complexity, requiring progressively more elaborate coordination skills: five environments inspired by the Cooperative Navigation problem from the multi-agent particle environment (Lowe et al., 2017; Mordatch and Abbeel, 2017), and Waterworld from the SISL suite (Gupta et al., 2017). We focus on two-agent systems to keep the settings sufficiently simple and analyse emerging communication behaviours.

Cooperative Navigation (CN).

This environment consists of $N$ agents and $N$ corresponding landmarks. An agent's task is to occupy one of the landmarks whilst avoiding collisions with other agents. Every agent observes the distance to all other agents and to the landmark positions.

Partial Observable Cooperative Navigation (PO CN).

This is based on Cooperative Navigation, i.e. the task and action space are the same, but the agents now have a limited vision range and can only observe a portion of the environment around them within a pre-defined radius.

Synchronous Cooperative Navigation (Sync CN).

The agents need to occupy the landmarks exactly at the same time in order to be positively rewarded. A landmark is declared as occupied when an agent is sufficiently close to it. Agents are penalised when the landmarks are not occupied at the same time.

Sequential Cooperative Navigation (Sequential CN).

This environment is similar to the previous one, but the agents now need to occupy the landmarks one after another, in sequence, in order to be positively rewarded. Occupying the landmarks at the same time is penalised.

Swapping Cooperative Navigation (Swapping CN).

In this case the task is more complex as it consists of two sub-tasks. Initially, the agents need to reach the landmarks and occupy them at the same time. Then, they need to swap their landmarks and repeat the same process.


Waterworld.

In this environment, two agents with limited-range vision have to collaboratively capture food targets whilst avoiding poison targets. A food target can be captured only if both agents reach it at the same time. Additional details are reported in (Gupta et al., 2017).

4.2 Experimental results

In our experiments, we compared the performance of the proposed MD-MADDPG algorithm against both MADDPG and Meta-agent MADDPG (MA-MADDPG). In the latter, the policy of an agent during both training and execution is conditioned upon the observations of all the other agents in order to overcome difficulties due to partial observability (see also the Supplementary Material). This algorithm has been added to the baselines in order to better understand the role of the memory mechanism and assess whether learning a shared representation of the world can improve the learned coordination skills. We analyse the performance on all six environments described in Section 4.1. In each case, after training, we evaluate an algorithm's performance by collecting samples from an additional set of episodes, which are then used to extract different performance metrics: the reward quantifies how well a task has been solved; the distance from landmark captures how well an agent has learned to reach a landmark; the number of collisions counts how many times an agent has failed to avoid collisions with others; sync occupations counts how many times both landmarks have been occupied simultaneously and, analogously, not sync occupations counts how many times only one of the two landmarks has been occupied. For Waterworld, we count the number of food targets and the number of poison targets. In Table 1, for each metric, we report the sample average and standard deviation obtained by each algorithm on each environment.

Table 1: Comparison of MADDPG, MA-MADDPG and MD-MADDPG on six environments ordered by increasing level of difficulty, from CN to Waterworld. The sample mean and standard deviation over the evaluation episodes are reported for each metric: average distance and number of collisions for CN and PO CN; number of synchronised and unsynchronised occupations for Sync CN; reward and average distance for Sequential CN and Swapping CN; number of food and poison targets for Waterworld.

All three algorithms perform very similarly in the Cooperative Navigation and Partial Observable Cooperative Navigation cases. This is an expected result because these environments involve simple tasks that can be completed without explicit message-passing activities and information sharing. Despite communication not being essential, MD-MADDPG reaches comparable performance to MADDPG and MA-MADDPG. In the Synchronous Cooperative Navigation case, the ability of MA-MADDPG to overcome partial observability issues by sharing the observations across agents seems to be crucial, as the total rewards achieved by this algorithm are substantially higher than those obtained by both MADDPG and MD-MADDPG. In this case, whilst not achieving the highest reward, MD-MADDPG keeps the number of unsynchronised occupations at the lowest level, and also performs better than MADDPG on all three metrics. It would appear that, in this case, pooling all the private observations together is sufficient for the agents to synchronise their paths leading to the landmarks.

When moving on to more complex tasks requiring further coordination, the performances of the three algorithms diverge further in favour of MD-MADDPG. The requirement for strong collaborative behaviour is more evident in the Sequential Cooperative Navigation problem, as the agents need to explicitly learn to take either shorter or longer paths from their initial positions to the landmarks in order to occupy them in sequential order. Furthermore, according to the results in Table 1, the average distance travelled by the agents trained with MD-MADDPG is less than half the distance travelled by agents trained with MADDPG, indicating that these agents were able to find a better strategy by developing an appropriate communication protocol. Similarly, in the Swapping Cooperative Navigation scenario, MD-MADDPG achieves superior performance, and is again able to discover solutions involving the shortest average distance. Waterworld is significantly more challenging as it requires a sustained level of synchronisation throughout the entire episode and can be seen as a sequence of sub-tasks whereby each time the agents must reach a new food target whilst avoiding poison targets. In Table 1, it can be noticed that MD-MADDPG significantly outperforms both competitors in this case. The importance of sharing observations with other agents can also be seen here, as MA-MADDPG generates good policies that avoid poison targets, yet in this case, the average reward is substantially lower than the one scored by MD-MADDPG.

(a) Sequential CN
(b) Swapping CN
(c) Waterworld
Figure 2: Visualisation of communications strategies learned by the agents in three different environments: the three principal components provide orthogonal descriptors of the memory content written by the agents and are being plotted as a function of time. Within each component, the highest values are in red, and the lowest values are in blue. The bar at the bottom of each figure indicates which phase (or sub-task) was being executed within an episode; see Section 4.3 for further details. The memory usage patterns learned by the agents are correlated with the underlying phases and the memory is no longer utilised once a task is about to be completed.

4.3 Communication analysis

In this section, we explore the dynamic patterns of communication activity that emerged in the environments presented in the previous section, and look at how the agents use the shared memory throughout an episode while solving the required task. For each environment, after training, we executed a number of episodes with time horizon $T$ and stored the write vector $\mathbf{m}'$ of each agent at every time step $t$. Exploring how $\mathbf{m}'$ evolves within an episode can shed some light onto the role of the memory device at each phase of the task. In order to produce meaningful visualisations, we first projected the dimensions of $\mathbf{m}'$ onto the directions maximising the sample variance (i.e. the variance of the observed $\mathbf{m}'$ across simulated episodes) using a linear PCA.
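The projection can be reproduced with an SVD-based PCA; this is a minimal sketch under assumed array shapes, as the paper does not detail the implementation:

```python
import numpy as np

def pc_activity(W, n_components=3):
    """Project stored write vectors W (time steps x memory dim) onto their top
    principal components and rescale each component to [0, 1] for a colour map.
    """
    Wc = W - W.mean(axis=0)                # centre the data
    _, _, Vt = np.linalg.svd(Wc, full_matrices=False)
    pcs = Wc @ Vt[:n_components].T         # scores on the top components
    lo, hi = pcs.min(axis=0), pcs.max(axis=0)
    return (pcs - lo) / (hi - lo)          # standardise to [0, 1]
```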

Figure 2 shows the principal components (PCs) associated with the two agents over time for three of our six simulation environments. Only the first three PCs were retained, as those were found to cumulatively explain the large majority of the variance in all cases. The values of each PC were standardised to lie in $[0,1]$ and are plotted on a colour map: one is in red and zero in blue. The timeline at the bottom of each figure indicates which specific phase of an episode is being executed at any given time point, and each consecutive phase is coloured using a different shade of grey. For instance, in Sequential Cooperative Navigation, a single landmark is reached and occupied in each phase. In Swapping Cooperative Navigation, during the first phase the agents search and find the landmarks; in the second phase they swap targets, and in the third phase they complete the task by reaching the landmarks again. Usually, in the last phase, the agents learn to stay close to their targets. We interpret the higher values as being indicative of high memory usage, and lower values as being associated with low activity. In most cases, high communication activity is maintained while the agents are actively working on and completing a task, whereas during the final phases (where typically there is no exploration because the task is considered completed) low activity levels are predominant.

This analysis also highlights the fact that the communication channel is used differently in each environment. In some cases, the levels of activity alternate between the agents. For instance, in Sequential Cooperative Navigation (Figure 2a), high levels of memory usage by one agent are associated with low ones by the other. A different behaviour is observed for the Swapping Cooperative Navigation task, where both agents produce either high or low activation values. The dynamics characterising the memory usage also change based on the particular phase reached within an episode. For example, in Figure 2a, during the first two phases the agents typically show alternating activity levels, whilst in the third phase both agents significantly decrease their memory activity as the task has already been solved and there are no more changes in the environment. Figure 2 provides some evidence that, in some cases, a peer-to-peer communication strategy is likely to emerge instead of a master-slave one where one agent takes complete control of the shared channel. The scenario is significantly more complex in Waterworld, where the changes in memory usage appear at a much higher frequency due to the presence of many sequential sub-tasks. Here, each light-grey phase indicates that a food target has been captured. Peaks of memory activity seem to follow these events as the agents reassess their situation and require higher coordination to jointly decide what the next target is going to be.

5 Conclusions

In this work, we have introduced MD-MADDPG, a multi-agent reinforcement learning framework that uses a shared memory device as an inter-agent communication channel to improve coordination skills. The memory content contains a learned representation of the environment that is used to better inform the individual policies. The memory device is learnable end-to-end without particular constraints other than its size, and each agent develops the ability to modify and interpret it. We have shown that this approach leads to better performance in cooperative tasks where coordination and synchronisation are crucial to a successful completion of the task and where world visibility is very limited. Furthermore, we have visualised and analysed the dynamics of the communication patterns that have emerged in several environments. This exploration has indicated that, as expected, the agents have learned different communication protocols depending upon the complexity of the task. In this study we have focused on two-agent systems to keep the settings sufficiently simple and understand the role of the memory. In future work we plan to study how the algorithm scales with the number of agents, and anticipate a higher level of learning difficulty due to the increased number of parameters, some of which would need to be shared. With more than two agents, we also expect that the order in which the memory is updated may become important. A possible approach may consist of deploying agent selection mechanisms, possibly based on attention, so that only a relevant subset of agents can modify the memory at any given time, or of imposing master-slave architectures. In future work we also plan to assess MD-MADDPG on environments characterised by more structured and high-dimensional observations (e.g. pixel data), where collectively learning to represent the environment through a shared memory should be particularly beneficial.


  • Caicedo and Lazebnik [2015] Juan C Caicedo and Svetlana Lazebnik. Active object localization with deep reinforcement learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 2488–2496, 2015.
  • Chu and Ye [2017] Xiangxiang Chu and Hangjun Ye. Parameter sharing deep deterministic policy gradient for cooperative multi-agent reinforcement learning. arXiv preprint arXiv:1710.00336, 2017.
  • Cooper et al. [1989] Russell Cooper, Douglas V DeJong, Robert Forsythe, and Thomas W Ross. Communication in the battle of the sexes game: some experimental results. The RAND Journal of Economics, 20(4):568, 1989.
  • Cortes et al. [2002] Jorge Cortes, Sonia Martinez, Timur Karatas, and Francesco Bullo. Coverage control for mobile sensing networks. In Robotics and Automation, 2002. Proceedings. ICRA’02. IEEE International Conference on, volume 2, pages 1327–1332. IEEE, 2002.
  • Crites and Barto [1998] Robert H Crites and Andrew G Barto. Elevator group control using multiple reinforcement learning agents. Machine learning, 33(2-3):235–262, 1998.
  • Degris et al. [2012] Thomas Degris, Martha White, and Richard S Sutton. Off-policy actor-critic. arXiv preprint arXiv:1205.4839, 2012.
  • Demichelis and Weibull [2008] Stefano Demichelis and Jorgen W Weibull. Language, meaning, and games: A model of communication, coordination, and evolution. American Economic Review, 98(4):1292–1311, 2008.
  • Evans and Gao [2016] Richard Evans and Jim Gao. DeepMind AI reduces Google data centre cooling bill by 40%, 2016. Accessed: 17-09-2018.
  • Foerster et al. [2016] Jakob Foerster, Ioannis Alexandros Assael, Nando de Freitas, and Shimon Whiteson. Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, pages 2137–2145, 2016.
  • Foerster et al. [2017] Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. arXiv preprint arXiv:1705.08926, 2017.
  • Fox et al. [2000] Dieter Fox, Wolfram Burgard, Hannes Kruppa, and Sebastian Thrun. A probabilistic approach to collaborative multi-robot localization. Autonomous robots, 8(3):325–344, 2000.
  • Guestrin et al. [2002] Carlos Guestrin, Michail Lagoudakis, and Ronald Parr. Coordinated reinforcement learning. In ICML, volume 2, pages 227–234. Citeseer, 2002.
  • Gupta et al. [2017] Jayesh K Gupta, Maxim Egorov, and Mykel Kochenderfer. Cooperative multi-agent control using deep reinforcement learning. In International Conference on Autonomous Agents and Multiagent Systems, pages 66–83. Springer, 2017.
  • Hernandez-Leal et al. [2017] Pablo Hernandez-Leal, Michael Kaisers, Tim Baarslag, and Enrique Munoz de Cote. A survey of learning in multiagent environments: Dealing with non-stationarity. arXiv preprint arXiv:1707.09183, 2017.
  • Hochreiter and Schmidhuber [1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • Itō et al. [2011] Takayuki Itō, Minjie Zhang, Valentin Robu, Shaheen Fatima, Tokuro Matsuo, and Hirofumi Yamaki. Innovations in Agent-Based Complex Automated Negotiations. Springer-Verlag Berlin Heidelberg, 2011.
  • Jang et al. [2016] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.
  • Jiang and Lu [2018] Jiechuan Jiang and Zongqing Lu. Learning attentional communication for multi-agent cooperation. arXiv preprint arXiv:1805.07733, 2018.
  • Kearns [2012] Michael Kearns. Experiments in social computation. Communications of the ACM, 55(10):56–67, 2012.
  • Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Kong et al. [2017] Xiangyu Kong, Bo Xin, Fangchen Liu, and Yizhou Wang. Revisiting the master-slave architecture in multi-agent deep reinforcement learning. arXiv preprint arXiv:1712.07305, 2017.
  • Kraemer and Banerjee [2016] Landon Kraemer and Bikramjit Banerjee. Multi-agent reinforcement learning as a rehearsal for decentralized planning. Neurocomputing, 190:82–94, 2016.
  • Lauer and Riedmiller [2000] Martin Lauer and Martin Riedmiller. An algorithm for distributed reinforcement learning in cooperative multi-agent systems. In In Proceedings of the Seventeenth International Conference on Machine Learning. Citeseer, 2000.
  • Laurent et al. [2011] Guillaume J Laurent, Laëtitia Matignon, Le Fort-Piat, et al. The world of independent learners is not markovian. International Journal of Knowledge-based and Intelligent Engineering Systems, 15(1):55–64, 2011.
  • LeCun et al. [2015] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.
  • Li [2017] Yuxi Li. Deep reinforcement learning: An overview. arXiv preprint arXiv:1701.07274, 2017.
  • Lillicrap et al. [2015] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. CoRR, abs/1509.02971, 2015.
  • Littman [1994] Michael L Littman. Markov games as a framework for multi-agent reinforcement learning. In Machine Learning Proceedings 1994, pages 157–163. Elsevier, 1994.
  • Lowe et al. [2017] Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, OpenAI Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, pages 6379–6390, 2017.
  • Matignon et al. [2007] Laëtitia Matignon, Guillaume Laurent, and Nadine Le Fort-Piat. Hysteretic q-learning: an algorithm for decentralized reinforcement learning in cooperative multi-agent teams. In IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS’07., pages 64–69, 2007.
  • Miller and Moser [2004] John H Miller and Scott Moser. Communication and coordination. Complexity, 9(5):31–40, 2004.
  • Mnih et al. [2013] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
  • Mnih et al. [2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
  • Mordatch and Abbeel [2017] Igor Mordatch and Pieter Abbeel. Emergence of grounded compositional language in multi-agent populations. arXiv preprint arXiv:1703.04908, 2017.
  • Olfati-Saber et al. [2007] Reza Olfati-Saber, J Alex Fax, and Richard M Murray. Consensus and cooperation in networked multi-agent systems. Proceedings of the IEEE, 95(1):215–233, 2007.
  • Oliehoek and Vlassis [2007] Frans A Oliehoek and Nikos Vlassis. Q-value functions for decentralized pomdps. In Proceedings of the 6th international joint conference on Autonomous agents and multiagent systems, page 220. ACM, 2007.
  • Ono and Fukumoto [1996] Norihiko Ono and Kenji Fukumoto. Multi-agent reinforcement learning: A modular approach. In Second International Conference on Multiagent Systems, pages 252–258, 1996.
  • Panait and Luke [2005] Liviu Panait and Sean Luke. Cooperative multi-agent learning: The state of the art. Autonomous agents and multi-agent systems, 11(3):387–434, 2005.
  • Parker et al. [2003] Dawn C Parker, Steven M Manson, Marco A Janssen, Matthew J Hoffmann, and Peter Deadman. Multi-agent systems for the simulation of land-use and land-cover change: a review. Annals of the association of American Geographers, 93(2):314–337, 2003.
  • Peng et al. [2017] Peng Peng, Quan Yuan, Ying Wen, Yaodong Yang, Zhenkun Tang, Haitao Long, and Jun Wang. Multiagent bidirectionally-coordinated nets for learning to play starcraft combat games. arXiv preprint arXiv:1703.10069, 2017.
  • Peshkin et al. [2000] Leonid Peshkin, Kee-Eung Kim, Nicolas Meuleau, and Leslie Pack Kaelbling. Learning to cooperate via policy search. In Proceedings of the Sixteenth conference on Uncertainty in artificial intelligence, pages 489–496. Morgan Kaufmann Publishers Inc., 2000.
  • Petrillo et al. [2018] Alberto Petrillo, Alessandro Salvi, Stefania Santini, and Antonio Saverio Valente. Adaptive multi-agents synchronization for collaborative driving of autonomous vehicles with multiple communication delays. Transportation research part C: emerging technologies, 86:372–392, 2018.
  • Pipattanasomporn et al. [2009] Manisa Pipattanasomporn, Hassan Feroze, and Saifur Rahman. Multi-agent systems in a distributed smart grid: Design and implementation. In Power Systems Conference and Exposition, 2009. PSCE’09. IEEE/PES, pages 1–8. IEEE, 2009.
  • Ren and Sorensen [2008] Wei Ren and Nathan Sorensen. Distributed coordination architecture for multi-robot formation control. Robotics and Autonomous Systems, 56(4):324–333, 2008.
  • Scardovi and Sepulchre [2008] Luca Scardovi and Rodolphe Sepulchre. Synchronization in networks of identical linear systems. In Decision and Control, 2008. CDC 2008. 47th IEEE Conference on, pages 546–551. IEEE, 2008.
  • Schmidhuber [1996] Jurgen Schmidhuber. A general method for multi-agent reinforcement learning in unrestricted environments. In Adaptation, Coevolution and Learning in Multiagent Systems: Papers from the 1996 AAAI Spring Symposium, pages 84–87, 1996.
  • Schmidhuber [2015] Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural networks, 61:85–117, 2015.
  • Schulman et al. [2015] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.
  • Selten and Warglien [2007] Reinhard Selten and Massimo Warglien. The emergence of simple languages in an experimental coordination game. Proceedings of the National Academy of Sciences, 104(18):7361–7366, 2007.
  • Silver et al. [2014] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In ICML, 2014.
  • Singh et al. [1994] Satinder P Singh, Tommi Jaakkola, and Michael I Jordan. Learning without state-estimation in partially observable markovian decision processes. In Machine Learning Proceedings 1994, pages 284–292. Elsevier, 1994.
  • Stone and Veloso [1998] Peter Stone and Manuela Veloso. Towards collaborative and adversarial learning: A case study in robotic soccer. International Journal of Human-Computer Studies, 48(1):83–104, 1998.
  • Sukhbaatar et al. [2016] Sainbayar Sukhbaatar, Rob Fergus, et al. Learning multiagent communication with backpropagation. In Advances in Neural Information Processing Systems, pages 2244–2252, 2016.
  • Sutton and Barto [1998] Richard S Sutton and Andrew G Barto. Introduction to reinforcement learning, volume 135. MIT press Cambridge, 1998.
  • Számadó [2010] Szabolcs Számadó. Pre-hunt communication provides context for the evolution of early human language. Biological Theory, 5(4):366–382, 2010.
  • Tampuu et al. [2017] Ardi Tampuu, Tambet Matiisen, Dorian Kodelja, Ilya Kuzovkin, Kristjan Korjus, Juhan Aru, Jaan Aru, and Raul Vicente. Multiagent cooperation and competition with deep reinforcement learning. PloS one, 12(4):e0172395, 2017.
  • Tan [1993] Ming Tan. Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proceedings of the tenth international conference on machine learning, pages 330–337, 1993.
  • Tuyls and Weiss [2012] Karl Tuyls and Gerhard Weiss. Multiagent learning: Basics, challenges, and prospects. Ai Magazine, 33(3):41, 2012.
  • Uhlenbeck and Ornstein [1930] George E Uhlenbeck and Leonard S Ornstein. On the theory of the brownian motion. Physical review, 36(5):823, 1930.
  • Vorobeychik et al. [2017] Yevgeniy Vorobeychik, Zlatko Joveski, and Sixie Yu. Does communication help people coordinate? PloS one, 12(2):e0170780, 2017.
  • Wen et al. [2012] Guanghui Wen, Zhisheng Duan, Wenwu Yu, and Guanrong Chen. Consensus in multi-agent systems with communication constraints. International Journal of Robust and Nonlinear Control, 22(2):170–182, 2012.
  • Wunder et al. [2009] Michael Wunder, Michael Littman, and Matthew Stone. Communication, credibility and negotiation using a cognitive hierarchy model. In Workshop# 19: MSDM 2009, page 73, 2009.
  • You and Xie [2011] Keyou You and Lihua Xie. Network topology and communication data rate for consensusability of discrete-time multi-agent systems. IEEE Transactions on Automatic Control, 56(10):2262, 2011.

Supplementary Material

Appendix A MD-MADDPG Algorithm

Initialise actor networks μ_1, …, μ_N with parameters θ_1, …, θ_N and critic networks Q_1, …, Q_N
Initialise actor target networks μ′_1, …, μ′_N and critic target networks Q′_1, …, Q′_N
Initialise replay buffer D
for episode = 1 to E do
     Initialise a random process X for action exploration
     Initialise memory device M ← 0
     for t = 1 to max episode length do
         for agent i = 1 to N do
              Receive observation o_i and the message M
              Generate observation encoding e_i (Eq. 1)
              Generate read vector r_i (Eq. 3)
              Generate new message M (Eq. 5)
              Select action a_i = μ_i(o_i, M) + X_t
              Store the new message M in the memory device
         end for
         Set o = (o_1, …, o_N) and a = (a_1, …, a_N)
         Execute actions a, observe rewards r and next observations o′
         Store (o, M, a, r, o′) in replay buffer D
     end for
     for agent i = 1 to N do
         Sample a random minibatch of samples (o^j, M^j, a^j, r^j, o′^j) from D
         Update critic by minimising the loss in Eq. 10
         Update actor according to the policy gradient in Eq. 9
     end for
     Update target networks: θ′_i ← τ θ_i + (1 − τ) θ′_i
end for
Algorithm 1 MD-MADDPG algorithm
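The read-write cycle in the inner loop of Algorithm 1 can be sketched as follows. The single-layer tanh networks, dimensions, and concatenation-based wiring here are illustrative placeholders for the learned modules of Eqs. 1, 3 and 5, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(in_dim, out_dim):
    """A single hypothetical tanh layer standing in for a learned network."""
    W = rng.standard_normal((in_dim, out_dim)) * 0.1
    b = np.zeros(out_dim)
    return lambda x: np.tanh(x @ W + b)

OBS_DIM, ENC_DIM, MEM_DIM = 8, 16, 32

encoder = mlp(OBS_DIM, ENC_DIM)            # stands in for Eq. 1
reader  = mlp(ENC_DIM + MEM_DIM, MEM_DIM)  # stands in for Eq. 3
writer  = mlp(ENC_DIM + MEM_DIM, MEM_DIM)  # stands in for Eq. 5

def agent_step(obs, memory):
    e = encoder(obs)                          # observation encoding
    r = reader(np.concatenate([e, memory]))   # read vector extracted from memory
    m_new = writer(np.concatenate([e, r]))    # new message overwrites the memory
    return m_new

memory = np.zeros(MEM_DIM)                   # memory re-initialised each episode
for agent_obs in rng.standard_normal((2, OBS_DIM)):  # two agents act in turn
    memory = agent_step(agent_obs, memory)
```

Each agent reads the memory left by the previous agent and replaces it with its own message, which is the mechanism the training loop above relies on.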

Appendix B MD-MADDPG centralised learning

All the agent-specific policy parameters are learned end-to-end. We adopt an actor-critic model within a CLDE framework [Foerster et al., 2016, Lowe et al., 2017]. In the standard actor-critic model [Degris et al., 2012], the agent is composed of an actor, used to select the actions, and a critic, used to evaluate the actor's moves and provide feedback. In DDPG [Silver et al., 2014, Lillicrap et al., 2015], neural networks are used to approximate both the actor, represented by the policy function μ(o; θ), and its corresponding critic, represented by an action-value function Q(o, a), in order to maximise the objective function J(θ). This is done by adjusting the parameters θ in the direction of the gradient of J(θ), which can be written as:

∇_θ J(θ) = E_{o∼D} [ ∇_θ μ(o) ∇_a Q(o, a) |_{a=μ(o)} ]
The actions produced by the actor μ are evaluated by the critic Q, which minimises the following loss:

L(θ^Q) = E_{(o,a,r,o′)∼D} [ (Q(o, a) − y)² ],   with   y = r + γ Q′(o′, μ′(o′))

where o′ is the next observation, D is an experience replay buffer which contains tuples (o, a, r, o′), and y represents the target Q-value. Q′ is a target network whose parameters are periodically updated with the current parameters of Q to make training more stable. Minimising this loss minimises the expectation of the difference between the current and the target action-value function.
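As an illustration, the target computation and squared-TD loss above amount to the following; the batch values here are made up for the example:

```python
import numpy as np

gamma = 0.95  # reward discount factor (illustrative value)

# hypothetical minibatch of three transitions
r         = np.array([1.0, 0.0, -0.5])   # rewards
q_next    = np.array([2.0, 1.5, 0.8])    # Q'(o', mu'(o')) from the target critic
q_current = np.array([2.5, 1.0, 0.2])    # Q(o, a) from the current critic

y = r + gamma * q_next                         # target Q-values
critic_loss = np.mean((q_current - y) ** 2)    # expected squared TD error
```

Gradient descent on `critic_loss` with respect to the critic parameters (held out of this sketch) is what "update critic by minimising" denotes in Algorithm 1.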

In this formulation, as there is no interaction between agents, the policies are learned independently. We adopt the CLDE paradigm by letting the critics use the observations and the actions of all agents, hence:

∇_{θ_i} J(μ_i) = E_{o,a∼D} [ ∇_{θ_i} μ_i(o_i) ∇_{a_i} Q_i(o_1, …, o_N, a_1, …, a_N) |_{a_i=μ_i(o_i)} ]    (9)

where D contains transitions in the form of (o_1, …, o_N, a_1, …, a_N, r_1, …, r_N, o′_1, …, o′_N) and o′_1, …, o′_N are the next observations of all agents. Accordingly, Q_i is updated as:

L(θ_i) = E [ (Q_i(o_1, …, o_N, a_1, …, a_N) − y)² ],   with   y = r_i + γ Q′_i(o′_1, …, o′_N, a′_1, …, a′_N)    (10)

in which a′_1, …, a′_N are the next actions of all agents, as produced by the target actors. By minimising Eq. 10 the model attempts to improve the estimate of the critic, which is in turn used to improve the policy itself through Eq. 9. Since the policy described in Eq. 8 also takes the shared memory M as input, the gradient of the Memory-driven MADDPG objective to be maximised can be written as:

∇_{θ_i} J(μ_i) = E_{o,M,a∼D} [ ∇_{θ_i} μ_i(o_i, M) ∇_{a_i} Q_i(o_1, …, o_N, a_1, …, a_N) |_{a_i=μ_i(o_i, M)} ]

where D is a replay buffer which contains transitions in the form of (o_1, …, o_N, M, a_1, …, a_N, r_1, …, r_N, o′_1, …, o′_N). The function Q_i is updated according to Eq. 10.

Appendix C MD-MADDPG decentralised execution

During execution, only the learned actors are used to make decisions and select actions. An action is taken in turn by a single agent. The current agent i receives its private observation o_i, reads the memory M to extract the read vector r_i (Eq. 3), generates the new version of M (Eq. 5), stores it into the memory device and selects its action using μ_i. The policy of the next agent is then driven by the updated memory.
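This turn-taking execution scheme can be sketched as follows. The actors here are hypothetical random-weight networks mapping (observation, memory) to (action, message); only the control flow, in which each agent rewrites the shared memory before the next agent acts, reflects the procedure described above:

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, MEM_DIM, ACT_DIM, N_AGENTS = 8, 16, 2, 2

def make_actor():
    """Hypothetical learned actor: (obs, memory) -> (action, new message)."""
    Wa = rng.standard_normal((OBS_DIM + MEM_DIM, ACT_DIM)) * 0.1
    Wm = rng.standard_normal((OBS_DIM + MEM_DIM, MEM_DIM)) * 0.1
    def actor(obs, memory):
        h = np.concatenate([obs, memory])
        return np.tanh(h @ Wa), np.tanh(h @ Wm)
    return actor

actors = [make_actor() for _ in range(N_AGENTS)]
memory = np.zeros(MEM_DIM)        # shared memory, re-initialised per episode
actions = []
for i in range(N_AGENTS):         # agents act strictly in turn
    obs = rng.standard_normal(OBS_DIM)        # stand-in private observation
    action, memory = actors[i](obs, memory)   # each turn rewrites the memory
    actions.append(action)
```

No critic is involved at this stage: execution is fully decentralised, with the shared memory as the only coupling between the agents.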

Appendix D Meta-agent

In the meta-agent MADDPG (MA-MADDPG), each agent observes the observations of all other agents in order to address issues related to partial observability. The policy of each agent is defined as μ_i(o_1, …, o_N) and the gradient is:

∇_{θ_i} J(μ_i) = E_{o,a∼D} [ ∇_{θ_i} μ_i(o_1, …, o_N) ∇_{a_i} Q_i(o_1, …, o_N, a_1, …, a_N) |_{a_i=μ_i(o_1, …, o_N)} ]

where the critic Q_i is updated according to Eq. 10.

Appendix E Experimental details

In all our experiments, we use a neural network with one layer of 512 units for the encoding (Eq. 1), a neural network with one layer of 256 units for the action selector (Eq. 7) and neural networks with three hidden layers for the critics. For MADDPG and MA-MADDPG the actors are implemented as neural networks with two hidden layers. The size of the memory and of the messages is fixed. Training is performed with the Adam optimizer [Kingma and Ba, 2014], with separate learning rates for the critics and the policies. The number of time steps per episode is set separately for Waterworld and for the other environments. We update network parameters after every 100 samples added to the replay buffer, using soft target updates. The Ornstein-Uhlenbeck process is used to add noise to the exploration process [Uhlenbeck and Ornstein, 1930]. Discrete actions are supported by using the Gumbel-Softmax estimator [Jang et al., 2016].
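For reference, the Ornstein-Uhlenbeck exploration process can be implemented as below. The θ and σ values here are common DDPG defaults, not the paper's settings, and the dimension is arbitrary:

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated exploration noise.

    dx = theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)
    theta=0.15 and sigma=0.2 are common defaults, not the paper's values.
    """
    def __init__(self, dim, theta=0.15, sigma=0.2, mu=0.0, dt=1.0, seed=0):
        self.theta, self.sigma, self.mu, self.dt = theta, sigma, mu, dt
        self.rng = np.random.default_rng(seed)
        self.x = np.full(dim, mu, dtype=float)

    def reset(self):
        """Reset the state at the start of each episode."""
        self.x[:] = self.mu

    def sample(self):
        # mean-reverting drift plus scaled Gaussian increment
        dx = (self.theta * (self.mu - self.x) * self.dt
              + self.sigma * np.sqrt(self.dt) * self.rng.standard_normal(self.x.shape))
        self.x = self.x + dx
        return self.x

noise = OUNoise(dim=2)
samples = np.array([noise.sample() for _ in range(1000)])
```

The mean-reverting drift keeps the noise centred on μ while successive samples stay correlated, which encourages smoother exploration trajectories than independent Gaussian noise.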

Appendix F Additional Experiments

F.1 Corrupting the memory

Table 2 shows the performance of MD-MADDPG when Gaussian noise (mean 0 and standard deviation 1) is added to the memory content at execution time. Corrupting the communication channel in this way causes a general worsening of the results. This expected outcome shows that the messages the agents learn to exchange are crucial to good performance: when they are corrupted, the agents can no longer rely on communication to synchronise their behaviour.
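The corruption applied at execution time amounts to the following perturbation; the memory dimension and content here are stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)
memory = np.tanh(rng.standard_normal(200))                 # stand-in learned memory content
noise = rng.normal(loc=0.0, scale=1.0, size=memory.shape)  # mean 0, standard deviation 1
corrupted = memory + noise                                 # what the agents read at test time
```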

Environment Metric MD-MADDPG - noise
CN Average distance
# collisions
PO CN Average distance
# collisions
Sync CN # sync occup.
# not sync occup.
Sequential CN Reward
Average distance
Swapping CN Reward
Average distance
Waterworld # food targets
# poison targets
Table 2: Results of MD-MADDPG when Gaussian noise is added to the memory content.

F.2 Increasing the number of agents - Cooperative Navigation

Table 3 compares MADDPG and MD-MADDPG on Cooperative Navigation as the number of agents increases. Although communication can be harder to achieve in MD-MADDPG, since all agents share the same channel, it yields a general improvement in overall performance over MADDPG.

Number of agents Metric MADDPG MD-MADDPG
3 Average distance
# collisions
4 Average distance
# collisions
5 Average distance
# collisions
6 Average distance
# collisions
Table 3: Comparison of MADDPG and MD-MADDPG on Cooperative Navigation when increasing the number of agents.

F.3 Increasing the number of agents - Partially Observable Cooperative Navigation

Table 4 presents the comparison of MADDPG and MD-MADDPG on Partially Observable Cooperative Navigation when the number of agents increases. It can be noted that MD-MADDPG still achieves good performance and in some scenarios (e.g. number of agents = 5) it outperforms MADDPG.

Number of agents Metric MADDPG MD-MADDPG
3 Average distance
# collisions
4 Average distance
# collisions
5 Average distance
# collisions
6 Average distance
# collisions
Table 4: Comparison of MADDPG and MD-MADDPG on Partially Observable Cooperative Navigation when increasing the number of agents.

F.4 Removing the context vector

In order to provide an ablation study showing the importance of the context vector (Eq. 3.2), we ran a set of experiments removing it from the reading module of the agents. Table 5 shows that using only the observation encoding, without the context vector, during the reading phase causes a general worsening of performance on all the environments (see Table 1 for the results obtained when the context vector is used).
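The ablation can be pictured as follows. The gated read, the dimensions, and the random weights are illustrative stand-ins for the paper's reading module; removing the context is modelled by zeroing its contribution:

```python
import numpy as np

rng = np.random.default_rng(0)
ENC_DIM, MEM_DIM = 4, 6
W = rng.standard_normal((ENC_DIM + MEM_DIM, MEM_DIM)) * 0.5  # hypothetical read weights

e = rng.standard_normal(ENC_DIM)   # observation encoding
h = rng.standard_normal(MEM_DIM)   # context vector (stand-in for its learned value)
M = rng.standard_normal(MEM_DIM)   # current memory content

# full read: the gate is conditioned on both the encoding and the context
k_full = np.tanh(np.concatenate([e, h]) @ W)
# ablated read: the context is removed, as in the experiments of Table 5
k_ablate = np.tanh(np.concatenate([e, np.zeros(MEM_DIM)]) @ W)

r_full, r_ablate = M * k_full, M * k_ablate  # element-wise gated read vectors
```

Without the context, the gate depends only on the instantaneous observation, so the extracted read vector loses the history-dependent filtering that the full model provides.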

Environment Metric MD-MADDPG - no context
CN Average distance
# collisions
PO CN Average distance
# collisions
Sync CN # sync occup.
# not sync occup.
Sequential CN Reward
Average distance
Swapping CN Reward
Average distance
Waterworld # food targets
# poison targets
Table 5: Results of MD-MADDPG without context vector.

Appendix G Memory analysis

Figure 3: Visualisation of communications strategies learned by the agents in Synchronous Cooperative Navigation

In Synchronous Cooperative Navigation, the phase indicates whether none of the landmarks is occupied (light grey), only one is occupied (dark grey), or both are occupied (black). Figure 3 shows that memory activity is very intense before phase three, while the agents are collaborating to complete the task.

Appendix H Simulation environments

(a) Cooperative Navigation
(b) Partially Observable CN
(c) Synchronous CN
(d) Sequential CN
(e) Swapping CN
(f) Waterworld
An illustration of our environments. Blue circles represent the agents; dashed lines indicate the range of vision; green and red circles represent the food and poison targets, respectively, while black dots represent landmarks to be reached.
