Is multiagent deep reinforcement learning the answer or the question? A brief survey

by Pablo Hernandez-Leal, et al.

Deep reinforcement learning (DRL) has achieved outstanding results in recent years. This has led to a dramatic increase in the number of applications and methods. Recent works have explored learning beyond single-agent scenarios and have considered multiagent scenarios. Initial results report successes in complex multiagent domains, although there are several challenges to be addressed. In this context, first, this article provides a clear overview of current multiagent deep reinforcement learning (MDRL) literature. Second, it provides guidelines to complement this emerging area by (i) showcasing examples on how methods and algorithms from DRL and multiagent learning (MAL) have helped solve problems in MDRL and (ii) providing general lessons learned from these works. We expect this article will help unify and motivate future research to take advantage of the abundant literature that exists in both areas (DRL and MAL) in a joint effort to promote fruitful research in the multiagent community.





1 Introduction

Almost 20 years ago Stone and Veloso’s seminal survey Stone and Veloso (2000) laid the groundwork for defining the area of multiagent systems (MAS) and its open problems in the context of AI. About ten years ago, Shoham, Powers, and Grenager Shoham et al. (2007) noted that the literature on multiagent learning (MAL) was growing and it was not possible to enumerate all relevant articles. Since then, the number of published MAL works has continued to rise steadily, which has led to different surveys on the area, ranging from analyzing the basics of MAL and their challenges Alonso et al. (2002); Tuyls and Weiss (2012); Busoniu et al. (2008), to addressing specific subareas: game theory and MAL Shoham et al. (2007); Nowé et al. (2012), cooperative scenarios Panait and Luke (2005); Matignon et al. (2012), and evolutionary dynamics of MAL Bloembergen et al. (2015). In just the last couple of years, two surveys related to MAL have been published: learning in non-stationary environments Hernandez-Leal et al. (2017) and agents modeling agents Albrecht and Stone (2018).

The research interest in MAL has been accompanied by successes, first, in single-agent video games Mnih et al. (2015); more recently, in two-player games, e.g., playing Go Silver et al. (2016, 2017), poker Moravčík et al. (2017), and games of two competing teams, e.g., DOTA 2 ope (2018).

While different techniques and algorithms were used in the above scenarios, in general, they are all a combination of techniques from two main areas: reinforcement learning (RL) Sutton and Barto (1998) and deep learning LeCun et al. (2015).

RL is an area of machine learning where an agent learns by interacting (i.e., taking actions) within a dynamic environment. However, one of its main drawbacks is the need for defining (hand-crafting) features used to learn. In recent years, deep learning has had successes in different areas such as computer vision and natural language processing LeCun et al. (2015). One of the key aspects of deep learning is the use of neural networks (NNs) that can find compact representations in high-dimensional data Arulkumaran et al. (2017), thus eliminating the need for manual feature design.

Complementing each other, deep learning and reinforcement learning generated a new area: deep reinforcement learning (DRL) Arulkumaran et al. (2017), in which deep neural networks are trained to approximate the optimal policy and/or the value function, where the promise of generalization is expected to be delivered by the representation ability of deep NNs (as the function approximator). One of the key advantages of DRL is that it enables RL to scale to problems with high-dimensional state and action spaces.

DRL has been regarded as an important component in constructing general AI systems Lake et al. (2016) and has been successfully integrated with other techniques, e.g., search Silver et al. (2016), planning Tamar et al. (2016), and more recently with multiagent systems, with an emerging area of multiagent deep reinforcement learning (MDRL).

Learning in multiagent settings is fundamentally more difficult than the single-agent case with problems like non-stationarity, curse of dimensionality, and multiagent credit assignment Shoham et al. (2007); Busoniu et al. (2008); Hernandez-Leal et al. (2017); Weiss (2013); Agogino and Tumer (2004); Wei and Luke (2016). Despite this complexity, top AI conferences like AAAI, ICML, ICLR, IJCAI and NIPS, and specialized conferences such as AAMAS, have published works reporting successes in MDRL. In light of these works, we believe it is pertinent to first, have an overview of the recent MDRL works, and second, understand how these recent works relate to the existing literature.

This paper contributes to the state of the art with a brief survey of the current works in MDRL in an effort to complement existing surveys Hernandez-Leal et al. (2017); Albrecht and Stone (2018); Arulkumaran et al. (2017). In particular, we identified four categories to group recent works (see Figure 1):

  • Analysis of emergent behaviors

  • Learning communication

  • Learning cooperation

  • Agents modeling agents

For each category we provide a description as well as outline the recent works (see Section 3 and Tables 1–4). Then, we take a step back and reflect on how these new works relate to the existing literature (see Section 4). In that context, first, we present examples on how methods and algorithms from (non-deep) MAL and RL helped to solve problems in MDRL (see Section 4.1). Second, we present general lessons learned from the existing MDRL works (see Section 4.2). Third, we outline some open questions (see Section 4.3). Lastly, we present some conclusions from this work (see Section 5).

Our goal is to outline a recent and active area (i.e., MDRL), as well as to motivate future research to take advantage of the ample and existing literature in multiagent learning. We expect that researchers with experience on either DRL or MAL could benefit from this article to have a common understanding about recent works and open problems in MDRL and to avoid having scattered sub-communities with little interaction Shoham et al. (2007); Hernandez-Leal et al. (2017); Albrecht and Stone (2018); Darwiche (2018).

Figure 1: Categories of different MDRL works. (a) Analysis of emergent behaviors: evaluate DRL algorithms in multiagent scenarios. (b) Learning communication: agents learn with actions and through messages. (c) Learning cooperation: agents learn to cooperate using only actions and local observations. (d) Agents modeling agents: agents reason about others to fulfill a task (e.g., cooperative or competitive). For a more detailed description see Sections 3.3–3.6 and Tables 1–4.

2 Single-agent learning

This section presents the formalism of reinforcement learning and its main components before outlining deep reinforcement learning along with recent algorithms. For a more detailed description we refer the reader to excellent books and surveys on the area Sutton and Barto (1998); Arulkumaran et al. (2017); Kaelbling et al. (1996).

2.1 Reinforcement learning

RL formalizes the interaction of an agent with an environment using a Markov decision process (MDP) Puterman (1994). An MDP is defined by the tuple $\langle \mathcal{S}, \mathcal{A}, R, T, \gamma \rangle$ where $\mathcal{S}$ represents a finite set of states. $\mathcal{A}$ represents a finite set of actions. The transition function $T: \mathcal{S} \times \mathcal{A} \rightarrow \Delta(\mathcal{S})$ maps each state-action pair to a probability distribution over the possible successor states, where $\Delta(\mathcal{S})$ denotes the set of all probability distributions over $\mathcal{S}$. Thus, for each $s, s' \in \mathcal{S}$ and $a \in \mathcal{A}$, the function $T$ determines the probability of a transition from state $s$ to state $s'$ after executing action $a$. The reward function $R: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow \mathbb{R}$ defines the immediate and possibly stochastic reward that an agent would receive given that the agent executes action $a$ while in state $s$ and it is transitioned to state $s'$. $\gamma \in [0, 1]$ represents the discount factor that balances the trade-off between immediate rewards and future rewards.

MDPs are adequate models to obtain optimal decisions in single agent environments. Solving an MDP will yield a policy $\pi: \mathcal{S} \rightarrow \mathcal{A}$, which is a mapping from states to actions. An optimal policy $\pi^*$ is the one that maximizes the expected discounted sum of rewards. There are different techniques for solving MDPs assuming a complete description of all its elements. One of the most common techniques is the value iteration algorithm Bellman (1957) which is based on the Bellman equation:

$$V^{\pi}(s) = \sum_{a \in \mathcal{A}} \pi(s, a) \sum_{s' \in \mathcal{S}} T(s, a, s') \left[ R(s, a, s') + \gamma V^{\pi}(s') \right]$$

This equation expresses the value of a state which can be used to obtain the optimal policy $\pi^* = \arg\max_{\pi} V^{\pi}(s)$, i.e., the one that maximizes that value function, and the optimal value function $V^*(s) = V^{\pi^*}(s)$.
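As an illustration of value iteration, the following is a minimal sketch rather than a reference implementation. It applies the Bellman optimality backup (a maximum over actions instead of an average under a fixed policy) until the values stop changing. The two-state MDP, the array layout `T[s, a, s2]` and `R[s, a, s2]`, and the tolerance `tol` are assumptions made for this example only.

```python
import numpy as np

def value_iteration(T, R, gamma=0.9, tol=1e-8):
    """Return (V, greedy_policy) for transition tensor T and reward tensor R.

    T[s, a, s2] is the probability of moving from s to s2 under action a;
    R[s, a, s2] is the reward received on that transition.
    """
    V = np.zeros(T.shape[0])
    while True:
        # Q[s, a] = sum_s2 T[s, a, s2] * (R[s, a, s2] + gamma * V[s2])
        Q = np.einsum("ijk,ijk->ij", T, R + gamma * V[None, None, :])
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new

# Hypothetical MDP: action 0 stays, action 1 switches state;
# staying in state 1 yields reward 1, everything else yields 0.
T = np.zeros((2, 2, 2))
R = np.zeros((2, 2, 2))
T[0, 0, 0] = T[0, 1, 1] = T[1, 0, 1] = T[1, 1, 0] = 1.0
R[1, 0, 1] = 1.0

V, policy = value_iteration(T, R, gamma=0.9)
```

In this toy problem the greedy policy moves from state 0 to state 1 and then stays there, which requires the full transition and reward tensors to be known in advance.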

Value iteration requires a complete and accurate representation of states, actions, rewards, and transitions. However, this may be difficult to obtain in many domains. For this reason, RL algorithms often learn from experience interacting with the environment in discrete time-steps.


One of the most well known algorithms for RL is Q-learning Watkins (1989). It has been devised for stationary, single-agent, fully observable environments with discrete actions. A Q-learning agent keeps the estimate of its expected payoff starting in state $s$, taking action $a$ as $\hat{Q}(s, a)$. Each tabular entry $\hat{Q}(s, a)$ is an estimate of the corresponding optimal $Q^*$ function that maps state-action pairs to the discounted sum of future rewards starting with action $a$ at state $s$ and following the optimal policy thereafter. Each time the agent transitions from a state $s$ to a state $s'$ via action $a$ receiving payoff $r$, the Q table is updated as follows:

$$\hat{Q}(s, a) \leftarrow \hat{Q}(s, a) + \alpha \left[ \left( r + \gamma \max_{a'} \hat{Q}(s', a') \right) - \hat{Q}(s, a) \right]$$

with the learning rate $\alpha \in [0, 1]$, typically decreasing over the course of many iterations. Q-learning is proven to converge towards $Q^*$ if each state-action pair is visited infinitely often under specific parameters Watkins (1989); Tsitsiklis (1994).
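The tabular update can be sketched in a few lines. The toy environment `step`, the uniform random exploration, and the constant learning rate are assumptions chosen to keep the example short; they are not part of the original algorithm description, which typically uses a decreasing learning rate.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha, gamma):
    """One tabular Q-learning update; modifies Q in place."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])

def step(s, a):
    """Hypothetical environment: action 0 stays, action 1 switches state.

    Staying in state 1 pays reward 1; every other transition pays 0.
    """
    s_next = s if a == 0 else 1 - s
    r = 1.0 if (s == 1 and a == 0) else 0.0
    return r, s_next

rng = np.random.default_rng(0)
Q = np.zeros((2, 2))  # one row per state, one column per action
s = 0
for _ in range(20000):
    a = int(rng.integers(2))  # uniform random exploration (an assumption)
    r, s_next = step(s, a)
    q_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.9)
    s = s_next

greedy = Q.argmax(axis=1)
```

Unlike value iteration, the agent never reads the transition or reward functions directly; it only sees sampled transitions returned by `step`.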

REINFORCE (Monte Carlo policy gradient)

In contrast to value-based methods for the RL problem, policy gradient methods can learn parameterized policies without using intermediate value estimates. Policy parameters are learned based on gradients of some performance measure with gradient ascent method Sutton et al. (2000). For example, REINFORCE Williams (1992) uses estimated return by Monte Carlo methods with full episode trajectories to learn policy parameters $\theta$ as follows:

$$\theta_{t+1} = \theta_t + \alpha G_t \frac{\nabla \pi(A_t; S_t, \theta_t)}{\pi(A_t; S_t, \theta_t)}$$

where $G_t$ represents the return, $\alpha$ is the learning rate, and $A_t \sim \pi$. Policy gradient methods can have high variance. To address this challenge, actor-critic algorithms, which approximate policy gradient methods, have been proposed. Actor-critic algorithms combine value-based and policy-gradient based methods Konda and Tsitsiklis (2000). The actor represents the policy, i.e., action-selection mechanism, whereas a critic is used for the value function learning. The policy update takes into account the critic’s value estimate, reducing the variance compared to vanilla policy gradient methods. When the critic also learns a state-action function besides a state value function, an advantage function can be computed as a baseline for variance reduction Sutton and Barto (1998).
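A minimal sketch of the REINFORCE update follows, assuming a softmax policy over two actions and a hypothetical one-step bandit in which arm 1 pays 1 and arm 0 pays 0. It uses the identity that the gradient of the policy divided by the policy equals the gradient of the log-policy, which for a softmax is the one-hot vector of the chosen action minus the action probabilities.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def reinforce_update(theta, action, G, alpha):
    """theta + alpha * G * grad log pi(action; theta) for a softmax policy."""
    probs = softmax(theta)
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0  # one_hot(action) - probs
    return theta + alpha * G * grad_log_pi

# A single update from a uniform policy, after observing return 1 for arm 1.
first = reinforce_update(np.zeros(2), action=1, G=1.0, alpha=0.1)

rng = np.random.default_rng(0)
theta = np.zeros(2)
for _ in range(2000):
    a = int(rng.choice(2, p=softmax(theta)))
    G = float(a)  # hypothetical bandit: the return equals the arm index
    theta = reinforce_update(theta, a, G, alpha=0.1)

final_probs = softmax(theta)
```

Because episodes here last a single step, the Monte Carlo return is just the immediate reward; in longer episodes the same update would be applied at every time-step with the return from that step onward.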

Policy gradient methods have a clear connection with deep reinforcement learning since the policy might be represented by a neural network whose input is a representation of the state, whose output are action selection probabilities, and whose weights are the policy parameters.

2.2 Deep reinforcement learning

Even when tabular RL methods such as Q-learning had successes, there were drawbacks: RL could be slow to learn in large state spaces, the methods did not generalize (across the state space), and state representations needed to be hand-specified Sutton and Barto (1998). Fortunately, these challenges can be addressed by using deep learning, i.e., neural networks as function approximators as follows:

$$Q(s, a; \theta) \approx Q^*(s, a)$$

where $\theta$ represents the neural network weights. First, deep learning helps to generalize across states improving the sample efficiency for large state-space RL problems. Second, deep learning can be used to reduce (or eliminate) the need for manually designing features to represent state information LeCun et al. (2015).

However, extending deep learning to RL problems comes with additional challenges including non-i.i.d. data for highly correlated sequential agent interactions and non-stationary data distribution due to learning agent behavior Mnih et al. (2013). Below we mention how the existing DRL methods address these challenges when briefly reviewing value-based methods, such as DQN Mnih et al. (2015); policy gradient methods, like PPO Schulman et al. (2017); and actor-critic methods like A3C Mnih et al. (2016). We refer the reader to a recent survey on single-agent DRL Arulkumaran et al. (2017) for a more detailed discussion of the literature.

Value-based methods

Figure 2: Deep Q-Network (DQN) Mnih et al. (2015), an example of standard Deep RL neural network architecture, composed of several layers: Convolutional layers employ filters to learn features from high-dimensional data with a much smaller number of neurons and Dense layers are fully-connected layers. The last layer represents the actions the agent can take (in this case, possible actions). Deep Recurrent Q-Network (DRQN) Hausknecht and Stone (2015), which extends DQN to partially observable domains Cassandra (1998), is identical to this setup except the penultimate layer (the first Dense layer) is replaced with a recurrent LSTM layer Hochreiter and Schmidhuber (1997).

The major breakthrough work blending deep learning with Q-learning was Deep Q-Network (DQN) Mnih et al. (2015). DQN uses a deep neural network for function approximation (see Figure 2) and maintains an experience replay (ER) buffer to store interactions $\langle s, a, r, s' \rangle$. In contrast to tabular Q-learning, DQN keeps an additional copy of neural network parameters, i.e., $\theta^-$, for the target network besides the $\theta$ parameters to stabilize the learning, i.e., to alleviate the non-stationary data distribution. For each training iteration $i$, DQN minimizes the mean-squared error (MSE) between the Q-network and its target network using the loss function:

$$L_i(\theta_i) = \mathbb{E}_{s, a, r, s'} \left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta_i^-) - Q(s, a; \theta_i) \right)^2 \right]$$

where target network parameters $\theta^-$ are set to Q-network parameters $\theta$ periodically and mini-batches of $\langle s, a, r, s' \rangle$ tuples are sampled from the ER buffer, as depicted in Figure 3.
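The interplay between the ER buffer, the Q-network, and the target network can be sketched as follows. To stay self-contained, the example assumes a linear function `linear_q` in place of a deep network and omits terminal-state handling; the buffer size, the discount factor, and all names are illustrative choices, not values from the DQN paper.

```python
import random
from collections import deque

import numpy as np

def linear_q(states, theta):
    """Stand-in for a deep network: returns Q-values of shape (batch, n_actions)."""
    return states @ theta

def sample_batch(buffer, batch_size):
    """Uniformly sample stored (s, a, r, s_next) tuples and stack them into arrays."""
    s, a, r, s_next = zip(*random.sample(list(buffer), batch_size))
    return np.array(s), np.array(a), np.array(r), np.array(s_next)

def dqn_loss(theta, theta_target, batch, gamma):
    """Mean-squared error between Q(s, a; theta) and the target-network estimate."""
    s, a, r, s_next = batch
    q_sa = linear_q(s, theta)[np.arange(len(a)), a]
    target = r + gamma * linear_q(s_next, theta_target).max(axis=1)
    return float(np.mean((target - q_sa) ** 2))

buffer = deque(maxlen=10_000)  # experience replay buffer
buffer.append((np.array([1.0, 0.0]), 1, 1.0, np.array([0.0, 1.0])))

theta = np.zeros((2, 2))                            # Q-network parameters
theta_target = np.array([[0.0, 0.0], [2.0, 3.0]])  # target-network parameters

loss = dqn_loss(theta, theta_target, sample_batch(buffer, 1), gamma=0.5)

# Periodic hard update: copy the Q-network parameters into the target network.
theta_target = theta.copy()
```

Holding `theta_target` fixed between copies keeps the regression target from shifting at every gradient step, which is the stabilizing effect described above.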

The ER buffer provides stability for learning, as random batches sampled from the buffer help alleviate the problems caused by the non-i.i.d. data. However, it comes with disadvantages, such as memory requirements and a mismatch between buffer content from an earlier policy and from the current policy Mnih et al. (2016).

DQN assumes full state observability and it does not perform well in partially observable domains Cassandra (1998). Deep Recurrent Q-Networks (DRQN) Hausknecht and Stone (2015) proposed using recurrent neural networks, in particular, LSTMs (Long Short-Term Memory) Hochreiter and Schmidhuber (1997) in DQN, for this setting. The main change to the standard DQN is to replace the first post-convolutional fully connected layer with an LSTM layer (see Figure 2). With this addition, DRQN has memory capacity so that it can even work with only one input rather than a stacked input of consecutive frames.

Figure 3: Representation of a DQN agent that uses an experience replay buffer to keep $\langle s, a, r, s' \rangle$ tuples to use within minibatch updates. The policy is parametrized with a NN that outputs an action at every timestep.

Policy gradient methods

For many tasks, particularly for physical control, the action space is continuous and high dimensional where DQN is not suitable. Deep deterministic policy gradient (DDPG) Lillicrap et al. (2016) is a model-free off-policy actor-critic algorithm for such domains where several adjustments to the DQN algorithm have been made. An actor-critic based model is employed where target network parameters slowly change, in contrast to hard reset to learned network parameters as in DQN. Given the off-policy nature, DDPG generates exploratory behavior by adding sampled noise from some noise processes to its actor policy. The authors also used batch normalization Ioffe and Szegedy (2015) to ensure generalization across many different tasks without performing manual normalizations.
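The difference between a slowly changing target network and a hard reset can be sketched with plain arrays standing in for network weights. The mixing coefficient `tau`, the Gaussian exploration noise, and the action bounds are assumptions for illustration; the text above only states that the target changes slowly and that noise is sampled from some noise process.

```python
import numpy as np

def hard_update(online):
    """DQN-style reset: the target becomes an exact copy of the learned weights."""
    return online.copy()

def soft_update(target, online, tau):
    """Slow tracking: move the target a small fraction tau toward the learned weights."""
    return (1.0 - tau) * target + tau * online

def noisy_action(mu, rng, sigma=0.1, low=-1.0, high=1.0):
    """Exploration: perturb the deterministic actor output and clip to the action range."""
    return np.clip(mu + sigma * rng.normal(size=mu.shape), low, high)

online = np.ones(3)   # hypothetical learned weights
target = np.zeros(3)  # hypothetical target weights
for _ in range(100):
    target = soft_update(target, online, tau=0.01)

rng = np.random.default_rng(0)
action = noisy_action(np.array([0.95, -0.2]), rng)
```

After 100 soft updates with `tau=0.01` the target has covered only part of the distance to the learned weights, whereas `hard_update` would close the gap in a single step.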

Figure 4: Asynchronous Advantage Actor-Critic (A3C) employs multiple (CPUs) workers without needing an ER buffer. Each worker has its own NN and independently interacts with the environment to compute the loss and gradients. Workers then pass computed gradients to the global NN that optimizes the parameters and synchronizes with the worker asynchronously. This distributed system is designed for single-agent deep RL. A major advantage of this method is that it does not require a GPU; it can use multiple CPU threads on standard computers.

A3C (Asynchronous Advantage Actor-Critic) Mnih et al. (2016) is an algorithm that employs a parallelized asynchronous training scheme (using multiple CPU threads) for efficiency. It is an on-policy RL method that does not use an experience replay buffer. A3C allows multiple workers to simultaneously interact with the environment and compute gradients locally. All the workers pass their computed local gradients to a global NN which performs the optimization and synchronizes with the workers asynchronously (see Figure 4). There is also the A2C (Advantage Actor-Critic) method, which combines the gradients from all the workers to update the global NN synchronously.

The UNREAL framework Jaderberg et al. (2017) is built on top of A3C. In particular, UNREAL proposes unsupervised auxiliary tasks (e.g., reward prediction) to speed up the learning process which require no additional feedback from the environment. In contrast to A3C, UNREAL uses an ER buffer that is sampled with more priority given to interactions with positive rewards to improve the critic network.

Another distributed architecture is the Importance Weighted Actor-Learner Architecture (IMPALA) Espeholt et al. (2018). Unlike A3C or UNREAL, IMPALA actors communicate trajectories of experience (sequences of states, actions, and rewards) to a centralized learner, thus IMPALA decouples acting from learning.

TRPO (Trust Region Policy Optimization) Schulman et al. (2015) and PPO (Proximal Policy Optimization) Schulman et al. (2017) are state-of-the-art policy gradient algorithms. Compared to vanilla policy gradient algorithms, PPO is designed to prevent abrupt changes in policies during training by incorporating the change in policy into the loss function. Another advantage of PPO is that it can be used in a distributed fashion, DPPO Heess et al. (2017). Note that distributed approaches like DPPO or A3C use only parallelization to improve the learning for single-agent DRL and they should not be considered MDRL approaches.
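One way the change in policy enters the loss is the clipped surrogate objective from the PPO paper, sketched below. This shows only that objective, not a full training loop; the clipping range `eps` and the sample values are assumptions for illustration.

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate objective (to be maximized), averaged over samples."""
    ratio = np.exp(logp_new - logp_old)  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return float(np.mean(np.minimum(unclipped, clipped)))

# The new policy raises the probability of an advantageous action from 0.4 to 0.6.
gain = ppo_clip_objective(np.log([0.6]), np.log([0.4]), np.array([1.0]))

# The same probability change for a disadvantageous action is not clipped.
penalty = ppo_clip_objective(np.log([0.6]), np.log([0.4]), np.array([-1.0]))
```

With a probability ratio of 1.5 and `eps=0.2`, the gain is capped at 1.2 times the advantage, so the optimizer has no incentive to move the policy further in one update, while the harmful change keeps its full penalty.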

We have reviewed recent algorithms in DRL. While the list is not exhaustive, it provides an overview of the different state-of-the-art techniques and algorithms, which will be useful when describing the MDRL techniques in the next section.

3 Multiagent Deep Reinforcement Learning (MDRL)

First, we briefly introduce the general framework on multiagent learning and then we dive into the categories and the research on MDRL.

3.1 Multiagent Learning

Learning in a multiagent environment is inherently more complex than in the single-agent case, as agents interact at the same time with the environment and potentially with each other (Busoniu et al., 2008). Directly using single-agent algorithms in a multiagent setting is a natural approach, called independent learners Tan (1993), even if assumptions under which these algorithms were derived are violated. In particular the Markov property (the future dynamics, transitions, and rewards fully depend on the current state) becomes invalid since the environment is no longer stationary (Tuyls and Weiss, 2012; Nowé et al., 2012; Laurent et al., 2011). This approach ignores the multiagent nature of the setting entirely and it can fail when an opponent adapts or learns, for example, based on the past history of interactions (Shoham et al., 2007).

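In order to understand why multiagent domains are non-stationary from agents’ local perspectives, consider a stochastic (also known as Markov) game $(\mathcal{S}, \mathcal{N}, \mathcal{A}, T, R)$, which can be seen as an extension of an MDP to multiple agents Littman (1994). One key distinction is that the transition, $T$, and reward function, $R$, depend on the actions $\mathcal{A} = A_1 \times \dots \times A_N$ of all, $\mathcal{N}$, agents.

Given a learning agent $i$ and using the common shorthand notation $-i = \mathcal{N} \setminus \{i\}$ for the set of opponents, the value function now depends on the joint action $\mathbf{a} = (a_i, \mathbf{a}_{-i})$, and the joint policy $\boldsymbol{\pi}(s, \mathbf{a}) = \prod_j \pi_j(s, a_j)$:

$$V_i^{\boldsymbol{\pi}}(s) = \sum_{\mathbf{a} \in \mathcal{A}} \boldsymbol{\pi}(s, \mathbf{a}) \sum_{s' \in \mathcal{S}} T(s, a_i, \mathbf{a}_{-i}, s') \left[ R_i(s, a_i, \mathbf{a}_{-i}, s') + \gamma V_i^{\boldsymbol{\pi}}(s') \right]$$

Consequently, the optimal policy is a best response dependent on the other agents’ policies,

$$\pi_i^*(s, a_i, \boldsymbol{\pi}_{-i}) = \arg\max_{\pi_i} V_i^{(\pi_i, \boldsymbol{\pi}_{-i})}(s)$$

Specifically, the opponents’ joint policy $\boldsymbol{\pi}_{-i}(s, \mathbf{a}_{-i})$ can be non-stationary, thus becoming the parameter of the best response function.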
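A single-state example makes the dependence on the opponents' policy concrete. The sketch below assumes the matching pennies payoff matrix for the learning agent (the row player) and two hypothetical opponent policies; it is an illustration of the best response idea, not a method from the surveyed works.

```python
import numpy as np

# Payoffs of the learning agent in matching pennies:
# rows are its own actions, columns are the opponent's actions.
payoff_i = np.array([[1.0, -1.0],
                     [-1.0, 1.0]])

def best_response(payoff, opp_policy):
    """Return the action maximizing expected payoff against a fixed opponent policy."""
    expected = payoff @ opp_policy  # expected payoff of each own action
    return int(np.argmax(expected)), expected

# The opponent mostly plays action 0, then later mostly plays action 1.
br_early, exp_early = best_response(payoff_i, np.array([0.8, 0.2]))
br_late, exp_late = best_response(payoff_i, np.array([0.2, 0.8]))
```

The payoff matrix never changes, yet the best action flips once the opponent's policy shifts. From the agent's local perspective this looks like a changing environment, which is why independent learners face non-stationarity.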

There are other common problems in MAL, including convergence Shoham et al. (2007); Bowling and Veloso (2002); Balduzzi et al. (2018), action shadowing Wei and Luke (2016); Fulda and Ventura (2007), the curse of dimensionality Busoniu et al. (2008), and multiagent credit assignment Agogino and Tumer (2004). Describing each problem is out of the scope of this survey. However, we refer the interested reader to excellent resources on general MAL Tuyls and Weiss (2012); Weiss (2013); mul (2017), as well as surveys in specific areas: game theory and multiagent reinforcement learning Busoniu et al. (2008); Nowé et al. (2012), cooperative scenarios Panait and Luke (2005); Matignon et al. (2012), evolutionary dynamics of multiagent learning Bloembergen et al. (2015), learning in non-stationary environments Hernandez-Leal et al. (2017), and agents modeling agents Albrecht and Stone (2018).

3.2 MDRL categorization

In the previous section we outlined some recent works in single-agent DRL since an exhaustive list is out of the scope of this article. This explosion of works has led DRL to be extended and combined with other techniques Arulkumaran et al. (2017). One natural extension to DRL is to test whether these approaches could be applied in a multiagent environment.

We analyzed the most recent works (that are not covered by previous MAL surveys Hernandez-Leal et al. (2017); Albrecht and Stone (2018)) that have a clear connection with MDRL. We do not consider genetic algorithms or swarm intelligence in this work.

We propose 4 categories which take inspiration from previous surveys Stone and Veloso (2000); Busoniu et al. (2008); Panait and Luke (2005); Albrecht and Stone (2018) and that conveniently describe and represent current works. Note that some of these works fit into two categories; however, for ease of exposition, we describe each of them in only one category.

  • Analysis of emergent behaviors. These works do not propose learning algorithms; their main focus is to analyze and evaluate DRL algorithms, e.g., DQN Tampuu et al. (2017); Leibo et al. (2017); Raghu et al. (2018) and others Lazaridou et al. (2017); Mordatch and Abbeel (2017); Bansal et al. (2018), in a multiagent environment. See Section 3.3 and Table 1.

  • Learning communication Lazaridou et al. (2017); Mordatch and Abbeel (2017); Foerster et al. (2016); Sukhbaatar et al. (2016); Peng et al. (2017). These works explore a sub-area that is attracting attention (for example, see recent workshops on Emergent Communication) and that had not been explored much in the MAL literature. See Section 3.4 and Table 2.

  • Learning cooperation. While learning to communicate is an emerging area, fostering cooperation in learning agents has a long history of research in MAL Panait and Luke (2005); Matignon et al. (2012). The works in this category mostly take inspiration from MAL to extend to the MDRL setting, with both value-based methods Foerster et al. (2017); Palmer et al. (2018); Omidshafiei et al. (2017); Zheng et al. (2018); Jaderberg et al. (2018); Sunehag et al. (2018); Rashid et al. (2018); Lanctot et al. (2017) and policy gradient methods Peng et al. (2017); Foerster et al. (2017); Lowe et al. (2017). See Section 3.5 and Table 3.

  • Agents modeling agents. Albrecht and Stone Albrecht and Stone (2018) presented a thorough survey in this topic and we have found many works that fit into this category in the MDRL setting, some taking inspiration from DRL He et al. (2016); Raileanu et al. (2018); Hong et al. (2018), and others from MAL Lanctot et al. (2017); Heinrich and Silver (2016); Foerster et al. (2018); Rabinowitz et al. (2018); Yang et al. (2018). Modeling agents is helpful not only to cooperate, but also for modeling opponents Lanctot et al. (2017); He et al. (2016); Hong et al. (2018); Heinrich and Silver (2016), inferring hidden goals Raileanu et al. (2018), and accounting for the learning behavior of other agents Foerster et al. (2018). See Section 3.6 and Table 4.

In the rest of this section we describe each category along with the summaries of related works.

Table 1: These papers analyze emergent behaviors in MDRL. A detailed description is given in Section 3.3.

| Work | Summary |
|---|---|
| Tampuu et al. (2017) | Train DQN agents to play Pong. |
| Leibo et al. (2017) | Train DQN agents to play sequential social dilemmas. |
| Bansal et al. (2018) | Train PPO agents in competitive MuJoCo scenarios. |
| Raghu et al. (2018) | Train PPO, A2C, and DQN agents in attacker-defender games. |
| Lazaridou et al. (2017) | Train agents represented with NN to learn a communication language. |
| Mordatch and Abbeel (2017) | Learn communication with an end-to-end differentiable model to train with backpropagation. |

Table 2: These papers propose algorithms for learning communication, together with a deep neural network architecture. A more detailed description is given in Section 3.4.

| Algorithm | Architecture | Summary |
|---|---|---|
| RIAL Foerster et al. (2016) | DRQN | Use a single network (parameter sharing) to train agents that take environmental and communication actions. |
| DIAL Foerster et al. (2016) | DRQN | Use gradient sharing during learning and communication actions during execution. |
| CommNet Sukhbaatar et al. (2016) | Multilayer NN | Use a continuous vector channel for communication on a single network. |
| BiCNet Peng et al. (2017) | Bidirectional RNN | Use the actor-critic paradigm where communication occurs in the latent space. |

Table 3: These papers aim to learn cooperation. We highlight the closest work (DRL or MAL) on which each is based. A more detailed description is given in Section 3.5.

| Algorithm | Based on | Summary |
|---|---|---|
| Fingerprints Foerster et al. (2017) | IQN | Deal with ER problems in MDRL by conditioning the value function on a fingerprint that disambiguates the age of the sampled data. |
| Lenient-DQN Palmer et al. (2018) | DQN | Achieve cooperation by leniency, optimism in the value function by forgiving suboptimal (low-rewards) actions. |
| Hysteretic-DRQN Omidshafiei et al. (2017) | DRQN | Achieve cooperation by using two learning rates, depending on the updated values together with multitask learning via policy distillation. |
| WDDQN Zheng et al. (2018) | DQN | Achieve cooperation by leniency, weighted double estimators, and a modified prioritized experience replay buffer. |
| FTW Jaderberg et al. (2018) | IMPALA | Agents act in a mixed environment (composed of teammates and opponents); it proposes a two-level architecture and population-based learning. |
| VDN Sunehag et al. (2018) | IQN | Decompose the team action-value function into pieces across agents, where the pieces can be easily added. |
| QMIX Rashid et al. (2018) | VDN | Decompose the team action-value function together with a mixing network that can recombine them. |
| COMA Foerster et al. (2017) | - | Use a centralized critic and a counter-factual advantage function based on solving the multiagent credit assignment. |
| MADDPG Lowe et al. (2017) | DDPG | Use an actor-critic approach where the critic is augmented with information from other agents, the actions of all agents. |

Table 4: These papers consider agents modeling agents. A more detailed description is given in Section 3.6.

| Algorithm | Summary |
|---|---|
| DRON He et al. (2016) | Have a network to infer the opponent behavior together with the standard DQN architecture. |
| DPIQN, DPIRQN Hong et al. (2018) | Learn policy features from raw observations that represent high-level opponent behaviors via auxiliary tasks. |
| SOM Raileanu et al. (2018) | Assume the reward function depends on a hidden goal of both agents and then use an agent’s own policy to infer the goal of the other agent. |
| NFSP Heinrich and Silver (2016) | Compute approximate Nash equilibria via self-play and two neural networks. |
| DCH Lanctot et al. (2017) | Policies can overfit to opponents: better compute approximate best responses to a mixture of policies. |
| LOLA Foerster et al. (2018) | Use a learning rule where the agent accounts for the parameter update of other agents in order to maximize its own reward. |
| ToMnet Rabinowitz et al. (2018) | Use an architecture for end-to-end learning and inference of diverse opponent types. |
| Deep Bayes-ToMoP Yang et al. (2018) | Best respond to opponents using Bayesian policy reuse, theory of mind, and deep networks. |

3.3 Emergent behaviors

A group of works have studied and analyzed the emergence of behaviors (e.g., cooperative or competitive) using independent DRL agents in different settings.

One of the earliest works by Tampuu et al. Tampuu et al. (2017) placed two independent DQN learning agents to play the Pong game. Their focus was to adapt the reward function for the learning agents, which resulted in either cooperative or competitive emergent behaviors.

Leibo et al. Leibo et al. (2017) also studied independent DQNs although in the context of sequential social dilemmas (a sequential social dilemma is a Markov game that satisfies certain social dilemma inequalities Leibo et al. (2017)). The focus of this work was to highlight that cooperative or competitive behaviors exist not only as discrete (atomic) actions, but are also temporally extended (over policies).

Recently, Bansal et al. Bansal et al. (2018) explored the emergent behaviors in competitive scenarios using the MuJoCo simulator Todorov et al. (2012). They trained independent learning agents with PPO Schulman et al. (2017) plus two main modifications to deal with the MAL nature of the problem. First, they used exploration rewards, which are dense rewards that allow agents to learn basic (non-competitive) behaviors; this type of reward is annealed through time, giving more weight to the environmental (competitive) reward. Second, they proposed opponent sampling, which maintains a pool of older versions of the opponent to sample from, in contrast to using the most recent version.
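Both modifications can be sketched generically. The linear annealing schedule, the uniform sampling over the whole pool, and all names below are assumptions for illustration; they are not taken from Bansal et al. and may differ from the schedule and sampling range used there.

```python
import random

def annealed_reward(dense_r, env_r, t, anneal_steps):
    """Blend a dense exploration reward with the environment reward.

    The exploration weight decays linearly from 1 to 0 over anneal_steps
    (an assumed schedule), after which only the environment reward counts.
    """
    alpha = max(0.0, 1.0 - t / anneal_steps)
    return alpha * dense_r + (1.0 - alpha) * env_r

class OpponentPool:
    """Keeps snapshots of earlier opponent parameters to train against."""

    def __init__(self):
        self.snapshots = []

    def add(self, params):
        self.snapshots.append(params)

    def sample(self, rng=random):
        # Uniform over all stored versions, rather than always the latest one.
        return rng.choice(self.snapshots)

pool = OpponentPool()
for version in range(5):
    pool.add({"version": version})  # hypothetical stand-in for policy weights

opponent = pool.sample(random.Random(0))
```

Training against a mixture of past opponents, rather than only the latest one, is meant to reduce the chance that a policy specializes against a single adversary.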

Raghu et al. Raghu et al. (2018) investigated how DRL algorithms (DQN, A2C, and PPO) performed in a family of two-player zero-sum games with tunable complexity, called Erdos-Selfridge-Spencer games. Their reasoning is threefold: (i) these games provide a parametrized family of environments where (ii) optimal behavior can be completely characterized, and (iii) support multiagent play. Their work showed that algorithms can exhibit wide variation in performance as the algorithms are tuned to the game’s difficulty.

Lazaridou et al. Lazaridou et al. (2017) proposed a framework for language learning that relies on multiagent communication. The agents, represented by (feed-forward) neural networks, need to develop an emergent language to solve a task. The task is formalized as a signaling game Fudenberg and Tirole (1991) in which two agents, a sender and a receiver, obtain a pair of images. The sender is told one of them is the target and is allowed to send a message (from a fixed vocabulary) to the receiver. Only when the receiver identifies the target image do both agents receive a positive reward. A key objective of this work was to analyze if the agent’s language could be human-interpretable, showing limited yet encouraging results.

Similarly, Mordatch and Abbeel Mordatch and Abbeel (2017) investigated the emergence of language with the difference that in their setting there were no explicit roles for the agents (i.e., sender, receiver). To learn, they proposed an end-to-end differentiable model of all agent and environment state dynamics over time to calculate the gradient of the return with backpropagation.

3.4 Learning communication

As we discussed in the previous section, one of the desired emergent behaviors of multiagent interaction is the emergence of communication Lazaridou et al. (2017); Mordatch and Abbeel (2017). This setting usually considers a set of cooperative agents in a partially observable environment where agents need to maximize their shared utility by means of communicating information.

RIAL (Reinforced Inter-Agent Learning) and DIAL (Differentiable Inter-Agent Learning) are two methods using deep networks to learn to communicate Foerster et al. (2016). Both methods use a neural net that outputs the agent’s Q values (as done in standard DRL algorithms) and a message to communicate to other agents in the next timestep. RIAL is based on DRQN and also uses the concept of parameter sharing, i.e., using a single network whose parameters are shared among all agents. In contrast, DIAL directly passes gradients via the communication channel during learning, and messages are discretized and mapped to the set of communication actions during execution.
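The difference between a differentiable channel during learning and a discretized one during execution can be sketched as follows. The noisy logistic used in training mode, the threshold at zero, and the names `channel_output` and `noise_std` are assumptions for illustration, not a statement of the exact unit used by Foerster et al. (2016).

```python
import math
import random


def channel_output(message, training, noise_std=1.0, rng=random):
    """Map a real-valued message to what the receiving agent observes."""
    if training:
        # Continuous and differentiable in `message`, so gradients could
        # flow from the receiver back to the sender through the channel.
        noisy = message + rng.gauss(0.0, noise_std)
        return 1.0 / (1.0 + math.exp(-noisy))
    # Execution: the message is discretized to one of two symbols.
    return 1.0 if message > 0 else 0.0
```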

While RIAL and DIAL used a discrete communication channel, CommNet Sukhbaatar et al. (2016) considered a continuous vector channel. Through this channel agents receive the summed transmissions of other agents. The authors assume full cooperation and train a single network for all the agents. There are two distinctive characteristics of CommNet from previous works: it allows multiple communication cycles at each timestep and a dynamic variation of agents at run time, i.e., agents come and go in the environment.
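A minimal sketch of one communication cycle, assuming each agent transmits a plain vector of floats: every agent receives the element-wise sum of the other agents' transmissions. The function name `communication_step` is hypothetical, and the learned layers that would process these sums are omitted.

```python
def communication_step(hidden_states):
    """For each agent, return the element-wise sum of the other agents' vectors."""
    if not hidden_states:
        return []
    dim = len(hidden_states[0])
    total = [sum(h[d] for h in hidden_states) for d in range(dim)]
    # Subtracting an agent's own vector from the total leaves the sum over the others.
    return [[total[d] - h[d] for d in range(dim)] for h in hidden_states]
```

Because the function only depends on the list it is given, the number of agents can change between calls, which mirrors the dynamic variation of agents mentioned above.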

In contrast to previous approaches, in Multiagent Bidirectionally Coordinated Network (BiCNet) Peng et al. (2017) communication takes place in the latent space (i.e., in the hidden layers). It also uses parameter sharing; however, it proposes bidirectional RNNs Schuster and Paliwal (1997) to model the actor and critic networks of their model. Note that in BiCNet agents do not explicitly share a message and thus it can be considered a method for learning cooperation.

3.5 Learning cooperation

Although explicit communication is a new emerging trend in MDRL, there has already been a large amount of work in MAL for cooperative settings that do not involve communication Panait and Luke (2005); Matignon et al. (2012). Therefore, it was a natural starting point for many recent MDRL works.

Foerster et al. Foerster et al. (2017) addressed the problem of cooperation with independent Q-learning agents, where the agents use the standard DQN architecture of neural networks and an experience replay buffer (see Section 2.2 and Figure 3). However, the ER buffer introduces problems, such as: the dynamics that generated the data in the ER no longer reflect the current dynamics, making the experience obsolete Foerster et al. (2017). Their solution is to add information to the experience tuple that can help to disambiguate the age of the sampled data from the replay memory. Two approaches were proposed: Multiagent Importance Sampling adds the probability of the joint action, and Multiagent Fingerprints adds the estimate of other agents’ policies. For the practical implementation, good results were obtained by using the training iteration number and exploration rate as the fingerprint.
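The fingerprint idea can be sketched as an experience tuple with one extra field. The field names and the choice to concatenate the fingerprint with the observation before it reaches the Q-network are assumptions made for illustration.

```python
from collections import namedtuple

# Standard experience fields plus a fingerprint that indicates the age of the sample.
Experience = namedtuple(
    "Experience", ["obs", "action", "reward", "next_obs", "fingerprint"]
)


def make_experience(obs, action, reward, next_obs, train_iter, epsilon):
    # Fingerprint = (training iteration number, exploration rate), as in the text.
    return Experience(obs, action, reward, next_obs, (train_iter, epsilon))


def q_network_input(exp):
    """Concatenate the observation with the fingerprint (assumed input layout)."""
    return list(exp.obs) + list(exp.fingerprint)
```

With this layout, two samples with identical observations but collected at different stages of training yield different network inputs, so they are no longer ambiguous.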

Lenient-DQN (LDQN) Palmer et al. (2018) extended the leniency concept to multiagent DRL, that is, introducing optimism in the value function update to foster cooperation Bloembergen et al. (2010). However, similar to other works Foerster et al. (2017), the authors experienced problems with the ER buffer and arrived at a similar solution: adding information to the experience tuple, in their case, the leniency value. When sampling from the ER buffer, this value is used to determine a leniency condition; if the condition is not met then the sample is ignored.

In a similar vein, Hysteretic Deep Recurrent Q-Networks (HDRQNs) Omidshafiei et al. (2017) were proposed for fostering cooperation among independent learners. The motivation is similar to LDQN, namely making an optimistic value update; however, their solution is different. Here, two learning rates are used (inspired by Hysteretic Q-learning Matignon et al. (2012)) and the ER buffer is modified into concurrent experience replay trajectories, which are composed of three dimensions: agent index, the episode, and the timestep. When training, the sampled traces have the same starting timesteps. Moreover, to improve on generalization over different tasks they make use of policy distillation Rusu et al. (2016) (see Section 4.1).
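The role of the two learning rates can be shown with a tabular sketch of a hysteretic update. The specific values of `alpha` and `beta` are arbitrary assumptions; the only requirement illustrated here is that the rate for negative temporal-difference errors is the smaller one.

```python
def hysteretic_update(q, target, alpha=0.1, beta=0.01):
    """Return the updated Q-value using two learning rates.

    Positive TD errors use the larger rate `alpha`; negative ones use the
    smaller rate `beta`, which makes the learner optimistic about outcomes
    that may have been caused by teammates' exploration.
    """
    delta = target - q
    rate = alpha if delta >= 0 else beta
    return q + rate * delta
```

In this sketch, a surprise of the same magnitude moves the estimate ten times further when it is good news than when it is bad news.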

Weighted Double Deep Q-Network (WDDQN) Zheng et al. (2018) is based on having double estimators Van Hasselt et al. (2016), i.e., two Q-networks, which intuitively balance between overestimation and underestimation. It also uses a lenient reward to be optimistic during the initial phase of coordination and proposes a scheduled replay strategy in which samples closer to the terminal state are given higher priority since those occur more rarely.

Figure 5: A schematic view of the architecture used in FTW (For the Win) Jaderberg et al. (2018): two unrolled RNNs operate at different time-scales; the idea is that the Slow RNN helps with long-term temporal correlations. Observations are the latent space output of some convolutional neural network to learn non-linear features. Feudal Networks Vezhnevets et al. (2017) is another work (in single-agent DRL) that also maintains a multi-time scale hierarchy where the slower network sets the goal and the faster network tries to achieve it.

While previous approaches were mostly inspired by how MAL algorithms could be extended to MDRL, other works take as base the results by single-agent DRL. One example is the For The Win (FTW) Jaderberg et al. (2018) agent which is based on the IMPALA architecture Espeholt et al. (2018) (see Section 2.2). Their setting is a game where two opposing teams compete to capture each other’s flags cap (2018). To deal with the MAL problem they propose two main additions: a hierarchical two-level representation with RNNs operating at different timescales, as depicted in Figure 5, and a population based training Jaderberg et al. (2017) where 30 agents were trained in parallel together with a stochastic matchmaking scheme that biases agents to be of similar skills Herbrich et al. (2007). An interesting result from this work is that the population-based training obtained better results than training via self-play, which was a standard concept in previous works Silver et al. (2016); Bowling et al. (2015).

FTW is based on value-based methods, but others have considered policy gradient methods (see Section 2.1) for MDRL. For example, Lowe et al. Lowe et al. (2017) noted that using standard policy gradient methods on multiagent environments yields high variance and performs poorly. Therefore, to overcome this issue they propose the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) Lowe et al. (2017) which is an extension of DDPG Lillicrap et al. (2016) with an actor-critic architecture where the critic is augmented with other agents’ actions, while the actor only has local information (turning the method into a centralized training with decentralized execution). In this case, the ER buffer records experiences of all agents. In contrast to the previous algorithms, MADDPG was tested in both cooperative and competitive scenarios.

Another approach based on policy gradients is the Counterfactual Multi-Agent Policy Gradients (COMA) Foerster et al. (2017). COMA focuses on the fully centralized setting and the multiagent credit assignment problem Tumer and Agogino (2007), i.e., how the agents should deduce their contributions when learning in a cooperative setting in the presence only of global rewards. Their proposal is to compute a counterfactual baseline, that is, marginalize out the action of the agent while keeping the rest of the other agents’ actions fixed. Then, an advantage function can be computed comparing the current value to the counterfactual.
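The counterfactual baseline can be sketched for a single agent with a discrete action set. The inputs are assumed to be already available: `q_values[a]` is the centralized value when the agent takes action `a` and all other agents' actions stay fixed, and `policy[a]` is the agent's probability of choosing `a`. The function name is hypothetical.

```python
def counterfactual_advantage(q_values, policy, taken_action):
    """Advantage of the taken action relative to a counterfactual baseline.

    The baseline marginalizes out this agent's action under its own policy
    while the other agents' actions are held fixed inside `q_values`.
    """
    baseline = sum(p * q for p, q in zip(policy, q_values))
    return q_values[taken_action] - baseline
```

By construction, the expected advantage under the agent's own policy is zero, so the baseline changes the variance of the gradient estimate without changing its expectation.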

On the one hand, fully centralized approaches (e.g., COMA) do not suffer from non-stationarity but have constrained scalability. On the other hand, independent learning agents are better suited to scale but suffer from non-stationarity issues. There are some hybrid approaches that learn a centralized but factored value function Oliehoek (2018). Value Decomposition Networks (VDNs) Sunehag et al. (2018) aim to decompose a team value function into an additive decomposition of the individual value functions. Similarly, QMIX Rashid et al. (2018) relies on the idea of factorizing, however, instead of sum, QMIX assumes a mixing network that combines the local values in a non-linear way, which can represent complex action-value functions.
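A small sketch of the additive case shows why a factored value function is convenient: when the team value is the sum of the local values, each agent can pick its own greedy action and the result is also the greedy joint action. The tabular local values and the function names are assumptions for illustration; QMIX would replace the sum with a learned mixing network, which is not shown here.

```python
import itertools


def vdn_joint_value(local_qs, joint_action):
    """Additive decomposition: the team value is the sum of the local values."""
    return sum(q[a] for q, a in zip(local_qs, joint_action))


def greedy_decentralized(local_qs):
    """Each agent picks the argmax of its own local values."""
    return tuple(max(range(len(q)), key=q.__getitem__) for q in local_qs)


def greedy_centralized(local_qs):
    """Brute-force argmax of the summed value over all joint actions."""
    joint_actions = itertools.product(*(range(len(q)) for q in local_qs))
    return max(joint_actions, key=lambda ja: vdn_joint_value(local_qs, ja))
```

The brute-force search grows exponentially with the number of agents, whereas the decentralized choice grows linearly, which is the scalability argument made above.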

3.6 Agents modeling agents

Figure 6: (a) Deep Policy Inference Q-Network: receives four frames as input. (b) Deep Recurrent Policy Inference Q-Network: receives one frame as input and has an LSTM layer instead of a fully connected layer (FC). Both approaches Hong et al. (2018) condition the value outputs on the policy features, which are also used to learn the opponent policy.

An important ability for agents to have is to reason about the behaviors of other agents by constructing models that make predictions about the modeled agents Albrecht and Stone (2018). An early work for modeling agents while using deep neural networks was the Deep Reinforcement Opponent Network (DRON) He et al. (2016). The idea is to have two networks: one which evaluates values and a second one that learns a representation of the opponent’s policy. Moreover, the authors proposed to have several expert networks to combine their predictions to get the estimated value, the idea being that each expert network captures one type of opponent strategy. DRON uses hand-crafted features to define the opponent network. In contrast, Deep Policy Inference Q-Network (DPIQN) and its recurrent version, DRPIQN Hong et al. (2018), learn policy features directly from raw observations of the other agents. The way to learn these policy features is by means of auxiliary tasks Jaderberg et al. (2017) (see Section 4.1) that provide additional learning goals; in this case, the auxiliary task is to learn the opponents’ policy. Then, the value function of the learning agent is conditioned on the policy features (see Figure 6), which aims to reduce the non-stationarity of the environment. The authors used an adaptive training procedure to adjust the attention (a weight on the loss function) to either emphasize learning the policy features (of the opponent) or the respective values of the agent. An advantage of these approaches is that modeling the agents can work for both opponents and teammates Hong et al. (2018).

In many previous works an opponent model is learned from observations. Self Other Modeling (SOM) Raileanu et al. (2018) proposed a different approach, that is, using the agent’s own policy as a means to predict the opponent’s actions. SOM aims to infer other agents’ goals by using two networks, one used for computing the agent’s policy and values, and a second one used to infer the opponent’s policy. The idea is that these networks have the same input parameters but with different values (the agent’s or the opponent’s). In contrast to previous approaches, SOM is not focused on learning the opponent strategy but rather on estimating the opponent’s goal (hidden state).

There is a long-standing history of combining game theory and MAL Shoham et al. (2007); Nowé et al. (2012); Bowling et al. (2015). From that context, some approaches were inspired by influential game theory approaches. Neural Fictitious Self-Play (NFSP) Heinrich and Silver (2016) builds on fictitious play Brown (1951) together with two deep networks to find approximate Nash equilibria, which is an efficient solution concept in certain types of games (e.g., two-player zero-sum games). One network learns an approximate best response to the historical behavior of other agents and the second one learns a model that averages over the agent’s behavior. The agent behaves using a mixture of both networks. A second example, which takes inspiration from behavioral game theory Camerer et al. (2004); Costa Gomes et al. (2001), is Deep Cognitive Hierarchies (DCHs) Lanctot et al. (2017). Their goal is to find approximate best responses to the meta-strategy (obtained by empirical game theoretic analysis Walsh et al. (2002)) of other players. Their reasoning is that computing a standard best response can over-fit to a specific agent, causing it to fail to generalize. Therefore, they propose to add an opponent/teammate regularization by means of approximately best responding to a mixture of policies.
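To make the classical building block concrete, the following sketch runs tabular fictitious play (not NFSP itself) on matching pennies, a two-player zero-sum game whose Nash equilibrium is to play each action with probability 0.5. The choice of game, the tie-breaking rule, and the function name are assumptions for illustration.

```python
def fictitious_play_matching_pennies(iterations):
    """Run fictitious play and return the empirical frequency of action 0
    for the (row, column) players."""
    # The row player wins when the actions match; the column player wins otherwise.
    row_counts = [0, 0]
    col_counts = [0, 0]
    for _ in range(iterations):
        # Each player best responds to the opponent's empirical action counts.
        row_action = 0 if col_counts[0] >= col_counts[1] else 1  # try to match
        col_action = 1 if row_counts[0] >= row_counts[1] else 0  # try to mismatch
        row_counts[row_action] += 1
        col_counts[col_action] += 1
    return row_counts[0] / iterations, col_counts[0] / iterations
```

Although each individual action is a deterministic best response, the averaged behavior approaches the mixed equilibrium; this is the quantity that the second (averaging) network in NFSP is described as learning.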

Previous approaches usually learned a model of the other agents as a way to predict their behavior. However, they do not explicitly account for anticipated learning of the other agents, which is the objective of Learning with Opponent-Learning Awareness (LOLA) Foerster et al. (2018). LOLA optimizes the expected return after the opponent updates its policy one step. Therefore, a LOLA agent directly shapes the policy updates of other agents to maximize its own reward. One of LOLA’s assumptions is having access to opponents’ policy parameters.

Theory of Mind Network (ToMnet) Rabinowitz et al. (2018) starts with a simple premise: when encountering a novel opponent, the agent should already have a strong and rich prior about how the opponent should behave. They propose an architecture composed of three networks: (i) a character network that learns from historical information, (ii) a mental state network that takes the character output and the recent trajectory, and (iii) the prediction network that takes the current state as well as the outputs of the other networks as its input. The output of the architecture is open for different problems but in general its goal is to predict the opponent’s next action.

Deep Bayes-ToMoP (Bayesian Theory of Mind Policy) Yang et al. (2018) is another algorithm that takes inspiration from theory of mind de Weerd et al. (2013). This algorithm is capable of best responding to changing stationary strategies of the opponent and to higher-level reasoning strategies (using theory of mind). It uses concepts from Bayesian policy reuse Rosman et al. (2016); Hernandez-Leal et al. (2016) and efficient tracking and detection of opponent strategies Hernandez-Leal et al. (2017); Hernandez-Leal and Kaisers (2017) together with deep networks.

4 Bridging MAL and MDRL

This section aims to provide directions to promote fruitful cooperations between sub-communities and reduce fractures in the MAL community. First, we present examples on how techniques from DRL and MAL can complement each other to solve problems in MDRL (see Section 4.1). Second, we outline lessons learned from the works analyzed in this survey (see Section 4.2). Third, we pose some open challenges and reflect on their relation with previous open questions in MAL Albrecht and Stone (2018) (see Section 4.3).

4.1 Examples of fruitful cooperation to solve MDRL problems

Here, we present 2 examples on how techniques from MAL can be used to solve problems in MDRL, and then 2 examples of DRL techniques extended naturally to MDRL.

Dealing with non-stationarity in independent learners

It is well known that using independent learners makes the environment non-stationary from the agent’s point of view Tuyls and Weiss (2012); Laurent et al. (2011). This can be solved in different ways Hernandez-Leal et al. (2017). One example is Hyper-Q Tesauro (2003) which is a MAL approach that proposes to account for the (values of mixed) strategies of other agents and include that information in the state representation, which effectively turns the learning problem into a stationary one. Note that in this way it is possible to even consider adaptive agents. Foerster et al. Foerster et al. (2016) make use of this insight to propose their fingerprint algorithm in an MDRL problem (see Section 3.5).

Multiagent credit assignment

In cooperative multiagent scenarios, it is common to use either local rewards, unique for each agent, or global rewards, which represent the entire group’s performance Agogino and Tumer (2008). However, local rewards are usually harder to obtain, therefore, it is common to rely only on the global ones. This raises the problem of credit assignment: how do a single agent’s actions contribute to a system that involves the actions of many agents Agogino and Tumer (2004). A solution that has proven successful in many scenarios is difference rewards Agogino and Tumer (2008); Devlin et al. (2014), which capture an agent’s contribution to the system’s global performance. COMA builds on this concept to propose an advantage function based on the contribution of the agent, which can be efficiently computed with deep neural networks Foerster et al. (2017) (see Section 3.5).
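A difference reward can be sketched as the global reward minus the global reward obtained when one agent's action is replaced by a default action. The toy coverage task, the use of `None` as the default action, and the function names are assumptions for illustration.

```python
def difference_reward(global_reward, joint_action, agent, default_action):
    """D_i = G(z) - G(z with agent i's action replaced by a default action)."""
    counterfactual = list(joint_action)
    counterfactual[agent] = default_action
    return global_reward(joint_action) - global_reward(counterfactual)


def coverage(joint_action):
    """Toy global reward: number of distinct targets covered (None = idle)."""
    return len({a for a in joint_action if a is not None})
```

For the joint action `[0, 0, 1]`, the first two agents cover the same target, so removing either of them does not change the global reward and their difference reward is 0, whereas the third agent receives 1. All three would receive the same global reward of 2, which hides these individual contributions.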

Multitask learning in MDRL

In the context of RL, multitask learning is an area that develops agents that can act in several related tasks rather than just in a single one Taylor and Stone (2009). Policy distillation Rusu et al. (2016) was proposed to extract DRL policies to train a much smaller NN and it also can be used to merge several task-specific policies into a single policy, i.e., for multitask learning. In the MDRL setting, HDRQNs Omidshafiei et al. (2017) successfully adapted policy distillation to obtain a more general multitask multiagent network (see Section 3.5).

Auxiliary tasks for MDRL

Jaderberg et al. Jaderberg et al. (2017) introduced the auxiliary task concept with the insight that (single-agent) environments contain a variety of possible training signals (e.g., pixel changes, network activations). These can be treated as pseudo-reward functions and the agent can still use standard DRL to learn the optimal policy for those. One could think of extending these auxiliary tasks to modeling other agents’ behaviors Mordatch and Abbeel (2017); this is one of the key ideas that DPIQN and DRPIQN Hong et al. (2018) proposed in MDRL settings (see Section 3.6).

4.2 Lessons learned

We have exemplified how DRL and MAL can complement each other for MDRL settings. Now, we outline general best practices learned from the works analyzed throughout this paper.

  • Experience replay buffer in MDRL. While some works removed the ER buffer in MDRL Foerster et al. (2016), it is an important component in many DRL and MDRL algorithms. However, using the standard buffer (i.e., keeping only the standard experience tuple) will probably fail. Adding information to the experience tuple that can help disambiguate the sample is the solution adopted in many works, whether a value-based method Foerster et al. (2017); Palmer et al. (2018); Omidshafiei et al. (2017); Zheng et al. (2018) or a policy gradient method Lowe et al. (2017). In this regard, it is an open question to consider new DRL ideas for the ER Andrychowicz et al. (2017); Schaul et al. (2016); Lipton et al. (2018) and how those would fare in an MDRL setting.

  • Centralized learning with decentralized execution. Many MAL works were either fully centralized or fully decentralized approaches. However, inspired by decentralized partially observable Markov decision processes (DEC-POMDPs) Oliehoek et al. (2016), in MDRL this new mixed paradigm has been commonly used Foerster et al. (2017); Palmer et al. (2018); Omidshafiei et al. (2017); Rashid et al. (2018); Lanctot et al. (2017); Foerster et al. (2017); Lowe et al. (2017). Note that during learning additional information can be used (e.g., state, action, rewards) and during execution this information is removed.

  • Parameter sharing. Another frequent component in many MDRL works is the idea of sharing parameters, i.e., training a single network in which agents share their weights. Note that, since agents could receive different observations (e.g., in partially observable scenarios), they can still behave differently Foerster et al. (2016); Sukhbaatar et al. (2016); Peng et al. (2017); Foerster et al. (2017); Sunehag et al. (2018); Rashid et al. (2018).

  • Recurrent networks. LSTM networks Hochreiter and Schmidhuber (1997); Greff et al. (2017) are widely used recurrent neural networks that have memory capability for sequential data, in contrast to feed-forward or convolutional neural networks. In single-agent DRL, DRQN Hausknecht and Stone (2015) initially proposed the idea of using recurrent networks in single-agent partially observable environments. Then, Feudal Networks Vezhnevets et al. (2017); Dayan and Hinton (1993) proposed multiple LSTM networks with different time-scales, i.e., the observation input schedule is different for each LSTM network, to create a temporal hierarchy so that it can better address the long-term credit assignment challenge for RL problems. Recently, the use of recurrent networks has been extended to MDRL Bansal et al. (2018); Foerster et al. (2016); Peng et al. (2017); Omidshafiei et al. (2017); Sunehag et al. (2018); Rashid et al. (2018); Raileanu et al. (2018); Hong et al. (2018); Rabinowitz et al. (2018); Yang et al. (2018), for example, in FTW Jaderberg et al. (2018), depicted in Figure 5.

  • Ensemble policies. Particularly when explicitly modeling opponents, models can over-fit to the behavior of other agents. In order to reduce this problem, a solution is to have a set of policies and learn from them or best respond to the mixture of them Lanctot et al. (2017); Lowe et al. (2017); He et al. (2016).
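The parameter sharing lesson from the list above can be illustrated with a minimal sketch in which a single set of weights is used by every agent. The linear Q-function and the function names are assumptions chosen to keep the example small; in the surveyed works the shared model is a deep network.

```python
def shared_q_values(weights, observation):
    """Linear Q-function: one row of weights per action, shared by every agent."""
    return [sum(w * o for w, o in zip(row, observation)) for row in weights]


def act(weights, observation):
    """Greedy action under the shared weights for one agent's observation."""
    q = shared_q_values(weights, observation)
    return max(range(len(q)), key=q.__getitem__)
```

Two agents that call `act` with the same `weights` but different observations can select different actions, which is why sharing weights does not force identical behavior.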

4.3 Open questions

Finally, here we present some open questions for MDRL.

  • On the challenge of sparse and delayed rewards.

    Recent MAL competitions and environments (e.g., Pommerman Resnick et al. (2018), Capture the flag cap (2018), MARLO mar (2018), Starcraft II Vinyals et al. (2017), and Dota 2 ope (2018)) have complex scenarios where many actions are taken before a reward signal is available. This is already a challenge for RL Sutton and Barto (1998); Kaelbling et al. (1996). In MDRL this is even more problematic since the agents not only need to learn basic behaviors (like in DRL), but also need to learn the strategic element (e.g., competitive/collaborative) embedded in the multiagent setting. To address this issue, recent MDRL approaches propose adding dense rewards at each step to allow the agents to learn basic motor skills and then decrease these dense rewards over time in favor of the environmental reward Bansal et al. (2018). Recent works like OpenAI Five ope (2018) use hand-crafted intermediate rewards to accelerate the learning, and FTW Jaderberg et al. (2018) lets the agents learn their internal rewards by a hierarchical two-tier optimization. In single-agent domains, RUDDER Arjona-Medina et al. (2018) has been recently proposed for such delayed sparse reward problems. RUDDER generates a new MDP with more intermediate rewards whose optimal solution is still an optimal solution to the original MDP. This is achieved by using LSTM networks to redistribute the original sparse reward to earlier state-action pairs and automatically provide reward shaping. The extension of RUDDER to multiagent domains is an open question that might yield more efficient learning agents.

  • On the role of self-play.

    Self-play is a cornerstone in MAL with impressive results Bowling and Veloso (2002); Hu and Wellman (2003); Bowling (2004); Conitzer and Sandholm (2006); Greenwald and Hall (2003). While notable results have also been shown in MDRL Heinrich and Silver (2016); Bowling et al. (2015), recent works have shown that plain self-play does not yield the best results. However, adding diversity, i.e., population-based or sampling-based methods, has shown good results Bansal et al. (2018); Jaderberg et al. (2018). A drawback of these solutions is the additional computational requirements, since they need either parallel training (more CPU computation) or more memory. Thus, it is still an open problem to work on computationally efficient implementations in MAL and MDRL (see (Albrecht and Stone, 2018, Section 5.5)).

  • On the challenge of the combinatorial nature of MDRL.

    Monte Carlo tree search (MCTS) Browne et al. (2012) has been the backbone of major breakthroughs such as AlphaGo Silver et al. (2016) and AlphaGo Zero Silver et al. (2017), which combined search and DRL. A recent work Vodopivec et al. (2017) has outlined how search and RL can be better combined for potentially new methods. However, for multiagent scenarios, there is an additional challenge of the exponential growth of all the agents’ action spaces for centralized methods Kartal et al. (2015). Amato and Oliehoek Amato and Oliehoek (2015) proposed to use factored value functions to exploit structure for multiagent settings. Search parallelization has also been employed for better scalability within multiagent scenarios Kartal et al. (2016); Best et al. (2018). Given more scalable planners, there is room for research in combining these techniques in MDRL settings.

5 Conclusions

Deep reinforcement learning has shown recent success on many fronts Mnih et al. (2015); Silver et al. (2016); Moravčík et al. (2017) and a natural next step is to test multiagent scenarios. However, learning in multiagent environments is fundamentally more difficult due to non-stationarity, the increase of dimensionality, and the credit-assignment problem, among other factors Stone and Veloso (2000); Busoniu et al. (2008); Hernandez-Leal et al. (2017); Bowling and Veloso (2002); Tumer and Agogino (2007); Wei et al. (2018).

While previous MAL surveys have tried to contextualize most of the MAL literature within the framework of game theory Shoham et al. (2007), there is still an acknowledgement that some works lie outside that framework. Stone Stone (2007) takes this insight as a starting point to consider that “there is no single correct multiagent learning algorithm — each problem must be considered individually.” Based on this reasoning, in this survey we have presented an overview on recent works in the emerging area of Multiagent Deep Reinforcement Learning (MDRL).

This work provides a comprehensive view of MDRL. First, we categorized recent works into four different topics: emergent behaviors, learning communication, learning cooperation, and agents modeling agents. Then, we exemplified how DRL and MAL can complement each other, we provided general lessons learned applicable to MDRL, and we reflected on open questions for MAL and MDRL.

While the number of works in DRL and MDRL is notable and represents important milestones for AI, at the same time we acknowledge there are also open questions in both (deep) single-agent learning Arulkumaran et al. (2017); Darwiche (2018); Henderson et al. (2018); Nagarajan et al. (2018); Machado et al. (2018); Torrado et al. (2018) and multiagent learning Hernandez-Leal et al. (2017); Albrecht and Stone (2018); Ortega and Legg (2018); Bono et al. (2018); Yang et al. (2018); Grover et al. (2018); Ling et al. (2018). In this article, we have provided an outline of a recent active research area of MDRL, and at the same time our aim was to motivate future research to take advantage of the ample and existing literature to avoid having scattered sub-communities with little interaction.


  • Stone and Veloso (2000) P. Stone, M. M. Veloso, Multiagent Systems - A Survey from a Machine Learning Perspective, Autonomous Robots 8 (2000) 345–383.
  • Shoham et al. (2007) Y. Shoham, R. Powers, T. Grenager, If multi-agent learning is the answer, what is the question?, Artificial Intelligence 171 (2007) 365–377.
  • Alonso et al. (2002) E. Alonso, M. D’inverno, D. Kudenko, M. Luck, J. Noble, Learning in multi-agent systems, Knowledge Engineering Review 16 (2002) 1–8.
  • Tuyls and Weiss (2012) K. Tuyls, G. Weiss, Multiagent learning: Basics, challenges, and prospects, AI Magazine 33 (2012) 41–52.
  • Busoniu et al. (2008) L. Busoniu, R. Babuska, B. De Schutter, A Comprehensive Survey of Multiagent Reinforcement Learning, IEEE Transactions on Systems, Man and Cybernetics, Part C (Applications and Reviews) 38 (2008) 156–172.
  • Nowé et al. (2012) A. Nowé, P. Vrancx, Y.-M. De Hauwere, Game theory and multi-agent reinforcement learning, in: Reinforcement Learning, Springer, 2012, pp. 441–470.
  • Panait and Luke (2005) L. Panait, S. Luke, Cooperative Multi-Agent Learning: The State of the Art, Autonomous Agents and Multi-Agent Systems 11 (2005).
  • Matignon et al. (2012) L. Matignon, G. J. Laurent, N. Le Fort-Piat, Independent reinforcement learners in cooperative Markov games: a survey regarding coordination problems, Knowledge Engineering Review 27 (2012) 1–31.
  • Bloembergen et al. (2015) D. Bloembergen, K. Tuyls, D. Hennes, M. Kaisers, Evolutionary Dynamics of Multi-Agent Learning: A Survey, Journal of Artificial Intelligence Research 53 (2015) 659–697.
  • Hernandez-Leal et al. (2017) P. Hernandez-Leal, M. Kaisers, T. Baarslag, E. Munoz de Cote, A Survey of Learning in Multiagent Environments - Dealing with Non-Stationarity (2017). arXiv:1707.09183.
  • Albrecht and Stone (2018) S. V. Albrecht, P. Stone, Autonomous agents modelling other agents: A comprehensive survey and open problems, Artificial Intelligence 258 (2018) 66–95.
  • Mnih et al. (2015) V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, D. Hassabis, Human-level control through deep reinforcement learning, Nature 518 (2015) 529–533.
  • Silver et al. (2016) D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, D. Hassabis, Mastering the game of Go with deep neural networks and tree search, Nature 529 (2016) 484–489.
  • Silver et al. (2017) D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al., Mastering the game of Go without human knowledge, Nature 550 (2017) 354.
  • Moravčík et al. (2017) M. Moravčík, M. Schmid, N. Burch, V. Lisý, D. Morrill, N. Bard, T. Davis, K. Waugh, M. Johanson, M. Bowling, DeepStack: Expert-level artificial intelligence in heads-up no-limit poker, Science 356 (2017) 508–513.
  • ope (2018) OpenAI Five, 2018. [Online; accessed 7-September-2018].
  • Sutton and Barto (1998) R. S. Sutton, A. G. Barto, Introduction to reinforcement learning, volume 135, MIT press Cambridge, 1998.
  • LeCun et al. (2015) Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (2015) 436.
  • Arulkumaran et al. (2017) K. Arulkumaran, M. P. Deisenroth, M. Brundage, A. A. Bharath, A Brief Survey of Deep Reinforcement Learning (2017). arXiv:1708.05866v2.
  • Lake et al. (2016) B. M. Lake, T. D. Ullman, J. Tenenbaum, S. Gershman, Building machines that learn and think like people, Behavioral and Brain Sciences 40 (2016).
  • Tamar et al. (2016) A. Tamar, S. Levine, P. Abbeel, Y. Wu, G. Thomas, Value Iteration Networks, NIPS (2016) 2154–2162.
  • Weiss (2013) G. Weiss (Ed.), Multiagent Systems (Intelligent Robotics and Autonomous Agents series), 2nd ed., MIT Press, Cambridge, MA, USA, 2013.
  • Agogino and Tumer (2004) A. K. Agogino, K. Tumer, Unifying Temporal and Structural Credit Assignment Problems, in: Proceedings of the Third International Joint Conference on Autonomous Agents and Multiagent Systems, 2004.
  • Wei and Luke (2016) E. Wei, S. Luke, Lenient Learning in Independent-Learner Stochastic Cooperative Games, Journal of Machine Learning Research (2016).
  • Darwiche (2018) A. Darwiche, Human-level intelligence or animal-like abilities?, Commun. ACM 61 (2018) 56–67.
  • Kaelbling et al. (1996) L. P. Kaelbling, M. L. Littman, A. W. Moore, Reinforcement learning: A survey, Journal of artificial intelligence research 4 (1996) 237–285.
  • Puterman (1994) M. L. Puterman, Markov decision processes: Discrete stochastic dynamic programming, John Wiley & Sons, Inc., 1994.
  • Bellman (1957) R. Bellman, A Markovian decision process, Journal of Mathematics and Mechanics 6 (1957) 679–684.
  • Watkins (1989) J. Watkins, Learning from delayed rewards, Ph.D. thesis, King’s College, Cambridge, UK, 1989.
  • Tsitsiklis (1994) J. Tsitsiklis, Asynchronous stochastic approximation and Q-learning, Machine Learning 16 (1994) 185–202.
  • Sutton et al. (2000) R. S. Sutton, D. A. McAllester, S. P. Singh, Y. Mansour, Policy Gradient Methods for Reinforcement Learning with Function Approximation, in: Advances in Neural Information Processing Systems, 2000.
  • Williams (1992) R. J. Williams, Simple statistical gradient-following algorithms for connectionist reinforcement learning, Machine learning 8 (1992) 229–256.
  • Konda and Tsitsiklis (2000) V. R. Konda, J. Tsitsiklis, Actor-critic algorithms, in: Advances in Neural Information Processing Systems, 2000.
  • Mnih et al. (2013) V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, M. Riedmiller, Playing Atari with Deep Reinforcement Learning (2013). arXiv:1312.5602v1.
  • Schulman et al. (2015) J. Schulman, S. Levine, P. Abbeel, M. I. Jordan, P. Moritz, Trust Region Policy Optimization, in: 32nd International Conference on Machine Learning, Lille, France, 2015.
  • Jaderberg et al. (2017) M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, K. Kavukcuoglu, Reinforcement Learning with Unsupervised Auxiliary Tasks, in: International Conference on Learning Representations, 2017.
  • Hausknecht and Stone (2015) M. Hausknecht, P. Stone, Deep Recurrent Q-Learning for Partially Observable MDPs, in: International Conference on Learning Representations, 2015.
  • Cassandra (1998) A. R. Cassandra, Exact and approximate algorithms for partially observable Markov decision processes, Ph.D. thesis, Computer Science Department, Brown University, 1998.
  • Hochreiter and Schmidhuber (1997) S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural computation 9 (1997) 1735–1780.
  • Mnih et al. (2016) V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, K. Kavukcuoglu, Asynchronous methods for deep reinforcement learning, in: International conference on machine learning, 2016, pp. 1928–1937.
  • Lillicrap et al. (2016) T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, D. Wierstra, Continuous control with deep reinforcement learning, in: International Conference on Learning Representations, 2016.
  • Ioffe and Szegedy (2015) S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, Proceedings of the 32nd International Conference on Machine Learning (2015) 448–456.
  • Espeholt et al. (2018) L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, et al., IMPALA: Scalable distributed Deep-RL with importance weighted actor-learner architectures, in: International Conference on Machine Learning, 2018.
  • Schulman et al. (2017) J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov, Proximal Policy Optimization Algorithms (2017). arXiv:1707.06347.
  • Heess et al. (2017) N. Heess, D. TB, S. Sriram, J. Lemmon, J. Merel, G. Wayne, Y. Tassa, T. Erez, Z. Wang, S. M. A. Eslami, M. A. Riedmiller, D. Silver, Emergence of Locomotion Behaviours in Rich Environments (2017). arXiv:1707.02286v2.
  • Tan (1993) M. Tan, Multi-Agent Reinforcement Learning: Independent vs. Cooperative Agents, in: Machine Learning Proceedings 1993 Proceedings of the Tenth International Conference, University of Massachusetts, Amherst, June 27–29, 1993, 1993, pp. 330–337.
  • Laurent et al. (2011) G. J. Laurent, L. Matignon, L. Fort-Piat, et al., The world of independent learners is not markovian, International Journal of Knowledge-based and Intelligent Engineering Systems 15 (2011) 55–64.
  • Littman (1994) M. L. Littman, Markov games as a framework for multi-agent reinforcement learning, in: Proceedings of the 11th International Conference on Machine Learning, New Brunswick, NJ, USA, 1994, pp. 157–163.
  • Bowling and Veloso (2002) M. Bowling, M. Veloso, Multiagent learning using a variable learning rate, Artificial Intelligence 136 (2002) 215–250.
  • Balduzzi et al. (2018) D. Balduzzi, S. Racaniere, J. Martens, J. Foerster, K. Tuyls, T. Graepel, The mechanics of n-player differentiable games (2018). arXiv:1802.05642.
  • Fulda and Ventura (2007) N. Fulda, D. Ventura, Predicting and Preventing Coordination Problems in Cooperative Q-learning Systems, in: Proceedings of the Twentieth International Joint Conference on Artificial Intelligence, Hyderabad, India, 2007, pp. 780–785.
  • mul (2017) Multiagent Learning, Foundations and Recent Trends, 2017. [Online; accessed 7-September-2018].
  • Tampuu et al. (2017) A. Tampuu, T. Matiisen, D. Kodelja, I. Kuzovkin, K. Korjus, J. Aru, J. Aru, R. Vicente, Multiagent cooperation and competition with deep reinforcement learning, PLOS ONE 12 (2017) e0172395.
  • Leibo et al. (2017) J. Z. Leibo, V. Zambaldi, M. Lanctot, J. Marecki, Multi-agent Reinforcement Learning in Sequential Social Dilemmas, in: Proceedings of the 16th Conference on Autonomous Agents and Multiagent Systems, Sao Paulo, 2017.
  • Raghu et al. (2018) M. Raghu, A. Irpan, J. Andreas, R. Kleinberg, Q. Le, J. Kleinberg, Can Deep Reinforcement Learning solve Erdos-Selfridge-Spencer Games?, in: Proceedings of the 35th International Conference on Machine Learning, 2018.
  • Lazaridou et al. (2017) A. Lazaridou, A. Peysakhovich, M. Baroni, Multi-Agent Cooperation and the Emergence of (Natural) Language, in: International Conference on Learning Representations, 2017.
  • Mordatch and Abbeel (2017) I. Mordatch, P. Abbeel, Emergence of grounded compositional language in multi-agent populations (2017). arXiv:1703.04908.
  • Bansal et al. (2018) T. Bansal, J. Pachocki, S. Sidor, I. Sutskever, I. Mordatch, Emergent Complexity via Multi-Agent Competition, in: International Conference on Machine Learning, 2018.
  • Foerster et al. (2016) J. N. Foerster, Y. M. Assael, N. De Freitas, S. Whiteson, Learning to communicate with deep multi-agent reinforcement learning, in: Advances in Neural Information Processing Systems, 2016, pp. 2145–2153.
  • Sukhbaatar et al. (2016) S. Sukhbaatar, A. Szlam, R. Fergus, Learning Multiagent Communication with Backpropagation, in: Advances in Neural Information Processing Systems, 2016, pp. 2244–2252.
  • Peng et al. (2017) P. Peng, Q. Yuan, Y. Wen, Y. Yang, Z. Tang, H. Long, J. Wang, Multiagent Bidirectionally-Coordinated Nets for Learning to Play StarCraft Combat Games (2017). arXiv:1703.10069.
  • Foerster et al. (2017) J. N. Foerster, N. Nardelli, G. Farquhar, T. Afouras, P. H. S. Torr, P. Kohli, S. Whiteson, Stabilising Experience Replay for Deep Multi-Agent Reinforcement Learning, in: International Conference on Machine Learning, 2017.
  • Palmer et al. (2018) G. Palmer, K. Tuyls, D. Bloembergen, R. Savani, Lenient Multi-Agent Deep Reinforcement Learning, in: International Conference on Autonomous Agents and Multiagent Systems, 2018.
  • Omidshafiei et al. (2017) S. Omidshafiei, J. Pazis, C. Amato, J. P. How, J. Vian, Deep Decentralized Multi-task Multi-Agent Reinforcement Learning under Partial Observability, in: Proceedings of the 34th International Conference on Machine Learning, Sydney, 2017.
  • Zheng et al. (2018) Y. Zheng, J. Hao, Z. Zhang, Weighted double deep multiagent reinforcement learning in stochastic cooperative environments (2018). arXiv:1802.08534.
  • Jaderberg et al. (2018) M. Jaderberg, W. M. Czarnecki, I. Dunning, L. Marris, G. Lever, A. G. Castañeda, C. Beattie, N. C. Rabinowitz, A. S. Morcos, A. Ruderman, N. Sonnerat, T. Green, L. Deason, J. Z. Leibo, D. Silver, D. Hassabis, K. Kavukcuoglu, T. Graepel, Human-level performance in first-person multiplayer games with population-based deep reinforcement learning (2018).
  • Sunehag et al. (2018) P. Sunehag, G. Lever, A. Gruslys, W. M. Czarnecki, V. F. Zambaldi, M. Jaderberg, M. Lanctot, N. Sonnerat, J. Z. Leibo, K. Tuyls, T. Graepel, Value-Decomposition Networks For Cooperative Multi-Agent Learning Based On Team Reward, in: Proceedings of 17th International Conference on Autonomous Agents and Multiagent Systems, Stockholm, Sweden, 2018.
  • Rashid et al. (2018) T. Rashid, M. Samvelyan, C. S. de Witt, G. Farquhar, J. N. Foerster, S. Whiteson, QMIX - Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning, in: International Conference on Machine Learning, 2018.
  • Lanctot et al. (2017) M. Lanctot, V. F. Zambaldi, A. Gruslys, A. Lazaridou, K. Tuyls, J. Pérolat, D. Silver, T. Graepel, A Unified Game-Theoretic Approach to Multiagent Reinforcement Learning, in: Advances in Neural Information Processing Systems, 2017.
  • Foerster et al. (2017) J. N. Foerster, G. Farquhar, T. Afouras, N. Nardelli, S. Whiteson, Counterfactual Multi-Agent Policy Gradients, in: 32nd AAAI Conference on Artificial Intelligence, 2017.
  • Lowe et al. (2017) R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, I. Mordatch, Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments, in: Advances in Neural Information Processing Systems, 2017, pp. 6379–6390.
  • He et al. (2016) H. He, J. Boyd-Graber, K. Kwok, H. Daume, Opponent modeling in deep reinforcement learning, in: 33rd International Conference on Machine Learning, 2016, pp. 2675–2684.
  • Raileanu et al. (2018) R. Raileanu, E. Denton, A. Szlam, R. Fergus, Modeling Others using Oneself in Multi-Agent Reinforcement Learning, in: International Conference on Machine Learning, 2018.
  • Hong et al. (2018) Z.-W. Hong, S.-Y. Su, T.-Y. Shann, Y.-H. Chang, C.-Y. Lee, A Deep Policy Inference Q-Network for Multi-Agent Systems, in: International Conference on Autonomous Agents and Multiagent Systems, 2018.
  • Heinrich and Silver (2016) J. Heinrich, D. Silver, Deep Reinforcement Learning from Self-Play in Imperfect-Information Games (2016). arXiv:1603.01121.
  • Foerster et al. (2018) J. N. Foerster, R. Y. Chen, M. Al-Shedivat, S. Whiteson, P. Abbeel, I. Mordatch, Learning with Opponent-Learning Awareness, in: Proceedings of 17th International Conference on Autonomous Agents and Multiagent Systems, Stockholm, Sweden, 2018.
  • Rabinowitz et al. (2018) N. C. Rabinowitz, F. Perbet, H. F. Song, C. Zhang, S. M. A. Eslami, M. Botvinick, Machine Theory of Mind, in: International Conference on Machine Learning, Stockholm, Sweden, 2018.
  • Yang et al. (2018) T. Yang, Z. Meng, J. Hao, C. Zhang, Y. Zheng, Bayes-ToMoP: A Fast Detection and Best Response Algorithm Towards Sophisticated Opponents (2018). arXiv:1809.04240.
  • Todorov et al. (2012) E. Todorov, T. Erez, Y. Tassa, MuJoCo - A physics engine for model-based control, Intelligent Robots and Systems (2012).
  • Fudenberg and Tirole (1991) D. Fudenberg, J. Tirole, Game Theory, The MIT Press, 1991.
  • Schuster and Paliwal (1997) M. Schuster, K. K. Paliwal, Bidirectional recurrent neural networks, IEEE Transactions on Signal Processing 45 (1997) 2673–2681.
  • Bloembergen et al. (2010) D. Bloembergen, M. Kaisers, K. Tuyls, Lenient frequency adjusted Q-learning, in: Proceedings of the 22nd Belgian/Netherlands Artificial Intelligence Conference, 2010.
  • Rusu et al. (2016) A. A. Rusu, S. G. Colmenarejo, C. Gulcehre, G. Desjardins, J. Kirkpatrick, R. Pascanu, V. Mnih, K. Kavukcuoglu, R. Hadsell, Policy Distillation, in: International Conference on Learning Representations, 2016.
  • Van Hasselt et al. (2016) H. Van Hasselt, A. Guez, D. Silver, Deep Reinforcement Learning with Double Q-Learning, in: AAAI, Phoenix, AZ, 2016.
  • Vezhnevets et al. (2017) A. S. Vezhnevets, S. Osindero, T. Schaul, N. Heess, M. Jaderberg, D. Silver, K. Kavukcuoglu, FeUdal Networks for Hierarchical Reinforcement Learning, International Conference On Machine Learning (2017).
  • cap (2018) Capture the Flag: the emergence of complex cooperative agents, 2018. [Online; accessed 7-September-2018].
  • Jaderberg et al. (2017) M. Jaderberg, V. Dalibard, S. Osindero, W. M. Czarnecki, J. Donahue, A. Razavi, O. Vinyals, T. Green, I. Dunning, K. Simonyan, et al., Population based training of neural networks (2017). arXiv:1711.09846.
  • Herbrich et al. (2007) R. Herbrich, T. Minka, T. Graepel, TrueSkill™: a Bayesian skill rating system, in: Advances in neural information processing systems, 2007, pp. 569–576.
  • Bowling et al. (2015) M. Bowling, N. Burch, M. Johanson, O. Tammelin, Heads-up limit hold’em poker is solved, Science 347 (2015) 145–149.
  • Tumer and Agogino (2007) K. Tumer, A. Agogino, Distributed agent-based air traffic flow management, in: Proceedings of the 6th International Conference on Autonomous Agents and Multiagent Systems, Honolulu, Hawaii, 2007.
  • Oliehoek (2018) F. A. Oliehoek, Interactive Learning and Decision Making - Foundations, Insights & Challenges, International Joint Conference on Artificial Intelligence (2018).
  • Brown (1951) G. W. Brown, Iterative solution of games by fictitious play, Activity analysis of production and allocation 13 (1951) 374–376.
  • Camerer et al. (2004) C. F. Camerer, T.-H. Ho, J.-K. Chong, A cognitive hierarchy model of games, The Quarterly Journal of Economics 119 (2004) 861.
  • Costa Gomes et al. (2001) M. Costa Gomes, V. P. Crawford, B. Broseta, Cognition and Behavior in Normal–Form Games: An Experimental Study, Econometrica 69 (2001) 1193–1235.
  • Walsh et al. (2002) W. E. Walsh, R. Das, G. Tesauro, J. O. Kephart, Analyzing complex strategic interactions in multi-agent systems, AAAI-02 Workshop on Game-Theoretic and Decision-Theoretic Agents (2002) 109–118.
  • de Weerd et al. (2013) H. de Weerd, R. Verbrugge, B. Verheij, How much does it help to know what she knows you know? An agent-based simulation study, Artificial Intelligence 199-200 (2013) 67–92.
  • Rosman et al. (2016) B. Rosman, M. Hawasly, S. Ramamoorthy, Bayesian Policy Reuse, Machine Learning 104 (2016) 99–127.
  • Hernandez-Leal et al. (2016) P. Hernandez-Leal, M. E. Taylor, B. Rosman, L. E. Sucar, E. Munoz de Cote, Identifying and Tracking Switching, Non-stationary Opponents: a Bayesian Approach, in: Multiagent Interaction without Prior Coordination Workshop at AAAI, Phoenix, AZ, USA, 2016.
  • Hernandez-Leal et al. (2017) P. Hernandez-Leal, Y. Zhan, M. E. Taylor, L. E. Sucar, E. Munoz de Cote, Efficiently detecting switches against non-stationary opponents, Autonomous Agents and Multi-Agent Systems 31 (2017) 767–789.
  • Hernandez-Leal and Kaisers (2017) P. Hernandez-Leal, M. Kaisers, Towards a Fast Detection of Opponents in Repeated Stochastic Games, in: G. Sukthankar, J. A. Rodriguez-Aguilar (Eds.), Autonomous Agents and Multiagent Systems: AAMAS 2017 Workshops, Best Papers, Sao Paulo, Brazil, May 8-12, 2017, Revised Selected Papers, 2017, pp. 239–257.
  • Tesauro (2003) G. Tesauro, Extending Q-learning to general adaptive multi-agent systems, in: Advances in Neural Information Processing Systems, Vancouver, Canada, 2003, pp. 871–878.
  • Agogino and Tumer (2008) A. K. Agogino, K. Tumer, Analyzing and visualizing multiagent rewards in dynamic and stochastic domains, Autonomous Agents and Multi-Agent Systems (2008).
  • Devlin et al. (2014) S. Devlin, L. M. Yliniemi, D. Kudenko, K. Tumer, Potential-based difference rewards for multiagent reinforcement learning, in: 13th International Conference on Autonomous Agents and Multiagent Systems, AAMAS 2014, Paris, France, 2014.
  • Taylor and Stone (2009) M. E. Taylor, P. Stone, Transfer learning for reinforcement learning domains: A survey, The Journal of Machine Learning Research 10 (2009) 1633–1685.
  • Andrychowicz et al. (2017) M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, P. Abbeel, W. Zaremba, Hindsight experience replay, in: Advances in Neural Information Processing Systems, 2017.
  • Schaul et al. (2016) T. Schaul, J. Quan, I. Antonoglou, D. Silver, Prioritized Experience Replay, in: International Conference on Learning Representations, 2016.
  • Lipton et al. (2018) Z. C. Lipton, K. Azizzadenesheli, A. Kumar, L. Li, J. Gao, L. Deng, Combating Reinforcement Learning’s Sisyphean Curse with Intrinsic Fear (2018). arXiv:1611.01211v8.
  • Oliehoek et al. (2016) F. A. Oliehoek, C. Amato, et al., A concise introduction to decentralized POMDPs, Springer, 2016.
  • Greff et al. (2017) K. Greff, R. K. Srivastava, J. Koutnik, B. R. Steunebrink, J. Schmidhuber, LSTM: A Search Space Odyssey, IEEE Transactions on Neural Networks and Learning Systems 28 (2017) 2222–2232.
  • Dayan and Hinton (1993) P. Dayan, G. E. Hinton, Feudal reinforcement learning, in: Advances in neural information processing systems, 1993, pp. 271–278.
  • Yang et al. (2018) Y. Yang, J. Hao, M. Sun, Z. Wang, C. Fan, G. Strbac, Recurrent Deep Multiagent Q-Learning for Autonomous Brokers in Smart Grid, in: Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 2018.
  • Resnick et al. (2018) C. Resnick, W. Eldridge, D. Ha, D. Britz, J. Foerster, J. Togelius, K. Cho, J. Bruna, Pommerman: A Multi-Agent Playground (2018). arXiv:1809.07124.
  • mar (2018) Multi-Agent Reinforcement Learning in Minecraft, 2018. [Online; accessed 7-September-2018].
  • Vinyals et al. (2017) O. Vinyals, T. Ewalds, S. Bartunov, P. Georgiev, A. S. Vezhnevets, M. Yeo, A. Makhzani, H. Küttler, J. Agapiou, J. Schrittwieser, J. Quan, S. Gaffney, S. Petersen, K. Simonyan, T. Schaul, H. van Hasselt, D. Silver, T. Lillicrap, K. Calderone, P. Keet, A. Brunasso, D. Lawrence, A. Ekermo, J. Repp, R. Tsing, StarCraft II: A New Challenge for Reinforcement Learning (2017). arXiv:1708.04782v1.
  • Arjona-Medina et al. (2018) J. A. Arjona-Medina, M. Gillhofer, M. Widrich, T. Unterthiner, S. Hochreiter, Rudder: Return decomposition for delayed rewards (2018). arXiv:1806.07857.
  • Hu and Wellman (2003) J. Hu, M. P. Wellman, Nash Q-learning for general-sum stochastic games, Journal of Machine Learning Research 4 (2003) 1039–1069.
  • Bowling (2004) M. Bowling, Convergence and no-regret in multiagent learning, in: Advances in Neural Information Processing Systems, Vancouver, Canada, 2004, pp. 209–216.
  • Conitzer and Sandholm (2006) V. Conitzer, T. Sandholm, AWESOME: A general multiagent learning algorithm that converges in self-play and learns a best response against stationary opponents, Machine Learning 67 (2006) 23–43.
  • Greenwald and Hall (2003) A. Greenwald, K. Hall, Correlated Q-learning, in: Proceedings of the 20th International Conference on Machine Learning, Washington, DC, USA, 2003, pp. 242–249.
  • Browne et al. (2012) C. B. Browne, E. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, S. Colton, A survey of Monte Carlo tree search methods, IEEE Transactions on Computational Intelligence and AI in games 4 (2012) 1–43.
  • Vodopivec et al. (2017) T. Vodopivec, S. Samothrakis, B. Ster, On Monte Carlo tree search and reinforcement learning, Journal of Artificial Intelligence Research 60 (2017) 881–936.
  • Kartal et al. (2015) B. Kartal, J. Godoy, I. Karamouzas, S. J. Guy, Stochastic tree search with useful cycles for patrolling problems, in: Robotics and Automation (ICRA), 2015 IEEE International Conference on, IEEE, 2015, pp. 1289–1294.
  • Amato and Oliehoek (2015) C. Amato, F. A. Oliehoek, Scalable Planning and Learning for Multiagent POMDPs, in: AAAI, 2015, pp. 1995–2002.
  • Kartal et al. (2016) B. Kartal, E. Nunes, J. Godoy, M. Gini, Monte Carlo tree search with branch and bound for multi-robot task allocation, in: The IJCAI-16 Workshop on Autonomous Mobile Service Robots, 2016.
  • Best et al. (2018) G. Best, O. M. Cliff, T. Patten, R. R. Mettu, R. Fitch, Dec-MCTS: Decentralized planning for multi-robot active perception, The International Journal of Robotics Research (2018) 0278364918755924.
  • Wei et al. (2018) E. Wei, D. Wicke, D. Freelan, S. Luke, Multiagent Soft Q-Learning (2018). arXiv:1804.09817.
  • Stone (2007) P. Stone, Multiagent learning is not the answer. It is the question, Artificial Intelligence 171 (2007) 402–405.
  • Henderson et al. (2018) P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, D. Meger, Deep Reinforcement Learning That Matters, in: 32nd AAAI Conference on Artificial Intelligence, 2018.
  • Nagarajan et al. (2018) P. Nagarajan, G. Warnell, P. Stone, Deterministic implementations for reproducibility in deep reinforcement learning (2018). arXiv:1809.05676.
  • Machado et al. (2018) M. C. Machado, M. G. Bellemare, E. Talvitie, J. Veness, M. Hausknecht, M. Bowling, Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents, Journal of Artificial Intelligence Research 61 (2018) 523–562.
  • Torrado et al. (2018) R. R. Torrado, P. Bontrager, J. Togelius, J. Liu, D. Perez-Liebana, Deep Reinforcement Learning for General Video Game AI (2018). arXiv:1806.02448.
  • Ortega and Legg (2018) P. A. Ortega, S. Legg, Modeling friends and foes (2018). arXiv:1807.00196.
  • Bono et al. (2018) G. Bono, J. S. Dibangoye, L. Matignon, F. Pereyron, O. Simonin, Cooperative multi-agent policy gradient, in: European Conference on Machine Learning, 2018.
  • Yang et al. (2018) Y. Yang, R. Luo, M. Li, M. Zhou, W. Zhang, J. Wang, Mean field multi-agent reinforcement learning, in: Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 2018.
  • Grover et al. (2018) A. Grover, M. Al-Shedivat, J. K. Gupta, Y. Burda, H. Edwards, Learning Policy Representations in Multiagent Systems, in: International Conference on Machine Learning, 2018.
  • Ling et al. (2018) C. K. Ling, F. Fang, J. Z. Kolter, What game are we playing? End-to-end learning in normal and extensive form games, in: Twenty-Seventh International Joint Conference on Artificial Intelligence, 2018.