Deep Multi-Agent Reinforcement Learning for Decentralized Continuous Cooperative Control

03/14/2020 ∙ by Christian Schroeder de Witt, et al.

Deep multi-agent reinforcement learning (MARL) holds the promise of automating many real-world cooperative robotic manipulation and transportation tasks. Nevertheless, decentralised cooperative robotic control has received less attention from the deep reinforcement learning community than single-agent robotics and multi-agent games with discrete actions. To address this gap, this paper introduces Multi-Agent Mujoco, an easily extensible multi-agent benchmark suite for robotic control in continuous action spaces. The benchmark tasks are diverse and admit easily configurable partially observable settings. Inspired by the success of single-agent continuous value-based algorithms in robotic control, we also introduce COMIX, a novel extension to a common discrete action multi-agent Q-learning algorithm. We show that COMIX significantly outperforms the state-of-the-art MADDPG on a partially observable variant of a popular particle environment and matches or surpasses it on Multi-Agent Mujoco. Thanks to this new benchmark suite and method, we can now pose an interesting question: what is the key to performance in such settings, the use of value-based methods instead of policy gradients, or the factorisation of the joint Q-function? To answer this question, we propose a second new method, FacMADDPG, which factors MADDPG's critic. Experimental results on Multi-Agent Mujoco suggest that factorisation is the key to performance.




1 Introduction

While reinforcement learning (RL) has shown promise in learning optimal control policies for a variety of single-agent robot control problems, ranging from idealised multi-joint simulations (Todorov et al., 2012; Gu et al., 2016; Haarnoja et al., 2018) to complex grasping control problems (Kalashnikov et al., 2018; Andrychowicz et al., 2020), many real-world robot control tasks naturally involve multiple decentralised collaborating agents. Cooperative manipulation tasks arise in autonomous aerial construction (Augugliaro et al., 2013, 2014), industrial manufacturing (Caccavale and Uchiyama, 2008), and agricultural robotics (Shamshiri et al., 2018) and have so far received comparatively little attention from the deep RL community.

Cooperative robotic tasks present a number of challenges compared to many conventional multi-agent tasks. For example, unlike in established multi-agent benchmarks such as StarCraft II (Vinyals et al., 2017; Samvelyan et al., 2019), robotic actuators are usually continuous, so learning algorithms must scale to large continuous joint action spaces. Furthermore, such tasks are often partially observable, which arises from varying sensory equipment, including limited fields of view, together with communication constraints due to latency, power, or environmental limitations (Ong et al., 2009). In fact, many such applications require fully decentralised policies for safety reasons, as communication cannot be guaranteed under all circumstances (Takadama et al., 2003).

Even when execution must be decentralised, deep reinforcement learning policies are typically trained in a centralised fashion in a simulator or laboratory. The framework of Centralised Training with Decentralised Execution (CTDE) (Oliehoek et al., 2008; Kraemer and Banerjee, 2016) allows policy training to exploit extra information that is not available during execution in order to accelerate learning (Lowe et al., 2017; Foerster et al., 2018; Rashid et al., 2018).

Although some multi-agent benchmark environments for continuous control exist (Leibo et al., 2017; Liu et al., 2019), few environments specialise in cooperative control and even fewer model partial observability. Moreover, existing benchmarks, like the popular Multi-agent Particle Environment (Lowe et al., 2017), are not complex enough to meaningfully compare methods intended for robotic control. After performing a comprehensive search, we found no benchmark that fits our requirements, so we introduce Multi-Agent Mujoco, a diverse, publicly available, partially observable cooperative robotics simulation that effectively captures the nature of cooperative robotic manipulation tasks.

Starting from the popular fully observable single-agent robotic control suite Mujoco (Todorov et al., 2012) included with OpenAI Gym (Brockman et al., 2016), we decompose single robots into individual segments controlled by different agents. We introduce partial observability during execution by allowing the user to limit the observation distance within the model graph formed by joints and limbs, as well as allowing fine-grained control over which segment attributes may be observed at different distance levels. Multi-Agent Mujoco is available as open-source software.

While there is a diverse portfolio of multi-agent algorithms for cooperative tasks with discrete action spaces (Foerster et al., 2018; Rashid et al., 2018; Schroeder de Witt et al., 2019), decentralised continuous control algorithms have been largely limited to deep deterministic policy gradient approaches (Lowe et al., 2017). Single-agent Q-learning approaches to continuous control exist, but the greedy action maximisation they involve usually requires strong constraints on the functional form of the Q-values (Gu et al., 2016; Amos et al., 2017) or an approximate maximisation procedure based on search heuristics (Kalashnikov et al., 2018). Neither approach scales well when joint action spaces are large and Q-values are poorly approximated by constrained functions, as might be expected in cooperative multi-agent robotic tasks.

To this end, we introduce a novel Q-learning algorithm, COMIX, that employs a decentralisable joint action-value function with a per-agent factorisation (Rashid et al., 2018). This allows the application of the cross-entropy method (Kalashnikov et al., 2018) for greedy action maximisation on a per-agent basis, circumventing scaling issues related to large joint action spaces. Importantly, we find that COMIX significantly outperforms the state-of-the-art MADDPG in a partially observable variant of a Continuous Predator-Prey toy environment (Lowe et al., 2017). We then benchmark both on Multi-Agent Mujoco, a more realistic decentralisable cooperative robotic control setting. We find that COMIX outperforms MADDPG in two of the three scenarios tested and performs similarly to MADDPG on the third. These results suggest that continuous Q-learning is a compelling alternative to deterministic policy gradients for decentralised cooperative multi-agent tasks.

Thanks to this new benchmark suite and method, we can now pose an interesting question: what is the key to performance in such settings, the use of value-based methods instead of policy gradients, or the factorisation of the joint Q-function? To answer this question, we introduce Factored MADDPG (FacMADDPG), a novel variant of MADDPG in which the centralised critic is factored into individual critic networks, similarly to COMIX. We find that, interestingly, FacMADDPG performs similarly to COMIX in the Predator-Prey toy environment, as well as on a single Multi-Agent Mujoco environment. This suggests that it is indeed the value-function factorisation that is key to performance in tasks such as these.

2 Background

We consider a fully cooperative multi-agent task in which a team of cooperative agents chooses sequential actions in a stochastic, partially observable environment. It can be modeled as a decentralised partially observable Markov decision process (Dec-POMDP; Oliehoek et al., 2016), defined by a tuple ⟨N, S, ρ, {U^a}, P, r, {Z^a}, O, γ⟩. Here N = {1, …, n} denotes the set of agents and S describes the discrete or continuous state of the environment. The initial state s_0 ∈ S is drawn from distribution ρ, and at each time step t, all agents simultaneously choose discrete or continuous actions u_t^a ∈ U^a, yielding the joint action u_t = (u_t^1, …, u_t^n). After executing the joint action u_t in state s_t, the next state s_{t+1} is drawn from transition kernel P(s_{t+1} | s_t, u_t) and the collaborative reward r_t = r(s_t, u_t) is returned to the team.

In a Dec-POMDP, the true state of the environment cannot be directly observed by the agents. Each agent a draws an individual observation z_t^a ∈ Z^a from the observation kernel O(s_t, a). The history of an agent’s observations and actions is denoted τ_t^a = (z_0^a, u_0^a, …, z_t^a), and the set of all agents’ histories τ_t = (τ_t^1, …, τ_t^n). Agent a chooses its actions with a decentralised policy π^a(u^a | τ^a) based only on its individual history. The collaborating team of agents aims to learn a joint policy π = (π^1, …, π^n) that maximises their expected discounted return, E[Σ_t γ^t r_t]. This joint policy induces a joint action-value function, which estimates the expected discounted return when the agents take joint action u_t with histories τ_t in state s_t and then follow the joint policy π:

Q^π(s_t, τ_t, u_t) = E[ Σ_{k=0}^∞ γ^k r_{t+k} | s_t, τ_t, u_t ],

where γ ∈ [0, 1) is a discount factor.
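For concreteness, the Dec-POMDP tuple can be represented as a bundle of sampling functions. The following Python sketch is purely illustrative; the field names and the toy two-agent game are ours, not part of the paper or benchmark:

```python
from typing import Callable, NamedTuple, Sequence

class DecPOMDP(NamedTuple):
    """A Dec-POMDP as a bundle of sampling functions (illustrative sketch)."""
    n_agents: int
    initial_state: Callable[[], object]               # draw s_0 from rho
    transition: Callable[[object, Sequence], object]  # draw s' from P(. | s, u)
    reward: Callable[[object, Sequence], float]       # shared team reward r(s, u)
    observe: Callable[[object, int], object]          # observation O(s, a) for agent a
    gamma: float                                      # discount factor

# Trivial two-agent example: the state is an integer counter, the team
# reward is the sum of the joint action, and each agent observes the
# state together with its own index.
game = DecPOMDP(
    n_agents=2,
    initial_state=lambda: 0,
    transition=lambda s, u: s + 1,
    reward=lambda s, u: float(sum(u)),
    observe=lambda s, a: (s, a),
    gamma=0.99,
)

s = game.initial_state()
r = game.reward(s, (1.0, 2.0))        # shared by the whole team
s_next = game.transition(s, (1.0, 2.0))
```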

2.1 Deep Q-Learning

Deep Q-Network (DQN; Mnih et al., 2015) uses a deep neural network to estimate the action-value function, Q(s, u; θ), where θ are the parameters of the network. For the sake of simplicity, we restrict ourselves here to feed-forward networks, which condition on the last observations z_t rather than the entire agent histories τ_t. The network parameters θ are trained by gradient descent on the mean squared regression loss:

L(θ) = E_{(s, u, r, s′) ∼ B} [ ( r + γ max_{u′} Q(s′, u′; θ⁻) − Q(s, u; θ) )² ],   (2)

where the expectation is estimated with transitions (s, u, r, s′) sampled from an experience replay buffer B (Lin, 1992b). The use of a replay buffer reduces correlations in the observation sequence. To further stabilise learning, θ⁻ denotes the parameters of a target network that are only periodically copied from the most recent θ.
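The TD target and regression loss above can be computed as in the following NumPy sketch, where small arrays stand in for network outputs (all shapes and values are illustrative):

```python
import numpy as np

def td_targets(rewards, next_q_values, dones, gamma=0.99):
    """Bootstrapped targets y = r + gamma * max_u' Q(s', u'; theta^-)."""
    # next_q_values come from the *target* network, one row per transition
    max_next = next_q_values.max(axis=1)
    return rewards + gamma * (1.0 - dones) * max_next

def dqn_loss(q_values, actions, targets):
    """Mean squared TD error on the Q-values of the actions actually taken."""
    chosen = q_values[np.arange(len(actions)), actions]
    return np.mean((targets - chosen) ** 2)

# Toy batch of three transitions with two discrete actions
q      = np.array([[1.0, 2.0], [0.5, 0.0], [0.0, 1.0]])  # Q(s, .; theta)
next_q = np.array([[0.0, 1.0], [2.0, 0.0], [0.0, 0.0]])  # Q(s', .; theta^-)
r      = np.array([1.0, 0.0, 1.0])
done   = np.array([0.0, 0.0, 1.0])
a      = np.array([0, 1, 1])

y = td_targets(r, next_q, done, gamma=0.5)  # -> [1.5, 1.0, 1.0]
loss = dqn_loss(q, a, y)
```

Note that terminal transitions (done = 1) drop the bootstrap term, as in standard DQN implementations.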

2.2 Centralised Training with Decentralised Execution

A simple and natural approach to solving Dec-POMDPs is to let each agent learn an individual action-value function independently, as in independent Q-learning (IQL; Tan, 1993). IQL serves as a surprisingly strong benchmark in both cooperative and competitive MARL tasks with discrete actions (Tampuu et al., 2017). However, IQL has no convergence guarantees since, as agents independently explore and update their policies, the environment becomes nonstationary from each agent’s viewpoint. An alternative solution is to employ centralised training with decentralised execution (CTDE; Kraemer and Banerjee, 2016). CTDE allows the learning algorithm to access all local action-observation histories τ and the global state s, as well as share gradients and parameters, but each agent’s executed policy can condition only on its own action-observation history τ^a.

2.3 VDN and QMIX

Value decomposition networks (VDN; Sunehag et al., 2018) and QMIX (Rashid et al., 2018) are two representative examples of value function factorisation (Koller and Parr, 1999) that aim to efficiently learn a centralised but factored action-value function. Both work on cooperative MARL tasks with discrete actions, using CTDE. To ensure consistency between the centralised and decentralised policies, VDN factors the joint action-value function Q_tot into a sum of individual action-value functions Q_a,¹ one for each agent a, that condition only on individual action-observation histories:

Q_tot(τ, u) = Σ_{a=1}^{n} Q_a(τ^a, u^a).

By contrast, QMIX represents the joint action-value function as a monotonic function of individual action-value functions. The main insight is that, to extract decentralised policies that are consistent with their centralised counterparts, it suffices to constrain Q_tot to be monotonic in each Q_a: ∂Q_tot/∂Q_a ≥ 0, ∀a. Thus, in QMIX, the joint action-value function is represented as:

Q_tot(τ, u, s) = g( Q_1(τ^1, u^1), …, Q_n(τ^n, u^n); s ),

where g is a mixing network that takes as input the agent network outputs and mixes them monotonically, producing the values of Q_tot. Monotonicity can be guaranteed by non-negative mixing weights. These weights are generated by separate hypernetworks (Ha et al., 2016), which condition on the full state s. This allows the mixing network to learn different monotonic mixing weights in each state.

¹ Strictly speaking, each Q_a is a utility function, since by itself it does not estimate an expected return. We refer to Q_a as an action-value function for simplicity.
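The monotonic mixing step can be sketched as follows; a single linear hypernetwork layer stands in for the full architecture, and all shapes and weights are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def monotonic_mix(agent_qs, state, w_hyper, b_hyper):
    """Mix per-agent utilities into Q_tot with state-conditioned,
    non-negative weights, guaranteeing dQ_tot/dQ_a >= 0."""
    # Hypernetwork: linear maps from the state to the mixing parameters.
    w = np.abs(state @ w_hyper)  # absolute value enforces non-negative weights
    b = state @ b_hyper          # the bias needs no sign constraint
    return agent_qs @ w + b

n_agents, state_dim = 3, 4
w_hyper = rng.normal(size=(state_dim, n_agents))
b_hyper = rng.normal(size=(state_dim,))
state = rng.normal(size=(state_dim,))
qs = np.array([1.0, -0.5, 2.0])

q_tot = monotonic_mix(qs, state, w_hyper, b_hyper)
# Raising any individual utility can never lower Q_tot:
q_tot_higher = monotonic_mix(qs + np.array([0.5, 0.0, 0.0]), state, w_hyper, b_hyper)
```

The final line illustrates the monotonicity property that makes per-agent greedy action selection consistent with the joint maximum.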

In both methods, the loss function is analogous to the standard DQN loss of Eq. (2), with Q replaced by Q_tot. During execution, each agent a selects actions greedily with respect to its own Q_a.

2.4 MADDPG

Multi-agent deep deterministic policy gradient (MADDPG; Lowe et al., 2017) is an actor-critic method that works in both cooperative and competitive MARL tasks with discrete or continuous action spaces. MADDPG was originally designed for the more general case of partially observable stochastic games (Kuhn, 1953). Here we discuss a version specific to Dec-POMDPs and consider continuous actions. We assume each agent a has a deterministic policy μ_a(o_a; θ_a), parameterised by θ_a, with μ = (μ_1, …, μ_n). MADDPG learns a centralised critic Q_a^μ(s, u_1, …, u_n; φ_a) for each agent a that conditions on the full state and the joint actions of all agents. The policy gradient for θ_a is:

∇_{θ_a} J(θ_a) = E_{s, o ∼ B} [ ∇_{θ_a} μ_a(o_a) ∇_{u_a} Q_a^μ(s, u_1, …, u_n) |_{u_b = μ_b(o_b)} ],

where s and o are sampled from a replay buffer B. The centralised action-value function of each agent is trained by minimising the following loss:

L(φ_a) = E_{(s, o, u, r, s′, o′) ∼ B} [ ( r + γ Q_a^μ(s′, μ_1(o′_1; θ_1⁻), …, μ_n(o′_n; θ_n⁻); φ_a⁻) − Q_a^μ(s, u; φ_a) )² ],

where transitions are sampled from a replay buffer B (Lin, 1992a) and θ⁻ and φ_a⁻ are target-network parameters.
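The bootstrapped critic target can be sketched as below, with toy deterministic policies and a toy critic standing in for the learned target networks (all functions and weights are illustrative):

```python
import numpy as np

def critic_target(reward, next_state, next_obs, target_policies, target_critic, gamma=0.99):
    """MADDPG-style target: y = r + gamma * Q(s', mu'_1(o'_1), ..., mu'_n(o'_n))."""
    # Each target policy maps its agent's next observation to a next action;
    # the centralised critic then scores the full joint action.
    next_actions = np.concatenate([mu(o) for mu, o in zip(target_policies, next_obs)])
    return reward + gamma * target_critic(next_state, next_actions)

# Toy target policies (one 1-D action each) and a toy linear critic
mu1 = lambda o: np.tanh(o)
mu2 = lambda o: np.tanh(-o)
critic = lambda s, u: float(s.sum() + u.sum())

s_next = np.array([0.1, 0.2])
obs_next = [np.array([0.0]), np.array([0.0])]
y = critic_target(1.0, s_next, obs_next, [mu1, mu2], critic, gamma=0.5)
```

Note that, unlike DQN, no maximisation is required: the target policies supply the next joint action directly.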

3 Multi-Agent Mujoco

Figure 1: Agent partitionings for Multi-Agent Mujoco environments: A) 2-Agent Swimmer [2x1], B) 3-Agent Hopper [3x1], C) 2-Agent HalfCheetah [2x3], D) 6-Agent HalfCheetah [6x1], E) 2-Agent Humanoid and 2-Agent HumanoidStandup (each [1x9,1x8]), F) 2-Agent Walker [2x3], G) 2-Agent Reacher [2x1], H) 2-Agent Ant [2x4], I) 2-Agent Ant Diag [2x4], J) 4-Agent Ant [4x2]. Colours indicate agent partitionings. Each joint corresponds to a single controllable motor. Split partitions indicate shared body segments. Square brackets indicate [(number of agents) x (joints per agent)]. Joint IDs are in order of definition in the corresponding OpenAI Gym XML asset files (Brockman et al., 2016). Global joints indicate degrees of freedom of the center of mass of the composite robotic agent.

Multi-Agent Mujoco is a novel benchmark for decentralised cooperative continuous multi-agent robotic control. Starting from the popular fully observable single-agent robotic Mujoco (Todorov et al., 2012) control suite included with OpenAI Gym (Brockman et al., 2016), we create novel scenarios in which multiple agents have to solve a task cooperatively. This is achieved by first representing a given single robotic agent as a body graph, where vertices (joints) are connected by edges (body segments), as shown in Figure 1. We then partition the body graph into disjoint sub-graphs, one for each agent, each of which contains one or more joints that can be controlled.

Figure 2: Observations by distance for 3-Agent Hopper (as seen from agent 1). 1) corresponds to joints and body parts at zero graph distance from agent 1, 2) to joints and body parts observable at unit graph distance, and 3) to those at a graph distance of two.

Hence, each agent’s action space in Multi-Agent Mujoco is given by the joint action space over all motors controllable by that agent. For example, the agent corresponding to the green partition in 2-Agent HalfCheetah (Figure 1, C) consists of three joints (joint IDs 1, 2 and 3) and four adjacent body segments. Each joint has action space [-1, 1], hence the joint action space of this agent is [-1, 1]³.

For each agent a, observations are constructed in a two-stage process. First, we infer which body segments and joints are observable by agent a. Each agent can always observe all joints within its own sub-graph. A configurable parameter k determines the maximum graph distance to the agent’s sub-graph at which joints are observable. Body segments directly attached to observable joints are themselves observable. The agent observation is then given by a fixed-order concatenation of the representation vector of each observable graph element. Depending on the configuration, representation vectors may include attributes such as position, velocity, and external body forces.

Restricting the observation distance k and limiting the set of observable element categories imposes partial observability. However, task goals remain unchanged from the single-agent variants (see Table 1 in the Appendix), except that the goals must be reached collaboratively by multiple agents: we simply repurpose the original single-agent reward signal as a team reward signal.
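The first stage of observation construction, determining which joints lie within graph distance k of an agent's sub-graph, amounts to a breadth-first traversal of the body graph. A sketch (the chain graph and agent partition below are hypothetical, not taken from a specific Multi-Agent Mujoco task):

```python
from collections import deque

def observable_joints(adjacency, agent_joints, k):
    """Joints an agent can observe: its own joints plus all joints within
    graph distance k of its sub-graph."""
    frontier = deque((j, 0) for j in agent_joints)
    seen = dict.fromkeys(agent_joints, 0)  # joint -> distance to sub-graph
    while frontier:
        joint, dist = frontier.popleft()
        if dist == k:          # do not expand beyond the observation distance
            continue
        for neighbour in adjacency[joint]:
            if neighbour not in seen:
                seen[neighbour] = dist + 1
                frontier.append((neighbour, dist + 1))
    return sorted(seen)

# Hypothetical 6-joint chain; the agent owns joints 0 and 1
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
within_one = observable_joints(chain, [0, 1], k=1)  # own joints plus joint 2
```

Body segments attached to any returned joint would then be marked observable in a second pass, as described above.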



4 COMIX

Q-learning has shown considerable success in multi-agent settings with discrete action spaces (Rashid et al., 2018). However, performing greedy action selection in Q-learning requires evaluating max_u Q(s, u), where Q is the joint state-action value function. In discrete action spaces, this operation can be performed efficiently through enumeration, unless the action space is extremely large. In continuous action spaces, however, enumeration is impossible. Hence, existing continuous Q-learning approaches in single-agent settings either impose constraints on the form of the Q-value to make maximisation easy (Gu et al., 2016; Amos et al., 2017), at the expense of estimation bias, or perform greedy action selection only approximately (Kalashnikov et al., 2018). Neither approach scales easily to the large joint action spaces inherent to multi-agent settings, as 1) the joint action space grows exponentially in the number of agents, and 2) training to select maximal actions becomes impractical when there are many agents.

This highlights the importance of learning a centralised but factored Q_tot. To factor large joint action spaces efficiently in a decentralisable fashion, COMIX models the joint state-action value function as Q_tot(τ, u, s) = g(Q_1(τ^1, u^1), …, Q_n(τ^n, u^n); s), where the Q_a are per-agent utility functions used for greedy action selection. Similarly to QMIX (Rashid et al., 2018), COMIX imposes a monotonicity constraint on g to keep joint action selection consistent with action selection from the individual utility functions.

COMIX performs greedy action selection with respect to the utility function Q_a of each agent a using the cross-entropy method (CEM; De Boer et al., 2005), a sampling-based derivative-free heuristic search method that has been successfully used to find approximate maxima of nonconvex Q-networks in a number of single-agent robotic control tasks (Kalashnikov et al., 2018). The centralised but factored Q_tot allows us to use CEM to sample actions for each agent independently, with the individual utility function Q_a guiding the selection of maximal actions. We rely on CEM instead of other continuous Q-learning approaches (Gu et al., 2016; Amos et al., 2017) because of its empirical success (see Section 6).

CEM iteratively draws a batch of N random samples from a candidate distribution D_i, e.g., a Gaussian, at each iteration i. The best M samples are then used to fit a new Gaussian distribution D_{i+1}, and this process repeats for a fixed number of iterations. For COMIX, we use a CEM hyperparameter configuration similar to Qt-Opt (Kalashnikov et al., 2018).² Gaussian distributions are initialised with a fixed mean and standard deviation. Algorithm 2 outlines the full CEM process for COMIX.

² We empirically find a small number of iterations to suffice.

  function COMIX
     for each training episode do
        for each environment step do
           …
        end for
     end for
  end function
Algorithm 1 Algorithmic description of COMIX.
The function CEM is defined in Appendix 10.
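The CEM loop can be sketched as follows for a single agent's utility function. The sample, elite, and iteration counts shown are illustrative defaults, not the configuration used in the paper, and the toy quadratic utility is ours:

```python
import numpy as np

def cem_argmax(q_fn, action_dim, n_samples=64, n_elites=6, n_iters=2, rng=None):
    """Approximate argmax_u Q(u) over the box [-1, 1]^action_dim via the
    cross-entropy method: sample, keep the elites, refit a Gaussian, repeat."""
    rng = rng or np.random.default_rng(0)
    mu, sigma = np.zeros(action_dim), np.ones(action_dim)
    for _ in range(n_iters):
        samples = np.clip(rng.normal(mu, sigma, size=(n_samples, action_dim)), -1, 1)
        scores = np.array([q_fn(u) for u in samples])
        elites = samples[np.argsort(scores)[-n_elites:]]   # best M samples
        mu = elites.mean(axis=0)                           # refit the Gaussian
        sigma = elites.std(axis=0) + 1e-6                  # avoid collapse to zero
    return mu

# Toy utility with its maximum at u = (0.5, -0.5)
q = lambda u: -np.sum((u - np.array([0.5, -0.5])) ** 2)
u_star = cem_argmax(q, action_dim=2, n_iters=10)
```

Because Q_tot is monotonic in each Q_a, running this search independently per agent recovers a greedy joint action without enumerating the joint action space.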


4.1 FacMADDPG

Learning a centralised critic that conditions on a large joint agent observation space can be difficult (Iqbal and Sha, 2019). We introduce FacMADDPG, a novel variant of MADDPG with an agent-specific factorisation that facilitates the learning of a centralised critic in Dec-POMDPs. In FacMADDPG, all agents share a single centralised critic that is factored as

Q^μ(s, τ, u) = g( Q_1(τ^1, u^1), …, Q_n(τ^n, u^n); s ),

where g is a function represented by a monotonic mixing network. Although monotonicity of g is no longer required, as the critic is not used for greedy action selection, FacMADDPG nonetheless imposes monotonicity on g in order to keep the factorisation comparable to the one employed by COMIX. We find that FacMADDPG significantly outperforms an ablation without monotonicity constraints (see Appendix 10).

5 Experimental Setup

Figure 3: Continuous Predator-Prey. Left: Top-down view of toroidal plane, with predators (red), prey (green) and obstacles (grey). Right: Illustration of the prey’s avoidance heuristic. Observation radii of both agents and prey are indicated.
Figure 4: Mean episode return on (a) Continuous Predator-Prey, (b) 2-Agent HalfCheetah [2x3], (c) 2-Agent Walker [2x3], and (d) 3-Agent Hopper [3x1]. The mean across 10 seeds is plotted and the 95% confidence interval is shown shaded.

Partially Observable Continuous Predator-Prey.

The mixed simple tag environment (Figure 3), introduced by Lowe et al. (2017), is a variant of the classic predator-prey game. Three slower cooperating circular agents (red), each with a continuous two-dimensional movement action space, must catch a faster circular prey (green) on a randomly generated two-dimensional toroidal plane with two large landmarks blocking the way.

To obtain a purely cooperative environment, we replace the prey’s policy with a hard-coded heuristic that, at each time step, moves the prey to the sampled position with the largest distance to the closest predator. If one of the cooperative agents collides with the prey, a team reward is emitted; otherwise, no reward is given. In the original simple tag environment, each agent can observe the relative positions of the other two agents, the relative position and velocity of the prey, and the relative positions of the landmarks. This means each agent’s private observation provides an almost complete representation of the true state of the environment.
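The prey's evasion heuristic can be sketched as below. For simplicity the sketch samples candidate positions on a circle around the prey and ignores the toroidal wrap-around; the candidate count and step size are illustrative:

```python
import numpy as np

def prey_position(prey_pos, predator_positions, n_candidates=32, step=0.1, rng=None):
    """Evasion heuristic sketch: sample candidate positions around the prey
    and move to the one whose distance to the closest predator is largest."""
    rng = rng or np.random.default_rng(0)
    angles = rng.uniform(0.0, 2.0 * np.pi, n_candidates)
    candidates = prey_pos + step * np.stack([np.cos(angles), np.sin(angles)], axis=1)
    # For each candidate, the distance to its *closest* predator
    dist_to_closest = np.array([
        min(np.linalg.norm(c - p) for p in predator_positions) for c in candidates
    ])
    return candidates[dist_to_closest.argmax()]

prey = np.array([0.0, 0.0])
predators = [np.array([1.0, 0.0])]        # a single predator to the right
best = prey_position(prey, predators)     # the prey should head left
```

A full implementation would additionally wrap candidate positions around the torus before measuring distances.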

To introduce partial observability to the environment, we add an agent view radius, which prevents agents from receiving information about other entities (including all landmarks, the other two agents, and the prey) that are out of range. Specifically, we set the view radius such that the agents can only observe other agents roughly 60% of the time. We open-source the full set of multi-agent particle environments with added partial observability.

Multi-Agent Mujoco.

All Multi-Agent Mujoco environments use the default configuration, in which each agent observes both the velocity and position of its own body parts, but only positions at graph distances greater than zero. Maximum observation distances are set per task for 2-Agent HalfCheetah, 3-Agent Hopper, and 2-Agent Walker. Default team reward signals are used (see Table 1 in the Appendix).


Ablations.

We also introduce a number of novel ablations in order to study diverse aspects of our method in isolation: 1) COVDN: we factor the joint action-value function Q_tot into a sum of individual action-value functions as in VDN, and use CEM to learn Q_a for each agent a; 2) COMIX-NAF: we factor Q_tot assuming mixing monotonicity as in QMIX, and add quadratic function constraints on each Q_a based on Normalized Advantage Functions (NAF; Gu et al., 2016); and 3) COVDN-NAF: we represent Q_tot assuming additive mixing as in VDN, and add quadratic function constraints on Q_a based on NAF.

Evaluation Procedure.

We evaluate each method’s performance using the following procedure: for each run of a method, we pause training every fixed number of timesteps (2000 for Continuous Predator-Prey and 4000 for Multi-Agent Mujoco) and run 10 independent episodes with each agent performing greedy decentralised action selection. The mean of these 10 episode returns is then used to evaluate the performance of the learned policies. See Appendix 9 for further experimental details.
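The evaluation procedure amounts to the following loop; the stub environment here is ours, purely to make the sketch runnable:

```python
def evaluate(env_reset, env_step, greedy_actions, n_episodes=10, max_steps=25):
    """Run n_episodes with greedy decentralised action selection and
    return the mean episode return (the paper's evaluation statistic)."""
    returns = []
    for _ in range(n_episodes):
        obs, total = env_reset(), 0.0
        for _ in range(max_steps):
            obs, reward, done = env_step(greedy_actions(obs))
            total += reward
            if done:
                break
        returns.append(total)
    return sum(returns) / len(returns)

# Stub environment: reward 1 per step, episodes end after 3 steps
class _Stub:
    def __init__(self):
        self.t = 0
    def reset(self):
        self.t = 0
        return 0
    def step(self, action):
        self.t += 1
        return 0, 1.0, self.t >= 3

env = _Stub()
mean_ret = evaluate(env.reset, env.step, lambda obs: None)
```

In the actual experiments, `greedy_actions` would select each agent's action by maximising its utility Q_a (via CEM for COMIX, or the deterministic policy for MADDPG).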

6 Empirical Results

Figure 4 shows that COMIX significantly outperforms MADDPG on Continuous Predator-Prey (Figure 4a), both in terms of absolute performance and learning speed. On Multi-Agent Mujoco, COMIX outperforms MADDPG in absolute terms on the 2-Agent Walker scenario (Figure 4c), which MADDPG cannot learn effectively. On 2-Agent HalfCheetah (Figure 4b), COMIX outperforms MADDPG in absolute terms and has lower asymptotic variance. On 3-Agent Hopper (Figure 4d), COMIX performs similarly to MADDPG but has significantly lower variance across seeds.

Despite its ability to represent a richer form of coordination with a functionally unconstrained critic (Son et al., 2019), in our experiments MADDPG is unable to outperform COMIX, which uses a monotonically constrained mixing network.

We hypothesise that this is because MADDPG’s critic conditions directly on the joint observations and actions of all agents, whereas COMIX represents the optimal joint action-value function using a monotonic mixing function of per-agent utilities. Early in training, MADDPG’s critic estimator may thus be more prone than COMIX to picking up non-trivial but suboptimal coordination patterns. Such local minima might be hard to subsequently escape.

By contrast, the monotonicity constraint on COMIX’s mixing network may smooth the optimisation landscape, allowing COMIX to avoid suboptimal local minima more efficiently than MADDPG. In other words, COMIX’s network architecture imposes a suitable prior that captures the forms of additive-monotonic coordination required to solve these tasks.

These results raise an interesting question: what is the key to performance in such settings, the use of value-based methods instead of policy gradients, or the choice of factorisation of the joint Q-value function? Previous work on this question (Bescuca, 2019) is inconclusive due to a confounder: the policy gradient methods studied (COMA and Central-V; Foerster et al., 2018) are on-policy, while the corresponding Q-learning method, QMIX (Rashid et al., 2018), is off-policy with experience replay (Lin, 1992a).

To address this question, we evaluate the performance of FacMADDPG, which factors MADDPG’s critic in the same manner as COMIX’s joint Q-value function. We find that FacMADDPG performs similarly to COMIX on both Continuous Predator-Prey and all three Multi-Agent Mujoco tasks (see Figures 4b-4d). As both COMIX and FacMADDPG are off-policy algorithms, this suggests that the factorisation of the joint Q-value function is the key to performance in these decentralised continuous cooperative multi-agent tasks.

Figure 5: Mean episode return on Continuous Predator-Prey comparing COMIX and ablations. The mean across 10 seeds is plotted and the 95% confidence interval is shown shaded.


We find that COVDN performs drastically worse than both COMIX and MADDPG across all Multi-Agent Mujoco tasks (Figures 4a-4d), demonstrating that non-linear mixing of agent utilities and conditioning on state information are necessary for competitive performance in such tasks.

Figure 5 shows that, compared to its ablations COVDN-NAF and COMIX-NAF, COMIX is noticeably more stable on Continuous Predator-Prey. COVDN-NAF suffers sharp drops in performance late in training, while COMIX-NAF converges significantly more slowly than COMIX and exhibits much higher variance across seeds. This demonstrates that greedy action selection based on CEM heuristic search is, in practice, both more stable and more performant than exact maximisation under functional constraints.

Finally, we evaluate an ablation of FacMADDPG without the monotonicity constraint. We find that this method does not learn at all on Continuous Predator-Prey and performs significantly worse than both COMIX and MADDPG on 2-Agent HalfCheetah (see Figure 7 in Appendix 10). This suggests that the monotonicity constraint itself is important for performance.

7 Related Work

While several MARL benchmarks with continuous action spaces have been released, few are simultaneously diverse, fully cooperative, decentralisable and admit partial observability. The Multi-Agent Particle suite (Lowe et al., 2017) features a few decentralisable tasks in a fully observable planar point mass toy environment. Presumably due to its focus on real-world robotic control, RoboCup Soccer Simulation (Kitano et al., 1997; Stone and Sutton, 2001; Riedmiller et al., 2009) does not currently feature an easily configurable software interface for MARL, nor suitable AI-controlled benchmark opponents. Liu et al. (2019) introduce MuJoCo Soccer Environment, a multi-agent soccer environment with continuous simulated physics that cannot be used in a purely cooperative setting and does not admit partial observability.

Among existing environments that are similar to, but not as diverse as, Multi-Agent Mujoco, Wang et al. (2018) introduce two decomposed Mujoco environments, Centipede and Snakes, the latter of which is similar to Multi-Agent Mujoco’s 2-Agent Swimmer. Ackermann et al. (2019) evaluate on a single environment similar to a particular configuration of 2-Agent Ant, but do not consider tasks across different numbers of agents and Mujoco scenarios.

A number of multi-agent variants of deep deterministic policy gradients (Lillicrap et al., 2015; Lowe et al., 2017) have been proposed for MARL in continuous action spaces. MADDPG-M (Kilinc and Montana, 2018) uses communication channels to overcome observation noise in partially observable settings; by contrast, we consider fully decentralised settings without communication. R-MADDPG (Wang et al., 2019) equips MADDPG with recurrent policies and critics in a partially observable setting with communication. As we are primarily interested in the relative performance of policy gradients and continuous Q-learning approaches, we employ feed-forward network architectures to avoid the complexities of recurrent network training.

NerveNet (Wang et al., 2018) achieves policy transfer across robotic agents with different numbers of repeated units. Unlike COMIX, NerveNet is not fully decentralisable as it requires explicit communication channels. Iqbal and Sha (2019) introduce MAAC, a variant of MADDPG for stochastic games, in which the centralised critics employ an attention mechanism on top of agent-specific observation embeddings. Unlike FacMADDPG, MAAC explicitly addresses settings where agents receive both individual and team rewards.

Besides VDN and QMIX, QTRAN (Son et al., 2019) allows for arbitrary utility function mixings by introducing auxiliary losses that align utility function maxima with maxima of the joint Q-function. Despite being more expressive, QTRAN does not scale well to complex environments, such as StarCraft II (Samvelyan et al., 2019; Bohmer et al., 2019), and may not generalise well to continuous action spaces due to the point-wise nature of its auxiliary losses.

Continuous Q-learning has so far been studied almost exclusively in the fully observable single-agent setting. Two distinct approaches to making greedy action selection tractable have emerged. Normalized Advantage Functions (NAF; Gu et al., 2016) and Partially Input-Convex Neural Networks (PICNN; Amos et al., 2017) both constrain the functional form of the state-action value function approximator so as to guarantee an easily identifiable global maximum. In contrast, heuristic search approaches, such as cross-entropy maximisation (CEM; Mannor et al., 2003), forfeit global guarantees but allow for unconstrained Q-learning approximators. CEM has been used successfully in single-agent robotic simulations (Kalashnikov et al., 2018). For COMIX, we find that ablations using NAF perform poorly (see Section 6).

8 Conclusion

In order to stimulate research into decentralised cooperative robotic control, we introduce a novel benchmark suite, Multi-Agent Mujoco. Multi-Agent Mujoco consists of a diverse set of multi-agent tasks with continuous action spaces and is easily extensible. We also introduce COMIX, a novel Q-learning algorithm that factors the joint action space into per-agent action spaces to overcome scalability problems in continuous greedy action selection.

Our results show that COMIX performs competitively with, or even outperforms, MADDPG both on a traditional benchmark environment and on Multi-Agent Mujoco. Furthermore, we introduce a second method, FacMADDPG, a novel variant of MADDPG in which the centralised critic is factored into individual critic networks, similarly to COMIX.

We find that, interestingly, FacMADDPG performs similarly to COMIX on the Continuous Predator-Prey toy environment, as well as on a single Multi-Agent Mujoco environment. This suggests that the factorisation of the joint Q-value is the key to performance in decentralised continuous cooperative multi-agent tasks.

Future work will investigate the utility of recent amortisation techniques for cross-entropy maximisation (Van de Wiele et al., 2020) and the relationship between exploration strategies in continuous -learning and deterministic policy gradient approaches. We also plan to extend Multi-Agent Mujoco to contain more challenging multi-agent environments composed of multiple robotic agents rather than decompositions of single robotic agents.


This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement number 637713), the National Institutes of Health (grant agreement number R01GM114311), EPSRC/MURI grant EP/N019474/1 and the JP Morgan Chase Faculty Research Award. This work is linked to and partly funded by the project Free the Drones (FreeD) under the Innovation Fund Denmark and Microsoft. It was also supported by the Oxford-Google DeepMind Graduate Scholarship and a generous equipment grant from NVIDIA.


  • J. Ackermann, V. Gabler, T. Osa, and M. Sugiyama (2019) Reducing overestimation bias in multi-agent domains using double centralized critics. arXiv preprint arXiv:1910.01465. Cited by: §7.
  • B. Amos, L. Xu, and J. Z. Kolter (2017) Input convex neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 146–155. Cited by: §1, §4, §4, §7.
  • O. M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, et al. (2020) Learning dexterous in-hand manipulation. The International Journal of Robotics Research 39 (1), pp. 3–20. Cited by: §1.
  • F. Augugliaro, A. Mirjan, F. Gramazio, M. Kohler, and R. D’Andrea (2013) Building tensile structures with flying machines. In 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 3487–3492. Cited by: §1.
  • F. Augugliaro, S. Lupashin, M. Hamer, C. Male, M. Hehn, M. W. Mueller, J. S. Willmann, F. Gramazio, M. Kohler, and R. D’Andrea (2014) The flight assembled architecture installation: cooperative construction with flying machines. IEEE Control Systems Magazine 34 (4), pp. 46–64. Cited by: §1.
  • M. Bescuca (2019) Factorised critics in deep multi-agent reinforcement learning. In Master Thesis, University of Oxford, Cited by: §6.
  • W. Bohmer, V. Kurin, and S. Whiteson (2019) Deep coordination graphs. arXiv preprint arXiv:1910.00091. Cited by: §7.
  • G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) OpenAI Gym. arXiv preprint arXiv:1606.01540. Cited by: §1, Figure 1, §3, §9.2.
  • F. Caccavale and M. Uchiyama (2008) Cooperative Manipulators. In Springer Handbook of Robotics, B. Siciliano and O. Khatib (Eds.), pp. 701–718. Cited by: §1.
  • K. Ciosek and S. Whiteson (2018) Expected policy gradients. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §9.3.
  • P. De Boer, D. P. Kroese, S. Mannor, and R. Y. Rubinstein (2005) A tutorial on the cross-entropy method. Annals of operations research 134 (1), pp. 19–67. Cited by: §4.
  • J. N. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson (2018) Counterfactual multi-agent policy gradients. In Thirty-second AAAI conference on artificial intelligence, Cited by: §1, §1, §6.
  • S. Gu, T. Lillicrap, I. Sutskever, and S. Levine (2016) Continuous deep q-learning with model-based acceleration. In International Conference on Machine Learning, pp. 2829–2838. Cited by: §1, §1, §4, §4, §5, §7.
  • D. Ha, A. Dai, and Q. V. Le (2016) Hypernetworks. arXiv preprint arXiv:1609.09106. Cited by: §2.3.
  • T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pp. 1856–1865. Cited by: §1.
  • S. Iqbal and F. Sha (2019) Actor-attention-critic for multi-agent reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, Long Beach, California, USA, pp. 2961–2970. Cited by: §4, §7.
  • D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, et al. (2018) Qt-opt: scalable deep reinforcement learning for vision-based robotic manipulation. arXiv preprint arXiv:1806.10293. Cited by: §1, §1, §1, §4, §4, §4, §7.
  • O. Kilinc and G. Montana (2018) Multi-agent deep reinforcement learning with extremely noisy observations. arXiv preprint arXiv:1812.00922. Cited by: §7.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §9.1.
  • H. Kitano, M. Asada, Y. Kuniyoshi, I. Noda, E. Osawa, and H. Matsubara (1997) RoboCup: a challenge problem for AI. AI magazine 18 (1), pp. 73–73. Cited by: §7.
  • D. Koller and R. Parr (1999) Computing factored value functions for policies in structured mdps. In Proceedings of IJCAI, pp. 1332–1339. Cited by: §2.3.
  • L. Kraemer and B. Banerjee (2016) Multi-agent reinforcement learning as a rehearsal for decentralized planning. Neurocomputing 190, pp. 82–94. Cited by: §1, §2.2.
  • H. Kuhn (1953) Extensive games and the problem of information. Annals of Mathematics Studies 28. Cited by: §2.4.
  • J. Z. Leibo, V. Zambaldi, M. Lanctot, J. Marecki, and T. Graepel (2017) Multi-agent Reinforcement Learning in Sequential Social Dilemmas. arXiv preprint arXiv:1702.03037. Cited by: §1, §5.
  • T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Cited by: §7.
  • L. Lin (1992a) Reinforcement learning for robots using neural networks. In Dissertation, Carnegie Mellon University, Cited by: §2.4, §6.
  • L. Lin (1992b) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8 (3-4), pp. 293–321. Cited by: §2.1.
  • S. Liu, G. Lever, J. Merel, S. Tunyasuvunakool, N. Heess, and T. Graepel (2019) Emergent coordination through competition. arXiv preprint arXiv:1902.07151. Cited by: §1, §7.
  • R. Lowe, Y. Wu, A. Tamar, J. Harb, O. P. Abbeel, and I. Mordatch (2017) Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in neural information processing systems, pp. 6379–6390. Cited by: §1, §1, §1, §2.4, §7, §7.
  • S. Mannor, R. Y. Rubinstein, and Y. Gat (2003) The cross entropy method for fast policy search. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 512–519. Cited by: §7.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533. Cited by: §2.1.
  • F. A. Oliehoek, C. Amato, et al. (2016) A concise introduction to decentralized pomdps. Vol. 1, Springer. Cited by: §2.
  • F. A. Oliehoek, M. T. J. Spaan, and N. Vlassis (2008) Optimal and Approximate Q-value Functions for Decentralized POMDPs. JAIR 32, pp. 289–353. Cited by: §1.
  • S. C. Ong, S. W. Png, D. Hsu, and W. S. Lee (2009) POMDPs for robotic tasks with mixed observability.. In Robotics: Science and systems, Vol. 5, pp. 4. Cited by: §1.
  • M. Plappert, R. Houthooft, P. Dhariwal, S. Sidor, R. Y. Chen, X. Chen, T. Asfour, P. Abbeel, and M. Andrychowicz (2018) Parameter space noise for exploration. arXiv preprint arXiv:1706.01905. Cited by: §9.3.
  • T. Rashid, M. Samvelyan, C. S. Witt, G. Farquhar, J. Foerster, and S. Whiteson (2018) QMIX: monotonic value function factorisation for deep multi-agent reinforcement learning. In International Conference on Machine Learning, pp. 4292–4301. Cited by: §1, §1, §1, §2.3, §4, §4, §6.
  • M. Riedmiller, T. Gabel, R. Hafner, and S. Lange (2009) Reinforcement learning for robot soccer. Autonomous Robots 27 (1), pp. 55–73. Cited by: §7.
  • M. Samvelyan, T. Rashid, C. S. de Witt, G. Farquhar, N. Nardelli, T. G. J. Rudner, C. Hung, P. H. S. Torr, J. Foerster, and S. Whiteson (2019) The StarCraft Multi-Agent Challenge. CoRR abs/1902.04043. Cited by: §1, §7.
  • C. Schroeder de Witt, J. Foerster, G. Farquhar, P. Torr, W. Boehmer, and S. Whiteson (2019) Multi-agent common knowledge reinforcement learning. In Advances in Neural Information Processing Systems 32, pp. 9924–9935. Cited by: §1.
  • R. Shamshiri, C. Weltzien, I. Hameed, I. Yule, T. Grift, S. Balasundram, L. Pitonakova, D. Ahmad, and G. Chowdhary (2018) Research and development in agricultural robotics: A perspective of digital farming. International Journal of Agricultural and Biological Engineering 11, pp. 1–14. Cited by: §1.
  • K. Son, D. Kim, W. J. Kang, D. E. Hostallero, and Y. Yi (2019) QTRAN: learning to factorize with transformation for cooperative multi-agent reinforcement learning. In International Conference on Machine Learning, pp. 5887–5896. Cited by: §6, §7.
  • P. Stone and R. S. Sutton (2001) Scaling reinforcement learning toward RoboCup soccer. In Icml, Vol. 1, pp. 537–544. Cited by: §7.
  • P. Sunehag, G. Lever, A. Gruslys, W. M. Czarnecki, V. Zambaldi, M. Jaderberg, M. Lanctot, N. Sonnerat, J. Z. Leibo, K. Tuyls, and et al. (2018) Value-decomposition networks for cooperative multi-agent learning based on team reward. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pp. 2085–2087. Cited by: §2.3.
  • K. Takadama, S. Matsumoto, S. Nakasuka, and K. Shimohara (2003) A reinforcement learning approach to fail-safe design for multiple space robots: cooperation mechanism without communication and negotiation schemes. Advanced Robotics 17 (1), pp. 21–39. Cited by: §1.
  • A. Tampuu, T. Matiisen, D. Kodelja, I. Kuzovkin, K. Korjus, J. Aru, J. Aru, and R. Vicente (2017) Multiagent cooperation and competition with deep reinforcement learning. PloS one 12 (4). Cited by: §2.2.
  • M. Tan (1993) Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proceedings of the tenth international conference on machine learning, pp. 330–337. Cited by: §2.2.
  • E. Todorov, T. Erez, and Y. Tassa (2012) Mujoco: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. Cited by: §1, §1, §3.
  • T. Van de Wiele, D. Warde-Farley, A. Mnih, and V. Mnih (2020) Q-learning in enormous action spaces via amortized approximate maximization. arXiv preprint arXiv:2001.08116. Cited by: §8.
  • O. Vinyals, T. Ewalds, S. Bartunov, P. Georgiev, A. S. Vezhnevets, M. Yeo, A. Makhzani, H. Küttler, J. Agapiou, J. Schrittwieser, J. Quan, S. Gaffney, S. Petersen, K. Simonyan, T. Schaul, H. van Hasselt, D. Silver, T. Lillicrap, K. Calderone, P. Keet, A. Brunasso, D. Lawrence, A. Ekermo, J. Repp, and R. Tsing (2017) StarCraft II: A New Challenge for Reinforcement Learning. arXiv preprint arXiv:1708.04782. Cited by: §1.
  • R. E. Wang, M. D. Everett, and J. P. How (2019) R-MADDPG for partially observable environments and limited communication. In Proceedings of the Reinforcement Learning for Real Life workshop (at ICML), Cited by: §7.
  • T. Wang, R. Liao, J. Ba, and S. Fidler (2018) NerveNet: learning structured policy with graph neural networks. In 6th International Conference on Learning Representations, ICLR, Cited by: §7, §7.

9 Experimental Details

In all experiments, we use a replay buffer, update the target networks via soft target updates, and scale the gradient norms during training to be at most 0.5. The mixing network used in COMIX, COMIX-NAF, and FacMADDPG consists of a single hidden layer of 64 units with an ELU non-linearity. The hypernetworks are then sized to produce weights of the appropriate shape: the hypernetworks producing the first-layer weights and the final-layer weights and bias of the mixing network each consist of a single hidden layer of 64 units with a ReLU non-linearity. For COMIX and its ablations, to speed up learning, we share the parameters of the agent networks across all agents. Similarly, in FacMADDPG a single actor and critic network is shared among all agents, while in MADDPG there is a separate actor and critic network for each agent, as in the original algorithm.
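The two generic training mechanics mentioned above can be sketched as follows (the soft-update coefficient `tau` below is illustrative; the paper's exact value is not reproduced here):

```python
import numpy as np

def soft_update(target, online, tau=0.005):
    """Polyak-average each online parameter into its target copy."""
    for k in target:
        target[k] = (1.0 - tau) * target[k] + tau * online[k]

def clip_grad_norm(grads, max_norm=0.5):
    """Rescale gradients so their joint L2 norm is at most max_norm."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads]

target = {"w": np.zeros(3)}
online = {"w": np.ones(3)}
soft_update(target, online)                 # target moves slightly towards online
clipped = clip_grad_norm([np.full(4, 10.0)])  # norm 20 rescaled down to 0.5
```

Soft updates keep the bootstrapping targets slowly moving, which stabilises off-policy Q-learning; clipping caps the effective step size on noisy multi-agent gradients.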

9.1 Continuous Predator-Prey

The architecture of all agent networks is an MLP with 2 hidden layers of 64 units and ReLU non-linearities, except for COVDN-NAF and COMIX-NAF, where we replace ReLU units with tanh units as this leads to better performance. In both MADDPG and FacMADDPG, the architecture of all agent and critic networks is also an MLP with 2 hidden layers of 64 units and ReLU non-linearities, while the final output layer of the actor is a tanh layer to bound the actions. The global state consists of the observations of all agents. COMIX and MADDPG can both take advantage of this extra state information during training, while COVDN ignores it. During training and testing, we restrict each episode to a length of 25 time steps. Training lasts for 2 million timesteps. To encourage exploration, we use uncorrelated, mean-zero Gaussian noise during training (for all 2 million timesteps), with the same noise standard deviation in all experiments. We train on a batch size of 1024 after every timestep. All neural networks are trained using the Adam optimiser (Kingma and Ba, 2014). To evaluate learning performance, training is paused every 2000 timesteps to run 10 test episodes, with agents performing action selection greedily in a decentralised fashion.
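The exploration scheme above amounts to perturbing the bounded deterministic action with uncorrelated Gaussian noise and clipping back into the action box. A minimal sketch (sigma = 0.1 is an illustrative value, not the paper's):

```python
import numpy as np

def noisy_action(greedy_action, sigma, rng):
    """Add uncorrelated, mean-zero Gaussian noise, then clip to [-1, 1]."""
    noisy = greedy_action + rng.normal(0.0, sigma, greedy_action.shape)
    return np.clip(noisy, -1.0, 1.0)

rng = np.random.default_rng(0)
greedy = np.tanh(np.array([0.4, -2.0]))  # actor output squashed to [-1, 1]
a = noisy_action(greedy, 0.1, rng)
```

Clipping after noising keeps every executed action inside the valid box even when the greedy action already sits near a boundary.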

9.2 Multi-Agent Mujoco

The architecture of all agent networks is an MLP with 2 hidden layers of 400 and 300 units respectively, similar to the setting used in OpenAI Spinning Up. All neural networks use ReLU non-linearities for all hidden layers, except for COVDN-NAF and COMIX-NAF, where we found tanh units lead to better performance. In both MADDPG and FacMADDPG, the architecture of all agent and critic networks is also an MLP with 2 hidden layers of 400 and 300 units respectively, while the final output layer of the actor is a tanh layer to bound the actions. The global state consists of the full state information returned by the original OpenAI Gym (Brockman et al., 2016). COMIX and MADDPG can both take advantage of this extra state information during training, while COVDN ignores it. During training and testing, we restrict each episode to a length of 1000 time steps. Training lasts for 4 million timesteps. To encourage exploration, we use uncorrelated, mean-zero Gaussian noise during training (for all 4 million timesteps), with the same noise standard deviation in all experiments. We also use the same trick as in OpenAI Spinning Up to improve exploration at the start of training: for a fixed number of steps at the beginning (we set it to 10000), the agents take actions sampled from a uniform random distribution over valid actions, after which exploration returns to normal Gaussian noise. We train on a batch size of 100 after every timestep. All neural networks are trained using the Adam optimiser. To evaluate learning performance, training is paused every 4000 timesteps to run 10 test episodes, with agents performing action selection greedily in a decentralised fashion.
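The warm-up trick above can be expressed as a simple switch on the environment step counter. A sketch under the stated 10000-step warm-up (sigma = 0.1 is again illustrative):

```python
import numpy as np

def select_action(t, greedy_action, rng, warmup_steps=10_000, sigma=0.1):
    """Uniform random actions during warm-up, Gaussian-perturbed greedy
    actions afterwards."""
    if t < warmup_steps:
        return rng.uniform(-1.0, 1.0, greedy_action.shape)
    noisy = greedy_action + rng.normal(0.0, sigma, greedy_action.shape)
    return np.clip(noisy, -1.0, 1.0)

rng = np.random.default_rng(0)
greedy = np.zeros(3)
early = select_action(500, greedy, rng)     # uniform sample over the box
late = select_action(20_000, greedy, rng)   # greedy action plus noise
```

The uniform phase fills the replay buffer with broadly spread transitions before the (initially arbitrary) greedy policy starts to dominate data collection.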

9.3 Exploration

The choice of exploration strategy plays a substantial role in the performance of deep deterministic policy gradient algorithms (Ciosek and Whiteson, 2018). To keep exploration strategies comparable across MADDPG and the Q-learning-based COMIX, we restrict ourselves to noise in action space rather than parameter space (Plappert et al., 2018). MADDPG's official codebase uses additive Gaussian noise whose standard deviation is itself given by an additional policy output learnt end-to-end within the policy gradient loss. As Q-learning does not explicitly predict a policy output, we cannot apply a comparable strategy to COMIX. However, we find empirically that constant i.i.d. noise in action space exhibits similar performance at lower variance than learnt noise on 2-Agent HalfCheetah (see Figure 6, right). Even on Continuous Predator-Prey, a significantly less complex environment on which MADDPG's official codebase was tuned, learnt exploration does not result in better limit performance than i.i.d. Gaussian noise (see Figure 6, left).

Figure 6: Mean episode return on Left: Continuous Predator-Prey and Right: 2-Agent HalfCheetah comparing MADDPG with constant i.i.d. Gaussian noise and MADDPG with learned Gaussian noise. The mean across 10 seeds is plotted and the 95% confidence interval is shown shaded.

10 Critic Mixing Network Constraints in FacMADDPG

As the critic is not used for greedy action selection in an actor-critic setting, FacMADDPG does not strictly require a monotonicity constraint on its critic mixing network. However, we find empirically that introducing the monotonicity constraint significantly increases performance (see Figure 7). This supports the hypothesis that monotonicity constraints strike a reasonable trade-off between independent critics with limited coordinative ability and an unconstrained critic whose excess coordinative expressivity increases learning difficulty.
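The monotonicity constraint is typically enforced QMIX-style by taking the absolute value of the mixing weights, so the mixed value is non-decreasing in every per-agent value. A sketch (the weights below are arbitrary stand-ins for hypernetwork outputs, not the actual trained parameters):

```python
import numpy as np

def elu(x):
    """ELU non-linearity, monotonically increasing everywhere."""
    return np.where(x > 0.0, x, np.exp(x) - 1.0)

def monotonic_mix(qs, w1, b1, w2, b2):
    """Mix per-agent values with |weights| so the output is monotone
    non-decreasing in each entry of qs."""
    h = elu(qs @ np.abs(w1) + b1)
    return float(h @ np.abs(w2) + b2)

rng = np.random.default_rng(0)
w1, b1 = rng.normal(size=(2, 64)), rng.normal(size=64)
w2, b2 = rng.normal(size=64), rng.normal()

q = np.array([0.2, -0.4])
bumped = q + np.array([0.3, 0.0])  # raise one agent's value
assert monotonic_mix(bumped, w1, b1, w2, b2) >= monotonic_mix(q, w1, b1, w2, b2)
```

Because every layer composes non-negative weights with increasing activations, improving any individual critic's value can never lower the joint estimate, which is exactly the constraint being ablated in Figure 7.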

Figure 7: Mean episode return on Left: Continuous Predator-Prey and Right: 2-Agent HalfCheetah comparing FacMADDPG and FacMADDPG without monotonicity constraints on the mixing network of the critic. The mean across 10 seeds is plotted and the 95% confidence interval is shown shaded.
| Task | Goal | Special observations |
| --- | --- | --- |
| 2-Agent Swimmer | Maximise +ve x-speed. | - |
| 2-Agent Reacher | Fingertip (green) needs to reach target at random location (red). | Target is only visible to the green agent. |
| 2-Agent Ant | Maximise +ve x-speed. | All agents can observe the central torso. |
| 2-Agent Ant Diag | Maximise +ve x-speed. | All agents can observe the central torso. |
| 2-Agent HalfCheetah | Maximise +ve x-speed. | - |
| 2-Agent Humanoid | Maximise +ve x-speed. | - |
| 2-Agent HumanoidStandup | Maximise +ve z-speed. | - |
| 3-Agent Hopper | Maximise +ve x-speed. | - |
| 4-Agent Ant | Maximise +ve x-speed. | All agents can observe the central torso. |
| 6-Agent HalfCheetah | Maximise +ve x-speed. | - |

Table 1: Overview of tasks contained in Multi-Agent Mujoco. Each task's reward function also includes an action regularisation term.
  function CEM(Q_a, τ_a)
     initialise the Gaussian sampling distribution D_0
     for i = 1, …, I do
        sample N_i candidate actions u^1, …, u^{N_i} from D_{i-1}
        for j = 1, …, N_i do
           evaluate Q_a(τ_a, u^j)
        end for
        fit D_i to the highest-scoring candidates (the elites)
        if the best candidate improves on the incumbent then
           update the incumbent best action
        end if
     end for
     return the incumbent best action
  end function
Algorithm 2 For each agent a, we perform I CEM iterations. Hyper-parameters N_i control how many actions are sampled at the i-th iteration.
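The per-agent CEM procedure of Algorithm 2 can be sketched in a few lines of numpy. The iteration count, sample sizes, and elite fraction below are illustrative; the paper's exact hyper-parameters are not reproduced here.

```python
import numpy as np

def cem_argmax(q_fn, action_dim, rng, iters=3, samples=64, n_elites=6):
    """Cross-entropy maximisation of a per-agent utility q_fn over the
    [-1, 1]^action_dim action box."""
    mu, sigma = np.zeros(action_dim), np.ones(action_dim)
    best_a, best_q = mu, q_fn(mu)
    for _ in range(iters):
        acts = np.clip(rng.normal(mu, sigma, (samples, action_dim)), -1.0, 1.0)
        qs = np.array([q_fn(a) for a in acts])
        elites = acts[np.argsort(qs)[-n_elites:]]        # top-scoring actions
        mu, sigma = elites.mean(0), elites.std(0) + 1e-6  # refit the sampler
        if qs.max() > best_q:                             # keep the running best
            best_q, best_a = qs.max(), acts[qs.argmax()]
    return best_a

rng = np.random.default_rng(0)
q_fn = lambda a: -np.sum((a - 0.5) ** 2)  # toy utility, maximised at (0.5, 0.5)
a_star = cem_argmax(q_fn, 2, rng)
assert q_fn(a_star) > q_fn(np.zeros(2))   # beats the initial sampling mean
```

Because each agent maximises only its own utility, the search cost grows linearly rather than exponentially in the number of agents, which is the scalability point the factorisation in COMIX relies on.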