1 Introduction

Deep Reinforcement Learning (DRL) has been successfully applied to a range of challenging tasks such as Atari games (Mnih et al., 2015), industrial data center cooling (Evans and Gao, 2016), and controlling humanoids (Xie et al., 2019). Most of these tasks involve a single agent, whose local objective is identical to the global system objective.
However, many real-world applications like air traffic control (Tumer and Agogino, 2007), multi-robot coordination (Sheng et al., 2006; Yliniemi et al., 2014), communication and language (Lazaridou et al., 2016; Mordatch and Abbeel, 2018), and autonomous driving (Shalev-Shwartz et al., 2016) involve multiple agents interacting with each other. Unfortunately, traditional DRL approaches are ill-suited to tackling multiagent problems due to a host of challenges including non-stationary environments (Foerster et al., 2017; Lowe et al., 2017), structural credit assignment (Agogino and Tumer, 2004; Rahmattalabi et al., 2016), and the explosion of the search space with an increasing number of agents (Li et al., 2012).
Consider soccer, where a team of agents coordinates to achieve the global objective of winning. Directly optimizing this objective to train each agent is sub-optimal for two reasons. First, it fails to capture the contributions of individual agents to the final result. Second, it is usually very sparse - a single scalar capturing the performance of an entire team operating over an extended period of time. This makes it a weak signal to learn from. Domain knowledge has been used to design agent-specific rewards (Devlin et al., 2011; Williamson et al., 2009), but this approach does not generalize well. For example, a team that is winning may benefit from protecting its lead by temporarily playing more defensively. This objective is now misaligned with the local objectives of the strikers, who prioritize scoring, leading to sub-optimal coordination overall.
In this paper, we introduce Multiagent Evolutionary Reinforcement Learning (MERL), a hybrid algorithm that combines gradient-based and gradient-free learning to address sparse and noisy coordination objectives without the need to manually design agent-specific rewards to align with a global objective. MERL employs a two-level approach: a local optimizer (policy gradient) learns using local rewards computed directly over each agent’s observation set. This has the advantage of being high-fidelity and dense - a perfect signal to learn non-coordination related aspects of the world such as perception and navigation. A global optimizer (evolutionary algorithm) learns to directly optimize the global reward which encodes the true system goal. The two processes operate concurrently and share information.
Our hypothesis is that the solution to the coordination task often lies on a smaller manifold than the solutions to the related navigation and perception tasks. For instance, in soccer, assume that each player has mastered self-oriented skills such as perceiving the world, passing, dribbling, and running. Given these skills, the coordination aspect of the game can be roughly reduced to planning who to pass to and when, and what spaces to occupy and when. The search space for learning the self-oriented skills is significantly larger than that for coordination. MERL leverages this split within the structure of the task: it employs local rewards coupled with fast policy gradient methods to learn the self-oriented skills, while employing the less powerful but more general global optimizer (neuroevolution) to learn coordination skills.
A key strength of MERL is that it optimizes the true learning goal (global reward) directly while leveraging local rewards as an auxiliary signal. This is in stark contrast to reward shaping techniques that construct a proxy reward to incentivize the attainment of the global reward (Agogino and Tumer, 2004; Devlin and Kudenko, 2012). Apart from requiring domain knowledge and manual tuning, this approach also poses risks of changing the underlying problem itself (Ng et al., 1999). MERL, on the other hand, is not susceptible to this mode of failure and is guaranteed to optimize the global reward. We test MERL in a multi-rover domain with increasingly complex coordination objectives. Results demonstrate that MERL significantly outperforms state-of-the-art multiagent reinforcement learning methods like MADDPG while using the same set of information and reward functions.
2 Background and Related Work
A standard reinforcement learning (RL) setting is often formalized as a Markov Decision Process (MDP) and consists of an agent interacting with an environment over a finite number of discrete time steps. This formulation can be extended to multiagent systems in the form of partially observable Markov games (Littman, 1994; Lowe et al., 2017). An N-agent Markov game is defined by a global state of the world, s, and a set of observations {o_1, ..., o_N} and actions {a_1, ..., a_N} corresponding to the N agents. At each time step t, each agent i observes its corresponding observation o_i and maps it to an action a_i using its policy π_i.
Each agent i receives a scalar reward r_i based on the global state and the joint action of the team. The world then transitions to the next state s', which produces a new set of observations. The process continues until a terminal state is reached. R_i = Σ_t γ^t r_i(t) is the total return for agent i with discount factor γ ∈ (0, 1]. Each agent aims to maximize its expected return.
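As a minimal, illustrative sketch (not the paper's code), the per-agent discounted return over a finite episode can be computed as:

```python
def discounted_return(rewards, gamma):
    """Total return for one agent: sum over t of gamma^t * r_t."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total
```

For example, `discounted_return([1.0, 1.0, 1.0], 0.5)` evaluates to 1.75.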
Policy gradient (PG) methods frame the goal of maximizing the expected return as the minimization of a loss function. A widely used PG method for continuous, high-dimensional action spaces is DDPG (Lillicrap et al., 2015). Recently, Fujimoto et al. (2018) extended DDPG to Twin Delayed DDPG (TD3), addressing its well-known overestimation problem. TD3 is the state-of-the-art off-policy algorithm for model-free DRL in continuous action spaces.
TD3 uses an actor-critic architecture (Sutton and Barto, 1998), maintaining a deterministic policy (actor) π and two distinct critics, Q_1 and Q_2. Each critic independently approximates the actor's action-value function Q^π(s, a). A separate copy of the actor and critics is kept as target networks for stability and updated periodically. A noisy version of the actor is used to explore the environment during training. The actor is trained using a noisy version of the sampled policy gradient computed by backpropagation through the combined actor-critic networks. This mitigates overfitting of the deterministic policy by smoothing the policy gradient updates.
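The two TD3 ingredients described above - the minimum over twin critics and target-policy smoothing - can be sketched as follows. The scalar action space and the specific noise constants are illustrative assumptions, not the published hyperparameters:

```python
import random

def td3_target(reward, gamma, q1_next, q2_next, done):
    """TD3 Bellman target: take the minimum of the two target critics'
    estimates to counteract overestimation bias."""
    if done:
        return reward
    return reward + gamma * min(q1_next, q2_next)

def smoothed_target_action(target_action, noise_std=0.2, noise_clip=0.5, max_action=1.0):
    """Target-policy smoothing: add clipped Gaussian noise to the target
    action, then clip back into the valid action range."""
    noise = max(-noise_clip, min(noise_clip, random.gauss(0.0, noise_std)))
    return max(-max_action, min(max_action, target_action + noise))
```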
Evolutionary Reinforcement Learning (ERL) is a hybrid algorithm that combines Evolutionary Algorithms (EAs) (Floreano et al., 2008; Lüders et al., 2017; Fogel, 2006; Spears et al., 1993), with policy gradient methods (Khadka and Tumer, 2018). Instead of discarding the data generated during a standard EA rollout, ERL stores this data in a central replay buffer shared with the policy gradient’s own rollouts - thereby increasing the diversity of the data available for the policy gradient learners. Since the EA directly optimizes for episode-wide return, it biases exploration towards states with higher long-term returns. The policy gradient algorithm which learns using this state distribution inherits this implicit bias towards long-term optimization. Concurrently, the actor trained by the policy gradient algorithm is inserted into the evolutionary population allowing the EA to benefit from the fast gradient-based learning.
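The shared replay buffer at the heart of ERL can be sketched as below. The class and capacity are illustrative; the point is that both evolutionary and policy-gradient rollouts push transitions into the same store:

```python
import random
from collections import deque

class ReplayBuffer:
    """Shared experience store: both evolutionary and policy-gradient
    rollouts push transitions here, diversifying the data available to
    the policy-gradient learner."""
    def __init__(self, capacity=100000):
        self.data = deque(maxlen=capacity)  # oldest transitions evicted first

    def push(self, state, action, reward, next_state, done):
        self.data.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # uniform mini-batch over all stored transitions, regardless of origin
        return random.sample(self.data, batch_size)

    def __len__(self):
        return len(self.data)
```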
Related Work: Lowe et al. (2017) introduced MADDPG which tackled the inherent non-stationarity of a multiagent learning environment by leveraging a critic which had full access to the joint state and action during training. Foerster et al. (2018b) utilized a similar setup with a centralized critic across agents to tackle StarCraft micromanagement tasks. An algorithm that could explicitly model other agents’ learning was investigated in Foerster et al. (2018a). However, all these approaches rely on a dense local reward that is aligned with the global coordination objective. Methods to solve for these aligned agent-specific reward functions were investigated in Li et al. (2012) but were limited to tasks with strong simulators where tree-based planning could be used.
Closely related to MERL is Liu et al. (2019), where Population-Based Training (PBT) (Jaderberg et al., 2017) is used to automatically tune the relative importance of a collection of dense, shaped rewards during training. This can be interpreted as a single central reward function constructed by scalarizing a collection of reward signals, where the scalarization coefficients are adaptively learned during training. In contrast, MERL optimizes its reward functions independently, with information transferred across them directly through shared replay buffers and policy migration. This form of information transfer through a shared replay buffer has been explored extensively in recent literature (Colas et al., 2018; Khadka et al., 2019).
3 Motivating Example
Consider the rover domain (Agogino and Tumer, 2004), a classic multiagent task where a team of rovers coordinates to explore a region. The global objective is to observe all POIs (Points of Interest) distributed in the area. Each rover also receives a local reward defined as the negative distance to the closest POI. In Figure 1(a), a team of two rovers, R1 and R2, seeks to explore and observe POIs P1 and P2. R1 is closer to P1 and has enough fuel to reach either of the POIs, whereas R2 can only reach P1. There is no communication between the rovers.
If R1 optimizes only locally by pursuing the closer POI P1, then the global objective is not achieved, since R2 can also only reach P1. The globally optimal solution for R1 is to spend more fuel and pursue P2 - this is misaligned with its locally optimal solution. This is related to sequential social dilemmas (Leibo et al., 2017; Perolat et al., 2017). Figure 1(b) shows the comparative performance of four algorithms - TD3-mixed, TD3-global, EA, and MERL - on this coordination task.
TD3-mixed and TD3-global optimize a scalarized version of the joint objective or just the global reward, respectively. Since the global reward is extremely sparse (only disbursed when a POI is observed), TD3-global fails to learn anything meaningful. In contrast, TD3-mixed, by virtue of its dense local reward component, successfully learns to perceive and navigate. However, the mixed reward is a static scalarization of local and global rewards that are not always aligned, as described in the preceding paragraph. TD3-mixed converges to the greedy local policy of pursuing P1.
EA relies on randomly stumbling onto a solution - e.g., a navigation sequence that takes the rovers to the correct POIs. The probability of one of the rovers stumbling onto the nearest POI is significantly higher, and this greedy policy is also the one that EA converges to.
MERL combines the core strengths of TD3 and EA. It exploits the local reward to first learn perception and navigation skills - treating it as a dense, auxiliary reward even though it is not aligned with the global objective. The task is then reduced to its coordination component - picking the right POI to go to. This is effectively tackled by the EA engine within MERL and enables it to find the optimal solution. This ability to leverage reward functions across multiple levels even when they are misaligned is the core strength of MERL.
4 Multiagent Evolutionary Reinforcement Learning
Policy Topology: We represent our multiagent (team) policies using a multi-headed neural network, as illustrated in Figure 2. The k-th head represents the k-th agent in the team. Given an incoming observation for agent k, only the output of head k is considered as agent k's response. In essence, all agents act independently based on their own observations while sharing weights (and by extension, features) in the lower layers (trunk). This weight sharing is commonly used to improve learning speed (Silver et al., 2017). Further, each agent also has its own replay buffer, which stores its experience, defined by the tuple (state, action, next state, local reward), for each episode of interaction with the environment (rollout) involving that agent.
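A minimal sketch of this multi-headed topology, with toy linear layers standing in for the actual network (all dimensions and initializations hypothetical):

```python
import random

def linear(in_dim, out_dim):
    """A toy linear layer: random weights plus a forward function."""
    w = [[random.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    def forward(x):
        return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]
    return forward

class MultiHeadTeam:
    """One team policy: a trunk shared by all agents plus one head per
    agent. Agent k's action is read only from head k."""
    def __init__(self, n_agents, obs_dim, hidden_dim, action_dim):
        self.trunk = linear(obs_dim, hidden_dim)
        self.heads = [linear(hidden_dim, action_dim) for _ in range(n_agents)]

    def act(self, agent_idx, observation):
        features = self.trunk(observation)  # features shared by all agents
        return self.heads[agent_idx](features)
```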
Global Reward Optimization: Figure 3 illustrates the MERL algorithm. A population of multi-headed teams, each with the same topology, is initialized with random weights. The k-th agent's replay buffer is shared across all teams. Each team in the population is then evaluated through a rollout. The global reward for each team is disbursed at the end of the episode and is treated as its fitness score. A selection operator selects a portion of the population for survival with probability proportionate to their fitness scores. The weights of the teams in the population are probabilistically perturbed through mutation and crossover operators to create the next generation of teams. A portion of the teams with the highest relative fitness are preserved as elites. At any given time, the team with the highest fitness, or the champion, represents the best multiagent solution for the task.
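One generation of this evolutionary loop might be sketched as follows, with teams represented as flat weight vectors. The selection scheme (roulette-wheel), elite fraction, and mutation noise are illustrative choices rather than the paper's exact operators (crossover is omitted for brevity):

```python
import random

def next_generation(population, fitnesses, elite_frac=0.2, mutation_std=0.1):
    """One evolutionary step: preserve the top elite_frac teams unchanged,
    then fill the rest with mutated copies chosen with probability
    proportionate to fitness."""
    ranked = sorted(zip(fitnesses, population), key=lambda p: p[0], reverse=True)
    n_elites = max(1, int(elite_frac * len(population)))
    elites = [team for _, team in ranked[:n_elites]]
    total = sum(fitnesses) or 1.0
    new_pop = list(elites)
    while len(new_pop) < len(population):
        # roulette-wheel selection over fitness scores
        pick = random.random() * total
        acc = 0.0
        for fit, team in zip(fitnesses, population):
            acc += fit
            if acc >= pick:
                # Gaussian mutation of every weight
                new_pop.append([w + random.gauss(0.0, mutation_std) for w in team])
                break
        else:
            new_pop.append(list(elites[0]))  # numeric fallback
    return new_pop
```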
Policy Gradient: The procedure described so far resembles a standard EA, except that each agent stores each of its experiences in its associated replay buffer instead of discarding them. However, unlike EA, which only learns from the low-fidelity global reward, MERL also learns from the experiences within episodes of a rollout using policy gradients. To enable this kind of "local learning", MERL initializes one multi-headed policy network and one critic. A noisy version of this policy network is then used to conduct its own set of rollouts in the environment, storing each agent's experiences in its corresponding buffer, similar to the evolutionary rollouts.
Local Reward Optimization: Crucially, each agent's replay buffer is kept separate from that of every other agent to ensure diversity amongst the agents. The shared critic samples a random mini-batch uniformly from each replay buffer and uses it to update its parameters via gradient descent. Each agent then draws a mini-batch of experiences from its corresponding buffer and uses it to sample a policy gradient from the shared critic. Unlike the teams in the evolutionary population, which directly seek to optimize the global reward, the policy-gradient team seeks to maximize the local reward of each agent while exploiting the experiences collected via evolution.
Local-Global Migration: Periodically, the policy-gradient network is copied into the evolving population of teams, where it can propagate its features by participating in evolution. This is the core mechanism that combines policies learned via local and global rewards. Regardless of whether the two rewards are aligned, evolution ensures that only the performant derivatives of the migrated network are retained. This mechanism guarantees protection against the destructive interference commonly seen when a direct scalarization of two reward functions is attempted. Further, the level of information exchange is adjusted automatically during learning, in contrast to being manually tuned by an expert designer.
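A sketch of this migration step, under the simplifying assumption that the policy-gradient team replaces the currently weakest member of the population (the exact replacement rule is an illustrative choice):

```python
def migrate(population, fitnesses, pg_team):
    """Copy the policy-gradient team (a flat weight vector here) into the
    population by replacing the weakest team. Selection in subsequent
    generations retains the migrant only if it is performant."""
    weakest = min(range(len(population)), key=lambda i: fitnesses[i])
    population[weakest] = list(pg_team)  # insert a copy of the PG weights
    return population
```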
5 Rover Domain
The domain used in this paper is a variant of the rover domain used in (Agogino and Tumer, 2004; Rahmattalabi et al., 2016). Here, a team of robots aims to observe Points of Interest (POIs) scattered across the environment. The robots start out near the center of the field, randomly distributed within a small fraction of the total area. The POIs are initialized randomly outside this area with a minimum distance from any robot. This is inspired by real-world scenarios of exploring an unknown environment, where the team of robots is air-dropped towards the center of the field.
Robot Capabilities: Each robot is loosely based on the characteristics of a Pioneer robot (Thrun et al., 2000). Its observation space consists of two channels dedicated to detecting POIs and rovers, respectively. Each channel receives intensity information over fixed angular brackets spanning the full 360° around the robot's position, similar to a LIDAR. Since each bracket returns only the closest reflector, occlusions make the problem partially observable. Each robot outputs two continuous actions representing change in heading and drive, respectively, with both the change in heading and the drive capped per step.
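The bracketed, closest-reflector sensing described above can be sketched as follows; the bracket count is an illustrative parameter:

```python
import math

def sense_channel(robot_xy, points, n_brackets=8):
    """Return, for each angular bracket around the robot, the distance to
    the closest point falling in that bracket (inf if the bracket is
    empty). Only the closest reflector per bracket is visible, so farther
    points in the same bracket are occluded."""
    readings = [math.inf] * n_brackets
    rx, ry = robot_xy
    width = 2 * math.pi / n_brackets
    for px, py in points:
        angle = math.atan2(py - ry, px - rx) % (2 * math.pi)
        bracket = int(angle / width) % n_brackets
        dist = math.hypot(px - rx, py - ry)
        readings[bracket] = min(readings[bracket], dist)
    return readings
```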
Reward Functions: The team's global reward is the percentage of POIs observed at the end of an episode. This is computed and broadcast to each robot at the end of an episode. It is sparse, low-fidelity, and noisy from each agent's point of view, as it compiles the entire team's joint state-action history into a single scalar value. However, it is an appropriate metric for evaluating the overall performance of the team without having to simultaneously account for each agent's navigation or perception skills.
Each robot also receives a local reward computed as the negative distance to the closest POI. In contrast to the global reward, the local reward is dense, high-fidelity, and not noisy, as it depends solely on the robot's own observations and actions. Critically, this local reward is not necessarily aligned with the global objective, since it does not aim to maximize the total number of POIs observed by the group. This makes it a good training metric for each agent to learn local skills like navigation to a particular POI without having to simultaneously account for the global objective.
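As a sketch, the local reward described above is simply the negative distance to the nearest POI:

```python
import math

def local_reward(robot_xy, poi_positions):
    """Dense local reward: negative Euclidean distance to the closest POI."""
    rx, ry = robot_xy
    return -min(math.hypot(px - rx, py - ry) for px, py in poi_positions)
```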
Global Team Objective: The coordination objective in the rover domain is expressed as the coupling requirement (Rahmattalabi et al., 2016; Agogino and Tumer, 2004). A coupling requirement of n means that n robots must be within an activation distance of a POI simultaneously in order to observe it. In the simplest case, a coupling of n = 1 defines a coordination problem where the robots need to spread out and explore the area on their own. This is similar to a set cover problem.
In contrast, a coupling of n > 1 defines a coordination problem where the robots need to form sub-teams of size n and explore the area jointly. The presence of fewer than n robots within the activation distance of a POI generates no reward signal. This defines a tough exploration problem and is based on tasks like lifting a rock, where multiple robots need to coordinate tightly to achieve any success at all.
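The coupling requirement and the resulting sparse global reward can be sketched as follows (the activation distance and positions are hypothetical parameters):

```python
import math

def poi_observed(poi_xy, robot_positions, coupling, activation_dist):
    """A POI counts as observed only if at least `coupling` robots are
    simultaneously within the activation distance; fewer robots yield no
    reward signal at all."""
    px, py = poi_xy
    close = sum(1 for rx, ry in robot_positions
                if math.hypot(rx - px, ry - py) <= activation_dist)
    return close >= coupling

def global_reward(pois, robot_positions, coupling, activation_dist):
    """Sparse global reward: fraction of POIs observed by the team."""
    observed = sum(poi_observed(p, robot_positions, coupling, activation_dist)
                   for p in pois)
    return observed / len(pois)
```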
Compared Baselines: We compare the performance of MERL with a standard neuroevolutionary algorithm (EA) (Fogel, 2006), MADDPG (Lowe et al., 2017), and MATD3, a variant of MADDPG that integrates the improvements of TD3 (Fujimoto et al., 2018) over DDPG. Internally, MERL uses EA and TD3 as its global and local optimizers, respectively.
MADDPG, on the other hand, was chosen as it is the state-of-the-art multiagent RL algorithm. We implemented MATD3 ourselves to ensure that the differences between MADDPG and MERL do not originate simply from using the more stable TD3 in place of DDPG.
Further, MADDPG and MATD3 were tested using either only the global reward or a mixed (global + local) reward function. The local reward here is simply the negative distance to the closest POI. MERL inherently leverages both reward functions, while EA directly optimizes the global reward. These variations of the baselines allow us to evaluate the efficacy of MERL's differentiating features as opposed to improvements that might come from other ways of combining reward functions.
Methodology for Reported Metrics: For MATD3 and MADDPG, the team network was periodically tested on a set of task instances without any exploratory noise, and the average score was logged as its performance. For MERL and EA, we chose the team with the highest fitness as the champion for each generation. The champion was then tested on a set of task instances, and the average score was logged. This protocol shields the reported metrics from any bias due to population size. We conducted statistically independent runs with different random seeds and report the average, with error bars showing a confidence interval.
The Steps Metric: All scores reported are compared against the number of environment steps (frames). A step is defined as the multiagent team taking a joint action and receiving feedback from the environment. To make the comparisons fair across single-team and population-based algorithms, all steps taken by all teams in the population are counted cumulatively.
Rover Domain Setup: For each coupling requirement n, a team of robots was initialized alongside a set of POIs spread out in the world. The coordination problem to be tackled was thus two-fold. First, the team of robots had to learn to form sub-teams of size n. Next, the sub-teams needed to coordinate with each other to spread out and cover different POIs, ensuring that all POIs were observed within the time allocated. Both the team formation and the coordinated spreading out have to be done autonomously and adaptively, based on the distribution of the robots and POIs (varied randomly for each instance of the task). This is the core difficulty of the task.
Figure 4 shows the comparative performance of MERL, MADDPG (global and mixed), MATD3 (global and mixed), and EA in the rover domain across a range of coupling requirements. MERL significantly outperforms all baselines across all coupling requirements, and the baselines clearly degrade quickly as the coupling grows. An increasing coupling requirement corresponds to increasing difficulty of joint-space exploration and entanglement in the global coordination objective. However, it does not increase the size of the state space or the complexity of perception or navigation. This indicates that the degradation in performance is strictly due to the increased complexity of the coordination objective.
Notably, MERL is able to learn at high coupling requirements where methods without explicit reward shaping have been shown to fail entirely (Rahmattalabi et al., 2016). This is consistent with the performance of our baselines, as none of them use explicit domain-specific reward shaping. MERL successfully completes the task using the same set of information and the same coarse, unshaped reward functions as the other algorithms. The primary mechanism that enables this is MERL's bi-level approach, whereby it leverages the local reward function to solve navigation and perception while concurrently using the global reward function to learn team formation and effective coordination.
Team Behaviors: Figure 5 illustrates the trajectories generated in the rover domain under a tight coupling requirement. The trajectories for partially and fully trained MERL are shown in Figure 5(a) and (b), respectively. During training, before MERL has discovered success on the global coordination objective (no POIs are successfully observed), it simply proceeds to optimize the local objective for each robot. This produces trajectories such as those shown in Figure 5(a), where each robot learns to go towards a POI.
Given this joint behavior, the probability of enough robots congregating at the same POI is higher than under random, undirected exploration by each robot. Once this scenario is stumbled upon, the global optimizer (EA) within MERL explicitly selects for agent policies that lead to such team-forming joint behaviors. Eventually it succeeds, as shown in Figure 5(b). Here, team formation and collaborative pursuit of the POIs are immediately apparent: two sub-teams of robots form at the start of the episode. Further, the two teams also coordinate with each other to pursue different POIs in order to maximize the global team reward. While the POI allocation is not perfect (the POI at the bottom is left unattended), they do succeed in observing most of the POIs.
In contrast, MATD3-mixed fails to observe any POI. From the trajectories, it is apparent that the robots have successfully learned to perceive and navigate to reach POIs. However, they are unable to use this sub-skill towards fulfilling the coordination objective. Instead, the robots are split on the objective they are optimizing: some seem to be in sole pursuit of POIs without any regard for team formation or collaboration, while others exhibit random movements.
The primary reason for this is the mixed reward function, which directly combines the local and global rewards. Since the two reward functions carry no guarantee of alignment across the state space of the task, they invariably lead to sub-optimal joint behaviors that solve some scalarized mixed objective. In practice, this problem can be addressed by manually tuning the scalarization coefficients to achieve the required coordination behavior. However, without such manual reward shaping, MATD3-mixed fails to solve the task. In contrast, MERL solves the task without any reward shaping or manual tuning.
6 Discussion

In this paper, we introduced MERL, a hybrid algorithm that can combine global objectives with local objectives even when they are not aligned with each other. MERL achieves this by using a fast policy-gradient local optimizer to exploit dense local rewards while concurrently leveraging a global optimizer (EA) to tackle the coordination aspects of the task.
Results demonstrate that MERL significantly outperforms MADDPG, the state-of-the-art multiagent RL method. We also tested a modification of MADDPG that integrates TD3 - the state-of-the-art single-agent RL algorithm - as well as variations that utilized either only global, or mixed global and local rewards. These experiments demonstrate that the core improvements of MERL come from its combination of EA and policy-gradient learning, which enables MERL to combine multiple objectives without relying on alignment between those objectives. This differentiates MERL from approaches like reward scalarization and reward shaping that either require extensive manual tuning or risk detrimentally changing the underlying MDP itself (Ng et al., 1999).
Here, we limited our focus to cooperative domains. Future work will explore MERL for adversarial settings such as Pommerman (Resnick et al., 2018), StarCraft (Justesen and Risi, 2017; Vinyals et al., 2017) and RoboCup (Kitano et al., 1995; Liu et al., 2019). Further, MERL can be considered a bi-level approach to combine local and global objectives. Extending MERL to generalized multilevel rewards is another promising area for future work.
References

- Agogino and Tumer  A. K. Agogino and K. Tumer. Unifying temporal and structural credit assignment problems. In Proceedings of the Third International Joint Conference on Autonomous Agents and Multiagent Systems-Volume 2, pages 980–987. IEEE Computer Society, 2004.
- Colas et al.  C. Colas, O. Sigaud, and P.-Y. Oudeyer. Gep-pg: Decoupling exploration and exploitation in deep reinforcement learning algorithms. arXiv preprint arXiv:1802.05054, 2018.
- Devlin et al.  S. Devlin, M. Grześ, and D. Kudenko. Multi-agent, reward shaping for robocup keepaway. In The 10th International Conference on Autonomous Agents and Multiagent Systems-Volume 3, pages 1227–1228. International Foundation for Autonomous Agents and Multiagent Systems, 2011.
- Devlin and Kudenko  S. M. Devlin and D. Kudenko. Dynamic potential-based reward shaping. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems, pages 433–440. IFAAMAS, 2012.
- Evans and Gao  R. Evans and J. Gao. Deepmind AI reduces google data centre cooling bill by 40%. DeepMind blog, 20, 2016.
- Floreano et al.  D. Floreano, P. Dürr, and C. Mattiussi. Neuroevolution: from architectures to learning. Evolutionary Intelligence, 1(1):47–62, 2008.
- Foerster et al. [2017]  J. Foerster, N. Nardelli, G. Farquhar, T. Afouras, P. H. Torr, P. Kohli, and S. Whiteson. Stabilising experience replay for deep multi-agent reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1146–1155. JMLR.org, 2017.
- Foerster et al. [2018a] J. Foerster, R. Y. Chen, M. Al-Shedivat, S. Whiteson, P. Abbeel, and I. Mordatch. Learning with opponent-learning awareness. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pages 122–130. International Foundation for Autonomous Agents and Multiagent Systems, 2018a.
- Foerster et al. [2018b]  J. N. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson. Counterfactual multi-agent policy gradients. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018b.
- Fogel  D. B. Fogel. Evolutionary computation: toward a new philosophy of machine intelligence, volume 1. John Wiley & Sons, 2006.
- Fujimoto et al.  S. Fujimoto, H. van Hoof, and D. Meger. Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477, 2018.
- Jaderberg et al.  M. Jaderberg, V. Dalibard, S. Osindero, W. M. Czarnecki, J. Donahue, A. Razavi, O. Vinyals, T. Green, I. Dunning, K. Simonyan, et al. Population based training of neural networks. arXiv preprint arXiv:1711.09846, 2017.
- Justesen and Risi  N. Justesen and S. Risi. Learning macromanagement in starcraft from replays using deep learning. In 2017 IEEE Conference on Computational Intelligence and Games (CIG), pages 162–169. IEEE, 2017.
- Khadka and Tumer  S. Khadka and K. Tumer. Evolution-guided policy gradient in reinforcement learning. In Advances in Neural Information Processing Systems, pages 1196–1208, 2018.
- Khadka et al.  S. Khadka, S. Majumdar, T. Nassar, Z. Dwiel, E. Tumer, S. Miret, Y. Liu, and K. Tumer. Collaborative evolutionary reinforcement learning. arXiv preprint arXiv:1905.00976v2, 2019.
- Kitano et al.  H. Kitano, M. Asada, Y. Kuniyoshi, I. Noda, and E. Osawa. Robocup: The robot world cup initiative, 1995.
- Lazaridou et al.  A. Lazaridou, A. Peysakhovich, and M. Baroni. Multi-agent cooperation and the emergence of (natural) language. arXiv preprint arXiv:1612.07182, 2016.
- Leibo et al.  J. Z. Leibo, V. Zambaldi, M. Lanctot, J. Marecki, and T. Graepel. Multi-agent reinforcement learning in sequential social dilemmas. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, pages 464–473. International Foundation for Autonomous Agents and Multiagent Systems, 2017.
- Li et al.  F.-D. Li, M. Wu, Y. He, and X. Chen. Optimal control in microgrid using multi-agent reinforcement learning. ISA transactions, 51(6):743–751, 2012.
- Lillicrap et al.  T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
- Littman  M. L. Littman. Markov games as a framework for multi-agent reinforcement learning. In Machine learning proceedings 1994, pages 157–163. Elsevier, 1994.
- Liu et al.  S. Liu, G. Lever, J. Merel, S. Tunyasuvunakool, N. Heess, and T. Graepel. Emergent coordination through competition. arXiv preprint arXiv:1902.07151, 2019.
- Lowe et al.  R. Lowe, Y. Wu, A. Tamar, J. Harb, O. P. Abbeel, and I. Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, pages 6379–6390, 2017.
- Lüders et al.  B. Lüders, M. Schläger, A. Korach, and S. Risi. Continual and one-shot learning through neural networks with dynamic external memory. In European Conference on the Applications of Evolutionary Computation, pages 886–901. Springer, 2017.
- Mnih et al.  V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
- Mordatch and Abbeel  I. Mordatch and P. Abbeel. Emergence of grounded compositional language in multi-agent populations. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- Ng et al.  A. Y. Ng, D. Harada, and S. Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, pages 278–287, 1999.
- Perolat et al.  J. Perolat, J. Z. Leibo, V. Zambaldi, C. Beattie, K. Tuyls, and T. Graepel. A multi-agent reinforcement learning model of common-pool resource appropriation. In Advances in Neural Information Processing Systems, pages 3643–3652, 2017.
- Rahmattalabi et al.  A. Rahmattalabi, J. J. Chung, M. Colby, and K. Tumer. D++: Structural credit assignment in tightly coupled multiagent domains. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4424–4429. IEEE, 2016.
- Resnick et al.  C. Resnick, W. Eldridge, D. Ha, D. Britz, J. Foerster, J. Togelius, K. Cho, and J. Bruna. Pommerman: A multi-agent playground. arXiv preprint arXiv:1809.07124, 2018.
- Shalev-Shwartz et al.  S. Shalev-Shwartz, S. Shammah, and A. Shashua. Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295, 2016.
- Sheng et al.  W. Sheng, Q. Yang, J. Tan, and N. Xi. Distributed multi-robot coordination in area exploration. Robotics and Autonomous Systems, 54(12):945–955, 2006.
- Silver et al.  D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017.
- Spears et al.  W. M. Spears, K. A. De Jong, T. Bäck, D. B. Fogel, and H. De Garis. An overview of evolutionary computation. In European Conference on Machine Learning, pages 442–459. Springer, 1993.
- Sutton and Barto  R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction, volume 1. MIT Press, Cambridge, 1998.
- Thrun et al.  S. Thrun, W. Burgard, and D. Fox. A real-time algorithm for mobile robot mapping with applications to multi-robot and 3d mapping. In ICRA, volume 1, pages 321–328, 2000.
- Tumer and Agogino  K. Tumer and A. Agogino. Distributed agent-based air traffic flow management. In Proceedings of the 6th international joint conference on Autonomous agents and multiagent systems, page 255. ACM, 2007.
- Vinyals et al.  O. Vinyals, T. Ewalds, S. Bartunov, P. Georgiev, A. S. Vezhnevets, M. Yeo, A. Makhzani, H. Küttler, J. Agapiou, J. Schrittwieser, et al. StarCraft II: A new challenge for reinforcement learning. arXiv preprint arXiv:1708.04782, 2017.
- Williamson et al.  S. A. Williamson, E. H. Gerding, and N. R. Jennings. Reward shaping for valuing communications during multi-agent coordination. In Proceedings of The 8th International Conference on Autonomous Agents and Multiagent Systems-Volume 1, pages 641–648. International Foundation for Autonomous Agents and Multiagent Systems, 2009.
- Xie et al.  Z. Xie, P. Clary, J. Dao, P. Morais, J. Hurst, and M. van de Panne. Iterative reinforcement learning based design of dynamic locomotion skills for cassie. arXiv preprint arXiv:1903.09537, 2019.
- Yliniemi et al.  L. Yliniemi, A. K. Agogino, and K. Tumer. Multirobot coordination for space exploration. AI Magazine, 35(4):61–74, 2014.
Appendix A Hyperparameters Description
|Hyperparameter|MERL|MATD3/MADDPG|
|---|---|---|
|Actor Learning Rate|||
|Critic Learning Rate|||
|Discount Rate|||
|Replay Buffer Size|||
|Super Mutation Probability||N/A|
|Reset Mutation Probability||N/A|
|Number of Elites||N/A|
|Rollouts per Fitness||N/A|
|Actor Neural Architecture|||
|Critic Neural Architecture|||
|TD3 Policy Noise Variance|||
|TD3 Policy Noise Clip|||
|TD3 Policy Update Frequency|||
Table 1 details the hyperparameters used for MERL, MATD3, and MADDPG. The hyperparameters were kept consistent across all experiments. The hyperparameters themselves are defined below:
Optimizer = Adam
The Adam optimizer was used to update both the actor and critic networks for all learners.
Population Size
This parameter controls the number of different actors (policies) that are present in the evolutionary population.
Rollout Size
This parameter controls the number of rollout workers (each running an episode of the task) per generation.
Note: The two parameters above (population size and rollout size) collectively modulate the proportion of exploration carried out through noise in the actor’s parameter space and in its action space.
Tau
This parameter controls the magnitude of the soft update between the actor and critic networks and their target counterparts.
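For concreteness, the soft update can be sketched as Polyak averaging over the network parameters. The function name and the flat list-of-arrays representation here are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def soft_update(target_params, source_params, tau):
    """Polyak-average the source network's weights into the target network.

    tau is the soft-update rate: tau=1 copies the source outright, while a
    small tau moves the target only slightly toward the source per update.
    """
    return [(1.0 - tau) * t + tau * s
            for t, s in zip(target_params, source_params)]
```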
Actor Learning Rate
This parameter controls the learning rate of the actor network.
Critic Learning Rate
This parameter controls the learning rate of the critic network.
Discount Rate
This parameter controls the discount rate used to compute the return optimized by the policy gradient.
Replay Buffer Size
This parameter controls the size of the replay buffer. After the buffer is filled, the oldest experiences are deleted in order to make room for new ones.
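A minimal sketch of such a FIFO replay buffer, using a bounded deque so that the oldest experiences are evicted automatically (an illustrative stand-in, not the actual implementation):

```python
import random
from collections import deque

class ReplayBuffer:
    """FIFO experience buffer: once capacity is reached, the oldest
    transition is deleted to make room for new ones."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # maxlen handles eviction

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Uniform sampling without replacement from the stored transitions.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```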
Batch Size
This parameter controls the batch size used to compute the gradients.
Actor Activation Function
Hyperbolic tangent was used as the activation function.
Critic Activation Function
Hyperbolic tangent was used as the activation function.
Number of Elites
This parameter controls the fraction of the population that is categorized as elite. Since an elite individual (actor) is shielded from the mutation step and preserved as-is, the elite fraction modulates the degree of exploration versus exploitation within the evolutionary population.
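Elite selection under this scheme can be illustrated as follows; the function name and the list-based fitness representation are hypothetical:

```python
def select_elites(population, fitnesses, elite_frac):
    """Return the top elite_frac of the population by fitness.

    Elites are carried over to the next generation unchanged
    (shielded from mutation)."""
    n_elites = max(1, int(len(population) * elite_frac))
    # Rank individuals by descending fitness and keep the top slice.
    order = sorted(range(len(population)),
                   key=lambda i: fitnesses[i], reverse=True)
    return [population[i] for i in order[:n_elites]]
```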
Mutation Probability
This parameter represents the probability that an actor goes through a mutation operation between generations.
Mutation Fraction
This parameter controls the fraction of the weights in a chosen actor (neural network) that are mutated, once the actor is selected for mutation.
Mutation Strength
This parameter controls the standard deviation of the Gaussian noise applied during mutation.
Super Mutation Probability
This parameter controls the probability that a super mutation (larger mutation) happens in place of a standard mutation.
Reset Mutation Probability
This parameter controls the probability that a neural weight is reset to a random value rather than being mutated.
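Taken together, the mutation-related parameters above can be sketched as an operator over a flat weight vector. All names and default values below are illustrative (not the values used in the experiments), and the super mutation is shown as 10x the standard strength purely for illustration:

```python
import numpy as np

def mutate_weights(weights, frac=0.1, strength=0.1,
                   super_prob=0.05, reset_prob=0.05, rng=None):
    """Gaussian mutation of a flat weight vector.

    A `frac` fraction of weights is selected for mutation. Each selected
    weight usually receives Gaussian noise of std `strength`; with
    probability `super_prob` it instead receives a larger "super" mutation,
    and with probability `reset_prob` it is reset to a fresh random value.
    """
    rng = rng or np.random.default_rng()
    w = weights.copy()
    chosen = rng.random(w.shape) < frac       # which weights to mutate
    for i in np.flatnonzero(chosen):
        r = rng.random()
        if r < reset_prob:                    # reset mutation
            w[i] = rng.normal(0.0, 1.0)
        elif r < reset_prob + super_prob:     # super (large) mutation
            w[i] += rng.normal(0.0, 10.0 * strength)
        else:                                 # standard mutation
            w[i] += rng.normal(0.0, strength)
    return w
```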
Exploration Noise
This parameter controls the standard deviation of the Gaussian noise added to the actor’s actions during exploration by the learners (learner rollouts).
TD3 Policy Noise Variance
This parameter controls the standard deviation of the Gaussian noise added to the target policy’s output before applying the Bellman backup. This is often referred to as the magnitude of policy smoothing in TD3.
TD3 Policy Noise Clip
This parameter controls the maximum magnitude to which the policy-smoothing noise is clipped.
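The two smoothing parameters above act together: noise is sampled, clipped to the noise-clip bound, added to the target action, and the result is clipped back to the valid action range. A sketch, with assumed action bounds of [-1, 1] and illustrative default values:

```python
import numpy as np

def smoothed_target_action(target_policy, next_state,
                           noise_std=0.2, noise_clip=0.5,
                           action_low=-1.0, action_high=1.0):
    """TD3 target policy smoothing: add clipped Gaussian noise to the
    target policy's action before it enters the Bellman backup."""
    action = target_policy(next_state)
    noise = np.clip(np.random.normal(0.0, noise_std, size=action.shape),
                    -noise_clip, noise_clip)
    # Keep the smoothed action inside the valid action range.
    return np.clip(action + noise, action_low, action_high)
```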
TD3 Policy Update Frequency
This parameter controls the number of critic updates per policy update in TD3.
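The delayed-update schedule this parameter induces can be sketched with a hypothetical callback interface: the critic is trained every iteration, while the actor (and targets) are updated only every `policy_freq` iterations:

```python
def run_updates(n_updates, policy_freq, update_critic, update_actor):
    """TD3-style delayed policy updates.

    `update_critic` runs every iteration; `update_actor` (which would also
    refresh the target networks) runs once per `policy_freq` iterations.
    """
    for step in range(1, n_updates + 1):
        update_critic()
        if step % policy_freq == 0:
            update_actor()
```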
Appendix B Rollout Methodology
Algorithm 2 describes a rollout episode under MERL, detailing the connections between the local reward, the global reward, and the associated replay buffers.
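Algorithm 2 itself is not reproduced here; the following sketch conveys the same structure under a hypothetical environment interface that returns both per-agent local rewards and a scalar global reward:

```python
def merl_rollout(env, team, buffers):
    """One MERL episode (sketch).

    Each agent's transition, tagged with its *local* reward, is stored in
    that agent's replay buffer (consumed by the policy-gradient learner).
    The accumulated *global* reward is returned as the team's fitness
    (consumed by the evolutionary optimizer).
    """
    obs = env.reset()
    fitness, done = 0.0, False
    while not done:
        actions = [pi(o) for pi, o in zip(team, obs)]
        next_obs, local_rewards, global_reward, done = env.step(actions)
        for i, buf in enumerate(buffers):
            buf.append((obs[i], actions[i], local_rewards[i],
                        next_obs[i], done))
        fitness += global_reward
        obs = next_obs
    return fitness
```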