Evolutionary Reinforcement Learning for Sample-Efficient Multiagent Coordination

by Shauharda Khadka, et al.
Oregon State University

A key challenge for Multiagent RL (Reinforcement Learning) is the design of agent-specific, local rewards that are aligned with sparse global objectives. In this paper, we introduce MERL (Multiagent Evolutionary RL), a hybrid algorithm that does not require an explicit alignment between local and global objectives. MERL uses fast, policy-gradient based learning for each agent by utilizing their dense local rewards. Concurrently, an evolutionary algorithm is used to recruit agents into a team by directly optimizing the sparser global objective. We explore problems that require coupling (a minimum number of agents required to coordinate for success), where the degree of coupling is not known to the agents. We demonstrate that MERL's integrated approach is more sample-efficient and retains performance better with increasing coupling orders compared to MADDPG, the state-of-the-art policy-gradient algorithm for multiagent coordination.




1 Introduction

Deep Reinforcement Learning (DRL) has been successfully applied to a range of challenging tasks such as Atari games (Mnih et al., 2015), industrial data center cooling applications (Evans and Gao, 2016) and controlling humanoids (Xie et al., 2019). Most of these tasks involve single agents, where the agent’s local objective is identical to the global system objective.

However, many real world applications like air traffic control (Tumer and Agogino, 2007), multi-robot coordination (Sheng et al., 2006; Yliniemi et al., 2014), communication and language (Lazaridou et al., 2016; Mordatch and Abbeel, 2018), and autonomous driving (Shalev-Shwartz et al., 2016) involve multiple agents interacting with each other. Unfortunately, traditional DRL approaches are ill-suited to tackling multiagent problems due to a host of challenges including non-stationary environments (Foerster et al., 2017; Lowe et al., 2017), structural credit assignment (Agogino and Tumer, 2004; Rahmattalabi et al., 2016), and the explosion of the search space with increasing number of agents (Li et al., 2012).

Consider soccer where a team of agents coordinate to achieve a global objective of winning. Directly optimizing this objective to train each agent is sub-optimal due to two reasons. First, it fails to encapsulate the contributions of individual agents to the final result. Second, it is usually very sparse - a single scalar capturing performance of an entire team operating across an extended period of time. This makes it a weak metric to learn on. Domain knowledge has been used to design agent-specific rewards (Devlin et al., 2011; Williamson et al., 2009). However, this is not very generalizable. For example, a team that is winning may benefit from protecting its lead by temporarily being more defensive. This objective now becomes misaligned with the local objectives of the strikers that prioritize scoring. This leads to sub-optimal coordination overall.

In this paper, we introduce Multiagent Evolutionary Reinforcement Learning (MERL), a hybrid algorithm that combines gradient-based and gradient-free learning to address sparse and noisy coordination objectives without the need to manually design agent-specific rewards to align with a global objective. MERL employs a two-level approach: a local optimizer (policy gradient) learns using local rewards computed directly over each agent’s observation set. This has the advantage of being high-fidelity and dense - a perfect signal to learn non-coordination related aspects of the world such as perception and navigation. A global optimizer (evolutionary algorithm) learns to directly optimize the global reward which encodes the true system goal. The two processes operate concurrently and share information.

Our hypothesis is that the solution to the coordination task often exists on a smaller manifold than that for the related navigation and perception tasks. For instance in soccer, assume that each player has mastered their self-oriented skills such as perceiving the world, passing, dribbling, and running. Given these skills, the coordination aspect of the game can be roughly reduced to planning who/when to make passes and what spaces to occupy and when. The search space for learning the self-oriented skills is significantly larger than that for coordination. MERL leverages this split within the structure of the task: employing local rewards coupled with fast policy gradient methods to learn the self-oriented skills while employing the less powerful but more general global optimizer (neuroevolution) to learn coordination skills.

A key strength of MERL is that it optimizes the true learning goal (global reward) directly while leveraging local rewards as an auxiliary signal. This is in stark contrast to reward shaping techniques that construct a proxy reward to incentivize the attainment of the global reward (Agogino and Tumer, 2004; Devlin and Kudenko, 2012). Apart from requiring domain knowledge and manual tuning, this approach also poses risks of changing the underlying problem itself (Ng et al., 1999). MERL, on the other hand, is not susceptible to this mode of failure and is guaranteed to optimize the global reward. We test MERL in a multi-rover domain with increasingly complex coordination objectives. Results demonstrate that MERL significantly outperforms state-of-the-art multiagent reinforcement learning methods like MADDPG while using the same set of information and reward functions.

2 Background and Related Work

Markov Games:

A standard reinforcement learning (RL) setting is often formalized as a Markov Decision Process (MDP) and consists of an agent interacting with an environment over a finite number of discrete time steps. This formulation can be extended to multiagent systems in the form of partially observable Markov games (Littman, 1994; Lowe et al., 2017). An $N$-agent Markov game is defined by $\mathcal{S}$, a global state of the world, and a set of observations $\{\mathcal{O}_i\}$ and actions $\{\mathcal{A}_i\}$ corresponding to the $N$ agents. At each time step $t$, each agent $i$ observes its corresponding observation $o_i^t$ and maps it to an action $a_i^t$ using its policy $\pi_i$.

Each agent receives a scalar reward $r_i^t$ based on the global state $s_t$ and joint action of the team. The world then transitions to the next state $s_{t+1}$, which produces a new set of observations. The process continues until a terminal state is reached. $R_i = \sum_{t=0}^{T} \gamma^t r_i^t$ is the total return for agent $i$ with discount factor $\gamma$. Each agent aims to maximize its expected return.
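As a concrete illustration of the total return described above, the following minimal Python sketch computes the discounted sum of one agent's rewards. The function name `discounted_return` and the example reward sequence are illustrative assumptions, not part of the original formulation.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma^t * r_t for one agent's reward sequence."""
    total = 0.0
    # Accumulate backwards so each step applies one multiplication by gamma.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical three-step episode for a single agent.
example = discounted_return([1.0, 0.0, 2.0], gamma=0.5)
```

With a discount factor below one, rewards received later in the episode contribute less to the return than early ones.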


Policy gradient (PG) methods frame the goal of maximizing the expected return as the minimization of a loss function. A widely used PG method for continuous, high-dimensional action spaces is DDPG (Lillicrap et al., 2015). Recently, Fujimoto et al. (2018) extended DDPG to Twin Delayed DDPG (TD3), addressing its well-known overestimation problem. TD3 is the state-of-the-art, off-policy algorithm for model-free DRL in continuous action spaces.

TD3 uses an actor-critic architecture (Sutton and Barto, 1998) maintaining a deterministic policy (actor) $\pi$ and two distinct critics $Q_1$ and $Q_2$. Each critic independently approximates the actor’s action-value function $Q^{\pi}$. Separate copies of the actor and critics are kept as target networks for stability and are updated periodically. A noisy version of the actor is used to explore the environment during training. The actor is trained using a noisy version of the sampled policy gradient computed by backpropagation through the combined actor-critic networks. This mitigates overfitting of the deterministic policy by smoothing the policy gradient updates.
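The way TD3 counters overestimation can be sketched in a few lines: the target value uses the smaller of the two target critics, and the target action is perturbed with clipped noise. The sketch below assumes scalar actions and plain Python callables for the target actor and critics; the function name `td3_target` and all default values are illustrative assumptions rather than the settings used in this paper.

```python
import random


def td3_target(reward, next_obs, done, target_actor, target_critics,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, max_action=1.0,
               rng=random):
    """Clipped double-Q target for one transition with a scalar action."""
    # Target policy smoothing: perturb the target action with clipped noise.
    noise = max(-noise_clip, min(noise_clip, rng.gauss(0.0, noise_std)))
    action = target_actor(next_obs) + noise
    action = max(-max_action, min(max_action, action))
    # Take the smaller of the two target critics to curb overestimation.
    q_min = min(critic(next_obs, action) for critic in target_critics)
    return reward + gamma * (0.0 if done else q_min)
```

Both critics would then be regressed towards this single target, while the actor is updated less frequently than the critics.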

Evolutionary Reinforcement Learning (ERL) is a hybrid algorithm that combines Evolutionary Algorithms (EAs) (Floreano et al., 2008; Lüders et al., 2017; Fogel, 2006; Spears et al., 1993), with policy gradient methods (Khadka and Tumer, 2018). Instead of discarding the data generated during a standard EA rollout, ERL stores this data in a central replay buffer shared with the policy gradient’s own rollouts - thereby increasing the diversity of the data available for the policy gradient learners. Since the EA directly optimizes for episode-wide return, it biases exploration towards states with higher long-term returns. The policy gradient algorithm which learns using this state distribution inherits this implicit bias towards long-term optimization. Concurrently, the actor trained by the policy gradient algorithm is inserted into the evolutionary population allowing the EA to benefit from the fast gradient-based learning.

Related Work: Lowe et al. (2017) introduced MADDPG which tackled the inherent non-stationarity of a multiagent learning environment by leveraging a critic which had full access to the joint state and action during training. Foerster et al. (2018b) utilized a similar setup with a centralized critic across agents to tackle StarCraft micromanagement tasks. An algorithm that could explicitly model other agents’ learning was investigated in Foerster et al. (2018a). However, all these approaches rely on a dense local reward that is aligned with the global coordination objective. Methods to solve for these aligned agent-specific reward functions were investigated in Li et al. (2012) but were limited to tasks with strong simulators where tree-based planning could be used.

A closely related work to MERL is Liu et al. (2019), where Population-Based Training (PBT) (Jaderberg et al., 2017) is used to optimize the relative importance between a collection of dense, shaped rewards automatically during training. This can be interpreted as a singular central reward function constructed by scalarizing a collection of reward signals where the scalarization coefficients are adaptively learned during training. In contrast, MERL optimizes its reward functions independently, with information transfer across them facilitated directly through shared replay buffers and policy migration. This form of information transfer through a shared replay buffer has been explored extensively in recent literature (Colas et al., 2018; Khadka et al., 2019).

3 Motivating Example

(a) Rover domain
(b) MERL vs TD3 vs EA
Figure 1: (a) Rover domain with a clear misalignment between local and global reward functions (b) Comparative performance of MERL compared against TD3-mixed, TD3-global and EA.

Consider the rover domain (Agogino and Tumer, 2004), a classic multiagent task where a team of rovers coordinate to explore a region. The global objective is to observe all POIs (Points of Interest) distributed in the area. Each robot also receives a local reward defined as the negative distance to the closest POI. In Figure 1(a), a team of two rovers $R_1$ and $R_2$ seek to explore and observe POIs $P_1$ and $P_2$. $R_1$ is closer to $P_2$ and has enough fuel to reach either of the POIs whereas $R_2$ can only reach $P_2$. There is no communication between the rovers.

If $R_1$ optimizes only locally by pursuing the closer POI $P_2$, then the global objective is not achieved since $R_2$ can only reach $P_2$. The globally optimal solution for $R_1$ is to spend more fuel and pursue $P_1$ - this is misaligned with its locally optimal solution. This is related to sequential social dilemmas (Leibo et al., 2017; Perolat et al., 2017). Figure 1(b) shows the comparative performance of four algorithms - namely TD3-mixed, TD3-global, EA and MERL on this coordination task.

TD3-mixed and TD3-global optimize a scalarized version of the joint objective or just the global reward, respectively. Since the global reward is extremely sparse (only disbursed when a POI is observed), TD3-global fails to learn anything meaningful. In contrast, TD3-mixed, by virtue of its dense local reward component, successfully learns to perceive and navigate. However, the mixed reward is a static scalarization between local and global rewards that are not always aligned, as described in the preceding paragraph. TD3-mixed converges to the greedy local policy of pursuing $P_2$.

EA relies on randomly stumbling onto a solution - e.g., a navigation sequence that takes the rovers to the correct POIs. The probability of one of the rovers stumbling onto the nearest POI is significantly higher. This is also the policy that EA converges to.

MERL combines the core strengths of TD3 and EA. It exploits the local reward to first learn perception and navigation skills - treating it as a dense, auxiliary reward even though it is not aligned with the global objective. The task is then reduced to its coordination component - picking the right POI to go to. This is effectively tackled by the EA engine within MERL and enables it to find the optimal solution. This ability to leverage reward functions across multiple levels even when they are misaligned is the core strength of MERL.

4 Multiagent Evolutionary Reinforcement Learning

Figure 2: Team represented as multi-headed policy net

Policy Topology: We represent our multiagent (team) policies using a multi-headed neural network $\pi$ as illustrated in Figure 2. The head $\pi^k$ represents the $k$-th agent in the team. Given an incoming observation for agent $k$, only the output of $\pi^k$ is considered as agent $k$’s response. In essence, all agents act independently based on their own observations while sharing weights (and by extension, the features) in the lower layers (trunk). This is commonly used to improve learning speed (Silver et al., 2017). Further, each agent $k$ also has its own replay buffer $\mathcal{R}^k$, which stores its experience defined by the tuple (state, action, next state, local reward) for each episode of interaction with the environment (rollout) involving that agent.
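A multi-headed team policy of this kind can be sketched without any deep learning library. The class below is a minimal illustration, assuming a single shared hidden layer (the trunk), one linear output head per agent, tanh activations, and no bias terms; the class name `MultiHeadPolicy` and the layer sizes are hypothetical and do not reflect the architecture used in the experiments.

```python
import math
import random


class MultiHeadPolicy:
    """Shared trunk followed by one output head per agent."""

    def __init__(self, obs_dim, hidden_dim, act_dim, num_agents, seed=0):
        rng = random.Random(seed)

        def matrix(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                    for _ in range(rows)]

        # Trunk weights are shared by all agents.
        self.trunk = matrix(hidden_dim, obs_dim)
        # Each agent owns exactly one head.
        self.heads = [matrix(act_dim, hidden_dim) for _ in range(num_agents)]

    @staticmethod
    def _layer(weights, inputs):
        # Dense layer with tanh activation (no bias, for brevity).
        return [math.tanh(sum(w * x for w, x in zip(row, inputs)))
                for row in weights]

    def act(self, agent_id, observation):
        """Only the head belonging to agent_id produces that agent's action."""
        features = self._layer(self.trunk, observation)
        return self._layer(self.heads[agent_id], features)
```

Calling `act(k, observation)` evaluates the shared trunk and then only head `k`, so agents share features while still producing their own actions.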

Global Reward Optimization: Figure 3 illustrates the MERL algorithm. A population of multi-headed teams, each with the same topology, is initialized with random weights. The replay buffer $\mathcal{R}^k$ is shared by the $k$-th agent across all teams. The population is then evaluated for each rollout. The global reward for each team is disbursed at the end of the episode and is considered as its fitness score. A selection operator selects a portion of the population for survival with probability proportionate to their fitness scores. The weights of the teams in the population are probabilistically perturbed through mutation and crossover operators to create the next generation of teams. A portion of the teams with the highest relative fitness are preserved as elites. At any given time, the team with the highest fitness, or the champion, represents the best multiagent solution for the task.
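One generation of the evolutionary loop can be sketched as follows. This is a simplified illustration under several assumptions: each team is a flat list of weights, selection is done by tournament (as in Algorithm 1), every weight of a selected parent is mutated with Gaussian noise, and crossover is omitted for brevity. The function name `next_generation` and its default values are hypothetical.

```python
import random


def next_generation(population, fitnesses, num_elites, mutation_std=0.1,
                    tournament_size=3, rng=None):
    """Build the next population from flat weight lists and fitness scores."""
    rng = rng or random.Random(0)
    ranked = sorted(range(len(population)), key=lambda i: fitnesses[i],
                    reverse=True)
    # Elites are copied unchanged so the best teams are never lost.
    new_population = [list(population[i]) for i in ranked[:num_elites]]
    while len(new_population) < len(population):
        # Tournament selection: the fittest of a random subset becomes a parent.
        contenders = [rng.randrange(len(population))
                      for _ in range(tournament_size)]
        parent = population[max(contenders, key=lambda i: fitnesses[i])]
        # Gaussian mutation perturbs every weight of the selected parent.
        new_population.append([w + rng.gauss(0.0, mutation_std)
                               for w in parent])
    return new_population
```

Because the elites are carried over untouched, the champion's fitness cannot decrease from one generation to the next when evaluation is deterministic.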

Figure 3: High level schematic of MERL highlighting the integration of local and global reward functions

Policy Gradient: The procedure described so far resembles a standard EA except that each agent stores each of its experiences in its associated replay buffer instead of just discarding it. However, unlike EA, which only learns based on the low-fidelity global reward, MERL also learns from the experiences within episodes of a rollout using policy gradients. To enable this kind of "local learning", MERL initializes one multi-headed policy network $\pi_{pg}$ and one critic $Q$. A noisy version of $\pi_{pg}$ is then used to conduct its own set of rollouts in the environment, storing each agent $k$’s experiences in its corresponding buffer $\mathcal{R}^k$, similar to the evolutionary rollouts.

Local Reward Optimization: Crucially, each agent’s replay buffer is kept separate from that of every other agent to ensure diversity amongst the agents. The shared critic samples a random mini-batch uniformly from each replay buffer and uses it to update its parameters using gradient descent. Each agent then draws a mini-batch of experiences from its corresponding buffer and uses it to sample a policy gradient from the shared critic. Unlike the teams in the evolutionary population, which directly seek to optimize the global reward, the policy gradient network $\pi_{pg}$ seeks to maximize the local reward per agent while exploiting the experiences collected via evolution.

Local Global Migration: Periodically, the policy gradient network $\pi_{pg}$ is copied into the evolving population of teams and can propagate its features by participating in evolution. This is the core mechanism that combines policies learned via local and global rewards. Regardless of whether the two rewards are aligned, evolution ensures that only the performant derivatives of the migrated network are retained. This mechanism guarantees protection against destructive interference commonly seen when a direct scalarization between two reward functions is attempted. Further, the level of information exchange is automatically adjusted during the process of learning, in contrast to being manually tuned by an expert designer.
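The migration step itself is small. The sketch below assumes that the gradient-trained network overwrites the team with the lowest fitness, which matches the "weakest" wording in Algorithm 1; the function name `migrate` and the list-based representation of teams are illustrative assumptions.

```python
import copy


def migrate(population, fitnesses, pg_team):
    """Overwrite the lowest-fitness team with a copy of the gradient-trained team."""
    weakest = min(range(len(population)), key=lambda i: fitnesses[i])
    # Copy, so later gradient updates do not silently change the population member.
    population[weakest] = copy.deepcopy(pg_team)
    return weakest
```

If the migrated team performs poorly on the global reward, selection simply discards it in a later generation, which is how misaligned local learning is kept from harming the population.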

Algorithm 1 provides detailed pseudocode of the MERL algorithm. The choice of hyperparameters is explained in the Appendix. Additionally, our source code (https://tinyurl.com/y6erclts) is available online.

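Initialize a population of $M$ multi-head teams $pop_{\pi}$, each with weights $\theta^{\pi}$ initialized randomly
Initialize a shared critic $Q$ with weights $\theta^{Q}$
Initialize an ensemble of $N$ empty cyclic replay buffers $\mathcal{R}^k$, one for each agent
Define a white Gaussian noise generator $W_g$ and a random number generator $r() \in [0, 1)$
for generation = 1, $\infty$ do
    for team $\pi \in pop_{\pi}$ do
        $g$, $\mathcal{R}$ = Rollout($\pi$, $\mathcal{R}$, noise=None, $\xi$)
        _, $\mathcal{R}$ = Rollout($\pi_{pg}$, $\mathcal{R}$, noise=$W_g$, $\xi = 1$)
        Assign $g$ as $\pi$’s fitness
    end for
    Rank the population $pop_{\pi}$ based on fitness scores
    Select the first $e$ teams $\pi \in pop_{\pi}$ as elites
    Select the remaining $(M - e)$ teams $\pi$ from $pop_{\pi}$ to form set $S$ using tournament selection with replacement
    while $|S| < (M - e)$ do
        Single-point crossover between a randomly sampled $\pi \in e$ and $\pi \in S$ and append to $S$
    end while
    for agent $k = 1, N$ do
        Randomly sample a minibatch of $T$ transitions $(o_i, a_i, l_i, o_{i+1})$ from $\mathcal{R}^k$
        Compute $y_i = l_i + \gamma \min_{j=1,2} Q'_j(o_{i+1}, \tilde{a} \mid \theta^{Q'_j})$
        where $\tilde{a} = \pi'_{pg}(k, o_{i+1} \mid \theta^{\pi'_{pg}}) + \epsilon$ [action sampled from the $k$-th head of $\pi'_{pg}$]
        Update $Q$ by minimizing the loss: $L = \frac{1}{T} \sum_i \left(y_i - Q(o_i, a_i \mid \theta^{Q})\right)^2$
        Update $\pi^k_{pg}$ using the sampled policy gradient:
            $\nabla_{\theta^{\pi_{pg}}} J \approx \frac{1}{T} \sum_i \nabla_a Q(o, a \mid \theta^{Q})\big|_{o=o_i,\, a=\pi^k_{pg}(o_i)} \nabla_{\theta^{\pi_{pg}}} \pi^k_{pg}(o \mid \theta^{\pi_{pg}})\big|_{o=o_i}$
        Soft update target networks: $\theta^{\pi'} \Leftarrow \tau \theta^{\pi} + (1 - \tau) \theta^{\pi'}$ and $\theta^{Q'} \Leftarrow \tau \theta^{Q} + (1 - \tau) \theta^{Q'}$
        Migrate the policy gradient team: for the weakest $\pi \in pop_{\pi}$, $\pi \Leftarrow \pi_{pg}$
    end for
end for
Algorithm 1 Multiagent Evolutionary Reinforcement Learning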
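The soft update of the target networks in Algorithm 1 is a simple interpolation between the current and target weights. A minimal sketch, assuming weights are stored as flat Python lists and using the hypothetical name `soft_update`:

```python
def soft_update(target_weights, source_weights, tau):
    """Move each target weight a fraction tau towards its source weight."""
    return [tau * s + (1.0 - tau) * t
            for t, s in zip(target_weights, source_weights)]


# Hypothetical example: the target moves 10% of the way towards the source.
updated = soft_update([0.0, 1.0], [1.0, 1.0], tau=0.1)
```

A small value of the target weight makes the target networks change slowly, which stabilizes the regression targets used by the critic.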

5 Rover Domain

The domain used in this paper is a variant of the rover domain used in (Agogino and Tumer, 2004; Rahmattalabi et al., 2016). Here, a team of robots aim to observe Points of Interest (POIs) scattered across the environment. The robots start out in the center of the field, randomly distributed within an area of the total field. The POIs are initialized randomly outside this area with a minimum distance of m from any robot. This is inspired by real-world scenarios of exploration of an unknown environment where the team of robots are air-dropped towards the center of the field.

Robot Capabilities: Each robot is loosely based on the characteristics of a Pioneer robot (Thrun et al., 2000). Its observation space consists of two channels dedicated to detecting POIs and rovers, respectively. Each channel receives intensity information over resolution spanning the around the robot’s position. This is similar to a LIDAR. Since it returns the closest reflector within each bracket, occlusions make the problem partially observable. Each robot outputs two continuous actions: and representing change in heading and drive, respectively. The maximum change in heading is capped at per step while the maximum drive is capped at .

Reward Functions: The team’s global reward is the percentage of POIs observed at the end of an episode. This is computed and broadcast to each robot at the end of an episode. It is sparse, low-fidelity, and noisy from each agent’s point of view as it compiles the entire team’s joint state-action history onto a single scalar value. However, it is an appropriate metric to evaluate the overall performance of the team without having to simultaneously account for each agent’s navigation or perception skills.

Each robot also receives a local reward computed as the negative distance to the closest POI. In contrast to global reward, the local reward is dense, high-fidelity, and not noisy as it depends solely on the robot’s own observations and actions. Critically, this local reward is not necessarily aligned with the global objective since it does not aim to maximize the total number of POIs observed by the group. This makes it a good training metric for each agent to learn local skills like navigation to a particular POI without having to simultaneously account for the global objective.
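The local reward is straightforward to compute. The sketch below assumes 2D positions given as coordinate tuples and Euclidean distance; the function name `local_reward` is hypothetical.

```python
import math


def local_reward(robot_xy, poi_positions):
    """Negative Euclidean distance from one robot to its closest POI."""
    return -min(math.dist(robot_xy, poi) for poi in poi_positions)


# Hypothetical layout: the closest POI is 5 units away.
reward = local_reward((0.0, 0.0), [(3.0, 4.0), (6.0, 8.0)])
```

Note that the value depends only on the robot's own position relative to the POIs, which is why it carries no information about what the rest of the team is doing.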

Global Team Objective: The coordination objective in the rover domain is expressed as the coupling requirement (Rahmattalabi et al., 2016; Agogino and Tumer, 2004). A coupling requirement of $n$ means that $n$ robots are required to be within an activation distance of a POI simultaneously in order to observe it. In the simplest case, a coupling of $1$ defines a coordination problem where the robots need to spread out and explore the area on their own. This is similar to a set cover problem.

In contrast, a coupling of $n > 1$ defines a coordination problem where the robots need to form sub-teams of $n$ and explore the area jointly. The presence of $m$ robots within the activation distance of a POI, where $m < n$, generates no reward signal. This defines a tough exploration problem and is based on tasks like lifting a rock where multiple robots need to coordinate tightly to achieve any success at all.
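The effect of the coupling requirement on the global reward can be illustrated with a short sketch. It evaluates a single snapshot of robot positions, which is a simplification: in the domain, a POI counts as observed if the requirement is met at any point during the episode. The function name `global_reward` and the argument names are assumptions made for illustration.

```python
import math


def global_reward(robot_positions, poi_positions, coupling, activation_dist):
    """Fraction of POIs with at least `coupling` robots within range."""
    observed = 0
    for poi in poi_positions:
        nearby = sum(1 for robot in robot_positions
                     if math.dist(robot, poi) <= activation_dist)
        # Fewer robots than the coupling requirement yields no credit at all.
        if nearby >= coupling:
            observed += 1
    return observed / len(poi_positions)
```

With the same robot positions, raising the coupling requirement can only lower this value, which is what makes the reward progressively sparser.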

6 Results

Compared Baselines: We compare the performance of MERL with a standard neuroevolutionary algorithm (EA) (Fogel, 2006), MADDPG (Lowe et al., 2017) and MATD3, a variant of MADDPG that integrates the improvements of TD3 (Fujimoto et al., 2018) over DDPG. Internally, MERL uses EA and TD3 as its global and local optimizer, respectively.

MADDPG on the other hand was chosen as it is the state-of-the-art multiagent RL algorithm. We implemented MATD3 ourselves to ensure that the differences between MADDPG and MERL do not originate from having the more stable TD3 over DDPG.

Further, MADDPG and MATD3 were tested using either only global rewards or mixed (global + local) reward functions. The local reward function here is simply defined as the negative of the distance to the closest POI. MERL inherently leverages both reward functions while EA directly optimizes the global reward function. These variations for the baselines allow us to evaluate the efficacy of the differentiating features of MERL as opposed to improvements that might come from other ways of combining reward functions.

Methodology for Reported Metrics: For MATD3 and MADDPG, the team network was periodically tested on task instances without any exploratory noise. The average score was logged as its performance. For MERL and EA, we choose the team with the highest fitness as the champion for each generation. The champion was then tested on task instances, and the average score was logged. This protocol shielded the reported metrics from any bias of the population size. We conduct statistically independent runs with random seeds from and report the average with error bars showing a confidence interval.

The Steps Metric: All scores reported are compared against the number of environment steps (frames). A step is defined as the multiagent team taking a joint action and receiving feedback from the environment. To make the comparisons fair across single-team and population-based algorithms, all steps taken by all teams in the population are counted cumulatively.

Rover Domain Setup: For each coupling requirement of $n$, robots were initialized accompanied with POIs spread out in the world. The coordination problem to be tackled was thus two-fold. First, the team of robots had to learn to form sub-teams of size $n$. Next, the sub-teams needed to coordinate with each other to spread out and cover different POIs to ensure they observe all of them within the time allocated. Both the team formation and the coordinated spreading out have to be done autonomously and adaptively based on the distribution of the robots and POIs (varied randomly for each instance of the task). This is the core difficulty of the task.

(a) Coupling 1
(b) Coupling 2
(c) Coupling 3
(d) Coupling 4
(e) Coupling 5
(f) Coupling 6
(g) Coupling 7
(h) Legend
Figure 4: Performance on the Rover Domain with coupling varied from 1 to 7. MERL significantly outperforms other baselines while being robust to increasing complexity of the coordination objective.

Figure 4 shows the comparative performance of MERL, MADDPG (global and mixed), MATD3 (global and mixed), and EA tested in the rover domain with coupling requirements from 1 to 7. MERL significantly outperforms all baselines across all coupling requirements. The tested baselines clearly degrade quickly beyond a coupling of . The increasing coupling requirement is equivalent to increasing difficulty in joint-space exploration and entanglement in the global coordination objective. However, it does not increase the size of the state-space, complexity of perception, or navigation. This indicates that the degradation in performance is strictly due to the increase in complexity of the coordination objective.

Notably, MERL is able to learn on coupling greater than where methods without explicit reward shaping have been shown to fail entirely (Rahmattalabi et al., 2016). This is consistent with the performances of our baselines as none of them use explicit domain-specific reward shaping. MERL successfully completes the task using the same set of information and coarse, unshaped reward functions as the other algorithms. The primary mechanism that enables this is MERL’s bi-level approach whereby it leverages the local reward function to solve navigation and perception while concurrently using the global reward function to learn team formation and effective coordination.

Team Behaviors: Figure 5 illustrates the trajectories generated for the rover domain with a coupling of . The trajectories for partially and fully trained MERL are shown in Figure 5(a) and (b), respectively. During training, when MERL has not discovered success in the global coordination objective (no POIs are successfully observed), MERL simply proceeds to optimize the local objective for each robot. This allows it to reach trajectories such as the ones shown in 5(a) where each robot learns to go towards a POI.

Given this joint behavior, the probability of having robots congregate to the same POI is higher compared to random undirected exploration by each robot. Once this scenario is stumbled upon, the global optimizer (EA) within MERL will explicitly select for agent policies that lead to such team-forming joint behaviors. Eventually it succeeds as shown in Figure 5(b). Here, team formation and collaborative pursuit of the POIs is immediately apparent. Two teams of robots each form at the start of the episode. Further, the two teams also coordinate among each other to pursue different POIs in order to maximize the global team reward. While the POI allocation is not perfect, (the one in the bottom is left unattended) they do succeed in successfully observing out of the POIs.

(a) MERL (training)
(b) MERL (trained)
(c) MATD3 (trained)
Figure 5: Visualizations illustrating the trajectories generated with a coupling of . Red and black squares represent observed and unobserved POIs respectively

In contrast, MATD3-mixed fails to successfully observe any POI. From the trajectories, it is apparent that the robots have successfully learned to perceive and navigate to reach POIs. However, they are unable to use this sub-skill towards fulfilling the coordination objective. Instead each robot is rather split on the objective that it is optimizing. Some robots seem to be in sole pursuit of POIs without any regard for team formation or collaboration while others seem to exhibit random movements.

The primary reason for this is the mixed reward function that directly combines the local and global reward functions. Since the two reward functions have no guarantees of alignment across the state-space of the task, they invariably lead to learning these sub-optimal joint behaviors that solve a certain form of scalarized mixed objective. In practice, this problem can be addressed by manually tuning the scalarization coefficients to achieve the required coordination behavior. However, without such manual reward shaping, MATD3-mixed fails to solve the task. In contrast, MERL is able to solve the task without any reward shaping or manual tuning.

7 Conclusion

In this paper, we introduced MERL, a hybrid algorithm that can combine global objectives with local objectives even when they are not aligned with each other. MERL achieves this by using a fast policy-gradient local optimizer to exploit dense local rewards while concurrently leveraging a global optimizer (EA) to tackle the coordination aspects of the task.

Results demonstrate that MERL significantly outperforms MADDPG, the state-of-the-art multiagent RL method. We also tested a modification of MADDPG to integrate TD3 - the state-of-the-art single-agent RL algorithm - as well as variations that utilized only global, or global and local rewards. These experiments demonstrated that the core improvements of MERL come from the combination of EA and policy gradient algorithms, which enables MERL to combine multiple objectives without relying on alignment between those objectives. This differentiates MERL from other approaches like reward scalarization and reward shaping that either require extensive manual tuning or can detrimentally change the MDP (Ng et al., 1999) itself.

Here, we limited our focus to cooperative domains. Future work will explore MERL for adversarial settings such as Pommerman (Resnick et al., 2018), StarCraft (Justesen and Risi, 2017; Vinyals et al., 2017) and RoboCup (Kitano et al., 1995; Liu et al., 2019). Further, MERL can be considered a bi-level approach to combine local and global objectives. Extending MERL to generalized multilevel rewards is another promising area for future work.



Appendix A Hyperparameters Description

Hyperparameter MERL MATD3/MADDPG
Population size 20 N/A
Rollout size
Target weight
Actor Learning Rate
Critic Learning Rate
Discount Rate
Replay Buffer Size
Batch Size
Mutation Probability N/A
Mutation Fraction N/A
Mutation Strength N/A
Super Mutation Probability N/A
Reset Mutation Probability N/A
Number of elites N/A
Exploration Policy
Exploration Noise
Rollouts per fitness N/A
Actor Neural Architecture
Critic Neural Architecture
TD3 Policy Noise Variance
TD3 Policy Noise Clip
TD3 Policy Update Frequency
Table 1: Hyperparameters used for Experiments

Table 1 details the hyperparameters used for MERL, MATD3, and MADDPG. The hyperparameters were kept consistent across all experiments. The hyperparameters themselves are defined below:

  • Optimizer = Adam
    Adam optimizer was used to update both the actor and critic networks for all learners.

  • Population size
    This parameter controls the number of different actors (policies) that are present in the evolutionary population.

  • Rollout size
    This parameter controls the number of rollout workers (each running an episode of the task) per generation.

    Note: The two parameters above (population size and rollout size) collectively modulate the proportion of exploration carried out through noise in the actor’s parameter space and its action space.

  • Target weight
    This parameter controls the magnitude of the soft update between the actor and critic networks and their target counterparts.

  • Actor Learning Rate
    This parameter controls the learning rate of the actor network.

  • Critic Learning Rate
    This parameter controls the learning rate of the critic network.

  • Discount Rate
    This parameter controls the discount rate used to compute the return optimized by policy gradient.

  • Replay Buffer Size
    This parameter controls the size of the replay buffer. After the buffer is filled, the oldest experiences are deleted in order to make room for new ones.

  • Batch Size
    This parameter controls the batch size used to compute the gradients.

  • Actor Activation Function
    Hyperbolic tangent was used as the activation function.

  • Critic Activation Function
    Hyperbolic tangent was used as the activation function.

  • Number of Elites
    This parameter controls the fraction of the population that are categorized as elites. Since an elite individual (actor) is shielded from the mutation step and preserved as it is, the elite fraction modulates the degree of exploration/exploitation within the evolutionary population.

  • Mutation Probability
    This parameter represents the probability that an actor goes through a mutation operation between generations.

  • Mutation Fraction
    This parameter controls the fraction of the weights in a chosen actor (neural network) that are mutated, once the actor is chosen for mutation.

  • Mutation Strength
    This parameter controls the standard deviation of the Gaussian operation that comprises mutation.

  • Super Mutation Probability
    This parameter controls the probability that a super mutation (larger mutation) happens in place of a standard mutation.

  • Reset Mutation Probability
    This parameter controls the probability that a neural weight is reset rather than being mutated.

  • Exploration Noise
    This parameter controls the standard deviation of the Gaussian operation that comprises the noise added to the actor’s actions during exploration by the learners (learner roll-outs).

  • TD3 Policy Noise Variance
    This parameter controls the standard deviation of the Gaussian operation that comprises the noise added to the policy output before applying the Bellman backup. This is often referred to as the magnitude of policy smoothing in TD3.

  • TD3 Policy Noise Clip
    This parameter controls the maximum norm of the policy noise used to smooth the policy.

  • TD3 Policy Update Frequency
    This parameter controls the number of critic updates per policy update in TD3.
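To show how the mutation-related hyperparameters above interact, here is a minimal sketch of a mutation operator over a flat list of weights. It is an illustration under stated assumptions, not the operator used in the experiments: the super mutation is assumed to use a 10x larger standard deviation, a reset is assumed to redraw the weight from a standard Gaussian, and all default values are placeholders. The mutation probability (whether a given actor is mutated at all) would be applied by the caller.

```python
import random


def mutate(weights, mut_fraction=0.1, mut_strength=0.1, super_mut_prob=0.05,
           reset_prob=0.05, super_scale=10.0, rng=None):
    """Return a mutated copy of a flat weight list."""
    rng = rng or random.Random(0)
    mutated = list(weights)
    # Only a fraction of the weights is touched by the mutation.
    num_mutations = max(1, int(mut_fraction * len(mutated)))
    for idx in rng.sample(range(len(mutated)), num_mutations):
        roll = rng.random()
        if roll < super_mut_prob:
            # Super mutation: same Gaussian perturbation, larger spread (assumed 10x).
            mutated[idx] += rng.gauss(0.0, super_scale * mut_strength)
        elif roll < super_mut_prob + reset_prob:
            # Reset: discard the old value and redraw it (assumed standard Gaussian).
            mutated[idx] = rng.gauss(0.0, 1.0)
        else:
            # Standard mutation: small additive Gaussian noise.
            mutated[idx] += rng.gauss(0.0, mut_strength)
    return mutated
```

Elites would bypass this function entirely, which is how the number of elites trades off exploration against exploitation.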

Appendix B Rollout Methodology

Algorithm 2 describes an episode of rollout under MERL detailing the connections between the local reward, global reward, and the associated replay buffer.

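procedure Rollout($\pi$, $\mathcal{R}$, noise, $\xi$)
    fitness = 0
    for j = 1:$\xi$ do
        Reset environment and get initial joint state $js$
        while env is not done do
            Initialize an empty list of joint action $ja$ = []
            for each agent (actor head) $\pi^k \in \pi$ and $s_k$ in $js$ do
                $ja \Leftarrow ja \cup \pi^k(s_k \mid \theta^{\pi^k})$ + noise
            end for
            Execute $ja$ and observe joint local reward $jl$, global reward $g$ and joint next state $js'$
            for each replay buffer $\mathcal{R}^k \in \mathcal{R}$ and $s_k$, $a_k$, $l_k$, $s'_k$ in $js$, $ja$, $jl$, $js'$ do
                Append transition $(s_k, a_k, l_k, s'_k)$ to $\mathcal{R}^k$
            end for
            $js = js'$
            if env is done then
                fitness $\Leftarrow$ fitness + $g$
            end if
        end while
    end for
    Return fitness / $\xi$, $\mathcal{R}$
end procedure
Algorithm 2 Function Rollout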
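The rollout procedure can be sketched in Python as follows. The environment interface (`reset()` returning joint observations and `step()` returning joint next observations, joint local rewards, the global reward, and a done flag), the scalar actions, and the `ToyEnv` class are all assumptions made so that the example is self-contained; they are not the interface of the released source code.

```python
import random
from collections import deque


def rollout(policies, buffers, env, noise_std=0.0, num_episodes=1, rng=None):
    """Run episodes, fill per-agent buffers, return the mean global reward."""
    rng = rng or random.Random(0)
    fitness = 0.0
    for _ in range(num_episodes):
        joint_obs = env.reset()
        done = False
        while not done:
            # Each agent acts on its own observation, with optional Gaussian noise.
            joint_action = [policy(obs) + rng.gauss(0.0, noise_std)
                            for policy, obs in zip(policies, joint_obs)]
            joint_next, local_rewards, global_reward, done = env.step(joint_action)
            # Each agent's transition, with its local reward, goes to its own buffer.
            for buf, o, a, l, o2 in zip(buffers, joint_obs, joint_action,
                                        local_rewards, joint_next):
                buf.append((o, a, l, o2))
            joint_obs = joint_next
        # The global reward at the end of the episode is the fitness signal.
        fitness += global_reward
    return fitness / num_episodes


class ToyEnv:
    """Hypothetical two-step environment with scalar observations and actions."""

    def __init__(self, num_agents):
        self.num_agents = num_agents
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0] * self.num_agents

    def step(self, joint_action):
        self.t += 1
        next_obs = [float(self.t)] * self.num_agents
        local_rewards = [-abs(a) for a in joint_action]
        done = self.t >= 2
        return next_obs, local_rewards, (1.0 if done else 0.0), done


# Cyclic buffers: once full, the oldest transitions are dropped automatically.
buffers = [deque(maxlen=1000) for _ in range(2)]
score = rollout([lambda o: 0.5, lambda o: -0.5], buffers, ToyEnv(2),
                num_episodes=3)
```

The local rewards only ever reach the replay buffers, while the global reward only ever reaches the fitness score, mirroring the separation between the two optimizers.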