1 Introduction
Despite outstanding successes in specific domains such as games Silver et al. (2017); Jaderberg et al. (2019) and robotics Tobin et al. (2018); Akkaya et al. (2019), Reinforcement Learning (RL) algorithms are still far from being immediately applicable to complex sequential decision problems. Among the issues, a remaining burden is the need to find the right balance between exploitation and exploration. On the one hand, algorithms which do not explore enough can easily get stuck in poor local optima. On the other hand, exploring too much hinders sample efficiency and can even prevent users from applying RL to large real-world problems.
Dealing with this exploration-exploitation tradeoff has been the focus of many RL papers Tang et al. (2016); Bellemare et al. (2016); Fortunato et al. (2017); Plappert et al. (2017). Among other things, having a population of agents working in parallel in the same environment is now a common recipe to stabilize learning and improve exploration, as these parallel agents collect a more diverse set of samples. This has led to two approaches, namely distributed RL, where the agents are the same, and population-based training, where diversity between agents further favors exploration Jung et al. (2020); Parker-Holder et al. (2020). However, such methods certainly do not make the most efficient use of available computational resources, as the agents may collect highly redundant information.
Besides, the focus on sparse or deceptive reward problems led to the realization that seeking diversity independently from maximizing rewards might be a good exploration strategy Lehman and Stanley (2011); Eysenbach et al. (2018); Colas et al. (2018). More recently, it was established that if one can define a behavior space or outcome space, i.e. the smaller space that matters for deciding whether a behavior is successful, then maximizing diversity in this space might be the optimal strategy to find the sparse reward source Doncieux et al. (2019).
When the reward signal is not sparse though, one can do better than just looking for diversity. Trying to simultaneously maximize diversity and rewards has been formalized into the Quality-Diversity (QD) framework Pugh et al. (2016); Cully and Demiris (2017). The corresponding algorithms try to populate the outcome space as widely as possible with an archive of past solutions which are both diverse and reward efficient. To do so, they generally rely on evolutionary algorithms. Selecting diverse and reward-efficient solutions is then performed using the Pareto front of the diversity and reward efficiency landscape, or by populating a grid of outcome cells with reward-efficient solutions, as in the MAP-Elites algorithm Mouret and Clune (2015). In principle, the QD approach offers a great way to deal with the exploration-exploitation tradeoff, as it simultaneously ensures pressure towards both wide coverage of the outcome space and high return efficiency. However, these methods suffer from relying on evolutionary methods. Though they have been shown to be competitive with deep RL approaches provided enough computational power Salimans et al. (2017); Colas et al. (2020), they do not take advantage of the gradient's analytical form, and thus have to sample to estimate gradients, resulting in far worse sample efficiency than their deep RL counterparts
Sigaud and Stulp (2019). On the other hand, deep RL methods which leverage policy gradients have far better sample efficiency, but they struggle on problems that require strong exploration and are sensitive to poorly conditioned reward signals such as deceptive rewards Colas et al. (2018). This is in part because they explore in the action space, the state-action space or the policy space rather than in an outcome space.
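To make the grid-based mechanism of MAP-Elites mentioned above concrete, here is a minimal, generic sketch of its archive-insertion rule (an illustration under our own naming, not the implementation of Mouret and Clune (2015)): each cell of a grid over the outcome space keeps only the highest-fitness solution whose outcome landed in that cell.

```python
def map_elites_insert(grid, cell_of, outcome, fitness, solution):
    """MAP-Elites archive rule (generic sketch): each cell of a grid over
    the outcome space keeps only the highest-fitness solution that
    produced an outcome falling into that cell."""
    cell = cell_of(outcome)
    if cell not in grid or fitness > grid[cell][0]:
        grid[cell] = (fitness, solution)
    return grid
```

Coverage pressure comes from filling new cells; quality pressure comes from replacing a cell's incumbent only when the fitness is higher.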
In this work, we combine the general QD framework with policy gradient methods and capitalize on the strengths of both approaches. Our QD-RL algorithm explores in an outcome space and thus can solve problems that simultaneously require complex exploration and high-dimensional control capabilities. We investigate the properties of QD-RL by first controlling a low-dimensional agent in a maze, and then addressing Ant-Maze, a larger MuJoCo benchmark. We compare QD-RL to several recent algorithms which also combine a diversity objective and a return maximization method, namely the NS-ES family, which mixes evolution strategies with novelty search Conti et al. (2017), and the ME-ES algorithm Colas et al. (2020), which uses MAP-Elites to maintain a diverse and high-performing population. The latter has been shown to scale well enough to also address large MuJoCo benchmarks, but we show that QD-RL is several orders of magnitude more sample efficient than these competitors.
2 Related Work
We consider the general context of a fully observable Markov Decision Process (MDP) $(\mathcal{S}, \mathcal{A}, \mathcal{P}, r, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{P}$ is the transition function, $r$ is the reward function and $\gamma$ is a discount factor. The exploration-exploitation tradeoff being central in RL, the search for efficient exploration methods is ubiquitous in the domain. We focus on the relationship between our work and two families of methods: those which introduce explicit diversity into a multi-actor deep RL approach, and those which combine distinct mechanisms for exploration and exploitation.
Diversity in multi-actor RL
Managing several actors is now a well-established method to improve wall-clock time and stabilize learning Jaderberg et al. (2017). But including an explicit diversity criterion is a more recent trend.
The ARAC algorithm Doan et al. (2019) uses a combination of attraction and repulsion mechanisms between good agents and poor agents to ensure diversity in a population of agents trained in parallel. The algorithm shows improved performance on large continuous action benchmarks such as Humanoid-v2 and sparse reward variants. However, diversity is defined in the space of policy performance, so the drive towards novel behaviors could be strengthened.
The P3S-TD3 algorithm Jung et al. (2020) is an instance of population-based training where the parameters of the best actor are softly distilled into the rest of the population. To prevent the whole population from collapsing into a single agent, a simple diversity criterion is enforced so as to maintain a minimum distance between all agents. The algorithm shows good performance over a large set of continuous action benchmarks, including "delayed" variants where the reward is obtained only after a fixed number of time steps. However, the diversity criterion they use is far from guaranteeing efficient exploration of the outcome space, particularly in the absence of reward, and it seems that the algorithm mostly benefits from the higher stability of population-based training.
Compared to P3S-TD3, the DvD algorithm Parker-Holder et al. (2020) proposes a population-wide diversity criterion which consists in maximizing the volume spanned by the parameters of the agents in a latent space. This criterion better limits redundancy between the considered agents.
Like our work, all these methods use a population of deep RL agents and explicitly look for diversity among these agents. However, none of them addresses deceptive reward environments such as the mazes we consider in our work. Furthermore, none of them clearly separates the exploration and exploitation components, nor searches for diversity in the outcome space as QD-RL does.
Separated exploration and exploitation mechanisms
One extreme case of the separation between exploration and exploitation is "exploration-only" methods. The efficiency of this approach was first put forward in the evolutionary optimization literature Lehman and Stanley (2011); Doncieux et al. (2019) and then imported into the reinforcement learning literature with works such as Eysenbach et al. (2018), which gave rise to several recent follow-ups Pong et al. (2019); Lee et al. (2019); Islam et al. (2019). These methods have proven useful in the sparse reward case, but they are inherently limited when some reward signal can be used and maximized during exploration. A second approach is sequential combination. Similarly to us, the GEP-PG algorithm Colas et al. (2018) combines a diversity-seeking component, namely Goal Exploration Processes Forestier et al. (2017), and a deep RL algorithm, namely DDPG Lillicrap et al. (2015), and shows that combining them sequentially can overcome a deceptive gradient issue. This sequential exploration-then-exploitation combination is also present in Go-Explore Ecoffet et al. (2019), which explores first and then memorizes the sequence to look for a high-reward policy in Atari games, and in PBCS Matheron et al. (2020), which does the same in a continuous action domain. Again, this approach is limited when the reward signal can help drive the exploration process towards a satisfactory solution.
Removing the sequentiality limitation, some approaches use a population of agents with various exploration rates Badia et al. (2020). Along a different line, the CEM-RL algorithm Pourchot and Sigaud (2018) combines an evolutionary algorithm, CEM De Boer et al. (2005), and a deep RL algorithm, TD3 Fujimoto et al. (2018), in such a way that each component takes the lead when it is the most appropriate in the current situation. In doing so, CEM-RL benefits from the better sample efficiency of deep RL and from the higher stability of evolutionary methods. But the evolutionary part is not truly a diversity-seeking component and, being still an evolutionary method, it is not as sample efficient as TD3. A common feature between CEM-RL and our work is that the reward-seeking agents benefit from the findings of the other agents by sharing their replay buffer.
Closer to our quality-diversity inspired approach, Conti et al. (2017) propose NSR-ES and NSRA-ES. But, as outlined in Colas et al. (2020), these approaches are not sample efficient, and the diversity and environment reward functions are mixed in a less efficient way. The work most closely related to ours is Colas et al. (2020). Their ME-ES algorithm also optimizes both diversity and reward efficiency, using an archive and two ES populations. Instead of using a Pareto front, ME-ES uses the MAP-Elites approach, where the outcome space is split into cells that the algorithm has to cover. Using such a distributional ES approach has been shown to be critically more efficient than population-based GA methods Salimans et al. (2017), but our results show that it is still less sample efficient than off-policy deep RL methods, as it does not leverage direct access to the policy gradient.
3 QD-RL
We present QD-RL, a quality-diversity optimization method designed to address hard exploration problems where sample efficiency matters. As depicted in Figure 1, QD-RL optimizes a population of agents for both environment reward and diversity using off-policy policy gradient methods, which are known to be more sample efficient than traditional genetic algorithms or evolution strategies Salimans et al. (2017); Petroski Such et al. (2017). In this study, we chose to rely on the TD3 agent (see Supplementary Section A), but any other off-policy agent such as SAC Haarnoja et al. (2018) could be used instead.

With respect to the standard MDP framework, QD-RL introduces an extra outcome space $\mathcal{O}$ and a behavior characterization function $\xi : \mathcal{S} \to \mathcal{O}$ that extracts the outcome $o = \xi(s)$ for a state $s$. The outcome of a behavior characterizes what matters about this behavior. As it often corresponds to what is needed to determine whether the behavior was successful or not, this outcome space can be equivalent to a goal space such as introduced in Schaul et al. (2015); Andrychowicz et al. (2017). For example, when working in a maze environment, the outcome may represent the coordinates at the end of the trajectory of the agent, which may also be its goal. However, in contrast to UVFAs, we do not condition the policy on the outcome. In this work, the behavior characterization function is given, as is also the case in Schaul et al. (2015); Andrychowicz et al. (2017), and we consider outcomes computed as a function of a single state. The more general case where it is learned, or computed as a function of the whole trajectory, is left for future work.
Like any QD system, QD-RL maintains an archive which contains all previously trained actors. The first generation contains a population of $N$ neural actors with parameters $\phi_1, \dots, \phi_N$. While all these actors share the same neural architecture, their weights are initialized differently. At each iteration, a selection mechanism based on the Pareto front selects the $N$ best actors from the archive containing all past agents, according to two criteria: the environment reward and a measure of novelty of the actor. To better stick to the QD framework, hereafter the former is called "quality" and the latter "diversity". If the Pareto front contains fewer than $N$ actors, the whole Pareto front is selected and a new front is computed over the remaining actors. Additional actors are sampled from this new Pareto front, and so on until $N$ actors are selected.
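The successive-Pareto-front selection described above can be sketched as follows. This is an illustrative simplification with our own helper names (`dominates`, `pareto_front`, `select`): in particular, we truncate the last front deterministically, whereas QD-RL samples the missing actors from it.

```python
def dominates(a, b):
    """a dominates b if a is at least as good on both criteria
    (quality, diversity) and strictly better on at least one."""
    return a[0] >= b[0] and a[1] >= b[1] and (a[0] > b[0] or a[1] > b[1])

def pareto_front(scores):
    """Indices of the non-dominated (quality, diversity) pairs."""
    return [i for i, s in enumerate(scores)
            if not any(dominates(t, s) for j, t in enumerate(scores) if j != i)]

def select(scores, n):
    """Peel successive Pareto fronts until n actors are selected.
    Simplification: the last front is truncated in order, whereas
    QD-RL samples the missing actors from it."""
    remaining = list(range(len(scores)))
    selected = []
    while len(selected) < n and remaining:
        front = [remaining[i] for i in pareto_front([scores[i] for i in remaining])]
        selected.extend(front[:n - len(selected)])
        remaining = [i for i in remaining if i not in selected]
    return selected
```

Because selection peels fronts rather than ranking on a single scalar, an actor that is mediocre on reward but very novel can still survive into the next generation.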
These selected actors form a new generation. Then, half of the selected actors are trained to optimize quality while the others optimize diversity. The actors with updated weights are then evaluated in the environment and added to the archive. In more detail, training is performed as follows.
First, like any standard RL method, QD-RL optimizes actors so as to maximize quality. More formally, it updates the actor weights $\phi$ to maximize the objective function $J_Q(\phi) = \mathbb{E}_{\tau \sim \pi_\phi}\left[\sum_t \gamma^t r(s_t, a_t)\right]$, where $\tau$ is a trajectory obtained by following the policy $\pi_\phi$ and $r$ is the environment reward function.
Second, QD-RL also optimizes actors to increase diversity in the outcome space. To evaluate the diversity of an outcome $o$, we look for the $k$ nearest neighbors of $o$ in the archive and compute a novelty score as the mean of the squared Euclidean distances between $o$ and its neighbors, as in Lehman and Stanley (2011); Conti et al. (2017). More formally, QD-RL maximizes the objective function $J_D(\phi) = \mathbb{E}_{\tau \sim \pi_\phi}\left[\sum_t \gamma^t d(o_t, \mathcal{A})\right]$, where $o_t$ is the outcome discovered by policy $\pi_\phi$ at time step $t$, $d$ is the novelty score function and $\mathcal{A}$ is the archive containing already discovered outcomes.
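The novelty score can be sketched as follows. This is a minimal illustration with our own naming; the value returned for an empty archive is a convention of the sketch, not taken from the paper.

```python
def novelty(outcome, archive, k=10):
    """Novelty score of an outcome: mean of the squared Euclidean
    distances to its k nearest neighbours among archived outcomes."""
    if not archive:
        return float("inf")  # convention: the first outcome is maximally novel
    sq_dists = sorted(
        sum((a - b) ** 2 for a, b in zip(outcome, other))
        for other in archive
    )
    return sum(sq_dists[:k]) / min(k, len(sq_dists))
```

An outcome far from everything already archived gets a high score, which is what pushes the diversity actors towards unvisited regions of the outcome space.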
The $J_Q$ and $J_D$ functions have the same structure, as we can rewrite $J_D(\phi) = \mathbb{E}_{\tau \sim \pi_\phi}\left[\sum_t \gamma^t r_D(s_t)\right]$, where $r_D(s) = d(\xi(s), \mathcal{A})$ is a non-stationary reward function corresponding to novelty scores. Thus all the mechanisms introduced in the deep RL literature to optimize $J_Q$ can also be applied to optimize $J_D$. Notably, we can introduce Q-value functions $Q^Q$ and $Q^D$ dealing with quality and diversity, and we can define two randomly initialized critic neural networks $Q^Q_{\theta_Q}$ and $Q^D_{\theta_D}$, with parameters $\theta_Q$ and $\theta_D$, to approximate these functions. These critics are shared by all the trained actors. Therefore, they capture the average population performance rather than the performance of individual actors, which has both an information-sharing effect and a smoothing effect. We found that training individual critics is harder in practice and left this analysis for future work.

The quality and diversity updates of the actor weights are performed according to Equations (1) and (2). An update consists in sampling a batch of transitions from the replay buffer and optimizing the weights of both critics so as to estimate quality and diversity. Then, we optimize the parameters of half the policies so as to maximize $J_Q$ and of the other half to maximize $J_D$. Therefore, the global update can be written
$\theta_X \leftarrow \arg\min_{\theta_X} \mathbb{E}_{(s, a, s') \sim \mathcal{B}} \left[ \left( Q^X_{\theta_X}(s, a) - r_X(s, a) - \gamma \, Q^X_{\theta_X'}\big(s', \pi_\phi(s')\big) \right)^2 \right], \quad X \in \{Q, D\}, \qquad (1)$

$\phi_i \leftarrow \arg\max_{\phi_i} \mathbb{E}_{s \sim \mathcal{B}} \left[ Q^Q_{\theta_Q}\big(s, \pi_{\phi_i}(s)\big) \right] \text{ for the quality actors}, \quad \phi_j \leftarrow \arg\max_{\phi_j} \mathbb{E}_{s \sim \mathcal{B}} \left[ Q^D_{\theta_D}\big(s, \pi_{\phi_j}(s)\big) \right] \text{ for the diversity actors}, \qquad (2)$

where $\theta_Q'$ and $\theta_D'$ correspond to the parameters of the target critic networks, $r_Q$ is the environment reward and $r_D$ the novelty reward. To keep notations simple, updates of the extra critic networks introduced in TD3 to reduce the value estimation bias do not appear in (1), but we use them in practice.
Once the updates have been performed, trajectories are collected in parallel from all policies. These trajectories are stored into a common replay buffer and, for each policy, the tuple (final outcome, return, actor parameters) is stored into the archive. Since the novelty score of an outcome varies over time as the archive grows, we do not store novelty scores: we store outcomes, and fresh novelty scores are computed every time a batch is sampled from the replay buffer.
4 Experiments
In this section, we demonstrate the capability of QD-RL to solve challenging exploration problems. We implement it with the TD3 algorithm and refer to this implementation as the QD-TD3 algorithm. Hyperparameters are described in Section B of the supplementary document. We first analyse each component of QD-TD3 and demonstrate their usefulness on a toy example. Then we show that QD-TD3 can solve a more challenging control and exploration problem, navigating the MuJoCo Ant in a large maze, with a better sample complexity than its standard evolutionary competitors.
4.1 Point-Maze: Move a point in a Maze
We first consider the point-maze environment, in which the agent controls a 2D point that must exit a three-corridor maze, depicted in Figure 1(a). The observation $s_t$ corresponds to the agent's $(x, y)$ coordinates at time $t$. The two continuous actions correspond to position increments along the $x$ and $y$ axes. The outcome space is the final state of the agent, as in Conti et al. (2017). The initial position of the agent is sampled uniformly in a zone located at the bottom right of the maze. The exit area is a small square; once it is reached, the episode ends. The maximum length of an episode is 200 time steps. The reward is the negative Euclidean distance between the agent and the center of the exit area. This reward leads to a deceptive gradient signal: following it leads the agent to stay stuck behind the second wall of the maze, as shown in Figure 4 of Supplementary Section C. In order to exit the maze, the agent must find the right balance between exploitation and exploration, that is, at a certain point, ignore the policy gradient and only explore the maze. Thus, though this example may look simple due to its low dimensionality, it remains very challenging for standard deep RL agents such as TD3.
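The deceptive reward structure can be illustrated with a small sketch (the exit coordinates below are placeholders, not the environment's actual values):

```python
import math

def point_maze_reward(position, exit_center=(0.0, 0.0)):
    """Dense reward of the point-maze: minus the Euclidean distance to the
    centre of the exit area (the exit coordinates here are placeholders).
    The reward ignores walls, so greedily following its gradient drives
    the agent into the wall closest to the exit: a deceptive gradient."""
    return -math.dist(position, exit_center)
```

Because the distance is measured through walls, the positions with the highest reward reachable by gradient following lie against a wall, not on the path around it.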
QD-TD3 performs three main operations: (i) it trains half of the agents to maximize quality; (ii) it trains the other half to maximize diversity; (iii) it uses a quality-diversity Pareto front as a population selection mechanism. We investigate the impact of each of these components separately through an ablation study. For all experiments, we use 4 actors. Results are aggregated in Figure 2(a).
First, we measure performance when training the 4 actors to maximize quality only. We call the resulting agent Q-TD3, but it is simply a multi-actor TD3. As depicted in Figure 5 of Supplementary Section C, the Q-TD3 population finds its way to the second maze wall but gets stuck there due to the deceptive nature of the gradient. This experiment clearly shows that a quality-only strategy has no chance of solving hard exploration problems with a deceptive reward signal such as point-maze or Ant-Maze.
Then, we evaluate performance when training the 4 actors to maximize diversity only. We call the resulting agent D-TD3. We show that D-TD3 sometimes finds its way past the second wall, but with a large variance and without finding the optimal trajectory.
We then consider a D-TD3 + Pareto agent that optimizes only for diversity but performs agent selection from the archive with a Pareto front, that is, it selects 4 actors for the next generation based on both their quality and diversity, but without optimizing the former. Interestingly, adding the Pareto front selection mechanism significantly improves performance and stability.
Finally, QD-TD3 optimizes half of the actors for quality and the other half for diversity, and selects them with the Pareto front mechanism. We observe that QD-TD3 outperforms all ablations, even if the improvement over D-TD3 + Pareto is smaller, which means that optimizing for quality is less critical in this environment, as good enough solutions are found just by maximising diversity. Table 1 summarises all the ablations we performed.
Algorithm        Opt. Quality   Opt. Diversity   Pareto selection   Episode return (± std)
QD-TD3
D-TD3 + Pareto   X
D-TD3            X                               X
Q-TD3 + Pareto                  X
Q-TD3                           X                X
4.2 Ant-Maze: Control an articulated Ant to solve a Maze
We then test QD-TD3 on a challenging environment modified from OpenAI Gym Brockman et al. (2016), based on Ant-v2 and also used in Colas et al. (2020); Frans et al. (2018). We refer to this environment as Ant-Maze. In Ant-Maze, a four-legged "ant" robot has to reach a goal zone which corresponds to the lower right part of the maze (colored in green in Figure 1(b)). The initial position of the ant is sampled from a small circle around an initial point situated in the extreme bottom right of the maze. Maze walls are organized so that following the gradient of the distance to the goal drives the ant into a dead-end. As in the point-maze, the reward is expressed as minus the distance between the center of gravity of the ant and the center of the goal zone, thus leading to a strongly deceptive gradient. This environment is more complex than point-maze, as the agent must learn to control a body with 8 degrees of freedom in all directions to explore the maze and solve it. Therefore, this problem is much harder than the standard Ant-v2 Gym environment, in which the ant only learns to go straight forward. The observation space contains the positions, angles, velocities and angular velocities of most ant articulations and of the center of gravity, and has 29 dimensions. The action space is 8-dimensional: an action corresponds to the choice of 8 continuous torque intensities to apply to the 8 ant articulations. Episodes have a fixed length of 3000 time steps. As previously, the outcome is computed as the final position of the center of gravity of the ant.

We compare the performance of QD-TD3 on this benchmark to four state-of-the-art methods: NSR-ES, NSRA-ES, NS-ES and ME-ES. While NS-ES and ME-ES explore optimize only for diversity, and ME-ES exploit optimizes only for quality, NSR-ES, NSRA-ES and ME-ES explore-exploit optimize for both. To ensure a fair comparison, we did not implement our own versions of these algorithms but reused results from the ME-ES paper Colas et al. (2020). We also ensured that the environment we used was rigorously the same.
All the baselines were run for 5 seeds. In QD-TD3, each seed corresponds to 20 actors distributed over 20 CPU cores. As in Colas et al. (2020), we compute the average and standard deviation over seeds of the minimum distance to the goal reached at the end of an episode by one of the agents, and report the results in Figure 2(b). As explained in Colas et al. (2020), NSR-ES and ME-ES exploit obtain a score around 26, which means that they get stuck in the dead-end, similarly to Q-TD3. By contrast, all the other algorithms manage to avoid it. More importantly, QD-TD3 achieves a similar score to these better exploring agents, but with more than 15 times fewer samples than its evolutionary competitors, as shown in Table 2.

Algorithm               Final Perf. (± std)   Sampled steps   Steps to distance 10   Ratio to QD-TD3
QD-TD3 (Ours)                                 6e8             5e8                    1
NS-ES                                         1e10            8e9                    16
NSR-ES                                        1e10
NSRA-ES                                       1e10            8e9                    16
ME-ES explore-exploit                         1e10            9e9                    18
ME-ES explore                                 1e10            9e9                    18
ME-ES exploit                                 1e10
Q-TD3                                         3e8
These results show that QD-TD3 leveraged the sample efficiency of off-policy policy gradients to learn to explore the maze efficiently. We also emphasize the low cost of QD-TD3 compared to its evolutionary counterparts. To solve Ant-Maze, QD-TD3 requires only 2 days of training on 20 CPU cores with no GPU, while evolutionary algorithms are usually run on much larger infrastructures. For instance, ME-ES needs to sample 10,000 different sets of parameters per iteration and evaluate them all to compute a diversity gradient with CEM Colas et al. (2020). Besides, the failure of the Q-TD3 ablation in Ant-Maze unsurprisingly shows that a pure RL approach without a diversity component fails on these deceptive gradient benchmarks.
5 Conclusion
In this paper, we proposed a novel way to deal with the exploration-exploitation tradeoff by combining a reward-seeking component, a diversity-seeking component and a selection component inspired from the Quality-Diversity approach. Crucially, we showed that quality and diversity could be optimized with off-policy reinforcement learning algorithms, resulting in a significantly improved sample efficiency. We showed experimentally the effectiveness of the resulting QD-RL framework, which can solve in two days with 20 CPUs problems which were previously out of reach without a much larger infrastructure.
Key components of QD-RL are selection through a Pareto front and the search for diversity in an outcome space. Admittedly, the outcome space needed to compute the diversity reward is hard-coded. There are attempts to automatically obtain the outcome space through unsupervised learning methods Péré et al. (2018); Paolo et al. (2019), but defining such a space is often a trivial decision which helps a lot, and can alleviate the need to carefully design reward functions.

In the future, we first want to address the case where the outcome depends on the whole trajectory. Next, we plan to further study the versatility of our approach to exploration compared to other deep RL exploration approaches. Besides, we intend to show that our approach can be extended to problems where the environment reward function can itself be decomposed into several loosely dependent components, such as standing, moving forward and manipulating objects for a humanoid, or to multi-agent reinforcement learning problems. In such environments, we could replace the maximization of the sum of reward contributions with a multi-criteria selection from a Pareto front where diversity would be only one of the considered criteria.
Broader Impact
Our paper presents a novel approach to the combination of diversity-driven exploration and modern reinforcement learning techniques. It results in more stable learning with respect to standard reinforcement learning, and more sample efficient learning with respect to standard evolutionary approaches to diversity. We believe this has a positive impact in making reinforcement learning techniques more accessible and feasible for real-world applications. Besides, our work may help build a much-needed bridge between the reinforcement learning and evolutionary optimization research communities. Finally, by releasing our code, we believe that we support reproducible science and allow the wider community to build upon and extend our work in the future.
References
 [1] (2019) Solving Rubik's cube with a robot hand. arXiv preprint arXiv:1910.07113. Cited by: §1.
 [2] (2017) Hindsight experience replay. arXiv preprint arXiv:1707.01495. Cited by: §3.
 [3] (2020) Agent57: outperforming the Atari human benchmark. arXiv preprint arXiv:2003.13350. Cited by: §2.
 [4] (2016) Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pp. 1471–1479. Cited by: §1.
 [5] (2016) OpenAI Gym. arXiv preprint arXiv:1606.01540. Cited by: §4.2.
 [6] (2020) Scaling MAP-Elites to deep neuroevolution. arXiv preprint arXiv:2003.01825. Cited by: §1, §1, §2, §4.2, §4.2, §4.2, §4.2.
 [7] (2018) GEP-PG: decoupling exploration and exploitation in deep reinforcement learning algorithms. arXiv preprint arXiv:1802.05054. Cited by: §1, §1, §2.
 [8] (2017) Improving exploration in evolution strategies for deep reinforcement learning via a population of novelty-seeking agents. arXiv preprint arXiv:1712.06560. Cited by: §1, §2, §3, §4.1.

 [9] (2017) Quality and diversity optimization: a unifying modular framework. IEEE Transactions on Evolutionary Computation. Cited by: Appendix B, §1.
 [10] (2005) A tutorial on the cross-entropy method. Annals of Operations Research 134 (1), pp. 19–67. Cited by: §2.
 [11] (2019) Attraction-repulsion actor-critic for continuous control reinforcement learning. arXiv preprint arXiv:1909.07543. Cited by: §2.
 [12] (2019) Novelty search: a theoretical perspective. In Proceedings of the Genetic and Evolutionary Computation Conference, pp. 99–106. Cited by: §1, §2.
 [13] (2019) Go-Explore: a new approach for hard-exploration problems. arXiv preprint arXiv:1901.10995. Cited by: §2.
 [14] (2018) Diversity is all you need: learning skills without a reward function. arXiv preprint arXiv:1802.06070. Cited by: §1, §2.
 [15] (2017) Intrinsically motivated goal exploration processes with automatic curriculum learning. arXiv preprint arXiv:1708.02190. Cited by: §2.
 [16] (2017) Noisy networks for exploration. arXiv preprint arXiv:1706.10295. Cited by: §1.
 [17] (2018) Meta learning shared hierarchies. Proc. of ICLR. Cited by: §4.2.
 [18] (2018) Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477. Cited by: Appendix A, §2.
 [19] (2018) Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905. Cited by: §3.
 [20] (2019) Marginalized state distribution entropy regularization in policy optimization. arXiv preprint arXiv:1912.05128. Cited by: §2.
 [21] (2019) Human-level performance in 3D multiplayer games with population-based reinforcement learning. Science 364 (6443), pp. 859–865. Cited by: §1.
 [22] (2017) Population-based training of neural networks. arXiv preprint arXiv:1711.09846. Cited by: §2.
 [23] (2019) Efficient exploration via state marginal matching. arXiv preprint arXiv:1906.05274. Cited by: §2.
 [24] (2011) Abandoning objectives: evolution through the search for novelty alone. Evolutionary computation 19 (2), pp. 189–223. Cited by: §1, §2, §3.
 [25] (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Cited by: Appendix A, §2.
 [26] (2020) PBCS: efficient exploration and exploitation using a synergy between reinforcement learning and motion planning. arXiv preprint arXiv:2004.11667. Cited by: §2.
 [27] (2015) Illuminating search spaces by mapping elites. arXiv preprint arXiv:1504.04909. Cited by: §1.
 [28] (2019) Unsupervised learning and exploration of reachable outcome space. algorithms 24, pp. 25. Cited by: §5.
 [29] (2020) Effective diversity in population-based reinforcement learning. arXiv preprint arXiv:2002.00632. Cited by: §1, §2.
 [30] (2018) Unsupervised learning of goal spaces for intrinsically motivated goal exploration. arXiv preprint arXiv:1803.00781. Cited by: §5.
 [31] (2017) Deep neuroevolution: genetic algorithms are a competitive alternative for training deep neural networks for reinforcement learning. arXiv preprint arXiv:1712.06567. Cited by: §3.
 [32] (2017) Parameter space noise for exploration. arXiv preprint arXiv:1706.01905. Cited by: §1.
 [33] (2019) Skew-Fit: state-covering self-supervised reinforcement learning. arXiv preprint arXiv:1903.03698. Cited by: §2.
 [34] (2018) CEM-RL: combining evolutionary and gradient-based methods for policy search. arXiv preprint arXiv:1810.01222. Cited by: §2.
 [35] (2016) Quality diversity: a new frontier for evolutionary computation. Frontiers in Robotics and AI 3, pp. 40. Cited by: §1.
 [36] (2017) Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864. Cited by: §1, §2, §3.

 [37] (2015) Universal value function approximators. In International Conference on Machine Learning, pp. 1312–1320. Cited by: §3.
 [38] (2019) Policy search in continuous action domains: an overview. Neural Networks 113, pp. 28–40. Cited by: §1.
 [39] (2014) Deterministic policy gradient algorithms. In Proceedings of the 30th International Conference in Machine Learning, Cited by: Appendix A.
 [40] (2017) Mastering the game of go without human knowledge. Nature 550 (7676), pp. 354–359. Cited by: §1.
 [41] (2016) #Exploration: a study of count-based exploration for deep reinforcement learning. arXiv preprint arXiv:1611.04717. Cited by: §1.
 [42] (2018) Domain randomization and generative models for robotic grasping. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3482–3489. Cited by: §1.
 [43] (2020) Population-guided parallel policy search for reinforcement learning. In International Conference on Learning Representations, Cited by: §1, §2.
Appendices
Appendix A The TD3 Agent
The Twin Delayed Deep Deterministic (td3) agent Fujimoto et al. (2018) builds upon the Deep Deterministic Policy Gradient (ddpg) agent Lillicrap et al. (2015). It trains a deterministic actor $\pi_\phi$ that maps environment observations directly to continuous actions, and a critic $Q$ that takes a state $s$ and an action $a$ and estimates the average return obtained by selecting $a$ in $s$ and then following $\pi_\phi$. ddpg alternates policy evaluation and policy improvement operations so as to maximise the average discounted return. In ddpg, the critic is updated during the policy evaluation step to minimize a temporal difference error, which induces an overestimation bias. td3 corrects for this bias by introducing two critics, $Q_{\theta_1}$ and $Q_{\theta_2}$. td3 alternates between interactions with the environment and critic and actor updates. It plays one step in the environment using its deterministic policy, then stores the observed transition into a replay buffer $\mathcal{B}$. It then samples a batch of transitions from $\mathcal{B}$ and updates the critic networks. Half the time, it also samples another batch of transitions to update the actor network.
Both critics are updated so as to minimize a loss function expressed as the mean squared error between their predictions and a common target $y$:
(3) $\mathcal{L}_{critic}(\theta_i) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{B}} \left[ \left( Q_{\theta_i}(s,a) - y \right)^2 \right], \quad i \in \{1, 2\},$
where the common target $y$ is computed as:
(4) $y = r + \gamma \min_{i=1,2} Q_{\theta'_i}\big(s', \pi_{\phi'}(s') + \epsilon\big), \quad \epsilon \sim \mathrm{clip}\big(\mathcal{N}(0, \sigma), -c, c\big),$
where $\theta'_i$ and $\phi'$ denote the target network parameters.
The Q-value estimate used to compute the target $y$ is taken as the minimum of the two target critic predictions, which reduces the overestimation bias. td3 also adds a small clipped perturbation $\epsilon$ to the target action so as to smooth the value estimate by bootstrapping similar state-action value estimates.
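As a concrete illustration, the target computation of Equation (4) can be sketched as follows. This is a minimal stdlib-only sketch, not the authors' implementation: the critics and target policy are passed in as plain functions, and all names are illustrative.

```python
import random

def td3_target(r, s_next, done, q1_targ, q2_targ, pi_targ,
               gamma=0.99, sigma=0.2, noise_clip=0.5):
    """Clipped double-Q target with target policy smoothing (Eq. (4))."""
    # Perturb the target action with clipped Gaussian noise.
    eps = max(-noise_clip, min(noise_clip, random.gauss(0.0, sigma)))
    a_next = pi_targ(s_next) + eps
    # Take the minimum over both target critics to fight overestimation.
    q_min = min(q1_targ(s_next, a_next), q2_targ(s_next, a_next))
    # No bootstrapping past terminal transitions.
    return r + gamma * (1.0 - done) * q_min
```

Setting `sigma=0` recovers the plain clipped double-Q target without smoothing.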
Every two critic updates, the actor is updated using the deterministic policy gradient also used in ddpg Silver et al. (2014). For a state $s$, ddpg updates the actor so as to maximise the critic estimate for $s$ and the action selected by the actor. As there are two critics in td3, the authors suggest arbitrarily taking the first one. Thus, the actor is updated by minimizing the following loss function:
(5) $\mathcal{L}_{actor}(\phi) = - \mathbb{E}_{s \sim \mathcal{B}} \left[ Q_{\theta_1}\big(s, \pi_\phi(s)\big) \right].$
Policy evaluation and policy improvement steps are repeated until convergence. td3 demonstrates state-of-the-art performance on some MuJoCo benchmarks. In this study, we use it to update the population of actors for both quality and diversity.
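Over a batch of states, the actor objective of Equation (5) reduces to the negated mean of the first critic's estimates; a minimal sketch, with plain functions standing in for the networks (illustrative names, not the authors' code):

```python
def td3_actor_loss(states, q1, pi):
    """Deterministic policy gradient objective (Eq. (5)):
    minimize -Q1(s, pi(s)), i.e. maximize the first critic's estimate
    of the actions the current actor would select."""
    return -sum(q1(s, pi(s)) for s in states) / len(states)
```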
Appendix B QD-RL Implementation Details
In pointmaze, the state space and the action space both have dimension 2. By contrast, in antmaze the state space has dimension 29 while the action space has dimension 8. We use fully-connected networks for all actors and critics.
We use one population of actors per environment, with 1 cpu thread per actor; the population size differs between pointmaze and antmaze. The code parallelisation is implemented with the Message Passing Interface (MPI) library. Our experiments were run on a multi-core cpu machine with 100 GB of RAM; we did not use any gpu. One experiment on pointmaze takes between 2 and 3 hours, while an experiment on antmaze takes 2 days.
During one iteration of the QD-RL algorithm, the actors of the population are updated according to Equation (2), where the losses are computed on batches sampled from a shared replay buffer; the actors are then evaluated. All gradients are computed in parallel. The gradients of the critic networks are then averaged through a reduce operation and redistributed to the actor threads, which update their weights.
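The reduce-and-redistribute step can be simulated in-process as follows. This sketches only the averaging semantics; the actual code uses MPI collectives, and the data layout here is an assumption.

```python
def average_critic_gradients(per_actor_grads):
    """Average per-parameter critic gradients across actor threads.

    `per_actor_grads` holds one list of per-parameter gradients per actor.
    This mimics an MPI average-reduce followed by redistributing the
    averaged gradients back to every thread.
    """
    n = len(per_actor_grads)
    # zip(*...) groups the i-th parameter's gradient from every actor.
    return [sum(grads) / n for grads in zip(*per_actor_grads)]
```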
After being updated, the actors are evaluated by performing one episode; evaluations also take place in parallel, and all transitions are stored into the replay buffer. For each actor, we also compute the discounted return over the episode, $R = \sum_{t=0}^{T} \gamma^t r_t$, where $T$ is the episode length. We compute the distance between the final outcome $o$ and its closest neighbor in the archive. If this distance is greater than an acceptance threshold (see below for values), we add the tuple (actor weights, return $R$, final outcome $o$) to the archive. Otherwise, we decide between keeping the new actor or its closest neighbor in the archive by selecting the one with the highest return. This selection technique, suggested in Cully and Demiris (2017), allows QD-RL to save space by only keeping the relevant elements. We also set a maximum size for the archive: if it were reached we would use a First In First Out mechanism, but this never happened in any of our experiments.
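The acceptance rule above can be sketched as follows. The data layout (dicts with `weights`, `return`, `outcome` keys) is illustrative, not the authors' implementation.

```python
import math

def try_insert(archive, candidate, threshold):
    """Insert `candidate` into the archive following the acceptance rule
    of Cully and Demiris (2017): add if novel enough, otherwise keep the
    better of the candidate and its closest neighbor."""
    if not archive:
        archive.append(candidate)
        return
    # Euclidean distance from the candidate's final outcome to each entry.
    dists = [math.dist(candidate["outcome"], entry["outcome"])
             for entry in archive]
    j = min(range(len(dists)), key=dists.__getitem__)
    if dists[j] > threshold:
        archive.append(candidate)      # novel enough: keep both
    elif candidate["return"] > archive[j]["return"]:
        archive[j] = candidate         # too close: keep the higher return
```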
Finally, to select the new population of actors for the next iteration, we compute a quality-diversity Pareto front over all the actors saved in the archive and sample actors from it. If the Pareto front contains fewer actors than needed, we select them all, remove them, compute the Pareto front over the remaining actors and sample again from it, and so on until the population is full.
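This selection step amounts to iterated non-dominated sorting over (quality, diversity) pairs; a sketch under the assumption that both objectives are to be maximized (tie-breaking and sampling details may differ from the authors' code):

```python
import random

def dominates(a, b):
    """a dominates b if it is at least as good on both objectives
    (quality, diversity) and strictly better on at least one."""
    return a[0] >= b[0] and a[1] >= b[1] and (a[0] > b[0] or a[1] > b[1])

def select_population(points, n, rng=random):
    """Peel quality-diversity Pareto fronts until n indices are chosen."""
    remaining = list(range(len(points)))
    chosen = []
    while len(chosen) < n and remaining:
        # Current Pareto front: points not dominated by any remaining point.
        front = [i for i in remaining
                 if not any(dominates(points[j], points[i])
                            for j in remaining if j != i)]
        need = n - len(chosen)
        chosen.extend(front if len(front) <= need else rng.sample(front, need))
        remaining = [i for i in remaining if i not in front]
    return chosen
```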
B.1 Hyperparameters
We summarize all the hyperparameters used in our experiments in Table 3. Most of these values are the original ones of the td3 algorithm. Our method introduces only 3 hyperparameters: the archive size, the acceptance threshold for adding an outcome to the archive, and the number of nearest neighbors $k$. The archive size is chosen so that it is never reached in practice. We found that QD-RL is not sensitive to the number of nearest neighbors as long as it is large enough. The acceptance threshold is chosen as a trade-off between keeping the archive small enough for the machine's ram capacity and not being so selective that meaningful actor weights are discarded.
Parameter  Point Maze  Ant Maze
Reinforcement Learning  
optimizer  Adam  Adam 
learning rate  
discount factor  
replay buffer size  
hidden layers size  
activations  ReLU  ReLU 
minibatch size  
target smoothing coefficient  
delay policy update  
target update interval  
gradient steps  
Archive  
archive size  
threshold of acceptance  
k nearest neighbors 
B.2 Full Pseudo Code of QD-RL
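The original listing is not reproduced in this extraction; the following high-level sketch reconstructs one iteration from the description in Appendix B and should be read as an outline, not the authors' exact algorithm:

```
initialize: population of actors, critics Q1 and Q2, empty archive,
            replay buffer B
repeat until the interaction budget is exhausted:
    for each actor, in parallel (one MPI thread per actor):
        sample batches from B and update the actor with the
        quality/diversity losses of Equation (2)
        compute the critic gradients locally
    average the critic gradients (reduce), redistribute them,
    and update the critics
    for each actor, in parallel:
        play one evaluation episode; store all transitions in B
        compute the discounted return R and the final outcome o
        apply the archive acceptance rule to (weights, R, o)
    select the next population by peeling quality-diversity
    Pareto fronts of the archive
```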
Appendix C PointMaze and AntMaze Environments Analysis
In this section, we provide a finer analysis of the pointmaze and antmaze environments. We highlight why these environments are hard to solve for classical deep RL agents without extra exploration mechanisms, and show the impact of the different components of our algorithm.
C.1 Deceptive Gradient in AntMaze
Figure 4 highlights the deceptive nature of the reward in the antmaze environment by depicting gradient fields in both environments.
C.2 Exploration in PointMaze for All Ablations
Figure 5 summarizes the coverage of the pointmaze environment by the different ablation algorithms over the course of training. A dot in the figure corresponds to the final position of an agent performing an episode in the environment. The color encodes the course of training: agents evaluated early in training appear in blue while later ones appear in purple. Figure 5 shows the map coverage for one seed; we chose the most representative one among all seeds. As dtd3 suffers from high variance between seeds, we show two possible behaviors: one where the whole map is covered and one where dtd3 gets stuck.
In Figure 5, all algorithms using diversity (qdtd3, dtd3, dtd3 + pareto) are able to explore the whole environment. The lower region, full of blue dots, is explored first, while the upper region, full of purple dots, is explored later. In the map coverage of qdtd3, the area in the right corner just above the first wall is not explored: since qdtd3 favors both quality and diversity, this area is not explored in priority. The two algorithms relying on quality only (qtd3 and qtd3 + pareto) quickly reach the first wall and then get stuck there because of the deceptive reward signal.