Continuous Control With Ensemble Deep Deterministic Policy Gradients

The growth of deep reinforcement learning (RL) has brought multiple exciting tools and methods to the field. This rapid expansion makes it important to understand the interplay between individual elements of the RL toolbox. We approach this task from an empirical perspective by conducting a study in the continuous control setting. We present multiple insights of fundamental nature, including: an average of multiple actors trained from the same data boosts performance; the existing methods are unstable across training runs, epochs of training, and evaluation runs; a commonly used additive action noise is not required for effective training; a strategy based on posterior sampling explores better than the approximated UCB combined with the weighted Bellman backup; the weighted Bellman backup alone cannot replace the clipped double Q-Learning; the critics' initialization plays a major role in ensemble-based actor-critic exploration. In conclusion, we show how existing tools can be brought together in a novel way, giving rise to the Ensemble Deep Deterministic Policy Gradients (ED2) method, to yield state-of-the-art results on continuous control tasks from OpenAI Gym MuJoCo. From the practical side, ED2 is conceptually straightforward, easy to code, and does not require knowledge outside of the existing RL toolbox.

1 Introduction

Recently, deep reinforcement learning (RL) has achieved multiple breakthroughs in a range of challenging domains (e.g. Silver et al. (2016); Berner et al. (2019); Andrychowicz et al. (2020b); Vinyals et al. (2019)). A part of this success is related to an ever-growing toolbox of tricks and methods that were observed to boost the RL algorithms' performance (e.g. Hessel et al. (2018); Haarnoja et al. (2018b); Fujimoto et al. (2018); Wang et al. (2020); Osband et al. (2019)). This state of affairs benefits the field but also brings challenges related to the often unclear interactions between the individual improvements and to the credit assignment for the overall performance of the algorithm (Andrychowicz et al., 2020a; Ilyas et al., 2020).

In this paper, we present a comprehensive empirical study of multiple tools from the RL toolbox applied to the continuous control in the OpenAI Gym MuJoCo setting. These are presented in Section 4 and Appendix B. Our insights include:

  • The ensemble of actors boosts the agent performance.

  • The current state-of-the-art methods are unstable under several stability criteria.

  • The normally distributed action noise, commonly used for exploration, can hinder training.

  • The approximated posterior sampling exploration (Osband et al., 2013) outperforms approximated UCB exploration combined with weighted Bellman backup (Lee et al., 2020).

  • The weighted Bellman backup (Lee et al., 2020) cannot replace the clipped double Q-Learning (Fujimoto et al., 2018).

  • The critics’ initialization plays a major role in ensemble-based actor-critic exploration, while the training is mostly invariant to the actors’ initialization.

To address some of the issues listed above, we introduce the Ensemble Deep Deterministic Policy Gradients (ED2) algorithm, described in Section 3. Our code is based on SpinningUp (Achiam, 2018) and is open-sourced at https://github.com/ed2-paper/ED2. ED2 brings together existing RL tools in a novel way: it is an off-policy algorithm for continuous control, which constructs an ensemble of streamlined versions of TD3 agents and achieves state-of-the-art performance on OpenAI Gym MuJoCo, substantially improving the results on the two hardest tasks, Ant and Humanoid. Importantly, ED2 does not require knowledge outside of the existing RL toolbox, is conceptually straightforward, and easy to code.

2 Background

We model the environment as a Markov Decision Process (MDP). It is defined by the tuple $(\mathcal{S}, \mathcal{A}, P, \gamma, \rho_0, r)$, where $\mathcal{S}$ is a continuous multi-dimensional state space, $\mathcal{A}$ denotes a continuous multi-dimensional action space, $P$ is a transition kernel, $\gamma$ stands for a discount factor, $\rho_0$ refers to an initial state distribution, and $r$ is a reward function. The agent learns a policy from sequences of transitions $(s_t, a_t, r_t, s_{t+1}, d_t)$, called episodes or trajectories, where $s_0 \sim \rho_0$, $a_t$ is sampled from the policy, $r_t = r(s_t, a_t)$, $d_t$ is a terminal signal, and $T$ is the terminal time-step. A stochastic policy $\pi(\cdot \mid s)$ maps each state to a distribution over actions. A deterministic policy $\mu(s)$ assigns each state an action.

All algorithms that we consider in this paper use a different policy for collecting data (exploration) and a different policy for evaluation (exploitation). In order to keep track of the progress, the evaluation runs are performed every ten thousand environment interactions. Because of the environments' stochasticity, we run the evaluation policy multiple times. Let $\{R_1, \ldots, R_n\}$ be the set of (undiscounted) returns from $n$ evaluation episodes. We evaluate the policy using the average test return $\bar{R} = \frac{1}{n}\sum_{i=1}^{n} R_i$ and the standard deviation of the test returns $\sigma_R = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\big(R_i - \bar{R}\big)^2}$.
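A minimal sketch of how these two metrics can be computed; `env` and `policy` are placeholders, and the rollout assumes the classic 4-tuple Gym step API:

    import numpy as np

    def test_return(env, policy, max_steps=1000):
        """Roll out the deterministic evaluation policy for one episode."""
        obs, ep_return, done, t = env.reset(), 0.0, False, 0
        while not done and t < max_steps:
            obs, reward, done, _ = env.step(policy(obs))
            ep_return += reward
            t += 1
        return ep_return

    def evaluation_metrics(env, policy, n_episodes=30):
        """Average test return and its standard deviation over n_episodes runs."""
        returns = np.array([test_return(env, policy) for _ in range(n_episodes)])
        return returns.mean(), returns.std()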

We run experiments on four continuous control tasks and their variants, introduced in the appropriate sections, from the OpenAI Gym MuJoCo suite (Brockman et al., 2016). The agent observes vectors that describe the kinematic properties of the robot, and its actions specify torques to be applied to the robot's joints. See Appendix D for the details of the experimental setup.

3 Ensemble Deep Deterministic Policy Gradients

For completeness of exposition, we present ED2 before the experimental section. The ED2 architecture is based on an ensemble of Streamlined Off-Policy (SOP) agents (Wang et al., 2020), meaning that our agent is an ensemble of TD3-like agents (Fujimoto et al., 2018) with the action normalization and the ERE replay buffer. The pseudo-code listing can be found in Algorithm 1, while the implementation details, including a more verbose version of the pseudo-code (Algorithm 3), can be found in Appendix E. In the data collection phase (Lines 2-10), ED2 selects one actor from the ensemble uniformly at random (Lines 2 and 10) and runs its deterministic policy for the course of one episode (Line 5). In the evaluation phase (not shown in Algorithm 1), the evaluation policy averages all the actors' output actions. We train the ensemble every 50 environment steps with 50 stochastic gradient descent updates (Lines 11-14). ED2 concurrently learns $2K$ Q-functions, $Q_{k,1}$ and $Q_{k,2}$ where $k \in \{1, \ldots, K\}$, by mean squared Bellman error minimization, in almost the same way that SOP learns its two Q-functions. The only difference is that we have $K$ critic pairs that are initialized with different random weights and then trained independently on the same batches of data. Because of the different initial weights, each Q-function has a different bias in its Q-values. Each actor $\pi_k$ is trained to maximize its corresponding first critic, $Q_{k,1}$, just like in SOP.
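The two behaviour policies can be summarized in a few lines of code; this is an illustrative sketch (the class and the names are ours, not from the released implementation), with `actors` a list of deterministic policy functions:

    import random
    import numpy as np

    class EnsembleDataCollection:
        """Follow one randomly chosen actor per episode (no action noise)."""

        def __init__(self, actors):
            self.actors = actors
            self.k = random.randrange(len(actors))

        def act(self, obs):
            return self.actors[self.k](obs)

        def on_episode_end(self):
            # Re-sample the actor index at every episode boundary.
            self.k = random.randrange(len(self.actors))

    def evaluation_action(actors, obs):
        """The evaluation policy averages the actions of all actors."""
        return np.mean([actor(obs) for actor in actors], axis=0)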

Utilizing the ensembles requires several design choices, which we summarize below. The ablation study of ED2 elements is provided in Appendix C.

1:Input: initial parameters of the policies $\phi_k$ and the Q-functions $\theta_{k,1}$, $\theta_{k,2}$, where $k \in \{1, \ldots, K\}$; empty replay buffer $\mathcal{D}$
2:Sample the current policy index $k \sim \mathcal{U}\{1, \ldots, K\}$.
3:Reset the environment and observe the state $s$.
4:repeat
5:     Execute the action $a = \pi_{\phi_k}(s)$   ▷ uses the action normalization
6:     Observe the reward $r$, the new state $s'$, and the terminal signal $d$; store $(s, a, r, s', d)$ in the replay buffer $\mathcal{D}$.
7:     Set $s \leftarrow s'$
8:     if episode is finished then
9:         Reset the environment and observe the initial state $s$.
10:         Sample the current policy index $k \sim \mathcal{U}\{1, \ldots, K\}$.
11:     if time to update then
12:         for as many steps as done in the environment since the last update do
13:              Sample a batch of transitions $B$ from $\mathcal{D}$   ▷ uses ERE
14:              Update the parameters $\phi_k$, $\theta_{k,1}$, and $\theta_{k,2}$ for every $k$ by one gradient step.
15:until convergence
Algorithm 1 ED2 - Ensemble Deep Deterministic Policy Gradients

Ensemble

Used: We train the ensemble of actors and critics; each actor learns from its own critic and the whole ensemble is trained on the same data.

Not used: We considered different actor-critic configurations, initialization schemes and relations, as well as the use of random prior networks (Osband et al., 2018), data bootstrap (Osband et al., 2016), and different ensemble sizes. We also varied the SOP network sizes and training intensity instead of using the ensemble. Apart from the prior networks in some special cases, these alternatives turn out to be inferior, as shown in Section 4 and Appendix B.1.

Exploration

Used: We pick one actor uniformly at random to collect the data for the course of one episode. The actor is deterministic (no additive action noise is applied). These two choices ensure coherent and temporally-extended exploration similarly to Osband et al. (2016).

Not used: We tested several approaches to exploration: using the ensemble of actors, UCB (Lee et al., 2020), and adding the action noise in different proportions. These experiments are presented in Appendix B.2.

Exploitation

Used: The evaluation policy averages all the actors’ output actions to provide stable performance.

Not used: We tried picking the action with the biggest value estimate (the average of the critics' Q-functions) in evaluation (Huang et al., 2017). Interestingly, both policies had similar results, see Appendix B.3.

Action normalization

Used: We use the action normalization introduced by Wang et al. (2020).

Not used: We experimented with the observations and rewards normalization, which turned out to be unnecessary. The experiments are presented in Appendix B.4.

Q-function updates

Used: We perform SGD updates (Adam optimizer (Kingma and Ba, 2015), MSE loss) of the actors and the critics every 50 environment interactions and use clipped double Q-Learning (Fujimoto et al., 2018).

Not used: We also examined doing the updates at the end of each episode (with a proportional number of updates), using the Huber loss, and doing weighted Bellman backups (Lee et al., 2020). However, we found that they bring no improvement to our method, as presented in Appendix B.5.

4 Experiments

In this section, we present our comprehensive study and the resulting insights. The rest of the experiments, verifying that our design choices perform better than the alternatives, are in Appendix B. Unless stated otherwise, a solid line in the figures represents an average, while a shaded region shows a 95% bootstrap confidence interval. We used 30 seeds for ED2 and the baselines, and a smaller number of seeds for the ED2 variants.

4.1 The ensemble of actors boosts the agent performance

ED2 achieves state-of-the-art performance on the OpenAI Gym MuJoCo suite. Figure 1 shows the results of ED2 contrasted with three strong baselines: SUNRISE (Lee et al., 2020), which is also an ensemble-based method, SOP (Wang et al., 2020), and SAC (Haarnoja et al., 2018b).

Figure 1: The average test returns across the training of ED2 and the three baselines.

Figure 2 shows ED2 with different ensemble sizes. As can be seen, the ensemble of size 5 (which we use in ED2) achieves good results, striking a balance between performance and computational overhead.

Figure 2: The average test returns across the training of ED2 with a different number of actor-critics.

4.2 The current state-of-the-art methods are unstable under several stability criteria

We consider three notions of stability: inference stability, asymptotic performance stability, and training stability. ED2 outperforms the baselines under each of these notions, as discussed below. Similar metrics were also studied in Chan et al. (2020).

Inference stability

We say that an agent is inference stable if, when run multiple times, it achieves similar test performance every time. We measure inference stability using the standard deviation of test returns explained in Section 2. We found that the existing methods train policies that are surprisingly sensitive to the randomness in the environment's initial conditions (the MuJoCo suite is overall deterministic; nevertheless, a little stochasticity is injected at the beginning of each trajectory, see Appendix D for details). Figure 1 and Figure 3 show that ED2 successfully mitigates this problem. By the end of the training, the spread of ED2's test returns on Humanoid is small relative to its average performance, while the test returns of SUNRISE, SOP, and SAC vary substantially more.

Figure 3: The standard deviation of test returns across training (lower is better), see Section 2.

Asymptotic performance stability

We say that an agent achieves asymptotic performance stability if it achieves similar test performance across multiple training runs starting from different initial network weights. Figure 4 shows that ED2 has a significantly smaller variance than the other methods while maintaining high performance.

Figure 4: The dots are the average test returns after training (3M samples) for each seed. The distance between each box's top and bottom edges is the interquartile range (IQR). The whiskers spread across all values.

Training stability

We consider training stable if the performance does not severely deteriorate from one evaluation to the next. We define the root mean squared deterioration (RMSD) metric as

$$\mathrm{RMSD} = \sqrt{\frac{1}{T-20}\sum_{t=21}^{T}\max\big(0,\ \bar{R}_{t-20}-\bar{R}_{t}\big)^{2}},$$

where $T$ is the number of evaluation phases during training and $\bar{R}_t$ is the average test return at the $t$-th evaluation phase (described in Section 2). We compare returns 20 evaluation phases apart to ensure that the measured deterioration does not stem from the evaluation variance. ED2 has the lowest RMSD across all tasks, see Figure 5.
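A small helper, matching the formula above, that computes RMSD from the sequence of average test returns logged at each evaluation phase (the 20-phase lag is a parameter):

    import numpy as np

    def rmsd(avg_test_returns, lag=20):
        """Root mean squared deterioration between evaluation phases `lag` apart."""
        r = np.asarray(avg_test_returns, dtype=float)
        assert len(r) > lag, "need more evaluation phases than the lag"
        drops = np.maximum(0.0, r[:-lag] - r[lag:])  # deterioration w.r.t. `lag` phases earlier
        return np.sqrt(np.mean(drops ** 2))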

Figure 5: RMSD, the average and the 95% bootstrap confidence interval over 30 seeds.

4.3 The normally distributed action noise, commonly used for exploration, can hinder training

In this experiment, we deprive SOP of its exploration mechanism, namely the additive normal action noise, and call this variant deterministic SOP (det. SOP). The lack of action noise, while simplifying the algorithm, causes a relatively minor deterioration of the Humanoid performance, has no significant influence on the Hopper or Walker performance, and substantially improves the Ant performance, see Figure 6. This result shows that no additional exploration mechanism, often taking the form of exploration noise (Lillicrap et al., 2016; Fujimoto et al., 2018; Wang et al., 2020), is required for diverse data collection and, in the case of Ant, it even hinders training.

Figure 6: The average test returns across the training of SOP without and with the exploration noise. All metrics were computed over 30 seeds.

ED2 leverages this insight and constructs an ensemble of deterministic SOP agents, as presented in Section 3. Figure 7 shows that ED2 exhibits the same effects when exploring without the action noise. In Figure 8 we present a more refined experiment in which we vary the noise level. With more noise the Humanoid results get better, whereas the Ant results get much worse.

Figure 7: The average test returns across the training of ED2 with and without the exploration noise. We used Gaussian noise with the default standard deviation from Wang et al. (2020).
Figure 8: The average test returns across the training of ED2 with and without the exploration noise. Different noise standard deviations.

4.4 The approximated posterior sampling exploration outperforms approximated UCB exploration combined with weighted Bellman backup

Posterior sampling is proven to be theoretically superior to the OFU strategy (Osband and Van Roy, 2017). We provide empirical evidence that this advantage carries over to the approximate methods. ED2 uses posterior sampling exploration approximated with the bootstrap (Osband et al., 2016). SUNRISE, on the other hand, approximates the Upper Confidence Bound (UCB) exploration technique and does weighted Bellman backups (Lee et al., 2020). For a fair comparison between ED2 and SUNRISE, we replace SUNRISE's base algorithm, SAC, with the SOP algorithm used by ED2. We call this variant SUNRISE-SOP.

We test both methods on the standard MuJoCo benchmarks as well as on delayed (Zheng et al., 2018a) and sparse (Plappert et al., 2018) reward variants. Both variations make the environments harder from the exploration standpoint. In the delayed version, the rewards are accumulated and returned to the agent only every 10 time-steps. In the sparse version, the reward for forward motion is returned to the agent only after it crosses the threshold of one unit on the x-axis. For perspective, a fully trained Humanoid moves around five units by the end of the episode. All the other reward components (living reward, control cost, and contact cost) remain unchanged. The results are presented in Table 1.

Environment    SOP    SUNRISE-SOP    ED2    Improvement over SOP    Improvement over SUNRISE-SOP
Hopper-v2
Walker-v2
Ant-v2
Humanoid-v2
DelayedHopper-v2
DelayedWalker-v2
DelayedAnt-v2
DelayedHumanoid-v2
SparseHopper-v2
SparseWalker-v2
SparseAnt-v2
SparseHumanoid-v2
Table 1: The test returns after training (3M samples), median across 30 seeds for the standard MuJoCo tasks and 7 seeds for the delayed/sparse variants.
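As an illustration of the delayed-reward variant described above, a wrapper along these lines accumulates rewards and releases them every 10 steps and at episode end; it assumes the classic 4-tuple Gym step API and is not the exact code used for the benchmarks:

    import gym

    class DelayedRewardWrapper(gym.Wrapper):
        """Accumulate rewards and return them to the agent only every `delay` steps."""

        def __init__(self, env, delay=10):
            super().__init__(env)
            self.delay = delay
            self.acc, self.t = 0.0, 0

        def reset(self, **kwargs):
            self.acc, self.t = 0.0, 0
            return self.env.reset(**kwargs)

        def step(self, action):
            obs, reward, done, info = self.env.step(action)
            self.acc += reward
            self.t += 1
            if self.t % self.delay == 0 or done:
                reward, self.acc = self.acc, 0.0  # release the accumulated reward
            else:
                reward = 0.0
            return obs, reward, done, info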

The performance in MuJoCo environments benefits from the ED2 approximate Bayesian posterior sampling exploration (Osband et al., 2013) in contrast to the approximated UCB in SUNRISE, which follows the OFU principle. Moreover, ED2 outperforms the non-ensemble method SOP, supporting the argument of coherent and temporally-extended exploration of ED2.

The experiment where the ED2’s exploration mechanism is replaced for UCB is in Appendix B.2. This variant also achieves worse results than ED2. The additional exploration efficiency experiment in the custom Humanoid environment, where an agent has to find and reach a goal position, is in Appendix A.

4.5 The weighted Bellman backup cannot replace the clipped double Q-Learning

We applied the weighted Bellman backups proposed by Lee et al. (2020) to our method. This technique is suggested to mitigate error propagation in Q-learning by re-weighting the Bellman backup based on uncertainty estimates from an ensemble of target Q-functions (i.e., the variance of their predictions). Interestingly, Figure 9 does not show this positive effect on ED2.
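For reference, the re-weighting in question scales each transition's Bellman error by a confidence weight derived from the ensemble's disagreement at the target; the form below reflects our reading of Lee et al. (2020), with $T$ a temperature hyper-parameter and $\hat{\sigma}_{\bar{Q}}(s', a')$ the empirical standard deviation of the target Q-functions across the ensemble:

$$\mathcal{L}_{WQ} = w(s', a')\,\big(Q_{\theta}(s, a) - y\big)^2, \qquad w(s', a') = \operatorname{sigmoid}\!\big(-\hat{\sigma}_{\bar{Q}}(s', a') \cdot T\big) + 0.5,$$

so the weight lies in $(0.5, 1.0]$ and down-weights targets about which the ensemble disagrees.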

Figure 9: The average test returns across the training of our method and ED2 with the weighted Bellman backup.

Our method uses clipped double Q-Learning to mitigate overestimation in the Q-functions (Fujimoto et al., 2018). We wanted to check whether it is required and whether it can be replaced with the weighted Bellman backups used by Lee et al. (2020). Figure 10 shows that clipped double Q-Learning is required and that the weighted Bellman backups cannot replace it.
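For completeness, within each critic pair the clipped double Q-Learning target takes the standard TD3-style form (our notation, consistent with Algorithm 3 in Appendix E; $\sigma$ is the target smoothing standard deviation):

$$y_k = r + \gamma (1 - d) \min_{i=1,2} Q_{\bar{\theta}_{k,i}}\big(s',\, \pi_{\bar{\phi}_k}(s') + \epsilon\big), \qquad \epsilon \sim \mathcal{N}(0, \sigma).$$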

Figure 10: The average test returns across the training of our method and ED2 without clipped double Q-Learning in two variants without and with the weighted Bellman backups.

4.6 The critics’ initialization plays a major role in ensemble-based actor-critic exploration, while the training is mostly invariant to the actors’ initialization

In this experiment, actors’ weights are initialized with the same random values (contrary to the standard case of different initialization). Moreover, we test a corresponding case with critics’ weights initialized with the same random values or simply training only a single critic.

Figure 11 indicates that the choice of the actors' initialization does not matter in any task but Humanoid. Although the average performance on Humanoid seems to be better, it is also less stable. This is quite interesting because the actors are deterministic. Therefore, the exploration must come from the fact that each actor is trained to optimize its own critic.

On the other hand, Figure 11 shows that the setup with a single critic severely impedes the agent's performance. We suspect that using a single critic impairs the agent's exploration capabilities, as its actors' policies, trained to maximize the same critic's Q-function, become very similar.

Figure 11: The average test returns across the training of ED2, ED2 with actors initialized to the same random values, and ED2 with the single critic.

5 Related work

Off-policy RL

Recently, multiple deep RL algorithms for continuous control have been proposed, e.g. DDPG (Lillicrap et al., 2016), TD3 (Fujimoto et al., 2018), SAC (Haarnoja et al., 2018b), SOP (Wang et al., 2020), SUNRISE (Lee et al., 2020). They provide a variety of methods for improving training quality, including double-Q bias reduction (van Hasselt et al., 2016), target policy smoothing or different update frequencies for actor and critic (Fujimoto et al., 2018), entropy regularization (Haarnoja et al., 2018b), action normalization (Wang et al., 2020), prioritized experience replay (Wang et al., 2020), weighted Bellman backups (Kumar et al., 2020; Lee et al., 2020), and use of ensembles (Osband et al., 2019; Lee et al., 2020; Kurutach et al., 2018; Chua et al., 2018).

Ensembles

Deep ensembles are a practical approximation of a Bayesian posterior, offering improved accuracy and uncertainty estimation (Lakshminarayanan et al., 2017; Fort et al., 2019). They inspired a variety of methods in deep RL. They are often used for temporally-extended exploration; see the next paragraph. Other than that, ensembles of different TD-learning algorithms were used to calculate better Q-learning targets (Chen et al., 2018). Others proposed to combine the actions and value functions of different RL algorithms (Wiering and van Hasselt, 2008) or of the same algorithm with different hyper-parameters (Huang et al., 2017). For mixing the ensemble components, complex self-adaptive confidence mechanisms were proposed by Zheng et al. (2018b). Our method is simpler: it uses the same algorithm with the same hyper-parameters without any complex or learnt mixing mechanism. Lee et al. (2020) proposed a unified framework for ensemble learning in deep RL (SUNRISE) which uses bootstrap with random initialization (Osband et al., 2016), similarly to our work. We achieve better results than SUNRISE and show in Appendix B that their UCB exploration and weighted Bellman backups do not aid our algorithm's performance.

Exploration

Various frameworks have been developed to balance exploration and exploitation in RL. The optimism in the face of uncertainty principle (Lai and Robbins, 1985; Bellemare et al., 2016) assigns an overly optimistic value to each state-action pair, usually in the form of an exploration bonus reward, to promote visiting unseen areas of the environment. The maximum entropy method (Haarnoja et al., 2018a) encourages the policy to be stochastic, hence boosting exploration. In the parameter space approach (Plappert et al., 2018; Fortunato et al., 2018), noise is added to the network weights, which can lead to temporally-extended exploration and a richer set of behaviours. Posterior sampling methods (Strens, 2000; Osband et al., 2016, 2018) have similar motivations. They stem from the Bayesian perspective and rely on selecting the maximizing action among a sampled, statistically plausible set of action values. The ensemble approach (Lowrey et al., 2018; Miłoś et al., 2019; Lee et al., 2020) trains multiple versions of the agent, which yields a diverse set of behaviours and can be viewed as an instance of posterior sampling RL.

6 Conclusions

We conduct a comprehensive empirical analysis of multiple tools from the RL toolbox applied to the continuous control in the OpenAI Gym MuJoCo setting. We believe that the findings can be useful to RL researchers. Additionally, we propose Ensemble Deep Deterministic Policy Gradients (ED2), an ensemble-based off-policy RL algorithm, which achieves state-of-the-art performance and addresses several issues found during the aforementioned study.

Reproducibility Statement

We have made a significant effort to make our results reproducible. We use 30 random seeds, which is above the currently popular choice in the field (up to 5 seeds). Furthermore, we systematically explain our design choices in Section 3 and provide a detailed pseudo-code of our method in Algorithm 3 in Appendix E. Additionally, we open-sourced the code for the project at https://github.com/ed2-paper/ED2, together with examples of how to reproduce the main experiments. The implementation details are explained in Appendix E, and extensive information about the experimental setup is given in Appendix D.

References

  • J. Achiam (2018) Spinning Up in Deep Reinforcement Learning. GitHub repository. Note: https://github.com/openai/spinningup Cited by: footnote 1.
  • M. Andrychowicz, A. Raichuk, P. Stanczyk, M. Orsini, S. Girgin, R. Marinier, L. Hussenot, M. Geist, O. Pietquin, M. Michalski, S. Gelly, and O. Bachem (2020a) What Matters in On-Policy Reinforcement Learning? A Large-Scale Empirical Study. External Links: 2006.05990, ISSN 23318422 Cited by: §B.4, §1.
  • O. M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, et al. (2020b) Learning dexterous in-hand manipulation. The International Journal of Robotics Research 39 (1), pp. 3–20. Cited by: §1.
  • M. G. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos (2016) Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, D. D. Lee, M. Sugiyama, U. von Luxburg, I. Guyon, and R. Garnett (Eds.), pp. 1471–1479. External Links: Link Cited by: §5.
  • C. Berner, G. Brockman, B. Chan, V. Cheung, P. Dębiak, C. Dennison, D. Farhi, Q. Fischer, S. Hashme, C. Hesse, et al. (2019) Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680. Cited by: §1.
  • G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) OpenAI gym. External Links: arXiv:1606.01540 Cited by: §2.
  • S. C. Y. Chan, S. Fishman, A. Korattikara, J. Canny, and S. Guadarrama (2020) Measuring the Reliability of Reinforcement Learning Algorithms. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, Cited by: §4.2.
  • X. L. Chen, L. Cao, C. X. Li, Z. X. Xu, and J. Lai (2018) Ensemble Network Architecture for Deep Reinforcement Learning. Mathematical Problems in Engineering. External Links: Document, ISSN 15635147 Cited by: §5.
  • K. Chua, R. Calandra, R. McAllister, and S. Levine (2018) Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In NeurIPS 2018, pp. 4759–4770. Cited by: §5.
  • S. Fort, H. Hu, and B. Lakshminarayanan (2019) Deep ensembles: A loss landscape perspective. CoRR abs/1912.02757. External Links: Link, 1912.02757 Cited by: §5.
  • M. Fortunato, M. G. Azar, B. Piot, J. Menick, M. Hessel, I. Osband, A. Graves, V. Mnih, R. Munos, D. Hassabis, O. Pietquin, C. Blundell, and S. Legg (2018) Noisy networks for exploration. In 6th International Conference on Learning Representations, ICLR 2018 - Conference Track Proceedings, Cited by: §5.
  • S. Fujimoto, H. van Hoof, and D. Meger (2018) Addressing function approximation error in actor-critic methods. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, J. G. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, pp. 1582–1591. External Links: Link Cited by: §B.2, 5th item, §1, §3, §3, §4.3, §4.5, §5.
  • T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine (2018a) Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In 35th International Conference on Machine Learning, ICML 2018, External Links: 1801.01290, ISBN 9781510867963 Cited by: §5.
  • T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel, and S. Levine (2018b) Soft actor-critic algorithms and applications. CoRR abs/1812.05905. External Links: Link, 1812.05905 Cited by: §1, §4.1, §5.
  • M. Hessel, J. Modayil, H. Van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver (2018) Rainbow: Combining improvements in deep reinforcement learning. In 32nd AAAI Conference on Artificial Intelligence, AAAI 2018, pp. 3215–3222. External Links: 1710.02298, ISBN 9781577358008, Link Cited by: §1.
  • Z. Huang, S. Zhou, B. E. Zhuang, and X. Zhou (2017) Learning to run with actor-critic ensemble. External Links: 1712.08987, ISSN 23318422 Cited by: §3, §5.
  • A. Ilyas, L. Engstrom, S. Santurkar, D. Tsipras, F. Janoos, L. Rudolph, and A. Madry (2020) A closer look at deep policy gradients. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, External Links: Link Cited by: §1.
  • D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), External Links: Link Cited by: §3.
  • A. Kumar, A. Gupta, and S. Levine (2020) DisCor: corrective feedback in reinforcement learning via distribution correction. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin (Eds.), External Links: Link Cited by: §5.
  • T. Kurutach, I. Clavera, Y. Duan, A. Tamar, and P. Abbeel (2018) Model-ensemble trust-region policy optimization. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, External Links: Link Cited by: §5.
  • T. L. Lai and H. Robbins (1985) Asymptotically efficient adaptive allocation rules. Advances in applied mathematics 6 (1), pp. 4–22. Cited by: §5.
  • B. Lakshminarayanan, A. Pritzel, and C. Blundell (2017) Simple and scalable predictive uncertainty estimation using deep ensembles. In NIPS 2017, Cited by: §5.
  • K. Lee, M. Laskin, A. Srinivas, and P. Abbeel (2020) SUNRISE: A Simple Unified Framework for Ensemble Learning in Deep Reinforcement Learning. External Links: 2007.04938, ISSN 23318422 Cited by: §B.1, §B.2, 4th item, 5th item, §3, §3, §4.1, §4.4, §4.5, §4.5, §5, §5, §5.
  • T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2016) Continuous control with deep reinforcement learning. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), External Links: Link Cited by: §4.3, §5.
  • K. Lowrey, A. Rajeswaran, S. Kakade, E. Todorov, and I. Mordatch (2018) Plan online, learn offline: efficient learning and exploration via model-based control. CoRR abs/1811.01848. External Links: 1811.01848 Cited by: §5.
  • P. Miłoś, Ł. Kuciński, K. Czechowski, P. Kozakowski, and M. Klimek (2019) Uncertainty-sensitive learning and planning with ensembles. arXiv preprint arXiv:1912.09996. Cited by: §5.
  • I. Osband, J. Aslanides, and A. Cassirer (2018) Randomized prior functions for deep reinforcement learning. In NeurIPS, Cited by: Appendix A, §B.1, §3, §5.
  • I. Osband, C. Blundell, A. Pritzel, and B. Van Roy (2016) Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems, External Links: 1602.04621, ISSN 10495258 Cited by: §B.1, §3, §3, §4.4, §5, §5.
  • I. Osband, B. V. Roy, D. J. Russo, and Z. Wen (2019) Deep exploration via randomized value functions. J. Mach. Learn. Res. 20, pp. 124:1–124:62. External Links: Link Cited by: §1, §5.
  • I. Osband, B. Van Roy, and D. Russo (2013) (More) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, External Links: 1306.0940, ISSN 10495258 Cited by: 4th item, §4.4.
  • I. Osband and B. Van Roy (2017) Why is posterior sampling better than optimism for reinforcement learning?. In 34th International Conference on Machine Learning, ICML 2017, External Links: 1607.00215, ISBN 9781510855144 Cited by: §4.4.
  • M. Plappert, R. Houthooft, P. Dhariwal, S. Sidor, R. Y. Chen, X. Chen, T. Asfour, P. Abbeel, and M. Andrychowicz (2018) Parameter space noise for exploration. In 6th International Conference on Learning Representations, ICLR 2018 - Conference Track Proceedings, Cited by: §4.4, §5.
  • D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529 (7587), pp. 484–489. Cited by: §1.
  • M. J. A. Strens (2000) A bayesian framework for reinforcement learning. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2000), Stanford University, Stanford, CA, USA, June 29 - July 2, 2000, pp. 943–950. Cited by: §5.
  • H. van Hasselt, A. Guez, and D. Silver (2016) Deep reinforcement learning with double q-learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA, D. Schuurmans and M. P. Wellman (Eds.), pp. 2094–2100. External Links: Link Cited by: §5.
  • O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, et al. (2019) Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature 575 (7782), pp. 350–354. Cited by: §1.
  • C. Wang, Y. Wu, Q. Vuong, and K. Ross (2020) Striving for simplicity and performance in off-policy DRL: output normalization and non-uniform sampling. Proceedings of the 37th International Conference on Machine Learning 119, pp. 10070–10080. External Links: Link Cited by: §B.2, Appendix E, Appendix E, §1, §3, §3, Figure 7, §4.1, §4.3, §5.
  • M. A. Wiering and H. van Hasselt (2008) Ensemble algorithms in reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics. External Links: Document, ISSN 10834419 Cited by: §5.
  • Z. Zheng, J. Oh, and S. Singh (2018a) On Learning Intrinsic Rewards for Policy Gradient Methods. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, S. Bengio, H. M. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 4649–4659. Cited by: §4.4.
  • Z. Zheng, C. Yuan, Z. Lin, Y. Cheng, and H. Wu (2018b) Self-adaptive double bootstrapped DDPG. In IJCAI International Joint Conference on Artificial Intelligence, External Links: Document, ISBN 9780999241127, ISSN 10450823 Cited by: §5.

Appendix A Exploration efficiency in the custom Humanoid environment

To check the exploration capabilities of our method, we constructed two environments based on Humanoid in which the goal is not only to move forward as fast as possible but to find and reach a specific region. The environments are described in Figure 12.

Figure 12: This is a top-down view. Humanoid starts at the origin (marked ’x’). The reward in each time-step is equal to the number of circles for which the agent is inside. Being in the most nested circle (or either of two) solves the task.
Figure 13: The fraction of episodes in which the task is finished for ED2 and two baselines. The average and the 95% bootstrap confidence interval over 20 seeds.

Because the Humanoid initial state is slightly perturbed in every run, we compare solved rates over multiple runs, see details in Appendix D. Figure 13 compares the solved rates of our method and the baselines. Our method outperforms the baselines. For this experiment, our method uses the prior networks [Osband et al., 2018].

Appendix B Design choices

In this section, we summarize the empirical evaluation of various design choices grouped by topics related to the ensemble of agents (B.1), exploration (B.2), exploitation (B.3), normalization (B.4), and Q-function updates (B.5). In the plots, a solid line and a shaded region represent an average and a 95% bootstrap confidence interval, computed over 30 seeds in the case of ED2 (ours) and a smaller number of seeds otherwise. All of these experiments test ED2 as presented in Section 3, with Algorithm 2 used for evaluation (the ensemble critic variant). We call Algorithm 2 the 'vote policy'.

1:Input: ensemble size $K$; policy and Q-function parameters $\phi_j$, $\theta_{j,1}$, $\theta_{j,2}$, where $j \in \{1, \ldots, K\}$; max action scale $M$
2:function vote_policy($s$, $k$)
        $a_j = \pi_{\phi_j}(s)$ for $j \in \{1, \ldots, K\}$     (1)
3:     if use arbitrary critic then
        $q_j = Q_{\theta_{k,1}}(s, a_j)$ for $j \in \{1, \ldots, K\}$     (2)
4:     else   ▷ use ensemble critic
        $q_j = \frac{1}{K}\sum_{i=1}^{K} Q_{\theta_{i,1}}(s, a_j)$ for $j \in \{1, \ldots, K\}$     (3)
5:     return $a_{j^*}$ for $j^* = \arg\max_{j} q_j$
Algorithm 2 Vote policy

B.1 Ensemble

Prior networks

We tested whether our algorithm can benefit from prior networks [Osband et al., 2018]. It turned out that the results are very similar on the OpenAI Gym MuJoCo tasks, see Figure 14. However, the prior networks are useful in our crafted hard-exploration Humanoid environments, see Figure 15.

Figure 14: The average test returns across the training of ED2 without (ours) and with prior networks.
Figure 15: The average test returns across the training of ED2 without (ours) and with prior networks on the custom hard-exploration Humanoid environments.

Moreover, we tested if the deterministic SOP variant can benefit from prior networks. It turned out that the results are very similar or worse, see Figure 16.

Figure 16: The average test returns across the training of SOP and SOP without the exploration noise in two variants: without and with the prior network. All metrics were computed over 30 seeds.

Ensemble size

Figure 17 shows ED2 with different ensemble sizes. As can be seen, the ensemble of size 5 (which we use in ED2) achieves good results, striking a balance between performance and computational overhead.

Figure 17: The average test returns across the training of ED2 with a different number of actor-critics.

Data bootstrap

Osband et al. [2016] and Lee et al. [2020] remark that training an ensemble of agents on the same training data but with different initializations achieves, in most cases, better performance than feeding different training samples to each agent. We confirm this observation in Figure 18. The data bootstrap assigns each transition to each agent in the ensemble with 50% probability.
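A sketch of the masking described above: each transition gets an independent Bernoulli(0.5) inclusion flag per ensemble member, and only members with a flag of 1 use that transition in their updates (names are illustrative):

    import numpy as np

    def bootstrap_masks(rng, num_transitions, ensemble_size, p=0.5):
        """One 0/1 inclusion flag per transition and per ensemble member."""
        return rng.binomial(n=1, p=p, size=(num_transitions, ensemble_size))

    rng = np.random.default_rng(0)
    masks = bootstrap_masks(rng, num_transitions=4, ensemble_size=5)
    # masks[t, k] == 1 means ensemble member k trains on transition t.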

Figure 18: The average test returns across the training of ED2 without (ours) and with data bootstrap.

SOP bigger networks and training intensity

We checked whether simply training SOP with bigger networks or with a higher training intensity (the number of updates made for each collected transition) can bring it close to the ED2 results. Figure 19 compares ED2 to SOP with different network sizes, while Figure 20 compares ED2 to SOP with one or five updates per environment step. It turns out that bigger networks or higher training intensity do not improve SOP performance.

Figure 19: The average test returns across the training of ED2 and SOP with different network sizes.
Figure 20: The average test returns across the training of ED2 and SOP with one or five updates for every step in an environment.

B.2 Exploration

Vote policy

In this experiment, we used the so-called "vote policy" described in Algorithm 2. We use it for the action selection in step 6 of Algorithm 3 in two variations: (1) an arbitrary critic, chosen for the duration of one episode, evaluates each actor's action, or (2) the full ensemble of critics evaluates the actors' actions. Figure 21 shows that the arbitrary-critic variant is not much different from our method. However, in the case of the ensemble critic, we observe a significant performance drop, suggesting deficient exploration.

Figure 21: The average test returns across the training of ED2 with and without the vote policy for exploration.

UCB

We tested the UCB exploration method from Lee et al. [2020]. This method defines an upper-confidence bound (UCB) based on the mean and variance of Q-functions in an ensemble and selects actions with the highest UCB for efficient exploration. Figure 22 shows that the UCB exploration method makes the results of our algorithm worse.

Figure 22: The average test returns across the training of our method and ED2 with the UCB exploration.

Gaussian noise

While our method uses ensemble-based temporally coherent exploration, the most popular choice of exploration is injecting i.i.d. noise [Fujimoto et al., 2018, Wang et al., 2020]. We evaluate whether these two approaches can be used together. We used Gaussian noise with the default standard deviation from Wang et al. [2020]. We found that the effects are task-specific: barely visible for Hopper and Walker, positive in the case of Humanoid, and negative for Ant, see Figure 23. In a more refined experiment, we varied the noise level. With more noise the Humanoid results are better, whereas the Ant results are worse, see Figure 24.

Figure 23: The average test returns across the training of ED2 with and without the additive Gaussian noise for exploration.
Figure 24: The average test returns across the training of ED2 with and without the additive Gaussian noise for exploration. Different noise standard deviations.

B.3 Exploitation

We used the vote policy, see Algorithm 2, as the evaluation policy in step 22 of Algorithm 3. Figure 25 shows that the vote policy does worse on the OpenAI Gym MuJoCo tasks. However, on our custom Humanoid tasks introduced in Appendix A, it improves our agent's performance, see Figure 26.

Figure 25: The average test returns across the training of our method and ED2 with the vote policy for evaluation.
Figure 26: The average test returns across the training of our method and ED2 with the vote policy for evaluation on the custom Humanoid tasks.

B.4 Normalization

We validated whether rewards or observations normalization [Andrychowicz et al., 2020a] helps our method. In both cases, we keep the empirical mean and standard deviation of each reward/observation coordinate, based on all rewards/observations seen so far, and normalize rewards/observations by subtracting the empirical mean and dividing by the standard deviation. It turned out that only the observations normalization significantly helps the agent on Humanoid, see Figures 27 and 28. The influence of the action normalization is tested in Appendix C.
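A minimal running normalizer of the kind described, maintaining an online per-coordinate mean and standard deviation with Welford's algorithm; this is a sketch, not the exact implementation:

    import numpy as np

    class RunningNormalizer:
        """Normalize vectors by the running empirical mean and std of each coordinate."""

        def __init__(self, size, eps=1e-8):
            self.mean = np.zeros(size)
            self.m2 = np.zeros(size)   # sum of squared deviations from the running mean
            self.count = 0
            self.eps = eps

        def update(self, x):
            self.count += 1
            delta = x - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (x - self.mean)

        def normalize(self, x):
            std = np.sqrt(self.m2 / max(self.count, 1)) + self.eps
            return (x - self.mean) / std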

Figure 27: The average test returns across the training of our method and ED2 with the rewards normalization.
Figure 28: The average test returns across the training of our method and ED2 with the observations normalization.

B.5 Q-function updates

Huber loss

We tried using the Huber loss for the Q-function training. It makes the results on all tasks worse, see Figure 29.

Figure 29: The average test returns across the training of our method and ED2 with the Huber loss.

Appendix C Ablation study

In this section, we ablate the ED2 components to see their impact on performance and stability. We start with the ensemble exploration and exploitation, and then move on to the action normalization and the ERE replay buffer. In all plots, a solid line and a shaded region represent an average and a 95% bootstrap confidence interval; the action normalization and ERE replay buffer experiments use a different number of seeds than the remaining ones.

Exploration & Exploitation

In the first experiment, we wanted to isolate the effect of ensemble-based temporally coherent exploration on the performance and stability of ED2. Figures 30-33 compare the performance and stability of ED2 and one baseline, SOP, to ED2 with a single actor (the first one) used for evaluation in step 22 of Algorithm 3. It is worth noting that the action selection during the data collection, step 6 in Algorithm 3, is left unchanged: the ensemble of actors is used for exploration and each actor is trained on all the data. This should isolate the effect of exploration on the test performance of every actor. The results show that the performance improvement and stability of ED2 do not come solely from the efficient exploration. The ED2 ablation performs comparably to the baseline and is even less stable.

Figure 30: The average test returns across the training of ED2, ED2 with the single actor for exploitation, and the baseline.
Figure 31: The standard deviation of test returns across the training of ED2, ED2 with the single actor for exploitation, and the baseline.
Figure 32: The dots are the average test returns after training (3M samples) for each seed of ED2, ED2 with the single actor for exploitation, and the baseline. The distance between each box's top and bottom edges is the interquartile range (IQR). The whiskers spread across all values.
Figure 33: RMSD of ED2, ED2 with the single actor for exploitation, and the baseline – the average and the 95% bootstrap confidence interval over 30 seeds.

In the next experiment, we wanted to check whether the ensemble evaluation is all we need in that case. Figure 34 compares the performance of ED2 and one baseline, SOP, to ED2 with a single actor (the first one) used for the data collection in step 6 of Algorithm 3. The action selection during the evaluation, step 22 in Algorithm 3, is left unchanged: the ensemble of actors is trained on the data collected by only one of the actors. We add Gaussian noise to the single actor's actions for exploration, as described in Appendix B.2. The results show that the ensemble's test performance collapses, possibly because of training on out-of-distribution data. This implies that the ensemble of actors, used for evaluation, improves the test performance and stability. However, it is required that the same ensemble of actors is also used for exploration during the data collection.

Figure 34: The average test returns across the training of ED2, ED2 with the single actor for exploration, and the baseline.

Action normalization

The implementation details of the action normalization are described in Appendix E. Figure 35 shows that the action normalization is especially required on the Ant and Humanoid environments, while not disrupting the training on the other tasks.

Figure 35: The average test returns across the training of ED2 with and without the action normalization.

ERE replay buffer

The implementation details of the ERE replay buffer are described in Appendix E. In Figure 36 we observe that it improves the final performance of ED2 on all tasks, especially on Walker2d and Humanoid.

Figure 36: The average test returns across the training of ED2 with and without the ERE replay buffer.

Appendix D Experimental setup

Plots

In all evaluations, we used 30 evaluation episodes to better assess the average performance of each policy, as described in Section 2. For a more pleasant look and easier visual assessment, we smoothed the lines using an exponential moving average.
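The smoothing is a plain exponential moving average; a sketch, where under this convention a smaller `alpha` gives heavier smoothing:

    import numpy as np

    def ema_smooth(values, alpha):
        """Exponential moving average with smoothing factor alpha in (0, 1]."""
        smoothed = np.empty(len(values))
        smoothed[0] = values[0]
        for i in range(1, len(values)):
            smoothed[i] = alpha * values[i] + (1.0 - alpha) * smoothed[i - 1]
        return smoothed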

OpenAI Gym MuJoCo

In MuJoCo environments, presented in Figure 37, a state is defined by position and velocity of the robot’s root, and angular position and velocity of each of its joints. The observation holds almost all information from the state except the x and y position of the robot’s root. The action is a torque that should be applied to each joint of the robot. Sizes of those spaces for each environment are summarised in Table 2.

MuJoCo is a deterministic physics engine; thus, all simulations conducted inside it are deterministic, including the simulations of our environments. However, to simplify the process of data gathering and to counteract over-fitting, the authors of OpenAI Gym decided to introduce some stochasticity. Each episode starts from a slightly different state: the initial positions and velocities are perturbed with random noise (uniform or normal, depending on the particular environment).

Figure 37: The OpenAI Gym MuJoCo tasks we benchmark our method on.
Environment name Action space size Observation space size
Hopper-v2 3 11
Walker2d-v2 6 17
Ant-v2 8 111
Humanoid-v2 17 376
Table 2: Action and observation space sizes for used environments.

Appendix E Implementation details

1:Input: ensemble size $K$; initial policy parameters $\phi_k$ and Q-function parameters $\theta_{k,1}$, $\theta_{k,2}$, where $k \in \{1, \ldots, K\}$; empty replay buffer $\mathcal{D}$; max action scale $M$; target smoothing std. dev. $\sigma$; interpolation factor $\rho$
2:Set the target parameters $\bar{\phi}_k \leftarrow \phi_k$, $\bar{\theta}_{k,i} \leftarrow \theta_{k,i}$ for $k \in \{1, \ldots, K\}$, $i \in \{1, 2\}$
3:Sample the current policy index $k \sim \mathcal{U}\{1, \ldots, K\}$.
4:Reset the environment and observe the state $s$.
5:repeat
6:     Execute the action $a = \pi_{\phi_k}(s)$   ▷ uses the action normalization
7:     Observe the reward $r$, the new state $s'$, and the terminal signal $d$; store $(s, a, r, s', d)$ in the replay buffer $\mathcal{D}$.
8:     Set $s \leftarrow s'$
9:     if episode is finished then
10:         Reset the environment and observe the initial state $s$.
11:         Sample the current policy index $k \sim \mathcal{U}\{1, \ldots, K\}$.
12:     if time to update then
13:         for as many steps as done in the environment since the last update do
14:              Sample a batch of transitions $B = \{(s, a, r, s', d)\}$ from $\mathcal{D}$   ▷ uses ERE
15:              Compute the targets
                      $y_k = r + \gamma (1 - d) \min_{i=1,2} Q_{\bar{\theta}_{k,i}}\big(s', \pi_{\bar{\phi}_k}(s') + \epsilon\big)$, $\epsilon \sim \mathcal{N}(0, \sigma)$, for $k \in \{1, \ldots, K\}$
16:              Update the Q-functions by one step of gradient descent using
                      $\nabla_{\theta_{k,i}} \frac{1}{|B|} \sum_{(s,a,r,s',d) \in B} \big(Q_{\theta_{k,i}}(s, a) - y_k\big)^2$ for $k \in \{1, \ldots, K\}$, $i \in \{1, 2\}$
17:              Update the policies by one step of gradient ascent using
                      $\nabla_{\phi_k} \frac{1}{|B|} \sum_{s \in B} Q_{\theta_{k,1}}\big(s, \pi_{\phi_k}(s)\big)$ for $k \in \{1, \ldots, K\}$
18:              Update the target parameters with
                      $\bar{\theta}_{k,i} \leftarrow \rho\, \bar{\theta}_{k,i} + (1 - \rho)\, \theta_{k,i}$, $\bar{\phi}_k \leftarrow \rho\, \bar{\phi}_k + (1 - \rho)\, \phi_k$
19:     if time to evaluate then
20:         for the specified number of evaluation runs do
21:              Reset the environment and observe the state $s$.
22:              Execute the averaged policy $a = \frac{1}{K} \sum_{k=1}^{K} \pi_{\phi_k}(s)$ until the terminal state.
23:              Record and log the return.
24:until convergence
Algorithm 3 ED2 - Ensemble Deep Deterministic Policy Gradients

Architecture and hyper-parameters

In our experiments, we use deep neural networks with two hidden layers, each with 256 units. All of the networks use ReLU activations, except for the final output layer, where the activation varies depending on the model: critic networks use no output activation, while actor networks use $\tanh$ multiplied by the max action scale. Table 3 shows the hyper-parameters used for the tested algorithms.
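A sketch of the described networks in PyTorch; the layer sizes and activations follow the text, while the class names and the way observations and actions are concatenated in the critic are illustrative (the action normalization described below is omitted here):

    import torch
    import torch.nn as nn

    def mlp(sizes, output_activation=None):
        layers = []
        for i in range(len(sizes) - 1):
            layers.append(nn.Linear(sizes[i], sizes[i + 1]))
            if i < len(sizes) - 2:
                layers.append(nn.ReLU())
        if output_activation is not None:
            layers.append(output_activation)
        return nn.Sequential(*layers)

    class Actor(nn.Module):
        def __init__(self, obs_dim, act_dim, max_action, hidden=(256, 256)):
            super().__init__()
            self.max_action = max_action
            self.net = mlp([obs_dim, *hidden, act_dim], output_activation=nn.Tanh())

        def forward(self, obs):
            return self.max_action * self.net(obs)  # tanh output scaled by the max action

    class Critic(nn.Module):
        def __init__(self, obs_dim, act_dim, hidden=(256, 256)):
            super().__init__()
            self.net = mlp([obs_dim + act_dim, *hidden, 1])  # no output activation

        def forward(self, obs, act):
            return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)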

[b] Parameter SAC SOP SUNRISE ED2 discounting optimizer Adam Adam Adam Adam learning rate replay buffer size batch size ensemble size - - entropy coefficient - - update interval 1 (ERE) - -

  • Number of environment interactions between updates.

Table 3: Default values of hyper-parameters used in our experiments.

Action normalization

Our algorithm employs the action normalization proposed by Wang et al. [2020]. It means that before applying the squashing function (e.g. $\tanh$), the outputs of each actor network are normalized in the following way: let $\mu = (\mu_1, \ldots, \mu_D)$ be the output of the actor's network and let $G = \frac{1}{D}\sum_{i=1}^{D}|\mu_i|$ be the average magnitude of this output, where $D$ is the action's dimensionality. If $G > 1$, we normalize the output by setting $\mu_i$ to $\mu_i / G$ for all $i$. Otherwise, we leave the output unchanged. Each actor's outputs are normalized independently from the other actors in the ensemble.
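In code, the rule just described reduces to rescaling the pre-squashing output by its average magnitude whenever that magnitude exceeds one; a minimal sketch:

    import numpy as np

    def normalize_actor_output(mu):
        """Rescale the pre-tanh actor output if its average magnitude exceeds 1."""
        g = np.mean(np.abs(mu))  # average magnitude of the output
        return mu / g if g > 1.0 else mu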

Emphasizing Recent Experience

We implement the Emphasizing Recent Experience (ERE) mechanism from Wang et al. [2020]. ERE samples non-uniformly from the most recent experiences stored in the replay buffer. Let $U$ be the number of mini-batch updates and $N$ be the current size of the replay buffer. When performing the gradient updates, for the $u$-th update we sample from the $c_u$ most recent data points stored in the replay buffer, where $c_u = \max\big(N \cdot \eta^{\,u \cdot 1000 / U},\ c_{\min}\big)$ for $u = 1, \ldots, U$.

The hyper-parameter $\eta$ starts off with a set value of $\eta_0$ and is later adapted based on the improvements in the agent's training performance. Let $I_{\text{recent}}$ be the improvement in terms of training episode returns made over the last $T$ time-steps and $I_{\text{max}}$ be the maximum of such improvements over the course of the training. We adapt $\eta$ according to the formula

$$\eta = \eta_0 \cdot \frac{I_{\text{recent}}}{I_{\text{max}}} + 1 \cdot \left(1 - \frac{I_{\text{recent}}}{I_{\text{max}}}\right).$$

Our implementation uses an exponentially weighted moving average to store the value of $I_{\text{recent}}$. More concretely, we track it with two additional accumulators that are updated whenever we receive a new training episode return, with a decay rate that depends on the maximum length of an episode.
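Putting the pieces together, ERE sampling can be sketched as follows; the $c_u$ schedule and the $c_{\min}$ value follow our reading of Wang et al. [2020], and the $\eta$ adaptation mirrors the interpolation above, so treat this as an approximation of the released implementation rather than a faithful copy:

    import numpy as np

    C_MIN = 5000  # assumed floor on the sampling range, following Wang et al. (2020)

    def ere_range(u, num_updates, buffer_size, eta):
        """Size of the recent-data window for the u-th of `num_updates` mini-batch updates."""
        c_u = buffer_size * eta ** (u * 1000.0 / num_updates)
        return int(max(min(c_u, buffer_size), min(C_MIN, buffer_size)))

    def sample_ere_batch(buffer, batch_size, u, num_updates, eta, rng):
        """Sample uniformly from the most recent c_u transitions (buffer ordered oldest first)."""
        c_u = ere_range(u, num_updates, len(buffer), eta)
        idx = rng.integers(len(buffer) - c_u, len(buffer), size=batch_size)
        return [buffer[i] for i in idx]

    def adapt_eta(eta_0, recent_improvement, max_improvement):
        """Move eta toward 1 (uniform sampling) as training progress plateaus."""
        ratio = recent_improvement / max(max_improvement, 1e-8)
        return eta_0 * ratio + 1.0 * (1.0 - ratio)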

Hardware

During the training of our models, we employ only CPUs, using a cluster of multi-core nodes. The running time of a typical experiment did not exceed 24 hours.