1 Introduction
Recently, deep reinforcement learning (RL) has achieved multiple breakthroughs in a range of challenging domains (e.g. Silver et al. (2016); Berner et al. (2019); Andrychowicz et al. (2020b); Vinyals et al. (2019)). A part of this success is related to an ever-growing toolbox of tricks and methods that have been observed to boost RL algorithms' performance (e.g. Hessel et al. (2018); Haarnoja et al. (2018b); Fujimoto et al. (2018); Wang et al. (2020); Osband et al. (2019)). This state of affairs benefits the field but also brings challenges: the interactions between the individual improvements are often unclear, and it is hard to attribute credit for the overall performance of an algorithm (Andrychowicz et al., 2020a; Ilyas et al., 2020).
In this paper, we present a comprehensive empirical study of multiple tools from the RL toolbox applied to continuous control in the OpenAI Gym MuJoCo setting. These are presented in Section 4 and Appendix B. Our insights include:
The ensemble of actors boosts the agent's performance.
The current state-of-the-art methods are unstable under several stability criteria.
The normally distributed action noise, commonly used for exploration, can hinder training.
The critics’ initialization plays a major role in ensemble-based actor-critic exploration, while the training is mostly invariant to the actors’ initialization.
To address some of the issues listed above, we introduce the Ensemble Deep Deterministic Policy Gradient (ED2) algorithm (our code is based on SpinningUp (Achiam, 2018); we open-source it at https://github.com/ed2-paper/ED2), see Section 3. ED2 brings together existing RL tools in a novel way: it is an off-policy algorithm for continuous control, which constructs an ensemble of streamlined versions of TD3 agents and achieves state-of-the-art performance in OpenAI Gym MuJoCo, substantially improving the results on the two hardest tasks, Ant and Humanoid. Consequently, ED2 does not require knowledge outside of the existing RL toolbox, is conceptually straightforward, and is easy to code.
2 Preliminaries
We model the environment as a Markov Decision Process (MDP). It is defined by the tuple $(\mathcal{S}, \mathcal{A}, P, \gamma, \rho_0, r)$, where $\mathcal{S}$ is a continuous multi-dimensional state space, $\mathcal{A}$ denotes a continuous multi-dimensional action space, $P$ is a transition kernel, $\gamma$ stands for a discount factor, $\rho_0$ refers to an initial state distribution, and $r$ is a reward function. The agent learns a policy from sequences of transitions $(s_t, a_t, r_t, s_{t+1}, d_t)$, called episodes or trajectories, where $s_t \in \mathcal{S}$, $a_t \in \mathcal{A}$, $r_t = r(s_t, a_t)$, $d_t$ is a terminal signal, and $T$ is the terminal time-step. A stochastic policy maps each state to a distribution over actions. A deterministic policy assigns each state an action.
All algorithms that we consider in this paper use one policy for collecting data (exploration) and a different policy for evaluation (exploitation). In order to keep track of the progress, the evaluation runs are performed every ten thousand environment interactions. Because of the environments' stochasticity, we run the evaluation policy multiple times. Let $\{R_i\}_{i=1}^{N}$ be the set of (undiscounted) returns from $N$ evaluation episodes. We evaluate the policy using the average test return $\bar{R} = \frac{1}{N} \sum_{i=1}^{N} R_i$ and the standard deviation of the test returns.
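This evaluation protocol can be sketched as follows; `run_episode` is a hypothetical stand-in for whatever rolls out one evaluation episode and returns its undiscounted return.

```python
import numpy as np

def evaluate(policy, run_episode, num_episodes=30):
    """Run the evaluation policy several times and report the average
    test return and the standard deviation of the test returns."""
    returns = np.array([run_episode(policy) for _ in range(num_episodes)])
    return returns.mean(), returns.std()
```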
We run experiments on four continuous control tasks, and their variants introduced in the appropriate sections, from the OpenAI Gym MuJoCo suite (Brockman et al., 2016). The agent observes vectors that describe the kinematic properties of the robot, and its actions specify torques to be applied to the robot's joints. See Appendix D for the details of the experimental setup.
3 Ensemble Deep Deterministic Policy Gradients
For completeness of exposition, we present ED2 before the experimental section. The ED2 architecture is based on an ensemble of Streamlined Off-Policy (SOP) agents (Wang et al., 2020), meaning that our agent is an ensemble of TD3-like agents (Fujimoto et al., 2018) with the action normalization and the ERE replay buffer. The pseudo-code listing can be found in Algorithm 1, while the implementation details, including a more verbose version of the pseudo-code (Algorithm 3), can be found in Appendix E. In the data collection phase (Lines 2-10), ED2 selects one actor from the ensemble uniformly at random (Lines 2 and 10) and runs its deterministic policy for the course of one episode (Line 5). In the evaluation phase (not shown in Algorithm 1), the evaluation policy averages all the actors' output actions. We train the ensemble every 50 environment steps with 50 stochastic gradient descent updates (Lines 11-14). ED2 concurrently learns $2K$ Q-functions, $Q_{\phi_{k,1}}$ and $Q_{\phi_{k,2}}$ where $k \in \{1, \dots, K\}$, by mean square Bellman error minimization, in almost the same way that SOP learns its two Q-functions. The only difference is that we have $K$ critic pairs that are initialized with different random weights and then trained independently on the same batches of data. Because of the different initial weights, each Q-function has a different bias in its Q-values. The actors $\mu_{\theta_k}$ train by maximizing their corresponding first critic, $Q_{\phi_{k,1}}$, just like in SOP.
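The interaction loop described above can be sketched as follows; `env`, `actors`, and `train_ensemble` are hypothetical stand-ins for the components of Algorithm 1, not the authors' interface.

```python
import random

def ed2_collect_and_train(env, actors, train_ensemble, steps=10_000,
                          train_every=50, updates=50):
    """Sketch of ED2's loop: one deterministic actor, drawn uniformly at
    random, collects a whole episode; the whole ensemble is trained every
    `train_every` environment steps with `updates` SGD updates."""
    actor = random.choice(actors)          # one actor per episode
    obs, done = env.reset(), False
    for step in range(1, steps + 1):
        action = actor(obs)                # deterministic, no action noise
        obs, reward, done, _ = env.step(action)
        if done:                           # new episode -> new random actor
            actor = random.choice(actors)
            obs, done = env.reset(), False
        if step % train_every == 0:
            for _ in range(updates):
                train_ensemble()           # same batches, independent critics
```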
Utilizing the ensembles requires several design choices, which we summarize below. The ablation study of ED2 elements is provided in Appendix C.
Used: We train the ensemble of actors and critics; each actor learns from its own critic and the whole ensemble is trained on the same data.
Not used: We considered different actor-critic configurations, initialization schemes and relations, as well as the use of random prior networks (Osband et al., 2018), data bootstrap (Osband et al., 2016), and different ensemble sizes. We also changed the SOP network sizes and training intensity instead of using the ensemble. Except for the prior networks in some special cases, these turned out to be inferior, as shown in Section 4 and Appendix B.1.
Used: We pick one actor uniformly at random to collect the data for the course of one episode. The actor is deterministic (no additive action noise is applied). These two choices ensure coherent and temporally-extended exploration similarly to Osband et al. (2016).
Not used: We tested several approaches to exploration: using the ensemble of actors, UCB (Lee et al., 2020), and adding the action noise in different proportions. These experiments are presented in Appendix B.2.
Used: The evaluation policy averages all the actors’ output actions to provide stable performance.
Not used: We tried picking the action with the biggest value estimate (the average of the critics' Q-functions) in evaluation (Huang et al., 2017).
Interestingly, both policies had similar results, see Appendix B.3.
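The averaging evaluation policy is straightforward; a minimal sketch, assuming each actor returns a NumPy action vector:

```python
import numpy as np

def average_policy(actors, obs):
    """Evaluation policy: average the output actions of all actors."""
    return np.mean([actor(obs) for actor in actors], axis=0)
```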
Used: We use the action normalization introduced by Wang et al. (2020).
Not used: We experimented with the observations and rewards normalization, which turned out to be unnecessary. The experiments are presented in Appendix B.4.
4 Experiments
In this section, we present our comprehensive study and the resulting insights. The rest of the experiments, verifying that our design choices perform better than the alternatives, are in Appendix B. Unless stated otherwise, a solid line in the figures represents an average, while a shaded region shows a 95% bootstrap confidence interval. We used 30 seeds for ED2 and the baselines, and fewer seeds for the ED2 variants.
4.1 The ensemble of actors boosts the agent performance
ED2 achieves state-of-the-art performance on the OpenAI Gym MuJoCo suite. Figure 1 shows the results of ED2 contrasted with three strong baselines: SUNRISE (Lee et al., 2020), which is also an ensemble-based method, SOP (Wang et al., 2020), and SAC (Haarnoja et al., 2018b).
Figure 2 shows ED2 with different ensemble sizes. As can be seen, the ensemble of size 5 (which we use in ED2) achieves good results, striking a balance between performance and computational overhead.
4.2 The current state-of-the-art methods are unstable under several stability criteria
We consider three notions of stability: inference stability, asymptotic performance stability, and training stability. ED2 outperforms baselines in each of these notions, as discussed below. Similar metrics were also studied in Chan et al. (2020).
Inference stability
We say that an agent is inference stable if, when run multiple times, it achieves similar test performance every time. We measure inference stability using the standard deviation of test returns explained in Section 2. We found that the existing methods train policies that are surprisingly sensitive to the randomness in the environment's initial conditions (the MuJoCo suite is overall deterministic; nevertheless, a little stochasticity is injected at the beginning of each trajectory, see Appendix D for details). Figure 1 and Figure 3 show that ED2 successfully mitigates this problem. By the end of the training, ED2 produces results close to its average performance on Humanoid, while the performance of SUNRISE, SOP, and SAC varies substantially more.
Asymptotic performance stability
We say that an agent achieves asymptotic performance stability if it achieves similar test performance across multiple training runs starting from different initial network weights. Figure 4
shows that ED2 has a significantly smaller variance than the other methods while maintaining high performance.
Training stability
We consider training stable if the performance does not severely deteriorate from one evaluation to the next. We define the root mean squared deterioration metric (RMSD) as follows:
$$\mathrm{RMSD} = \sqrt{\frac{1}{T - 20} \sum_{t=21}^{T} \max\left(0, \bar{R}_{t-20} - \bar{R}_t\right)^2},$$
where $T$ is the number of evaluation phases during training and $\bar{R}_t$ is the average test return at the $t$-th evaluation phase (described in Section 2). We compare returns 20 evaluation phases apart to ensure that the deterioration in performance does not stem from the evaluation variance. ED2 has the lowest RMSD across all tasks, see Figure 5.
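The metric can be sketched in code; this assumes RMSD averages the squared performance drops between evaluation phases a fixed gap apart, which matches the description in the text.

```python
import numpy as np

def rmsd(avg_returns, gap=20):
    """Root mean squared deterioration: penalize drops in the average
    test return between evaluation phases `gap` apart."""
    r = np.asarray(avg_returns, dtype=float)
    drops = np.maximum(0.0, r[:-gap] - r[gap:])   # deterioration only
    return float(np.sqrt(np.mean(drops ** 2)))
```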
4.3 The normally distributed action noise, commonly used for exploration, can hinder training
In this experiment, we deprive SOP of its exploration mechanism, namely the additive normal action noise, and call this variant deterministic SOP (det. SOP). The lack of action noise, while simplifying the algorithm, causes a relatively minor deterioration in the Humanoid performance, has no significant influence on the Hopper or Walker performance, and substantially improves the Ant performance, see Figure 6. This result shows that no additional exploration mechanism, often in the form of exploration noise (Lillicrap et al., 2016; Fujimoto et al., 2018; Wang et al., 2020), is required for diverse data collection and, in the case of Ant, it even hinders training.
ED2 leverages this insight and constructs an ensemble of the deterministic SOP agents presented in Section 3. Figure 7 shows that ED2 exhibits the same effects of exploring without the action noise. In Figure 8, we present a more refined experiment where we vary the noise level. With more noise, the Humanoid results get better, whereas the Ant results get much worse.
4.4 Approximate posterior sampling exploration outperforms approximate UCB exploration combined with weighted Bellman backups
Posterior sampling is proven to be theoretically superior to the OFU strategy (Osband and Van Roy, 2017). We corroborate this empirically for the approximate methods. ED2 uses posterior sampling exploration approximated with the bootstrap (Osband et al., 2016). SUNRISE, on the other hand, approximates the Upper Confidence Bound (UCB) exploration technique and uses weighted Bellman backups (Lee et al., 2020). For a fair comparison between ED2 and SUNRISE, we replace SAC, the SUNRISE base algorithm, with SOP, the algorithm used by ED2. We call this variant SUNRISE-SOP.
We test both methods on the standard MuJoCo benchmarks as well as on delayed (Zheng et al., 2018a) and sparse (Plappert et al., 2018) rewards variants. Both variations make the environments harder from the exploration standpoint. In the delayed version, the rewards are accumulated and returned to the agent only every 10 time-steps. In the sparse version, the reward for the forward motion is returned to the agent only after it crosses the threshold of one unit on the x-axis. For a better perspective, a fully trained Humanoid is able to move around five units by the end of the episode. All the other reward components (living reward, control cost, and contact cost) remain unchanged. The results are presented in Table 1.
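As an illustration, the delayed-reward variant can be expressed as a simple Gym-style wrapper. This is a sketch under our reading of the description, not the authors' code: the accumulated reward is flushed every `delay` steps or at episode end.

```python
class DelayedReward:
    """Accumulate rewards and release them only every `delay` steps
    (or when the episode ends)."""
    def __init__(self, env, delay=10):
        self.env, self.delay = env, delay
        self.buffer, self.t = 0.0, 0
    def reset(self):
        self.buffer, self.t = 0.0, 0
        return self.env.reset()
    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.buffer += reward
        self.t += 1
        out = 0.0
        if done or self.t % self.delay == 0:
            out, self.buffer = self.buffer, 0.0
        return obs, out, done, info
```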
The performance in the MuJoCo environments benefits from ED2's approximate Bayesian posterior sampling exploration (Osband et al., 2013), in contrast to the approximate UCB in SUNRISE, which follows the OFU principle. Moreover, ED2 outperforms the non-ensemble method SOP, supporting the argument of coherent and temporally-extended exploration in ED2.
The experiment where ED2's exploration mechanism is replaced with UCB is in Appendix B.2. This variant also achieves worse results than ED2. An additional exploration efficiency experiment in a custom Humanoid environment, where the agent has to find and reach a goal position, is in Appendix A.
4.5 The weighted Bellman backup cannot replace clipped double Q-learning
We applied the weighted Bellman backups proposed by Lee et al. (2020) to our method. This technique is suggested to mitigate error propagation in Q-learning by re-weighting the Bellman backup based on uncertainty estimates from an ensemble of target Q-functions (i.e. the variance of predictions). Interestingly, Figure 9 does not show this positive effect on ED2.
Our method uses clipped double Q-learning to mitigate overestimation in Q-functions (Fujimoto et al., 2018). We wanted to check if it is required and if it can be replaced by the weighted Bellman backups used by Lee et al. (2020). Figure 10 shows that clipped double Q-learning is required and that the weighted Bellman backups cannot replace it.
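Clipped double Q-learning computes the TD target with the minimum of the two target critics (after Fujimoto et al., 2018); a minimal sketch:

```python
import numpy as np

def clipped_double_q_target(reward, done, q1_next, q2_next, gamma=0.99):
    """TD target with clipped double Q-learning: take the element-wise
    minimum of the two target critics to counteract overestimation."""
    min_q = np.minimum(q1_next, q2_next)
    return reward + gamma * (1.0 - done) * min_q
```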
4.6 The critics’ initialization plays a major role in ensemble-based actor-critic exploration, while the training is mostly invariant to the actors’ initialization
In this experiment, actors’ weights are initialized with the same random values (contrary to the standard case of different initialization). Moreover, we test a corresponding case with critics’ weights initialized with the same random values or simply training only a single critic.
Figure 11 indicates that the choice of actor initialization does not matter in any task but Humanoid. Although the average performance on Humanoid seems to be better, it is also less stable. This is quite interesting because the actors are deterministic. Therefore, the exploration must come from the fact that each actor is trained to optimize its own critic.
On the other hand, Figure 11 shows that the setup with a single critic severely impedes the agent's performance. We suspect that using a single critic impairs the agent's exploration capabilities, as its actors' policies, trained to maximize the same critic's Q-function, become very similar.
5 Related work
Recently, multiple deep RL algorithms for continuous control have been proposed, e.g. DDPG (Lillicrap et al., 2016), TD3 (Fujimoto et al., 2018), SAC (Haarnoja et al., 2018b), SOP (Wang et al., 2020), SUNRISE (Lee et al., 2020). They provide a variety of methods for improving training quality, including double-Q bias reduction (van Hasselt et al., 2016), target policy smoothing or different update frequencies for actor and critic (Fujimoto et al., 2018), entropy regularization (Haarnoja et al., 2018b), action normalization (Wang et al., 2020), prioritized experience replay (Wang et al., 2020), weighted Bellman backups (Kumar et al., 2020; Lee et al., 2020), and use of ensembles (Osband et al., 2019; Lee et al., 2020; Kurutach et al., 2018; Chua et al., 2018).
Deep ensembles are a practical approximation of a Bayesian posterior, offering improved accuracy and uncertainty estimation (Lakshminarayanan et al., 2017; Fort et al., 2019). They inspired a variety of methods in deep RL and are often used for temporally-extended exploration; see the next paragraph. Beyond that, ensembles of different TD-learning algorithms were used to calculate better Q-learning targets (Chen et al., 2018). Others proposed to combine the actions and value functions of different RL algorithms (Wiering and van Hasselt, 2008) or of the same algorithm with different hyper-parameters (Huang et al., 2017). For mixing the ensemble components, complex self-adaptive confidence mechanisms were proposed in Zheng et al. (2018b). Our method is simpler: it uses the same algorithm with the same hyper-parameters without any complex or learnt mixing mechanism. Lee et al. (2020) proposed a unified framework for ensemble learning in deep RL (SUNRISE), which uses bootstrap with random initialization (Osband et al., 2016) similarly to our work. We achieve better results than SUNRISE and show in Appendix B that their UCB exploration and weighted Bellman backups do not aid our algorithm's performance.
Various frameworks have been developed to balance exploration and exploitation in RL. The optimism in the face of uncertainty principle (Lai and Robbins, 1985; Bellemare et al., 2016) assigns an overly optimistic value to each state-action pair, usually in the form of an exploration bonus reward, to promote visiting unseen areas of the environment. The maximum entropy method (Haarnoja et al., 2018a) encourages the policy to be stochastic, hence boosting exploration. In the parameter space approach (Plappert et al., 2018; Fortunato et al., 2018), noise is added to the network weights, which can lead to temporally-extended exploration and a richer set of behaviours. Posterior sampling (Strens, 2000; Osband et al., 2016, 2018) methods have similar motivations. They stem from the Bayesian perspective and rely on selecting the maximizing action among a sampled, statistically plausible set of action values. The ensemble approach (Lowrey et al., 2018; Miłoś et al., 2019; Lee et al., 2020) trains multiple versions of the agent, which yields a diverse set of behaviours and can be viewed as an instance of posterior sampling RL.
6 Conclusions
We conduct a comprehensive empirical analysis of multiple tools from the RL toolbox applied to continuous control in the OpenAI Gym MuJoCo setting. We believe that these findings can be useful to RL researchers. Additionally, we propose Ensemble Deep Deterministic Policy Gradients (ED2), an ensemble-based off-policy RL algorithm, which achieves state-of-the-art performance and addresses several issues found during the aforementioned study.
We have made a significant effort to make our results reproducible. We use 30 random seeds, which is above the currently popular choice in the field (up to 5 seeds). Furthermore, we systematically explain our design choices in Section 3, and we provide a detailed pseudo-code of our method in Algorithm 3 in Appendix E. Additionally, we open-sourced the code for the project (https://github.com/ed2-paper/ED2) together with examples of how to reproduce the main experiments. The implementation details are explained in Appendix E, and extensive information about the experimental setup is given in Appendix D.
- Spinning Up in Deep Reinforcement Learning. GitHub repository. https://github.com/openai/spinningup
- What Matters in On-Policy Reinforcement Learning? A Large-Scale Empirical Study.
- Learning dexterous in-hand manipulation. The International Journal of Robotics Research 39 (1), pp. 3–20.
- Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems 29 (NeurIPS 2016), pp. 1471–1479.
- Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680.
- OpenAI Gym. arXiv preprint arXiv:1606.01540.
- Measuring the Reliability of Reinforcement Learning Algorithms. In International Conference on Learning Representations (ICLR 2020).
- Ensemble Network Architecture for Deep Reinforcement Learning. Mathematical Problems in Engineering.
- Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In NeurIPS 2018, pp. 4759–4770.
- Deep ensembles: A loss landscape perspective. CoRR abs/1912.02757.
- Noisy networks for exploration. In International Conference on Learning Representations (ICLR 2018).
- Addressing function approximation error in actor-critic methods. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), PMLR 80, pp. 1582–1591.
- Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In 35th International Conference on Machine Learning (ICML 2018).
- Soft actor-critic algorithms and applications. CoRR abs/1812.05905.
- Rainbow: Combining improvements in deep reinforcement learning. In 32nd AAAI Conference on Artificial Intelligence (AAAI 2018), pp. 3215–3222.
- Learning to run with actor-critic ensemble.
- A closer look at deep policy gradients. In International Conference on Learning Representations (ICLR 2020).
- Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR 2015).
- DisCor: Corrective feedback in reinforcement learning via distribution correction. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020).
- Model-ensemble trust-region policy optimization. In International Conference on Learning Representations (ICLR 2018).
- Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6 (1), pp. 4–22.
- Simple and scalable predictive uncertainty estimation using deep ensembles. In NIPS 2017.
- SUNRISE: A Simple Unified Framework for Ensemble Learning in Deep Reinforcement Learning.
- Continuous control with deep reinforcement learning. In International Conference on Learning Representations (ICLR 2016).
- Plan online, learn offline: Efficient learning and exploration via model-based control. CoRR abs/1811.01848.
- Uncertainty-sensitive learning and planning with ensembles. arXiv preprint arXiv:1912.09996.
- Randomized prior functions for deep reinforcement learning. In NeurIPS 2018.
- Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems (NIPS 2016).
- Deep exploration via randomized value functions. Journal of Machine Learning Research 20, pp. 124:1–124:62.
- (More) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems (NIPS 2013).
- Why is posterior sampling better than optimism for reinforcement learning? In 34th International Conference on Machine Learning (ICML 2017).
- Parameter space noise for exploration. In International Conference on Learning Representations (ICLR 2018).
- Mastering the game of Go with deep neural networks and tree search. Nature 529 (7587), pp. 484–489.
- A Bayesian framework for reinforcement learning. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2000), pp. 943–950.
- Deep reinforcement learning with double Q-learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI 2016), pp. 2094–2100.
- Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575 (7782), pp. 350–354.
- Striving for simplicity and performance in off-policy DRL: Output normalization and non-uniform sampling. In Proceedings of the 37th International Conference on Machine Learning (ICML 2020), PMLR 119, pp. 10070–10080.
- Ensemble algorithms in reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics.
- On Learning Intrinsic Rewards for Policy Gradient Methods. In Advances in Neural Information Processing Systems 31 (NeurIPS 2018), pp. 4649–4659.
- Self-adaptive double bootstrapped DDPG. In IJCAI International Joint Conference on Artificial Intelligence.
Appendix A Exploration efficiency in the custom Humanoid environment
To check the exploration capabilities of our method, we constructed two environments based on Humanoid where the goal is not only to move forward as fast as possible but to find and get to the specific region. The environments are described in Figure 13.
Because the Humanoid initial state is slightly perturbed every run, we compare solved rates over multiple runs, see details in Appendix D. Figure 13 compares the solved rates of our method and the three baselines. Our method outperforms the baselines. For this experiment, our method uses the prior networks [Osband et al., 2018].
Appendix B Design choices
In this section, we summarize the empirical evaluation of various design choices, grouped by topics related to the ensemble of agents (B.1), exploration (B.2), exploitation (B.3), normalization (B.4), and Q-function updates (B.5). In the plots, a solid line and a shaded region represent an average and a 95% bootstrap confidence interval over 30 seeds in the case of ED2 (ours) and fewer seeds otherwise. All of these experiments test ED2, presented in Section 3, with Algorithm 2 used for evaluation (the ensemble critic variant). We call Algorithm 2 the 'vote policy'.
We tested if our algorithm can benefit from prior networks [Osband et al., 2018]. It turned out that the results are very similar on OpenAI Gym MuJoCo tasks, see Figure 14. However, the prior networks are useful on our crafted hard-exploration Humanoid environments, see Figure 15.
Moreover, we tested if the deterministic SOP variant can benefit from prior networks. It turned out that the results are very similar or worse, see Figure 16.
Figure 17 shows ED2 with different ensemble sizes. As can be seen, the ensemble of size 5 (which we use in ED2) achieves good results, striking a balance between performance and computational overhead.
Osband et al. [2016] and Lee et al. [2020] remark that training an ensemble of agents using the same training data but with different initialization achieves, in most cases, better performance than applying different training samples to each agent. We confirm this observation in Figure 18. Data bootstrap assigns each transition to each agent in the ensemble with 50% probability.
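Data bootstrap of this kind can be sketched as a Bernoulli mask over the ensemble; `bootstrap_mask` is a hypothetical helper, not part of the authors' code.

```python
import numpy as np

def bootstrap_mask(num_transitions, ensemble_size, p=0.5, rng=None):
    """Assign each transition to each ensemble member independently with
    probability p; member k trains only on rows where mask[:, k] == 1."""
    rng = rng or np.random.default_rng()
    return (rng.random((num_transitions, ensemble_size)) < p).astype(np.int8)
```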
SOP bigger networks and training intensity
We checked if simply training SOP with bigger networks or with a higher training intensity (the number of updates made for each collected transition) can get it close to the ED2 results. Figure 19 compares ED2 to SOP with different network sizes, while Figure 20 compares ED2 to SOP with one or five updates per environment step. It turns out that neither bigger networks nor higher training intensity improves SOP performance.
In this experiment, we used the so-called "vote policy" described in Algorithm 2. We use it for action selection in step 6 of Algorithm 3 in two variations: (1) a random critic, chosen for the duration of one episode, evaluates each actor's action, or (2) the full ensemble of critics evaluates the actors' actions. Figure 21 shows that the arbitrary critic variant is not much different from our method. However, in the case of the ensemble critic, we observe a significant performance drop, suggesting deficient exploration.
We tested the UCB exploration method from Lee et al. [2020]. This method defines an upper-confidence bound (UCB) based on the mean and variance of the Q-functions in an ensemble and selects actions with the highest UCB for efficient exploration. Figure 22 shows that the UCB exploration method makes the results of our algorithm worse.
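The UCB rule scores candidate actions by the mean plus a scaled standard deviation of the ensemble's Q-estimates; a sketch with a hypothetical interface (critics taken as plain callables over actions):

```python
import numpy as np

def ucb_select(candidate_actions, q_ensemble, beta=1.0):
    """Pick the candidate action maximizing mean(Q) + beta * std(Q)
    across the critic ensemble."""
    qs = np.array([[q(a) for a in candidate_actions] for q in q_ensemble])
    scores = qs.mean(axis=0) + beta * qs.std(axis=0)
    return candidate_actions[int(np.argmax(scores))]
```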
While our method uses ensemble-based temporally coherent exploration, the most popular choice of exploration is injecting i.i.d. noise [Fujimoto et al., 2018, Wang et al., 2020]. We evaluated if the two approaches can be used together. We used Gaussian noise with the default standard deviation from Wang et al. [2020]. We found that the effects are task-specific: barely visible for Hopper and Walker, positive in the case of Humanoid, and negative for Ant, see Figure 23. In a more refined experiment, we varied the noise level. With more noise, the Humanoid results are better, whereas the Ant results are worse, see Figure 24.
We validated whether reward or observation normalization [Andrychowicz et al., 2020a] helps our method. In both cases, we keep the empirical mean and standard deviation of each reward/observation coordinate, based on all rewards/observations seen so far, and normalize rewards/observations by subtracting the empirical mean and dividing by the standard deviation. It turned out that only the observation normalization significantly helps the agent on Humanoid, see Figures 27 and 28. The action normalization influence is tested in Appendix C.
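The running normalization described above can be sketched as follows; this Welford-style running mean/variance is one plausible implementation, not necessarily the one used in the experiments.

```python
import numpy as np

class RunningNormalizer:
    """Keep a running mean/std of every observation (or reward) coordinate
    and normalize incoming values with them."""
    def __init__(self, shape, eps=1e-8):
        self.n, self.mean = 0, np.zeros(shape)
        self.m2, self.eps = np.zeros(shape), eps
    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)   # Welford's online update
    def normalize(self, x):
        std = np.sqrt(self.m2 / max(self.n - 1, 1)) + self.eps
        return (x - self.mean) / std
```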
B.5 Q-function updates
We tried using the Huber loss for the Q-function training. It makes the results on all tasks worse, see Figure 29.
Appendix C Ablation study
In this section, we ablate the ED2 components to see their impact on performance and stability. We start with the ensemble exploration and exploitation, and then move on to the action normalization and the ERE replay buffer. In all plots, a solid line and a shaded region represent an average and a 95% bootstrap confidence interval over the random seeds (the action normalization and ERE replay buffer experiments were run with a different number of seeds).
Exploration & Exploitation
In the first experiment, we wanted to isolate the effect of ensemble-based temporally coherent exploration on the performance and stability of ED2. Figures 30-33 compare the performance and stability of ED2 and one baseline, SOP, to ED2 with a single actor (the first one) used for evaluation in step 22 of Algorithm 3. It is worth noting that the action selection during the data collection, step 6 in Algorithm 3, is left unchanged: the ensemble of actors is used for exploration, and each actor is trained on all the data. This should isolate the effect of exploration on the test performance of every actor. The results show that the performance improvement and stability of ED2 do not come solely from the efficient exploration. The ED2 ablation performs comparably to the baseline and is even less stable.
In the next experiment, we checked whether the ensemble evaluation alone is sufficient. Figure 34 compares the performance of ED2 and one baseline, SOP, to ED2 with a single actor (the first one) used for data collection in step 6 of Algorithm 3. The action selection during evaluation, step 22 in Algorithm 3, is left unchanged – the ensemble of actors is trained on the data collected by only one of the actors. We add Gaussian noise to the single actor's actions for exploration, as described in Appendix B.2. The results show that the ensemble's test performance collapses, possibly because of training on out-of-distribution data. This implies that the ensemble of actors used for evaluation improves the test performance and stability, but only if the same ensemble of actors is also used for exploration during data collection.
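The two roles of the ensemble discussed above can be sketched as follows. This is a minimal Python sketch under two assumptions – that evaluation averages the actors' actions and that data collection follows a single actor per episode; ED2's exact rules are given in steps 22 and 6 of Algorithm 3:

```python
import numpy as np

def evaluation_action(actors, obs):
    # Combine the ensemble at test time by averaging the actions
    # proposed by all actors (an assumed ensemble-evaluation rule).
    return np.mean([actor(obs) for actor in actors], axis=0)

def collection_action(actors, obs, episode_actor_idx):
    # Temporally coherent exploration: a single actor, fixed for the
    # whole episode, picks every action during data collection.
    return actors[episode_actor_idx](obs)

# Two toy deterministic "actors" standing in for neural-network policies.
actors = [lambda o: np.array([0.0]), lambda o: np.array([1.0])]
```

The ablations above correspond to replacing `evaluation_action` (first experiment) or `collection_action` (second experiment) with a single fixed actor.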
ERE replay buffer
Appendix D Experimental setup
In all evaluations, we used 30 evaluation episodes to better assess the average performance of each policy, as described in Section 2. For easier visual assessment, we smoothed the lines using an exponential moving average with a fixed smoothing factor.
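The smoothing applied to the learning curves can be sketched in a few lines. A minimal sketch; the function name is illustrative and the smoothing factor below is an arbitrary example, not the value used in our plots:

```python
def exponential_moving_average(values, smoothing):
    """Smooth a curve for plotting: each point is a convex combination
    of the previous smoothed value and the new raw value."""
    smoothed, last = [], values[0]
    for v in values:
        last = smoothing * last + (1.0 - smoothing) * v
        smoothed.append(last)
    return smoothed

curve = exponential_moving_average([0.0, 1.0, 1.0, 1.0], smoothing=0.5)
```

Larger smoothing factors give smoother but more lagged curves.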
OpenAI Gym MuJoCo
In the MuJoCo environments, presented in Figure 37, a state is defined by the position and velocity of the robot's root and by the angular position and velocity of each of its joints. The observation holds almost all information from the state, except the x and y positions of the robot's root. The action specifies the torque applied to each joint of the robot. The sizes of these spaces for each environment are summarised in Table 2.
MuJoCo is a deterministic physics engine, so all simulations conducted inside it are deterministic, including those of our environments. However, to simplify data gathering and to counteract over-fitting, the authors of OpenAI Gym introduced some stochasticity: each episode starts from a slightly different state, with the initial positions and velocities perturbed by random noise (uniform or normal, depending on the particular environment).
| Environment name | Action space size | Observation space size |
Appendix E Implementation details
Architecture and hyper-parameters
In our experiments, we use deep neural networks with two hidden layers of 256 units each. All networks use ReLU activations, except at the final output layer, where the activation depends on the model: critic networks use no output activation, while actor networks use tanh multiplied by the max action scale. Table 3 shows the hyper-parameters used for the tested algorithms.
Our algorithm employs the action normalization proposed by Wang et al. [2020]. It means that before applying the squashing function (e.g. tanh), the outputs of each actor network are normalized in the following way: let mu_1, ..., mu_K be the output of the actor's network and let G = (1/K) * sum_i |mu_i| be the average magnitude of this output, where K is the action's dimensionality. If G > 1, we normalize the output by setting mu_i to mu_i / G for all i = 1, ..., K. Otherwise, we leave the output unchanged. Each actor's outputs are normalized independently of the other actors in the ensemble.
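The normalization rule above can be sketched in a few lines. A sketch of the pre-activation normalization from Wang et al. [2020], applied per actor; the function name is illustrative:

```python
import numpy as np

def normalize_pre_activations(mu):
    """If the average magnitude G of the actor's pre-activation outputs
    exceeds 1, divide every coordinate by G; otherwise leave the
    outputs unchanged (per-actor normalization, Wang et al. style)."""
    g = np.mean(np.abs(mu))
    return mu / g if g > 1.0 else mu

out = normalize_pre_activations(np.array([4.0, 2.0]))   # G = 3 > 1, rescaled
small = normalize_pre_activations(np.array([0.1, -0.2]))  # G < 1, unchanged
```

After this step, the squashing function and the action scale are applied as usual.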
Emphasizing Recent Experience
We implement the Emphasizing Recent Experience (ERE) mechanism from Wang et al. [2020]. ERE samples non-uniformly, favouring the most recent experiences stored in the replay buffer. Let K be the number of mini-batch updates and N the size of the replay buffer. When performing the K gradient updates, for the k-th update we sample from the c_k most recent data points stored in the replay buffer, where c_k = N * η^(k * 1000 / K) for k = 1, ..., K.
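The sampling range above can be sketched as follows. A minimal sketch of the ERE range computation; the function name is illustrative, and the floor `c_min` (a minimum sampling range, present in Wang et al.'s formulation) is given an example value here, not necessarily ours:

```python
def ere_range(k, num_updates, buffer_size, eta, c_min=5000):
    """Number of most-recent replay-buffer points to sample from for the
    k-th of `num_updates` mini-batch updates (ERE, Wang et al. style).
    `c_min` caps how small the sampling range can get."""
    c_k = int(buffer_size * eta ** (k * 1000.0 / num_updates))
    return max(c_min, c_k)
```

Early updates in a batch draw from the whole buffer; later updates concentrate on increasingly recent experience.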
The hyper-parameter η starts off with a fixed initial value and is later adapted based on the improvements in the agent's training performance. Let I be the improvement in terms of training episode returns made over recent time-steps and I_max be the maximum of such improvements over the course of the training. We adapt η according to the formula:
Our implementation uses an exponentially weighted moving average to store the value of I. More concretely, we define I based on two additional parameters, which are updated whenever we receive a new training episode return:
where T_max denotes the maximum length of an episode.
During the training of our models, we employ only CPUs, using a cluster whose nodes each provide multiple cores and sufficient memory. The running time of a typical experiment did not exceed 24 hours.