Online Hyper-parameter Tuning in Off-policy Learning via Evolutionary Strategies

by Yunhao Tang, et al.
Columbia University

Off-policy learning algorithms are known to be sensitive to the choice of hyper-parameters. However, unlike near on-policy algorithms, for which hyper-parameters can be optimized via e.g. meta-gradients, similar techniques cannot be straightforwardly applied to off-policy learning. In this work, we propose a framework that applies Evolutionary Strategies (ES) to online hyper-parameter tuning in off-policy learning. Our formulation draws close connections to meta-gradients and leverages the strengths of black-box optimization over relatively low-dimensional search spaces. We show that our method outperforms state-of-the-art off-policy learning baselines with static hyper-parameters, as well as recent prior work, over a wide range of continuous control benchmarks.








1 Introduction

Off-policy learning is a powerful paradigm for RL problems. Despite its great promise, when combined with neural networks in many modern applications [27, 35], off-policy learning suffers from constant instability, partly characterized as the deadly triad [38, 40]. As a result, additional empirical techniques must be implemented to achieve more robust performance in practice, e.g. target networks [27]. Though theory suggests that off-policy learning could be performed with a highly different behavior policy $\mu$ and target policy $\pi$, in challenging domains the best performance is obtained when data are near on-policy, i.e. $\mu \approx \pi$ [18]. In batch RL, an extreme special case of off-policy learning where the data are collected under a behavior policy beforehand and no further data collection is allowed, naive applications of off-policy algorithms do not work properly [11]. In addition to algorithmic limitations, the search for good hyper-parameters for off-policy algorithms is also critical yet brittle. For example, prior work has observed that performance is highly sensitive to hyper-parameters such as learning rates, and depends critically on seemingly heuristic techniques such as $n$-step updates [23, 3, 16].

In this work, we focus on this latter source of instability for off-policy learning, i.e. hyper-parameter tuning. Unlike in supervised learning, where a static set of hyper-parameters might suffice, for general RL problems it is desirable to adapt the hyper-parameters on the fly, as the training procedure is much more non-stationary. Though it is possible to design theoretically justified schemes for adapting hyper-parameters, such methods are usually limited to a set of special quantities, such as the eligibility trace $\lambda$ [25] or the mixing coefficient $\alpha$ for alpha-retrace [31]. More generally, the tuning of generic hyper-parameters could be viewed as greedily optimizing certain meta-objectives at each iteration [45, 29, 46]. For example, in near on-policy algorithms such as IMPALA [10], hyper-parameters are updated by meta-gradients [45, 46] (in this literature, trainable hyper-parameters are called meta-parameters), which are calculated via back-propagation from the meta-objectives.

However, in off-policy learning, techniques such as meta-gradients are not immediately feasible. Indeed, since the existing formulation of meta-gradients [45, 46] is limited to near on-policy actor-critic algorithms [26, 10], its extension to replay-based off-policy algorithms is not yet clear. The difficulty arises from the design of many off-policy algorithms: many off-policy updates are based not on the target RL objective but on proxies such as Bellman errors [27, 23] or off-policy objectives [9]. This makes it challenging to define and calculate meta-gradients, which requires differentiating through the RL objective via policy gradients [45]. To adapt hyper-parameters in such cases, a naive yet straightforward resort is to train multiple agents with an array of hyper-parameters in parallel, as in Population Based Training (PBT), and to update hyper-parameters with e.g. genetic algorithms [17]. Though more black-box in nature, PBT proves high-performing yet too costly in practice.

(a) Mean
(b) Median
(c) Best ratio
Figure 1: Training performance of discrete hyper-parameter adaptation on control suite tasks. Each plot shows a separate performance statistic during training (mean, median and best ratio). The statistics are normalized per task and averaged over simulated locomotion tasks. Observe that ES adaptation outperforms other baselines in every performance metric. See Section 4 and Appendix 5.2 for detailed descriptions of the normalized scores.
Main idea.

We propose a framework for optimizing hyper-parameters within the lifetime of a single agent (unlike the multiple copies in PBT) with ES, called OHT-ES. ES are agnostic to the off-policy updates of the baseline algorithm and can readily adapt discrete/continuous hyper-parameters effectively. With the recent revival of ES, especially for low-dimensional search spaces [32, 13], we will see that our proposal combines the best of both off-policy learning and ES.

OHT-ES outperforms off-policy baselines with static hyper-parameters. In Figure 1, we show the significant performance gains of off-policy learning baselines combined with OHT-ES (blue curves), compared to static hyper-parameters. We evaluate all algorithms with normalized scores over 13 simulated control tasks (see Section 4 for details). The performance gains of OHT-ES are consistent across all three reported metrics over normalized scores.

2 Background

In the standard formulation of an MDP, at a discrete time $t$ an agent is in state $s_t$, takes action $a_t$, receives a reward $r_t$ and transitions to a next state $s_{t+1}$. A policy $\pi(a|s)$ defines a map from states to distributions over actions. The standard objective of RL is to maximize the expected cumulative discounted returns $J(\pi) = \mathbb{E}_\pi[\sum_{t \ge 0} \gamma^t r_t]$ with a discount factor $\gamma \in [0, 1)$.

2.1 Off-policy learning

Off-policy learning entails policy optimization through learning from data generated by an arbitrary behavior policy, e.g. historical policies. For example, Q-learning [42] is a prominent framework for off-policy learning: given an $n$-step partial trajectory $(s_t, a_t, r_t, \ldots, s_{t+n})$, $n$-step Q-learning optimizes a parameterized Q-function $Q_\theta(s, a)$ by minimizing the Bellman error

$\min_\theta\, \mathbb{E}_{\mathcal{D}}\left[\left(Q_\theta(s_t, a_t) - y_t\right)^2\right],$

where $y_t = \sum_{i=0}^{n-1} \gamma^i r_{t+i} + \gamma^n \max_{a'} Q_{\theta^-}(s_{t+n}, a')$ is the $n$-step target and $\mathbb{E}_{\mathcal{D}}$ denotes that the data are sampled from a replay buffer $\mathcal{D}$. When $n = 1$, Q-learning converges to the optimal solution in tabular cases and under mild conditions [42]. Recently, [31] showed that general uncorrected $n$-step updates for $n > 1$ introduce target bias in exchange for faster contraction to the fixed point, which tends to bring empirical gains. Though there is no general optimality guarantee for $n > 1$, prior work finds that employing $n > 1$ significantly speeds up optimization in challenging image-based benchmark domains [26, 16, 18]. Other related prominent off-policy algorithms include off-policy policy gradients [9, 41], whose details we omit here.
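To make the target concrete, here is a minimal sketch of computing the uncorrected $n$-step target from a recorded reward window; the function name and the scalar `bootstrap_q` (standing in for the target-network value $\max_{a'} Q_{\theta^-}(s_{t+n}, a')$) are illustrative, not from the paper.

```python
def n_step_target(rewards, bootstrap_q, gamma=0.99):
    """Uncorrected n-step target: discounted sum of the n observed rewards,
    bootstrapped with a Q-value estimate at step n."""
    target = 0.0
    for i, r in enumerate(rewards):          # i = 0 .. n-1
        target += (gamma ** i) * r
    target += (gamma ** len(rewards)) * bootstrap_q
    return target

# n = 3, unit rewards, gamma = 0.5, bootstrap value 4:
# 1 + 0.5 + 0.25 + 0.125 * 4 = 2.25
print(n_step_target([1.0, 1.0, 1.0], 4.0, gamma=0.5))
```

Note that larger `n` places more weight on observed rewards and less on the (possibly biased) bootstrap estimate, which is the trade-off discussed above.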

2.2 Off-policy actor-critic

By construction, Q-learning requires a maximization over actions to compute target values, which becomes intractable when the action space is continuous, e.g. $\mathcal{A} = \mathbb{R}^m$. To bypass such issues, consider a deterministic policy $\pi_\phi$ as an approximate maximizer, i.e. $\pi_\phi(s) \approx \arg\max_a Q_\theta(s, a)$. This produces the Q-function target $y_t = \sum_{i=0}^{n-1} \gamma^i r_{t+i} + \gamma^n Q_{\theta^-}(s_{t+n}, \pi_\phi(s_{t+n}))$. The Q-function (critic) and the policy (actor) are alternately updated as follows, with learning rate $\alpha$,

$\theta \leftarrow \theta - \alpha \nabla_\theta\, \mathbb{E}_{\mathcal{D}}\left[\left(Q_\theta(s_t, a_t) - y_t\right)^2\right], \qquad \phi \leftarrow \phi + \alpha \nabla_\phi\, \mathbb{E}_{\mathcal{D}}\left[Q_\theta(s_t, \pi_\phi(s_t))\right]. \quad (1)$

Depending on whether the actor or the critic is fully optimized at each iteration, there are two alternative interpretations of the updates defined in Eqn.(1). When the policy is fully optimized such that $\pi_\phi(s) = \arg\max_a Q_\theta(s, a)$, the updates are exact $n$-step Q-learning. When the critic is fully optimized such that $Q_\theta \approx Q^{\pi_\phi}$, the updates are $n$-step SARSA for policy evaluation combined with deterministic policy gradients [36]. In practice, critic and actor updates take place alternately, and the algorithm is a mixture between value iteration and policy iteration [37]. Built upon the updates in Eqn.(1), additional techniques such as double critics [12] and the maximum entropy formulation [14] can greatly improve the stability of the baseline algorithm.

2.2.1 Evolutionary strategies

ES are a family of zero-order optimization algorithms (see e.g. [15, 8, 43, 32]), which have seen a recent revival for applications in RL [32]. In its generic form, consider a function $f(\theta)$ with parameter $\theta \in \mathbb{R}^d$; the aim is to optimize $f$ with only queries of its function values. For simplicity, assume $f$ is continuous and consider the ES gradient descent formulation introduced in [32]. Instead of optimizing $f(\theta)$ directly, consider a smoothed objective

$f_\sigma(\theta) = \mathbb{E}_{\epsilon \sim N(0, I)}\left[f(\theta + \sigma \epsilon)\right]$

with some fixed variance parameter $\sigma > 0$. It is then feasible to approximate the gradient with $N$-sample unbiased estimates, in particular,

$\nabla_\theta f_\sigma(\theta) \approx \frac{1}{N \sigma} \sum_{i=1}^{N} f(\theta + \sigma \epsilon_i)\, \epsilon_i,$

where $\epsilon_i$ are i.i.d. Gaussian vectors. A naive approach to RL is to flatten the sequential problem into a one-step blackbox problem by setting $f(\theta) = J(\pi_\theta)$, the expected cumulative return of policy $\pi_\theta$. Despite its simplicity, this approach proved efficient compared to policy gradient algorithms [32, 6, 24], though generally its sample efficiency could not match that of off-policy algorithms.
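The $N$-sample estimator above can be sketched in a few lines. This is an illustrative implementation, not the paper's code; the subtraction of the baseline $f(\theta)$ is a common variance-reduction trick that leaves the estimator unbiased but is not spelled out in the text.

```python
import random

def es_gradient(f, theta, sigma=0.1, n_samples=2000, seed=0):
    """N-sample ES estimate of the gradient of the smoothed objective:
    grad ≈ (1 / (N * sigma)) * sum_i (f(theta + sigma*eps_i) - f(theta)) * eps_i."""
    rng = random.Random(seed)
    dim = len(theta)
    f0 = f(theta)                        # baseline, reduces variance without bias
    grad = [0.0] * dim
    for _ in range(n_samples):
        eps = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        val = f([t + sigma * e for t, e in zip(theta, eps)]) - f0
        for j in range(dim):
            grad[j] += val * eps[j] / (n_samples * sigma)
    return grad

# Sanity check: f(x) = -|x|^2 has gradient -2x, so at theta = [1, 0]
# the estimate should be close to [-2, 0].
g = es_gradient(lambda x: -sum(v * v for v in x), [1.0, 0.0])
```

Only function evaluations of `f` are needed, which is what makes the approach applicable to the non-differentiable meta objectives considered later.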

3 Online Hyper-parameter Tuning via Evolutionary Strategies

Let $\eta$ denote the set of adjustable hyper-parameters, e.g. the learning rate $\alpha \in \mathbb{R}_+$ for real-valued hyper-parameters, or a probability distribution over $n$-step targets for discrete $n$. At iteration $t$ with actor-critic parameter $\theta_t$, given replay buffer $\mathcal{D}$, the algorithm constructs an update such that

$\theta_{t+1} = F(\theta_t, \eta_t, \mathcal{D}), \quad (2)$

following [45]. Here we make explicit the dependency of the update function $F$ on the replay buffer $\mathcal{D}$. For example, the update function could be the gradient descent steps defined in Eqn.(1).

When the algorithm does not update the hyper-parameters at all, i.e. $\eta_{t+1} = \eta_t$, we reduce to the case of static hyper-parameters. One straightforward way to update the hyper-parameters is to greedily optimize them against some meta objective $J$ [45, 29], such that

$\eta_{t+1} = \arg\max_\eta\, J\!\left(F(\theta_t, \eta, \mathcal{D})\right).$

Since the motivation of hyper-parameter adaptation is to better optimize the RL objective, it is natural to set the meta objective as the target RL objective, i.e. the cumulative returns $J(\pi_\theta)$.

3.1 Methods

1:  Input: off-policy update function $F$, agent parameter $\theta$, and hyper-parameter distribution parameter $\phi$ (e.g. a Gaussian mean).
2:  while training is not complete do
3:     Sample $K$ hyper-parameters $\eta_k \sim N(\phi, \sigma^2 I)$, $1 \le k \le K$, from a Gaussian distribution.
4:     Train off-policy agents: $\theta_k = F(\theta, \eta_k, \mathcal{D})$.
5:     Collect rollouts with agent parameters $\theta_k$, save data to $\mathcal{D}$. Estimate $\hat{J}(\eta_k)$.
6:     Update the hyper-parameter distribution based on Eqn.(3).
7:  end while
Algorithm 1 Online Hyper-parameter Tuning via Evolutionary Strategies (OHT-ES)

Now we describe Online Hyper-parameter Tuning via Evolutionary Strategies (OHT-ES). Note that the framework is generic, as it could be combined with any off-policy algorithm with update function $F$; recall that the update function returns a new parameter $\theta_{t+1} = F(\theta_t, \eta_t, \mathcal{D})$. The general meta algorithm is presented in Algorithm 1, where we assume the hyper-parameters to be real-valued. It is straightforward to derive similar algorithms for discrete hyper-parameters, as explained below.

Consider iteration $t$ of learning: the agent maintains a parametric distribution over hyper-parameters, e.g. a Gaussian $N(\phi_t, \sigma^2 I)$ with tunable mean $\phi_t$ and fixed variance $\sigma^2$. We sample a population of $K$ actor-critic agents, each with a separate hyper-parameter $\eta_k$ drawn from the parametric distribution. Then, for each of the $K$ copies of the agent, we update its parameters via the off-policy subroutine $\theta_k = F(\theta_t, \eta_k, \mathcal{D})$. After the update is complete, each agent with parameter $\theta_k$ collects rollouts from the environment and saves the data to $\mathcal{D}$. From the rollouts, we construct estimates $\hat{J}(\eta_k)$ of the meta objective. Finally, the hyper-parameter mean is updated via an ES subroutine. For example, we might apply ES gradient ascent [32], and the new distribution parameter is updated with learning rate $\beta$,

$\phi_{t+1} = \phi_t + \beta \cdot \frac{1}{K \sigma} \sum_{k=1}^{K} \hat{J}(\eta_k)\, \epsilon_k, \quad \text{where } \eta_k = \phi_t + \sigma \epsilon_k. \quad (3)$
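As a self-contained toy instance of Algorithm 1, the sketch below tunes a single hyper-parameter, the log learning rate of a scalar gradient-ascent "agent" maximizing $f(w) = -(w-3)^2$, with an ES update on the Gaussian mean. The population size, the z-scoring of returns and the learning-rate clamp are illustrative choices, not details from the paper.

```python
import random

def oht_es_sketch(num_iters=50, pop_size=10, sigma=0.3, meta_lr=0.3, seed=0):
    """Toy OHT-ES loop: the hyper-parameter is log10 of a learning rate, the
    'off-policy update' is one gradient-ascent step on f(w) = -(w - 3)^2,
    and the meta objective is the post-update value of f."""
    rng = random.Random(seed)
    w, phi = 0.0, -3.0                   # agent parameter; hyper-parameter mean
    f = lambda v: -(v - 3.0) ** 2
    for _ in range(num_iters):
        eps_list, returns = [], []
        for _ in range(pop_size):        # population of perturbed hyper-parameters
            eps = rng.gauss(0.0, 1.0)
            lr = min(10.0 ** (phi + sigma * eps), 0.8)   # clamp for stability
            w_k = w + lr * 2.0 * (3.0 - w)               # one update step
            eps_list.append(eps)
            returns.append(f(w_k))                       # estimated meta objective
        mean_r = sum(returns) / pop_size
        std_r = (sum((r - mean_r) ** 2 for r in returns) / pop_size) ** 0.5 + 1e-8
        # ES gradient ascent on the hyper-parameter mean, with z-scored returns
        phi += meta_lr * sum((r - mean_r) / std_r * e
                             for r, e in zip(returns, eps_list)) / pop_size
        # advance the shared agent with the current mean hyper-parameter
        w = w + min(10.0 ** phi, 0.8) * 2.0 * (3.0 - w)
    return w, phi
```

Starting from a deliberately tiny learning rate ($10^{-3}$), the ES update grows it toward the well-conditioned regime, and the agent parameter converges near its optimum $w = 3$.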

Discrete hyper-parameters.

We also account for the case where the hyper-parameters take values from a discrete set, denoted as $\{\eta^{(1)}, \ldots, \eta^{(m)}\}$. In such cases, instead of maintaining a parametric Gaussian distribution over hyper-parameters, we maintain a categorical distribution $p_\phi$, where $\phi \in \mathbb{R}^m$ is the vector of logits and $p_\phi(\eta^{(j)}) \propto \exp(\phi_j)$. By sampling several hyper-parameter candidates $\eta_k \sim p_\phi$, we could construct a score function gradient estimator [44] for the logits,

$\nabla_\phi\, \mathbb{E}_{\eta \sim p_\phi}\left[\hat{J}(\eta)\right] \approx \frac{1}{K} \sum_{k=1}^{K} \hat{J}(\eta_k)\, \nabla_\phi \log p_\phi(\eta_k). \quad (4)$
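For a softmax-parameterized categorical distribution, the score function has the closed form $\nabla_{\phi_j} \log p_\phi(\eta_k) = \mathbb{1}[\eta_k = \eta^{(j)}] - p_\phi(\eta^{(j)})$, which the following illustrative sketch implements (the helper names are ours, not the paper's):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def score_function_grad(logits, sampled_idx, returns):
    """Score-function gradient estimate for categorical logits:
    grad_j ≈ (1/K) * sum_k R_k * (1[idx_k = j] - p_j)."""
    p = softmax(logits)
    grad = [0.0] * len(logits)
    for idx, r in zip(sampled_idx, returns):
        for j in range(len(logits)):
            grad[j] += r * ((1.0 if j == idx else 0.0) - p[j]) / len(returns)
    return grad

# One sample of candidate 0 with return 1 under uniform logits pushes
# probability mass toward candidate 0 and away from the others.
g = score_function_grad([0.0, 0.0, 0.0], [0], [1.0])
```

A gradient ascent step on the logits with this estimate then plays the role of Eqn.(3) in the discrete case.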
3.2 Connections to prior work

We make explicit the connections between our approach and closely related prior work.

Connections to meta-gradients.

When hyper-parameters are real-valued, the ES updates defined in Eqn.(3) closely relate to meta-gradients [45], as summarized in the following proposition.

Proposition 1.

(Proved in Appendix 5.1) Assume that sampled hyper-parameters follow a Gaussian distribution $\eta \sim N(\phi, \sigma^2 I)$. Then the following holds,

$\nabla_\phi\, \mathbb{E}_{\eta \sim N(\phi, \sigma^2 I)}\left[J\!\left(F(\theta, \eta, \mathcal{D})\right)\right] = \mathbb{E}_{\eta \sim N(\phi, \sigma^2 I)}\left[\nabla_\eta\, J\!\left(F(\theta, \eta, \mathcal{D})\right)\right]. \quad (5)$

Since ES gradient updates are a zero-order approximation to the analytic gradients, this connection should be intuitive. Note that the RHS of Eq.(5) differs from meta-gradient updates in practice in several aspects [45]: in general, meta-gradients could introduce trace parameters to stabilize the update, and the gradient is evaluated at the current hyper-parameter $\phi$ instead of the perturbed samples $\eta \sim N(\phi, \sigma^2 I)$ as defined above.
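The identity can be sketched with a one-line reparameterization argument (cf. the full proof in Appendix 5.1; we assume here that $J \circ F$ is smooth enough to exchange gradient and expectation). Writing $\eta = \phi + \sigma \epsilon$ with $\epsilon \sim N(0, I)$,

```latex
\begin{align*}
\nabla_{\phi}\, \mathbb{E}_{\eta \sim N(\phi, \sigma^2 I)}\!\left[J(F(\theta, \eta, \mathcal{D}))\right]
&= \nabla_{\phi}\, \mathbb{E}_{\epsilon \sim N(0, I)}\!\left[J(F(\theta, \phi + \sigma\epsilon, \mathcal{D}))\right] \\
&= \mathbb{E}_{\epsilon \sim N(0, I)}\!\left[\nabla_{\phi}\, J(F(\theta, \phi + \sigma\epsilon, \mathcal{D}))\right] \\
&= \mathbb{E}_{\eta \sim N(\phi, \sigma^2 I)}\!\left[\nabla_{\eta}\, J(F(\theta, \eta, \mathcal{D}))\right].
\end{align*}
```

In words, the ES gradient of the smoothed objective equals the meta-gradient averaged over Gaussian perturbations of the hyper-parameter.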

Connections to near on-policy methods.

For near on-policy algorithms such as A2C, TRPO and PPO [26, 33, 34], there are natural constraints on the parameter updates. As a result, given the meta objective at one hyper-parameter value, it is possible to estimate meta objectives at alternative hyper-parameter values with importance sampling (IS) [29]. Meta objectives could then be greedily optimized even via zero-order methods. However, it is not clear how the correlations/variance of such IS-estimated meta objectives impact the updates, as they are estimated from the same data. As an alternative to IS, we estimate $\hat{J}(\eta_k)$ via Monte-Carlo samples of cumulative returns under $\theta_k$, which is applicable when trust regions are not available (as with many off-policy algorithms) and when policies are deterministic [23, 12].

Connections to ES-RL.

Our method closely relates to prior work on combining ES with gradient-based off-policy RL algorithms [19, 30], which we name ES-RL. These algorithms maintain a population of off-policy agents with parameters $\{\theta_k\}$ and carry out ES updates directly on the agent parameters, e.g. with a genetic algorithm [19] or the cross-entropy method [30]. They could be interpreted as a special case of our framework: indeed, one could include the trainable agent parameters as part of the hyper-parameters, and this formulation reduces to ES-RL. However, ES-RL applies ES updates to high-dimensional trainable parameters, which might be less effective than applying them to a low-dimensional hyper-parameter search space. We will examine their relative strengths in Section 4.

Connections to PBT.

Our framework could be interpreted as a special variant of PBT [17], where the $K$ copies of the RL agents share replay buffers. In particular, PBT agents are trained independently in parallel and only exchange information during periodic hyper-parameter updates, while our approach ensures that these agents share information during training as well. This makes our approach potentially much more sample efficient than PBT. It is also worth noting that sharing buffers involves a trade-off: though agents could utilize others' data for potentially better exploration, the behavior data also become less on-policy for any particular agent and might introduce additional instability [18].

4 Experiments

In the experiments, we seek to address the following questions: (1) Is OHT-ES effective for discrete hyper-parameters? (2) Is OHT-ES effective for continuous hyper-parameters, and how does it compare to meta-gradients [45]? (3) How does OHT-ES compare to highly related methods such as ES-RL [30]?

To address (1), we study the effect of adapting the horizon hyper-parameter $n$ in $n$-step updates. Prior work observed that small $n > 1$ generally performs well for Atari and image-based continuous control [26, 16, 3], though the best hyper-parameter can be task-dependent. We expect OHT-ES to adapt to near-optimal hyper-parameters for each task. To address (2), we study the effect of learning rates, and we compare with an application of meta-gradients [45] to off-policy agents. Though prior work focuses on applying meta-gradients to near on-policy methods [45], we provide one extension to off-policy baselines for comparison, with details described below.

Benchmark tasks.

For benchmark tasks, we focus on state-based continuous control. In order to assess the strengths of different algorithmic variants, we consider similar tasks (Walker, Cheetah and Ant) with different simulation backends from OpenAI gym [5], Roboschool [22], DeepMind Control Suite [39] and the Bullet Physics Engine [7]. These backends differ in many aspects, e.g. dimensions of the observation and action spaces, transition dynamics and reward functions. With such a wide variety, we seek to validate algorithmic gains with sufficient robustness to varying domains. There are a total of 13 distinct tasks, with details in Appendix 5.2.

Base update function.

Since we focus on continuous control, we adopt the state-of-the-art TD3 [12] as the baseline algorithm, i.e. as the update function $F$ defined in Eqn.(2).

4.1 Continuous hyper-parameters

As an example of adapting continuous hyper-parameters, we focus on the learning rates $\alpha$, which include the learning rates for the actor and critic respectively. Extensions to other continuous hyper-parameters are straightforward. For example, the original meta-gradients were designed for the discount factor $\gamma$ and eligibility trace $\lambda$ [45], and later extended to entropy regularization and learning rates [46]. For the baseline TD3, an alternative hyper-parameter is the discount $\gamma$, for which we find adaptive tuning does not provide significant gains.

We present results on challenging domains from the DeepMind Control Suite [39], where performance gains are most significant. Detailed environment and hyper-parameter settings are in Appendix 5.2. We compare the OHT-ES tuning approach with a variant of meta-gradients: as discussed in Section 3, meta-gradient approaches are less straightforward in general off-policy learning. We derive a meta-gradient algorithm for deterministic actor-critics [23, 12] and provide a brief introduction below.

Meta-gradients for deterministic actor-critics.

Deterministic actor-critics maintain a Q-function critic $Q_\theta$ and a deterministic actor $\pi_\phi$. We propose to train an alternative critic $Q_\psi$ for policy evaluation, updated via TD-learning on $\pi_\phi$. Recall that actor-critics are updated as defined in Eqn.(1), and let $\theta', \phi'$ denote the updated parameters. Next, let the meta objective be the off-policy objective $J(\phi') = \mathbb{E}_{s \sim \mathcal{D}}\left[Q_\psi(s, \pi_{\phi'}(s))\right]$ [9], where the expectation is taken over states sampled from the replay buffer $\mathcal{D}$. The meta-gradients are calculated by back-propagating through the updated actor parameter $\phi'$ via the chain rule. Please see Appendix 5.2 for a detailed derivation and design choices.


The comparisons between OHT-ES, meta-gradients and the TD3 baseline are shown in Figure 2. We make a few observations: (1) OHT-ES consistently achieves the best performance across all four environments, with significant gains in both asymptotic performance and learning speed over meta-gradients and TD3; (2) Meta-gradients achieve gains over the baseline most of the time, which implies that there is potential for improvement from adaptive learning rates; (3) The baseline TD3 does not perform very well on the control suite tasks, in contrast to its high performance on typical benchmarks such as OpenAI gym [22]. This provides a strong incentive to test on a wide range of benchmark testbeds in future research, as in our paper. We speculate that TD3's suboptimality is due to its design choices (including hyper-parameters) not being exhaustively tuned on these newer benchmarks. With adaptive tuning, we partially resolve the issue and obtain performance almost identical to state-of-the-art algorithms on the control suite (e.g. see MPO [1]).

(a) DMWalkerRun
(b) DMWalkerStand
(c) DMWalkerWalk
(d) DMCheetahRun
Figure 2: Training performance of continuous hyper-parameter adaptation on control suite tasks. Algorithmic variants are shown in different colors: TD3 (red), meta-gradient TD3 (green) and OHT-ES TD3 (blue). Each task is trained for a fixed number of time steps, and each curve shows results across three seeds.

4.2 Discrete hyper-parameters

As an important example of adaptive discrete hyper-parameters, we focus on the horizon parameter $n$ in $n$-step updates. Due to the discrete nature of such hyper-parameters, it is less straightforward to apply meta-gradients out of the box. As a comparison to the adaptive approach, we consider static hyper-parameters and test whether online adaptation brings significant gains. We show results on tasks from the control suite in Figure 3 (first row). For static baselines, we consider TD3 with $n$-step updates for several fixed values of $n$.

(a) DMWalkerRun
(b) DMWalkerStand
(c) DMWalkerWalk
(d) DMCheetahRun
(e) DMWalkerRun(D)
(f) DMWalkerStand(D)
(g) DMWalkerWalk(D)
(h) DMCheetahRun(D)
Figure 3: Training performance of discrete hyper-parameter adaptation on control suite tasks. TD3 with different fixed $n$-step parameters is shown in a few colors, while blue shows the result of ES adaptation. Each curve shows results across three seeds.
Evaluation with normalized scores.

Since different tasks involve a wide range of inherent difficulties and reward scales, we propose to calculate normalized scores for each task and aggregate performance across tasks. This is similar to the standard evaluation technique on Atari games [4]. In particular, for each task, let $z_t, 1 \le t \le T$, denote the performance curve of a particular algorithm with maximum iteration $T$, and let $z_{\text{rand}}$ and $z_{\text{opt}}$ be the performance of a random policy and the optimal policy respectively. Then the normalized score is $\bar{z}_t = (z_t - z_{\text{rand}})/(z_{\text{opt}} - z_{\text{rand}})$, and we graph these scores for comparison (Figure 1). Please refer to Appendix 5.2 for detailed scores for each task.

For convenience, let there be $M$ algorithmic baselines and $N$ tasks. To facilitate comparison of overall performance, for the $i$-th baseline we calculate the normalized scores for the $j$-th task, and at each time tick calculate statistics across tasks. There are three statistics: mean, median and best ratio, similar to [31]. The best ratio indicates the proportion of tasks on which a certain baseline performs the best. These three statistics summarize the overall algorithmic performance of the baseline methods and display their relative strengths/weaknesses.
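The normalization and the three aggregate statistics can be sketched directly; `aggregate_statistics` and its tie-handling for the best ratio (ties count for every tied baseline) are illustrative simplifications, not the paper's evaluation code:

```python
def normalized_score(perf, random_score, optimal_score):
    """Per-task normalized score: 0 for a random policy, 1 for the optimal one."""
    return (perf - random_score) / (optimal_score - random_score)

def aggregate_statistics(scores):
    """scores[i][j]: normalized score of baseline i on task j at one time tick.
    Returns per-baseline (mean, median, best ratio), as plotted in Figure 1."""
    n_baselines, n_tasks = len(scores), len(scores[0])
    stats = []
    for i in range(n_baselines):
        row = sorted(scores[i])
        mean = sum(row) / n_tasks
        mid = n_tasks // 2
        median = row[mid] if n_tasks % 2 else 0.5 * (row[mid - 1] + row[mid])
        # fraction of tasks where baseline i matches the per-task maximum
        best = sum(1 for j in range(n_tasks)
                   if scores[i][j] == max(scores[k][j] for k in range(n_baselines)))
        stats.append((mean, median, best / n_tasks))
    return stats
```

The best ratio complements mean and median: a baseline can have a mediocre average yet dominate on a specific subset of tasks.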

Evaluations on standard benchmarks.

We present results across all 13 simulated tasks in Figure 1. In Figure 3 (first row), we show detailed training curves on the control suite. Here, OHT-ES maintains a categorical distribution over a small set of candidate values of $n$.

We make several observations from the results: (1) Among the static baselines, the best fixed $n$-step update achieves the best performance on the second largest number of tasks, yet its overall performance is slightly worse in the median; (2) The adaptive $n$-step performs the best across all three metrics. This implies that adaptive $n$-step both achieves significantly better overall performance (mean and median) and achieves the best performance across a considerable proportion of tasks (best ratio); (3) From the best ratio result, we conclude that adaptive $n$-step is able to locate the best $n$-step hyper-parameter for each task through online adaptation.

Evaluations on delayed reward environment.

Delayed reward environments test algorithms' capability to tackle delayed feedback in the form of sparse rewards [28]. In particular, a standard benchmark environment returns a dense reward $r_t$ at each step $t$. Consider accumulating the rewards over $d$ consecutive steps and returning the sum at the end of every $d$ steps, i.e. $r_t' = \sum_{i=t-d+1}^{t} r_i$ if $t \equiv 0 \pmod{d}$, and $r_t' = 0$ otherwise.
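A minimal sketch of this transformation applied to a recorded reward sequence (the benchmark applies it inside the environment; `delay_rewards` is an illustrative offline version that also flushes the remaining accumulator at episode end):

```python
def delay_rewards(rewards, d):
    """Accumulate dense rewards over windows of d consecutive steps and emit
    the sum at the end of each window (and at episode end), zero elsewhere."""
    out, acc = [], 0.0
    for t, r in enumerate(rewards):
        acc += r
        if (t + 1) % d == 0 or t == len(rewards) - 1:
            out.append(acc)
            acc = 0.0
        else:
            out.append(0.0)
    return out

# Five unit rewards with d = 2: sums are emitted at steps 2, 4 and episode end.
print(delay_rewards([1.0, 1.0, 1.0, 1.0, 1.0], 2))
```

The total return is preserved, but the signal becomes sparser as $d$ grows, which is what favors bootstrapping over longer horizons below.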

We present the full results in Figure 4 with normalized scores across all 13 simulated tasks. In Figure 3 (bottom row), we show detailed training curves on the control suite. Due to delayed rewards, we find it beneficial to increase the support of the categorical distribution to allow for bootstrapping from longer horizons. As a result, OHT-ES takes discrete values of $n$ from a larger candidate set.

We also make several observations: (1) The overall performance of fixed $n$-step updates is monotonically increasing in $n$ over the tested range (mean and median); in particular, the largest tested $n$ performs the best. Intuitively, the $n$-step update skips over time steps and combines multiple rewards into a single target, which makes it naturally compatible with the delayed reward signal; (2) The best ratio curves show that the largest fixed $n$ achieves the fastest learning progress across all baselines (including adaptive $n$-step), yet this advantage decays as training progresses and adaptive $n$-step takes over. This implies that adapting the $n$-step hyper-parameter is critical to achieving more stable long-term progress; (3) In terms of overall performance, adaptive $n$-step initially lags behind the best static baseline yet quickly catches up and exceeds it.

(a) Mean
(b) Median
(c) Best ratio
Figure 4: Training performance of discrete hyper-parameter adaptation on control suite tasks with delayed rewards. The plot has the exact same setup as Figure 1.

4.3 Comparison to ES-RL

The combination of ES with RL subroutines has the potential to bring the best of both worlds. While the previous sections have shown that adaptive hyper-parameters achieve significantly better performance than static hyper-parameters, how does this approach compare to the case where the ES adaptation is applied to the entire parameter vector $\theta$ [19, 30]?

We show results over a wide range of tasks in Table 1, where we compare several baselines: ES adaptation of the $n$-step horizon parameter; ES adaptation of the learning rate $\alpha$; ES adaptation of the parameter vector $\theta$ (also named ES-RL) [30], where the ES update is based on the CEM [8], following CEM-RL [30]; as well as the baselines TD3 and SAC [14]. Several observations: (1) Across the selected tasks, ES adaptation generally provides performance gains over the baseline TD3, as shown by the fact that the best performance is usually obtained via ES adaptation; (2) ES adaptation of hyper-parameters achieves overall better performance than ES-RL. We speculate that this is partially because ES-RL naively applies ES updates to high-dimensional parameter vectors, which could be highly inefficient. ES adaptation of hyper-parameters, on the other hand, focuses on a compact set of tunable variables and can exploit the strengths of ES updates to a larger extent.

[Table 1: per-task numeric results for ES $n$-step, ES $\alpha$, ES TD3 (ES-RL), TD3 and SAC omitted.]
Table 1: Summary of the performance of algorithmic variants across benchmark tasks. ES $n$-step denotes tuning of the $n$-step horizon parameter; ES $\alpha$ denotes tuning of the learning rate $\alpha$; ES TD3 denotes the ES-RL baseline [30]. For each task, the algorithmic variants with top performance are highlighted (multiple are highlighted if they are not statistically significantly different). Each entry shows performance.

5 Conclusion

We propose a framework which combines ES with online hyper-parameter tuning of general off-policy learning algorithms. This framework extends the mathematical formulation of near on-policy based meta-gradients [45, 46] and flexibly allows for the adaptation of both discrete and continuous variables. Empirically, this method provides significant performance gains over static hyper-parameters in off-policy learning baselines. As part of the ongoing efforts in combining ES with off-policy learning, the current formulation greatly reduces the search space of the ES subroutines, and makes the performance gains more consistent compared to prior work [30].


  • [1] A. Abdolmaleki, J. T. Springenberg, Y. Tassa, R. Munos, N. Heess, and M. Riedmiller (2018) Maximum a posteriori policy optimisation. arXiv preprint arXiv:1806.06920. Cited by: §4.1.
  • [2] J. Achiam (2018) Openai spinning up. GitHub, GitHub repository. Cited by: §5.2, §5.2, §5.2.1.
  • [3] G. Barth-Maron, M. W. Hoffman, D. Budden, W. Dabney, D. Horgan, D. Tb, A. Muldal, N. Heess, and T. Lillicrap (2018) Distributed distributional deterministic policy gradients. arXiv preprint arXiv:1804.08617. Cited by: §1, §4.
  • [4] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling (2013) The arcade learning environment: an evaluation platform for general agents. Journal of Artificial Intelligence Research 47, pp. 253–279. Cited by: §4.2.
  • [5] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) Openai gym. arXiv preprint arXiv:1606.01540. Cited by: §4, §5.2.
  • [6] K. Choromanski, M. Rowland, V. Sindhwani, R. E. Turner, and A. Weller (2018) Structured evolution with compact architectures for scalable policy optimization. arXiv preprint arXiv:1804.02395. Cited by: §2.2.1, §5.2.1.
  • [7] E. Coumans (2010) Bullet physics engine. Open Source Software: http://bulletphysics. org 1 (3), pp. 84. Cited by: §4, §5.2.
  • [8] P. De Boer, D. P. Kroese, S. Mannor, and R. Y. Rubinstein (2005) A tutorial on the cross-entropy method. Annals of operations research 134 (1), pp. 19–67. Cited by: §2.2.1, §5.2.1, §5.2.1, footnote 1.
  • [9] T. Degris, M. White, and R. S. Sutton (2012) Off-policy actor-critic. arXiv preprint arXiv:1205.4839. Cited by: §1, §2.1, §4.1, §5.2.1.
  • [10] L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, et al. (2018) Impala: scalable distributed deep-rl with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561. Cited by: §1, §1.
  • [11] S. Fujimoto, D. Meger, and D. Precup (2018) Off-policy deep reinforcement learning without exploration. arXiv preprint arXiv:1812.02900. Cited by: §1.
  • [12] S. Fujimoto, H. Van Hoof, and D. Meger (2018) Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477. Cited by: §2.2, §3.2, §4, §4.1.
  • [13] D. Ha and J. Schmidhuber (2018) World models. arXiv preprint arXiv:1803.10122. Cited by: §1.
  • [14] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290. Cited by: §2.2, §4.3.
  • [15] N. Hansen, S. D. Müller, and P. Koumoutsakos (2003) Reducing the time complexity of the derandomized evolution strategy with covariance matrix adaptation (cma-es). Evolutionary computation 11 (1), pp. 1–18. Cited by: §2.2.1.
  • [16] M. Hessel, J. Modayil, H. Van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver (2018) Rainbow: combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §1, §2.1, §4.
  • [17] M. Jaderberg, V. Dalibard, S. Osindero, W. M. Czarnecki, J. Donahue, A. Razavi, O. Vinyals, T. Green, I. Dunning, K. Simonyan, et al. (2017) Population based training of neural networks. arXiv preprint arXiv:1711.09846. Cited by: §1, §3.2.
  • [18] S. Kapturowski, G. Ostrovski, J. Quan, R. Munos, and W. Dabney (2018) Recurrent experience replay in distributed reinforcement learning. Cited by: §1, §2.1, §3.2.
  • [19] S. Khadka and K. Tumer (2018) Evolution-guided policy gradient in reinforcement learning. In Advances in Neural Information Processing Systems, pp. 1188–1200. Cited by: §3.2, §4.3.
  • [20] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §5.2.1, §5.2.1.
  • [21] D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §5.1.
  • [22] O. Klimov and J. Schulman (2017) Roboschool. Cited by: §4, §4.1, §5.2.
  • [23] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Cited by: §1, §1, §3.2, §4.1.
  • [24] H. Mania, A. Guy, and B. Recht (2018) Simple random search provides a competitive approach to reinforcement learning. arXiv preprint arXiv:1803.07055. Cited by: §2.2.1, §5.2.1.
  • [25] T. A. Mann, H. Penedones, S. Mannor, and T. Hester (2016) Adaptive lambda least-squares temporal difference learning. arXiv preprint arXiv:1612.09465. Cited by: §1.
  • [26] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu (2016) Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937. Cited by: §1, §2.1, §3.2, §4.
  • [27] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller (2013) Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602. Cited by: §1, §1.
  • [28] J. Oh, Y. Guo, S. Singh, and H. Lee (2018) Self-imitation learning. arXiv preprint arXiv:1806.05635. Cited by: §4.2.
  • [29] S. Paul, V. Kurin, and S. Whiteson (2019) Fast efficient hyperparameter tuning for policy gradients. arXiv preprint arXiv:1902.06583. Cited by: §1, §3.2, §3.
  • [30] A. Pourchot and O. Sigaud (2018) CEM-rl: combining evolutionary and gradient-based methods for policy search. arXiv preprint arXiv:1810.01222. Cited by: §3.2, §4.3, §4.3, Table 1, §4, §5, §5.2, §5.2.1, footnote 1.
  • [31] M. Rowland, W. Dabney, and R. Munos (2019) Adaptive trade-offs in off-policy learning. arXiv preprint arXiv:1910.07478. Cited by: §1, §2.1, §4.2.
  • [32] T. Salimans, J. Ho, X. Chen, S. Sidor, and I. Sutskever (2017) Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864. Cited by: §1, §2.2.1, §3.1, §5.2.1, §5.2.1.
  • [33] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz (2015) Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897. Cited by: §3.2.
  • [34] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §3.2.
  • [35] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. (2016) Mastering the game of go with deep neural networks and tree search. nature 529 (7587), pp. 484–489. Cited by: §1.
  • [36] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller (2014) Deterministic policy gradient algorithms. In ICML, Cited by: §2.2, §5.2.1.
  • [37] R. S. Sutton and A. G. Barto (1998) Reinforcement learning: an introduction. Vol. 1, MIT press Cambridge. Cited by: §2.2.
  • [38] R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. MIT press. Cited by: §1.
  • [39] Y. Tassa, Y. Doron, A. Muldal, T. Erez, Y. Li, D. d. L. Casas, D. Budden, A. Abdolmaleki, J. Merel, A. Lefrancq, et al. (2018) Deepmind control suite. arXiv preprint arXiv:1801.00690. Cited by: §4, §4.1, §5.2.
  • [40] H. Van Hasselt, Y. Doron, F. Strub, M. Hessel, N. Sonnerat, and J. Modayil (2018) Deep reinforcement learning and the deadly triad. arXiv preprint arXiv:1812.02648. Cited by: §1.
  • [41] Z. Wang, T. Schaul, M. Hessel, H. Van Hasselt, M. Lanctot, and N. De Freitas (2015) Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581. Cited by: §2.1.
  • [42] C. J. Watkins and P. Dayan (1992) Q-learning. Machine learning 8 (3-4), pp. 279–292. Cited by: §2.1.
  • [43] D. Wierstra, T. Schaul, T. Glasmachers, Y. Sun, J. Peters, and J. Schmidhuber (2014) Natural evolution strategies. The Journal of Machine Learning Research 15 (1), pp. 949–980. Cited by: §2.2.1.
  • [44] R. J. Williams (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Reinforcement Learning, pp. 5–32. Cited by: §3.1, §5.1.
  • [45] Z. Xu, H. P. van Hasselt, and D. Silver (2018) Meta-gradient reinforcement learning. In Advances in neural information processing systems, pp. 2396–2407. Cited by: §1, §1, §3.2, §3.2, §3, §3, §4.1, §4, §4, §5, §5.1, §5.2.1.
  • [46] T. Zahavy, Z. Xu, V. Veeriah, M. Hessel, J. Oh, H. van Hasselt, D. Silver, and S. Singh (2020) Self-tuning deep reinforcement learning. arXiv preprint arXiv:2002.12928. Cited by: §1, §1, §4.1, §5, §5.1.

APPENDIX: Online Hyper-parameter Tuning in Off-policy Learning via Evolutionary Strategies

5.1 Proof of Proposition 1

To show the equivalence, note first that the ES gradient estimator is the REINFORCE gradient estimator [44] of the meta-objective J(η). This gradient can be converted into its reparameterized gradient counterpart [21] as follows

∇_η E_{ε∼N(0,I)}[J(η + σε)] = (1/σ) E_{ε∼N(0,I)}[J(η + σε) ε] = E_{ε∼N(0,I)}[∇_η J(η + σε)].

Then we expand the RHS in orders of σ. In particular, ∇_η J(η + σε) = ∇_η J(η) + σ ∇²_η J(η) ε + O(σ²), where the terms on the right are the Taylor expansions of the objective's gradient with respect to η. Due to the expectation, the first-order term vanishes because E[ε] = 0. And because we take the limit σ → 0, the term with O(σ²) vanishes too. When the meta-objective does not explicitly depend on the meta-parameter, i.e. the explicit partial derivative ∂J/∂η = 0 (which is the case if the meta-objective is defined as the cumulative returns of the policy as in [45, 46] and our case), we finally have

lim_{σ→0} ∇_η E_{ε∼N(0,I)}[J(η + σε)] = ∇_η J(η),

i.e. the ES gradient recovers the meta-gradient of the meta-objective.
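The limiting argument above can be checked numerically. The sketch below (a toy objective J and all numeric values are hypothetical, not from the paper) compares an antithetic score-function ES estimator against the analytic gradient for shrinking σ:

```python
import numpy as np

rng = np.random.default_rng(0)

def J(eta):
    # Toy smooth meta-objective standing in for policy returns (hypothetical).
    eta = np.atleast_2d(eta)
    return np.sin(eta[:, 0]) + 0.5 * eta[:, 1] ** 2

def analytic_grad(eta):
    return np.array([np.cos(eta[0]), eta[1]])

def es_gradient(eta, sigma, n_samples=100_000):
    # Antithetic score-function (REINFORCE) estimator of
    # grad_eta E_{eps ~ N(0, I)}[J(eta + sigma * eps)].
    eps = rng.standard_normal((n_samples, 2))
    diffs = J(eta + sigma * eps) - J(eta - sigma * eps)
    return (diffs[:, None] * eps).mean(axis=0) / (2.0 * sigma)

eta = np.array([0.3, -1.2])
errors = {s: float(np.linalg.norm(es_gradient(eta, s) - analytic_grad(eta)))
          for s in (1.0, 0.1)}
```

As σ shrinks, the O(σ²) Taylor bias decays and the ES estimate approaches the analytic (meta-)gradient, consistent with the proposition.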

5.2 Experiment Details

Environment details.

We consider a set of similar tasks Walker, Cheetah and Ant with different simulation backends: Walker-v1, HalfCheetah-v1 and Ant-v1 from OpenAI gym [5]; RoboschoolWalker-v1, RoboschoolHalfCheetah-v1 and RoboschoolAnt-v1 from Roboschool [22]; WalkerRun, WalkerWalk, WalkerStand and CheetahRun from DeepMind Control Suite [39]; Walker2dBullet-v0, HalfCheetahBullet-v0 and AntBullet-v0 from Bullet Physics Engine [7]. Due to different simulation backends, these environments vary in several aspects, which allow us to validate the performance of algorithms in a wider range of scenarios.

Normalization scores.

To calculate the normalization scores, we adopt the score statistics reported in Table 2. We summarize three statistics from the test performance R_{i,j,t}, where i indexes the algorithmic baseline, j indexes the task and t indexes the time tick during training. Three statistics are defined at each time tick t and for each baseline i; in their definitions, the median is taken across all tasks j and I{·} denotes the indicator function.
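As a concrete illustration, a minimal sketch of the normalized-score computation and the median-across-tasks summary statistic (the task names, score bounds and raw returns below are hypothetical placeholders, not the values of Table 2):

```python
import numpy as np

# Hypothetical per-task score bounds and raw test returns.
high = {"Walker": 4000.0, "Cheetah": 8000.0}
low = {"Walker": 0.0, "Cheetah": -300.0}
raw = {"td3": {"Walker": 3000.0, "Cheetah": 5000.0},
       "oht_es": {"Walker": 3500.0, "Cheetah": 6500.0}}

def normalized(r, task):
    # Normalized score: (R - R_low) / (R_high - R_low).
    return (r - low[task]) / (high[task] - low[task])

def median_score(baseline):
    # Median of normalized scores across tasks, one summary statistic.
    return float(np.median([normalized(r, t) for t, r in raw[baseline].items()]))
```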

Implementation details.

The algorithmic baselines TD3, SAC and DDPG are all based on OpenAI Spinning Up [2]. We construct all algorithmic variants on top of the code base. To implement ES-RL, we borrow components from the open source code of the original paper [30].


All algorithmic baselines, including TD3, SAC and DDPG, share the same network architecture following [2]. The Q-function network and the policy network are both fully-connected neural networks with several hidden layers before the output layer, and all hidden layers use ReLU activations. By default, for all algorithmic variants, both networks are updated with the code base's default learning rate. Other missing hyper-parameters take default values from the code base.

5.2.1 Further implementation and hyper-parameter details

Below we introduce the skeleton formulation and detailed hyper-parameter setup for each algorithmic variant.

Oht-ES for discrete hyper-parameters.

We focused on adapting the n-step hyper-parameter n, which takes discrete values. The hyper-parameter is constrained to a discrete set of K candidate values. We parameterize logits over these K values and update the induced softmax distribution.

As introduced in the main paper, we maintain K agents, each corresponding to one candidate hyper-parameter value. At training iteration t, we sample an index from the distribution and execute the corresponding agent. The trajectory is used for estimating the performance of the sampled agent, and then the logits are updated based on Eqn.(4). To ensure exploration, when sampling the agent we maintain a small probability of sampling uniformly. The logits are initialized to a zero-valued vector, and we sample a batch of agents before carrying out each update on the logits. The update is carried out with the Adam optimizer [20]. To generate test performance, the algorithm samples an agent from the distribution and evaluates its performance.
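The sampling and logits-update steps above can be sketched as follows (K, the exploration probability, and the plain-SGD step are illustrative stand-ins; the paper uses Eqn.(4) with Adam):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5               # number of candidate n-step values (illustrative)
eps_explore = 0.1   # probability of sampling an agent uniformly (illustrative)

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def sample_agent(logits):
    # Mix the softmax distribution with a uniform one for exploration.
    p = (1.0 - eps_explore) * softmax(logits) + eps_explore / K
    return rng.choice(K, p=p), p

def logits_update(logits, idx, fitness, baseline, lr=0.1):
    # Score-function (REINFORCE-style) update of the logits: raise the logit
    # of the sampled agent when its estimated return beats a baseline.
    grad = -softmax(logits)
    grad[idx] += 1.0
    return logits + lr * (fitness - baseline) * grad

logits = np.zeros(K)  # zero-initialized, as described above
```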

Oht-ES for continuous hyper-parameters.

We focused on adapting the learning rate α. To ensure positivity of the learning rate, we take the parameterization α = exp(β), where β ∈ ℝ, and update β with ES.

At training iteration t, we sample perturbations of the current parameter mean from a Gaussian whose standard deviation is initialized per task and tuned based on each task; we find that a single setting works well for most tasks. Note that although this standard deviation of β would be considered large in a typical ES setting for RL [32, 6, 24], it induces relatively small changes in the space of α. The mean of the Gaussian is initialized at the log of the default TD3 learning rate [2]. We also maintain a main agent that keeps training the policy parameter using the central learning rate (this agent generates the test performance).
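A minimal sketch of this log-space perturbation scheme (all numeric values, including the 3e-4 center and σ = 0.5, are illustrative assumptions, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(1)

# Log-space parameterization alpha = exp(beta) keeps the learning rate
# positive under arbitrary Gaussian perturbations of beta.
beta_mean = np.log(3e-4)   # centered on a common TD3 default learning rate
sigma = 0.5                # perturbation std in beta-space
n_perturb = 8

betas = beta_mean + sigma * rng.standard_normal(n_perturb)
alphas = np.exp(betas)     # perturbed learning rates handed to ES workers
```

Even a β-space std of 0.5 only rescales α by a factor of roughly e^±0.5, which is why perturbations that would be large in a typical parameter-space ES remain mild in α-space.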

For the ES update, we adopt CEM [8] to update the mean and standard deviation of the Gaussian. Note that unlike a gradient-based ES update [32], here the standard deviation parameter is adjusted online.
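A single CEM step of the kind described above might be sketched as follows (the elite fraction and std floor are hypothetical choices, not the paper's defaults):

```python
import numpy as np

def cem_update(samples, fitness, elite_frac=0.5, std_floor=1e-6):
    # One cross-entropy-method step: refit the Gaussian (mean and std) to the
    # elite samples, so the std is adapted online rather than held fixed.
    n_elite = max(1, int(len(samples) * elite_frac))
    elite = samples[np.argsort(fitness)[-n_elite:]]
    return elite.mean(axis=0), elite.std(axis=0) + std_floor
```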

Meta-gradient for continuous hyper-parameters.

Meta-gradients are designed for continuous hyper-parameters [45]. We still consider adapting the learning rate as above, and we elaborate below on how meta-gradients are implemented.

In addition to the typical Q-function critic, we also train a meta Q-function critic to approximate the Q-function of the current policy. The meta-critic has the same training objective as the typical critic, but their predictions differ due to randomness in the initializations and updates.

The meta objective is the expected meta-critic value E_s[Q_meta(s, π_θ(s))], where the expectation is taken over states sampled off-policy. This meta objective estimates the off-policy gradient objective [9, 36]. Since the policy network is updated according to Eqn.(1), we calculate the meta-gradient as the derivative of the meta objective with respect to the learning rate, to be computed via back-propagation.
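A scalar toy version of this chain-rule computation (the quadratic inner loss and meta-objective are illustrative stand-ins, not the paper's networks):

```python
# One-step meta-gradient sketch: the parameter update is
# theta' = theta - alpha * g(theta), and we differentiate the
# meta-objective J(theta') with respect to alpha.

def g(theta):
    # Inner-loss gradient for a hypothetical quadratic loss (theta - 1)^2.
    return 2.0 * (theta - 1.0)

def J(theta):
    # Meta-objective evaluated at the updated parameter.
    return -(theta - 1.0) ** 2

def grad_J(theta):
    return -2.0 * (theta - 1.0)

def meta_grad(theta, alpha):
    # Chain rule through the inner update:
    # dJ(theta')/dalpha = grad_J(theta - alpha * g(theta)) * (-g(theta)).
    theta_p = theta - alpha * g(theta)
    return grad_J(theta_p) * (-g(theta))

# Finite-difference check of the analytic meta-gradient.
theta, alpha, h = 0.0, 0.1, 1e-6
fd = (J(theta - (alpha + h) * g(theta))
      - J(theta - (alpha - h) * g(theta))) / (2 * h)
```

In the paper's setting, g is the policy-gradient update of Eqn.(1) and J is the meta-critic objective, with back-propagation playing the role of the hand-written chain rule here.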

At training iteration t, the meta-critic is updated with the same rules as the typical critic but on different batches of data. Then the learning rate is updated via meta-gradients with the Adam optimizer [20] and a meta learning rate tuned for each task. Throughout the training, only one agent is maintained and trained, and this single agent generates the test performance.


CEM-RL.

We implement the CEM-RL algorithm [30], but with our TD3 subroutines for fair evaluations. For critical hyper-parameters of the algorithm, we take their default values from the paper [30].

Let θ be the agent parameter; the algorithm maintains a Gaussian distribution over agent parameters. At each iteration t, the algorithm samples a population of parameters from the distribution and updates the corresponding agents using TD3 gradient updates. Then each agent is executed in the environment and its fitness (cumulative return) is collected. Finally, the distribution parameters are updated via the cross-entropy method [8]. The mean agent generates the test performance.

Table 2: Summary of the performance scores (high score and low score per task) used for calculating normalized scores across different simulated tasks. Normalized scores are calculated as (R − R_low) / (R_high − R_low), where R is the test performance of a given baseline algorithm. The low score R_low is estimated by executing a random policy in the environment; the high score R_high is estimated from high-performing returns on the selected tasks reported in related prior literature.