1 Introduction
Off-policy learning is a powerful paradigm for RL problems. Despite its great promise, when combined with neural networks in many modern applications [27, 35], off-policy learning suffers from persistent instability, partly characterized as the deadly triad [38, 40]. As a result, additional empirical techniques must be implemented to achieve more robust performance in practice, e.g. target networks [27]. Though theory suggests that off-policy learning could be performed with highly different behavior and target policies, in challenging domains the best performance is obtained when the data are collected near on-policy, i.e. when the behavior policy stays close to the target policy [18]. In batch RL, an extreme special case of off-policy learning where the data are collected under a behavior policy beforehand and no further data collection is allowed, naive applications of off-policy algorithms do not work properly [11]. In addition to algorithmic limitations, the search for good hyperparameters for off-policy algorithms is also critical yet brittle. For example, prior work has observed that performance is highly sensitive to hyperparameters such as learning rates, and depends critically on seemingly heuristic choices such as n-step updates [23, 3, 16].
In this work, we focus on this latter source of instability for off-policy learning, i.e. hyperparameter tuning. Unlike in supervised learning, where a static set of hyperparameters might suffice, for general RL problems it is desirable to adapt the hyperparameters on the fly, as the training procedure is much more non-stationary. Though it is possible to design theoretically justified schedules for hyperparameters, such methods are usually limited to a set of special quantities, such as the eligibility trace [25] or the mixing coefficient for α-retrace [31]. More generally, the tuning of generic hyperparameters can be viewed as greedily optimizing certain meta-objectives at each iteration [45, 29, 46]. For example, in near on-policy algorithms such as IMPALA [10], hyperparameters are updated by meta-gradients [45, 46] (in such literature, trainable hyperparameters are called meta-parameters), which are calculated via backpropagation from the meta-objectives.
However, in off-policy learning, techniques such as meta-gradients are not immediately feasible. Indeed, since the existing formulation of meta-gradients [45, 46] is limited to near on-policy actor-critic algorithms [26, 10], its extension to replay-based off-policy algorithms is not yet clear. The difficulty arises from the design of many off-policy algorithms: many off-policy updates are not based on the target RL objective but on proxies such as Bellman errors [27, 23] or off-policy objectives [9]. This makes it challenging to define and calculate meta-gradients, which require differentiating through the RL objective via policy gradients [45]. To adapt hyperparameters in such cases, a naive yet straightforward resort is to train multiple agents with an array of hyperparameters in parallel, as in population-based training (PBT), and to update hyperparameters with e.g. genetic algorithms [17]. Though more black-box in nature, PBT has proved high-performing yet too costly in practice.
Main idea.
We propose a framework for optimizing hyperparameters within the lifetime of a single agent (unlike the multiple agent copies in PBT) with ES, called OHT-ES. ES is agnostic to the off-policy updates of the baseline algorithm and can readily adapt discrete and continuous hyperparameters effectively. With the recent revival of ES, especially for low-dimensional search spaces [32, 13], we will see that our proposal combines the best of both off-policy learning and ES.
OHT-ES outperforms off-policy baselines with static hyperparameters. In Figure 1, we show the significant performance gains of off-policy learning baselines combined with OHT-ES (blue curves), compared to static hyperparameters. We evaluate all algorithms with normalized scores over 13 simulated control tasks (see Section 4 for details). The performance gains of OHT-ES are consistent across all three reported metrics over normalized scores.
2 Background
In the standard formulation of a Markov decision process (MDP), at a discrete time t, an agent is in state s_t, takes action a_t, receives a reward r_t and transitions to a next state s_{t+1}. A policy π(a|s) defines a map from states to distributions over actions. The standard objective of RL is to maximize the expected cumulative discounted return E[Σ_{t≥0} γ^t r_t] with a discount factor γ ∈ [0, 1).
2.1 Off-policy learning
Off-policy learning entails policy optimization through learning from data generated by an arbitrary behavior policy, e.g. historical policies. For example, Q-learning [42] is a prominent framework for off-policy learning: given an n-step partial trajectory (s_t, a_t, r_t, ..., s_{t+n}), n-step Q-learning optimizes a parameterized Q-function Q_θ(s, a) by minimizing the Bellman error
L(θ) = E_B [ (Q_θ(s_t, a_t) − y_t)² ],   y_t = Σ_{i=0}^{n−1} γ^i r_{t+i} + γ^n max_{a'} Q_θ(s_{t+n}, a'),
where y_t is the n-step target and E_B denotes that the data are sampled from a replay buffer B. When n = 1, Q-learning converges to the optimal solution in tabular cases under mild conditions [42]. Recently, [31] showed that general uncorrected n-step updates with n > 1 introduce target bias in exchange for faster contraction to the fixed point, which tends to bring empirical gains. Though there is no general optimality guarantee for n > 1, prior work finds that employing n > 1 significantly speeds up optimization in challenging image-based benchmark domains [26, 16, 18]. Other prominent related off-policy algorithms include off-policy policy gradients [9, 41], whose details we omit here.
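To make the n-step target concrete, the following sketch computes y_t from a partial trajectory. This is our own minimal illustration (function name, tabular Q-values and constants are assumptions), not the paper's implementation:

```python
import numpy as np

def n_step_target(rewards, q_next, gamma):
    """Uncorrected n-step Q-learning target:
    y_t = sum_{i=0}^{n-1} gamma^i r_{t+i} + gamma^n max_a Q(s_{t+n}, a).

    rewards: the n rewards r_t, ..., r_{t+n-1}
    q_next:  Q-values Q(s_{t+n}, .) over all actions at the bootstrap state
    """
    n = len(rewards)
    discounted = sum(gamma ** i * r for i, r in enumerate(rewards))
    return discounted + gamma ** n * np.max(q_next)

# n = 1 reduces to the familiar one-step target r + gamma * max_a Q(s', a):
y1 = n_step_target([1.0], np.array([0.0, 2.0]), gamma=0.9)   # 1.0 + 0.9*2.0 = 2.8
# n = 3 accumulates three discounted rewards before bootstrapping:
y3 = n_step_target([1.0, 1.0, 1.0], np.array([0.0, 2.0]), gamma=0.9)
```

Larger n propagates reward information faster at the cost of the uncorrected off-policy bias discussed above.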
2.2 Off-policy actor-critic
By construction, Q-learning requires a maximization over actions to compute target values, which becomes intractable when the action space is continuous. To bypass this issue, consider a policy π_φ(s) as an approximate maximizer, i.e. π_φ(s) ≈ argmax_a Q_θ(s, a). This produces the Q-function target y_t = Σ_{i=0}^{n−1} γ^i r_{t+i} + γ^n Q_θ(s_{t+n}, π_φ(s_{t+n})). The Q-function (critic) and the policy (actor) are alternately updated as follows, with learning rate α,
θ ← θ − α ∇_θ E_B [ (Q_θ(s_t, a_t) − y_t)² ],   φ ← φ + α ∇_φ E_B [ Q_θ(s_t, π_φ(s_t)) ].   (1)
Depending on whether the actor or the critic is fully optimized at each iteration, there are two alternative interpretations of the updates defined in Eqn. (1). When the policy is fully optimized such that π_φ(s) = argmax_a Q_θ(s, a), the updates are exact n-step Q-learning. When the critic is fully optimized such that Q_θ evaluates the current policy exactly, the updates are n-step SARSA for policy evaluation combined with deterministic policy gradients [36]. In practice, critic and actor updates take place alternately, and the algorithm is a mixture of value iteration and policy iteration [37]. Built upon the updates in Eqn. (1), additional techniques such as double critics [12] and the maximum-entropy formulation [14] can greatly improve the stability of the baseline algorithm.
2.2.1 Evolutionary strategies
ES is a family of zero-order optimization algorithms (see e.g. [15, 8, 43, 32]) which has seen a recent revival for applications in RL [32]. In its generic form, consider a function f(θ) with parameter θ; the aim is to optimize f with only queries of its function values. For simplicity, assume θ is continuous and consider the ES gradient formulation introduced in [32]. Instead of optimizing f(θ) directly, consider a smoothed objective
f_σ(θ) = E_{ε ∼ N(0, I)} [ f(θ + σε) ]
with some fixed variance parameter σ > 0. It is then feasible to approximate the gradient ∇_θ f_σ(θ) with N-sample unbiased estimates, in particular
∇_θ f_σ(θ) ≈ (1/(Nσ)) Σ_{i=1}^{N} f(θ + σε_i) ε_i,
where ε_i are i.i.d. Gaussian vectors. A naive approach to RL is to flatten the sequential problem into a one-step black-box problem by setting f(θ) to the cumulative return of the policy with parameter θ. Despite its simplicity, this approach has proved efficient compared to policy gradient algorithms [32, 6, 24], though in general its sample efficiency cannot match that of off-policy algorithms.
3 Online Hyperparameter Tuning via Evolutionary Strategies
Let η denote the set of adjustable hyperparameters, e.g. a real-valued learning rate α, or a probability distribution over n-step targets for discrete n. At iteration t with actor-critic parameter θ_t, given replay buffer B, the algorithm constructs an update θ_{t+1} = f(θ_t, η, B) [45]. Here we make explicit the dependency of the update function f on the replay buffer B. For example, the update function could be the gradient updates defined in Eqn. (1). When the algorithm does not update the hyperparameter at all, we reduce to the case of static hyperparameters. One straightforward way to update the hyperparameter is to greedily optimize it against some meta-objective J [45, 29], such that
η* = argmax_η J( f(θ_t, η, B) ).   (2)
Since the motivation of hyperparameter adaptation is to better optimize the RL objective, it is natural to set the meta-objective J as the target RL objective, i.e. the expected cumulative return of the updated policy.
3.1 Methods
Now we describe Online Hyperparameter Tuning via Evolutionary Strategies (OHT-ES). Note that the framework is generic, as it can be combined with any off-policy algorithm with update function f; recall that the update function returns a new parameter θ' = f(θ, η, B). The general meta-algorithm is presented in Algorithm 1, where we assume hyperparameters to be real-valued. It is straightforward to derive similar algorithms for discrete hyperparameters, as explained below.
Consider iteration t of learning: the agent maintains a parametric distribution over hyperparameters, e.g. a Gaussian with tunable mean μ and fixed variance σ². We sample a population of N actor-critic agents, each with a separate hyperparameter η_i = μ + σε_i drawn from the parametric distribution. Then, for each of the N copies of the agent, we update its parameters via the off-policy subroutine θ_i = f(θ, η_i, B). After the update is complete, each agent with parameter θ_i collects rollouts from the environment and saves the data to the shared buffer B. From the rollouts, we construct estimates Ĵ(θ_i) of the meta-objective. Finally, the hyperparameter mean is updated via an ES subroutine. For example, we might apply ES gradient ascent [32], where the new distribution mean is updated with learning rate α_ES,
μ ← μ + α_ES · (1/(Nσ)) Σ_{i=1}^{N} Ĵ(θ_i) ε_i.   (3)
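For concreteness, the ES gradient-ascent update of the hyperparameter mean can be sketched as follows, with a toy meta-objective standing in for actual RL returns (function names and constants are our own assumptions, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def es_gradient_step(mu, sigma, meta_objective, n_samples=64, lr=0.1):
    """One ES gradient-ascent step on the hyperparameter mean mu:
    sample eta_i = mu + sigma * eps_i, evaluate the meta-objective at each
    sample, then move mu along the score-weighted perturbations,
    mu <- mu + lr * (1 / (N * sigma)) * sum_i J(eta_i) * eps_i.
    """
    eps = rng.standard_normal(n_samples)
    etas = mu + sigma * eps
    scores = np.array([meta_objective(eta) for eta in etas])
    grad_estimate = (scores * eps).mean() / sigma
    return mu + lr * grad_estimate

# Toy meta-objective peaked at eta = 2: repeated ES steps push mu toward 2.
J = lambda eta: -(eta - 2.0) ** 2
mu = 0.0
for _ in range(200):
    mu = es_gradient_step(mu, sigma=0.3, meta_objective=J)
```

In OHT-ES, the meta-objective evaluation would be the Monte-Carlo return of the agent trained and rolled out with the sampled hyperparameter, rather than a closed-form function.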
Discrete hyperparameters.
We also account for the case where the hyperparameters take values from a discrete set of K values. In such cases, instead of maintaining a parametric Gaussian distribution over hyperparameters, we maintain a categorical distribution p_ν(η = η_k) = softmax(ν)_k, where ν ∈ R^K are the logits. By sampling several hyperparameter candidates η_i ∼ p_ν, we can construct a score-function gradient estimator [44] for the logits:
∇_ν E_{η ∼ p_ν}[J] ≈ (1/N) Σ_{i=1}^{N} Ĵ(θ_i) ∇_ν log p_ν(η_i).   (4)
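The score-function estimator for the categorical case can be sketched as follows; this is our own toy illustration (names, learning rate and per-choice returns are assumptions), exploiting the fact that for a softmax categorical, the score is onehot(k) − softmax(ν):

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def logits_gradient(logits, samples, scores):
    """Score-function (REINFORCE) gradient estimate for categorical logits:
    averages score-weighted (onehot(k) - p) over sampled indices k."""
    p = softmax(logits)
    grad = np.zeros_like(logits)
    for k, j in zip(samples, scores):
        onehot = np.zeros_like(logits)
        onehot[k] = 1.0
        grad += j * (onehot - p)
    return grad / len(samples)

# Toy illustration: choice 2 has the highest (assumed) meta-objective, so
# repeated ascent steps tend to shift probability mass toward index 2.
returns = np.array([0.0, 0.1, 1.0, 0.2])
logits = np.zeros(4)
for _ in range(300):
    ks = rng.choice(4, size=32, p=softmax(logits))
    logits += 0.2 * logits_gradient(logits, ks, returns[ks])
```

Note that the gradient components always sum to zero, since (onehot − p) does, so the softmax normalization is preserved up to a constant shift of the logits.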
3.2 Connections to prior work
We make explicit the connections between our approach and closely related prior work.
Connections to meta-gradients.
When hyperparameters are real-valued, the ES update defined in Eqn. (3) closely relates to meta-gradients [45], as summarized in the following proposition.
Proposition 1.
(Proved in Appendix 5.1) Assume that sampled hyperparameters follow a Gaussian distribution η ∼ N(μ, σ²I). Then the following holds,
lim_{σ→0} ∇_μ E_{η ∼ N(μ, σ²I)} [ J(f(θ, η, B)) ] = ∇_η J(f(θ, η, B)) |_{η=μ}.   (5)
Since ES gradient updates are a zero-order approximation to the analytic gradients, this connection should be intuitive. Note that the RHS of Eq. (5) differs from meta-gradient updates in practice in several aspects [45]: in general, meta-gradients could introduce trace parameters to stabilize the update, and the gradient is evaluated at the sampled hyperparameter instead of the mean μ as defined above.
Connections to near on-policy methods.
For near on-policy algorithms such as A2C, TRPO and PPO [26, 33, 34], there are natural constraints on the parameter updates. As a result, given the meta-objective at one hyperparameter value, it is possible to estimate meta-objectives at alternative hyperparameter values with importance sampling (IS) [29]. The meta-objectives can then be greedily optimized even with zero-order methods. However, it is not clear how the correlations/variance of such IS-estimated meta-objectives impact the updates, as they are estimated from the same data. As an alternative to IS, we estimate the meta-objective via Monte-Carlo samples of cumulative returns under each sampled hyperparameter, which is applicable when trust regions are not available (as with many off-policy algorithms) and when policies are deterministic [23, 12].
Connections to ES-RL.
Our method closely relates to prior work on combining ES with gradient-based off-policy RL algorithms [19, 30], which we call ES-RL. These algorithms maintain a population of off-policy agents and carry out ES updates directly on the agent parameters, e.g. with a genetic algorithm [19] or the cross-entropy method [30]. They can be interpreted as a special case of our framework: indeed, one could include the trainable agent parameters as part of the hyperparameter, and our formulation then reduces to ES-RL. However, ES-RL applies ES updates to a high-dimensional trainable parameter, which might be less effective than applying them to a low-dimensional hyperparameter search space. We examine their relative strengths in Section 4.
Connections to PBT.
Our framework can be interpreted as a special variant of PBT [17] in which copies of the RL agent share replay buffers. In particular, PBT agents are trained independently in parallel and only exchange information during periodic hyperparameter updates, while our approach ensures that agents share information during training as well. This makes our approach potentially much more sample efficient than PBT. It is also worth noting that sharing buffers involves a tradeoff: though agents can utilize others' data for potentially better exploration, the behavior data also become less on-policy for any particular agent and might introduce additional instability [18].
4 Experiments
In the experiments, we seek to address the following questions: (1) Is OHT-ES effective for discrete hyperparameters? (2) Is OHT-ES effective for continuous hyperparameters, and how does it compare to meta-gradients [45]? (3) How does OHT-ES compare to highly related methods such as ES-RL [30]?
To address (1), we study the effect of adapting the horizon hyperparameter n in n-step updates. Prior work observed that a properly chosen n generally performs well for Atari and image-based continuous control [26, 16, 3], though the best hyperparameter can be task-dependent. We expect OHT-ES to adapt to the near-optimal hyperparameter for each task. To address (2), we study the effect of learning rates, and we compare with an application of meta-gradients [45] to off-policy agents. Though prior work focuses on applying meta-gradients to near on-policy methods [45], we provide one extension to off-policy baselines for comparison, with details described below.
Benchmark tasks.
For benchmark tasks, we focus on state-based continuous control. In order to assess the strengths of different algorithmic variants, we consider similar tasks (Walker, Cheetah and Ant) with different simulation backends from OpenAI gym [5], Roboschool [22], DeepMind Control Suite [39] and the Bullet Physics Engine [7]. These backends differ in many aspects, e.g. dimensions of the observation and action spaces, transition dynamics and reward functions. With such a wide range of variations, we seek to validate algorithmic gains with sufficient robustness to varying domains. There are a total of 13 distinct tasks, with details in Appendix 5.2.
Base update function.
4.1 Continuous hyperparameters
As an example of adaptive continuous hyperparameters, we focus on the learning rate α, which includes the learning rates for the actor and the critic respectively. Extensions to other continuous hyperparameters are straightforward: for example, the original meta-gradients were designed for the discount factor and eligibility trace [45] and later extended to entropy regularization and learning rates [46]. For the baseline TD3, an alternative hyperparameter is the discount γ, for which we find adaptive tuning does not provide significant gains.
We present results on challenging domains from the DeepMind Control Suite [39], where performance gains are most significant. Detailed environment and hyperparameter settings are in Appendix 5.2. We compare the OHT-ES tuning approach with a variant of meta-gradients: as discussed in Section 3, meta-gradient approaches are less straightforward in general off-policy learning. We derive a meta-gradient algorithm for deterministic actor-critics [23, 12] and provide a brief introduction below.
Meta-gradients for deterministic actor-critics.
Deterministic actor-critics maintain a Q-function critic Q_θ and a deterministic actor π_φ. We propose to train an additional critic for policy evaluation, updated via TD-learning toward the value of the current policy. Recall that actor-critics are updated as defined in Eqn. (1); let φ' denote the updated actor parameter. Next, let the meta-objective be the off-policy objective [9], i.e. the expected value of the updated policy under states sampled from the replay buffer. The meta-gradients are calculated by differentiating this meta-objective with respect to the learning rate through the update, via backpropagation. Please see Appendix 5.2 for a detailed derivation and design choices.
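The meta-gradient computation described above can be written out explicitly; the following is our reconstruction under standard notation (the policy-evaluation critic symbol Q̃ and the specific chain-rule factorization are assumptions, not taken verbatim from the original):

```latex
% One actor update with learning rate \alpha (cf. Eqn. (1)):
\phi' = \phi + \alpha \, \nabla_\phi \, \mathbb{E}_{s \sim \mathcal{B}}\!\left[ Q_\theta(s, \pi_\phi(s)) \right]
% Meta-objective evaluated with the separate policy-evaluation critic \tilde{Q}:
J(\phi') = \mathbb{E}_{s \sim \mathcal{B}}\!\left[ \tilde{Q}(s, \pi_{\phi'}(s)) \right]
% Meta-gradient w.r.t. the learning rate, by the chain rule:
\frac{\partial J}{\partial \alpha}
  = \left( \frac{\partial J}{\partial \phi'} \right)^{\!\top} \frac{\partial \phi'}{\partial \alpha}
  = \left( \frac{\partial J}{\partial \phi'} \right)^{\!\top}
    \nabla_\phi \, \mathbb{E}_{s \sim \mathcal{B}}\!\left[ Q_\theta(s, \pi_\phi(s)) \right]
```

The key point is that ∂φ'/∂α is simply the actor gradient itself, so the meta-gradient reduces to an inner product that standard autodiff can compute by backpropagating through the actor update.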
Evaluations.
The comparisons between OHT-ES, meta-gradients and the TD3 baseline are shown in Figure 2. We make a few observations: (1) OHT-ES consistently achieves the best results across all four environments, with significant gains in asymptotic performance and learning speed over meta-gradients and TD3; (2) Meta-gradients achieve gains over the baseline most of the time, which implies that there is potential for improvement from adaptive learning rates; (3) The baseline TD3 does not perform very well on the control suite tasks, in contrast to its high performance on typical benchmarks such as OpenAI gym [22]. This provides strong incentives to test on a wide range of benchmark testbeds in future research, as in our paper. We speculate that TD3's suboptimality is due to the fact that its design choices (including hyperparameters) are not exhaustively tuned on these newer benchmarks. With adaptive tuning, we partially resolve the issue and obtain performance almost identical to state-of-the-art algorithms on the control suite (e.g. see MPO [1]).
4.2 Discrete hyperparameters
As an important example of adaptive discrete hyperparameters, we focus on the horizon parameter n in n-step updates. Due to the discrete nature of such hyperparameters, it is less straightforward to apply meta-gradients out of the box. As a comparison to the adaptive approach, we consider static hyperparameters and test whether online adaptation brings significant gains. We show results on tasks from the control suite in Figure 3 (first row). For static baselines, we consider TD3 with n-step updates for several fixed values of n.
Evaluation with normalized scores.
Since different tasks involve a wide range of inherent difficulties and reward scales, we calculate normalized scores for each task and aggregate performance across tasks, similar to the standard evaluation technique on Atari games [4]. In particular, for each task, let z(t) denote the performance curve of a particular algorithm up to the maximum iteration, and let z_r and z_o be the performance of a random policy and an optimal (reference) policy respectively. Then the normalized score is s(t) = (z(t) − z_r) / (z_o − z_r), and we graph these scores for comparison (Figure 1). Please refer to Appendix 5.2 for detailed scores for each task.
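The normalization itself is a one-line affine rescaling; a minimal sketch (the function name is our own):

```python
def normalized_score(z, z_random, z_optimal):
    """Atari-style normalization: 0 at random-policy performance,
    1 at the reference optimal-policy performance."""
    return (z - z_random) / (z_optimal - z_random)

# e.g. a raw return of 550 on a task where random scores 100 and the
# reference optimum scores 1000 normalizes to 0.5
s = normalized_score(550.0, 100.0, 1000.0)  # 0.5
```

Scores above 1 or below 0 are possible if an algorithm exceeds the reference optimum or underperforms the random policy.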
For convenience, let there be m algorithmic baselines and K tasks. To facilitate comparison of overall performance, for the i-th baseline we calculate the normalized score s_{ij}(t) for the j-th task, and at each time tick compute statistics across tasks. There are three statistics: mean, median and best ratio, similar to [31]. The best ratio indicates the proportion of tasks on which a certain baseline performs the best. These three statistics summarize the overall algorithmic performance of baseline methods and display their relative strengths/weaknesses.
Evaluations on standard benchmarks.
We present results across all 13 simulated tasks in Figure 1. In Figure 3 (first row), we show detailed training curves on the control suite. Here, OHT-ES maintains a categorical distribution over a discrete set of values of n.
We make several observations from the results: (1) Among the static baselines, one fixed n achieves the best performance on the second largest number of tasks, yet its overall median performance is slightly worse than that of another static value; (2) The adaptive n-step performs the best across all three metrics. This implies that adaptive n-step both achieves significantly better overall performance (mean and median) and achieves the best performance on a considerable proportion of tasks (best ratio); (3) From the best-ratio result, we conclude that adaptive n-step is able to locate the best n-step hyperparameter for each task through online adaptation.
Evaluations on delayed reward environment.
The delayed reward environment tests an algorithm's capability to handle delayed feedback in the form of sparse rewards [28]. In particular, a standard benchmark environment returns a dense reward r_t at each step t. Consider accumulating the rewards over d consecutive steps and returning the sum at the end of every d steps, i.e. the modified reward is r̃_t = Σ_{i=t−d+1}^{t} r_i if t mod d = 0, and r̃_t = 0 otherwise.
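The reward-delay transformation can be sketched as a thin wrapper around an environment's per-step reward (a hypothetical illustration; the class name and interface are assumptions):

```python
class DelayedReward:
    """Accumulates dense rewards and releases the running sum every
    d steps, emitting zero reward in between."""
    def __init__(self, d):
        self.d = d
        self.acc = 0.0
        self.t = 0

    def step(self, reward):
        self.t += 1
        self.acc += reward
        if self.t % self.d == 0:
            out, self.acc = self.acc, 0.0
            return out
        return 0.0

# With d = 3, a dense reward stream 1,1,1,1,1,1 becomes 0,0,3,0,0,3
w = DelayedReward(3)
delayed = [w.step(1.0) for _ in range(6)]
```

The total return over an episode is unchanged; only the timing of the feedback is sparsified, which is exactly what stresses short-horizon bootstrapping.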
We present the full results in Figure 4 with normalized scores across all 13 simulated tasks. In Figure 3 (bottom row) we show detailed training curves on the control suite. Due to delayed rewards, we find it beneficial to enlarge the support of the categorical distribution to allow for bootstrapping from longer horizons; as a result, OHT-ES takes discrete values of n from an enlarged set.
We also make several observations: (1) The overall performance of the static n-step update is monotonic in n over the tested range (mean and median); in particular, the largest tested n performs the best. Intuitively, the n-step update skips over time steps and combines multiple rewards into a single target, which makes it naturally compatible with the delayed reward signal; (2) The best-ratio curves show that the largest tested n achieves the fastest learning progress across all baselines (including adaptive n-step), yet this advantage decays as training progresses and adaptive n-step takes over. This implies that adapting the n-step hyperparameter is critical in achieving more stable long-term progress; (3) In terms of overall performance, adaptive n-step initially lags behind yet quickly catches up and eventually exceeds the static baselines.
4.3 Comparison to ES-RL
The combination of ES with RL subroutines has the potential to bring the best of both worlds. While previous sections have shown that adaptive hyperparameters generally achieve significantly better performance than static hyperparameters, how does this approach compare to the case where ES adaptation is applied to the entire parameter vector [19, 30]?
We show results on a wide range of tasks in Table 1, where we compare several baselines: ES adaptation of the n-step horizon; ES adaptation of the learning rate; ES adaptation of the parameter vector (also named ES-RL) [30], where the ES update is based on CEM [8] following CEM-RL [30]; as well as baseline TD3 and SAC [14]. Several observations: (1) Across the selected tasks, ES adaptation generally provides performance gains over baseline TD3, as shown by the fact that the best performance is usually obtained via ES adaptations; (2) ES adaptation of hyperparameters achieves overall better performance than ES-RL. We speculate that this is partially because ES-RL naively applies ES updates to high-dimensional parameter vectors, which can be highly inefficient. On the other hand, ES adaptation of hyperparameters focuses on a compact set of tunable variables and can exploit the strength of ES updates to a larger extent.
Tasks  |  ES n-step  |  ES learning rate  |  ES parameter (ES-RL)  |  TD3  |  SAC

DMWalkerRun  
DMWalkerWalk  
DMWalkerStand  
DMCheetahRun  
Ant  
HalfCheetah  
RoboAnt  
RoboHalfCheetah  
Ant(B)  
HalfCheetah(B) 
5 Conclusion
We propose a framework that combines ES with online hyperparameter tuning of general off-policy learning algorithms. This framework extends the mathematical formulation of near on-policy meta-gradients [45, 46] and flexibly allows for the adaptation of both discrete and continuous variables. Empirically, this method provides significant performance gains over static hyperparameters in off-policy learning baselines. As part of the ongoing effort to combine ES with off-policy learning, the current formulation greatly reduces the search space of the ES subroutine and makes the performance gains more consistent compared to prior work [30].
References
 [1] (2018) Maximum a posteriori policy optimisation. arXiv preprint arXiv:1806.06920. Cited by: §4.1.
 [2] (2018) Openai spinning up. GitHub, GitHub repository. Cited by: §5.2, §5.2, §5.2.1.
 [3] (2018) Distributed distributional deterministic policy gradients. arXiv preprint arXiv:1804.08617. Cited by: §1, §4.

 [4] (2013) The arcade learning environment: an evaluation platform for general agents. Journal of Artificial Intelligence Research 47, pp. 253–279. Cited by: §4.2.
 [5] (2016) Openai gym. arXiv preprint arXiv:1606.01540. Cited by: §4, §5.2.
 [6] (2018) Structured evolution with compact architectures for scalable policy optimization. arXiv preprint arXiv:1804.02395. Cited by: §2.2.1, §5.2.1.
 [7] (2010) Bullet physics engine. Open Source Software: http://bulletphysics.org 1 (3), pp. 84. Cited by: §4, §5.2.
 [8] (2005) A tutorial on the cross-entropy method. Annals of Operations Research 134 (1), pp. 19–67. Cited by: §2.2.1, §5.2.1, §5.2.1, footnote 1.
 [9] (2012) Off-policy actor-critic. arXiv preprint arXiv:1205.4839. Cited by: §1, §2.1, §4.1, §5.2.1.
 [10] (2018) IMPALA: scalable distributed deep-RL with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561. Cited by: §1, §1.
 [11] (2018) Off-policy deep reinforcement learning without exploration. arXiv preprint arXiv:1812.02900. Cited by: §1.
 [12] (2018) Addressing function approximation error in actorcritic methods. arXiv preprint arXiv:1802.09477. Cited by: §2.2, §3.2, §4, §4.1.
 [13] (2018) World models. arXiv preprint arXiv:1803.10122. Cited by: §1.
 [14] (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290. Cited by: §2.2, §4.3.
 [15] (2003) Reducing the time complexity of the derandomized evolution strategy with covariance matrix adaptation (CMA-ES). Evolutionary Computation 11 (1), pp. 1–18. Cited by: §2.2.1.
 [16] (2018) Rainbow: combining improvements in deep reinforcement learning. In ThirtySecond AAAI Conference on Artificial Intelligence, Cited by: §1, §2.1, §4.
 [17] (2017) Population based training of neural networks. arXiv preprint arXiv:1711.09846. Cited by: §1, §3.2.
 [18] (2018) Recurrent experience replay in distributed reinforcement learning. Cited by: §1, §2.1, §3.2.
 [19] (2018) Evolutionguided policy gradient in reinforcement learning. In Advances in Neural Information Processing Systems, pp. 1188–1200. Cited by: §3.2, §4.3.
 [20] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §5.2.1, §5.2.1.
 [21] (2013) Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114. Cited by: §5.1.
 [22] (2017) Roboschool. Cited by: §4, §4.1, §5.2.
 [23] (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Cited by: §1, §1, §3.2, §4.1.
 [24] (2018) Simple random search provides a competitive approach to reinforcement learning. arXiv preprint arXiv:1803.07055. Cited by: §2.2.1, §5.2.1.
 [25] (2016) Adaptive lambda least-squares temporal difference learning. arXiv preprint arXiv:1612.09465. Cited by: §1.

 [26] (2016) Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937. Cited by: §1, §2.1, §3.2, §4.
 [27] (2013) Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602. Cited by: §1, §1.
 [28] (2018) Self-imitation learning. arXiv preprint arXiv:1806.05635. Cited by: §4.2.

 [29] (2019) Fast efficient hyperparameter tuning for policy gradients. arXiv preprint arXiv:1902.06583. Cited by: §1, §3.2, §3.
 [30] (2018) CEM-RL: combining evolutionary and gradient-based methods for policy search. arXiv preprint arXiv:1810.01222. Cited by: §3.2, §4.3, §4.3, Table 1, §4, §5, §5.2, §5.2.1, footnote 1.
 [31] (2019) Adaptive trade-offs in off-policy learning. arXiv preprint arXiv:1910.07478. Cited by: §1, §2.1, §4.2.
 [32] (2017) Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864. Cited by: §1, §2.2.1, §3.1, §5.2.1, §5.2.1.
 [33] (2015) Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897. Cited by: §3.2.
 [34] (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §3.2.
 [35] (2016) Mastering the game of go with deep neural networks and tree search. nature 529 (7587), pp. 484–489. Cited by: §1.
 [36] (2014) Deterministic policy gradient algorithms. In ICML, Cited by: §2.2, §5.2.1.
 [37] (1998) Reinforcement learning: an introduction. Vol. 1, MIT press Cambridge. Cited by: §2.2.
 [38] (2018) Reinforcement learning: an introduction. MIT press. Cited by: §1.
 [39] (2018) Deepmind control suite. arXiv preprint arXiv:1801.00690. Cited by: §4, §4.1, §5.2.
 [40] (2018) Deep reinforcement learning and the deadly triad. arXiv preprint arXiv:1812.02648. Cited by: §1.
 [41] (2015) Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581. Cited by: §2.1.
 [42] (1992) Q-learning. Machine Learning 8 (3–4), pp. 279–292. Cited by: §2.1.
 [43] (2014) Natural evolution strategies. The Journal of Machine Learning Research 15 (1), pp. 949–980. Cited by: §2.2.1.
 [44] (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Reinforcement Learning, pp. 5–32. Cited by: §3.1, §5.1.
 [45] (2018) Meta-gradient reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2396–2407. Cited by: §1, §1, §3.2, §3.2, §3, §3, §4.1, §4, §4, §5, §5.1, §5.2.1.
 [46] (2020) Self-tuning deep reinforcement learning. arXiv preprint arXiv:2002.12928. Cited by: §1, §1, §4.1, §5, §5.1.
APPENDIX: Online Hyperparameter Tuning in Off-policy Learning via Evolutionary Strategies
5.1 Proof of Proposition 1
To show the equivalence, note first that the ES gradient estimator is the REINFORCE gradient estimator [44] of the meta-objective. This gradient can be converted to its reparameterized counterpart [21] as follows:
∇_μ E_{η ∼ N(μ, σ²I)} [ J(η) ] = E_{ε ∼ N(0, I)} [ ∇_μ J(μ + σε) ].
Then we expand the RHS in orders of σ. In particular, ∇_μ J(μ + σε) = ∇J(μ) + σ ∇²J(μ) ε + O(σ²), the Taylor expansion of the gradient with respect to σ. Under the expectation, the first-order term vanishes because E[ε] = 0, and because we take the limit σ → 0, the O(σ²) term vanishes too. When the meta-objective does not explicitly depend on the meta-parameter (which is the case if the meta-objective is defined as the cumulative return of the policy, as in [45, 46] and in our case), we finally have
lim_{σ→0} ∇_μ E_{η ∼ N(μ, σ²I)} [ J(f(θ, η, B)) ] = ∇_η J(f(θ, η, B)) |_{η=μ}.
5.2 Experiment Details
Environment details.
We consider a set of similar tasks Walker, Cheetah and Ant with different simulation backends: Walkerv1, HalfCheetahv1 and Antv1 from OpenAI gym [5]; RoboschoolWalkerv1, RoboschoolHalfCheetahv1 and RoboschoolAntv1 from Roboschool [22]; WalkerRun, WalkerWalk, WalkerStand and CheetahRun from DeepMind Control Suite [39]; Walker2dBulletv0, HalfCheetahBulletv0 and AntBulletv0 from Bullet Physics Engine [7]. Due to different simulation backends, these environments vary in several aspects, which allow us to validate the performance of algorithms in a wider range of scenarios.
Normalization scores.
To calculate the normalization scores, we adopt the score statistics reported in Table 2. We summarize three statistics from the normalized scores s_{ijt}, where i indexes the algorithmic baseline, j indexes the task and t indexes the time tick during training. The three statistics are defined at each time tick and for each baseline as
mean_i(t) = (1/K) Σ_j s_{ijt},   median_i(t) = median_j s_{ijt},   best-ratio_i(t) = (1/K) Σ_j 1[ i = argmax_{i'} s_{i'jt} ],
where median_j refers to taking the median across all tasks and 1[·] is the indicator function.
Implementation details.
The algorithmic baselines TD3, SAC and DDPG are all based on OpenAI Spinning Up (https://github.com/openai/spinningup) [2]. We construct all algorithmic variants on top of this code base. To implement ES-RL, we borrow components from the open-source code of the original paper (https://github.com/apourchot/CEMRL) [30].
Architecture.
All algorithmic baselines, including TD3, SAC and DDPG, share the same network architecture following [2]. The Q-function network and the policy are both fully-connected neural networks, with ReLU activations for all hidden layers before the output layer. By default, for all algorithmic variants, both networks are updated with the same learning rate. Other missing hyperparameters take default values from the code base.
5.2.1 Further implementation and hyperparameter details
Below we introduce the skeleton formulation and detailed hyperparameter setup for each algorithmic variant.
OHT-ES for discrete hyperparameters.
We focus on adapting the n-step hyperparameter, which takes discrete values from a fixed finite set. We parameterize logits ν and update the categorical distribution p_ν(n) = softmax(ν).
As introduced in the main paper, we maintain one agent per candidate hyperparameter value. At training iteration t, we sample an index k ∼ p_ν and execute the corresponding agent. The trajectory is used for estimating the performance of agent k, and the logits are then updated based on Eqn. (4). To ensure exploration when sampling the agent, we maintain a small probability of sampling uniformly. The logits are initialized as a zero-valued vector. We sample several agents before carrying out an update on the logits. The update uses an Adam optimizer [20]. To generate test performance, the algorithm samples from the distribution and evaluates the corresponding agent.
OHT-ES for continuous hyperparameters.
We focus on adapting the learning rate α. To ensure positivity of the learning rate, we parameterize it via a positive transform of an unconstrained variable and update that variable with ES.
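One common way to realize such a positive parameterization is an exponential (log-space) transform; the exact transform used in the paper is not preserved in this extraction, so the following is a hedged sketch (the initial value is also an assumption):

```python
import numpy as np

# Represent the learning rate as alpha = exp(nu) for an unconstrained nu:
# any additive ES perturbation of nu keeps alpha positive and acts
# multiplicatively on alpha.
nu = np.log(1e-3)                    # assumed typical default learning rate
alpha = np.exp(nu)
alpha_perturbed = np.exp(nu + 0.1)   # additive Gaussian noise in nu-space
ratio = alpha_perturbed / alpha      # = exp(0.1), independent of nu
```

This explains why a perturbation standard deviation that looks large by ES-for-RL standards can still induce only modest relative changes in the learning rate itself.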
At training iteration t, we sample N perturbations of the current parameter mean, with the standard deviation of the perturbations initialized at a fixed value and tuned per task; we find that a single setting works well for most tasks. Note that though this standard deviation is large compared to a typical ES setting for RL [32, 6, 24], it induces relatively small changes in the space of the learning rate α. The mean of the Gaussian is initialized at the default TD3 learning rate from the code base [2]. We also maintain a main agent which keeps training the policy parameter using the central learning rate (this agent generates the test performance).
Meta-gradients for continuous hyperparameters.
Meta-gradients are designed for continuous hyperparameters [45]. We still consider adapting the learning rates as above, and we elaborate below how meta-gradients are implemented in detail.
In addition to the typical Q-function network, we also train a meta Q-function critic to approximate the Q-function of the current policy. The meta-critic has the same training objective as the standard critic, but their predictions differ due to randomness in initialization and updates.
The meta-objective is the off-policy gradient objective [9, 36], i.e. the expected meta-critic value of the updated policy under states sampled from the replay buffer. Since the policy network is updated according to Eqn. (1), the meta-gradient with respect to the learning rate is computed via backpropagation through the update.
At training iteration t, the meta-critic is updated with the same rules as the standard critic but with different batches of data. Then the learning rates are updated via meta-gradients with the Adam optimizer [20], using a meta learning rate tuned for each task. Throughout training, only one agent is maintained and trained, and this single agent generates the test performance.
ES-RL.
We implement the CEM-RL algorithm [30] but with our TD3 subroutines for fair evaluation. For critical hyperparameters of the algorithm, we take the default values from the paper [30].
Let θ be the agent parameter; the algorithm maintains a Gaussian distribution over agent parameters. At each iteration t, the algorithm samples a population of parameters from the distribution and updates the corresponding agents using TD3 gradient updates. Then each agent is executed in the environment and its fitness (cumulative return) is collected. Finally, the distribution parameters are updated via the cross-entropy method [8]. The mean agent generates the test performance.
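The cross-entropy method update at the core of ES-RL/CEM-RL can be sketched as follows, on a toy quadratic fitness in place of RL returns (function names, population size and elite fraction are our own assumptions, not the defaults from [30]):

```python
import numpy as np

rng = np.random.default_rng(2)

def cem_step(mean, std, fitness, pop_size=50, elite_frac=0.2):
    """One cross-entropy method iteration: sample a population around the
    current mean, keep the top fraction by fitness, and refit the Gaussian
    to the elites (a small floor keeps the std strictly positive)."""
    pop = mean + std * rng.standard_normal((pop_size, mean.shape[0]))
    scores = np.array([fitness(x) for x in pop])
    n_elite = max(1, int(pop_size * elite_frac))
    elites = pop[np.argsort(scores)[-n_elite:]]
    return elites.mean(axis=0), elites.std(axis=0) + 1e-3

# Fitness peaked at (1, -1): the distribution mean should move there.
f = lambda x: -np.sum((x - np.array([1.0, -1.0])) ** 2)
mean, std = np.zeros(2), np.ones(2)
for _ in range(30):
    mean, std = cem_step(mean, std, f)
```

In ES-RL the same update is applied to the full network parameter vector, which is exactly the high-dimensional setting where our hyperparameter-only search space is argued to be more efficient.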
Tasks  High score  Low score 

DMWalkerRun  
DMWalkerWalk  
DMWalkerStand  
DMCheetahRun  
Ant  
HalfCheetah  
RoboAnt  
RoboHalfCheetah  
RoboWalker2d  
Ant(B)  
HalfCheetah(B)  
Walker2d(B) 