1 Introduction
Policy gradient methods (Williams, 1992; Sutton et al., 1999) optimise reinforcement learning policies by performing gradient ascent on the policy parameters and have shown considerable success in environments characterised by large or continuous action spaces (Mordatch et al., 2015; Schulman et al., 2016; Rajeswaran et al., 2017). However, like other gradient-based optimisation methods, their performance can be sensitive to a number of key hyperparameters.
For example, the performance of first order policy gradient methods can depend critically on the learning rate, the choice of which in turn often depends on the task, the particular policy gradient method in use, and even the optimiser, e.g., RMSProp (Tieleman & Hinton, 2012) and ADAM (Kingma & Ba, 2014) have narrow (and different) ranges for good learning rates (Henderson et al., 2018b). Even for second order methods like Natural Policy Gradients (NPG) (Kakade, 2001) or Trust Region Policy Optimisation (TRPO) (Schulman et al., 2015), which tend to be more robust to the KL divergence constraint (which can be interpreted as a learning rate), significant performance gains can often be obtained by tuning these parameters (Duan et al., 2016).
Similarly, variance reduction techniques such as Generalised Advantage Estimators (GAE) (Schulman et al., 2016), which trade variance for bias in policy gradient estimates, introduce key hyperparameters that can also greatly affect performance (Schulman et al., 2016; Mahmood et al., 2018).
Given such sensitivities, there is a great need for effective methods for tuning policy gradient hyperparameters. Perhaps the most popular hyperparameter optimiser is simply grid search (Schulman et al., 2015; Mnih et al., 2016; Duan et al., 2016; Igl et al., 2018; Farquhar et al., 2018). More sophisticated techniques such as Bayesian optimisation (BO) (Srinivas et al., 2010; Hutter et al., 2011; Snoek et al., 2012; Chen et al., 2018) have also proven effective, and new innovations such as Population Based Training (PBT) (Jaderberg et al., 2017) have shown considerable promise. Furthermore, a host of methods have been proposed for hyperparameter optimisation in supervised learning (see Section 4).
However, such methods are poorly suited to reinforcement learning. They typically estimate only the best fixed values of the hyperparameters, when in fact they often need to change dynamically during learning according to a schedule that is not known a priori (Jaderberg et al., 2017). This is particularly important in reinforcement learning, where the distribution of visited states, the need for exploration, and the cost of taking suboptimal actions can all vary greatly during a single learning run.
Even methods such as PBT that can learn such schedules suffer from another weakness: they require performing many learning runs to identify good hyperparameters. Even methods that do so far more efficiently than grid search still require many more runs than the single run that would be conducted if good hyperparameters were known a priori. This inefficiency is particularly problematic in reinforcement learning, where it incurs not just computational costs but sample costs, as new learning runs typically require fresh interactions with the environment.
To make hyperparameter optimisation practical for reinforcement learning methods such as policy gradients, we need radically more efficient methods that can dynamically set key hyperparameters on the fly, not just find the best fixed values, and do so within a single run, using only the data that the baseline method would have gathered anyway.
This goal may seem overly ambitious, but in this paper we show that it is actually entirely feasible, using a surprisingly simple method that we call Hyperparameter Optimisation on the Fly (HOOF). HOOF automatically learns a schedule for those hyperparameters that affect the policy update directly through the gradient, e.g., the learning rate, and the GAE hyperparameters.
The main idea is to use the policy gradient method to find the gradient direction, and then greedily select the hyperparameter setting that maximises the return along that direction. To maintain sample efficiency, HOOF uses importance sampling (IS) to construct off-policy estimates of the value of the policy after various candidate updates.
The viability of such a simple approach is counterintuitive since off-policy evaluation using IS tends to have high variance that grows rapidly as the behaviour and evaluation policies diverge. However, HOOF is motivated by the insight that in second order methods such as NPG and TRPO, constraints on the magnitude of the update in policy space ensure that the IS estimates remain informative. While this is not the case for first order methods, we show that adding a simple KL constraint, without any of the complications of second order methods, suffices to keep IS estimates informative and enable effective hyperparameter optimisation.
HOOF is 1) sample efficient, requiring no more than one training run and as a result, 2) computationally efficient compared to sequential and parallel search methods; 3) able to learn a dynamic schedule for the hyperparameters that outperforms methods that learn fixed hyperparameter settings; and 4) simple to implement.
Furthermore, when reward is a sum of multiple separately observed reward streams (van Seijen et al., 2017), HOOF can learn different hyperparameter schedules for each reward stream, leading to even faster and better learning.
Finally, because HOOF is gradient free, it avoids the limitations of gradient-based methods (Sutton, 1992; Luketina et al., 2016; Xu et al., 2018) for learning hyperparameters. While such methods can be highly sample efficient, they are more restricted in which hyperparameters they can learn, e.g., just the learning rate (Sutton, 1992), or the discount factor $\gamma$ and $\lambda$ in TD($\lambda$) (Xu et al., 2018), and can be sensitive to the choice of their own hyperparameters (see Section 4).
We evaluate HOOF across a range of simulated continuous control tasks using the MuJoCo OpenAI Gym environments (Brockman et al., 2016). First, we show that using HOOF to learn optimal hyperparameter schedules for NPG can outperform TRPO. This suggests that while strictly enforcing the KL constraint enables TRPO to outperform NPG, doing so becomes unnecessary once we can properly adapt NPG’s hyperparameters. Second, we apply HOOF to A2C (Mnih et al., 2016), and show that using it to learn the learning rate can improve performance. Third, we show that HOOF can be used to disentangle the entropy term from the A2C objective, with its coefficient learnt separately, leading to even better performance. Finally, we consider tasks with multiple reward streams and show that HOOF enables faster learning in such settings.
2 Background
Consider the RL task where an agent interacts with its environment and tries to maximise its expected return. At timestep $t$, it observes the current state $s_t$, takes an action $a_t$, receives a reward $r_t$, and transitions to a new state $s_{t+1}$ following some transition probability $P(s_{t+1} \mid s_t, a_t)$. The value function of the state $s_t$ is $V(s_t) = \mathbb{E}\left[\sum_{t' \geq t} \gamma^{t'-t} r_{t'}\right]$ for some discount rate $\gamma$. The undiscounted formulation of the objective is to find a policy that maximises the expected return $J(\pi) = \mathbb{E}_{\pi}\left[\sum_t r_t\right]$. In stochastic policy gradient algorithms, $a_t$ is sampled from a parametrised stochastic policy $\pi(a \mid s)$ that maps states to actions. We abuse notation to use $\pi$ to denote both the policy as well as the parameters. These methods perform an update of the form
$$\pi' = \pi + f(\psi). \tag{1}$$
Here $f(\psi)$ represents a step along the gradient direction for some objective function estimated from a batch of sampled trajectories $\{\tau_1, \tau_2, \ldots, \tau_K\}$, and $\psi$ is the set of hyperparameters.
For first order policy gradient methods with GAE, $\psi = (\alpha, \gamma, \lambda)$, and the update takes the form:
$$f(\alpha, \gamma, \lambda) = \alpha \sum_t \nabla \log \pi(a_t \mid s_t) \, A_t^{GAE(\gamma, \lambda)}, \tag{2}$$
where $A_t^{GAE(\gamma, \lambda)} = \sum_{t' \geq t} (\gamma\lambda)^{t'-t} \delta_{t'}$ with $\delta_{t'} = r_{t'} + \gamma V(s_{t'+1}) - V(s_{t'})$. By discounting future rewards and bootstrapping off the value function, GAE reduces the variance due to rewards observed far in the future, but adds bias to the policy gradient estimate. Well chosen $(\gamma, \lambda)$ can significantly speed up learning (Schulman et al., 2016; Henderson et al., 2018a; Mahmood et al., 2018).
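As an illustration of the GAE computation described above, the following is a minimal Python sketch for a single finite trajectory. It assumes the standard backward recursion $A_t = \delta_t + \gamma\lambda A_{t+1}$ from Schulman et al. (2016); the function name `gae_advantages` and the convention that `values` carries one extra bootstrap entry are assumptions made for this example, not details taken from the paper.

```python
from typing import List


def gae_advantages(rewards: List[float], values: List[float],
                   gamma: float, lam: float) -> List[float]:
    """Generalised Advantage Estimates for one finite trajectory.

    `values` must hold one more entry than `rewards`: its last entry is the
    bootstrap value of the state reached after the final reward
    (0.0 if that state is terminal). This layout is an assumption of the sketch.
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have len(rewards) + 1 entries")
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Work backwards so each advantage reuses the one that follows it:
    # A_t = delta_t + gamma * lam * A_{t+1}
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

Setting `lam` to 0 reduces each advantage to the one-step TD residual (low variance, more bias), while `lam` equal to 1 recovers the discounted Monte Carlo return minus the value baseline.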
In first order methods, small updates in parameter space can lead to large changes in policy space, leading to large changes in performance. Second order methods like NPG address this by restricting the change to the policy through the constraint $\mathrm{KL}(\pi' \,\|\, \pi) \leq \delta_{KL}$. An approximate solution to this constrained optimisation problem leads to the update rule:
$$f(\delta_{KL}, \gamma, \lambda) = \sqrt{\frac{2\,\delta_{KL}}{g^\top \hat{F}^{-1} g}} \; \hat{F}^{-1} g, \tag{3}$$
where $g = \sum_t \nabla \log \pi(a_t \mid s_t) \, A_t^{GAE(\gamma, \lambda)}$ is the estimated policy gradient and $\hat{F}$ is the estimated Fisher information matrix (FIM).
Since the above is only an approximate solution, the constraint can be violated in some iterations. Further, since $\delta_{KL}$ is not adaptive, it might be too large for some iterations. TRPO addresses these issues by requiring an improvement in the surrogate $L_{\pi}(\pi') = \mathbb{E}\left[\frac{\pi'(a_t \mid s_t)}{\pi(a_t \mid s_t)} A_t^{GAE(\gamma, \lambda)}\right]$, as well as ensuring that the KL divergence constraint is satisfied. It does this by performing a backtracking line search along the gradient direction. As a result, TRPO is more robust to the choice of $\delta_{KL}$ (Schulman et al., 2015).
3 Hyperparameter Optimisation on the Fly
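The main idea behind HOOF is to automatically learn a schedule for the hyperparameters by greedily maximising the value of the updated policy, i.e., starting with policy $\pi_n$ at iteration $n$, HOOF sets
$$\psi_n = \arg\max_{\psi} J(\pi_{n+1}) = \arg\max_{\psi} J(\pi_n + f(\psi)). \tag{4}$$
Given a set of sampled trajectories, $f(\psi)$ can be computed for any $\psi$, and thus we can generate different candidate $\pi_{n+1}$ without requiring any further samples. However, solving the optimisation problem in (4) requires evaluating $J(\pi_{n+1})$ for each such candidate. Any on-policy approach would have prohibitive sample requirements, so HOOF uses weighted importance sampling (WIS) to construct an off-policy estimate of $J(\pi_{n+1})$. Given sampled trajectories $\tau_1, \tau_2, \ldots, \tau_K$, with corresponding returns $R(\tau_1), R(\tau_2), \ldots, R(\tau_K)$, the WIS estimate of $J(\pi_{n+1})$ is given by:
$$\hat{J}(\pi_{n+1}) = \sum_{k=1}^{K} \left( \frac{w_k}{\sum_{k'=1}^{K} w_{k'}} \right) R(\tau_k), \tag{5}$$
where $w_k = \frac{p(\tau_k \mid \pi_{n+1})}{p(\tau_k \mid \pi_n)}$. Since $p(\tau \mid \pi) = p(s_0) \prod_t \pi(a_t \mid s_t) \, P(s_{t+1} \mid s_t, a_t)$, we have:
$$w_k = \prod_{t} \frac{\pi_{n+1}(a_t \mid s_t)}{\pi_n(a_t \mid s_t)}, \tag{6}$$
where the product runs over the state-action pairs of trajectory $\tau_k$.
The success of this approach depends critically on the quality of the WIS estimates, which can suffer from high variance that grows rapidly as the distributions of $\pi_{n+1}$ and $\pi_n$ diverge. Fortunately, for second order methods like NPG, $\mathrm{KL}(\pi_{n+1} \,\|\, \pi_n)$ is automatically approximately bounded by the update, ensuring reasonable WIS estimates when HOOF directly uses (3). In the following, we consider the more challenging case of first order methods.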
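The WIS estimate can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: it assumes that, for every trajectory, the summed action log-probabilities under the candidate and behaviour policies are already available, and the name `wis_estimate` is hypothetical. Working in log space and subtracting the largest log-weight before exponentiating is a numerical-stability choice made here; it leaves the normalised weights unchanged.

```python
import math
from typing import Sequence


def wis_estimate(returns: Sequence[float],
                 logp_candidate: Sequence[float],
                 logp_behaviour: Sequence[float]) -> float:
    """Weighted importance sampling estimate of a candidate policy's return.

    Each logp_* entry is the sum over one trajectory of log pi(a_t | s_t)
    under the candidate or behaviour policy. The transition probabilities
    cancel in the trajectory ratio, so only policy terms are needed.
    """
    log_weights = [c - b for c, b in zip(logp_candidate, logp_behaviour)]
    shift = max(log_weights)  # stabilise the exponentials
    weights = [math.exp(lw - shift) for lw in log_weights]
    total = sum(weights)
    # Self-normalised weights: the common factor exp(-shift) cancels here.
    return sum(w * r for w, r in zip(weights, returns)) / total
```

When the candidate and behaviour policies coincide, all weights are equal and the estimate reduces to the plain average of the sampled returns.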
3.1 First Order HOOF
Without a KL bound on the policy update, it may seem that WIS will not yield adequate estimates to solve (4). However, a key insight is that, while the estimated policy value can have high variance, the relative ordering of the policies, which HOOF solves for, has much lower variance. Nonetheless, HOOF could still fail if $\mathrm{KL}(\pi_{n+1} \,\|\, \pi_n)$ becomes too large, which can occur in first order methods. Hence, First Order HOOF modifies (4) by constraining $\mathrm{KL}(\pi_{n+1} \,\|\, \pi_n)$:
$$\psi_n = \arg\max_{\psi} J(\pi_{n+1}) \quad \text{s.t.} \quad \mathrm{KL}(\pi_{n+1} \,\|\, \pi_n) < \epsilon. \tag{7}$$
While this yields an update that superficially resembles that of second order methods, the KL constraint is applied only during the search for the optimal hyperparameter settings using WIS. The direction of the update is determined solely by a first order gradient update rule, and the estimation and inversion of the FIM is not required.
First order methods do not typically use KL constraints. Instead, as we show experimentally in Section 5.1, they rely on a good choice of initial learning rate as well as a manually constructed annealing schedule to keep $\mathrm{KL}(\pi_{n+1} \,\|\, \pi_n)$ reasonable. HOOF obviates the need for expensive optimisation of the initial learning rate, e.g., via a grid search, and learns a dynamic schedule for the learning rate, as well as other hyperparameters.
While first order HOOF has its own hyperparameter $\epsilon$, we show in Section 5.1 that its performance is highly robust to the choice of $\epsilon$.
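The constrained selection step of First Order HOOF can be illustrated with a short Python sketch. It assumes that a WIS value estimate and an estimated KL divergence from the current policy have already been computed for each candidate; the function name, the argument names, and the choice to return `None` when no candidate satisfies the constraint are assumptions of this example, not behaviour specified in the paper.

```python
from typing import Optional, Sequence


def select_first_order_candidate(wis_values: Sequence[float],
                                 kl_values: Sequence[float],
                                 epsilon: float) -> Optional[int]:
    """Index of the candidate with the highest WIS value among those whose
    estimated KL divergence from the current policy is below epsilon.

    Returns None if every candidate violates the KL constraint; how to
    handle that case is left to the caller (an assumption of this sketch).
    """
    best_index: Optional[int] = None
    best_value = float("-inf")
    for index, (value, kl) in enumerate(zip(wis_values, kl_values)):
        if kl < epsilon and value > best_value:
            best_index, best_value = index, value
    return best_index
```

Note that the KL values are used only to filter candidates; the update direction itself still comes from the first order gradient.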
3.2 HOOF with Multiple Reward Streams
In some environments, the reward function is a sum of multiple reward streams, i.e., $r_t = \sum_i r_t^{i}$ (van Seijen et al., 2017). For example, if we are trying to learn a locomotion behaviour for a robot there could be a reward stream for forward motion, another that penalises joint movement, and another for reaching the goal. If each of these reward streams is observable, we can use HOOF to learn hyperparameters specific to each reward stream. In this setting, the GAE in (2) is simply a linear combination of the advantages for each reward stream, each with its own set of parameters:
$$A_t = \sum_i A_{t,i}^{GAE(\gamma_i, \lambda_i)}, \tag{8}$$
where $A_{t,i}^{GAE(\gamma_i, \lambda_i)}$ is the GAE computed from reward stream $i$ alone.
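A minimal Python sketch of this per-stream combination follows. It assumes each reward stream has its own value estimates (with one extra bootstrap entry per trajectory) and its own `(gamma, lam)` pair; the function names and data layout are hypothetical choices for this example.

```python
from typing import List, Sequence, Tuple


def gae(rewards: Sequence[float], values: Sequence[float],
        gamma: float, lam: float) -> List[float]:
    """Standard GAE recursion; values has len(rewards) + 1 entries."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def multi_stream_advantages(
    stream_rewards: Sequence[Sequence[float]],
    stream_values: Sequence[Sequence[float]],
    stream_params: Sequence[Tuple[float, float]],
) -> List[float]:
    """Sum of per-stream GAE advantages, one (gamma, lam) pair per stream."""
    horizon = len(stream_rewards[0])
    total = [0.0] * horizon
    for rewards, values, (gamma, lam) in zip(stream_rewards, stream_values, stream_params):
        for t, advantage in enumerate(gae(rewards, values, gamma, lam)):
            total[t] += advantage
    return total
```

With identical `(gamma, lam)` pairs and value estimates that sum to the full value function, this reduces to ordinary GAE on the summed reward, since the recursion is linear in the rewards and values.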
3.3 Greedy Maximisation
Setting the hyperparameters at each iteration with HOOF requires greedily maximising $J(\pi_{n+1})$ by solving (4). This can be done using random search or BO, depending on the computational expense of generating and evaluating each candidate $\pi_{n+1}$.
In (2), the gradient term $\sum_t \nabla \log \pi(a_t \mid s_t) \, A_t^{GAE(\gamma, \lambda)}$ is independent of $\alpha$. Thus, if we only want to learn a schedule for $\alpha$, the gradient needs to be computed only once. Subsequently, computing $\pi_{n+1}$ for different $\alpha$ involves a multiplication and an addition operation, which is far cheaper than computing the gradient. Thus, in this case we can employ random search to solve (4) efficiently.
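The following Python sketch illustrates random search over the learning rate with the gradient computed once and reused. The `evaluate` callback stands in for whatever scores a candidate parameter vector (for HOOF this would be a WIS estimate, possibly combined with a KL check); that callback, the log-uniform sampling of candidates, and all names here are assumptions made for illustration rather than details confirmed by the paper.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple


def random_search_learning_rate(
    theta: Sequence[float],
    gradient: Sequence[float],
    evaluate: Callable[[List[float]], float],
    low: float,
    high: float,
    num_candidates: int,
    seed: int = 0,
) -> Tuple[float, List[float], float]:
    """Try num_candidates learning rates along one fixed gradient direction.

    Returns the best learning rate, the corresponding parameters and score.
    """
    rng = random.Random(seed)
    best_alpha, best_theta, best_score = 0.0, list(theta), float("-inf")
    for _ in range(num_candidates):
        # Log-uniform sampling between low and high (a choice of this sketch).
        alpha = math.exp(rng.uniform(math.log(low), math.log(high)))
        # Each candidate costs one multiply and one add per parameter.
        candidate = [p + alpha * g for p, g in zip(theta, gradient)]
        score = evaluate(candidate)
        if score > best_score:
            best_alpha, best_theta, best_score = alpha, candidate, score
    return best_alpha, best_theta, best_score
```

For example, with a toy objective `lambda p: -(p[0] - 1.0) ** 2`, starting point `[0.0]` and gradient `[2.0]`, the search should favour learning rates near 0.5, which land the parameter close to the optimum at 1.0.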
If we use HOOF to learn $(\gamma, \lambda)$ as well, $A_t^{GAE(\gamma, \lambda)}$ has to be computed for each setting of $(\gamma, \lambda)$. With neural net value function approximations, we modify our value function such that its inputs are $(s_t, \gamma, \lambda)$, similar to Universal Value Function Approximators (Schaul et al., 2015). Thus we learn a $(\gamma, \lambda)$-conditioned value function that can make value predictions for any candidate $(\gamma, \lambda)$ at the cost of a single forward pass.
A computationally expensive step arises when using second order methods combined with deep neural net policies with tens of thousands of parameters. If the policy had few enough parameters that $\hat{F}^{-1}$ can be computed exactly and stored in memory, then the update for each candidate $(\gamma, \lambda)$ can be computed efficiently. To work with large policies, TRPO uses the conjugate gradient method to approximate $\hat{F}^{-1} g$, the product of the inverse FIM and the gradient, directly without explicitly computing $\hat{F}^{-1}$. When used with NPG, the resulting algorithm is referred to as Truncated Natural Policy Gradients (TNPG) (Duan et al., 2016). This implies that each setting of $(\gamma, \lambda)$ considered requires a new run of the conjugate gradient algorithm. Thus, BO might be a suitable choice in such situations. However, our experiments suggest that random search with a rather small sample size of 10 performs well even in this case.
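To make the cost structure concrete, here is a small pure-Python sketch of the conjugate gradient method and of a natural gradient step built on it. It is a textbook version, not the TRPO or TNPG code: `fisher_vector_product` is a hypothetical callback that returns the product of the (symmetric positive definite) Fisher matrix with a vector, and the step scaling follows the standard natural gradient form with a KL bound.

```python
import math
from typing import Callable, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def conjugate_gradient(matvec: Callable[[Vector], Vector], b: Vector,
                       iterations: int = 10, tol: float = 1e-10) -> Vector:
    """Approximately solve A x = b using only products A v (A must be SPD)."""
    x = [0.0] * len(b)
    residual = list(b)   # b - A x with x = 0
    direction = list(b)
    rs_old = dot(residual, residual)
    for _ in range(iterations):
        if rs_old < tol:
            break
        a_dir = matvec(direction)
        step = rs_old / dot(direction, a_dir)
        x = [xi + step * di for xi, di in zip(x, direction)]
        residual = [ri - step * ai for ri, ai in zip(residual, a_dir)]
        rs_new = dot(residual, residual)
        direction = [ri + (rs_new / rs_old) * di for ri, di in zip(residual, direction)]
        rs_old = rs_new
    return x


def natural_gradient_step(fisher_vector_product: Callable[[Vector], Vector],
                          gradient: Vector, kl_bound: float,
                          cg_iterations: int = 10) -> Vector:
    """Step sqrt(2 * kl_bound / (g . F^-1 g)) * F^-1 g, with F^-1 g from CG."""
    nat_grad = conjugate_gradient(fisher_vector_product, gradient, cg_iterations)
    scale = math.sqrt(2.0 * kl_bound / dot(gradient, nat_grad))
    return [scale * d for d in nat_grad]
```

In this sketch, changing only `kl_bound` rescales an already computed direction, whereas any change to the gradient itself (for instance through different GAE hyperparameters) requires calling `conjugate_gradient` again.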
4 Related Work
Most hyperparameter search methods can be broadly classified into sequential search, parallel search, and gradient based methods.
Sequential search methods perform a training run with some candidate hyperparameters, and use the results to inform the choice of the next set of hyperparameters for evaluation. BO is a sample efficient global optimisation framework that models performance as a function of the hyperparameters, and is especially suited for sequential search as each training run is expensive. After each training run BO uses the observed performance to update the model in a Bayesian way, which then informs the choice of the next set of hyperparameters for evaluation. Several modifications have been suggested to further reduce the number of evaluations required: input warping (Snoek et al., 2014) to address non-stationary fitness landscapes; freeze-thaw BO (Swersky et al., 2014) to decide whether a new training run should be started and the current one discontinued based on interim performance; transferring knowledge about hyperparameters across similar tasks (Swersky et al., 2013); and modelling training time as a function of dataset size (Klein et al., 2016). To further speed up the wall clock time, some BO based methods use a hybrid mode wherein batches of hyperparameter settings are evaluated in parallel (Contal et al., 2013; Desautels et al., 2014; Shah & Ghahramani, 2015; Wang et al., 2016; Kandasamy et al., 2018).
By contrast, parallel search methods like grid search and random search run multiple training runs with different hyperparameter settings in parallel to reduce wall clock time, but require more parallel computational resources. These methods are easy to implement, and have been shown to perform well (Bergstra et al., 2011; Bergstra & Bengio, 2012).
Both sequential and parallel search suffer from two key disadvantages. First, they require performing multiple training runs to identify good hyperparameters. Not only is this computationally inefficient, but when applied to RL, also sample inefficient as each run requires fresh interactions with the environment. Second, these methods learn fixed values for the hyperparameters that are used throughout training instead of a schedule, which can lead to suboptimal performance (Luketina et al., 2016; Jaderberg et al., 2017; Xu et al., 2018).
PBT (Jaderberg et al., 2017) is a hybrid of random search and sequential search, with the added benefit of learning a schedule of hyperparameters. It starts by training a population of hyperparameters which are then updated periodically throughout training to further explore promising hyperparameter settings. However, by requiring multiple training runs, it inherits the sample inefficiency of random search.
HOOF is much more sample efficient because it requires no more interactions with the environment than those gathered by the underlying policy gradient method for one training run. Consequently, it is also far more computationally efficient. However, while HOOF can only optimise hyperparameters that directly affect the policy update, these methods can tune other hyperparameters, e.g., the policy architecture and batch size. Combining these complementary strengths is an interesting topic for future work.
Like HOOF, gradient based methods (Sutton, 1992; Bengio, 2000; Luketina et al., 2016; Pedregosa, 2016; Xu et al., 2018) are highly sample efficient and require only one training run to optimise hyperparameters. They perform gradient descent on some suitably chosen loss function with respect to the hyperparameters. Hence, they are even more restricted than HOOF in the hyperparameters that they can optimise. For example, the approach of Sutton (1992) optimises only the learning rate. Metagradients (Xu et al., 2018) optimise only the discount rate and TD($\lambda$) hyperparameters and, unlike HOOF, cannot optimise, e.g., the learning rate and the policy entropy coefficient in the A2C objective function.
However, the major disadvantage of metagradients is that the metagradient estimates can have high variance, which in turn significantly affects performance. To address this, the objective function of metagradients relies on reference hyperparameters to trade off bias and variance. As a result, its performance can be sensitive to the choice of the reference hyperparameters, as well as other hyperparameters like the metalearning rate, as the experimental results of Xu et al. (2018) show. As a gradient free method, HOOF does not require metagradient estimates and, while it has a few hyperparameters of its own, we show in Section 5 that it is robust to these.
Other work on non-gradient based methods includes that of Kearns & Singh (2000), who derive a theoretical schedule for the TD($\lambda$) hyperparameter that they show is better than any fixed value. Downey et al. (2010) learn a schedule for TD($\lambda$) using a Bayesian approach. White & White (2016) greedily adapt the TD($\lambda$) hyperparameter as a function of state. Unlike HOOF, these methods can only be applied to TD($\lambda$) and, in the case of Kearns & Singh (2000), are not compatible with function approximation.
5 Experiments
To experimentally validate HOOF, we apply it to four simulated continuous control tasks from MuJoCo OpenAI Gym (Brockman et al., 2016): HalfCheetah, Hopper, Ant, and Walker. We start with A2C as our first order method, and show that using HOOF to learn a schedule can lead to better performance than using a linearly annealed learning rate. We further show that we can separate the policy entropy from the A2C objective function, and learn a learning rate for this separately using HOOF, leading to even better performance.
Next we use NPG as our second order method and apply HOOF to learn a dynamic schedule for $(\delta_{KL}, \gamma, \lambda)$ that outperforms TRPO with fixed hyperparameters.
The HalfCheetah and Ant environments have additive reward functions whose individual components can be easily exposed to the learning agent within the Gym framework. For our final set of experiments, we use HOOF to learn a different discount rate for each of these reward streams, and show that this can lead to faster learning.
We repeat all experiments across 10 random starts. Across all figures, solid lines represent the median and shaded regions the quartiles.
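The aggregation used in the figures can be sketched with the Python standard library. This is an assumed procedure: the paper does not state how the quartiles are interpolated, so the `inclusive` method of `statistics.quantiles` is one reasonable choice among several, and the function name and data layout (one list of returns per random start, aligned by training iteration) are hypothetical.

```python
import statistics
from typing import List, Sequence, Tuple


def median_and_quartiles(curves: Sequence[Sequence[float]]) -> List[Tuple[float, float, float]]:
    """Per-iteration (lower quartile, median, upper quartile) across runs.

    curves[i][t] is the return of random start i at training iteration t.
    """
    summary = []
    for values_at_t in zip(*curves):
        lower, median, upper = statistics.quantiles(values_at_t, n=4, method="inclusive")
        summary.append((lower, median, upper))
    return summary
```

The median would be drawn as the solid line and the band between the lower and upper quartiles as the shaded region.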
5.1 HOOF with A2C
In the A2C framework, a neural net with parameters $\theta$ is commonly used to represent both the policy and the value function, usually with some shared layers. The objective for the update function (1) for A2C is a linear combination of the policy loss, the value loss, and the policy entropy:
$$L = L_{\mathrm{policy}} + c_V L_{\mathrm{value}} + c_H L_{\mathrm{entropy}}, \tag{9}$$
where we have omitted the dependence on the timestep and hyperparameters for ease of notation. The coefficients $c_V$ and $c_H$ of the value loss and policy entropy terms are usually fixed a priori, and one learning rate $\alpha$ is used for the entire objective. For our first experiment, we learn $\alpha$ using HOOF, and refer to it as HOOFLR.
We can also separate the policy entropy term from the objective function and view it as a separate objective function that has its own learning rate. The update can thus be reformulated as
$$\theta' = \theta + \alpha_1 \nabla_\theta L_1 + \alpha_2 \nabla_\theta L_2, \tag{10}$$
where
$$L_1 = L_{\mathrm{policy}} + c_V L_{\mathrm{value}}, \qquad L_2 = L_{\mathrm{entropy}}. \tag{11}$$
This is equivalent to (9) with $\alpha = \alpha_1$ and $c_H = \alpha_2 / \alpha_1$. Thus under this formulation we can use HOOF to learn both $\alpha_1$ and $\alpha_2$. We refer to this as HOOFAdditive.
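The equivalence between the two-learning-rate update and a single learning rate with an entropy coefficient can be checked numerically. The sketch below treats the two gradients as given vectors; the function names are hypothetical, and sign conventions (ascent versus descent) are assumed to be absorbed into the gradients.

```python
import math
from typing import List, Sequence


def additive_update(theta: Sequence[float], grad_main: Sequence[float],
                    grad_entropy: Sequence[float],
                    alpha_main: float, alpha_entropy: float) -> List[float]:
    """Separate learning rates for the main objective and the entropy term."""
    return [p + alpha_main * gm + alpha_entropy * ge
            for p, gm, ge in zip(theta, grad_main, grad_entropy)]


def single_rate_update(theta: Sequence[float], grad_main: Sequence[float],
                       grad_entropy: Sequence[float],
                       alpha: float, entropy_coef: float) -> List[float]:
    """One learning rate applied to main objective + entropy_coef * entropy."""
    return [p + alpha * (gm + entropy_coef * ge)
            for p, gm, ge in zip(theta, grad_main, grad_entropy)]
```

Calling `single_rate_update` with `alpha = alpha_main` and `entropy_coef = alpha_entropy / alpha_main` gives the same parameters as `additive_update`, up to floating point rounding, which is why learning the two rates separately also learns an effective entropy coefficient.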
We compare HOOFAdditive and HOOFLR to a baseline A2C which uses a learning rate of 7e-4. This learning rate was shown to be within the optimal range for these environments (Henderson et al., 2018b).
Figure 1 shows learning curves for HOOFAdditive and HOOFLR, both with a KL constraint $\epsilon$, and compares them to those of Baseline A2C. It demonstrates that HOOF, by learning an adaptive schedule for the learning rate, can consistently outperform Baseline A2C’s linearly annealed schedule. Furthermore, HOOFAdditive matches the performance of HOOFLR across all environments, except for Walker where it significantly outperforms HOOFLR. This suggests that separating out the entropy term from the objective and learning a specific learning rate for it can help performance in certain instances.
Figure 2 compares the performance of HOOFLR with different settings of $\epsilon$, its own hyperparameter quantifying the KL constraint. The results show that HOOFLR’s performance is stable across different values of this parameter.
Figure 3, which shows the KL divergence incurred by each Baseline A2C update for all four environments, demonstrates that, although Baseline A2C does not enforce a KL constraint, in practice it always satisfies it, due to the careful selection of the initial learning rate. Therefore, the performance advantage of HOOF shown in Figure 1 is due to effective hyperparameter optimisation and is not an artefact of the KL constraint added in Section 3.1.
Appendix A.3 contains further experimental details, including results confirming that the KL constraint is crucial to ensuring sound WIS estimates.
5.2 HOOF with TNPG
A major disadvantage of second order methods is that they require the inversion of the FIM in (3), which can be prohibitively expensive for large neural net policies with thousands of parameters. TNPG and TRPO address this by using the conjugate gradient algorithm to efficiently compute $\hat{F}^{-1} g$, the product of the inverse FIM and the gradient. TRPO has been shown to perform better than TNPG in continuous control tasks (Schulman et al., 2015), a result attributed to stricter enforcement of the KL constraint.
However, in this section, we show that stricter enforcement of the KL constraint becomes unnecessary once we properly adapt TNPG’s learning rate. To do so, we apply HOOF to learn $(\delta_{KL}, \gamma, \lambda)$ of TNPG (HOOFAll), and compare it to two baseline versions of TRPO: one with $\gamma = 0.99$ and $\lambda = 1$ following Schulman et al. (2015); Duan et al. (2016), and another with $\gamma = 0.995$ and $\lambda = 0.97$ following Henderson et al. (2018a); Rajeswaran et al. (2017). We refer to these as TRPO(0.99,1) and TRPO(0.995,0.97). The KL constraint for both is set to 0.01 following Schulman et al. (2015); Henderson et al. (2018a).
Figure 4 shows the learning curves of HOOFAll and the two TRPO baselines. Across all four environments TRPO(0.995,0.97) outperforms TRPO(0.99,1). HOOFAll learns much faster, and achieves a significantly better return than TRPO(0.995,0.97) in HalfCheetah and Ant. In Hopper and Walker, there is no significant performance difference between HOOF and the TRPO(0.995,0.97) baseline. However, the large variation in the returns for all methods in Walker suggests that the choice of the random seed has a far greater impact on performance than the choice of the hyperparameters.
Figure 5 presents the learnt $(\delta_{KL}, \gamma, \lambda)$ for HalfCheetah, Hopper, and Walker. The results show that different KL constraints and GAE hyperparameters are needed for different domains. See Appendix B.3 for plots of the learnt hyperparameters for Ant.
Finally, we evaluate HOOF when it learns only $\delta_{KL}$, while $(\gamma, \lambda)$ are fixed to $(0.995, 0.97)$, and compare it against TRPO(0.995, 0.97). The results in Appendix B.2 show that even in this case HOOF outperforms TRPO, confirming that it is beneficial to learn a schedule for the KL constraint even when good values of $(\gamma, \lambda)$ are known a priori.
5.3 Tasks with Multiple Reward Streams
The reward function for HalfCheetah is the sum of two components: positive reward for forward movement and penalties for joint movements. Similarly for Ant, the agent gets a fixed reward at each timestep it survives, together with other rewards for forward movement and penalties for joint movement. These additive reward components can be exposed to the learning agent through Gym’s API. We use this to test if learning a separate set of hyperparameters for each reward stream with HOOF can improve learning.
Following (8), our method, MultiHOOF, learns a single KL constraint, but a different $\gamma$ for each reward stream. We compare MultiHOOF to a second HOOF baseline (BLHOOF) that learns a single KL constraint and $\gamma$ for the full reward function. For both, $\lambda$ was fixed to 1.
Figure 6 presents the results, together with the learnt hyperparameters for HalfCheetah. They show that MultiHOOF learns a significantly different discount rate for each reward stream and that this helps improve performance.
The results for Ant, presented in Figure 7, show that once again learning multiple discount schedules helps improve performance. In Figure 6(c), MultiHOOF’s discount rate schedule corresponding to the forward movement reward stream starts lower than that of BLHOOF’s discount rate, but the two soon converge. The discount rate for the survival reward stream on the other hand starts high and then reduces. We believe this is because during early stages of training the key signal for policy improvement comes from the survival reward stream as the agent learns to stay alive. Once it starts getting the survival reward consistently, the forward movement reward stream provides a much better signal for learning a better policy.
6 Conclusions & Future Work
The performance of a policy gradient method is highly dependent on its hyperparameter settings. However, methods typically used to tune these hyperparameters are highly sample inefficient, computationally expensive, and learn only a fixed setting of the hyperparameters. In this paper we presented HOOF, a sample efficient method that automatically learns a schedule for the learning rate and GAE hyperparameters of policy gradient methods without requiring multiple training runs. We believe that this, combined with its simplicity and ease of implementation, makes HOOF a compelling method for optimising policy gradient hyperparameters.
While this paper focused on learning only a few hyperparameters of policy gradient methods, in principle HOOF could be used to learn other hyperparameters as well, e.g., those of Generative Adversarial Imitation Learning (GAIL) (Ho & Ermon, 2016) or Model Agnostic Meta-Learning (MAML) (Finn et al., 2017), which could lead to more stable learning.
Acknowledgements
This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement #637713), and Samsung R&D Institute UK. The experiments were made possible by a generous equipment grant from NVIDIA.
References
 Bengio (2000) Bengio, Y. Gradient-based optimization of hyperparameters. Neural computation, 12(8):1889–1900, 2000.
 Bergstra & Bengio (2012) Bergstra, J. and Bengio, Y. Random search for hyperparameter optimization. Journal of Machine Learning Research, 13(Feb):281–305, 2012.
 Bergstra et al. (2011) Bergstra, J. S., Bardenet, R., Bengio, Y., and Kégl, B. Algorithms for hyperparameter optimization. In Advances in neural information processing systems, pp. 2546–2554, 2011.
 Brockman et al. (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Openai gym, 2016.
 Chen et al. (2018) Chen, Y., Huang, A., Wang, Z., Antonoglou, I., Schrittwieser, J., Silver, D., and de Freitas, N. Bayesian optimization in alphago. CoRR, abs/1812.06855, 2018.
 Contal et al. (2013) Contal, E., Buffoni, D., Robicquet, A., and Vayatis, N. Parallel gaussian process optimization with upper confidence bound and pure exploration. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 225–240. Springer, 2013.
 Desautels et al. (2014) Desautels, T., Krause, A., and Burdick, J. W. Parallelizing exploration-exploitation tradeoffs in gaussian process bandit optimization. The Journal of Machine Learning Research, 15(1):3873–3923, 2014.
 Dhariwal et al. (2017) Dhariwal, P., Hesse, C., Klimov, O., Nichol, A., Plappert, M., Radford, A., Schulman, J., Sidor, S., Wu, Y., and Zhokhov, P. Openai baselines. https://github.com/openai/baselines, 2017.
 Downey et al. (2010) Downey, C., Sanner, S., et al. Temporal difference bayesian model averaging: A bayesian perspective on adapting lambda. In ICML, pp. 311–318. Citeseer, 2010.
 Duan et al. (2016) Duan, Y., Chen, X., Houthooft, R., Schulman, J., and Abbeel, P. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning (ICML), 2016.
 Farquhar et al. (2018) Farquhar, G., Rocktäschel, T., Igl, M., and Whiteson, S. Treeqn and atreec: Differentiable tree planning for deep reinforcement learning. International Conference on Learning Representations, 2018.
 Finn et al. (2017) Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.
 Henderson et al. (2018a) Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., and Meger, D. Deep reinforcement learning that matters. In AAAI, 2018a.
 Henderson et al. (2018b) Henderson, P., Romoff, J., and Pineau, J. Where did my optimum go?: An empirical analysis of gradient descent optimization in policy gradient methods. CoRR, abs/1810.02525, 2018b.
 Ho & Ermon (2016) Ho, J. and Ermon, S. Generative adversarial imitation learning. In Neural Information Processing Systems (NIPS), 2016.
 Hutter et al. (2011) Hutter, F., Hoos, H. H., and Leyton-Brown, K. Sequential model-based optimization for general algorithm configuration. In International Conference on Learning and Intelligent Optimization, pp. 507–523. Springer, 2011.
 Igl et al. (2018) Igl, M., Zintgraf, L. M., Le, T. A., Wood, F., and Whiteson, S. Deep variational reinforcement learning for pomdps. In ICML, volume 80 of JMLR Workshop and Conference Proceedings, pp. 2122–2131. JMLR.org, 2018.
 Jaderberg et al. (2017) Jaderberg, M., Dalibard, V., Osindero, S., Czarnecki, W. M., Donahue, J., Razavi, A., Vinyals, O., Green, T., Dunning, I., Simonyan, K., et al. Population based training of neural networks. arXiv preprint arXiv:1711.09846, 2017.
 Kakade (2001) Kakade, S. A natural policy gradient. In Neural Information Processing Systems (NIPS), 2001.
 Kandasamy et al. (2018) Kandasamy, K., Krishnamurthy, A., Schneider, J., and Póczos, B. Parallelised bayesian optimisation via thompson sampling. In International Conference on Artificial Intelligence and Statistics, pp. 133–142, 2018.
 Kearns & Singh (2000) Kearns, M. J. and Singh, S. P. Bias-variance error bounds for temporal difference updates. In COLT, pp. 142–147. Citeseer, 2000.
 Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014. URL http://arxiv.org/abs/1412.6980.
 Klein et al. (2016) Klein, A., Falkner, S., Bartels, S., Hennig, P., and Hutter, F. Fast Bayesian optimization of machine learning hyperparameters on large datasets. arXiv preprint arXiv:1605.07079, 2016.
 Luketina et al. (2016) Luketina, J., Raiko, T., Berglund, M., and Greff, K. Scalable gradient-based tuning of continuous regularization hyperparameters. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pp. 2952–2960, 2016. URL http://jmlr.org/proceedings/papers/v48/luketina16.html.
 Mahmood et al. (2018) Mahmood, A. R., Korenkevych, D., Vasan, G., Ma, W., and Bergstra, J. Benchmarking reinforcement learning algorithms on real-world robots. In Proceedings of The 2nd Conference on Robot Learning, 2018.
 Mnih et al. (2016) Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning (ICML), 2016.

 Mordatch et al. (2015) Mordatch, I., Lowrey, K., Andrew, G., Popovic, Z., and Todorov, E. V. Interactive control of diverse complex characters with neural networks. In Neural Information Processing Systems (NIPS), 2015.
 Pedregosa (2016) Pedregosa, F. Hyperparameter optimization with approximate gradient. arXiv preprint arXiv:1602.02355, 2016.
 Rajeswaran et al. (2017) Rajeswaran, A., Lowrey, K., Todorov, E. V., and Kakade, S. M. Towards generalization and simplicity in continuous control. In Neural Information Processing Systems (NIPS), 2017.
 Schaul et al. (2015) Schaul, T., Horgan, D., Gregor, K., and Silver, D. Universal value function approximators. In International Conference on Machine Learning (ICML), 2015.
 Schulman et al. (2015) Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International Conference on Machine Learning (ICML), 2015.
 Schulman et al. (2016) Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. High-dimensional continuous control using generalized advantage estimation. In Proceedings of the International Conference on Learning Representations (ICLR), 2016.
 Shah & Ghahramani (2015) Shah, A. and Ghahramani, Z. Parallel predictive entropy search for batch global optimization of expensive objective functions. In Advances in Neural Information Processing Systems, pp. 3330–3338, 2015.
 Snoek et al. (2012) Snoek, J., Larochelle, H., and Adams, R. P. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pp. 2951–2959, 2012.
 Snoek et al. (2014) Snoek, J., Swersky, K., Zemel, R., and Adams, R. Input warping for Bayesian optimization of non-stationary functions. In International Conference on Machine Learning (ICML), 2014.
 Srinivas et al. (2010) Srinivas, N., Krause, A., Kakade, S. M., and Seeger, M. Gaussian process optimization in the bandit setting: no regret and experimental design. In International Conference on Machine Learning (ICML), 2010.
 Sutton (1992) Sutton, R. S. Adapting bias by gradient descent: An incremental version of delta-bar-delta. In AAAI, pp. 171–176, 1992.
 Sutton et al. (1999) Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Neural Information Processing Systems (NIPS), 1999.
 Swersky et al. (2013) Swersky, K., Snoek, J., and Adams, R. P. Multi-task Bayesian optimization. In Advances in Neural Information Processing Systems, pp. 2004–2012, 2013.
 Swersky et al. (2014) Swersky, K., Snoek, J., and Adams, R. P. Freeze-thaw Bayesian optimization. arXiv preprint arXiv:1406.3896, 2014.
 Tieleman & Hinton (2012) Tieleman, T. and Hinton, G. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31, 2012.
 van Seijen et al. (2017) van Seijen, H., Fatemi, M., Laroche, R., Romoff, J., Barnes, T., and Tsang, J. Hybrid reward architecture for reinforcement learning. In Neural Information Processing Systems (NIPS), 2017.
 Wang et al. (2016) Wang, J., Clark, S. C., Liu, E., and Frazier, P. I. Parallel Bayesian global optimization of expensive functions. arXiv preprint arXiv:1602.05149, 2016.
 White & White (2016) White, M. and White, A. A greedy approach to adapting the trace parameter for temporal difference learning. In Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, pp. 557–565. International Foundation for Autonomous Agents and Multiagent Systems, 2016.
 Williams (1992) Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 1992.
 Xu et al. (2018) Xu, Z., van Hasselt, H., and Silver, D. Meta-gradient reinforcement learning, 2018.
Appendix A A2C Experimental details
We present further details about our A2C experiments in this section.
a.1 Implementation details
Our codebase for the A2C experiments is based on the OpenAI Baselines (Dhariwal et al., 2017) implementation of A2C and uses its default hyperparameters. Experiments involving HOOF use the same hyperparameters, apart from those that are learnt by HOOF. All hyperparameters are presented in Table 1.
Table 1: Hyperparameters for the A2C experiments.

Hyperparameter                                  Value
Number of environments (num_envs)               20
Timesteps per worker (nsteps)                   5
Total environment steps                         1e6
Discounting                                     0.99
Max gradient norm                               0.5
Optimiser                                       RMSProp
– decay (alpha)                                 0.99
– epsilon                                       1e-5
Policy                                          MLP
– num of fully connected layers                 2
– num of units per layer                        64
– activation                                    tanh
Default settings for Baseline A2C & HOOF
– Initial learning rate                         7e-4
– Learning rate schedule                        linear annealing
– Value function cost weight                    0.5
– Entropy cost weight                           0.01
HOOF specific hyperparameters
– HOOF-LR search bounds for the learning rate   [0, 1e-2]
– HOOF-Additive search bounds                   [0, 1e-2]
– HOOF-Additive search bounds                   [0, 5e-4]
a.2 Learnt learning rate of updates
The learning rates learnt by HOOF-LR and HOOF-Additive are presented in Figure 8. The learnt schedules suggest that the default linear annealing used by Baseline A2C might not be optimal.
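For reference, the linear annealing baseline mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Baselines code itself: the function name `linear_annealed_lr` is hypothetical, and the defaults are taken from Table 1 (initial learning rate 7e-4, 1e6 total environment steps).

```python
def linear_annealed_lr(step, total_steps=1_000_000, initial_lr=7e-4):
    """Learning rate that decays linearly from initial_lr to 0 over total_steps.

    The name and signature are illustrative assumptions; the defaults follow Table 1.
    """
    # Clamp the step so the rate never goes negative or exceeds initial_lr.
    clamped = min(max(step, 0), total_steps)
    frac_remaining = 1.0 - clamped / total_steps
    return initial_lr * frac_remaining


if __name__ == "__main__":
    for step in (0, 250_000, 500_000, 1_000_000):
        print(step, linear_annealed_lr(step))
```

A fixed schedule of this kind depends only on the step count, whereas the schedules in Figure 8 are chosen by HOOF during training.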
a.3 Performance of HOOF without KL constraint
Figure 9 shows that, without a KL constraint, HOOF-A2C does not converge. This confirms that policy updates need to be constrained so that the WIS estimates remain sound.
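The role of the constraint can be illustrated with a generic trajectory-level WIS estimator. The sketch below is not the authors' implementation: the function name `wis_estimate`, its arguments, and the effective-sample-size diagnostic are assumptions introduced here for illustration. It shows that when a candidate policy assigns very different probabilities to the sampled trajectories than the behaviour policy did, the normalised weights concentrate on a few trajectories and the estimate rests on very little data.

```python
import math


def wis_estimate(returns, logp_candidate, logp_behaviour):
    """Weighted importance sampling (WIS) estimate of a candidate policy's return.

    returns[i]        : return of trajectory i, sampled under the behaviour policy
    logp_candidate[i] : log-probability of trajectory i's actions under the candidate
    logp_behaviour[i] : log-probability of the same actions under the behaviour policy
    Returns (estimate, effective_sample_size). All names are illustrative.
    """
    log_w = [c - b for c, b in zip(logp_candidate, logp_behaviour)]
    shift = max(log_w)  # subtract the max before exponentiating, for stability
    w = [math.exp(lw - shift) for lw in log_w]
    total = sum(w)
    # Self-normalised estimate: weights sum to one after dividing by `total`.
    estimate = sum(wi * r for wi, r in zip(w, returns)) / total
    # Effective sample size: n for uniform weights, close to 1 if one weight dominates.
    ess = total ** 2 / sum(wi ** 2 for wi in w)
    return estimate, ess


if __name__ == "__main__":
    returns = [1.0, 2.0, 3.0]
    behaviour = [0.0, 0.0, 0.0]
    print(wis_estimate(returns, [0.0, 0.0, 0.0], behaviour))      # candidate equals behaviour
    print(wis_estimate(returns, [0.0, -50.0, -50.0], behaviour))  # candidate far from behaviour
```

In the second call almost all the weight falls on a single trajectory, which is the kind of degeneracy a KL constraint on the update is meant to prevent.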
Appendix B TNPG Experimental details
We present further details about our TNPG experiments in this section.
b.1 Implementation details
Our codebase for the TNPG experiments is based on the RLLab (Duan et al., 2016) implementation. The hyperparameters are presented in Table 2.
Table 2: Hyperparameters for the TNPG experiments.

Batch size
– HalfCheetah-v1                        20
– Hopper-v1                             20
– Walker2d-v1                           40
– Ant-v1                                40
Baseline TRPO
– KL constraint                         0.01
– Discounting                           0.99 or 0.995
– GAE                                   1.0 or 0.97
Policy                                  MLP
– num of fully connected layers         2
– num of units per layer                100
– activation                            ReLU
HOOF specific hyperparameters
– search bounds for the KL constraint   [0.00125, 0.045]
– search bounds for discounting         [0.95, 1]
– search bounds for GAE                 [0.95, 1]
b.2 Performance of HOOF with fixed (γ, λ)
In certain situations, good values of the discounting and GAE parameters (γ, λ) may be known a priori. To check whether learning only the KL constraint with HOOF (HOOF-KL) still leads to any benefit, we compare the performance of TRPO(0.995, 0.97) against TNPG where HOOF is used to learn the KL constraint and (γ, λ) is fixed to (0.995, 0.97). The results presented in Figure 10 show that even in this case HOOF outperforms TRPO.
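The pair (0.995, 0.97) fixed in this comparison corresponds to the discounting and GAE settings listed in Table 2. As a reminder of how these two values enter the advantage estimates, the following is a minimal sketch of the standard GAE recursion (Schulman et al., 2016). The function name `gae_advantages` and its argument layout are assumptions made for illustration; this is not the RLLab code.

```python
def gae_advantages(rewards, values, gamma=0.995, lam=0.97):
    """Generalised advantage estimates for a single trajectory.

    rewards : list of T rewards r_0, ..., r_{T-1}
    values  : list of T + 1 value estimates V(s_0), ..., V(s_T); the last entry
              is the bootstrap value (0.0 if the episode terminated)
    gamma   : discount factor
    lam     : GAE parameter, trading variance (lam near 1) for bias (lam near 0)
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Work backwards: A_t = delta_t + gamma * lam * A_{t+1}
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # one-step TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


if __name__ == "__main__":
    print(gae_advantages([1.0, 1.0], [0.0, 0.0, 0.0]))
```

With lam set to 0 the estimate reduces to the one-step TD error, and with lam set to 1 it becomes the discounted return minus the value baseline.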
b.3 Learnt hyperparameter schedules for Ant
The hyperparameter schedules learnt by HOOF-All for Ant are presented in Figure 11.