Implementation of the Fast Efficient Hyperparameter Tuning for Policy Gradient Methods https://arxiv.org/abs/1902.06583
The performance of policy gradient methods is sensitive to hyperparameter settings that must be tuned for any new application. Widely used grid search methods for tuning hyperparameters are sample inefficient and computationally expensive. More advanced methods like Population Based Training that learn optimal schedules for hyperparameters instead of fixed settings can yield better results, but are also sample inefficient and computationally expensive. In this paper, we propose Hyperparameter Optimisation on the Fly (HOOF), a gradient-free meta-learning algorithm that can automatically learn an optimal schedule for hyperparameters that affect the policy update directly through the gradient. The main idea is to use existing trajectories sampled by the policy gradient method to optimise a one-step improvement objective, yielding a sample and computationally efficient algorithm that is easy to implement. Our experimental results across multiple domains and algorithms show that using HOOF to learn these hyperparameter schedules leads to faster learning with improved performance.
Policy gradient methods (Williams, 1992; Sutton et al., 1999) optimise reinforcement learning policies by performing gradient ascent on the policy parameters and have shown considerable success in environments characterised by large or continuous action spaces (Mordatch et al., 2015; Schulman et al., 2016; Rajeswaran et al., 2017). However, like other gradient-based optimisation methods, their performance can be sensitive to a number of key hyperparameters.
For example, the performance of first order policy gradient methods can depend critically on the learning rate, the choice of which in turn often depends on the task, the particular policy gradient method in use, and even the optimiser: RMSProp (Tieleman & Hinton, 2012) and ADAM (Kingma & Ba, 2014), for instance, have narrow (and different) ranges of good learning rates (Henderson et al., 2018b). Even for second order methods like Natural Policy Gradients (NPG) (Kakade, 2001) or Trust Region Policy Optimisation (TRPO) (Schulman et al., 2015), which tend to be more robust to the KL divergence constraint (which can be interpreted as a learning rate), significant performance gains can often be obtained by tuning these parameters (Duan et al., 2016).
Given such sensitivities, there is a great need for effective methods for tuning policy gradient hyperparameters. Perhaps the most popular hyperparameter optimiser is simply grid search (Schulman et al., 2015; Mnih et al., 2016; Duan et al., 2016; Igl et al., 2018; Farquhar et al., 2018). More sophisticated techniques such as Bayesian optimisation (BO) (Srinivas et al., 2010; Hutter et al., 2011; Snoek et al., 2012; Chen et al., 2018) have also proven effective and new innovations such as Population Based Training (PBT) (Jaderberg et al., 2017) have shown considerable promise. Furthermore, a host of methods have been proposed for hyperparameter optimisation in supervised learning (see Section 4).
However, such methods are poorly suited to reinforcement learning. They typically estimate only the best fixed values of the hyperparameters, when in fact they often need to change dynamically during learning according to a schedule that is not known a priori (Jaderberg et al., 2017). This is particularly important in reinforcement learning, where the distribution of visited states, the need for exploration, and the cost of taking suboptimal actions can all vary greatly during a single learning run.
Even methods such as PBT that can learn such schedules suffer from another weakness: they require performing many learning runs to identify good hyperparameters. Even methods that do so far more efficiently than grid search still require many more runs than the single run that would be conducted if good hyperparameters were known a priori. This inefficiency is particularly problematic in reinforcement learning, where it incurs not just computational costs but sample costs, as new learning runs typically require fresh interactions with the environment.
To make hyperparameter optimisation practical for reinforcement learning methods such as policy gradients, we need radically more efficient methods that can dynamically set key hyperparameters on the fly, not just find the best fixed values, and do so within a single run, using only the data that the baseline method would have gathered anyway.
This goal may seem overly ambitious, but in this paper we show that it is actually entirely feasible, using a surprisingly simple method that we call Hyperparameter Optimisation on the Fly (HOOF). HOOF automatically learns a schedule for those hyperparameters that affect the policy update directly through the gradient, e.g., the learning rate, and the GAE hyperparameters.
The main idea is to use the policy gradient method to find the gradient direction, and then greedily select the hyperparameter setting that maximises the return along that direction. To maintain sample efficiency, HOOF uses importance sampling (IS) to construct off-policy estimates of the value of the policy after various candidate updates.
The viability of such a simple approach is counterintuitive since off-policy evaluation using IS tends to have high variance that grows rapidly as the behaviour and evaluation policies diverge. However, HOOF is motivated by the insight that in second order methods such as NPG and TRPO, constraints on the magnitude of the update in policy space ensure that the IS estimates remain informative. While this is not the case for first order methods, we show that adding a simple KL constraint, without any of the complications of second order methods, suffices to keep IS estimates informative and enable effective hyperparameter optimisation.
HOOF is 1) sample efficient, requiring no more than one training run and as a result, 2) computationally efficient compared to sequential and parallel search methods; 3) able to learn a dynamic schedule for the hyperparameters that outperforms methods that learn fixed hyperparameter settings; and 4) simple to implement.
Furthermore, when reward is a sum of multiple separately observed reward streams (van Seijen et al., 2017), HOOF can learn different hyperparameter schedules for each reward stream, leading to even faster and better learning.
Finally, because HOOF is gradient free, it avoids the limitations of gradient-based methods (Sutton, 1992; Luketina et al., 2016; Xu et al., 2018) for learning hyperparameters. While such methods can be highly sample efficient, they are more restricted in which hyperparameters they can learn, e.g., just the learning rate (Sutton, 1992), or the discount factor $\gamma$ and $\lambda$ in TD($\lambda$) (Xu et al., 2018), and can be sensitive to the choice of their own hyperparameters (see Section 4).
We evaluate HOOF across a range of simulated continuous control tasks using the Mujoco OpenAI Gym environments (Brockman et al., 2016). First, we show that using HOOF to learn optimal hyperparameter schedules for NPG can outperform TRPO. This suggests that while strictly enforcing the KL constraint enables TRPO to outperform NPG, doing so becomes unnecessary once we can properly adapt NPG’s hyperparameters. Second, we apply HOOF to A2C (Mnih et al., 2016), and show that using it to learn the learning rate can improve performance. Third, we show that HOOF can be used to disentangle the entropy term from the A2C objective, with its coefficient learnt separately, leading to even better performance. Finally, we consider tasks with multiple reward streams and show that HOOF enables faster learning in such settings.
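Consider the RL task where an agent interacts with its environment and tries to maximise its expected return. At timestep $t$, it observes the current state $s_t$, takes an action $a_t$, receives a reward $r_t$, and transitions to a new state $s_{t+1}$ following some transition probability. The value function of the state $s_t$ is $V(s_t) = \mathbb{E}\left[\sum_{l \geq 0} \gamma^{l} r_{t+l} \mid s_t\right]$ for some discount rate $\gamma \in [0, 1)$. The undiscounted formulation of the objective is to find a policy that maximises the expected return $J(\pi) = \mathbb{E}_{\pi}\left[\sum_{t} r_t\right]$. In stochastic policy gradient algorithms, $a_t$ is sampled from a parametrised stochastic policy $\pi(a_t \mid s_t)$ that maps states to actions. We abuse notation to use $\pi$ to denote both the policy as well as the parameters. These methods perform an update of the form
$$\pi' = \pi + f(\psi). \quad (1)$$
Here $f(\psi)$ represents a step along the gradient direction for some objective function estimated from a batch of sampled trajectories $\{\tau_1, \ldots, \tau_K\}$, and $\psi$ is the set of hyperparameters.
For first order policy gradient methods with GAE, $\psi = (\alpha, \gamma, \lambda)$, and the update takes the form:
$$f(\alpha, \gamma, \lambda) = \alpha \sum_{t} \nabla \log \pi(a_t \mid s_t) \, A_t^{\text{GAE}(\gamma, \lambda)}, \quad (2)$$
where $A_t^{\text{GAE}(\gamma, \lambda)} = \sum_{l \geq 0} (\gamma \lambda)^{l} \, \delta^{V}_{t+l}$ with $\delta^{V}_{t} = r_t + \gamma V(s_{t+1}) - V(s_t)$. By discounting future rewards and bootstrapping off the value function, GAE reduces the variance due to rewards observed far in the future, but adds bias to the policy gradient estimate. Well chosen $(\gamma, \lambda)$ can significantly speed up learning (Schulman et al., 2016; Henderson et al., 2018a; Mahmood et al., 2018).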
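As a minimal sketch of the GAE computation described above, the following NumPy function accumulates the discounted TD residuals backwards through a single trajectory. The function name `gae_advantages` and the convention that `values` carries one extra bootstrap entry are assumptions made for illustration; they are not taken from the authors' code.

```python
import numpy as np


def gae_advantages(rewards, values, gamma, lam):
    """Generalised advantage estimates for one trajectory.

    rewards: length-T sequence of rewards r_t.
    values:  length-(T+1) sequence of value predictions V(s_t); the last
             entry is the bootstrap value of the final state (assumed layout).
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    # TD residuals: r_t + gamma * V(s_{t+1}) - V(s_t)
    deltas = rewards + gamma * values[1:] - values[:-1]
    advantages = np.zeros_like(deltas)
    running = 0.0
    # Backward recursion: A_t = delta_t + gamma * lam * A_{t+1}
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages
```

Setting `lam=0` reduces the estimate to the one-step TD residual, while `lam=1` recovers the discounted return minus the value baseline, which illustrates the bias and variance trade-off discussed above.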
In first order methods, small updates in parameter space can lead to large changes in policy space, leading to large changes in performance. Second order methods like NPG address this by restricting the change to the policy through the constraint $\mathrm{KL}(\pi' \,\|\, \pi) \leq \delta$. An approximate solution to this constrained optimisation problem leads to the update rule:
$$f(\delta, \gamma, \lambda) = \sqrt{\frac{2\delta}{\hat{g}^{\top} \hat{F}^{-1} \hat{g}}} \; \hat{F}^{-1} \hat{g},$$
where $\hat{g} = \sum_{t} \nabla \log \pi(a_t \mid s_t) \, A_t^{\text{GAE}(\gamma, \lambda)}$ is the estimated policy gradient and $\hat{F}$ is the estimated Fisher information matrix (FIM).
Since the above is only an approximate solution, the constraint can be violated in some iterations. Further, since $\delta$ is not adaptive, it might be too large for some iterations. TRPO addresses these issues by requiring an improvement in the surrogate objective, as well as ensuring that the KL-divergence constraint is satisfied. It does this by performing a backtracking line search along the gradient direction. As a result, TRPO is more robust to the choice of $\delta$ (Schulman et al., 2015).
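The main idea behind HOOF is to automatically learn a schedule for the hyperparameters by greedily maximising the value of the updated policy, i.e., starting with policy $\pi_n$ at iteration $n$, HOOF sets
$$\psi_n = \arg\max_{\psi} J(\pi_{n+1}) = \arg\max_{\psi} J(\pi_n + f(\psi)). \quad (3)$$
Given a set of sampled trajectories, $f(\psi)$ can be computed for any $\psi$, and thus we can generate different candidate $\pi_{n+1}$ without requiring any further samples. However, solving the optimisation problem in (3) requires evaluating $J(\pi_{n+1})$ for each such candidate. Any on-policy approach would have prohibitive sample requirements, so HOOF uses weighted importance sampling (WIS) to construct an off-policy estimate of $J(\pi_{n+1})$. Given sampled trajectories $\{\tau_1, \ldots, \tau_K\}$, with corresponding returns $\{R(\tau_1), \ldots, R(\tau_K)\}$, the WIS estimate of $J(\pi_{n+1})$ is given by:
$$\hat{J}(\pi_{n+1}) = \sum_{k=1}^{K} \frac{w_k}{\sum_{j=1}^{K} w_j} \, R(\tau_k),$$
where $w_k = \frac{p(\tau_k \mid \pi_{n+1})}{p(\tau_k \mid \pi_n)}$. Since $p(\tau \mid \pi) = p(s_0) \prod_{t} \pi(a_t \mid s_t) \, p(s_{t+1} \mid s_t, a_t)$, the initial state and transition probabilities cancel and we have:
$$w_k = \prod_{t} \frac{\pi_{n+1}(a_t \mid s_t)}{\pi_n(a_t \mid s_t)}.$$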
The success of this approach depends critically on the quality of the WIS estimates, which can suffer from high variance that grows rapidly as the distributions of $\pi_n$ and $\pi_{n+1}$ diverge. Fortunately, for second order methods like NPG, $\mathrm{KL}(\pi_{n+1} \,\|\, \pi_n)$ is automatically approximately bounded by the update, ensuring reasonable WIS estimates when HOOF directly uses (3). In the following, we consider the more challenging case of first order methods.
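The WIS estimate described above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions: the function name `wis_estimate` and the input layout (one array of per-step action log-probabilities per trajectory, under the behaviour and the candidate policy) are hypothetical choices, and working in log space is a numerical convenience rather than something prescribed by the text.

```python
import numpy as np


def wis_estimate(logp_new, logp_old, returns):
    """Weighted importance sampling estimate of a candidate policy's return.

    logp_new: list of arrays, log pi_{n+1}(a_t | s_t) for each trajectory.
    logp_old: list of arrays, log pi_n(a_t | s_t) for the same trajectories.
    returns:  length-K sequence of trajectory returns R(tau_k).
    """
    returns = np.asarray(returns, dtype=float)
    # log w_k = sum_t [log pi_{n+1}(a_t|s_t) - log pi_n(a_t|s_t)]
    log_w = np.array([np.sum(new) - np.sum(old)
                      for new, old in zip(logp_new, logp_old)])
    # Subtracting the maximum leaves the normalised weights unchanged
    # but avoids overflow in the exponential.
    w = np.exp(log_w - log_w.max())
    return float(np.sum(w * returns) / np.sum(w))
```

When the candidate and behaviour policies coincide, all weights are equal and the estimate reduces to the ordinary sample mean of the returns.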
Without a KL bound on the policy update, it may seem that WIS will not yield adequate estimates to solve (3). However, a key insight is that, while the estimated policy value can have high variance, the relative ordering of the policies, which HOOF solves for, has much lower variance. Nonetheless, HOOF could still fail if $\mathrm{KL}(\pi_{n+1} \,\|\, \pi_n)$ becomes too large, which can occur in first order methods. Hence, First Order HOOF modifies (3) by constraining $\mathrm{KL}(\pi_{n+1} \,\|\, \pi_n)$:
$$\psi_n = \arg\max_{\psi} J(\pi_{n+1}) \quad \text{s.t.} \quad \mathrm{KL}(\pi_{n+1} \,\|\, \pi_n) < \epsilon.$$
While this yields an update that superficially resembles that of second order methods, the KL constraint is applied only during the search for the optimal hyperparameter settings using WIS. The direction of the update is determined solely by a first order gradient update rule, and the estimation and inversion of the FIM is not required.
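To make the constrained search concrete, here is a hedged toy sketch of one First Order HOOF step. For brevity it assumes a state-independent categorical policy parametrised by logits `theta` (a deliberate simplification, not the neural net policies used in the paper), and the names `first_order_hoof_step`, `action_seqs`, and `alphas` are hypothetical. Candidates whose KL divergence from the current policy reaches `eps` are discarded before the WIS comparison.

```python
import numpy as np


def log_softmax(logits):
    z = logits - logits.max()
    return z - np.log(np.exp(z).sum())


def kl_categorical(logits_p, logits_q):
    """KL(p || q) for two categorical distributions given by logits."""
    lp, lq = log_softmax(logits_p), log_softmax(logits_q)
    return float(np.sum(np.exp(lp) * (lp - lq)))


def first_order_hoof_step(theta, grad, action_seqs, returns, alphas, eps):
    """Pick the learning rate whose candidate policy has the best WIS value.

    theta:       current logits (toy, state-independent policy).
    grad:        gradient direction, computed once from the sampled batch.
    action_seqs: list of integer arrays, the actions of each trajectory.
    Returns (best_alpha, best_value), or (None, -inf) if every candidate
    violates the KL constraint.
    """
    returns = np.asarray(returns, dtype=float)
    logp_old = log_softmax(theta)
    best_alpha, best_value = None, -np.inf
    for alpha in alphas:
        candidate = theta + alpha * grad
        # Constraint used only during the search: KL(pi_{n+1} || pi_n) < eps
        if kl_categorical(candidate, theta) >= eps:
            continue
        logp_new = log_softmax(candidate)
        log_w = np.array([np.sum(logp_new[a] - logp_old[a]) for a in action_seqs])
        w = np.exp(log_w - log_w.max())
        value = float(np.sum(w * returns) / np.sum(w))
        if value > best_value:
            best_alpha, best_value = alpha, value
    return best_alpha, best_value
```

Note that the gradient direction is supplied from outside and never modified: the KL check only filters candidate step sizes, which mirrors the point that no FIM estimation or inversion is needed.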
First order methods do not typically use KL constraints. Instead, as we show experimentally in Section 5.1, they rely on a good choice of initial learning rate as well as a manually constructed annealing schedule to keep $\mathrm{KL}(\pi_{n+1} \,\|\, \pi_n)$ reasonable. HOOF obviates the need for expensive optimisation of the initial learning rate, e.g., via a grid search, and learns a dynamic schedule for the learning rate, as well as other hyperparameters.
While first order HOOF has its own hyperparameter $\epsilon$, we show in Section 5.1 that its performance is highly robust to the choice of $\epsilon$.
In some environments, the reward function is a sum of multiple reward streams, i.e., $r_t = \sum_{i} r_t^{i}$ (van Seijen et al., 2017). For example, if we are trying to learn a locomotion behaviour for a robot there could be a reward stream for forward motion, another that penalises joint movement, and another for reaching the goal. If each of these reward streams is observable, we can use HOOF to learn hyperparameters specific to each reward stream. In this setting, the GAE in (2) is simply a linear combination of the advantages for each reward stream, each with its own set of parameters:
$$f(\alpha, \gamma_1, \lambda_1, \ldots, \gamma_m, \lambda_m) = \alpha \sum_{t} \nabla \log \pi(a_t \mid s_t) \sum_{i=1}^{m} A_t^{\text{GAE}(\gamma_i, \lambda_i)}, \quad (8)$$
where the $i$-th advantage is computed from the rewards of stream $i$.
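A minimal sketch of this linear combination of per-stream advantages is given below. It assumes that each reward stream comes with its own value predictions, which is an assumption made here to keep the example self-contained; the helper names `gae` and `multi_stream_advantage` are likewise hypothetical.

```python
import numpy as np


def gae(rewards, values, gamma, lam):
    """GAE for one reward stream; values has one extra bootstrap entry."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    deltas = rewards + gamma * values[1:] - values[:-1]
    adv = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv


def multi_stream_advantage(stream_rewards, stream_values, gammas, lams):
    """Sum of per-stream advantages, each with its own (gamma, lambda)."""
    return sum(gae(r, v, g, l)
               for r, v, g, l in zip(stream_rewards, stream_values, gammas, lams))
```

For example, a forward-motion stream could be given a high discount rate while a control-penalty stream is given a low one; the summed advantage then plugs into the usual policy gradient update.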
Setting the hyperparameters at each iteration with HOOF requires greedily maximising $J(\pi_{n+1})$ by solving (3). This can be done using random search or BO, depending on the computational expense of generating and evaluating each candidate $\pi_{n+1}$.
In (2), $\sum_{t} \nabla \log \pi(a_t \mid s_t) \, A_t^{\text{GAE}(\gamma, \lambda)}$ is independent of $\alpha$. Thus, if we only want to learn a schedule for $\alpha$, the gradient needs to be computed once. Subsequently, computing $\pi_{n+1}$ for different $\alpha$ involves a multiplication and an addition operation, which is far cheaper than computing the gradient. Thus, in this case we can employ random search to solve (3) efficiently.
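The cost argument above can be illustrated with a short sketch: once the gradient has been computed, every candidate parameter vector is obtained by one scaling and one addition. The function name `candidate_policies` and the uniform sampling of learning rates within fixed bounds are assumptions for illustration; each returned candidate would then be scored with the WIS estimate.

```python
import numpy as np


def candidate_policies(theta, grad, low, high, num_candidates, rng):
    """Sample learning rates uniformly and build all candidates in one shot.

    theta: current parameter vector, shape (D,).
    grad:  gradient direction computed once from the batch, shape (D,).
    Returns (alphas, candidates) with candidates of shape (num_candidates, D).
    """
    alphas = rng.uniform(low, high, size=num_candidates)
    # Broadcasting: row i is theta + alphas[i] * grad
    candidates = theta[None, :] + alphas[:, None] * grad[None, :]
    return alphas, candidates
```

A call such as `candidate_policies(theta, grad, 0.0, 1e-2, 10, np.random.default_rng(0))` produces ten candidates from a single gradient evaluation.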
If we use HOOF to learn $(\gamma, \lambda)$ as well, $A_t^{\text{GAE}(\gamma, \lambda)}$ has to be computed for each setting of $(\gamma, \lambda)$. With neural net value function approximations, we modify our value function such that its inputs are $(s, \gamma, \lambda)$, similar to Universal Value Function Approximators (Schaul et al., 2015). Thus we learn a $(\gamma, \lambda)$-conditioned value function that can make value predictions for any candidate $(\gamma, \lambda)$ at the cost of a single forward pass.
A computationally expensive step arises when using second order methods combined with deep neural net policies with tens of thousands of parameters. If the policy had few enough parameters that $\hat{F}^{-1}$ could be computed exactly and stored in memory, then the update for each candidate setting could be computed efficiently. To work with large policies, TRPO uses the conjugate gradient method to approximate $\hat{F}^{-1} \hat{g}$ directly without explicitly computing $\hat{F}^{-1}$. When used with NPG, the resulting algorithm is referred to as Truncated Natural Policy Gradients (TNPG) (Duan et al., 2016). Because $\hat{g}$ depends on the advantage estimates, each setting of $(\gamma, \lambda)$ considered requires a new run of the conjugate gradient algorithm. Thus, BO might be a suitable choice in such situations. However, our experiments suggest that random search with a rather small sample size of 10 performs well even in this case.
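The conjugate gradient step mentioned above can be sketched as follows. This is a generic textbook conjugate gradient solver combined with the NPG scaling, written under the assumption that a Fisher-vector product function `fvp` is available; the names `conjugate_gradient`, `npg_step`, and `fvp` are hypothetical, and the code is not taken from the TRPO or TNPG implementations.

```python
import numpy as np


def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g using only products v -> F v."""
    x = np.zeros_like(g)
    r = g.copy()          # residual g - F x, with x = 0 initially
    p = r.copy()          # search direction
    rs = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        step = rs / (p @ Fp)
        x += step * p
        r -= step * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x


def npg_step(fvp, g, delta, iters=10):
    """Natural gradient step scaled so that 0.5 * step^T F step equals delta."""
    x = conjugate_gradient(fvp, g, iters)   # x approximates F^{-1} g
    return np.sqrt(2.0 * delta / (g @ x)) * x
```

Since `x` depends only on the gradient, trying a different KL bound `delta` merely rescales the same direction, whereas a different $(\gamma, \lambda)$ changes `g` and therefore needs a fresh solve.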
Most hyperparameter search methods can be broadly classified into sequential search, parallel search, and gradient based methods.
Sequential search methods perform a training run with some candidate hyperparameters, and use the results to inform the choice of the next set of hyperparameters for evaluation. BO is a sample efficient global optimisation framework that models performance as a function of the hyperparameters, and is especially suited for sequential search as each training run is expensive. After each training run BO uses the observed performance to update the model in a Bayesian way, which then informs the choice of the next set of hyperparameters for evaluation. Several modifications have been suggested to further reduce the number of evaluations required: input warping (Snoek et al., 2014) to address nonstationary fitness landscapes; freeze-thaw BO (Swersky et al., 2014) to decide whether a new training run should be started and the current one discontinued based on interim performance; transferring knowledge about hyperparameters across similar tasks (Swersky et al., 2013); and modelling training time as a function of dataset size (Klein et al., 2016). To further speed up the wall clock time, some BO based methods use a hybrid mode wherein batches of hyperparameter settings are evaluated in parallel (Contal et al., 2013; Desautels et al., 2014; Shah & Ghahramani, 2015; Wang et al., 2016; Kandasamy et al., 2018).
By contrast, parallel search methods like grid search and random search run multiple training runs with different hyperparameter settings in parallel to reduce wall clock time, but require more parallel computational resources. These methods are easy to implement, and have been shown to perform well (Bergstra et al., 2011; Bergstra & Bengio, 2012).
Both sequential and parallel search suffer from two key disadvantages. First, they require performing multiple training runs to identify good hyperparameters. Not only is this computationally inefficient, but when applied to RL, also sample inefficient as each run requires fresh interactions with the environment. Second, these methods learn fixed values for the hyperparameters that are used throughout training instead of a schedule, which can lead to suboptimal performance (Luketina et al., 2016; Jaderberg et al., 2017; Xu et al., 2018).
PBT (Jaderberg et al., 2017) is a hybrid of random search and sequential search, with the added benefit of learning a schedule of hyperparameters. It starts by training a population of hyperparameters which are then updated periodically throughout training to further explore promising hyperparameter settings. However, by requiring multiple training runs, it inherits the sample inefficiency of random search.
HOOF is much more sample efficient because it requires no more interactions with the environment than those gathered by the underlying policy gradient method for one training run. Consequently, it is also far more computationally efficient. However, while HOOF can only optimise hyperparameters that directly affect the policy update, these methods can tune other hyperparameters, e.g., the policy architecture and batch size. Combining these complementary strengths is an interesting topic for future work.
Gradient based methods are highly sample efficient and require only one training run to optimise hyperparameters. They perform gradient descent on some suitably chosen loss function with respect to the hyperparameters. Hence, they are even more restricted than HOOF in the hyperparameters that they can optimise. For example, the approach of Sutton (1992) optimises only the learning rate. Meta-gradients (Xu et al., 2018) optimise only the discount rate and TD($\lambda$) hyperparameters and, unlike HOOF, cannot optimise, e.g., the learning rate and the policy entropy coefficient in the A2C objective function.
However, the major disadvantage of meta-gradients is that the meta-gradient estimates can have high variance, which in turn significantly affects performance. To address this, the objective function of meta-gradients relies on reference hyperparameters to trade off bias and variance. As a result, its performance can be sensitive to the choice of these reference values, as well as other hyperparameters like the meta-learning rate, as the experimental results of Xu et al. (2018) show. As a gradient free method, HOOF does not require meta-gradient estimates and, while it has a few hyperparameters of its own, we show in Section 5 that it is robust to these.
Other work on non-gradient based methods includes that of Kearns & Singh (2000), who derive a theoretical schedule for the TD($\lambda$) hyperparameter that they show is better than any fixed value. Downey et al. (2010) learn a schedule for TD($\lambda$) using a Bayesian approach. White & White (2016) greedily adapt the TD($\lambda$) hyperparameter as a function of state. Unlike HOOF, these methods can only be applied to TD($\lambda$) and, in the case of Kearns & Singh (2000), are not compatible with function approximation.
To experimentally validate HOOF, we apply it to four simulated continuous control tasks from MuJoCo OpenAI Gym (Brockman et al., 2016): HalfCheetah, Hopper, Ant, and Walker. We start with A2C as our first order method, and show that using HOOF to learn a schedule can lead to better performance than using a linearly annealed learning rate. We further show that we can separate the policy entropy from the A2C objective function, and learn a learning rate for this separately using HOOF, leading to even better performance.
Next we use NPG as our second order method and apply HOOF to learn a dynamic schedule for $(\delta, \gamma, \lambda)$ that outperforms TRPO with fixed hyperparameters.
The HalfCheetah and Ant environments have additive reward functions whose individual components can be easily exposed to the learning agent within the Gym framework. For our final set of experiments, we use HOOF to learn a different discount rate for each of these reward streams, and show that this can lead to faster learning.
We repeat all experiments across 10 random starts. Across all figures solid lines represent the median, and shaded regions the quartiles.
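For readers reproducing such plots, the summary statistics can be computed as in the sketch below. It assumes the learning curves are stored as one row per random start, and it takes the shaded region to span the 25th to 75th percentiles; both the array layout and the function name `summarise_runs` are assumptions for illustration.

```python
import numpy as np


def summarise_runs(returns_per_seed):
    """Median and quartile curves across random starts.

    returns_per_seed: array-like of shape (num_seeds, num_points).
    Returns (median, lower_quartile, upper_quartile), each of shape (num_points,).
    """
    curves = np.asarray(returns_per_seed, dtype=float)
    q25, median, q75 = np.percentile(curves, [25, 50, 75], axis=0)
    return median, q25, q75
```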
In the A2C framework, a neural net with parameters $\theta$ is commonly used to represent both the policy and the value function, usually with some shared layers. The objective for the update function (1) for A2C is a linear combination of the policy loss, the value loss, and the policy entropy:
$$J^{\text{A2C}} = J^{\pi} + c_V J^{V} + c_H J^{H},$$
where $J^{\pi}$, $J^{V}$, and $J^{H}$ denote the policy, value, and entropy terms respectively, and we have omitted the dependence on the timestep and hyperparameters for ease of notation. The coefficients $c_V$ and $c_H$ of the value loss and policy entropy terms are usually fixed a priori, and one learning rate $\alpha$ is used for the entire objective. For our first experiment, we learn $\alpha$ using HOOF, and refer to it as HOOF-LR.
We can also separate the policy entropy term from the objective function and view it as a separate objective function that has its own learning rate. The update can thus be reformulated as
$$\theta' = \theta + \alpha \nabla \left(J^{\pi} + c_V J^{V}\right) + \beta \nabla J^{H}.$$
This is equivalent to the single-objective update above with $\beta = \alpha c_H$. Thus under this formulation we can use HOOF to learn both $\alpha$ and $\beta$. We refer to this as HOOF-Additive.
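The equivalence between the two formulations can be checked numerically with a small sketch. The function and argument names below (`a2c_single_lr_update`, `a2c_additive_update`, `lr_main`, `lr_entropy`) are hypothetical, and the gradients are passed in as plain vectors rather than computed from a network.

```python
import numpy as np


def a2c_single_lr_update(theta, grad_main, grad_entropy, lr, entropy_coef):
    """One learning rate applied to the combined objective."""
    return theta + lr * (grad_main + entropy_coef * grad_entropy)


def a2c_additive_update(theta, grad_main, grad_entropy, lr_main, lr_entropy):
    """Entropy term treated as a separate objective with its own learning rate."""
    return theta + lr_main * grad_main + lr_entropy * grad_entropy
```

With `lr_entropy = lr * entropy_coef` the two functions return the same parameters; the additive form simply lets a search procedure vary the two learning rates independently.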
We compare HOOF-Additive and HOOF-LR to a Baseline A2C which uses an initial learning rate of 7e-4. This learning rate was shown to be within the optimal range for these environments (Henderson et al., 2018b).
Figure 1 shows learning curves for HOOF-Additive and HOOF-LR, both with a KL constraint $\epsilon$, and compares them to that of Baseline A2C. It demonstrates that HOOF, by learning an adaptive schedule for the learning rate, can consistently outperform Baseline A2C’s linearly annealed schedule. Furthermore, HOOF-Additive matches the performance of HOOF-LR across all environments, except for Walker where it significantly outperforms HOOF-LR. This suggests that separating out the entropy term from the objective and learning a specific learning rate for it can help performance in certain instances.
Figure 2 compares the performance of HOOF-LR with different settings of $\epsilon$, its own hyperparameter quantifying the KL constraint. The results show that HOOF-LR’s performance is stable across different values of this parameter.
Figure 3, which shows the KL divergence between successive Baseline A2C updates for all four environments, demonstrates that, although Baseline A2C does not enforce a KL constraint, in practice it always satisfies it, due to the careful selection of the initial learning rate. Therefore, the performance advantage of HOOF shown in Figure 1 is due to effective hyperparameter optimisation and is not an artefact of the KL constraint added in Section 3.1.
Appendix A.3 contains further experimental details, including results confirming that the KL constraint is crucial to ensuring sound WIS estimates.
A major disadvantage of second order methods is that they require the inversion of the FIM in the NPG update rule, which can be prohibitively expensive for large neural net policies with thousands of parameters. TNPG and TRPO address this by using the conjugate gradient algorithm to efficiently compute $\hat{F}^{-1} \hat{g}$. TRPO has been shown to perform better than TNPG in continuous control tasks (Schulman et al., 2015), a result attributed to stricter enforcement of the KL constraint.
However, in this section, we show that stricter enforcement of the KL constraint becomes unnecessary once we properly adapt TNPG’s learning rate. To do so, we apply HOOF to learn $(\delta, \gamma, \lambda)$ of TNPG (HOOF-All), and compare it to two baseline versions of TRPO: one with $\gamma = 0.99$ and $\lambda = 1$ following Schulman et al. (2015); Duan et al. (2016), and another with $\gamma = 0.995$ and $\lambda = 0.97$ following Henderson et al. (2018a); Rajeswaran et al. (2017). We refer to these as TRPO(0.99,1) and TRPO(0.995,0.97). The KL constraint for both is set to 0.01 following Schulman et al. (2015); Henderson et al. (2018a).
Figure 4 shows the learning curves of HOOF-All and the two TRPO baselines. Across all four environments TRPO(0.995,0.97) outperforms TRPO(0.99,1). HOOF-All learns much faster, and achieves a significantly better return than TRPO(0.995,0.97) in HalfCheetah and Ant. In Hopper and Walker, there is no significant performance difference between HOOF-All and the TRPO(0.995,0.97) baseline. However, the large variation in the returns for all methods in Walker suggests that the choice of the random seed has a far greater impact on performance than the choice of the hyperparameters.
Figure 5 presents the learnt $(\delta, \gamma, \lambda)$ for HalfCheetah, Hopper, and Walker. The results show that different KL constraints and GAE hyperparameters are needed for different domains. See Appendix B.3 for plots of the learnt hyperparameters for Ant.
Finally, we evaluate HOOF when we learn only $\delta$, while $(\gamma, \lambda)$ are fixed to $(0.995, 0.97)$, and compare it against TRPO(0.995, 0.97). The results in Appendix B.2 show that even in this case HOOF outperforms TRPO, confirming that it is beneficial to learn a schedule for the KL constraint even when good values of $(\gamma, \lambda)$ are known a priori.
The reward function for HalfCheetah is the sum of two components: positive reward for forward movement and penalties for joint movements. Similarly for Ant, the agent gets a fixed reward at each timestep it survives, together with other rewards for forward movement and penalties for joint movement. These additive reward components can be exposed to the learning agent through Gym’s API. We use this to test if learning a separate set of hyperparameters for each reward stream with HOOF can improve learning.
Following (8), our method, Multi-HOOF, learns a single KL constraint, but a different $\gamma$ for each reward stream. We compare Multi-HOOF to a second HOOF baseline (BL-HOOF) that learns a single KL constraint and $\gamma$ for the full reward function. For both, $\lambda$ was fixed to 1.
Figure 6 presents the results, together with the learnt hyperparameters for HalfCheetah. They show that Multi-HOOF learns a significantly different discount rate for each reward stream and that this helps improve performance.
The results for Ant, presented in Figure 7, show that once again learning multiple discount schedules helps improve performance. In Figure 6(c), Multi-HOOF’s discount rate schedule corresponding to the forward movement reward stream starts lower than that of BL-HOOF’s discount rate, but the two soon converge. The discount rate for the survival reward stream on the other hand starts high and then reduces. We believe this is because during early stages of training the key signal for policy improvement comes from the survival reward stream as the agent learns to stay alive. Once it starts getting the survival reward consistently, the forward movement reward stream provides a much better signal for learning a better policy.
The performance of a policy gradient method is highly dependent on its hyperparameter settings. However, methods typically used to tune these hyperparameters are highly sample inefficient, computationally expensive, and learn only a fixed setting of the hyperparameters. In this paper we presented HOOF, a sample efficient method that automatically learns a schedule for the learning rate and GAE hyperparameters of policy gradient methods without requiring multiple training runs. We believe that this, combined with its simplicity and ease of implementation, makes HOOF a compelling method for optimising policy gradient hyperparameters.
While this paper focused on learning only a few hyperparameters of policy gradient methods, in principle HOOF could be used to learn other hyperparameters as well, e.g., those of Generative Adversarial Imitation Learning (GAIL)(Ho & Ermon, 2016) or Model Agnostic Meta-Learning (MAML) (Finn et al., 2017), which could lead to more stable learning.
This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement #637713), and Samsung R&D Institute UK. The experiments were made possible by a generous equipment grant from NVIDIA.
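Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281–305, 2012.
Kandasamy, K., Krishnamurthy, A., Schneider, J., and Poczos, B. Parallelised Bayesian optimisation via Thompson sampling. In International Conference on Artificial Intelligence and Statistics, pp. 133–142, 2018.
Mordatch, I., Lowrey, K., Andrew, G., Popovic, Z., and Todorov, E. Interactive control of diverse complex characters with neural networks. In Neural Information Processing Systems (NIPS), 2015.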
We present further details about our A2C experiments in this section.
Our codebase for the A2C experiments is based on OpenAI Baselines (Dhariwal et al., 2017) implementation of A2C and uses their default hyperparameters. Experiments involving HOOF use the same hyperparameters apart from those that are learnt by HOOF. All hyperparameters are presented in Table 1.
|Number of environments (num_envs)||20|
|Timesteps per worker (nsteps)||5|
|Total environment steps||1e6|
|Max gradient norm||0.5|
|– num of fully connected layers||2|
|– num of units per layer||64|
|Default settings for Baseline A2C & HOOF|
|– Initial learning rate||7e-4|
|– Learning rate schedule||linear annealing|
|– Value function cost weight||0.5|
|– Entropy cost weight||0.01|
|HOOF specific hyperparameters|
|– HOOF-LR search bounds for the learning rate||[0, 1e-2]|
|– HOOF-Additive search bounds for the learning rate||[0, 1e-2]|
|– HOOF-Additive search bounds for the entropy learning rate||[0, 5e-4]|
The learning rates learnt by HOOF-LR and HOOF-Additive are presented in Figure 8. The schedule for the learning rates suggests that the default option of linear annealing used by Baseline A2C might not be optimal.
Figure 9 shows that without a KL constraint HOOF-A2C does not converge, which confirms that we need to constrain policy updates so that WIS estimates remain sound.
We present further details about our TNPG experiments in this section.
|– KL constraint||0.01|
|– Discounting||0.99 or 0.995|
|– GAE-$\lambda$||1.0 or 0.97|
|– num of fully connected layers||2|
|– num of units per layer||100|
|HOOF specific hyperparameters|
|– search bounds for the KL constraint $\delta$||[0.00125, 0.045]|
|– search bounds for the discount rate $\gamma$||[0.95, 1]|
|– search bounds for GAE-$\lambda$||[0.95, 1]|
In certain situations good values of $(\gamma, \lambda)$ may be known a priori. To check if learning only $\delta$ with HOOF (HOOF-KL) leads to any benefit, we compare the performance of TRPO(0.995, 0.97) against TNPG where HOOF is used to learn $\delta$ and $(\gamma, \lambda)$ is fixed to (0.995, 0.97). Results presented in Figure 10 show that even in this case HOOF outperforms TRPO.
The hyperparameter schedules learnt by HOOF-All for Ant are presented in Figure 11.