1 Introduction
A key consideration when applying reinforcement learning (RL) to a physical setting is the risk and expense of running trials, e.g., while learning the optimal policy for a robot. Another consideration is the robustness of the learned policies. Since it is typically infeasible to test a policy in all contexts, it is difficult to ensure it works as broadly as intended. Fortunately, policies can often be tested in a simulator that exposes key environment variables – state features that are unobserved and randomly determined by the environment in a physical setting but are controllable in the simulator. This paper considers how to use environment variables to help learn robust policies.
Although running trials in a simulator is cheaper and safer than running physical trials, the computational cost of each simulated trial can still be quite high. The challenge then is to develop algorithms that are sample efficient, i.e., that minimise the number of such trials. In such settings, Bayesian Optimisation (BO) [Brochu, Cora, and de Freitas2010] is a sampleefficient approach that has been successfully applied to RL in multiple domains [Lizotte et al.2007, MartinezCantin et al.2007, MartinezCantin et al.2009, Cully et al.2015, Calandra et al.2015].
A naïve approach would be to randomly sample values for the environment variables in each trial, so as to estimate expected performance. However, this approach (1) often requires testing each policy in a prohibitively large number of scenarios, and (2) is not robust to
significant rare events (SREs), i.e., it fails any time there are rare events that substantially affect expected performance. For example, rare localisation errors may mean that a robot is much nearer to an obstacle than expected, increasing the risk of a collision. Since collisions are so catastrophic, avoiding them is key to maximising expected performance, even though the factors contributing to the collision occur only rarely. In such cases, the naïve approach will not see such rare events often enough to learn an appropriate response.Instead, we propose a new approach called alternating optimisation and quadrature (ALOQ) specifically aimed towards learning policies that are robust to these rare events while being as sample efficient as possible. The main idea is to actively select the environment variables (instead of sampling them) in a simulator thanks to a Gaussian Process (GP) that models returns as a function of both the policy and the environment variables and then, at each timestep, to use BO and Bayesian Quadrature (BQ) in turn to select a policy and environment setting, respectively, to evaluate.
We apply ALOQ to a number of problems and our results demonstrate that ALOQ learns better and faster than multiple baselines. We also demonstrate that the policy learnt by ALOQ in a simulated hexapod transfers successfully to the real robot.
2 Related Work
frank2008 (frank2008) also consider the problems posed by SREs. In particular, they propose an approach based on importance sampling (IS) for efficiently evaluating policies whose expected value may be substantially affected by rare events. While their approach is based on temporal difference
(TD) methods, we take a BObased policy search approach. Unlike TD methods, BO is well suited to settings in which sample efficiency is paramount and/or where assumptions (e.g., the Markov property) that underlie TD methods cannot be verified. More importantly, they assume prior knowledge of the SREs, such that they can directly alter the probability of such events during policy evaluation. By contrast, a key strength of ALOQ is that it requires only that a set of environment variables can be controlled in the simulator, without assuming any prior knowledge about whether SREs exist, or about the settings of the environment variables that might trigger them.
More recently, OFFER (OFFER) also proposed an IS based algorithm, OFFER, where the setting of the environment variable is gradually changed based on observed trials. Since OFFER is a TD method, it suffers from all the disadvantages mentioned earlier. It also assumes that the environment variable only affects the initial state as otherwise it leads to unstable IS estimates.
williams_santner (williams_santner) consider a problem setting they call the design of computer experiments that is similar to our setting, but does not specifically consider SREs. Their proposed GPbased approach marginalises out the environment variable by alternating between BO and BQ. However, unlike ALOQ, their method is based on the EI acquisition function, which makes it computationally expensive for reasons discussed in Section 4, and is applicable only to discrete environment variables. We include their method as a baseline in our experiments. Our results presented in Section 5 show that, compared to ALOQ, their method is unsuitable for settings with SREs. Further, their method is far more computationally expensive and fails even to outperform a baseline that randomly samples the environment variable at each step.
krause2011contextual (krause2011contextual) also address optimising performance in the presence of environment variables. However, they address a fundamentally different contextual bandit setting in which the learned policy conditions on the observed environment variable.
PILCO [Deisenroth and Rasmussen2011] is a modelbased policy search method that achieves remarkable sample efficiency in robot control [Deisenroth, Fox, and Rasmussen2015]. PILCO superficially resembles ALOQ in its use of GPs but the key difference is that in PILCO the GP models the transition dynamics while in ALOQ it models the returns as a function of the policy and environment variable. PILCO is fundamentally ill suited to our setting. First, it assumes that the transition dynamics are Gaussian and can be learned with a few hundred observed transitions, which is often infeasible in more complex environments (i.e., it scales poorly as the dimensionality of the state/action space increases). Second, even in simple environments, PILCO will not be able to learn the transition dynamics because in our setting the environment variable is not observed in physical trials, leading to major violations of the Gaussian assumption when those environment variables can cause SREs.
Policies found in simulators are rarely optimal when deployed on the physical agent due to the reality gap that may exist due to the inability of any simulator to model reality perfectly. EPOpt [Rajeswaran et al.2016] tries to address this by finding policies that are robust to simulators with different settings of its parameters. First, multiple instances of the simulator are generated by drawing a random sample of the simulator parameter settings. Trajectories are then sampled from each of these instances and used by a batch policy optimisation algorithm (e.g. TRPO [Schulman et al.2015]). While ALOQ finds a riskneutral policy, EPOpt finds a riskaverse solution based on maximising the conditional value at risk (CVaR) by feeding the policy optimisation only the sampled trajectories whose returns are lower than the CVaR. In a riskneutral setting, EPOpt reduces to the underlying policy optimisation algorithm with trajectories randomly sampled from different instances of the simulator. This approach will not see SREs often enough to learn an appropriate response, as we demonstrate in our experiments.
RARL (RARL) also suggest a method to address the problem of finding robust policies. Their method learns a policy by training in a simulator that is adversarial in nature, i.e., the simulator settings are dynamically chosen to minimise the returns of the policy. This method requires significant prior knowledge to be able to set the simulator settings such that it provides just the right amount of challenge to the policy. Furthermore, it does not consider any settings with SREs.
3 Background
GPs provide a principled way of quantifying uncertainties associated with modelling unknown functions. A GP is a distribution over functions, and is fully specified by its mean function and covariance function (see GPML (GPML) for an indepth treatment) which encode any prior belief about the nature of the function. The prior can be combined with observed values to update the belief about the function in a Bayesian way to generate a posterior distribution.
The prior mean function of the GP is often assumed to be 0 for convenience. A popular choice for the covariance function is the class of stationary functions of the form , which implies that the correlation between the function values of any two points depends only on the distance between them.
In GP regression, it is assumed that the observed function values
is a sample from a multivariate Gaussian distribution. The prediction for a new point
is connected with the observations through the mean and covariance functions. By conditioning on the observed data, this can be computed analytically as a Gaussian :(1a)  
(1b)  
(1c) 
where
denotes the vector of observed inputs,
the vector of corresponding function values, and is the matrix with entries .This property of generating estimates of the uncertainty associated with any prediction makes it particularly suited for finding the optimum of using BO. BO requires an acquisition function to guide the search and balance exploitation (searching the space expected to have the optimum) and exploration (searching the space which has not been explored well). Given a set of observations, the next point for evaluation is actively chosen as the that maximises the acquisition function.
Two commonly used acquisition functions are expected improvement (EI) [Močkus1975, Jones, Schonlau, and Welch1998] and upper confidence bound (UCB) [Cox and John1992, Cox and John1997]. Defining as the current optimal evaluation, i.e., , EI seeks to maximise the expected improvement over the current optimum , where . By contrast, UCB does not depend on but directly incorporates the uncertainty in the prediction by defining an upper bound: , where controls the explorationexploitation tradeoff.
BQ [O’Hagan1991, Rasmussen and Ghahramani2003] is a sampleefficient technique for computing integrals of the form , where
is a probability distribution. Using GP regression to compute the prediction for any
given some observed data,is a Gaussian whose mean and variance can be computed analytically for particular choices of the covariance function and
[Briol et al.2015b]. If no analytical solution exists, we can approximate the mean and variance via Monte Carlo quadrature by sampling the predictions of various .Given some observed data , we can also devise acquisition functions for BQ to actively select the next point for evaluation. A natural objective here is to select that minimises the uncertainty of , i.e., [Osborne et al.2012]. Due to the nature of GPs, does not depend on and is thus computationally feasible to evaluate. Uncertainty sampling [Settles2010] is an alternative acquisition function that chooses the with the maximum posterior variance: . Although simple and computationally cheap, it is not the same as reducing uncertainty about since evaluating the point with the highest prediction uncertainty does not necessarily lead to the maximum reduction in the uncertainty of the estimate of the integral.
Monte Carlo (MC) quadrature simply samples from and estimates the integral as This typically requires a large and so is less sample efficient than BQ: it should only be used if is cheap to evaluate. The many merits of BQ over MC, both philosophically and practically, are brought out by oHagan1987 (oHagan1987) and HenOsbGirRSPA2015 (HenOsbGirRSPA2015). Below, we will describe an active Bayesian quadrature scheme (that is, selecting points according to an acquisition function), inspired by the empirical improvements offered by those of osborne_duvenaud (osborne_duvenaud) and gunter14fastbayesianquadrature (gunter14fastbayesianquadrature).
4 Problem Setting & Method
We assume access to a computationally expensive simulator that takes as input a policy and environment variable and produces as output the return , where both and belong to some compact sets in and , respectively.
We also assume access to , the probability distribution over . may be known a priori, or it may be a posterior distribution estimated from whatever physical trials have been conducted. Note that we do not require a perfect simulator: any uncertainty about the dynamics of the physical world can be modelled in , i.e., some environment variables may just be simulator parameters whose correct fixed setting is not known with certainty.
Defining , we assume we have a dataset . Our objective is to find an optimal policy :
(2) 
First, consider a naïve approach consisting of a standard application of BO that disregards , performs BO on with only one input , and attempts to estimate . Formally, this approach models as a GP with a zero mean function and a suitable covariance function . For any given , the variation in due to different settings of is treated as noise. To estimate , the naïve approach applies BO, while sampling from at each timestep. This approach will almost surely fail due to not sampling SREs often enough to learn a suitable response.
By contrast, our method ALOQ (see Alg. 1) models as a GP: , acknowledging both its inputs. The main idea behind ALOQ is, given , to use a BO acquisition function to select for evaluation and then use a BQ acquisition function to select , conditioning on .
Selecting requires maximising a BO acquisition function (6) on , which requires estimating , together with the uncertainty associated with it. Fortunately BQ is well suited for this since it can use the GP to estimate together with the uncertainty associated with it. This is illustrated in Figure 4.
Once is chosen, ALOQ selects by minimising a BQ acquisition function (7) quantifying the uncertainty about . After is selected, ALOQ evaluates it on the simulator and updates the GP with the new datapoint . Our estimate of is thus:
(3) 
Although the approach described so far actively selects and through BO and BQ, it is unlikely to perform well in practice. A key observation is that the presence of SREs, which we seek to address with ALOQ, implies that the scale of varies considerably, e.g., returns in case of collision vs no collision. This nonstationarity cannot be modelled with our stationary kernel. Therefore, we must transform the inputs to ensure stationarity of . In particular, we employ Beta warping, i.e., transform the inputs using Beta CDFs with parameters [Snoek et al.2014]
. The CDF of the beta distribution on the support
is given by:(4) 
where is the beta function. The beta CDF is particularly suitable for our purpose as it is able to model a variety of warpings based on the settings of only two parameters . ALOQ transforms each dimension of and independently, and treats the corresponding
as hyperparameters. We assume that we are working with the transformed inputs for the rest of the paper.
While the resulting algorithm should be able to cope with SREs, the that it returns at each iteration may still be poor, since our BQ evaluation of leads to a noisy approximation of the true expected return. This is particularly problematic in high dimensional settings. To address this, intensification [BartzBeielstein, Lasarczyk, and Preuss2005, Hutter et al.2009], i.e., reevaluation of selected policies in the simulator, is essential. Therefore, ALOQ performs two simulator calls at each timestep. In the first evaluation, is selected via the BO/BQ scheme described earlier. In the second stage, is evaluated, where is selected using (3) and using the BQ acquisition function (7).
Computing : For discrete with support , the estimate of the mean and variance for is straightforward:
(5a)  
(5b) 
where is the prediction from the GP with mean and covariance computed using (1). For continuous , we apply Monte Carlo quadrature. Although this requires sampling a large number of and evaluating the corresponding , it is feasible since we evaluate , not from the expensive simulator, but from the computationally cheaper GP.
BO acquisition function for : A modified version of the UCB acquisition function is a natural choice since using (5) we can compute it easily as
(6) 
and set .
Note that although it is possible to define an EIbased acquisition function: , where , as an alternative choice for ALOQ, it is prohibitively expensive to compute in practice. The stochastic renders this analytically intractable. Approximating it using Monte Carlo sampling would require performing predictions on points, i.e., all the observed ’s paired with all the possible settings of the environment variable, which is infeasible even for moderate as the computational complexity of GP predictions scales quadratically with the number of predictions.
BQ acquisition function for : BQ can be viewed as performing policy evaluation in our approach. Since the presence of SREs leads to high variance in the returns associated with any given policy, it is of critical importance that we minimise the uncertainty associated with our estimate of the expected return of a policy. We formalise this objective through our BQ acquisition function for : ALOQ selects by minimising the posterior variance of , yielding:
(7) 
We also tried uncertainty sampling in our experiments. Unsurprisingly it performed worse as it is not as good at reducing the uncertainty associated with the expected return of a policy as explained in Section 3.
Properties of ALOQ: Thanks to convergence guarantees for BO using [Srinivas et al.2010], ALOQ converges if the BQ scheme on which it relies also converges. Unfortunately, to the best of our knowledge, existing convergence guarantees [Kanagawa, Sriperumbudur, and Fukumizu2016, Briol et al.2015a] apply only to BQ methods that do not actively select points, as (7) does. Of course, we expect such active selection to only improve the rate of convergence of our algorithms over nonactive versions. However, our empirical results in Section 5 show that in practice ALOQ efficiently optimises policies in the presence of SREs across a variety of tasks.
ALOQ’s computational complexity is dominated by an matrix inversion, where is the sample size of the dataset . This cubic scaling is common to all BO methods involving GPs. The BQ integral estimation in each iteration requires only GP predictions, which are .
5 Experimental Results
To evaluate ALOQ we applied it to 1) a simulated robot arm control task, including a variation where is not known a priori but must be inferred from data, and 2) a hexapod locomotion task [Cully et al.2015]. Further experiments on test functions to clearly show the how each element of ALOQ is necessary for settings with SREs is presented in the supplementary material.
We compare ALOQ to several baselines: 1) the naïve method described in the previous section; 2) the method of williams_santner (williams_santner), which we refer to as WSN; 3) the simple policy gradient method Reinforce [Williams1992], and 4) the stateoftheart policy gradient method TRPO [Schulman et al.2015]. To show the importance of each component of ALOQ, we also perform experiments with ablated versions of ALOQ, namely: 1) Random Quadrature ALOQ (RQALOQ), in which is sampled randomly from instead of being chosen actively; 2) unwarped ALOQ, which does not perform Beta warping of the inputs; and 3) onestep ALOQ, which does not use intensification. All plotted results are the median of 20 independent runs. Details of the experimental setups and the variability in performance can be found in the supplementary material.
5.1 Robotic Arm Simulator
In this experiment, we evaluate ALOQ’s performance on a robot control problem implemented in a kinematic simulator. The goal is to configure each of the three controllable joints of a robot arm such that the tip of the arm gets as close as possible to a predefined target point.
Collision Avoidance
In the first setting, we assume that the robotic arm is part of a mobile robot that has localised itself near the target. However, due to localisation errors, there is a small possibility that it is near a wall and some joint angles may lead to the arm colliding with the wall and incurring a large cost. Minimising cost entails getting as close to the target as possible while avoiding the region where the wall may be present. The environment variable in this setting is the distance to the wall.
Figures 8 and 8 show the expected cost (lower is better) of the arm configurations after each timestep for each method. ALOQ, unwarped ALOQ, and RQALOQ greatly outperform the other baselines. Reinforce and TRPO, being relatively sample inefficient, exhibit a very slow rate of improvement in performance, while WSN fails to converge at all.
Figure 8 shows the learned arm configurations, as well as the policy that would be learned by ALOQ if there was no wall (No Wall). The shaded region represents the possible locations of the wall. This plot illustrates that ALOQ learns a policy that gets closest to the target. Furthermore, while all the BO based algorithms learn to avoid the wall, active selection of allows ALOQ to do so more quickly: smart quadrature allows it to more efficiently observe rare events and accurately estimate their boundary. For readability we have only presented the arm configurations for algorithms which have performance comparable to ALOQ.
Joint Breakage
Next we consider a variation in which instead of uncertainty introduced by localisation, some settings of the first joint carry a 5% probability of it breaking, which consequently incurs a large cost. Minimising cost thus entails getting as close to the target as possible, while minimising the probability of the joint breaking.
Figures 12 and 12 shows the expected cost (lower is better) of the arm configurations after each timestep for each method. Since is continuous in this setting, and WSN requires discrete , it was run on a slightly different version with discretised by 100 equidistant points. The results are similar to the previous experiment, except that the baselines perform worse. In particular, the Naïve baseline, WSN, and Reinforce seem to have converged to a suboptimal policy since they have not witnessed any SREs.
Figure 12 shows the learned arm configurations together with the policy that would be learned if there were no SREs (‘No break’). The shaded region represents the joint angles that can lead to failure. This figure illustrates that ALOQ learns a qualitatively different policy than the other algorithms, one which avoids the joint angles that might lead to a breakage while still getting close to the target faster than the other methods. Again for readability we only present the arm configurations for the most competitive algorithms.
Performance of Reinforce and TRPO
Both these baselines are relatively sample inefficient. However, one question that arises is whether these methods eventually find the optimal policy. To check this, we ran them for 2000 iterations with a batch size of 5 trajectories (thus a total of 10000 simulator calls). We repeated this for both the Collision Avoidance and Joint Breakage settings. The expected cost of the arm configurations after each iteration are presented in Figure 15 (we only present the results up to 1000 simulator calls for readability  there is no improvement beyond what can be seen in the plot). Both baselines can solve the tasks in settings without SREs, i.e. where there is no possibility of a collision or a breakage (’No Wall’ and ’No Break’ in the figures). However, in settings with SREs they converge rapidly to a suboptimal policy from which they are unable recover even if run for much longer, since they don’t experience the SREs often enough. This is especially striking in the collision avoidance task where TRPO converges to a policy that has a relatively high probability of leading to a collision.
Setting with unknown
Now we consider the setting where is not known a priori, but must be approximated using trajectories from some baseline policy. In this setting, instead of directly setting the robot arm’s joint angles, we set the torque applied to each joint . The final joint angles are determined by the torque and the unknown friction between the joints . Setting the torque too high can lead to the joint breaking, which incurs a large cost.
We use the simulator as a proxy for both real trials as well as the simulated trials. In the first case, we simply sample from a uniform prior, run a baseline policy, and use the observed returns to compute an approximate posterior over . We then use ALOQ to compute the optimal policy over this posterior (‘ALOQ policy’). For comparison, we also compute the MAP of and the corresponding optimal policy (‘MAP policy’). To show that active selection of is advantageous, we also compare against the policy learned by RQALOQ.
Since we are approximating the unknown with a set of samples, it makes sense to keep the sample size relatively low for computational efficiency when finding the ALOQ policy (50 samples in this instance). However, to show that ALOQ is robust to this approximation, when comparing the performance of the ALOQ and MAP policies, we used a much larger sample size of 400 for the posterior distribution.
For evaluation, we drew 1000 samples of from the more granular posterior distribution and measured the returns of the three policies for each of the samples. The average cost incurred by the ALOQ policy (presented in Table 1) was 31% lower than that incurred by the MAP policy and 23.6% lower than the RQALOQ policy. This is because ALOQ finds a policy that slightly underperforms the MAP policy in some of cases but avoids over 95% of the SREs (cost 70 in Table 1) experienced by the MAP and RQALOQ policies.
Average  % Episodes in Cost Range  

Cost  020  2070  70  
ALOQ Policy  19.82  61.3%  38.5%  0.2% 
MAP Policy  28.76  67.1%  28.7%  4.2% 
RQALOQ  25.95    94.5%  5.5% 
5.2 Hexapod Locomotion Task
As robots move from fully controlled environments to more complex and natural ones, they have to face the inevitable risk of getting damaged. However, it may be expensive or even impossible to decommission a robot whenever any damage condition prevents it from completing its task. Hence, it is desirable to develop methods that enable robots to recover from failure.
Intelligent trial and error (IT&E) [Cully et al.2015] has been shown to recover from various damage conditions and thereby prevent catastrophic failure. Before deployment, IT&E uses the simulator to create an archive of diverse and locally high performing policies for the intact robot that are mapped to a lower dimensional behaviour space. If the robot becomes damaged after deployment, it uses BO to quickly find the policy in the archive that has the highest performance on the damaged robot. However, it can only respond after damage has occurred. Though it learns quickly, performance may still be poor while learning during the initial trials after damage occurs. To mitigate this effect, we propose to use ALOQ to learn in simulation the policy with the highest expected performance across the possible damage conditions. By deploying this policy, instead of the policy that is optimal for the intact robot, we can minimise in expectation the negative effects of damage in the period before IT&E has learned to recover.
We consider a hexapod locomotion task with a setup similar to that of [Cully et al.2015] to demonstrate this experimentally. The objective is to cross a finish line a fixed distance from its starting point. Failure to cross the line leads to a large negative reward, while the reward for completing the task is inversely proportional to the time taken.
It is possible that a subset of the legs may be damaged or broken when deployed in a physical setting. For our experiments we assume that, based on prior experience, any of the front two or back two legs can be shortened or removed with probability of 10% and 5% respectively, independent of the other legs, leading to 81 possible configurations. We excluded the middle two legs from our experiment as their failure had a relatively lower impact on the hexapod’s movement. The configuration of the six legs acts as our environment variable. Figure 18 shows one such setting.
We applied ALOQ to learn the optimal policy given these damage probabilities, but restricted the search to the policies in the archive created by [Cully et al.2015]. Figure 18 shows that ALOQ finds a policy with much higher expected reward than RQALOQ. It also shows the policy that generates the maximum reward when none of the legs are damaged or broken (‘opt undamaged policy’).
To demonstrate that ALOQ learns a policy that can be applied to a physical environment, we also deployed the best ALOQ policy on the real hexapod. In order to limit the number of physical trials required to evaluate ALOQ, we limited the possibility of damage to the rear two legs. The learnt policy performed well on the physical robot because it optimised performance on the rare configurations that matter most for expected return (e.g., either leg shortened).
6 Conclusions
This paper proposed ALOQ, a novel approach to using BO and BQ to perform sampleefficient RL in a way that is robust to the presence of significant rare events. We empirically evaluated ALOQ on different simulated tasks involving a robotic arm simulator, and a hexapod locomotion task and showed how it can be also be applied to settings where the distribution of the environment variable is unknown a priori, and that it successfully transfers to a real robot. Our results demonstrated that ALOQ outperforms multiple baselines, including related methods proposed in the literature. Further, ALOQ is computationally efficient and does not require any restrictive assumptions to be made about the environment variables.
Acknowledgements
This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreements #637713 and #637972).
References

[BartzBeielstein, Lasarczyk, and
Preuss2005]
BartzBeielstein, T.; Lasarczyk, C. W. G.; and Preuss, M.
2005.
Sequential parameter optimization.
In
2005 IEEE Congress on Evolutionary Computation
, 773–780 Vol.1.  [Briol et al.2015a] Briol, F.X.; Oates, C. J.; Girolami, M.; Osborne, M. A.; and Sejdinovic, D. 2015a. Probabilistic Integration: A Role for Statisticians in Numerical Analysis? ArXiv eprints.
 [Briol et al.2015b] Briol, F.X.; Oates, C. J.; Girolami, M.; Osborne, M. A.; and Sejdinovic, D. 2015b. Probabilistic Integration: A role for statisticians in numerical analysis? arXiv:1512.00933 [cs, math, stat]. arXiv: 1512.00933.
 [Brochu, Cora, and de Freitas2010] Brochu, E.; Cora, V. M.; and de Freitas, N. 2010. A tutorial on bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. eprint arXiv:1012.2599, arXiv.org.

[Calandra et al.2015]
Calandra, R.; Seyfarth, A.; Peters, J.; and Deisenroth, M.
2015.
Bayesian optimization for learning gaits under uncertainty.
Annals of Mathematics and Artificial Intelligence
.  [Ciosek and Whiteson2017] Ciosek, K., and Whiteson, S. 2017. Offer: Offenvironment reinforcement learning. In AAAI 2017: Proceedings of the ThirtyFirst AAAI Conference on Artificial Intelligence.
 [Cox and John1992] Cox, D. D., and John, S. 1992. A statistical method for global optimization. In Systems, Man and Cybernetics, 1992., IEEE International Conference on.
 [Cox and John1997] Cox, D. D., and John, S. 1997. Sdo: A statistical method for global optimization. In in Multidisciplinary Design Optimization: StateoftheArt.
 [Cully and Mouret2015] Cully, A., and Mouret, J.B. 2015. Evolving a behavioral repertoire for a walking robot. Evolutionary Computation.
 [Cully et al.2015] Cully, A.; Clune, J.; Tarapore, D.; and Mouret, J.B. 2015. Robots that can adapt like animals. Nature 521.
 [Deisenroth and Rasmussen2011] Deisenroth, M. P., and Rasmussen, C. E. 2011. Pilco: A modelbased and dataefficient approach to policy search. In ICML.
 [Deisenroth, Fox, and Rasmussen2015] Deisenroth, M. P.; Fox, D.; and Rasmussen, C. E. 2015. Gaussian processes for dataefficient learning in robotics and control. IEEE Trans. Pattern Anal. Mach. Intell. 37(2):408–423.
 [Frank, Mannor, and Precup2008] Frank, J.; Mannor, S.; and Precup, D. 2008. Reinforcement learning in the presence of rare events. In ICML.
 [Gunter et al.2014] Gunter, T.; Osborne, M. A.; Garnett, R.; Hennig, P.; and Roberts, S. 2014. Sampling for inference in probabilistic models with fast bayesian quadrature. In NIPS.
 [Hennig, Osborne, and Girolami2015] Hennig, P.; Osborne, M. A.; and Girolami, M. 2015. Probabilistic numerics and uncertainty in computations. Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences.
 [Hutter et al.2009] Hutter, F.; Hoos, H. H.; LeytonBrown, K.; and Murphy, K. P. 2009. An experimental investigation of modelbased parameter optimisation: Spo and beyond. In Proceedings of the 11th Annual Conference on Genetic and Evolutionary Computation, 271–278.
 [Jones, Perttunen, and Stuckman1993] Jones, D. R.; Perttunen, C. D.; and Stuckman, B. E. 1993. Lipschitzian optimization without the lipschitz constant. Journal of Optimization Theory and Applications 79(1):157–181.
 [Jones, Schonlau, and Welch1998] Jones, D.; Schonlau, M.; and Welch, W. 1998. Efficient global optimization of expensive blackbox functions. Journal of Global Optimization.
 [Kanagawa, Sriperumbudur, and Fukumizu2016] Kanagawa, M.; Sriperumbudur, B. K.; and Fukumizu, K. 2016. Convergence guarantees for kernelbased quadrature rules in misspecified settings. In Advances in Neural Information Processing Systems 29.
 [Krause and Ong2011] Krause, A., and Ong, C. S. 2011. Contextual gaussian process bandit optimization. In NIPS.
 [Lizotte et al.2007] Lizotte, D. J.; Wang, T.; Bowling, M.; and Schuurmans, D. 2007. Automatic gait optimization with gaussian process regression. In IJCAI.
 [MartinezCantin et al.2007] MartinezCantin, R.; de Freitas, N.; Doucet, A.; and Castellanos, J. 2007. Active policy learning for robot planning and exploration under uncertainty. In Robotics: Science and Systems.
 [MartinezCantin et al.2009] MartinezCantin, R.; de Freitas, N.; Brochu, E.; Castellanos, J.; and Doucet, A. 2009. A bayesian explorationexploitation approach for optimal online sensing and planning with a visually guided mobile robot. Autonomous Robots 27(2).
 [Močkus1975] Močkus, J. 1975. On bayesian methods for seeking the extremum. In Optimization Techniques IFIP Technical Conference Novosibirsk, July 1–7, 1974.
 [Mouret and Clune2015] Mouret, J.B., and Clune, J. 2015. Illuminating search spaces by mapping elites. arxiv:1504.04909.
 [Neal2000] Neal, R. 2000. Slice sampling. Annals of Statistics 31.
 [O’Hagan1987] O’Hagan, A. 1987. Monte carlo is fundamentally unsound. Journal of the Royal Statistical Society. Series D (The Statistician) 36(2/3):pp. 247–249.
 [O’Hagan1991] O’Hagan, A. 1991. Bayeshermite quadrature. Journal of Statistical Planning and Inference.
 [Osborne et al.2012] Osborne, M.; Garnett, R.; Ghahramani, Z.; Duvenaud, D. K.; Roberts, S. J.; and Rasmussen, C. E. 2012. Active learning of model evidence using bayesian quadrature. In NIPS.
 [Pinto et al.2017] Pinto, L.; Davidson, J.; Sukthankar, R.; and Gupta, A. 2017. Robust adversarial reinforcement learning. CoRR abs/1703.02702.
 [Pritchard et al.1999] Pritchard, J. K.; Seielstad, M. T.; PerezLezaun, A.; and Feldman, M. W. 1999. Population growth of human y chromosomes: a study of y chromosome microsatellites. Molecular Biology and Evolution 16(12):1791–1798.
 [Rajeswaran et al.2016] Rajeswaran, A.; Ghotra, S.; Levine, S.; and Ravindran, B. 2016. Epopt: Learning robust neural network policies using model ensembles. CoRR abs/1610.01283.
 [Rasmussen and Ghahramani2003] Rasmussen, C. E., and Ghahramani, Z. 2003. Bayesian monte carlo. NIPS 15.

[Rasmussen and Williams2005]
Rasmussen, C. E., and Williams, C. K. I.
2005.
Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning)
. The MIT Press.  [Rubin1984] Rubin, D. B. 1984. Bayesianly justifiable and relevant frequency calculations for the applied statistician. Ann. Statist. 12(4):1151–1172.
 [Schulman et al.2015] Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; and Moritz, P. 2015. Trust region policy optimization. In Bach, F., and Blei, D., eds., Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research. Lille, France: PMLR.
 [Settles2010] Settles, B. 2010. Active learning literature survey. University of Wisconsin, Madison 52(5566):11. 00000.
 [Snoek et al.2014] Snoek, J.; Swersky, K.; Zemel, R.; and Adams, R. 2014. Input warping for bayesian optimization of nonstationary functions. In ICML.
 [Srinivas et al.2010] Srinivas, N.; Krause, A.; Kakade, S. M.; and Seeger, M. 2010. Gaussian process optimization in the bandit setting: no regret and experimental design. In ICML.
 [Tavaré et al.1997] Tavaré, S.; Balding, D. J.; Griffiths, R. C.; and Donnelly, P. 1997. Inferring coalescence times from dna sequence data. Genetics 145(2):505–518.
 [Williams, Santner, and Notz2000] Williams, B. J.; Santner, T. J.; and Notz, W. I. 2000. Sequential design of computer experiments to minimize integrated response functions. Statistica Sinica 10(4).
 [Williams1992] Williams, R. J. 1992. Simple statistical gradientfollowing algorithms for connectionist reinforcement learning. Machine Learning 8(3):229–256.
General Experimental Details
We provide further details of our experiments in this section.
Covariance function: All our experiments use a squared exponential covariance function given by:
(8) 
where the hyperparameter specifies the variance and the length scales for the dimensions.
Treatment of hyperparameters: Instead of maximising the likelihood of the hyperparameters, we follow a full Bayesian approach and compute the marginalised posterior distribution
by first placing a hyperprior distribution on
, the set of all hyperparameters, and then marginalising it out from . In practice, an analytical solution for this is unlikely to exist so we estimate using Monte Carlo quadrature. Slice sampling [Neal2000] was used to draw random samples from .Choice of hyperpriors: We assume a lognormal hyperprior distribution for all the above hyperparameters. For the variance we use , while for the lengthscales we use . For we used .
Optimising the BO/BQ acquisition functions: We used DIRECT [Jones, Perttunen, and Stuckman1993] to maximise the BO acquisition function . To minimise the BQ acquisition function, we exhaustively computed for each since this was computationally very cheap.
Robotic Arm Simulator
The configuration of the robot arm is determined by three joint angles, each of which is normalised to lie in . The arm has a reach of on the axis. We set for all three experiments in this section.
Collision Avoidance
In this experiment, the target was set to the final position of the end effector for . The location of the wall, , was discrete with 20 support points logarithmically distributed in . The probability mass was distributed amongst these points such that there was only a 12% chance of collision for .
Joint Breakage
The target for the arm breakage experiment was set to the final position of the end effector for . Angles between for the first joint have an associated 5% probability of breakage.
Comparison of runtimes
A comparison of the perstep runtimes for the GP based methods are presented in Figure 21. As expected, ALOQ is once again much faster than WSN.
Variation in performance
Setting with unknown
As described in the paper, in this setting we assume that is the torque applied to the joints, and controls the rigidity of the joints. The final joint angle is determined as . If the torque applied to any of the joints is greater than the rigidity, (i.e. any of the angles end up ), then the joint is damaged, incurring a large cost.
To simulate a set of physical trials with a baseline policy , we sample from and observe the return and add iid Gaussian noise to them. The posterior can then be computed as , where . We can approximate this using slice sampling since both the prior and the likelihood are analytical.
An alternative formulation would be to corrupt the joint angles with Gaussian noise instead of the observed returns. The posterior can still be computed in this case, but instead of using slice sampling, we would have to make use of approximate Bayesian computation [Rubin1984, Tavaré et al.1997, Pritchard et al.1999], which would be computationally expensive.
To ensure that only the information gained about gets carried over from the physical trials to the final ALOQ/MAP policy being learned, the target for the baseline policy was different to the target for the final policy.
As mentioned in the paper, to find the optimal policy using ALOQ, we approximated the posterior with 50 samples using a slice sampler. However, for evaluation and comparison with the MAP policy, we used a much more granular approximation with 400 samples.
Hexapod Locomotion Task
The robot has six legs with three degrees of freedom each. We built a fairly accurate model of the robot which involved creating a URDF model with dynamic properties of each of the legs and the body, including their weights, and used the DART simulator for the dynamic physics simulation.
^{1}^{1}1https://dartsim.github.io We also used velocity actuators.The lowlevel controller (or policy) is the same openloop controller as in [Cully and Mouret2015] and [Cully et al.2015]. The position of the first two joints of each of the six legs is controlled by a periodic function with three parameters: an offset, a phase shift, and an amplitude (we keep the frequency fixed). The position of the third joint of each leg is the opposite of the position of the second one, so that the last segment always stays vertical. This results in 36 parameters.
The archive of policies in the behaviour space was created using the MAPElites algorithm [Mouret and Clune2015]. MAPElites searches for the highestperforming solution for each point in the duty factor space [Cully et al.2015], i.e., the time each tip of the leg spent touching the ground. MAPElites also acts as a dimensionality reduction algorithm and maps the high dimensional controller/policy space (in our case 36D) to the lower dimensional behaviour space (in our case 6D). We also used this lower dimensional representation of the policies in the archive as the policy search space () for ALOQ.
For our experiment, we set the reward such that failure to cross the finish line within 5 seconds yields zero reward, while crossing the finish line gives a reward of where is the average velocity in m/s.
Further experiments
In this section, we present the results of further experiments performed on test functions to demonstrate that each element of ALOQ is necessary for settings with SREs.
We begin with modified versions of the Branin and Hartmann 6 test functions used by williams_santner. The modified Branin test function is a fourdimensional problem, with two dimensions treated as discrete environment variables with a total of 12 support points, while the modified Hartmann 6 test function is sixdimensional with two dimensions treated as environment variables with a total of 49 support points. See [Williams, Santner, and Notz2000] for the mathematical formulation of these test functions.
The performance of the algorithms on the two functions is presented in Figure 24. In the Branin function, ALOQ, RQALOQ, unwarped ALOQ, and onestep ALOQ all substantially outperform WSN. WSN performs better in the Hartmann 6 function as it does not get stuck in a local maximum. However, it still cannot outperform onestep ALOQ. Note that ALOQ slightly underperforms onestep ALOQ. This is not surprising: since the problem does not have SREs, the intensification procedure used by ALOQ does not yield any significant benefit.
Figure 27 plots in log scale the perstep runtime of each algorithm, i.e., the time taken to process one data point on the two test functions. WSN takes significantly longer than ALOQ or the other baselines, and shows a clear increasing trend. The reduction in time near the end is a computational artefact due to resources being freed up as some runs finish faster than others.
The slow runtime of WSN is as expected due to the reasons mentioned in the paper. However, its failure to outperform RQALOQ is surprising as these are the test problems williams_santner use in their own evaluation. However, they never compared WSN to these (or any other) baselines. Consequently, they never validated the benefit of modelling explicitly, much less selecting it actively. In retrospect, these results make sense because the function is not characterised by significant rare events and there is no other a priori reason to predict that simpler methods will fail.
These results underscore the fact that a meaningful evaluation must include a problem with SREs, as such problems do demand more robust methods. To create such an evaluation, we formulated two test functions, FSRE1 and FSRE2, that are characterised by significant rare events. For , FSRE1 is defined as:
(9) 
And for , FSRE2 is defined as:
(10) 
Figure 30 shows the contour plots of these two functions. Both functions have a narrow band of which corresponds to the SRE regions, i.e. the scale of the rewards is much larger in these regions. In FSRE1 this is while in FSRE2 this is . We downscaled the region corresponding to the SRE by a factor of 10 to make the plots more readable. The final learned policy, i.e., , of each algorithm is shown as a vertical line, along with (the true maximum). These lines illustrate that properly accounting for significant rare events can lead to learning qualitatively different policies.
Figures 33, which plots the performance of all methods the two functions, shows that ALOQ substantially outperforms all the other algorithms except for onestep ALOQ (note that both WSN and the naïve approach fail completely in these settings). As expected, intensification does not yield any additional benefit in this low dimensional problem. However, our experiments on the robotics tasks presented in the paper show that intensification is crucial for success in higher dimensional problems.
The perstep runtime is presented in Fig 36. Again WSN is significantly slower than all other methods. In fact, it was not computationally feasible to run WSN beyond 100 data points for FSRE2.
To provide a sense of the variance in the performance of each algorithm across the 20 independent runs, Table 9 presents the quartiles of the expected function value of the final for all four artificial test functions.
Across all four test functions, we used a lognormal hyperprior distribution with for each of and .
Comments
There are no comments yet.