Robust Policy Search for Robot Navigation with Stochastic Meta-Policies

03/02/2020 ∙ by Javier Garcia-Barcos, et al. ∙ University of Zaragoza

Bayesian optimization is an efficient nonlinear optimization method where the queries are carefully selected to gather information about the optimum location. Thus, in the context of policy search, it has been called active policy search. The main ingredients of Bayesian optimization for sample efficiency are the probabilistic surrogate model and the optimal decision heuristics. In this work, we exploit both to provide robustness against different issues that arise in policy search algorithms. We combine several methods and show how their interaction works better than the sum of the parts. First, to deal with input noise and provide a safe and repeatable policy, we use an improved version of unscented Bayesian optimization. Then, to deal with mismodeling errors and improve exploration, we use stochastic meta-policies for query selection and an adaptive kernel. We compare the proposed algorithm with previous results in several optimization benchmarks and robot tasks, such as pushing objects with a robot arm or path finding with a rover.







I Introduction

Robot control and navigation in uncertain environments can be framed as a policy search problem [43], which has led to important achievements in robotics [9, 31, 18, 20]. Previous results have been achieved using gradient-based policy search, which might require a large number of optimization steps and a good initialization to avoid local minima. As a result, policy search typically requires a large number of trials, or policy evaluations, to find a local solution.

Robot experiments are quite expensive, requiring moving and interacting with a robot. Even if we use a simulator for planning or learning, robotic simulators are typically complex and computationally expensive. For example, information-based planning might be quite expensive, especially if we also require performing localization and dense map-building simultaneously [24, 23]. Thus, sample efficiency during policy learning is of paramount importance. Active policy search uses Bayesian optimization to drive the search for optimality in an intelligent and efficient fashion. Bayesian optimization is a method for sample-efficient nonlinear optimization that does not require gradients to obtain global convergence [35]. Furthermore, the probabilistic core of Bayesian optimization also allows the use of partial or incomplete information, similarly to the stochastic gradient descent commonly used in classical policy search. In fact, Bayesian optimization has already been used in many robotics and reinforcement learning setups, such as robot walking [21, 5, 32], control [38, 19], planning [23, 22], grasping [29, 6] and damage recovery [8].

Fig. 1: Path planning on uneven terrain with obstacles, with different trajectories displayed (left and right). The orange regions represent slopes with a higher traversing cost. The red rectangles are obstacles. Top: the desired trajectories (blue dashed line). Bottom: possible deviations (blue lines) from the desired trajectories due to input noise. The right trajectory is more efficient without input noise; once we take input noise into account, it becomes unsafe, as it can easily collide with obstacles. The left trajectory is safer in the presence of input noise despite being less efficient.

Bayesian optimization relies on a probabilistic surrogate model of the target function, typically a Gaussian process. In the original formulation, this model is incorporated for sample efficiency [15]. In addition, the surrogate model can also be used for secondary tasks that are useful in robotics scenarios, such as guaranteeing a minimum outcome [37], detecting and removing outliers [26], or incorporating prior information [8]. However, the surrogate model must be simple enough that it can be learned using few samples, as intended within the Bayesian optimization context. This simplicity results in certain assumptions about the surrogate model, such as spatial stationarity, that might not be satisfied when applied to policy search [27]. In this work, we propose several methods to incorporate robustness to mismodeling errors and assumptions. First, we use an adaptive surrogate model for nonstationary environments [27]. Bayesian optimization also relies on optimal decision theory to actively select the most informative trial to find the optimal result. When combined with a biased model, this can lead to poor results and lack of convergence. Thus, we have also incorporated a decision rule based on stochastic meta-policies that is robust to surrogate modeling errors [11]. The intuition behind the meta-policies is to consider the Bayesian optimization component as a meta reinforcement learning problem on top of the actual policy search problem. Another advantage of the stochastic meta-policies is that they trivially allow us to perform distributed Bayesian optimization in a multi-robot setup or using a simulator in a computer cluster.

In the case of robotic applications, localization or trajectory uncertainty should also be considered. The optimal policy should be one that the robot can follow effectively if we need to repeat the task multiple times. For example, take the path planning problem from Figure 1: although the right trajectory is more cost efficient, as we introduce input noise it becomes unsafe and incurs a higher cost on average. On the contrary, the left trajectory is less cost efficient without input noise, but beyond a certain input noise level it becomes a safe and efficient route. If we think of the cost function in terms of the policy parameters, we can see that the left trajectory lies in a smooth, flat region while the right trajectory lies in a high-variance region with a narrow valley. Thus, depending on the task or environmental conditions, the algorithm should be able to select between narrow and flat optimum regions. In this work, we combine previous works on Bayesian optimization with input noise [29] and adaptive kernels [25], which allow us to obtain the best solution for each situation. Furthermore, we provide a new interpretation of unscented Bayesian optimization [29] as an integrated response method and a new formulation based on the scaled unscented transform.

II Active Policy Search

Policy search consists of finding the optimal parameters θ of a policy π_θ(a|s), which is a distribution over actions a conditioned on states s, with respect to the expected future return, denoted J(θ). The expectation is under the policy and the system dynamics, which together form a distribution over trajectories ξ. If we use an episodic formulation, such as REINFORCE [43], the expectation is usually approximated from Monte-Carlo rollouts of the robot trajectory. In this setup, finding the optimal policy parameters can be framed as a pure optimization problem, where the objective function is computed as:

θ* = arg max_θ J(θ),   J(θ) = E_{p(ξ|θ)} [ ∑_{t=1}^{T} r_t ] ≈ (1/N) ∑_{n=1}^{N} ∑_{t=1}^{T} r_t^{(n)}     (1)

where θ* are the parameters of the optimal policy π_{θ*} and r_t^{(n)} is the instantaneous reward at time step t following rollout ξ_n. Active policy search [24] computes the optimal policy parameters using Bayesian optimization. Similarly to stochastic gradient descent in gradient-based policy search, Bayesian optimization can be directly applied to stochastic optimization thanks to the probabilistic surrogate model [14]. Therefore, the expectation in equation (1) can be approximated with a small batch of rollouts or even a single episode. Algorithm 1 summarizes the active policy search strategy. Section III details the steps of updating the surrogate model and generating the next set of policy parameters.

1: Optimization budget N
2: Initialize {θ_i} based on a low discrepancy sequence.
3: for each optimization iteration n until budget N do:
4:     Generate episode ξ_n and compute its return J(θ_n)
5:     Add (θ_n, J(θ_n)) to the surrogate model with equation (3)
6:     Generate θ_{n+1} using equation (2)
7: end for
Algorithm 1 Active Policy Search
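As an illustration, the loop of Algorithm 1 can be sketched in Python. The callables `run_episode`, `fit_surrogate` and `next_query` are hypothetical stand-ins for the rollout simulator, the surrogate update and the acquisition step; this is a sketch of the structure, not the paper's implementation:

```python
import numpy as np

def active_policy_search(run_episode, init_queries, fit_surrogate, next_query, budget):
    """Sketch of Algorithm 1: episodic active policy search.
    run_episode(theta) returns a Monte Carlo estimate of the return J(theta);
    fit_surrogate refits the probabilistic model on all (theta, J) pairs;
    next_query optimizes (or samples) the acquisition function over parameters."""
    thetas, returns = [], []
    for theta in init_queries:                  # low discrepancy initialization
        thetas.append(theta)
        returns.append(run_episode(theta))
    for _ in range(budget):
        model = fit_surrogate(thetas, returns)  # update surrogate model
        theta = next_query(model)               # select next policy parameters
        thetas.append(theta)
        returns.append(run_episode(theta))
    best = int(np.argmax(returns))              # incumbent optimum
    return thetas[best], returns[best]
```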

III Bayesian Optimization

Bayesian optimization (BO) is a framework that aims to efficiently optimize noisy, expensive, blackbox functions. It uses two distinct components: a probabilistic surrogate model that learns the properties and features of the target function using previously evaluated observations, and an acquisition function that, based on the surrogate model, builds a utility function that rates how promising a subsequent query could be. For the remainder of the paper, the discussion and results are based on the use of a Gaussian process (GP) as the surrogate model and the expected improvement (EI) as the acquisition function, because they are the most commonly used in the literature due to their excellent performance in a large variety of problems.

Formally, BO attempts to find the global optimum x* of an expensive, unknown function f over some domain X by sequentially performing queries. At iteration t, all previously observed values y = y_{1:t-1} at queried points x_{1:t-1} are used to learn a probabilistic surrogate model p(f | y_{1:t-1}). Typically, the next query is then determined by greedily optimizing the acquisition function in X:

x_t = arg max_{x ∈ X} EI(x)     (2)

although we will replace the greedy selection in Section V-B.

Surrogate Model

As commented previously, the most common surrogate model is the Gaussian process (GP). For the remainder of the paper we consider a GP with zero mean and kernel k(·, · | Θ) with hyperparameters Θ. The GP posterior model allows predictions at a query point x, which are normally distributed y ∼ N(μ(x), σ²(x)), such that:

μ(x) = kᵀ K⁻¹ y,   σ²(x) = k(x, x) − kᵀ K⁻¹ k     (3)

where k = [k(x, x_i)]_i and K = [k(x_i, x_j)]_{ij} + σ_n² I. For the kernel, we have used the Spartan kernel [25], which provides robustness to nonstationary functions and improves convergence. The use of the kernel is further explained in Section V-A.
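A minimal sketch of the GP prediction step follows, using a squared-exponential kernel as a simple stand-in for the Spartan kernel (the `se_kernel` hyperparameters are illustrative, not the ones used in the paper):

```python
import numpy as np

def se_kernel(A, B, length=0.2, sigma_f=1.0):
    """Squared-exponential kernel, a simple stand-in for the Spartan kernel."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sigma_f ** 2 * np.exp(-0.5 * d2 / length ** 2)

def gp_posterior(X, y, x_star, noise=1e-6):
    """Predictive mean and variance of a zero-mean GP at one query x_star."""
    K = se_kernel(X, X) + noise * np.eye(len(X))   # Gram matrix of past queries
    k = se_kernel(X, x_star[None, :]).ravel()      # cross-covariances k(x*, x_i)
    L = np.linalg.cholesky(K)                      # stable inversion via Cholesky
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    v = np.linalg.solve(L, k)
    mu = k @ alpha                                 # posterior mean
    var = se_kernel(x_star[None, :], x_star[None, :])[0, 0] - v @ v
    return mu, var
```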

Acquisition Function

The expected improvement (EI) [28] is a standard acquisition function defined in terms of the query improvement at iteration t:

EI(x) = (μ(x) − ρ) Φ(z) + σ(x) φ(z),   with z = (μ(x) − ρ) / σ(x)

where φ and Φ are the corresponding Gaussian probability density function (PDF) and cumulative density function (CDF). In this case, μ(x) and σ(x) are the prediction parameters computed with (3) and ρ is the incumbent optimum at that iteration.
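The closed-form EI (here for maximization, with incumbent ρ) needs only the standard library; this is a generic sketch, not the paper's implementation:

```python
import math

def expected_improvement(mu, sigma, rho):
    """Closed-form EI for maximization with incumbent rho:
    EI = (mu - rho) * Phi(z) + sigma * phi(z), z = (mu - rho) / sigma."""
    sigma = max(sigma, 1e-12)  # guard against zero predictive variance
    z = (mu - rho) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mu - rho) * cdf + sigma * pdf
```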


The GP formulation provides a convenient closed-form computation of predictive distributions. However, the kernel hyperparameters introduce nonlinearity and non-Gaussianity that break the closed-form solution. In many applications of GPs, including Bayesian optimization, the empirical Bayes approach is employed, where a point estimate of Θ is used, resulting in an overconfident estimate of the GP uncertainty [33]. Instead, we use a fully Bayesian approach based on Markov chain Monte Carlo (MCMC) to generate a set of samples {Θ_k}_{k=1}^m with Θ_k ∼ p(Θ | y). In particular, we use the slice sampling algorithm, which has already been used successfully in Bayesian optimization [36]. In this case, the resulting acquisition function has to be approximated by its Monte Carlo counterpart:

EI(x) ≈ (1/m) ∑_{k=1}^{m} EI(x | Θ_k)

During the first iterations, EI is known to be unstable due to lack of information [15, 3]. Therefore, the optimization is initialized with evaluations from a low discrepancy sequence, like the Sobol sequence, before we start using the acquisition function to decide the location of the next query.
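The Monte Carlo counterpart of the acquisition function can be sketched as follows, where `posterior_samples` is a hypothetical list of predictors, one per MCMC hyperparameter sample Θ_k, each mapping a query to its (μ, σ):

```python
import math

def mc_expected_improvement(x, rho, posterior_samples):
    """Average the closed-form EI (maximization form) over m hyperparameter
    samples; posterior_samples is a list of callables x -> (mu, sigma)."""
    def ei(mu, sigma):
        sigma = max(sigma, 1e-12)
        z = (mu - rho) / sigma
        pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
        return (mu - rho) * cdf + sigma * pdf
    return sum(ei(*predict(x)) for predict in posterior_samples) / len(posterior_samples)
```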

IV Robustness to Input Noise

Local robustness in optimization can be achieved by performing sensitivity analysis of the optimum selection. In Bayesian optimization, this can be performed online thanks to the surrogate model. Instead of selecting the point that optimizes a single outcome, we select the point that optimizes an integrated outcome:

g(x) = ∫ f(x + δ) p(δ) dδ     (6)

where p(δ) can be interpreted objectively as input noise or, subjectively, as a probabilistic representation of the local stability or safety region, that is, a region that guarantees good results even if the query is repeated several times. Instead of f(x), the integrated outcome g(x) becomes the function that will be optimized. This has been previously studied in the context of Bayesian optimization using the unscented transformation [29, 4]. Input noise has also been addressed in [2, 30]; however, the former focuses on worst-case scenarios of the input noise [2], while the latter is designed to find unstable global optima despite input noise [30]. In this paper, we use a more flexible variant of the unscented transformation, called the scaled unscented transformation [17], to allow more control over the stability region and to avoid numerical issues. Note that, contrary to previous works that focus on finding stable/safe regions, we consider the more challenging scenario of input noise, where the objective is not only to find a broad maximum, but where the queries themselves are also perturbed.
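For evaluation purposes, the integrated outcome can be approximated by plain Monte Carlo, which is how the experiments section later scores candidate solutions. A minimal sketch, assuming isotropic Gaussian input noise:

```python
import numpy as np

def integrated_outcome(f, x, input_std, n_samples=1000, seed=0):
    """Monte Carlo approximation of g(x) = E_delta[f(x + delta)] with
    delta ~ N(0, input_std^2 I); scores the stability of a candidate x."""
    rng = np.random.default_rng(seed)
    x = np.atleast_1d(np.asarray(x, dtype=float))
    delta = rng.normal(0.0, input_std, size=(n_samples, x.size))
    return float(np.mean([f(x + d) for d in delta]))
```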

IV-A Scaled unscented Bayesian optimization

The unscented transformation (UT) is a method to propagate probability distributions through nonlinear transformations, trading off computational cost against accuracy. It is based on the principle that it is easier to approximate a probability distribution than to approximate an arbitrary nonlinear function [16]. The unscented transformation uses a set of deterministically selected samples from the original distribution (called sigma points) and transforms them through the nonlinear function f. Then, the transformed distribution is computed from the weighted combination of the transformed sigma points:

g(x) ≈ ∑_{i=0}^{2d} w^(i) f(x^(i)),   x^(0) = x,   x^(i) = x ± (√((d + λ)Σ_x))_i

where (√((d + λ)Σ_x))_i is the i-th row or column of the corresponding matrix square root and λ = α²(d + κ) − d. The weight for the initial point is w^(0) = λ/(d + λ) and w^(i) = 1/(2(d + λ)) for the rest. The parameters should satisfy 0 < α ≤ 1 and κ ≥ 0. As pointed out by van der Merwe [40], we recommend a small α and a κ close to zero. For the matrix square root function, we use the Cholesky decomposition for its numerical stability and robustness [40].
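A compact sketch of the scaled unscented transform estimate of the expected value; the function name is illustrative, and only the mean weights are computed (β enters only the covariance weights, which are not needed here):

```python
import numpy as np

def scaled_unscented_mean(f, mean, cov, alpha=0.3, kappa=0.0):
    """Scaled UT estimate of E[f(x)] for x ~ N(mean, cov) using 2d+1 sigma
    points, with the weights and lambda defined as in the text."""
    d = len(mean)
    lam = alpha ** 2 * (d + kappa) - d
    S = np.linalg.cholesky((d + lam) * cov)   # matrix square root via Cholesky
    pts = [mean] + [mean + S[:, i] for i in range(d)] \
                 + [mean - S[:, i] for i in range(d)]
    w = np.full(2 * d + 1, 1.0 / (2.0 * (d + lam)))
    w[0] = lam / (d + lam)                    # weight of the central sigma point
    return float(w @ np.array([f(p) for p in pts]))
```

Note that the UT is exact for linear and quadratic functions of a Gaussian input, which makes it easy to sanity-check.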

The unscented transformation is used twice in our algorithm. First, it is used to drive the search towards stable optima by computing the unscented expected improvement. However, there might be some cases where unstable optima are found by chance. Therefore, we further compute the unscented optimal incumbent, which selects the most stable optimum among those predicted by the surrogate model. We define the unscented optimal incumbent ρ_ut and its corresponding input vector x_ut as:

x_ut = arg max_{x_i ∈ x_{1:t}} g(x_i),   ρ_ut = g(x_ut)

In this case, the output of the optimization process is the final optimal incumbent x_ut.

IV-B The Unscented transform as integration

Although the unscented transformation was designed for the propagation of probability distributions through nonlinear functions, it can also be interpreted as a probabilistic integration. In fact, the unscented transformation with α = 1 and κ = 2 is equivalent to the three-point Gauss-Hermite quadrature rule [40]. While the Gauss-Hermite method computes the integral exactly under the Gaussian assumption, it has a cost of O(p^d), where p is the polynomial order of the function in the region. Meanwhile, the unscented transform has a quadratic cost for computing the expected value [40]. The low cost of the unscented transformation is also an advantage compared to other more advanced integration methods, such as Monte Carlo or Bayesian quadrature, which have a higher computational cost.

Note that, during optimization, the integrated outcome is always applied with respect to the surrogate model, to avoid increasing the number of queries to the target function. Therefore, the integral is only as accurate as the Gaussian process is with respect to the target function. We found that, in practice, it is more efficient to employ the computational resources to improve the surrogate model than to provide a better integrated outcome. Furthermore, the unscented transform also computes the full posterior uncertainty, which can be used for further Bayesian analysis and hierarchical decision models beyond the integrated outcome of (6).

V Robustness to Mismodeling Errors

In this section, we consider different sources of mismodeling errors and the methods we use to deal with them.

V-A Adaptive nonstationarity

Consider again the example from Figure 1. Given that the optimum can be either trajectory depending on the input noise level, our optimization algorithm must be able to model both if needed. Thus, it might also be required to find very narrow optima in nonstationary spaces, which is known to be problematic for GP-based BO. Furthermore, it has been previously shown that reward functions in robotics environments are typically nonstationary [27]. This presents a problem for standard Bayesian optimization, which assumes spatial stationarity of the target function. For that reason, we have incorporated the Spartan kernel [25], which combines a local and a global kernel to allow better modelling of complex functions.

This composite kernel is the combination of a kernel with global influence and several kernels with moving local influence. The influence is determined by a weighting function. The influence of the local kernels is centered at a single point with multiple diameters, creating a funnel structure:

k_S(x, x') = ∑_j ω̄_j(x) ω̄_j(x') k_j(x, x')

where the weighting function for the local kernels includes the parameter μ_l to move the center of the local kernels along the input space. In order to achieve smooth interpolation between regions, each region has an associated weighting function ω_j, having its maximum in the corresponding region and decreasing its value with the distance to that region [25]. Then, we can set ω̄_j(x) = ω_j(x) / ∑_i ω_i(x). The unnormalized weights are defined as:

ω_j(x) = exp( −‖x − μ_j‖² / (2 σ_j²) )

where μ_g and μ_l can be seen as the centers of the influence regions of each kernel, while σ_g and σ_l are related to the diameter of the area of influence. Note that all the local kernels share the same position (mean value) but have different sizes (variance), generating a funnel-like structure.

For the local kernels, we estimate the center of the funnel structure based on the data gathered. We propose to consider the center μ_l as part of the hyperparameters of the Spartan kernel, which also include the parameters of the local and global kernels, that is, Θ = [θ_g, θ_l, μ_l]. In the experiments, we have used a single local Matérn kernel with automatic relevance determination of the hyperparameters [33, 25]. The global kernel is also a Matérn kernel with the same prior on the hyperparameters.

V-B Stochastic meta-policies

POMDP          | Bayesian optimization
State          | Target function
Action         | Next query
Observation    | Response value
Belief         | Surrogate model
Q-function     | Acquisition function
Reward         | Improvement

TABLE I: Comparison of POMDP and BO terms

As a sequential decision making process, we can interpret the Bayesian optimization framework as a partially observable Markov decision process (POMDP) [39, 11]. In this interpretation, the state is the target function, the action is the next query point, the belief is the surrogate model and the action-value (Q-function) is the acquisition function for each possible query. Table I contains a comparison between the elements of both frameworks. Note that this POMDP model represents the learning process of the actual policy search. Then, equation (2) can be seen as a meta-policy, because it is used to learn the actual policy π_θ. We can see that the Bayesian optimization meta-policies found in the literature are spatially greedy; that is, they select the action or next query that maximizes the acquisition function or Q-function.

In the reinforcement learning literature, stochastic policies can be used to improve exploration and increase performance when the model is not accurate, as mismodeling errors might result in a lack of convergence of greedy policies. Mismodeling errors are also common in Bayesian optimization: by selecting a specific surrogate model (GP, random forest, Bayesian NN…), we introduce assumptions that the target function might not satisfy, as discussed in Section V-A. Furthermore, having input noise during the optimization is another source of misleading observations, as the observed query will deviate from the intended query.

Our approach consists of replacing the greedy policy of equation (2) with a stochastic policy such as the following Boltzmann policy (also known as Gibbs or softmax policy):

p(x_t) ∝ exp( EI(x_t) / T )

which defines a probability distribution for the next query or action [11]. Thus, the actual next query is selected by sampling that distribution, x_t ∼ p(x). This policy allows exploration even if the model is completely biased. This approach can be applied to any acquisition function or surrogate model found in the literature.
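In a discretized domain, sampling from such a Boltzmann meta-policy reduces to a softmax over acquisition values. A minimal sketch, with an illustrative `temperature` parameter:

```python
import numpy as np

def sample_query(candidates, acq_values, temperature=1.0, seed=None):
    """Boltzmann (softmax) meta-policy over a finite set of candidate queries:
    instead of the greedy argmax of the acquisition function, sample the next
    query with probability proportional to exp(EI(x) / temperature)."""
    rng = np.random.default_rng(seed)
    logits = np.asarray(acq_values, dtype=float) / temperature
    logits -= logits.max()        # shift for numerical stability
    p = np.exp(logits)
    p /= p.sum()
    return candidates[rng.choice(len(candidates), p=p)]
```

At low temperature the policy approaches the usual greedy selection; at high temperature it explores almost uniformly.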

V-C Distributed Bayesian optimization

Fig. 2: Illustration of a fully distributed BO architecture where each of the nodes performs BO. In the example, nodes A and B are already up and working with different random seeds, and we want to spin up a new node C during the optimization. A and B run independently and only need to broadcast their new queries and observations. C needs all previous queries and observations up to the current instant to start working in the same way as A and B. No state or model needs to be transferred.

A secondary advantage of the stochastic meta-policies is that they also trivially enable distributed optimization, where different policy parameters can be evaluated in parallel in a fully distributed fashion. This could be applied in multi-robot scenarios or for simulation-based reinforcement learning. Most parallel and batch BO methods require a central node to keep track of the computed and deployed queries in order to ensure that parallel queries are diverse. Without this central node, different nodes could evaluate the same query independently, due to the maximum in equation (2). Many parallel methods for Bayesian optimization have been proposed in the past few years, with heuristics to enforce diverse queries: some authors include artificially augmented data by hallucinated observations [12, 36], combine optimization with some degree of active learning in order to maximize the knowledge about the target function [10, 7, 34], or enforce spatial coverage [13].

Sampling from the stochastic policy already ensures diverse queries [11]. It does not require a centralized node, and all the computation can be done in each node in a fully distributed manner across multiple nodes, as shown in Figure 2. Furthermore, for optimal results, the nodes only need to broadcast their latest evaluated query and observation value (x, y), requiring minimal communication bandwidth. In addition, communication can be asynchronous and even robust to failures in the network, as the order of the queries and observations is irrelevant.
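One asynchronous step of a node in this scheme can be sketched as follows; all four callables are hypothetical stand-ins for the components described above:

```python
def distributed_step(shared_history, refit, sample_query, evaluate):
    """One asynchronous step of a single BO node in the fully distributed
    scheme: rebuild the local surrogate from the broadcast history, sample a
    query from the stochastic meta-policy, evaluate it, and broadcast only the
    new (x, y) pair."""
    model = refit(shared_history)   # every node can reconstruct the same model
    x = sample_query(model)         # stochastic, so concurrent nodes diverge
    y = evaluate(x)
    shared_history.append((x, y))   # minimal broadcast: latest query and value
    return x, y
```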

Fig. 3: Benchmark function optimization results. In general, UBO is able to find a more stable solution than vanilla BO, resulting in a better average value. Moreover, using stochastic meta-policies results in improved stability. Parallelized runs had a much lower wall time without a penalty in performance.

VI Results

In this section, we describe the experiments used to compare the performance of different Bayesian optimization methods in the presence of input noise. We compare a vanilla implementation of Bayesian optimization (BO), unscented Bayesian optimization (UBO) with a greedy policy, and UBO with the stochastic meta-policy (UBO-SP). We also include in the comparison a parallelized version of the stochastic policy applied to UBO with 4 nodes (UBO-SPx4), to study the performance impact of adding parallelization. Note that in the results we show the number of function evaluations, not iterations. For example, at the 20th evaluation, the UBO-SPx4 method had run for only 5 iterations, therefore requiring less wall time and using only the information of 16 points, instead of 19.

Fig. 4: Robot pushing problem and rover path planning optimization results. For the more complex problems, UBO is not able to find a stable solution, but the stochastic meta-policy is. Only for the 4D robot push is there a penalty for using the parallel version.

As discussed previously, all methods share the same configuration: expected improvement (EI) as the acquisition function and a Gaussian process as the surrogate model with the Spartan kernel and MCMC for the hyperparameters. The initial samples are taken from a Sobol sequence. We normalize the input of all the problems between 0 and 1, so the reported input noise already assumes that the input is normalized.

The performance of each method was evaluated in the following way: for every function evaluation, each method computes its best solution (the optimal incumbent or the unscented optimal incumbent x_ut) using the observations and according to its model at that point of the optimization. Then, we evaluate the integrated outcome at the best solution by approximating (6) using 1000 Monte Carlo samples of p(δ) over the actual function f. For the plots, we repeat each optimization 20 times and display the mean value with a 95% confidence interval. Common random numbers were used for all the methods.

VI-A Benchmark Optimization Functions

First, we have evaluated the methods on synthetic benchmark functions for optimization. We have used the functions from Nogueira et al. [29], the RKHS function [1] and a mixture of 2D Gaussian distributions (GM). These functions have unstable global optima for certain levels of input noise. This means that, in order to locate the safe optima, we need to model and take into account the input noise. We have also used a 4D Michalewicz function, a popular test problem for global optimization because of its sharp edges and large number of local optima. All benchmark functions use input noise and 40 evaluations. The number of initial samples is set based on the dimensionality of each problem: 5, 20 and 30 samples for RKHS, GM and Michalewicz, respectively.

Figure 3 shows the results on the benchmark functions. We can see how UBO is able to find better stable optima than vanilla BO, and how introducing a stochastic meta-policy into the unscented method (UBO-SP) further improves performance. It also shows that choosing to add parallelization barely impacts the optimization results. This means that we can achieve better performance and better wall time using the parallel stochastic meta-policy.

Fig. 5: Examples of optimized trajectories found by different methods (rows) and trials (columns), showing the possible deviations from the trajectories by simulating input noise. We display the cost of the desired trajectory (assuming no input noise) and the average cost of the possible deviations over each result.
Fig. 6: Overlapping rover trajectories evaluated during optimization, for different methods (rows) and trials (columns) with input noise. This allows us to visualize the effect of greedy and stochastic meta-policies in the optimization. We can see how greedy methods (Default-EI and UBO-EI) are more prone to over-sample similar trajectories, while stochastic methods (UBO-SP-EI and UBO-SPx4-EI) perform more exploration of trajectories.

VI-B Robot Pushing

Next, we have used the active learning for robot pushing setup and code from Wang et al. [42]. The task is to perform active policy search for pushing an object towards a designated goal location. In the 3D version, the policy parameters are the robot location and the pushing duration. An alternative 4D function is also used, adding the robot angle as a fourth parameter. These functions have also been used previously to study robust Bayesian optimization [2]. In both functions, we use 10 initial queries and a budget of 40 function evaluations during optimization. The 3D and 4D versions use different input noise levels; we reduced the input noise in the 4D function because the robot angle parameter is very sensitive to input noise, as a small change in the direction of the robot might result in completely missing the goal.

Figure 4 shows the results of the robot pushing problem. In both functions, the results are consistent with the experiments on benchmark functions. In general, applying UBO improves the performance over BO and, by introducing stochastic meta-policies, UBO-SP and UBO-SPx4 further enhance exploration and improve robustness and, thus, the overall performance.

VI-C Robot Path Planning

In this section, we cover the problem of safe path planning. The objective is to find a stable and efficient trajectory of a rover through rugged terrain with multiple obstacles. It is based on the rover trajectory optimization from Wang et al. [41]. In this case, there are 4 policy parameters. We designed a new environment in which a rover has to perform path planning while avoiding obstacles, which might be dangerous for the rover to collide with, and changes in elevation, which might be dangerous as the rover can tip over. In the figures, obstacles are red rectangles and slopes are orange regions. Uncertainty in following the desired trajectory is represented as input noise in the trajectory parameters, meaning that we are interested in finding stable trajectories that avoid the danger that might arise from possible deviations. This is a common problem in robot navigation, as localization errors might result in the robot not following the desired trajectory accurately [24, 23].

We study this problem using two different input noise levels. In both cases, we use 30 initial samples and 40 function evaluations during optimization. Figure 4 shows the resulting optimization performance of each of the methods, and Figure 5 shows some trajectories obtained using different methods. In this problem, applying UBO is not enough, as the results show that it does not improve over BO. In order to understand why, we study the function evaluations performed by each method, shown in Figure 6. We can see how greedy methods cannot recover from a biased model, as shown by the high density of similar trajectories. Contrary to that, the stochastic approach does not suffer from this and keeps exploring different trajectories. This shows how the UBO-SP and UBO-SPx4 methods are more robust to mismodeling errors, and how the improved exploration helps in finding better safe trajectories.

VII Conclusions

In this paper, we propose the first active policy search algorithm that offers robustness to input noise and mismodeling errors, using unscented Bayesian optimization with stochastic meta-policies. First, we have presented a new formulation and interpretation of the UBO algorithm. Second, we have combined the UBO algorithm with the Spartan kernel, to deal with nonstationary functions, and a stochastic meta-policy, for mismodeling robustness, and evaluated it on several benchmark functions and robotic applications that showcase the influence of input noise, such as safe robot navigation. The results confirm that the synergies of both methods (UBO and the stochastic meta-policy) result in improved robustness over either method separately. This further highlights previous results indicating that the ubiquitous greedy strategy in the Bayesian optimization literature can be suboptimal in many applications. We also take advantage of the embarrassingly parallel nature of the stochastic meta-policies, which could be used in multi-robot setups or simulation environments.


  • [1] J. Assel, Z. Wang, B. Shahriari, and N. Freitas (2015) Heteroscedastic treed Bayesian optimization. Note: arXiv:1410.7172v2 Cited by: §VI-A.
  • [2] I. Bogunovic, J. Scarlett, S. Jegelka, and V. Cevher (2018) Adversarially robust optimization with gaussian processes. In Advances in Neural Information Processing Systems, pp. 5760–5770. Cited by: §IV, §VI-B.
  • [3] A. D. Bull (2011) Convergence rates of efficient global optimization algorithms. Journal of Machine Learning Research 12, pp. 2879–2904. Cited by: §III.
  • [4] J. Castanheira, P. Vicente, R. Martinez-Cantin, L. Jamone, and A. Bernardino (2018) Finding safe 3d robot grasps through efficient haptic exploration with unscented bayesian optimization and collision penalty. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1643–1648. Cited by: §IV.
  • [5] R. Calandra, A. Seyfarth, J. Peters, and M. Deisenroth (2015) Bayesian optimization for learning gaits under uncertainty. Annals of Mathematics and Artificial Intelligence (AMAI) 1 (1), pp. 1–19. Cited by: §I.
  • [6] C. Daniel, O. Kroemer, M. Viering, J. Metz, and J. Peters (2015) Active reward learning with a novel acquisition function. Autonomous Robots 39 (3), pp. 389–405. Cited by: §I.
  • [7] E. Contal, D. Buffoni, A. Robicquet, and N. Vayatis (2013) Parallel Gaussian process optimization with upper confidence bound and pure exploration. In ECML/PKDD, pp. 225–240. Cited by: §V-C.
  • [8] A. Cully, J. Clune, D. Tarapore, and J. B. Mouret (2015) Robots that can adapt like animals. Nature 521, pp. 503–507. Cited by: §I, §I.
  • [9] M. Deisenroth, G. Neumann, and J. Peters (2013) A survey on policy search for robotics. Foundations and Trends in Robotics 2 (1-2), pp. 1–142. Cited by: §I.
  • [10] T. Desautels, A. Krause, and J.W. Burdick (2014) Parallelizing exploration-exploitation tradeoffs in Gaussian process bandit optimization. JMLR 15 (1), pp. 3873–3923. Cited by: §V-C.
  • [11] J. Garcia-Barcos and R. Martinez-Cantin (2019) Fully distributed Bayesian optimization with stochastic policies. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pp. 2357–2363. Cited by: §I, §V-B, §V-B, §V-C.
  • [12] D. Ginsbourger, R. Le Riche, and L. Carraro (2010) Kriging is well-suited to parallelize optimization. In Computational intelligence in expensive optimization problems, pp. 131–162. Cited by: §V-C.
  • [13] J. González, Z. Dai, P. Hennig, and N. Lawrence (2016) Batch Bayesian optimization via local penalization. In AISTATS, pp. 648–657. Cited by: §V-C.
  • [14] D. Huang, T. T. Allen, W. I. Notz, and N. Zheng (2006) Global optimization of stochastic black-box systems via sequential kriging meta-models. Journal of Global Optimization 34 (3), pp. 441–466. Cited by: §II.
  • [15] D. R. Jones, M. Schonlau, and W. J. Welch (1998) Efficient global optimization of expensive black-box functions. Journal of Global Optimization 13 (4), pp. 455–492. Cited by: §I, §III.
  • [16] S. Julier and J. Uhlmann (2004) Unscented filtering and nonlinear estimation. Proceedings of the IEEE 92 (3), pp. 401–422. Cited by: §IV-A.
  • [17] S. Julier and J.K. Uhlmann (2002) The scaled unscented transformation. In IEEE American Control Conf., Anchorage AK, USA, pp. 4555–4559. Cited by: §IV.
  • [18] N. Kohl and P. Stone (2004) Policy gradient reinforcement learning for fast quadrupedal locomotion. In Proc. of the IEEE Int. Conf. on Robotics & Automation, Cited by: §I.
  • [19] S. R. Kuindersma, R. A. Grupen, and A. G. Barto (2013) Variable risk control via stochastic optimization. The International Journal of Robotics Research 32 (7), pp. 806–825. Cited by: §I.
  • [20] S. Levine and P. Abbeel (2014) Learning neural network policies with guided policy search under unknown dynamics. In Advances in Neural Information Processing Systems, pp. 1071–1079. Cited by: §I.
  • [21] D. Lizotte, T. Wang, M. Bowling, and D. Schuurmans (2007) Automatic gait optimization with Gaussian process regression. In IJCAI, pp. 944–949. Cited by: §I.
  • [22] R. Marchant and F. Ramos (2014) Bayesian optimisation for informative continuous path planning. In Robotics and Automation (ICRA), 2014 IEEE International Conference on, pp. 6136–6143. Cited by: §I.
  • [23] R. Martinez-Cantin, N. de Freitas, E. Brochu, J. Castellanos, and A. Doucet (2009) A Bayesian exploration-exploitation approach for optimal online sensing and planning with a visually guided mobile robot. Autonomous Robots 27 (3), pp. 93–103. Cited by: §I, §VI-C.
  • [24] R. Martinez-Cantin, N. de Freitas, A. Doucet, and J. A. Castellanos (2007) Active policy learning for robot planning and exploration under uncertainty. In Robotics: Science and Systems, Cited by: §I, §II, §VI-C.
  • [25] R. Martinez-Cantin (2019) Funneled Bayesian optimization for design, tuning and control of autonomous systems. IEEE Trans Cybern 49 (4), pp. 1489–1500. Cited by: §I, §III, §V-A, §V-A, §V-A.
  • [26] R. Martinez-Cantin, K. Tee, and M. McCourt (2018) Practical bayesian optimization in the presence of outliers. In International Conference on Artificial Intelligence and Statistics, pp. 1722–1731. Cited by: §I.
  • [27] R. Martinez-Cantin (2017) Bayesian optimization with adaptive kernels for robot control. In Proc. of the IEEE International Conference on Robotics and Automation, pp. 3350–3356. Cited by: §I, §V-A.
  • [28] J. Mockus, V. Tiesis, and A. Zilinskas (1978) The application of Bayesian methods for seeking the extremum. In Towards Global Optimisation 2, pp. 117–129. Cited by: §III.
  • [29] J. Nogueira, R. Martinez-Cantin, A. Bernardino, and L. Jamone (2016) Unscented Bayesian optimization for safe robot grasping. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1967–1972. Cited by: §I, §I, §IV, §VI-A.
  • [30] R. Oliveira, L. Ott, and F. Ramos (2019) Bayesian optimisation under uncertain inputs. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1177–1184. Cited by: §IV.
  • [31] J. Peters and S. Schaal (2006) Policy gradient methods for robotics. In Proc. of the IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, Cited by: §I.
  • [32] A. Rai, R. Antonova, F. Meier, and C. G. Atkeson (2019) Using simulation to improve sample-efficiency of Bayesian optimization for bipedal robots. Journal of Machine Learning Research 20 (49), pp. 1–24. Cited by: §I.
  • [33] C. Rasmussen and C. Williams (2006) Gaussian processes for machine learning. The MIT Press. External Links: ISBN 026218253X Cited by: §III, §V-A.
  • [34] A. Shah and Z. Ghahramani (2015) Parallel predictive entropy search for batch global optimization of expensive objective functions. In NIPS, Cited by: §V-C.
  • [35] B. Shahriari, K. Swersky, Z. Wang, R.P. Adams, and N. de Freitas (2016) Taking the human out of the loop: a review of Bayesian optimization. Proceedings of the IEEE 104 (1), pp. 148–175. Cited by: §I.
  • [36] J. Snoek, H. Larochelle, and R. Adams (2012) Practical Bayesian optimization of machine learning algorithms. In NIPS, pp. 2960–2968. Cited by: §III, §V-C.
  • [37] Y. Sui, A. Gotovos, J. Burdick, and A. Krause (2015) Safe exploration for optimization with Gaussian processes. In International Conference on Machine Learning, pp. 997–1005. Cited by: §I.
  • [38] M. Tesch, J. Schneider, and H. Choset (2011) Adapting control policies for expensive systems to changing environments. In IEEE/RSJ International Conference on Intelligent Robots and Systems, Cited by: §I.
  • [39] M. Toussaint (2014) The Bayesian search game. In Theory and Principled Methods for Designing Metaheuristics, Cited by: §V-B.
  • [40] R. van der Merwe (2004) Sigma-point Kalman filters for probabilistic inference in dynamic state-space models. Ph.D. Thesis, OGI School of Science & Engineering, Oregon Health & Science University. Cited by: §IV-A, §IV-B.
  • [41] Z. Wang, C. Gehring, P. Kohli, and S. Jegelka (2018) Batched large-scale Bayesian optimization in high-dimensional spaces. In International Conference on Artificial Intelligence and Statistics (AISTATS), External Links: Link Cited by: §VI-C.
  • [42] Z. Wang and S. Jegelka (2017) Max-value entropy search for efficient Bayesian optimization. In ICML, Vol. 70, pp. 3627–3635. Cited by: §VI-B.
  • [43] R. J. Williams (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8 (3-4), pp. 229–256. Cited by: §I, §I, §II.