BayesOpt: A Library for Bayesian optimization with Robotics Applications

09/03/2013 ∙ by Ruben Martinez-Cantin, et al.

The purpose of this paper is twofold. On the one hand, we present a general framework for Bayesian optimization and compare it with some related fields in active learning and Bayesian numerical analysis. On the other hand, Bayesian optimization and related problems (bandits, sequential experimental design) are highly dependent on the surrogate model that is selected; however, there is no clear standard in the literature. Thus, we present a fast and flexible toolbox that allows testing and combining different models and criteria with little effort. It includes most of the state-of-the-art contributions, algorithms and models. Its speed also removes part of the stigma that Bayesian optimization methods are only good for "expensive functions". The software is free and it can be used on many operating systems and from several computer languages.


I Introduction

Many problems in engineering, computer science, robotics, etc., require finding the extremum of a real-valued function. In many cases, those functions do not have a closed-form expression, might be multimodal with local extrema whose outcome is much worse than the global extremum, or might be costly to evaluate.

Global optimization is a special case of non-convex optimization where we want to find the global extremum of a real-valued function f, that is, the target function. The search is done through pointwise evaluations of the target function.

The objective of a global optimization algorithm is to find the sequence of points

(1)   x_1, x_2, …, x_n

which converges to the point x*, that is, the extremum of the target function, when n is large. The algorithm should be able to find that sequence at least for all functions from a given family.

As explained in [26], this search procedure is a sequential decision making problem where the point x_n at step n is based on a decision d_n which considers all previous data:

(2)   x_n = d_n(x_{1:n-1}, y_{1:n-1})

where y_i = f(x_i) + ε_i. For simplicity, many works assume ε_i = 0, that is, function evaluations are deterministic. However, we can easily extend the description to include stochastic functions (e.g., homoscedastic noise ε_i ∼ N(0, σ_n²)).

The search method is the sequence of decisions d_1, …, d_n, which leads to the final decision x_n. In most applications, the objective is to optimize the response of the final decision. Then, the criterion relies on the optimality error or optimality gap, which can be expressed as:

(3)   δ_n = f(x_n) − f(x*)

In other applications, the objective may be to converge to x* in the input space. Then, we can use, for example, the Euclidean distance error:

(4)   δ_n = ||x_n − x*||

Equations (3) and (4) can also be interpreted as variants of the loss function δ_n(f, d) for the decision d at each step. Thus, the optimal decision is defined as the one that minimizes the loss function:

(5)   d*_n = arg min_d δ_n(f, d)

This requires full knowledge of the function f, which is unavailable. Instead, let us assume that the target function belongs to a family of functions F, e.g., the continuous functions in ℝ^d. Let us also assume that the function can be represented as a sample from a probability distribution over functions, f ∼ P(f). Then, the best response case analysis for the search process is defined as the decision that optimizes the expectation of the loss function:

(6)   d*_n = arg min_d ∫ δ_n(f, d) dP(f)

where P(f) is a prior distribution over functions.

However, we can improve equation (6) considering that, at decision n, we have already observed the actual response of the function at n−1 points, y_{1:n-1} = {y_1, …, y_{n-1}}. Thus, the prior information about the function can be updated with the observations and Bayes' rule:

(7)   P(f | x_{1:n-1}, y_{1:n-1}) = P(y_{1:n-1} | f, x_{1:n-1}) P(f) / P(y_{1:n-1} | x_{1:n-1})

In fact, we can rewrite the equation to represent the updates sequentially:

(8)   P(f | x_{1:n-1}, y_{1:n-1}) ∝ P(y_{n-1} | f, x_{n-1}) P(f | x_{1:n-2}, y_{1:n-2})

Thus, equation (6) can be rewritten as:

(9)   d*_n = arg min_d ∫ δ_n(f, d) dP(f | x_{1:n-1}, y_{1:n-1})

Equation (9) is the root of Bayesian optimization, where the Bayesian part comes from the fact that we are computing the expectation with respect to the posterior distribution, also called belief, over functions. Therefore, Bayesian optimization is a memory-based optimization algorithm.

As commented before, although most of the theory of Bayesian optimization is related to deterministic functions, we also consider stochastic functions, that is, we assume there might be a random error in the function output. In fact, evaluations can produce different outputs if repeated. In that case, the target function is the expected output. Furthermore, a recent paper by [8] has shown that, even for deterministic functions, it is better to assume a certain error in the observations. The main reason is that, in practice, there might be some mismodeling errors which can lead to instability of the recursion if neglected.

II Related fields

Equation (9) is, in fact, much more general than Bayesian optimization. There are many fields of research that combine optimal decision making over the expectation of an unknown function which is recursively learned.

II-A Active and interactive learning

Active learning considers the situation where data points can be selected for labeling during the training process. The rationale is that data points might be expensive to label, or that bad data points might introduce an important bias. The analogy with Bayesian optimization is clear.

In fact, one can rethink Bayesian optimization as an active learning problem where the unknown parameter is the extremum of the function, i.e., x*. In this setup, equation (3) corresponds to the prediction bias and equation (4) corresponds to the parameter variance.

In fact, the first applications of Bayesian optimization in fields like robotics or computer graphics design are all related to problems of active learning like [24] or interactive learning [3]. See Section VI for a detailed description.

II-B Experimental design

The field of Bayesian optimization shares important pieces of knowledge with the sequential experimental design field, especially related to the Design and Analysis of Computer Experiments (DACE), pioneered by [31] and beautifully reviewed in [32]. The main difference resides in the optimum point x*, which in the case of DACE is no longer the extremum of the function, but the point that provides the most information to reconstruct the target function f. The reader should also note the analogy with active learning, where the target function corresponds to the regression function or classifier that the system is learning, while sequentially querying the most informative points.

For sequential experimental design, equation (3) is replaced by a different loss function. For example, if we consider that the function can be represented in a –potentially infinite– parametric form f(x; w), then we can define the loss function in terms of the parameters w, for instance, through the classical alphabetical optimality criteria based on the posterior distribution of the parameters:

(10)   δ_n = E[ ||w − ŵ_n||² | x_{1:n}, y_{1:n} ]
(11)   δ_n^A = tr( Cov[ w | x_{1:n}, y_{1:n} ] )
(12)   δ_n^D = det( Cov[ w | x_{1:n}, y_{1:n} ] )

where ŵ_n is the posterior estimate of the parameters.

There are related criteria (E-optimality, C-optimality, …) which can be easily mapped to a loss function. For a complete description of the different criteria, see [1]. Thus, the only difference between code for experimental design and code for Bayesian optimization is the criterion used.

II-C Reinforcement learning

The field of reinforcement learning [17] shares the same structure of sequential decision making under uncertainty as Bayesian optimization. The following subsections describe two particularly relevant connections.

II-C1 Multi-armed bandits

The simplest known reinforcement learning problem is the multi-armed bandit problem [36]. In the bandit problem, the response function is a reward or cost function –for example, the cost of manufacturing a product in a certain way–. The algorithm must find the lowest production cost, considering that changing the process may itself result in a more expensive manufacturing cost.

The main difference with the optimization problem is that intermediate evaluations of the function may incur extra cost. Thus, the definition of the performance function is no longer the optimality error. Instead, equation (3) is called the instantaneous regret r_n in the reinforcement learning literature. The target of the bandit algorithm is to minimize the accumulated regret R_N = Σ_{n=1}^N r_n or the average regret R_N / N. Thus, (3) can be replaced by:

(13)   r̄_N = (1/N) Σ_{n=1}^N [ f(x_n) − f(x*) ]

In this setup, an algorithm with no regret, that is, lim_{N→∞} R_N / N = 0, is guaranteed to converge to the optimum. In fact, as cleverly found by [36], the average regret can be used to provide convergence rates of the optimization algorithm.

Typically, in the multi-armed bandit setup, each input has an associated probability distribution over the reward or cost values. Thus, several evaluations of the same input result in different outcomes. The analogous case is the optimization of a stochastic function.

Finally, Bull [5] suggests ε-greedy, a classical bandit algorithm, to improve the convergence of Bayesian optimization.

II-C2 Partially-observable Markov Decision Processes (POMDP)

As pointed out in [39], Bayesian optimization, as a sequential decision making problem, has a direct connection to a POMDP. In fact, bandits can be modeled as a single-state reinforcement learning problem, for example, a POMDP with immediate reward.

An independent connection between Bayesian optimization and POMDPs arises through a specific methodology to solve POMDPs based on policy search. In policy search, it is assumed that the policy can be expressed as a parametric function. In an abuse of notation, let us name the policy parameters x. Thus, the parametrization inherently encodes the dynamics of the world (priors, transitions, etc.). Then, the reinforcement learning problem becomes a static optimization problem:

(14)   x* = arg max_x E_{P(R)}[ R(x) ]

which is equivalent to (6), replacing P(f) by P(R) and the loss function by the reward function (with the corresponding change from minimization to maximization).

II-D Bayesian numerical analysis

The seminal works by [6] and [27] introduced the field of Bayesian numerical analysis. The main idea is to solve a complex analysis problem –for example: interpolation, regression, integral evaluation, etc.– by following a simple methodology:

  1. Put a prior on the family of functions (e.g.: continuous) on the working domain.

  2. Compute the response of the function at a set of sample points x_{1:n}.

  3. Compute the posterior.

  4. Solve the original problem by the Bayes rule.

[6] presents a set of beautiful examples where the previous algorithm results in powerful machinery. [27] goes one step further and formalizes the previous methodology in a single, although very general, model, showing its applicability to several analysis problems –including optimization–. In fact, all the previous fields, including Bayesian optimization itself, can be seen as particular cases of Bayesian numerical analysis.

The models presented in this paper are partially based on the formulation of [27].

III Bayesian optimization general model

In order to simplify the description, we use a special case of the Bayesian optimization model defined previously, which corresponds to the most common application. In subsequent sections we will introduce some generalizations for different applications.

Without loss of generality, consider the problem of finding the minimum of an unknown real-valued function f : X → ℝ, where X ⊂ ℝ^d is a compact space, d ≥ 1. Let P(f) be a prior distribution over functions represented as a stochastic process, for example, a Gaussian process, with inputs x ∈ X and an associated kernel or covariance function k(·, ·). Let us also assume that the target function is a sample from that stochastic process.

In order to find the minimum, the algorithm has a maximum budget of N evaluations of the target function f. The purpose of the algorithm is to find optimal decisions that provide the best performance at the end, according to equation (9).
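
For illustration only, the generic loop implied by equations (2) and (9) can be sketched as follows; the nextQuery callable –which bundles updating the posterior and maximizing a decision criterion– and the Observation type are hypothetical helpers, not part of the BayesOpt API:

#include <functional>
#include <limits>
#include <vector>

// Generic Bayesian optimization loop (illustrative sketch, not library code).
struct Observation { std::vector<double> x; double y; };

std::vector<double> bayesOptLoop(
    int budgetN,
    const std::function<double(const std::vector<double>&)>& target,
    const std::function<std::vector<double>(const std::vector<Observation>&)>& nextQuery)
{
  std::vector<Observation> data;                 // observed points x_{1:n}, y_{1:n}
  std::vector<double> best;
  double bestValue = std::numeric_limits<double>::infinity();
  for (int n = 0; n < budgetN; ++n)
  {
    std::vector<double> xNext = nextQuery(data); // update posterior, optimize criterion
    double yNext = target(xNext);                // expensive function evaluation
    data.push_back({xNext, yNext});
    if (yNext < bestValue) { bestValue = yNext; best = xNext; }
  }
  return best;                                   // incumbent minimizer after N queries
}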

One advantage of using Gaussian processes as prior distributions over functions is that new observations of the target function can easily be used to update the distribution over functions. Furthermore, the posterior distribution is also a Gaussian process. Therefore, the posterior can be used as an informative prior for the next iteration in a recursive algorithm.
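
For completeness, and using standard Gaussian process regression results [30] with our own notation, the posterior of a zero-mean Gaussian process at a query point x has mean and variance

μ_n(x) = k(x)^T (K + σ_n² I)^{-1} y_{1:n}
s_n²(x) = k(x, x) − k(x)^T (K + σ_n² I)^{-1} k(x)

where k(x) = [k(x, x_1), …, k(x, x_n)]^T and K is the Gram matrix with entries K_ij = k(x_i, x_j); a new observation simply adds one entry to k(x) and one row and column to K.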

In a more general setting, many authors have suggested to modify the standard zero-mean Gaussian process for different variations that include semi-parametric models [12, 10, 16, 27], use of hyperpriors on the parameters [25, 4, 11], Student t processes [9, 31, 43], etc.

We use a generalized linear model of the form:

(15)   f(x) = φ(x)^T w + ε(x)

where

(16)   ε(x) ∼ NP( 0, σ_s² ( K(θ) + σ_n² I ) )

The term NP(·, ·) means a non-parametric process, which can refer to a Gaussian process GP(·, ·) or a Student's t process TP(·, ·). In both cases, σ_n² is the observation noise variance, sometimes called the nugget, and it is problem specific. Many authors decide to fix this value when the function is deterministic, for example, a computer simulation. However, as cleverly pointed out in [8], there might be more reasons to include this term apart from the observation noise, for example, to account for model inaccuracies.

This model has been presented in different ways depending on the field where it was used:

  • As a generalized linear model φ(x)^T w with heteroscedastic perturbation ε(x).

  • As a nonparametric process of the form NP( φ(x)^T w, σ_s² ( K(θ) + σ_n² I ) ).

  • As a semiparametric model, where the parametric component is the mean function φ(x)^T w and the nonparametric component is the process ε(x).

IV Computing the hyperparameters

Based on the model that we have defined previously, we have to consider four sets of hyperparameters: the kernel hyperparameters θ, the mean function parameters w, the signal variance σ_s² and the noise variance σ_n².

A fully Bayesian approach does not admit a closed-form solution for most kernel functions. In the Bayesian optimization literature, the standard and most effective approach is to rely on an empirical Bayes approach, where some parameters are approximated by a point value –e.g., the kernel hyperparameters θ–, others are directly fixed a priori –e.g., the noise variance σ_n²–, while some other parameters are actually computed in closed form –e.g., w and σ_s²–.

On the other hand, we can assign conjugate hyperpriors to some parameters of the model, that is, w and σ_s², and still get a closed-form posterior. For example, let us assume that the prior can be decomposed such that P(w, σ_s²) = P(w | σ_s²) P(σ_s²), for example, as a normal inverse-gamma, a normal scaled-inverse-chi-squared, or a Jeffreys prior.

In fact, the first and second options are different parametrizations of the same model, and the Jeffreys prior can be considered as the limit case. The Jeffreys prior [14] is an uninformative prior which is invariant to reparametrization –for example, we can exchange the signal variance for the signal precision and the prior is the same–. However, it can perform poorly for multidimensional parameters, for example, if there are many mean basis functions φ(x).
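
As an illustration of such a decomposition –using a standard parametrization of our own choosing, not necessarily the exact one implemented in the library– the normal inverse-gamma option reads

P(w, σ_s²) = P(w | σ_s²) P(σ_s²) = N(w | w_0, σ_s² V_0) · IG(σ_s² | a_0, b_0)

Because this prior is conjugate to the Gaussian likelihood of the linear part of the model, the posterior after observing y_{1:n} stays in the same family with updated parameters (w_n, V_n, a_n, b_n), which is what allows computing w and σ_s² in closed form. The Jeffreys prior corresponds to the improper limit P(w, σ_s²) ∝ 1/σ_s².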

IV-A Learning the kernel parameters

For Bayesian optimization, this cannot be trivially implemented. As [40] proved, updating the value of θ during the optimization process introduces a bias which might result in the optimization getting stuck in a local minimum. To avoid this problem, we learn the parameters during a preliminary stage, with a small number of data points, and then freeze them or update them infrequently.

We consider that the hyperprior on the kernel hyperparameters θ –if available– is independent of the other variables. Depending on the model, the likelihood function will be a multivariate Gaussian distribution or a multivariate t distribution. Based on [32, 30], we are going to consider the following algorithms to learn the kernel hyperparameters:

  • Cross-validation: In this case, we try to maximize the average predictive log probability using the leave-one-out (LOO) strategy [30].

  • Maximum likelihood: For any of the models presented, one approach to learn the hyperparameters is to maximize the likelihood of all the parameters θ, w and σ_s². Then, the likelihood function is a multivariate Gaussian distribution. As presented in [32], we can obtain a better estimate if we adjust the number of degrees of freedom.

  • Posterior maximum likelihood: In this case, the likelihood function is modified to consider the posterior estimate of w and σ_s² based on the different cases defined in Section IV. The resulting function will be a multivariate Gaussian or t distribution, depending on the kind of prior used for σ_s².

  • Maximum a posteriori: We can modify any of the previous algorithms by adding a prior distribution p(θ). As commented in [5, 40], applying maximum likelihood naively while doing optimization might be problematic. [42] suggests adding bounds to restrict the exploration of an overfitted or underfitted width parameter. However, this solution is restricted to using a box-bounded optimizer to learn θ and does not generalize to include extra information –if available– about the parameters. Adding a prior distribution is a more general and elegant solution.

  • Sampling strategies: The applicability of these techniques might seem to be against the philosophy of Bayesian optimization, where we aim to reduce the number of data points. Nevertheless, some efforts have been made in this area, such as those presented in [35] or [18].

V BayesOpt Toolbox

We present a highly efficient Bayesian optimization toolbox. It is implemented in C++ and is compatible with Windows, Mac OS and Linux. It also provides interfaces to C, Python and Matlab/Octave.

The library was influenced by the GPML toolbox by [29] and NLOPT by [15]. The toolbox is highly configurable. The user can select among different kernels, mean functions, models and optimization criteria.

V-A Advantages of the library design

First, it is fast. The execution time of the library is shown in Figure 1, running on a standard laptop with a single process. The optimization problem was trivial, so the computational cost plotted can be considered purely that of the optimization machinery.

Fig. 1: Execution time (in seconds) of the BayesOpt library compared to an equivalent naive Matlab implementation based on GPML. Clearly, the use of C/C++ for the core library, combined with some code optimizations especially intended for iterative optimization, reduces the total computational cost. The target function was trivial, thus the plot can be considered purely the cost of the optimization code.

One of the most critical components is the computation of the inverse of the kernel matrix. We have compared different numerical tricks, like the incremental computation of the inverse matrix using blockwise inversion, and found that the Cholesky decomposition method outperforms any other method in terms of performance and numerical stability. Besides, it guarantees the numerical stability of the Gram matrix.

Furthermore, we can exploit two properties of the method:

  1. Points arrive one at a time. Thus, we can do incremental computations of the matrix and vectors. For example, at each iteration, we know that only n new elements will appear in the correlation matrix –the correlations of the new point with each of the existing points–. The rest of the matrix remains invariant. Thus, instead of computing the whole Cholesky decomposition, which is O(n³), we just add the new row of elements to the triangular matrix, which is O(n²) (see the sketch after this list).

  2. Multiple queries from the same model. Computing the optimal decision requires evaluating many queries. However, all those queries rely on the same previous data. Thus, we can factor out –precompute– all the operations that are independent of the query.
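
A minimal sketch of this incremental update –written in plain C++ with std::vector, independent of the library's internal uBlas-based implementation– is shown below. Given the lower-triangular factor L of the current Gram matrix, the vector k of correlations between the new point and the existing points, and its self-correlation (plus nugget) knn, the new row is obtained with a single forward substitution:

#include <cmath>
#include <cstddef>
#include <vector>

// Append one row to the Cholesky factor L (lower triangular, stored as rows)
// when a new point arrives: O(n^2) instead of refactorizing in O(n^3).
void addPointToCholesky(std::vector<std::vector<double> >& L,
                        const std::vector<double>& k,  // correlations with old points
                        double knn)                    // self-correlation (plus nugget)
{
  const std::size_t n = L.size();
  std::vector<double> row(n + 1, 0.0);
  for (std::size_t i = 0; i < n; ++i)      // forward substitution: solve L z = k
  {
    double sum = k[i];
    for (std::size_t j = 0; j < i; ++j) sum -= L[i][j] * row[j];
    row[i] = sum / L[i][i];
  }
  double diag = knn;
  for (std::size_t i = 0; i < n; ++i) diag -= row[i] * row[i];
  row[n] = std::sqrt(diag);                // new diagonal element
  L.push_back(row);
}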

Second, it is flexible. The design has been carefully laid out, relying on inheritance and polymorphism for all the components of the optimization process. It is very easy to create a new kernel, surrogate or criterion model by inheriting from the abstract model or from one of the existing models. Thus, the newly created model will be fully integrated in the library. It will also inherit most of the existing linear algebra optimizations and safety checks. This is especially important in some parts of the code that have been carefully designed to optimize performance or to guarantee a correct implementation by design. For example, the initial set of points is selected using Latin hypercube sampling, and the kernel hyperparameters are not updated after every new data point, since that is known to introduce bias [5, 40].

The library includes different kernel functions (Matern, Gaussian, isotropic, automatic relevance determination, etc.), different mean functions (constant, linear, radial, etc.), different surrogate models depending on the parameter hyperpriors (Gaussian process, Student's t process, etc.), and different criteria (expected improvement (EI), lower confidence bound (LCB), probability of improvement (POI), etc.). See Figure 2 for the inheritance tree of the implemented criteria.

Fig. 2: Inheritance diagram for the criteria class

One of the advantages of having such a flexible design is that it is easy to combine different options. For example, we have implemented what we call metacriteria algorithms, like GP-Hedge [11], which can be used to find the most relevant criterion online. Another metacriterion is the linear combination of multiple criteria, which can be used to implement an optimization criterion with movement penalties [22].

The library can be downloaded from https://bitbucket.org/rmcantin/bayesopt/downloads and the online documentation can be found at http://rmcantin.bitbucket.org/html/.

The library internally uses NLOPT for the inner optimization loops (optimizing criteria, learning kernel hyperparameters, etc.) [15]. The interface is very similar, thus NLOPT can also be used for comparison and benchmarking.

V-B Compatibility

The toolbox has been designed to be highly compatible with many platforms and setups. It has already been tested and compiled on different operating systems (Windows, Debian/Ubuntu, Mac OS, …), with different compilers (Visual Studio, gcc, clang, mingw, …). The core of the library is written in C++98 for compatibility with older compilers. It also provides interfaces and demos for C, Python and Matlab/Octave.

V-C Demos

The code includes many demos for different languages, models (continuous, discrete) and test functions (Ackley, Michalewicz, Rosenbrock, etc.). It even includes some specific demos such as multiprocessing computation or a simple computer vision application (image binarization).

V-D Using BayesOpt API

The API is compatible with several languages and programming paradigms. In general, the usage of the toolbox can be summarized in three steps:

  1. Define the function to optimize.

  2. Set or modify the parameters and models of the optimization process.

  3. Run the optimizer.

Here we show a brief summary of the different ways to use the library:

V-D1 C/C++ callback usage

This interface is the most standard approach; it could also be used as an interface or wrapper for other languages such as Fortran, Ada, etc.

The function to optimize must agree with the following template:

double my_function (unsigned int n, const double *x,
                    double *gradient, void *func_data);

Then we call the optimizer for continuous or discrete optimization, passing the target function as a pointer. (The gradient argument has been included for future compatibility; in the current implementation, it is not used.)
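
As a rough sketch –assuming a continuous optimization entry point named bayes_optimization that takes the dimension, the callback, box bounds, output buffers for the best point and value, and the bopt_params struct; check the included documentation for the exact header and signature– a complete program could look like this:

#include <stdio.h>
#include "bayesopt.h"              /* C interface header (name assumed) */

/* Example target matching the callback template above: a quadratic bowl. */
double my_function (unsigned int n, const double *x,
                    double *gradient, void *func_data)
{
  double value = 0.0;
  unsigned int i;
  for (i = 0; i < n; ++i)
    value += (x[i] - 0.3) * (x[i] - 0.3);
  return value;                    /* gradient is currently unused */
}

int main(void)
{
  const int dim = 2;
  double lb[2] = {0.0, 0.0};       /* lower bounds of the box */
  double ub[2] = {1.0, 1.0};       /* upper bounds of the box */
  double xmin[2], fmin;
  bopt_params par = initialize_parameters_to_default();
  /* Continuous optimization entry point; signature assumed, see the docs. */
  bayes_optimization(dim, &my_function, NULL, lb, ub, xmin, &fmin, par);
  printf("minimum found at (%f, %f) with value %f\n", xmin[0], xmin[1], fmin);
  return 0;
}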

V-D2 C++ inheritance usage

This is the most straightforward and complete way to use the library. The object to be optimized must inherit from the bayesopt::ContinuousModel or bayesopt::DiscreteModel classes. For example:

class MyOptimization: public bayesopt::ContinuousModel
{
public:
  MyOptimization(bopt_params param):
    ContinuousModel(param) {}

  double evaluateSample(const ublas::vector<double> &query)
  { /* My function here */ }

  bool checkReachability(const ublas::vector<double> &query)
  { /* My constraints here */ }

  // optimize(result) is provided by the base class.
};

Then, we just need to override one of the virtual functions called evaluateSample, which can be called with C arrays and uBlas vectors. Since there are no pure virtual functions, you can just redefine your preferred interface. You can also override the checkReachability function to include nonlinear constraints. Again, the parameters are defined in the bopt_params struct.
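
A minimal usage sketch –assuming, as suggested above, that the optimize method is provided by the base class and fills the vector with the best point found– would then be:

bopt_params par = initialize_parameters_to_default();
MyOptimization opt(par);
ublas::vector<double> bestPoint(2);  // problem dimension (illustrative)
opt.optimize(bestPoint);             // runs the Bayesian optimization loop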

V-D3 Python callback/inheritance usage

Both interfaces are analogous to the C/C++ interface. In this case, the parameters are defined as a Python dictionary.

V-D4 Matlab/Octave callback usage

Matlab/Octave only supports the callback interface. The parameters are defined as a Matlab struct, analogous to the parameters struct in the C/C++ interface.

V-E BayesOpt parameters

The parameters are defined in the bopt_params struct. The easiest way to set the parameters is to use

bopt_params parameters = initialize_parameters_to_default();

and then modify the necessary fields. The details of each parameter can be found in the included documentation. For example:

par.kernel.name = "kSum(kSEISO,kConst)"; // Sum of kernels
par.kernel.hp_mean = {0.5, 0.5};
par.kernel.hp_std = {1, 1};         // Hyperprior on kernel
par.kernel.n_hp = 2;
//Surrogate hyperpriors (w,sigma_s)
par.surr_name = S_STUDENT_T_PROCESS_JEFFREYS;
// We combine Expected Improvement, Lower Confidence Bound
// probability of improvement and Thompson sampling
par.crit_name = "cHedge(cEI,cLCB,cPOI,cThompsonSampling)";
par.crit_params = {1,1,0.01}; // Each criterion has
par.n_crit_params = 3;        // its parameters
par.l_type = L_ML;         // Learning type
par.n_iterations = 200;    // Number of iterations
Fig. 3: This simulation shows three stages of the robot exploring an environment. The simulation includes landmarks that the robot does not know a priori. As soon as the robot observes these landmarks, it incorporates them into its model of the world. The robot continuously plans and replans so as to minimize the uncertainty in its pose and in the location of the known landmarks. The figure also shows the robot’s limited field of view and the paths that it plans to follow at the three simulation stages.

VI Robotics applications

Bayesian optimization has already been applied to many robotics and related problems. It was first introduced in the field by [21] and, independently, by [24]. The work of Lizotte et al. [21] used Bayesian optimization for robot gait optimization, whereas Martinez-Cantin et al. [24, 25] introduced a policy search algorithm for robot planning that assumes expensive reward functions (see Figure 3). Based on related work from [32], the surrogate model in [24] included hyperpriors on the model parameters. This work was the original motivation to build the library.

The idea of learning a controller and a planner was later combined and extended in [4] in a hierarchical fashion. Although it was originally intended for video games and simulation, its applicability to robotic platforms is direct.

As commented before, this topic is highly related to the work on sequential experimental design relying on Gaussian processes. In that field, one of the major breakthroughs was the discovery of mutual information as an efficient criterion for submodular optimization. Originally intended for sensor placement [19], it was later extended to robot planning (moving sensors) [33, 34].

In fact, sensor placement or selection was addressed in a pure Bayesian optimization methodology using a more classical RMS error measure by Garnett et al. [7].

A related problem of robots as moving sensors was presented in [22], which introduced the cost of robot movement as an additional criterion to trade off exploration and exploitation.

Gait optimization in a more complex setup (snake robot) was further analyzed in [37] where the expected improvement algorithm is extended with non-controllable variables as in [43]. Recently, the same authors have extended the work to deal with multiobjective functions [38].

The manipulation community has also been attracted by the capabilities of Bayesian optimization. Kroemer et al. [20] presented a reinforcement learning algorithm for grasping where the Bayesian optimization algorithm is used as a proxy between the reward function of the grasp and the grasping parameters. The reward function is computed directly by executing the grasp and performing a set of movements to guarantee the stability of the grip. For the inner loop of the Bayesian optimization, the authors used a gradient-based strategy. This kind of strategy has some advantages in reinforcement learning for robots [28]. However, it may provide suboptimal performance due to local minima.

Instead of relying on arbitrary movements to test the grasp reward, some authors have suggested a metric based on a simulated environment where the wrench space of the contact points can be computed [2]. This idea was explored in the Bayesian optimization setup in [23]. Later, it was tested in [41], which uses the library presented here.

Fig. 4: Bayesian optimization for grasping. The reward of the grasp is computed by the volume of the wrench space generated by the contact points. This provides a metric of the stability of the grasp assuming that the friction of the fingers can be predicted.

Finally, as presented in a recent work [13], the parameters –either continuous, discrete or categorical– of any algorithm can be optimized using Bayesian optimization.

VII Conclusion

We have presented a common framework for Bayesian nonlinear optimization, sequential experimental design, bandits and, in general, hyperparameter optimization for many applications. The framework has been implemented in an efficient and easy to use library compatible with many operating systems and computer languages. It includes most of the state-of-the-art contributions to the field, both in terms of core algorithms and application oriented modifications.

As the implementation includes metamodels and metacriteria, the library is able to generalize to many situations with a small contribution from the user. On the other hand, for advanced users, it allows full customization to improve the performance in dedicated applications.

References

  • Atkinson et al. [2007] Anthony C. Atkinson, Alexander N. Donev, and Randall D. Tobias. Optimum Experimental Design with SAS. Oxford University Press, 2007.
  • Baier and Zhang [2007] Tim Baier and Jianwei Zhang. Learning to grasp everyday objects using reinforcement-learning with automatic value cut-off. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2007), 2007.
  • Brochu et al. [2007] Eric Brochu, Nando de Freitas, and Abhijeet Ghosh. Active preference learning with discrete choice data. In Advances in Neural Information Processing Systems, 2007.
  • Brochu et al. [2010] Eric Brochu, Vlad M. Cora, and Nando de Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. eprint arXiv:1012.2599, arXiv.org, December 2010.
  • Bull [2011] Adam D. Bull. Convergence rates of efficient global optimization algorithms. Journal of Machine Learning Research, 12:2879–2904, 2011.
  • Diaconis [1988] Persi Diaconis. Bayesian numerical analysis. In S.S. Gupta and J.O. Berger, editors, Statistical Decision Theory and Related Topics IV, volume 1, pages 163–175, 1988.
  • Garnett et al. [2010] Roman Garnett, Michael A. Osborne, and Stephen J. Roberts. Bayesian optimization for sensor set selection. In Information Processing in Sensor Networks (IPSN 2010), 2010.
  • Gramacy and Lee [2012] Robert B. Gramacy and Herbert K.H. Lee. Cases for the nugget in modeling computer experiments. Statistics and Computing, 22:713–722, 2012. ISSN 0960-3174. doi: 10.1007/s11222-010-9224-x. URL http://dx.doi.org/10.1007/s11222-010-9224-x.
  • Gramacy and Polson [2009] Robert B Gramacy and Nicholas G Polson. Particle learning of gaussian process models for sequential design and optimization. Business, 20(1):18, 2009. URL http://arxiv.org/abs/0909.5262.
  • Handcock and Stein [1993] Mark S. Handcock and Michael L. Stein. A Bayesian analysis of Kriging. Technometrics, 35(4):403–410, 1993.
  • Hoffman et al. [2011] Matthew Hoffman, Eric Brochu, and Nando de Freitas. Portfolio allocation for Bayesian optimization. In 27th Conference on Uncertainty in Artificial Intelligence (UAI2011), 2011.
  • Huang et al. [2006] D. Huang, T. T. Allen, W. I. Notz, and N. Zheng. Global optimization of stochastic black-box systems via sequential kriging meta-models. Journal of Global Optimization, 34(3):441– 466, 2006.
  • Hutter et al. [2011] Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In Proceedings of Learning and Intelligent Optimization (LION-5), page 507–523, 2011.
  • Jeffreys [1946] Harold Jeffreys. An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society of London. Series A, Mathematical and Physical Sciences, 186(1007):453–461, 1946.
  • Johnson [2013] Steven G. Johnson. The NLopt nonlinear-optimization package, 2013. URL http://ab-initio.mit.edu/nlopt.
  • Jones et al. [1998] Donald R. Jones, Matthias Schonlau, and William J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455–492, 1998.
  • Kaelbling et al. [1996] Leslie P. Kaelbling, Michael L. Littman, and Andrew W. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.
  • Kleijnen et al. [2012] Jack PC Kleijnen, Wim van Beers, and Inneke Van Nieuwenhuyse. Expected improvement in efficient global optimization through bootstrapped kriging. Journal of Global Optimization, 54(1):59–73, 2012.
  • Krause et al. [2008] Andreas Krause, Ajit Singh, and Carlos Guestrin. Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies. Journal of Machine Learning Research, 9:235–284, 2008.
  • Kroemer et al. [2010] Oliver Kroemer, Renaud Detry, Justus Piater, and Jan Peters. Combining active learning and reactive control for robot grasping. Robotics and Autonomous Systems, 58(9):1105–1116, 2010. URL http://robot-learning.de/pmwiki/uploads/Publications/KroemerJRAS_6636[0].pdf.
  • Lizotte et al. [2007] Daniel Lizotte, Tao Wang, Michael Bowling, and Dale Schuurmans. Automatic gait optimization with gaussian process regression. In Proceedings of the Twentieth International Joint Conference on Artificial Intelligence (IJCAI), pages 944–949, 2007.
  • Marchant and Ramos [2012] Roman Marchant and Fabio Ramos. Bayesian optimisation for intelligent environmental monitoring. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2242–2249, 2012.
  • Martinez-Cantin [2012] Ruben Martinez-Cantin. Efficient computation of stable grasps through Bayesian optimization. Technical report, ISR-Vislab, 2012.
  • Martinez-Cantin et al. [2007] Ruben Martinez-Cantin, Nando de Freitas, Arnoud Doucet, and Jose A. Castellanos. Active policy learning for robot planning and exploration under uncertainty. In Robotics: Science and Systems, 2007.
  • Martinez-Cantin et al. [2009] Ruben Martinez-Cantin, Nando de Freitas, Eric Brochu, Jose A. Castellanos, and Arnoud Doucet. A Bayesian exploration-exploitation approach for optimal online sensing and planning with a visually guided mobile robot. Autonomous Robots - Special Issue on Robot Learning, Part B, 27(3):93–103, 2009.
  • Mockus [1994] Jonas Mockus. Application of bayesian approach to numerical methods of global and stochastic optimization. Journal of Global Optimization, 4:347–365, 1994.
  • O’Hagan [1992] Anthony O’Hagan. Some Bayesian numerical analysis. Bayesian Statistics, 4:345–363, 1992.
  • Peters and Schaal [2006] J. Peters and S. Schaal. Policy gradient methods for robotics. In Proceedings of the IEEE International Conference on Intelligent Robotics Systems (IROS 2006), 2006. URL http://www-clmc.usc.edu/publications/P/peters-IROS2006.pdf.
  • Rasmussen and Nickisch [2010] Carl E. Rasmussen and Hannes Nickisch. Gaussian processes for machine learning (GPML) toolbox. Journal of Machine Learning Research, 11:3011–3015, 2010.
  • Rasmussen and Williams [2006] Carl E. Rasmussen and Christopher K.I. Williams. Gaussian Processes for Machine Learning. The MIT Press, Cambridge, Massachusetts, 2006.
  • Sacks et al. [1989] Jerome Sacks, William J. Welch, Toby J. Mitchell, and Henry P. Wynn. Design and analysis of computer experiments. Statistical Science, 4(4):409–423, 1989.
  • Santner et al. [2003] Thomas J. Santner, Brian J. Williams, and William I. Notz. The Design and Analysis of Computer Experiments. Springer-Verlag, 2003.
  • Singh et al. [2009a] Amarjeet Singh, Andreas Krause, Carlos Guestrin, and William Kaiser. Efficient informative sensing using multiple robots. Journal of Artificial Intelligence Research (JAIR), 34:707–755, 2009a.
  • Singh et al. [2009b] Amarjeet Singh, Andreas Krause, and William Kaiser. Nonmyopic adaptive informative path planning for multiple robots. In Proc. International Joint Conference on Artificial Intelligence (IJCAI), 2009b.
  • Snoek et al. [2012] Jasper Snoek, Hugo Larochelle, and Ryan Adams. Practical Bayesian optimization of machine learning algorithms. In P. Bartlett, F.C.N. Pereira, C.J.C. Burges, L. Bottou, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 2960–2968, 2012.
  • Srinivas et al. [2010] Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proc. International Conference on Machine Learning (ICML), 2010.
  • Tesch et al. [2011] Matthew Tesch, Jeff Schneider, and Howie Choset. Adapting control policies for expensive systems to changing environments. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), September 2011.
  • Tesch et al. [2012] Matthew Tesch, Jeff Schneider, and Howie Choset. Expensive multiobjective optimization and validation with a robotics application. In Neural Information Processing Systems: Workshop on Bayesian Optimization & Decision Making, December 2012.
  • Toussaint [in press] Marc Toussaint. The Bayesian search game. In Alberto Moraglio Yossi Borenstein, editor, Theory and Principled Methods for Designing Metaheuristics. Springer, in press.
  • Vazquez and Bect [2010] Emmanuel Vazquez and Julien Bect. Convergence properties of the expected improvement algorithm with fixed mean and covariance functions. Journal of Statistical Planning and Inference, 140(11):3088–3095, 2010.
  • Veiga and Bernardino [2013] Filipe Veiga and Alexandre Bernardino. Active tactile exploration for grasping. In ICRA Workshop on Autonomous Learning, 2013.
  • Wang et al. [2013] Ziyu Wang, Masrour Zoghi, David Matheson, Frank Hutter, and Nando de Freitas. Bayesian optimization in a billion dimensions via random embeddings. In International Joint Conference on Artificial Intelligence (IJCAI), 2013.
  • Williams et al. [2000] Brian J. Williams, Thomas J. Santner, and William I. Notz. Sequential design of computer experiments to minimize integrated response functions. Statistica Sinica, 10(4):1133–1152, 2000.