1 Introduction
We are interested in finding the global minimum of an objective function $f(\mathbf{x})$ over some bounded domain, typically $\mathcal{X} \subset \mathbb{R}^D$, subject to the nonnegativity of a set of constraint functions $c_1, \ldots, c_K$. This can be formalized as
$$\min_{\mathbf{x} \in \mathcal{X}} f(\mathbf{x}) \quad \text{s.t.} \quad c_1(\mathbf{x}) \geq 0, \ldots, c_K(\mathbf{x}) \geq 0. \tag{1}$$
However, $f$ and $c_1, \ldots, c_K$ are unknown and can only be evaluated pointwise via expensive queries to black boxes that provide noise-corrupted evaluations of $f$ and $c_1, \ldots, c_K$. We assume that $f$ and each of the constraints are defined over the entire space $\mathcal{X}$. We seek to find a solution to (1) with as few queries as possible. Bayesian optimization (Mockus et al., 1978) methods approach this type of problem by building a Bayesian model of the unknown objective function and/or constraints, using this model to compute an acquisition function that represents how useful each input is thought to be as a next evaluation, and then maximizing this acquisition function to select a suggestion for function evaluation.
In this work we extend Predictive Entropy Search (PES) (Hernández-Lobato et al., 2014) to solve (1), an approach that we call Predictive Entropy Search with Constraints (PESC). PESC is an acquisition function that approximates the expected information gain about the value of the constrained minimizer $\mathbf{x}_\star$. As we will show below, PESC is effective in practice and can be applied to a much wider variety of constrained problems than existing methods.
2 Related Work and Challenges
Most previous approaches to Bayesian optimization with unknown constraints are variants of expected improvement (EI) (Mockus et al., 1978; Jones et al., 1998). EI measures the expected amount by which observing $f$ at $\mathbf{x}$ leads to improvement over the current best value or incumbent $\eta$:
$$\alpha_{\text{EI}}(\mathbf{x}) = \mathbb{E}\left[\max\left(0, \eta - f(\mathbf{x})\right) \mid \mathcal{D}\right] = \sigma(\mathbf{x})\left[z(\mathbf{x})\Phi\left(z(\mathbf{x})\right) + \phi\left(z(\mathbf{x})\right)\right], \tag{2}$$
where $\mathcal{D}$ is the collected data, $\mu(\mathbf{x})$ and $\sigma^2(\mathbf{x})$ are the posterior mean and variance of $f(\mathbf{x})$, $z(\mathbf{x}) = (\eta - \mu(\mathbf{x}))/\sigma(\mathbf{x})$, and $\Phi$ and $\phi$ denote the standard Gaussian CDF and density.
2.1 Expected improvement with constraints
One way to use EI with constraints is to discount EI by the posterior probability of a constraint violation. The resulting acquisition function, which we call expected improvement with constraints (EIC), is given by
$$\alpha_{\text{EIC}}(\mathbf{x}) = \alpha_{\text{EI}}(\mathbf{x}) \prod_{k=1}^{K} \Pr\left(c_k(\mathbf{x}) \geq 0 \mid \mathcal{D}_k\right), \tag{3}$$
where $\mathcal{D}_0$ is the set of objective function observations and $\mathcal{D}_k$ is the set of observations for constraint $c_k$. Initially proposed by Schonlau et al. (1998), EIC has recently been independently developed in Snoek (2013); Gelbart et al. (2014); Gardner et al. (2014). In the constrained case, the incumbent $\eta$ is the smallest value of the posterior mean of $f$ such that all the constraints are satisfied at the corresponding location.
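As a concrete illustration, the EIC rule in (3) can be sketched in a few lines given Gaussian posterior marginals for the objective and the constraints. The helper names below are ours, not from any particular package:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, eta):
    """EI of a Gaussian predictive N(mu, sigma^2) over incumbent eta (minimization)."""
    z = (eta - mu) / sigma
    return sigma * (z * norm.cdf(z) + norm.pdf(z))

def eic(mu_f, sigma_f, eta, mu_c, sigma_c):
    """EI with constraints, eq. (3): discount EI by the probability that each
    constraint c_k(x) >= 0 under its Gaussian posterior N(mu_c[k], sigma_c[k]^2)."""
    p_feasible = np.prod(norm.cdf(np.asarray(mu_c) / np.asarray(sigma_c)))
    return expected_improvement(mu_f, sigma_f, eta) * p_feasible
```

Note that a confidently violated constraint (large negative posterior mean) drives the acquisition toward zero regardless of how promising the objective looks.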
2.2 Augmented Lagrangian
Gramacy et al. (2014) propose a combination of the expected improvement heuristic and the augmented Lagrangian (AL) optimization framework for constrained black-box optimization. AL methods are a class of algorithms for constrained nonlinear optimization that work by iteratively optimizing the unconstrained AL objective
$$L_A(\mathbf{x}; \lambda, p) = f(\mathbf{x}) - \lambda^\top \mathbf{c}(\mathbf{x}) + \frac{1}{2p} \sum_{k=1}^{K} \min\left(0, c_k(\mathbf{x})\right)^2,$$
where $p > 0$ is a penalty parameter and $\lambda \geq 0$ is an approximate Lagrange multiplier, both of which are updated at each iteration.
The method proposed by Gramacy et al. (2014) uses Bayesian optimization with EI to solve the unconstrained inner loop of the augmented Lagrangian formulation. AL is limited by requiring noiseless constraints so that $\lambda$ and $p$ can be updated at each iteration. In Section 4.3 we show that PESC and EIC perform better than AL on the synthetic benchmark problem considered in Gramacy et al. (2014), even when the AL method has access to the true objective function and PESC and EIC do not.
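For intuition, the following is a minimal sketch of one generic AL step under the $c(\mathbf{x}) \geq 0$ convention used here; the exact penalty form and update rules of Gramacy et al. (2014) differ in detail, so treat this as an assumption-laden illustration:

```python
import numpy as np

def augmented_lagrangian(f_val, c_vals, lam, p):
    """AL value for constraints c(x) >= 0: subtract the Lagrange term and
    quadratically penalize only violated constraints (c_k < 0)."""
    c = np.asarray(c_vals)
    return f_val - lam @ c + (1.0 / (2.0 * p)) * np.sum(np.minimum(0.0, c) ** 2)

def update_multipliers(lam, c_vals, p):
    """Standard first-order multiplier update, projected to stay nonnegative."""
    return np.maximum(0.0, lam - np.asarray(c_vals) / p)
```

In the outer loop, $p$ is typically shrunk whenever the inner minimizer remains infeasible, which sharpens the penalty on violated constraints.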
2.3 Integrated expected conditional improvement
Gramacy & Lee (2011) propose an acquisition function based on the integrated expected conditional improvement (IECI), which is given by
$$\alpha_{\text{IECI}}(\mathbf{x}) = \int_{\mathcal{X}} \left[\alpha_{\text{EI}}(\mathbf{x}') - \alpha_{\text{EI}}(\mathbf{x}' \mid \mathbf{x})\right] h(\mathbf{x}')\, d\mathbf{x}', \tag{4}$$
where $\alpha_{\text{EI}}(\mathbf{x}')$ is the expected improvement at $\mathbf{x}'$ and $\alpha_{\text{EI}}(\mathbf{x}' \mid \mathbf{x})$ is the expected improvement at $\mathbf{x}'$ when the objective has been evaluated at $\mathbf{x}$, but without knowing the value obtained. The IECI at $\mathbf{x}$ is the expected reduction in improvement at $\mathbf{x}'$ under the density $h(\mathbf{x}')$ caused by observing the objective at that location, where $h(\mathbf{x}')$ is the probability of all the constraints being satisfied at $\mathbf{x}'$. Gelbart et al. (2014) compare IECI with EIC for optimizing the hyperparameters of a topic model with constraints on the entropy of the per-topic word distribution and show that EIC outperforms IECI for this problem.
2.4 Expected volume reduction
Picheny (2014) proposes to sequentially explore the location that yields the largest expected volume reduction (EVR) of the feasible region below the best feasible objective value found so far. This volume is given by integrating the product of the probability of improvement and the probability of feasibility. That is,
$$V(\mathcal{D}) = \int_{\mathcal{X}} \Pr\left(f(\mathbf{x}') \leq \eta \mid \mathcal{D}\right) h(\mathbf{x}')\, d\mathbf{x}', \tag{5}$$
where, as in IECI, $h(\mathbf{x}')$ is the probability that the constraints are satisfied at $\mathbf{x}'$; the EVR criterion selects the $\mathbf{x}$ whose evaluation is expected to reduce this volume the most. This stepwise uncertainty reduction approach is similar to PESC in that both methods work by reducing a specific measure of uncertainty (entropy for PESC and expected volume for EVR).
2.5 Challenges
EI-based methods for constrained optimization have several issues. First, when no point in the search space is feasible under the above definition, the incumbent $\eta$ does not exist and the EI cannot be computed. This issue affects EIC, IECI, and EVR. To address it, Gelbart et al. (2014) modify EIC to ignore the EI factor in (3) and only consider the posterior probability of the constraints being satisfied when $\eta$ is not defined. The resulting acquisition function focuses only on searching for a feasible location and ignores learning about the objective $f$.
Furthermore, Gelbart et al. (2014) identify a pathology with EIC when one is able to separately evaluate the objective or the constraints, i.e., the decoupled case. The best solution must satisfy a conjunction of a low objective value and high (nonnegative) constraint values. By evaluating only the objective or only a single constraint, this conjunction cannot be satisfied by a single observation under a myopic search policy. Thus, the newly observed location cannot become the new incumbent as a result of a decoupled observation, and the expected improvement is zero. Therefore, standard EIC fails in the decoupled setting. Gelbart et al. (2014) circumvent this pathology by treating decoupling as a special case and using a two-stage acquisition function: first, $\mathbf{x}$ is chosen with EIC, and then, given $\mathbf{x}$, the task (whether to evaluate the objective or one of the constraints) is chosen with the method in Villemonteix et al. (2009). This approach does not take full advantage of the available information in the way a joint selection of $\mathbf{x}$ and the task would. Like EIC, the methods AL, IECI, and EVR are also not easily extended to the decoupled setting.
In addition to these difficulties, EVR and IECI are limited by having to compute the integrals in (4) and (5) over the entire domain, which is done numerically over a grid on $\mathcal{X}$ (Gramacy & Lee, 2011; Picheny, 2014). The resulting acquisition function must then be globally optimized, which also requires a grid on $\mathcal{X}$. This nesting of grid operations limits the application of these methods to small $D$.
Our new method, PESC, does not suffer from these pathologies. First, the PESC acquisition function does not depend on the current best feasible solution, so it can operate coherently even when there is not yet a feasible solution. Second, PESC naturally separates the contribution of each task (objective or constraint) in its acquisition function. As a result, no pathology arises in the decoupled case and, thus, no ad hoc modifications to the acquisition function are required. Third, like EVR and IECI, PESC involves computing a difficult integral (over the posterior on $\mathbf{x}_\star$). However, this can be done efficiently using the sampling approach described in Hernández-Lobato et al. (2014). Furthermore, in addition to its increased generality, our experiments show that PESC performs favorably when compared to EIC and AL even in the basic setting of joint evaluations to which these methods are most suited.
3 Predictive entropy search with constraints
We seek to maximize information about the location $\mathbf{x}_\star$ of the constrained global minimizer, whose posterior distribution is $p(\mathbf{x}_\star \mid \mathcal{D})$. We assume that $f$ and $c_1, \ldots, c_K$ follow independent Gaussian process (GP) priors (see, e.g., Rasmussen & Williams, 2006) and that observation noise is i.i.d. Gaussian with zero mean. GPs are widely used probabilistic models for Bayesian nonparametric regression which provide a flexible framework for working with unknown response surfaces.
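For readers unfamiliar with GP regression, a minimal sketch of the posterior mean and marginal variance computations that the acquisition functions in this paper rely on (squared exponential kernel; the length-scale and noise values are illustrative assumptions, not those used in our experiments):

```python
import numpy as np

def se_kernel(A, B, ell=0.5, amp=1.0):
    """Squared exponential covariance between row-vector inputs A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return amp * np.exp(-0.5 * d2 / ell ** 2)

def gp_posterior(X, y, Xs, noise=1e-2):
    """Posterior mean and marginal variance of a zero-mean GP at test points Xs."""
    K = se_kernel(X, X) + noise * np.eye(len(X))
    Ks = se_kernel(Xs, X)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mu = Ks @ alpha                       # posterior mean
    v = np.linalg.solve(L, Ks.T)
    var = se_kernel(Xs, Xs).diagonal() - (v ** 2).sum(0)  # marginal variance
    return mu, var
```

The posterior mean interpolates the data (up to noise) and the variance collapses near observed points, which is what the entropy-based reasoning below exploits.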
In the coupled setting we will let $\mathcal{D}_n = \{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{n}$ denote all the observations up to step $n$, where $\mathbf{y}_i$ is a vector collecting the objective and constraint observations at step $i$. The next query $\mathbf{x}_{n+1}$ can then be defined as the one that maximizes the expected reduction in the differential entropy of the posterior on $\mathbf{x}_\star$. We can write the PESC acquisition function as
$$\alpha(\mathbf{x}) = H\left[p(\mathbf{x}_\star \mid \mathcal{D}_n)\right] - \mathbb{E}_{p(\mathbf{y} \mid \mathcal{D}_n, \mathbf{x})}\left[H\left[p(\mathbf{x}_\star \mid \mathcal{D}_n \cup \{(\mathbf{x}, \mathbf{y})\})\right]\right], \tag{6}$$
where the expectation is taken with respect to the posterior distribution $p(\mathbf{y} \mid \mathcal{D}_n, \mathbf{x})$ over the noisy evaluations $\mathbf{y}$ of $f$ and $c_1, \ldots, c_K$ at $\mathbf{x}$.
The exact computation of the above expression is infeasible in practice. Instead, we follow Houlsby et al. (2012); Hernández-Lobato et al. (2014) and take advantage of the symmetry of mutual information, rewriting this acquisition function as the mutual information between $\mathbf{y}$ and $\mathbf{x}_\star$ given the collected data $\mathcal{D}_n$. That is,
$$\alpha(\mathbf{x}) = H\left[p(\mathbf{y} \mid \mathcal{D}_n, \mathbf{x})\right] - \mathbb{E}_{p(\mathbf{x}_\star \mid \mathcal{D}_n)}\left[H\left[p(\mathbf{y} \mid \mathcal{D}_n, \mathbf{x}, \mathbf{x}_\star)\right]\right], \tag{7}$$
where the expectation is now with respect to the posterior $p(\mathbf{x}_\star \mid \mathcal{D}_n)$ and where $p(\mathbf{y} \mid \mathcal{D}_n, \mathbf{x}, \mathbf{x}_\star)$ is the posterior predictive distribution for objective and constraint values given the past data and the location $\mathbf{x}_\star$ of the global solution to the constrained optimization problem. We call this the conditioned predictive distribution (CPD). The first term on the right-hand side of (7) is straightforward to compute: it is the entropy of a product of independent Gaussians, which is given by
$$H\left[p(\mathbf{y} \mid \mathcal{D}_n, \mathbf{x})\right] = \frac{K+1}{2}\log(2\pi e) + \frac{1}{2}\log\left(v_f(\mathbf{x}) + \sigma_f^2\right) + \frac{1}{2}\sum_{k=1}^{K}\log\left(v_k(\mathbf{x}) + \sigma_k^2\right), \tag{8}$$
where $v_f(\mathbf{x})$ and $v_1(\mathbf{x}), \ldots, v_K(\mathbf{x})$ are the predictive variances of the objective and constraints, respectively, and $\sigma_f^2$ and $\sigma_1^2, \ldots, \sigma_K^2$ are the corresponding noise variances. However, the second term on the right-hand side of (7) has to be approximated. For this, we first approximate the expectation by averaging over samples of $\mathbf{x}_\star$ approximately drawn from $p(\mathbf{x}_\star \mid \mathcal{D}_n)$. To sample $\mathbf{x}_\star$, we first approximately draw $f$ and $c_1, \ldots, c_K$ from their GP posteriors using a finite parameterization of these functions. Then we solve a constrained optimization problem on the sampled functions to yield a sample of $\mathbf{x}_\star$. This is an extension to the constrained setting of the approach described in more detail by Hernández-Lobato et al. (2014). For each value of $\mathbf{x}_\star$ generated by this procedure, we approximate the CPD as described in the next section.
3.1 Approximating the CPD
Let $\mathbf{z} = (z_0, z_1, \ldots, z_K)$ denote the concatenated vector of the noise-free objective and constraint values at $\mathbf{x}$. We can approximate the CPD by first approximating the posterior predictive distribution of $\mathbf{z}$ conditioned on $\mathcal{D}_n$, $\mathbf{x}$, and $\mathbf{x}_\star$, which we call the noise-free CPD (NFCPD), and then convolving that approximation with the additive Gaussian observation noise.
We first consider the distribution $p(\mathbf{x}_\star \mid f, c_1, \ldots, c_K)$. The variable $\mathbf{x}_\star$ is in fact a deterministic function of the latent functions $f, c_1, \ldots, c_K$: in particular, $\mathbf{x}_\star$ is the global minimizer if and only if (i) all constraints are satisfied at $\mathbf{x}_\star$ and (ii) $f(\mathbf{x}_\star)$ is the smallest feasible value in the domain. We can informally translate these deterministic conditions into a conditional probability:
$$p(\mathbf{x}_\star \mid f, c_1, \ldots, c_K) \propto \left[\prod_{k=1}^{K} \Theta\left(c_k(\mathbf{x}_\star)\right)\right] \prod_{\mathbf{x}' \in \mathcal{X}} \left[1 - \Psi(\mathbf{x}')\left(1 - \Theta\left(f(\mathbf{x}') - f(\mathbf{x}_\star)\right)\right)\right], \tag{9}$$
where $\Psi(\mathbf{x})$ is defined as
$$\Psi(\mathbf{x}) = \prod_{k=1}^{K} \Theta\left(c_k(\mathbf{x})\right),$$
and the symbol $\Theta$ denotes the Heaviside step function with the convention that $\Theta(0) = 1$. The first product in (9) encodes condition (i) and the infinite product over $\mathbf{x}'$ encodes condition (ii): each factor is zero precisely when $\mathbf{x}'$ is feasible and $f(\mathbf{x}') < f(\mathbf{x}_\star)$. Note that $\Psi(\mathbf{x})$ also depends on $c_1, \ldots, c_K$; we use the notation $\Psi(\mathbf{x})$ for brevity.
Because $\mathbf{z}$ is simply a vector containing the values of $f, c_1, \ldots, c_K$ at $\mathbf{x}$, $\mathbf{z}$ is also a deterministic function of $f, c_1, \ldots, c_K$, and we can write $p(\mathbf{z} \mid f, c_1, \ldots, c_K, \mathbf{x})$ using Dirac delta functions to pick out the values at $\mathbf{x}$:
$$p(\mathbf{z} \mid f, c_1, \ldots, c_K, \mathbf{x}) = \delta\left(z_0 - f(\mathbf{x})\right) \prod_{k=1}^{K} \delta\left(z_k - c_k(\mathbf{x})\right). \tag{10}$$
We can now write the NFCPD by i) noting that $\mathbf{z}$ is independent of $\mathcal{D}_n$ given $f, c_1, \ldots, c_K$, ii) multiplying the product of (9) and (10) by the GP posterior over the latent functions, and iii) integrating out the latent functions $f, c_1, \ldots, c_K$:
$$p(\mathbf{z} \mid \mathcal{D}_n, \mathbf{x}, \mathbf{x}_\star) \propto \int \delta\left(z_0 - f(\mathbf{x})\right) \left[\prod_{k=1}^{K} \delta\left(z_k - c_k(\mathbf{x})\right)\right] \left[1 - \Psi(\mathbf{x})\left(1 - \Theta\left(f(\mathbf{x}) - f(\mathbf{x}_\star)\right)\right)\right] \left[\prod_{k=1}^{K} \Theta\left(c_k(\mathbf{x}_\star)\right)\right] \prod_{\mathbf{x}' \neq \mathbf{x}} \left[1 - \Psi(\mathbf{x}')\left(1 - \Theta\left(f(\mathbf{x}') - f(\mathbf{x}_\star)\right)\right)\right] p(f, c_1, \ldots, c_K \mid \mathcal{D}_n)\, df\, dc_1 \cdots dc_K, \tag{11}$$
where $p(f, c_1, \ldots, c_K \mid \mathcal{D}_n)$ is an infinite-dimensional Gaussian given by the GP posterior on $f, c_1, \ldots, c_K$, and we have separated out the factor for $\mathbf{x}' = \mathbf{x}$ from the infinite product over $\mathbf{x}'$.
We find a Gaussian approximation to (11) in several steps. The general approach is to separately approximate the factors that do and do not depend on $\mathbf{z}$, so that the computations associated with the latter factors can be reused rather than recomputed for each $\mathbf{x}$. In (11), the only factors that depend on $\mathbf{z}$ are the Dirac deltas and the separated factor for $\mathbf{x}' = \mathbf{x}$.
Let $\mathbf{f}$ denote the vector containing the objective function evaluations at the observed locations $\mathbf{x}_1, \ldots, \mathbf{x}_n$ and at $\mathbf{x}_\star$, and define constraint vectors $\mathbf{c}_1, \ldots, \mathbf{c}_K$ similarly. Then, we approximate (11) by conditioning only on $\mathbf{f}$ and $\mathbf{c}_1, \ldots, \mathbf{c}_K$, rather than the full latent functions. We first approximate the factors in (11) that do not depend on $\mathbf{z}$ as
$$q(\mathbf{f}, \mathbf{c}_1, \ldots, \mathbf{c}_K) \propto \left[\prod_{k=1}^{K} \Theta\left(c_k(\mathbf{x}_\star)\right)\right] \prod_{i=1}^{n} \left[1 - \Psi(\mathbf{x}_i)\left(1 - \Theta\left(f(\mathbf{x}_i) - f(\mathbf{x}_\star)\right)\right)\right] p(\mathbf{f}, \mathbf{c}_1, \ldots, \mathbf{c}_K \mid \mathcal{D}_n), \tag{12}$$
where $p(\mathbf{f}, \mathbf{c}_1, \ldots, \mathbf{c}_K \mid \mathcal{D}_n)$ is the GP predictive distribution for the objective and constraint values. Because (12) is not tractable, we approximate the normalized version of $q$ with a product of Gaussians using expectation propagation (EP) (Minka, 2001). In particular, we obtain
$$q(\mathbf{f}, \mathbf{c}_1, \ldots, \mathbf{c}_K) \approx Z^{-1}\, \mathcal{N}\left(\mathbf{f} \mid \mathbf{m}_0, \mathbf{V}_0\right) \prod_{k=1}^{K} \mathcal{N}\left(\mathbf{c}_k \mid \mathbf{m}_k, \mathbf{V}_k\right), \tag{13}$$
where $Z$ is the normalization constant of $q$ and $\mathbf{m}_k$, $\mathbf{V}_k$ for $k = 0, \ldots, K$ are the mean and covariance terms determined by EP. See the supplementary material for details on the EP approximation. Roughly speaking, EP approximates each true (but intractable) factor in (12) with a Gaussian factor whose parameters are iteratively refined. The product of all these Gaussian factors produces a tractable Gaussian approximation to (12).
We now approximate the portion of (11) that does depend on $\mathbf{z}$, namely the Dirac deltas and the factor for $\mathbf{x}' = \mathbf{x}$, by replacing the deltas with $p(\mathbf{z} \mid \mathbf{f}, \mathbf{c}_1, \ldots, \mathbf{c}_K, \mathbf{x})$, the $(K+1)$-dimensional Gaussian conditional distribution given by the GP priors on $f, c_1, \ldots, c_K$. Our full approximation to (11) is then
$$p(\mathbf{z} \mid \mathcal{D}_n, \mathbf{x}, \mathbf{x}_\star) \approx \frac{1}{Z'} \int \left[1 - \left[\prod_{k=1}^{K} \Theta(z_k)\right]\left(1 - \Theta\left(z_0 - f_\star\right)\right)\right] p(\mathbf{z} \mid \mathbf{f}, \mathbf{c}_1, \ldots, \mathbf{c}_K, \mathbf{x})\, \mathcal{N}\left(\mathbf{f} \mid \mathbf{m}_0, \mathbf{V}_0\right) \prod_{k=1}^{K} \mathcal{N}\left(\mathbf{c}_k \mid \mathbf{m}_k, \mathbf{V}_k\right) d\mathbf{f}\, d\mathbf{c}_1 \cdots d\mathbf{c}_K, \tag{14}$$
where $Z'$ is a normalization constant and $f_\star$ denotes the entry of $\mathbf{f}$ corresponding to $\mathbf{x}_\star$. From here, we analytically marginalize out all integration variables except $f_\star$; see the supplementary material for the full details. This calculation, and those that follow, must be repeated for every $\mathbf{x}_\star$; however, the EP approximation in (13) can be reused over all $\mathbf{x}$. After performing the integration, we arrive at
(15) 
Details on how to compute the means and variances in this expression, as well as the two-dimensional mean vector and covariance matrix involved, can be found in the supplementary material.
3.2 The PESC acquisition function
By approximating the NFCPD with a product of independent Gaussians, we can approximate the entropy of the CPD with the following operations: first, we add the noise variances to the marginal variances of our final approximation of the NFCPD, and second, we compute the entropy with (8). The PESC acquisition function, which approximates (6), is then
$$\alpha_{\text{PESC}}(\mathbf{x}) = \frac{1}{2}\log\left(v_f(\mathbf{x}) + \sigma_f^2\right) + \sum_{k=1}^{K} \frac{1}{2}\log\left(v_k(\mathbf{x}) + \sigma_k^2\right) - \frac{1}{M} \sum_{m=1}^{M} \left[\frac{1}{2}\log\left(v_f^{(m)}(\mathbf{x}) + \sigma_f^2\right) + \sum_{k=1}^{K} \frac{1}{2}\log\left(v_k^{(m)}(\mathbf{x}) + \sigma_k^2\right)\right], \tag{16}$$
where $M$ is the number of samples drawn from $p(\mathbf{x}_\star \mid \mathcal{D}_n)$, $\mathbf{x}_\star^{(m)}$ is the $m$th of these samples, $v_f(\mathbf{x}) + \sigma_f^2$ and $v_k(\mathbf{x}) + \sigma_k^2$ are the predictive variances for the noisy evaluations of $f$ and $c_k$ at $\mathbf{x}$, respectively, and $v_f^{(m)}(\mathbf{x}) + \sigma_f^2$ and $v_k^{(m)}(\mathbf{x}) + \sigma_k^2$ are the approximated marginal variances of the CPD for the noisy evaluations of $f$ and $c_k$ at $\mathbf{x}$ given that $\mathbf{x}_\star = \mathbf{x}_\star^{(m)}$. Marginalization of (16) over the GP hyperparameters can be done efficiently as in Hernández-Lobato et al. (2014).
The PESC acquisition function is additive in the expected amount of information that is obtained from the evaluation of each task (objective or constraint) at any particular location $\mathbf{x}$. For example, the expected information gain obtained from the evaluation of $f$ at $\mathbf{x}$ is given by the terms in (16) that involve the variances of $f$. The other terms in (16) measure the corresponding contribution from evaluating each of the constraints. This allows PESC to easily address the decoupled scenario in which the different functions can be evaluated independently at different locations. In other words, Equation (16) is a sum of individual acquisition functions, one for each function that we can evaluate. Existing methods for Bayesian optimization with unknown constraints (described in Section 2) do not possess this desirable property. Finally, the per-iteration cost of PESC in the coupled setting is, as with unconstrained PES, dominated by the cost of a matrix inversion in the EP step.
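Under the factorized Gaussian approximation, evaluating (16) reduces to differences of log variances, one term per task. A minimal sketch (array names and shapes are ours, not from the Spearmint implementation):

```python
import numpy as np

def pesc_acquisition(v_pred, v_cond):
    """Approximate per-task information gain, eq. (16).

    v_pred: (K+1,) predictive variances at x, noise included.
    v_cond: (M, K+1) conditioned (CPD) variances, one row per sampled minimizer.
    Returns the (K+1,) vector of per-task gains; sum the entries for a coupled
    evaluation, or use them separately in the decoupled setting."""
    v_cond = np.asarray(v_cond)
    return 0.5 * np.log(v_pred) - 0.5 * np.log(v_cond).mean(axis=0)
```

The additivity over tasks is explicit here: conditioning on $\mathbf{x}_\star$ shrinks each task's variance, and the gain for each task is just the resulting drop in its Gaussian entropy.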
4 Experiments
We evaluate the performance of PESC through experiments with i) synthetic functions sampled from the GP prior distribution, ii) analytic benchmark problems previously used in the literature on Bayesian optimization with unknown constraints, and iii) real-world constrained optimization problems.
For case i) above, the synthetic functions sampled from the GP prior are generated following the same experimental setup as in Hennig & Schuler (2012) and Hernández-Lobato et al. (2014). The search space is the unit hypercube of dimension $D$, and the ground truth objective $f$ is a sample from a zero-mean GP with a squared exponential covariance function of unit amplitude and a fixed length scale in each dimension. We represent the function $f$ by first sampling from the GP prior on a grid of 1000 points generated using a Halton sequence (see Leobacher & Pillichshammer, 2014) and then defining $f$ as the resulting GP posterior mean. We use a single constraint function $c$ whose ground truth is sampled in the same way as $f$. The evaluations of $f$ and $c$ are contaminated with i.i.d. Gaussian noise of fixed variance.
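This ground-truth construction can be sketched in one dimension as follows; the length-scale and noise values are illustrative assumptions, and a Halton sequence would replace the regular grid in higher dimensions:

```python
import numpy as np

def make_ground_truth(grid, ell=0.1, noise=1e-6, seed=0):
    """Sample a smooth ground-truth function: draw GP prior values on a grid,
    then return the GP posterior mean that interpolates that draw."""
    rng = np.random.default_rng(seed)
    d2 = (grid[:, None] - grid[None, :]) ** 2
    K = np.exp(-0.5 * d2 / ell ** 2) + noise * np.eye(len(grid))
    y = rng.multivariate_normal(np.zeros(len(grid)), K)
    Kinv_y = np.linalg.solve(K, y)

    def f(x):
        # posterior mean of the GP fit to the sampled grid values
        k = np.exp(-0.5 * (x - grid) ** 2 / ell ** 2)
        return k @ Kinv_y

    return f
```

Because the returned function is a GP posterior mean, it is defined everywhere in the domain, not just on the grid, which is what allows noiseless "ground truth" queries at arbitrary recommendations.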
4.1 Accuracy of the PESC approximation
We first analyze the accuracy of the approximation to (7) generated by PESC. We compare the PESC approximation with a ground truth for (7) obtained by rejection sampling (RS). The RS method works by discretizing the search space using a uniform grid. The expectation with respect to $\mathbf{x}_\star$ in (7) is then approximated by Monte Carlo. To achieve this, $f$ and $c$ are sampled on the grid and the grid cell with positive $c$ (feasibility) and the lowest value of $f$ (optimality) is selected. For each sample of $\mathbf{x}_\star$ generated by this procedure, the CPD is approximated by rejection sampling: we select those samples of $f$ and $c$ whose corresponding feasible optimal solution is the sampled $\mathbf{x}_\star$ and reject the other samples. We then assume that the selected samples of $f$ and $c$ are independent and have Gaussian marginal distributions. Under this assumption, the second term in (7) can be approximated using the formula for the entropy of independent Gaussian random variables, with the variance parameters in this formula being equal to the empirical marginal variances of the selected samples of $f$ and $c$ at $\mathbf{x}$ plus the corresponding noise variances.

The left plot in Figure 1 shows the posterior distribution of $f$ and $c$ given 5 evaluations sampled from the GP prior with $D = 1$. The posterior is computed using the optimal GP hyperparameters. The corresponding approximations to (7) generated by PESC and RS are shown in the middle plot of Figure 1. Both PESC and RS use a total of 50 samples of $\mathbf{x}_\star$ when approximating the expectation in (7). The PESC approximation is very accurate, and importantly its maximum value is very close to the maximum value of the RS approximation.

One disadvantage of the RS method is its high cost, which scales with the size of the grid used. This grid has to be large to guarantee good performance, especially when $D$ is large. An alternative is to use a small dynamic grid that changes as data is collected. Such a grid can be obtained by sampling from $p(\mathbf{x}_\star \mid \mathcal{D})$ using the same approach as in PESC. The samples obtained then form the dynamic grid. The resulting method is called Rejection Sampling with a Dynamic Grid (RSDG).
We compare the performance of PESC, RS and RSDG in experiments with synthetic data corresponding to 500 pairs of $f$ and $c$ sampled from the GP prior with $D = 1$. At each iteration, RSDG draws the same number of samples of $\mathbf{x}_\star$ as PESC. We assume that the GP hyperparameter values are known to each method. Recommendations are made by finding the location with the lowest posterior mean of $f$ such that $c$ is nonnegative with probability at least $1 - \delta$, for a fixed confidence level $\delta$. For reporting purposes, we set the utility of a recommendation $\mathbf{x}$ to be $f(\mathbf{x})$ if $\mathbf{x}$ satisfies the constraint, and otherwise a penalty value equal to the worst (largest) objective function value achievable in the search space. For each recommendation, we compute the utility gap: the absolute difference between the utility of the recommendation and $f(\mathbf{x}_\star)$, where $\mathbf{x}_\star$ is the true solution of the optimization problem. Each method is initialized with the same three random points drawn with Latin hypercube sampling.
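The recommendation rule described above can be sketched as follows (single constraint, Gaussian posterior marginals; `delta` is the confidence level, and the helper names are ours):

```python
import numpy as np
from scipy.stats import norm

def recommend(candidates, mu_f, mu_c, sigma_c, delta=0.05):
    """Recommend the candidate with lowest posterior mean objective among
    those whose constraint is nonnegative with probability >= 1 - delta."""
    p_feas = norm.cdf(np.asarray(mu_c) / np.asarray(sigma_c))
    ok = p_feas >= 1.0 - delta
    if not ok.any():
        return None  # no location is currently believed feasible
    idx = np.flatnonzero(ok)[np.argmin(np.asarray(mu_f)[ok])]
    return candidates[idx]
```

Separating the feasibility filter from the objective ranking mirrors how the utility and penalty are defined for reporting.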
The right plot in Figure 1 shows the median of the utility gap for each method across the 500 realizations of $f$ and $c$. The horizontal axis in this plot is the number of joint function evaluations of $f$ and $c$. We report the median because the empirical distribution of the utility gap is heavy-tailed and in this case the median is more representative of the location of the bulk of the data than the mean. The heavy tails arise because we are measuring performance across 500 different optimization problems with very different degrees of difficulty. In this and all following experiments, standard errors on the reported statistics are computed using the bootstrap. The plot shows that PESC and RS are better than RSDG. Furthermore, PESC is very similar to RS, with PESC even performing slightly better at the end of the data collection process, since PESC is not limited by a finite grid as RS is. These results show that PESC yields a very accurate approximation of the information gain. Although RSDG performs worse than PESC, RSDG is faster because the rejection sampling operation (with a small grid) is less expensive than the EP algorithm. Thus, RSDG is an attractive alternative to PESC when the available computing time is very limited.
4.2 Synthetic functions in 2 and 8 input dimensions
We also compare the performance of PESC and RSDG with that of EIC (Section 2.1), using the same experimental protocol as in the previous section but with dimensionalities $D = 2$ and $D = 8$. We do not compare with RS here because its use of grids does not scale to higher dimensions. Figure 2 shows the utility gap for each method across 500 different samples of $f$ and $c$ from the GP prior with $D = 2$ (a) and $D = 8$ (b). Overall, PESC is the best method, followed by RSDG and EIC. RSDG performs similarly to PESC when $D = 2$, but is significantly worse when $D = 8$. This shows that, when $D$ is high, grid-based approaches (e.g., RSDG) are at a disadvantage with respect to methods that do not require a grid (e.g., PESC).
4.3 A toy problem
We compare PESC with EIC and AL (Section 2.2) on the toy problem described in Gramacy et al. (2014). We seek to minimize the function $f(\mathbf{x}) = x_1 + x_2$, subject to the constraint functions $c_1(\mathbf{x}) \geq 0$ and $c_2(\mathbf{x}) \geq 0$, given by
$$c_1(\mathbf{x}) = 0.5 \sin\left(2\pi\left(x_1^2 - 2x_2\right)\right) + x_1 + 2x_2 - 1.5, \tag{17}$$
$$c_2(\mathbf{x}) = 1.5 - x_1^2 - x_2^2, \tag{18}$$
where $\mathbf{x}$ is confined to the unit square. The evaluations of $f$, $c_1$ and $c_2$ are noise-free. We compare PESC and EIC using a squared exponential GP kernel. PESC uses 10 samples of $\mathbf{x}_\star$ when approximating the expectation in (7). We use the AL implementation provided by Gramacy et al. (2014) in the R package laGP, which is based on the squared exponential kernel and assumes the objective $f$ is known. Thus, in order for this implementation to be used, AL has an advantage over the other methods: it has access to the true objective function. In all three methods, the GP hyperparameters are estimated by maximum likelihood.
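For reference, the toy problem can be coded directly. The expressions below restate (17)-(18) in the $c(\mathbf{x}) \geq 0$ convention used in this paper; the sign conventions differ from Gramacy et al. (2014), so check against the original if reusing them:

```python
import numpy as np

# Toy problem of Gramacy et al. (2014), rewritten so that feasibility
# means c1(x) >= 0 and c2(x) >= 0 (signs adapted to this paper's convention).
def f(x):
    return x[0] + x[1]

def c1(x):
    return 0.5 * np.sin(2.0 * np.pi * (x[0] ** 2 - 2.0 * x[1])) + x[0] + 2.0 * x[1] - 1.5

def c2(x):
    return 1.5 - x[0] ** 2 - x[1] ** 2
```

Note that the unconstrained minimizer of $f$ on the unit square, the origin, violates $c_1$, which is what makes the problem a meaningful test of constrained search.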
Figure 2(c) shows the mean utility gap for each method across 500 independent realizations. Each realization corresponds to a different initialization of the methods with three data points selected with Latin hypercube sampling. Here, we report the mean because we are now measuring performance across realizations of the same optimization problem, so the heavy-tailed effect described in Section 4.1 is less severe. The results show that PESC is significantly better than EIC and AL for this problem. EIC is superior to AL, although AL performs slightly better at the very beginning, presumably because it has access to the ground-truth objective $f$.
4.4 Finding a fast neural network
In this experiment, we tune the hyperparameters of a three-hidden-layer neural network subject to the constraint that the prediction time must not exceed 2 ms on a GeForce GTX 580 GPU (also used for training). The search space consists of 12 parameters: 2 learning rate parameters (initial value and decay rate), 2 momentum parameters (initial and final), 2 dropout parameters (input layer and other layers), 2 other regularization parameters (weight decay and max weight norm), the number of hidden units in each of the 3 hidden layers, and the activation function (ReLU or sigmoid). The network is trained using the deepnet package (https://github.com/nitishsrivastava/deepnet), and the prediction time is computed as the average time of 1000 predictions, each for a batch of size 128. The network is trained on the MNIST digit classification task with momentum-based stochastic gradient descent for 5000 iterations. The objective is reported as the classification error rate on the validation set. As above, we treat constraint violations as the worst possible value (in this case a classification error of 1.0).
Figure 3(a) shows the results of 50 iterations of Bayesian optimization. In this experiment and the next, the vertical axis represents the best observed objective value, a Matérn 5/2 GP covariance kernel is used, and the GP hyperparameters are integrated out using slice sampling (Neal, 2000) as in Snoek et al. (2012). Curves are the mean over 5 independent experiments. We find that PESC performs significantly better than EIC. However, when the noise level is high, reporting the best objective observation is an overly optimistic metric (due to "lucky" evaluations); on the other hand, ground truth is not available. Therefore, to validate our results further, we used the recommendations made at the final iteration of Bayesian optimization for each method (EIC and PESC) and evaluated the function at these recommended parameters. We repeated the evaluation 10 times for each of the 5 repeated experiments to compute a ground-truth score averaged over 50 function evaluations. This procedure yields a better score for PESC than for EIC (as in the figure, constraint violations are treated as a classification error of 1.0). This result is consistent with Figure 3(a) in that PESC performs significantly better than EIC, but it also demonstrates that, due to noise, Figure 3(a) is overly optimistic. While we may believe this optimism to affect both methods equally, the ground-truth measurement provides a more reliable result and a much clearer understanding of the classification error attained by Bayesian optimization.
4.5 Tuning Markov chain Monte Carlo
Hybrid Monte Carlo, also known as Hamiltonian Monte Carlo (HMC), is a popular Markov chain Monte Carlo (MCMC) technique that uses gradient information in a numerical integration to select the next sample. However, the numerical integration gives rise to new parameters, such as the integration step size and the number of integration steps. Following the experimental setup in Gelbart et al. (2014), we optimize the number of effective samples produced by an HMC sampler limited to 5 minutes of computation time, subject to passing the Geweke (Geweke, 1992) and Gelman-Rubin (Gelman & Rubin, 1992) convergence diagnostics, as well as the constraint that the numerical integration should not diverge. We tune 4 parameters of the HMC sampler: the integration step size, the number of integration steps, the fraction of the allotted 5 minutes spent in burn-in, and an HMC mass parameter (see Neal, 2011). We use the coda R package (Plummer et al., 2006) to compute the effective sample size and the Geweke convergence diagnostic, and the PyMC python package (Patil et al., 2010) to compute the Gelman-Rubin diagnostic over two independent traces. Following Gelbart et al. (2014), we impose the constraints that the absolute value of the Geweke test score be at most 2.0 and the Gelman-Rubin score be at most 1.2, and we sample from the posterior distribution of a logistic regression problem using the UCI German credit data set (Frank & Asuncion, 2010).

Figure 3(b) evaluates EIC and PESC on this task, averaged over 10 independent experiments. As above, we perform a ground-truth assessment of the final recommendations, computing an average effective sample size for PESC and for EIC. From these results we draw a similar conclusion to that of Figure 3(b); namely, that PESC outperforms EIC, but only by a small margin, and furthermore that the experiment is very noisy.
5 Discussion
In this paper, we addressed global optimization with unknown constraints. Motivated by the weaknesses of existing methods, we presented PESC, a method based on the theoretically appealing expected-information-gain heuristic. We showed that the approximations in PESC are quite accurate, and that PESC performs on par with a ground-truth method based on rejection sampling. In Sections 4.2 through 4.5, we showed that PESC outperforms current methods such as EIC and AL over a variety of problems. Furthermore, PESC is easily applied to problems with decoupled constraints, without additional computational cost or the pathologies discussed in Gelbart et al. (2014).
One disadvantage of PESC is that it is relatively difficult to implement: in particular, the EP approximation often leads to numerical instabilities. Therefore, we have integrated our implementation, which carefully addresses these numerical issues, into the opensource Bayesian optimization package Spearmint at https://github.com/HIPS/Spearmint/tree/PESC. We have demonstrated that PESC is a flexible and powerful method and we hope the existence of such a method will bring constrained Bayesian optimization into the standard toolbox of Bayesian optimization practitioners.
Acknowledgements
José Miguel Hernández-Lobato acknowledges support from the Rafael del Pino Foundation. Zoubin Ghahramani acknowledges support from a Google Focused Research Award and EPSRC grant EP/I036575/1. Matthew W. Hoffman acknowledges support from EPSRC grant EP/J012300/1.
References
Frank & Asuncion (2010) Frank, Andrew and Asuncion, Arthur. UCI machine learning repository, 2010.
 Gardner et al. (2014) Gardner, Jacob R., Kusner, Matt J., Xu, Zhixiang (Eddie), Weinberger, Kilian Q., and Cunningham, John P. Bayesian optimization with inequality constraints. In ICML, 2014.
 Gelbart et al. (2014) Gelbart, Michael A., Snoek, Jasper, and Adams, Ryan P. Bayesian optimization with unknown constraints. In UAI, 2014.
 Gelman & Rubin (1992) Gelman, Andrew and Rubin, Donald B. A single series from the Gibbs sampler provides a false sense of security. In Bayesian Statistics, pp. 625–632, 1992.
 Geweke (1992) Geweke, John. Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments. In Bayesian Statistics, pp. 169–193, 1992.
 Gramacy & Lee (2011) Gramacy, Robert B. and Lee, Herbert K. H. Optimization under unknown constraints. Bayesian Statistics, 9, 2011.
 Gramacy et al. (2014) Gramacy, Robert B., Gray, Genetha A., Digabel, Sebastien Le, Lee, Herbert K. H., Ranjan, Pritam, Wells, Garth, and Wild, Stefan M. Modeling an augmented Lagrangian for improved black-box constrained optimization, 2014. arXiv:1403.4890v2 [stat.CO].
 Hennig & Schuler (2012) Hennig, Philipp and Schuler, Christian J. Entropy search for information-efficient global optimization. JMLR, 13, 2012.
 Hernández-Lobato et al. (2014) Hernández-Lobato, J. M., Hoffman, M. W., and Ghahramani, Z. Predictive entropy search for efficient global optimization of black-box functions. In NIPS, 2014.
 Houlsby et al. (2012) Houlsby, N., Hernández-Lobato, J. M., Huszar, F., and Ghahramani, Z. Collaborative Gaussian processes for preference learning. In NIPS, 2012.
 Jones et al. (1998) Jones, Donald R., Schonlau, Matthias, and Welch, William J. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455–492, 1998.
 Leobacher & Pillichshammer (2014) Leobacher, Gunther and Pillichshammer, Friedrich. Introduction to quasi-Monte Carlo integration and applications. Springer, 2014.
 Minka (2001) Minka, Thomas P. A family of algorithms for approximate Bayesian inference. PhD thesis, Massachusetts Institute of Technology, 2001.
 Mockus et al. (1978) Mockus, Jonas, Tiesis, Vytautas, and Zilinskas, Antanas. The application of Bayesian methods for seeking the extremum. Towards Global Optimization, 2, 1978.
 Neal (2000) Neal, Radford. Slice sampling. Annals of Statistics, 31:705–767, 2000.
 Neal (2011) Neal, Radford. MCMC using Hamiltonian dynamics. In Handbook of Markov Chain Monte Carlo. Chapman and Hall/CRC, 2011.
 Patil et al. (2010) Patil, Anand, Huard, David, and Fonnesbeck, Christopher. PyMC: Bayesian stochastic modelling in Python. Journal of Statistical Software, 2010.
 Picheny (2014) Picheny, Victor. A stepwise uncertainty reduction approach to constrained global optimization. In AISTATS, 2014.
 Plummer et al. (2006) Plummer, Martyn, Best, Nicky, Cowles, Kate, and Vines, Karen. CODA: Convergence diagnosis and output analysis for MCMC. R News, 6(1):7–11, 2006.
 Rasmussen & Williams (2006) Rasmussen, C. and Williams, C. Gaussian Processes for Machine Learning. MIT Press, 2006.
 Schonlau et al. (1998) Schonlau, Matthias, Welch, William J., and Jones, Donald R. Global versus local search in constrained optimization of computer models. Lecture Notes-Monograph Series, pp. 11–25, 1998.
 Snoek (2013) Snoek, Jasper. Bayesian Optimization and Semiparametric Models with Applications to Assistive Technology. PhD thesis, University of Toronto, Toronto, Canada, 2013.
 Snoek et al. (2012) Snoek, Jasper, Larochelle, Hugo, and Adams, Ryan P. Practical Bayesian optimization of machine learning algorithms. In NIPS, 2012.
 Villemonteix et al. (2009) Villemonteix, Julien, Vázquez, Emmanuel, and Walter, Eric. An informational approach to the global optimization of expensivetoevaluate functions. Journal of Global Optimization, 44(4):509–534, 2009.