1 Introduction
In Bayesian optimization (BO) [19], we wish to optimize a derivative-free, expensive-to-evaluate function $f$ with feasible domain $\mathbb{A} \subseteq \mathbb{R}^d$,
$$\min_{x \in \mathbb{A}} f(x),$$
with as few function evaluations as possible. In this paper, we assume that membership in the domain $\mathbb{A}$ is easy to evaluate and that we can evaluate $f$ only at points in $\mathbb{A}$. We assume that evaluations of $f$ are either noise-free, or have additive independent normally distributed noise. We consider the parallel setting, in which we perform more than one simultaneous evaluation of $f$.

BO typically puts a Gaussian process prior distribution on the function $f$, updating this prior distribution with each new observation of $f$, and choosing the next point or points to evaluate by maximizing an acquisition function that quantifies the benefit of evaluating the objective as a function of where it is evaluated. In comparison with other global optimization algorithms, BO often finds "near optimal" function values with fewer evaluations [19]. As a consequence, BO is useful when function evaluation is time-consuming, such as when training and testing complex machine learning algorithms (e.g., deep neural networks) or tuning algorithms on large-scale datasets (e.g., ImageNet) [4]. Recently, BO has become popular in machine learning as it is highly effective in tuning hyperparameters of machine learning algorithms [19, 22, 9, 8].

Most previous work in BO assumes that we evaluate the objective function sequentially [13], though a few recent papers have considered parallel evaluations [18, 26, 5, 3]. In practice, we can often evaluate several different choices in parallel; for example, multiple machines can simultaneously train a machine learning algorithm with different sets of hyperparameters. In this paper, we assume that we can perform $q$ evaluations simultaneously at each iteration. We then develop a new parallel acquisition function to guide where to evaluate next, based on a decision-theoretic analysis.
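To make the loop just described concrete, here is a minimal sketch of sequential BO with a GP surrogate and expected improvement as the acquisition function. The toy objective, the grid-based inner optimizer, and all names here are illustrative assumptions, and we use scikit-learn's GP rather than the implementation discussed later in the paper:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expected_improvement(x, gp, f_best):
    # EI for minimization: E[max(f_best - f(x), 0)] under the GP posterior.
    mu, std = gp.predict(x.reshape(-1, 1), return_std=True)
    std = np.maximum(std, 1e-12)
    z = (f_best - mu) / std
    return (f_best - mu) * norm.cdf(z) + std * norm.pdf(z)

def bayes_opt(objective, bounds, n_init=3, n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(*bounds, size=n_init)          # initial design
    y = np.array([objective(x) for x in X])
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    for _ in range(n_iter):
        gp.fit(X.reshape(-1, 1), y)                # update the posterior
        cand = np.linspace(*bounds, 201)           # crude grid inner optimizer
        x_next = cand[np.argmax(expected_improvement(cand, gp, y.min()))]
        X = np.append(X, x_next)                   # evaluate and record
        y = np.append(y, objective(x_next))
    return X[np.argmin(y)], y.min()

x_best, f_best = bayes_opt(lambda x: (x - 0.3) ** 2, bounds=(-1.0, 1.0))
```

The loop alternates between refitting the posterior and maximizing the acquisition function, exactly the pattern the parallel methods in this paper generalize from one point per iteration to a batch.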
Our Contributions. We propose a novel batch BO method that measures the information gain of evaluating $q$ points via a new acquisition function, the parallel knowledge gradient (qKG). This method is derived using a decision-theoretic analysis that chooses the set of points to evaluate next that is optimal in the average case with respect to the posterior when there is only one batch of points remaining. Naively maximizing qKG would be extremely computationally intensive, especially when $q$ is large, and so, in this paper, we develop a method based on infinitesimal perturbation analysis (IPA) [26] to evaluate qKG's gradient efficiently, allowing its efficient optimization. In our experiments on both synthetic functions and tuning practical machine learning algorithms, qKG consistently finds better function values than other parallel BO algorithms, such as parallel EI [19, 2, 26], batch UCB [5], and parallel UCB with exploration [3]. qKG provides especially large value when function evaluations are noisy. The code in this paper is available at https://github.com/wujian16/qKG.
The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 gives background on Gaussian processes and defines notation used later. Section 4 proposes our new acquisition function qKG for batch BO. Section 5 provides our computationally efficient approach to maximizing qKG. Section 6 presents the empirical performance of qKG and several benchmarks on synthetic functions and real problems. Finally, Section 7 concludes the paper.
2 Related work
Within the past several years, the machine learning community has revisited BO [19, 22, 9, 8, 18, 20] due to its huge success in tuning hyperparameters of complex machine learning algorithms. BO algorithms consist of two components: a statistical model describing the function and an acquisition function guiding evaluations. In practice, the Gaussian process (GP) [16] is the most widely used statistical model due to its flexibility and tractability. Much of the literature on BO focuses on designing good acquisition functions that reach optima with as few evaluations as possible. Maximizing this acquisition function usually provides a single point to evaluate next; common acquisition functions for sequential Bayesian optimization include probability of improvement (PI) [24], expected improvement (EI) [13], upper confidence bound (UCB) [21], entropy search (ES) [11], and knowledge gradient (KG) [17].

Recently, a few papers have extended BO to the parallel setting, aiming to choose a batch of points to evaluate next in each iteration, rather than just a single point. [10, 19] suggest parallelizing EI by iteratively constructing a batch, in each iteration adding the point with maximal single-evaluation EI averaged over the posterior distribution of previously selected points. [10] also proposes an algorithm called "constant liar", which iteratively constructs a batch of points to sample by maximizing single-evaluation EI while pretending that points previously added to the batch have already returned values.
There is also work extending UCB to the parallel setting. [5] proposes the GP-BUCB policy, which selects points sequentially by a UCB criterion until the batch is filled. Each time a point is selected, the algorithm updates the kernel function while keeping the mean function fixed. [3] proposes an algorithm combining UCB with pure exploration, called GP-UCB-PE. In this algorithm, the first point is selected according to a UCB criterion; the remaining points are then selected to encourage diversity within the batch. These two UCB-based algorithms do not require Monte Carlo sampling, making them fast and scalable. However, UCB criteria are usually designed to minimize cumulative regret rather than immediate regret, causing these methods to underperform in BO, where we wish to minimize simple regret.
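The greedy batch construction used by GP-BUCB can be sketched as follows: each point is chosen by a confidence-bound score, and after each choice only the posterior variance is updated, since no new function value is available yet. This is a simplified sketch, assuming a one-dimensional domain, a unit-variance squared-exponential kernel, a fixed exploration weight beta, and the maximization convention for UCB; none of these specific choices come from [5]:

```python
import numpy as np

def kernel(A, B, ls=0.5):
    # unit-variance squared-exponential kernel on scalars (illustrative)
    return np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2 / ls**2)

def posterior_var(cand, conditioned_on, noise_var=1e-6):
    # GP posterior variance at cand, conditioned on a set of input locations;
    # no function values are needed, which is what GP-BUCB exploits.
    if len(conditioned_on) == 0:
        return np.ones(len(cand))
    K = kernel(conditioned_on, conditioned_on) + noise_var * np.eye(len(conditioned_on))
    k = kernel(cand, conditioned_on)
    return 1.0 - np.sum(k * np.linalg.solve(K, k.T).T, axis=1)

def gp_bucb_batch(cand, mu_n, x_obs, q, beta=2.0):
    batch = []
    for _ in range(q):
        cond = np.concatenate([x_obs, np.array(batch)])
        sigma = np.sqrt(np.maximum(posterior_var(cand, cond), 0.0))
        batch.append(cand[np.argmax(mu_n + beta * sigma)])  # UCB score
    return np.array(batch)
```

Because each selected point collapses the posterior variance in its neighborhood, the UCB score at already-chosen locations drops, which pushes subsequent selections toward unexplored regions.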
The parallel methods above construct the batch of points in an iterative greedy fashion, optimizing some single-evaluation acquisition function while holding the other points in the batch fixed. The acquisition function we propose considers the batch of points collectively, and we choose the batch to jointly optimize this acquisition function. Other recent papers that value points collectively include [2], which optimizes parallel EI via a closed-form formula; [15, 26], in which gradient-based methods are proposed to jointly optimize a parallel EI criterion; and [18], which proposes a parallel version of the ES algorithm and uses Monte Carlo sampling to optimize the parallel ES acquisition function.
We compare against methods from a number of these previous papers in our numerical experiments, and demonstrate that we provide an improvement, especially in problems with noisy evaluations.
Our method is also closely related to the knowledge gradient (KG) method [17, 7] for the non-batch (sequential) setting, which chooses the Bayes-optimal point to evaluate if only one iteration is left [17], where the final solution that we choose is not restricted to be one of the points we evaluate. (Expected improvement is Bayes-optimal if the solution is restricted to be one of the points we evaluate.) We go beyond this previous work in two aspects. First, we generalize to the parallel setting. Second, while the sequential setting allows evaluating the KG acquisition function exactly, evaluation requires Monte Carlo sampling in the parallel setting, and so we develop more sophisticated computational techniques to optimize our acquisition function. Recently, [27] studied a nested batch knowledge gradient policy. However, they optimize over a finite discrete feasible set, where the gradient of KG does not exist. As a result, their computation of KG is much less efficient than ours. Moreover, they focus on a nesting structure from materials science not present in our setting.
3 Background on Gaussian processes
In this section, we state our prior on $f$, briefly discuss well-known results about Gaussian processes (GPs), and introduce notation used later. We put a Gaussian process prior over the function $f$, which is specified by its mean function $\mu(\cdot)$ and kernel function $K(\cdot, \cdot)$. We assume either exact or independent normally distributed measurement errors, i.e., the evaluation $y(x^{(i)})$ at point $x^{(i)}$ satisfies
$$y(x^{(i)}) \mid f(x^{(i)}) \sim \mathcal{N}\left(f(x^{(i)}), \sigma^2(x^{(i)})\right),$$
where $\sigma^2 : \mathbb{A} \to \mathbb{R}^+$ is a known function describing the variance of the measurement errors. If $\sigma^2$ is not known, we can also estimate it as we do in Section 6.

Supposing we have measured $f$ at $n$ points $x^{(1:n)} = \{x^{(1)}, \ldots, x^{(n)}\}$ and obtained corresponding measurements $y^{(1:n)}$, we can then combine these observed function values with our prior to obtain a posterior distribution on $f$. This posterior distribution is still a Gaussian process, with mean function $\mu^{(n)}$ and kernel function $K^{(n)}$ as follows:
$$\mu^{(n)}(x) = \mu(x) + K(x, x^{(1:n)}) \left(K(x^{(1:n)}, x^{(1:n)}) + \mathrm{diag}\{\sigma^2(x^{(1)}), \ldots, \sigma^2(x^{(n)})\}\right)^{-1} \left(y^{(1:n)} - \mu(x^{(1:n)})\right),$$
$$K^{(n)}(x_1, x_2) = K(x_1, x_2) - K(x_1, x^{(1:n)}) \left(K(x^{(1:n)}, x^{(1:n)}) + \mathrm{diag}\{\sigma^2(x^{(1)}), \ldots, \sigma^2(x^{(n)})\}\right)^{-1} K(x^{(1:n)}, x_2). \quad (3.1)$$
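The posterior update in (3.1) can be transcribed directly; below is a minimal numpy sketch, assuming a zero prior mean and a squared-exponential kernel on scalar inputs (both illustrative choices, not the kernel used in the experiments):

```python
import numpy as np

def kernel(A, B, lengthscale=1.0):
    # squared-exponential kernel on scalar inputs (an illustrative choice)
    return np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2 / lengthscale**2)

def gp_posterior(x_query, x_obs, y_obs, noise_var):
    """Posterior mean mu_n and kernel K_n from (3.1), with zero prior mean."""
    K_nn = kernel(x_obs, x_obs) + np.diag(np.full(len(x_obs), noise_var))
    K_qn = kernel(x_query, x_obs)
    # K(x, x^{1:n}) (K + diag sigma^2)^{-1} applied to (y^{1:n} - mu)
    mu_n = K_qn @ np.linalg.solve(K_nn, y_obs)
    K_n = kernel(x_query, x_query) - K_qn @ np.linalg.solve(K_nn, K_qn.T)
    return mu_n, K_n
```

With near-zero noise, the posterior mean interpolates the observations and the posterior variance collapses at observed points, as (3.1) predicts.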
4 Parallel knowledge gradient (qKG)
In this section, we propose a novel parallel Bayesian optimization algorithm by generalizing the concept of the knowledge gradient from [7] to the parallel setting. The knowledge gradient policy in [7] for discrete $\mathbb{A}$ chooses the next sampling decision by maximizing the expected incremental value of a measurement, without assuming (as expected improvement does) that the point returned as the optimum must be a previously sampled point.
We now show how to compute this expected incremental value of an additional iteration in the parallel setting. Suppose that we have observed $n$ function values. If we were to stop measuring now, $\min_{x \in \mathbb{A}} \mu^{(n)}(x)$ would be the minimum of the predictor of the GP. If instead we took one more batch of $q$ samples, $\min_{x \in \mathbb{A}} \mu^{(n+q)}(x)$ would be the minimum of the predictor of the GP. The difference between these quantities, $\min_{x \in \mathbb{A}} \mu^{(n)}(x) - \min_{x \in \mathbb{A}} \mu^{(n+q)}(x)$, is the increment in expected solution quality (given the posterior after $n+q$ samples) that results from the additional batch of samples.

This increment in solution quality is random given the posterior after $n$ samples, because $\mu^{(n+q)}$ is itself random due to its dependence on the outcome of the $q$ new samples. We can compute the probability distribution of this difference (with more details given below), and the
qKG algorithm values the sampling decision $z^{(1:q)}$ according to its expected value, which we call the parallel knowledge gradient factor, and indicate using the notation qKG. Formally, we define the qKG factor for a set of candidate points $z^{(1:q)}$ to sample as
$$\mathrm{qKG}(z^{(1:q)}, \mathbb{A}) = \min_{x \in \mathbb{A}} \mu^{(n)}(x) - \mathbb{E}_n\left[\min_{x \in \mathbb{A}} \mu^{(n+q)}(x) \,\middle|\, z^{(1:q)}\right], \quad (4.1)$$
where $\mathbb{E}_n$ is the expectation taken with respect to the posterior distribution after $n$ evaluations. We then choose to evaluate the next batch of $q$ points that maximizes the parallel knowledge gradient,
$$\max_{z^{(1:q)} \subset \mathbb{A}} \mathrm{qKG}(z^{(1:q)}, \mathbb{A}). \quad (4.2)$$
By construction, the parallel knowledge gradient policy is Bayes-optimal for minimizing the minimum of the predictor of the GP if only one decision is remaining. The qKG algorithm reduces to the parallel EI algorithm if function evaluations are noise-free and the final recommendation is restricted to the previous sampling decisions, because under these two conditions the increment in expected solution quality becomes
$$\min_{1 \le i \le n} f(x^{(i)}) - \mathbb{E}_n\left[\min\left(\min_{1 \le i \le n} f(x^{(i)}),\; f(z^{(1)}), \ldots, f(z^{(q)})\right)\right],$$
which is exactly the parallel EI acquisition function. However, computing qKG and its gradient is very expensive. We address these computational issues in Section 5. In summary, the qKG algorithm iterates the following steps: update the posterior with all observations so far, choose the next batch by maximizing (4.2), and evaluate the chosen batch; when the evaluation budget is exhausted, it recommends the point with the smallest posterior mean.
5 Computation of qKG
In this section, we provide a strategy to maximize qKG with a gradient-based optimizer. In Sections 5.1 and 5.2, we describe how to compute qKG and its gradient when $\mathbb{A}$ is finite in (4.1). Section 5.3 describes an effective way to discretize $\mathbb{A}$ in (4.1). The reader should note that there are two $\mathbb{A}$s here: one is the $\mathbb{A}$ in (4.1), which is used to compute the qKG factor given a sampling decision $z^{(1:q)}$; the other is the feasible domain in (A.1) that we optimize over. We discretize the first $\mathbb{A}$.
5.1 Estimating qKG when $\mathbb{A}$ is finite in (4.1)
Following [7], we express $\mu^{(n+q)}$ as
$$\mu^{(n+q)}(x) = \mu^{(n)}(x) + K^{(n)}(x, z^{(1:q)}) \left(K^{(n)}(z^{(1:q)}, z^{(1:q)}) + \mathrm{diag}\{\sigma^2(z^{(1)}), \ldots, \sigma^2(z^{(q)})\}\right)^{-1} \left(y(z^{(1:q)}) - \mu^{(n)}(z^{(1:q)})\right).$$
Because $y(z^{(1:q)}) - \mu^{(n)}(z^{(1:q)})$ is normally distributed with zero mean and covariance matrix $K^{(n)}(z^{(1:q)}, z^{(1:q)}) + \mathrm{diag}\{\sigma^2(z^{(1)}), \ldots, \sigma^2(z^{(q)})\}$ with respect to the posterior after $n$ observations, we can rewrite $\mu^{(n+q)}$ as
$$\mu^{(n+q)}(x) = \mu^{(n)}(x) + \tilde{\sigma}_n(x, z^{(1:q)})\, Z_q, \quad (5.1)$$
where $Z_q$ is a standard $q$-dimensional normal random vector, and
$$\tilde{\sigma}_n(x, z^{(1:q)}) = K^{(n)}(x, z^{(1:q)}) \left(D_n(z^{(1:q)})^{\mathrm{T}}\right)^{-1},$$
where $D_n(z^{(1:q)})$ is the Cholesky factor of the covariance matrix $K^{(n)}(z^{(1:q)}, z^{(1:q)}) + \mathrm{diag}\{\sigma^2(z^{(1)}), \ldots, \sigma^2(z^{(q)})\}$. Now we can compute the qKG factor using Monte Carlo sampling when $\mathbb{A}$ is finite: we sample $Z_q$, compute (5.1), plug it into (4.1), repeat many times, and take the average.
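The Monte Carlo procedure just described can be sketched directly: given the posterior mean $\mu^{(n)}$ on a finite $\mathbb{A}$ and the matrix $\tilde{\sigma}_n$ (one row per point of $\mathbb{A}$, one column per point of $z^{(1:q)}$), sample $Z_q$, form the updated means via (5.1), and average. The inputs below are illustrative stand-ins rather than outputs of a fitted GP:

```python
import numpy as np

def qkg_factor(mu_n, sigma_tilde, n_samples=10000, seed=0):
    # mu_n: posterior means over a finite A (length m)
    # sigma_tilde: m x q matrix from (5.1)
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n_samples, sigma_tilde.shape[1]))  # Z_q ~ N(0, I_q)
    updated = mu_n[None, :] + Z @ sigma_tilde.T   # one mu^{(n+q)} scenario per row
    return mu_n.min() - updated.min(axis=1).mean()
```

When $\tilde{\sigma}_n$ is zero the batch is uninformative and the factor is zero; any informative batch makes the expected posterior minimum drop below the current one, so the factor is positive.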
5.2 Estimating the gradient of qKG when $\mathbb{A}$ is finite in (4.1)
In this section, we propose an unbiased estimator of the gradient of qKG using IPA when $\mathbb{A}$ is finite. Access to a stochastic gradient makes optimization much easier. By (5.1), we can express qKG as
$$\mathrm{qKG}(z^{(1:q)}, \mathbb{A}) = \mathbb{E}_n\left[\min_{x \in \mathbb{A}} \mu^{(n)}(x) - \min_{x \in \mathbb{A}} \hat{g}(x, z^{(1:q)}, Z_q)\right], \quad (5.2)$$
where $\hat{g}(x, z^{(1:q)}, Z_q) = \mu^{(n)}(x) + \tilde{\sigma}_n(x, z^{(1:q)})\, Z_q$. Under the condition that $\mu$ and $K$ are continuously differentiable, one can show that (please see the details in the supplementary materials)
$$\frac{\partial\, \mathrm{qKG}}{\partial z_{ij}}(z^{(1:q)}, \mathbb{A}) = -\mathbb{E}_n\left[\frac{\partial}{\partial z_{ij}}\, \hat{g}\left(x^*(Z_q), z^{(1:q)}, Z_q\right)\right], \quad (5.3)$$
where $z_{ij}$ is the $j$th dimension of the $i$th point in $z^{(1:q)}$, and $x^*(Z_q) = \arg\min_{x \in \mathbb{A}} \hat{g}(x, z^{(1:q)}, Z_q)$. By the formula for $\hat{g}$,
$$\frac{\partial \hat{g}}{\partial z_{ij}}(x, z^{(1:q)}, Z_q) = \frac{\partial \mu^{(n)}}{\partial z_{ij}}(x) + \frac{\partial \tilde{\sigma}_n}{\partial z_{ij}}(x, z^{(1:q)})\, Z_q,$$
where the first term is nonzero only when $x$ coincides with one of the points in $z^{(1:q)}$, and the derivative of $\tilde{\sigma}_n$ follows from differentiating $K^{(n)}$ and the Cholesky factor $D_n$.
Now we can sample $Z_q$ many times and take the average to estimate the gradient of qKG via (5.3). This technique is called infinitesimal perturbation analysis (IPA) in the gradient estimation literature [14]. Since we can estimate the gradient of qKG efficiently when $\mathbb{A}$ is finite, we apply standard gradient-based optimization algorithms, such as multi-start stochastic gradient ascent, to maximize qKG.
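To illustrate the interchange of derivative and expectation that IPA relies on, consider the toy problem of differentiating $h(z) = \mathbb{E}[\min(a, z + Z)]$ with $Z \sim \mathcal{N}(0,1)$: the pathwise derivative is the indicator $1\{z + Z < a\}$, and averaging it recovers the exact gradient $\Phi(a - z)$. This is a deliberately simplified stand-in for the estimator in (5.3), not the paper's estimator itself:

```python
import numpy as np
from scipy.stats import norm

def ipa_gradient(z, a=0.0, n_samples=200000, seed=1):
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal(n_samples)
    # pathwise derivative of min(a, z + Z) w.r.t. z: 1 when z + Z < a, else 0
    return np.mean(z + Z < a)

est = ipa_gradient(0.5)
exact = norm.cdf(-0.5)  # Phi(a - z) with a = 0, z = 0.5
```

As in (5.3), the minimum is differentiated along each sample path (here, through whichever branch of the min is active), and the average of those pathwise derivatives is an unbiased gradient estimate.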
5.3 Approximating qKG when $\mathbb{A}$ is infinite in (4.1) through discretization
We have specified how to maximize qKG when $\mathbb{A}$ is finite in (4.1), but usually $\mathbb{A}$ is infinite. In this case, we discretize $\mathbb{A}$ to approximate qKG, and then maximize the approximate qKG. The discretization itself is an interesting research topic [17].
In this paper, the discrete set is not chosen statically, but evolves over time: specifically, we suggest drawing samples of the global optimum of the posterior distribution of the Gaussian process (please refer to [11, 18] for a description of this technique). This sample set is then extended by the locations of previously sampled points $x^{(1:n)}$ and the set of candidate points $z^{(1:q)}$; we denote the resulting discretization by $\mathbb{A}_n$. Then (4.1) can be restated as
$$\mathrm{qKG}(z^{(1:q)}, \mathbb{A}_n) = \min_{x \in \mathbb{A}_n} \mu^{(n)}(x) - \mathbb{E}_n\left[\min_{x \in \mathbb{A}_n} \mu^{(n+q)}(x) \,\middle|\, z^{(1:q)}\right], \quad (5.4)$$
where $\mathbb{A}_n$ is the discrete set described above. For the experimental evaluation, we recompute $\mathbb{A}_n$ in every iteration after updating the posterior of the Gaussian process.
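One way the construction of the discretization set described above can be sketched: draw sample paths from the GP posterior on a dense grid, record each path's minimizer, and take the union with the previously sampled points and the candidate batch. The grid, the number of paths, and the placeholder posterior are all assumptions for illustration; the technique referenced in the text samples approximate global optima of the posterior rather than grid minimizers:

```python
import numpy as np

def build_discretization(grid, mu_n, K_n, x_prev, z_batch, n_paths=20, seed=0):
    rng = np.random.default_rng(seed)
    jitter = 1e-8 * np.eye(len(grid))
    # each row is one sample path of the posterior evaluated on the grid
    paths = rng.multivariate_normal(mu_n, K_n + jitter, size=n_paths)
    optima = grid[np.argmin(paths, axis=1)]   # one (approximate) minimizer per path
    return np.unique(np.concatenate([optima, x_prev, z_batch]))
```

The resulting set concentrates near where the posterior believes the optimum lies, while still containing the points whose values the batch decision directly affects.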
6 Numerical experiments
We conduct experiments in two different settings: the noise-free setting and the noisy setting. In both settings, we test the algorithms on well-known synthetic functions chosen from [1] and on practical problems. Following previous literature [19], we use a constant mean prior and the ARD Matérn kernel. In the noisy setting, we assume that $\sigma^2$ is constant across the domain $\mathbb{A}$, and we estimate it together with the other hyperparameters in the GP using maximum likelihood estimation (MLE). We discretize the domain following the strategy in Section 5.3. In general, the qKG algorithm performs as well as or better than state-of-the-art benchmark algorithms on both synthetic and real problems. It performs especially well in the noisy setting.
Before describing the details of the empirical results, we highlight the implementation details of our method and the open-source implementations of the benchmark methods. Our implementation inherits the open-source implementation of parallel EI from the Metrics Optimization Engine [25], which is fully implemented in C++ with a Python interface. We reuse their GP regression and GP hyperparameter fitting methods and implement the qKG method in C++. Besides comparing to parallel EI in [25], we also compare our method to a well-known heuristic parallel EI implemented in Spearmint [12], and to the parallel UCB algorithm (GP-BUCB) and parallel UCB with pure exploration (GP-UCB-PE), both implemented in Gpoptimization [6].

6.1 Noise-free problems
In this section, we focus our attention on the noise-free setting, in which we can evaluate the objective exactly. We show that the parallel knowledge gradient outperforms or is competitive with state-of-the-art benchmarks on several well-known test functions and in tuning practical machine learning algorithms.
6.1.1 Synthetic functions
First, we test our algorithm along with the benchmarks on 4 well-known synthetic test functions: Branin2, Rosenbrock3, Ackley5, and Hartmann6, where the suffix denotes the dimension of each problem. We initialize our algorithms by randomly sampling points from a Latin hypercube design, with the number of initial points scaling with the dimension of the problem. Figure 5 reports the mean and the standard deviation of the base-10 logarithm of the immediate regret over multiple random initializations with a fixed batch size. The results show that qKG is significantly better on Rosenbrock3, Ackley5, and Hartmann6, and is slightly worse than the best of the other benchmarks on Branin2. Especially on Rosenbrock3 and Ackley5, qKG makes dramatic progress in early iterations.
6.1.2 Tuning logistic regression and convolutional neural networks (CNN)
In this section, we test the algorithms on two practical problems: tuning logistic regression on the MNIST dataset and tuning a CNN on the CIFAR10 dataset. We fix the batch size across these experiments.

First, we tune logistic regression on the MNIST dataset. This task is to classify handwritten digits from images, and is a 10-class classification problem. We train logistic regression on a training set of 60,000 instances with a given set of hyperparameters and test it on a test set of 10,000 instances. We tune 4 hyperparameters: the mini-batch size, the number of training iterations, the regularization parameter, and the learning rate. We report the mean and standard deviation of the test error over independent runs. From the results, one can see that both algorithms make progress in the initial stage, while qKG maintains this progress for longer and in general results in a better algorithm configuration.

In the second experiment, we tune a CNN on the CIFAR10 dataset. This is also a 10-class classification problem. We train the CNN on the training data with given hyperparameters and test it on the test set of 10,000 instances. For the network architecture, we choose the one in the TensorFlow tutorial: it consists of convolutional layers and fully connected layers, with a softmax layer on top for the final classification. We tune 8 hyperparameters in total: the mini-batch size, the number of training epochs, the regularization parameter, the learning rate, the kernel size, the number of channels in the convolutional layers, the number of hidden units in the fully connected layers, and the dropout rate. We report the mean and standard deviation of the test error over independent runs. In this example, qKG makes better (more aggressive) progress than parallel EI even in the initial stage and maintains this advantage to the end. This architecture had been carefully tuned by a human expert, and our automatic algorithm further improves its test error.

6.2 Noisy problems
In this section, we study problems with noisy function evaluations. Our results show that the performance gains of qKG over benchmark algorithms, evident in the noise-free setting, are even larger in the noisy setting.
6.2.1 Noisy synthetic functions
We test on the same synthetic functions from the noise-free setting, adding independent Gaussian noise to each function evaluation. The algorithms are not given the standard deviation of this noise, and must learn it from data.
The results in Figure 4 show that qKG is consistently better than or at least competitive with all competing methods. Also observe that the performance advantage of qKG is larger than for noise-free problems.
6.2.2 Noisy logistic regression with small test sets
Testing on a large test set such as ImageNet is slow, especially when we must test many times for different hyperparameters. To speed up hyperparameter tuning, we may instead test the algorithm on a subset of the testing data to approximate the test error on the full set. We study the performance of our algorithm and the benchmarks in this scenario, focusing on tuning logistic regression on MNIST. We train logistic regression on the full training set, but we test the algorithm on a small set of randomly selected samples from the test set, which provides a noisy approximation of the test error on the full test set.
We report the mean and standard deviation of the test error on the full test set, using the hyperparameters recommended by each parallel BO algorithm, over independent runs. The results show that qKG is better than both versions of parallel EI, and its final test error is close to the noise-free test error (which is substantially more expensive to obtain). As we saw with the synthetic test functions, qKG's performance advantage in the noisy setting is wider than in the noise-free setting.
Acknowledgments
The authors were partially supported by NSF CAREER CMMI-1254298, NSF CMMI-1536895, NSF IIS-1247696, AFOSR FA9550-12-1-0200, AFOSR FA9550-15-1-0038, and AFOSR FA9550-16-1-0046.
7 Conclusions
In this paper, we introduce a novel batch Bayesian optimization method, qKG, derived from a decision-theoretic perspective, and develop a computational method to implement it efficiently. We show that qKG outperforms or is competitive with state-of-the-art benchmark algorithms on several synthetic functions and in tuning practical machine learning algorithms.
Supplementary Material
Appendix A Asynchronous qKG Optimization
Equation (A.1) below corresponds to synchronous qKG optimization, in which we wait for all points from our previous batch to finish before searching for a new batch of points:
$$\max_{z^{(1:q)} \subset \mathbb{A}} \mathrm{qKG}(z^{(1:q)}, \mathbb{A}_n). \quad (\mathrm{A.1})$$
However, in some applications, we may wish to generate a new batch of points to evaluate next while other points are still being evaluated, before we have their values. This is common in training machine learning algorithms, where different machine learning models do not necessarily finish at the same time.

We can generalize (A.1) to asynchronous qKG optimization. Given that some points are still under evaluation, we would like to recommend a new batch of points to evaluate. As we did for synchronous qKG optimization above, we estimate the qKG of the combined set of pending and new points, but take the gradient only with respect to the points that we need to recommend. Then we proceed in the same way via gradient-based algorithms.
Appendix B Speedup analysis
Next, we compare qKG at different levels of parallelism against the fully sequential KG algorithm. We test the algorithms with different batch sizes $q$ on two noisy synthetic functions, Branin2 and Hartmann6, with independent Gaussian observation noise of fixed standard deviation. The results show that our parallel knowledge gradient method does provide a speedup as $q$ goes up.
Appendix C The unbiasedness of the stochastic gradient estimator
Recall that in Section 5 of the main document, we expressed the qKG factor as follows:
$$\mathrm{qKG}(z^{(1:q)}, \mathbb{A}) = \mathbb{E}_n\left[\min_{x \in \mathbb{A}} \mu^{(n)}(x) - \min_{x \in \mathbb{A}} \hat{g}(x, z^{(1:q)}, Z_q)\right], \quad (\mathrm{C.1})$$
where the expectation is taken over $Z_q$, and
$$\hat{g}(x, z^{(1:q)}, Z_q) = \mu^{(n)}(x) + \tilde{\sigma}_n(x, z^{(1:q)})\, Z_q.$$
The main purpose of this section is to prove the following proposition.
Proposition 1.
When $\mathbb{A}$ is finite, under the condition that $\mu$ and $K$ are continuously differentiable,
$$\frac{\partial\, \mathrm{qKG}}{\partial z_{ij}}(z^{(1:q)}, \mathbb{A}) = -\mathbb{E}_n\left[\frac{\partial}{\partial z_{ij}}\, \hat{g}\left(x^*(Z_q), z^{(1:q)}, Z_q\right)\right], \quad (\mathrm{C.2})$$
where $x^*(Z_q) = \arg\min_{x \in \mathbb{A}} \hat{g}(x, z^{(1:q)}, Z_q)$, and $z_{ij}$ is the $j$th dimension of the $i$th point in $z^{(1:q)}$.
Without loss of generality, we assume that (1) $Z_q$ is fixed in advance and (2) all components of $z^{(1:q)}$ other than $z_{ij}$ are fixed; we would like to prove that (C.2) is correct. Before proceeding, we define one more piece of notation: let $z^{(1:q)}(\theta)$ equal $z^{(1:q)}$ componentwise except that $z_{ij}$ is replaced by $\theta$. To prove the result, we cite Theorem 1 in [14], which requires three conditions for (C.2) to be valid: there exists an open neighborhood $\Theta$ of $z_{ij}$ such that (i) $\min_{x \in \mathbb{A}} \hat{g}(x, z^{(1:q)}(\theta), Z_q)$ is continuous in $\theta$ for any fixed $Z_q$; (ii) it is differentiable except on a denumerable set in $\Theta$ for any given $Z_q$; and (iii) its derivative (when it exists) is uniformly bounded over $\Theta$ by a quantity whose expectation is finite.
C.1 The proof of condition (i)
Under the condition that the mean function $\mu$ and the kernel function $K$ are continuously differentiable, we see that, for any given $x$, $\hat{g}(x, z^{(1:q)}, Z_q)$ is continuously differentiable in $z^{(1:q)}$, by the result that multiplication, matrix inversion (when the inverse exists), and the Cholesky operator [Smith, 1995] preserve continuous differentiability. When $\mathbb{A}$ is finite, we see that $\min_{x \in \mathbb{A}} \hat{g}(x, z^{(1:q)}, Z_q)$ is continuous in $z^{(1:q)}$. It is then also continuous in $\theta$ by the definition of the function $z^{(1:q)}(\theta)$.
C.2 The proof of condition (ii)
By the expression for $\hat{g}$, if the minimizer $\arg\min_{x \in \mathbb{A}} \hat{g}(x, z^{(1:q)}(\theta), Z_q)$ is unique, then $\min_{x \in \mathbb{A}} \hat{g}(x, z^{(1:q)}(\theta), Z_q)$ is differentiable at $\theta$. We define $B$ to be the set of $\theta$ at which it is not differentiable; ties producing non-differentiability can occur between two competing points of $\mathbb{A}$, which may or may not coincide with points of $z^{(1:q)}$. As $\mathbb{A}$ is finite, we only need to show that the set of $\theta$ producing each such tie is denumerable.
Defining $\delta(\theta)$ on $\Theta$ to be the difference between two competing values, one can see that $\delta$ is continuously differentiable on $\Theta$. We would like to show that its zero set $B_1$ is denumerable. To prove this, we will show that $B_1$ contains only isolated points; one can then use a theorem in real analysis: any set of isolated points in $\mathbb{R}$ is denumerable (see the proof of statement 4.2.25 on page 165 in [23]). To prove that $B_1$ contains only isolated points, we use the definition of an isolated point: $\theta_0$ is an isolated point of $B_1$ if and only if $\theta_0$ is not a limit point of $B_1$. We prove this by contradiction: suppose that $\theta_0$ is a limit point of $B_1$; then there exists a sequence of points $\theta_k$, all belonging to $B_1$, such that $\theta_k \to \theta_0$. However, by the definition of the derivative and the properties of $\delta$, this yields a contradiction. So we conclude that $B_1$ contains only isolated points, and hence is denumerable.
Defining the analogous difference function for the other kind of tie on $\Theta$, it is also continuously differentiable on $\Theta$, and one can similarly prove that its zero set $B_2$ is denumerable.
C.3 The proof of condition (iii)
Recall from Section 5 of the main document that
$$\frac{\partial \hat{g}}{\partial z_{ij}}(x, z^{(1:q)}, Z_q) = \frac{\partial \mu^{(n)}}{\partial z_{ij}}(x) + \frac{\partial \tilde{\sigma}_n}{\partial z_{ij}}(x, z^{(1:q)})\, Z_q.$$
Using the fact that $\mu^{(n)}$ is continuously differentiable and the closure of $\Theta$ is compact, the first term is bounded by some constant. By the result that the derivative of $\tilde{\sigma}_n$ is continuous, it is bounded componentwise on the compact closure of $\Theta$, so the second term is bounded by a constant multiple of $\|Z_q\|$. Hence the derivative of $\hat{g}$, when it exists, is uniformly bounded on $\Theta$ by a bound of the form $c_1 + c_2\|Z_q\|$, whose expectation is finite.
Appendix D The convergence of stochastic gradient ascent
In this section, we will prove that SGA converges to a stationary point. We follow the same idea of proving the Theorem 2 in [26].
First, it requires the step size satisfying as , and
. Second, it requires the second moment of the gradient estimator is finite. In the above section 1.3, we have show that
, then .References
 Bingham [2015] Bingham, D. (2015). Optimization test problems. http://www.sfu.ca/~ssurjano/optimization.html.
 Chevalier and Ginsbourger [2013] Chevalier, C. and Ginsbourger, D. (2013). Fast computation of the multipoints expected improvement with applications in batch selection. In Learning and Intelligent Optimization, pages 59–69. Springer.
 Contal et al. [2013] Contal, E., Buffoni, D., Robicquet, A., and Vayatis, N. (2013). Parallel gaussian process optimization with upper confidence bound and pure exploration. In Machine Learning and Knowledge Discovery in Databases, pages 225–240. Springer.
 Deng and Yu [2014] Deng, L. and Yu, D. (2014). Deep learning: Methods and applications. Foundations and Trends in Signal Processing, 7(3–4):197–387.
 Desautels et al. [2014] Desautels, T., Krause, A., and Burdick, J. W. (2014). Parallelizing explorationexploitation tradeoffs in gaussian process bandit optimization. The Journal of Machine Learning Research, 15(1):3873–3923.
 Contal et al. [2015] Contal, E. et al. (2015). Gpoptimization. https://reine.cmla.ens-cachan.fr/e.contal/gpoptimization.
 Frazier et al. [2009] Frazier, P., Powell, W., and Dayanik, S. (2009). The knowledgegradient policy for correlated normal beliefs. INFORMS journal on Computing, 21(4):599–613.
 Gardner et al. [2014] Gardner, J. R., Kusner, M. J., Xu, Z. E., Weinberger, K. Q., and Cunningham, J. (2014). Bayesian optimization with inequality constraints. In ICML, pages 937–945.

 Gelbart et al. [2014] Gelbart, M., Snoek, J., and Adams, R. (2014). Bayesian optimization with unknown constraints. In Proceedings of the Thirtieth Annual Conference on Uncertainty in Artificial Intelligence (UAI-14), pages 250–259, Corvallis, Oregon. AUAI Press.
 Ginsbourger et al. [2010] Ginsbourger, D., Le Riche, R., and Carraro, L. (2010). Kriging is well-suited to parallelize optimization. In Computational Intelligence in Expensive Optimization Problems, pages 131–162. Springer.
 HernándezLobato et al. [2014] HernándezLobato, J. M., Hoffman, M. W., and Ghahramani, Z. (2014). Predictive entropy search for efficient global optimization of blackbox functions. In Advances in Neural Information Processing Systems, pages 918–926.
 Snoek et al. [2015] Snoek, J., Larochelle, H., and Adams, R. P. (2015). Spearmint. https://github.com/HIPS/Spearmint.
 Jones et al. [1998] Jones, D. R., Schonlau, M., and Welch, W. J. (1998). Efficient global optimization of expensive blackbox functions. Journal of Global optimization, 13(4):455–492.
 L’Ecuyer [1990] L’Ecuyer, P. (1990). A unified view of the IPA, SF, and LR gradient estimation techniques. Management Science, 36(11):1364–1383.
 Marmin et al. [2015] Marmin, S., Chevalier, C., and Ginsbourger, D. (2015). Differentiating the multipoint expected improvement for optimal batch design. In International Workshop on Machine Learning, Optimization and Big Data, pages 37–48. Springer.
 Rasmussen [2006] Rasmussen, C. E. (2006). Gaussian processes for machine learning.
 Scott et al. [2011] Scott, W., Frazier, P., and Powell, W. (2011). The correlated knowledge gradient for simulation optimization of continuous parameters using gaussian process regression. SIAM Journal on Optimization, 21(3):996–1026.
 Shah and Ghahramani [2015] Shah, A. and Ghahramani, Z. (2015). Parallel predictive entropy search for batch global optimization of expensive objective functions. In Advances in Neural Information Processing Systems, pages 3312–3320.
 Snoek et al. [2012] Snoek, J., Larochelle, H., and Adams, R. P. (2012). Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pages 2951–2959.
 Snoek et al. [2014] Snoek, J., Swersky, K., Zemel, R., and Adams, R. (2014). Input warping for bayesian optimization of nonstationary functions. In Proceedings of the 31st International Conference on Machine Learning (ICML14), pages 1674–1682.
 Srinivas et al. [2010] Srinivas, N., Krause, A., Seeger, M., and Kakade, S. M. (2010). Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the 27th International Conference on Machine Learning (ICML10), pages 1015–1022.
 Swersky et al. [2013] Swersky, K., Snoek, J., and Adams, R. P. (2013). Multitask bayesian optimization. In Advances in neural information processing systems, pages 2004–2012.
 Thomson et al. [2008] Thomson, B. S., Bruckner, J. B., and Bruckner, A. M. (2008). Elementary real analysis. ClassicalRealAnalysis. com.
 Törn and Zilinskas [1989] Törn, A. and Zilinskas, A. (1989). Global optimization, volume 350 of lecture notes in computer science.
 Wang et al. [2014] Wang, J., Clark, S. C., Liu, E., and Frazier, P. I. (2014). Metrics optimization engine. http://yelp.github.io/MOE/. Last accessed on 20160121.
 Wang et al. [2015a] Wang, J., Clark, S. C., Liu, E., and Frazier, P. I. (2015a). Parallel bayesian global optimization of expensive functions.
 Wang et al. [2015b] Wang, Y., Reyes, K. G., Brown, K. A., Mirkin, C. A., and Powell, W. B. (2015b). Nestedbatchmode learning and stochastic optimization with an application to sequential multistage testing in materials science. SIAM Journal on Scientific Computing, 37(3):B361–B381.