In Bayesian optimization  (BO), we wish to optimize a derivative-free expensive-to-evaluate function with feasible domain ,
with as few function evaluations as possible. In this paper, we assume that membership in the domain is easy to evaluate and we can evaluate only at points in . We assume that evaluations of
are either noise-free, or have additive independent normally distributed noise. We consider the parallel setting, in which we perform more than one simultaneous evaluation of.
BO typically puts a Gaussian process prior distribution on the function , updating this prior distribution with each new observation of , and choosing the next point or points to evaluate by maximizing an acquisition function that quantifies the benefit of evaluating the objective as a function of where it is evaluated. In comparison with other global optimization algorithms, BO often finds “near optimal” function values with fewer evaluations 
. As a consequence, BO is useful when function evaluation is time-consuming, such as when training and testing complex machine learning algorithms (e.g. deep neural networks) or tuning algorithms on large-scale dataset (e.g. ImageNet). Recently, BO has become popular in machine learning as it is highly effective in tuning hyperparameters of machine learning algorithms [19, 22, 9, 8].
Most previous work in BO assumes that we evaluate the objective function sequentially , though a few recent papers have considered parallel evaluations [18, 26, 5, 3]. While in practice, we can often evaluate several different choices in parallel, such as multiple machines can simultaneously train the machine learning algorithm with different sets of hyperparameters. In this paper, we assume that we can access evaluations simultaneously at each iteration. Then we develop a new parallel acquisition function to guide where to evaluate next based on the decision-theoretical analysis.
Our Contributions. We propose a novel batch BO method which measures the information gain of evaluating points via a new acquisition function, the parallel knowledge gradient (q-KG). This method is derived using a decision-theoretic analysis that chooses the set of points to evaluate next that is optimal in the average-case with respect to the posterior when there is only one batch of points remaining. Naively maximizing q-KG would be extremely computationally intensive, especially when is large, and so, in this paper, we develop a method based on infinitesimal perturbation analysis (IPA)  to evaluate q-KG’s gradient efficiently, allowing its efficient optimization. In our experiments on both synthetic functions and tuning practical machine learning algorithms, q-KG consistently finds better function values than other parallel BO algorithms, such as parallel EI [19, 2, 26], batch UCB  and parallel UCB with exploration . q-KG provides especially large value when function evaluations are noisy. The code in this paper is available at https://github.com/wujian16/qKG.
The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 gives background on Gaussian processes and defines notation used later. Section 4 proposes our new acquisition function q-KG for batch BO. Section 5 provides our computationally efficient approach to maximizing q-KG. Section 6 presents the empirical performance of q-KG and several benchmarks on synthetic functions and real problems. Finally, Section 7 concludes the paper.
2 Related work
Within the past several years, the machine learning community has revisited BO [19, 22, 9, 8, 18, 20] due to its huge success in tuning hyperparameters of complex machine learning algorithms. BO algorithms consist of two components: a statistical model describing the function and an acquisition function guiding evaluations. In practice, Gaussian Process (GP) 
is the mostly widely used statistical model due to its flexibility and tractability. Much of the literature in BO focuses on designing good acquisition functions that reach optima with as few evaluations as possible. Maximizing this acquisition function usually provides a single point to evaluate next, with common acquisition functions for sequential Bayesian optimization including probability of improvement (PI), expected improvement (EI) , upper confidence bound (UCB) , entropy search (ES) , and knowledge gradient (KG) .
Recently, a few papers have extended BO to the parallel setting, aiming to choose a batch of points to evaluate next in each iteration, rather than just a single point. [10, 19] suggests parallelizing EI by iteratively constructing a batch, in each iteration adding the point with maximal single-evaluation EI averaged over the posterior distribution of previously selected points.  also proposes an algorithm called “constant liar", which iteratively constructs a batch of points to sample by maximizing single-evaluation while pretending that points previously added to the batch have already returned values.
There are also work extending UCB to the parallel setting.  proposes the GP-BUCB policy, which selects points sequentially by a UCB criterion until filling the batch. Each time one point is selected, the algorithm updates the kernel function while keeping the mean function fixed.  proposes an algorithm combining UCB with pure exploration, called GP-UCB-PE. In this algorithm, the first point is selected according to a UCB criterion; then the remaining points are selected to encourage the diversity of the batch. These two algorithms extending UCB do not require Monte Carlo sampling, making them fast and scalable. However, UCB criteria are usually designed to minimize cumulative regret rather than immediate regret, causing these methods to underperform in BO, where we wish to minimize simple regret.
The parallel methods above construct the batch of points in an iterative greedy fashion, optimizing some single-evaluation acquisition function while holding the other points in the batch fixed. The acquisition function we propose considers the batch of points collectively, and we choose the batch to jointly optimize this acquisition function. Other recent papers that value points collectively include  which optimizes the parallel EI by a closed-form formula, [15, 26], in which gradient-based methods are proposed to jointly optimize a parallel EI criterion, and , which proposes a parallel version of the ES algorithm and uses Monte Carlo Sampling to optimize the parallel ES acquisition function.
We compare against methods from a number of these previous papers in our numerical experiments, and demonstrate that we provide an improvement, especially in problems with noisy evaluations.
Our method is also closely related to the knowledge gradient (KG) method [17, 7] for the non-batch (sequential) setting, which chooses the Bayes-optimal point to evaluate if only one iteration is left , and the final solution that we choose is not restricted to be one of the points we evaluate. (Expected improvement is Bayes-optimal if the solution is restricted to be one of the points we evaluate.) We go beyond this previous work in two aspects. First, we generalize to the parallel setting. Second, while the sequential setting allows evaluating the KG acquisition function exactly, evaluation requires Monte Carlo in the parallel setting, and so we develop more sophisticated computational techniques to optimize our acquisition function. Recently,  studies a nested batch knowledge gradient policy. However, they optimize over a finite discrete feasible set, where the gradient of KG does not exist. As a result, their computation of KG is much less efficient than ours. Moreover, they focus on a nesting structure from materials science not present in our setting.
3 Background on Gaussian processes
In this section, we state our prior on , briefly discuss well known results about Gaussian processes (GP), and introduce notation used later. We put a Gaussian process prior over the function , which is specified by its mean function and kernel function . We assume either exact or independent normally distributed measurement errors, i.e. the evaluation at point satisfies
is a known function describing the variance of the measurement errors. If
is not known, we can also estimate it as we do in Section6.
Supposing we have measured at points and obtained corresponding measurements , we can then combine these observed function values with our prior to obtain a posterior distribution on . This posterior distribution is still a Gaussian process with the mean function and the kernel function as follows
4 Parallel knowledge gradient (q-KG)
In this section, we propose a novel parallel Bayesian optimization algorithm by generalizing the concept of the knowledge gradient from  to the parallel setting. The knowledge gradient policy in  for discrete chooses the next sampling decision by maximizing the expected incremental value of a measurement, without assuming (as expected improvement does) that the point returned as the optimum must be a previously sampled point.
We now show how to compute this expected incremental value of an additional iteration in the parallel setting. Suppose that we have observed function values. If we were to stop measuring now, would be the minimum of the predictor of the GP. If instead we took one more batch of samples, would be the minimum of the predictor of the GP. The difference between these quantities, , is the increment in expected solution quality (given the posterior after samples) that results from the additional batch of samples.
This increment in solution quality is random given the posterior after samples, becauseq-KG algorithm values the sampling decision according to its expected value, which we call the parallel knowledge gradient factor, and indicate it using the notation q-KG. Formally, we define the q-KG factor for a set of candidate points to sample as
where is the expectation taken with respect to the posterior distribution after evaluations. Then we choose to evaluate the next batch of points that maximizes the parallel knowledge gradient,
By construction, the parallel knowledge gradient policy is Bayes-optimal for minimizing the minimum of the predictor of the GP if only one decision is remaining. The q-KG algorithm will reduce to the parallel EI algorithm if function evaluations are noise-free and the final recommendation is restricted to the previous sampling decisions. Because under the two conditions above, the increment in expected solution quality will become
which is exactly the parallel EI acquisition function. However, computing q-KG and its gradient is very expensive. We will address the computational issues in Section 5. The full description of the q-KG algorithm is summarized as follows.
5 Computation of q-KG
In this section, we provide the strategy to maximize q-KG by a gradient-based optimizer. In Section 5.1 and Section 5.2, we describe how to compute q-KG and its gradient when is finite in (4.1). Section 5.3 describes an effective way to discretize in (4.1). The readers should note that there are two s here, one is in (4.1) which is used to compute the q-KG factor given a sampling decision . The other is the feasible domain in (A.1) () that we optimize over. We are discretizing the first .
5.1 Estimating q-KG when is finite in (4.1)
Following , we express as
Because is normally distributed with zero mean and covariance matrix with respect to the posterior after observations, we can rewrite as
where is a standard -dimensional normal random vector, and
where is the Cholesky factor of the covariance matrix . Now we can compute the q-KG factor using Monte Carlo sampling when is finite: we can sample , compute (5.1), then plug in (4.1), repeat many times and take average.
5.2 Estimating the gradient of q-KG when is finite in (4.1)
In this section, we propose an unbiased estimator of the gradient ofq-KG using IPA when is finite. Accessing a stochastic gradient makes optimization much easier. By (5.1), we express q-KG as
where . Under the condition that and are continuously differentiable, one can show that (please see the details in the supplementary materials)
where is the th dimension of the th point in . By the formula of ,
where , , and
Now we can sample many times and take average to estimate the gradient of q-KG via (5.3). This technique is called infinitesimal perturbation analysis (IPA) in gradient estimation . Since we can estimate the gradient of q-KG efficiently when is finite, we will apply some standard gradient-based optimization algorithms, such as multi-start stochastic gradient ascent to maximize q-KG.
5.3 Approximating q-KG when is infinite in (4.1) through discretization
We have specified how to maximize q-KG when is finite in (4.1), but usually is infinite. In this case, we will discretize to approximate q-KG, and then maximize over the approximate q-KG. The discretization itself is an interesting research topic .
In this paper, the discrete set is not chosen statically, but evolves over time: specifically, we suggest drawing samples from the global optima of the posterior distribution of the Gaussian process (please refer to [11, 18] for a description of this technique). This sample set, denoted by , is then extended by the locations of previously sampled points and the set of candidate points . Then (4.1) can be restated as
where . For the experimental evaluation we recompute in every iteration after updating the posterior of the Gaussian process.
6 Numerical experiments
We conduct experiments in two different settings: the noise-free setting and the noisy setting. In both settings, we test the algorithms on well-known synthetic functions chosen from  and practical problems. Following previous literature , we use a constant mean prior and the ARD Matrn kernel. In the noisy setting, we assume that is constant across the domain , and we estimate it together with other hyperparameters in the GP using maximum likelihood estimation (MLE). We set to discretize the domain following the strategy in Section 5.3. In general, the q-KG algorithm performs as well or better than state-of-art benchmark algorithms on both synthetic and real problems. It performs especially well in the noisy setting.
Before describing the details of the empirical results, we highlight the implementation details of our method and the open-source implementations of the benchmark methods. Our implementation inherits the open-source implementation of parallel EI from the Metrics Optimization Engine , which is fully implemented in C++ with a python interface. We reuse their GP regression and GP hyperparameter fitting methods and implement the q-KG method in C++. Besides comparing to parallel EI in 
, we also compare our method to a well-known heuristic parallel EI implemented inSpearmint , the parallel UCB algorithm (GP-BUCB) and parallel UCB with pure exploration (GP-UCB-PE) both implemented in Gpoptimization .
6.1 Noise-free problems
In this section, we focus our attention on the noise-free setting, in which we can evaluate the objective exactly. We show that parallel knowledge gradient outperforms or is competitive with state-of-art benchmarks on several well-known test functions and tuning practical machine learning algorithms.
6.1.1 Synthetic functions
First, we test our algorithm along with the benchmarks on 4 well-known synthetic test functions: Branin2 on the domain , Rosenbrock3 on the domain , Ackley5 on the domain , and Hartmann6 on the domain . We initiate our algorithms by randomly sampling points from a Latin hypercube design, where is the dimension of the problem. Figure 5
reports the mean and the standard deviation of the base 10 logarithm of the immediate regret by runningrandom initializations with batch size .
The results show that q-KG is significantly better on Rosenbrock3, Ackley5 and Hartmann6, and is slightly worse than the best of the other benchmarks on Branin2. Especially on Rosenbrock3 and Ackley5, q-KG makes dramatic progress in early iterations.
6.1.2 Tuning logistic regression and convolutional neural networks (CNN)
In this section, we test the algorithms on two practical problems: tuning logistic regression on the MNIST dataset and tuning CNN on the CIFAR10 dataset. We set the batch size to.
First, we tune logistic regression on the MNIST dataset. This task is to classify handwritten digits from images, and is a-class classification problem. We train logistic regression on a training set with instances with a given set of hyperparameters and test it on a test set with instances. We tune hyperparameters: mini batch size from to , training iterations from to , the regularization parameter from to , and learning rate from to . We report the mean and standard deviation of the test error for independent runs. From the results, one can see that both algorithms are making progress at the initial stage while q-KG can maintain this progress for longer and results in a better algorithm configuration in general.
In the second experiment, we tune a CNN on CIFAR10 dataset. This is also a -class classification problem. We train the CNN on the training data with certain hyperparameters and test it on the test set with instances. For the network architecture, we choose the one in tensorflow tutorial. It consists of convolutional layers,
fully connected layers, and on top of them is a softmax layer for final classification. We tune totallyhyperparameters: the mini batch size from to
, training epoch fromto , the regularization parameter from to , learning rate from to , the kernel size from to , the number of channels in convolutional layers from to , the number of hidden units in fully connected layers from to , and the dropout rate from to . We report the mean and standard deviation of the test error for independent runs. In this example, the q-KG is making better (more aggressive) progress than parallel EI even in the initial stage and maintain this advantage to the end. This architecture has been carefully tuned by the human expert, and achieve a test error around , and our automatic algorithm improves it to around .
6.2 Noisy problems
In this section, we study problems with noisy function evaluations. Our results show that the performance gains over benchmark algorithms from q-KG evident in the noise-free setting are even larger in the noisy setting.
6.2.1 Noisy synthetic functions
We test on the same synthetic functions from the noise-free setting, and add independent gaussian noise with standard deviation to the function evaluation. The algorithms are not given this standard deviation, and must learn it from data.
The results in Figure 4 show that q-KG is consistently better than or at least competitive with all competing methods. Also observe that the performance advantage of q-KG is larger than for noise-free problems.
6.2.2 Noisy logistic regression with small test sets
Testing on a large test set such as ImageNet is slow, especially when we must test many times for different hyperparameters. To speed up hyperparameter tuning, we may instead test the algorithm on a subset of the testing data to approximate the test error on the full set. We study the performance of our algorithm and benchmarks in this scenario, focusing on tuning logistic regression on MNIST. We train logistic regression on the full training set of , but we test the algorithm by testing on randomly selected samples from the test set, which provides a noisy approximation of the test error on the full test set.
We report the mean and standard deviation of the test error on the full set using the hyperparameters recommended by each parallel BO algorithm for independent runs. The result shows that q-KG is better than both versions of parallel EI, and its final test error is close to the noise-free test error (which is substantially more expensive to obtain). As we saw with synthetic test functions, q-KG’s performance advantage in the noisy setting is wider than in the noise-free setting.
The authors were partially supported by NSF CAREER CMMI-1254298, NSF CMMI-1536895, NSF IIS-1247696, AFOSR FA9550-12-1-0200, AFOSR FA9550-15-1-0038, and AFOSR FA9550-16-1-0046.
In this paper, we introduce a novel batch Bayesian optimization method q-KG, derived from a decision-theoretical perspective, and develop a computational method to implement it efficiently. We show that q-KG outperforms or is competitive with the state-of-art benchmark algorithms on several synthetic functions and in tuning practical machine learning algorithms.
Appendix A Asynchronous q-KG Optimization
The (A.1) corresponds to the synchronous q-KG optimization, in which we wait for all points from our previous batch to finish before searching for a new batch of points. However, in some applications, we may wish to generate a new batch of points to evaluate next while points are still being evaluated, before we have their values. This is common in training machine learning algorithms, where different machine learning models do not necessarily finish at the same time.
We can generalize (A.1) to the asynchronous q-KG optimization. Given that points are still under evaluation, now we would like to recommend a batch of points to evaluate. As we did for the synchronous q-KG optimization above, now we estimate the q-KG of the combined points only with respect to the points that we need to recommend. Then we proceed the same way via gradient-based algorithms.
Appendix B Speed-up analysis
Next, we compare q-KG at different levels of parallelism against the fully sequential KG algorithm. We test the algorithms with different batch sizes on two noisy synthetic functions Branin2 and Hartmann6, whose standard deviation of the noise is . From the results, our parallel knowledge gradient method does provide a speed-up as goes up.
Appendix C The unbiasedness of the stochastic gradient estimator
Recall that in Section 5 of the main document, we have expressed the q-KG factor as follows,
where the expectation is taken over and
The main purpose of this section is to prove the following proposition.
When is finite, under the condition that and are continuous differentiable,
where , , is the th dimension of the th point in and .
Without loss of generality, we assume that (1) and are fixed in advance and (2) , we would like to prove that (C.2) is correct. Before proceeding, we define one more notation where equals to component-wise except for . To prove it, we cite Theorem 1 in , which requires three conditions to make (C.2) valid: there exists an open neighborhood of where is the th dimension of th point in such that (i) is continuous in for any fixed and , (ii) is differentiable except on a denumerable set in for any given and , (iii) the derivative of (when it exists) is uniformly bounded by for all , and the expectation of is finite.
c.1 The proof of condition (i)
Under the condition that the mean function and the kernel function are continuous differentiable, we see that for any given , is continuous differentiable in by the result that the multiplication, the inverse (when the inverse exists) and the Cholesky operators [smith1995differentiation] preserve continuous differentiability. When is finite, we see that is continuous in . Then is also continuous in by the definition of the function .
c.2 The proof of condition (ii)
By the expression that , if both and are unique, then is differentiable at . We define to be the set that is not differentiable, then we see that
where . depend on if () where is the th point of . As is finite, we only need to show that and is denumerable.
Defining on , one can see that is continuous differentiable on . We would like to show that is denumerable. To prove it, we will show that contains only isolated points. Then one can use a theorem in real analysis: any set of isolated points in is denumerable (see the proof of statement 4.2.25 on page 165 in ). To prove that only contains isolated points, we use the definition of an isolated point: is an isolated point of if and only if is not a limit point of . We will prove by contradiction, suppose that is a limit point of , then it means that there exists a sequence of points all belong to such that . However, by the definition of derivative and , , a contradiction. So we conclude that only contains isolated points, so is denumerable.
Defining on , is also continuous differentiable on , then one can similarly prove that is denumerable.
c.3 The proof of condition (iii)
Recall that from Section 5 of the main document,
where , , and
We can calculate the as follows
Using the fact that is continuously differentiable and is compact, then is bounded by some . By the result that is continous, it is bounded by a vector as is compact. Then where . And .
Appendix D The convergence of stochastic gradient ascent
In this section, we will prove that SGA converges to a stationary point. We follow the same idea of proving the Theorem 2 in .
First, it requires the step size satisfying as , and
. Second, it requires the second moment of the gradient estimator is finite. In the above section 1.3, we have show that, then .
- Bingham  Bingham, D. (2015). Optimization test problems. http://www.sfu.ca/~ssurjano/optimization.html.
- Chevalier and Ginsbourger  Chevalier, C. and Ginsbourger, D. (2013). Fast computation of the multi-points expected improvement with applications in batch selection. In Learning and Intelligent Optimization, pages 59–69. Springer.
- Contal et al.  Contal, E., Buffoni, D., Robicquet, A., and Vayatis, N. (2013). Parallel gaussian process optimization with upper confidence bound and pure exploration. In Machine Learning and Knowledge Discovery in Databases, pages 225–240. Springer.
- Deng and Yu  Deng, L. and Yu, D. (2014). Deep learning: Methods and applications. Foundations and Trends in Signal Processing, 7(3–4):197–387.
- Desautels et al.  Desautels, T., Krause, A., and Burdick, J. W. (2014). Parallelizing exploration-exploitation tradeoffs in gaussian process bandit optimization. The Journal of Machine Learning Research, 15(1):3873–3923.
- etal  etal, E. C. (2015). Gpoptimization. https://reine.cmla.ens-cachan.fr/e.contal/gpoptimization.
- Frazier et al.  Frazier, P., Powell, W., and Dayanik, S. (2009). The knowledge-gradient policy for correlated normal beliefs. INFORMS journal on Computing, 21(4):599–613.
- Gardner et al.  Gardner, J. R., Kusner, M. J., Xu, Z. E., Weinberger, K. Q., and Cunningham, J. (2014). Bayesian optimization with inequality constraints. In ICML, pages 937–945.
Gelbart et al. 
Gelbart, M., Snoek, J., and Adams, R. (2014).
Bayesian optimization with unknown constraints.
Proceedings of the Thirtieth Conference Annual Conference on Uncertainty in Artificial Intelligence (UAI-14), pages 250–259, Corvallis, Oregon. AUAI Press.
- Ginsbourger et al.  Ginsbourger, D., Le Riche, R., and Carraro, L. (2010). Kriging is well-suited to parallelize optimization. In Computational Intelligence in Expensive Optimization Problems, pages 131–162. Springer.
- Hernández-Lobato et al.  Hernández-Lobato, J. M., Hoffman, M. W., and Ghahramani, Z. (2014). Predictive entropy search for efficient global optimization of black-box functions. In Advances in Neural Information Processing Systems, pages 918–926.
- Jasper Snoek  Jasper Snoek, Hugo Larochelle, R. P. A. e. (2015). Spearmint. https://github.com/HIPS/Spearmint.
- Jones et al.  Jones, D. R., Schonlau, M., and Welch, W. J. (1998). Efficient global optimization of expensive black-box functions. Journal of Global optimization, 13(4):455–492.
- L’Ecuyer  L’Ecuyer, P. (1990). A unified view of the IPA, SF, and LR gradient estimation techniques. Management Science, 36(11):1364–1383.
- Marmin et al.  Marmin, S., Chevalier, C., and Ginsbourger, D. (2015). Differentiating the multipoint expected improvement for optimal batch design. In International Workshop on Machine Learning, Optimization and Big Data, pages 37–48. Springer.
- Rasmussen  Rasmussen, C. E. (2006). Gaussian processes for machine learning.
- Scott et al.  Scott, W., Frazier, P., and Powell, W. (2011). The correlated knowledge gradient for simulation optimization of continuous parameters using gaussian process regression. SIAM Journal on Optimization, 21(3):996–1026.
- Shah and Ghahramani  Shah, A. and Ghahramani, Z. (2015). Parallel predictive entropy search for batch global optimization of expensive objective functions. In Advances in Neural Information Processing Systems, pages 3312–3320.
- Snoek et al.  Snoek, J., Larochelle, H., and Adams, R. P. (2012). Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pages 2951–2959.
- Snoek et al.  Snoek, J., Swersky, K., Zemel, R., and Adams, R. (2014). Input warping for bayesian optimization of non-stationary functions. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1674–1682.
- Srinivas et al.  Srinivas, N., Krause, A., Seeger, M., and Kakade, S. M. (2010). Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 1015–1022.
- Swersky et al.  Swersky, K., Snoek, J., and Adams, R. P. (2013). Multi-task bayesian optimization. In Advances in neural information processing systems, pages 2004–2012.
- Thomson et al.  Thomson, B. S., Bruckner, J. B., and Bruckner, A. M. (2008). Elementary real analysis. ClassicalRealAnalysis. com.
- Törn and Zilinskas  Törn, A. and Zilinskas, A. (1989). Global optimization, volume 350 of lecture notes in computer science.
- Wang et al.  Wang, J., Clark, S. C., Liu, E., and Frazier, P. I. (2014). Metrics optimization engine. http://yelp.github.io/MOE/. Last accessed on 2016-01-21.
- Wang et al. [2015a] Wang, J., Clark, S. C., Liu, E., and Frazier, P. I. (2015a). Parallel bayesian global optimization of expensive functions.
- Wang et al. [2015b] Wang, Y., Reyes, K. G., Brown, K. A., Mirkin, C. A., and Powell, W. B. (2015b). Nested-batch-mode learning and stochastic optimization with an application to sequential multistage testing in materials science. SIAM Journal on Scientific Computing, 37(3):B361–B381.