1 Introduction
is able to find global optima with a remarkably small number of potentially noisy objective function evaluations. Bayesian optimization has thus been particularly successful for automatic hyperparameter tuning of machine learning algorithms
[35, 38, 11, 10], where objectives can be extremely expensive to evaluate, noisy, and multimodal.Bayesian optimization supposes that the objective function (e.g., the predictive performance with respect to some hyperparameters) is drawn from a prior distribution over functions, typically a Gaussian process (GP), maintaining a posterior as we observe the objective in new places. Acquisition functions, such as expected improvement [17, 15, 28], upper confidence bound [37], predictive entropy search [14] or the knowledge gradient [32], determine a balance between exploration and exploitation, to decide where to query the objective next. By choosing points with the largest acquisition function values, one seeks to identify a global optimum using as few objective function evaluations as possible.
Bayesian optimization procedures do not generally leverage derivative information, beyond a few exceptions described in Sect. 2. By contrast, other types of continuous optimization methods [36] use gradient information extensively. The broader use of gradients for optimization suggests that gradients should also be quite useful in Bayesian optimization: (1) Gradients inform us about the objective’s relative value as a function of location, which is wellaligned with optimization. (2) In dimensional problems, gradients provide distinct pieces of information about the objective’s relative value in each direction, constituting values per query together with the objective value itself. This advantage is particularly significant for highdimensional problems. (3) Derivative information is available in many applications at little additional cost. Recent work [e.g., 23]
makes gradient information available for hyperparameter tuning. Moreover, in the optimization of engineering systems modeled by partial differential equations, which predates most hyperparameter tuning applications
[8], adjoint methods provide gradients cheaply [29, 16]. And even when derivative information is not readily available, we can compute approximative derivatives in parallel through finite differences.In this paper, we explore the “what, when, and why” of Bayesian optimization with derivative information. We also develop a Bayesian optimization algorithm that effectively leverages gradients in hyperparameter tuning to outperform the state of the art. This algorithm accommodates incomplete and noisy gradient observations, can be used in both the sequential and batch settings, and can optionally reduce the computational overhead of inference by selecting the single most valuable directional derivatives to retain. For this purpose, we develop a new acquisition function, called the derivativeenabled knowledgegradient (dKG). dKG generalizes the previously proposed batch knowledge gradient method of Wu and Frazier [44] to the derivative setting, and replaces its approximate discretizationbased method for calculating the knowledgegradient acquisition function by a novel faster exact discretizationfree method. We note that this discretizationfree method is also of interest beyond the derivative setting, as it can be used to improve knowledgegradient methods for other problem settings. We also provide a theoretical analysis of the dKG
algorithm, showing (1) it is onestep Bayesoptimal by construction when derivatives are available; (2) that it provides onestep value greater than in the derivativefree setting, under mild conditions; and (3) that its estimator of the global optimum is asymptotically consistent.
In numerical experiments we compare with stateoftheart batch Bayesian optimization algorithms with and without derivative information, and the gradientbased optimizer BFGS with full gradients.
We assume familiarity with GPs and Bayesian optimization, for which we recommend Rasmussen and Williams [31] and Shahriari et al. [34] as a review. In Section 2 we begin by describing related work. In Sect. 3 we describe our Bayesian optimization algorithm exploiting derivative information. In Sect. 4 we compare the performance of our algorithm with several competing methods on a collection of synthetic and real problems.
The code for this paper is available at https://github.com/wujian16/CornellMOE.
2 Related Work
Osborne et al. [26] proposes fully Bayesian optimization procedures that use derivative observations to improve the conditioning of the GP covariance matrix. Samples taken near previously observed points use only the derivative information to update the covariance matrix. Unlike our current work, derivative information does not affect the acquisition function. We directly compare with Osborne et al. [26]
within the KNN benchmark in Sect.
4.2.Lizotte [22, Sect. 4.2.1 and Sect. 5.2.4] incorporates derivatives into Bayesian optimization, modeling the derivatives of a GP as in Rasmussen and Williams [31, Sect. 9.4]. Lizotte [22] shows that Bayesian optimization with the expected improvement (EI) acquisition function and complete gradient information at each sample can outperform BFGS. Our approach has six key differences: (i) we allow for noisy and incomplete derivative information; (ii) we develop a novel acquisition function that outperforms EI with derivatives; (iii) we enable batch evaluations; (iv) we implement and compare batch Bayesian optimization with derivatives across several acquisition functions, on benchmarks and new applications such as kernel learning, logistic regression, deep learning and knearest neighbors, further revealing empirically where gradient information will be most valuable; (v) we provide a theoretical analysis of Bayesian optimization with derivatives; (vi) we develop a scalable implementation.
Very recently, Koistinen et al. [19] uses GPs with derivative observations for minimum energy path calculations of atomic rearrangements and Ahmed et al. [1] studies expected improvement with gradient observations. In Ahmed et al. [1], a randomly selected directional derivative is retained in each iteration for computational reasons, which is similar to our approach of retaining a single directional derivative, though differs in its random selection in contrast with our valueofinformationbased selection. Our approach is complementary to these works.
For batch Bayesian optimization, several recent algorithms have been proposed that choose a set of points to evaluate in each iteration [35, 39, 24, 5, 6, 18, 33, 12]. Within this area, our approach to handling batch observations is most closely related to the batch knowledge gradient (KG) of Wu and Frazier [44]. We generalize this approach to the derivative setting, and provide a novel exact method for computing the knowledgegradient acquisition function that avoids the discretization used in Wu and Frazier [44]. This generalization improves speed and accuracy, and is also applicable to other knowledge gradient methods in continuous search spaces.
Recent advances improving both access to derivatives and computational tractability of GPs make Bayesian optimization with gradients increasingly practical and timely for discussion.
3 Knowledge Gradient with Derivatives
Sect. 3.1 reviews a general approach to incorporating derivative information into GPs for Bayesian optimization. Sect. 3.2 introduces a novel acquisition function dKG, based on the knowledge gradient approach, which utilizes derivative information. Sect. 3.3 computes this acquisition function and its gradient efficiently using a novel fast discretizationfree approach. Sect. 3.4 shows that this algorithm provides greater value of information than in the derivativefree setting, is onestep Bayesoptimal, and is asymptotically consistent when used over a discretized feasible space.
3.1 Derivative Information
Given an expensivetoevaluate function , we wish to find , where is the domain of optimization. We place a GP prior over , which is specified by its mean function and kernel function . We first suppose that for each sample of we observe the function value and all
partial derivatives, possibly with independent normally distributed noise, and then later discuss relaxation to observing only a single directional derivative.
Since the gradient is a linear operator, the gradient of a GP is also a GP (see also Sect. 9.4 in Rasmussen and Williams [31]), and the function and its gradient follow a multioutput GP with mean function and kernel function defined below:
(3.1) 
where and is the Hessian of .
When evaluating at a point , we observe the noiseobscured function value and gradient . Jointly, these observations form a
dimensional vector with conditional distribution
(3.2) 
where
gives the variance of the observational noise. If
is not known, we may estimate it from data. The posterior distribution is again a GP. We refer to the mean function of this posterior GP after samples as and its kernel function as . Suppose that we have sampled at points and observed , where each observation consists of the function value and the gradient at . Then and are given byIf our observations are incomplete, then we remove the rows and columns in , , , and of Eq. (LABEL:eq:posterior) corresponding to partial derivatives (or function values) that were not observed. If we can observe directional derivatives, then we add rows and columns corresponding to these observations, where entries in and
are obtained by noting that a directional derivative is a linear transformation of the gradient.
3.2 The dKG Acquisition Function
We propose a novel Bayesian optimization algorithm to exploit available derivative information, based on the knowledge gradient approach [9]. We call this algorithm the derivativeenabled knowledge gradient (dKG).
The algorithm proceeds iteratively, selecting in each iteration a batch of points in that has a maximum value of information (VOI). Suppose we have observed points, and recall from Section 3.1 that is the dimensional vector giving the posterior mean for and its partial derivatives at . Sect. 3.1 discusses how to remove the assumption that all values are provided.
The expected value of under the posterior distribution is . If after samples we were to make an irrevocable (riskneutral) decision now about the solution to our overarching optimization problem and receive a loss equal to the value of at the chosen point, we would choose and suffer conditional expected loss . Similarly, if we made this decision after samples our conditional expected loss would be . Therefore, we define the dKG factor for a given set of candidate points as
where is the expectation taken with respect to the posterior distribution after evaluations, and the distribution of under this posterior marginalizes over the observations upon which it depends. We subsequently refer to Eq. (LABEL:eqn:multiKG) as the inner optimization problem.
The dKG algorithm then seeks to evaluate the batch of points next that maximizes the dKG factor,
(3.5) 
We refer to Eq. (3.5) as the outer optimization problem. dKG solves the outer optimization problem using the method described in Section 3.3.
The dKG acquisition function differs from the batch knowledge gradient acquisition function in Wu and Frazier [44] because here the posterior mean at time depends on . This in turn requires calculating the distribution of these gradient observations under the time posterior and marginalizing over them. Thus, the dKG algorithm differs from KG not just in that gradient observations change the posterior, but also in that the prospect of future gradient observations changes the acquisition function. An additional major distinction from Wu and Frazier [44] is that dKG employs a novel discretizationfree method for computing the acquisition function (see Section 3.3).
Fig. 1 illustrates the behavior of dKG and dEI on a d example. dEI generalizes expected improvement (EI) to batch acquisition with derivative information [22]. dKG clearly chooses a better point to evaluate than dEI.
Including all partial derivatives can be computationally prohibitive since GP inference scales as . To overcome this challenge while retaining the value of derivative observations, we can include only one directional derivative from each iteration in our inference. dKG can naturally decide which derivative to include, and can adjust our choice of where to best sample given that we observe more limited information. We define the dKG acquisition function for observing only the function value and the derivative with direction at as
(3.6) 
where conditioning on is here understood to mean that is the conditional mean of given and . The full algorithm is as follows.
3.3 Efficient Exact Computation of dKG
Calculating and maximizing dKG is difficult when is continuous because the term in Eq. (3.6) requires optimizing over a continuous domain, and then we must integrate this optimal value through its dependence on and . Previous work on the knowledge gradient in continuous domains [32, 44, 30] approaches this computation by taking minima within expectations not over the full domain but over a discretized finite approximation. This approach supports analytic integration in Scott et al. [32] and Poloczek et al. [30], and a samplingbased scheme in Wu and Frazier [44]. However, the discretization in this approach introduces error and scales poorly with the dimension of .
Here we propose a novel method for calculating an unbiased estimator of the gradient of
dKG which we then use within stochastic gradient ascent to maximize dKG. This method avoids discretization, and thus is exact. It also improves speed significantly over a discretizationbased scheme.In Section A of the supplement we show that the dKG factor can be expressed as
(3.7) 
where is the mean function of after evaluations, is a dimensional standard normal random column vector and is the first row of a dimensional matrix, which is related to the kernel function of after evaluations with an exact form specified in (A.2) of the supplement.
Under sufficient regularity conditions [21], one can interchange the gradient and expectation operators,
where here the gradient is with respect to and . If is continuously differentiable and is compact, the envelope theorem [25] implies
(3.8) 
where . To find , one can utilize a multistart gradient descent method since the gradient is analytically available for the objective . Practically, we find that the learning rate of is robust for finding .
The expression (3.8) implies that is an unbiased estimator of , when the regularity conditions it assumes hold. We can use this unbiased gradient estimator within stochastic gradient ascent [13], optionally with multiple starts, to solve the outer optimization problem and can use a similar approach when observing full gradients to solve (3.5). For the outer optimization problem, we find that the learning rate of performs well over all the benchmarks we tested.
Bayesian Treatment of Hyperparameters.
We adopt a fully Bayesian treatment of hyperparameters similar to Snoek et al. [35]. We draw samples of hyperparameters for via the emcee package [7] and average our acquisition function across them to obtain
(3.9) 
where the additional argument in dKG indicates that the computation is performed conditioning on hyperparameters . In our experiments, we found this method to be computationally efficient and robust, although a more principled treatment of unknown hyperparameters within the knowledge gradient framework would instead marginalize over them when computing and .
3.4 Theoretical Analysis
Here we present three theoretical results giving insight into the properties of dKG, with proofs in the supplementary material. For the sake of simplicity, we suppose all partial derivatives are provided to dKG. Similar results hold for dKG with relevant directional derivative detection. We begin by stating that the value of information (VOI) obtained by dKG exceeds the VOI that can be achieved in the derivativefree setting.
Proposition 1.
Next, we show that dKG is onestep Bayesoptimal by construction.
Proposition 2.
If only one iteration is left and we can observe both function values and partial derivatives, then dKG is Bayesoptimal among all feasible policies.
As a complement to the onestep optimality, we show that dKG is asymptotically consistent if the feasible set is finite. Asymptotic consistency means that dKG will choose the correct solution when the number of samples goes to infinity.
Theorem 1.
If the function is sampled from a GP with known hyperparameters, the dKG algorithm is asymptotically consistent, i.e.
almost surely, where is the point recommended by dKG after iterations.
4 Experiments
We evaluate the performance of the proposed algorithm dKG with relevant directional derivative detection (Algorithm 1) on six standard synthetic benchmarks (see Fig. 2). Moreover, we examine its ability to tune the hyperparameters for the weighted knearest neighbor metric, logistic regression, deep learning, and for a spectral mixture kernel (see Fig. 3).
We provide an easytouse Python package with the core written in C++, available at https://github.com/wujian16/CornellMOE.
We compare dKG to several stateoftheart methods: (1) The batch expected improvement method (EI) of Wang et al. [39] that does not utilize derivative information and an extension of EI that incorporates derivative information denoted dEI. dEI is similar to Lizotte [22] but handles incomplete gradients and supports batches. (2) The batch GPUCBPE method of Contal et al. [5] that does not utilize derivative information, and an extension that does. (3) The batch knowledge gradient algorithm without derivative information (KG) of Wu and Frazier [44]. Moreover, we generalize the method of Osborne et al. [26] to batches and evaluate it on the KNN benchmark. All of the above algorithms allow incomplete gradient observations. In benchmarks that provide the full gradient, we additionally compare to the gradientbased method LBFGSB provided in scipy. We suppose that the objective function is drawn from a Gaussian process , where is a constant mean function and is the squared exponential kernel. We sample sets of hyperparameters by the emcee package [7].
Recall that the immediate regret is defined as the loss with respect to a global optimum. The plots for synthetic benchmark functions, shown in Fig. 2
, report the log10 of immediate regret of the solution that each algorithm would pick as a function of the number of function evaluations. Plots for other experiments report the objective value of the solution instead of the immediate regret. Error bars give the mean value plus and minus one standard deviation. The number of replications is stated in each benchmark’s description.
4.1 Synthetic Test Functions
We evaluate all methods on six test functions chosen from Bingham [2]. To demonstrate the ability to benefit from noisy derivative information, we sample additive normally distributed noise with zero mean and standard deviation for both the objective function and its partial derivatives. is unknown to the algorithms and must be estimated from observations. We also investigate how incomplete gradient observations affect algorithm performance. We also experiment with two different batch sizes: we use a batch size for the Branin, Rosenbrock, and Ackley functions; otherwise, we use a batch size . Fig. 2 summarizes the experimental results.
Functions with Full Gradient Information.
For 2d Branin on domain , 5d Ackley on , and 6d Hartmann function on , we assume that the full gradient is available.
Looking at the results for the Branin function in Fig. 2, dKG outperforms its competitors after 40 function evaluations and obtains the best solution overall (within the limit of function evaluations). BFGS makes faster progress than the Bayesian optimization methods during the first 20 evaluations, but subsequently stalls and fails to obtain a competitive solution. On the Ackley function dEI makes fast progress during the first evaluations but also fails to make subsequent progress. Conversely, dKG requires about evaluations to improve on the performance of dEI, after which dKG achieves the best overall performance again. For the Hartmann function dKG clearly dominates its competitors over all function evaluations.
Functions with Incomplete Derivative Information.
For the 3d Rosenbrock function on we only provide a noisy observation of the third partial derivative. Both EI and dEI get stuck early. dKG on the other hand finds a nearoptimal solution after function evaluations; KG, without derivatives, catches up after evaluations and performs comparably afterwards. The 4d Levy benchmark on , where the fourth partial derivative is observable with noise, shows a different ordering of the algorithms: EI has the best performance, beating even its formulation that uses derivative information. One explanation could be that the smoothness and regular shape of the function surface benefits this acquisition criteria. For the 8d Cosine mixture function on we provide two noisy partial derivatives. dKG and UCB with derivatives perform better than EItype criterion, and achieve the best performances, with dKG beating UCB with derivatives slightly.
In general, we see that dKG successfully exploits noisy derivative information and has the best overall performance.
4.2 RealWorld Test Functions
Weighted kNearest Neighbor.
Suppose a cab company wishes to predict the duration of trips. Clearly, the duration not only depends on the endpoints of the trip, but also on the day and time. In this benchmark we tune a weighted knearest neighbor (KNN) metric to optimize predictions of these durations, based on historical data. A trip is described by the pickup time , the pickup location , and the dropoff point . Then the estimate of the duration is obtained as a weighted average over all trips in our database that happened in the time interval minutes, where is a tunable hyperparameter: The weight of trip in this prediction is given by ,
where are the respective parameter values for trip , and are tunable hyperparameters. Thus, we have hyperparameters to tune: . We choose in , in , and each in .
We use the yellow cab NYC public data set from June 2016, sampling records from June 1 – 25 as training data and trip records from June – as validation data. Our test criterion is the root mean squared error (RMSE), for which we compute the partial derivatives on the validation dataset with respect to the hyperparameters , while the hyperparameter is not differentiable. In Fig. 3 we see that dKG overtakes the alternatives, and that UCB and KG acquisition functions also benefit from exploiting derivative information.
Kernel Learning.
Spectral mixture kernels [40] can be used for flexible kernel learning to enable longrange extrapolation. These kernels are obtained by modeling a spectral density by a mixture of Gaussians. While any stationary kernel can be described by a spectral mixture kernel with a particular setting of its hyperparameters, initializing and learning these parameters can be difficult. Although we have access to an analytic closed form of the (marginal likelihood) objective, this function is (i) expensive to evaluate and (ii) highly multimodal. Moreover, (iii) derivative information is available. Thus, learning flexible kernel functions is a perfect candidate for our approach.
The task is to train a component spectral mixture kernel on an airline data set [40]. We must determine the mixture weights, means, and variances, for each of the two Gaussians. Fig. 3 summarizes performance for batch size . BFGS is sensitive to its initialization and human intervention and is often trapped in local optima. dKG, on other hand, more consistently finds a good solution, and obtains the best solution of all algorithms (within the step limit). Overall, we observe that gradient information is highly valuable in performing this kernel learning task.
Logistic Regression and Deep Learning.
We tune logistic regression and a feedforward neural network with
hidden layers on the MNIST dataset [20], a standard classification task for handwritten digits. The training set contains images, the test set . We tune hyperparameters for logistic regression: the regularization parameter from to , learning rate from to , mini batch size from toand training epochs from
to . The first derivatives of the first two parameters can be obtained via the technique of Maclaurin et al. [23]. For the neural network, we additionally tune the number of hidden units in .Fig. 3 reports the mean and standard deviation of the mean crossentropy loss (or its log scale) on the test set for replications. dKG outperforms the other approaches, which suggests that derivative information is helpful. Our algorithm proves its value in tuning a deep neural network, which harmonizes with research computing the gradients of hyperparameters [23, 27].
5 Discussion
Bayesian optimization is successfully applied to low dimensional problems where we wish to find a good solution with a very small number of objective function evaluations. We considered several such benchmarks, as well as logistic regression, deep learning, kernel learning, and knearest neighbor applications. We have shown that in this context derivative information can be extremely useful: we can greatly decrease the number of objective function evaluations, especially when building upon the knowledge gradient acquisition function, even when derivative information is noisy and only available for some variables.
Bayesian optimization is increasingly being used to automate parameter tuning in machine learning, where objective functions can be extremely expensive to evaluate. For example, the parameters to learn through Bayesian optimization could even be the hyperparameters of a deep neural network. We expect derivative information with Bayesian optimization to help enable such promising applications, moving us towards fully automatic and principled approaches to statistical machine learning.
In the future, one could combine derivative information with flexible deep projections [43], and recent advances in scalable Gaussian processes for training and test time predictions [41, 42]. These steps would help make Bayesian optimization applicable to a much wider range of problems, wherever standard gradient based optimizers are used – even when we have analytic objective functions that are not expensive to evaluate – while retaining faster convergence and robustness to multimodality.
Acknowledgments
Wilson was partially supported by NSF IIS1563887. Frazier, Poloczek, and Wu were partially supported by NSF CAREER CMMI1254298, NSF CMMI1536895, NSF IIS1247696, AFOSR FA95501210200, AFOSR FA95501510038, and AFOSR FA95501610046.
References
 Ahmed et al. [2016] M. O. Ahmed, B. Shahriari, and M. Schmidt. Do we need “harmless” bayesian optimization and “firstorder” bayesian optimization? In NIPS BayesOpt, 2016.
 Bingham [2015] D. Bingham. Optimization test problems. http://www.sfu.ca/~ssurjano/optimization.html, 2015.
 Brochu et al. [2010] E. Brochu, V. M. Cora, and N. De Freitas. A tutorial on bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599, 2010.
 Commission [2016] N. T. . L. Commission. NYC Trip Record Data. http://www.nyc.gov/html/tlc/, June 2016. Last accessed on 20161010.
 Contal et al. [2013] E. Contal, D. Buffoni, A. Robicquet, and N. Vayatis. Parallel gaussian process optimization with upper confidence bound and pure exploration. In Machine Learning and Knowledge Discovery in Databases, pages 225–240. Springer, 2013.
 Desautels et al. [2014] T. Desautels, A. Krause, and J. W. Burdick. Parallelizing explorationexploitation tradeoffs in gaussian process bandit optimization. The Journal of Machine Learning Research, 15(1):3873–3923, 2014.
 ForemanMackey et al. [2013] D. ForemanMackey, D. W. Hogg, D. Lang, and J. Goodman. emcee: the mcmc hammer. Publications of the Astronomical Society of the Pacific, 125(925):306, 2013.
 Forrester et al. [2008] A. Forrester, A. Sobester, and A. Keane. Engineering design via surrogate modelling: a practical guide. John Wiley & Sons, 2008.
 Frazier et al. [2009] P. Frazier, W. Powell, and S. Dayanik. The knowledgegradient policy for correlated normal beliefs. INFORMS Journal on Computing, 21(4):599–613, 2009.
 Gardner et al. [2014] J. R. Gardner, M. J. Kusner, Z. E. Xu, K. Q. Weinberger, and J. Cunningham. Bayesian optimization with inequality constraints. In International Conference on Machine Learning, pages 937–945, 2014.
 Gelbart et al. [2014] M. Gelbart, J. Snoek, and R. Adams. Bayesian optimization with unknown constraints. In International Conference on Machine Learning, pages 250–259, Corvallis, Oregon, 2014.
 Gonzalez et al. [2016] J. Gonzalez, Z. Dai, P. Hennig, and N. Lawrence. Batch bayesian optimization via local penalization. In AISTATS, pages 648–657, 2016.
 Harold et al. [2003] J. Harold, G. Kushner, and G. Yin. Stochastic approximation and recursive algorithm and applications. Springer, 2003.
 HernándezLobato et al. [2014] J. M. HernándezLobato, M. W. Hoffman, and Z. Ghahramani. Predictive entropy search for efficient global optimization of blackbox functions. In Advances in Neural Information Processing Systems, pages 918–926, 2014.
 Huang et al. [2006] D. Huang, T. T. Allen, W. I. Notz, and N. Zeng. Global Optimization of Stochastic BlackBox Systems via Sequential Kriging MetaModels. Journal of Global Optimization, 34(3):441–466, 2006.
 Jameson [1999] A. Jameson. Reengineering the design process through computation. Journal of Aircraft, 36(1):36–50, 1999.
 Jones et al. [1998] D. R. Jones, M. Schonlau, and W. J. Welch. Efficient global optimization of expensive blackbox functions. Journal of Global optimization, 13(4):455–492, 1998.
 Kathuria et al. [2016] T. Kathuria, A. Deshpande, and P. Kohli. Batched gaussian process bandit optimization via determinantal point processes. In Advances in Neural Information Processing Systems, pages 4206–4214, 2016.
 Koistinen et al. [2016] O.P. Koistinen, E. Maras, A. Vehtari, and H. Jónsson. Minimum energy path calculations with gaussian process regression. Nanosystems: Physics, Chemistry, Mathematics, 7(6), 2016.

LeCun et al. [1998]
Y. LeCun, C. Cortes, and C. J. Burges.
The mnist database of handwritten digits, 1998.
 L’Ecuyer [1995] P. L’Ecuyer. Note: On the interchange of derivative and expectation for likelihood ratio derivative estimators. Management Science, 41(4):738–747, 1995.
 Lizotte [2008] D. J. Lizotte. Practical bayesian optimization. PhD thesis, University of Alberta, 2008.
 Maclaurin et al. [2015] D. Maclaurin, D. Duvenaud, and R. P. Adams. Gradientbased hyperparameter optimization through reversible learning. In International Conference on Machine Learning, pages 2113–2122, 2015.
 Marmin et al. [2016] S. Marmin, C. Chevalier, and D. Ginsbourger. Efficient batchsequential bayesian optimization with moments of truncated gaussian vectors. arXiv preprint arXiv:1609.02700, 2016.
 Milgrom and Segal [2002] P. Milgrom and I. Segal. Envelope theorems for arbitrary choice sets. Econometrica, 70(2):583–601, 2002.
 Osborne et al. [2009] M. A. Osborne, R. Garnett, and S. J. Roberts. Gaussian processes for global optimization. In 3rd International Conference on Learning and Intelligent Optimization (LION3), pages 1–15. Citeseer, 2009.
 Pedregosa [2016] F. Pedregosa. Hyperparameter optimization with approximate gradient. In International Conference on Machine Learning, pages 737–746, 2016.
 Picheny et al. [2013] V. Picheny, D. Ginsbourger, Y. Richet, and G. Caplin. Quantilebased optimization of noisy computer experiments with tunable precision. Technometrics, 55(1):2–13, 2013.
 Plessix [2006] R.É. Plessix. A review of the adjointstate method for computing the gradient of a functional with geophysical applications. Geophysical Journal International, 167(2):495–503, 2006.
 Poloczek et al. [2017] M. Poloczek, J. Wang, and P. I. Frazier. Multiinformation source optimization. In Advances in Neural Information Processing Systems, 2017. Accepted for publication. ArXiv preprint 1603.00389.
 Rasmussen and Williams [2006] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006. ISBN 026218253X.
 Scott et al. [2011] W. Scott, P. Frazier, and W. Powell. The correlated knowledge gradient for simulation optimization of continuous parameters using gaussian process regression. SIAM Journal on Optimization, 21(3):996–1026, 2011.
 Shah and Ghahramani [2015] A. Shah and Z. Ghahramani. Parallel predictive entropy search for batch global optimization of expensive objective functions. In Advances in Neural Information Processing Systems, pages 3312–3320, 2015.
 Shahriari et al. [2016] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas. Taking the human out of the loop: A review of bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2016.
 Snoek et al. [2012] J. Snoek, H. Larochelle, and R. P. Adams. Practical bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951–2959, 2012.

Snyman [2005]
J. Snyman.
Practical mathematical optimization: an introduction to basic optimization theory and classical and new gradientbased algorithms
, volume 97. Springer Science & Business Media, 2005.  Srinivas et al. [2010] N. Srinivas, A. Krause, M. Seeger, and S. M. Kakade. Gaussian process optimization in the bandit setting: No regret and experimental design. In International Conference on Machine Learning, pages 1015–1022, 2010.
 Swersky et al. [2013] K. Swersky, J. Snoek, and R. P. Adams. Multitask bayesian optimization. In Advances in Neural Information Processing Systems, pages 2004–2012, 2013.
 Wang et al. [2016] J. Wang, S. C. Clark, E. Liu, and P. I. Frazier. Parallel bayesian global optimization of expensive functions. arXiv preprint arXiv:1602.05149, 2016.
 Wilson and Adams [2013] A. G. Wilson and R. P. Adams. Gaussian process kernels for pattern discovery and extrapolation. In International Conference on Machine Learning, pages 1067–1075, 2013.

Wilson and Nickisch [2015]
A. G. Wilson and H. Nickisch.
Kernel interpolation for scalable structured gaussian processes (kissgp).
In International Conference on Machine Learning, pages 1775–1784, 2015.  Wilson et al. [2015] A. G. Wilson, C. Dann, and H. Nickisch. Thoughts on massively scalable gaussian processes. arXiv preprint arXiv:1511.01870, 2015.

Wilson et al. [2016]
A. G. Wilson, Z. Hu, R. Salakhutdinov, and E. P. Xing.
Deep kernel learning.
In
Proceedings of the 19th International Conference on Artificial Intelligence and Statistics
, pages 370–378, 2016.  Wu and Frazier [2016] J. Wu and P. Frazier. The parallel knowledge gradient method for batch bayesian optimization. In Advances in Neural Information Processing Systems, pages 3126–3134, 2016.
Appendix A The Computation of dKG and its Gradient: Additional Details
In this section, we show additional details in Sect. 3.3 of the main document: how to provide unbiased estimators of and its gradient. It is wellknown that if and are the mean and the kernel function respectively of the posterior of after evaluating points, then follows a bivariate Gaussian process with the mean function and the kernel function as follows
Analogously, the is also subject to noise,
where . Following Wu and Frazier [44], we express as
Conditioning on and the knowledge after evaluations, we have is normally distributed with mean and covariance matrix where the function maps the sample to its function and the directional derivative observation at direction . Thus, we can rewrite as
(A.1) 
where is a dimensional standard normal vector and
(A.2) 
Here is the Cholesky factor of the covariance matrix . Now we can follow Sect. 3.3 of the main document to estimate and its gradient.
Appendix B Proof of Proposition 1 and Proposition 2
Proof of Proposition 1.
Recall that we start with the same posterior . Then
(B.1)  
where recall that is the observed function value at , and are the derivative observations at . The inequality above holds due to Jensen’s inequality.
Now we will show that the inequality is strict when depends on , where . Equality holds only if there exists a set such that: (1) ; (2) for any given , is a linear function of for all in a set (allowed to depend on ) such that .
By (LABEL:eq:posterior), we can express as
Then condition (2) holds only if is constant for all (in a conditionally almost sure set allowed to depend on ), which holds only if are the same for all . Thus, the inequality is strict in settings where affects . ∎
Next we analyze the Bayesian optimization problem under a dynamic programming (DP) framework and show that dKG is onestep Bayesoptimal.
Proof of Proposition 2.
Suppose that we are given a budget of samples, i.e. we may run the algorithm for iterations. Our goal is to choose sampling decisions and the implementation decision that minimizes . We assume that is drawn from the prior , then follows the posterior process after iterations, so we have . Thus, letting be the set of feasible policies , we can formulate our problem as follows
We analyze this problem under the DP framework. We define our state space as after iteration as it completely characterizes our belief on . Under the DP framework, we will define the value function as follows
(B.2) 
for every . The Bellman equation tells us that the value function can be written recursively as
where
At the same time, we also know that any policy whose decisions satisfy
(B.3) 
is optimal. If we were to stop at iteration , then and (B.3) reduces to
which is exactly the dKG algorithm. This proves that dKG is onestep Bayesoptimal. ∎
Appendix C Proof of Theorem 1
Recall that we define the value function in Eq. (B.2). Similarly, we can define the value function for a specific policy as
(C.1) 
Since we are varying the number of iterations , we define as the optimal value function when the size of the iteration budget is . Additionally, we define . Similarly, we define and for a specific policy .
Next we will state two lemmas concerning the benefits of additional samples, which will be useful in the latter proofs. First we have the following result for any stationary policy . A policy is called stationary if the decision of the policy only depends on the current state (not on the iteration ). dKG is stationary.
Lemma 1.
For any stationary policy and state , .
This lemma states that for any stationary policy, one additional iteration helps on average.
Proof of Lemma 1.
We prove by induction on . When , by Jensen’s inequality,
Then by the induction hypothesis,
where line 2 above is due to the induction hypothesis and line 3 is due to the stationarity of the policy and the transition kernel of . We conclude the proof. ∎
The following lemma is related to the optimal policy. It says that if allowed an extra fixed batch of samples, the optimal policy performs better on average than if no extra samples allowed.
Lemma 2.
For any state and , .
As a direct corollary, we have for any state .
Proof of Lemma 2.
Recall that .The lemma below shows that is well defined and bounded below.
Lemma 3.
For any state , exists and
(C.2) 