1 Introduction
Bayesian optimization (BO) has a long history and has been used in a variety of fields (see Shahriari et al, 2016), with recent interest from the machine learning community in the context of automatic hyperparameter tuning (Snoek et al, 2012; Golovin et al, 2017). BO is an example of a global black-box optimization algorithm (Hendrix et al, 2010; Jones et al, 1998; Pintér, 1991; Rios and Sahinidis, 2013), which optimizes an unknown function that may not have nice properties such as convexity. In the typical setting, we assume that we only have access to a black box that evaluates the function, and that these evaluations are expensive. The objective is to find a global optimum of the unknown function with the minimum number of function evaluations. The global optimization of a real-valued function is impossible unless we make assumptions about the structure of the unknown function. Lipschitz continuity (that the function cannot change arbitrarily fast as we change the inputs) is one of the weakest assumptions under which optimizing an unknown function is still possible. Lipschitz optimization (LO) (Piyavskii, 1972; Shubert, 1972) exploits knowledge of the Lipschitz constant of the function (a bound on the amount that it can change) to prune the search space in order to locate the optimum. In contrast, Bayesian optimization makes the stronger assumption that the unknown function belongs to a known model class (typically a class of smooth functions), the most common being a Gaussian process (GP) generated using a Gaussian or Matérn kernel (Stein, 2012). We review LO and BO in Section 2.
Under their own specific sets of additional assumptions, both BO (Bull, 2011, Theorem 5) and LO (Malherbe and Vayatis, 2017) can be shown to be exponentially faster than random search strategies. If the underlying function is close to satisfying the stronger BO assumptions, then BO is typically able to optimize functions faster than LO. However, when these assumptions are not reasonable, BO may converge more slowly than simply trying random values (Li et al, 2016; Ahmed et al, 2016). On the other hand, LO makes minimal assumptions (not even requiring differentiability: the absolute value function is a simple non-differentiable but Lipschitz-continuous function) and simply prunes away values of the parameters that are not compatible with the Lipschitz condition and thus cannot be solutions. This is useful in speeding up simple algorithms like random search. Given a new function to optimize, it is typically not clear which of these strategies will perform better.
In this paper, we propose to combine BO and LO to exploit the advantages of both methods. We call this Lipschitz Bayesian Optimization (LBO). Specifically, in Section 3, we design mixed acquisition functions that use Lipschitz continuity in conjunction with existing BO algorithms. We also address the issue of providing a "harmless" estimate of the Lipschitz constant (see Section 2.3), which is an important practical issue for any LO method. Our experiments (Section 4) indicate that in some settings the addition of estimated Lipschitz information leads to a large improvement over standard BO methods. This is particularly true for Thompson sampling, which often outperforms other standard acquisition functions when augmented with Lipschitz information; the estimated Lipschitz bounds appear to correct for Thompson sampling's well-known problem of over-exploration (Shahriari et al, 2014). Further, our experiments indicate that it does not hurt to use the Lipschitz information, since even in the worst case it does not change the runtime or the performance of the method.
2 Background
We consider the problem of maximizing a real-valued function $f(x)$ over a compact set $\mathcal{X} \subseteq \mathbb{R}^d$. We assume that on iteration $t$, an algorithm chooses a point $x_t \in \mathcal{X}$ and then receives the corresponding function value $f(x_t)$. Typically, our goal is to find the largest possible $f(x_t)$ across the iterations. We describe two approaches for solving this problem, namely BO and LO, in detail below.
2.1 Bayesian Optimization
BO methods are typically based on Gaussian processes (GPs), since they have appealing universal consistency properties over compact sets and admit a closed-form posterior distribution (Rasmussen and Williams, 2006). BO methods typically assume a smooth GP prior on the unknown function, and use the observed function evaluations to compute a posterior distribution over the possible function values at any point $x$. At iteration $t$, given the previously selected points $x_1, \dots, x_{t-1}$ and their corresponding observations $f(x_1), \dots, f(x_{t-1})$, the algorithm uses an acquisition function (based on the GP posterior) to select the next point $x_t$ to evaluate. The value of the acquisition function at a point $x$ characterizes the importance of evaluating that point in order to maximize $f$. To determine $x_t$, we usually maximize this acquisition function over all $x \in \mathcal{X}$ using an auxiliary optimization procedure (typically we can only approximately solve this maximization).
We now formalize the above high-level procedure. We assume that $f$ follows a $\mathcal{GP}(0, k(x, x'))$ distribution, where $k(x, x')$ is a kernel function which quantifies the similarity between points $x$ and $x'$. Throughout this paper, we use the Matérn 5/2 kernel, for which $k(x, x') = \left(1 + \sqrt{5}\,r + \tfrac{5}{3}\,r^2\right)\exp\left(-\sqrt{5}\,r\right)$, where $r^2 = \sum_{i=1}^{d} (x_i - x'_i)^2/\ell_i^2$. Here, the hyperparameter $\ell_i$ is referred to as the length-scale for dimension $i$ and dictates the extent of smoothness we assume about the function in direction $i$. The hyperparameter $\sigma_n^2$ quantifies the amount of noise we expect in the function values.
We denote the maximum observed function value up to iteration $t$ as $f^*_t$ and the set of selected points as $X_t = \{x_1, \dots, x_t\}$. Let $\mathbf{f}_t = [f(x_1), \dots, f(x_t)]^\top$ and let us denote the kernel matrix as $K_t$ (so $[K_t]_{ij} = k(x_i, x_j)$ for all $i, j \le t$). Given the $t$ function evaluations (observations), the posterior distribution at a point $x$ after $t$ iterations is given as $f(x) \mid \mathbf{f}_t \sim \mathcal{N}(\mu_t(x), \sigma_t^2(x))$. Here, the mean and standard deviation of the function at $x$ are given as:

$$\mu_t(x) = \mathbf{k}_t(x)^\top (K_t + \sigma_n^2 I)^{-1} \mathbf{f}_t, \qquad \sigma_t^2(x) = k(x, x) - \mathbf{k}_t(x)^\top (K_t + \sigma_n^2 I)^{-1} \mathbf{k}_t(x), \quad (1)$$

where $\mathbf{k}_t(x) = [k(x, x_1), \dots, k(x, x_t)]^\top$.
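As a concrete illustration of Equation (1), the following sketch (function names and the single-observation simplification are ours, not from the paper) computes the posterior mean and standard deviation for a Matérn 5/2 kernel with one observation, where the matrix inverse reduces to a scalar reciprocal:

```python
import math

def matern52(x, y, lengthscales):
    # Matérn 5/2 kernel with a per-dimension length-scale (ARD form).
    r = math.sqrt(sum(((a - b) / l) ** 2 for a, b, l in zip(x, y, lengthscales)))
    s = math.sqrt(5.0) * r
    return (1.0 + s + s * s / 3.0) * math.exp(-s)

def gp_posterior_1obs(x, x1, y1, lengthscales, noise_var):
    # Posterior mean/std from Equation (1) with a single observation (x1, y1),
    # so (K + sigma_n^2 I)^{-1} is just 1 / (k(x1, x1) + sigma_n^2).
    k11 = matern52(x1, x1, lengthscales) + noise_var
    k1x = matern52(x, x1, lengthscales)
    mu = k1x * y1 / k11
    var = matern52(x, x, lengthscales) - k1x * k1x / k11
    return mu, math.sqrt(max(var, 0.0))

# Near the observed point the mean approaches y1 and the std approaches 0;
# far from it the posterior reverts to the prior (mean 0, unit std).
mu, sd = gp_posterior_1obs([0.0], [0.0], 2.0, [1.0], 1e-6)
```

With more observations the same formulas apply, but $(K_t + \sigma_n^2 I)^{-1}$ requires a linear solve (a Cholesky factorization in practice).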
As alluded to earlier, an acquisition function uses the above posterior distribution in order to select the next point at which to evaluate the function. A number of acquisition functions have been proposed in the literature, with the most popular ones being: the upper confidence bound (UCB) (Srinivas et al, 2010), Thompson sampling (TS) (Thompson, 1933), expected improvement (EI) (Močkus, 1975), probability of improvement (PI) (Kushner, 1964), and entropy search (Villemonteix et al, 2009; Hennig and Schuler, 2012; Hernández-Lobato et al, 2014). In this work, we focus on four simple widely-used acquisition functions: UCB, TS, EI, and PI. However, we expect that our conclusions would apply to other acquisition functions. For brevity, when defining the acquisition functions, we drop the subscripts from $\mu_t$, $\sigma_t$, and $f^*_t$.

UCB: The acquisition function is defined as:

$$\mathrm{UCB}(x) = \mu(x) + \beta\,\sigma(x). \quad (2)$$

Here, $\beta$ is a positive parameter that trades off exploration and exploitation.
TS: For TS, in each iteration we first sample a function $\tilde{f}$ from the GP posterior. TS then selects the point $x_t$ which maximizes this deterministic function $\tilde{f}$.
PI: We define the possible improvement (over the current maximum) at $x$ as $I(x) = \max\{0, f(x) - f^*\}$, and the indicator of improvement as $u(x) = \mathbb{1}[f(x) > f^*]$. PI selects the point which maximizes the probability of improving over $f^*$. If $\phi$ and $\Phi$ are the probability density function and the cumulative distribution function for the standard normal distribution, then the PI acquisition function is given as (Kushner, 1964):

$$\mathrm{PI}(x) = \Phi(\gamma(x)), \quad (3)$$

where we have defined $\gamma(x) = (\mu(x) - f^*)/\sigma(x)$.
EI: EI selects an $x$ that maximizes $\mathbb{E}[I(x)]$, where the expectation is taken over the posterior $f(x) \sim \mathcal{N}(\mu(x), \sigma^2(x))$. If $\phi$ is the pdf of the standard normal distribution, the expected improvement acquisition function can be written as (Močkus, 1975):

$$\mathrm{EI}(x) = \sigma(x)\left[\gamma(x)\,\Phi(\gamma(x)) + \phi(\gamma(x))\right]. \quad (4)$$
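The three closed-form acquisition functions above can be sketched directly from the posterior mean $\mu$, standard deviation $\sigma$, and incumbent $f^*$; this is a minimal illustration (function names are ours) using only the standard-normal pdf and cdf:

```python
import math

def norm_pdf(z):
    # Standard normal probability density function phi(z).
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    # Standard normal cumulative distribution function Phi(z).
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def ucb(mu, sigma, beta):
    # Equation (2): optimism in the face of uncertainty.
    return mu + beta * sigma

def pi(mu, sigma, f_best):
    # Equation (3): probability of improving over the incumbent f*.
    gamma = (mu - f_best) / sigma
    return norm_cdf(gamma)

def ei(mu, sigma, f_best):
    # Equation (4): expected improvement over f*.
    gamma = (mu - f_best) / sigma
    return sigma * (gamma * norm_cdf(gamma) + norm_pdf(gamma))
```

For example, at a point where $\mu = f^*$ and $\sigma = 1$, PI equals $\Phi(0) = 0.5$ and EI equals $\phi(0) \approx 0.399$.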
2.2 Lipschitz Optimization
As opposed to assuming that the function comes from a specific family of functions, in LO we simply assume that the function cannot change too quickly as we change $x$. In particular, we say that a function $f$ is Lipschitz-continuous if for all $x$ and $y$ we have

$$|f(x) - f(y)| \le L\,\|x - y\|, \quad (5)$$

for a constant $L$ which is referred to as the Lipschitz constant. Note that, unlike typical priors used in BO (like the Gaussian or Matérn kernel), a function can be non-smooth and still be Lipschitz-continuous.
Lipschitz optimization uses this Lipschitz inequality in order to test possible locations for the maximum of the function. In particular, at iteration $t$ the Lipschitz inequality implies that the function's value at any $x$ can be upper and lower bounded, for any previously evaluated $x_i$, by

$$f(x_i) - L\,\|x - x_i\| \le f(x) \le f(x_i) + L\,\|x - x_i\|.$$

Since the above inequality holds simultaneously for all $i \le t$, for any $x$ the function value can be bounded as:

$$f^{lb}_t(x) := \max_{i \le t}\left[f(x_i) - L\,\|x - x_i\|\right] \le f(x) \le \min_{i \le t}\left[f(x_i) + L\,\|x - x_i\|\right] =: f^{ub}_t(x). \quad (6)$$

Notice that if $f^{ub}_t(x) < f^*_t$, then $x$ cannot achieve a higher function value than our current maximum $f^*_t$.
To exploit these bounds, at each iteration a typical Lipschitz optimization (LO) method, such as that of Malherbe and Vayatis (2017), samples points uniformly at random from $\mathcal{X}$ until it finds an $x$ that satisfies $f^{ub}_t(x) \ge f^*_t$. If we know the Lipschitz constant (or use a valid overestimate of it), this strategy may prune away large areas of the space while guaranteeing that we do not prune away any optimal solutions. This can substantially decrease the number of function evaluations needed to come close to the global optimum, compared to using random points without pruning.
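The bounds in Equation (6) and the resulting pruning rule can be sketched as follows (a minimal illustration with function names of our choosing):

```python
import math

def lipschitz_bounds(x, points, values, L):
    # Equation (6): intersect the cones f(x_i) +/- L * ||x - x_i||
    # over all previously evaluated points.
    dists = [math.dist(x, xi) for xi in points]
    lb = max(fi - L * d for fi, d in zip(values, dists))
    ub = min(fi + L * d for fi, d in zip(values, dists))
    return lb, ub

def prune(candidates, points, values, L):
    # Keep only candidates whose upper bound can still beat the
    # incumbent f*; the rest cannot contain the maximum.
    f_best = max(values)
    return [x for x in candidates
            if lipschitz_bounds(x, points, values, L)[1] >= f_best]
```

A point very close to a low-valued observation gets an upper bound below $f^*_t$ and is discarded without ever evaluating the expensive function there.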
A major drawback of Lipschitz optimization is that in most applications we do not know a valid $L$. We discuss this scenario in the next section, but first we note that there exist applications where we do have access to a valid $L$. For example, Bunin and François (2016) discuss cases where $L$ can be dictated by the physical laws of the underlying process (e.g., in heat transfer, solid oxide fuel-cell systems, and polymerization). Alternately, if we have a lower and an upper bound on the possible values that the function can take, then we can combine this with the size of $\mathcal{X}$ to obtain an overestimate of the minimum valid $L$.
2.3 Harmless Lipschitz Optimization
When our black-box function arises from a real-world process, a suitable value of $L$ is typically dictated by physical limitations of the process. However, in practice we often do not know $L$ and thus need to estimate it. A simple way to obtain an underestimate of $L$ at iteration $t$ is to use the maximum value $L_t$ that satisfies the Lipschitz inequality across all pairs of previously evaluated points,

$$L_t = \max_{i \ne j} \frac{|f(x_i) - f(x_j)|}{\|x_i - x_j\|}. \quad (7)$$
Note that this estimate monotonically increases as we see more examples, but it may be far smaller than the true value. A common variation is to sample several points on a grid (or randomly) to use in the estimate above. Unfortunately, without knowing the Lipschitz constant we do not know how fine this grid should be, so in general this may still significantly underestimate the true quantity.
A reasonable property of any estimate of $L$ that we use is that it is "harmless" in the sense of Ahmed et al (2016). Specifically, the choice of the estimate should not make the algorithm converge to the global optimum more slowly than random guessing (in the worst case). If we have an overestimate of the minimum possible value of $L$, then the LO algorithm is harmless, as it can only prune values that cannot improve the objective function (although if we overestimate it by too much then it may not prune much of the space). However, the common underestimates of $L$ discussed in the previous paragraph are not harmless, since they may prune the global optima.
We propose a simple solution to the problem that LO is not harmless if we don't have prior knowledge about $L$: we use a growing estimate of $L$. The danger in using a growing strategy is that if we grow the estimate too slowly then the algorithm may not be harmless. However, in the appendix we show that LO is "harmless" for most reasonable strategies for growing the estimate. This result is not prescriptive in the sense that it does not suggest a practical strategy for growing the estimate (since it depends on the true $L$), but it shows that even for enormous values of $L$, an estimate would have to grow exceedingly slowly in order to not be harmless (exponentially slowly in the minimum value of $L$, the dimensionality, and the desired accuracy). In our experiments we simply use $\hat{L}_t = c\,t\,L_t$: the underestimator $L_t$ multiplied by the (growing) iteration number $t$ and a constant $c$ (a tunable hyperparameter). In Section 4, we observe that this choice of $\hat{L}_t$, with a single fixed value of $c$, consistently works well across 14 datasets with 4 different acquisition functions.
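The pairwise underestimate of Equation (7) and the growing "harmless" estimate described above can be sketched as (function names are ours):

```python
import math

def lipschitz_underestimate(points, values):
    # Equation (7): the largest observed slope over all pairs of
    # previously evaluated points. Always a lower bound on the true L.
    est = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = math.dist(points[i], points[j])
            if d > 0:
                est = max(est, abs(values[i] - values[j]) / d)
    return est

def growing_estimate(points, values, t, c):
    # Growing estimate L_hat_t = c * t * L_t: the underestimate is
    # inflated by the iteration number t and a tunable constant c,
    # so it eventually exceeds the true L.
    return c * t * lipschitz_underestimate(points, values)
```

Because $L_t$ is monotone in the data and the factor $t$ grows without bound, the estimate eventually overestimates any fixed true $L$, at which point the pruning can no longer discard the global optimum.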
3 Lipschitz Bayesian optimization
In this section, we show how simple changes to the standard acquisition functions used in BO allow us to incorporate the Lipschitz inequality bounds. We call this Lipschitz Bayesian Optimization (LBO). LBO prevents BO from considering values of $x$ that cannot be global maxima (assuming we have overestimated $L$) and also restricts the range of function values considered in the acquisition function to those that are consistent with the Lipschitz inequalities. Figure 1 illustrates the key features of BO, LO, and LBO. It is important to note that the Lipschitz constant $L$ has a different interpretation than the length-scales $\ell_i$ of the GP. The constant $L$ specifies an absolute maximum rate of change for the function, while the $\ell_i$ specify how quickly the parameterized distance between pairs of points changes the GP's correlations. We also note that the computational complexity of using the Lipschitz inequalities is $O(t)$ per candidate point, which is cheaper than the $O(t^3)$ cost of (exact) inference in the GP.
We can use the Lipschitz bounds to restrict the limits of the unknown function value for computing the improvement. The upper bound will always be $f^{ub}(x)$, while the lower bound will depend on the relative values of $f^{lb}(x)$, $f^{ub}(x)$, and $f^*$. In particular, we have the following two cases: (i) if $f^{ub}(x) > f^*$, the function value at $x$ is restricted to the interval $[f^{lb}(x), f^{ub}(x)]$; (ii) if $f^{ub}(x) \le f^*$, the point cannot improve over the current maximum. The second case represents points that cannot improve over the current best value (that are "rejected" by the Lipschitz inequalities).
Truncated-PI: We can define a truncated variant of the PI acquisition function by replacing the Gaussian posterior with its truncation to $[f^{lb}(x), f^{ub}(x)]$:

$$\mathrm{TPI}(x) = \frac{\Phi(\beta(x)) - \Phi(-\gamma(x))}{\Phi(\beta(x)) - \Phi(\alpha(x))}, \quad (8)$$

where $\alpha(x) = (f^{lb}(x) - \mu(x))/\sigma(x)$ and $\beta(x) = (f^{ub}(x) - \mu(x))/\sigma(x)$, and we set $\mathrm{TPI}(x) = 0$ for rejected points.
Truncated-EI: Using the above bounds, the truncated expected improvement at a point $x$ is the expected improvement under the same truncated posterior:

$$\mathrm{TEI}(x) = \mathbb{E}\left[\max\{0, f(x) - f^*\}\right], \qquad f(x) \sim \mathcal{N}(\mu(x), \sigma^2(x)) \text{ truncated to } [f^{lb}(x), f^{ub}(x)]. \quad (9)$$
Note that removing the Lipschitz bounds corresponds to using $f^{lb}(x) = -\infty$ and $f^{ub}(x) = \infty$, and in this case we recover the usual PI and EI methods in Equations (3) and (4), respectively.
Truncated-UCB: The same strategy can be applied to UCB by clipping the UCB value to the Lipschitz interval:

$$\mathrm{TUCB}(x) = \min\left\{\max\left\{\mu(x) + \beta\,\sigma(x),\ f^{lb}(x)\right\},\ f^{ub}(x)\right\}. \quad (10)$$
Accept-Reject: An alternative strategy to incorporate the Lipschitz bounds is to use an accept-reject based mixed acquisition function. This approach uses the Lipschitz bounds as a sanity check to accept or reject the value provided by the original acquisition function, similar to LO methods. Formally, if $a(x)$ is the value of the original acquisition function (e.g., $\mu(x) + \beta\,\sigma(x)$ for UCB or the sampled value $\tilde{f}(x)$ for TS), then the mixed acquisition function is given as follows:

$$\bar{a}(x) = \begin{cases} a(x) & \text{if } f^{lb}(x) \le a(x) \le f^{ub}(x), \\ f^{lb}(x) & \text{otherwise.} \end{cases}$$
We refer to the accept-reject based mixed acquisition functions as AR-UCB and AR-TS, respectively. Note that the accept-reject method is quite generic and can be used with any acquisition function whose values are on the same scale as that of the function. When using an estimate of $L$, it is possible that a good point could be rejected because the estimate of $L$ is too small, but using a growing estimate ensures that such points can again be selected in later iterations.
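A minimal sketch of the accept-reject step (the choice of returning the pessimistic lower bound for rejected points is one plausible reading of the rule above, not prescribed here):

```python
def accept_reject(acq_value, lb, ub):
    # Accept the acquisition value if it is consistent with the
    # Lipschitz bounds [lb, ub]; otherwise fall back to the lower
    # bound so that rejected points lose during maximization.
    if lb <= acq_value <= ub:
        return acq_value
    return lb
```

Used inside the auxiliary maximization, a Thompson sample or UCB value that exceeds what the Lipschitz cones allow is replaced by a value no candidate with a feasible acquisition value can lose to.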
3.1 Regret bound for AR-UCB
In this section, we show that under reasonable assumptions, AR-UCB is provably "harmless", in the sense that it retains the good theoretical properties of GP-UCB. We prove the theorem below under the following assumptions:

1. The GP is correctly specified and, with infinite observations, the posterior distribution will collapse to the "true" function $f$.

2. The noise in the observations is small enough for the Lipschitz bounds in Equation (6) to hold.

3. The Lipschitz constant $L$ is known or has been overestimated using the techniques described in Section 2.3.

Assumption 1 is a common assumption made for providing theoretical results for GP-UCB (Srinivas et al, 2010). Under these assumptions, we obtain the following theorem (proved in Appendix B):
Theorem 1.
Let $D$ be a finite decision space and $\sigma$ be the standard deviation of the noise in the observations. Let $\delta \in (0, 1)$ be a positive scalar. If we use the AR-UCB algorithm with $\beta_t = 2\log\left(|D|\,t^2\pi^2/6\delta\right)$, assuming that the above conditions 1-3 hold, then the expected cumulative regret $R_T$ can be bounded as follows:

$$R_T \le \sqrt{C_1\, T\, \beta_T\, \gamma_T}, \qquad C_1 = \frac{8}{\log(1 + \sigma^{-2})}.$$

Here, $\gamma_T$ refers to the information gain for the selected points and depends on the kernel being used. For the squared exponential kernel, $\gamma_T = O\left((\log T)^{d+1}\right)$, and we obtain the following specific bound:

$$R_T = O\left(\sqrt{T\, \beta_T\, (\log T)^{d+1}}\right).$$

The term $\gamma_T$ can also be bounded for the Matérn kernel following Srinivas et al (2010). The above theorem shows that, under reasonable assumptions, using the Lipschitz bounds in conjunction with GP-UCB cannot result in worse regret. We empirically show that if $L$ is overestimated, then AR-UCB matches the performance of GP-UCB in the worst case.
Note that the above theorem assumes that the GP is correctly specified with the correct hyperparameters. It also assumes that we are able to specify the correct value of the trade-off parameter $\beta_t$. These assumptions are not guaranteed to hold in practice, and this may result in worse performance of the GP-UCB algorithm. In such cases, our experiments show that using the Lipschitz bounds can lead to better empirical performance than the original GP-UCB.
4 Experiments
Datasets: We perform an extensive experimental evaluation and present results on twelve synthetic datasets and three real-world tasks. For the synthetic experiments, we use standard global-optimization benchmarks, namely the Branin, Camel, Goldstein-Price, Hartmann (2 variants), Michalewicz (3 variants), and Rosenbrock (4 variants) functions. The closed form and domain for each of these functions is given in Jamil and Yang (2013). As examples of real-world tasks, we consider tuning the parameters for a robot-pushing simulation (2 variants) (Wang and Jegelka, 2017) and tuning the hyperparameters for logistic regression (Wu et al, 2017). For the robot-pushing example, our aim is to find a good pre-image (Kaelbling and Lozano-Pérez, 2017) in order for the robot to push the object to a pre-specified goal location. We follow the experimental protocol from Wang and Jegelka (2017) and use the negative of the distance to the goal location as the black-box function to maximize. We consider tuning the robot position and the duration of the push for the 3D case. We also tune the angle of the push to make it a 4-dimensional problem. For the hyperparameter-tuning task, we consider tuning the strength of the regularization, the learning rate for stochastic gradient descent, and the number of passes over the data. The black-box function is the negative loss on the test set (using a train/test split) for the MNIST dataset.

Experimental Setup: For Bayesian optimization, we use a Gaussian process prior with the Matérn kernel (with a different length-scale for each dimension). We modified the publicly available BO package pybo of Hoffman and Shahriari (2014) to construct the mixed acquisition functions. All the prior hyperparameters were set and updated across iterations according to the open-source Spearmint package (https://github.com/hips/spearmint). In order to make the optimization invariant to the scale of the function values, similar to Spearmint, we standardize the function values: after each iteration, we centre the observed function values by subtracting their mean and dividing by their standard deviation. We then fit a GP to these rescaled function values and correct our Lipschitz constant estimate by dividing it by the same standard deviation. We use DIRECT (Jones et al, 1993) in order to optimize the acquisition function in each iteration. This is one of the standard choices in current works on BO (Eric et al, 2008; Martinez-Cantin et al, 2007; Mahendran et al, 2012), but we expect that Lipschitz information could improve the performance under other choices of the acquisition-function optimizer, such as discretization (Snoek et al, 2012), adaptive grids (Bardenet and Kégl, 2010), and gradient-based methods (Hutter et al, 2011; Lizotte et al, 2012). In order to ensure that Bayesian optimization does not get stuck in suboptimal maxima (either because of the auxiliary optimization or a "bad" set of hyperparameters), on every fourth iteration of BO (or LBO) we choose a random point to evaluate rather than optimizing the acquisition function. This makes the optimization procedure "harmless" in the sense that BO (or LBO) will not perform worse than random search (Ahmed et al, 2016).
This has become common in recent BO methods such as Bull (2011), Hutter et al (2011), and Falkner et al (2017), and to make the comparison fair we add this "exploration" step to all methods. Note that in the case of LBO we may need to reject random points until we find one satisfying the Lipschitz inequalities (this does not require evaluating the function). In practice, we found that both the standardization and the iterations of random exploration are essential for good performance. (We verified that our baseline version of BO performs better than or equal to Spearmint across the benchmark problems.) All our results are averaged over independent runs, and each of our figures plots the mean and standard deviation of the absolute error (compared to the global optimum) versus the number of function evaluations. For functions evaluated on a log scale, we show the 10th and 90th quantiles.
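The standardization step and the matching correction to the Lipschitz estimate can be sketched as follows (a small illustration, with a function name of our choosing):

```python
def standardize(values, L_hat):
    # Centre and scale the observed function values, and rescale the
    # Lipschitz estimate by the same standard deviation so the bounds
    # remain valid for the rescaled function.
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    if std == 0.0:
        std = 1.0  # all values identical; leave the scale unchanged
    scaled = [(v - mean) / std for v in values]
    return scaled, L_hat / std
```

Dividing $\hat{L}$ by the same standard deviation is what keeps Equation (6) consistent after the function values have been rescaled: if $|f(x) - f(y)| \le L\|x - y\|$, then $|f(x)/s - f(y)/s| \le (L/s)\|x - y\|$.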
Algorithms compared: We compare the performance of random search, BO, and LBO methods (using both the estimated and the true $L$) for the EI, PI, UCB, and TS acquisition functions. The true $L$ was estimated offline using a large number of random points. For UCB, we set the trade-off parameter $\beta_t$ according to Kandasamy et al (2017). For EI and PI, we use the Lipschitz bounds to truncate the range of function values for calculating the improvement, and use the LBO variants TEI and TPI, respectively. For UCB and TS, we use the accept-reject strategy and evaluate the LBO variants AR-UCB and AR-TS, respectively. In addition to these, we use random exploration as another baseline. We chose the hyperparameter $c$ (that controls the extent of overestimating the Lipschitz constant) on the Rosenbrock 4D function and use the best value of $c$ for all the other datasets and acquisition functions, for both BO and LBO.
Results: To make the results easier to read, we divide the results into the following groups:

- LBO provides huge improvements over BO, shown in Figure 2. Overall, this represents of all the test cases.

- LBO provides improvements over BO, shown in Figure 3(a). Overall, this represents of all the test cases.

- LBO performs similarly to BO, shown in Figure 3(b). Overall, this represents of all the test cases.

- LBO performs slightly worse than BO, shown in Figure 3(c). Overall, this represents of all the test cases.
A comparison of the performance across different acquisition functions (for both BO and LBO) on some of the functions is shown in Figure 4, where we also show an example of UCB where $\beta_t$ is misspecified. The plots for all functions and methods are available in Appendix C. From these experiments, we make the following observations:

- LBO can potentially lead to large gains in performance across acquisition functions and datasets, particularly for TS.

- Across datasets, we observe that the gains for EI are relatively small, occasionally large for PI and UCB, and consistently large for TS. This can be explained as follows: EI tends to under-explore the search space, a fact that has been consistently observed and even theoretically proven by Qin et al (2017). As a result, BO with EI does not tend to explore "bad" regions, which results in smaller gains from LBO (though the under-exploration remains a risk in itself).

- TS suffers from exactly the opposite problem: its high variance leads to over-exploration of the search space and poor performance. This can be observed in Figures 2(a), 2(b), and 2(c), where the performance of TS is near random. This has also been observed and noted by Shahriari et al (2016). For the discrete multi-armed bandit case, Chapelle and Li (2011) multiply the obtained variance estimate by a small number to discourage over-exploration and show that this leads to better results. LBO offers a more principled way of obtaining the same effect and consequently makes TS more competitive with the other acquisition functions.

- The only functions where LBO slightly hurts are Rosenbrock 4D and Goldstein-Price with UCB and PI.

- For Michalewicz 5D (Figure 4(a)), we see that there is no gain for EI, PI, or UCB. However, the gain is huge for TS. In fact, even though TS is the worst-performing acquisition function on this dataset, its LBO variant AR-TS gives the best performance across all methods. This demonstrates the possible gain that can be obtained from using mixed acquisition functions.

- We observe a similar trend in Figure 4(b), where LBO improves TS from near-random performance to being competitive with the best-performing methods (while it does not adversely affect the methods that already perform well).

- For the cases where BO performs slightly better than LBO, we notice that using the true value of $L$ provides performance comparable to BO, so the problem can be narrowed down to finding a good estimate of $L$.

- Figure 4(c) shows examples where LBO saves BO with UCB when the parameter $\beta_t$ is chosen too large. In this case BO performs near random, but using LBO leads to better performance than random search.
In any case, our experiments indicate that LBO methods rarely hurt the performance of the original acquisition function. Since they have minimal computational and memory requirements and are simple to implement, these experiments support using the Lipschitz bounds.
5 Related work
The Lipschitz condition has been used with BO in different contexts in two previous works (González et al, 2016; Sui et al, 2015). The aim of Sui et al (2015) is to design a "safe" BO algorithm. They assume knowledge of the true Lipschitz constant and exploit Lipschitz continuity to construct a safety threshold in order to define a "safe" region of the parameter space. This is different from our goal of improving the performance of existing BO methods, and also different in that we estimate the Lipschitz constant as we run the algorithm. On the other hand, González et al (2016) used Lipschitz continuity to model interactions between a batch of points chosen simultaneously in every iteration of BO (referred to as "batch" Bayesian optimization). This contrasts with our work, where we aim to improve the performance of existing sequential algorithms (it is possible that our ideas could be used in their framework).
6 Discussion
In this paper, we have proposed simple ways to combine Lipschitz inequalities with some of the most common BO methods. Our experiments show that this often gives a performance gain, and in the worst case it performs similarly to a standard BO method. Although we have focused on four of the simplest acquisition functions, these inequalities could also be used within other acquisition functions. Further, we expect that the Lipschitz inequalities could be used in other settings, such as BO with constraints (Gelbart et al, 2014; Hernández-Lobato et al, 2016; Gardner et al, 2014), BO methods based on other model classes like neural networks (Snoek et al, 2015; Hutter et al, 2011), and methods that evaluate more than one point at a time (Ginsbourger et al, 2010; Wang et al, 2016). Finally, there has been recent interest in first-order Bayesian optimization methods (Ahmed et al, 2016; Wu et al, 2017). If the gradient is Lipschitz-continuous, then it is possible to use the descent lemma (Bertsekas, 2016) to obtain Lipschitz bounds that depend on both function values and gradients.

References
 Ahmed et al (2016) Ahmed MO, Shahriari B, Schmidt M (2016) Do we need "harmless" Bayesian optimization and "first-order" Bayesian optimization? NIPS Workshop on Bayesian Optimization
 Bardenet and Kégl (2010) Bardenet R, Kégl B (2010) Surrogating the surrogate: accelerating Gaussian-process-based global optimization with a mixture cross-entropy algorithm. In: International Conference on Machine Learning (ICML), Omnipress, pp 55–62
 Bertsekas (2016) Bertsekas DP (2016) Nonlinear Programming, 3rd edn. MIT
 Bull (2011) Bull AD (2011) Convergence rates of efficient global optimization algorithms. Journal of Machine Learning Research 12(Oct):2879–2904
 Bunin and François (2016) Bunin GA, François G (2016) Lipschitz constants in experimental optimization. arXiv preprint arXiv:1603.07847
 Chapelle and Li (2011) Chapelle O, Li L (2011) An empirical evaluation of Thompson sampling. In: Advances in Neural Information Processing Systems (NIPS), pp 2249–2257
 Eric et al (2008) Eric B, Freitas ND, Ghosh A (2008) Active preference learning with discrete choice data. In: Advances in Neural Information Processing Systems (NIPS), pp 409–416
 Falkner et al (2017) Falkner S, Klein A, Hutter F (2017) Combining Hyperband and Bayesian optimization. In: NIPS Workshop on Bayesian Optimization
 Gardner et al (2014) Gardner JR, Kusner MJ, Xu ZE, Weinberger KQ, Cunningham JP (2014) Bayesian optimization with inequality constraints. In: International Conference on Machine Learning (ICML), pp 937–945
 Gelbart et al (2014) Gelbart MA, Snoek J, Adams RP (2014) Bayesian optimization with unknown constraints. arXiv preprint arXiv:1403.5607
 Ginsbourger et al (2010) Ginsbourger D, Le Riche R, Carraro L (2010) Kriging is wellsuited to parallelize optimization. In: Computational intelligence in expensive optimization problems, Springer, pp 131–162
 Golovin et al (2017) Golovin D, Solnik B, Moitra S, Kochanski G, Karro J, Sculley D (2017) Google vizier: A service for blackbox optimization. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp 1487–1495

 González et al (2016) González J, Dai Z, Hennig P, Lawrence N (2016) Batch Bayesian optimization via local penalization. In: International Conference on Artificial Intelligence and Statistics (AISTATS), pp 648–657
 Hendrix et al (2010) Hendrix EM, Boglárka G, et al (2010) Introduction to nonlinear and global optimization. Springer
 Hennig and Schuler (2012) Hennig P, Schuler CJ (2012) Entropy search for informationefficient global optimization. Journal of Machine Learning Research 13(Jun):1809–1837
 HernándezLobato et al (2014) HernándezLobato JM, Hoffman MW, Ghahramani Z (2014) Predictive entropy search for efficient global optimization of blackbox functions. In: Advances in Neural Information Processing Systems (NIPS), pp 918–926
 HernándezLobato et al (2016) HernándezLobato JM, Gelbart MA, Adams RP, Hoffman MW, Ghahramani Z (2016) A general framework for constrained Bayesian optimization using informationbased search. Journal of Machine Learning Research 17(1):5549–5601
 Hoffman and Shahriari (2014) Hoffman MW, Shahriari B (2014) Modular mechanisms for Bayesian optimization. In: NIPS Workshop on Bayesian Optimization, pp 1–5
 Hutter et al (2011) Hutter F, Hoos HH, LeytonBrown K (2011) Sequential modelbased optimization for general algorithm configuration. In: International Conference on Learning and Intelligent Optimization, Springer, pp 507–523
 Jamil and Yang (2013) Jamil M, Yang XS (2013) A literature survey of benchmark functions for global optimisation problems. International Journal of Mathematical Modelling and Numerical Optimisation 4(2):150–194
 Jones et al (1993) Jones DR, Perttunen CD, Stuckman BE (1993) Lipschitzian optimization without the Lipschitz constant. Journal of Optimization Theory and Applications 79(1):157–181
 Jones et al (1998) Jones DR, Schonlau M, Welch WJ (1998) Efficient global optimization of expensive blackbox functions. Journal of Global optimization 13(4):455–492
 Kaelbling and LozanoPérez (2017) Kaelbling LP, LozanoPérez T (2017) Preimage backchaining in belief space for mobile manipulation. In: Robotics Research, Springer, pp 383–400
 Kandasamy et al (2017) Kandasamy K, Krishnamurthy A, Schneider J, Poczos B (2017) Asynchronous parallel Bayesian optimisation via Thompson sampling. arXiv preprint arXiv:1705.09236
 Kushner (1964) Kushner HJ (1964) A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise. Journal of Basic Engineering 86(1):97–106

 Li et al (2016) Li L, Jamieson K, DeSalvo G, Rostamizadeh A, Talwalkar A (2016) Efficient hyperparameter optimization and infinitely many armed bandits. arXiv preprint arXiv:1603.06560
 Lizotte et al (2012) Lizotte DJ, Greiner R, Schuurmans D (2012) An experimental methodology for response surface optimization methods. Journal of Global Optimization 53(4):699–736
 Mahendran et al (2012) Mahendran N, Wang Z, Hamze F, De Freitas N (2012) Adaptive mcmc with bayesian optimization. In: International Conference on Artificial Intelligence and Statistics (AISTATS), pp 751–760
 Malherbe and Vayatis (2017) Malherbe C, Vayatis N (2017) Global optimization of lipschitz functions. In: International Conference on Machine Learning (ICML), pp 2314–2323, URL http://proceedings.mlr.press/v70/malherbe17a.html
 MartinezCantin et al (2007) MartinezCantin R, de Freitas N, Doucet A, Castellanos JA (2007) Active policy learning for robot planning and exploration under uncertainty. In: Robotics: Science and Systems, vol 3, pp 321–328
 Močkus (1975) Močkus J (1975) On Bayesian methods for seeking the extremum. In: Optimization Techniques IFIP Technical Conference, Springer, pp 400–404
 Pintér (1991) Pintér JD (1991) Global optimization in action. Scientific American 264:54–63
 Piyavskii (1972) Piyavskii S (1972) An algorithm for finding the absolute extremum of a function. USSR Computational Mathematics and Mathematical Physics 12(4):57–67
 Qin et al (2017) Qin C, Klabjan D, Russo D (2017) Improving the expected improvement algorithm. In: Advances in Neural Information Processing Systems (NIPS), pp 5387–5397
 Rasmussen and Williams (2006) Rasmussen CE, Williams CK (2006) Gaussian processes for machine learning. MIT Press
 Rios and Sahinidis (2013) Rios LM, Sahinidis NV (2013) Derivativefree optimization: a review of algorithms and comparison of software implementations. Journal of Global Optimization 56(3):1247–1293
 Shahriari et al (2014) Shahriari B, Wang Z, Hoffman MW, BouchardCôté A, de Freitas N (2014) An entropy search portfolio. In: NIPS Workshop on Bayesian Optimization
 Shahriari et al (2016) Shahriari B, Swersky K, Wang Z, Adams RP, de Freitas N (2016) Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE 104(1):148–175
 Shubert (1972) Shubert BO (1972) A sequential method seeking the global maximum of a function. SIAM Journal on Numerical Analysis 9(3):379–388
 Snoek et al (2012) Snoek J, Larochelle H, Adams RP (2012) Practical Bayesian optimization of machine learning algorithms. Advances in Neural Information Processing Systems (NIPS)
 Snoek et al (2015) Snoek J, Rippel O, Swersky K, Kiros R, Satish N, Sundaram N, Patwary M, Prabhat M, Adams R (2015) Scalable Bayesian optimization using deep neural networks. In: International Conference on Machine Learning (ICML), pp 2171–2180
 Srinivas et al (2010) Srinivas N, Krause A, Kakade SM, Seeger M (2010) Gaussian process optimization in the bandit setting: No regret and experimental design. In: International Conference on Machine Learning (ICML), pp 1015–1022

Stein (2012)
Stein ML (2012) Interpolation of spatial data: some theory for kriging. Springer Science & Business Media
 Sui et al (2015) Sui Y, Gotovos A, Burdick J, Krause A (2015) Safe exploration for optimization with gaussian processes. In: International Conference on Machine Learning (ICML), pp 997–1005
 Thompson (1933) Thompson WR (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3/4):285–294
 Villemonteix et al (2009) Villemonteix J, Vazquez E, Walter E (2009) An informational approach to the global optimization of expensivetoevaluate functions. Journal of Global Optimization 44(4):509
 Wang et al (2016) Wang J, Clark SC, Liu E, Frazier PI (2016) Parallel Bayesian global optimization of expensive functions. arXiv preprint arXiv:160205149
 Wang and Jegelka (2017) Wang Z, Jegelka S (2017) Maxvalue entropy search for efficient bayesian optimization. In: International Conference on Machine Learning (ICML)
 Wu et al (2017) Wu J, Poloczek M, Wilson AG, Frazier P (2017) Bayesian optimization with gradients. In: Advances in Neural Information Processing Systems (NIPS), pp 5267–5278
References
 Ahmed et al (2016) Ahmed MO, Shahriari B, Schmidt M (2016) Do we need “harmless” Bayesian optimization and “first-order” Bayesian optimization? NIPS Workshop on Bayesian Optimization
 Bardenet and Kégl (2010) Bardenet R, Kégl B (2010) Surrogating the surrogate: accelerating Gaussian-process-based global optimization with a mixture cross-entropy algorithm. In: International Conference on Machine Learning (ICML), Omnipress, pp 55–62
 Bertsekas (2016) Bertsekas DP (2016) Nonlinear Programming, 3rd edn. MIT
 Bull (2011) Bull AD (2011) Convergence rates of efficient global optimization algorithms. Journal of Machine Learning Research 12(Oct):2879–2904
 Bunin and François (2016) Bunin GA, François G (2016) Lipschitz constants in experimental optimization. arXiv preprint arXiv:1603.07847
 Chapelle and Li (2011) Chapelle O, Li L (2011) An empirical evaluation of Thompson sampling. In: Advances in Neural Information Processing Systems (NIPS), pp 2249–2257
 Eric et al (2008) Eric B, Freitas ND, Ghosh A (2008) Active preference learning with discrete choice data. In: Advances in Neural Information Processing Systems (NIPS), pp 409–416
 Falkner et al (2017) Falkner S, Klein A, Hutter F (2017) Combining Hyperband and Bayesian optimization. In: NIPS Workshop on Bayesian Optimization
 Gardner et al (2014) Gardner JR, Kusner MJ, Xu ZE, Weinberger KQ, Cunningham JP (2014) Bayesian optimization with inequality constraints. In: International Conference on Machine Learning (ICML), pp 937–945
 Gelbart et al (2014) Gelbart MA, Snoek J, Adams RP (2014) Bayesian optimization with unknown constraints. arXiv preprint arXiv:1403.5607
 Ginsbourger et al (2010) Ginsbourger D, Le Riche R, Carraro L (2010) Kriging is wellsuited to parallelize optimization. In: Computational intelligence in expensive optimization problems, Springer, pp 131–162
 Golovin et al (2017) Golovin D, Solnik B, Moitra S, Kochanski G, Karro J, Sculley D (2017) Google vizier: A service for blackbox optimization. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp 1487–1495
 González et al (2016) González J, Dai Z, Hennig P, Lawrence N (2016) Batch Bayesian optimization via local penalization. In: International Conference on Artificial Intelligence and Statistics (AISTATS), pp 648–657
 Hendrix et al (2010) Hendrix EM, Boglárka G, et al (2010) Introduction to nonlinear and global optimization. Springer
 Hennig and Schuler (2012) Hennig P, Schuler CJ (2012) Entropy search for informationefficient global optimization. Journal of Machine Learning Research 13(Jun):1809–1837
 Hernández-Lobato et al (2014) Hernández-Lobato JM, Hoffman MW, Ghahramani Z (2014) Predictive entropy search for efficient global optimization of black-box functions. In: Advances in Neural Information Processing Systems (NIPS), pp 918–926
 Hernández-Lobato et al (2016) Hernández-Lobato JM, Gelbart MA, Adams RP, Hoffman MW, Ghahramani Z (2016) A general framework for constrained Bayesian optimization using information-based search. Journal of Machine Learning Research 17(1):5549–5601
 Hoffman and Shahriari (2014) Hoffman MW, Shahriari B (2014) Modular mechanisms for Bayesian optimization. In: NIPS Workshop on Bayesian Optimization, pp 1–5
 Hutter et al (2011) Hutter F, Hoos HH, Leyton-Brown K (2011) Sequential model-based optimization for general algorithm configuration. In: International Conference on Learning and Intelligent Optimization, Springer, pp 507–523
 Jamil and Yang (2013) Jamil M, Yang XS (2013) A literature survey of benchmark functions for global optimisation problems. International Journal of Mathematical Modelling and Numerical Optimisation 4(2):150–194
 Jones et al (1993) Jones DR, Perttunen CD, Stuckman BE (1993) Lipschitzian optimization without the Lipschitz constant. Journal of Optimization Theory and Applications 79(1):157–181
 Jones et al (1998) Jones DR, Schonlau M, Welch WJ (1998) Efficient global optimization of expensive black-box functions. Journal of Global Optimization 13(4):455–492
 Kaelbling and Lozano-Pérez (2017) Kaelbling LP, Lozano-Pérez T (2017) Pre-image backchaining in belief space for mobile manipulation. In: Robotics Research, Springer, pp 383–400
 Kandasamy et al (2017) Kandasamy K, Krishnamurthy A, Schneider J, Poczos B (2017) Asynchronous parallel Bayesian optimisation via Thompson sampling. arXiv preprint arXiv:1705.09236
 Kushner (1964) Kushner HJ (1964) A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise. Journal of Basic Engineering 86(1):97–106
 Li et al (2016) Li L, Jamieson K, DeSalvo G, Rostamizadeh A, Talwalkar A (2016) Efficient hyperparameter optimization and infinitely many armed bandits. arXiv preprint arXiv:1603.06560
 Lizotte et al (2012) Lizotte DJ, Greiner R, Schuurmans D (2012) An experimental methodology for response surface optimization methods. Journal of Global Optimization 53(4):699–736
 Mahendran et al (2012) Mahendran N, Wang Z, Hamze F, De Freitas N (2012) Adaptive MCMC with Bayesian optimization. In: International Conference on Artificial Intelligence and Statistics (AISTATS), pp 751–760
 Malherbe and Vayatis (2017) Malherbe C, Vayatis N (2017) Global optimization of Lipschitz functions. In: International Conference on Machine Learning (ICML), pp 2314–2323, URL http://proceedings.mlr.press/v70/malherbe17a.html
 Martinez-Cantin et al (2007) Martinez-Cantin R, de Freitas N, Doucet A, Castellanos JA (2007) Active policy learning for robot planning and exploration under uncertainty. In: Robotics: Science and Systems, vol 3, pp 321–328
 Močkus (1975) Močkus J (1975) On Bayesian methods for seeking the extremum. In: Optimization Techniques IFIP Technical Conference, Springer, pp 400–404
 Pintér (1991) Pintér JD (1991) Global optimization in action. Scientific American 264:54–63
 Piyavskii (1972) Piyavskii S (1972) An algorithm for finding the absolute extremum of a function. USSR Computational Mathematics and Mathematical Physics 12(4):57–67
 Qin et al (2017) Qin C, Klabjan D, Russo D (2017) Improving the expected improvement algorithm. In: Advances in Neural Information Processing Systems (NIPS), pp 5387–5397
 Rasmussen and Williams (2006) Rasmussen CE, Williams CK (2006) Gaussian processes for machine learning. MIT Press
 Rios and Sahinidis (2013) Rios LM, Sahinidis NV (2013) Derivative-free optimization: a review of algorithms and comparison of software implementations. Journal of Global Optimization 56(3):1247–1293
 Shahriari et al (2014) Shahriari B, Wang Z, Hoffman MW, Bouchard-Côté A, de Freitas N (2014) An entropy search portfolio. In: NIPS Workshop on Bayesian Optimization
 Shahriari et al (2016) Shahriari B, Swersky K, Wang Z, Adams RP, de Freitas N (2016) Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE 104(1):148–175
 Shubert (1972) Shubert BO (1972) A sequential method seeking the global maximum of a function. SIAM Journal on Numerical Analysis 9(3):379–388
 Snoek et al (2012) Snoek J, Larochelle H, Adams RP (2012) Practical Bayesian optimization of machine learning algorithms. Advances in Neural Information Processing Systems (NIPS)
 Snoek et al (2015) Snoek J, Rippel O, Swersky K, Kiros R, Satish N, Sundaram N, Patwary M, Prabhat M, Adams R (2015) Scalable Bayesian optimization using deep neural networks. In: International Conference on Machine Learning (ICML), pp 2171–2180
 Srinivas et al (2010) Srinivas N, Krause A, Kakade SM, Seeger M (2010) Gaussian process optimization in the bandit setting: No regret and experimental design. In: International Conference on Machine Learning (ICML), pp 1015–1022
 Stein (2012) Stein ML (2012) Interpolation of spatial data: some theory for kriging. Springer Science & Business Media
 Sui et al (2015) Sui Y, Gotovos A, Burdick J, Krause A (2015) Safe exploration for optimization with Gaussian processes. In: International Conference on Machine Learning (ICML), pp 997–1005
 Thompson (1933) Thompson WR (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3/4):285–294
 Villemonteix et al (2009) Villemonteix J, Vazquez E, Walter E (2009) An informational approach to the global optimization of expensivetoevaluate functions. Journal of Global Optimization 44(4):509
 Wang et al (2016) Wang J, Clark SC, Liu E, Frazier PI (2016) Parallel Bayesian global optimization of expensive functions. arXiv preprint arXiv:1602.05149
 Wang and Jegelka (2017) Wang Z, Jegelka S (2017) Max-value entropy search for efficient Bayesian optimization. In: International Conference on Machine Learning (ICML)
 Wu et al (2017) Wu J, Poloczek M, Wilson AG, Frazier P (2017) Bayesian optimization with gradients. In: Advances in Neural Information Processing Systems (NIPS), pp 5267–5278
Appendix A Proof for Lipschitz constant estimation
Let $L$ be the minimum possible Lipschitz constant for the function $f$, and let $x^*$ be a globally-optimal solution. Here we bound the number of iterations $t$ that we need before random search finds a point $x_t$ satisfying

$$f(x_t) - f(x^*) \leq \epsilon \qquad (11)$$

for some small $\epsilon > 0$.

By the Lipschitz inequality we have

$$|f(x) - f(x^*)| \leq L\,\|x - x^*\|$$

for any $x$. If we assume that $x^*$ is at least distance $\epsilon/L$ away from the boundary, then this inequality implies that there exists a hypersphere around $x^*$ of volume $O((\epsilon/L)^d)$ on which all values satisfy (11). Using that the domain is compact (and thus its total volume is bounded), if we choose a random $x_t$ then the probability that it lands inside this ball is $O((\epsilon/L)^d)$. (If $x^*$ is near the boundary, the volume of the intersection of this ball with the domain is still $O((\epsilon/L)^d)$ and the argument is unchanged.) Since we sample the $x_t$ independently, this probability gives the chance that we find an approximate solution on each iteration. The expected number of iterations before we have such a “success” is thus the mean of a geometric random variable with this probability, which means we need

$$O\big((L/\epsilon)^d\big)$$

iterations in expectation to find a point satisfying (11). (Instead of the “number of iterations to reach accuracy $\epsilon$”, we could equivalently state the result in terms of the “expected error at iteration $t$” (simple regret) by inverting the relationship between $t$ and $\epsilon$; this gives an expected error on iteration $t$ of $O(L/t^{1/d})$.)

Unfortunately, if we use an estimate $\hat{L}_t$ instead of an $L$ satisfying the Lipschitz inequality, we could reject an approximate solution. However, if $\hat{L}_t$ grows with $t$, then eventually it is sufficiently large that we will not reject an approximate solution (unless we already have an optimal solution). Thus, a crude bound on the expected number of iterations before we find a solution with accuracy $\epsilon$ is $t' + O((L/\epsilon)^d)$, where $t'$ is the first iteration beyond which we always have $\hat{L}_t \geq L$. Hence, if we choose the sequence $\{\hat{L}_t\}$ such that $t' = O((L/\epsilon)^d)$, then LO is harmless: it requires the same expected number of iterations as random guessing. A simple example of a sequence satisfying this property is $\hat{L}_t = \log(t)$, which grows extremely slowly relative to $(L/\epsilon)^d$ (for small $\epsilon$ and non-trivial $L$ or $d$). Faster-growing sequences would imply a smaller $t'$ and hence would also be harmless.
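The counting argument above is easy to check numerically. The sketch below is illustrative only (the helper name `iterations_to_eps` and the test function are hypothetical): it runs pure random search on the 1-d function $f(x) = |x - 0.5|$, which is Lipschitz with $L = 1$, so the per-draw success probability is $2\epsilon$ and the geometric hitting time should average $1/(2\epsilon)$, matching the $O((L/\epsilon)^d)$ rate for $d = 1$:

```python
import random

def iterations_to_eps(f, dim, eps, rng, max_iters=100_000):
    """Pure random search on [0, 1]^dim: count draws until f(x) <= f* + eps (f* = 0 here)."""
    for t in range(1, max_iters + 1):
        x = [rng.random() for _ in range(dim)]
        if f(x) <= eps:
            return t
    return max_iters

# 1-d test function: Lipschitz with L = 1, minimum f* = 0 at x* = 0.5.
f = lambda x: abs(x[0] - 0.5)

rng = random.Random(0)
eps = 0.05
trials = [iterations_to_eps(f, 1, eps, rng) for _ in range(2000)]
mean_iters = sum(trials) / len(trials)

# Success probability per draw is 2 * eps (the interval of radius eps/L around x*),
# so the geometric hitting time has mean 1 / (2 * eps) = 10 here.
print(mean_iters)  # close to the predicted mean of 10
```

Running the same experiment with larger `dim` (and a suitable multivariate test function) shows the hitting time growing exponentially in the dimension, as the $(L/\epsilon)^d$ bound predicts.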
Appendix B Regret Bound
Theorem 2.
Let $\mathcal{D}$ be a finite decision space and let $\sigma$ be the standard deviation of the noise in the observations. Let $L$ be a positive scalar such that $L \geq L^*$, where $L^*$ denotes the true Lipschitz constant of $f$, and let $\beta_t$ be chosen as in Lemma 1 below. If we run the AR-UCB algorithm under these conditions, then the expected cumulative regret can be bounded as follows:
Here, $\gamma_T$ refers to the information gain for the $T$ selected points and depends on the kernel being used. For the squared-exponential kernel, $\gamma_T = O((\log T)^{d+1})$ (Srinivas et al, 2010), which yields the following specific bound:
Proof.
By the definition of the Lipschitz bounds, and assuming we know the true Lipschitz constant $L$, at iteration $t$ we have, for all $x \in \mathcal{D}$,
$$\max_{j \leq t}\big[f(x_j) - L\,\|x - x_j\|\big] \;\leq\; f(x) \;\leq\; \min_{j \leq t}\big[f(x_j) + L\,\|x - x_j\|\big]. \qquad (13)$$
We now use the following lemma from Srinivas et al (2010):
Lemma 1 (Lemma 5.1 in Srinivas et al (2010)).
Denoting by $\mathcal{D}$ a finite decision space, pick $\delta \in (0,1)$ and $\pi_t > 0$ with $\sum_{t \geq 1} \pi_t^{-1} = 1$ (for example, $\pi_t = \pi^2 t^2 / 6$). Choose $\beta_t = 2\log(|\mathcal{D}|\,\pi_t/\delta)$. Then, for all $x \in \mathcal{D}$ and all $t \geq 1$, with probability at least $1 - \delta$,
$$|f(x) - \mu_{t-1}(x)| \;\leq\; \beta_t^{1/2}\,\sigma_{t-1}(x). \qquad (14)$$
For the point $x_t$ selected at round $t$, the following relation holds because of the Accept-Reject condition:
(15)  
The following holds because of the definition of the UCB rule:  
(16)  
From Equations 13 and 14,  
(17)  
From Equations 14 and 15,  
(18) 
Let $r_t$ denote the instantaneous regret in round $t$. Then,  
(From Equation 17)  
(From Equation 16)  
()  
()  
(From Equation 18)  
()  
(From Equation 13) 
Let us now consider the term  
(By Equation 6)  
()  
()  
From the above equations,  
Let $R_T = \sum_{t=1}^{T} r_t$ denote the cumulative regret after $T$ rounds.  
() 
We now bound the term $\sum_{t=1}^{T} \beta_t\,\sigma_{t-1}^2(x_t)$ using the following lemma from Srinivas et al (2010), which we restate next:
Lemma 2 (Lemma 5.4 in Srinivas et al (2010)).
Choosing $\beta_t$ as in Lemma 1,
$$\sum_{t=1}^{T} \beta_t\,\sigma_{t-1}^2(x_t) \;\leq\; C_1\,\beta_T\,\gamma_T,$$
where $C_1 = 8 / \log(1 + \sigma^{-2})$. Here $\gamma_T$ refers to the information gain for the $T$ selected points.
Using the above lemma, we obtain the following bound:
∎
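The Accept-Reject mechanism used above — discard candidate points that the Lipschitz bounds certify cannot contain the optimum, then apply the usual UCB rule to the survivors — can be sketched as follows. This is an illustrative 1-d, noiseless reconstruction under stated assumptions, not the paper's exact AR-UCB: the names `lipschitz_upper_bound` and `select_next`, and the stand-in `ucb` scoring function, are hypothetical.

```python
def lipschitz_upper_bound(x, observed, L):
    """u(x) = min_j [ f(x_j) + L * |x - x_j| ]: an upper bound on f(x) valid
    for any L-Lipschitz f consistent with the (noiseless) observations, in 1-d."""
    return min(fj + L * abs(x - xj) for xj, fj in observed)

def select_next(candidates, ucb, observed, L):
    """Pick the UCB maximizer among candidates the Lipschitz bounds cannot rule out.

    A candidate x is rejected when its Lipschitz upper bound falls below the
    best value observed so far: such an x cannot improve on the incumbent
    (we are maximizing), so there is no point evaluating it."""
    best_seen = max(fj for _, fj in observed)
    feasible = [x for x in candidates
                if lipschitz_upper_bound(x, observed, L) >= best_seen]
    if not feasible:  # the bounds pruned everything: fall back to plain UCB
        feasible = list(candidates)
    return max(feasible, key=ucb)

# Toy usage: two noiseless observations of a 2-Lipschitz function on [0, 1].
observed = [(0.0, 0.0), (1.0, -0.5)]
cands = [i / 10 for i in range(11)]
ucb = lambda x: -(x - 0.5) ** 2 + 0.3  # stand-in acquisition score, peaked at 0.5
x_next = select_next(cands, ucb, observed, L=2.0)
print(x_next)  # 0.5: the UCB maximizer survives the Lipschitz pruning
```

Note the pruning only ever removes points that provably cannot be optimal under the Lipschitz assumption, which is why the combination cannot hurt the regret guarantee when $L \geq L^*$.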
Appendix C Additional Experimental Results
Below we show the results of all the experiments on all the datasets:

Figure 5 shows the performance of Random search, BO, and LBO (using both the estimated and the true $L$) for the TS acquisition function.

Figure 6 shows the performance of Random search, BO, and LBO (using both the estimated and the true $L$) for the UCB acquisition function.

Figure 7 shows the performance of Random search, BO, and LBO (using both the estimated and the true $L$) for the EI acquisition function.

Figure 8 shows the performance of Random search, BO, and LBO (using both the estimated and the true $L$) for the PI acquisition function.

Figure 9 shows the performance of BO and LBO using the estimated $L$ for all of the acquisition functions.

Figure 10 shows the performance of Random search, BO, and LBO (using both the estimated and the true $L$) for the UCB acquisition function with a very large exploration parameter $\beta$.