Efficient Batch Black-box Optimization with Deterministic Regret Bounds

05/24/2019, by Yueming Lyu et al.

In this work, we investigate black-box optimization from the perspective of frequentist kernel methods. We propose a novel batch optimization algorithm that jointly maximizes the acquisition function and selects all points of a batch in a holistic way. Theoretically, we derive regret bounds for both the noise-free and the perturbation settings. Moreover, we analyze the property of the adversarial regret that is required for a robust initialization of Bayesian Optimization (BO), and prove that the adversarial regret bounds decrease as the covering radius decreases, which provides a criterion for generating an initialization point set that minimizes the bound. We then propose fast searching algorithms to generate a point set with a small covering radius for robust initialization. Experimental results on both synthetic benchmark problems and real-world problems show the effectiveness of the proposed algorithms.


1 Introduction

Bayesian Optimization (BO) is a promising approach to expensive black-box (non-convex) optimization problems. Applications of BO include automatic tuning of hyper-parameters in machine learning Nips2012practical, gait optimization in robot control robot, identifying molecular compounds in drug discovery drugdiscovery, and optimization of computation-intensive engineering designs engineerDisign.

BO aims to find the optimum of an unknown, usually non-convex, function. Since little information is known about the underlying function, BO needs to estimate a surrogate function to model it. Therefore, a major challenge of BO is to balance collecting information to model the function (exploration) against searching for an optimum based on the collected information (exploitation).

Typically, BO assumes that the underlying function is sampled from a Gaussian process (GP) prior. BO selects the candidate solutions for evaluation by maximizing an acquisition function (PI ; ei1975 ; EI), which balances exploration and exploitation given all previous observations. In practice, BO can usually find an approximate maximizer with a remarkably small number of function evaluations Nips2012practical ; scarlett2018tight.

In many real applications, it is often preferable to perform multiple function evaluations in parallel to achieve time efficiency, for example, examining several hyperparameter settings of a machine learning algorithm simultaneously or running multiple instances of a reinforcement learning simulation in parallel. Sequential BO selection is by no means efficient in these scenarios. Therefore, several batch BO approaches have been proposed to address this issue. Shah et al. shah2015parallel propose a parallel predictive entropy search method, which builds on predictive entropy search (PES) PES and adapts it to the batch case. Wu et al. wu2016parallel generalize the knowledge gradient method frazier2009knowledge to a parallel knowledge gradient method. These methods are, however, computationally intensive because they rely on Monte Carlo sampling. In addition, they do not scale to large batch sizes and they lack theoretical convergence guarantees.

Instead of using Monte Carlo sampling, another line of research improves efficiency by deriving a tighter upper confidence bound. The GP-BUCB policy GPBUCB makes selections point by point sequentially, according to an upper confidence bound (UCB) criterion auer2002using ; gpucb with a fixed mean function and an updated covariance function, until the preset batch size is reached. It proves sublinear growth bounds on the cumulative regret, which bound the number of iterations required to get close enough to the optimum. GP-UCB-PE UCBPE combines the UCB strategy with pure exploration PE within the evaluations of the same batch, and achieves a better upper bound on the cumulative regret than GP-BUCB. However, although these methods do not require Monte Carlo sampling, they select the candidate queries of a batch in a greedy manner. As a result, they are still far from satisfactory in terms of efficiency and scalability.

To achieve greater efficiency in batch selection, we propose to select all candidate queries of a batch simultaneously, in a holistic manner, rather than in the sequential manner of previous methods. In this paper, we analyze both batch BO and sequential BO from a frequentist perspective. For batch BO, we propose a novel selection method that takes both the mean prediction values and the correlation of the points in a batch into consideration. By jointly maximizing the novel acquisition function with respect to all the points in a batch, the proposed method attains a better exploitation/exploration trade-off. For sequential BO, we obtain an acquisition function similar to that of GP-UCB gpucb, except that ours employs a constant weight for the deviation term. The constant weight is preferred over the theoretical weight proposed in GP-UCB, because the latter is overly conservative, as has been observed in many other works bogunovic2016truncated ; bogunovic2018adversarially ; gpucb. Moreover, for functions with a bounded norm in the reproducing kernel Hilbert space (RKHS), we derive non-trivial regret bounds for both the batch BO method and the sequential BO method.

At the beginning of BO, when little information is available, the initialization phase is vitally important. To obtain a good and robust initialization, we first study the properties that a robust initialization must satisfy by analyzing the adversarial regret. We prove that the regret bounds decrease with the decrease of the covering radius (named fill distance in kanagawa2018gaussian). As minimizing the covering radius of a lattice is equivalent to maximizing its packing radius (named separate distance in kanagawa2018gaussian) rank1Image ; keller2007monte, we then propose a novel fast searching method that maximizes the packing radius of a rank-1 lattice and obtains a point set with a small covering radius.

Limitations of Bull’s batch method: Bull bull presents a non-adaptive batch method in which all the query points except one are fixed at the beginning. However, as mentioned by Bull, this method is not practical. We propose an adaptive BO method and initialize it with a robust initialization algorithm. More specifically, we first select the initialization query points by minimizing the covering radius, and then select the subsequent query points with our adaptive methods.

Relationship to Bull’s bounds: We give regret bounds w.r.t. the covering radius for different kernels, while Bull’s bound is limited to Matérn-type kernels. Compared with Bull’s bound (Theorem 1 in bull), our regret bound directly links to the covering radius (fill distance), which provides a criterion for generating a point set that achieves small bounds. In contrast, Bull’s bound does not provide a criterion for minimizing the bound. We generate the initialization point set by minimizing the covering radius (our bound), whereas Bull’s work does not.

Our contributions are summarized as follows:

  • We study black-box optimization for functions with a bounded norm in an RKHS and achieve deterministic regret bounds for both the noise-free setting and the perturbation setting. The study not only brings more insight into the BO literature, but also provides better guidance for designing new acquisition functions.

  • We propose novel, more efficient adaptive algorithms for batch optimization, which select the candidate queries of a batch in a holistic manner. Theoretically, we prove that the proposed methods achieve non-trivial regret bounds.

  • We analyze the adversarial regret for a robust initialization of BO, and theoretically prove that the regret bounds decrease with the decrease of the covering radius, which provides a criterion for generating a point set that minimizes the bound (for the initialization of BO).

  • We propose novel, fast searching algorithms that maximize the packing radius of a rank-1 lattice and generate a point set with a small covering radius. The generated point set provides a robust initialization for BO. Moreover, it can be used for integral approximation on the domain. Experimental results show that the proposed method achieves a larger packing radius (separate distance) than the baselines.

2 Notations and Symbols

Let H_k denote a separable reproducing kernel Hilbert space (RKHS) associated with the kernel k(·,·), and let ‖·‖_{H_k} denote the RKHS norm. ‖·‖_2 denotes the Euclidean norm. A bounded subset of the RKHS serves as the function class of the unknown function, and the domain is a compact subset of R^d. The symbol [N] denotes the set {1, ..., N}. Z and P denote the set of integers and the set of prime numbers, respectively. Bold capital letters are used for matrices.

3 Background and Problem Setup

Let f be the unknown black-box function to be optimized over a compact domain. BO aims to find a maximizer of the function f, i.e., a point x* with f(x*) = max_x f(x).

In sequential BO, a single point x_t is selected for the query of an observation at round t. Batch BO is introduced in the literature for the benefits of parallel execution. Batch BO methods select a batch of L points simultaneously at round t, where L is the batch size. Batch BO differs from sequential BO because the observations are delayed during the batch selection phase. An additional challenge arises in batch BO because it has to select a whole batch of points at once, without knowing the latest information about the function f.

In BO, the effectiveness of a selection policy can be measured by the cumulative regret and the simple regret (minimum regret) over T steps. The cumulative regret and simple regret are defined in equations (1) and (2), respectively:

R_T = \sum_{t=1}^{T} \big( f(x^*) - f(x_t) \big),   (1)
r_T = \min_{t \in [T]} \big( f(x^*) - f(x_t) \big).   (2)

The regret bounds in numerous theoretical works are based on the maximum information gain, defined as

\gamma_T := \max_{|A| = T} I(y_A; f_A),   (3)

where the maximum is taken over subsets A of the domain of size T. The bounds of \gamma_T for commonly used kernels are studied in gpucb. Specifically, Srinivas et al. gpucb state that \gamma_T = O(d \log T) for the linear kernel, \gamma_T = O((\log T)^{d+1}) for the squared exponential kernel, and \gamma_T = O(T^{d(d+1)/(2\nu + d(d+1))} \log T) for Matérn kernels with \nu > 1, where d is the dimension of the domain. We employ the term \gamma_T to build the regret bounds of our algorithms.

In this work, we consider two settings: the noise-free setting and perturbation setting.

Noise-Free Setting: We assume that the underlying function f belongs to an RKHS associated with the kernel k, i.e., f ∈ H_k, with a bounded RKHS norm. In the noise-free setting, we can observe f(x) directly, without noise perturbation.

Perturbation Setting: In the perturbation setting, we cannot observe the function evaluation directly. Instead, we observe the evaluation corrupted by an unknown perturbation function.

Define for , where and . We assume , with and , respectively. Therefore, we know and .

4 BO in Noise-Free Setting

In this section, we will first present algorithms and theoretical analysis in the sequential case. We then discuss our batch selection method. All detailed proofs are included in the supplementary material.

4.1 Sequential Selection in Noise Free Setting

Define \mu_t(x) and \sigma_t(x) as follows:

\mu_t(x) = k_t(x)^\top K_t^{-1} y_t,   (4)
\sigma_t^2(x) = k(x, x) - k_t(x)^\top K_t^{-1} k_t(x),   (5)

where k_t(x) = [k(x_1, x), \dots, k(x_t, x)]^\top, y_t = [y_1, \dots, y_t]^\top, and the kernel matrix K_t = [k(x_i, x_j)]_{i,j \le t}. These terms are closely related to the posterior mean and variance functions of a GP with zero observation noise. We use them in the deterministic setting. A detailed review of the relationships between GP methods and kernel methods can be found in kanagawa2018gaussian.
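For concreteness, here is a small NumPy sketch of equations (4) and (5); the squared exponential kernel, the unit prior variance k(x, x) = 1, and the optional regularizer lam (which turns the formulas into the regularized versions (11) and (12) used later) are illustrative choices rather than the paper's exact configuration.

```python
import numpy as np

def se_kernel(A, B, lengthscale=1.0):
    """Squared exponential kernel matrix between the rows of A and the rows of B."""
    A, B = np.atleast_2d(A), np.atleast_2d(B)
    sq_dists = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * np.maximum(sq_dists, 0.0) / lengthscale**2)

def posterior_mean_std(x, X, y, lengthscale=1.0, lam=0.0):
    """Mean and deviation of eqs. (4)-(5); lam > 0 gives the regularized eqs. (11)-(12)."""
    K = se_kernel(X, X, lengthscale) + lam * np.eye(len(X))
    kx = se_kernel(X, x, lengthscale)                     # column vector k_t(x), shape (t, 1)
    mean = (kx.T @ np.linalg.solve(K, np.asarray(y))).item()
    var = 1.0 - (kx.T @ np.linalg.solve(K, kx)).item()    # k(x, x) = 1 for the SE kernel
    return mean, np.sqrt(max(var, 0.0))
```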

The sequential optimization method for the noise-free setting is described in Algorithm 1. It has a similar form to GP-UCB gpucb, except that it employs a constant weight on the deviation term to balance exploration and exploitation, whereas GP-UCB uses an increasing weight. In practice, a constant weight is preferred in scenarios where an aggressive selection strategy is needed; for example, only a small number of evaluations can be performed when tuning the hyperparameters of RL algorithms with limited resources. The regret bounds of Algorithm 1 are given in Theorem 1.

Theorem 1.

Suppose associated with and . Let . Algorithm 1 achieves a cumulative regret bound and a simple regret bound given as follows:

(6)
(7)

Remark: We can obtain concrete bounds by replacing \gamma_T with the specific bound for the corresponding kernel; for example, for SE kernels we obtain the corresponding cumulative and simple regret bounds, respectively. Bull bull presents bounds for Matérn-type kernels. The bound in Theorem 1 is tighter than Bull’s bound for pure EI (Theorem 4 in bull) for certain values of the smoothness parameter of the Matérn kernel, but it is no better than the bound for mixed strategies (Theorem 5 in bull). Nevertheless, the bound in Theorem 1 makes fewer assumptions about the kernels and covers more general kernels than Bull’s work.

  for t = 1 to T do
     Obtain μ_{t-1} and σ_{t-1} via equations (4) and (5).
     Choose x_t = argmax_x μ_{t-1}(x) + β σ_{t-1}(x), with a constant weight β.
     Query the observation y_t at location x_t.
  end for
Algorithm 1: Sequential selection in the noise-free setting
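A minimal sketch of the selection loop in Algorithm 1, using the posterior_mean_std helper above, a finite candidate set in place of the continuous maximization over the domain, and an assumed constant weight beta; none of these choices are taken from the paper.

```python
import numpy as np

def sequential_noise_free_bo(objective, candidates, T, beta=2.0, lengthscale=1.0):
    """Sequential noise-free selection with a constant exploration weight (UCB-style rule)."""
    X = [candidates[0]]               # the paper uses a rank-1 lattice initialization instead
    y = [objective(candidates[0])]
    for _ in range(T - 1):
        scores = [
            m + beta * s
            for m, s in (posterior_mean_std(c, np.array(X), np.array(y), lengthscale)
                         for c in candidates)
        ]
        x_next = candidates[int(np.argmax(scores))]
        X.append(x_next)
        y.append(objective(x_next))   # noise-free observation
    best = int(np.argmax(y))
    return X[best], y[best]
```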

4.2 Batch Selection in Noise-Free Setting

Let N and L be the number of batches and the batch size, respectively. Without loss of generality, we assume T = NL. Let B_t = \{x_{t,1}, \dots, x_{t,L}\} denote the batch of points selected at round t, and let D_{t-1} denote the union of all batches selected before round t. The covariance matrix of f(B_t) for the noise-free case is given in equation (8):

\Sigma_{t-1}(B_t) = K_{B_t} - K_{B_t, D_{t-1}} K_{D_{t-1}}^{-1} K_{D_{t-1}, B_t},   (8)

where K_{B_t} is the kernel matrix of the batch B_t, and K_{B_t, D_{t-1}} denotes the kernel matrix between B_t and D_{t-1}. When t = 1, \Sigma_0(B_1) = K_{B_1} is the prior kernel matrix. We assume that the kernel matrix is invertible in the noise-free setting.

The proposed batch optimization algorithm is presented in Algorithm 2. It employs the mean prediction values of a batch together with a covariance term to balance the exploration/exploitation trade-off. The covariance term in Algorithm 2 penalizes batches with over-correlated points. Intuitively, for SE kernels and Matérn kernels, it penalizes batches whose points are too close to each other (in Euclidean distance). As a result, it encourages the points in a batch to spread out for better exploration. The regret bounds of our batch optimization method are summarized in Theorem 2.

Theorem 2.

Suppose associated with and . Let , and . Algorithm 2 with batch size achieves a cumulative regret bound and a simple regret bound given by equations (9) and (10), respectively:

(9)
(10)

Remark: (1) A large leads to a large bound, while a small attains a small bound. Algorithm 2 punishes the correlated points and encourages the uncorrelated points in a batch, which can attain a small in general. (2) A trivial bound of is .

To prove Theorem 2, we introduce the following lemma. The detailed proof can be found in the supplementary material.

Lemma 1.

Suppose associated with kernel and , then , where denotes the kernel covariance matrix with .

Remark: Lemma 1 provides a tighter bound on the deviation of the sum over a batch than directly applying the single-point bound L times.

  for t = 1 to N do
     Obtain μ_{t-1} and Σ_{t-1} via equations (4) and (8), respectively.
     Choose the batch B_t by jointly maximizing the batch acquisition function (mean prediction plus covariance term) over all L points.
     Query the batch observations at the locations in B_t.
  end for
Algorithm 2: Batch selection in the noise-free setting
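To make the holistic batch selection concrete, the sketch below scores an entire candidate batch at once using equation (8) and the helpers above. The paper does not spell out its covariance term here, so this sketch substitutes a log-determinant diversity term (a DPP-style penalty on over-correlated batches); eta, the random batch search, and that substitution are all assumptions, not the acquisition of Algorithm 2.

```python
import numpy as np

def batch_score(batch, X, y, eta=1.0, lengthscale=1.0):
    """Jointly score a whole batch: predicted values plus a diversity term.

    The log-determinant of the batch posterior covariance (eq. (8)) is used as a
    stand-in diversity term; it likewise discourages over-correlated batches."""
    K = se_kernel(X, X, lengthscale)
    K_B = se_kernel(batch, batch, lengthscale)
    K_BX = se_kernel(batch, X, lengthscale)
    Sigma = K_B - K_BX @ np.linalg.solve(K, K_BX.T)       # eq. (8)
    means = K_BX @ np.linalg.solve(K, np.asarray(y))      # posterior means of the batch
    _, logdet = np.linalg.slogdet(Sigma + 1e-10 * np.eye(len(batch)))
    return float(np.sum(means)) + eta * logdet

def select_batch(candidates, X, y, L=5, n_trials=2000, seed=0):
    """Holistic selection: jointly score randomly sampled L-point candidate batches."""
    rng = np.random.default_rng(seed)
    best_idx, best_score = None, -np.inf
    for _ in range(n_trials):
        idx = rng.choice(len(candidates), size=L, replace=False)
        score = batch_score(candidates[idx], X, y)
        if score > best_score:
            best_idx, best_score = idx, score
    return candidates[best_idx]
```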

5 BO in Perturbation Setting

In the perturbation setting, we cannot observe the function evaluation directly. Instead, we observe the evaluation corrupted by an unknown perturbation function. We discuss the sequential selection and batch selection methods in the following subsections.

5.1 Sequential Selection in Perturbation Setting

Define \mu_t(x) and \sigma_t(x) for the perturbation setting as follows:

\mu_t(x) = k_t(x)^\top (K_t + \lambda I)^{-1} y_t,   (11)
\sigma_t^2(x) = k(x, x) - k_t(x)^\top (K_t + \lambda I)^{-1} k_t(x),   (12)

where k_t(x) = [k(x_1, x), \dots, k(x_t, x)]^\top and the kernel matrix K_t = [k(x_i, x_j)]_{i,j \le t}.

The sequential selection method is presented in Algorithm 3. It has a similar form to Algorithm 1, but Algorithm 3 employs a regularization term to handle the uncertainty of the perturbation. The regret bounds of Algorithm 3 are summarized in Theorem 3.

  for t = 1 to T do
     Obtain μ_{t-1} and σ_{t-1} via equations (11) and (12).
     Choose x_t = argmax_x μ_{t-1}(x) + β σ_{t-1}(x), with a constant weight β.
     Query the observation y_t at location x_t.
  end for
Algorithm 3: Sequential selection in the perturbation setting
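Relative to the noise-free sketch, the only change needed for equations (11) and (12) is the regularizer in the linear solves; a brief usage example reusing the posterior_mean_std helper above (the data, lam = 1e-2, and the weight 2.0 are placeholders, not values from the paper):

```python
import numpy as np

# Perturbation setting: the same constant-weight rule, but with regularized solves
# (eqs. (11)-(12)); the constants below are placeholders, not the paper's values.
X_observed = np.array([[0.1, 0.2], [0.7, 0.9]])
y_observed = np.array([0.3, -0.5])
m, s = posterior_mean_std(np.array([0.4, 0.4]), X_observed, y_observed,
                          lengthscale=1.0, lam=1e-2)
ucb_score = m + 2.0 * s
```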
Theorem 3.

Define , where and . Suppose , associated with kernel and kernel with and , respectively. Let . Algorithm 3 achieves a cumulative regret bound and a simple regret bound given by equations (13) and (14), respectively.

(13)
(14)

Remark: In the perturbation setting, the unknown perturbation function results in an unavoidable additional regret term in the regret bound compared with GP-UCB gpucb. Note that the bounds in gpucb are probabilistic; there is always a positive probability that they fail. In contrast, the bounds in Theorem 3 are deterministic.

Corollary 1.

Suppose associated with and . Let . Algorithm 3 achieves a cumulative regret bound and a simple regret bound given by equations (15) and (16), respectively:

(15)
(16)
Proof.

Setting and in Theorem 3, we can achieve the results. ∎

Remark: In practice, a small constant is added to the kernel matrix to avoid numerical problems in the noise-free setting. Corollary 1 shows that this small constant results in an additional bias term in the regret bound. Theorem 1 employs (4) and (5) for the update, while Corollary 1 gives the regret bound for the practical update via (11) and (12).

5.2 Batch Selection in Perturbation Setting

The covariance matrix of f(B_t) in the perturbation setting is defined in equation (17):

\Sigma_{t-1}(B_t) = K_{B_t} - K_{B_t, D_{t-1}} (K_{D_{t-1}} + \lambda I)^{-1} K_{D_{t-1}, B_t},   (17)

where K_{B_t} is the kernel matrix of the batch B_t, and K_{B_t, D_{t-1}} denotes the kernel matrix between B_t and D_{t-1}. The batch optimization method for the perturbation setting is presented in Algorithm 4. The regret bounds of Algorithm 4 are summarized in Theorem 4.

Theorem 4.

Define , where and . Suppose and associated with kernel and kernel with and , respectively. Let , and . Algorithm 4 with batch size achieves a cumulative regret bound and a simple regret bound given by equations (18) and (19), respectively:

(18)
(19)

Remark: When the batch size is one, the regret bounds reduce to the sequential case.

Figure 1: The mean value of the simple regret for different algorithms over 30 runs on six synthetic test functions: (a) Rosenbrock, (b) Nesterov, (c) Different-Powers, (d) Dixon-Price, (e) Levy, and (f) Ackley.
  for t = 1 to N do
     Obtain μ_{t-1} and Σ_{t-1} via equations (11) and (17), respectively.
     Choose the batch B_t by jointly maximizing the batch acquisition function over all L points.
     Query the batch observations at the locations in B_t.
  end for
Algorithm 4: Batch selection in the perturbation setting

6 Experiments

In this section, we focus on the evaluation of the proposed batch method. We evaluate the proposed Batch Kernel OPtimization (BKOP) method by comparing it with GP-BUCB GPBUCB and GP-UCB-PE UCBPE on several synthetic benchmark test problems, on hyperparameter tuning of a deep network on CIFAR100 cifar100, and on the robot pushing task from mes. An empirical study of our fast rank-1 lattice searching method is included in the supplementary material.

Figure 2: The mean value of the simple regret on (a) the network tuning task on CIFAR100 and (b) the robot pushing task.

Synthetic benchmark problems: The synthetic test functions and their domains are listed in Table 4 in the supplementary material; they include nonconvex, nonsmooth, and multimodal functions.

We fix the weight of the covariance term in the acquisition function of BKOP to one in all the experiments. For all the synthetic test problems, we fix the dimension of the domain, and we set the batch size to 5 and 10 for all the batch BO algorithms. We use the ARD Matérn 5/2 kernel for all the methods. Instead of finding the optimum by discrete approximation, we employ the CMA-ES algorithm CMAES to optimize the acquisition function in the continuous domain for all the methods, which usually improves performance compared with discrete approximation. For each test problem, we use 20 rank-1 lattice points rescaled to the domain as the initialization. All the methods use the same initial points.
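For reference, maximizing an acquisition function in the continuous domain with CMA-ES can be sketched with the pycma package as follows; the box bounds, initial step size, and the negation trick are generic choices, not the exact settings used in the experiments.

```python
import numpy as np
import cma  # pip install cma

def maximize_acquisition(acq, lower, upper, sigma0=0.3, seed=0):
    """Maximize acq over a box [lower, upper] by minimizing -acq with CMA-ES."""
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    x0 = 0.5 * (lower + upper)                     # start from the center of the box
    es = cma.CMAEvolutionStrategy(x0, sigma0,
                                  {'bounds': [lower.tolist(), upper.tolist()],
                                   'seed': seed, 'verbose': -9})
    while not es.stop():
        xs = es.ask()                              # sample a population of candidates
        es.tell(xs, [-acq(np.asarray(x)) for x in xs])
    return np.asarray(es.result.xbest)
```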

The mean values and error bars of the simple regret over 30 independent runs for the different algorithms are presented in Figure 1. We observe that BKOP with batch sizes 5 and 10 performs better than the other methods with the same batch size. Moreover, algorithms with batch size 5 achieve a faster decrease in regret than those with batch size 10. BKOP achieves significantly lower regret than the other methods on the Different-Powers and Rosenbrock test functions.

Hyperparameter tuning of a network: We evaluate BKOP on hyperparameter tuning of a network on the CIFAR100 dataset. The network we employ contains three hidden building blocks, each consisting of one convolution layer, one batch normalization layer, and one ReLU layer. The depth of a building block is defined as the number of repetitions of these three layers. Seven hyperparameters are searched in total, namely the depth of the building blocks, the initial learning rate for SGD, the momentum weight, the weight of the L2 regularization, and three hyperparameters related to the filter size of each building block. We employ the default training set (i.e., 50,000 samples) for training, and use the default test set (i.e., 10,000 samples) to compute the validation error regret of automatic hyperparameter tuning for all the methods.

We employ five rank-1 lattice points rescaled to the domain as the initialization. All the methods use the same initial points. The mean value of the simple regret of the validation error (in percentage) over 10 independent runs is presented in Figure 2(a). We observe that BKOP with both batch size 5 and batch size 10 outperforms the other methods. Moreover, the performance of GP-UCB-PE with batch size 10 is worse than that of the others.

Robot Pushing Task: We further evaluate the performance of BKOP on the robot pushing task from mes. The goal of this task is to select a good action for pushing an object to a target location. The 4-dimensional robot pushing problem takes the robot location and angle, and the pushing duration, as input, and outputs the distance between the pushed object and the target location as the function value. We employ 20 rank-1 lattice points as the initialization. All the methods use the same initialization points. Thirty goal locations are randomly generated for testing, and all the methods use the same goal locations. The mean values and error bars over 30 trials are presented in Figure 2(b). We observe that BKOP with both batch size 5 and batch size 10 achieves lower regret than GP-BUCB and GP-UCB-PE.

7 Conclusion

We analyzed black-box optimization for functions with a bounded norm in an RKHS. For sequential BO, we obtained an acquisition function similar to that of GP-UCB, but with a constant deviation weight. For batch BO, we proposed the BKOP algorithm, which is competitive with, or better than, other batch confidence-bound methods on a variety of tasks. Theoretically, we derived regret bounds for both the sequential and the batch case. Furthermore, we derived adversarial regret bounds with respect to the covering radius and proposed fast searching methods to construct a good rank-1 lattice. Empirically, the proposed searching methods obtain a large packing radius (separate distance).

References

  • [1] Adrian Ebert, Hernan Leövey, and Dirk Nuyens. Successive coordinate search and component-by-component construction of rank-1 lattice rules. arXiv preprint arXiv:1703.06334, 2018.
  • [2] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
  • [3] Ilija Bogunovic, Jonathan Scarlett, Stefanie Jegelka, and Volkan Cevher. Adversarially robust optimization with gaussian processes. In Advances in Neural Information Processing Systems, pages 5765–5775, 2018.
  • [4] Ilija Bogunovic, Jonathan Scarlett, Andreas Krause, and Volkan Cevher. Truncated variance reduction: A unified approach to bayesian optimization and level-set estimation. In Advances in Neural Information Processing Systems, pages 1507–1515, 2016.
  • [5] Sébastien Bubeck, Rémi Munos, and Gilles Stoltz. Pure exploration in multi-armed bandits problems. In International conference on Algorithmic learning theory, pages 23–37. Springer, 2009.
  • [6] Adam D Bull. Convergence rates of efficient global optimization algorithms. Journal of Machine Learning Research, 12(Oct):2879–2904, 2011.
  • [7] Emile Contal, David Buffoni, Alexandre Robicquet, and Nicolas Vayatis. Parallel gaussian process optimization with upper confidence bound and pure exploration. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 225–240. Springer, 2013.
  • [8] Sabrina Dammertz and Alexander Keller. Image synthesis by rank-1 lattices. In Monte Carlo and Quasi-Monte Carlo Methods 2006, pages 217–236. Springer, 2008.
  • [9] Thomas Desautels, Andreas Krause, and Joel W Burdick. Parallelizing exploration-exploitation tradeoffs in gaussian process bandit optimization. Journal of Machine Learning Research, 15(1):3873–3923, 2014.
  • [10] Peter Frazier, Warren Powell, and Savas Dayanik. The knowledge-gradient policy for correlated normal beliefs. INFORMS journal on Computing, 21(4):599–613, 2009.
  • [11] Leonhard Grünschloß, Johannes Hanika, Ronnie Schwede, and Alexander Keller. (t, m, s)-nets and maximized minimum distance. In Monte Carlo and Quasi-Monte Carlo Methods 2006, pages 397–412. Springer, 2008.
  • [12] Nikolaus Hansen, Sibylle D Müller, and Petros Koumoutsakos. Reducing the time complexity of the derandomized evolution strategy with covariance matrix adaptation (cma-es). Evolutionary computation, 11(1):1–18, 2003.
  • [13] José Miguel Hernández-Lobato, Matthew W Hoffman, and Zoubin Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. In NIPS, pages 918–926, 2014.
  • [14] L-K Hua and Yuan Wang. Applications of number theory to numerical analysis. Springer Science & Business Media, 2012.
  • [15] Donald R Jones, Matthias Schonlau, and William J Welch. Efficient global optimization of expensive black-box functions. Journal of Global optimization, 13(4):455–492, 1998.
  • [16] Motonobu Kanagawa, Philipp Hennig, Dino Sejdinovic, and Bharath K Sriperumbudur. Gaussian processes and kernel methods: A review on connections and equivalences. arXiv preprint arXiv:1807.02582, 2018.
  • [17] Alexander Keller, Stefan Heinrich, and Harald Niederreiter. Monte Carlo and Quasi-Monte Carlo Methods 2006. Springer, 2007.
  • [18] N. M. Korobov. Properties and calculation of optimal coefficients. Dokl. Akad. Nauk SSSR, 132:1009–1012, 1960.
  • [19] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
  • [20] Harold J Kushner. A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise. Journal of Basic Engineering, 86(1):97–106, 1964.
  • [21] Daniel J Lizotte, Tao Wang, Michael H Bowling, and Dale Schuurmans. Automatic gait optimization with gaussian process regression. In IJCAI, volume 7, pages 944–949, 2007.
  • [22] Jonas Močkus. On bayesian methods for seeking the extremum. In Optimization Techniques IFIP Technical Conference, pages 400–404. Springer, 1975.
  • [23] Diana M Negoescu, Peter I Frazier, and Warren B Powell. The knowledge-gradient algorithm for sequencing experiments in drug discovery. INFORMS Journal on Computing, 23(3):346–363, 2011.
  • [24] Jonathan Scarlett. Tight regret bounds for bayesian optimization in one dimension. In Proceedings of the 35th International Conference on Machine Learning (ICML), pages 4500–4508, 2018.
  • [25] Amar Shah and Zoubin Ghahramani. Parallel predictive entropy search for batch global optimization of expensive objective functions. In NIPS, pages 3330–3338, 2015.
  • [26] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical bayesian optimization of machine learning algorithms. In NIPS, pages 2951–2959, 2012.
  • [27] Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In ICML, 2010.
  • [28] G Gary Wang and Songqing Shan. Review of metamodeling techniques in support of engineering design optimization. Journal of Mechanical design, 129(4):370–380, 2007.
  • [29] Zi Wang and Stefanie Jegelka. Max-value entropy search for efficient bayesian optimization. In International Conference on Machine Learning (ICML), page 3627–3635, 2017.
  • [30] Holger Wendland. Scattered data approximation, volume 17. Cambridge university press, 2004.
  • [31] Jian Wu and Peter Frazier. The parallel knowledge gradient method for batch bayesian optimization. In NIPS, pages 3126–3134, 2016.

Appendix A Robust Initialization for BO

In this section, we discuss how to achieve a robust initialization by analyzing the regret in the adversarial setting. We show that algorithms that attain a small covering radius (fill distance) achieve small adversarial regret bounds.

Let , be the black-box function to be optimized at round . Let with . The simple adversarial regret is defined as:

(20)

where the constraints ensure that each candidate function has the same observation values as the history at the previous query points. This can be viewed as an adversarial game: during each round, the opponent chooses a function from a candidate set, and we then choose a query so as to achieve a small regret. A robust initialization can be viewed as a batch of points that achieves low simple adversarial regret irrespective of the access order.

Define the covering radius (fill distance [16]) and the packing radius (separate distance [16]) of a point set X_n = \{x_1, \dots, x_n\} as in equations (21) and (22), respectively:

h_{X_n} := \sup_{x} \min_{x_i \in X_n} \|x - x_i\|_2,   (21)
\rho_{X_n} := \tfrac{1}{2} \min_{i \neq j} \|x_i - x_j\|_2,   (22)

where the supremum in (21) is taken over the domain.
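For intuition, these two quantities can be computed (or estimated) with a few lines of NumPy; the Monte Carlo estimate of the covering radius, the unit-cube domain, and the factor 1/2 in the packing radius follow the definitions above and are a sketch rather than the paper's implementation.

```python
import numpy as np

def covering_radius(points, n_ref=20000, seed=0):
    """Monte Carlo estimate of the fill distance of `points` over the unit cube."""
    rng = np.random.default_rng(seed)
    ref = rng.random((n_ref, points.shape[1]))        # dense reference sample of the domain
    d2 = ((ref**2).sum(1)[:, None] + (points**2).sum(1)[None, :]
          - 2.0 * ref @ points.T)                     # squared distances, shape (n_ref, n)
    return float(np.sqrt(np.maximum(d2, 0.0).min(axis=1)).max())

def packing_radius(points):
    """Separate distance: half of the minimum pairwise Euclidean distance."""
    diff = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diff ** 2).sum(-1))
    np.fill_diagonal(dists, np.inf)
    return 0.5 * float(dists.min())
```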

Our method for robust initialization is presented in Algorithm 5, which constructs an initialization set by minimizing the covering radius. We present one such construction method in Algorithm 6 in the next section. The initialization set can be evaluated in a batch manner, which benefits from parallel evaluation. The regret bounds of Algorithm 5 are summarized in Theorem 5 and Theorem 6.

Theorem 5.

Define associated with for . Suppose and is norm-equivalent to the Sobolev space of order . Then there exists a constant , such that the query point set generated by Algorithm 5 with a sufficiently small covering radius (fill distance) achieves a regret bound given by equation (23):

(23)

Remark: The regret bound decreases as the covering radius becomes smaller. This means that a query set with a small covering radius guarantees a small regret. Bull [6] gives bounds for a fixed point set for Matérn kernels (Theorem 1 in [6]); however, those bounds do not link to the covering radius. The bound in Theorem 5 directly links to the covering radius, which provides a criterion for generating points that achieve small bounds.

Theorem 6.

Define associated with the squared exponential kernel on the unit cube. Suppose . Then there exists a constant , such that the query point set generated by Algorithm 5 with a sufficiently small covering radius (fill distance) achieves a regret bound given by equation (24):

(24)

Remark: Theorem 6 presents a regret bound for the SE kernel. It attains a higher rate w.r.t. the covering radius than Theorem 5, because functions in an RKHS with an SE kernel are smoother than functions in the Sobolev space.

  Construct a candidate set with n points by minimizing the fill distance (e.g., Algorithm 6).
  Query the observations at the candidate set.
  Obtain μ and σ via equations (4) and (5).
  Choose the next query point by maximizing the acquisition function.
  Query the observation at the chosen location.
Algorithm 5: Robust initialization for BO
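Putting the pieces together, a sketch of the pipeline of Algorithm 5: query a rank-1 lattice rescaled to the domain, then take one adaptive step with the constant-weight rule. It reuses the posterior_mean_std helper from Section 4 and the rank1_lattice generator sketched in Appendix B; the base vector z, the weight beta, and the random candidate set standing in for the continuous maximization are all illustrative choices.

```python
import numpy as np

def robust_init_then_select(objective, z, n_init, lower, upper,
                            beta=2.0, n_cand=2000, seed=0):
    """Robust initialization via a rank-1 lattice, followed by one adaptive selection step."""
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    X = lower + rank1_lattice(z, n_init) * (upper - lower)   # initialization set (eq. (27))
    y = np.array([objective(x) for x in X])                  # can be evaluated in parallel
    rng = np.random.default_rng(seed)
    cand = lower + rng.random((n_cand, len(lower))) * (upper - lower)
    scores = [m + beta * s for m, s in
              (posterior_mean_std(c, X, y) for c in cand)]
    x_next = cand[int(np.argmax(scores))]                    # next query after initialization
    return X, y, x_next
```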
Figure 3: 100 lattice points (a) versus 100 random points (b).

We also analyze the regret under a more adversarial setting, which corresponds to a more robust requirement. The regret bounds under a fully adversarial setting, where little information is known, are summarized in Theorem 7.

Theorem 7.

Define associated with a shift invariant kernel that decreases w.r.t . Let with . Then the query point set generated by Algorithm 5 with covering radius (fill distance) achieves a regret bound as

(25)

Remark: Theorem 7 gives a fully adversarial bound; namely, the opponent can choose functions without the constraint of matching the observation history. The regret bound decreases with the decrease of the covering radius (fill distance).

Corollary 2.

Define associated with squared exponential kernel. Let with . Then the query point set generated by Algorithm 5 with covering radius (fill distance) achieves a regret bound as

(26)

Remark: For a regular grid [30], we can then achieve the corresponding bound. Computer search can find a point set with a smaller covering radius than that of a regular grid.

All the adversarial regret bounds discussed above decrease with the decrease of the covering radius. Thus, the point set generated for Algorithm 5 with a small covering radius can serve as a good robust initialization for BO.

Appendix B Fast Rank-1 Lattice Construction

  Input: Number of primes, dimension, number of lattice points
  Output: Lattice points and base vector

  Set , initialize .
  Construct set containing primes.
  for  each  do
     for  to  do
        Set , where and .
        Set .
        Set as by concatenating vector and .
        Generate lattice given base vector as Eq.(27).
        Calculate the packing radius (separate distance) of as Eq.(29).
        if   then
           Set and .
        end if
     end for
  end for
  Generate lattice given base vector as Eq.(27).
Algorithm 6 Rank-1 Lattice Construction
  Input: Number of primes, dimension, number of lattice points, number of iterations of the SCS search subroutine
  Output: Lattice points and base vector
  Set , initialize .
  Construct set containing primes.
  for  each  do
     for  to  do
        Set , where and .
        Set .
        Set as by concatenating vector and .
        Perform SCS search [1] with as the initialization base vector to get a better base and .
        if   then
           Set and .
        end if
     end for
  end for
  Generate lattice given base vector as Eq.(27).
Algorithm 7 Rank-1 Lattice Construction with Successive Coordinate Search (SCS)

In this section, we describe the procedure for generating a query point set with a small covering radius (fill distance). Because minimizing the covering radius of a lattice is equivalent to maximizing its packing radius (separate distance) [17], we generate the query point set by maximizing the packing radius (separate distance) of a rank-1 lattice. An illustration of the rank-1 lattice constructed by Algorithm 6 is given in Figure 3.

B.1 The rank-1 lattice construction given a base vector

Rank-1 lattices are widely used in the Quasi-Monte Carlo (QMC) literature for integral approximation [17, 18]. The lattice points of a rank-1 lattice in the unit cube are generated by a base vector. Given an integer base vector z, a lattice set that consists of n points in the unit cube is constructed as

x_i = \left\{ \frac{i \, z}{n} \right\} = \frac{(i \, z) \bmod n}{n}, \quad i = 0, 1, \dots, n-1,   (27)

where mod denotes the component-wise modular function, and \{\cdot\} denotes taking the (component-wise) fractional part of a number in this work.
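Equation (27) in code, assuming an integer base vector z and the unit cube:

```python
import numpy as np

def rank1_lattice(z, n):
    """Rank-1 lattice of eq. (27): x_i = frac(i * z / n) for i = 0, ..., n-1, in [0, 1)^d."""
    z = np.asarray(z, dtype=np.int64)
    i = np.arange(n, dtype=np.int64)[:, None]
    return (i * z[None, :] % n) / float(n)

# Example: rank1_lattice([1, 377], 610) gives a 610-point lattice in the unit square.
```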

B.2 The separate distance of a rank-1 lattice

Denote the toroidal distance [11] between two lattice points x and y in the unit cube as:

\|x - y\|_T := \sqrt{ \sum_{j=1}^{d} \big( \min( |x_j - y_j|, \, 1 - |x_j - y_j| ) \big)^2 }.   (28)

Because the difference (subtraction) of two lattice points is still a lattice point, and a rank-1 lattice is periodic with period 1, the packing radius (separate distance) of a rank-1 lattice X in the unit cube can be calculated as

\rho_X = \tfrac{1}{2} \min_{i \in \{1, \dots, n-1\}} \| x_i - x_0 \|_T,   (29)

where \|x_i - x_0\|_T can be seen as the toroidal distance between x_i and the origin x_0 = 0. This formulation computes the packing radius (separate distance) with a time complexity of O(nd), rather than the O(n^2 d) of a pairwise computation.
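The O(nd) computation of equations (28) and (29) in code, reusing the lattice generator above; whether to report the minimum toroidal distance itself (as in the tables below) or half of it (the packing radius of eq. (22)) is left explicit in the comments.

```python
import numpy as np

def toroidal_distance_to_origin(points):
    """Toroidal distance (eq. (28)) from each point in [0, 1)^d to the origin."""
    wrapped = np.minimum(points, 1.0 - points)    # per-coordinate wrap-around distance to 0
    return np.sqrt((wrapped ** 2).sum(axis=-1))

def lattice_min_distance(lattice):
    """Minimum pairwise toroidal distance of a rank-1 lattice, computed in O(n*d).

    Differences of lattice points are again lattice points, so the minimum pairwise
    toroidal distance equals the minimum distance from a nonzero lattice point to
    the origin (x_0 = 0). The packing radius of eq. (22) is half this value."""
    return float(toroidal_distance_to_origin(lattice[1:]).min())
```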

B.3 Searching for the rank-1 lattice with maximized separate distance

Given the number of primes, the dimension, and the number of lattice points, we search for the base vector whose corresponding lattice points maximize the separate distance over a candidate set. We adopt the algebra-field-based construction formula of [14] to construct the base vector of a rank-1 lattice. Instead of using exactly the same form as [14], we adopt the searching procedure summarized in Algorithm 6. The main idea is a greedy search starting from a set of prime numbers: for each prime number, it also searches over an offset to construct a candidate base vector and its corresponding lattice. After the greedy search, the algorithm returns the base vector and the lattice point set that attain the maximum separate distance. Algorithm 6 can be extended by including successive coordinate search (SCS) [1] as an inner search procedure; the extended method is summarized in Algorithm 7 and achieves superior performance compared with the other baselines.
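A simplified sketch of the greedy search in Algorithm 6, reusing rank1_lattice and lattice_min_distance from above. The algebra-field construction of [14] is replaced here by a Korobov-form base vector z = (1, a, a^2, ...) mod n, so this only illustrates the keep-the-best-separate-distance loop and is not the paper's exact construction.

```python
import numpy as np

def korobov_base(a, n, d):
    """Korobov-form base vector (1, a, a^2, ..., a^(d-1)) mod n; a stand-in for the
    algebra-field construction of [14] used inside Algorithm 6."""
    z = np.empty(d, dtype=np.int64)
    z[0] = 1
    for j in range(1, d):
        z[j] = (z[j - 1] * a) % n
    return z

def search_rank1_lattice(n, d, candidates=None):
    """Greedy search: keep the base vector whose lattice has the largest minimum distance."""
    if candidates is None:
        candidates = range(2, n)   # candidate multipliers; the paper searches primes and offsets
    best_z, best_dist = None, -np.inf
    for a in candidates:
        z = korobov_base(a, n, d)
        dist = lattice_min_distance(rank1_lattice(z, n))
        if dist > best_dist:
            best_z, best_dist = z, dist
    return best_z, best_dist
```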

B.4 Comparison of minimum distances generated by different methods

  Alg 6:    0.59632   1.0051    1.3031    1.5482    1.7571
  Korobov:  0.56639   0.90139   1.0695    1.2748    1.3987
  SCS:      0.60224   1.0000    1.2247    1.4142    1.5811
  Alg 7:    0.62738   1.0472    1.3620    1.6175    1.8401
Table 1: Minimum distance of 1,000 lattice points in the unit cube for five dimension settings.
  Alg 6:    0.54658   0.95561   1.2595    1.4996    1.7097
  Korobov:  0.51536   0.80039   0.96096   1.1319    1.2506
  SCS:      0.57112   0.98420   1.2247    1.4142    1.5811
  Alg 7:    0.58782   1.0144    1.3221    1.5758    1.8029
Table 2: Minimum distance of 2,000 lattice points in the unit cube for five dimension settings.
  Alg 6:    0.53359   0.93051   1.2292    1.4696    1.7009
  Korobov:  0.50000   0.67185   0.82285   0.95015   1.0623
  SCS:      0.52705   0.74536   0.91287   1.0541    1.1785
  Alg 7:    0.56610   0.98601   1.2979    1.5553    1.7771
Table 3: Minimum distance of 3,000 lattice points in the unit cube for five dimension settings.

We evaluate the proposed Algorithm 6 and Algorithm 7 by comparing them with searching in the Korobov form [18] and with SCS [1]. We fix the number of primes for Algorithm 6 and Algorithm 7 in all the experiments. The number of iterations of the SCS search [1] and the number of iterations of the SCS subroutine in Algorithm 7 are kept fixed as well.

The minimum distances of 1,000, 2,000, and 3,000 lattice points generated by the different methods are summarized in Tables 1, 2, and 3, respectively. Algorithm 7 achieves a larger separate (minimum) distance than the other searching methods, which means that it generates point sets with a smaller covering radius (fill distance). Thus, it provides a more robust initialization for BO. Moreover, Algorithm 7 can also be used to generate points for integral approximation on the unit cube.

B.5 Comparison between lattice points and random points

Figure 4: Lattice points versus random points: (a) 100 lattice points, (b) 100 random points, (c) 1000 lattice points, (d) 1000 random points.

The points generated by Algorithm 6 and by uniform random sampling are presented in Figure 4. We observe that the points generated by Algorithm 6 partition the domain into regular cells and attain a smaller covering radius (fill distance) than random sampling. Thus, they can be used as a robust initialization for BO.

Appendix C Synthetic Benchmark Test Problems

Synthetic benchmark test problems are listed in Table 4.

name function domain
Rosenbrock