# Bayesian Optimization with Exponential Convergence

This paper presents a Bayesian optimization method with exponential convergence without the need of auxiliary optimization and without the delta-cover sampling. Most Bayesian optimization methods require auxiliary optimization: an additional non-convex global optimization problem, which can be time-consuming and hard to implement in practice. Also, the existing Bayesian optimization method with exponential convergence requires access to the delta-cover sampling, which was considered to be impractical. Our approach eliminates both requirements and achieves an exponential convergence rate.

## Authors

• 19 publications
• 36 publications
• 26 publications
• ### Bayesian Multi-Scale Optimistic Optimization

Bayesian optimization is a powerful global optimization technique for ex...
02/27/2014 ∙ by Ziyu Wang, et al. ∙ 0

• ### Bayesian optimization in ab initio nuclear physics

Theoretical models of the strong nuclear interaction contain unknown cou...
02/03/2019 ∙ by A. Ekström, et al. ∙ 0

• ### Benchmarking five global optimization approaches for nano-optical shape optimization and parameter reconstruction

Numerical optimization is an important tool in the field of computationa...
09/18/2018 ∙ by Philipp-Immanuel Schneider, et al. ∙ 4

• ### Bayesian Optimization in AlphaGo

During the development of AlphaGo, its many hyper-parameters were tuned ...
12/17/2018 ∙ by Yutian Chen, et al. ∙ 128

• ### Lookahead Bayesian Optimization via Rollout: Guarantees and Sequential Rolling Horizons

Lookahead, also known as non-myopic, Bayesian optimization (BO) aims to ...
11/04/2019 ∙ by Xubo Yue, et al. ∙ 0

• ### Generic Camera Attribute Control using Bayesian Optimization

Cameras are the most widely exploited sensor in both robotics and comput...
07/21/2018 ∙ by Joowan Kim, et al. ∙ 0

• ### Stopping criteria for boosting automatic experimental design using real-time fMRI with Bayesian optimization

Bayesian optimization has been proposed as a practical and efficient too...
11/24/2015 ∙ by Romy Lorenz, et al. ∙ 0

## Code Repositories

### Bayesian_Optimization

Bayesian Optimisation for Parameter Tuning of the XOR Neural Network

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

We consider a general global optimization problem: maximize subject to where

is a non-convex black-box deterministic function. Such a problem arises in many real-world applications, such as parameter tuning in machine learning

[3], engineering design problems [4], and model parameter fitting in biology [5]. For this problem, one performance measure of an algorithm is the simple regret, , which is given by where

is the best input vector found by the algorithm. For brevity, we use the term “regret” to mean simple regret.

The general global optimization problem is known to be intractable if we make no further assumptions [6]. The simplest additional assumption to restore tractability is to assume the existence of a bound on the slope of . A well-known variant of this assumption is Lipschitz continuity with a known Lipschitz constant, and many algorithms have been proposed in this setting [7, 8, 9]

. These algorithms successfully guaranteed certain bounds on the regret. However appealing from a theoretical point of view, a practical concern was soon raised regarding the assumption that a tight Lipschitz constant is known. Some researchers relaxed this somewhat strong assumption by proposing procedures to estimate a Lipschitz constant during the optimization process

[10, 11, 12].

Bayesian optimization is an efficient way to relax this assumption of complete knowledge of the Lipschitz constant, and has become a well-recognized method for solving global optimization problems with non-convex black-box functions. In the machine learning community, Bayesian optimization—especially by means of a Gaussian process (GP)—is an active research area [13, 14, 15]. With the requirement of the access to the -cover sampling procedure (it samples the function uniformly such that the density of samples doubles in the feasible regions at each iteration), de Freitas et al. [1] recently proposed a theoretical procedure that maintains an exponential convergence rate (exponential regret). However, as pointed out by Wang et al. [2], one remaining problem is to derive a GP-based optimization method with an exponential convergence rate without the -cover sampling procedure, which is computationally too demanding in many cases.

In this paper, we propose a novel GP-based global optimization algorithm, which maintains an exponential convergence rate and converges rapidly without the -cover sampling procedure.

## 2 Gaussian Process Optimization

In Gaussian process optimization, we estimate the distribution over function and use this information to decide which point of should be evaluated next. In a parametric approach, we consider a parameterized function , with being distributed according to some prior. In contrast, the nonparametric GP approach directly puts the GP prior over as where is the mean function and is the covariance function or the kernel. That is, and . For a finite set of points, the GP model is simply a joint Gaussian: , where and is the number of data points. To predict the value of

at a new data point, we first consider the joint distribution over

of the old data points and the new data point:

 (f(x1:N)f(xN+1))∼N(m(x1:N)m(xN+1),[KkkTκ(xN+1,xN+1)])

where . Then, after factorizing the joint distribution using the Schur complement for the joint Gaussian, we obtain the conditional distribution, conditioned on observed entities and , as:

 f(xN+1)|DN,xN+1∼N(μ(xN+1|DN),σ2(xN+1|DN))

where and . One advantage of GP is that this closed-form solution simplifies both its analysis and implementation.

To use a GP, we must specify the mean function and the covariance function. The mean function is usually set to be zero. With this zero mean function, the conditional mean can still be flexibly specified by the covariance function, as shown in the above equation for . For the covariance function, there are several common choices, including the Matern kernel and the Gaussian kernel. For example, the Gaussian kernel is defined as where

is the kernel parameter matrix. The kernel parameters or hyperparameters can be estimated by empirical Bayesian methods

The flexibility and simplicity of the GP prior make it a common choice for continuous objective functions in the Bayesian optimization literature. Bayesian optimization with GP selects the next query point that optimizes the acquisition function generated by GP. Commonly used acquisition functions include the upper confidence bound (UCB) and expected improvement (EI). For brevity, we consider Bayesian optimization with UCB, which works as follows. At each iteration, the UCB function is maintained as where is a parameter of the algorithm. To find the next query for the objective function , GP-UCB solves an additional non-convex optimization problem with as . This is often carried out by other global optimization methods such as DIRECT and CMA-ES. The justification for introducing a new optimization problem lies in the assumption that the cost of evaluating the objective function dominates that of solving additional optimization problem.

For deterministic function, de Freitas et al. [1] recently presented a theoretical procedure that maintains exponential convergence rate. However, their own paper and the follow-up research [1, 2] point out that this result relies on an impractical sampling procedure, the -cover sampling. To overcome this issue, Wang et al. [2] combined GP-UCB with a hierarchical partitioning optimization method, the SOO algorithm [18], providing a regret bound with polynomial dependence on the number of function evaluations. They concluded that creating a GP-based algorithm with an exponential convergence rate without the impractical sampling procedure remained an open problem.

## 3 Infinite-Metric GP Optimization

### 3.1 Overview

The GP-UCB algorithm can be seen as a member of the class of bound-based search methods, which includes Lipschitz optimization, A* search, and PAC-MDP algorithms with optimism in the face of uncertainty. Bound-based search methods have a common property: the tightness of the bound determines its effectiveness. The tighter the bound is, the better the performance becomes. However, it is often difficult to obtain a tight bound while maintaining correctness. For example, in A* search, admissible heuristics maintain the correctness of the bound, but the estimated bound with admissibility is often too loose in practice, resulting in a long period of global search.

The GP-UCB algorithm has the same problem. The bound in GP-UCB is represented by UCB, which has the following property:

with some probability. We formalize this property in the analysis of our algorithm. The problem is essentially due to the difficulty of obtaining a tight bound

such that and (with some probability). Our solution strategy is to first admit that the bound encoded in GP prior may not be tight enough to be useful by itself. Instead of relying on a single bound given by the GP, we leverage the existence of an unknown bound encoded in the continuity at a global optimizer.

###### Assumption 1.

(Unknown Bound) There exists a global optimizer and an unknown semi-metric such that for all , and .

In other words, we do not expect the known upper bound due to GP to be tight, but instead expect that there exists some unknown bound that might be tighter. Notice that in the case where the bound by GP is as tight as the unknown bound by semi-metric in Assumption 1, our method still maintains an exponential convergence rate and an advantage over GP-UCB (no need for auxiliary optimization). Our method is expected to become relatively much better when the known bound due to GP is less tight compared to the unknown bound by .

As the semi-metric is unknown, there are infinitely many possible candidates that we can think of for . Accordingly, we simultaneously conduct global and local searches based on all the candidates of the bounds. The bound estimated by GP is used to reduce the number of candidates. Since the bound estimated by GP is known, we can ignore the candidates of the bounds that are looser than the bound estimated by GP. The source code of the proposed algorithm is publicly available at http://lis.csail.mit.edu/code/imgpo.html.

### 3.2 Description of Algorithm

Figure 1 illustrates how the algorithm works with a simple 1-dimensional objective function. We employ hierarchical partitioning to maintain hyperintervals, as illustrated by the line segments in the figure. We consider a hyperrectangle as our hyperinterval, with its center being the evaluation point of (blue points in each line segment in Figure 1). For each iteration , the algorithm performs the following procedure for each interval size:

1. [label=()]

2. Select the interval with the maximum center value among the intervals of the same size.

3. Keep the interval selected by (i) if it has a center value greater than that of any larger interval.

4. Keep the interval accepted by (ii) if it contains a UCB greater than the center value of any smaller interval.

5. If an interval is accepted by (iii), divide it along with the longest coordinate into three new intervals.

6. For each new interval, if the UCB of the evaluation point is less than the best function value found so far, skip the evaluation and use the UCB value as the center value until the interval is accepted in step (ii) on some future iteration; otherwise, evaluate the center value.

7. Repeat steps (i)–(v) until every size of intervals are considered

Then, at the end of each iteration, the algorithm updates the GP hyperparameters. Here, the purpose of steps (i)–(iii) is to select an interval that might contain the global optimizer. Steps (i) and (ii) select the possible intervals based on the unknown bound by , while Step (iii) does so based on the bound by GP.

We now explain the procedure using the example in Figure 1. Let be the number of divisions of intervals and let be the number of function evaluations. is the number of iterations. Initially, there is only one interval (the center of the input region ) and thus this interval is divided, resulting in the first diagram of Figure 1. At the beginning of iteration , step (i) selects the third interval from the left side in the first diagram (, as its center value is the maximum. Because there are no intervals of different size at this point, steps (ii) and (iii) are skipped. Step (iv) divides the third interval, and then the GP hyperparameters are updated, resulting in the second diagram (). At the beginning of iteration , it starts conducting steps (i)–(v) for the largest intervals. Step (i) selects the second interval from the left side and step (ii) is skipped. Step (iii) accepts the second interval, because the UCB within this interval is no less than the center value of the smaller intervals, resulting in the third diagram (). Iteration continues by conducting steps (i)–(v) for the smaller intervals. Step (i) selects the second interval from the left side, step (ii) accepts it, and step (iii) is skipped, resulting in the forth diagram (). The effect of the step (v) can be seen in the diagrams for iteration . At , the far right interval is divided, but no function evaluation occurs. Instead, UCB values given by GP are placed in the new intervals indicated by the red asterisks. One of the temporary dummy values is resolved at when the interval is queried for division, as shown by the green asterisk. The effect of step (iii) for the rejection case is illustrated in the last diagram for iteration . At , is increased to 10 from 9, meaning that the largest intervals are first considered for division. However, the three largest intervals are all rejected in step (iii), resulting in the division of a very small interval near the global optimum at .

### 3.3 Technical Detail of Algorithm

We define to be the depth of the hierarchical partitioning tree, and to be the center point of the hyperrectangle at depth . is the number of the GP evaluations. Define to be the largest integer such that the set is not empty. To compute UCB , we use where is the number of the calls made so far for (i.e., each time we use , we increment by one). This particular form of is to maintain the property of during an execution of our algorithm with probability at least . Here, is the parameter of IMGPO. is another parameter, but it is only used to limit the possibly long computation of step (iii) (in the worst case, step (iii) computes UCBs times although it would rarely happen).

The pseudocode is shown in Algorithm 1. Lines 8 to 23 correspond to steps (i)-(iii). These lines compute the index of the candidate of the rectangle that may contain a global optimizer for each depth . For each depth , non-null index at Line 24 indicates the remaining candidate of a rectangle that we want to divide. Lines 24 to 33 correspond to steps (iv)-(v) where the remaining candidates of the rectangles for all are divided. To provide a simple executable division scheme (line 29), we assume to be a hyperrectangle (see the last paragraph of section 4 for a general case).

Lines 8 to 17 correspond to steps (i)-(ii). Specifically, line 10 implements step (i) where a single candidate is selected for each depth, and lines 11 to 12 conduct step (ii) where some candidates are screened out. Lines 13 to 17 resolve the the temporary dummy values computed by GP. Lines 18 to 23 correspond to step (iii) where the candidates are further screened out. At line 21, indicates the set of all center points of a fully expanded tree until depth within the region covered by the hyperrectangle centered at . In other words, contains the nodes of the fully expanded tree rooted at with depth and can be computed by dividing the current rectangle at and recursively divide all the resulting new rectangles until depth (i.e., depth from , which is depth in the whole tree).

### 3.4 Relationship to Previous Algorithms

The most closely related algorithm is the BaMSOO algorithm [2], which combines SOO with GP-UCB. However, it only achieves a polynomial regret bound while IMGPO achieves a exponential regret bound. IMGPO can achieve exponential regret because it utilizes the information encoded in the GP prior/posterior to reduce the degree of the unknownness of the semi-metric .

The idea of considering a set of infinitely many bounds was first proposed by Jones et al. [19]. Their DIRECT algorithm has been successfully applied to real-world problems [4, 5], but it only maintains the consistency property (i.e., convergence in the limit) from a theoretical viewpoint. DIRECT takes an input parameter to balance the global and local search efforts. This idea was generalized to the case of an unknown semi-metric and strengthened with a theoretical support (finite regret bound) by Munos [18] in the SOO algorithm. By limiting the depth of the search tree with a parameter , the SOO algorithm achieves a finite regret bound that depends on the near-optimality dimension.

## 4 Analysis

In this section, we prove an exponential convergence rate of IMGPO and theoretically discuss the reason why the novel idea underling IMGPO is beneficial. The proofs are provided in the supplementary material. To examine the effect of considering infinitely many possible candidates of the bounds, we introduce the following term.

###### Definition 1.

(Infinite-metric exploration loss). The infinite-metric exploration loss is the number of intervals to be divided during iteration .

The infinite-metric exploration loss can be computed as at line 25. It is the cost (in terms of the number of function evaluations) incurred by not committing to any particular upper bound. If we were to rely on a specific bound, would be minimized to 1. For example, the DOO algorithm [18] has . Even if we know a particular upper bound, relying on this knowledge and thus minimizing is not a good option unless the known bound is tight enough compared to the unknown bound leveraged in our algorithm. This will be clarified in our analysis. Let be the maximum of the averages of for (i.e., .

###### Assumption 2.

For some pair of a global optimizer and an unknown semi-metric that satisfies Assumption 1, both of the following, (i) shape on and (ii) lower bound constant, conditions hold:

1. [label=()]

2. there exist , and in such that for all , .

3. there exists such that for all , .

In Theorem 1, we show that the exponential convergence rate with is achieved. We define to be the largest used so far with total node expansions. For simplicity, we assume that is a square, which we satisfied in our experiments by scaling original .

###### Theorem 1.

Assume Assumptions 1 and 2. Let . Let . Then, with probability at least , the regret of IMGPO is bounded as

 rN≤L(3βD1/p)αexp(−α[N+Ngp2CD¯ρt−Ξn−2]ln3)=O(λN+Ngp).

Importantly, our bound holds for the best values of the unknown and even though these values are not given. The closest result in previous work is that of BaMSOO [2], which obtained with probability for . As can be seen, we have improved the regret bound. Additionally, in our analysis, we can see how and affect the bound, allowing us to view the inherent difficulty of an objective function in a theoretical perspective. Here, is a constant in and is used in previous work [18, 2]. For example, if we conduct or function evaluations per node-expansion and if , we have that .

We note that can get close to one as input dimension increases, which suggests that there is a remaining challenge in scalability for higher dimensionality. One strategy for addressing this problem would be to leverage additional assumptions such as those in [14, 20].

###### Remark 1.

(The effect of the tightness of UCB by GP) If UCB computed by GP is “useful” such that , then our regret bound becomes . If the bound due to UCB by GP is too loose (and thus useless), can increase up to (due to ), resulting in the regret bound of , which can be bounded by 111This can be done by limiting the depth of search tree as . Our proof works with this additional mechanism, but results in the regret bound with being replaced by . Thus, if we assume to have at least “not useless” UCBs such that , this additional mechanism can be disadvantageous. Accordingly, we do not adopt it in our experiments.. This is still better than the known results.

###### Remark 2.

(The effect of GP) Without the use of GP, our regret bound would be as follows: , where is the infinite-metric exploration loss without GP. Therefore, the use of GP reduces the regret bound by increasing and decreasing , but may potentially increase the bound by increasing .

###### Remark 3.

(The effect of infinite-metric optimization) To understand the effect of considering all the possible upper bounds, we consider the case without GP. If we consider all the possible bounds, we have the regret bound for the best unknown and . For standard optimization with a estimated bound, we have for an estimated , and . By algebraic manipulation, considering all the possible bounds has a better regret when . For an intuitive insight, we can simplify the above by assuming and as . Because and are the ones that achieve the lowest bound, the logarithm on the right-hand side is always non-negative. Hence, always satisfies the condition. When and are not tight enough, the logarithmic term increases in magnitude, allowing to increase. For example, if the second term on the right-hand side has a magnitude of greater than 0.5, then satisfies the inequality. Therefore, even if we know the upper bound of the function, we can see that it may be better not to rely on this, but rather take the infinite many possibilities into account.

One may improve the algorithm with different division procedures than one presented in Algorithm 1 as discussed in the supplementary material.

## 5 Experiments

In this section, we compare the IMGPO algorithm with the SOO, BaMSOO, GP-PI and GP-EI algorithms [18, 2, 3]. In previous work, BaMSOO and GP-UCB were tested with a pair of a handpicked good kernel and hyperparameters for each function [2]. In our experiments, we assume that the knowledge of good kernel and hyperparameters is unavailable, which is usually the case in practice. Thus, for IMGPO, BaMSOO, GP-PI and GP-EI, we simply used one of the most popular kernels, the isotropic Matern kernel with . This is given by , where . Then, we blindly initialized the hyperparameters to and for all the experiments; these values were updated with an empirical Bayesian method after each iteration. To compute the UCB by GP, we used for IMGPO and BaMSOO. For IMGPO, was fixed to be (the effect of selecting different values is discussed later). For BaMSOO and SOO, the parameter was set to , according to Corollary 4.3 in [18]. For GP-PI and GP-EI, we used the SOO algorithm and a local optimization method using gradients to solve the auxiliary optimization. For SOO, BaMSOO and IMGPO, we used the corresponding deterministic division procedure (given

, the initial point is fixed and no randomness exists). For GP-PI and GP-EI, we randomly initialized the first evaluation point and report the mean and one standard deviation for 50 runs.

The experimental results for eight different objective functions are shown in Figure 2. The vertical axis is log, where is the global optima and is the best value found by the algorithm. Hence, the lower the plotted value on the vertical axis, the better the algorithm’s performance. The last five functions are standard benchmarks for global optimization [21]. The first two were used in [18] to test SOO, and can be written as for Sin1 and for Sin2. The form of the third function is given in Equation (16) and Figure 2 in [22]. The last function is Sin2 embedded in 1000 dimension in the same manner described in Section 4.1 in [14], which is used here to illustrate a possibility of using IMGPO as a main subroutine to scale up to higher dimensions with additional assumptions. For this function, we used REMBO [14] with IMGPO and BaMSOO as its Bayesian optimization subroutine. All of these functions are multimodal, except for Rosenbrock2, with dimensionality from 1 to 1000.

As we can see from Figure 2, IMGPO outperformed the other algorithms in general. SOO produced the competitive results for Rosenbrock2 because our GP prior was misleading (i.e., it did not model the objective function well and thus the property did not hold many times). As can be seen in Table 1, IMGPO is much faster than traditional GP optimization methods although it is slower than SOO. For Sin 1, Sin2, Branin and Hartmann3, increasing does not affect IMGPO because did not reach (Figure 2). For the rest of the test functions, we would be able to improve the performance of IMGPO by increasing at the cost of extra CPU time.

## 6 Conclusion

We have presented the first GP-based optimization method with an exponential convergence rate () without the need of auxiliary optimization and the -cover sampling. Perhaps more importantly in the viewpoint of a broader global optimization community, we have provided a practically oriented analysis framework, enabling us to see why not relying on a particular bound is advantageous, and how a non-tight bound can still be useful (in Remarks 1, 2 and 3). Following the advent of the DIRECT algorithm, the literature diverged along two paths, one with a particular bound and one without. GP-UCB can be categorized into the former. Our approach illustrates the benefits of combining these two paths.

As stated in Section 3.1, our solution idea was to use a bound-based method but rely less on the estimated bound by considering all the possible bounds. It would be interesting to see if a similar principle can be applicable to other types of bound-based methods such as planning algorithms (e.g., A* search and the UCT or FSSS algorithm [23]) and learning algorithms (e.g., PAC-MDP algorithms [24]).

#### Acknowledgments

The authors would like to thank Dr. Remi Munos for his thoughtful comments and suggestions. We gratefully acknowledge support from NSF grant 1420927, from ONR grant N00014-14-1-0486, and from ARO grant W911NF1410433. Kenji Kawaguchi was supported in part by the Funai Overseas Scholarship. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of our sponsors.

Bayesian Optimization with Exponential Convergence: Supplementary Material

In this supplementary material, we provide the proofs of the theoretical results. Along the way, we also prove regret bounds for a general class of algorithms, the result of which may be used to design a new algorithm.

We first provide a known property of the upper confidence bound of GP.

###### Lemma 1.

(Bound Estimated by GP) According to the belief encoded in the GP prior/posterior222Thus, the probability in this analysis should be seen as that of the subjective view. If we assume that is indeed a sample from the GP, we have the same result with the objective view of probability. , for any , holds during the execution of Algorithm 1 with probability at least .

###### Proof.

It follows the proof of lemma 5.1 of [15]

. From the property of the standard gaussian distribution,

. Taking union bound on the entire execution of Algorithm 1, . Substituting , we obtain the statement. ∎

Our algorithm has a concrete division procedure in line 27 of Algorithm 1. However, one may improve the algorithm with different division procedures. Accordingly, we first derive abstract version of regret bound for the IMGPO (Algorithm 1) under a family of division procedures that satisfy Assumptions 3 and 4. After that, we provide a proof for the main results in the paper.

### A With Family of Division Procedure

In this section, we modify the result obtained by [18]. Let to be any point in the region covered by the th hyperinterval at depth , and be the global optimizer that may exist in the th hyperinterval at depth . The previous work provided the regret bound of the SOO algorithm with a family of division procedure that satisfies the following two assumptions.

###### Assumption 3.

(Decreasing diameter) There exists a diameter function such that, for any hyperinterval and its center and any , we have and for all 1.

###### Assumption 4.

(Well-shaped cell) There exists 0 such that any hyperinterval contains at least an -ball of radius centered in .

Thus, in this section, hyperinterval is not restricted to hyperrectangle. We now revisit the definitions of several terms and variables used in [18]. Let the -optimal space be defined as . That is, the -optimal space is the set of input vectors whose function value is at least -close to the global optima. To bound the number of hyperintervals relevant to this -optimal space, we define a near-optimality dimension as follows.

###### Definition 3.

(Near-optimality dimension) The near-optimality dimension is the smallest such that, there exists for all the maximum number of disjoint -balls of radius with center in the -optimal space is less than .

Finally, we define the set of -optimal hyperintervals as . The -optimal hyperinterval is used to relate the hyperintervals to the -optimal space. Indeed, the -optimal hyperinterval is almost identical to the -optimal space , except that is focused on the center points whereas considers the whole input vector space. In the following, we use to denote the number of and derive its upper bound.

###### Lemma 2.

(Lemma 3.1 in [18]) Let be the near-optimality dimension and denote the corresponding constant in Definition 1. Then, the number of -optimal hyperintervals is bounded by .

We are now ready to present the main result in this section. In the following, we use the term optimal hyperinterval to indicate a hyperinterval that contains a global optimizer . We say a hyperinterval is dominated by other intervals when it is rejected or not selected in step (i)-(iii). In Lemma 3, we bound the maximum size of the optimal hyperinterval. From Assumption 1, this can be translated to the regret bound, as we shall see in Theorem 2.

###### Lemma 3.

Let be the largest used so far with total node expansions. Let be the depth of the deepest expanded node that contains a global optimizer after total node expansions (i.e., determines the size of the optimal hyperinterval). Then, with probability at least , is bounded below by some that satisfies

 n≥∑h′+Ξl=0|Il|∑τ =1ρτ.
###### Proof.

Let denote the time at which the optimal hyperinterval is further divided. We prove the statement by showing that the time difference is bounded by the number of -optimal hyperintervals. To do so, we first note that there are three types of hyperinterval that can dominate an optimal hyperinterval during the time , all of which belong to -optimal hyperintervals . The first type has the same size (i.e., same depth ), . In this case,

 f(ch+1,i)≥f(ch+1,∗)≥f(x∗h+1,∗)−δ(h+1),

where the first inequality is due to line 10 (step (i)) and the second follows Assumptions 1 and 2. Thus, it must be . The second case is where the optimal hyperinterval may be dominated by a hyperinterval of larger size (depth ), . In this case, similarly,

 f(cl,i)≥f(ch+1,∗)≥f(x∗h+1,∗)−δ(l),

where the first inequality is due to lines 11 to 12 (step (ii)) and thus . In the final scenario, the optimal hyperinterval is dominated by a hyperinterval of smaller size (depth ), . In this case,

 f(ch+1+ξ,i)≥z(h+1,∗)≥f(x∗h+1,∗)−δ(h+1+ξ)

with probability at least where is defined in line 21 of Algorithm 1. The first inequality is due to lines 19 to 23 (step (iii)) and the second inequality follows Lemma 1 and Assumptions 1 and 3. Hence, we can see that .

For all of the above arguments, the temporarily assigned under GP has no effect. This is because the algorithm still covers the above three types of -optimal hyperintervals , as with probability at least (Lemma 1). However, these are only expanded based on because of the temporary nature of . Putting these results together,

 Th+1−Th≤∑h+1+Ξnl=1|Iδ(l)|∑τ=1ρτ.

Since if one of the is divided during , it cannot be divided again during another time period,

 h∗n∑h=0Th+1−Th≤∑h∗n+1+Ξnl=1|Il|∑τ=1ρτ,

where on the right-hand side, we could combine the summation and into the one, because each in the summation refers to the same -optimal interval with , and should not be double-counted. As , and ,

 Th∗n+1≤1+∑h∗n+1+Ξnl=1|Il|∑τ=1ρτ≤∑h∗n+1+Ξnl=0|Il|∑τ=1ρτ.

As by definition, for any such that , we have . ∎

With Lemmas 2 and 3, we are ready to present a finite regret bound with the family of division procedures.

###### Theorem 2.

Assume Assumptions 1, 3, and 4. Let be the smallest integer such that

 n≤C∑h+Ξnl=0δ(l)−d∑τ=1ρτ.

Then, with probability at least , the regret of the IMGPO with any general division procedure is bounded as

 rn≤δ(h(n)−1).
###### Proof.

Let and be the center point expanded at the th expansion and the optimal hyperinterval containing a global optimizer , respectively. Then, from Assumptions 1, 3, and 4, , where is the global optima. Hence, the regret bound is . To find a lower bound for the quantity , we first relate to Lemma 3 by

 n>C∑h(n)+Ξn−1l=0δ(l)−d∑τ=1ρτ ≥∑h(n)+Ξn−1l=0|Il|∑τ=1ρτ,

where the first inequality comes from the definition of , and the second follows from Lemma 2. Then, from Lemma 3, we have . Therefore, . ∎

###### Assumption 5.

(Decreasing diameter revisit) The decreasing diameter defined in Assumption 3 can be written as for some and with a division procedure that requires function evaluations per node expansion.

###### Corollary 1.

Assume Assumptions 1, 3, 4, and 5. Then, if , with probability at least ,

 rN≤O(exp(−N+Ngpc2CD¯ρt)).

If , with probability at least ,

 rN≤O⎛⎝(1N+Ngp)1/d(−c2C¯ρt1−γd/D)1/dγ−1D⎞⎠.
###### Proof.

For the case , we have , where the first inequality follows from the definition of , and the second comes from the definition of and the assumption . The second inequality holds for that only considers with . This is computable, because by construction. Indeed, the condition of Lemma 3 implies . Therefore, the two inequalities hold, and we can deduce that by algebraic manipulation. By Assumption 5, . With this, substituting the lower bound of into the statement of Theorem 2 with Assumption 5,

 rN≤c1exp(−[N+Ngpc2D1C¯ρt−Ξn−2]ln1γ).

Similarly, for the case ,

 n≤C∑h(n)+Ξnl=0δ(l)−d∑τ=1ρτ≤c−dCγ−(h(n)+Ξn+1)d/D−1γ−d/D−1∑τ=1¯ρt,

and hence by algebraic manipulation. Substituting this into the result of Theorem 2, we arrive at the desired result. ∎

### B With a Concrete Division Procedure

In this section, we prove the main result in the paper. In Theorem 1, we show that the exponential convergence rate bound with is achieved without Assumptions 3, 4 and 5 and without the assumption that .

###### Theorem 1.

Assume Assumptions 1 and 2. Let . Let . Then, without Assumptions 3, 4 and 5 and without the assumption on , with probability at least , the regret of IMGPO with the division procedure in Algorithm 1 is bounded as

 rN≤L(3βD1/p)αexp(−α[N+Ngp2C¯ρtD−Ξn−2]ln3)=O(λN+Ngp).
###### Proof.

To prove the statement, we show that Assumptions 3, 4, and 5 can all be satisfied while maintaining .

From Assumption 2 (i), and based on the division procedure that the algorithm uses,

 supx∈ωh,iℓ(x,ch,i)≤supx∈ωh,iL||x−ch,i||αp≤L(3−⌊h/D⌋βD1/p)α.

This upper bound corresponds to the diagonal length of each hyperrectangle with respect to -norm, where corresponds to the length of the longest side. We fix the form of as , which satisfies Assumption 3.

This form of also satisfies Assumption 5 with and .

Every hyperrectangle contains at least one -ball with a radius corresponding to the length of the shortest side of the hyperrectangle. Thus, we have at least one -ball of radius