1 Introduction
Sequential optimization has evolved into one of the fastest developing areas of machine learning
(Mazumdar et al., 2020). We consider sequential optimization of an unknown objective function $f$ from noisy and expensive-to-evaluate zeroth-order¹ observations.

¹Zeroth-order feedback signifies observations of the function values $f(x)$, in contrast to first-order feedback, which refers to observations of the gradient of $f$, as, e.g., in stochastic gradient descent (see, e.g., Agarwal et al., 2011; Vakili and Zhao, 2019).

This is a ubiquitous problem in academic research and industrial production. Examples of applications include exploration in reinforcement learning, recommendation systems, medical analysis tools and speech recognizers
(Shahriari et al., 2016). A notable application in the field of machine learning is automatic hyperparameter tuning. Prevalent methods such as grid search can be prohibitively expensive (Bergstra et al., 2011; McGibbon et al., 2016). Sequential optimization methods, on the other hand, are shown to efficiently find good hyperparameters by an adaptive exploration of the hyperparameter space (Falkner et al., 2018).

Our sequential optimization setting is as follows. Consider an objective function $f$ defined over a domain $\mathcal{X}\subset\mathbb{R}^d$, where $d$ is the dimension of the input. A learning algorithm is allowed to perform an adaptive exploration to sequentially observe the potentially corrupted values $y_t=f(x_t)+\epsilon_t$ of the objective function, where $\epsilon_t$ are random noise terms. At the end of $T$ exploration trials, the learning algorithm returns a candidate maximizer $\hat{x}_T$ of $f$. Let $x^*$ be a true optimal solution. We may measure the performance of the learning algorithm in terms of simple regret; that is, the difference between the performance under the true optimal, $f(x^*)$, and that under the learnt value, $f(\hat{x}_T)$.
Our formulation falls under the general framework of continuum-armed bandits, in which feedback is received only for the selected observation point at each time (Agrawal, 1995; Kleinberg, 2004; Bubeck et al., 2011b, a). Bandit problems have been extensively studied under numerous settings and various performance measures, including simple regret (see, e.g., Bubeck et al., 2011b; Carpentier and Valko, 2015; Deshmukh et al., 2018), cumulative regret (see, e.g., Auer et al., 2002; Slivkins, 2019; Zhao, 2019), and best arm identification (see, e.g., Audibert et al., 2010; Grover et al., 2018). The choice of performance measure strongly depends on the application. Simple regret is suitable for situations with a preliminary exploration phase (for instance, hyperparameter tuning) in which costs are not measured in terms of rewards but rather in terms of resources expended (Bubeck et al., 2011b).
Due to the infinite cardinality of the domain, approaching $f(x^*)$ is feasible only when appropriate regularity assumptions on $f$ and the noise are satisfied. Following a growing literature (Srinivas et al., 2010; Chowdhury and Gopalan, 2017; Janz et al., 2020; Vakili et al., 2020a), we focus on a variation of the problem where $f$ is assumed to belong to a reproducing kernel Hilbert space (RKHS), which is a very general assumption: almost all continuous functions can be approximated by the RKHS elements of practically relevant kernels such as the Matérn family of kernels (Srinivas et al., 2010). We consider two classes of noise: sub-Gaussian and light-tailed.
Our regularity assumption on $f$ allows us to utilize Gaussian processes (GPs), which provide powerful Bayesian (surrogate) models for $f$ (Rasmussen and Williams, 2006). Sequential optimization based on GP models is often referred to as Bayesian optimization in the literature (Shahriari et al., 2016; Snoek et al., 2012; Frazier, 2018). We build on the prediction and uncertainty estimates provided by GP models to study an efficient adaptive exploration algorithm referred to as Maximum Variance Reduction (MVR). Under the simple regret measure, MVR embodies the simple principle of exploring the points with the highest variance first. Intuitively, the variance in the GP model is considered a measure of uncertainty about the unknown objective function, and the exploration steps are designed to maximally reduce this uncertainty. At the end of the exploration trials, MVR returns a candidate maximizer based on the prediction provided by the learnt GP model. With its simple structure, MVR is amenable to a tight analysis that significantly improves the best known bounds on simple regret. To this end, we derive novel and sharp confidence intervals for GP models applicable to RKHS elements. In addition, we provide numerical experiments on the simple regret performance of MVR, comparing it to GP-UCB (Srinivas et al., 2010; Chowdhury and Gopalan, 2017), GP-PI (Hoffman et al., 2011) and GP-EI (Hoffman et al., 2011).

1.1 Main Results
Our main contributions are as follows.
We first derive novel confidence intervals for GP models applicable to RKHS elements (Theorems 1 and 2). As part of our analysis, we formulate the posterior variance of a GP model as the sum of two terms: the maximum prediction error from noise-free observations, and the effect of noise (Proposition 1). This interpretation elicits new connections between GP regression and kernel ridge regression (Kanagawa et al., 2018). These results are of interest on their own.

We then build on the confidence intervals for GP models to provide a tight analysis of the simple regret of the MVR algorithm (Theorem 3). In particular, we prove a high probability $\tilde{\mathcal{O}}(\sqrt{\gamma_T/T})$ simple regret,² where $\gamma_T$ is the maximal information gain (see § 2.4). In comparison to the existing $\tilde{\mathcal{O}}(\gamma_T/\sqrt{T})$ bounds on simple regret (see, e.g., Srinivas et al., 2010; Chowdhury and Gopalan, 2017; Scarlett et al., 2017), we show an improvement by a multiplicative $\sqrt{\gamma_T}$ factor. It is noteworthy that our bound guarantees convergence to the optimum value of $f$, while previous bounds do not, since although $\gamma_T$ grows sublinearly with $T$, it can grow faster than $\sqrt{T}$.

²The notations $\mathcal{O}$ and $\tilde{\mathcal{O}}$ are used to denote the mathematical order, and the mathematical order up to logarithmic factors, respectively.

We then specialize our results for the particular cases of the practically relevant Matérn and Squared Exponential (SE) kernels. We show that our regret bounds match the lower bounds and close the gap reported in Scarlett et al. (2017); Cai and Scarlett (2020), who showed that an average simple regret of $\epsilon$ requires $T=\Omega\big(\frac{1}{\epsilon^2}(\log\frac{1}{\epsilon})^{d/2}\big)$ exploration trials in the case of the SE kernel. For the Matérn kernel (where $\nu$ is the smoothness parameter, see § 2.1) they gave the analogous bound of $T=\Omega\big(\frac{1}{\epsilon^{2+d/\nu}}\big)$. They also reported a significant gap between these lower bounds and the upper bounds achieved by the GP-UCB algorithm. In Corollary 1, we show that our analysis of MVR closes this gap in the performance and establishes upper bounds matching the lower bounds up to logarithmic factors.
In contrast to the existing results, which mainly focus on Gaussian and sub-Gaussian distributions for the noise, we extend our analysis to the more general class of light-tailed distributions, thus broadening the applicability of the results. This extension increases both the confidence interval width and the simple regret by only a multiplicative logarithmic factor. These results apply, e.g., to the privacy preserving setting, where a light-tailed noise is often employed (Basu et al., 2019; Ren et al., 2020; Zheng et al., 2020).

1.2 Literature Review
The celebrated work of Srinivas et al. (2010) pioneered the analysis of GP bandits by proving an $\tilde{\mathcal{O}}(\gamma_T\sqrt{T})$ upper bound on the cumulative regret of GP-UCB, an optimistic optimization algorithm which sequentially selects the observation points that maximize an upper confidence bound score over the search space. That implies an $\tilde{\mathcal{O}}(\gamma_T/\sqrt{T})$ simple regret (Scarlett et al., 2017). Their analysis relied on deriving confidence intervals for GP models applicable to RKHS elements. They also considered a fully Bayesian setting, where $f$ is assumed to be a sample from a GP and the noise is assumed to be Gaussian. Chowdhury and Gopalan (2017) built on the feature space representation of GP models and self-normalized martingale inequalities, first developed in Abbasi-Yadkori et al. (2011) for linear bandits, to improve the confidence intervals of Srinivas et al. (2010) by a multiplicative factor. That led to an improvement in the regret bounds by the same multiplicative factor. A discussion on the comparison between these results and the confidence intervals derived in this paper is provided in § 3.3. A technical comparison with some recent advances in regret bounds requires introducing new notation and is deferred to Appendix A.
The performance of Bayesian optimization algorithms has been extensively studied under numerous settings, including contextual information (Krause and Ong, 2011), high-dimensional spaces (Djolonga et al., 2013; Mutny and Krause, 2018), safety constraints (Berkenkamp et al., 2016; Sui et al., 2018), parallelization (Kandasamy et al., 2018), meta-learning (Wang et al., 2018a), multi-fidelity evaluations (Kandasamy et al., 2019), ordinal models (Picheny et al., 2019), corruption tolerance (Bogunovic et al., 2020; Cai and Scarlett, 2020), and neural tangent kernels (Zhou et al., 2020; Zhang et al., 2020). Javidi and Shekhar (2018) introduced an adaptive discretization of the search space, improving the computational complexity of a GP-UCB based algorithm. Sparse approximations of GP posteriors are shown to preserve the regret orders while improving the computational complexity of Bayesian optimization algorithms (Mutny and Krause, 2018; Calandriello et al., 2019; Vakili et al., 2020b). Under the RKHS setting with noisy observations, GP-TS (Chowdhury and Gopalan, 2017) and GP-EI (Nguyen et al., 2017; Wang and de Freitas, 2014) are also shown to achieve the same regret guarantees as GP-UCB (up to logarithmic factors). All these works report $\tilde{\mathcal{O}}(\gamma_T\sqrt{T})$ cumulative regret bounds.
Improved regret bounds are also reported under other, often simpler, settings such as noise-free observations (Bull, 2011; Vakili et al., 2020c) or a Bayesian regret that is averaged over a known prior on $f$ (Kandasamy et al., 2018; Wang et al., 2018b; Wang and Jegelka, 2017; Scarlett, 2018; Shekhar and Javidi, 2021; Grünewälder et al., 2010; de Freitas et al., 2012; Kawaguchi et al., 2015), rather than for a fixed and unknown $f$ as in our setting.
Other lines of work on continuum-armed bandits rely on other regularity assumptions such as Lipschitz continuity (Kleinberg, 2004; Bubeck et al., 2011a; Carpentier and Valko, 2015; Kleinberg et al., 2008), convexity (Agarwal et al., 2011) and unimodality (Combes et al., 2020), to name a few. A notable example is Bubeck et al. (2011a), who showed that hierarchical algorithms based on tree search yield sublinear cumulative regret under a Lipschitz continuity assumption. We do not compare with these results due to the inherent difference in the regularity assumptions.
1.3 Organization
In § 2, the problem formulation, the regularity assumptions, and the preliminaries on RKHSs and GP models are presented. The novel confidence intervals for GP models are proven in § 3. The MVR algorithm and its analysis are given in § 4. The experiments are presented in § 5. We conclude with a discussion in § 6.
2 Problem Formulation and Preliminaries
Consider an objective function $f:\mathcal{X}\rightarrow\mathbb{R}$, where $\mathcal{X}\subset\mathbb{R}^d$ is a convex and compact domain. Consider an optimal point $x^*\in\arg\max_{x\in\mathcal{X}}f(x)$. A learning algorithm $\mathcal{A}$ sequentially selects observation points $x_t\in\mathcal{X}$ and observes the corresponding noise-disturbed objective values $y_t=f(x_t)+\epsilon_t$, where $\epsilon_t$ is the observation noise. We use the notations $X_t=\{x_1,\dots,x_t\}$ and $\mathbf{y}_t=[y_1,\dots,y_t]^\top$ for the observation points and the observation values up to time $t$. In the simple regret setting, the learning algorithm determines a sequence of mappings, where the mapping at time $T$ predicts a candidate maximizer $\hat{x}_T$. For algorithm $\mathcal{A}$, the simple regret under a budget of $T$ tries is defined as

$$r(\mathcal{A},T)=f(x^*)-f(\hat{x}_T). \qquad (1)$$

The budget $T$ may be unknown a priori. Notation-wise, we use $\mathbf{f}_t=[f(x_1),\dots,f(x_t)]^\top$ and $\boldsymbol{\epsilon}_t=[\epsilon_1,\dots,\epsilon_t]^\top$ to denote the noise-free part of the observations and the noise history, respectively, similar to $X_t$ and $\mathbf{y}_t$.
2.1 Gaussian Processes
Bayesian optimization algorithms build on GP (surrogate) models. A GP is a random process $\{\hat{f}(x)\}_{x\in\mathcal{X}}$, any finite subset of which follows a multivariate Gaussian distribution. The distribution of a GP is fully specified by its mean function $\mu(x)=\mathbb{E}[\hat{f}(x)]$ and a positive definite kernel (or covariance function) $k(x,x')=\mathbb{E}\big[(\hat{f}(x)-\mu(x))(\hat{f}(x')-\mu(x'))\big]$. Without loss of generality, it is typically assumed that $\mu(x)=0$ for prior GP distributions.
Conditioning GPs on available observations provides us with powerful non-parametric Bayesian (surrogate) models over the space of functions. In particular, using the conjugate property, conditioned on $\{X_t,\mathbf{y}_t\}$, the posterior of $\hat{f}$ is a GP with mean function $\mu_t$ and kernel function $k_t$ specified as follows:

$$\mu_t(x)=\mathbf{k}_t^\top(x)\,(K_t+\lambda I_t)^{-1}\,\mathbf{y}_t,$$
$$k_t(x,x')=k(x,x')-\mathbf{k}_t^\top(x)\,(K_t+\lambda I_t)^{-1}\,\mathbf{k}_t(x'), \qquad \sigma_t^2(x)=k_t(x,x), \qquad (2)$$

where, with some abuse of notation, $\mathbf{k}_t(x)=[k(x,x_1),\dots,k(x,x_t)]^\top$, $K_t=[k(x_i,x_j)]_{i,j=1}^{t}$ is the covariance matrix, $I_t$ is the identity matrix of dimension $t$, and $\lambda>0$ is a real number.

In practice, Matérn and squared exponential (SE) kernels are the most commonly used for Bayesian optimization (see, e.g., Shahriari et al., 2016; Snoek et al., 2012):

$$k_{\mathrm{SE}}(x,x')=\exp\Big(-\frac{r^2}{2l^2}\Big), \qquad k_{\mathrm{Matern}}(x,x')=\frac{2^{1-\nu}}{\Gamma(\nu)}\Big(\frac{\sqrt{2\nu}\,r}{l}\Big)^{\nu}B_{\nu}\Big(\frac{\sqrt{2\nu}\,r}{l}\Big),$$

where $l>0$ is referred to as the lengthscale, $r=\|x-x'\|_2$ is the Euclidean distance between $x$ and $x'$, $\nu>0$ is referred to as the smoothness parameter, and $\Gamma$ and $B_\nu$ are, respectively, the Gamma function and the modified Bessel function of the second kind. Variation over the parameter $\nu$ creates a rich family of kernels. The SE kernel can also be interpreted as a special case of the Matérn family in the limit $\nu\rightarrow\infty$.
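As an illustration, the two kernels can be evaluated directly as functions of the distance $r$ (a sketch; the helper names `se_kernel` and `matern_kernel` and the default parameter values are our own choices, not from the paper):

```python
import numpy as np
from scipy.special import gamma, kv  # Gamma function; modified Bessel function of the second kind

def se_kernel(r, lengthscale=1.0):
    """Squared exponential kernel as a function of the distance r = ||x - x'||."""
    r = np.asarray(r, dtype=float)
    return np.exp(-r**2 / (2 * lengthscale**2))

def matern_kernel(r, nu=1.5, lengthscale=1.0):
    """Matern kernel with smoothness nu, as a function of the distance r."""
    r = np.atleast_1d(np.asarray(r, dtype=float))
    k = np.ones_like(r)  # k(x, x) = 1 at zero distance
    s = np.sqrt(2 * nu) * r[r > 0] / lengthscale
    k[r > 0] = (2 ** (1 - nu) / gamma(nu)) * s ** nu * kv(nu, s)
    return k
```

For $\nu=1/2$, the Matérn kernel reduces to the exponential kernel $e^{-r/l}$, and for large $\nu$ it approaches the SE kernel, which is the limit mentioned above.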
2.2 RKHSs and Regularity Assumptions on
Consider a positive definite kernel $k:\mathcal{X}\times\mathcal{X}\rightarrow\mathbb{R}$ with respect to a finite Borel measure (e.g., the Lebesgue measure) supported on $\mathcal{X}$. A Hilbert space $\mathcal{H}_k$ of functions on $\mathcal{X}$ equipped with an inner product $\langle\cdot,\cdot\rangle_{\mathcal{H}_k}$ is called an RKHS with reproducing kernel $k$ if the following is satisfied: for all $x\in\mathcal{X}$, $k(\cdot,x)\in\mathcal{H}_k$, and for all $x\in\mathcal{X}$ and $f\in\mathcal{H}_k$, $f(x)=\langle f,k(\cdot,x)\rangle_{\mathcal{H}_k}$ (reproducing property). A constructive definition of RKHSs requires the use of Mercer's theorem, which provides an alternative representation for kernels as an inner product of infinite-dimensional feature maps (e.g., Kanagawa et al., 2018, Theorem 4.1), and is deferred to Appendix B. We have the following regularity assumption on the objective function $f$.
Assumption 1
The objective function $f$ is assumed to live in the RKHS $\mathcal{H}_k$ corresponding to a positive definite kernel $k$. In particular, $\|f\|_{\mathcal{H}_k}\le B$, for some $B>0$, where $\|f\|_{\mathcal{H}_k}=\sqrt{\langle f,f\rangle_{\mathcal{H}_k}}$ denotes the RKHS norm.
For common kernels, such as the Matérn family, members of $\mathcal{H}_k$ can uniformly approximate any continuous function on any compact subset of the domain (Srinivas et al., 2010). This is a very general class of functions, more general than, e.g., convex or Lipschitz classes, and it has thus gained increasing interest in recent years.
2.3 Regularity Assumptions on Noise
We consider two different cases regarding the regularity assumption on the noise. Let us first revisit the definition of sub-Gaussian distributions.
Definition 1
A random variable $\epsilon$ is called $R$-sub-Gaussian if its moment-generating function is upper bounded by that of a Gaussian random variable; that is, $\mathbb{E}[\exp(h\epsilon)]\le\exp(h^2R^2/2)$ for all $h\in\mathbb{R}$.

The sub-Gaussian assumption implies that $\mathbb{E}[\epsilon]=0$. It also allows us to use the Chernoff-Hoeffding concentration inequality (Antonini et al., 2008) in our analysis.
We next recall the definition of light-tailed distributions.
Definition 2
A random variable $\epsilon$ is called light-tailed if its moment-generating function exists, i.e., there exists $h_0>0$ such that for all $|h|\le h_0$, $\mathbb{E}[\exp(h\epsilon)]<\infty$.

For a zero-mean light-tailed random variable $\epsilon$, we have (Chareka et al., 2006)

$$\mathbb{E}[\exp(h\epsilon)]\le\exp\Big(\frac{h^2\xi_0^2}{2}\Big), \quad \forall\,|h|\le h_0, \qquad \text{with }\ \xi_0^2=\sup_{|h|\le h_0}g''(h), \qquad (3)$$

where $g''$ denotes the second derivative of the moment-generating function $g$ of $\epsilon$, and $h_0$ is the parameter specified in Definition 2. We observe that the upper bound in (3) is the moment-generating function of a zero-mean Gaussian random variable with variance $\xi_0^2$. Thus, light-tailed distributions are also called locally sub-Gaussian distributions (Vakili et al., 2013).
We provide confidence intervals for GP models and regret bounds for MVR under each of the following assumptions on the noise terms.
Assumption 2 (SubGaussian Noise)
The noise terms $\epsilon_t$ are i.i.d. over $t$. In addition, $\epsilon_t$ is $R$-sub-Gaussian for some $R>0$.
Assumption 3 (LightTailed Noise)
The noise terms $\epsilon_t$ are i.i.d. zero-mean light-tailed random variables over $t$; i.e., their moment-generating function exists for all $|h|\le h_0$, for some $h_0>0$.
Bayesian optimization uses GP priors for the objective function and assumes a Gaussian distribution for the noise (for its conjugate property). It is noteworthy that the use of GP models is merely for the purpose of algorithm design and does not affect our regularity assumptions on $f$ and the noise. We use the notation $\hat{f}$ to distinguish the GP model from the fixed $f$.
2.4 Maximal Information Gain
The regret bounds derived in this work are given in terms of the maximal information gain, defined as $\gamma_T=\sup_{X_T\subset\mathcal{X}}I(\mathbf{y}_T;\mathbf{f}_T)$, where $I(\mathbf{y}_T;\mathbf{f}_T)$ denotes the mutual information between $\mathbf{y}_T$ and $\mathbf{f}_T$ (see, e.g., Cover, 1999). In the case of a GP model, the mutual information can be given as $I(\mathbf{y}_T;\mathbf{f}_T)=\frac{1}{2}\log\det(I_T+\lambda^{-1}K_T)$, where $\det$ denotes the determinant of a square matrix. Note that the maximal information gain is kernel-specific and $f$-independent. Upper bounds on $\gamma_T$ are derived in Srinivas et al. (2010); Janz et al. (2020); Vakili et al. (2020a), which are commonly used to provide explicit regret bounds. In the case of the Matérn and SE kernels, $\gamma_T=\tilde{\mathcal{O}}\big(T^{\frac{d}{2\nu+d}}\big)$ and $\gamma_T=\mathcal{O}\big(\log^{d+1}(T)\big)$, respectively (Vakili et al., 2020a).
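For a GP model on a finite set of points, the mutual information above is a simple log-determinant. A minimal sketch (the SE kernel, the uniform grid in $[0,1]$, the lengthscale, and the noise parameter $\lambda=0.25$ are illustrative choices of ours):

```python
import numpy as np

def information_gain(K, lam=0.25):
    """0.5 * log det(I + lam^{-1} K) for a kernel matrix K and noise parameter lam."""
    T = K.shape[0]
    _, logdet = np.linalg.slogdet(np.eye(T) + K / lam)
    return 0.5 * logdet

def se_matrix(X, lengthscale=0.2):
    """SE kernel matrix on a set of one-dimensional points X."""
    d = X[:, None] - X[None, :]
    return np.exp(-d**2 / (2 * lengthscale**2))

gains = [information_gain(se_matrix(np.linspace(0, 1, T))) for T in (10, 50, 250)]
print(gains)
```

The computed values increase with $T$ but only slowly, consistent with the polylogarithmic growth of $\gamma_T$ quoted above for the SE kernel in one dimension.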
3 Confidence Intervals for Gaussian Process Models
The analysis of bandit problems classically builds on confidence intervals applicable to the values of the objective function (see, e.g., Auer, 2002; Bubeck et al., 2012). GP modelling allows us to create confidence intervals for complex functions over continuous domains. In particular, we utilize the prediction ($\mu_t$) and the uncertainty estimate ($\sigma_t$) provided by GP models in building the confidence intervals, which become an important building block of our analysis in the next section. To this end, we first prove the following proposition, which formulates the posterior variance of a GP model as the sum of two terms: the maximum prediction error for an RKHS element from noise-free observations, and the effect of noise.
Proposition 1
Let $\sigma_t^2(x)$ be the posterior variance of the surrogate GP model, as defined in (2). Let $\mathbf{a}_t(x)=(K_t+\lambda I_t)^{-1}\mathbf{k}_t(x)$. We have

$$\sigma_t^2(x)=\sup_{f\in\mathcal{H}_k:\,\|f\|_{\mathcal{H}_k}\le 1}\big(f(x)-\mathbf{a}_t^\top(x)\,\mathbf{f}_t\big)^2+\lambda\,\|\mathbf{a}_t(x)\|_2^2.$$
Notice that the first term captures the maximum prediction error from the noise-free observations $\mathbf{f}_t$. The second term captures the effect of noise in the surrogate GP model (and is independent of $f$). A detailed proof of Proposition 1 is provided in Appendix C.
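This two-term decomposition can be checked numerically. The sketch below (our own construction; the SE kernel, $\lambda=0.1$, and the sample points are arbitrary) uses the reproducing property, under which the supremum of the squared noise-free prediction error over the unit ball of the RKHS has the closed form $k(x,x)-2\mathbf{a}^\top\mathbf{k}_t(x)+\mathbf{a}^\top K_t\mathbf{a}$, with $\mathbf{a}=(K_t+\lambda I_t)^{-1}\mathbf{k}_t(x)$:

```python
import numpy as np

def se(A, B, l=0.5):
    return np.exp(-(A[:, None] - B[None, :])**2 / (2 * l**2))

rng = np.random.default_rng(0)
Xt = rng.uniform(0, 1, 6)      # observation points x_1, ..., x_t
lam = 0.1                      # the regularization / noise parameter lambda
x = np.array([0.3])            # query point

Kt = se(Xt, Xt)
kx = se(Xt, x)[:, 0]           # k_t(x)
a = np.linalg.solve(Kt + lam * np.eye(6), kx)

var = se(x, x)[0, 0] - kx @ a                        # posterior variance from eq. (2)
term1 = se(x, x)[0, 0] - 2 * a @ kx + a @ Kt @ a     # max squared noise-free prediction error
term2 = lam * a @ a                                  # effect of noise

print(abs(var - (term1 + term2)))  # agrees up to floating-point error
```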
Proposition 1 elicits new connections between GP models and kernel ridge regression. While the equivalence of the posterior mean in GP models and the regressor in kernel ridge regression is well known, the interpretation of posterior variance of GP models as the maximum prediction error for an RKHS element is less studied (see Kanagawa et al., 2018, Section 3, for a detailed discussion on the connections between GP models and kernel ridge regression).
3.1 Confidence Intervals under SubGaussian Noise
The following theorem provides a confidence interval for GP models applicable to RKHS elements under the assumption that the noise terms are subGaussian.
Theorem 1
We can write the difference between the objective function and the posterior mean as $f(x)-\mu_t(x)=\big(f(x)-\mathbf{a}_t^\top(x)\mathbf{f}_t\big)-\mathbf{a}_t^\top(x)\boldsymbol{\epsilon}_t$, where $\mathbf{a}_t(x)=(K_t+\lambda I_t)^{-1}\mathbf{k}_t(x)$. The first term can be bounded directly following Proposition 1. The second term is bounded as a result of Proposition 1 and the Chernoff-Hoeffding inequality. A detailed proof of Theorem 1 is provided in Appendix D.
3.2 Confidence Intervals under LightTailed Noise
We now extend the confidence intervals to the case of light-tailed noise. The main difference from the sub-Gaussian case is that the Chernoff-Hoeffding inequality is no longer applicable. We derive new bounds accounting for the light-tailed noise in the analysis of Theorem 2.
Theorem 2
3.3 Comparison with the Existing Confidence Intervals
The most relevant work to our Theorems 1 and 2 is the confidence interval of Chowdhury and Gopalan (2017), which itself was an improvement over that of Srinivas et al. (2010). Chowdhury and Gopalan (2017) built on the feature space representation of GP kernels and self-normalized martingale inequalities (Abbasi-Yadkori et al., 2011; Peña et al., 2008) to establish a confidence interval in the same form as in Theorem 1, under Assumptions 1 and 2, but with a width that grows with $\sqrt{\gamma_t}$, in contrast to the width in Theorem 1. There is a stark contrast between this confidence interval and the one given in Theorem 1 in its dependence on $\gamma_t$, which has a relatively large and possibly polynomial (in $t$) value. That contributes an extra multiplicative $\sqrt{\gamma_T}$ factor to the regret.
Neither of these two results (our Theorem 1 and that of Chowdhury and Gopalan (2017)) implies the other. Although our confidence interval is much tighter, there are two important differences in the settings of these theorems. One difference is in the probabilistic dependencies between the observation points and the noise terms. While Theorem 1 assumes that the observation points are independent of the noise terms, Chowdhury and Gopalan (2017) allow each observation point to depend on the previous noise terms. This is a reflection of the difference in the analytical requirements of MVR and GP-UCB. The other difference is that the confidence interval of Chowdhury and Gopalan (2017) holds for all $t\in\mathbb{N}$, while Theorem 1 holds for a single $t$. As we will see in § 4.2, a probability union bound can be used to obtain confidence intervals applicable to all $x$ in (a discretization of) $\mathcal{X}$, which contributes only logarithmic terms to the regret, in contrast to the $\sqrt{\gamma_T}$ factor. Roughly speaking, we are trading off the extra $\sqrt{\gamma_T}$ term for restricting the confidence interval to hold for a single $t$. It remains an open problem whether the same can be done when the observation points are allowed to depend on the noise history.
4 Maximum Variance Reduction and Simple Regret
In this section, we first formally present an exploration policy based on GP models referred to as Maximum Variance Reduction (MVR). We then utilize the confidence intervals for GP models derived in § 3 to prove bounds on the simple regret of MVR.
4.1 Maximum Variance Reduction Algorithm
MVR relies on the principle of reducing the maximum uncertainty where the uncertainty is measured by the posterior variance of the GP model. After exploration trials, MVR returns a candidate maximizer according to the prediction provided by the learnt GP model. A pseudocode is given in Algorithm 1.
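Algorithm 1 itself is not reproduced here; the following is a minimal one-dimensional sketch of MVR on a finite grid (the SE kernel, the grid, the lengthscale, and $\lambda=0.01$ are our own illustrative choices, not the paper's configuration):

```python
import numpy as np

def se(A, B, l=0.2):
    return np.exp(-(A[:, None] - B[None, :])**2 / (2 * l**2))

def mvr(f_noisy, grid, T, lam=0.01):
    """Maximum Variance Reduction: explore the point of highest posterior variance,
    then return the maximizer of the posterior mean of the learnt GP model."""
    X, y = [], []
    for _ in range(T):
        if X:
            Xa = np.array(X)
            A = np.linalg.solve(se(Xa, Xa) + lam * np.eye(len(X)), se(Xa, grid))
            var = 1.0 - np.sum(se(grid, Xa).T * A, axis=0)   # posterior variance on the grid
        else:
            var = np.ones_like(grid)                          # prior variance before any data
        x = grid[np.argmax(var)]                              # most uncertain point
        X.append(x)
        y.append(f_noisy(x))
    Xa = np.array(X)
    w = np.linalg.solve(se(Xa, Xa) + lam * np.eye(T), np.array(y))
    mu = se(grid, Xa) @ w                                     # posterior mean on the grid
    return grid[np.argmax(mu)]

# Usage on a toy objective with a maximum at 0.5 and small Gaussian noise.
rng = np.random.default_rng(0)
grid = np.linspace(0, 1, 101)
f = lambda z: np.exp(-(z - 0.5)**2 / 0.05)
x_hat = mvr(lambda z: f(z) + 0.01 * rng.standard_normal(), grid, T=30)
```

Note that the exploration steps never look at the observed values; only the final prediction step does, which is what makes the observation points independent of the noise, as required by Theorems 1 and 2.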
4.2 Regret Analysis
For the analysis of MVR, we assume there exists a fine discretization of the domain for RKHS elements, which is a standard assumption in the literature (see, e.g., Srinivas et al., 2010; Chowdhury and Gopalan, 2017; Vakili et al., 2020b).
Assumption 4
For each given $T\in\mathbb{N}$ and each $f\in\mathcal{H}_k$ with $\|f\|_{\mathcal{H}_k}\le B$, there exists a discretization $\mathbb{D}$ of $\mathcal{X}$ such that $f(x)-f([x])\le\frac{1}{\sqrt{T}}$, where $[x]$ is the closest point in $\mathbb{D}$ to $x$, and $|\mathbb{D}|\le CB^dT^{d/2}$, where $C$ is a constant independent of $T$ and $B$.
Assumption 4 is a mild assumption that holds for typical kernels such as SE and Matérn (Srinivas et al., 2010; Chowdhury and Gopalan, 2017). The following theorem provides a high probability bound on the regret performance of MVR when the noise terms satisfy either Assumption 2 or 3.
Theorem 3
A detailed proof of the theorem is provided in Appendix E.
Remark 2
4.3 Optimal Order Simple Regret with SE and Matérn Kernels
To enable a direct comparison with the lower bounds on simple regret proven in Scarlett et al. (2017); Cai and Scarlett (2020), in the following corollary we state a dual form of Theorem 3 for the Matérn and SE kernels. Specifically, we formalize the number of exploration trials required to achieve an average simple regret of at most $\epsilon$.
Corollary 1
A proof is provided in Appendix F. Scarlett et al. (2017); Cai and Scarlett (2020) showed that for the SE kernel, an average simple regret of $\epsilon$ requires $T=\Omega\big(\frac{1}{\epsilon^2}(\log\frac{1}{\epsilon})^{d/2}\big)$. For the Matérn kernel, they gave the analogous bound of $T=\Omega\big(\frac{1}{\epsilon^{2+d/\nu}}\big)$. They also reported significant gaps between these lower bounds and the existing results (see, e.g., Scarlett et al., 2017, Table I). Comparing with Corollary 1, our bounds are tight in all cases up to logarithmic factors.
5 Experiments
In this section, we provide numerical experiments on the simple regret performance of MVR, Improved GP-UCB (IGP-UCB) as presented in Chowdhury and Gopalan (2017), and GP-PI and GP-EI as presented in Hoffman et al. (2011).
We follow the experimental setup in Chowdhury and Gopalan (2017) to generate test functions from the RKHS. First, a set of points is uniformly sampled from the input interval. A GP sample with kernel $k$ is drawn over these points. Given this sample, the mean of the posterior distribution is used as the test function $f$. The noise parameter is set to a fixed percentage of the function range. For IGP-UCB we set the parameters exactly as described in Chowdhury and Gopalan (2017). The GP model is equipped with an SE or Matérn kernel. We use different models for the noise: a zero-mean Gaussian (a sub-Gaussian distribution) and a zero-mean Laplace distribution (a light-tailed distribution). We run each experiment over 25 independent trials and plot the average simple regret in Figure 1. More experiments on two commonly used benchmark functions for Bayesian optimization (Rosenbrock and Hartman) are reported in Appendix G. Further details on the experiments are provided in the supplementary material.
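The test-function construction described above can be sketched as follows (our own minimal version; the number of anchor points, the lengthscale, the interval $[0,1]$, and the jitter values are illustrative assumptions, not the exact settings of Chowdhury and Gopalan (2017)):

```python
import numpy as np

rng = np.random.default_rng(1)
l = 0.2

def se(A, B):
    return np.exp(-(A[:, None] - B[None, :])**2 / (2 * l**2))

# Sample a GP with SE kernel at a set of anchor points ...
m = 30
Z = rng.uniform(0, 1, m)
Kz = se(Z, Z)
fz = rng.multivariate_normal(np.zeros(m), Kz + 1e-8 * np.eye(m))

# ... and use the posterior mean through these values as the test function f.
w = np.linalg.solve(Kz + 1e-6 * np.eye(m), fz)

def f(x):
    return se(np.atleast_1d(np.asarray(x, dtype=float)), Z) @ w
```

By construction, $f$ lies in the span of $k(\cdot,z_i)$ and hence in the RKHS of the SE kernel; it interpolates the sampled values up to the small jitter.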
6 Discussion
In this paper, we proved novel and sharp confidence intervals for GP models applicable to RKHS elements. We then built on these results to prove bounds on the simple regret of an adaptive exploration algorithm under the framework of GP bandits. In addition, for the practically relevant SE and Matérn kernels, where a lower bound on regret is known (Scarlett et al., 2017; Cai and Scarlett, 2020), we showed the order optimality of our results up to logarithmic factors. This closes a significant gap in the literature on the analysis of Bayesian optimization algorithms under the performance measure of simple regret.
A limitation of our work, adhering to simple regret, is that neither our theoretical nor our experimental results prove that MVR is a better algorithm in practice. Overall, exploration-exploitation oriented algorithms such as GP-UCB may perform worse than MVR in terms of simple regret for two reasons. One is over-exploitation of local maxima when $f$ is multimodal, and the other is dependence on an exploration-exploitation balancing hyperparameter that is often set too conservatively in order to guarantee low regret bounds. Furthermore, their existing analytical regret bounds are suboptimal and possibly vacuous (non-diminishing when $\gamma_T$ grows faster than $\sqrt{T}$, as discussed). On the other hand, when compared in terms of cumulative regret ($\sum_{t=1}^{T}(f(x^*)-f(x_t))$), MVR suffers from a linear regret.
The main value of our work is in proving tight bounds on the simple regret of a GP based exploration algorithm, while other Bayesian optimization algorithms such as GP-UCB lack a proof of an always diminishing and non-vacuous regret under the same setting as ours. It remains an open question whether the possibly vacuous regret bounds of GP-UCB (as well as GP-TS and GP-EI, whose analyses are inspired by that of GP-UCB) reflect a fundamental limitation or an artifact of the proofs.
It is worth reiterating that simple regret is favorable in situations with a preliminary exploration phase (for instance, hyperparameter tuning) (Bubeck et al., 2011b). It has been explicitly studied under numerous settings, e.g., Lipschitz continuous $f$ (Bubeck et al., 2011b; Carpentier and Valko, 2015; Deshmukh et al., 2018), $f$ in an RKHS with noise-free observations (Bull, 2011), a known prior distribution on $f$ with noise-free observations (Grünewälder et al., 2010; de Freitas et al., 2012; Kawaguchi et al., 2015), a known prior distribution on $f$ with noisy observations (Contal et al., 2013), and $f$ in an RKHS with noisy observations (Scarlett et al., 2017; Cai and Scarlett, 2020; Shekhar and Javidi, 2020; Bogunovic et al., 2016). See also § 1.2 and Appendix A for comparison with existing results, including Shekhar and Javidi (2020); Bogunovic et al. (2016).
References
Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pp. 2312–2320.
Stochastic convex optimization with bandit feedback. Advances in Neural Information Processing Systems 24, pp. 1035–1043.
The continuum-armed bandit problem. SIAM Journal on Control and Optimization 33 (6), pp. 1926–1951.
Convergence of series of dependent sub-Gaussian random variables. Journal of Mathematical Analysis and Applications 338 (2), pp. 1188–1203.
Best arm identification in multi-armed bandits. In COLT, pp. 41–53.
Finite-time analysis of the multi-armed bandit problem. Machine Learning 47 (2–3), pp. 235–256.
Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research 3 (Nov), pp. 397–422.
Hybrid batch Bayesian optimization. arXiv preprint arXiv:1202.5597.
Differential privacy for multi-armed bandits: what is it and what is its cost? arXiv preprint arXiv:1905.12298.
Algorithms for hyper-parameter optimization. In 25th Annual Conference on Neural Information Processing Systems (NIPS 2011), Vol. 24.
Bayesian optimization with safety constraints: safe and automatic parameter tuning in robotics. arXiv preprint arXiv:1602.04450.
Corruption-tolerant Gaussian process bandit optimization. arXiv preprint arXiv:2003.01971.
Truncated variance reduction: a unified approach to Bayesian optimization and level-set estimation. arXiv preprint arXiv:1610.07379.
Bandits with heavy tail. arXiv preprint arXiv:1209.1727.
X-armed bandits. Journal of Machine Learning Research 12 (5).
Pure exploration in finitely-armed and continuous-armed bandits. Theoretical Computer Science 412 (19), pp. 1832–1852.
Convergence rates of efficient global optimization algorithms. The Journal of Machine Learning Research.
On lower bounds for standard and robust Gaussian process bandit optimization. arXiv preprint arXiv:2008.08757.
Gaussian process optimization with adaptive sketching: scalable and no regret. In Proceedings of the Thirty-Second Conference on Learning Theory, Proceedings of Machine Learning Research, Vol. 99, Phoenix, USA.
Simple regret for infinitely many armed bandits. In International Conference on Machine Learning, pp. 1133–1141.
Locally sub-Gaussian random variable and the strong law of large numbers. Atlantic Electronic Journal of Mathematics 1 (1), pp. 75–81.
On kernelized multi-armed bandits. In International Conference on Machine Learning, pp. 844–853.
Unimodal bandits with continuous arms: order-optimal regret without smoothness. Proceedings of the ACM on Measurement and Analysis of Computing Systems 4 (1), pp. 1–28.
Parallel Gaussian process optimization with upper confidence bound and pure exploration. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 225–240.
Elements of Information Theory. John Wiley & Sons.
Exponential regret bounds for Gaussian process bandits with deterministic observations. In Proceedings of the 29th International Conference on Machine Learning, pp. 955–962.
Simple regret minimization for contextual bandits. arXiv preprint arXiv:1810.07371.
High-dimensional Gaussian process bandits. In Advances in Neural Information Processing Systems 26, pp. 1025–1033.
BOHB: robust and efficient hyperparameter optimization at scale. arXiv preprint arXiv:1807.01774.
Bayesian optimization. In Recent Advances in Optimization and Modeling of Contemporary Problems, pp. 255–278.
Best arm identification in multi-armed bandits with delayed feedback. In International Conference on Artificial Intelligence and Statistics, pp. 833–842.
Regret bounds for Gaussian process bandit problems. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 273–280.
Gaussian processes for big data. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence (UAI 2013).
Portfolio allocation for Bayesian optimization. In UAI, pp. 327–336.
Applied Analysis. World Scientific.
Bandit optimisation of functions in the Matérn kernel RKHS. In Proceedings of Machine Learning Research, Vol. 108, pp. 2486–2495.
Gaussian process bandits with adaptive discretization. Electronic Journal of Statistics 12 (2), pp. 3829–3874.
Gaussian processes and kernel methods: a review on connections and equivalences. Available at arXiv.
Multi-fidelity Gaussian process bandit optimisation. Journal of Artificial Intelligence Research 66, pp. 151–196.
Parallelised Bayesian optimisation via Thompson sampling. In International Conference on Artificial Intelligence and Statistics, pp. 133–142.
Bayesian optimization with exponential convergence. In Advances in Neural Information Processing Systems, pp. 2809–2817.
Multi-armed bandits in metric spaces. In Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, pp. 681–690.
Nearly tight bounds for the continuum-armed bandit problem. Advances in Neural Information Processing Systems 17, pp. 697–704.
Contextual Gaussian process bandit optimization. In Advances in Neural Information Processing Systems 24, pp. 2447–2455.
On Thompson sampling with Langevin algorithms. Proceedings of ICML.
Osprey: hyperparameter optimization for machine learning. Journal of Open Source Software 1 (5), pp. 34.
Efficient high dimensional Bayesian optimization with additivity and quadrature Fourier features. In Advances in Neural Information Processing Systems 31, pp. 9005–9016.
Regret for expected improvement over the best-observed value and stopping condition. In Asian Conference on Machine Learning, pp. 279–294.
Self-normalized processes: limit theory and statistical applications. Springer Science & Business Media.
Ordinal Bayesian optimisation. arXiv preprint arXiv:1912.02493.
A benchmark of kriging-based infill criteria for noisy optimization. Structural and Multidisciplinary Optimization 48 (3), pp. 607–626.
 Gaussian Processes for Machine Learning. MIT Press. Cited by: §1.
 Multiarmed bandits with local differential privacy. arXiv preprint arXiv:2007.03121. Cited by: §1.1.
 Lower bounds on regret for noisy Gaussian process bandit optimization. In Proceedings of the 2017 Conference on Learning Theory, Proceedings of Machine Learning Research, Vol. 65, Amsterdam, Netherlands, pp. 1723–1742. Cited by: Appendix A, §1.1, §1.1, §1.2, §4.3, §4.3, §6, §6.
 Tight regret bounds for bayesian optimization in one dimension. arXiv preprint arXiv:1805.11792. Cited by: §1.2.
 Taking the human out of the loop: a review of bayesian optimization. Proceedings of the IEEE 104 (1), pp. 148–175. Cited by: §1, §1, §2.1.
 Multiscale zeroorder optimization of smooth functions in an rkhs. arXiv preprint arXiv:2005.04832. Cited by: Appendix A, §6.
 Significance of gradient information in bayesian optimization. In International Conference on Artificial Intelligence and Statistics, pp. 2836–2844. Cited by: §1.2.
 Introduction to multiarmed bandits. arXiv preprint arXiv:1904.07272. Cited by: §1.
 Practical bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems 25, pp. 2951–2959. Cited by: §1, §2.1.
 Gaussian process optimization in the bandit setting: no regret and experimental design. In Proceedings of the 27th International Conference on International Conference on Machine Learning, pp. 1015–1022. Cited by: Appendix A, Appendix E, Appendix F, §G.2, §1.1, §1.2, §1, §1, §2.2, §2.4, §3.3, §4.2, §4.2.
 Stagewise safe bayesian optimization with gaussian processes. arXiv preprint arXiv:1806.07555. Cited by: §1.2.
 Convergence of gaussian process regression with estimated hyperparameters and applications in bayesian inverse problems. Available at Arxiv. (), pp. . Cited by: Appendix B.
 Variational Learning of Inducing Variables inSparse Gaussian Processes. In Proceedings of the International Conference on Artificial Intelligence and Statistics, pp. 567–574. Cited by: §G.2.
 On information gain and regret bounds in gaussian process bandits. arXiv preprint arXiv:2009.06966. Cited by: Appendix A, Appendix F, §1, §2.4.
 Deterministic sequencing of exploration and exploitation for multiarmed bandit problems. IEEE Journal of Selected Topics in Signal Processing 7 (5), pp. 759–767. Cited by: §2.3.
 Scalable thompson sampling using sparse gussian process mdels. Available at Arxiv. (), pp. . Cited by: §G.2, §1.2, §4.2.
 Regret bounds for noisefree bayesian optimization. arXiv preprint arXiv:2002.05096. Cited by: §1.2.
 A random walk approach to firstorder stochastic convex optimization. In IEEE International Symposium on Information Theory (ISIT), Cited by: footnote 1.
 Finitetime analysis of kernelised contextual bandits. In Proceedings of the TwentyNinth Conference on Uncertainty in Artificial Intelligence, UAI’13, Arlington, Virginia, USA, pp. 654–663. Cited by: Appendix A.
 Maxvalue entropy search for efficient Bayesian optimization. In 34th International Conference on Machine Learning, ICML 2017, Vol. 7, pp. 5530–5543. External Links: 1703.01968, ISBN 9781510855144 Cited by: §1.2.
 Regret bounds for meta bayesian optimization with an unknown gaussian process prior. In Advances in Neural Information Processing Systems, pp. 10477–10488. Cited by: §1.2.
 Regret bounds for meta bayesian optimization with an unknown gaussian process prior. arXiv preprint arXiv:1811.09558. Cited by: §1.2.
 Theoretical analysis of bayesian optimisation with unknown gaussian process hyperparameters. arXiv preprint arXiv:1406.7758. Cited by: Appendix A, §1.2.
 Neural thompson sampling. arXiv preprint arXiv:2010.00827. Cited by: §1.2.
 Multiarmed bandits: theory and applications to online learning in networks. Synthesis Lectures on Communication Networks 12 (1), pp. 1–165. Cited by: §1.
 Locally differentially private (contextual) bandits learning. arXiv preprint arXiv:2006.00701. Cited by: §1.1.
 Neural contextual bandits with ucbbased exploration. In International Conference on Machine Learning, pp. 11492–11502. Cited by: §1.2.
Appendix A Further Comparison with the Existing Regret Bounds
There are several Bayesian optimization algorithms, namely GP-UCB [Srinivas et al., 2010], IGP-UCB, GP-TS [Chowdhury and Gopalan, 2017], TruVar [Bogunovic et al., 2016], GP-EI [Wang and de Freitas, 2014, Nguyen et al., 2017] and Kernel-UCB [Valko et al., 2013], which enjoy theoretical upper bounds on regret (under the regularity assumptions considered in this paper) that grow at least as fast as $\gamma_T/\sqrt{T}$. These bounds do not necessarily converge to zero, since $\gamma_T$ can grow faster than $\sqrt{T}$, resulting in vacuous regret bounds. For example, in the case of a Matérn kernel, replacing $\gamma_T = \tilde{\mathcal{O}}(T^{\frac{d}{2\nu+d}})$ [Vakili et al., 2020a] results in an $\tilde{\mathcal{O}}(T^{\frac{d-2\nu}{2(2\nu+d)}})$ simple regret which does not converge to zero when $d \ge 2\nu$, meaning the algorithm does not necessarily approach the optimum $f(x^*)$.

Janz et al. [2020] developed a GP-UCB based algorithm, specific to the Matérn family of kernels, that constructs a cover of the search space as many hypercubes and fits an independent GP to each cover element. This algorithm, referred to as $\pi$-GP-UCB, was proven to achieve diminishing regret for all $\nu$ and $d$. Recently, Shekhar and Javidi [2020] introduced LP-GP-UCB, where the GP model is augmented with local polynomial estimators to construct a multi-scale upper confidence bound guiding the sequential optimization. They further improved the regret bounds of Janz et al. [2020] and showed that LP-GP-UCB matches the lower bounds for some configurations of the parameters $\nu$ and $d$ in the case of a Matérn kernel; its bounds on simple regret depend on the relation between $\nu$ and $d$ and are order-optimal only in certain regimes [see Shekhar and Javidi, 2020, for a detailed discussion of the bounds on the simple regret of LP-GP-UCB]. In comparison, our bounds on simple regret match the lower bound, up to logarithmic factors, for all values of $\nu$ and $d$. In addition, LP-GP-UCB is impractical due to large constant factors, though a practical heuristic was also given, while MVR enjoys a simple implementation and works efficiently in practice. Of important theoretical value, SupKernelUCB [Valko et al., 2013], which builds on episodic independent batches of observations, was proven to achieve an $\tilde{\mathcal{O}}(\sqrt{T\gamma_T})$ regret on a finite set of actions. SupKernelUCB is also reported to perform poorly in practice [Janz et al., 2020, Calandriello et al., 2019, Cai and Scarlett, 2020].

It is noteworthy that our techniques do not directly apply to the analysis of the cumulative regret of algorithms such as GP-UCB. The key difference is that in MVR the observation points $\{x_t\}_{t=1}^{T}$ are independent of the noise terms $\{\epsilon_t\}_{t=1}^{T}$ (although $x_t$ is allowed to depend on $x_1, \dots, x_{t-1}$, and the returned point $\hat{x}_T$ is allowed to depend on all the observations), while in GP-UCB the point $x_t$ is allowed to depend on the past noisy observations $y_1, \dots, y_{t-1}$ (see also the discussion in the main text). It remains an interesting open question whether the state-of-the-art upper bound on the regret performance of GP-UCB [Chowdhury and Gopalan, 2017] is tight or the gap with the lower bound [Scarlett et al., 2017] is an artifact of its proof.
Appendix B Constructive Definition of RKHS
A constructive definition of the RKHS requires Mercer's theorem, which provides an alternative representation of kernels as an inner product of infinite-dimensional feature maps [see, e.g., Kanagawa et al., 2018].
Mercer Theorem:
Let $k$ be a continuous kernel with respect to a finite Borel measure $\mu$ supported on $\mathcal{X}$. Then there exist an orthonormal system $\{\phi_m\}_{m=1}^{\infty}$ in $L^2_{\mu}(\mathcal{X})$ and non-negative eigenvalues $\{\lambda_m\}_{m=1}^{\infty}$ with $\lambda_1 \ge \lambda_2 \ge \cdots \ge 0$, such that
$$k(x, x') = \sum_{m=1}^{\infty} \lambda_m \phi_m(x) \phi_m(x'),$$
where the convergence of the series is absolute and uniform.
The RKHS can consequently be represented in terms of the eigenpairs $\{(\lambda_m, \phi_m)\}_{m=1}^{\infty}$ using Mercer's representation theorem [see, e.g., Kanagawa et al., 2018].
Mercer’s Representation Theorem:
Let $\{(\lambda_m, \phi_m)\}_{m=1}^{\infty}$ be the same as in Mercer's theorem. Then the RKHS of $k$ is given by
$$\mathcal{H}_k = \left\{ f = \sum_{m=1}^{\infty} w_m \lambda_m^{1/2} \phi_m \;:\; \|f\|_{\mathcal{H}_k}^2 := \sum_{m=1}^{\infty} w_m^2 < \infty \right\}.$$
Mercer's representation theorem indicates that $\{\lambda_m^{1/2}\phi_m\}_{m=1}^{\infty}$ form an orthonormal basis for $\mathcal{H}_k$. It also provides a constructive definition of the RKHS as the span of this orthonormal basis, together with a definition of the norm of a member $f = \sum_{m} w_m \lambda_m^{1/2}\phi_m$ as the $\ell^2$ norm of the weight sequence $\{w_m\}_{m=1}^{\infty}$.
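As a numerical illustration (a minimal sketch of ours, not part of the paper), the Mercer eigenpairs of an SE kernel on $[0,1]$ with the uniform measure can be approximated from its Gram matrix via a Nyström-type scaling, and an RKHS member can then be synthesized from the weighted eigenbasis exactly as in the representation theorem:

```python
import numpy as np

def se_kernel(a, b, ls=0.1):
    """Squared-exponential kernel matrix between 1-d point sets a and b."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * ls ** 2))

n = 200
x = np.linspace(0.0, 1.0, n)
K = se_kernel(x, x)

# Eigendecomposition of the Gram matrix; with uniform quadrature weights 1/n,
# evals/n approximate the Mercer eigenvalues lambda_m and sqrt(n)*evecs
# approximate the L2-orthonormal eigenfunctions phi_m on the grid.
evals, evecs = np.linalg.eigh(K)
evals, evecs = evals[::-1], evecs[:, ::-1]  # sort in descending order
lam = evals / n
phi = evecs * np.sqrt(n)

# Synthesize f = sum_m w_m * sqrt(lambda_m) * phi_m; by Mercer's
# representation theorem its squared RKHS norm is sum_m w_m^2.
m = 10
w = np.random.default_rng(0).standard_normal(m)
f = phi[:, :m] @ (w * np.sqrt(lam[:m]))
rkhs_norm_sq = np.sum(w ** 2)
```

Truncating at $m$ terms makes the construction finite; the rapid decay of the SE eigenvalues $\lambda_m$ keeps the truncation error small.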
The RKHS of the Matérn kernel with smoothness parameter $\nu$ is equivalent to a Sobolev space of order $\nu + d/2$ [Kanagawa et al., 2018, Teckentrup, 2018]. This observation provides an intuitive interpretation of the Matérn RKHS norm as proportional to the cumulative $L^2$ norm of the weak derivatives of $f$ up to order $\nu + d/2$. That is, in the case of the Matérn family, the assumption on the RKHS norm of $f$ translates to the existence of weak derivatives of $f$ up to order $\nu + d/2$, which can be understood as a versatile measure of the smoothness of $f$ controlled by $\nu$. In the case of the SE kernel, the regularity assumption implies the existence of all weak derivatives of $f$. For the details on the definition of weak derivatives and Sobolev spaces, see Hunter and Nachtergaele [2011].
Appendix C Proof of Proposition
Recall the notations $k_t(x) = [k(x, x_1), \dots, k(x, x_t)]^{\top}$, $K_t = [k(x_i, x_j)]_{i,j=1}^{t}$, and $y_t = [y_1, \dots, y_t]^{\top}$. From the closed-form expression for the posterior mean of GP models, we have $\mu_t(x) = k_t^{\top}(x)\,(K_t + \lambda I)^{-1} y_t$.
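For concreteness, this closed-form posterior mean can be computed in a few lines. The following is a generic GP regression sketch; the SE kernel, lengthscale, observation count, and regularizer $\lambda$ below are illustrative choices of ours, not the paper's:

```python
import numpy as np

def se_kernel(a, b, ls=0.3):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * ls ** 2))

rng = np.random.default_rng(1)
t = 30
X = rng.uniform(0.0, 1.0, t)                      # observation points x_1..x_t
y = np.sin(6 * X) + 0.1 * rng.standard_normal(t)  # noisy zeroth-order feedback
lam = 0.01                                        # regularization parameter

# mu_t(x) = k_t(x)^T (K_t + lam I)^{-1} y_t: a linear combination of the
# kernel sections k(., x_i), with weights alpha = (K_t + lam I)^{-1} y_t.
alpha = np.linalg.solve(se_kernel(X, X) + lam * np.eye(t), y)

def mu_t(xq):
    return se_kernel(np.atleast_1d(np.asarray(xq, dtype=float)), X) @ alpha
```

Evaluating `mu_t` on a grid recovers a smooth estimate of the latent function; note that $\mu_t$ is itself an RKHS member, a point the lemma below exploits.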
The proof of Proposition uses the following lemma.
Lemma 1
For a positive definite kernel $k$ and its corresponding RKHS $\mathcal{H}_k$, for any points $x_1, \dots, x_t \in \mathcal{X}$ and weights $\alpha_1, \dots, \alpha_t \in \mathbb{R}$, the following holds:
(5) $\sup_{f \in \mathcal{H}_k:\, \|f\|_{\mathcal{H}_k} \le 1} \sum_{i=1}^{t} \alpha_i f(x_i) = \Big\| \sum_{i=1}^{t} \alpha_i k(\cdot, x_i) \Big\|_{\mathcal{H}_k}.$
The lemma establishes the equivalence between the RKHS norm of a linear combination of the feature vectors $k(\cdot, x_i)$ induced by the points $x_1, \dots, x_t$ and the supremum of the corresponding linear combination of function values, taken over the functions in the unit ball of the RKHS. For a proof, see Kanagawa et al. [2018]. Expanding the RKHS norm in the right-hand side through an algebraic manipulation, we get
$$\Big\| \sum_{i=1}^{t} \alpha_i k(\cdot, x_i) \Big\|_{\mathcal{H}_k} = \sqrt{\sum_{i=1}^{t}\sum_{j=1}^{t} \alpha_i \alpha_j k(x_i, x_j)}.$$
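The identity in the lemma can be sanity-checked numerically (our sketch, with an illustrative SE kernel and arbitrary points and weights): the squared RKHS norm of $g = \sum_i \alpha_i k(\cdot, x_i)$ expands to $\alpha^{\top} K_t \alpha$ by the reproducing property, and the supremum in the lemma is attained at $f = g / \|g\|_{\mathcal{H}_k}$:

```python
import numpy as np

def se_kernel(a, b, ls=0.25):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * ls ** 2))

x = np.array([0.1, 0.4, 0.7])       # points x_1..x_t
alpha = np.array([1.0, -2.0, 0.5])  # weights alpha_1..alpha_t
K = se_kernel(x, x)

# ||sum_i alpha_i k(., x_i)||^2 = alpha^T K alpha via the reproducing property.
norm_g = np.sqrt(alpha @ K @ alpha)

# At the maximizer f = g/||g||, sum_i alpha_i f(x_i) = (alpha^T K alpha)/||g||,
# which equals ||g||, so both sides of the lemma agree.
attained = (alpha @ K @ alpha) / norm_g
```

Any other unit-norm RKHS member achieves at most `norm_g` by Cauchy–Schwarz, which is the content of the supremum.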