Optimal Order Simple Regret for Gaussian Process Bandits

08/20/2021 · Sattar Vakili, et al. · Imperial College London

Consider the sequential optimization of a continuous, possibly non-convex, and expensive-to-evaluate objective function $f$. The problem can be cast as a Gaussian Process (GP) bandit where $f$ lives in a reproducing kernel Hilbert space (RKHS). The state of the art analysis of several learning algorithms shows a significant gap between the lower and upper bounds on the simple regret performance. When $N$ is the number of exploration trials and $\gamma_N$ is the maximal information gain, we prove an $\tilde{\mathcal{O}}(\sqrt{\gamma_N/N})$ bound on the simple regret performance of a pure exploration algorithm that is significantly tighter than the existing bounds. We show that this bound is order optimal up to logarithmic factors for the cases where a lower bound on regret is known. To establish these results, we prove novel and sharp confidence intervals for GP models applicable to RKHS elements, which may be of broader interest.


1 Introduction

Sequential optimization has evolved into one of the fastest developing areas of machine learning (Mazumdar et al., 2020). We consider sequential optimization of an unknown objective function from noisy and expensive-to-evaluate zeroth-order observations. (Footnote 1: Zeroth-order feedback signifies observations from $f$ itself, in contrast to first-order feedback, which refers to observations of the gradient of $f$, as e.g. in stochastic gradient descent; see, e.g., Agarwal et al., 2011; Vakili and Zhao, 2019.) That is a ubiquitous problem in academic research and industrial production. Examples of applications include exploration in reinforcement learning, recommendation systems, medical analysis tools, and speech recognizers (Shahriari et al., 2016). A notable application in the field of machine learning is automatic hyper-parameter tuning. Prevalent methods such as grid search can be prohibitively expensive (Bergstra et al., 2011; McGibbon et al., 2016). Sequential optimization methods, on the other hand, are shown to efficiently find good hyper-parameters by an adaptive exploration of the hyper-parameter space (Falkner et al., 2018).

Our sequential optimization setting is as follows. Consider an objective function $f$ defined over a domain $\mathcal{X} \subset \mathbb{R}^d$, where $d$ is the dimension of the input. A learning algorithm is allowed to perform an adaptive exploration to sequentially observe the potentially corrupted values of the objective function, $y_n = f(x_n) + \epsilon_n$, where $\epsilon_n$ are random noises. At the end of $N$ exploration trials, the learning algorithm returns a candidate maximizer $\hat{x}_N$ of $f$. Let $x^*$ be a true optimal solution. We may measure the performance of the learning algorithm in terms of simple regret; that is, the difference between the performance under the true optimal, $f(x^*)$, and that under the learnt value, $f(\hat{x}_N)$.

Our formulation falls under the general framework of continuum armed bandits, in which the learner receives feedback only for the selected observation point $x_n$ at each time $n$ (Agrawal, 1995; Kleinberg, 2004; Bubeck et al., 2011a,b). Bandit problems have been extensively studied under numerous settings and various performance measures, including simple regret (see, e.g., Bubeck et al., 2011b; Carpentier and Valko, 2015; Deshmukh et al., 2018), cumulative regret (see, e.g., Auer et al., 2002; Slivkins, 2019; Zhao, 2019), and best arm identification (see, e.g., Audibert et al., 2010; Grover et al., 2018). The choice of performance measure strongly depends on the application. Simple regret is suitable for situations with a preliminary exploration phase (for instance, hyper-parameter tuning) in which costs are not measured in terms of rewards but rather in terms of resources expended (Bubeck et al., 2011b).

Due to the infinite cardinality of the domain, approaching $f(x^*)$ is feasible only when appropriate regularity assumptions on $f$ and noise are satisfied. Following a growing literature (Srinivas et al., 2010; Chowdhury and Gopalan, 2017; Janz et al., 2020; Vakili et al., 2020a), we focus on a variation of the problem where $f$ is assumed to belong to a reproducing kernel Hilbert space (RKHS), which is a very general assumption. Almost all continuous functions can be approximated by the RKHS elements of practically relevant kernels such as the Matérn family of kernels (Srinivas et al., 2010). We consider two classes of noise: sub-Gaussian and light-tailed.

Our regularity assumption on $f$ allows us to utilize Gaussian processes (GPs), which provide powerful Bayesian (surrogate) models for $f$ (Rasmussen and Williams, 2006). Sequential optimization based on GP models is often referred to as Bayesian optimization in the literature (Shahriari et al., 2016; Snoek et al., 2012; Frazier, 2018). We build on prediction and uncertainty estimates provided by GP models to study an efficient adaptive exploration algorithm referred to as Maximum Variance Reduction (MVR). Under the simple regret measure, MVR embodies the simple principle of exploring the points with the highest variance first. Intuitively, the variance in the GP model is considered as a measure of uncertainty about the unknown objective function, and the exploration steps are designed to maximally reduce the uncertainty. At the end of the exploration trials, MVR returns a candidate maximizer based on the prediction provided by the learnt GP model. With its simple structure, MVR is amenable to a tight analysis that significantly improves the best known bounds on simple regret. To this end, we derive novel and sharp confidence intervals for GP models applicable to RKHS elements. In addition, we provide numerical experiments on the simple regret performance of MVR, comparing it to GP-UCB (Srinivas et al., 2010; Chowdhury and Gopalan, 2017), GP-PI (Hoffman et al., 2011) and GP-EI (Hoffman et al., 2011).

1.1 Main Results

Our main contributions are as follows.

We first derive novel confidence intervals for GP models applicable to RKHS elements (Theorems 1 and 2). As part of our analysis, we formulate the posterior variance of a GP model as the sum of two terms: the maximum prediction error from noise-free observations, and the effect of noise (Proposition 1). This interpretation elicits new connections between GP regression and kernel ridge regression (Kanagawa et al., 2018). These results are of interest on their own.

We then build on the confidence intervals for GP models to provide a tight analysis of the simple regret of the MVR algorithm (Theorem 3). In particular, we prove a high probability $\tilde{\mathcal{O}}(\sqrt{\gamma_N/N})$ simple regret, where $\gamma_N$ is the maximal information gain (see § 2.4). (Footnote 2: The notations $\mathcal{O}$ and $\tilde{\mathcal{O}}$ are used to denote the mathematical order and the mathematical order up to logarithmic factors, respectively.) In comparison to the existing $\tilde{\mathcal{O}}(\gamma_N/\sqrt{N})$ bounds on simple regret (see, e.g., Srinivas et al., 2010; Chowdhury and Gopalan, 2017; Scarlett et al., 2017), we show a $\sqrt{\gamma_N}$ improvement. It is noteworthy that our bound guarantees convergence to the optimum value of $f$, while previous bounds do not, since although $\gamma_N$ grows sublinearly with $N$, it can grow faster than $\sqrt{N}$.

We then specialize our results for the particular cases of the practically relevant Matérn and Squared Exponential (SE) kernels. We show that our regret bounds match the lower bounds and close the gap reported in Scarlett et al. (2017); Cai and Scarlett (2020), who showed that an average simple regret of $\epsilon$ requires $N = \Omega\left(\frac{1}{\epsilon^2}(\log\frac{1}{\epsilon})^{\frac{d}{2}}\right)$ exploration trials in the case of the SE kernel. For the Matérn-$\nu$ kernel (where $\nu$ is the smoothness parameter, see § 2.1) they gave the analogous bound of $N = \Omega\left((\frac{1}{\epsilon})^{2+\frac{d}{\nu}}\right)$. They also reported a significant gap between these lower bounds and the upper bounds achieved by the GP-UCB algorithm. In Corollary 1, we show that our analysis of MVR closes this gap in the performance and establishes upper bounds matching the lower bounds up to logarithmic factors.

In contrast to the existing results, which mainly focus on Gaussian and sub-Gaussian distributions for noise, we extend our analysis to the more general class of light-tailed distributions, thus broadening the applicability of the results. This extension increases both the confidence interval width and the simple regret by only a multiplicative logarithmic factor. These results apply to, e.g., the privacy preserving setting where often a light-tailed noise is employed (Basu et al., 2019; Ren et al., 2020; Zheng et al., 2020).

1.2 Literature Review

The celebrated work of Srinivas et al. (2010) pioneered the analysis of GP bandits by proving an $\tilde{\mathcal{O}}(\gamma_N\sqrt{N})$ upper bound on the cumulative regret of GP-UCB, an optimistic optimization algorithm which sequentially selects observation points that maximize an upper confidence bound score over the search space. That implies an $\tilde{\mathcal{O}}(\gamma_N/\sqrt{N})$ simple regret (Scarlett et al., 2017). Their analysis relied on deriving confidence intervals for GP models applicable to RKHS elements. They also considered a fully Bayesian setting where $f$ is assumed to be a sample from a GP and the noise is assumed to be Gaussian. Chowdhury and Gopalan (2017) built on the feature space representation of GP models and self-normalized martingale inequalities, first developed in Abbasi-Yadkori et al. (2011) for linear bandits, to improve the confidence intervals of Srinivas et al. (2010) by a multiplicative factor, which led to an improvement in the regret bounds by the same factor. A discussion on the comparison between these results and the confidence intervals derived in this paper is provided in § 3.3. A technical comparison with some recent advances in regret bounds requires introducing new notation and is deferred to Appendix A.

The performance of Bayesian optimization algorithms has been extensively studied under numerous settings including contextual information (Krause and Ong, 2011), high dimensional spaces (Djolonga et al., 2013; Mutny and Krause, 2018), safety constraints (Berkenkamp et al., 2016; Sui et al., 2018), parallelization (Kandasamy et al., 2018), meta-learning (Wang et al., 2018a), multi-fidelity evaluations (Kandasamy et al., 2019), ordinal models (Picheny et al., 2019), corruption tolerance (Bogunovic et al., 2020; Cai and Scarlett, 2020), and neural tangent kernels (Zhou et al., 2020; Zhang et al., 2020). Javidi and Shekhar (2018) introduced an adaptive discretization of the search space, improving the computational complexity of a GP-UCB based algorithm. Sparse approximations of GP posteriors are shown to preserve the regret orders while improving the computational complexity of Bayesian optimization algorithms (Mutny and Krause, 2018; Calandriello et al., 2019; Vakili et al., 2020b). Under the RKHS setting with noisy observations, GP-TS (Chowdhury and Gopalan, 2017) and GP-EI (Nguyen et al., 2017; Wang and de Freitas, 2014) are also shown to achieve the same regret guarantees as GP-UCB (up to logarithmic factors). All these works report $\tilde{\mathcal{O}}(\gamma_N\sqrt{N})$ cumulative regret bounds.

Regret bounds are also reported under other, often simpler, settings such as noise-free observations (Bull, 2011; Vakili et al., 2020c) or a Bayesian regret that is averaged over a known prior on $f$ (Kandasamy et al., 2018; Wang et al., 2018b; Wang and Jegelka, 2017; Scarlett, 2018; Shekhar and Javidi, 2021; Grünewälder et al., 2010; de Freitas et al., 2012; Kawaguchi et al., 2015), rather than for a fixed and unknown $f$ as in our setting.

Other lines of work on continuum armed bandits exist, relying on other regularity assumptions such as Lipschitz continuity (Kleinberg, 2004; Bubeck et al., 2011a; Carpentier and Valko, 2015; Kleinberg et al., 2008), convexity (Agarwal et al., 2011) and unimodality (Combes et al., 2020), to name a few. A notable example is Bubeck et al. (2011a), who showed that hierarchical algorithms based on tree search yield sublinear cumulative regret. We do not compare with these results due to the inherent difference in the regularity assumptions.

1.3 Organization

In § 2, the problem formulation, the regularity assumptions, and the preliminaries on RKHSs and GP models are presented. The novel confidence intervals for GP models are proven in § 3. The MVR algorithm and its analysis are given in § 4. The experiments are presented in § 5. We conclude with a discussion in § 6.

2 Problem Formulation and Preliminaries

Consider an objective function $f: \mathcal{X} \rightarrow \mathbb{R}$, where $\mathcal{X} \subset \mathbb{R}^d$ is a convex and compact domain. Consider an optimal point $x^* \in \arg\max_{x \in \mathcal{X}} f(x)$. A learning algorithm $\mathcal{A}$ sequentially selects observation points $x_n \in \mathcal{X}$ and observes the corresponding noise disturbed objective values $y_n = f(x_n) + \epsilon_n$, where $\epsilon_n$ is the observation noise. We use the notations $X_n = \{x_1, \dots, x_n\}$ and $Y_n = [y_1, \dots, y_n]^\top$ for the observation points and the observed values, for all $n \in \mathbb{N}$. In the simple regret setting, the learning algorithm determines a sequence of mappings, where each mapping predicts a candidate maximizer $\hat{x}_N$ from the first $N$ observations. For algorithm $\mathcal{A}$, the simple regret under a budget of $N$ tries is defined as

$$r_N(\mathcal{A}) = f(x^*) - f(\hat{x}_N). \qquad (1)$$

The budget $N$ may be unknown a priori. Notation-wise, we use $F_n = [f(x_1), \dots, f(x_n)]^\top$ and $E_n = [\epsilon_1, \dots, \epsilon_n]^\top$ to denote the noise free part of the observations and the noise history, respectively, similar to $X_n$ and $Y_n$.

2.1 Gaussian Processes

The Bayesian optimization algorithms build on GP (surrogate) models. A GP is a random process $\hat{f} = \{\hat{f}(x)\}_{x \in \mathcal{X}}$, where each of its finite subsets follows a multivariate Gaussian distribution. The distribution of a GP is fully specified by its mean function $\mu(x) = \mathbb{E}[\hat{f}(x)]$ and a positive definite kernel (or covariance function) $k(x, x') = \mathbb{E}[(\hat{f}(x) - \mu(x))(\hat{f}(x') - \mu(x'))]$. Without loss of generality, it is typically assumed that $\mu \equiv 0$ for prior GP distributions.

Conditioning GPs on available observations provides us with powerful non-parametric Bayesian (surrogate) models over the space of functions. In particular, using the conjugate property, conditioned on $Y_n$, the posterior of $\hat{f}$ is a GP with mean function $\mu_n$ and kernel function $k_n$ specified as follows:

$$\mu_n(x) = k_n^\top(x)\left(K_n + \sigma^2 I_n\right)^{-1} Y_n, \qquad k_n(x, x') = k(x, x') - k_n^\top(x)\left(K_n + \sigma^2 I_n\right)^{-1} k_n(x'), \qquad (2)$$

where, with some abuse of notation, $k_n(x) = [k(x, x_1), \dots, k(x, x_n)]^\top$, $K_n = [k(x_i, x_j)]_{i,j=1}^{n}$ is the covariance matrix, $I_n$ is the identity matrix of dimension $n$, and $\sigma > 0$ is a real number. We use $\sigma_n^2(x) = k_n(x, x)$ to denote the posterior variance.
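For concreteness, the posterior formulas in (2) translate directly into a few lines of linear algebra. The following is a minimal NumPy sketch (the function name and the Cholesky-based solve are our own choices, not from the paper):

```python
import numpy as np

def gp_posterior(K, k_x, k_xx, y, sigma2=0.1):
    """Posterior mean and variance at one query point, per Eq. (2).

    K      -- (n, n) kernel matrix over the observed points
    k_x    -- (n,) kernel vector between the query point and observed points
    k_xx   -- scalar prior kernel value k(x, x) at the query point
    y      -- (n,) observed values Y_n
    sigma2 -- the regularization parameter sigma^2 in Eq. (2)
    """
    n = K.shape[0]
    # Cholesky factorization of (K + sigma^2 I) for numerical stability,
    # instead of forming an explicit matrix inverse.
    L = np.linalg.cholesky(K + sigma2 * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    v = np.linalg.solve(L, k_x)
    mean = k_x @ alpha          # mu_n(x)
    var = k_xx - v @ v          # sigma_n^2(x) = k_n(x, x)
    return mean, var
```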

In practice, Matérn and squared exponential (SE) are the most commonly used kernels for Bayesian optimization (see, e.g., Shahriari et al., 2016; Snoek et al., 2012):

$$k_{\mathrm{SE}}(x, x') = \exp\left(-\frac{r^2}{2l^2}\right), \qquad k_{\mathrm{Mat\acute{e}rn}}(x, x') = \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}\, r}{l}\right)^{\nu} B_{\nu}\!\left(\frac{\sqrt{2\nu}\, r}{l}\right),$$

where $l > 0$ is referred to as lengthscale, $r = \|x - x'\|$ is the Euclidean distance between $x$ and $x'$, $\nu > 0$ is referred to as the smoothness parameter, and $\Gamma$ and $B_\nu$ are, respectively, the Gamma function and the modified Bessel function of the second kind. Variation over the parameter $\nu$ creates a rich family of kernels. The SE kernel can also be interpreted as the special case of the Matérn family obtained in the limit $\nu \to \infty$.
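Both kernels are a few lines in code; the following sketch uses SciPy for the Gamma and Bessel functions (the lengthscale and $\nu$ defaults are arbitrary placeholders):

```python
import numpy as np
from scipy.special import gamma, kv  # kv: modified Bessel function, 2nd kind

def se_kernel(x, x2, lengthscale=0.2):
    r = np.linalg.norm(np.atleast_1d(x) - np.atleast_1d(x2))
    return np.exp(-r**2 / (2 * lengthscale**2))

def matern_kernel(x, x2, lengthscale=0.2, nu=2.5):
    r = np.linalg.norm(np.atleast_1d(x) - np.atleast_1d(x2))
    if r == 0.0:
        return 1.0  # limit of the expression below as r -> 0
    z = np.sqrt(2 * nu) * r / lengthscale
    return (2 ** (1 - nu) / gamma(nu)) * z**nu * kv(nu, z)
```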

2.2 RKHSs and Regularity Assumptions on $f$

Consider a positive definite kernel $k: \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$ with respect to a finite Borel measure (e.g., the Lebesgue measure) supported on $\mathcal{X}$. A Hilbert space $H_k$ of functions on $\mathcal{X}$ equipped with an inner product $\langle \cdot, \cdot \rangle_{H_k}$ is called an RKHS with reproducing kernel $k$ if the following is satisfied: for all $x \in \mathcal{X}$, $k(\cdot, x) \in H_k$; and for all $x \in \mathcal{X}$ and $f \in H_k$, $f(x) = \langle f, k(\cdot, x) \rangle_{H_k}$ (reproducing property). A constructive definition of RKHS requires the use of the Mercer theorem, which provides an alternative representation for kernels as an inner product of infinite dimensional feature maps (e.g., Kanagawa et al., 2018, Theorem 4.1), and is deferred to Appendix B. We have the following regularity assumption on the objective function $f$.

Assumption 1

The objective function $f$ is assumed to live in the RKHS corresponding to a positive definite kernel $k$. In particular, $\|f\|_{H_k} \le B$, for some $B > 0$, where $\|f\|_{H_k} = \sqrt{\langle f, f \rangle_{H_k}}$ denotes the RKHS norm.

For common kernels, such as the Matérn family of kernels, members of $H_k$ can uniformly approximate any continuous function on any compact subset of the domain $\mathcal{X}$ (Srinivas et al., 2010). This is a very general class of functions, more general than, e.g., convex or Lipschitz classes. It has thus gained increasing interest in recent years.

2.3 Regularity Assumptions on Noise

We consider two different cases regarding the regularity assumption on noise. Let us first revisit the definition of sub-Gaussian distributions.

Definition 1

A random variable $X$ is called sub-Gaussian if its moment generating function is upper bounded by that of a Gaussian random variable: there exists $R > 0$ such that $\mathbb{E}[\exp(\lambda X)] \le \exp\left(\frac{\lambda^2 R^2}{2}\right)$ for all $\lambda \in \mathbb{R}$.

The sub-Gaussian assumption implies that $\mathbb{E}[X] = 0$. It also allows us to use the Chernoff-Hoeffding concentration inequality (Antonini et al., 2008) in our analysis.

We next recall the definition of light-tailed distributions.

Definition 2

A random variable $X$ is called light-tailed if its moment generating function exists, i.e., there exists $\lambda_0 > 0$ such that for all $|\lambda| \le \lambda_0$, $\mathbb{E}[\exp(\lambda X)] < \infty$.

For a zero mean light-tailed random variable $X$, we have (Chareka et al., 2006)

$$\mathbb{E}[\exp(\lambda X)] \le \exp\left(\frac{\xi_0 \lambda^2}{2}\right), \quad \text{for all } |\lambda| \le \lambda_0, \qquad (3)$$

where $\xi_0 = \sup\{ h''(\lambda) : |\lambda| \le \lambda_0 \}$, $h''$ denotes the second derivative of the moment generating function $h(\lambda) = \mathbb{E}[\exp(\lambda X)]$, and $\lambda_0$ is the parameter specified in Definition 2. We observe that the upper bound in (3) is the moment generating function of a zero mean Gaussian random variable with variance $\xi_0$. Thus, light-tailed distributions are also called locally sub-Gaussian distributions (Vakili et al., 2013).

We provide confidence intervals for GP models and regret bounds for MVR under each of the following assumptions on the noise terms.

Assumption 2 (Sub-Gaussian Noise)

The noise terms $\epsilon_n$ are i.i.d. over $n$. In addition, they are $R$-sub-Gaussian: $\mathbb{E}[\exp(\lambda \epsilon_n)] \le \exp\left(\frac{\lambda^2 R^2}{2}\right)$ for all $\lambda \in \mathbb{R}$, for some $R > 0$.

Assumption 3 (Light-Tailed Noise)

The noise terms $\epsilon_n$ are i.i.d. zero mean light-tailed random variables over $n$. In addition, their moment generating functions satisfy Definition 2 and (3) with parameters $\lambda_0 > 0$ and $\xi_0 > 0$.

Bayesian optimization uses GP priors for the objective function and assumes a Gaussian distribution for noise (for its conjugate property). It is noteworthy that the use of GP models is merely for the purpose of algorithm design and does not affect our regularity assumptions on $f$ and noise. We use the notation $\hat{f}$ to distinguish the GP model from the fixed $f$.

2.4 Maximal Information Gain

The regret bounds derived in this work are given in terms of the maximal information gain, defined as $\gamma_N = \sup_{\{x_1, \dots, x_N\} \subset \mathcal{X}} I(Y_N; F_N)$, where $I(Y_N; F_N)$ denotes the mutual information between $Y_N$ and $F_N$ (see, e.g., Cover, 1999). In the case of a GP model, the mutual information can be given as $I(Y_N; F_N) = \frac{1}{2}\log\det\left(I_N + \sigma^{-2} K_N\right)$, where $\det$ denotes the determinant of a square matrix. Note that the maximal information gain is kernel-specific and $f$-independent. Upper bounds on $\gamma_N$ are derived in Srinivas et al. (2010); Janz et al. (2020); Vakili et al. (2020a), which are commonly used to provide explicit regret bounds. In the case of the Matérn-$\nu$ and SE kernels, $\gamma_N = \mathcal{O}\left(N^{\frac{d}{2\nu+d}} \log^{\frac{2\nu}{2\nu+d}}(N)\right)$ and $\gamma_N = \mathcal{O}\left(\log^{d+1}(N)\right)$, respectively (Vakili et al., 2020a).
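The information gain of a given design is straightforward to evaluate from its kernel matrix; computing $\gamma_N$ itself requires a supremum over all size-$N$ subsets of $\mathcal{X}$, which in practice is upper-bounded analytically or approximated by greedy selection (near-optimal by submodularity of the mutual information). A minimal sketch with our own naming:

```python
import numpy as np

def information_gain(K, sigma2=0.1):
    """I(Y_N; F_N) = 0.5 * logdet(I_N + sigma^{-2} K_N) for kernel matrix K."""
    n = K.shape[0]
    _, logdet = np.linalg.slogdet(np.eye(n) + K / sigma2)
    return 0.5 * logdet
```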

3 Confidence Intervals for Gaussian Process Models

The analysis of bandit problems classically builds on confidence intervals applicable to the values of the objective function (see, e.g., Auer, 2002; Bubeck et al., 2012). The GP modelling allows us to create confidence intervals for complex functions over continuous domains. In particular, we utilize the prediction ($\mu_n$) and the uncertainty estimate ($\sigma_n$) provided by GP models in building the confidence intervals, which become an important building block of our analysis in the next section. To this end, we first prove the following proposition, which formulates the posterior variance of a GP model as the sum of two terms: the maximum prediction error for an RKHS element from noise free observations, and the effect of noise.

Proposition 1

Let $\sigma_n^2(\cdot)$ be the posterior variance of the surrogate GP model as defined in (2). Let $Z_n(x) = (K_n + \sigma^2 I_n)^{-1} k_n(x)$. We have

$$\sigma_n^2(x) = \sup_{g \in H_k: \|g\|_{H_k} \le 1} \left( g(x) - Z_n^\top(x)\, g_{X_n} \right)^2 + \sigma^2 \left\| Z_n(x) \right\|_2^2, \quad \text{where } g_{X_n} = [g(x_1), \dots, g(x_n)]^\top.$$

Notice that the first term captures the maximum prediction error from the noise free observations $g_{X_n}$. The second term captures the effect of noise in the surrogate GP model (and is independent of $g$). A detailed proof for Proposition 1 is provided in Appendix C.

Proposition 1 elicits new connections between GP models and kernel ridge regression. While the equivalence of the posterior mean in GP models and the regressor in kernel ridge regression is well known, the interpretation of posterior variance of GP models as the maximum prediction error for an RKHS element is less studied (see Kanagawa et al., 2018, Section 3, for a detailed discussion on the connections between GP models and kernel ridge regression).
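The decomposition in Proposition 1 can be checked numerically. The sketch below is our own construction; it uses the closed form of the RKHS-norm supremum via the reproducing property (Lemma 1 in Appendix C) for an SE kernel in one dimension:

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma2, ell = 8, 0.1, 0.2
X = rng.uniform(0, 1, size=n)
x = 0.37

k = lambda a, b: np.exp(-(a - b)**2 / (2 * ell**2))   # SE kernel
K = k(X[:, None], X[None, :])
k_x = k(x, X)

# Left-hand side: posterior variance from Eq. (2).
lhs = k(x, x) - k_x @ np.linalg.solve(K + sigma2 * np.eye(n), k_x)

# Right-hand side: the two terms of Proposition 1, with Z_n(x) as below.
Z = np.linalg.solve(K + sigma2 * np.eye(n), k_x)
# The sup over the unit ball of the RKHS equals the squared RKHS distance
# ||k(., x) - sum_i Z_i k(., x_i)||^2, expanded via the reproducing property:
pred_err = k(x, x) - 2 * Z @ k_x + Z @ K @ Z
noise_term = sigma2 * Z @ Z
assert np.isclose(lhs, pred_err + noise_term)
```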

3.1 Confidence Intervals under Sub-Gaussian Noise

The following theorem provides a confidence interval for GP models applicable to RKHS elements under the assumption that the noise terms are sub-Gaussian.

Theorem 1

Assume that Assumptions 1 and 2 hold. Provided $n$ noisy observations $Y_n$ from $f$, let $\mu_n$ and $\sigma_n$ be as defined in (2). Assume $X_n$ are independent of $E_n$. For a fixed $x \in \mathcal{X}$, define the upper and lower confidence bounds, respectively,

$$\mathcal{U}_n(x) = \mu_n(x) + \beta(\delta)\,\sigma_n(x), \qquad \mathcal{L}_n(x) = \mu_n(x) - \beta(\delta)\,\sigma_n(x), \qquad (4)$$

with $\beta(\delta) = B + \frac{R}{\sigma}\sqrt{2\log\frac{1}{\delta}}$, where $\delta \in (0, 1)$, and $B$ and $R$ are the parameters specified in Assumptions 1 and 2. We have

$$\Pr\left[f(x) \le \mathcal{U}_n(x)\right] \ge 1 - \delta, \qquad \Pr\left[f(x) \ge \mathcal{L}_n(x)\right] \ge 1 - \delta.$$

We can write the difference between the objective function and the posterior mean as follows:

$$f(x) - \mu_n(x) = \left(f(x) - Z_n^\top(x) F_n\right) - Z_n^\top(x) E_n.$$

The first term (the prediction error from noise free observations) can be bounded directly following Proposition 1. The second term (the effect of noise) is bounded as a result of Proposition 1 and the Chernoff-Hoeffding inequality. A detailed proof of Theorem 1 is provided in Appendix D.
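Given the posterior at a fixed point, evaluating the interval in (4) is immediate; in the sketch below the values of $B$, $R$, $\sigma$, and $\delta$ are placeholders (they are problem-dependent inputs, not defaults from the paper):

```python
import numpy as np

def confidence_bounds(mu_x, sigma_x, B=1.0, R=0.1, sigma=0.1, delta=0.01):
    """Upper and lower confidence bounds of Theorem 1 at a fixed point x."""
    beta = B + (R / sigma) * np.sqrt(2 * np.log(1 / delta))
    return mu_x + beta * sigma_x, mu_x - beta * sigma_x
```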

3.2 Confidence Intervals under Light-Tailed Noise

We now extend the confidence intervals to the case of light-tailed noise. The main difference with sub-Gaussian noise is that the Chernoff-Hoeffding inequality is no longer applicable. We derive new bounds accounting for light-tailed noise in the analysis of Theorem 2.

Theorem 2

Assume that Assumptions 1 and 3 hold. For a fixed $x \in \mathcal{X}$, define the upper and lower confidence bounds $\mathcal{U}_n(x)$ and $\mathcal{L}_n(x)$ similar to Theorem 1, with a confidence interval width parameter $\beta(\delta)$ given by the maximum of two terms depending on $B$, $\lambda_0$ and $\xi_0$, the parameters specified in Assumptions 1 and 3 (the exact expression is given in Appendix D). Assume $X_n$ are independent of $E_n$. We have

$$\Pr\left[f(x) \le \mathcal{U}_n(x)\right] \ge 1 - \delta, \qquad \Pr\left[f(x) \ge \mathcal{L}_n(x)\right] \ge 1 - \delta.$$

In comparison to Theorem 1, under the light-tailed assumption, the confidence interval width increases by a multiplicative logarithmic factor. A detailed proof of Theorem 2 is provided in Appendix D.

Remark 1

Theorems 1 and 2 rely on the assumption that $X_n$ are independent of $E_n$. As we shall see in § 4, this assumption is satisfied when the confidence intervals are applied to the analysis of MVR.

3.3 Comparison with the Existing Confidence Intervals

The most relevant results to our Theorems 1 and 2 are the confidence intervals of Chowdhury and Gopalan (2017), which were themselves an improvement over those of Srinivas et al. (2010). Chowdhury and Gopalan (2017) built on the feature space representation of GP kernels and self-normalized martingale inequalities (Abbasi-Yadkori et al., 2011; Peña et al., 2008) to establish a confidence interval in the same form as in Theorem 1, under Assumptions 1 and 2, with confidence interval width $\beta(\delta) = B + R\sqrt{2(\gamma_n + 1 + \log\frac{1}{\delta})}$ (instead of $B + \frac{R}{\sigma}\sqrt{2\log\frac{1}{\delta}}$). (Footnote 4: The effect of $\sigma$ is absorbed into the constants.) There is a stark contrast between this confidence interval and the one given in Theorem 1 in its dependence on $\gamma_n$, which has a relatively large, and possibly polynomial in $n$, value. That contributes an extra multiplicative $\sqrt{\gamma_N}$ factor to regret.

Neither of these two results (our Theorem 1 and that of Chowdhury and Gopalan (2017)) implies the other. Although our confidence interval is much tighter, there are two important differences in the settings of these theorems. One difference is in the probabilistic dependencies between the observation points $X_n$ and the noise terms $E_n$. While Theorem 1 assumes that $X_n$ are independent of $E_n$, Chowdhury and Gopalan (2017) allow each $x_n$ to depend on the previous noise terms $E_{n-1}$. This is a reflection of the difference in the analytical requirements of MVR and GP-UCB. The other difference is that the confidence interval of Chowdhury and Gopalan (2017) holds for all $x \in \mathcal{X}$, while Theorem 1 holds for a single $x$. As we will see in § 4.2, a probability union bound can be used to obtain confidence intervals applicable to all $x$ in (a discretization of) $\mathcal{X}$, which contributes only logarithmic terms to regret, in contrast to $\sqrt{\gamma_N}$. Roughly speaking, we are trading off the extra $\sqrt{\gamma_N}$ term for restricting the confidence interval to hold for a single $x$. It remains an open problem whether the same can be done when $X_n$ are allowed to depend on $E_n$.

4 Maximum Variance Reduction and Simple Regret

In this section, we first formally present an exploration policy based on GP models referred to as Maximum Variance Reduction (MVR). We then utilize the confidence intervals for GP models derived in § 3 to prove bounds on the simple regret of MVR.

4.1 Maximum Variance Reduction Algorithm

MVR relies on the principle of reducing the maximum uncertainty, where the uncertainty is measured by the posterior variance of the GP model. After $N$ exploration trials, MVR returns a candidate maximizer according to the prediction provided by the learnt GP model. A pseudo-code is given in Algorithm 1.

1: Initialization: domain $\mathcal{X}$, kernel $k$, parameter $\sigma$, prior $\mu_0 = 0$, $\sigma_0^2(\cdot) = k(\cdot, \cdot)$.
2: for $n = 1, 2, \dots, N$ do
3:     $x_n \in \arg\max_{x \in \mathcal{X}} \sigma_{n-1}(x)$, where a tie is broken arbitrarily.
4:     Update $\sigma_n$ according to (2).
5: end for
6: Update $\mu_N$ according to (2).
7: return $\hat{x}_N \in \arg\max_{x \in \mathcal{X}} \mu_N(x)$, where a tie is broken arbitrarily.
Algorithm 1 Maximum Variance Reduction (MVR)
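The following is a minimal implementation of Algorithm 1 over a finite candidate set (a sketch: the interfaces `kernel(a, b) -> float` and `f_noisy(x) -> float` are our assumptions; a continuous domain would replace the candidate enumeration with a continuous maximization of the posterior variance):

```python
import numpy as np

def mvr(kernel, candidates, f_noisy, N, sigma2=0.1):
    """Maximum Variance Reduction (Algorithm 1) over a finite candidate set."""
    def kvec(x, X):
        return np.array([kernel(x, a) for a in X])

    X, y = [], []
    for _ in range(N):
        if X:  # posterior variance per Eq. (2); note it does not depend on y
            K = np.array([[kernel(a, b) for b in X] for a in X])
            A = np.linalg.inv(K + sigma2 * np.eye(len(X)))
            var = [kernel(x, x) - kvec(x, X) @ A @ kvec(x, X) for x in candidates]
        else:
            var = [kernel(x, x) for x in candidates]
        x_next = candidates[int(np.argmax(var))]  # max-variance exploration
        X.append(x_next)
        y.append(f_noisy(x_next))

    # The posterior mean is computed once, after the exploration phase.
    K = np.array([[kernel(a, b) for b in X] for a in X])
    alpha = np.linalg.solve(K + sigma2 * np.eye(N), np.array(y))
    mean = [kvec(x, X) @ alpha for x in candidates]
    return candidates[int(np.argmax(mean))]
```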

4.2 Regret Analysis

For the analysis of MVR, we assume there exists a fine discretization of the domain for RKHS elements, which is a standard assumption in the literature (see, e.g., Srinivas et al., 2010; Chowdhury and Gopalan, 2017; Vakili et al., 2020b).

Assumption 4

For each given $N \in \mathbb{N}$ and each $f \in H_k$ with $\|f\|_{H_k} \le B$, there exists a discretization $\mathbb{D}$ of $\mathcal{X}$ such that $f(x) - f([x]) \le \frac{1}{\sqrt{N}}$, where $[x] = \arg\min_{x' \in \mathbb{D}} \|x' - x\|$ is the closest point in $\mathbb{D}$ to $x$, and $|\mathbb{D}| \le c B^d N^{\frac{d}{2}}$, where $c$ is a constant independent of $N$ and $B$.

Assumption 4 is a mild assumption that holds for typical kernels such as SE and Matérn (Srinivas et al., 2010; Chowdhury and Gopalan, 2017). The following theorem provides a high probability bound on the regret performance of MVR when the noise terms satisfy either Assumption 2 or 3.
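For intuition, the discretizations in Assumption 4 can be realized by uniform grids for sufficiently smooth kernels; below is a sketch on $[0, 1]^d$ (our own construction, with the per-dimension resolution `m` left as a parameter to be scaled with $N$):

```python
import numpy as np
from itertools import product

def uniform_discretization(d, m):
    """A uniform grid with m points per dimension on [0, 1]^d."""
    grid_1d = (np.arange(m) + 0.5) / m
    return np.array(list(product(grid_1d, repeat=d)))

def closest_point(x, D):
    """[x]: the closest discretization point in D to x."""
    return D[np.argmin(np.linalg.norm(D - np.asarray(x), axis=1))]
```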

Theorem 3

Consider the Gaussian process bandit problem. Under Assumptions 1, 4, and (2 or 3), for any $\delta \in (0, 1)$, with probability at least $1 - \delta$, MVR satisfies

$$r_N(\mathrm{MVR}) = \tilde{\mathcal{O}}\left(\sqrt{\frac{\gamma_N}{N}}\right),$$

where, under Assumption 2, the factors hidden in the $\tilde{\mathcal{O}}$ notation are logarithmic in $N$ and $\frac{1}{\delta}$, and under Assumption 3 they grow by an additional multiplicative logarithmic factor; the implied constants depend on $B$, $R$, $\lambda_0$, $\xi_0$, $\sigma$, and $c$, the constants specified in Assumptions 1, 2, 3 and 4.

A detailed proof of the theorem is provided in Appendix E.

Remark 2

Under Assumptions 2 and 3, the regret bounds can be simplified as $r_N(\mathrm{MVR}) = \tilde{\mathcal{O}}(\sqrt{\gamma_N/N})$, with the bound under Assumption 3 carrying an additional logarithmic factor. For instance, in the case of the Matérn-$\nu$ kernel, under Assumptions 2 and 3 alike,

$$r_N(\mathrm{MVR}) = \tilde{\mathcal{O}}\left(N^{-\frac{\nu}{2\nu + d}}\right),$$

which always converges to zero as $N$ grows (unlike the existing regret bounds).

Remark 3

In the analysis of Theorem 3, we apply Assumption 4 to the posterior mean $\mu_N$ as well as to $f$. For this purpose, we derive a high probability upper bound on the RKHS norm of $\mu_N$ (see Lemma 4 in Appendix E), which appears in the regret bound expression.

4.3 Optimal Order Simple Regret with SE and Matérn Kernels

To enable a direct comparison with the lower bounds on simple regret proven in Scarlett et al. (2017); Cai and Scarlett (2020), in the following corollary we state a dual form of Theorem 3 for the Matérn and SE kernels. Specifically, we formalize the number of exploration trials required to achieve an average simple regret of at most $\epsilon$.

Corollary 1

Consider the GP bandit problem with an SE or a Matérn kernel. For $\epsilon > 0$, define $N(\epsilon) = \min\{N \in \mathbb{N} : \mathbb{E}[r_N(\mathrm{MVR})] \le \epsilon\}$. Under Assumptions 1, 4, and (2 or 3), upper bounds on $N(\epsilon)$ are reported in Table 1.

Kernel        | Under Assumption 2                                  | Under Assumption 3
SE            | $\tilde{\mathcal{O}}\left(\frac{1}{\epsilon^2}\right)$ | $\tilde{\mathcal{O}}\left(\frac{1}{\epsilon^2}\right)$
Matérn-$\nu$  | $\tilde{\mathcal{O}}\left(\left(\frac{1}{\epsilon}\right)^{2+\frac{d}{\nu}}\right)$ | $\tilde{\mathcal{O}}\left(\left(\frac{1}{\epsilon}\right)^{2+\frac{d}{\nu}}\right)$
Table 1: The upper bounds on $N(\epsilon)$ defined in Corollary 1 with SE or Matérn kernel. Here $\tilde{\mathcal{O}}$ hides factors polylogarithmic in $\frac{1}{\epsilon}$; the entries under Assumption 3 carry an additional logarithmic factor relative to those under Assumption 2.

A proof is provided in Appendix F. Scarlett et al. (2017); Cai and Scarlett (2020) showed that for the SE kernel, an average simple regret of $\epsilon$ requires $N = \Omega\left(\frac{1}{\epsilon^2}(\log\frac{1}{\epsilon})^{\frac{d}{2}}\right)$. For the Matérn-$\nu$ kernel they gave the analogous bound of $N = \Omega\left((\frac{1}{\epsilon})^{2+\frac{d}{\nu}}\right)$. They also reported significant gaps between these lower bounds and the existing results (see, e.g., Scarlett et al., 2017, Table I). Comparing with Corollary 1, our bounds are tight in all cases up to factors polylogarithmic in $\frac{1}{\epsilon}$.

5 Experiments

In this section, we provide numerical experiments on the simple regret performance of MVR, Improved GP-UCB (IGP-UCB) as presented in Chowdhury and Gopalan (2017), and GP-PI and GP-EI as presented in Hoffman et al. (2011).

We follow the experiment set up in Chowdhury and Gopalan (2017) to generate test functions from the RKHS. First, a set of points is uniformly sampled from the input interval. A GP sample with kernel $k$ is drawn over these points. Given this sample, the mean of the posterior distribution is used as the test function $f$. The noise parameter $\sigma$ is set to a fixed fraction of the function range. For IGP-UCB we set the parameters exactly as described in Chowdhury and Gopalan (2017). The GP model is equipped with an SE or Matérn-$\nu$ kernel. We use two different models for the noise: a zero mean Gaussian (a sub-Gaussian distribution) and a zero mean Laplace (a light-tailed distribution). We run each experiment over 25 independent trials and plot the average simple regret in Figure 1. More experiments on two commonly used benchmark functions for Bayesian optimization (Rosenbrock and Hartmann) are reported in Appendix G. Further details on the experiments are provided in the supplementary material.
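A sketch of this test function construction follows (the number of points, the interval, and the noise parameter below are placeholders, not the paper's values):

```python
import numpy as np

def sample_test_function(kernel, low=0.0, high=1.0, m=100, sigma2=0.01, seed=0):
    """Draw a GP sample on m uniform points; return the posterior mean fitted
    to that sample as the test function f, as described in the experiments."""
    rng = np.random.default_rng(seed)
    Z = np.sort(rng.uniform(low, high, m))
    K = np.array([[kernel(a, b) for b in Z] for a in Z])
    sample = rng.multivariate_normal(np.zeros(m), K + 1e-10 * np.eye(m))
    alpha = np.linalg.solve(K + sigma2 * np.eye(m), sample)
    return lambda x: np.array([kernel(x, z) for z in Z]) @ alpha
```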

Figure 1: Comparison of the simple regret performance of Bayesian optimization algorithms on samples from the RKHS. Panels: (a) SE, Gaussian noise; (b) Matérn, Gaussian noise; (c) SE, Laplace noise; (d) Matérn, Laplace noise.

6 Discussion

In this paper, we proved novel and sharp confidence intervals for GP models applicable to RKHS elements. We then built on these results to prove bounds for the simple regret of an adaptive exploration algorithm under the framework of GP bandits. In addition, for the practically relevant SE and Matérn kernels, where a lower bound on regret is known (Scarlett et al., 2017; Cai and Scarlett, 2020), we showed the order optimality of our results up to logarithmic factors. This closes a significant gap in the literature on the analysis of Bayesian optimization algorithms under the performance measure of simple regret.

The limitation of our work, adhering to simple regret, is that neither our theoretical nor our experimental results prove that MVR is a better algorithm in practice. Overall, exploration-exploitation oriented algorithms such as GP-UCB may perform worse than MVR in terms of simple regret for two reasons. One is over-exploitation of local maxima when $f$ is multi-modal; the other is dependence on an exploration-exploitation balancing hyper-parameter that is often set too conservatively in order to guarantee low regret bounds. Furthermore, their existing analytical regret bounds are suboptimal and possibly vacuous (non-diminishing, when $\gamma_N$ grows faster than $\sqrt{N}$, as discussed). On the other hand, when compared in terms of cumulative regret ($\sum_{n=1}^{N}(f(x^*) - f(x_n))$), MVR suffers from a linear regret.

The main value of our work is in proving tight bounds on the simple regret of a GP based exploration algorithm, when other Bayesian optimization algorithms such as GP-UCB lack a proof of an always diminishing and non-vacuous regret under the same setting as ours. It remains an open question whether the possibly vacuous regret bounds of GP-UCB (as well as GP-TS and GP-EI, whose analyses are inspired by that of GP-UCB) are a fundamental limitation or an artifact of their proofs.

It is worth reiterating that simple regret is favorable in situations with a preliminary exploration phase (for instance, hyper-parameter tuning) (Bubeck et al., 2011b). It has been explicitly studied under numerous settings, e.g., Lipschitz continuous $f$ (Bubeck et al., 2011b; Carpentier and Valko, 2015; Deshmukh et al., 2018); $f$ in an RKHS with noise-free observations (Bull, 2011); a known prior distribution on $f$ with noise-free observations (Grünewälder et al., 2010; de Freitas et al., 2012; Kawaguchi et al., 2015); a known prior distribution on $f$ with noisy observations (Contal et al., 2013); and $f$ in an RKHS with noisy observations (Scarlett et al., 2017; Cai and Scarlett, 2020; Shekhar and Javidi, 2020; Bogunovic et al., 2016). See also § 1.2 and Appendix A for comparison with existing results including Shekhar and Javidi (2020); Bogunovic et al. (2016).

References

  • Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári. (2011) Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pp. 2312–2320. Cited by: §1.2, §3.3.
  • A. Agarwal, D. P. Foster, D. J. Hsu, S. M. Kakade, and A. Rakhlin (2011) Stochastic convex optimization with bandit feedback. Advances in Neural Information Processing Systems 24, pp. 1035–1043. Cited by: §1.2, footnote 1.
  • R. Agrawal (1995) The continuum-armed bandit problem. SIAM journal on control and optimization 33 (6), pp. 1926–1951. Cited by: §1.
  • R. G. Antonini, Y. Kozachenko, and A. Volodin (2008) Convergence of series of dependent φ-subgaussian random variables. Journal of Mathematical Analysis and Applications 338 (2), pp. 1188–1203. Cited by: Appendix D, §2.3.
  • J. Audibert, S. Bubeck, and R. Munos (2010) Best arm identification in multi-armed bandits.. In COLT, pp. 41–53. Cited by: §1.
  • P. Auer, N. Cesa-Bianchi, and P. Fischer (2002) Finite-time analysis of the multiarmed bandit problem. Machine Learning 47 (2–3), pp. 235–256. Cited by: §1.
  • P. Auer (2002) Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research 3 (Nov), pp. 397–422. Cited by: §3.
  • J. Azimi, A. Jalali, and X. Fern (2012) Hybrid batch bayesian optimization. arXiv preprint arXiv:1202.5597. Cited by: §G.1.
  • D. Basu, C. Dimitrakakis, and A. Tossou (2019) Differential privacy for multi-armed bandits: what is it and what is its cost?. arXiv preprint arXiv:1905.12298. Cited by: §1.1.
  • J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl (2011) Algorithms for hyper-parameter optimization. In 25th annual conference on neural information processing systems (NIPS 2011), Vol. 24. Cited by: §1.
  • F. Berkenkamp, A. Krause, and A. P. Schoellig (2016) Bayesian optimization with safety constraints: safe and automatic parameter tuning in robotics. arXiv preprint arXiv:1602.04450. Cited by: §1.2.
  • I. Bogunovic, A. Krause, and J. Scarlett (2020) Corruption-tolerant gaussian process bandit optimization. arXiv preprint arXiv:2003.01971. Cited by: §1.2.
  • I. Bogunovic, J. Scarlett, A. Krause, and V. Cevher (2016) Truncated variance reduction: a unified approach to bayesian optimization and level-set estimation. arXiv preprint arXiv:1610.07379. Cited by: Appendix A, §6.
  • S. Bubeck, N. Cesa-Bianchi, and G. Lugosi (2012) Bandits with heavy tail. arXiv preprint arXiv:1209.1727. Cited by: §3.
  • S. Bubeck, R. Munos, G. Stoltz, and C. Szepesvári (2011a) X-armed bandits.. Journal of Machine Learning Research 12 (5). Cited by: §1.2, §1.
  • S. Bubeck, R. Munos, and G. Stoltz (2011b) Pure exploration in finitely-armed and continuous-armed bandits. Theoretical Computer Science 412 (19), pp. 1832–1852. Cited by: §1, §6.
  • A. D. Bull (2011) Convergence rates of efficient global optimization algorithms. The Journal of Machine Learning Research 12, pp. 2879–2904. Cited by: §1.2, §6.
  • X. Cai and J. Scarlett (2020) On lower bounds for standard and robust gaussian process bandit optimization. arXiv preprint arXiv:2008.08757. Cited by: Appendix A, §1.1, §1.2, §4.3, §4.3, §6, §6.
  • D. Calandriello, L. Carratino, A. Lazaric, M. Valko, and L. Rosasco (2019) Gaussian process optimization with adaptive sketching: scalable and no regret. In Proceedings of the Thirty-Second Conference on Learning Theory, Proceedings of Machine Learning Research, Vol. 99, Phoenix, USA. Cited by: Appendix A, §G.2, §1.2.
  • A. Carpentier and M. Valko (2015) Simple regret for infinitely many armed bandits. In International Conference on Machine Learning, pp. 1133–1141. Cited by: §1.2, §1, §6.
  • P. Chareka, O. Chareka, and S. Kennedy (2006) Locally sub-gaussian random variable and the strong law of large numbers. Atlantic Electronic Journal of Mathematics 1 (1), pp. 75–81. Cited by: §2.3.
  • S. R. Chowdhury and A. Gopalan (2017) On kernelized multi-armed bandits. In International Conference on Machine Learning, pp. 844–853. Cited by: Appendix A, Appendix A, §G.2, §G.2, §1.1, §1.2, §1.2, §1, §1, §3.3, §3.3, §4.2, §4.2, §5, §5.
  • R. Combes, A. Proutière, and A. Fauquette (2020) Unimodal bandits with continuous arms: order-optimal regret without smoothness. Proceedings of the ACM on Measurement and Analysis of Computing Systems 4 (1), pp. 1–28. Cited by: §1.2.
  • E. Contal, D. Buffoni, A. Robicquet, and N. Vayatis (2013) Parallel gaussian process optimization with upper confidence bound and pure exploration. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 225–240. Cited by: §6.
  • T. M. Cover (1999) Elements of information theory. John Wiley & Sons. Cited by: §2.4.
  • N. de Freitas, A. J. Smola, and M. Zoghi (2012) Exponential regret bounds for gaussian process bandits with deterministic observations. In Proceedings of the 29th International Conference on Machine Learning, pp. 955–962. Cited by: §1.2, §6.
  • A. A. Deshmukh, S. Sharma, J. W. Cutler, M. Moldwin, and C. Scott (2018) Simple regret minimization for contextual bandits. arXiv preprint arXiv:1810.07371. Cited by: §1, §6.
  • J. Djolonga, A. Krause, and V. Cevher (2013) High-dimensional gaussian process bandits. In Advances in Neural Information Processing Systems 26, pp. 1025–1033. Cited by: §1.2.
  • S. Falkner, A. Klein, and F. Hutter (2018) BOHB: robust and efficient hyperparameter optimization at scale. arXiv preprint arXiv:1807.01774. Cited by: §1.
  • P. I. Frazier (2018) Bayesian optimization. In Recent Advances in Optimization and Modeling of Contemporary Problems, pp. 255–278. Cited by: §1.
  • A. Grover, T. Markov, P. Attia, N. Jin, N. Perkins, B. Cheong, M. Chen, Z. Yang, S. Harris, W. Chueh, et al. (2018) Best arm identification in multi-armed bandits with delayed feedback. In International Conference on Artificial Intelligence and Statistics, pp. 833–842. Cited by: §1.
  • S. Grünewälder, J. Audibert, M. Opper, and J. Shawe–Taylor (2010) Regret bounds for gaussian process bandit problems. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 273–280. Cited by: §1.2, §6.
  • J. Hensman, N. Fusi, and N. D. Lawrence (2013) Gaussian processes for big data. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence (UAI 2013), Cited by: §G.2.
  • M. D. Hoffman, E. Brochu, and N. de Freitas (2011) Portfolio allocation for bayesian optimization. In UAI, pp. 327–336. Cited by: 2nd item, 3rd item, §1, §5.
  • J. K. Hunter and B. Nachtergaele (2011) Applied Analysis. World Scientific. Cited by: Appendix B.
  • D. Janz, D. Burt, and J. Gonzalez (2020) Bandit optimisation of functions in the matern kernel rkhs. In Proceedings of Machine Learning Research, Vol. 108, pp. 2486–2495. Cited by: Appendix A, §1, §2.4.
  • T. Javidi and S. Shekhar (2018) Gaussian process bandits with adaptive discretization. Electron. J. Statist. 12 (2), pp. 3829–3874. Cited by: §1.2.
  • M. Kanagawa, P. Hennig, D. Sejdinovic, and B. K. Sriperumbudur (2018) Gaussian processes and kernel methods: a review on connections and equivalences. arXiv preprint. Cited by: Appendix B, Appendix C, Appendix E, §1.1, §2.2, §3.
  • K. Kandasamy, G. Dasarathy, J. Oliva, J. Schneider, and B. Poczos (2019) Multi-fidelity gaussian process bandit optimisation. Journal of Artificial Intelligence Research 66, pp. 151–196. Cited by: §1.2.
  • K. Kandasamy, A. Krishnamurthy, J. Schneider, and B. Póczos (2018) Parallelised bayesian optimisation via thompson sampling. In International Conference on Artificial Intelligence and Statistics, pp. 133–142. Cited by: §1.2, §1.2.
  • K. Kawaguchi, L. P. Kaelbling, and T. Lozano-Pérez (2015) Bayesian optimization with exponential convergence. In Advances in Neural Information Processing Systems, Vol. 2015-Janua, pp. 2809–2817. External Links: 1604.01348, ISSN 10495258 Cited by: §1.2, §6.
  • R. Kleinberg, A. Slivkins, and E. Upfal (2008) Multi-armed bandits in metric spaces. In Proceedings of the fortieth annual ACM symposium on Theory of computing, pp. 681–690. Cited by: §1.2.
  • R. Kleinberg (2004) Nearly tight bounds for the continuum-armed bandit problem. Advances in Neural Information Processing Systems 17, pp. 697–704. Cited by: §1.2, §1.
  • A. Krause and C. S. Ong (2011) Contextual gaussian process bandit optimization. In Advances in Neural Information Processing Systems 24, pp. 2447–2455. Cited by: §1.2.
  • E. Mazumdar, A. Pacchiano, Y. Ma, P. L. Bartlett, and M. I. Jordan (2020) On thompson sampling with langevin algorithms. Proceedings of ICML. Cited by: §1.
  • R. T. McGibbon, C. X. Hernández, M. P. Harrigan, S. Kearnes, M. M. Sultan, S. Jastrzebski, B. E. Husic, and V. S. Pande (2016) Osprey: hyperparameter optimization for machine learning. Journal of Open Source Software 1 (5), pp. 34. Cited by: §1.
  • M. Mutny and A. Krause (2018) Efficient high dimensional bayesian optimization with additivity and quadrature fourier features. In Advances in Neural Information Processing Systems 31, pp. 9005–9016. Cited by: §1.2.
  • V. Nguyen, S. Gupta, S. Rana, C. Li, and S. Venkatesh (2017) Regret for expected improvement over the best-observed value and stopping condition. In Asian Conference on Machine Learning, pp. 279–294. Cited by: Appendix A, §1.2.
  • V. H. Peña, T. L. Lai, and Q. Shao (2008) Self-normalized processes: limit theory and statistical applications. Springer Science & Business Media. Cited by: §3.3.
  • V. Picheny, S. Vakili, and A. Artemev (2019) Ordinal bayesian optimisation. arXiv preprint arXiv:1912.02493. Cited by: §1.2.
  • V. Picheny, T. Wagner, and D. Ginsbourger (2013) A benchmark of kriging-based infill criteria for noisy optimization. Structural and Multidisciplinary Optimization 48 (3), pp. 607–626. External Links: Document, ISSN 1615147X Cited by: §G.1.
  • C. E. Rasmussen and C. K. Williams (2006) Gaussian Processes for Machine Learning. MIT Press. Cited by: §1.
  • W. Ren, X. Zhou, J. Liu, and N. B. Shroff (2020) Multi-armed bandits with local differential privacy. arXiv preprint arXiv:2007.03121. Cited by: §1.1.
  • J. Scarlett, I. Bogunovic, and V. Cevher (2017) Lower bounds on regret for noisy Gaussian process bandit optimization. In Proceedings of the 2017 Conference on Learning Theory, Proceedings of Machine Learning Research, Vol. 65, Amsterdam, Netherlands, pp. 1723–1742. Cited by: Appendix A, §1.1, §1.1, §1.2, §4.3, §4.3, §6, §6.
  • J. Scarlett (2018) Tight regret bounds for bayesian optimization in one dimension. arXiv preprint arXiv:1805.11792. Cited by: §1.2.
  • B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas (2016) Taking the human out of the loop: a review of bayesian optimization. Proceedings of the IEEE 104 (1), pp. 148–175. Cited by: §1, §1, §2.1.
  • S. Shekhar and T. Javidi (2020) Multi-scale zero-order optimization of smooth functions in an rkhs. arXiv preprint arXiv:2005.04832. Cited by: Appendix A, §6.
  • S. Shekhar and T. Javidi (2021) Significance of gradient information in bayesian optimization. In International Conference on Artificial Intelligence and Statistics, pp. 2836–2844. Cited by: §1.2.
  • A. Slivkins (2019) Introduction to multi-armed bandits. arXiv preprint arXiv:1904.07272. Cited by: §1.
  • J. Snoek, H. Larochelle, and R. P. Adams (2012) Practical bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems 25, pp. 2951–2959. Cited by: §1, §2.1.
  • N. Srinivas, A. Krause, S. Kakade, and M. Seeger (2010) Gaussian process optimization in the bandit setting: no regret and experimental design. In Proceedings of the 27th International Conference on International Conference on Machine Learning, pp. 1015–1022. Cited by: Appendix A, Appendix E, Appendix F, §G.2, §1.1, §1.2, §1, §1, §2.2, §2.4, §3.3, §4.2, §4.2.
  • Y. Sui, V. Zhuang, J. W. Burdick, and Y. Yue (2018) Stagewise safe bayesian optimization with gaussian processes. arXiv preprint arXiv:1806.07555. Cited by: §1.2.
  • A. L. Teckentrup (2018) Convergence of gaussian process regression with estimated hyper-parameters and applications in bayesian inverse problems. arXiv preprint. Cited by: Appendix B.
  • M. K. Titsias (2009) Variational Learning of Inducing Variables in Sparse Gaussian Processes. In Proceedings of the International Conference on Artificial Intelligence and Statistics, pp. 567–574. Cited by: §G.2.
  • S. Vakili, K. Khezeli, and V. Picheny (2020a) On information gain and regret bounds in gaussian process bandits. arXiv preprint arXiv:2009.06966. Cited by: Appendix A, Appendix F, §1, §2.4.
  • S. Vakili, K. Liu, and Q. Zhao (2013) Deterministic sequencing of exploration and exploitation for multi-armed bandit problems. IEEE Journal of Selected Topics in Signal Processing 7 (5), pp. 759–767. Cited by: §2.3.
  • S. Vakili, V. Picheny, and A. Artemev (2020b) Scalable thompson sampling using sparse gaussian process models. arXiv preprint. Cited by: §G.2, §1.2, §4.2.
  • S. Vakili, V. Picheny, and N. Durrande (2020c) Regret bounds for noise-free bayesian optimization. arXiv preprint arXiv:2002.05096. Cited by: §1.2.
  • S. Vakili and Q. Zhao (2019) A random walk approach to first-order stochastic convex optimization. In IEEE International Symposium on Information Theory (ISIT), Cited by: footnote 1.
  • M. Valko, N. Korda, R. Munos, I. Flaounas, and N. Cristianini (2013) Finite-time analysis of kernelised contextual bandits. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, UAI’13, Arlington, Virginia, USA, pp. 654–663. Cited by: Appendix A.
  • Z. Wang and S. Jegelka (2017) Max-value entropy search for efficient Bayesian optimization. In 34th International Conference on Machine Learning, ICML 2017, Vol. 7, pp. 5530–5543. External Links: 1703.01968, ISBN 9781510855144 Cited by: §1.2.
  • Z. Wang, B. Kim, and L. P. Kaelbling (2018a) Regret bounds for meta bayesian optimization with an unknown gaussian process prior. In Advances in Neural Information Processing Systems, pp. 10477–10488. Cited by: §1.2.
  • Z. Wang, B. Kim, and L. P. Kaelbling (2018b) Regret bounds for meta bayesian optimization with an unknown gaussian process prior. arXiv preprint arXiv:1811.09558. Cited by: §1.2.
  • Z. Wang and N. de Freitas (2014) Theoretical analysis of bayesian optimisation with unknown gaussian process hyper-parameters. arXiv preprint arXiv:1406.7758. Cited by: Appendix A, §1.2.
  • W. Zhang, D. Zhou, L. Li, and Q. Gu (2020) Neural thompson sampling. arXiv preprint arXiv:2010.00827. Cited by: §1.2.
  • Q. Zhao (2019) Multi-armed bandits: theory and applications to online learning in networks. Synthesis Lectures on Communication Networks 12 (1), pp. 1–165. Cited by: §1.
  • K. Zheng, T. Cai, W. Huang, Z. Li, and L. Wang (2020) Locally differentially private (contextual) bandits learning. arXiv preprint arXiv:2006.00701. Cited by: §1.1.
  • D. Zhou, L. Li, and Q. Gu (2020) Neural contextual bandits with ucb-based exploration. In International Conference on Machine Learning, pp. 11492–11502. Cited by: §1.2.

Appendix A Further Comparison with the Existing Regret Bounds

There are several Bayesian optimization algorithms, namely GP-UCB [Srinivas et al., 2010], IGP-UCB, GP-TS [Chowdhury and Gopalan, 2017], TruVar [Bogunovic et al., 2016], GP-EI [Wang and de Freitas, 2014, Nguyen et al., 2017] and KernelUCB [Valko et al., 2013], which enjoy theoretical upper bounds on simple regret (under Assumptions 1 and 2) that grow at least as fast as $\gamma_N/\sqrt{N}$. These bounds do not necessarily converge to zero, since $\gamma_N$ can grow faster than $\sqrt{N}$, resulting in vacuous regret bounds. For example, in the case of a Matérn-$\nu$ kernel, replacing $\gamma_N = \tilde{\mathcal{O}}(N^{\frac{d}{2\nu+d}})$ [Vakili et al., 2020a] results in an $\tilde{\mathcal{O}}(N^{\frac{d}{2\nu+d}-\frac{1}{2}})$ regret which does not converge to zero for $d \ge 2\nu$, meaning the algorithm does not necessarily approach $f(x^*)$. Janz et al. [2020] developed a GP-UCB based algorithm, specific to the Matérn family of kernels, that constructs a cover of the search space, as many hypercubes, and fits an independent GP to each cover element. This algorithm, referred to as $\pi$-GP-UCB, was proven to achieve diminishing regret for all $d$ and $\nu$. Recently, Shekhar and Javidi [2020] introduced LP-GP-UCB, where the GP model is augmented with local polynomial estimators to construct a multi-scale upper confidence bound guiding the sequential optimization. They further improved the regret bounds of Janz et al. [2020] and showed that LP-GP-UCB matches the lower bounds for some configurations of the parameters $d$ and $\nu$ in the case of a Matérn kernel, though not for all [see Shekhar and Javidi, 2020, for a detailed discussion of the bounds on the simple regret of LP-GP-UCB]. In comparison, our bounds on simple regret match the lower bound, up to logarithmic factors, for all parameters $d$ and $\nu$. In addition, LP-GP-UCB is impractical due to large constant factors, though a practical heuristic was also given, while MVR enjoys a simple implementation and works efficiently in practice. Of important theoretical value, SupKernelUCB [Valko et al., 2013], which builds on episodic independent batches of observations, was proven to achieve an $\tilde{\mathcal{O}}(\sqrt{N\gamma_N})$ cumulative regret on a finite set of arms. SupKernelUCB is also reported to perform poorly in practice [Janz et al., 2020, Calandriello et al., 2019, Cai and Scarlett, 2020].

It is noteworthy that our techniques do not directly apply to the analysis of cumulative regret of algorithms such as GP-UCB. The key difference is that in MVR the observation points $X_N$ are independent of the noise terms $E_N$ (although each $x_n$ is allowed to depend on $X_{n-1}$, and $\hat{x}_N$ is allowed to depend on $Y_N$), while in GP-UCB each $x_n$ is allowed to depend on $E_{n-1}$ (see also § 3.3). It remains an interesting open question whether the state of the art upper bound on the regret performance of GP-UCB [Chowdhury and Gopalan, 2017] is tight, or whether the gap with the lower bound [Scarlett et al., 2017] is an artifact of its proof.

Appendix B Constructive Definition of RKHS

A constructive definition of RKHS requires the use of the Mercer theorem, which provides an alternative representation for kernels as an inner product of infinite dimensional feature maps [see, e.g., Kanagawa et al., 2018, Theorem 4.1].

Mercer Theorem:

Let $k$ be a continuous kernel with respect to a finite Borel measure supported on $\mathcal{X}$. There exist an orthonormal system of functions $\{\phi_i\}_{i=1}^{\infty}$ in $L^2$ and a sequence of eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots > 0$ such that

$$k(x, x') = \sum_{i=1}^{\infty} \lambda_i \phi_i(x) \phi_i(x').$$

The RKHS can consequently be represented in terms of $\{(\lambda_i, \phi_i)\}_{i=1}^{\infty}$ using Mercer's representation theorem [see, e.g., Kanagawa et al., 2018].

Mercer's Representation Theorem:

Let $\{(\lambda_i, \phi_i)\}_{i=1}^{\infty}$ be as in the Mercer Theorem. Then, the RKHS of $k$ is given by

$$H_k = \left\{ f = \sum_{i=1}^{\infty} w_i \lambda_i^{\frac{1}{2}} \phi_i \; : \; \|f\|_{H_k}^2 := \sum_{i=1}^{\infty} w_i^2 < \infty \right\}.$$

Mercer's representation theorem indicates that $\{\lambda_i^{1/2}\phi_i\}_{i=1}^{\infty}$ form an orthonormal basis for $H_k$. It also provides a constructive definition of the RKHS as the span of this orthonormal basis, and a definition of the norm of a member $f$ as the $\ell^2$ norm of its weights $\{w_i\}_{i=1}^{\infty}$.

The RKHS of the Matérn-$\nu$ kernel is equivalent to a Sobolev space with parameter $\nu + \frac{d}{2}$ [Kanagawa et al., 2018, Teckentrup, 2018]. This observation provides an intuitive interpretation of the norm of the Matérn RKHS as proportional to the cumulative $L^2$ norm of the weak derivatives of $f$ up to order $\nu + \frac{d}{2}$. I.e., in the case of the Matérn family, Assumption 1 on the norm of $f$ translates to the existence of weak derivatives of $f$ up to order $\nu + \frac{d}{2}$, which can be understood as a versatile measure of the smoothness of $f$ controlled by $\nu$. In the case of the SE kernel, the regularity assumption implies the existence of all weak derivatives of $f$. For the details on the definition of weak derivatives and Sobolev spaces see Hunter and Nachtergaele [2011].

Appendix C Proof of Proposition 1

Recall the notations $k_n(x)$, $K_n$, and $F_n$. Let $Z_n(x) = (K_n + \sigma^2 I_n)^{-1} k_n(x)$. From the closed form expression for the posterior mean of GP models, we have $\mu_n(x) = Z_n^\top(x) Y_n$.

The proof of Proposition 1 uses the following lemma.

Lemma 1

For a positive definite kernel $k$ and its corresponding RKHS $H_k$, the following holds for any $x, x_1, \dots, x_n \in \mathcal{X}$ and any weights $w_1, \dots, w_n \in \mathbb{R}$:

$$\sup_{g \in H_k: \|g\|_{H_k} \le 1} \left( g(x) - \sum_{i=1}^{n} w_i g(x_i) \right) = \left\| k(\cdot, x) - \sum_{i=1}^{n} w_i k(\cdot, x_i) \right\|_{H_k}. \qquad (5)$$

The lemma establishes the equivalence of the RKHS norm of a linear combination of feature vectors induced by $k$ to the supremum of the linear combination of the corresponding function values, over the functions in the unit ball of the RKHS. For a proof, see Kanagawa et al. [2018].

Expanding the RKHS norm on the right hand side through an algebraic manipulation (using the reproducing property $\langle k(\cdot, x), k(\cdot, x') \rangle_{H_k} = k(x, x')$), we get

$$\left\| k(\cdot, x) - \sum_{i=1}^{n} w_i k(\cdot, x_i) \right\|_{H_k}^2 = k(x, x) - 2\sum_{i=1}^{n} w_i k(x, x_i) + \sum_{i=1}^{n}\sum_{j=1}^{n} w_i w_j k(x_i, x_j).$$