No-regret Bayesian Optimization with Unknown Hyperparameters

01/10/2019 · by Felix Berkenkamp, et al. · ETH Zurich

Bayesian optimization (BO) based on Gaussian process models is a powerful paradigm to optimize black-box functions that are expensive to evaluate. While several BO algorithms provably converge to the global optimum of the unknown function, they assume that the hyperparameters of the kernel are known in advance. This is not the case in practice and misspecification often causes these algorithms to converge to poor local optima. In this paper, we present the first BO algorithm that is provably no-regret and converges to the optimum without knowledge of the hyperparameters. We slowly adapt the hyperparameters of stationary kernels and thereby expand the associated function class over time, so that the BO algorithm considers more complex function candidates. Based on the theoretical insights, we propose several practical algorithms that achieve the empirical data efficiency of BO with online hyperparameter estimation, but retain theoretical convergence guarantees. We evaluate our method on several benchmark problems.

1 Introduction

The performance of machine learning algorithms often critically depends on the choice of tuning inputs, e.g., learning rates or regularization constants. Picking these correctly is a key challenge in machine learning. Traditionally, these inputs are optimized using grid or random search (Bergstra and Bengio, 2012). However, as data sets become larger, the computation time required to train a single model increases, which renders these approaches less applicable. Bayesian optimization (BO, Mockus (2012)) is an alternative method that provably determines good inputs within few evaluations of the underlying objective function in some settings. BO methods construct a statistical model of the underlying objective function and use it to evaluate inputs that are informative about the optimum. However, the theoretical guarantees, empirical performance, and data efficiency of BO algorithms critically depend on their own choice of hyperparameters and, in particular, on the prior distribution over the function space. Thus, we effectively shift the problem of tuning inputs one level up, to the tuning of the hyperparameters of the BO algorithm.

In this paper, we present the first BO algorithm that does not require any knowledge about the hyperparameters and provably converges to the global optimum. To this end, we adapt the hyperparameters of the stationary GP kernel and our BO algorithm, so that the associated function space grows over time. The resulting class of algorithms provably converges to the global optimum and retains theoretical convergence guarantees, even when combined with online estimation of hyperparameters.

Related work

General BO has received a lot of attention in recent years. Typically, BO algorithms suggest inputs to evaluate based on an acquisition function that measures how informative an evaluation is about the optimum. The next inputs are obtained by maximizing the acquisition function. Classical acquisition functions are the expected improvement over the best function value encountered so far under the GP distribution (Mockus et al., 1978) and the upper confidence bound algorithm, GP-UCB, which applies the ‘optimism in the face of uncertainty’ principle and is shown to provably converge by Srinivas et al. (2012). Durand et al. (2018) extend this framework to the case of unknown measurement noise. A related method is truncated variance reduction by Bogunovic et al. (2016), which considers the reduction in uncertainty at candidate locations for the optimum. Hennig and Schuler (2012) propose entropy search, which approximates the distribution of the optimum of the objective function and uses the reduction of the entropy in this distribution as an acquisition function. Computationally less expensive approximations are proposed by Hernández-Lobato et al. (2014); Wang and Jegelka (2017); Ru et al. (2018). Other alternatives are the knowledge gradient (Frazier et al., 2009), which is one-step Bayes optimal, and information directed sampling by Russo and Van Roy (2014), which considers a tradeoff between regret and information gained when evaluating an input. Kirschner and Krause (2018) extend the latter framework to heteroscedastic noise.

These BO methods have also been successful empirically. In machine learning, they are used to optimize the performance of learning methods (Brochu et al., 2010; Snoek et al., 2012). BO is also applicable more broadly; for example, in reinforcement learning to optimize a parametric policy for a robot (Calandra et al., 2014; Lizotte et al., 2007; Berkenkamp et al., 2016) or in control to optimize the energy output of a power plant (Abdelrahman et al., 2016). It also forms the backbone of Google Vizier, a service for tuning black-box functions (Golovin et al., 2017).

Some of the previous BO algorithms provide theoretical guarantees about convergence to the optimum. However, these guarantees only hold when the hyperparameters are known a priori. When this is not the case, hyperparameters of the kernel are often inferred using either maximum a posteriori estimates or sampling-based inference (Snoek et al., 2012). However, methods that estimate the hyperparameters online are known to get stuck in local optima (Bull, 2011). Instead, we propose to adapt the hyperparameters online in order to consider larger function spaces over time, which allows us to provide guarantees in terms of convergence to the global optimum. Wang and de Freitas (2014) analyze this setting when a lower bound on the kernel lengthscales is known a priori. They decrease the lengthscales over time and bound the regret in terms of the known lower bound on the lengthscales. Empirically, similar heuristic approaches are used by Wang et al. (2016) and Wabersich and Toussaint (2016). In contrast, this paper considers the case where the hyperparameters are not known. Moreover, the scaling of the hyperparameters in the previous two papers does not depend on the dimensionality of the problem, which can cause the function space to grow too quickly.

Considering larger function classes as more data becomes available is the core idea behind structural risk minimization (Vapnik, 1992) in statistical learning theory. However, there the data is assumed to be sampled independently and identically distributed. This is not the case in BO, where new data is generated actively based on past information.

Our contribution

In this paper, we present Adaptive GP-UCB  (A-GP-UCB), the first algorithm that provably converges to the globally optimal inputs when BO hyperparameters are unknown. Our method expands the function class encoded in the model over time, but does so slowly enough to ensure sublinear regret and convergence to the optimum. Based on the theoretical insights, we propose practical variants of the algorithm with guaranteed convergence. Since our method can be used as an add-on module to existing algorithms with hyperparameter estimation, it achieves similar performance empirically, but avoids local optima when hyperparameters are misspecified. In summary, we:

  • Provide theoretical convergence guarantees for BO with unknown hyperparameters;

  • Propose several practical algorithms based on the theoretical insights;

  • Evaluate the performance in practice and show that our method retains the empirical performance of heuristic methods based on online hyperparameter estimation, but leads to significantly improved performance when the model is misspecified initially.

The remainder of the paper is structured as follows. We state the problem in Sec. 2 and provide relevant background material in Sec. 3. We derive the main theoretical result of the paper in Sec. 4 and use insights gained from the theory to propose practical algorithms. We evaluate these algorithms experimentally in Sec. 5 and draw conclusions in Sec. 6. The technical details of the proofs are shown in the appendix.

2 Problem Statement

In general, BO considers global optimization problems of the form

$$\max_{\mathbf{x} \in \mathcal{D}} f(\mathbf{x}), \tag{1}$$

where $\mathcal{D} \subset \mathbb{R}^d$ is a compact domain over which we want to optimize inputs $\mathbf{x}$, and $f \colon \mathcal{D} \to \mathbb{R}$ is an objective function that evaluates the cost $f(\mathbf{x})$ associated with a given input configuration $\mathbf{x}$. For example, in a machine learning application, $f(\mathbf{x})$ may be the validation loss and $\mathbf{x}$ may be the tuning inputs (e.g., regularization parameters) of the training algorithm. We do not have any significant prior knowledge about the structure of $f$. Specifically, we cannot assume convexity or access to gradient information. Moreover, evaluations of $f$ are corrupted by $\sigma$-sub-Gaussian noise, a general class of noise models that includes, for example, bounded and Gaussian noise.

Regret

We aim to construct a sequence of input evaluations $\mathbf{x}_t$ that eventually maximizes the function value $f(\mathbf{x}_t)$. One natural way to prove this convergence is to show that an algorithm has sublinear regret. The instantaneous regret at iteration $t$ is defined as $r_t = f(\mathbf{x}^*) - f(\mathbf{x}_t)$, which is the loss incurred by evaluating the function at $\mathbf{x}_t$ instead of at the a priori unknown optimal inputs $\mathbf{x}^*$. The cumulative regret is defined as $R_T = \sum_{t=1}^T r_t$, the sum of regrets incurred over $T$ steps. If we can show that the cumulative regret is sublinear for a given algorithm, that is, $\lim_{T \to \infty} R_T / T = 0$, then eventually the algorithm evaluates the function at close-to-optimal inputs most of the time. We say that such an algorithm has no regret. Intuitively, if the average regret approaches zero then, on average, the instantaneous regret must approach zero too, since $r_t$ is non-negative. This implies that there exists a $t$ such that $f(\mathbf{x}_t)$ is arbitrarily close to $f(\mathbf{x}^*)$ and the algorithm converges. Thus, we aim to design an optimization algorithm that has sublinear regret.
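As a concrete illustration of these definitions, the following minimal Python sketch computes the instantaneous, cumulative, and average regret for a sequence of observed function values; the helper name and signature are hypothetical, not from the paper.

```python
import numpy as np

def regret_summary(f_values, f_opt):
    """Instantaneous regret r_t, cumulative regret R_T, and average
    regret R_T / T for a sequence of evaluated function values."""
    r = f_opt - np.asarray(f_values)      # r_t = f(x*) - f(x_t) >= 0
    R = np.cumsum(r)                      # R_T = sum of r_t up to T
    avg = R / np.arange(1, len(r) + 1)    # -> 0 for a no-regret algorithm
    return r, R, avg
```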

Regularity assumptions

Without further assumptions, it is impossible to design an algorithm that achieves sublinear regret on Eq. (1). In the worst case, $f$ could be discontinuous at every input in $\mathcal{D}$. To make the optimization problem in Eq. (1) tractable, we make regularity assumptions about $f$. In particular, we assume that the function $f$ has low complexity, as measured by the norm in a reproducing kernel Hilbert space (RKHS, Christmann and Steinwart (2008)). An RKHS $\mathcal{H}_k$ contains well-behaved functions of the form $f(\mathbf{x}) = \sum_i \alpha_i \, k(\mathbf{x}, \mathbf{x}_i)$, for given representer points $\mathbf{x}_i$ and weights $\alpha_i$ that decay sufficiently quickly. The kernel $k$ determines the roughness and size of the function space, and the induced RKHS norm $\|f\|_k$ measures the complexity of a function $f \in \mathcal{H}_k$ with respect to the kernel.

In the following, we assume that $f$ in Eq. (1) has bounded RKHS norm $\|f\|_{k_\theta} \le B$ with respect to a kernel $k_\theta$ that is parameterized by hyperparameters $\theta$. We write $\mathcal{H}_\theta$ for the corresponding RKHS. For known $B$ and $\theta$, no-regret BO algorithms for Eq. (1) are known, e.g., by Srinivas et al. (2012). In practice, this means that $B$ and $\theta$ need to be tuned. In this paper, we consider the case where $B$ and $\theta$ are unknown. We focus on stationary kernels, which measure similarity based on the distance of inputs, $k_\theta(\mathbf{x}, \mathbf{x}') = k_\theta(\mathbf{x} - \mathbf{x}')$. The most commonly used hyperparameters for these kernels are the lengthscales $\theta \in \mathbb{R}^d$, which scale the inputs to the kernel in order to account for different magnitudes and effects on the output value in the different components of $\mathbf{x}$. That is, we scale the difference $\mathbf{x} - \mathbf{x}'$ by the lengthscales $\theta$,

$$k_\theta(\mathbf{x}, \mathbf{x}') = k\big(\mathbf{r}(\mathbf{x}, \mathbf{x}')\big) \quad \text{with} \quad [\mathbf{r}(\mathbf{x}, \mathbf{x}')]_i = \frac{[\mathbf{x}]_i - [\mathbf{x}']_i}{[\theta]_i}, \tag{2}$$

where $[\theta]_i$ denotes the $i$th element of $\theta$. Typically, these kernels assign larger similarity scores to two inputs when the scaled distance between them is small. Another common hyperparameter is the prior variance of the kernel, a multiplicative constant that determines the magnitude of the kernel. We assume $k(\mathbf{x}, \mathbf{x}) = 1$ for all $\mathbf{x} \in \mathcal{D}$ without loss of generality, as any multiplicative scaling can be absorbed by the norm bound $B$.
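For concreteness, the following sketch implements a stationary kernel of the form in Eq. (2), instantiated as a squared exponential (Gaussian) kernel with per-dimension lengthscales and unit prior variance; the function name is ours, not the paper's.

```python
import numpy as np

def se_kernel(x1, x2, lengthscales):
    """Squared exponential kernel with per-dimension lengthscales:
    k(x, x') = exp(-0.5 * sum_i ((x_i - x'_i) / theta_i)^2).
    x1: (n, d), x2: (m, d), lengthscales: (d,). Returns an (n, m) matrix."""
    diff = (x1[:, None, :] - x2[None, :, :]) / lengthscales
    return np.exp(-0.5 * np.sum(diff ** 2, axis=-1))
```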

In summary, the goal is to efficiently solve Eq. (1) via a BO algorithm with sublinear regret, where $f$ lies in some RKHS $\mathcal{H}_\theta$, but neither the hyperparameters $\theta$ nor the norm bound $B$ are known.

3 Background

In this section, we review Gaussian processes (GPs) and Bayesian optimization (BO).

3.1 Gaussian Processes (GP)

Based on the assumptions in Sec. 2, we can use GPs to infer confidence intervals on $f$. The goal of GP inference is to infer a posterior distribution over a nonlinear map $f \colon \mathcal{D} \to \mathbb{R}$ from an input vector $\mathbf{x} \in \mathcal{D}$ to the function value $f(\mathbf{x})$. This is accomplished by assuming that the function values $f(\mathbf{x})$, associated with different values of $\mathbf{x}$, are random variables and that any finite number of these random variables have a joint Gaussian distribution (Rasmussen and Williams, 2006). A GP distribution is parameterized by a prior mean function and a covariance function or kernel $k(\mathbf{x}, \mathbf{x}')$, which defines the covariance of any two function values $f(\mathbf{x})$ and $f(\mathbf{x}')$ for $\mathbf{x}, \mathbf{x}' \in \mathcal{D}$. In this work, the mean is assumed to be zero, without loss of generality. The choice of kernel function is problem-dependent and encodes assumptions about the unknown function.

We can condition a GP on a set of $t$ past observations $\mathbf{y}_t = (y_1, \dots, y_t)$ at inputs $\mathcal{A}_t = \{\mathbf{x}_1, \dots, \mathbf{x}_t\}$ in order to obtain a posterior distribution on $f(\mathbf{x})$ for any input $\mathbf{x} \in \mathcal{D}$. The GP model assumes that observations are noisy measurements of the true function value, $y_t = f(\mathbf{x}_t) + \epsilon_t$, where $\epsilon_t \sim \mathcal{N}(0, \sigma^2)$. The posterior distribution is again a GP with mean $\mu_t$, covariance $k_t$, and variance $\sigma_t^2$,

$$\mu_t(\mathbf{x}) = \mathbf{k}_t(\mathbf{x})^\top (\mathbf{K}_t + \sigma^2 \mathbf{I}_t)^{-1} \mathbf{y}_t, \tag{3}$$
$$k_t(\mathbf{x}, \mathbf{x}') = k(\mathbf{x}, \mathbf{x}') - \mathbf{k}_t(\mathbf{x})^\top (\mathbf{K}_t + \sigma^2 \mathbf{I}_t)^{-1} \mathbf{k}_t(\mathbf{x}'), \tag{4}$$
$$\sigma_t^2(\mathbf{x}) = k_t(\mathbf{x}, \mathbf{x}), \tag{5}$$

where the covariance matrix $\mathbf{K}_t \in \mathbb{R}^{t \times t}$ has entries $[\mathbf{K}_t]_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$, $i, j \in \{1, \dots, t\}$, and the vector $\mathbf{k}_t(\mathbf{x}) = \big(k(\mathbf{x}, \mathbf{x}_1), \dots, k(\mathbf{x}, \mathbf{x}_t)\big)$ contains the covariances between the input $\mathbf{x}$ and the observed data points in $\mathcal{A}_t$. The identity matrix is denoted by $\mathbf{I}_t \in \mathbb{R}^{t \times t}$.
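A minimal numerical sketch of the posterior equations (3)-(5), assuming the se_kernel helper from the earlier sketch and using a Cholesky factorization for numerical stability; names and structure are ours, not the paper's.

```python
import numpy as np

def gp_posterior(X, y, X_star, lengthscales, noise_std):
    """Posterior mean and standard deviation at test inputs X_star,
    following Eqs. (3)-(5); zero prior mean and unit prior variance."""
    K = se_kernel(X, X, lengthscales)             # K_t
    K_s = se_kernel(X_star, X, lengthscales)      # k_t(x) for each test x
    L = np.linalg.cholesky(K + noise_std ** 2 * np.eye(len(X)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mu = K_s @ alpha                              # Eq. (3)
    v = np.linalg.solve(L, K_s.T)
    var = 1.0 - np.sum(v ** 2, axis=0)            # Eqs. (4)-(5), k(x, x) = 1
    return mu, np.sqrt(np.maximum(var, 1e-12))
```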

3.2 Learning RKHS functions with GPs

The GP framework uses a statistical model that makes different assumptions from the ones made about $f$ in Sec. 2. In particular, we assume a different noise model, and samples from a GP are rougher than RKHS functions and are not contained in $\mathcal{H}_\theta$. Surprisingly, it is still possible to use GP models to infer reliable confidence intervals on $f$ in Eq. (1).

Lemma 1 (Abbasi-Yadkori (2012); Chowdhury and Gopalan (2017))

Assume that $f$ has bounded RKHS norm $\|f\|_k \le B$ and that measurements are corrupted by $\sigma$-sub-Gaussian noise. If $\beta_t^{1/2} = B + 4\sigma \sqrt{\mathrm{I}(\mathbf{y}_{t-1}; f) + 1 + \ln(1/\delta)}$, then for all $\mathbf{x} \in \mathcal{D}$ and $t \ge 1$ it holds jointly with probability at least $1 - \delta$ that $|f(\mathbf{x}) - \mu_{t-1}(\mathbf{x})| \le \beta_t^{1/2} \sigma_{t-1}(\mathbf{x})$.

Lemma 1 implies that, with high probability, the true function $f$ is contained in the confidence intervals induced by the posterior GP distribution that uses the kernel $k$ from Lemma 1 as a covariance function, scaled by an appropriate factor $\beta_t^{1/2}$. Here, $\mathrm{I}(\mathbf{y}_t; f)$ denotes the mutual information between the GP prior on $f$ and the $t$ measurements $\mathbf{y}_t$. Intriguingly, for GP models this quantity only depends on the inputs $\mathcal{A}_t$, not the corresponding measurements $\mathbf{y}_t$. Specifically, for a given set of measurements $\mathbf{y}_t$ at inputs $\mathcal{A}_t$, the mutual information is given by

$$\mathrm{I}(\mathbf{y}_t; f) = \tfrac{1}{2} \log \det(\mathbf{I}_t + \sigma^{-2} \mathbf{K}_t), \tag{6}$$

where $\mathbf{K}_t$ is the kernel matrix with $[\mathbf{K}_t]_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$. Intuitively, the mutual information measures how informative the collected samples $\mathbf{y}_t$ are about the function $f$. If the function values are independent of each other under the GP prior, they will provide large amounts of new information. However, if measurements are taken close to each other as measured by the kernel, they are correlated under the GP prior and provide less information.
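Equation (6) is straightforward to evaluate numerically; the sketch below, assuming the se_kernel helper from above, uses a log-determinant for stability.

```python
import numpy as np

def mutual_information(X, lengthscales, noise_std):
    """I(y_t; f) = 0.5 * log det(I + sigma^-2 K_t) from Eq. (6).
    Note that only the inputs X enter, not the measured values."""
    K = se_kernel(X, X, lengthscales)
    _, logdet = np.linalg.slogdet(np.eye(len(X)) + K / noise_std ** 2)
    return 0.5 * logdet
```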

3.3 Bayesian Optimization (BO)

BO aims to find the global maximum of an unknown function (Mockus, 2012). The framework assumes that evaluating the function is expensive in terms of time required or monetary costs, while other computational resources are comparatively inexpensive. In general, BO methods model the objective function $f$ with a statistical model and use it to determine informative sample locations. A popular approach is to model the underlying function with a GP, see Sec. 3.1. GP-based methods use the posterior mean and variance predictions in Eqs. (3)-(5) to compute the next sample location.

One popular algorithm is the GP-UCB algorithm by Srinivas et al. (2012). It uses confidence intervals on the function $f$, e.g., from Lemma 1, in order to select the next inputs to evaluate:

$$\mathbf{x}_t = \operatorname*{argmax}_{\mathbf{x} \in \mathcal{D}} \; \mu_{t-1}(\mathbf{x}) + \beta_t^{1/2} \sigma_{t-1}(\mathbf{x}). \tag{7}$$

Intuitively, Eq. (7) selects new evaluation points at locations where the upper bound of the confidence interval of the GP estimate is maximal. Repeatedly evaluating the function $f$ at inputs given by Eq. (7) improves the mean estimate of the underlying function and decreases the uncertainty at candidate locations for the maximum, so that the global maximum is provably found eventually (Srinivas et al., 2012). While Eq. (7) is itself an optimization problem, it only depends on the GP model of $f$, so that solving it does not require any expensive evaluations of $f$.
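In practice, Eq. (7) is often maximized approximately over a finite candidate set; a minimal sketch, reusing the gp_posterior helper from above (the square-root convention follows the common $\mu + \beta^{1/2}\sigma$ form):

```python
import numpy as np

def gp_ucb_next_input(X, y, candidates, lengthscales, noise_std, beta):
    """Approximately maximize the UCB criterion of Eq. (7) over a finite
    candidate set (e.g., a grid or random samples from the domain)."""
    mu, sigma = gp_posterior(X, y, candidates, lengthscales, noise_std)
    return candidates[np.argmax(mu + np.sqrt(beta) * sigma)]
```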

Regret bounds

Srinivas et al. (2012) show that the GP-UCB algorithm has cumulative regret $R_T = \mathcal{O}(\beta_T^{1/2} \sqrt{T \gamma_T})$ for all $T \ge 1$, with the same $1 - \delta$ probability with which the confidence intervals hold, e.g., those in Lemma 1. Here, $\gamma_T$ is the largest amount of mutual information that could be obtained by any algorithm from at most $T$ measurements,

$$\gamma_T = \max_{\mathcal{A} \subset \mathcal{D},\, |\mathcal{A}| \le T} \mathrm{I}(\mathbf{y}_{\mathcal{A}}; f). \tag{8}$$

We refer to $\gamma_T$ as the information capacity, since it can be interpreted as a measure of complexity of the function class associated with a GP prior. It was shown by Srinivas et al. (2012) that $\gamma_T$ has a sublinear dependence on $T$ for many commonly used kernels, such as the Gaussian kernel. As a result, $R_T$ also has a sublinear dependence on $T$, so that $R_T / T \to 0$ and we can conclude that GP-UCB converges to the optimal inputs. These regret bounds were extended to Thompson sampling, an algorithm that uses samples from the posterior GP as the acquisition function, by Chowdhury and Gopalan (2017).
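Computing $\gamma_T$ exactly is intractable, but because the mutual information in Eq. (6) is submodular, a greedy scheme gives a near-optimal estimate (Srinivas et al., 2012). A small illustrative sketch, assuming the mutual_information helper from above:

```python
import numpy as np

def greedy_info_capacity(candidates, T, lengthscales, noise_std):
    """Greedily approximate the information capacity gamma_T of Eq. (8)
    over a finite candidate set; O(T * N) mutual-information evaluations."""
    chosen = []
    for _ in range(T):
        gains = [mutual_information(candidates[chosen + [i]],
                                    lengthscales, noise_std)
                 for i in range(len(candidates))]
        chosen.append(int(np.argmax(gains)))
    return mutual_information(candidates[chosen], lengthscales, noise_std)
```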

Online hyperparameter estimation

(a) Sample from GP prior.
(b) GP estimate (RKHS).
(c) Lengthscale distribution.
Figure 1: A sample from the GP prior in Fig. 1(a) typically varies at a consistent rate over the input space. However, RKHS functions with the same kernel may be less consistent and can have bumps, as in Fig. 1(b) (gray). As a result, inferring the posterior lengthscales based on measurements (blue crosses in Fig. 1(b)) can lead to erroneous results. In Fig. 1(c), most of the probability mass of the posterior lengthscales has concentrated around large lengthscales that encode smooth functions. Consequently, the GP’s confidence intervals in Fig. 1(b) (blue shaded), based on the posterior samples, do not contain the true function.

In the previous section, we have seen that the GP-UCB algorithm provably converges. However, it requires access to an RKHS norm bound $B$ under the correct kernel hyperparameters $\theta$ in order to construct reliable confidence intervals using Lemma 1. In practice, these are unknown and have to be estimated online, based on a prior distribution placed on $\theta$. Unfortunately, it is well known that online estimation of the hyperparameters, be it via maximum a posteriori (MAP) or sampling methods, does not always converge to the optimum (Bull, 2011). The problem does not primarily lie with the inference scheme, but rather with the assumptions made by the GP. In particular, typical samples drawn from a GP with a stationary kernel tend to have a similar rate of change throughout the input space, see Fig. 1(a). In contrast, the functions inside the RKHS, as specified in Sec. 2, can have different rates of change and are thus improbable under the GP prior. For example, the gray function in Fig. 1(b) is linear but has one bump that defines the global maximum, which makes this function an improbable sample under the GP prior even though it uses the same kernel. This property of GPs with stationary kernels means that, for inference, it is sufficient to estimate the lengthscales in a small part of the state space in order to make statements about the function space globally. This is illustrated in Fig. 1(c), where we show samples from the posterior distribution over the lengthscales based on the measurements obtained by the GP-UCB algorithm in Fig. 1(b) (blue crosses). Even though the prior distribution on the lengthscales is suggestive of short lengthscales, most of the posterior probability mass is concentrated around lengthscales that are significantly larger than the true ones. As a result, even under model averaging over the samples from the posterior distribution, the confidence intervals of the posterior distribution do not contain the true function in Fig. 1(b). This is not a problem of the inference method applied, but rather a direct consequence of the probabilistic model that we have specified based on the stationary kernel, which does not consider functions with different rates of change to be likely.

4 The Adaptive Gp-Ucb Algorithm

In this section, we extend the GP-UCB algorithm to the case when neither the norm bound $B$ nor the lengthscales $\theta$ are known. In this case, it is always possible that the global optimum is defined by a local bump based on a kernel with small lengthscales, which has not been encountered by the data points, as in Fig. 1(b). The only way to avoid this problem is to keep exploring so as to eventually cover the space (Bull, 2011). We propose to expand the function space associated with the hyperparameters slowly over time, so that we obtain sublinear regret once the true function class has been identified. Intuitively, this can help BO algorithms avoid premature convergence to local optima caused by misspecified hyperparameters $\theta$ and $B$. For example, in Fig. 2(a), the GP-UCB algorithm has converged to a local maximum. By decreasing the lengthscales, we increase the underlying function class, which means that the GP confidence intervals on the function widen. This enables GP-UCB to explore further, so that the global optimum is found in Fig. 2(c).

(a) Stuck in local optimum.
(b) Expanding the function class.
(c) Global optimum found.
Figure 2: BO algorithms get stuck in local optima when the hyperparameters of the model are misspecified. In Fig. 2(a), the true function is not contained within the GP’s confidence intervals (blue shaded), so that GP-UCB only collects data at the local optimum on the right (green arrow), see also Fig. 1. Our method expands the function class over time by scaling the hyperparameters, which encourages additional exploration in Fig. 2(b). The function class grows slowly enough that the global optimum is provably found in Fig. 2(c).

Specifically, we start with initial guesses $\theta_0$ and $B_0$ for the lengthscales and the norm bound on $f$, respectively. Over the iterations, we scale down the lengthscales and scale up the norm bound,

$$\theta_t = \frac{\theta_0}{g(t)}, \qquad B_t = b(t)\, g(t)^{d/2} B_0, \tag{9}$$

where $g(t) \ge 1$ and $b(t) \ge 1$ are monotonically increasing functions that can, additionally, depend on $\mathcal{D}_t$, the data up to iteration $t$. As $g(t)$ increases, the lengthscales $\theta_t$ of the kernel become shorter, which enlarges the underlying function space:

Lemma 2

(Bull, 2011, Lemma 4) If $0 < \theta' \le \theta$ elementwise, then $\mathcal{H}_\theta \subseteq \mathcal{H}_{\theta'}$, and for all $f \in \mathcal{H}_\theta$,

$$\|f\|_{k_{\theta'}} \le \left( \prod_{i=1}^d \frac{[\theta]_i}{[\theta']_i} \right)^{1/2} \|f\|_{k_\theta}. \tag{10}$$

Lemma 2 states that when we decrease the lengthscales $\theta$, the resulting function space contains the previous one. Thus, as $g(t)$ increases, we consider larger RKHSs as candidate spaces for the function $f$. In addition, as we increase $b(t)$, we consider larger norm balls within the function space $\mathcal{H}_{\theta_t}$, which corresponds to more complex functions. However, it follows from Eq. (10) that, as we increase $g(t)$, we also increase the norm of any existing function in $\mathcal{H}_{\theta_0}$ by at most a factor of $g(t)^{d/2}$. This is illustrated in Fig. 3(a): as we scale up the norm ball by $b(t)$, we capture $f$ under the initial lengthscales $\theta_0$. However, by shortening the lengthscales by $g(t)$, the function $f$ has a larger norm in the new function space $\mathcal{H}_{\theta_t}$. We account for this through the additional scaling factor $g(t)^{d/2}$ in the norm bound $B_t$ in Eq. (9).
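The scaling in Eq. (9) is mechanical to implement; a small sketch, where the $g(t)^{d/2}$ compensation factor follows our reading of Lemma 2 and the example schedules are illustrative assumptions, not the paper's choices:

```python
import math

def scaled_hyperparameters(t, theta_0, B_0, g, b, d):
    """Hyperparameter schedule of Eq. (9): shrink the lengthscales by
    g(t) and enlarge the norm bound by b(t) times the Lemma-2 factor."""
    theta_t = theta_0 / g(t)
    B_t = b(t) * g(t) ** (d / 2) * B_0
    return theta_t, B_t

# Example schedules: monotonically increasing and unbounded, but slow.
g = lambda t: 1.0 + math.log(1.0 + t)
b = lambda t: 1.0 + math.log(1.0 + t)
```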

(a) Scaling of the norm bound.
(b) Cumulative regret with scaling.
Figure 3: The function in Fig. 3(a) has RKHS norm $B$. To account for this, we expand the norm balls by $b(t)$ over time. By scaling down the lengthscales by $g(t)$, the norm of $f$ in the resulting RKHS is larger, see Lemma 2. We account for this when defining the norm ball $B_t$. As we scale the lengthscales and norm bound, we effectively increase the cumulative regret in Fig. 3(b). Nevertheless, even if we choose $g(t)$ and $b(t)$ adaptively, as long as the cumulative regret is upper bounded by a sublinear function, the A-GP-UCB algorithm ultimately converges to the global optimum.

Theoretical analysis

Based on the previous derivations together with Lemma 2, it is clear that, if $g(t)$ and $b(t)$ are monotonically increasing functions and at some iteration $t_0$ we have $f \in \mathcal{H}_{\theta_{t_0}}$ with $\|f\|_{k_{\theta_{t_0}}} \le B_{t_0}$, then $f \in \mathcal{H}_{\theta_t}$ with $\|f\|_{k_{\theta_t}} \le B_t$ for all $t \ge t_0$. That is, once the function $f$ is contained within the norm ball $B_t$ for the lengthscales $\theta_t$, then, for any further increase in $g(t)$ or $b(t)$, the function $f$ is still contained in the candidate space. Based on this insight, we propose A-GP-UCB in Algorithm 1. At iteration $t$, A-GP-UCB sets the GP lengthscales to $\theta_t$ and selects new inputs $\mathbf{x}_t$ similar to the GP-UCB algorithm, but based on the norm bound $B_t$. We extend the analysis of GP-UCB and Lemma 1 to obtain our main result. Assume that $f$ has bounded RKHS norm $\|f\|_{k_\theta} \le B$ in an RKHS that is parametrized by a stationary kernel with unknown lengthscales $\theta$. Based on initial guesses $\theta_0$ and $B_0$, define monotonically increasing functions $g(t)$ and $b(t)$, and run A-GP-UCB with the norm bound $B_t$ and GP lengthscales $\theta_t$ from Eq. (9). Then, with probability at least $1 - \delta$, we obtain a regret bound of

$$R_T \le 2 B\, t^* + \sqrt{C_1\, T\, \beta_T\, \mathrm{I}_{\theta_T}(\mathbf{y}_T; f)}, \tag{11}$$

where $t^*$ is the first iteration at which $f$ is contained in the scaled function class, $\mathrm{I}_{\theta_t}$ is the mutual information in Eq. (6) based on the GP model with lengthscales $\theta_t$, $\beta_t^{1/2} = B_t + 4\sigma \sqrt{\mathrm{I}_{\theta_t}(\mathbf{y}_{t-1}; f) + 1 + \ln(1/\delta)}$, and $C_1 = 8 / \log(1 + \sigma^{-2})$.

The proof is given in the appendix. Intuitively, the regret bound in Eq. (11) splits the run of the algorithm into two distinct phases. In the first one, either the RKHS $\mathcal{H}_{\theta_t}$ or the norm bound $B_t$ is too small to contain the true function $f$. Thus, the GP confidence intervals scaled by $\beta_t^{1/2}$ do not necessarily contain the true function $f$, as in Fig. 1(b). In these iterations, we obtain constant regret that is bounded by $2B$, since $\|f\|_\infty \le \|f\|_{k_\theta} \le B$. After both $g(t)$ and $b(t)$ have grown sufficiently for the considered function space to contain the true function, the confidence bounds are reliable and we can apply the theoretical results of the GP-UCB algorithm. This second phase is illustrated in Fig. 3(b): nominal GP-UCB incurs sublinear regret for fixed hyperparameters. While we increase the cumulative regret bound by increasing $g(t)$ and $b(t)$, as long as the overall regret remains bounded by a sublinear function, the overall regret bound of the second stage remains sublinear. While Eq. (11) assumes that the noise properties are known, the results can be extended to estimate the noise, similar to Durand et al. (2018). The regret bound in Eq. (11) depends on the true hyperparameters $\theta$ and $B$; however, the algorithm itself does not. Moreover, Eq. (11) gives an instance-specific bound, since the mutual information depends on the inputs in $\mathcal{D}_t$. One can obtain a worst-case upper bound by bounding $\mathrm{I}_{\theta_T}(\mathbf{y}_T; f) \le \gamma_T(\theta_T)$, the worst-case mutual information as in Eq. (8), but based on the GP model with lengthscales $\theta_T$.

1:   Input: Input space $\mathcal{D}$, initial guesses $\theta_0$ and $B_0$, functions $g(t)$ and $b(t)$
2:   Set the initial lengthscales to $\theta_0$ and the initial norm bound to $B_0$
3:   for all $t = 1, 2, \dots$ do
4:      Set the GP kernel lengthscales to $\theta_t = \theta_0 / g(t)$
5:      Set $\beta_t^{1/2} = B_t + 4\sigma \sqrt{\mathrm{I}_{\theta_t}(\mathbf{y}_{t-1}; f) + 1 + \ln(1/\delta)}$ with $B_t = b(t)\, g(t)^{d/2} B_0$
6:      Choose $\mathbf{x}_t = \operatorname{argmax}_{\mathbf{x} \in \mathcal{D}}\, \mu_{t-1}(\mathbf{x}) + \beta_t^{1/2} \sigma_{t-1}(\mathbf{x})$
7:      Evaluate $y_t = f(\mathbf{x}_t) + \epsilon_t$
8:      Perform a Bayesian update to obtain $\mu_t$ and $\sigma_t$
Algorithm 1 Adaptive GP-UCB (A-GP-UCB)
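For illustration, here is an end-to-end sketch of Algorithm 1 over a finite candidate set, wiring together the scaled_hyperparameters, mutual_information, and gp_ucb_next_input helpers sketched above; the $\beta_t$ constants follow the Lemma 1 form and, like the random fallback for the first input, are assumptions.

```python
import numpy as np

def a_gp_ucb(f, candidates, theta_0, B_0, g, b, noise_std, delta, T, d):
    """Sketch of A-GP-UCB (Algorithm 1) with a finite candidate set."""
    X, y = np.empty((0, d)), np.empty(0)
    for t in range(1, T + 1):
        theta_t, B_t = scaled_hyperparameters(t, theta_0, B_0, g, b, d)
        info = mutual_information(X, theta_t, noise_std) if len(X) else 0.0
        beta = (B_t + 4 * noise_std
                * np.sqrt(info + 1 + np.log(1 / delta))) ** 2
        if len(X):
            x_t = gp_ucb_next_input(X, y, candidates, theta_t,
                                    noise_std, beta)
        else:
            x_t = candidates[np.random.randint(len(candidates))]
        X = np.vstack([X, x_t])
        y = np.append(y, f(x_t))    # noisy evaluation of the objective
    return X, y
```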

For arbitrary functions $g(t)$ and $b(t)$, the candidate function space can grow faster than we gather information about it by selecting informative measurements according to Eq. (7). In particular, in the regret term $\sqrt{C_1 T \beta_T \gamma_T}$, both $\beta_T$ and $\gamma_T$ depend on the scaling factors $g(t)$ and $b(t)$. If these factors grow too quickly, the resulting algorithm does not enjoy sublinear regret. Effectively, $\gamma_T$ is a measure of the size and complexity of the underlying function space. We have the following result that explicitly states the dependence of $\gamma_T$ on the scaling factor $g(t)$. Let $k_\theta$ be a stationary kernel parameterized by lengthscales $\theta$ as in Eq. (2) and define $\gamma_T(\theta)$ for lengthscales $\theta$ as in Eq. (8). Define the lengthscales as $\theta_t = \theta_0 / g(t)$ as in Eq. (9).

  • Let $k$ be a squared exponential (Gaussian) kernel. Then

    $$\gamma_T\big(\theta_0 / g(T)\big) = \mathcal{O}\big(g(T)^d \, (\log T)^{d+1}\big). \tag{12}$$
  • Let $k$ be a Matérn kernel with smoothness parameter $\nu$, defined in terms of the modified Bessel function $K_\nu$ and the gamma function $\Gamma$. Then

    (13)

Proposition 4 explicitly states the relationship between $\gamma_T$ and $g(t)$. For the Gaussian kernel, if we scale down the lengthscales by a factor of two, the amount of mutual information that we can gather in the worst case, $\gamma_T$, grows by a factor of $2^d$. Knowing the dependence of $\gamma_T$ on $g(t)$, we can refine Eq. (11) to obtain concrete regret bounds for two commonly used kernels. If, under the assumptions of Eq. (11), $g(t)$ and $b(t)$ grow without bound, then we obtain the following high-probability regret bounds for Algorithm 1:

  • Squared exponential kernel: ;

  • Matérn kernel: .

If $g(t)$ and $b(t)$ grow without bound, the first term of the cumulative regret in Eq. (11) can be upper bounded by a constant. The remaining result is obtained by plugging in $\beta_T$ and the bounds from Eq. (8). Thus, any functions $g$ and $b$ that render the regret bounds in Corollary 4 sublinear allow the algorithm to converge, even though the true lengthscales and norm bound are unknown.

The specific choices of $g(t)$ and $b(t)$ matter for the regret bound in Eq. (11) in practice. Consider the one-dimensional case $d = 1$ for the Gaussian kernel. Given the true hyperparameters $\theta$ and $B$, if we set $g(t)$ and $b(t)$ to be constant, we recover the non-adaptive regret bounds of GP-UCB with known hyperparameters. If $g(t)$ depends on $t$ and grows slowly, then the algorithm incurs constant regret during the initial rounds, while functions that grow to values larger than the true ones lead to additional exploration and incur an additional multiplicative factor in the cumulative regret in later rounds, c.f. Corollary 4. In the following section, we discuss appropriate choices for these functions in practice.

4.1 Choosing the scaling functions $g(t)$ and $b(t)$

It follows from Eq. (11) that A-GP-UCB achieves no regret for any functions $g(t)$ and $b(t)$ that increase without bound and render Eq. (11) sublinear in $T$. Thus, the corresponding BO routine converges to the optimal value eventually. For example, slowly growing functions such as $g(t) = b(t) = \log t$ satisfy this condition. However, the convergence guarantees only hold asymptotically as $t \to \infty$. In practice, BO is often used with objective functions $f$ that are expensive to evaluate, which imposes a hard constraint on the number of evaluations. For regret bounds to be meaningful in this setting, we must choose functions $g$ and $b$ that grow fast enough to ensure that the constant-regret period in Eq. (11) is short, yet slowly enough that the effect of the sublinear regret is visible at realistic values of $T$. In the following, we propose two methods to choose $g$ and $b$ adaptively, based on the observations seen so far.

For convenience, we fix the relative magnitude of $g(t)$ and $b(t)$: we define both in terms of one overall scaling factor, together with a weighting that encodes whether we prefer to scale up the norm bound using $b(t)$ or to decrease the lengthscales using $g(t)$. This enables us to reason about the overall magnitude of the scaling, which can be uniquely decomposed into $g(t)$ and $b(t)$ given the weighting. At one extreme of the weighting, the algorithm prefers to attribute an increase in the overall scaling to $g(t)$ and shortens the lengthscales, while at the other extreme it prefers to scale up the RKHS norm. The assumptions in Corollary 4 hold for any fixed weighting if the overall scaling grows without bound. Moreover, if the overall scaling is sublinear, then so are $g(t)$ and $b(t)$.

Reference regret

While any sublinear scaling that grows without bound leads to sublinear regret asymptotically, we want to ensure that our method performs well in finite time too. For fixed hyperparameters with $g(t) = b(t) = 1$, our algorithm reduces to GP-UCB and the regret bound in Eq. (11) is sublinear, which is illustrated by the bottom curve in Fig. 3(b). However, this does not imply no regret if the hyperparameters are misspecified, as in Fig. 2(a), since the first term in Eq. (11) is unbounded in these cases. To avoid this, we must increase the overall scaling to consider larger function classes.

We propose to define a sublinear reference regret $\bar{R}(t)$, see Fig. 3(b), and to choose the scaling so that an estimate of the regret with respect to the current hyperparameters matches this reference. As GP-UCB converges, the regret estimate with respect to the current hyperparameters levels off and drops below the reference $\bar{R}(t)$. In these cases, we increase the scaling to consider larger function classes and explore further. The choice of $\bar{R}(t)$ thus directly specifies the amount of additional regret one is willing to incur for the additional exploration. Specifically, given a regret estimate $\hat{R}$ that depends on the data collected so far and on the selected scaling, we obtain the scaling by matching the reference, as

$$g(t) = \max \big\{\, s \ge g(t-1) \;:\; \hat{R}(t, s) \le \bar{R}(t) \,\big\}. \tag{14}$$

Here we explicitly enforce that the scaling must be increasing over time. In the following, we consider estimators $\hat{R}$ that are increasing functions of the scaling, so that Eq. (14) can be solved efficiently via a line search.
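Since $\hat{R}$ is increasing in the scaling, Eq. (14) can be solved by bisection; a minimal sketch with hypothetical names, where regret_estimate is any increasing estimator of the cumulative regret:

```python
def match_reference_regret(regret_estimate, reference, g_prev,
                           g_max=1e3, tol=1e-3):
    """Bisection for Eq. (14): the largest scaling g >= g_prev whose
    estimated regret does not exceed the reference regret."""
    lo, hi = g_prev, g_max
    if regret_estimate(lo) >= reference:
        return lo                     # cannot expand without overshooting
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if regret_estimate(mid) <= reference:
            lo = mid
        else:
            hi = mid
    return lo
```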

Whether choosing the scaling according to Eq. (14) leads to a sublinear function depends on the regret estimator $\hat{R}$. However, it is always possible to upper bound the scaling obtained from Eq. (14) by a fixed sublinear function, which guarantees sublinear regret eventually. In the following, we consider two estimators that upper bound the cumulative regret experienced so far with respect to the hyperparameters suggested by the scaling.

Regret bound

As a first estimator for the cumulative regret, we consider the regret bound in Eq. (11). We focus on the Gaussian kernel, but the arguments transfer directly to the case of the Matérn kernel. The term $\sqrt{C_1 t \beta_t \gamma_t}$ bounds the regret with respect to the current function class specified by the scaling. In addition to the direct dependence of $\beta_t$ on the scaling, the regret bound also depends on it implicitly through the mutual information, which is computed with lengthscales $\theta_t = \theta_0 / g(t)$. To make the dependence on $g$ more explicit, we use Proposition 4 (Sec. 4) and rewrite the mutual information as $g^d\, \mathrm{I}_{\theta_0}(\mathbf{y}_t; f)$ instead. Note that this scaling factor was derived for the worst-case mutual information, but it remains a good indicator of the increase in mutual information. With this replacement we use

$$\hat{R}(t, g) = \beta_t^{1/2}(g) \sqrt{C_1\, t\, g^d\, \mathrm{I}_{\theta_0}(\mathbf{y}_t; f)} \tag{15}$$

to estimate the regret, where the term $\beta_t^{1/2}(g)$ is as in Eq. (11), but with the mutual information similarly replaced by its explicit dependence on $g$. Solving Eq. (14) together with Eq. (15) is fast, since computing $\hat{R}$ does not require inverting the kernel matrix.
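An illustrative implementation of an estimator in the spirit of Eq. (15); the $\beta$ form follows the Lemma 1 convention used above, and the constants should be treated as assumptions:

```python
import numpy as np

def regret_bound_estimate(g, t, info_0, B_0, b_t, noise_std, delta, d):
    """Regret-bound estimator in the spirit of Eq. (15) for the Gaussian
    kernel: the mutual information info_0 (computed once with the initial
    lengthscales theta_0) is scaled explicitly by g**d, so no kernel-matrix
    inversion is needed inside the line search."""
    info = g ** d * info_0                        # explicit g-dependence
    B = b_t * g ** (d / 2) * B_0                  # scaled norm bound
    beta_sqrt = B + 4 * noise_std * np.sqrt(info + 1 + np.log(1 / delta))
    C1 = 8 / np.log(1 + noise_std ** -2)          # constant from GP-UCB
    return beta_sqrt * np.sqrt(C1 * t * info)
```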

One step predictions

While Eq. (15) is fast to compute, it requires knowing how the mutual information depends on $g$. Deriving such analytic bounds can be infeasible for many kernels. As an alternative, we estimate the regret one step ahead directly. In particular, if the considered function class is sufficiently large to contain the true function, then the one-step-ahead cumulative regret at iteration $t$ is bounded from above by

$$\hat{R}(t, g) = \sum_{k=1}^{t-1} 2 \beta_k^{1/2} \sigma_{k-1}(\mathbf{x}_k) \;+\; 2 \beta_t^{1/2}(g)\, \sigma_{t-1}\big(\mathbf{x}_t(g)\big), \tag{16}$$

where each $\beta_k$ and $\sigma_{k-1}$ is based on the corresponding hyperparameters at iteration $k$. In the analysis, this quantity is further upper bounded by Eq. (11). The regret estimate in Eq. (16) depends on $\mathbf{x}_t(g)$, the next input that would be evaluated based on the UCB criterion with GP hyperparameters scaled according to $g$. As the hyperparameters for previous iterations are fixed, the only term that depends on $g$ is the bound on the instantaneous regret, $2 \beta_t^{1/2}(g)\, \sigma_{t-1}(\mathbf{x}_t(g))$. Unlike Eq. (15), Eq. (16) is not able to exploit the known dependence of the mutual information on $g$, so it cannot reason about the long-term effects of changing $g$. This means that, empirically, the cumulative regret will often overshoot the reference regret, only to settle below it later.
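A sketch of the one-step estimator in Eq. (16), reusing the helpers above; the past instantaneous-regret bounds are passed in precomputed, and only the last term is recomputed for each candidate scaling g (assumes at least one observation has been collected):

```python
import numpy as np

def one_step_regret_estimate(g, X, y, candidates, past_bounds,
                             theta_0, B_0, b_t, noise_std, delta, d):
    """Eq. (16): cumulative regret estimate in which only the next UCB
    input x_t(g) depends on the candidate scaling g."""
    theta = theta_0 / g
    B = b_t * g ** (d / 2) * B_0
    info = mutual_information(X, theta, noise_std)
    beta = (B + 4 * noise_std
            * np.sqrt(info + 1 + np.log(1 / delta))) ** 2
    mu, sigma = gp_posterior(X, y, candidates, theta, noise_std)
    i_next = np.argmax(mu + np.sqrt(beta) * sigma)   # index of x_t(g)
    return sum(past_bounds) + 2 * np.sqrt(beta) * sigma[i_next]
```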

Scaling $g$ according to Eq. (16) provides an interesting perspective on the method by Wang and de Freitas (2014). They decrease the kernel lengthscales whenever the posterior standard deviation at the selected input drops below a threshold. In our framework, this corresponds to a reference regret that grows linearly in $t$, which is not sublinear. As a consequence, while they ultimately bound the cumulative regret using the smallest possible lengthscale, this algorithmic choice forces too much exploration to achieve sublinear regret before the lower bound is reached. In contrast, if we choose $\bar{R}(t)$ to be sublinear, then the function class grows slowly enough to ensure more careful exploration. This allows us to achieve sublinear regret even if a lower bound on the hyperparameters is not known.

4.2 Practical Considerations and Discussion

In this section, we discuss additional practical considerations and show how to combine the theoretical results with online inference of the hyperparameters.

Online inference and exploration strategies

The theoretical results presented in the previous sections extend to the case where the initial guess $\theta_0$ of the GP's lengthscales is improved online using any estimator, e.g., MAP estimation. Theoretically, as long as the change in the estimated lengthscales is bounded, the cumulative regret increases by at most a constant factor. In practice, this bound can always be enforced by truncating the estimated hyperparameters. Moreover, the scaling induced by online inference can be considered part of $g(t)$ according to Lemma 2, in which case the norm bound can be adapted accordingly. In practice, online inference improves performance drastically, as it is often difficult to specify an appropriate relative initial scaling of the lengthscales $\theta_0$.

In more than one dimension, $d > 1$, there are multiple ways in which MAP estimation can be combined with the theoretical results of the paper. The simplest one is to enforce an upper bound on the lengthscales based on $\theta_0 / g(t)$,

$$\theta_t = \min\big( \theta_t^{\mathrm{MAP}},\; \theta_0 / g(t) \big), \tag{17}$$

where the min is taken elementwise. This choice is similar to the one by Wang et al. (2016). If all entries of $\theta_t^{\mathrm{MAP}}$ have the same magnitude, this scaling can be understood as encouraging additional exploration in the smoothest direction of the input space first. This often makes sense, since MAP estimates tend to assume functions that are too smooth, see Fig. 1. However, it can be undesirable when the true function only depends on a subset of the inputs. In these cases, the MAP estimate would correctly eliminate these inputs from the input space by assigning long lengthscales, but the scaling in Eq. (17) would encourage additional exploration in these directions first. Note that eventually exploring the entire input space is unavoidable if we want to avoid getting stuck in local optima (Bull, 2011).

An alternative approach is to instead scale down the MAP estimate directly,

$$\theta_t = \theta_t^{\mathrm{MAP}} / g(t). \tag{18}$$

This scaling can be understood as evenly encouraging additional exploration in all directions. While Eq. (18) also explores in directions that have been eliminated by the MAP estimate, unlike Eq. (17) it simultaneously explores all directions relative to the MAP estimate. From a theoretical point of view, the choice of exploration strategy does not matter, as in the limit $t \to \infty$ all lengthscales approach zero. In the one-dimensional case, both strategies are equivalent. Both strategies use the MAP lengthscales for BO in the nominal case, but the $g(t)$ factor eventually scales down the lengthscales further. In practice, this ensures that our method only improves on the empirical performance of BO with MAP estimation.
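The two strategies in Eqs. (17) and (18) differ only in one line; a minimal sketch (the names are ours):

```python
import numpy as np

def combine_with_map(theta_map, theta_0, g_t):
    """Combine MAP lengthscales with the scaling g(t): Eq. (17) caps the
    MAP estimate elementwise, Eq. (18) scales it down directly."""
    capped = np.minimum(theta_map, theta_0 / g_t)   # Eq. (17)
    scaled = theta_map / g_t                        # Eq. (18)
    return capped, scaled
```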

In practice, maximum likelihood estimates for the hyperparameters are often good enough when the underlying function resembles a sample from a GP. Thus, the approach presented in this paper is most relevant when the underlying function has some nonstationarity. In the literature, other approaches to deal with nonstationarity have been proposed. For example, Snoek et al. (2013) warp the inputs through a beta function and infer its hyperparameters online. Our approach can easily be combined with any such method, as it works on top of any estimate provided by the underlying inference scheme. Moreover, in high-dimensional spaces one can combine our algorithm with methods to automatically identify a low-dimensional subspace of $\mathcal{D}$ (Djolonga et al., 2013; Wang et al., 2016).

In this paper, we have considered the kernel structure to be fixed and only adapted the lengthscales and norm bound. However, the kernel structure itself is often a critical tuning choice (Duvenaud et al., 2011). The same strategy as presented in this paper could be used to add rougher kernels over time or, for example, to adapt the smoothness parameter $\nu$ of the Matérn kernel, which determines its roughness.

Confidence intervals

Empirically, $\beta_t^{1/2}$ is often set to a constant rather than using the theoretical bounds in Lemma 1. In particular, measurement data is typically standardized to be zero mean and unit variance, and $\beta_t^{1/2}$ is set to two or three. This often works well in practice, but does not provide any guarantees. However, if one were to believe the resulting confidence bounds, our method can be used to avoid getting stuck in local optima, too. In this case, one can use the constant in place of the theoretical value of $\beta_t^{1/2}$ and apply our method as before.

General discussion

Knowing how the sample complexity of the underlying BO algorithm depends on the lengthscales also has implications for practical implementations of BO algorithms. For example, Wang et al. (2016) and Wabersich and Toussaint (2016) suggest scaling down the lengthscales by a fixed constant factor, although not at every iteration. As shown in Sec. 4, this scales the regret bound by a factor that grows exponentially in the number of dimensions. Exponentiating their scaling factors with $1/d$ is likely to make their approaches more robust as BO scales to higher-dimensional problems.

Lastly, in a comparison of multiple BO algorithms (acquisition functions) on a robotic platform, Calandra et al. (2014) conclude that the GP-UCB algorithm shows the best empirical performance for their problem. They implemented the theoretical version of the algorithm by Srinivas et al. (2012), in which $\beta_t$ grows with an additional multiplicative factor over time. In our framework with the bounds in Lemma 1, this is equivalent to scaling up the initial guess for the RKHS norm bound by the same factor at every iteration, which increases the function class considered by the algorithm over time. We conjecture that this increase of the function class was likely responsible for pushing the MAP estimate of the lengthscales out of local optima, which in turn led to better empirical performance.

5 Experiments

In this section, we evaluate our proposed method on several benchmark problems. As baselines, we consider algorithms based on the UCB acquisition function. We specify a strong gamma prior that encourages short lengthscales, and consider both maximum a posteriori (MAP) point estimates of the hyperparameters and a Hamiltonian Monte Carlo (HMC) approach that samples from the posterior distribution of the hyperparameters and marginalizes them out. Unless otherwise specified, the initial lengthscales, the initial norm bound, the confidence level of the bounds, and the tradeoff factor between $g(t)$ and $b(t)$ are fixed across experiments.

We follow several best practices in BO to ensure a fair comparison with the baselines. We rescale the input space $\mathcal{D}$ to the unit hypercube in order to ensure that both the initial lengthscales and the prior over lengthscales are reasonable for different problems. As is common in practice, the comparison baselines use the empirical confidence intervals suggested in Sec. 4.2, instead of the theoretical bounds in Lemma 1 that are used for our method. Lastly, we initialize all GPs with a small number of measurements collected uniformly at random within the domain $\mathcal{D}$. To measure performance, we use the cumulative regret that has been the main focus of this paper. In addition, we evaluate the different methods in terms of simple regret, $\min_{t' \le t} r_{t'}$, which is the regret of the best inputs evaluated so far. This metric is relevant when costs during experiments do not matter and BO is only used to determine close-to-optimal inputs.

5.1 Synthetic Experiments

Example function

(a) Simple regret.
(b) Cumulative regret.
(c) Scaling $g(t)$.
Figure 4: Mean and standard deviation of the empirical simple and cumulative regret over ten different random initializations for the function in Fig. 2. The HMC baseline (red) gets stuck in a local optimum and obtains constant regret in Fig. 4(a). GP-UCB with the true hyperparameters (gray dashed) obtains the lowest cumulative regret in Fig. 4(b). However, our methods (orange/blue) increase the function class over time, see Fig. 4(c), and thus obtain sublinear regret without knowing the true hyperparameters.

We first evaluate all proposed methods on the example function in Fig. 2, which lies inside the RKHS associated with a Gaussian kernel. We evaluate our proposed method with a sublinear reference regret $\bar{R}(t)$ together with maximum a posteriori hyperparameter estimation. We compare against both GP-UCB with the fixed, correct hyperparameters and HMC hyperparameter estimation. Additionally, we consider a modified variant of the method suggested by Wang and de Freitas (2014), see Sec. 4.1. Rather than scaling the lengthscales by a fixed constant, we conduct a line search to find the smallest possible scaling factor that triggers their lengthscale-decrease condition. This is the most conservative variant of their algorithm. Note that we do not know a lower bound on the hyperparameters and therefore do not enforce one.

The results of the experiments are shown in Fig. 4. The simple regret plot in Fig. 4(a) shows that all methods based on hyperparameter adaptation evaluate close-to-optimal inputs eventually, and do so almost as quickly as GP-UCB based on the true hyperparameters (gray dashed). However, the method based on HMC hyperparameter estimation (red) considers functions that are too smooth and gets stuck in a local optimum, as in Fig. 2. This can also be seen in Fig. 4(c), which plots the effective scaling $g(t)$ based on the combination of Bayesian hyperparameter estimation and hyperparameter adaptation. The HMC hyperparameters consistently overestimate the lengthscales by a factor of roughly two. In contrast, while MAP estimation leads to the wrong hyperparameters initially, the adaptation methods in Eqs. (15) and (16) slowly increase the function class until the true lengthscales are found eventually. It can be seen that the one-step estimate in Eq. (16) (orange) is noisier than the upper bound in Eq. (15) (blue).

(a) Simple regret.
(b) Cumulative regret.
Figure 5: Simple and cumulative regret over 10 random seeds for samples from a GP with bounded RKHS norm. The GP-UCB algorithm with misspecified hyperparameters (magenta) fails to converge given only a wrong choice of . In contrast, our methods (blue/orange) converge even though  is misspecified in addition.

While all adaptation methods determine good inputs quickly according to the simple regret, they perform differently in terms of the cumulative regret in Fig. 4(b). As expected, the HMC method (red line) converges to a local optimum and experiences constant regret equal to the simple regret at every time step. The modified method of Wang and de Freitas (2014) (green line) expands the function class too aggressively and also incurs constant regret. Empirically, their method always explores and never repeatedly evaluates close-to-optimal inputs that would decrease the cumulative regret. While the method works well in terms of simple regret, without a lower bound on the hyperparameters it never achieves sublinear regret. As expected from Eq. (11), GP-UCB based on the optimal hyperparameters achieves the lowest cumulative regret. Our two methods expand the function class over time, which allows them to converge to close-to-optimal inputs, even though MAP estimation gets the hyperparameters wrong initially. While the regret is sublinear, the additional exploration caused by the growing scaling means that the cumulative regret is larger. This is the additional cost we incur for not knowing the hyperparameters in advance.

Samples from a GP

As a second experiment, we compare GP-UCB to A-GP-UCB on samples drawn from a GP when the norm bound $B$ is misspecified. Samples from a GP are almost surely not contained in the RKHS. To avoid this technical issue, we sample function values from the posterior GP at only a finite number of discrete gridpoints and interpolate between them using the kernel with the correct lengthscales $\theta$. We rescale these functions to have a fixed RKHS norm, but provide both BO algorithms with an initial norm guess that is too small, and do not use any hyperparameter estimation. Even though we use the correct kernel lengthscales for GP-UCB, this discrepancy means that the true function is not contained in the initial confidence intervals. As before, for our method we use a sublinear reference regret and additionally misspecify the initial lengthscales $\theta_0$.

The results are shown in Fig. 5. GP-UCB with the correct hyperparameters (black, dashed) obtains the lowest cumulative regret. However, it fails to converge when hyperparameters are misspecified (magenta), since the confidence intervals are too small to encourage any exploration. In contrast, our methods (blue/orange) converge to close-to-optimal inputs as in the previous example.

5.2 Logistic Regression Experiment

(a) Simple regret.
(b) Cumulative regret.
Figure 6: Simple and cumulative regret over 5 random seeds for a logistic regression problem. All methods determine close-to-optimal parameters. However, our methods explore more to counteract misspecified hyperparameters.

Lastly, we use our method to tune a logistic regression problem on the MNIST data set (LeCun, 1998). As in the experiment in Klein et al. (2016), we consider four training inputs: the learning rate, the $\ell_2$ regularization constant, the batch size, and the dropout rate. We use the validation loss as the optimization objective.

The results are shown in Fig. 6. Even though the input space is fairly high-dimensional with $d = 4$, all algorithms determine close-to-optimal inputs quickly. In particular, MAP estimation determines that both the dropout rate and the batch size do not influence the validation loss significantly. Since the theoretical results of A-GP-UCB are compatible with MAP estimation, our approach achieves the same empirical performance, but has theoretical worst-case regret bounds. After convergence, the BO baselines repeatedly evaluate the same inputs, without gaining any new information. In contrast, our method continues to explore in order to potentially find better inputs. While it does not occur in this case, this allows us to be more confident that the global optimum has been identified as $t$ increases. For standard BO methods, there is no guarantee of convergence with misspecified hyperparameters.

6 Conclusion and Future Work

We introduced A-GP-UCB, a BO algorithm that is provably no-regret when hyperparameters are unknown. Our method adapts the hyperparameters online, which causes the underlying BO algorithm to consider larger function spaces over time. Eventually, the function space is large enough to contain the true function, so that our algorithm provably converges. We evaluated our method on several benchmark problems, confirming that, on the one hand, it provably converges even in cases where standard BO algorithms get stuck in local optima, and, on the other hand, enjoys the same empirical performance as standard BO algorithms that do not have theoretical guarantees in this setting.

The main idea behind our analysis is that adapting the hyperparameters increases the cumulative regret bound, but that we can do this slowly enough to converge eventually. This idea is fairly general and could also be applied to other no-regret algorithms. Another potential future direction is to investigate alternative strategies to select the scaling factors  and  and consider adapting other parameters, e.g., the kernel structure.

This research was supported in part by SNSF grant 200020_159557, NSERC grant RGPIN-2014-04634, the Vector Institute, and an Open Philanthropy Project AI fellowship. We would like to thank Johannes Kirschner for valuable discussions.

References

  • Abbasi-Yadkori (2012) Yasin Abbasi-Yadkori. Online learning of linearly parameterized control problems. PhD thesis, 2012.
  • Abdelrahman et al. (2016) Hany Abdelrahman, Felix Berkenkamp, and Andreas Krause. Bayesian optimization for maximum power point tracking in photovoltaic power plants. In 2016 European Control Conference (ECC), pages 2078–2083, 2016.
  • Bergstra and Bengio (2012) James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281–305, 2012.
  • Berkenkamp et al. (2016) Felix Berkenkamp, Andreas Krause, and Angela P. Schoellig. Bayesian optimization with safety constraints: safe and automatic parameter tuning in robotics. arXiv:1602.04450 [cs.RO], 2016.
  • Bogunovic et al. (2016) Ilija Bogunovic, Jonathan Scarlett, Andreas Krause, and Volkan Cevher. Truncated variance reduction: a unified approach to bayesian optimization and level-set estimation. In Advances in Neural Information Processing Systems 29, pages 1507–1515. Curran Associates, Inc., 2016.
  • Brochu et al. (2010) Eric Brochu, Vlad M. Cora, and Nando de Freitas. A tutorial on bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv:1012.2599 [cs], 2010.
  • Bull (2011) Adam D. Bull. Convergence rates of efficient global optimization algorithms. Journal of Machine Learning Research, 12(Oct):2879–2904, 2011.
  • Calandra et al. (2014) Roberto Calandra, André Seyfarth, Jan Peters, and Marc Peter Deisenroth. An experimental comparison of bayesian optimization for bipedal locomotion. In 2014 IEEE International Conference on Robotics and Automation (ICRA), pages 1951–1958, 2014.
  • Chowdhury and Gopalan (2017) Sayak Ray Chowdhury and Aditya Gopalan. On kernelized multi-armed bandits. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 844–853. PMLR, 2017.
  • Christmann and Steinwart (2008) Andreas Christmann and Ingo Steinwart. Support Vector Machines. Information Science and Statistics. Springer, New York, NY, 2008.
  • Djolonga et al. (2013) Josip Djolonga, Andreas Krause, and Volkan Cevher. High-dimensional gaussian process bandits. In Advances in Neural Information Processing Systems 26, pages 1025–1033, 2013.
  • Durand et al. (2018) Audrey Durand, Odalric-Ambrym Maillard, and Joelle Pineau. Streaming kernel regression with provably adaptive mean, variance, and regularization. Journal of Machine Learning Research, 19(17):1–34, 2018.
  • Duvenaud et al. (2011) David K. Duvenaud, Hannes Nickisch, and Carl Edward Rasmussen. Additive gaussian processes. In Advances in Neural Information Processing Systems 24, pages 226–234, 2011.
  • Frazier et al. (2009) Peter Frazier, Warren Powell, and Savas Dayanik. The knowledge-gradient policy for correlated normal beliefs. INFORMS Journal on Computing, 21(4):599–613, 2009.
  • Golovin et al. (2017) Daniel Golovin, Benjamin Solnik, Subhodeep Moitra, Greg Kochanski, John Karro, and D. Sculley. Google vizier: a service for black-box optimization. In Proc. of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’17, pages 1487–1495, New York, NY, USA, 2017. ACM.
  • Gradshteĭn et al. (2007) I. S. Gradshteĭn, I. M. Ryzhik, and Alan Jeffrey. Table of integrals, series, and products. Academic Press, Amsterdam, Boston, 7th ed edition, 2007.
  • Hennig and Schuler (2012) Philipp Hennig and Christian J. Schuler. Entropy search for information-efficient global optimization. Journal of Machine Learning Research, 13(1):1809–1837, 2012.
  • Hernández-Lobato et al. (2014) José Miguel Hernández-Lobato, Matthew W Hoffman, and Zoubin Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 918–926. Curran Associates, Inc., 2014.
  • Kirschner and Krause (2018) Johannes Kirschner and Andreas Krause. Information directed sampling and bandits with heteroscedastic noise. In Proceedings of the 31st Conference On Learning Theory, volume 75 of Proceedings of Machine Learning Research, pages 358–384. PMLR, 2018.
  • Klein et al. (2016) Aaron Klein, Stefan Falkner, Jost Tobias Springenberg, and Frank Hutter. Bayesian neural network for predicting learning curves. In NIPS 2016 Bayesian Neural Network Workshop, 2016.
  • LeCun (1998) Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
  • Lizotte et al. (2007) Daniel J. Lizotte, Tao Wang, Michael H. Bowling, and Dale Schuurmans. Automatic gait optimization with gaussian process regression. In Proceedings of the Twentieth International Joint Conference on Artificial Intelligence (IJCAI), volume 7, pages 944–949, 2007.
  • Mockus (2012) Jonas Mockus. Bayesian approach to global optimization: theory and applications. Springer Science & Business Media, 2012.
  • Mockus et al. (1978) Jonas Mockus, Vytautas Tiesis, and Antanas Zilinskas. The application of bayesian methods for seeking the extremum. Towards Global Optimization, 2:117–129, 1978.
  • Rasmussen and Williams (2006) Carl Edward Rasmussen and Christopher K.I Williams. Gaussian processes for machine learning. MIT Press, Cambridge MA, 2006.
  • Ru et al. (2018) Binxin Ru, Michael A. Osborne, Mark Mcleod, and Diego Granziol. Fast information-theoretic bayesian optimisation. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4384–4392. PMLR, 2018.
  • Russo and Van Roy (2014) Daniel Russo and Benjamin Van Roy. Learning to optimize via information-directed sampling. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 1583–1591. Curran Associates, Inc., 2014.
  • Seeger et al. (2008) M. W. Seeger, S. M. Kakade, and D. P. Foster. Information consistency of nonparametric gaussian process methods. IEEE Transactions on Information Theory, 54(5):2376–2382, 2008.
  • Snoek et al. (2012) Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical bayesian optimization of machine learning algorithms. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 2951–2959. Curran Associates, Inc., 2012.
  • Snoek et al. (2013) Jasper Snoek, Kevin Swersky, Richard S. Zemel, and Ryan P. Adams. Input warping for bayesian optimization of non-stationary functions. In NIPS Workshop on Bayesian Optimization, 2013.
  • Srinivas et al. (2012) Niranjan Srinivas, Andreas Krause, Sham M. Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: no regret and experimental design. IEEE Transactions on Information Theory, 58(5):3250–3265, 2012.
  • Vapnik (1992) V. Vapnik. Principles of risk minimization for learning theory. In J. E. Moody, S. J. Hanson, and R. P. Lippmann, editors, Advances in Neural Information Processing Systems 4, pages 831–838. Morgan-Kaufmann, 1992.
  • Wabersich and Toussaint (2016) Kim Peter Wabersich and Marc Toussaint. Advancing bayesian optimization: the mixed-global-local (mgl) kernel and length-scale cool down. arXiv:1612.03117 [cs, stat], 2016.
  • Wang and Jegelka (2017) Zi Wang and Stefanie Jegelka. Max-value entropy search for efficient bayesian optimization. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3627–3635. PMLR, 2017.
  • Wang and de Freitas (2014) Ziyu Wang and Nando de Freitas. Theoretical analysis of bayesian optimisation with unknown gaussian process hyper-parameters. arXiv:1406.7758 [cs, stat], 2014.
  • Wang et al. (2016) Ziyu Wang, Frank Hutter, Masrour Zoghi, David Matheson, and Nando de Freitas. Bayesian optimization in a billion dimensions via random embeddings. Journal of Artificial Intelligence Research, 55:361–387, 2016.

Appendix A Proof of Main Theorem

Lemma 3

Let $f \in \mathcal{H}_{\theta_0}$ with $\|f\|_{k_{\theta_0}} \le B_0$. Then, for any monotonically increasing functions $g(t) \ge 1$ and $b(t) \ge 1$ and for all $t \ge 1$: $f \in \mathcal{H}_{\theta_t}$ with $\|f\|_{k_{\theta_t}} \le B_t$.

Proof: Lemma 2 together with the monotonicity of $g$ yields $\|f\|_{k_{\theta_t}} \le g(t)^{d/2} \|f\|_{k_{\theta_0}}$, so that $f \in \mathcal{H}_{\theta_t}$ and $\|f\|_{k_{\theta_t}} \le g(t)^{d/2} B_0 \le b(t)\, g(t)^{d/2} B_0 = B_t$.

Lemma 4

Under the assumptions of Lemma 1, let $\{\theta_t\}_{t \ge 1}$ be a predictable sequence of kernel hyperparameters such that $\|f\|_{k_{\theta_t}} \le B_t$, and let the GP predictions $\mu_t$ and $\sigma_t$ use the prior covariance $k_{\theta_t}$. If $\beta_t^{1/2} = B_t + 4\sigma \sqrt{\mathrm{I}_{\theta_t}(\mathbf{y}_{t-1}; f) + 1 + \ln(1/\delta)}$, then $|f(\mathbf{x}) - \mu_{t-1}(\mathbf{x})| \le \beta_t^{1/2} \sigma_{t-1}(\mathbf{x})$ holds for all $\mathbf{x} \in \mathcal{D}$ and iterations $t \ge 1$ jointly with probability at least $1 - \delta$.

The proof is the same as the one by Abbasi-Yadkori (2012); Chowdhury and Gopalan (2017), except that the kernel is time-dependent.

We are now ready to prove the main result:

(Proof of the regret bound in Eq. (11)) We split the regret bound into two terms, $R_T \le R_{t^*} + R_{t^* \to T}$. Let $t^*$ be the first iteration such that $f \in \mathcal{H}_{\theta_{t^*}}$ with $\|f\|_{k_{\theta_{t^*}}} \le B_{t^*}$. In the initial rounds, where either $f \notin \mathcal{H}_{\theta_t}$ or $\|f\|_{k_{\theta_t}} > B_t$, the instantaneous regret is trivially bounded by $2B$. Thus, $R_{t^*} \le 2 B t^*$. From Lemma 3, we have that $f \in \mathcal{H}_{\theta_t}$ with $\|f\|_{k_{\theta_t}} \le B_t$ for all $t \ge t^*$. Thus, we can use Lemma 4 to conclude that $|f(\mathbf{x}) - \mu_{t-1}(\mathbf{x})| \le \beta_t^{1/2} \sigma_{t-1}(\mathbf{x})$ for all $\mathbf{x} \in \mathcal{D}$ and $t \ge t^*$ jointly with probability at least $1 - \delta$. We use Lemmas 5.2-5.4 in Srinivas et al. (2012) to conclude that the second stage has a regret bound of $\sqrt{C_1 T \beta_T \mathrm{I}_{\theta_T}(\mathbf{y}_T; f)}$, which concludes the proof.

Appendix B Bound on the information capacity $\gamma_T$

[Theorem 8 in Srinivas et al. (2012)] Suppose that $\mathcal{D} \subset \mathbb{R}^d$ is compact, and $k(\mathbf{x}, \mathbf{x}')$ is a covariance function for which the additional assumptions of Theorem 2 in Srinivas et al. (2012) hold. Moreover, let $B_k(T_*) = \sum_{s > T_*} \lambda_s$, where $\{\lambda_s\}$ is the operator spectrum of $k$ with respect to the uniform distribution over $\mathcal{D}$. Pick $\tau > 0$, and let $n_T$ be as in Srinivas et al. (2012). Then, the following bound holds true:

(19)

This theorem allows us to bound $\gamma_T$ through the operator spectrum of the kernel with respect to the uniform distribution. We now consider this quantity for two specific kernels.

B.1 Bounds for the Squared Exponential Kernel

Lemma 5

For all  it holds that

In this section, we use the theorem above to obtain concrete bounds for the Gaussian kernel. From Seeger et al. (2008), we obtain a bound on the eigenspectrum; the constant in that bound parameterizes the distribution over $\mathcal{D}$. In the following, we bound the eigenspectrum. The steps follow the outline of Seeger et al. (2008), but we provide more details and make the dependence on the lengthscales $\theta$ explicit:

where . Now substitute . Then and
where for as in Gradshteĭn et al. (2007, (8.352.4)). Then, with ,