Bayesian optimization (BO), building on Gaussian Processes (GPs), provides a spectrum of powerful and flexible modeling tools allowing for efficiently addressing the exploration-exploitation trade-off in sequential optimization. A popular instance of BO is Thompson Sampling (GP-TS, Chowdhury2017bandit), that leverages surrogate GP models to sample from the posterior distribution of the optimum point of an objective function. A particular advantage of GP-TS is its versatility (see hernandez2017parallel; kandasamy2018parallelised and paria2019flexible for batch-sequential and multi-objective versions, respectively). In sequential optimization problems with bandit feedback, TS has been shown to enjoy favorable performance both in theory and practice (see Kaufmann2012; Agrawal2013TS
for Bernoulli distributions with beta priors,Agrawal2013TS
for Gaussian distributions with Gaussian priors,Kaufmann2013TS for the one-dimensional exponential family with uninformative priors, Gopalan2014TS for finitely supported distributions and priors, (Abeille2017LinTS) for linear bandits, and Chowdhury2017bandit for kernelized bandits).
Despite the appealing results, the practical implementation of GP-TS introduces two main difficulties that prevent the method from scaling in terms of time horizon. First, building the GP posterior distribution (based on the Bayes rule), which is needed every time a new observation is acquired, is well known to require an computation (where is the number of observations) due to a matrix inversion step (Rasmussen2006). Given the updated posterior model, sampling from its optimum is the second challenging task. The standard approach is to draw a joint sample on a grid discretizing the search space. Sampling over a grid of size has an complexity (due to a Cholesky decomposition step, diggle1998model), which is the second computational bottleneck.
A natural answer to the first challenge is to rely on the recent advances in sparse variational GP models (referred to as SVGP for the rest of the paper) which allow a low rank approximation of the GP posterior in computation, where is the number of the so-called inducing variables and grows at a rate much slower than (Titsias2009Variational). The inducing variables are manifested either as inducing points or inducing features (sometimes referred to as inducing inter-domain variables, Burt2019Rates; vanderWilk2020Framework). Furthermore, wilson2020efficientsampling introduced an efficient sampling rule (referred to as decomposed sampling) which decomposes a sample from the posterior into the sum of a prior with features (see Sec. 3.2) and an SVGP update, reducing the computational cost of drawing a sample to . Leveraging this sampling rule results in a scalable GP-TS algorithm (henceforth S-GP-TS) that is simple to implement and can handle orders of magnitude larger number of observation points.
The main question is to assess whether such an approach maintains the performance guarantees (in terms of regret, see (1)) of the vanilla GP-TS. Indeed, using sparse models and decomposed sampling introduce two layers of approximation, that, if handled without care, can have a dramatic effect on performance (see e.g. Phan2019TSExample, that showed that even a small constant posterior error (in divergence) can lead to under- or over-exploration and poor performance (i.e. linear regret)). Hence, the objective of this work is the analysis of S-GP-TS methods, along with necessary conditions on some algorithmic parameters to guarantee performance.
In Theorem 1, we first establish an upper bound on the regret of S-GP-TS for any approximate model that satisfies some conditions on the quality of their posterior approximations (Assumptions 1 and 2). Then, focusing on SVGP models, we provide bounds for the number of inducing variables required to guarantee a low regret when the decomposed sampling rule of wilson2020efficientsampling is used. The bounds on are characterized by the spectrum of the kernel of the GP model. With these bounds in place, we report drastic improvement in the computational complexity of S-GP-TS with Matérn and Squared Exponential kernels (see Table 1) without compromising the performance, i.e., preserving the same order of regret as in the exact GP-TS.
As part of our analysis, we prove a novel concentration inequality for GP models resulting in stronger regret bounds compared to the existing work. Specifically, an immediate result of Theorem 1 is an 111The notation is used to denote the mathematical order up to the factors. regret bound for the exact GP-TS (where is the time horizon) that is an improvement over the regret bounds reported in Chowdhury2017bandit; srinivas2010gaussian; Calandriello2019Adaptive for GP-TS or GP-UCB (an optimistic optimization algorithm that selects the observation points according to an upper confidence bound score), where denotes an upper bound on the information gain, sublinearly scaling with , that depends on the GP kernel. This improvement in regret bounds is significant as the existing bounds are not always sublinear.
1.2 Related work
Janz2020SlightImprov recently addressed the issue that the existing regret bound is not necessarily sublinear for the Matérn kernel. They used a GP-UCB based approach that constructs a cover for the search space (as many hypercubes) and fits an independent GP to each cover element. Our regret bounds applied to the Matérn kernel matches the ones reported in Janz2020SlightImprov, for the exact GP models. More generally, the concentration inequality given in Lemma 1 applies to the analysis of GP-UCB as well and improves its regret bounds by a factor of (e.g. it can replace the concentration inequality presented in Theorem 2 of Chowdhury2017bandit with no additional algorithmic sophistication). Calandriello2019Adaptive
introduced a scalable version of GP-UCB based on randomized matrix sketching andleverage score sampling which is a low rank approximation based on inducing points. Their selection rule for the inducing points is however different from the one for SVGP introduced in Burt2019Rates. They proved the same regret guarantees as the ones for GP-UCB (up to a multiplicative factor). In comparison, the analysis of S-GP-TS is different and has additional complexity due to the cost of posterior sampling ( per step). Also, as mentioned above, our results offer an improvement in regret bounds over the ones presented in Calandriello2019Adaptive.
Among other approaches to sparse approximation of GP posteriors is to use a finite dimensional truncated feature representation (Hernandez2014features; Shahriari2016) such as random Fourier features (FFs) which might suffer from variance starvation
, i.e., underestimate the variance of points far from the observations(wang2018batched; Mutny2018SGPTS; wilson2020efficientsampling). Intuitively, that is because the Fourier basis is only an efficient basis for representing stationary GPs, while the posterior is generally nonstationary (wilson2020efficientsampling). To the best of our knowledge, no regret guarantees are reported for this approach. Mutny2018SGPTS however took a different approach based on the quadratic FFs (in contrast to the random FFs) and constructed scalable Bayesian optimization methods not suffering from variance starvation. They provided regret bounds for the case of additive SE kernel which almost match the ones resulting from the application of our Theorem 1 to the SE kernel, though our bounds are tighter by a factor of . The application of our regret bounds to dimensional linear kernels results in the same bounds as in linear bandits with TS (Abeille2017LinTS) or with UCB (Abbasi2011).
introduced a heuristic parallel implementation of TS where each updated posterior model is used to drawsamples. kandasamy2018parallelised showed that parallelization improves the computational complexity due to updating the model, by a constant factor of at the price of incurring an additive regret. The parallelization however does not reduce the cost due to sampling. This idea directly applies to our results as well.
There exist several other approaches to BO, namely expected improvement and probability of improvement which lack theoretical guarantees on low regret.
2 Problem Formulation
We consider the sequential optimization of a fixed and unknown function over a compact set . The goal is to design a learning policy that selects an observation point at each discrete time instance , and receives the corresponding real-valued reward , where is a zero mean noise, i.i.d. over . Let be the optimal point. The goal is to minimize regret defined as the cumulative loss compared to over a time horizon . Specifically, regret is defined as
where is the sequence of the selected observation points and the expectation is taken with respect to the possible randomness in . For the algebra , is measurable and is measurable. We assume that is sub-Gaussian where is a fixed constant that is a standard assumption both in bandits and BO literature. Specifically, The sub-Gaussian assumption implies that , for all .
Our regret measure is strict as it is defined for a fixed in contrast to a possible alternative definition, referred to as Bayesian regret (e.g. see Russo2018TutorialTS; kandasamy2018parallelised), where the performance is averaged over a prior distribution on . Upper bounds on strict regret directly apply to the Bayesian regret while the reverse statement does not necessarily hold true. The analysis of GP-TS with Bayesian regret is simpler than with strict regret in the sense that using the technique introduced in Russo2014 (also, see Russo2018TutorialTS), it can be reduced to the analysis of an upper confidence bound policy. The convenience of the analysis however comes at the price of weaker results.
Our regularity assumption on the objective function, sometimes referred to as agnostic (srinivas2010gaussian), is motivated by kernelized learning models and their associated reproducing kernel Hilbert spaces (RKHS). Let be a positive definite kernel. A Hilbert space of functions on equipped with an inner product is called an RKHS with reproducing kernel if , for all , and (reproducing property) for all and . A RKHS is completely specified by its kernel function and vice-versa. The inner product induces the RKHS norm that can be interpreted as a measure of the complexity of .
We assume that the RKHS norm of the objective function is bounded for some , and for all (the same regularity assumption used in Chowdhury2017bandit; srinivas2010gaussian; Calandriello2019Adaptive).
For a detailed construction of RKHSs and more examples we refer to Kanagawa2018. Here, we provide an intuitive explanation of our assumption for particular classes of kernels, namely Squared Exponential (SE) and Matérn, which are perhaps the most popular kernels in practice for BO (see e.g. Rasmussen2006; Snoek2012practicalBO),
where is the Euclidean distance between and , is referred to as lengthscale, is referred to as the smoothness parameter and is the modified Bessel function of the second kind. Variation over parameter creates a rich family of kernels. The SE kernel can also be interpreted as a special case of Matérn when . When , the Matérn kernel can be expressed as the product of an exponential and a polynomial of order .
The RKHS of Matérn is equivalent to a Sobolev space with parameter (Teckentrup2019; Kanagawa2018). This observation provides an intuitive interpretation for the norm of Matérn RKHS as proportional to the cumulative norm of the weak derivatives of up to order. I.e., in the case of Matérn family, our assumption on the norm of translates to the existence of weak derivatives of up to order which can be understood as a versatile measure for the smoothness of controlled by . In the case of SE kernel, our regularity assumption implies the existence of all weak derivatives of . For the details on the definition of weak derivatives and Sobolev spaces see Hunter2001Analysis.
3 Gaussian Processes and Sparse Models
A GP is a collection of (possibly infinitely many) random variables, , whose finite subsets each follows a multivariate Gaussian distribution (Rasmussen2006). The distribution of a GP can be specified by its mean function (without loss of generality, it is typically assumed that for prior GP distributions) and a positive definite kernel (or covariance function) .
3.1 Surrogate GP Models
Conditioning GPs on available observations provides us with powerful non-parametric surrogate Bayesian models over the space of functions. Consider a set of observations , where and (recall that and for all ). Conditioned on the set of past observations , the posterior of is a GP distribution with mean function and kernel function specified as follows:
where and is the positive definite covariance matrix . The surrogate model induces a Gaussian likelihood by assuming a Gaussian distribution for the noise 222The GP-TS methods use surrogate GP models and assume a Gaussian noise to obtain closed form posterior distributions applied to the optimization of .
This Bayesian approach is only for the purpose of algorithm design and does not affect our general agnostic assumptions on (being fixed and unknown, living in a RKHS) and (being sub-Gaussian). The notation is used to distinguish the GP model from the fixed . The assumed variance for the noise has an effect similar to the regularization parameter in kernel ridge regression (see
for the noise has an effect similar to the regularization parameter in kernel ridge regression (seeKanagawa2018 for more details).. The posterior variance of is given by . Although GP models enjoy closed-form posterior expressions, calculating requires computations which is the computational bottleneck for large .
3.2 Feature Representation
Using Mercer’s theorem, we can view the kernel as the inner product of possibly infinite dimensional feature maps (see Theorem 4.1 in Kanagawa2018):
are referred to as the eigenvalues and
, an orthonormal basis of RKHS, are referred to as the eigenfunctions. Using the feature representation, a samplecan be expressed as a weighted sum of eigenfunctions where the weights are independent normal random variables. With decaying eigenvalues, an approximate sample can be drawn from the GP using an truncated feature map as This property can be used for sampling from the prior (as we will see in the decomposed sampling rule of wilson2020efficientsampling).
3.3 Sparse Variational Models
To overcome the cubic cost of exact GPs, SVGPs (Titsias2009Variational; Hensman2013) approximate GP posterior using a finite set of inducing points . SVGPs condition the prior on inducing variables (i.e. , for ) instead of observations and specify a Gaussian density such that the approximate posterior distribution can be interpreted as . The approximate distribution takes the shape with the posterior mean and covariance
which can be interpreted as a method for a low rank approximation of . Variational parameters and are the maximizer of the evidence lower bound (ELBO)
where , , , , is the identity matrix and . Titsias2009Variational provides an explicit solution for the convex optimization problem of finding . Hensman2013 proposed a numerical solution allowing for mini-batching (also, see Burt2019Rates).
The inducing points are sampled from according to a discrete k-Determinantal Point Process (k-DPP). While sampling according to an exact k-DPP distribution might be costly, defying the computational gain of SVGP, Burt2019Rates showed that can be efficiently sampled from close sampling methods without compromising the predictive quality of SVGP.
The distribution of can be specified on the integral transformations of in the form of using eigenfunctions . It can be shown that and (Burt2019Rates). The resulting sparse approximation of the posterior mean and covariance using inducing features is
where similar to the previous case the variational parameters and are the maximizer of the ELBO,
is the truncated feature vector andis the diagonal matrix of eigenvalues, . The expression of ELBO is the same as in the case of inducing points where , , , .
In both cases, as a result of the low rank approximation of the covariance matrix, the computational complexity of updating the GP model reduces to , at the price of introducing an approximation error.
4 Scalable Thompson Sampling using Gaussian Process Models (S-GP-TS)
An ideal GP-TS proceeds by sampling from the posterior distribution of and finding its maximizer. Since is an infinite dimensional object, it is standard in practice to resort to sampling on a discretization of . Then is selected as the maximizer of the discretized sample:
The discrete draws are generated via an affine transformation of Gaussian random variables , where for , is the zero vector and is the identity matrix. Specifically, , where , , , and denotes a matrix square root, such as a Cholesky factor which incurs an computational cost.
wilson2020efficientsampling introduced a sampling method based on decomposing the posterior sample into a truncated feature representation of the prior and an SVGP update which improves the computational cost of drawing a sample. We build on their results to introduce two sampling rules for S-GP-TS. The first rule is referred to as Decomposed Sampling with Inducing Points:
where , , () and where are drawn i.i.d from , for . This is a slight modification of the sampling rule of wilson2020efficientsampling where we have scaled the covariance of the sample with (to be specified later). The scaling is necessary for TS methods to ensure sufficient exploration based on anti-concentration of Gaussian distributions. In addition, we consider a second sampling rule referred to as Decomposed Sampling with Inducing Features:
where . The computational complexity of both sampling rules introduced above is per step 333The computational cost also comprises an term which is dominated by . This will become clear later when the values of and are specified. .
5 Regret Analysis
In this section, we present the main contribution of our work that is the theoretical analysis of S-GP-TS methods. First, in Lemma 1, we derive a novel concentration inequality for GP posteriors. We then establish an upper bound on the regret of S-GP-TS (Theorem 1) based on the quality of approximations parameterized in Assumptions 1 and 2. The consequences of Theorem 1 for the regret bounds and the computational complexity of S-GP-TS methods is further discussed.
5.1 Existing Bounds and A New Concentration Inequality
Chowdhury2017bandit proved that, with probability at least , , where and is the maximal information gain. Specifically, where denotes the mutual information between observations and the underlying function values . The maximal information gain can itself be bounded for a specific kernel, e.g. for the SE kernel (srinivas2010gaussian) and for the Matérn kernel (Janz2020SlightImprov).
Based on this concentration inequality, Chowdhury2017bandit
showed that the regret of GP-TS scales with the cumulative uncertainty at the observation points measured by the standard deviation,. Furthermore, srinivas2010gaussian showed that . Using this result and applying Jensen inequality to , Chowdhury2017bandit proved that . In some of the existing works is referred as the effective dimension of the problem and used in the bounds in place of maximal information gain (e.g. Calandriello2019Adaptive; Janz2020SlightImprov).
We build our regret bounds on a tighter concentration inequality provided in the following lemma.
Assume , the observation noise is sub-Gaussian, and and are the posterior mean and standard deviation as defined in (3.1). With probability at least ,
To prove the lemma, we first derive an important expression for the posterior variance by breaking it down into the maximum expected error in prediction and the variance introduced by the (assumed) noise in the surrogate Bayesian model.
The posterior mean and variance defined in (3.1) satisfy
5.2 Regret Bounds Based on the Quality of Approximations
We start our analysis by making two assumptions on the quality of approximations (, ). This parameterization is agnostic to the particular sampling rules and provides valuable intuition that can be applied to any approximate method.
Assumption 1 (quality of the approximate standard deviation).
For the approximate , the exact , and for all , where , for all and some constants , and for all and some small constant .
Assumption 2 (quality of the approximate prediction).
For the approximate , the exact and , and for all , where for all and some constant .
The following Lemma establishes a concentration inequality for the approximate statistics as a direct result of the assumptions and Lemma 1
Proof is provided in Appendix A. Following Chowdhury2017bandit; srinivas2010gaussian, we also assume that is selected in a way that for all , where is the closest point (in Euclidean norm) to in . The size of this discretization satisfies where is independent of (Chowdhury2017bandit; srinivas2010gaussian). We are now in a position to present the regret bounds based on the quality of the approximations. Let
The result can be simplified as which makes the scaling of regret more visible with respect to the parameterization of the quality of the approximations. The regret bound scales with the product of the ratios and . It also comprises an additive term depending on the additive approximation errors. An immediate result of Theorem 1 is an improved regret bound for the exact GP-TS given in the following corollary.
Under the setting of Theorem 1, the regret performance of the exact GP-TS (corresponding to and ) satisfies
Parameter is used to properly tune the scaling . If it is not known in advance, standard guess-and-double techniques apply (the same as in GP-UCB, see srinivas2010gaussian; Calandriello2019Adaptive)
Proof of Theorem 1.
Concentration of the approximate predictions around (provided by Lemma 2) as well as concentration of Gaussian samples around result in high probability bounds in the form of . This allows us to upper bound the instantaneous regret within . Specifically, with high probability
where the last inequality holds by the selection rule of TS. While the second term can efficiently be upper bounded using maximal information gain, the first term does not necessarily converge unless there is sufficient exploration around the optimum point. Anti-concentration of Gaussian distributions will ensure sufficient exploration, if is large enough. On the other hand a small implies that the prediction around optimum point is close to the value of . We thus proceed under two different cases and . Under the second case we can move to upper bound the cumulative regret based on (13) and the assumption that is sufficiently small. Under the first case, an intermediary point (that is a point with the smallest variance among less explored points) is introduced and shown to satisfy and where the expectation is taken with respect to the randomness in . We then proceed with
where the last inequality holds by the selection rule of TS. We then use the upper bound on the cumulative standard deviations based on the information gain. Accounting for the probability of concentration inequalities to fail as well as contribution of discritizaton to regret we conclude the result. A detailed proof is provided in Appendix B. ∎
5.3 Approximation Quality of the Decomposed Sampling Rule
The quality of the approximation is characterized using the spectrum properties of the GP kernel. Let us define where . With decaying eigenvalues, including sufficient eigenfunctions in the feature representation results in a small . In addition, Burt2019Rates showed that, for SVGP, a sufficient number of inducing variables ensures that the Kullback–Leibler (KL) divergence between the approximate and the true posterior distributions diminishes. Consequently, the approximate posterior mean and the approximate posterior variance converge to the true ones. Building on this result we prove the following proposition on the quality of approximations.
For S-GP-TS based on sampling rule (6) with a SVGP using an close k-DPP for selecting , with probability at least , Assumptions 1 and 2 hold with parameters , , , and , where is a constant specified in Appendix C and .
The results in Burt2019Rates do not directly apply to our setting for two reasons. First, we use the decomposed sampling rule of wilson2020efficientsampling (recall (6) and (7)) which introduces additional error in approximate variance compared to SVGP (while keeping the prediction the same as in SVGP). This additional error in the approximate variance in particular makes the analysis of S-GP-TS more challenging. Second, Burt2019Rates build their convergence results on the assumption that the observation points are drawn from a fixed distribution which is not the case in S-GP-TS, where are selected according to an experimental design method. A detailed proof of Proposition 2 is provided in Appendix C.
5.4 Application of Regret Bounds to Matérn and SE Kernels
Gabriel2020practicalfeature gave closed form expression of eigenvalue-eigenfunction pairs of Matérn and SE kernels. In case of a Matérn kernel with smoothness parameter , which implies . In order to apply Proposition 2, under sampling rule (6) we select , and under sampling rule (7), we select for some small . We also select under both sampling rules. In case of SE kernel, which implies . In order to apply Proposition 2, under both sampling rules (6) and (7), we select and .
With this choice of parameters, Proposition 2 implies that ; consequently, , as grows. Also, as grows. Applying Theorem 1, the regret of S-GP-TS satisfies with the computational complexities given in Table 1 which shows drastic improvements over computational cost of the exact GP-TS. For the Matérn kernel, under sampling rule (6), in order for to grow slower than , is required to be sufficiently larger than .
6 Discussions and Future Work
We showed that low regret is achievable with an computation per step (due to discretization) which is a drastic improvement over the cost of the standard sampling. however is exponential in the dimension of the search space, which remains a limiting factor in implementation of Bayesian optimization methods on high dimensional spaces. Hence, while S-GP-TS with decomposed sampling rule run for orders of magnitude longer , it still suffers from the curse of dimensionality. Intuitively, this seems inevitable due to NP-Hardness of the non-convex optimization problems (Jian2017NPHard). The same complexity with appears in finding the maximizer of the UCB in GP-UCB (Calandriello2019Adaptive), and even in the application of UCB to linear bandits (Dani2008). In particular, the computational cost of the adaptive sketching method of Calandriello2019Adaptive was reported as where , referred to as the effective dimension of the problem, is upper bounded by .
Some existing works assume oracle access to an efficient optimizer which can select the maximizer of the acquisition function, that might be efficiently accessible in case of convex or sufficiently smooth acquisition functions (e.g. see Mutny2018SGPTS). Our results would also benefit from the access to an oracle maximizer, to further reduce the <