# Efficient Gaussian Process Bandits by Believing only Informative Actions

Bayesian optimization is a framework for global search via maximum a posteriori updates rather than simulated annealing, and has gained prominence for decision-making under uncertainty. In this work, we cast Bayesian optimization as a multi-armed bandit problem, where the payoff function is sampled from a Gaussian process (GP). Further, we focus on action selections via upper confidence bound (UCB) or expected improvement (EI) due to their prevalent use in practice. Prior works using GPs for bandits cannot allow the iteration horizon T to be large, as the complexity of computing the posterior parameters scales cubically with the number of past observations. To circumvent this computational burden, we propose a simple statistical test: only incorporate an action into the GP posterior when its conditional entropy exceeds an ϵ threshold. Doing so permits us to derive sublinear regret bounds of GP bandit algorithms up to factors depending on the compression parameter ϵ for both discrete and continuous action sets. Moreover, the complexity of the GP posterior remains provably finite. Experimentally, we observe state of the art accuracy and complexity tradeoffs for GP bandit algorithms applied to global optimization, suggesting the merits of compressed GPs in bandit settings.


## 1 Introduction

Bayesian optimization is a framework for global optimization of a black box function via noisy evaluations (Frazier, 2018), and provides an alternative to simulated annealing (Kirkpatrick et al., 1983; Bertsimas and Tsitsiklis, 1993) or exhaustive search (Davis, 1991). These methods have proven adept at hyper-parameter tuning of machine learning models (Snoek et al., 2012; Li et al., 2017), nonlinear system identification (Srivastava et al., 2013), experimental design (Chaloner and Verdinelli, 1995; Press, 2009), and semantic mapping (Shotton et al., 2008).

More specifically, denote by f : X → ℝ the function we seek to optimize through noisy samples, i.e., for a given choice x_t ∈ X, we observe y_t = f(x_t) + ε_t sequentially. We make no assumptions for now on the convexity, smoothness, or other properties of f, other than that each function evaluation must be selected judiciously. Our goal is to select a sequence of actions {x_t} that eventuates in competitive performance with respect to the optimal selection x* = argmax_{x∈X} f(x). For sequential decision making, a canonical performance metric is regret, which quantifies the performance of a sequence of decisions {x_t} as compared with the optimal x*:

 Reg_T := ∑_{t=1}^{T} [f(x*) − f(x_t)]. (1.1)

Regret in (1.1) is natural because at each time t we quantify how far decision x_t was from optimal through the difference f(x*) − f(x_t). An algorithm eventually learns the optimal strategy if it is no-regret: Reg_T/T → 0 as T → ∞.
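As a concrete illustration (our own minimal sketch, not from the paper), the cumulative regret (1.1) and its running average Reg_T/T can be tracked alongside any action sequence:

```python
import numpy as np

def cumulative_regret(f, actions, x_star):
    """Cumulative regret (1.1): sum of f(x*) - f(x_t) over the chosen actions."""
    gaps = [f(x_star) - f(x_t) for x_t in actions]
    return np.cumsum(gaps)

# A no-regret algorithm drives Reg_T / T -> 0. For example, if the chosen
# actions approach the maximizer like x_t = x* + 1/sqrt(t):
f = lambda x: -(x - 0.3) ** 2
actions = [0.3 + 1.0 / np.sqrt(t) for t in range(1, 1001)]
reg = cumulative_regret(f, actions, x_star=0.3)
avg_regret = reg / np.arange(1, len(actions) + 1)   # Reg_T / T, shrinks with T
```

Here each per-step gap is 1/t, so cumulative regret grows only logarithmically and the average regret vanishes.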

In this work, we focus on Bayesian optimization, which hypothesizes a likelihood on the relationship between the unknown function f and the action selection x_t. Then, upon selecting an action x_t, one tracks a posterior distribution, or belief model (Powell and Ryzhov, 2012), over possible outcomes y, which informs how the next action is selected. In classical Bayesian inference, posterior distributions do not influence which samples are observed next (Ghosal et al., 2000). In contrast, in multi-armed bandits, action selection determines which observations form the posterior, which is why it is also referred to as active learning (Jamieson et al., 2015).

Two key questions in this setting are how to specify (i) a likelihood and (ii) an action selection strategy. These specifications come with their own merits and drawbacks in terms of optimality and computational efficiency. Regarding (i) the likelihood model, when the action space X is discrete and of moderate size, one may track a probability for each element of X, as in Thompson (posterior) sampling (Russo et al., 2018), Gittins indices (Gittins et al., 2011), and the Upper Confidence Bound (UCB) (Auer et al., 2002). These methods differ in their manner of action selection, but not in their distributional representation.

However, when the range of possibilities X is large, computational challenges arise. This is because the number of parameters one needs to define a posterior distribution over X is proportional to |X|, an instance of the curse of dimensionality in nonparametric statistics. One way to circumvent this issue for continuous spaces is to discretize the action space according to a pre-defined time-horizon T that determines the total number of selected actions (Bubeck et al., 2011; Magureanu et al., 2014), and carefully tune the discretization to the time-horizon T. The drawback of these approaches is that as T → ∞, the number of parameters in the posterior grows intractably large.

An alternative is to define a history-dependent distribution directly over the large (possibly continuous) space using, e.g., Gaussian Processes (GPs) (Rasmussen, 2004) or Monte Carlo (MC) methods (Smith, 2013). Bandit action selection strategies based on such distributional representations have been shown to be no-regret in recent years – see Srinivas et al. (2012); Gopalan et al. (2014). While MC methods permit the most general priors on the unknown function f, computational and technical challenges arise when the prior/posterior no longer possess conjugacy properties (Gopalan et al., 2014). By contrast, GPs, stochastic processes for which any finite collection of realizations is jointly Gaussian (Krige, 1951), have a conjugate prior and posterior, and thus their parametric updates admit a closed form – see Rasmussen (2004, Ch. 2).

The conjugacy of GPs has driven their use in bandit action selection. In particular, by connecting regret to maximum information-gain based exploration, which may be quantified by the posterior variance (Srinivas et al., 2012; De Freitas et al., 2012), no-regret algorithms may be derived through variance maximization. Doing so yields actions which over-prioritize exploration, which may be balanced through, e.g., upper-confidence bound (UCB) based action selection. GP-UCB algorithms, and variants such as expected improvement (EI) (Wang and de Freitas, 2014; Nguyen et al., 2017) and step-wise uncertainty reduction (SUR) (Villemonteix et al., 2009), including the knowledge gradient (Frazier et al., 2008), have been shown to be no-regret or statistically consistent (Bect et al., 2019) in recent years.

However, these convergence results hinge upon use of the dense GP, whose posterior distribution [cf. (2.7)] has complexity cubic in t due to the inversion of a Gram (kernel) matrix formed from the entire training set. Numerous efforts to reduce the complexity of GPs exist in the literature – see Csató and Opper (2002); Bauer et al. (2016); Bui et al. (2017). These methods all fix the complexity of the posterior and “project” all additional points onto a fixed likelihood “subspace.” Doing so, however, may cause uncontrollable statistical bias and divergence. In this work, we seek to explicitly design GPs to ensure both small regret and controlled complexity.

Contributions. In this context, we propose a statistical test for the GP that explicitly trades off memory and regret (1.1), motivated by compression routines that permit flexible representational complexity of nonparametric models (Koppel, 2019; Elvira et al., 2016). Specifically, we:

• propose a statistical test that operates inside GP-UCB or GP-EI which incorporates actions into the posterior only when their conditional entropy exceeds an ϵ threshold (Sec. 2). We call these methods Compressed GP-UCB and Compressed GP-EI (Algorithm 1).

• derive sublinear regret bounds of GP bandit algorithms up to factors depending on the compression parameter ϵ for both discrete and continuous action sets (Sec. 3).

• establish that the complexity of the GP posterior remains provably finite (Sec. 3).

• experimentally employ these approaches for optimizing non-convex functions and tuning the regularizer and step-size of a logistic regressor, obtaining a state of the art trade-off in regret versus computational efficiency relative to several baselines (Sec. 4).

## 2 Gaussian Process Bandits

Information Gain and Upper-Confidence Bound: To find x* = argmax_{x∈X} f(x) when f is unknown, one may first globally approximate f well, and then evaluate it at the maximizer. In order to formalize this approach, we propose to quantify how informative a collection of points is through information gain (Cover and Thomas, 2012), a standard quantity that tracks the mutual information between f and the observations {y_u} for indices u in some sampling set, defined as

 I({y_u}; f) = H({y_u}) − H({y_u} | f) (2.1)

where H({y_u}) denotes the entropy of the observations {y_u} and H({y_u} | f) denotes their entropy conditional on f. For a Gaussian N(μ, Σ) with mean μ and covariance Σ, the entropy is given as

 H(N(μ, Σ)) = (1/2) log |2πeΣ| (2.2)

which allows us to evaluate the information gain in closed form as

 I({y_u}; f) = (1/2) log |I + σ^{−2} K_t|. (2.3)
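For concreteness, (2.3) can be evaluated directly from the Gram matrix; the following sketch (our own, with a squared-exponential kernel and arbitrary test points) also illustrates why well-spread points are more informative than near-duplicates:

```python
import numpy as np

def rbf_gram(X, lengthscale=1.0):
    """Squared-exponential Gram matrix K for row-stacked points X."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def information_gain(X, noise_var=0.1):
    """I({y_u}; f) = (1/2) log det(I + sigma^{-2} K), cf. (2.3)."""
    K = rbf_gram(X)
    _, logdet = np.linalg.slogdet(np.eye(len(X)) + K / noise_var)
    return 0.5 * logdet

spread = np.linspace(0.0, 5.0, 6).reshape(-1, 1)       # well-separated inputs
clustered = 2.5 + 1e-3 * np.arange(6).reshape(-1, 1)   # near-duplicate inputs
```

Near-duplicate observations are nearly redundant, so their joint information gain is far smaller than that of well-separated ones — the intuition behind discarding uninformative actions later in this section.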

Suppose we are tasked with finding a subset of T points that maximize the information gain. This amounts to a challenging subset selection problem whose exact solution cannot be found in polynomial time (Ko et al., 1995). However, near-optimal solutions may be obtained via greedy maximization, as information gain is submodular (Krause et al., 2008). Greedily maximizing information gain is equivalent to (Srinivas et al., 2012)

 x_t = argmax_{x∈X} σ_{X_{t−1}}(x) (2.4)

where σ_{X_{t−1}}(x) is the empirical standard deviation associated with a matrix X_{t−1} of past data points {x_u}_{u<t}. We note that (2.4) may be shown to obtain a near-optimal selection of points in the sense that after T rounds, executing (2.4) guarantees

 I({y_u}_{u=1}^T; f) ≥ (1 − 1/e) max_{A : |A| = T} I({y_u}_{u∈A}; f)

via the theory of submodular functions (Nemhauser et al., 1978). Indeed, selecting points based upon (2.4) permits one to efficiently explore globally. However, it dictates that action selection never moves towards the actual maximizer of f. For this, x_t should be chosen according to prior knowledge about the function, exploiting information about where f is large. To balance between these two extremes, a number of different acquisition functions are possible based on the GP posterior – see (Powell and Ryzhov, 2012). Here, for simplicity, we do so either based upon the upper-confidence bound (UCB):

 x_t = argmax_{x∈X} α_UCB(x),  α_UCB(x) := μ_{X_{t−1}}(x) + √β_t σ_{X_{t−1}}(x), (2.5)

with β_t > 0 as an exploration parameter, or the expected improvement (EI) (Nguyen et al., 2017), defined as

 x_t = argmax_{x∈X} α_EI(x),  α_EI(x) := σ_{t−1}(x) φ(z) + [μ_{t−1}(x) − y^max_{t−1}] Φ(z), (2.6)

where y^max_{t−1} is the maximum observation value among past data, z = (μ_{t−1}(x) − y^max_{t−1})/σ_{t−1}(x) is the z-score of x, and φ and Φ denote the density and distribution function of a standard Gaussian. Moreover, the mean μ and standard deviation σ in the preceding expressions are computed via a GP, to be defined next.
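Given a posterior mean and standard deviation at candidate points, the scores (2.5) and (2.6) are straightforward to compute; below is our own minimal sketch (the standard normal density and CDF are written out to keep it self-contained):

```python
import math
import numpy as np

def phi(z):
    """Standard normal density."""
    return np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)

def Phi(z):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))

def alpha_ucb(mu, sigma, beta_t):
    """UCB score (2.5): mean plus scaled standard deviation."""
    return mu + np.sqrt(beta_t) * sigma

def alpha_ei(mu, sigma, y_max):
    """EI score (2.6): sigma*phi(z) + (mu - y_max)*Phi(z), z the z-score."""
    sigma = np.maximum(sigma, 1e-12)   # guard against zero posterior variance
    z = (mu - y_max) / sigma
    return sigma * phi(z) + (mu - y_max) * Phi(z)

# Both rules prefer a candidate with high mean AND high uncertainty:
mu = np.array([0.0, 0.5, 0.5])
sigma = np.array([1.0, 1.0, 0.1])
```

On these toy values, both rules pick the second candidate, which combines the highest mean with high uncertainty.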

Gaussian Processes: A Gaussian Process (GP) is a stochastic process for which every finite collection of realizations is jointly Gaussian. We hypothesize a Gaussian Process prior GP(μ(x), κ(x, x′)) for f, which is specified by a mean function

 μ(x) = E[f(x)]

and a covariance kernel defined as

 κ(x, x′) = E[(f(x) − μ(x))(f(x′) − μ(x′))].

Subsequently, we assume the prior is zero-mean, μ(x) = 0. GPs play multiple roles in this work: as a way of specifying smoothness and a prior for the unknown function f, as well as characterizing regret when f is a sample from a known GP GP(0, κ(x, x′)). GPs admit a closed form for their conditional a posteriori mean and covariance given the training set X_t = {x_u}_{u≤t} and observations y_t = [y_1, …, y_t]^T (Rasmussen, 2004, Ch. 2):

 μ_{X_t}(x) = k_t(x)^T (K_t + σ² I)^{−1} y_t (2.7)
 σ²_{X_t}(x, x′) = κ(x, x′) − k_t(x)^T (K_t + σ² I)^{−1} k_t(x′)

where k_t(x) = [κ(x_1, x), …, κ(x_t, x)]^T denotes the empirical kernel map and K_t denotes the Gram matrix of kernel evaluations whose entries are [K_t]_{u,v} = κ(x_u, x_v) for u, v ≤ t. The subscript X_t underscores the dictionary's role in parameterizing the mean and covariance. Further, note that (2.7) depends upon a linear observation model y = f(x) + ε with Gaussian noise prior ε ∼ N(0, σ²). The parametric updates (2.7) depend on past actions X_t = [x_1, …, x_t], which causes the kernel dictionary to grow by one at each iteration, i.e.,

 X_{t+1} = [X_t ; x_{t+1}] ∈ ℝ^{d×(t+1)},

and the posterior at time t uses all past observations y_t. Henceforth, the number of columns in the dictionary is called the model order; the GP posterior at time t has model order t.
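The updates (2.7) reduce to a pair of linear solves against the regularized Gram matrix; here is a minimal sketch in our own notation (kernel choice and test data are illustrative only):

```python
import numpy as np

def rbf(A, B, ell=1.0):
    """Squared-exponential kernel matrix between row-stacked points A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell ** 2)

def gp_posterior(X_t, y_t, X_query, noise_var=1e-4):
    """Posterior mean and variance (2.7) at X_query given dictionary X_t."""
    A = rbf(X_t, X_t) + noise_var * np.eye(len(X_t))   # K_t + sigma^2 I
    k_q = rbf(X_t, X_query)                            # empirical kernel map k_t(x)
    mu = k_q.T @ np.linalg.solve(A, y_t)
    var = 1.0 - np.sum(k_q * np.linalg.solve(A, k_q), axis=0)  # kappa(x,x)=1 for RBF
    return mu, np.maximum(var, 0.0)

X_t = np.array([[0.0], [1.0], [2.0]])
y_t = np.sin(X_t).ravel()
mu_train, var_train = gp_posterior(X_t, y_t, X_t)             # near-interpolation
mu_far, var_far = gp_posterior(X_t, y_t, np.array([[10.0]]))  # reverts to prior
```

At observed inputs the posterior nearly interpolates with small variance, while far from the data it reverts to the prior variance κ(x, x) = 1.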

The resulting action selection strategy (2.5) using the GP (2.7) is called GP-UCB, and its regret (1.1) is established in (Srinivas et al., 2012, Theorems 1 and 2) as sublinear with high probability up to factors depending on the maximum information gain over T points, which is defined as

 γ_T := max_{{x_u}} I({y_u}_{u=1}^T; f)  such that  |{x_u}| = T. (2.8)

Compression Statistic: The fundamental role of information gain in the regret of GP bandits, using either UCB or EI, provides a conceptual basis for finding a parsimonious GP posterior that nearly preserves the no-regret properties of (2.5) - (2.7). To define our compression rule, first we define some key quantities related to approximate GPs. Suppose we select some other kernel dictionary D_t rather than X_t at time t, where M_t is the model order (number of columns) of the resulting Gaussian Process. Then, the only difference is that the kernel matrix K_t in (2.7) and the empirical kernel map k_t(x) are substituted by K_{D,D} and k_D(x), respectively, whose entries are kernel evaluations among the elements of D_t and between D_t and x. Further, y_D denotes the sub-vector of y_t associated with only the indices of training points retained in the dictionary D_t. By rewriting (2.7) with D_t as the dictionary rather than X_t, we obtain

 μ_D(x) = k_D(x)^T [K_{D,D} + σ² I]^{−1} y_D (2.9)
 σ²_D(x, x′) = κ(x, x′) − k_D(x)^T (K_{D,D} + σ² I)^{−1} k_D(x′).

The question, then, is how to select a sequence of dictionaries {D_t} whose columns comprise a subset of those of X_t in such a way as to approximately preserve the regret bounds of (Srinivas et al., 2012, Theorems 1 and 2) while ensuring the model order M_t remains moderate.

We propose using conditional entropy as the statistic to compress against, i.e., a new data point should be appended to the Gaussian process posterior only when its conditional entropy is at least ϵ, which results in the following update rule for the dictionary D_t:

 If  H(y_t | y_{1:t−1}) = (1/2) log(2πe(σ² + σ²_{D_{t−1}}(x_t))) > ϵ,  update  D_t = [D_{t−1} ; x_t];
 else  update  D_t = D_{t−1}, (2.10)

where we define ϵ > 0 as the compression budget. This amounts to a statistical test of whether the action x_t yielded an informative sample in the sense that its conditional entropy exceeds an ϵ threshold. Therefore, uninformative past decisions are dropped from belief formation about the present. The modification of GP-UCB, called Compressed GP-UCB, or CUB for short, uses (2.5) with the lazy GP belief model (2.9) defined by the dictionary updates (2.10). Similarly, the compressed version of EI is called Compressed GP-EI, or CEI for short. We present them together for simplicity as Algorithm 1, with the understanding that in practice one must specify UCB (2.5) or EI (2.6). Next, we rigorously establish how Algorithm 1 trades off regret and memory through the threshold ϵ on conditional entropy for whether a point should be included in the GP.
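The test (2.10) itself is one line given the compressed posterior variance at the new action; a minimal sketch (function and variable names are ours):

```python
import numpy as np

def is_informative(post_var_at_x, noise_var, eps):
    """Test (2.10): keep x_t only if H(y_t | y_{1:t-1}) exceeds eps."""
    cond_entropy = 0.5 * np.log(2.0 * np.pi * np.e * (noise_var + post_var_at_x))
    return cond_entropy > eps

def update_dictionary(D, x_t, post_var_at_x, noise_var=1e-3, eps=0.0):
    """Append x_t to the dictionary only when it passes the entropy test."""
    if is_informative(post_var_at_x, noise_var, eps):
        return np.vstack([D, np.atleast_2d(x_t)])
    return D

D = np.array([[0.0]])                                # current dictionary D_{t-1}
D = update_dictionary(D, [3.0], post_var_at_x=0.5)   # informative: kept
D = update_dictionary(D, [3.0], post_var_at_x=1e-6)  # redundant revisit: dropped
```

Even with ϵ = 0 this filters revisits of well-explored actions, since the conditional entropy of a near-noiseless repeat observation falls below zero; larger ϵ compresses more aggressively.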

## 3 Balancing Regret and Complexity

In this section, we establish that Algorithm 1 attains comparable regret (1.1) to the standard GP approach to bandit optimization under the canonical settings of the action space X being a discrete finite set and a continuous compact Euclidean subset, when actions follow the upper-confidence bound (2.5). We further establish sublinear regret of the expected improvement (2.6) when the action set is discrete. We build upon techniques pioneered in (Srinivas et al., 2012; Nguyen et al., 2017). The points of departure in our analysis are: (i) the characterization of the statistical bias induced by the compression rule (2.10) in the regret bounds, and (ii) the relating of properties of the posterior (2.9) and action selections (2.5)-(2.6) to topological properties of the action space to ensure the model order of the GP defined by (2.9) is at worst finite for all t. Next we present the regret performance of Algorithm 1 when actions are selected according to the UCB (2.5).

###### Theorem 3.1.

(Regret of Compressed GP-UCB) Fix δ ∈ (0, 1) and suppose the Gaussian Process prior for f has zero mean with covariance kernel κ(x, x′). Define the constant C₁ := 8/log(1 + σ^{−2}) as in (Srinivas et al., 2012). Then under the following parameter selections and conditions on the data domain X, we have:

1. (Finite decision set) For finite cardinality |X|, with exploration parameter selected as

 β_t = 2 log(|X| t² π² / (6δ)),

the accumulated regret is sublinear with probability 1 − δ:

 P{Reg_T ≤ √(C₁ T β_T γ̂_T) + √(ϵT)} ≥ 1 − δ (3.1)

where ϵ is the compression budget.

2. (General decision set) For a continuous compact set X ⊂ [0, r]^d, assume the derivatives of the GP sample paths are bounded with high probability, i.e., for constants a, b > 0,

 P{sup_{x∈X} |∂f/∂x_j| > L} ≤ a e^{−(L/b)²}  for j = 1, …, d. (3.2)

Then, under exploration parameter

 β_t = 2 log(t² 2π²/(3δ)) + 2d log(t² d b r √(log(4da/δ))),

the accumulated regret satisfies

 P{Reg_T ≤ √(C₁ T β_T γ̂_T) + √(ϵT) + π²/6} ≥ 1 − δ. (3.3)

Theorem 3.1, whose proof is provided in the supplementary material, establishes that Algorithm 1 with actions selected according to (2.5) attains sublinear regret with high probability when the action space is discrete and finite, as well as when it is a continuous compact subset of Euclidean space, up to factors depending on the maximum information gain (2.8) and the compression budget ϵ in (2.10). The sublinear dependence of the information gain γ_T on T in terms of the parameter dimension d is derived in (Srinivas et al., 2012, Sec. V-B) for common kernels such as the linear, Gaussian, and Matérn.

The proof follows a path charted in (Srinivas et al., 2012, Appendix I), except that we must contend with the compression-induced error. Specifically, we begin by computing the confidence interval for each action x_t taken by the proposed algorithm at time t. Then, we bound the instantaneous regret in terms of the problem parameters such as δ, β_t, σ, the compression budget ϵ, and the information gain, using the fact that the upper-confidence bound overshoots the maximizer. By summing over time with Cauchy-Schwarz, we build an upper-estimate of cumulative regret from the instantaneous regret. Unsurprisingly, an additional term √(ϵT) appears due to our compression budget in the final regret bounds, which for ϵ = 0 reduces to (Srinivas et al., 2012, Theorems 1 and 2). However, rather than permitting the complexity of the GP to grow unbounded with T, it grows only when informative actions are taken, and the sublinear growth of regret is preserved for any ϵ such that √(ϵT) = o(T), e.g., any constant ϵ > 0.
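Schematically (our paraphrase of the argument, not the authors' exact derivation), the final step combines the per-step bound with Cauchy-Schwarz:

```latex
\mathrm{Reg}_T \;=\; \sum_{t=1}^{T} r_t
 \;\le\; \sqrt{\,T \sum_{t=1}^{T} r_t^2\,}
 \;\le\; \sqrt{C_1 T \beta_T \hat{\gamma}_T} \;+\; \sqrt{\epsilon T},
```

where r_t = f(x*) − f(x_t) is the instantaneous regret, the first inequality is Cauchy-Schwarz, and the second uses a per-step bound on r_t² in terms of β_t and the compressed posterior variance plus compression error: the variance sum is controlled by the information gain γ̂_T, while the compression error accumulates to at most ϵT (splitting via √(a + b) ≤ √a + √b).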

Next, we analyze the performance of Algorithm 1 when actions are selected according to the expected improvement (2.6).

###### Theorem 3.2.

(Regret of Compressed GP-EI) Suppose we select actions based upon Expected Improvement (2.6) together with the conditional entropy-based rule (2.10) for retaining past points in the GP posterior, as detailed in Algorithm 1. Then, under the same conditions and parameter selection as in Theorem 3.1, when X is a finite discrete set, the regret is sublinear with probability 1 − δ, i.e.,

 P{Reg_T ≤ √(2T(γ_T + ϵT)/log(1 + σ^{−2})) [√(3(β_T + 1 + R²)) + √β_T]} ≥ 1 − δ, (3.4)

where

 R := sup_{t≥0} sup_{x∈X} |μ_{t−1}(x) − y^max| / σ_{t−1}(x)

is the maximum value of the z-score, and β_T is as defined in Lemma 9.7.

The proof is provided in Appendix 9. In Theorem 3.2, we have characterized how the regret of Algorithm 1 depends on the compression budget ϵ when actions are selected according to the EI rule. We note that (3.4) holds for a discrete action space X; the result for a continuous action space follows by combining the proof of statement 2 of Theorem 3.1 with the proof of Theorem 3.2. The proof of Theorem 3.2 follows a path similar to that of Nguyen et al. (2017). We start by upper bounding the instantaneous improvements achieved by the proposed compressed EI algorithm in terms of the acquisition function in Lemma 9.3. Further, the sum of the predictive variances of the compressed version over T instances is upper bounded in terms of the maximum information gain in Lemma 9.5. Then we upper bound the cumulative sum of the instantaneous regret in terms of the model parameters δ, β_T, σ, ϵ, and γ_T. Similar to the analysis that gives rise to Theorem 3.1, an additional term arises due to compression-induced error, which explicitly trades off regret and complexity. Moreover, note that setting ϵ = 0 recovers the result of Nguyen et al. (2017).

Next, we establish that the main merit of performing this statistical test inside a bandit algorithm is that it controls the complexity of the belief model that drives action selections. In particular, Theorem 3.3 formalizes that the dictionary defined by (2.10) in Algorithm 1 always contains a finite number of elements, even as T → ∞.

###### Theorem 3.3.

Suppose that the conditional entropy H(y_t | y_{1:t−1}) is bounded for all t. Then, the number of elements M_T in the dictionary D_T of the GP posterior of Algorithm 1 remains finite as T → ∞ for any fixed compression threshold ϵ > 0.

The implications of Theorem 3.3 are that Algorithm 1 only retains significant actions in belief formation and drops extraneous points. Interestingly, this result states that despite infinitely many actions being taken in the limit, only finitely many of them are ϵ-informative. In principle, one could make ϵ adaptive with t to improve performance, but analyzing such a choice becomes complicated, as relating the worst-case model complexity to the covering number of the space would then depend on variable sets whose conditional entropy is at least ϵ_t. In the next section, we evaluate the merit of these conceptual results in experimental settings involving black box non-convex optimization and hyper-parameter tuning of linear logistic regressors.

## 4 Experiments

In this section, we evaluate the performance of the statistical compression method under a few different action selections (acquisition functions). Specifically, Algorithm 1 employs the Upper Confidence Bound (UCB) or Expected Improvement (EI) (Nguyen et al., 2017) acquisition function, but the key insight here is a modification of the GP posterior, not the action selection. Thus, we validate its use for Most Probable Improvement (MPI) (Wang and de Freitas, 2014) as well, defined as

 α_MPI(x) = σ_{t−1}(x) φ(z) + [μ_{t−1}(x) − ξ] Φ(z),

where φ and Φ denote the standard Gaussian density and distribution functions, ξ is an improvement target, and z = (μ_{t−1}(x) − ξ)/σ_{t−1}(x) is the centered z-score. We further compare the compression scheme against Budgeted Kernel Bandits (BKB) proposed by (Calandriello et al., 2019), which randomly adds or drops points according to a distribution that is inversely proportional to the posterior variance, also on the aforementioned acquisition functions.

Unless otherwise specified, the squared exponential kernel is used to represent the correlation between inputs, the lengthscale is set to 1.0, the noise prior is set to 0.001, the compression budget ϵ is held fixed, and the confidence bounds hold with probability at least 0.9. As a common practice across all three problems, we initialize the Gaussian priors with training data randomly collected from the input domain, where d is the input dimension. We quantify performance using the mean average regret over the iterations and over clock time. In addition, the model order, or number of points defining the GP posterior, is visualized over time to characterize the compression of the training dictionary. To ensure fair evaluations, all the listed simulations were performed on a PC with a 1.8 GHz Intel Core i7 CPU and 16 GB memory. The same initial priors and parameters are used to assess computational efficiency in terms of the compression.

### 4.1 Example function

Firstly, we evaluate our proposed method on an example function given by

 f(x) = sin(x) + cos(x) + 0.1x. (4.1)

Random Gaussian noise is added to every observation of f, to emulate practical applications of Bayesian optimization in which the black box function is often corrupted by noise.
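To reproduce the flavor of this experiment, one can run compressed GP-UCB on (4.1) over a discretized interval; the following self-contained sketch uses our own grid, kernel, and parameter choices, not the authors' exact setup:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(x) + np.cos(x) + 0.1 * x        # objective (4.1)
grid = np.linspace(0.0, 10.0, 200)                   # discretized action set
kern = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)
noise_var, eps, beta = 1e-3, 0.5, 4.0                # our own parameter choices

D, y = np.empty(0), np.empty(0)                      # compressed dictionary and targets
model_order = []
for t in range(100):
    if D.size == 0:
        mu, var = np.zeros_like(grid), np.ones_like(grid)
    else:
        A = kern(D, D) + noise_var * np.eye(D.size)  # K_{D,D} + sigma^2 I
        kq = kern(D, grid)
        mu = kq.T @ np.linalg.solve(A, y)
        var = np.maximum(1.0 - np.sum(kq * np.linalg.solve(A, kq), axis=0), 0.0)
    i = int(np.argmax(mu + np.sqrt(beta * var)))     # UCB action (2.5)
    y_t = f(grid[i]) + np.sqrt(noise_var) * rng.standard_normal()
    # test (2.10): retain the action only if its conditional entropy exceeds eps
    if 0.5 * np.log(2 * np.pi * np.e * (noise_var + var[i])) > eps:
        D, y = np.append(D, grid[i]), np.append(y, y_t)
    model_order.append(D.size)
```

In runs of this sketch, the model order stops growing once every region of the interval has been resolved below the entropy threshold, whereas the dense GP would retain all 100 points.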

The results of this experiment are shown in Figure 1, and the associated wall clock times are reported in Table 1. Observe that the compression rule (2.10) yields regret that is typically comparable to the dense GP, with orders of magnitude reduction in model complexity. This complexity reduction, in turn, permits a state of the art tradeoff in regret versus wall clock time for certain acquisition functions, i.e., the UCB and EI, but not MPI. Interestingly, the model complexity of Algorithm 1 settles to a constant determined by the covering number (metric entropy) of the action space, validating the conceptual result of Theorem 3.3.

### 4.2 Rosenbrock Function

For the second experiment, we compare the compressed variants with their baseline algorithms on a two-dimensional non-convex function popularly known as the Rosenbrock function, given by:

 f(x, y) = (a − x)² + b(y − x²)². (4.2)

The Rosenbrock function is a common benchmark non-convex function used to validate the performance of global optimization methods. Here we fix its parameters a and b throughout. Again, we run the various (dense and reduced-order) Gaussian Process bandit algorithms with different acquisition functions.

The results of this experiment are displayed in Figure 2 with associated wall clock times collected in Table 2. Again, we observe that compression with respect to conditional entropy yields a minimal reduction in performance in terms of regret while translating to a significant reduction of complexity. Specifically, rather than growing linearly with the number of past actions, as is standard in nonparametric statistics, the model order settles down to an intrinsic constant determined by the metric entropy of the action space. This means that we obtain a state of the art tradeoff in model complexity versus regret, as compared with the dense GP or probabilistic dropping inversely proportional to the variance, as in (Calandriello et al., 2019).

### 4.3 Hyper-parameter Tuning in Logistic Regression

In this subsection, we propose using bandit algorithms to automate the hyper-parameter tuning of machine learning algorithms. More specifically, we propose using Algorithm 1 and variants with different acquisition functions to tune the following hyper-parameters of a supervised learning scheme, whose concatenation forms the action space: the learning rate, batch size, dropout of the inputs, and the regularization constant. The specific supervised learning problem we focus on is the training of a multi-class logistic regressor over the MNIST training set (LeCun and Cortes, 2010) for classifying handwritten digits. The instantaneous reward here is the statistical accuracy on a hold-out validation set.

Considering the high-dimensional input domain and the number of training examples, the GP dictionary may grow to a large size. In large-scale settings, the input space could be much larger, with many more hyper-parameters to tune, in which case GPs may be computationally intractable. The statistical compression proposed here ameliorates this issue by keeping the size of the training dictionary in check, which makes it feasible for hyper-parameter tuning as the number of training examples becomes large.

The results of this implementation are given in Figure 3 with associated compute times in Table 3. Observe that the trend identified in the previous two examples carries over to practice here: the compression technique (2.10) yields algorithms whose regret is comparable to the dense GP, with a significant reduction in model complexity that eventually settles to a constant. This constant is a fundamental measure of the complexity of the action space required for finding a no-regret policy. Overall, then, one can run Algorithm 1 on the back-end of any training scheme for supervised learning in order to automate the selection of hyper-parameters in perpetuity without worrying about eventual slowdown.

## 5 Conclusions

We considered bandit problems whose action spaces are discrete but have large cardinality, or are continuous. The canonical performance metric, regret, quantifies how well bandit action selection is against a best comparator in hindsight. By connecting regret to maximum information-gain based exploration which may be quantified by variance, one may find no-regret algorithms through variance maximization. Doing so yields actions which over-prioritize exploration. To balance between exploration and exploitation, that is, moving towards the optimum in finite time, we focused on upper-confidence bound based action selection. Following a number of previous works for bandits with large action spaces, we parameterized the action distribution as a Gaussian Process in order to have a closed form expression for the a posteriori variance.

Unfortunately, Gaussian Processes exhibit complexity challenges when operating ad infinitum: the complexity of computing posterior parameters grows cubically with the time index. While numerous previous memory-reduction methods exist for GPs, designing compression for bandit optimization is relatively unexplored. Within this gap, we proposed a compression rule for the GP posterior explicitly derived by information-theoretic regret bounds, where the conditional entropy encapsulates the per-step progress of the bandit algorithm. This compression only includes past actions whose conditional entropy exceeds an -threshold to enter into the posterior.

As a result, we derived explicit tradeoffs between model complexity and information-theoretic regret. Moreover, the complexity of the resulting GP posterior is at worst finite and depends on the covering number (metric entropy) of the action space, a fundamental constant that determines the bandit problem's difficulty. In experiments, we observed a favorable tradeoff between regret, model complexity, and iteration index/clock time for a couple of toy non-convex optimization problems, as well as the practical problem of tuning the hyper-parameters of a supervised machine learning model.

Future directions include extensions to non-stationary bandit problems, generalizations to history-dependent action selection strategies such as step-wise uncertainty reduction methods (Villemonteix et al., 2009), and information-theoretic compression of deep neural networks based on bandit algorithms.

## References

• P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire (2002) The nonstochastic multiarmed bandit problem. SIAM journal on computing 32 (1), pp. 48–77. Cited by: §1.
• M. Bauer, M. van der Wilk, and C. E. Rasmussen (2016) Understanding probabilistic sparse gaussian process approximations. In Advances in neural information processing systems, pp. 1533–1541. Cited by: §1.
• J. Bect, F. Bachoc, D. Ginsbourger, et al. (2019) A supermartingale approach to gaussian process based sequential design of experiments. Bernoulli 25 (4A), pp. 2883–2919. Cited by: §1.
• D. Bertsimas and J. Tsitsiklis (1993) Simulated annealing. Statistical Science 8 (1), pp. 10–15. Cited by: §1.
• S. Bubeck, G. Stoltz, and J. Y. Yu (2011) Lipschitz bandits without the lipschitz constant. In International Conference on Algorithmic Learning Theory, pp. 144–158. Cited by: §1.
• T. D. Bui, C. Nguyen, and R. E. Turner (2017) Streaming sparse gaussian process approximations. In Advances in Neural Information Processing Systems, pp. 3301–3309. Cited by: §1.
• D. Calandriello, L. Carratino, A. Lazaric, M. Valko, and L. Rosasco (2019) Gaussian process optimization with adaptive sketching: scalable and no regret. In Conference on Learning Theory, pp. 533–557. Cited by: §4.2, §4.
• K. Chaloner and I. Verdinelli (1995) Bayesian experimental design: a review. Statistical Science, pp. 273–304. Cited by: §1.
• T. M. Cover and J. A. Thomas (2012) Elements of information theory. John Wiley & Sons. Cited by: §2, §7.1.
• L. Csató and M. Opper (2002) Sparse on-line gaussian processes. Neural computation 14 (3), pp. 641–668. Cited by: §1.
• L. Davis (1991) Handbook of genetic algorithms. Cited by: §1.
• N. De Freitas, A. Smola, and M. Zoghi (2012) Exponential regret bounds for gaussian process bandits with deterministic observations. arXiv preprint arXiv:1206.6457. Cited by: §1.
• V. Elvira, J. Míguez, and P. M. Djurić (2016) Adapting the number of particles in sequential monte carlo methods through an online scheme for convergence assessment. IEEE Transactions on Signal Processing 65 (7), pp. 1781–1794. Cited by: §1.
• Y. Engel, S. Mannor, and R. Meir (2004) The kernel recursive least-squares algorithm. IEEE Transactions on Signal Processing 52 (8), pp. 2275–2285. External Links: Document, ISSN 1941-0476 Cited by: §8.
• P. I. Frazier, W. B. Powell, and S. Dayanik (2008) A knowledge-gradient policy for sequential information collection. SIAM Journal on Control and Optimization 47 (5), pp. 2410–2439. Cited by: §1.
• P. I. Frazier (2018) A tutorial on bayesian optimization. arXiv preprint arXiv:1807.02811. Cited by: §1.
• S. Ghosal, J. K. Ghosh, A. W. Van Der Vaart, et al. (2000) Convergence rates of posterior distributions. Annals of Statistics 28 (2), pp. 500–531. Cited by: §1.
• J. Gittins, K. Glazebrook, and R. Weber (2011) Multi-armed bandit allocation indices. John Wiley & Sons. Cited by: §1.
• A. Gopalan, S. Mannor, and Y. Mansour (2014) Thompson sampling for complex online problems. In International Conference on Machine Learning, pp. 100–108. Cited by: §1.
• K. G. Jamieson, L. Jain, C. Fernandez, N. J. Glattard, and R. Nowak (2015) Next: a system for real-world development, evaluation, and application of active learning. In Advances in neural information processing systems, pp. 2656–2664. Cited by: §1.
• S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi (1983) Optimization by simulated annealing. science 220 (4598), pp. 671–680. Cited by: §1.
• C. Ko, J. Lee, and M. Queyranne (1995) An exact algorithm for maximum entropy sampling. Operations Research 43 (4), pp. 684–691. Cited by: §2.
• A. Koppel (2019) Consistent online gaussian process regression without the sample complexity bottleneck. In 2019 American Control Conference (ACC), pp. 3512–3518. Cited by: §1.
• A. Krause, A. Singh, and C. Guestrin (2008) Near-optimal sensor placements in gaussian processes: theory, efficient algorithms and empirical studies. Journal of Machine Learning Research 9 (Feb), pp. 235–284. Cited by: §2.
• D. G. Krige (1951) A statistical approach to some basic mine valuation problems on the witwatersrand. Journal of the Southern African Institute of Mining and Metallurgy 52 (6), pp. 119–139. Cited by: §1.
• Y. LeCun and C. Cortes (2010) MNIST handwritten digit database. External Links: Link Cited by: §4.3.
• L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar (2017) Hyperband: a novel bandit-based approach to hyperparameter optimization. The Journal of Machine Learning Research 18 (1), pp. 6765–6816. Cited by: §1.
• S. Magureanu, R. Combes, and A. Proutière (2014) Lipschitz bandits: regret lower bounds and optimal algorithms. In COLT 2014, Cited by: §1.
• G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher (1978) An analysis of approximations for maximizing submodular set functions–I. Mathematical programming 14 (1), pp. 265–294. Cited by: §2.
• V. Nguyen, S. Gupta, S. Rana, C. Li, and S. Venkatesh (2017) Regret for expected improvement over the best-observed value and stopping condition. In Asian Conference on Machine Learning, pp. 279–294. Cited by: §1, §2, §3, §3, §4, §9.1, §9.1, §9.1, §9.1, §9.1, §9.2.
• W. B. Powell and I. O. Ryzhov (2012) Optimal learning. Vol. 841, John Wiley & Sons. Cited by: §1, §2.
• W. H. Press (2009) Bandit solutions provide unified ethical models for randomized clinical trials and comparative effectiveness research. Proceedings of the National Academy of Sciences 106 (52), pp. 22387–22392. Cited by: §1.
• C. E. Rasmussen (2004) Gaussian processes in machine learning. In Advanced lectures on machine learning, pp. 63–71. Cited by: §1, §2.
• D. J. Russo, B. Van Roy, A. Kazerouni, I. Osband, Z. Wen, et al. (2018) A tutorial on thompson sampling. Foundations and Trends® in Machine Learning 11 (1), pp. 1–96. Cited by: §1.
• J. Shotton, M. Johnson, and R. Cipolla (2008) Semantic texton forests for image categorization and segmentation. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. Cited by: §1.
• A. Smith (2013) Sequential monte carlo methods in practice. Springer Science & Business Media. Cited by: §1.
• J. Snoek, H. Larochelle, and R. P. Adams (2012) Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pp. 2951–2959. Cited by: §1.
• N. Srinivas, A. Krause, S. M. Kakade, and M. W. Seeger (2012) Information-theoretic regret bounds for gaussian process optimization in the bandit setting. IEEE Transactions on Information Theory 58 (5), pp. 3250–3265. Cited by: §1, §1, §2, §2, §2, §3, §3, §3, item ii, §9.1.
• V. Srivastava, P. Reverdy, and N. E. Leonard (2013) On optimal foraging and multi-armed bandits. In 2013 51st Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 494–499. Cited by: §1.
• J. Villemonteix, E. Vazquez, and E. Walter (2009) An informational approach to the global optimization of expensive-to-evaluate functions. Journal of Global Optimization 44 (4), pp. 509. Cited by: §1, §5.
• Z. Wang and N. de Freitas (2014) Theoretical analysis of bayesian optimisation with unknown gaussian process hyper-parameters. arXiv preprint arXiv:1406.7758. Cited by: §1, §4.

## 6 Preliminaries

Before proceeding with the detailed proofs, we define some notation to clarify the exposition. In the analysis, it is important to differentiate the actions taken by the standard (uncompressed) GP-UCB algorithm defined by (2.5)-(2.7) from those of Algorithm 1, which employs information-gain-based compression. Therefore, we adopt the following notation.

1. We denote the parameters of the posterior defined by (2.7) without compression as $\mu_{t-1}$ for the mean and $\sigma_{t-1}$ for the covariance, and the resulting action sequence from (2.5) as $\{x_t\}$.

2. For the proposed Algorithm 1, we write $\hat{x}_t$ for the actions, $\hat{\mu}_t$ for the means, and $\hat{\sigma}_t$ for the covariance functions, to emphasize that they are approximations of the scheme in (Srinivas et al., 2012).

For clarity, we restate the proposed Algorithm 1 as Algorithm 2 using this notation.

Subsequently, we pursue proofs in terms of the aforementioned definitions.
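For concreteness, the UCB action selection shared by both algorithms can be sketched for a finite action set as below; the lists `mu` and `sigma` stand in for the compressed posterior mean $\hat{\mu}_{t-1}$ and standard deviation $\hat{\sigma}_{t-1}$ evaluated at each action, and the function name is illustrative (a minimal sketch, not the full Algorithm 2, which additionally applies the entropy-based compression test after each observation):

```python
import math

def ucb_action(mu, sigma, beta):
    # Select the index maximizing mu[x] + sqrt(beta) * sigma[x], i.e., the
    # upper-confidence-bound rule over a finite action set.
    scores = [m + math.sqrt(beta) * s for m, s in zip(mu, sigma)]
    return max(range(len(scores)), key=scores.__getitem__)

mu = [0.2, 0.5, 0.1]      # posterior means at each action
sigma = [0.9, 0.1, 0.3]   # posterior standard deviations at each action
print(ucb_action(mu, sigma, beta=4.0))  # 0: high uncertainty dominates
print(ucb_action(mu, sigma, beta=0.0))  # 1: pure exploitation picks the best mean
```

The exploration weight $\beta_t$ thus controls how strongly uncertain actions are favored over actions with high estimated mean.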

## 7 Proof of Theorem 3.1

The statement of Theorem 3.1 is divided into two parts: a finite decision set (statement (i)) and a compact convex action space (statement (ii)). We present the proofs of the two statements separately.

### 7.1 Proof of Theorem 3.1 statement (i)

The proof of Theorem 3.1(i) is based on upper bounding the difference $|f(x) - \hat{\mu}_{t-1}(x)|$ by a scaled version of the standard deviation $\hat{\sigma}_{t-1}(x)$, which we state next.

###### Lemma 7.1.

Choose $\delta \in (0,1)$ and let $\beta_t = 2\log\big(|\mathcal{X}|/(\delta\pi_t)\big)$, for some $\pi_t > 0$ such that $\sum_{t\geq 1}\pi_t = 1$. Then the parameters of the approximate GP posterior in Algorithm 1 satisfy

 $|f(x) - \hat{\mu}_{t-1}(x)| \leq \beta_t^{1/2}\,\hat{\sigma}_{t-1}(x) \quad \forall x \in \mathcal{X},\; \forall t \geq 1$   (7.1)

with probability at least $1-\delta$.

###### Proof.

At each $t$, we have a dictionary $\mathcal{D}_t$ which contains the data points (actions taken so far) for the function $f$. For a given $x$ and $\mathcal{D}_t$, the posterior distribution of $f(x)$ is Gaussian with mean $\hat{\mu}_{t-1}(x)$ and variance $\hat{\sigma}^2_{t-1}(x)$. In Algorithm 2, we take actions $\hat{x}_t$ and observe $\hat{y}_t$, which are different from the $x_t$ and $y_t$ of the uncompressed bandit algorithm [cf. (2.5)-(2.7)].

Hence, $\hat{\mu}_{t-1}$ and $\hat{\sigma}_{t-1}$ are the parameters of a Gaussian whose entropy is given by $\frac{1}{2}\log\big(2\pi e\,\hat{\sigma}^2_{t-1}(x)\big)$. This Gaussian is parametrized by the collection of data points $\mathcal{D}_t$. At time $t$, we take an action $\hat{x}_t$, after which we observe $\hat{y}_t$. Then we check the conditional entropy of the new observation given $\mathcal{D}_t$: if the conditional entropy is higher than $\epsilon$, then we update the GP distribution; otherwise we do not (Algorithm 2). Hence, there is a fundamental difference between the posterior distributions and action selections of the two schemes. We seek to analyze the performance of the proposed algorithm in terms of the regret defined against the optimal action $x^\ast = \arg\max_{x\in\mathcal{X}} f(x)$. To do so, we exploit some properties of the Gaussian distribution; specifically, for a standard normal random variable $r \sim \mathcal{N}(0,1)$, the tail probability can be expressed as

 $P(r>c) = \frac{1}{\sqrt{2\pi}}\int_c^{\infty} e^{-r^2/2}\,dr = e^{-c^2/2}\,\frac{1}{\sqrt{2\pi}}\int_c^{\infty} e^{-(r-c)^2/2}\,e^{-c(r-c)}\,dr.$   (7.2)

For $r > c$ and $c > 0$, we have $e^{-c(r-c)} \leq 1$. Furthermore, the remaining integral scaled by $\frac{1}{\sqrt{2\pi}}$ is the density of a Gaussian with mean $c$ and unit standard deviation integrated from $c$ to $\infty$, which equals $P(r>0)$. Therefore, we get

 $P(r>c) \leq e^{-c^2/2}\,P(r>0) = \frac{1}{2}e^{-c^2/2}.$   (7.3)
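The bound (7.3) can be checked numerically against the exact Gaussian tail, which is expressible via the complementary error function as $P(r>c) = \frac{1}{2}\,\mathrm{erfc}(c/\sqrt{2})$ (a quick sanity check, not part of the proof):

```python
import math

def gauss_tail(c):
    # Exact standard-normal tail probability P(r > c).
    return 0.5 * math.erfc(c / math.sqrt(2.0))

def tail_bound(c):
    # The bound in (7.3): P(r > c) <= 0.5 * exp(-c^2 / 2) for c >= 0.
    return 0.5 * math.exp(-c * c / 2.0)

# The bound holds with equality at c = 0 and is loose but valid for c > 0.
for c in [0.0, 0.5, 1.0, 2.0, 4.0]:
    assert gauss_tail(c) <= tail_bound(c) + 1e-15
    print(f"c={c}: tail={gauss_tail(c):.3e}  bound={tail_bound(c):.3e}")
```

Both sides equal $1/2$ at $c=0$, and the gap grows with $c$, which is harmless since only the exponential decay rate enters the regret analysis.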

Now set $r = \big(f(x) - \hat{\mu}_{t-1}(x)\big)/\hat{\sigma}_{t-1}(x)$, which is standard normal under the posterior, and $c = \beta_t^{1/2}$ for some sequence of nonnegative scalars $\{\beta_t\}$. Substituting this choice into (7.3), we obtain

 $P\big\{|f(x) - \hat{\mu}_{t-1}(x)| > \beta_t^{1/2}\,\hat{\sigma}_{t-1}(x)\big\} \leq e^{-\beta_t/2}.$   (7.4)

Now apply Boole’s inequality to the preceding expression to write

 $P\Big\{\bigcup_{x\in\mathcal{X}} |f(x) - \hat{\mu}_{t-1}(x)| > \beta_t^{1/2}\,\hat{\sigma}_{t-1}(x)\Big\} \leq \sum_{x\in\mathcal{X}} P\big\{|f(x) - \hat{\mu}_{t-1}(x)| > \beta_t^{1/2}\,\hat{\sigma}_{t-1}(x)\big\} \leq |\mathcal{X}|\,e^{-\beta_t/2}.$   (7.5)

To obtain the result in the statement of Lemma 7.1, select $\beta_t$ such that $|\mathcal{X}|\,e^{-\beta_t/2} = \delta\pi_t$, with a scalar parameter sequence $\pi_t > 0$ satisfying $\sum_{t\geq 1}\pi_t = 1$. Applying Boole's inequality again over all time points $t$, we get

 $P\Big\{\bigcup_{t=1}^{\infty} |f(x) - \hat{\mu}_{t-1}(x)| > \beta_t^{1/2}\,\hat{\sigma}_{t-1}(x)\Big\} \leq \sum_{t=1}^{\infty} P\big\{|f(x) - \hat{\mu}_{t-1}(x)| > \beta_t^{1/2}\,\hat{\sigma}_{t-1}(x)\big\} \leq \sum_{t=1}^{\infty} \delta\pi_t = \delta.$   (7.6)

The last equality is true since $\sum_{t\geq 1}\pi_t = 1$. We reverse the inequality to obtain an upper bound on the absolute difference between the true function and the estimated mean function for all $x\in\mathcal{X}$ and $t\geq 1$, so that

 $|f(x) - \hat{\mu}_{t-1}(x)| \leq \beta_t^{1/2}\,\hat{\sigma}_{t-1}(x), \quad \forall x\in\mathcal{X},\; \forall t\geq 1$   (7.7)

holds with probability at least $1-\delta$, as stated in Lemma 7.1. ∎
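The time-union bound in (7.6) can be verified numerically for one standard choice of the sequence, $\pi_t = 6/(\pi^2 t^2)$, which satisfies $\sum_{t\geq 1}\pi_t = 1$ because $\sum_{t\geq 1} 1/t^2 = \pi^2/6$ (an illustrative check under this particular assumed choice of $\pi_t$):

```python
import math

delta = 0.05
T = 200_000  # truncation horizon for the infinite sum

# pi_t = 6 / (pi^2 * t^2) sums to 1 over t >= 1, so the per-step failure
# probabilities delta * pi_t sum to exactly delta in the limit.
partial = delta * sum(6.0 / (math.pi ** 2 * t ** 2) for t in range(1, T + 1))

assert partial <= delta              # the total failure probability never exceeds delta
assert abs(partial - delta) < 1e-4   # and approaches delta as T grows
print(partial)
```

Any positive sequence summing to one works; this choice simply makes $\beta_t$ grow logarithmically in $t$, as required for the regret bounds.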

###### Lemma 7.2.

Fix $t\geq 1$. If $|f(x) - \hat{\mu}_{t-1}(x)| \leq \beta_t^{1/2}\,\hat{\sigma}_{t-1}(x)$ for all $x\in\mathcal{X}$, then the instantaneous regret is bounded as

 $r_t \leq 2\beta_t^{1/2}\,\hat{\sigma}_{t-1}(\hat{x}_t).$   (7.8)
###### Proof.

Since Algorithm 2 chooses the next sampling point as $\hat{x}_t = \arg\max_{x\in\mathcal{X}} \hat{\mu}_{t-1}(x) + \sqrt{\beta_t}\,\hat{\sigma}_{t-1}(x)$ at each step, we have

 $\hat{\mu}_{t-1}(\hat{x}_t) + \sqrt{\beta_t}\,\hat{\sigma}_{t-1}(\hat{x}_t) \geq \hat{\mu}_{t-1}(x^\ast) + \sqrt{\beta_t}\,\hat{\sigma}_{t-1}(x^\ast) \geq f(x^\ast),$   (7.9)

where the first inequality holds by the definition of the maximum, with $x^\ast = \arg\max_{x\in\mathcal{X}} f(x)$ the optimal point, and the second holds by Lemma 7.1. The instantaneous regret is then bounded as

 $r_t = f(x^\ast) - f(\hat{x}_t) \leq \hat{\mu}_{t-1}(\hat{x}_t) + \beta_t^{1/2}\,\hat{\sigma}_{t-1}(\hat{x}_t) - f(\hat{x}_t).$   (7.10)

But from Lemma 7.1 we have that $f(\hat{x}_t) \geq \hat{\mu}_{t-1}(\hat{x}_t) - \beta_t^{1/2}\,\hat{\sigma}_{t-1}(\hat{x}_t)$ holds with probability at least $1-\delta$. This implies that

 $r_t = f(x^\ast) - f(\hat{x}_t) \leq 2\sqrt{\beta_t}\,\hat{\sigma}_{t-1}(\hat{x}_t).$   (7.11)

Then, (7.11) quantifies the instantaneous regret of the action taken by Algorithm 2, as stated in Lemma 7.2. ∎

###### Lemma 7.3.

The information gain of the actions selected by Algorithm 1, denoted $I(\hat{y}_T; \hat{f}_T)$, admits a closed form in terms of the posterior variance of the compressed GP and the variance of the noise prior as

 $I(\hat{y}_T; \hat{f}_T) = \frac{1}{2}\sum_{t\in\mathcal{M}_T(\epsilon)} \log\big(1 + \sigma^{-2}\,\hat{\sigma}^2_{t-1}(\hat{x}_t)\big)$