On Batch Bayesian Optimization

11/04/2019 ∙ by Sayak Ray Chowdhury, et al.

We present two algorithms for Bayesian optimization in the batch feedback setting, based on Gaussian process upper confidence bound and Thompson sampling approaches, along with frequentist regret guarantees and numerical results.


1 Introduction

Black-box optimization of an unknown function is an important problem in several real-world domains, such as hyper-parameter tuning of complex machine learning models and experimental design, and in recent years the Bayesian optimization framework has gained a lot of traction towards achieving this goal. Bayesian optimization (BO) methods start with a prior distribution, generally a Gaussian process (GP), over a function class, and use function evaluations to compute the posterior distribution. Popular strategies in this vein include expected improvement (GP-EI) Močkus (1975), probability of improvement (GP-PI) Wang et al. (2016), upper confidence bounds (GP-UCB) Srinivas et al. (2012), Thompson sampling (GP-TS) Chowdhury and Gopalan (2017b), and predictive-entropy search Hernández-Lobato et al. (2015). In some cases, it is possible, and also desirable, to evaluate the function in batches, e.g., when parallelizing an expensive computer simulation over multiple cores. In this case, we can gather more information within the same time window, but future decisions need to be taken without the benefit of the evaluations in progress. A flurry of parallel (equivalently, batch) Bayesian optimization strategies have been developed recently to address this problem Desautels et al. (2014); Kandasamy et al. (2018); Contal et al. (2013); Kathuria et al. (2016); González et al. (2016). In this work, we further explore the potential of batch BO strategies for black-box optimization, assuming that the unknown function lies in the Reproducing Kernel Hilbert Space (RKHS) induced by a symmetric positive semi-definite kernel.

Contributions. We design a new algorithm, Improved Gaussian Process-Batch Upper Confidence Bound (IGP-BUCB), for batch Bayesian optimization. It is a variant of the GP-BUCB algorithm of Desautels et al. (2014), but with a significantly reduced confidence interval, resulting in an order-wise improvement in its regret bound. We also develop a nonparametric version of Thompson sampling, namely Gaussian Process-Batch Thompson Sampling (GP-BTS), and prove the first frequentist guarantee for TS in the setting of batch Bayesian optimization. To put this in perspective, GP-BTS can be seen as a variant of the asyTS algorithm of Kandasamy et al. (2018), but the setting under which it is analyzed in this work is agnostic, i.e., the reward function is fixed but unknown, whereas Kandasamy et al. (2018) consider the purely Bayesian setup. Finally, we confirm empirically the efficiency of IGP-BUCB and GP-BTS on several synthetic and real-world datasets.

2 Problem Statement

We consider the problem of sequentially maximizing a fixed but unknown reward function $f : D \to \mathbb{R}$ over a set of decisions (equivalently, arms or actions) $D$. An algorithm for this problem chooses, at each round $t$, an action $x_t \in D$, and observes a noisy reward $y_t = f(x_t) + \varepsilon_t$. We assume that the noise sequence $\{\varepsilon_t\}_{t \ge 1}$ is conditionally $R$-sub-Gaussian for a fixed constant $R \ge 0$, i.e., for all $t \ge 0$ and $\eta \in \mathbb{R}$, $\mathbb{E}\big[e^{\eta \varepsilon_t} \mid \mathcal{F}_{t-1}\big] \le \exp(\eta^2 R^2 / 2)$, where $\mathcal{F}_{t-1}$ is the $\sigma$-algebra generated by the random variables $\{x_s, \varepsilon_s\}_{s=1}^{t-1}$ and $x_t$.

The decision $x_t$ is chosen causally, depending upon the arms played and the rewards available till round $t$. Specifically, for each decision round $t$, let $\mathrm{fb}[t] < t$ represent the index of the most recent round for which the reward is available, so that $x_t$ can be chosen using only the rewards obtained till round $\mathrm{fb}[t]$, along with the actions (naturally) known to the algorithm until round $t-1$. We assume that $t - \mathrm{fb}[t] \le B$ for a known constant $B \ge 1$, i.e., rewards are available as batches of variable length up to $B$. For example: (a) if $\mathrm{fb}[t] = \lfloor (t-1)/B \rfloor \cdot B$, then rewards are available as batches of length $B$, and this is denoted the simple batch setting, and (b) if $\mathrm{fb}[t] = \max\{t - B, 0\}$, then the rewards are delayed by $B$ time periods, and this is denoted the simple delay setting. An important special case is $B = 1$ or, equivalently, $\mathrm{fb}[t] = t - 1$. Then all the rewards till round $t-1$ are available, and this represents the standard strictly sequential setting.

Regret. A natural goal of a sequential algorithm is to maximize its cumulative reward $\sum_{t=1}^{T} f(x_t)$ over a time horizon $T$ or, equivalently, to minimize its cumulative regret $R_T = \sum_{t=1}^{T} \big(f(x^\star) - f(x_t)\big)$, where $x^\star$ is a maximum point of $f$ (assuming the maximum is attained; not necessarily unique). A sublinear growth of $R_T$ in $T$ signifies that the time-average regret $R_T / T \to 0$ as $T \to \infty$.

Regularity assumptions. Attaining sublinear regret is impossible in general for arbitrary reward functions $f$, and thus some regularity assumptions are in order. In what follows, we assume that $f$ has small norm in the Reproducing Kernel Hilbert Space (RKHS), denoted $H_k(D)$, of real-valued functions on $D$, with positive semi-definite kernel function $k : D \times D \to \mathbb{R}$. We assume a known bound $S$ on the RKHS norm of $f$, i.e., $\|f\|_k \le S$. Moreover, we assume bounded variance by restricting $k(x, x) \le 1$ for all $x \in D$. Some common kernels, such as the Squared Exponential (SE) kernel and the Matérn kernel, satisfy this property.

3 Algorithms

Representing uncertainty of $f$ via Gaussian processes. We model $f$ as a sample from a Gaussian process prior $GP_D(0, k)$, and assume that the noise variables are drawn i.i.d. from $N(0, \lambda)$. By standard properties of GPs Rasmussen and Williams (2006), conditioned on the history of observations $\mathcal{H}_t = \{(x_s, y_s)\}_{s=1}^{t}$, the posterior over $f$ is also a Gaussian process, $GP_D(\mu_t, k_t)$, with mean function $\mu_t(x) = k_t(x)^\top (K_t + \lambda I)^{-1} y_{1:t}$, kernel function $k_t(x, x') = k(x, x') - k_t(x)^\top (K_t + \lambda I)^{-1} k_t(x')$ and posterior variance $\sigma_t^2(x) = k_t(x, x)$. Here $y_{1:t} = [y_1, \ldots, y_t]^\top$ denotes the vector of rewards observed at the set $\{x_1, \ldots, x_t\}$, $k_t(x) = [k(x_1, x), \ldots, k(x_t, x)]^\top$ denotes the vector of kernel evaluations between $x$ and the elements of the set, and $K_t = [k(x_s, x_{s'})]_{s, s' \le t}$ denotes the kernel matrix computed at $\{x_1, \ldots, x_t\}$.
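For concreteness, the posterior formulas above take only a few lines of NumPy (a minimal sketch under the stated Gaussian model; the SE kernel, its lengthscale, and the noise variance `lam` are our illustrative assumptions):

```python
import numpy as np

def se_kernel(X1, X2, lengthscale=0.2):
    """Squared Exponential kernel matrix between row-stacked points."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * lengthscale ** 2))

def gp_posterior(X, y, Xstar, lam=0.01):
    """Posterior mean and variance at Xstar given observations (X, y)."""
    K = se_kernel(X, X)                      # kernel matrix K_t
    kstar = se_kernel(X, Xstar)              # k_t(x), stacked column-wise
    A = np.linalg.solve(K + lam * np.eye(len(X)), kstar)
    mu = A.T @ y                             # posterior mean mu_t
    var = 1.0 - (kstar * A).sum(axis=0)      # uses k(x, x) = 1 for the SE kernel
    return mu, var
```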

Representing the posterior GP with delayed feedback. In the batch (equivalently, delayed) feedback setup, the only rewards available at the start of round $t$ are $y_{1:\mathrm{fb}[t]}$; however, all the previous decisions $x_1, \ldots, x_{t-1}$ are available. This suggests 'hallucinating' the missing rewards $y_{\mathrm{fb}[t]+1}, \ldots, y_{t-1}$, an idea first proposed by Desautels et al. (2014), using the most recently updated posterior mean $\mu_{\mathrm{fb}[t]}$, i.e., setting $y_s = \mu_{\mathrm{fb}[t]}(x_s)$ for all $\mathrm{fb}[t] < s \le t-1$. By doing this, observe via the iterative GP update equations (1) and (2) below Chowdhury and Gopalan (2017a) that the mean of the posterior including the hallucinated observations remains precisely $\mu_{\mathrm{fb}[t]}$, while the posterior standard deviation decreases to $\sigma_{t-1}$.

$$\mu_t(x) = \mu_{t-1}(x) + \frac{k_{t-1}(x, x_t)}{\lambda + \sigma_{t-1}^2(x_t)}\,\big(y_t - \mu_{t-1}(x_t)\big) \qquad (1)$$
$$k_t(x, x') = k_{t-1}(x, x') - \frac{k_{t-1}(x, x_t)\,k_{t-1}(x_t, x')}{\lambda + \sigma_{t-1}^2(x_t)} \qquad (2)$$

Indeed, update (1) leaves the mean unchanged whenever the hallucinated reward equals the current posterior mean, while update (2) never depends on the rewards, so the covariance shrinks regardless. Therefore, a natural approach to batch Bayesian optimization is to use a decision rule that sequentially chooses actions using all the information available so far, i.e., a rule that uses the most recently updated posterior mean $\mu_{\mathrm{fb}[t]}$ and the posterior standard deviation $\sigma_{t-1}$ to choose the action $x_t$ at round $t$.
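A quick numeric illustration of this invariance (our own example, reusing `gp_posterior` from the sketch above): feeding the current posterior mean back in as a fake reward leaves the posterior mean unchanged, while the posterior variance still shrinks.

```python
rng = np.random.default_rng(0)
X = rng.uniform(size=(5, 1))
y = np.sin(6 * X[:, 0])
Xq = np.linspace(0, 1, 25)[:, None]     # query grid
x_new = rng.uniform(size=(1, 1))        # pending action, reward not yet seen

mu0, var0 = gp_posterior(X, y, Xq)
mu_h, _ = gp_posterior(X, y, x_new)     # hallucinated reward = posterior mean
mu1, var1 = gp_posterior(np.vstack([X, x_new]), np.append(y, mu_h), Xq)

assert np.allclose(mu0, mu1)            # mean is exactly unchanged
assert np.all(var1 <= var0 + 1e-12)     # variance never increases
```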

Improved GP-Batch UCB (IGP-BUCB) algorithm. IGP-BUCB (Algorithm 1), at each round $t$, chooses the action $x_t = \arg\max_{x \in D}\, \mu_{\mathrm{fb}[t]}(x) + \beta_t\,\sigma_{t-1}(x)$, where $\beta_t = e^{C}\big(S + R\sqrt{2(\gamma_{\mathrm{fb}[t]} + 1 + \ln(1/\delta))}\big)$ and $\delta \in (0, 1]$ is a free parameter. $\gamma_t$ denotes the Maximum Information Gain about any $g \sim GP_D(0, k)$ from $t$ noisy observations $\{y_s\}_{s=1}^{t}$, which are obtained by passing $g$ through an i.i.d. Gaussian channel $N(0, \lambda)$. The key quantity $C$ bounds the information we gain about $f$ from the hallucinated observations (there are at most $B-1$ of them at every round) conditioned on the actual observations, in the sense that $I\big(f;\, y_{\mathrm{fb}[t]+1:t-1} \mid y_{1:\mathrm{fb}[t]}\big) \le C$ for all $t$, where $y_{1:\mathrm{fb}[t]}$ and $y_{\mathrm{fb}[t]+1:t-1}$ denote the vectors of actual and hallucinated observations, respectively. This rule inherently trades off exploration (picking points with high uncertainty $\sigma_{t-1}(x)$) with exploitation (picking points with high expected reward $\mu_{\mathrm{fb}[t]}(x)$), where $\beta_t$ serves twin purposes: (a) it balances exploration and exploitation, and (b) it compensates (via the factor $e^{C}$) for the bias created by the hallucinated data, in the attempt to aggressively shrink the confidence interval and reduce exploration.
Note: While Desautels et al. (2014) propose the GP-BUCB algorithm, which uses the same template as IGP-BUCB, we are able to reduce the width of the confidence interval and provably improve the regret bound (Section 4).

  Input: Kernel $k$, feedback mapping $\mathrm{fb}$, parameters $S$, $R$, $\lambda$, $\delta$.
  for $t = 1, 2, 3, \ldots$ do
     Choose $x_t = \arg\max_{x \in D}\, \mu_{\mathrm{fb}[t]}(x) + \beta_t\,\sigma_{t-1}(x)$.
     Update $\sigma_t$ using (2).
     if $\mathrm{fb}[t+1] > \mathrm{fb}[t]$ then
        for $s = \mathrm{fb}[t] + 1, \ldots, \mathrm{fb}[t+1]$ do
           Observe $y_s = f(x_s) + \varepsilon_s$.
           Update $\mu_s$ using (1).
        end for
     end if
  end for
Algorithm 1 IGP-BUCB
  Input: Kernel $k$, feedback mapping $\mathrm{fb}$, parameters $S$, $R$, $\lambda$, $\delta$, discretizations $\{D_t\}_{t \ge 1}$.
  for $t = 1, 2, 3, \ldots$ do
     Sample $f_t \sim GP_{D_t}\big(\mu_{\mathrm{fb}[t]},\, v_t^2\, k_{t-1}\big)$.
     Choose $x_t = \arg\max_{x \in D_t} f_t(x)$.
     Update $\sigma_t$ using (2).
     if $\mathrm{fb}[t+1] > \mathrm{fb}[t]$ then
        for $s = \mathrm{fb}[t] + 1, \ldots, \mathrm{fb}[t+1]$ do
           Observe $y_s = f(x_s) + \varepsilon_s$.
           Update $\mu_s$ using (1).
        end for
     end if
  end for
Algorithm 2 GP-BTS

GP-Batch Thompson Sampling (GP-BTS) algorithm.

Thompson sampling is a randomized strategy which, at every round, chooses an action according to the posterior probability that it is the optimal one. At every round $t$, GP-BTS (Algorithm 2) (a) samples a random function $f_t$ from the posterior Gaussian process $GP_{D_t}\big(\mu_{\mathrm{fb}[t]},\, v_t^2\, k_{t-1}\big)$, where $D_t$ is a suitable discretization (see Appendix C for details) of $D$ and $v_t = e^{C}\big(S + R\sqrt{2(\gamma_{\mathrm{fb}[t]} + 1 + \ln(2/\delta))}\big)$, and (b) chooses the action $x_t = \arg\max_{x \in D_t} f_t(x)$. Here again, $v_t$ plays a role similar to that of $\beta_t$ in IGP-BUCB, i.e., promoting exploration and compensating for the bias of hallucination.
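To tie the pieces together, the following is a rough Python sketch of the common loop of Algorithms 1 and 2 in the simple batch setting over a finite candidate grid, reusing `gp_posterior` from Section 3. The schedules, noise level, and all names are our assumptions, not the authors' code; in particular, the Thompson-sampling branch uses an independent-per-point draw as a crude stand-in for a joint posterior sample (a faithful GP-BTS draw would use the full posterior covariance over the discretization).

```python
def run_batch_bo(f, grid, T, B, beta, rule="ucb", lam=0.01, noise_sd=0.1, seed=0):
    rng = np.random.default_rng(seed)
    X_act = np.empty((0, grid.shape[1]))   # points with revealed rewards
    y_act = np.empty(0)
    pending = []                           # chosen points awaiting rewards
    regret = []
    fstar = f(grid).max()
    for t in range(1, T + 1):
        # mu_{fb[t]}: posterior mean from the actual rewards only
        if len(y_act):
            mu_fb, _ = gp_posterior(X_act, y_act, grid, lam)
        else:
            mu_fb = np.zeros(len(grid))
        # sigma_{t-1}: std dev from actual + pending (hallucinated) points;
        # the GP variance never looks at the y-values, so zeros suffice
        X_all = np.vstack([X_act] + [p[None, :] for p in pending]) \
            if pending else X_act
        if len(X_all):
            _, var = gp_posterior(X_all, np.zeros(len(X_all)), grid, lam)
        else:
            var = np.ones(len(grid))
        sd = np.sqrt(np.maximum(var, 0.0))
        if rule == "ucb":   # IGP-BUCB: argmax of mu_{fb[t]} + beta_t * sigma_{t-1}
            idx = int(np.argmax(mu_fb + beta(t) * sd))
        else:               # GP-BTS: argmax of an (approximate) posterior draw
            idx = int(np.argmax(mu_fb + beta(t) * sd
                                * rng.standard_normal(len(grid))))
        x_t = grid[idx]
        pending.append(x_t)
        regret.append(fstar - float(f(x_t[None, :])[0]))
        if t % B == 0:      # batch boundary: reveal all pending rewards
            Xb = np.vstack(pending)
            yb = f(Xb) + noise_sd * rng.standard_normal(len(Xb))
            X_act = np.vstack([X_act, Xb])
            y_act = np.append(y_act, yb)
            pending = []
    return np.cumsum(regret)

# Example usage: compare the two rules on a 1-d toy function.
grid = np.linspace(0, 1, 200)[:, None]
f = lambda X: np.sin(6 * X[:, 0]) + 0.5 * np.cos(9 * X[:, 0])
for rule in ("ucb", "ts"):
    R = run_batch_bo(f, grid, T=100, B=5, beta=lambda t: 2.0, rule=rule)
    print(rule, "time-average regret:", R[-1] / len(R))
```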

Remark. One particular choice for $C$ is $C = \gamma_{B-1}$ Desautels et al. (2014), and $\gamma_t$ (or an upper bound on it) can be computed given the kernel Srinivas et al. (2009); e.g., $\gamma_t = O\big((\ln t)^{d+1}\big)$ for the Squared Exponential (SE) kernel, and $\gamma_t = O\big(t^{\frac{d(d+1)}{2\nu + d(d+1)}}\,\ln t\big)$ for the Matérn kernel with smoothness parameter $\nu$.

4 Regret Bounds for IGP-BUCB and GP-BTS

Though our algorithms rely on GP priors, the setting under which they are analyzed is agnostic, i.e., the reward function is fixed (non-random) but unknown. This is arguably more challenging Srinivas et al. (2009) than traditional Bayesian regret analysis (expected regret under a random reward function drawn from the known GP prior).

Theorem 1 (Regret bound for IGP-BUCB)

Let $f$ be a member of the RKHS $H_k(D)$, with $\|f\|_k \le S$, and the noise sequence $\{\varepsilon_t\}_{t \ge 1}$ be conditionally $R$-sub-Gaussian. Then, for any $\delta \in (0, 1]$, IGP-BUCB enjoys, with probability at least $1 - \delta$, the regret bound $R_T = O\Big(e^{C}\sqrt{T}\big(S\sqrt{\gamma_T} + \gamma_T + \sqrt{\gamma_T \ln(1/\delta)}\big)\Big)$.

The regret bound for IGP-BUCB is $\tilde O\big(e^{C}\sqrt{T}\,\gamma_T\big)$ with high probability, whereas Desautels et al. (2014) show that GP-BUCB obtains $\tilde O\big(e^{C}\sqrt{T}\,\gamma_T \ln^{3/2} T\big)$ regret with high probability. Hence, we obtain a multiplicative improvement, poly-logarithmic in $T$, in the final regret bound; our numerical experiments reflect this improvement.

Theorem 2 (Regret bound for GP-BTS)

Let $D \subset \mathbb{R}^d$ be compact and convex, $f$ be a member of the RKHS $H_k(D)$, with $\|f\|_k \le S$, and the noise sequence be conditionally $R$-sub-Gaussian. Then, for any $\delta \in (0, 1]$, GP-BTS enjoys, with probability at least $1 - \delta$, the regret bound $R_T = O\Big(e^{C}\sqrt{dT}\big(S\sqrt{\gamma_T} + \gamma_T + \sqrt{\gamma_T \ln(1/\delta)}\big)\Big)$.

The regret bound for GP-BTS is $\tilde O\big(e^{C}\sqrt{d}\,\sqrt{T}\,\gamma_T\big)$ with high probability. Though it is inferior to that of IGP-BUCB in terms of the dependency on the dimension $d$, to the best of our knowledge this represents the first (frequentist) regret guarantee for Thompson sampling in batch Bayesian optimization.
Remark. In the strictly sequential setup ($B = 1$ and $\mathrm{fb}[t] = t - 1$), IGP-BUCB and GP-BTS reduce to the IGP-UCB and GP-TS algorithms of Chowdhury and Gopalan (2017b), respectively.

$\gamma_T$ is poly-logarithmic in $T$ for popular kernels such as the SE kernel Srinivas et al. (2009). Hence, the regret bounds of our algorithms grow sublinearly with $T$. But if we naively run our algorithms with $C = \gamma_{B-1}$ as discussed in Section 3, then the regret bounds grow at least linearly with the batch size $B$. This can be obviated by incorporating the same initialization scheme (see Appendix D for details) as Desautels et al. (2014), which allows $e^{C}$ to be bounded by a constant independent of $B$.
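As a quick sanity check (our own arithmetic, combining the shape of the bound in Theorem 1 with the SE-kernel MIG bound of Lemma 2 in Appendix A): for the SE kernel,
$$\gamma_T = O\big((\ln T)^{d+1}\big) \quad \Longrightarrow \quad R_T = \tilde O\big(e^{C}\,\sqrt{T}\,(\ln T)^{d+1}\big),$$
so the time-average regret decays as $R_T / T = \tilde O\big((\ln T)^{d+1} / \sqrt{T}\big) \to 0$.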

5 Experiments

We numerically compare the performance of GP-BUCB (Desautels et al., 2014, Theorem 2, case 3) with our algorithms IGP-BUCB and GP-BTS in the kernelized setting. The parameters $\beta_t$ and $v_t$ are set, unless otherwise specified, according to the theoretical bounds for the corresponding kernels, with the confidence level and the constant $C$ chosen similarly to Desautels et al. (2014). Unless otherwise specified, the time-average regret $R_t / t$ of all algorithms in the simple batch setting is plotted in Figure 1. The experiments are performed on the following data:

Figure 1: Time-average regret for (a) RKHS functions of the SE kernel in the simple batch setting, (b) RKHS functions of the Matérn kernel in the simple batch setting, (c) RKHS functions of the SE kernel in the simple delay setting, (d) RKHS functions of the Matérn kernel in the simple delay setting, (e) Cosines test function, (f) Rosenbrock test function, (g) temperature sensor data and (h) light sensor data.

1. Functions from RKHS. A set of functions is generated from the RKHSs corresponding to the Matérn and Squared Exponential (SE) kernels, following a procedure similar to that of Chowdhury and Gopalan (2017a). The decision set $D$ is a discretization of the domain into evenly spaced points. Comparison is done in both the simple batch and simple delay settings.

2. Benchmark functions. We consider the Cosines and Rosenbrock test functions Azimi et al. (2012). The decision set $D$ is a grid of evenly spaced points on the domain of each test function, and the kernel used is the SE kernel.

3. Temperature (http://db.csail.mit.edu/labdata/labdata.html) and light sensor data (http://www.cs.cmu.edu/~guestrin/Class/10708-F08/projects/lightsensor.zip). The algorithms are compared in the context of learning the maximum reading of the sensors Srinivas et al. (2009). The kernel used is the empirical covariance of the sensor readings, and the noise variance is set to a fraction of the average empirical variance of the readings.

Observations: IGP-BUCB outperforms GP-BUCB in all experiments, validating our theoretical bounds. On the synthetic benchmarks, IGP-BUCB performs better than GP-BTS, while on the sensor data GP-BTS fares comparably with, if not better than, IGP-BUCB.

Challenges and future work. The adaptive discretization in GP-BTS introduces an extra multiplicative factor, polynomial in the dimension $d$, in the regret bound. We believe the analysis can be carried out without resorting to discretization; this remains an open problem even in the strictly sequential setting. From an applied point of view, an important open question is how to efficiently and provably optimize the UCB rule, or the functions randomly drawn from GPs, over large or continuous domains.

Acknowledgments

The authors are grateful to the anonymous reviewers for providing useful comments. Sayak Ray Chowdhury is supported by a Google India PhD Fellowship.

References

  • Abramowitz et al. [1966] Milton Abramowitz, Irene A Stegun, et al. Handbook of mathematical functions. Applied mathematics series, 55(62):39, 1966.
  • Azimi et al. [2012] Javad Azimi, Ali Jalali, and Xiaoli Fern. Hybrid batch bayesian optimization. arXiv preprint arXiv:1202.5597, 2012.
  • Chowdhury and Gopalan [2017a] Sayak Ray Chowdhury and Aditya Gopalan. On kernelized multi-armed bandits. arXiv preprint arXiv:1704.00445, 2017a.
  • Chowdhury and Gopalan [2017b] Sayak Ray Chowdhury and Aditya Gopalan. On kernelized multi-armed bandits. In Proceedings of the 34th International Conference on Machine Learning, pages 844–853, 2017b.
  • Contal et al. [2013] Emile Contal, David Buffoni, Alexandre Robicquet, and Nicolas Vayatis. Parallel gaussian process optimization with upper confidence bound and pure exploration. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 225–240. Springer, 2013.
  • Desautels et al. [2014] Thomas Desautels, Andreas Krause, and Joel W Burdick. Parallelizing exploration-exploitation tradeoffs in gaussian process bandit optimization. The Journal of Machine Learning Research, 15(1):3873–3923, 2014.
  • Durand et al. [2017] Audrey Durand, Odalric-Ambrym Maillard, and Joelle Pineau. Streaming kernel regression with provably adaptive mean, variance, and regularization. arXiv preprint arXiv:1708.00768, 2017.
  • González et al. [2016] Javier González, Zhenwen Dai, Philipp Hennig, and Neil Lawrence. Batch bayesian optimization via local penalization. In Artificial Intelligence and Statistics, pages 648–657, 2016.
  • Hernández-Lobato et al. [2015] José Miguel Hernández-Lobato, Michael A Gelbart, Matthew W Hoffman, Ryan P Adams, and Zoubin Ghahramani. Predictive entropy search for bayesian optimization with unknown constraints. In Proceedings of the 32nd International Conference on Machine Learning, 2015.
  • Kandasamy et al. [2018] Kirthevasan Kandasamy, Akshay Krishnamurthy, Jeff Schneider, and Barnabás Póczos. Parallelised bayesian optimisation via thompson sampling. In International Conference on Artificial Intelligence and Statistics, pages 133–142, 2018.
  • Kathuria et al. [2016] Tarun Kathuria, Amit Deshpande, and Pushmeet Kohli. Batched gaussian process bandit optimization via determinantal point processes. In Advances in Neural Information Processing Systems, pages 4206–4214, 2016.
  • Močkus [1975] J Močkus. On bayesian methods for seeking the extremum. In Optimization Techniques IFIP Technical Conference, pages 400–404. Springer, 1975.
  • Rasmussen and Williams [2006] Carl Edward Rasmussen and Christopher KI Williams. Gaussian processes for machine learning. MIT Press, 2006.
  • Srinivas et al. [2009] Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv preprint arXiv:0912.3995, 2009.
  • Srinivas et al. [2012] Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias W Seeger. Information-theoretic regret bounds for gaussian process optimization in the bandit setting. IEEE Transactions on Information Theory, 58(5):3250–3265, 2012.
  • Wang et al. [2016] Zi Wang, Bolei Zhou, and Stefanie Jegelka. Optimization as estimation with gaussian processes in bandit settings. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2016.

Appendix A Relevant Definitions and Results

We first review some relevant definitions and results from the Gaussian process multi-armed bandit literature, which will be useful in the analysis of our algorithms. We begin with the definition of Maximum Information Gain, which first appeared in Srinivas et al. [2009] and which basically measures the reduction in uncertainty about the unknown function after some noisy observations (rewards).

For a function $f$ and any subset $A$ of its domain, we use $f_A$ to denote its restriction to $A$, i.e., a vector containing $f$'s evaluations at each point in $A$ (under an implicitly understood bijection from coordinates of the vector to points in $A$). In case $f$ is a random function, $f_A$ will be understood to be a random vector. For jointly distributed random variables $X$ and $Y$, $I(X; Y)$ denotes the Shannon mutual information between them.

Definition 1 (Maximum Information Gain (MIG))

Let $f$ be a (possibly random) real-valued function defined on a domain $D$, and $T$ a positive integer. For each subset $A \subset D$, let $y_A$ denote a noisy version of $f_A$ obtained by passing $f_A$ through a channel $P(y_A \mid f_A)$. The Maximum Information Gain (MIG) about $f$ after $T$ noisy observations is defined as
$$\gamma_T := \max_{A \subset D :\, |A| = T} I(f_A; y_A).$$
(We omit mentioning explicitly the dependence on the channels for ease of notation.)

MIG will serve as a key instrument in obtaining our regret bounds, by virtue of Lemma 1.

For a kernel function $k$ and points $x_1, \ldots, x_t \in D$, we define $k_t(x) = [k(x_1, x), \ldots, k(x_t, x)]^\top$, the vector of kernel evaluations between $x$ and the points, and $K_t = [k(x_s, x_{s'})]_{s, s' \le t}$, the kernel matrix induced by the $x_s$'s. Also, for each $x \in D$ and $t \ge 1$, let $\sigma_t^2(x) = k(x, x) - k_t(x)^\top (K_t + \lambda I)^{-1} k_t(x)$ denote the posterior variance after the $t$ queries.

Lemma 1 (Information Gain and Predictive Variances under GP prior and additive Gaussian noise)

Let $k$ be a symmetric positive semi-definite kernel and $f \sim GP_D(0, k)$ a sample from the associated Gaussian process over $D$. For each subset $A \subset D$, let $y_A$ denote a noisy version of $f_A$ obtained by passing $f_A$ through a channel that adds i.i.d. $N(0, \lambda)$ noise to each element of $f_A$. Then,

$$I(f_A; y_A) = \frac{1}{2}\ln\det\big(I + \lambda^{-1} K_A\big), \qquad (3)$$

where $K_A$ is the kernel matrix at $A$, and, for any sequence of queries $x_1, \ldots, x_T$,

$$\sum_{t=1}^{T} \ln\big(1 + \lambda^{-1}\sigma_{t-1}^2(x_t)\big) \le 2\,\gamma_T. \qquad (4)$$

Further, if $k$ has bounded variance, i.e., $k(x, x) \le 1$ for all $x \in D$,

$$\sum_{t=1}^{T} \sigma_{t-1}(x_t) \le \sqrt{\frac{2\,T\,\gamma_T}{\ln(1 + \lambda^{-1})}}. \qquad (5)$$

Proof  (3) and (4) follow from Srinivas et al. [2009].

Further, from our assumption $k(x, x) \le 1$, we have $\lambda^{-1}\sigma_{t-1}^2(x_t) \le \lambda^{-1}$ for all $t \ge 1$, and hence, since $s / \ln(1 + s)$ is non-decreasing for any $s \ge 0$,
$$\lambda^{-1}\sigma_{t-1}^2(x_t) \le \frac{\lambda^{-1}}{\ln(1 + \lambda^{-1})}\,\ln\big(1 + \lambda^{-1}\sigma_{t-1}^2(x_t)\big).$$
Therefore
$$\sum_{t=1}^{T} \sigma_{t-1}^2(x_t) \le \frac{1}{\ln(1 + \lambda^{-1})}\sum_{t=1}^{T}\ln\big(1 + \lambda^{-1}\sigma_{t-1}^2(x_t)\big) \le \frac{2\,\gamma_T}{\ln(1 + \lambda^{-1})},$$
where the last inequality follows from (4). Now (5) follows from the Cauchy-Schwarz inequality: $\sum_{t=1}^{T} \sigma_{t-1}(x_t) \le \sqrt{T \sum_{t=1}^{T} \sigma_{t-1}^2(x_t)}$.
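The identities above are easy to check numerically (our own sanity check, reusing `se_kernel` and `gp_posterior` from Section 3 and assuming the Gaussian channel $N(0, \lambda)$; note that the chain argument above also holds with $\gamma_T$ replaced by the information gain of the particular query set):

```python
rng = np.random.default_rng(1)
lam, T = 0.01, 30
X = rng.uniform(size=(T, 1))

# (3): information gain of this particular set A = {x_1, ..., x_T}
K = se_kernel(X, X)
info_gain = 0.5 * np.linalg.slogdet(np.eye(T) + K / lam)[1]

# sum of sequential posterior standard deviations sigma_{t-1}(x_t);
# the variance does not depend on the rewards, so zeros suffice
sig_sum = 1.0  # sigma_0(x_1) = sqrt(k(x, x)) = 1 for the SE kernel
for t in range(1, T):
    _, var = gp_posterior(X[:t], np.zeros(t), X[t:t + 1], lam)
    sig_sum += np.sqrt(max(float(var[0]), 0.0))

# (5), with gamma_T replaced by this set's (smaller) information gain
bound = np.sqrt(2 * T * info_gain / np.log(1 + 1 / lam))
assert sig_sum <= bound + 1e-9
```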

Bound on Maximum Information Gain

Note that the right-hand sides of (3) and (4) depend only on the kernel function $k$, the domain $D$, and the number of observations $T$. Srinivas et al. [2009] proved upper bounds on $\gamma_T$ for three commonly used kernels, namely the Linear, Squared Exponential and Matérn kernels, defined respectively as
$$k_{\mathrm{Lin}}(x, x') = x^\top x', \qquad k_{\mathrm{SE}}(x, x') = \exp\Big(-\frac{s^2}{2 l^2}\Big), \qquad k_{\mathrm{Mat}}(x, x') = \frac{2^{1-\nu}}{\Gamma(\nu)}\Big(\frac{s\sqrt{2\nu}}{l}\Big)^{\nu} B_{\nu}\Big(\frac{s\sqrt{2\nu}}{l}\Big),$$
where $l > 0$ and $\nu > 0$ are hyper-parameters of the kernels, $s = \|x - x'\|_2$ encodes the similarity between two points $x, x' \in D$, and $B_\nu$ denotes the modified Bessel function Abramowitz et al. [1966]. The bounds are given in Lemma 2.

Lemma 2 (MIG for common kernels)

Let $k$ be a symmetric positive semi-definite kernel and $f \sim GP_D(0, k)$. Let $D$ be a compact and convex subset of $\mathbb{R}^d$ and suppose the kernel satisfies $k(x, x') \le 1$ for all $x, x' \in D$. Then:

  • Linear kernel: $\gamma_T = O(d \ln T)$.

  • Squared Exponential kernel: $\gamma_T = O\big((\ln T)^{d+1}\big)$.

  • Matérn kernel (with smoothness $\nu > 1$): $\gamma_T = O\big(T^{\frac{d(d+1)}{2\nu + d(d+1)}}\,\ln T\big)$.

Note that the Maximum Information Gain depends only sublinearly on the number of observations $T$ for all these kernels.
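For concreteness (our own arithmetic, not a claim from the paper): for the Matérn kernel with smoothness $\nu = 5/2$ in dimension $d = 1$, the exponent evaluates to
$$\frac{d(d+1)}{2\nu + d(d+1)} = \frac{2}{5 + 2} = \frac{2}{7},$$
so $\gamma_T = O\big(T^{2/7} \ln T\big)$, and an $\tilde O(\sqrt{T}\,\gamma_T)$-type regret bound evaluates to $\tilde O\big(T^{1/2 + 2/7}\big) = \tilde O\big(T^{11/14}\big)$, which is sublinear in $T$.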

Lemma 3 (which appeared independently in Chowdhury and Gopalan [2017b] and Durand et al. [2017]) gives a concentration bound for a member $f$ of the RKHS $H_k(D)$.

Lemma 3 (Concentration of an RKHS member)

Let $k : D \times D \to \mathbb{R}$ be a symmetric, positive semi-definite kernel and $f$ be a member of the RKHS $H_k(D)$ of real-valued functions on $D$ with kernel $k$. Let $\{x_t\}_{t \ge 1}$ and $\{\varepsilon_t\}_{t \ge 1}$ be stochastic processes such that $\{x_t\}_{t \ge 1}$ forms a predictable process, i.e., $x_t$ is $\mathcal{F}_{t-1}$-measurable for each $t$, and $\{\varepsilon_t\}_{t \ge 1}$ is conditionally $R$-sub-Gaussian for a positive constant $R$, i.e.,
$$\forall t \ge 0,\; \forall \eta \in \mathbb{R}, \quad \mathbb{E}\big[e^{\eta \varepsilon_t} \mid \mathcal{F}_{t-1}\big] \le \exp\Big(\frac{\eta^2 R^2}{2}\Big),$$
where $\mathcal{F}_{t-1}$ is the $\sigma$-algebra generated by $\{x_s, \varepsilon_s\}_{s=1}^{t-1}$ and $x_t$. Let $\{y_t\}_{t \ge 1}$ be a sequence of noisy observations at the query points $\{x_t\}_{t \ge 1}$, where $y_t = f(x_t) + \varepsilon_t$. For $\lambda > 0$ and $x \in D$, let
$$\mu_t(x) = k_t(x)^\top (K_t + \lambda I)^{-1} y_{1:t}, \qquad \sigma_t^2(x) = k(x, x) - k_t(x)^\top (K_t + \lambda I)^{-1} k_t(x),$$
where $y_{1:t} = [y_1, \ldots, y_t]^\top$ denotes the vector of observations at $\{x_1, \ldots, x_t\}$. Then, for any $0 < \delta \le 1$, with probability at least $1 - \delta$, uniformly over $t \ge 0$ and $x \in D$,
$$\big|f(x) - \mu_t(x)\big| \le \Big(\|f\|_k + R\sqrt{2\big(\gamma_t + 1 + \ln(1/\delta)\big)}\Big)\,\sigma_t(x),$$
where $\gamma_t$ is the Maximum Information Gain about any $g \sim GP_D(0, k)$ after $t$ noisy observations obtained by passing $g$ through an i.i.d. Gaussian channel $N(0, \lambda)$.

Proof  The proof follows from the proof of Theorem 2.1 in Durand et al. [2017].  

Now we define a quantity $C$ (modified from Kandasamy et al. [2018]), which essentially measures the information about $f$ that gets hallucinated each round due to at most $B - 1$ hallucinated observations, conditioned on the actual observations.

Definition 2 (Maximum Hallucinated Information)

Let $f \sim GP_D(0, k)$, $\mathrm{fb}$ be a mapping such that $\mathrm{fb}[t] < t$ for all $t \ge 1$, and $B \ge 1$ be a constant such that $t - \mathrm{fb}[t] \le B$ for all $t \ge 1$. Then $C > 0$ denotes the maximum hallucinated information about $f$ due to the hallucinated observations (there are at most $B - 1$ of them at every round $t$) in the sense that, for all $t \ge 1$,
$$I\big(f;\, y_{\mathrm{fb}[t]+1:t-1} \mid y_{1:\mathrm{fb}[t]}\big) \le C,$$
where $y_{1:\mathrm{fb}[t]}$ is the vector of actual observations up to round $\mathrm{fb}[t]$, $y_{\mathrm{fb}[t]+1:t-1}$ is the vector of hallucinated observations and $I\big(f; y_{\mathrm{fb}[t]+1:t-1} \mid y_{1:\mathrm{fb}[t]}\big)$ is the conditional mutual information between $f$ and $y_{\mathrm{fb}[t]+1:t-1}$, given $y_{1:\mathrm{fb}[t]}$.

The following result is modified from Desautels et al. [2014], and provides a choice of $C$.

Lemma 4 (Relation between the Maximum Information Gain and Hallucinated Information)

Let $f \sim GP_D(0, k)$, $\mathrm{fb}$ be a mapping such that $\mathrm{fb}[t] < t$ for all $t \ge 1$, and $B \ge 1$ be a constant such that $t - \mathrm{fb}[t] \le B$ for all $t \ge 1$. Further, let $\gamma_{B-1}$ be the Maximum Information Gain about $f$ after $B - 1$ noisy observations (Definition 1) and $C$ be the maximum hallucinated information (Definition 2). Then, $C$ can be chosen as $\gamma_{B-1}$.

The proof follows from the fact that $I\big(f;\, y_{\mathrm{fb}[t]+1:t-1} \mid y_{1:\mathrm{fb}[t]}\big) \le \gamma_{B-1}$ for all $t \ge 1$ Desautels et al. [2014].

The next lemma is due to Desautels et al. [2014, Proposition 1] and is pivotal in the analysis of batch Bayesian optimization.

Lemma 5 (Ratio of Posterior standard deviations bounded by Hallucinated Information)

Let $k$ be a symmetric, positive semi-definite kernel and $f \sim GP_D(0, k)$. Let $\sigma_{t'}$ and $\sigma_{t-1}$ be the posterior standard deviations, respectively conditioned on the first $t'$ and $t - 1$ queries, where $t' \le t - 1$. Then, for all $x \in D$,
$$\frac{\sigma_{t'}(x)}{\sigma_{t-1}(x)} \le \exp\Big(I\big(f;\, y_{t'+1:t-1} \mid y_{1:t'}\big)\Big).$$

Definition 2, along with Lemma 5 (applied with $t' = \mathrm{fb}[t]$), implies that $\sigma_{\mathrm{fb}[t]}(x) \le e^{C}\,\sigma_{t-1}(x)$ for all $t \ge 1$ and $x \in D$.
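This variance-ratio bound is also easy to verify numerically (our own check, again reusing the Section 3 sketches and assuming the Gaussian channel $N(0, \lambda)$; for jointly Gaussian variables the conditional information has the closed form used below):

```python
rng = np.random.default_rng(2)
lam = 0.01
Xp = rng.uniform(size=(8, 1))           # first t' queries
Xe = rng.uniform(size=(3, 1))           # the extra (hallucinated) queries
grid = rng.uniform(size=(50, 1))

def post_sd(X):
    """Posterior std dev on the grid; the variance ignores the rewards."""
    if len(X) == 0:
        return np.ones(len(grid))
    _, var = gp_posterior(X, np.zeros(len(X)), grid, lam)
    return np.sqrt(np.maximum(var, 0.0))

# I(f; y_extra | y_prior) = 0.5 logdet(I + K_post / lam), where K_post is
# the posterior covariance of f at Xe after conditioning on y at Xp
Kpp, Kee, Kpe = se_kernel(Xp, Xp), se_kernel(Xe, Xe), se_kernel(Xp, Xe)
K_post = Kee - Kpe.T @ np.linalg.solve(Kpp + lam * np.eye(len(Xp)), Kpe)
cond_info = 0.5 * np.linalg.slogdet(np.eye(len(Xe)) + K_post / lam)[1]

ratio = post_sd(Xp) / post_sd(np.vstack([Xp, Xe]))
assert np.all(ratio <= np.exp(cond_info) + 1e-9)
```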

Appendix B Regret Analysis for IGP-BUCB Algorithm

We begin with the following lemma, which states that the reward function is always well concentrated within appropriately constructed confidence intervals in the batch setting.

Lemma 6 (Concentration of reward function in the batch setting)

Let $k$ be a symmetric, positive semi-definite kernel and $f$ be a member of the RKHS $H_k(D)$ of real-valued functions on $D$ corresponding to kernel $k$, with RKHS norm bounded by $S$. Further, let $\{y_t\}_{t \ge 1}$ be a sequence of noisy observations at the queries $\{x_t\}_{t \ge 1}$, where $y_t = f(x_t) + \varepsilon_t$ and the noise sequence $\{\varepsilon_t\}_{t \ge 1}$ is conditionally $R$-sub-Gaussian. Let $\mathrm{fb}$ be a mapping such that $\mathrm{fb}[t] < t$ for all $t \ge 1$, $\mu_{\mathrm{fb}[t]}$ be the posterior mean and $\sigma_{t-1}^2$ be the posterior variance, after $\mathrm{fb}[t]$ and $t - 1$ rounds, respectively. Then, for any $0 < \delta \le 1$, the following holds:
$$\mathbb{P}\Big[\forall t \ge 1,\, \forall x \in D:\; \big|f(x) - \mu_{\mathrm{fb}[t]}(x)\big| \le \beta_t\,\sigma_{t-1}(x)\Big] \ge 1 - \delta, \quad \text{where } \beta_t = e^{C}\Big(S + R\sqrt{2\big(\gamma_{\mathrm{fb}[t]} + 1 + \ln(1/\delta)\big)}\Big).$$

Proof  Recall that, for any $t \ge 1$, the decision rule of the IGP-BUCB algorithm is
$$x_t = \arg\max_{x \in D}\; \mu_{\mathrm{fb}[t]}(x) + \beta_t\,\sigma_{t-1}(x),$$
where $\beta_t = e^{C}\big(S + R\sqrt{2(\gamma_{\mathrm{fb}[t]} + 1 + \ln(1/\delta))}\big)$. Implicit in this decision rule is the corresponding confidence interval, for each $x \in D$ and for each $t \ge 1$,
$$C_t^{\mathrm{batch}}(x) := \big[\mu_{\mathrm{fb}[t]}(x) - \beta_t\,\sigma_{t-1}(x),\; \mu_{\mathrm{fb}[t]}(x) + \beta_t\,\sigma_{t-1}(x)\big].$$
A special case of the batch setup is the strictly sequential setup with $B = 1$ and $\mathrm{fb}[t] = t - 1$. Here, the confidence intervals take the form
$$C_t^{\mathrm{seq}}(x) := \big[\mu_{t-1}(x) - \alpha_t\,\sigma_{t-1}(x),\; \mu_{t-1}(x) + \alpha_t\,\sigma_{t-1}(x)\big],$$
where $\alpha_t = S + R\sqrt{2(\gamma_{t-1} + 1 + \ln(1/\delta))}$. Thus, Lemma 3 implies that
$$\mathbb{P}\big[\forall t \ge 1,\, \forall x \in D:\; f(x) \in C_t^{\mathrm{seq}}(x)\big] \ge 1 - \delta,$$
and therefore, evaluating the sequential intervals at the rounds $\mathrm{fb}[t] + 1$,
$$\mathbb{P}\big[\forall t \ge 1,\, \forall x \in D:\; \big|f(x) - \mu_{\mathrm{fb}[t]}(x)\big| \le \alpha_{\mathrm{fb}[t]+1}\,\sigma_{\mathrm{fb}[t]}(x)\big] \ge 1 - \delta. \qquad (6)$$

Now, consider the confidence interval
$$\tilde C_t(x) := \big[\mu_{\mathrm{fb}[t]}(x) - \alpha_{\mathrm{fb}[t]+1}\,\sigma_{\mathrm{fb}[t]}(x),\; \mu_{\mathrm{fb}[t]}(x) + \alpha_{\mathrm{fb}[t]+1}\,\sigma_{\mathrm{fb}[t]}(x)\big].$$
Observe that both the intervals $\tilde C_t(x)$ and $C_t^{\mathrm{batch}}(x)$ are centered around $\mu_{\mathrm{fb}[t]}(x)$, and their widths are $2\,\alpha_{\mathrm{fb}[t]+1}\,\sigma_{\mathrm{fb}[t]}(x)$ and $2\,\beta_t\,\sigma_{t-1}(x)$, respectively. Further, see that $\beta_t = e^{C}\,\alpha_{\mathrm{fb}[t]+1}$. This, along with the fact that $\sigma_{\mathrm{fb}[t]}(x) \le e^{C}\,\sigma_{t-1}(x)$ (Lemma 5 together with Definition 2), implies
$$\alpha_{\mathrm{fb}[t]+1}\,\sigma_{\mathrm{fb}[t]}(x) \le \beta_t\,\sigma_{t-1}(x).$$
Thus, we have $\tilde C_t(x) \subseteq C_t^{\mathrm{batch}}(x)$ for all $x \in D$ and for all $t \ge 1$. Therefore,
$$\mathbb{P}\big[\forall t \ge 1,\, \forall x \in D:\; f(x) \in \tilde C_t(x)\big] \le \mathbb{P}\big[\forall t \ge 1,\, \forall x \in D:\; f(x) \in C_t^{\mathrm{batch}}(x)\big]. \qquad (7)$$

Also, as the event in (6) is exactly the event $\{\forall t \ge 1, \forall x \in D: f(x) \in \tilde C_t(x)\}$, we have
$$\mathbb{P}\big[\forall t \ge 1,\, \forall x \in D:\; f(x) \in \tilde C_t(x)\big] \ge 1 - \delta. \qquad (8)$$

Now combining equations (6), (7) and (8), we get
$$\mathbb{P}\big[\forall t \ge 1,\, \forall x \in D:\; f(x) \in C_t^{\mathrm{batch}}(x)\big] \ge 1 - \delta.$$

Finally, the result follows from the definition of the confidence interval $C_t^{\mathrm{batch}}(x)$.

B.1 Proof of Theorem 1

For every round $t \ge 1$, the decision rule of IGP-BUCB (Algorithm 1) implies that
$$\mu_{\mathrm{fb}[t]}(x_t) + \beta_t\,\sigma_{t-1}(x_t) \ge \mu_{\mathrm{fb}[t]}(x^\star) + \beta_t\,\sigma_{t-1}(x^\star),$$
where $x^\star \in \arg\max_{x \in D} f(x)$. From Lemma 6, with probability at least $1 - \delta$,
$$f(x^\star) \le \mu_{\mathrm{fb}[t]}(x^\star) + \beta_t\,\sigma_{t-1}(x^\star) \quad \text{and} \quad f(x_t) \ge \mu_{\mathrm{fb}[t]}(x_t) - \beta_t\,\sigma_{t-1}(x_t).$$
Therefore, with probability at least $1 - \delta$, we have for all $t \ge 1$ the instantaneous regret
$$r_t = f(x^\star) - f(x_t) \le 2\,\beta_t\,\sigma_{t-1}(x_t).$$
Hence, with probability at least $1 - \delta$, the cumulative regret $R_T \le 2\sum_{t=1}^{T} \beta_t\,\sigma_{t-1}(x_t)$. Now from Definition 1, see that $\gamma_t$ doesn't decrease with $t$. Hence $\gamma_{\mathrm{fb}[t]} \le \gamma_{\mathrm{fb}[T]}$, and thus $\beta_t \le \beta_T$ for all $t \le T$. Therefore, with probability at least $1 - \delta$, $R_T \le 2\,\beta_T \sum_{t=1}^{T} \sigma_{t-1}(x_t)$. Further, $\gamma_{\mathrm{fb}[T]} \le \gamma_T$, since $\mathrm{fb}[T] \le T$. From Lemma 1, $\sum_{t=1}^{T} \sigma_{t-1}(x_t) \le \sqrt{2 T \gamma_T / \ln(1 + \lambda^{-1})}$. Hence, with probability at least $1 - \delta$,
$$R_T \le 2\,e^{C}\Big(S + R\sqrt{2\big(\gamma_T + 1 + \ln(1/\delta)\big)}\Big)\sqrt{\frac{2\,T\,\gamma_T}{\ln(1 + \lambda^{-1})}}.$$

Thus $R_T = O\Big(e^{C}\sqrt{T}\big(S\sqrt{\gamma_T} + \gamma_T + \sqrt{\gamma_T \ln(1/\delta)}\big)\Big)$ with high probability.

Appendix C Regret Analysis for GP-BTS Algorithm

C.1 Useful Lemmas and Definitions

Lemma 7 (Lipschitzness of RKHS functions)

Let $f$ be a function in the RKHS $H_k(D)$, with $\|f\|_k \le S$ and $D \subset [0, r]^d$ compact. Let the kernel $k$ be continuously differentiable and let $L$ be a constant satisfying