We consider the problem of maximizing a function $f$ from its noisy observations of the form
$$ y_t = f(x_t) + \eta_t, $$
where $\eta_t$ is the observation noise at time $t$. We work in the Bayesian setting, assuming that the function $f$ is a sample from a zero mean Gaussian Process (GP) indexed by the space $\mathcal{X}$, and that the noise terms $\eta_t$ for $t \ge 1$ are i.i.d. Gaussian random variables. We further assume that the function $f$ is expensive to evaluate, and we are allocated a budget of $n$ function evaluations.
This problem can be thought of as an extension of the Multi-armed bandit (MAB) problem to the case of infinite (possibly uncountable) arms indexed by the set $\mathcal{X}$, and is referred to as the GP bandits problem (Srinivas et al., 2012). The goal is to design a strategy of sequentially selecting query points based on the past observations and the prior on $f$. As in the case of MAB with finitely many arms, the performance of any query point selection strategy is usually measured by the cumulative regret $R_n$, which forces the agent to address the exploration-exploitation trade-off:
$$ R_n = \sum_{t=1}^{n} \big( f(x^*) - f(x_t) \big), \quad \text{where } x^* \in \arg\max_{x \in \mathcal{X}} f(x). \tag{2} $$
An alternative measure of performance is the simple regret $S_n$, which is used in the Bayesian Optimization (BO) or pure exploration problem:
$$ S_n = f(x^*) - f(\hat{x}_n), \tag{3} $$
where $\hat{x}_n$ is the point recommended after $n$ evaluations.
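As a concrete illustration of these two performance measures, the following sketch (not from the paper; the quadratic test function and the query sequence are hypothetical) computes both regrets for a fixed query sequence:

```python
def cumulative_regret(f, xs, x_star):
    """Cumulative regret R_n = sum over t of (f(x*) - f(x_t))."""
    return sum(f(x_star) - f(x) for x in xs)

def simple_regret(f, x_hat, x_star):
    """Simple regret S_n = f(x*) - f(x_hat)."""
    return f(x_star) - f(x_hat)

# Toy function on [0, 1] with maximizer at 0.5.
f = lambda x: -(x - 0.5) ** 2
xs = [0.0, 0.25, 0.5, 0.5]            # hypothetical query sequence
print(cumulative_regret(f, xs, 0.5))   # 0.3125
print(simple_regret(f, 0.5, 0.5))      # 0.0
```

Cumulative regret charges the strategy for every suboptimal query, whereas simple regret only scores the final recommendation.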
1.1 Prior work
Optimizing a black-box function from its noisy observations is an active area of research with a large body of literature. Here, we review existing methods which take a Bayesian approach with GP prior to this problem, and have provable guarantees on their performance.
Srinivas et al. (2012) formulated the task of black-box function optimization as a MAB problem and proposed the GP-UCB algorithm, a modification of the Upper Confidence Bound (UCB) strategy widely used in the bandit literature. The algorithm constructs high probability UCBs on the function values using the GP posterior and selects the evaluation points by maximizing the UCB over $\mathcal{X}$. For finite search spaces they showed that the GP-UCB algorithm admits a high probability upper bound on the cumulative regret of the form:
$$ R_n = O\big( \sqrt{n \beta_n \gamma_n} \big), \tag{4} $$
where $\gamma_n$ is the maximum information gain with $n$ evaluations (and $\beta_n$ is a confidence-width parameter). We will refer to cumulative regret bounds of this form as information-type regret bounds in this paper. In addition, to make the dependence on $n$ explicit, Srinivas et al. (2012) further derived bounds on the term $\gamma_n$ for some commonly used kernels. Finally, they presented an extension of the GP-UCB algorithm to the case of continuous $\mathcal{X}$ by applying it on a sequence of increasingly fine uniform discretizations of $\mathcal{X}$.
Follow-up works to Srinivas et al. (2012) have extended the GP-UCB algorithm in several ways. Contal and Vayatis (2016) proposed a method of constructing a sequence of uniform discretizations with tight control over the approximation error, which allowed the extension of the GP-UCB algorithm to arbitrary compact metric spaces $\mathcal{X}$. Desautels et al. (2014) and Contal et al. (2013) considered the GP bandits problem with the additional assumption that the evaluations can be performed in parallel. Desautels et al. (2014) proposed the GP-BUCB algorithm, which selects the points in a batch sequentially by maximizing a variant of the UCB, computed by keeping the mean function fixed and only updating the posterior variance. Contal et al. (2013) proposed the GP-UCB-PE algorithm, which uses the UCB function for selecting the first point of a batch and then proceeds in a greedy manner, selecting the remaining points by maximizing the posterior variance. Krause and Ong (2011) proposed and analyzed the CGP-UCB algorithm for the contextual GP bandits problem, where the mean reward function corresponding to context-action pairs is modeled as a sample from a GP on the context-action product space. Kandasamy et al. (2016) considered a multi-fidelity version of the GP bandits problem in which they assumed the availability of a sequence of approximations of the true function, with increasing accuracies, which were cheaper to evaluate. They proposed an extension of GP-UCB called MF-GP-UCB and derived information-type bounds on its cumulative regret.
Wang et al. (2016) proposed the GP-EST algorithm, which looks at the optimization problem through the lens of estimation. In particular, the algorithm constructs an estimate of the maximum function value, and then selects for evaluation a point which has the largest probability of attaining this value. Russo and Van Roy (2014) analyzed the performance of the Thompson Sampling algorithm for a large class of problems, including the GP bandits problem. Thompson Sampling is a randomized strategy in which query points are sampled according to the posterior distribution of the maximizer $x^*$. Since computing this posterior may be complicated, in practice the query points are selected by the following two-step procedure: first, a sample $g$ of the unknown function is generated, and then the query point is chosen by maximizing $g$ over $\mathcal{X}$. For the case of continuous $\mathcal{X}$, the function samples are generated over uniform discretizations of $\mathcal{X}$. By observing a relation between the expected regret of Thompson Sampling and UCB strategies, Russo and Van Roy (2014) obtained information-type bounds on the expected cumulative regret of the Thompson Sampling algorithm for GP bandits.
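The two-step Thompson Sampling procedure described above can be sketched as follows; this is an illustrative implementation with an assumed RBF kernel, noise level, and grid size, not the exact setup of Russo and Van Roy (2014):

```python
import numpy as np

def rbf(a, b, ell=0.2):
    """RBF kernel matrix between 1-d point sets a and b (illustrative)."""
    return np.exp(-(np.asarray(a)[:, None] - np.asarray(b)[None, :]) ** 2
                  / (2 * ell ** 2))

def thompson_step(grid, X_obs, y_obs, noise_var=0.01, rng=None):
    """Two-step Thompson Sampling: draw one posterior sample of f on a
    uniform discretization, then return the grid point maximizing it."""
    rng = np.random.default_rng() if rng is None else rng
    K = rbf(X_obs, X_obs) + noise_var * np.eye(len(X_obs))
    k_star = rbf(grid, X_obs)
    K_inv = np.linalg.inv(K)
    mu = k_star @ K_inv @ y_obs                      # posterior mean on grid
    cov = rbf(grid, grid) - k_star @ K_inv @ k_star.T  # posterior covariance
    sample = rng.multivariate_normal(mu, cov + 1e-8 * np.eye(len(grid)))
    return grid[np.argmax(sample)]

grid = np.linspace(0, 1, 50)
X_obs = np.array([0.1, 0.5, 0.9])
y_obs = np.array([0.2, 1.0, 0.1])   # hypothetical noisy observations
x_next = thompson_step(grid, X_obs, y_obs, rng=np.random.default_rng(0))
```

Repeating this step, conditioning on each new observation, yields the full Thompson Sampling strategy.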
As observed in (Bubeck et al., 2011a), bounding the cumulative regret automatically gives us a bound on the expected simple regret by employing a randomized point recommendation strategy. Additionally, for the pure exploration setting, several algorithms specifically geared towards minimizing $S_n$, such as Expected Improvement (GP-EI), Probability of Improvement (GP-PI), Entropy Search and Bayesian Multi-Scale Optimistic Optimization (BaMSOO), have been proposed (see (Shahriari et al., 2016) for a recent survey). Bogunovic et al. (2016b) considered the BO and Level Set Estimation problems in a unified manner and proposed the Truncated Variance Reduction (TRUVAR) algorithm, which selects evaluation points greedily to obtain the largest reduction in the sum of truncated variances of the potential maximizers. The performance of all these algorithms has been empirically studied over various synthetic as well as real-world datasets. Furthermore, theoretical guarantees are also known for GP-EI (Bull, 2011) and BaMSOO (Wang et al., 2014) with noiseless observations, and for TRUVAR (Bogunovic et al., 2016b) with noisy observations and non-uniform cost of evaluations.
All the algorithms above, with the exception of BaMSOO, require solving an auxiliary optimization problem in each round for selecting the query point $x_t$. The objective function of this auxiliary optimization problem is usually non-convex and multi-modal and hence requires an exhaustive search over an increasingly fine sequence of uniform discretizations to guarantee that a close approximation of the true optimum is found (Srinivas et al., 2012; Contal and Vayatis, 2016). The size of these uniform discretizations increases exponentially with the dimension of $\mathcal{X}$, because the discretizations are chosen off-line and do not depend on the function evaluations made up to round $t$. In contrast, BaMSOO adaptively constructs discretizations by locally refining the regions of $\mathcal{X}$ in which $f$ is more likely to take higher values, based on the observations. As a result, the size of the discretizations under BaMSOO is independent of the dimension of $\mathcal{X}$, which leads to significantly lower computational costs when $\mathcal{X}$ is high-dimensional. Our work is strongly motivated by this aspect of BaMSOO: we provide the first algorithm for GP bandits with noisy observations whose computational complexity remains independent of the dimension of $\mathcal{X}$.
1.2 Our contributions
In this paper, we address two issues with existing approaches to the GP bandits problem:
As discussed above, all the existing algorithms for GP bandits require solving an auxiliary optimization problem over the entire search space for selecting a query point, which may be computationally infeasible; thus practical implementations resort to various approximation techniques which do not come with theoretical guarantees.
Furthermore, by constructing specific Gaussian Processes we show that the information-type regret bounds can be too pessimistic, thus motivating the need for designing algorithms that admit alternative analysis techniques.
To tackle these two problems, we design algorithms for GP bandits which utilize ideas from existing works in the Lipschitz function optimization literature, such as (Bubeck et al., 2011b; Munos, 2011; Munos et al., 2014; Kleinberg et al., 2013). More specifically, our main contributions are as follows:
We first present an algorithm for GP bandits which employs a tree of partitions of the search space to adaptively refine it based on observations. We show that, because of the adaptive discretization, when the budget and the dimension of the search space are large, our algorithm has significantly lower computational complexity than algorithms requiring auxiliary optimization.
We obtain high probability bounds on the cumulative regret of our algorithm which are always as good as, and in some cases strictly better than, the existing regret bounds. In particular, we obtain the first explicit sublinear regret bounds for the GP with exponential kernel (Ornstein-Uhlenbeck process) and also identify sufficient conditions under which our bounds improve upon the current ones for the Matérn family of kernels.
We also derive high probability bounds on the simple regret for our algorithm. To the best of our knowledge, BaMSOO (Wang et al., 2014) is the only adaptive algorithm (we use the term adaptive to refer to algorithms which adaptively discretize the search space based on earlier observations) for the black-box optimization problem in the Bayesian setting for which theoretical guarantees on simple regret are known. Our algorithm matches BaMSOO’s performance, with the additional advantages that it requires fewer assumptions on the covariance functions and can work with noisy observations.
We also study two extensions of our algorithm. First, we present a Bayesian Zooming algorithm based on (Kleinberg et al., 2013; Slivkins, 2014) and obtain theoretical guarantees on its regret performance. This algorithm assumes a covering oracle access to the metric space instead of requiring a hierarchical tree of partitions of $\mathcal{X}$. We then extend our algorithm for GP bandits to the contextual GP bandits problem and obtain bounds on the contextual regret.
Finally, our algorithms and the theoretical bounds rely on a set of technical results about Gaussian Processes which may be of independent interest. We provide these results and discuss their implications in Section 6.
1.3 Toy examples
As mentioned earlier, our cumulative regret bounds for Matérn kernels improve upon the known information type bounds for GP bandits. In this section, we attempt to provide some intuition for this result. In particular, we construct two toy examples which serve to highlight a potential drawback of the information type regret bounds for GP bandit problems shown in (4).
The information-type regret bounds (4) depend on the maximum information gain, which is defined as:
$$ \gamma_n := \max_{A \subset \mathcal{X} \,:\, |A| = n} I(f; y_A). \tag{5} $$
Here $I(f; y_A)$ is the mutual information between the unknown function $f$ and the vector of observations $y_A$ corresponding to the query points in $A$. This term depends on the covariance function (we will use the terms covariance functions and kernels interchangeably) of the Gaussian Process (GP), and upper bounds on $\gamma_n$ for many commonly used GPs are given in (Srinivas et al., 2012). We note that since our aim is to gather information about a maximizer of $f$, and not necessarily about the behavior of $f$ over the entire space $\mathcal{X}$, information-type regret bounds can be quite loose. We present two examples which have been specifically constructed to illustrate scenarios where the regret bounds implied by (4) are very pessimistic. Both examples utilize the fact that the maximum information gain $\gamma_n$ can be large if the Gaussian Process has many independent components, even when the maximizer may be simple to learn.
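For a GP observed under i.i.d. Gaussian noise, the mutual information has the closed form $I(f; y_A) = \tfrac{1}{2} \log\det(I + \sigma^{-2} K_A)$, where $K_A$ is the kernel matrix at the query points (Srinivas et al., 2012). The following sketch (the RBF kernel, length scale, and noise level are illustrative choices) computes this quantity and shows that well-separated queries yield a larger information gain than repeated queries:

```python
import numpy as np

def information_gain(K_A, noise_var):
    """I(f; y_A) = 0.5 * log det(I + sigma^{-2} K_A) for a GP observed
    under i.i.d. Gaussian noise with variance noise_var."""
    m = K_A.shape[0]
    sign, logdet = np.linalg.slogdet(np.eye(m) + K_A / noise_var)
    return 0.5 * logdet

def rbf(x, ell=0.2):
    """Kernel matrix of an RBF GP at the 1-d query points x."""
    return np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * ell ** 2))

spread = np.linspace(0, 1, 5)   # well-separated queries
clumped = np.full(5, 0.5)       # the same query repeated five times
assert information_gain(rbf(spread), 0.1) > information_gain(rbf(clumped), 0.1)
```

Intuitively, a GP with many nearly independent components admits many query sets with high mutual information, inflating $\gamma_n$ even when locating the maximizer is easy.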
For our first example, we construct a GP whose samples have a simple structure around the maximum despite a highly complex structure away from the maximizer. More specifically, we begin by dividing the interval into three equal subintervals. Over the second and third subintervals, the GP sample varies smoothly as scaled and shifted versions of a smooth bump function, modulated by a standard normal random variable. The first subinterval is further divided into three parts, and this process continues infinitely.
Suppose $\mathcal{X} = [0, 1]$, and let us define a GP $f = (f(x))_{x \in \mathcal{X}}$ as follows:
where the construction uses a non-increasing positive sequence of scale factors, a continuous unimodal bump function vanishing at the endpoints of its support, and a sequence of independent standard normal random variables.
For this GP, we can claim the following (details in Appendix A.1):
On the other hand, if the scale factors decay sufficiently fast, then the true maximizer lies in a known subinterval with high probability, and it can be identified with just one function evaluation, implying a constant cumulative regret.
For our second example, we construct a GP in which the search space is partitioned at different scales, and statistically equivalent components are assigned to the sets of a given partition. This process is repeated with increasingly finer partitions, and we show that for a certain choice of parameters, each observation of the GP sample diminishes the region of uncertainty associated with $x^*$ by a constant factor. However, the information-type bound is again dominated by the information obtained from the large number of independent components of the GP, and it gives a linear upper bound on the cumulative regret.
We again take $\mathcal{X} = [0, 1]$ and define the GP recursively, using as a building block the function from Example 1. As before, the construction uses a decreasing sequence of positive real numbers and a sequence of i.i.d. standard normal random variables. For this example, we can claim the following:
If the noise variance is small enough, we have $\gamma_n = \Omega(n)$, which implies a linear (in $n$) information-type bound on the cumulative regret.
With the choice of parameters described in Appendix A.2, we can select the evaluation points in such a way that, with high probability, after every observation the size of the region containing $x^*$ shrinks by a constant factor, which in turn implies that the cumulative regret remains bounded.
Both our examples have been specifically crafted to highlight scenarios in which the information-type upper bounds given in (4) may not reflect the actual performance of the algorithms, due to their dependence on the term $\gamma_n$. In Section 4.2 we further strengthen this observation by showing that the information-type regret bounds are loose for a practically relevant class of Gaussian Processes.
The rest of the paper is organized as follows: In Section 2 we introduce the required definitions and present some background for the problem. We then describe our algorithm for GP bandits and analyze its regret in Section 3. We discuss the behavior of our algorithm in some specific problem instances in Section 4. In Section 5 we study two extensions of our approach and analyze their performance. Finally, Section 6 contains some technical results which were used in designing our algorithms.
In this section we recall some definitions required for stating the results, and fix the notation used.
A Gaussian Process $(f(x))_{x \in \mathcal{X}}$ is a collection of random variables which satisfy the property that $(f(x_1), \ldots, f(x_m))$ is a jointly Gaussian random vector for every finite subset $\{x_1, \ldots, x_m\} \subset \mathcal{X}$ and every $m \ge 1$. A Gaussian Process is completely specified by its mean function and its covariance function $K$.
For a comprehensive discussion of Gaussian Processes and their applications in machine learning, see (Rasmussen and Williams, 2006).
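The standard GP regression formulas for the posterior mean and variance can be sketched as follows; the RBF kernel, length scale, observation values, and test points below are illustrative assumptions, not the paper's setup:

```python
import numpy as np

def gp_posterior(K, X_obs, y_obs, X_query, noise_var):
    """Posterior mean and variance of a zero-mean GP with kernel K,
    given noisy observations y_obs at X_obs (standard GP regression)."""
    Kxx = K(X_obs, X_obs) + noise_var * np.eye(len(X_obs))
    Kqx = K(X_query, X_obs)
    Kqq = K(X_query, X_query)
    A = np.linalg.solve(Kxx, Kqx.T)   # Kxx^{-1} K_{obs,query}
    mean = A.T @ y_obs
    var = np.diag(Kqq - Kqx @ A)
    return mean, var

rbf = lambda a, b: np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * 0.1 ** 2))
X_obs = np.array([0.3, 0.7])
y_obs = np.array([1.0, -0.5])        # hypothetical observations
mean, var = gp_posterior(rbf, X_obs, y_obs, np.array([0.3, 0.99]), 1e-4)
# var is far smaller at the observed point 0.3 than at the unobserved 0.99
```

The posterior variance quantifies the residual uncertainty at each point; it shrinks near observed locations, which is the quantity the algorithms below compare against the per-cell variation bound.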
Any zero mean Gaussian Process with covariance function $K$ induces a metric $d$ on its index set $\mathcal{X}$, defined as
$$ d(x_1, x_2) = \sqrt{\mathbb{E}\big[ (f(x_1) - f(x_2))^2 \big]} = \sqrt{K(x_1, x_1) + K(x_2, x_2) - 2K(x_1, x_2)}, $$
which gives us the following useful tail bound for any $x_1, x_2 \in \mathcal{X}$ and $a > 0$:
$$ \mathbb{P}\big( |f(x_1) - f(x_2)| > a \big) \;\le\; 2 \exp\left( -\frac{a^2}{2\, d(x_1, x_2)^2} \right). $$
Next, we introduce some properties of any metric space which will be used later on.
Suppose $\mathcal{X}$ is a non-empty set and $d$ is a metric on $\mathcal{X}$. Then we have the following definitions:
A subset $A$ of $\mathcal{X}$ is called an $\epsilon$-covering set of $\mathcal{X}$ if for any $x \in \mathcal{X}$ there exists an $a \in A$ with $x \in B(a, \epsilon)$, where $B(a, \epsilon) := \{ x' \in \mathcal{X} : d(a, x') \le \epsilon \}$. The cardinality of the smallest such $A$ is called the $\epsilon$-covering number of $\mathcal{X}$ with respect to $d$, denoted here by $N(\mathcal{X}, \epsilon)$.
The metric dimension of a space $\mathcal{X}$ with associated metric $d$ is the smallest number $D$ such that for all $\epsilon > 0$ we have
$$ N(\mathcal{X}, \epsilon) \le C \epsilon^{-D} $$
for some constant $C > 0$.
For bounded subsets of $\mathbb{R}^D$ with a norm-induced metric, the metric dimension coincides with the usual notion of dimension (van Handel, 2014, page 125). The metric dimension gives us a notion of dimensionality intrinsic to the metric space $(\mathcal{X}, d)$. We now present a function-specific measure of the dimensionality of $\mathcal{X}$.
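As an illustration of how covering numbers scale with the metric dimension, the following sketch greedily bounds the $\epsilon$-covering number of a grid approximation of $[0,1]^2$ under the sup metric (the grid resolution and the greedy heuristic are illustrative choices, not the paper's construction):

```python
def covering_number(points, eps):
    """Greedy upper bound on the eps-covering number of a finite point
    set under the sup metric: repeatedly pick an uncovered point as a
    center and discard everything within distance eps of it."""
    uncovered = list(points)
    count = 0
    while uncovered:
        c = uncovered[0]
        uncovered = [p for p in uncovered
                     if max(abs(a - b) for a, b in zip(c, p)) > eps]
        count += 1
    return count

# Grid approximation of [0,1]^2; covering numbers grow roughly like eps^-2,
# reflecting the metric dimension D = 2.
grid = [(i / 40, j / 40) for i in range(41) for j in range(41)]
n_coarse = covering_number(grid, 0.2)
n_fine = covering_number(grid, 0.05)
```

Quartering $\epsilon$ multiplies the covering number by roughly $4^D$, which is the exponential dependence on dimension that uniform discretizations suffer from.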
Suppose $\mathcal{X}$ is a non-empty set, $d$ is a metric on $\mathcal{X}$, and $f$ is a function from $\mathcal{X}$ to $\mathbb{R}$. Then:
A subset $A$ of $\mathcal{X}$ is called an $\epsilon$-separated set of $\mathcal{X}$ if for any two distinct points $a_1, a_2 \in A$ we have $d(a_1, a_2) \ge \epsilon$. The cardinality of the largest such set is called the $\epsilon$-packing number of $\mathcal{X}$ with respect to $d$.
For any $\epsilon > 0$ and $c > 0$, consider the near-optimal set $\mathcal{X}_{c\epsilon} := \{ x \in \mathcal{X} : f(x) \ge \sup_{x' \in \mathcal{X}} f(x') - c\epsilon \}$ and its $\epsilon$-packing number. Then we define the near-optimality dimension associated with the metric $d$ and the function $f$ as the smallest real number $\tilde{d}$ such that for all $\epsilon > 0$, the $\epsilon$-packing number of $\mathcal{X}_{c\epsilon}$ is at most $C \epsilon^{-\tilde{d}}$
for some constant $C > 0$.
We will call a compact metric space $(\mathcal{X}, d)$ well-behaved if there exists a sequence of subsets $(\mathcal{X}_h)_{h \ge 0}$ of $\mathcal{X}$ satisfying the following properties:
Each subset $\mathcal{X}_h$ has $N^h$ elements for some integer $N \ge 2$, i.e., $\mathcal{X}_h = \{ x_{h,i} : 1 \le i \le N^h \}$, and to each element $x_{h,i}$ is associated a cell $\mathcal{X}_{h,i} \subset \mathcal{X}$.
For all $h \ge 0$ and $1 \le i \le N^h$, the cell $\mathcal{X}_{h,i}$ is the union of the cells of the $N$ nodes at level $h+1$ that lie inside it.
The nodes at level $h+1$ whose cells make up $\mathcal{X}_{h,i}$ are called the children of $x_{h,i}$, which in turn is referred to as their parent.
We assume that the cells have geometrically decaying radii, i.e., there exist $0 < \rho < 1$ and $v_1 \ge v_2 > 0$ such that for all $h \ge 0$ and $1 \le i \le N^h$ we have
$$ B(x_{h,i}, v_2 \rho^h) \subset \mathcal{X}_{h,i} \subset B(x_{h,i}, v_1 \rho^h). $$
From the union property we can see that the cells partition the space $\mathcal{X}$ for every level $h$, while the nesting of cells implies that we get an increasingly fine sequence of partitions with increasing $h$. Finally, the geometric decay of the cell radii imposes the condition that for any $h$, the points $x_{h,i}$ are evenly spread out in the space $\mathcal{X}$. The subsets satisfying these properties are said to form a tree of partitions (Munos et al., 2014; Bubeck et al., 2011b).
We note that if $\mathcal{X} = [0, 1]^D$ and $d$ is any norm-induced metric on $\mathcal{X}$, then $\mathcal{X}$ is well-behaved according to the above definition. The cells in this case are $D$-dimensional hyper-rectangles, and the cells at level $h+1$ can be constructed from those at level $h$ by dividing each cell along its longest edge into equal parts.
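The longest-edge splitting construction described above can be sketched as follows (here with $N = 2$ children per node; the function names are illustrative):

```python
def split_cell(cell):
    """Split a hyper-rectangle along its longest edge into two halves.
    A cell is a list of (lo, hi) intervals, one per dimension."""
    d = max(range(len(cell)), key=lambda i: cell[i][1] - cell[i][0])
    lo, hi = cell[d]
    mid = (lo + hi) / 2
    left, right = list(cell), list(cell)
    left[d], right[d] = (lo, mid), (mid, hi)
    return [left, right]

def partition_at_depth(root, h):
    """Cells of the tree of partitions at depth h (N = 2 children)."""
    cells = [root]
    for _ in range(h):
        cells = [child for c in cells for child in split_cell(c)]
    return cells

root = [(0.0, 1.0), (0.0, 1.0)]        # X = [0,1]^2
cells = partition_at_depth(root, 4)     # 2^4 = 16 cells partitioning X
volume = sum((c[0][1] - c[0][0]) * (c[1][1] - c[1][0]) for c in cells)
```

At every depth the cells tile the unit square exactly (their volumes sum to 1), and alternating the split dimension keeps the cell diameters decaying geometrically, as the definition requires.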
| Description | Reference |
| --- | --- |
| black-box function | Section 1 |
| function evaluation budget | —"— |
| observation noise (i.i.d. Gaussian) | —"— |
| prior on the black-box function, with covariance function | —"— |
| class of covariance functions considered | Section 3.3.1 |
| parameters associated with the covariance class | —"— |
| posterior mean and variance functions | |
| Cumulative regret | Section 1, (2) |
| Simple regret | Section 1, (3) |
| Maximum information gain | Section 1.3, (5) |
| Contextual regret | Section 5.2, (36) |
| Compact search space with metric | |
| ball with given center and radius | (13) |
| metric induced by the GP on the search space | Section 2, (8) |
| Metric dimension | —"—, Definition 2 |
| Covering number | —"—, —"— |
| Packing number | —"—, Definition 3 |
| near-optimality dimension | —"—, —"— |
| instances of the near-optimality dimension | Remark 8, Remark 13 |
| Parameters of the tree of partitions | Section 2, Definition 4 |
| Parameters of Algorithm 1 and Algorithm 3 | |
| set of leaf nodes | Section 3.2 |
| Index used for point selection in Algorithm 1 | —"—, (14) |
| upper bound on the function value at a node | —"—, (15) |
| parent node | |
| multiplicative factor for confidence intervals | Section 3.3.2, Claim 1 |
| maximum depth of the tree | —"—, (17) |
| upper bound on maximum variation of the function in a cell at a given level | —"—, Claim 2 |
| relevant leaf nodes | Section 5.2.1 |
| Index used for action selection in Algorithm 3 | —"—, (38) |
| Context space and Action space | Section 5.2 |
| Parameters of Algorithm 2 | |
| Set of active points | Section 5.1 |
| radius associated with a point | —"— |
| upper bound on variation of the function in the ball around a point | Claim 6 |
3 Algorithm for GP bandits
We begin this section by describing the general outline of all the algorithms proposed in this paper in Section 3.1. Then we introduce our tree based algorithm for GP bandits and obtain high probability bounds on its regret in Section 3.2.
3.1 General approach
At any time $t$, we maintain a discretization (i.e., a finite subset) of $\mathcal{X}$. To each point in this discretization, we have an associated confidence region and an index, which is a high probability upper bound on the maximum value of the function in that region. The index depends on three quantities: (a) the actual function value at the point, (b) the amount of uncertainty in the function value at the point, and (c) the amount of variation of the function value over the confidence region. We proceed as follows:
In each round, we select a candidate point optimistically by maximizing the index over the current discretization.
If the uncertainty in the function value at the selected point is smaller than the variation of the function over its confidence region, it means that we must refine our discretization in the confidence region associated with that point.
If, on the other hand, the uncertainty in the function value at the selected point is larger than the variation of the function over the associated confidence region, our algorithm evaluates the function at this point to reduce this uncertainty.
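The steps above can be sketched end-to-end as a toy loop on $\mathcal{X} = [0, 1]$; the RBF kernel, the confidence multiplier `beta`, the variation bound `V(h)`, and the noise level are illustrative stand-ins for the paper's exact choices:

```python
import numpy as np

def rbf(a, b, ell=0.2):
    return np.exp(-(np.asarray(a)[:, None] - np.asarray(b)[None, :]) ** 2
                  / (2 * ell ** 2))

def posterior(X_obs, y_obs, x, noise_var=0.01):
    """Posterior mean and variance of a zero-mean RBF GP at a single point."""
    if not X_obs:
        return 0.0, 1.0
    K = rbf(X_obs, X_obs) + noise_var * np.eye(len(X_obs))
    k = rbf([x], X_obs).ravel()
    w = np.linalg.solve(K, k)
    return float(w @ np.asarray(y_obs)), max(float(1.0 - k @ w), 0.0)

def run(f, n_evals, beta=2.0, V=lambda h: 0.8 * 0.5 ** h, rng=None):
    """Select / refine / evaluate loop: leaves are (depth, interval) cells."""
    rng = rng if rng is not None else np.random.default_rng(0)
    leaves = [(0, (0.0, 1.0))]
    X_obs, y_obs, evals = [], [], 0
    while evals < n_evals:
        def index(leaf):                       # UCB at cell center + V(h)
            h, (lo, hi) = leaf
            mu, var = posterior(X_obs, y_obs, (lo + hi) / 2)
            return mu + beta * np.sqrt(var) + V(h)
        h, (lo, hi) = max(leaves, key=index)   # optimistic selection
        x = (lo + hi) / 2
        _, var = posterior(X_obs, y_obs, x)
        if beta * np.sqrt(var) <= V(h):        # uncertainty small: refine
            leaves.remove((h, (lo, hi)))
            mid = (lo + hi) / 2
            leaves += [(h + 1, (lo, mid)), (h + 1, (mid, hi))]
        else:                                  # uncertainty large: evaluate
            X_obs.append(x)
            y_obs.append(f(x) + 0.1 * rng.standard_normal())
            evals += 1
    return X_obs[int(np.argmax(y_obs))]

best = run(lambda x: -(x - 0.3) ** 2, n_evals=30)
```

Note that the discretization only grows where the index is competitive, so the number of leaves does not scale exponentially with the dimension as a uniform grid would.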
In Section 3.2 we present an algorithm for GP bandits which uses a hierarchical partitioning scheme for locally refining the search space similar to (Munos et al., 2014; Bubeck et al., 2011b; Wang et al., 2014). Alternatively, the covering oracle based approach used by Slivkins (2014); Kleinberg et al. (2013) can also be employed for refining the discretization, and we describe such an algorithm in Section 5.1. We also apply this approach to design an adaptive algorithm for the Contextual GP bandits problem in Section 5.2.
3.2 Tree based Algorithm
We now describe our algorithm for GP bandits and derive high probability bounds on its regret. Our algorithm is motivated by several tree-based methods that have been proposed for function optimization under Lipschitz-like assumptions, such as (Bubeck et al., 2011b; Munos, 2011; Munos et al., 2014). Assuming that the metric space $(\mathcal{X}, d)$ is well-behaved, i.e., that we have a sequence of subsets whose associated cells form a tree of partitions of $\mathcal{X}$, we proceed as follows:
In every round $t$, the algorithm maintains an active set of leaf nodes, such that the cells of the nodes in this set partition $\mathcal{X}$. The active set is initialized to the root node, whose associated cell is $\mathcal{X}$ itself.
The algorithm selects a node from the active set by maximizing an index $I_{h,i}$. The index is an upper confidence bound (UCB) on the maximum function value in the cell $\mathcal{X}_{h,i}$ and is defined as
$$ I_{h,i} = U_{h,i} + V_h. \tag{14} $$
The term $U_{h,i}$ in the above equation is a high probability upper bound on the function value at $x_{h,i}$ and is defined as
$$ U_{h,i} = \min\big( \mu(x_{h,i}) + \beta\, \sigma(x_{h,i}),\; U_{h-1, p(i)} + V_{h-1} \big), \tag{15} $$
where $p(i)$ denotes the parent node of $x_{h,i}$, and $\mu$ and $\sigma$ are the current posterior mean and standard deviation. For any $h \ge 0$, the term $V_h$ is an upper bound on the maximum function variation in any cell at level $h$. Thus, we see that $U_{h,i}$ computes an upper bound on the value of $f(x_{h,i})$ in two ways and takes their minimum, while adding $V_h$ to it gives us an upper bound on the maximum function value in the cell $\mathcal{X}_{h,i}$.
Having chosen the point according to the selection rule (Line 2 of Algorithm 1), we take one of the following two actions:
Refine: If the posterior uncertainty at the selected point is at most the variation bound $V_h$ for its level, then the node is expanded, i.e., the children of the node are added to the set of leaves, and the node itself is removed from it. (Lines 4-5 of Algorithm 1)
Evaluate: Otherwise, the function is evaluated at the selected point, i.e., we observe the noisy function value and update the posterior distribution of $f$. (Lines 7-9 of Algorithm 1)
The steps of the algorithm are shown as pseudo-code in Algorithm 1. The algorithm maintains two counters: one which counts the total number of rounds (function evaluations plus refinements), and one which keeps track of the number of function evaluations. The algorithm stops after $n$ function evaluations and recommends a point from one of the deepest expanded cells (for minimizing $S_n$). The second condition on Line 3 of Algorithm 1 is added to prevent the (unlikely) scenario in which the algorithm keeps refining indefinitely without evaluating the function.
Algorithm 1 requires knowledge of the horizon, or budget, $n$. However, we can use the well known doubling trick (Cesa-Bianchi and Lugosi, 2006, Section 2.3) to make our algorithm anytime without any change in the theoretical regret guarantees. The trick is to work in phases of exponentially increasing lengths, applying the algorithm with known horizon (equal to the duration of the phase) in each phase.
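The doubling trick can be sketched as follows (`run_with_horizon` stands in for one fixed-horizon run of the algorithm; here it simply records the phase length):

```python
def doubling_trick(anytime_budget, run_with_horizon):
    """Run a fixed-horizon algorithm in phases of doubling length
    (Cesa-Bianchi and Lugosi, 2006, Sec. 2.3) to obtain an anytime one."""
    spent, phase, results = 0, 0, []
    while spent < anytime_budget:
        horizon = min(2 ** phase, anytime_budget - spent)  # last phase is clipped
        results.append(run_with_horizon(horizon))
        spent += horizon
        phase += 1
    return results

# Phase lengths for a total budget of 20 evaluations: 1, 2, 4, 8, 5.
phases = doubling_trick(20, lambda h: h)
```

Since the phase lengths grow geometrically, the regret over all phases is within a constant factor of the known-horizon regret, which is why the guarantees carry over unchanged.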
3.3 Analysis of Algorithm 1
In this section, we first specify the assumptions on the covariance functions required for the theoretical analysis and then furnish the missing details of our tree based algorithm for GP bandits. Finally, we derive high probability bounds on the cumulative and simple regret for our algorithm.
3.3.1 Assumptions on the covariance functions
To analyze our algorithm, we will restrict our attention to a class of covariance functions, denoted by $\mathcal{K}$, such that for any $K \in \mathcal{K}$, we have:
For any pair of points, the GP-induced distance is bounded above by a non-decreasing continuous function of their distance in the ambient metric, and this function vanishes at zero. Recall that the ambient metric is assumed to be any metric on the space $\mathcal{X}$, and the GP metric is the one induced on $\mathcal{X}$ by the zero mean GP with covariance function $K$.
Moreover, we require that for all sufficiently small distances this function admits polynomial upper and lower bounds, with constants satisfying suitable compatibility conditions.
The first assumption informally requires that, at least for small distances, points which are close in the ambient metric are also close in the GP metric. These assumptions are satisfied by all the commonly used kernels, such as the squared exponential (SE) and the Matérn family of kernels. The class also includes other kernels, such as the rational quadratic kernel.
We note that $\mathcal{K}$ is closed under finite addition and multiplication operations. This is an important property, as in many practical applications more than one kernel is often combined through addition or multiplication to provide more accurate models (Duvenaud, 2014, Chapter 2; Rasmussen and Williams, 2006).
3.3.2 Details of the algorithm
To complete the description of Algorithm 1, we need to specify the choice of its parameters.
First, we observe a uniform bound on the cell radii that holds for all levels; this follows from the assumption in Definition 4. From the definition of metric dimension, we can then upper bound the number of nodes required at each level. As will be evident in the proof of Theorem 1, these observations dictate an appropriate choice of the confidence parameter.
With this choice, the following event occurs with high probability:
where the (random) quantity above is the number of rounds required by the algorithm to complete its $n$ function evaluations.
This random variable is bounded above by a deterministic quantity, and its deviation probabilities can be controlled. Based on these two observations, we can claim the following:
Finally, we get the required bound by selecting the confidence parameter appropriately.
The calculation above is based on a worst-case assumption. For certain kernels, and for odd values of the associated smoothness parameter, we can use a tighter bound, which allows us to consider larger values of the maximum tree depth.
Next, we obtain the expressions for the remaining parameters as an immediate consequence of Corollary 1.
3.3.3 Regret Bounds
Before presenting the regret bounds, we first characterize the sub-optimality as well as the number of times points are evaluated by Algorithm 1.
If at time $t$ a point is evaluated by the algorithm, then the suboptimality of the selected point can be upper bounded as follows:
Furthermore, if the evaluated point satisfies an additional condition, then we have another bound on the suboptimality in terms of the posterior variance:
A point at level $h$ may be evaluated no more than a fixed number of times before it is expanded, where this number is defined below.
Furthermore, for $h$ large enough, we have a simpler expression,
using the assumptions on the covariance function $K$.
We recall that under the high probability event, the confidence bounds hold for all points and all times. Furthermore, from the definition of the event, we have the following for all times and all nodes:
Using these two facts we can prove the first part of this lemma in the following way:
Suppose at time $t$ the true maximizer $x^*$ lies in the cell associated with a leaf node, and the algorithm selects and evaluates the point corresponding to that node. Then we have the following sequence of inequalities:
The first inequality follows from the definition of the index, while the second uses the confidence bound that holds under the high probability event. For the third, we use the fact that the parent node must have been expanded, which means its uncertainty must be smaller than the corresponding variation bound. For the next inequality, we observe that $x^*$ must lie in the cell associated with the selected node and then use the definition of the variation bound, while the last step follows from the triangle inequality.
For obtaining the bound in (21), we again use the definition of the index, now upper bounding it by the other term in the minimum:
The inequality above uses the fact that, since the function is evaluated at time $t$, the evaluation condition of the algorithm must hold.
A point must be evaluated by the algorithm sufficiently many times to reduce the uncertainty in the function value at that point to below the variation bound for its level. We provide a loose upper bound on this quantity by bounding the number of function evaluations sufficient to reduce the uncertainty in the value of the function at the point to below the required threshold. Using the first part of Proposition 3, we define the required quantity as follows to get the result.
From Lemma 1, we can see that the algorithm only selects points lying in the near-optimal sets. Now, let us define the following quantities: