Among the most promising approaches to the global optimization of an unknown function under reasonable smoothness assumptions are extensions of the multi-armed bandit setup. Bubeck et al. [2009] highlighted the connection between cumulative regret and simple regret, which facilitates fair comparison between methods, and Bubeck et al. [2011] proposed bandit algorithms on a metric space $\mathcal{X}$, called $\mathcal{X}$-armed bandits. In this context, theory and algorithms have been developed for the case where the expected reward is a function satisfying smoothness conditions such as Lipschitz or Hölder continuity [Kleinberg, 2004, Kocsis and Szepesvári, 2006, Auer et al., 2007, Kleinberg et al., 2008, Munos, 2011]. Another line of work is the Bayesian optimization framework [Jones et al., 1998, Bull, 2011, Mockus, 2012], in which the unknown function is assumed to be the realization of a prior stochastic process distribution, typically a Gaussian process. An efficient algorithm that can be derived in this framework is the popular GP-UCB algorithm due to Srinivas et al. [2010, 2012]. However, an important limitation of upper confidence bound (UCB) strategies without smoothness conditions is that the search space has to be finite with bounded cardinality, a fact which is well known but, to our knowledge, has not been discussed so far in the related literature.
In this paper, we propose an approach which improves both lines of work with respect to their present limitations. Our purpose is to: (i) relax smoothness assumptions that limit the relevance of $\mathcal{X}$-armed bandits in practical situations where target functions may only display random smoothness, (ii) extend the UCB strategy to arbitrary sets $\mathcal{X}$. Here we will assume that $f$, being the realization of a given stochastic process distribution, fulfills a probabilistic smoothness condition. We consider the stochastic process bandit setup and develop a UCB algorithm based on generic chaining [Bogachev, 1998, Adler and Taylor, 2009, Talagrand, 2014, Giné and Nickl, 2015]. Using the generic chaining construction, we compute hierarchical discretizations of $\mathcal{X}$ in the form of chaining trees, in a way that allows precise control of the discretization error. The UCB algorithm then applies on these successive discrete subspaces and chooses the accuracy of the discretization at each iteration so that the cumulative regret it incurs matches the state-of-the-art bounds on finite $\mathcal{X}$. In the paper, we propose an algorithm which computes a generic chaining tree for an arbitrary stochastic process in quadratic time. We show that this tree is optimal for classes like Gaussian processes with high probability. Our theoretical contributions have an impact in the two contexts mentioned above. From the bandit and global optimization point of view, we provide a generic algorithm that incurs state-of-the-art regret on stochastic process objectives including non-trivial functionals of Gaussian processes such as the sum of squares of Gaussian processes (in the spirit of mean-square-error minimization), or nonparametric Gaussian processes on ellipsoids (RKHS classes), or the Ornstein-Uhlenbeck process, which was conjectured impossible by [Srinivas et al., 2010] and [Srinivas et al., 2012]. From the point of view of Gaussian process theory, the generic chaining algorithm leads to tight bounds on the supremum of the process in probability and not only in expectation.
The remainder of the paper is organized as follows. In Section 2, we present the stochastic process bandit framework over continuous spaces. Section 3 is devoted to the construction of generic chaining trees for search space discretization. Regret bounds are derived in Section 4 after choosing adequate discretization depth. Finally, lower bounds are established in Section 5.
2 Stochastic Process Bandits Framework
We consider the optimization of an unknown function $f : \mathcal{X} \to \mathbb{R}$ which is assumed to be sampled from a given separable stochastic process distribution. The input space $\mathcal{X}$ is an arbitrary space, not restricted to subsets of $\mathbb{R}^D$, and we will see in the next section how the geometry of $\mathcal{X}$ for a particular metric is related to the hardness of the optimization. At each iteration $t$, an algorithm does the following:
- it queries $f$ at a point $x_t$ chosen using the previously acquired information,
- it receives a noisy observation $y_t = f(x_t) + \epsilon_t$,
where the $(\epsilon_t)_{t \ge 1}$ are independent centered Gaussian variables $\mathcal{N}(0, \eta^2)$ of known variance. We evaluate the performance of such an algorithm using the cumulative regret:
$$R_T = T \sup_{x \in \mathcal{X}} f(x) - \sum_{t=1}^{T} f(x_t).$$
This objective is not observable in practice, and our aim is to give theoretical upper bounds that hold with arbitrarily high probability, in the form:
$$\Pr\big[R_T \le g(T, u)\big] \ge 1 - e^{-u}.$$
Since the stochastic process is separable, the supremum over $\mathcal{X}$ can be replaced by the supremum over all finite subsets of $\mathcal{X}$ [Boucheron et al., 2013]. Therefore we can assume without loss of generality that $\mathcal{X}$ is finite with arbitrary cardinality. We discuss practical approaches to handling continuous spaces in Appendix C. Note that the probabilities are taken under the product space of both the stochastic process $f$ itself and the independent Gaussian noises $(\epsilon_t)_{t \ge 1}$. The algorithm faces the exploration-exploitation tradeoff: it has to decide between reducing the uncertainty on $f$ and maximizing the rewards. In some applications one may be interested in finding the maximum of $f$ only, that is, in minimizing the simple regret:
$$S_T = \sup_{x \in \mathcal{X}} f(x) - \max_{t \le T} f(x_t).$$
We will reduce our analysis to this case by simply observing that $S_T \le R_T / T$.
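The relation between the two notions of regret can be checked numerically on a finite search space. The following is a minimal sketch, assuming the standard definitions of cumulative and simple regret; the objective values and the query sequence are hypothetical, randomly generated data, not taken from the paper.

```python
import numpy as np

def cumulative_regret(f_values, queried_idx):
    """Sum over iterations of the gap between the maximum of f and f(x_t)."""
    f_values = np.asarray(f_values, dtype=float)
    gaps = f_values.max() - f_values[np.asarray(queried_idx)]
    return float(gaps.sum())

def simple_regret(f_values, queried_idx):
    """Gap between the maximum of f and the best value queried so far."""
    f_values = np.asarray(f_values, dtype=float)
    return float(f_values.max() - f_values[np.asarray(queried_idx)].max())

# Hypothetical example: 50 candidate points, 20 arbitrary queries.
rng = np.random.default_rng(0)
f_values = rng.normal(size=50)
queried_idx = rng.integers(0, 50, size=20)

R_T = cumulative_regret(f_values, queried_idx)
S_T = simple_regret(f_values, queried_idx)
T = len(queried_idx)
```

The simple regret is the smallest of the per-iteration gaps, so it can never exceed their average, which is why a bound on the cumulative regret transfers to the simple regret.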
Confidence Bound Algorithms and Discretization.
To deal with the uncertainty, we adopt the optimistic optimization paradigm and compute high confidence intervals where the values $f(x)$ lie with high probability, and then query the point maximizing the upper confidence bound [Auer et al., 2002]. A naive approach would use a union bound over all of $\mathcal{X}$ to get the high confidence intervals at every point $x \in \mathcal{X}$. This would work for a search space with fixed cardinality $|\mathcal{X}|$, resulting in a factor $\sqrt{\log |\mathcal{X}|}$ in the Gaussian case, but it fails when $|\mathcal{X}|$ is unbounded, typically a grid of high density approximating a continuous space. In the next section, we tackle this challenge by employing generic chaining to build hierarchical discretizations of $\mathcal{X}$.
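The naive union-bound approach can be illustrated for a single round on a finite set of candidates. This is a sketch under stated assumptions: the posterior means `mu`, deviations `sigma` and the failure probability `delta` are hypothetical inputs, and the multiplier uses the Gaussian tail bound $\Pr[|Z| > s] \le e^{-s^2/2}$ together with a union bound over the candidates only (not over iterations).

```python
import math
import numpy as np

def ucb_width_factor(n_points, delta):
    """Width multiplier s such that n_points * exp(-s^2 / 2) = delta."""
    return math.sqrt(2.0 * math.log(n_points / delta))

def select_ucb(mu, sigma, delta):
    """Index of the candidate maximizing the upper confidence bound."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    beta = ucb_width_factor(mu.size, delta)
    return int(np.argmax(mu + beta * sigma))

# The multiplier grows without bound as the grid gets denser.
factors = [ucb_width_factor(n, 0.1) for n in (10, 10**3, 10**6)]

# Hypothetical two-point example: the uncertain point is preferred.
chosen = select_ucb(mu=[0.0, 0.5], sigma=[1.0, 0.0], delta=0.1)
```

The multiplier scales like the square root of the logarithm of the number of candidates, so refining a grid that approximates a continuous space makes the intervals arbitrarily wide, which is the failure mode described above.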
3 Discretizing the Search Space via Generic Chaining
3.1 The Stochastic Smoothness of the Process
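Let $\ell_u(x, y)$ for $x, y \in \mathcal{X}$ and $u \ge 0$ be the following confidence bound on the increments of $f$:
$$\ell_u(x, y) = \inf\big\{s \in \mathbb{R} : \Pr[f(x) - f(y) > s] < e^{-u}\big\}.$$
In short, $\ell_u(x, y)$ is the best bound satisfying $\Pr[f(x) - f(y) > \ell_u(x, y)] \le e^{-u}$. For particular distributions of $f$, it is possible to obtain closed formulae for $\ell_u$. However, in the present work we will consider upper bounds on $\ell_u$. Typically, if $f$ is distributed as a centered Gaussian process of covariance $k$, which we denote $f \sim \mathcal{GP}(0, k)$, we know that $\ell_u(x, y) \le \sqrt{2u}\, d(x, y)$, where $d(x, y) = \big(\mathbb{E}(f(x) - f(y))^2\big)^{1/2}$ is the canonical pseudo-metric of the process. More generally, if there exists a pseudo-metric $d(\cdot, \cdot)$ and a function $\psi(\cdot, \cdot)$ bounding the logarithm of the moment-generating function of the increments, that is,
$$\log \mathbb{E}\, e^{\lambda (f(x) - f(y))} \le \psi\big(\lambda, d(x, y)\big),$$
for $x, y \in \mathcal{X}$ and $\lambda \in I \subseteq \mathbb{R}$, then using the Chernoff bounding method [Boucheron et al., 2013],
$$\ell_u(x, y) \le \psi^{*-1}\big(u, d(x, y)\big),$$
where $\psi^*(s, \delta) = \sup_{\lambda \in I} \{\lambda s - \psi(\lambda, \delta)\}$ is the Fenchel-Legendre dual of $\psi$ and $\psi^{*-1}(u, \delta) = \inf\{s \in \mathbb{R} : \psi^*(s, \delta) > u\}$ denotes its generalized inverse. In that case, we say that $f$ is a $(d, \psi)$-process. For example if $f$ is sub-Gamma, that is:
$$\psi(\lambda, \delta) \le \frac{\nu \lambda^2 \delta^2}{2(1 - c \lambda \delta)}, \tag{1}$$
we obtain,
$$\ell_u(x, y) \le \big(c u + \sqrt{2 \nu u}\big)\, d(x, y). \tag{2}$$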
3.2 A Tree of Successive Discretizations
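As stated in the introduction, our strategy to obtain confidence intervals for stochastic processes is by successive discretization of $\mathcal{X}$. We define a notion of tree that will be used for this purpose. A set $\mathcal{T} = (\mathcal{T}_h)_{h \ge 0}$ where $\mathcal{T}_h \subseteq \mathcal{X}$ for $h \ge 0$ is a tree with parent relationship $p : \mathcal{X} \to \mathcal{X}$, when for all $x \in \mathcal{T}_{h+1}$ its parent is given by $p(x) \in \mathcal{T}_h$. We denote by $\mathcal{T}_{\le h}$ the set of the nodes of $\mathcal{T}$ at depth at most $h$: $\mathcal{T}_{\le h} = \bigcup_{h' \le h} \mathcal{T}_{h'}$. For $h \ge 0$ and a node $x \in \mathcal{T}_{h'}$ with $h \le h'$, we also denote by $p_h(x)$ its parent at depth $h$, that is $p_h(x) = p^{h' - h}(x)$, and we write $x \succ s$ when $s$ is a parent of $x$. To simplify the notations in the sequel, we extend the relation $p_h$ to $p_h(x) = x$ when $x \in \mathcal{T}_{\le h}$.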
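The tree structure described above can be sketched as a small data structure. This is an illustrative sketch only: the class name `DiscretizationTree`, its methods and the toy labels are hypothetical, and nodes are keyed by (depth, label) so that the same point of the search space may appear at several depths.

```python
from typing import Dict, Hashable, List, Tuple

class DiscretizationTree:
    """Levels of nodes with a parent map from each level to the previous one."""

    def __init__(self, root: Hashable) -> None:
        self.levels: List[List[Hashable]] = [[root]]
        self.parent: Dict[Tuple[int, Hashable], Hashable] = {}

    def add_level(self, parent_of: Dict[Hashable, Hashable]) -> None:
        """Append a level; parent_of maps each new node to a node of the last level."""
        depth = len(self.levels)
        previous = set(self.levels[-1])
        for child, par in parent_of.items():
            if par not in previous:
                raise ValueError(f"parent {par!r} is not at depth {depth - 1}")
            self.parent[(depth, child)] = par
        self.levels.append(list(parent_of))

    def ancestor(self, point: Hashable, depth: int, h: int) -> Hashable:
        """Ancestor at depth h of a node at `depth`; the node itself if depth <= h."""
        while depth > h:
            point = self.parent[(depth, point)]
            depth -= 1
        return point

    def nodes_up_to(self, h: int) -> List[Tuple[int, Hashable]]:
        """All (depth, node) pairs at depth at most h."""
        return [(d, x) for d, level in enumerate(self.levels[: h + 1]) for x in level]

# Hypothetical three-level tree over the labels a, b, c, d.
tree = DiscretizationTree("a")
tree.add_level({"a": "a", "b": "a"})
tree.add_level({"a": "a", "c": "b", "d": "b"})
```

For instance, `tree.ancestor("c", 2, 1)` walks one step up from the node `c` at depth 2, and a node already at depth at most `h` is returned unchanged, mirroring the extension of the parent relation described above.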
We now introduce a powerful inequality bounding the supremum of the difference of $f$ between a node and any of its descendants in $\mathcal{T}$, provided that $|\mathcal{T}_h|$ is not excessively large.
[Generic Chaining Upper Bound] Fix any , and an increasing sequence of integers. Set where is the Riemann zeta function. Then for any tree such that ,
holds with probability at least , where,
The full proof of the theorem can be found in Appendix B. It relies on repeated application of the union bound over the pairs .
Now, if we look at $\mathcal{T}_h$ as a discretization of $\mathcal{X}$ where a point $x \in \mathcal{X}$ is approximated by its parent at depth $h$, this result can be read in terms of discretization error, as stated in the following corollary.
[Discretization error of ] Under the assumptions of Theorem 3.2 with for large enough, we have that,
holds with probability at least .
3.3 Geometric Interpretation for -processes
The previous inequality suggests that to obtain a good upper bound on the discretization error, one should take such that is as small as possible for every and . We specify what it implies for -processes. In that case, we have:
Writing the -radius of the “cell” at depth containing , we remark that , that is:
In order to make this bound as small as possible, one should spread the points of in so that is evenly small, while satisfying the requirement . Let and , and define an -net as a set for which is covered by -balls of radius with center in . Then if one takes , twice the metric entropy of , that is the logarithm of the minimal -net, we obtain with probability at least that :
where . The tree achieving this bound is obtained by computing a minimal $\epsilon$-net at each depth, which can be done efficiently by Algorithm 2 if one is satisfied with an almost optimal heuristic, whose approximation ratio is discussed in Appendix C. This technique is often called classical chaining [Dudley, 1967] and we note that an implementation appears in Contal et al. [2015] on real data. However, the upper bound in Eq. 3 is not always tight, for instance for a Gaussian process indexed by an ellipsoid, as discussed in Section 4.2. We will present later in Section 5 an algorithm to compute a tree in quadratic time leading to both a lower and an upper bound on when is a Gaussian process.
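The construction of nets at successively finer scales can be sketched with a greedy set-cover heuristic on a finite search space. This is a simplified stand-in under stated assumptions, not necessarily the paper's Algorithm 2: the function name `greedy_epsilon_net`, the halving radii and the toy points on a segment are all hypothetical choices for illustration.

```python
import numpy as np

def greedy_epsilon_net(dist, eps):
    """Greedy cover: indices of centers such that every point lies within
    distance eps of at least one center. `dist` is an (n, n) matrix of
    pairwise pseudo-distances with a zero diagonal."""
    dist = np.asarray(dist, dtype=float)
    uncovered = np.ones(dist.shape[0], dtype=bool)
    centers = []
    while uncovered.any():
        # For each candidate center, count the uncovered points in its ball.
        covers = (dist <= eps) & uncovered[None, :]
        best = int(np.argmax(covers.sum(axis=1)))
        centers.append(best)
        # Points inside the chosen ball are now covered.
        uncovered &= dist[best] > eps
    return centers

# Hypothetical example: 11 equally spaced points on the unit segment.
points = np.linspace(0.0, 1.0, 11)
dist = np.abs(points[:, None] - points[None, :])
diameter = dist.max()

# One net per depth, with radii halving at each level (an assumed schedule).
nets = [greedy_epsilon_net(dist, diameter * 2.0 ** -h) for h in range(4)]
```

Each iteration covers at least one new point, because every uncovered point lies in its own ball, so the loop terminates after at most as many steps as there are points. The greedy choice does not guarantee a minimal net, only a valid cover.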
The previous inequality is particularly convenient when we know a bound on the growth of the metric entropy of , as stated in the following corollary.
[Sub-Gamma process with metric entropy bound] If is sub-Gamma and there exists such that for all , , then with probability at least :
With the condition on the growth of the metric entropy, we obtain . With Eq. 3 for a sub-Gamma process we get, knowing that and , that . ∎
Note that the conditions of Corollary 3.3 are fulfilled when and there is such that for all , by simply cutting $\mathcal{X}$ into hyper-cubes of side length . We also remark that this condition is very close to the near-optimality dimension of the metric space defined in Bubeck et al. [2011]. However, our condition constrains the entire search space $\mathcal{X}$ instead of the near-optimal set . Controlling the dimension of may allow one to obtain an exponential decay of the regret for particular deterministic functions with a quadratic behavior near their maximum. However, to our knowledge no progress has been made in this direction for stochastic processes without constraining their behavior around the maximum. A reader interested in this subject may look at the recent work by Grill et al. [2015] on smooth and noisy functions with unknown smoothness, and the works by de Freitas et al. [2012] or Wang et al. [2014] on Gaussian processes without noise and a quadratic local behavior.
4 Regret Bounds for Bandit Algorithms
Now that we have a tool to discretize $\mathcal{X}$ at a given accuracy, we show here how to derive an optimization strategy on $\mathcal{X}$.
4.1 High Confidence Intervals
Assume that given observations at queried locations , we can compute and for all and , such that:
Then for any that we will carefully choose later, we obtain by a union bound on that:
And by an additional union bound on that:
where for any and is the Riemann zeta function. Our optimistic decision rule for the next query is thus:
Combining this with Corollary 3.2, we are able to prove the following bound linking the regret with and the width of the confidence interval.
[Generic Regret Bound] When for all , we have with probability at least :
In order to select the level of discretization to reduce the bound on the regret, it is required to have explicit bounds on and the confidence intervals. For example by choosing
we obtain as shown later. The performance of our algorithm is thus linked with the decrease rate of , which characterizes the “size” of the optimization problem. We first study the case where is distributed as a Gaussian process, and then for a sum of squared Gaussian processes.
4.2 Results for Gaussian Processes
The problem of regret minimization where $f$ is sampled from a Gaussian process has been introduced by Srinivas et al. [2010] and Grunewalder et al. [2010]. Since then, it has been extensively adapted to various settings of Bayesian optimization with successful practical applications. In the first work the authors address the cumulative regret and assume that either $\mathcal{X}$ is finite or that the samples of the process are Lipschitz with high probability, where the distribution of the Lipschitz constant has Gaussian tails. In the second work the authors address the simple regret without noise and with known horizon, and they assume that the canonical pseudo-metric is bounded by a given power of the supremum norm. Both works require that the input space is a subset of $\mathbb{R}^D$. The analysis in our paper allows us to derive similar bounds in a nonparametric fashion where $\mathcal{X}$ is an arbitrary metric space. Note that if $\mathcal{X}$ is not totally bounded, then the supremum of the process is infinite with probability one, and so is the regret of any algorithm.
Confidence intervals and information gain.
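First, $f$ being distributed as a Gaussian process, it is easy to derive confidence intervals given a set of observations. Writing $\mathbf{Y}_t = [y_1, \dots, y_t]^\top$ the vector of noisy values at points in $X_t = \{x_1, \dots, x_t\}$, we find by Bayesian inference [Rasmussen and Williams, 2006] that:
$$\Pr\big[\,|f(x) - \mu_t(x)| \ge \sigma_t(x) \sqrt{2u} \;\big|\; X_t, \mathbf{Y}_t\big] \le e^{-u},$$
for all $x \in \mathcal{X}$ and $u > 0$, where:
$$\mu_t(x) = \mathbf{k}_t(x)^\top \mathbf{C}_t^{-1} \mathbf{Y}_t, \tag{6}$$
$$\sigma_t^2(x) = k(x, x) - \mathbf{k}_t(x)^\top \mathbf{C}_t^{-1} \mathbf{k}_t(x), \tag{7}$$
where $\mathbf{k}_t(x) = [k(x_i, x)]_{x_i \in X_t}$ is the covariance vector between $f(x)$ and $\mathbf{Y}_t$, $\mathbf{C}_t = \mathbf{K}_t + \eta^2 \mathbf{I}$, and $\mathbf{K}_t = [k(x_i, x_j)]_{x_i, x_j \in X_t}$ the covariance matrix and $\eta^2$ the variance of the Gaussian noise. Therefore the width of the confidence interval in Theorem 4.1 can be bounded in terms of $\sigma_{t-1}$: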
Furthermore it is proved in Srinivas et al.  that the sum of the posterior variances at the queried points is bounded in terms of information gain:
where and is the maximum information gain of obtainable by a set of points. Note that for Gaussian processes, the information gain is simply . Finally, using the Cauchy-Schwarz inequality and the fact that is increasing we have with probability at least :
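The quantity $\gamma_T$ heavily depends on the covariance of the process. On one extreme, if $k(\cdot, \cdot)$ is a Kronecker delta, $f$ is a Gaussian white noise process and $\gamma_T = \mathcal{O}(T)$. On the other hand, Srinivas et al. [2012] proved the following inequalities for widely used covariance functions and $\mathcal{X} \subset \mathbb{R}^D$:
- linear covariance $k(x, y) = x^\top y$, $\gamma_T = \mathcal{O}(D \log T)$.
- squared exponential covariance $k(x, y) = e^{-\frac{1}{2} \|x - y\|_2^2}$, $\gamma_T = \mathcal{O}\big((\log T)^{D+1}\big)$.
- Matérn covariance, $k(x, y) = \frac{2^{1-\nu}}{\Gamma(\nu)} r^{\nu} K_\nu(r)$, where $r = \sqrt{2\nu}\, \|x - y\|_2$ and $K_\nu$ is the modified Bessel function, $\gamma_T = \mathcal{O}(T^{a} \log T)$, with $a = \frac{D(D+1)}{2\nu + D(D+1)}$ for $\nu > 1$.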
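The posterior mean, the posterior deviation and the information gain discussed in this subsection can be sketched numerically. This is a minimal sketch under stated assumptions: it uses the standard Gaussian process regression formulas [Rasmussen and Williams, 2006] and the log-determinant expression of the information gain, with a unit length scale squared exponential covariance; the function names and the toy data are hypothetical.

```python
import numpy as np

def sq_exp_kernel(A, B):
    """Squared exponential covariance with unit length scale (an assumption).
    A has shape (n, D), B has shape (m, D); returns an (n, m) matrix."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dist)

def gp_posterior(X_obs, y_obs, X_query, noise_var, kernel=sq_exp_kernel):
    """Posterior mean and standard deviation at X_query given noisy observations."""
    C = kernel(X_obs, X_obs) + noise_var * np.eye(len(X_obs))
    k_q = kernel(X_obs, X_query)                 # shape (t, m)
    chol = np.linalg.cholesky(C)                 # C = chol @ chol.T
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_obs))
    mu = k_q.T @ alpha
    v = np.linalg.solve(chol, k_q)
    var = np.diag(kernel(X_query, X_query)) - (v ** 2).sum(axis=0)
    return mu, np.sqrt(np.maximum(var, 0.0))     # clip tiny negative round-off

def information_gain(X_obs, noise_var, kernel=sq_exp_kernel):
    """Half the log-determinant of I + K / noise_var."""
    K = kernel(X_obs, X_obs)
    _, logdet = np.linalg.slogdet(np.eye(len(X_obs)) + K / noise_var)
    return 0.5 * logdet

# Hypothetical one-dimensional example with two observations.
X_obs = np.array([[0.0], [1.0]])
y_obs = np.array([0.5, -0.2])
X_query = np.array([[0.0], [0.5], [3.0]])
mu, sigma = gp_posterior(X_obs, y_obs, X_query, noise_var=0.01)
gain = information_gain(X_obs, noise_var=0.01)
```

The Cholesky factorization is used instead of an explicit matrix inverse for numerical stability. In this toy example the posterior deviation is small at an observed location and close to the prior deviation far from the data.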
Bounding with the metric entropy.
This upper bound holds true in particular for Gaussian processes with and for all , . For stationary covariance this becomes which is satisfied for the usual covariances used in Bayesian optimization such as the squared exponential covariance or the Matérn covariance with parameter . For these values of it is well known that , with , and . Then we see that it suffices to choose to obtain and since and ,
holds with high probability. Such a bound holds true in particular for the Ornstein-Uhlenbeck process, which was conjectured impossible in Srinivas et al. [2010] and Srinivas et al. [2012]. However, we do not know suitable bounds for $\gamma_T$ in this case and cannot deduce convergence rates.
Gaussian processes indexed on ellipsoids and RKHS.
As mentioned in Section 3.3, the previous bound on the discretization error is not tight for every Gaussian process. An important example is when the search space is a (possibly infinite dimensional) ellipsoid:
$$\mathcal{X} = \Big\{x \in \ell^2 : \sum_{i \ge 1} \frac{x_i^2}{a_i^2} \le 1\Big\},$$
where $a = (a_i)_{i \ge 1} \in \ell^2$, and $f(x) = \sum_{i \ge 1} x_i g_i$ with $g_i \overset{\text{iid}}{\sim} \mathcal{N}(0, 1)$, and the pseudo-metric $d(x, y)$ coincides with the usual $\ell_2$ metric. The study of the supremum of such processes is connected to learning error bounds for kernel machines like Support Vector Machines, as a quantity bounding the learning capacity of a class of functions in an RKHS, see for example Mendelson [2002]. It can be shown by geometrical arguments that $\mathbb{E} \sup_{x \in \mathcal{X}} f(x) \le \sqrt{\sum_{i \ge 1} a_i^2}$ and that this supremum exhibits $\chi^2$-tails around its expectation, see for example Boucheron et al. [2013] and Talagrand [2014]. This concentration is not captured by Corollary 3.3; one has to leverage the construction of Section 5 to get a tight estimate. Therefore the present work forms a step toward efficient and practical online model selection in such classes, in the spirit of Rakhlin and Sridharan [2014] and Gaillard and Gerchinovitz [2015].
4.3 Results for Quadratic Forms of Gaussian Processes
The preeminent model in Bayesian optimization is by far the Gaussian process. Yet, it is very common to attempt to minimize the regret on functions which do not look like Gaussian processes. Consider the typical cases where $f$ has the form of a mean square error or a Gaussian likelihood. In both cases, minimizing $f$ is equivalent to minimizing a sum of squares, which we cannot assume to be sampled from a Gaussian process. To alleviate this problem, we show that this objective fits in our generic setting. Indeed, if we consider that $f$ is a sum of squares of Gaussian processes, then $f$ is sub-Gamma with respect to a natural pseudo-metric. In order to match the challenge of maximization, we will precisely take the opposite. In this particular setting we allow the algorithm to observe directly the noisy values of the separate Gaussian processes, instead of the sum of their squares. To simplify the forthcoming arguments, we will choose independent and identically distributed processes, but one can remove the covariances between the processes by Cholesky decomposition of the covariance matrix, and then our analysis adapts easily to processes with non-identical distributions.
The stochastic smoothness of squared GP.
Let , where are independent centered Gaussian processes with stationary covariance such that for every . We have for and :
Therefore with and , we conclude that is a -process. Since for , which can be proved by series comparison, we obtain that is sub-Gamma with parameters and . Now with Eq. 2,
Furthermore, we also have that for and standard covariance functions including the squared exponential covariance or the Matérn covariance with parameter or . Then Corollary 3.3 leads to:
Confidence intervals for squared GP.
As mentioned above, we consider here that we are given separated noisy observations for each of the processes. Deriving confidence intervals for given is a tedious task since the posterior processes given are not standard nor centered. We propose here a solution based directly on a careful analysis of Gaussian integrals. The proof of the following technical lemma can be found in Appendix D.
[Tails of squared Gaussian] Let and . We have:
for and .
Using this lemma, we compute the confidence interval for by a union bound over . Denoting and the posterior expectation and deviation of given (computed as in Eq. 6 and 7), the confidence interval follows for all :
Then we also have:
Since , and we obtain . Therefore Theorem 4.1 says with probability at least :
It is now possible to proceed as in Section 4.2 and bound the sum of posterior variances with :
As before, under the conditions of Eq. 9 and choosing the discretization level we obtain , and since ,
holds with high probability.
5 Tightness Results for Gaussian Processes
We present in this section a strong result on the tree obtained by Algorithm 1. Let $f$ be a centered Gaussian process with arbitrary covariance $k$. We show that a converse of Theorem 3.2 is true with high probability.
5.1 A High Probabilistic Lower Bound on the Supremum
We first recall that for Gaussian process we have , that is:
with probability at least . For the following, we will fix for a geometric sequence for all . Therefore we have the following upper bound:
Fix any and let be constructed as in Algorithm 1. Then there exists a constant such that, for ,
holds for all and with probability at least .
To show the tightness of this result, we prove the following probabilistic bound: [Generic Chaining Lower Bound] Fix any and let be constructed as in Algorithm 1. Then there exists a constant such that, for ,
holds for all and with probability at least .
The benefit of this lower bound is huge for theoretical and practical reasons. It first says that we cannot discretize $\mathcal{X}$ in a finer way than Algorithm 1, up to a constant factor. This also means that even if the search space is “smaller” than what the metric entropy suggests, as for ellipsoids, Algorithm 1 finds the correct “size”. To our knowledge, this result is the first construction of a tree leading to a lower bound at every depth with high probability. The proof of this theorem shares some similarity with the construction used to obtain lower bounds in expectation, see for example Talagrand [2014], or Ding et al. [2011] for a tractable algorithm.
5.2 Analysis of Algorithm 1
Therefore we have for all . Moreover, by looking at how the -net is computed we also have for all . These two properties are crucial for the proof of the lower bound.
Then, the algorithm updates the tree to make it well balanced, that is such that no node has more than children. We note at this time that this condition will be already satisfied in every reasonable space, so that the complex procedure that follows is only required in extreme cases. To force this condition, Algorithm 1 starts from the leaves and “prunes” the branches if they outnumber . We remark that this backward step is not present in the literature on generic chaining, and is needed for our objective of a lower bound with high probability. By doing so, it creates a node called a pruned node which will take as children the pruned branches. For this construction to be tight, the pruning step has to be careful. Algorithm 1 attaches to every pruned node a value, computed using the values of its children, hence the backward strategy. When pruning branches, the algorithm keeps the nodes with maximum values and displaces the others. The intuition behind this strategy is to avoid pruning branches that already contain pruned nodes.
Finally, note that this pruning step may create unbalanced pruned nodes when the number of nodes at depth is much larger than . When this is the case, Algorithm 1 restarts the pruning with the updated tree to recompute the values. Thanks to the doubly exponential growth in the balance condition, this cannot occur more than times and the total complexity is .
5.3 Computing the Pruning Values and Anti-Concentration Inequalities
We end this section by describing the values used for the pruning step. We need a function satisfying the following anti-concentration inequality. For all , let and such that and , and finally . Then is such that:
We thank Cédric Malherbe and Kevin Scaman for fruitful discussions.
- Adler and Taylor [2009] R. J. Adler and J. E. Taylor. Random fields and geometry. Springer Science & Business Media, 2009.
- Auer et al. [2002] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
- Auer et al. [2007] P. Auer, R. Ortner, and C. Szepesvári. Improved rates for the stochastic continuum-armed bandit problem. In Proceedings of the 20th Annual Conference on Learning Theory (COLT), pages 454–468. Omnipress, 2007.
- Bogachev [1998] V. I. Bogachev. Gaussian measures, volume 62. American Mathematical Society Providence, 1998.
- Boucheron et al. [2013] S. Boucheron, G. Lugosi, and P. Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press, 2013.
- Bubeck et al. [2009] S. Bubeck, R. Munos, and G. Stoltz. Pure exploration in multi-armed bandits problems. In Algorithmic Learning Theory: 20th International Conference (ALT), pages 23–37. Springer-Verlag, 2009.
- Bubeck et al. [2011] S. Bubeck, R. Munos, G. Stoltz, and C. Szepesvári. X-armed bandits. Journal of Machine Learning Research, 12:1655–1695, 2011.
- Bull [2011] A. D. Bull. Convergence rates of efficient global optimization algorithms. The Journal of Machine Learning Research, 12:2879–2904, 2011.
- Contal et al. [2015] E. Contal, C. Malherbe, and N. Vayatis. Optimization for Gaussian processes via chaining. NIPS Workshop on Bayesian Optimization, 2015.
- Côté et al. [2012] F. D. Côté, I. N. Psaromiligkos, and W. J. Gross. A Chernoff-type lower bound for the Gaussian Q-function. arXiv preprint arXiv:1202.6483, 2012.
- de Freitas et al. [2012] N. de Freitas, A. J. Smola, and M. Zoghi. Exponential regret bounds for Gaussian process bandits with deterministic observations. In Proceedings of the 29th International Conference on Machine Learning (ICML). icml.cc / Omnipress, 2012.
- Ding et al. [2011] J. Ding, J. R. Lee, and Y. Peres. Cover times, blanket times, and majorizing measures. In Proceedings of the forty-third annual ACM symposium on Theory of computing (STOC), pages 61–70. ACM, 2011.
- Dudley [1967] R. M. Dudley. The sizes of compact subsets of Hilbert space and continuity of Gaussian processes. Journal of Functional Analysis, 1(3):290–330, 1967.
- Gaillard and Gerchinovitz [2015] P. Gaillard and S. Gerchinovitz. A chaining algorithm for online nonparametric regression. Proceedings of the Conference on Learning Theory (COLT), 2015.
- Giné and Nickl [2015] E. Giné and R. Nickl. Mathematical Foundations of Infinite-Dimensional Statistical Models. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2015.
- Grill et al. [2015] J. B. Grill, M. Valko, and R. Munos. Black-box optimization of noisy functions with unknown smoothness. In Advances in Neural Information Processing Systems 28 (NIPS), pages 667–675. Curran Associates, Inc., 2015.
- Grunewalder et al. [2010] S. Grunewalder, J-Y. Audibert, M. Opper, and J. Shawe-Taylor. Regret bounds for Gaussian process bandit problems. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pages 273–280. MIT Press, 2010.
- Johnson [1973] D. S. Johnson. Approximation algorithms for combinatorial problems. In Proceedings of the fifth annual ACM symposium on Theory of computing (STOC), pages 38–49. ACM, 1973.
- Jones et al. [1998] D. R. Jones, M. Schonlau, and W. J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455–492, December 1998.
- Kleinberg [2004] R. Kleinberg. Nearly tight bounds for the continuum-armed bandit problem. In Advances in Neural Information Processing Systems 17 (NIPS), pages 697–704. MIT Press, 2004.
- Kleinberg et al. [2008] R. Kleinberg, A. Slivkins, and E. Upfal. Multi-armed bandits in metric spaces. In Proceedings of the 40th annual ACM symposium on Theory of computing (STOC), pages 681–690, 2008.
- Kocsis and Szepesvári [2006] L. Kocsis and C. Szepesvári. Bandit based Monte-Carlo planning. In Proceedings of the 17th European conference on Machine Learning (ECML), pages 282–293. Springer, 2006.
- Ledoux and Talagrand [1991] M. Ledoux and M. Talagrand. Probability in Banach Spaces: isoperimetry and processes. Springer Science & Business Media, 1991.
- Mendelson [2002] S. Mendelson. Geometric parameters of kernel machines. In Proceedings of the 15th Annual Conference on Computational Learning Theory (COLT), pages 29–43. Springer-Verlag, 2002.
- Mockus [2012] J. Mockus. Bayesian approach to global optimization: theory and applications, volume 37. Springer Science & Business Media, 2012.
- Munos [2011] R. Munos. Optimistic optimization of deterministic functions without the knowledge of its smoothness. In Advances in Neural Information Processing Systems (NIPS), 2011.
- Rakhlin and Sridharan [2014] A. Rakhlin and K. Sridharan. Online nonparametric regression. Proceedings of the Conference on Learning Theory (COLT), 35:1232–1264, 2014.
- Rasmussen and Williams [2006] C. E. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
- Raz and Safra [1997] R. Raz and S. Safra. A sub-constant error-probability low-degree test, and a sub-constant error-probability PCP characterization of NP. In Proceedings of the twenty-ninth annual ACM symposium on Theory of computing (STOC), pages 475–484. ACM, 1997.
- Srinivas et al. [2010] N. Srinivas, A. Krause, S. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the International Conference on Machine Learning (ICML), pages 1015–1022. icml.cc / Omnipress, 2010.
- Srinivas et al. [2012] N. Srinivas, A. Krause, S. Kakade, and M. Seeger. Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. IEEE Transactions on Information Theory, 58(5):3250–3265, 2012.
- Talagrand [2014] M. Talagrand. Upper and Lower Bounds for Stochastic Processes: Modern Methods and Classical Problems, volume 60. Springer-Verlag Berlin Heidelberg, 2014.
- Wang et al. [2014] Z. Wang, B. Shakibi, L. Jin, and N. de Freitas. Bayesian multi-scale optimistic optimization. In Artificial Intelligence and Statistics (AISTATS), pages 1005–1014, 2014.