After the publication of our article, we found an error in the proof of Lemma 1 which invalidates the main theorem. It appears that the information given to the algorithm is not sufficient for the main theorem to hold true. The theoretical guarantees would remain valid in a setting where the algorithm observes the instantaneous regret instead of noisy samples of the unknown function. We describe in this page the mistake and its consequences.
Let be the unknown function to be optimized, which is a sample from a Gaussian process. Let’s fix and the observations where the noise variables are independent Gaussian noise . We define the instantaneous regret and,
In Lemma 1, we claimed that is a Gaussian martingale with respect to . Even if is a centered Gaussian conditioned on , it is wrong to say that is a martingale since it is not measurable with respect to .
In order to fix Lemma 1, it is possible to modify and use its natural filtration instead of ,
Then is a Gaussian martingale with respect to . Now to adapt the algorithm for this new quantity it needs to observe instead of
to be able to compute both the posterior expectation and variance for allin :
We remark that the experiments performed in this article are remarkably good in spite of Lemma 1 being unproved. After having discovered the mistake we were able to build scenarios were the GP-MI algorithm is overconfident and misses the optimum of , and therefore incurs a linear cumulative regret.
Stochastic optimization problems are encountered in numerous real world domains including engineering design Wang & Shan (2007), finance Ziemba & Vickson (2006), natural sciences Floudas & Pardalos (2000), or in machine learning for selecting models by tuning the parameters of learning algorithms Snoek et al. (2012). We aim at finding the input of a given system which optimizes the output (or reward). In this view, an iterative procedure uses the previously acquired measures to choose the next query predicted to be the most useful. The goal is to maximize the sum of the rewards received at each iteration, that is to minimize the cumulative regret by balancing exploration (gathering information by favoring locations with high uncertainty) and exploitation (focusing on the optimum by favoring locations with high predicted reward). This optimization task becomes challenging when the dimension of the search space is high and the evaluations are noisy and expensive. Efficient algorithms have been studied to tackle this challenge such as multiarmed bandit Auer et al. (2002); Kleinberg (2004); Bubeck et al. (2011); Audibert et al. (2011)2011); Chen & Krause (2013) or Bayesian optimization Mockus (1989); Grunewalder et al. (2010); Srinivas et al. (2012); de Freitas et al. (2012). The theoretical analysis of such optimization procedures requires some prior assumptions on the underlying function . Modeling as a function distributed from a Gaussian process (GP) enforces near-by locations to have close associated values, and allows to control the general smoothness of
with high probability according to the kernel of the GPRasmussen & Williams (2006). Our main contribution is twofold: we propose a generic algorithm scheme for Gaussian process optimization and we prove sharp upper bounds on its cumulative regret. The theoretical analysis has a direct impact on strategies built with the GP-UCB algorithm Srinivas et al. (2012) such as Krause & Ong (2011); Desautels et al. (2012); Contal et al. (2013). We suggest an alternative policy which achieves an exponential speed up with respect to the cumulative regret. We also introduce a novel algorithm, the Gaussian Process Mutual Information algorithm (GP-MI), which improves furthermore upper bounds for the cumulative regret from for GP-UCB, the current state of the art, to the spectacular , where is the number of iterations, is the dimension of the input space and the kernel function is Gaussian. The remainder of this article is organized as follows. We first introduce the setup and notations. We define the GP-MI algorithm in Section 2. Main results on the cumulative regret bounds are presented in Section 3. We then provide technical details in Section 4. We finally confirm the performances of GP-MI on real and synthetic tasks compared to the state of the art of GP optimization and some heuristics used in practice.
2 Gaussian process optimization and the Gp-Mi algorithm
2.1 Sequential optimization and cumulative regret
Let , where is a compact and convex set, be the unknown function modeling the system we want to be optimized. We consider the problem of finding the maximum of denoted by:
via successive queries . At iteration , the choice of the next query depends on the previous noisy observations, at locations where for all , and the noise variables
are independently distributed as a Gaussian random variablewith zero mean and variance . The efficiency of a policy and its ability to address the exploration/exploitation trade-off is measured via the cumulative regret , defined as the sum of the instantaneous regret , the gaps between the value of the maximum and the values at the sample locations,
Our aim is to obtain upper bounds on the cumulative regret with high probability.
2.2 The Gaussian process framework
Prior assumption on .
In order to control the smoothness of the underlying function, we assume that is sampled from a Gaussian process with mean function and kernel function . We formalize in this manner the prior assumption that high local variations of have low probability. The prior mean function is considered without loss of generality to be zero, as the kernel can completely define the GP Rasmussen & Williams (2006). We consider the normalized and dimensionless framework introduced by Srinivas et al. (2010) where the variance is assumed to be bounded, that is for all .
At iteration , given the previously observed noisy values at locations
, we use Bayesian inference to compute the current posterior distributionRasmussen & Williams (2006), which is a GP of mean and variance given at any by,
is the vector of covariances betweenand the query points at time , and with the kernel matrix, is the variance of the noise and
stands for the identity matrix. The Bayesian inference is illustrated on Figure1
in a sample problem in dimension one, where the posteriors are based on four observations of a Gaussian Process with squared exponential kernel. The height of the gray area represents two posterior standard deviations at each point.
2.3 The Gaussian Process Mutual Information algorithm
A novel algorithm.
The Gaussian Process Mutual Information algorithm (GP-MI) is presented as Algorithm 1. The key statement is the choice of the query, . The exploitation ability of the procedure is driven by , while the exploration is governed by , which is an increasing function of . The novelty in the GP-MI algorithm is that is empirically controlled by the amount of exploration that has been already done, that is, the more the algorithm has gathered information on , the more it will focus on the optimum. In the GP-UCB algorithm from Srinivas et al. (2012) the exploration coefficient is a and therefore tends to infinity. The parameter in Algorithm 1 governs the trade-off between precision and confidence, as shown in Theorem 2. The efficiency of the algorithm is robust to the choice of its value. We confirm empirically this property and provide further discussion on the calibration of in Section 5.
The quantity controlling the exploration in our algorithm forms a lower bound on the information acquired on by the query points . The information on is formally defined by , the mutual information between and the noisy observations at , hence the name of the GP-MI algorithm. For a Gaussian process distribution where is the kernel matrix . We refer to Cover & Thomas (1991) for further reading on mutual information. We denote by the maximum mutual information obtainable by a sequence of query points. In the case of Gaussian processes with bounded variance, the following inequality is satisfied (Lemma 5.4 in Srinivas et al. (2012)):
where and is the noise variance. The upper bounds on the cumulative regret we derive in the next section depend mainly on this key quantity.
3 Main Results
3.1 Generic Optimization Scheme
We first consider the generic optimization scheme defined in Algorithm 2, where we let as a generic function viewed as a parameter of the algorithm. We only require to be measurable with respect to , the observations available at iteration . The theoretical analysis of Algorithm 2 can be used as a plug-in theorem for existing algorithms. For example the GP-UCB algorithm with parameter is obtained with . A generic analysis of Algorithm 2 leads to the following upper bounds on the cumulative regret with high probability.
Theorem 1 (Regret bounds for the generic algorithm).
For all and , the regret incurred by Algorithm 2 on distributed as a GP perturbed by independent Gaussian noise with variance satisfies the following bound with high probability, with and :
The proof of Theorem 1, relies on concentration guarantees for Gaussian processes (see Section 4.1). Theorem 1 provides an intermediate result used for the calibration of to face the exploration/exploitation trade-off. For example by choosing (where the dimensional constant is hidden), Algorithm 2 becomes a variant of the GP-UCB algorithm where in particular the exploration parameter is fixed to instead of being an increasing function of . The upper bounds on the cumulative regret with this definition of are of the form , as stated in Corollary 1. We then also consider the case where the kernel of the Gaussian process is under the form of a squared exponential (RBF) kernel, , for all and length scale . In this setting, the maximum mutual information satisfies the upper bound , where is the dimension of the input space Srinivas et al. (2012).
To prove Corollary 1 we apply Theorem 1 with the given definition of and then Equation 3, which leads to . The previously known upper bounds on the cumulative regret for the GP-UCB algorithm are of the form where . The improvement of the generic Algorithm 2 with over the GP-UCB algorithm with respect to the cumulative regret is then exponential in the case of Gaussian processes with RBF kernel. For sampled from a GP with linear kernel, corresponding to with , we obtain . We remark that the GP assumption with linear kernel is more restrictive than the linear bandit framework, as it implies a Gaussian prior over the linear coefficients . Hence there is no contradiction with the lower bounds stated for linear bandit like those of Dani et al. (2008). We refer to Srinivas et al. (2012) for the analysis of with other kernels widely used in practice.
3.2 Regret bounds for the Gp-Mi algorithm
We present here the main result of the paper, the upper bound on the cumulative regret for the GP-MI algorithm.
Theorem 2 (Regret bounds for the Gp-Mi algorithm).
For all and , the regret incurred by Algorithm 1 on distributed as a GP perturbed by independent Gaussian noise with variance satisfies the following bound with high probability, with and :
Corollary 2 (RBF kernels).
The cumulative regret incurred by Algorithm 1 on sampled from a GP with RBF kernel satisfies with high probability,
The GP-MI algorithm significantly improves the upper bounds for the cumulative regret over the GP-UCB algorithm and the alternative policy of Corollary 1.
4 Theoretical analysis
In this section, we provide the proofs for Theorem 1 and Theorem 2. The approach presented here to study the cumulative regret incurred by Gaussian process optimization strategies is general and can be used further for other algorithms.
4.1 Analysis of the general algorithm
The theoretical analysis of Theorem 1 uses a similar approach to the Azuma-Hoeffding inequality adapted for Gaussian processes. Let for all . We define , which is shown later to be a martingale with respect to ,
for and . Let be defined as the martingale difference sequence with respect to , that is the difference between the instantaneous regret and the gap between the posterior mean for the optimum and the one for the point queried,
The sequence is a martingale with respect to and for all , given , the random variable is distributed as a Gaussian with zero mean and variance , where:
From the GP assumption, we know that given , the distribution of is Gaussian for all , and is a projection of a Gaussian random vector, that is is distributed as a Gaussian and is distributed as Gaussian , with , hence is a Gaussian martingale. ∎
We now give a concentration result for using inequalities for self-normalized martingales.
For all and , the martingale normalized by the predictable quadratic variation satisfies the following concentration inequality with and :
Let . We introduce the notation for and for . Given that is a Gaussian martingale, using Theorem 4.2 and Remark 4.2 from Bercu & Touati (2008) with and and we obtain for all :
With where , we have:
Now, using Theorem 4.1 and Remark 4.2 from Bercu & Touati (2008) the following inequality is satisfied for all :
With we have:
4.2 Analysis of the Gp-Mi algorithm
In order to bound the cumulative regret for the GP-MI algorithm, we focus on an alternative definition of the exploration functions where the last term is modified inductively so as to simplify the sum for all . Being a constant term for a fixed , Algorithm 1 remains unchanged. Let be defined as,
where is the point selected by Algorithm 1 at iteration . We have for all ,
We can now derive upper bounds for which will be plugged in Theorem 1 in order to cancel out the terms involving . In this manner we can calibrate sharply the exploration/exploitation trade-off by optimizing the remaining terms.
For the GP-MI algorithm, the exploration term in the equation of Theorem 1 satisfies the following inequality:
Using our alternative definition of which gives the equality stated in Equation 8, we know that,
By concavity of the square root, we have for all that . Introducing the notations and , we obtain,
Moreover, with for all , we have and for all which gives,
leading to the inequality of Lemma 4. ∎
Considering Theorem 1 in the case of the GP-MI algorithm and bounding with Lemma 4, we obtain the following bound on the cumulative regret incurred by GP-MI: Pr[ R˙T ≤α^γ˙T - α2 ∑˙t=1Tσ˙t2(x⋆) ^γ˙T+1 + 4α(C˙1 γ˙T+1) + α2 ∑˙t=1Tσ˙t2(x⋆)C˙1 γ˙T+1 ] ≥1-δ , which simplifies to the inequality of Lemma 5 using Equation 3, and thus proves Theorem 2. ∎
5 Practical considerations and experiments
5.1 Numerical experiments
We compare the empirical performances of our algorithm against the state-of-the-art of GP optimization, the GP-UCB algorithm Srinivas et al. (2012), and a commonly used heuristic, the Expected Improvement (EI) algorithm with GP Jones et al. (1998). The tasks used for assessment come from two real applications and five synthetic problems described here. For all data sets and algorithms the learners were initialized with a random subset of observations
. When the prior distribution of the underlying function was not known, the Bayesian inference was made using a squared exponential kernel. We first picked the half of the data set to estimate the hyper-parameters of the kernel via cross validation in this subset. In this way, each algorithm was running with the same prior information. The value of the parameterfor the GP-MI and the GP-UCB algorithms was fixed to for all these experimental tasks. Modifying this value by several orders of magnitude is insignificant with respect to the empirical mean cumulative regret incurred by the algorithms, as discussed in Section 5.2. The results are provided in Figure 3. The curves show the evolution of the average regret in term of iteration
. We report the mean value with the confidence interval over a hundred experiments.
Description of the data sets.
We describe briefly all the data sets used for assessment.
Generated GP. The generated Gaussian process functions are random GPs drawn from an isotropic Matérn kernel in dimension and , with the kernel bandwidth set to for dimension , and for dimension . The Matérn parameter was set to and the noise standard deviation to 1% of the signal standard deviation.
Gaussian Mixture. This synthetic function comes from the addition of three -D Gaussian functions. We then perturb these Gaussian functions with smooth variations generated from a Gaussian Process with isotropic Matérn Kernel and 1% of noise. It is shown on Figure 2(a). The highest peak being thin, the sequential search for the maximum of this function is quite challenging.
Himmelblau. This task is another synthetic function in dimension . We compute a slightly tilted version of the Himmelblau’s function with the addition of a linear function, and take the opposite to match the challenge of finding its maximum. This function presents four peaks but only one global maximum. It gives a practical way to test the ability of a strategy to manage exploration/exploitation trade-offs. It is represented in Figure 2(b).
Branin. The Branin or Branin-Hoo function is a common benchmark function for global optimization. It presents three global optimum in the -D square . This benchmark is one of the two synthetic functions used by Srinivas et al. (2012) to evaluate the empirical performances of the GP-UCB algorithm. No noise has been added to the original signal in this experimental task.
Goldstein-Price. The Goldstein & Price function is an other benchmark function for global optimization, with a single global optimum but several local optima in the -D square . This is the second synthetic benchmark used by Srinivas et al. (2012). Like in the previous challenge, no noise has been added to the original signal.
Tsunamis. Recent post-tsunami survey data as well as the numerical simulations of Hill et al. (2012) have shown that in some cases the run-up, which is the maximum vertical extent of wave climbing on a beach, in areas which were supposed to be protected by small islands in the vicinity of coast, was significantly higher than in neighboring locations. Motivated by these observations Stefanakis et al. (2012) investigated this phenomenon by employing numerical simulations using the VOLNA code Dutykh et al. (2011) with the simplified geometry of a conical island sitting on a flat surface in front of a sloping beach. In the study of Stefanakis et al. (2013) the setup was controlled by five physical parameters and the aim was to find with confidence and with the least number of simulations the parameters leading to the maximum run-up amplification.
Mackey-Glass function. The Mackey-Glass delay-differential equation is a chaotic system in dimension , but without noise. It models real feedback systems and is used in physiological domains such as hematology, cardiology, neurology, and psychiatry. The highly chaotic behavior of this function makes it an exceptionally difficult optimization problem. It has been used as a benchmark for example by Flake & Lawrence (2002).
Empirical comparison of the algorithms.
Figure 3 compares the empirical mean average regret for the three algorithms. On the easy optimization assessments like the Branin data set (Fig. 3(e)) the three strategies behave in a comparable manner, but the GP-UCB algorithm incurs a larger cumulative regret. For more difficult assessments the GP-UCB algorithm performs poorly and our algorithm always surpasses the EI heuristic. The improvement of the GP-MI algorithm against the two competitors is the most significant for exceptionally challenging optimization tasks as illustrated in Figures 3(a) to 3(d) and 3(h), where the underlying functions present several local optima. The ability of our algorithm to deal with the exploration/exploitation trade-off is emphasized by these experimental results as its average regret decreases directly after the first iterations, avoiding unwanted exploration like GP-UCB on Figures 3(a) to 3(d), or getting stuck in some local optimum like EI on Figures 3(c), 3(g) and 3(h). We further mention that the GP-MI algorithm is empirically robust against the number of dimensions of the data set (Fig. 3(b), 3(g), 3(h)).
5.2 Practical aspects
Calibration of .
The value of the parameter is chosen following Theorem 2 as with being a confidence parameter. The guarantees we prove in Section 4.2 on the cumulative regret for the GP-MI algorithm holds with probability at least . With increasing linearly for decreasing exponentially toward , the algorithm is robust to the choice of . We present on Figure 4 the small impact of on the average regret for four different values selected on a wide range.
Even if the numerical cost of GP-MI is insignificant in practice compared to the cost of the evaluation of , the complexity of the sequential Bayesian update Osborne (2010) is and might be prohibitive for large . One can reduce drastically the computational time by means of Lazy Variance Calculation Desautels et al. (2012), built on the fact that always decreases for increasing and for all . We further mention that approximated inference algorithms such as the EP approximation and MCMC sampling Kuss et al. (2005) can be used as an alternative if the computational time is a restrictive factor.
We introduced the GP-MI algorithm for GP optimization and prove upper bounds on its cumulative regret which improve exponentially the state-of-the-art in common settings. The theoretical analysis was presented in a generic framework in order to expand its impact to other similar algorithms. The experiments we performed on real and synthetic assessments confirmed empirically the efficiency of our algorithm against both the theoretical state-of-the-art of GP optimization, the GP-UCB algorithm, and the commonly used EI heuristic.
The authors would like to thank David Buffoni and Raphaël Bonaque for fruitful discussions. The authors also thank the anonymous reviewers of the 31st International Conference on Machine Learning for their detailed feedback.
- Audibert et al. (2011) Audibert, J-Y., Bubeck, S., and Munos, R. Bandit view on noisy optimization. In Optimization for Machine Learning, pp. 431–454. MIT Press, 2011.
- Auer et al. (2002) Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
- Bercu & Touati (2008) Bercu, B. and Touati, A. Exponential inequalities for self-normalized martingales with applications. The Annals of Applied Probability, 18(5):1848–1869, 2008.
- Bubeck et al. (2011) Bubeck, S., Munos, R., Stoltz, G., and Szepesvári, C. X-armed bandits. Journal of Machine Learning Research, 12:1655–1695, 2011.
- Carpentier et al. (2011) Carpentier, A., Lazaric, A., Ghavamzadeh, M., Munos, R., and Auer, P. Upper-confidence-bound algorithms for active learning in multi-armed bandits. In Proceedings of the International Conference on Algorithmic Learning Theory, pp. 189–203. Springer-Verlag, 2011.
- Chen & Krause (2013) Chen, Y. and Krause, A. Near-optimal batch mode active learning and adaptive submodular optimization. In Proceedings of the International Conference on Machine Learning. icml.cc / Omnipress, 2013.
- Contal et al. (2013) Contal, E., Buffoni, D., Robicquet, A., and Vayatis, N. Parallel Gaussian process optimization with upper confidence bound and pure exploration. In Machine Learning and Knowledge Discovery in Databases, volume 8188, pp. 225–240. Springer Berlin Heidelberg, 2013.
- Cover & Thomas (1991) Cover, T. M. and Thomas, J. A. Elements of Information Theory. Wiley-Interscience, 1991.
- Dani et al. (2008) Dani, V., Hayes, T. P., and Kakade, S. M. Stochastic linear optimization under bandit feedback. In Proceedings of the 21st Annual Conference on Learning Theory, pp. 355–366, 2008.
- de Freitas et al. (2012) de Freitas, N., Smola, A. J., and Zoghi, M. Exponential regret bounds for Gaussian process bandits with deterministic observations. In Proceedings of the 29th International Conference on Machine Learning. icml.cc / Omnipress, 2012.
- Desautels et al. (2012) Desautels, T., Krause, A., and Burdick, J.W. Parallelizing exploration-exploitation tradeoffs with Gaussian process bandit optimization. In Proceedings of the 29th International Conference on Machine Learning, pp. 1191–1198. icml.cc / Omnipress, 2012.
- Dutykh et al. (2011) Dutykh, D., Poncet, R, and Dias, F. The VOLNA code for the numerical modelling of tsunami waves: generation, propagation and inundation. European Journal of Mechanics B/Fluids, 30:598–615, 2011.
- Flake & Lawrence (2002) Flake, G. W. and Lawrence, S. Efficient SVM regression training with SMO. Machine Learning, 46(1-3):271–290, 2002.
- Floudas & Pardalos (2000) Floudas, C.A. and Pardalos, P.M. Optimization in Computational Chemistry and Molecular Biology: Local and Global Approaches. Nonconvex Optimization and Its Applications. Springer, 2000.
Grunewalder et al. (2010)
Grunewalder, S., Audibert, J-Y., Opper, M., and Shawe-Taylor, J.
Regret bounds for Gaussian process bandit problems.
Proceedings of the International Conference on Artificial Intelligence and Statistics, pp. 273–280. MIT Press, 2010.
- Hill et al. (2012) Hill, E. M., Borrero, J. C., Huang, Z., Qiu, Q., Banerjee, P., Natawidjaja, D. H., Elosegui, P., Fritz, H. M., Suwargadi, B. W., Pranantyo, I. R., Li, L., Macpherson, K. A., Skanavis, V., Synolakis, C. E., and Sieh, K. The 2010 Mw 7.8 Mentawai earthquake: Very shallow source of a rare tsunami earthquake determined from tsunami field survey and near-field GPS data. J. Geophys. Res., 117:B06402–, 2012.
- Jones et al. (1998) Jones, D. R., Schonlau, M., and Welch, W. J. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455–492, December 1998.
- Kleinberg (2004) Kleinberg, R. Nearly tight bounds for the continuum-armed bandit problem. In Advances in Neural Information Processing Systems 17, pp. 697–704. MIT Press, 2004.
- Krause & Ong (2011) Krause, A. and Ong, C.S. Contextual Gaussian process bandit optimization. In Advances in Neural Information Processing Systems 24, pp. 2447–2455, 2011.
- Kuss et al. (2005) Kuss, M., Pfingsten, T., Csató, L., and Rasmussen, C.E. Approximate inference for robust Gaussian process regression. Max Planck Inst. Biological Cybern., Tubingen, GermanyTech. Rep, 136, 2005.
- Mockus (1989) Mockus, J. Bayesian approach to global optimization. Mathematics and its applications. Kluwer Academic, 1989.
- Osborne (2010) Osborne, Michael. Bayesian Gaussian processes for sequential prediction, optimisation and quadrature. PhD thesis, Oxford University New College, 2010.
- Rasmussen & Williams (2006) Rasmussen, C. E. and Williams, C. Gaussian Processes for Machine Learning. MIT Press, 2006.
- Snoek et al. (2012) Snoek, J., Larochelle, H., and Adams, R. P. Practical bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems 25, pp. 2960–2968, 2012.
- Srinivas et al. (2010) Srinivas, N., Krause, A., Kakade, S., and Seeger, M. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the International Conference on Machine Learning, pp. 1015–1022. icml.cc / Omnipress, 2010.
- Srinivas et al. (2012) Srinivas, N., Krause, A., Kakade, S., and Seeger, M. Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. IEEE Transactions on Information Theory, 58(5):3250–3265, 2012.
- Stefanakis et al. (2012) Stefanakis, T. S., Dias, F., Vayatis, N., and Guillas, S. Long-wave runup on a plane beach behind a conical island. In Proceedings of the World Conference on Earthquake Engineering, 2012.
- Stefanakis et al. (2013) Stefanakis, T. S., Contal, E., Vayatis, N., Dias, F., and Synolakis, C. E. Can small islands protect nearby coasts from tsunamis ? An active experimental design approach. arXiv preprint arXiv:1305.7385, 2013.
- Wang & Shan (2007) Wang, G. and Shan, S. Review of metamodeling techniques in support of engineering design optimization. Journal of Mechanical Design, 129(4):370–380, 2007.
- Ziemba & Vickson (2006) Ziemba, W. T. and Vickson, R. G. Stochastic optimization models in finance. World Scientific Singapore, 2006.