Suppose we wish to minimize a continuous function , where is a compact subset of . Observing is costly (it may require a lengthy computer simulation or physical experiment), so we wish to use as few observations as possible. We know little about the shape of ; in particular we will be unable to make assumptions of convexity or unimodality. We therefore need a global optimization algorithm, one which attempts to find a global minimum.
Many standard global optimization algorithms exist, including genetic algorithms, multistart, and simulated annealing(Pardalos and Romeijn, 2002), but these algorithms are designed for functions that are cheap to evaluate. When is expensive, we need an efficient algorithm, one which will choose its observations to maximize the information gained.
We can consider this a continuum-armed-bandit problem (Srinivas et al., 2010, and references therein), with noiseless data, and loss measured by the simple regret (Bubeck et al., 2009). At time , we choose a design point , make an observation , and then report a point where we believe will be low. Our goal is to find a strategy for choosing the and , in terms of previous observations, so as to minimize .
We would like to find a strategy which can guarantee convergence: for functions in some smoothness class, should tend to , preferably at some fast rate. The simplest method would be to fix a sequence of in advance, and set , for some approximation to . We will show that if converges in supremum norm at the optimal rate, then also converges at its optimal rate. However, while this strategy gives a good worst-case bound, on average it is clearly a poor method of optimization: the design points are completely independent of the observations .
We may therefore ask if there are more efficient methods, with better average-case performance, that nevertheless provide good guarantees of convergence. The difficulty in designing such a method lies in the trade-off between exploration and exploitation. If we exploit the data, observing in regions where is known to be low, we will be more likely to find the optimum quickly; however, unless we explore every region of , we may not find it at all (Macready and Wolpert, 1998).
Initial attempts at this problem include work on Lipschitz optimization (summarized in Hansen et al., 1992) and the DIRECT algorithm (Jones et al., 1993), but perhaps the best-known strategy is expected improvement. It is sometimes called Bayesian optimization, and first appeared in Močkus (1974) as a Bayesian decision-theoretic solution to the problem. Contemporary computers were not powerful enough to implement the technique in full, and it was later popularized by Jones et al. (1998), who provided a computationally efficient implementation. More recently, it has also been called a knowledge-gradient policy by Frazier et al. (2009). Many extensions and alterations have been suggested by further authors; a good summary can be found in Brochu et al. (2010).
Expected improvement performs well in experiments (Osborne, 2010, §9.5), but little is known about its theoretical properties. The behaviour of the algorithm depends crucially on the Gaussian process prior chosen for . Each prior has an associated space of functions , its reproducing-kernel Hilbert space. contains all functions as smooth as a posterior mean of , and is the natural space in which to study questions of convergence.
Vazquez and Bect (2010) show that when is a fixed Gaussian process prior of finite smoothness, expected improvement converges on the minimum of any , and almost surely for drawn from . Grunewalder et al. (2010) bound the convergence rate of a computationally infeasible version of expected improvement: for priors of smoothness , they show convergence at a rate on drawn from . We begin by bounding the convergence rate of the feasible algorithm, and show convergence at a rate on all . We go on to show that a modification of expected improvement converges at the near-optimal rate .
For practitioners, however, these results are somewhat misleading. In typical applications, the prior is not held fixed, but depends on parameters estimated sequentially from the data. This process ensures the choice of observations is invariant under translation and scaling of , and is believed to be more efficient (Jones et al., 1998, §2). It has a profound effect on convergence, however: Locatelli (1997, §3.2) shows that, for a Brownian motion prior with estimated parameters, expected improvement may not converge at all.
We extend this result to more general settings, showing that for standard priors with estimated parameters, there exist smooth functions on which expected improvement does not converge. We then propose alternative estimates of the prior parameters, chosen to minimize the constants in the convergence rate. We show that these estimators give an automatic choice of parameters, while retaining the convergence rates of a fixed prior.
Table 1 summarizes the notation used in this paper. We say is a bump function if is infinitely differentiable and of compact support, and is Hermitian if . We use the Landau notation to denote , and to denote . If , we say , and if both and , we say . If further , we say . Finally, if and are random, and as , we say .
In Section 2, we briefly describe the expected-improvement algorithm, and detail our assumptions on the priors used. We state our main results in Section 3, and discuss implications for further work in Section 4. Finally, we give proofs in Appendix A.
|unknown function to be minimized|
|compact subset of to minimize over|
|number of dimensions to minimize over|
|points in at which is observed|
|estimated minimum of , given|
|prior distribution for|
|strategy for choosing ,|
|expected improvement given|
global mean and variance of Gaussian-process prior
|underlying correlation kernel for|
|correlation kernel for with length-scales|
|,||smoothness parameters of|
|, , ,||quantities describing posterior distribution of given|
|expected improvement strategy with fixed prior|
|,||estimates of prior parameters ,|
|rate of decay of|
|expected improvement strategy with estimated prior|
|reproducing-kernel Hilbert space of on|
|Sobolev Hilbert space of order on|
|loss suffered over an RKHS ball after steps|
|expected improvement strategy with robust estimated prior|
|-greedy expected improvement strategies|
2 Expected Improvement
Suppose we wish to minimize an unknown function , choosing design points and estimated minima as in the introduction. If we pick a prior distribution for , representing our beliefs about the unknown function, we can describe this problem in terms of decision theory. Let
be a probability space, equipped with a random processhaving law . A strategy
is a collection of random variables, taking values in . Set , and define the filtration . The strategy is valid if is conditionally independent of given , and likewise given . (Note that we allow random strategies, provided they do not depend on unknown information about .)
When taking probabilities and expectations we will write and , denoting the dependence on both the prior and strategy . The average-case performance at some future time is then given by the expected loss,
and our goal, given , is to choose the strategy to minimize this quantity.
2.1 Bayesian Optimization
For this problem is very computationally intensive (Osborne, 2010, §6.3), but we can solve a simplified version of it. First, we restrict the choice of to the previous design points . (In practice this is reasonable, as choosing an we have not observed can be unreliable.) Secondly, rather than finding an optimal strategy for the problem, we derive the myopic strategy: the strategy which is optimal if we always assume we will stop after the next observation. This strategy is suboptimal (Ginsbourger et al., 2008, §3.1), but performs well, and greatly simplifies the calculations involved.
In this setting, given , if we are to stop at time we should choose , where . (In the case of ties, we may pick any minimizing .) We then suffer a loss , where . Were we to observe at before stopping, the expected loss would be
so the myopic strategy should choose to minimize this quantity. Equivalently, it should maximize the expected improvement over the current loss,
So far, we have merely replaced one optimization problem with another. However, for suitable priors, can be evaluated cheaply, and thus maximized by standard techniques. The expected-improvement algorithm is then given by choosing to maximize (1).
2.2 Gaussian Process Models
We still need to choose a prior for . Typically, we model as a stationary Gaussian process: we consider the values to be jointly Gaussian, with mean and covariance
is the global mean of ; we place a flat prior on , reflecting our uncertainty over the location of .
is the global scale of variation of , and its correlation kernel, governing the local properties of . In the following, we will consider kernels
for an underlying kernel with . (Note that we can always satisfy this condition by suitably scaling and .) The are the length-scales of the process: two values and will be highly correlated if each is small compared with . For now, we will assume the parameters and are fixed in advance.
is continuous and integrable.
thus has Fourier transform
and by Bochner’s theorem, is non-negative and integrable.
is isotropic and radially non-increasing.
In other words, for a non-increasing function ; as a consequence, is isotropic.
As , either:
for some ; or
for all (we will then say that ).
Note the condition is required for to be integrable.
is , for the largest integer less than , and at the origin, has -th order Taylor approximation satisfying
as , for some .
When , this is just the condition that be -Hölder at the origin; when , we instead require this condition up to a log factor.
The rate controls the smoothness of functions from the prior: almost surely, has continuous derivatives of any order (Adler and Taylor, 2007, §1.4.2). Popular kernels include the Matérn class,
where is a modified Bessel function of the second kind, and the Gaussian kernel,
obtained in the limit (Rasmussen and Williams, 2006, §4.2). Between them, these kernels cover the full range of smoothness . Both kernels satisfy Assumptions 1–4 for the given; except for the Matérn kernel with , where (Abramowitz and Stegun, 1965, §9.6).
Having chosen our prior distribution, we may now derive its posterior. We find
for , , and (Santner et al., 2003, §4.1.3). Equivalently, these expressions are the best linear unbiased predictor of and its variance, as given in Jones et al. (1998, §2). We will also need the reduced sum of squares,
2.3 Expected Improvement Strategies
are the standard normal distribution and density functions respectively.
For a prior as above, expected improvement chooses to maximize (9), but this does not fully define the strategy. Firstly, we must describe how the strategy breaks ties, when more than one maximizes . In general, this will not affect the behaviour of the algorithm, so we allow any choice of maximizing (9).
Secondly, we must say how to choose , as the above expressions are undefined when . In fact, Jones et al. (1998, §4.2) find that expected improvement can be unreliable given few data points, and recommend that several initial design points be chosen in a random quasi-uniform arrangement. We will therefore assume that until some fixed time , points are instead chosen by some (potentially random) method independent of . We thus obtain the following strategy.
An strategy chooses:
initial design points independently of ; and
further design points from the maximizers of (9).
So far, we have not considered the choice of parameters and . While these can be fixed in advance, doing so requires us to specify characteristic scales of the unknown function , and causes expected improvement to behave differently on a rescaling of the same function. We would prefer an algorithm which could adapt automatically to the scale of .
A natural approach is to take maximum likelihood estimates of the parameters, as recommended by Jones et al. (1998, §2). Given , the MLE ; for full generality, we will allow any choice , where . Estimates of , however, must be obtained by numerical optimization. As can vary widely in scale, this optimization is best performed over ; as the likelihood surface is typically multimodal, this requires the use of a global optimizer. We must therefore place (implicit or explicit) bounds on the allowed values of . We have thus described the following strategy.
3 Convergence Rates
To discuss convergence, we must first choose a smoothness class for the unknown function . Each kernel is associated with a space of functions , its reproducing-kernel Hilbert space (RKHS) or native space. contains all functions as smooth as a posterior mean of , and is the natural space to study convergence of expected-improvement algorithms, allowing a tractable analysis of their asymptotic behaviour.
3.1 Reproducing-Kernel Hilbert Spaces
Given a symmetric positive-definite kernel on , set . For , let be the space of functions spanned by the , for . Furnish with the inner product defined by
The completion of under this inner product is the reproducing-kernel Hilbert space of on . The members are abstract objects, but we can identify them with functions through the reproducing property,
We will find it convenient also to use an alternative characterization of . We begin by describing in terms of Fourier transforms. Let denote the Fourier transform of a function . The following result is stated in Parzen (1963, §2), and proved in Wendland (2005, §10.2); we give a short proof in Appendix A.
is the space of real continuous whose norm
is finite, taking .
We may now describe in terms of .
Lemma 2 (Aronszajn, 1950, §1.5).
is the space of functions for some , with norm
and there is a unique minimizing this expression.
These spaces are in fact closely related to the Sobolev Hilbert spaces of functional analysis. Say a domain is Lipschitz if its boundary is locally the graph of a Lipschitz function (see Tartar, 2007, §12, for a precise definition). For such a domain , the Sobolev Hilbert space is the space of functions , given by the restriction of some , whose norm
is finite. Thus, for the kernel with Fourier transform , this is just the RKHS . More generally, if satisfies our assumptions with , these spaces are equivalent in the sense of normed spaces: they contain the same functions, and have norms satisfying
for constants .
Let denote the RKHS of on , and be a Lipschitz domain.
If , is equivalent to the Sobolev Hilbert space .
If , is continuously embedded in for all .
Thus if , and is, say, a product of intervals , the RKHS is equivalent to the Sobolev Hilbert space , identifying each function in that space with its unique continuous extension to .
3.2 Fixed Parameters
We are now ready to state our main results. Let be compact with non-empty interior. For a function , let and denote probability and expectation when minimizing the fixed function with strategy . (Note that while is fixed, may be random, so its performance is still probabilistic in nature.) We define the loss suffered over the ball in after steps by a strategy ,
We will say that converges on the optimum at rate , if
for all . Note that we do not allow to vary with ; the strategy must achieve this rate without prior knowledge of .
We begin by showing that the minimax rate of convergence is .
If , then for any , ,
and this rate can be achieved by a strategy not depending on .
The upper bound is provided by a naive strategy as in the introduction: we fix a quasi-uniform sequence in advance, and take
to minimize a radial basis function interpolant of the data. As remarked previously, however, this naive strategy is not very satisfying; in practice it will be outperformed by any good strategy varying with the data. We may thus ask whether more sophisticated strategies, with better practical performance, can still provide good worst-case bounds.
One such strategy is the strategy of 1. We can show this strategy converges at least at rate , up to log factors.
Let be a prior with length-scales . For any ,
For , these rates are near-optimal. For , we are faced with a more difficult problem; we discuss this in more detail in Section 3.4.
3.3 Estimated Parameters
First, we consider the effect of the prior parameters on . While the previous result gives a convergence rate for any fixed choice of parameters, the constant in that rate will depend on the parameters chosen; to choose well, we must somehow estimate these parameters from the data. The strategy, given by 2, uses maximum likelihood estimates for this purpose. We can show, however, that this may cause the strategy to never converge.
Suppose . Given , , , there exists satisfying , and for some fixed ,
The counterexamples constructed in the proof of the theorem may be difficult to minimize, but they are not badly-behaved (Figure 1). A good optimization strategy should be able to minimize such functions, and we must ask why expected improvement fails.
We can understand the issue by considering the constant in Theorem 2. Define
From the proof of Theorem 2, the dominant term in the convergence rate has constant
for not depending on or . In Appendix A, we will prove the following result.
is non-decreasing in , and bounded above by .
Hence for fixed , the estimate , and thus . Inserting this choice into (11) gives a constant growing exponentially in , destroying our convergence rate.
To resolve the issue, we will instead try to pick to minimize (11). The term is increasing in , and the term is decreasing in ; we may balance the terms by taking . The constant is then proportional to , which we may minimize by taking . In practice, we will not know in advance, so we must estimate it from the data; from 1, a convenient estimate is .
Suppose, then, that we make some bounded estimate of , and set . As Theorem 3 holds for any of faster than logarithmic decay, such a choice is necessary to ensure convergence. (We may also choose to minimize (11); we might then pick minimizing but our assumptions on are weak enough that we need not consider this further.)
If we believe our Gaussian-process model, this estimate is certainly unusual. We should, however, take care before placing too much faith in the model. The function in Figure 1 is a reasonable function to optimize, but as a Gaussian process it is highly atypical: there are intervals on which the function is constant, an event which in our model occurs with probability zero. If we want our algorithm to succeed on more general classes of functions, we will need to choose our parameter estimates appropriately.
To obtain good rates, we must add a further condition to our strategy. If , is identically zero, and all choices of are equally valid. To ensure we fully explore , we will therefore require that when our strategy is applied to a constant function , it produces a sequence dense in . (This can be achieved, for example, by choosing uniformly at random from when .) We have thus described the following strategy.
We cannot now prove a convergence result uniform over balls in , as the rate of convergence depends on the ratio , which is unbounded. (Indeed, any estimator of must sometimes perform poorly: can appear from the data to have arbitrarily small norm, while in fact having a spike somewhere we have not yet observed.) We can, however, provide the same convergence rates as in Theorem 2, in a slightly weaker sense.
For any , under ,
3.4 Near-Optimal Rates
So far, our rates have been near-optimal only for . To obtain good rates for , standard results on the performance of Gaussian-process interpolation (Narcowich et al., 2003, §6) then require the design points to be quasi-uniform in a region of interest. It is unclear whether this occurs naturally under expected improvement, but there are many ways we can modify the algorithm to ensure it.
Perhaps the simplest, and most well-known, is an -greedy strategy (Sutton and Barto, 1998, §2.2). In such a strategy, at each step with probability we make a decision to maximize some greedy criterion; with probability we make a decision completely at random. This random choice ensures that the short-term nature of the greedy criterion does not overshadow our long-term goal.
The parameter controls the trade-off between global and local search: a good choice of will be small enough to not interfere with the expected-improvement algorithm, but large enough to prevent it from getting stuck in a local minimum. Sutton and Barto (1998, §2.2) consider the values and , but in practical work should of course be calibrated to a typical problem set.
We therefore define the following strategies.
Let denote , or . For , an strategy:
chooses initial design points independently of ;
with probability , chooses design point as in ; or
with probability , chooses uniformly at random from .
We can show that these strategies achieve near-optimal rates of convergence for all .
Let be one of the strategies in 4. If , then for any ,
while if , the statement holds for all .
Note that unlike a typical -greedy algorithm, we do not rely on random choice to obtain global convergence: as above, the and strategies are already globally convergent. Instead, we use random choice simply to improve upon the worst-case rate. Note also that the result does not in general hold when ; to obtain good rates, we must combine global search with inference about .
We have shown that expected improvement can converge near-optimally, but a naive implementation may not converge at all. We thus echo Diaconis and Freedman (1986) in stating that, for infinite-dimensional problems, Bayesian methods are not always guaranteed to find the right answer; such guarantees can only be provided by considering the problem at hand.
We might ask, however, if our framework can also be improved. Our upper bounds on convergence were established using naive algorithms, which in practice would prove inefficient. If a sophisticated algorithm fails where a naive one succeeds, then the sophisticated algorithm is certainly at fault; we might, however, prefer methods of evaluation which do not consider naive algorithms so successful.
Vazquez and Bect (2010) and Grunewalder et al. (2010) consider a more Bayesian formulation of the problem, where the unknown function is distributed according to the prior , but this approach can prove restrictive: as we saw in Section 3.3, placing too much faith in the prior may exclude functions of interest. Further, Grunewalder et al. find the same issues are present also within the Bayesian framework.
A more interesting approach is given by the continuum-armed-bandit problem (Srinivas et al., 2010, and references therein). Here the goal is to minimize the cumulative regret,
in general observing the function under noise. Algorithms controlling the cumulative regret at rate also solve the optimization problem, at rate (Bubeck et al., 2009, §3). The naive algorithms above, however, have poor cumulative regret. We might, then, consider the cumulative regret to be a better measure of performance, but this approach too has limitations. Firstly, the cumulative regret is necessarily increasing, so cannot establish rates of optimization faster than . (This is not an issue under noise, where typically , see Kleinberg and Slivkins, 2010.) Secondly, if our goal is optimization, then minimizing the regret, a cost we do not incur, may obscure the problem at hand.
Bubeck et al. (2010) study this problem with the additional assumption that has finitely many minima, and is, say, quadratic in a neighbourhood of each. This assumption may suffice in practice, and allows the authors to obtain impressive rates of convergence. For optimization, however, a further weakness is that these rates hold only once the algorithm has found a basin of attraction; they thus measure local, rather than global, performance. It may be that convergence rates alone are not sufficient to capture the performance of a global optimization algorithm, and the time taken to find a basin of attraction is more relevant. In any case, the choice of an appropriate framework to measure performance in global optimization merits further study.
Finally, we should also ask how to choose the smoothness parameter (or the equivalent parameter in similar algorithms). van der Vaart and van Zanten (2009) show that Bayesian Gaussian-process models can, in some contexts, automatically adapt to the smoothness of an unknown function . Their technique requires, however, that the estimated length-scales to tend to 0, posing both practical and theoretical challenges. The question of how best to optimize functions of unknown smoothness remains open.
We would like to thank the referees, as well as Richard Nickl and Steffen Grunewalder, for their valuable comments and suggestions.
Appendix A Proofs
We now prove the results in Section 3.
a.1 Reproducing-Kernel Hilbert Spaces
Proof of 1.
Let be the space of functions described, and be the closed real subspace of Hermitian functions in . We will show is an isomorphism , so we may equivalently work with . Given , by Cauchy-Schwarz and Bochner’s theorem,
and as ,
so . is thus the Fourier transform of a real continuous , satisfying the Fourier inversion formula everywhere.
is hence an isomorphism . It remains to show that . is complete, so is. Further, , and by Fourier inversion each satisfies the reproducing property,
so is a closed subspace of . Given , for all , so . Thus . ∎
Proof of 3.
By 1, the norm on is
and has Fourier transform
If , by assumption , for a finite non-increasing function satisfying as . Hence
for constants , and we obtain that is equivalent to the Sobolev space .
From 2, is given by the restriction of functions in ; as is Lipschitz, the same is true of . is thus equivalent to . Finally, functions in are continuous, so uniquely identified by their restriction to , and
If , by a similar argument is continuously embedded in all . ∎
From 1, we can derive results on the behaviour of as varies. For small , we obtain the following result.
If , then for all , and
Likewise, for large , we obtain the following.
If , , then for , and
for a depending only on and .
As in the proof of 3, we have constants such that
Thus for ,
and we may argue as in the previous lemma. ∎
We can also describe the posterior distribution of in terms of ; as a consequence, we may deduce 1.
Suppose , .