Bayesian optimization is known to be difficult to scale to high dimensions, because the acquisition step requires solving a non-convex optimization problem in the same search space. In order to scale the method and keep its benefits, we propose an algorithm (LineBO) that restricts the problem to a sequence of iteratively chosen one-dimensional sub-problems. We show that our algorithm converges globally and obtains a fast local rate when the function is strongly convex. Further, if the objective has an invariant subspace, our method automatically adapts to the effective dimension without changing the algorithm. Our method scales well to high dimensions and makes use of a global Gaussian process model. When combined with the SafeOpt algorithm to solve the sub-problems, we obtain the first safe Bayesian optimization algorithm with theoretical guarantees applicable in high-dimensional settings. We evaluate our method on multiple synthetic benchmarks, where we obtain competitive performance. Further, we deploy our algorithm to optimize the beam intensity of the Swiss Free Electron Laser with up to 40 parameters while satisfying safe operation constraints.READ FULL TEXT VIEW PDF
Accompanying code for AAAI 2021 publication - High-Dimensional Bayesian Optimization via Tree-Structured Additive Models
Zero-order stochastic optimization problems arise in many applications such as hyper-parameter tuning of machine learning models, reinforcement-learning and industrial processes. An example that motivates the present work is parameter tuning of a free electron laser (FEL). FELs are large-scale physical machines that accelerate electrons in order to generate bright and shortly pulsed X-ray lasing. The X-ray pulses then facilitate many experiments in biology, medicine and material science. The accelerator and the electron beam line of a free electron laser consist of multiple individual components, each of which has several parameters that experts adjust to maximize the pulse energy. Because of different operational modes and parameter drift, this is a recurrent, time-consuming task which takes away valuable time for experiments. Evaluations can be obtained at a rate of more than 1 Hz, which makes this task well suited for automated optimization with a continuous search space of about 10-100 parameters. Some of these parameters are known to physically over-parametrize the objective function, which leads to invariant subspaces and local optima. Additionally, some settings can cause electron losses, which are required to stay below a pre-defined threshold.
This scenario can be cast as a gradient-free stochastic optimization problem with implicit constraints. The fact that the constraints are safety critical rules out many commonly used algorithms. Arguably, the simplest approach is to use a local optimization method with a conservatively chosen step size and a term that penalizes constraint violations in the objective, but such a method might get stuck in local optima. As an alternative, Bayesian optimization offers a principled, global optimization routine that can also operate under safety constraints (Sui et al., 2015). When applied to a low-dimensional subset of parameters, Bayesian optimization has been successfully used on FELs and in similar applications. However, it is well known that standard Bayesian optimization does not scale to high-dimensional settings, because optimizing the acquisition function becomes itself an intractable optimization problem.
In this work, we propose a novel way of using Bayesian optimization that is computationally feasible even in high dimensions. The key idea is to iteratively solve sub-problems of the global problem, each of which can be solved efficiently, both computationally and statistically. As feasible sub-problems we choose one-dimensional subspaces of the domain that contain the best point so far. On a one-dimensional domain, Bayesian optimization can be implemented computationally efficiently and the sample-complexity to obtain an -optimal point is independent of the outer dimension. A global GP model can be used and allows to share information between the sub-problem to increase data-efficiency, in particular as samples start to accumulate close to an optimum. As we will show, our approach obtains both local and global convergence guarantees and further adaptively scales with the effective dimension, if the objective contains an invariant subspace. In the constraint setting, we use SafeOpt to solve the sub-problems. This way, we obtain the first principled and safe Bayesian optimization algorithm applicable to high-dimensional domains.
We propose a novel way of using Bayesian optimization that circumvents the issue of acquisition function optimization by decomposing the global problem into a sequence of one-dimensional sub-problems that can be solved efficiently.
Theoretically, we show that if the one-dimensional subspaces are chosen randomly, the algorithm converges with a fast local rate where the function is strongly convex, and converges globally at a Lipschitz rate that adaptively scales with the effective dimension.
To respect safety constraints during optimization, each sub-problem can be solved with SafeOpt. To the best of our knowledge, this is the first principled algorithm for high dimensional safe Bayesian optimization.
Our algorithm is practical and amenable to heuristics that improve local convergence. As user feedback we provide one-dimensional slice plots that allow to monitor the progress and the model fit.
We evaluate our method on synthetic benchmark functions, and apply it to tune the Swiss Free Electron Laser (SwissFEL) with up to 40 parameters on a continuous domain, satisfying safe operation constraints.
Derivative-free stochastic optimization covers an array of algorithms from the very general grid-based methods (Nesterov, 2004; Jones, 2001) to local methods, where most of the work is spent on approximating the gradient (Nesterov & Spokoiny, 2017). Especially of interest are algorithms that optimize functions with a noisy oracle, also known as stochastic bandit feedback (Flaxman et al., 2004; Shamir, 2013). Popular examples include CMA-ES (Hansen & Ostermeier, 2001; Hansen et al., 2003), Nelder-Mead (Powell, 1973) and SPSA (Bhatnagar et al., 2013). Line-search techniques are related to our method, but have been primarily studied in the context of convex optimization (Gratton et al., 2015), also with stochastic models and search directions (Cartis & Scheinberg, 2018; Paquette & Scheinberg, 2018; Diniz-Ehrhardt et al., 2008).
Bayesian optimization is a family of algorithms using probabilistic models to determine which point to evaluate next (Mockus, 1982; Shahriari et al., 2016). Many variants appeared in literature; including GP-UCB (Srinivas et al., 2010)2017), and Expected Improvement (Mockus, 1982); and recently with information theoretic criteria such as MVES or IDS (Wang & Jegelka, 2017; Kirschner & Krause, 2018). Lower bounds are known as well (Scarlett et al., 2017). Bayesian optimization on a one dimensional domain is not necessarily thought of as line search, although it can be used as such (Mahsereci & Hennig, 2017), and the one dimensional setting is theoretically well understood (Scarlett, 2018). Success stories, where Bayesian optimization outperforms classical techniques, include applications in laser technology (Schneider et al., 2018), performance optimization of Free Electron Lasers (McIntire et al., 2016a, b) and parameter optimization in the CPLEX suite (Shahriari et al., 2016).
The scaling of Bayesian optimization to high dimensions has been considered recently, as many of the commonly used kernels suffer from the curse of dimensionality. Hence, to make the problem tractable, most approaches make structural assumptions on the function such as additivity(Rolland et al., 2018; Mutný & Krause, 2018) or a low-dimensional active subspace (Djolonga et al., 2013). The latter category includes REMBO (Wang et al., 2016), which also optimizes on a random low-dimensional subspace, however, in contrast to our method, the dimension of the low-dimensional embedding needs to be known a priori and an iterative procedure to define the subspaces is not considered. Our method also relates to the Dropout-BO algorithm of Li et al. (2017a), but the theoretical analysis is insufficient as the regret bound is non-vanishing in the limit. A method that combines local optimization with Bayesian optimization was proposed by McLeod et al. (2018) but without theoretical analysis.
The main instance of safe Bayesian optimization is the SafeOpt algorithm (Sui et al., 2015; Berkenkamp et al., 2016a; Sui et al., 2018); but its formulation relies on a discretized domain, which prevents high-dimensional applications. An adaptive discretization based on particle swarms was proposed by Berkenkamp et al. (2016b).
Let be a compact domain and the objective function we seek to minimize111Bayesian optimization is typically formulated as maximization problem, but since we also have results in the flavor of convex optimization, w.l.o.g. we use minimization here.,
where we allow for implicit constraints
. The constraint function can be chosen vector valued to allow for multiple constraints. We refer to such constraints assafety constraints if it is required that the iterates need to maintain during optimization. We assume that and can only be accessed via a noisy oracle, that given a point returns an evaluation and possibly , where is a noise term with sub-Gaussian tails.
Denote and let be a point such that . An algorithm iteratively picks a sequence of evaluations , and obtains the corresponding noisy observations . As a measure of progress we use simple regret. At any, possibly random stopping time , the optimization algorithm proposes a candidate solution . This point is allowed to differ from the point that is chosen for the purpose of optimization, still, some algorithms might set . Simple regret is defined as
and therefore measures the ability of an optimization algorithm to predict a minimizer at time . To impose some minimal regularity, we make the following assumption.
[RKHS] The objective and constraint functions and are members of reproducing kernel Hilbert spaces , with known kernel functions and bounded norm .
In its standard formulation, Bayesian optimization uses a Gaussian process prior GP with mean and kernel function and Bayes’ rule to update the posterior as observations arrive. If a Gaussian likelihood is used, the posterior mean can be computed analytically and is equivalent to the regularized least squares kernel estimator,
From the Bayesian posterior, one can obtain credible intervals
, which in this case are known to match frequentist confidence intervals up to the scaling factor. Bayesian optimization is built upon using the uncertainty estimates , or more generally the posterior distribution, to determine promising query points that efficiently reduce the uncertainty about the true maximizer . Typically, an acquisition function is defined and evaluations are chosen as . Commonly used acquisition functions include UCB, Thompson Sampling, Expected Improvement and Max-Value Entropy Search, and trade-off between exploration and exploitation on the GP posterior landscape.
The success of Bayesian optimization crucially relies on the ability to find a maximizer of the acquisition function , which requires solving a non-convex optimization problem in the same search space. In most of the literature on Bayesian optimization, this is not discussed further as the computational cost of solving is assumed to be negligible compared to obtaining a new evaluation on the oracle. In practice, however, this step renders the method intractable in high-dimensional settings.
In order to maintain tractability of the acquisition step in high dimensions, we propose to restrict the search space to a one-dimensional222Generalization to higher dimensional subspaces is possible. affine subspace , where is the offset, and is the direction. On such a restriction, the acquisition step can be effectively solved using an (adaptive) grid-search over . We will show that by carefully choosing a sequence of one-dimensional subspaces, we obtain a method that still converges globally, and additionally it has properties similar to a gradient method. By using a global GP model, we can share information between the sub-solvers and handle noise in a principled way.
The LineBO method is presented in Algorithm 1. As standard for Bayesian optimization, we initialize with a GP prior. We also assume that the user provides a direction oracle , which is used to iteratively define subspaces . The affine subspace is always chosen to contain the previous best point to ensure a monotonic improvement over iterations. We then proceed by efficiently solving the subspace using standard Bayesian optimization (Appendix A).
A canonical example of the direction oracle is to pick the direction uniformly at random, which is also the main focus of our analysis. As we will see, this algorithm obtains both a local and a global convergence rate. Another possibility is to use (random) coordinate aligned directions, which resembles a coordinate descent algorithm. In this case, our method is a special case of DropoutUCB of Li et al. (2017a), but the global rate they obtained has a non-vanishing gap in the limit and local convergence was not analysed.
The restriction of the search space allows us to effectively use a safe Bayesian optimization algorithm like SafeOpt (see Appendix A.2) as a sub-solver, which in turn renders the global method safe (SafeLineBO). We note that in its current formulation, this method crucially relies on a discretized domain, which makes it difficult to apply even with . It is, however, an easy task to implement these methods on a one dimensional domain. To the best of our knowledge, this way we obtain the first principled method for safe Bayesian optimization in high dimensions.
To understand the sample complexity of solving the one dimensional sub-problems, we rely on the standard analysis of Bayesian optimization developed by Srinivas et al. (2010); Abbasi-Yadkori (2012); Chowdhury & Gopalan (2017). The results are often stated in terms of a complexity measure called maximum information gain , which is defined as the mutual information . This quantity depends on the kernel and upper bounds are known for the RBF and Matern kernel (Seeger et al., 2008; Srinivas et al., 2010). We focus on a subset of kernels, which when restricted on the one dimensional affine subspace , their satisfies the following assumption.
[Bounded ] Let be a one-dimensional kernel and , then
This is satisfied for the squared exponential kernel () and the Matern kernel with (). Simple regret can be bounded as , and with the assumption above, the bound becomes up to logarithmic factors (see also Appendix A.1). Equivalently, the time until regret is guaranteed is . The best known lower bound for this case is (Scarlett et al., 2017), hence almost closes the gap. The overall number of evaluations after iterations of Algorithm 1 is at most .
In practice, we often encounter functions that are high-dimensional but contain an (unknown) invariant subspace. This means that there are directions in which the function is constant and after removal of these dimensions the problem might not be high dimensional (see Figure 2). The dimension of the linear space where the function varies is called effective dimension, as formalized in the following definition.
[Effective dimension] The effective dimensionality of a function is the smallest s.t. there exists a linear subspace of dimension and for all and , where is the orthogonal complement of , .
If Algorithm 1 is used with randomly chosen directions, we show that the convergence of the algorithm adaptively scales with the effective dimension . The result is quantified in the following proposition.
and directions chosen uniformly at random, with probability at least, it holds that
The proof is deferred to Appendix B.1. The result should be understood as a property of random exploration and is the best one can hope for on worst-case examples. Instances, where random search is competitive have been reported in literature (Bergstra & Bengio, 2012; Wang et al., 2016; Li et al., 2017b) and this has been attributed to the same effect. However, random search fails to control the error induced by the noise, and our method has the advantage of using the GP model to deal with the noise in a principled way.
In contrast to other algorithms that exploit subspace structure, including the REMBO algorithm of Wang et al. (2016) and SI-BO of Djolonga et al. (2013), our formulation does not require the knowledge of in advance. Intuitively, we can demonstrate the consequence of the effective dimension and the random line algorithm by plotting the set that appears as an isolated spike in the domain in the worst-case. For functions with an invariant subspace, the volume of the set increases substantially and hence the probability of a random line passing through this region increases (see Figure 2).
Naturally, such a bound cannot avoid an exponential scaling with , as also does not full-scale Bayesian optimization. However, we show in the next section, that if our algorithm finds a point in the proximity of a local optimum, the convergence is dominated by a fast local rate, a property not exhibited by random search.
By Taylor’s theorem, differentiable functions have an open set around their minimizers where the function is convex or even strongly-convex. We show that if our algorithm starts in a subset of the domain where the function is strongly convex, it converges to the (local) minimum at a linear rate. Again, we focus on the instance where directions are picked at random. The key insight is that random directions can be used as descent directions in the following sense.
[Random Descent Direction] Let be a uniformly random point on the -dimensional unit sphere or uniformly among an orthonormal basis. Then,
Let satisfy Assumption 2, be -strongly convex and -smooth if restricted to . Let and assume all iterates are contained in . Then, after iterations of Algorithm 1 with accuracy and random directions that satisfy Lemma 4.3, it holds that,
To interpret the result, we fix the total number of evaluations and assume . If the kernel restricted to any one dimensional subspace satisfies Assumption 4.1, we can set the accuracy . Then, with the previous proposition, the simple regret can we bounded by
Importantly, the bound has only a polynomial dependence on , for instance with the squared exponential kernel () we get .
The ability to use an arbitrary line solver for the subproblems allows us to implement safety by using a safe BO algorithm such as SafeOpt as a sub-solver. We call LineBO with SafeOpt as sub-solver SafeLineBO. Formally, we define the safe set . It is unavoidable that an initial safe point must be provided. The best one can hope for is the exploration of the reachable safe set , which can be defined as the connected component of that contains . For details, we refer to Sui et al. (2015) and Berkenkamp et al. (2016a) for multiple constraints.
The one dimensional subproblems are guaranteed to be solved safely by the guarantees of SafeOpt under the same additional technical assumptions as for the original algorithm. However a natural question arises as to what extend the safe set is explored sufficiently when restricting the acquisition to one-dimensional subspaces. To allow for the possibility that a safe maximizer can be reached within one iteration from a given safe starting point, the straight line segment from this point to the optimum needs to be contained in . Naturally, this is guaranteed if the safe set is convex; but other conditions are possible. For instance, if the level set is both safe and convex, one can expect that the iterates do not leave and consequently the optimum is found. Note that this is a natural condition that arises if the function is convex on a subset of domain, as we assume for our local convergence guarantees. On the other hand, it is easy to construct counterexamples even in two dimensions that are successfully solved by SafeOpt but not with the LineBO method (for instance with a U-shaped safe set). In practice, however, this might not be a severe limitation, in particular if constraint violations are not expected close to the optimum.
Our main goal is to provide a Bayesian optimization algorithm that is practically useful, with the main benefit that the acquisition function can be solved efficiently. We note that this enables the use of acquisition functions such as Thomson sampling or MVES that rely on sampling the GP posterior and where an analytical expression is not available. Besides this, our methods has several further practical advantages, as we explain below.
Picking random directions is one possibility to define the sub-problems, that allows us to simultaneously obtain global and local guarantees. In practice, this might not necessarily be the best choice, as it increases variance and a different trade-off between local and global exploration might be beneficial. An alternative way is to choose the directions coordinate aligned (CoordinateLineBO). This we found to be efficient on many benchmark problems, likely because of reduced variance and symmetries in the objective. If one seeks to speed up local convergence, using a gradient estimate is the obvious choice. As the gradient-norm becomes smaller, one can eventually switch to random directions to encourage random exploration. For estimating descent directions, we implement the following heuristic based on Thompson Sampling. First, we take the gradient at of a sample from the posterior GP. Then we evaluate , where is a small step size, and update the model. After several such steps ( times), we use the gradient of the posterior mean at as direction oracle (see Appendix C.2 for details). In our experiments, we found that this method (DescentLineBO) improves local convergence, and this variant was used on the free electron laser as well.
We introduced the LineBO method with a global GP model as usually done for Bayesian optimization. This has the advantage that data is shared between the sub-problems, which can speed up convergence, but comes at the cost of inverting the kernel matrix. The iterative update cost is quadratic in the number of data points, which becomes a limiting factor typically around a few thousand steps. It is also possible to use independent sub-solvers or keep a fixed-sized data buffer; as long as the sub-problems are solved sufficiently accurately, this does not affect our theoretical guarantees and yields a further speedup.
An additional benefit of restricting the acquisition function to a one-dimensional subspace is that we can plot evaluations together with the model predictions on this subspace. One example that we obtained when we tuned the SwissFEL is shown in Figure 4(c)
. This allows to better understand the structure of the optimization problem; moreover, it provides valuable user-feedback, as it allows to monitor model fit and to adjust GP-hyperparameters. With safety constraints this is of particular importance, as a misspecified GP model cannot capture the safe set correctly and might cause constraint violations.
As for standard benchmarks we use the Camelback (2d) and the Hartmann6 (6d) functions. Further, we use the Gaussian in 10 dimensions as a benchmark where local convergence is sufficient; note that when restricted to a small enough Euclidean ball this function is strongly convex. To obtain benchmarks with invariant subspaces, we augment the Camelback and Hartmann6 function with 10 and 14 auxiliary dimensions respectively, and shuffle the coordinates to avoid bias. For the constraint case, we add an upper bound to the objective, ie for some threshold . We found that the performance of the local methods (including our approach) depends on the initial point. For that reason, we randomized the initial points for the Camelback and Hartmann6 function uniformly in the domain; in the constrained case restricted to the safe-set. On the Gaussian function we randomize on the level set with in the unconstrained and
in the constrained case. On all experiments we add Gaussian noise with standard deviation 0.2, to obtain a similar signal-noise ratio as on our real-world application.
We compare our approach to random search, Nelder-Mead, SPSA, CMAES and standard GP-UCB. For the subspace problems we additionally compare to REMBO and its interleaved variant. The latter never perform better in our experiments and is omitted from the plots for visual clarity. In the constrained case we compare to SafeOpt in 2 dimensions, and to the SwarmSafeOpt heuristic on the higher-dimensional benchmarks. We use public libraries where available, details can be found in Appendix C. For our LineBO methods, we use the UCB acquisition function. We manually chose reasonable values for hyperparameters of the methods or use recommended setting where available, but we did not run an exhaustive hyperparameter search (which would arguable not be possible in most real-world applications). All methods that use GPs share the same hyperparameters, expect on the Gaussian, where a smaller lengthscale for GP-UCB resulted in better performance.
We evaluate progress using simple regret. Each method suggests a candidate solution in each iteration (in addition to the optimization step), which is evaluated but not used in the optimization. Naturally for the GP-methods, this was chosen as the best mean of the model, and for our line methods, the best mean was determined on the current subspace. The Nelder-Mead and SPSA implementation we used did not have such an option, so progress on each evaluation is shown. Each experiment was repeated 100 times and confidence bars show the standard error.
The results for the unconstrained case are presented in Figure 3. In the standard Camelback and Hartmann6 benchmarks, we obtain competitive performance. In particular the CoordinateLineBO method works well, which might be due to symmetries in the benchmarks. The benchmark on the Gaussian function is challenging in that the initial signal is of the same magnitude as the noise. If an optimization algorithm initially takes steps away from the optimum, the objective quickly gets very flat, and it becomes difficult to recover by means of a gradient signal only. We found that our method allow to robustly take steps towards the optimum, where local-convergence can be guaranteed, outperforming the standard GP-UCB and CMAES algorithm. Note that the DescentLineBO method works particularly well on this example, as it is designed to use the estimated gradient as line directions; but it does not necessarily perform better on the other benchmarks. When adding an invariant subspace (Figure 3 fig:camelback_sub, fig:hartmann6_sub), our methods remain competitive with the bulk of methods, but surprisingly also UCB works very well on the camelback function with augmented coordinates. This might be due to an effect similar as in Proposition 2 carrying over from the random restarts of the approximate acquisition function optimizer.
Figure 2(f) shows computation time per iteration in a 10 dimensional setting (Hartmann6d+4d) averaged over 500 steps. Our methods obtain roughly one order of magnitude speed up compared to the full-scale Bayesian optimization methods; however this is of course dependent on the implementation. For GP-UCB and REMBO, we optimize the acquisition function using L-BFGS with 50 restarts, where starting points are either randomly chosen or from a previous maximizer.
The results for the constrained case can be found in Figure 4. Our methods clearly outperform both SafeOpt and SwarmSafeOpt in terms of simple regret.
Parameter tuning is a tedious and repetitive task for operation of free electron lasers. The main objective is to increase the laser energy measured by a gas detector at the end of the beam-line. Among hundreds of available parameters that expert operators usually adjust, some parameter groups allow for automated tuning. Those include quadrupole currents settings, beam position parameters and configuration variables of the undulators. For our tests, a suitable subset of 5-40 parameters was selected by machine experts. The machine is operated at 25 Hz and we averaged 10 consecutive evaluations to reduce noise. Ideally, the computation time per step is well below 1s to avoid slowing down the overall optimization. This effectively rules out full-scale Bayesian optimization given the number of parameters. Besides manual tuning by operators, a random walk optimizer is in use and reported to often achieve satisfactory performance when run over a longer period of time; in other cases it did not improve the signal while other methods did. This hints that hill-climbing on the objective should be taken into account as a feasible step towards an acceptable solution, but global exploration and noise robustness are important, too. Nelder-Mead is mostly considered as standard benchmark in the accelerator community. Standard Bayesian optimization was previously reported to outperform it (McIntire et al., 2016a), but safety constraints, and the efficient scaling to high dimensions were not considered. Safe operation constraints include electron loss monitors and a lower threshold on the pulse energy, which is important to maintain during user operation. For our experiments we were mainly concerned with the latter, as at the time of testing, the loss monitoring system could not be used for technical reasons; but this will be an important addition once implemented.
Our results are shown in Figure 4(a). To obtain a systematic comparison, we manually detuned the machine, then run both Nelder-Mead and DescentLineBO twice from the same starting point (limited machine development time did not allow for a more extensive comparison). Our method soundly outperforms Nelder-Mead, both in terms of convergence speed and pulse energy at the final solution. A direct comparison between the LineBO and SafeLineBO in Figure 4(b) shows that the safe method is able to maintain the safety constraint. The safety constraint has the additional benefit of restricting the search space which we found to improve convergence in this case. The solution obtained after 600 steps (after min) already achieves a higher pulse energy than the previous expert setting, which was obtained with the help of a local random walk optimizer. A single, successful run with 40 parameters can be found in Figure 1.
We presented a novel and practical Bayesian optimization algorithm, LineBO, which iteratively decomposes the problem to a sequence of one dimensional sub-problems. This addresses the often ignored issue of how to maximize the acquisition function, and allows to scale the method to high-dimensional settings. We showed that the algorithm is theoretically as well as practically effective. In addition, it can also be used with safety constraints by means of safely solving each sub-problem, and is therefore, to the best of our knowledge, the first method to achieve this. Finally, we demonstrated how we apply the SafeLineBO method on SwissFEL for tuning the pulse energy with up to 40 parameters on a continuous domain while satisfying safe operation constraints.
The authors thank in particular Manuel Nonnenmacher for the work he did during his Master thesis and Kfir Levy for valuable discussions and feedback. For the experiments on the free electron laser, the authors would like to acknowledge the support of the entire SwissFEL team.
This research was supported by SNSF grant 200020_159557 and 407540_167212 through the NRP 75 Big Data program. Further, this project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme grant agreement No 815943.
Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, pp. 2096–2102, 2017a.
A general outline of Bayesian optimization is given in Algorithm 2 below. The evaluation point is determined using an acquisition function that typically depends on the GP model . One of the most common acquisition functions, that have been analyzed theoretically, is GP-UCB (Srinivas et al., 2010). With posterior mean and posterior standard deviation it is defined as the upper confidence bound (in the maximization setting),
for a scaling factor (for details, see Srinivas et al. (2010)).
In the frequentist analysis with confidence intervals , the simple regret at any point can be controlled with
The sample complexity bounds of GP-UCB make sure that the breaking condition in Algorithm 2 is eventually satisfied. Even though these bounds are typically formulated for cumulative regret, the proofs in fact bound the following quantity,
see (Chowdhury & Gopalan, 2017, Theorem 3). From this a bound on simple regret follows as
Bounds on are known for different kernels, including for the linear kernel: , the RBF kernel: , and the Matern kernel with : ; see Srinivas et al. (2010, Theorem 5).
SafeOpt uses an idea that is similar to Algorithm 2. A GP-model is used to estimate the implicit constraint function with confidence intervals . The confidence estimates can be used to define a conservatively estimated safe set . The acquisition step is then restricted to , and under the condition that the confidence estimates hold, the algorithm does not violate the constraints. However, the exploration problem becomes more difficult, as both and need to be explored in an appropriate way. For completeness, we reproduce the pseudo-code from (Sui et al., 2015; Berkenkamp et al., 2016a) in Algorithm 3. Please refer to the original publication for a more detailed treatment.
In each iteration , the algorithm defines a safe set , the set of potential minimizers , and the expander set with point that can possibly enlarge the safe set. The algorithm uses two functions; the first is the uncertainty at a specific point to determine which points to acquire,
Next, to quantify possible expanders of the safe set,
Also, denote and the lower and upper confidence bound of .
The algorithm in its original formulation requires the knowledge of a Lipschitz constant . It is however possible to derive a bound on from the norm bound as by Assumption 2. Note that this algorithm is in particularly simple to implement in the one-dimensional setting. There, the safe-set is always an interval with its endpoints being possible expanders.
We first show the following lemma. Let be twice differentiable with effective dimension . Further, let be the minimum objective value that can be obtained by minimizing any line up to iteration . Then,
where is a lower bound on the probability that a random line intersects the set . Further, if the first order condition at the minimum is met, then
Before we go on to the proof, we show how Proposition 2 follows.
Denote and remember that each line is solved up to accuracy, therefore . Using Lemma B.1, we know that with probability at least , , hence when the statement in the proposition is true. Solving for yields , concluding the proof. ∎
Let be a lower bound on the probability that a random line intersects the set . Using this, we find
where the last inequality uses . Hence
Using the assumption that is twice-differentiable, we can over-approximate around a minimizer using a quadratic function for small and . Therefore , and it is enough to intersect with a random line. Note that if we also use the assumption that the function varies only in dimensions, we can restrict the approximation to the active subspace, therefore . If we allow the hidden constant also to depend on the diameter of the domain , the probability that a random line intersects , and therefore , is at least . ∎
The assumption that is twice differentiable is always satisfied if the kernel function is twice differentiable, see Steinwart & Christmann (2008, Corollary 4.36).
First, we recall the definition of smooth and strongly convex functions. [Strong Convexity] A differentiable function is called -strongly convex if there exists such that for all,
[Smoothness] A differentiable function is called -smooth, if there exists such that for all ,
Strong convexity implies the Polyak–Lojasiewicz condition,
The next lemma shows that randomly chosen directions can be used as descent directions (Lemma 4.3 in the main text).
[Random Descent Direction ] Let be a randomly chosen direction. Specifically assume that is uniformly random on the -dimensional unit sphere (random directions) or uniformly among an orthonormal basis (coordinate descent). Then for any
Denote by the th coordinate of . Note that , hence . Further for due to symmetry argument if is uniformly on the sphere, and by orthonormality in the coordinate case. The results follow from expanding the square and using the previous two equations. ∎
[Exact line search oracle] Let be -strongly convex and -smooth on a domain . If we obtain iterates from Algorithm 1 with random directions that satisfy Lemma B.2, then the exact line-search solution improves per step by
Let be solution obtained from an exact line-search on the sub-problem , and assume that the directions are random, satisfy Lemma B.2 and . Let be arbitrary, then
The last inequality we obtain by setting to minimize the RHS, and note that . Taking expectation of and using Lemma B.2, we get
Rearranging concludes the proof,
Assume that we run Algorithm 2 with accuracy and obtain iterates that do not leave the subset where the function is -smooth and -strongly convex. Denote the exact line-search solutions by and , to find
by means of Lemma B.2. Recursively applying the previous inequality gives
This concludes the proof. ∎
For the Random method, we pick points uniformly in the domain and report the best observation as candidates, without any control on the noise. UCB is implemented using GPy (GPy, 2012), and the acquisition function is maximized using the L-BFGS solver provided by the SciPy library. To evade local maxima, 50 restarts are used, both containing random points and previous maximizers. For Nelder-Mead we use the SciPy implementation. SPSA is provided by noisyopt (https://noisyopt.readthedocs.io). CMAES is provided by the pycma package (https://github.com/CMA-ES/pycma). Our REMBO and InterleavedREMBO implementation is based on https://github.com/jmetzen/bayesian_optimization. SafeOpt and SwarmSafeOpt use the author implementation https://github.com/befelix/SafeOpt.
Our DescentLineBO algorithm uses the following heuristic to find the directions. We use a step-size of and evaluations in our experiments. Note that this can be seen as Thompson sampling on the Euclidean ball with a linear approximation of the posterior GP.