Excursion Search for Constrained Bayesian Optimization under a Limited Budget of Failures

05/15/2020, by Alonso Marco et al.

When learning to ride a bike, a child falls down a number of times before achieving the first success. As falling down usually has only mild consequences, it can be seen as a tolerable failure in exchange for a faster learning process, as it provides rich information about an undesired behavior. In the context of Bayesian optimization under unknown constraints (BOC), typical strategies for safe learning explore conservatively and avoid failures at all costs. On the other side of the spectrum, non-conservative BOC algorithms that allow failing may fail an unbounded number of times before reaching the optimum. In this work, we propose a novel decision maker grounded in control theory that controls the amount of risk we allow in the search as a function of a given budget of failures. Empirical validation shows that our algorithm uses the budget of failures more efficiently than state-of-the-art methods in a variety of optimization experiments, and generally achieves lower regret. In addition, we propose an original algorithm for unconstrained Bayesian optimization inspired by the notion of excursion sets in stochastic processes, upon which the failures-aware algorithm is built.


1 Introduction

Deploying machine learning (ML) algorithms in real-world scenarios is an ongoing challenge. A key difficulty lies in the proper management of undesired outcomes, which are inevitable when learning under unknown or uncertain circumstances, for instance, when sufficiently accurate models or knowledge of the environment are lacking. As an extreme case, in applications like autonomous driving, a failure in the decision-making may lead to human casualties. Such safety-critical scenarios need conservative ML algorithms that forbid any failures. On the other hand, there exist scenarios in which failures are still undesired, although they might not come at a high cost. For example, when deploying ML algorithms to optimize the parameters of an industrial drilling machine to drill faster, a few configurations might break the drill bits, but in exchange, a faster drilling can be learned. In such non-safety-critical applications, failures shall be considered a valuable source of knowledge, and one would tolerate a limited number of them in exchange for better learning performance.

When iteratively improving machine parameters directly from data, the mapping between a specific parameter configuration and the corresponding behavior of the machine is often unknown, and can only be revealed through experiments. Normally, such experiments are time-consuming, and thus data collection is considered expensive. In order to learn the optima of expensive black-box functions, Bayesian optimization (BO) has been established in the last decade as a promising probabilistic framework (shahriari2016taking). Therein, the aim is to efficiently exploit the observed data in combination with prior probabilistic models to estimate the global optimum from a few trials. In the context of robot learning, BO has been used to mitigate the effort of manual controller tuning, see, e.g., (calandra2016bayesian; 2018vonrohr; rai2018bayesian).

When the optimization is subject to unknown external restrictions, the goal is to solve a constrained optimization problem under multiple black-box constraints. (hernandez2016general; Gelbart2014; gardner2014bayesian; gramacy2011opti; schonlau1998global; picheny2014stepwise) propose different BO methods to estimate the constrained global optimum. In (NIPS2017_6785), a variant of this problem is considered, where the total budget of evaluations is explicitly included in the decision-making by formulating the problem as a dynamic programming instance. Because these methods do not have a limit on the number of incurred failures, they can fail many times. In other words, none of them informs the decision maker about the remaining budget of failures at each iteration.

From a different perspective, zero-budget strategies (sui2015safe; berkenkamp2016safe) are needed in safety-critical applications, where failures are not allowed. Such strategies avoid failures by conservatively expanding an initially given safe area, and never exploring beyond the learned safety boundaries. However, when applied in a context where failures are allowed, such strategies become suboptimal: they will ignore the budget and miss alternative, potentially more promising, safe areas located outside the initial safe area.

In this work, we pose the problem of learning the constrained global optimum in settings where a non-zero budget of failures is given. In particular, we make two main contributions. Our first contribution is a failures-aware strategy for BOC that, in contrast to prior work, does not need to be initialized in a safe region and makes decisions that take into account the remaining budgets of failures and evaluations.

Our second contribution is a novel acquisition function inspired by key notions of the geometry of excursion sets in stochastic processes. In (adler2009random), an excursion set is defined over smooth manifolds as the set of points where a process realization lies above a given threshold. The larger the threshold, the more likely it is that an upcrossing will reveal the location of the global maximum. Based on this intuition, we derive an acquisition function that can be written analytically, is cheap to evaluate, and explicitly includes the process derivative to make optimal decisions.

In the following, we explain and experimentally validate the aforementioned contributions. In Section 2, we characterize excursion sets in Gaussian processes (GPs), and explain their benefits when used in BO. In Section 3, we formalize the proposed novel acquisition function to solve unconstrained problems. In Section 4, this acquisition is extended to the constrained case in the presence of a budget of failures. In Section 5, we validate both acquisition functions empirically on common benchmarks for global optimization and on real-world applications. We conclude with a discussion in Section 6.

2 Excursion sets in Bayesian optimization

The proposed search strategy is inspired by the study of the differential and integral geometry of excursion sets in stochastic processes (adler2009random). In the particular case of GPs, analytical expressions can be derived for such sets. In the following, we provide the needed mathematical tools and intuition upon which our search strategy is constructed.

2.1 Problem formulation

The main goal is to address the unconstrained optimization problem

x_* = \arg\min_{x \in \mathcal{X}} f(x), \quad (1)

where the objective f : \mathcal{X} \to \mathbb{R} is a black-box function whose evaluations are corrupted by noise and are expensive to collect (due to, e.g., energetic costs), and \mathcal{X} \subset \mathbb{R}^D is a compact domain.

2.2 Gaussian process (GP)

We model the objective as a Gaussian process, f \sim \mathcal{GP}(0, k(x, x')), with covariance function k and zero prior mean. Observations are modeled using additive Gaussian noise, y = f(x) + \varepsilon, \varepsilon \sim \mathcal{N}(0, \sigma_n^2). After having collected t observations from the objective, \mathcal{D}_t = \{(x_i, y_i)\}_{i=1}^{t}, its predictive distribution at a location x is given by \mathcal{N}(f(x); \mu_t(x), \sigma_t^2(x)), with predictive mean

\mu_t(x) = k_t(x)^\top (K_t + \sigma_n^2 I)^{-1} y_t,

where the entries of the vector k_t(x) are [k_t(x)]_i = k(x, x_i), the entries of the Gram matrix K_t are [K_t]_{ij} = k(x_i, x_j), and the entries of the vector of observations y_t are [y_t]_i = y_i. The predictive variance is given by

\sigma_t^2(x) = k(x, x) - k_t(x)^\top (K_t + \sigma_n^2 I)^{-1} k_t(x).

In the remainder of the paper, we drop the dependency on the current data set \mathcal{D}_t and write \mu(x), \sigma^2(x) to refer to \mu_t(x), \sigma_t^2(x), respectively.
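For concreteness, the following is a minimal NumPy sketch of these predictive equations. The squared exponential kernel and all numerical values are illustrative assumptions, not the settings used in the paper's experiments.

```python
# A minimal sketch of the GP predictive mean and variance above.
import numpy as np

def sq_exp_kernel(A, B, lengthscale=0.2, signal_var=1.0):
    """Squared exponential kernel k(x, x') between row-stacked inputs."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return signal_var * np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior(X, y, Xq, noise_var=1e-4):
    """Predictive mean and variance at query points Xq, given data (X, y)."""
    K = sq_exp_kernel(X, X) + noise_var * np.eye(len(X))  # K_t + sigma_n^2 I
    kq = sq_exp_kernel(X, Xq)                             # cross-covariances
    mean = kq.T @ np.linalg.solve(K, y)                   # mu_t(Xq)
    v = np.linalg.solve(K, kq)
    var = sq_exp_kernel(Xq, Xq).diagonal() - np.einsum("ij,ij->j", kq, v)
    return mean, var                                      # sigma_t^2(Xq)

# Condition on three noisy observations of a 1-D objective and query a grid.
X = np.array([[0.1], [0.5], [0.9]])
y = np.sin(3 * X[:, 0])
mu, var = gp_posterior(X, y, np.linspace(0, 1, 5)[:, None])
```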

2.3 Excursion sets in Gaussian processes

Let us assume a zero-mean scalar Gaussian process f(x), with x \in \mathcal{X} \subset \mathbb{R}, D = 1, and stationary covariance function k. The excursion set A_u = \{x \in \mathcal{X} : f(x) \geq u\} is defined as the set of locations where the process is above the threshold u \in \mathbb{R}. In (adler2009random, Part II. Geometry), such sets are characterized by the number of upcrossings of process samples through the level u, i.e., N_u = \#\{x \in \mathcal{X} : f(x) = u, f'(x) > 0\}, where f'(x) is the derivative of the process. Intuitively, a large \mathbb{E}[N_u] represents a high frequency of upcrossings, which is connected with having many areas in \mathcal{X} where f lives above u. For a one-dimensional, stationary, almost surely continuous and mean-square differentiable Gaussian process, the expected number of upcrossings (Rasmussen2006Gaussian, Sec. 4.1) is given by the well-known Rice's formula (lindgren2006lectures, Sec. 3.1.2)

\mathbb{E}[N_u] = \int_{\mathcal{X}} \mathbb{E}\!\left[\delta(f(x) - u)\, \max(0, f'(x))\right] \mathrm{d}x = \int_{\mathcal{X}} \int_{0}^{\infty} f'\, p_x(u, f')\, \mathrm{d}f'\, \mathrm{d}x, \quad (2)

where p_x(f, f') is the joint density of the process and its derivative, both queried at location x, \delta(\cdot) is the Dirac delta, and the second derivative of the covariance function must exist. Interestingly, (2) can be used to approximate the probability of finding the supremum of a process realization above a high level u. The growth rate of the approximation error with respect to u is bounded,

\left| \Pr\!\left\{ \sup_{x \in \mathcal{X}} f(x) \geq u \right\} - \mathbb{E}[N_u] \right| < O\!\left(e^{-\alpha u^2}\right), \quad (3)

as u \to \infty, with \alpha > 0 indicating the limiting behavior of the approximation error, which needs to be found (adler2009random, Sec. 14). The intuitive reasoning behind this is simple: if f crosses a high level u, it is unlikely to do so more than once. Therefore, the probability that f meets its supremum above u is close to the probability that there is an upcrossing of u. Since the number of upcrossings of a high level will always be small, the probability of an upcrossing is well approximated by \mathbb{E}[N_u].

While the bound in (3) does not hold for the general case D > 1, we use it as a starting point to build a new acquisition function for arbitrary D (cf. Section 2.5), which shows empirically superior results compared to state-of-the-art BO methods. In the following section, we show, for D = 1, how \mathbb{E}[N_u] can be leveraged to lead the search towards areas where the number of upcrossings is large, or equivalently, where the global maximum is more likely to be found. Thereafter, we extend the result to D > 1.

2.4 Practical interpretation for use in BO

The expected number of upcrossings (2) contains valuable information about the number of times a sample realization of the process “upcrosses” the level u. However, (2) cannot be used directly for decision-making because it is a global property of the process itself, rather than a local quantity at a specific location x. Next, we provide a practical interpretation that relaxes some of the assumptions made to obtain (2) and allows for its use in BO. To this end, we introduce three modifications.

First, when seeking the optimum of the process, it is more useful to consider both the up- and down-crossings through the level u, as both of them occur near the optimum when u is large. This quantity is defined in (lindgren2006lectures, Sec. 3.1.2) as the expected number of crossings

\mathbb{E}[C_u] = \int_{\mathcal{X}} \int_{-\infty}^{\infty} |f'| \, p_x(u, f') \, \mathrm{d}f' \, \mathrm{d}x, \quad (4)

with C_u = \#\{x \in \mathcal{X} : f(x) = u\}.

Second, BO uses pointwise information to decide how interesting it is to explore a specific location x. (lindgren2006lectures, Theorem 3.1) proposes the intensity of expected crossings \nu(u, x), which can be computed by simply removing the domain integral in (4),

\nu(u, x) = \int_{-\infty}^{\infty} |f'| \, p_x(u, f') \, \mathrm{d}f'. \quad (5)

Third, when conditioning the Gaussian process on observed data \mathcal{D}_t, it becomes non-stationary (note that all GPs are non-stationary when conditioned on data, even if the covariance function that defines them is stationary), and thus the predictive distribution of a query x changes as a function of \mathcal{D}_t. The dependency on \mathcal{D}_t is introduced in (5) following (lindgren2006lectures, Remark 3.2), as

\nu_t(u, x) = \int_{-\infty}^{\infty} |f'| \, p_x(u, f' \mid \mathcal{D}_t) \, \mathrm{d}f', \quad (6)

where the joint density is evaluated at f(x) = u after resolving the integral over the Dirac delta. We next provide a brief analysis for solving (6).

Using the rule of conditional probability, we have p_x(u, f' \mid \mathcal{D}_t) = p(f(x) = u \mid \mathcal{D}_t)\, p(f' \mid f(x) = u, \mathcal{D}_t). The first term, p(f(x) = u \mid \mathcal{D}_t), is a Gaussian density evaluated at u, with the predictive mean \mu(x) and variance \sigma^2(x) of the GP model (with slight abuse of notation, we write p(f(x) = u \mid \mathcal{D}_t) for the density of f(x) evaluated at u, and p_x(u, f' \mid \mathcal{D}_t) for the joint density evaluated at f(x) = u for some value f'). The second term is also a Gaussian density over the process derivative, conditioned on f(x) = u. This can be seen as adding a virtual observation at location x to the existing data set. Hence, p(f' \mid f(x) = u, \mathcal{D}_t) = \mathcal{N}(f'; \hat{\mu}(x), \hat{\sigma}^2(x)), where \hat{\mu}(x) and \hat{\sigma}^2(x) are computed from the extended data set \mathcal{D}_t \cup \{(x, u)\}. Then, (6) can be rewritten as

\nu_t(u, x) = \frac{1}{\sigma(x)}\, \phi\!\left(\frac{u - \mu(x)}{\sigma(x)}\right) \left[ \hat{\sigma}(x)\sqrt{\tfrac{2}{\pi}}\, e^{-\hat{\beta}^2(x)} + \hat{\mu}(x)\, \mathrm{erf}\!\left(\hat{\beta}(x)\right) \right], \quad (7)

where \hat{\beta}(x) = \hat{\mu}(x) / (\sqrt{2}\, \hat{\sigma}(x)), \phi is the probability density function of a standard normal distribution, and \mathrm{erf} is the error function (see Appendix A for a complete derivation). Fig. 1 shows \nu_t(u, x) for two different values of u, where the GP is conditioned on seven observations. As can be seen, different thresholds imply a different intensity of crossings for the same process. When the threshold is near collected evaluations, the largest intensity of crossings tends to be concentrated near the data. On the contrary, when it is far from the data, the largest intensity of crossings is found in areas of large variance.

(a) Gaussian process posterior
(b) Intensity of expected crossings
Figure 1: (a) Gaussian process posterior conditioned on a set of observations. Given a process realization (dashed lines), two choices for the threshold u (solid lines) determine two different excursion sets. (b) Intensity of expected crossings for each threshold u. Higher values correspond to areas where the boundaries of the excursion sets are likely to be, i.e., where the process is more likely to cross u. The curves are normalized to have the same maximum value.
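The following sketch shows how (7) can be evaluated at a single location x. The derivative statistics m_d and s_d (the posterior mean and standard deviation of f'(x) after adding the virtual observation f(x) = u, cf. Appendix A) are taken as inputs; all names and numbers are illustrative.

```python
# A sketch of the intensity of expected crossings (7) at a single location.
import numpy as np
from scipy.special import erf
from scipy.stats import norm

def crossing_intensity(u, mu_x, sigma_x, m_d, s_d):
    """nu_t(u, x), given the GP moments of f(x) and of f'(x) | f(x) = u."""
    density_at_u = norm.pdf(u, loc=mu_x, scale=sigma_x)   # p(f(x) = u | data)
    beta = m_d / (np.sqrt(2.0) * s_d)
    # E|f'(x)| for f'(x) ~ N(m_d, s_d^2):
    expected_abs_grad = (s_d * np.sqrt(2.0 / np.pi) * np.exp(-beta**2)
                         + m_d * erf(beta))
    return density_at_u * expected_abs_grad

print(crossing_intensity(u=-1.0, mu_x=0.2, sigma_x=0.5, m_d=0.3, s_d=1.2))
```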

2.5 Extension to D dimensions

Although (7) was derived for D = 1, we can extend it to the case D > 1. Since (6) depends on the process derivative f', a natural extension is to consider the L1 norm of the gradient of the process, \|\nabla f(x)\|_1 = \sum_{j=1}^{D} |\partial f(x) / \partial x_j|. Following this, we extend (7) as

\nu_t(u, x) = \frac{1}{\sigma(x)}\, \phi\!\left(\frac{u - \mu(x)}{\sigma(x)}\right) \sum_{j=1}^{D} \left[ \hat{\sigma}_j(x)\sqrt{\tfrac{2}{\pi}}\, e^{-\hat{\beta}_j^2(x)} + \hat{\mu}_j(x)\, \mathrm{erf}\!\left(\hat{\beta}_j(x)\right) \right], \quad (8)

where \hat{\beta}_j(x) = \hat{\mu}_j(x) / (\sqrt{2}\, \hat{\sigma}_j(x)). The gradient \nabla f(x) follows a multivariate Gaussian, and \hat{\mu}_j(x) and \hat{\sigma}_j^2(x) are the posterior mean and variance of its j-th component. Note that \hat{\mu}_j(x) and \hat{\sigma}_j^2(x) depend on the extended data set \mathcal{D}_t \cup \{(x, u)\}.

In the following sections, we propose two novel algorithms that build upon the quantity (8).

3 Excursion search algorithm

The modifications applied to (2), detailed above, allow extracting useful information about how likely the process is to cross a certain level u at each location x. When u is a lower bound on the collected data, (8) reveals locations where the process is more likely to have a minimum. If we repeatedly evaluate at such locations, one would expect to approach the global minimum faster. In the following, we characterize (8) as an acquisition function for optimal decision-making.

3.1 Threshold of crossings as the global minimum

The choice of the threshold u in (8) is important when trying to find the global minimum. A hypothetically appropriate low value for u is right above the global minimum f_* = f(x_*), i.e., u = f_* + \epsilon, where \epsilon > 0 is small. Then, if crossings through u are likely to occur in a specific area, we know that such area is likely to contain the global minimum, and thus will show a large \nu_t(u, x). However, in practice we do not have access to the true f_* of the objective function, and thus cannot compute u in the aforementioned way. At most, we are able to assume a distribution over the global minimum, p(f_*), implied by the GP model on f. In the following, we assume that u follows such distribution, i.e., u \sim p(f_*).

It is well known in extreme value theory (de2007extreme) that the minimum of a stochastic process follows one of the three extreme value distributions: Gumbel, Fréchet, or Weibull, which generally model the tails of distributions. For example, in (wang2017mes), the Gumbel distribution is chosen to model p(f_*). However, such distribution has infinite support, while in practice it is not useful to have any probability mass above the best observed evaluation y_{\min}. Instead, we consider the Fréchet distribution a more appropriate choice, as it provides support bounded above by y_{\min}. For minimization problems, we can define it in terms of its survival function S(z) = \Pr[f_* > z], given by

S(z) = \exp\!\left( -\left( \frac{y_{\min} - z}{s} \right)^{-\alpha} \right), \quad z \leq y_{\min}, \quad (9)

where the scale s > 0 and shape \alpha > 0 can be estimated from data following the same approach as in (wang2017mes, Appendix B). A thorough analysis of the advantage of using the Fréchet distribution, instead of the Gumbel distribution, for gathering samples of f_* can be found in Appendix B. Using the above definition, the stochastic threshold u \sim p(f_*) makes the quantity (8) also stochastic. We propose to compute its expectation over p(f_*), i.e., \mathbb{E}_{u \sim p(f_*)}[\nu_t(u, x)], which we explain next.

3.2 Acquisition function

We define the excursion search (Xs) acquisition function as

\alpha_{\text{Xs}}(x) = \mathbb{E}_{u \sim p(f_*)}\!\left[ \nu_t(u, x) \right] \approx \frac{1}{M} \sum_{m=1}^{M} \nu_t(u_m, x), \quad (10)

where the outer expectation is intractable and is approximated via sampling. For each sample u_m, (8) needs to be recomputed. The samples can be collected through the inverse of (9),

u_m = y_{\min} - s\left( -\ln r_m \right)^{-1/\alpha},

where r_m follows a uniform distribution in the unit interval, r_m \sim \mathcal{U}(0, 1).

Intuitively, the Xs acquisition function (10) reveals areas near the global minimum (i.e., where the process crosses the estimated f_* with large gradient norm), instead of directly aiming at potential maxima, minima, or saddle points. Furthermore, Xs inherently trades off exploration with exploitation: at early stages of the search, the estimated Fréchet distribution (9) reflects large uncertainty about f_*, which causes the samples u_m to lie far from the data. Hence, exploration is encouraged, as shown in Fig. 1 (green lines). At later stages, when more data is available, the Fréchet distribution (9) shrinks toward the lowest observations, which then encourages exploitation, as shown in Fig. 1 (violet lines).

The acquisition (10) is our first contribution, and can be used for unconstrained optimization problems, e.g., (1).
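The following sketch illustrates (9) and (10): thresholds are drawn by inverse-transform sampling and the crossing intensities are averaged. The generic intensity_fn argument stands in for (7)-(8), and the parameter names are illustrative assumptions.

```python
# A sketch of the Xs acquisition (10) via Fréchet threshold sampling.
import numpy as np

def sample_frechet_minima(y_min, scale, shape, n_samples, rng):
    """Inverse-transform samples of the global minimum, all below y_min."""
    r = rng.uniform(size=n_samples)                       # r_m ~ U(0, 1)
    return y_min - scale * (-np.log(r)) ** (-1.0 / shape)

def xs_acquisition(x, intensity_fn, y_min, scale, shape, n_samples=20, seed=0):
    """Monte-Carlo estimate of (10): E_{u ~ p(f_*)}[ nu_t(u, x) ]."""
    rng = np.random.default_rng(seed)
    u_samples = sample_frechet_minima(y_min, scale, shape, n_samples, rng)
    return np.mean([intensity_fn(x, u) for u in u_samples])

# Example with a toy intensity function in place of (7)-(8):
acq = xs_acquisition(x=0.3, intensity_fn=lambda x, u: np.exp(-(x - u) ** 2),
                     y_min=-1.0, scale=0.5, shape=2.0)
print(acq)
```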

4 Bayesian optimization with a limited budget of failures

In the previous section, we introduced a new acquisition function (10), grounded in the connection between the true optimum of the process and the expected number of crossings through its current estimate (cf. (3)). However, such acquisition does not explicitly take into account any budget of failures or evaluations. In the following, we propose an algorithm that makes use of both budgets to balance the decision-making between (i) safely exploring encountered safe areas, and (ii) searching outside the safe areas at the risk of failing, when the safe areas contain no further information.

4.1 Problem formulation

To the unconstrained problem (1), we add G black-box constraints, g_j : \mathcal{X} \to \mathbb{R}, j = 1, \dots, G, also corrupted by noise and expensive to evaluate. Moreover, we assume a non-safety-critical scenario, where violating the constraints is allowed, but it is strictly forbidden to do so more than B times. Analogously, we allow only for a maximum number N of evaluations. The case B \geq N is not considered herein, as the budget of failures can then simply be ignored. Under these conditions, we formulate the constrained optimization problem with limited budget of failures as

x_* = \arg\min_{x \in \mathcal{X}} f(x) \;\; \text{s.t.} \;\; g_j(x) \leq 0 \;\; \forall j, \qquad \sum_{i=1}^{N} \iota(x_i) \leq B, \quad (11)

where x_* is the location of the constrained minimum, and \iota(x) = \mathbb{1}[\exists j : g_j(x) > 0] equals 1 if at least one of the constraints is violated at location x, and 0 otherwise. \mathbb{1}[\cdot] is the indicator function, and x_1, \dots, x_N are the locations at which the constraints are evaluated during the search. Since the constraints are unknown, and modeled as independent Gaussian processes g_j \sim \mathcal{GP}(0, k_j(x, x')), queries of f and g_j are stochastic and (11) cannot be solved directly. Instead, we address the analogous probabilistic formulation from (Gelbart2014):

x_* = \arg\min_{x \in \mathcal{X}} \mu(x) \;\; \text{s.t.} \;\; \prod_{j=1}^{G} \Pr\!\left[ g_j(x) \leq 0 \right] \geq \rho, \quad (12)

where \Pr[g_j(x) \leq 0] = \Phi(-\mu_j(x) / \sigma_j(x)), \Phi is the cumulative density function of a standard normal distribution, and \rho \in (0, 1) is typically set close to one. The predictive mean \mu_j(x) and variance \sigma_j^2(x) of each g_j, conditioned on the constraint observations, are computed as in Section 2.2. In the following, we provide a novel Bayesian optimization strategy to address (12).
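As a concrete illustration, the joint probability of constraint satisfaction in (12) can be computed from the constraints' predictive moments; a minimal sketch follows (function and variable names are illustrative).

```python
# Probability that all independent GP-modeled constraints are satisfied at x,
# given the predictive means and standard deviations of each g_j.
import numpy as np
from scipy.stats import norm

def prob_constraints_satisfied(means, stds):
    """Product over constraints of Pr[g_j(x) <= 0] = Phi(-mu_j / sigma_j)."""
    means, stds = np.asarray(means), np.asarray(stds)
    return float(np.prod(norm.cdf(-means / stds)))

# Two constraints: one comfortably satisfied, one borderline.
print(prob_constraints_satisfied(means=[-1.5, 0.1], stds=[0.5, 0.4]))
```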

4.2 Safe exploration with dynamic control

In order to include the probability of constraint satisfaction in the decision-making, we propose a similar approach to (Gelbart2014), explicitly adding a probabilistic constraint to the search for the next evaluation,

x_{t+1} = \arg\max_{x \in \mathcal{X}} \alpha_{\text{Xs}}(x) \;\; \text{s.t.} \;\; \prod_{j=1}^{G} \Pr\!\left[ g_j(x) \leq 0 \right] \geq \rho_t, \quad (13)

where the parameter \rho_t \in (0, 1) determines how much we are willing to tolerate constraint violation at each iteration t. This leads the search away from areas where the constraints are likely to be violated, as those areas get revealed during the search.

Contrary to (Gelbart2014), where this parameter is fixed a priori, we propose to choose it at each iteration, depending on the remaining budget of failures B_t and remaining iterations N_t. Intuitively, the more failures we have left (large B_t), the more we are willing to tolerate constraint violation (small \rho_t, i.e., a weaker constraint in (13)). We achieve this by proposing an automatic control law to drive \rho_t, which we describe next.

Let us define a latent variable \rho_t that follows the deterministic process \rho_{t+1} = \rho_t + u_t, using a dynamic feedback controller u_t. Such controller drives the process toward one of two references, \rho_{\text{safe}} and \rho_{\text{risk}}, with typical values of \rho_{\text{safe}} close to one and \rho_{\text{risk}} close to zero. We define the control law

u_t = \frac{c_t}{B_t}\left( \rho_{\text{safe}} - \rho_t \right) + \frac{B_t}{N_t}\left( \rho_{\text{risk}} - \rho_t \right), \quad (14)

with c_t = 1 if a failure occurs at iteration t and c_t = 0 otherwise, B_t the number of failures left in the budget, and N_t the number of remaining iterations. The first term drives the process toward \rho_{\text{safe}} when a failure occurs at iteration t, with intensity 1/B_t. In this way, the fewer failures are left in the budget, the more urgently the process chases \rho_{\text{safe}}. The second term attempts to push \rho_t down to \rho_{\text{risk}} with an intensity proportional to the ratio between the remaining failures and iterations.

When B_t = 0 but t < N, only a conservative safe exploration is allowed. To do so, we set \rho_t = \rho_{\text{safe}} for the remaining iterations until t = N. Additionally, if there are more failures left than remaining iterations, i.e., B_t \geq N_t, the remaining budget of failures is not decisive for decision-making, and thus we set \rho_t = \rho_{\text{risk}}.

The resulting control strategy weighs risky versus conservative decision-making by considering the budgets of failures and iterations left: when no failures occur for a few consecutive iterations, \rho_t is slowly driven toward \rho_{\text{risk}}, and when a failure takes place, it lifts up toward \rho_{\text{safe}}.
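A minimal sketch of this control strategy follows, under the notation reconstructed above; the specific gain schedule and the clamping to the reference interval are illustrative design assumptions.

```python
# Budget-aware update of rho_t per (14): a failure pushes rho_t toward
# rho_safe (more urgently as the failure budget shrinks), while a drive term
# proportional to the ratio of remaining failures to remaining iterations
# pulls rho_t down toward rho_risk.
def update_rho(rho, failure_occurred, failures_left, iters_left,
               rho_risk=0.01, rho_safe=0.99):
    if failures_left == 0:
        return rho_safe                  # budget depleted: safe mode only
    if failures_left >= iters_left:
        return rho_risk                  # budget not binding: full risk
    u = (failures_left / iters_left) * (rho_risk - rho)   # risk drive term
    if failure_occurred:
        u += (rho_safe - rho) / failures_left             # failure response
    return min(max(rho + u, rho_risk), rho_safe)          # clamp (assumption)

# Example: a failure with only two failures left lifts rho_t sharply.
print(update_rho(rho=0.2, failure_occurred=True, failures_left=2, iters_left=50))
```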

4.3 Risky search of new safe areas

The probabilistic constraint in (13) puts a hard constraint on the decision-making by not allowing evaluations in regions that are known to be unsafe. When \rho_t is high, (13) will discard regions where no data has been collected and locally explore regions where safe evaluations are present. Such conservative decision-making is desirable when the budget of failures is nearly depleted, because it avoids unsafe evaluations. The smaller the \rho_t, the more risky evaluations we can afford, which makes the constraint information less important in the decision-making. However, when \rho_t is too low, the probabilistic constraint tends to be ignored, and the decisions are based only on the information from the objective. Albeit this indeed counts as the wanted risky exploration strategy, completely ignoring the constraint information could result in repeated evaluations in unsafe areas. To avoid this, we follow the approach from (Gelbart2014), where the acquisition function is aware of the constraint information without this being a hard constraint. Therein, locations are chosen at

x_{t+1} = \arg\max_{x \in \mathcal{X}} \; \alpha_{\text{Xs}}(x) \prod_{j=1}^{G} \Pr\!\left[ g_j(x) \leq 0 \right]. \quad (15)

This approach “jumps” outside the current safe areas at the risk of failing, while the multiplying term discourages exploration in areas revealed to be unsafe.

Trading off risky versus safe exploration depends on the remaining budget B_t, and is quantified by \rho_t, as detailed in Section 4.2. We propose a user-defined decision boundary \bar{\rho} \in (\rho_{\text{risk}}, \rho_{\text{safe}}), such that if \rho_t < \bar{\rho}, the next location will be selected using (15), and using (13) otherwise.

While (13) assumes that a safe area has already been found, this might not be the case at an early stage of the search. In that case, we collect observations using (15) and only resort to the risk-versus-safety trade-off once a safe area has been found.

Pseudocode for the overall framework, named failures-aware excursion search (XsF), and an analysis of its computational complexity can be found in Appendix C. XsF returns the estimated location of the constrained minimum from (12), computed from the posterior GP models.

5 Empirical analysis and validation

We empirically validate Xs and XsF by comparing their performance against state-of-the-art methods. We consider three different scenarios. In the first one, we validate each method on common challenging benchmarks for global optimization. In the second and third scenarios, we compare XsF against state-of-the-art methods in constrained optimization problems. In the second, we optimize the hyperparameters of a neural network to achieve maximum compression without degrading its performance. In the third, we learn the state feedback controller of a cart-pole system. Both Xs and XsF are implemented in Python. The code, which includes scripts to reproduce the results presented herein, is documented and publicly available at https://github.com/alonrot/excursionsearch.

5.1 Experimental setup

To assess the performance of all methods, we use the simple regret r_t = f(x_t^+) - f(x_*), where x_t^+ is the point that yielded the best observation so far. In the constrained case, such point is given by the best observation that satisfies the constraints. We quantify how often safe evaluations are collected using the percentage of safe evaluations, \Omega = 100 \cdot N_{\text{safe}} / N, where N_{\text{safe}} is the number of safe evaluations made at the end of each run.

In all cases, the domain is scaled to the unit hypercube. The parameters \rho_{\text{risk}} and \rho_{\text{safe}}, as well as the decision boundary \bar{\rho}, are kept fixed across experiments. Both the objective function and the constraint are modeled with a zero-mean GP with a squared exponential kernel. The lengthscales and the signal variance are fit to the data after each iteration. Further implementation details, such as hyperprior choices and number of random restarts, are reported in Appendix D.

5.2 Benchmarks for global optimization

We validate Xs and XsF on two challenging benchmarks for global optimization: Hartman 6D and Michalewicz 10D (survey2013benchmarks). We allow a fixed budget of evaluations in all cases and repeat all experiments 50 times for each function, using a different seed each time. As in (wang2017mes; hernandez2016general), we use the same initial evaluation (previously selected at random) across all repetitions.

5.2.1 Excursion search (Xs)

We assess the performance of Xs by comparing against popular BO methods: expected improvement (EI) (movckus1975bayesian), probability of improvement (PI) (kushner1964new), min-value entropy search (mES) (wang2017mes), and Gaussian process upper confidence bound (UCB) (Srinivas10gaussianprocess). Our implementations are based on those used by (wang2017mes), available at https://github.com/zi-w/Max-value-Entropy-Search.

Fig. 2(a) shows the evolution of the simple regret over iterations in the Michalewicz 10D benchmark. Xs reaches the lowest regret, and none of the methods is able to achieve a regret close to zero, which is not surprising given the high dimensionality of the problem and the limited number of allowed evaluations. Table 1 (top) shows statistics on the regret value for both benchmarks. While all methods report a generally high regret on Michalewicz 10D, Xs clearly outperforms all the other methods on Hartman 6D, as it finds a near-zero regret.

        | Hartman 6D | Michalewicz 10D
EI      |            |
mES     |            |
PI      |            |
UCB     |            |
Xs      |            |
EIC     |            |
PESC    |            |
XsF     |            |
Table 1: Unconstrained (top: EI, mES, PI, UCB, Xs) and constrained (bottom: EIC, PESC, XsF) benchmarks. Simple regret (mean ± std) and percentage of safe evaluations \Omega.
(a) Unconstrained
(b) Constrained
Figure 2: Performance assessment of Xs and XsF on the Michalewicz 10-dimensional benchmark.

5.2.2 Failures-aware excursion search (XsF)

To validate XsF, we propose a constrained optimization problem under a limited budget of failures. For this, we simply impose a constraint g(x) on the aforementioned benchmarks. Such function uniformly divides the volume into sub-hypercubes, and places convex disjoint unsafe areas in each one of the sub-hypercubes, so that they are never adjacent to each other. We allow a fixed budget of evaluations and a considerably small budget of failures for all methods. We compare XsF against expected improvement with constraints (EIC) (Gelbart2014) and predictive entropy search with constraints (PESC) (hernandez2016general). EIC and PESC are terminated when their budget is depleted. Although individual experiments rarely finish at the same iteration (i.e., some may deplete the budget of failures earlier than others), we use in our results the last regret reported by each algorithm. For EIC, we use our own implementation, while for PESC we use the available open source implementation, included in Spearmint (https://github.com/HIPS/Spearmint/tree/PESC).

In Fig. 2(b), we see that XsF reaches a higher number of total evaluations and consistently achieves lower regret than EIC and PESC. Fig. 2(b) (middle) shows the evolution of the remaining budget of failures over iterations (mean and standard deviation). As can be seen, EIC and PESC deplete the budget faster than XsF. Finally, Fig. 2(b) (bottom) shows the evolution of the parameter \rho_t, used to switch between risky and safe strategies in XsF, and also as a threshold for probabilistic constraint satisfaction (cf. Section 4.2). We differentiate two stages: during the initial iterations, \rho_t is low, and thus risky exploration is preferred, which allows XsF to quickly discover better safe areas. At the last iterations, when the budget is depleted, XsF keeps exploring the discovered safe areas conservatively, with \rho_t = \rho_{\text{safe}}.

Table 1 (bottom) shows the regret for both the Michalewicz 10D and the Hartman 6D functions in the constrained case. While the regret comparison is similar in the 10D case, the 6D case shows that XsF clearly outperforms the other methods. The quantity \Omega confirms that XsF collects safe evaluations more often than the other methods.

Generally, hyperparameter learning influences the performance of the algorithms. In Appendix E, we show experiments with fixed hyperparameters and a correct GP model, where Xs and XsF outperform the aforementioned methods.

5.3 Compressing a deep neural network

Applying modern deep neural networks (NNs) to large amounts of data typically results in large memory requirements to store the learned weights. Therefore, finding ways of reducing model size without degrading the NN performance has become an important goal in deep learning, for example, to meet storage requirements or to reduce energy consumption. Bayesian compression has been recently proposed as a means to reduce NN size: given an NN architecture, an approximate posterior distribution q on the NN weights is obtained by maximizing the evidence lower bound (ELBO), which balances the expected log-likelihood of samples from q and the theoretical compression size, as given by the KL divergence between q and a prior distribution (havasi2018minimal). A penalization factor can be used to scale the KL divergence to control the final size of the NN. Finding the value of this factor that achieves the lowest compression size without significantly degrading NN performance is a challenging and expensive tuning problem, for which Bayesian optimization is commonly used. Herein, we propose to minimize the validation error of the NN while keeping its size below a threshold, using constrained Bayesian optimization under a limited budget of failures. While in this example failing to comply with the size requirements is not catastrophic, collecting many failures is undesirable.

We use a LeNet-5 on the MNIST dataset, and require a size below 15 kB. The parameters to tune are the KL penalization factor, the learning rate, and a scaling factor on the number of neurons of all layers. As a reference for our implementation, we used the open source implementation of MIRACLE (https://github.com/cambridge-mlg/miracle) (havasi2018minimal). We allow fixed budgets of evaluations and failures, and repeat the experiments 5 times. We fix the training epochs to 20000 for each evaluation (about 25 min in wall-clock time). As shown in Fig. 3(a), XsF achieves the lowest regret and standard deviation. The best safe observation is reported by XsF, with a low validation error and a theoretical NN size meeting the requirement (553× compression).

(a) NN compression
(b) Cart-pole problem
Figure 3: Performance comparison of XsF against EIC and PESC

5.4 Tuning a feedback controller

Bayesian optimization has been used for learning robot controllers to alleviate manual tuning (calandra2016bayesian; rai2018bayesian). Herein, we propose to tune a 4D state feedback controller on a cart-pole system, where unstable controllers found during the search are undesirable, as human intervention is required to reset the platform, but not catastrophic. In this setting, allowing a limited budget of failures might increase the chances of finding a better optimum. In practice, a constraint can be placed on the cart position to trigger an emergency stop when it grows too large (marco2016automatic). Controllers that surpass such limit at any moment during the experiment are considered a failure. We use the simulated cart-pole system from OpenAI gym (openAI) (https://gym.openai.com/envs/InvertedPendulum-v2/), implemented in the MuJoCo physics engine (MuJoCo). The task consists of, first, stabilizing the pendulum starting from random initial conditions, and second, rejecting a small step disturbance in the cart position. We consider fixed budgets of evaluations and failures, and repeat all experiments 10 times. Fig. 3(b) shows that XsF finds a better controller than the other methods.

6 Conclusions

In this paper, we have presented two novel algorithms for BO: excursion search (Xs), which is based on the study of excursion sets in Gaussian processes, and failures-aware excursion search (XsF), which trades off risky and safe exploration as a function of the remaining budget of failures through a dynamic feedback controller. Our empirical validation shows that both algorithms outperform state-of-the-art methods. Specifically, in situations in which failing is permitted, but undesirable, XsF makes better use of a given budget of failures.

References

Appendix A Additional details to Sec. 2.4

Herein, the derivation of (7) is complemented with two additional insights. First, in Section A.1, we show how the integral from (7) resolves into an analytical expression. Then, in Section A.2, we reason about adding f(x) = u to the data set as a virtual observation.

A.1 Analytical expression for the integral in (7)

The integral from (7) can be split in two parts,

\int_{-\infty}^{\infty} |f'| \, p(f' \mid f(x) = u, \mathcal{D}_t) \, \mathrm{d}f' = \int_{0}^{\infty} f' \mathcal{N}(f'; \hat{\mu}, \hat{\sigma}^2) \, \mathrm{d}f' - \int_{-\infty}^{0} f' \mathcal{N}(f'; \hat{\mu}, \hat{\sigma}^2) \, \mathrm{d}f',

where the placeholders \hat{\mu}, \hat{\sigma}^2 are used for simplicity, and their dependency on x is implicit and also omitted. Since f' is Gaussian distributed, each of the integrals above can be seen as the expected value of an unnormalized truncated normal distribution with support [0, \infty) and (-\infty, 0], respectively. These expectations are given by (jawitz2004moments)

\int_{0}^{\infty} f' \mathcal{N}(f'; \hat{\mu}, \hat{\sigma}^2) \, \mathrm{d}f' = \hat{\mu}\, \Phi(\beta) + \hat{\sigma}\, \phi(\beta), \qquad \int_{-\infty}^{0} f' \mathcal{N}(f'; \hat{\mu}, \hat{\sigma}^2) \, \mathrm{d}f' = \hat{\mu}\, \Phi(-\beta) - \hat{\sigma}\, \phi(\beta),

where \beta = \hat{\mu} / \hat{\sigma}, \phi is the density of a standard normal distribution, and \Phi is its cumulative density function. We make use of the definition \Phi(\beta) = \frac{1}{2}\left(1 + \mathrm{erf}(\beta / \sqrt{2})\right), where \mathrm{erf} is the error function. Then, \Phi(\beta) - \Phi(-\beta) = \mathrm{erf}(\beta / \sqrt{2}), and the integral can be solved analytically as

\int_{-\infty}^{\infty} |f'| \, p(f' \mid f(x) = u, \mathcal{D}_t) \, \mathrm{d}f' = \hat{\sigma} \sqrt{\tfrac{2}{\pi}}\, e^{-\hat{\mu}^2 / (2\hat{\sigma}^2)} + \hat{\mu}\, \mathrm{erf}\!\left(\frac{\hat{\mu}}{\sqrt{2}\,\hat{\sigma}}\right).

Then, (7) follows.
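This closed form is easy to verify numerically; a short sketch:

```python
# Numerical check of the closed form above: for f' ~ N(m, s^2),
# E|f'| = s*sqrt(2/pi)*exp(-m^2/(2 s^2)) + m*erf(m/(sqrt(2)*s)).
import numpy as np
from scipy.special import erf

m, s = 0.7, 1.3
analytic = s * np.sqrt(2 / np.pi) * np.exp(-m**2 / (2 * s**2)) \
           + m * erf(m / (np.sqrt(2) * s))
samples = np.random.default_rng(0).normal(m, s, size=1_000_000)
print(analytic, np.abs(samples).mean())  # the two values agree closely
```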

A.2 Virtual observation

The posterior of the process derivative, p(f'(x) \mid f(x) = u, \mathcal{D}_t), is a Gaussian density and can be seen as conditioning on an extended data set that includes f(x) = u as a virtual observation. In the following, we briefly discuss this.

Since differentiation is a linear operation, the derivative of a GP remains a GP (Rasmussen2006Gaussian, Sec. 9.4). Furthermore, the joint density between a process value f(x), its derivative f'(x), and the observations y_t is Gaussian (wu2017bayesian),

p\!\left( y_t, f(x), f'(x) \right) = \mathcal{N}\!\left( 0, \begin{bmatrix} K_t + \sigma_n^2 I & k_t(x) & \partial k_t(x) \\ k_t(x)^\top & k(x, x) & \partial k(x, x) \\ \partial k_t(x)^\top & \partial k(x, x) & \partial^2 k(x, x) \end{bmatrix} \right),

where [\partial k_t(x)]_i = \partial k(x, x_i)/\partial x, \partial k(x, x) = \partial k(x, x')/\partial x \big|_{x'=x}, \partial^2 k(x, x) = \partial^2 k(x, x')/\partial x\, \partial x' \big|_{x'=x}, and the prior mean of the GP is assumed to be zero. Then, the conditional p(f'(x) \mid f(x), \mathcal{D}_t) is also Gaussian, and can be obtained using Gaussian algebra (Rasmussen2006Gaussian, A.2). The mean \hat{\mu}(x) depends on the random variable f(x) as

\hat{\mu}(x) = \begin{bmatrix} \partial k_t(x)^\top & \partial k(x, x) \end{bmatrix} \begin{bmatrix} K_t + \sigma_n^2 I & k_t(x) \\ k_t(x)^\top & k(x, x) \end{bmatrix}^{-1} \begin{bmatrix} y_t \\ f(x) \end{bmatrix}. \quad (16)

The sought Gaussian density p(f'(x) \mid f(x) = u, \mathcal{D}_t) is obtained by replacing the value f(x) = u in the expression for the mean (16). Thereby, u appears in (16) as an additional virtual observation at location x added to the existing data set \mathcal{D}_t; in shorthand notation, \mathcal{D}_t \cup \{(x, u)\}.

Appendix B Fréchet distribution

In this section, we present a brief analysis of why assuming a Gumbel distribution is more error-prone in practice than using the Fréchet distribution when it comes to modeling the distribution over the global minimum p(f_*). This analysis complements Sec. 3.1 in the paper.

When modeling p(f_*) with the Gumbel distribution and sampling from it, some samples of the global minimum can lie above the best observed evaluation y_{\min} with non-zero probability, which is unrealistic. This can be explicitly avoided by using the Fréchet distribution which, contrary to Gumbel, has zero probability mass above y_{\min}. We illustrate this with an example, in which a GP with zero mean, unit variance, and squared exponential kernel is considered, conditioned on 20 observations sampled from the GP prior. We discretize the domain in 200 points and sample the resulting GP posterior at them. In Fig. 4, we see that a portion of the Gumbel samples lie above y_{\min}. To show consistency, we sample the posterior GP 100 times and average the number of times each model exceeds y_{\min}: the Gumbel samples do so in a substantial fraction of the cases, while the Fréchet samples never exceed y_{\min}, by construction.

Figure 4: (top) Gaussian process and y_{\min} (red dashed line). (bottom) Survival functions for both the Gumbel and Fréchet distributions. Samples from the Gumbel (crosses) and from the Fréchet (circles) distributions are shown.
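A short sketch of this comparison, with illustrative parameter values and both models centered at the best observation:

```python
# Draw minimum-value samples from a min-Gumbel and a mirrored Fréchet model,
# both centered at y_min, and count how often each exceeds y_min.
import numpy as np

rng = np.random.default_rng(1)
y_min, beta, scale, shape = -2.0, 0.3, 0.5, 2.0
r = rng.uniform(size=100_000)
gumbel = y_min + beta * np.log(-np.log(r))               # min-Gumbel samples
frechet = y_min - scale * (-np.log(r)) ** (-1 / shape)   # bounded above by y_min
print((gumbel > y_min).mean(), (frechet > y_min).mean()) # ~0.37 vs exactly 0.0
```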

Appendix C Algorithm and complexity

Herein, we discuss pseudocode for XsF and its computational complexity.

C.1 XsF algorithm

Pseudocode for XsF is shown in Algorithm 1. The decision boundary \bar{\rho} is used to switch between safe search (cf. (13)) and risky search (cf. (15)). The algorithm returns the location where the mean of the posterior GP is minimized without violating the probabilistic constraints. To abbreviate, we have used the placeholder \Gamma(x) = \prod_{j=1}^{G} \Pr[g_j(x) \leq 0] for the probability of constraint satisfaction.

We do not explicitly discuss Xs, as it simply comprises a standard Bayesian optimization loop, which involves (i) computing samples of the global minimum, and (ii) maximizing the acquisition function (10).

C.2 Complexity

At each iteration, the most expensive operations required to obtain (13) and (15) are: (a) obtaining samples from the global minimum and (b) maximizing the acquisition function using local optimization with random restarts.

As explained in (wang2017mes), obtaining samples from p(f_*) involves discretizing the input domain and performing a binary search, with a total cost governed by the size of the discretization grid and the accuracy of the binary search.

Each call to the acquisition function (10) has a cost proportional to the number of minimum samples M and the dimensionality D of the input space. Then, assuming a fixed number of random restarts and a maximum number of function calls per restart, the total per-iteration cost of XsF in the worst-case scenario additionally includes the cost of inverting the Gram matrices needed for GP predictions (cf. (16)), i.e., O((G + 1) t^3) after having collected t observations with G constraints. When setting G = 0, we obtain the computational cost of Xs, as it also requires gathering samples from p(f_*) and local optimization with random restarts.

  Input: budgets N (evaluations) and B (failures); references \rho_{\text{risk}}, \rho_{\text{safe}}; decision boundary \bar{\rho}
  for t = 1 to N do
      \rho_t \leftarrow UpdateDecisionBoundary(\rho_{t-1}, B_t, N_t)
      \{u_m\}_{m=1}^{M} \leftarrow SampleGlobalMinimum()
      if \rho_t \geq \bar{\rho} then
          x_{t+1} \leftarrow safe search (13)
      else
          x_{t+1} \leftarrow risky search (15)
      end if
      EvaluateAndUpdateGPs(x_{t+1})
  end for

  Return: estimate of x_* from (12), computed with the posterior GP means
  function UpdateDecisionBoundary(\rho_{t-1}, B_t, N_t)
      u_t \leftarrow controller update (14)
      \rho_t \leftarrow \rho_{t-1} + u_t (process update)
      Return: \rho_t
  end function
  function SampleGlobalMinimum()
      Estimate the Fréchet distribution (9) following (wang2017mes, Appendix B)
      for m = 1 to M do
          u_m \leftarrow y_{\min} - s(-\ln r_m)^{-1/\alpha}, with r_m \sim \mathcal{U}(0, 1)
      end for
      Return: \{u_m\}_{m=1}^{M}
  end function
  function EvaluateAndUpdateGPs(x_{t+1})
      y_{t+1} \leftarrow f(x_{t+1}) + \varepsilon, \mathcal{D}_{t+1} \leftarrow \mathcal{D}_t \cup \{(x_{t+1}, y_{t+1})\}
      Evaluate the constraints g_j(x_{t+1}) and update their data sets analogously
      Update hyperparameters of all GP models
  end function
Algorithm 1 Failures-aware Excursion Search (XsF)

Appendix D Implementation details

Both Xs and XsF are developed using BoTorch (https://botorch.org/docs/introduction.html), a Python library for Bayesian optimization that serves as a low-level API for building and optimizing new acquisition functions and fitting GP models. It makes use of scipy optimizers (https://docs.scipy.org/doc/scipy/reference/tutorial/optimize.html) for estimating the GP hyperparameters and optimizing the acquisition function through local optimization with random restarts. In all cases, we allow 10 random restarts and use L-BFGS-B (byrd1995limited) as the local optimization algorithm. Currently, BoTorch does not support optimization under non-linear constraints, which is needed to solve (13). To overcome this, we use the implementation of COBYLA (powell1994direct) from nlopt (https://nlopt.readthedocs.io/en/latest/).

In all experiments, the noise of the likelihood is fixed for all GPs. The chosen hyperpriors on the lengthscales and the signal variance are reported in Table 2, where U(a, b) refers to a uniform prior on the interval [a, b], G(a, b) refers to a Gamma prior with concentration a and rate b, and N(m, s) refers to a normal distribution with mean m and standard deviation s.

In Sec. 5.2, both the Michalewicz and the Hartman functions are normalized to have zero mean and unit variance. The true minimum is known for both functions, which allows computing the regret.

In Sec. 5.4, the goal is to find the state feedback gain for the cart-pole problem that minimizes a quadratic cost, which penalizes deviations of the pendulum states from the equilibrium point. The states are the pole angle, the pole angular velocity, the cart displacement, and the cart velocity. The input to the system is the cart acceleration, given by the state feedback law, where an integrator is added to eliminate the steady-state error. For each parametrization, the constraint value is computed as the maximum displacement of the cart over a simulation of fixed length. Constraint violation is quantified relative to the physical limit of the rail on which the cart moves. To allow the system to dissipate energy, the damping value of the simulated cart-pole in MuJoCo was increased from 1.0 to 1.5.
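A sketch of this cost and constraint computation over a rollout follows; the rollout array, the weight matrix, and the rail limit are illustrative assumptions.

```python
# Quadratic state cost accumulated over a rollout, and a constraint given by
# the maximum cart displacement (exceeding the rail limit triggers an
# emergency stop and counts as a failed experiment).
import numpy as np

def cost_and_constraint(states, Q=np.diag([1.0, 0.1, 1.0, 0.1]), x_limit=1.0):
    """states: (T, 4) rollout [pole angle, angular velocity, cart pos., cart vel.]."""
    cost = np.einsum("ti,ij,tj->", states, Q, states) / len(states)
    max_cart_disp = np.abs(states[:, 2]).max()
    violated = max_cart_disp > x_limit
    return cost, max_cart_disp, violated

# Example on a dummy 100-step rollout:
rollout = np.random.default_rng(0).normal(scale=0.2, size=(100, 4))
print(cost_and_constraint(rollout))
```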

                | Lengthscale | Signal variance
Michalewicz 10D |             |
Hartman 6D      |             |
NN compression  |             |
Pendulum        |             |
Table 2: Hyperprior choices for the GP model hyperparameters in all experiments.

Appendix E Additional results

In this section, we present complementary results to Sec. 5.

To decouple the influence of hyperparameter learning from the performance of the acquisition function itself, we fix the GP hyperparameters and sample the true objective and the true constraint from the corresponding GP priors. To obtain such samples, we follow the same approach as in (hernandez2016general): first, the input domain is discretized into an irregular grid of 8000 points. Second, function evaluations are randomly sampled from the corresponding GP prior at those locations. Finally, the GP is conditioned on those evaluations, and the resulting posterior mean is used as the true objective. The lengthscales and the signal variance were fixed across repetitions.

The simple regret cannot be computed because the true minimum of the GP sample is unknown a priori. Instead, we report results assuming a very conservative lower bound on all possible sampled functions. We allow a fixed maximum number of iterations, and a budget of failures in the constrained case. The experiments were repeated 50 times for all algorithms. At each repetition, a new function is sampled from the GP priors.

In Table 3, we show a performance comparison of both Xs and XsF when optimizing over a 3D input space. Without the influence of hyperparameter optimization, the proposed methods reach lower observations than state-of-the-art methods.

3D Synthetic function