# A General Framework for Constrained Bayesian Optimization using Information-based Search

We present an information-theoretic framework for solving global black-box optimization problems that also have black-box constraints. Of particular interest to us are problems with decoupled constraints, in which subsets of the objective and constraint functions may be evaluated independently; for example, when the objective is evaluated on a CPU while the constraints are evaluated independently on a GPU. These problems require an acquisition function that can be separated into the contributions of the individual function evaluations. We develop one such acquisition function and call it Predictive Entropy Search with Constraints (PESC). PESC is an approximation to the expected information gain criterion and it compares favorably to alternative approaches based on improvement in several synthetic and real-world problems. In addition, we consider problems with a mix of functions that are fast and slow to evaluate. These problems require balancing the amount of time spent in the meta-computation of PESC and in the actual evaluation of the target objective. We take a bounded-rationality approach and develop a partial update for PESC that trades off accuracy against speed. We then propose a method for adaptively switching between the partial and full updates for PESC. This allows us to interpolate between versions of PESC that are efficient in terms of function evaluations and those that are efficient in terms of wall-clock time. Overall, we demonstrate that PESC is an effective algorithm that provides a promising direction towards a unified solution for constrained Bayesian optimization.


## 1 Introduction

Many real-world optimization problems involve finding a global minimizer of a black-box objective function subject to a set of black-box constraints all being simultaneously satisfied. For example, consider the problem of optimizing the performance of a speech recognition system, subject to the requirement that it operates within a specified time limit. The system may be implemented as a neural network with hyper-parameters such as the number of hidden units, learning rates, regularization constants, etc. These hyper-parameters have to be tuned to minimize the recognition error on some validation data under a constraint on the maximum runtime of the resulting system. Another example is the discovery of new materials. Here, we aim to find new molecular compounds with optimal properties such as the power conversion efficiency in photovoltaic devices. Constraints arise from our ability (or inability) to synthesize various molecules. In this case, the estimation of the properties of the molecule and its synthesizability can be achieved by running expensive simulations on a computer.

More formally, we are interested in finding the global minimum of a scalar objective function $f(x)$ over some bounded domain, typically $\mathcal{X} \subset \mathbb{R}^D$, subject to the non-negativity of a set of constraint functions $c_1, \dots, c_K$. We write this as

$$\min_{x \in \mathcal{X}} f(x) \quad \text{s.t.} \quad c_1(x) \geq 0, \; \dots, \; c_K(x) \geq 0. \tag{1}$$

However, $f$ and $c_1, \dots, c_K$ are unknown and can only be evaluated pointwise via expensive queries to “black boxes” that may provide noise-corrupted values. Note that we are assuming that $f$ and each of the constraints $c_k$ are defined over the entire space $\mathcal{X}$. We seek to find a solution to Eq. 1 with as few queries as possible.

For solving unconstrained problems, Bayesian optimization (BO) is a successful approach to the efficient optimization of black-box functions (Mockus et al., 1978). BO methods work by applying a Bayesian model to the previous evaluations of the function, with the aim of reasoning about the global structure of the objective function. The Bayesian model is then used to compute an acquisition function (i.e., expected utility function) that represents how promising each possible $x \in \mathcal{X}$ is if it were to be evaluated next. Maximizing the acquisition function produces a suggestion which is then used as the next evaluation location. When the evaluation of the objective at the suggested point is complete, the Bayesian model is updated with the newly collected function observation and the process repeats. The optimization ends after a maximum number of function evaluations is reached, a time threshold is exceeded, or some other stopping criterion is met. When this occurs, a recommendation of the solution is given to the user. This is achieved, for example, by optimizing the predictive mean of the Bayesian model, or by choosing the best observed point among the evaluations. The Bayesian model is typically a Gaussian process (GP); an in-depth treatment of GPs is given by Rasmussen and Williams (2006). A commonly-used acquisition function is the expected improvement (EI) criterion (Jones et al., 1998), which measures the expected amount by which we will improve over some incumbent or current best solution, typically given by the expected value of the objective at the current best recommendation. Other acquisition functions aim to approximate the expected information gain, or expected reduction in the posterior entropy, of the global minimizer of the objective (Villemonteix et al., 2009; Hennig and Schuler, 2012; Hernández-Lobato et al., 2014). For more information on BO, we refer to the tutorial by Brochu et al. (2010).
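The loop just described can be sketched in a few lines. The sketch below is generic and hypothetical: `update_model`, `acquisition`, and the finite candidate set stand in for a GP model, an acquisition function such as EI, and a continuous search space, and the final recommendation is simply the best observed point.

```python
def bayesian_optimization(objective, acquisition, update_model, model,
                          candidates, n_iters):
    """Generic BO loop: fit model, maximize acquisition, evaluate, repeat."""
    data = []  # list of (x, y) observations collected so far
    for _ in range(n_iters):
        # Update the Bayesian model with all observations collected so far.
        model = update_model(model, data)
        # Suggest the candidate location maximizing the acquisition function.
        x_next = max(candidates, key=lambda x: acquisition(x, model))
        # Query the expensive black box at the suggested location.
        y_next = objective(x_next)
        data.append((x_next, y_next))
    # Recommend the best observed point (lowest objective value).
    return min(data, key=lambda pair: pair[1])
```

In a real implementation the `max` over candidates would be a continuous optimization of the acquisition surface, and the recommendation could instead minimize the model's predictive mean.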

There have been several attempts to extend BO methods to address the constrained optimization problem in Eq. 1. The proposed techniques use GPs and variants of the EI heuristic (Schonlau et al., 1998; Parr, 2013; Snoek, 2013; Gelbart et al., 2014; Gardner et al., 2014; Gramacy et al., 2016; Gramacy and Lee, 2011; Picheny, 2014). Some of these methods lack generality since they were designed to work in specific contexts, such as when the constraints are noiseless or the objective is known. Furthermore, because they are based on EI, computing their acquisition function requires the current best feasible solution or incumbent: a location in the search space with low expected objective value and high probability of satisfying the constraints. However, the best feasible solution does not exist when no point in the search space satisfies the constraints with high probability (for example, because of lack of data). Finally, and more importantly, these methods run into problems when the objective and the constraint functions are decoupled, meaning that the functions in Eq. 1 can be evaluated independently. In particular, the acquisition functions used by these methods usually consider joint evaluations of the objective and constraints and cannot produce an optimal suggestion when only subsets of these functions are being evaluated.

In this work, we propose a general approach for constrained BO that does not have the problems mentioned above. Our approach to constraints is based on an extension of Predictive Entropy Search (PES) (Hernández-Lobato et al., 2014), an information-theoretic method for unconstrained BO problems. The resulting technique is called Predictive Entropy Search with Constraints (PESC) and its acquisition function approximates the expected information gain with regard to the solution of Eq. 1, which we call $x^\star$. At each iteration, PESC collects data at the location that is the most informative about $x^\star$, in expectation. One important property of PESC is that its acquisition function naturally separates the contributions of the individual function evaluations when those functions are modeled independently. That is, the amount of information that we approximately gain by jointly evaluating a set of independent functions is equal to the sum of the gains of information that we approximately obtain by the individual evaluation of each of the functions. This additive property in its acquisition function allows PESC to efficiently solve the general constrained BO problem, including problems with decoupled evaluation, something that, to the best of our knowledge, no other existing technique can achieve.

An initial description of PESC is given by Hernández-Lobato et al. (2015). That work considers PESC only in the coupled evaluation scenario, where all the functions are jointly evaluated at the same input value. This is the standard setting considered by most prior approaches for constrained BO. Here, we further extend that initial work on PESC as follows:

1. We present a taxonomy of constrained BO problems. We consider problems in which the objective and constraints can be split into subsets of functions or tasks that require coupled evaluation, but where different tasks can be evaluated in a decoupled way. These different tasks may or may not compete for a limited set of resources. We propose a general algorithm for solving this type of problem and then show how PESC can be used for the practical implementation of this algorithm.

2. We analyze PESC in the decoupled scenario. We evaluate the accuracy of PESC when the different functions (objective or constraint) are evaluated independently. We show how PESC efficiently solves decoupled problems with an objective and constraints that compete for the same computational resource.

3. We intelligently balance the computational overhead of the Bayesian optimization method relative to the cost of evaluating the black-boxes. To achieve this, we develop a partial update to the PESC approximation that is less accurate but faster to compute. We then automatically switch between partial and full updates so that we can balance the amount of time spent in the Bayesian optimization subroutines and in the actual collection of data. This allows us to efficiently solve problems with a mix of decoupled functions where some are fast and others slow to evaluate.

The rest of the paper is structured as follows. Section 2 reviews prior work on constrained BO and considers these methods in the context of decoupled functions. In Section 3 we present a general framework for describing BO problems with decoupled functions, which contains as special cases the standard coupled framework considered in most prior work as well as the notion of decoupling introduced by Gelbart et al. (2014). This section also describes a general algorithm for BO problems with decoupled functions. In Section 4 we show how to extend Predictive Entropy Search (PES) (Hernández-Lobato et al., 2014) to solve Eq. 1 in the context of decoupling, an approach that we call Predictive Entropy Search with Constraints (PESC). We also show how PESC can be used to implement the general algorithm from Section 3. In Section 5 we modify the PESC algorithm to be more efficient in terms of wall-clock time by adaptively using an approximate but faster version of the method. In Sections 6 and 7 we perform empirical evaluations of PESC on coupled and decoupled optimization problems, respectively. Finally, we conclude in Section 8.

## 2 Related Work

Below we discuss previous approaches to Bayesian optimization with black-box constraints, many of which are variants of the expected improvement (EI) heuristic (Jones et al., 1998). In the unconstrained setting, EI measures the expected amount by which observing the objective at $x$ leads to improvement over the current best recommendation or incumbent, the objective value of which is denoted by $\eta$ (thus, $\eta$ has the units of $f(x)$, not $x$). The incumbent is usually defined as the lowest expected value for the objective over the optimization domain. The EI acquisition function is then given by

$$\alpha_{\text{EI}}(x) = \int \max(0, \eta - f(x))\, p(f(x) \,|\, \mathcal{D})\, df(x) = \sigma_f(x)\left(z_f(x)\Phi(z_f(x)) + \phi(z_f(x))\right), \tag{2}$$

where $\mathcal{D}$ represents the collected data (previous function evaluations), $p(f(x) \,|\, \mathcal{D})$ is the predictive distribution for the objective made by a Gaussian process (GP), $\mu_f(x)$ and $\sigma_f^2(x)$ are the GP predictive mean and variance for $f(x)$, $z_f(x) = (\eta - \mu_f(x))/\sigma_f(x)$, and $\Phi$ and $\phi$ are the standard Gaussian CDF and PDF, respectively.
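For concreteness, the closed form on the right-hand side of Eq. 2 can be computed with nothing more than the standard Gaussian CDF and PDF. The sketch below uses illustrative values of $\mu_f(x)$, $\sigma_f(x)$, and $\eta$, not quantities from the paper:

```python
import math

def std_normal_cdf(z):
    """Standard Gaussian CDF, Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def std_normal_pdf(z):
    """Standard Gaussian PDF, phi(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def expected_improvement(mu, sigma, eta):
    """E[max(0, eta - f)] for f ~ N(mu, sigma^2), the closed form of Eq. 2."""
    z = (eta - mu) / sigma
    return sigma * (z * std_normal_cdf(z) + std_normal_pdf(z))
```

As a sanity check, for $\mu_f(x) = \eta$ the expression reduces to $\sigma_f(x)\,\phi(0) = \sigma_f(x)/\sqrt{2\pi}$.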

### 2.1 Expected Improvement with Constraints

An intuitive extension of EI in the presence of constraints is to define improvement as only occurring when the constraints are satisfied. Because we are uncertain about the values of the constraints, we must weight the original EI value by the probability of the constraints being satisfied. This results in what we call Expected Improvement with Constraints (EIC):

$$\alpha_{\text{EIC}}(x) = \alpha_{\text{EI}}(x) \prod_{k=1}^{K} \Pr(c_k(x) \geq 0 \,|\, \mathcal{D}) = \alpha_{\text{EI}}(x) \prod_{k=1}^{K} \Phi\!\left(\frac{\mu_k(x)}{\sigma_k(x)}\right). \tag{3}$$

The constraint satisfaction probability factorizes because $c_1, \dots, c_K$ are modeled by independent GPs. In this expression $\mu_k(x)$ and $\sigma_k^2(x)$ are the posterior predictive mean and variance for $c_k(x)$. EIC was initially proposed by Schonlau et al. (1998) and has been revisited by Parr (2013), Snoek (2013), Gardner et al. (2014) and Gelbart et al. (2014).

In the constrained setting, the incumbent can be defined as the minimum expected objective value subject to all the constraints being satisfied at the corresponding location. However, we can never guarantee that all the constraints will be satisfied when they are only observed through noisy evaluations. To circumvent this problem, Gelbart et al. (2014) define $\eta$ as the lowest expected objective value subject to all the constraints being satisfied with posterior probability larger than the threshold $1 - \delta$, where $\delta$ is a small number. However, this value for $\eta$ still cannot be computed when there is no point in the search space that satisfies the constraints with posterior probability higher than $1 - \delta$, for example, because of lack of data for the constraints. In this case, Gelbart et al. change the original acquisition function given by Eq. 3 and ignore the factor $\alpha_{\text{EI}}(x)$ in that expression. This allows them to search only for a feasible location, ignoring the objective entirely and just optimizing the constraint satisfaction probability. However, this can lead to inefficient optimization in practice because the data collected for the objective is not used to make optimal decisions.

### 2.2 Integrated Expected Conditional Improvement

Gramacy and Lee (2011) propose an acquisition function called the integrated expected conditional improvement (IECI), defined as

$$\alpha_{\text{IECI}}(x) = \int_{\mathcal{X}} \left[\alpha_{\text{EI}}(x') - \alpha_{\text{EI}}(x' \,|\, x)\right] h(x')\, dx'. \tag{4}$$

Here, $\alpha_{\text{EI}}(x')$ is the expected improvement at $x'$, $\alpha_{\text{EI}}(x' \,|\, x)$ is the expected improvement at $x'$ given that the objective has been observed at $x$ (but without making any assumptions about the observed value), and $h(x')$ is an arbitrary density over $\mathcal{X}$. The IECI at $x$ is the expected reduction in EI at $x'$, under the density $h$, caused by observing the objective at $x$. Gramacy and Lee use IECI for constrained BO by setting $h(x')$ to the probability of the constraints being satisfied at $x'$. They define the incumbent as the lowest posterior mean for the objective over the whole optimization domain, ignoring the fact that the lowest posterior mean for the objective may be achieved in an infeasible location.

The motivation for IECI is that collecting data at an infeasible location may also provide useful information. EIC strongly discourages this, because Eq. 3 always takes very low values when the constraints are unlikely to be satisfied. This is not the case with IECI because Eq. 4 considers the EI over the whole optimization domain instead of focusing only on the EI at the current evaluation location, which may be infeasible with high probability. Gelbart et al. (2014) compare IECI with EIC for optimizing the hyper-parameters of a topic model with constraints on the entropy of the per-topic word distribution and show that EIC outperforms IECI on this problem.

### 2.3 Expected Volume Minimization

An alternative approach is given by Picheny (2014), who proposes to sequentially explore the location that most decreases the expected volume (EV) of the feasible region below the best feasible objective value found so far. This quantity is computed by integrating the product of the probability of improvement and the probability of feasibility. That is,

$$\alpha_{\text{EV}}(x) = \int p[f(x') \leq \eta]\, h(x')\, dx' - \int p[f(x') \leq \min(\eta, f(x))]\, h(x')\, dx', \tag{5}$$

where, as in IECI, $h(x')$ is the probability that the constraints are satisfied at $x'$. Picheny considers noiseless evaluations for the objective and constraint functions and defines $\eta$ as the best feasible objective value seen so far, or $+\infty$ when no feasible location has been found.

A disadvantage of Picheny’s method is that it requires the integral in Eq. 5 to be computed over the entire search domain $\mathcal{X}$, which is done numerically over a grid on $\mathcal{X}$. The resulting acquisition function must then be globally optimized. This is often performed by first evaluating the acquisition function on a second grid on $\mathcal{X}$. The best point in this second grid is then used as the starting point of a numerical optimizer for the acquisition function. This nesting of grid operations limits the application of this method to problems of small input dimension. This is also the case for IECI, whose acquisition function in Eq. 4 also includes an integral over $\mathcal{X}$. Our method PESC requires a similar integral in the form of an expectation with respect to the posterior distribution of the global feasible minimizer $x^\star$. Nevertheless, this expectation can be efficiently approximated by averaging over samples of $x^\star$ drawn using the approach proposed by Hernández-Lobato et al. (2014). This approach is further described in Section B.3. Note that the integrals in Eq. 5 could in principle also be approximated by using Markov chain Monte Carlo (MCMC) to sample from the unnormalized density $h$. However, this was not proposed by Picheny, who only described the grid-based method.
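The grid-based computation described above can be sketched as follows, assuming Gaussian predictive marginals at each grid point; the grid values are illustrative placeholders, not Picheny's actual implementation:

```python
import math

def std_normal_cdf(z):
    """Standard Gaussian CDF, Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_below(threshold, mu, sigma):
    """p[f(x') <= threshold] when f(x') ~ N(mu, sigma^2)."""
    return std_normal_cdf((threshold - mu) / sigma)

def alpha_ev(f_x, eta, grid_mu, grid_sigma, grid_h):
    """Grid approximation of Eq. 5: the decrease in expected feasible
    volume below the incumbent caused by observing f(x) = f_x.

    grid_mu/grid_sigma are the predictive mean/std at each grid point x',
    and grid_h is the feasibility probability h(x') at each grid point.
    Both integrals are approximated as averages over the same grid.
    """
    before = sum(prob_below(eta, m, s) * h
                 for m, s, h in zip(grid_mu, grid_sigma, grid_h))
    after = sum(prob_below(min(eta, f_x), m, s) * h
                for m, s, h in zip(grid_mu, grid_sigma, grid_h))
    return (before - after) / len(grid_mu)
```

When the candidate value `f_x` does not improve on the incumbent (`f_x >= eta`), the two integrals coincide and the acquisition value is zero, matching Eq. 5.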

### 2.4 Modeling an Augmented Lagrangian

Gramacy et al. (2016) propose to use a combination of EI and the augmented Lagrangian (AL) method: an algorithm which turns an optimization problem with constraints into a sequence of unconstrained optimization problems. Gramacy et al. use BO techniques based on EI to solve the unconstrained inner loop of the AL problem. When $f$ and $c_1, \dots, c_K$ are known, the unconstrained AL objective is defined as

$$L_A(x \,|\, \lambda_1, \dots, \lambda_K, p) = f(x) + \sum_{k=1}^{K} \left[\frac{1}{2p}\min(0, c_k(x))^2 - \lambda_k c_k(x)\right], \tag{6}$$

where $p > 0$ is a penalty parameter and $\lambda_1, \dots, \lambda_K \geq 0$ serve as Lagrange multipliers. The AL method iteratively minimizes Eq. 6 with different values for $p$ and $\lambda_1, \dots, \lambda_K$ at each iteration. Let $x_t$ be the minimizer of Eq. 6 at iteration $t$ using the current parameter values. The multipliers are then updated using the observed constraint values at $x_t$, and the penalty parameter is kept unchanged if $x_t$ is feasible and decreased otherwise. When $f$ and $c_1, \dots, c_K$ are unknown we cannot directly minimize Eq. 6. However, if we have observations for $f$ and $c_1, \dots, c_K$, we can then map such data into observations for the AL objective. Gramacy et al. fit a GP model to the AL observations and then select the next evaluation location using the EI heuristic. After collecting the data, the AL parameters are updated as above using the new values for the constraints and the whole process repeats.
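For reference, Eq. 6 is straightforward to evaluate once $f$ and the $c_k$ are known. A sketch with toy objective and constraint functions (not the benchmark used by Gramacy et al.):

```python
def augmented_lagrangian(x, f, constraints, lambdas, p):
    """AL objective of Eq. 6:
    L_A(x | lambda_1..K, p) = f(x)
        + sum_k [ min(0, c_k(x))^2 / (2p) - lambda_k * c_k(x) ].
    """
    value = f(x)
    for c_k, lam_k in zip(constraints, lambdas):
        ck = c_k(x)
        # Quadratic penalty is active only where the constraint is violated.
        value += min(0.0, ck) ** 2 / (2.0 * p) - lam_k * ck
    return value
```

Note that only infeasible points (where some $c_k(x) < 0$) incur the quadratic penalty, while the linear multiplier term acts everywhere.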

A disadvantage of this approach is that it assumes the constraints are noiseless, to guarantee that the multipliers and the penalty parameter can be correctly updated. Furthermore, Gramacy et al. (2016) focus only on the case in which the objective $f$ is known, although they provide suggestions for extending their method to unknown $f$. In Section 6.3 we show that PESC and EIC perform better than the AL approach on the synthetic benchmark problem considered by Gramacy et al., even when the AL method has access to the true objective function and PESC and EIC do not.

### 2.5 Existing Methods for Decoupled Evaluations

The methods described above can be used to solve constrained BO problems with coupled evaluations. These are problems in which all the functions (objective and constraints) are always evaluated jointly at the same input. Gelbart et al. (2014) consider extending the EIC method from Section 2.1 to the decoupled setting, where the different functions can be independently evaluated at different input locations. However, they identify a problem with EIC in the decoupled scenario. In particular, the EIC utility function requires two conditions to produce positive values. First, the evaluation for the objective must achieve a lower value than the best feasible solution so far and, second, the evaluations for the constraints must produce non-negative values. When we evaluate only one function (objective or constraint), the conjunction of these two conditions cannot be satisfied by a single observation under a myopic search policy. Thus, the new evaluation location can never become the new incumbent and the EIC is zero everywhere. Therefore, standard EIC fails in the decoupled setting.

Gelbart et al. (2014) circumvent the problem mentioned above by treating decoupling as a special case and using a two-stage acquisition function: first, the next evaluation location $x$ is chosen with EIC, and then, given $x$, the task (whether to evaluate the objective or one of the constraints) is chosen according to the expected reduction in the entropy of the global feasible minimizer $x^\star$, where the entropy computation is approximated using Monte Carlo sampling as proposed by Villemonteix et al. (2009). We call the resulting method EIC-D. Note that the two-stage decision process used by EIC-D is sub-optimal; a joint selection of $x$ and the task should produce better results. As discussed in the sections that follow, our method, PESC, does not suffer from this disadvantage and, furthermore, can be extended to a wider range of decoupled problems than EIC-D can.

## 3 Decoupled Function Evaluations and Resource Allocation

We present a framework for describing constrained BO problems. We say that a set of functions (objective or constraints) are coupled when they always require joint evaluation at the same input location. We say that they are decoupled when they can be evaluated independently at different inputs. In practice, a particular problem may exhibit coupled or decoupled functions or a combination of both. An example of a problem with coupled functions is given by a financial simulator that generates many samples from the distribution of possible financial outcomes. If the objective function is the expected profit and the constraint is a maximum tolerable probability of default, then these two functions are computed jointly by the same simulation and are thus coupled to each other. An example of a problem with decoupled functions is the optimization of the predictive accuracy of a neural network speech recognition system subject to prediction time constraints. In this case different neural network architectures may produce different predictive accuracies and different prediction times. Assessing the prediction time may not require training the neural network and could be done using arbitrary network weights. Thus, we can evaluate the timing constraint without evaluating the accuracy objective.

When problems exhibit a combination of coupled and decoupled functions, we can then partition the different functions into subsets of functions that require coupled evaluation. We call these subsets of coupled functions tasks. In the financial simulator example, the objective and the constraint form the only task. In the speech recognition system there are two tasks, one for the objective and one for the constraint. Functions within different tasks are decoupled and can be evaluated independently. These tasks may or may not compete for a limited set of resources. For example, two tasks that both require the performance of expensive computations may have to compete for using a single available CPU. An example with no competition is given by two tasks, one which performs computations in a CPU and another one which performs computations in a GPU. Finally, two competitive tasks may also have different evaluation costs and this should be taken into account when deciding which one is going to be evaluated next.

In the previous section we showed that most existing methods for constrained BO can only address problems with coupled functions. Furthermore, the extension of these methods to the decoupled setting is difficult because most of them are based on the EI heuristic and, as illustrated in Section 2.5, improvement can be impossible with decoupled evaluations. A decoupled problem can, of course, be coupled artificially and then solved as a coupled problem with existing methods. We examine this approach here with a thought experiment and with empirical evaluations in Section 7. Returning to our time-limited speech recognition system, let us consider the cost of evaluating each of the tasks. Evaluating the objective requires training the neural network, which is a very expensive process. On the other hand, evaluating the constraint (run time) requires only timing the predictions made by the neural network, and this could be done without training, using arbitrary network weights. Therefore, evaluating the constraint is in this case much faster than evaluating the objective. In a decoupled framework, one could first measure the run time at several evaluation locations, gaining a sense of the constraint surface. Only then would we incur the significant expense of evaluating the objective task, heavily biasing our search towards locations that are considered to be feasible with high probability. Put another way, artificially coupling the tasks becomes increasingly inefficient as the cost differential increases; for example, one might spend a week examining one aspect of a design that could have been ruled out within seconds by examining another aspect.

In the following sections we present a formalization of constrained Bayesian optimization problems that encompasses all of the cases described above. We then show that our method, PESC (Section 4), is an effective practical solution to these problems because it naturally separates the contributions of the different function evaluations in its acquisition function.

### 3.1 Competitive Versus Non-competitive Decoupling and Parallel BO

We divide the class of problems with decoupled functions into two sub-classes, which we call competitive decoupling (CD) and non-competitive decoupling (NCD). CD is the form of decoupling considered by Gelbart et al. (2014), in which two or more tasks compete for the same resource. This happens when there is only one CPU available and the optimization problem includes two tasks with each of them requiring a CPU to perform some expensive simulations. In contrast, NCD refers to the case in which tasks require the use of different resources and can therefore be evaluated independently, in parallel. This occurs, for example, when one of the two tasks uses a CPU and the other task uses a GPU.

Note that NCD is closely related to parallel Bayesian optimization (see e.g., Ginsbourger et al., 2011; Snoek et al., 2012). In both parallel BO and NCD we perform multiple task evaluations concurrently, where each new evaluation location is selected optimally according to the available data and the locations of all the currently pending evaluations. The difference between parallel BO and NCD is that in NCD the tasks whose evaluations are currently pending may be different from the task that will be evaluated next, while in parallel BO there is only a single task. Parallel BO conveniently fits into the general framework described below.

### 3.2 Formalization of Constrained Bayesian Optimization Problems

We now present a framework for describing constrained BO problems of the form given by Eq. 1. Our framework can be used to represent general problems within any of the categories previously described, including coupled and decoupled functions that may or may not compete for a limited number of resources, each of which may be replicated multiple times. Let $\mathcal{F} = \{f, c_1, \dots, c_K\}$ be the set of functions and let the set of tasks $\mathcal{T}$ be a partition of $\mathcal{F}$ indicating which functions are coupled and must be jointly evaluated. Let $\mathcal{R}$ be the set of resources available to solve this problem. We encode the relationship between tasks and resources with a bipartite graph $G$ with vertices $\mathcal{T} \cup \mathcal{R}$ and edges $\mathcal{E}$, where each edge $(t, r) \in \mathcal{E}$ satisfies $t \in \mathcal{T}$ and $r \in \mathcal{R}$. The interpretation of an edge $(t, r)$ is that task $t$ can be performed on resource $r$. (We do not address the case in which a task requires multiple resources to be executed; we leave this as future work.) We also introduce a capacity $\kappa(r)$ for each resource $r$. The capacity represents how many tasks may be simultaneously executed on resource $r$; for example, if $r$ represents a cluster of CPUs, $\kappa(r)$ would be the number of CPUs in the cluster. Introducing the notion of capacity is simply a matter of convenience, since it is equivalent to setting all capacities to one and replicating each resource node in $G$ according to its capacity.

We can now formally describe problems with coupled evaluations as well as NCD and CD. In particular, coupling occurs when two functions belong to the same task $t$. If this task can be evaluated on multiple resources (or on one resource $r$ with capacity $\kappa(r) > 1$), then this is parallel Bayesian optimization. NCD occurs when two functions belong to different tasks $t_1$ and $t_2$, which themselves require different resources $r_1$ and $r_2$ (that is, $(t_1, r_1) \in \mathcal{E}$, $(t_2, r_2) \in \mathcal{E}$, and $r_1 \neq r_2$). CD occurs when two functions belong to different tasks $t_1$ and $t_2$ (decoupled) that require the same resource $r$ (competitive). These definitions are visually illustrated in Fig. 1. The definitions can be trivially extended to cases with more than two functions. The most general case is an arbitrary task-resource graph encoding a combination of coupling, NCD, CD and parallel Bayesian optimization.
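The taxonomy above can be made concrete with a tiny encoding of the task partition and the bipartite task-resource edge set. The task, function, and resource names below are illustrative; the classification logic simply mirrors the definitions in this section:

```python
def classify(pair_of_funcs, tasks, edges):
    """Classify a pair of functions as 'coupled', 'CD', or 'NCD'.

    tasks: dict mapping task name -> set of function names (a partition).
    edges: set of (task, resource) pairs, the bipartite graph's edge set.
    """
    f1, f2 = pair_of_funcs
    t1 = next(name for name, funcs in tasks.items() if f1 in funcs)
    t2 = next(name for name, funcs in tasks.items() if f2 in funcs)
    if t1 == t2:
        return "coupled"  # same task: always jointly evaluated
    r1 = {r for (t, r) in edges if t == t1}
    r2 = {r for (t, r) in edges if t == t2}
    # Decoupled tasks sharing a usable resource compete for it: CD.
    # Decoupled tasks with disjoint resources run independently: NCD.
    return "CD" if r1 & r2 else "NCD"
```

For instance, with tasks `{"t1": {"f"}, "t2": {"c1"}}` both attached to a single `"cpu"` resource, the pair `("f", "c1")` is classified as CD.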

### 3.3 A General Algorithm for Constrained Bayesian Optimization

In this section we present a general algorithm for solving constrained Bayesian optimization problems specified according to the formalization from the previous section. Our approach relies on an acquisition function that can measure the expected utility of evaluating any arbitrary subset of functions, that is, of any possible task. When an acquisition function satisfies this requirement we say that it is separable. As discussed in Section 4.1, our method PESC has this property, when the functions are modeled as independent. This property makes PESC an effective solution for the practical implementation of our general algorithm. By contrast, the EIC-D method of Gelbart et al. (2014) is not separable and cannot be applied in the general case presented here.

Algorithm 1 provides a general procedure for solving constrained Bayesian optimization problems. The inputs to the algorithm are the set of functions $\mathcal{F}$, the set of tasks $\mathcal{T}$, the set of resources $\mathcal{R}$, the task-resource graph $G$, an acquisition function for each task, that is, $\alpha_t(x)$ for $t \in \mathcal{T}$, the search space $\mathcal{X}$, a Bayesian model $\mathcal{M}$, the initial data set $\mathcal{D}_0$, the resource query functions $\kappa(\cdot)$ and $u(\cdot)$, and a confidence level for making a final recommendation. Recall that $\kappa(r)$ indicates how many tasks can be simultaneously executed on resource $r$. The function $u(r)$ is introduced here to indicate how many tasks are currently being evaluated on resource $r$. The acquisition function $\alpha_t(x)$ measures the utility of evaluating task $t$ at the location $x$. This acquisition function depends on the predictions of the Bayesian model $\mathcal{M}$. The separability property of the original acquisition function guarantees that we can compute an $\alpha_t(x)$ for each $t \in \mathcal{T}$.

Algorithm 1 can solve problems that exhibit any combination of coupling, parallelism, NCD, and CD.

### 3.4 Incorporating Cost Information

Algorithm 1 always selects, among a group of competitive tasks, the one whose evaluation produces the highest utility value. However, other cost factors may render the evaluation of one task more desirable than another. The most salient of these costs is the run time or duration of the task's evaluation, which could depend on the evaluation location x. For example, in the neural network speech recognition system, one of the variables to be optimized may be the number of hidden units in the neural network. In this case, the run time of an evaluation of the predictive accuracy of the system is a function of x, since the training time for the network scales with its size. Snoek et al. (2012) address this issue by automatically measuring the duration of function evaluations. They model the duration as a function of x with an additional Gaussian process (GP). Swersky et al. (2013) extend this concept to multiple optimization tasks, so that an independent GP is used to model the unknown duration of each task. This approach can be applied in Algorithm 1 by penalizing the acquisition function for task t with the expected cost of evaluating that task. In particular, we can change lines 6 and 7 in Algorithm 1 to

where the expected cost associated with the evaluation of task t at x is estimated by a model of the collected cost data. When taking into account task costs modeled by Gaussian processes, the total number of GP models used by Algorithm 1 is equal to the number of functions in the constrained BO problem plus the number of tasks, that is, (K + 1) + |T|. Alternatively, one could fix the cost functions a priori instead of learning them from collected data.
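To make the cost-penalized selection concrete, the sketch below (in Python, with hypothetical names; the paper's actual Algorithm 1 differs in its bookkeeping) picks, among competitive tasks, the (task, location) pair that maximizes expected information gain per unit of expected cost, in the spirit of the expected-improvement-per-second heuristic of Snoek et al. (2012):

```python
import numpy as np

def select_task(candidates, acquisition, expected_cost):
    """Pick the (task, x) pair maximizing information gain per unit cost.

    candidates:    list of (task, x) pairs that can run on a free resource.
    acquisition:   maps (task, x) -> expected information gain alpha_t(x).
    expected_cost: maps (task, x) -> positive expected evaluation cost.
    """
    scores = [acquisition(t, x) / expected_cost(t, x) for t, x in candidates]
    return candidates[int(np.argmax(scores))]

# Toy example: two competitive tasks considered at the same location.
alpha = {("objective", 0.3): 0.9, ("constraint", 0.3): 0.5}
cost = {("objective", 0.3): 10.0, ("constraint", 0.3): 1.0}
best = select_task(list(alpha),
                   lambda t, x: alpha[(t, x)],
                   lambda t, x: cost[(t, x)])
# The cheap constraint task wins despite its lower raw information gain.
```

Dividing by the expected cost is one simple way to implement the penalization described above; other monotone penalizations are possible.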

## 4 Predictive Entropy Search with Constraints (PESC)

To implement Algorithm 1 in practice we need to compute an acquisition function that is separable and can measure the utility of evaluating an arbitrary subset of functions. In this section we describe how to achieve this.

Our acquisition function approximates the expected gain of information about the solution to the constrained optimization problem, which is denoted by x⋆. Importantly, our approximation is additive. For example, let A be a set of functions and let α_A(x) be the amount of information that we approximately gain in expectation by jointly evaluating the functions in A at x. Then α_A(x) = ∑_{a∈A} α_a(x), where α_a(x) is the gain from evaluating function a alone. Although our acquisition function is additive, the exact expected gain of information is not. Additivity is the result of a factorization assumption in our approximation (see Section 4.2 for further details). The good results obtained in our experiments seem to support that this is a reasonable assumption. Because of this additive property, we can compute an acquisition function for any possible subset of f, c_1, …, c_K, using the individual acquisition functions for these functions as building blocks.

We follow MacKay (1992) and measure information about x⋆ by the differential entropy of p(x⋆ | D), where D is the data collected so far. The distribution p(x⋆ | D) is formally defined in the unconstrained case by Hennig and Schuler (2012). In the constrained case, p(x⋆ | D) can be understood as the probability distribution determined by the following sampling process. First, we draw f, c_1, …, c_K from their posterior distributions given D and, second, we minimize the sampled f subject to the sampled c_1, …, c_K being non-negative, that is, we solve Eq. 1 for the sampled functions. The solution to Eq. 1 obtained by this procedure represents a sample from p(x⋆ | D).

We consider first the case in which all the black-box functions are evaluated at the same time (coupled). Let H[x⋆ | D] denote the differential entropy of p(x⋆ | D) and let y_1, …, y_{K+1} denote the noisy measurements obtained by querying the black-boxes for f, c_1, …, c_K at the input location x. We encode these measurements in vector form as y = (y_1, …, y_{K+1})ᵀ. Note that y contains the result of the evaluation of all the functions at x, that is, the objective f and the constraints c_1, …, c_K. We aim to collect data at the location x that maximizes the expected information gain, or the expected reduction in the entropy of x⋆. The corresponding acquisition function is

 α(x) = H[x⋆ | D] − E_{y|D,x}[ H[x⋆ | D ∪ {(x, y)}] ].   (7)

In this expression, H[x⋆ | D ∪ {(x, y)}] is the amount of information on x⋆ that is available once we have collected new data y at the input location x. However, this new y is unknown because it has not been collected yet. To circumvent this problem, we take the expectation of this term with respect to the predictive distribution for y given D and x. This produces an expression that does not depend on y and could in principle be readily computed.

A direct computation of Eq. 7 is challenging because it requires evaluating the entropy of the intractable distribution p(x⋆ | D ∪ {(x, y)}) as different pairs (x, y) are added to the data. To simplify computations, we note that Eq. 7 is the mutual information between x⋆ and y given D and x, which we denote by MI(x⋆; y). The mutual information operator is symmetric, that is, MI(x⋆; y) = MI(y; x⋆). Therefore, we can follow Houlsby et al. (2012) and swap the random variables x⋆ and y in Eq. 7. The result is a reformulation of the original equation that is now expressed in terms of entropies of predictive distributions, which are easier to approximate:

 α(x) = H[y | D, x] − E_{x⋆|D}[ H[y | D, x, x⋆] ].   (8)

This is the same reformulation used by Predictive Entropy Search (PES) (Hernández-Lobato et al., 2014) for unconstrained Bayesian optimization, but extended to the case where y is a vector rather than a scalar. Since we focus on constrained optimization problems, we call our method Predictive Entropy Search with Constraints (PESC). Eq. 8 is used by PESC to efficiently solve constrained Bayesian optimization problems with decoupled function evaluations. In the following section we describe how to obtain a computationally efficient approximation to Eq. 8. We also show that the resulting approximation is separable.
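The symmetry of mutual information that licenses the swap from Eq. 7 to Eq. 8 can be verified numerically on a small discrete joint distribution (a toy check of the identity, not part of PESC itself):

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a discrete distribution (in nats)."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# A random joint distribution p(x_star, y) on a 3 x 4 grid.
rng = np.random.default_rng(0)
joint = rng.random((3, 4))
joint /= joint.sum()

px = joint.sum(axis=1)  # marginal p(x_star)
py = joint.sum(axis=0)  # marginal p(y)

# Eq. 7 form: H[x_star] - E_y[ H[x_star | y] ]
mi_1 = entropy(px) - sum(py[j] * entropy(joint[:, j] / py[j]) for j in range(4))
# Eq. 8 form: H[y] - E_{x_star}[ H[y | x_star] ]
mi_2 = entropy(py) - sum(px[i] * entropy(joint[i, :] / px[i]) for i in range(3))

assert np.isclose(mi_1, mi_2)  # both forms compute the same mutual information
```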

### 4.1 The PESC Acquisition Function

We assume that the functions f, c_1, …, c_K are independent samples from Gaussian process (GP) priors and that the noisy measurements returned by the black-boxes are obtained by adding Gaussian noise to the noise-free function evaluations at x. Under this Bayesian model for the data, the first term in Eq. 8 can be computed exactly. In particular,

 H[y | D, x] = ∑_{i=1}^{K+1} ½ log σ²_i(x) + ((K+1)/2) log(2πe),   (9)

where σ²_i(x) is the predictive variance for y_i at x and y_i is the i-th entry in y. To obtain this formula we have used the fact that y_1, …, y_{K+1} are generated independently, so that H[y | D, x] = ∑_{i=1}^{K+1} H[y_i | D, x], and that each p(y_i | D, x) is Gaussian with variance parameter σ²_i(x) given by the GP predictive variance (Rasmussen and Williams, 2006):

 σ²_i(x) = k_i(x, x) − k_i(x)ᵀ K_i⁻¹ k_i(x) + ν_i,   i = 1, …, K+1,   (10)

where ν_i is the variance of the additive Gaussian noise in the i-th black-box, with f being evaluated by the first black-box and c_K by the last one. The scalar k_i(x, x) is the prior variance of the noise-free black-box evaluations at x. The vector k_i(x) contains the prior covariances between the black-box values at x and at those locations for which data from the i-th black-box is available. Finally, K_i is a matrix with the prior covariances for the noise-free black-box evaluations at those locations for which data is available.
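Eq. 9 and Eq. 10 are cheap to evaluate under the independent-GP model. The following sketch (our own minimal implementation, assuming a squared-exponential kernel; note that the Gaussian entropy depends only on the predictive variances, not the means) computes the per-function predictive variances and the joint predictive entropy:

```python
import numpy as np

def sq_exp_kernel(a, b, length_scale=0.5, amplitude=1.0):
    # Squared-exponential covariance between two sets of 1-D inputs.
    d = a[:, None] - b[None, :]
    return amplitude * np.exp(-0.5 * (d / length_scale) ** 2)

def predictive_variance(x, X_obs, noise):
    # Eq. 10: sigma_i^2(x) = k_i(x, x) - k_i(x)^T K_i^{-1} k_i(x) + nu_i.
    K = sq_exp_kernel(X_obs, X_obs) + noise * np.eye(len(X_obs))
    k_x = sq_exp_kernel(X_obs, np.array([x]))[:, 0]
    prior_var = sq_exp_kernel(np.array([x]), np.array([x]))[0, 0]
    return prior_var - k_x @ np.linalg.solve(K, k_x) + noise

def predictive_entropy(x, observed_inputs, noises):
    # Eq. 9: entropy of the joint Gaussian predictive distribution, which
    # factorizes because the K+1 black-boxes are modeled independently.
    total = sum(0.5 * np.log(predictive_variance(x, X_obs, nu))
                for X_obs, nu in zip(observed_inputs, noises))
    return total + 0.5 * len(noises) * np.log(2 * np.pi * np.e)

# One objective and one constraint, each with its own observations and noise.
obs = [np.array([0.1, 0.5, 0.9]), np.array([0.2, 0.8])]
H = predictive_entropy(0.4, obs, noises=[1e-4, 1e-4])
```

As expected, the predictive variance shrinks near observed inputs and reverts to the prior variance far from the data.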

The second term in Eq. 8, that is, E_{x⋆|D}[H[y | D, x, x⋆]], cannot be computed exactly and needs to be approximated. We do this in three steps. Step 1: the expectation with respect to x⋆ is approximated with an empirical average over M samples drawn from p(x⋆ | D). These samples are generated by following the approach proposed by Hernández-Lobato et al. (2014) for sampling x⋆ in the unconstrained case. We draw approximate posterior samples of f, c_1, …, c_K, as described by Hernández-Lobato et al. (2014, Appendix A), and then solve Eq. 1 to obtain x⋆ given the sampled functions. More details can be found in Section B.3 of this document. Note that this approach only applies to stationary kernels, but this class includes popular choices such as the squared exponential and Matérn kernels. Step 2: we assume that the components of y are independent given D, x and x⋆, that is, we assume that the evaluations of f, c_1, …, c_K at x are independent given D and x⋆. This factorization assumption guarantees that the acquisition function used by PESC is additive across the different functions that are being evaluated. Step 3: let x⋆^{(j)} be the j-th sample from p(x⋆ | D). We find a Gaussian approximation to each p(y_i | D, x, x⋆^{(j)}) using expectation propagation (EP) (Minka, 2001a). Let σ²_i(x | x⋆^{(j)}) be the variance of the Gaussian approximation to p(y_i | D, x, x⋆^{(j)}) given by EP. Then, we obtain

 E_{x⋆|D}[ H[y | D, x, x⋆] ]
   ≈ (1/M) ∑_{j=1}^{M} H[y | D, x, x⋆^{(j)}]   (step 1)
   ≈ (1/M) ∑_{j=1}^{M} ∑_{i=1}^{K+1} H[y_i | D, x, x⋆^{(j)}]   (step 2)
   ≈ ∑_{i=1}^{K+1} { (1/M) ∑_{j=1}^{M} ½ log σ²_i(x | x⋆^{(j)}) } + ((K+1)/2) log(2πe),   (step 3)   (11)

where each of the approximations has been labeled with the corresponding step from the description above. Note that in step 3 of Eq. 11 we have swapped the sums over i and j.

The acquisition function used by PESC is then given by the difference between Eq. 9 and the approximation shown in the last line of Eq. 11. In particular, we obtain

 α_PESC(x) = ∑_{i=1}^{K+1} α̃_i(x),   (12)

where

 α̃_i(x) = (1/M) ∑_{j=1}^{M} α̃_i(x | x⋆^{(j)}),  with  α̃_i(x | x⋆^{(j)}) = ½ log σ²_i(x) − ½ log σ²_i(x | x⋆^{(j)}),   i = 1, …, K+1.   (13)

Interestingly, the factorization assumption that we made in step 2 of Eq. 11 has produced an acquisition function in Eq. 12 that is the sum of function-specific acquisition functions, given by the α̃_i(x) in Eq. 13. Each α̃_i(x) measures how much information we gain on average by evaluating only the i-th black-box, where the first black-box evaluates f and the last one evaluates c_K. Furthermore, each α̃_i(x) is the empirical average of α̃_i(x | x⋆^{(j)}) across the M samples from p(x⋆ | D). Therefore, we can interpret each α̃_i(x | x⋆^{(j)}) in Eq. 13 as a function-specific acquisition function conditioned on x⋆ = x⋆^{(j)}. Crucially, by using bits of information about the minimizer as a common unit of measurement, our acquisition function can make meaningful comparisons between the usefulness of evaluating the objective and the constraints.
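In code, Eq. 12 and Eq. 13 reduce to simple averages of log-variance differences. The sketch below takes the EP-conditioned variances as given inputs (computing them is the subject of Section 4.2); the function names and the numeric values are ours, for illustration only:

```python
import numpy as np

def pesc_acquisition_i(var_uncond, var_cond):
    """Eq. 13: average entropy reduction for black-box i over M samples of x_star.

    var_uncond: sigma_i^2(x), the predictive variance without conditioning.
    var_cond:   array of sigma_i^2(x | x_star^(j)) for j = 1..M (e.g. from EP).
    """
    var_cond = np.asarray(var_cond)
    return np.mean(0.5 * np.log(var_uncond) - 0.5 * np.log(var_cond))

# Conditioning on x_star shrinks the (approximate) predictive variance, so
# each term, and hence the average, is non-negative.
alpha_obj = pesc_acquisition_i(1.0, [0.7, 0.9, 0.5])   # objective black-box
alpha_con = pesc_acquisition_i(0.8, [0.6, 0.7, 0.8])   # constraint black-box

# Eq. 12: the full PESC acquisition is the sum over the K+1 black-boxes.
alpha_pesc = alpha_obj + alpha_con
```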

We now show how PESC can be used to obtain the task-specific acquisition functions required by the general algorithm from Section 3.3. Let us assume that we plan to evaluate only a subset of the functions f, c_1, …, c_K, and let t ⊆ {1, …, K+1} contain the indices of the functions to be evaluated, where the first function is f and the last one is c_K. We assume that the functions indexed by t are coupled and require joint evaluation. In this case t encodes a task according to the definition from Section 3.2. We can then approximate the expected gain of information that is obtained by evaluating this task at input x. The process is similar to the one used above when all the black-boxes are evaluated at the same time. However, instead of working with the full vector y, we now work with the components of y indexed by t. One can then show that the expected information gain obtained after evaluating task t at input x can be approximated as

 α_t(x) = ∑_{i∈t} α̃_i(x),   (14)

where the α̃_i(x) are given by Eq. 13. PESC's acquisition function is therefore separable, since Eq. 14 can be used to obtain an acquisition function for each possible task. The process for constructing these task-specific acquisition functions is also efficient, since it only requires using the individual acquisition functions from Eq. 13 as building blocks. These two properties make PESC an effective solution for the practical implementation of the general algorithm from Section 3.3.
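Separability then amounts to summing precomputed per-function terms over the indices in a task. A minimal illustration with made-up values:

```python
# Per-function acquisition values alpha_tilde_i(x) at a candidate x,
# indexed 0 for the objective f and 1..K for the constraints c_1..c_K.
alpha_tilde = [0.40, 0.10, 0.25]

def task_acquisition(task):
    """Eq. 14: alpha_t(x) = sum of alpha_tilde_i(x) for i in task t."""
    return sum(alpha_tilde[i] for i in task)

coupled = task_acquisition({0, 1, 2})       # all functions evaluated jointly
objective_only = task_acquisition({0})      # decoupled: objective task
constraints_only = task_acquisition({1, 2}) # decoupled: constraint task

# The coupled acquisition decomposes exactly into the decoupled pieces,
# so the same building blocks serve every possible task.
assert abs(coupled - (objective_only + constraints_only)) < 1e-12
```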

Fig. 2 illustrates with a toy example the process for computing the function-specific acquisition functions from Eq. 13. In this example there is only one constraint function. Therefore, the functions in the optimization problem are only f and c_1. The search space X is the unit interval and we have collected four measurements for each function. The data for f are shown as black points in panels A, B and C. The data for c_1 are shown as black points in panels F, G and H. We assume that f and c_1 are independently sampled from a GP with zero mean function and squared exponential covariance function with unit amplitude and a fixed length-scale. The noise variance for the black-boxes that evaluate f and c_1 is zero. Let y_1 and y_2 be the black-box evaluations for f and c_1 at input x. Under the assumed GP model we can analytically compute the predictive distributions for y_1 and y_2, that is, p(y_1 | D, x) and p(y_2 | D, x). Panels B and G show the means of these distributions with confidence bands equal to one standard deviation. The first step to compute the α̃_i(x) from Eq. 13 is to draw M samples from p(x⋆ | D). To generate each of these samples, we first approximately sample f and c_1 from their posterior distributions using the method described by Hernández-Lobato et al. (2014, Appendix A). Panels A and F show one of the samples obtained for f and c_1, respectively. We then solve the optimization problem given by Eq. 1 when f and c_1 are known and equal to the samples obtained. The solution to this problem is the input that minimizes the sampled f subject to the sampled c_1 being non-negative. This produces a sample x⋆^{(j)} from p(x⋆ | D), which is shown as a discontinuous vertical line with a red triangle in all the panels. The next step is to find a Gaussian approximation to the predictive distributions when we condition on x⋆ = x⋆^{(j)}, that is, to p(y_1 | D, x, x⋆^{(j)}) and p(y_2 | D, x, x⋆^{(j)}). This step is performed using expectation propagation (EP) as described in Section 4.2 and Appendix A. Panels C and H show the approximations produced by EP for f and c_1, respectively. Panel C shows that conditioning on x⋆^{(j)} decreases the posterior mean of f in the neighborhood of x⋆^{(j)}. The reason for this is that x⋆^{(j)} must be the global feasible solution, which means that f(x⋆^{(j)}) must be lower than the objective value at any other feasible point. Panel H shows that conditioning on x⋆^{(j)} increases the posterior mean of c_1 in the neighborhood of x⋆^{(j)}. The reason for this is that c_1(x⋆^{(j)}) must be non-negative because x⋆^{(j)} has to be feasible. In particular, by conditioning on x⋆^{(j)} we are giving zero probability to all c_1 such that c_1(x⋆^{(j)}) < 0. Let σ²_1(x | x⋆^{(j)}) and σ²_2(x | x⋆^{(j)}) be the variances of the Gaussian approximations to p(y_1 | D, x, x⋆^{(j)}) and p(y_2 | D, x, x⋆^{(j)}), and let σ²_1(x) and σ²_2(x) be the variances of p(y_1 | D, x) and p(y_2 | D, x). We use these quantities to obtain α̃_1(x | x⋆^{(j)}) and α̃_2(x | x⋆^{(j)}) according to Eq. 13. These two functions are shown in panels D and I. The whole process is repeated M times and the resulting α̃_1(x | x⋆^{(j)}) and α̃_2(x | x⋆^{(j)}), j = 1, …, M, are averaged according to Eq. 13 to obtain the function-specific acquisition functions α̃_1(x) and α̃_2(x), whose plots are shown in panels E and J. These plots indicate that evaluating the objective f is in this case more informative than evaluating the constraint c_1. But this is certainly not always the case, as will be demonstrated in the experiments later on.

### 4.2 How to Compute the Gaussian Approximation to p(y_i | D, x, x⋆^{(j)})

We briefly describe the process followed to find a Gaussian approximation to p(y_i | D, x, x⋆^{(j)}) using expectation propagation (EP) (Minka, 2001b). Recall that the variance of this approximation, that is, σ²_i(x | x⋆^{(j)}), is used to compute α̃_i(x | x⋆^{(j)}) in Eq. 13. Here we only provide a sketch of the process; full details can be found in Appendix A.

We start by assuming that the search space X has finite size. In this case the functions f, c_1, …, c_K are encoded as finite dimensional vectors whose entries are the result of evaluating f, c_1, …, c_K at the elements of X. Let us assume that x and x⋆^{(j)} are in X. Then p(y | D, x, x⋆^{(j)}) can be defined by the following rejection sampling process. First, we sample f, c_1, …, c_K from their posterior distribution given the assumed GP models. We then solve the optimization problem given by Eq. 1. For this, we find the entry of the sampled f with lowest value subject to the corresponding entries of the sampled c_1, …, c_K being non-negative. If the selected entry does not correspond to x⋆^{(j)}, we reject the sampled f, c_1, …, c_K and start again. Otherwise, we take the entries of f, c_1, …, c_K corresponding to x, that is, f(x), c_1(x), …, c_K(x), and obtain y by adding to each of these values a Gaussian random variable with zero mean and variance ν_1, …, ν_{K+1}, respectively. The probability distribution implied by this rejection sampling process can be obtained by first multiplying the posterior for f, c_1, …, c_K with indicator functions that take value zero when f, c_1, …, c_K should be rejected and one otherwise. We can then multiply the resulting quantity by the likelihood for y given f(x), c_1(x), …, c_K(x). The desired distribution is finally obtained by marginalizing out f, c_1, …, c_K.
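On a finite search space, the first part of this process (sampling the latent functions and finding the feasible minimizer, i.e. drawing from p(x⋆ | D)) can be sketched directly. The code below uses exact multivariate-normal posterior draws on a small grid rather than the approximate feature-space samples used in the paper, and omits the rejection step, which is only needed when conditioning on a fixed x⋆^{(j)}; all names and numbers are illustrative:

```python
import numpy as np

def sample_x_star(rng, mean_f, cov_f, mean_c, cov_c, n_samples=50):
    """Draw samples of x_star on a finite grid: minimize the sampled f
    subject to the sampled constraint c being non-negative (Eq. 1)."""
    samples = []
    for _ in range(n_samples):
        f = rng.multivariate_normal(mean_f, cov_f)
        c = rng.multivariate_normal(mean_c, cov_c)
        feasible = np.where(c >= 0)[0]
        if feasible.size == 0:
            continue  # no feasible point in this draw; skip it
        samples.append(int(feasible[np.argmin(f[feasible])]))
    return samples

rng = np.random.default_rng(1)
mean_f = np.array([0.0, -1.0, 2.0, 0.5, 1.0])  # posterior mean of f on the grid
mean_c = np.array([1.0, -3.0, 1.0, 1.0, 1.0])  # index 1 is almost surely infeasible
cov = 0.01 * np.eye(5)
draws = sample_x_star(rng, mean_f, cov, mean_c, cov)
# Index 1 has the lowest objective mean but is (almost surely) infeasible,
# so the sampled minimizers concentrate on index 0 instead.
```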

We introduce several indicator functions to implement the approach described above. The first one, Γ(x), takes value one when x is a feasible solution and value zero otherwise, that is,

 Γ(x) = ∏_{k=1}^{K} Θ[c_k(x)],   (15)

where Θ is the Heaviside step function, which is equal to one if its input is non-negative and zero otherwise. The second indicator function, Ψ(x′), takes value zero if x′ is a better solution than x⋆^{(j)} according to the sampled functions. Otherwise Ψ(x′) takes value one. In particular,

 Ψ(x′) = Γ(x′) Θ[f(x′) − f(x⋆^{(j)})] + (1 − Γ(x′)).   (16)

When x′ is infeasible, this expression takes value one. In this case, x′ is not a better solution than x⋆^{(j)} (because x′ is infeasible) and we do not have to reject. When x′ is feasible, the factor Θ[f(x′) − f(x⋆^{(j)})] in Eq. 16 is zero when x′ takes a lower objective value than x⋆^{(j)}. This will allow us to reject f, c_1, …, c_K when x′ is a better solution than x⋆^{(j)}. Using Eq. 15 and Eq. 16, we can then write p(y | D, x, x⋆^{(j)}) as

 p(y | D, x, x⋆^{(j)}) ∝ ∫ p(y | f, c_1, …, c_K, x) p(f, c_1, …, c_K | D) Γ(x⋆^{(j)}) { ∏_{x′∈X} Ψ(x′) } df dc_1 ⋯ dc_K,   (17)

where p(f, c_1, …, c_K | D) is the GP posterior distribution for the noise-free evaluations of f, c_1, …, c_K at the locations in X, and p(y | f, c_1, …, c_K, x) is the likelihood function, that is, the distribution of the noisy evaluations produced by the black-boxes with input x given the true function values:

 p(y | f, c_1, …, c_K, x) = N(y_1 | f(x), ν_1) N(y_2 | c_1(x), ν_2) ⋯ N(y_{K+1} | c_K(x), ν_{K+1}).   (18)

The product of the indicator functions Γ and Ψ in Eq. 17 takes value zero whenever x⋆^{(j)} is not the best feasible solution according to f, c_1, …, c_K. The indicator Γ(x⋆^{(j)}) in Eq. 17 guarantees that x⋆^{(j)} is a feasible location. The product of all the Ψ(x′) in Eq. 17 guarantees that no other point in X is better than x⋆^{(j)}. Therefore, the product of Γ(x⋆^{(j)}) and the Ψ(x′) in Eq. 17 rejects any values of f, c_1, …, c_K for which x⋆^{(j)} is not the optimal solution to the constrained optimization problem.

The factors p(y | f, c_1, …, c_K, x) and p(f, c_1, …, c_K | D) in Eq. 17 are Gaussian. Thus, their product is also Gaussian and tractable. However, the integral in Eq. 17 does not have a closed form solution because of the complexity introduced by the product of the indicator functions Γ and Ψ. This means that Eq. 17 cannot be computed exactly and has to be approximated. For this, we use EP to approximate the product of p(f, c_1, …, c_K | D) and the indicator functions Γ and Ψ in Eq. 17 with a tractable Gaussian distribution given by

 q(f, c_1, …, c_K | x⋆^{(j)}) = N(f | m_1, V_1) N(c_1 | m_2, V_2) ⋯ N(c_K | m_{K+1}, V_{K+1}),   (19)

where m_1, …, m_{K+1} and V_1, …, V_{K+1} are mean vectors and covariance matrices to be determined by the execution of EP. Let v_i(x) be the diagonal entry of V_i corresponding to the evaluation location x, and let m_i(x) be the entry of m_i corresponding to the evaluation location x, for i = 1, …, K+1. Then, by replacing the product of p(f, c_1, …, c_K | D) and the indicator functions in Eq. 17 with q(f, c_1, …, c_K | x⋆^{(j)}), we obtain

 p(y | D, x, x⋆^{(j)}) ≈ ∏_{i=1}^{K+1} N(y_i | m_i(x), v_i(x) + ν_i).