 # Warm Starting Bayesian Optimization



## Abstract

We develop a framework for warm-starting Bayesian optimization that reduces the time required to solve an optimization problem that is one of a sequence of related problems. This is useful when optimizing the output of a stochastic simulator that provides no derivative information, a setting for which Bayesian optimization methods are well suited. Sequences of related optimization problems arise when making repeated business decisions using one optimization model and input data collected over different time periods or markets. While many gradient-based methods can be warm started by initiating optimization at the solution to the previous problem, this approach does not apply to Bayesian optimization methods, which carry a full metamodel of the objective function from iteration to iteration. Our approach instead builds a joint statistical model of the entire collection of related objective functions, and uses a value of information calculation to recommend points to evaluate.

## 1 Introduction

Businesses that use optimization to make operational decisions often solve collections of related optimization problems. Consider three examples:

• First, consider a ride-sharing company, such as Uber or Lyft, that makes real-time decisions about how to assign requests for rides to drivers who may fulfill those requests. Such a company might have a single algorithm for making these decisions, whose performance depends crucially on tunable parameters. Given a distributional forecast of demand patterns for the next time period, and a simulator to estimate the performance of the algorithm under this forecast, the company might wish to optimize the parameters every few hours to find a configuration that best achieves its goals in that time interval. Such distributional forecasts are likely to change across hours within a day and across days within a week due to seasonality, and also vary from week to week as they incorporate more recent historical data reflecting changing weather, sporting events, etc.

• Second, consider a company managing inventory: to avoid overstocking or stockouts, an inventory management system has to decide what quantity of each item to hold in stock and how to set the threshold levels at which to order supplies. To maximize profit, the system must take into account not only current market prices but also holding costs, delivery times affecting the lead times, and patterns in demand. Thus, the company would re-optimize the parameters of its inventory management system on a regular schedule and whenever any of the key factors change.

• Third, consider a business that relies on machine learning models to predict the behavior of its clients. Suppose that such a model is used to predict “click-rates” for ads displayed on a collection of webpages. These predictions are subsequently used to decide which ad is displayed to each visitor, based on the user’s profile but also on the targets agreed on in the advertising campaigns. To obtain the most accurate predictions, the company would want to retrain its model frequently on the latest historical data; some of the model’s parameters typically have to be fit by a derivative-free method.

The common characteristic of all these problems is that very similar problems must be solved over and over again. To squeeze the most out of the real-time availability of data, we wish to “shortcut” the optimization process, thereby saving time and money and, most importantly, gaining an edge over competitors, without negatively affecting the quality of solutions. One way to avoid a “cold start”, i.e., starting from scratch each time, is to exploit knowledge obtained while previously tackling related problems. This knowledge comprises not only the final solutions that were actually used in the application, but also all the points evaluated during the respective optimization processes.

The idea of using previous solutions to speed up the optimization process is ubiquitous in mathematical programming: in particular, the number of iterations that interior-point methods for linear programming require to obtain an almost optimal solution can be reduced by a warm start, e.g., see [YW02, GW04, JY08]. Typically this is achieved simply by using the solution to a previous optimization problem as the starting point when solving the current problem. The same technique can be used when applying stochastic gradient ascent to optimization via simulation problems whose objective function evaluations provide stochastic gradients. However, when evaluations of the objective function are derivative-free, as they commonly are when optimizing complex simulators, many of the optimization methods that are most effective for solving a single problem instance lack the structure that would permit warm starting them in this way.

In this article, we create the ability to warm start Bayesian optimization methods. Bayesian optimization methods are effective for solving optimization via simulation problems in which the objective does not provide gradients, is expensive to evaluate, and takes a small number (< 20) of real- or integer-valued inputs. For an overview, see the Bayesian optimization tutorials and surveys [FW16, BCd09, Kle14]. Bayesian optimization methods belong to a wider set of optimization via simulation methods that leverage Bayesian statistics and value of information calculations; see, e.g., the surveys [Chi00, Fra11].

Bayesian optimization methods iteratively construct a Gaussian process metamodel of the objective function and use this metamodel to decide where to sample next. Because the information carried from iteration to iteration is not just a single point but the full metamodel, simply carrying over the solution from a previous problem as a starting point does not provide a warm starting capability for Bayesian optimization, necessitating a more sophisticated approach. While one can partially warm start Bayesian optimization by evaluating the current objective at the solutions to previous problems, this fails to take full advantage of the set of previous evaluations at non-optimal points.

In this article, we propose a Bayesian framework that captures the observed relationship of the current black-box function to previous ones. This enables us to perform a “warm start” and hence to zoom in on an optimal solution faster. Our approach puts an emphasis on conceptual simplicity and wide applicability. In particular, we model the tasks and their relationships by a Gaussian process whose covariance structure is tailored to capture the differences between the current task and the previous ones.

To optimize the current objective, we iteratively sample the search space, using the Knowledge Gradient acquisition criterion proposed by Frazier, Powell, and Dayanik [FPD09]; besides its one-step Bayes optimality, this criterion has the advantage that it has a simple form and can be computed efficiently. However, our statistical model is independent of the acquisition function and can be combined with others if that seems beneficial for the application under consideration.

Related Literature. The previous work most directly relevant to warm starting Bayesian optimization and optimization of expensive functions more generally comes from the machine learning literature, where hyper-parameters in machine learning models are fit by solving optimization problems with expensive derivative-free evaluations.

Within this literature, the most relevant previous work is due to Swersky, Snoek, and Adams [SSA13], who developed a warm start algorithm to tune hyper-parameters of machine learning models using Bayesian optimization. They demonstrate that their algorithm can speed up the tuning of hyper-parameters of convolutional neural networks and of logistic regression for image classification. Their approach is similar to ours, as they also construct a Gaussian process model to quantify the relationships between all previous instances and then use an acquisition function, in their case Entropy Search, to optimize the hyper-parameters for the new instance.

We believe that our covariance kernel has the following advantages. Their model assumes the covariance of each pair of tasks to be uniform over the whole domain, and moreover requires that all tasks be positively correlated. Our covariance function does not impose these restrictions and in particular allows the relationship between tasks to vary over the domain. Additionally, we quantify biases between the objective functions of different tasks. Finally, inferring the parameters of their kernel requires an expensive sampling-based step involving slice sampling, whereas our approach performs inference in closed form.

In other related work from machine learning, Feurer, Springenberg, and Hutter [FSH14] proposed a meta-learning procedure that improves the performance of a machine learning algorithm by selecting a small set of start-up configurations. The selection procedure explores the new task and then ranks previous tasks (and their corresponding optimal solutions) based on a metric that relies on an evaluation of the “metafeatures” that capture the characteristics of each dataset. The authors demonstrate that their approach can speed up the process of optimizing hyper-parameters for support vector machine and algorithm selection models. This approach differs from ours in three ways: first, the metafeatures and distance functions it develops are specific to machine learning, and extending the approach in a generic way to the optimization via simulation problems encountered in operations research would require substantial effort; second, it uses only the final solution from each previous problem, while our approach uses the full set of function evaluations; third, it requires that solutions from previous problems be evaluated under the current objective before they provide useful information, while our approach utilizes this information directly without requiring additional expensive evaluations.

Finally, in the aerospace engineering literature, Gano, Renaud, and Sanders [GRS04] studied the optimization of expensive-to-evaluate functions in a multi-fidelity setting and showed that warm starting optimization of an expensive high-fidelity objective using inexpensive evaluations of a low-fidelity model can reduce the number of high-fidelity queries required. This previous work is focused on engineering design problems in which computational codes of various expense and fidelities all model the same objective, while we focus on a sequence of optimization problems with different objectives that are not ordered by fidelity and do not necessarily differ in computational cost.

Outline of the Paper. The problem is stated formally in Sect. 2. The warm start algorithm is described in Sect. 3. The results of an experimental evaluation of the algorithm are presented in Sect. 4.

## 2 Problem Definition

We give a formal exposition of the setup. We indicate the current problem to be solved by 0, and the tasks that have been solved previously by ℓ = 1, 2, …, M, where M is the number of previously solved problems. We also refer to these optimization problems as “instances” or “tasks” in the sequel.

We suppose that evaluating task ℓ ∈ {0, 1, …, M} at input parameter x yields independently and normally distributed observations with mean f(ℓ, x) and variance λℓ(x); in particular, f(0, x) denotes the objective value of x for the current task. We also suppose that accompanying this evaluation is an estimate of λℓ(x). This assumption of normally distributed noise accompanied by an estimate of the noise’s variance is approximately valid when each evaluation is actually an average of many independent and identically distributed observations, whose sample variance provides an estimate of λℓ(x).
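This sample-mean construction can be sketched as follows; the quadratic simulator `sim`, its noise level, and the replication count are hypothetical stand-ins for illustration, not quantities from the paper.

```python
import random
import statistics

def noisy_evaluate(objective, x, n_replications=200):
    """Average i.i.d. simulator replications at x.

    Returns the sample mean (the noisy observation) and the sample
    variance of that mean, i.e. the estimate of the noise variance
    that accompanies each evaluation.
    """
    samples = [objective(x) for _ in range(n_replications)]
    mean = statistics.fmean(samples)
    var_of_mean = statistics.variance(samples) / n_replications
    return mean, var_of_mean

# Hypothetical simulator: a quadratic objective plus Gaussian noise.
sim = lambda x: -(x - 1.0) ** 2 + random.gauss(0.0, 0.5)
y, lam = noisy_evaluate(sim, 0.5)
```

By the central limit theorem the returned mean is approximately normally distributed around the true objective value, which is exactly the assumption made above.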

We indicate the current optimization problem’s feasible set by D. This feasible set may be shared with previous problems, or previous problems may have different feasible sets. We assume that D is a compact subset of ℝᵈ, and that membership in this set can be determined simply, without evaluating an expensive function. For example, D might be a hyper-rectangle or a simplex.

Our goal is to optimize the current task, i.e., to find a point in argmax_{x∈D} f(0, x). To support us in this task we have evaluations of previous tasks, in the form of a collection of tuples (ℓ, x, y, λ), where y is the sum of f(ℓ, x) and independent normally distributed noise with variance λ. Typically these would be obtained from runs of the optimization algorithm we propose, or of some other optimization algorithm, when solving these previous tasks. Evaluations obtained from other simulation studies can also be included.

During optimization for the current task, we assume that previous tasks cannot be evaluated. We make this assumption because maintaining the ability to evaluate previous tasks may present greater practical engineering challenges than simply storing previous evaluations, if previous tasks used different versions of the simulation software, or required access to large historical datasets. In contrast, under the assumptions made below, previous evaluations do not consume a great deal of memory, and can be stored simply in a flat file or database.

We make a few additional assumptions that should be met for our proposed methodology to be useful in comparison to other methods.

1. First, we suppose that evaluations of f(0, ·) do not provide first- or second-order derivative information. If derivative information is available, it would be more productive to instead use a gradient-based method.

2. Second, we suppose that evaluations are expensive to obtain, limiting their number to at most a few hundred per problem. When the number of possible evaluations grows substantially larger, the computational expense incurred by our method grows, decreasing its utility.

3. Third, we will model both the current problem’s objective and the collection of all objectives with Gaussian processes, and so we assume that these objectives are amenable to Gaussian process modeling. While Gaussian processes are flexible and can handle a wide variety of objective functions, they require continuity of f(ℓ, ·) for each ℓ to work well.

4. Fourth, while our mathematical approach does not restrict the dimension d, our numerical experiments all assume that d ≤ 8. Other work on Bayesian optimization suggests that performance degrades on high-dimensional problems, suggesting that the method we develop will provide value only for problems with d < 20.

## 3 Warm Start Algorithm

The key idea of our approach is to use samples of the objectives from previous optimization problems to learn about the current optimization problem’s objective. To accomplish this, we use a Bayesian prior probability distribution that models the relationship between the past objectives f(ℓ, ·), ℓ = 1, …, M, and the current objective f(0, ·). Thereby we begin our optimization of f(0, ·) with substantial information obtained from previous tasks, allowing us to find one of its optima faster.

More specifically, we place a Gaussian process prior on f, which models the relationships between the tasks. We emphasize that this prior is not just over the single objective f(0, ·), as is typical in Bayesian optimization, but over the whole collection f(ℓ, ·), ℓ = 0, 1, …, M. Modeling the objective jointly across tasks allows our approach to learn about the current task from previous ones. This is described in Sect. 3.1.

The Gaussian process prior utilizes a parametrized class of covariance functions; we describe in Sect. 3.2 how the parameters of the covariance function can be obtained in our problem setting. These parameters describe fundamental characteristics of the instances and are therefore typically robust. We suggest optimizing them periodically rather than for every new task.

Then, using this Gaussian process prior over the collection of tasks, we apply Gaussian process regression to update our belief on f(0, ·) based on samples obtained from previous tasks. Although this application of Gaussian process regression is relatively straightforward, for completeness we provide a detailed description in Sect. A, showing how to calculate the posterior probability distribution on f given the prior from Sect. 3.1 and the samples from previous tasks.

Finally, starting from this Gaussian process posterior over the current task’s objective f(0, ·), we use samples of f(0, ·) to localize the optimum. Our algorithm chooses points to sample by iteratively solving a sub-optimization problem: we find the point to evaluate that maximizes an acquisition function, sample this point, use it to update the posterior, and repeat until our budget of samples is exhausted. We choose as our acquisition function the Knowledge Gradient (KG) proposed by Frazier, Powell, and Dayanik [FPD09]; a succinct exposition is given in Sect. 3.3.

#### The Algorithm wsKG.

The proposed algorithm for warm starting Bayesian optimization proceeds as follows:

1. Using samples of previous tasks, and the prior distribution described in Sect. 3.1:

1. Estimate hyper-parameters of the Gaussian process prior as described in Sect. 3.2.

2. Use the prior with the estimated hyper-parameters and the samples on previous tasks to calculate a posterior on f, as described in Sect. A.

2. Until the budget of samples on the current task is exhausted:

1. Choose the point at which to sample next by maximizing the KG factor, as described in Sect. 3.3.

2. Update the posterior distribution to include the new sample of f(0, ·), as described in Sect. A, so that it includes all samples on the current and previous tasks.

3. Choose as our solution the point with the largest estimated value according to the current posterior.
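The three steps above can be sketched as a driver loop. The statistical components are passed in as callables; the stand-ins in the demo below (a list of observations as the “posterior” and a grid search seeded at the best previous-task sample as the “acquisition function”) are deliberately trivial, hypothetical substitutes for the Gaussian process machinery of Sects. 3.1–3.3.

```python
def ws_kg(prev_data, evaluate, fit_hypers, update, next_point, recommend, budget):
    """Driver loop for warm-started optimization (steps 1-3 above)."""
    hypers = fit_hypers(prev_data)              # step 1.1: hyper-parameters
    post = update(None, prev_data, hypers)      # step 1.2: posterior on f
    for _ in range(budget):                     # step 2: spend the budget
        x = next_point(post)                    # step 2.1: acquisition max
        y, noise_var = evaluate(x)              # sample the current task
        post = update(post, [(0, x, y, noise_var)], hypers)  # step 2.2
    return recommend(post)                      # step 3: best estimated point

# --- Trivial stand-ins for a 1-D demo on the grid {0.0, 0.1, ..., 1.0}. ---
grid = [i / 10 for i in range(11)]
f0 = lambda x: -(x - 0.6) ** 2                                   # current task
prev = [(1, x, -(x - 0.55) ** 2, 0.0) for x in (0.2, 0.5, 0.8)]  # old samples

fit = lambda data: None                         # no hyper-parameters here
upd = lambda post, data, h: (post or []) + [d for d in data if d[0] == 0]
best_prev = max(prev, key=lambda d: d[2])[1]    # warm-start information

def nxt(post):
    """Greedy stand-in acquisition: untried grid point nearest best_prev."""
    tried = {d[1] for d in post}
    return min((x for x in grid if x not in tried),
               key=lambda x: abs(x - best_prev))

rec = lambda post: max(post, key=lambda d: d[2])[1]

x_star = ws_kg(prev, lambda x: (f0(x), 0.0), fit, upd, nxt, rec, budget=5)
```

Because the previous task’s optimum (0.55) is close to the current one (0.6), even this crude warm start lands on the best grid point within its five-sample budget.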

### 3.1 Construction of the Prior

We begin by describing the construction of the Gaussian process prior, which models the relationship between previous tasks and the current one. For each ℓ = 1, …, M, let δℓ(x) denote the difference in objective value of any design x ∈ D between task ℓ and the current task:

 δℓ(x)=f(ℓ,x)−f(0,x). (1)

We construct our Bayesian prior probability distribution over f by supposing that each δℓ with 1 ≤ ℓ ≤ M is drawn from an independent Gaussian process prior with mean function μℓ and covariance kernel Σℓ, where μℓ and Σℓ belong to some pre-specified parametrized class of functions. For example, we assume in our numerical experiments that μℓ ≡ 0, and that Σℓ belongs to the Matérn covariance functions, a popular class of stationary kernels (see chapter 4.2 in Rasmussen and Williams [RW06] for details). It would also be possible to set the prior mean to a non-zero constant. Hyper-parameters determining the specific choices for μℓ and Σℓ from their respective parametrized classes can be fit from data in an automated process, as we describe in more detail in Sect. 3.2.

This models the difference between each previous task and the current task. By using knowledge of the typical values of δℓ, inferred from the hyper-parameters and from evaluations of f(ℓ, x′) and f(0, x′) for x′ at and near x, we can learn about f(0, x) from knowledge of f(ℓ, x).

We also suppose that f(0, ·) is drawn from another independent Gaussian process with mean function μ0 and covariance kernel Σ0, which also belong to some pre-specified parametrized class of functions. As with μℓ and Σℓ, these parameters may also be chosen automatically as described in Sect. 3.2.

This specification of the (joint) distribution of f(0, ·) and the δℓ, together with the specification of f(ℓ, ·) in terms of these quantities, defines a joint Gaussian process probability distribution over f. This is because each f(ℓ, ·) is a linear function of quantities that are themselves jointly Gaussian. By specifying the mean function and covariance kernel of this Gaussian process, which we call μ and Σ respectively, we may leverage standard software for Gaussian process regression.

We calculate the functions μ and Σ as follows. For ease of notation let δ0(x) = 0 for all x ∈ D, so that we can write f(ℓ, x) as f(0, x) + δℓ(x) for all ℓ = 0, 1, …, M. Then we calculate the mean function as

 μ(ℓ,x)=E[f(ℓ,x)]=E[f(0,x)+δℓ(x)]=μ0(x)+E[δℓ(x)]=μ0(x),

since E[δℓ(x)] = 0. For ℓ, m ∈ {0, 1, …, M} and x, x′ ∈ D, the covariance kernel is given by

 Σ((ℓ,x),(m,x′)) = Cov(f(ℓ,x), f(m,x′))
 = Cov(f(0,x) + δℓ(x), f(0,x′) + δm(x′))
 = Cov(f(0,x), f(0,x′)) + Cov(δℓ(x), δm(x′))
 = Σ0(x,x′) + 1ℓ,m · Σℓ(x,x′),

where the third equality holds because f(0, ·) is independent of each δℓ, and the fourth because the Gaussian processes δℓ and δm are independent whenever ℓ ≠ m.

Here the indicator variable 1ℓ,m is one if ℓ = m and ℓ ≠ 0, and zero otherwise.
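A minimal sketch of this joint kernel for scalar inputs, assuming a Matérn 5/2 kernel for both Σ0 and the Σℓ (the experiments use the Matérn class; the unit length scales and variances here are arbitrary placeholders):

```python
import math

def matern52(x, xp, length_scale=1.0, variance=1.0):
    """Matérn 5/2 kernel for scalar inputs."""
    r = math.sqrt(5.0) * abs(x - xp) / length_scale
    return variance * (1.0 + r + r * r / 3.0) * math.exp(-r)

def joint_cov(l, x, m, xp, sigma0=matern52, sigma_task=matern52):
    """Sigma((l,x),(m,xp)) = Sigma0(x,xp) + 1{l = m != 0} * Sigma_l(x,xp).

    sigma0 models the current objective f(0, .); sigma_task models the
    deviation delta_l shared by two evaluations of the same previous task.
    """
    k = sigma0(x, xp)
    if l == m and l != 0:
        k += sigma_task(x, xp)
    return k

# Two samples of the same previous task share the delta_l term...
k_same = joint_cov(1, 0.3, 1, 0.3)
# ...while samples of different tasks, or of the current task, do not.
k_diff = joint_cov(1, 0.3, 2, 0.3)
k_curr = joint_cov(0, 0.3, 0, 0.3)
```

Evaluations of the same previous task are therefore more strongly correlated than evaluations of different tasks, which is what lets samples of task ℓ inform the posterior on f(0, ·).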

### 3.2 Estimation of the Hyper-Parameters

When designing the method, we aimed for wide applicability, and therefore do not assume that further information on the current optimization task or its relationship with previous tasks is given explicitly. Rather, we estimate the important quantities from historical data in an automated fashion, as follows.

Let θ denote the parameters of the covariance function Σ; for instance, in the case of the squared exponential kernel on ℝᵈ there are d + 1 parameters, a signal variance and one length scale per dimension. Since we believe that θ changes little from instance to instance, we formulate a prior distribution based on previously optimal choices for θ: let P(θ) be the probability of θ under that prior. Moreover, our statistical model provides the likelihood of the observations y conditioned on the sampled points x and the parameters θ, denoted by P(y | x, θ). Thinking of θ as a random variable, the posterior probability of θ conditioned on the observations is proportional to the product P(θ) · P(y | x, θ). Thus, the method of maximum a posteriori (MAP) estimation chooses the hyper-parameters that maximize this product, or equivalently argmax_θ [ln P(θ) + ln P(y | x, θ)], where ln is the natural logarithm. The optimal parameters can be computed efficiently via gradient methods. We refer to Sect. 5.4.1 in [RW06] for details.
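The MAP computation can be sketched as follows for a one-dimensional squared exponential kernel. The log-normal prior and the candidate grid are illustrative assumptions: in practice the prior would be fit to previously optimal hyper-parameters and the maximization done by gradient ascent rather than grid search.

```python
import numpy as np

def log_marginal_likelihood(theta, X, y, noise_var=1e-4):
    """ln P(y | X, theta) for a 1-D squared exponential kernel with
    signal variance theta[0] and length scale theta[1]."""
    s2, ls = theta
    d = X[:, None] - X[None, :]
    K = s2 * np.exp(-0.5 * (d / ls) ** 2) + noise_var * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return float(-0.5 * y @ alpha - np.log(np.diag(L)).sum()
                 - 0.5 * len(X) * np.log(2.0 * np.pi))

def log_prior(theta):
    """Hypothetical log-normal prior, centred at 1 for both parameters
    (standing in for 'previously optimal choices')."""
    return float(-0.5 * np.sum(np.log(np.asarray(theta)) ** 2))

def map_estimate(X, y, candidates):
    """MAP estimate: maximize ln P(theta) + ln P(y | X, theta); a grid
    search stands in for the gradient methods used in practice."""
    return max(candidates,
               key=lambda t: log_prior(t) + log_marginal_likelihood(t, X, y))

X = np.linspace(0.0, 2.0 * np.pi, 10)
y = np.sin(X)
theta_hat = map_estimate(X, y, [(1.0, 0.1), (1.0, 1.0), (1.0, 10.0)])
```

On this sine example the moderate length scale wins: a very short one explains the data as pure noise, while a very long one cannot represent the oscillation at all.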

### 3.3 Acquisition Function

The proposed algorithm takes as input the historical data, i.e., the samples obtained during the optimization of previous tasks, and “oracle access” to the current task; that is, we consider the objective function a black box. The actual optimization process is conducted in an iterative manner, where in each iteration a so-called acquisition function selects a point x ∈ D to sample. A wide range of such functions has been proposed in the literature. We suggest using the Knowledge Gradient [FPD09]: this greedy strategy is known to be one-step (Bayes) optimal, and moreover converges to an optimal solution in settings where a finite number of alternatives is given with a multivariate normal prior.

From a practical point of view, the main advantage of the Knowledge Gradient is that it has been shown to produce good results and nonetheless can be computed very efficiently (see [FPD09] for details on the implementation). For the sake of a self-contained presentation, we give a brief description of the method.

We assume that we have already observed n points for the current task and are now to decide the (n+1)-st sample. Let μ(n)(x) = E_n[f(0, x)], where E_n denotes the expected value under the posterior distribution after n samples. Thus, based on our current knowledge, the best design for task 0 has an expected objective value of max_{x′∈D} μ(n)(x′). Note that this value can be computed with the information available after observing the n-th sample. The Knowledge Gradient (KG) then selects a design x(n+1) that maximizes the expected increment in this value. We obtain a computationally tractable approximation by replacing D with a discrete set of points, denoted by A:

 E_n[ max_{x′∈A} μ(n+1)(x′) | x(n+1) = x ] − max_{x′∈A} μ(n)(x′). (2)

The set A can be obtained by randomly sampling from D, for example using a Latin hypercube design to ensure an even coverage. If certain regions of D seem particularly relevant, then A can represent them with higher resolution. Note that the first summand of Eq. (2), the conditional expectation for fixed x, can be calculated from the posterior distribution: recall that μ(n) is the mean vector and let Σ(n) denote the covariance matrix of the posterior distribution after observing n samples. Define σ̃x(x′) = Σ(n)(x′, x) / √(λ0(x) + Σ(n)(x, x)), the change in the posterior mean at x′ induced by one standard unit of the new observation at x. It is well known (e.g., see [Kai68, FPD09]) that the random variables μ(n+1)(x′) and μ(n)(x′) + σ̃x(x′)·Z have the same distribution conditioned on the previous samples, where Z is a standard normal random variable. Thus, E_n[max_{x′∈A} μ(n+1)(x′) | x(n+1) = x] = E[max_{x′∈A} (μ(n)(x′) + σ̃x(x′)·Z)] holds for every x.
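Given the posterior means over A and the vector of sensitivities σ̃x, the conditional expectation in Eq. (2) can be estimated by Monte Carlo over the scalar Z, as in this sketch. (Exact algorithms based on sorting the lines μ(n)(x′) + σ̃x(x′)·z exist [FPD09]; Monte Carlo simply keeps the example short.)

```python
import numpy as np

def kg_factor(mu, sigma_tilde, n_z=20000, seed=0):
    """Monte Carlo estimate of the KG factor (2) for one candidate x.

    mu[i]          -- posterior mean at the i-th point of A
    sigma_tilde[i] -- change of that mean per unit of the standard
                      normal Z induced by sampling at x
    """
    z = np.random.default_rng(seed).standard_normal(n_z)
    # Best updated mean over A, for each realization of Z.
    future_best = np.max(mu[:, None] + sigma_tilde[:, None] * z[None, :], axis=0)
    return float(future_best.mean() - mu.max())

# Two-point example: sampling moves the two means in opposite directions,
# so the expected best future mean is E|Z| = sqrt(2/pi) above the current best.
kg = kg_factor(np.array([0.0, 0.0]), np.array([1.0, -1.0]))
```

The KG factor is always non-negative: the maximum is a convex function of the updated means, so Jensen’s inequality applies.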

Based on that observation, we can either enumerate all x ∈ A and pick a point that maximizes Eq. (2), which may be useful if evaluating the task is very expensive (compared to the computational cost of the enumeration); or we can utilize the fact that the gradient of the KG factor can be computed efficiently (e.g., see [FXC11, SFP11, PWF16]) and perform a local search over D, for instance using a multi-start gradient ascent method. In the latter case we would require that the noise functions λℓ are differentiable.

## 4 Experimental Evaluation

We have evaluated the warm start algorithm on two benchmark suites. The first set comprises variants of the two-dimensional Rosenbrock function, a standard test problem in Bayesian optimization.

The second family of testbed instances in this evaluation is taken from the Assemble-To-Order (ATO) problem that was introduced by Hong and Nelson [HN06]. In this eight-dimensional simulation optimization scenario, the task is to optimize the target stock level of each item under a continuous-review base stock policy that manages the inventory of a production facility. To evaluate the objective value of a specific choice of the parameters, we run the simulator of Xie, Frazier, and Chick [XFC12], available on the SimOpt website.

The performance of the warm start algorithm (wsKG) is compared to two traditional methods that do not benefit from previous optimizations of related tasks.

EGO. The first baseline method is the well-known EGO algorithm of Jones, Schonlau, and Welch [JSW98]. EGO is also a myopic algorithm that iteratively selects one point to sample. EGO’s acquisition criterion is the expected improvement (EI): for this exposition, suppose that observations are noiseless and that f* is the best objective value sampled so far. Let Y(x) be the random variable distributed (normally) according to the posterior distribution of the objective value of a design x; then the expected improvement for x is E[max(Y(x) − f*, 0)]. In each iteration EGO greedily samples an x that maximizes this expectation.
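For noiseless observations this expectation has a familiar closed form, EI(x) = σ(x)·(uΦ(u) + φ(u)) with u = (μ(x) − f*)/σ(x); a short sketch for a maximization problem:

```python
import math

def expected_improvement(mu, sigma, f_best):
    """Closed-form EI, E[max(Y - f_best, 0)], for Y ~ N(mu, sigma^2)."""
    if sigma <= 0.0:
        return max(mu - f_best, 0.0)  # degenerate posterior: no uncertainty
    u = (mu - f_best) / sigma
    pdf = math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)  # phi(u)
    cdf = 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))         # Phi(u)
    return sigma * (u * cdf + pdf)

# At the incumbent (mu == f_best) EI reduces to sigma * phi(0).
ei = expected_improvement(0.0, 1.0, 0.0)
```

EI grows with both the predicted mean and the posterior uncertainty, which is how EGO trades off exploitation against exploration.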

KG. In order to assess the benefit of taking data on previous tasks into account, we also compare the new method to the “vanilla version” of the Knowledge Gradient (KG) [FPD09]; see Sect. 3.3.

Experimental Setup. In our evaluation all algorithms are given the same initial data points drawn randomly for each instance and each run. The warm start algorithm is additionally provided with one data set for the Rosenbrock instances and two for ATO: each set contains the samples collected during a single run on a related instance. A single run consists of 25 steps (also referred to as iterations) for each of the Rosenbrock instances and 50 steps for the Assemble-to-Order suite.

The hyper-parameters of the Gaussian process affect how accurately the posterior distribution predicts the objective function of the current task. In accordance with our scenario, we optimized the hyper-parameters once for a single instance of each suite and then invoked wsKG with this fixed setting for all instances. For the baseline methods EGO and KG we optimized these parameters for each instance in advance, thereby possibly giving them an edge over wsKG in this respect.

The plots below provide the mean of the gain over the initial solution for each iteration. Error bars are shown at the mean plus and minus two standard errors, averaged over at least 100 replications.

### 4.1 The Rosenbrock Family

The basis of these instances is the 2D Rosenbrock function

 RB1(x1, x2) = (1 − x1)² + 100·(x2 − x1²)²,

which is subject to the following modifications:

 RB2(x1, x2) = RB1(x1, x2) + 0.01·sin(10·x1 + 5·x2)
 RB3(x1, x2) = RB1(x1 + 0.01, x2 − 0.005)
 RB4(x1, x2) = RB2(x1, x2) + 0.01·x1

Moreover, each function evaluation is subject to i.i.d. normally distributed noise with mean zero and constant variance. The task is to optimize the respective function over a hyper-rectangular domain.
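The four variants are straightforward to reproduce; the noise standard deviation below is an arbitrary placeholder, since the variance used in the experiments is not restated here.

```python
import math
import random

def rb1(x1, x2):
    """The basic 2-D Rosenbrock function; global minimum 0 at (1, 1)."""
    return (1.0 - x1) ** 2 + 100.0 * (x2 - x1 ** 2) ** 2

def rb2(x1, x2):
    """RB1 plus a small additive sine perturbation."""
    return rb1(x1, x2) + 0.01 * math.sin(10.0 * x1 + 5.0 * x2)

def rb3(x1, x2):
    """RB1 with a slightly shifted argument."""
    return rb1(x1 + 0.01, x2 - 0.005)

def rb4(x1, x2):
    """RB2 plus a small bias depending on x1."""
    return rb2(x1, x2) + 0.01 * x1

def noisy(f, x1, x2, sd=0.1):
    """One simulated evaluation: the true value plus i.i.d. Gaussian noise."""
    return f(x1, x2) + random.gauss(0.0, sd)
```

The perturbations are small relative to the Rosenbrock valley, so the four tasks share most of their structure, which is exactly the regime in which warm starting should help.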

The results are summarized in Fig. 1: the warm start algorithm is able to exploit the knowledge from a single run on a related instance and samples better solutions than the other methods from the first iteration on. In particular, wsKG obtains a near-optimal solution after only one or two samples. KG and EGO trail behind, with KG showing superior performance to EGO.

Figure 1: (ul) The basic Rosenbrock function RB1. (ur) The Rosenbrock function RB2 with an additive sine. (bl) The shifted Rosenbrock function RB3. (br) The Rosenbrock function RB4 with an additive sine and a bias depending on x1.

### 4.2 The Assemble-To-Order Benchmark

In the Assemble-To-Order problem we consider a production facility that sells different products, each assembled on demand whenever a customer request comes in. Products are manufactured from a selection of different items, where for each product we distinguish between key items and non-key items: when a product is ordered and its key items are all stocked, it is assembled and sold; otherwise the customer is lost. Non-key items are optional and used if available; they increase the revenue. On the other hand, items kept in the inventory incur holding costs, so the facility would like to avoid overstocking.

The inventory is managed by a continuous-review base stock policy that sets a target stock level for each item and triggers a replenishment order whenever an item is requested. The delivery time for supplies is normally distributed and varies over the items. Moreover, the orders for the products vary, too, and are modeled by Poisson arrival processes. The goal is to choose the target stock levels used by the inventory management system so that the expected daily profit is maximized.

Hong and Nelson [HN06] proposed a setting (referred to as ATO 1) with eight items, five products, and a hyper-rectangular search domain. We have created three additional instances based on ATO 1:

In the first time period, ATO 2, the forecast sees a change in customer behavior: the expected demand for the two products that had been most popular before drops, while the popularity of two other products grows. Additionally, the profit of some of the items increases on average.

In the next period, ATO 3, the delivery of some of the items is delayed, increasing the expected waiting times. Moreover, all products see a higher demand and the profits for several items increase.

In the final period, ATO 4, the production facility experiences rising holding costs. Moreover, the profits of several items drop slightly.

Figure 2: (l) ATO 1 (r) ATO 2: All algorithms have the same initial data for the current problem. wsKG also has access to samples of two runs on related instances, but its hyper-parameters are not optimized for the current instance.

Figure 3: (l) ATO 3 (r) ATO 4: wsKG has received the samples of two runs on related instances.

Comparing the performance of the three algorithms in Fig. 2 and Fig. 3 for the task of selecting the inventory levels used by the stocking policy, we see that wsKG consistently performs significantly better than its competitors, coming close to the optimum after only ten samples. Among the two distanced methods, the KG policy achieves better solutions than EGO for the Assemble-To-Order problem. Looking closer, KG’s solutions typically remain a noticeable distance from the optimum after 10, 20, and even 50 steps, depending on the instance. EGO achieves less than half of the optimum after 20 iterations and remains considerably below it after 50 steps. Notably, EGO shows a large discrepancy between the performance of single runs; the error bars display the mean averaged over 100 runs.

## 5 Conclusion

We have proposed an algorithm to warm start Bayesian optimization and demonstrated its usefulness for an application in simulation optimization. The design of the algorithm is conceptually simple and computationally efficient.

A typical drawback of such Bayesian optimization methods is their limited ability to scale to larger input sizes. For the chosen acquisition function, Poloczek, Wang, and Frazier [PWF16] recently suggested a novel parallelization that efficiently utilizes several multi-core CPUs. Thus, the acquisition function is no longer the bottleneck of the approach if access to a computing cluster is provided. However, obtaining the predictive distribution of the Gaussian process model involves a Cholesky decomposition to invert a matrix, whose running time is cubic in the number of previous observations.

There are several approaches to address this, and we believe it is worthwhile to combine one of them with the proposed method. One possible direction is to replace the Gaussian process by a scalable approximation; several such techniques have been proposed lately, e.g., see [FWN15, WN15] and the references therein.

A complementary approach to speed up the computations is to reduce the number of data points: instead of using the whole data set, a “sketch” reduces the number of data points significantly, while predictions made from the sketch are nearly indistinguishable from those based on the full dataset. Geppert et al. [GIM15] proposed such an approximation and demonstrated its usefulness for applications in Bayesian linear regression. Another approach, also based on a linear projection of the dataset to a subspace, was given by Banerjee, Dunson, and Tokdar [BDT12].

Another interesting direction is to extend our statistical model. We sketch two ideas. Recall that the statistical model in Sect. 3.1 describes the previous tasks as deviations from the current objective, where each deviation is given by a Gaussian process, and the deviations of distinct tasks are supposed to be independent. Alternatively, we could apply the formulation described in Sect. 2.2 of [PWF16] to additionally incorporate correlations among previous tasks.

A second approach is to suppose that all tasks are derived from an “ideal” problem: one task gives the current objective that we want to optimize, and the others are the previous tasks. Note that the ideal problem is not observable in this case and is obtained via Gaussian process regression. Both approaches can be implemented with little additional effort. Therefore, we have a collection of flexible models to choose from, depending on the characteristics of the optimization problem at hand.
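To make the first idea concrete, modeling each previous task as the current objective plus an independent GP deviation induces a simple covariance structure over task-design pairs. The sketch below uses illustrative squared-exponential kernels; the function names and hyperparameters are ours, not from the paper.

```python
import numpy as np

def rbf(x, xp, ls):
    """Squared-exponential kernel (illustrative choice)."""
    return np.exp(-0.5 * (x - xp) ** 2 / ls ** 2)

def warm_start_cov(l, x, lp, xp, ls_shared=0.2, ls_dev=0.1):
    """Covariance over task-design pairs implied by modeling previous task l
    as f(0, .) plus an independent GP deviation delta_l (hypothetical
    hyperparameters):
        cov((l,x), (l',x')) = Sigma_0(x, x') + [l == l' > 0] * Sigma_l(x, x')."""
    k = rbf(x, xp, ls_shared)  # shared component from the current objective
    if l == lp and l > 0:      # deviations of distinct tasks are independent
        k += rbf(x, xp, ls_dev)
    return k
```

Under this structure, observations on any previous task are informative about the current task through the shared component, which is what enables the warm start.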

## Appendix A Prior and Posterior Distribution of the Gaussian Process

For the sake of a self-contained exposition, we give the prior and the posterior of the Gaussian process. Both follow canonically (see [RW06] for an in-depth treatment).

Let $X = \left((\ell^{(1)}, x^{(1)}), \ldots, (\ell^{(n)}, x^{(n)})\right)$ be a column vector of sample locations, where each pair consists of a task index $\ell^{(i)}$ and a design $x^{(i)}$. Then the prior distribution of $f$ at $X$ is given by

$$\left(f(\ell^{(1)}, x^{(1)}), \ldots, f(\ell^{(n)}, x^{(n)})\right)^T \sim \mathcal{N}\left(\mu_0(X),\, K(X,X)\right), \tag{3}$$

i.e., the prior distribution is multivariate normal. Here we utilized that $f(\ell, x)$ is normally distributed for any task $\ell$ and design $x$, as shown above; $\mu_0(X)$ is a shorthand for the $n$-dimensional column vector with entries $\mu_0(\ell^{(i)}, x^{(i)})$, and $K(X,X)$ is the $n \times n$ matrix whose $(i,j)$ entry is the prior covariance of $f(\ell^{(i)}, x^{(i)})$ and $f(\ell^{(j)}, x^{(j)})$ for $1 \le i, j \le n$.

Next we describe the Gaussian process regression used to predict the latent function $f$ that serves as our estimator for the objective. Upon sampling the points in $X$, we have observed $Y = \left(y^{(1)}, \ldots, y^{(n)}\right)^T$ with

$$y^{(i)} = y_{\ell^{(i)}}\left(x^{(i)}\right) = f\left(\ell^{(i)}, x^{(i)}\right) + \varepsilon^{(i)},$$

where $\varepsilon^{(i)}$ is normally distributed with mean zero for $1 \le i \le n$. Recall that the noise function gives the variance when evaluating task $\ell$ for design $x$, and observe that $\varepsilon^{(i)}$ and $\varepsilon^{(j)}$ are independent for $i \neq j$, conditioned on the sampled locations and the respective noise functions.

Then the posterior (or predictive) distribution of the latent function value $f(0, x)$ is still multivariate normal (e.g., see Eq. (A.6) on p. 200 in [RW06]) with

$$f(0, x) \mid x, X, Y \sim \mathcal{N}\left(\mu^n(x),\, \sigma^{2,n}(x)\right), \quad \text{where} \tag{4}$$
$$\mu^n(x) = \mu_0(x) + K(x, X)\left[K(X,X) + D(X)\right]^{-1}\left(Y - \mu_0(X)\right),$$
$$\sigma^{2,n}(x) = \Sigma_0(x, x) - K(x, X)\left[K(X,X) + D(X)\right]^{-1} K(x, X)^T,$$

where $K(x, X)$ is defined as the row vector of length $n$ whose $i$-th entry is the prior covariance of $f(0, x)$ and $f(\ell^{(i)}, x^{(i)})$, and $D(X)$ is the $n \times n$ diagonal matrix whose $i$-th diagonal entry is the noise variance of the observation $y^{(i)}$. As usual, $A^{-1}$ denotes the inverse of an $n \times n$ matrix $A$ under matrix multiplication, i.e., $A A^{-1} = I_n$ holds, where $I_n$ is the identity matrix.
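As a sanity check, Eq. (4) can be evaluated directly with numpy. This is a naive sketch with hypothetical function and argument names of our own choosing; a practical implementation would use a Cholesky factorization rather than repeated generic solves, as discussed in the conclusion.

```python
import numpy as np

def posterior(k_x, K, D, Y, mu0_X, mu0_x, prior_var):
    """Posterior mean and variance of f(0, x) per Eq. (4):
    mu_n(x)     = mu_0(x) + K(x,X) [K(X,X) + D(X)]^{-1} (Y - mu_0(X))
    sigma2_n(x) = Sigma_0(x,x) - K(x,X) [K(X,X) + D(X)]^{-1} K(x,X)^T
    Here k_x is the row vector K(x,X) and D is the diagonal matrix D(X)."""
    A = K + D
    mean = mu0_x + k_x @ np.linalg.solve(A, Y - mu0_X)
    var = prior_var - k_x @ np.linalg.solve(A, k_x)
    return mean, var

# one nearly noise-free observation at x itself: the posterior collapses onto it
m, v = posterior(np.array([1.0]), np.array([[1.0]]), np.array([[1e-8]]),
                 np.array([2.0]), np.array([0.0]), 0.0, 1.0)
print(m, v)
```

With negligible observation noise, the posterior mean at the sampled point approaches the observed value and the posterior variance approaches zero, as expected.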

## References

• [BCd09] Eric Brochu, Vlad M. Cora, and Nando de Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. Technical Report TR-2009-23, Department of Computer Science, University of British Columbia, 2009.
• [BDT12] Anjishnu Banerjee, David B. Dunson, and Surya T. Tokdar. Efficient Gaussian process regression for large datasets. Biometrika, 100(1):75–89, 2012.
• [Chi00] Stephen E Chick. Bayesian methods: Bayesian methods for simulation. In Jeffrey A. Joines, Russel R. Barton, Keebom Kang, and Paul A. Fishwick, editors, Proceedings of the 2000 Winter Simulation Conference, pages 109–118, San Diego, CA, USA, 2000. Society for Computer Simulation International.
• [FPD09] Peter I Frazier, Warren B Powell, and Savas Dayanik. The Knowledge Gradient Policy for Correlated Normal Beliefs. INFORMS Journal on Computing, 21(4):599–613, 2009.
• [Fra11] P. I. Frazier. Decision-theoretic foundations of simulation optimization. In Wiley Encyclopedia of Operations Research and Management Science. John Wiley and Sons, 2011.
• [FSH14] Matthias Feurer, Jost Tobias Springenberg, and Frank Hutter. Using meta-learning to initialize Bayesian optimization of hyperparameters. In Proceedings of the International Workshop on Meta-Learning and Algorithm Selection (ECAI), pages 3–10, 2014.
• [FW16] Peter I. Frazier and Jialei Wang. Bayesian optimization for materials design. In Information Science for Materials Discovery and Design, pages 45–75. 2016.
• [FWN15] Seth Flaxman, Andrew Gordon Wilson, Daniel B. Neill, Hannes Nickisch, and Alexander J. Smola. Fast Kronecker inference in Gaussian processes with non-Gaussian likelihoods. In Proceedings of the 32nd International Conference on Machine Learning, pages 607–616, 2015.
• [FXC11] Peter I. Frazier, Jing Xie, and Stephen E. Chick. Value of information methods for pairwise sampling with correlations. In Sanjay Jain, Roy Creasey, and Jan Himmelspach, editors, Proceedings of the 2011 Winter Simulation Conference, pages 3979–3991, Phoenix, AZ, USA, 2011. Society for Computer Simulation International.
• [GIM15] Leo N. Geppert, Katja Ickstadt, Alexander Munteanu, Jens Quedenfeld, and Christian Sohler. Random projections for Bayesian regression. Statistics and Computing, pages 1–23, 2015.
• [GRS04] Shawn E Gano, John E Renaud, and Brian Sanders. Variable fidelity optimization using a kriging based scaling function. In Proceedings of the Tenth AIAA/ISSMO Multidisciplinary Analysis and Optimization Conference, 2004.
• [GW04] Mevludin Glavic and Louis Wehenkel. Interior point methods: A survey, short survey of applications to power systems, and research opportunities. Technical report, Department of Electrical Engineering and Computer Science, Universite de Liège, Sart Tilman, Belgium, 2004.
• [HN06] L. Jeff Hong and Barry L. Nelson. Discrete optimization via simulation using COMPASS. Operations Research, 54(1):115–129, 2006.
• [JSW98] Donald R Jones, Matthias Schonlau, and William J Welch. Efficient Global Optimization of Expensive Black-Box Functions. Journal of Global Optimization, 13(4):455–492, 1998.
• [JY08] Elizabeth John and E Alper Yıldırım. Implementation of warm-start strategies in interior-point methods for linear programming in fixed dimension. Computational Optimization and Applications, 41(2):151–183, 2008.
• [Kai68] Thomas Kailath. An innovations approach to least-squares estimation – Part I: Linear filtering in additive white noise. IEEE Transactions on Automatic Control, 13(6):646–655, 1968.
• [Kle14] Jack PC Kleijnen. Simulation-optimization via kriging and bootstrapping: A survey. Journal of Simulation, 8(4):241–250, 2014.
• [PWF16] Matthias Poloczek, Jialei Wang, and Peter I. Frazier. Multi-information source optimization with general model discrepancies. ArXiv e-print 1603.00389, 2016. Available via http://arxiv.org/abs/1603.00389.
• [RW06] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
• [SFP11] Warren R. Scott, Peter I. Frazier, and Warren B. Powell. The correlated knowledge gradient for simulation optimization of continuous parameters using Gaussian process regression. SIAM Journal on Optimization, 21(3):996–1026, 2011.
• [SSA13] Kevin Swersky, Jasper Snoek, and Ryan P. Adams. Multi-task Bayesian optimization. In Advances in Neural Information Processing Systems, pages 2004–2012, 2013.
• [WN15] Andrew Gordon Wilson and Hannes Nickisch. Kernel interpolation for scalable structured Gaussian processes (KISS-GP). In Proceedings of the 32nd International Conference on Machine Learning, pages 1775–1784, 2015.
• [XFC12] J. Xie, P. I. Frazier, and S. Chick. Assemble to order simulator. Accessed May 9, 2016.
• [YW02] E Alper Yildirim and Stephen J Wright. Warm-start strategies in interior-point methods for linear programming. SIAM Journal on Optimization, 12(3):782–810, 2002.