Multi-Information Source Optimization

03/01/2016 ∙ by Matthias Poloczek, et al. ∙ cornell university 0

We consider Bayesian optimization of an expensive-to-evaluate black-box objective function, where we also have access to cheaper approximations of the objective. In general, such approximations arise in applications such as reinforcement learning, engineering, and the natural sciences, and are subject to an inherent, unknown bias. This model discrepancy is caused by an inadequate internal model that deviates from reality and can vary over the domain, making the utilization of these approximations a non-trivial task. We present a novel algorithm that provides a rigorous mathematical treatment of the uncertainties arising from model discrepancies and noisy observations. Its optimization decisions rely on a value of information analysis that extends the Knowledge Gradient factor to the setting of multiple information sources that vary in cost: each sampling decision maximizes the predicted benefit per unit cost. We conduct an experimental evaluation that demonstrates that the method consistently outperforms other state-of-the-art techniques: it finds designs of considerably higher objective value and additionally inflicts less cost in the exploration process.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

We consider the problem of optimizing an expensive-to-evaluate black-box function, where we additionally have access to cheaper approximations of the objective. For instance, this scenario arises when tuning hyper-parameters of machine learning algorithms, e.g., parameters of a neural network architecture or a regression method: one may assess the performance of a particular setting on a smaller related dataset or even a subset of the whole database (e.g., see 

Swersky et al. [2013], Kandasamy et al. [2016], Klein et al. [2016]). Or consider a ride-sharing company, e.g., Uber or Lyft, that matches clients looking for a ride to drivers. These real-time decisions then could be made by a system whose performance crucially depends on a variety of parameters. These parameters can be evaluated either in a simulation or by deploying them in a real-world experiment. Similarly, the tasks of configuring the system that navigates a mobile robot in a factory, or steers a driverless vehicle on the road, can be approached by invoking a simulator, or by evaluating its performance on a closed test course or even a public road. The approach of exploiting cheap approximations is fundamental to make such tasks tractable.

We formalize these problems as multi-information source optimization problems

(MISO), where the goal is to optimize a complex design under an expensive-to-evaluate black-box function. To reduce the overall cost we may utilize cheap approximate estimates of the objective that are provided by so-called “information sources”. However, in the general settings we consider their output is not necessarily an unbiased estimator only subject to observational noise, but in addition may be inherently biased: the behavior of an autonomous car on a test course or the performance of a dispatch algorithm in a ride-sharing simulation will differ significantly from the real world. Thus, inevitably we face a

model discrepancy which denotes an inherent inability to describe the reality accurately. We stress that this notion goes far beyond typical “noise” such as measurement errors or numerical inaccuracies. The latter grasps uncertainty that arises when sampling from an information source, and is the type of inaccuracy that most of the previous work on multifidelity optimization deals with, e.g., see Balabanov and Venter [2004], Eldred et al. [2004], Rajnarayan et al. [2008], March and Willcox [2012], Lam et al. [2015] for problem formulations in the engineering sciences. In particular, such an understanding of noise assumes implicitly that the internal value (or state) of the information source itself is an accurate description of the truth. Our more general notion of model discrepancy captures the uncertainty about the truth that originates from the inherent deviation from reality.

Related Work.

The task of optimizing a single, expensive-to-evaluate black-box function has received a lot of attention. A successful approach to this end is Bayesian optimization, a prominent representative being the efficient global optimization (EGO) method by Jones et al. [1998]. EGO assumes that there is only a single information source that returns the true value; the goal is to find a global maximum of the objective while minimizing the query cost. It consists of the following two steps that have served as a blueprint for many subsequent Bayesian optimization techniques. First a stochastic, in their case Gaussian, process is formulated and fitted to a set of initial data points. Then they search for a global optimum by iteratively sampling a point of highest predicted score according to some “acquisition criterion”: Jones et al. employ expected improvement (EI) that samples a point next whose improvement over the best current design is maximum in expectation. Subsequently, the expected improvement method was extended to deal with observational noise, e.g., see Huang et al. [2006b], Scott et al. [2011], Picheny et al. [2013]. The extension of Picheny et al. also dynamically controls the precision of a measurement, making it well suited for many multifidelity optimization problems.

The strategy of reducing the overall cost of the optimization process by utilizing cheaper approximations of the objective function was successfully employed in a seminal paper by Swersky, Snoek, and Adams [2013], who showed that the task of tuning hyper-parameters for two classification problems can be sped up significantly by evaluating settings on a small sample instead of the whole database. To this end, they proposed a Gaussian process model to quantify the correlation between such an “auxiliary task” and the primary task, building on previous work on Gaussian process regression for multiple tasks in Bonilla et al. [2007], Goovaerts [1997], Teh et al. [2005]

: their kernel is given by the tensor product 

, where  (resp., ) denotes the covariance matrix of the tasks (resp., of the points). Their acquisition criterion is a cost-sensitive formulation of Entropy Search [Hennig and Schuler, 2012, Villemonteix et al., 2009]: here one samples in each iteration a point that yields a maximum reduction in the uncertainty over the location of the optimum.

Besides, interesting variants of the MISO problem have been studied recently: Kandasamy et al. [2016] examined an alternative objective for multifidelity settings that asks to minimize the cumulative regret over a series of function evaluations: besides classification problems they also presented an application where the likelihood of three cosmological constants is to be maximized based on Supernovae data. Klein et al. [2016] considered hyper-parameter optimization of machine learning algorithms over a large dataset . Supposing access to subsets of 

of arbitrary sizes, they show how to exploit regularity of performance across dataset sizes to significantly speed up the optimization process for support vector machines and neural networks.

In engineering sciences the approach of building cheap-to-evaluate, approximate models for the real function, that offer different fidelity-cost trade-offs, is also known as “surrogate modeling” and has gained a lot of popularity (e.g., see the survey Queipo et al. [2005]). Kennedy and O’Hagan [2000] introduced Gaussian process regression to multifidelity optimization to optimize a design given several computer codes that vary in accuracy and computational complexity. Contrasting the related work discussed above, these articles consider model discrepancy, but impose several restrictions on its nature: a common constraint in multifidelity optimization (e.g., see Kennedy and O’Hagan [2000], Balabanov and Venter [2004], Eldred et al. [2004], Rajnarayan et al. [2008], March and Willcox [2012], Gratiet and Cannamela [2015]) is that information sources are required to form a hierarchy, thereby limiting possible correlations among their outputs: in particular, once one has queried a high fidelity source for some point , then no further knowledge on  can be gained by querying any other information source of lower fidelity (at any point). A second frequent assumption is that information sources are unbiased

, admitting only (typically normally distributed) noise that further must be

independent across different sources. Lam, Allaire, and Willcox [2015] addressed several of these shortcomings by a novel surrogate-based approach that requires the information sources to be neither hierarchical nor unbiased, and allows a more general notion of model discrepancy building on theoretical foundations by Allaire and Willcox [2014]. Their model has a separate Gaussian process for each information source that in particular quantifies its uncertainty over the domain. Predictions are obtained by fusing that information via the method of Winkler [1981]. Then they apply the EI acquisition function on these surrogates to first decide what design  should be evaluated next and afterwards select the respective information source to query 

; the latter decision is based on a heuristic that aims to balance information gain and query cost (see also Sect. 

4).

Our Contributions.

We present an approach to multi-information source optimization that allows exploiting cheap approximative estimates of the objective, while handling model discrepancies in a general and stringent way. Thus, we anticipate a wide applicability, including scenarios such as described above. We build on the common technique to capture the model discrepancy of each information source by a Gaussian process. However, we break through the separation and build a single statistical model that allows a uniform Bayesian treatment of the black-box objective function and the information sources. This model improves on previous works in multifidelity optimization and in particular on Lam et al. [2015], as it allows rigorously exploiting correlations across different information sources and hence enables us to reduce our uncertainty about all information sources when receiving one new data point, even if it originates from a source with lower fidelity. Thus, we obtain a more accurate estimate of the true objective function from each sample. Note that this feature is also accomplished by the product kernel of Swersky et al. [2013]. Their model differs in the assumption that the relationship of any pair of information sources is fixed throughout the domain, whereas our covariance function is more flexible. Additionally, our model incorporates and quantifies the bias between the objective and any other source explicitly.

Another important contribution is an acquisition function that quantifies the uncertainty about the true objective and information sources, in particular due to model discrepancies and noise. To this end, we show how the Knowledge Gradient (KG) factor proposed by Frazier, Powell, and Dayanik [2009] can be computed efficiently in the presence of multiple information sources that vary in cost. This cost-sensitive Knowledge Gradient selects a pair  such that the simple regret (i.e. the loss of the current best solution with respect to the unknown global optimum) relative to the specific query cost  is minimized in expectation. Specifically, we show that the query cost can be incorporated as part of the objective in a natural way: our policy picks a pair that offers an optimal trade-off between the predicted simple regret and the corresponding cost. Specifically, its choice is even provably one-step Bayes optimal in terms of this benefit per unit cost. We regard it a conceptual advantage that the cost-sensitive KG factor can be computed analytically, whereas Swersky et al. [2013] rely on Monte Carlo approximations (see also Hennig and Schuler [2012] for a discussion). Lam et al. [2015] utilize a two-step heuristic as acquisition function.

We also demonstrate that our model is capable of handling information sources whose model discrepancies are interrelated in a more sophisticated way: in particular, we address the scenario of groups of information sources whose models deviate from the truth in a correlated fashion. For instance, in silico simulations of physical or chemical processes might employ similar approximations whose deviations from the physical laws are thus correlated, e.g., finite element analyses with different mesh fineness by Huang et al. [2006a] or calculations based on shared data sets. Additionally, experiments conducted in the same location are exposed to the same environmental conditions or singular events, thus the outputs of these experiments might deviate from the truth by more than independent measurement errors. Another important factor is humans involved in the lab work, as typically workers have received the comparable training and may have made similar experiences during previous joint projects, which influences their actions and decisions.

2 The Model

Each design (or point)  is specified by  parameters. Given some compact set  of feasible designs, our goal is to find a best design under some continuous objective function , i.e. we want to find a design in . Restrictions on  such as box constraints can be easily incorporated in our model. We have access to  possibly biased and/or noisy information sources that provide information about . Note that the  (with ) are also called "surrogates"; in the context of hyper-parameter tuning they are sometimes referred to as “auxiliary tasks” and  is the primary task. We suppose that repeatedly observing for some  and  provides independent and normally distributed observations with mean value 

and variance 

. These sources are thought of as approximating , with variable model discrepancy or bias . We suppose that  can be observed directly without bias (but possibly with noise) and set . Each  is also associated with a query cost function . We assume that the cost function  and the variance function  are both known and continuously differentiable. In practice, these functions may either be provided by domain experts or may be estimated along with other model parameters from data (see Sect. 4, the supplement, and Rasmussen and Williams [2006]). Our motivation in having the cost and noise vary over the space of designs is that physical experiments may become difficult to conduct and/or expensive when environmental parameters are extreme. Moreover, simulations may be limited to certain specified parameter settings and their accuracy diminish quickly.

We now place a single Gaussian process prior on (i.e., on and the mean response from the  information sources). Let be the mean function of this Gaussian process, and be the covariance kernel. (Here, for any  we use  as a shorthand for the set , and further define .) While our method can be used with an arbitrary mean function and positive semidefinite covariance kernel, we provide two concrete parameterized classes of mean functions and covariance kernels that are useful for multi-information source optimization. Due to space constraints the second class is deferred to the supplement, where we also detail how to estimate the hyper-parameters of the mean function and the covariance kernel.

Independent Model Discrepancy.

We first propose a parameterized class of mean functions and covariance functions derived by supposing that model discrepancies are chosen independently across information sources. This first approach is appropriate when information sources are different in kind from each other and share no relationship except the fact that they are modeling a common objective. We also propose a more general parameterized class that models correlation between model discrepancies, as is typical when information sources can be partitioned into groups, such that information sources within a group tend to agree more amongst themselves than they do with information sources in other groups. Due to space constraints, this class was deferred to the the online supplement.

We suppose here that for each was drawn from a separate independent Gaussian process, . We also suppose that is identically , and that , for some given and . We then define for each . Typically, one would not have a strong prior belief as to the direction of the bias inherent in an information source, and so we set . (If one does have a strong prior opinion that an information source is biased in one direction, then one may instead set

to a constant estimated using maximum a posteriori estimation.) With this modeling choice, we see that the mean of

with mean function  and covariance kernel  is given by

for each , since  holds. Additionally, for  and ,

where  denotes Kronecker’s delta, and where we have used independence of , , and .

3 The Value of Information Analysis

Our optimization algorithm proceeds in rounds, where in each round it selects a design  and an information source  with . The goal is to find an  that maximizes  over 

. Let us assume for the moment that query costs are uniform over the whole domain and all information sources; we will show how to remove this assumption later. Further, assume that we have already sampled 

points  and made the observations . Finally, denote by  the expected value according to the posterior distribution given  and  and shorthand . Since that distribution is normal, the best expected objective value of any design, as estimated by our statistical model, is . If we were to pick an  now irrevocably, then we would select an  of maximum expectation. This motivates choosing the next design  and information source  that we will sample such that we maximize , or equivalently maximize the expected gain over the current optimum by . Note that the equivalence of the maximizers for both expressions follows immediately from the observation that  is a constant for all  given  and .

Next we show how the assumption made at the beginning of this section, that query costs are uniform across the domain and for all information sources, can be removed. To this end, we associate a query cost function  with each information source  for . Then our goal becomes to find a sample  whose value of information divided by its respective query cost is maximum. The gist is that conditioned on any , the expected gain of all  is scaled by . Then the cost-sensitive Knowledge Gradient policy picks a sample  that maximizes the expectation

(1)

denoted by . Our task is to compute , which is a nested optimization problem.

To make this task feasible in practice, we discretize the domain of the inner maximization problem stated in Eq. (1): for simplicity, we choose the discrete set  via a Latin Hypercube design. Alternatively, one could scatter the points to emphasize promising areas of  and resample regularly. Now we have reduced the inner optimization problem for each information source to the setting of Frazier et al. [2009] who showed how to compute the value of information over a discrete set if there is only one information source and query costs are uniform.

We provide an outline and refer to their article for details. For their setting let  be the vector of posterior means for  and define for each  , where  is the posterior covariance matrix and  with a one for  and zeros elsewhere. Given these vectors, observe that

where  for vectors , and 

is a one-dimensional standard normal random variable. Frazier et al. show how to compute 

.

Thus, following our initial considerations, we approximate the cost-sensitive Knowledge Gradient by maximizing  over , i.e. the outer optimization problem is still formulated over . Note that we can compute the gradient of  with respect to  assuming that  is differentiable (e.g., when given by a suitable Gaussian process). Thus, we may apply a multi-start gradient-based optimizer to compute .

We summarize our algorithm misoKG:

  1. Using samples from all information sources, estimate hyper-parameters of the Gaussian process prior as described in the online supplement.

    Then calculate the posterior  based on the prior and samples.

  2. Until the budget for samples is exhausted do:

    Determine the information source  and the design  that maximize the cost-sensitive Knowledge Gradient proposed in Eq. (1) and observe .

    Update the posterior distribution with the new observation.

  3. Return the point with the largest estimated value according to the current posterior .

4 Numerical Experiments

We demonstrate the performance of the proposed multi-information source optimization algorithm, misoKG, by comparing it with the state-of-the-art Bayesian optimization algorithms for MISO problems. The statistical model and the value of information analysis were implemented in Python 2.7 and C++ using the functionality provided by the Metrics Optimization Engine MOE .

Benchmark Algorithms.

The first benchmark method, MTBO+, is an improved version of Multi-Task Bayesian Optimization proposed by Swersky et al. [2013]. It uses a cost-sensitive version of Entropy Search as acquisition function that picks samples to maximize the information gain over the location of the optimum of the objective function, normalized by the respective query cost (see their paper for details). MTBO combines acquisition function with a “multi-task” Gaussian process model that captures the relationships between information sources (the “tasks”) and the objective function. Following a recommendation of Snoek 2016, our implementation MTBO+ uses an improved formulation of the acquisition function given by Hernández-Lobato et al. [2014], Snoek and et al. , but otherwise is identical to MTBO; in particular, it uses the statistical model of Swersky et al. [2013].

The other algorithm, misoEI of Lam et al. [2015], was developed to solve MISO problems that involve model discrepancy and therefore is a good competing method to compare with. It maintains a separate Gaussian process for each information source: to combine this knowledge, the corresponding posterior distributions are fused for each design via Winkler’s method (1981) into a single intermediate surrogate, which is a normally distributed random variable. Then Lam et al. adapt the Expected Improvement (EI) acquisition function to select the design which is to be sampled next: for the sake of simplicity, assume that observations are noiseless and that  is the objective value of a best sampled design. If  denotes a Gaussian random variable with the posterior distribution of the objective value for design , then  is the expected improvement for , and the EI acquisition function selects an  that maximizes this expectation. Based on this decision, the information source to invoke is chosen by a heuristic that aims at maximizing the EI per unit cost.

The Experimental Setups.

We conducted numerical experiments on the following test problems: the first is the 2-dimensional Rosenbrock function which is a standard benchmark in the literature, tweaked into the MISO setting by Lam et al. [2015]. The second is a MISO benchmark proposed by Swersky et al. [2013]: the goal is to optimize the four hyper-parameters of a machine learning algorithm, using a small, related set of smaller images as cheap information source. The third is an assemble-to-order problem introduced by Hong and Nelson [2006]: here the objective is to optimize an -dimensional target stock vector in order to maximize the expected daily profit of a company, for which an estimate is provided as an output by their simulator.

In MISO settings the amount of initial data that one can use to inform the methods about each information source is typically dictated by the application, in particular by resource constraints and the availability of the respective source. In our experiments all methods were given identical initial datasets for each information source in every replication; these sets were drawn randomly via Latin Hypercube designs. For the sake of simplicity, we provided the same number of points for each , deliberately set in advance to 2.5 points per dimension of the design space . Regarding the kernel and mean function, MTBO+ uses the settings provided in [Snoek and et al., ]. The other algorithms used the squared exponential kernel and a constant mean function set to the average of a random sample.

We report the “gain” over the best initial solution, that is the true objective value of the respective design that a method would return at each iteration minus the best value in the initial data set. If the true objective value is not known for a given design, we report the value obtained from the information source of highest fidelity. This gain is plotted as a function of the total cost, that is the cumulative cost for invoking the information sources plus the fixed cost for the initial data; this metric naturally generalizes the number of function evaluations prevalent in Bayesian optimization. Note that the computational overhead of choosing the next information source and sample is omitted, as it is negligible compared to invoking an information source in real-world applications. Error bars are shown at the mean plus and minus

two standard errors

averaged over at least 100 runs of each algorithm. Note that even for deterministic sources a tiny observational noise of  is supposed to avoid numerical issues during matrix inversion.

4.1 The Rosenbrock Benchmarks

We consider the design space , and information sources. is the Rosenbrock function  plus optional Gaussian noise, and equals  with an additional oscillatory component:

(2)

where , and is i.i.d. noise drawn from the standard normal distribution. and are configuration constants. We suppose that  is not subject to observational noise, hence the uncertainty only originates from the model discrepancy. We experimented on two different configurations to gain a better insight into characteristics of the algorithms. Since Lam et al. reported a good performance of their method on (2), we replicated their experiment using the same parameters to compare the performance of the four methods: that is, we set , . Replicating the setting in [Lam et al., 2015, p. 15], we also suppose a tiny uncertainty for , although it actually outputs the truth, and set  and  for all . Furthermore, we assume a cost of  for each query to  and of  for .

Since all methods converged to good solutions within few queries, we investigate the ratio of gain to cost: Fig. 1 (t) displays the gain of each method over the best initial solution as a function of the total cost, that is the cost of the initial data and the query cost accumulated by the acquisition functions.

Figure 1: (t) The Rosenbrock benchmark with the parameter setting of Lam et al. [2015]. (b) The Rosenbrock benchmark with the alternative setup. misoKG offers an excellent gain-to-cost ratio and outperforms its competitors substantially.

We see that the new method misoKG offers a better gain per unit cost, typically finding an almost optimal solution within  samples. We note that misoKG relies only on cheap samples, therefore managing the uncertainties successfully. This is also true for MTBO+ that obtains a slightly worse solution. misoEI on the other hand queries the expensive truth to find the global optimum, thereby accumulating considerably higher cost.

For the second setup, we make the following changes: we set  and , and suppose for  uniform observational noise of  and uniform query cost . Note that now the difference of the costs of both sources is much smaller, while their uncertainties are considerably bigger. The results are displayed in Fig. 1 (b): as for the first configuration, misoKG outperforms the other methods, making efficient use of the cheap biased information source. In fact, the relative difference in performance is even larger, which might suggest that misoKG handles the observational noise slightly better than its competitors for this benchmark.

4.2 The Image Classification Benchmark

This classification problem was introduced by Swersky et al. [2013] to demonstrate the performance of their MTBO

algorithm. The goal is to optimize hyper-parameters of the Logistic Regression algorithm 

Theano in order to minimize the classification error on the MNIST dataset of LeCun et al.

. The weights are trained using a stochastic gradient method with mini-batches. We have four hyper-parameters: the learning rate, the L2-regularization parameter, the batch size, and the number of epochs. The MNIST dataset contains 70,000 grayscale images of handwritten digits, where each image consists of 784 pixels. In the experimental setup information source 

corresponds to invoking the machine learning algorithm on this dataset.

Following Swersky et al., the USPS dataset serves as cheap information source : this set comprises only about 9000 images of handwritten digits that are also smaller, only 256 pixels each USPS . Again we suppose a tiny observational noise of  and set the invocation costs of the sources to 4.5 for  and 43.69 for . A closer examination shows that  is subject to considerable bias with respect to , making it a challenge for MISO algorithms.

Initially, misoKG and MTBO+ are on par and both outperform misoEI (cp. Fig.2 (t)). In order to study the convergence behavior, we evaluated misoKG and MTBO+ for 150 steps, with a lower number of replications but using the same initial data for each replication. We observe that misoKG usually achieves an optimal testerror of about  on the MNIST testset after about 80 queries to information sources (see Fig.2 (b)). In its late iterations MTBO+ achieves a worse performance than misoKG has at the same costs. Note that the experimental results of Swersky et al. [2013] show that MTBO+ will also converge to the optimum eventually.

Figure 2: The performance on the image classification benchmark with significant model discrepancy. (t) The first 50 steps of each algorithm: misoKG and MTBO+ perform better than misoEI. (b) The first 150 steps of misoKG and MTBO+. While the initial performance of misoKG and MTBO+ is comparable, misoKG achieves better testerrors after about 80 steps and converges to the global optimum.

4.3 The Assemble-To-Order Benchmark

In the assemble-to-order (ATO) benchmark, a reinforcement learning problem from a business application, we are managing the inventory of a company that manufactures  products. Each of these products is made from a selection from  items, where we distinguish for each product between key items and non-key items: if the company runs out of key items, then it cannot sell the respective products until it has restocked its inventory; non-key items are optional and used if available. When the company sends a replenishment order, the required item is delivered after a random period whose distribution is known. Since items in the inventory inflict holding cost, our goal is to find an optimal target inventory level vector  that determines the amount of each item we want to stock, such that we maximize the expected profit per day (cp. Hong and Nelson [2006] for details).

Hong and Nelson proposed a specific scenario with  different products that depend on a subset of  items, thus our task is to optimize the -dimensional target vector . For each such vector their simulator provides an estimate of the expected daily profit by running the simulation for a variable number of replications (see “runs” in Table 1). Increasing this number yields a more accurate estimate but also has higher computational cost. The observational noise and query cost, i.e. the computation time of the simulation, are estimated from samples for each information source, assuming that both functions are constant over the domain for the sake of simplicity.

There are two simulators for this assemble-to-order setting that differ subtly in the model of the inventory system. However, the effect in estimated objective value is significant: on average the outputs of both simulators at the same target vector differ by about  of the score of the global optimum, which is about , whereas the largest observed bias out of  random samples was . Moreover, the sample variance of the difference between the outputs of both simulators is about . Thus, we are witnessing a significant model discrepancy. We set up three information sources (cp. Table 1): and  use the simulator of Xie et al. [2012], whereas the cheapest source  invokes the implementation of Hong and Nelson. We assume that  models the truth.

Figure 3: The performance on the assemble-to-order benchmark with significant model discrepancy. misoKG has the best gain per cost ratio among the algorithms.
# Runs Noise Variance Cost
Table 1: The Parameters for the ATO problem

Fig. 3 displays the performance over the first 150 steps for misoKG and MTBO+ and the first 50 steps of misoEI, all averaged over 100 runs. misoKG has the best start and dominates in particular MTBO+ clearly. misoKG averages at a gain of , but inflicts only an average query cost of  to the information sources, excluding the fixed cost for the initial datasets that are identical for all algorithms for the moment. This is only  of the query cost that misoEI requires to achieve a comparable score. Interestingly, misoKG and MTBO+ utilize in their optimization of the target inventory level vector mostly the cheap, biased source, and therefore are able to obtain significantly better gain per cost ratios than misoEI.

Looking closer, we see that typically misoKG’s first call to  occurs after about  steps. In total, misoKG queries  about ten times within the first  steps; in some replications misoKG makes one late call to  when it has already converged. Our interpretation is that misoKG exploits the cheap, biased  to zoom in on the global optimum and switches to the unbiased but noisy  to identify the optimal solution exactly. This is the expected (and desired) behavior for misoKG when the uncertainty of  for some  is not expected to be reduced sufficiently by queries to .

MTBO+ trades off the gain versus cost differently: it queries  once or twice after  steps and directs all other queries to , which might explain the observed lower performance. misoEI that employs a two-step heuristic for trading off predicted gain and query cost chose almost always the most expensive simulation to evaluate the selected design.

Acknowledgments

The authors were partially supported by NSF CAREER CMMI-1254298, NSF CMMI-1536895, NSF IIS-1247696, AFOSR FA9550-12-1-0200, AFOSR FA9550-15-1-0038, and AFOSR FA9550-16-1-0046.

References

  • Allaire and Willcox [2014] D. Allaire and K. Willcox. A mathematical and computational framework for multifidelity design and analysis with computer models. International Journal for Uncertainty Quantification, 4(1), 2014.
  • Balabanov and Venter [2004] V. Balabanov and G. Venter. Multi-fidelity optimization with high-fidelity analysis and low-fidelity gradients. In 10th AIAA/ISSMO Multidisciplinary Analysis and Optimization Conference, 2004.
  • Bonilla et al. [2007] E. V. Bonilla, K. M. Chai, and C. Williams. Multi-task gaussian process prediction. In Advances in Neural Information Processing Systems, pages 153–160, 2007.
  • Eldred et al. [2004] M. S. Eldred, A. A. Giunta, and S. S. Collis. Second-order corrections for surrogate-based optimization with model hierarchies. In Proceedings of the 10th AIAA/ISSMO Multidisciplinary Analysis and Optimization Conference, pages 2013–2014, 2004.
  • Frazier et al. [2009] P. I. Frazier, W. B. Powell, and S. Dayanik. The Knowledge Gradient Policy for Correlated Normal Beliefs. INFORMS Journal on Computing, 21(4):599–613, 2009.
  • Goovaerts [1997] P. Goovaerts. Geostatistics for Natural Resources Evaluation. Oxford University, 1997.
  • Gratiet and Cannamela [2015] L. L. Gratiet and C. Cannamela. Cokriging-based sequential design strategies using fast cross-validation techniques for multi-fidelity computer codes. Technometrics, 57(3):418–427, 2015.
  • Hennig and Schuler [2012] P. Hennig and C. J. Schuler. Entropy search for information-efficient global optimization. The Journal of Machine Learning Research, 13(1):1809–1837, 2012.
  • Hernández-Lobato et al. [2014] J. M. Hernández-Lobato, M. W. Hoffman, and Z. Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. In Advances in Neural Information Processing Systems, pages 918–926, 2014.
  • Hong and Nelson [2006] L. J. Hong and B. L. Nelson. Discrete optimization via simulation using compass. Operations Research, 54(1):115–129, 2006.
  • Huang et al. [2006a] D. Huang, T. Allen, W. Notz, and R. Miller. Sequential kriging optimization using multiple-fidelity evaluations. Structural and Multidisciplinary Optimization, 32(5):369–382, 2006a.
  • Huang et al. [2006b] D. Huang, T. T. Allen, W. I. Notz, and N. Zeng. Global Optimization of Stochastic Black-Box Systems via Sequential Kriging Meta-Models. Journal of Global Optimization, 34(3):441–466, 2006b.
  • Jones et al. [1998] D. R. Jones, M. Schonlau, and W. J. Welch. Efficient Global Optimization of Expensive Black-Box Functions. Journal of Global Optimization, 13(4):455–492, 1998.
  • Kandasamy et al. [2016] K. Kandasamy, G. Dasarathy, J. B. Oliva, J. Schneider, and B. Poczos. Multi-fidelity gaussian process bandit optimisation. In Advances in Neural Information Processing Systems, 2016.
  • Kennedy and O’Hagan [2000] M. C. Kennedy and A. O’Hagan. Predicting the output from a complex computer code when fast approximations are available. Biometrika, 87(1):1–13, 2000.
  • Klein et al. [2016] A. Klein, S. Falkner, S. Bartels, P. Hennig, and F. Hutter. Fast bayesian optimization of machine learning hyperparameters on large datasets. CoRR, abs/1605.07079, 2016.
  • Lam et al. [2015] R. Lam, D. Allaire, and K. Willcox. Multifidelity optimization using statistical surrogate modeling for non-hierarchical information sources. In 56th AIAA/ASCE/AHS/ASC Structures, Structural Dynamics, and Materials Conference, 2015.
  • [18] Y. LeCun, C. Cortes, and C. J. Burges.

    The MNIST database of handwritten digits.

    http://yann.lecun.com/exdb/mnist/. Last Accessed on 10/09/2016.
  • March and Willcox [2012] A. March and K. Willcox. Provably convergent multifidelity optimization algorithm not requiring high-fidelity derivatives. AIAA Journal, 50(5):1079–1089, 2012.
  • [20] MOE. Metrics optimization engine. http://yelp.github.io/MOE/. Last Accessed on 10/04/2016.
  • Picheny et al. [2013] V. Picheny, D. Ginsbourger, Y. Richet, and G. Caplin. Quantile-based optimization of noisy computer experiments with tunable precision. Technometrics, 55(1):2–13, 2013.
  • Queipo et al. [2005] N. V. Queipo, R. T. Haftka, W. Shyy, T. Goel, R. Vaidyanathan, and P. K. Tucker. Surrogate-based analysis and optimization. Progress in aerospace sciences, 41(1):1–28, 2005.
  • Rajnarayan et al. [2008] D. Rajnarayan, A. Haas, and I. Kroo. A multifidelity gradient-free optimization method and application to aerodynamic design. In Proceedings of the 12th AIAA/ISSMO Multidisciplinary Analysis and Optimization Conference, number 6020, 2008.
  • Rasmussen and Williams [2006] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006. ISBN ISBN 0-262-18253-X.
  • Scott et al. [2011] W. R. Scott, P. I. Frazier, and W. B. Powell. The correlated knowledge gradient for simulation optimization of continuous parameters using gaussian process regression. SIAM Journal on Optimization, 21(3):996–1026, 2011.
  • Snoek [2016] J. Snoek. Personal communication, 2016.
  • [27] J. Snoek and et al. Spearmint. http://github.com/HIPS/Spearmint. Last Accessed on 10/04/2016.
  • Swersky et al. [2013] K. Swersky, J. Snoek, and R. P. Adams. Multi-task bayesian optimization. In Advances in Neural Information Processing Systems, pages 2004–2012, 2013.
  • Teh et al. [2005] Y.-W. Teh, M. Seeger, and M. Jordan. Semiparametric latent factor models. In Artificial Intelligence and Statistics 10, 2005.
  • [30] Theano. Theano: Logistic regression. http://deeplearning.net/tutorial/code/logistic_sgd.py. Last Accessed on 10/08/16.
  • [31] USPS. USPS dataset. http://mldata.org/repository/data/viewslug/usps/. Last Accessed on 10/09/2016.
  • Villemonteix et al. [2009] J. Villemonteix, E. Vazquez, and E. Walter. An informational approach to the global optimization of expensive-to-evaluate functions. Journal of Global Optimization, 44(4):509–534, 2009.
  • Winkler [1981] R. L. Winkler.

    Combining probability distributions from dependent information sources.

    Management Science, 27(4):479–488, 1981.
  • Xie et al. [2012] J. Xie, P. I. Frazier, and S. Chick. Assemble to order simulator. http://simopt.org/wiki/index.php?title=Assemble_to_Order&oldid=447, 2012. Last Accessed on 10/02/2016.

Appendix A The Model Revisited

a.1 Correlated Model Discrepancies

Next we demonstrate that our approach is flexible and can easily be extended to scenarios where some of the information sources have correlated model discrepancies. This arises for hyper-parameter tuning if the auxiliary tasks are formed from data that was collected in batches and thus is correlated over time (see Sect. 1 for a discussion). In engineering sciences we witness this if some sources share a common modeling approach, as for example, if one set of sources for an airfoil modeling problem correspond to different discretizations of a PDE that models wing flutter, while another set provides various discretizations of another PDE that modeling airflow. Two information sources that solve the same PDE will be more correlated than two that solve different PDEs.

For example, let  denote a partition of  and define the function  that gives for each IS its corresponding partition in . Then we suppose an independent Gaussian process  for each partition. (Note that in principle we could take this approach further to arbitrary sets of . However, this comes at the expense of a larger number of hyper-parameters that need to be estimated.) Again our approach is to incorporate all Gaussian processes into a single one with prior distribution :111For simplicity we reuse the notation from the first model to denote their pendants in this model. therefore, for all  and  we define , where  is the objective function that we want to optimize. Due to linearity of expectation, we have

since . Recall that the indicator variable  denotes Kronecker’s delta. Let  and , then we define the following composite covariance function :

a.2 Estimation of Hyper-Parameters

In this section we detail how to set the hyper-parameters via maximum a posteriori (MAP) estimation and propose a specific prior that has proven its value in our application and thus is of interest in its own right.

In typical MISO scenarios little data is available, that is why we suggest MAP estimates that in our experience are more robust than maximum likelihood estimates (MLE) under these circumstances. However, we wish to point out that we observed essentially the same performances of the algorithms when conducting the Rosenbrock and Assemble-to-Order benchmarks with maximum likelihood estimates for the hyper-parameters.

In what follows we use the notation introduced in Sect. 2. One would suppose that the functions  and  with  belong to some parameterized class: for example, one might set  and each  to constants, and suppose that  each belong to the class of Matérn covariance kernels (cp. Sect. 4 for the choices used in the experimental evaluation). The hyper-parameters are fit from data using maximum a posteriori (MAP) estimation; note that this approach ensures that covariances between information sources and the objective function are inferred from data.

For a Matérn kernel we have to estimate  hyper-parameters for each information source (see next subsection): length scales and the signal variance. We suppose a normal prior  for hyper-parameter . Let  be a set of points, for example chosen via a Latin Hypercube design, and evaluate every information source at all points in . We estimate the hyper-parameters for  and the  for , using the “observations”  for the . The prior mean of the signal variance parameter of  is set to the variance of the observations at  minus their average observational noise. The mean for the signal variance of  with  is obtained analogously using the “observations” in ; here we subtract the mean noise variance of the observations at  and the mean noise at , exploiting the assumption that observational noise is independent. Regarding the means of the priors for length scales, we found it useful to set each prior mean to the length of the interval that the corresponding parameter is optimized over. For all hyper-parameters  we set the variance of the prior to , where  is the mean of the prior.

a.3 How to Express Beliefs on Fidelities of Information Sources

In many applications one has beliefs about the relative accuracies of information sources. One approach to explicitly encode these is to introduce a new coefficient  for each  that typically would be fitted from data along with the other hyper-parameters. But we may also set it at the discretion of a domain expert, which is particularly useful if none of the information sources is an unbiased estimator and we rely on regression to estimate the true objective. In case of the squared exponential kernel this coefficient is sometimes part of the formulation and referred to as “signal variance” (e.g., see [Rasmussen and Williams, 2006, p. 19]). For the sake of completeness, we detail the effect for our model of uncorrelated information sources stated in Sect. 2. Recall that we suppose  with a mean function and covariance kernel , and observe that the introduction of the new coefficient  does not affect . But it changes  to

We observe that setting  to a larger value results in a bigger uncertainty. The gist is that then samples from such an information source have less influence in the Gaussian process regression (e.g., see Eq. (A.6) on pp. 200 in Rasmussen and Williams [2006]). It is instructive to consider the case that we observe a design  at a noiseless and deterministic information source: then its observed output coincides with  (with zero variance). Our estimate  for , however, is a Gaussian random variable whose variance depends (in particular) on the uncertainty of the above information source as encoded in , since  holds.

Appendix B Parallel Computation of the Cost-Sensitive Knowledge Gradient

In Sect. 3 we detailed how the cost-sensitive Knowledge Gradient can be computed. In particular, we discretized the inner optimization problem in Eq. (1) to obtain

Then we suggested exploiting the gradient of the CKG factor to obtain the next sample point and information source that maximize the expected gain per unit cost.

While we found this approach to work well in our experimental evaluation, there are scenarios where it is beneficial to also discretize the outer maximization problem  and find the best  by enumerating all CKG factors, for example when the CKG over  has many local maxima and therefore gradient-based optimization techniques may fail to find the best sample location. However, the running time of this approach is , and therefore may become a bottleneck in probing the domain with high accuracy. We note in passing that the logarithmic factor can be shaved off by a suitable sorting algorithm that exploits the property that we are sorting numbers and hence runs in time . In this section we propose a parallel algorithmic solution of essentially linear speed up, that makes efficient use of modern multi-core architectures.

We present two ideas to improve the scalability: the first stems from the observation that the computations for different choices of the next sample decision  are independent and thus can be done in parallel; the only required communication is to scatter the data and afterwards determine the best sample, hence the speedup is essentially linear. The second optimization is more intricate: we also parallelize the computation of the value of information for each . Thus, we offer two levels of parallelization that can be used separately or combined to efficiently utilize several multi-core CPUs of a cluster.

First recall that the query cost only depends on the choice of and thus can be trivially incorporated; we omit it in the sequel for a more compact presentation. Note that the same set of discrete points  is used in the inner and the outer maximization problem only for the sake of simplicity; we could also use different sets and additionally exploit potential benefits of choosing them adaptively. Moreover, let  be a bijection and define the -dimensional vectors  and  as  and  respectively, where  is the analog of  in [Frazier et al., 2009] (see also Sect. 3). Then we can define and using 

and thus observe that

Comparing these ideas to [Frazier et al., 2009], the first modification corresponds to a parallel execution of the outer loop of Algorithm 2 of Frazier et al. [2009] that computes the -function, whereas the second aims at parallelizing each iteration of the loop itself. Due to space constraints we only provide a sketch of the parallel algorithm here and assume familiarity with the algorithm of Frazier et al. [2009].

We begin its computation by sorting the entries of  and  in parallel by ascending -value; if multiple entries have the same , we only keep the entry with largest value in  and discard all others, since they are dominated [Frazier et al., 2009]. W.l.o.g. we assume in the sequel that both vectors do not contain dominated entries and that  whenever  for . However, there is another type of domination that is even more important: for each  it is sufficient to find the , which is equivalent to removing those  that never maximize  for any . Let  be the number of sample points that remain after this step and consider the sequence  with  for , and . We observe that the intervals between these points uniquely determine the respective  for all  in that interval.

In order to parallelize Algorithm 1 of Frazier et al. [2009] that determines un-dominated candidates for the next sample point (also called alternatives), we divide the previously sorted sequence  into  subsequences, where  is the number of cores we wish to use. After running the linear scan algorithm of Frazier et al. [2009] on each subsequence separately and in parallel, we “merge” adjacent subsequences pairwise in parallel: when merging two sequences, say  and , where  contains the smaller elements, it suffices to search for the rightmost element in  not dominated by the leftmost one in  (or vice versa). The reason is that we have ensured previously that no element is dominated by another within each subsequence.

After at most  merging rounds, each core has determined which of the elements in its respective subsequence of  are not dominated as required by Algorithm 2. Note that the “merging” procedure does not require actually transfer the elements among cores. Hence the final step, the summation in Eq. (14) on p. 605 in [Frazier et al., 2009] that calculates , is trivial to parallelize.