Optimizing an unknown function that is expensive to evaluate is a common problem in real applications. Examples include experimental design for protein engineering, where chemists need to synthesize designed amino acid sequences and then test whether they satisfy certain properties (Romero et al., 2013); or black-box optimization for material science, where scientists need to run extensive computational experiments at various levels of accuracy to find the optimal material design structure (Fleischman et al., 2017). Conducting real experiments could be labor-intensive and time-consuming. In practice, we would like to look for alternative ways to gather information so that we can make the most effective use of real experiments that we do conduct. A natural candidate is computer simulation (van Gunsteren & Berendsen, 1990), which tends to be less time consuming but produces less accurate results. For example, computer simulation is ubiquitous in robotic applications, e.g. we test a control policy first in simulation before deploying it in a real physical system (Marco et al., 2017).
The central challenge in efficiently using multiple sources of information is captured in the general framework of multi-fidelity optimization (Forrester et al., 2007; Kandasamy et al., 2016, 2017; Marco et al., 2017; Sen et al., 2018), where multiple functions with varying degrees of accuracy and cost are leveraged to provide the maximal amount of information. However, existing approaches rely on strong assumptions, such as strict relations between the quality and the cost of a lower fidelity function, or on two-stage query selection criteria (cf. §2.2); these are likely to limit their practical use and lead to sub-optimal selections.
In this paper, we propose a general and principled multi-fidelity Bayesian optimization framework, MF-MI-Greedy (Multi-fidelity Mutual Information Greedy), that prioritizes maximizing the amount of mutual information gathered across fidelity levels. Figure 1 captures the intuition of maximizing mutual information: gathering information from a lower fidelity also conveys information on the target fidelity. We make this idea concrete in §4. Our method improves upon prior work on multi-fidelity Bayesian optimization by establishing explicit connections across fidelity levels to enable joint posterior updates and hyperparameter optimization. In summary, our contributions in this paper are
2 Background and Related Work
In this section, we review related work on Bayesian optimization with Gaussian processes.
2.1 Background on Gaussian Processes
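A Gaussian process (GP) (Rasmussen & Williams, 2006) models an infinite collection of random variables, each indexed by an $x \in \mathcal{X}$, such that every finite subset of random variables has a multivariate Gaussian distribution. The GP distribution $\mathcal{GP}(\mu(x), k(x, x'))$ is a joint Gaussian distribution over all those (infinitely many) random variables, specified by its mean $\mu(x) = \mathbb{E}[f(x)]$ and covariance (also known as kernel) function $k(x, x') = \mathbb{E}[(f(x) - \mu(x))(f(x') - \mu(x'))]$.
A key advantage of GPs is that inference is very efficient. Assume that $f \sim \mathcal{GP}(\mu(x), k(x, x'))$ is a sample from the GP distribution, and that $y = f(x) + \epsilon(x)$ is a noisy observation of the function value $f(x)$. Here, the noise $\epsilon(x) \sim \mathcal{N}(0, \sigma^2(x))$ could depend on the input $x$. Suppose that we have selected $A = \{x_1, \dots, x_n\} \subseteq \mathcal{X}$ and received $\mathbf{y}_A = [y_1, \dots, y_n]^\top$. We can obtain the posterior mean $\mu(x \mid A)$ and covariance $k(x, x' \mid A)$ of the function through the covariance matrix $K_A = [k(x_i, x_j) + \delta_{ij}\sigma^2(x_i)]_{i,j}$ and $k_A(x) = [k(x_1, x), \dots, k(x_n, x)]^\top$:
$$\mu(x \mid A) = \mu(x) + k_A(x)^\top K_A^{-1} (\mathbf{y}_A - \boldsymbol{\mu}_A) \tag{1}$$
$$k(x, x' \mid A) = k(x, x') - k_A(x)^\top K_A^{-1} k_A(x') \tag{2}$$
where $\boldsymbol{\mu}_A = [\mu(x_1), \dots, \mu(x_n)]^\top$ and $\delta_{ij}$ is the Kronecker delta function.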
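The posterior update described above can be illustrated with a short NumPy sketch. It is only a minimal example under stated assumptions: a squared exponential kernel, a constant prior mean, and input-independent noise. The function names `rbf_kernel` and `gp_posterior` are hypothetical and not part of the paper.

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    """Squared exponential kernel between two sets of row-vector inputs."""
    sq_dists = (np.sum(X1**2, axis=1)[:, None]
                + np.sum(X2**2, axis=1)[None, :]
                - 2.0 * X1 @ X2.T)
    return variance * np.exp(-0.5 * np.maximum(sq_dists, 0.0) / lengthscale**2)

def gp_posterior(X_train, y_train, X_test, noise_var=1e-2, prior_mean=0.0):
    """Posterior mean and covariance at X_test given noisy observations.

    Assumes a constant prior mean and homoscedastic noise (illustrative choices).
    """
    K = rbf_kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
    K_s = rbf_kernel(X_train, X_test)              # cross-covariance, (n_train, n_test)
    L = np.linalg.cholesky(K)                      # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train - prior_mean))
    mean = prior_mean + K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    cov = rbf_kernel(X_test, X_test) - v.T @ v
    return mean, cov

X_train = np.array([[0.0], [1.0], [2.0]])
y_train = np.sin(X_train).ravel()
X_test = np.array([[1.0], [5.0]])                  # one observed input, one far away
mean, cov = gp_posterior(X_train, y_train, X_test, noise_var=1e-6)
```

With nearly noise-free observations, the posterior mean at an observed input matches the observation and its variance collapses, while a far-away input keeps close to the prior variance.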
2.2 Bayesian Optimization via Gaussian Processes
Single-fidelity Gaussian Process optimization
Optimizing an unknown and noisy function is a common task in Bayesian optimization. In real applications, such functions tend to be expensive to evaluate, for example when tuning hyperparameters for deep learning models (Snoek et al., 2012), so the number of queries should be minimized. The Gaussian process (GP) (Rasmussen & Williams, 2006) is an expressive and flexible tool for modeling a large class of such unknown functions. A classical method for Bayesian optimization with GPs is GP-UCB (Srinivas et al., 2010), which treats Bayesian optimization as a multi-armed bandit problem and proposes an upper-confidence bound based algorithm for query selection. The authors provide a theoretical bound on the cumulative regret that is connected with the amount of mutual information gained through the queries. Contal et al. (2014) directly incorporate mutual information into the UCB framework and demonstrate the empirical value of their method.
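As a small illustration of upper-confidence-bound query selection, the sketch below picks the candidate that maximizes the posterior mean plus a scaled posterior standard deviation. The function name `gp_ucb_select` and the example numbers are hypothetical; the schedule for the exploration weight `beta` is left to the caller and is not the specific schedule analyzed by Srinivas et al. (2010).

```python
import numpy as np

def gp_ucb_select(mean, std, beta):
    """Return the index of the candidate maximizing mean + sqrt(beta) * std."""
    ucb = mean + np.sqrt(beta) * std
    return int(np.argmax(ucb))

# Posterior mean and standard deviation at three candidate inputs (made-up values).
mean = np.array([0.2, 0.5, 0.4])
std = np.array([0.05, 0.10, 0.60])

greedy_choice = gp_ucb_select(mean, std, beta=0.0)   # pure exploitation
ucb_choice = gp_ucb_select(mean, std, beta=4.0)      # adds an exploration bonus of 2 * std
```

With `beta=0` the rule picks the candidate with the highest mean; a larger `beta` shifts the choice toward the most uncertain candidate.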
Entropy search (Hennig & Schuler, 2012) represents another class of GP-based Bayesian optimization approaches. Its main idea is to directly search for the global optimum of an unknown function through queries. Each query point is selected based on its informativeness in learning the location of the function optimum. Predictive entropy search (Hernández-Lobato et al., 2014) addresses some computational issues of entropy search by maximizing the expected information gain with respect to the location of the global optimum. Max-value entropy search (Wang et al., 2016; Wang & Jegelka, 2017) approaches the task differently: instead of searching for the location of the global optimum, it looks for the value of the global optimum. This effectively avoids issues related to the dimension of the search space, and the authors are able to provide a regret bound analysis that the previous two entropy search methods lack.
A computational consideration for learning with GPs concerns optimizing the specific kernels used to model the covariance structures of GPs. As this optimization task depends on the dimension of the feature space, approximation methods are needed to speed up the learning process. Random Fourier features (Rahimi & Recht, 2008) are efficient tools for dimension reduction and are employed in GP regression tasks (Lázaro-Gredilla et al., 2010). As elaborated in §4, our algorithmic framework offers the flexibility of choosing among different single-fidelity optimization approaches as a subroutine, so that one can take advantage of these computational and approximation algorithms for efficient optimization.
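To make the random Fourier feature idea concrete, the following sketch approximates a squared exponential kernel with an explicit finite-dimensional feature map. It is a minimal illustration, assuming a unit-variance squared exponential kernel; the function name `random_fourier_features` and all parameter values are hypothetical.

```python
import numpy as np

def random_fourier_features(X, num_features, lengthscale, rng):
    """Map inputs to features whose inner products approximate an RBF kernel."""
    d = X.shape[1]
    # Frequencies are drawn from the spectral density of the RBF kernel.
    W = rng.normal(scale=1.0 / lengthscale, size=(d, num_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)
    return np.sqrt(2.0 / num_features) * np.cos(X @ W + b)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
Z = random_fourier_features(X, num_features=5000, lengthscale=1.0, rng=rng)

K_approx = Z @ Z.T                                       # approximate kernel matrix
sq = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
K_exact = np.exp(-0.5 * sq)                              # exact RBF kernel matrix
max_error = float(np.max(np.abs(K_approx - K_exact)))
```

The approximation error shrinks as the number of random features grows, which trades accuracy for a linear model in the feature space.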
Multi-output Gaussian Process
Sometimes it is desirable to model multiple correlated outputs with Gaussian processes. Most GP-based multi-output models create correlated outputs by mixing a set of independent latent processes. A simple form of such a mixing scheme is the linear model of coregionalization (Teh et al., 2005; Bonilla et al., 2008), where each output is modeled as a linear combination of latent GPs with fixed coefficients. The dependencies among outputs are captured by sharing those latent GPs. More complex structures can be captured by a linear combination of GPs with input-dependent coefficients (Wilson et al., 2012), shared inducing variables (Nguyen & Bonilla, 2014), or convolved processes (Boyle & Frean, 2005; Alvarez & Lawrence, 2009). In comparison with single fidelity/output GPs, multi-output GPs often require more sophisticated approximate models for efficient optimization, e.g., using inducing points (Snelson & Ghahramani, 2007) to reduce the storage and computational complexity, and variational inference approaches to approximate the posterior of the latent processes (Titsias, 2009; Nguyen & Bonilla, 2014). While the analysis of our framework is not limited to a fixed structural assumption in modeling the joint distribution among multiple outputs, for efficiency we use a simple additive model between multiple fidelity outputs in our experiments (cf. §6.1) to demonstrate the effectiveness of the optimization framework.
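The linear model of coregionalization mentioned above can be sketched as follows: when each output is a fixed linear combination of independent latent GPs, the joint covariance is a sum of Kronecker products. The function names `rbf` and `lmc_covariance`, the mixing weights, and the lengthscales are illustrative assumptions.

```python
import numpy as np

def rbf(x1, x2, lengthscale):
    """Unit-variance squared exponential kernel on 1-D inputs."""
    return np.exp(-0.5 * (x1[:, None] - x2[None, :])**2 / lengthscale**2)

def lmc_covariance(X, mixing, lengthscales):
    """Joint covariance of P outputs built from Q independent latent GPs.

    mixing[i, q] is the weight of latent GP q in output i. The result has
    shape (P * N, P * N), with blocks ordered by output index.
    """
    P, Q = mixing.shape
    N = len(X)
    K = np.zeros((P * N, P * N))
    for q in range(Q):
        B_q = np.outer(mixing[:, q], mixing[:, q])   # coregionalization matrix
        K += np.kron(B_q, rbf(X, X, lengthscales[q]))
    return K

X = np.linspace(0.0, 1.0, 4)
mixing = np.array([[1.0, 0.0],     # output 0 uses only latent GP 0
                   [0.8, 0.5]])    # output 1 mixes both latent GPs
lengthscales = [1.0, 0.3]
K = lmc_covariance(X, mixing, lengthscales)
```

The off-diagonal block couples the two outputs only through the shared latent GP, which is how observations of one output inform the other.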
Multi-fidelity Bayesian optimization
Multi-fidelity optimization is a general framework that captures the trade-off between cheap low quality and expensive high quality data. Recently, there have been several works on using GPs to model functions of different fidelity levels. Recursive co-kriging (Forrester et al., 2007; Le Gratiet & Garnier, 2014) considers an autoregressive model for multi-fidelity GP regression, which assumes that the higher fidelity consists of a lower fidelity term and an independent GP term that models the systematic error in approximating the higher-fidelity output. Therefore, one can model the cross-covariance between the high-fidelity and low-fidelity functions using the covariance of the lower fidelity function only. Virtual vs Real (Marco et al., 2017) extends this idea to Bayesian optimization. The authors consider a two-fidelity setting (i.e., virtual simulation and real system experiments), where they model the correlation between the two fidelities through co-kriging, and then apply entropy search (ES) to optimize the target output. Zhang et al. (2017) model the dependencies between different fidelities with convolved Gaussian processes (Alvarez & Lawrence, 2009), and then apply predictive entropy search (PES) (Hernández-Lobato et al., 2014) for efficient exploration. Although both the ES and multi-fidelity PES heuristics have shown promising empirical results on some datasets, little is known about their theoretical performance.
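The autoregressive structure used in co-kriging can be illustrated with a two-fidelity sketch. It assumes the common form in which the high fidelity equals a scaled low fidelity plus an independent GP; the scaling factor `rho`, the kernels, and the function names are assumptions made for this example and are not taken from the cited papers.

```python
import numpy as np

def rbf(x1, x2, lengthscale=1.0):
    """Unit-variance squared exponential kernel on 1-D inputs."""
    return np.exp(-0.5 * (x1[:, None] - x2[None, :])**2 / lengthscale**2)

def ar1_blocks(x_low, x_high, rho, ls_low=1.0, ls_delta=0.5):
    """Covariance blocks for f_high = rho * f_low + delta (delta independent)."""
    K_ll = rbf(x_low, x_low, ls_low)
    # The cross-covariance involves only the low-fidelity kernel.
    K_hl = rho * rbf(x_high, x_low, ls_low)
    K_hh = rho**2 * rbf(x_high, x_high, ls_low) + rbf(x_high, x_high, ls_delta)
    return K_ll, K_hl, K_hh

x_low = np.array([0.0, 0.5, 1.0])    # low-fidelity query locations
x_high = np.array([0.5])             # high-fidelity location of interest
rho = 0.9
K_ll, K_hl, K_hh = ar1_blocks(x_low, x_high, rho)

# Posterior variance of f_high(0.5) after (almost) noise-free low-fidelity data.
noise = 1e-8
post_var = K_hh - K_hl @ np.linalg.solve(K_ll + noise * np.eye(len(x_low)), K_hl.T)
```

In this toy setting, observing the low fidelity at the same input removes the variance contributed by the shared low-fidelity term and leaves only the variance of the independent error term.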
Recently, Kandasamy et al. (2016) proposed Multi-fidelity GP-UCB (MF-GP-UCB), a principled framework for multi-fidelity Bayesian optimization with Gaussian processes. In follow-up work, Kandasamy et al. (2017) and Sen et al. (2018) address the disconnect among fidelity levels by considering a continuous fidelity space and performing joint updates to effectively share information among different fidelity levels.
In contrast to our general assumption on the joint distribution between different fidelities, Kandasamy et al. (2016) and Sen et al. (2018) assume a specific structure over multiple fidelities, where the cost of each lower fidelity is determined according to the maximal approximation error in function value when compared with the target fidelity. Kandasamy et al. (2017) consider a two-stage optimization process, where the action and the fidelity level are selected in two separate stages. We note that this procedure may lead to non-intuitive choices of queries: For example, in a pessimistic case where the low fidelity only differs from the target fidelity by a constant shift, their algorithm is likely to focus only on querying the target fidelity actions even though the low fidelity is as useful as the target fidelity. In contrast, as described in §4, our algorithm jointly selects a query point and a fidelity level so such sub-optimality can be avoided.
3 Problem Statement
We now introduce useful notations and formally state the problem studied in this paper.
Payoff function and auxiliary functions
Consider the problem of maximizing an unknown payoff function. We can probe the function by directly querying it at some input and obtaining a noisy observation, where the noise is i.i.d. Gaussian white noise. In addition to the payoff function, we are also given access to oracle calls to some unknown auxiliary functions; similarly, we obtain a noisy observation when querying an auxiliary function at some input. Here, each auxiliary function could be viewed as a low-fidelity version of the payoff function. For example, if the payoff function represents the actual reward obtained by running a real physical system with a given input, then an auxiliary function may represent the simulated payoff from a numerical simulator at a lower fidelity level.
Joint distribution on multiple fidelities
We assume that the multiple fidelities are mutually dependent through some fixed, (possibly) unknown joint probability distribution. In particular, we model this joint distribution with a multi-output Gaussian process; hence the marginal distribution on each fidelity is a separate GP, specified by the (prior) mean and covariance at that fidelity level.
Action, reward, and cost
Let us use to denote the action of querying at . Each action incurs a cost of , and achieves a reward
That is, performing (at the target fidelity) achieves a reward . We receive the minimal immediate reward with lower fidelity actions for , even though it may provide some information about and could thus lead to more informed decisions in the future. W.l.o.g., we assume that , and .
Let us encode an adaptive strategy for picking actions as a policy . In words, a policy specifies which action to perform next, based on the actions picked so far and the corresponding observations. We consider policies with a fixed budget . Upon termination, returns a sequence of actions , such that . Note that for a given policy , the sequence is random, dependent on the joint distribution and the (random) observations of the selected actions.
Given a budget on , our goal is to maximize the expected cumulative reward, so as to identify an action with performance close to as rapidly as possible. Formally, we seek
Problem 4 strictly generalizes the optimal value of information (VoI) problem (Chen et al., 2015) to the online setting. To see this, consider the scenario where for , and . To achieve a non-zero reward, a policy must pick as the last action before exhausting the budget . Therefore, our goal becomes to adaptively pick lower fidelity actions that are the most informative about under budget , which reduces to the optimal VoI problem.
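The bookkeeping behind this problem setting can be sketched in a few lines. The sketch assumes that a target-fidelity action earns the payoff value at the queried input, that lower-fidelity actions earn a constant minimal reward (taken to be 0 here), and that a sequence is truncated at the first action that would exceed the budget. The names `cumulative_reward`, `payoff`, and `costs` are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Action = Tuple[float, int]   # (input x, fidelity level)

def cumulative_reward(actions: List[Action],
                      payoff: Callable[[float], float],
                      costs: Dict[int, float],
                      target_level: int,
                      budget: float,
                      min_reward: float = 0.0) -> Tuple[float, float]:
    """Total reward and total cost of a budget-feasible prefix of `actions`."""
    spent = 0.0
    total = 0.0
    for x, level in actions:
        if spent + costs[level] > budget:
            break                       # the next action would exceed the budget
        spent += costs[level]
        # Only target-fidelity actions earn the payoff value.
        total += payoff(x) if level == target_level else min_reward
    return total, spent

payoff = lambda x: 1.0 - (x - 0.3) ** 2        # toy payoff, maximized at x = 0.3
costs = {1: 1.0, 2: 5.0}                       # level 2 is the target fidelity
actions = [(0.1, 1), (0.3, 2), (0.5, 1), (0.3, 2)]
total, spent = cumulative_reward(actions, payoff, costs, target_level=2, budget=11.0)
```

In this toy run the last target-fidelity action does not fit in the remaining budget, so only one target-fidelity reward is collected.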
4 The Multi-fidelity BO Framework
We now present MF-MI-Greedy, a general framework for multi-fidelity Gaussian process optimization. In a nutshell, MF-MI-Greedy attempts to balance the “exploratory” low-fidelity actions and the more expensive target fidelity actions, based on how much information (per unit cost) these actions could provide about the target fidelity function. Concretely, MF-MI-Greedy proceeds in rounds under a given budget . Each round can be divided into two phases: (i) an exploration (i.e., information gathering) phase, where the algorithm focuses on exploring the low fidelity actions, and (ii) an optimization phase, where the algorithm tries to optimize the payoff function by performing an action at the target fidelity. The pseudo-code of the algorithm is provided in Algorithm 1.
A key challenge in designing the algorithm is to decide when to stop exploration (or equivalently, to invoke the optimization phase). Note that this is analogous to the exploration-exploitation dilemma in the classical single-fidelity Bayesian optimization problems; the difference is that in the multi-fidelity setting, we have a more distinctive notion of “exploration”, and a more complicated structure of the action space (i.e., each exploration phase corresponds to picking a set of low fidelity actions). Furthermore, note that there is no explicit measurement of the relative “quality” of a low fidelity action, as they all have uniform reward by our modeling assumption (c.f. Eq. (3)); hence we need to design a proper heuristic to keep track of the progress of exploration.
We consider an information-theoretic selection criterion for picking low fidelity actions. The quality of a low fidelity action is captured by the information gain, defined as the amount of entropy reduction in the posterior distribution of the target payoff function, given the set of previously selected actions and the corresponding observation history. (An alternative, more aggressive information measurement is the information gain over the optimizer of the target function (Hennig & Schuler, 2012), or over the optimal value of the target function (Wang & Jegelka, 2017).) Given an exploration budget, our objective for a single exploration phase is to find a set of low-fidelity actions, which are (i) maximally informative about the target function, (ii) better than the best action on the target fidelity when considering the information gain per unit cost (otherwise, one would rather pick the target fidelity action to trade off exploration and exploitation), and (iii) not overly aggressive in terms of exploration (since we would also like to reserve a certain budget for performing target fidelity actions to gain reward).
Finding the optimal set of actions satisfying the above design principles is computationally prohibitive, as it requires us to search through a combinatorial (for finite discrete domains) or even infinite (for continuous domains) space. In Algorithm 2, we introduce Explore-LF, a key subroutine of MF-MI-Greedy, for efficient exploration on low fidelities. At each step, Explore-LF takes a greedy step w.r.t. the benefit-cost ratio over all actions. To ensure that the algorithm does not explore excessively, we consider the following stopping conditions: (i) when the budget is exhausted (Line 2), (ii) when a single target fidelity action is better than all the low fidelity actions in terms of the benefit-cost ratio (Line 2), and (iii) when the cumulative benefit-cost ratio is small (Line 2). Here, the parameter is set to be to ensure low regret, and we defer the detailed discussion of the choice of to §5.2.
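The greedy exploration step and its three stopping conditions can be sketched as below. This is only an illustrative loop in the spirit of Explore-LF, not the authors' Algorithm 2: the callables `info_gain` and `cost`, the `threshold` parameter, and the exact form of condition (iii) are assumptions, and the toy gains ignore the selection history, whereas a real information gain would depend on it.

```python
from typing import Callable, Hashable, List, Sequence

def explore_lf(low_actions: Sequence[Hashable],
               target_actions: Sequence[Hashable],
               info_gain: Callable[[Hashable, List[Hashable]], float],
               cost: Callable[[Hashable], float],
               budget: float,
               threshold: float) -> List[Hashable]:
    """Greedily pick low-fidelity actions by information gain per unit cost."""
    selected: List[Hashable] = []
    spent = 0.0
    gained = 0.0

    def ratio(a: Hashable) -> float:
        return info_gain(a, selected) / cost(a)

    while True:
        affordable = [a for a in low_actions
                      if a not in selected and spent + cost(a) <= budget]
        if not affordable:                           # (i) budget exhausted
            break
        best_low = max(affordable, key=ratio)
        best_target = max(target_actions, key=ratio)
        if ratio(best_target) >= ratio(best_low):    # (ii) target action is better
            break
        new_gained = gained + info_gain(best_low, selected)
        new_spent = spent + cost(best_low)
        if new_gained / new_spent < threshold:       # (iii) cumulative ratio too small
            break
        selected.append(best_low)
        gained, spent = new_gained, new_spent
    return selected

# Toy gains and costs (made-up numbers); "t" is a target-fidelity action.
base_gain = {"a1": 0.9, "a2": 0.5, "a3": 0.05, "t": 1.0}
unit_cost = {"a1": 1.0, "a2": 1.0, "a3": 1.0, "t": 5.0}
gain_fn = lambda a, history: base_gain[a]
cost_fn = lambda a: unit_cost[a]

explored = explore_lf(["a1", "a2", "a3"], ["t"], gain_fn, cost_fn, budget=10.0, threshold=0.1)
explored_small_budget = explore_lf(["a1", "a2", "a3"], ["t"], gain_fn, cost_fn, budget=1.0, threshold=0.1)
```

In the first call the loop stops through condition (ii), because the remaining low-fidelity action has a lower benefit-cost ratio than the target-fidelity action; in the second call it stops through condition (i).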
At the end of the exploration phase, MF-MI-Greedy updates the posterior distribution of the joint GP using the full observation history, and searches for a target fidelity action via the (single-fidelity) GP optimization subroutine SF-GP-OPT (Line 1). Here, SF-GP-OPT could be any off-the-shelf Bayesian optimization algorithm, such as GP-UCB (Srinivas et al., 2010), GP-MI (Contal et al., 2014), EST (Wang et al., 2016), or MVES (Wang & Jegelka, 2017). Unlike the preceding exploration phase, which seeks an informative set of low fidelity actions, the GP optimization subroutine aims to trade off exploration and exploitation on the target fidelity, and outputs a single action at each round. MF-MI-Greedy then proceeds to the next round until it exhausts the preset budget, and eventually outputs an estimator of the target function optimizer.
5 Theoretical Analysis
In this section, we investigate the theoretical behavior of MF-MI-Greedy. We first introduce an intuitive notion of regret for the multi-fidelity setting, and then state our main theoretical results.
5.1 Multi-fidelity Regret
[Episode] Let be any integer. We call a sequence of items an episode, if and . In words, only the last action of an episode is from the target fidelity and all remaining actions are from lower fidelities. We now define a simple notion of regret for an episode. [Episode regret] The regret of an episode is
where is the total cost of episode , and denotes the reward value of the last action on the target fidelity.
Suppose we run policy under budget and select a sequence of actions . One can represent using multiple episodes , where denotes the sequence of low fidelity actions and target fidelity action selected at the episode. Let be the cost of episode ; clearly . We define the multi-fidelity cumulative regret as follows. [Cumulative regret] The cumulative regret of policy under budget is
Intuitively, Definition 5.1 characterizes the difference between the cumulative reward of the policy and the best possible reward gathered under the given budget. (Note that our notion of cumulative regret is different from the multi-fidelity regret (Eq. (2)) of Kandasamy et al. (2016). Although both definitions reduce to the classical single-fidelity regret (Srinivas et al., 2010) in the single-fidelity setting, Definition 5.1 has a simpler form and an intuitive physical interpretation.)
5.2 Regret Analysis
In the following, we establish a bound on the cumulative regret of MF-MI-Greedy, as a function of the mutual information between the target fidelity function and the actions attained by the algorithm. Assume that MF-MI-Greedy terminates in episodes, and w.h.p., the cumulative regret incurred by SF-GP-OPT is upper bounded by , where is some constant independent of , and denotes the mutual information gathered by the target fidelity actions chosen by SF-GP-OPT (equivalently by MF-MI-Greedy). Then, w.h.p, the cumulative regret of MF-MI-Greedy (Algorithm 1) satisfies
where , is some constant independent of , and
denotes the mutual information gathered by the low fidelity actions when running MF-MI-Greedy. The proof of Theorem 5.2 is provided in the Appendix. Similarly with the single-fidelity case, a desirable asymptotic property of a multi-fidelity optimization algorithm is to be no-regret, i.e., . If we set , then Theorem 5.2 reduces to . Clearly, MF-MI-Greedy is no-regret as .
Furthermore, let us compare the above result with the regret bound of the single-fidelity GP optimization algorithm SF-GP-OPT. By the assumption of Theorem 5.2, we know , where is the information gain of all the (target fidelity) actions by running SF-GP-OPT for rounds. When the low fidelity actions are very informative about the and have much lower costs than (hence larger ), the less likely MF-MI-Greedy will focus on exploring the target fidelity, i.e., , and hence MF-MI-Greedy becomes more advantageous to SF-GP-OPT. The implication of Theorem 5.2 is similar in spirit to the regret bound provided in Kandasamy et al. (2016), however, our results apply to a much broader class of optimization strategies, as suggested by the following corollary.
Let . Running MF-MI-Greedy with subroutine GP-UCB, EST, or GP-MI in the optimization phase is no-regret.
6 Experiments
In this section, we empirically evaluate our algorithm on 3 synthetic test function optimization tasks and 2 practical optimization problems.
6.1 Experimental Setup
To model the relationship between a low fidelity function and the target fidelity function, we use an additive model. Specifically, we assume that, for all lower fidelity levels, the low fidelity function is the sum of the target fidelity function and an unknown function characterizing the error incurred by that lower fidelity function. We use Gaussian processes to model the target fidelity function and the error functions. Since the target fidelity function is embedded in every fidelity level, we can use an observation from any fidelity to update the posterior for every fidelity level. We use squared exponential kernels for all the GP covariances, with hyperparameter tuning scheduled periodically during optimization. We keep the same experimental setup for MF-GP-UCB as in Kandasamy et al. (2016).
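A minimal sketch of such an additive model is given below. It assumes independent zero-mean GPs with squared exponential kernels for the target function and for each lower-fidelity error term, so that two observations share the target kernel and additionally share an error kernel only when they come from the same lower fidelity. All function names and hyperparameter values are hypothetical.

```python
import numpy as np

def sq_exp(x1, x2, lengthscale, variance):
    """Squared exponential kernel on 1-D inputs."""
    return variance * np.exp(-0.5 * (x1[:, None] - x2[None, :])**2 / lengthscale**2)

def additive_mf_kernel(x1, lev1, x2, lev2, target_level,
                       ls_f=1.0, var_f=1.0, ls_e=0.5, var_e=0.2):
    """Covariance between observations at (x1, lev1) and (x2, lev2)."""
    shared = sq_exp(x1, x2, ls_f, var_f)               # target function, in every level
    same_low = (lev1[:, None] == lev2[None, :]) & (lev1[:, None] < target_level)
    return shared + same_low * sq_exp(x1, x2, ls_e, var_e)

def target_posterior_var(x_obs, lev_obs, x_query, target_level, noise_var=1e-3):
    """Posterior variance of the target function given multi-fidelity data."""
    lev_q = np.full(len(x_query), target_level)
    K = additive_mf_kernel(x_obs, lev_obs, x_obs, lev_obs, target_level)
    K = K + noise_var * np.eye(len(x_obs))
    K_s = additive_mf_kernel(x_obs, lev_obs, x_query, lev_q, target_level)
    K_ss = additive_mf_kernel(x_query, lev_q, x_query, lev_q, target_level)
    return np.diag(K_ss - K_s.T @ np.linalg.solve(K, K_s))

# One low-fidelity observation (level 1) at x = 0; the target fidelity is level 2.
x_obs, lev_obs = np.array([0.0]), np.array([1])
post_var = target_posterior_var(x_obs, lev_obs, np.array([0.0]), target_level=2)
```

Even though the only observation comes from the low fidelity, the posterior variance of the target function at the same input drops well below its prior variance of 1, because the target function is a shared component of every fidelity.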
For all our experiments, we use a total budget of 100 times the cost of a target fidelity function call. When the optimal value of the target function is known, we compare simple regrets. Otherwise, we compare simple rewards.
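The two evaluation metrics can be computed as in the following sketch, which assumes a maximization problem and the usual best-so-far convention over target-fidelity queries. The function names, the example values, and the target cost of 9 are illustrative assumptions.

```python
import numpy as np

def simple_reward_curve(target_values):
    """Best target-fidelity value observed so far, after each target query."""
    return np.maximum.accumulate(np.asarray(target_values, dtype=float))

def simple_regret_curve(target_values, optimum):
    """Gap between the known optimum and the best value observed so far."""
    return optimum - simple_reward_curve(target_values)

target_cost = 9.0                      # cost of one target-fidelity call (example)
total_budget = 100 * target_cost       # budget of 100 target-fidelity calls

values = [0.2, 0.7, 0.5, 0.9]          # target-fidelity values in query order
reward = simple_reward_curve(values)
regret = simple_regret_curve(values, optimum=1.0)
```

Both curves are monotone by construction: the simple reward never decreases and the simple regret never increases as more target-fidelity queries are made.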
6.2 Compared Methods
Our framework is general and we could plug in different single fidelity Bayesian optimization algorithms for the SF-GP-OPT procedure in Algorithm 1. In our experiment, we choose to use GP-UCB as one instantiation. We compare with MF-GP-UCB (Kandasamy et al., 2016) and GP-UCB (Srinivas et al., 2010).
6.3 Synthetic Examples
We first evaluate our algorithm on three synthetic datasets, namely (a) Hartmann 6D, (b) Currin exponential 2D and (c) BoreHole 8D (Kandasamy et al., 2016). We follow the setup used in Kandasamy et al. (2016) to define lower fidelity functions, while we use a different definition of lower fidelity costs. We emphasize that in synthetic settings, the artificially defined costs do not have practical meanings, as function evaluation costs do not differ across different fidelity levels. Nevertheless, we set the cost of the function evaluations (monotonically) according to the fidelity levels, and present the results in Fig. 3. The x-axis represents the expended budget and the y-axis represents the smallest simple regret. The error bars represent one standard error over 20 runs of each experiment.
Our method MF-MI-Greedy is generally competitive with MF-GP-UCB. A common issue is that its simple regrets tend to be larger at the beginning. A possible cause is that the parameters controlling the termination conditions are not tuned optimally early on, which leads to over-exploration in regions that do not reveal much information about where the function optimum lies.
6.4 Real Experiments
We test our methods on two real datasets: maximum likelihood inference for cosmological parameters and experimental design for material science.
6.4.1 Maximum Likelihood Inference for Cosmological Parameters
The first real experiment is to perform maximum likelihood inference on 3 cosmological parameters, the Hubble constant , the dark matter fraction and the dark energy fraction . It thus has a dimensionality of 3. The likelihood is given by the Robertson-Walker metric, which requires a one-dimensional numerical integration for each point in the dataset from Davis et al. (2007). In Kandasamy et al. (2017), the authors set up two lower fidelity functions by considering two aspects of computing the likelihood: (i) how many data points (denoted by ) are used, and (ii) what discrete grid size (denoted by ) is used for performing the numerical integration. The ranges for these two parameters are and . We follow the fidelity levels selected in Kandasamy et al. (2017), which correspond to two lower fidelities with , and the target fidelity with . Costs are defined as the product of the number of data points and the grid size.
Upon further investigation, we find that the grid sizes selected above for performing numerical integration do not affect the final integral values, i.e., the grid size for the lowest fidelity is already fine enough to approximate the integral as accurately as the grid size for the target fidelity. Costs that take the integration grid sizes into consideration are therefore not an accurate characterization of the true computation costs. As a result, we propose a different cost definition that depends only on how many data points are used to compute the likelihood, i.e. the new costs for the 3 functions are , respectively.
The results using the original cost definition are shown in Figure 2(a). Note that for this task we do not know the optimal likelihood, so we report the best objective value so far (simple rewards) on the y-axis. Our method MF-MI-Greedy (red) outperforms both baselines. The results using the new cost definition are shown in Figure 2(b). Our method obtains a consistently high likelihood when the cost structure changes. However, MF-GP-UCB's quality degrades significantly, which implies that it is sensitive to how the costs among fidelity levels are defined. These two sets of results demonstrate the robustness of our method against costs, which is a desirable property as inaccuracy in cost estimates is inevitable in practical applications.
6.4.2 Experimental Design for Optimizing Nanophotonic Structures
The second experiment is motivated by a material science task of designing nanophotonic structures with desired color filtering property (Fleischman et al., 2017). A nanophotonic structure is characterized by the following 5 parameters: mirror height, film thickness, mirror spacing, slit width, and oxide thickness. For each parameter setting, we use a score, commonly called a figure-of-merit (FOM), to represent how well the resulting structure satisfies the desired color filtering property. By minimizing FOM, we hope to find a set of high-quality design parameters.
Traditionally, FOMs can only be computed by actually fabricating a structure and testing its various physical properties, which is a time-consuming process. Alternatively, simulations can be utilized to estimate what physical properties a design will have. By solving a variant of Maxwell's equations, we can simulate the transmission of the light spectrum and compute the FOM from the spectrum. We collect data at three fidelity levels on 5000 nanophotonic structures. What distinguishes each fidelity is the mesh size we use to solve Maxwell's equations. Finer meshes lead to more accurate results. Specifically, the lowest fidelity uses a mesh size of , the middle fidelity and the target fidelity . The costs, i.e., the simulation times, are inversely proportional to the mesh size, so we use the costs [1, 4, 9] for our three fidelity functions.
Figure 2(c) shows the results of this experiment. As usual, the x-axis is the cost and the y-axis is the negative FOM. After a small portion of the budget is used in initial exploration, MF-MI-Greedy (red) is able to arrive at a better final design compared with MF-GP-UCB and GP-UCB.
7 Conclusion
In this paper, we investigated the multi-fidelity Bayesian optimization problem, and proposed a general, principled framework for addressing the problem. We introduced a simple, intuitive notion of regret, and showed that our framework is able to lift many popular, off-the-shelf single-fidelity GP optimization algorithms to the multi-fidelity setting, while still preserving their original regret bounds. We demonstrated the performance of our proposed algorithm on several synthetic and real datasets.
This work was supported in part by NSF Award #1645832, Northrop Grumman, Bloomberg, and a Swiss NSF Early Mobility Postdoctoral Fellowship.
- Alvarez & Lawrence (2009) Alvarez, M. and Lawrence, N. D. Sparse convolved gaussian processes for multi-output regression. In Advances in neural information processing systems, pp. 57–64, 2009.
- Bonilla et al. (2008) Bonilla, E. V., Chai, K. M., and Williams, C. Multi-task gaussian process prediction. In Advances in neural information processing systems, pp. 153–160, 2008.
- Boyle & Frean (2005) Boyle, P. and Frean, M. Dependent gaussian processes. In Advances in neural information processing systems, pp. 217–224, 2005.
- Chen et al. (2015) Chen, Y., Javdani, S., Karbasi, A., Bagnell, J. A., Srinivasa, S., and Krause, A. Submodular surrogates for value of information. In Proc. Conference on Artificial Intelligence (AAAI), January 2015. URL https://bit.ly/2QEfJnZ.
- Contal et al. (2014) Contal, E., Perchet, V., and Vayatis, N. Gaussian process optimization with mutual information. In International Conference on Machine Learning, pp. 253–261, 2014. URL https://bit.ly/2x7EEbw.
- Davis et al. (2007) Davis, T. M., Mörtsell, E., Sollerman, J., Becker, A. C., Blondin, S., Challis, P., Clocchiatti, A., Filippenko, A., Foley, R., Garnavich, P. M., et al. Scrutinizing exotic cosmological models using essence supernova data combined with other cosmological probes. The Astrophysical Journal, 666(2):716, 2007.
- Fleischman et al. (2017) Fleischman, D., Sweatlock, L. A., Murakami, H., and Atwater, H. Hyper-selective plasmonic color filters. Optics Express, 25(22):27386–27395, 2017.
- Forrester et al. (2007) Forrester, A. I., Sóbester, A., and Keane, A. J. Multi-fidelity optimization via surrogate modelling. In Proceedings of the royal society of london a: mathematical, physical and engineering sciences, volume 463, pp. 3251–3269. The Royal Society, 2007. URL https://bit.ly/2xkMXRr.
- Hennig & Schuler (2012) Hennig, P. and Schuler, C. J. Entropy search for information-efficient global optimization. Journal of Machine Learning Research, 13(Jun):1809–1837, 2012. URL https://bit.ly/2x5KMQC.
- Hernández-Lobato et al. (2014) Hernández-Lobato, J. M., Hoffman, M. W., and Ghahramani, Z. Predictive entropy search for efficient global optimization of black-box functions. In Advances in neural information processing systems, pp. 918–926, 2014.
- Kandasamy et al. (2016) Kandasamy, K., Dasarathy, G., Oliva, J. B., Schneider, J., and Póczos, B. Gaussian process bandit optimisation with multi-fidelity evaluations. In Advances in Neural Information Processing Systems, pp. 992–1000, 2016. URL https://bit.ly/2Qngemh.
- Kandasamy et al. (2017) Kandasamy, K., Dasarathy, G., Schneider, J., and Póczos, B. Multi-fidelity bayesian optimisation with continuous approximations. In International Conference on Machine Learning, pp. 1799–1808, 2017. URL https://bit.ly/2N9KgMq.
- Lázaro-Gredilla et al. (2010) Lázaro-Gredilla, M., Quiñonero-Candela, J., Rasmussen, C. E., and Figueiras-Vidal, A. R. Sparse spectrum gaussian process regression. Journal of Machine Learning Research, 11:1865–1881, 2010.
- Le Gratiet & Garnier (2014) Le Gratiet, L. and Garnier, J. Recursive co-kriging model for design of computer experiments with multiple levels of fidelity. International Journal for Uncertainty Quantification, 4(5), 2014. URL https://bit.ly/2PICVQu.
- Marco et al. (2017) Marco, A., Berkenkamp, F., Hennig, P., Schoellig, A. P., Krause, A., Schaal, S., and Trimpe, S. Virtual vs. real: Trading off simulations and physical experiments in reinforcement learning with bayesian optimization. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 1557–1563, 2017. URL https://bit.ly/2Oa4e62.
- Nguyen & Bonilla (2014) Nguyen, T. V. and Bonilla, E. V. Collaborative multi-output gaussian processes. In UAI, pp. 643–652, 2014. URL https://bit.ly/2D4BwCH.
- Rahimi & Recht (2008) Rahimi, A. and Recht, B. Random features for large-scale kernel machines. In Advances in neural information processing systems, pp. 1177–1184, 2008.
- Rasmussen & Williams (2006) Rasmussen, C. E. and Williams, C. K. I. Gaussian Processes for Machine Learning. MIT Press, 2006. URL https://bit.ly/2tYpBix.
- Romero et al. (2013) Romero, P. A., Krause, A., and Arnold, F. H. Navigating the protein fitness landscape with gaussian processes. Proceedings of the National Academy of Sciences, 110(3):E193–E201, 2013.
- Sen et al. (2018) Sen, R., Kandasamy, K., and Shakkottai, S. Multi-fidelity black-box optimization with hierarchical partitions. In Proceedings of the 35th International Conference on Machine Learning (ICML), 2018. URL https://bit.ly/2MSIsSQ.
- Snelson & Ghahramani (2007) Snelson, E. and Ghahramani, Z. Local and global sparse gaussian process approximations. In Artificial Intelligence and Statistics, pp. 524–531, 2007.
- Snoek et al. (2012) Snoek, J., Larochelle, H., and Adams, R. P. Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pp. 2951–2959, 2012.
- Srinivas et al. (2010) Srinivas, N., Krause, A., Kakade, S., and Seeger, M. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proc. International Conference on Machine Learning (ICML), 2010. URL https://bit.ly/2CNGPGc.
- Teh et al. (2005) Teh, Y.-W., Seeger, M., and Jordan, M. Semiparametric latent factor models. In Artificial Intelligence and Statistics 10, 2005.
- Titsias (2009) Titsias, M. Variational learning of inducing variables in sparse gaussian processes. In Artificial Intelligence and Statistics, pp. 567–574, 2009.
- van Gunsteren & Berendsen (1990) van Gunsteren, W. F. and Berendsen, H. J. Computer simulation of molecular dynamics: Methodology, applications, and perspectives in chemistry. Angewandte Chemie International Edition in English, 29(9):992–1023, 1990.
- Wang & Jegelka (2017) Wang, Z. and Jegelka, S. Max-value entropy search for efficient bayesian optimization. arXiv preprint arXiv:1703.01968, 2017. URL https://arxiv.org/pdf/1703.01968.pdf.
- Wang et al. (2016) Wang, Z., Zhou, B., and Jegelka, S. Optimization as estimation with gaussian processes in bandit settings. In Artificial Intelligence and Statistics, pp. 1022–1031, 2016. URL https://bit.ly/2OeSOhp.
- Wilson et al. (2012) Wilson, A. G., Knowles, D. A., and Ghahramani, Z. Gaussian process regression networks. In Proceedings of the 29th International Conference on Machine Learning (ICML), 2012.
- Zhang et al. (2017) Zhang, Y., Hoang, T. N., Low, B. K. H., and Kankanhalli, M. Information-based multi-fidelity bayesian optimization. NIPS Workshop on Bayesian Optimization, 2017. URL https://bit.ly/2N5CdjH.
Appendix A Proofs for §5
A.1 Proof of Theorem 5.2
[Proof of Theorem 5.2] Assume that MF-MI-Greedy terminates within episodes. Let us use to denote the sequence of actions selected by MF-MI-Greedy, where denotes the sequence of actions selected at the episode. Further, let be the cost of the episode, and the cost of lower fidelity actions of the episode. The budget allocated for the target fidelity is . By definition of the cumulative regret (Eq. (6)), we get
The first term on the R.H.S. of Eq. (7) represents the regret incurred from exploring the lower fidelity actions, while the second term represents the regret from the target fidelity actions (chosen by SF-GP-OPT).
where denotes the observations obtained up to episode , and specifies the stopping condition of Explore-LF at episode . Therefore
where step (a) is because for . Recall that . Therefore,
Note that the second term of Eq. (7) is the regret of MF-MI-Greedy on the target fidelity. Since all the target fidelity actions are selected by the subroutine SF-GP-OPT, by assumption, we know . Combining this with Eq. (9) completes the proof.
A.2 Proof of Corollary 5.2
To show that running MF-MI-Greedy with subroutine GP-UCB (Srinivas et al., 2010), EST (Wang et al., 2016), or GP-MI (Contal et al., 2014) in the optimization phase is no-regret, it suffices to show that each of these candidate subroutines satisfies the assumption on SF-GP-OPT stated in Theorem 5.2. This follows from the regret bounds established for each algorithm in the respective references.
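To illustrate the kind of single-fidelity subroutine the corollary refers to, the following is a minimal sketch of GP-UCB on a finite grid. It is not the implementation used in this paper: the squared-exponential kernel, the lengthscale, the noise level, the test function, and the helper names (`rbf_kernel`, `gp_posterior`, `gp_ucb`) are all assumptions made for illustration. The exploration weight uses the finite-domain choice from Srinivas et al. (2010).

```python
import numpy as np


def rbf_kernel(a, b, lengthscale=0.2):
    """Squared-exponential kernel between two 1-D arrays (assumed kernel)."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / lengthscale) ** 2)


def gp_posterior(x_obs, y_obs, x_grid, noise=1e-3):
    """Posterior mean and standard deviation of a zero-mean GP on x_grid."""
    k_oo = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    k_go = rbf_kernel(x_grid, x_obs)
    mean = k_go @ np.linalg.solve(k_oo, y_obs)
    # The prior variance is 1 for this kernel; subtract the explained part.
    var = 1.0 - np.sum(k_go * np.linalg.solve(k_oo, k_go.T).T, axis=1)
    return mean, np.sqrt(np.clip(var, 0.0, None))


def gp_ucb(f, x_grid, n_rounds=30, delta=0.1, seed=0):
    """Run GP-UCB on a finite grid; return the instantaneous regret per round."""
    rng = np.random.default_rng(seed)
    f_vals = f(x_grid)
    f_star = f_vals.max()
    idx = [int(rng.integers(len(x_grid)))]  # random first query
    regrets = [f_star - f_vals[idx[0]]]
    for t in range(2, n_rounds + 1):
        mean, std = gp_posterior(x_grid[idx], f_vals[idx], x_grid)
        # beta_t for a finite domain D: 2 log(|D| t^2 pi^2 / (6 delta)).
        beta = 2.0 * np.log(len(x_grid) * t**2 * np.pi**2 / (6.0 * delta))
        nxt = int(np.argmax(mean + np.sqrt(beta) * std))
        idx.append(nxt)
        regrets.append(f_star - f_vals[nxt])
    return np.array(regrets)


# Hypothetical test function on [0, 1]; observations are noise-free here.
x_grid = np.linspace(0.0, 1.0, 100)
regrets = gp_ucb(lambda x: np.sin(6.0 * x), x_grid)
avg_regret = np.cumsum(regrets) / np.arange(1, len(regrets) + 1)
```

In this sketch, `avg_regret` is the running average of the instantaneous regret. The no-regret property means this average tends to zero as the number of rounds grows, which is the behaviour the assumption on SF-GP-OPT is meant to capture.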