1 Introduction
Optimizing an unknown function that is expensive to evaluate is a common problem in real applications. Examples include experimental design for protein engineering, where chemists need to synthesize designed amino acid sequences and then test whether they satisfy certain properties (Romero et al., 2013), or black-box optimization for material science, where scientists need to run extensive computational experiments at various levels of accuracy to find the optimal material design structure (Fleischman et al., 2017). Conducting real experiments can be labor-intensive and time-consuming. In practice, we would like to look for alternative ways to gather information so that we can make the most effective use of the real experiments that we do conduct. A natural candidate is computer simulation (van Gunsteren & Berendsen, 1990), which tends to be less time-consuming but produces less accurate results. For example, computer simulation is ubiquitous in robotics, e.g., we test a control policy in simulation before deploying it on a real physical system (Marco et al., 2017).
The central challenge in efficiently using multiple sources of information is captured by the general framework of multi-fidelity optimization (Forrester et al., 2007; Kandasamy et al., 2016, 2017; Marco et al., 2017; Sen et al., 2018), in which multiple functions with varying degrees of accuracy and cost can be leveraged to provide the maximal amount of information. However, strict assumptions, such as requiring a fixed relation between the quality and the cost of a lower-fidelity function, as well as two-stage query selection criteria (cf. §2.2), are likely to limit the practical use of existing methods and lead to suboptimal selections.
In this paper, we propose a general and principled multi-fidelity Bayesian optimization framework, MF-MI-Greedy (Multi-fidelity Mutual Information Greedy), that prioritizes maximizing the amount of mutual information gathered across fidelity levels. Figure 1 captures the intuition behind maximizing mutual information: gathering information from a lower fidelity also conveys information about the target fidelity. We make this idea concrete in §4. Our method improves upon prior work on multi-fidelity Bayesian optimization by establishing explicit connections across fidelity levels to enable joint posterior updates and hyperparameter optimization. In summary, our main contributions are as follows.
2 Background and Related Work
In this section, we review related work on Bayesian optimization with Gaussian processes.
2.1 Background on Gaussian Processes
A Gaussian process (GP) (Rasmussen & Williams, 2006) models an infinite collection of random variables, each indexed by an input $x \in \mathcal{X}$, such that every finite subset of the random variables has a multivariate Gaussian distribution. The GP distribution is a joint Gaussian distribution over all those (infinitely many) random variables, specified by its mean function $\mu(x)$ and covariance (also known as kernel) function $k(x, x')$. A key advantage of GPs is that it is very efficient to perform inference. Assume that $f \sim \mathcal{GP}(\mu, k)$ is a sample from the GP distribution, and that $y = f(x) + \epsilon$ is a noisy observation of the function value $f(x)$. Here, the noise $\epsilon \sim \mathcal{N}(0, \sigma^2(x))$ could depend on the input $x$. Suppose that we have selected inputs $x_1, \dots, x_t$ and received observations $y_t = [y_1, \dots, y_t]^\top$. We can obtain the posterior mean and covariance of the function through the kernel matrix $K_t = [k(x_i, x_j)]_{i,j \le t}$ and the vector $k_t(x) = [k(x_1, x), \dots, k(x_t, x)]^\top$:
$$\mu_t(x) = k_t(x)^\top (K_t + \Sigma_t)^{-1} y_t \qquad (1)$$
$$k_t(x, x') = k(x, x') - k_t(x)^\top (K_t + \Sigma_t)^{-1} k_t(x') \qquad (2)$$
where $[\Sigma_t]_{ij} = \sigma^2(x_i)\,\delta_{ij}$ and $\delta_{ij}$ is the Kronecker delta function.
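The posterior update in Eqs. (1)–(2) can be sketched in a few lines of NumPy. This is a minimal illustration rather than any particular implementation; the squared-exponential kernel and homoscedastic noise are simplifying assumptions, and all function names are ours.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel k(x, x') between row-stacked inputs."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * sq / lengthscale**2)

def gp_posterior(X_train, y_train, X_test, noise_var=1e-2):
    """Posterior mean and covariance in the spirit of Eqs. (1)-(2),
    with constant noise variance (Sigma_t = noise_var * I)."""
    K = rbf_kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
    k_star = rbf_kernel(X_train, X_test)              # k_t(x) for each test point
    mu = k_star.T @ np.linalg.solve(K, y_train)       # Eq. (1)
    cov = rbf_kernel(X_test, X_test) - k_star.T @ np.linalg.solve(K, k_star)  # Eq. (2)
    return mu, cov
```

With near-zero noise, the posterior mean interpolates the observed values, and the posterior variance shrinks below the prior variance at observed inputs.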
2.2 Bayesian Optimization via Gaussian Processes
Single-fidelity Gaussian process optimization
Optimizing an unknown and noisy function is a common task in Bayesian optimization. In real applications, such functions tend to be expensive to evaluate, for example when tuning hyperparameters for deep learning models (Snoek et al., 2012), so the number of queries should be minimized. As a way to model the unknown function, a Gaussian process (GP) (Rasmussen & Williams, 2006) is an expressive and flexible tool covering a large class of functions. A classical method for Bayesian optimization with GPs is GP-UCB (Srinivas et al., 2010), which treats Bayesian optimization as a multi-armed bandit problem and proposes an upper confidence bound based algorithm for query selection. The authors provide a theoretical bound on the cumulative regret that is connected with the amount of mutual information gained through the queries. Contal et al. (2014) directly incorporate mutual information into the UCB framework and demonstrate the empirical value of their method.

Entropy search (Hennig & Schuler, 2012) represents another class of GP-based Bayesian optimization approaches. Its main idea is to directly search for the global optimum of an unknown function through queries; each query point is selected based on its informativeness about the location of the function optimum. Predictive entropy search (Hernández-Lobato et al., 2014) addresses some computational issues of entropy search by maximizing the expected information gain with respect to the location of the global optimum. Max-value entropy search (Wang et al., 2016; Wang & Jegelka, 2017) approaches the task of searching for the global optimum differently: instead of searching for the location of the global optimum, it looks for the value of the global optimum. This effectively avoids issues related to the dimension of the search space, and the authors are able to provide a regret bound analysis that the previous two entropy search methods lack.
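As a concrete illustration of the GP-UCB selection rule discussed above, the acquisition step over a finite candidate set can be written as follows. This is a sketch in our own notation; in Srinivas et al. (2010) the exploration coefficient grows with the round $t$ rather than being a fixed constant.

```python
import numpy as np

def gp_ucb_select(mu, var, beta=2.0):
    """GP-UCB rule: pick the candidate x maximizing mu(x) + sqrt(beta) * sigma(x).

    mu, var : arrays of posterior mean and variance at each candidate point.
    Returns the index of the selected candidate.
    """
    sigma = np.sqrt(np.maximum(var, 0.0))
    return int(np.argmax(mu + np.sqrt(beta) * sigma))
```

A point is thus chosen either because its predicted value is high (exploitation) or because its uncertainty is high (exploration).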
A computational consideration for learning with GPs concerns optimizing the specific kernels used to model the covariance structure of a GP. As this optimization task depends on the dimension of the feature space, approximation methods are needed to speed up the learning process. Random Fourier features (Rahimi & Recht, 2008) are an efficient tool for dimensionality reduction and have been employed in GP regression tasks (Lázaro-Gredilla et al., 2010). As elaborated in §4, our algorithmic framework offers the flexibility of choosing among different single-fidelity optimization approaches as a subroutine, so that one can take advantage of these computational and approximation algorithms for efficient optimization.
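To make the random Fourier feature idea concrete, the sketch below (our notation, RBF kernel assumed) draws random frequencies from the kernel's spectral density so that inner products of the features approximate the kernel:

```python
import numpy as np

def rff_features(X, n_features=2000, lengthscale=1.0, seed=0):
    """Random Fourier feature map z(x) such that z(x) @ z(x') approximates
    the RBF kernel exp(-||x - x'||^2 / (2 * lengthscale^2))."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(0.0, 1.0 / lengthscale, size=(d, n_features))  # spectral samples
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)            # random phases
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)
```

Regression on these $D$-dimensional features then costs $O(nD^2)$ instead of the $O(n^3)$ of exact GP inference, at the price of a kernel approximation error that shrinks as $D$ grows.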
Multi-output Gaussian processes
Sometimes it is desirable to model multiple correlated outputs with Gaussian processes. Most GP-based multi-output models create correlated outputs by mixing a set of independent latent processes. A simple form of such a mixing scheme is the linear model of coregionalization (Teh et al., 2005; Bonilla et al., 2008), where each output is modeled as a linear combination of latent GPs with fixed coefficients; the dependencies among outputs are captured by sharing those latent GPs. More complex structures can be captured by a linear combination of GPs with input-dependent coefficients (Wilson et al., 2012), shared inducing variables (Nguyen & Bonilla, 2014), or convolved processes (Boyle & Frean, 2005; Alvarez & Lawrence, 2009). In comparison with single-fidelity/output GPs, multi-output GPs often require more sophisticated approximate models for efficient optimization (e.g., using inducing points (Snelson & Ghahramani, 2007) to reduce the storage and computational complexity, and variational inference approaches to approximate the posterior of the latent processes (Titsias, 2009; Nguyen & Bonilla, 2014)). While the analysis of our framework is not limited to a fixed structural assumption on the joint distribution among multiple outputs, for efficiency reasons we use a simple, additive model between multiple fidelity outputs in our experiments (cf. §6.1) to demonstrate the effectiveness of the optimization framework.

Multi-fidelity Bayesian optimization
Multi-fidelity optimization is a general framework that captures the trade-off between cheap, low-quality and expensive, high-quality data. Recently, there have been several works on using GPs to model functions at different fidelity levels. Recursive co-kriging (Forrester et al., 2007; Le Gratiet & Garnier, 2014) considers an auto-regressive model for multi-fidelity GP regression, which assumes that the higher fidelity consists of a lower-fidelity term plus an independent GP term that models the systematic error of approximating the higher-fidelity output. Therefore, one can model the cross-covariance between the high-fidelity and low-fidelity functions using the covariance of the lower-fidelity function only. Virtual vs. Real (Marco et al., 2017) extends this idea to Bayesian optimization. The authors consider a two-fidelity setting (i.e., virtual simulation and real system experiments), where they model the correlation between the two fidelities through co-kriging, and then apply entropy search (ES) to optimize the target output.
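For concreteness, the auto-regressive co-kriging model takes the following standard form (notation ours, with $\rho_{m-1}$ a fixed scaling factor):

```latex
f^{(m)}(x) = \rho_{m-1}\, f^{(m-1)}(x) + \delta^{(m)}(x),
\qquad
\delta^{(m)} \sim \mathcal{GP}\big(\mu_\delta^{(m)}, k_\delta^{(m)}\big)
\ \text{independent of}\ f^{(m-1)},
```

so that $\mathrm{Cov}\big(f^{(m)}(x), f^{(m-1)}(x')\big) = \rho_{m-1}\,\mathrm{Cov}\big(f^{(m-1)}(x), f^{(m-1)}(x')\big)$: the cross-covariance between fidelities is determined entirely by the lower-fidelity covariance.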
Zhang et al. (2017) model the dependencies between different fidelities with convolved Gaussian processes (Alvarez & Lawrence, 2009), and then apply predictive entropy search (PES) (Hernández-Lobato et al., 2014) for efficient exploration. Although both the ES and multi-fidelity PES heuristics have shown promising empirical results on some datasets, little is known about their theoretical performance.

Recently, Kandasamy et al. (2016) proposed Multi-fidelity GP-UCB (MF-GP-UCB), a principled framework for multi-fidelity Bayesian optimization with Gaussian processes. In follow-up work (Kandasamy et al., 2017; Sen et al., 2018), the authors address the issue of fidelities being modeled as disconnected by considering a continuous fidelity space and performing joint updates to effectively share information among different fidelity levels.
In contrast to our general assumption on the joint distribution between different fidelities, Kandasamy et al. (2016) and Sen et al. (2018) assume a specific structure over multiple fidelities, where the cost of each lower fidelity is determined according to the maximal approximation error in function value when compared with the target fidelity. Kandasamy et al. (2017) consider a two-stage optimization process, where the action and the fidelity level are selected in two separate stages. We note that this procedure may lead to non-intuitive choices of queries: for example, in a pessimistic case where the low fidelity differs from the target fidelity only by a constant shift, their algorithm is likely to focus on querying target-fidelity actions even though the low fidelity is as useful as the target fidelity. In contrast, as described in §4, our algorithm jointly selects a query point and a fidelity level, so such suboptimality can be avoided.
3 Problem Statement
We now introduce useful notation and formally state the problem studied in this paper.
Payoff function and auxiliary functions
Consider the problem of maximizing an unknown payoff function $f^{(M)}: \mathcal{X} \to \mathbb{R}$. We can probe the function by directly querying it at some $x \in \mathcal{X}$ and obtaining a noisy observation $y = f^{(M)}(x) + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \sigma^2)$ denotes i.i.d. Gaussian white noise. In addition to the payoff function $f^{(M)}$, we are also given access to oracle calls to some unknown auxiliary functions $f^{(1)}, \dots, f^{(M-1)}$; similarly, we obtain a noisy observation $y = f^{(m)}(x) + \epsilon$ when querying $f^{(m)}$ at $x$. Here, each $f^{(m)}$ could be viewed as a low-fidelity version of $f^{(M)}$ for $m \in \{1, \dots, M-1\}$. For example, if $f^{(M)}(x)$ represents the actual reward obtained by running a real physical system with input $x$, then $f^{(m)}$ may represent the simulated payoff from a numerical simulator at fidelity level $m$.

Joint distribution on multiple fidelities
We assume that the multiple fidelities $f^{(1)}, \dots, f^{(M)}$ are mutually dependent through some fixed, (possibly) unknown joint probability distribution $P(f^{(1)}, \dots, f^{(M)})$. In particular, we model $P$ with a multiple-output Gaussian process; hence the marginal distribution of each fidelity is a separate GP, i.e., $f^{(m)} \sim \mathcal{GP}(\mu^{(m)}, k^{(m)})$, where $\mu^{(m)}, k^{(m)}$ specify the (prior) mean and covariance at fidelity level $m$.

Action, reward, and cost
Let us use $a = \langle x, m \rangle$ to denote the action of querying $f^{(m)}$ at $x$. Each action $\langle x, m \rangle$ incurs a cost $\lambda^{(m)} > 0$, and achieves a reward
$$r(\langle x, m \rangle) = \begin{cases} f^{(M)}(x) & \text{if } m = M \\ \min_{x'} f^{(M)}(x') & \text{otherwise} \end{cases} \qquad (3)$$
That is, performing $\langle x, M \rangle$ (at the target fidelity) achieves a reward $f^{(M)}(x)$. We receive the minimal immediate reward with lower-fidelity actions $\langle x, m \rangle$ for $m < M$, even though such an action may provide some information about $f^{(M)}$ and could thus lead to more informed decisions in the future. W.l.o.g., we assume that $f^{(M)} \ge 0$, and $\min_x f^{(M)}(x) = 0$.
Policy
Let us encode an adaptive strategy for picking actions as a policy $\pi$. In words, a policy specifies which action to perform next, based on the actions picked so far and the corresponding observations. We consider policies with a fixed budget $\Lambda$. Upon termination, $\pi$ returns a sequence of actions $(a_1, \dots, a_T)$, such that $\sum_{i=1}^{T} \lambda^{(m_i)} \le \Lambda$. Note that for a given policy $\pi$, the sequence of selected actions is random, dependent on the joint distribution $P$ and the (random) observations of the selected actions.
Objective
Given a budget $\Lambda$, our goal is to maximize the expected cumulative reward, so as to identify an action with performance close to $\max_x f^{(M)}(x)$ as rapidly as possible. Formally, we seek
$$\pi^* \in \arg\max_{\pi} \ \mathbb{E}\Big[\sum_{a \in \pi} r(a)\Big] \quad \text{s.t.} \quad \mathrm{cost}(\pi) \le \Lambda \qquad (4)$$
Remarks
Problem (4) strictly generalizes the optimal value of information (VoI) problem (Chen et al., 2015) to the online setting. To see this, consider the scenario where $\lambda^{(m)} \ll \lambda^{(M)}$ for $m < M$, and $\Lambda < 2\lambda^{(M)}$, i.e., the budget admits at most one target-fidelity action. To achieve a nonzero reward, a policy must pick a target-fidelity action as the last action before exhausting the budget $\Lambda$. Therefore, our goal becomes to adaptively pick the lower-fidelity actions that are the most informative about $f^{(M)}$ under budget $\Lambda$, which reduces to the optimal VoI problem.
4 The Multifidelity BO Framework
We now present MF-MI-Greedy, a general framework for multi-fidelity Gaussian process optimization. In a nutshell, MF-MI-Greedy attempts to balance the "exploratory" low-fidelity actions and the more expensive target-fidelity actions, based on how much information (per unit cost) these actions could provide about the target-fidelity function. Concretely, MF-MI-Greedy proceeds in rounds under a given budget $\Lambda$. Each round can be divided into two phases: (i) an exploration (i.e., information gathering) phase, where the algorithm focuses on exploring the low-fidelity actions, and (ii) an optimization phase, where the algorithm tries to optimize the payoff function by performing an action at the target fidelity. The pseudocode of the algorithm is provided in Algorithm 1.
Exploration phase
A key challenge in designing the algorithm is to decide when to stop exploration (or, equivalently, when to invoke the optimization phase). Note that this is analogous to the exploration-exploitation dilemma in classical single-fidelity Bayesian optimization problems; the difference is that in the multi-fidelity setting, we have a more distinctive notion of "exploration" and a more complicated structure of the action space (i.e., each exploration phase corresponds to picking a set of low-fidelity actions). Furthermore, note that there is no explicit measurement of the relative "quality" of a low-fidelity action, as all such actions have uniform reward by our modeling assumption (cf. Eq. (3)); hence we need to design a proper heuristic to keep track of the progress of exploration.
We consider an information-theoretic selection criterion for picking low-fidelity actions. The quality of a set of low-fidelity actions $A$ is captured by its information gain, defined as the amount of entropy reduction in the posterior distribution of the target payoff function: $I(f^{(M)}; y_A \mid y_S) = H(f^{(M)} \mid y_S) - H(f^{(M)} \mid y_S, y_A)$. Here, $S$ denotes the set of previously selected actions, and $y_S, y_A$ denote the corresponding observations. (An alternative, more aggressive information measurement is the information gain over the optimizer of the target function (Hennig & Schuler, 2012), i.e., $\arg\max_x f^{(M)}(x)$, or over the optimal value of $f^{(M)}$ (Wang & Jegelka, 2017), i.e., $\max_x f^{(M)}(x)$.) Given an exploration budget, our objective for a single exploration phase is to find a set of low-fidelity actions which are (i) maximally informative about the target function, (ii) better than the best action on the target fidelity when considering the information gain per unit cost (otherwise, one would rather pick the target-fidelity action to trade off exploration and exploitation), and (iii) not overly aggressive in terms of exploration (since we would also like to reserve a certain budget for performing target-fidelity actions to gain reward).
Finding the optimal set of actions satisfying the above design principles is computationally prohibitive, as it requires searching through a combinatorial (for finite discrete domains) or even infinite (for continuous domains) space. In Algorithm 2, we introduce Explore-LF, a key subroutine of MF-MI-Greedy, for efficient exploration on low fidelities. At each step, Explore-LF takes a greedy step w.r.t. the benefit-cost ratio over all actions. To ensure that the algorithm does not explore excessively, we consider the following stopping conditions: (i) when the budget is exhausted, (ii) when a single target-fidelity action is better than all the low-fidelity actions in terms of the benefit-cost ratio, and (iii) when the cumulative benefit-cost ratio is small. The threshold parameter in condition (iii) is set so as to ensure low regret; we defer the detailed discussion of its choice to §5.2.
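The greedy benefit-cost loop of Explore-LF can be sketched as follows. This is a simplified illustration in our own notation, not the paper's pseudocode: `info_gain`, `cost`, `target_ratio`, and `threshold` are hypothetical stand-ins for the GP-based marginal information gain, the per-query costs, the benefit-cost ratio of the best target-fidelity action, and the stopping threshold of condition (iii).

```python
import numpy as np

def explore_lf(candidates, info_gain, cost, target_ratio, budget, threshold):
    """Greedy benefit-cost exploration over low-fidelity actions (sketch).

    candidates           : list of low-fidelity actions
    info_gain(a, picked) : marginal information gain of action a given picks so far
    cost(a)              : query cost of action a
    target_ratio         : benefit-cost ratio of the best target-fidelity action
    budget               : exploration budget for this phase
    threshold            : stop when the best marginal ratio falls below this value
    """
    picked, spent = [], 0.0
    while True:
        affordable = [a for a in candidates
                      if a not in picked and spent + cost(a) <= budget]
        if not affordable:                      # (i) budget exhausted
            break
        ratios = [info_gain(a, picked) / cost(a) for a in affordable]
        best = int(np.argmax(ratios))
        if ratios[best] <= target_ratio:        # (ii) target action is better
            break
        if ratios[best] < threshold:            # (iii) marginal ratio too small
            break
        picked.append(affordable[best])
        spent += cost(affordable[best])
    return picked, spent
```

Because marginal information gain shrinks as observations accumulate, the loop naturally hands control back to the optimization phase once low-fidelity queries stop paying for themselves.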
Optimization phase
At the end of the exploration phase, MF-MI-Greedy updates the posterior distribution of the joint GP using the full observation history, and searches for a target-fidelity action via the (single-fidelity) GP optimization subroutine SF-GP-OPT. Here, SF-GP-OPT could be any off-the-shelf Bayesian optimization algorithm, such as GP-UCB (Srinivas et al., 2010), GP-MI (Contal et al., 2014), EST (Wang et al., 2016), or MVES (Wang & Jegelka, 2017). Unlike the preceding exploration phase, which seeks an informative set of low-fidelity actions, the GP optimization subroutine aims to trade off exploration and exploitation on the target fidelity, and outputs a single action at each round. MF-MI-Greedy then proceeds to the next round until it exhausts the preset budget, and eventually outputs an estimate of the target function optimizer.
5 Theoretical Analysis
In this section, we investigate the theoretical behavior of MF-MI-Greedy. We first introduce an intuitive notion of regret for the multi-fidelity setting, and then state our main theoretical results.
5.1 Multifidelity Regret
[Episode] Let $s$ be any positive integer. We call a sequence of actions $e = (a_1, \dots, a_s)$ an episode if $m_i < M$ for all $i < s$, and $m_s = M$. In words, only the last action of an episode is at the target fidelity and all remaining actions are at lower fidelities. We now define a simple notion of regret for an episode. [Episode regret] The regret of an episode $e = (\langle x_1, m_1 \rangle, \dots, \langle x_s, m_s \rangle)$ is
$$r(e) = \lambda_e \cdot \max_x f^{(M)}(x) - \lambda^{(M)} f^{(M)}(x_s) \qquad (5)$$
where $\lambda_e = \sum_{i=1}^{s} \lambda^{(m_i)}$ is the total cost of episode $e$, and $f^{(M)}(x_s)$ denotes the reward value of the last action, taken on the target fidelity.
Suppose we run policy $\pi$ under budget $\Lambda$ and select a sequence of actions $(a_1, \dots, a_T)$. One can represent this sequence using multiple episodes $e_1, \dots, e_k$, where $e_i$ denotes the sequence of low-fidelity actions and the single target-fidelity action selected in the $i$-th episode. Let $\lambda_{e_i}$ be the cost of episode $e_i$; clearly $\sum_{i=1}^{k} \lambda_{e_i} \le \Lambda$. We define the multi-fidelity cumulative regret as follows. [Cumulative regret] The cumulative regret of policy $\pi$ under budget $\Lambda$ is
$$R(\pi; \Lambda) = \sum_{i=1}^{k} r(e_i) \qquad (6)$$
Intuitively, Definition 5.1 characterizes the difference between the cumulative reward of $\pi$ and the best possible reward that can be gathered under budget $\Lambda$. (Note that our notion of cumulative regret is different from the multi-fidelity regret (Eq. (2)) of Kandasamy et al. (2016). Although both definitions reduce to the classical single-fidelity regret (Srinivas et al., 2010) when $M = 1$, Definition 5.1 has a simpler form and an intuitive physical interpretation.)
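Taking the episode regret to be of the form $r(e) = \lambda_e \max_x f^{(M)}(x) - \lambda^{(M)} f^{(M)}(x_s)$ (our reconstruction of the definition above), the cumulative regret can be computed for a toy action trace as follows; the function names and the trace are illustrative only.

```python
def episode_regret(episode_costs, last_value, f_star, target_cost):
    """Regret of one episode: (total episode cost) * f_star
    minus (target-fidelity cost) * (value of the final, target-fidelity action)."""
    return sum(episode_costs) * f_star - target_cost * last_value

def cumulative_regret(episodes, f_star, target_cost):
    """Cumulative regret: sum of episode regrets over a full run.

    episodes : list of (episode_costs, last_value) pairs.
    """
    return sum(episode_regret(c, v, f_star, target_cost) for c, v in episodes)
```

As a sanity check, with a single fidelity every episode is one target-fidelity action of cost $\lambda^{(M)}$, and the expression reduces to $\lambda^{(M)} \sum_i (f^* - f(x_i))$, i.e., the classical cumulative regret scaled by the query cost.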
5.2 Regret Analysis
In the following, we establish a bound on the cumulative regret of MF-MI-Greedy, as a function of the mutual information between the target-fidelity function and the actions taken by the algorithm. Assume that MF-MI-Greedy terminates in $k$ episodes and that, w.h.p., the cumulative regret incurred by SF-GP-OPT is upper bounded by $c\sqrt{k \gamma_k}$, where $c$ is some constant independent of $k$, and $\gamma_k$ denotes the mutual information gathered by the target-fidelity actions chosen by SF-GP-OPT (equivalently, by MF-MI-Greedy). Then, w.h.p., the cumulative regret of MF-MI-Greedy (Algorithm 1) is bounded by a term of order $\sqrt{k \gamma_k}$ plus a term proportional to $\gamma^{\mathrm{LF}}$, where $\gamma^{\mathrm{LF}}$ denotes the mutual information gathered by the low-fidelity actions when running MF-MI-Greedy. The proof of Theorem 5.2 is provided in the Appendix. As in the single-fidelity case, a desirable asymptotic property of a multi-fidelity optimization algorithm is to be no-regret, i.e., $\lim_{\Lambda \to \infty} R(\pi; \Lambda)/\Lambda = 0$. With the choice of exploration threshold discussed in §5.2, Theorem 5.2 implies that the average regret of MF-MI-Greedy vanishes; hence MF-MI-Greedy is no-regret as $\Lambda \to \infty$.
Furthermore, let us compare the above result with the regret bound of the single-fidelity GP optimization algorithm SF-GP-OPT. By the assumption of Theorem 5.2, running SF-GP-OPT alone for $T$ rounds incurs regret at most $c\sqrt{T \gamma_T}$, where $\gamma_T$ is the information gain of all the (target-fidelity) actions made by SF-GP-OPT. When the low-fidelity actions are very informative about the target function and have much lower costs than $\lambda^{(M)}$ (hence a larger information gain per unit cost), MF-MI-Greedy spends less of its budget exploring the target fidelity, and hence becomes more advantageous relative to SF-GP-OPT. The implication of Theorem 5.2 is similar in spirit to the regret bound provided in Kandasamy et al. (2016); however, our results apply to a much broader class of optimization strategies, as suggested by the following corollary.
Running MF-MI-Greedy with the subroutine GP-UCB, EST, or GP-MI in the optimization phase is no-regret.
6 Experiments
In this section, we empirically evaluate our algorithm on three synthetic test-function optimization tasks and two practical optimization problems.
6.1 Experimental Setup
To model the relationship between a low-fidelity function $f^{(m)}$ and the target-fidelity function $f^{(M)}$, we use an additive model. Specifically, we assume that $f^{(m)} = f^{(M)} + e^{(m)}$ for all fidelity levels $m < M$, where $e^{(m)}$ is an unknown function characterizing the error incurred by the lower-fidelity function. We use independent Gaussian processes to model $f^{(M)}$ and each $e^{(m)}$. Since $f^{(M)}$ is embedded in every fidelity level, we can use an observation from any fidelity to update the posterior for every fidelity level. We use squared exponential kernels for all the GP covariances, with hyperparameter tuning scheduled periodically during optimization. We keep the same experimental setup for MF-GP-UCB as in Kandasamy et al. (2016).
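Under this additive model, the covariance between observations at any two fidelities has a simple closed form: every pair of observations shares the kernel of $f^{(M)}$, and two observations at the same lower fidelity additionally share the error kernel. The sketch below (our own notation and hyperparameters; fidelity labels $1, \dots, M$ with target $M = 2$) shows how a single low-fidelity observation updates the posterior at the target fidelity.

```python
import numpy as np

def rbf(A, B, ls=1.0):
    """Squared-exponential kernel between row-stacked inputs."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / ls**2)

def joint_cov(X1, m1, X2, m2, ls_f=1.0, ls_e=0.3, var_e=0.25, M=2):
    """Covariance between observations at (inputs X1, fidelities m1) and
    (X2, m2) under the additive model f^(m) = f^(M) + e^(m), with
    independent GP error terms per lower fidelity."""
    K = rbf(X1, X2, ls_f)                         # shared term from f^(M)
    same_low = (m1[:, None] == m2[None, :]) & (m1[:, None] < M)
    return K + same_low * var_e * rbf(X1, X2, ls_e)  # matching low-fidelity errors

def mf_posterior(X, m, y, Xq, mq, noise=1e-4, **kw):
    """Joint GP posterior at query points/fidelities given mixed-fidelity data."""
    K = joint_cov(X, m, X, m, **kw) + noise * np.eye(len(X))
    ks = joint_cov(X, m, Xq, mq, **kw)
    mu = ks.T @ np.linalg.solve(K, y)
    var = np.diag(joint_cov(Xq, mq, Xq, mq, **kw)) \
        - np.sum(ks * np.linalg.solve(K, ks), axis=0)
    return mu, var
```

Because the cross-covariance between a low-fidelity observation and the target function is the shared $f^{(M)}$ kernel, a cheap query at fidelity 1 shrinks the target-fidelity posterior variance, exactly the information sharing the framework exploits.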
For all our experiments, we use a total budget equal to 100 times the cost $\lambda^{(M)}$ of a target-fidelity function call. When the optimal value of $f^{(M)}$ is known, we compare simple regrets; otherwise, we compare simple rewards.
6.2 Compared Methods
Our framework is general, and we can plug different single-fidelity Bayesian optimization algorithms into the SF-GP-OPT procedure in Algorithm 1. In our experiments, we choose GP-UCB as one instantiation. We compare with MF-GP-UCB (Kandasamy et al., 2016) and GP-UCB (Srinivas et al., 2010).
6.3 Synthetic Examples
We first evaluate our algorithm on three synthetic datasets, namely (a) Hartmann 6D, (b) Currin exponential 2D, and (c) Borehole 8D (Kandasamy et al., 2016). We follow the setup used in Kandasamy et al. (2016) to define the lower-fidelity functions, but use a different definition of lower-fidelity costs. We emphasize that in synthetic settings, the artificially defined costs have no practical meaning, as function evaluation costs do not actually differ across fidelity levels. Nevertheless, we set the cost of function evaluations to increase monotonically with the fidelity level, and present the results in Fig. 3. The x-axis represents the expended budget and the y-axis represents the smallest simple regret so far. The error bars represent one standard error over 20 runs of each experiment.
Our method MF-MI-Greedy is generally competitive with MF-GP-UCB. A common issue is that its simple regret tends to be larger at the beginning. A likely cause of this behavior is that the parameters controlling the termination conditions are not tuned optimally early on, which leads to over-exploration in regions that do not reveal much information about where the function optimum lies.
6.4 Real Experiments
We test our methods on two real datasets: maximum likelihood inference for cosmological parameters and experimental design for material science.
6.4.1 Maximum Likelihood Inference for Cosmological Parameters
The first real experiment is to perform maximum likelihood inference on three cosmological parameters: the Hubble constant $H_0$, the dark matter fraction $\Omega_M$, and the dark energy fraction $\Omega_\Lambda$. The task thus has a dimensionality of 3. The likelihood is given by the Robertson-Walker metric, which requires a one-dimensional numerical integration for each point in the dataset from Davis et al. (2007). In Kandasamy et al. (2017), the authors set up two lower-fidelity functions by considering two aspects of computing the likelihood: (i) how many data points (denoted by $N$) are used, and (ii) the discrete grid size (denoted by $G$) used for performing the numerical integration. We follow the fidelity levels selected in Kandasamy et al. (2017), which correspond to two lower fidelities and the target fidelity with increasing values of $N$ and $G$. Costs are defined as the product of $N$ and $G$.
Upon further investigation, we find that the grid sizes selected above do not affect the final integral values, i.e., the grid size for the lowest fidelity is already fine enough to compute as good an approximation to the integral as the grid size for the target fidelity. Costs that account for the integration grid sizes are therefore not an accurate characterization of the true computational cost. As a result, we propose a different cost definition that depends only on the number of data points used to compute the likelihood, i.e., the new cost of each of the three functions is proportional to its value of $N$.
The results using the original cost definition are shown in Figure 2(a). Note that for this task we do not know the optimal likelihood, so we report the best objective value so far (simple reward) on the y-axis. Our method MF-MI-Greedy (red) outperforms both baselines. The results using the new cost definition are shown in Figure 2(b). Our method consistently obtains a high likelihood when the cost structure changes, whereas the quality of MF-GP-UCB degrades significantly, which implies that it is sensitive to how the costs among fidelity levels are defined. These two sets of results demonstrate the robustness of our method to cost misspecification, which is a desirable property, as inaccuracy in cost estimates is inevitable in practical applications.
6.4.2 Experimental Design for Optimizing Nanophotonic Structures
The second experiment is motivated by a material science task of designing nanophotonic structures with a desired color filtering property (Fleischman et al., 2017). A nanophotonic structure is characterized by the following 5 parameters: mirror height, film thickness, mirror spacing, slit width, and oxide thickness. For each parameter setting, we use a score, commonly called a figure of merit (FOM), to represent how well the resulting structure satisfies the desired color filtering property. By minimizing the FOM, we hope to find a set of high-quality design parameters.
Traditionally, a FOM can only be computed by actually fabricating a structure and testing its various physical properties, which is a time-consuming process. Alternatively, simulations can be used to estimate the physical properties a design will have. By solving a variant of Maxwell's equations, we can simulate the transmission spectrum of light and compute the FOM from the spectrum. We collect data at three fidelity levels for 5,000 nanophotonic structures. What distinguishes each fidelity is the mesh size used to solve Maxwell's equations: finer meshes lead to more accurate results, with the lowest, middle, and target fidelities using successively finer meshes. Since the cost (simulation time) grows as the mesh becomes finer, we use costs of [1, 4, 9] for the three fidelity functions.
Figure 2(c) shows the results of this experiment. As before, the x-axis is the expended cost and the y-axis is the negative FOM. After a small portion of the budget is spent on initial exploration, MF-MI-Greedy (red) is able to arrive at a better final design than MF-GP-UCB and GP-UCB.
7 Conclusion
In this paper, we investigated the multi-fidelity Bayesian optimization problem and proposed a general, principled framework for addressing it. We introduced a simple, intuitive notion of regret, and showed that our framework is able to lift many popular, off-the-shelf single-fidelity GP optimization algorithms to the multi-fidelity setting while preserving their original regret bounds. We demonstrated the performance of our proposed algorithm on several synthetic and real datasets.
8 Acknowledgments
This work was supported in part by NSF Award #1645832, Northrop Grumman, Bloomberg, and a Swiss NSF Early Mobility Postdoctoral Fellowship.
References
Alvarez & Lawrence (2009) Alvarez, M. and Lawrence, N. D. Sparse convolved Gaussian processes for multi-output regression. In Advances in Neural Information Processing Systems, pp. 57–64, 2009.
Bonilla et al. (2008) Bonilla, E. V., Chai, K. M., and Williams, C. Multi-task Gaussian process prediction. In Advances in Neural Information Processing Systems, pp. 153–160, 2008.
Boyle & Frean (2005) Boyle, P. and Frean, M. Dependent Gaussian processes. In Advances in Neural Information Processing Systems, pp. 217–224, 2005.

Chen et al. (2015) Chen, Y., Javdani, S., Karbasi, A., Bagnell, J. A., Srinivasa, S., and Krause, A. Submodular surrogates for value of information. In Proc. Conference on Artificial Intelligence (AAAI), January 2015. URL https://bit.ly/2QEfJnZ.
Contal et al. (2014) Contal, E., Perchet, V., and Vayatis, N. Gaussian process optimization with mutual information. In International Conference on Machine Learning, pp. 253–261, 2014. URL https://bit.ly/2x7EEbw.
Davis et al. (2007) Davis, T. M., Mörtsell, E., Sollerman, J., Becker, A. C., Blondin, S., Challis, P., Clocchiatti, A., Filippenko, A., Foley, R., Garnavich, P. M., et al. Scrutinizing exotic cosmological models using ESSENCE supernova data combined with other cosmological probes. The Astrophysical Journal, 666(2):716, 2007.
Fleischman et al. (2017) Fleischman, D., Sweatlock, L. A., Murakami, H., and Atwater, H. Hyper-selective plasmonic color filters. Optics Express, 25(22):27386–27395, 2017.
Forrester et al. (2007) Forrester, A. I., Sóbester, A., and Keane, A. J. Multi-fidelity optimization via surrogate modelling. In Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, volume 463, pp. 3251–3269. The Royal Society, 2007. URL https://bit.ly/2xkMXRr.
Hennig & Schuler (2012) Hennig, P. and Schuler, C. J. Entropy search for information-efficient global optimization. Journal of Machine Learning Research, 13(Jun):1809–1837, 2012. URL https://bit.ly/2x5KMQC.
Hernández-Lobato et al. (2014) Hernández-Lobato, J. M., Hoffman, M. W., and Ghahramani, Z. Predictive entropy search for efficient global optimization of black-box functions. In Advances in Neural Information Processing Systems, pp. 918–926, 2014.
Kandasamy et al. (2016) Kandasamy, K., Dasarathy, G., Oliva, J. B., Schneider, J., and Póczos, B. Gaussian process bandit optimisation with multi-fidelity evaluations. In Advances in Neural Information Processing Systems, pp. 992–1000, 2016. URL https://bit.ly/2Qngemh.
Kandasamy et al. (2017) Kandasamy, K., Dasarathy, G., Schneider, J., and Póczos, B. Multi-fidelity Bayesian optimisation with continuous approximations. In International Conference on Machine Learning, pp. 1799–1808, 2017. URL https://bit.ly/2N9KgMq.
Lázaro-Gredilla et al. (2010) Lázaro-Gredilla, M., Quiñonero-Candela, J., Rasmussen, C. E., and Figueiras-Vidal, A. R. Sparse spectrum Gaussian process regression. Journal of Machine Learning Research, 11:1865–1881, 2010.
Le Gratiet & Garnier (2014) Le Gratiet, L. and Garnier, J. Recursive co-kriging model for design of computer experiments with multiple levels of fidelity. International Journal for Uncertainty Quantification, 4(5), 2014. URL https://bit.ly/2PICVQu.

Marco et al. (2017) Marco, A., Berkenkamp, F., Hennig, P., Schoellig, A. P., Krause, A., Schaal, S., and Trimpe, S. Virtual vs. real: Trading off simulations and physical experiments in reinforcement learning with Bayesian optimization. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 1557–1563, 2017. URL https://bit.ly/2Oa4e62.
Nguyen & Bonilla (2014) Nguyen, T. V. and Bonilla, E. V. Collaborative multi-output Gaussian processes. In UAI, pp. 643–652, 2014. URL https://bit.ly/2D4BwCH.
 Rahimi & Recht (2008) Rahimi, A. and Recht, B. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pp. 1177–1184, 2008.
 Rasmussen & Williams (2006) Rasmussen, C. E. and Williams, C. K. I. Gaussian Processes for Machine Learning. MIT Press, 2006. URL https://bit.ly/2tYpBix.
 Romero et al. (2013) Romero, P. A., Krause, A., and Arnold, F. H. Navigating the protein fitness landscape with Gaussian processes. Proceedings of the National Academy of Sciences, 110(3):E193–E201, 2013.
 Sen et al. (2018) Sen, R., Kandasamy, K., and Shakkottai, S. Multi-fidelity black-box optimization with hierarchical partitions. In Proceedings of the 35th International Conference on Machine Learning (ICML), 2018. URL https://bit.ly/2MSIsSQ.
 Snelson & Ghahramani (2007) Snelson, E. and Ghahramani, Z. Local and global sparse Gaussian process approximations. In Artificial Intelligence and Statistics, pp. 524–531, 2007.
 Snoek et al. (2012) Snoek, J., Larochelle, H., and Adams, R. P. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pp. 2951–2959, 2012.
 Srinivas et al. (2010) Srinivas, N., Krause, A., Kakade, S., and Seeger, M. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proc. International Conference on Machine Learning (ICML), 2010. URL https://bit.ly/2CNGPGc.
 Teh et al. (2005) Teh, Y. W., Seeger, M., and Jordan, M. I. Semiparametric latent factor models. In Artificial Intelligence and Statistics 10, 2005.
 Titsias (2009) Titsias, M. Variational learning of inducing variables in sparse gaussian processes. In Artificial Intelligence and Statistics, pp. 567–574, 2009.
 van Gunsteren & Berendsen (1990) van Gunsteren, W. F. and Berendsen, H. J. Computer simulation of molecular dynamics: Methodology, applications, and perspectives in chemistry. Angewandte Chemie International Edition in English, 29(9):992–1023, 1990.
 Wang & Jegelka (2017) Wang, Z. and Jegelka, S. Max-value entropy search for efficient Bayesian optimization. arXiv preprint arXiv:1703.01968, 2017. URL https://arxiv.org/pdf/1703.01968.pdf.
 Wang et al. (2016) Wang, Z., Zhou, B., and Jegelka, S. Optimization as estimation with Gaussian processes in bandit settings. In Artificial Intelligence and Statistics, pp. 1022–1031, 2016. URL https://bit.ly/2OeSOhp.
 Wilson et al. (2012) Wilson, A. G., Knowles, D. A., and Ghahramani, Z. Gaussian process regression networks. In Proceedings of the 29th International Conference on Machine Learning (ICML), 2012.
 Zhang et al. (2017) Zhang, Y., Hoang, T. N., Low, B. K. H., and Kankanhalli, M. Information-based multi-fidelity Bayesian optimization. NIPS Workshop on Bayesian Optimization, 2017. URL https://bit.ly/2N5CdjH.
Appendix A Proofs for §5
A.1 Proof of Theorem 5.2
[Proof of Theorem 5.2] Assume that MF-MI-Greedy terminates within $K$ episodes. Let us use $\pi = (\pi_1, \dots, \pi_K)$ to denote the sequence of actions selected by MF-MI-Greedy, where $\pi_k$ denotes the sequence of actions selected at the $k$-th episode. Further, let $\lambda_k$ be the cost of the $k$-th episode, and $\lambda^{\mathrm{LF}}_k$ the cost of the lower-fidelity actions of the $k$-th episode. The budget allocated for the target fidelity is then $\Lambda - \sum_{k=1}^{K} \lambda^{\mathrm{LF}}_k$. By the definition of the cumulative regret (Eq. (6)), we get
(7) 
The first term on the R.H.S. of Eq. (7) represents the regret incurred from exploring the lower-fidelity actions, while the second term represents the regret from the target-fidelity actions (chosen by SF-GP-OPT).
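The decomposition just described can be written schematically as follows; the symbols here ($K$ for the number of episodes, $\pi_k$ for the actions of episode $k$, and $R^{\mathrm{LF}}$, $R^{(m)}$ for the two regret components) are illustrative notation for this sketch, not necessarily the paper's exact definitions:

```latex
R_\Lambda
  \;=\; \underbrace{\sum_{k=1}^{K} R^{\mathrm{LF}}(\pi_k)}_{\text{lower-fidelity exploration}}
  \;+\; \underbrace{R^{(m)}(\pi)}_{\text{target-fidelity actions chosen by SF-GP-OPT}}
```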
According to the stopping condition of Algorithm 2 at Line 2, we know that when Explore-LF terminates at episode $k$, the selected lower-fidelity actions satisfy
where $\mathcal{D}_k$ denotes the observations obtained up to episode $k$, and $\eta_k$ specifies the stopping condition of Explore-LF at episode $k$. Therefore
(8)  
where step (a) is because for . Recall that . Therefore,
(9) 
Note that the second term of Eq. (7) is the regret of MF-MI-Greedy on the target fidelity. Since all the target-fidelity actions are selected by the subroutine SF-GP-OPT, by assumption the regret incurred by these actions is bounded by the single-fidelity guarantee of SF-GP-OPT. Combining this with Eq. (9) completes the proof.
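The episode structure underlying this proof — greedily gathering cheap lower-fidelity information until the benefit-per-cost ratio falls below a stopping threshold, then handing the remaining budget to the target fidelity — can be sketched as follows. All names here (`explore_lf`, `info_gain`, `cost`, and the toy diminishing-gain model) are hypothetical illustrations, not the paper's actual subroutines:

```python
def explore_lf(actions, info_gain, cost, budget, threshold):
    """Greedy lower-fidelity exploration: repeatedly pick the action with
    the highest information-gain-per-cost ratio; stop once the best ratio
    drops below `threshold` or the budget would be exceeded."""
    selected, spent = [], 0.0
    remaining = list(actions)
    while remaining:
        best = max(remaining, key=lambda a: info_gain(a, selected) / cost(a))
        ratio = info_gain(best, selected) / cost(best)
        if ratio < threshold or spent + cost(best) > budget:
            break  # remaining budget is handed to the target fidelity
        selected.append(best)
        spent += cost(best)
        remaining.remove(best)
    return selected, budget - spent

# Toy model: information gain shrinks as more actions are selected,
# and every lower-fidelity action has unit cost.
gain = lambda a, chosen: 1.0 / (1 + len(chosen))
unit_cost = lambda a: 1.0
picked, leftover = explore_lf(range(10), gain, unit_cost, budget=8.0, threshold=0.3)
```

In this toy run the per-cost gains are 1, 1/2, 1/3, 1/4, …, so exploration stops after three picks (1/4 < 0.3) and the remaining five units of budget are left for the target fidelity.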
A.2 Proof of Corollary 5.2
To show that running MF-MI-Greedy with subroutine GP-UCB (Srinivas et al., 2010), EST (Wang et al., 2016), or GP-MI (Contal et al., 2014) in the optimization phase is no-regret, it suffices to show that the candidate subroutines GP-UCB, EST, and GP-MI satisfy the assumption on SF-GP-OPT stated in Theorem 5.2. Each of these algorithms is known to achieve a sublinear cumulative regret bound in the single-fidelity setting, as established in the respective references, so the assumption holds and the corollary follows.
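For concreteness, here is a minimal single-fidelity GP-UCB loop of the kind the corollary refers to. This is a sketch under assumptions — the RBF kernel, fixed lengthscale, constant `beta`, and toy objective are illustrative choices, not the schedules analyzed by Srinivas et al. (2010):

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=0.5):
    # Squared-exponential kernel with unit signal variance.
    d2 = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior(X_train, y_train, X_test, noise=1e-4):
    # Standard GP regression posterior mean and standard deviation.
    K = rbf_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    K_s = rbf_kernel(X_train, X_test)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mu = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    var = np.clip(1.0 - np.sum(v**2, axis=0), 1e-12, None)
    return mu, np.sqrt(var)

def gp_ucb_select(X_train, y_train, candidates, beta=4.0):
    # GP-UCB acquisition: argmax of mu(x) + sqrt(beta) * sigma(x).
    mu, sigma = gp_posterior(X_train, y_train, candidates)
    return candidates[np.argmax(mu + np.sqrt(beta) * sigma)]

# Toy run: maximize f(x) = -(x - 0.6)^2 over a grid on [0, 1].
f = lambda x: -(x - 0.6) ** 2
grid = np.linspace(0.0, 1.0, 201).reshape(-1, 1)
X = np.array([[0.0], [1.0]])            # two initial observations
y = f(X).ravel()
for _ in range(10):
    x_next = gp_ucb_select(X, y, grid)  # next query point, shape (1,)
    X = np.vstack([X, x_next])
    y = np.append(y, f(x_next).item())
best_x = X[np.argmax(y)].item()         # best observed point, near 0.6
```

The exploration bonus `sqrt(beta) * sigma` dominates early (so the loop samples where the posterior is most uncertain) and fades as observations accumulate, after which the posterior mean steers queries toward the optimum.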