1 Introduction
Global optimisation is core to any complex problem where design and choice play a role. Within machine learning, such problems are found in the tuning of hyperparameters (Snoek et al., 2012), sensor selection (Garnett et al., 2010) or experimental design (González et al., 2014; Martinez-Cantin et al., 2009). Most global optimisation techniques are myopic, in considering no more than a single step into the future. Relieving this myopia requires solving the multi-step lookahead problem: the global optimisation of a function by considering the significance of the next function evaluation on function evaluations (steps) further into the future.

It is clear that a solution to this problem would offer performance gains. For example, consider the case in which we have a budget of two evaluations with which to optimise a function over a one-dimensional domain. If we are strictly myopic, our first evaluation will likely be at the centre of the domain, and our second at the midpoint of only one of the two remaining halves. This myopic strategy will thereby result in ignoring half of the domain, regardless of the second choice. If we adopt a two-step lookahead approach, we will select function evaluations that will be more evenly distributed across the domain by the time the budget is exhausted. We will consequently be better informed about the function and its optimum.

There is a limited literature on the multi-step lookahead problem. Osborne et al. (2009) perform multi-step lookahead by optimising future evaluation locations and sampling over future function values. This approach scales poorly with the number of future evaluations considered, and the authors present results for no more than two-step lookahead. Marchant et al. (2014)
reframe the multi-step lookahead problem as a partially observed Markov decision process, and adopt a Monte Carlo tree search approach to solving it. Again, the scaling of the approach permits the authors to consider no more than six steps into the future. Earlier, Streltsov and Vakili (1999) studied the multi-step lookahead problem, proposing a utility function that maximises the total ‘reward’ of the algorithm by taking into account the cost of future computations, rather than trying to find the optimum after a fixed number of evaluations.

Interestingly, there exists a link between the multi-step lookahead problem and batch Bayesian optimisation (Ginsbourger et al., 2009; Azimi et al., 2012). In the latter case, batches of locations, rather than individual observations, are selected in each iteration of the algorithm and evaluated in parallel. When such locations are selected greedily, that is, one after the other, the key to selecting good batches is the ability of the batch criterion to predict future steps of the algorithm. In this work we will exploit this parallelism to compute a non-myopic loss for Bayesian optimisation.
This paper is organised as follows. In Section 2 we formalise the problem and describe the contributions of this work. Section 3 describes the details of the proposed algorithm. Section 4 illustrates the superior performance of glasses on a variety of test functions, and we conclude in Section 5 with a discussion of the most interesting results observed in this work.
2 Background and challenge
2.1 Bayesian optimisation with one-step lookahead
Let f be a well-behaved function defined on a compact subset X of R^d. We are interested in solving the global optimisation problem of finding

x_min = arg min_{x in X} f(x).

We assume that f is a black-box from which only perturbed evaluations of the type y_i = f(x_i) + epsilon_i, with Gaussian noise epsilon_i, are available. Bayesian optimisation (bo) is a heuristic strategy to make a series of evaluations x_1, ..., x_n of f, typically very limited in number, such that the minimum of f is evaluated as soon as possible (Lizotte, 2008; Jones, 2001; Snoek et al., 2012; Brochu et al., 2010).

Assume that n points have been gathered so far, yielding a dataset D_n = {(x_i, y_i)}_{i=1}^n. Before collecting any new point, a surrogate probabilistic model for f is calculated. This is typically a Gaussian process (gp) with mean function mu(x) and covariance function k(x, x'), whose parameters will be denoted by theta. Let I_n be the currently available information: the conjunction of D_n, the model parameters and the model likelihood type. Under a Gaussian likelihood, the predictive distribution for f at a location x is also Gaussian, with posterior mean and variance

mu(x | I_n) = k(x)^T (K + sigma_noise^2 I)^{-1} y,    sigma^2(x | I_n) = k(x, x) - k(x)^T (K + sigma_noise^2 I)^{-1} k(x),

where K is the n x n matrix such that K_ij = k(x_i, x_j), and k(x) is the vector of covariances between x and the observed locations (Rasmussen and Williams, 2005).
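These posterior equations can be sketched numerically. The following is a minimal illustration (our own code, not the paper's implementation), using a squared exponential kernel with arbitrary hyperparameters rather than the fitted model described above:

```python
import numpy as np

def sq_exp_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared exponential covariance k(x, x') = v * exp(-||x - x'||^2 / (2 l^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_posterior(X, y, Xstar, noise=1e-3):
    """Posterior mean and variance of a zero-mean GP at the test points Xstar."""
    K = sq_exp_kernel(X, X) + noise * np.eye(len(X))  # K_ij = k(x_i, x_j) + noise
    Ks = sq_exp_kernel(Xstar, X)                      # cross-covariances k(x)
    alpha = np.linalg.solve(K, y)
    mu = Ks @ alpha                                   # posterior mean
    v = np.linalg.solve(K, Ks.T)
    var = sq_exp_kernel(Xstar, Xstar).diagonal() - (Ks * v.T).sum(1)
    return mu, var
```

At the training inputs, with near-zero noise, the posterior mean reproduces the observations and the posterior variance collapses, which is a quick sanity check of the algebra.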
Given the gp model, we now need to determine the best location to sample. Imagine that we have only one remaining evaluation before we need to report our inferred location of the minimum of f. Denote by eta = min{y_1, ..., y_n} the current best found value. We can define the loss of evaluating f this last time at x, assuming it returns y, as

lambda(y) = min(y, eta).

Its expectation is

Lambda_1(x | I_n) = E[min(y, eta)],

where the subscript in Lambda_1 refers to the fact that we are considering one future evaluation. Given the properties of the gp, Lambda_1(x | I_n) can be computed in closed form for any x. In particular, for Phi the usual Gaussian cumulative distribution function and phi its density, we have that

Lambda_1(x | I_n) = eta + (mu - eta) Phi(gamma) - sigma phi(gamma),    gamma = (eta - mu) / sigma,

where we have abbreviated mu(x | I_n) as mu and sigma(x | I_n) as sigma. Finally, the next evaluation is located where Lambda_1 takes its minimum value. This point can be obtained by any gradient-descent algorithm, since analytical expressions for the gradient and Hessian of Lambda_1 exist (Osborne, 2010).
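The closed form follows from the standard Gaussian identity E[min(y, eta)] = eta + (mu - eta) Phi(gamma) - sigma phi(gamma), with gamma = (eta - mu) / sigma and y ~ N(mu, sigma^2). A small sketch (our illustration), easy to check against Monte Carlo:

```python
import numpy as np
from scipy.stats import norm

def myopic_expected_loss(mu, sigma, eta):
    """Closed form of E[min(y, eta)] for y ~ N(mu, sigma^2):
    eta + (mu - eta) * Phi(gamma) - sigma * phi(gamma), gamma = (eta - mu) / sigma."""
    gamma = (eta - mu) / sigma
    return eta + (mu - eta) * norm.cdf(gamma) - sigma * norm.pdf(gamma)
```

Since min(y, eta) never exceeds eta, the expected loss is always at most the current best value; a Monte Carlo average of min(y, eta) over Gaussian draws agrees with the formula to sampling error.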
2.2 Looking many steps ahead
Expression (2.1) can also be used as a myopic approximation to the optimal decision when several evaluations of f remain available. Indeed, most bo methods are myopic and ignore the decisions that the algorithm will make in future steps.
Let j index the remaining available evaluations and denote by I_j the available information after the dataset has been augmented with j further observations and the parameters of the model updated. We use Lambda_n(x | I_0) to denote the expected loss of selecting x given I_0 and considering n future evaluations. A proper Bayesian formulation allows us to define this long-sight loss (Osborne, 2010) as^1

^1We assume that .
(2)  
where the first density is the predictive distribution of the gp at the corresponding location, and the remaining factor
reflects the optimisation step required to obtain each location after all the previous evaluations have been iteratively optimised and marginalised. The graphical probabilistic model underlying (2) is illustrated in Figure 1.
To evaluate Eq. (2) we can successively sample the future outcomes and optimise for the corresponding future locations. This is done in (Osborne, 2010) for only two-step lookahead. The reason why further horizons remain unexplored is the computational burden required to compute this loss for many steps ahead. Note that analytical expressions are only available in the myopic case n = 1.
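The combinatorial growth of this optimise-then-marginalise recursion can be sketched numerically. The following toy code is our own illustration, not the method of Osborne (2010): it estimates a multi-step loss over a small discrete candidate set by sampling outcomes and recursing, crudely treating the candidate beliefs as independent. Each extra step multiplies the number of tree nodes by (candidates x samples), which is why horizons beyond a couple of steps are impractical.

```python
import numpy as np

rng = np.random.default_rng(1)

def lookahead_loss(mu, var, eta, steps, n_samples=5):
    """Monte Carlo estimate of a multi-step expected loss on a discrete
    candidate set. `mu`/`var` are the current beliefs at the candidates;
    each level samples an outcome and recurses, so the cost grows as
    (len(mu) * n_samples) ** steps."""
    best_loss = np.inf
    for j in range(len(mu)):                 # optimise over the next location
        samples = rng.normal(mu[j], np.sqrt(var[j]), n_samples)
        losses = []
        for y in samples:                    # marginalise over its outcome
            new_eta = min(eta, y)
            if steps == 1:
                losses.append(new_eta)
            else:
                # after observing y at j, that candidate's value is known
                mu2, var2 = mu.copy(), var.copy()
                mu2[j], var2[j] = y, 1e-12
                losses.append(lookahead_loss(mu2, var2, new_eta,
                                             steps - 1, n_samples))
        best_loss = min(best_loss, float(np.mean(losses)))
    return best_loss
```

Because every leaf value is min(eta, observed outcomes), the estimated loss can never exceed the current best value eta, mirroring the myopic case.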
2.3 Contributions of this work
The goal of this work is to propose a computationally efficient approximation to Eq. (2) for many steps ahead that is able to relieve the myopia of classical Bayesian optimisation. The contributions of this paper are:

A new algorithm, glasses, to relieve the myopia of Bayesian optimisation, able to efficiently take into account dozens of steps ahead. The method is based on predicting the future steps of the myopic loss in order to efficiently integrate out a long-sight loss.

The key aspect of our approach is to split the recursive optimisation-marginalisation loop in Eq. (2) into two independent optimisation and marginalisation steps that jointly act on all the future steps. We propose an Expectation Propagation formulation for the joint marginalisation, and we discuss different strategies to carry out the optimisation step.

Together with this work, we deliver an open-source Python framework^2 (http://sheffieldml.github.io/GPyOpt/) containing a fully functional implementation of the method, useful both to reproduce the results of this work and to address general global optimisation problems. As we mentioned in the introduction, there exists a limited literature on non-myopic bo methods and, to our knowledge, no available generic bo package can be used with non-myopic loss functions.

Simulations: new practical experiments and insights showing that non-myopic methods outperform myopic approaches on a benchmark of optimisation problems.
3 The GLASSES algorithm
As detailed in the previous section, a proper multi-step lookahead loss function requires the iterative optimisation-marginalisation of the future steps, which is computationally intractable. A possible way of dealing with this issue is to jointly model our epistemic uncertainty over the future locations with a joint probability distribution, and to consider the expected loss

(3)
for the vector of future evaluations of f at those locations and X the matrix whose rows are the future evaluation locations. The vector of future evaluations is multivariate Gaussian, since it corresponds to the predictive distribution of the gp at X. The graphical probabilistic model underlying (3) is illustrated in Figure 2. This loss differs from Eq. (2) in that all future evaluations are modelled jointly rather than sequentially.

A proper choice of the distribution over future locations is crucial here. An interesting option would be to choose it to be some determinantal point process (dpp)^3 defined on X (Affandi et al., 2014) and to integrate Eq. (3) by averaging over multiple samples (Kulesza and Taskar, 2012, 2011). dpps provide a density over sample locations that induces them to be dissimilar to each other (as well-spaced samples should be), but that can be concentrated in chosen regions (such as regions of low myopic expected loss). However, although dpps have nice computational properties on discrete sets, here we would need to take samples from the dpp by conditioning on the putative location and the number of steps ahead. Although this is possible in theory, the computational burden of generating these samples makes this strategy impractical.

An alternative and more efficient approach, which we explore here, is to work with a fixed X, which we assume is computed beforehand. As we show in this section, although this approach does not make use of our epistemic uncertainty about the future steps, it drastically reduces the computational burden of approximating the loss.

^3 A determinantal point process is a probability measure over sets that is entirely characterised by the determinant of some (kernel) function.
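The repulsion property mentioned above is concrete on a discrete set: the unnormalised density of a discrete dpp is the determinant of the kernel submatrix indexed by the chosen points, so well-spaced subsets receive much higher density than clustered ones. A minimal sketch (the grid, kernel and lengthscale below are our own illustrative choices):

```python
import numpy as np

def dpp_log_density(L, subset):
    """Unnormalised log-density of a discrete dpp with kernel matrix L:
    P(S) is proportional to det(L_S), the principal minor indexed by S."""
    sign, logdet = np.linalg.slogdet(L[np.ix_(subset, subset)])
    return logdet if sign > 0 else -np.inf

# a candidate grid on [0, 1] and an RBF similarity kernel (lengthscale 0.1)
x = np.linspace(0.0, 1.0, 20)
L = np.exp(-((x[:, None] - x[None, :]) ** 2) / (2 * 0.1 ** 2))

spread = [0, 9, 19]    # three well-separated candidates
clustered = [0, 1, 2]  # three nearly coincident candidates
```

Near-duplicate points make the submatrix nearly singular, so its determinant (and hence the set's density) collapses, which is exactly the diversity-inducing behaviour described above.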
3.1 Oracle multiple steps lookahead expected loss
Suppose that we had access to an oracle function F able to predict the future locations that the loss would suggest if we started evaluating at x. In other words, F takes the putative location x as input and returns the predicted future locations. We work here under the assumption that the oracle has perfect information about the future locations, in the same way that we have perfect information about the locations that the algorithm has already visited. This is an unrealistic assumption in practice, but it will help us to set up our algorithm. We leave for the next section the details of how to approximate the oracle.
Assume, for now, that the oracle F exists and that we have access to it. We denote by f_F(x) the vector of future evaluations of f at the locations F(x). Assuming that F is known, it is possible to rewrite the expected loss in Eq. (2) as
(4) 
where the expectation is taken over the multivariate Gaussian distribution, with the mean vector and covariance matrix that arise from marginalising the posterior distribution of the gp at the locations F(x). Note that under a fixed F, it also holds that . See the supplementary materials for details.

The intuition behind Eq. (4) is as follows: the expected loss at x is the best possible function value that we expect to find in the next steps, conditional on the first evaluation being made at x. Rather than merely quantifying the benefit provided by the next evaluation, this loss function accounts for the expected gain in the whole future sequence of evaluations of f. As we analyse in the experimental section of this work, the effect of this is an adaptive loss that tends to be more explorative when more remaining evaluations are available, and more exploitative as we approach the final evaluations of f.
To compute Eq. (4) we use Expectation Propagation, ep (Minka, 2001). This turns out to be a natural operation, as can be seen by observing that
where and
See the supplementary materials for details. The first term in Eq. (3.1) is a Gaussian probability on an unbounded polyhedron whose limits are aligned with the axes. The second term is the sum of Gaussian expectations on different non-axis-aligned polyhedra defined by the indicator functions. Both terms can be computed with ep using the approach proposed in (Cunningham et al., 2011). In a nutshell, to compute the integrals it is possible to replace the indicator functions with univariate Gaussians that play the role of soft indicators in the ep iterations. This method is computationally efficient and scales well to high dimensions. Note that when n = 1, Eq. (4) reduces to Eq. (2.1).
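For readers without an ep implementation at hand, the quantity being computed, the expected minimum of a jointly Gaussian vector capped at eta, can be sanity-checked by plain Monte Carlo. This is our own stand-in for illustration, not the ep method of Cunningham et al. (2011):

```python
import numpy as np
from scipy.stats import norm

def expected_min_mc(mean, cov, eta, n_samples=200000, seed=0):
    """Monte Carlo estimate of E[min(f_F, eta)] with f_F ~ N(mean, cov)."""
    rng = np.random.default_rng(seed)
    samples = rng.multivariate_normal(mean, cov, size=n_samples)
    return float(np.minimum(samples.min(axis=1), eta).mean())

def myopic_closed_form(mu, sigma, eta):
    """The one-step special case, available analytically."""
    gamma = (eta - mu) / sigma
    return eta + (mu - eta) * norm.cdf(gamma) - sigma * norm.pdf(gamma)
```

In the one-step case the Monte Carlo estimate matches the closed form, and adding further (correlated) future evaluations can only lower the expected minimum, which matches the intuition behind Eq. (4).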
Under the hypothesis of this section, the next evaluation is located where the loss takes its minimum value. We still need, however, to propose a way to approximate the oracle F. We do so in the next section.
3.2 Local Penalisation to Predict the Steps Ahead
This section proposes an empirical surrogate for the oracle F. A sensible option would be to use the maximum a posteriori (map) configuration of the above-mentioned dpp. However, as with the generation of dpp samples, computing the map of a dpp is an expensive operation (Gillenwater et al., 2012). Alternatively, here we use some ideas that have recently been developed in the batch Bayesian optimisation literature. In a nutshell, batch bo methods aim to define sets of points in the domain at which f should be evaluated in parallel, rather than sequentially. In essence, the key to building a ‘good’ batch is the same as the key to computing a good approximation of F: finding a set of good locations at which to evaluate the objective.
In this work we adapt to our context the batch bo idea proposed by González et al. (2015). Inspired by the repulsion properties of dpps, González et al. (2015) propose to build each batch by iteratively optimising and penalising the acquisition function in a neighbourhood of the already collected points by means of local penalisers. Note that any other batch method could be used here, but we adopt this approach because it is computationally tractable and scales well with the size of the batches (the number of steps ahead, in our context) and the dimensionality of the problem.
More formally, assume that the objective function is Lipschitz continuous, that is, it satisfies |f(x1) - f(x2)| <= L ||x1 - x2|| for all x1, x2 in the domain. Take M to be the minimum of f and L a valid Lipschitz constant. It is possible to show that the ball

(6)  B_j = { x : ||x - x_j|| <= r_j },

where r_j = (f(x_j) - M) / L, does not contain the minimum of f. Probabilistic versions of these balls are used to define the above-mentioned penalisers, by noting that under the gp the value f(x_j) is Gaussian distributed. In particular, the penaliser phi(x; x_j) is chosen as the probability that x, any point in the domain that is a potential candidate to be a minimum, does not belong to B_j. As detailed in González et al. (2015), these functions have a closed form and create an exclusion zone around the point x_j. The predicted k-th location when looking several steps ahead and using x as the putative point is defined as

(7)

for k >= 2, where the phi(.; x_j) are local penalisers centred at the previously predicted locations and g is the soft-plus transformation g(z) = ln(1 + e^z).
To illustrate how F computes the steps ahead, we include Figure 3. We show the myopic loss in a bo experiment, together with the locations predicted by F for two different putative points. In Case 1, the putative point (grey star) is close to the location of the minimum of the myopic loss (blue region). In Case 2, the putative point is located in an uninteresting region. In both cases the future locations explore the interesting region determined by the myopic loss, while the locations of the points are conditioned on the first location. In the bottom row of the figure we show how the points are selected in Case 1. The first putative point creates an exclusion zone that shifts the minimum of the acquisition, where the next location is selected. Iteratively, new locations are found by balancing diversity (due to the effect of the exclusion areas) and quality (exploring locations where the loss is low), similarly to the way probabilities over subsets are characterised in a dpp (Kulesza and Taskar, 2012).
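The penalised selection loop can be sketched on a one-dimensional grid. This is our own simplified rendering of the local-penalisation oracle: the penaliser follows the ball construction above, but the loss surface, Lipschitz constant, posterior standard deviations and soft-plus weighting are illustrative choices, not the paper's exact formulation (7).

```python
import numpy as np
from scipy.stats import norm

def softplus(z):
    return np.log1p(np.exp(z))

def predict_steps_ahead(x_grid, loss, mu, sigma, x0, k, M, L):
    """Sketch of the oracle F on a grid: starting from the putative point x0,
    iteratively pick the location with the best (soft-plus transformed) myopic
    loss, multiplied by local penalisers around the points chosen so far."""
    chosen = [x0]
    for _ in range(k - 1):
        score = softplus(-loss)
        for xj in chosen:
            j = int(np.argmin(np.abs(x_grid - xj)))  # grid index of xj
            dist = np.abs(x_grid - xj)
            # P(x lies outside the exclusion ball around xj): a soft indicator
            score = score * norm.cdf((L * dist - mu[j] + M) / (sigma[j] + 1e-12))
        chosen.append(float(x_grid[np.argmax(score)]))
    return np.array(chosen)
```

Each chosen point down-weights its own neighbourhood, so successive picks trade off quality (low loss) against diversity (staying outside existing exclusion zones), exactly the behaviour shown in Figure 3.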
3.3 Algorithm and computational costs
All the steps of glasses are detailed in Algorithm 1. The main computational cost is the calculation of the steps ahead, which is done using a sequence of lbfgs optimisers whose cost is governed by the maximum number of lbfgs updates. The use of ep to compute the value of the expected loss at each location requires a run-time factor quadratic in the dimensionality of each Gaussian factor.
4 Experiments
4.1 Interpreting the nonmyopic loss
The goal of this experiment is to visualise the effect on the expected loss of considering multiple steps ahead. To this end, we use the six-hump camel function (see Table 4 for details). We fit a gp with a squared exponential kernel and plot the myopic expected loss together with five variants that consider 2, 3, 5, 10 and 20 steps ahead. Increasing the number of steps ahead decreases the optimum value of the loss: the algorithm can visit more locations, so the expected minimum is lower. Increasing the number of steps ahead also flattens the loss: a good location is likely to be hit irrespective of the initial point, so all candidates look better because of the algorithm's future chances of reaching a good location. In practice, this behaviour translates into an acquisition function that becomes more explorative as we look further ahead.
4.2 Testing the effect of considering multiple steps ahead
Table 1: Test functions used in the experiments.
Name  Dimension(s)
SinCos  1  
Cosines  2  
Branin  2  
Sixhumpcamel  2  
McCormick  2  
Dropwave  2  
Beale  2  
Powers  2  
Alpine2  2, 5, 10  
Ackley  2, 5 
Table 2: ‘Gap’ performance of the compared methods on the test functions.
MPI  GPLCB  EL  EL2  EL3  EL5  EL10  GLASSES
SinCos  0.7147  0.6058  0.7645  0.8656  0.6027  0.4881  0.8274  0.9000 
Cosines  0.8637  0.8704  0.8161  0.8423  0.8118  0.7946  0.7477  0.8722 
Branin  0.9854  0.9616  0.9900  0.9856  0.9673  0.9824  0.9887  0.9811 
Sixhumpcamel  0.8983  0.9346  0.9299  0.9115  0.9067  0.8970  0.9123  0.8880 
Mccormick  0.9514  0.9326  0.9055  0.9139  0.9189  0.9283  0.9389  0.9424 
Dropwave  0.7308  0.7413  0.7667  0.7237  0.7555  0.7293  0.6860  0.7740 
Powers  0.2177  0.2167  0.2216  0.2428  0.2372  0.2390  0.2339  0.3670 
Ackley2  0.8230  0.8975  0.7333  0.6382  0.5864  0.6864  0.6293  0.7001 
Ackley5  0.1832  0.2082  0.5473  0.6694  0.3582  0.3744  0.6700  0.4348 
Ackley10  0.9893  0.9864  0.8178  0.9900  0.9912  0.9916  0.8340  0.8567 
Alpine22  0.8628  0.8482  0.7902  0.7467  0.5988  0.6699  0.6393  0.7807 
Alpine25  0.5221  0.6151  0.7797  0.6740  0.6431  0.6592  0.6747  0.7123 
To study the validity of our approximation we choose a variety of functions with a range of dimensions and domain sizes; see Table 1 for details. We use the full glasses algorithm (in which, at each iteration, the number of remaining iterations is used as the number of steps ahead), and we also show results when 2, 3, 5 and 10 steps of lookahead are used to compute the loss function. Each problem is solved 5 times with different random initialisations of 5 data points. The number of allowed evaluations is 10 times the dimensionality of the problem. This allows us to compare the average performance of each method on each problem. As a baseline we use the myopic expected loss, el. For comparative purposes we also use two other loss functions commonly used in the literature: the maximum probability of improvement, mpi, and the gp lower confidence bound, gplcb. In the latter case we set the parameter that balances exploration and exploitation to 1. See (Snoek et al., 2012) for details on these loss functions. All acquisition functions are optimised using the dividing rectangles algorithm, direct (Jones et al., 1993). As the surrogate model for the functions we use a gp with a squared exponential kernel plus a bias kernel (Rasmussen and Williams, 2005). The hyperparameters of the model are optimised by the standard method of maximising the marginal likelihood, using lbfgs (Nocedal, 1980) for 1,000 iterations and selecting the best of 5 random restarts. To compare the methods we use the ‘gap’ measure of performance (Huang et al., 2006), defined as

gap = ( y(x_first) - y(x_best) ) / ( y(x_first) - y(x_min) ),

where y denotes the evaluation of the objective function, x_min is the global minimum, and x_first and x_best are the first and best evaluated points, respectively. To avoid gap measures larger than one due to the noise in the data, the measures for each experiment are normalised across all methods. The initial point was chosen to be the best of the original points used to initialise the models.
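The gap measure is the fraction of the achievable improvement over the first evaluated point that the optimiser actually realised. A sketch of the raw measure (before the per-experiment normalisation mentioned above):

```python
import numpy as np

def gap_measure(y_evals, y_min):
    """'Gap' performance measure (Huang et al., 2006):
    (y(x_first) - y(x_best)) / (y(x_first) - y_min), i.e. the fraction of the
    possible improvement over the first evaluated point that was achieved."""
    y_first = y_evals[0]
    y_best = np.min(y_evals)
    return (y_first - y_best) / (y_first - y_min)
```

A gap of 1 means the global minimum was found; a gap of 0 means no improvement over the starting point.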
Table 2 shows the comparison across the different functions and methods. None of the methods is universally the best, but a non-myopic loss is the best in 6 of the 11 cases. In 3 cases the full glasses approach is the best of all the methods. Especially interesting are the cases of the McCormick and Powers functions: in these two cases, increasing the number of steps ahead used to compute the loss consistently improves the obtained results. Note as well that when the glasses algorithm is not the best global method, it typically performs close to the best alternative, which makes it a good ‘default’ choice when the function to optimise is expensive to evaluate.
5 Conclusions
In this paper we have explored myopia in Bayesian optimisation methods. For the first time in the literature, we have proposed a non-myopic loss that takes into account dozens of future evaluations before deciding where to sample the objective function. The key idea is to jointly model all future evaluations of the algorithm with a probability distribution, and to compute the expected loss by marginalising them out. Because this is an expensive step, we avoid it by proposing a fixed prediction of the future steps. Although this does not make use of the epistemic uncertainty about the steps ahead, it drastically reduces the computational burden of approximating the loss. We made use of the connection between the multiple-steps-ahead problem and methods proposed in the batch Bayesian optimisation literature to solve this issue. The final computation of the loss for each point in the domain is carried out by adapting ep to our context. As previously suggested in Osborne et al. (2009), our results confirm that using a non-myopic loss helps, in practice, to solve global optimisation problems. Interestingly, and as happens in any comparison of loss functions across many objective functions, there is no universally best method. However, in cases in which glasses is not superior, it performs very closely to the myopic loss, which makes it an interesting default choice in most scenarios.
Some interesting challenges will be addressed in the future, such as making the optimisation of the loss more efficient (direct is employed for this purpose in this work): although the smoothness of the loss is guaranteed if the steps ahead are consistently predicted for points close together in the space, if the prediction of the steps ahead fails, the optimisation of the loss may be challenging. The use of non-stationary kernels, extensions to high-dimensional problems, and efficient ways of sampling many steps ahead will also be analysed.
References
 Affandi et al. [2014] Raja H. Affandi, Emily B. Fox, and Ben Taskar. Approximate inference in continuous determinantal processes. In Neural Information Processing Systems 26. MIT Press, 2014.
 Azimi et al. [2012] Javad Azimi, Ali Jalali, and Xiaoli Zhang Fern. Hybrid batch Bayesian optimization. In Proceedings of the 29th International Conference on Machine Learning, 2012.
 Brochu et al. [2010] Eric Brochu, Vlad M. Cora, and Nando De Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599, 2010.
 Cunningham et al. [2011] John P. Cunningham, Philipp Hennig, and Simon LacosteJulien. Gaussian probabilities and expectation propagation. arXiv:1111.6832 [stat], Nov 2011. arXiv: 1111.6832.
 Garnett et al. [2010] Roman Garnett, Michael A. Osborne, and Stephen J. Roberts. Bayesian optimization for sensor set selection, pages 209–219. ACM, 2010. ISBN 1605589888. doi: 10.1145/1791212.1791238.
 Gillenwater et al. [2012] Jennifer Gillenwater, Alex Kulesza, and Ben Taskar. Nearoptimal MAP inference for determinantal point processes. In F. Pereira, C.J.C. Burges, L. Bottou, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 2735–2743. Curran Associates, Inc., 2012.
 Ginsbourger et al. [2009] David Ginsbourger, Rodolphe Le Riche, and Laurent Carraro. A multipoints criterion for deterministic parallel global optimization based on Gaussian processes. HAL: hal00260579, 2009.
 González et al. [2014] Javier González, Joseph Longworth, David James, and Neil Lawrence. Bayesian optimisation for synthetic gene design. NIPS Workshop on Bayesian Optimization in Academia and Industry, 2014.
 González et al. [2015] Javier González, Zhenwen Dai, Philipp Hennig, and Neil D Lawrence. Batch Bayesian optimization via local penalization. arXiv preprint arXiv:1505.08052, 2015.
 Huang et al. [2006] D. Huang, T. T. Allen, W. I. Notz, and N. Zeng. Global optimization of stochastic blackbox systems via sequential kriging metamodels. J. of Global Optimization, 34(3):441–466, March 2006. ISSN 09255001.
 Jones et al. [1993] D. R. Jones, C. D. Perttunen, and B. E. Stuckman. Lipschitzian optimization without the Lipschitz constant. J. Optim. Theory Appl., 79(1):157–181, October 1993. ISSN 00223239.
 Jones [2001] Donald R. Jones. A taxonomy of global optimization methods based on response surfaces. Journal of global optimization, 21(4):345–383, 2001.
 Kulesza and Taskar [2011] Alex Kulesza and Ben Taskar. k-DPPs: fixed-size determinantal point processes. In Lise Getoor and Tobias Scheffer, editors, ICML, 2011.
 Kulesza and Taskar [2012] Alex Kulesza and Ben Taskar. Determinantal point processes for machine learning. Foundations and Trends® in Machine Learning, 5(2–3):123–286, 2012. ISSN 19358237. doi: 10.1561/2200000044.
 Lizotte [2008] Daniel James Lizotte. Practical Bayesian Optimization. PhD thesis, University of Alberta, 2008. AAINR46365.

 Marchant et al. [2014] Roman Marchant, Fabio Ramos, and Scott Sanner. Sequential Bayesian optimisation for spatial-temporal monitoring. In Proceedings of the International Conference on Uncertainty in Artificial Intelligence, 2014.
 Martinez-Cantin et al. [2009] Ruben Martinez-Cantin, Nando de Freitas, Eric Brochu, José Castellanos, and Arnaud Doucet. A Bayesian exploration-exploitation approach for optimal online sensing and planning with a visually guided mobile robot. Autonomous Robots, 27(2):93–103, August 2009. ISSN 09295593, 15737527. doi: 10.1007/s1051400991302.

 Minka [2001] Thomas P. Minka. Expectation propagation for approximate Bayesian inference. In Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence, UAI ’01, pages 362–369, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc. ISBN 1558608001.
 Molga and Smutnicki [1995] Marcin Molga and Czesław Smutnicki. Test functions for optimization needs, 1995. URL www.zsd.ict.pwr.wroc.pl/files/docs/functions.pdf.
 Nocedal [1980] Jorge Nocedal. Updating quasiNewton matrices with limited storage. Mathematics of Computation, 35(151):773–782, 1980.
 Osborne [2010] Michael Osborne. Bayesian Gaussian Processes for Sequential Prediction, Optimisation and Quadrature. PhD thesis, University of Oxford, 2010.
 Osborne et al. [2009] Michael A. Osborne, Roman Garnett, and Stephen J. Roberts. Gaussian processes for global optimization. In 3rd international conference on learning and intelligent optimization (LION3), pages 1–15, 2009.
 Rasmussen and Williams [2005] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2005.
 Snoek et al. [2012] Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, page 2951–2959, 2012.
 Streltsov and Vakili [1999] Simon Streltsov and Pirooz Vakili. A nonmyopic utility function for statistical global optimization algorithms. J. Global Optim., 14(3):283–298, 1999.
S1 Oracle multiple steps lookahead expected loss
Denote the value of the best visited location when looking several evaluations into the future. Note that this reduces to the current best loss in the one-step-ahead case. It is straightforward to see that
It holds that
where the integrals with respect to the future locations simplify, since we do not need to optimise for any further location. Notice that
and therefore
S2 Formulation of the oracle multiple steps lookahead expected loss to be computed using Expectation Propagation
Assume that . Then we have that
The first term can be written as follows:
where the regions partition the integration domain. We can do this because the regions are disjoint and their union is the whole space. Also, note that the minimum can be replaced within the integrals since, within each region, the corresponding component attains it. Rewriting the integral in terms of indicator functions, we have that
(S.1) 
where the indicator function equals 1 if its condition is satisfied and 0 otherwise.
S2.1 Synthetic functions
In this section we include the formulation of the objective functions used in the experiments that are not available in the references provided.
Name  Function 

SinCos  
Alpine2  
Cosines  . 