Bayesian Quantile and Expectile Optimisation

01/12/2020 ∙ by Léonard Torossian, et al. ∙ INRA

Bayesian optimisation is widely used to optimise stochastic black box functions. While most strategies focus on optimising conditional expectations, a large variety of applications require risk-averse decisions, and alternative criteria accounting for the distribution tails need to be considered. In this paper, we propose new variational models for Bayesian quantile and expectile regression that are well-suited to heteroscedastic settings. Our models consist of two latent Gaussian processes, accounting respectively for the conditional quantile (or expectile) and the variance, that are chained through asymmetric likelihood functions. Furthermore, we propose two Bayesian optimisation strategies, derived either from GP-UCB or from Thompson sampling, that are tailored to such models and that can accommodate large batches of points. As illustrated in the experimental section, the proposed approach clearly outperforms the state of the art.


1 Introduction

Let $f$ be an unknown function $f : \mathcal{X} \times \Omega \to \mathbb{R}$, where $\mathcal{X} \subset \mathbb{R}^d$ and $(\Omega, \mathcal{F}, \mathbb{P})$ denotes a probability space representing some uncontrolled variables. For any fixed $x \in \mathcal{X}$, $f(x, \cdot)$ is a random variable of distribution $\mathbb{P}_x$. We assume here a classical black-box optimisation framework: $f$ is available only through (costly) pointwise evaluations, and no gradient or structural information is available. Typical examples include stochastic simulators in physics or biology (see [Skullerud, 1968] for simulations of ion motion and [Székely Jr and Burrage, 2014] for simulations of heterogeneous natural systems), but $f$ can also represent the performance of a machine learning algorithm according to some hyperparameters (see [Bergstra et al., 2011, Li et al., 2016] for instance). In the latter case, the randomness can come from the use of mini-batching in the training procedure, the choice of a stochastic optimiser or the randomness in the optimisation initialisation.

Let $g(x) = \rho(\mathbb{P}_x)$ be the objective function we want to maximise, where $\rho$ is a real-valued functional defined on probability measures. The canonical choice for $\rho$ is the conditional expectation (i.e. the expectation conditional on $x$), which is sensible when the exposure to extreme values is not a significant aspect of the decision. However, in a large variety of fields such as agronomy, medicine or finance, the decision maker has an incentive to protect himself against extreme events, which typically have little influence on the expectation but can lead to severe consequences. To take these rare events into account, one should consider alternative choices for $\rho$ that depend on the tails of $\mathbb{P}_x$, such as the quantile [Rostek, 2010], the conditional value-at-risk (CVaR, see [Rockafellar et al., 2000]) or the expectile [Bellini and Di Bernardino, 2017]. In this paper we focus on the conditional quantile and expectile.

Given an estimate of $g$ based on the available data, global optimisation algorithms define a policy that finds a trade-off between exploration and intensification. More precisely, the algorithm has to explore the input space in order to avoid getting trapped in a local optimum, but it also has to concentrate its budget on input regions identified as having a high potential. The latter results in accurate estimates of $g$ in the regions of interest and allows the algorithm to return an optimal input value with high precision.

In the context of Bayesian optimisation (BO), such trade-offs were initially studied by [Mockus et al., 1978] and [Jones et al., 1998] in a noise-free setting. Their framework has later been extended to the optimisation of the conditional expectation of a stochastic black box (see e.g. [Frazier et al., 2009, Srinivas et al., 2009] or [Picheny et al., 2013] for a review). Although the literature is very scarce for quantile or expectile optimisation under the BO framework, an algorithm based on Gaussian processes for quantile optimisation is presented in [Browne et al., 2016]. The approach they propose, however, requires many replications per input point and is not compatible with small-data settings.

Contributions

The contributions of this paper are the following: 1) We propose a new metamodel based on variational inference to estimate either quantiles or expectiles without repetitions in the design of experiments, suited to heteroscedastic cases. 2) We propose a new Bayesian optimisation algorithm suited to conditional quantiles and expectiles in a small-data setting. Two batch-sequential acquisition strategies are designed to find a good trade-off between exploration and intensification. The ability of our algorithm to optimise quantiles and expectiles is illustrated on several test problems.

2 Bayesian metamodels for tail-dependent measures

The conditional quantile of order $\tau \in (0, 1)$ can be defined as

$$q_\tau(x) \in \operatorname*{arg\,min}_{q \in \mathbb{R}} \mathbb{E}\left[\ell_\tau\!\left(f(x, \omega) - q\right)\right], \qquad (1)$$

with

$$\ell_\tau(\xi) = \max\!\left(\tau \xi, (\tau - 1)\xi\right) \qquad (2)$$

the so-called pinball loss introduced by [Koenker and Bassett Jr, 1978]. Following this formalism, [Newey and Powell, 1987] introduced the squared pinball loss defined as

$$\ell^2_\tau(\xi) = \left|\tau - \mathbb{1}_{\xi < 0}\right| \xi^2 \qquad (3)$$

and the expectile of order $\tau$ as

$$e_\tau(x) \in \operatorname*{arg\,min}_{e \in \mathbb{R}} \mathbb{E}\left[\ell^2_\tau\!\left(f(x, \omega) - e\right)\right]. \qquad (4)$$

We detail in the next section how these losses can be used to get an estimate of $g$ using a dataset $D_n = \{(x_i, y_i)\}_{i=1}^n$, with $X = (x_1, \dots, x_n)^\top$ an $n \times d$ matrix.
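As a concrete illustration, the short NumPy sketch below (our own, with hypothetical function names, not part of the original implementation) evaluates both losses and checks numerically that minimising the corresponding empirical risks recovers the empirical quantile and expectile.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def pinball_loss(xi, tau):
    """Pinball (check) loss: l_tau(xi) = max(tau * xi, (tau - 1) * xi)."""
    return np.maximum(tau * xi, (tau - 1.0) * xi)

def squared_pinball_loss(xi, tau):
    """Squared pinball loss: tau * xi^2 if xi >= 0, (1 - tau) * xi^2 otherwise."""
    return np.where(xi >= 0, tau, 1.0 - tau) * xi ** 2

tau = 0.9
y = np.random.default_rng(0).lognormal(size=10_000)   # samples from some P_x

# Minimising the empirical pinball risk recovers the tau-quantile ...
q_hat = minimize_scalar(lambda q: pinball_loss(y - q, tau).mean(),
                        bounds=(y.min(), y.max()), method="bounded").x
# ... and minimising the squared pinball risk recovers the tau-expectile.
e_hat = minimize_scalar(lambda e: squared_pinball_loss(y - e, tau).mean(),
                        bounds=(y.min(), y.max()), method="bounded").x

print(q_hat, np.quantile(y, tau))   # close to each other
print(e_hat)                        # the expectile, a different tail measure
```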

2.1 Quantile and Expectile Metamodel

To estimate a conditional quantile of fixed order, different metamodels such as artificial neural networks [Cannon, 2011], random forests [Meinshausen, 2006] or nonparametric estimators in reproducing kernel Hilbert spaces [Takeuchi et al., 2006, Sangnier et al., 2016] have been proposed. While the literature on expectile regression is less extensive, neural network [Jiang et al., 2017] or SVM-like [Farooq and Steinwart, 2017] approaches have been developed as well. All the approaches cited above define an estimator of $g$ as the function that minimises (optionally with a regularisation term) the empirical risk

$$\hat{g} \in \operatorname*{arg\,min}_{\tilde{g} \in \mathcal{G}} \frac{1}{n} \sum_{i=1}^n L\!\left(y_i - \tilde{g}(x_i)\right), \qquad (5)$$

with $L = \ell_\tau$ for quantile estimation and $L = \ell^2_\tau$ for the expectile. This framework makes sense because, asymptotically, minimising (5) is equivalent to minimising the expected losses defining (1) and (4).

All these approaches however have a drawback: they do not quantify the uncertainty associated with each prediction. This is a significant problem in our setting, since this knowledge is of paramount importance to define the exploration/intensification trade-off. Alternatively, using Bayesian models can overcome this issue as they provide a posterior distribution on $g$. To do so, [Yu and Moyeed, 2001] proposed the model

$$y_i = g(x_i) + \varepsilon_i,$$

with an improper uniform prior on $g$ and $\varepsilon_i$ a random variable that follows an asymmetric Laplace distribution. The associated likelihood is given by

$$p\!\left(y \mid g(x), \sigma\right) = \frac{\tau(1 - \tau)}{\sigma} \exp\!\left(-\frac{\ell_\tau\!\left(y - g(x)\right)}{\sigma}\right). \qquad (6)$$

Then an estimator of $g$ is taken as the function that maximises this likelihood. This model is intuitive for two reasons. First, $\varepsilon_i$ is asymmetric, such that $\mathbb{P}(\varepsilon_i \le 0) = \tau$ and $\mathbb{P}(\varepsilon_i > 0) = 1 - \tau$. Second, minimising the empirical risk associated with the pinball loss (5) is equivalent to maximising the asymmetric Laplace likelihood (6). To the best of our knowledge, there are no Bayesian expectile models in the literature. However, similarly to quantiles, it is possible to use the asymmetric Gaussian distribution defined as

$$p\!\left(y \mid g(x), \sigma\right) = C(\tau, \sigma)\, \exp\!\left(-\frac{\ell^2_\tau\!\left(y - g(x)\right)}{2\sigma^2}\right), \qquad (7)$$

with $C(\tau, \sigma)$ the corresponding normalising constant.

To estimate these models, we can refer to the existing methods for the quantile: see for instance [Boukouvalas et al., 2012, Abeywardana and Ramos, 2015]. In the review of [Torossian et al., 2019], quantile estimation using variational approaches and a Gaussian process prior on $g$ appears to be one of the most competitive approaches on the considered benchmark. However, although GPs theoretically provide confidence intervals, the original metamodel of [Abeywardana and Ramos, 2015] seems to be overconfident in heteroscedastic problems, as presented in Figure 1. The main reason for this is the use (also present in the aforementioned papers) of a single spread parameter $\sigma$ for the likelihood function. This amounts to considering the spread of $\mathbb{P}_x$ as constant over $\mathcal{X}$, which is particularly harmful as quantile optimisation precisely aims at leveraging the varying spread of $\mathbb{P}_x$.

To overcome this issue and to propose a relevant model for heteroscedastic cases, it is necessary to add a degree of freedom to the spread of the likelihood and make it dependent on the variance of $\mathbb{P}_x$. For both the asymmetric Laplace and asymmetric Gaussian distributions, this can be done simply by defining $\sigma$ in equations (6) and (7) as a function of the input. Intuitively, using a small $\sigma$ creates a very confident model that tends to interpolate the data, while using a large $\sigma$ adds regularity to the model and produces a more robust estimate. To incorporate such flexibility, we propose a model with a GP prior on $g$ and on the scale parameter $\sigma$,

$$g(\cdot) \sim \mathcal{GP}\!\left(0, k_g(\cdot, \cdot)\right), \qquad (8)$$
$$\sigma(\cdot) = \phi\!\left(h(\cdot)\right), \quad h(\cdot) \sim \mathcal{GP}\!\left(0, k_h(\cdot, \cdot)\right), \qquad (9)$$

where $\phi$ is the softmax function, used in order to ensure the positivity of $\sigma$. Estimating the parameters of this model may appear challenging, but the chained GP formalism introduced by [Saul et al., 2016] provides the appropriate framework to do so.
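For illustration, here is a minimal sketch (not the implementation used in the paper) of the likelihood evaluation underlying such a chained model: the asymmetric Laplace log-likelihood of (6) with a scale that depends on the input through a second latent function; a softplus is used here as a stand-in positivity link $\phi$.

```python
import numpy as np

def softplus(z):
    # Numerically stable positivity link mapping the second latent function to a scale.
    return np.log1p(np.exp(-np.abs(z))) + np.maximum(z, 0.0)

def pinball_loss(xi, tau):
    return np.maximum(tau * xi, (tau - 1.0) * xi)

def chained_asym_laplace_loglik(y, g_x, h_x, tau):
    """log p(y | g(x), h(x)) under an asymmetric Laplace likelihood whose
    scale sigma(x) = softplus(h(x)) varies with the input (heteroscedastic case)."""
    sigma = softplus(h_x)
    return np.log(tau * (1.0 - tau) / sigma) - pinball_loss(y - g_x, tau) / sigma

# y, g_x and h_x would come from data and from samples of the two latent GPs.
y   = np.array([0.3, -1.2, 2.5])
g_x = np.array([0.1, -0.8, 1.9])   # latent quantile values at the inputs
h_x = np.array([-1.0, 0.5, 2.0])   # latent (pre-link) scale values
print(chained_asym_laplace_loglik(y, g_x, h_x, tau=0.9).sum())
```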

Figure 1: Original GP quantile model from [Abeywardana and Ramos, 2015] (left) and chained GP (right) on a very heteroscedastic problem. The model on the left cannot compromise between regions with very small observation variance and regions with very large variance, and largely overfits on half of the domain while returning overconfident confidence intervals. The chained GP model captures both the low-variance and the high-variance regions, while returning well-calibrated confidence intervals.

2.2 Inference Procedure

Although our study is limited to the small data regime, quantile and expectile regression are much more challenging problems than classical regression. In the review on quantile regression of [Torossian et al., 2019], the typical budget required for estimation is a large multiple of the input dimension, while the classical rule-of-thumb for GP regression is 10 times the dimension [Loeppky et al., 2009]. As we wish to propose a scalable algorithm, and as optimisation naturally needs more points than regression, we need a model able to train on comparatively large datasets. In addition, the concentration of points resulting from the intensification part of our optimisation scheme would produce numerical instabilities in the computation of the covariance matrix. To handle these two potential issues, it is natural to use the classical inducing point approach. Following [Snelson and Ghahramani, 2006, Titsias, 2009, Hensman et al., 2013], we introduce $m$ 'pseudo inputs' (named inducing points) at locations $Z = (z_1, \dots, z_m)$ and the corresponding outputs $u_g = g(Z)$ and $u_h = h(Z)$. The marginal likelihood is thus provided as

$$p(y) = \int p(y \mid g, h)\, p(g \mid u_g)\, p(u_g)\, p(h \mid u_h)\, p(u_h)\, \mathrm{d}g\, \mathrm{d}h\, \mathrm{d}u_g\, \mathrm{d}u_h,$$

with $p(y \mid g, h) = \prod_{i=1}^n p\!\left(y_i \mid g(x_i), \phi(h(x_i))\right)$. This latter quantity is not analytically tractable because the likelihoods introduced to estimate quantiles and expectiles are not conjugate with the Gaussian priors of assumptions (8) and (9). Thus, to estimate the parameters of the model, we use a black-box variational formalism with a stochastic optimisation scheme, as introduced in [Saul et al., 2016, Hensman et al., 2013]. Assuming the mean-field approximation for $(g, u_g)$ and $(h, u_h)$ implies

$$q(g, h, u_g, u_h) = p(g \mid u_g)\, q(u_g)\, p(h \mid u_h)\, q(u_h).$$

This results in the following evidence lower bound (ELBO):

$$\mathcal{L} = \mathbb{E}_{q(g, h)}\!\left[\log p(y \mid g, h)\right] - \mathrm{KL}\!\left(q(u_g)\, \|\, p(u_g)\right) - \mathrm{KL}\!\left(q(u_h)\, \|\, p(u_h)\right).$$

The posterior on $u_g$ and $u_h$ is assumed to be Gaussian,

$$q(u_g) = \mathcal{N}(m_g, S_g), \qquad q(u_h) = \mathcal{N}(m_h, S_h),$$

with $m_g$, $m_h$ in $\mathbb{R}^m$ and $S_g$, $S_h$ in $\mathbb{R}^{m \times m}$ the variational quantities that are fully parametrised. Next, because the considered distributions follow Gaussian priors, we obtain

$$q\!\left(a(x)\right) = \mathcal{N}\!\left(\mu_a(x), \sigma_a^2(x)\right),$$

where, for $a \in \{g, h\}$, $\mu_a(x) = k_a(x, Z) K_a^{-1} m_a$ and $\sigma_a^2(x) = k_a(x, x) - k_a(x, Z) K_a^{-1} (K_a - S_a) K_a^{-1} k_a(Z, x)$, with $K_a = k_a(Z, Z)$.

Finally, the approximation of the posterior at any input $x$ factorises as

$$q\!\left(g(x), h(x)\right) = q\!\left(g(x)\right)\, q\!\left(h(x)\right),$$

where $q(g(x))$ and $q(h(x))$ are the Gaussian marginals defined above.

To compute the intractable expectation of the log-likelihood, it is possible to take advantage of the factorised form of our likelihood across the data in order to stochastically optimise an unbiased estimate of the ELBO, since

$$\mathbb{E}_{q(g, h)}\!\left[\log p(y \mid g, h)\right] = \sum_{i=1}^n \mathbb{E}_{q(g(x_i), h(x_i))}\!\left[\log p\!\left(y_i \mid g(x_i), \phi(h(x_i))\right)\right]$$

can be estimated on mini-batches of data. Note that due to the non-differentiability of the pinball loss at the origin, the lower bound is not differentiable everywhere. We thus use a first-order optimiser (the ADAM optimiser [Kingma and Ba, 2014]) as it does not require the objective function to have continuous derivatives. To estimate the expectations above and their gradients, we use a quadrature approximation.
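As an illustration of the quadrature step, the sketch below (our own, assuming independent Gaussian marginals for $g(x_i)$ and $h(x_i)$ as above) approximates one expected log-likelihood term with a tensorised Gauss-Hermite rule.

```python
import numpy as np

def gauss_hermite_expectation(loglik, mu_g, s_g, mu_h, s_h, deg=20):
    """Approximate E_{q(g_i) q(h_i)}[loglik(g_i, h_i)], with q(g_i) = N(mu_g, s_g^2)
    and q(h_i) = N(mu_h, s_h^2), using a tensorised Gauss-Hermite quadrature."""
    x, w = np.polynomial.hermite.hermgauss(deg)     # physicists' Hermite nodes/weights
    g_nodes = mu_g + np.sqrt(2.0) * s_g * x          # change of variables for N(mu, s^2)
    h_nodes = mu_h + np.sqrt(2.0) * s_h * x
    W = np.outer(w, w) / np.pi                       # normalisation of the 2D product rule
    G, H = np.meshgrid(g_nodes, h_nodes, indexing="ij")
    return np.sum(W * loglik(G, H))

# Example reusing the chained asymmetric Laplace log-likelihood sketched in Section 2.1:
# expected_ll = gauss_hermite_expectation(
#     lambda g, h: chained_asym_laplace_loglik(y_i, g, h, tau=0.9),
#     mu_g=0.0, s_g=0.5, mu_h=-1.0, s_h=0.3)
```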

To make predictions at a query point $x^*$, using Gaussian properties it is possible to write

$$q\!\left(g(x^*)\right) = \mathcal{N}\!\left(\mu_g(x^*), \sigma_g^2(x^*)\right), \qquad q\!\left(h(x^*)\right) = \mathcal{N}\!\left(\mu_h(x^*), \sigma_h^2(x^*)\right),$$

where $\mu_g$, $\sigma_g^2$, $\mu_h$ and $\sigma_h^2$ are given by the expressions above.

3 Bayesian optimisation

Classical BO algorithms work as follows. First, a posterior distribution on $g$ is inferred from an initial set of experiments (typically obtained using a space-filling design). Then the next input point to evaluate is chosen as the maximiser of an acquisition function computed from the posterior. The objective function is sampled at the chosen input and the posterior on $g$ is updated. These steps are repeated until the budget is exhausted.

The efficiency of such strategies depends on the relevance of the posterior but also on the exploration/exploitation trade-off provided by the acquisition function. Many acquisition functions have been designed to address this trade-off, among them expected improvement (EI) [Jones et al., 1998], the upper confidence bound (UCB) [Srinivas et al., 2009], the knowledge gradient (KG) [Frazier et al., 2009] or entropy search [Hernández-Lobato et al., 2014]. In the case of quantiles and expectiles, adding points one at a time may be impractical, as many points may be necessary to modify the posterior significantly. One solution is to rely on replications, i.e. repeatedly evaluating a single input, as in [Browne et al., 2016, Wang et al., 2019]. However, in [Torossian et al., 2019] using replications was found to be clearly less efficient than using distributed observations.

All of the above-mentioned acquisition functions have been extended to batches of points: see for instance [Ginsbourger et al., 2010, Marmin et al., 2015] for EI, [Wu and Frazier, 2016] for KG or [Contal et al., 2013, Desautels et al., 2014] for UCB. However, none actually fits our setting, for two main reasons. First, most parallel acquisitions make use of explicit update equations for the GP moments, which are not available for our model. Second, most are designed for small batches and become numerically intractable for the larger batches we aim at.

We propose in the following two alternatives, one based on a simple adaptation of UCB, the other based on the Thompson sampling approach proposed by [Hernández-Lobato et al., 2017].

Figure 2: Left: estimator of the conditional quantile and the associated confidence intervals; middle: two UCB criteria with different optimism levels, leading to different maximisers (red); right: two sample trajectories of the posterior obtained with RFF, with different maximisers.

3.1 Batch GP-UCB via Multiple Optimism Levels

Assume the posterior on $g(x)$ is a GP of mean $\mu(x)$ and variance $\sigma^2(x)$; then the classical UCB acquisition function is

$$\alpha_{\mathrm{UCB}}(x) = \mu(x) + \beta\, \sigma(x), \qquad (10)$$

with $\beta$ a positive hyperparameter that tunes the trade-off between exploration (large $\beta$, implying more weight on the variance) and exploitation (small $\beta$, implying more weight on the mean).

A simple way to parallelise this criterion consists in selecting different values of $\beta$ at the same time. Denoting $K$ the batch size, we choose $\beta_1, \dots, \beta_K$ as Gaussian quantiles, $\beta_j = \Phi^{-1}(p_j)$ for levels $p_j$ spread over $(1/2, 1)$, with $\Phi^{-1}$ the inverse of the cumulative distribution function of the standard Gaussian distribution. Intuitively, each batch of new inputs is based on a gradient of exploration/exploitation trade-offs. This idea is represented in the centre of Figure 2. However, such values of $\beta$ are too small to guarantee exploration, so we finally multiply them by a constant factor. Algorithm 1 presents the pseudo-code for this strategy.

Contrary to [Srinivas et al., 2009, Contal et al., 2013, Desautels et al., 2014], due to the chained GP framework our approach does not have theoretical guarantees. However, this might have a limited practical effect, as the theoretically sound values for $\beta$ are known to be overly conservative and typical algorithms use constant values of $\beta$.

Input: initial data $D_0$; batch size $K$
for $t = 0$ to $T - 1$ do
       Compute the posterior given $D_t$;
       for $j = 1$ to $K$ do
             Select $\beta_j$;
             $x_{t,j} \leftarrow \operatorname*{arg\,max}_{x \in \mathcal{X}}\; \mu(x) + \beta_j\, \sigma(x)$;
             Observe $y_{t,j}$ by sampling $f$ at $x_{t,j}$;
       end for
      $D_{t+1} \leftarrow D_t \cup \{(x_{t,j}, y_{t,j})\}_{j=1}^{K}$;
end for
Algorithm 1 Risk Parallel GP-UCB
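A minimal sketch of the inner selection loop of Algorithm 1 is given below; the candidate grid, the quantile levels and the multiplicative constant `scale` are simplifications of our own, not the exact values used in the experiments.

```python
import numpy as np
from scipy.stats import norm

def batch_ucb_select(mu, sigma, X_cand, batch_size, scale=3.0):
    """Select `batch_size` candidates, one per optimism level.
    mu, sigma: posterior mean and standard deviation of the latent quantile
    (or expectile) GP evaluated on the candidate set X_cand."""
    selected = []
    for j in range(1, batch_size + 1):
        # Gaussian quantile levels spread over (0.5, 1), inflated to force exploration.
        beta_j = scale * norm.ppf(0.5 + 0.5 * j / (batch_size + 1))
        ucb = mu + beta_j * sigma
        selected.append(X_cand[np.argmax(ucb)])
    return np.array(selected)

# mu and sigma would come from the chained GP posterior on a candidate grid X_cand.
```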

3.2 Thompson Sampling

In this section we adapt the parallel Thompson sampling strategy of [Hernández-Lobato et al., 2017] to the case of chained GPs with Matérn kernel priors.

Given the posterior on $g$, an intuitive approach is to sample each new input according to the probability that it maximises $g$. However, this distribution is usually intractable. Alternatively, one may achieve the same goal by sampling a trajectory from the posterior of $g$ and selecting the input that corresponds to its maximiser. Such an approach directly extends to batches of inputs, by drawing several trajectories and selecting all the maximisers. Algorithm 2 illustrates this strategy.

Input: initial data $D_0$; batch size $K$
for $t = 0$ to $T - 1$ do
       Compute the posterior given $D_t$;
       for $j = 1$ to $K$ do
             Sample a trajectory $\tilde{g}_j$ according to the posterior;
             $x_{t,j} \leftarrow \operatorname*{arg\,max}_{x \in \mathcal{X}}\; \tilde{g}_j(x)$;
             Observe $y_{t,j}$ by sampling $f$ at $x_{t,j}$;
       end for
      $D_{t+1} \leftarrow D_t \cup \{(x_{t,j}, y_{t,j})\}_{j=1}^{K}$;
end for
Algorithm 2 Risk Parallel Thompson Sampling

The main difficulty of this strategy lies in the creation of sample trajectories of $g$. It is well known that the values of trajectories from a GP posterior can be obtained on any discrete set $X_N \subset \mathcal{X}$ of size $N$ using

$$\tilde{g}(X_N) = \mu(X_N) + L\, \varepsilon,$$

where $\varepsilon$ is a vector of independent standard Gaussian samples and $L$ is the lower triangular matrix of the Cholesky decomposition of the posterior covariance evaluated on $X_N$. But this framework has two drawbacks. First, the obtained trajectories are not continuous functions, so the optimisation can only be made over the discrete set. Second, as $L$ is obtained with a Cholesky decomposition, defining such trajectories has an $O(N^3)$ cost [Diggle et al., 1998]. So this approach quickly meets its limitations as the dimension of $\mathcal{X}$ increases and $\mathcal{X}$ can no longer be well represented by a discrete set of moderate size.
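For reference, the discrete Cholesky-based sampler described above boils down to a few lines, with the $O(N^3)$ factorisation as the bottleneck (a minimal sketch of our own):

```python
import numpy as np

def sample_discrete_trajectories(mu, K, n_samples=1, jitter=1e-8):
    """Draw trajectories of a GP posterior restricted to a discrete set:
    mu is the posterior mean (shape (N,)), K the posterior covariance (N, N).
    The Cholesky factorisation costs O(N^3)."""
    L = np.linalg.cholesky(K + jitter * np.eye(len(mu)))
    eps = np.random.default_rng().standard_normal((len(mu), n_samples))
    return mu[:, None] + L @ eps          # shape (N, n_samples)
```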

To overcome these drawbacks, a solution is to go back to a parametric formulation of $g$ and to use random Fourier features (RFF) to approximate the kernel $k$, as presented in [Rahimi and Recht, 2008]. Let us introduce Bochner's theorem, which asserts the existence of a dual formulation for a large class of kernels, in particular the Matérn family.

Theorem 1

A continuous, shift-invariant kernel is positive definite if and only if it is the Fourier transform of a non-negative, finite measure.

Thus, given a stationary kernel $k$, there exists an associated spectral density $S$ such that

$$k(x - x') = \int_{\mathbb{R}^d} S(\omega)\, e^{i \omega^\top (x - x')}\, \mathrm{d}\omega.$$

Note that $S$ is not a probability density function because it is not normalised. It is possible to define $p(\omega) = S(\omega) / \alpha$, where the normalising constant is $\alpha = \int S(\omega)\, \mathrm{d}\omega = k(0)$. Using this formulation enables us to write

$$k(x - x') = \alpha\, \mathbb{E}_{p(\omega)}\!\left[\cos\!\left(\omega^\top (x - x')\right)\right].$$

RFFs then consist in approximating this expectation using a Monte-Carlo estimate:

$$k(x - x') \approx \phi(x)^\top \phi(x'), \qquad (11)$$

with $\phi$ an $M$-dimensional feature map such that $\phi_k(x) = \sqrt{2\alpha / M}\, \cos\!\left(\omega_k^\top x + b_k\right)$, where the $\omega_k$ and $b_k$ are i.i.d. samples from $p(\omega)$ and $\mathcal{U}(0, 2\pi)$, respectively.

Such a methodology has classically been used to approximate the squared-exponential kernel because it is self-conjugate, its spectral density being Gaussian as well (see [Hernández-Lobato et al., 2014]). Here we present how to use RFFs to approximate anisotropic Matérn kernels. Our goal is to determine the spectral density associated with the Matérn kernel with variance parameter $\sigma_k^2$ and length scales $\theta_1, \dots, \theta_d$.

In its simplest form (unit variance and length scale), the Matérn kernel is given by

$$k_\nu(r) = \frac{2^{1 - \nu}}{\Gamma(\nu)} \left(\sqrt{2\nu}\, r\right)^{\nu} K_\nu\!\left(\sqrt{2\nu}\, r\right),$$

with $K_\nu$ the modified Bessel function of the second kind of order $\nu$ and $\Gamma$ the gamma function, and its Fourier transform (see [Rasmussen, 2003] for more details) is proportional to

$$\left(2\nu + \|\omega\|^2\right)^{-(\nu + d/2)}.$$

If the anisotropic kernel is defined through $r = \|\Lambda^{-1}(x - x')\|$, with $\Lambda$ the diagonal matrix containing the length scale hyperparameters $\theta_1, \dots, \theta_d$, then the scaling property of the Fourier transform yields a spectral density proportional to

$$\left(2\nu + \|\Lambda\, \omega\|^2\right)^{-(\nu + d/2)} \propto \left(1 + \frac{\omega^\top \Lambda^2 \omega}{2\nu}\right)^{-(2\nu + d)/2},$$

whose associated normalised probability density function is that of a multivariate t-distribution with $2\nu$ degrees of freedom and scale matrix $\Lambda^{-2}$. As the Fourier transform is linear, we simply have to multiply by $\sigma_k^2$ to obtain the normalising constant, $\alpha = \sigma_k^2$.

Now, to approximate the trajectories, we only need to rewrite $g$ under its parametric form. Combining (11) with the variational posterior of Section 2.2, a sample trajectory can be written as

$$\tilde{g}(x) = \phi(x)^\top \theta,$$

with $\theta$ an $M$-dimensional Gaussian vector whose mean and covariance are obtained from the variational distribution of the inducing variables. Consequently, the randomness can be factorised out of the trajectory: once $\theta$ is drawn, the trajectory is a deterministic function of $x$. With this sampling strategy, an analytic expression of the trajectory is known, which enables its optimisation. In addition, the cost of obtaining a trajectory scales with the number of features $M$ rather than cubically with the number of evaluation points.
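The sketch below illustrates the resulting trajectory sampling under simplifying assumptions of our own: a Matérn-5/2 kernel, weights drawn from the prior rather than from the variational posterior, and a grid search instead of a continuous optimiser.

```python
import numpy as np

rng = np.random.default_rng(0)

def rff_matern_features(X, lengthscales, variance, n_features=500, nu=2.5):
    """Random Fourier features for an anisotropic Matern kernel. Frequencies are
    sampled from the kernel's spectral density, i.e. a multivariate t-distribution
    with 2*nu degrees of freedom and scale matrix diag(lengthscales)**-2."""
    d = X.shape[1]
    g = rng.standard_normal((n_features, d))
    u = rng.chisquare(df=2 * nu, size=(n_features, 1))
    omega = g * np.sqrt(2 * nu / u) / np.asarray(lengthscales)   # t-distributed frequencies
    b = rng.uniform(0.0, 2 * np.pi, size=n_features)
    return np.sqrt(2.0 * variance / n_features) * np.cos(X @ omega.T + b)

# Analytic (approximate) trajectory of the latent GP and its maximiser on a grid.
X_grid = np.linspace(0.0, 1.0, 1000)[:, None]
Phi = rff_matern_features(X_grid, lengthscales=[0.2], variance=1.0)
w = rng.standard_normal(Phi.shape[1])   # weights: here a prior sample; the variational
trajectory = Phi @ w                    # posterior on the weights would be used in practice
x_next = X_grid[np.argmax(trajectory)]
```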

3.3 Adding Noise

Both algorithms presented above select sampling points that correspond to a potential reduction of the simple regret. They do not necessarily correspond to inputs that improve the accuracy of the model in the vicinity of the maximum (contrary to the approaches in [Frazier et al., 2009, Hernández-Lobato et al., 2014, Picheny, 2014], for instance).

We observed empirically that focusing on the simple regret resulted in overly myopic strategies, as our model delivers much better local predictions using well-spread observations over a local region rather than highly concentrated points around a local optimum. In a sense, our acquisition functions point towards the right optima but do not propose an efficient sampling strategy to improve our model.

However, a simple way to correct this problem is to add a small centred multivariate Gaussian noise to the selected inputs, with a diagonal covariance matrix.
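In practice this amounts to a few lines; the noise standard deviations (`jitter_std`) are hyperparameters of our sketch, and clipping back into the domain is our addition.

```python
import numpy as np

def jitter_batch(X_batch, jitter_std, lower, upper):
    """Perturb each selected input with centred Gaussian noise (diagonal covariance)
    and clip the result back into the search domain [lower, upper]^d."""
    noise = np.random.default_rng().normal(scale=jitter_std, size=X_batch.shape)
    return np.clip(X_batch + noise, lower, upper)
```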

4 Experiments

4.1 Test Cases Description

In this section we show the capacity of our algorithms to optimise a conditional quantile or expectile. To do so, we propose two challenging toy problems of dimension 2 and 7, respectively.

Test case 1

is a two-dimensional toy problem based on the Griewank function (see [Dixon and Szego, 1978]), combined with a stochastic perturbation. The corresponding quantiles are represented in Figure 5.

Test case 2

is a seven-dimensional toy problem based on the Ackley function (see [Ackley, 1987]), combined with a stochastic perturbation that follows a log-normal distribution.

4.2 Quantile Kriging Baseline

To the best of our knowledge, there exists no other BO algorithm to tackle quantile or expectile problems. A simple alternative is to use repetitions in the design of experiments to observe the quantile (or expectile) locally, together with an estimate of the observation noise (for instance obtained by bootstrap). As direct observations are then available, standard GP inference can be used to provide a posterior on $g$ [Plumlee and Tuo, 2014]. Next, a BO procedure can be defined based on the EI criterion. As the number of repetitions is a potentially critical parameter of the method, in our experiments we use different values: 10 and 20 in 2D, 10 and 35 in 7D. We refer to these algorithms by their number of repetitions (the smallest and the largest), and they serve as our baseline competitors.
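A sketch of how such local observations could be produced from repetitions (the bootstrap noise estimate mirrors the description above; the resulting pairs would then feed a standard noisy GP, which is outside the scope of this sketch):

```python
import numpy as np

def replicate_and_estimate(simulator, X_design, tau, n_rep, n_boot=200):
    """For each design point, run `n_rep` replications, estimate the tau-quantile,
    and estimate its observation noise by bootstrap. The returned (q_hat, q_var)
    pairs can then feed a standard GP model with heterogeneous observation noise."""
    rng = np.random.default_rng(0)
    q_hat, q_var = [], []
    for x in X_design:
        y = np.array([simulator(x) for _ in range(n_rep)])
        q_hat.append(np.quantile(y, tau))
        boot = [np.quantile(rng.choice(y, size=n_rep, replace=True), tau)
                for _ in range(n_boot)]
        q_var.append(np.var(boot))
    return np.array(q_hat), np.array(q_var)
```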

4.3 Experimental Setting

Figure 3: Evolution of the root mean square error during optimisation: 2D expectiles (left and middle), 7D quantile (right).
Figure 4: Regrets for: 2D expectiles (left and middle), 7D quantile (right).

Sequential strategy

We created an initial design of fixed size. At each update of the model we selected a batch of new inputs to be sampled, and we also added a few new points selected uniformly at random in $\mathcal{X}$. In 2D we used a budget of 350 evaluations, while in 7D the budget was 1100. At the end, the point returned by the algorithm corresponds to the maximiser of the model prediction.

On the hyperparameters of the model

For the first test case we selected inducing points at the locations of the initial design of experiments. For the second test case we placed inducing points at the locations of the initial design of experiments and added an inducing point at each corner of the input space, which empirically helps to obtain relevant trajectories for TS. We trained the whole model on the initial design until convergence of the ELBO. To update the model, we first trained only the variational parameters for a fixed number of epochs, then optimised both the variational parameters and the kernel hyperparameters for additional epochs. Note that we did not optimise the inducing point locations. To help the optimisation we used the whitening representation (see [Hensman et al., 2015] for more details).

Metrics

Each strategy is run 30 times with different initial conditions. As a primary performance metric, we consider the simple regret. In addition, we record the root mean square error of the models on randomly drawn test points over $\mathcal{X}$.

4.4 Results

Table 1 summarises our results. It is clear that on every test case the UCB and TS strategies outperform the two versions of the baseline. In addition, some problems are harder than others: for example, the results for one of the quantile orders are not as good as those obtained for the other. The reason lies in the high variance of the conditional distribution close to the maximum of the former objective, while the variance is much smaller close to the optimum of the latter.

Figure 4 shows the regret curves for a subset of problems. We see that on the simplest 2D problem, the baseline, although not competitive with our approach, behaves correctly. On the much more difficult 2D problem, the very high noise prevents the baseline from converging. While our approaches provide much better solutions from the start, the progress along iterations is limited. However, on average the model provides improved predictions (Figure 3, middle). On the 7D problem, our approaches directly start with a much better solution than the baseline and improve significantly over time. In addition, the overall prediction quality improves almost linearly (Figure 3, right).

Despite the stronger theoretical grounds of Thompson sampling, the UCB approach offers comparable performance on our test problems for a smaller computational burden. This may be explained by the specificity of the problems at hand: the difficulty of the learning task results in large uncertainties in the model predictions, which reduces the influence of the sampling strategy. More significant differences may appear when more data is available, or on less demanding tasks such as low-noise settings.

Figure 5: Quantiles of three different orders for test case 1.

Quantile                       Baseline (fewest rep.)   Baseline (most rep.)   UCB     TS
2D test case, first order      0.007                    0.01                   0.006   0.006
2D test case, second order     0.88                     0.82                   0.25    0.27
7D test case                   2.1                      0.73                   0.29    0.31

Expectile                      Baseline (fewest rep.)   Baseline (most rep.)   UCB     TS
2D test case, first order      0.01                     0.01                   0.003   0.003
2D test case, second order     0.65                     0.61                   0.27    0.27
7D test case                   0.71                     0.78                   0.42    0.41

Table 1: Final values (median) of the simple regret with a budget of 350 points in 2D and 1100 points in 7D.

5 Conclusion

In this paper we have presented a new model to estimate quantiles and expectiles of stochastic black-box functions which is well suited to heteroscedastic cases. The proposed model has then been used to create a Bayesian optimisation algorithm designed to optimise conditional quantiles and expectiles without repetitions in the experimental design. These algorithms showed good results on toy problems in dimensions 2 and 7 and for different orders.

References

  • [Abeywardana and Ramos, 2015] Abeywardana, S. and Ramos, F. (2015). Variational inference for nonparametric bayesian quantile regression. In AAAI, pages 1686–1692.
  • [Ackley, 1987] Ackley, D. (1987). A connectionist machine for genetic hillclimbing, volume 28 of SECS. Kluwer Academic Publishers.
  • [Bellini and Di Bernardino, 2017] Bellini, F. and Di Bernardino, E. (2017). Risk management with expectiles. The European Journal of Finance, 23(6):487–506.
  • [Bergstra et al., 2011] Bergstra, J. S., Bardenet, R., Bengio, Y., and Kégl, B. (2011). Algorithms for hyper-parameter optimization. In Advances in neural information processing systems, pages 2546–2554.
  • [Boukouvalas et al., 2012] Boukouvalas, A., Barillec, R., and Cornford, D. (2012). Gaussian process quantile regression using expectation propagation. arXiv preprint arXiv:1206.6391.
  • [Browne et al., 2016] Browne, T., Iooss, B., Gratiet, L. L., Lonchampt, J., and Remy, E. (2016). Stochastic simulators based optimization by gaussian process metamodels–application to maintenance investments planning issues. Quality and Reliability Engineering International, 32(6):2067–2080.
  • [Cannon, 2011] Cannon, A. J. (2011). Quantile regression neural networks: Implementation in r and application to precipitation downscaling. Computers & geosciences, 37(9):1277–1284.
  • [Contal et al., 2013] Contal, E., Buffoni, D., Robicquet, A., and Vayatis, N. (2013). Parallel gaussian process optimization with upper confidence bound and pure exploration. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 225–240. Springer.
  • [Desautels et al., 2014] Desautels, T., Krause, A., and Burdick, J. W. (2014). Parallelizing exploration-exploitation tradeoffs in gaussian process bandit optimization. The Journal of Machine Learning Research, 15(1):3873–3923.
  • [Diggle et al., 1998] Diggle, P. J., Tawn, J. A., and Moyeed, R. (1998). Model-based geostatistics. Journal of the Royal Statistical Society: Series C (Applied Statistics), 47(3):299–350.
  • [Dixon and Szego, 1978] Dixon, L. C. W. and Szego, G. P. (1978). The global optimization problem. an introduction. Toward global optimization, 2:1–15.
  • [Farooq and Steinwart, 2017] Farooq, M. and Steinwart, I. (2017). An svm-like approach for expectile regression. Computational Statistics & Data Analysis, 109:159–181.
  • [Frazier et al., 2009] Frazier, P., Powell, W., and Dayanik, S. (2009). The knowledge-gradient policy for correlated normal beliefs. INFORMS journal on Computing, 21(4):599–613.
  • [Ginsbourger et al., 2010] Ginsbourger, D., Le Riche, R., and Carraro, L. (2010). Kriging is well-suited to parallelize optimization. In Computational intelligence in expensive optimization problems, pages 131–162. Springer.
  • [Hensman et al., 2013] Hensman, J., Fusi, N., and Lawrence, N. D. (2013). Gaussian processes for big data. arXiv preprint arXiv:1309.6835.
  • [Hensman et al., 2015] Hensman, J., Matthews, A. G., Filippone, M., and Ghahramani, Z. (2015). Mcmc for variationally sparse gaussian processes. In Advances in Neural Information Processing Systems, pages 1648–1656.
  • [Hernández-Lobato et al., 2014] Hernández-Lobato, J. M., Hoffman, M. W., and Ghahramani, Z. (2014). Predictive entropy search for efficient global optimization of black-box functions. In Advances in neural information processing systems, pages 918–926.
  • [Hernández-Lobato et al., 2017] Hernández-Lobato, J. M., Requeima, J., Pyzer-Knapp, E. O., and Aspuru-Guzik, A. (2017). Parallel and distributed thompson sampling for large-scale accelerated exploration of chemical space. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1470–1479. JMLR. org.
  • [Jiang et al., 2017] Jiang, C., Jiang, M., Xu, Q., and Huang, X. (2017). Expectile regression neural network model with applications. Neurocomputing, 247:73–86.
  • [Jones et al., 1998] Jones, D. R., Schonlau, M., and Welch, W. J. (1998). Efficient global optimization of expensive black-box functions. Journal of Global optimization, 13(4):455–492.
  • [Kingma and Ba, 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • [Koenker and Bassett Jr, 1978] Koenker, R. and Bassett Jr, G. (1978). Regression quantiles. Econometrica: journal of the Econometric Society, pages 33–50.
  • [Li et al., 2016] Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., and Talwalkar, A. (2016). Hyperband: A novel bandit-based approach to hyperparameter optimization. arXiv preprint arXiv:1603.06560.
  • [Loeppky et al., 2009] Loeppky, J. L., Sacks, J., and Welch, W. J. (2009). Choosing the sample size of a computer experiment: A practical guide. Technometrics, 51(4):366–376.
  • [Marmin et al., 2015] Marmin, S., Chevalier, C., and Ginsbourger, D. (2015). Differentiating the multipoint expected improvement for optimal batch design. In International Workshop on Machine Learning, Optimization and Big Data, pages 37–48. Springer.
  • [Meinshausen, 2006] Meinshausen, N. (2006). Quantile regression forests. Journal of Machine Learning Research, 7(Jun):983–999.
  • [Mockus et al., 1978] Mockus, J., Tiesis, V., and Zilinskas, A. (1978). The application of bayesian methods for seeking the extremum. Towards global optimization, 2(117-129):2.
  • [Newey and Powell, 1987] Newey, W. K. and Powell, J. L. (1987). Asymmetric least squares estimation and testing. Econometrica: Journal of the Econometric Society, pages 819–847.
  • [Picheny, 2014] Picheny, V. (2014). A stepwise uncertainty reduction approach to constrained global optimization. In Artificial Intelligence and Statistics, pages 787–795.
  • [Picheny et al., 2013] Picheny, V., Wagner, T., and Ginsbourger, D. (2013). A benchmark of kriging-based infill criteria for noisy optimization. Structural and Multidisciplinary Optimization, 48(3):607–626.
  • [Plumlee and Tuo, 2014] Plumlee, M. and Tuo, R. (2014). Building accurate emulators for stochastic simulations via quantile kriging. Technometrics, 56(4):466–473.
  • [Rahimi and Recht, 2008] Rahimi, A. and Recht, B. (2008). Random features for large-scale kernel machines. In Advances in neural information processing systems, pages 1177–1184.
  • [Rasmussen, 2003] Rasmussen, C. E. (2003). Gaussian processes in machine learning. In Summer School on Machine Learning, pages 63–71. Springer.
  • [Rockafellar et al., 2000] Rockafellar, R. T., Uryasev, S., et al. (2000). Optimization of conditional value-at-risk. Journal of risk, 2:21–42.
  • [Rostek, 2010] Rostek, M. (2010). Quantile maximization in decision theory. The Review of Economic Studies, 77(1):339–371.
  • [Sangnier et al., 2016] Sangnier, M., Fercoq, O., and d’Alché Buc, F. (2016). Joint quantile regression in vector-valued rkhss. In Advances in Neural Information Processing Systems, pages 3693–3701.
  • [Saul et al., 2016] Saul, A. D., Hensman, J., Vehtari, A., and Lawrence, N. D. (2016). Chained gaussian processes. In Artificial Intelligence and Statistics, pages 1431–1440.
  • [Skullerud, 1968] Skullerud, H. (1968). The stochastic computer simulation of ion motion in a gas subjected to a constant electric field. Journal of Physics D: Applied Physics, 1(11):1567.
  • [Snelson and Ghahramani, 2006] Snelson, E. and Ghahramani, Z. (2006). Sparse gaussian processes using pseudo-inputs. In Advances in neural information processing systems, pages 1257–1264.
  • [Srinivas et al., 2009] Srinivas, N., Krause, A., Kakade, S. M., and Seeger, M. (2009). Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv preprint arXiv:0912.3995.
  • [Székely Jr and Burrage, 2014] Székely Jr, T. and Burrage, K. (2014). Stochastic simulation in systems biology. Computational and structural biotechnology journal, 12(20-21):14–25.
  • [Takeuchi et al., 2006] Takeuchi, I., Le, Q. V., Sears, T. D., and Smola, A. J. (2006). Nonparametric quantile estimation. Journal of machine learning research, 7(Jul):1231–1264.
  • [Titsias, 2009] Titsias, M. (2009). Variational learning of inducing variables in sparse gaussian processes. In Artificial Intelligence and Statistics, pages 567–574.
  • [Torossian et al., 2019] Torossian, L., Picheny, V., Faivre, R., and Garivier, A. (2019). A review on quantile regression for stochastic computer experiments. arXiv preprint arXiv:1901.07874.
  • [Wang et al., 2019] Wang, S., Ng, S. H., and Haskell, W. B. (2019). A multi-level simulation optimization approach for quantile functions. arXiv preprint arXiv:1901.05768.
  • [Wu and Frazier, 2016] Wu, J. and Frazier, P. (2016). The parallel knowledge gradient method for batch bayesian optimization. In Advances in Neural Information Processing Systems, pages 3126–3134.
  • [Yu and Moyeed, 2001] Yu, K. and Moyeed, R. A. (2001). Bayesian quantile regression. Statistics & Probability Letters, 54(4):437–447.