1 Introduction
Computer simulation models are now an essential tool for performance evaluation, decision making, quality control and uncertainty quantification of complex systems.
These models generally depend on multiple input variables that can be divided into two groups, one controllable the other uncontrollable. The enduser will decide based on the controllable variable, but the decisions should account for the effect of uncontrollable variables. For example, in pharmacology, the optimal drug posology depends on the drug formulation but also on the individual targeted (genetic characters, age, sex) and environmental interactions. The shelf life and performance of a manufacturing device depend on its design, but also on its environment and on uncertainties during to the manufacturing process. The performance of a deep neural network depends on its (hyper)parameters and on the quality of the training set.
In such systems, the links between the inputs and outputs may be too complex to be fully understood or to be formulated in a closed form. In this case, the system can be considered as a black box and formalized by an unknown function: , where denotes a compact space representing the controllable variables, and denotes the probability space representing the uncontrollable variables. Note that some black boxes have their outputs in a multidimensional space but this aspect is beyond the scope of the present paper. Because the space may be large and/or complex, we do not assume that it has any structure. In contrast to deterministic systems, for any fixed ,
is a random variable of law
; hence, such systems are often referred to as stochastic black boxes.Based on the standard constraints encountered in computer experiments, throughout this paper, we assume that the function is only accessible through pointwise evaluations ; no structural information is available regarding ; in addition, we take into account the fact that evaluations may be expensive, which drastically limits the number of possible calls to .
In order to understand the behavior of the system of interest or to take optimal decisions, information is needed about . An intuitive approach is to use a simple MonteCarlo technique and evaluate
to extract statistical moments, the empirical cumulative distribution function, etc. Unfortunately a stratified approach is not practical in the case of an
that is expensive to evaluate.Instead, we focus on surrogate models (also referred to as metamodels or statistical emulators), which are appropriate approaches in a small data setting associated with a regularity hypothesis (with respect to ) concerning targeted statistics. Among the vast choice of surrogate models [73, 81]
, the most popular ones include regression trees, Gaussian processes, support vector machines and neural networks. In the framework of stochastic black boxes, the standard approach consists in estimating the conditional expectation of
. This case has been extensively treated in the literature and many applications, including Bayesian optimization [64], have been developed. However, the conditional expectation is riskneutral, whereas pharmacologists, manufacturers, asset managers, data scientists and agronomists need to evaluate the worst case scenarios associated with their decisions, for example.Risk information can be introduced by using a surrogate expectationvariance model
[32, 65]or by estimating the whole distribution via heteroscedastic Gaussian processes
[36, 40]. However, such approaches usually imply that the shape of the distribution (e.g. normal, uniform, etc.) is the same for all . Another possible approach would be to learn the whole distribution with no strong structural hypotheses [52], but this requires a large number of evaluations of . Here, we focus on the conditional quantile estimation of order , a flexible tool to tackle cases in which the distribution of varies markedly in spread and shape with respect to , and a riskawareness tool in decision theory [3].1.1 Paper Overview
Many metamodels originally designed to estimate conditional expectations have been adapted to estimate the conditional quantile. However, despite extensive literature on estimating the quantile in the presence of spatial structure, few studies have reported on the constraints associated with stochastic black boxes. The performance of a metamodel with high dimension input is treated in insufficient details, performance based on the number of points has rarely been tackled and, to our knowledge, dependence on specific aspects of the quantile functions has never been studied. The aim of the present paper is to review quantile regression methods under standard constraints related to the stochastic black box framework, to provide information on the performance of the selected methods, and to recommend which metamodel to use depending on the characteristics of the computer simulation model and the data.
A comprehensive review of quantile regression is of course beyond the scope of the present work.We limit our review to the approaches that are best suited for our framework, while insuring the necessary diversity of metamodels. In particular, we have chosen six metamodels that are representative of three main categories: approaches based on statistical order as Knearest neighbors (KN) regression [6]
and random forest (RF) regression
[49], functional or frequentist approaches as neural networks (NN) regression [15] and regression in reproducing kernel Hilbert space (RK) [75], and Bayesian approaches based on Gaussian processes as Quantile Kriging (QK) [55] and the variational Bayesian (VB) regression [1]. Each category has some specificities in terms of theoretical basis, implementation and complexity. The methods are described in full in sections 3 to 6.1.In order to identify the relevant areas of expertise of the different metamodels, we designed an original benchmark system based on three toy functions and an agronomical model [16]. The dimension of the problems ranged from to and the number of observations from to . Particular attention was paid to the performance of each metamodel according to the size of the learning set, the value of the probability density function at the targeted quantile and the dimension of the problem. Sections 6 and 7
describe the benchmark system and detail its implementation, with particular focus on the tuning of the hyperparameters of each method. Full results and discussion are to be found in Sections
8 and 9, respectively.Figure 1 summarizes our findings. In a nutshell, quantile regression requires large budgets: while the ruleofthumb for computer experiments is a budget (i.e. number of experiments) 10 times the dimension, in our problems, we found that no method was able to provide a relevant quantile estimate with a number of observations less than 50 times the dimension. For larger budgets, no method works uniformly better than any other. NN and VB are best when the budget is large. When the budget is smaller, RF, RK, KN are best when the pdf is small in the neighborhood of the quantile, in other words, when little information is available. However, VB outperforms all the other methods when more information is available, that is, when the pdf is large in the neighborhood of the quantile.
2 Quantile emulators and design of experiments
We first provide the necessary definitions, objects and properties related to the quantile. The quantile of order of a random variable can be defined either as the (generalized) inverse of a cumulative distribution function (CDF), or as the solution to an optimization problem:
(1) 
where is the CDF of and
(2) 
is the socalled pinball loss [38] (Figure 2). In the following, we only consider situations in which is continuous.
Given a finite observation set composed of i.i.d samples of , the empirical estimator of can thus be introduced in two different ways:
(3)  
(4) 
where denotes the empirical CDF function. coincides with the order statistic:
where represents the smallest integer greater than or equal to and is the th smallest value in the sample . Similarly to (1), the conditional quantile of order can be defined in two equivalent ways:
(5) 
where is the conditional CDF of and is a functional space containing .
In a quantile regression context, one only has access to a finite observation set with a matrix. Estimators for (5) are either based on the order statistic as in (3) (section 3), or on the pinball loss as in (4) (sections 4 and 5). Throughout this work, the observation set is fixed (we do not consider a dynamic or sequential framework). Following the standard approach used in computer experiments, the training points are chosen according to a spacefilling design [17] over a hyperrectangle. In particular, we assume that there are no repeated experiments: , ; most of the methods chosen in this survey (KN, RF, RK, NN, VB) work under that setting.
However, as a baseline approach, one may decide to use a stratified experimental design with i.i.d samples for a given , , extract pointwise quantile estimates using (3) and fit a standard metamodel to these estimates. The immediate drawback is that for the same budget () such experimental designs cover much less of the design space than a design with no repetition. The QK method is based on this approach.
3 Methods based on order statistics
A simple way to compute a quantile estimate is to take an order statistic of an i.i.d. sample. A possible approach is to emulate such a sample by selecting all the data points in the neighborhood of the query point , and then by taking the order statistic of this subsample as an estimator for the conditional quantile. One may simply choose a subsample of based on a distance defined on : this is what the nearest neighbors approach does. It is common to use KN based on the Euclidean distance but of course any other distance can be used, such as Mahalanobis [80] or weighted Euclidean distance [28]. Alternatively, one may define a notion of neighborhood using some space partitioning of
. That includes all the decision tree methods
[14], in particular regression trees, bagging or random forest [49].3.1 nearest neighbors
The nearest neighbors method was first proposed for the estimation of conditional expectations [71, 72]. Its extension to the conditional quantile estimation can be found in [6].
3.1.1 Quantile regression implementation
KN works as follows: define the subset of containing the points that are the closest to the query point . Define the associated outputs, and define as the associated empirical CDF. According to (3), the conditional quantile of order is define as
(6) 
Algorithm 1 details the implementation of the KN method.
3.1.2 Computational complexity
For a naive implementation of such an estimator, one needs to compute distances, where is the number of query points, hence for a cost in . Moreover, sorting distances in order to extract the nearest points has a cost in . Combining the two operations implies a complexity of order
Note that some algorithms have been proposed in order to reduce the computational time, for example by using GPUs [30] or by using tree search algorithms [4].
3.2 Random forests
Random forests were introduced by Breiman [13] for the estimation of conditional expectations. They have been used successfully for classification and regression, especially with problems where the number of variables is much larger than the number of observations [25].
3.2.1 Overview
The basic element of random forests is the regression tree , a simple regressor built via a binary recursive partitioning process. Starting with all data in the same partition , the following sequential process is applied. At each step, the data is split in two, so that is partitioned in a way that can be represented by a tree as it is presented Figure 3.

Several splitting criteria can be chosen (see [34]). In [49], the splitting point minimizes
(7) 
where and are the mean of the left and right subpopulations, respectively. Equation (7) applies when the ’s are realvalued. In the multidimensional case, the dimension in which the split is performed has to be selected. The split then goes through and perpendicularly to the direction . There are several rules to stop the expansion of . For instance, the process can be stopped when the population of each cell is inferior to a minimal size : then, each node becomes a terminal node or leaf. The result of the process is a partition of the input space into hyperrectangles . Like the KN method, the treebased estimator is constant on each neighborhood. However the regression tree framework automatically builds neighborhoods from the data that should be adapted to each problem.
Despite their simplicity of construction and interpretation, regression trees are known to suffer from a high variance. To overcome this drawback, regression trees can be used with ensemble methods like bagging. Instead of using only one tree, bagging creates a set of tree based on a bootstrap version of . Then the final model is created by averaging the results among all the trees.
Although bagging reduces the variance of the predictor, as the splitting criterion has to be optimized over all the input dimensions, computing (7) for each possible split is costly when the dimension is large. The random forest algorithm, a variant of bagging, constructs an ensemble of weak learners based on and aggregates them. Unlike plain bagging, at each node evaluation, the algorithm uses only a subset of covariables for the choice of the split dimensions. Because the covariables are randomly chosen, the result of the process is a random partition of constructed by the random tree .
3.2.2 Quantile prediction
We present the extension proposed in [49] for conditional quantile regression. Let us define the leaf obtained from the tree containing a query point and
The ’s represent the weights illustrating the “proximity” between and . In the classical regression case, the estimator of the expectation is:
(8) 
In [49] the conditional quantile of order is defined as in (3) with the CDF estimator defined as
(9) 
Algorithm 2 details the implementation of the RF method.
3.2.3 Computational complexity
Assuming that the value of (7) can be computed sequentially for consecutive thresholds, the RF computation burden lies in the search of the splitting point that implies sorting the data. Sorting variables has a complexity in . Thus, at each node the algorithm finds the best splitting points considering only covariables. This implies a complexity of per node. In addition, the depth of a tree is generally upper bounded by . Then the computational cost of building a forest containing trees under the criterion (7) is
[44][82]. One may observe that RF are easy to parallelize and that contrary to KN the prediction time is very small once the forest is built.
4 Approaches based on functional analysis
Functional methods search directly for the function mapping the input to the output in a space fixed beforehand by the user.
Estimating any functional of with this framework implies selecting a loss (associated to ) and a function space . Thus, the estimator is obtained as the minimizer of the empirical risk associated to ,
(10) 
The functional space must be chosen large enough to extract some signal from the data. In addition,
needs to have enough structure to make the optimization procedure feasible (at least numerically). In the literature, several formalisms such as linear regression
[63], spline regression [46], support vector machine [79], neural networks [8] or deep neural networks [61] use structured functional spaces with different levels of flexibility.However, using a too large can lead to overfitting, return predictors that are good only on the training set and generalize poorly. Overcoming overfitting requires some regularization [62] [85] [86], defining for example the regularized risk
(11) 
where is a penalization factor, and is either a norm for some methods (Section 4.2) or a measure of variability for others (Section 4.1). The parameter plays a major role, as it allows to tune the balance between bias and variance.
Classically, squared loss is used: it is perfectly suited to the estimation of the conditional expectation. Using the pinball loss (Eq. 2) instead allows to estimate quantiles. In this section we present two approaches based on Equation (11) with the pinball loss. The first one is regression using artificial neural networks (NN), a rich and versatile class of functions that has shown a high efficiency in several fields. The second approach is the generalized linear regression in reproducing kernel Hilbert spaces (RK). RK is a nonparametric regression method that has been much studied in the last decades (see [69]) since it appeared in the core of learning theory in the 1990’s [62, 79].
4.1 Neural Networks
Artificial neural networks have been successfully used for a large variety of tasks such as classification, computer vision, music generation, and regression
[8]. In the regression setting, feedforward neural networks have shown outstanding achievements. Here we present quantile regression neural network
[15] wich is an adaptation of the traditional feedforward neural network.4.1.1 Overview
A feedforward neural network is defined by its number of hidden layers
, its numbers of neurons per layer
, and its activation functions
, . Given an input vector the information is fed to the hidden layer composed of a fixed number of neurones . For each neurone , , a scalar product (noted ) is computed between the input vector and the weights of the neurones. Then a bias term is added to the result of the scalar product. The result is composed with the activation function(linear or nonlinear) which is typically the sigmoid or the RElu function
[61] and the result is given to the next layer where the same operation is processed until the information comes out from the outpout layer. For example, the output of a layers NN at is given by(12) 
The corresponding architecture can be found in Figure 4.
The architecture of the NN defines . Finding the right architecture is a very difficult problem which will not be treated in this paper. However a classical implementation procedure consists of creating a network large enough (able to overfit) and then using techniques such as early stopping, dropout, bootstrapping or risk regularization to avoid overfitting [67]. In [15], the following regularized risk is used:
(13) 
4.1.2 Quantile regression
Minimizing Equation (13) (with respect to all the weights and biases) is in general challenging, as is a highly multimodal function. It is mostly tackled using derivativebased algorithms and multistarts ( launching the optimization procedure times with different starting points). In the case of quantile estimation case, the loss function is nondifferentiable at the origin, which may cause problems to some numerical optimization procedures. To address this issue, [15] introduced a smooth version of the pinball loss function, defined as:
where
(14) 
Let us define the list containing the weights and bias of the network. To find , a minimizer of , the idea is to solve a series of problems using the smoothed loss instead of the pinball one with a sequence corresponding to decreasing values of . The process begins with the optimization with the larger value . Once the optimization converges, the optimal weights are used as the initialization for the optimization with , and so on. The process stops when the weights based on are obtained. Finally, is given by the evaluation of the optimal network at . Algorithm 3 details the implementation of the NN method.
4.1.3 Computational complexity
In [15] the optimization is based on a Newton method. Thus the procedure needs to inverse a Hessian matrix. Without sophistications, its cost is with the size of the problem
the number of parameters to optimize. Note that using a high order method makes sense here because NN has few parameters (in contrast to deep learning methods). Moreover providing an upper bound on the number of iterations needed to reach an optimal point may be really hard in practice because of the non convexity of (
13). In the nonconvex case, there is no optimality guaranty and the optimization could be stuck in a local minima. However, it can be shown that the convergence near a local optimal point is at least super linear and may be quadratic (with some additional hypotheses). It implies, for each , the number of iterations until is bounded above bywith the minimal decreasing rate, , the strong convexity constant of near and the Hessian Lipschitz constant (see [11] page 489). As increases very slowly with respect to , it is possible to bound the number of iterations typically by
That means, near an optimal point, the complexity is , with the total number of neurons. Then using a multistart procedure implies a complexity of
with .
4.2 Generalized linear regression
Regression in RKHS was introduced for classification via Support Vector Machine by [22, 33], and has been naturally extended for the estimation of the conditional expectation [27, 58]. Since, many applications have been developed [69, 62], here we present the quantile regression in RKHS [75, 60].
4.2.1 RKHS introduction and formalism
Under the linear regression framework, is assumed to be under the form , with in . To stay in the same vein while creating nonlinear responses, one can map the input space to a space of higher dimension (named the feature space), thanks to a feature map . For exemple the feature space could be a polynomial space, in that case we are working with the spline framework [46]. For a large flexibility and few parameters, the feature space can even be chosen as an infinite dimensional space. In the following, defines a feature map from to , where is the Hilbert functional space defined as
where is the cardinal of . Under the hypothesis that belongs to , can be written as
(15) 
Notice that without more hypothesis on , estimating is difficult. In fact it is impossible to compute (15) directly because of the infinite sum. However, this issue may be tackled by using the RKHS formalism and the socalled kernel trick [62]. Let us first introduce the symmetric definite positive function such that:
(16) 
Under this setting, is a RKHS with the reproducing kernel , that means for all and the reproducing property
(17) 
holds for all and all . It can be shown that working with a fixed kernel is equivalent to working with its associated functional Hilbert space.
The kernel choice is based on kernel properties or assumptions made on the functional space. See for instance [69], chapter 4, for some kernel definitions and properties. In the following, and denote respectively a RKHS and its kernel associated to the hyperparameters vector . is the kernel matrix obtained via .
From a theoretical point of view, the representer theorem implies that the minimizer of
lives in with Thanks to the reproducing property (17), and since ,
Using the definition (16), it is possible to rewrite as:
Hence, the original infinite dimensional problem (15) becomes an optimization problem over coefficients . More precisely, finding is equivalent to minimizing in the quantity
(18) 
4.2.2 Quantile regression
Quantile regression in RKHS was introduced by [75], followed by several authors [42, 70, 20, 21, 60]. Quantile regression has two specificities compared to the general case. Firstly the loss is defined as the pinball. Secondly, to ensure the quantile property, the intercept is not regularized. More precisely, we assume that
and we consider the empirical regularized risk
(19) 
Thus the representer theorem implies that can be written under the form
for a new query point . Since (19) cannot be minimized analytically, a numerical minimization procedure is used. [22] followed by [75] introduced nonnegative variables to transform the original problem into
subject to
and
Using a Lagrangian formulation, it can be shown that minimizing is equivalent to the problem:
(20)  
and 
It is a quadratic optimization problem under linear constraint, for which many efficient solvers exist.
The value of may be obtained from the KarushKuhnTucker slackness condition or fixed independently of the problem. A simple way to do so is to choose as the quantile of . Algorithm 4 details the implementation of the RK method.
and 
4.2.3 Computational complexity
Let us notice two things. Firstly, the minimal upper bound complexity for solving (20) is . Indeed solving (20) without the constraints is easier and it needs . Secondly the optimization problem (20) is convex, thus the optimum is global.
There are two main approaches for solving (20), the interior point method [11] and the iterative methods like libSVM [18]. The interior point method is based on the Newton algorithm, one method is the barrier method (see [11] p.590). It is shown that the number of iterations for reaching a solution with precision is Moreover each iteration of a Newton type algorithm costs because it needs to inverse a Hessian. Thus, the complexity of an interior point method for finding a global solution with precision is
On another hand, iterative methods like libSVM transform the main problem into a smaller one. At each iteration the algorithm solves a subproblem in . Contrary to the interior point methods, the number of iterations depends explicitly on the matrix . [43] shows that the number of iterations is
where . Note that depends on the type of the kernel, it evolves in with an increasing value of the regularity of [19]
. For more information about the eigenvalues of
one can consult [12].To summarize, it implies that the complexity of the libSVM method has an upper bound higher than the interior point algorithm. However, these algorithms are known to converge pretty fast. In practice, the upper bound is almost never reached, and thus the most important factor is the cost per iteration, rather than the number of iterations needed. That is the reason why libSVM is popular in this setting.
5 Bayesian approaches
Bayesian formalism has been used for a wide class of problems such as classification and regression [57], model averaging [10] and model selection [56].
Under the regression framework, a classical Bayesian model is defined as
(21) 
where depends on parameters that have to be inferred, is an observation and a noise term. Knowing , the goal is to estimate the posterior distribution at
with a vector of parameters. The posterior is a priori unknown and is the model hypothesis. According to the Bayes formula, the posterior can be written as
Because the normalizing constant is independent of and , considering only the likelihood and the prior is enough. We obtain
with the likelihood
(22) 
The Bayesian quantile regression framework can be summarized as follows. Starting from Equation (22), there are several estimation possibilities. The first idea was introduced in [83] where the authors worked under a linear framework, but the linear hypothesis is too restrictive to treat the stochastic black box setting. [74] introduced a mixture modeling framework called Dirichlet process to perform nonlinear quantile regression. However the inference is performed with MCMC methods [31, 29], a procedure that is often costly. A possible alternative is the utilisation of Gaussian process (GP). GPs are powerful in a Bayesian context because of their flexibility and their tractability (GP are only characterized by their mean and covariance ). In this section we present QK and VB, two approaches using a Gaussian process (GP) as a prior for .
Note that while QK and VB use GP, both metamodels are intrinsically different. For QK, GP is a consequence of the hypothesis made on . However VB uses the artificial assumption GP in order to simplify the estimation procedure while ensuring flexibility.
5.1 Quantile kriging
Kriging takes its origins in geostatistics and spatial data interpolation
[23, 68]. Since the’s, kriging drew attention of the machine learning community
[57]. We present a very intuitive method that gives flexible quantile estimators based on data containing repetition and GPs [55].5.1.1 Kriging introduction
In the kriging framework, in (21) is as assumed to be Gaussian,
(23) 
and
with
(24) 
Note that here . Under this assumption, the likelihood (22) is given by
(25)  
with . It can be shown (see [57]) that under the hypotheses (23) and (24)
(26) 
with
being a diagonal matrix of size with diagonal terms equals to which represents the observation noise. As in Section 4.2, the covariance functions are usually chosen among a set of predefined ones that depend on a set of hyperparameters .
The best hyperparameter can be selected as the maximizer of the marginal likelihood, which can be computed by marginalizing over :
In addition, since it follows [57] that
where is the determinant of the matrix . Maximizing this likelihood with respect to is usually done using derivativebased algorithms, although the problem is nonconvex and known to have several local maxima.
Different estimators of may be extracted based on (26). Here is fixed as
. Note that this classical choice is made because the maximum a posteriori of a Gaussian distribution coincides with its mean.
5.1.2 Quantile kriging prediction
As is a latent quantity, the solution proposed in [55] is to consider the sample which represents a design of experiments with different points that are repeated times in order to obtain quantile observations. For each , , let us define:
and
Following [55], let us assume that
Comments
There are no comments yet.