1 Introduction
Researchers, technicians and policy makers often base their decisions on complex models. They have to process, analyze and interpret these models assertively with the data available. Usually, such models include many variables and interactions. One way to overcome these issues is to select the most relevant variables of the system. In this way, we gain insight into the model and discover its main characteristics. Still, we have to produce a stable approximation of the model, avoiding large variations in the output caused by small perturbations in the input. The analyst, in any case, has to confirm, check and improve the model.
The typical situation assumes a set of input variables $X_1, \ldots, X_p$ producing an output $Y$ related by the model

(1) $Y = f(X_1, \ldots, X_p).$
The function $f$ may be unknown and complex. In some cases, a computer code can evaluate it (e.g., oakley2004probabilistic). Also, we can replace the original model by a low-fidelity approximation called a metamodel (see box1987empirical). The problems related to this formulation extend to engineering, biology, oceanography and other fields.
Given the set of inputs in the model (1), we can rank them according to different criteria. Some examples are: the screening method (cullen1999probabilistic, campolongo2011screening), automatic differentiation (rall1980applications, carmichael1997sensitivity), regression analysis (draper1981applied, devore1996statistics) and the response surface method (montgomery1995response, goos2002optimal).
The work of sobol1993sensitivity, inspired by an ANOVA (or Hoeffding) decomposition, splits the variance of the model into partial variances. They are generated by the conditional expectations of $Y$ given each input $X_i$, for $i = 1, \ldots, p$. The partial variances represent the uncertainty created by each input or its interactions. Dividing each partial variance by the total variance of the model, we get a normalized index of importance. We call first-order Sobol indices the quantities

$S_i = \dfrac{\mathrm{Var}\big(\mathbb{E}[Y \mid X_i]\big)}{\mathrm{Var}(Y)}, \qquad i = 1, \ldots, p.$

Notice that $\mathbb{E}[Y \mid X_i]$ is the best approximation of $Y$ given $X_i$. Thus, if the variance of $\mathbb{E}[Y \mid X_i]$ is large, it means a large influence of $X_i$ on $Y$.
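As a quick sanity check of this definition, the ratio $\mathrm{Var}(\mathbb{E}[Y \mid X_i])/\mathrm{Var}(Y)$ can be computed directly on a toy linear model for which the conditional expectation is available in closed form. The model, coefficients, seed and sample size below are illustrative choices, not taken from this paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
X = rng.uniform(0, 1, size=(n, 3))
Y = 4*X[:, 0] + 2*X[:, 1] + X[:, 2]   # illustrative linear model

# For this model E[Y | X1] = 4*X1 + 1.5 in closed form, so the first-order
# index S1 = Var(E[Y|X1]) / Var(Y) can be checked directly.
cond_mean = 4*X[:, 0] + 1.5
S1_hat = cond_mean.var() / Y.var()
S1_exact = 16.0 / 21.0               # Var(4*X1) / Var(Y) for independent uniforms
print(round(S1_hat, 3), round(S1_exact, 3))
```

For independent uniform inputs the index of each term is proportional to the squared coefficient, which is why the first input dominates here.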
These indices are used in theoretical and applied settings to determine the most relevant and sensitive inputs of the model. We can also establish indices that measure the interactions between variables or the total effect of a certain input on the whole model. We refer the reader to saltelli2000sensitivity for the exact computation of higher-order Sobol indices.
The main challenge with the Sobol indices lies in their computation. Monte Carlo or quasi Monte Carlo methods sample the model (on the order of hundreds or thousands of evaluations) to approximate its behavior. Two classic examples are the Fourier Amplitude Sensitivity Test (FAST) and the Sobol Pick-Freeze (SPF) scheme. cukier1973study and cukier1978nonlinear created the FAST method, which transforms the partial variances into Fourier expansions. This method allows the aggregated and simple estimation of Sobol indices in a scalable way. The SPF scheme regresses the model output against a pick-frozen replication. The principle is to create a replication holding the variable of interest (frozen) and resampling the other variables (picked). We refer the reader to sobol1993sensitivity, sobol2001global and janon2013asymptotic. Among other methods, Ishigami1990Importance improved the classic Monte Carlo procedure by resampling the inputs and reducing the whole process to only one Monte Carlo draw. The paper of saltelli2002making proposed an algorithm to estimate higher-order indices with minimal computational effort.
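The pick-freeze idea can be sketched in a few lines: freeze the column of interest, redraw the remaining inputs, and correlate the two resulting outputs. The toy model, seed and sample size below are illustrative assumptions; the estimator follows a Janon-type pick-freeze formula:

```python
import numpy as np

rng = np.random.default_rng(1)

def model(X):
    # illustrative toy model, not from the paper
    return X[:, 0]**2 + X[:, 1]

n = 200_000
X = rng.uniform(0, 1, size=(n, 2))
Xf = rng.uniform(0, 1, size=(n, 2))
Xf[:, 0] = X[:, 0]          # freeze X1, pick fresh values for the rest

Y, Y1 = model(X), model(Xf)
# pick-freeze estimator of the first-order index of X1
m = 0.5 * (Y.mean() + Y1.mean())
S1_hat = (np.mean(Y * Y1) - m**2) / (0.5*(np.mean(Y**2) + np.mean(Y1**2)) - m**2)
print(round(S1_hat, 2))
```

For this toy model the exact first-order index of $X_1$ is $16/31 \approx 0.52$, obtained from $\mathrm{Var}(X_1^2) = 4/45$ and $\mathrm{Var}(Y) = 31/180$.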
The Monte Carlo methods suffer from high computational stress in their implementation. For example, the FAST method requires estimating a set of suitable transformation functions and integer angular frequencies for each variable. The SPF scheme creates a new copy of the variable in each iteration. For complex and high-dimensional models, those techniques become expensive in computational time.
Another limitation is that, in most cases, those methods require a complete identification of the link function $f$ between the inputs and the output. That is, the analyst has to have available the exact link function, or some alternative algorithm which produces the outcome. Consider, for example, a researcher who has only a dataset with explanatory and response variables. One natural task is to find the most influential explanatory variables.
This article proposes an alternative way to compute the Sobol indices. In particular, we take the ideas of zhu1996asymptotics and apply a nonparametric Nadaraya-Watson estimator to estimate $\mathbb{E}[Y \mid X_i]$. With this estimator, we avoid the stochastic techniques and use the information in the data to fit the nonparametric model. This work shows that if the joint density of $(X_i, Y)$ is twice differentiable, the nonparametric estimator of $S_i$ has a parametric rate of convergence. Otherwise, we get a nonparametric rate of convergence depending on the regularity of the density.

The article follows this framework: in Section 2 we propose the nonparametric estimator for the first-order Sobol indices. We gather the hypotheses, assumptions and main results in Section 3. The method to calibrate the optimal bandwidth is in Section 4. We illustrate our method with two numerical examples in Section 5. In Section 6, we present the conclusions and discussion.
2 Methodology
In our context, we suppose that $(X_1^{(j)}, \ldots, X_p^{(j)}, Y_j)$, $j = 1, \ldots, n$, are independent and identically distributed observations from the random vector $(X_1, \ldots, X_p, Y)$ satisfying model (1). We denote by $f_{X_i, Y}$ the joint density of the couple $(X_i, Y)$. Let $f_{X_i}$ be the marginal density function of $X_i$ for $i = 1, \ldots, p$.

Recall the definition of the Sobol indices presented in the introduction,
(2) $S_i = \dfrac{\mathbb{E}\big[\mathbb{E}[Y \mid X_i]^2\big] - \mathbb{E}[Y]^2}{\mathrm{Var}(Y)}.$
We have expanded the variance in the numerator to simplify the presentation. Notice that we can estimate the terms $\mathbb{E}[Y]$ and $\mathrm{Var}(Y)$ in equation (2) by their empirical counterparts

(3) $\widehat{\mathbb{E}}[Y] = \dfrac{1}{n} \sum_{j=1}^{n} Y_j$

and

(4) $\widehat{\mathrm{Var}}(Y) = \dfrac{1}{n} \sum_{j=1}^{n} \Big( Y_j - \widehat{\mathbb{E}}[Y] \Big)^2.$
The term $\mathbb{E}\big[\mathbb{E}[Y \mid X_i]^2\big]$ requires more effort to estimate. For any $x$ in the support of $X_i$ we introduce the following notation,

$\mathbb{E}[Y \mid X_i = x] = \dfrac{\varphi_i(x)}{f_{X_i}(x)}, \quad$ where $\quad \varphi_i(x) = \displaystyle\int y \, f_{X_i, Y}(x, y) \, dy.$
We will use a modified version of the nonparametric estimator developed in loubes2013rates. That paper estimates a conditional-expectation covariance for a dimension reduction problem.
We estimate the functions $\varphi_i$ and $f_{X_i}$ by their nonparametric kernel estimators,

(5) $\hat{\varphi}_i(x) = \dfrac{1}{n h} \sum_{j=1}^{n} Y_j \, K\!\left(\dfrac{X_i^{(j)} - x}{h}\right),$

(6) $\hat{f}_{X_i}(x) = \dfrac{1}{n h} \sum_{j=1}^{n} K\!\left(\dfrac{X_i^{(j)} - x}{h}\right),$

where $K$ is a kernel function and $h > 0$ a bandwidth. The nonparametric estimator for $\mathbb{E}\big[\mathbb{E}[Y \mid X_i]^2\big]$ is

(7) $\hat{T}_i = \dfrac{1}{n} \sum_{j=1}^{n} \left( \dfrac{\hat{\varphi}_i\big(X_i^{(j)}\big)}{\hat{f}_{X_i}\big(X_i^{(j)}\big)} \right)^{2},$

which, plugged into (2) together with (3) and (4), yields

(8) $\hat{S}_i = \dfrac{\hat{T}_i - \widehat{\mathbb{E}}[Y]^2}{\widehat{\mathrm{Var}}(Y)}.$

The estimator (8) provides a direct way to estimate the first-order Sobol index $S_i$. We want to control the squared risk over some regular class of functions for the joint density $f_{X_i, Y}$. Thus, we can choose a bandwidth according to the regularity to get a convergent estimator. Our aim is to find sufficient conditions to obtain a parametric rate of convergence for $\hat{S}_i$.
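A compact numerical sketch of this plug-in strategy is given below, written in the spirit of the ratio form $\hat{m} = \hat{\varphi} / \hat{f}$ evaluated at the sample points. The kernel, toy model, bandwidth and sample size are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

def epanechnikov(u):
    # second-order Epanechnikov kernel supported on [-1, 1]
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)

def sobol_nw(x, y, h):
    """Plug-in first-order Sobol index: the Nadaraya-Watson regression of y
    on one input x is evaluated at the sample points, then plugged into
    (mean(m_hat^2) - mean(y)^2) / var(y)."""
    K = epanechnikov((x[:, None] - x[None, :]) / h)
    m_hat = (K * y[None, :]).sum(axis=1) / K.sum(axis=1)
    return (np.mean(m_hat**2) - np.mean(y)**2) / np.var(y)

rng = np.random.default_rng(2)
n = 2000
x = rng.uniform(0, 1, n)
y = np.sin(2*np.pi*x)**2 + rng.uniform(0, 1, n)   # toy model
est = sobol_nw(x, y, h=0.05)
print(round(est, 2))
```

For this toy model the exact index is $3/5$, since $\mathrm{Var}(\sin^2(2\pi X)) = 1/8$ and $\mathrm{Var}(Y) = 1/8 + 1/12$; the kernel estimate should land close to that value, up to smoothing bias.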
3 Main result
We denote by $C$, $C_1$, $C_2$, and so on, constants (independent of $n$) that may take different values throughout the paper.

Let $\ell$ be a positive real value and let $\lfloor \ell \rfloor$ be the largest integer strictly smaller than $\ell$. We say that a continuous kernel function $K$ is of order $\ell$ if it satisfies:

the support of $K$ is the interval $[-1, 1]$;

$K$ is symmetric around 0;

$\int_{-1}^{1} K(u) \, du = 1$ and $\int_{-1}^{1} u^{k} K(u) \, du = 0$ for $k = 1, \ldots, \lfloor \ell \rfloor$.
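These moment conditions are easy to verify numerically. The check below uses the second-order Epanechnikov kernel and one common fourth-order Epanechnikov variant; the specific formula of the fourth-order kernel is an assumed standard form, not taken from the paper:

```python
import numpy as np

# A kernel of order l must satisfy: integral of K = 1 and
# integral of u^k K(u) du = 0 for k = 1, ..., floor(l), on the support [-1, 1].
u = np.linspace(-1.0, 1.0, 200_001)
du = u[1] - u[0]

K2 = 0.75 * (1.0 - u**2)                          # second-order Epanechnikov
K4 = (15.0/32.0) * (3.0 - 10.0*u**2 + 7.0*u**4)   # assumed fourth-order variant

def moment(K, k):
    # simple Riemann approximation of the k-th moment of the kernel
    return float(np.sum(u**k * K) * du)

print([round(moment(K2, k), 4) for k in (0, 1, 2)])
print([round(moment(K4, k), 4) for k in (0, 1, 2, 3, 4)])
```

The fourth-order kernel integrates to one with vanishing moments up to order three, at the cost of taking negative values, which is what reduces the smoothing bias.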
To guarantee parametric consistency of our estimator, we need to impose some regularity conditions. In our case, we define the Hölder class of smooth functions as follows.
Definition 1.

Denote by $\Sigma(\beta, L)$ the Hölder class of density functions with smoothness $\beta > 0$ and radius $L > 0$, defined as the set of $\lfloor \beta \rfloor$ times differentiable functions $g : T \to \mathbb{R}$, where $T$ is an interval of $\mathbb{R}$, whose derivative $g^{(\lfloor \beta \rfloor)}$ satisfies

$\big| g^{(\lfloor \beta \rfloor)}(x) - g^{(\lfloor \beta \rfloor)}(x') \big| \leq L \, |x - x'|^{\beta - \lfloor \beta \rfloor} \quad$ for all $x, x' \in T.$
The following technical assumption establishes the class of functions over which we will derive the upper bound of the risk.
Assumption 1.

For each fixed $y$, the function $x \mapsto f_{X_i, Y}(x, y)$ belongs to a Hölder class of regularity $\beta$ and radius $L$, i.e. it belongs to $\Sigma(\beta, L)$. Also, the marginal density $f_{X_i}$, for $i = 1, \ldots, p$, belongs to a Hölder class of smoothness $\beta$ and radius $L$, i.e. $f_{X_i} \in \Sigma(\beta, L)$.
Remark 1.

Notice that the function $\varphi_i$ also belongs to a Hölder class $\Sigma(\beta, L')$ for some radius $L' > 0$. Recall that $\varphi_i(x) = \int y \, f_{X_i, Y}(x, y) \, dy$. Let $x$ and $x'$ be fixed. If $f_{X_i, Y}(\cdot, y)$ belongs to $\Sigma(\beta, L)$ for each $y$, then a direct calculation proves our assertion.
Denote by $\mathcal{F}$ the class of joint densities that fulfill Assumption 1. The following theorem establishes the rates of convergence for $\hat{S}_i$ depending on the parameter $\beta$.
Theorem 1.

Suppose that Assumption 1 holds. The upper bound of the risk of the estimator $\hat{S}_i$ defined in (8) over the functional class $\mathcal{F}$ satisfies:

if $\beta \geq 2$, then choosing the bandwidth appropriately the mean squared risk is bounded by $C n^{-1}$;

if $\beta < 2$, then choosing the bandwidth according to the regularity $\beta$ the mean squared risk attains a nonparametric rate depending on $\beta$.
Theorem 1 presents an elbow effect in the rates of convergence. This is a typical behavior in studies on functional estimation, for instance baraud2003adaptive or laurent2005adaptive. The regularity of the joint density function defines the rate of convergence for the mean squared risk of $\hat{S}_i$. That is, we can obtain a parametric rate at the price of requiring regularity of the joint density. When $\beta < 2$, the rate turns into a nonparametric one which depends on the parameter $\beta$.

In the regular case, when $\beta \geq 2$, we avoid any adaptability problem concerning the choice of the bandwidth $h$. We can establish a parametric rate of convergence for $\hat{S}_i$ by taking a bandwidth that depends only on $n$. Otherwise, it is necessary to know the regularity of the model to choose the bandwidth.
4 Choice of bandwidths for Sobol indices
Even if Theorem 1 gives the asymptotically optimal rate, the constant still depends on unknown quantities such as the true density function. Thus, the choice of the bandwidth remains the crucial step for estimating $S_i$ accurately. The main issue is to estimate the regression curve $m_i(x) = \mathbb{E}[Y \mid X_i = x]$.
We can fit this curve with the available data by minimizing the least-squares criterion

(9) $LS(h) = \dfrac{1}{n} \sum_{j=1}^{n} \Big( Y_j - \hat{m}_{i,h}\big(X_i^{(j)}\big) \Big)^2,$

where, as before, we estimate $m_i$ by $\hat{m}_{i,h} = \hat{\varphi}_i / \hat{f}_{X_i}$, with $\hat{\varphi}_i$ and $\hat{f}_{X_i}$ defined in Equations (5) and (6). However, a problem arises because this method uses the data twice, to calibrate and to verify the model. The cross-validation method estimates the prediction error by removing the observations one by one and recalculating the model with the remaining data. The resulting estimator is called the leave-one-out estimator, with the expression

$\hat{m}_{i,h}^{(-j)}(x) = \dfrac{\sum_{k \neq j} Y_k \, K\big( (X_i^{(k)} - x)/h \big)}{\sum_{k \neq j} K\big( (X_i^{(k)} - x)/h \big)}.$
Afterwards, we can build a new version of the least-squares error,

$CV(h) = \dfrac{1}{n} \sum_{j=1}^{n} \Big( Y_j - \hat{m}_{i,h}^{(-j)}\big(X_i^{(j)}\big) \Big)^2,$

and find the optimal bandwidth $\hat{h}_{CV} = \arg\min_h CV(h)$. Finally, we estimate $S_i$ with this bandwidth. The book of hardle2004nonparametric details the procedure.
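The leave-one-out criterion can be written compactly by zeroing the diagonal of the kernel weight matrix, so each point is predicted from the remaining ones. The toy data, kernel choice and search grid below are illustrative assumptions:

```python
import numpy as np

def epanechnikov(u):
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)

def loo_cv_score(x, y, h):
    """Leave-one-out least-squares score for a Nadaraya-Watson fit."""
    K = epanechnikov((x[:, None] - x[None, :]) / h)
    np.fill_diagonal(K, 0.0)                 # drop the j-th point from its own fit
    denom = K.sum(axis=1)
    m_loo = (K @ y) / np.where(denom > 0, denom, np.nan)
    return float(np.nanmean((y - m_loo)**2))

rng = np.random.default_rng(3)
n = 500
x = rng.uniform(0, 1, n)
y = np.sin(2*np.pi*x) + 0.3*rng.standard_normal(n)

grid = np.linspace(0.01, 0.5, 50)
scores = [loo_cv_score(x, y, h) for h in grid]
h_cv = float(grid[int(np.argmin(scores))])
print(round(h_cv, 3))
```

Very small bandwidths blow up the prediction variance and very large ones oversmooth the sine curve, so the score is minimized at an intermediate bandwidth.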
However, even if cross-validation is asymptotically unbiased, these estimators have a relatively large finite-sample bias. The works of Faraway1990, Romano1988 and Padgett1986 established the same behavior when studying nonparametric estimators of the density, the quantiles and the mode, respectively. This problem arises in nonparametric-based models, as exemplified by Hardle1993. One solution is to remove the bias part of the estimate by bootstrapping, following the ideas in Racine2001.
The procedure starts with the residuals of the variable $Y$ with respect to its nonparametric estimate with some bandwidth $h$,

$\hat{\varepsilon}_j = Y_j - \hat{m}_{i,h}\big(X_i^{(j)}\big).$

These residuals are then transformed by

$\tilde{\varepsilon}_j = \dfrac{\hat{\varepsilon}_j - \bar{\varepsilon}}{\hat{\sigma}\big(X_i^{(j)}\big)},$

where $\bar{\varepsilon}$ is the arithmetic mean of the residuals and $\hat{\sigma}\big(X_i^{(j)}\big)$ is the conditional standard deviation of $Y$ given $X_i^{(j)}$. That is, we normalize the residuals by the conditional standard deviation at each point of the sample. This overcomes any issue due to heteroscedasticity in the sample. Therefore, the quantities $\tilde{\varepsilon}_j$ are pure noise random variables.

Denote by $\varepsilon_1^{*}, \ldots, \varepsilon_n^{*}$ a bootstrap sample taken from $\tilde{\varepsilon}_1, \ldots, \tilde{\varepsilon}_n$. The bootstrap sample takes $n$ draws with replacement from the empirical distribution of the standardized residuals. Then, reconstruct the response variable by defining
(10) $Y_j^{*} = \hat{m}_{i,h}\big(X_i^{(j)}\big) + \hat{\sigma}\big(X_i^{(j)}\big)\, \varepsilon_j^{*}$

as a bootstrap sample of $Y$. Here, we take a base mean function $\hat{m}_{i,h}$ and recreate the noise multiple times. To reconstruct another output sample of $Y$ we rescale this noise with the conditional standard deviation at each point $X_i^{(j)}$. This new sample of $Y$ depends on the index $i$, because the errors were taken from the residuals between $Y$ and its regression on $X_i$. Thus, for each replication we take a sample with replacement from the standardized residuals. For each bootstrap sample, we estimate the regression curve $\hat{m}_{i,h}^{*}$.
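The steps above (fit, residuals, resample, rebuild, refit) can be sketched as follows. For simplicity this version assumes homoscedastic noise and skips the conditional standard deviation rescaling used in the paper; the toy model, bandwidth, seed and number of replications are illustrative assumptions:

```python
import numpy as np

def nw_fit(x_train, y_train, x_eval, h):
    # Nadaraya-Watson regression with an Epanechnikov kernel
    u = (x_eval[:, None] - x_train[None, :]) / h
    K = np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)
    return (K @ y_train) / K.sum(axis=1)

rng = np.random.default_rng(4)
n = 400
x = np.sort(rng.uniform(0, 1, n))
y = np.cos(2*np.pi*x) + 0.2*rng.standard_normal(n)

h = 0.1
m_hat = nw_fit(x, y, x, h)
resid = y - m_hat
resid = resid - resid.mean()              # centre the residuals

# Homoscedastic residual bootstrap: the x's are kept fixed,
# only the noise is redrawn with replacement.
B = 100
curves = np.empty((B, n))
for b in range(B):
    y_star = m_hat + rng.choice(resid, size=n, replace=True)
    curves[b] = nw_fit(x, y_star, x, h)

mean_curve = curves.mean(axis=0)
print(round(float(np.mean((mean_curve - m_hat)**2)), 4))
```

Averaging the replicated fits washes out the resampled noise, so the bootstrap mean curve stays close to the original fit while the individual curves show the extra variability.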
Notice that the regression is performed conditionally on the design sequence: the $X_i^{(j)}$'s are not resampled, only the residuals. As an example, for the g-Sobol model explained in Section 5.1 we took the first input against the output. Figure 1 presents 100 curves generated by the bootstrap procedure together with their mean. Notice that each bootstrap curve presents more variance. However, this behavior captures the irregularities in the model and produces a better fit of the data.
Having obtained the regression curves, we can now compare the mean squared distance between the observed output and the bootstrap mean curve. Thus, an improvement to the least-squares error presented in Equation (9) is

$LS^{*}(h) = \dfrac{1}{n} \sum_{j=1}^{n} \Big( Y_j - \bar{m}_{i,h}^{*}\big(X_i^{(j)}\big) \Big)^2, \qquad \bar{m}_{i,h}^{*} = \dfrac{1}{B} \sum_{b=1}^{B} \hat{m}_{i,h}^{*(b)}.$

We call it the Bootstrap Least-Squares criterion. The second term in the last expression is the mean curve generated from the $B$ bootstrap curves. The function $LS^{*}$ reaches its smallest value at

$\hat{h}^{*} = \arg\min_h LS^{*}(h).$

Once the bandwidth is estimated, it only remains to re-estimate the Sobol index by plugging $\hat{h}^{*}$ into the estimator (8).
The procedure captures the different irregularities in the data without requiring an explicit functional form of the model. It summarizes those irregularities in a mean curve and creates a corrected Sobol index for each variable.
5 Numerical illustrations
5.1 Simulation study
Simulations were performed to assess the quality of the Sobol index estimator using the classic cross-validation and bootstrap procedures. In all the simulations we take the sample size $n$ equal to 100, 200 and 300. We repeated the experiment 100 times, selecting different samples in each iteration. In the bootstrap case, 100 draws were taken in each iteration. The inputs are uniform random variables specified in each configuration. For all simulations, the algorithm executed the nonparametric regression with second- and fourth-order Epanechnikov kernels. These kernels are defined by $K_2(u) = \frac{3}{4}(1 - u^2)$ and $K_4(u) = \frac{15}{32}(3 - 10u^2 + 7u^4)$ for $|u| \leq 1$. The purpose of including fourth-order kernels is to reduce the bias, giving a smoother structure to the model (for further details, see Tsybakov2009). For a detailed explanation of higher-order kernels see Hansen2005.
The software used was R (RCoreTeam2017), along with the package np (Hayfield2008) for all the nonparametric estimators, and the routine optimize to minimize the criterion. The setting considered is called g-Sobol and is defined by

(11) $Y = \displaystyle\prod_{i=1}^{p} \dfrac{|4 X_i - 2| + a_i}{1 + a_i}, \qquad X_i \sim \mathcal{U}[0, 1],$

where the $a_i$'s are positive parameters. The g-Sobol is a strongly nonlinear and non-monotonic function. As discussed by Saltelli2008, this model has exact first-order Sobol indices

$S_i = \dfrac{V_i}{\prod_{j=1}^{p} (1 + V_j) - 1}, \qquad V_i = \dfrac{1}{3 (1 + a_i)^2}.$

For each input, the lower the value of $a_i$, the higher the relevance of $X_i$ in the model. The simulations use four fixed parameters $a_1, \ldots, a_4$ with their corresponding exact Sobol indices $S_1, \ldots, S_4$.
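For reference, the closed-form indices of the g-Sobol function can be computed and cross-checked by simulation. The parameter vector `a` below is an illustrative choice, not the configuration used in the paper:

```python
import numpy as np

def g_sobol(X, a):
    # g-Sobol function: product over i of (|4*X_i - 2| + a_i) / (1 + a_i)
    return np.prod((np.abs(4*X - 2) + a) / (1 + a), axis=1)

def exact_first_order_indices(a):
    """Closed-form first-order Sobol indices of the g-Sobol function."""
    v = 1.0 / (3.0 * (1.0 + np.asarray(a))**2)   # partial variances V_i
    total = np.prod(1.0 + v) - 1.0               # total variance
    return v / total

a = np.array([0.0, 1.0, 4.5, 9.0])   # illustrative parameters, not the paper's
S = exact_first_order_indices(a)
print(np.round(S, 3))

# cross-check the total variance by Monte Carlo
rng = np.random.default_rng(5)
X = rng.uniform(0, 1, size=(200_000, 4))
print(round(float(g_sobol(X, a).var()), 3))
```

With this choice the first input dominates ($S_1 \approx 0.72$), which matches the rule that smaller $a_i$ means a more influential input.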
To compare our method, we estimated in parallel the following approaches for Sobol indices: B-spline smoothing (Ratto2010), and the schemes by Sobol (sobol1993sensitivity), Saltelli (saltelli2002making), Mauntz-Kucherenko (Sobol2007), Jansen-Sobol (Jansen1999), Martinez and Touati (Baudin2016, Touati2016), Janon-Monod (Makowski2006), Mara (AlexMara2008) and Owen (Owen2013). These methods do not represent an exhaustive list, but they give a wide basis of comparison between estimators. All the estimated methods (except the B-splines) need prior knowledge of the link function in equation (11) between the inputs and the output.
Figures 2 and 3 present the estimated Sobol indices for the g-Sobol model. The first figure presents the indices using all the algorithms described in the last paragraph. The second examines further the bandwidth adjustment between the classic cross-validation and the bootstrap methods.
When the bandwidth is selected with the classic cross-validation procedure, the bias with the second-order kernel is greater than with the fourth-order kernel. This behavior is not surprising. In the latter case, the regression assumes that the underlying curve has at least four finite derivatives, and the bias adjusts better due to this smoothness.
The proposed bootstrap algorithm reduces the bias in all the cases, giving an approximate value near the real one. It overfits the regression curve by selecting a smaller bandwidth. This overfitting causes the variance to increase and the Sobol index to grow. Notice that in most cases the procedure corrects the structural bias. However, for a fourth-order kernel, the bias is already controlled with the classic cross-validation procedure. Therefore, the proposed method raises the values, causing an overestimation of the Sobol index.
For all variables, the nonparametric methods attain the theoretical values when compared with the other methodologies.
Table 1 presents the median estimated bandwidths for the g-Sobol model. The algorithm calculated the bandwidths using the cross-validation and bootstrap methods with the second-order Epanechnikov kernel. The results show the overfitting explained before, due to the choice of smaller bandwidths by the bootstrap algorithm. Here, there were runs that did not converge to an optimal solution, where the bandwidth tended to an extreme value. The phenomenon occurs because the regression curve is almost flat, so its variance stays close to zero. For those examples, the nonparametric curve estimator represents only the mean of the data.
Table 1: Median estimated bandwidths for the g-Sobol model.

                                    Sample size
Method            Bandwidth    100      200      300
Bootstrap         h_1          0.005    0.004    0.004
                  h_2          0.012    0.009    0.008
                  h_3          0.037    0.019    0.020
                  h_4          0.130    0.066    0.032
Cross-validation  h_1          0.051    0.045    0.039
                  h_2          0.096    0.079    0.065
                  h_3          0.187    0.121    0.129
                  h_4          0.373    0.226    0.168
5.2 Application
A classic academic test case for sensitivity analysis is the dyke model. This model simplifies the 1D hydrodynamical Saint-Venant equations under the assumptions of uniform and constant flowrate and a large rectangular section. The following equations recreate the river overflow and its cost (in millions of euros),
(12)  
(13) 
The first variable measures the maximal annual overflow of the river (in meters) and the second is the associated cost (in million euros) of the dyke. Table 2 shows the inputs of the model. The indicator function is equal to 1 when its condition holds and 0 otherwise. The dyke height in Equation (12) is a design parameter.

In Equation (13), the first term is a cost of 1 million euros due to a flooding, the second term corresponds to the cost of the dyke maintenance, and the third term is the construction cost of the dyke. The latter cost is constant for a dyke height of less than 8 m and grows with the dyke height otherwise.
Input   Description                  Unit    Probability distribution
Q       Maximal annual flowrate      m³/s    truncated on [500, 3000]
Ks      Strickler coefficient        —       truncated
Zv      River downstream level       m
Zm      River upstream level         m
Hd      Dyke height                  m
Cb      Bank level                   m
L       Length of the river stretch  m
B       River width                  m
For a complete discussion of the model, its parameters and their meaning, the reader can consult Iooss2015, Rocquigny2006 and their references.
We generated 1000 observations for each input according to Table 2 and computed their respective output values. Figure 4 shows the results of the simulations for the two outputs of the dyke model using the cross-validation and bootstrap procedures.
For both outputs, the same five variables dominate in order of importance. The rest of the variables have values near zero and provide an insignificant impact on the output.
As reported in Table 3, we compared the values from our procedure against those reported by Iooss2015. The values of the Sobol indices detect the influence of each variable in agreement with the classic Monte Carlo and metamodel procedures. The exception is the fourth variable in Table 3, which in our case decreased to values near 5.5% against the reported values of around 13%.
Table 3: First-order Sobol indices (in %) for the dyke model.

Method                Indices (in %)
Monte Carlo (Iooss)   35.5   15.9   18.3   12.5   3.8
Metamodel (Iooss)     38.9   16.8   18.8   13.9   3.7
Bootstrap             40.5   15.5   18.1    5.7   2.9
Cross-validation      37.2   14.6   17.2    5.5   2.8
6 Conclusions
This paper presented an alternative way to estimate first-order Sobol indices for the general model $Y = f(X_1, \ldots, X_p)$. These indices are calculated using the formula $S_i = \mathrm{Var}(\mathbb{E}[Y \mid X_i]) / \mathrm{Var}(Y)$. The method builds the regression curve $\mathbb{E}[Y \mid X_i]$ by kernel nonparametric regression. We proved that the estimator attains a parametric rate of convergence under some regularity conditions. Otherwise, the rate of convergence depends on the number of finite derivatives of the density function.
The least-squares cross-validation procedure is a classic way to find the bandwidth. However, the literature reports cases where there exists a finite-sample bias in the model. One way to correct it is to increase the number of samples. Instead, this work proposes a bootstrap algorithm to correct the bias, by first estimating the normalized residuals of the model and then recreating a bootstrap version of the response variable. With these new data, the algorithm estimates an empirical version of the least-squares error. We call it the Bootstrap Least-Squares criterion. This function attains its minimum at the selected bandwidth.
The proposed algorithm overfits the regression curve, because it chooses a smaller bandwidth to increase the variability of the curve. The procedure approximates the first-order Sobol indices well, but it can overestimate them when using fourth-order kernels.
The criterion was minimized using a Brent-type routine, implemented in the R function optimize. Due to the complexity of the target function, one future improvement to the algorithm is to use a global minimizer, such as simulated annealing, to compare the results. In this scenario, we would expect a better choice of the bandwidths and a better adjustment of the Sobol indices.
The method showed a consistent approximation of the Sobol indices using only the observational data available. In all cases, the nonparametric estimators using cross-validation and bootstrap identified the influential variables in the g-Sobol and dyke models.
We considered only the indices with simple interactions between one variable and the output. The higher-order indices and total effects remain for further study. In that case, we would have to estimate a multivariate nonparametric surface for multiple variables and then approximate the surface variability over some range. The latter step will be an interesting topic of study due to its numerical complexities.
7 Proofs
Proof of Theorem 1.
To simplify the notation, we introduce shorthand for the quantities involved in the decomposition below. Notice that
(14) 
The third term in equation (14) can be bounded using the Cauchy-Schwarz inequality in the following way,
(15) 
Therefore, we have reduced the problem to bounding only the first and second terms of equation (14). The first term in equation (14) is bounded straightforwardly by applying the delta method. For the second term, adding and subtracting an intermediate term, we can decompose it into
We can apply the same argument as in Equation (15) to bound the third term of the last equation. Therefore, we only have to bound the terms
(16)  
and  
(17) 
Let us start with equation (16). Here we can write
Given that the kernel function and this quantity are bounded, we can bound the first term on the left-hand side and apply again the Cauchy-Schwarz inequality. Then, there exists a real number that controls this term. Employing once again the delta method, we can control the value
For the term in equation (17), we apply Theorem 1 of loubes2013rates. We use the same reasoning as in that proof: if $\beta \geq 2$, choosing the bandwidth accordingly we get

Otherwise, taking the bandwidth according to $\beta$, the upper bound turns into
This proves the theorem. ∎