Nonparametric estimation of the first order Sobol indices with bootstrap bandwidth

03/08/2018
by Maikol Solís et al.

Suppose that $Y = m(X_1, \ldots, X_p)$, where $(X_1, \ldots, X_p)$ are inputs, $Y$ is an output, and $m$ is an unknown function. The Sobol indices gauge the sensitivity of each $X_i$ against $Y$ by estimating the variability of the regression curve between them. In this paper, we estimate them with a kernel-based methodology. Under some regularity conditions, the mean squared error of the first-order Sobol index estimator has an asymptotically parametric behavior, with a bandwidth of order $n^{-1/4}$, where $n$ is the sample size. For finite samples, the cross-validation method produces a structural bias. To correct this, we propose a bootstrap procedure which reconstructs the model residuals and re-estimates the nonparametric regression curve. With the new curve, the procedure corrects the bias in the Sobol index. We present two simulated numerical examples with complex functions to assess the performance of our procedure.


1 Introduction

Researchers, technicians and policy makers often support their decisions on complex models, which they have to process, analyze and interpret assertively with the data available. Usually, those models include many variables and interactions. One way to overcome these issues is to select the most relevant variables of the system; in this way, we gain insight into the model and discover its main characteristics. Still, we have to produce a stable approximation of the model, avoiding large variations in the output caused by small perturbations of the inputs. The analyst, however, has to confirm, check and improve the model.

The typical situation assumes a set of input variables $(X_1, \ldots, X_p)$ producing an output $Y$ related by the model

$$Y = m(X_1, \ldots, X_p). \qquad (1)$$

The function $m$ could be unknown and complex. In some cases, a computer code can gauge it (e.g., oakley2004probabilistic). Also, we can replace the original model by a low-fidelity approximation called a meta-model (see box1987empirical). The problems related to this formulation extend to engineering, biology, oceanography and other fields.

Given the set of inputs in the model (1), we can rank them according to different criteria. Some examples are: the screening method (cullen1999probabilistic, campolongo2011screening), automatic differentiation (rall1980applications, carmichael1997sensitivity), regression analysis (draper1981applied, devore1996statistics) or the response surface method (montgomery1995response, goos2002optimal).

The work of sobol1993sensitivity, inspired by an ANOVA (or Hoeffding) decomposition, splits the variance of the model into partial variances. They are generated by the conditional expectations of $Y$ given each input $X_i$, for $i = 1, \ldots, p$. The partial variances represent the uncertainty created by each input or its interactions. Dividing each partial variance by the model total variance, we get a normalized index of importance. We call first-order Sobol indices the quantities

$$S_i = \frac{\operatorname{Var}(\mathbb{E}[Y \mid X_i])}{\operatorname{Var}(Y)}, \qquad i = 1, \ldots, p.$$

Notice that $\mathbb{E}[Y \mid X_i]$ is the best approximation of $Y$ given $X_i$. Thus, if the variance of $\mathbb{E}[Y \mid X_i]$ is large, it means a large influence of $X_i$ on $Y$.
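As a toy illustration (ours, not from the paper): for a linear model $Y = aX_1 + bX_2$ with independent inputs, $\mathbb{E}[Y \mid X_1] = aX_1 + b\,\mathbb{E}[X_2]$, so

$$S_1 = \frac{a^2\operatorname{Var}(X_1)}{a^2\operatorname{Var}(X_1) + b^2\operatorname{Var}(X_2)},$$

which recovers the usual share of variance explained by $X_1$.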

These indices are used in theoretical and applied techniques to determine the most relevant and sensitive inputs of the model. We can establish indices that measure the interactions between variables or the total effect of a certain input on the whole model. We refer the reader to saltelli2000sensitivity for the exact computation of higher-order Sobol indices.

The main task with the Sobol indices lies in their computation. Monte-Carlo or quasi Monte-Carlo methods propose sampling the model (on the order of hundreds or thousands of evaluations) to get an approximation of its behavior. Two examples are the Fourier Amplitude Sensitivity Test (FAST) and the Sobol Pick-Freeze (SPF) scheme. cukier1973study and cukier1978nonlinear created the FAST method, which transforms the partial variances into Fourier expansions. This method allows the aggregated and simple estimation of Sobol indices in a scaled way. The SPF scheme regresses the model output against a pick-frozen replication. The principle is to create a replication holding the variable of interest (frozen) and re-sampling the other variables (picked). We refer the reader to sobol1993sensitivity, sobol2001global and janon2013asymptotic. Other methods include Ishigami1990Importance, which improved the classic Monte-Carlo procedure by re-sampling the inputs and reducing the whole process to only one Monte-Carlo draw. The paper of saltelli2002making proposed an algorithm to estimate higher-order indices with minimal computational effort.
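For intuition, the pick-freeze principle fits in a few lines of R. This is a toy version of the first-order estimator along the lines of janon2013asymptotic; the details vary across the schemes cited above.

# Toy pick-freeze estimate of S_1 for a function f of p inputs
pick_freeze <- function(f, n, p) {
  X  <- matrix(runif(n * p), ncol = p)   # original sample
  X2 <- matrix(runif(n * p), ncol = p)   # replicated sample
  X2[, 1] <- X[, 1]                      # freeze X_1, re-pick the others
  Y  <- apply(X, 1, f)
  Y2 <- apply(X2, 1, f)
  (mean(Y * Y2) - mean(Y)^2) / (mean(Y^2) - mean(Y)^2)
}

pick_freeze(function(x) sum(x^2), n = 1e4, p = 3)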

The Monte-Carlo methods suffer from high computational stress in their implementation. For example, the FAST method requires estimating a set of suitable transformation functions and integer angular frequencies for each variable. The SPF scheme creates a new copy of the variable in each iteration. For complex and high-dimensional models, those techniques become expensive in computational time.

One limitation is that, in most cases, those methods require a complete identification of the link function $m$ between the inputs and the outputs. That is, the analyst must have available the exact link function or some alternative algorithm which produces the outcome. Suppose, for example, that a researcher has only a dataset containing explanatory and response variables. One natural task is to find the most influential explanatory variables.

This article proposes an alternative way to compute the Sobol indices. In particular, we take the ideas of zhu1996asymptotics and apply a nonparametric Nadaraya-Watson estimator to estimate $\mathbb{E}[Y \mid X_i]$. With this estimator, we avoid the stochastic techniques and use the information in the data to fit the nonparametric model. This work shows that if the joint distribution of $(X_i, Y)$ is twice differentiable, the nonparametric estimator of $S_i$ has a parametric rate of convergence. Otherwise, we get a nonparametric rate of convergence depending on the regularity of the density.

The article follows this framework: in Section 2 we propose the nonparametric estimator for the first-order Sobol indices. We gather the hypotheses, assumptions and main results in Section 3. The method to calibrate the optimal bandwidth is in Section 4. We illustrate our method with two numerical examples in Section 5. In Section 6, we present the conclusions and discussion.

2 Methodology

In our context we suppose that $(X_{1k}, \ldots, X_{pk})$ and $Y_k$, $k = 1, \ldots, n$, are independent and identically distributed observations from the random vector $(X_1, \ldots, X_p)$ and the output $Y$. We denote by $f_{X_i, Y}$ the joint density of the couple $(X_i, Y)$, and let $f_{X_i}$ be the marginal density function of $X_i$, for $i = 1, \ldots, p$.

Recall the Sobol indices definition presented in the introduction,

$$S_i = \frac{\operatorname{Var}(\mathbb{E}[Y \mid X_i])}{\operatorname{Var}(Y)} = \frac{\mathbb{E}\big[\mathbb{E}[Y \mid X_i]^2\big] - \mathbb{E}[Y]^2}{\operatorname{Var}(Y)}. \qquad (2)$$

We have expanded the variance in the numerator to simplify the presentation. Notice we can estimate the terms $\mathbb{E}[Y]$ and $\operatorname{Var}(Y)$ in equation (2) by their empirical counterparts

$$\widehat{\mathbb{E}}[Y] = \frac{1}{n}\sum_{k=1}^{n} Y_k \qquad (3)$$

and

$$\widehat{\operatorname{Var}}(Y) = \frac{1}{n}\sum_{k=1}^{n} \big(Y_k - \widehat{\mathbb{E}}[Y]\big)^2. \qquad (4)$$

The term $\mathbb{E}\big[\mathbb{E}[Y \mid X_i]^2\big]$ requires more effort to estimate. For any $x$ in the support of $X_i$ we introduce the following notation,

$$\mathbb{E}[Y \mid X_i = x] = \frac{\varphi_i(x)}{f_{X_i}(x)},$$

where

$$\varphi_i(x) = \int y\, f_{X_i, Y}(x, y)\, dy.$$

We will use a modified version of the nonparametric estimator developed in loubes2013rates, a paper which estimates the conditional expectation covariance for a dimension reduction problem.

We will estimate the functions $\varphi_i$ and $f_{X_i}$ by their nonparametric estimators,

$$\hat{\varphi}_i(x) = \frac{1}{nh}\sum_{k=1}^{n} Y_k\, K\!\left(\frac{X_{ik} - x}{h}\right), \qquad (5)$$

$$\hat{f}_{X_i}(x) = \frac{1}{nh}\sum_{k=1}^{n} K\!\left(\frac{X_{ik} - x}{h}\right), \qquad (6)$$

where $K$ is a kernel function and $h$ a bandwidth. The nonparametric estimator for $\mathbb{E}\big[\mathbb{E}[Y \mid X_i]^2\big]$ is

$$\frac{1}{n}\sum_{k=1}^{n} \left(\frac{\hat{\varphi}_i(X_{ik})}{\hat{f}_{X_i}(X_{ik})}\right)^{2}. \qquad (7)$$

Thus, we can gather the estimators (3), (4) and (7) and define the nonparametric estimator for $S_i$ as

$$\hat{S}_i = \frac{\displaystyle\frac{1}{n}\sum_{k=1}^{n} \left(\frac{\hat{\varphi}_i(X_{ik})}{\hat{f}_{X_i}(X_{ik})}\right)^{2} - \widehat{\mathbb{E}}[Y]^2}{\widehat{\operatorname{Var}}(Y)}. \qquad (8)$$

The estimator (8) provides a direct way to estimate the first-order Sobol index $S_i$. We want to control the squared risk over some regular class of functions for the joint density $f_{X_i, Y}$. Thus, we can choose a bandwidth according to the regularity to get a convergent estimator. Our aim is to find sufficient conditions to have a parametric rate of convergence for $\hat{S}_i$.
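As an illustration only (the function and variable names below are ours, not from any package), a minimal R sketch of the estimator (8), using a second-order Epanechnikov kernel and a fixed bandwidth, could read:

# Second-order Epanechnikov kernel, supported on [-1, 1]
K2 <- function(u) 0.75 * (1 - u^2) * (abs(u) <= 1)

# Nadaraya-Watson estimate of E[Y | X_i = x] with bandwidth h
nw <- function(x, Xs, Ys, h) {
  w <- K2((Xs - x) / h)
  sum(w * Ys) / sum(w)   # assumes at least one observation within h of x
}

# First-order Sobol index estimator, equation (8)
sobol_nw <- function(X, Y, h) {
  m_hat <- sapply(X, nw, Xs = X, Ys = Y, h = h)  # curve at the sample points
  (mean(m_hat^2) - mean(Y)^2) / mean((Y - mean(Y))^2)
}

# Example with n = 300 and the theoretical bandwidth n^(-1/4)
set.seed(1)
X <- runif(300)
Y <- sin(2 * pi * X) + rnorm(300, sd = 0.2)
sobol_nw(X, Y, h = 300^(-1/4))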

3 Main result

We denote by $C$, $C_1$, $C_2$, and so on, constants (independent of $n$) that may take different values throughout the paper.

Let $\beta$ be a positive real value and define $\lfloor\beta\rfloor$ as the largest integer such that $\lfloor\beta\rfloor < \beta$. We say that a continuous kernel function $K$ is of order $\lfloor\beta\rfloor$ if it satisfies:

  • the support of $K$ is the interval $[-1, 1]$;

  • $K$ is symmetric around 0;

  • $\int_{-1}^{1} K(u)\,du = 1$ and $\int_{-1}^{1} u^{j} K(u)\,du = 0$ for $j = 1, \ldots, \lfloor\beta\rfloor$.

To guarantee parametric consistency in our model, we need to impose some regularity conditions. In our case, we define the Hölder class of smooth functions as follows.

Definition 1.

Denote by $\Sigma(\beta, L)$ the Hölder class of density functions with smoothness $\beta > 0$ and radius $L > 0$, defined as the set of $\lfloor\beta\rfloor$ times differentiable functions $f : T \to \mathbb{R}$, where $T$ is an interval of $\mathbb{R}$, whose derivative $f^{(\lfloor\beta\rfloor)}$ satisfies

$$\big|f^{(\lfloor\beta\rfloor)}(x) - f^{(\lfloor\beta\rfloor)}(x')\big| \le L\,\lvert x - x'\rvert^{\beta - \lfloor\beta\rfloor} \quad \text{for all } x, x' \in T.$$

The following technical assumption is important to establish the class of functions where we will find the upper bound of the risk.

Assumption 1.

For $y$ fixed, the function $x \mapsto f_{X_i, Y}(x, y)$ belongs to a Hölder class of regularity $\beta$ and radius $L$, i.e. $f_{X_i, Y}(\cdot, y) \in \Sigma(\beta, L)$.

Also, the function $f_{X_i}$, for $i = 1, \ldots, p$, belongs to a Hölder class of smoothness $\beta$ and radius $L$, i.e. $f_{X_i} \in \Sigma(\beta, L)$.

Remark 1.

Notice that the function $\varphi_i$ appearing in (5) also belongs to a Hölder class $\Sigma(\beta, L')$ for some radius $L' > 0$. Recall that

$$\varphi_i(x) = \int y\, f_{X_i, Y}(x, y)\, dy.$$

Let $x$ and $x'$ be fixed. If $f_{X_i, Y}(\cdot, y) \in \Sigma(\beta, L)$ for every $y$, then a direct calculation proves our assertion.

Denote by $\mathcal{F}(\beta, L)$ the class of functions that fulfill Assumption 1. The following theorem establishes the rates of convergence for $\hat{S}_i$ depending on the parameter $\beta$.

Theorem 1.

Assume that Assumption 1 holds. The upper bound of the risk of the estimator $\hat{S}_i$ defined in (8) over the functional class $\mathcal{F}(\beta, L)$ satisfies:

  • if $\beta \ge 2$, choosing $h \asymp n^{-1/4}$, then

$$\sup_{\mathcal{F}(\beta, L)} \mathbb{E}\big[(\hat{S}_i - S_i)^2\big] \le \frac{C}{n};$$

  • if $\beta < 2$, choosing $h \asymp n^{-1/(\beta+2)}$, then

$$\sup_{\mathcal{F}(\beta, L)} \mathbb{E}\big[(\hat{S}_i - S_i)^2\big] \le C\, n^{-2\beta/(\beta+2)}.$$

The proof of Theorem 1 is postponed to Section 7.

Theorem 1 presents an elbow effect in the rates of convergence. This is a typical behavior in studies on functional estimation, see for instance baraud2003adaptive or laurent2005adaptive. The regularity of the joint density function $f_{X_i, Y}$ defines the rate of convergence for the mean squared risk of $\hat{S}_i$. That is, we can get a parametric rate paying a price in the regularity of $f_{X_i, Y}$. When $\beta < 2$, the rate turns into a nonparametric one which depends on the parameter $\beta$.

In the regular case, when $\beta \ge 2$, we avoid any adaptivity problem concerning the choice of the bandwidth $h$. We can establish a parametric rate of convergence for $\hat{S}_i$ taking a bandwidth of order $n^{-1/4}$. Otherwise, it is necessary to know the regularity of our model to choose the bandwidth.

4 Choice of bandwidths for Sobol indices

Even if Theorem 1 gives the asymptotically optimal rate, the constant $C$ still depends on unknown quantities like the true density function $f_{X_i, Y}$. Thus, the choice of the bandwidth remains the crucial step to estimate $S_i$ accurately. The main issue is to estimate the regression curve $x \mapsto \mathbb{E}[Y \mid X_i = x]$.

We can fit this curve with the available data by minimizing the least-squares criterion

$$LS(h) = \frac{1}{n}\sum_{k=1}^{n} \big(Y_k - \hat{m}_h(X_{ik})\big)^2. \qquad (9)$$

As before, we estimate $\mathbb{E}[Y \mid X_i = x]$ by $\hat{m}_h(x) = \hat{\varphi}_i(x)/\hat{f}_{X_i}(x)$, where $\hat{\varphi}_i$ and $\hat{f}_{X_i}$ were defined in Equations (5) and (6). But there is a problem, because this method uses the data twice, to calibrate and to verify the model. The cross-validation method estimates the prediction error by removing the observations one by one and recalculating the model with the remaining data. The resulting estimator is called the leave-one-out estimator, with expression

$$\hat{m}_h^{(-k)}(x) = \frac{\sum_{j \ne k} Y_j\, K\big((X_{ij} - x)/h\big)}{\sum_{j \ne k} K\big((X_{ij} - x)/h\big)}.$$

Afterwards, we can build a new version of the least-squares error

$$CV(h) = \frac{1}{n}\sum_{k=1}^{n} \big(Y_k - \hat{m}_h^{(-k)}(X_{ik})\big)^2$$

and find the optimal bandwidth $\hat{h}_{CV} = \arg\min_h CV(h)$. Finally, we estimate $\hat{m}_{\hat{h}_{CV}}$. The book of hardle2004nonparametric details the procedure.
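Continuing the R sketch from Section 2 (ours, for illustration; the np package used in Section 5 automates this selection), a naive leave-one-out computation of $CV(h)$ could be:

# Leave-one-out cross-validation criterion CV(h)
cv_error <- function(h, X, Y) {
  pred <- vapply(seq_along(Y), function(k) {
    nw(X[k], X[-k], Y[-k], h)   # refit without observation k
  }, numeric(1))
  mean((Y - pred)^2)            # assumes h leaves every point a neighbor
}

# Brent-type minimization over a bandwidth range
h_cv <- optimize(cv_error, interval = c(0.05, 1), X = X, Y = Y)$minimum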

However, even if cross-validation is asymptotically unbiased, these estimators have a relatively large finite-sample bias. The works of Faraway1990, Romano1988 and Padgett1986 established the same behavior studying nonparametric estimators of the density, the quantiles and the mode, respectively. This problem arises in nonparametric-based models, as exemplified by Hardle1993. One solution is to remove the bias part of the estimate by bootstrapping, following the ideas in Racine2001.

The procedure starts with the residuals of the variable $Y$ with respect to its nonparametric estimate with some pilot bandwidth $h_0$,

$$\hat{\varepsilon}_k = Y_k - \hat{m}_{h_0}(X_{ik}), \qquad k = 1, \ldots, n.$$

These residuals are then transformed by

$$\tilde{\varepsilon}_k = \frac{\hat{\varepsilon}_k - \bar{\varepsilon}}{\hat{\sigma}(X_{ik})},$$

where $\bar{\varepsilon}$ is the arithmetic mean of $\hat{\varepsilon}_1, \ldots, \hat{\varepsilon}_n$ and $\hat{\sigma}(X_{ik})$ is the conditional standard deviation of $Y$ given $X_i = X_{ik}$. That is, we normalize the residuals by the conditional variance at each point of the sample. This overcomes any issue due to heteroscedasticity in the sample. Therefore, the quantities $\tilde{\varepsilon}_k$ are pure noise random variables.

Denote by $\varepsilon^{*} = (\varepsilon_1^{*}, \ldots, \varepsilon_n^{*})$ a bootstrap sample taken from $(\tilde{\varepsilon}_1, \ldots, \tilde{\varepsilon}_n)$. The bootstrap sample takes $n$ draws with replacement from the empirical distribution of the normalized residuals. Then, we reconstruct the response variable defining

$$Y_k^{*} = \hat{m}_{h_0}(X_{ik}) + \hat{\sigma}(X_{ik})\,\varepsilon_k^{*} \qquad (10)$$

as a bootstrap sample of $Y$. Here, we take a base mean function and re-create the noise multiple times. To reconstruct another output sample of $Y$, we re-scale this noise with the conditional standard deviation at each point $X_{ik}$. This new sample of $Y$ depends on the index $i$, because the errors were taken from the residuals between $Y$ and $\hat{m}_{h_0}$. Thus, for $b = 1, \ldots, B$ we take a sample with replacement from $(\tilde{\varepsilon}_1, \ldots, \tilde{\varepsilon}_n)$. For each sample $b$, we estimate the regression curve

$$\hat{m}_h^{*b}(x) = \frac{\sum_{k=1}^{n} Y_k^{*b}\, K\big((X_{ik} - x)/h\big)}{\sum_{k=1}^{n} K\big((X_{ik} - x)/h\big)}.$$

Notice that the regression is performed conditionally on the design sequence: the $X_{ik}$'s are not resampled, only the residuals. As an example, for the g-Sobol model explained in Section 5.1, we took the first input $X_1$ against the output $Y$. Figure 1 presents 100 curves generated by the bootstrap procedure and their mean. Notice that each bootstrap curve presents more variance. However, this behavior captures the irregularities in the model and produces a better fit of the data.

Figure 1: Data points (black dots), bootstrap curves (gray lines) and mean curve (black solid line) from observations of the g-Sobol model (Equation (11)) for the first input $X_1$ against the output $Y$.

Having obtained the regression curves, we can now compare the mean squared distance between the observed output and the bootstrap mean curve. Thus, an improvement to the least-squares error presented in Equation (9) is

$$BLS(h) = \frac{1}{n}\sum_{k=1}^{n} \left(Y_k - \frac{1}{B}\sum_{b=1}^{B} \hat{m}_h^{*b}(X_{ik})\right)^{2}.$$

We call it the Bootstrap Least-Squares criterion. The second term in the last expression is the mean curve generated from the bootstrap curves. The function reaches its smallest value at

$$\hat{h}_{BLS} = \arg\min_h BLS(h).$$

With the bandwidth estimated, it only remains to re-estimate the Sobol index with the new bootstrap structure, plugging $\hat{h}_{BLS}$ into the estimator (8).

The procedure captures the different irregularities in the data, without having an explicit functional form of the model. It summarizes those irregularities in a mean curve and creates a corrected Sobol index for each variable.
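A condensed R sketch of the whole procedure under our reading of this section; the pilot bandwidth h0, the constant conditional-scale estimate and B = 100 draws are simplifying assumptions of this sketch, not the paper's exact choices.

# Bootstrap least-squares criterion BLS(h)
bls_error <- function(h, X, Y, h0, sigma_hat, B = 100) {
  m0  <- sapply(X, nw, Xs = X, Ys = Y, h = h0)   # pilot regression curve
  res <- (Y - m0) / sigma_hat                    # normalized residuals
  res <- res - mean(res)                         # centered "pure noise"
  boot <- replicate(B, {
    Ystar <- m0 + sigma_hat * sample(res, replace = TRUE)  # equation (10)
    sapply(X, nw, Xs = X, Ys = Ystar, h = h)     # bootstrap curve
  })
  mean((Y - rowMeans(boot))^2)  # distance to the bootstrap mean curve
}
# (in practice one would fix the resampled indices across evaluations of h)

# Crude constant stand-in for the conditional standard deviation
sig   <- rep(sd(Y - sapply(X, nw, Xs = X, Ys = Y, h = h_cv)), length(Y))
h_bls <- optimize(bls_error, interval = c(0.05, 1),
                  X = X, Y = Y, h0 = h_cv, sigma_hat = sig)$minimum
sobol_nw(X, Y, h = h_bls)   # corrected first-order index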

5 Numerical illustrations

5.1 Simulation study

Simulations were performed to assess the quality of the Sobol index estimator using the classic cross-validation and the bootstrap procedures. In all the simulations we take sample sizes $n$ equal to 100, 200 and 300 for each case. We repeated the experiment 100 times, selecting different samples in each iteration. In the bootstrap case, 100 draws were taken in each iteration. The inputs are uniform random variables specified in each configuration. For all simulations, the algorithm executed the nonparametric regression with second- and fourth-order Epanechnikov kernels. These kernels are defined by $K_2(u) = \tfrac{3}{4}(1 - u^2)$ and $K_4(u) = \tfrac{15}{32}(1 - u^2)(3 - 7u^2)$, for $u \in [-1, 1]$ in both cases. The purpose of including fourth-order kernels is to reduce the bias, giving a smoother structure to the model (for further details, see Tsybakov2009). For a detailed explanation of higher-order kernels see Hansen2005.
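In R, the moment conditions that define the kernel order can be checked numerically (our own verification):

K4 <- function(u) (15 / 32) * (1 - u^2) * (3 - 7 * u^2) * (abs(u) <= 1)

integrate(K4, -1, 1)$value                        # 1: zeroth moment
integrate(function(u) u^2 * K4(u), -1, 1)$value   # 0: vanishing second moment

The odd moments vanish by symmetry, so K4 is indeed of fourth order.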

The software used was R (RCoreTeam2017), along with the package np (Hayfield2008) for all the nonparametric estimators and the routine optimize to minimize the criterion $BLS$. The setting considered is called g-Sobol and is defined by

$$Y = \prod_{i=1}^{p} \frac{\lvert 4X_i - 2\rvert + a_i}{1 + a_i}, \qquad (11)$$

where the $a_i$'s are positive parameters and the inputs $X_i$ are i.i.d. uniform on $[0, 1]$. The g-Sobol is a strongly nonlinear and nonmonotonic function. As discussed by Saltelli2008, this model has exact first-order Sobol indices

$$S_i = \frac{V_i}{\prod_{j=1}^{p}(1 + V_j) - 1}, \qquad V_i = \frac{1}{3(1 + a_i)^2}.$$

For each $i$, the lower the value of $a_i$, the higher the relevance of $X_i$ in the model. The simulations fix one parameter $a_i$ per input; the corresponding theoretical Sobol indices appear as the dashed reference lines in Figures 2 and 3.
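A small R transcription of (11) and its exact indices; the parameter vector a below is an illustrative choice, not necessarily the one used in the paper's simulations.

# g-Sobol model, equation (11): rows of X are U(0,1) input vectors
g_sobol <- function(X, a) {
  apply(X, 1, function(x) prod((abs(4 * x - 2) + a) / (1 + a)))
}

# Exact first-order indices: S_i = V_i / (prod(1 + V) - 1)
g_sobol_indices <- function(a) {
  V <- 1 / (3 * (1 + a)^2)
  V / (prod(1 + V) - 1)
}

a <- c(0, 1, 4.5, 9)                         # illustrative parameters
X <- matrix(runif(300 * length(a)), ncol = length(a))
Y <- g_sobol(X, a)
round(g_sobol_indices(a), 3)                 # theoretical targets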

To compare our method, we estimated in parallel the following methods for Sobol indices: B-spline smoothing (Ratto2010), and the schemes by Sobol (sobol1993sensitivity), Saltelli (saltelli2002making), Mauntz-Kucherenko (Sobol2007), Jansen-Sobol (Jansen1999), Martinez and Touati (Baudin2016, Touati2016), Janon-Monod (Makowski2006), Mara (AlexMara2008) and Owen (Owen2013). These methods do not represent an exhaustive list, but they give a wide point of comparison between estimators. All of the estimated methods, except the B-splines, need prior knowledge of the link function in equation (11) between the inputs and the output.

Figures 2 and 3 present the estimated Sobol indices for the g-Sobol model. The first figure presents the indices using all the algorithms described in the last paragraph. The second figure examines further the bandwidth adjustment between the classic cross-validation and the bootstrap methods.

Measuring the bandwidth with the classic cross-validation procedure, the bias with the second-order kernel is greater than with the fourth-order kernel. This behavior is not surprising: in the latter case, the regression assumes that the underlying curve has at least four finite derivatives, and the bias adjusts better due to the smoothness.

The proposed bootstrap algorithm reduces the bias in all cases, giving an approximate value near the real one. It overfits the regression curve by selecting a smaller bandwidth. This over-fitting increases the variance and makes the Sobol index larger. Notice that in most cases the procedure corrects the structural bias. However, for a fourth-order kernel, the bias is already controlled with the classic cross-validation procedure. Therefore, the proposed method raises the values, causing an overestimation of the Sobol index.

For all variables, the nonparametric methods reach the theoretical values when compared with the other methodologies.

Figure 2: Estimated values from the 100 iterations for the g-Sobol model across different methodologies. The horizontal dashed lines represent the theoretical values for each Sobol index.
Figure 3: Estimated values from the 100 iterations for the g-Sobol model using the cross-validation and bootstrap procedures to estimate the bandwidth. Both methods were calculated using an Epanechnikov kernel of second and fourth order. The horizontal dashed lines represent the theoretical values for each Sobol index.

Table 1 presents the median estimated bandwidths for the g-Sobol model. The algorithm calculated the bandwidths using the cross-validation and the bootstrap methods with a second-order Epanechnikov kernel. The results show the over-fitting explained before, due to the choice of smaller bandwidths by the bootstrap algorithm. Here, there were runs that did not converge to an optimal solution and the bandwidth tended to its largest allowed value. The phenomenon is due to the regression curve being almost flat, causing its variance to stay near zero. For those examples, the nonparametric curve estimator represents only the mean of the data.

Method             Input   Sample size
                           100     200     300
Bootstrap          X_1     0.005   0.004   0.004
                   X_2     0.012   0.009   0.008
                   X_3     0.037   0.019   0.020
                   X_4     0.130   0.066   0.032
Cross-validation   X_1     0.051   0.045   0.039
                   X_2     0.096   0.079   0.065
                   X_3     0.187   0.121   0.129
                   X_4     0.373   0.226   0.168
Table 1: Median bandwidths estimated from the 100 iterations for the first four variables of the g-Sobol model, by method and sample size.

5.2 Application

One academic real-case model to test performance in sensitivity analysis is the dyke model. This model simplifies the 1D hydro-dynamical equations of Saint-Venant under the assumptions of uniform and constant flowrate and a large rectangular section. The following equations re-create the river overflow and its cost (in millions of euros),

$$S = Z_v + H - H_d - C_b, \qquad H = \left(\frac{Q}{K_s\, B\, \sqrt{(Z_m - Z_v)/L}}\right)^{0.6}, \qquad (12)$$

$$C_p = \mathbb{1}_{S > 0} + \left[0.2 + 0.8\left(1 - e^{-1000/S^4}\right)\right]\mathbb{1}_{S \le 0} + \frac{1}{20}\left(H_d\,\mathbb{1}_{H_d > 8} + 8\,\mathbb{1}_{H_d \le 8}\right). \qquad (13)$$

The variable $S$ measures the maximal annual overflow of the river (in meters) and $C_p$ is the associated cost (in million euros) of the dyke. Table 2 shows the inputs $(Q, K_s, Z_v, Z_m, H_d, C_b, L, B)$. Here $\mathbb{1}_{A}$ is equal to 1 when the condition $A$ holds and 0 otherwise. The variable $H_d$ in Equation (12) is a design parameter for the dyke's height.

In Equation (13), the first term is the 1 million euros cost due to a flooding ($S > 0$), the second term corresponds to the cost of the dyke maintenance ($S \le 0$) and the third term is the construction cost of the dyke. The latter cost is constant for a dyke height of less than 8 m and grows like the dyke height otherwise.

Input   Description                   Unit     Probability distribution
Q       Maximal annual flowrate       m^3/s    Gumbel truncated on [500, 3000]
K_s     Strickler coefficient         -        Normal truncated on [15, +inf)
Z_v     River downstream level        m        Triangular
Z_m     River upstream level          m        Triangular
H_d     Dyke height                   m        Uniform
C_b     Bank level                    m        Triangular
L       Length of the river stretch   m        Triangular
B       River width                   m        Triangular
Table 2: Input variables and their probability distributions.

For a complete discussion about the model, its parameters and their meaning, the reader can review Iooss2015, Rocquigny2006 and their references.
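A runnable R transcription of (12) and (13) as reconstructed above; the uniform ranges in the example call are crude stand-ins for the distributions of Table 2.

# Overflow S, equation (12), and cost Cp, equation (13)
dyke <- function(Q, Ks, Zv, Zm, Hd, Cb, L, B) {
  H  <- (Q / (Ks * B * sqrt((Zm - Zv) / L)))^0.6   # water height
  S  <- Zv + H - Hd - Cb                           # maximal annual overflow
  Cp <- (S > 0) +                                  # 1 M euros if flooding
    (0.2 + 0.8 * (1 - exp(-1000 / S^4))) * (S <= 0) +  # maintenance
    (1 / 20) * ifelse(Hd > 8, Hd, 8)               # construction cost
  list(S = S, Cp = Cp)
}

n   <- 1000
ins <- data.frame(Q  = runif(n, 500, 3000), Ks = runif(n, 15, 45),
                  Zv = runif(n, 49, 51),    Zm = runif(n, 54, 56),
                  Hd = runif(n, 7, 9),      Cb = runif(n, 55, 56),
                  L  = runif(n, 4990, 5010), B = runif(n, 295, 305))
out <- with(ins, dyke(Q, Ks, Zv, Zm, Hd, Cb, L, B))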

Figure 4: Estimated values from the 100 iterations for the outputs $S$ and $C_p$ in the dyke model.

We generated 1000 observations for each input according to Table 2, with their respective values of $S$ and $C_p$. Figure 4 shows the result of the simulations for the outputs $S$ and $C_p$ of the dyke model using the cross-validation and bootstrap procedures.
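Continuing the sketches above (ours), the per-input nonparametric indices for the overflow $S$ could then be computed column by column, rescaling the bandwidth by the range of each input:

sapply(ins, function(x) sobol_nw(x, out$S, h = diff(range(x)) * nrow(ins)^(-1/4)))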

For both outputs $S$ and $C_p$, we see that the variables in order of importance are $Q$, $Z_v$, $K_s$, $H_d$ and $C_b$. The rest of the variables have values near zero and provide an insignificant impact on the output.

As reported in Table 3, we compared the values from our procedure against those reported by Iooss2015. The values of the Sobol indices detect the influence of each variable, comparably with the classic Monte-Carlo and meta-model procedures. The exception is $H_d$, whose index in our case decreased to values near 6% against the reported values of around 13%.

Indices (in %)        Q      K_s    Z_v    H_d    C_b
Monte-Carlo (Iooss)   35.5   15.9   18.3   12.5   3.8
Meta-model (Iooss)    38.9   16.8   18.8   13.9   3.7
Bootstrap             40.5   15.5   18.1    5.7   2.9
Cross-validation      37.2   14.6   17.2    5.5   2.8
Table 3: Comparison between the Sobol indices in the dyke model reported by Iooss2015 and our method. The Monte-Carlo and meta-model values are those reported by Iooss2015; the bootstrap and cross-validation methods used samples of size 1000. In all cases the experiment was repeated 100 times.

6 Conclusions

This paper presented an alternative way to estimate the first-order Sobol indices for the general model $Y = m(X_1, \ldots, X_p)$. These indices are calculated through the formula $S_i = \operatorname{Var}(\mathbb{E}[Y \mid X_i])/\operatorname{Var}(Y)$. The method builds the regression curve by a kernel nonparametric regression. We proved that the estimator converges at the parametric rate $n^{-1}$ in mean squared error under some regularity conditions. Otherwise, the rate of convergence depends on the number of finite derivatives of the density function.

The least-squares cross-validation procedure is a classic way to find the bandwidth. However, the literature presents cases where there exists a finite-sample bias in the model. One way to correct it is by increasing the number of samples. This work instead proposes a bootstrap algorithm to correct the bias, by first estimating the normalized residuals of the model and then recreating a bootstrap version of the response variable. With these new data, the algorithm estimates an empirical version of the least-squares error. We call it the Bootstrap Least-Squares criterion, denoted $BLS$. The function finds its minimum at a value $\hat{h}_{BLS}$.

The proposed algorithm overfits the regression curve, because it chooses a smaller bandwidth that increases the variability of the curve. The procedure approximates the first-order Sobol indices well, but it could overestimate them when using fourth-order kernels.

The function $BLS$ was minimized using a Brent-type routine, implemented in the R function optimize. Due to the complexity of the target function, one future improvement to the algorithm is to use a global minimizer like simulated annealing and compare the results. In this scenario, we would expect a better choice of the bandwidths and a better adjustment of the Sobol indices.

The method showed a consistent approximation to the Sobol indices having only the observational data available. In all cases, the nonparametric estimators using cross-validation and the bootstrap identified the influential variables in the g-Sobol and dyke models.

We considered only the indices with simple interactions between one variable and the output. The higher-order indices and total effects remain for further study. We would have to estimate the multivariate nonparametric surface for multiple variables and then approximate the surface variability over some range. The latter step will be an interesting topic of study due to its numerical complexities.

7 Proofs

Proof of Theorem 1.

To simplify the notation, denote $T_i = \mathbb{E}\big[\mathbb{E}[Y \mid X_i]^2\big]$ and by $\hat{T}_i$ its nonparametric estimator defined in (7). Notice that the squared error $(\hat{S}_i - S_i)^2$ expands into two quadratic terms and a cross term,

(14)

The third term in equation (14) is bounded using the Cauchy-Schwarz inequality in the following way,

(15)

Therefore, we have reduced the problem to bounding only the first and second terms of equation (14). The first term in equation (14) is bounded straightforwardly by applying the delta-method. For the second term, adding and subtracting an appropriate intermediate term, we can decompose it into three parts. We can apply the same argument of Equation (15) to bound the third part of this decomposition. Therefore, we only have to bound the terms

(16)

and

(17)

Let us start with equation (16). Given that the kernel function $K$ and the quantities involved are bounded, we can bound its first factor and apply again the Cauchy-Schwarz inequality. Then, there exists a real number $C$ that bounds this term. Employing once again the delta-method, we can control the remaining value.

For the term in equation (17), we apply Theorem 1 of loubes2013rates, using the same reasoning as in that proof: if $\beta \ge 2$, choosing $h \asymp n^{-1/4}$, we get an upper bound of order $n^{-1}$. Otherwise, taking $h \asymp n^{-1/(\beta+2)}$, the upper bound turns into $C\, n^{-2\beta/(\beta+2)}$.

This proves the theorem. ∎