Since the early work of Sobol’ (Sobol’, 1993), global sensitivity analysis (GSA) has received a lot of attention in the computer experiments community. These days, variance-based sensitivity indices are common tools in the analysis of complex physical phenomena. Several statistical estimators have been proposed (Cukier et al., 1973; Saltelli et al., 2010; Tarantola et al., 2006; Janon et al., 2013; Owen, 2013) and their asymptotic properties are now well understood (Tissot and Prieur, 2012; Da Veiga and Gamboa, 2013; Janon et al., 2013). In addition, the case of computationally expensive codes has been investigated thoroughly with the introduction of several dedicated surrogate models (Oakley and O’Hagan, 2004; Marrel et al., 2009; Da Veiga et al., 2009; Blatman and Sudret, 2010; Touzani and Busby, 2012; Durrande et al., 2012).
However, even though they are extremely popular and informative importance measures, variance-based indices suffer from theoretical and practical limitations. First, by definition they only study the impact of the input parameters on the variance of the output. Since this is a restricted summary of the output distribution, this measure can be inadequate for many case studies. Alternative approaches include for example density-based indices (Borgonovo, 2007), derivative-based measures (Sobol and Kucherenko, 2009), or dedicated goal-oriented indices (Fort et al., 2013). Second, variance-based indices do not generalize easily to the case of a multivariate output (Gamboa et al., 2013). Unfortunately, computer code outputs often consist of several scalars or even time-dependent curves, which severely limits the practical use of standard indices. Finally, for high-dimensional problems, a preliminary screening procedure is usually mandatory before the analysis of the computer code or the modeling with a surrogate. The computational cost of GSA is in general too high to envision its use for screening purposes, and more qualitative approaches are thus needed, e.g. the Morris method (Morris, 1991) or group screening (Moon et al., 2012).
In this paper, we propose a completely original point of view in order to overcome these limitations. Starting from the general framework of GSA and the concept of dissimilarity measure, we introduce a new class of sensitivity indices which comprises as a special case the density-based index of Borgonovo (2007). We propose an estimation procedure relying on density ratio estimation and show that it gives access to several different indices for the same computational cost. More importantly, we highlight that other special cases lead to well-known dependence measures, including the mutual information. This link motivates us to investigate the potential of recent state-of-the-art dependence measures as new sensitivity indices, such as the distance correlation (Székely et al., 2007) or the Hilbert-Schmidt independence criterion (Gretton et al., 2005a). An appealing property of such measures is that they can handle multivariate random variables very easily. We also discuss how feature selection methods based on these measures can effectively replace standard screening procedures.
The structure of the paper is as follows. In Section 2, we introduce the general GSA framework based on dissimilarity measures and discuss the use of Csiszár f-divergences. We also emphasize the link with mutual information and propose an estimation procedure. In Section 3, we review some dependence measures and show how they can be used as new sensitivity indices. We also provide examples of feature selection techniques in which they are involved; screening will then be seen as the counterpart of feature selection in machine learning. Finally, several numerical experiments are conducted in Section 4 on both analytical functions and industrial applications. In particular, we illustrate the potential of dependence measures for GSA.
2 From dissimilarity measures to sensitivity indices
Denote $Y = \eta(X_1, \dots, X_p)$ the computer code output, a function of the input random variables $X_1, \dots, X_p$, where $\eta$ is assumed to be continuous. In standard global sensitivity analysis, it is further assumed that the $X_i$ have a known distribution and are independent. As pointed out by Baucells and Borgonovo (2013), a natural way of defining the impact of a given input $X_i$ on $Y$ is to consider a function which measures the similarity between the distribution of $Y$ and that of $Y|X_i$. More precisely, the impact of $X_i$ on $Y$ is given by
$$S_{X_i} = \mathbb{E}_{X_i}\left[ d(Y; Y|X_i) \right], \quad (1)$$
where $d(\cdot\,; \cdot)$ denotes a dissimilarity measure between two random variables. The advantage of such a formulation is that many choices for $d$ are available, and we will see in what follows that some natural dissimilarity measures yield sensitivity indices related to well-known quantities. However, before going further, let us note that the naive dissimilarity measure
$$d(Y; Y|X_i) = \left( \mathbb{E}(Y) - \mathbb{E}(Y|X_i) \right)^2, \quad (2)$$
where random variables are compared only through their mean values, produces the unnormalized Sobol first-order sensitivity index $S_{X_i} = \mathrm{Var}\left( \mathbb{E}(Y|X_i) \right)$.
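To fix ideas, the index $\mathrm{Var}(\mathbb{E}(Y|X_i))$ can be approximated by a brute-force double-loop Monte Carlo scheme. The sketch below is only illustrative: the toy model, input distributions and sample sizes are our own assumptions, not taken from the paper.

```python
import numpy as np

# Crude double-loop Monte Carlo estimate of the unnormalized first-order
# index Var(E[Y|X_i]) for a hypothetical toy model (illustrative only).
rng = np.random.default_rng(0)
n_outer, n_inner, dim = 500, 200, 2

def model(x):
    # toy code: output depends strongly on x[..., 0], weakly on x[..., 1]
    return np.sin(x[..., 0]) + 0.1 * x[..., 1] ** 2

def first_order_unnormalized(i):
    xi = rng.uniform(-np.pi, np.pi, size=n_outer)    # frozen values of X_i
    cond_means = np.empty(n_outer)
    for k, v in enumerate(xi):
        x = rng.uniform(-np.pi, np.pi, size=(n_inner, dim))
        x[:, i] = v                                  # freeze X_i = v
        cond_means[k] = model(x).mean()              # inner mean E[Y | X_i = v]
    return cond_means.var()                          # outer variance

s0 = first_order_unnormalized(0)
s1 = first_order_unnormalized(1)
```

The influential input should receive a markedly larger index; in practice the pick-and-freeze or FAST-type estimators cited above are far more economical than this double loop.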
2.1 Csiszár f-divergences
Assuming all input random variables have an absolutely continuous distribution with respect to the Lebesgue measure on $\mathbb{R}$, the f-divergence (Csiszár, 1967) between $Y$ and $Y|X_i$ is given by
$$d_f(Y \,\|\, Y|X_i) = \int f\left( \frac{p_Y(y)}{p_{Y|X_i}(y)} \right) p_{Y|X_i}(y)\, dy,$$
where $f$ is a convex function such that $f(1) = 0$, and $p_Y$ and $p_{Y|X_i}$ are the probability density functions of $Y$ and $Y|X_i$, respectively. Standard choices for the function $f$ include for example
Hellinger distance: $f(t) = (\sqrt{t} - 1)^2$;
Total variation distance: $f(t) = |t - 1|$;
Pearson divergence: $f(t) = (t - 1)^2$ or $f(t) = t^2 - 1$;
Neyman divergence: $f(t) = (1 - t)^2 / t$ or $f(t) = 1/t - t$.
Plugging this dissimilarity measure in (1) yields the following sensitivity index:
$$S_{X_i}^f = \mathbb{E}_{X_i}\left[ \int f\left( \frac{p_Y(y)}{p_{Y|X_i}(y)} \right) p_{Y|X_i}(y)\, dy \right] = \int\!\!\int f\left( \frac{p_{X_i}(x)\, p_Y(y)}{p_{X_i, Y}(x, y)} \right) p_{X_i, Y}(x, y)\, dx\, dy, \quad (3)$$
where $p_{X_i, Y}$ and $p_{X_i}$ are the joint probability density function of $(X_i, Y)$ and the marginal density of $X_i$, respectively. First of all, note that inequalities on Csiszár f-divergences imply that such sensitivity indices are nonnegative and equal zero when $Y$ and $X_i$ are independent. Also, it is important to note that, given the form of (3), $S_{X_i}^f$ is invariant under any smooth and uniquely invertible transformation of the variables $X_i$ and $Y$, see the proof for mutual information in Kraskov et al. (2004). This is a major advantage over variance-based Sobol sensitivity indices, which are only invariant under linear transformations.
It is easy to see that the total variation distance with $f(t) = \frac{1}{2}|t - 1|$ gives a sensitivity index equal to the one proposed by Borgonovo (2007):
$$S_{X_i}^{TV} = \frac{1}{2}\, \mathbb{E}_{X_i}\left[ \int \left| p_{Y|X_i}(y) - p_Y(y) \right| dy \right] = \delta_i.$$
In addition, the Kullback-Leibler divergence with $f(t) = -\log t$ yields
$$S_{X_i}^{KL} = \mathbb{E}_{X_i}\left[ \int \log\left( \frac{p_{Y|X_i}(y)}{p_Y(y)} \right) p_{Y|X_i}(y)\, dy \right],$$
that is, the mutual information $I(X_i; Y)$ between $X_i$ and $Y$. A normalized version of this sensitivity index was studied by Krzykacz-Hausmann (2001). Similarly, the Neyman divergence with $f(t) = (1 - t)^2 / (2t)$ leads to
$$S_{X_i}^{N} = \frac{1}{2} \int\!\!\int \frac{\left( p_{X_i, Y}(x, y) - p_{X_i}(x)\, p_Y(y) \right)^2}{p_{X_i}(x)\, p_Y(y)}\, dx\, dy,$$
which is the so-called squared-loss mutual information between $X_i$ and $Y$ (or mean square contingency). These results show that some previously proposed sensitivity indices are actually special cases of more general indices defined through Csiszár f-divergences. To the best of our knowledge, this is the first work in which this link is highlighted. Moreover, the specific structure of equation (3) makes it possible to envision more efficient tools for the estimation of these sensitivity indices: indeed, it only involves approximating a density ratio rather than full densities. This point is investigated in the next subsection. More importantly, we see that special choices for $f$ define sensitivity indices that are actually well-known dependence measures, such as the mutual information. This paves the way for completely new sensitivity indices based on recent state-of-the-art dependence measures, see Section 3.
2.2 Estimation
Coming back to equation (3), the goal is to estimate
$$S_{X_i}^f = \mathbb{E}_{X_i, Y}\left[ f\left( \frac{1}{r(X_i, Y)} \right) \right],$$
where $r(x, y) = p_{X_i, Y}(x, y) / \left( p_{X_i}(x)\, p_Y(y) \right)$ is the ratio between the joint density of $(X_i, Y)$ and the product of the marginals. Of course, straightforward estimation is possible if one estimates the densities $p_{X_i, Y}$, $p_{X_i}$ and $p_Y$ with e.g. kernel density estimators. However, it is well known that density estimation suffers from the curse of dimensionality, which limits the possible multivariate extensions we discuss in the next subsection. Besides, since only the ratio function $r$ is needed, we expect more robust estimates by focusing on it directly.
Let us assume now that we have a sample $\left( X_i^{(1)}, Y^{(1)} \right), \dots, \left( X_i^{(n)}, Y^{(n)} \right)$ of $(X_i, Y)$. The idea is to first build an estimate $\hat r$ of the ratio $r$. The final estimator of $S_{X_i}^f$ will then be given by
$$\hat S_{X_i}^f = \frac{1}{n} \sum_{k=1}^n f\left( \frac{1}{\hat r\left( X_i^{(k)}, Y^{(k)} \right)} \right).$$
Powerful estimation methods for density ratios include e.g. maximum-likelihood estimation (Suzuki et al., 2008) and unconstrained least-squares importance fitting (Kanamori et al., 2009), among others (see Sugiyama et al., 2012). A k-nearest-neighbors strategy dedicated to mutual information is also discussed in Kraskov et al. (2004).
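As a rough sketch of such a ratio-based strategy, the tentative code below fits a least-squares density-ratio model in the spirit of unconstrained least-squares importance fitting and plugs it into the Kullback-Leibler index; the Gaussian basis, bandwidth, regularization parameter and toy data are all illustrative assumptions rather than the tuned procedures of the cited references.

```python
import numpy as np

# Sketch of least-squares density-ratio estimation (uLSIF-like) for
# r(x, y) = p(x, y) / (p(x) p(y)); all tuning choices are illustrative.
rng = np.random.default_rng(1)
n = 300
x = rng.normal(size=n)
y = x + 0.3 * rng.normal(size=n)                  # strongly dependent toy pair

def gauss(u, centers, sigma):
    return np.exp(-(u[:, None] - centers[None, :]) ** 2 / (2 * sigma ** 2))

sigma, lam, b = 0.5, 0.1, 50
cx, cy = x[:b], y[:b]                             # kernel centers on the sample

# design matrices: joint sample (numerator) vs shuffled sample, which
# approximates a draw from the product of the marginals (denominator)
Phi_num = gauss(x, cx, sigma) * gauss(y, cy, sigma)
Phi_den = gauss(x, cx, sigma) * gauss(y[rng.permutation(n)], cy, sigma)

H = Phi_den.T @ Phi_den / n                       # denominator second moments
h = Phi_num.mean(axis=0)                          # numerator first moments
theta = np.linalg.solve(H + lam * np.eye(b), h)   # ridge-regularized fit

r_hat = Phi_num @ theta                           # ratio on the joint sample
mi_hat = np.log(np.clip(r_hat, 1e-12, None)).mean()   # plug-in KL/MI index
```

Note that only the ratio is ever modeled: no individual density estimate appears, which is precisely the robustness argument made above.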
2.3 Multivariate extensions
Given our approach focusing only on densities, it is straightforward to extend the definition of the sensitivity index in equation (3) to any number of input and output variables. We can then study the impact of a given group of input variables $X_J = (X_i)_{i \in J}$, where $J$ is a subset of $\{1, \dots, p\}$, on a multivariate output $Y$ with the sensitivity index given by
$$S_{X_J}^f = \mathbb{E}_{X_J}\left[ \int f\left( \frac{p_Y(y)}{p_{Y|X_J}(y)} \right) p_{Y|X_J}(y)\, dy \right]. \quad (4)$$
This favorable generalization was already mentioned for the special cases of the total-variation distance and mutual information by Borgonovo (2007) and Auder and Iooss (2008), respectively. However, in the high-dimensional setting, estimation of such sensitivity indices is infeasible since even the ratio trick detailed above fails. This is thus a severe limitation for screening purposes. We examine efficient alternatives in Section 3.
Moreover, note that extending the naive dissimilarity measure (2) to the multivariate output case naturally leads to considering $d(Y; Y|X_i) = \left\| \mathbb{E}(Y) - \mathbb{E}(Y|X_i) \right\|^2$. Straightforward calculations reveal that the corresponding sensitivity index is then the sum of the Sobol first-order sensitivity indices of each output component. Gamboa et al. (2013) showed that this multivariate index is the only one possessing the desired invariance properties in the variance-based index family.
2.4 On the use of other dissimilarity measures
We focused above on Csiszár f-divergences, but other dissimilarity measures exist to compare probability distributions. In particular, integral probability metrics (IPM, Müller, 1997) are a popular family of distance measures on probabilities, given by
$$\gamma_{\mathcal{F}}(P, Q) = \sup_{g \in \mathcal{F}} \left| \int g\, dP - \int g\, dQ \right| \quad (5)$$
for two probability measures $P$ and $Q$, where $\mathcal{F}$ is a class of real-valued bounded measurable functions. Just as the choice of the function $f$ in Csiszár f-divergences gives rise to different measures, the choice of $\mathcal{F}$ generates different IPMs, e.g. the Wasserstein distance, the Dudley metric or the total variation distance. It is interesting to note that Csiszár f-divergences and IPMs are very distinct classes of measures, since they intersect only at the total variation distance (Sriperumbudur et al., 2012). Unfortunately, plugging the general expression (5) of an IPM in equation (1) no longer yields a closed-form expression for a sensitivity index. However, we plan to study such indices in a future work, since estimation of IPMs appears to be easier than for Csiszár f-divergences and is independent of the dimensionality of the random variables (Sriperumbudur et al., 2012).
Finally, let us mention the recent work of Fort et al. (2013) on goal-oriented measures, where they introduce a new class of sensitivity indices
$$S_{X_i}^\psi = \min_{\theta} \mathbb{E}\left[ \psi(Y, \theta) \right] - \mathbb{E}_{X_i}\left[ \min_{\theta} \mathbb{E}\left[ \psi(Y, \theta) \,\middle|\, X_i \right] \right], \quad (6)$$
where $\psi$ is the contrast function associated to the features of interest $\theta^*$ and $\theta^*(X_i)$ of $Y$ and of $Y$ conditionally to $X_i$, respectively (note that we only give here the unnormalized version of the index). It is easy to check that (6) is a special case of (1).
3 Dependence measures and feature selection
Given two random vectors $U \in \mathbb{R}^p$ and $V \in \mathbb{R}^q$, dependence measures aim at quantifying the dependence between $U$ and $V$ in arbitrary dimension, with the property that the measure equals zero if and only if $U$ and $V$ are independent. In particular, they are useful when one wants to design a statistical test of independence. Here, we focus on the long-known mutual information criterion, as well as on the more recent distance correlation measure (Székely et al., 2007). Sejdinovic et al. (2013) showed that the latter shares deep links with distances between embeddings of distributions into reproducing kernel Hilbert spaces (RKHS), and especially with the Hilbert-Schmidt independence criterion (HSIC, Gretton et al., 2005a), which will also be discussed. Finally, we review feature selection techniques introduced in machine learning which make use of these dependence measures.
3.1 Mutual information
Mutual information (MI, Shannon, 1948) is a symmetric measure of dependence which is related to entropy. Assuming $U$ and $V$ are two random vectors which are absolutely continuous with respect to the Lebesgue measure, with density functions $p_U$ and $p_V$ respectively, one can define the marginal entropy of $U$:
$$H(U) = -\int p_U(u) \log p_U(u)\, du,$$
and $H(V)$ similarly. Denoting $p_{UV}$ their joint density function, the joint entropy of $U$ and $V$ writes
$$H(U, V) = -\int\!\!\int p_{UV}(u, v) \log p_{UV}(u, v)\, du\, dv.$$
MI is then formally defined as
$$I(U; V) = H(U) + H(V) - H(U, V).$$
Interestingly, MI equals zero if and only if $U$ and $V$ are independent. This implies that MI is able to detect nonlinear dependencies between random variables, unlike the correlation coefficient. It is also easy to check with Jensen’s inequality that $I(U; V) \geq 0$. Further note that MI is not a distance, since it does not obey the triangle inequality. A simple modified version yielding a distance, the variation of information (VI), is given by
$$VI(U; V) = H(U, V) - I(U; V).$$
Another variant is the squared-loss mutual information (SMI, Suzuki et al., 2009):
$$SMI(U; V) = \frac{1}{2} \int\!\!\int p_U(u)\, p_V(v) \left( \frac{p_{UV}(u, v)}{p_U(u)\, p_V(v)} - 1 \right)^2 du\, dv, \quad (7)$$
which is again a dependence measure, verifying $SMI(U; V) \geq 0$ with equality if and only if $U$ and $V$
are independent. Applications of MI, VI and SMI include independent component analysis (Hyvärinen and Oja, 2000) and image registration (Pluim et al., 2003), among many others.
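For a pair of scalar variables, both MI and SMI admit simple histogram plug-in estimates. The sketch below is a minimal illustration on an assumed toy pair; the bin count and sample size are arbitrary choices, and such plug-in estimators are known to be biased.

```python
import numpy as np

# Histogram plug-in estimates of MI and SMI on a nonlinear toy pair;
# y depends on x only through x**2, so the linear correlation is near zero.
rng = np.random.default_rng(2)
n, bins = 5000, 20
x = rng.normal(size=n)
y = x ** 2 + 0.5 * rng.normal(size=n)

pxy, _, _ = np.histogram2d(x, y, bins=bins)
pxy /= n                                   # joint cell probabilities
px = pxy.sum(axis=1, keepdims=True)        # marginal of x
py = pxy.sum(axis=0, keepdims=True)        # marginal of y
prod = px * py

m = pxy > 0                                # cells with mass (prod > 0 there too)
mi = np.sum(pxy[m] * np.log(pxy[m] / prod[m]))
m2 = prod > 0
smi = 0.5 * np.sum((pxy[m2] - prod[m2]) ** 2 / prod[m2])
```

Both estimates are clearly positive here while the empirical correlation coefficient stays close to zero, which is exactly the nonlinear-detection property discussed above.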
In the context of global sensitivity analysis, we have seen in Section 2.1 that MI and SMI arise as sensitivity indices when specific Csiszár f-divergences are chosen to evaluate the dissimilarity between the output $Y$ and the conditional output $Y|X_i$, where $X_i$ is one of the input variables. We will then study the two following sensitivity indices:
$$S_{X_i}^{MI} = I(X_i; Y) \quad (8)$$
and
$$S_{X_i}^{SMI} = SMI(X_i; Y). \quad (9)$$
3.2 Distance correlation
The distance correlation was introduced by Székely et al. (2007) to address the problem of testing dependence between two random vectors $U \in \mathbb{R}^p$ and $V \in \mathbb{R}^q$. It is based on the concept of distance covariance, which measures the distance between the joint characteristic function of $(U, V)$ and the product of the marginal characteristic functions.
More precisely, denote $\phi_U$ and $\phi_V$ the characteristic functions of $U$ and $V$, respectively, and $\phi_{UV}$ their joint characteristic function. For a complex-valued function $\phi$, we also denote $\bar\phi$ the complex conjugate of $\phi$ and $|\phi|^2 = \phi \bar\phi$. The distance covariance (dCov) between $U$ and $V$ with finite first moments is then defined as a weighted distance between $\phi_{UV}$ and $\phi_U \phi_V$, given by
$$\mathcal{V}^2(U, V) = \int_{\mathbb{R}^{p+q}} \left| \phi_{UV}(t, s) - \phi_U(t)\, \phi_V(s) \right|^2 w(t, s)\, dt\, ds, \quad (10)$$
where the weight function $w(t, s) = \left( c_p\, c_q\, |t|_p^{1+p}\, |s|_q^{1+q} \right)^{-1}$ with constants $c_d = \pi^{(1+d)/2} / \Gamma\left( (1+d)/2 \right)$ for $d \geq 1$ is chosen to ensure invariance properties, see Székely et al. (2007). The distance correlation (dCor) between $U$ and $V$ is then naturally defined as
$$\mathcal{R}^2(U, V) = \frac{\mathcal{V}^2(U, V)}{\sqrt{\mathcal{V}^2(U, U)\, \mathcal{V}^2(V, V)}} \quad (11)$$
if $\mathcal{V}^2(U, U)\, \mathcal{V}^2(V, V) > 0$, and equals $0$ otherwise. Important properties of the distance correlation introduced in (11) include that $0 \leq \mathcal{R} \leq 1$ and that $\mathcal{R}(U, V) = 0$ if and only if $U$ and $V$ are independent. Interestingly, the distance covariance in (10) can be computed in terms of expectations of pairwise Euclidean distances, namely
$$\mathcal{V}^2(U, V) = \mathbb{E}\left[ |U - U'|_p\, |V - V'|_q \right] + \mathbb{E}\left[ |U - U'|_p \right] \mathbb{E}\left[ |V - V'|_q \right] - 2\, \mathbb{E}\left[ |U - U'|_p\, |V - V''|_q \right], \quad (12)$$
where $(U', V')$ and $(U'', V'')$ are i.i.d. copies of $(U, V)$. Concerning estimation, let $\left( U^{(1)}, V^{(1)} \right), \dots, \left( U^{(n)}, V^{(n)} \right)$ be a sample of the random vector $(U, V)$. Following equation (12), an estimator of $\mathcal{V}^2(U, V)$ is then given by
$$\mathcal{V}_n^2(U, V) = \frac{1}{n^2} \sum_{k, l = 1}^n A_{kl}\, B_{kl}, \quad (13)$$
where $A_{kl} = a_{kl} - \bar a_{k \cdot} - \bar a_{\cdot l} + \bar a_{\cdot \cdot}$ and $B_{kl} = b_{kl} - \bar b_{k \cdot} - \bar b_{\cdot l} + \bar b_{\cdot \cdot}$ are double-centered versions of the pairwise distance matrices $a_{kl} = \left| U^{(k)} - U^{(l)} \right|_p$ and $b_{kl} = \left| V^{(k)} - V^{(l)} \right|_q$,
and is also equal to equation (10) if one uses the empirical characteristic functions computed on the sample. The empirical distance correlation is then
$$\mathcal{R}_n^2(U, V) = \frac{\mathcal{V}_n^2(U, V)}{\sqrt{\mathcal{V}_n^2(U, U)\, \mathcal{V}_n^2(V, V)}}$$
and satisfies $0 \leq \mathcal{R}_n \leq 1$. Although $\mathcal{V}_n^2$ is a consistent estimator of $\mathcal{V}^2$, it is easy to see that it is biased. Székely and Rizzo (2013b) propose an unbiased version of $\mathcal{V}_n^2$, and a specific correction for the high-dimensional case is investigated in Székely and Rizzo (2013a). Further note that Székely et al. (2007) also study a generalized distance covariance $\mathcal{V}^{2(\alpha)}$ defined as
$$\mathcal{V}^{2(\alpha)}(U, V) = \int_{\mathbb{R}^{p+q}} \left| \phi_{UV}(t, s) - \phi_U(t)\, \phi_V(s) \right|^2 w_\alpha(t, s)\, dt\, ds \quad (14)$$
with the new weight function $w_\alpha(t, s) = \left( C(p, \alpha)\, C(q, \alpha)\, |t|_p^{\alpha + p}\, |s|_q^{\alpha + q} \right)^{-1}$ and appropriate constants $C(p, \alpha)$ and $C(q, \alpha)$, as soon as $0 < \alpha < 2$ and the moments $\mathbb{E}\left[ |U|_p^\alpha \right]$ and $\mathbb{E}\left[ |V|_q^\alpha \right]$ are finite. Distance covariance is retrieved for $\alpha = 1$. The very general case of $U$ and $V$ living in metric spaces has been examined by Lyons (2013). More precisely, let $(\mathcal{X}, \rho_{\mathcal{X}})$ and $(\mathcal{Y}, \rho_{\mathcal{Y}})$ be metric spaces of negative type (see Lyons, 2013); the generalized distance covariance
$$\mathcal{V}_{\rho_{\mathcal{X}}, \rho_{\mathcal{Y}}}^2(U, V) = \mathbb{E}\left[ \rho_{\mathcal{X}}(U, U')\, \rho_{\mathcal{Y}}(V, V') \right] + \mathbb{E}\left[ \rho_{\mathcal{X}}(U, U') \right] \mathbb{E}\left[ \rho_{\mathcal{Y}}(V, V') \right] - 2\, \mathbb{E}\left[ \rho_{\mathcal{X}}(U, U')\, \rho_{\mathcal{Y}}(V, V'') \right] \quad (15)$$
characterizes independence between $U$ and $V$.
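The pairwise-distance formulas above translate directly into a few lines of code. The following sketch computes the biased empirical distance correlation on an assumed toy example where the output depends on the input only through its square, a dependence the ordinary correlation coefficient misses.

```python
import numpy as np

# Empirical (biased) distance covariance/correlation via double-centered
# pairwise distance matrices; toy data for illustration only.
def dcov2(a, b):
    A = np.sqrt(((a[:, None, :] - a[None, :, :]) ** 2).sum(-1))
    B = np.sqrt(((b[:, None, :] - b[None, :, :]) ** 2).sum(-1))
    A = A - A.mean(0) - A.mean(1)[:, None] + A.mean()   # double centering
    B = B - B.mean(0) - B.mean(1)[:, None] + B.mean()
    return (A * B).mean()                               # V_n^2

def dcor(a, b):
    denom = np.sqrt(dcov2(a, a) * dcov2(b, b))
    return np.sqrt(dcov2(a, b) / denom) if denom > 0 else 0.0

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, size=(400, 1))
y = x ** 2 + 0.05 * rng.normal(size=(400, 1))   # nonlinear, uncorrelated link
r = dcor(x, y)                                   # detects the dependence
```

Since the inputs are arrays of shape (n, d), exactly the same code handles multivariate $U$ and $V$, which is the appeal of dCov for multivariate outputs.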
Coming back to sensitivity analysis, just like we defined a new index based on mutual information, we can finally introduce an index based on distance correlation, i.e.
$$S_{X_i}^{dCor} = \mathcal{R}^2(X_i, Y), \quad (16)$$
which will measure the dependence between an input variable $X_i$ and the output $Y$. Since distance correlation is designed to detect nonlinear relationships, we expect this index to quantify effectively the impact of $X_i$ on $Y$. Besides, considering that distance covariance is defined in arbitrary dimension, this index generalizes easily to the multivariate case:
$$S_{X_J}^{dCor} = \mathcal{R}^2(X_J, Y)$$
for evaluating the impact of a group of inputs $X_J$ on a multivariate output $Y$.
Remark 1. The limiting case $\alpha \to 2$ in (14) interestingly leads to a measure proportional to the absolute covariance $\left| \mathrm{Cov}(U, V) \right|$, see Székely et al. (2007). This turns out to be another original way of defining a new sensitivity index. Indeed, recall that the Sobol first-order sensitivity index actually equals $\mathrm{Cov}(Y, Y^i) / \mathrm{Var}(Y)$, where $Y^i$ is an independent copy of $Y$ obtained by fixing $X_i$, see Janon et al. (2013). The idea is then to replace the covariance (obtained with $\alpha = 2$) by dCov (with $\alpha = 1$):
$$S_{X_i}^{PF} = \mathcal{R}^2(Y, Y^i), \quad (17)$$
where PF stands for pick-and-freeze, since this index generalizes the pick-and-freeze estimator proposed by Janon et al. (2013) and is able to detect nonlinear dependencies, unlike the correlation coefficient.
3.3 The Hilbert-Schmidt independence criterion
The Hilbert-Schmidt independence criterion proposed by Gretton et al. (2005a) builds upon kernel-based approaches for detecting dependence, and more particularly on cross-covariance operators in RKHSs. Here, we only give a brief introduction to this topic and refer the reader to Berlinet and Thomas-Agnan (2004); Gretton et al. (2005a); Smola et al. (2007) for details.
Let the random vector $U$ have distribution $P_U$, and consider an RKHS $\mathcal{F}$ of functions from $\mathbb{R}^p$ to $\mathbb{R}$ with kernel $k$ and dot product $\langle \cdot, \cdot \rangle_{\mathcal{F}}$. Similarly, we can also define a second RKHS $\mathcal{G}$ of functions from $\mathbb{R}^q$ to $\mathbb{R}$, with kernel $l$ and dot product $\langle \cdot, \cdot \rangle_{\mathcal{G}}$, associated to the random vector $V$ with distribution $P_V$. By definition, the cross-covariance operator $C_{UV}$ associated to the joint distribution $P_{UV}$ of $(U, V)$ is the linear operator from $\mathcal{G}$ to $\mathcal{F}$ defined for every $f \in \mathcal{F}$ and $g \in \mathcal{G}$ as
$$\langle f, C_{UV}\, g \rangle_{\mathcal{F}} = \mathrm{Cov}\left( f(U), g(V) \right) = \mathbb{E}\left[ f(U)\, g(V) \right] - \mathbb{E}\left[ f(U) \right] \mathbb{E}\left[ g(V) \right].$$
In a nutshell, the cross-covariance operator generalizes the covariance matrix by representing higher-order correlations between $U$ and $V$ through nonlinear kernels. For every linear operator $C: \mathcal{G} \to \mathcal{F}$, and provided the sum converges, the Hilbert-Schmidt norm of $C$ is given by
$$\| C \|_{HS}^2 = \sum_{i, j} \langle u_i, C\, v_j \rangle_{\mathcal{F}}^2,$$
where $(u_i)_{i \geq 1}$ and $(v_j)_{j \geq 1}$ are orthonormal bases of $\mathcal{F}$ and $\mathcal{G}$, respectively. This is simply the generalization of the Frobenius norm on matrices. The HSIC criterion is then defined as the squared Hilbert-Schmidt norm of the cross-covariance operator:
$$\mathrm{HSIC}(U, V)_{\mathcal{F}, \mathcal{G}} = \| C_{UV} \|_{HS}^2 = \mathbb{E}\left[ k(U, U')\, l(V, V') \right] + \mathbb{E}\left[ k(U, U') \right] \mathbb{E}\left[ l(V, V') \right] - 2\, \mathbb{E}\left[ k(U, U')\, l(V, V'') \right], \quad (18)$$
where the last equality in terms of kernels is proven in Gretton et al. (2005a). An important property of $\mathrm{HSIC}(U, V)_{\mathcal{F}, \mathcal{G}}$ is that it equals $0$ if and only if $U$ and $V$ are independent, as long as the associated RKHSs $\mathcal{F}$ and $\mathcal{G}$ are universal, i.e. they are dense in the space of continuous functions with respect to the infinity norm (Gretton et al., 2005b). Examples of kernels generating universal RKHSs are e.g. the Gaussian and the Laplace kernels (Sriperumbudur et al., 2009).
It is interesting to note the similarity between the generalized distance covariance of equation (15) and the HSIC criterion (18). Actually, Sejdinovic et al. (2013) recently studied the deep connection between these approaches and showed that
$$\mathcal{V}_{\rho_{\mathcal{X}}, \rho_{\mathcal{Y}}}^2(U, V) = 4\, \mathrm{HSIC}(U, V)_{\mathcal{F}, \mathcal{G}}$$
if the kernels $k$ and $l$ generate the metrics $\rho_{\mathcal{X}}$ and $\rho_{\mathcal{Y}}$, respectively (see Sejdinovic et al., 2013). In particular, the standard distance covariance (12) is retrieved with the (universal) kernel $k(u, u') = \frac{1}{2} \left( |u|_p + |u'|_p - |u - u'|_p \right)$, which generates the metric $\rho(u, u') = |u - u'|_p$.
Assume now that $\left( U^{(1)}, V^{(1)} \right), \dots, \left( U^{(n)}, V^{(n)} \right)$ is a sample of the random vector $(U, V)$, and denote $K$ and $L$ the Gram matrices with entries $K_{kl} = k\left( U^{(k)}, U^{(l)} \right)$ and $L_{kl} = l\left( V^{(k)}, V^{(l)} \right)$. Gretton et al. (2005a) propose the following consistent estimator for $\mathrm{HSIC}(U, V)_{\mathcal{F}, \mathcal{G}}$:
$$\mathrm{HSIC}_n(U, V) = \frac{1}{n^2} \mathrm{Tr}\left( K H L H \right),$$
where $H$ is the centering matrix such that $H_{kl} = \delta_{kl} - \frac{1}{n}$. Besides, it is easy to check that $\mathrm{HSIC}_n(U, V)$ can be expressed just like the empirical distance covariance (13):
$$\mathrm{HSIC}_n(U, V) = \frac{1}{n^2} \sum_{k, l = 1}^n \tilde K_{kl}\, \tilde L_{kl},$$
where $\tilde K = H K H$ and $\tilde L = H L H$ are the doubly-centered Gram matrices.
An unbiased estimator is also introduced by Song et al. (2012).
We can finally propose a sensitivity index generalizing (16):
$$S_{X_i}^{HSIC} = R_{HSIC}^2(X_i, Y), \quad (19)$$
where the kernel-based distance correlation is given by
$$R_{HSIC}^2(X_i, Y) = \frac{\mathrm{HSIC}(X_i, Y)}{\sqrt{\mathrm{HSIC}(X_i, X_i)\, \mathrm{HSIC}(Y, Y)}},$$
and where the kernels inducing $\mathcal{F}$ and $\mathcal{G}$ have to be chosen within the class of universal kernels. The multivariate extension of $S_{X_i}^{HSIC}$ is straightforward. The impact of the choice of kernels has previously been studied by Sriperumbudur et al. (2009) in the context of independence hypothesis tests.
Instead of working with the cross-covariance operator $C_{UV}$, Fukumizu et al. (2007) work with the normalized cross-covariance operator (NOCCO) defined as $V_{UV} = C_{UU}^{-1/2}\, C_{UV}\, C_{VV}^{-1/2}$, see Fukumizu et al. (2007) for the existence of this representation. Just as for the HSIC criterion, the associated measure of dependence is given by $I^{NOCCO}(U, V) = \| V_{UV} \|_{HS}^2$. Interestingly, $I^{NOCCO}$ is independent of the choice of kernels and is actually equal to the squared-loss mutual information (7) under some assumptions, see Fukumizu et al. (2008). Despite the advantage of being kernel-free, using $I^{NOCCO}$ in practice unfortunately requires working with an estimator involving a regularization parameter, which has to be selected (Fukumizu et al., 2007). Nevertheless, this approach remains interesting for approximating SMI efficiently, since the dimensionality limitations related to density function estimation no longer apply.
The pick-and-freeze estimator defined in Remark 1 can be readily generalized with kernels:
$$S_{X_i}^{PF, k} = R_{HSIC}^2(Y, Y^i), \quad (20)$$
where this time only the kernel acting on $Y$ needs to be specified.
3.3.2 Going beyond
The kernel point of view in HSIC also provides an elegant and powerful framework for dealing with categorical inputs and outputs, as well as functional ones.
The categorical case is common practice in feature selection, since the target output is often represented as labels. Appropriate kernels include for example $l(y, y') = \mathbb{1}_{\{y = y'\}} / n_y$, where $n_y$ is the number of samples with label $y$, see e.g. Song et al. (2012); Yamada et al. (2013). From a GSA perspective, this implies that we can evaluate the impact of the inputs on level sets of the output by the simple change of variable $Z = \mathbb{1}_{\{Y > t\}}$ for a given threshold $t$. We can note the resemblance with the approach of Fort et al. (2013) if one uses a contrast function adapted to exceedance probabilities.
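A tentative illustration of this level-set viewpoint: combine a Gaussian kernel on an input with the label kernel above applied to the exceedance indicator. The model, threshold and bandwidth are hypothetical choices.

```python
import numpy as np

# HSIC between inputs and a binarized output 1{Y > t}, using the label
# kernel l(z, z') = 1{z = z'} / n_z; toy model and threshold are illustrative.
rng = np.random.default_rng(6)
n = 400
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = x1 ** 2 + 0.1 * rng.normal(size=n)          # only x1 drives the output
z = (y > np.quantile(y, 0.8)).astype(int)       # exceedance indicator

counts = np.bincount(z)                         # samples per label
L = (z[:, None] == z[None, :]) / counts[z][:, None]   # label Gram matrix

def gram_gauss(u):
    d2 = (u[:, None] - u[None, :]) ** 2
    return np.exp(-d2 / (2 * np.median(d2[d2 > 0])))

def hsic_n(K, L):
    m = K.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    return np.trace(K @ H @ L @ H) / m ** 2

s_x1 = hsic_n(gram_gauss(x1), L)                # input driving the level set
s_x2 = hsic_n(gram_gauss(x2), L)                # irrelevant input
```

The input driving the exceedance event receives the larger score, which is the kind of goal-oriented diagnostic alluded to above.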
As a matter of fact, it is also possible to design dedicated semi-metrics for functional data which can be incorporated in the definition of the kernels, see e.g. Ferraty and Vieu (2006). For example, let $\delta(\cdot, \cdot)$ be such a semi-metric, defined on the functional space under consideration when the output variable is of functional type. The kernel associated to $\delta$ is then given by $l(y, y') = h\left( \delta(y, y') \right)$, where $h$ is a kernel acting on $\mathbb{R}$. The same scheme applies to functional inputs as well, see Ginsbourger et al. (2012) for an illustration in the context of surrogate modeling where the semi-metric is a cheap and simplified computer code. However, a theoretical shortcoming lies in our current inability to check whether such semi-metric kernels are universal, which implies that we cannot claim that independence can be detected. Despite this deficiency, we show in Section 4 that, from a practical perspective, the use of a semi-metric based on principal components can efficiently deal with a functional output given as a 2D map.
3.4 Feature selection as an alternative to screening
In machine learning, feature selection aims at identifying relevant features (among a large set) with respect to a prediction task. The goal is to detect irrelevant or redundant features which may increase the prediction variance without reducing its bias. As a matter of fact, this closely resembles the objective of factor screening in GSA. The main difference is that in GSA, input variables are usually assumed to be independent, whereas in feature selection redundant features, i.e. highly dependent factors, precisely have to be filtered out. This apparently naive distinction actually makes feature selection an interesting alternative to screening when some input variables are correlated. But it is important to note that it is also a powerful option even in the independent case.
We do not intend here to give an exhaustive review of feature selection techniques, but rather detail some approaches which make use of the dependence measures we recapped above. We hope that it will illustrate how they can be used as new screening procedures in high dimensional problems.
Literature on feature selection is abundant and entails many approaches. In the high-dimensional setting, model-based techniques include for example the Lasso (Tibshirani, 1996) or sparse additive models (Ravikumar et al., 2009), see Fan and Lv (2010) for a selective overview. Generalizations to the ultra-high dimensional case usually abandon penalty-based techniques in favor of marginal regression, where an underlying model is still assumed (e.g. linear in Fan and Lv (2008), or non-parametric additive in Fan et al. (2011)). Another line of work for the ultra-high dimensional setting is model-free methods, where only dependence measures are used to identify relevant features. Except for the very specific HSIC Lasso technique (Yamada et al., 2013), here we only focus on pure dependence-based approaches.
Let us first introduce the concept of Max-Dependency (Peng et al., 2005). Denote $\mathcal{S} = \{X_1, \dots, X_p\}$ the set of available features, $Y$ the target output to predict, and $D(\cdot, \cdot)$ any measure quantifying the dependence between two random vectors. The Max-Dependency scheme for feature selection involves finding the $m$ features which jointly have the largest dependence with $Y$, i.e. one has to solve the following optimization problem:
$$\max_{\mathcal{S}_m \subset \mathcal{S},\ |\mathcal{S}_m| = m} D(\mathcal{S}_m, Y). \quad (21)$$
Solving (21) is however computationally infeasible when $p$ and $m$ are large, for cardinality reasons. Near-optimal solutions are then usually found by iterative procedures, where features are added one at a time to the subset (forward selection). On the other hand, the dependence measure must also be robust to dimensionality, which is hard to achieve in practice when the number of samples is small compared to $m$. Consequently, marginal computations, which only involve the one-dimensional terms $D(X_i, Y)$, are usually preferred. The Max-Relevance criterion (Peng et al., 2005) serves in this context as a proxy to Max-Dependency, where the optimization problem writes
$$\max_{\mathcal{S}_m \subset \mathcal{S},\ |\mathcal{S}_m| = m} \frac{1}{m} \sum_{X_i \in \mathcal{S}_m} D(X_i, Y). \quad (22)$$
But when the features are dependent, it is likely that this criterion will select redundant features. To limit this effect, one can add a condition of Min-Redundancy, expressed as
$$\min_{\mathcal{S}_m \subset \mathcal{S},\ |\mathcal{S}_m| = m} \frac{1}{m^2} \sum_{X_i, X_j \in \mathcal{S}_m} D(X_i, X_j), \quad (23)$$
and both criteria can be combined into the max-relevance min-redundancy (mRMR) criterion
$$\max_{\mathcal{S}_m \subset \mathcal{S},\ |\mathcal{S}_m| = m} \left[ \frac{1}{m} \sum_{X_i \in \mathcal{S}_m} D(X_i, Y) - \frac{1}{m^2} \sum_{X_i, X_j \in \mathcal{S}_m} D(X_i, X_j) \right]. \quad (24)$$
Forward and backward procedures for mRMR are investigated by Peng et al. (2005), where $D$ is chosen as the mutual information. Similarly, forward and backward approaches where MI is replaced with HSIC are introduced by Song et al. (2012). A purely marginal point of view is studied by Li et al. (2012), where the authors propose a screening procedure based on the dCor criterion (11). In a nutshell, the dCor measure is computed between $Y$ and each factor $X_i$, and only the features with dCor above a certain threshold are retained. A sure screening property of this approach is also proven. Balasubramanian et al. (2013) extend this work by considering a modified version of the HSIC dependence measure (a supremum of HSIC over a family of universal kernels, denoted sup-HSIC). Even if the sure screening property of this generalized method is proven, the authors mention that every feature selection technique based on marginal computations fails at detecting features that may be marginally uncorrelated with the output but are in fact jointly correlated with it. As a result, they propose the following iterative approach:
1. Compute the marginal sup-HSIC measures between $Y$ and each feature $X_i$, and select the inputs with a measure above a given threshold. Let $\mathcal{S}^*$ be the subset of selected features.
2. Compute sup-HSIC between $Y$ and $\mathcal{S}^* \cup \{X_i\}$ for each $X_i \notin \mathcal{S}^*$. Augment $\mathcal{S}^*$ with the features yielding a measure greater than the sup-HSIC criterion between $Y$ and $\mathcal{S}^*$.
3. Repeat until the subset of selected features stabilizes or its cardinality reaches a given maximum value.
As pointed out previously, another drawback of marginal computations which is not taken care of by the above scheme is that redundant variables are not eliminated. But Balasubramanian et al. (2013) design another iterative procedure to deal with this case. Finally, let us note that in the examples of Section 4, we will only study the above iterative technique since we focus on independent input factors. We plan to investigate in particular the full mRMR approach for problems with correlated inputs in a future work.
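To make the marginal step of such procedures concrete, here is a hedged sketch of dependence-based screening: score every input with an empirical (squared) distance correlation against the output and rank. The toy model and selection rule are illustrative, and dCor merely stands in for the sup-HSIC measure used by Balasubramanian et al. (2013).

```python
import numpy as np

# Marginal dependence screening: score each input with the (squared)
# empirical distance correlation and rank; toy model for illustration.
def _center(D):
    return D - D.mean(0) - D.mean(1)[:, None] + D.mean()

def dcor2(u, v):
    A = _center(np.abs(u[:, None] - u[None, :]))
    B = _center(np.abs(v[:, None] - v[None, :]))
    return (A * B).mean() / np.sqrt((A * A).mean() * (B * B).mean())

rng = np.random.default_rng(7)
n, p = 500, 10
X = rng.uniform(-1, 1, size=(n, p))
y = X[:, 0] + np.cos(np.pi * X[:, 1]) + 0.1 * rng.normal(size=n)  # X0, X1 active

scores = np.array([dcor2(X[:, j], y) for j in range(p)])
ranked = np.argsort(scores)[::-1]     # candidate ordering for screening
```

In this hypothetical setup the two active inputs, one of them acting through an even (uncorrelated) nonlinearity, rise above the noise level of the inactive ones, illustrating why such measures are attractive for cheap preliminary screening.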
Instead of working with forward and backward approaches, Yamada et al. (2013) propose a combination of the Lasso and the HSIC dependence measure. Denote $\tilde K^{(i)}$ for $i = 1, \dots, p$ and $\tilde L$ the centered Gram matrices computed from a sample of $(X_1, \dots, X_p, Y)$, following the notations of Section 3.3. The HSIC Lasso solves the following optimization problem:
$$\min_{\beta \in \mathbb{R}^p} \frac{1}{2} \left\| \tilde L - \sum_{i=1}^p \beta_i\, \tilde K^{(i)} \right\|_F^2 + \lambda \sum_{i=1}^p |\beta_i| \quad (25)$$
with constraints $\beta_1, \dots, \beta_p \geq 0$, where $\| \cdot \|_F$ stands for the Frobenius norm and $\lambda > 0$ is a regularization parameter. Interestingly, the first term of equation (25) expands as
$$\frac{1}{2} \sum_{i, j = 1}^p \beta_i\, \beta_j\, \mathrm{HSIC}_n(X_i, X_j) - \sum_{i = 1}^p \beta_i\, \mathrm{HSIC}_n(X_i, Y) + \frac{1}{2}\, \mathrm{HSIC}_n(Y, Y),$$
using that $\tilde K^{(i)}$ and $\tilde L$ are symmetric and $H$ is idempotent, which highlights the strong correspondence with the mRMR criterion (24). The authors show that (25) can be recast as a standard Lasso program and propose a dual augmented Lagrangian algorithm to solve the optimization problem. They also discuss a variant based on the $I^{NOCCO}$ dependence measure.
We mentioned before that feature selection techniques based on dependence measures have been particularly designed for the ultra-high dimensional case, which is not the common setting of screening problems in GSA. Nevertheless, we illustrate in Section 4 that they perform remarkably well on complex benchmark functions, while requiring very few samples of the output variable. This reveals their high potential for preliminary screening of expensive computer codes.
4 Numerical experiments
In this section, we finally assess the performance of all the new sensitivity indices introduced before on a series of benchmark analytical functions and two industrial applications. All benchmark functions can be found in the Virtual Library of Simulation Experiments available at http://www.sfu.ca/~ssurjano/index.html. For easier comparison, we first summarize the proposed indices in Table 1 (SI stands for sensitivity index; see Tarantola et al. (2006) for RBD-FAST).
| Index | Description | Comment |
| --- | --- | --- |
| $S_{X_i}$ | Sobol first-order SI | Normalized version (RBD-FAST est.) |
| $S_{T_i}$ | Sobol total SI | Normalized version (RBD-FAST est.) |
| $S_{X_i}^f$ (eq. (3)) | Csiszár f-divergences | Includes Borgonovo (2007) as a special case |
| $S_{X_i}^{MI}$ (eq. (8)) and $S_{X_i}^{SMI}$ (eq. (9)) | Mutual information and squared-loss MI | |
| $S_{X_i}^{dCor}$ (eq. (16)) | Distance correlation | |
| $S_{X_i}^{PF}$ (eq. (17)) | Pick-and-freeze distance correlation | Generalization of Janon et al. (2013) |
| $S_{X_i}^{HSIC}$ (eq. (19)) | HSIC | Can be extended to categorical/functional data |
| $S_{X_i}^{PF, k}$ (eq. (20)) | Pick-and-freeze HSIC | Generalization of Janon et al. (2013) |