1 Introduction
Global sensitivity analysis (GSA) aims to quantify the relative importance of input variables or factors in determining the value of a function [41, 21]. It has been used widely for analysis of parameter uncertainty in mathematical models and simulations [41, 18]. In particular, GSA may be used to improve modeling insight, encourage model parsimony, and accelerate the modelfitting process. In this paper we propose a novel method for GSA of parameters in statistical models. To our knowledge, the GSA tools developed for mathematical models and simulations have not been systematically developed for analysis of statistical models. The combination of parameter correlation, model stochasticity, multivariate model output, and having an unknown parameter distribution prohibits a direct application of GSA tools to statistical models. Nevertheless, problem structure in statistical models may be exploited to enable efficient GSA. This paper provides a framework to use existing GSA tools along with tools from statistics to address these challenges and yield a new GSA approach for analysis of statistical models.
This work is motivated by a statistical model that fuses two datasets of atmospheric wind speed in order to provide statistical prediction of wind speed in space and time [1]. The predictions are generated from a Gaussian process whose mean and covariance are parameterized through a large number of parameters, which are determined by numerical optimization. Because of changing weather patterns, the parameters must be reoptimized on a regular basis. Our objective in this application is to reduce the dimension of the parameter space making the model easier to fit and interpret. The optimization procedure to fit parameters is an important problem feature influencing our approach to GSA. Our method is developed in an abstract setting and subsequently used to analyze the Gaussian process wind speed model.
There are a variety of methods under the umbrella of GSA, the most common is variancebased
[45, 44, 40, 37]. For reasons of computational efficiency, derivativebased methods [27, 46, 47, 26, 39, 19] have also gained attention recently; they are related to the classical Morris method [32]. Theoretical challenges in the aforementioned methods has motivated interest in alternatives such as momentindependent importance measures
[4, 3, 2], Shapley effects [35, 48, 20, 34], and dependence measures [10]. In this paper we propose a derivativebased strategy for GSA of statistical models. In principle, any of the aforementioned methods may be used. Computational considerations make derivativebased methods preferable. In particular, the number of gradient evaluations is independent of the parameter space dimension (evaluating the gradient may depend on the dimension but can frequently be evaluated efficiently) and they do not require sampling from conditional distributions.To perform GSA, a probability distribution must be defined on the input space, and the analysis is done with respect to it. GSA has been well developed and studied for problems where the inputs are independent. In many statistical models, however, the inputs (parameters) are correlated, thus posing additional challenges to traditional GSA tools. Developing GSA tools for problems with correlated inputs is an active area of research
[51, 50, 28, 29, 7, 8, 3, 52, 2, 48, 4, 34, 35, 20]. In addition to the theoretical and computational challenges posed by input correlations, simply defining or fitting a joint distribution on the inputs may be challenging in the context of statistical models.
The traditional GSA framework has focused on realvalued deterministic functions
with uncertain inputs. A spacetime Gaussian process, as in our motivating application, is not a deterministic realvalued function but rather a stochastic vectorvalued function, i.e.
, where is a set of random vectors. Realvalued stochastic processes are considered in [17, 30] and vectorvalued deterministic models in [11, 31]; generalizing GSA tools to stochastic and/or vectorvalued functions is an area of ongoing research. In principle, these approaches may be used to compute sensitivities of a statistical model with respect to its parameters; however, treating them as generic stochastic processes fails to exploit important structure in statistical models. Sensitivity analysis of statistical surrogate models is considered in [33, 23]; however, they focus on the sensitivity of model inputs instead of model parameters. The work of [42] considers the sensitivity of a Gaussian process model to changes in the prior and correlation function. To the author’s knowledge, the approach proposed in this article is the first to formally apply methods from the global sensitivity analysis literature to analyze the influence of statistical model parameters.This article provides a framework to connect the mathematical and statistical tools needed to efficiently extend GSA to statistical models. We use the loss function associated with the statistical model parameter estimation to define a joint probability distribution, which respects the correlation structure in the problem. This distribution is sampled using a Markov Chain Monte Carlo method, and derivativebased sensitivity indices of the loss function are computed from these samples. In this framework, we are able to discover both sensitivities and correlation structures without requiring a priori knowledge about the parameters.
Our framework requires efficient evaluations of the loss function’s gradient. In fact, it is designed to exploit efficient gradient evaluations, as is the case in our motivating application. Derivativebased methods are typically limited to identifying unimportant parameters as they may fail to capture the relative importance of the most influential parameters. Though generally undesirable, this is permissible when seeking model simplification and/or parameter dimension reduction as identifying unimportant parameters is the primary focus.
We provide definitions and construct the abstract framework in Section 2. Section 3 details the computational methodology and summarizes the proposed method. In Section 4, we present numerical results for a synthetic test problem and for our motivating application. Section 5 provides a brief summary of our conclusions.
2 Preliminaries
Let be a statistical model defined through parameters
, . Let be a loss function (or cost function) associated with ;
that is, to fit to the data , one computes .
In the modeloriented context of statistics, parameters of the model are usually estimated by minimizing computed on observed data .
The loss function is chosen given the model and its intended use. Common examples are leastsquares and maximum likelihood. For simplicity, will be used instead of for the remainder of the paper.
In our motivating application, is a Gaussian process with
mean and covariance depending on a vector of
parameters and is the the negative log likelihood. We assume that may be computed efficiently, as is the case in our motivating application.
We are interested in the global sensitivity of with respect to . Since may be a complex mathematical object, we propose to analyze the global sensitivity of to through the global sensitivity of to . This makes our analysis dependent on the choice of loss function, which is consistent with a goal oriented choice of . This is appropriate since encodes the dependence of on . Further, since is a deterministic realvalued function of , it is frequently easier to analyze than .
One challenge is that the statistical model may be mathematically well defined for parameters but yield a practically irrelevant solution in the context of a given application. To avoid this scenario, we let be the subset that restricts to parameters yielding relevant solutions. For instance, a quantity in may be required to be nonnegative so restricts to parameters respecting this constraint. We assume that is a Lebesgue measurable set; this is easily verified in most applications.
We make three assumptions about ; formally, they are expressed as the following.

is differentiable.

such that , , and
, , exist and are finite. 
such that , .
Assumption I is necessary since we seek to use a derivativebased GSA. This assumption is easily verifiable in most cases. Assumption II is needed so that global sensitivity indices (3) are well defined. Assumption III is a needed technical assumption requiring that the loss function be bounded below. Note that if is continuously differentiable and is compact, then all three assumptions follow immediately; this is a common case.
To define global sensitivity indices, we must specify a probability measure to integrate against. Let
(1) 
for some and ;
is the characteristic function of a set, and
is the Euclidean norm. Note that is defined through constraints on so it is generally difficult to express in terms of simple algebraic constraints. In most cases, however, the constraints may be checked when is evaluated, and hence is easily evaluated through evaluating .From Assumption II and the fact that , it follows that
is integrable. We define the probability density function (PDF) as
(2) 
Then is supported on and gives the greatest probability to regions where is close to its minimum namely, where is a good fit. A PDF of this form corresponds to a Gibbs measure [24] with temperature ; the temperature determines how the probability mass disperses from the modes. The scalar is a regularization factor that aids when is too heavy tailed; this is illustrated in Section 4. The determination of and is considered in Section 3
. Our formulation shares similar characteristics to Bayesian inference. For instance, if
is a negative log likelihood and then is the posterior PDF of using a Gaussian prior truncated to (or when, the prior is simply a uniform distribution on
).Definition 1
Let the sensitivity index of with respect to be defined as
(3) 
Derivativebased sensitivity indices are commonly defined in the literature by taking the expected value of the partial derivative squared. The absolute value is used here because and its derivatives typically become large for parameters with low probability, so squaring the partial derivative results in the low probability realizations making larger contributions to (3). Since depends on the units of , it will be difficult to compare partial derivatives when parameters are on multiple scales. Typically one would rescale parameters a priori to avoid this issue, but this is difficult to do in our context. Multiplying by yields scaleinvariant global sensitivity indices. Sensitivity indices for groups of variables [47] may also be defined in our framework, but are not considered in this paper.
Correlations in make Monte Carlo integration with uniform sampling intractable for computing the ’s. Importance sampling may be used if an efficient proposal distribution is found; however, this is also challenging in most cases. Therefore, we propose to compute the ’s with Markov Chain Monte Carlo (MCMC) methods.
3 Computing sensitivities
In this section we present the main result of this study. The proposed method may be partitioned into three stages:

Preprocessing where we collect information about the loss function,

Sampling where samples are drawn from the probability measure (2),

Postprocessing where sensitivities as well as additional information are computed.
In the preprocessing stage we seek to find characteristic values for the parameters and the loss function. These characteristic values are used to determine the temperature and regularization factor in the PDF (2).
In the sampling stage we first determine the temperature and regularization factor. Subsequently an MCMC sampler is run to collect samples from (2).
In the postprocessing stage we compute sensitivities by evaluating the gradient of the loss function at the samples drawn in the sampling stage ii. In addition, the robustness of the sensitivities with respect to perturbations in the temperature and the parameter correlations are extracted from the existing samples and gradient evaluations. These two pieces of information are byproducts of computing sensitivities and require no additional computation.
These three stages are described in Subsections 3.1, 3.2, and 3.3, respectively. The method is summarized as a whole in Subsection 3.4.
3.1 Preprocessing stage
Characteristic magnitudes for and are needed to determine the regularization factor and temperature. To this end we introduce two auxiliary computations as a preprocessing step.
The first auxiliary computation runs an optimization routine to minimize ; the choice of optimizer is not essential here. Let be the minimizing parameter vector. For our purposes it is acceptable if is not the global minimizer of as long as it is sufficiently close to capture characteristic magnitudes of in regions of good fit.
The second auxiliary computation uses to determine the range of loss function values that our MCMC sampler should explore. Let , and let be a random vector defined by
(4) 
where all the ’s are independent of one another and denotes the uniform distribution. Hence, represents uniform uncertainty of about .
Determining is an applicationdependent problem. In fact, its determination is the only portion of our proposed method that cannot be automated. To choose , we suggest fixing a value for , sampling from , and assessing the quality of ’s predictions using the sample. Repeating this sample and assessment process for various values of allow the user to determine a that yields reasonable predictions. This step is highly subjective and application dependent; however, it is a very natural means of inserting user specification. One simple way to do this is visualizing the model output for each sample and increasing until the outputs become unrealistic.
Taking large values for will result in the PDF giving significant probability to regions of the parameter space yielding poor fits, and thus hence sensitivity indices that are not useful. Taking small values for will result in the PDF giving significant probability to regions of the parameter space near local minima, thus making the sensitivity indices local. Since the choice of is strongly user dependent, the robustness of the sensitivity indices with respect to perturbations in is highly relevant; this is indirectly addressed by Theorem 3.2.
Once is specified, then a threshold , which is used to compute the regularization factor and temperature (see Subsection 3.2.1 and Subsection 3.2.2), may be easily computed via Monte Carlo integration. We define the threshold
(5) 
Note that the expectation in (5) is computed with respect to the independent uniform measure; all other expectations in the paper are computed with respect to the PDF (2).
3.2 Sampling stage
We use an MCMC method to sample from (2) through evaluations of the unnormalized density (1). Then the ’s may be computed through evaluations of at the sample points. Many MCMC methods may be used to sample ; see, for example [43, 38, 49, 14, 12, 16].
Determining which MCMC method to use and when it has converged may be challenging. Convergence diagnostics [9, 6] have been developed that may identify when the chain has not converged; however, they all have limitations and cannot ensure convergence [13]. In Section 4, adaptive MCMC [49] is used with the convergence diagnostic from [5].
Assuming that an MCMC sampler is specified, we focus on determining the temperature and regularization factors in Subsections 3.2.1 and 3.2.2, respectively.
3.2.1 Determining the regularization factor
To determine the regularization factor , consider the function
The PDF gives greatest probability to regions where is small. If is small relative to , then the local minima of are near the local minima of . Ideally we would like , but in some cases this results in being too heavy tailed. Instead we may require that for some ; that is, the regularization term contributes percent of the value of . Setting and replacing and with and , we get
(6) 
In practice we suggest beginning with . If the MCMC sampler yields heavytailed distributions that converge slowly, then may be increased to aid the convergence. This case is illustrated in Section 4.
3.2.2 Determining the temperature
To determine the temperature , we first define
We seek to find so that with probability ; is suggested to mitigate wasted computation in regions where yields a poor fit. Let . We note that is a Lebesgue measurable set since is continuous and is Lebesgue measurable. We define the function by
(7) 
Then gives the probability that . The optimal temperature is the solution of . Four results are given below showing that possesses advantageous properties making the nonlinear equation easily solvable. The proofs of the following propositions are given in the appendix.
Proposition 1
If , then is differentiable on with
(8) 
Proposition 2
If , then is a strictly increasing function on .
Propositions 1 and 2 yield desirable properties of . The assumption that
is integrable is necessary for to be well defined. Note that this assumption follows from Assumption I when is bounded. Theorem 3.1 and Corollary 1 below give existence and uniqueness, respectively, for the solution of under mild assumptions.
Theorem 3.1
If is a bounded set and such that , then such that .
Corollary 1
If , is a bounded set, and such that
, then admits a unique solution.
The assumption that is bounded is reasonable in most applications; may be unbounded, but is restricted to relevant solutions that will typically be bounded. The assumption that means that is not chosen as the global minimum, which should always hold in practice. The assumption that is necessary for existence. Typically is much less than 1, while is chosen close to 1. The assumptions Theorem 3.1 and Corollary 1 hold in most applications
In summary, under mild assumptions is a scalar nonlinear equation admitting a unique solution and possesses nice properties (monotonicity and differentiability). Further, and may be approximated simultaneously by running MCMC. The challenge is that evaluating and in high precision requires running a long MCMC chain. In fact, is significantly more challenging to evaluate than . For this reason we suggest using derivativefree nonlinear solvers which will still be efficient since is a wellbehaved function. In the spirit of inexact Newton methods [22], shorter chains may be run for the early iterations solving and the precision increased near the solution. In practice, relatively few evaluations of are needed because of its properties, shown above.
As previously highlighted, the PDF (2) corresponds to a Bayesian posterior PDF when is a negative log likelihood and . If , our GSA approach uses a “flatter” PDF than the Bayesian posterior. Our determination of incorporates information from the parameter optimization procedure to ensure that our sensitivity analysis searches the parameter space in which the optimization routine traverses.
3.3 Postprocessing stage
Having attained samples from (2), the sensitivities (3) may be estimated by evaluating at the sample points and forming the Monte Carlo estimator for the expectations in (3). In addition to computing these sensitivities, we may extract two other useful pieces of information, namely, the robustness of the sensitivities with respect to perturbations in the temperature and the parameter correlations. These are described in Subsections 3.3.1 and 3.3.2, respectively.
3.3.1 Robustness with respect to the temperature
As a result of the uncertainty in the determination of (computation of , choice of , estimation of , solution of ), we analyze the robustness of the sensitivities with respect to . Consider the functions
; clearly . Theorem 3.2 gives the derivative of the sensitivity index with respect to the temperature , namely, .
Theorem 3.2
If , , and
exist and are finite, then is differentiable with
(9) 
where is the covariance operator.
Theorem 3.2 allows to be computed from the samples and function evaluations used to compute . For small , so the robustness of may be estimated without any further computational expense.
Since the magnitude of may depend on , it is useful to normalize for each when assessing robustness. We define
(10) 
which may be plotted for to assess robustness. Since this is only a local estimate we suggest taking , reflecting a uncertainty about .
The user may interpret as the local sensitivity of (3) with respect to . Because of the several sources of uncertainty in , it is desirable to have a global sensitivity of (3) with respect to ; however, this would require significantly more computational effort. Nonetheless, locality in does not diminish the value of (3) as a global sensitivity index, and it provides useful information about (3) at a negligible computational cost.
3.3.2 Extracting parameter correlations
Parameters are typically correlated, and the correlation information is a valuable complement to the sensitivity indices. For instance, if is sensitive to two parameters that are highly correlated, then it may be possible to remove one of them from since the other may compensate. In addition, the correlations may reveal parameter misspecifications in .
The strength and nature of the correlations in are typically not known a priori. Correlation coefficients may be computed from the MCMC samples and returned as a byproduct of computing sensitivity indices. The Pearson correlation coefficient is commonly used to measure correlations from sampled data. Other measures of correlation may be interchanged within our framework as well.
3.4 Summary of the method
This subsection summarizes our proposed method. The method is divided into three algorithms, one for each stage described in Section 3.
Algorithm 1 performs the auxiliary computations of Subsection 3.1. Note that determining in line 2 is the only applicationspecific portion of the proposed method; user discernment is necessary to choose .
Algorithm 2 requires the user to specify the parameter from Subsection 3.2.1, the parameter from Subsection 3.2.2, and the number of MCMC samples . We suggest starting with and rerunning Algorithm 2 with a larger if the convergence results indicate that the PDF is heavy tailed. Hence may be viewed as a computed quantity rather than one specified by the user. As mentioned in Subsection 3.2.2, we suggest using . It may be considered fixed and the user only needs to change it if they have a specific purpose which requires giving more weight to “poor” parameter choices. The choice of may be difficult; however, more samples may be appended after an initial run so can be adapted without any wasted computation.
4 Numerical results
In this section we apply the proposed method to two problems. The first is a synthetic test problem meant to illustrate the methodological details described in Section 3. The second is our motivating application where is a spacetime hierarchical Gaussian process used for wind speed forecast [1].
4.1 Synthetic test problem
This synthetic problem illustrates the proposed method of GSA and its properties on a simple example with least squares estimation. We demonstrate the difficulty of MCMC sampling with heavy tailed distributions and how the regularization factor (6) alleviates this problem.
Mimicking characteristics of our motivating application, we consider a spacetime process governed by the function
(11) 
where
with , , and
(12) 
We draw samples from (11) on a uniform grid of , which gives data
A model parameterized in the same form as (11) is proposed, but the parameters are assumed to be unknown. They are determined by minimizing the least squares loss function
Least squares estimates are generally used as initial conditions for maximum likelihood optimization. This motivates a least squares formulation in this example as a simplification of the loss function in our motivating application.
Analytic solutions for the sensitivities are intractable; however, we can validate our results by comparing them with our knowledge of the true model that generated the data. In particular, the relative importance of the parameters is clear by examining (11) and (12). We expect to be the most important parameter and and to be the least important parameters.
The proposed method is used with , , , and . Five independent chains are generated from overdispersed initial iterates using adaptive MCMC [49]. When , the MCMC sampler fails to converge because the tail of is too heavy. To illustrate this, Figure 1 shows the iteration history for the parameter in each of the five chains after a burnin period is discarded. The two leftmost frames indicate that is heavy tailed; the other three chains never reach the tail. A heavytailed PDF such as this requires extensive sampling, which makes the reliable computation of sensitivity indices intractable. Therefore, we use regularization to alleviate this problem by increasing as we monitor the sampler’s convergence. We find that yields converged chains with samples. The chains are deemed convergent by using the potential scale reduction factor (PSRF) [5] as well as visualizing the iteration histories and histograms from each of the five chains.
Plotting the iteration history of indicates that a burnin of is sufficient. Then the remaining samples from the five chains are pooled together so that sensitivities and correlations may be computed from them. Figure 2 shows the sensitivity indices and Pearson correlation matrix computed from the pooled samples. These results are consistent with our expectations, is seen as the most important parameter and as the least important. Two primarily blocks are seen in the correlation plot representing the set of spatial variables and the set of temporal variables. Negative correlations are observed on the off diagonal blocks since the spatial and temporal variables are multiplied by one another and hence are inversely related.
4.2 Analysis of a spacetime Gaussian process
In this section we apply the proposed method to analyze the motivating statistical model [1]. The model aims at forecasting wind speed by fusing two heterogeneous datasets: numerical weather prediction model (NWP) outputs and physical observations. Let denote the output of the NWP model and denote the observed measurements. They are modeled as a bivariate spacetime Gaussian process specified in terms of mean and covariance structures as follows,
(13) 
where is the set of parameters that describe the shapes of the means and covariances. The model is expressed in a hierarchical conditional manner to avoid the specification of the full joint covariance in (13), indeed the mean and covariance of the distributions and are specified in time, geographical coordinates, and parameters from the numerical model (the landuse parameter). More precisely,
(14) 
where is a spatial location, is time, represents a sum of temporal harmonics with daily, halfdaily and 8hrperiodicities,
is a categorical variable that represents the landuse associated with location
, and are the latitude and longitude coordinates.(15) 
, are temporal squared exponential covariances expressed as
where is the Kronecker delta, , , and are parameters to be estimated. are landuse specific terms, and is linear in the latitude and longitude coordinates and quadratic in time. The parameters , , , along with the parameters of , and , will be estimated during the maximum likelihood procedure. We will denote the collection of all these parameters by .
The conditional distribution is expressed through its mean and covariance:
where is written similarly to as a product of temporal harmonics and a linear combination of the coordinates latitude and longitude. is a projection matrix specified in time, latitude, longitude and the landuse parameter. The covariance is parametrized with a similar shape to (15) with a different set of parameters. Parameters of these functions are denoted as in the following and will estimated by maximum likelihood.
This model is fitted by maximum likelihood on the two datasets with respect to the parameters . The negative log likelihood of the model can be decomposed as
(17) 
where and are the negative log likelihoods for the marginal distribution of and the conditional distribution , respectively. Since the model decomposes in this way, we will consider analysis of the parameters in and separately. Our dataset consists of 27 days of measurements from August 2012; details may be found in [1]. The parameter sensitivity during the first 13 days for and is analyzed in Subsection 4.2.1 and Subsection 4.2.2, respectively. Inferences are drawn from this analysis and validated using the later 14 days of data in Subsection 4.2.3.
4.2.1 Parameter sensitivity analysis for
In this subsection we apply the proposed method to determine the sensitivity of the marginal model for to its 41 parameters during a 13day period. The sensitivities being computed are with respect to the parameters in , but for notational simplicity we will denote them by in this subsection.
In order to determine (line 1 of Algorithm 1), the LBFGSB algorithm is used to minimize . Visualizing the model predictions for various choices of yields (line 2 of Algorithm 1). Then (5) is estimated with Monte Carlo samples (line 3 of Algorithm 1). It returns an estimate
with standard deviation 2; hence
is considered to be sufficiently large. These steps complete the preprocessing stage by providing characteristic values for the parameters and the loss function .The PDF is found to be heavy tailed, so is chosen to reduce this effect. Then the equation is solved with by evaluating and manually updating . The solution is obtained. This converged in very few iterations because of the nice properties of the equation . Any other derivativefree nonlinear solver may be used in our framework; however, manual tuning is preferable in many cases because of the simplicity of the equation and the stochasticity of the function evaluations. Having determined and , the PDF from which we draw samples is now well defined.
Adaptive MCMC [49] is used with a desired acceptance rate of . Five chains of length each are generated independently from overdispersed initial iterates, and the first iterates are discarded as burnin. The PSRF convergence diagnostic from [5] is used on and separately to assess the convergence of each. The PSRFs for all parameters lie in the intervals and , respectively. Other visual diagnostics are applied as well, along with comparing sensitivity indices from each of the chains. The sensitivity estimation appears to converge.
Figure 4 displays the sensitivity indices estimated from each of the chains. The five different colors represent the five different chains; their comparability demonstrates that MCMC estimation errors are negligible. The intercept terms in the mean and the covariance kernel parameters are the most influential. The terms parameterizing are less influential, particularly the quadratic temporal terms.
As discussed in Subsection 3.3.1, the robustness of the sensitivities with respect to errors in may be estimated as a byproduct of computing sensitivities. Figure 5 displays (10) plotted for , . Most of the curves are nearly horizontal, and those that not horizontal display small variation that does not change the resulting inference. Thus the sensitivities are robust with respect to , and hence any errors made when determining are negligible.
As mentioned in Subsection 3.3.2, parameter correlation information is a useful complement to sensitivity indices. Figure 6 displays the empirical Pearson correlation matrix computed from the MCMC samples retained after removing burnin and pooling the chains. Strong positive correlations are observed between the three landuse dependent spatial intercepts in the mean. Strong negative correlations are observed between the temporal range and the nugget term parameterizing the landuse specific covariance kernels . This correlation is expected since the nugget term represents the variance of the signal that is not explained by the exponential part.
4.2.2 Parameter sensitivity analysis for
In this subsection we apply the proposed method to determine the sensitivity of the model for to its 54 parameters during the same 13day period used in Subsection 4.2.1. The sensitivities being computed are with respect to the parameters in , but for notational simplicity we will denote them by in this subsection.
In a similar fashion to Subsection 4.2.1, the LBFGSB algorithm is used to determine (line 1 of Algorithm 1). Visualizing the model predictions for various choices of yields (line 2 of Algorithm 1). Then M (5) is estimated with Monte Carlo samples (line 3 of Algorithm 1). It returns an estimate with standard deviation 2; hence is considered to be sufficiently large. These steps complete the preprocessing stage by providing characteristic values for the parameters and the loss function . One may note that and are significantly smaller for than for . Their difference is unsurprising since and model two different processes.
The PDF is found to be heavy tailed, so is chosen to reduce the tail of . Analogously to Subsection 4.2.1, is solved yielding (line 3 of Algorithm 2).
Adaptive MCMC [49] is used with a desired acceptance rate of . Five chains of length each are generated independently from overdispersed initial iterates, and the first iterates are discarded as burnin. The convergence diagnostic from [5] is used on and separately to assess the convergence of each. The PSRFs for all parameters lie in the intervals and , respectively. Other visual diagnostics are applied as well, along with comparing sensitivity indices from each of the chains. A few sensitivity indices have not fully converged; however, the remaining uncertainty in their estimation is sufficiently small for our purposes. These uncertain sensitivities are among the largest in magnitude. Since our goal is encouraging model parsimony, then precisely ordering the most influential parameters is of secondary importance.
Figure 7 displays the sensitivity indices estimated from each of the chains. The five different colors represent the five different chains. The sensitivities with greatest uncertainties are demonstrated by the differences in their estimated values in each chain; however, these discrepancies are sufficiently small that they do not alter our resulting inference. The longitudinal terms in the mean and are observed to have little influence since their sensitivity indices are nearly zero. The most influential parameters are the spatial weights in the matrix which acts on .
As discussed in Subsection 3.3.1, the robustness of the sensitivities with respect to errors in may be estimated as a byproduct of computing sensitivities. Figure 8 displays (10) plotted for , . Most of the curves are nearly horizontal, and those that are not horizontal display small variation that does not change the resulting inference.
As mentioned in Subsection 3.3.2, parameter correlation information is a useful complement to sensitivity indices. Figure 9 displays the empirical Pearson correlation matrix computed from the MCMC samples retained after burnin and pooling the chains. Strong correlations are observed between the spatial weights in the matrix . Similar to , strong negative correlations are also observed between the temporal range and the nugget term of the landuse specific covariance kernels.
4.2.3 Inference and validation of results
In some cases with mathematical models one may set a threshold and fix all parameters whose sensitivity is below the threshold. This approach is not suitable for statistical models because the parameters must be understood in light of their contribution to the model structure and their correlation with other parameters. Rather, the sensitivity indices and correlation structures should be used to reparameterize the statistical model in a simpler way. For instance, if a collection of spatially dependent parameters are all found to be unimportant then the user may consider replacing them by a single parameter which is not spatially dependent.
Using the results of Subsection 4.2.1, and considerations of the model structure, we determine that the model is insensitive to the temporal quadratic terms in the parameterization of . Similarly, coupling the results of Subsection 4.2.2, and the model structure, we determine that the model is insensitive to several of the longitude terms. Specifically, the longitude term in the parameterization of the mean and the nine longitude terms in the parameterization of . This conclusion confirms what one would expect from the physics considerations. The flow is predominantly eastwest and thus the northsouth correlation is relatively weaker. The eastwest information is likely to be well captured by and hence is not needed in .
The 16 insensitive parameters are removed from the model, yielding a more parsimonious model that we refer to as the reduced model. To validate our inferences, we use the original model and the reduced model for prediction on the other 14 days of data we have available but did not use in the sensitivity analysis. Leaveoneout crossvalidation is used to fit each of the models and assess their predictive capabilities.
We simulated 1,000 scenarios for each of the 14 days and use two metrics to quantify the predictive capacity of the full and reduced models, namely, the energy score and the root mean square error. The energy score^{1}^{1}1, where and are independent predictive scenarios with [15, 36] is a measure of distance between the distribution used to generate the scenarios and the observed data on a fixed day; hence 14 energy scores are computed (one for each day). The root mean square error^{2}^{2}2 is computed as the square root of the time average squared error between the average of scenarios and the observation at each spatial location; hence, there are 11 root mean square errors. Figure 10 displays the energy scores on the left and root mean square errors on the right. The reduced model has slightly smaller (and hence better) energy scores and root mean square errors in the majority of the cases. The sum of energy scores for full model and reduced model is 121.1 and 120.7, respectively. The sum of root mean squared errors for the full model and reduced model is 11.4 and 11.3, respectively.
To further illustrate the difference between the full and reduced models, we display the simulated scenarios for a typical case. Specifically, we take the spatial location with median root mean square error and six days with median energy score and plot the 1,000 scenarios along with the observed data. Figure 11 displays the results with the full model on the left and reduced model on the right.
We have thus simplified the parameterization of the model from having 95 parameters to 79. The reduced model has equal or better predictive capability and is simpler to fit and analyze. Further, the reduced model typically has fewer outlying scenarios, as evidenced in Figure 11.
For this application we observed a strong insensitivity of to some of its terms contributing longitudinal information. This is likely because the model captures longitudinal information well and hence the Gaussian process does not need to fit terms contributing longitudinal information. Thus, inferences may also be made on the data being input to the statistical model through the parameter sensitivities.
These inferences and simplifications are useful for multiple reasons. First, long term weather prediction is difficult so the parameters must be optimized frequently to accommodate changing weather patterns. Hence a large optimization problem must be solved frequently, reducing the number of parameters allows for faster and more robust optimization. Second, by removing unimportant parameters the model is more robust and can be more easily integrated into a larger workflow, namely power scheduling. Third, we are able to learn more about the underlying system through these inferences, for instance, the unimportance of longitudinal information.
5 Conclusion
A new method for global sensitivity analysis of statistical model parameters was introduced. It addresses the challenges and exploits the problem structure specific to parameterized statistical models. The method is nearly fully automated; one step depends on the user’s discretion, but this level of user specification is likely necessary for any alternative method. The proposed method also admits perturbation analysis at no additional computational cost, thus yielding sensitivities accompanied with certificates of confidence in them.
The method was motivated by, and applied to, a Gaussian process model aimed at wind simulation. Sensitivities were computed and the model parameterization simplified by removing 17% of the model parameters. This simpler model was validated and shown to provide equal or superior predictive capability compared with the original model.
Our proposed method has two primary limitations. First, it relies heavily on Markov Chain Monte Carlo sampling for which convergence diagnostics are notoriously challenging. Second, regularization may be needed to eliminate heavytailed distributions. Determining the regularization constant is simple in principle but may require drawing many samples to resolve. However, these limitations are classical and have been observed in various applications previously.
Global sensitivity analysis has seen much success analyzing parameter uncertainty for mathematical models. The framework presented in this paper provides the necessary tools to extend global sensitivity analysis to parameters of complex statistical models. To the authors knowledge, our approach is the first to systematically combined tools from mathematics and statistics to facilitate efficient global sensitivity analysis. There are a plurality of other approaches one can consider, which may yield similar or different results than our proposed method. Utilizing the loss function and its gradient is critical for our method’s efficiency. Future work may include a broader exploration of how our approach compares with possible alternative methods, particularly those which have less dependence on the loss function. Analysis for groups of parameters may also be a useful extension of this work.
Appendix
This appendix contains the proofs for Proposition 1, Proposition 2, Theorem 3.1, Corollary 1, and Theorem 3.2.
Proof of Proposition 1
Let and . Then
By using Theorem 6.28 in [25] with dominating , we have that and are differentiable with
is differentiable since , , and applying the quotient rule for derivatives gives
Simple manipulations yields
Writing and using the linearity of the integral completes the proof.
Comments
There are no comments yet.