# Global sensitivity analysis for statistical model parameters

Global sensitivity analysis (GSA) is frequently used to analyze the influence of uncertain parameters in mathematical models and simulations. In principle, tools from GSA may be extended to analyze the influence of parameters in statistical models. Such analyses may enable reduced or parsimonious modeling and greater predictive capability. However, difficulties such as parameter correlation, model stochasticity, multivariate model output, and unknown parameter distributions prohibit a direct application of GSA tools to statistical models. We introduce a novel framework that addresses these difficulties and enables GSA for statistical model parameters. Theoretical and computational properties are considered and illustrated on a synthetic example. The framework is applied to a Gaussian process model from the literature, which depends on 95 parameters. Non-influential parameters are discovered through GSA and a reduced model with equal or stronger predictive capability is constructed by using only 79 parameters.

## Authors

• 6 publications
• 8 publications
• 5 publications
12/29/2021

### PowerGraph: Using neural networks and principal components to multivariate statistical power trade-offs

It is increasingly acknowledged that a priori statistical power estimati...
03/03/2020

### Assessment of WRF model parameter sensitivity for high-intensity precipitation events during the Indian summer monsoon

Default values for many model parameters in Numerical Weather Prediction...
10/26/2021

### Global sensitivity analysis of rare event probabilities

By their very nature, rare event probabilities are expensive to compute;...
03/27/2021

### Multivariate Gaussian Process Incorporated Predictive Model for Stream Turbine Power Plant

Steam power turbine-based power plant approximately contributes 90 total...
07/14/2018

### Global sensitivity analysis of frequency band gaps in one-dimensional phononic crystals

Phononic crystals have been widely employed in many engineering fields, ...
04/08/2020

### Statistical identification of penalizing configurations in high-dimensional thermalhydraulic numerical experiments: The ICSCREAM methodology

In the framework of risk assessment in nuclear accident analysis, best-e...
06/10/2020

### Condensing Two-stage Detection with Automatic Object Key Part Discovery

Modern two-stage object detectors generally require excessively large mo...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Global sensitivity analysis (GSA) aims to quantify the relative importance of input variables or factors in determining the value of a function [41, 21]. It has been used widely for analysis of parameter uncertainty in mathematical models and simulations [41, 18]. In particular, GSA may be used to improve modeling insight, encourage model parsimony, and accelerate the model-fitting process. In this paper we propose a novel method for GSA of parameters in statistical models. To our knowledge, the GSA tools developed for mathematical models and simulations have not been systematically developed for analysis of statistical models. The combination of parameter correlation, model stochasticity, multivariate model output, and having an unknown parameter distribution prohibits a direct application of GSA tools to statistical models. Nevertheless, problem structure in statistical models may be exploited to enable efficient GSA. This paper provides a framework to use existing GSA tools along with tools from statistics to address these challenges and yield a new GSA approach for analysis of statistical models.

This work is motivated by a statistical model that fuses two datasets of atmospheric wind speed in order to provide statistical prediction of wind speed in space and time [1]. The predictions are generated from a Gaussian process whose mean and covariance are parameterized through a large number of parameters, which are determined by numerical optimization. Because of changing weather patterns, the parameters must be re-optimized on a regular basis. Our objective in this application is to reduce the dimension of the parameter space making the model easier to fit and interpret. The optimization procedure to fit parameters is an important problem feature influencing our approach to GSA. Our method is developed in an abstract setting and subsequently used to analyze the Gaussian process wind speed model.

There are a variety of methods under the umbrella of GSA, the most common is variance-based

[45, 44, 40, 37]. For reasons of computational efficiency, derivative-based methods [27, 46, 47, 26, 39, 19] have also gained attention recently; they are related to the classical Morris method [32]

. Theoretical challenges in the aforementioned methods has motivated interest in alternatives such as moment-independent importance measures

[4, 3, 2], Shapley effects [35, 48, 20, 34], and dependence measures [10]. In this paper we propose a derivative-based strategy for GSA of statistical models. In principle, any of the aforementioned methods may be used. Computational considerations make derivative-based methods preferable. In particular, the number of gradient evaluations is independent of the parameter space dimension (evaluating the gradient may depend on the dimension but can frequently be evaluated efficiently) and they do not require sampling from conditional distributions.

To perform GSA, a probability distribution must be defined on the input space, and the analysis is done with respect to it. GSA has been well developed and studied for problems where the inputs are independent. In many statistical models, however, the inputs (parameters) are correlated, thus posing additional challenges to traditional GSA tools. Developing GSA tools for problems with correlated inputs is an active area of research

[51, 50, 28, 29, 7, 8, 3, 52, 2, 48, 4, 34, 35, 20]

. In addition to the theoretical and computational challenges posed by input correlations, simply defining or fitting a joint distribution on the inputs may be challenging in the context of statistical models.

The traditional GSA framework has focused on real-valued deterministic functions

with uncertain inputs. A space-time Gaussian process, as in our motivating application, is not a deterministic real-valued function but rather a stochastic vector-valued function, i.e.

, where is a set of random vectors. Real-valued stochastic processes are considered in [17, 30] and vector-valued deterministic models in [11, 31]; generalizing GSA tools to stochastic and/or vector-valued functions is an area of ongoing research. In principle, these approaches may be used to compute sensitivities of a statistical model with respect to its parameters; however, treating them as generic stochastic processes fails to exploit important structure in statistical models. Sensitivity analysis of statistical surrogate models is considered in [33, 23]; however, they focus on the sensitivity of model inputs instead of model parameters. The work of [42] considers the sensitivity of a Gaussian process model to changes in the prior and correlation function. To the author’s knowledge, the approach proposed in this article is the first to formally apply methods from the global sensitivity analysis literature to analyze the influence of statistical model parameters.

This article provides a framework to connect the mathematical and statistical tools needed to efficiently extend GSA to statistical models. We use the loss function associated with the statistical model parameter estimation to define a joint probability distribution, which respects the correlation structure in the problem. This distribution is sampled using a Markov Chain Monte Carlo method, and derivative-based sensitivity indices of the loss function are computed from these samples. In this framework, we are able to discover both sensitivities and correlation structures without requiring a priori knowledge about the parameters.

Our framework requires efficient evaluations of the loss function’s gradient. In fact, it is designed to exploit efficient gradient evaluations, as is the case in our motivating application. Derivative-based methods are typically limited to identifying unimportant parameters as they may fail to capture the relative importance of the most influential parameters. Though generally undesirable, this is permissible when seeking model simplification and/or parameter dimension reduction as identifying unimportant parameters is the primary focus.

We provide definitions and construct the abstract framework in Section 2. Section 3 details the computational methodology and summarizes the proposed method. In Section 4, we present numerical results for a synthetic test problem and for our motivating application. Section 5 provides a brief summary of our conclusions.

## 2 Preliminaries

Let be a statistical model defined through parameters
, . Let be a loss function (or cost function) associated with ; that is, to fit to the data , one computes . In the model-oriented context of statistics, parameters of the model are usually estimated by minimizing computed on observed data . The loss function is chosen given the model and its intended use. Common examples are least-squares and maximum likelihood. For simplicity, will be used instead of for the remainder of the paper. In our motivating application, is a Gaussian process with mean and covariance depending on a vector of parameters and is the the negative log likelihood. We assume that may be computed efficiently, as is the case in our motivating application.

We are interested in the global sensitivity of with respect to . Since may be a complex mathematical object, we propose to analyze the global sensitivity of to through the global sensitivity of to . This makes our analysis dependent on the choice of loss function, which is consistent with a goal oriented choice of . This is appropriate since encodes the dependence of on . Further, since is a deterministic real-valued function of , it is frequently easier to analyze than .

One challenge is that the statistical model may be mathematically well defined for parameters but yield a practically irrelevant solution in the context of a given application. To avoid this scenario, we let be the subset that restricts to parameters yielding relevant solutions. For instance, a quantity in may be required to be nonnegative so restricts to parameters respecting this constraint. We assume that is a Lebesgue measurable set; this is easily verified in most applications.

We make three assumptions about ; formally, they are expressed as the following.

1. is differentiable.

2. such that , , and
, , exist and are finite.

3. such that , .

Assumption I is necessary since we seek to use a derivative-based GSA. This assumption is easily verifiable in most cases. Assumption II is needed so that global sensitivity indices (3) are well defined. Assumption III is a needed technical assumption requiring that the loss function be bounded below. Note that if is continuously differentiable and is compact, then all three assumptions follow immediately; this is a common case.

To define global sensitivity indices, we must specify a probability measure to integrate against. Let

 q(θ)=χB(θ)e−δ(L(θ)+λ||θ||22) (1)

for some and ;

is the characteristic function of a set, and

is the Euclidean norm. Note that is defined through constraints on so it is generally difficult to express in terms of simple algebraic constraints. In most cases, however, the constraints may be checked when is evaluated, and hence is easily evaluated through evaluating .

From Assumption II and the fact that , it follows that

is integrable. We define the probability density function (PDF) as

 p(θ)=q(θ)∫Aq(γ)dγ=e−δ(L(θ)+λ||θ||22)∫Be−δ(L(γ)+λ||γ||22)dγ. (2)

Then is supported on and gives the greatest probability to regions where is close to its minimum namely, where is a good fit. A PDF of this form corresponds to a Gibbs measure [24] with temperature ; the temperature determines how the probability mass disperses from the modes. The scalar is a regularization factor that aids when is too heavy tailed; this is illustrated in Section 4. The determination of and is considered in Section 3

. Our formulation shares similar characteristics to Bayesian inference. For instance, if

is a negative log likelihood and then is the posterior PDF of using a Gaussian prior truncated to (or when

, the prior is simply a uniform distribution on

).

###### Definition 1

Let the sensitivity index of with respect to be defined as

 Sk=E(|θk|)E(∣∣∣∂L∂θk(θ)∣∣∣)=∫B|θk|p(θ)dθ∫B∣∣∣∂L∂θk(θ)∣∣∣p(θ)dθ. (3)

Derivative-based sensitivity indices are commonly defined in the literature by taking the expected value of the partial derivative squared. The absolute value is used here because and its derivatives typically become large for parameters with low probability, so squaring the partial derivative results in the low probability realizations making larger contributions to (3). Since depends on the units of , it will be difficult to compare partial derivatives when parameters are on multiple scales. Typically one would rescale parameters a priori to avoid this issue, but this is difficult to do in our context. Multiplying by yields scale-invariant global sensitivity indices. Sensitivity indices for groups of variables [47] may also be defined in our framework, but are not considered in this paper.

Correlations in make Monte Carlo integration with uniform sampling intractable for computing the ’s. Importance sampling may be used if an efficient proposal distribution is found; however, this is also challenging in most cases. Therefore, we propose to compute the ’s with Markov Chain Monte Carlo (MCMC) methods.

In summary, the global sensitivity of to may be estimated by using only evaluations of and along with MCMC. This framework also admits additional useful information as by-products of estimating (3). More details are given in Section 3.

## 3 Computing sensitivities

In this section we present the main result of this study. The proposed method may be partitioned into three stages:

1. Preprocessing where we collect information about the loss function,

2. Sampling where samples are drawn from the probability measure (2),

3. Post-processing where sensitivities as well as additional information are computed.

In the preprocessing stage we seek to find characteristic values for the parameters and the loss function. These characteristic values are used to determine the temperature and regularization factor in the PDF (2).

In the sampling stage we first determine the temperature and regularization factor. Subsequently an MCMC sampler is run to collect samples from (2).

In the post-processing stage we compute sensitivities by evaluating the gradient of the loss function at the samples drawn in the sampling stage ii. In addition, the robustness of the sensitivities with respect to perturbations in the temperature and the parameter correlations are extracted from the existing samples and gradient evaluations. These two pieces of information are by-products of computing sensitivities and require no additional computation.

These three stages are described in Subsections 3.13.2, and  3.3, respectively. The method is summarized as a whole in Subsection 3.4.

### 3.1 Preprocessing stage

Characteristic magnitudes for and are needed to determine the regularization factor and temperature. To this end we introduce two auxiliary computations as a preprocessing step.

The first auxiliary computation runs an optimization routine to minimize ; the choice of optimizer is not essential here. Let be the minimizing parameter vector. For our purposes it is acceptable if is not the global minimizer of as long as it is sufficiently close to capture characteristic magnitudes of in regions of good fit.

The second auxiliary computation uses to determine the range of loss function values that our MCMC sampler should explore. Let , and let be a random vector defined by

 Θk∼U[(1−c)θ⋆k,(1+c)θ⋆k], (4)

where all the ’s are independent of one another and denotes the uniform distribution. Hence, represents uniform uncertainty of about .

Determining is an application-dependent problem. In fact, its determination is the only portion of our proposed method that cannot be automated. To choose , we suggest fixing a value for , sampling from , and assessing the quality of ’s predictions using the sample. Repeating this sample and assessment process for various values of allow the user to determine a that yields reasonable predictions. This step is highly subjective and application dependent; however, it is a very natural means of inserting user specification. One simple way to do this is visualizing the model output for each sample and increasing until the outputs become unrealistic.

Taking large values for will result in the PDF giving significant probability to regions of the parameter space yielding poor fits, and thus hence sensitivity indices that are not useful. Taking small values for will result in the PDF giving significant probability to regions of the parameter space near local minima, thus making the sensitivity indices local. Since the choice of is strongly user dependent, the robustness of the sensitivity indices with respect to perturbations in is highly relevant; this is indirectly addressed by Theorem 3.2.

Once is specified, then a threshold , which is used to compute the regularization factor and temperature (see Subsection 3.2.1 and Subsection 3.2.2), may be easily computed via Monte Carlo integration. We define the threshold

 M=E(L(Θ)). (5)

Note that the expectation in (5) is computed with respect to the independent uniform measure; all other expectations in the paper are computed with respect to the PDF (2).

### 3.2 Sampling stage

We use an MCMC method to sample from (2) through evaluations of the unnormalized density (1). Then the ’s may be computed through evaluations of at the sample points. Many MCMC methods may be used to sample ; see, for example [43, 38, 49, 14, 12, 16].

Determining which MCMC method to use and when it has converged may be challenging. Convergence diagnostics [9, 6] have been developed that may identify when the chain has not converged; however, they all have limitations and cannot ensure convergence [13]. In Section 4, adaptive MCMC [49] is used with the convergence diagnostic from [5].

Assuming that an MCMC sampler is specified, we focus on determining the temperature and regularization factors in Subsections 3.2.1 and 3.2.2, respectively.

#### 3.2.1 Determining the regularization factor

To determine the regularization factor , consider the function

 Lλ(θ)=L(θ)+λ||θ||22.

The PDF gives greatest probability to regions where is small. If is small relative to , then the local minima of are near the local minima of . Ideally we would like , but in some cases this results in being too heavy tailed. Instead we may require that for some ; that is, the regularization term contributes percent of the value of . Setting and replacing and with and , we get

 λ=νM(1−ν)||θ⋆||22. (6)

In practice we suggest beginning with . If the MCMC sampler yields heavy-tailed distributions that converge slowly, then may be increased to aid the convergence. This case is illustrated in Section 4.

#### 3.2.2 Determining the temperature

To determine the temperature , we first define

 Mλ=M+λ||θ⋆||22.

We seek to find so that with probability ; is suggested to mitigate wasted computation in regions where yields a poor fit. Let . We note that is a Lebesgue measurable set since is continuous and is Lebesgue measurable. We define the function by

 Δ(δ)=∫Cp(θ)dθ. (7)

Then gives the probability that . The optimal temperature is the solution of . Four results are given below showing that possesses advantageous properties making the nonlinear equation easily solvable. The proofs of the following propositions are given in the appendix.

###### Proposition 1

If , then is differentiable on with

 Δ′(δ)=(−1+Δ(δ))∫CLλ(θ)p(θ)dθ+Δ(δ)∫B∖CLλ(θ)p(θ)dθ. (8)

###### Proposition 2

If , then is a strictly increasing function on .

Propositions 1 and 2 yield desirable properties of . The assumption that
is integrable is necessary for to be well defined. Note that this assumption follows from Assumption I when is bounded. Theorem 3.1 and Corollary 1 below give existence and uniqueness, respectively, for the solution of under mild assumptions.

###### Theorem 3.1

If is a bounded set and such that , then such that .

###### Corollary 1

If , is a bounded set, and such that
, then admits a unique solution.

The assumption that is bounded is reasonable in most applications; may be unbounded, but is restricted to relevant solutions that will typically be bounded. The assumption that means that is not chosen as the global minimum, which should always hold in practice. The assumption that is necessary for existence. Typically is much less than 1, while is chosen close to 1. The assumptions Theorem 3.1 and Corollary 1 hold in most applications

In summary, under mild assumptions is a scalar nonlinear equation admitting a unique solution and possesses nice properties (monotonicity and differentiability). Further, and may be approximated simultaneously by running MCMC. The challenge is that evaluating and in high precision requires running a long MCMC chain. In fact, is significantly more challenging to evaluate than . For this reason we suggest using derivative-free nonlinear solvers which will still be efficient since is a well-behaved function. In the spirit of inexact Newton methods [22], shorter chains may be run for the early iterations solving and the precision increased near the solution. In practice, relatively few evaluations of are needed because of its properties, shown above.

As previously highlighted, the PDF (2) corresponds to a Bayesian posterior PDF when is a negative log likelihood and . If , our GSA approach uses a “flatter” PDF than the Bayesian posterior. Our determination of incorporates information from the parameter optimization procedure to ensure that our sensitivity analysis searches the parameter space in which the optimization routine traverses.

### 3.3 Post-processing stage

Having attained samples from (2), the sensitivities (3) may be estimated by evaluating at the sample points and forming the Monte Carlo estimator for the expectations in (3). In addition to computing these sensitivities, we may extract two other useful pieces of information, namely, the robustness of the sensitivities with respect to perturbations in the temperature and the parameter correlations. These are described in Subsections 3.3.1 and 3.3.2, respectively.

#### 3.3.1 Robustness with respect to the temperature

As a result of the uncertainty in the determination of (computation of , choice of , estimation of , solution of ), we analyze the robustness of the sensitivities with respect to . Consider the functions

 Fk:(δmin,∞) → R, δ ↦ (∫B|θk|(e−δLλ(θ)∫Be−δLλ(~θ)d~θ)dθ)(∫B∣∣∣∂L∂θk(θ)∣∣∣(e−δLλ(θ)∫Be−δLλ(~θ)d~θ)dθ)

; clearly . Theorem 3.2 gives the derivative of the sensitivity index with respect to the temperature , namely, .

###### Theorem 3.2

If , , and
exist and are finite, then is differentiable with

 (9)

where is the covariance operator.

Theorem 3.2 allows to be computed from the samples and function evaluations used to compute . For small , so the robustness of may be estimated without any further computational expense.

Since the magnitude of may depend on , it is useful to normalize for each when assessing robustness. We define

 ^Fk(δ+δh)=Fk(δ)+hδF′k(δ)∑nj=1(Fj(δ)+hδF′j(δ)),k=1,2,…,n, (10)

which may be plotted for to assess robustness. Since this is only a local estimate we suggest taking , reflecting a uncertainty about .

The user may interpret as the local sensitivity of (3) with respect to . Because of the several sources of uncertainty in , it is desirable to have a global sensitivity of (3) with respect to ; however, this would require significantly more computational effort. Nonetheless, locality in does not diminish the value of (3) as a global sensitivity index, and it provides useful information about (3) at a negligible computational cost.

#### 3.3.2 Extracting parameter correlations

Parameters are typically correlated, and the correlation information is a valuable complement to the sensitivity indices. For instance, if is sensitive to two parameters that are highly correlated, then it may be possible to remove one of them from since the other may compensate. In addition, the correlations may reveal parameter misspecifications in .

The strength and nature of the correlations in are typically not known a priori. Correlation coefficients may be computed from the MCMC samples and returned as a by-product of computing sensitivity indices. The Pearson correlation coefficient is commonly used to measure correlations from sampled data. Other measures of correlation may be interchanged within our framework as well.

### 3.4 Summary of the method

This subsection summarizes our proposed method. The method is divided into three algorithms, one for each stage described in Section 3.

Algorithm 1 performs the auxiliary computations of Subsection 3.1. Note that determining in line 2 is the only application-specific portion of the proposed method; user discernment is necessary to choose .

Algorithm 2 requires the user to specify the parameter from Subsection 3.2.1, the parameter from Subsection 3.2.2, and the number of MCMC samples . We suggest starting with and rerunning Algorithm 2 with a larger if the convergence results indicate that the PDF is heavy tailed. Hence may be viewed as a computed quantity rather than one specified by the user. As mentioned in Subsection 3.2.2, we suggest using . It may be considered fixed and the user only needs to change it if they have a specific purpose which requires giving more weight to “poor” parameter choices. The choice of may be difficult; however, more samples may be appended after an initial run so can be adapted without any wasted computation.

Algorithm 3 is a simple post-processing of the MCMC samples to compute sensitivity indices, robustness estimates, and parameter correlations. One may also perform convergence diagnostics on the MCMC estimators of , , along with Algorithm 3.

## 4 Numerical results

In this section we apply the proposed method to two problems. The first is a synthetic test problem meant to illustrate the methodological details described in Section 3. The second is our motivating application where is a space-time hierarchical Gaussian process used for wind speed forecast [1].

### 4.1 Synthetic test problem

This synthetic problem illustrates the proposed method of GSA and its properties on a simple example with least squares estimation. We demonstrate the difficulty of MCMC sampling with heavy tailed distributions and how the regularization factor (6) alleviates this problem.

Mimicking characteristics of our motivating application, we consider a space-time process governed by the function

 f(x,t)=S(x)T(t), (11)

where

 S(x)=α0+α1x+α2x2, and
 T(t)=β0+β1e−γtcos(2π100t)+β2sin(2π100t)+β311+e−.1(t−50)

with , , and

 θ =(β0,β1,β2,β3,γ,α0,α1,α2) =(2,10,3,.01,.01,1,.01,1). (12)

We draw samples from (11) on a uniform grid of , which gives data

 {(xi,ti,f(xi,ti))}225i=1.

A model parameterized in the same form as (11) is proposed, but the parameters are assumed to be unknown. They are determined by minimizing the least squares loss function

 L(θ)=1225225∑i=1(f(xi,ti)−^f(xi,ti))2.

Least squares estimates are generally used as initial conditions for maximum likelihood optimization. This motivates a least squares formulation in this example as a simplification of the loss function in our motivating application.

Analytic solutions for the sensitivities are intractable; however, we can validate our results by comparing them with our knowledge of the true model that generated the data. In particular, the relative importance of the parameters is clear by examining (11) and (12). We expect to be the most important parameter and and to be the least important parameters.

The proposed method is used with , , , and . Five independent chains are generated from overdispersed initial iterates using adaptive MCMC [49]. When , the MCMC sampler fails to converge because the tail of is too heavy. To illustrate this, Figure 1 shows the iteration history for the parameter in each of the five chains after a burn-in period is discarded. The two leftmost frames indicate that is heavy tailed; the other three chains never reach the tail. A heavy-tailed PDF such as this requires extensive sampling, which makes the reliable computation of sensitivity indices intractable. Therefore, we use regularization to alleviate this problem by increasing as we monitor the sampler’s convergence. We find that yields converged chains with samples. The chains are deemed convergent by using the potential scale reduction factor (PSRF) [5] as well as visualizing the iteration histories and histograms from each of the five chains.

Plotting the iteration history of indicates that a burn-in of is sufficient. Then the remaining samples from the five chains are pooled together so that sensitivities and correlations may be computed from them. Figure 2 shows the sensitivity indices and Pearson correlation matrix computed from the pooled samples. These results are consistent with our expectations, is seen as the most important parameter and as the least important. Two primarily blocks are seen in the correlation plot representing the set of spatial variables and the set of temporal variables. Negative correlations are observed on the off diagonal blocks since the spatial and temporal variables are multiplied by one another and hence are inversely related.

Figure 3 displays (10) plotted for , . The horizontal lines indicate that errors in determining are immaterial since the analysis would be unchanged by perturbing .

### 4.2 Analysis of a space-time Gaussian process

In this section we apply the proposed method to analyze the motivating statistical model [1]. The model aims at forecasting wind speed by fusing two heterogeneous datasets: numerical weather prediction model (NWP) outputs and physical observations. Let denote the output of the NWP model and denote the observed measurements. They are modeled as a bivariate space-time Gaussian process specified in terms of mean and covariance structures as follows,

 (YObsYNWP)∼N((μObs(θ)μNWP(θ)),(ΣObs(θ)ΣObs,NWP(θ)ΣTObs,NWP(θ)ΣNWP(θ))), (13)

where is the set of parameters that describe the shapes of the means and covariances. The model is expressed in a hierarchical conditional manner to avoid the specification of the full joint covariance in (13), indeed the mean and covariance of the distributions and are specified in time, geographical coordinates, and parameters from the numerical model (the land-use parameter). More precisely,

 μNWP(t,s) =(α0LU(s)+α1Lat(s)+α2Long(s))f(t), (14)

where is a spatial location, is time, represents a sum of temporal harmonics with daily, half-daily and 8hr-periodicities,

is a categorical variable that represents the land-use associated with location

, and are the latitude and longitude coordinates.

 ΣNWP(.,si;.,sj)=Cov(YNWP(.,si),YNWP(.,sj))=(Ψ(si)Γ0Ψ(sj)T)+δi−jΓLU(si), (15)

, are temporal squared exponential covariances expressed as

 Γ.(tk,tl)=σ.exp(−ρ.(|tk−tl|)2)+δk−lγ.,

where is the Kronecker delta, , , and are parameters to be estimated. are land-use specific terms, and is linear in the latitude and longitude coordinates and quadratic in time. The parameters , , , along with the parameters of , and , will be estimated during the maximum likelihood procedure. We will denote the collection of all these parameters by .

The conditional distribution is expressed through its mean and covariance:

 μObs|NWP(t,s) =μ(t,s)+(ΛYNWP)(t,s),

where is written similarly to as a product of temporal harmonics and a linear combination of the coordinates latitude and longitude. is a projection matrix specified in time, latitude, longitude and the land-use parameter. The covariance is parametrized with a similar shape to (15) with a different set of parameters. Parameters of these functions are denoted as in the following and will estimated by maximum likelihood.

This model is fitted by maximum likelihood on the two datasets with respect to the parameters . The negative log likelihood of the model can be decomposed as

 L(θ)=LNWP(θNWP)+LObs|NWP(θObs|NWP), (17)

where and are the negative log likelihoods for the marginal distribution of and the conditional distribution , respectively. Since the model decomposes in this way, we will consider analysis of the parameters in and separately. Our dataset consists of 27 days of measurements from August 2012; details may be found in [1]. The parameter sensitivity during the first 13 days for and is analyzed in Subsection 4.2.1 and Subsection 4.2.2, respectively. Inferences are drawn from this analysis and validated using the later 14 days of data in Subsection 4.2.3.

#### 4.2.1 Parameter sensitivity analysis for YNWP

In this subsection we apply the proposed method to determine the sensitivity of the marginal model for to its 41 parameters during a 13-day period. The sensitivities being computed are with respect to the parameters in , but for notational simplicity we will denote them by in this subsection.

In order to determine (line 1 of Algorithm 1), the L-BFGS-B algorithm is used to minimize . Visualizing the model predictions for various choices of yields (line 2 of Algorithm 1). Then (5) is estimated with Monte Carlo samples (line 3 of Algorithm 1). It returns an estimate

with standard deviation 2; hence

is considered to be sufficiently large. These steps complete the preprocessing stage by providing characteristic values for the parameters and the loss function .

The PDF is found to be heavy tailed, so is chosen to reduce this effect. Then the equation is solved with by evaluating and manually updating . The solution is obtained. This converged in very few iterations because of the nice properties of the equation . Any other derivative-free nonlinear solver may be used in our framework; however, manual tuning is preferable in many cases because of the simplicity of the equation and the stochasticity of the function evaluations. Having determined and , the PDF from which we draw samples is now well defined.

Adaptive MCMC [49] is used with a desired acceptance rate of . Five chains of length each are generated independently from overdispersed initial iterates, and the first iterates are discarded as burn-in. The PSRF convergence diagnostic from [5] is used on and separately to assess the convergence of each. The PSRFs for all parameters lie in the intervals and , respectively. Other visual diagnostics are applied as well, along with comparing sensitivity indices from each of the chains. The sensitivity estimation appears to converge.

Figure 4 displays the sensitivity indices estimated from each of the chains. The five different colors represent the five different chains; their comparability demonstrates that MCMC estimation errors are negligible. The intercept terms in the mean and the covariance kernel parameters are the most influential. The terms parameterizing are less influential, particularly the quadratic temporal terms.

As discussed in Subsection 3.3.1, the robustness of the sensitivities with respect to errors in may be estimated as a by-product of computing sensitivities. Figure 5 displays (10) plotted for , . Most of the curves are nearly horizontal, and those that not horizontal display small variation that does not change the resulting inference. Thus the sensitivities are robust with respect to , and hence any errors made when determining are negligible.

As mentioned in Subsection 3.3.2, parameter correlation information is a useful complement to sensitivity indices. Figure 6 displays the empirical Pearson correlation matrix computed from the MCMC samples retained after removing burn-in and pooling the chains. Strong positive correlations are observed between the three land-use dependent spatial intercepts in the mean. Strong negative correlations are observed between the temporal range and the nugget term parameterizing the land-use specific covariance kernels . This correlation is expected since the nugget term represents the variance of the signal that is not explained by the exponential part.

#### 4.2.2 Parameter sensitivity analysis for YObs|YNWP

In this subsection we apply the proposed method to determine the sensitivity of the model for to its 54 parameters during the same 13-day period used in Subsection 4.2.1. The sensitivities being computed are with respect to the parameters in , but for notational simplicity we will denote them by in this subsection.

In a similar fashion to Subsection 4.2.1, the L-BFGS-B algorithm is used to determine (line 1 of Algorithm 1). Visualizing the model predictions for various choices of yields (line 2 of Algorithm 1). Then M (5) is estimated with Monte Carlo samples (line 3 of Algorithm 1). It returns an estimate with standard deviation 2; hence is considered to be sufficiently large. These steps complete the preprocessing stage by providing characteristic values for the parameters and the loss function . One may note that and are significantly smaller for than for . Their difference is unsurprising since and model two different processes.

The PDF is found to be heavy tailed, so is chosen to reduce the tail of . Analogously to Subsection 4.2.1, is solved yielding (line 3 of Algorithm 2).

Adaptive MCMC [49] is used with a desired acceptance rate of . Five chains of length each are generated independently from overdispersed initial iterates, and the first iterates are discarded as burn-in. The convergence diagnostic from [5] is used on and separately to assess the convergence of each. The PSRFs for all parameters lie in the intervals and , respectively. Other visual diagnostics are applied as well, along with comparing sensitivity indices from each of the chains. A few sensitivity indices have not fully converged; however, the remaining uncertainty in their estimation is sufficiently small for our purposes. These uncertain sensitivities are among the largest in magnitude. Since our goal is encouraging model parsimony, then precisely ordering the most influential parameters is of secondary importance.

Figure 7 displays the sensitivity indices estimated from each of the chains. The five different colors represent the five different chains. The sensitivities with greatest uncertainties are demonstrated by the differences in their estimated values in each chain; however, these discrepancies are sufficiently small that they do not alter our resulting inference. The longitudinal terms in the mean and are observed to have little influence since their sensitivity indices are nearly zero. The most influential parameters are the spatial weights in the matrix which acts on .

As discussed in Subsection 3.3.1, the robustness of the sensitivities with respect to errors in may be estimated as a by-product of computing sensitivities. Figure 8 displays (10) plotted for , . Most of the curves are nearly horizontal, and those that are not horizontal display small variation that does not change the resulting inference.

As mentioned in Subsection 3.3.2, parameter correlation information is a useful complement to sensitivity indices. Figure 9 displays the empirical Pearson correlation matrix computed from the MCMC samples retained after burn-in and pooling the chains. Strong correlations are observed between the spatial weights in the matrix . Similar to , strong negative correlations are also observed between the temporal range and the nugget term of the land-use specific covariance kernels.

#### 4.2.3 Inference and validation of results

In some cases with mathematical models one may set a threshold and fix all parameters whose sensitivity is below the threshold. This approach is not suitable for statistical models because the parameters must be understood in light of their contribution to the model structure and their correlation with other parameters. Rather, the sensitivity indices and correlation structures should be used to re-parameterize the statistical model in a simpler way. For instance, if a collection of spatially dependent parameters are all found to be unimportant then the user may consider replacing them by a single parameter which is not spatially dependent.

Using the results of Subsection 4.2.1, and considerations of the model structure, we determine that the model is insensitive to the temporal quadratic terms in the parameterization of . Similarly, coupling the results of Subsection 4.2.2, and the model structure, we determine that the model is insensitive to several of the longitude terms. Specifically, the longitude term in the parameterization of the mean and the nine longitude terms in the parameterization of . This conclusion confirms what one would expect from the physics considerations. The flow is predominantly east-west and thus the north-south correlation is relatively weaker. The east-west information is likely to be well captured by and hence is not needed in .

The 16 insensitive parameters are removed from the model, yielding a more parsimonious model that we refer to as the reduced model. To validate our inferences, we use the original model and the reduced model for prediction on the other 14 days of data we have available but did not use in the sensitivity analysis. Leave-one-out cross-validation is used to fit each of the models and assess their predictive capabilities.

We simulated 1,000 scenarios for each of the 14 days and use two metrics to quantify the predictive capacity of the full and reduced models, namely, the energy score and the root mean square error. The energy score111, where and are independent predictive scenarios with [15, 36] is a measure of distance between the distribution used to generate the scenarios and the observed data on a fixed day; hence 14 energy scores are computed (one for each day). The root mean square error222 is computed as the square root of the time average squared error between the average of scenarios and the observation at each spatial location; hence, there are 11 root mean square errors. Figure 10 displays the energy scores on the left and root mean square errors on the right. The reduced model has slightly smaller (and hence better) energy scores and root mean square errors in the majority of the cases. The sum of energy scores for full model and reduced model is 121.1 and 120.7, respectively. The sum of root mean squared errors for the full model and reduced model is 11.4 and 11.3, respectively.

To further illustrate the difference between the full and reduced models, we display the simulated scenarios for a typical case. Specifically, we take the spatial location with median root mean square error and six days with median energy score and plot the 1,000 scenarios along with the observed data. Figure 11 displays the results with the full model on the left and reduced model on the right.

We have thus simplified the parameterization of the model from having 95 parameters to 79. The reduced model has equal or better predictive capability and is simpler to fit and analyze. Further, the reduced model typically has fewer outlying scenarios, as evidenced in Figure 11.

For this application we observed a strong insensitivity of to some of its terms contributing longitudinal information. This is likely because the model captures longitudinal information well and hence the Gaussian process does not need to fit terms contributing longitudinal information. Thus, inferences may also be made on the data being input to the statistical model through the parameter sensitivities.

These inferences and simplifications are useful for multiple reasons. First, long term weather prediction is difficult so the parameters must be optimized frequently to accommodate changing weather patterns. Hence a large optimization problem must be solved frequently, reducing the number of parameters allows for faster and more robust optimization. Second, by removing unimportant parameters the model is more robust and can be more easily integrated into a larger workflow, namely power scheduling. Third, we are able to learn more about the underlying system through these inferences, for instance, the unimportance of longitudinal information.

## 5 Conclusion

A new method for global sensitivity analysis of statistical model parameters was introduced. It addresses the challenges and exploits the problem structure specific to parameterized statistical models. The method is nearly fully automated; one step depends on the user’s discretion, but this level of user specification is likely necessary for any alternative method. The proposed method also admits perturbation analysis at no additional computational cost, thus yielding sensitivities accompanied with certificates of confidence in them.

The method was motivated by, and applied to, a Gaussian process model aimed at wind simulation. Sensitivities were computed and the model parameterization simplified by removing 17% of the model parameters. This simpler model was validated and shown to provide equal or superior predictive capability compared with the original model.

Our proposed method has two primary limitations. First, it relies heavily on Markov Chain Monte Carlo sampling for which convergence diagnostics are notoriously challenging. Second, regularization may be needed to eliminate heavy-tailed distributions. Determining the regularization constant is simple in principle but may require drawing many samples to resolve. However, these limitations are classical and have been observed in various applications previously.

Global sensitivity analysis has seen much success analyzing parameter uncertainty for mathematical models. The framework presented in this paper provides the necessary tools to extend global sensitivity analysis to parameters of complex statistical models. To the authors knowledge, our approach is the first to systematically combined tools from mathematics and statistics to facilitate efficient global sensitivity analysis. There are a plurality of other approaches one can consider, which may yield similar or different results than our proposed method. Utilizing the loss function and its gradient is critical for our method’s efficiency. Future work may include a broader exploration of how our approach compares with possible alternative methods, particularly those which have less dependence on the loss function. Analysis for groups of parameters may also be a useful extension of this work.

## Appendix

This appendix contains the proofs for Proposition 1, Proposition 2, Theorem 3.1, Corollary 1, and Theorem 3.2.

### Proof of Proposition 1

Let and . Then

 Δ(δ)=1V(δ)∫CU(δ)dθ.

By using Theorem 6.28 in [25] with dominating , we have that and are differentiable with

 ddδ∫CU(δ)dθ =−∫C(Lλ(θ)−Lmin)U(δ)dθ, V′(δ) =−∫B(Lλ(θ)−Lmin)e−δ(Lλ(θ)−Lmin)dθ.

is differentiable since , , and applying the quotient rule for derivatives gives

 Δ′(δ)=ddδ∫CU(δ)dθV(δ)−V′(δ)V(δ)∫CU(δ)dθV(δ).

Simple manipulations yields

 Δ′(δ)=(−1+Δ(δ))∫C(Lλ(θ)−Lmin)U(δ)V(δ)dθ+Δ(δ)∫B∖C(Lλ(θ)−Lmin)U(δ)V(δ)dθ.

Writing and using the linearity of the integral completes the proof.

### Proof of Proposition 2

From the proof of Proposition 1 we have

 Δ′(δ)=(−1+Δ(δ))∫C(Lλ(θ)−Lmin)U(δ)V(δ)dθ+Δ(δ)∫B∖C(Lλ(θ)−Lmin)U(δ)V(δ)dθ.

Since

 θ∈C⟹Lλ(θ)≤Mλ, θ∈B∖C⟹Lλ(θ)>Mλ, Δ(δ)∈[0,1], ∫B∖Cp(θ)dθ=1−Δ(δ),

we have

 Δ′(δ) >(−1+Δ(δ))∫C(Mλ−Lmin)U(δ)V(δ)dθ+Δ(δ)∫B∖C(Mλ−Lmin)U(δ)V(δ)dθ =(−1+Δ(δ))(Mλ−Lmin)Δ(δ)+Δ(δ)(Mλ−Lmin)(1−Δ(δ)) =0.

### Proof of Theorem 3.1

Let . Define

 ~Mλ=L(θ′)+Mλ2

By Assumption I, is continuous so . Then