    # Linear screening for high-dimensional computer experiments

In this paper we propose a linear variable screening method for computer experiments when the number of input variables is larger than the number of runs. This method fits a linear model to the nonlinear data and screens the important variables with existing screening methods for linear models. When the underlying simulator is nearly sparse, we prove that the linear screening method is asymptotically valid under mild conditions. To improve the screening accuracy, we also provide a two-stage procedure that uses different basis functions in the linear model. The proposed methods are very simple and easy to implement. Numerical results indicate that our methods outperform existing model-free screening methods.


## 1 Introduction

Nowadays computer experiments are commonly used to study computer simulations in engineering and scientific investigations (Santner, Williams, and Notz 2018). Computer simulations usually have complex nonlinear input-output relationships with long running times. Furthermore, they often involve large numbers of input variables (Fang, Li, and Sudjianto 2006). For example, building performance simulation is used to predict performance aspects of a building, and its inputs include various types of parameters such as climate parameters, geometry parameters, envelope parameters, and so on. For large buildings, the number of these inputs can be much larger than one hundred (Clarke 2001). Examples of computer simulations with large numbers of input variables can also be found in climate simulations (Roulstone and Norbury 2013) and manufacturing simulations (Jahangirian et al. 2010).

Many authors have discussed the screening/selection problem, or the related sensitivity analysis problem, for computer simulations with many inputs. If only a small proportion of the inputs are active or influential, variable screening or sensitivity analysis methods can detect the active inputs that have a major impact on the output, and thus we can better understand the input-output relationship. Morris (1991) proposed a design-based one-factor-at-a-time factor screening method. Schonlau and Welch (2006) presented a screening method via analysis of variance and visualization. Linkletter et al. (2006) and Reich, Storlie, and Bondell (2009) provided Bayesian selection methods. Moon, Dean, and Santner (2012) proposed a two-stage sensitivity-based group screening method. Sung et al. (2017) provided a multi-resolution functional ANOVA approach for many-input computer experiments. However, these methods are not applicable to the cases where the number of variables is larger than the number of runs. Such cases are common in practice since we usually have limited runs to analyze a high-dimensional computer simulation due to the long running time. In addition, the "large $p$, small $n$" problem often appears in the first stage of analyzing high-dimensional simulations. Based on the screening result from the first stage, more efficient design and analysis strategies can be devised in the follow-up study.

This paper focuses on the variable screening problem for computer experiments when the number of inputs, $p$, is larger than the number of runs, $n$. In recent years, many methodologies have been proposed to screen important variables for large-$p$-small-$n$ problems in statistics. Fan and Lv (2008) proposed the sure independence screening (SIS) method for linear regression models. This method was extended to generalized linear models (Fan and Song 2010), nonparametric additive models (Fan, Feng, and Song 2011), and varying coefficient models (Fan, Ma, and Dai 2014). Many model-free screening methods were also provided in the literature; see, e.g., Zhu et al. (2011), Li, Zhong, and Zhu (2012), Huang and Zhu (2016), and Lu and Lin (2017). These model-free methods can be used for the aforementioned high-dimensional computer experiments, but their performance in this setting remains to be evaluated.

It should be noted that most screening methods for $p > n$ cases in the literature are marginal methods that only use the separate relationship between each variable and the response. In this paper we consider the screening problem from another angle. Compared to the number of variables, the available runs are very limited, so it seems natural to first consider models that are as simple as possible. Therefore, we fit the linear regression model to the data from high-dimensional computer experiments, and use the screening principle for the linear model (Xiong 2014; Xu and Chen 2014) to screen the active input variables of the nonlinear simulator. The idea of this linear screening method is similar to that of the regression method in global sensitivity analysis for computer experiments (Santner, Williams, and Notz 2018), which uses regression coefficients under the linear regression model as sensitivity indices for the input variables.
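To make the idea concrete, here is a minimal sketch of linear screening, assuming a toy nonlinear simulator and a ridge-type estimator chosen for illustration; the paper's actual procedure uses the more sophisticated screening algorithms cited above, and the simulator, sample sizes, and penalty below are all assumptions.

```python
import numpy as np

def linear_screen(X, y, lam=1e-2, top_k=10):
    """Rank input variables by the magnitude of the coefficients of a
    ridge-regularized linear fit.  Works when p > n via the dual form
    beta = Xc' (Xc Xc' + lam I)^{-1} yc, which only inverts an n x n matrix."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)   # center inputs
    yc = y - y.mean()         # center output
    alpha = np.linalg.solve(Xc @ Xc.T + lam * np.eye(n), yc)
    beta = Xc.T @ alpha       # length-p coefficient vector
    order = np.argsort(-np.abs(beta))
    return order[:top_k], beta

# Toy nonlinear simulator with 3 active inputs out of p = 500, run n = 200 times
rng = np.random.default_rng(0)
n, p = 200, 500
X = rng.uniform(size=(n, p))
y = 10 * X[:, 0] + 8 * X[:, 1] ** 2 + 9 * X[:, 2] ** 3

top, beta = linear_screen(X, y, top_k=10)
```

Despite the nonlinearity of the simulator, the three active inputs receive the largest linear coefficients, which is exactly the behavior the screening method exploits.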

The linear screening method is very simple and easy to implement. One of the main contributions of this paper is to prove its asymptotic validity. To handle the bias caused by the model's simplicity, we investigate the best linear approximation (BLA) of a nonlinear computer simulator. When the simulator is nearly sparse, we show that the active variables are still active in its BLA under mild conditions. Based on this, we prove the asymptotic validity of the linear screening principle for computer experiments with $p > n$. Consequently, sophisticated screening algorithms for linear regression models, beyond the marginal methods, can be employed in our linear screening procedure. A large number of numerical results indicate that the proposed methods perform better than the marginal screening methods in the literature. In addition, the screening accuracy of the proposed methods can be improved by using different basis functions in the underlying linear model.

The rest of the paper is organized as follows. In Section 2, we give the definition of the BLA and discuss its properties. Section 3 provides theoretical results on the asymptotic validity of the linear screening methods for nonlinear computer models. In Section 4, we discuss the linear screening methods with different basis functions. Section 5 gives the numerical results. Section 6 ends this paper with some discussion. Additional definitions and all proofs are given in the Supplementary Materials.

## 2 Best linear approximation of a nonlinear function

Suppose that the input-output relationship of a deterministic computer simulation is

$$y = f(x), \qquad (1)$$

where the input variables $x = (x_1, \ldots, x_p)' \in [0,1]^p$, $f$ is continuous, i.e., $f \in C([0,1]^p)$, and $'$ denotes the transpose. For design sites $x_1, \ldots, x_n$, the corresponding outputs are $y_1, \ldots, y_n$, where $y_i = f(x_i)$. Let $X = (x_1, \ldots, x_n)'$ and $y = (y_1, \ldots, y_n)'$. When we discuss asymptotics, $f$ in (1) depends on $n$, and is also written as $f_n$.

When $p$ is larger than $n$, popular modeling and variable selection methods for computer experiments such as Kriging (Matheron 1963) are difficult to apply. Compared with the dimensionality, our data are very limited, which suggests using a very simple model for the data. Here the linear regression model

$$y = \phi_0 + \phi' x + \epsilon \qquad (2)$$

is under consideration, where $\phi_0 \in \mathbb{R}$ and $\phi = (\phi_1, \ldots, \phi_p)' \in \mathbb{R}^p$ are unknown coefficients and $\epsilon$ is the measurement error. In fact, the linear part of the above linear model corresponds to the best linear approximation (BLA) of $f$, which is defined as

$$\beta_0 + \beta' x = \operatorname*{argmin}_{g \in \{\phi_0 + \phi' x:\ \phi_0 \in \mathbb{R},\ \phi \in \mathbb{R}^p\}} \int_{[0,1]^p} [f(x) - g(x)]^2 \, dx.$$

Let $\beta = (\beta_1, \ldots, \beta_p)'$ and
$$H(\phi_0, \phi_1, \ldots, \phi_p) = \int_{[0,1]^p} \big[ f(x) - (\phi_0 + \phi_1 x_1 + \cdots + \phi_p x_p) \big]^2 \, dx.$$

Taking partial derivatives of $H$ with respect to $\phi_0, \phi_1, \ldots, \phi_p$ and setting them equal to zero, we have

$$\beta_0 = \int_{[0,1]^p} f_n(x) \, dx - \frac{1}{2} \sum_{j=1}^p \beta_j, \qquad (3)$$
$$\beta_j = 12 \left( \int_{[0,1]^p} x_j f_n(x) \, dx - \frac{1}{2} \int_{[0,1]^p} f_n(x) \, dx \right), \quad j = 1, \ldots, p. \qquad (4)$$
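Formulas (3) and (4) can be checked numerically by Monte Carlo integration over the uniform distribution on $[0,1]^p$. The sketch below is illustrative: the test function, dimension, and sample size are assumptions, not examples from the paper. For a function that is already linear, the BLA must recover the function itself.

```python
import numpy as np

def bla_coefficients(f, p, n_mc=200_000, seed=1):
    """Monte Carlo estimates of the BLA coefficients in (3)-(4) of f,
    integrating over the uniform distribution on [0, 1]^p."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(size=(n_mc, p))
    fx = f(x)
    mean_f = fx.mean()                               # ~ integral of f over [0,1]^p
    # (4): beta_j = 12 * ( E[x_j f(x)] - E[f(x)] / 2 )
    beta = 12 * ((x * fx[:, None]).mean(axis=0) - mean_f / 2)
    # (3): beta_0 = E[f(x)] - (1/2) * sum_j beta_j
    beta0 = mean_f - beta.sum() / 2
    return beta0, beta

# For the linear function f(x) = 1 + 2 x_1 + 3 x_2, the BLA is f itself,
# so we expect beta_0 ~ 1, beta_1 ~ 2, beta_2 ~ 3, and beta_3, beta_4 ~ 0.
beta0, beta = bla_coefficients(lambda x: 1 + 2 * x[:, 0] + 3 * x[:, 1], p=4)
```

Note that since $E[x_j] = 1/2$ and $\mathrm{Var}(x_j) = 1/12$ under the uniform distribution, (4) is simply $\beta_j = \mathrm{Cov}(x_j, f) / \mathrm{Var}(x_j)$, the usual regression slope.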

To discuss theoretical properties of our methods, we make the following basic assumption.

###### Assumption 1.

There exist an integer $p_0 < p$, a constant $\eta_n > 0$, and a function $\tilde{f}_n$ such that

$$\sup_{(x_1, \ldots, x_p)' \in [0,1]^p} \big| f_n(x_1, \ldots, x_p) - \tilde{f}_n(x_1, \ldots, x_{p_0}) \big| < \eta_n, \qquad (5)$$

where $\tilde{f}_n$ depends only on $x_1, \ldots, x_{p_0}$. Furthermore, for each $j = 1, \ldots, p_0$,

$$\left| \int_{[0,1]^p} x_j f_n(x) \, dx - \frac{1}{2} \int_{[0,1]^p} f_n(x) \, dx \right| > \tau, \qquad (6)$$

where $\tau$ is a positive constant.

Unlike the (complete) sparsity assumption in the literature of high-dimensional screening, the first part (5) of Assumption 1 allows the computer model $f_n$ to differ from a model with only $p_0$ variables. This indicates that $f_n$ may also depend on the less important variables $x_{p_0+1}, \ldots, x_p$, which matches practical cases better than the sparsity assumption. By (4), the second part (6) requires that each of the variables $x_1, \ldots, x_{p_0}$ be active in the BLA of $f_n$. Specifically, we have the following result.

###### Theorem 1.

Under Assumption 1, $|\beta_j| > 12\tau$ for $j = 1, \ldots, p_0$, and $|\beta_j| \le 12\eta_n$ for $j = p_0 + 1, \ldots, p$.
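The two bounds in Theorem 1 can be illustrated numerically. In the sketch below, the toy simulator is an assumption (not an example from the paper): it is active in $x_1, x_2$ and depends on $x_3$ only through a term bounded by a small $\eta$, matching (5) with $p_0 = 2$. Its BLA coefficients, estimated by Monte Carlo as in (4), are large on the active variables and of order $\eta$ elsewhere.

```python
import numpy as np

# Nearly sparse toy simulator: active in x_1 and x_2; the x_3 term is
# bounded by eta = 0.05, so (5) holds with p0 = 2 and eta_n = eta.
eta = 0.05
f = lambda x: 5 * x[:, 0] + 3 * x[:, 1] ** 2 + eta * np.sin(x[:, 2])

# Monte Carlo estimate of the BLA coefficients via (4)
rng = np.random.default_rng(2)
x = rng.uniform(size=(500_000, 5))
fx = f(x)
beta = 12 * ((x * fx[:, None]).mean(axis=0) - fx.mean() / 2)

# Active coefficients are bounded away from zero (beta_1 ~ 5; beta_2 ~ 3,
# since 12 * Cov(x, 3 x^2) = 3 under the uniform distribution), while the
# inactive ones obey the Theorem 1 bound |beta_j| <= 12 * eta = 0.6.
```

This is exactly the gap the screening method needs: as long as $\eta_n \to 0$ while $\tau$ stays fixed, the active and inactive coefficients of the BLA separate asymptotically.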

For an integer $q \le p$, let $\mathcal{M}_q = \{1, \ldots, q\}$. If $\eta_n \to 0$ as $n \to \infty$, then Theorem 1 indicates that, under Assumption 1, $\mathcal{M}_{p_0}$ can be viewed as the true submodel of both the original model (1) and its BLA. Note that the vector of coefficients $\beta = (\beta_1, \ldots, \beta_p)'$ can be sparsely estimated by regularized least squares under the linear model (2). Theorem 1 basically guarantees the validity of our linear screening methods to select $\mathcal{M}_{p_0}$. Further discussion on (6) can be seen in Section LABEL:sec:db.