Variable selection has attracted tremendous interests from both researchers and practitioners, due to the availability of large number of variables in many real applications. In such scenarios, identifying the truly informative variables for the objective of analysis has become a key factor to facilitate statistical modeling and analysis. Ideally, a variable selection algorithm shall be flexible, scalable, and with theoretical guarantee. To be more specific, the algorithm shall not assume restrictive model assumptions, so that it is applicable to data with complex structures; its implementation shall be computationally efficient and able to take advantage of high performance computing platform; it shall have theoretical guarantee on its asymptotic sparsistency in identifying the truly informative variables.
In literature, many variable selection algorithms have been developed in the regularization framework assuming certain working model set. The most popular working model set is to assume a linear model, where the variable selection task simplifies to identifying nonzero coefficients. Under the linear model assumption, the regularization framework consists of a least sqaure loss function for the linear model as well as a sparsity-inducing regularization term. Various regularization terms have been considered, including the least absolute shrinkage and selection operator (Lasso; Tibshirani, 1996,), the smoothly clipped absolute deviation (SCAD; Fan and Li, 2001, ), the adaptive Lasso (Zou, 2006, ), the minimax concave penalty (MCP; Zhang, 2010, ), the truncated -penalty (TLP; Shen et al., 2012, ), and so on. These algorithms have also been extended to the nonparametric models to relax the linear model assumption. For example, under the additive model assumption, a number of variable selection algorithms have been developed (Shively et al., 1999, ; Huang et al., 2010, ), where each component function depends on one variable only. Further, a component selection and smoothing operator algorithm is proposed in Lin and Zhang (2006, ) to allow higher-order interaction components in the additive model, at the cost of increasing computational cost. These nonparametric variable selection algorithms, although more flexible than the linear model, still require some explicit working model sets.
More recently, attempts have been made to develop model-free variable selection algorithms to circumvent the dependency on restrictive model assumptions. Particularly, variable selection is formulated in a dimension reduction framework in Li et al. (2005, ) and Bondell and Li (2009, ) via searching for the sparse basis of the central dimension reduction space. Fukumizu and Leng (2014, ) developed a gradient-based dimension reduction algorithm that can be extended to model-free variable selection. A novel measurement-error-model-based variable selection algorithm is developed in Stefanski et al. (2014, ) and Wu and Stefanski (2015, ) for nonparametric kernel regression models, and a gradient learning framework is proposed in Yang et al. (2016, ) to conduct variable selection in a flexible RKHS (Wahba, 1998, ). These algorithms are attractive due to its flexibility and asymptotic sparsistency, yet their high computational cost remains as one of the main obstacles.
Another popular line of research on high-dimensional data is variable screening, which screens out uninformative variables by examining the marginal relationship between the response and each variable. The marginal relationship can be measured by various criteria, including the Pearson’s correlation (Fan and Lv, 2008, ), the empirical functional norm (Fan et al., 2011, ), the distance correlation (Li et al., 2012, 
), and a quantile-adaptive procedure (He et al., 2013,
). All these methods are computationally very efficient, and attain the sure screening property, meaning that all the truly informative variables are retained after screening with probability tending to one. This is a desirable property, yet slightly weaker than the asymptotic sparsistency in variable selection. Another potential weakness of the marginal screening methods is that they may ignore those marginally unimportant but jointly important variables (He et al., 2013,). To remedy this limitation, some recent work (Hao et al., 2017, ; Kong et al., 2017, ; She et al., 2017, ) has been done to conduct sure screening for variables with interaction effects.
In this article, we propose a scalable kernel-based variable selection algorithm, which is methodologically flexible, computationally efficient, and able to achieve the asymptotic sparsistency without requiring any explicit model assumption. The algorithm consists of three simple steps, involving kernel-based estimation of the regression function and its gradient functions as well as a hard thresholding. It first fits a kernel ridge regression model in a flexible RKHS to obtain an estimated regression function, then estimates its gradient functions along each variable by taking advantage of the derivative reproducing property (Zhou, 2007, ), and finally hard-thresholds the empirical norm of each gradient function to identify the truly informative variables. This algorithm is flexible in that it can be adapted to any RKHS with different kernel functions, to accommodate prior information about the true regression function. Its computational cost is only linear in the data dimension, and thus scalable to analyze dataset with large dimensions. For example, the simulated examples with variables can be efficiently analyzed on a standard multi-core PC. More importantly, asymptotic sparsistency can be established for the proposed algorithm without requiring any explicit model assumptions. It is clear that the proposed algorithm is advantageous than the existing algorithms, as it achieves methodological flexibility, numerical efficiency and asymptotic sparsistency. To our knowledge, this algorithm is the first one that can achieve these three desriable properties at the same time.
The rest of the article is organized as follows. In Section 2, we present the proposed general kernel-based variable selection algorithm as well as its computational scheme. In Section 3, the asymptotic sparsistency of the proposed algorithm is established. In Section 4, we give a detailed analysis of the asymptotic behavior of the proposed algorithm under linear model, and compare it against some existing theoretical results. In Section 5, the proposed algorithm is extended to select truly informative interaction terms. The numerical experiments on the simulated and real examples are contained in Section 6, and all the technical proofs are given in Section 7.
2 Proposed algorithm
Suppose a random sample are independent copies of , drawn from some unknown distribution with supported on a compact metric space and . Consider a general regression setting,
where is a random error with and , and thus with denoting the conditional distribution of given . It is also assumed that , where is a RKHS induced by some pre-specified kernel function . For each , denote , and the reproducing property of RKHS implies that for any , where is the inner product in . Examples of RKHS include the standard Euclidean space, the Sobolev space, the Besov space, as well as some domain-specific spaces for various applications (Genton, 2001, ).
In sparse modeling, it is generally believed that only depends on a small number of variables, while others are uninformative. Unlike model-based settings, variable selection for a general regression model is challenging due to the lack of explicit regression parameters. Here we measure the importance of variables in a regression function by examining the corresponding gradient functions. It is crucial to observe that if a variable is deemed uninformative, the corresponding gradient function should be exactly zero almost surely. Thus the true active set can be defined as
where with the marginal distribution .
The proposed general variable selection algorithm is presented in Algorithm 1.
We now give details of each step in Algorithm 1. To obtain in Step 1, we employ the kernel ridge regression model,
where and . Then the optimization task in (2.1) can be solved analytically, with
where , and denotes the Moore-Penrose generalized inverse of a matrix. When is invertible, (2.2) simplifies to .
Next, to obtain in Step 2, it follows from Lemma 1 in the Appendix that for any ,
where . This implies that the gradient function of any can be bounded by its -norm up to some constant. In other words, if we want to estimate within the smooth RKHS, it suffices to estimate itself without loss of information. Consequently, if is obtained in Step 1, can be estimated as for each .
In Step 3, it is difficult to evaluate directly, as is usually unknown in practice. We then adopt the empirical norm of as a practical measure,
The estimated active set can be set as for some pre-specified . It is crucial to choose a proper for optimal selection performance, which shall be determined by some data-adaptive tuning schemes such as the stability-based criterion in Sun et al. (2013, ).
The proposed Algorithm 1 is general in that it can be adapted to any smooth RKHS with different kernel functions, where the choice of kernel function depends on prior knowledge about . A direct application of the proposed algorithm under linear model will be discussed in Sections 4. The proposed algorithm is also computationally efficient, whose computational cost is about . The complexity comes from inverting an matrix in (2.2), and the complexity comes from calculating for . Furthermore, if low-rank approximation of the kernel matrix is employed, such as Nystrom sampling (Williams and Seeger, 2001, ) or incomplete Cholesky decomposition (Fine and Scheinberg, 2002, ), the complexity can be reduced to , where is the preserved rank. The algorithm can also be parallelized to further speed up the computation, which is particularly attractive in the large--small- scenarios.
3 Asymptotic sparsistency
Now we establish the sparsistency of the proposed algorithm. First, we introduce an integral operator , given by
for any . Note that if the corresponding RKHS is separable, by the spectral theorem we have
where is an orthonormal basis of ,
is the eigenvalue of the integral operator, and is the inner product in .
We rewrite and as and to emphasize their dependency on , and thus is allowed to diverge with . The cardinality of the true active set is denoted as . The following technical assumptions are made.
Assumption 1: Suppose that is in the range of the -th power of , denoted as , for some positive constant .
Note that the operator on is self-adjoint and semi-positive definite, and thus its fractional operator is well defined. Furthermore, the range of is contained in if (Smale and Zhou, 2007, ), and thus Assumption 1 implies that there exists some function such that , ensuring strong estimation consistency under the RKHS-norm. Similar assumptions are also imposed in Mendelson and Neeman (2010, ).
Assumption 2: There exist some constants and such that , and , for any
Assumption 2 assumes the boundedness of the kernel function and its gradient functions, which is satisfied by many popular kernels, including the Gaussian kernel and the Sobolev kernel (Smale and Zhou, 2007, ; Rosasco et al., 2013, ; Yang et al., 2016, ).
Suppose Assumptions 1 and 2 are satisfied. For any , with probability at least , there holds
Additionally, let , then with probability at least , there holds
Theorem 1 establishes the convergence rate of the difference between the estimated regression function and the true regression function in terms of the RKHS-norm. In the upper bound in (3.1), the quantity may depend on through , yet such dependency is generally difficult to quantify explicitly (Fukumizu and Leng, 2014, ). Theorem 1 also shows that converges to with high probability, which is crucial to establish the asymptotic sparsistency. Note that the constant is spelled out precisely for the subsequent analysis of the asymptotic sparsistency and its dependency on .
follows a sub-Gaussian distribution and the decay rate of’s eigenvalues fulfills a polynomial upper bound of order ; that is, for some positive constant . Then the convergence rate can be obtained as
, which has a weak and implicit dependency on the ambient dimension. In other words, even though most of the RKHSs are infinite-dimensional spaces, their functional complexity can be well controlled and the curse of dimensionality can be largely avoided without specifying model structures.
Assumption 3: There exist constants and such that .
Assumption 3 requires the true gradient function contains sufficient information about the truly informative variables. Unlike most nonparametric models, we measure the significance of each gradient function to distinguish the informative and uninformative variables without any explicit model specification. Now we establish the asymptotic sparsistency of the proposed variable selection algorithm.
Suppose the assumptions of Theorem 1 and Assumption 3 are satisfied. Let , then we have
Theorem 2 shows that the selected active set can exactly recover the true active set with probability tending to 1. This result is particularly interesting given the fact that it is established for any RKHS with different kernel functions. A direct application of the proposed algorithm and Theorem 2 is to conduct nonparametric variable selection with sparsistency (Li et al., 2012, ; He et al., 2013, ; Yang et al., 2016, ). If no prior knowledge about the true regression function is available, the proposed algorithm can be applied with a RKHS associated with the Gaussian kernel. Asymptotic sparsistency can be established following Theorem 2 provided that is contained in the RKHS associated with the Gaussian kernel. This RKHS is fairly large as the Gaussian kernel is known to be universal in the sense that any continuous function can be well approximated by some function in the induced RKHS under the infinity norm (Steinwart, 2005, ).
4 A special case with linear kernel
Variable selection for linear model is of great interest in statistical literature due to its simplicity and interpretability. Particularly, the true regression function is assumed to be a linear function, , and the true active set is defined as . We also centralize the response and each variable, so that can be discarded from the linear model for simplicity.
With the linear kernel , the induced RKHS becomes the linear functional space and contains the true regression function . The proposed Algorithm 1 simplifies to first fit a ridge regression model,
where the RKHS-norm induced by the linear kernel reduces to . Then the estimated active set is defined as for some pre-specified thresholding value . This algorithm simply truncates the ridge regression coefficients for variable selection, and its asymptotic properties have been considered in Shao and Deng (2012, ) and Wang and Leng (2016, ).
We now apply the general results in Section 3 to establish the sparsistency of the proposed algorithm under the linear model. We first scale the original data as and , and let be the RKHS induced by a scaled linear kernel . The true regression model can be rewritten as with . With the scaled data, the ridge regression formula in (4.1) becomes
where . By the representer theorem, the solution of (4.2) is
where and . This is equivalent to the standard formula for the ridge regression according to the Sherman-Morrison-Woodbury formula (Wang and Leng, 2015).
Assumption 4: There exist some positive constants and such that the smallest eigenvalue of , , and .
Assumption 4 implies that is invertible, and that Assumption 1 is satisfied for the scaled linear kernel with . Assumption 2 is also satisfied due to the fact that is bounded on a compact support . A similar assumption is made in Shao and Deng (2012, ), assuming the decay order of the smallest eigenvalue of . We now establish the estimation consistency for .
Suppose Assumption 4 is met. For any , there exists some positive constant such that, with probability at least , there holds
Specifically, let , then with probability at least , there holds
Corollary 1 is a directly application of Theorem 1 under the linear kernel. An additional assumption is made to establish the sparsistency.
Assumption 5: There exist some positive constants and such that .
Assumption 5 is similar to Assumption 3, and requires the true regression coefficients contains sufficient information about the truly informative variables in the linear model. Similar assumptions are also assumed in Shao and Deng (2012, ) and Wang and Leng (2016, ).
Suppose the assumptions of Corollary 1 and Assumption 5 are met. Let , then we have
It is interesting to point out that Corollary 2 holds when diverges at order . Particularly, when and , can diverge at a polynomial rate
. This result is similar as that established in Shao and Deng (2012) under the assumption of a finite second moment of. In literature, it is possible to allow to diverge at an exponential rate of under certain distributional assumptions on , such as a sub-Gaussian distribution in Shao and Deng (2012, ) or a spherically symmetric distribution with -exponential tails in Wang and Leng (2016, ). Such distributional assumptions are not necessary in establishing the sparsistency in Corollary 2.
5 Extension to interaction selection
Interaction selection has become popular in recent literature, including Bien et al. (2013, ), Lim and Hastie (2015, ), Hao et al. (2018, ), She et al. (2018, ) and Kong et al. (2017, ). In these work, the true regression function is assumed to be a quadratic model . A strong heredity is often assumed, requiring that if an interaction effect is included in the model, then both main effects and must be included as well.
We now extend the proposed algorithm to identify the truly informative interaction terms without assuming the quadratic model. Following the idea in Section 2, the true interaction effects can be defined as those with nonzero second-order gradient function . Note that a nonzero second-order gradient function implies that both and are also nonzero, and thus the strong heredity is automatically satisfied. Specifically, given the true active set , we denote
which contains the variables that contribute to the interaction effects in . Further, let , which contains the variables that contribute to the main effects of only. Therefore, the main goal of interaction selection is to estimate both and .
First, let be a forth-order differentiable kernel function, then it follows from Lemma 1 in the Appendix that for any ,
where . Then, given from (2.1), its second-order gradient function is
where . Its empirical norm is . With some pre-defined thresholding value , the estimated and are set as
respectively. The following technical assumption is made to establish the interaction selection consistency for the proposed algorithm.
Assumption 6: There exists some constant such that for any and . Also, there exist some positive constants and such that .
Assumption 6 can be regarded as an extension of Assumptions 2 and 3 by requiring the boundedness of the second-order gradients of , and that the true second-order gradient functions have sufficient information about the interaction effects.
Suppose the assumptions of Theorem 2 and Assumption 6 are met. Let . For any , with probability at least , there holds
Theorem 3 shows that converges to with high probability, which is crucial to establish the interaction selection consistency.
Suppose the assumptions of Theorem 3 are met. Let , there holds
Theorem 4 shows that the proposed interaction selection algorithm can exactly detect the interaction structure with probability tending to 1. It is clear that the algorithm can be extended to detect higher-order interaction effects, which is of particular interest in some real applications (Ritchie et al., 2001, ).
6 Numerical experiments
In this section, the numerical performance of the proposed algorithm is examined, and compared against some existing methods, including the distance correlation screening (Li et al., 2012, ) and the quantile-adaptive screening (He et al, 2013, ). As these two methods are designed for screening only, they are also truncated by some thresholding values to conduct variable selection. For simplicity, we denote these three methods as GM, DC-t and QaSIS-t, respectively.
In all the simulation examples, no prior knowledge about the true regression function is assumed, and the Gaussian kernel is used to induce the RKHS, where is set as the median of all the pairwise distances among the training sample. The thresholding values for all the methods are selected by the stability-based selection criterion (Sun et al., 2013, ). Its key idea is to measure the stability of variable selection by randomly splitting the training sample into two parts and comparing the disagreement between the two estimated active sets. The maximization of the stability criterion is conducted via a grid search, where the grid is set as .
6.1 Simulated examples
Two simulated examples are examined under various scenarios.
Example 1: We first generate with , where and are independently drawn from . The response is generated as where , with , , and ’s are independently drawn from . Clearly, the first variables are truly informative.
Example 2: The generating scheme is similar as Example 1, except that and are independently drawn from and . The first variables are truly informative.
For each example, we consider scenarios with , and . For each scenario, and are examined. When , the variables are completely independent, whereas when , correlation structure are added among the variables. Each scenario is replicated times. The averaged performance measures are summarized in Tables 1 and 2, where Size is the averaged number of selected informative variables, TP is the number of truly informative variables selected, FP is the number of truly uninformative variables selected, and C, U, O are the times of correct-fitting, under-fitting, and over-fitting, respectively.
It is evident that GM outperforms the other methods in both examples. In Example 1, GM is able to identify all the truly informative variables in most replications. However, the other two methods tend to miss some truly informative variables, probably due to the interaction effect between and . In Example 2, with a three-way interaction term involved in , GM is still able to identify all the truly informative variables with high accuracy, but the other two methods tend to underfit by missing some truly informative variables in the interaction term. Note that if we do not threshold DC and QaSIS, they tend to overfit almost in every replication as both screening methods tend to keep a substantial amount of uninformative variables to attain the sure screening property. Furthermore, when the correlation structure with is considered, identifying the truly informative variables becomes more difficult, yet GM still outperforms the other two competitors in most scenarios.
6.2 Supermarket dataset
We now apply the proposed algorithm to a supermarket dataset in Wang (2009, ). The dataset is collected from a major supermarket located in northern China, consisting of daily sale records of products on
days. In this dataset, the response is the number of customers on each day, and the variables are the daily sale volumes of each product. The primary interest is to identify the products whose sale volumes are related with the number of customers, and then to design sale strategies based on those products. The dataset is pre-processed so that both the response and predictors have zero mean and unit variance.
In addition to GM, DC-t and QaSIS-t, we also include the original DC and QaSIS without thresholding, which keep the first variables to assure the sure screening property. As the truly informative variables are unknown for the supermarket dataset, we report the prediction performance of each method. Specifically, the supermarket dataset is randomly split into two parts, with 164 observations for testing and the remaining for training. We first apply each method to the full dataset to select the informative variables, and then refit a kernel ridge regression model with the selected variables for each method on the training set. The prediction performance of each ridge regression model is measured on the testing set. The procedure is replicated 1000 times, and the number of selected variables and the averaged prediction errors are summarized in Table 3.
|Dataset||Method||Size||Testing error (Std)|
As Table 3 shows, GM selects 10 variables, whereas DC-t and QaSIS-t select 7 variables and DC and QaSIS select 75 variables. The average prediction error of GM is smaller than that of the other four methods, implying that DC-t and QaSIS-t may miss some truly informative variables that deteriorate their prediction accuracy, and DC and QaSIS may include too many noise variables. Precisely, among the 10 selected variables by GM, and are missed by both DC-t and QaSIS-t. The scatter plots of the response against these five variables are presented in Figure 1.
The scatter plots of the response against a number of selected variables by GM in the supermarket dataset. The solid lines are the fitted curve by local smoothing, and the dashed lines are the fitted mean plus or minus one standard deviation.
It is evident that the response and these variables have showed some clear relationship, which supports the advantage of GM in identifying the truly informative variables.
7 Lemmas and technical proofs
To be self-contained, we first give a special case of Theorem 1 in Zhou (2007, ) as a lemma on the smooth RKHS below, which plays an important role for the subsequent analysis. Its proof follows directly from that of Theorem 1 in Zhou (2007, ) and thus omitted here.
Let be a Mercer kernel such that , where is a class of functions whose fourth derivative is continuous. Then the following statements hold:
(a) For any , , for any .
(b) A derivative reproducing property holds true; that is, for any ,
We now start the proof of Theorem 1 with a proposition, which is crucial to establish the estimation consistency of the proposed algorithm.
Suppose Assumption 2 is met and . Let be the minimizer of in . Then for any , with probability at least , there holds
Proof of Proposition 1: Define the sample operators and as
Then solving (2.1) is equivalent to solve
where , and hence that
Similarly, the minimizer of in must have the form
Therefore, we have
and its RKHS-norm can be upper bounded as