With the development of modern technology, it is much easier to collect a large amount of data with unprecedented size and complexity at relatively low cost. It is common that the number of variables collected are much larger than the sample size. In such scenarios, it is generally believed that only a small number of collected variables are useful for statistical analysis while others are noise. Thus, sparse learning aiming to correctly learn the sparse structure of the true target function plays a crucial role to facilitate subsequent statistical modeling and analysis.
Sparse learning has attracted tremendous attention in the past few decades. A popular class is based on regularized M-estimations under certain working model assumptions. The most popular assumed working model is the linear model, which assumes that the variables affect the response only through a linear pattern, and thus the truly informative variables are defined as those with nonzero coefficients. Under the linear assumption, the regularized M-estimation aims to solve the optimization on the combination of various loss functions and some sparse-induced penalty terms (see [5, 12, 13, 28] and references therein). Most recently, Loh  examines the theoretical properties of regularized linear M-estimators. More efforts have been made by considering the nonparametric additive model to relax the restrictive linear assumption (see [11, 18, 29] and the references therein). Yet such assumptions are usually difficult to check in real-life analysis and their performance highly relies on the validity of the pre-specified model assumption. Only a few attempts have been made recently to relax the restrictive assumptions (see [9, 16, 26, 27] and references therein). Yet they are highly computational demanding and the computational efficiency becomes one of their main obstacles.
Another popular research line on high dimensional data analysis is the screening type of methods, which examine the marginal relationship between the response and each variable to screen out non-informative variables under some pre-specified criteria (see [4, 10, 14] and references therein). Most recently, Han  proposes a novel nonparametric screening working within a convex loss family, which shows substantial advantage than many existing methods. Yet it requires that the loss function should be differentiable almost everywhere, which may exclude some popularly used loss functions such as the hinge loss. Note that the screening type of methods can attain the sure screening property, which is slightly weaker than the selection consistency, and may fail to detect the marginally independent but jointly dependent variables [10, 14]. Recently, some novel methods have been proposed by adding noise into the model and thus identify the truly informative variables by using the property of noise, including the measurement-error-model-based selection and the knock-off filter methods (see [1, 25] and the references therein). Most recently, Dasgupta et al. 
propose a recursive feature elimination method via repeatedly fitting a kernel ridge regression within a rich loss function family. Yet their lack of selection consistency or computational efficiency remain as some of their main obstacles.
This article proposes a new spare learning method to learn the sparse structure of the true target function via the regularized M-estimators within a rich family of loss functions in a flexible RKHS. The family of loss functions interested is very rich, which contains most commonly used one in literature, such as the square loss, check loss, hinge loss, logistic loss, -insensitive loss and so on. The proposed method is motivated by the derivative reproducing properties of RKHS . The proposed method is methodologically simple and easy to carry out. Specifically, it consists of the estimation of regularized M-estimators in a flexible RKHS, computation of the gradient functions as well as a hard thresholding step. It shows great advantage than most existing methods that it works for general loss functions, assumes no distributional model, admits general dependence structure, allows for efficient computation, and attains asymptotic selection consistency. The proposed method is computationally efficient by only fitting a standard kernel ridge regression, and thus scalable to analyze large scale dataset. More importantly, asymptotic estimation and selection consistencies of the sparse learning method can be established for general loss functions without requiring any explicit model assumptions. To our knowledge, it is the only method which achieves universality, scalability, efficiency and consistency at the same time in literature.
The rest of the article is organized as follows. Section 2 introduces the rich family of loss functions interested in this paper. The motivations as well as the proposed sparse learning method are provided in Section 3. In Section 4, the asymptotic estimation and selection consistencies of the sparse learning method are established for a general loss under mild conditions. Numerical experiments on the simulated and real examples are reported in Section 5. A brief summary is provided in Section 6, and the numerical algorithms as well as all the technical proofs are given in Appendix and a online supplemental material.
Suppose a training sample are independent copies of , with supported on a compact metric space of and . The true target function is often defined as the minimizer of an expected error
where is a loss function of interest. In this paper, we assume the following conditions on that
(1) is convex, and there exist some positive constants and such that , for any and .
(2) is locally Lipschitz continuous; i.e., for any , there exists some constant such that , for any and .
Note that both conditions are commonly used in machine learning literature to characterize loss functions[2, 8, 15]. Many popularly used loss functions satisfy these two conditions, including:
(i) square loss: with and , for any with a constant ;
(ii) check loss: with and ;
(iii) -insensitive loss: with and ;
(iv) logistic loss: with , ;
(v) hinge loss: with and .
The explicit form of varies from one loss function to another; for example, when the square loss is used, ; when the check loss is used, with ; and when the hinge loss is used, . In this paper, we assume that , where is a reproducing kernel Hilbert space (RKHS) induced by a kernel function .
Unlike most existing nonparametric methods with substantial computational cost, the proposed method provides an efficient alternative for conducting nonparametric variable selection. It is motivated by a key observation that a variable does not contribute to if and only if
More importantly, the derivative reproducing property in RKHS  assures that for any ,
where . This implies that to estimate within the induced RKHS, it suffices to estimate itself without loss of any information. Furthermore, the -norm induced by the marginal distribution of is defined as
which can be adopted as a valid measure to distinguish the informative and noninformative variables. Then the true active set is defined as .
Motivated by the key observations (2) and (3), the proposed method consists of three steps, involving a regularized M-estimation in a flexible RKHS, the computation of the corresponding gradients functions and a hard-thresholding procedure.
Firstly, to estimate , we consider the regularized M-estimation in a flexible RKHS by solving
where and denote the representer coefficients.
Once is obtained, the derivative reproducing property in Proposition LABEL:prop3 can be used to facilitate the computation of and by (3), we have for each , where the estimated representer coefficients are obtained in (6) and . In practice, it is difficult to directly evaluate , since is usually unknown. We then adopt the empirical norm of as a practical measure,
Then the active set can be estimated as for some pre-specified .
Note that the selection performance of the proposed method highly depends on the thresholding parameter , and thus the stability-based criterion  is employed to select an optimal value of . Specifically, its key idea is to measure the stability of variable selection by randomly splitting the training sample into two parts and comparing the disagreement between the two estimated active sets. Specifically, given a thresholding value , we randomly split the training sample into two parts and . Then the proposed algorithm is applied to and and obtain two estimated active sets and , respectively. The disagreement between and is measured by Cohen’s kappa coefficient
where and with and denotes the set cardinality. The procedure is repeated for times and the estimated variable selection stability is measured as
Finally, the thresholding parameter is set as , where is some given percentage.
4 Asymptotic results
In this section, we establish the estimation and selection consistencies of the proposed method. We start with a brief introduction about some basic facts in learning theory. Specifically, we have for any , and for any . By the Mercer’s theorem, under some regularity conditions, the eigen-expansion of the kernel function is
are non-negative eigenvalues, and
are the associated eigenfunctions, taken to be orthonormal in. The RKHS-norm of any then can be written as
which implies the decay rate of fully characterizes the complexity of the RKHS induced by , and has close relationships with various entropy numbers . Therefore, for any , we have where are Fourier coefficients. Note that these results require that , which is automatically satisfied if is bounded. Besides, the solution of (1) may not be unique, and we further define with to ensure the uniqueness of in the rest content.. Moreover, we denote .
In the rest of this paper, we rewrite and as and to emphasize their dependency on , and is allowed to diverge with . We also denote as . The following technical assumptions are required to establish the estimation consistency of the proposed method.
Assumption 1: There exist some positive constants and such that and for any .
Assumption 2: There exist some positive constants and such that the approximation error .
Assumption 1 imposes the boundedness condition on the kernel function as well as its gradient functions, which is commonly used in literature [20, 21] and satisfied by many kernels, including the Gaussian kernel, Sobolev kernel, scaled linear kernel, scaled quadratic kernel and so on. Assumption 2 quantifies the approximation error as a function of the tuning parameter , which is sensible as in general. Similar assumptions are also used in literature to control the approximation error rate [2, 19, 20, 27]. Particularly, Mendelson and Neeman  shows that the approximation error under the square loss is with . Further investigations about the approximation error rate are provided in Eberts and Steinwart  and Hang and Steinwart  by imposing some additional technical assumptions.
Suppose that Assumptions 1 – 2 are satisfied. Let , then for any , there exists some positive constant such that with probability at least
such that with probability at least, there holds
Theorem 1 establishes the estimation consistency of , in the sense that converges to in probability. This consistency result is established without requiring any specific model assumption for a general loss satisfying conditions (1) and (2) in Section 2. Note that the convergence result allows to diverge with but the dependency is generally difficult to be quantified explicitly. Similar observations have also been made in . Theorem 1 is also crucial to establish the selection consistency of the proposed method.
One more technical assumption is needed to establish the selection consistency of the proposed method.
Assumption 3. There exist some positive constants and such that
Assumption 3 requires that contains sufficient information about the truly informative variables. Note that the required minimal signal strength in Assumption 3 is much tighter than many nonparametric variable selection methods [11, 26], which require the signal is significantly bounded away from zero by some positive constant. Now we establish the selection consistency of the proposed method.
Suppose that all the assumptions of Theorem 1 as well as Assumption 3 are satisfied. Let , then we have
Theorem 2 shows that the selected active set can exactly recover the true active set with probability tending to 1. This result is particularly interesting given the fact that it is established for general loss function without any explicit model assumption. The attained theoretical result also shows some advantages than the existing results in literature. First, the theoretical results hold for a variety of loss functions, including (i)–(v) in Section 2. Second, the proposed method can be regarded as a joint screening method and thus is able to identify all the informative variables acting on the response with a general dependence structure, including the marginally noninformative but jointly informative ones. To some extent, the proposed method can simultaneously achieve methodological universality, computational scalability, and theoretical consistency.
5 Numerical results
In this section, we examine the proposed method in some simulated and real-life examples under various settings. Specifically, we consider mean regression with square error loss, check loss with quantile level, and classification with the hinge loss and logistic loss, due to their popularity and importance in statistical machine learning [10, 26, 28, 31]. Under mean regression setting, we compare the performance of the proposed methods with distance correlation learning , the quantile-adaptive screening  with quantile level , denoted as MF-SQ, MF-QA, DC and QaSIS for simplicity. As the DC and QaSIS methods are designed to keep the first variables to assure the sure screening property, they are also truncated by some thresholding values to conduct variable selection and we denote the truncated methods as DC-t and QaSIS-t, respectively. Under classification setting, we compare the performance of the proposed methods with sparse additive machine , DC and DC-t, and the first three methods are denoted as MF-SVM, MF-LOG and SAM for simplicity.
In all the examples, the RKHS induced by the Gaussian kernel is adopted, where is set as the median of all the pairwise distances among the training sample. Other tuning parameters such as the ridge regression parameter and the thresholding value are selected by the variable selection stability criterion  via a grid search, where the grid is set as .
5.1 Simulated examples
In this section, we consider two simulated examples to illustrate the performance of the proposed algorithm. The first example is under mean regression setting where the true regression function is nonparametric. The second example is under classification setting where the true conditional logit function has a complicated structure. The two simulated examples are examined under various scenarios.
Example 1 (Regression): We first generate with , where and are independently drawn from . The response is generated as where , , and ’s are independently drawn from . Clearly, the first variables are truly informative.
Example 2 (Classification): We generate with , where and are independently drawn from . Then we generate with the true conditional logit function Clearly, the first 2 variables are truly informative.
For each example, we consider scenarios with and . For each scenario, and are examined. When , the variables are completely independent, whereas when , correlation structure are added among the variables. Each scenario is replicated 50 times. The averaged performance measures are summarized in Tables 1 and 2, where Size is the averaged number of selected informative variables, TP is the number of truly informative variables selected, FP is the number of truly non-informative variables selected, and C, U, O are the times of correct-fitting, under-fitting, and over-fitting, respectively.
|Tables 1–2 about here|
It is evident that the proposed methods outperform the other competitors in both regression and classification settings. In Example 1, MF-SQ and MF-QA are able to identify all the truly informative variables in most replications. However, the DC-t and QaSIS-t methods tend to miss some truly informative variables. In Example 2, MF-SVM and MF-LOG are able to identify all the truly informative variables acting on the true conditional logit function with high probability, but the DC-t method tends to underfit by missing some truly informative variables and the SAM method tends to overfit by including some noise variables. As opposed to DC-t and SIS-t, the DC and QaSIS methods tend to overfit in every replication as they are designed to keep a substantial amount of noninformative variables to attain the sure screening property. Furthermore, when the correlation structure with is considered, identifying the truly informative variables becomes more difficult, yet the proposed algorithm still outperforms the other competitors in most scenarios. As a computational remark, since the computational cost of SAM can be very expensive especially the dimension is large, and thus we do not report the performance of SAM in the scenarios with .
5.2 Human breast cancer dataset
In this section, we apply the proposed method to a real dataset on the human breast cancer study , which can be downloaded online at https://www.ncbi.nlm.nih.gov/geo/ with accession number GSE20194. It consists of 278 patient samples with 164 patients have positive oestrogen receptor status and 114 patients have negative oestrogen receptor status, and each sample is characterized by 22283 genes. A patient has positive oestrogen receptor status if the receptors for estrogen are detected, which suggests that estrogen may send signals to the cancer cells among normal breast cells to promote their growth. It has been shown that roughly 80 percent of the patients diagnosed with breast cancers, have positive estrogen receptor status. Consequently, the main interest of our study is to identify the genes which effect the oestrogen receptor status.
Clearly, this dataset falls into classification scenarios. Then we apply all the methods used in Example 2 to identify the informative genes. When the informative genes have been selected, we randomly split the dataset, with 78 observations for testing and the remaining for training, and refit a standard kernel SVM by using R package ‘e1071’. The splitting is replicated 1000 times, and the averaged prediction results are summarized in Table 3. Besides, we also provide the boxplots for all the methods in Figure 1.
|Figure 1 and Table 3 about here|
Note that since the oestrogen receptor status plays a important role in assistant diagnosis for breast cancer, it is more severe to wrongly classify the patients with positive oestrogen receptor status to the negative one. And thus, we also report the false negative rate in Table3. Clearly, MF-SVM shows substantial advantage than all the other methods and followed by MF-LOG from Table 3 and Figure 1. Specifically, the averaged testing error and false negative rate based on the selected sets of MF-SVM are the smallest among all the methods, and followed by MF-LOG and SAM have the similar performance and show advantage than the DC and DC-t in terms of the testing error. Note that MF-LOG selects less genes than SAM, which implies that SAM may wrongly identify some variables. The high testing errors of DC-t and DC suggest that these methods may include or exclude some non-informative genes that deteriorate their prediction performance. The numerical results in real-life analysis assure that our proposed method outperforms most existing classification methods in literature with strong selection and prediction power.
This paper proposes a novel and universal sparse learning method, which brings bright to the general variable selection under various scenarios. The proposed method takes fully advantage of the nice properties of the induced RKHS. It is worhty to point out that our proposed method can be directly extended to conduct interaction selection efficiently without explicit model assumption without paying much additional efforts. Another interesting future extension is to allow to be out of the induced RKHS but , one possible routine is to modify the definition of the true active set as , where denotes all variables except for and measureing the largest possible change of along the -th coordinate. Then, the equivalence between and the gradients of some intermediate function in is of great interest to be examined. As discussed after Theorem 1, the convergence rate of the proposed sparse learning method is linked with the dimension implicitly, and may vary from case to case.It would be of great interest to expound the explicit diverging order in the future work by analyzing the kernel function in high-dimensional case in detail.
Due to space constraint, the computing algorithms and some lemmas and their proofs are provided in a online supplementary material.
Appendix II: technical proofs
Proof of Theorem 1: Define the sample operators for gradients, and its adjoint operator as
respectively. Similarly, the integral operators for gradients, and are defined as
Note that and are Hilbert-Schimdt operators  in the Hilbert-Schmidt space , where is a Hilbert space with all the Hilbert-Schmidt operators on , endowed with norm . Note that the following property holds: for any . Therefore, we have
Therefore, there holds
where is a bounded quantity and the inequality follows from the Cauchy-Swartz inequality. Note that , and thus an upper bound for is provided by the combination of Assumption 2 and Lemma 2 in the supplementary material. Now we turn to bound and .
By Assumption 1 and direct calculation yields that
On the other hand, by the concentration inequalities in on , for any , we have
Then, this implies that for any , with probability at least , there holds
Then, by setting , the desired results follows immediately.
which contradicts with Theorem 1, and thus with probability at least .
Next, we show that in probability. If not, suppose there exists some but , which implies that by Assumption 3, and . Hence, by Theorem 1, with probability at least , we have
And thus it contradicts with Theorem 1 again. This implies that with probability at least . Combining these two results yields the desired sparsistency.
-  R. Barber and E. Cands. A knockoff filter for high-dimensional selective inference. ArXiv:1602.03574, 2018.
-  S. Dasgupta, Y. Goldberg, and M Kosorok. Feature elimination in kernel machines in moderately high dimensions. Annals of Statistics, In press, 2018.
-  M. Eberts and I. Steinwart. Optimal regression rates for svms using gaussian kernels. Electronic Journal of Statistics, 7:1–42, 2013.
-  J. Fan and J. Lv. Sure independence screening for ultrahigh dimensional feature space (with discussion). Journal of the Royal Statistical Society, Series B, 70:849–911, 2008.
-  J. Fan and J. Lv. A selective overview of variable selection in high dimensional feature space (invited review article). Statistica Sinica, 20:101–148, 2010.
-  K. Fukumizu and C. Leng. Gradient-based kernel dimension reduction for regression. Journal of the American Statistical Association, 109:359–370, 2014.
-  X. Han. Nonparametric screening under conditional strictly convex loss for ultrahigh dimensional sparse data. Annals of Statistics, In press, 2018.
-  H. Hang and I. Steinwart. A bernstein-type inequality for some mixing processes and dynamical systems with an application to learning. Annals of Statistics, 45:708–743, 2018.
-  X. He, J. Wang, and S. Lv. Gradient-induced model-free variable selection with composite quantile regression. Statistica Sinica, 28:1512–1538, 2018.
-  X. He, L. Wang, and H. Hong. Quantile-adaptive model-free variable screening for high-dimensional heterogeneous data. Annals of Statistics, 41:342–369, 2013.
-  J. Huang, J. Horowitz, and F. Wei. Variable selection in nonparametric additive models. Annals of Statistics, 38:2282–2313, 2010.
-  J. Huang, Y. Jiao, Y. Liu, and X. Lu. A constructive approach to high-dimensional regression. ArXiv:1701.05128v1, 2018.
-  C. Leng, Y. Lin, and G. Wahba. A note on the lasso and related procedures in model selection. Statistica Sinica, 16:1273–1284, 2006.
-  R. Li, W. Zhong, and L. Zhu. Feature screening via distance correlation learning. Journal of the American Statistical Association, 107:1129–1139, 2012.
-  J. Lin, L. Rosasco, and D. Zhou. Iterative regularization for learning with convex loss functions. Journal of Machine Learning Research, 17:1–38, 2016.
-  Y. Lin and H. Zhang. Component selection and smoothing in multivariate nonparametric regression. Annal of Statistics, 34:2272–2297, 2006.
-  P. Loh. Statistical consistency and asymptotic normality for high-dimensional robust m-estimators. Annals of Statistics, 45:866–896, 2017.
-  S. Lv, H. Lin, H. Lian, and J. Huang. Oracle inequalities for sparse additive quantile regression in reproducing kernel hilbert space. Annals of Statistics, 2:781–813, 2018.
-  S. Mendelson and J. Neeman. Regularization in kernel learning. Annal of Statistics, 38:526–565, 2010.
-  L. Rosasco, S. Villa, S. Mosci, M. Santoro, and A. Verri. Nonparametric sparsity and regularization. Journal of Machine Learning Research, 14:1665–1714, 2013.
-  S. Smale and D. Zhou. Learning theory estimates via integral operators and their approximations. Constructive Approximation, 26:153–172, 2007.
-  I. Steinwart and A. Christmann. Support Vector Machine. Springer, 2008.
-  W. Sun, J. Wang, and Y. Fang. Consistent selection of tuning parameters via variable selection stability. Journal of Machine Learning Research, 14:3419–3440, 2013.
-  G. Wahba. Support vector machines, reproducing kernel hilbert spaces, and randomized gacv. Advances in kernel methods: support vector learning, pages 69–88, MIT Press, 1998.
-  Y. Wu and L. Stefanski. Automatic structure recovery for additive models. Biometrika, 102:381–395, 2015.
-  L. Yang, S. Lv, and J. Wang. Model-free variable selection in reproducing kernel hilbert space. Journal of Machine Learning Research, 17:1–24, 2016.
-  C. Zhang, Y. Liu, and Y. Wu. On quantile regression in reproducing kernel hilbert spaces with data sparsity constraint. Journal of Machine Learning Research, 17:1–45, 2016.
-  X. Zhang, Y. Wu, L. Wang, and R. Li. Variable selection for support vector machines in moderately high dimensions. Journal of the Royal Statistical Society Series B, 78:53–76, 2016.
T. Zhao and H. Liu.
Sparse additive machine.
International Conference on Artificial Intelligence and Statistics, 2012.
-  D. Zhou. Derivative reproducing properties for kernel methods in learning theory. Journal of Computational and Applied Mathematics, 220:456–463, 2007.
J. Zhu and T. Hastie.
Kernel logistic regression and the import vector machine.Journal of Computational and Graphical Statistics, 14:185–205, 2005.