1 Introduction
Feature selection is important in data mining and knowledge discovery. It has many applications, including 1) finding and ranking important disease genes; 2) reducing nuisance features that mask truth or terminate a computation due to the size or memory limitation of a current computing software architecture [Liu (2009)]; 3) building a parsimonious model or performing a dimension reduction; and 4) improving efficiency in storage, communication, or computation.
A unified view
for feature selection in supervised learning is that finding important features is equivalent to an optimization problem that seeks a solution “
” that maximizes an objective function over a domain of decisions “”, given a data set . Here , where is a matrix representing measurements of covariates measured on subjects, and denotes the outcomes of these subjects. The number of true important features (defined explicitly below in Section 2) from the covariates is unknown and typically sparse. depends on a statistical objective and a suitable model for analyzing the data . The value of , which maximizes is indicative of the importance of features. For example,is a penalized summation of squares of “residuals” in a linear regression context, or a measure of predictive classification rate in a classification model.
With a large , seeking an exact optimization of may be impossible or inefficient. Approximate solutions or clever alternatives are often necessary for practical use. These alternatives include Local Quadratic Approximation (LQA) for minimizing a penalized likelihood with a SCAD penalty [Fan and Li (2001)], LAR [Efron et al. (2004)] for speeding up LASSO [Tibshirani (1996)], MCP [Zhang (2010)] that advocates using a minimax concave penalty, and iterative procedures, such as stepwise procedure/ELSA [Kim, Street, and Menczer (2000)], generalized bagging [Breiman (1996)] and boosting [Schapire et al. (1998)] procedures for tree based models. As a current stateoftheart technique, the Elastic Net [Trevor, Tibshirani, and Friedman (2009)] is a regularized regression method that linearly combines the L1 and L2 penalties of the LASSO and ridge methods. This procedure and its cousins, SCAD and MCP, all belong to the class of penalized (likelihood) strategies. They have been studied extensively to generalize for example, to a smooth regression model that subjects to different covariate effects within a group by Guo et al. (2016); for their asymptotic properties by Liu and Yu (2013); or for asymptotic optimality by Zhang (2013); for multivariate regression modeling with a large number of responses by Sun et al. (2015)
; and for confidence intervals in highdimensional linear regression [
Cai and Guo (2017)]. The Random Forest [
Breiman (2001)] is the benchmark of treebased methods for producing importance of predictors (or features) in the application of ensemble learning and regression. These sophisticated procedures assume that target data are within the size and memory limitation of these procedures’ software implementation. The everincreasing volume of data continuously challenges this assumption. When a limit is reached, one strategy is to use a sample of the data. However, large data are often heterogenous, a subsample or even a few subsamples may not adequately represent the full data set. Another strategy is to use a newer machine with a larger space and memory, and a new software implementation. On the other hand, when the true number of features is only a few, inclusion of all features, even only a few hundreds or thousands into a (standard) statistical analysis procedure can mask true features. See Liu, Sun, and Zhang (2007)for subsampling for PCA, an unsupervised learning. The third approach is using a rough dimension reduction methodology or domain knowledge to prescreen the features. Here, we consider a different approach, developing
an ensemble that uses an adequate number of multiple subsamples and the outcomes from these subsamples strategically. We also wish to require this ensemble procedure simple to implement and easy to scale to everenlarging dimension or size of data. Our Subsampling Winner Algorithm (SWA) is developed based on thess ensemble requirements. Our SWA differs from the “subsampling procedures" in Wang et al. (2016), which subsample from observations; SWA subsamples from variables or features, and hence the traditional statistics based on a large number of observations does not apply here. SWA bears some similarity to the philosophy behind the bagging and boosting for a tree based model in selecting the best split at each node of the tree. SWA applies simple, standard procedures on subsamples strategically as Donoho and Jin (2015) did for Higher Criticism (HC) motivated by a Tukey’s idea, and hence SWA can be used for large data without adding too much complexity in its algorithm. By its subsampling nature, SWA reduces not only the requirement of the memory or size, but also the complexity that arises from many masking nuisance features in the whole sample.This paper focuses on developing a practical SWA for feature selection in linear regression from large data. In Section 2, our sparse linear model is provided with an ovarian cancer data that motivated our development of SWA for regression. In Section 3, the Subsampling Winner Algorithm (SWA) is introduced. In Section 4, the procedures for specifying the SWA parameters are provided. In Section 5, the SWA’s parameter specification procedure is validated, and a comprehensive comparison of SWA performance with those of benchmark procedures are given. The benchmark procedures include the Elastic Net, SCAD, MCP, and Random Forest. In Section 6, the SWA is applied to ovarian cancer data, where we also provided a “double assurance” procedure for practical applications. We found that SWA has excellent, automatic control of false discovery rates (FDR), and is competitive in capturing true features, i.e., with a good true discovery rate (TDR) when the FDRs are equalized
for all comparison procedures. Also, the features discovered by SWA from the ovarian cancer data contain functionally important genes and pathways, setting a basis for future research into much needed development of drug targets for ovarian cancer. The article is concluded with a discussion, summary, and recommendations in Section 7, including an analogy of SWA with the deep learning principle. A simple proof of our proposition is given in the appendix.
2 Model and Data
Consider data from a linear model
(1) 
relating the outcome and its dimensional covariates (aka features) via their measurements in
, and the unknown parameter vector
. Here, the ndimensional error vectoris assumed to come from a homoscedastic normal distribution with mean
and variance
, for simplicity. This simple model (1) is adequate for many data applications, though the data may need to be preprocessed or transformed before its modeling by (1), such as the motivating data below. This model is also the most basic model for evaluating a paradigm or direction change from a status quo such as Cai and Guo (2017) did for confidence intervals in high dimension. A generalization of (1) to more general error structures will be discussed in section 7. The challenge here, as it is in many recent modern procedures, is when . These recent procedures have been based on the exploitation of sparsity assumption of the large vector, which assumes that only a small number out of components of are nonzero. We assume the same moderate sparsity with as Cai and Guo (2017) did. Our objective here is to identify the set of ‘true features’ that are associated with the ‘covariates’ corresponding to nonzero coefficients ’s.Data. In human cancer genomic research, feature selection plays an important role in analyzing big data. Messenger RNA (mRNA) are RNA molecules that transmit genetic information from DNA to the ribosome, where they control the amino acid sequence of gene expression. In particular, for the ovarian cancer data from the Broad Institute TCGA Genome Data Analysis Center (GDAC, 2013), we are interested in the identification of mRNAs that are directly linked to an important gene expression (aka, the response ) that has a significant impact on the overall survival of ovarian cancer patients. Analyzing this publicly available data set is in response to the NIH call for “Secondary Analysis and Integration of Existing Data to Elucidate the Genetic Architecture of Cancer Risk and Related Outcomes (R01)” (https://grants.nih.gov/grants/guide/pafiles/PA17239.html).
The important mRNA genes would influence the response and have nonzero ’s in explaining the response in (1). The ovarian cancer data contain mRNA gene expression profiles derived from ovarian serous cystadenocarcinoma specimens available from the Broad Institute’s GDAC. The data have been normalized and analyzed for an association study (Broad Institute TCGA Genome Data Analysis Center, GDAC, 2013). Hence, Equation (1) is an adequate approximate model for our application to this normalized dataset.
Clearly in this ovarian cancer data. Thus, an adhoc or a standard statistical feature selection procedure is inconsistent. An adequate procedure designed specially for largesmall problems is needed. SWA described below is suitable for the case of and will be applied to this ovarian cancer data to find significant genes linked to an important response . The resulting “important” genes will help to validate candidate target genes in literature, or mine further important features that were not presented in the existing biological literatures.
3 Subsampling Winner Algorithm
This section presents our Subsampling Winner Algorithm (SWA) for linear regression with data from equation (1) when . Given a suitable base procedure h and a scoring algorithm w that ranks features, our SWA goes through 4 steps: (1) SWA subsamples data and performs subsample analysis using a base procedure to each of subsamples of size , for ; (2) SWA ranks the subsample models and features; (3) SWA obtains semifinalists from the ranks of features; (4) SWA analyzes the semifinalists to capture the finalists, i.e., the most important features using the base, or an enhanced base procedure. Hence, an SWA depends on and h and w:
(2) 
where h, w are the base and scoring functions, respectively.
For the linear regression model (1), h can be simply the standard least square procedure for subsamples of dimension of data points (). We define based on the scores of features in each submodel weighed by the submodel performance as given in the score function (3) below.
Algorithm 1: SWA for Regression in Large Data

Subsample analysis. For each , {

Sample candidate features. Randomly draw subcolumns from columns of to obtain an dimensional submatrix, still denoted as for simplicity though having abused the notation.

Least squared regression. Fit by h, the standard least square procedure (and allowing an additional stepwise variable selection on this resulting fit, as a user specified option).

Grade candidates locally. Store the resulting ‘Residual Sum of Squares’ and estimated ‘ statistic’ of as and , for , where the for the unsampled features are set to zero.
}


Rank submodels. Rank for , and let be the smallest Residual Sum of Squares and be the corresponding tvalues of , for , .

Rank all individuals to obtain semifinalists. Compute the scores of all features for , by the scoring function:
(3) Retain the top features that correspond to the largest values, as the semifinalists.

Select finalists. Fit a linear regression of on the semifinalists by h. The features with pvalues less than (after a multiplicity adjustment) are selected as the finalists. A further stepwise selection is also an option from this final regression fit.
We shall evaluate the SWA by comparing SWA with its benchmarks, in terms of the actual false discovery rate and true discovery rate of important features.
This SWA algorithm is simple. It can subsample from large data to select important features with no restriction on the size of . It therefore may be called “dimensionless” and will be shown to be competitive in Section 5.
4 Procedures for SWA’s Parameter Selection
This section addresses the specification of , and in (2) for the SWA in regression. We use SWA in stead of SWA hereafter.
4.1 The Choice of
Denote to be the set of true features and to be the set of features selected by SWA. If the fixed , then as given by Proposition 1 below, a chance that all features are selected in one subsample is close to when is large. The estimated coefficients from this subsample analysis would then be consistent for a fixed , as [Rao (1973)].
In general, we assume that data come from an identifiable model in the sense that (1) features are sparse with [Cai and Guo (2018)]; (2) models with more true features are dominant than those with less true features, leading to small RSS; and (3) the SOIL condition by Ye, Yang, and Yang (2018) is satisfied as . See more discussion on this SOIL condition below. Hence, we would then have that if ,
(4) 
In other words, when
, the features selected by the SWA have a large probability to be consistent to the set of true nonzero features when the model is identifiable. However, the true number of features
is unknown. If is too small, we may not be able to capture with a high probability all the important features in any subsample. On the other hand, if is too large, a large can slow down computation dramatically and may mask important features. Hence as , where “” denotes the symmetric difference of two sets and .Rule of Thumb. Our simulation and experience recommend a rule of thumb in selecting :
(5) 
This rule of thumb does not require
knowing but some range of . When the range is still quite
uncertain, we propose a multipanel diagnostic plot for choosing ,
constructed based on the following algorithm.
Algorithm 2: Multipanel Plot Algorithm (MPA) for selecting

Initialize a set of values: , e.g., .

Arrange scree plots of feature weights for each of the s values in I panels. Run the SWA for . Name the resulting weights in (3) for each as , for and . Plot for to obtain simultaneous panel plots in both fixed and freescales to examine the changes in both magnitude and shape of these plots. See Figures 1 & 2.

Identify the minimal stable point of s. A value of is a stable point, if its scree plot has an obvious “elbow” point and a stable “upper arm” set. An “elbow” point is the point where a significant change in the slope of a scree plot has occurred, typically leading to a tapering flat line. An “upper arm” set contains the points above the “elbow” point, for example, feature set in Figures 1 and 2. An “upper arm” set is stable if indices of the points in the “upper arm” set are the same as those of the “upper arm” set of the next value (although set orders may differ). A set is “relative stable" if the intersection of this set with its adjacent upper arm set(s) (for adjacent s values) contains a stable subset in reference to the neighboring sets.

Optional step. If there are no clear stable plots, but some semistable subplots (i.e., there are no definite stable sets), enlarge the set of by adding a few in the neighborhood of semistable plots, return to Step 2 using the updated .
Both practical selection procedures of are simple to implement. The multipanel plot has been implemented in our R package “subsamp,” available in cran.rproject.org.
The following are two examples of using this algorithm.
Example 1: Consider data from the model:
(6) 
where , are i.i.d. , for . Hence, , , and . For this example, we apply the SWA by subsampling with a repetition size , and . The resulting multipanel plots of the weight scores of the top 40 features for each choice of , for , are in Figures 1 and 2.
The fixedscale multipanel plot Figure 1 shows a significant change in magnitude occurred at or . Combining with the information in the freescale plot Figure 2, we see that the plot at has an “elbow” point, while the plot at does not have an obvious “elbow” point.

The upper arm set is for the plot of and this set has a relatively stable subset of . It is important to note that the coefficient of is not as large as these of and . So “1” can hide among other features.

, obviously.
When , true features are gradually masked more by nuisance features. Hence, the corresponding would be the recommended subsample size.
Example 2: Consider data from the model
(7) 
where , and are i.i.d. , for . This is again large small data with true features. For this example we apply SWA by subsampling with a repetition size , and .

As shown in the fixed scale multipanel plot in Figure 3, significant changes in magnitude occurred at . In combination with the information from Figure 4, we observe that lead to plots with a stable ‘elbow’ point.

The upper arm set is , which is also stable.

It is obvious that . It is also important to note that the signal of , and are not as strong as the rest of true features with that of being extremely weak, especially when compared with the variance of .
Figure 1: Fixedscale multipanel diagnostics plot for .
Figure 2: Freescale multipanel diagnostics plot for .
Figure 3: Fixedscale multipanel diagnostics plot for .
Figure 4: Freescale multipanel diagnostics plot for .
The is the optimal subsample size in terms of stabilization in both scale and shape. In general, in is sufficient by our experience, although there are some cases when for a large may still work. In Section 5.1, we validate systematically the multipanel procedure for choosing s. In these two examples, we did not need to use Step 4 in Algorithm 2.
Discussion of the identifiable model. Given the SOIL condition in the identifiable model, Ye, Yang, and Yang (2018) has shown that the ensemble model leads to a consistent feature selection. We made the same assumption in our identifiable model. However, in our studies, under the identifiable condition (2), the dominance condition, the scoring functions ’s in (3) typically place strongenough important features into a semifinalist set and hence lead to consistent estimates of regression coefficients. This is perhaps why our SWA procedure is competitive to the consistent penalized procedures in our simulation studies. We conjecture that conditions (1) and (2) in the identifiable model above would imply conditions (3), the SOIL condition or the consistency of Random Forest [Scornet, Biau, and Vert (2015)], under some mild regularity conditions on the strength of the signals. The ideas used to show that Adaptive Regression by Mixing (ARM) leads to the SOIL condition [Yang (2001)] might be generalizable to probe the consistency of our SWA procedure. We leave the formal consistency condition and proof to future investigation.
4.2 The Choice of
Proposition 1. Given that , the upper and lower bounds for the number of independently selected subsamples of size to capture all the important features with a probability at least are given by
(8) 
A simple proof is given in the appendix.
Remark: It is worthwhile to point out that the repetition number given by Proposition 1 is usually quite large. This is due to the consideration of the “worst” case senario, in which the important features would not be discovered unless all of them were selected in one of subsamples. This “all or none" condition is extreme. The repetition number needed to catch the important features is often much smaller than the one indicated by Proposition 1. The repetition number and its corresponding lower and upper bounds given by Proposition 1 may be considered conservative benchmarks.
The larger is, the more computation load will be. In our R package “subsamp,” the default , has been sufficient for all examples we tried. Parallel computing can be used to speed up the implementation of the algorithm process, see discussion in Section 7.
4.3 The Choice of q
In the last step of Subsampling Winner Algorithm, in our examples and application in Section 5 and 6, we have chosen in the final (stepwise) model selection, which seems to be satisfactory for the values of we have tried.
5 Validation and Comparisons
In this section, we first validate the MPA method specified in Algorithm 2, for selecting the subsample size , by empirically examining the performance of SWA with various choices of subsample size (Section 5.1). We then compare the SWA with three benchmark procedures: Elastic Net, SCAD and MCP, via their R code implementation “glmnet”, “ncvreg” and “plus”, and then with the “randomForest” procedure. Both independent (Section 5.2) and correlated (Section 5.3) covariates are considered in our comparative study. The ultra high dimensional case with a prescreening procedure is also examined (Section 5.4).
5.1 Validation of Selection Method MPA of
The model of our choice for validating the MPA is (7) given in Example 2, where , , with representing the parameters of true features. This model has a mixture of features including extremely weak (with coefficients 0.1 and 0.5), weak ( 1 and 1.5), moderate ( 2 and 2.5), strong () and very strong ( 3.5, 4 and 5) features.
Using the MPA, we have determined that is a good choice of subsample size for example 2. For each sample drawn from this model, variables selected by SWA, were recorded; pvalues of the finalists were adjusted by the simple Bonferroni method to control the multiplicity of searching from covariates at familywise error (FWE) rate. We repeated this random sampling and analysis times independently.
Tables 1 and 2 show the effects of different choices of on the true discovery and false discovery of features by SWA, respectively. We examine the number of captures of true features for (full capture), and for at least true features. We experiment with , and . Chosen by Algorithm in Example 2, is indeed the best choice based on a simulation study with repetitions. As shown in Table 1, the performance of true feature detection is improved as the subsample size increases from 5 to 30. When (or ), obviously, detection rate for more than 5 (or 8) features is zero, because with . The two best powers seem to have reached at and in this study. There is little difference between results with and , considering that is not very different from , when .
#Features  s=5  s=8  s=10  s=15  s=20  s=30  s=35 

Captured  
0%  0%  0%  0%  0%  0%  0.1%  
0%  0%  0.3%  4.7%  16.8%  26.7%  29%  
0%  1.2 %  10.2%  46.3%  78.1%  96.2%  96%  
0%  22.1%  44.0%  86.6%  97.8%  99.9%  100.0%  
0%  63.7%  80.1%  98.0%  99.9%  100.0%  
34.4%  90.8%  96.7%  99.8%  100.0%  
63.8%  97.7%  99.9%  100.0%  
81.7%  99.6%  100.0%  
100.0%  100.0% 

Notes: Entries are percentages of captures of true features. “10” in the first column means that all 10 true feature are captured, while “” means that at least 6 true features are captured.
Table 2 indicates that the falsefeature capture rate remains controlled at until . At , the false discovery rate is a little bigger than . Hence, is the optimal value of when both the power and false detection rate are considered. This outcome (from Tables 1 and 2) based on the repeated experiments validates our finding by the multipaneling plot algorithm (MPA).
#Features  s=5  s=8  s=10  s=15  s=20  s=30  s=35 

Captured  
986  991  987  984  981  956  934  
14  9  12  15  19  42  58  
1  1  2  6  
2  
Total  1000  1000  1000  1000  1000  1000  1000 

Notes: Entries are numbers of captures. “1” in the first column means that exactly one false feature is captured.
5.2 Comparison with Benchmark Procedures in the Case of Independent Covariates
In this section, we compare SWA with 3 benchmark procedures and Random Forest, when covariates are generated independently as in model (7) in Example 2. Given by the multipanel diagnostic plots in Figures 3 and 4, we set and . We performed a comprehensive, comparative study of SWA with the Elastic Net, SCAD and MCP as well as Random Forest.
We first compared SWA to the Elastic Net, SCAD and MCP using the default parameters in these R packages “glmnet” and “ncvreg” where their tuning parameter was chosen corresponding to the minimum mean crossvalidated error for Elastic Net and SCAD, and using for MCP. The simulation size is 1000. The False Discovery Rates (FDR) of these competing procedures are listed in Table 3. Out of independent repetitions, SWA with a Bonferroni control of the multiplicity at 0.05 FWE performed extremely well, capturing none of false features in 956 times, exact 1 false feature in 42 of the trials, exactly 2 nuisance features only twice, and no error in capturing more than 2 nuisance features. This SWA had a FDR closest to the target level of , among all comparison procedures. With the same Bonferroni control but adding a stepwise procedure in SWA’s last step (Step 4), the false discovery of the nuisance features by SWA are comparable to the MCP procedure, both much better than those by the Elastic Net and SCAD (using their default parameters). The procedure with a BenjaminiHochberg (BH) FDR control still did better than Elastic Net and SCAD (with the default datadependent parameter) procedures in terms of false feature alarm.
#Features  SWA w/  SWA w/  SWA w/step  Elastic Net  SCAD  MCP 

Captured  Bonferroni  BH  Bonferroni  
956  756  828  7  403  832  
42  169  142  14  255  151  
2  53  20  31  175  15  
13  6  50  88  2  
5  3  65  38  
3  1  76  24  
89  10  
1  91  5  
82  2  
76  
419 

Notes: Entries are numbers of captures. “1” in the first column means that exactly one false feature is captured.
Next we compare the competing procedures, SWA, Elastic Net, SCAD and MCP, in terms of truefeature discovery rates after setting the regularization parameter for Elastic Net, SCAD and for MCP to match the FDR of SWA at 0.05. The outcome is summarized in Table , which provides the percentage of times that the true features are captured. Overall, SWA with a Bonferroni correction (denoted as “SWA w/ Bonferroni") under this setting is comparable to SCAD and MCP, while Elastic Net does not scale up to the performance of either of SWA, SCAD and MCP. Specifically, SWA captured all moderate, strong and very strong features as SCAD and MCP did in the “7+” case; SWA (w/ Bonferroni) had a rate of 96.2% truefeature captures in the “8+” case that inclueds a weak signal. In the “9+” case that includes extremely weak features, we do not expect a high capture rate, where SWA (w/ Bonferroni) had a true feature capturing rate of 26.7% which is less than those by SCAD and MCP. It is possible that if we used a less conservative multiplicity adjustment than Bonferroni procedure, such as the BenjaminiHochberg (BH in short) procedure, SWA could have captured more very weak features, though the false discovery may also increase.
In this set of comparisons, the values of and are typical of some genetic studies, and are within the implementation limitation of the SCAD, Elastic Net and MCP procedures. The advantages of SWA are as follows. First, it does not require a choice of or determination of unknown while providing comparable outcomes for weak to strong signals. We do need to choose and , but they are simple to setup. SWA by default controls the actual FDR to the nominal level, while penalized procedures are liberal (as shown in Table 3) by current data dependent choice of . In the true feature comparison, we adjusted the regularization parameter of Elastic Net, SCAD and MCP based on the performance of SWA in this simulation study to equalize the FDR. In a real data application, it is not clear how to control the FDR to the desired level for each of Elastic Net, SCAD and MCP individually. Perhaps, we can combine the results of either SWA and SCAD, or SWA and MCP. See more discussion in Section 7. Third, SWA is simple. It can handle data with a much larger than those requiring all dimensional data input once for each analysis, as it’s a subsampling procedure. Thus, SWA can scale easily to the cases when the penalized procedures do not run at the current software implementation.
#Features  SWA w/  SWA w/  SWA w/step  Elastic Net  SCAD  MCP 

Captured  Bonferroni  BH  Bonferroni  FDRctrl  FDRctrl  FDRctrl 
0%  0.6%  0.7%  0%  0.1%  0.1%  
26.7%  45.8%  46.8%  0%  41.9%  63.8%  
96.2%  98.6%  97.6%  0.8%  99.6%  99.9%  
99.9%  100.0%  100.0%  8.1%  100.0%  100.0%  
100.0%  21.4%  
41.5%  
60.2%  
78.9%  
93.6%  
100.0% 

Notes: Entries are percentages of captures. “10” in the first column means that all 10 true features are captured, “6+” means that at least 6 true features are captured.
Random Forest is a machine learning method for classification and regression. Random Forest and SWA share some similarities such as random selection of features. However, Random Forest does not control FDR, but gives a set of importance features for each user specified number called “npick,” the number of features that one desires to specify in advance. Hence, it is not directly comparable with SWA, Elastic Net, SCAD, or MCP in terms of FDR. Regradless, we evaluated Random Forest performance on the same data from Example 2. In comparison to SWA, Table 5 summarizes the percentage of true features among top “npick” features found by Random Forest measure of importance. When “ntree,” another user specified number in Random Forest calculation, is set to be ntree
, the resulting truefeature capture percentage improves as “npick” increases from 10 to 30, which would mean that the FDR would increase at least to 10 (50%) for npick , and 20 (67%) for npick . When npick , the number of true features, only strong and moderate features are captured well by Random Forest. With npick , we also increased the number of trees, “ntree,” from 1000 to 2000, which did not help much. In comparison, SWA performs well for all levels, controlling the FDR and providing good true feature detection rates.#Features  npick=30  npick=20  npick=10  npick=20  npick=10  SWA w/ 

Captured  ntree=1K  ntree=1K  ntree=1K  ntree=2K  ntree=2K  Bonferroni 
1.3%  0.1%  0%  0.1%  0%  0%  
7.9%  1.9%  0%  2.7%  0.1%  26.7%  
30.4%  13.7 %  0.7%  17.3%  1.7%  96.2%  
67.6%  44.9%  13.6%  52.2%  13.1%  99.9%  
93.1%  81.1%  46.3%  82.2%  51.5%  100.0%  
99.2%  97.2%  98.7%  97.7%  86.8%  
100.0%  99.9%  100.0%  100.0%  99.4%  
100.0%  100.0% 

Notes: Entries are percentages of captures. “10” in the first column means that all 10 true feature are captured, “” means that at least 6 true features are captured.
Based on the above outcome, we shall only compare SWA with Elastic Net, SCAD, and MCP in the correlated and ultrahigh dimension cases below.
5.3 Comparison with Correlated Covariates
We repeat our comparison when the covariates are correlated as it used in Tibshirani (1996). Specially, in Example 2, let correlation between and be for with , and and respectively. The results for case 1 of are given in Tables , while those for case 2 of are in Tables .
For Case 1, in terms of false feature alarm, SWA and SCAD made the fewest misdetection, which is followed by MCP and then Elastic Net. In terms of true feature detection, after adjusting benchmark procedures to match the same FDR as that of SWA, SWA is found to be comparable with the benchmark procedures as shown in Table 7, with MCP to be the best performer, slightly.
# features  SWA w/  SWA w/  Elastic Net  SCAD  MCP 

captured  Bonferroni  BH  
961  768  264  977  822  
36  171  195  21  163  
3  42  143  2  14  
15  129  2  
2  61  
2  68  
37  
28  
17  
21  
37 

“1” in the first column means that exactly one false feature is captured.
For Case 2, simulation variation is increased to . As shown in Table 8, SWA again has the fewest misdetection. In terms of true feature detection, after matching benchmark procedures’ FDR at the same rate of SWA, SWA is found to be comparable with the competing procedures for detecting the moderate and stronger features, while Elastic Net does better in detecting the weak features.
# of features  SWA w/  SWA w/  Elastic Net  SCAD  MCP 

captured  Bonferroni  BH  FDRctrl  FDRctrl  FDRctrl 
0%  0%  2.5%  1.9%  0.7%  
21.5%  45.7%  50.1%  60.2%  50.7%  
94.4%  98.8%  99.8%  99.2%  100%  
100%  100%  100%  100% 

Notes: Entries are percentages of captures. “10” in the first column means that all 10 true features are captured, “6+” means that at least 6 true features are captured.
# features  SWA w/  SWA w/  Elastic Net  SCAD  MCP 

captured  Bonferroni  BH  
875  567  299  73  638  
10  234  213  99  272  
21  101  143  142  73  
57  82  127  13  
19  69  117  3  
13  44  112  1  
6  31  95  
2  21  68  
1  20  68  
14  42  
55  57 

Notes: “1” in the first column means that exactly one true feature is captured.
# features  SWA w/  SWA w/  Elastic Net  SCAD  MCP 

captured  Bonferroni  BH  FDRctrl  FDRctrl  FDRctrl 
0%  0%  0.8%  0.1%  0%  
0.1%  0.5%  7.8%  2.8%  0.3%  
1.7%  8.9%  35.5%  23.5%  10.5%  
20.3%  51%  80.6%  59.7%  45.2%  
71.1%  92.2%  98.0%  86.6%  86.1%  
95.5%  99%  99.9%  97.6%  98.5%  
99.8%  99.8%  100%  100%  100%  
100%  100% 

Notes: Entries are percentages of captures. “10” in the first column means that all 10 true features are captured, “6+” means that at least 6 true features are captured.
5.4 Comparison in Ultra high Dimensions
Consider the model in Example 2 with the same and , but let instead of . SWA did not perform well without a prescreening step. Since it is common to perform a preindependence screening procedure [Fan and Lv (2008)] for variable selection, we added a prescreening step by selecting
variables with the strongest marginal correlation to the response variable
. SWA procedure with a repetition size and was applied afterwards. SWA (w/ Bonferroni adjustment of or ) outperformed, Elastic Net, SCAD and MCP using default parameters, in false feature detection (in Table 10). After setting the smoothing parameters in Elastic Net, SCAD and MCP to match the FDA at the same level, SWA was found still competitive overall as shown in Table 11. In this ultra high dimensional case, we recommend using the Bonferroni adjustment at the rate of , the size that was adjusted to after a prescreening step. Arguably, using a Bonferroni adjustment at the rate of , the size of subsample size and the size of the semifinalists , is also reasonable.# features  SWA  SWA adjusted  Elastic Net  SCAD  MCP 

captured  
965  913  136  714  
29  71  147  249  
6  14  131  35  
2  128  2  
87  
77  
65  
1  60  
48  
1  22  
998  29 

Notes: Entries are numbers of true feature captures. “1” in the first column means that exactly one feature is captured.
# features  SWA adjusted  SWA adjusted  Elastic Net  SCAD  MCP 

captured  FDRctrl  FDRctrl  FDRctrl  
0%  0%  
27.9%  38.1%  3.1%  31.4%  
96.6%  98%  58.4%  99.7%  
100%  100%  96.5%  100%  
0.3%  99.7%  
5.9%  100%  
19.1%  
33.2%  
41.5%  
100% 

Notes: Entries are percentages of captures. “10” in column 1 means that all 10 true features are captured, “6+” means that at least 6 true features are captured.
6 Application and Double Assurance Procedure
In this section, we apply our SWA to the ovarian cancer data. There are samples, each having mRNA gene expressions. This data set neither contains an obvious response variable nor has control cases. We hence choose the Cyclin E gene, termed as “CCNE1”, as the response in this case study. “CCNE1” is a good surrogate for the survival, according to the cBioPortal for Cancer Genomics (http://cbioportal.org) by Gao et al. (2013), 25% of the 591 ovarian cancer patients were found to have their “CCNE1” gene altered, and the subjects with the alteration of “CCNE1” gene had a significantly lower overall survival rate (Pvalue) as shown in the right side of Figure 5. Hence, we are interested in finding mRNAs that collectively are significant explanatory variables or features for “CCNE1.” Finding these interesting explanatory variables is helpful to examine/find pathways, not just one gene, responsible for survival.
First, we activate a screening step based on the distribution of sample correlation in the left side of Figure 5, to reduce the number of features from to by selecting the features with an absolute value of correlation . Then, we apply SWA with a repetition size and subsampling sizes from and to choose the ‘optimal’ subsample . Guided by Algorithm , examining the fixed scale multipanel plot in Figure 6, we find that significant changes in magnitude occur at . Combining this fixedscale indication with the freescale information in Figure 7, we observe that and lead to plots with a stable ‘elbow’ point and a stable upper arm set. Thus, the choice of is the optimal subsample size that is stabilized in both scale and shape. However, as an enhancement to the practice, we propose one more step to implement a ‘DoubleAssurance Procedure’ below.
Algorithm 3: Double Assurance Procedure

Consider for and a few with that seemed to have led to semistable points.

Combine the final significant features in and for these ’s (here, and ) into a combined prediction set , and then run another base analysis of for , and perform a variable selection step on the outcomes.
Figure 6: Fixedscale multipanel diagnostics plot for ovarian cancer study.
Figure 7: Freescale multipanel diagnostics plot for ovarian cancer study.
The final significant features from Algorithm 3 are listed in Table 12. Our set of features is sensible based on a STRING analysis of our detected features using functional protein association networks (http://stringdb.org/) by Szklarczyk et al. (2015). The STRING analysis shows that our discovered features contain functionally important genes; in particular “POLQ” and “RAD51” form a special pathway in which “POLQ” blocks “RAD51”mediated recombination. Together, “POLQ” and “RAD51” correlate with defects in homologous recombination (HR) repair published in Nature [Ceccaldi et al. (2015)]. In addition, pathway analysis shows that the network has significantly more interactions than expected (Pvalue). Both our findings and their results reveal a synthetic lethal relationship between the HR pathway and “POLQ”mediated repair in Ovarian Cancer. The biological application of the discovery of features in Table 12 may go beyond the statements above. In a separate biological manuscript, we apply another new bioinformatics tool to a combined set of these features with other “BRCA” genes to explore potentials for drug target for ovarian cancer.
# of features  Gene Name  Estimate  Pvalue 

CDKN2A  0.26554  2e16  
C19orf2  0.28228  4.94e10  
PLEKHF1  0.21810  5.08e09  
POLQ  0.35074  4.25e06  
TMEM30B  0.13503  8.61e06  
GLCE  0.16539  5.93e05  
POP4  0.20944  7.06e05  
GIPC1  0.19072  0.000393  
C17orf53  0.43510  0.000529  
C15orf15  0.15195  0.000586  
TM2D3  0.14747  0.000712  
CRIM1  0.14717  0.000768  
RAD51  0.27486  0.002840 
7 Discussion and Conclusions
We have provided a new procedure, Subsampling Winner Algorithm (SWA), for finding important features in large regression. We performed a comprehensive study to compare SWA with the benchmark procedures, LASSO, Elastic Net, SCAD, MCP and Random Forest. We also applied the SWA to analyze the genomic data on ovarian cancer. Our study revealed a meaningful ovarian cancer pathway verified using the STRING analysis, as well as new genes that have become candidates to be examined biologically.
Our subsampling approach represents a paradigm shift from the popular approach using a penalized criterion in selecting important/true features from covariates when . The penalized procedures approximate the solution to a penalized criterion based on entire sample. The SWA is an ensemble of simple procedures performed on subsamples, by lifting the good performance of in low dimension () to the high dimension () using our scoring function strategically. SWA is analogous to the method for selecting national merit scholars.
The advantages of SWA are given below.
First, the SWA is virtually dimensionless, due to its subsampling nature and the simplicity of its base procedure (without a need for finding a penalizing or turning parameter). Hence, it can scale easily to an enormously large , especially when is too large to use a benchmark, penalized procedure that requires data inputted in an existing software in one pass. Also due to the simplicity and subsampling nature, SWA can be made “embarrassingly parallel" and distributed over multiple cores for fast computation and for enlarging data. We leave this kind of coding refinement for multiple cores or high performance computers to the future.
Second, when the large p is manageable, SWA is shown to be competitive to benchmark procedure in our simulation studies. In terms of the false discovery rate (FDR), we compared the SWA using a simple Bonferroni multiplicity adjustment (denoted as SWA w/Bonferroni) with LASSO, Elastic Net, SCAD, MCP and Random Forest (RF). We used the default parameter based on a crossvalidated mean square errors for Elastic Net and SCAD, and the known for MCP, as well as various parameter choices for RF. The SWA controlled the false discovery rate (FDR) closest to the target value (Table 3, 6, 8 and 10). Random Forest does not explicitly control FDR and hence is not comparable (Table 5). The advantage of Random Forest is not at controlling FDR, but at its prediction precision and perhaps at providing a set of importance features as a firststage dimension reduction candidates.
In terms of true discovery rate (TDR), we compared the true feature selection after equalizing the FDRs for all penalized procedures to that of the SWA. The good performers are summarized in Table 13 based on the comprehensive results in Tables 4,7,9,11 and 12. Specifically, in the case of independent covariates, SWA, SCAD and MCP were comparable except for the very weak features, where MCP was the (slight) winner followed closely by SCAD and SWA (Table 4). Elastic Net did not perform well for the independent covariates case (Table 4), but was comparable with SWA, SCAD and MCP in Cases 1 and 2 of the correlatedcovariates (Tables 7 and 9), and performed the best slightly for the correlated case 2 (Table 9). SWA and MCP are clear winners in the ultrahigh dimension case with independent covariates (Table 11). In the ultra high dimension case, it is important to add a prescreening procedure as we and all penalized procedures have done.
Cases  Good Performers  Good Performers  Practical 

based on FDR  based on TDR  Recommendation  
Independent  SWA (Table 3)  MCP, SCAD &  A. SWA with an added double assurance (Algorithm 3); or B. Combination of SWA & a penalized procedure (Algorithm 4) 
SWA (Table 4)  
Correlated Case1  SCAD & SWA (Table 6)  MCP, SCAD, ElasticNet  
& SWA (Table 7)  
Correlated Case2  SWA (Table 8)  ElasticNet, SWA,  
MCP & SCAD (Table 9)  
Ultra High  SWA (Table 10)  MCP, & SWA (Table 11) 

Notes: Entries in Columns 2 and 3 are good performers based on the results from the table cited. Recommendations in Column 4 were made based on both FDR and TDR.
In all these comparisons, for Elastic Net, SCAD and MCP we used a trialanderror method to modify the and from their default or initial values to match the FDR at 0.05. In practice, without knowing the truth, this trialanderror adjustment to match actual 0.05 rate is impractical. In practice, thus, we recommend using SWA with a double assurance as in Algorithm 3, or a combination of SWA with one of good performers from Elastic Net, SCAD, and MCP as shown in Table 13 by the following Algorithm 4.
Algorithm 4: Combine features from SWA and a good performing penalized procedure and run our base procedure on this combined predictor set to lead to the final set of features.
We do not have a rigorous mathematical statement or proof of SWA’s optimality, although we gave a rationale for its obvious consistency if given the SOIL condition, as Ye, Yang, and Yang (2018) did. SWA’s excellent performance and scalability to an ultralarge may bear some similarity to that of the deep learning procedure [LeCon, Bengio, and Hinton (2015)] as follows. A deep learning algorithm is built on many layers of linear combination of simple basis functions, while SWA is based on a fusion of many simple analyses of subsamples. The purpose of deep learning parallels to that of RandomForest, being excellent for prediction, while SWA aims directly at feature selection. Another reason for the good performance is perhaps at SWA’s selection of important features from an ensemble of (subsample) models in the same spirit of the SOIL conditions of Ye, Yang, and Yang (2018). The difference between our SWA and the ensemble of Ye, Yang, and Yang (2018) is that we extract final useful set of features from simple analyses of subsamples (hence, SWA is scalable to very high dimensions) while Ye, Yang, and Yang (2018)’s ensemble obtains final results from penalized or other iterated procedures of whole samples.
Although SWA does not need to choose a penalizing parameter or require an estimate of in advance, SWA does need to choose a subsample size and the number of subsample iterations ; fortunately, they are straightforward to select and our selection automatically controls the FDR to the target level. Specifically, the SWA is reasonably stable as long as with a sufficiently large ; and can be chosen more specifically by our simple multipaneling plot with a fixed . Another parameter , the number of “semifinalists" in SWA is set to be in our simulation experiments herein and was shown to be satisfactory. However, this can easily be enlarged to or to be conservative. We also proposed a doubleenhancement procedure, which is similar to enlarging from , for practical applications in Section 6.
The SWA in this paper is developed for linear regression, the most fundamental model. The SWA can be generalized for a generalized linear model, or to a nonparametric classification function, as suitable for other type of data and analysis objectives. The key is to choose properly and generalize the scoring function
accordingly. See SWA for principal component analysis [
Liu, Sun, and Zhang (2007)] used different and . We leave the study for these generalized models using SWA to the future.It is also worth to note the following caveats before using any new advanced procedure to analyze a data set. First, preprocessing data is essential as it can remove a large number of obviously nuisance features cheaply and quickly in advance. Second, an ensemble of statistical procedures is commonly needed for analyzing a large and complex data, for example, the ensemble used for analyzing movies (a sequence of images) in a clinical study from Wang, Sun, and Bogie (2006) and Bogie et al. (2008) includes both image preprocessing and statistical analysis procedures. Third, correct modeling is critical to features selection for examining the effect of possible influencing factors. Under different models, the effects of various features may be shown differently.
Finally, we advocate that any biological implications be verified by independent biological experiments or another instrument. We used cBiportal and STRING analysis tools to our findings by SWA. This is consistent to the recommendation made by Wasserstein, Schirm, and Lazar (2019). We are quite pleased in discovering that our study of the ovarian cancer data has led to important pathways, and also that there is a potential for biological applications beyond what is presented in Section 6.
Appendix A Regularity Conditions and Proofs
Proof.
The idea is to first calculate the probability that all the important features are selected and detected in one subsample, then choose the required such that this probability is at least .
Since , there exists a chance that all the important features are selected in a subsample so that the important features can be detected at once. For each subsample, the probability that all important features are selected is
(9) 
It is known that , we can obtain upper and lower bounds for based on the above expression
(10) 
Assume the random subsamples are independent of each other, the probability that all important features are selected at least once in one subsample after m trials is Setting it to be
(11) 
we have
(12) 
and can thus obtain the upper and lower bounds of from the upper and lower bounds of , which are the ones given in (8). ∎
References
 Berger (2010) Berger, J.O. (2010). Statistical Decision Theory and Bayesian Analysis, 2nd ed. Springer, New York.
 Bogie et al. (2008) Bogie, K., Wang, X., Fei, B., and Sun, J (2008). Getting More out of Pressure Mapping: A New Technique for Interface Pressure Analysis. Journal of Rehabilitation Research and Development, 45 (4) 523–536.
 Breiman (1996) Breiman, L. (1996). Bagging predictors. Machine Learning, 24 (2) 123–140.
 Breiman (2001) Breiman, L. (2001). Random Forests. Machine Learning, 45 (1) 5–32.
 Broad Institute TCGA Genome Data Analysis Center (GDAC, 2013) Broad Institute TCGA Genome Data Analysis Center (2013). Identification of putative miR direct targets. Broad Institute of MIT and Harvard. doi:10.7908/C1B27SBH. 23 May, 2013.
 Cai and Guo (2017) Cai, T. T. and Guo, Z. (2017). Confidence intervals for highdimensional linear regression: minimax rates and adaptivity. Annals of Statistics, 45 (2) 615–646.
 Cai and Guo (2018) Cai, T. T. and Guo, Z. (2018). Accuracy assessment for highdimensional linear regression. Annals of Statistics, 46 (4) 1807–1836.
 Candés and Tao (2007) Candés, E. J. and Tao, T. (2007). The Dantzig selector: statistical estimation when is much larger than . Annals of Statistics, 35 2392–2404.
 Ceccaldi et al. (2015) Ceccaldi, R., Liu, J.C., Amunugama, R., Hajdu, I., Primack, B., Petalcorin, M., O’Connor, K.W., Konstantinopoulos, P.A., Elledge, S.J., Boulton, S.J., Yusufzai, T., and D’Andrea, A.D. (2015). Homologous recombinationdeficient tumors are hyperdependent on POLQmediated repair. Nature, 518 258–262.
 Donoho and Jin (2015) Donoho, D. and Jin, J. (2015). Higher Criticism for LargeScale Inference, Especially for Rare and Weak Effects Statistical Science, 30 (1) 1–25.
 Efron and Tibshirani (1997) Efron, B. and Tibshirani, R. (1997). Improvements on crossvalidation: The .632+ bootstrap method. Journal of the American Association, 92 548–560.
 Efron et al. (2004) Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least Angle Regression. The Annals of Statistics, 32 (2) 407–499.
 Fan and Li (2001) Fan, J. and Li, R.(2001). Variable selection via nonconcave penalized likelihood and its oracle property. Journal of the American Association, 96 (456) 1348–1360.
 Fan and Lv (2008) Fan, J. and Lv, J.(2008). Sure Independence Screening for Ultrahigh Dimensional Feature Space (with discussion). Journal of Royal Statistical Society B, 70 849–911.
 Gao et al. (2013) Gao, J., Aksoy, B.A., Dogrusoz, U, Dresdner, G., Gross, B., Sumer, S.O., Sun, Y., Jacobsen, A., Sinha, R., Larsson, E., Cerami, E., Sander, C., and Schultz, N. (2013). Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Sci Signal. 2013 Apr 2; 6 (269):pl1. doi: 10.1126/scisignal.2004088.
 Golub, Heath, and Wahba (1979) Golub, G., Heath, M., and Wahba, G. (1979). Generalized crossvalidation as a method for choosing a good ridge parameter. Technometrics, 21 (2) 215–223.
 Guo et al. (2016) Guo, J., Hu, J., Jing, B., and Zhang, Z. (2016). SplineLasso in HighDimensional Linear Regression. Journal of the American Association, 111 (513) 288–297.
 Ishwaran, James, and Sun (2001) Ishwaran, H., James, L., and Sun, J. (2001). Bayesian Model Selection in Finite Mixtures by Marginal Density Decompositions. Journal of American Statistical Association, 96 (456) 1316–1332.
 Kim, Street, and Menczer (2000) Kim, Y., Street, W.N., and Menczer, F. (2000). Feature selection in unsupervised learning via evolutionary search. Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, 365–369. ACM.
 LeCon, Bengio, and Hinton (2015) LeCon, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature, 521 (7553) 436–444.
 Liu and Yu (2013) Liu, H. and Yu, B. (2013). Asymptotic properties of Lasso+mLS and Lasso+Ridge in sparse highdimensional linear regression. Electronic Journal of Statistics, 7 3124–3169.
 Liu (2009) Liu, P. (2009). Adaptive Mixture Estimation and Subsampling PCA. Ph.D Thesis, Case Western Reserve University, Sciences.
 Liu, Sun, and Zhang (2007) Liu, P., Sun, J., and Zhang, Z. (2007). SPCA  A New Feature Selection Procedure for Largep Data. Proceedings of American Statistical Association, Section on Statistical computing, [CDROM], Alexandria, VA: American Statistical Association: 19701974.
 Park, Sriram, and Yin (2010) Park, J., Sriram, T.N., and Yin, X. (2010). Dimension Reduction in Time Series, Statistica Sinica, 20 747–770.
 Rao (1973) Rao, C.R. (1973). Linear statistical inference and its applications. 2nd ed. New York: John Wiley & Sons.
 Schapire et al. (1998) Schapire, R., Freund, Y., Bartlett, P. and Lee, W. (1998). Boosting the margin: A new explanation for the effectiveness of voting methods. Annals of Statistics, 26 (5) 1651–1686.
 Scornet, Biau, and Vert (2015) Scornet, E., Biau, G., and Vert, J (2015). Consistency of Random Forests. Annals of Statistics, 43 (4) 1716–1741.
 Sun et al. (2015) Sun, Q., Zhu, H., Liu, Y., and Ibrahim, J. G. (2015). SPReM: sparse projection regression model for highdimensional linear regression. Journal of American Statistical Association, 110 (509) 289–302.
 Szklarczyk et al. (2015) Szklarczyk, D., Franceschini, A., Wyder, S., Forslund, K., Heller, D., HuertaCepas, J., Simonovic, M., Roth, A., Santos, A., Tsafou, K.P., Kuhn, M., Bork, P., Jensen, L.J., and Von Mering, C. (2015). STRING v10: proteinprotein interaction networks, integrated over the tree of life. Nucleic Acids Res. 43 D447–52.
 Tibshirani (1996) Tibshirani, R. (1996). Regression shrinkage and selection via lasso. Journal of the Royal Statistical Society Series B. 58 (1) 267–288.
 Trevor, Tibshirani, and Friedman (2009) Trevor, H., Tibshirani, R., and Friedman, J. (2009). The Element of Statistical Learning; Data Mining, Inference, and Prediction. 2nd ed. Springer.
 Wang et al. (2016) Wang, C., Chen, M., Schifano, E., and Yan, J. (2016). Statistical Methods and Computing for Big Data. Stat Interface. 9 (4) 399–414.
 Wang, Sun, and Bogie (2006) Wang, X., Sun, J., and Bogie, K. (2006). Spatialtemporal data mining procedure: LASR. IMS Lecture Notes–Monograph Series. 50 213–231.
 Wasserstein, Schirm, and Lazar (2019) Wasserstein, R.L., Schirm, A.L., and Lazar, N.A. (2019). Moving to a World Beyond “”. The American Statistician. 73 1–19.
 Yang (2001) Yang, Y. (2001). Adaptive Regression by Mixing. Journal of the American Statistical Association. 96 (454) 574–588.
 Ye, Yang, and Yang (2018) Ye, C., Yang, Y., and Yang, Y. (2018). Sparsity Oriented Importance Learning for HighDimensional Linear Regression. Journal of the American Statistical Association. 113 (524) 1797–1812 Theory and Methods.
 Zhang (2010) Zhang, C.H. (2010). Nearly unbiased variable selection under minimax concave penalty. Annals of Statistics. 38 (2) 894–942.
 Zhang (2013) Zhang, L. (2013). Nearly optimal minimax estimator for highdimensional sparse linear regression. Annals of Statistics. 41 (4) 2149–2175.
Comments
There are no comments yet.