Iterated Feature Screening based on Distance Correlation for Ultrahigh-Dimensional Censored Data with Covariates Measurement Error

01/06/2019 ∙ by Li-Pang Chen, et al. ∙ University of Waterloo

Feature screening is an important method for reducing the dimension and capturing informative variables in ultrahigh-dimensional data analysis, and many methods have been developed for this purpose. These methods, however, are challenged by complex features of the data collection as well as the nature of the data themselves. Typically, an incomplete response caused by right-censoring and covariates subject to measurement error frequently arise in survival analysis. Although many methods have been proposed for censored data, little work is available when the incomplete response and measurement error occur simultaneously. In addition, conventional feature screening methods may fail to detect truly important covariates that are marginally independent of the response variable due to correlations among covariates. In this paper, we explore this important problem and propose a valid feature screening method for survival data with covariate measurement error. In addition, we develop an iterated procedure to improve the accuracy of selecting all important covariates. Numerical studies are reported to assess the performance of the proposed method. Finally, we apply the proposed method to two different real datasets.

1 Introduction

Ultrahigh-dimensional data appear in various scientific research areas, including genetic data, financial data, and survival data. In regression analysis, ultrahigh-dimensional data are difficult to analyze because they contain many unimportant variables, in the sense that those variables are only weakly related to the response. In addition, the sample covariance matrix of the variables is usually singular because the dimension of the variables greatly exceeds the sample size. As a result, we should select the informative variables before constructing regression models. Moreover, to cope with ultrahigh dimensionality, the assumption of sparsity is imposed; in other words, only a small number of predictors are associated with the response.

In the early development of variable selection, Akaike's Information Criterion (AIC) (Akaike 1973) and the Bayesian Information Criterion (BIC) (Schwarz 1978) were two well-known conventional selection criteria. These criteria aim to search over all possible combinations of variables so that an optimal model is achieved. For ultrahigh-dimensional data, however, it is nearly impossible to search for the final model over all possible combinations of variables. In the past two decades, regularization methods have been proposed for variable selection, including the LASSO (Tibshirani 1996), SCAD (Fan and Li 2001), LARS (Efron et al. 2004), elastic net (Zou and Hastie 2005), adaptive LASSO (Zou 2006), and Dantzig selector (Candes and Tao 2007). However, these methods are mainly designed for high-dimensional data whose dimension is smaller than the sample size, and they may perform poorly for ultrahigh-dimensional data.

To address ultrahigh-dimensional data with stable computation and accurate selection, Fan and Lv (2008) first proposed the sure independence screening (SIS) procedure for the ultrahigh-dimensional linear model, which uses the Pearson correlation to rank the importance of each predictor. Hall and Miller (2009) developed a bootstrap procedure to rank the importance of each predictor based on the generalized correlation between the response and the predictors. Fan et al. (2009) and Fan and Song (2010) considered ranking the importance of each predictor through the marginal maximum likelihood. Different from the SIS method, which specifies the model structure, Zhu et al. (2011) and Li et al. (2012) proposed model-free feature screening methods to capture the informative covariates in ultrahigh-dimensional data.

Even though feature screening methods for ultrahigh-dimensional data have been developed, research gaps remain. Specifically, in survival analysis with genetic data, the response (failure time) is usually incomplete due to right-censoring, and the covariates are usually contaminated with measurement error. It is not trivial to implement the conventional feature screening methods to analyze such data. When the response is incomplete but the covariates are precisely measured, some valid methods have been proposed. To name a few, Fan et al. (2010) extended the SIS method to the Cox model. Song et al. (2014) proposed censored rank independence screening. Yan et al. (2017) proposed Spearman rank correlation screening. Chen et al. (2018) developed robust feature screening based on distance correlation. Chen et al. (2019) considered a model-free survival conditional feature screening. In the presence of measurement error, however, it is unknown whether those existing methods can determine the correct features when only surrogate versions of the covariates are available.

The other crucial issue is the accuracy of feature screening. Since conventional SIS methods rank the importance of each predictor through marginal utilities, they may fail to detect truly important predictors that are marginally independent of the response because of correlations among the predictors. A detailed example is deferred to Section 2.4. To overcome this problem, Fan and Lv (2008) proposed the iterative SIS method, and Zhong and Zhu (2015) developed the iterated distance correlation method to improve the accuracy of variable screening. These methods, however, are based on completely observed data that are free of mismeasurement. For ultrahigh-dimensional survival data with measurement error in the covariates, no method is available to deal with this problem. As a result, we explore this important problem with both survival data and covariate measurement error incorporated. In our development, we first present distance correlation with error correction for feature screening. Under this approach, the active set selected from the surrogate covariates is the same as the one selected from the unobserved true covariates. After that, we propose a valid iterated procedure with error correction to improve the accuracy of feature screening. In particular, our proposed method is free of model specification and of distributional assumptions on the covariates.

The remainder of the paper is organized as follows. In Section 2, we introduce survival data with right-censoring, the measurement error model, and the distance correlation method. In Section 3, we propose the iteration algorithm of the feature screening procedure for censored data with covariate measurement error. Empirical studies, including simulation results and real data analyses, are provided in Sections 4 and 5, respectively. We conclude the article with discussions in Section 6.

2 Notation and Model

2.1 Survival Data

In survival analysis, the response is usually incomplete due to the presence of the censoring time. Specifically, let T be the failure time and C be the censoring time. Then let Y = min(T, C) denote the observed time and δ = I(T ≤ C) the censoring indicator, where I(·) is the indicator function. Let X = (X_1, …, X_p)^⊤ be the p-dimensional random vector of covariates. Suppose that we have a sample of n subjects and that, for i = 1, …, n, (Y_i, δ_i, X_i) has the same distribution as (Y, δ, X) and represents the realization for subject i. Let τ denote the maximum support of the failure time. Some regularity conditions are imposed.

  • (C1) P(Y ≥ τ) > 0, where τ is an upper bound of the failure times, assumed to be finite; that is, the risk set at τ is non-empty with positive probability.

  • (C2) The censoring time is non-informative; that is, the failure time and the censoring time are independent.

2.2 Measurement Error Model

Let W denote the surrogate, or observed version, of the covariate X. Let Σ_X and Σ_W be the covariance matrices of X and W, respectively. For i = 1, …, n, W_i has the same distribution as W and denotes the realization for subject i. In this paper, we focus on the following measurement error model:

W_i = X_i + e_i    (1)

for i = 1, …, n, where e_i is independent of X_i and has covariance matrix Σ_e. Here Σ_e can be known or unknown. Hence, to discuss Σ_e and its estimation, we consider the following three scenarios:

Scenario I: Σ_e is known.

In this scenario, Σ_e is a known constant matrix, so the analysis can proceed directly.

Scenario II: Σ_e is unknown and repeated measurements are available.

The measurement error model (1) with repeated measurements is given by

W_{ij} = X_i + e_{ij}

for i = 1, …, n and j = 1, …, m, where the e_{ij} have covariance matrix Σ_e and are independent of X_i. Using the method of moments, we estimate Σ_e by

{n(m − 1)}^{−1} ∑_{i=1}^{n} ∑_{j=1}^{m} (W_{ij} − W̄_i)(W_{ij} − W̄_i)^⊤,    (2)

where W̄_i = m^{−1} ∑_{j=1}^{m} W_{ij} is the subject-specific average of the repeated measurements.

Scenario III: Σ_e is unknown and validation data are available.

Suppose that M is the index set of the main study, containing n subjects, and V is the index set of the external validation study, containing additional subjects. Assume that M and V do not overlap. Therefore, the available data contain the surrogate measurements {W_i : i ∈ M} from the main study and the paired measurements {(X_i, W_i) : i ∈ V} from the validation sample. Hence, for the measurement error model, we have

W_i = X_i + e_i

for i ∈ V, where the e_i have covariance matrix Σ_e and are independent of X_i. In this case, applying the least squares regression of the W_i on the X_i within the validation sample gives

(3)

where the estimator of Σ_e is obtained from the resulting residuals.
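
Because the displays (2) and (3) could not be fully recovered from the source, the sketch below shows the standard estimators that the surrounding text describes: the within-subject moment estimator of Σ_e under repeated surrogate measurements, and a residual-based estimate from an external validation sample in which both W and X are observed. This is a minimal sketch in Python; array shapes and function names are illustrative assumptions rather than the paper's notation.

```python
import numpy as np

def sigma_e_repeated(W_rep):
    """Within-subject moment estimator of the error covariance from repeated
    surrogate measurements; W_rep has shape (n, m, p) with m >= 2 replicates."""
    n, m, p = W_rep.shape
    dev = W_rep - W_rep.mean(axis=1, keepdims=True)     # deviations from subject means
    return np.einsum('imj,imk->jk', dev, dev) / (n * (m - 1))

def sigma_e_validation(W_val, X_val):
    """Residual-based estimate of the error covariance from an external
    validation sample in which both W and X are observed; under the additive
    model (1), the residual W - X has covariance Sigma_e."""
    resid = W_val - X_val
    resid = resid - resid.mean(axis=0, keepdims=True)
    return resid.T @ resid / (resid.shape[0] - 1)
```

With repeated measurements, sigma_e_repeated corresponds to the moment estimator in (2); with validation data, sigma_e_validation plays the role of the least-squares-based estimator referred to in (3).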

2.3 Review of the Distance Correlation Method

In this section, we briefly review the distance correlation (DC) method, which was first proposed by Székely et al. (2007).

Let φ_u(t) and φ_v(s) denote the characteristic functions of two random vectors u and v, respectively, and let φ_{u,v}(t, s) be the joint characteristic function of u and v. Let ||φ||² = φ φ̄ for any complex function φ, where φ̄ is the conjugate of φ. The distance covariance between u and v is defined as

dcov²(u, v) = ∫ ||φ_{u,v}(t, s) − φ_u(t) φ_v(s)||² / {c_{d_u} c_{d_v} ||t||^{1+d_u} ||s||^{1+d_v}} dt ds,

where d_u and d_v are the dimensions of u and v, respectively, c_d = π^{(1+d)/2} / Γ{(1+d)/2} with Γ(·) being the gamma function, and ||a|| is the Euclidean norm of any vector a. Therefore, the DC is defined as

dcorr(u, v) = dcov(u, v) / {dcov(u, u) dcov(v, v)}^{1/2}.    (4)

Székely et al. (2007) showed that two random vectors u and v are independent if and only if dcorr(u, v) = 0. This property motivates us to perform feature screening by identifying which covariates are dependent on the response (e.g., Li et al. 2012). The detailed estimation of (4) can be found in Li et al. (2012).
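
For concreteness, the following minimal sketch (in Python with numpy and scipy, which are this illustration's choices rather than tools mentioned in the paper) computes the usual sample version of (4) from doubly centered pairwise distance matrices, as in Székely et al. (2007) and Li et al. (2012).

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def _centered_distances(x):
    """Doubly centered Euclidean distance matrix of one sample (n observations)."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    d = squareform(pdist(x))                      # n x n pairwise distances
    return d - d.mean(axis=0) - d.mean(axis=1)[:, None] + d.mean()

def distance_correlation(u, v):
    """Sample distance correlation between two samples of the same length."""
    A, B = _centered_distances(u), _centered_distances(v)
    dcov2_uv = max((A * B).mean(), 0.0)           # squared sample distance covariance
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    return np.sqrt(dcov2_uv / denom) if denom > 0 else 0.0
```

This helper is reused in the screening sketches given later in the paper.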

2.4 Potential Problem in Conventional Screening Method

As discussed in Section 1, even though many feature screening methods have been proposed, they may fail to capture all important variables because some variables are highly correlated with others. To see this problem explicitly, we consider the following regression model, which was adopted by Fan and Lv (2008):

Y = β_1 X_1 + β_2 X_2 + β_3 X_3 + β_4 X_4 + ε,    (5)

where X = (X_1, …, X_p)^⊤ is a vector of covariates and each X_j is generated from the normal distribution with mean zero and unit variance. The pairwise correlations among all covariates except X_4 are ρ, while X_4 has correlation √ρ with all other variables.

It is clear that the variables X_1, X_2, X_3, and X_4 are included in model (5). By feature screening based on the conventional distance correlation method, we can only identify X_1, X_2, and X_3, while there is a large probability that X_4 cannot be identified because its marginal association with the response is essentially zero.

This simple example verifies that the conventional feature screening method can fail to select all the important variables. To successfully identify the variable X_4, Fan and Lv (2008) proposed the iterated SIS method, and Zhong and Zhu (2015) considered the iterated distance correlation method. In survival analysis, however, the response in model (5) is usually incomplete due to right-censoring, so it is not trivial to implement those conventional methods for survival data. In addition, another challenge comes from the mismeasurement of covariates. Specifically, the variables X_1, …, X_p in model (5) may be contaminated with measurement error, so that we only observe the surrogate variables W_1, …, W_p. It is expected that the important variables X_1–X_4 cannot all be identified if we ignore the impact of mismeasurement. As a result, it is also crucial to account for the measurement error effect.
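
To make the phenomenon concrete, the following small simulation uses illustrative coefficient and correlation values in the spirit of Fan and Lv (2008) (the paper's actual values in model (5) were not recoverable) and shows that marginal distance-correlation ranking typically places X_4 far from the top even though X_4 is in the model. It reuses the distance_correlation helper sketched in Section 2.3.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, rho = 200, 500, 0.5                           # illustrative sizes

# All covariates have pairwise correlation rho, except X4, whose correlation
# with every other covariate is sqrt(rho).
common = rng.standard_normal(n)
X = np.sqrt(rho) * common[:, None] + np.sqrt(1 - rho) * rng.standard_normal((n, p))
X[:, 3] = common                                    # X4 (index 3 in 0-based Python)

# Illustrative coefficients chosen so that cov(X4, Y) = 0 although X4 is active.
y = 5 * (X[:, 0] + X[:, 1] + X[:, 2]) - 15 * np.sqrt(rho) * X[:, 3] \
    + rng.standard_normal(n)

omega = np.array([distance_correlation(X[:, j], y) for j in range(p)])
print("marginal rank of X4:", 1 + np.sum(omega > omega[3]))   # typically far from the top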

3 The Proposed Method

3.1 Feature Screening for Censored Data and Measurement Error

To present the setting, we start from the unobserved covariate X.

Let F(y | X) denote the conditional distribution function of the response given X, and let A denote the active set, which contains all relevant predictors for the response; its complement A^c contains all irrelevant predictors for the response. In this case, let X_A denote the vector containing all the active predictors, and let X_{A^c} be the vector containing all the irrelevant predictors.

If the response is complete, i.e., Y_i = T_i for all i, then it is straightforward to implement conventional methods to determine the active set. However, if the response is incomplete, i.e., right-censoring occurs, then we impute T_i by (Buckley and James 1979)

Y_i* = δ_i Y_i + (1 − δ_i) E(T_i | T_i > Y_i),

indicating that E(Y_i* | X_i) = E(T_i | X_i) (Miller 1981, p. 151). In addition, by Condition (C1) in Section 2.1, E(T_i | T_i > Y_i) can be written as

E(T_i | T_i > Y_i) = ∫_{Y_i}^{τ} t f(t) dt / {1 − F(Y_i)},

where f and F are the density and distribution functions of the failure time, respectively. Moreover, this conditional expectation can be estimated by replacing F with its Kaplan-Meier estimator. As a result, the estimator of Y_i*, i.e., the imputed response, is determined by replacing E(T_i | T_i > Y_i) with the resulting estimate.
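
A minimal sketch of this imputation step, assuming the standard Buckley-James-type construction with the Kaplan-Meier estimator (the paper's own displays were lost in extraction), is given below. Placing the leftover Kaplan-Meier mass at the largest observation is one common tail convention and not necessarily the paper's choice.

```python
import numpy as np

def km_jumps(time, delta):
    """Kaplan-Meier estimate of the failure-time distribution:
    returns the jump times and the corresponding probability masses."""
    order = np.argsort(time)
    t, d = np.asarray(time)[order], np.asarray(delta)[order]
    n = len(t)
    surv, at_risk, i = 1.0, n, 0
    jump_t, jump_m = [], []
    while i < n:
        ti = t[i]
        tied = int(np.sum(t == ti))
        events = int(np.sum((t == ti) & (d == 1)))
        if events > 0:
            new_surv = surv * (1.0 - events / at_risk)
            jump_t.append(ti)
            jump_m.append(surv - new_surv)
            surv = new_surv
        at_risk -= tied
        i += tied
    return np.array(jump_t), np.array(jump_m)

def impute_response(time, delta):
    """Replace each censored response by an estimate of E(T | T > time_i)
    computed from the Kaplan-Meier jump masses; any mass remaining beyond
    the last event is placed at the largest observation."""
    time, delta = np.asarray(time, dtype=float), np.asarray(delta)
    jump_t, jump_m = km_jumps(time, delta)
    rest = 1.0 - jump_m.sum()
    if rest > 1e-12:
        jump_t = np.append(jump_t, time.max())
        jump_m = np.append(jump_m, rest)
    y_star = time.copy()
    for i in np.where(delta == 0)[0]:
        tail = jump_t > time[i]
        mass = jump_m[tail].sum()
        if mass > 0:
            y_star[i] = np.sum(jump_t[tail] * jump_m[tail]) / mass
    return y_star
```

The imputed vector returned by impute_response plays the role of the response in the screening criterion of this section.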

Finally, the crucial target is the determination of the active set A. In the presence of measurement error, we adopt the DC method described in Section 2.3 with a modification. Let φ_e(t) = E{exp(i t^⊤ e)} denote the characteristic function of the error term e, where i is a complex number with i² = −1. Based on φ_e, we define error-corrected versions of the characteristic functions associated with the surrogate W. If Σ_e is unknown, then it can be estimated from the repeated measurements or the validation data via (2) or (3). Therefore, we define the corrected distance covariance and, from it, the corrected distance correlation

(6)

As a result, to select features, it suffices to consider the corrected marginal utility

(7)

for j = 1, …, p, and the corresponding estimator is

(8)

As suggested by Li et al. (2012), letting the threshold value be c n^{−κ} for some constants c > 0 and 0 ≤ κ < 1/2, the estimated active set is given by

(9)
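
Since the displays (6)-(9) could not be recovered from the source, the following sketch shows only the generic ranking-and-truncation step, using the plain sample distance correlation from Section 2.3 as a stand-in for the error-corrected utility in (8); by Theorem 3.1 the corrected and uncorrected population quantities identify the same active set. The hard cut at d = [n / log n] follows the practical suggestion in Section 3.2 rather than the threshold in (9).

```python
import numpy as np

def dc_screen(W, y_star, d=None):
    """Rank each (surrogate) covariate by its marginal distance correlation
    with the imputed response and keep the d top-ranked covariates."""
    n, p = W.shape
    if d is None:
        d = int(np.floor(n / np.log(n)))          # size suggested in Section 3.2
    omega = np.array([distance_correlation(W[:, j], y_star) for j in range(p)])
    ranked = np.argsort(omega)[::-1]
    return np.sort(ranked[:d]), omega
```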

To see the validity of the criterion (7), we have the following theorem:

Theorem 3.1

The active features determined from the true covariates and those determined from the surrogate covariates are the same. That is, for every j = 1, …, p, the distance correlation between X_j and the imputed response is zero if and only if the corrected distance correlation between W_j and the imputed response is zero, where the latter is determined by implementing the corrected surrogate quantities in (4).

Generally speaking, Theorem 3.1 suggests that, based on the feature selection criterion (7), the true and surrogate covariates share the same active set A. Furthermore, a derivation similar to that in Li et al. (2012) yields that the estimated active set has the sure screening property, in the sense that the probability that it contains A tends to one as n → ∞. Therefore, we can decompose the measurement error model (1) as

W_A = X_A + e_A,    (10a)
W_{A^c} = X_{A^c} + e_{A^c},    (10b)

where W_A, X_A, and e_A collect the components indexed by A, and W_{A^c}, X_{A^c}, and e_{A^c} collect the components indexed by A^c. The covariance matrix Σ_e can be further decomposed accordingly, where Σ_{e,AA} is the covariance matrix based on (10a), Σ_{e,A^cA^c} is the covariance matrix based on (10b), and Σ_{e,AA^c} is the covariance matrix based on the interaction of (10a) and (10b).

3.2 Iteration Algorithm

As motivated by the example in Section 2.4, directly implementing (7) may miss some important variables. To increase the probability of selecting all the important variables, we modify the selection criterion (7) and develop an iterated feature screening procedure.

The key idea is as follows. We first implement the feature screening criterion (7) to determine an initial active set and its complement. Some potentially important variables may remain in the complement without being identified. Therefore, to determine the other important variables in the complement, a natural way is to remove the correlation between the unselected and selected covariates by regressing the unselected covariates onto the selected ones. The residuals obtained from this linear regression are then uncorrelated with the selected covariates, so the remaining important variables can be identified from these residuals and the imputed response.

Specifically, to present the idea explicitly, we provide the following iteration algorithm:

Step 1:

Initial determination of the active set.

Let W = (W_{(1)}, …, W_{(p)}) denote the surrogate covariate matrix, where W_{(j)} is the n-dimensional vector of the jth covariate for j = 1, …, p.

In this stage, we first implement (7) to determine the initial active set, and we denote the corresponding matrix of relevant covariates by W_A; the remaining columns form the matrix of irrelevant covariates W_{A^c}. In addition, based on the feature selection criterion (7) and Theorem 3.1, the active set based on the surrogate variables is equal to the set based on the true covariates, so the same initial active set is obtained whether the true or the surrogate covariates are used.

Step 2:

Improvement.

In this stage, we aim to search for other important variables in A^c. Our main approach is to regress W_{A^c} onto W_A and update the active set through the residuals.

In this paper, we consider the multivariate linear regression of the irrelevant covariates on the relevant ones, and the ordinary least squares criterion minimizes the squared L_2-norm of the residuals with respect to the parameter matrix. The corresponding score function is based on the true covariates X_A and X_{A^c}.

However, in the presence of covariate measurement error, we only observe W_A and W_{A^c}, and the score function becomes

(11)

It is well known that directly solving (11) may yield an estimator of the parameter matrix with substantial bias (e.g., Carroll et al. 2006). Instead, by a simple calculation, we obtain the corrected score function

(12)

which satisfies an unbiasedness property relative to the score function based on the true covariates, indicating that (12) is a suitable score function that corrects for the error-prone variables. Therefore, the estimator of the parameter matrix based on (12) is given by

(13)

Based on (13) and the surrogate variables, define the residual matrix

(14)

In fact, (14) is an exact formulation of the residuals and thus contains the covariate information in W_{A^c} while being uncorrelated with W_A. Therefore, implementing (7) with these residuals gives an additional set of active covariates from A^c.

Step 3:

Update of the active set.

Update the active set by taking the union of the initial active set and the newly identified covariates, and repeat Step 2 until no more covariates are included. The resulting set gives the final model.

In practice, as suggested by Yan et al. (2017), Chen et al. (2019), and others, we can specify the size of the active set to be [n / log n], where [·] stands for the floor function. In this sense, based on the iteration algorithm, we first select part of the active set in Step 1 and then determine the remaining variables in Step 2.
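
The iteration can then be sketched as follows. This is an assumption-based illustration of Steps 1-3: the error-corrected regression in Step 2 subtracts n times the relevant blocks of Σ_e from the Gram and cross-product matrices of the centered surrogates, which is a standard corrected least-squares construction consistent with the role of (12)-(13) rather than the paper's verbatim formulas, and the residuals play the role of (14). It reuses dc_screen from Section 3.1 together with the imputed response.

```python
import numpy as np

def iterated_dc_screen(W, y_star, Sigma_e, d1, d2):
    """One pass of the iterated screening: Step 1 selects d1 covariates,
    Step 2 selects d2 more from the residuals of an error-corrected regression
    of the unselected surrogates on the selected ones; Step 3 takes the union
    (Step 2 may be repeated until no new covariate enters)."""
    n, p = W.shape
    Wc = W - W.mean(axis=0, keepdims=True)                 # center the surrogates

    # Step 1: initial active set from marginal distance-correlation ranking.
    active, _ = dc_screen(Wc, y_star, d=d1)
    inactive = np.setdiff1d(np.arange(p), active)

    # Step 2: error-corrected least squares of W_inactive on W_active.
    WA, WI = Wc[:, active], Wc[:, inactive]
    G = WA.T @ WA - n * Sigma_e[np.ix_(active, active)]    # corrected Gram matrix
    C = WA.T @ WI - n * Sigma_e[np.ix_(active, inactive)]  # corrected cross-products
    B = np.linalg.solve(G, C)                              # corrected coefficient matrix
    resid = WI - WA @ B                                    # residuals, uncorrelated with W_active
    extra_local, _ = dc_screen(resid, y_star, d=d2)

    # Step 3: update the active set.
    return np.union1d(active, inactive[extra_local])
```

For example, in the data analyses of Section 5 the first stage keeps 8 (or 7) gene expressions and the second stage fills the active set up to 20 (or 18).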

4 Simulation Studies

4.1 Simulation Setup

Let n denote the sample size, and let X = (X_1, …, X_p)^⊤ denote the p-dimensional vector of covariates, generated from the normal distribution with mean zero and a covariance matrix whose diagonal elements are one and whose off-diagonal elements are determined by the correlations among the covariates. Similar to the example in Section 2.4, we specify the correlations among all covariates except X_4 to be ρ, while X_4 has correlation √ρ with all other variables. Several values of p and n are considered.

The failure time is generated from a transformation model that relates a linear predictor in the covariates to the failure time through an error term. Specifying the distribution of the error term gives some commonly used survival models; in this paper, we consider the extreme value distribution for the proportional hazards (PH) model and the logistic distribution for the proportional odds (PO) model. The censoring time is generated from a uniform distribution whose upper bound is a constant chosen so that the censoring rate is approximately 50%. As a result, we have Y_i = min(T_i, C_i) and δ_i = I(T_i ≤ C_i), and for i = 1, …, n the observed survival data are (Y_i, δ_i).

For the measurement error model (1), the error term is generated from the normal distribution with mean zero and a diagonal covariance matrix Σ_e whose common diagonal entry is 0.15, 0.5, or 0.75. If Σ_e is unknown, then the following two scenarios are considered as additional information:

Scenario 1: Repeated measurements

For each subject i = 1, …, n, the covariate X_i and the error terms are generated as above, and the repeated surrogates W_{ij} are generated from model (1) by adding independent error terms e_{ij} to X_i for each replicate j. As a result, Σ_e can be estimated by (2).

Scenario 2: Validation data

For subjects in an external validation sample, the covariates and error terms are again generated as above, and the surrogates W_i are generated from model (1), so that both X_i and W_i are available for these subjects. Therefore, Σ_e can be estimated by (3).

Finally, we repeat the simulation 1000 times for each setting.
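
For concreteness, one replicate of the simulated data might be generated as in the sketch below. The transformation-model form, the coefficient values, and the dimensions are illustrative assumptions (the paper's displayed model and exact settings were not recoverable), and only the PH case under Scenario I with a known diagonal Σ_e is shown.

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(2019)
n, p, rho, sig_e = 200, 1000, 0.5, 0.5                 # illustrative values

# Covariates with the correlation structure of Section 2.4.
common = rng.standard_normal(n)
X = np.sqrt(rho) * common[:, None] + np.sqrt(1 - rho) * rng.standard_normal((n, p))
X[:, 3] = common

beta = np.zeros(p)
beta[:3], beta[3] = 5.0, -15.0 * np.sqrt(rho)          # illustrative coefficients

# PH-type model: extreme-value error in a log-linear transformation model.
eps = np.log(-np.log(rng.uniform(size=n)))
T = np.exp(X @ beta + eps)

# Uniform censoring C ~ U(0, c): P(T > C) equals E[min(T, c)] / c, tuned to 50%.
def censor_gap(c):
    return np.mean(np.minimum(T, c)) / c - 0.5
c = brentq(censor_gap, T.min() / 2, 100 * T.max())
C = rng.uniform(0, c, size=n)
Y, delta = np.minimum(T, C), (T <= C).astype(int)

# Surrogates under model (1) with a known diagonal error covariance (Scenario I).
W = X + np.sqrt(sig_e) * rng.standard_normal((n, p))
Sigma_e = sig_e * np.eye(p)
```

Scenarios 1 and 2 would additionally generate repeated surrogates or a validation sample and estimate Sigma_e as sketched in Section 2.2.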

4.2 Simulation Results

To evaluate the finite-sample performance of the proposed method, we consider the proportion of the 1000 simulations in which each active covariate is selected, and the proportion in which all active covariates are selected simultaneously. In addition, for comparison, we also examine the naive method, which directly uses the observed surrogate covariates and iterates through (11). For the two survival models and several settings of the measurement error covariance, we compare the results obtained by applying the proposed method to the surrogate covariates with those obtained by fitting the data with the true covariate measurements.

The numerical results are reported in Tables 1-3. Since feature screening based on the naive and proposed methods uses the same criterion (7), the screening results are the same. Furthermore, the results of feature screening based on the true covariates are similar to those based on the surrogate covariates regardless of the degree of measurement error and the other settings, which also verifies Theorem 3.1. However, although the feature screening method successfully selects the variables X_1, X_2, and X_3 with high probability, X_4 is selected with low proportion; this result is consistent with the example in Section 2.4. On the contrary, from Tables 1-3, we can see that the iterated feature screening method based on the corrected score function (12) successfully identifies X_4 with high proportion. This result parallels the case in which the true covariates are used. On the other hand, even when the iterated feature screening method is implemented, X_4 cannot be identified if the measurement error effect is not corrected appropriately; this is verified by the naive method, which uses (11).

5 Data Analysis

5.1 Analysis of The Mantle Cell Lymphoma Microarray Data

We first illustrate the proposed methods by an application to the mantle cell lymphoma microarray dataset, available from http://llmpp.nih.gov/MCL/. The dataset contains the survival times of 92 patients and the gene expression measurements of 8810 genes for each patient. We retain 6312 genes after deleting 2498 genes with missing values. During the follow-up, 64 patients died of mantle cell lymphoma and the other 28 were censored, giving a censoring rate of approximately 30%. The aim of the study was to formulate a molecular predictor of survival after chemotherapy for the disease.

Since this dataset contains no information characterizing the degree of measurement error accompanying the gene expressions, we conduct sensitivity analyses to investigate the effects of measurement error on the analysis results. Specifically, based on the covariance matrix of the gene expressions, we consider a covariance matrix for the measurement error model (1) that is diagonal with a common value on its diagonal, specified at three increasing levels to feature settings with minor, moderate, or severe measurement error. We aim to select 20 variables in the active set. In the iteration algorithm, we first select 8 gene expressions, and then the remaining 12 gene expressions are selected by either (11) or (12). For comparison, we examine the feature screening (FS) method in Section 3.1 and the iterated feature screening (IFS) method in Section 3.2. The selection results are summarized in Table 4.

From Table 4, we can see that the feature screening and iterated feature screening methods give the same results for the first 8 gene expressions, regardless of whether the proposed or the naive method is used. This indicates that the first 8 gene expressions are clearly dependent on the response and easily identified. For the remaining 12 gene expressions, on the other hand, the screening results differ. Specifically, the iterated feature screening method selects some gene expressions, such as 29897, 30620, and 32699, regardless of the degree of measurement error, and those gene expressions do not appear in the results of the feature screening method. This implies that the iterated feature screening method selects some potentially important variables that are not identified by the feature screening method. Furthermore, for the naive method, even when iterated feature screening is implemented, the selections among the remaining 12 gene expressions differ from those based on the correction of the error effect. The main reason is the use of the estimator solved from (11) rather than (12).

5.2 Analysis of NKI Breast Cancer Data

In this section, we implement our proposed method to analyze the breast cancer data collected by the Netherlands Cancer Institute (NKI) (van de Vijver et al. 2002). Tumors from 295 women with breast cancer were collected from the fresh-frozen-tissue bank of the Netherlands Cancer Institute. The tumors were primary invasive breast carcinomas less than 5 cm in diameter, the patients were 52 years of age or younger at diagnosis, and the diagnoses were made between 1984 and 1995. Of all these patients, 79 died before the study ended, yielding a censoring rate of approximately 73.2%. For each patient's tumor, approximately 25000 gene expression measurements were collected. Consistent with common practice in the analysis of gene expression data, we treat the log intensities as the covariates.

Since this dataset also contains no information characterizing the degree of measurement error accompanying the gene expressions, we follow the idea in Section 5.1 and conduct sensitivity analyses to investigate the effects of measurement error on the analysis results. That is, based on the covariance matrix of the gene expressions, we consider a covariance matrix for the measurement error model (1) that is diagonal with a common value on its diagonal, specified at three increasing levels to feature settings with minor, moderate, or severe measurement error. We aim to select 18 variables in the active set. In the iteration algorithm, we first select 7 gene expressions, and then the remaining 11 gene expressions are selected by either (11) or (12). Similar to the procedure in Section 5.1, we investigate the feature screening (FS) method in Section 3.1 and the iterated feature screening (IFS) method in Section 3.2. The selection results are summarized in Table 5.

From Table 5, the results for the NKI data parallel those in Section 5.1, in the sense that the feature screening and iterated feature screening methods give the same results for the first 7 gene expressions, regardless of whether the proposed or the naive method is used. This indicates that the first 7 gene expressions are clearly dependent on the response and easily identified. For the remaining 11 gene expressions, on the other hand, the screening results differ. For example, the iterated feature screening method selects some gene expressions, such as NM_020188, Contig25991, and NM_003882, regardless of the degree of measurement error, and those gene expressions do not appear in the results of the feature screening method. This implies that the iterated feature screening method selects some potentially important variables that are not identified by the feature screening method. Furthermore, for the naive method, even when iterated feature screening is implemented, the selections among the remaining 11 gene expressions differ from those based on the correction of the error effect. The main reason is the use of the estimator solved from (11) rather than (12).

6 Conclusion

Ultrahigh-dimensional data analysis has been an important topic in recent decades, and such data appear frequently in many practical situations and research fields, such as biology and finance. Many methods have been developed to deal with this problem. When censored data and covariate measurement error occur simultaneously, however, few methods are available. Furthermore, some truly important covariates may fail to be detected due to correlations among the covariates.

To overcome these challenges, we propose a valid feature screening method to deal with ultrahigh dimensionality when both censored responses and covariate measurement error are present. Different from other feature screening methods for censored data, the proposed method determines the same active predictors whether the surrogate or the unobserved true covariates are used. To improve the accuracy of feature screening and identify potentially important variables, we further develop iterated feature screening with correction for measurement error. The simulation studies and real data analyses verify that the iterated feature screening method yields satisfactory results and outperforms the non-iterated feature screening and naive methods.

There are several possible extensions and applications. First, even after the dimension of the variables is reduced below the sample size, the dimension may still be high and some unimportant variables may remain in the dataset. In this case, variable selection techniques, such as the LASSO or SCAD, can be implemented to identify the most important variables and shrink the unimportant ones. Second, although we mainly consider continuous covariates and the classical measurement error model, the proposed method can be naturally extended to other types of variables, such as binary and count variables, and to other measurement error models, including the Berkson error model. Furthermore, mismeasurement of binary covariates, also called misclassification, is another crucial problem. Finally, in addition to right-censoring, some complex structures, such as left-truncation (e.g., Chen 2019), also appear in ultrahigh-dimensional survival data. It would also be interesting to explore this problem by extending the proposed method. These important topics are left for future work.

Appendix: Proof of Theorem 3.1

We first consider the characteristic-function quantities based on the true covariates and those based on the surrogate covariates. Note that the former are based on the true covariates X, while the latter are based on the surrogate covariates W.

Since the error term e follows a normal distribution with mean zero and covariance matrix Σ_e, its characteristic function is given by

(A.1)

By the direct computation, we have

(A.2)

where the second equality is due to the independence of X and e, and the last equality is due to (A.1).

In addition, we can also derive

(A.3)

where the second equality is again due to the independence of the error term and the remaining quantities, and the last equality again comes from (A.1). As a result, combining (A.2) and (A.3) with the error correction gives the same expression as the one based on the true covariates.

The equivalence of the remaining quantities holds by similar derivations. Therefore, we conclude that the distance correlations based on the true and surrogate covariates are equivalent, in the sense that one is zero if and only if the other is zero. Consequently, the same active features are determined by X and W.
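
For the reader's convenience, the identities that (A.1) and (A.2) presumably contain can be sketched in standard notation as follows, under model (1) with the normal error e independent of X; this is an assumption-based reconstruction, since the paper's own displays were not recoverable:

φ_e(t) = E{exp(i t^⊤ e)} = exp(−t^⊤ Σ_e t / 2),

φ_W(t) = E{exp(i t^⊤ (X + e))} = E{exp(i t^⊤ X)} E{exp(i t^⊤ e)} = φ_X(t) exp(−t^⊤ Σ_e t / 2).

Applying the same factorization to the joint characteristic function of the surrogate covariates and the imputed response gives the analogue of (A.3); multiplying the surrogate-based characteristic functions by exp(t^⊤ Σ_e t / 2) therefore recovers the ones based on the true covariates, which is consistent with the equivalence stated in Theorem 3.1.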

References

Akaike, H. (1973) Information theory and an extension of the maximum likelihood principle. In 2nd International Symposium on Information Theory, eds. Petrov, B. N. and Csaki, F., 267 - 281. Akademiai Kiado, Budapest.

Buckley, J. and James, I. (1979) Linear regression with censored data. Biometrika, 66, 429-436.

Candes, E. and Tao, T. (2007) The Dantzig selector: statistical estimation when p is much larger than n (with discussion). The Annals of Statistics, 35, 2313 - 2404.

Carroll, R. J., Ruppert, D., Stefanski, L. A., and Crainiceanu, C. M. (2006) Measurement Error in Nonlinear Models. CRC Press, New York.

Chen, L.-P. (2019). Pseudo likelihood estimation for the additive hazards model with data subject to left-truncation and right-censoring. Statistics and Its Interface, 12, 135-148.

Chen, X., Chen, X. and Wang, H. (2018) Robust feature screening for ultra-high dimensional right censored data via distance correlation. Computational Statistics and Data Analysis, 119, 118-138.

Chen, X., Zhang, Y., Chen, X. and Liu, Y. (2019) A simple model-free survival conditional feature screening. Statistics and Probability Letters, 146, 156-160.

Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004) Least angle regression. The Annals of Statistics, 32, 409 - 499.

Fan, J. and Li, R. (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96, 1348-1360.

Fan, J. and Lv, J. (2008) Sure independence screening for ultrahigh dimensional feature space (with discussion). Journal of the Royal Statistical Society. Series B, 70, 849 - 911.

Fan, J., Samworth, R. and Wu, Y. (2009) Ultrahigh dimensional feature selection: beyond the linear model. Journal of Machine Learning Research, 10, 1829 - 1853.

Fan, J. and Song, R. (2010) Sure independence screening in generalized linear models with NP-dimensionality. The Annals of Statistics, 38, 3567 - 3604.

Fan, J., Feng, Y. and Wu, Y. (2010) Ultrahigh dimensional variable selection for Cox’s proportional hazards model. IMS Collect, 6, 70 - 86.

Hall, P. and Miller, H. (2009) Using generalized correlation to effect variable selection in very high dimensional problems. Journal of Computational and Graphical Statistics, 18, 533 - 550.

Li, R., Zhong, W. and Zhu, L. (2012) Feature screening via distance correlation learning. Journal of the American Statistical Association, 107, 1129 - 1139.

Miller, R. G. (1981). Survival Analysis. Wiley, New York.

Rosenwald, A., Wright, G., Chan, W. C., Connors, J. M., Campo, E., Fisher, R. I., Gascoyne, R. D., Muller-Hermelink, H. K., Smeland, E. B., and Staudt, L. M. (2002). The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. The New England Journal of Medicine, 346, 1937-1947.

Schwarz, G. (1978) Estimating the dimension of a model. The Annals of Statistics, 6, 461 - 464.

Székely, G. J., Rizzo, M. L. and Bakirov, N. K. (2007) Measuring and testing dependence by correlation of distances. The Annals of Statistics, 35, 2769-2794.

Tibshirani, R. (1996) Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B, 58, 267-288.

van de Vijver, M. J., He, Y. D., van’t Veer, L. J., Dai, H., Hart, A. A.M., Voskuil, D. W., Schreiber, G.J., Peterse, J.L., Roberts, C., Marton, M.J., Parrish, M., Atsma, D., Witteveen, A., Glas, A., Delahaye, L., van der Velde, T., Bartelink, H., Rodenhuis, S., Rutgers, E.T., Friend, S.H. and Bernards, R. (2002) A gene-expression signature as a predictor of survival in breast cancer. The New England Journal of Medicine, 347, 1999 - 2009.

Yan, X., Tang, N. and Zhao, X. (2017) The Spearman rank correlation screening for ultrahigh dimensional censored data. arXiv:1702.02708v1

Zhong, W. and Zhu, L. (2015) An iterative approach to distance correlation-based sure independence screening. Journal of Statistical Computation and Simulation, 85, 2331 - 2345.

Zhu, L., Li, L., Li, R. and Zhu, L. (2011). Model-free feature screening for ultrahigh-dimensional data. Journal of the American Statistical Association, 106, 1464 - 1475.

Zou, H. and Hastie, T. (2005) Regularization and variable selection via the elastic net. Journal of Royal Statistical Society: Series B, 67, 301-320.

Zou, H. (2006) The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101, 1418-1429.

Feature screening Iterated feature screening
Model Method
PH 0.15 Naive 1.000 1.000 1.000 0.005 0.005 1.000 1.000 1.000 0.006 0.006
Propose 1.000 1.000 1.000 0.005 0.005 1.000 1.000 1.000 1.000 1.000
0.50 Naive 1.000 1.000 1.000 0.005 0.005 1.000 1.000 1.000 0.006 0.006
Propose 1.000 1.000 1.000 0.005 0.005 1.000 1.000 1.000 0.998 0.998
0.75 Naive 1.000 1.000 1.000 0.004 0.004 1.000 1.000 1.000 0.006 0.006
Propose 1.000 1.000 1.000 0.004 0.004 1.000 1.000 1.000 0.996 0.996
1.000 1.000 1.000 0.005 0.005 1.000 1.000 1.000 1.000 1.000
0.15 Naive 1.000 1.000 1.000 0.005 0.005 1.000 1.000 1.000 0.006 0.006
Propose 1.000 1.000 1.000 0.004 0.004 1.000 1.000 1.000 0.996 0.996
0.50 Naive 1.000 1.000 1.000 0.005 0.005 1.000 1.000 1.000 0.006 0.006
Propose 1.000 1.000 1.000 0.004 0.004 1.000 1.000 1.000 0.996 0.996
0.75 Naive 1.000 1.000 1.000 0.005 0.005 1.000 1.000 1.000 0.006 0.006
Propose 1.000 1.000 1.000 0.004 0.004 1.000 1.000 1.000 0.996 0.996
1.000 1.000 1.000 0.005 0.005 1.000 1.000 1.000 0.997 0.997
PO 0.15 Naive 1.000 1.000 1.000 0.005 0.005 1.000 1.000 1.000 0.006 0.006
Propose 1.000 1.000 1.000 0.004 0.004 1.000 1.000 1.000 0.996 0.996
0.50 Naive 1.000 1.000 1.000 0.005 0.005 1.000 1.000 1.000 0.006 0.006
Propose 1.000 1.000 1.000 0.004 0.004 1.000 1.000 1.000 0.996 0.996
0.75 Naive 1.000 1.000 1.000 0.005 0.005 1.000 1.000 1.000 0.006 0.006
Propose 1.000 1.000 1.000 0.004 0.004 1.000 1.000 1.000 0.996 0.996
1.000 1.000 1.000 0.005 0.005 1.000 1.000 1.000 1.000 1.000
0.15 Naive 1.000 1.000 1.000 0.005 0.005 1.000 1.000 1.000 0.006 0.006
Propose 1.000 1.000 1.000 0.004 0.004 1.000 1.000 1.000 0.996 0.996
0.50 Naive 1.000 1.000 1.000 0.005 0.005 1.000 1.000 1.000 0.006 0.006
Propose 1.000 1.000 1.000 0.004 0.004 1.000 1.000 1.000 0.996 0.996
0.75 Naive 1.000 1.000 1.000 0.005 0.005 1.000 1.000 1.000 0.006 0.006
Propose 1.000 1.000 1.000 0.004 0.004 1.000 1.000 1.000 0.996 0.996
1.000 1.000 1.000 0.002 0.002 1.000 1.000 1.000 0.997 0.997
Table 1: Simulation results for feature selection with Σ_e known
Feature screening Iterated feature screening
Model Method
PH 0.15 Naive 1.000 1.000 1.000 0.005 0.005 1.000 1.000 1.000 0.004 0.004
Propose 1.000 1.000 1.000 0.005 0.005 1.000 1.000 1.000 1.000 1.000
0.50 Naive 1.000 1.000 1.000 0.005 0.005 1.000 1.000 1.000 0.005 0.005
Propose 1.000 1.000 1.000 0.005 0.005 1.000 1.000 1.000 0.997 0.997
0.75 Naive 1.000 1.000 1.000 0.004 0.004 1.000 1.000 1.000 0.004 0.004
Propose 1.000 1.000 1.000 0.004 0.004 1.000 1.000 1.000 0.997 0.997
1.000 1.000 1.000 0.005 0.005 1.000 1.000 1.000 1.000 1.000
0.15 Naive 1.000 1.000 1.000 0.004 0.004 1.000 1.000 1.000 0.005 0.005
Propose 1.000 1.000 1.000 0.004 0.004 1.000 1.000 1.000 0.995 0.995
0.50 Naive 1.000 1.000 1.000 0.003 0.003 1.000 1.000 1.000 0.002 0.002
Propose 1.000 1.000 1.000 0.003 0.003 1.000 1.000 1.000 0.994 0.994
0.75 Naive 1.000 1.000 1.000 0.004 0.004 1.000 1.000 1.000 0.006 0.006
Propose 1.000 1.000 1.000 0.004 0.004 1.000 1.000 1.000 0.996 0.996
1.000 1.000 1.000 0.006 0.006 1.000 1.000 1.000 0.998 0.998
PO 0.15 Naive 1.000 1.000 1.000 0.003 0.003 1.000 1.000 1.000 0.004 0.004
Propose 1.000 1.000 1.000 0.003 0.003 1.000 1.000 1.000 0.997 0.997
0.50 Naive 1.000 1.000 1.000 0.003 0.003 1.000 1.000 1.000 0.004 0.004
Propose 1.000 1.000 1.000 0.003 0.003 1.000 1.000 1.000 0.995 0.995
0.75 Naive 1.000 1.000 1.000 0.003 0.003 1.000 1.000 1.000 0.004 0.004
Propose 1.000 1.000 1.000 0.003 0.003 1.000 1.000 1.000 0.995 0.995
1.000 1.000 1.000 0.005 0.005 1.000 1.000 1.000 1.000 1.000
0.15 Naive 1.000 1.000 1.000 0.002 0.002 1.000 1.000 1.000 0.003 0.003
Propose 1.000 1.000 1.000 0.002 0.002 1.000 1.000 1.000 0.997 0.997
0.50 Naive 1.000 1.000 1.000 0.003 0.003 1.000 1.000 1.000 0.006 0.006
Propose 1.000 1.000 1.000 0.003 0.003 1.000 1.000 1.000 0.995 0.995
0.75 Naive 1.000 1.000 1.000 0.002 0.002 1.000 1.000 1.000 0.003 0.003
Propose 1.000 1.000 1.000 0.002 0.002 1.000 1.000 1.000 0.995 0.995
1.000 1.000 1.000 0.002 0.002 1.000 1.000 1.000 0.997 0.997
Table 2: Simulation results for feature selection with repeated measurements
Feature screening Iterated feature screening
Model Method
PH 0.15 Naive 1.000 1.000 1.000 0.007 0.007 1.000 1.000 1.000 0.007 0.007
Propose 1.000 1.000 1.000 0.007 0.007 1.000 1.000 1.000 1.000 1.000
0.50 Naive 1.000 1.000 1.000 0.005 0.005 1.000 1.000 1.000 0.005 0.005
Propose 1.000 1.000 1.000 0.005 0.005 1.000 1.000 1.000 0.997 0.997
0.75 Naive 1.000 1.000 1.000 0.003 0.003 1.000 1.000 1.000 0.004 0.003
Propose 1.000 1.000 1.000 0.003 0.003 1.000 1.000 1.000 0.995 0.995
1.000 1.000 1.000 0.005 0.005 1.000 1.000 1.000 1.000 1.000
0.15 Naive 1.000 1.000 1.000 0.005 0.005 1.000 1.000 1.000 0.005 0.005
Propose 1.000 1.000 1.000 0.005 0.005 1.000 1.000 1.000 0.997 0.997
0.50 Naive 1.000 1.000 1.000 0.004 0.004 1.000 1.000 1.000 0.004 0.004
Propose 1.000 1.000 1.000 0.004 0.004 1.000 1.000 1.000 0.996 0.996
0.75 Naive 1.000 1.000 1.000 0.001 0.001 1.000 1.000 1.000 0.006 0.006
Propose 1.000 1.000 1.000 0.001 0.001 1.000 1.000 1.000 0.994 0.994
1.000 1.000 1.000 0.005 0.005 1.000 1.000 1.000 0.998 0.998
PO 0.15 Naive 1.000 1.000 1.000 0.008 0.008 1.000 1.000 1.000 0.009 0.009
Propose 1.000 1.000 1.000 0.008 0.008 1.000 1.000 1.000 0.998 0.998
0.50 Naive 1.000 1.000 1.000 0.005 0.005 1.000 1.000 1.000 0.005 0.005
Propose 1.000 1.000 1.000 0.005 0.005 1.000 1.000 1.000 0.997 0.997
0.75 Naive 1.000 1.000 1.000 0.005 0.005 1.000 1.000 1.000 0.006 0.006
Propose 1.000 1.000 1.000 0.005 0.005 1.000 1.000 1.000 0.996 0.996
1.000 1.000 1.000 0.006 0.006 1.000 1.000 1.000 1.000 1.000
0.15 Naive 1.000 1.000 1.000 0.003 0.003 1.000 1.000 1.000 0.004 0.004
Propose 1.000 1.000 1.000 0.003 0.003 1.000 1.000 1.000 0.995 0.995
0.50 Naive 1.000 1.000 1.000 0.002 0.002 1.000 1.000 1.000 0.004 0.004
Propose 1.000 1.000 1.000 0.002 0.002 1.000 1.000 1.000 0.995 0.995
0.75 Naive 1.000 1.000 1.000 0.000 0.000 1.000 1.000 1.000 0.003 0.003
Propose 1.000 1.000 1.000 0.000 0.000 1.000 1.000 1.000 0.994 0.994
1.000 1.000 1.000 0.003 0.003 1.000 1.000 1.000 0.998 0.998
Table 3: Simulation results for feature selection with validation data
# naive
FS IFS FS IFS FS IFS FS IFS
1 16587 16587 16587 16587 16587 16587 16587 16587
2 24719 24719 24719 24719 24719 24719 24719 24719
3 27057 27057 27057 27057 27057 27057 27057 27057
4 28581 28581 28581 28581 28581 28581 28581 28581
5 31420 31420 31420 31420 31420 31420 31420 31420
6 34790 34790 34790 34790 34790 34790 34790 34790
7 28581 28581 28581 28581 28581 28581 28581 28581
8 16312 29357 16312 29357 16312 29357 30157 30157
9 34771 29897 26537 29897 17053 29897 27116 28872
10 28346 30620 29637 30620 30917 30620 30334 32699
11 26521 30898 16587 30898 30929 30898 27762 27095
12 34375 32699 17053 32699 31972 32699 17326 24710
13 29642 15843 28346 15843 29637 15844 27019 19325
14 26537 15924 28908 15924 17605 15924 27762 30282
15 17605 27927 32519 27927 28346 27931 17176 32187
16 28920 28929 26521 28929 34771 28929 23887 29209
17 29657 34339 34364 34375 28908 34375 17343 16528
18 32519 34913 34667 32475 34651 32475 32699 27019
19 34651 26510 34771 26510 16079 26510 30157 23887
20 28908 27530 27931 34913 26537 27530 17917 16020
Table 4: Sensitivity Analyses of mantle cell lymphoma microarray dataset. FS stands for feature screening method in Section 3.1; IFS stands for iterated feature screening method in Section 3.2.
# naive
FS IFS FS IFS FS IFS FS IFS
1 NM  016359 NM  016359 NM  016359 NM  016359 NM  016359 NM  016359 NM  016359 NM  016359
2 AA555029  RC AA555029  RC AA555029  RC AA555029  RC AA555029  RC AA555029  RC AA555029  RC AA555029  RC
3 NM  003748 NM  003748 NM  003748 NM  003748 NM  003748 NM  003748 NM  003748 NM  003748
4 Contig38288  RC Contig38288  RC Contig38288  RC Contig38288  RC Contig38288  RC Contig38288  RC Contig38288  RC Contig38288  RC
5 NM  003862 NM  003862 NM  003862 NM  003862 NM  003862 NM  003862 NM  003862 NM  003862
6 Contig28552  RC Contig28552  RC Contig28552  RC Contig28552  RC Contig28552  RC Contig28552  RC Contig28552  RC Contig28552  RC
7 Contig32125  RC Contig32125  RC Contig32125  RC Contig32125  RC Contig32125  RC Contig32125  RC Contig32125  RC Contig32125  RC
8 AB037863 Contig036649  RC Contig55725  RC Contig036649  RC Contig55725  RC Contig036649  RC NM  000599 NM  000599
9 Contig036649  RC Contig46218  RC AF201905 Contig46218  RC AB037863 Contig46218  RC Contig46223 NM  005915
10 X05610 AB037863 AB037863 AB037863 AF201905 AB037863 AF257175 Contig46223
11 AL080079 NM  020188 Contig48328  RC NM  020188 Contig036649  RC NM  020188 NM  006931 X05610
12 NM  006931 Contig55377  RC Contig036649  RC Contig25991 X05610 Contig25991 AK000745 AK000745
13 AF201905 Contig48328  RC AL080079 Contig55377  RC NM  018354 Contig48328  RC NM  005915 NM  005915
14 NM  003875 Contig25991 X05610 Contig46223  RC AL080079 Contig55377  RC NM  001282 NM  001282
15 Contig55725  RC NM  003875 Contig55725  RC NM  003875 Contig55725  RC NM  003875 AL080079 NM  614321
16 Contig48328  RC NM  006101 NM  018354 NM  006101 NM  006931 NM  006101 NM  014889 AF257175
17 NM  000599 NM  003882 NM  003875 NM  003607 Contig48328  RC NM  000849 Contig55725  RC NM  014889
18 NM  018354 NM  016577 NM  006931 NM  003882 NM  003875 NM  016577 NM  614321