1 Introduction
With the increasing proliferation of biomarker studies, there is a need for efficient methods for relating a survival time response to a large number of features. In typical genetic microarray studies, the sample size is measured in hundreds whereas the number of features per sample can be in excess of millions. Sparse regression techniques such as the lasso (Tibshirani, 1997) and SCAD (Fan and Li, 2001) have proved useful for dealing with such high-dimensional features, but their usefulness diminishes when the number of features $p$ becomes extremely large compared to the sample size $n$. The notion of NP-dimensionality (Fan and Lv, 2009) has been conceived to describe such ultra-high dimensional settings, which are formally analyzed in an asymptotic regime where $p$ grows at a non-polynomial rate with $n$. Despite recent progress (Bradic et al., 2011), theoretical knowledge about sparse regression techniques under NP-dimensionality is still in its infancy. Moreover, NP-dimensionality poses substantial computational challenges. When, for example, pairwise interactions among gene expressions in a genetic microarray study are of interest, the dimension of the feature space will trouble even the most efficient algorithms for fitting sparse regression models. A popular ad hoc solution is to simply pretend that feature correlations are ignorable and resort to computationally swift univariate regression methods; so-called independent screening methods.
In an important paper, Fan and Lv (2008) laid the formal foundation for using independent screening to distinguish ‘relevant’ features from ‘irrelevant’ ones. For the linear regression model they showed that, when the design is close to orthogonal, a superset of the true set of nonzero regression coefficients can be estimated consistently by simple hard-thresholding of feature–response correlations. This sure independent screening (SIS) property of correlation screening would be rather trivial, were it not for the fact that it holds true in the asymptotic regime of NP-dimensionality. Thus, when the feature covariance structure is sufficiently simple, SIS methods can overcome the noise accumulation in extremely high dimension. In order to accommodate more complex feature covariance structures, Fan and Lv (2008) and Fan et al. (2009) developed heuristic, iterated methods combining independent screening with forward selection techniques. Recently, Fan and Song (2010) extended the formal basis for SIS to generalized linear models.

In biomedical applications, the response of interest is often a right-censored survival time, making the study of screening methods for survival data an important one. Fan et al. (2010) investigated SIS methods for the Cox proportional hazards model based on ranking features according to the univariate partial log-likelihood but gave no formal justification. Tibshirani (2009) suggested soft-thresholding of univariate Cox score statistics with some theoretical justification but under strong assumptions. Indeed, independent screening methods for survival data are apt to be difficult to justify theoretically due to the presence of censoring, which can confound marginal associations between the response and the features. Recent work by Zhao and Li (2010) contains ideas which indicate that independent screening based on the Cox model may have the SIS property in the absence of censoring.
In the present paper, we depart from the standard approach of studying SIS as a rather specific type of model misspecification in which the univariate versions of a particular regression model are used to infer the structure of the joint version of the same particular regression model. Instead, we propose a survival variant of independent screening based on a model-free statistic which we call the ‘Feature Aberration at Survival Times’ (FAST) statistic. The FAST statistic is a simple linear statistic which aggregates, across survival times, the aberration of each feature relative to its time-varying average. Independent screening based on this statistic can be regarded as a natural survival equivalent of correlation screening. We study the SIS property of FAST screening in ultra-high dimension for a general class of single-index hazard rate regression models in which the risk of an event depends on the features through some linear functional. A key aim has been to derive simple and operational sufficient conditions for the SIS property to hold. Accordingly, our main result states that the FAST statistic has the SIS property in an ultra-high dimensional setting under covariance assumptions as in Fan et al. (2009), provided that censoring is essentially random and that features satisfy a technical condition which holds when they follow an elliptically contoured distribution. Utilizing the fact that the FAST statistic is related to the univariate regression coefficients in the semiparametric additive hazards model (Lin and Ying, 1994; McKeague and Sasieni, 1994), we develop methods for iterated SIS. The techniques are evaluated in a simulation study where we also compare with screening methods for the Cox model (Fan et al., 2010). Finally, an application to a real genetic microarray data set is presented.
2 The FAST statistic and its motivation
Let $T$ be a survival time which is subject to right-censoring by some random variable $C$. Denote by $N(t)$ the counting process which counts events up to time $t$, let $Y(t)$ be the at-risk process, and let $\mathbf{Z}$ denote a random vector of $p$ explanatory variables or features. It is assumed throughout that $\mathbf{Z}$ has finite variance and is standardized, i.e. centered and with a covariance matrix $\boldsymbol\Sigma$ with unit diagonal. We observe independent and identically distributed (i.i.d.) replicates $(N_i, Y_i, \mathbf{Z}_i)$ of $(N, Y, \mathbf{Z})$ for $i = 1, \ldots, n$, where $[0, \tau]$ is the observation time window. Define the ‘Feature Aberration at Survival Times’ (FAST) statistic as follows:
$$\mathbf{d} := n^{-1} \sum_{i=1}^n \int_0^\tau \{\mathbf{Z}_i - \bar{\mathbf{Z}}(t)\}\, \mathrm{d}N_i(t), \tag{1}$$
where $\bar{\mathbf{Z}}(t)$ is the at-risk average of the $\mathbf{Z}_i$s,
$$\bar{\mathbf{Z}}(t) := \frac{\sum_{i=1}^n Y_i(t)\, \mathbf{Z}_i}{\sum_{i=1}^n Y_i(t)}.$$
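As a concrete illustration, the statistic in (1) can be computed directly from right-censored data. The following is a minimal numpy sketch (the function name and array conventions are ours, not the paper's):

```python
import numpy as np

def fast_statistic(time, status, Z, tau=None):
    """FAST statistic d_j = n^{-1} sum_i int_0^tau {Z_ij - Zbar_j(t)} dN_i(t).

    time:   observed times (minimum of survival and censoring), shape (n,)
    status: event indicator (1 = event, 0 = censored), shape (n,)
    Z:      feature matrix, shape (n, p)
    """
    n, p = Z.shape
    if tau is None:
        tau = float(time.max())
    d = np.zeros(p)
    # Each observed event contributes the subject's aberration from the
    # at-risk average of the features at that event time.
    for i in np.flatnonzero((status == 1) & (time <= tau)):
        at_risk = time >= time[i]          # risk set at the event time
        zbar = Z[at_risk].mean(axis=0)     # at-risk average \bar{Z}(t)
        d += Z[i] - zbar
    return d / n
```

The convention `time >= t` for the risk set matches the usual counting-process setup; a per-feature loop is avoided by vectorizing over the feature axis.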
Components of the FAST statistic define basic measures of the marginal association between each feature and survival. In the following, we provide two motivations for using the FAST statistic for screening purposes. The first, being model-based, is perhaps the most intuitive; the second shows that, even in a model-free setting, the FAST statistic may provide valuable information about marginal associations.

2.1 A model-based interpretation of the FAST statistic
Assume in this section that the survival times have hazard functions of the form
$$\alpha(t; \mathbf{Z}) = \alpha_0(t) + \boldsymbol\beta^\top \mathbf{Z}, \tag{2}$$
with $\alpha_0$ an unspecified baseline hazard rate and $\boldsymbol\beta$ a vector of regression coefficients. This is the so-called semiparametric additive hazards model (Lin and Ying, 1994; McKeague and Sasieni, 1994), henceforth simply the Lin–Ying model. The Lin–Ying model corresponds to assuming for each $i$ an intensity function of the form $\lambda_i(t) = Y_i(t)\{\alpha_0(t) + \boldsymbol\beta^\top \mathbf{Z}_i\}$. From the Doob–Meyer decomposition $\mathrm{d}N_i(t) = \lambda_i(t)\,\mathrm{d}t + \mathrm{d}M_i(t)$ with $M_i$ a martingale, it is easily verified that
$$n^{-1}\sum_{i=1}^n \{\mathbf{Z}_i - \bar{\mathbf{Z}}(t)\}\,\mathrm{d}N_i(t) = n^{-1}\sum_{i=1}^n Y_i(t)\{\mathbf{Z}_i - \bar{\mathbf{Z}}(t)\}^{\otimes 2}\boldsymbol\beta\,\mathrm{d}t + n^{-1}\sum_{i=1}^n \{\mathbf{Z}_i - \bar{\mathbf{Z}}(t)\}\,\mathrm{d}M_i(t). \tag{3}$$
This suggests that $\boldsymbol\beta$ is estimable as the solution $\hat{\boldsymbol\beta}$ to the linear system of equations
$$\mathbf{d} = \mathbf{D}\hat{\boldsymbol\beta}, \tag{4}$$
where
$$\mathbf{D} := n^{-1}\sum_{i=1}^n \int_0^\tau Y_i(t)\{\mathbf{Z}_i - \bar{\mathbf{Z}}(t)\}^{\otimes 2}\,\mathrm{d}t. \tag{5}$$
Suppose $\hat{\boldsymbol\beta}$ solves (4). Standard martingale arguments (Lin and Ying, 1994) imply root-$n$ consistency of $\hat{\boldsymbol\beta}$, with sandwich-type variance estimator
$$\mathbf{D}^{-1}\mathbf{B}\mathbf{D}^{-1}, \qquad \mathbf{B} := n^{-1}\sum_{i=1}^n \int_0^\tau \{\mathbf{Z}_i - \bar{\mathbf{Z}}(t)\}^{\otimes 2}\,\mathrm{d}N_i(t). \tag{6}$$
For now, simply observe that the left-hand side of (4) is exactly the FAST statistic, whereas $d_j / D_{jj}$ for $j = 1, \ldots, p$ estimate the regression coefficients in the corresponding univariate Lin–Ying models. Hence we can interpret $\mathbf{d}$ as a (scaled) estimator of the univariate regression coefficients in a working Lin–Ying model.

A nice heuristic interpretation of $\mathbf{d}$ results from the pointwise signal/error decomposition (3), which is essentially a reformulated linear regression model with ‘responses’ $\mathrm{d}N_i(t)$ and ‘explanatory variables’ $Y_i(t)\{\mathbf{Z}_i - \bar{\mathbf{Z}}(t)\}$. The FAST statistic is given by the time average of the corresponding predictor–response products and may accordingly be viewed as a survival equivalent of the usual predictor–response correlations.
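The univariate working Lin–Ying coefficients $d_j / D_{jj}$ above can be computed exactly, because the risk set, and hence the at-risk average, is piecewise constant between observed times. A minimal numpy sketch (names and conventions ours):

```python
import numpy as np

def linying_univariate(time, status, Z, tau=None):
    """Univariate Lin-Ying coefficient estimates d_j / D_jj."""
    n, p = Z.shape
    if tau is None:
        tau = float(time.max())
    # FAST statistic d: aberration of features summed over event times.
    d = np.zeros(p)
    for i in np.flatnonzero((status == 1) & (time <= tau)):
        at_risk = time >= time[i]
        d += Z[i] - Z[at_risk].mean(axis=0)
    d /= n
    # D_jj = n^{-1} sum_i int_0^tau Y_i(t){Z_ij - Zbar_j(t)}^2 dt, computed
    # exactly by integrating over intervals where the risk set is constant.
    Djj = np.zeros(p)
    grid = np.concatenate(([0.0], np.unique(np.minimum(time, tau))))
    for left, right in zip(grid[:-1], grid[1:]):
        at_risk = time > left              # risk set on the interval (left, right]
        if not at_risk.any():
            break
        zbar = Z[at_risk].mean(axis=0)
        Djj += (right - left) * ((Z[at_risk] - zbar) ** 2).sum(axis=0)
    Djj /= n
    return d / Djj
```

Features that are constant within every risk set would give a zero diagonal; for continuous standardized features this does not occur.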
2.2 A model-free interpretation of the FAST statistic
For a feature to be judged (marginally) associated with survival in any reasonable interpretation of survival data, one would first require that the feature is correlated with the probability of experiencing an event; second, that this correlation persists throughout the time window. The FAST statistic can be shown to reflect these two requirements when the censoring mechanism is sufficiently simple.

Specifically, assume administrative censoring at time $\tau$ (so that $C = \tau$). Set $F(t \mid \mathbf{Z}) := P(T \le t \mid \mathbf{Z})$, the conditional probability of death before time $t$. For each $j$, denote by $\rho_j$ the population version of $d_j$ (the in-probability limit of $d_j$ when $n \to \infty$). Then $\rho_j$ can be written as a time average over $[0, \tau]$ of a covariance function $c_j(t)$ measuring the correlation between $Z_j$ and the risk of death at time $t$. We can make the following observations:

(i) If $c_j$ has constant sign throughout $[0, \tau]$, then $|\rho_j| > 0$.

(ii) Conversely, if $c_j$ changes sign, so that the direction of association with survival is not persistent throughout $[0, \tau]$, then this will lead to a smaller value of $|\rho_j|$ compared to (i).

(iii) Lastly, if $Z_j$ is independent of survival, then $\rho_j = 0$.

In other words, the sample version $d_j$ estimates a time-averaged summary of the correlation function $c_j$ which takes into account both magnitude and persistent behavior throughout $[0, \tau]$. This indicates that the FAST statistic is relevant for judging marginal association of features with survival beyond the model-specific setting of Section 2.1.
3 Independent screening with the FAST statistic
In this section, we extend the heuristic arguments of the previous section and provide theoretical justification for using the FAST statistic to screen for relevant features when the data-generating model belongs to a class of single-index hazard rate regression models.
3.1 The general case of single-index hazard rate models
In the notation of Section 2, we assume survival times to have hazard rate functions of single-index form:
$$\alpha(t; \mathbf{Z}) = \psi(t, \boldsymbol\beta^\top \mathbf{Z}). \tag{7}$$
Here $\psi$ is a continuous function, $\mathbf{Z}$ is a random vector in $\mathbb{R}^{p_n}$, $\boldsymbol\beta$ is a vector of regression coefficients, and $\boldsymbol\beta^\top \mathbf{Z}$ defines a risk score. We subscript by $n$ to indicate that the dimension $p_n$ of the feature space can grow with the sample size. Censoring will always be assumed at least independent so that $T$ is independent of $C$ conditionally on $\mathbf{Z}$. We impose the following assumption on the hazard ‘link function’ $\psi$:

Assumption 1.
The survival function $x \mapsto P(T > t \mid \boldsymbol\beta^\top\mathbf{Z} = x)$ is strictly monotonic for each $t$.

Requiring the survival function to depend monotonically on the risk score is natural in order to enable interpretation of the components of $\boldsymbol\beta$ as indicative of positive or negative association with survival. Note that it suffices that $x \mapsto \psi(t, x)$ is strictly monotonic for each $t$. Assumption 1 holds for a range of popular survival regression models. For example, $\psi(t, x) = \alpha_0(t) + x$ with $\alpha_0$ some baseline hazard yields the Lin–Ying model (2); $\psi(t, x) = \alpha_0(t)e^x$ is a Cox model; and $\psi(t, x) = \alpha_0(te^x)e^x$ is an accelerated failure time model.
Denote by $\boldsymbol\rho$ the population version of the FAST statistic under the model (7) which, by the Doob–Meyer decomposition $\mathrm{d}N(t) = Y(t)\psi(t, \boldsymbol\beta^\top\mathbf{Z})\,\mathrm{d}t + \mathrm{d}M(t)$ with $M$ a martingale, takes the form
$$\boldsymbol\rho = E\left[\int_0^\tau \{\mathbf{Z} - \mathbf{e}(t)\}\, Y(t)\, \psi(t, \boldsymbol\beta^\top\mathbf{Z})\,\mathrm{d}t\right], \qquad \mathbf{e}(t) := \frac{E\{\mathbf{Z}\, Y(t)\}}{E\{Y(t)\}}. \tag{8}$$
Our proposed FAST screening procedure is as follows: given some (data-dependent) threshold $\gamma_n > 0$,

(i) calculate the FAST statistic $\mathbf{d}$ from the available data; and

(ii) declare the ‘relevant features’ to be the set $\widehat{\mathcal{M}}_\gamma := \{j : |d_j| > \gamma_n\}$.
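The two screening steps are simple to implement once the statistic has been computed; the sketch below (helper name and interface ours) supports both an explicit threshold and a target model size, e.g. of order $n/\log n$ as discussed later in this section:

```python
import numpy as np

def fast_screen(d, gamma=None, top=None):
    """FAST screening: keep features whose |d_j| exceeds a threshold.

    Supply either an explicit threshold `gamma` or a desired model
    size `top` (for example int(n / np.log(n))).
    """
    d = np.asarray(d, dtype=float)
    if gamma is None:
        # Set the threshold so that exactly `top` features survive
        # (assuming no ties in |d|).
        gamma = np.sort(np.abs(d))[-top - 1] if top < d.size else -np.inf
    return np.flatnonzero(np.abs(d) > gamma)
```

A usage example: with `n = 400`, `fast_screen(d, top=int(400 / np.log(400)))` retains the 66 largest components of `|d|`.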
By the arguments in Section 2, this procedure defines a natural survival equivalent of correlation screening. Define the following sets of features:
$$\mathcal{M} := \{j : \beta_j \neq 0\}, \qquad \mathcal{M}_\rho := \{j : \rho_j \neq 0\}.$$
The problem of establishing the SIS property of FAST screening amounts to determining when $\mathcal{M} \subseteq \widehat{\mathcal{M}}_\gamma$ holds with large probability for large $n$. This translates into two questions: first, when do we have $\mathcal{M}_\rho \subseteq \widehat{\mathcal{M}}_\gamma$; second, when do we have $\mathcal{M} \subseteq \mathcal{M}_\rho$? The first question is essentially model-independent and requires establishing an exponential bound for $\max_j |d_j - \rho_j|$ as $p_n \to \infty$. The second question is strongly model-dependent and is answered by manipulating expectations under the single-index model (7).
We state the main results here and relegate proofs to the appendix, where we also state various regularity conditions. The following principal assumptions, however, deserve separate attention:

Assumption 2.
The conditional expectation $E(\mathbf{Z} \mid \boldsymbol\beta^\top\mathbf{Z})$ is a linear function of $\boldsymbol\beta^\top\mathbf{Z}$.

Assumption 3.
The censoring time $C$ depends on $\mathbf{Z}$ only through the irrelevant features $\{Z_j : j \notin \mathcal{M}\}$.

Assumption 4.
$\{Z_j : j \notin \mathcal{M}\}$ is independent of $\{Z_j : j \in \mathcal{M}\}$.

Assumption 2 is a ‘linear regression’ property which holds true for Gaussian features and, more generally, for features following an elliptically contoured distribution (Hardin, 1982). In view of Hall and Li (1993), which states that most low-dimensional projections of high-dimensional features are close to linear, Assumption 2 may not be unreasonable a priori even for general feature distributions when $p_n$ is large.

Assumption 3 restricts the censoring mechanism to be partially random in the sense of depending only on irrelevant features. As we will discuss in detail below, such rather strong restrictions on the censoring distribution seem necessary for obtaining the SIS property; Assumption 3 is both general and convenient.

Assumption 4 is the partial orthogonality condition also used by Fan and Song (2010). Under this assumption and Assumption 3, it follows from (8) that $\rho_j = 0$ whenever $j \notin \mathcal{M}$, implying $\mathcal{M}_\rho \subseteq \mathcal{M}$. Provided that we also have $\rho_j \neq 0$ for $j \in \mathcal{M}$ (that is, $\mathcal{M} \subseteq \mathcal{M}_\rho$), there exists a threshold $\gamma_n$ such that $\widehat{\mathcal{M}}_\gamma = \mathcal{M}$ with large probability.

Consequently, Assumptions 3–4 are needed to enable consistent model selection via independent screening. Although model selection consistency is not essential in order to capture just some superset of the relevant features via independent screening, it is pertinent in order to limit the size of such a superset.
The following theorem on FAST screening (FAST-SIS) is our main theoretical result. It states that the screening property may hold with large probability even when $p_n$ grows exponentially fast in a certain power of $n$ which depends on the tail behavior of features. The covariance condition in the theorem is analogous to that of Fan and Song (2010) for SIS in generalized linear models with Gaussian features.

Observe that with bounded features, the tail-behavior exponent can be taken arbitrarily large, so that we may handle a dimension $p_n$ of nearly exponential order in $n$.
We may dispense with Assumption 2 on the feature distribution by revising (9). By Lemma 5 in the appendix, it holds generally under Assumption 3 that each $\rho_j$ is a covariance between $Z_j$ and the chance of survival. Accordingly, if we replace (9) with the assumption that this covariance is uniformly bounded away from zero for $j \in \mathcal{M}$, the conclusions of Theorem 1 still hold. In other words, we can generally expect FAST-SIS to detect features which are ‘correlated with the chance of survival’, much in line with Section 2. While this is valuable structural insight, the covariance assumption (9) seems a more operational condition.
Assumption 3 is crucial to the proof of Theorem 1 and to the general idea of translating a model-based feature selection problem into a problem of hard-thresholding. A weaker assumption is not possible in general. For example, suppose that only Assumption 2 holds and that the censoring time also follows some single-index model of the form (7) with regression coefficients $\boldsymbol\theta$. Applying Lemma 2.1 of Cheng and Wu (1994) to (8), there exist finite constants $c_1, c_2$ (depending on $\boldsymbol\beta$ and $\boldsymbol\theta$) such that
$$\boldsymbol\rho = c_1 \boldsymbol\Sigma \boldsymbol\beta + c_2 \boldsymbol\Sigma \boldsymbol\theta. \tag{10}$$
It follows that unrestricted censoring will generally confound the relationship between $\boldsymbol\rho$ and $\boldsymbol\beta$, hence between $\mathcal{M}_\rho$ and $\mathcal{M}$. The precise impact of such unrestricted censoring seems difficult to discern, although (10) suggests that FAST-SIS may still be able to capture the underlying model (unless the censoring model is particularly ill-behaved). We will have more to say about unrestricted censoring in the next section.
Theorem 1 shows that FAST-SIS can consistently capture a superset of the relevant features. A priori, this superset can be quite large; indeed, ‘perfect’ screening would result by simply including all features. For FAST-SIS to be useful, it must substantially reduce feature space dimension. Below we state a survival analogue of Theorem 5 in Fan and Song (2010), providing an asymptotic rate on the FAST-SIS model size.
Theorem 2.
Suppose that Assumptions 1–4 hold alongside the regularity conditions of the appendix and that $\min_{j \in \mathcal{M}} |\rho_j| \geq c n^{-\kappa}$ for positive constants $c$, $\kappa$ and sufficiently large $n$. If $\gamma_n = c^* n^{-\kappa}$ for some $\kappa < 1/2$ and $c^* \leq c/2$, there exists a positive constant $C$ such that
$$P\big\{|\widehat{\mathcal{M}}_\gamma| \leq C\, n^{2\kappa} \lambda_{\max}(\boldsymbol\Sigma)\big\} \to 1,$$
where $\lambda_{\max}(\boldsymbol\Sigma)$ denotes the maximal eigenvalue of the covariance matrix $\boldsymbol\Sigma$ of the feature distribution.

Informally, the theorem states that, under similar assumptions as in Theorem 1 and the partial orthogonality condition (Assumption 4), if features are not too strongly correlated (as measured by the maximal eigenvalue of the covariance matrix) and the threshold decays sufficiently slowly, we can choose a threshold for hard-thresholding such that the false selection rate becomes asymptotically negligible.
Our theorems say little about how to actually select the hard-thresholding parameter $\gamma_n$ in practice. Following Fan and Lv (2008) and Fan et al. (2009), we would typically choose $\gamma_n$ such that $|\widehat{\mathcal{M}}_\gamma|$ is of order $n/\log n$. Devising a general data-adaptive way of choosing $\gamma_n$ is an open problem; false-selection-based criteria are briefly mentioned in Section 3.3.
3.2 The special case of the Aalen model
Additional insight into the impact of censoring on FAST-SIS is possible within the more restrictive context of the nonparametric Aalen model with Gaussian features (Aalen, 1980, 1989). This particular model asserts a hazard rate function for $T$ of the form
$$\alpha(t; \mathbf{Z}) = \beta_0(t) + \boldsymbol\beta(t)^\top \mathbf{Z}, \tag{11}$$
for some baseline hazard rate function $\beta_0$ and a vector $\boldsymbol\beta(\cdot)$ of continuous regression coefficient functions. The Aalen model extends the Lin–Ying model of Section 2 by allowing time-varying regression coefficients. Alternatively, it can be viewed as defining an expansion to the first order of a general hazard rate function in the class (7) in the sense that
$$\psi(t, \boldsymbol\beta^\top\mathbf{Z}) = \psi(t, 0) + \partial_x\psi(t, 0)\,\boldsymbol\beta^\top\mathbf{Z} + o(|\boldsymbol\beta^\top\mathbf{Z}|). \tag{12}$$
For Aalen models with Gaussian features, we have the following analogue of Theorem 1.

Theorem 3.
Suppose that Assumptions 1–2 hold alongside the regularity conditions of the appendix. Suppose moreover that $\mathbf{Z}$ is mean-zero Gaussian and that $T$ follows a model of the form (11) with regression coefficients $\boldsymbol\beta(\cdot)$. Assume that $C$ also follows a model of the form (11) conditionally on $\mathbf{Z}$ and that censoring is independent. Then the conclusions of Theorem 1 hold with $\boldsymbol\rho$ proportional to an expectation of the regression coefficient functions evaluated at the exit time (13).

In view of (12), Theorem 3 can be viewed as establishing, within the model class (7), conditions for first-order validity of FAST-SIS under a general (independent) censoring mechanism and Gaussian features. The expectation term in (13) is essentially the ‘expected regression coefficients at the exit time’, which is strongly dependent on censoring through the symmetric dependence on survival and censoring time.
In fact, general independent censoring is a nuisance even in the Lin–Ying model, which would otherwise seem the ‘natural model’ in which to use FAST-SIS. Specifically, assuming only independent censoring, suppose that $T$ follows a Lin–Ying model with regression coefficients $\boldsymbol\beta$ conditionally on $\mathbf{Z}$ and that $C$ also follows some Lin–Ying model conditionally on $\mathbf{Z}$. If $\mathbf{Z} = \boldsymbol\Sigma^{1/2}\mathbf{X}$ where the components of $\mathbf{X}$ are i.i.d. with mean zero and unit variance, there exists a diagonal matrix $\boldsymbol\Lambda$ such that
$$\boldsymbol\rho = \boldsymbol\Sigma \boldsymbol\Lambda \boldsymbol\beta. \tag{14}$$
See Lemma 6 in the appendix. It holds that $\boldsymbol\Lambda$ has constant diagonal iff features are Gaussian; otherwise the diagonal is nonconstant and depends nontrivially on the regression coefficients of the censoring model. A curious implication is that, under Gaussian features, FAST screening has the SIS property for this ‘double’ Lin–Ying model irrespective of the (independent) censoring mechanism. Conversely, sufficient conditions for a SIS property to hold here under more general feature distributions would require the $j$th diagonal element of $\boldsymbol\Lambda$ to be ‘large’ whenever $|\beta_j|$ is ‘large’; hardly a very operational assumption. In other words, even in the simple Lin–Ying model, unrestricted censoring complicates analysis of FAST-SIS considerably.
3.3 Scaling the FAST statistic

The FAST statistic is easily generalized to incorporate scaling. Inspection of the results in the appendix immediately shows that multiplying the FAST statistic by some strictly positive, deterministic weight does not alter its asymptotic behavior. Under suitable assumptions, this also holds when weights are stochastic. In the notation of Section 2, the following two types of scaling are immediately relevant:
$$d_j^* := d_j / \hat{V}_{jj}^{1/2} \quad \text{(FAST*)}; \tag{15}$$
$$d_j^{LY} := d_j / D_{jj} \quad \text{(Lin–Ying-FAST)}. \tag{16}$$
The FAST* statistic corresponds to standardizing $d_j$ by its estimated standard deviation $\hat{V}_{jj}^{1/2}$; screening with this statistic is equivalent to the standard approach of ranking features according to univariate Wald $Z$-values. Various forms of asymptotic false-positive control can be implemented for FAST*, courtesy of the central limit theorem. Note that FAST* is model-independent in the sense that its interpretation (and asymptotic normality) does not depend on a specific model. In contrast, the Lin–Ying-FAST statistic is model-specific and corresponds to calculating the univariate regression coefficients in the Lin–Ying model, thus leading to an analogue of the idea of ‘ranking by absolute regression coefficients’ of Fan and Song (2010). We may even devise a scaling of $\mathbf{d}$ which mimics the ‘ranking by marginal likelihood ratio’ screening of Fan and Song (2010) by considering univariate versions of the natural loss function for the Lin–Ying model. The components of the resulting statistic are rather similar to (16), taking the form
$$\ell_j := d_j^2 / D_{jj}. \tag{17}$$
Additional flexibility can be gained by using a time-dependent scaling where some strictly positive (stochastic) weight is multiplied on the integrand in (1). This is beyond the scope of the present paper.
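The scalings discussed in this section are simple componentwise transformations of $\mathbf{d}$ once the diagonal of $\mathbf{D}$ and a variance estimate are available. A minimal sketch (names ours; the inputs are assumed computed as in Section 2):

```python
import numpy as np

def scaled_fast(d, Djj, Vjj):
    """Scalings of the FAST statistic d (Section 3.3).

    d:   FAST statistic, shape (p,)
    Djj: diagonal of the Lin-Ying matrix D
    Vjj: estimated variances of the components of d
    """
    d, Djj, Vjj = map(np.asarray, (d, Djj, Vjj))
    return {
        "fast_star": d / np.sqrt(Vjj),  # (15): univariate Wald-type Z-values
        "linying": d / Djj,             # (16): univariate Lin-Ying coefficients
        "loss": d ** 2 / Djj,           # (17)-style univariate loss decrease
    }
```

Since each scaling is strictly positive, all three produce the same screening set for a suitably transformed threshold; they differ in how ties and heteroscedasticity across features are resolved.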
4 Beyond simple independent screening – iterated FAST screening
The main assumption underlying any SIS method, including FAST-SIS, is that the design is close to orthogonal. This assumption is easily violated: a relevant feature may have a low marginal association with survival; an irrelevant feature may be indirectly associated with survival through associations with relevant features; and so on. To address such issues, Fan and Lv (2008) and Fan et al. (2009) proposed various heuristic iterative SIS (ISIS) methods which generally work as follows. First, SIS is used to recruit a small subset of features within which an even smaller subset of features is selected using a (multivariate) variable selection method such as penalized regression. Second, the (univariate) relevance of each feature not selected in the variable selection step is re-evaluated, adjusted for all the selected features. Third, a small subset of the most relevant of these new features is joined to the set of already selected features, and the variable selection step is repeated. The last two steps are iterated until the set of selected features stabilizes or some stopping criterion of choice is reached.
We advocate a similar strategy to extend the application domain of FAST-SIS. In view of Section 2.1, a variable selection step using a working Lin–Ying model is intuitively sensible. We may also provide some formal justification. Firstly, estimation in a Lin–Ying model corresponds to optimizing the loss function
$$L(\boldsymbol\beta) := \boldsymbol\beta^\top \mathbf{D} \boldsymbol\beta - 2\boldsymbol\beta^\top \mathbf{d}, \tag{18}$$
where $\mathbf{D}$ was defined in Section 2.1. As discussed by Martinussen and Scheike (2009), the loss function (18) is meaningful for general hazard rate models: it is the empirical version of the mean squared prediction error for predicting, with a working Lin–Ying model, the part of the intensity which is orthogonal to the at-risk indicator. In the present context, we are mainly interested in the model selection properties of a working Lin–Ying model. Suppose that, conditionally on $\mathbf{Z}$, $T$ follows a single-index model of the form (7) and that Assumptions 3–4 hold. Define $\boldsymbol\beta^* := \bar{\mathbf{D}}^{-1}\boldsymbol\rho$ with $\bar{\mathbf{D}}$ the in-probability limit of $\mathbf{D}$. Then $j \notin \mathcal{M}$ implies $\beta^*_j = 0$ (Hattori, 2006), so that a working Lin–Ying model will yield conservative model selection in a quite general setting. Under stronger assumptions, the following result, related to work by Brillinger (1983) and Li and Duan (1989), is available.
Theorem 4.
Assume that, conditionally on $\mathbf{Z}$, $T$ follows a single-index model of the form (7). Suppose moreover that Assumption 2 holds and that $C$ is independent of $(T, \mathbf{Z})$ (random censoring). If $\boldsymbol\beta^*$ defined by $\bar{\mathbf{D}}\boldsymbol\beta^* = \boldsymbol\rho$ is the vector of regression coefficients of the associated working Lin–Ying model and $\bar{\mathbf{D}}$ is nonsingular, then there exists a nonzero constant $c$, depending only on the distributions of $(T, C)$ and $\boldsymbol\beta^\top\mathbf{Z}$, such that $\boldsymbol\beta^* = c\boldsymbol\beta$.

Thus a working Lin–Ying model can consistently estimate regression coefficient signs under misspecification. From the efforts of Zhu et al. (2009) and Zhu and Zhu (2009) for other types of single-index models, it seems conceivable that variable selection methods designed for the Lin–Ying model will enjoy certain consistency properties within the model class (7). The conclusion of Theorem 4 continues to hold when $\bar{\mathbf{D}}$ is replaced by any matrix proportional to the feature covariance matrix $\boldsymbol\Sigma$. This is a consequence of Assumption 2 and underlines the considerable flexibility available when estimating $\boldsymbol\beta$ in single-index models.
Variable selection based on the Lin–Ying loss (18) can be accomplished by optimizing a penalized loss function of the form
$$L(\boldsymbol\beta) + \sum_{j=1}^p p_\lambda(|\beta_j|), \tag{19}$$
where $p_\lambda$ is some nonnegative penalty function, singular at the origin to facilitate model selection (Fan and Li, 2001) and depending on some tuning parameter $\lambda$ controlling the sparsity of the penalized estimator. A popular choice is the lasso penalty (Tibshirani, 1996) and its adaptive variant (Zou, 2006), corresponding to penalty functions $p_\lambda(x) = \lambda x$ and $p_\lambda(x) = \lambda x/|\tilde\beta_j|$ with $\tilde{\boldsymbol\beta}$ some root-$n$ consistent estimator of $\boldsymbol\beta$, respectively. These penalties were studied by Ma and Leng (2007) and Martinussen and Scheike (2009) for the Lin–Ying model. Empirically, we have had better success with the one-step SCAD (OS-SCAD) penalty of Zou and Li (2008) than with lasso penalties. Letting
$$p'_\lambda(\theta) := \lambda\left\{I(\theta \le \lambda) + \frac{(a\lambda - \theta)_+}{(a-1)\lambda}\, I(\theta > \lambda)\right\}, \qquad a = 3.7, \tag{20}$$
an OS-SCAD penalty function for the Lin–Ying model can be defined as follows:
$$p_\lambda(|\beta_j|) := \bar{D}\, p'_\lambda(|\tilde\beta_j|)\, |\beta_j|. \tag{21}$$
Here $\tilde{\boldsymbol\beta}$ is the unpenalized estimator and $\bar{D}$ is the average diagonal element of $\mathbf{D}$; this particular rescaling is just one way to lessen dependency of the penalization on the time scale. If $\mathbf{D}$ has approximately constant diagonal (which is often the case for standardized features), then rescaling by $\bar{D}$ leads to a similar penalty as for OS-SCAD in the linear regression model with standardized features. The choice $a = 3.7$ in (20) was recommended by Fan and Li (2001). OS-SCAD has not previously been explored for the Lin–Ying model, but its favorable performance in ISIS for other regression models is well known (Fan et al., 2009, 2010). OS-SCAD can be implemented efficiently using, for example, coordinate descent methods for fitting the lasso (Gorst-Rasmussen and Scheike, 2011; Friedman et al., 2007). For fixed $p$, the OS-SCAD penalty (21) has the oracle property if the Lin–Ying model holds true. A proof is beyond scope but follows by adapting Zou and Li (2008) along the lines of Martinussen and Scheike (2009).
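The SCAD derivative with $a = 3.7$ and the resulting one-step weights are straightforward to compute. A minimal sketch (function names ours), using the standard SCAD derivative of Fan and Li (2001):

```python
import numpy as np

def scad_deriv(theta, lam, a=3.7):
    """Derivative p'_lam of the SCAD penalty (Fan and Li, 2001).

    p'_lam(t) = lam                  for t <= lam
              = (a*lam - t)/(a - 1)  for lam < t < a*lam
              = 0                    for t >= a*lam
    """
    theta = np.abs(np.asarray(theta, dtype=float))
    return np.where(
        theta <= lam,
        lam,
        np.maximum(a * lam - theta, 0.0) / (a - 1.0),
    )

def one_step_scad_weights(beta_init, lam, a=3.7):
    """One-step SCAD (Zou and Li, 2008): weights for a weighted lasso fit,
    evaluated at an initial unpenalized estimate beta_init."""
    return scad_deriv(beta_init, lam, a)
```

Large initial coefficients thus receive zero penalty, while small ones are penalized like the lasso; the weights plug directly into any weighted-lasso solver.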
In the basic FAST-ISIS algorithm proposed below, the initial recruitment step corresponds to ranking the regression coefficients in the univariate Lin–Ying models. This is a convenient generic choice because it enables interpretation of the algorithm as standard ‘vanilla ISIS’ (Fan et al., 2009) for the Lin–Ying model.

Algorithm 1 (Lin–Ying-FAST-ISIS).
Set $k = 1$ and let $k_{\max}$ be some predefined maximal number of iterations of the algorithm.
(a) (Initial recruitment). Perform SIS by ranking $|d_j/D_{jj}|$, $j = 1, \ldots, p$, according to decreasing order of magnitude and retaining the $d_1$ most relevant features $\mathcal{A}_1$.
(b) For $k = 1, \ldots, k_{\max}$ do:
(b.i) (Feature selection). Apply penalized regression (19) in the working Lin–Ying model restricted to $\mathcal{A}_k$, yielding a set of selected features $\mathcal{S}_k \subseteq \mathcal{A}_k$.
(b.ii) If $\mathcal{S}_k = \mathcal{S}_{k-1}$ or a prescribed maximal model size has been reached, stop and return $\mathcal{S}_k$.
(b.iii) (Re-recruitment). Rank each feature $j \notin \mathcal{S}_k$ according to its relevance adjusted for the features in $\mathcal{S}_k$, and join the most relevant features to $\mathcal{S}_k$ to form $\mathcal{A}_{k+1}$.

Fan and Lv (2008) recommended choosing $d_1$ to be of order $n/\log n$. Following Fan et al. (2009), we may take $d = \lfloor n/\log n \rfloor$, $d_1 = \lfloor 2d/3 \rfloor$, and join $d - |\mathcal{S}_k|$ features at each re-recruitment step. This ensures that we complete at least one iteration of the algorithm; the choice of recruitment size for $k \geq 1$ ensures that at most $d$ features are included in the final solution.
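The control flow of the algorithm can be sketched generically, abstracting its three ingredients (marginal ranking, penalized selection, adjusted re-ranking) as user-supplied callables; the skeleton below is ours and not from the paper:

```python
import numpy as np

def vanilla_isis(marginal_score, select, adjusted_score, p, d1, dk, kmax=5):
    """Generic vanilla-ISIS skeleton mirroring the structure of Algorithm 1.

    marginal_score(j):           univariate relevance of feature j
    select(candidates):          penalized variable selection within a set
    adjusted_score(j, selected): relevance of j adjusted for `selected`
    """
    scores = np.array([marginal_score(j) for j in range(p)])
    active = set(np.argsort(-np.abs(scores))[:d1])   # (a) initial recruitment
    selected = set()
    for _ in range(kmax):
        new_selected = set(select(sorted(active)))   # (b.i) feature selection
        if new_selected == selected:                 # (b.ii) stabilized: stop
            break
        selected = new_selected
        rest = [j for j in range(p) if j not in selected]
        adj = np.array([adjusted_score(j, sorted(selected)) for j in rest])
        recruits = [rest[i] for i in np.argsort(-np.abs(adj))[:dk]]
        active = selected | set(recruits)            # (b.iii) re-recruitment
    return sorted(selected)
```

In practice `marginal_score` would be the univariate Lin–Ying coefficient, `select` a penalized Lin–Ying fit, and `adjusted_score` an adjusted statistic as in step (b.iii).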
Algorithm 1 defines an iterated variant of SIS with the Lin–Ying-FAST statistic (16). We can devise an analogous iterated variant of FAST-SIS in which the initial recruitment is performed by ranking based on the FAST* statistic (15), and the subsequent re-recruitments are performed by ranking Wald $Z$-statistics in the multivariate Lin–Ying model according to decreasing order of magnitude, using the variance estimator (6). A third option would be to base recruitment on (17) and re-recruitments on the decrease in the multivariate loss (18) when joining a given feature to the set of features picked out in the variable selection step.
The re-recruitment step (b.iii) in Algorithm 1 resembles that of Fan et al. (2009). Its naive implementation will be computationally burdensome when $p$ is large, requiring a low-dimensional matrix inversion per feature. Significant speedup over the naive implementation is possible via the partitioned-matrix identity
$$\begin{pmatrix} a & \mathbf{b}^\top \\ \mathbf{b} & \mathbf{C} \end{pmatrix}^{-1} = \begin{pmatrix} s^{-1} & -s^{-1}\mathbf{b}^\top\mathbf{C}^{-1} \\ -s^{-1}\mathbf{C}^{-1}\mathbf{b} & \mathbf{C}^{-1} + s^{-1}\mathbf{C}^{-1}\mathbf{b}\mathbf{b}^\top\mathbf{C}^{-1} \end{pmatrix}, \qquad s := a - \mathbf{b}^\top\mathbf{C}^{-1}\mathbf{b}. \tag{23}$$
Note that only the first row of the inverse is required for the re-recruitment step, so that (22) can be implemented using just a single low-dimensional matrix inversion alongside matrix/vector multiplications. Combining (23) with (6), a similarly efficient implementation applies for FAST-ISIS.
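The speedup rests on the Schur-complement form of the partitioned inverse: once the inverse of the currently selected block is computed, the required first row of each feature-augmented inverse follows from matrix/vector products alone. A minimal sketch (names ours):

```python
import numpy as np

def first_row_of_inverse(a, b, C_inv):
    """First row of inv([[a, b^T], [b, C]]) given C^{-1} precomputed.

    One low-dimensional inversion (of C) serves every candidate feature;
    each feature then costs only matrix/vector multiplications.
    """
    u = C_inv @ b
    s = a - b @ u          # Schur complement of C
    return np.concatenate(([1.0], -u)) / s
```

The same `C_inv` is reused across all candidate features in a re-recruitment sweep, replacing one dense inversion per feature with one per iteration.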
The variable selection step (b.i) of Algorithm 1 requires the choice of an appropriate tuning parameter. This is traditionally a difficult part of penalized regression, particularly when the aim is model selection, where methods such as cross-validation are prone to overfitting (Leng et al., 2007). Previous work on ISIS used the Bayesian information criterion (BIC) for tuning parameter selection (Fan et al., 2009). Although BIC is based on the likelihood, we may still define the following ‘pseudo-BIC’ (PBIC) based on the loss (18):
$$\mathrm{PBIC}(\lambda) := \kappa^{-1}\{L(\hat{\boldsymbol\beta}_\lambda) - L(\tilde{\boldsymbol\beta})\} + \widehat{\mathrm{df}}_\lambda \cdot \frac{\log n}{n}. \tag{24}$$
Here $\hat{\boldsymbol\beta}_\lambda$ is the penalized estimator, $\tilde{\boldsymbol\beta}$ is the unpenalized estimator, $\kappa$ is a scaling constant of choice, and $\widehat{\mathrm{df}}_\lambda$ estimates the degrees of freedom of the penalized estimator. A computationally convenient choice is the number of nonzero coefficients of $\hat{\boldsymbol\beta}_\lambda$ (Zou et al., 2007). It turns out that choosing $\lambda$ to minimize (24) may lead to model selection consistency. Specifically, the loss (18) for the Lin–Ying model is of the least-squares type. Then we can repeat the arguments of Wang and Leng (2007) and show that, under suitable consistency assumptions for the penalized estimator, there exists a sequence $\lambda_n$ yielding selection consistency and satisfying
$$P\Big\{\inf_{\lambda \in \Omega_+ \cup \Omega_-} \mathrm{PBIC}(\lambda) > \mathrm{PBIC}(\lambda_n)\Big\} \to 1, \tag{25}$$
with $\Omega_+ \cup \Omega_-$ the union of the sets of tuning parameters which lead to overfitted models (strict supermodels of the true model), respectively underfitted models (any model which does not include the true model). While (25) holds independently of the scaling constant $\kappa$, the finite-sample behavior of PBIC depends strongly on $\kappa$. A sensible value may be inferred heuristically as follows: the range of a ‘true’ likelihood BIC is asymptotically equivalent to a Wald statistic in the sense that (for fixed $p$),
$$2n\{\ell(\hat{\boldsymbol\beta}) - \ell(\boldsymbol\beta_0)\} \approx n(\hat{\boldsymbol\beta} - \boldsymbol\beta_0)^\top \mathcal{I}(\hat{\boldsymbol\beta} - \boldsymbol\beta_0), \tag{26}$$
with $\hat{\boldsymbol\beta}$ the maximum likelihood estimator and $\mathcal{I}$ the information matrix. We may specify $\kappa$ by requiring that $\kappa^{-1}\{L(\hat{\boldsymbol\beta}_\lambda) - L(\tilde{\boldsymbol\beta})\}$ admits an analogous interpretation as a Wald statistic. Since the loss difference is asymptotically a quadratic form in $\hat{\boldsymbol\beta}_\lambda - \tilde{\boldsymbol\beta}$, it follows from (6) that we should choose $\kappa$ as an average of the estimated variances of the components of $\mathbf{d}$. This choice of $\kappa$ also removes the dependency of PBIC on the time scale.
5 Simulation studies
In this section, we investigate the performance of FAST screening on simulated data. Rather than comparing with popular variable selection methods such as the lasso, we will compare with analogous screening methods based on the Cox model (Fan et al., 2010). This seems a more pertinent benchmark since previous work has already demonstrated that (iterated) SIS can outperform variable selection based on penalized regression in a number of cases (Fan and Lv, 2008; Fan et al., 2009).

For all the simulations, survival times were generated from three different conditionally exponential models of the generic form (7); that is, a time-independent hazard ‘link function’ applied to a linear functional of features. For suitable constants, the link functions were as follows (see also Figure 1):

The link functions represent different characteristic effects of the feature functional, ranging from uniformly bounded (logit) over fast decay/increase (Cox), to fast decay/slow increase (log). We fixed the constants at suitable values and, unless otherwise stated, survival times were right-censored by independent exponential random variables with rate parameters 0.12 (logit link), 0.3 (Cox link) and 0.17 (log link). These rate parameters were selected to provide a crude ‘calibration’ to make the simulation models more comparable: for a univariate standard Gaussian feature, a fixed regression coefficient, and a fixed sample size, the expected $Z$-statistic was 8 for all three link functions, with an expected censoring rate of 25%, as evaluated by numerical integration based on the true likelihood. Methods for FAST screening have been implemented in the R package ‘ahaz’ (Gorst-Rasmussen, 2011).
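A conditionally exponential single-index design of this kind is easy to simulate; the sketch below (names and numeric constants ours, not the paper's calibrated values) generates right-censored data for an arbitrary link:

```python
import numpy as np

def simulate_single_index(n, beta, link, cens_rate, rng):
    """Simulate right-censored data from a conditionally exponential
    single-index model: hazard = link(beta' Z), censoring ~ Exp(cens_rate).

    Returns observed times, event indicators, and the feature matrix.
    """
    p = beta.size
    Z = rng.standard_normal((n, p))               # standardized features
    hazard = link(Z @ beta)                       # constant-in-time hazard
    T = rng.exponential(1.0 / hazard)             # survival times
    C = rng.exponential(1.0 / cens_rate, size=n)  # independent censoring
    time = np.minimum(T, C)
    status = (T <= C).astype(int)
    return time, status, Z

# Example with a Cox-type link, link(x) = 0.1 * exp(x) (illustrative only).
rng = np.random.default_rng(0)
time, status, Z = simulate_single_index(
    500, np.array([1.0, 0.5, 0.0]), lambda x: 0.1 * np.exp(x), 0.03, rng)
```

Swapping the `link` argument for a bounded (logit-type) or logarithmic function reproduces the qualitative behavior of the three simulation scenarios.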
5.1 Performance of FAST-SIS
We first considered the performance of basic, non-iterated FAST-SIS. Features were generated as in scenario 1 of Fan and Song (2010). Specifically, let $\varepsilon$ be standard Gaussian. Define
$$Z_j = \frac{\varepsilon_j + a_j \varepsilon}{\sqrt{1 + a_j^2}}, \tag{27}$$
where the $\varepsilon_j$ are independently distributed as standard Gaussians for the first group of indices; independently distributed according to a double exponential distribution with location parameter zero and scale parameter 1 for the second group; and independently distributed according to a Gaussian mixture for the remaining indices. The constants $a_j$ enable crude adjustment of the correlation structure of the feature distribution: for equal $a_j = a$ within a group, features in that group have common pairwise correlation $\rho = a^2/(1 + a^2)$. Regression coefficients were chosen to be of a generic form with exactly the first $s$ components nonzero.

For each combination of hazard link function, non-sparsity level $s$, and correlation $\rho$, we performed 100 simulations with $p$ features and $n$ observations. Features were ranked using the vanilla FAST statistic, the scaled FAST statistics (15) and (16), and SIS based on a Cox working model (Cox-SIS), the latter ranking features according to their absolute univariate regression coefficients. Results are shown in Table 1. As a performance measure, we report the median of the minimum model size (MMMS) needed to detect all relevant features alongside its relative standard deviation (RSD), the interquartile range divided by 1.34. The minimum model size is a useful performance measure for this type of study since it eliminates the need to select a threshold parameter for SIS. The censoring rate in the simulations was typically 30%–40%.
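The minimum model size and the IQR-based relative standard deviation are simple to compute from a vector of screening statistics; a minimal sketch (names ours):

```python
import numpy as np

def minimum_model_size(scores, true_set):
    """How many top-ranked features (by |score|) must be kept before
    every truly relevant feature is included."""
    order = np.argsort(-np.abs(np.asarray(scores)))  # best feature first
    ranks = np.empty(order.size, dtype=int)
    ranks[order] = np.arange(1, order.size + 1)      # rank of each feature
    return int(ranks[list(true_set)].max())

def rsd(values):
    """Relative standard deviation: interquartile range divided by 1.34."""
    q75, q25 = np.percentile(values, [75, 25])
    return (q75 - q25) / 1.34
```

Across simulation replicates, one collects `minimum_model_size` per run and reports its median (MMMS) together with `rsd` of the collected sizes.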
For all methods, the MMMS is seen to increase with the feature correlation ρ and the nonsparsity s. As also noted by Fan and Song (2010) for the case of SIS in generalized linear models, some correlation among features can actually be helpful since it increases the strength of marginal signals. Overall, the vanilla FAST statistic seems to perform slightly worse than the two scaled statistics, whereas the latter perform similarly to Cox-SIS. In our basic implementation, screening with any of the FAST statistics was more than 100 times faster than Cox-SIS, providing a rough indication of the relative computational efficiency of FAST-SIS.
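The minimum model size underlying these comparisons can be computed directly from a marginal ranking; a small sketch follows (the function name is hypothetical), where the minimum model size is simply the worst rank attained by any relevant feature.

```python
import numpy as np

def minimum_model_size(scores, relevant):
    """Smallest k such that the k top-ranked features (by absolute
    marginal score) contain all relevant ones; equivalently, the
    largest rank among the relevant features."""
    order = np.argsort(-np.abs(np.asarray(scores, dtype=float)))
    ranks = np.empty(len(order), dtype=int)
    ranks[order] = np.arange(1, len(order) + 1)   # rank 1 = strongest signal
    return int(ranks[list(relevant)].max())
```

The MMMS in Table 1 would then be the median of this quantity across simulation replicates.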
To gauge the relative difficulty of the different simulation scenarios, Figure 2 shows box plots of the minimum of the observed statistics in the oracle model (the joint model including only the relevant features, with estimation based on the likelihood under the true link function) for one of the three link functions. This link function represents an 'intermediate' level of difficulty, the corresponding statistics for the two remaining link functions being generally somewhat larger and slightly smaller, respectively. Even with oracle information and the correct working model, these are evidently difficult data to deal with.
5.2 FAST-SIS with non-Gaussian features and non-random censoring
We next investigated FAST-SIS with non-Gaussian features and a more complex censoring mechanism. The simulation scenario was inspired by that of the previous section, but with all features generated according to either a standard Gaussian distribution, a t-distribution with 4 degrees of freedom, or a unit-rate exponential distribution. Features were standardized to have mean zero and variance one, and the feature correlation structure was such that one block of features shared a common pairwise correlation while all remaining correlations were zero. Survival times were generated according to one of the link functions with regression coefficients β, while censoring times were generated according to the same model (same link function, conditionally on the same feature realizations) with regression coefficients proportional to β. The proportionality constant controls the association between censoring and survival times, leading to a basic example of non-random censoring (competing risks).

Using p features and n observations, we performed 100 simulations under each of the three feature distributions and for a range of values of the proportionality constant. Table 2 reports the MMMS and RSD for the four screening methods of the previous section, as well as for the statistic in (17). The censoring rate in all scenarios was around 50%.
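This censoring design can be sketched as follows; the sketch assumes a Cox-type link with conditional hazard exp(Xβ) purely for illustration (the section considers several links), and the function and parameter names are our own. Setting the association constant to zero recovers feature-independent (random) censoring.

```python
import numpy as np

def simulate_survival(X, beta, kappa, base_rate=1.0, rng=None):
    """Sketch of the nonrandom-censoring design: survival times T with
    conditional hazard exp(X @ beta) (an assumed Cox-type link) and
    censoring times C generated from the same model, conditionally on
    the same features, with coefficients kappa * beta.  kappa = 0 gives
    random censoring; kappa != 0 ties censoring to survival through the
    shared features (a basic competing-risks mechanism)."""
    rng = np.random.default_rng(rng)
    lin = X @ beta
    T = rng.exponential(1.0 / np.exp(lin))                       # survival
    C = rng.exponential(1.0 / (base_rate * np.exp(kappa * lin)))  # censoring
    time = np.minimum(T, C)
    event = (T <= C).astype(int)    # 1 = event observed, 0 = censored
    return time, event
```

Screening statistics would then be computed from the observed pairs (time, event) only.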
From the column corresponding to random censoring, the heavier tails of the t-distribution are seen to increase the MMMS, particularly for the statistic (17). The vanilla FAST statistic seems the least affected here, most likely because it does not directly involve second-order statistics, which are poorly estimated under heavier tails. While the statistics (15) and (16) are also scaled by second-order statistics, the impact of the tails is dampened by the square-root transformation in the scaling factors. In contrast, the more distinctly non-Gaussian exponential distribution is problematic for the vanilla statistic. Overall, the scaled statistics (15) and (16) seem to have the best and most consistent performance across feature distributions. Non-random censoring generally increases the MMMS and RSD, particularly for the non-Gaussian distributions, with no clear difference between the effects of positive and negative values of the censoring association constant. We found that the effect of non-random censoring diminished when the sample size was increased (results not shown), suggesting that non-random censoring in the present example leads to a power issue rather than a bias issue. This may not be surprising in view of the considerations below (14). Nonetheless, the example shows the dramatic impact that non-random censoring can have on the performance of SIS.
5.3 Performance of FAST-ISIS
We lastly evaluated the ability of FAST-ISIS (Algorithm 1) to cope with scenarios in which FAST-SIS fails. As in the previous sections, we compared our results with the analogous ISIS screening method for the Cox model. To perform Cox-ISIS, we used the R package 'SIS', with (re)recruitment based on the absolute Cox regression coefficients and variable selection based on OS-SCAD. We also compared with the FAST-ISIS variant described below Algorithm 1, in which (re)recruitment is based on the Lin-Ying model statistics (results for FAST-ISIS with (re)recruitment based on the loss function were very similar).
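A generic iterated-screening loop of this kind can be sketched as below. This is not Algorithm 1 itself, whose (re)recruitment and selection rules are specified in the text; `score_fn` and `select_fn` are illustrative stand-ins for the conditional marginal (re)recruitment statistic and the sparse joint selection step, respectively.

```python
import numpy as np

def isis(score_fn, select_fn, p, n_keep, max_iter=5):
    """Generic ISIS skeleton.  score_fn(candidates, current) returns a
    marginal score for each candidate feature index, conditional on the
    currently selected set; select_fn(recruited) returns the subset
    retained by a sparse joint fit.  Iterate until the selected set
    stabilizes or max_iter is reached."""
    selected = []
    for _ in range(max_iter):
        candidates = [j for j in range(p) if j not in selected]
        if not candidates:
            break
        scores = np.asarray(score_fn(candidates, selected), dtype=float)
        order = np.argsort(-np.abs(scores))              # strongest first
        recruited = [candidates[i] for i in order[:n_keep]]
        new_selected = sorted(select_fn(sorted(set(selected) | set(recruited))))
        if new_selected == sorted(selected):
            break                                        # fixed point reached
        selected = new_selected
    return sorted(selected)
```

The iteration lets features that are marginally weak, but jointly strong, enter once the features masking them have been selected.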
For the simulations, we adopted the structural form of the feature distributions used by Fan et al. (2010). We considered n observations and p features which were jointly Gaussian and marginally standard Gaussian. Only the regression coefficients and feature correlations differed between the cases, as follows:

(a) The regression coefficients β_1, …, β_6 are nonzero and β_j = 0 for j > 6. Features are independent: corr(X_j, X_k) = 0 for j ≠ k.

(b) The regression coefficients are the same as in (a), while the features now share a common nonzero pairwise correlation.

(c) Only the leading regression coefficients are nonzero, taking two distinct values. The correlation between features is given by one common value for one set of index pairs and by another common value for the remaining pairs.

(d) The regression coefficients β_1, β_2 and β_3 are nonzero. The correlation between features is constant for
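Jointly Gaussian, marginally standard Gaussian features with a common pairwise correlation, as used in the correlated cases above, can be generated by a shared-factor construction; a minimal sketch, where `rho` denotes the desired correlation:

```python
import numpy as np

def equicorrelated_gaussian(n, p, rho, rng=None):
    """Features X_j = sqrt(rho) * Z + sqrt(1 - rho) * e_j with Z and e_j
    independent standard Gaussians, so each X_j is marginally N(0, 1)
    and corr(X_j, X_k) = rho for j != k (requires 0 <= rho < 1)."""
    assert 0.0 <= rho < 1.0
    rng = np.random.default_rng(rng)
    shared = rng.standard_normal((n, 1))        # common factor Z
    idio = rng.standard_normal((n, p))          # idiosyncratic errors e_j
    return np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * idio
```

For non-constant correlation structures, a Cholesky factor of the target covariance matrix can be used instead of the shared factor.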