# Independent screening for single-index hazard rate models with ultra-high dimensional features

In data sets with many more features than observations, independent screening based on all univariate regression models leads to a computationally convenient variable selection method. Recent efforts have shown that in the case of generalized linear models, independent screening may suffice to capture all relevant features with high probability, even in ultra-high dimension. It is unclear whether this formal sure screening property is attainable when the response is a right-censored survival time. We propose a computationally very efficient independent screening method for survival data which can be viewed as the natural survival equivalent of correlation screening. We state conditions under which the method admits the sure screening property within a general class of single-index hazard rate models with ultra-high dimensional features. An iterative variant is also described which combines screening with penalized regression in order to handle more complex feature covariance structures. The methods are evaluated through simulation studies and through application to a real gene expression dataset.


## 1 Introduction

With the increasing proliferation of biomarker studies, there is a need for efficient methods for relating a survival time response to a large number of features. In typical genetic microarray studies, the sample size is measured in hundreds whereas the number of features per sample can be in excess of millions. Sparse regression techniques such as the lasso (Tibshirani, 1997) and SCAD (Fan and Li, 2001) have proved useful for dealing with such high-dimensional features, but their usefulness diminishes when the number of features $p$ becomes extremely large compared to the sample size $n$. The notion of NP-dimensionality (Fan and Lv, 2009) has been conceived to describe such ultra-high dimensional settings, which are formally analyzed in an asymptotic regime where $p$ grows at a non-polynomial rate with $n$. Despite recent progress (Bradic et al., 2011), theoretical knowledge about sparse regression techniques under NP-dimensionality is still in its infancy. Moreover, NP-dimensionality poses substantial computational challenges. When, for example, pairwise interactions among gene expressions in a genetic microarray study are of interest, the dimension of the feature space will trouble even the most efficient algorithms for fitting sparse regression models. A popular ad hoc solution is to simply pretend that feature correlations are ignorable and resort to computationally swift univariate regression methods: so-called independent screening methods.

In an important paper, Fan and Lv (2008) laid the formal foundation for using independent screening to distinguish 'relevant' features from 'irrelevant' ones. For the linear regression model they showed that, when the design is close to orthogonal, a superset of the true set of nonzero regression coefficients can be estimated consistently by simple hard-thresholding of feature-response correlations. This sure independence screening (SIS) property of correlation screening is a rather trivial one, if not for the fact that it holds true in the asymptotic regime of NP-dimensionality. Thus, when the feature covariance structure is sufficiently simple, SIS methods can overcome the noise accumulation in extremely high dimension. In order to accommodate more complex feature covariance structures, Fan and Lv (2008) and Fan et al. (2009) developed heuristic, iterated methods combining independent screening with forward selection techniques. Recently, Fan and Song (2010) extended the formal basis for SIS to generalized linear models.

In biomedical applications, the response of interest is often a right-censored survival time, making the study of screening methods for survival data an important one. Fan et al. (2010) investigated SIS methods for the Cox proportional hazards model based on ranking features according to the univariate partial log-likelihood but gave no formal justification. Tibshirani (2009) suggested soft-thresholding of univariate Cox score statistics with some theoretical justification but under strong assumptions. Indeed, independent screening methods for survival data are apt to be difficult to justify theoretically due to the presence of censoring which can confound marginal associations between the response and the features. Recent work by Zhao and Li (2010) contains ideas which indicate that independent screening based on the Cox model may have the SIS property in the absence of censoring.

In the present paper, we depart from the standard approach of studying SIS as a rather specific type of model misspecification in which the univariate versions of a particular regression model are used to infer the structure of the joint version of the same particular regression model. Instead, we propose a survival variant of independent screening based on a model-free statistic which we call the ‘Feature Aberration at Survival Times’ (FAST) statistic. The FAST statistic is a simple linear statistic which aggregates across survival times the aberration of each feature relative to its time-varying average. Independent screening based on this statistic can be regarded as a natural survival equivalent of correlation screening. We study the SIS property of FAST screening in ultra-high dimension for a general class of single-index hazard rate regression models in which the risk of an event depends on the features through some linear functional. A key aim has been to derive simple and operational sufficient conditions for the SIS property to hold. Accordingly, our main result states that the FAST statistic has the SIS property in an ultra-high dimensional setting under covariance assumptions as in Fan et al. (2009), provided that censoring is essentially random and that features satisfy a technical condition which holds when they follow an elliptically contoured distribution. Utilizing the fact that the FAST statistic is related to the univariate regression coefficients in the semiparametric additive hazards model (Lin and Ying (1994); McKeague and Sasieni (1994)), we develop methods for iterated SIS. The techniques are evaluated in a simulation study where we also compare with screening methods for the Cox model (Fan et al., 2010). Finally, an application to a real genetic microarray data set is presented.

## 2 The FAST statistic and its motivation

Let $T$ be a survival time which is subject to right-censoring by some random variable $C$. Denote by $N(t)$ the counting process which counts events up to time $t$, let $Y(t)$ be the at-risk process, and let $Z$ denote a random vector of explanatory variables or features. It is assumed throughout that $Z$ has finite variance and is standardized, i.e. centered and with a covariance matrix $\Sigma$ with unit diagonal. We observe independent and identically distributed (i.i.d.) replicates $(N_i, Y_i, Z_i)$ for $i = 1, \dots, n$, where $[0, \tau]$ is the observation time window.

Define the ‘Feature Aberration at Survival Times’ (FAST) statistic as follows:

$$d := n^{-1}\sum_{i=1}^{n}\int_0^{\tau}\{Z_i-\bar{Z}(t)\}\,\mathrm{d}N_i(t); \qquad (1)$$

where $\bar{Z}(t)$ is the at-risk average of the $Z_i$'s,

$$\bar{Z}(t) := \frac{\sum_{i=1}^{n} Z_i Y_i(t)}{\sum_{i=1}^{n} Y_i(t)}.$$

Components of the FAST statistic define basic measures of the marginal association between each feature and survival. In the following, we provide two motivations for using the FAST statistic for screening purposes. The first, being model-based, is perhaps the most intuitive – the second shows that, even in a model-free setting, the FAST statistic may provide valuable information about marginal associations.
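To make the definition concrete, the statistic in (1) can be computed directly from right-censored data: $\mathrm{d}N_i(t)$ jumps only at observed event times, so the integral reduces to a sum over events of the event subject's feature values minus the at-risk average. The following numpy sketch illustrates this (the function name and data layout are ours, not from the paper):

```python
import numpy as np

def fast_statistic(time, event, Z, tau=None):
    """FAST statistic d of display (1).

    time  : (n,) observed times (minimum of survival and censoring time)
    event : (n,) 1 if the observed time is an event, 0 if censored
    Z     : (n, p) standardized feature matrix
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=int)
    Z = np.asarray(Z, dtype=float)
    n, p = Z.shape
    if tau is None:
        tau = time.max()
    d = np.zeros(p)
    # dN_i(t) jumps only at observed event times in [0, tau]
    for i in np.flatnonzero((event == 1) & (time <= tau)):
        at_risk = time >= time[i]          # Y_j(t) = 1(observed time >= t)
        zbar = Z[at_risk].mean(axis=0)     # at-risk average Z-bar(t)
        d += Z[i] - zbar
    return d / n
```

With standardized features, a large $|d_j|$ flags feature $j$ as aberrant, relative to the at-risk average, at the observed survival times.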

### 2.1 A model-based interpretation of the FAST statistic

Assume in this section that the survival times have hazard functions of the form

$$\lambda_j(t) = \lambda_0(t) + Z_j^\top\alpha_0, \quad j=1,2,\dots,n; \qquad (2)$$

with $\lambda_0$ an unspecified baseline hazard rate and $\alpha_0$ a vector of regression coefficients. This is the so-called semiparametric additive hazards model (Lin and Ying (1994); McKeague and Sasieni (1994)), henceforth simply the Lin-Ying model. The Lin-Ying model corresponds to assuming for each $j$ an intensity function of the form $Y_j(t)\{\lambda_0(t) + Z_j^\top\alpha_0\}$. From the Doob-Meyer decomposition $\mathrm{d}N_i(t) = Y_i(t)\{\lambda_0(t) + Z_i^\top\alpha_0\}\,\mathrm{d}t + \mathrm{d}M_i(t)$ with $M_i$ a martingale, it is easily verified that

$$\sum_{i=1}^{n}\{Z_i-\bar{Z}(t)\}\,\mathrm{d}N_i(t) = \Big[\sum_{i=1}^{n}\{Z_i-\bar{Z}(t)\}^{\otimes 2}Y_i(t)\,\mathrm{d}t\Big]\alpha_0 + \sum_{i=1}^{n}\{Z_i-\bar{Z}(t)\}\,\mathrm{d}M_i(t), \quad t\in[0,\tau]. \qquad (3)$$

This suggests that $\alpha_0$ is estimable as the solution to the linear system of equations

$$d = D\alpha; \qquad (4)$$

where

$$d := n^{-1}\sum_{i=1}^{n}\int_0^{\tau}\{Z_i-\bar{Z}(t)\}\,\mathrm{d}N_i(t), \quad\text{and}\quad D := n^{-1}\sum_{i=1}^{n}\int_0^{\tau} Y_i(t)\{Z_i-\bar{Z}(t)\}^{\otimes 2}\,\mathrm{d}t. \qquad (5)$$

Suppose $\hat{\alpha}$ solves (4). Standard martingale arguments (Lin and Ying, 1994) imply root-$n$ consistency of $\hat{\alpha}$,

$$\sqrt{n}(\hat{\alpha}-\alpha_0) \xrightarrow{d} N(0, D^{-1}BD^{-1}), \quad\text{where } B = n^{-1}\sum_{i=1}^{n}\int_0^{\tau}\{Z_i-\bar{Z}(t)\}^{\otimes 2}\,\mathrm{d}N_i(t). \qquad (6)$$

For now, simply observe that the left-hand side of (4) is exactly the FAST statistic, whereas the ratios $d_j/D_{jj}$, $j=1,\dots,p$, estimate the regression coefficients in the corresponding univariate Lin-Ying models. Hence we can interpret $d$ as a (scaled) estimator of the univariate regression coefficients in a working Lin-Ying model.

A nice heuristic interpretation of $d$ results from the pointwise signal/error decomposition (3), which is essentially a reformulated linear regression model with 'responses' $\mathrm{d}N_i(t)$ and 'explanatory variables' $Y_i(t)\,\mathrm{d}t$. The FAST statistic is given by the time average of this decomposition and may accordingly be viewed as a survival equivalent of the usual predictor-response correlations.
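To illustrate the working-model view, the unpenalized Lin-Ying estimator solves the linear system (4) with $d$ and $D$ as in (5). Since the at-risk set only changes at observed exit times, the time integral in $D$ reduces to a finite sum over grid intervals. A minimal numpy sketch (function name and data layout are ours; it assumes $D$ is nonsingular):

```python
import numpy as np

def lin_ying_fit(time, event, Z, tau=None):
    """Unpenalized Lin-Ying estimator: solve d = D a, cf. (4)-(5)."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=int)
    Z = np.asarray(Z, dtype=float)
    n, p = Z.shape
    if tau is None:
        tau = time.max()
    # d: sum over events of feature minus at-risk average, cf. (1)
    d = np.zeros(p)
    for i in np.flatnonzero((event == 1) & (time <= tau)):
        at_risk = time >= time[i]
        d += Z[i] - Z[at_risk].mean(axis=0)
    d /= n
    # D: the integrand is piecewise constant between consecutive exit times
    grid = np.unique(np.concatenate(([0.0], time[time <= tau], [tau])))
    D = np.zeros((p, p))
    for t0, t1 in zip(grid[:-1], grid[1:]):
        at_risk = time >= t1         # at-risk set is constant on (t0, t1)
        if not at_risk.any():
            break
        R = Z[at_risk] - Z[at_risk].mean(axis=0)
        D += (t1 - t0) * R.T @ R
    D /= n
    return np.linalg.solve(D, d), d, D
```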

### 2.2 A model-free interpretation of the FAST statistic

For a feature to be judged (marginally) associated with survival in any reasonable interpretation of survival data, one would first require that the feature be correlated with the probability of experiencing an event and, second, that this correlation persist throughout the time window. The FAST statistic can be shown to reflect these two requirements when the censoring mechanism is sufficiently simple.

Specifically, assume administrative censoring at time $\tau$ (so that $C_i = \tau$). Set $F(t\mid Z_1) := P(T_1 \le t \mid Z_1)$, the conditional probability of death before time $t$. For each $j$, denote by $\delta_j$ the population version of $d_j$ (the in-probability limit of $d_j$ when $n\to\infty$). Then

$$\begin{aligned}
\delta_j &= E\Big(\Big[Z_{1j} - \frac{E\{Z_{1j}Y_1(t)\}}{E\{Y_1(t)\}}\Big]\,1(T_1 \le t\wedge\tau)\Big)\\
&= E\{Z_{1j}F(\tau\mid Z_1)\} - \int_0^{\tau}\frac{E\{Z_{1j}Y_1(t)\}}{E\{Y_1(t)\}}\,E\{\mathrm{d}F(t\mid Z_1)\}\\
&= V(\tau)\,\mathrm{Cor}\{Z_{1j}, F(\tau\mid Z_1)\} + \int_0^{\tau}\mathrm{Cor}\{Z_{1j}, F(t\mid Z_1)\}\,\frac{V(t)}{E\{Y_1(t)\}}\,E\{\mathrm{d}F(t\mid Z_1)\}.
\end{aligned}$$

We can make the following observations:

1. If $\mathrm{Cor}\{Z_{1j}, F(t\mid Z_1)\}$ has constant sign throughout $[0,\tau]$, then $|\delta_j| > 0$.

2. Conversely, if $\mathrm{Cor}\{Z_{1j}, F(t\mid Z_1)\}$ changes sign, so that the direction of association with survival is not persistent throughout $[0,\tau]$, then this will lead to a smaller value of $|\delta_j|$ compared to (i).

3. Lastly, if $\mathrm{Cor}\{Z_{1j}, F(t\mid Z_1)\} = 0$ for all $t\in[0,\tau]$, then $\delta_j = 0$.

In other words, the sample version $d_j$ estimates a time-averaged summary of the correlation function $t \mapsto \mathrm{Cor}\{Z_{1j}, F(t\mid Z_1)\}$ which takes into account both magnitude and persistent behavior throughout $[0,\tau]$. This indicates that the FAST statistic is relevant for judging marginal association of features with survival beyond the model-specific setting of Section 2.1.

## 3 Independent screening with the FAST statistic

In this section, we extend the heuristic arguments of the previous section and provide theoretical justification for using the FAST statistic to screen for relevant features when the data-generating model belongs to a class of single-index hazard rate regression models.

### 3.1 The general case of single-index hazard rate models

In the notation of Section 2, we assume survival times to have hazard rate functions of single-index form:

$$\lambda_j(t) = \lambda(t, Z_j^\top\alpha_0), \quad j=1,2,\dots,n. \qquad (7)$$

Here $\lambda$ is a continuous function, $Z_j$ are random vectors in $\mathbb{R}^{p_n}$, $\alpha_0$ is a vector of regression coefficients, and $Z_j^\top\alpha_0$ defines a risk score. We subscript $p$ by $n$ to indicate that the dimension of the feature space can grow with the sample size. Censoring will always be assumed at least independent so that $T_j$ is independent of $C_j$ conditionally on $Z_j$. We impose the following assumption on the hazard 'link function' $\lambda$:

###### Assumption 1.
The survival function implied by $\lambda$ is strictly monotonic in the index $Z_1^\top\alpha_0$ for each $t$.

Requiring the survival function to depend monotonically on the risk score is natural in order to enable interpretation of the components of $\alpha_0$ as indicative of positive or negative association with survival. Note that it suffices that $x \mapsto \lambda(t,x)$ is strictly monotonic for each $t$. Assumption 1 holds for a range of popular survival regression models. For example, $\lambda(t,x) = \lambda_0(t) + x$ with some baseline hazard $\lambda_0$ yields the Lin-Ying model (2); $\lambda(t,x) = \lambda_0(t)e^{x}$ is a Cox model; and $\lambda(t,x) = \lambda_0(te^{x})e^{x}$ is an accelerated failure time model.

Denote by $\delta$ the population version of the FAST statistic under the model (7) which, by the Doob-Meyer decomposition $\mathrm{d}N_1(t) = Y_1(t)\lambda(t, Z_1^\top\alpha_0)\,\mathrm{d}t + \mathrm{d}M_1(t)$ with $M_1$ a martingale, takes the form

$$\delta = E\Big[\int_0^{\tau}\{Z_1 - e(t)\}Y_1(t)\lambda(t, Z_1^\top\alpha_0)\,\mathrm{d}t\Big]; \quad\text{where } e(t) := \frac{E\{Z_1 Y_1(t)\}}{E\{Y_1(t)\}}. \qquad (8)$$

Our proposed FAST screening procedure is as follows: given some (data-dependent) threshold $\gamma_n > 0$,

1. calculate the FAST statistic $d$ from the available data; and

2. declare the 'relevant features' to be the set $\{1 \le j \le p_n : |d_j| > \gamma_n\}$.

By the arguments in Section 2, this procedure defines a natural survival equivalent of correlation screening. Define the following sets of features:

$$\hat{\mathcal{M}}_n^{d} := \{1 \le j \le p_n : |d_j| > \gamma_n\}, \qquad \mathcal{M}_n := \{1 \le j \le p_n : \alpha_{0j} \ne 0\}, \qquad \mathcal{M}_n^{\delta} := \{1 \le j \le p_n : \delta_j \ne 0\}.$$

The problem of establishing the SIS property of FAST screening amounts to determining when $\mathcal{M}_n \subseteq \hat{\mathcal{M}}_n^{d}$ holds with large probability for large $n$. This translates into two questions: first, when does $\hat{\mathcal{M}}_n^{d}$ consistently recover $\mathcal{M}_n^{\delta}$; second, when do we have $\mathcal{M}_n \subseteq \mathcal{M}_n^{\delta}$? The first question is essentially model-independent and requires establishing an exponential bound for the deviation of $d$ from $\delta$ as $n\to\infty$. The second question is strongly model-dependent and is answered by manipulating expectations under the single-index model (7).
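Operationally, the screening step is plain hard-thresholding of $|d_j|$. A small sketch (names are ours), with a helper that picks the threshold so a prescribed number of features is retained, since the paper leaves the choice of $\gamma_n$ open:

```python
import numpy as np

def fast_screen(d, gamma):
    """Estimated relevant features: the index set {j : |d_j| > gamma}."""
    return np.flatnonzero(np.abs(np.asarray(d, dtype=float)) > gamma)

def threshold_for_size(d, k):
    """Data-driven gamma retaining the k largest |d_j| (a common
    operational choice; not prescribed by the theory)."""
    a = np.sort(np.abs(np.asarray(d, dtype=float)))[::-1]
    return a[k] if k < len(a) else 0.0
```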

We state the main results here and relegate proofs to the appendix where we also state various regularity conditions. The following principal assumptions, however, deserve separate attention:

###### Assumption 2.
The conditional expectation $E(Z_1 \mid Z_1^\top\alpha_0)$ is a linear function of $Z_1^\top\alpha_0$.
###### Assumption 3.
The censoring time $C_1$ depends on $Z_1$ only through the features $\{Z_{1j} : j \notin \mathcal{M}_n\}$.
###### Assumption 4.
$\{Z_{1j} : j \notin \mathcal{M}_n\}$ is independent of $\{Z_{1j} : j \in \mathcal{M}_n\}$.

Assumption 2 is a 'linear regression' property which holds true for Gaussian features and, more generally, for features following an elliptically contoured distribution (Hardin, 1982). In view of Hall and Li (1993), which states that most low-dimensional projections of high-dimensional features are close to linear, Assumption 2 may not be unreasonable a priori even for general feature distributions when $p_n$ is large.

Assumption 3 restricts the censoring mechanism to be partially random in the sense of depending only on irrelevant features. As we will discuss in detail below, such rather strong restrictions on the censoring distribution seem necessary for obtaining the SIS property; Assumption 3 is both general and convenient.

Assumption 4 is the partial orthogonality condition also used by Fan and Song (2010). Under this assumption and Assumption 3, it follows from (8) that $\delta_j = 0$ whenever $j \notin \mathcal{M}_n$, implying $\mathcal{M}_n^{\delta} \subseteq \mathcal{M}_n$. Provided that we also have $\delta_j \ne 0$ for $j \in \mathcal{M}_n$ (that is, $\mathcal{M}_n \subseteq \mathcal{M}_n^{\delta}$), there exists a threshold $\zeta_n > 0$ such that

$$\min_{j\in\mathcal{M}_n}|\delta_j| \ge \zeta_n > \max_{j\notin\mathcal{M}_n}|\delta_j| = 0.$$

Consequently, Assumptions 3-4 are needed to enable consistent model selection via independent screening. Although model selection consistency is not essential in order to capture just some superset of the relevant features via independent screening, it is pertinent in order to limit the size of such a superset.

The following theorem on FAST screening (FAST-SIS) is our main theoretical result. It states that the screening property may hold with large probability even when $p_n$ grows exponentially fast in a certain power of $n$ which depends on the tail behavior of features. The covariance condition in the theorem is analogous to that of Fan and Song (2010) for SIS in generalized linear models with Gaussian features.

###### Theorem 1.
Suppose that Assumptions 1-3 hold alongside the regularity conditions of the appendix and that for some positive constants and sufficiently large . Suppose moreover that for some and ,
(9)
Then . Suppose in addition that for some constant and that . Then the SIS property holds, when .

Observe that with bounded features, we may take and handle dimension of order .

We may dispense with Assumption 2 on the feature distribution by revising (9). By Lemma 5 in the appendix, it holds generally under Assumption 3 that

$$\delta_j = E\{\tilde{e}_j(T_1\wedge C_1\wedge\tau)\}, \quad j\in\mathcal{M}_n.$$

Accordingly, if we replace (9) with the assumption that uniformly in for , the conclusions of Theorem 1 still hold. In other words, we can generally expect FAST-SIS to detect features which are ‘correlated with the chance of survival’, much in line with Section 2. While this is valuable structural insight, the covariance assumption (9) seems a more operational condition.

Assumption 3 is crucial to the proof of Theorem 1 and to the general idea of translating a model-based feature selection problem into a problem of hard-thresholding. A weaker assumption is not possible in general. For example, suppose that only Assumption 2 holds and that the censoring time also follows some single-index model of the form (7) with regression coefficients $\beta_0$. Applying Lemma 2.1 of Cheng and Wu (1994) to (8), there exist finite constants $\zeta_1, \zeta_2$ (depending on $\alpha_0$ and $\beta_0$) such that

$$\delta = \Sigma(\zeta_1\alpha_0 + \zeta_2\beta_0). \qquad (10)$$

It follows that unrestricted censoring will generally confound the relationship between $\delta$ and $\alpha_0$, hence between $d$ and $\alpha_0$. The precise impact of such unrestricted censoring seems difficult to discern, although (10) suggests that FAST-SIS may still be able to capture the underlying model (unless $\beta_0$ is particularly ill-behaved). We will have more to say about unrestricted censoring in the next section.

Theorem 1 shows that FAST-SIS can consistently capture a superset of the relevant features. A priori, this superset can be quite large; indeed, ‘perfect’ screening would result by simply including all features. For FAST-SIS to be useful, it must substantially reduce feature space dimension. Below we state a survival analogue of Theorem 5 in Fan and Song (2010), providing an asymptotic rate on the FAST-SIS model size.

###### Theorem 2.
Suppose that Assumptions 1-4 hold alongside the regularity conditions of the appendix and that for positive constants and sufficiently large . If for some and , there exists a positive constant such that
where $\lambda_{\max}(\Sigma)$ denotes the maximal eigenvalue of the covariance matrix $\Sigma$ of the feature distribution.

Informally, the theorem states that, under similar assumptions as in Theorem 1 and the partial orthogonality condition (Assumption 4), if features are not too strongly correlated (as measured by the maximal eigenvalue of the covariance matrix) and grows sufficiently fast, we can choose a threshold for hard-thresholding such that the false selection rate becomes asymptotically negligible.

Our theorems say little about how to actually select the hard-thresholding parameter $\gamma_n$ in practice. Following Fan and Lv (2008) and Fan et al. (2009), we would typically choose $\gamma_n$ such that the number of retained features is of order $n/\log n$. Devising a general data-adaptive way of choosing $\gamma_n$ is an open problem; false-selection-based criteria are briefly mentioned in Section 3.3.

### 3.2 The special case of the Aalen model

Additional insight into the impact of censoring on FAST-SIS is possible within the more restrictive context of the nonparametric Aalen model with Gaussian features (Aalen (1980); Aalen (1989)). This particular model asserts a hazard rate function for $T_j$ of the form

$$\lambda_j(t) = \lambda_0(t) + Z_j^\top\alpha_0(t), \quad j=1,2,\dots,n; \qquad (11)$$

for some baseline hazard rate function $\lambda_0$ and a vector $\alpha_0(\cdot)$ of continuous regression coefficient functions. The Aalen model extends the Lin-Ying model of Section 2 by allowing time-varying regression coefficients. Alternatively, it can be viewed as defining an expansion to the first order of a general hazard rate function in the class (7) in the sense that

$$\lambda(t, Z_1^\top\alpha_0) \approx \lambda(t,0) + Z_1^\top\alpha_0\,\frac{\partial\lambda(t,x)}{\partial x}\Big|_{x=0}. \qquad (12)$$

For Aalen models with Gaussian features, we have the following analogue to Theorem 1.

###### Theorem 3.
Suppose that Assumptions 1-2 hold alongside the regularity conditions of the appendix. Suppose moreover that $Z_1$ is mean zero Gaussian and that $T_1$ follows a model of the form (11) with regression coefficients $\alpha_0(\cdot)$. Assume that $C_1$ also follows a model of the form (11) conditionally on $Z_1$ and that censoring is independent. If for some and , we have
(13)
then the conclusions of Theorem 1 hold with .

In view of (12), Theorem 3 can be viewed as establishing, within the model class (7), conditions for first-order validity of FAST-SIS under a general (independent) censoring mechanism and Gaussian features. The expectation term in (13) is essentially the 'expected regression coefficients at the exit time', which is strongly dependent on censoring through the symmetric dependence on the survival and censoring times.

In fact, general independent censoring is a nuisance even in the Lin-Ying model, which would otherwise seem the 'natural model' in which to use FAST-SIS. Specifically, assuming only independent censoring, suppose that $T_1$ follows a Lin-Ying model with regression coefficients $\alpha_0$ conditionally on $Z_1$ and that $C_1$ also follows some Lin-Ying model conditionally on $Z_1$. If $Z_1 = \Sigma^{1/2}W$ where the components of $W$ are i.i.d. with mean zero and unit variance, there exists a diagonal matrix $C$ such that

$$\delta = \Sigma^{1/2}C\Sigma^{1/2}\alpha_0. \qquad (14)$$

See Lemma 6 in the appendix. It holds that $C$ has constant diagonal if and only if features are Gaussian; otherwise the diagonal is nonconstant and depends nontrivially on the regression coefficients of the censoring model. A curious implication is that, under Gaussian features, FAST screening has the SIS property for this 'double' Lin-Ying model irrespective of the (independent) censoring mechanism. Conversely, sufficient conditions for a SIS property to hold here under more general feature distributions would require the $j$th diagonal element of $C$ to be 'large' whenever $|\alpha_{0j}|$ is 'large'; hardly a very operational assumption. In other words, even in the simple Lin-Ying model, unrestricted censoring complicates analysis of FAST-SIS considerably.

### 3.3 Scaling the FAST statistic

The FAST statistic is easily generalized to incorporate scaling. Inspection of the results in the appendix immediately shows that multiplying the FAST statistic by some strictly positive, deterministic weight does not alter its asymptotic behavior. Under suitable assumptions, this also holds when weights are stochastic. In the notation of Section 2, the following two types of scaling are immediately relevant:

$$d_j^{Z} = d_j B_{jj}^{-1/2} \quad\text{(Z-FAST)}; \qquad (15)$$
$$d_j^{LY} = d_j D_{jj}^{-1} \quad\text{(Lin-Ying-FAST)}. \qquad (16)$$

The Z-FAST statistic corresponds to standardizing $d_j$ by its estimated standard deviation; screening with this statistic is equivalent to the standard approach of ranking features according to univariate Wald $p$-values. Various forms of asymptotic false-positive control can be implemented for Z-FAST, courtesy of the central limit theorem. Note that Z-FAST is model-independent in the sense that its interpretation (and asymptotic normality) does not depend on a specific model. In contrast, the Lin-Ying-FAST statistic is model-specific and corresponds to calculating the univariate regression coefficients in the Lin-Ying model, thus leading to an analogue of the idea of 'ranking by absolute regression coefficients' of Fan and Song (2010).

We may even devise a scaling of $d$ which mimics the 'ranking by marginal likelihood ratio' screening of Fan and Song (2010) by considering univariate versions of the natural loss function for the Lin-Ying model. The components of the resulting statistic are rather similar to (16), taking the form

$$d_j^{\mathrm{loss}} = d_j D_{jj}^{-1/2} \quad\text{(loss-FAST)}. \qquad (17)$$

Additional flexibility can be gained by using a time-dependent scaling where some strictly positive (stochastic) weight is multiplied on the integrand in (1). This is beyond the scope of the present paper.
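The scalings (15)-(16) reuse quantities already available from the screening pass: $B_{jj}$ is the average over events of the squared centered feature values, and $D_{jj}$ is the diagonal of $D$ from Section 2.1. A numpy sketch of both scaled statistics (function names and the convention of passing the precomputed diagonal of $D$ are ours):

```python
import numpy as np

def scaled_fast(time, event, Z, Djj=None, tau=None):
    """Z-FAST (15) and, if the diagonal of D is supplied, Lin-Ying-FAST
    (16).  B_jj is the diagonal of B in display (6)."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=int)
    Z = np.asarray(Z, dtype=float)
    n, p = Z.shape
    if tau is None:
        tau = time.max()
    d = np.zeros(p)
    Bjj = np.zeros(p)
    for i in np.flatnonzero((event == 1) & (time <= tau)):
        r = Z[i] - Z[time >= time[i]].mean(axis=0)  # centered feature
        d += r
        Bjj += r ** 2
    d, Bjj = d / n, Bjj / n
    dZ = d / np.sqrt(Bjj)                   # Z-FAST: d_j * B_jj^{-1/2}
    dLY = d / Djj if Djj is not None else None  # Lin-Ying-FAST: d_j / D_jj
    return dZ, dLY
```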

## 4 Beyond simple independent screening – iterated FAST screening

The main assumption underlying any SIS method, including FAST-SIS, is that the design is close to orthogonal. This assumption is easily violated: a relevant feature may have a low marginal association with survival; an irrelevant feature may be indirectly associated with survival through associations with relevant features etc. To address such issues, Fan and Lv (2008) and Fan et al. (2009) proposed various heuristic iterative SIS (ISIS) methods which generally work as follows. First, SIS is used to recruit a small subset of features within which an even smaller subset of features is selected using a (multivariate) variable selection method such as penalized regression. Second, the (univariate) relevance of each feature not selected in the variable selection step is re-evaluated, adjusted for all the selected features. Third, a small subset of the most relevant of these new features is joined to the set of already selected features, and the variable selection step is repeated. The last two steps are iterated until the set of selected features stabilizes or some stopping criterion of choice is reached.

We advocate a similar strategy to extend the application domain of FAST-SIS. In view of Section 2.1, a variable selection step using a working Lin-Ying model is intuitively sensible. We may also provide some formal justification. Firstly, estimation in a Lin-Ying model corresponds to optimizing the loss function

$$L(\beta) := \beta^\top D\beta - 2\beta^\top d; \qquad (18)$$

where $D$ was defined in Section 2.1. As discussed by Martinussen and Scheike (2009), the loss function (18) is meaningful for general hazard rate models: it is the empirical version of the mean squared prediction error for predicting, with a working Lin-Ying model, the part of the intensity which is orthogonal to the at-risk indicator. In the present context, we are mainly interested in the model selection properties of a working Lin-Ying model. Suppose that $T_1$ conditionally on $Z_1$ follows a single-index model of the form (7) and that Assumptions 3-4 hold. Let $\beta^*$ denote the in-probability limit of the minimizer of (18). Then $j \notin \mathcal{M}_n$ implies $\beta^*_j = 0$ (Hattori, 2006), so that a working Lin-Ying model will yield conservative model selection in a quite general setting. Under stronger assumptions, the following result, related to work by Brillinger (1983) and Li and Duan (1989), is available.

###### Theorem 4.
Assume that $T_1$ conditionally on $Z_1$ follows a single-index model of the form (7). Suppose moreover that Assumption 2 holds and that $C_1$ is independent of $(T_1, Z_1)$ (random censoring). If $\beta^*$, defined by $D\beta^* = \delta$, is the vector of regression coefficients of the associated working Lin-Ying model and $D$ is nonsingular, then there exists a nonzero constant $c$, depending only on the distributions of $T_1$, $C_1$ and $Z_1^\top\alpha_0$, such that $\beta^* = c\,\alpha_0$.

Thus a working Lin-Ying model can consistently estimate regression coefficient signs under misspecification. From the efforts of Zhu et al. (2009) and Zhu and Zhu (2009) for other types of single-index models, it seems conceivable that variable selection methods designed for the Lin-Ying model will enjoy certain consistency properties within the model class (7). The conclusion of Theorem 4 continues to hold when $D$ is replaced by any matrix proportional to the feature covariance matrix $\Sigma$. This is a consequence of Assumption 2 and underlines the considerable flexibility available when estimating $\alpha_0$ in single-index models.

Variable selection based on the Lin-Ying loss (18) can be accomplished by optimizing a penalized loss function of the form

$$\beta \mapsto L(\beta) + \sum_{j=1}^{p}p_\lambda(|\beta_j|); \qquad (19)$$

where $p_\lambda$ is some nonnegative penalty function, singular at the origin to facilitate model selection (Fan and Li, 2001), and depending on some tuning parameter $\lambda$ controlling the sparsity of the penalized estimator. A popular choice is the lasso penalty (Tibshirani, 2009) and its adaptive variant (Zou, 2006), corresponding to penalty functions $p_\lambda(|\beta_j|) = \lambda|\beta_j|$ and $p_\lambda(|\beta_j|) = \lambda|\beta_j|/|\tilde{\beta}_j|$ with $\tilde{\beta}$ some root-$n$ consistent estimator of $\beta$, respectively. These penalties were studied by Ma and Leng (2007) and Martinussen and Scheike (2009) for the Lin-Ying model. Empirically, we have had better success with the one-step SCAD (OS-SCAD) penalty of Zou and Li (2008) than with lasso penalties. Letting

$$w_\lambda(x) := \lambda\,1(x\le\lambda) + \frac{(a\lambda - x)_+}{a-1}\,1(x>\lambda), \quad a>2, \qquad (20)$$

an OS-SCAD penalty function for the Lin-Ying model can be defined as follows:

$$p_\lambda(|\beta_j|) := w_\lambda(\bar{D}\,|\hat{\beta}_j|)\,|\beta_j|. \qquad (21)$$

Here $\hat{\beta}_j$ is the $j$th component of the unpenalized estimator and $\bar{D}$ is the average diagonal element of $D$; this particular re-scaling is just one way to lessen dependency of the penalization on the time scale. If $D$ has approximately constant diagonal (which is often the case for standardized features), then re-scaling by $\bar{D}$ leads to a similar penalty as for OS-SCAD in the linear regression model with standardized features. The choice $a = 3.7$ in (20) was recommended by Fan and Li (2001). OS-SCAD has not previously been explored for the Lin-Ying model, but its favorable performance in ISIS for other regression models is well known (Fan et al., 2009, 2010). OS-SCAD can be implemented efficiently using, for example, coordinate descent methods for fitting the lasso (Gorst-Rasmussen and Scheike, 2011; Friedman et al., 2007). For fixed $p$, the OS-SCAD penalty (21) has the oracle property if the Lin-Ying model holds true. A proof is beyond scope but follows by adapting Zou and Li (2008) along the lines of Martinussen and Scheike (2009).
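The weight function (20) is straightforward to implement; a minimal sketch (function name is ours), with the recommended $a = 3.7$ as default:

```python
import numpy as np

def scad_weight(x, lam, a=3.7):
    """SCAD-derivative weight w_lambda(x) of display (20):
    lam for x <= lam, then (a*lam - x)_+ / (a - 1) for x > lam,
    with a > 2 (a = 3.7 as recommended by Fan and Li, 2001)."""
    x = np.asarray(x, dtype=float)
    return np.where(x <= lam, lam, np.maximum(a * lam - x, 0.0) / (a - 1.0))
```

The weight decreases from $\lambda$ to zero as the (rescaled) unpenalized coefficient grows, so large coefficients are left essentially unpenalized while small ones face the full lasso-type penalty.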

In the basic FAST-ISIS algorithm proposed below, the initial recruitment step corresponds to ranking the regression coefficients in the univariate Lin-Ying models. This is a convenient generic choice because it enables interpretation of the algorithm as standard ‘vanilla ISIS’ (Fan et al., 2009) for the Lin-Ying model.

###### Algorithm 1 (Lin-Ying-FAST-ISIS).
Set , let be some pre-defined maximal number of iterations of the algorithm. (Initial recruitment). Perform SIS by ranking , , according to decreasing order of magnitude and retaining the most relevant features . For do: (Feature selection). Define if and otherwise. Estimate
with defined in (21) for some suitable tuning parameter . Set .
If and , or if ; return . (Re-recruitment). Otherwise, re-evaluate relevance of features in according to the absolute value of their regression coefficient in the unpenalized Lin-Ying models including each feature in and all features in , i.e.
(22)
Take where is the set of the most relevant features in , ranked according to decreasing order of magnitude of .

Fan and Lv (2008) recommended choosing to be of order . Following Fan et al. (2009), we may take and at each step. This ensures that we complete at least one iteration of the algorithm; the choice of for ensures that at most features are included in the final solution.

Algorithm 1 defines an iterated variant of SIS with the Lin-Ying-FAST statistic (16). We can devise an analogous iterated variant of Z-FAST-SIS in which the initial recruitment is performed by ranking based on the statistic (15), and the subsequent re-recruitments are performed by ranking $Z$-statistics in the multivariate Lin-Ying model according to decreasing order of magnitude, using the variance estimator (6). A third option would be to base recruitment on (17) and re-recruitments on the decrease in the multivariate loss (18) when joining a given feature to the set of features picked out in the variable selection step.

The re-recruitment step (b.iii) in Algorithm 1 resembles that of Fan et al. (2009). Its naive implementation will be computationally burdensome when is large, requiring a low-dimensional matrix inversion per feature. Significant speedup over the naive implementation is possible via the matrix identity

 D=\begin{pmatrix}e&f^{\top}\\ f&\tilde{D}\end{pmatrix}\;\Rightarrow\; D^{-1}=\begin{pmatrix}k^{-1}&-k^{-1}f^{\top}\tilde{D}^{-1}\\ -k^{-1}\tilde{D}^{-1}f&(\tilde{D}-e^{-1}ff^{\top})^{-1}\end{pmatrix},\quad\text{where }k=e-f^{\top}\tilde{D}^{-1}f. (23)

Note that only the first row of is required for the re-recruitment step so that (22) can be implemented using just a single low-dimensional matrix inversion alongside matrix/vector multiplications. Combining (23) with (6), a similarly efficient implementation applies for -FAST-ISIS.
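The computational content of (23) can be illustrated in a short numpy sketch: the first row of D^{-1} follows from a single linear solve involving the low-dimensional block, and agrees with the corresponding row of the full inverse. The numerical example is ours:

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a symmetric positive-definite D partitioned as in (23):
# D = [[e, f^T], [f, Dtil]] with scalar e, vector f, block Dtil.
q = 5
A = rng.standard_normal((q + 1, q + 1))
D = A @ A.T + (q + 1) * np.eye(q + 1)
e, f, Dtil = D[0, 0], D[1:, 0], D[1:, 1:]

# Identity (23): one solve against the low-dimensional block Dtil
# yields k and hence the entire first row of D^{-1}.
Dtil_inv_f = np.linalg.solve(Dtil, f)
k = e - f @ Dtil_inv_f
first_row = np.concatenate(([1.0 / k], -Dtil_inv_f / k))
# first_row coincides with np.linalg.inv(D)[0], but Dtil^{-1} can be
# shared across all candidate features in the re-recruitment step.
```

Since the factorization of the shared block can be reused for every candidate feature, the per-feature cost drops from a full inversion to a handful of matrix/vector multiplications.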

The variable selection step (b.i) of Algorithm 1 requires the choice of an appropriate tuning parameter. This is traditionally a difficult part of penalized regression, particularly when the aim is model selection where methods such as cross-validation are prone to overfitting (Leng et al., 2007). Previous work on ISIS used the Bayesian information criterion (BIC) for tuning parameter selection (Fan et al., 2009). Although BIC is based on the likelihood, we may still define the following ‘pseudo BIC’ based on the loss (18):

 \mathrm{PBIC}(\lambda)=\kappa\bigl\{L(\hat\beta_\lambda)-L(\hat\beta)\bigr\}+n^{-1}\,\mathrm{df}_\lambda\log n. (24)

Here is the penalized estimator, is the unpenalized estimator, is a scaling constant of choice, and

estimates the degrees of freedom of the penalized estimator. A computationally convenient choice is

(Zou et al., 2007). It turns out that choosing may lead to model selection consistency. Specifically, the loss (18) for the Lin-Ying model is of the least-squares type, so we can repeat the arguments of Wang and Leng (2007) and show that, under suitable consistency assumptions for the penalized estimator, there exists a sequence which yields selection consistency for and satisfies

 P\Bigl\{\inf_{\lambda\in S}\mathrm{PBIC}(\lambda)>\mathrm{PBIC}(\lambda_n)\Bigr\}\to 1,\qquad n\to\infty; (25)

with the union of the sets of tuning parameters which lead to overfitted models (strict supermodels of the true model) and underfitted models (models which do not include the true model), respectively. While (25) holds independently of the scaling constant , the finite-sample behavior of PBIC depends strongly on . A sensible value may be inferred heuristically as follows: the range of a 'true' likelihood BIC is asymptotically equivalent to a Wald statistic in the sense that (for fixed ),

 \mathrm{BIC}(0)-\mathrm{BIC}(\infty)=\hat\beta_{\mathrm{ML}}^{\top}I(\beta_0)\hat\beta_{\mathrm{ML}}+o_p(n^{-1/2}); (26)

with the maximum likelihood estimator and the information matrix. We may specify by requiring that admits an analogous Wald-statistic interpretation. Since , it follows from (6) that we should choose

 \kappa:=\frac{d^{\top}B^{-1}d}{d^{\top}D^{-1}d}.

This choice of also removes the dependency of PBIC on the time scale.
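The following Python sketch illustrates how (24) and the Wald-type choice of κ might be computed in practice. The inputs (losses and the first- and second-order statistics d, B, D) are placeholders for the corresponding Lin-Ying quantities, and using the number of nonzero coefficients as the degrees of freedom follows Zou et al. (2007):

```python
import numpy as np

def pbic(loss_pen, loss_unpen, n_nonzero, n, kappa):
    """Pseudo-BIC (24): kappa * {L(beta_lambda) - L(beta_hat)}
    + n^{-1} * df_lambda * log(n), with df_lambda estimated by the
    number of nonzero coefficients (Zou et al., 2007)."""
    return kappa * (loss_pen - loss_unpen) + n_nonzero * np.log(n) / n

def kappa_wald(d, B, D):
    """Scaling constant kappa = (d^T B^{-1} d) / (d^T D^{-1} d);
    d, B, D stand in for the first-order statistic and the two
    second-order matrices of the Lin-Ying estimating equations."""
    return (d @ np.linalg.solve(B, d)) / (d @ np.linalg.solve(D, d))

# Tuning parameter selection: evaluate pbic over a grid of lambda
# values and keep the minimizer (illustrative numbers below).
val = pbic(loss_pen=1.0, loss_unpen=0.5, n_nonzero=3, n=100, kappa=2.0)
```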

## 5 Simulation studies

In this section, we investigate the performance of FAST screening on simulated data. Rather than comparing with popular variable selection methods such as the lasso, we compare with analogous screening methods based on the Cox model (Fan et al., 2010). This seems a more pertinent benchmark since previous work has already demonstrated that (iterated) SIS can outperform variable selection based on penalized regression in a number of cases (Fan and Lv, 2008; Fan et al., 2009).

For all the simulations, survival times were generated from three different conditionally exponential models of the generic form (7); that is, a time-independent hazard ‘link function’ applied to a linear functional of features. For suitable constants , the link functions were as follows (see also Figure 1):

 \text{Logit:}\quad\lambda_{\mathrm{logit}}(t,x):=\{1+\exp(c_{\mathrm{logit}}x)\}^{-1}
 \text{Cox:}\quad\lambda_{\mathrm{cox}}(t,x):=\exp(c_{\mathrm{cox}}x)
 \text{Log:}\quad\lambda_{\mathrm{log}}(t,x):=\log\{e+(c_{\mathrm{log}}x)^{2}\}\{1+\exp(c_{\mathrm{log}}x)\}^{-1}.

The link functions represent different characteristic effects of the feature functional, ranging from uniformly bounded (logit), through fast decay/increase (Cox), to fast decay/slow increase (log). We took , , and and, unless otherwise stated, survival times were right-censored by independent exponential random variables with rate parameters 0.12 (logit link), 0.3 (Cox link), and 0.17 (log link). These constants were selected to provide a crude 'calibration' making the simulation models more comparable: for a univariate standard Gaussian feature , a regression coefficient , and a sample size of , the expected -statistic was 8 for all three link functions, with an expected censoring rate of 25%, as evaluated by numerical integration based on the true likelihood.
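For concreteness, survival data from a generic conditionally exponential model of this form can be simulated as follows. The calibration constants are folded into the linear predictor here, so the sketch does not reproduce the exact constants of the study:

```python
import numpy as np

rng = np.random.default_rng(1)

# The three hazard 'link functions', applied to the (scaled) linear
# predictor u; the calibration constants c are absorbed into u here.
links = {
    "logit": lambda u: 1.0 / (1.0 + np.exp(u)),
    "cox":   lambda u: np.exp(u),
    "log":   lambda u: np.log(np.e + u**2) / (1.0 + np.exp(u)),
}

def simulate(link, beta, X, cens_rate, rng):
    """Conditionally exponential survival times T with hazard
    link(X @ beta), right-censored by independent Exp(cens_rate)
    censoring times; returns (observed time, event indicator)."""
    hazard = link(X @ beta)
    T = rng.exponential(1.0 / hazard)              # scale = 1 / rate
    C = rng.exponential(1.0 / cens_rate, size=len(T))
    return np.minimum(T, C), (T <= C).astype(int)

X = rng.standard_normal((200, 10))
beta = np.zeros(10); beta[0] = 1.0                 # one relevant feature
time, status = simulate(links["cox"], beta, X, cens_rate=0.3, rng=rng)
```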

Methods for FAST screening have been implemented in the R package ‘ahaz’ (Gorst-Rasmussen, 2011).

### 5.1 Performance of FAST-SIS

We first considered the performance of basic, non-iterated FAST-SIS. Features were generated as in scenario 1 of Fan and Song (2010). Specifically, let be standard Gaussian. Define

 Z_{1j}:=\frac{\epsilon_j+a_j\epsilon}{\sqrt{1+a_j^{2}}},\qquad j=1,\dots,p; (27)

where is independently distributed as a standard Gaussian for ; independently distributed according to a double exponential distribution with location parameter zero and scale parameter 1 for ; and independently distributed according to a Gaussian mixture for . The constants satisfy and for . With the choice , , we obtain for , , enabling crude adjustment of the correlation structure of the feature distribution. Regression coefficients were chosen to be of the generic form with exactly the first components nonzero.
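The common-factor construction (27) is easy to implement and makes the induced correlation explicit: corr(Z_i, Z_j) = a_i a_j / {(1 + a_i^2)(1 + a_j^2)}^{1/2} for i ≠ j. A numpy sketch with equal a_j, which gives constant pairwise correlation a^2/(1 + a^2):

```python
import numpy as np

rng = np.random.default_rng(2)

def factor_features(n, p, a, rng):
    """Features Z_j = (eps_j + a_j * eps) / sqrt(1 + a_j^2) as in (27),
    with eps and eps_j independent standard Gaussian.  Marginals are
    N(0, 1), and corr(Z_i, Z_j) = a_i a_j / sqrt((1+a_i^2)(1+a_j^2))
    for i != j."""
    a = np.broadcast_to(np.asarray(a, dtype=float), (p,))
    eps = rng.standard_normal((n, 1))       # shared factor
    eps_j = rng.standard_normal((n, p))     # idiosyncratic noise
    return (eps_j + a * eps) / np.sqrt(1.0 + a**2)

# With a_j = 1 for all j, every off-diagonal correlation is 1/2:
Z = factor_features(50_000, 4, 1.0, rng)
```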

For each combination of hazard link function, non-sparsity level , and correlation , we performed 100 simulations with features and observations. Features were ranked using the vanilla FAST statistic, the scaled FAST statistics (15) and (16), and SIS based on a Cox working model (Cox-SIS), the latter ranking features according to their absolute univariate regression coefficients. Results are shown in Table 1. As a performance measure, we report the median minimum model size (MMMS) needed to detect all relevant features, alongside its relative standard deviation (RSD), the interquartile range divided by 1.34. The minimum model size is a useful performance measure for this type of study since it eliminates the need to select a threshold parameter for SIS. The censoring rate in the simulations was typically 30%-40%.
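The minimum model size for a given ranking statistic can be computed directly as the worst rank attained by a relevant feature; a small sketch (the scores and relevant-feature indices below are illustrative):

```python
import numpy as np

def minimum_model_size(scores, true_idx):
    """Minimum model size: the smallest k such that the k features with
    largest |score| include every truly relevant feature -- equivalently,
    the worst (1-based) rank attained by a relevant feature."""
    ranks = np.argsort(np.argsort(-np.abs(scores))) + 1
    return int(ranks[list(true_idx)].max())

scores = np.array([0.9, -0.1, 0.5, -0.8, 0.05])
# relevant features 0 and 2 have ranks 1 and 3, so the MMS is 3
mms = minimum_model_size(scores, [0, 2])
```

Reporting the median of this quantity over simulation replicates gives the MMMS used in Tables 1 and 2.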

For all methods, the MMMS is seen to increase with feature correlation and non-sparsity . As also noted by Fan and Song (2010) for the case of SIS for generalized linear models, some correlation among features can actually be helpful since it increases the strength of marginal signals. Overall, the statistic seems to perform slightly worse than both and whereas the latter two statistics perform similarly to Cox-SIS. In our basic implementation, screening with any of the FAST statistics was more than 100 times faster than Cox-SIS, providing a rough indication of the relative computational efficiency of FAST-SIS.

To gauge the relative difficulty of the different simulation scenarios, Figure 2 shows box plots of the minimum of the observed -statistics in the oracle model (the joint model with only the relevant features included and estimation based on the likelihood under the true link function) for the link function . This particular link function represents an ‘intermediate’ level of difficulty; with -statistics for generally being somewhat larger and -statistics for being slightly smaller. Even with oracle information and the correct working model, these are evidently difficult data to deal with.

### 5.2 FAST-SIS with non-Gaussian features and nonrandom censoring

We next investigated FAST-SIS with non-Gaussian features and a more complex censoring mechanism. The simulation scenario was inspired by the previous section, but with all features generated according to either a standard Gaussian distribution, a t-distribution with 4 degrees of freedom, or a unit rate exponential distribution. Features were standardized to have mean zero and variance one, and the feature correlation structure was such that for , and otherwise. Survival times were generated according to the link function with regression coefficients , while censoring times were generated according to the same model (link function and conditionally on the same feature realizations) with regression coefficients . The constant controls the association between the censoring and survival times, leading to a basic example of nonrandom censoring (competing risks).
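A minimal sketch of this censoring mechanism, assuming for illustration that the censoring coefficients are proportional to the survival coefficients via a constant rho (the actual coefficient values of the study are not reproduced, and the 'log' link is used purely as an example):

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate_dependent_censoring(X, beta, rho, link, rng):
    """Survival and censoring times drawn from the same conditionally
    exponential model on the same feature realizations.  The censoring
    coefficients are taken as rho * beta (a proportionality assumed
    here for illustration), so rho controls the survival/censoring
    association; rho = 0 recovers random censoring."""
    T = rng.exponential(1.0 / link(X @ beta))
    C = rng.exponential(1.0 / link(X @ (rho * beta)))
    return np.minimum(T, C), (T <= C).astype(int)

link = lambda u: np.log(np.e + u**2) / (1.0 + np.exp(u))  # 'log' link
X = rng.standard_normal((500, 8))
beta = np.zeros(8); beta[:2] = 1.0
time, status = simulate_dependent_censoring(X, beta, rho=0.5,
                                            link=link, rng=rng)
```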

Using features and  observations, we performed 100 simulations under each of the three feature distributions, for different values of . Table 2 reports the MMMS and RSD for the four different screening methods of the previous section, as well as for the statistic in (17). The censoring rate in all scenarios was around 50%.

From the column with (random censoring), the heavier tails of the t-distribution increase the MMMS, particularly for . The vanilla FAST statistic seems least affected here, most likely because it does not directly involve second-order statistics, which are poorly estimated due to the heavier tails. While and are also scaled by second-order statistics, the impact of the tails is dampened by the square-root transformation in the scaling factors. In contrast, the more distinctly non-Gaussian exponential distribution is problematic for . Overall, the statistics and seem to have the best and most consistent performance across feature distributions. Nonrandom censoring generally increases the MMMS and RSD, particularly for the non-Gaussian distributions. There appears to be no clear difference between the effects of positive and negative values of . We found that the effect of diminished when the sample size was increased (results not shown), suggesting that nonrandom censoring in the present example leads to a power issue rather than a bias issue. This may not be surprising in view of the considerations below (14). However, the example still shows the dramatic impact of nonrandom censoring on the performance of SIS.

### 5.3 Performance of FAST-ISIS

We lastly evaluated the ability of FAST-ISIS (Algorithm 1) to cope with scenarios where FAST-SIS fails. As in the previous sections, we compared our results with the analogous ISIS screening method for the Cox model. To perform Cox-ISIS, we used the R package 'SIS', with (re)recruitment based on the absolute Cox regression coefficients and variable selection based on OS-SCAD. We also compared with the -FAST-ISIS variant described below Algorithm 1, in which (re)recruitment is based on the Lin-Ying model -statistics (results for FAST-ISIS with (re)recruitment based on the loss function were very similar).

For the simulations, we adopted the structural form of the feature distributions used by Fan et al. (2010). We considered observations and features which were jointly Gaussian and marginally standard Gaussian. Only regression coefficients and feature correlations differed between cases as follows:

1. The regression coefficients are , , , , , and for . Features are independent, for .

2. The regression coefficients are the same as in (a) while for .

3. The regression coefficients are , . The correlation between features is given by for and for , .

4. The regression coefficients are , and . The correlation between features is for ,