The ordinary linear model
, despite its apparent simplicity, has been the bedrock of signal processing, statistics, and machine learning for decades. The last decade, however, has witnessed a marked transformation of this model: instead of the classical low-dimensional setting in which the dimension,, of (henceforth, referred to as the sample size) exceeds the dimension, , of
(henceforth, referred to as the number of features/predictors/variables), we are increasingly having to operate in the high-dimensional setting in which the number of variables far exceeds the sample size (i.e.,). While the high-dimensional setting should ordinarily lead to ill-posed problems, the principle of parsimony—which states that only a small number of variables typically affect the response —helps obtain unique solutions to inference problems based on high-dimensional linear models.
Our focus in this paper is on ultrahigh-dimensional linear models, in which the number of variables can scale exponentially with the sample size: for . (Footnote 1: Recall Landau’s big- notation: if and if .) Such linear models are increasingly becoming common in application areas ranging from genomics [1, 2, 3, 4] and proteomics [5, 6, 7]8, 9, 10] and hyperspectral imaging [11, 12, 13]. While there exist a number of techniques in the literature (such as forward selection/matching pursuit, backward elimination, least absolute shrinkage and selection operator (LASSO), elastic net, smoothly clipped absolute deviation (SCAD), bridge regression [18, 19], adaptive LASSO, group LASSO, and Dantzig selector) that can be employed for inference from high-dimensional linear models, all these techniques have super-linear (in the number of variables ) computational complexity. In the ultrahigh-dimensional setting, therefore, use of the aforementioned methods for statistical inference can easily become computationally prohibitive. Variable selection-based dimensionality reduction, commonly referred to as variable screening, has been put forth as a practical means of overcoming this curse of dimensionality: since only a small number of (independent) variables actually contribute to the response (dependent variable) in the ultrahigh-dimensional setting, one can first, in principle, discard most of the variables (the screening step) and then carry out inference on a relatively low-dimensional linear model using any one of the sparsity-promoting techniques. There are two main challenges that arise in the context of variable screening in ultrahigh-dimensional linear models. First, the screening algorithm should have low computational complexity (ideally, ). Second, the screening algorithm should be accompanied by mathematical guarantees that ensure the reduced linear model contains all relevant variables that affect the response.
Our goal in this paper is to revisit one of the simplest screening algorithms, which uses marginal correlations between the variables and the response for screening purposes [24, 25], and provide a comprehensive theoretical understanding of its screening performance for arbitrary (random or deterministic) ultrahigh-dimensional linear models.
I-A Relationship to Prior Work
Researchers have long intuited that the (absolute) marginal correlation is a strong indicator of whether the -th variable contributes to the response variable. Indeed, methods such as stepwise forward regression are based on this very intuition. It is only recently, however, that we have obtained a rigorous understanding of the role of marginal correlations in variable screening. One of the earliest screening works in this regard that is agnostic to the choice of the subsequent inference techniques is termed sure independence screening (SIS) . SIS is based on simple thresholding of marginal correlations and satisfies the so-called sure screening property, which guarantees that all important variables survive the screening stage with high probability, for the case of normally distributed variables. An iterative variant of SIS, termed ISIS, is also discussed in, while  presents variants of SIS and ISIS that can lead to reduced false selection rates of the screening stage. Extensions of SIS to generalized linear models are discussed in [27, 28], while its generalizations for semi-parametric (Cox) models and non-parametric models are presented in [29, 30] and [31, 32], respectively.
The marginal correlation can be considered an empirical measure of the Pearson correlation coefficient, which is a natural choice for discovering linear relations between the independent variables and the response. In order to perform ultrahigh-dimensional variable screening in the presence of non-linear relations between ’s and and/or heavy-tailed variables,  and  have put forth screening using generalized (empirical) correlation and Kendall rank correlation, respectively.
The defining characteristic of the works referenced above is that they are agnostic to the inference technique that follows the screening stage. In recent years, screening methods have also been proposed for specific optimization-based inference techniques. To this end,  formulates a marginal correlations-based screening method, termed SAFE, for the LASSO problem and shows that SAFE results in zero false selection rate. In , the so-called strong rules for variable screening in LASSO-type problems are proposed; these rules are still based on marginal correlations, but they discard far more variables than the SAFE method. The screening tests of [35, 36] for the LASSO problem are further improved in [37, 38, 39] by analyzing the dual of the LASSO problem. We refer the reader to  for an excellent review of these different screening tests for LASSO-type problems.
Notwithstanding these prior works, gaps remain in our understanding of variable screening in ultrahigh-dimensional linear models. Works such as [35, 36, 37, 38, 39] necessitate the use of LASSO-type inference techniques after the screening stage. In addition, these works do not help us understand the relationship between the problem parameters and the dimensions of the reduced model. Stated differently, it is difficult to quantify a priori the computational savings associated with the screening tests proposed in [35, 36, 37, 38, 39]. Similar to [26, 27, 33, 34], and in contrast to [35, 36, 37, 38, 39], our focus in this paper is on screening that is agnostic to the post-screening inference technique. To this end,  lacks a rigorous theoretical understanding of variable screening using the generalized correlation. While [26, 27, 34] overcome this shortcoming of , these works have two major limitations. First, their results are derived under the assumption of restrictive statistical priors on the linear model (e.g., normally distributed ’s). In many applications, however, it can be a challenge to ascertain the distribution of the independent variables. Second, the analyses in [26, 27, 34] assume the variance of the response variable to be bounded by a constant; this assumption, in turn, imposes the condition. In contrast, defining , we establish in the sequel that the ratio (and not ) directly influences the performance of marginal correlation-based screening procedures.
I-B Our Contributions
Our focus in this paper is on marginal correlation-based screening of ultrahigh-dimensional linear models that is agnostic to the post-screening inference technique. To this end, we provide an extended analysis of the thresholding-based SIS procedure of . The resulting screening procedure, which we term extended sure independence screening (ExSIS), provides new insights into marginal correlation-based screening of arbitrary (random or deterministic) ultrahigh-dimensional linear models. Specifically, we first provide a simple, distribution-agnostic sufficient condition—termed the screening condition—for (marginal correlation-based) screening of linear models. This sufficient condition, which succinctly captures joint interactions among both the active and the inactive variables, is then leveraged to explicitly characterize the performance of ExSIS as a function of various problem parameters, including noise variance, the ratio , and model sparsity. The numerical experiments reported at the end of this paper confirm that the dependencies highlighted in this screening result are reflective of the actual challenges associated with marginal correlation-based screening and are not mere artifacts of our analysis.
Next, despite the theoretical usefulness of the screening condition, it cannot be explicitly verified in polynomial time for any given linear model. This is reminiscent of related conditions such as the incoherence condition , the irrepresentable condition , the restricted isometry property , and the restricted eigenvalue condition
studied in the literature on high-dimensional linear models. In order to overcome this limitation of the screening condition, we explicitly derive it for two families of linear models. The first family corresponds to sub-Gaussian linear models, in which the independent variables are independently drawn from (possibly different) sub-Gaussian distributions. We show that the ExSIS results for this family of linear models generalize the SIS results derived in for normally distributed linear models. The second family corresponds to arbitrary (random or deterministic) linear models in which the (empirical) correlations between independent variables satisfy certain polynomial-time verifiable conditions. The ExSIS results for this family of linear models establish that, under appropriate conditions, it is possible to reduce the dimension of an ultrahigh-dimensional linear model to almost the sample size even when the number of active variables scales almost linearly with the sample size. This, to the best of our knowledge, is the first screening result that provides such explicit and optimistic guarantees without imposing a statistical prior on the distribution of the independent variables.
I-C Notation and Organization
The following notation is used throughout this paper. Lower-case letters are used to denote scalars and vectors, while upper-case letters are used to denote matrices. Given, denotes the smallest integer greater than or equal to . Given , we use as a shorthand for . Given a vector , denotes its norm. Given a matrix , denotes its -th column and denotes the entry in its -th row and -th column. Further, given a set , (resp., ) denotes a submatrix (resp., subvector) obtained by retaining columns of (resp., entries of ) corresponding to the indices in . Finally, the superscript denotes the transpose operation.
The rest of this paper is organized as follows. We formulate the problem of marginal correlation-based screening in Sec. II. Next, in Sec. III, we define the screening condition and present one of our main results that establishes the screening condition as a sufficient condition for ExSIS. In Sec. IV, we derive the screening condition for sub-Gaussian linear models and discuss the resulting ExSIS guarantees in relation to prior work. In Sec. V, we derive the screening condition for arbitrary linear models based on the correlations between independent variables and discuss implications of the derived ExSIS results. Finally, results of extensive numerical experiments on both synthetic and real data are reported in Sec. VI, while concluding remarks are presented in Sec. VII.
II Problem Formulation
Our focus in this paper is on the ultrahigh-dimensional ordinary linear model , where , , and for . In the statistics literature, is referred to as data/design/observation matrix with the rows of corresponding to individual observations and the columns of corresponding to individual features/predictors/variables, is referred to as observation/response vector with individual responses given by , is referred to as the parameter vector, and is referred to as modeling error or observation noise. Throughout this paper, we assume has unit norm columns, is sparse with (i.e., ), and is a zero-mean Gaussian vector with (entry-wise) variance and covariance . Here, is taken to be Gaussian with covariance for the sake of this exposition, but our analysis is trivially generalizable to other noise distributions and/or covariance matrices. Further, we make no a priori assumption on the distribution of . Finally, we define to be the set that indexes the non-zero components of . Using this notation, the linear model can equivalently be expressed as
Given (1), the goal of variable screening is to reduce the number of variables in the linear model from (since ) to a moderate scale (with ) using a fast and efficient method. Our focus here is in particular on screening methods that satisfy the so-called sure screening property; specifically, a method is said to carry out sure screening if the reduced model returned by it is guaranteed with high probability to retain all the columns of that are indexed by . The motivation here is that once one obtains a moderate-dimensional model through sure screening of (1), one can use computationally intensive model selection, regression, and estimation techniques on the reduced model for reliable model selection (identification of ), prediction (estimation of ), and reconstruction (estimation of ), respectively.
In this paper, we study sure screening using marginal correlations between the response vector and the columns of . The resulting screening procedure is outlined in Algorithm 1, which is based on the principle that the higher the correlation of a column of with the response vector, the more likely it is that the said column contributes to the response vector (i.e., it is indexed by the set ).
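Since the listing of Algorithm 1 is not reproduced in the text above, the following is only a minimal NumPy sketch of marginal correlation-based screening as it is described in this section. It assumes that the procedure computes the inner product of every column of the design matrix with the response vector and retains the indices of the largest absolute values. The function name `marginal_correlation_screening` and the argument names `X`, `y`, and `d` are illustrative choices, not notation taken from the paper.

```python
import numpy as np


def marginal_correlation_screening(X, y, d):
    """Return the (sorted) indices of the d columns of X whose marginal
    correlations with y are largest in absolute value.

    Assumption: the columns of X have unit norm, as stipulated in this section.
    """
    # One inner product per column: a single matrix-vector product.
    w = X.T @ y
    # argpartition places the d largest |w_i| in the last d slots
    # without performing a full sort.
    top = np.argpartition(np.abs(w), -d)[-d:]
    return np.sort(top)


# Illustrative synthetic example (all sizes and values are hypothetical).
rng = np.random.default_rng(0)
n, p, k, d = 100, 2000, 5, 50
X = rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=0)      # unit-norm columns
beta = np.zeros(p)
beta[:k] = 5.0                      # first k variables are active
y = X @ beta + 0.1 * rng.standard_normal(n)

kept = marginal_correlation_screening(X, y, d)
print("active variables retained:", np.intersect1d(kept, np.arange(k)).size, "of", k)
```

The sketch performs one pass over the design matrix and a partial selection of the correlations, which is why a procedure of this kind remains practical when the number of variables is very large. Whether all active variables survive for a given choice of `d` is precisely the question that the sure screening analysis addresses.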
The computational complexity of Algorithm 1 is only and its ability to screen ultrahigh-dimensional linear models has been investigated in recent years by a number of researchers [24, 25]. The fundamental difference among these works stems from the manner in which the parameter (the dimension of the screened model) is computed from (1). Our goal in this paper is to provide an extended understanding of the screening performance of Algorithm 1 for arbitrary (random or deterministic) design matrices. The term sure independence screening (SIS) was coined in  to refer to screening of ultrahigh-dimensional Gaussian linear models using Algorithm 1. In this vein, we refer to variable screening using Algorithm 1 and the analysis of this paper as extended sure independence screening (ExSIS). The main research challenge for ExSIS is specification of for arbitrary matrices such that with high probability. Note that there is an inherent trade-off in addressing this challenge: the higher the value of , the more likely is to satisfy the sure screening property; however, the smaller the value of , the lower the computational cost of performing model selection, regression, estimation, etc., on the reduced problem. This leads us to the following research questions for ExSIS: (i) What are the conditions on under which ? (ii) How small can be for arbitrary matrices such that ? (iii) What are the constraints on the sparsity parameter under which ? Note that there is also an interplay between the sparsity level and the allowable value of for sure screening: the lower the sparsity level, the easier it should be to screen a larger number of columns of . Thus, an understanding of ExSIS also requires characterization of this relationship between and for marginal correlation-based screening. In the sequel, we not only address the aforementioned questions for ExSIS, but also characterize this relationship.
III Sufficient Conditions for Sure Screening
In this section, we derive the most general sufficient conditions for ExSIS of ultrahigh-dimensional linear models. The results reported in this section provide important insights into the workings of ExSIS without imposing any statistical priors on and . We begin with a definition of the screening condition for the design matrix .
Definition 1 (Screening Condition).
Fix an arbitrary that is sparse. The (normalized) matrix satisfies the screening condition if there exists such that the following hold:
The screening condition is a statement about the collinearity of the independent variables in the design matrix. The parameter in the screening condition captures the similarity between (i) the columns of , and (ii) the columns of and ; the smaller the parameter is, the less similar the columns are. Furthermore, since in the screening condition, the parameter reflects constraints on the sparsity parameter .
We now present one of our main screening results for arbitrary design matrices, which highlights the significance of the screening condition and the role of the parameter within ExSIS.
Theorem 1 (Sufficient Conditions for ExSIS).
Let with a sparse vector and the entries of independently distributed as . Define and , and let be the event . Suppose satisfies the screening condition and assume . Then, conditioned on , Algorithm 1 satisfies as long as .
We refer the reader to Sec. III-B for a proof of this theorem.
Theorem 1 highlights the dependence of ExSIS on the observation noise, the ratio , the parameter , and model sparsity. We first comment on the relationship between ExSIS and observation noise . Notice that the statement of Theorem 1 is dependent upon the event . However, for any , we have (see, e.g., [45, Lemma 6])
Therefore, substituting in (2), we obtain
Thus, Algorithm 1 possesses the sure screening property in the case of the observation noise distributed as . We further note from the statement of Theorem 1 that the higher the signal-to-noise ratio (SNR), defined here as , the more irrelevant/inactive variables Algorithm 1 can screen. It is also worth noting here trivial generalizations of Theorem 1 for other noise distributions. In the case of distributed as , Theorem 1 has replaced by the largest eigenvalue of the covariance matrix . In the case of following a non-Gaussian distribution, Theorem 1 has replaced by a distribution-specific upper bound on that holds with high probability.
In addition to the noise distribution, the performance of ExSIS also seems to be impacted by the minimum-to-signal ratio (MSR), defined here as . Specifically, the higher the MSR, the more Algorithm 1 can screen inactive variables. Stated differently, the independent variable with the weakest contribution to the response determines the size of the screened model. Finally, the parameter in the screening condition also plays a central role in characterization of the performance of ExSIS. First, the smaller the parameter , the more Algorithm 1 can screen inactive variables. Second, the smaller the parameter , the more independent variables can be active in the original model; indeed, we have from the screening condition that . Third, the smaller the parameter , the lower the smallest allowable value of MSR; indeed, we have from the theorem statement that .
It is evident from the preceding discussion that the screening condition (equivalently, the parameter ) is one of the most important factors that helps understand the workings of ExSIS and helps quantify its performance. Unfortunately, the usefulness of this knowledge is limited in the sense that the screening condition cannot be utilized in practice. Specifically, the screening condition is defined in terms of the set , which is of course unknown. We overcome this limitation of Theorem 1 by implicitly deriving the screening condition for sub-Gaussian design matrices in Sec. IV and for a class of arbitrary (random or deterministic) design matrices in Sec. V.
III-B Proof of Theorem 1
We first provide an outline of the proof of Theorem 1, which is followed by its formal proof. Define , , and . Next, fix a positive integer and define
The idea is to first derive an initial upper bound on , denoted by , and then choose ; trivially, we have . As a result, we get
Note that while deriving , we need to ensure ; this in turn imposes some conditions on that also need to be specified. Next, we can repeat the aforementioned steps to obtain from for a fixed positive integer . Specifically, define
and . We can then derive an upper bound on , denoted by , and then choose ; once again, we have . Notice further that we do require , which again will impose conditions on .
In similar vein, we can keep on repeating this procedure to obtain a decreasing sequence of numbers and sets as long as , where and . The complete proof of Theorem 1 follows from a careful combination of these (analytical) steps. In order for us to be able to do that, however, we need two lemmas. The first lemma provides an upper bound on for , denoted by . The second lemma provides conditions on the design matrix such that . The proof of the theorem follows from repeated application of the two lemmas.
Lemma 1. Fix and suppose , where and . Further, suppose the design matrix satisfies the screening condition for the sparse vector and the event holds true. Finally, define . Under these conditions, we have
The proof of this lemma is provided in Appendix A. The second lemma, whose proof is given in Appendix B, provides conditions on under which the upper bound derived on for , denoted by , is non-trivial.
Lemma 2. Fix . Suppose and . Then, we have .
We are now ready to present a complete technical proof of Theorem 1.
The idea is to use Lemma 1 and Lemma 2 repeatedly to screen columns of . Note, however, that this is simply an analytical technique and we do not actually need to perform such an iterative procedure to specify in Algorithm 1. To begin, recall that we have , ,
IV Screening of Sub-Gaussian Design Matrices
In this section, we characterize the implications of Theorem 1 for ExSIS of the family of sub-Gaussian design matrices. As noted in Sec. III, this effort primarily involves establishing the screening condition for sub-Gaussian matrices and specifying the parameter for such matrices. We begin by first recalling the definition of a sub-Gaussian random variable.
Definition 2. A zero-mean random variable is said to follow a sub-Gaussian distribution if there exists a sub-Gaussian parameter such that for all .
In words, a sub-Gaussian random variable is one whose moment generating function is dominated by that of a Gaussian random variable. Some common examples of sub-Gaussian random variables include:
Our focus in this paper is on design matrices in which entries are first independently drawn from sub-Gaussian distributions and then the columns are normalized. In contrast to prior works, however, we do not require the (pre-normalized) entries to be identically distributed. Rather, we allow each independent variable to be distributed as a sub-Gaussian random variable with a different sub-Gaussian parameter. Thus, the ExSIS analysis of this section is applicable to design matrices in which different columns might have different sub-Gaussian distributions. It is also straightforward to extend our analysis to the case where all entries of the design matrix (and not just those in different columns) are non-identically distributed; we do not focus on this extension here for the sake of notational clarity.
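As an illustration of the class of design matrices considered in this section, the sketch below draws each column independently from a zero-mean sub-Gaussian distribution with its own scale and then normalizes the columns. The three distributions used (Gaussian, scaled Rademacher, and bounded uniform) are standard examples of sub-Gaussian laws chosen here for illustration; the variable names `V`, `X`, and `scales`, as well as the sizes, are assumptions and do not come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 300   # hypothetical sample size and number of variables

# Hypothetical per-column scales, so that different columns follow
# different sub-Gaussian distributions.
scales = rng.uniform(0.5, 2.0, size=p)

V = np.empty((n, p))
for j in range(p):
    kind = j % 3
    if kind == 0:
        # Zero-mean Gaussian with standard deviation scales[j].
        V[:, j] = scales[j] * rng.standard_normal(n)
    elif kind == 1:
        # Scaled Rademacher: +scales[j] or -scales[j] with equal probability.
        V[:, j] = scales[j] * rng.choice([-1.0, 1.0], size=n)
    else:
        # Uniform on [-scales[j], scales[j]]: bounded and zero-mean.
        V[:, j] = rng.uniform(-scales[j], scales[j], size=n)

# Normalize every column to unit Euclidean norm.
X = V / np.linalg.norm(V, axis=0)
```

After normalization every column of `X` has unit norm regardless of its original scale, which matches the unit-norm assumption placed on the design matrix in Sec. II.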
IV-A Main Result
The ExSIS of linear models involving sub-Gaussian design matrices mainly requires establishing the screening condition and characterization of the parameter for sub-Gaussian matrices. We accomplish this by individually deriving (SC-1) and (SC-2) in Definition 1 for sub-Gaussian design matrices in the following two lemmas.
Lemma 3. Let be an matrix with the entries independently distributed as with variances . Suppose the design matrix is obtained by normalizing the columns of , i.e., . Finally, fix an arbitrary that is sparse, define , and let . Then, with probability exceeding , we have
Lemma 4. Let be an matrix with the entries independently distributed as with variances . Suppose the design matrix is obtained by normalizing the columns of , i.e., . Finally, fix an arbitrary that is sparse, define , and let . Then, with probability exceeding , we have
The proofs of Lemma 3 and Lemma 4 are provided in Appendix C and Appendix D, respectively. It now follows from a simple union bound argument that the screening condition holds for sub-Gaussian design matrices with probability exceeding . In particular, we have from Lemma 3 and Lemma 4 that for sub-Gaussian matrices. We can now use this knowledge and Theorem 1 to provide the main result for ExSIS of ultrahigh-dimensional linear models involving sub-Gaussian design matrices.
Theorem 2 (ExSIS and Sub-Gaussian Matrices).
Let be an matrix with the entries independently distributed as with variances . Suppose the design matrix is obtained by normalizing the columns of , i.e., . Next, let with a sparse vector and the entries of independently distributed as . Finally, define and , and let and . Then Algorithm 1 guarantees with probability exceeding as long as
Let be the event that the design matrix satisfies the screening condition with parameter . Further, let be the event as defined in Theorem 1. It then follows from Lemma 3, Lemma 4, (3), and the union bound that the event holds with probability exceeding . The advertised claim now follows directly from Theorem 1. ∎
Since Theorem 2 follows from Theorem 1, it shares many of the insights discussed in Sec. III-A. In particular, Theorem 2 allows for exponential scaling of the number of independent variables, , and dictates that the number of independent variables, , retained after the screening stage be increased with an increase in the sparsity level and/or the number of independent variables, while it can be decreased with an increase in the SNR, MSR, and/or the number of samples. Notice that the lower bound on in Theorem 2 does require knowledge of the sparsity level. However, this limitation can be overcome in a straightforward manner, as shown below.
Let be an matrix with the entries independently distributed as with variances . Suppose the design matrix is obtained by normalizing the columns of , i.e., . Next, let with a sparse vector and the entries of independently distributed as . Further, define and . Finally, let , , and for some constants . Then Algorithm 1 guarantees with probability exceeding as long as .
A few remarks are in order now concerning our analysis of ExSIS for sub-Gaussian design matrices and that of SIS for random matrices in the existing literature. To this end, we focus on the results reported in , which is one of the most influential SIS works. In contrast to the screening condition presented in this paper, the analysis in  is carried out for design matrices that satisfy a certain concentration property. Since the said concentration property has only been shown in  to hold for Gaussian matrices, our discussion in the following is limited to Gaussian design matrices with independent entries.
The SIS results reported in  hold under four specific conditions. In particular, Condition 3 in  requires that: (i) the variance of the response variable is , (ii) for some , , and (iii) for some . Notice, however, that the variance condition is equivalent to having . Our analysis, in contrast, imposes no such restriction. Rather, Theorem 2 shows that marginal correlation-based sure screening is fundamentally affected by the MSR . While Theorem 2 is only concerned with sufficient conditions, numerical experiments reported in Sec. VI confirm this dependence. Next, notice that implies . It therefore follows that (SC-1) in the screening condition is a non-statistical variant of the condition in .
We next assume for the sake of simplicity of argument and explicitly compare Theorem 2 and [26, Theorem 1] for the case of Gaussian design matrices with independent entries. Similar to , we also impose the condition for comparison purposes. In this setting, both the theorems guarantee sure screening with high probability. In [26, Theorem 1], this requires for and for some . It is, however, easy to verify that substituting and in Theorem 2 results in identical constraints of and for our analysis. Next, [26, Theorem 1] also imposes the sparsity constraint for the sure screening result to hold. However, the condition with reduces this constraint to , which matches the sparsity constraint imposed by Theorem 2 (cf. Corollary 2.1). To summarize, the ExSIS results derived in this paper coincide with the ones in  for the case of Gaussian design matrices. However, our results are more general in the sense that they explicitly bring out the dependence of Algorithm 1 on the SNR and the MSR, which is something missing in , and they are applicable to sub-Gaussian design matrices.
V Screening of Arbitrary Design Matrices
The ExSIS analysis in Sec. IV specializes Theorem 1 for sub-Gaussian design matrices. But what about the design matrices in which either the entries do not follow sub-Gaussian distributions or the statistical distributions of entries are unknown? We address this particular question in this section by deriving verifiable sufficient conditions that guarantee the screening condition for any arbitrary (random or deterministic) design matrix. These sufficient conditions are presented in terms of two measures of similarity among the columns of a design matrix. These measures, termed worst-case coherence and average coherence, are defined as follows.
Definition 3 (Worst-case and Average Coherences).
Notice that both the worst-case and the average coherences are readily computable in polynomial time. Heuristically, the worst-case coherence is an indirect measure of pairwise similarity among the columns of: with as the columns of become less similar and as at least two columns of become more similar. The average coherence, on the other hand, is an indirect measure of both the collective similarity among the columns of and the spread of the columns of within the unit sphere: with as the columns of become more spread out in and as the columns of become less spread out. We refer the reader to  for further discussion of these two measures as well as their values for commonly encountered matrices.
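To make the computability claim concrete, the sketch below evaluates both coherence measures from the Gram matrix of a design matrix with unit-norm columns. Because the body of Definition 3 does not appear above, the code assumes the definitions commonly used in the coherence literature: the worst-case coherence is taken to be the largest absolute inner product between two distinct columns, and the average coherence is taken to be the largest absolute sum of a column's inner products with all other columns, divided by the number of columns minus one. The function name `coherences` is illustrative.

```python
import numpy as np


def coherences(X):
    """Return (worst-case coherence, average coherence) of X.

    Assumptions: X has unit-norm columns and at least two columns; the
    formulas follow the commonly used definitions stated in the lead-in.
    """
    G = X.T @ X                   # Gram matrix of pairwise inner products
    np.fill_diagonal(G, 0.0)      # exclude each column's product with itself
    p = G.shape[0]
    mu = np.max(np.abs(G))                         # worst-case coherence
    nu = np.max(np.abs(G.sum(axis=1))) / (p - 1)   # average coherence
    return mu, nu


# Illustrative use on a random matrix with normalized columns.
rng = np.random.default_rng(0)
A = rng.standard_normal((64, 256))
A /= np.linalg.norm(A, axis=0)
mu, nu = coherences(A)
print(f"worst-case coherence: {mu:.3f}, average coherence: {nu:.4f}")
```

Forming the full Gram matrix requires time and memory that grow quadratically with the number of columns, so for very large problems one would likely compute it in column blocks; the quantities themselves remain computable in polynomial time.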
We are now ready to describe the main results of this section. The first result connects the screening condition to the worst-case coherence. We will see, however, that this result suffers from the so-called square-root bottleneck: ExSIS analysis based solely on the worst-case coherence can, at best, handle scaling of the sparsity parameter. The second result overcomes this bottleneck by connecting the screening condition to both worst-case and average coherences. The caveat here is that this result imposes a mild statistical prior on the set .
V-A ExSIS and the Worst-case Coherence
We begin by relating the worst-case coherence of an arbitrary design matrix with unit-norm columns to the screening condition.
Lemma 5 (Worst-case Coherence and the Screening Condition).
Let be an design matrix with unit-norm columns. Then, we have
The proof of this lemma is provided in Appendix E. It follows from Lemma 5 that a design matrix satisfies the screening condition with parameter as long as . We now combine this implication of Lemma 5 with Theorem 1 to provide a result for ExSIS of arbitrary linear models.
Theorem 3. Let with a sparse vector and the entries of independently distributed as . Suppose and . Then, Algorithm 1 satisfies with probability exceeding as long as .
The proof of this theorem follows directly from Lemma 5 and Theorem 1. Next, a straightforward corollary of Theorem 3 shows that ExSIS of arbitrary linear models can in fact be carried out without explicit knowledge of the sparsity parameter .
Let with a sparse vector and the entries of independently distributed as . Suppose , , and for some . Then, Algorithm 1 satisfies with probability exceeding as long as .