Global testing under sparse alternatives for single index models

05/04/2018
by Qian Lin, et al.

For the single index model y = f(β^τ x, ϵ) with Gaussian design, where β is a sparse p-dimensional unit vector with at most s nonzero entries, we are interested in testing the null hypothesis that β, viewed as a whole vector, is zero against the alternative that some entries of β are nonzero. Assuming that var(E[x | y]) is non-vanishing, we define the generalized signal-to-noise ratio (gSNR) λ of the model as the unique non-zero eigenvalue of var(E[x | y]). We show that if s²log²(p) ∧ p is of a smaller order than n, denoted as s²log²(p) ∧ p ≺ n, where n is the sample size, one can detect the existence of signals if and only if gSNR ≻ p^{1/2}/n ∧ s log(p)/n. Furthermore, if the noise is additive (i.e., y = f(β^τ x) + ϵ), one can detect the existence of the signal if and only if gSNR ≻ p^{1/2}/n ∧ s log(p)/n ∧ 1/√n. It is rather surprising that the detection boundary for the single index model with additive noise matches that for linear regression models. These results pave the road for a thorough theoretical analysis of single/multiple index models in high dimensions.



1 Introduction

Testing whether a subset of covariates has any relationship with a quantitative response is one of the central problems in statistical analyses. Most existing literature focuses on linear relationships. The analysis of variance (ANOVA), introduced by Fisher in the 1920s, has been a main tool for the statistical analysis of experiments and is routinely used in countless applications. Under the simple normal mean model

y_i = μ_i + ϵ_i,  ϵ_i ∼ N(0, σ²),  i = 1, …, n,

one-way ANOVA tests the null hypothesis H₀: μ₁ = ⋯ = μ_n = 0 against the alternative hypothesis H₁: at least one μ_i ≠ 0. Although the test cannot indicate which μ_i's are nonzero, ANOVA is powerful in testing the global null against alternatives of the form ‖μ‖₂ ≥ ρ. Arias-Castro et al. (2011b) extended these results to the linear model

y = Xβ + ϵ,   (1)

where ϵ ∼ N(0, σ² I_n), to test whether all the β_j's are zero. This can be formulated as the following pair of null and alternative hypotheses:

H₀: β = 0  versus  H₁: β ∈ B_s, ‖β‖₂ ≥ ρ,   (2)

where B_s denotes the set of s-sparse vectors in ℝ^p, i.e., vectors whose number of non-zero entries is no greater than s, and ρ is the separation level. Arias-Castro et al. (2011b) and Ingster et al. (2010) showed that one can detect the signal if and only if the SNR exceeds the order of p^{1/2}/n ∧ s log(p)/n. The upper bound is guaranteed by an asymptotically most powerful test based on higher criticism (Donoho and Jin, 2004).

The linearity or other functional-form assumptions are often too restrictive in practice. Theoretical and methodological developments beyond parametric models are important, urgent, and extremely challenging. As a first step towards nonparametric global testing, we here study the single index model

y = f(β^τ x, ϵ),

where f is an unknown function. Our goal is to test the global null hypothesis that all entries of β are zero. The first challenge is to find an appropriate formulation of the alternative hypotheses, because the norm ‖β‖₂ used in (2) is not even identifiable in single index models.

When β is nonzero in a single index model, the unique non-zero eigenvalue λ of var(E[x | y]) can be viewed as the generalized signal-to-noise ratio (gSNR) (Lin et al., 2018b). In Section 2, we show that for the linear regression model, λ is almost proportional to the SNR when it is small. The alternative hypotheses in (2) can therefore be rewritten in terms of a lower bound on λ. Because of this connection, we can treat λ as the separation quantity for the single index model and consider the following contrasting hypotheses:

H₀: λ = 0  versus  H₁: β ∈ B_s, λ ≥ ϱ.

We show that, under certain regularity conditions, one can detect a non-zero gSNR if and only if λ ≻ p^{1/2}/n ∧ s log(p)/n ∧ 1/√n for the single index model with additive noise.

This is a strong and surprising result because this detection boundary is the same as that for the linear model. Using ideas from sliced inverse regression (SIR) (Li, 1991), we show that this boundary can be achieved by the proposed Spectral test Statistic based on SIR (SSS) and the ANOVA-assisted SSS (SSSa). Although SIR has been advocated as an effective alternative to linear multivariate analysis (Chen and Li, 1998), the existing literature did not provide satisfactory theoretical foundations until recently (Lin et al., 2018a, b, 2017). We believe that the results in this paper provide further supporting evidence for the speculation that "SIR can be used to take the same role as linear regression in model building, residual analysis, regression diagnoses, etc." (Chen and Li, 1998).

In Section 2, after briefly reviewing SIR and related results in linear regression, we state the optimal detection problem and a lower bound for single index models. In Section 3, we first show that the correlation-based Higher Criticism (Cor-HC) developed for linear models fails for single index models, and then propose a test that achieves the lower bound stated in Section 2. Some numerical studies are included in Section 4. We list several interesting implications and future directions in Section 5. Additional proofs and lemmas are included in the appendices.

2 Generalized SNR for Single Index Models

2.1 Notation

The following notation is adopted throughout the paper. For a matrix V, we call the space generated by its column vectors the column space and denote it by col(V). The i-th row and j-th column of a matrix V are denoted by V_{i,·} and V_{·,j}, respectively. For vectors u and v, we denote their inner product by ⟨u, v⟩, and the k-th entry of v by v(k). For two positive numbers a and b, we use a ∨ b and a ∧ b to denote max(a, b) and min(a, b), respectively. Throughout the paper, we use C, C′, C₁, and C₂ to denote generic absolute constants, though their actual values may vary from case to case. For two sequences a_n and b_n, we write a_n ≳ b_n (resp. a_n ≲ b_n) if there exists a positive constant C (resp. C′) such that a_n ≥ C b_n (resp. a_n ≤ C′ b_n). We write a_n ≍ b_n if both a_n ≳ b_n and a_n ≲ b_n hold, and a_n ≻ b_n (resp. a_n ≺ b_n) if a_n/b_n → ∞ (resp. a_n/b_n → 0). The operator norm and the Frobenius norm of a matrix A are denoted by ‖A‖₂ and ‖A‖_F, respectively. For a finite set S, we denote by |S| its cardinality. We also write A_{S,T} for the sub-matrix of A with elements A_{i,j}, i ∈ S, j ∈ T, and v_S for the sub-vector of v with entries v(k), k ∈ S. For any square matrix A, we define λ_min(A) and λ_max(A) as the smallest and largest eigenvalues of A, respectively.

2.2 A brief review of the sliced inverse regression (SIR)

SIR was first proposed by Li (1991) to estimate the central space spanned by β₁, …, β_d based on i.i.d. observations (y_i, x_i), i = 1, …, n, from the multiple index model y = f(β₁^τ x, …, β_d^τ x, ϵ), under the assumption that x follows an elliptical distribution and ϵ is Gaussian. SIR starts by dividing the data into H equal-sized slices according to the order statistics y_{(i)}. To ease notation and arguments, we assume that n = cH for an integer c, and re-express the data as y_{h,j} and x_{h,j}, where h refers to the slice number and j refers to the order number within the slice, i.e., y_{h,j} = y_{((h−1)c+j)}. Here x_{h,j} is the concomitant of y_{h,j}. Let the sample mean in the h-th slice be denoted by x̄_{h,·}; then Λ = var(E[x | y]) can be estimated by:

Λ̂_H = (1/H) Σ_{h=1}^{H} x̄_{h,·} x̄_{h,·}^τ,   (3)

which we may also write as Λ̂_H = (1/H) X̄ X̄^τ, where X̄ denotes the p × H matrix formed by the sample means, i.e., X̄ = (x̄_{1,·}, …, x̄_{H,·}). Thus, col(Λ) is estimated by col(V̂_d), where V̂_d is the matrix formed by the eigenvectors associated with the d largest eigenvalues of Λ̂_H. This is a consistent estimator of col(Λ) under certain technical conditions (Duan and Li, 1991; Hsing and Carroll, 1992; Zhu et al., 2006; Li, 1991; Lin et al., 2017). It is shown in Lin et al. (2017, 2018a) that, for single index models (d = 1), H can be chosen as a fixed number not depending on n, p, or s for the asymptotic results to hold. Throughout this paper, we assume the following mild conditions; a numerical sketch of the slicing estimator (3) follows the list.

  • A1) x ∼ N(0, Σ), and there exist two positive constants C_min and C_max such that C_min ≤ λ_min(Σ) ≤ λ_max(Σ) ≤ C_max.

  • A2) The matrix Λ = var(E[x | y]) is non-vanishing, i.e., λ := λ_max(Λ) > 0.
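
The slicing estimator (3) is simple enough to state in a few lines of code. The following is a minimal sketch, assuming equal-sized slices with H dividing n; the function name and the toy link function are our choices, not the paper's.

```python
# A minimal sketch (our code, not the authors') of the SIR estimate
# Lambda_hat_H in (3): sort by y, slice, and average the outer products
# of the slice means.
import numpy as np

def sir_lambda_hat(X, y, H=10):
    """X: (n, p) design, y: (n,) response; assumes H divides n."""
    n, p = X.shape
    c = n // H                                   # slice size
    X_sorted = X[np.argsort(y)]                  # rows ordered by y
    # slice means: H blocks of c consecutive concomitants
    xbar = X_sorted[:H * c].reshape(H, c, p).mean(axis=1)
    return xbar.T @ xbar / H                     # (1/H) * Xbar Xbar^tau

# Toy usage with an assumed link f(z) = sinh(z):
rng = np.random.default_rng(0)
n, p = 2000, 10
beta = np.zeros(p); beta[0] = 1.0
X = rng.standard_normal((n, p))
y = np.sinh(X @ beta) + 0.5 * rng.standard_normal(n)
print(np.linalg.eigvalsh(sir_lambda_hat(X, y))[-1])  # top eigenvalue ~ gSNR
```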

2.3 Generalized Signal-to-Noise Ratio of Single Index Models

We consider the following single index model:

y = f(β^τ x, ϵ),   (4)

where f is an unknown function. What we want to know is whether the coefficient vector β, viewed as a whole, is zero. This can be formulated as the global testing problem

H₀: β = 0  versus  H₁: β ≠ 0.

When assuming the linear model y = β^τ x + ϵ, whether we can separate the null and the alternative depends on the interplay between the sample size and the norm of β. More precisely, it depends on the signal-to-noise ratio (SNR) defined as

SNR = var(β^τ x)/var(ϵ),

where ϵ ∼ N(0, σ²) (Janson et al., 2017). The SNR is useful for benchmarking prediction accuracy for various model selection techniques such as AIC, BIC, or the Lasso. However, since there is an unknown link function f in the single index model, the norm ‖β‖₂ becomes non-identifiable. Without loss of generality, we restrict ‖β‖₂ = 1 and have to find another quantity to describe the separability.

For the single index model (4), to simplify notation, let us use λ to denote the unique non-zero eigenvalue of var(E[x | y]). For linear models with identity covariance, we can easily show that

var(E[x | y]) = ββ^τ/(‖β‖₂² + σ²).

Consequently, λ = ‖β‖₂²/(‖β‖₂² + σ²) = SNR/(1 + SNR). When assuming condition A2), the ratio λ/SNR is bounded by two finite limits. Thus, λ can be treated as a quantity equivalent to the SNR for linear models, and it is therefore named the generalized signal-to-noise ratio (gSNR) for single index models.
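
This relationship is easy to verify numerically. The following sketch (our code) compares the slice-based estimate of λ with SNR/(1 + SNR) in a small linear model with identity covariance.

```python
# Numerical check (our sketch) that the gSNR of a linear model with
# identity covariance equals ||beta||^2 / (||beta||^2 + sigma^2),
# which is approximately the SNR when the SNR is small.
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma = 200_000, 5, 1.0
beta = np.array([0.2, 0.1, 0.0, 0.0, 0.0])        # weak signal
X = rng.standard_normal((n, p))
y = X @ beta + sigma * rng.standard_normal(n)

H = 20; c = n // H
xbar = X[np.argsort(y)][:H * c].reshape(H, c, p).mean(axis=1)
lam_hat = np.linalg.eigvalsh(xbar.T @ xbar / H)[-1]   # slice-based gSNR

snr = beta @ beta / sigma**2
print(lam_hat, snr / (1 + snr))                   # the two numbers agree
```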

Remark 1.

To the best of our knowledge, although SIR uses the estimated eigenvalues of Λ̂_H to determine the structural dimension (Li (1991)), few investigations have been made into the theoretical properties of this procedure in high dimensions. The only work that uses λ as a parameter to quantify the estimation error when estimating the direction β is Lin et al. (2018a), which, however, does not indicate explicitly what role λ plays. The observation above about λ for single index models provides a useful intuition: λ is a generalized notion of the SNR, and condition A2) merely requires that the gSNR be non-zero.

2.4 Global testing for single index models

As we have discussed, Arias-Castro et al. (2011b) and Ingster et al. (2010) considered the testing problem (2), which can be viewed as determining the detection boundary of the gSNR. Throughout the paper, we consider the following testing problem:

H₀: λ = 0  versus  H₁: β ∈ B_s, λ ≥ ϱ,   (5)

based on n i.i.d. samples (y_i, x_i). Two models are considered: (i) the general single index model (SIM) defined in (4); and (ii) the single index model with additive noise (SIMa) defined as

y = f(β^τ x) + ϵ.   (6)

We assume that conditions A1) and A2) hold for both models.

3 The Optimal Test for Single Index Models

3.1 The detection boundary of linear regression

To set the goal and scope, we briefly review some related results on the detection boundary for linear models (Arias-Castro et al., 2011b; Ingster et al., 2010).

Proposition 1.

Assume that x_i ∼ N(0, I_p), ϵ ∼ N(0, σ²), and that β has at most s non-zero entries. There is a test with both type I and type II errors converging to zero for the testing problem in (2) if and only if

SNR ≻ p^{1/2}/n ∧ s log(p)/n.   (7)

Assuming additionally that the variance of the noise is known, Ingster et al. (2010) obtained the sharp detection boundary (i.e., with the exact asymptotic constant) for the above problem. Since linear models are special cases of SIMa, which in turn is a special subset of SIM, the following statement about the lower bound of detectability is a direct corollary of Proposition 1.

Corollary 1.

If s² log²(p) ∧ p ≺ n, then any test fails to separate the null and the alternative hypothesis asymptotically for SIM when

λ ≺ p^{1/2}/n ∧ s log(p)/n.   (8)

Any test fails to separate the null and the alternative hypothesis asymptotically for SIMa when

λ ≺ p^{1/2}/n ∧ s log(p)/n ∧ 1/√n.   (9)

3.2 Single Index Models

Moving from linear models to single index models is a big step. A natural and reasonable starting point is to consider tests based on the marginal correlations used for linear models (Ingster et al., 2010; Arias-Castro et al., 2011b). However, the following example shows that marginal correlations fail for single index models, indicating that we need to look for other statistics to approximate the gSNR.

Example 1.

Suppose that x ∼ N(0, I_p), ‖β‖₂ = 1, and we have n samples from the following model:

(10)

Simple calculation shows that cov(x_j, y) = 0 for every coordinate j. Thus, correlation-based methods do not work for this simple model. On the other hand, since the link function is monotone when |β^τ x| is sufficiently large, we know that E[x | y] is not a constant and λ = λ_max(var(E[x | y])) > 0.
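
The phenomenon is easy to demonstrate with a link of this flavor. The sketch below uses the Hermite-type link f(z) = z³ − 3z as an assumed stand-in for (10): it is orthogonal to z under the Gaussian measure, so every marginal covariance vanishes, yet it is monotone for |z| > 1, and SIR picks up a clearly non-zero top eigenvalue.

```python
# Marginal correlations vs. SIR on an assumed cubic Hermite link
# y = z^3 - 3z + noise, z = beta^tau x (our stand-in for model (10)).
import numpy as np

rng = np.random.default_rng(2)
n, p = 5000, 50
beta = np.zeros(p); beta[0] = 1.0
X = rng.standard_normal((n, p))
z = X @ beta
y = z**3 - 3 * z + 0.5 * rng.standard_normal(n)

print(np.max(np.abs(X.T @ y) / n))        # all marginal covariances ~ 0

H = 10; c = n // H
xbar = X[np.argsort(y)][:H * c].reshape(H, c, p).mean(axis=1)
print(np.linalg.eigvalsh(xbar.T @ xbar / H)[-1])   # clearly non-zero
```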

Let and be two sequences such that

For a symmetric matrix M and a positive integer s₀ such that s₀ ≤ p, we define the s₀-sparse largest eigenvalue

λ_max^{(s₀)}(M) = max_{S ⊂ {1,…,p}, |S| ≤ s₀} λ_max(M_{S,S}).   (11)

For model , in addition to the condition that , we further assume that .

Let Λ̂_H be the estimate of Λ based on SIR, and let three threshold quantities be chosen to satisfy

(12)

We introduce the following two assisting tests:

  • Define

  • Define

Finally, the Spectral test Statistic based on SIR, abbreviated as SSS, is defined as

(13)
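
To make the construction concrete, the sketch below gives our reading of the two ingredients that SSS combines: the top eigenvalue of Λ̂_H for the moderate-sparsity regime, and the s-sparse top eigenvalue from (11) for the high-sparsity regime. The threshold names tau_dense and tau_sparse stand in for the sequences in (12); the exhaustive subset search is exponential in s, so this is for small problems only.

```python
# A sketch (our reading of (11) and (13)) of the two SSS ingredients:
# reject when either the full top eigenvalue or the s-sparse top
# eigenvalue of the SIR matrix exceeds its threshold.
import numpy as np
from itertools import combinations

def sparse_top_eigenvalue(M, s):
    """max over |S| = s of the top eigenvalue of M restricted to S
    (by eigenvalue interlacing this equals the max over |S| <= s)."""
    p = M.shape[0]
    return max(np.linalg.eigvalsh(M[np.ix_(S, S)])[-1]
               for S in combinations(range(p), s))

def sss_reject(Lambda_hat, s, tau_dense, tau_sparse):
    dense = np.linalg.eigvalsh(Lambda_hat)[-1] >= tau_dense
    sparse = sparse_top_eigenvalue(Lambda_hat, s) >= tau_sparse
    return dense or sparse
```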

To show the theoretical properties of SSS, we impose the following condition on the covariance matrix Σ.

  • A3) Each row of Σ contains at most a bounded number of non-zero entries.

This assumption was first explicitly proposed in Lin et al. (2017), partially motivated by the Separable After Screening (SAS) properties in Ji and Jin (2012). In this paper, we assume this relatively strong condition and focus on establishing the detection boundary. The condition could potentially be relaxed by considering a larger class of covariance matrices,

which is used in Arias-Castro et al. (2011a) for analyzing linear models. Our condition corresponds to a special case of that class, and the constraint could be relaxed accordingly. However, the technical details would be much more involved and would obscure the main results. We thus leave this for future investigation.

Theorem 1.

Assume that s² log²(p) ∧ p ≺ n, that λ ≻ p^{1/2}/n ∧ s log(p)/n, and that conditions A1)-A3) hold. Let the two threshold sequences satisfy the conditions in (12). Then the type I and type II errors of the test converge to zero for the testing problem (5) under SIM, i.e., we have

Compared with the test proposed in Ingster et al. (2010), our test statistic is a spectral statistic and depends on the top eigenvalue of Λ̂_H. It is adaptive in the moderate-sparsity scenario. In the high-sparsity scenario, the SSS relies on the s-sparse top eigenvalue, which depends on the sparsity of the vector β. Therefore, SSS is not adaptive to the sparsity level. Both Arias-Castro et al. (2011a) and Ingster et al. (2010) introduced an (adaptive) asymptotically powerful test based on higher criticism (HC) for the testing problem under linear models. It is an interesting research problem to develop an adaptive test using the idea of higher criticism for (5).

3.3 Optimal Test for SIMa

When the noise is additive as in SIMa (6), the detection boundary can be further improved. In addition to conditions A1)-A3), ϵ is further assumed to satisfy the following condition:

  • A4) ϵ is sub-Gaussian, and for some constant , where .

Note that for any fixed function such that , there exists a positive constant such that

(14)

By continuity, we know that (14) holds in a small neighbourhood of , i.e., if is sufficiently small, condition holds for a large class of functions.

First, we adopt the test described in the previous subsection. Since the noise is additive, we include the ANOVA test,

where the threshold sequence satisfies condition (12). Combining this test with the test above, we can introduce the SSS assisted by the ANOVA test (SSSa) as

(15)

We then have the following result.

Theorem 2.

Assume that λ ≻ p^{1/2}/n ∧ s log(p)/n ∧ 1/√n and that conditions A1)-A4) hold. Assume that the threshold sequences satisfy condition (12). Then the type I and type II errors of the test converge to zero for the testing problem under SIMa, i.e., we have

Example continued. For the example in (10), we calculated the test statistic defined by (13) under both the null and the alternative hypotheses. Figure 1 shows the histograms of this statistic under both hypotheses, demonstrating a perfect separation between the null and the alternative. For this example, one of the two statistics in (13) has more discriminating power than the other.

Figure 1: Histograms of the test statistic for model (10). The top panel shows the scores under the null and the bottom panel shows the scores under the alternative. The black vertical line is the 95% quantile under the null.

3.4 Computationally efficient test

Although the test (and its ANOVA-assisted variant) is rate optimal, it is computationally inefficient. Here we propose an efficient algorithm to approximate the sparse spectral statistic via a convex relaxation, similar to the convex relaxation method for estimating the top eigenvector of a semi-definite matrix in Adamczak et al. (2008). To be precise, given the SIR estimate Λ̂_H of Λ, consider the following semi-definite programming (SDP) problem:

(16)
subject to
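
A concrete instance of this relaxation can be written in a few lines with cvxpy. The form below, maximizing tr(Λ̂_H Z) over symmetric Z ⪰ 0 with tr(Z) = 1 and an ℓ₁ budget, is our assumed stand-in for (16) in the spirit of sparse-PCA relaxations; the budget parameter mu is ours.

```python
# A cvxpy sketch of an SDP relaxation of the s-sparse top eigenvalue,
# our assumed stand-in for problem (16).
import cvxpy as cp
import numpy as np

def sdp_statistic(Lambda_hat, mu):
    p = Lambda_hat.shape[0]
    Z = cp.Variable((p, p), symmetric=True)
    constraints = [Z >> 0,                     # positive semidefinite
                   cp.trace(Z) == 1,           # trace normalization
                   cp.sum(cp.abs(Z)) <= mu]    # l1 budget inducing sparsity
    prob = cp.Problem(cp.Maximize(cp.trace(Lambda_hat @ Z)), constraints)
    prob.solve()
    # with mu >= s this value upper-bounds the s-sparse top eigenvalue,
    # since Z = v v^tau is feasible for any unit-norm s-sparse v
    return prob.value
```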

With the solution of (16) in hand, for a sequence satisfying the condition in (12), a computationally feasible test is

Then, for any sequence satisfying the inequality in (12), we define the following computationally feasible alternative:

(17)
Theorem 3.

Assume that s² log²(p) ∧ p ≺ n and that conditions A1)-A3) hold. Then the type I and type II errors of the test converge to zero for the testing problem under SIMa, i.e., we have

Similarly, if we introduce the test

(18)

for three threshold sequences satisfying (12), then we have

Theorem 4.

Assume that s² log²(p) ∧ p ≺ n and that conditions A1)-A4) hold. The test is asymptotically powerful for the testing problem under SIMa, i.e., we have

Theorem 2 and Theorem 4 not only establish the detection boundary of the gSNR for single index models, but also open the door to a thorough understanding of semi-parametric regression with a Gaussian design. It is shown in Lin et al. (2018a) that for single index models satisfying conditions A1), A2), and A3), one has

(19)

This implies that the necessary and sufficient condition for obtaining a consistent estimate of the projection operator P_β is λ ≻ s log(p)/n. On the other hand, Theorems 2 and 4 state that, for single index models with additive noise, if λ ≻ p^{1/2}/n ∧ s log(p)/n ∧ 1/√n, then one can detect the existence of the gSNR (i.e., of a non-trivial direction β). Our results thus imply for SIMa that, if p^{1/2}/n ∧ 1/√n ≺ λ ≺ s log(p)/n, one can detect the existence of a non-zero β but cannot consistently estimate its direction. To estimate the locations of the non-zero coefficients, we must tolerate a certain error rate such as the false discovery rate (Benjamini and Hochberg (1995)). For example, the knockoff procedure (Barber and Candès (2015)), SLOPE (Su and Candes (2016)), and UPT (Ji and Zhao (2014)) might be extended to single index models.

3.5 Practical Issues

In practice, we do not know whether the noise is additive or not. Therefore, we only consider the test statistics in (13). Condition (12) provides a theoretical basis for choosing the threshold sequences. In practice, however, we determine these thresholds by simulating the null distribution of the two statistics. Our final algorithm is as follows; a code sketch follows the algorithm.

1. Calculate the two test statistics in (13) for the given input (X, y);
2. Generate ỹ = (ỹ₁, …, ỹ_n), where ỹ_i ∼ N(0, 1) i.i.d.;
3. Calculate the two test statistics based on (X, ỹ);
4. Repeat Steps 2 and 3 K times to obtain two sequences of simulated statistics, and let the two thresholds be the 95% quantiles of these sequences;
5. Reject the null if either observed statistic exceeds its threshold.
Algorithm 1 Spectral test Statistic based on SIR (SSS) Algorithm
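
A minimal sketch of Algorithm 1 follows; it reuses sparse_top_eigenvalue from the sketch in Section 3.2, and the defaults H = 10 and K = 100 are our choices, not the paper's. Under the null, y is independent of x, so the slicing order is a uniformly random permutation and a pure-noise ỹ calibrates both statistics.

```python
# A sketch of Algorithm 1: calibrate both thresholds by simulating the
# null, where y is replaced by noise generated independently of X.
import numpy as np

def sss_test(X, y, s, H=10, K=100, level=0.95, rng=None):
    rng = rng or np.random.default_rng()
    n, p = X.shape

    def statistics(y_):
        c = n // H
        xbar = X[np.argsort(y_)][:H * c].reshape(H, c, p).mean(axis=1)
        L = xbar.T @ xbar / H                     # SIR matrix for this y
        return np.linalg.eigvalsh(L)[-1], sparse_top_eigenvalue(L, s)

    observed = statistics(y)                                  # Step 1
    null_draws = np.array([statistics(rng.standard_normal(n))
                           for _ in range(K)])                # Steps 2-4
    thresholds = np.quantile(null_draws, level, axis=0)
    return observed[0] >= thresholds[0] or observed[1] >= thresholds[1]
```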

4 Numerical Studies

Let β be the vector of coefficients and let S be the active set of size s, over which the non-zero coefficients are simulated. Let X be the random design matrix with each row following N(0, Σ). We consider two types of covariance matrices: (i) an essentially sparse covariance matrix indexed by a parameter r, with r chosen among 0, 0.3, 0.5, and 0.8; and (ii) a dense covariance matrix with r chosen as 0.2. In all the simulations, p varies among 100, 500, 1,000, and 2,000, and the number of replications is 100. The random error follows a normal distribution. We consider the following models (a sketch of the two covariance designs follows the list):

  • , where ;

  • , where ;

  • , where ;

  • , where .
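
The sketch below encodes the two covariance designs under an assumed reading: type (i) as the AR-type matrix Σ_{ij} = r^{|i−j|} (essentially sparse) and type (ii) as the equicorrelation matrix with unit diagonal and off-diagonal entries r (dense); both formulas are our assumptions.

```python
# Assumed forms of the two covariance designs used in the simulations.
import numpy as np

def make_sigma(p, r, kind="ar"):
    if kind == "ar":                       # type (i): Sigma_ij = r^|i-j|
        idx = np.arange(p)
        return r ** np.abs(idx[:, None] - idx[None, :])
    Sigma = np.full((p, p), r)             # type (ii): equicorrelation
    np.fill_diagonal(Sigma, 1.0)
    return Sigma

def simulate_design(n, p, r, kind="ar", rng=None):
    rng = rng or np.random.default_rng()
    L = np.linalg.cholesky(make_sigma(p, r, kind))
    return rng.standard_normal((n, p)) @ L.T   # rows ~ N(0, Sigma)
```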

Model  p     r     SSS   HC    |  Model  p     r     SSS   HC
I      100   0     1.00  0.16  |  II     100   0     0.98  0.12
I      100   0.3   1.00  0.29  |  II     100   0.3   0.97  0.16
I      100   0.5   0.99  0.54  |  II     100   0.5   0.96  0.24
I      100   0.8   1.00  0.93  |  II     100   0.8   1.00  0.37
I      100   0.2*  0.90  0.35  |  II     100   0.2*  0.96  0.56
I      500   0     0.98  0.16  |  II     500   0     0.87  0.06
I      500   0.3   0.99  0.18  |  II     500   0.3   0.80  0.09
I      500   0.5   0.97  0.34  |  II     500   0.5   0.82  0.13
I      500   0.8   0.98  0.71  |  II     500   0.8   0.83  0.14
I      500   0.2*  0.52  0.25  |  II     500   0.2*  0.77  0.32
I      1000  0     0.89  0.19  |  II     1000  0     0.81  0.09
I      1000  0.3   0.88  0.16  |  II     1000  0.3   0.74  0.06
I      1000  0.5   0.91  0.33  |  II     1000  0.5   0.77  0.08
I      1000  0.8   0.96  0.53  |  II     1000  0.8   0.84  0.11
I      1000  0.2*  0.37  0.30  |  II     1000  0.2*  0.69  0.25
I      2000  0     0.92  0.18  |  II     2000  0     0.75  0.11
I      2000  0.3   0.86  0.25  |  II     2000  0.3   0.68  0.12
I      2000  0.5   0.83  0.43  |  II     2000  0.5   0.68  0.13
I      2000  0.8   0.90  0.60  |  II     2000  0.8   0.81  0.10
I      2000  0.2*  0.43  0.17  |  II     2000  0.2*  0.63  0.41
III    100   0     1.00  0.21  |  IV     100   0     0.89  0.01
III    100   0.3   1.00  0.25  |  IV     100   0.3   0.91  0.03
III    100   0.5   1.00  0.63  |  IV     100   0.5   0.89  0.04
III    100   0.8   1.00  1.00  |  IV     100   0.8   1.00  0.10
III    100   0.2*  0.98  0.78  |  IV     100   0.2*  0.94  0.07
III    500   0     0.99  0.11  |  IV     500   0     0.70  0.03
III    500   0.3   1.00  0.12  |  IV     500   0.3   0.57  0.04
III    500   0.5   0.98  0.11  |  IV     500   0.5   0.57  0.07
III    500   0.8   0.99  0.22  |  IV     500   0.8   0.69  0.09
III    500   0.2*  0.62  0.72  |  IV     500   0.2*  0.45  0.08
III    1000  0     0.99  0.11  |  IV     1000  0     0.55  0.07
III    1000  0.3   0.97  0.06  |  IV     1000  0.3   0.56  0.04
III    1000  0.5   0.97  0.18  |  IV     1000  0.5   0.51  0.09
III    1000  0.8   0.92  0.10  |  IV     1000  0.8   0.73  0.06
III    1000  0.2*  0.60  0.59  |  IV     1000  0.2*  0.44  0.08
III    2000  0     0.96  0.16  |  IV     2000  0     0.58  0.07
III    2000  0.3   0.97  0.19  |  IV     2000  0.3   0.47  0.07
III    2000  0.5   0.93  0.15  |  IV     2000  0.5   0.45  0.09
III    2000  0.8   0.88  0.10  |  IV     2000  0.8   0.61  0.02
III    2000  0.2*  0.59  0.58  |  IV     2000  0.2*  0.40  0.08
Table 1: Power comparison of SSS and HC for the four models I-IV under different parameter settings; r is the covariance parameter. The symbol "*" indicates the type (ii) covariance matrix.

Computing the full null calibration of Algorithm 1 for every replication would take an extremely long time. Therefore, in the simulations we calculate the thresholds in a way slightly different from Algorithm 1: for each generated data set, we simulate only one vector ỹ with ỹ_i ∼ N(0, 1) and calculate the two statistics on (X, ỹ). The thresholds are then chosen as the 95% quantiles of the corresponding sequences pooled across all replications.

For each generated data set, we also calculated the Cor-HC scores according to Arias-Castro et al. (2012); a sketch of a generic Cor-HC score appears below. The threshold is chosen by the same scheme as for the SSS thresholds: we calculate the Cor-HC scores based on (X, ỹ) with ỹ_i ∼ N(0, 1), and the threshold is the 95% quantile of these simulated scores. The null hypothesis is rejected if the Cor-HC score exceeds this threshold. The power of both methods is calculated as the proportion of rejections among the 100 replications. These numbers are reported in Table 1.
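
For completeness, here is a sketch of a generic correlation-based higher criticism score in the spirit of Donoho and Jin (2004); the standardization details of Arias-Castro et al. (2012) are simplified, so this is an illustrative stand-in rather than their exact procedure.

```python
# A generic Cor-HC sketch: higher criticism applied to p-values of
# marginal correlations between each covariate and the response.
import numpy as np
from scipy.stats import norm

def cor_hc_score(X, y):
    n, p = X.shape
    Xs = (X - X.mean(0)) / X.std(0)
    ys = (y - y.mean()) / y.std()
    z = Xs.T @ ys / np.sqrt(n)                # approx N(0, 1) under the null
    pv = np.sort(2 * norm.sf(np.abs(z)))      # sorted two-sided p-values
    k = np.arange(1, p + 1)
    hc = np.sqrt(p) * (k / p - pv) / np.sqrt(pv * (1 - pv) + 1e-12)
    return hc[: p // 2].max()                 # maximize over the lower half
```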

It is clearly seen that the power of SSS decreases as the dimension increases. Nevertheless, the power of SSS is better than that of Cor-HC in all but one case. In Figure 2, we plot the histogram of the statistic under the null in the top-left panel and under the alternative in the bottom-left panel for Model III with the type (i) covariance matrix. The test statistics are well separated under the null and the alternative. In contrast, Cor-HC fails to distinguish between the null and the alternative, as shown in the two panels on the right.

Figure 2: Model III with the type (i) covariance matrix.

To see how the performance of Cor-HC varies, we consider the following model

  • , where

Under the type (i) covariance matrix, the power of both methods is displayed in Figure 3. The coefficient determines the magnitude of the marginal correlation between the active predictors and the response. It is seen that when this coefficient is close to 16, representing the case of diminishing marginal correlation, the power of Cor-HC drops to its lowest. Under all the models, SSS is more powerful in detecting the existence of the signal.

Figure 3: Power of Model V for the type (i) covariance matrix.

To observe the influence of the signal-to-noise ratio on the power of the tests, we consider the following two models

  • , where ;

  • , where .

Here the coefficient controls the signal strength. Setting the other parameters as before, we plot the power of both methods against this coefficient in Figure 4. It is clearly seen that for both examples there is a sharp "phase transition" in the power of SSS as the signal strength increases, validating our theory about the detection boundary. In both examples, SSS is much more powerful than Cor-HC.

Figure 4: Power of Models VI and VII for the type (i) covariance matrix.

5 Discussion

Assuming that var(E[x | y]) is non-vanishing, we show in this paper that λ, the unique non-zero eigenvalue of var(E[x | y]) associated with the single index model, is a generalization of the SNR. We demonstrate a surprising similarity between linear regression and single index models with Gaussian design: the detection boundary of the gSNR for the testing problem (5) under SIMa matches that of the SNR for linear models (2). This similarity provides additional support for the speculation that "the rich theories developed for linear regression can be extended to the single/multiple index models" (Lin et al., 2018b; Chen and Li, 1998).

Besides the gap we explicitly depicted between the detection and estimation boundaries, we mention several other directions that might be of interest to researchers. First, although this paper only deals with single index models, the results obtained here are very likely extendable to multiple index models. Assume that the noise is additive, and let λ₁ ≥ ⋯ ≥ λ_d be the non-zero eigenvalues of the matrix var(E[x | y]) of a multiple index model. Similar arguments can show that the k-th direction is detectable if λ_k lies above the detection boundary. New ideas and technical preparations might be needed for a rigorous determination of the lower bound of the detection boundary. Second, the framework can be extended to study theoretical properties of other sufficient dimension reduction algorithms such as SAVE and directional regression (Lin et al. (2017, 2018a, 2018b)).

6 Acknowledgment

We thank Dr. Zhisu Zhu for his generous help with SDP.

APPENDIX: PROOFS

Appendix A Assisting Lemmas

Since our approach is based on the technical tools developed in Lin et al. (2017, 2018a, 2018b), we briefly restate the necessary (modified) statements without proofs below.

Lemma 1.

Let . Let be positive constants satisfying . Then for any , we have

(20)
Lemma 2.

Suppose that a matrix formed by dimensional vector , where for some constants and . We have

(21)

with probability at most

for some positive constant . In particular, we know that

(22)

happens with probability at least .

Lemma 3.

Assume that . Let be a matrix, where and are scalar, is a vector and is a matrix satisfying

(23)

for a constant where . Then we have

(24)

Sliced approximation inequality

The next result is referred to as the 'key lemma' in Lin et al. (2017, 2018a, 2018b), which depends on the following sliced stable condition.

Definition 1 (Sliced stable condition).

For , let denote all partitions of satisfying that

A curve is -sliced stable with respect to y, if there exist positive constants and large enough such that for any , for any partition in and any , one has:

(25)

A curve is sliced stable if it is -sliced stable for some positive constant .

The sliced stable condition is a mild condition. Neykov et al. (2016) derived it from a modification of the regularity condition proposed in Hsing and Carroll (1992). The inequality (25) implies the following deviation inequality for multiple index models; for our purposes, we modify it for single index models.

Lemma 4.

Assume that Conditions , and the sliced stable condition (for some ) hold in the single index model . Let be the SIR estimate of , and let be the projection matrix associated with the column space of . For any vector and any , let . There exist positive constants , and such that for any and satisfying that , one has

(26)

Lin and Liu (2017) recently proved a similar deviation inequality without the sliced stable condition.

Appendix B Proof of Theorems

Proof of Theorem 1

Theorem 1 follows from Lemma 5 and Lemma 6 below.

Lemma 5.

Assume that , and be a sequence such that . Then, as , we have:

Under