High-throughput technology in biology stimulates new challenges in high-dimensional data analysis. For example, recent genomic studies have suggested complex, polygenic bases for complex traits — hundreds of genetic variants are involved in conferring disease risk; individual variants have low effect but variants in aggregate modify disease susceptibility at the gene, pathway or network level. One major goal of the high-throughput biological research, such as genome-wide association studies (GWAS), epigenome-wide association studies (EWAS), and expression quantitative trait locus (eQTL), is to elucidate the joint mechanisms of a genome-wide set of genomic variables on the trait of interests. The exploration of polygenicity requires sophisticated approaches based on simultaneously analyzing genome-wide variables, e.g., pathway based analysis (Kao et al. (2017) and reference therein), polygenic modeling for assessing SNP main or interaction effects (Waldmann et al. (2013) and reference therein, Wu et al. (2010), Hung et al. (2016)) or for genetic prediction (Abraham et al. (2013) and reference therein). However, due to the high dimension of genome-wide variables and limited sample size, these simultaneous analyses have to be coupled with a pre-screening step to reduce the data dimension.
Pre-screening can greatly impact subsequent analyses. When individual variants have small effect, pre-screening that is too stringent may fail to capture them for follow-up studies. On the other hand, a pre-screening that is too liberal can hurt the performance of the subsequent simultaneous analyses by including too many noise variables. In current practice, pre-screening on genome-wide variables is often performed by selecting the SNPs with p-values less than an arbitrary threshold (e.g., p-value in Zhou et al. (2011), and in Wu et al. (2009)). In this work, we aim to develop a data-driven method that is adaptive to the underlying data features so that a high proportion of signals can be selected without incurring many unnecessary noise variables.
One major challenge in developing a data-adaptive method for signal inclusion is how to effectively accommodate the unknown signal information such as signal sparsity and intensity. When signals are much rarer than noise, inference based on signal information is more challenging than inference based on noise distribution. Consequently, retaining signals through false negative control requires different techniques from those used for false positive control. Another challenge for data-adaptive signal inclusion is how to accommodate dependence among variables in real applications. For example, in genomic data analysis, a Single Nucleotide Polymorphism (SNP) is usually strongly correlated with the SNPs nearby due to linkage disequilibrium. The dependence among SNPs can dramatically effect the test statistics and confound inference.
In this paper, we propose a new analytic framework for efficient signal inclusion under dependence. We first discuss sensible criteria for false negative control and propose a new measure called signal missing rate (SMR). Compared to existing measures, SMR assesses the exceedance probability of false negative proportion and incorporates the variability of false negative proportion into inferences.
Next, we develop data-adaptive procedures to control SMR under dependence. The first procedure, conservative SMR (cvSMR), utilizes existing techniques in multiple testing to control false discovery proportion at a stochastic level involving signal information and, consequently, control the measure of SMR at a low level. cvSMR is quite intuitive and easy to implement. However, it tends to be overly conservative for false negative control and can include too many noise variables.
In order to improve the efficiency of signal inclusion, we propose the second method, Adaptive SMR (AdSMR). The main difference between AdSMR and cvSMR is that AdSMR implements a much relaxed critical sequence in its selection rule, which results in a smaller subset of selected variables. The new critical sequence is established by novel theoretical analysis on the variability of false negative proportion through concentration properties of order statistics under dependence. The improved AdSMR procedure and the new analytic techniques guarantee the control of SMR at a degenerating level and the control of unnecessary false positives. Although the implementation of AdSMR does not need signal information, the cut-off position of AdSMR automatically vary with signal sparsity and intensity, and, as a result, when signal intensity become stronger, both false negative and false positive can be better controlled by AdSMR.
A by-product in the study is a consistent estimator for the number of signals under block dependence that is widely observed in genomic data. Existing studies on signal proportion estimation mainly assume independence(Meinshausen and Rice, 2006; Jin and Cai, 2007). Consistent estimation under dependence is not only useful for signal inclusion as described in this paper, but also valuable in other areas, such as to improve the performance of FDR-based methods in multiple testing.
We compare the finite-sample performances of the proposed AdSMR method and existing methods in simulation. The simulation settings include different sparsity and intensity levels of signals, block dependence with various block sizes, sparse dependence without block structures, and dependence structure from a multi-factor model. While all the methods seem to be effective in false negative control, AdSMR generally outperforms other methods in incurring less false positives.
We apply AdSMR to a GWA analysis on human height using the CoLaus data, in which all autosomal SNPs explain of the phenotypic variability of human height. Multiple testing based on the full set of SNPs identifies zero significant candidates because individual variants have small effects. In order to significantly reduce data dimension and carry as many relevant SNPs to subsequent polygenic analyses, we apply AdSMR and select only a small proportion () of the total SNPs, which explains nearly all the of the height variation attained using the full set of SNPs. We further apply penalized regression on the SNPs selected by AdSMR and narrow down the number of selected SNPs to . The estimated heritability is still close to based on only the SNPs. The selected subset would include a high proportion of truly relevant SNPs for further downstream analyses such as gene annotation, pathway mapping, polygenic risk score, etc.
The rest of the paper is organized as follows. Section 2 introduces the signal missing rate and develop procedures for SMR control under dependence. Consistent estimation of signal proportion is also discussed. Section 3 demonstrates the finite-sample performance of the proposed methods in simulation. Comparisons with other methods are provided. Section 4 presents an application of our method to genomic data analysis. Concluding remarks are provided in Section 5.
2 Efficient Signal Inclusion Under Dependence
2.1 Signal Missing Rate
We first discuss sensible measures for false negatives. Table 1 summarizes notations in classification of variables where TP, FN, FP, and TN are numbers of true positives, false negatives, false positives, and true negatives, respectively; is the total number of signal variables. Our goal of signal inclusion corresponds to seeking low .
In the multiple testing literature, False Nondiscovery Rate (FNR) has been proposed as an analogue of False Discovery Rate (FDR) to measure false negatives. It is defined as the expectation of the proportion of false negatives among the unselected variables, (Genovese and Wasserman, 2004; Sarkar, 2006). The notion of FNR, unfortunately, does not suit our need for signal inclusion, because FNR is mostly very close to zero when signals are sparse, and a large (or small) FNR does not correspond to a high (or low) . We provide a simulation example to illustrate this point in Supplementary Material.
Recently, Cai and Sun (2017) constructed the Missed Discovery Rate (MDR) for false negative control. MDR is the expected value of false negative proportion, . To utilize MDR for false negative control, all the variables are ranked by their estimated local FDR values (). Because can be approximated by , given an estimate for , a cut-off position on the ranked local FDR values can be determined to control MDR at a pre-fixed level. MDR control is intuitive and easy to implement. However, the measure of MDR does not consider the variability of false negative proportion; and the control of MDR has not been studied under dependence.
In this paper, we propose a new measure for false negative control called Signal Missing Rate (SMR). SMR is defined as
where is a constant between and . Signal missing rate evaluates the probability of neglecting at least a certain proportion of signals. By controlling SMR at a low level with a small , a high proportion of signals can be captured. Compared to MDR, SMR measures the exceedance probability of false negative proportion and incorporates the variability of into inference.
2.2 Controlling SMR under dependence
To assure generality of our work, we do not assume any specific distribution for the test statistic. Specifically, define and as collections of indices of the noise and signal variables, respectively. Let be the -value of the th variable. Assume
whereis some unknown cumulative distribution function dominating , i.e., for all . This mixture model on -values in (2.2) provides a convenient framework for large-scale inference. It can be used in a wide range of applications as long as the baseline distribution of the noise can be reasonably estimated from either asymptotic or empirical approaches, such as by permutation or parametric bootstrap.
We define the signal missing rate of a procedure selecting the top candidates along the ranked -values as
where is a constant and represents the number of false negatives for selecting the top candidates, which equals to the number of true signals ranked after .
We develop two procedures for SMR control. Both procedures are easy to implement in applications. The first procedure is more in line with existing techniques in multiple testing. The second procedure improves the efficiency of the first approach by developing new analytic techniques based on concentration inequalities of order statistics under dependence.
2.2.1 The conservative SMR procedure
Suppose that we know the number of signals , then a procedure controlling at the level of includes the number of signals as . Therefore, one can modify a method that controls FDP at the level of to include a high proportion of signals. Motivated by this idea, we develop the conservative SMR (cvSMR), a procedure that determines the cut-off position on the ranked -values by
where is an estimate for the number of signals, is a prefixed small constant, and with . The top candidates are selected.
cvSMR is a step-down procedure with critical sequence that is frequently used in methods controlling FDR or FDP. Compared to an existing step-down procedure studied in Lehmann and Romano (2005), cvSMR uses an opposite sign when comparing -value with the critical sequence. This is because the procedure starts at the position , where FDP is not controlled in general, and stops as soon as FDP is controlled at a desirable level. On the other hand, the procedure in Lehmann and Romano (2005) starts from the first position where FDP is controlled and stops once FDP cannot be controlled. cvSMR also differs in the index of , where is not the index of the ordered -values (k), but . It can be proved that the step-down procedure of cvSMR controls FDP at a stochastic level of . Consequently, the number of true signals included in the top variables is greater than with high probability.
The following theoretical results show that cvSMR asymptotically controls SMR at the level of . denote the -values corresponding to the noise variables and denote the -values corresponding to the signals. We consider the same dependence condition as in Lehmann and Romano (2005): for any ,
This condition says that the -value of a noise variable is conditionally dominated by a uniform distribution. This condition allows arbitrary joint dependence within noise variables and within signal variables.
Note that as long as is a constant. If , the asymptotic control on SMR by cvSMR may not hold. This follows our intuition that the control of false negatives is harder when smaller number of false negatives is allowed.
Generally speaking, cvSMR would work well in situations where FDR/FDP methods work well. However, cvSMR inherits the same issue as FDR/FDP methods and tends to be conservative under high dimensionality. Here, the conservativeness is on false negative control, which means that too many variables could be selected. In fact, this disadvantage can be more severe for cvSMR as the event is less likely to happen than the event in FDR/FDP methods. Table 2 presents the number of selected candidates in a simulation example with and
. The test statistics are generated from multivariate normal distribution, where for noise and for signal. The covariance matrix is a block-diagonal matrix with equal block size and within-block correlation . cvSMR with appears too conservative in this example by selecting almost all variables.
Because cvSMR controls for arbitrarily small constant , one way to mitigate the conservativeness of cvSMR is to weaken the control of false negatives and allow for a fixed proportion of false negatives. Consequently, the critical sequence can be relaxed by involving the fixed proportion. Considering our motivation for retaining as many true signals as possible for subsequent data analysis, we would like to propose a different strategy to significantly reduce the number of false positives without weakening the theoretical control on false negatives.
2.2.2 The Adaptive SMR Procedure
In order to develop a method that incurs less false positives and is more applicable in Big Data applications, we propose the second procedure, the Adaptive SMR (AdSMR). AdSMR has the cut-off position on the ranked -values as
. The rationale to use beta distribution to determineis because our proposed method is based on ranked -values, and the -th ranked -value of noise variables follows under independence. Therefore, it is natural to utilize to perform inference.
Although the median of a beta distribution does not have an explicit form, it is known that the median is bounded by the mode and mean of beta distribution. Therefore,
and is approximately times as large as of cvSMR. Larger results in less variables being selected as shown in Table 2.
To justify this new procedure in theory, existing techniques for FDP control cannot be used anymore. We develop novel techniques to analyze the procedure based on concentration properties of the ordered -values under dependence. Figure 1 illustrates a sequence of ordered -values, where denotes the location before the first noise variable and denotes the location of the last signal variable. Signals and noise are mixed indistinguishably between and . We show that the proposed AdSMR method is able to capture a high proportion of signals ranked before and, at the same time, avoid unnecessary false positives ranked after .
Let be the total number of noise variables ranked before , then . Note that
is a random variable varying from sample to sample. Generally speaking, lower signal intensity results in larger. The specific relationship depends on the model that generates the data. In this section, we assume that is bounded almost surely by a number , and
Condition (2.8) says that the number of indistinguishable noise is not too large. For example, in an association study with total variables and truly associated signals, we request . This condition is fairly general as it allows the existence of weak signals that may rank after many noise variables and “pseudo” signals which are indeed noise but show up as signals due to high dependence with true signals. We note that the number of such “pseudo” signals is often much smaller than the total number of variables in applications when true signals are sparse.
Our next condition is on dependence in the data. Let denote the ordered -values corresponding to the noise variables. Assume that, for any ,
where is the conterpart of the left side probability for independent -values, and is some constant. Since order statistics of independent noise -values follow beta distribution, is the cumulative distribution function of with and . This condition essentially says that the smallest noise -values (given the signal -values) can be more or less extreme than their counterparts under independence as long as is a constant no less than 1. There are no constraints on the noise -values ranked after or the -values of signal variables.
The following theorem shows that given the above conditions, AdSMR has a degenerating SMR, which is equivalent to say that the FN proportion/sensitivity of AdSMR converges to / in probability.
Comparing Theorem 2.2 with Proposition 2.1, it can been seen that AdSMR and cvSMR control SMR differently. cvSMR asymptotically controls SMR at a prefixed level , whereas AdSMR controls at a degenerating level. The theoretical justification coupled with the more relaxed critical sequence make AdSMR a more efficient method for false negative control.
The asymptotic result in Theorem 2.2 holds for any constant but may not hold for . In other words, AdSMR may allow a number of false negatives as long as the number is not greater than a proportion of the total number of signals.
Next, we show that AdSMR can avoid selecting unnecessary false positives. Recall the locations of and in Figure 1. Although signal and noise variables mix indistinguishably between and , noise variables ranked after should be avoided. The next theorem shows that under suitable conditions on the dependence and the estimator , AdSMR controls the selection of noise variables ranked after .
Condition (2.11) is quite general as is an arbitrary constant no less than 1. Theorem 2.2 and 2.3 imply that AdSMR achieves both SMR control and false positive control when (a) dependence conditions in (2.9) and (2.11) are satisfied, (b) signal intensity is strong enough so that condition (2.8) is satisfied, and (c) for arbitrarily small constant . The estimator studied in the next section has the property in (c).
2.3 AdSMR with MR estimator
An important component of the SMR control-based methods is a consistent estimator for the number of signals. Estimation of signal proportion among all variables () has inspired profound research in high-dimensional inference. For example, Storey (2002), Genovese and Wasserman (2004), and Jin and Cai (2007) have developed consistent estimators for relatively dense signals with proportion ; Cai et al. (2007) has considered sparse signals with proportion ; and Meinshausen and Rice (2006) has considered both dense and sparse signals for proportion estimation. However, existing studies mainly focus on independent variables. Rigorous analysis under realistic dependence structure is scarce.
In genomic data analysis, the covariance matrix of -values of SNPs often exhibits a block structure. In Figure 2, we present heatmap of the absolute values of sample correlations of the first SNPs in Chromosome 1 from Cohorte Lausannoise (CoLaus) study samples. Details of the real data are described in Section 4. The heatmap shows blocks of high correlations along the diagonal regions with different block sizes. Such dependence structures are frequently observed in genomic data (Efron, 2007; Fan et al., 2012).
In this section, we study the consistency of the estimator developed in Meinshausen and Rice (2006) in the situation where variables have block dependence. We refer to this estimator as the MR estimator which is constructed as follows. Define the empirical distribution of -values
where and denote the empirical distributions of the -values corresponding to noise variables and signals, respectively. Similarly, denote as the empirical distributions of the -values when all candidates are noise. Define
and denote as a bounding sequence of such that is monotonically increasing with and . Then, the MR estimator for is constructed as
Consistency of has been shown for independent -values in Meinshausen and Rice (2006). In this section, we study the consistency of under block dependence and implement into AdSMR. Assume that can be divided into independent groups with arbitrary dependence in each group. Denote as the upper bound of the group sizes and assume
The next theorem summarizes the conditions for consistency and the SMR control of AdSMR with MR estimator.
Theorem 2.4 considers the scenario where the same dataset is used to derive and to retain signals by AdSMR. Therefore, the SMR control of AdSMR would be affected by the conditions for the consistency of . Given the general block structure of dependence, condition (2.9) implies constraints on the block size and within-block dependence. Under the constraints of dependence, conditions in (i) - (iii) can be satisfied if signal intensity is strong enough in different ranges of sparsity level.
It can be seen that the more sparse the signals (larger ), the stronger the condition on signal intensity. When , the condition is quite general and has been described as the “pure” case in Genovese and Wasserman (2004). When , the stronger condition says that the distribution of signal -values is highly concentrated around . One can also see the effect of block size demonstrated through . Generally speaking, the larger the , the stronger the conditions in (i) - (iii).
AdSMR with MR estimator also controls unnecessary false positives ranked after as presented in the following corollary. The proof duplicates parts of the proofs for Theorem 2.3 and 2.4, and thus are omitted.
The next corollary shows that AdSMR with MR estimator has asymptotically zero false positives in two special scenarios.
We conclude this section by a complete algorithm for AdSMR with the MR estimator. For simplicity, the same is used to simulate the bounding sequence and to obtain in AdSMR. More specifically, one can simulation from the empirical null distribution. Then, can be determined as the
th quantile of the empirical distribution offrom 1000 simulations. To save computation, we approximate by as shown in (2.7) and set an upper limit for at . The step-by-step algorithm is as follows. A toy example demonstrating the algorithm is provided in Appendix 5.4.
Algorithm 1: AdSMR with MR estimator
3 Simulation Study
We compare the finite-sample performances of AdSMR and existing methods for false negative control. These methods include the MDR procedure (Cai and Sun, 2017) and BH-FDR with high nominal levels. For fair comparison, both AdSMR and MDR use for proportion estimation. MDR aims to control the expectation of FN proportion through estimating local FDR. We use the software “locfdr” to estimate local FDR and apply MDR at the recommended level . BH-FDR with high nominal levels are ad-hoc procedures that apply the original FDR method in Benjamini and Hochberg (1995) with high nominal levels 0.5 and 0.7 to capture more signals.
We demonstrate FN control of these methods by reporting their FN proportions (). Then, we show their efficiency by reporting the false discovery proportion () of these methods. Higher FDP can be viewed as higher price paid to achieve low FN proportion.
In addition, we employ the F-measure as a summary metric for FN proportion and FDP (Powers, 2011) . By definition, F-measure is the harmonic mean of precision (
. By definition, F-measure is the harmonic mean of precision () and recall () and calculated by .
We simulate test statistics from multivariate normal distribution , where for noise and for signal. The locations of the signals are selected randomly. We consider settings with different signal sparsity and intensity levels and various dependence.
Example 1 has data dimension . The covariance matrix is a block diagonal matrix with equal block size and within-block correlation . The diagonal elements of are set to be 1. We demonstrate different signal sparsity levels: and . Signal intensity increases from 3 to 5.5.
Figure 3 presents the median values of FN proportion, FDP, and the F-measure for all the methods from 100 simulations. It shows that the FN proportion of AdSMR is lower for signals with than for signals with . This agrees with the theoretical insights of Theorem 2.4 where stronger condition on signal intensity is required for more sparse signals. Compared to other methods, AdSMR has slightly larger FN proportion but smaller FDP. AdSMR generally outperforms other methods in F-measure especially for sparse signals with .
We follow the convention in high-dimensional screening literatures, e.g., Fan and Lv (2008) , and report the median of these measures.
Examples of the mean and standard deviation of the measures are provided in Supplementary Material.
, and report the median of these measures. Examples of the mean and standard deviation of the measures are provided in Supplementary Material.
Example 2 has block dependence with larger block size . Figure 4 compares all the methods under different signal sparsity and intensity levels. Compared to Example 1, the performance of AdSMR deteriorates a little for FN control, which agree with the theoretical insights in Theorem 2.4 on the effect of block size. Overall, AdSMR still mostly outperforms other methods in FDP control and F-measure.
Example 3 simulates data with and a sparse whose nonzero elements are randomly located. The data generation process is similar to Model 3 in Cai et al. (2013). Let , where , Bernoulli for and . Then , where . Figure 5 shows that the performance of AdSMR is comparable to its performance in Example 1, although the covariance matrix in this example does not have a block structure. MDR performs relatively better in this example than in previous examples. AdSMR and MDR generally outperform high level BH-FDR procedures in this example.
Example 4 considers dependence structure from a two-factor model as in Fan
et al. (2012). Let be the correlation matrix of a random sample with size 100 of -dimensional vector
-dimensional vector, where , and are iid , and are iid , and are iid . Figure 6 shows that the performance of AdSMR is comparable to its performance in Example 2, where has large diagonal blocks. MDR and high level BH-FDR procedures perform relatively better in this example than in the examples with block dependence.
In all the examples, the cut-off position of AdSMR automatically vary with signal sparsity and intensity and, as a result, when signal intensity becomes stronger and the noise and signal -values are better separated, AdSMR controls both false negative and false positive better. BH-FDR with high nominal levels do not have such property as their FDPs remain high with increasing signal intensity. MDR method performs better than high level BH-FDR in terms of adapting to signal intensity but worse than AdSMR when signals are more sparse. Additional simulation for the empirical SMR of AdSMR is presented in Appendix 5.5. Finite-sample performance of the MR estimator along is demonstrated in Supplementary Material.
4 Real Data Analysis
Recent heritability analyses of GWAS data have suggested that a large proportion of variation of human height can be explained by all autosomal SNPs although only a small proportion of associated variants have been successfully identified. For example, Yang
et al. (2011) showed that about of the variation can be accounted for by common SNPs from a sample of around Australians with ancestry in the British Isles. In the analysis of Kostem and
Eskin (2013), of the variation can be explained by all autosomal SNPs from the Northern Finland Birth Cohort of 1966 from unrelated individuals. However, the discovered associated SNPs have only explained a small proportion of the total heritability. For example, in Allen
et al. (2010), only of the variance can be explained by the
of the variance can be explained by theSNPs with genome-wide significance. It indicates that there exist a large number of weak signals, for which current methods have limited power even with a sample size of .
We obtained the GWAS height data from the Cohorte Lausannoise (CoLaus) study (Firmann et al., 2008). The dataset includes subjects with available information on age, sex, and autosomal SNPs with minor allele frequencies greater than . We calculated the -values corresponding to each SNP by fitting marginal linear models while adjusting 12 clinical covariates including sex, age, and the top 10 principal components. The FDR method in Benjamini and Hochberg (1995) cannot identify any associated SNPs based on the full set of SNPs due to small effects of individual SNPs. On the other hand, the Lasso procedure (Tibshirani, 1996) applied to the full dataset identifies only two candidate SNPs.
In order to reduce data dimension and carry as many true signals to subsequent analyses, we apply both AdSMR and MDR procedures. To implement the MR estimator to AdSMR, we first generate a set of -value sequences under the global null using the permutation approach introduced in Westfall and Young (1993). In each permutation, the phenotype values are randomly shuffled and reassigned to individuals, the null -values are calculated in each permutation with the correlation structure of the SNPs preserved. We replicate the permutation process times. Based on the simulated null -values, the MR estimate and the cut-off position of AdSMR are calculated by the algorithm in Section 2.3. The estimated number of associated SNPs is equal to ; AdSMR selects SNPs, while MDR selects SNPs.
To examine the contribution of the selected SNPs, we perform heritability analysis using GCTA package (Yang
et al. (2011)). First, we use GCTA to estimate the genetic relationship matrix (GRM) of all the GWAS SNPs and then fit a random effects model to estimate the proportion of variance explained by all the autosomal SNPs.
We found that of the phenotypic variance in height can be explained by all the autosomal SNPs. We then repeat the analysis on the partitioned SNP sets (e.g., the SNPs identified by AdSMR vs. the rest) to estimate the proportion of variance in height explained by the selected SNPs while adjusting for the remaining unselected SNPs.
Note that the random effects model in GCTA is very different from the marginal linear regression model that we used to select SNPs. In addition, the heritability percentage is also different from the
SNPs identified by AdSMR vs. the rest) to estimate the proportion of variance in height explained by the selected SNPs while adjusting for the remaining unselected SNPs. Note that the random effects model in GCTA is very different from the marginal linear regression model that we used to select SNPs. In addition, the heritability percentage is also different from thestatistic of goodness-of-fit test for prediction.
Table 3 shows that AdSMR reduces the number of SNPs from to , but the explained variability of the selected SNPs is still , suggesting that almost all important SNPs to human height are retained by AdSMR for this dataset. MDR also explained of the height variation, but appears less efficient than AdSMR by selecting a much larger set of SNPs.
|Number of SNPs||Estimated Heritability|
With the reduced set of SNPs from AdSMR, we can apply joint modeling using Lasso and further narrow down the number of SNPs to 1,563. By repeating the above GCTA analysis on the 1,563 selected SNPs vs. the rest, the estimated heritability is still . These selected SNPs provide promising candidates for further downstream analyses such as gene annotation, pathway mapping, polygenic risk score, etc.
5 Conclusion and Further Discussion
In this paper, we consider the problem of efficient signal inclusion under dependence. Our motivation comes from Big Data applications where sample sizes are relatively limited with respect to data dimension, and there are great needs to significantly reduce data dimension while retaining as many true signals as possible for subsequent analyses. However, in applications where signals are much rarer than noise, inference for false negative control based on signal information requires different techniques from those used for false positive control. Furthermore, the dependence among variables, especially between noise and signal variables, adds another layer of difficulty. To address these challenges, we develop data-adaptive methods whose implementations do not need the signal information. Nevertheless, the cut-off position of the proposed AdSMR procedure automatically vary with signal sparsity and intensity and, as a result, a high proportion of signals can be selected without incurring many unnecessary false positives. When signal intensity becomes stronger and the noise and signal -values are better separated, the cut-off position of AdSMR controls both false negative and false positive better. These properties are presented in Theorem 2.2 - 2.4 and Corollary 2.5 - 2.6, and illustrated in simulation examples where signal intensity increases in each setting.
We have also proposed a new measure, SMR, for false negative control. The notion of SMR includes two parameters: the FN proportion and the SMR control level . The methods developed in the paper, cvSMR and AdSMR, control SMR at different levels. The construction of cvSMR is more in line with the existing techniques based on marginal distribution of -values, whereas AdSMR benefits by new explorations on the concentration properties of the ordered -values under dependence. Both methods control SMR for arbitrarily small . However, AdSMR is shown to incur much less false positives.
We note that false positives can be further reduced by allowing a small fixed FN proportion. Such extension on AdSMR would involve more delicate analyses on the concentration properties of order statistics under dependence. The estimator implemented in cvSMR and AdSMR is another subject to be re-investigated under the new request. Detailed studies are deferred to future research. Another interesting topic for future research is SMR control for specific data generating models. While the current paper considers a general -value model without model assumptions on test statistics, it will be interesting to relate conditions in the paper to parameters of the specific models. We expect that the characterizations would be very different for models with sparse or dense signal component.
The authors are grateful to Dr. Peter Vollenweider and Dr. Gerard Waeber, PIs of the CoLaus study, and Dr. Meg Ehm and Dr. Matthew Nelson, collaborators at GlaxoSmithKline, for providing the CoLaus phenotype and genetic data. The authors appreciate the very helpful comments and suggestions from an associate editor and three reviewers. Dr. Jeng was partially supported by National Human Genome Research Institute of the National Institute of Health under grant R03HG008642. Dr. Tzeng was partially supported by National Institutes of Health grant P01CA142538.
Appendix 5.1 - 5.3 provide proofs of Proposition 2.1 and Theorem 2.2. A toy example demonstrating the AdSMR algorithm is shown in Appendix 5.4, and additional simulation results are presented in Appendix 5.5. Proofs of Theorem 2.3, Theorem 2.4, and Corollary 2.6 and more simulation results are provided in Supplementary Material.
5.1 Proof of Proposition 2.1
For notation simplicity, let , where . Then . Denote and as the numbers of true positives and false positives in the top candidates, then . We also have . Now, for any ,
where the last step is by the consistency of .
5.2 Proof of Theorem 2.2
We consider and separately.
When , the proof is relatively straight-forward. First, by the definitions of and ,
which implies with probability tending to 1. On the other hand,
which implies that
with probability tending to 1. This conclude the case with .
The following proof is for . By the definition of , it is enough to show that for any ,
The case can be proved by similar arguments leading to (5.2).
Consider the case . Without loss of generality, assume is an integer. Denote as the position for the -th signal. Then
Note that are composed of and . Denote and as the critical values corresponding to and in , respectively. Then
Let be the largest before . The following lemma shows the relationships between and and between and .
Given a consistent estimator for the number of signals and , we have
with high probability.
where is a random variable following with and . Further, by (5.6),
where the second inequality is by the properties of Beta distribution that is less positively skewed as
is less positively skewed asincreases and the change of skewness gets slower as approaches to . Then the above implies
Note that there is no explicit form for the cumulative distribution function of Beta distribution. To derive the probability in (5.10), let
By the relationship between Beta and F distributions,
By the relationship between Beta and F distributions,has an distribution (Johnson and Kotz, 1970). In our case, by condition (2.8), then is highly concentrated at with mean and variance . On the other hand, we know that the median of Beta distribution is bounded by its mean for , then
where the last step is by condition (2.8). Further, let
then by Wilson-Hilferty approximation to and the fact that is a ratio of distribution, approximately follows