Data-adaptive testing under high-dimensions
Current statistical inference problems in areas like astronomy, genomics, and marketing routinely involve the simultaneous testing of thousands -- even millions -- of null hypotheses. For high-dimensional multivariate distributions, these hypotheses may concern a wide range of parameters, with complex and unknown dependence structures among variables. In analyzing such hypothesis testing procedures, gains in efficiency and power can be achieved by performing variable reduction on the set of hypotheses prior to testing. We present in this paper an approach using data-adaptive multiple testing that serves exactly this purpose. This approach applies data mining techniques to screen the full set of covariates on equally sized partitions of the whole sample via cross-validation. This generalized screening procedure is used to create average ranks for covariates, which are then used to generate a reduced (sub)set of hypotheses, from which we compute test statistics that are subsequently subjected to standard multiple testing corrections. The principal advantage of this methodology lies in its providing valid statistical inference without the a priori specifying which hypotheses will be tested. Here, we present the theoretical details of this approach, confirm its validity via a simulation study, and exemplify its use by applying it to the analysis of data on microRNA differential expression.READ FULL TEXT VIEW PDF
Multiple testing of a single hypothesis and testing multiple hypotheses ...
This article develops a framework for testing general hypothesis in
The ability to predict individualized treatment effects (ITEs) based on ...
Hypothesis testing plays a central role in statistical inference, and is...
The conventional nonparametric tests in survival analysis, such as the
While standard statistical inference techniques and machine learning
Differences between genetic networks corresponding to disease conditions...
Data-adaptive testing under high-dimensions
Recently developed technologies enable high throughput screen of thousands (millions) of biological molecules, which has resulted in filters that use multiple testing to highlight potentially informative biomakers. For instance, consider microRNA, a new class of small, non-coding RNAs have been the subject of intense study due to their central role in gene regulation (Wienholds et al., 2005)
. In a process involving binding to messenger RNA (mRNA), miRNAs regulate gene expression at the post-transcriptional level, thereby affecting the abundance of a wide range of proteins in diverse biological processes. The resulting data consists of a large vector of microRNA expressions as well as other characteristics and experimental conditions relevant to a particular biological sample. Before looking at the complex relationship of these molecules, a first step is often examining the association of microRNA expression with some other phenotypic variable(s). In our case, we consider a study examining the relationship of occupational exposure benzene (a known carcinogen) to microRNA expression. This study, like many of its kind, has a relative small sample size, but large ambitions regarding the teasing out of associations of many thousands of potential microRNA’s. In this paper, we propose a method that adaptively reduces the number of tests so that a study can preserve reasonable power even when the number of potential tests is huge.
Speaking generally, current problems of statistical inference involving multiple hypothesis testing share the following characteristics: inference for high- dimensional multivariate distributions, with complex and unknown dependence structures among variables; a variety of parameters of interest, such as coefficients in general regression models relating possibly censored biological and clinical covariates and outcomes to genome-wide expression measures and genotypes; many null hypotheses, in the thousands or even millions; complex and unknown dependence structures among test statistics.
An often-ignored yet insidious issue with small-sample inference and large numbers of comparisons is the enormous sample sizes required for joint convergence, by the central limit theorem, of the joint sampling distribution to a multivariate normal distribution. Unfortunately, others have shown that for convergence to multivariate normal distribution in the far tails (to derive accurate adjustment for multiple testing) requires astronomical sample sizes (e.g., from high-throughput sequencing technologies). Gerlovina and Hubbard (2016) employed Edgeworth expansions to rigorously study the manner in which the number of tests performed affects the accuracy of the resultant analyses, with particular attention paid to situations where the number of tests vastly outnumbers the sample size. Here, the utility of Edgeworth series lies in their providing, by way of higher-order approximations, estimates of critical values that would otherwise be computed exactly were the true distribution known. In any case, these expansions and associated simulations of multiple testing experiments showed that error control can be wildly anti-conservative if sample size is inadequate. Thus, using standard and commonly used multiple testing techniques can make it practically impossible to obtain honest statistical inference when conducting very large numbers of tests.
Motivated by the broad use of multiple testing procedures for very large numbers of tests, and the limitations of existing multiple testing procedures, we present in this paper a technique for data-adaptive multiple testing in high-dimensional problems, one that harnesses data mining procedures to perform variable reduction, whilst preserving accurate and honest statistical inference. This new method is a natural extension of the data-adaptive statistical target parameter framework introduced by Hubbard and van der Laan (2016), who show that this new class of inference procedures provides an impetus for using methods providing rigorous statistical inference for data-mining procedures.
The proposed approach for multiple testing in high-dimensional settings uses data-adaptive test statistics, which rely on cross-validation, to perform variable reduction by screening algorithms. Typically, the method uncovers associations that represent signals that are stable across the full sample, while allowing for multiple testing efforts to be restricted to a much smaller subset of biomarkers (predictors). It is expected that this new class of test statistics outperforms, in terms of improvements in both power and control of type-I error rate, standard test statistics generated by classical approaches to multiple hypothesis testing, in which variable set is pre-specified.
In section 2, we proceed to define this methodology in generality with the aforementioned estimation strategies, present theorems establishing its asymptotic statistical performance, present influence function and bootstrap-based inference, and discuss the implications of these theoretical results. In section 3, we demonstrate the relative performance by way of simulation studies; in section 4, we apply the general approach for high-dimensional multiple testing using data-adaptive test statistics to the analysis of data from miRNA assays; and then conclude (in section 5) with remarks on the implications of this new methodology to future work in analyzing data from biomedical investigations.
The procedure is straightforward and involves cross-validation to keep separate the data used for choosing which tests to perform and the data used for constructing the test statistics and generating p-values. Define the observed data as , which consist of independent and identically distributed random samples from the population distribution , with known to be an element of a statistical model . For each observation, is the outcome of interest of dimension and can be a real number or class variable. is a 1-dimensional vector of treatment variable and is a multi-dimensional vector of baseline covariates.
Consider V-fold cross-validation, where the learning set is randomly divided into mutually exclusive and exhaustive sets, , of as nearly equal size as possible. We will define the parameter-generating sample for , the sample used to select which of the original set of hypotheses should be considered for future testing. Then, the estimates are computed on the estimation-sample , and averaged (details below) across the . The proportion of observations in the estimation-sample is approximately .
For a given random split , let be the empirical distribution of the parameter-generating sample , and be the empirical distribution of the estimation-sample . is the target parameter mapping indexed by the parameter-generating sample , and the corresponding estimator of this target parameter. For instance, one might use an algorithm within the training sample to subset variables based on, say, ranking by some statistic (e.g., differential expression). Thus, could be the mean difference between two groups of this random subset of variables. Here, is a nonparametric model and an estimator is defined as an algorithmic mapping from a nonparametric model, including all of the constituent empirical distributions, to the parameter space. For simplicity, assume that the parameter is real-valued. Thus, the target parameter mapping and estimator can depend not only on the parameter-generating sample , but also on the particular split .
Assume the existence of a mapping from the parameter-generating sample into a target parameter mapping and a corresponding estimator of that target parameter. The choice of target parameter mapping and corresponding estimator can be informed by the data but not by the estimation sample – that is, one only need know the realization of the mapping from the parameter-generating sample to the space of target parameter mappings and estimators , but not the explicit definition of said mapping.
Define the sample-split data-adaptive statistical target parameter as with
and the statistical estimand of interest is thus
Note that this target parameter mapping depends on the data, which is the reason for calling it a data-adaptive target parameter. A corresponding estimator of the data-adaptive estimand is given by:
In this section, we consider a rank-based approach to generate a test statistic on a reduced set of responses , where is a subset of the full observed data determined by the application of a selection procedure on the parameter-generating sample. The data-adaptive statistic is identified by three components: the cardinality of the reduced set denoted by , the parameter-generating algorithm, and the number of folds in cross-validation.
Specifically, for each parameter-generating sample , we simply rank by , which is the empirical average treatment effect of on . Within each fold , the parameter-generating algorithm returns the rank of each response covariate by its effect size using the the parameter-generating sample , and the set is defined by taking the top ’s ( we have arbitrarily chosen ).
The parameter of interest is the average treatment effect , estimated by using the estimation-sample . The efficient influence curve of is derived in van der Laan and Petersen (2012) and can be represented as
where , , and . We calculate the efficient influence curves also on the estimation-sample, which will be useful when we later calculate test statistics.
We repeat the procedure for all folds, and take the average for each that are selected in any single fold . The calculated efficient influence curves are combined across all folds and used to derive asymptotic distribution of target parameters and perform statistical testing. The p-values for the in the final reduced set can be constructed based on asymptotic linearity of the TMLE (van der Laan and Petersen, 2012). Under regularity conditions on the estimates of and ,
is the empirical variance of efficient influence curve. As a result, the test statistics based on the response matrixare computed using asymptotic normal distribution of each single target parameter, thus we have a p-value for each selected in . The false discovery rate of the corresponding test p-values () can be controlled, for example, using the well-established Benjamini-Hochberg procedure for controlling the False Discovery Rate (FDR) (Benjamini and Hochberg, 1995).
The data-adaptive parameter-generating procedure not only reduces the multiplicity of the hypothesis tests compared with directly applying multiple testing methodologies, but it also generates summary statistics that validate the robustness and credibility of the result. A plot of sorted Q-values can become useful when a lot of covariates are significant. If the smallest Q-values are similar among each other, and as a group much smaller than the rest (still significant) covariates, we can identify the cluster of smallest Q-values to be of most scientific interest. In addition to looking at the Q-values, which is a measure of statistical significance, practitioners can evaluate the scientific significance by looking at the magnitude of average treatment effect. Percentage of each to be selected in across all folds can be viewed as a measure of association robustness. So is the average rank of each covariate across folds.
We consider a situation that has an analogue in high-dimensional data generated by microRNA data discussed above. The method evaluated in this simulation study uses the parameter-generating sample to select a small subset of the original genes, and subsequently it uses the estimation sample to validate the effect of these genes on a phenotype of interest. In this manner, it avoids the need to apply multiple testing procedures that control a Type-I error rate among a comparatively large number of tests.
Let where is a binary vector, and
a multivariate outcome. The true probability distribution,, is generated based on a design where there is equal probability of and ; moreover, for each gene , the distribution of , given , is defined by the following regression equation:
The coefficient is generated by a standard normal distribution, and the coefficient takes a fixed sparse design, with and . As a result, we generated true effects and null effects. Note, that these coefficients are fixed in the simulation. The errors were independent draws from a random distribution, and we repeated the simulation not only for different magnitudes of the residual error (different realizations of ), but also for increasing sample sizes. We define our data-adaptive statistical target parameter as outlined in 2.1, where we set the dimension of the reduced response matrix to be . Data on a total of potential biomarkers were generated.
Directly adjusting p-values using the method of Benjamini and Hochberg (1995), for controlling the FDR, on all responses yields the plot in 2. Note that out of the top true effects failed to achieve significance despite a signal-to-noise ratio of , due to adjustment on too large a dimension.
Running the data-adaptive algorithm on the parameter-generating samples provides a reasonable recovery of true effect responses. The results are given below:
|covar. ID||ATE est.||p-value||adjusted p-value||mean CV-rank||% times in top 15|
From the table, it is clear that the approach of data-adaptive statistical target parameter estimation consistently picks out the true effects in the top candidates. After application of the Benjamini and Hochberg (1995) procedure on the reduced set of responses, out of true signals are still significant.
Benzene, an established cause of acute myeloid leukemia (AML), may also cause one or more lymphoid malignancies in humans. Previous studies have identified single nucleotide polymorphisms (SNP) and patterns of DNA methylation associated with exposure to benzene through transcriptomic analyses of blood cells from a small number of occupationally exposed workers (Zhang et al., 2010); however, in these studies, the effect of benzene-induced changes to microRNA expression (Affymetrix 3.0 GeneChips which contain probes for human miRNAs) in blood cells was not the subject of intense scrutiny. We discuss a study of 85 individuals in Tianjin, China, in which 56 workers exposed to varying levels of benzene and 29 unexposed control counterparts were monitored repeatedly for up to 12 months. For each individual, blood samples were collected and miRNA expression was measured and log-transformed, leading to the formulation of a statistical problem in which a large number of comparisons arises from performing miRNA screens on both exposed subjects and unexposed controls.
To illustrate the flexibility of the proposed method we will show how it can be used as a powerful test for differentially expressed microRNA. The data consist of 5639 real-valued outcomes and one binary treatment for 85 individuals. In univariate testing, the microRNA hsa-miR-320a_st has the smallest p-value (). However the association is no longer significant after controlling for multiple testing using the Benjamini-Hochberg (Benjamini and Hochberg, 1995) method to control the False Discovery Rate (). Since our objective of interest was to detect the top differentially expressed microRNAs. (and not to differentiate all microRNAs), we generated data-adaptive test statistics to reduce the number of hypotheses based on the procedure outlined in 2.1
|microRNA||ATE||Fold Change||p-value||adjusted p-value|
|microRNA||ATE||Fold Change||p-value||adjusted p-value|
Given the multiplicity of comparisons, we propose the use of data-adaptive test statistics to reduce the number of comparisons first, thereby increasing power, while still maintaining accurate statistical inference. We specified the reduced set of response matrix with dimension , which correspond to studying top 30 microRNAs that will express differently under benzene exposure. We carried out 10-fold cross-validation to calculate the data-adaptive test statistics and tested each of the top 30 microRNAs. We finally performed FDR correction ((Benjamini and Hochberg, 1995)) on the 30 raw p-values.
The results for the top 30 microRNAs and their FDR-adjusted p-values are shown in Table 4. 19 out of 30 top microRNAs had a significant differential expression (q-value
) while we found none in FDR-corrected t-tests in Table2. Observing the plot of FDR-adjusted p-values (Figure 4) also gives us insights as we can easily identify groups of significant p-values. In practice, we can choose a cutoff based on the trend of sorted p-values. The average rank of the top covariates (in Table 4) can also be referenced as a measure of stability of the effect across different subjects.
|microRNA||ATE||raw p-values||adjusted p-values||avg rank||% appear in top 30|
The same analysis can be performed on up-regulated microRNA’s. 20 out of 30 top microRNAs had a significant differential expression (q-value ) as did the hsa-miR-338-5p_st and hsa-miR-103a-2-star_st that we found in FDR-corrected t-tests in Table 3.
|microRNA||ATE||raw p-value||adjusted p-value||avg rank||% appear in top 30|
Overall, this formally confirms the conclusions in the paper that by generating the data-adaptive test statistics we can increase the power of testing a large set of statistical hypotheses and at the same time control the level of false positive rate (or false discovery rate).
The goal of this article is to introduce a generalized class of robust procedures for performing statistical tests in high-dimensional settings, relying on the approach of data-adaptive statistical target parameters. Here, we have introduced, in generality, the theory and methodology underlying the use of data-adaptive statistics for multiple testing, illustrating key advantages of this approach via simulation studies and providing examples where relevant. By providing the theoretical formalisms in a generalized way, we have exposed a flexible framework in which the number of multiple testing corrections applied in high-dimensional problems can be reduced, allowing for signals that would otherwise be made undetectable by said corrections to be recovered.
In the example provided, we demonstrate the power of the approach based on data-adaptive test statistics in the context of a study of miRNA. We show that this new class of approaches for analyzing high-dimensional data sets allows researchers to derive improved statistical power in problems plagued by multiple testing, by allowing for relatively fewer null hypotheses of interest to be generated data-adaptively – that is, suggested by the observed data. In order to improve accessibilty to the methodology presented herein, Cai et al. (2017) have developed and made publicly available an open-source software package for data-adaptive multiple testing, available for the R statistical computing language (R Core Team (2016)).