1 Introduction
A multiple hypotheses scenario arises when a number of factors are believed to contribute to the difference between two populations. A typical example in biology is transcriptome analysis, in which thousands of gene expression levels are examined to detect a handful that are expressed at significantly different levels between two groups of subjects with different biological outcomes. A common goal of such a study is to identify the regulator genes that influence the outcome of interest. Empirical Bayes methods can provide useful analysis tools for these problems.
The empirical Bayes model was considered by Robbins [Robbins] to study iid observations $X_1, \dots, X_N$, with $X_i \sim f(x \mid \theta_i)$, where $\theta_i$ is a latent variable with density $g$. This results in the mixture model in which the $X_i$ are iid with density

(1.1) $f(x) = \int f(x \mid \theta)\, g(\theta)\, d\theta.$

Here $f(x \mid \theta)$ is the conditional density of $X$ given $\theta$ and $f(x)$ is the marginal density of $X$.
In addition to justifying shrinkage estimation [Morris], the method has been very useful in large-scale inference [Efron07, Efron08, Efron10]. The specialization of (1.1) to the two-point mixture model has played an important role in the analysis of the two-class data problem. In this model,

(1.2) $f(x) = \pi_0 f_0(x) + \pi_1 f_1(x),$
$f_0$ is the background density and $f_1$ is the signal or contamination density, with $\pi_0 + \pi_1 = 1$. This representation can be used for a generic multiple testing problem where the hypotheses under study are divided into a null class and a nonnull class [Efron07] and one wishes to control the false discovery rate [Benjamini] for the entire array of hypotheses under study. For example, a two-point mixture model where $f_0$ is the Uniform(0,1) distribution and $f_1$ is a Beta distribution is used in [Allison] to analyze p-values for gene expression data.
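As a concrete illustration of fitting such a two-point Uniform-Beta mixture to a collection of p-values, the following Python sketch uses a simple EM-type algorithm; the starting values and the numerical M-step are illustrative choices, not a prescribed fitting routine:

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize

def fit_uniform_beta_mixture(p, n_iter=100, tol=1e-8):
    """EM-type fit of f(x) = pi0 * 1 + pi1 * Beta(x; a, b) to p-values."""
    p = np.clip(p, 1e-12, 1 - 1e-12)     # keep the Beta log-density finite
    pi1, a, b = 0.1, 0.5, 3.0            # illustrative starting values
    for _ in range(n_iter):
        f1 = stats.beta.pdf(p, a, b)
        w = pi1 * f1 / ((1 - pi1) + pi1 * f1)   # E-step: P(signal | p_i)
        pi1_new = w.mean()                       # M-step: mixing weight
        # M-step for the Beta shape: weighted maximum likelihood
        nll = lambda th: -np.sum(w * stats.beta.logpdf(p, th[0], th[1]))
        a, b = minimize(nll, x0=[a, b], bounds=[(1e-3, 50), (1e-3, 50)]).x
        if abs(pi1_new - pi1) < tol:
            pi1 = pi1_new
            break
        pi1 = pi1_new
    return 1 - pi1, pi1, a, b

# Example: 90% null Uniform(0,1) p-values, 10% signal Beta(0.5, 8)
rng = np.random.default_rng(0)
p = np.concatenate([rng.uniform(size=900), rng.beta(0.5, 8, size=100)])
pi0, pi1, a, b = fit_uniform_beta_mixture(p)
```

With a signal component peaked near 0 (Beta shape parameter $a < 1$), the fitted mixing weight recovers roughly the simulated null proportion.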
The methodology proposed in this paper is a modification of these ideas using the two-class empirical Bayes models for p-values in multiple testing problems. The modification considers the two-point Uniform-Beta mixture model (1.2) and then adjusts the mixture components to better accommodate the extreme p-values. The main step incorporates repeated random sample splitting, using the “modeling” split to fit this adjusted model. The fitted model is then used to identify the signal p-values in the “screening” split. The recurrence of the signal p-values is finally used to identify significant signals in the data. The modified approach is able to control the false discovery rate and the precision for the whole process.
By repeating the screening process over a large number of randomly chosen subgroups of the available subjects and then combining the findings, the proposed method helps to balance out unwanted sources of variation in observational data. Cross-validation through resampling also addresses statistical overfitting [Baily] of the underlying empirical model and enhances the reproducibility of the screening results. The frequency distribution of detection recurrence provides insights into group behavior among significant factors. This methodology can be used to analyze omics data to identify potential regulators and to explore the interrelationships between them. In this paper, we demonstrate the usefulness of our methodology for gene expression data analysis.
The next section provides further background on the empirical Bayes approach for multiple testing problems. The proposed methodology, along with the associated power and precision probability calculations, is presented in Section 3. The methodology is illustrated in Section 4 using a microarray data set and an RNA-sequencing data set. A few simulation studies are also included in that section along with a related power analysis. Some discussion and concluding remarks are given in Section 5. Appendix A presents an argument supporting the association plots developed by the method, and Appendix B, on pooling data, gives a justification of the methodology and suggests when clusters of hypotheses may act in sync. A performance comparison between the proposed method and an existing screening method is included in Appendix C.

2 Background
For a multiple-testing problem, the p-values associated with the hypothesis tests correspond either to the baseline/background information (cases where the null hypothesis is true) or to the signal (true discoveries, i.e., cases where the null hypothesis is false and the p-values are close to 0). The basic model for the density of the population of p-values under study is assumed to be (1.2), where we label the background distribution as null and the signal distribution as nonnull. For any continuous test statistic, the p-value will follow a Uniform(0,1) distribution under the null hypothesis [Dickhaus], whereas the Beta distribution can be a natural choice for the nonnull (signal) p-values because it is flexible enough to capture distributions with a wide range of shapes on [0,1].

For an observed p-value $x$, the posterior probabilities $P(\text{null} \mid x)$ and $P(\text{nonnull} \mid x)$ can be given by the assignment functions associated with the mixture model in (1.2): $\pi_0 f_0(x)/f(x)$ and $\pi_1 f_1(x)/f(x)$. In an empirical Bayes approach, $\pi_0 f_0(x)/f(x)$ is estimated using $\hat\pi_0$, $\hat f_0$ and $\hat f$ fitted from the observed data, and in the multiple testing scenario it is referred to as the local false discovery rate (local fdr) (see [Efron07, Efron08, Efron10]). The local fdr can be used to calculate the tail-area false discovery rate (for a symmetric tail area in two-sided tests; it can be adjusted for nonsymmetric cases). The tail-area Fdr is a useful tool to screen for potentially significant cases. Specifically, observations with small Fdr can be viewed as less likely to be false discoveries and thus can be considered significant cases or true discoveries. (See [Efron07] for a detailed discussion of local and tail-area false discovery rates and their relationship with the Benjamini-Hochberg [Benjamini] (BH) mode of controlling the overall false discovery rate.)
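The two screening quantities induced by a fitted Uniform-Beta mixture can be computed directly; in this Python sketch the fitted values pi0, a and b are stand-ins rather than estimates from any real data:

```python
import numpy as np
from scipy import stats

pi0, a, b = 0.9, 0.5, 8.0      # stand-in fitted mixture parameters

def local_fdr(x):
    # fdr(x) = pi0 * f0(x) / f(x), with the null density f0 = 1 on (0, 1)
    f = pi0 + (1 - pi0) * stats.beta.pdf(x, a, b)
    return pi0 / f

def tail_area_Fdr(x):
    # Fdr for the left tail [0, x]: pi0 * F0(x) / F(x), with F0(x) = x
    F = pi0 * x + (1 - pi0) * stats.beta.cdf(x, a, b)
    return pi0 * x / F

x = np.array([0.001, 0.01, 0.05, 0.5])
print(tail_area_Fdr(x))        # extreme p-values receive small Fdr
```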
In mixture-model-based Fdr screening [Efron07, Efron08, Murlidharan], the two-class model is generally fitted first using the whole data set and then used for Fdr-based screening to detect the signal, i.e., the significant cases. The procedure is equivalent to a classification problem in which the estimated Fdr is used as the discriminant function. Estimation of the Fdr is based on the fitted model, which is entirely data-driven and hence prone to overfitting.
Statistical overfitting [Baily] is an especially important issue in machine learning and observational biological data analysis, where empirical model building remains essential. Usually, a prognostic model is developed based on the available data, and then the model is used to classify the factors under study or to predict the factors' behavior in general. A problem arises when the developed model works on the given data set but fails to perform correctly on another data set, creating a potential reproducibility issue [Baily, Raeder].

Cross-validation is recommended as a common remedy for overfitting due to data-driven empirical model building [BERRAR, Harell, Mathur, Raeder, Simon, Subramanian]. However, additional data to validate the fitted model are often not available, and the entire analysis or inference must be done using only one observed data set. Resampling methods are the natural choice for cross-validation in such scenarios.
The proposed method uses the mixture modelbased false discovery rate control described above with the following key attributes:

We incorporate a random sample splitting step in the existing multiple hypothesis testing procedure with Fdr screening, where one part of the data is used for model fitting and the complementary part is used to screen for the potential signal. This fitting/screening step is applied repeatedly with multiple resampled subsets of all available subjects. The most frequently detected factors over all subsets are considered statistically significant. The cross-validation through repeated subsampling and screening helps to reduce the effect of possible overfitting.

Observational studies that are common in biology are prone to include many latent sources of variation that can attenuate or distort the signal of interest. The proposed method screens a large number of different data splits and then selects significant cases taking into account the variations over all these splits. This helps to balance out the unwanted sources of variation.

We offer a novel way to control the overall false discovery of the inference process by choosing the critical detection frequency of a factor to be identified as a potential true discovery.

Based on the resampled screening frequencies, we also present an association plot that can be used to explore the interrelationship between the significant factors.
3 Methodology
We now consider a multiple testing scenario where $N$ null hypotheses $H_{01}, \dots, H_{0N}$ are to be tested against appropriate alternatives, respectively. Suppose $T_1, \dots, T_N$ are the test statistics to be used for the individual tests given the available sample size. To identify the nonnull cases we can analyze either the p-values associated with each $T_i$ or the left-tail area (LTA) of each $T_i$. Nonnull cases will produce p-values close to 0; equivalently, nonnull LTA's will be close to 0 or 1. The advantage of working with LTA's is that one can easily identify the direction of deviation from the null hypothesis.

Let $x_i$ denote the p-value (or LTA) from test statistic $T_i$. When the $x_i$'s follow a two-point mixture distribution with density (1.2), the proposed methodology is implemented by the following steps.
Proposed SampleSplitting Analysis Methodology Steps:
 (i)

First, the subjects under study are randomly split into two (equal) parts, viz. the “modeling” set and the “screening” set. The p-values or LTA's from the test statistics associated with each of the hypotheses under study, derived from the modeling set and the screening set, are called the “modeling data” and the “screening data”, respectively.
 (ii)

Then, a mixture contamination model is fitted using only the modeling data. This fitted model is then adjusted to an “empirical model” that captures the baseline and the signal more appropriately.
The detection of significant cases is based on the screening data alone.
 (iii)

The fitted model is used to derive the tail-area Fdr or the local fdr associated with each $x_i$ in the screening split. The cases with tail-area Fdr (or local fdr) less than a predetermined cutoff point are identified as potentially significant cases.
 (iv)

The entire process (stages i-iii) is repeated several times with different random splits into modeling and screening subsets. For each split/repetition, a set of potentially significant cases is identified. The most frequently identified cases are considered “significant discoveries”. The required critical detection frequency can be set so that the overall false discovery rate for the process is controlled at a fixed level.
 (v)

The cases screened together, and their detection frequencies, can be used to study the interrelationships/dependencies between the significant cases. This frequency distribution is used to develop an association plot that graphically describes these insights for the hypotheses under study.
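The steps above can be sketched end-to-end as follows. This Python illustration uses simulated two-group data, plain two-sample t-tests, and a crude stand-in for the mixture fit (the tail adjustment of the fitted model is omitted), so it conveys the split/fit/screen/count structure rather than the paper's exact procedure:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def screen_once(data_a, data_b, fdr_cut=0.1):
    """One modeling/screening cycle (steps i-iii), simplified."""
    n_a, n_b = data_a.shape[1], data_b.shape[1]
    # (i) random half-split of the subjects in each group
    ia, ib = rng.permutation(n_a), rng.permutation(n_b)
    mod_a, scr_a = data_a[:, ia[:n_a // 2]], data_a[:, ia[n_a // 2:]]
    mod_b, scr_b = data_b[:, ib[:n_b // 2]], data_b[:, ib[n_b // 2:]]

    def pvals(xa, xb):
        p = stats.ttest_ind(xa, xb, axis=1).pvalue
        return np.clip(p, 1e-12, 1 - 1e-12)

    p_mod, p_scr = pvals(mod_a, mod_b), pvals(scr_a, scr_b)
    # (ii) crude stand-in for the mixture fit on the modeling p-values:
    # null p-values are symmetric about 1/2, signal p-values sit near 0
    pi0 = min(0.99, 2 * np.mean(p_mod > 0.5))
    a, b = 0.5, 8.0                       # illustrative fixed Beta shape
    # (iii) tail-area Fdr of each screening p-value under that mixture
    F = pi0 * p_scr + (1 - pi0) * stats.beta.cdf(p_scr, a, b)
    fdr_hat = pi0 * p_scr / F
    return np.flatnonzero(fdr_hat < fdr_cut)

# (iv) repeat over many random splits and count detections per gene
n_genes = 500
data_a = rng.normal(size=(n_genes, 40))
data_b = rng.normal(size=(n_genes, 40))
data_b[:10] += 2.0                        # genes 0-9 carry real signal
counts = np.zeros(n_genes, dtype=int)
for _ in range(100):
    counts[screen_once(data_a, data_b)] += 1
top = np.argsort(counts)[::-1][:10]       # most frequently detected genes
```

In this sketch the true signal genes accumulate high detection counts across the repetitions, while null genes are detected rarely, which is the behavior that the critical detection frequency of Subsection 3.3 exploits.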
Next we present the theoretical model formulation along with the screening and power analysis tools. As noted earlier in Section 1, the local false discovery rate (fdr) in [Efron07] is essentially the same as the assignment function $\pi_0 f_0(x)/f(x)$. From identity (2.15) in [Efron10], the relationship between the local fdr and the tail-area Fdr for a given tail area $S$ is $\mathrm{Fdr}(S) = E[\mathrm{fdr}(x) \mid x \in S]$.
3.1 Empirical Fit
Using the observed $x_i$'s in the modeling split, we first fit a mixture of a Uniform(0,1) distribution and a Beta distribution to the population density given by (1.2), viz. $\hat f(x) = \hat\pi_0 \cdot 1 + \hat\pi_1 \hat f_1(x)$, where $\hat f_1$ is the fitted Beta density.
Since our main interest is the identification of the most extreme cases, we adjust the signal/contamination part as follows: let , where
(3.1a)  
(3.1b) 
So, captures the more extreme part of the signal (deviation from the ). Since is not a density, it needs to be normalized as follows. Let and define
and
Now let
Then the fitted model can be rewritten as
(3.2) 
Note that the fitted mixture model is unchanged; the terms have simply been rearranged so that the adjusted baseline captures more of the middle part of the data while the adjusted signal captures the tail part. The rearrangement given in (3.2) is what we will refer to as the “empirical mixture model”; it is related to Efron's “empirical null” [Efron07], and this representation of the fitted mixture better captures the baseline and the signal distributions than the original mixture representation. To derive the estimates for the expressions presented in the next two subsections, one has only to replace the corresponding terms in the assumed population density by those of Equation (3.2).
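One concrete rearrangement of this type can be sketched numerically. The split used below (moving the part of the fitted Beta density that does not exceed the uniform level into the baseline and keeping the excess as the adjusted signal) is an assumed instance for illustration and need not match the exact thresholds of (3.1):

```python
import numpy as np
from scipy import stats

pi0, pi1, a, b = 0.9, 0.1, 0.5, 8.0          # stand-in fitted values

x = np.linspace(1e-6, 1 - 1e-6, 100001)      # grid on (0, 1)
dx = x[1] - x[0]
f1 = stats.beta.pdf(x, a, b)

excess = np.maximum(f1 - 1.0, 0.0)           # signal mass above the uniform level
base = pi0 + pi1 * np.minimum(f1, 1.0)       # middle part absorbed by the baseline

pi1_adj = pi1 * np.sum(excess) * dx          # adjusted signal weight (smaller)
pi0_adj = 1.0 - pi1_adj                      # adjusted baseline weight (larger)
f1_adj = excess / (np.sum(excess) * dx)      # normalized adjusted signal density
f0_adj = base / pi0_adj                      # normalized adjusted baseline density

# The rearranged mixture reproduces the original fit pointwise:
# pi0_adj * f0_adj + pi1_adj * f1_adj == pi0 + pi1 * f1
```

The key property of any such rearrangement is visible here: the total mixture is unchanged, the adjusted signal density is supported on the extreme part of [0,1], and the adjusted signal weight is smaller than the original one.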
Comment: The assumption that the null data follow a Uniform(0,1) distribution may not always hold in discrete cases (see [muralidharan2012detecting]). But the tail-adjustment step can compensate, at least in part, for the deviation from uniformity in the final adjusted form (3.2).
3.2 Screening Significant Cases Based on Fdr and Sample Splitting.
In case p-values are used for the analysis, $x_i$'s close to 0 are associated with the signal. If the $x_i$'s are the LTA's of the test statistics, then $x_i$'s close to 0 or 1 (or both) are associated with the signal, depending on whether a left-sided, right-sided, or two-sided test is being used. Thus the tail area (left, right, or two-sided) used for calculating the tail-area Fdr will depend on the definition of the $x_i$'s and the direction of the alternative hypotheses under study.
Let $F$ be the distribution function corresponding to $f$ and define $P_F(S) = \int_S f(x)\,dx$ for any Borel set $S$. Similarly, write $F_0$ for the distribution function of the baseline distribution $f_0$. With this notation, we derive the tail-area Fdr associated with a given tail area $S$ as follows:

(3.3) $\mathrm{Fdr}(S) = \pi_0\, P_{F_0}(S)\, /\, P_F(S).$
In practice, one can consider the tail area $S(x)$ associated with any value $x$ in the support of $f$. The tail-area false discovery rate associated with $x$ can then be calculated by applying (3.3) with the fitted model (3.2).
If the $x_i$'s are p-values derived for each hypothesis under study, then for any $x$ the appropriate tail area to use is $S(x) = [0, x]$. When the $x_i$'s are LTA's from test statistics, one should use $S(x) = [0, x]$ for a left-sided test and $S(x) = [x, 1]$ for a right-sided test. In a two-sided test situation with the test statistic symmetric around zero, the tail area simply is $S(x) = [0, \min(x, 1-x)] \cup [\max(x, 1-x), 1]$.
In general, for an $f$ that is not symmetric, the two-sided tail area can be derived using matching percentiles. Express any $x$ as a percentile of $f$, i.e., $F(x) = q$; then a matching $x'$ can be found such that $F(x') = 1 - q$. Now if $q < 1/2$ ($x$ is smaller than the median) we choose

(3.4a) $S(x) = [0, x] \cup [x', 1].$

On the other hand, if $q > 1/2$ ($x$ is larger than the median) we choose

(3.4b) $S(x) = [0, x'] \cup [x, 1].$
An observation with a tail-area Fdr smaller than a predetermined critical value should be identified as significant.
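The matching-percentile construction can be sketched as follows, using a skewed chi-square law as an assumed stand-in for an asymmetric density:

```python
import numpy as np
from scipy import stats

dist = stats.chi2(df=4)                  # assumed asymmetric stand-in for f

def two_sided_tail_area(x):
    q = dist.cdf(x)                      # x is the q-th percentile of f
    x_match = dist.ppf(1 - q)            # matching point at percentile 1 - q
    lo, hi = min(x, x_match), max(x, x_match)
    return dist.cdf(lo) + (1.0 - dist.cdf(hi))   # total area of S(x)

# A point far in the left tail yields a small two-sided area, while the
# median (q = 1/2) matches itself, giving area 1.
area = two_sided_tail_area(0.5)
```

Because the matching point sits at percentile $1 - q$, the resulting two-sided area equals $2\min(q, 1-q)$, reducing to the familiar symmetric tail area when $f$ is symmetric.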
For tail-area Fdr screening of a data set, a modeling split is first used to fit the adjusted mixture model (3.2). Then, for each data point in the corresponding screening split, the appropriate tail area is determined using the fitted model. Next, the estimates from the modeling fit are used with (3.3) to derive the estimated observed tail-area Fdr. Any case whose estimated Fdr falls below a predetermined cutoff point is screened as a potential discovery.
For screening with the local fdr, after fitting the adjusted mixture model on the modeling split, the fitted densities are used to calculate the estimated local fdr for each screening-split data point. Cases with estimated local fdr less than a predetermined cutoff are considered potential discoveries.
In terms of the rejection region, for any fixed cutoff point $\Delta$, depending on whether tail-area Fdr or local fdr screening is used, the theoretical rejection set from the $k$-th split is given by

(3.5a) $R_k = \{x \in \mathcal{X}_k : \mathrm{Fdr}(S(x)) < \Delta\}$

or

(3.5b) $r_k = \{x \in \mathcal{X}_k : \mathrm{fdr}(x) < \Delta\},$

where $\mathcal{X}_k$ is the support of the screening data from the $k$-th sample split.
The above calculation is repeated a number of times. The potential significant cases can be identified from the combined rejection set $R = \bigcup_k R_k$ or $r = \bigcup_k r_k$. But to increase the precision, only the observations that have been detected repeatedly with high frequency across the $R_k$'s or $r_k$'s should be considered potential true discoveries. The critical frequency of detection for significant discoveries will be discussed in Subsection 3.3 (see Equation (3.13)).
3.3 Power and Error Probabilities Calculation.
Efron [Efron07, Efron10] uses a whole-data fit on z-transformations of the p-values associated with the hypotheses under study and advocates the use of the local fdr for the screening of significant cases. In that analysis, for a given cutoff point $\Delta$ of the local fdr, the rejection region effectively is $\{x : \mathrm{fdr}(x) < \Delta\}$. The power diagnostic tools chosen in those discussions are the nonnull average of the local fdr and the nonnull cdf of the local fdr. Some empirical estimates of these functions were used in [Efron07, Efron10] for the power analysis.

For the sample splitting method proposed in this paper, where the model is fitted on the p-values or LTA's associated with the test statistics, the rejection region from the $k$-th split can be obtained from (3.5). The combined rejection region from all splits can then be constructed as $R = \bigcup_k R_k$ or as $r = \bigcup_k r_k$, depending on the screening tool used. Considering the mixture model setup in (1.2), for a given combined rejection region $R$ with cutoff point $\Delta$, the following probabilities can be used for the power analysis and a relative efficiency comparison:
(3.6) $\mathrm{Recall}(\Delta) = P_{F_1}(R)$

(3.7) $\mathrm{FNR}(\Delta) = 1 - P_{F_1}(R)$

(3.8) $\mathrm{FPR}(\Delta) = P_{F_0}(R)$

(3.9) $\mathrm{Precision}(\Delta) = \pi_1 P_{F_1}(R)\, /\, P_F(R)$

(3.10) $\mathrm{FDR}(\Delta) = \pi_0 P_{F_0}(R)\, /\, P_F(R)$
Here, $f$, $f_0$ and $f_1$ are the true densities that follow from the assumption that the $x_i$ are iid with pdf (1.2). The terms “recall”, “false negative rate”, “false positive rate” and “precision” are commonly used in machine learning (see [Powers]). Equation (3.10) measures the process-wise false discovery rate associated with the combined rejection region $R$.
The estimates of Equations (3.6)-(3.10) for a given data set can be obtained from the following steps:
Rejection Region and Power Calculation Steps.
 (i)
 (ii)

For a two-sided test with LTA's, the rejection region from the $k$-th split, using either the tail-area Fdr or the local fdr screening, can be written as in (3.5). Then, writing $K$ for the total number of splits, the combined rejection region can be expressed as:

(3.11a) $R = \bigcup_{k=1}^{K} R_k$  or  (3.11b) $r = \bigcup_{k=1}^{K} r_k,$

depending on the choice of the screening tool. Equation (3.11) will include sets with one-sided regions only for p-value analyses or one-sided tests with LTA's.
 (iii)
 (iv)
The power and error probabilities in the steps above are associated with the final analysis method that combines all splits, and not with any single modeling split in particular. Therefore, for the estimation of Equations (3.6)-(3.10), it is appropriate to use a mixture model fitted to the entire data set for the estimates of the densities $f$, $f_0$ and $f_1$, as suggested in step (iii) above. An individual fit from any single modeling split should not be used for the power analysis.
Using the combined rejection region $R$ for screening will increase the number of rejections compared to the whole-data-based screening described in [Efron07]. This will naturally increase the power (3.6) of the proposed method, but the price will be a loss of precision (3.9), i.e., an increase in the process-wise false discovery rate (3.10).
Critical Detection Frequency.
Using only the cases in the combined rejection region with high detection frequencies as the potential discoveries will increase the precision of the method. If we want to impose a tolerance level on the process-wise false discovery rate, a critical detection frequency requirement can be derived as follows:

(3.12) 

Note that a factor is detected as potentially significant if its observed value falls in the rejection region. For any factor (individual hypothesis) under study, the relative frequency of detection over the repeated sample splittings is essentially the bootstrap resampling relative frequency of that factor lying in the rejection region. Thus, for a large number of splits, the relative frequency of detection for a factor estimates its probability of lying in the rejection region.

Following this derivation, if only the factors whose relative frequency of detection exceeds the resulting critical level are considered significant, the estimated process-wise false discovery rate is controlled at the tolerance level. In other words, a factor can be considered a significant discovery if, after the full run of sample splittings and subsequent screenings, its frequency of detection is at least

(3.13) 
Here the choices of the Fdr cutoff point and the tolerance level are user-defined (just as the choice of the level of significance is in a traditional single hypothesis test). Smaller values of either will result in more conservative screening.
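The counting rule can be sketched as follows; the critical relative frequency eta and the detection rates below are assumed illustrative values rather than quantities derived from (3.13):

```python
import numpy as np

rng = np.random.default_rng(2)
r = 500                                  # number of split/screen repetitions
eta = 0.01                               # assumed critical relative frequency
m_crit = int(np.ceil(r * eta))           # a factor needs >= m_crit detections

# Simulated detection counts: 5 true signals detected in ~20% of runs,
# 995 null factors detected in ~0.2% of runs (illustrative rates only).
counts = rng.binomial(r, 0.002, size=1000)
counts[:5] = rng.binomial(r, 0.20, size=5)

significant = np.flatnonzero(counts >= m_crit)
```

Even a modest critical count separates the consistently detected factors from the occasional null detections, which is what lifts the precision of the combined process.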
An added benefit of expressing the error probabilities as functions of the Fdr cutoff point is that one can choose a cutoff at which both the type I and the type II error probabilities are at reasonable levels. Alternatively, an appropriate cutoff can be chosen so that the proportion of correct classifications is at a desired level; this proportion is also known as the “accuracy” in machine learning.
4 Illustrative Examples.
In this section, we illustrate the proposed methodology with a microarray data set and an RNAsequencing data set where the goal is to identify genes that are expressed at a significantly higher or lower level in the experimental group compared to the control group. We also present several simulation studies to compare the proposed approach with the existing Fdr screening methods.
4.1 Microarray Data Analysis.
Empirical Bayes methods have been used to analyze microarray data [smyth2004linear, hirakawa, Efron08]. The proposed methodology is more general, since it can be applied with any test statistic (not necessarily Normal- or t-distribution-based).
We analyzed a prostate cancer microarray data set from [Singh] (used in [Efron10]; available in R in the package sda as the data file “singh2002”), which consists of 52 prostate cancer patients and 50 normal subjects. In the data, the expression levels of the same 6033 genes were measured for each subject. The analysis aims to detect genes that have significantly different expression levels between the cancer and noncancer groups and to explore the interrelation between those genes. The significant genes are captured in the screening step. An association plot, useful for exploring the interrelation between these genes, is generated based on concurrent detection frequencies (see Section 5 for discussion and interpretation of these plots). We used LTA's with tail-area Fdr screening for the analysis, although local fdr screening can also be used following the steps described in Section 3.
To begin, the data was split into the modeling set and the screening set. The modeling split consisted of 26 randomly selected prostate cancer patients and 25 normal subjects. The remaining 26 patients and 25 noncancer subjects formed the screening split. The modeling data was used to fit the contamination model (3.2). This is described next.
The two-sample t-statistic $t_i$, $i = 1, \dots, 6033$, is calculated for each gene from the modeling group, where it is assumed that $t_i$ follows a central $t$ distribution with 49 degrees of freedom under the null hypothesis. The LTA for each of these $t_i$'s is calculated as:

(4.1) $x_i = P(T_{49} \le t_i),$

where $T_{49}$ denotes a $t$-distributed random variable with 49 degrees of freedom.
Note that the $x_i$'s should be close to 0 or 1 for genes that are significantly differentially expressed between the cancer and noncancer groups. To compute the statistics, we used the difference (control mean − cancer mean). Hence, any $x_i$ close to 0 implies that the gene is expressed at a higher level in cancer patients, while any $x_i$ close to 1 indicates that the gene is expressed at a lower level in cancer patients compared to the control group. The histogram of the $x_i$'s from the 6033 genes in Figure 1 shows a bathtub shape. Prompted by that shape, we first fitted a mixture of a Uniform(0,1) and a Beta distribution to the modeling data and then readjusted it as described in Subsection 3.1.
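The left-tail-area computation in (4.1) can be sketched as follows, using simulated stand-in expression matrices in place of the actual microarray data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
control = rng.normal(size=(6033, 25))    # stand-in expression matrices
cancer = rng.normal(size=(6033, 26))     # (no real signal in this sketch)

def lta(x_control, x_cancer):
    """Left tail area of the pooled two-sample t-statistic for each gene."""
    n1, n2 = x_control.shape[1], x_cancer.shape[1]
    v1 = x_control.var(axis=1, ddof=1)
    v2 = x_cancer.var(axis=1, ddof=1)
    sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)   # pooled variance
    t = (x_control.mean(axis=1) - x_cancer.mean(axis=1)) \
        / np.sqrt(sp2 * (1 / n1 + 1 / n2))
    return stats.t.cdf(t, df=n1 + n2 - 2)  # df = 49 for 25 + 26 subjects

x = lta(control, cancer)                   # roughly Uniform(0,1) under the null
```

With no true signal, the resulting LTA's are roughly uniform; differentially expressed genes would push values toward 0 or 1, producing the bathtub shape described above.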
Figure 1 illustrates the histogram of the $x_i$'s (left tail areas of the observed two-sample t-statistics) from a particular screening split consisting of half of the control and treatment groups, respectively. Panel 1(a) features the fitted Uniform and Beta components (blue dashed lines) and the associated mixture density (red solid line) obtained from the corresponding modeling split. Panel 1(b) shows the adjusted empirical fit as in Equation (3.2).
Next, using the fitted model (3.2) in (3.3), we computed the tail-area Fdr associated with each gene. Genes with tail-area Fdr less than 0.1 were declared potentially significantly different between cancer patients and noncancer subjects. This procedure was repeated on 500 different sample splits.
Out of these 500 repetitions, 234 screening groups identified at least one significant gene with an associated tail-area Fdr less than 0.1. The other 266 screening groups failed to capture any significant gene. A few screening sets identified more than one significant gene. In the 500 repetitions of this procedure, out of the 6033 genes, 117 showed Fdr < 0.1 at least once, and 47 did so at least twice. These genes included some that are expressed at significantly higher levels among the cancer patients ($x_i$'s close to 0) and some at significantly lower levels ($x_i$'s close to 1).
Genes that are repeatedly detected as significant strongly confirm the difference between the patient and control groups. Table 1 shows the significant genes (detection frequency at least 2) along with the number of times they were identified as significant across the 500 sample splits. The false discovery rate calculated from the whole-data analysis by the BH [Benjamini] method is also included in the table.
Gene_ID  freq  rfreq  FDR_BH  med(Fdr)  med(x)  mean(x)  sd(x)  

610  54  0.108  0.001  0.068  0.000  0.001  0.003 
1720  51  0.102  0.005  0.063  0.000  0.002  0.006 
4331  28  0.056  0.019  0.067  0.998  0.992  0.017 
3940  22  0.044  0.015  0.065  0.999  0.994  0.013 
914  20  0.040  0.015  0.067  0.001  0.005  0.010 
364  15  0.030  0.015  0.076  0.999  0.995  0.010 
4546  8  0.016  0.018  0.043  0.999  0.994  0.014 
921  8  0.016  0.060  0.083  0.994  0.985  0.027 
3647  7  0.014  0.019  0.071  0.002  0.008  0.020 
4316  7  0.014  0.037  0.082  0.997  0.988  0.023 
1077  6  0.012  0.031  0.066  0.002  0.011  0.023 
2856  5  0.010  0.064  0.056  0.994  0.979  0.039 
579  5  0.010  0.019  0.081  0.002  0.007  0.012 
1089  4  0.008  0.019  0.059  0.002  0.009  0.020 
11  4  0.008  0.092  0.041  0.011  0.030  0.048 
1346  4  0.008  0.060  0.058  0.995  0.982  0.030 
2  4  0.008  0.063  0.079  0.006  0.018  0.032 
2852  4  0.008  0.151  0.060  0.014  0.038  0.059 
2945  4  0.008  0.065  0.085  0.994  0.982  0.031 
3930  4  0.008  0.070  0.078  0.008  0.023  0.041 
4088  4  0.008  0.035  0.065  0.997  0.989  0.022 
702  4  0.008  0.063  0.090  0.995  0.983  0.029 
3017  3  0.006  0.070  0.059  0.994  0.982  0.029 
3991  3  0.006  0.035  0.079  0.996  0.988  0.022 
4073  3  0.006  0.040  0.065  0.997  0.987  0.028 
4396  3  0.006  0.070  0.065  0.992  0.975  0.044 
1068  2  0.004  0.019  0.087  0.002  0.005  0.011 
1314  2  0.004  0.060  0.059  0.005  0.015  0.026 
1491  2  0.004  0.119  0.048  0.013  0.029  0.043 
1588  2  0.004  0.070  0.080  0.007  0.021  0.040 
1966  2  0.004  0.120  0.071  0.014  0.036  0.055 
2370  2  0.004  0.063  0.062  0.006  0.018  0.034 
2621  2  0.004  0.512  0.078  0.062  0.097  0.110 
2745  2  0.004  0.627  0.080  0.912  0.862  0.145 
2785  2  0.004  0.327  0.077  0.969  0.936  0.084 
2912  2  0.004  0.137  0.076  0.014  0.035  0.055 
332  2  0.004  0.015  0.094  0.001  0.004  0.007 
3600  2  0.004  0.065  0.056  0.994  0.984  0.027 
3848  2  0.004  0.386  0.064  0.040  0.074  0.092 
4000  2  0.004  0.063  0.079  0.993  0.982  0.032 
4040  2  0.004  0.095  0.082  0.989  0.974  0.041 
4104  2  0.004  0.060  0.086  0.994  0.982  0.031 
4417  2  0.004  0.529  0.063  0.931  0.891  0.120 
4515  2  0.004  0.117  0.057  0.987  0.969  0.054 
4518  2  0.004  0.033  0.085  0.003  0.010  0.021 
4549  2  0.004  0.060  0.083  0.005  0.016  0.033 
758  2  0.004  0.255  0.067  0.973  0.946  0.071 

med(x), mean(x) and sd(x) are the median, mean and standard deviation of the 500 LTA's for each gene computed from the 500 screening splits.

Table 1 further shows that the genes most frequently screened by the proposed method would also be detected by the BH method, since their BH false discovery rates were below 0.1 (Efron's approach would give a similar result, as it is related to the BH method [Efron07]). However, our approach was able to capture a few genes (with low detection frequencies) that would have been missed by BH screening if the same threshold value of 0.1 were used. For example, genes 3848, 4417, 4515 and 758 all had BH false discovery rates larger than 0.1 but were detected twice by our method; that is, these genes showed tail-area Fdr < 0.1 in 2 of the 500 screening splits used in the analysis. Note that the median LTA, med(x), for these genes is fairly close to 0 or 1, which supports the conclusion that these are potentially significant genes.
Figure 2(a) shows the genes identified as significant at least twice and the sets of genes with which they were simultaneously identified as significant across the 500 sample splits and subsequent screenings. In this gene frequency association plot (F-association plot), the nodes and edges respectively indicate the detected significant genes and the simultaneous detection of two genes in a particular split. The edge width increases with the frequency of simultaneous detection for a pair of significant genes. The node color represents the median tail area over the 500 screening data sets for that gene; the color shifts from blue to red as the tail area increases from small (close to 0) to large (close to 1). That is, blue nodes represent genes expressed at significantly higher levels, and red nodes represent genes expressed at significantly lower levels, in the cancer patients' group compared to the controls. Figure 2(b) shows the genes detected at least 5 times, for a clearer picture of the network.
4.2 RNAseq Data Analysis
In this subsection, we show the application of our methodology to the analysis of global gene expression data from RNA-seq, a commonly used tool for transcriptome profiling. Here we demonstrate how the method can be used with p-values. We used data consisting of RNA-Seq profiles of lymphoblastoid cell lines derived from 69 Yoruba individuals from Ibadan, Nigeria [pickrell2010understanding]. The analysis aims to investigate differentially expressed genes between males and females. The RNA counts are available in the R package tweeDEseqCountData. In the raw RNA counts, there are 38415 genes with or without defined annotations. To filter out the noninformative genes, we keep genes with both defined annotations and at least 1 count-per-million (CPM) in at least 20 samples. After filtering, 17310 genes remain for the differential gene analysis.
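The CPM filter described above can be sketched as follows, using simulated stand-in counts in place of the actual RNA-seq matrix:

```python
import numpy as np

rng = np.random.default_rng(4)
# Stand-in count matrix (genes x samples): the first 100 genes are
# expressed at very low levels and should be filtered out.
counts = rng.poisson(5.0, size=(1000, 69))
counts[:100] = rng.poisson(0.05, size=(100, 69))

# Counts-per-million per sample, then keep genes with CPM >= 1
# in at least 20 of the 69 samples.
cpm = counts / counts.sum(axis=0, keepdims=True) * 1e6
keep = (cpm >= 1).sum(axis=1) >= 20
filtered = counts[keep]
```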
Gene  Chrom  freq  FDR  med(FDR)  med(p)  mean(p)  sd(p) 

XIST  X  91  1.78E-42  2.27E-19  8.92E-25  2.44E-23  7.24E-23 
CYorf15A  Y  91  1.03E-37  1.70E-16  1.64E-21  1.00E-20  2.58E-20 
CYorf15B  Y  91  1.31E-32  1.76E-14  1.92E-19  1.09E-18  2.13E-18 
RPS4Y2  Y  91  1.31E-32  3.85E-14  2.32E-19  4.52E-18  1.49E-17 
TTTY15  Y  91  1.53E-31  2.60E-13  1.31E-18  4.96E-17  2.03E-16 
EIF1AY  Y  91  8.78E-27  1.39E-11  3.89E-16  7.25E-15  2.11E-14 
NLGN4Y  Y  91  4.21E-24  7.37E-10  2.89E-14  3.86E-13  9.19E-13 
UTY  Y  91  7.01E-22  1.97E-09  1.63E-13  1.87E-12  5.62E-12 
AC010889.1  Y  91  6.29E-21  4.93E-09  6.55E-13  3.18E-12  1.01E-11 
RPS4Y1  Y  91  6.29E-21  1.16E-08  3.81E-13  2.03E-11  6.14E-11 
KDM5D  Y  91  1.82E-20  1.27E-08  1.41E-12  1.41E-11  4.08E-11 
RP11331F4.1  16  91  4.13E-20  2.12E-08  2.51E-12  2.82E-11  9.40E-11 
DDX3Y  Y  91  3.30E-18  2.16E-07  1.56E-11  6.59E-10  2.70E-09 
NLGN4X  X  91  2.24E-15  4.98E-06  7.24E-10  7.74E-09  1.86E-08 
PNPLA4  X  89  3.65E-12  1.91E-04  9.08E-08  3.44E-07  8.81E-07 
RP13204A15.4  X  88  3.68E-13  2.41E-05  5.84E-09  1.84E-07  6.61E-07 
AL137162.1  20  84  1.19E-10  4.20E-04  2.16E-07  1.48E-06  2.72E-06 
RP51068H6.3  20  76  1.22E-09  9.72E-04  8.62E-07  2.95E-06  4.99E-06 
AC016734.3  2  72  3.29E-09  7.49E-04  9.67E-07  1.10E-05  3.66E-05 
RP11143J12.1  18  64  1.47E-08  1.40E-03  2.47E-06  9.25E-06  1.58E-05 
RPS4X  X  64  1.56E-08  1.88E-03  3.49E-06  9.70E-06  1.69E-05 
RP11411G7.1  17  62  1.56E-08  1.80E-03  3.95E-06  1.14E-05  2.14E-05 
RP11863N1.1  18  61  3.35E-08  9.45E-04  2.68E-06  3.21E-05  8.07E-05 
CTD3116E22.2  19  58  3.56E-08  2.64E-03  6.68E-06  2.39E-05  6.72E-05 
RP11135F9.1  12  56  2.41E-08  2.73E-03  5.29E-06  1.87E-05  3.00E-05 
RP11624G17.1  11  56  3.59E-08  2.82E-03  5.76E-06  1.66E-05  3.06E-05 
HDHD1  X  54  5.64E-08  1.58E-03  4.20E-06  3.77E-05  9.29E-05 
RP1121K20.1  12  48  5.64E-08  2.10E-03  7.33E-06  2.88E-05  4.67E-05 
Of the 69 individuals, 40 are female and 29 are male. From the whole group, 20 female and 15 male subjects were randomly selected to construct the modeling split (the remaining subjects formed the screening split). Assuming the data follow a negative binomial distribution
[anders2010diff], the p-value for each gene was calculated using a generalized likelihood ratio test comparing male and female subjects in the modeling split. The modeling data consist of these 17310 p-values. Similar p-values from the screening split provided the screening data. A mixture of a Uniform(0,1) and a Beta distribution was fitted to the modeling-data p-values and was adjusted to obtain the empirical fit as described in Subsection 3.1. Figure 3 illustrates the effect of the adjustment on the whole data. The initial mixture model fit, combining the Uniform background density and the Beta signal density, is illustrated in Panel 2(a), while the adjusted mixture model, according to Equation (3.2), is featured in Panel 2(b).
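A two-point Uniform/Beta mixture of this kind can be fitted by direct maximum likelihood. The sketch below uses Python/SciPy rather than the paper's R implementation, and the starting values are arbitrary assumptions:

```python
import numpy as np
from scipy import stats, optimize

def fit_uniform_beta(pvals):
    """MLE for the p-value mixture pi0*Uniform(0,1) + (1-pi0)*Beta(a, b)."""
    def nll(theta):
        pi0, a, b = theta
        # Uniform(0,1) density is 1, so the mixture density is pi0 + (1-pi0)*Beta pdf
        dens = pi0 + (1 - pi0) * stats.beta.pdf(pvals, a, b)
        return -np.sum(np.log(np.clip(dens, 1e-300, None)))
    res = optimize.minimize(nll, x0=[0.9, 0.5, 2.0],
                            bounds=[(1e-3, 1 - 1e-3), (1e-3, 100), (1e-3, 100)])
    return res.x  # estimated (pi0, a, b)
```

When the signal p-values concentrate near 0, the fitted Beta component typically has a shape parameter a below 1.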
For the analysis, the adjusted fit from each modeling data set was used to calculate the tail-area Fdr associated with the p-value of each gene in the corresponding screening data. Genes with Fdr less than 0.01 were detected as significantly differentially expressed between male and female subjects. The process was repeated 100 times.
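For a fitted Uniform/Beta mixture, the tail-area Fdr takes a simple closed form, Fdr(p) = pi0 · p / F(p), with F the mixture CDF. A minimal sketch, assuming the mixture parameters are already estimated:

```python
from scipy import stats

def tail_fdr(p, pi0, a, b):
    """Tail-area Fdr under the fitted mixture: Fdr(p) = pi0 * p / F(p),
    where F(p) = pi0 * p + (1 - pi0) * BetaCDF(p; a, b) is the mixture CDF."""
    F = pi0 * p + (1 - pi0) * stats.beta.cdf(p, a, b)
    return pi0 * p / F
```

A gene in the screening split would then be flagged when tail_fdr(p, pi0, a, b) falls below the chosen cutoff, e.g. 0.01.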
The 100 sample splits and subsequent screenings detected a total of 83 significant genes out of 17310. Of these 83 detected significant genes, 56 appeared more than once across the 100 repetitions of splitting and screening. Table 2 presents a partial summary of the detection results: the 28 most frequently detected significant genes under tail-area Fdr screening (detection frequencies 40 or higher). The “freq” column indicates the frequency of occurrence of the corresponding gene in the 100 screening splits. Note in particular that the top-ranked gene XIST (X inactive specific transcript) is expressed only in females, where it silences one of the two X chromosomes; this provides the dosage equivalence between females and males. Most of the other frequently detected significant genes appear on either the X or the Y chromosome, which is expected since such genes are differentially expressed between males and females.
Figure 4 shows the F-association plot for the most significant genes. Since we are using p-values for the analysis, detected genes cannot be flagged as over-expressed or under-expressed (unlike LTA values); they can only be detected as significantly differentially expressed between females and males. Therefore the color scheme of Figure 2 is absent in Figure 4. Each node represents a gene, and a gray link between two nodes indicates a pair of genes that are simultaneously differentially expressed. In Figure 3(a), most of the inner clustered genes come from the top-ranked genes of Table 2, and the farther outward a node sits in the frequency network, the lower its frequency of being detected as differentially expressed.
4.3 Simulation Studies
In this section, we present several simulation studies featuring the proposed screening. Subsection 4.3.1 includes a continuous data analysis and a detailed power and precision comparison with Efron’s whole-data-fit method. Subsection 4.3.2 examines the specificity of the method on simulated data without any signal. Subsection 4.3.3 illustrates the analysis of discrete data, while Subsection 4.3.4 analyzes a similar discrete model simulation with a larger signal proportion.
4.3.1 Continuous Data Simulation
We simulated a data set with a treatment group of 50 subjects and a control group of 50 subjects, with 1000 expression values simulated for each subject. Of the 1000 variables, the first 30 were set to be the “non-null” cases (expressions simulated from different distributions for the treatment and control groups), while the remaining 970 variables were set to be the “null” cases (expressions simulated from the same distribution for both groups).
Output Variables  Mean  Variance (treatment and control)  

treatment  control  
3  
(independent)  6  2  
(independent)  6  2 
Further, to show that the F-association plots constructed from the detection frequencies can pick up existing interrelationships between screened cases, we added a correlation structure among the first 10 non-null variables. In a gene expression study, if some genes are interrelated, they will work in tandem in any subject, whether from the control or the treatment group, although their expression levels can differ between the two groups. Accordingly, for the simulated data the same correlation structures were applied to both the treatment and the control group while keeping the non-null means different in the two groups.
We used the normal distribution for the simulation. The same null distribution was used for all 970 null variables for each subject in the treatment and the control groups. The treatment and control group parameters used to simulate the 30 non-null variables are described in Table 3, where the mean and variance columns give the Normal mean vectors and variance-covariance matrices used for the simulation. The first ten variables are clustered together with the covariance structure shown in the table, whereas variables 11 to 30 were simulated independently. Ten values were generated from one distribution, one for the treatment-group mean of each of variables 11 to 20; similarly, ten values were generated from another distribution, one for the treatment-group mean of each of variables 21 to 30. The 100 subjects were randomly split into a modeling and a screening set, each consisting of 25 subjects from the treatment group and 25 subjects from the control group. The data points were defined as the left tail areas of the two-sample t-test statistics, each computed for an output variable from the 25 control and the 25 treatment samples in the modeling split. Then a mixture of a Uniform and a Beta distribution was fitted to the modeling-split tail areas and adjusted to better capture the background and the signal, similar to (3.2) in Section 4.1. Figure 5 shows the histogram of the left tail areas of the observed two-sample t-statistics for the simulated variables from a particular screening data set. Figure 4(a) illustrates the initial mixture model, combining the Uniform pdf and the Beta pdf, fitted from the modeling data; the red solid line shows the fitted mixture distribution. Figure 4(b) presents the adjusted empirical null fit according to Equation (3.2).
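The data-generating and tail-area steps can be sketched as follows. This is a simplified setup with a single upward mean shift for the non-null block, not the exact Table 3 parameters:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_treat, n_ctrl, n_vars, n_signal = 50, 50, 1000, 30

treat = rng.normal(0.0, 1.0, size=(n_treat, n_vars))
ctrl = rng.normal(0.0, 1.0, size=(n_ctrl, n_vars))
treat[:, :n_signal] += 2.0        # shift the non-null means upward (illustrative)

# two-sample t statistics and their left tail areas, one per variable
t_stat = stats.ttest_ind(treat, ctrl, axis=0).statistic
lta = stats.t.cdf(t_stat, df=n_treat + n_ctrl - 2)
```

Left tail areas near 0 or 1 mark candidate non-null variables; these values are what the Uniform/Beta mixture is fitted to.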
Here we used tail-area Fdr screening to construct the frequency table (Table 4) and the F-association plots (Figure 6). The local fdr screening results are used for comparison in Figure 7.
The two-sample t-statistic for each variable in the screening split and its left tail area formed the screening data. The modeling-split fit (3.2) was used to obtain the tail-area Fdr associated with each screening data point. Variables with tail-area Fdr below the chosen threshold were detected as significant, and the process was repeated 100 times. Table 4 shows the 26 most frequently detected significant variables from the 100 sample splits under tail-area Fdr screening (detection frequencies 2 or higher). The “freq” column shows the frequency of detection of the corresponding variable over the 100 screening splits.
variable  freq  med(Fdr)  med(LTA)  avg(LTA)  sd(LTA) 

8  63  0.03660  0.00006  0.00058  0.00153 
5  45  0.04460  0.99987  0.99928  0.00146 
10  44  0.03740  0.00040  0.00266  0.00862 
6  41  0.04690  0.99979  0.99898  0.00216 
7  38  0.05290  0.99971  0.99868  0.00303 
15  36  0.04490  0.00051  0.00399  0.01051 
26  34  0.05280  0.99962  0.99826  0.00362 
21  22  0.05030  0.99923  0.99583  0.01447 
9  20  0.04850  0.00213  0.01042  0.01890 
16  17  0.04820  0.00233  0.01138  0.02408 
23  17  0.05830  0.99841  0.99473  0.01289 
30  17  0.07370  0.99819  0.99265  0.01572 
29  13  0.04630  0.99760  0.99270  0.01855 
12  12  0.04980  0.00648  0.01892  0.03579 
20  10  0.06330  0.00518  0.02024  0.03748 
11  9  0.06040  0.00311  0.01547  0.03765 
13  6  0.06720  0.01671  0.02783  0.03881 
19  6  0.06930  0.00958  0.01910  0.02499 
4  6  0.06720  0.01157  0.02577  0.04406 
24  5  0.04710  0.99692  0.99130  0.01475 
3  5  0.06420  0.00861  0.02270  0.03694 
106  3  0.03140  0.02360  0.04740  0.06810 
523  3  0.04050  0.02281  0.05717  0.10463 
27  2  0.07010  0.98869  0.96776  0.05341 
508  2  0.07170  0.04119  0.07933  0.10046 
919  2  0.05420  0.92189  0.89232  0.12438 
Note that, although variables 1 to 30 of the 1000 simulated variables were set as non-null, the groups of variables 5, 6, 7 and 8, 9, 10 deviated the most from the null. The mean vectors for variables 1, 2 and 3, 4 did not deviate enough from the null mean to produce significantly large or small t-statistic values. Because the analysis was based on the t-statistic tail areas and not on the original normal observations, the weakly deviating non-null variables 1 and 2 were naturally not captured in the screening process.
Figure 5(a) presents the F-association plot of the 26 significant variables appearing at least twice in the 100 sample splits under tail-area Fdr screening. Figure 5(b) is the simplified network of Figure 5(a), obtained by deleting variables with fewer than 5 connected edges. Here a blue node indicates that the variable was expressed at a lower level in the treatment group than in the control group, whereas red implies a higher expression level in the treatment group.
The F-association plots in Figures 5(a) and 5(b) show that among the detected significant variables, the groups with the strongest correlation structure, one with variables 5, 6 and 7 and another with variables 8 and 10, were captured successfully. However, the plot did not capture all correlated variables; e.g., variable 9, which had a low correlation with variable 10 but a higher correlation with variable 8, did not show up in that group.
The association featured in the network plot depends on the difference between the control and treatment groups in the observed data; while the inherent dependence structure might drive that difference, the difference itself need not indicate dependence. Hence the plot should be used for exploration and not to prove a causal association. A more detailed discussion of the limitations of the F-network is included in Section 5, while a justification for the association is presented in Appendix A.
Power and precision comparison with the whole-data-fit method.
To compare the efficacy of the proposed method with the whole-data fit/screening, we present the power (recall) and precision (Equations (3.6) and (3.9)) comparisons in Figure 7. Since Efron [Efron07] used the local fdr for screening, for comparability with repeated sample splitting, the combined rejection region based on local fdr screening as in (3.11b) is used in Figure 7 alongside the rejection region from the whole-data fit and local fdr screening method.
The combined rejection region from repeated sample splits is larger and hence provides higher power, as is evident in Figure 6(a). But the inclusion of all detections increases the number of false discoveries and consequently decreases the precision, as seen in Figure 6(b). Note that, even when the entire combined rejection region was used with the proposed method, at relevant values 0.1 to 0.3 (reasonable cutoff points for the local fdr) the sample splitting technique shows a level of precision on par with the whole-data fit/screening method.
However, we propose that only the variables with high detection frequencies, not the entire combined rejection region, should be screened as potential true discoveries. By adding a high-frequency criterion (as in (3.13)) to the combined rejection region, the precision can be set at any desired fixed level. Moreover, with sufficiently many repeated sample splits, the set of high-frequency variables in the combined rejection region is unlikely to be much smaller than the rejection set produced by the existing methods, so high-frequency screening can be tuned to achieve the intended level of precision without significant loss of power.
For example, 21 of the 26 detections (about 81%) in Table 4 with detection frequencies of 2 or higher were true discoveries. If only variables with detection frequency 5 or higher are considered, all 21 remaining detections are true discoveries (100%).
Critical detection frequency and the combined FDR bound.
For a fixed threshold, we screen significant genes at every splitting/screening step whenever the tail-area Fdr in the screening split falls below that threshold. This leads to the rejection region from an individual split according to Equation (3.5a), which can be found by a numerical grid search for the cutoff so that Equation (3.5a) holds. The overall rejection region is constructed by combining the rejection regions from 100 repeated splits, according to Equation (3.11a).
For example, for the simulated data generated with the parameters in Table 3, keeping the threshold fixed, the combined rejection region can be obtained in this way from the 100 repeated random sample splits/screenings.
The power and critical detection frequency calculation steps described in Subsection 3.3, Equations (3.10) and (3.13), use a mixture distribution fitted on the entire data without sample splitting (as described in step (iii) under the heading “Rejection Region and Power Calculation Steps” in that subsection). For this particular simulated data set, the whole-data Uniform/Beta mixture was fitted on the left tail areas and then tail-adjusted according to Equation (3.2).
When the splitting/screening is repeated multiple times, the critical detection frequency for potential true significance needed to keep the overall false discovery rate bounded can be calculated from Equation (3.13), where the bound can be user-defined, such as 0.05, 0.1, or 0.2. A numerical integration procedure is used for this part of the calculation in our R package.
To generalize the concept to any number of splits, one may instead consider the critical detection “relative” frequency.
Figure 8 illustrates the critical relative frequency of detection required to keep the overall combined false discovery rate bounded at 0.05, 0.1, or 0.2, when each screening split uses the tail-area Fdr detection criterion.
For example, in Figure 7(b), the relevant curve shows that the minimum relative frequency needed is about 0.01. This implies that if, for each screening split, a variable is declared significant when its tail-area Fdr falls below the cutoff and the process is repeated 500 times, then any variable needs to be detected more than 500 × 0.01 = 5 times to be identified as a potential true discovery in order to keep the combined false discovery rate bounded. If the process is repeated 1000 times, any variable needs to be detected more than 10 times to be identified as a potential true discovery.
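The frequency-thresholding step can be sketched as follows. The toy detection process below (two variables always flagged plus one random spurious variable per split) is an illustrative assumption standing in for the per-split Fdr screen:

```python
import numpy as np

rng = np.random.default_rng(0)
n_splits, n_vars, rfreq_crit = 500, 10_000, 0.01

counts = np.zeros(n_vars, dtype=int)
for _ in range(n_splits):
    # toy screen: variables 0 and 1 pass every split, plus one spurious index
    flagged = [0, 1, rng.integers(2, n_vars)]
    counts[flagged] += 1

# keep variables detected more than n_splits * rfreq_crit (= 5) times
keep = np.flatnonzero(counts > n_splits * rfreq_crit)
```

Spurious detections scatter across many variables and so rarely exceed the critical count, while persistently detected variables survive.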
Figure 9 shows the type I and type II errors as functions of the cutoff point for tail-area Fdr screening, calculated from the simulated data following the steps described in Section 3.3. This may help in choosing an appropriate cutoff point for the main analysis.
In Appendix C we include an overall performance comparison between the repeated sample splitting/screening procedure and the conventional Benjamini-Hochberg (BH) screening method over 100 different data sets simulated from the Normal distribution setup used in this section. The comparison reveals that our approach, as a whole, performed better in terms of stability selection [Meinshausen]. The sample splitting/screening approach successfully detected all non-null genes, whereas the BH screening detected some non-null genes with high relative frequency but missed nearly half of them. The results also showed a prominent specificity difference between the two approaches: our approach almost never detected null genes as non-null, while the BH method erroneously picked up almost all null genes as non-null multiple times over the course of the 100 simulations (Appendix C contains further details).
4.3.2 Null Model Simulation (Specificity Study)
Next, we consider the null model, i.e., the scenario where both the experiment and the control group data come from the same population distribution. In this case, all hypotheses under study belong to the null class, and any classification or detection of a non-null case is necessarily a false discovery. To assess how well a given procedure works in this scenario, we can examine the percentage of times the procedure correctly assigns a hypothesis to the null class, i.e., the “specificity” of the method.
We simulated continuous data on 1000 genes for 100 subjects from a normal distribution with a common mean and standard deviation. The first 50 subjects and the next 50 subjects respectively constitute the control group and the treatment group, which obviously come from the same distribution. Figure 10 illustrates the specificity comparison of the BH method (dark solid line) and the proposed method (gray dashed line). The vertical axis presents one minus the specificity, so a smaller height of the plot indicates higher specificity, i.e., a higher rate of correct classification of null cases. The plot shows that the proposed method has slightly higher specificity for moderate FDR cutoffs but lower specificity for large FDR cutoffs. Note that in multiple testing, only screened factors with small FDR are considered to be true discoveries, and Figure 10 shows that the proposed method works better in that crucial region.
4.3.3 Discrete Data Simulation
Here we consider a simulation of discrete data for 100 subjects and 1000 genes, in which the first 50 subjects form the control group and the other 50 subjects form the treatment group. The control group values are simulated from a negative binomial distribution with a given mean and dispersion (we use the negative binomial parameterization of [robinson2007small]). The treatment group data are simulated from a mixture of negative binomial distributions, in which the first 30 genes are the non-null genes, simulated from a negative binomial distribution with a shifted mean and the same dispersion value as the control group.
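Counts in a mean/dispersion parameterization of this kind (variance μ + φμ², as in [robinson2007small]) can be drawn by mapping (μ, φ) to NumPy's (n, p) parameters. The specific values μ = 10 and φ = 0.2 in the test are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def rnegbin(rng, mu, phi, size):
    """Draw NB(mu, phi) counts with mean mu and variance mu + phi * mu**2,
    via numpy's parameterization: n = 1/phi successes, p = n / (n + mu)."""
    n = 1.0 / phi
    return rng.negative_binomial(n, n / (n + mu), size=size)
```

Control-group counts come from one call; non-null treatment-group counts come from the same call with a shifted mean and the same dispersion.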
Gene_ID  freq  rfreq  FDR(BH)  p-value 

22  44  0.088  0.001  1.01E06 
27  17  0.034  0.043  2.55E04 
5  14  0.028  0.031  6.95E05 
25  11  0.022  0.031  9.36E05 
209  9  0.018  0.033  1.30E04 
21  9  0.018  0.081  9.37E04 
4  8  0.016  0.034  1.71E04 
2  7  0.014  0.048  3.34E04 
19  6  0.012  0.054  4.70E04 
193  5  0.010  0.081  9.68E04 
29  5  0.010  0.054  5.22E04 
We used 500 random splits to analyze the data. Using the Fdr threshold 0.05 and the critical detection relative frequency threshold 0.01, our method found 11 significant genes, including 9 true discoveries and 2 false discoveries. Alternatively, the conventional BH method, using the threshold 0.05, detected 7 genes, including 6 true discoveries and one false discovery. Screening results are shown in Table 5, where the columns freq and rfreq give the detection frequencies and detection relative frequencies of the significant genes. The FDR values in the table are calculated with the BH method using the whole data, and the last column presents the p-value for each gene from the whole-data likelihood ratio test comparing the treatment and control group averages. Table 5 shows that genes 19, 21 and 29 were detected at least 5 times by our method while their BH FDR values exceeded the threshold 0.05; hence they would have been missed by the BH screening.
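For reference, the BH step-up rule used in these comparisons can be sketched as:

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """BH step-up: reject the k smallest p-values, where
    k = max{ i : p_(i) <= i * alpha / m } (k = 0 if no such i)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    passed = p[order] <= np.arange(1, m + 1) * alpha / m
    k = passed.nonzero()[0].max() + 1 if passed.any() else 0
    rejected = np.zeros(m, dtype=bool)
    rejected[order[:k]] = True
    return rejected
```

Applied to the whole-data p-values at alpha = 0.05, this reproduces the kind of rejection set the proposed method is compared against.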
4.3.4 Discrete Data Simulation with Larger Signal Proportion
We repeated the discrete data simulation setup of Subsection 4.3.3 but with 100 non-null genes in the treatment group instead of 30. That is, data were simulated for 1000 genes for 50 control subjects and 50 treatment subjects, with the control group data generated from a negative binomial distribution and the treatment group data from a mixture of negative binomial distributions in which the first 100 genes are the non-null genes, simulated from a negative binomial distribution with a shifted mean.
The analysis was done with 500 random sample splits using the proposed method. Using the FDR threshold 0.05 and the critical relative detection frequency 0.02, our method detected 46 significant genes, including 39 true discoveries and 7 false discoveries. Comparatively, the BH method detected 39 genes, including 34 true discoveries and 5 false discoveries. The power of the proposed method was 0.39, and the power of the BH method was 0.34. Table 6 presents the screening results, including the detection frequencies and relative frequencies of all 46 genes along with their corresponding BH FDR values. The table shows that the proposed method detected non-null genes 27, 47, 80, 85, and 18 at least 10 times, while their BH FDR values were greater than 0.05; hence they would have been missed by the BH method at the same 0.05 FDR threshold.
Gene_ID  freq  rfreq  FDR(BH)  p-value 

86  123  0.246  0.002  5.03E06 
58  121  0.242  0.002  9.67E06 
70  107  0.214  0.002  1.15E05 
56  101  0.202  0.003  2.47E05 
66  96  0.192  0.003  1.89E05 
94  96  0.192  0.002  1.32E05 
55  93  0.186  0.002  1.28E05 
92  93  0.186  0.002  8.33E06 
21  77  0.154  0.006  8.97E05 
24  69  0.138  0.004  3.61E05 
25  66  0.132  0.003  2.41E05 
209  63  0.126  0.004  4.74E05 
23  53  0.106  0.012  1.86E04 
77  51  0.102  0.009  1.42E04 
72  50  0.100  0.004  5.31E05 
91  49  0.098  0.060  2.59E03 
83  47  0.094  0.024  7.30E04 
39  46  0.092  0.017  4.03E04 
82  45  0.090  0.005  6.10E05 
32  42  0.084  0.018  4.66E04 
99  42  0.084  0.017  3.95E04 
12  37  0.074  0.031  1.15E03 
57  34  0.068  0.013  2.64E04 
74  34  0.068  0.013  2.79E04 
8  29  0.058  0.013  2.52E04 
95  29  0.058  0.012  1.96E04 
35  28  0.056  0.018  5.00E04 
73  27  0.054  0.017  3.92E04 
97  25  0.050  0.031  1.07E03 
29  22  0.044  0.018  4.80E04 
22  21  0.042  0.013  2.80E04 
68  20  0.040  0.029  8.96E04 
193  18  0.036  0.018  4.46E04 
42  18  0.036  0.031  1.10E03 
60  18  0.036  0.024  7.26E04 
27  17  0.034  0.059  2.46E03 
927  15  0.030  0.029  9.42E04 
181  14  0.028  0.106  6.23E03 
763  14  0.028  0.031  1.22E03 
4  13  0.026  0.031  1.17E03 
117  12  0.024  0.031  1.05E03 
706  12  0.024  0.064  2.82E03 
47  11  0.022  0.090  4.53E03 
80  11  0.022  0.057  2.27E03 
85  11  0.022  0.077  3.75E03 
18  10  0.020  0.057  2.35E03 

5 Discussion
Sample Splitting:
In a multiple testing situation, if the entire available data set is used to fit a null/non-null mixture model, then using the same model for non-null detection on that data may compromise the screening process through overfitting of the empirical model. The sample splitting in the proposed method lets one part of the available information be used for model building and the other part for screening significant cases, and hence avoids that drawback. Further, when the data are randomly split multiple times, a different (though not necessarily disjoint) modeling set is produced each time. Fitting models on these different splits helps to neutralize sources of variation (noise) other than the one of interest in a study.
The use of only partial information for model building may lead to some loss of power for an individual modeling split, but repeated sample splitting and combining the resulting rejection regions overcomes that. In fact, in all our simulation studies, the proposed method had a higher number of true discoveries than the existing whole-data-fit method, and in models with a higher signal proportion this difference was more pronounced. The combined rejection region also accumulates false discoveries and reduces precision, but the critical detection frequency threshold is designed to attain any fixed precision level for the entire process.
F-association Plots:
The other benefit of the repeated sample splitting is the F-association plots, constructed from the detection frequencies across the repeated splits. The simulation study shows that if some screened non-null cases are indeed associated, that relation is captured in the group structure of the F-association plot (see Appendix A for a heuristic justification). We want to emphasize that the F-association plots generated by this method should not be used as “proof” of group behavior among the cases. If two unrelated non-null cases deviate strongly from the null distribution, they are bound to be frequently detected as significant no matter how the data are split. In that case, these two unrelated non-null cases may repeatedly appear in the detected sets at the same time and consequently show up in the F-association plot as a group. Therefore, the F-association plot is not intended to be used as a basis for claiming a causal relation.
Thus we recommend that these plots be used as an exploratory tool preceding further investigation to establish possible causal relationships among cases that show high concurrent detection frequencies. When a study includes thousands of cases, even a starting point for an exploratory network analysis can be highly useful and cost-effective.
The relevance and effectiveness of the proposed method are particularly clear in studies whose main goal is the identification of groups of differentially expressed genes. Such studies are commonly used to identify the genes associated with a specific biological behavior. Genes detected through the proposed methodology can be selected by experimental biologists for detailed follow-up functional studies. For example, a biologist may look into the few most frequently identified significant genes to distinguish the “regulators”. A systematic gene knockout experiment, conducted on the sets that appear in the F-association plot as a group, can reveal the effect of individual genes on the biological response of interest. This can provide a novel starting point for the elucidation of gene networks or hierarchical regulation patterns in a biological system. Thus, the proposed analysis can guide an exploratory biological study in which, instead of experimental investigation of the effect of every gene, a small subset of significant genes is selected for further experimentation to establish their individual or collective roles in the biological response of interest.
Conditional Independence:
An interesting question can be posed: when does the empirical Bayes model (1.2) work? It of course works when the observations are i.i.d. according to (1.2), but in many cases (in particular, the microarray example considered here) that assumption is not correct. It does work, though, when pooling observations that are conditionally independent. For example, in a microarray, groups of genes may act together while still being conditionally independent. This is a typical argument in multiple hypothesis testing [Karlin81, Sarkar97, Huque16]. See Appendix B for an explanation of why the empirical Bayes approach works in this case.
In conclusion, we present a method for identifying significant cases when carrying out a large number of simultaneous tests. We propose a cross-validation-type analysis in which part of the available information goes into understanding the underlying process, or model fitting, while the other part goes into screening for extreme cases. Random splitting and repeated screening provide a way to reduce the noise (other sources of variation) in the analysis. We also present a way of controlling the overall chance of false discoveries arising from combining the screening results from all splits. As a byproduct, we obtain an exploratory look into the association patterns among significant cases.
6 Software
The analyses were done using R, and the relevant code is available upon request. An R package implementing the proposed analysis method, tiltmod, is available for general use and can be downloaded via: https://github.com/chongma1989/tiltmod.
References
Appendix A F-association plots justification.
When we try to detect genes whose expression levels differ significantly between the treatment group and the control group, the detection can be based on the difference in the average expression level between the two groups.
Let D_i denote the average expression difference between the experiment and the control group for the i-th gene. Genes that behave similarly in both groups should have D_i close to 0, whereas genes that are truly different in the two groups should have D_i removed from 0 in either a positive or a negative direction. These extreme genes will have smaller chances of being false discoveries. Thus an extreme-gene detection criterion stated in terms of a small Fdr threshold is equivalent to one of the form |D_i| > c for some c > 0.
Let us consider two genes g_1 and g_2 that are truly different on average in the treatment group compared to the control group. Suppose g_1 is detected, that is, the observed value of D_1, the average expression difference for g_1, is beyond the threshold; without loss of generality suppose D_1 > c. We argue below that the chance that g_2 will also be detected is higher when g_1 and g_2 have some level of dependency than when they behave completely independently.
Similar to the simulation study in Section 4.3, we assume the D_i's follow a Normal distribution (not an entirely hypothetical scenario if the sample sizes in the experiment and the control groups are large enough). If g_1 and g_2 function independently, we assume D_1 and D_2 are independent N(μ_1, σ²) and N(μ_2, σ²) variables. If g_1 and g_2 are not independent, we assume (D_1, D_2) is bivariate Normal with means μ_1 and μ_2, common variance σ², and correlation ρ. (We will denote the left and right tail probabilities of the standard Normal distribution by Φ and Φ̄, respectively.)
If the two genes behave independently and g_1 is detected, the chance that g_2 will also be detected is Φ̄((c − μ_2)/σ), the right tail area from (c − μ_2)/σ in a standard Normal distribution. If the two genes are not independent, then we have the following two possibilities:

If ρ > 0 and g_1 is detected, the chance that g_2 will also be detected is the conditional probability P(D_2 > c | D_1 > c) computed under the bivariate Normal model above, which exceeds the independent-case probability Φ̄((c − μ_2)/σ).
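A quick Monte Carlo check of this claim, under the bivariate Normal model above; the values μ = 1, σ = 1, c = 2, and ρ = 0.8 are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, c, n_draws = 1.0, 1.0, 2.0, 1_000_000

def codetect_prob(rho):
    """Estimate P(D2 > c | D1 > c) for bivariate Normal (D1, D2)
    with common mean mu, variance sigma**2, and correlation rho."""
    cov = [[sigma**2, rho * sigma**2], [rho * sigma**2, sigma**2]]
    d = rng.multivariate_normal([mu, mu], cov, size=n_draws)
    first_detected = d[:, 0] > c
    return np.mean(d[first_detected, 1] > c)
```

With these values the conditional co-detection probability rises well above the independent-case value Φ̄((c − μ)/σ) ≈ 0.159 as ρ increases, consistent with correlated genes co-appearing in the F-association plot.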