Exploratory data analysis for large-scale multiple testing problems and its application in gene expression studies

12/12/2019 ∙ by Paramita Chakraborty, et al. ∙ University of South Carolina

In large-scale multiple testing problems, a two-class empirical Bayes approach can be used to control the false discovery rate (Fdr) for the entire array of hypotheses under study. We modify that approach by incorporating a sample splitting step, in which one part of the data is used for model fitting and the other part for detecting the significant cases by a screening technique featuring the empirical Bayes mode of Fdr control. Cases with high detection frequency across repeated random sample splits are considered true discoveries, and a critical detection frequency is set to control the overall false discovery rate. The proposed method helps to balance out unwanted sources of variation and addresses potential statistical overfitting of the core empirical model by cross-validation through resampling. Further, concurrent detection frequencies are used to provide visual tools for exploring the inter-relationships between significant cases. The methodology is illustrated using a microarray data set, an RNA-sequencing data set, and several simulation studies. A power analysis is presented to assess the efficiency of the proposed method.


1 Introduction

A multiple-hypotheses scenario arises when a number of factors are believed to contribute to the difference between two populations. A typical example in biology is transcriptome analysis, in which thousands of gene expression levels are examined to detect a handful that are expressed at significantly different levels between two groups of subjects with different biological outcomes. A common goal of such a study is to identify the regulator genes that influence the outcome of interest. Empirical Bayes methods can provide useful analysis tools for these problems.

The empirical Bayes model was considered by Robbins [Robbins] to study iid observations $z_1, \dots, z_N$, where $z_i \mid \theta_i$ has density $f_{\theta_i}$ and the $\theta_i$ are iid copies of a latent variable $\theta$ with distribution $G$. This model results in the mixture model where the $z_i$ are iid with density

(1.1)  $f(z) = \int f_{\theta}(z)\, dG(\theta).$

Here $f_{\theta}$ is the conditional density of $z$ given $\theta$, and $f$ is the marginal density of $z$.

In addition to justifying shrinkage estimation [Morris], the method has been very useful in large-scale inference [Efron07, Efron08, Efron10]. The specialization of (1.1) to the two-point mixture model has played an important role in the analysis of the two-class data problem. In the model

(1.2)  $f(x) = \pi_0 f_0(x) + \pi_1 f_1(x),$

$f_0$ is the background density and $f_1$ is the signal or contamination density, with $\pi_0 + \pi_1 = 1$. This representation can be used for a generic multiple testing problem where the hypotheses under study are divided into the null class and the non-null class [Efron07] and one wishes to control the false discovery rate [Benjamini] for the entire array of hypotheses under study. For example, a two-point mixture model where $f_0$ is the Uniform$(0,1)$ distribution and $f_1$ is a Beta distribution is used in [Allison] to analyze p-values for gene expression data.

The methodology proposed in this paper is a modification of these ideas with the two-class empirical Bayes models for p-values in multiple testing problems. The modification considers the two-point Uniform-Beta mixture model (1.2) and then adjusts the mixture components to better accommodate the extreme p-values. The main step incorporates repeated random sample splitting, using the “modeling” split to fit this adjusted model. The fitted model is then used to identify the signal p-values in the “screening” split. The recurrence of the signal p-values is finally used to “identify” significant signals in the data. The modified approach is able to control the false discovery rate and the precision for the whole process.

By repeating the screening process over a large number of randomly chosen subgroups of available subjects, then combining the findings, the proposed method can help to balance out unwanted sources of variation in observational data. Cross-validation through resampling also addresses statistical overfitting [Baily] of the underlying empirical model and enhances the reproducibility of the screening results. The frequency distribution of detection recurrence provides insights into group behavior among significant factors. This methodology can be used to analyze omics data to identify potential regulators and to explore the inter-relationship between them. In this paper, we demonstrate the usefulness of our methodology for gene expression data analysis.

The next section provides further background on the empirical Bayes approach for multiple testing problems. The proposed methodology, along with the associated power and precision probability calculations, is presented in Section 3. The methodology is illustrated in Section 4 using a microarray data set and an RNA-sequencing data set. A few simulation studies are also included in that section along with a related power analysis. Some discussion and concluding remarks are given in Section 5. Appendix A presents an argument supporting the association plots developed by the method; Appendix B, on pooling data, gives a justification of the methodology and suggests when clusters of hypotheses may act in sync. A performance comparison between the proposed method and an existing screening method is included in Appendix C.

2 Background

For a multiple-testing problem, the p-values associated with the hypothesis tests correspond either to the baseline/background information (cases where the null hypothesis is true) or to the signal (true discoveries, i.e., cases where the null hypothesis is false and the p-values are close to 0). The basic model for the density of the population of p-values under study is assumed to be (1.2), where we label the background distribution $f_0$ as null and the signal distribution $f_1$ as non-null. For any continuous test statistic, the p-value will follow a Uniform$(0,1)$ distribution under the null hypothesis [Dickhaus], whereas the Beta distribution is a natural choice for the non-null (signal) p-values because it is flexible enough to capture distributions with a wide range of shapes on $(0,1)$.

For an observed p-value $x$, the posterior probabilities $P(\text{null} \mid x)$ and $P(\text{non-null} \mid x)$ can be given by the assignment functions associated with the mixture model in (1.2): $a_0(x) = \pi_0 f_0(x)/f(x)$ and $a_1(x) = \pi_1 f_1(x)/f(x)$. In an empirical Bayes approach, $a_0(x)$ is estimated using $\hat\pi_0$, $\hat f_0$ and $\hat f_1$ fitted from the observed data, and for the multiple testing scenario it is referred to as the local false discovery rate (local fdr) (see [Efron07, Efron08, Efron10]).

The local fdr can be used to calculate the tail-area false discovery rate (for a density symmetric in two-sided tests; it can be adjusted for non-symmetric cases). The tail-area Fdr is a useful tool to screen for potentially significant cases: observations with small Fdr are less likely to be false discoveries and thus can be considered significant cases or true discoveries. (See [Efron07] for a detailed discussion of local and tail-area false discovery rates and their relationship with the Benjamini-Hochberg [Benjamini] (BH) mode of controlling the overall false discovery rate.)
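For illustration, once the mixture components of (1.2) are specified, both quantities are one-liners in R. The sketch below uses hypothetical parameter values, chosen only to show the computation:

# Local fdr and left-tail-area Fdr under a Uniform(0,1)/Beta(a, b) mixture.
pi0 <- 0.95; a <- 0.5; b <- 4                 # hypothetical parameter values
f <- function(x) pi0 * dunif(x) + (1 - pi0) * dbeta(x, a, b)
fdr_local <- function(x) pi0 * dunif(x) / f(x)               # P(null | x)
Fdr_tail <- function(x)                                      # P(null | X <= x)
  pi0 * punif(x) / (pi0 * punif(x) + (1 - pi0) * pbeta(x, a, b))
fdr_local(0.001); Fdr_tail(0.001)   # small values flag a likely signal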

In mixture-model-based Fdr screening [Efron07, Efron08, Murlidharan], the two-class model is generally first fitted using the whole data and then used for Fdr-based screening to detect the signal, i.e., the significant cases. The procedure is equivalent to a classification problem where the estimated Fdr is used as the discriminant function. Estimation of the Fdr is based on a fitted model that is entirely data-driven, and hence prone to overfitting.

Statistical overfitting [Baily] is an especially important issue in machine learning and observational biological data analysis, where empirical model building remains essential. Usually, a prognostic model is developed based on the available data, and then the model is used to classify the factors under study or to predict the factors’ behavior in general. The problem arises when the developed model works on the given data set but fails to perform correctly on another data set, creating a potential reproducibility issue [Baily, Raeder].

Cross-validation is recommended as a common remedy to avoid overfitting due to data-driven empirical model building [BERRAR, Harell, Mathur, Raeder, Simon, Subramanian]. However, often additional data to validate the fitted model are not available, and the entire analysis or inference is to be done using only one observed data set. Resampling methods are the natural choice for cross-validation in such scenarios.

The proposed method uses the mixture model-based false discovery rate control described above with the following key attributes:

  • We incorporate a random sample splitting step in the existing multiple hypothesis testing procedure with Fdr screening where one part of the data is used for model fitting, and the complementary part is used to screen for the potential signal. This fitting/screening step is used repeatedly with multiple resampled subsets of all available subjects. The most frequently detected factors over all subsets are considered statistically significant. The cross-validation through repeated subsampling and screening helps to reduce the effect of possible overfitting.

  • Observational studies that are common in biology are prone to include many latent sources of variation that can attenuate or distort the signal of interest. The proposed method screens a large number of different data splits and then selects significant cases taking into account the variations over all these splits. This helps to balance out the unwanted sources of variation.

  • We offer a novel way to control the overall false discovery of the inference process by choosing the critical detection frequency of a factor to be identified as a potential true discovery.

  • Based on the resampled screening frequencies, we also present an association plot that can be used to explore the inter-relationship between the significant factors.

3 Methodology

We now consider a multiple testing scenario where $N$ null hypotheses $H_{01}, \dots, H_{0N}$ are to be tested against appropriate alternatives $H_{11}, \dots, H_{1N}$, respectively. Suppose $T_1, \dots, T_N$ are the test statistics to be used for the individual tests given the available sample size. To identify the non-null cases we can analyze either the p-value associated with each $T_i$ or the left-tail area (LTA) from each $T_i$. Non-null cases will produce p-values close to 0; equivalently, non-null LTA's will be close to 0 or 1. The advantage of working with LTA's is that one can easily identify the direction of deviation from the null hypothesis.

Let $x_i$ denote the p-value (or LTA) from test statistic $T_i$. When the $x_i$'s follow a two-point mixture distribution with density (1.2), the proposed methodology is implemented by the following steps.

Proposed Sample-Splitting Analysis Methodology Steps (a schematic code sketch follows the list):

(i)

First, the subjects under study are randomly split into two (equal) parts, viz. the modeling set and the screening set. The p-values or LTA's of the test statistics associated with each of the hypotheses under study, derived from the modeling set and the screening set, are called the “modeling data” and the “screening data”, respectively.

(ii)

Then, a mixture contamination model is fitted using only the modeling data. This fitted model is then adjusted to an “empirical model” that captures the baseline and the signal in the empirical fit more appropriately.

The detection of significant cases is based on the screening data alone.

(iii)

The fitted model is used to derive the tail-area Fdr or the local fdr associated with each $x_i$ in the screening split. The cases with tail-area Fdr (or local fdr) less than a predetermined cutoff point are identified as potential significant cases.

(iv)

The entire process (stages i, ii, iii) is repeated several times with different random splits into the modeling and screening subsets. For each split/repetition, a set of potential significant cases is identified. The most frequently identified significant cases are considered “significant discoveries”. The required critical detection frequency can be set so that the overall false discovery rate for the process is controlled at a fixed level.

(v)

The screened cases detected together and their detection frequencies can be used to study the inter-relationships/dependencies between the significant cases. This frequency distribution is used to develop an association plot for the hypotheses that graphically describes these insights.
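The five steps above can be organized into one resampling loop. The sketch below is a minimal schematic: expr (a hypotheses-by-subjects data matrix), group, and the helpers lta(), fit_adjusted_mixture() and tail_Fdr() are assumed placeholders rather than functions from our R package.

# Schematic of steps (i)-(iv): repeated sample splitting and Fdr screening.
K <- 500; cutoff <- 0.1
hits <- integer(nrow(expr))                       # detection count per hypothesis
for (k in seq_len(K)) {
  g1 <- which(group == 1); g2 <- which(group == 2)
  model_cols <- c(sample(g1, length(g1) %/% 2),   # step (i): random half/half split
                  sample(g2, length(g2) %/% 2))
  screen_cols <- setdiff(seq_along(group), model_cols)
  x_model <- lta(expr[, model_cols], group[model_cols])     # modeling data
  x_screen <- lta(expr[, screen_cols], group[screen_cols])  # screening data
  fit <- fit_adjusted_mixture(x_model)            # step (ii): Equation (3.2) fit
  hits <- hits + (tail_Fdr(x_screen, fit) < cutoff)   # step (iii): Fdr screening
}
significant <- which(hits >= m_star)   # step (iv): m_star from Equation (3.13)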

Next we present the theoretical model formulation along with the screening and power analysis tools. As noted earlier in Section 1, the local false discovery rate (fdr) in [Efron07] is essentially the same as the assignment function $a_0(x)$. From identity (2.15) in [Efron10], the relationship between the local fdr and the tail-area Fdr for a given tail-area $A$ is $\mathrm{Fdr}(A) = E[\mathrm{fdr}(x) \mid x \in A]$.

3.1 Empirical Fit

Using the observed $x_i$'s in the modeling split, we first fit a mixture of a Uniform$(0,1)$ distribution and a Beta distribution to the population density given by (1.2), viz. $\hat f(x) = \hat\pi_0 u(x) + \hat\pi_1 b(x)$, where $u$ is the Uniform$(0,1)$ pdf and $b$ is the fitted Beta pdf.

Since our main interest is the identification of the most extreme cases, we adjust the signal/contamination part as follows: let $\hat\pi_1 b(x) = g_c(x) + g_t(x)$, where

(3.1a)  $g_c$ is the central part of the fitted signal component,
(3.1b)  $g_t$ is its tail part.

So $g_t$ captures the more extreme part of the signal (the deviation from the fitted baseline). Since $g_t$ is not a density, it needs to be normalized as follows. Let $\lambda = \int_0^1 g_t(x)\,dx$ and define

$f_0^*(x) = \dfrac{\hat\pi_0 u(x) + g_c(x)}{1 - \lambda}$ and $f_1^*(x) = \dfrac{g_t(x)}{\lambda}.$

Now let $\pi_0^* = 1 - \lambda$ and $\pi_1^* = \lambda$. Then the fitted model can be re-written as

(3.2)  $\hat f(x) = \pi_0^* f_0^*(x) + \pi_1^* f_1^*(x).$

Note that the fitted mixture model is unchanged; the terms have been rearranged so that $f_0^*$ captures more of the middle part of the data while $f_1^*$ captures the tail part. The rearrangement given in (3.2) is what we will refer to as the “empirical mixture model” and is related to Efron’s “empirical null” [Efron07]; this representation of the fitted mixture model captures the baseline and the signal distribution better than the original mixture representation. To derive the estimates for the expressions presented in the next two subsections, one has only to replace the corresponding terms in the assumed population density (1.2) by those of Equation (3.2).

Comment: The assumption that the null data follow a Uniform distribution may not always hold in discrete cases (see [muralidharan2012detecting]). But the tail adjustment can compensate, at least in part, for the deviation from the Uniform in the final adjusted form (3.2).
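The first-stage Uniform-Beta fit can be obtained, for example, by direct maximum likelihood; the optim()-based sketch below is one possible implementation, not necessarily the routine used in our R package. On $(0,1)$ the Uniform density is identically 1, which simplifies the likelihood:

# Maximum likelihood fit of f(x) = pi0 * 1 + (1 - pi0) * dbeta(x, a, b).
negll <- function(par, x) {
  pi0 <- plogis(par[1]); a <- exp(par[2]); b <- exp(par[3])  # keep parameters valid
  -sum(log(pi0 + (1 - pi0) * dbeta(x, a, b)))
}
fit <- optim(c(2, 0, 0), negll, x = x_model)   # x_model: modeling-split values
pi0_hat <- plogis(fit$par[1])                  # estimated null proportion
ab_hat <- exp(fit$par[2:3])                    # estimated Beta shape parameters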

3.2 Screening Significant Cases Based on Fdr and Sample Splitting.

If p-values are used for the analysis, $x_i$'s close to 0 are associated with the signal. If the $x_i$'s are LTA's from the test statistics, then $x_i$'s close to 0 or 1 (or both) are associated with the signal, depending on whether a left-sided test, right-sided test, or two-sided test is being used. Thus the tail area (left, right, or two-sided) used for calculating the tail-area Fdr will depend on the definition of the $x_i$'s and the direction of the alternative hypotheses under study.

Let $F$ be the distribution function corresponding to $f$. Define $P_f(A) = \int_A f(x)\,dx$ for any Borel set $A$. Similarly, write $F_0$ as the distribution function of the baseline distribution $f_0$. With this notation, we derive the tail-area Fdr associated with a given tail-area $A$ as follows:

(3.3)  $\mathrm{Fdr}(A) = P(\text{null} \mid x \in A) = \dfrac{\pi_0 P_{f_0}(A)}{P_f(A)}.$

In practice, one can consider the tail-area $A_x$ associated with any value $x$ in the support of $f$. The tail-area false discovery rate associated with $A_x$ can then be calculated from the fitted model (3.2) using (3.3).

If the $x_i$'s are p-values derived for each hypothesis under study, for any $x$ the appropriate tail area to use is $A_x = [0, x]$. When the $x_i$'s are LTA's from test statistics, one should use $A_x = [0, x]$ for a left-sided test and $A_x = [x, 1]$ for a right-sided test. In a two-sided test situation with a test statistic symmetric around zero, the tail area simply is $A_x = [0, x] \cup [1-x, 1]$ for $x \le 1/2$, and $A_x = [0, 1-x] \cup [x, 1]$ otherwise.

In general, for an $F$ that is not symmetric, the two-sided tail area can be derived using matching percentiles. Express any $x$ as the $q$-th percentile of $F$, i.e., $q = F(x)$; then a matching $x'$ can be found such that $F(x') = 1 - q$. Now if $q < 1/2$ ($x$ is smaller than the median) we choose

(3.4a)  $A_x = [0, x] \cup [x', 1].$

On the other hand, if $q > 1/2$ ($x$ is larger than the median) we choose

(3.4b)  $A_x = [0, x'] \cup [x, 1].$

An observation with $\mathrm{Fdr}(A_x)$ smaller than a predetermined critical value should be identified as significant.
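Numerically, the matching point can be found by root finding on the fitted cdf. The sketch below assumes a fitted mixture cdf F_hat on $(0,1)$ is available as an R function; the labels follow (3.4):

# Two-sided tail area via matching percentiles for a non-symmetric fitted cdf.
two_sided_tail <- function(x, F_hat) {
  q <- F_hat(x)                                                # percentile of x
  xp <- uniroot(function(t) F_hat(t) - (1 - q), c(0, 1))$root  # matching point x'
  if (q < 0.5) list(lower = c(0, x), upper = c(xp, 1))  # (3.4a): x below the median
  else list(lower = c(0, xp), upper = c(x, 1))          # (3.4b): x above the median
}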

For tail-area Fdr screening of a data set, a modeling split is first used to fit the adjusted mixture model (3.2). Then, for each data point $x$ in the corresponding screening split, the appropriate tail area $A_x$ is determined using the fitted model. Next, $\pi_0^*$, $f_0^*$ and $f_1^*$ from the modeling fit are used with (3.3) to derive the estimated observed tail-area Fdr, viz. $\widehat{\mathrm{Fdr}}(A_x)$. Any case with $\widehat{\mathrm{Fdr}}(A_x) < c$ is screened as a potential discovery, where $c$ is a pre-determined cutoff point.

For screening with the local fdr, after fitting the adjusted mixture model with the modeling split, the fitted densities are used to calculate the estimated local fdr, $\widehat{\mathrm{fdr}}(x)$, for each screening split data point. Cases with $\widehat{\mathrm{fdr}}(x)$ less than a predetermined cutoff $c$ are considered to be potential discoveries.

In terms of the rejection region, for any fixed cutoff point $c$, depending on the tail-area Fdr or the local fdr screening, the theoretical rejection set from the $k$-th split is given by

(3.5a)  $R_k = \{x \in \mathcal{X}_k : \mathrm{Fdr}(A_x) < c\}$

or

(3.5b)  $r_k = \{x \in \mathcal{X}_k : \mathrm{fdr}(x) < c\},$

where $\mathcal{X}_k$ is the support of $f$ from the $k$-th sample split.

The above calculation is repeated a number of times, say $K$. The potential significant cases can be identified from the combined rejection set $R = \bigcup_{k=1}^{K} R_k$ or $r = \bigcup_{k=1}^{K} r_k$. But to increase the precision, only the observations that have been detected repeatedly with high frequency across the $R_k$'s or $r_k$'s should be considered potential true discoveries. The critical frequency of detection for significant discoveries will be discussed in sub-Section 3.3 (see Equation (3.13)).

3.3 Power and Error Probabilities Calculation.

Efron [Efron07, Efron10] uses a whole-data fit on z-transformations of the p-values associated with the hypotheses under study and advocates the use of the local fdr for the screening of significant cases. In that analysis, for a given cutoff point $c$ of the local fdr, the rejection region effectively is $\{z : \mathrm{fdr}(z) < c\}$. The power diagnostic tools chosen in those discussions are the non-null average of the local fdr and the non-null cdf of the local fdr. Some empirical estimates of these functions were used in [Efron07, Efron10] for the power analysis.

For the sample splitting method proposed in this paper, where the model is fitted on the p-values or LTA's associated with the test statistics, the rejection region from the $k$-th split can be obtained from (3.5). The combined rejection region from all $K$ splits can then be constructed as $R = \bigcup_k R_k$ or as $r = \bigcup_k r_k$, depending on the screening tool used. Considering the mixture model setup in (1.2), for a given rejection region $R$ with a cutoff point $c$, the following probabilities can be used for the power analysis and a relative efficiency comparison:

(3.6)  $\text{Power} = P_{f_1}(R)$  (recall)
(3.7)  $\text{Type I error} = P_{f_0}(R)$  (false positive rate)
(3.8)  $\text{Type II error} = 1 - P_{f_1}(R)$  (false negative rate)
(3.9)  $\text{Precision} = \pi_1 P_{f_1}(R)/P_f(R)$
(3.10)  $\mathrm{Fdr}_C = \pi_0 P_{f_0}(R)/P_f(R)$

Here, $f_0$, $f_1$ and $f$ are the true densities that follow from the assumption that the $x_i$ are iid with pdf (1.2). The terms “recall”, “false negative rate” or “false positive rate”, and “precision” are commonly used in machine learning (see [Powers]). $\mathrm{Fdr}_C$ measures the process-wise false discovery rate associated with the combined rejection region $R$.

The estimates of (3.6)-(3.10) for a given data set can be obtained from the following steps:
Rejection Region and Power Calculation Steps.

(i)

Suppose sample splitting and subsequent screening were done $K$ times following the steps in Section 3.2, and for a given cutoff $c$ the rejection regions $R_k$ or $r_k$ (as in (3.5)) were obtained from each of the $K$ splits.

(ii)

For a two-sided test with LTA's, the rejection region from the $k$-th split, using either the tail-area Fdr or the local fdr screening, can be written as $R_k = [0, l_k) \cup (u_k, 1]$. Then, writing $l^* = \max_k l_k$ and $u^* = \min_k u_k$, the combined rejection region can be expressed as:

(3.11a)  $R = [0, l^*) \cup (u^*, 1]$

or

(3.11b)  $r = [0, \tilde l^*) \cup (\tilde u^*, 1]$  (with endpoints from the local fdr screening)

depending on the choice of the screening tool. Equation (3.11) will include only one-sided sets for p-value analysis or for one-sided tests with LTA's.

(iii)

A mixture model with the tail adjustment fitted to the entire data (without any data splitting) can be used for the estimates of the densities in (3.6)-(3.10).

(iv)

Numerical integration and numerical root-finding techniques can be used to estimate the probabilities in (3.6)-(3.10) and to find $l^*$ and $u^*$ (or $\tilde l^*$ and $\tilde u^*$) from (3.5), where closed-form solutions are not feasible.

The power and error probabilities in the steps above are associated with the final analysis method that combines all $K$ splits, not with any single modeling split in particular. Therefore, for the estimation of (3.6)-(3.10) it is appropriate to use a mixture model fitted to the entire data for the estimates of the densities $f_0$, $f_1$ and $f$, as suggested in step (iii) above. An individual fit from any single modeling split should not be used for the power analysis.
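Given a whole-data adjusted fit with weight pi0 and densities f0 and f1 (assumed available as R functions), the probabilities (3.6)-(3.10) for a two-sided combined region $R = [0, l) \cup (u, 1]$ reduce to one-dimensional numerical integrals, e.g.:

# Power and error probabilities over the combined region R = [0, l) U (u, 1].
P_R <- function(dens, l, u)
  integrate(dens, 0, l)$value + integrate(dens, u, 1)$value
power_R <- P_R(f1, l, u)                            # (3.6): power / recall
typeI_R <- P_R(f0, l, u)                            # (3.7): type I error
typeII_R <- 1 - power_R                             # (3.8): type II error
PfR <- pi0 * typeI_R + (1 - pi0) * power_R          # P_f(R)
precision_R <- (1 - pi0) * power_R / PfR            # (3.9): precision
FdrC_R <- pi0 * typeI_R / PfR                       # (3.10): process-wise Fdr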

Using the combined rejection region $R$ for screening will increase the number of rejections compared to the whole-data-based screening described in [Efron07]. This will naturally increase the power (3.6) of the proposed method, but the price will be a loss of precision (3.9), i.e., an increase in $\mathrm{Fdr}_C$.

Critical Detection Frequency.
Using only the cases in $R$ with high detection frequencies as the potential discoveries will increase the precision of the method. If we want to impose a tolerance level $\gamma$ for $\mathrm{Fdr}_C$, a critical detection frequency requirement can be derived as follows:

(3.12)

Note that factor $i$ is detected as potentially significant in the $k$-th split if $x_i \in R_k$. For any factor (individual hypothesis) under study, the relative frequency of detection from repeated sample splittings is essentially the bootstrap resampling relative frequency of $x_i$ lying in the rejection region. Thus, for large $K$, the relative frequency of detection for factor $i$ estimates the probability that $x_i$ falls in the rejection region of a random split, $i = 1, \dots, N$.

Following Equation (3.12), if only factors with relative frequency of detection larger than the corresponding critical value are considered to be significant, that will control the estimated $\mathrm{Fdr}_C$ at $\gamma$. In other words, a factor can be considered a significant discovery if, after $K$ runs of sample splitting and subsequent screening, its frequency of detection is at least

(3.13)

Here the choices of $K$ and $\gamma$ are user-defined (just as the choice of the level of significance for a traditional single hypothesis test). Smaller values of $c$ and $\gamma$ will result in more conservative screening.
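Once the critical relative frequency implied by the tolerance has been computed (by the numerical integration step in our R package; the value below is a placeholder), applying it to the detection counts is immediate:

# Apply the critical detection frequency (Equation (3.13)) to the counts.
K <- 500
p_star <- 0.01                        # placeholder critical relative frequency
m_star <- ceiling(K * p_star)         # critical detection count
true_discoveries <- which(hits >= m_star)   # hits: counts from the screening loop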

An added benefit of expressing the error probabilities as functions of the Fdr cutoff point $c$ is that one can choose a $c$ where both the type I and the type II error probabilities are at a reasonable level. Alternatively, an appropriate $c$ can be chosen so that the proportion of correct classifications is at a desired level; this proportion is also known as the “accuracy” function in machine learning.

4 Illustrative Examples.

In this section, we illustrate the proposed methodology with a microarray data set and an RNA-sequencing data set where the goal is to identify genes that are expressed at a significantly higher or lower level in the experimental group compared to the control group. We also present several simulation studies to compare the proposed approach with the existing Fdr screening methods.

4.1 Microarray Data Analysis.

Empirical Bayes methods have been used to analyze microarray data [smyth2004linear, hirakawa, Efron08]. The proposed methodology is more general since it can be applied with any test statistic (not necessarily Normal- or t-distribution based).

We analyzed a prostate cancer microarray data set (used in [Efron10]; from [Singh]; available in R in the package sda, in a data file named “singh2002”) consisting of 52 prostate cancer patients and 50 normal subjects. In the data, the expression levels of the same 6033 genes were measured for each subject. The analysis aims to detect genes that have significantly different expression levels between the cancer and non-cancer groups and to explore the inter-relation between those genes. The significant genes are captured in the screening step. An association plot is generated based on concurrent detection frequencies, which is useful for exploring the inter-relation between these genes (see Section 5 for discussion and interpretation of these plots). We used LTA's with tail-area Fdr screening for the analysis, although local fdr screening can also be used following the steps described in Section 3.

(a) Uniform-Beta mixture distribution.
(b) Adjusted Uniform-Beta mixture distribution.
Figure 1: The histogram of the $x_i$'s (left-tail areas of the observed two-sample t-statistics for the 6033 genes) from a particular screening split consisting of half of the control and half of the treatment group. Superimposed on 1(a) are the fitted Uniform and Beta components (blue dashed lines) and the associated mixture distribution (red solid line) obtained from the corresponding modeling split. Panel 1(b) is the empirical null fit adjusted from the fitted Uniform-Beta mixture distribution as in Equation (3.2).

To begin, the data was split into a modeling set and a screening set. The modeling split consisted of 26 randomly selected prostate cancer patients and 25 normal subjects. The remaining 26 patients and 25 non-cancer subjects formed the screening split. The modeling data was used to fit the contamination model (3.2), as described next.

The two-sample t-statistic $t_i$, $i = 1, \dots, 6033$, is calculated for each gene from the modeling group, where it is assumed that $t_i$ follows a central t-distribution with 49 degrees of freedom under the null hypothesis. The LTA for each of these $t_i$'s is calculated as:

(4.1)  $x_i = P(T_{49} \le t_i).$

Note that the $x_i$'s should be close to 0 or 1 for genes that are deemed significantly differentially expressed between the cancer and non-cancer groups. To compute the t-statistics, we used the difference (control mean − cancer mean). Hence, any $x_i$ close to 0 implies that the gene is expressed at a higher level in cancer patients, while any $x_i$ close to 1 indicates that the gene is expressed at a lower level in cancer patients compared to the control group. The histogram of the $x_i$'s from the 6033 genes in Figure 1 shows a bathtub shape. Prompted by that shape, we first fitted a mixture of a Uniform and a Beta distribution to the modeling data and then readjusted it as described in Subsection 3.1.
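The gene-wise statistics and LTA's for a modeling split can be computed as follows; X and grp below are an assumed arrangement of the singh2002 data (genes in rows, the 51 modeling-split subjects in columns):

# Gene-wise two-sample t statistics and left-tail areas on the modeling split.
# X: 6033 x 51 expression matrix; grp: "control"/"cancer" labels for the columns.
t_stat <- apply(X, 1, function(g)
  t.test(g[grp == "control"], g[grp == "cancer"], var.equal = TRUE)$statistic)
x_lta <- pt(t_stat, df = 49)   # Equation (4.1): LTA under the central t(49)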

Figure 1 illustrates this fit for one such split: Panel 1(a) features the fitted Uniform and Beta components (blue dashed lines) and the associated mixture distribution (red solid line) obtained from the corresponding modeling split, while Panel 1(b) shows the adjusted empirical null fit as in Equation (3.2).

Next, following (3.2), we computed the tail-area Fdr associated with each gene. Genes with tail-area Fdr less than 0.1 were declared potentially significantly different between cancer patients and non-cancer subjects. This procedure was repeated on 500 different sample splits.

Out of these 500 repetitions, 234 verification groups identified at least one significant gene with associated tail-area Fdr less than 0.1; the other 266 verification groups failed to capture any significant gene. A few verification sets identified more than one significant gene. In the 500 repetitions of this procedure, out of 6033 total genes, 117 genes showed Fdr < 0.1 at least once, and 47 genes did so at least twice. These genes included some that are expressed at significantly higher levels among the cancer patients ($x$'s close to 0) and some at significantly lower levels among the cancer patients ($x$'s close to 1).

Genes that are repeatedly detected as significant strongly confirm the difference between the patient and the control group. Table 1 shows the significant genes (with detection frequency at least 2) along with the number of times they were identified as significant through the 500 sample splits. The false discovery rate calculated from the whole-data analysis by the BH [Benjamini] method is also included in the table (column FDR_BH).

Gene_ID freq rfreq FDR_BH med(Fdr) med(x) mean(x) sd(x)
610 54 0.108 0.001 0.068 0.000 0.001 0.003
1720 51 0.102 0.005 0.063 0.000 0.002 0.006
4331 28 0.056 0.019 0.067 0.998 0.992 0.017
3940 22 0.044 0.015 0.065 0.999 0.994 0.013
914 20 0.040 0.015 0.067 0.001 0.005 0.010
364 15 0.030 0.015 0.076 0.999 0.995 0.010
4546 8 0.016 0.018 0.043 0.999 0.994 0.014
921 8 0.016 0.060 0.083 0.994 0.985 0.027
3647 7 0.014 0.019 0.071 0.002 0.008 0.020
4316 7 0.014 0.037 0.082 0.997 0.988 0.023
1077 6 0.012 0.031 0.066 0.002 0.011 0.023
2856 5 0.010 0.064 0.056 0.994 0.979 0.039
579 5 0.010 0.019 0.081 0.002 0.007 0.012
1089 4 0.008 0.019 0.059 0.002 0.009 0.020
11 4 0.008 0.092 0.041 0.011 0.030 0.048
1346 4 0.008 0.060 0.058 0.995 0.982 0.030
2 4 0.008 0.063 0.079 0.006 0.018 0.032
2852 4 0.008 0.151 0.060 0.014 0.038 0.059
2945 4 0.008 0.065 0.085 0.994 0.982 0.031
3930 4 0.008 0.070 0.078 0.008 0.023 0.041
4088 4 0.008 0.035 0.065 0.997 0.989 0.022
702 4 0.008 0.063 0.090 0.995 0.983 0.029
3017 3 0.006 0.070 0.059 0.994 0.982 0.029
3991 3 0.006 0.035 0.079 0.996 0.988 0.022
4073 3 0.006 0.040 0.065 0.997 0.987 0.028
4396 3 0.006 0.070 0.065 0.992 0.975 0.044
1068 2 0.004 0.019 0.087 0.002 0.005 0.011
1314 2 0.004 0.060 0.059 0.005 0.015 0.026
1491 2 0.004 0.119 0.048 0.013 0.029 0.043
1588 2 0.004 0.070 0.080 0.007 0.021 0.040
1966 2 0.004 0.120 0.071 0.014 0.036 0.055
2370 2 0.004 0.063 0.062 0.006 0.018 0.034
2621 2 0.004 0.512 0.078 0.062 0.097 0.110
2745 2 0.004 0.627 0.080 0.912 0.862 0.145
2785 2 0.004 0.327 0.077 0.969 0.936 0.084
2912 2 0.004 0.137 0.076 0.014 0.035 0.055
332 2 0.004 0.015 0.094 0.001 0.004 0.007
3600 2 0.004 0.065 0.056 0.994 0.984 0.027
3848 2 0.004 0.386 0.064 0.040 0.074 0.092
4000 2 0.004 0.063 0.079 0.993 0.982 0.032
4040 2 0.004 0.095 0.082 0.989 0.974 0.041
4104 2 0.004 0.060 0.086 0.994 0.982 0.031
4417 2 0.004 0.529 0.063 0.931 0.891 0.120
4515 2 0.004 0.117 0.057 0.987 0.969 0.054
4518 2 0.004 0.033 0.085 0.003 0.010 0.021
4549 2 0.004 0.060 0.083 0.005 0.016 0.033
758 2 0.004 0.255 0.067 0.973 0.946 0.071

Table 1: The prostate cancer data was analyzed using the proposed method with 500 random sample splits. Our method discovered 47 genes using the Fdr threshold 0.1 and the detection relative frequency threshold 0.004 (i.e., at least two detections). Columns freq and rfreq show the detection frequency and relative frequency, respectively. FDR_BH represents the Benjamini-Hochberg FDR values computed using the whole data for these 47 genes. For each gene, med(Fdr) shows the median tail-area Fdr from the 500 screening splits. The columns med(x), mean(x) and sd(x) are the median, mean and standard deviation of the 500 LTA's for each gene computed from the 500 screening splits.
(a) Frequency network for the 47 detected genes with relative frequency greater than 0.004, using the tail-area Fdr threshold 0.1.
(b) Frequency network for the genes with relative detection frequency greater than 0.01.
Figure 2: Frequency network plot from the prostate cancer data analysis. The edge width indicates the frequency with which the pair of genes was detected as significant at the same time. The node color represents the direction of the difference in gene expression: blue nodes represent genes expressed at significantly higher levels, and red nodes genes expressed at significantly lower levels, in the cancer patients compared to the control group.

Table 1 further shows that the most frequently screened genes from the proposed method would also be detected by the BH method, since they had FDR_BH < 0.1 (Efron's approach would give a similar result to the BH method since they are related [Efron07]). However, our approach was able to capture a few genes (with low detection frequencies) that would have been missed by the BH screening if the same threshold value 0.1 were used. For example, genes 3848, 4417, 4515 and 758 all had FDR_BH larger than 0.1 but were each detected twice by our method; that is, these genes showed tail-area Fdr < 0.1 in 2 of the 500 screening splits used in the analysis. Note that the median LTA's (med(x)) for these genes are fairly close to 0 or 1, which supports the view that these are potentially significant genes.

Figure 2(a) shows the genes identified as significant at least twice and the sets of genes with which they are simultaneously identified as significant across the 500 sample splits and subsequent screenings. In this gene frequency association plot (F-association), the nodes and edges respectively indicate the detected significant genes and the detection of two genes at the same time in a particular split. The edge width increases with the frequency of simultaneous detection for a pair of significant genes. The node color represents the median tail area from the 500 screening data sets for that gene; the color turns from blue to red as the tail area increases from small (close to 0) to large (close to 1). That is, the blue nodes represent genes expressed at significantly higher levels, and the red nodes genes expressed at significantly lower levels, in the cancer patients' group compared to the controls. Figure 2(b) shows only the genes detected at least 5 times, for a clearer picture of the network.
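The F-association plot can be drawn from the matrix of concurrent detection counts. A minimal sketch using the igraph package, assuming codetect is the symmetric matrix whose (i, j) entry counts the splits in which genes i and j were screened together (with gene IDs as dimnames) and med_x is a named vector of median LTA's:

library(igraph)
# F-association plot: nodes are the significant genes; the width of an edge is
# the number of splits in which the two endpoint genes were detected together.
g <- graph_from_adjacency_matrix(codetect, mode = "undirected",
                                 weighted = TRUE, diag = FALSE)
V(g)$color <- ifelse(med_x[V(g)$name] < 0.5, "blue", "red")  # direction of change
plot(g, edge.width = E(g)$weight, vertex.label = V(g)$name)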

4.2 RNA-seq Data Analysis

In this subsection, we show the application of our methodology to the analysis of global gene expression data from RNA-seq, a commonly used tool for transcriptome profiling. Here we demonstrate how the method can be used with p-values. We used data consisting of RNA-Seq profiles of lymphoblastoid cell lines derived from 69 different Yoruba individuals from Ibadan, Nigeria [pickrell2010understanding]. The analysis aims to investigate differentially expressed genes between males and females. The RNA counts are available in the R package tweeDEseqCountData. In the raw RNA counts, there are 38415 genes with or without defined annotations. To filter out the noninformative genes, we keep genes with both defined annotations and at least 1 count-per-million (CPM) in at least 20 samples. In the end, 17310 genes remain for the differential gene analysis.
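The CPM filtering step can be carried out, for example, with the cpm() utility from edgeR; the sketch below assumes the pickrell.eset ExpressionSet shipped with tweeDEseqCountData and omits the annotation-based part of the filter:

library(Biobase)
library(edgeR)
library(tweeDEseqCountData)
data(pickrell)                           # loads pickrell.eset (ExpressionSet)
counts <- exprs(pickrell.eset)           # raw counts, genes x 69 samples
keep <- rowSums(cpm(counts) >= 1) >= 20  # >= 1 CPM in at least 20 samples
counts <- counts[keep, ]                 # annotation filtering not shown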

(a) Unadjusted fit for the whole data.
(b) Adjusted fit for the whole data.
Figure 3: The mixture model for the p-values using the whole data set: the initial Uniform-Beta mixture fit (a) and the adjusted mixture model according to Equation (3.2) (b).
Gene Chrom freq FDR_BH med(Fdr) med(x) mean(x) sd(x)
XIST X 91 1.78E-42 2.27E-19 8.92E-25 2.44E-23 7.24E-23
CYorf15A Y 91 1.03E-37 1.70E-16 1.64E-21 1.00E-20 2.58E-20
CYorf15B Y 91 1.31E-32 1.76E-14 1.92E-19 1.09E-18 2.13E-18
RPS4Y2 Y 91 1.31E-32 3.85E-14 2.32E-19 4.52E-18 1.49E-17
TTTY15 Y 91 1.53E-31 2.60E-13 1.31E-18 4.96E-17 2.03E-16
EIF1AY Y 91 8.78E-27 1.39E-11 3.89E-16 7.25E-15 2.11E-14
NLGN4Y Y 91 4.21E-24 7.37E-10 2.89E-14 3.86E-13 9.19E-13
UTY Y 91 7.01E-22 1.97E-09 1.63E-13 1.87E-12 5.62E-12
AC010889.1 Y 91 6.29E-21 4.93E-09 6.55E-13 3.18E-12 1.01E-11
RPS4Y1 Y 91 6.29E-21 1.16E-08 3.81E-13 2.03E-11 6.14E-11
KDM5D Y 91 1.82E-20 1.27E-08 1.41E-12 1.41E-11 4.08E-11
RP11-331F4.1 16 91 4.13E-20 2.12E-08 2.51E-12 2.82E-11 9.40E-11
DDX3Y Y 91 3.30E-18 2.16E-07 1.56E-11 6.59E-10 2.70E-09
NLGN4X X 91 2.24E-15 4.98E-06 7.24E-10 7.74E-09 1.86E-08
PNPLA4 X 89 3.65E-12 1.91E-04 9.08E-08 3.44E-07 8.81E-07
RP13-204A15.4 X 88 3.68E-13 2.41E-05 5.84E-09 1.84E-07 6.61E-07
AL137162.1 20 84 1.19E-10 4.20E-04 2.16E-07 1.48E-06 2.72E-06
RP5-1068H6.3 20 76 1.22E-09 9.72E-04 8.62E-07 2.95E-06 4.99E-06
AC016734.3 2 72 3.29E-09 7.49E-04 9.67E-07 1.10E-05 3.66E-05
RP11-143J12.1 18 64 1.47E-08 1.40E-03 2.47E-06 9.25E-06 1.58E-05
RPS4X X 64 1.56E-08 1.88E-03 3.49E-06 9.70E-06 1.69E-05
RP11-411G7.1 17 62 1.56E-08 1.80E-03 3.95E-06 1.14E-05 2.14E-05
RP11-863N1.1 18 61 3.35E-08 9.45E-04 2.68E-06 3.21E-05 8.07E-05
CTD-3116E22.2 19 58 3.56E-08 2.64E-03 6.68E-06 2.39E-05 6.72E-05
RP11-135F9.1 12 56 2.41E-08 2.73E-03 5.29E-06 1.87E-05 3.00E-05
RP11-624G17.1 11 56 3.59E-08 2.82E-03 5.76E-06 1.66E-05 3.06E-05
HDHD1 X 54 5.64E-08 1.58E-03 4.20E-06 3.77E-05 9.29E-05
RP11-21K20.1 12 48 5.64E-08 2.10E-03 7.33E-06 2.88E-05 4.67E-05
Table 2: RNA-seq data analysis with 100 sample splits. The table shows the 28 most frequently detected significant genes with tail-area Fdr < 0.01 (detection frequencies of 40 or higher). The “freq” column indicates the frequency of occurrence of the corresponding gene in the 100 screening splits. The FDR_BH column shows the Benjamini-Hochberg FDR for each of these genes, and med(Fdr) the median tail-area Fdr from the 100 screening splits. The columns med(x), mean(x) and sd(x) are the median, mean and standard deviation of the 100 screening-split p-values for each gene. Chrom refers to the chromosome.

Of the 69 individuals, 40 are female and 29 male. From the whole group, 20 female and 15 male subjects were randomly selected to construct the modeling split (the remaining subjects formed the screening split). Assuming the data follow a negative binomial distribution [anders2010diff], the p-value for each gene was calculated using a generalized likelihood ratio test comparing male and female subjects in the modeling split. The modeling data consists of these 17310 p-values. Similar p-values from the screening split provided the screening data.
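The gene-wise negative binomial likelihood ratio tests can be obtained, for example, with edgeR's GLM functions; this is one convenient implementation of the test described above, not necessarily the exact one used here. model_cols (the modeling-split columns) and sex (the male/female labels) are assumed to be defined:

library(edgeR)
# Negative binomial GLM likelihood ratio tests on a modeling split.
y <- DGEList(counts = counts[, model_cols])
y <- calcNormFactors(y)
design <- model.matrix(~ sex[model_cols])
y <- estimateDisp(y, design)
fit <- glmFit(y, design)
lrt <- glmLRT(fit, coef = 2)        # tests the sex effect for every gene
p_model <- lrt$table$PValue         # the 17310 p-values forming the modeling data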

A mixture of a Uniform and a Beta distribution was fitted to the modeling data p-values and was adjusted to obtain the empirical fit as described in Subsection 3.1. Figure 3 illustrates the effect of the adjustment on the whole data: the initial mixture model fit is illustrated in Panel 3(a), while the adjusted mixture model, according to Equation (3.2), is featured in Panel 3(b).

For the analysis, the adjusted fit from each modeling split was used to calculate the tail-area Fdr associated with the p-value in the corresponding screening data for each gene. Genes with Fdr less than 0.01 were detected as significantly differentially expressed between male and female subjects. The process was repeated 100 times.

The 100 sample splits and subsequent screenings detected a total of 83 significant genes out of 17310. Of these 83 detected significant genes, 56 appeared more than once in the 100 repetitions of the splitting and screening. Table 2 presents a partial summary of the detection results that includes the 28 most frequently detected significant genes with tail-area Fdr < 0.01 (detection frequencies of 40 or higher). The “freq” column indicates the frequency of occurrence of the corresponding gene in the 100 screening splits. Note in particular that the top-ranked gene XIST (X inactive specific transcript) is expressed only in females to silence one of the two X chromosomes; this provides the dosage equivalence between females and males. The other most frequently detected significant genes appear on either the X or Y chromosome, which is expected since they are differentially expressed between males and females.

(a) F-association for the 27 most significant genes.
(b) F-association for the 14 most significant genes.
Figure 4: F-association plots for the most significant genes: those appearing more than 50 times in (a) and more than 90 times in (b). The Fdr threshold is 0.01.

Figure 4 shows the F-association plot for the most significant genes. Since we are using p-values for the analysis, detected genes cannot be flagged as over-expressed or under-expressed (unlike with LTA values) but can only be flagged as significantly differentially expressed in females compared to males. Therefore the color scheme of Figure 2 is absent in Figure 4. Each node represents a gene, and a gray link between nodes indicates a pair of genes that are simultaneously differentially expressed. In Figure 4(a), most of the inner clustered genes come from the top-ranked genes in Table 2, and the more outward a node lies in the frequency network, the lower its frequency of being detected as differentially expressed.

4.3 Simulation Studies

In this section, we present several simulation studies featuring the proposed screening. Subsection 4.3.1 includes a continuous data analysis and a detailed power and precision comparison with Efron's whole-data fit method. Subsection 4.3.2 examines the specificity of the method for simulated data without any signal. Subsection 4.3.3 illustrates the analysis of discrete data, whereas in Subsection 4.3.4 we analyze a similar discrete-model simulation with a larger proportion of signal.

4.3.1 Continuous Data Simulation

We simulated a data set with a treatment group of 50 subjects and a control group of 50 subjects, where 1000 expression values were simulated for each subject. Out of the total 1000 variables, the first 30 were set to be the “non-null” cases (their expressions were simulated from different distributions for the treatment and control groups), while the last 970 variables were set to be the “null” cases (expressions simulated from the same distribution for both groups).

Output Variables Mean Variance (treatment and control)
treatment  control
3
(independent) 6 2
(independent) 6 2
Table 3: The simulation parameters for the 30 non-null variables. Here 10 mean values were generated from one distribution, one for each of the variables 11 to 20 in the treatment group. Similarly, 10 mean values were generated from a second distribution, one for each of the output variables 21 to 30 in the treatment group.

Further, to show that the F-association plots constructed using the detection frequencies can pick up existing inter-relationships between screened cases, we added a correlation structure among the first 10 non-null variables. For a gene expression study, if some genes are inter-related, they will work in tandem in any subject, whether from the control or from the treatment group, although their expression levels can differ between the two groups. Keeping that in mind, the same correlation structure was applied to both the treatment and the control group in the simulated data, while keeping the non-null means different in the two groups.

We used the normal distribution for the simulation. The same normal distribution was used for all 970 null variables for each subject in the treatment and the control group. The treatment and control group parameters used to simulate the 30 non-null variables are described in Table 3, where the mean and variance columns give the Normal mean vectors and variance-covariance matrices used for the simulation. Variables 1 to 10 are clustered together with the covariance structure shown in the table, whereas variables 11 to 30 were simulated independently. Ten treatment-group mean values were generated from one distribution, one for each of the variables 11 to 20, and similarly ten treatment-group mean values were generated from another distribution, one for each of the output variables 21 to 30.
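A sketch of this simulation design using MASS::mvrnorm; the cluster covariance, the baseline mean and variance, and the non-null treatment means below are placeholders, since the exact Table 3 values are not reproduced here:

library(MASS)
set.seed(1)
n <- 50                                           # subjects per group
Sigma <- 2 * 0.6 ^ abs(outer(1:10, 1:10, "-"))    # placeholder cluster covariance
sim_group <- function(mu_clust, mu_ind) {
  clust <- mvrnorm(n, mu = rep(mu_clust, 10), Sigma = Sigma)       # variables 1-10
  indep <- sapply(mu_ind, function(m) rnorm(n, m, sqrt(2)))        # variables 11-30
  nulls <- matrix(rnorm(n * 970, mean = 3, sd = sqrt(2)), n, 970)  # variables 31-1000
  cbind(clust, indep, nulls)
}
treatment <- sim_group(6, runif(20, 4, 8))   # placeholder non-null treatment means
control   <- sim_group(3, rep(3, 20))        # baseline means in the control group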

(a) Uniform-Beta mixture distribution.
(b) Adjusted Uniform-Beta mixture distribution.
Figure 5: The histogram of the $x_i$'s (left-tail areas of the observed two-sample t-statistics for the 1000 simulated variables) from a screening data set consisting of half of the control and half of the treatment group. Superimposed on Panel 5(a) are the fitted Uniform, Beta and mixture distributions obtained from the corresponding modeling split; the red solid line shows the fitted mixture distribution. Panel 5(b) shows the empirical null fit adjusted from the fitted Uniform-Beta mixture distribution as in Equation (3.2).

The 100 subjects were randomly split into a modeling and a screening set, each consisting of 25 subjects from the treatment group and 25 subjects from the control group. The data points were defined as $x_i = P(T_{48} \le t_i)$, where $t_i$ is the two-sample t-test statistic for the $i$-th output variable computed from the 25 control and 25 treatment samples in the modeling split. Then a mixture of a Uniform and a Beta distribution was fitted to the modeling-split $x_i$'s and was adjusted to better capture the background and the signal, following (3.2) as in Section 4.1.

Figure 5 shows the histogram of the $x_i$'s from a particular screening data set. Panel 5(a) illustrates the initial mixture model fitted from the modeling data, with the red solid line showing the fitted mixture distribution, and Panel 5(b) presents the adjusted empirical null fit according to Equation (3.2).

Here we used tail-area Fdr screening to construct the frequency table (Table 4) and the F-association plots (Figure 6). The local fdr screening results are used for comparison purposes in Figure 7.

The two-sample t-statistic for each variable in the screening split and its left-tail area formed the screening data. The modeling split fit (3.2) was used to obtain the tail-area Fdr associated with each screening data point. The variables with tail-area Fdr less than the cutoff were detected as significant. The process was repeated 100 times. Table 4 shows the 26 most frequently detected significant variables from the 100 sample splits (detection frequencies of 2 or higher). The “freq” column shows the frequency of detection for the corresponding variable in the 100 screening splits.

variable (i) freq med(Fdr) med(x) mean(x) sd(x)
8 63 0.03660 0.00006 0.00058 0.00153
5 45 0.04460 0.99987 0.99928 0.00146
10 44 0.03740 0.00040 0.00266 0.00862
6 41 0.04690 0.99979 0.99898 0.00216
7 38 0.05290 0.99971 0.99868 0.00303
15 36 0.04490 0.00051 0.00399 0.01051
26 34 0.05280 0.99962 0.99826 0.00362
21 22 0.05030 0.99923 0.99583 0.01447
9 20 0.04850 0.00213 0.01042 0.01890
16 17 0.04820 0.00233 0.01138 0.02408
23 17 0.05830 0.99841 0.99473 0.01289
30 17 0.07370 0.99819 0.99265 0.01572
29 13 0.04630 0.99760 0.99270 0.01855
12 12 0.04980 0.00648 0.01892 0.03579
20 10 0.06330 0.00518 0.02024 0.03748
11 9 0.06040 0.00311 0.01547 0.03765
13 6 0.06720 0.01671 0.02783 0.03881
19 6 0.06930 0.00958 0.01910 0.02499
4 6 0.06720 0.01157 0.02577 0.04406
24 5 0.04710 0.99692 0.99130 0.01475
3 5 0.06420 0.00861 0.02270 0.03694
106 3 0.03140 0.02360 0.04740 0.06810
523 3 0.04050 0.02281 0.05717 0.10463
27 2 0.07010 0.98869 0.96776 0.05341
508 2 0.07170 0.04119 0.07933 0.10046
919 2 0.05420 0.92189 0.89232 0.12438
Table 4: Screening results for the simulated continuous data, where variables 1 to 30 are set to be non-null. The table shows the 26 detected significant variables from 100 sample splits (detection frequencies of 2 or higher). The “freq” column shows the frequency of detection for the corresponding variable in the 100 screening splits. The values in the med(Fdr) column show the median tail-area Fdr from the 100 sample splits for each variable. The columns med(x), mean(x) and sd(x) are the median, mean and standard deviation of the LTA's for each variable computed from the 100 screening splits.

Note that although variables 1 to 30 of the 1000 simulated variables were set as non-null, the groups of variables 5, 6, 7 and 8, 9, 10 deviated the most from the null. The mean vectors for variables 1, 2 and 3, 4 did not deviate enough from the null mean to produce significantly large or small t-statistic values. The analysis was done using the t-statistic tail areas and not the original normal distribution, so non-null variables 1 and 2 were naturally not captured in the screening process.

(a) F-association for the 26 most significant variables.
(b) Simplified F-association for 6(a).
Figure 6: Panel 6(a) is the F-association plot of the 26 significant variables appearing at least twice in the 100 sample splits using the tail-area Fdr. Panel 6(b) is the simplified network of 6(a), obtained by deleting variables with fewer than 5 connected edges. Blue nodes indicate variables expressed at a lower level in the treatment group compared to the control group, whereas red nodes indicate variables expressed at a higher level in the treatment group.

Figure 6(a) presents the F-association plot of the 26 significant variables appearing at least twice in the 100 sample splits using the tail-area Fdr, and Figure 6(b) is the simplified network of Figure 6(a), obtained by deleting variables with fewer than 5 connected edges. Here blue nodes indicate variables expressed at a lower level in the treatment group than in the control group, whereas red indicates variables expressed at a higher level in the treatment group.

The F-association plots in Figures 6(a) and 6(b) show that among the detected significant variables, the groups with the strongest correlation structure, one with variables 5, 6 and 7 and another with variables 8 and 10, were captured successfully. However, the plot did not capture all correlated variables (e.g., variable 9, which had a low correlation with variable 10 but a higher correlation with variable 8, did not show up in the group with variables 8 and 10).

The association featured in the network plot depends on the difference between the control and treatment groups in the observed data; while the inherent dependence structure might drive the difference, the difference itself might not always indicate a dependence. Hence the plot should be used for exploration and not to establish a causal association. A more detailed discussion of the limitations of the F-network is included in Section 5, while a justification for the association is presented in Appendix A.

Power and precision comparison with the whole data fit method.

(a) Power curve comparison.
(b) Precision curve comparison.
Figure 7: Panel 7(a) is the power curve comparison and Panel 7(b) the precision curve comparison between the whole-data fit/screening method and the proposed sample splitting method, as functions of the cutoff point of the local fdr. The solid line represents power or precision for the whole-data fit/screening method, and the dashed line the same for the proposed method with 100 sample splits.

To compare the efficacy of the proposed method with the whole-data fit/screening, we present the power (recall) and precision comparisons (Equations (3.6) and (3.9)) in Figure 7. Since Efron [Efron07] used the local fdr for screening, for comparison purposes the combined rejection region based on local fdr screening with repeated sample splitting, as in (3.11b), is used in Figure 7 along with the rejection region from the whole-data fit and local fdr screening method.

The combined rejection region from repeated sample splits is larger, hence providing higher power, as evident in Figure 7(a). But the inclusion of all detections increases the number of false discoveries and consequently decreases the precision, as seen in Figure 7(b). Note that even when the entire combined rejection region $r$ was used with the proposed method, at the relevant cutoff values 0.1 to 0.3 (reasonable cutoff points for the local fdr) the sample splitting technique shows a level of precision on par with the whole-data fit/screening method.

However, we propose that only the variables with high detection frequencies should be screened as potential true discoveries, not the entire $r$. By adding a high-frequency criterion (as in (3.13)) to the combined rejection region, the precision can be set at any desired fixed level for all cutoff points. Since, with enough repeated sample splitting, the set of variables in $r$ with high frequencies is unlikely to be much smaller than the rejection set produced by the existing methods, the high-frequency screening can be set to achieve the intended level of precision without a significant loss of power.

For example, 22 of the 26 detections (about 85%) in Table 4 with detection frequencies of 2 or higher were true discoveries. If only variables with detection frequency 5 or higher are considered, all 21 such detections are true discoveries (100%).

Critical detection frequency and $\mathrm{Fdr}_C$.

For a fixed cutoff $c$, we screen significant genes at every splitting/screening step if the tail-area Fdr in the verification split is less than $c$. This leads to the rejection region from an individual split according to Equation (3.5a), which can be calculated by a numerical grid search for the region endpoints so that Equation (3.5a) holds. The overall rejection region is constructed by combining the rejection regions from the 100 repeated splits, according to Equation (3.11a).

For example, for the simulated data generated with the parameters in Table 3, keeping the cutoff fixed, the combined rejection region from 100 repeated random sample splits/screenings takes the form (3.11a).

The $\mathrm{Fdr}_C$ and critical detection frequency calculation steps described in Subsection 3.3, Equations (3.10) and (3.13), use a mixture distribution fitted to the entire data without sample splitting (as described in step (iii) under the heading “Rejection Region and Power Calculation Steps” in that subsection). For this particular simulated data set, the whole-data Uniform/Beta mixture was fitted on the left-tail areas and then written in its tail-adjusted form according to Equation (3.2).

When the splitting/screening is repeated $K$ times, to keep $\mathrm{Fdr}_C$ at a tolerance level $\gamma$, the critical detection frequency for potential true significance can be calculated from Equation (3.13), where $\gamma$ can be user-defined, such as 0.05, 0.1, or 0.2. A numerical integration procedure is used for this part of the calculation in our R package.

To generalize the concept for any $K$, one might simply consider the critical detection relative frequency, i.e., the critical detection frequency divided by $K$.

Figure 8 illustrates the critical relative frequency of detection required to keep the overall combined false discovery rate bounded at a given $\gamma$, when each screening split uses a tail-area Fdr cutoff as the detection criterion.

(a) The tail-area Fdr cutoff point used for each screening vs. the relative frequency of detection needed to attain a given $\mathrm{Fdr}_C$.
(b) Zoomed-in section of plot 8(a).
Figure 8: Critical relative frequency of detection needed to achieve a given $\mathrm{Fdr}_C$ according to Equation (3.13), when the criterion used to detect potential significant cases from each individual screening split is a tail-area Fdr cutoff.

For example, in Figure 8(b), the graph shows that the minimum relative frequency needed at the chosen cutoff is about 0.01. This implies that if, for each screening split, the detection criterion employed is “a variable is significant when its corresponding tail-area Fdr is below the cutoff” and the process is repeated 500 times, any variable needs to be detected more than 5 times to be identified as a potential true discovery in order to keep $\mathrm{Fdr}_C$ at the desired level for the combined screening process. If the process is repeated 1000 times, any variable needs to be detected more than 10 times to be identified as a potential true discovery.

Figure 9 shows the type I and type II errors as functions of the cutoff point for the tail-area Fdr screening, calculated from the simulated data following the steps described in Section 3.3. This may help in the choice of an appropriate cutoff point for the main analysis.

(a) Type I and II errors.
(b) ROC curve for the Type I and II errors.
Figure 9: The solid line represents the Type I error and the dashed line the Type II error, as in Equations (3.7) and (3.8). The error probabilities are calculated as functions of the tail-area Fdr cutoff point with the rejection region in (3.11a).

In Appendix C we include an overall performance comparison between the repeated sample splitting/screening procedure and the conventional Benjamini-Hochberg screening method over 100 different data sets simulated from the Normal distribution setup used in this section. The comparison reveals that our approach, as a whole, performed better in terms of stability selection [Meinshausen]. The sample splitting/screening approach successfully detected all non-null genes, whereas the BH screening detected some non-null genes with high relative frequency but missed nearly half of them. A prominent specificity difference between the two approaches was apparent as well: our approach almost never detected null genes as non-null, while the BH method erroneously picked up almost all null genes as non-null multiple times over the course of the 100 simulations (Appendix C contains further details).

4.3.2 Null Model Simulation (Specificity Study)

Next, we consider the null model, i.e., the scenario where both the experimental and the control group data come from the same population distribution. In this case, all hypotheses under study belong to the null class; hence any classification or detection of a non-null case will essentially be a false discovery. To understand how well a given procedure works in this scenario, we can examine the percentage of the time the procedure assigns a hypothesis correctly to the null class, i.e., the “specificity” of the method.

(a) Specificity comparison between the Benjamini-Hochberg method and the proposed method. The dark solid line represents the Benjamini-Hochberg FDR method, and the gray dashed line the proposed method.
(b) The zoomed-in version of Figure 10(a). The dark solid line represents the Benjamini-Hochberg FDR method, and the gray dashed line the proposed method.
Figure 10: Specificity comparison plot for the proposed method and the BH method. The x-axis shows the FDR cutoff point used for screening. The y-axis represents the proportion of null cases misclassified as non-null; therefore a smaller height of the plot indicates higher specificity, i.e., a higher rate of correct classification of null cases.

We simulated continuous data from 100 subjects for 1000 genes, drawn from a common normal distribution. The first 50 subjects constitute the control group and the next 50 subjects the treatment group, so the two groups obviously come from the same distribution. Figure 10 illustrates the specificity comparison of the BH method (dark solid line) and the proposed method (gray dashed line). The y-axis presents the proportion of null cases falsely classified as non-null, so a smaller height of the plot indicates higher specificity, i.e., a higher rate of correct classification of null cases. The plot shows that the proposed method has slightly higher specificity for moderate FDR cutoffs but lower specificity for large FDR cutoffs. Note that in multiple testing screening, factors with small FDR are the ones considered to be true discoveries, and Figure 10 shows that the proposed method works better in that crucial region.
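A minimal sketch of this specificity computation for the BH arm, assuming per-gene two-sample t-tests (our choice; any valid per-gene test would do):

set.seed(2)
n_gene <- 1000
grp  <- rep(c("control", "treatment"), each = 50)
expr <- matrix(rnorm(n_gene * 100), nrow = n_gene)   # same distribution in both groups
pv <- apply(expr, 1, function(z)
  t.test(z[grp == "control"], z[grp == "treatment"])$p.value)
bh_specificity <- function(q) mean(p.adjust(pv, method = "BH") > q)
sapply(c(0.01, 0.05, 0.10), bh_specificity)          # fraction of nulls correctly kept null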

4.3.3 Discrete Data Simulation

Here we consider a simulation of discrete data for 100 subjects and 1000 genes, in which the first 50 subjects form the control group and the other 50 subjects form the treatment group. The control group values are simulated from a negative binomial distribution with a common mean and dispersion (we are using the negative binomial parameterization of [robinson2007small]). The treatment group data are simulated from a mixture of negative binomial distributions in which the first 30 genes are the non-null genes, simulated from a negative binomial distribution with a shifted mean and the same dispersion value as the control group.
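This simulation can be sketched in R as follows; the specific mean and dispersion values below (mu0, mu1, phi) are placeholders, since the paper's values are elided in this extract. In the parameterization of [robinson2007small] the variance is mu + phi * mu^2, which corresponds to size = 1/phi in R's rnbinom().

set.seed(3)
n_gene <- 1000; n_per_grp <- 50
nonnull <- 1:30
mu0 <- 10; mu1 <- 20; phi <- 0.2                     # hypothetical values
ctrl <- matrix(rnbinom(n_gene * n_per_grp, mu = mu0, size = 1/phi), n_gene)
trt  <- matrix(rnbinom(n_gene * n_per_grp, mu = mu0, size = 1/phi), n_gene)
trt[nonnull, ] <- rnbinom(length(nonnull) * n_per_grp, mu = mu1, size = 1/phi)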

Gene_ID   freq   rfreq   Fdr     pvalue
22 44 0.088 0.001 1.01E-06
27 17 0.034 0.043 2.55E-04
5 14 0.028 0.031 6.95E-05
25 11 0.022 0.031 9.36E-05
209 9 0.018 0.033 1.30E-04
21 9 0.018 0.081 9.37E-04
4 8 0.016 0.034 1.71E-04
2 7 0.014 0.048 3.34E-04
19 6 0.012 0.054 4.70E-04
193 5 0.010 0.081 9.68E-04
29 5 0.010 0.054 5.22E-04
Table 5: Screening results for simulated discrete data, where freq and rfreq are calculated from the proposed method over 500 sample splits. Genes 1 to 30 are set to be the non-null genes. The values in the Fdr column are calculated based on the BH method by using the whole data. The last column presents the p-values for each gene from whole-data likelihood ratio tests that compare the treatment group and the control group averages.

We used 500 random splits to analyze the data. By using the Fdr threshold 0.05 and the critical relative detection frequency threshold 0.01, our method found 11 significant genes, including 9 true discoveries and 2 false discoveries. Alternatively, the conventional BH method detected 7 genes, including 6 true discoveries and one false discovery, by using the threshold 0.05. Screening results are shown in Table 5, where the columns freq and rfreq give the detection frequencies and detection relative frequencies of the significant genes. The values in the Fdr column are calculated based on the BH method by using the whole data, and the last column presents the p-values for each gene from whole-data likelihood ratio tests that compare the treatment group and the control group averages. Table 5 shows that genes 19, 21, and 29 were detected at least 5 times by our method, while their Fdr values exceeded the threshold 0.05, hence they would have been missed by the BH screening.
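Schematically, the repeated split/screen/count loop looks like the sketch below. The helper screen_split() is hypothetical: it stands in for fitting the adjusted mixture model on the modeling half and returning the indices whose tail-area Fdr falls below fdr_cut on the screening half (the tiltmod package referenced in Section 6 implements the actual procedure).

repeat_screen <- function(ctrl, trt, B = 500, fdr_cut = 0.05, rfreq_cut = 0.01) {
  freq <- numeric(nrow(ctrl))
  for (b in seq_len(B)) {
    idx_c <- sample(ncol(ctrl), ncol(ctrl) / 2)      # random modeling half, control
    idx_t <- sample(ncol(trt),  ncol(trt) / 2)       # random modeling half, treatment
    hits  <- screen_split(ctrl, trt, idx_c, idx_t, fdr_cut)   # hypothetical helper
    freq[hits] <- freq[hits] + 1
  }
  list(freq = freq, detected = which(freq / B >= rfreq_cut))
}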

4.3.4 Discrete Data Simulation with Larger Signal Proportion

We repeated the discrete data simulation setup from Subsection 4.3.3 but with 100 non-null genes in the treatment group instead of 30. That is, data were simulated for 1000 genes for 50 control subjects and 50 treatment group subjects, with the control group data generated from a negative binomial distribution with the same mean and dispersion as before, and the treatment group data from a mixture of negative binomial distributions in which the first 100 genes are set to be the non-null genes, simulated from a negative binomial distribution with the shifted mean.

The analysis was done with 500 random sample splits using the proposed method. By using the FDR threshold 0.05 and the relative detection frequency threshold 0.02, our method detected 46 significant genes, including 39 true discoveries and 7 false discoveries. Comparatively, the BH method detected 39 genes, including 34 true discoveries and 5 false discoveries. The power of the proposed method was 0.39, and the power of the BH method was 0.34. Table 6 presents the screening results, including the detection frequencies and relative frequencies of all 46 genes along with their corresponding BH method FDR. The table shows that the proposed method detected the non-null genes 27, 47, 80, 85, and 18 at least 10 times, while their BH Fdr values were greater than 0.05, hence they would have been missed by the BH method with the same 0.05 FDR threshold value.

Gene_ID   freq   rfreq   Fdr     pvalue
86 123 0.246 0.002 5.03E-06
58 121 0.242 0.002 9.67E-06
70 107 0.214 0.002 1.15E-05
56 101 0.202 0.003 2.47E-05
66 96 0.192 0.003 1.89E-05
94 96 0.192 0.002 1.32E-05
55 93 0.186 0.002 1.28E-05
92 93 0.186 0.002 8.33E-06
21 77 0.154 0.006 8.97E-05
24 69 0.138 0.004 3.61E-05
25 66 0.132 0.003 2.41E-05
209 63 0.126 0.004 4.74E-05
23 53 0.106 0.012 1.86E-04
77 51 0.102 0.009 1.42E-04
72 50 0.100 0.004 5.31E-05
91 49 0.098 0.060 2.59E-03
83 47 0.094 0.024 7.30E-04
39 46 0.092 0.017 4.03E-04
82 45 0.090 0.005 6.10E-05
32 42 0.084 0.018 4.66E-04
99 42 0.084 0.017 3.95E-04
12 37 0.074 0.031 1.15E-03
57 34 0.068 0.013 2.64E-04
74 34 0.068 0.013 2.79E-04
8 29 0.058 0.013 2.52E-04
95 29 0.058 0.012 1.96E-04
35 28 0.056 0.018 5.00E-04
73 27 0.054 0.017 3.92E-04
97 25 0.050 0.031 1.07E-03
29 22 0.044 0.018 4.80E-04
22 21 0.042 0.013 2.80E-04
68 20 0.040 0.029 8.96E-04
193 18 0.036 0.018 4.46E-04
42 18 0.036 0.031 1.10E-03
60 18 0.036 0.024 7.26E-04
27 17 0.034 0.059 2.46E-03
927 15 0.030 0.029 9.42E-04
181 14 0.028 0.106 6.23E-03
763 14 0.028 0.031 1.22E-03
4 13 0.026 0.031 1.17E-03
117 12 0.024 0.031 1.05E-03
706 12 0.024 0.064 2.82E-03
47 11 0.022 0.090 4.53E-03
80 11 0.022 0.057 2.27E-03
85 11 0.022 0.077 3.75E-03
18 10 0.020 0.057 2.35E-03

Table 6: Screening results for simulated discrete data with a larger proportion of signal, where genes 1 to 100 are set to be the non-null genes. The freq and rfreq columns are calculated from the proposed method over 500 sample splits. The values in the Fdr column are calculated based on the BH method by using the whole data. The last column presents the p-values for each gene from whole-data likelihood ratio tests that compare the treatment group and the control group averages. The proposed method discovered 46 genes by using the FDR threshold 0.05 and the relative detection frequency threshold 0.02.

5 Discussion

Sample Splitting:

In a multiple testing situation, if the entire available data set is used to fit the null/non-null mixture model, then using the same fitted model for non-null detection on that same data may compromise the screening process through overfitting of the empirical model. The sample splitting in the proposed method allows one part of the available information to be used for model building and the other part for screening significant cases, and hence avoids that drawback. Further, when the data is randomly split multiple times, it produces a different (not necessarily disjoint) modeling set each time. Fitting models on these different splits helps to neutralize the effect of sources of variation (noise) other than the one of interest in a study.

The use of only partial information for model building may lead to some loss of power for an individual modeling split, but repeated sample splitting and combining the resulting rejection regions overcomes that. In fact, in all simulation studies, the proposed method had a higher number of true discoveries than the existing whole-data fit method, and in models with a higher signal proportion this difference was more pronounced. The combined rejection region also accumulates false discoveries and reduces precision, but the critical detection frequency threshold is designed to attain any fixed precision level for the entire process.

F-association Plots:

The other benefit of repeated sample splitting is the F-association plots, which are constructed from the detection frequencies across the repeated splits. The simulation study shows that if some screened non-null cases are indeed associated, that relation is captured in the group structure of the F-association plot (see Appendix A for a heuristic justification). We emphasize, however, that the F-association plots generated by this method should not be used as "proof" of group behavior among the cases. If two unrelated non-null cases deviate strongly from the null distribution, they are bound to be frequently detected as significant no matter how the data is split. In that case, the two unrelated non-null cases may repeatedly appear in the detected sets at the same time and consequently show up in the F-association plot as a group. Therefore, the F-association plot is not intended to be used as a basis for a causal relation.

Thus we recommend that these plots be used as an exploratory tool that precedes further investigation to establish possible causal relationships between cases that show high concurrent detection frequencies. When a study includes thousands of cases, even a starting point for an exploratory network analysis can be highly useful and cost-effective; a sketch of how the underlying co-detection counts can be tabulated is given below.
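A minimal sketch, assuming a B x m indicator matrix hit_mat (our name) whose (b, i) entry records whether case i was detected in split b; the concurrent detection counts behind an F-association plot are then pairwise inner products:

# m x m matrix of concurrent detection counts across the B splits
co_detect <- function(hit_mat) crossprod(hit_mat)
# e.g., heatmap(co_detect(hit_mat)) gives a quick visual scan for
# groups of cases that are frequently detected together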

The relevance and effectiveness of the proposed method can be explained particularly well in a study where the main goal is the identification of groups of differentially expressed genes. Such studies are commonly used to identify the genes that are associated with specific biological behavior. Genes detected through the proposed methodology can be selected by experimental biologists for detailed follow-up functional studies. For example, a biologist may look into the few most frequently identified significant genes to distinguish the “regulators”. A systematic gene knockout experiment, conducted on the sets that appear in the F-association plot as a group, can reveal the effect of individual genes on the biological response of interest. This can provide a novel starting point for the elucidation of gene networks or hierarchical regulation patterns in a biological system. Thus, the proposed analysis can guide an exploratory biological study where, instead of experimental investigations of the effect of every gene, a small subset of significant ones can be selected for further experimentation to establish their individual or collective role in the biological response of interest.

Conditional Independence:

An interesting question can be posed: "When does the empirical Bayes model (1.2) work?" It works, of course, when the observations are i.i.d. according to (1.2). But in many cases (in particular, the microarray example considered here) that assumption is not correct. The model does work, though, when pooling observations that are conditionally independent. For example, in a microarray, groups of genes may act together while still being conditionally independent. This is a typical argument used in multiple hypothesis testing [Karlin81, Sarkar97, Huque16]. See Appendix B for an explanation of why the empirical Bayes approach works in this case.

In conclusion, we present a method that can be used for identifying significant cases when carrying out a large number of simultaneous tests. We propose a cross-validation type analysis where a part of the available information goes into the understanding of the underlying process or model fitting, while the other part goes into screening for extreme cases. Random splitting and repeated screening provide a way to reduce the noise (other sources of variation) in the analysis. We present a way of controlling the overall chance of false discoveries that arise from combining the screening results from all splits. As a by-product we get an exploratory look into the associative pattern between significant cases.

6 Software

The analyses were done using R, and the relevant code is available upon request. An R package for the proposed analysis method, tiltmod, is available for general use and can be downloaded via: https://github.com/chongma1989/tiltmod.

References

Appendix A F-association plots justification.

When we are trying to detect genes that have significantly different levels of expression in the treatment group compared to that of the control group, the detection can be based on the difference in the average expression level between the two groups.

Let $Z_i$ denote the average expression difference between the experiment and the control group for the $i$-th gene. Genes that behave similarly in both groups should have $Z_i$'s close to $0$, whereas genes that are truly different in the two groups should have $Z_i$'s removed from $0$ in either a positive or a negative direction. These extreme genes will have smaller chances of being false discoveries. Thus an extreme-gene detection criterion of the form $Fdr(z_i) \le f_0$ for some cutoff $f_0$ is equivalent to $|Z_i| > c$ for some threshold $c$.

Let us consider two genes $g_i$ and $g_j$ that are truly different on average in the treatment group compared to the control group. Suppose $g_i$ is detected, that is, the observed value of $Z_i$, the average expression difference for $g_i$, is beyond the threshold; without loss of generality suppose $Z_i > c$. We present an argument below that the chance that $g_j$ will also be detected is higher when $g_i$ and $g_j$ have some level of dependency, as compared to the scenario when $g_i$ and $g_j$ behave completely independently.

Similar to the simulation study in Section 4.3, we assume the $Z_i$'s follow a Normal distribution (not an entirely hypothetical scenario if we have large enough sample sizes in the experiment and the control groups). If $g_i$ and $g_j$ function independently, we assume $Z_i$ and $Z_j$ are independent $N(\mu_i, 1)$ and $N(\mu_j, 1)$ variables. If $g_i$ and $g_j$ are not independent, we assume $(Z_i, Z_j)$ is bivariate Normal with the same marginals and correlation $\rho$, with $\mu_i > 0$ and $\mu_j > 0$. (We will denote the left and right tail probabilities of the standard Normal distribution by $\Phi$ and $\bar{\Phi}$, respectively.)

If the two genes behave independently and $g_i$ is detected, the chance that $g_j$ will also be detected is $P(Z_j > c) = \bar{\Phi}(c - \mu_j)$ (here $\bar{\Phi}(x)$ is the right-tail area from $x$ under a standard Normal distribution). If the two genes are not independent, then we have the following two possibilities:

  • If $\rho > 0$ and $g_i$ is detected, the chance that $g_j$ will also be detected is $P(Z_j > c \mid Z_i > c) = P(Z_i > c, Z_j > c)/\bar{\Phi}(c - \mu_i)$
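The qualitative claim above, that positive dependence raises the co-detection chance beyond $\bar{\Phi}(c - \mu_j)$, can be checked numerically. Below is a hedged R sketch using the mvtnorm package; the values of $\mu$, $c$, and $\rho$ are purely illustrative.

library(mvtnorm)
mu  <- c(1.5, 1.5); c0 <- 2; rho <- 0.6                  # illustrative values
p_indep <- pnorm(c0, mean = mu[2], lower.tail = FALSE)   # \bar\Phi(c - mu_j)
sig     <- matrix(c(1, rho, rho, 1), 2)
p_joint <- pmvnorm(lower = c(c0, c0), mean = mu, sigma = sig)[1]
p_i     <- pnorm(c0, mean = mu[1], lower.tail = FALSE)   # P(Z_i > c)
p_cond  <- p_joint / p_i                                 # P(Z_j > c | Z_i > c)
c(independent = p_indep, correlated = p_cond)            # p_cond exceeds p_indep here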