 # Detecting weak signals by combining small P-values in observational studies with multiple testing

Human health is affected by multiple risk factors. Studies may focus on a hypothesis that a particular exposure is a risk factor for disease, but complex outcomes are typically influenced by spectra of weak genetic and environmental factors. A plethora of weak associations makes it difficult to detect a real signal because some portion of these associations is expected to be spurious. However, the combined effect of many individually weak signals has proven to be a more powerful approach to study the underpinnings of health conditions. Based on ideas from meta-analysis, statistical methods have been developed for combining top-ranked weak associations. For example, Truncated Product Method (TPM) and Rank Truncated Product (RTP) have gained substantial popularity in applications for linking combined contribution of multiple weak risk factors to disease. Both TRM and RTP aggregate only top ranking signals, while adjusting for the total number of predictors to assure Type-I error protection. Unlike TPM, the RTP distribution is comparatively unwieldy and involves repeated integration, which obscures its probabilistic interpretation and makes it difficult to implement. In this article, we developed new ways of evaluating the distribution of RTP and related statistics that not only are mathematically simple, but further lead to powerful extensions for combining top-ranked and correlated weak associations.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## Methods

### Theoretical RTP distribution

We derived a simple expression for the RTP distribution as the expectation of a function of a uniform (0 to 1) random variable:

 PRTP(k) = Pr(Wk≤w)=1−∫10Gk⎧⎪⎨⎪⎩ln⎛⎜⎝[B−1k+1(u)]kw⎞⎟⎠⎫⎪⎬⎪⎭du, (4)

where is inverse CDF of distribution, and is CDF of Gamma. is the combined RTP P-value. Notably, given the value of the product of P-values, , we can simultaneously evaluate :

 PRTP(k+1) = (5)

Details and the derivation are given in “Supplemental Materials (S-2)”.

### Modified RTP method, mRTP

Next, we develop an alternative to the product statistic for calculating combined P-values. This alternative statistic and its distribution are not an approximations to and the RTP distribution. However, similarly to RTP the new statistic is designed to capture information contained in the first smallest -values. To construct the new statistic, we propose the following transformation that involves the product and the variable :

 Ak = (k−1)ln{P(k)}−ln{Wk−1}+G−1λ{1−Bk(p(k))}, (6)

where is inverse CDF of Gamma,

 λ=(k−1)×E{−ln(p(k))}=(k−1)(Γ′(L+1)/Γ(L+1)−Γ′(k)/Γ(k)),

is the first derivative of a gamma function; and is the CDF of distribution evaluated at . Given the observed value , the combined P-value is then:

 PAk=Pr(Ak≤ak)=1−Gk+λ−1(ak). (7)

Under the null hypothesis, as illustrated by Fig. (

2), combined P-values based on the proposed are very similar to , and approach as increases. However, under the alternative, we found that mRTP has either the same or higher power than RTP. Furthermore, the combined P-value, , can be easily computed in R using its standard functions. A short example and an R code is given in “Supplemental Materials (S-3)”.

As we discussed earlier in Introduction, the number of P-values to be combined by the RTP method is fixed and needs to be pre-specified. The choice of is somewhat arbitrary, so a researcher may wish to evaluate at several values of , consequently choosing that corresponds to the smallest combined P-value. However, this additional step creates another layer of multiple comparisons, which needs to be accounted for. Yu et al.  proposed an empirical procedure to evaluate adaptive rank truncated product (aRTP) method based on the minimum P-value computed over various candidate truncation points. To avoid a cumbersome two-level permutation procedure, they built on the method suggested by Ge et al. to reduce computational time. While computationally efficient, the method requires to store a large matrix, with every row containing P-values generated under the null distribution over simulated experiments. Zhang et al. derived analytic but mathematically complex aRTP distribution, which needs to be evaluated using high-dimensional integration. Here, we propose a new and easily implemented version of the aRTP theoretical distribution. The method exploits the fact that ordered P-values can be represented as functions of the same number of independent uniform random variable (Supplemental Materials (S-4)).

### Correlated P-values

We further extend the proposed methods to combine correlated P-values. Let correlated -values, , originate from statistics that jointly follow a multivariate normal disitrbution, , under . Dependent variables can be transformed into independent variables by using eigendecomposition of , such that , where is a square matrix, with

th column containing eigenvector

of , and

is the diagonal matrix of eigenvalues

. Next, define an orthogonal matrix

and . P-values are decorrelated as . Then, the first smallest decorrelated P-values can be used to calculate various combined statistics.

There are many ways to choose orthogonal transformations, but a valid one needs to have the following “invariance to order” property. Suppose we sample an equicorrelated normal vector

with a common correlation for all pairs of variables. Before decorrelating the vector, we permute its values to a different order. A permutation in this example is a legitimate operation, because an equicorrelation structure does not suggest a particular order of values. After an orthogonal transformation of to , the order of entries may change due to permutation but their values should remain the same. This would not be the case if one were to use orthogonal matrices, such as the inverse Cholesky factor or , to decorrelate , unless and the combined P-value distribution was derived from the distribution of quadratic form in all variables, . Moreover, some transformations of equicorrelated data to independence, such as the Helmert transformation, may change values of depending on the order of values in even in a special equicorrelation case of , that is, when normal variables in are independent. The proposed , as defined above, has both the invariance to order property (due to its symmetry) and may be used with P-value transformations other than , implied by the quadratic form, .

## Results

### Simulation study results

We used simulation experiments to evaluate the Type I error rate and power of the proposed methods relative to the previously studied RTP (defined for a fixed ) and to the adaptive RTP (where is varied and the distribution is evaluated by single-layer simulations as in Yu et al, 2009).[18, 8] Performance of various methods was evaluated using first-ordered P-values, with and . Details of the simulation design are given in “Supplemental Materials (S-5)”.

Table 1 and Table 2 present Type I error rates for combinations of independent and decorrelated P-values respectively. In the tables, rows labeled “aRTP(new)” refer to our newly proposed adaptive RTP method, while “aRTP” labels the results of the conventional approach. For the aRTP(new) results, the number in parentheses is the average optimal (average across simulations). For the adaptive methods, the sequence of truncation points varied from 1 to or from 1 to , if . Both tables confirm that all methods maintain the correct Type I error rate.

Tables 3-6 summarize a set of power simulations for independent P-values. Results presented in Table 3 were obtained under the assumption that all statistics had the same underlying effect size (). From this table, it is evident that our mRTP has the highest power, closely followed by RTP. The Simes method has the lowest power, which is expected due to homogeneity in effect sizes across tests and absence of true nulls. For the results in Tables 4-5, the effect size was allowed to either randomly vary throughout the range from 0.05 to 0.45 (Table 4) or was equally spaced within the same range (Table 5). In both of these tables, the mRTP method has the highest power, while the Simes method has the lowest power. The power of both adaptive methods is very similar to one another but lower than that of methods based on a fixed (RTP and mRTP). Nonetheless, in practice, a good choice for may not be immediately clear, so a small sacrifice in power may be preferable to an arbitrary and possibly poor choice of . However, when is large, it can be impractical or unreasonable to vary candidate truncation points all the way up to . Finally, Table 6 summarizes results for simulations when some of the hypotheses were true nulls (), while the remaining hypotheses were true signals (). The results follow the same pattern as in the previous tables, with mRTP having the highest power. Further, “aRTP(new),” appears to outperform the regular aRTP.

Tables 7-9 summarize a set of power simulations for decorrelated P-values. Under the assumption of common effect size across tests, decorrelation-based methods performe relatively poorly, compared to the empirical combined P-values, where the null distribution of correlated P-values was kept as is, without transformation to independence (Table 7). However, it should be noted that the common effect size assumption is admittedly unrealistic. Also, under this assumption, if , an increase in does not lead to a dramatic increase in power of empirical methods (labeled “RTP” and “aRTP”), for which the null distribution was obtained without the decorrelation step. This is expected, because for MVN distributed , the noncentrality parameter for sums of squares is approximately , where and are averaged effect size and correlation values. This sum quickly approaches as increases, subsequently leading to a small power gain for combining more than a few statistical values.

Under heterogeneous effect sizes (Tables 8-9), relative ordering of powers is reversed. Empirical versions of the tests (“RTP”, “aRTP”) still show nearly identical (and low) power for various combinations of and values. However, decorrelation-based methods become quite powerful, and their power is increasing with and . This is expected, because decorrelation induces noncentrality that involves sum, which increases with the increased heterogeneity of .

### Real-data analysis

We analyzed data on reported associations between various dietary patterns and risk for type 2 diabetes mellitus (T2DM). T2DM is one of the most prevalent diseases in the United States and the effect of increasing the dietary intake of various nutrients on reducing the T2DM risk is not well understood. We collected 7 publications reporting generally weak but significant associations between T2DM and dietary intake of whole grain, protein, fiber, magnesium, calcium, fruit and berries, and alcohol. From these publications, we extracted odds ratios (OR), confidence interval bounds (LCI, UCI), and association P-values (Table

10). To test whether at least one of these nutrients is associated with T2DM, we employed our proposed adaptive RTP with varying from 1 to 7, and Fisher’s test (or the traditional RTP with ). The resulting combined adaptive RTP P-value was equal to at the value =6, while P-value based on the RTP and mRTP (for =6) were =1.3 and =1.8. Next, we applied both mRTP and RTP to find the minimum value , such that the combined P-value would be larger than 0.05. At , both P-values are still below 0.05, and . At , and . Therefore, around 270 manuscripts reporting non-significant associations with diabetes, possibly unpublished and tucked away in researchers’ file drawers, would collectively suggest that there is a good possibility of observing such set of P-values as in Table 10 due to chance. Here, we do not intend to cast doubt on the reported diabetes associations but merely want to illustrate one of the possible ways to use the RTP and its variations.

## Discussion

Complex diseases are influenced by collective effects of environmental exposures and genetic determinants. There can be numerous weak, but biologically meaningful risk factors. The challenge is to distinguish between real and spurious statistical signals in the presence of multiple comparisons and low detection power. In this article, we derived a mathematically simple form of the rank truncated product P-value distribution that not only makes its evaluation very simple, but also leads to new extensions, such as mRTP (the modified RTP) and our version of adaptive RTP. The mRTP is designed with the same objectives in mind as RTP and TPM: to facilitate detection of possibly weak signals among top-ranking predictors that could have been missed, unless combined into a single score. The mRTP is trivial to implement in terms of standard functions provided by packages such as R, and its P-value under the null hypothesis is closely tracking RTP values (Figure 2). In terms of power, however, mRTP has an edge over RTP: its power is either the same or higher. Further, we developed a fast adaptive algorithm which searches through a number of candidate values of truncation points and finds an optimal number of top ranking predictors to combine, to maximize power of the overall association. We also proposed a powerful analytic solution to accommodate correlated risk factors.

To evaluate performance of our methods, we gauged their power against the Simes test. The Simes test is a useful benchmark, because it is related to the combined P-value methods with truncation, as well as to multiple adjustment procedures. At the extremes, namely, for RTP with and TPM with the threshold set at , both truncation methods become equivalent to Šidák correction. Šidák correction is approximately the Bonferroni correction, for small P-values and large . The Simes P-value, , is at least as small as Bonferroni-corrected P-value. In addition, there is a connection of the Simes test to the Benjamini & Hochberg false discovery rate (FDR). Formally, the Simes test is algebraically the same procedure as the Benjamini & Hochberg FDR but the interpretation is different: FDR method determines the largest , such that , and rejects for all , , to control the expectation of FDR. TPM was omitted from comparisons, because its performance relative to RTP was extensively studied in previous publications.[2, 22]

Our evaluation of correlated P-values has two main limitations: (1) We make an assumption that the generating mechanism for dependent association statistics is via a multivariate normal distribution, MVN. The association statistics themselves can be distributed differently, for example, marginally follow a chi-square distribution, as in our simulations. The implicit assumption here is that chi-square statistics are transformations of the underlying MVN; (2) The MVN correlation structure is known or well estimated. Future research should explore the extent of measurement error in

on power of truncation methods.

Finally, in light of replicability crisis, utility of P-values is being re-evaluated. 

Despite numerous drawbacks, P-values remain a useful filtering tool. They can also be viewed as transformations of standardized effect sizes and contain information that can be converted to approximate posterior probability of a false discovery and to Bayesian effect size estimates.

[24, 25, 26, 27]

## Acknowledgements

Author affliation: Biostatistics and Computational Biology, National Institute of Environmental Health Sciences, National Institutes of Health, Research Triangle Park, North Carolina, USA (Fengjiao Hu, Dmitri V. Zaykin) Biostatistics Department, University of Kentucky, Lexington, Kentucky, USA (Olga A. Vsevolozhskaya)
Correspondence: Dmitri V. Zaykin, Senior Investigator at the Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, National Institutes of Health, P.O. Box 12233, Research Triangle Park, NC 27709, USA. Tel.: +1 (919) 541-0096 ; Fax: +1 (919) 541-4311. Email address: dmitri.zaykin@nih.gov

This research was supported in part by the Intramural Research Program of the NIH, National Institute of Environmental Health Sciences.

## References

•  Zaykin DV, Zhivotovsky LA, Westfall PH, Weir BS. Truncated product method for combining P-values. Genet Epidemiol. 2002;22(2):170–85.
•  Dudbridge F, Koeleman BP. Rank truncated product of P-values, with application to genomewide association scans. Genet Epidemiol. 2003;25(4):360–366.
•  Wilson PW, Meigs JB, Sullivan L, Fox CS, Nathan DM, D’Agostino RB. Prediction of incident diabetes mellitus in middle-aged adults: the Framingham Offspring Study. Archives of internal medicine. 2007;167(10):1068–1074.
•  Lin D, Zeng D. Meta-analysis of genome-wide association studies: no efficiency gain in using individual participant data. Genet Epidemiol. 2010;34(1):60–66.
•  Zaykin DV. Optimally weighted Z-test is a powerful method for combining probabilities in meta-analysis. J Evol Biol. 2011;24(8):1836–1841.
•  Fisher SRA. Statistical methods for research workers. Genesis Publishing Pvt Ltd; 1932.
•  Hoh J, Wille A, Ott J. Trimming, weighting, and grouping SNPs in human case-control association studies. Genome Res. 2001;11(12):2115–9.
•  Yu K, Li QZ, Bergen AW, Pfeiffer RM, Rosenberg PS, Caporaso N, et al. Pathway Analysis by Adaptive Combination of P-Values. Genet Epidemiol. 2009;33(8):700–709.
•  Zhang S, Chen HS, Pfeiffer RM. A combined p-value test for multiple hypothesis testing. Journal of Statistical Planning and Inference. 2013;143(4):764–770.
•  Sheng X, Yang J. An adaptive truncated product method for combining dependent p-values. Economics letters. 2013;119(2):180–182.
•  Zaykin DV. Statistical analysis of genetic associations. Ph.D. thesis. North Carolina State University; 2000.
•  Zaykin DV, Zhivotovsky LA, Czika W, Shao S, Wolfinger RD. Combining P-values in large-scale genomics experiments. Pharm Stat. 2007;6(3):217–226.
•  Nagaraja HN. Order statistics from independent exponential random variables and the sum of the top order statistics. In: Advances in Distribution Theory, Order Statistics, and Inference. Springer; 2006. p. 173–185.
•  Stouffer SA, DeVinney LC, Suchmen EA. The American soldier: Adjustment during army life. vol. 1. Princeton University Press, Princeton, US; 1949.
•  Whitlock MC. Combining probability from independent tests: the weighted Z-method is superior to Fisher’s approach. Journal of Evolutionary Biology. 2005;18(5):1368–1373.
•  Loughin TM. A systematic comparison of methods for combining p-values from independent tests. Computational statistics & data analysis. 2004;47(3):467–485.
•  Won S, Morris N, Lu Q, Elston RC. Choosing an optimal method to combine P-values. Statistics in medicine. 2009;28(11):1537–1553.
•  Ge Y, Dudoit S, Speed TP. Resampling-based multiple testing for microarray data analysis. Test. 2003;12(1):1–77.
•  Šidák Z. Rectangular confidence regions for the means of multivariate normal distributions. Journal of the American Statistical Association. 1967;78:626–633.
•  Bonferroni CE. Il calcolo delle assicurazioni su gruppi di teste. Studi in onore del professore salvatore ortu carboni. 1935;p. 13–60.
•  Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Royal Stat Soc B. 1995;57(1):289–300.
•  Zaykin DV, Zhivotovsky LA, Czika W, Shao S, Wolfinger RD. Combining p-values in large-scale genomics experiments. Pharmaceutical statistics. 2007;6(3):217–226.
•  Wasserstein RL, Lazar NA. The ASA’s statement on P-values: context, process, and purpose. Am Stat. 2016;70(2):129–133.
•  Wakefield J. A Bayesian measure of the probability of false discovery in genetic epidemiology studies. Am J Hum Genet. 2007;81(2):208–227.
•  Wakefield J. Bayes factors for genome-wide association studies: comparison with P-values. Genet Epidemiol. 2009;33(1):79–86.
•  Storey JD. A direct approach to false discovery rates. J Royal Stat Soc B. 2002;64(3):479–498.
•  Efron B, Tibshirani R, Storey JD, Tusher V. Empirical Bayes analysis of a microarray experiment. J Am Statist Assoc. 2001;96(456):1151–1160.
•  Fung TT, Hu FB, Pereira MA, Liu S, Stampfer MJ, Colditz GA, et al. Whole-grain intake and the risk of type 2 diabetes: a prospective study in men. The American journal of clinical nutrition. 2002;76(3):535–540.
•  Tinker LF, Sarto GE, Howard BV, Huang Y, Neuhouser ML, Mossavar-Rahmani Y, et al. Biomarker-calibrated dietary energy and protein intake associations with diabetes risk among postmenopausal women from the Women’s Health Initiative. The American journal of clinical nutrition. 2011;94(6):1600–1606.
•  Meyer KA, Kushi LH, Jacobs DR, Slavin J, Sellers TA, Folsom AR. Carbohydrates, dietary fiber, and incident type 2 diabetes in older women. The American journal of clinical nutrition. 2000;71(4):921–930.
•  Cullmann M, Hilding A, Östenson CG. Alcohol consumption and risk of pre-diabetes and type 2 diabetes development in a Swedish population. Diabetic Medicine. 2012;29(4):441–452.
•  Montonen J, Järvinen R, Heliövaara M, Reunanen A, Aromaa A, Knekt P. Food consumption and the incidence of type II diabetes mellitus. European Journal of Clinical Nutrition. 2005;59(3):441–448.
•  Hruby A, Meigs JB, O’Donnell CJ, Jacques PF, McKeown NM. Higher magnesium intake reduces risk of impaired glucose and insulin metabolism and progression from prediabetes to diabetes in middle-aged americans. Diabetes Care. 2014;37(2):419–427.
•  Van Dam RM, Hu FB, Rosenberg L, Krishnan S, Palmer JR. Dietary calcium and magnesium, major food sources, and risk of type 2 diabetes in US black women. Diabetes care. 2006;29(10):2238–2243.
•  Balakrishnan N, Rao CR. Order statistics: theory & methods. Elsevier Amsterdam; 1998.
•  Mi X, Miwa T, Hothorn T. mvtnorm: New numerical algorithm for multivariate normal probabilities. The R Journal. 2009;1(1):37–39.
•  Bartlett MS. An inverse matrix adjustment arising in discriminant analysis. The Annals of Mathematical Statistics. 1951;22(1):107–111.
•  Simes R. An improved Bonferroni procedure for multiple tests of significance. Biometrika. 1986;73(3):751–754.

## Figure Legends Figure 1: Illustration for decorrelated yet dependent P-values; k=2, n=4. A plot of simulated and decorrelated values, U(1) vs U(2), reveals a hole in the middle, instead of the complete Malevich black square, indicating dependency. Figure 2: Combined P-values based on Ak versus RTP statistic. Multiple combined P-values were computed using the two proposed statistics based on either top 10 or top 15 P-values out of L=20 tests.

## Supplemental material

### S-1 Correlation and dependencies among k smallest P-values

The complexity of analytic forms of the RTP distribution is due to dependency introduced by ordering of -values. Although order statistics are correlated, products and sums are oblivious to the order of the terms, therefore for the case when , the statistic

follows the gamma distribution with the shape parameter equal to

, and the unit scale, i.e., . This is essentially the same as the Fisher combined P-value, where the statistic is , distributed as the chi-square with degrees of freedom. However, for , the smallest -values remain dependent even if these values are not sorted (e.g., randomly shuffled). The dependency is induced through being a random variable: when happens to be relatively small, the P-values have to squeeze into a relatively small interval from zero to that value. This induces a positive correlation between random sets of smallest P-values, similar to the clustering effect in the random effects models.

The smallest unordered P-values are equicorrelated and also have the same marginal distribution, which can be obtained as a permutation distribution of the first uniform order statistics. Assuming independence of

P-values and their uniform distribution under the null hypothesis, we can derive the correlation between any pair of

unordered smallest P-values as . As increases, the correlation approaches the limit that no longer depends on : . The correlation can be substantial for small and cannot be ignored. There is a very simple transformation that makes a set of P-values uncorrelated. All that is needed to decorrelate these P-values is to scale the largest of them:

 X1 = P(1) X2 = P(2) ⋮ Xk−1 = P(k−1) Xk = σP(k),

where

 σ=2L−k+3+√(k+1)(L+1)(L−k+1)4+2L,

and then randomly shuffle the set . This scale factor can be derived by solving the mixture covariance linear equations induced by the permutation distribution of the first order statistics. The decorrelated values can be further transformed so that each has the uniform (0,1) distribution marginally:

 Uj = 1kk∑i=1Beta(Xj;i,L−i+1),j=1,…,k−1 (S-1) Uk = 1kk∑i=1Beta(Xk/σ;i,L−i+1), (S-2)

where Beta is the CDF of a beta distribution evaluated at .

Unfortunately, although the scaling and subsequent shuffle removes the correlation, the values remain dependent, as illustrated in Figure 1.

### S-2 Derivation of the RTP distribution

An intuitive way to understand our derivation of the RTP distribution is through references to simulations. The simplest, brute-force algorithm to obtain the RTP combined P-value is by simulating its distribution directly. If is the product of actual P-values, one can repeatedly ( times) simulate Uniform(0,1) random variables , sort them, take the product of smallest values, and compare the resulting product to . As the number of simulations, , increases, the proportion of times that simulated values will be smaller than converges to the true combined RTP P-value.

There are several ways to optimize the above simulation scenario with respect to computational complexity. For instance, sets of ordered uniform P-values can be simulated directly using well-known results from the theory of order statistics. Despite the fact that the marginal distribution of th ordered value is Beta, to create the necessary dependency between the ordered P-values, sets of values have to be simulated in a step-wise, conditional fashion. The minimum value, , can be sampled from Beta distribution. Alternatively, using the relationship between beta and Uniform(0,1) random variables, it can be sampled as . Next, since the value cannot be smaller than

, conditionally on the obtained value, it has to be generated from a truncated beta distribution. The third smallest value should be sampled conditionally on the second one, and so on.

 Therefore, the sequence and the product can be obtained by simulating ordered P-values, rather then all unsorted values.

 P(1)=1−U1L1P(2)=1−u1L1U1L−12P(3)=1−u1L1u1L−12U1L−23⋮P(k)=1−u1L1u1L−12…U1L−k+1k. (S-3)

Further optimization of the simulation algorithm is illustrative because it provides intuition for theoretical derivation of the RTP distribution. This optimization is achieved by using the Markov property of order statistics. Specifically, the unordered set behaves as a sample of independent variables, identically distributed as . After re-scaling,

 {P1p(k+1),P2p(k+1),…,Pkp(k+1)}∼Unif(0,1). (S-4)

The capital notation is used here to emphasize the fact that the variable is random, while the lowercase refers to a realized value of a random variable, . Next, given that , minus log of the product of independent conditional uniform random variables will follow a gamma distribution. Specifically,

 −lnk∏i=1Pip(k+1)=klnp(k+1)−k∑i=1lnPi,

and treating as a constant,

 −lnk∏i=1Pip(k+1)∼12χ22k=% Gamma(k,1).

The above manipulations reduce the set of random variables to a set of just two variables: a gamma and a beta. Therefore, the combined RTP P-value can be evaluated numerically by simulating only pairs of beta- and gamma-distributed random variables as follows. Let

 −ln(k∏i=1P(i))=−ln(k∏i=1Pip(k+1))−klnp(k+1),

and define

 X = P(k+1)∼Beta(k+1,L−k) (S-5) Y∣X = −ln(k∏i=1Pip(k+1))∼% Gamma(k,1). (S-6)

The empirical distribution of the product of values under can then be obtained by repeatedly simulating and , and comparing the observed value of to in every simulation. would then be defined as the proportion of times simulated values of were larger than .

Further, it is important to note that with the above modifications one can simultaneously evaluate probabilities for two consequtive products,

 Pr(Wk≤w),and Pr(Wk+1≤w),

by reusing the same pair of random numbers, which follows from the fact that

 −ln(k+1∏i=1P(i))=−ln(k∏i=1Pip(k+1))−(k+1)lnp(k+1). (S-7)

In the latter case, is compared to . This simulation method is very fast and approaches the exact solution as the number of simulated pairs increases. Moreover, through these simulation experiments it becomes clear that once one conditions on the observed value of , the test statistic is formed as a product/sum of independent random variables. Specifically, Gamma distribution for the variable in Equ. (S-6) appears to be conditional on the observed when the pairs are simulated. Alternatively, one can first simulate and then generate a test statistic using uniform random variables, , on (0, ) interval.

We just described a way to evaluate the RTP distribution by repeated sampling of two random variables to elucidate the idea that the combined RTP P-value can be obtained by integrating out the random upper bound

over its probability density function. Random

has to be at least as large as but smaller than one, . Re-expressing in terms of the observed product , it becomes evident that because the product is maximized if for all , so the observed can be at most . Now, integrating over the beta density, with parameters , of a single variable , we treat as a constant:

 Pr(Wk≤w) = 1−∫1w1/kGk{ln(tkw)}f(t)dt. (S-8)

Next, by performing a quantile transformation, we can express the integral as an expectation and make the integration limits to be 0 to 1:

 PRTP(k)=Pr(Wk≤w)=1−∫10Gk⎧⎪⎨⎪⎩ln⎛⎜⎝[B−1k+1(u)]kw⎞⎟⎠⎫⎪⎬⎪⎭du, (S-9)

where is inverse CDF of distribution, and is CDF of Gamma. is the combined RTP P-value. Similarly,

 Pr(Wk+1≤w) = 1−∫10Gk⎧⎪⎨⎪⎩ln⎛⎜⎝[B−1k+1(u)]k+1w⎞⎟⎠⎫⎪⎬⎪⎭du, (S-10)

We have now derived a relatively simple expression that involves only a single integral. Moreover, in contrast to the initial form of this distribution in Equ (S-8), integration limits in Equ (S-9) no longer involve a product value and are conveniently bounded on zero to one interval. Equ (S-9) shows that the RTP distribution is in fact the expectation of a function of a uniform random variable, Uniform(0,1). If we let , the unconditional distribution of is

 Pr(Wk≤w)=1−∫10H(u|k,w)du=1−E{H(U|k,w)}.

Therefore, to evaluate numerically, one can simply sample a large number of uniform random numbers, , apply the function and then take the mean. The corresponding R code using one million random numbers is:

mean(1-pgamma(log(qbeta(runif(1e6),k+1,L-k))*k+z,k))


where . Using the integration explicitly, the R code is:

integrate(function(x,w,k,L) 1-pgamma(log(qbeta(x,k+1,L-k))*k+w,k),0,1,z,k,L)$va.  ### S-3 R code example for computation of PAk and PRTP In the code below, and are computed for a vector of six P-values with =4, lW and Pk: Ak <- function(lW, Pk, k, L) { d = (k-1)*(digamma(L+1) - digamma(k)) ak = (k-1)*log(Pk) - lW + qgamma(1-pbeta(Pk, k, L-k+1), shape=d) 1 - pgamma(ak, shape=k+d-1) } P = sort(c(0.7, 0.07, 0.15, 0.12, 0.08, 0.09)) L = length(P) k = 4 Z = sum(-log(P[1:k])) lW = sum(log(P[1:(k-1)])) P.rtp = integrate(function(x,y,m,n) 1-pgamma(log(qbeta(x,m+1,n-m))*m+y,m),0,1,Z,k,L)$va
P.ak = Ak(lW, P[k], k, L)


The resulting combined P-values are =0.045 and =0.047. Note that all six original P-values are larger than the combined mRTP and RTP. This example demonstrates that week signals can form a much stronger one after they are combined.

### S-4 Transformation to independence in the derivation of the aRTP distribution

As we discussed, ordered P-values can be represented as functions of the same number of independent uniform random variables (Eq. S-3). This reveals that the th value, , is a function of all and that in a given set of variables (i.e., conditionally) all information is contained in independent random variables, . These independent components can be extracted and utilized. Specifically, by using the conditional distribution of , which only depends on the two preceding partial products, and , we define independent variables ’s as . Successive partial products relate to one another as:

 Wk=Wk−1−Wk−1(1−Wk−1Wk−2)U1L−k+1k.

Since , the conditional density and the CDF for the product is

 f(Wk=x∣Wk−1=tk−1,Wk−2=tk−2)=(tk−1−x)L−kB(L−k+1,1)(tk−1(1−tk−1tk−2))L−k+1.

Let

 1−Zi=Pr(Wi
 Pr(Wk≤x∣Wk−1=tk−1,Wk−2=tk−2)=∫xt2k−1/tk−2f(Wk=x|Wk−1=tk−1,Wk−2)dx=1B(L−k+1,1)(tk−1(1−tk−1tk−2))∫xt2k−1/tk−2(tk−1−x)L−kdx=1−⎛⎜ ⎜⎝tk−1−xtk−1(1−tk−1tk−2)⎞⎟ ⎟⎠L−k+1=1−(1−p(k)1−p(k−1))L−k+1.

Thus, we obtain a transformation to new set of independent uniform () random variables.

 Zi=(1−p(i)1−p(i−1))L−i+1,

with

 Z1=(1−p(1))L.

Next, define , where is the inverse gamma CDF with the shape and the scale 1. Under , has a gamma distribution with the shape = . The combined -value can be obtained as:

 1−G∑ki=1λi(k∑i=1G−1λi(1−Zi)) (S-11)

When is large, the gamma CDF approaches the standard normal CDF, which motivates an alternative, the inverse normal transformation. The quantiles can be calculated by using instead of

. The inverse normal method is useful for the reason that the joint distribution of the partial sums can be derived to evaluate the adaptive RTP (aRTP) P-value. For the aRTP,, we define partial sums as:

 Sk=k∑i=1λiΦ−1(1−Zi),

where is inverse CDF of the standard normal distribution. Then, under the null hypothesis, follows a multiviate normal distribution MVN, with and

The vector can be standardized as , where are the diagonal elements of , then MVN, . The null distribution of can be used to evaluate aRTP by using probabilities or quantiles to obtain significance thresholds with MVN functions available in R mvtnorm package.

### S-5 Simulation setup

We performed =100,000 simulations to evaluate the Type I error rate and power of the proposed methods. To study performance of combination methods for independent P-values, in each simulation, we generated normally distributed statistics, . The squared values of follow the chi-square distribution with one degree of freedom and noncentrality parameter , . P-values were obtained as one minus the CDF of the noncentral chi-square evaluated at , or as in terms of the normal CDF. P-values generated from normal statistics (without squaring them) were also considered, but these results are omitted for brevity, because the resulting ranking of the methods by power was found to be similar. Under , P-values were sampled from the uniform (0, 1) distribution, which is equivalent to setting to zero.

To study non-independent P-values, we simulated statistics from a mutivariate normal distribution and decorrelated them by eigendecomposition described in Section 2.5. In each simulation, a correlation matrix was generated randomly by perturbing an equicorrelated matrix . Specifically, we added perturbation to equicorrelated matrix with off-diagonal elements as:

 R=D+uuT, (S-12)

where is random vector. Then, was converted to a correlation matrix with off-diagonal elements . The amount of “jiggle” in depends on the variability of elements in . If elements of are generated in the range between and , the value of would represent the upper bound for the amount of jiggle allowed between pairwise correlations in . In our simulations, we set , allowing for a mix of positive and negative values of in .

The Type I error rate and power performance were computed based on two matrices of P-values, and , every row of which contained smallest sorted P-values out of tests across simulations ( P-values were discarded). stored simulated P-values under and under the alternative hypothesis, . Taking the product of P-values in each row, we obtain two vectors, , . RTP P-values were computed based on the empirical CDF (eCDF) of evaluated at values of . Power was calculated as the proportion of P-values that were smaller than the significance threshold, .

Finally, when various combined P-value methods are being compared, it is meaningful to gauge their performance against methods designed for multiple testing adjustments. This is especially relevant with methods that employ truncation due to their emphasis on small P-values. Therefore, we included the Simes method in our power comparisons because it can be viewed as a combined P-value method. The Simes method tests the overall without a reference to individual P-values: the is rejected at level if for at least one . Equivalently, the overall (or the “combined”) Simes P-value can be obtained as .