Quantifying replicability and consistency in systematic reviews

07/16/2019 ∙ by Iman Jaljuli, et al. ∙ Brown University

Systematic reviews of interventions are important tools for synthesizing evidence from multiple studies. They serve to increase power and improve precision, in the same way that larger studies can, but also to establish the consistency of effects and the replicability of results across studies that are not identical. In this work we suggest incorporating replicability analysis tools to quantify consistency and conflict. These are offered for both the fixed-effect and the random-effects meta-analyses. We motivate and demonstrate our approach and its implications with examples from systematic reviews in the Cochrane library, and offer a way to incorporate our suggestions in their standard reporting system.




1 Introduction

In systematic reviews, several studies that examine the same questions are analyzed together. Viewing all the available information is extremely valuable for practitioners in the health sciences. A notable example is the Cochrane systematic reviews on the effects of healthcare interventions (Higgins J and Green S, 2011).

Deriving conclusions about the overall health benefits or harms from an ensemble of studies can be difficult, since the studies are never exactly the same and there is a danger that these differences affect the inference. For example, factors that are particular to a study, such as its cohorts (drawn from specific populations exposed to specific environments), its experimental protocol, or its caregivers, may have an impact on the treatment effect.

There are many reasons to perform a meta-analysis, according to the Cochrane Handbook for Systematic Reviews of Interventions (§ 9.1.3, Deeks JJ, Higgins JPT, and Altman DG, 2017). The first two reasons are the obvious ones: (1) to increase power, since many individual studies are too small to detect small effects, but when combined there is a higher chance of detecting an effect; (2) to improve precision. The next two reasons address questions that cannot be answered by individual studies. We quote the third reason: "Primary studies often involve a specific type of patient and explicitly defined interventions. A selection of studies in which these characteristics differ can allow investigation of the consistency of effect and, if relevant, allow reasons for differences in effect estimates to be investigated." The fourth reason is: "To settle controversies arising from apparently conflicting studies or to generate new hypotheses. Statistical analysis of findings allows the degree of conflict to be assessed formally, and reasons for different results to be explored and quantified."

If there are only one or two studies with results that are in apparent conflict with the rest of the studies, they may be excluded from the meta-analysis if an obvious reason for the outlying result can be identified. However, in general, at least one characteristic can be found for any study in any meta-analysis which makes it different from the others (Higgins J and Green S, 2011), so exclusion of studies may introduce bias (Deeks JJ, Higgins JPT, and Altman DG, 2017). It is possible to carry out a sensitivity analysis in which the meta-analysis is repeated, each time omitting one of the studies (Anzures-Cabrera J and Higgins J, 2010). A plot of the results of these meta-analyses, called an "exclusion sensitivity plot" (Bax L, Yu L, Ikeda N, Tsuruta H, and Moons K, 2006), will reveal any studies that have a particularly large influence on the results of the meta-analysis. We concur with this view, but in this work we offer a single-number summary of such a sensitivity analysis, and recommend adding it to the report of the main results and to the forest plot of the meta-analysis.

Gail and Simon (1985) and Piantadosi and Gail (1993) suggested tests targeted towards identifying whether effects are inconsistent, i.e., whether the effect direction is positive in some studies but negative in others (termed "qualitative" or "crossover" interactions). We offer in addition a lower confidence bound on the number of studies with a significant effect in each direction, using the replicability analysis tools developed in Benjamini Y and Heller R (2008); Benjamini Y, Heller R, and Yekutieli D (2009); Heller R (2010).

Our starting point is that the researchers decided that multiple studies are sufficiently similar to answer a clinical question of interest by a meta-analysis. Consistency evaluation is important for proper assessment of the overall evidence towards a positive or a negative effect. We examined the extent of the lack of consistency in systematic reviews in the Cochrane library in § 6 and found it to be non-negligible.

In § 2 we review meta-analysis methods that are carried out routinely in various disciplines. For clarity of exposition we focus on the two common methods in the Cochrane library. In § 3 we explain how to carry out the additional replicability and consistency evaluations that we suggest adding to the meta-analyses. In § 4 we carry out simulations that examine the differences between meta-analysis and replicability analysis in various settings of interest. In § 5 we demonstrate how such an evaluation contributes to the assessment of the overall intervention effect in case studies from the Cochrane library. In § 7 we conclude with some final remarks.

2 Review of meta-analysis methods

Let $n$ be the number of studies available for meta-analysis, and let $\theta_i$ be the (unknown) treatment effect in study $i$, $i = 1, \ldots, n$. Let $\hat{\theta}_i$ and $\widehat{SE}_i$ be the estimated effect size and its standard error for study $i$. The test statistic for testing $H_{0i}: \theta_i = 0$ is $Z_i = \hat{\theta}_i / \widehat{SE}_i$. Although $\widehat{SE}_i$ is estimated from the data, for clarity of exposition we adopt the typical assumption in meta-analysis that the null distribution of $Z_i$ (possibly after logarithmic or other appropriate transformation) is (approximately) standard normal (Borenstein M, Hedges LV, Higgins JPT, and Rothstein HR, 2009; Deeks JJ, Higgins JPT, and Altman DG, 2017).

2.1 Random effects and fixed effect meta-analysis

The overwhelming majority of meta-analyses published over the past two decades are meta-analyses of effect sizes (Borenstein M, Hedges LV, Higgins JPT, and Rothstein HR, 2009). The primary aim of the meta-analysis of effect sizes is to estimate the overall treatment effect, whether efficacy or adverse response, from $\hat{\theta}_1, \ldots, \hat{\theta}_n$. The summary effect estimate is the weighted average

$$\hat{\theta} = \sum_{i=1}^{n} w_i \hat{\theta}_i,$$

where $w_i \geq 0$ and $\sum_{i=1}^{n} w_i = 1$. Let $\theta = E(\hat{\theta})$ denote the expectation of the summary effect estimate. It is of interest to test the meta-analysis null hypothesis $H_0: \theta = 0$, as well as provide a confidence interval (CI) for $\theta$. The choice of weights, and the distribution of $\hat{\theta}$, depend on whether it is assumed that the effects are heterogeneous.

Assuming no heterogeneity, i.e., each study is estimating exactly the same quantity $\theta$, a fixed effect (FE) meta-analysis is performed where $w_i \propto 1/\widehat{SE}_i^2$. The standard error of the summary effect is therefore

$$\widehat{SE}_{FE} = \left(\sum_{i=1}^{n} \widehat{SE}_i^{-2}\right)^{-1/2}.$$

The test statistic for $H_0$ is $Z_{FE} = \hat{\theta}/\widehat{SE}_{FE}$, and the pooled CI for $\theta$ is, typically, $\hat{\theta} \pm z_{1-\alpha/2}\,\widehat{SE}_{FE}$. If there is no heterogeneity, the pooled CI may be far narrower than the CIs of the individual studies; and if $H_0$ is false the test based on $Z_{FE}$ is powerful. However, if there is heterogeneity, $H_0$ is false, and the $\theta_i$'s have mixed signs, then the test based on $Z_{FE}$ can have low power because $\theta$ may be close to zero. Moreover, the CI for the summary effect is meaningless, since if the effect is decreasing in some studies but increasing in others, the researchers may be far more interested in investigating the sources of the sign inconsistencies than in assessing the magnitude of $\theta$, which averages positive and negative terms.
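The fixed effect computation described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the software used in the Cochrane reviews: the function name `fixed_effect` and the three-study input are hypothetical, and the normal approximation for the test statistic is the one assumed in the text.

```python
from math import sqrt
from statistics import NormalDist


def fixed_effect(estimates, std_errors, alpha=0.05):
    """Fixed effect summary: estimate, standard error, z statistic, p-value, CI."""
    inv_var = [1.0 / se ** 2 for se in std_errors]  # unnormalized weights 1 / SE_i^2
    total = sum(inv_var)
    theta_hat = sum(w * est for w, est in zip(inv_var, estimates)) / total
    se = 1.0 / sqrt(total)  # standard error of the summary effect
    z = theta_hat / se
    p_two_sided = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    half_width = NormalDist().inv_cdf(1.0 - alpha / 2.0) * se
    return theta_hat, se, z, p_two_sided, (theta_hat - half_width, theta_hat + half_width)


# Hypothetical inputs: effect estimates and standard errors from three studies.
theta_hat, se, z, p, ci = fixed_effect([0.5, 0.3, 0.4], [0.2, 0.25, 0.3])
print(f"summary={theta_hat:.3f} SE={se:.3f} z={z:.2f} p={p:.4f} CI=({ci[0]:.3f}, {ci[1]:.3f})")
```

Dividing the inverse-variance weights by their sum inside the function is equivalent to using normalized weights, so the sketch does not depend on which convention is adopted for the weights.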

A model that accounts for heterogeneity of effects is the random effects (RE) model. Some researchers argue that the fixed effect assumptions are implausible most of the time, and thus suggest always using the random effects model (Higgins J and Green S, 2011). Others choose the RE model over the FE model for inference on $\theta$ a priori, either based on clinical knowledge or based on a heterogeneity summary statistic. The Cochrane Handbook for Systematic Reviews of Interventions § 9.5.4 (Deeks JJ, Higgins JPT, and Altman DG, 2017) cautions against choosing an RE over an FE meta-analysis based on a statistical test of the homogeneity hypothesis $\theta_1 = \cdots = \theta_n$ using the data for the meta-analysis. It is relevant to note that the RE model was also recommended by Kafkafi N, Benjamini Y, Sakov A, Elmer GI, and Golani I (2005) as the tool for assessing replicability across laboratories in animal phenotyping experiments.

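In the RE model, $\theta_1, \ldots, \theta_n$ is an independent identically distributed sample from a distribution with (unknown) mean $\theta$ and variance $\tau^2$. In the Cochrane library, the random effects meta-analyses typically assume that the distribution of $\theta_i$ is Gaussian. The variance $\tau^2$ is estimated by the method of DerSimonian and Laird (Borenstein M, Hedges LV, Higgins JPT, and Rothstein HR, 2009) to be

$$\hat{\tau}^2 = \max\left\{0,\ \frac{Q - (n-1)}{\sum_{i} v_i - \sum_{i} v_i^2 / \sum_{i} v_i}\right\},$$

where $v_i = 1/\widehat{SE}_i^2$, $Q = \sum_{i} v_i (\hat{\theta}_i - \hat{\theta}_{FE})^2$, and $\hat{\theta}_{FE}$ is the fixed effect summary estimate. The weights are inversely proportional to the total variability of $\hat{\theta}_i$, $w_i \propto 1/(\widehat{SE}_i^2 + \hat{\tau}^2)$. The standard error of the summary effect is

$$\widehat{SE}_{RE} = \left(\sum_{i=1}^{n} \frac{1}{\widehat{SE}_i^2 + \hat{\tau}^2}\right)^{-1/2}.$$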

If the $\theta_i$'s come from a Gaussian distribution, then estimating $\theta$ by an appropriate CI is useful if $\theta$ represents an overall effect size. This may be the case if the heterogeneity of effect sizes is due to clinical variability (e.g., participants, details of interventions or of outcomes). However, this may not be the case if the heterogeneity in intervention effects is due to methodological variability which introduces bias. In general, it is not possible to distinguish whether heterogeneity results from clinical or methodological variability (Deeks JJ, Higgins JPT, and Altman DG, 2017). Even when the source of variability is clinical, it is difficult to establish the validity of any distributional assumption on the $\theta_i$'s, and this is a common criticism of random effects meta-analyses (Deeks JJ, Higgins JPT, and Altman DG, 2017).

The default method of CI estimation is still nowadays $\hat{\theta} \pm z_{1-\alpha/2}\,\widehat{SE}_{RE}$, where $\widehat{SE}_{RE}$ is the standard error of the RE summary effect, despite its low coverage. The performance of this method depends on the number of studies and the magnitude of $\tau^2$, with the poorest performance for few studies, compared to other available methods (Veroniki et al., 2019).
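As an illustration of the random effects computation, the following standard-library sketch implements the usual DerSimonian and Laird moment estimator of the between-study variance and the resulting summary effect. The function name `dersimonian_laird` is hypothetical, and the code assumes at least two studies with positive standard errors.

```python
from math import sqrt


def dersimonian_laird(estimates, std_errors):
    """Random effects summary using the DerSimonian-Laird moment estimator of tau^2."""
    n = len(estimates)
    v = [1.0 / se ** 2 for se in std_errors]  # inverse within-study variances
    sum_v = sum(v)
    theta_fe = sum(vi * est for vi, est in zip(v, estimates)) / sum_v
    q = sum(vi * (est - theta_fe) ** 2 for vi, est in zip(v, estimates))  # Cochran's Q
    c = sum_v - sum(vi ** 2 for vi in v) / sum_v
    tau2 = max(0.0, (q - (n - 1)) / c)  # estimate truncated at zero
    w = [1.0 / (se ** 2 + tau2) for se in std_errors]  # total-variance weights
    theta_re = sum(wi * est for wi, est in zip(w, estimates)) / sum(w)
    se_re = 1.0 / sqrt(sum(w))
    return theta_re, se_re, tau2


# Hypothetical inputs with visibly heterogeneous estimates.
print(dersimonian_laird([-1.0, 1.0, -1.0, 1.0], [0.1, 0.1, 0.1, 0.1]))
```

When the estimated between-study variance is zero the weights reduce to the fixed effect weights, so the two analyses coincide.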

The assumption that the $\theta_i$'s have a Gaussian distribution may be plausible if the heterogeneity of effect sizes is due to clinical variability (e.g., participants, details of interventions or of outcomes), rather than methodological variability which introduces bias. Either way, the $z$-test by DerSimonian and Laird is known not to control the type I error rate (IntHout J, Ioannidis J, and Borm G, 2014).

2.2 Combining independent studies with no assumptions

The random effects meta-analysis null, $H_0$, is that the mean effect across studies is zero. A more fundamental aim is to test the global null hypothesis that the effect size is zero in each and every study,

$$H_0^{G}: \theta_1 = \cdots = \theta_n = 0.$$

If $H_0^{G}$ is true, then $H_0$ is true as well. But if $H_0^{G}$ is false, $H_0$ may still be true. Testing $H_0^{G}$ is useful regardless of the source of heterogeneity (clinical or methodological).

Many tests are available for $H_0^{G}$ using the independent test statistics $Z_1, \ldots, Z_n$, or their corresponding $p$-values $p_1, \ldots, p_n$ (Loughin T, 2004; Futschik A, Taus T, and Zehetmayer S, 2018). The preferred test depends on the (unknown) alternative, and there is no single test that dominates all others. Let $f(p_1, \ldots, p_n)$ be a combining function for testing the global null. We consider Fisher's combining function, $f_{Fisher}(p_1, \ldots, p_n) = -2\sum_{i=1}^{n} \log p_i$. If $H_0^{G}$ is true, $f_{Fisher}$ has a chi-squared distribution with $2n$ degrees of freedom, so the global null $p$-value using Fisher's combining function is

$$P\left(\chi^2_{2n} \geq -2\sum_{i=1}^{n} \log p_i\right).$$
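A short sketch of Fisher's combining method follows. Because the reference distribution has an even number of degrees of freedom, its upper tail has a closed form and no external library is needed. The helper names `chi2_sf_even` and `fisher_pvalue` are illustrative, and the inputs are assumed to lie strictly between 0 and 1.

```python
from math import exp, log


def chi2_sf_even(x, df):
    """P(chi-squared_df >= x) for even df, using the closed-form Erlang tail."""
    if df <= 0 or df % 2:
        raise ValueError("df must be a positive even integer")
    if x <= 0:
        return 1.0
    half, term, total = x / 2.0, 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return exp(-half) * total


def fisher_pvalue(p_values):
    """Global null p-value from Fisher's combining function."""
    stat = -2.0 * sum(log(p) for p in p_values)
    return chi2_sf_even(stat, 2 * len(p_values))


# Hypothetical p-values from three independent studies.
print(fisher_pvalue([0.04, 0.10, 0.30]))
```

With a single study the combined $p$-value equals the input $p$-value, which is a quick sanity check on the tail formula.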


Fisher’s combining method is popular in various application fields (e.g., genomic research, education, social sciences) since it has been shown to have excellent power properties (Owen A, 2009). It is rarely used in systematic reviews of randomized clinical trials, where the focus is on effect sizes. However, we believe it can be useful for assessing the consistency of the direction of effect, as we will show in § 3. Before doing this, we shall consider two extensions of this combining method, which may be useful for application in systematic reviews.

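In the first, define $p_i^L$ as the $p$-value for the left-sided alternative that $\theta_i < 0$, and $p_i^R = 1 - p_i^L$ for the right-sided alternative. Pearson suggests combining the left-sided and the right-sided $p$-values separately by Fisher's combining function, giving $S^L = -2\sum_{i=1}^{n} \log p_i^L$ and $S^R = -2\sum_{i=1}^{n} \log p_i^R$, respectively.

Pearson's test statistic is the maximum of the two resulting statistics, $S = \max(S^L, S^R)$.

Pearson's global null $p$-value is, therefore,

$$\min\left\{1,\ 2\,P\left(\chi^2_{2n} \geq S\right)\right\}.$$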

This test has greater power than a test based on Fisher's combining method using two-sided $p$-values when the direction of the signal is consistent across studies, while not requiring us to know the common direction.
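The power comparison above can be explored with a small sketch. It takes left-sided $p$-values, forms the right-sided ones as their complements, and doubles the smaller of the two one-sided Fisher tail probabilities, capped at one. All function names and the example inputs are hypothetical; inputs are assumed to lie strictly between 0 and 1.

```python
from math import exp, log


def chi2_sf_even(x, df):
    """P(chi-squared_df >= x) for even df, using the closed-form Erlang tail."""
    if x <= 0:
        return 1.0
    half, term, total = x / 2.0, 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return exp(-half) * total


def fisher_statistic(p_values):
    return -2.0 * sum(log(p) for p in p_values)


def pearson_pvalue(left_p_values):
    """Pearson's global null p-value from left-sided p-values in (0, 1)."""
    n = len(left_p_values)
    right_p_values = [1.0 - p for p in left_p_values]
    stat = max(fisher_statistic(left_p_values), fisher_statistic(right_p_values))
    return min(1.0, 2.0 * chi2_sf_even(stat, 2 * n))


# Hypothetical left-sided p-values pointing in a common direction.
left = [0.04, 0.04, 0.04]
two_sided = [2.0 * min(p, 1.0 - p) for p in left]
print(pearson_pvalue(left), chi2_sf_even(fisher_statistic(two_sided), 2 * len(left)))
```

In this example the directional combination yields the smaller of the two printed $p$-values, in line with the claim that consistency of direction is rewarded.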

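The second extension is useful if the $\theta_i$'s are suspected to have mixed signs. A potentially more powerful test for $H_0^{G}$ may therefore be based on aggregation of the $p$-values that are at most a predefined threshold $t$ (Zaykin DV, Zhivotovsky LA, Westfall PH, and Weir BS, 2002), and the null distribution is adjusted accordingly. The directional test statistics are

$$S_t^L = -2\sum_{i=1}^{n} \log\left(p_i^L\right) \mathbb{1}\left(p_i^L \leq t\right), \qquad S_t^R = -2\sum_{i=1}^{n} \log\left(p_i^R\right) \mathbb{1}\left(p_i^R \leq t\right),$$

and the truncated-Pearson test statistic is $S_t = \max(S_t^L, S_t^R)$. Clearly, $S_1 = S$.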

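The null distribution has a simple form, and following Hsu J, Small DS, and Rosenbaum PR (2013) it is straightforward to see that the computation is as follows for $s > 0$:

$$P_{H_0^{G}}\left(S_t^L \geq s\right) = \sum_{k=1}^{n} \left[B_{n,t}(k) - B_{n,t}(k-1)\right]\left[1 - G_k\left(\tfrac{s}{2} + k \log t\right)\right],$$

where $G_k$ is the cumulative gamma distribution with scale parameter equal to one and shape parameter $k$ (so that $G_k(y) = 0$ for $y \leq 0$), and $B_{n,t}$ is the cumulative Binomial distribution with $n$ trials and probability of success $t$. The same expression holds for $S_t^R$. Denoting by $p_t^L$ and $p_t^R$ the tail probabilities at the observed values of $S_t^L$ and $S_t^R$, the truncated-Pearson's $p$-value is therefore $\min\{1, 2\min(p_t^L, p_t^R)\}$.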
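The null tail probability of a truncated statistic can be sketched directly from first principles: conditional on $k$ of the $n$ null $p$-values falling below the threshold, each is uniform on $(0, t)$, so the sum of their negative logarithms is a shifted gamma variable with integer shape $k$. The code below is an illustration under that derivation; the function names are hypothetical and the threshold `0.5` in the example is an arbitrary choice, not a value taken from the paper.

```python
from math import comb, exp, log


def gamma_sf_int(y, shape):
    """P(Gamma(shape, scale=1) >= y) for a positive integer shape."""
    if y <= 0:
        return 1.0
    term, total = 1.0, 1.0
    for j in range(1, shape):
        term *= y / j
        total += term
    return exp(-y) * total


def truncated_tail(stat, n, t):
    """P(S_t >= stat) under the global null, for one direction."""
    if stat <= 0:
        return 1.0
    prob = 0.0
    for k in range(1, n + 1):
        pmf = comb(n, k) * t ** k * (1.0 - t) ** (n - k)  # k p-values fall below t
        prob += pmf * gamma_sf_int(stat / 2.0 + k * log(t), k)
    return prob


def truncated_statistic(p_values, t):
    return -2.0 * sum(log(p) for p in p_values if p <= t)


def truncated_pearson_pvalue(left_p_values, t):
    """Two-sided truncated-Pearson p-value from left-sided p-values in (0, 1)."""
    n = len(left_p_values)
    right_p_values = [1.0 - p for p in left_p_values]
    tail_left = truncated_tail(truncated_statistic(left_p_values, t), n, t)
    tail_right = truncated_tail(truncated_statistic(right_p_values, t), n, t)
    return min(1.0, 2.0 * min(tail_left, tail_right))


# Hypothetical left-sided p-values; t = 0.5 is an illustrative threshold.
print(truncated_pearson_pvalue([0.001, 0.002, 0.9], 0.5))
```

Setting `t` to one recovers the untruncated chi-squared tail, which gives a convenient check of the implementation.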

If $H_0^{G}$ is false and the $\theta_i$'s have mixed signs, then testing $H_0$ is meaningless. However, testing whether the $\theta_i$'s have mixed signs is informative, and at level $\alpha$ the conclusion that the $\theta_i$'s have mixed signs follows if both directional truncated-Pearson $p$-values, $p_t^L$ and $p_t^R$, are at most $\alpha/2$. A major contribution of this paper is the evaluation of the extent of the "mixing"; see details in § 3.2. Clearly, it is worth reporting the evidence towards mixed signs in order to interpret appropriately the $p$-value of $H_0$ and the CI provided for $\theta$ in the fixed, and especially random, meta-analysis.

3 Replicability Analysis

In a meta-analysis, if the true and unknown treatment effect was present in the same direction in more than one study, we claim replicability. If in addition there were no treatment effects in the opposite direction, we claim consistency. We shall show how to evaluate the evidence towards replicability and consistency by testing the null hypothesis that the intervention effect is zero except possibly in $u - 1$ studies, for $u = 1, \ldots, n$, using Pearson's truncated extension to Fisher's combining function defined in § 3.1.2.

3.1 The r-value

We suggest testing the null hypothesis that at most $u - 1$ studies have an effect in the same direction (Benjamini Y and Heller R, 2008; Benjamini Y, Heller R, and Yekutieli D, 2009; Heller R, 2010), denoted by $H^{u/n}$. By rejecting $H^{u/n}$, we conclude that at least $u$ out of the $n$ studies in the meta-analysis have an effect in the same direction. Specifically, this is the conclusion for a two-sided alternative. Directional replicability can be similarly defined, to claim that at least $u$ out of the $n$ studies in the meta-analysis have a positive effect (right-sided alternative), or a negative effect (left-sided alternative). The $p$-value for the test quantifies the evidence that the treatment effect was replicated in at least $u$ studies, and we call this $p$-value the $u$-out-of-$n$ $r$-value. The minimal requirement for replicability is rejection of $H^{u/n}$ with $u = 2$. The null hypothesis $H^{2/n}$ is true if at most one study has an effect in the same direction, so rejecting this null enables us to conclude that the significant finding does not hinge on a single study. Henceforth, the $2$-out-of-$n$ $r$-value will be referred to simply as the $r$-value.

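Let $\Pi(u)$ denote the set of all possible subsets of size $n - u + 1$ from $\{1, \ldots, n\}$, so it has cardinality $\binom{n}{u-1}$. Let $\mathcal{I}$ denote the subset $\{i_1, \ldots, i_{n-u+1}\} \in \Pi(u)$. The procedure for computing the $u$ out of $n$ $r$-value is as follows.

  1. Select appropriate left- and right-sided tests for the null hypotheses that $\theta_i = 0$ for all $i \in \mathcal{I}$, $\mathcal{I} \in \Pi(u)$.

  2. Compute the left- and right-sided $p$-values from testing the null hypothesis that $\theta_i = 0$ for all $i \in \mathcal{I}$, for all $\mathcal{I} \in \Pi(u)$. Denote them by $p^L_{\mathcal{I}}$ and $p^R_{\mathcal{I}}$.

  3. For a left-sided alternative that there exist at least $u$ studies with a negative effect, the $r$-value is $r^L = \max_{\mathcal{I} \in \Pi(u)} p^L_{\mathcal{I}}$.

    For a right-sided alternative that there exist at least $u$ studies with a positive effect, the $r$-value is $r^R = \max_{\mathcal{I} \in \Pi(u)} p^R_{\mathcal{I}}$.

    For a two-sided alternative that there exist at least $u$ studies with a positive effect, or at least $u$ studies with a negative effect, the $r$-value is $r = \min\{1, 2\min(r^L, r^R)\}$.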

$u$-out-of-$n$ replicability at significance level $\alpha$ is claimed if $r \leq \alpha$. For $u = 2$ this means that the conclusion remains significant at level $\alpha$ using a meta-analysis of each of the $n$ subsets of $n - 1$ studies, see Benjamini Y and Heller R (2008) for a simple proof. Similarly, we claim at level $\alpha$ replicability of a decreasing effect if $r^L \leq \alpha$, and of an increasing effect if $r^R \leq \alpha$.
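Steps 1 to 3 can be illustrated by brute-force enumeration of the subsets. The sketch below plugs in Fisher's combining function as the test in Step 1; that choice, and the names `fisher_pvalue`, `r_value_one_sided` and `r_values`, are assumptions made for illustration, and any other valid global null test could be passed as `combine`.

```python
from itertools import combinations
from math import exp, log


def fisher_pvalue(p_values):
    """Fisher's combined p-value (closed-form chi-squared tail, 2 * len df)."""
    half = -sum(log(p) for p in p_values)
    term, total = 1.0, 1.0
    for k in range(1, len(p_values)):
        term *= half / k
        total += term
    return exp(-half) * total


def r_value_one_sided(p_values, u, combine=fisher_pvalue):
    """Largest combined p-value over all subsets of size n - u + 1."""
    n = len(p_values)
    subsets = combinations(p_values, n - u + 1)
    return max(combine(list(subset)) for subset in subsets)


def r_values(left_p_values, u):
    """Left-sided, right-sided and two-sided u-out-of-n r-values."""
    right_p_values = [1.0 - p for p in left_p_values]
    r_left = r_value_one_sided(left_p_values, u)
    r_right = r_value_one_sided(right_p_values, u)
    return r_left, r_right, min(1.0, 2.0 * min(r_left, r_right))


# Hypothetical left-sided p-values from four studies.
print(r_values([0.001, 0.01, 0.2, 0.6], u=2))
```

Enumeration is exponential in the number of studies in general, so it is only practical for the small meta-analyses typical of systematic reviews, or as a reference against which faster computations can be checked.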

3.1.1 The r-value for a fixed effect meta-analysis

The report of an $r$-value is valuable for the fixed effect model, with the underlying assumption that the effects in the studies are equal, $\theta_1 = \cdots = \theta_n$. The $r$-value quantifies the evidence against the null that at most one study has an effect. Using the $r$-value, it is possible to provide a lower confidence bound on the number of studies with effect.

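The appropriate test to be selected in Step 1 is the fixed effect meta-analysis test statistic if it is believed that $\theta_i = \theta$ except possibly in $u - 1$ studies. For $u = 2$, the computation of the $r$-value for $H^{2/n}$ is as follows. Let the fixed effect meta-analysis summary effect estimate and its standard error, excluding study $i$, be

$$\hat{\theta}_{(-i)} = \frac{\sum_{j \neq i} \hat{\theta}_j / \widehat{SE}_j^2}{\sum_{j \neq i} 1/\widehat{SE}_j^2}, \qquad \widehat{SE}_{(-i)} = \left(\sum_{j \neq i} \widehat{SE}_j^{-2}\right)^{-1/2},$$

and let $Z_{(-i)} = \hat{\theta}_{(-i)} / \widehat{SE}_{(-i)}$ be the test statistic for the null hypothesis that $\theta_j = 0$ for all $j \neq i$.

The $p$-values in Step 2 above are: $p^L_{(-i)} = \Phi(Z_{(-i)})$ and $p^R_{(-i)} = 1 - \Phi(Z_{(-i)})$, $i = 1, \ldots, n$, where $\Phi$ is the standard normal cumulative distribution function.

The $r$-values in Step 3 above are: $r^L = \max_i p^L_{(-i)}$, $r^R = \max_i p^R_{(-i)}$, and $r = \min\{1, 2\min(r^L, r^R)\}$.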

Intuitively, the $r$-value should be larger than the meta-analysis $p$-value since a stronger scientific claim is made by rejecting $H^{2/n}$ than by rejecting $H_0$. We formalize this in the following proposition.

Proposition 3.1

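Let $Z_{FE}$ be the fixed-effects meta-analysis test statistic, with meta-analysis $p$-value $p_{FE} = 2\min(p^L_{FE}, p^R_{FE})$, where $p^L_{FE} = \Phi(Z_{FE})$ and $p^R_{FE} = 1 - \Phi(Z_{FE})$. Let $r^L$, $r^R$, and $r$ be the $2$ out of $n$ $r$-values for $H^{2/n}$, as defined in this section. Then $r \geq p_{FE}$, and moreover, if $Z_{FE} < 0$, $r^L \geq p^L_{FE}$, and if $Z_{FE} > 0$, $r^R \geq p^R_{FE}$.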

See Appendix for the proof.
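A leave-one-out computation for the fixed effect model can be sketched as follows, together with a toy example in which one study drives the meta-analysis. The function names and the three-study data are hypothetical; the sketch uses the normal approximation for the test statistics and plain Python lists as input.

```python
from math import sqrt
from statistics import NormalDist


def fe_z(estimates, std_errors):
    """Fixed effect z statistic: summary estimate divided by its standard error."""
    inv_var = [1.0 / se ** 2 for se in std_errors]
    return sum(w * est for w, est in zip(inv_var, estimates)) / sqrt(sum(inv_var))


def fe_r_values(estimates, std_errors):
    """2-out-of-n r-values based on the n leave-one-out fixed effect tests."""
    n = len(estimates)
    phi = NormalDist().cdf
    z_loo = [
        fe_z(estimates[:i] + estimates[i + 1:], std_errors[:i] + std_errors[i + 1:])
        for i in range(n)
    ]
    r_left = max(phi(z) for z in z_loo)         # evidence for a replicated decrease
    r_right = max(1.0 - phi(z) for z in z_loo)  # evidence for a replicated increase
    return r_left, r_right, min(1.0, 2.0 * min(r_left, r_right))


# Hypothetical data: one study with a large effect, two studies with none.
est, se = [2.0, 0.0, 0.0], [0.5, 0.5, 0.5]
p_fe = 2.0 * (1.0 - NormalDist().cdf(abs(fe_z(est, se))))
print(p_fe, fe_r_values(est, se))
```

In this example the meta-analysis $p$-value falls below 0.05 while the two-sided $r$-value is far from significant, which is the situation the $r$-value is designed to flag.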

3.1.2 The r-value when combining studies with no assumptions

The only difference in the computation of the $r$-value from § 3.1.1 is the choice of test statistic in Step 1, which is no longer targeted towards the alternative that the non-null intervention effects have a common sign. When the $\theta_i$'s are expected to be of different magnitudes or signs, the test statistic for determining the lower bounds that is used in the fixed effect model can be improved. Instead, we suggest using the Pearson or truncated-Pearson combining function in expression (4).

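For $u = 2$, the computation of the $r$-value for $H^{2/n}$ is as follows. Let the test statistic for the null hypothesis that $\theta_j = 0$ for all $j \neq i$ be $S_{t,(-i)} = \max\{S^L_{t,(-i)}, S^R_{t,(-i)}\}$, where $S^L_{t,(-i)}$ and $S^R_{t,(-i)}$ are the expressions in (4) after excluding the $i$th study.

The truncation value $t$ should be predefined, and is set to in this work. The $p$-values in Step 2 above, for $i = 1, \ldots, n$, are as in equation (4) when replacing $n$ with $n - 1$.

The $r$-values in Step 3 above are as detailed in § 3.1.1.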

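Since $S^L_t$ and $S^R_t$ are monotone in the $p$-values, $r^L$ and $r^R$ can be computed efficiently by sorting the left-sided $p$-values so that $p^L_{(1)} \leq \cdots \leq p^L_{(n)}$. Then $r^L$ is the combined $p$-value based on the subset of the $n - u + 1$ largest left-sided $p$-values, $p^L_{(u)}, \ldots, p^L_{(n)}$, and similarly, $r^R$ is the combined $p$-value based on the $n - u + 1$ largest right-sided $p$-values.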
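The sorting shortcut can be checked against exhaustive enumeration on a small example. For brevity the sketch uses Fisher's combining function, i.e., no truncation, which is a simplifying assumption; the function names are hypothetical.

```python
from itertools import combinations
from math import exp, log


def fisher_pvalue(p_values):
    """Fisher's combined p-value (closed-form chi-squared tail, 2 * len df)."""
    half = -sum(log(p) for p in p_values)
    term, total = 1.0, 1.0
    for k in range(1, len(p_values)):
        term *= half / k
        total += term
    return exp(-half) * total


def r_value_sorted(p_values, u):
    """Combine only the n - u + 1 largest p-values (one sort, one test)."""
    return fisher_pvalue(sorted(p_values)[u - 1:])


def r_value_brute_force(p_values, u):
    """Reference implementation: maximize over every subset of size n - u + 1."""
    size = len(p_values) - u + 1
    return max(fisher_pvalue(list(s)) for s in combinations(p_values, size))


# Hypothetical one-sided p-values from five studies.
p = [0.30, 0.002, 0.04, 0.75, 0.01]
for u in range(1, len(p) + 1):
    print(u, r_value_sorted(p, u), r_value_brute_force(p, u))
```

The two columns agree because the combined $p$-value is non-decreasing in each input $p$-value, so the maximizing subset is always the one with the largest inputs.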

3.2 Sign consistency analysis

Whether the $\theta_i$'s have mixed signs can be assessed by testing $H^{u/n}$ against both the right- and the left-sided alternatives.

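Let $u^L_{\max}$ be the maximal value of $u$ for which $H^{u/n}$ was rejected against the left-sided alternative at significance level $\alpha/2$,

$$u^L_{\max} = \max\{u : r^L(u) \leq \alpha/2\},$$

where $r^L(u)$ is the left-sided $u$ out of $n$ $r$-value computed in Steps 1–3 above, and $u^L_{\max} = 0$ if no such $u$ exists. Then we can conclude with $1 - \alpha/2$ confidence that there are at least $u^L_{\max}$ studies with a negative intervention effect (Heller R, 2010).

Similarly, compute

$$u^R_{\max} = \max\{u : r^R(u) \leq \alpha/2\}.$$

Combined, with $1 - \alpha$ confidence, there are at least $u^L_{\max}$ studies with a negative intervention effect and $u^R_{\max}$ studies with a positive intervention effect.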

Meta-analysis evidence is said to be consistent in effect direction if (1) $u^L_{\max} > 0$ and $u^R_{\max} = 0$, or (2) $u^L_{\max} = 0$ and $u^R_{\max} > 0$. The larger the non-zero value is, the greater the consistency evidence. The evidence is inconsistent if $\min(u^L_{\max}, u^R_{\max}) > 0$. There is not enough evidence to assess the consistency of the meta-analysis otherwise.
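The two lower bounds and the resulting verdict can be sketched as below. Several choices here are assumptions made for illustration: Fisher's combining function stands in for the truncated-Pearson test, each direction is tested at half the overall level, and a direction counts as supported as soon as its bound is positive. The function names are hypothetical.

```python
from math import exp, log


def fisher_pvalue(p_values):
    """Fisher's combined p-value (closed-form chi-squared tail, 2 * len df)."""
    half = -sum(log(p) for p in p_values)
    term, total = 1.0, 1.0
    for k in range(1, len(p_values)):
        term *= half / k
        total += term
    return exp(-half) * total


def r_value_one_sided(p_values, u):
    """u-out-of-n r-value: combine the n - u + 1 largest one-sided p-values."""
    return fisher_pvalue(sorted(p_values)[u - 1:])


def max_rejected_u(p_values, level):
    """Largest u whose r-value is at most `level`; zero if none is rejected."""
    best = 0
    for u in range(1, len(p_values) + 1):
        if r_value_one_sided(p_values, u) <= level:
            best = u
    return best


def sign_consistency(left_p_values, alpha=0.05):
    """Lower bounds on studies with decreased / increased effect, plus a verdict."""
    right_p_values = [1.0 - p for p in left_p_values]
    u_left = max_rejected_u(left_p_values, alpha / 2.0)
    u_right = max_rejected_u(right_p_values, alpha / 2.0)
    if u_left > 0 and u_right > 0:
        verdict = "inconsistent"
    elif u_left > 0 or u_right > 0:
        verdict = "consistent"
    else:
        verdict = "not enough evidence"
    return u_left, u_right, verdict


# Hypothetical left-sided p-values: two clear decreases, two uninformative studies.
print(sign_consistency([0.001, 0.001, 0.5, 0.5]))
```

Reporting the pair of bounds rather than only the verdict keeps the strength of the evidence visible, since a larger non-zero bound indicates that more studies support the direction.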

We shall show how the consistency evidence complements the random or fixed effect meta-analysis findings in examples in § 5. We view this evaluation as particularly useful for accompanying the RE meta-analysis, in which it is assumed that the effects are unequal, but sampled from a Gaussian distribution.

If the $\theta_i$'s are sampled from a Gaussian distribution, then protection from the danger that the significant conclusion may rely on a single study is almost automatically guaranteed, see § 4 for simulation results. However, if the Gaussian assumption on the $\theta_i$'s is false (and why should it be Gaussian?), we can still provide confidence bounds on the number of studies with increasing effect and on the number of studies with decreasing effect, so the consistency of the sign can be assessed. If the lower bound in one direction is zero, and in the other direction is greater than one, we conclude there is sign consistency. If on the other hand the lower bound is greater than zero in both directions, we have sign inconsistency. This does not mean that the random effects model is incorrect, but it does urge the researcher to examine why some studies deem the intervention effective and others harmful.

4 Simulations

Underlying the meta-analysis models is the assumption that all studies in a review share the same signal $\theta$, and the goal is to identify whether this signal is bigger or smaller than 0. The approach that we put forward is that when the underlying assumption is compromised by having only one out of $n$ studies in one direction away from 0, we should look suspiciously at such a finding. Even more so if a single other study has a signal in the opposite direction. However, if the conclusion reflects two or more studies with real signal in the same direction, we should be able to discover it. We therefore carry out a simulation study where we compare and contrast the rejection probability of meta-analysis and replicability analysis.

We generate two groups of sizes 25 and standard deviation 1, so that

$$\hat{\theta}_i \sim N(\theta_i, SE^2),$$

with $SE^2 = 2/25$. We consider a combination of studies, and various configurations of interest for the effect sizes $\theta_1, \ldots, \theta_n$. Similar results were obtained in additional simulations with studies (not shown). We used iterations for each simulation setting.

Random Effects Analysis

We consider several studies with heterogeneous but fixed signal. Although this is not the setting assumed by the RE model, the typical analysis that will be carried out (arguably) is the RE meta-analysis because of the heterogeneity in effect sizes. Therefore, we compare the rejection probability of the RE meta-analysis test with that of the replicability analysis test detailed in § 3.2 with and .

Figure 1 shows in the top left panel that the meta-analysis test has a rejection rate of at most 0.07 when a single study has signal and that the replicability analysis test is well below the nominal level. In the bottom row, the advantage of complementing the meta-analysis with a replicability analysis is manifest: when at least two studies have signal in the same direction, the replicability analysis test has power increasing to one to detect the replicated signals. On the other hand, the RE meta-analysis test has little (left panel) and zero (right panel) power to detect the replicated signals.

Figure 1: Rejection rate as a function of the effect size, analyzed by RE meta-analysis (dashed), and replicability analysis (solid). The effect estimates are sampled from $N(\theta_i, SE^2)$, where $\theta_1, \ldots, \theta_n$ are indicated at the top of each panel. The consistent settings are in the left column, and the inconsistent settings are in the right column. The replicability null hypothesis, $H^{2/n}$, is true in the top row graphs and false otherwise. The horizontal solid line is the 0.05 significance level of the test.

Fixed Effects Analysis

In order to quantify the behavior of the type I error probability and power in FE analysis, we carry out simulations assuming that if the effect is not null, it is fixed at a value $\theta$, so $\theta_i \in \{0, \theta\}$. We examined the rejection rate with fixed effect meta-analysis as well as with replicability analysis, when the number of studies with signal varied from zero to eight.

Figure 2 shows that the significance of the fixed effect meta-analysis can, yet the significance of the replicability analysis cannot, be driven by a single study: when a single study is non-null, the rejection rate for replicability analysis is at the nominal 0.05 level, but the rejection rate for the fixed effect meta-analysis is far above it. As expected, the greater the number of studies with non-zero signal, the greater the rejection rate, and this rate is greater for meta-analysis than for replicability analysis.

Figure 2: Rejection rate versus the number of nonnull studies with effect size $\theta$, with the fixed effect meta-analysis test (dashed) or with the replicability analysis test for the fixed effect model (solid). The curves with circles, triangles, and squares have $\theta$ value of one, two, and three, respectively. The fixed effect meta-analysis null hypothesis is true when the number of nonnull studies is zero, and false otherwise. The replicability null hypothesis, $H^{2/n}$, is true when the number of nonnull studies is zero or one, and false otherwise. The horizontal solid line is the 0.05 significance level of the test.

Replicability analysis to complement RE meta-analysis

We consider several studies with heterogeneous signal generated independently from a Gaussian distribution with mean $\theta$ and variance $\tau^2$. This is the setting assumed by the RE model, so the ideal analysis is (arguably) the RE meta-analysis. We compare the rejection probability of the RE meta-analysis test with that of the replicability analysis test detailed in § 3.2 with and .

Figure 3 shows that the rejection rate is much higher for replicability analysis than for meta-analysis. In the RE model, the effect sizes are non-zero with probability one. Therefore, the replicability null hypothesis is never true, and the rejection rate increases with $|\theta|$. The meta-analysis null hypothesis is false except when $\theta = 0$. Interestingly, at $\theta = 0$, the rejection rate for the RE meta-analysis test is 0.093 instead of the nominal 0.05 level, an inflation that is due primarily to the use of the $z$-test instead of the $t$-test (IntHout J, Ioannidis J, and Borm G, 2014).

Figure 3: Rejection rate versus $\theta$, for the left-sided (dotted), right-sided (dashed), and two-sided (solid) meta-analysis test (red) and replicability analysis test (black). The horizontal solid line is the 0.05 two-sided significance level of the test. The one-sided tests are carried out at level 0.025.

The fact that the rejection rate is much higher for replicability analysis than for meta-analysis suggests that for many forest plots generated from the RE model, it may be possible to provide informative lower bounds on the number of studies with effect in each direction even when the RE meta-analysis $p$-value is non-significant. The top panel of Figure 4 illustrates such a dataset. Although the meta-analysis test is non-significant, the replicability analysis reports with 95% confidence that at least three studies have a decreased effect and at least four studies have an increased effect. In the bottom panel, both the meta-analysis test and the replicability analysis test are significant, and the replicability analysis complements the meta-analysis by reporting with 95% confidence that at least five studies have an increased effect, and no studies have a decreased effect.

Figure 4: A simulated meta-analysis with effect sizes sampled from (top) and from (bottom). The estimated effect sizes are sampled from .

5 Case studies from the Cochrane library

We provide examples of meta-analyses in the breast cancer domain for which we can, and cannot, claim replicability. For each example, we compute the $r$-value with $u = 2$, i.e., the smallest significance level at which we claim replicability. In addition, we evaluate the extent of the evidence for consistency by providing 95% confidence lower bounds for the number of studies with decreased effect and the number of studies with increased effect in the meta-analysis. We provide recommendations on how to incorporate these new analyses in the Cochrane reviews' abstract and forest plots.

5.1 Assessing replicability and consistency in fixed effect meta-analyses

We present two examples that were analyzed as a fixed effect meta-analysis by the authors of the systematic reviews. In both examples the meta-analysis null hypothesis of zero average effect was rejected at the 0.05 significance level (which according to Proposition 3.1 is a necessary requirement for an $r$-value of at most 0.05). However, the examples differ drastically in their evidence towards replicability. The replicability analysis supports the study conclusion in the first example but weakens it in the second example.

The first example is based on a meta-analysis in review CD002943 (Figure 5). The primary objective of this review was to assess the effectiveness of different strategies for increasing the participation rate of women invited to community breast cancer screening activities or mammography programs. In this meta-analysis, the effect of sending invitation letters was examined in five studies. Only two studies resulted in a significant (at the 0.05 level) positive effect. However, this does not mean that the effect is absent in the other three studies. Indeed, in our complementary analysis, we estimate that at least 3 studies have a positive effect. The authors write: "The odds ratio in relation to the outcome, 'attendance in response to the mammogram invitation during the 12 months after the invitation', was 1.66 (95% CI 1.43 to 1.92)". To this, we suggest adding: "The evidence towards an increased effect was replicable, with an . Moreover, with 95% confidence, we can conclude that at least three studies had an increased effect."

Figure 5: In review CD002943, the effect of mammogram invitation on attendance during the following 12 months. The evidence towards replicability is strong: the 2 out of 5 ; the 95% lower bound on the number of studies with increased effect is 3.

The second example is based on a meta-analysis in review CD007077 (Figure 6). The primary objective of this review was to assess the effectiveness of partial breast irradiation (PBI) or accelerated partial breast irradiation (APBI), i.e., the delivery of radiation to a limited volume of the breast around the tumor bed, sometimes with a shortened treatment duration. The main objective of this review is to determine whether PBI/APBI is equivalent to or better than conventional or hypo-fractionated whole breast radiotherapy (WBRT) after breast-conservation therapy for early-stage breast cancer. The primary outcome was Cosmesis. The lack of replicability in this meta-analysis is not surprising, as the two largest studies report conflicting significant effects. However, since the summary effect size is significantly positive, the authors write that "Cosmesis (physician-reported) appeared worse with PBI/APBI (odds ratio (OR) 1.51, 95% CI 1.17 to 1.95, five studies, 1720 participants, low-quality evidence)". To this, we suggest adding: "We cannot rule out the possibility that this result is critically based on a single study ( )."

Figure 6: In review CD007077, the effect of PBI/APBI versus WBRT on Cosmesis. There is no evidence towards replicability, $r$-value 1.

5.2 Assessing replicability and consistency in random effects meta-analyses

The r-values were computed using the test statistic based on the truncated Pearson combining method, as described in § 3.1.2, with and . These values may be useful regardless of whether the RE meta-analysis p-value is significant, as we shall demonstrate in two examples that were analyzed by the Cochrane authors using the RE model.

The first example is based on a meta-analysis in review CD006823 (Figure 7), where the meta-analysis finding was statistically significant. The authors examine the effects of wound drainage after axillary dissection for breast carcinoma on the incidence of post-operative Seroma formation. We establish replicability with an r-value of 0.0002, as well as consistency: with 95% confidence there are at least two studies with decreased effect and zero studies with increased effect. The authors write: "The OR for Seroma formation was 0.46 (95% CI 0.23 to 0.91, P = 0.03) in favor of a reduced incidence of Seroma in participants with drains inserted." To this, we suggest adding: "The evidence towards a decreased effect is consistent: there were at least two studies with decreased effect and no study with increased effect (with 95% confidence)."

Figure 7: In review CD006823, the effects of wound drainage on Seroma formation. The evidence is consistent: the 2 out of 7 ; there is a decreased effect in at least 3 out of 7 studies, and no study with increased effect, with 95% confidence.

The second example is based on a meta-analysis in review CD003366 (Figure 8). The authors compare taxane-containing chemotherapies, single agent taxane vs. Regimen C, on the overall effect in Leukopaenia. Pooling 13 studies, the random effects meta-analysis fails to declare any significant difference between regimens, because individual studies report highly significant yet contradictory results. The authors write: "Overall, there was no difference in the risk of Leukopaenia (RR 1.07; 95% CI 0.97 to 1.17; P = 0.16; participants = 6564; Analysis 5.2) with significant heterogeneity across the studies (I² = 90%; P < 0.00001)". We suggest adding: "There is inconsistent evidence for the direction of effect: an increased effect in at least three studies, and a decreased effect in at least two studies (with 95% confidence)."

Figure 8: In review CD003366, the effect of single agent taxane vs. Regimen C on Leukopaenia. The evidence towards both an increased and a decreased effect is strong: the 2 out of 7 r-value is ; the 95% lower bound on the number of studies with increased and decreased effect is 3 and 2, respectively.
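The directional lower bounds reported in these examples can be illustrated with a small sketch. It is not the procedure of § 3: as a stand-in it uses the Bonferroni partial conjunction p-value of Benjamini and Heller (2008), (n - u + 1) times the u-th smallest p-value, applied to one-sided p-values. Testing each direction at half the overall level is an assumption, the z-scores are invented for illustration, and all function names are hypothetical.

```python
from statistics import NormalDist


def one_sided_p_values(z_scores):
    """Return (right-sided, left-sided) p-values for each study z-score."""
    norm = NormalDist()
    right = [1.0 - norm.cdf(z) for z in z_scores]  # evidence for an increased effect
    left = [norm.cdf(z) for z in z_scores]         # evidence for a decreased effect
    return right, left


def bonferroni_lower_bound(p_values, alpha):
    """Largest u such that every 'at least u' claim up to u is rejected at level alpha.

    Uses the Bonferroni partial conjunction p-value (n - u + 1) * p_(u),
    which is an assumption of this sketch, not the combining method of the paper.
    """
    n = len(p_values)
    ordered = sorted(p_values)
    bound = 0
    for u in range(1, n + 1):
        pc_p = min(1.0, (n - u + 1) * ordered[u - 1])
        if pc_p > alpha:
            break  # stopping at the first failure keeps the bound monotone
        bound = u
    return bound


def direction_bounds(z_scores, alpha=0.05):
    """Lower bounds on the number of studies with increased and decreased effect."""
    right, left = one_sided_p_values(z_scores)
    return (bonferroni_lower_bound(right, alpha / 2),
            bonferroni_lower_bound(left, alpha / 2))


# Hypothetical z-scores: three strongly positive, two strongly negative studies.
example_z = [4.5, 4.0, 3.9, 0.3, -0.2, -3.8, -4.2]
print(direction_bounds(example_z))
```

With these invented z-scores both bounds are positive, which is the pattern described as inconsistent evidence for the direction of effect.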

6 The extent of the replicability problem in Cochrane systematic reviews

We took all the updated Cochrane Collaboration systematic reviews in the breast cancer domain. Our eligibility criteria were as follows: (a) the review included forest plots; (b) at least one fixed-effect primary outcome was reported as significant at the 0.05 level, which is the default significance level used in Cochrane Reviews; (c) the meta-analysis of at least one of the primary outcomes was based on at least three studies; (d) there was no reporting in the review of unreliable/biased primary outcomes or poor quality of available evidence; and (e) the data were available for download. We consider as primary outcomes the outcomes that were defined as primary by the review authors. If none were defined, we selected the most important findings from the review summaries and treated the outcomes for these findings as primary. In the breast cancer domain, 62 updated (up to February 2018) reviews were published by the Cochrane Breast Cancer Group in the Cochrane Library, out of which we analyzed the 23 reviews that met our eligibility criteria (16, 12, 5, 2 and 4 reviews were excluded for reasons a, b, c, d and e, respectively). Out of the 23 eligible reviews, 13 had at least one primary outcome with a meta-analysis p-value of at most 0.05 but an r-value greater than 0.05.

Figure 9: r-values versus p-values, after the quartic root (power of 1/4) transformation, for each of the 219 primary outcomes analyzed with the fixed-effect model (red circles) or the random effects model (blue triangles). The axes show the matching values on the original scale. Color darkness increases with the number of overlapping results. The solid line is the diagonal, where the r-value equals the p-value.
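The quartic root transformation used for the axes of Figure 9 can be sketched as below. The snippet only shows how values are mapped and how tick labels can stay on the original scale; the function names and the choice of tick values are illustrative assumptions, not taken from the authors' plotting code.

```python
def quartic_root(value):
    """Map a p-value or r-value in [0, 1] to its fourth root."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("expected a value between 0 and 1")
    return value ** 0.25


def axis_ticks(original_values=(0.0001, 0.001, 0.01, 0.05, 0.25, 1.0)):
    """Return (position, label) pairs: positions are transformed, labels are not."""
    return [(quartic_root(v), str(v)) for v in original_values]


for position, label in axis_ticks():
    print(f"tick at {position:.3f} labelled {label}")
```

The transformation stretches the region near zero, so that small p-values and r-values, which would otherwise pile up in one corner, can be told apart.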

We analyzed a total of 247 primary outcomes contributed by the eligible systematic reviews, of which 106 were fixed-effect meta-analyses, as reported by the authors. Out of the 70 outcomes with a statistically significant fixed-effect p-value, 15 were sensitive to omitting one study (i.e., had an r-value greater than 0.05); see Figure 9.
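Sensitivity to omitting one study can be probed with a leave-one-out loop over an inverse-variance fixed-effect meta-analysis. The sketch below is a simplified illustration under stated assumptions: it uses two-sided p-values and flags an analysis when the largest leave-one-out p-value exceeds the level, which is not the exact r-value defined in § 3. The effect estimates (for example, log odds ratios) and standard errors are invented, and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def fixed_effect_z(estimates, std_errors):
    """Inverse-variance fixed-effect z-score for the pooled effect."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    pooled_se = sqrt(1.0 / sum(weights))
    return pooled / pooled_se


def two_sided_p(z):
    """Two-sided normal p-value for a z-score."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def leave_one_out_p_values(estimates, std_errors):
    """Meta-analysis p-value after omitting each study in turn."""
    estimates, std_errors = list(estimates), list(std_errors)
    p_values = []
    for i in range(len(estimates)):
        rest_est = estimates[:i] + estimates[i + 1:]
        rest_se = std_errors[:i] + std_errors[i + 1:]
        p_values.append(two_sided_p(fixed_effect_z(rest_est, rest_se)))
    return p_values


def is_sensitive_to_one_study(estimates, std_errors, alpha=0.05):
    """True if the pooled result is significant but some leave-one-out result is not."""
    full_p = two_sided_p(fixed_effect_z(estimates, std_errors))
    worst_p = max(leave_one_out_p_values(estimates, std_errors))
    return full_p <= alpha and worst_p > alpha


# Hypothetical data: one large positive study dominates three near-null studies.
print(is_sensitive_to_one_study([0.9, 0.05, 0.02, -0.03], [0.15, 0.2, 0.2, 0.2]))
```

In the invented data the pooled effect is significant only because of the first study, so the check flags the analysis.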

The remaining 141 outcomes were analyzed by the authors using a random effects model. For this model, the r-value may be smaller or larger than the meta-analysis p-value. Table 1 summarizes the number of consistent and inconsistent meta-analyses for the 141 outcomes. As expected, among the non-significant outcomes there is less consistency than among the significant outcomes. Nine inconsistent outcomes were detected, warranting further research into why the effect is increasing in some experiments yet decreasing in others. Among the 62 significant outcomes, consistency of effect direction could be established for 46 outcomes.

                       Non-significant p-value   Significant p-value
Consistent                        8                       46
Inconsistent                      9                        1
Not enough evidence              62                       15
Table 1: The consistency distribution for the 141 random effects meta-analyses, for significant outcomes (column 2) and non-significant outcomes (column 1). The significance was assessed by the default -test for the random effects model, at the 0.05 level.
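As a quick check on Table 1, the column totals and the share of each category can be recomputed from the six counts. The dictionary layout and function names below are illustrative assumptions; the counts are copied from the table.

```python
# Counts from Table 1 (random effects meta-analyses).
table_1 = {
    "Consistent": {"non_significant": 8, "significant": 46},
    "Inconsistent": {"non_significant": 9, "significant": 1},
    "Not enough evidence": {"non_significant": 62, "significant": 15},
}


def column_total(table, column):
    """Number of outcomes in one column of the table."""
    return sum(row[column] for row in table.values())


def column_shares(table, column):
    """Fraction of the column's outcomes falling in each row."""
    total = column_total(table, column)
    return {label: row[column] / total for label, row in table.items()}


for column in ("non_significant", "significant"):
    total = column_total(table_1, column)
    shares = column_shares(table_1, column)
    summary = ", ".join(f"{label}: {share:.0%}" for label, share in shares.items())
    print(f"{column} (n = {total}): {summary}")
```

The two column totals add up to the 141 random effects outcomes described in the text.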

7 Discussion

In § 6 we report that 101 out of 132 significant meta-analysis outcomes were found to have a significant r-value, meaning that the replicability claim was not met in about 23% (31 of 132) of outcomes that report a statistically significant effect size. This high percentage suggests that many systematic reviews may critically depend on the result of a single study, and therefore can benefit from complementing the meta-analysis with a report of the r-value and lower bounds on the number of studies with increasing and decreasing effect. It may seem that if the number of studies is large, the meta-analysis cannot be driven by one outlying study. However, we found three fairly large fixed-effect analyses, with 11, 13 and 13 studies, for which the meta-analysis p-value was significant but the r-value was not. Reporting the r-value can be valuable whether the number of studies is small or large: a significant r-value when pooling a small number of studies reflects strong evidence towards replicability of effects; a non-significant r-value for numerous studies guards against unfounded conclusions.
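The figure of 101 out of 132 can be reconciled with the counts given in § 6 by simple arithmetic. The sketch below assumes that the 46 significant random effects outcomes with established consistency are the ones counted as having a significant r-value; that reading is an inference from the totals, not something the text states explicitly. The function name is hypothetical.

```python
def replicability_share(sig_fixed, fixed_sensitive, sig_random, random_replicable):
    """Return (replicable outcomes, significant outcomes, share not replicable)."""
    replicable = (sig_fixed - fixed_sensitive) + random_replicable
    total = sig_fixed + sig_random
    return replicable, total, (total - replicable) / total


# 70 significant fixed-effect outcomes, 15 of them sensitive to one study;
# 62 significant random effects outcomes, 46 assumed to have a significant r-value.
replicable, total, share_not_met = replicability_share(70, 15, 62, 46)
print(f"{replicable} of {total} replicable; claim not met in {share_not_met:.1%}")
```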

High heterogeneity may lead to studies having opposite signs for estimated effect sizes. In such cases, random effects meta-analysis will mostly result in a non-significant p-value. Our complementary replicability analysis gives insight into the consistency of the effects. For example, Table 1 shows that among the 79 meta-analyses with a non-significant RE meta-analysis p-value, we can establish consistency in 8 (about 10%) and inconsistency in 9 (about 11%).

While we motivated and demonstrated our approach and its implications by examples from the Cochrane reviews, it should be clear that the methods we offer can be used in any meta-analysis. The only caveat is that meta-analysis is prone to publication bias, where only significant results (at ) are published. The Cochrane reviews are known to be careful during their search for eligible studies, avoiding as much as possible this problem. In other areas, where this may not be feasible, using conditional p-values rather than the raw ones in the procedures we offer may circumvent the problem (with unfortunate loss of some power.)
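The text does not spell out how the conditional p-values are constructed. One common form, assumed here, applies when a result is only published if its p-value falls at or below a selection threshold: under the null hypothesis the published p-value is then uniform on [0, threshold], so dividing by the threshold restores a valid p-value. The function name and the default threshold of 0.05 are assumptions of this sketch.

```python
def conditional_p_value(p, threshold=0.05):
    """p-value conditional on the result having been selected (p <= threshold)."""
    if not 0.0 < threshold <= 1.0:
        raise ValueError("threshold must be in (0, 1]")
    if not 0.0 <= p <= threshold:
        raise ValueError("only results with p <= threshold are assumed to be published")
    return p / threshold


# A published p-value of 0.001 shrinks to much weaker evidence once selection is accounted for.
print(conditional_p_value(0.001))
```

The conditional values are never smaller than the raw ones, which is the loss of power mentioned above.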


The authors thank Ian Pan for his help with extracting and processing data from Cochrane forest plots; David Steinberg for useful comments on an earlier version of this manuscript; and Daniel Yekutieli for useful discussions about the methodology.


This research was supported by the Israeli Science Foundation [grant 1049/16 (RH)]; and the European Research Council [grant FP7/2007-2013 ERC agreement no. PSARPS-294519 (YB, IJ, LS)].


  • Higgins J and Green S (2011) Higgins J and Green S (editors). The Cochrane Handbook for Systematic Reviews of Interventions Version 5.1.0 [updated March 2011].
  • Deeks JJ, Higgins JPT, and Altman DG (2017) Deeks JJ, Higgins JPT, Altman DG (eds) on behalf of the Cochrane Statistical Methods Group. Chapter 9: Analysing data and undertaking meta-analyses. In: Higgins JPT, Churchill R, Chandler J, Cumpston MS (eds), Cochrane Handbook for Systematic Reviews of Interventions version 5.2.0 (updated June 2017), Cochrane, 2017. Available from www.training.cochrane.org/handbook
  • Anzures-Cabrera J and Higgins J (2010) Anzures-Cabrera J and Higgins J. Graphical displays for meta-analysis: An overview with suggestions for practice. Research Synthesis Methods 2010; 1: 66–80.
  • Bax L, Yu L, Ikeda N, Tsuruta H, and Moons K (2006) Bax L, Yu L, Ikeda N, Tsuruta H, and Moons K. Development and validation of Mix: comprehensive free software for meta-analysis of causal research data. BMC Medical Research Methodology 2006; 6 (50).
  • Benjamini Y and Heller R (2008) Benjamini Y and Heller R. Screening for partial conjunction hypotheses. Biometrics 2008; 64:1215–1222.
  • Benjamini Y, Heller R, and Yekutieli D (2009) Benjamini Y, Heller R, and Yekutieli D. Selective inference in complex research. Philosophical Transactions of the Royal Society 2009; 367:4255–4271.
  • Heller R (2010) Heller R. Discussion of "Multiple Testing for Exploratory Research" by J. J. Goeman and A. Solari. Statistical Science 2010; 26 (4): 598–600.
  • Borenstein M, Hedges LV, Higgins JPT, and Rothstein HR (2009) Borenstein M, Hedges LV, Higgins JPT, and Rothstein HR Introduction to Meta-Analysis, Wiley-Blackwell, 2009. Available from https://onlinelibrary.wiley.com/doi/pdf/10.1002/9780470743386
  • Kafkafi N, Benjamini Y, Sakov A, Elmer GI, and Golani I (2005) Kafkafi N, Benjamini Y, Sakov A, Elmer GI, and Golani I. Genotype–environment interactions in mouse behavior: a way out of the problem. Proceedings of the National Academy of Sciences 2005; 102 (12): 4619–4624.
  • Loughin T (2004) Loughin T. A systematic comparison of methods for combining p-values from independent tests. Computational Statistics & Data Analysis 2004; 47: 467–485.
  • Futschik A, Taus T, and Zehetmayer S (2018) Futschik A, Taus T, and Zehetmayer S. An omnibus test for the global null hypothesis. Statistical Methods in Medical Research 2018. DOI: 10.1177/0962280218768326.
  • Owen A (2009) Owen A. Karl Pearson’s meta-analysis revisited, The Annals of Statistics 2009; 37 (6B): 3867–3892.
  • Zaykin DV, Zhivotovsky LA, Westfall PH, and Weir BS (2002) Zaykin DV, Zhivotovsky LA, Westfall PH, and Weir BS. Truncated Product Method of Combining P-values. Genetic Epidemiology 2002; 22: 170–185.
  • Hsu J, Small DS, and Rosenbaum PR (2013) Hsu J, Small DS, and Rosenbaum PR. Effect Modification and Design Sensitivity in Observational Studies. Journal of the American Statistical Association 2013; 108 (501): 135–148.
  • IntHout J, Ioannidis J, and Borm G (2014) IntHout J, Ioannidis J, and Borm G. The Hartung-Knapp-Sidik-Jonkman method for random effects meta-analysis is straightforward and considerably outperforms the standard DerSimonian-Laird method. BMC Medical Research Methodology 2014; 14:25.
  • DerSimonian R and Laird N (2015) DerSimonian R and Laird N. Meta-Analysis in Clinical Trials Revisited. Contemp Clin Trials 2015; 45: 139–145.
  • Veroniki et al. (2019) Veroniki, A. A., Jackson, D., Bender, R., Kuss, O., Langan, D., Higgins, J. P., Knapp, G., and Salanti, G. (2019). Methods to calculate uncertainty in the estimated overall effect size from a random-effects meta-analysis. Research synthesis methods, 10(1):23–43.
  • Gail and Simon (1985) Gail, M. and Simon, R. (1985). Testing for qualitative interactions between treatment effects and patient subsets. Biometrics, 41(2):361–372.
  • Piantadosi and Gail (1993) Piantadosi, S. and Gail, M. (1993). A comparison of the power of two tests for qualitative interactions. Statistics in Medicine, 12(13):1239–1248.

Appendix A Proof of Proposition 3.1

Let , , and be the fixed-effect meta-analysis test statistic, estimated effect, and SE, respectively, for the intersection hypotheses indexed by . Since , the meta-analysis test statistic can be expressed in terms of , :


Let . By definition, . We shall show that if , and . Clearly, since . Therefore, the result follows by showing that and .

We start by showing that . If , then by definition and therefore it follows that . If then

where the first inequality follows from (5) and the definition of , the second inequality follows since for all , and the last equality follows since

Since it thus follows that .

Next, we show that . By definition, . Since and

it follows that and therefore that .

Therefore, if we have and . Similar arguments show that if , and . It thus follows that .

Remark A.1

The property that, with probability one, the global null p-value is smaller than the r-value, is not satisfied with popular combining functions such as Fisher, Simes, and Bonferroni (Benjamini Y and Heller R, 2008). For example, if $p_{(1)} \le \ldots \le p_{(n)}$ are the ordered p-values, then the Bonferroni meta-analysis p-value is $n p_{(1)}$, its r-value for $u = 2$ is $(n-1) p_{(2)}$, and