1 Introduction
In systematic reviews, several studies that examine the same questions are analyzed together. Viewing all the available information is extremely valuable for practitioners in the health sciences. A notable example is the Cochrane systematic reviews on the effects of healthcare interventions (Higgins J and Green S, 2011).
Deriving conclusions about the overall health benefits or harms from an ensemble of studies can be difficult, since the studies are never exactly the same and there is danger that these differences affect the inference. For example, factors that are particular to the study, such as the specific cohorts in the study that are from specific populations exposed to specific environments, the specific experimental protocol used in the study, the specific care givers in the study, etc., may have an impact on the treatment effect.
There are many reasons to perform a meta-analysis, according to the Cochrane Handbook for Systematic Reviews of Interventions (§ 9.1.3, Deeks JJ, Higgins JPT, and Altman DG, 2017). The first two reasons are the obvious ones: (1) to increase power, since many individual studies are too small to detect small effects, but when combined there is a higher chance of detecting an effect; (2) to improve precision. The next two reasons address questions that cannot be answered by individual studies. We quote the third reason: "Primary studies often involve a specific type of patient and explicitly defined interventions. A selection of studies in which these characteristics differ can allow investigation of the consistency of effect and, if relevant, allow reasons for differences in effect estimates to be investigated." The fourth reason is "To settle controversies arising from apparently conflicting studies or to generate new hypotheses. Statistical analysis of findings allows the degree of conflict to be assessed formally, and reasons for different results to be explored and quantified."
If there are only one or two studies with results that are in apparent conflict with the rest of the studies, they may be excluded from the meta-analysis if an obvious reason for the outlying result can be identified. However, in general, at least one characteristic can be found for any study in any meta-analysis which makes it different from the others (Higgins J and Green S, 2011), so exclusion of studies may introduce bias (Deeks JJ, Higgins JPT, and Altman DG, 2017). It is possible to carry out a sensitivity analysis in which the meta-analysis is repeated, each time omitting one of the studies (Anzures-Cabrera J and Higgins J, 2010). A plot of the results of these meta-analyses, called an "exclusion sensitivity plot" (Bax L, Yu L, Ikeda N, Tsuruta H, and Moons K, 2006), will reveal any studies that have a particularly large influence on the results of the meta-analysis. We concur with this view, but in this work we offer a single-number summary of such a sensitivity analysis, and recommend adding it to the report of the main results, and to the forest plot, of the meta-analysis.
Gail and Simon (1985) and Piantadosi and Gail (1993) suggested tests targeted towards identifying whether effects are inconsistent, i.e., for identifying whether the effect direction is positive in some studies but negative in other studies (termed 'qualitative' or 'crossover' interactions). We offer in addition a lower confidence bound on the number of studies with a significant effect in each direction, using the replicability analysis tools developed in Benjamini Y and Heller R (2008); Benjamini Y, Heller R, and Yekutieli D (2009); Heller R (2010).
Our starting point is that the researchers decided that multiple studies are sufficiently similar to answer a clinical question of interest by a meta-analysis. Consistency evaluation is important for proper assessment of the overall evidence towards a positive or a negative effect. We examined the extent of the lack of consistency in systematic reviews in the Cochrane library in § 6 and found it to be non-negligible.
In § 2 we review meta-analysis methods that are carried out routinely in various disciplines. For clarity of exposition we focus on the two common methods in the Cochrane library. In § 3 we explain how to carry out the additional replicability and consistency evaluations that we suggest adding to the meta-analyses. In § 4 we carry out simulations that examine the differences between meta-analysis and replicability analysis in various settings of interest. In § 5 we demonstrate how such an evaluation contributes to the assessment of the overall intervention effect in case studies from the Cochrane library. In § 7 we conclude with some final remarks.
2 Review of meta-analysis methods
Let $n$ be the number of studies available for meta-analysis, and let $\theta_i$ be the (unknown) treatment effect in study $i$, $i = 1, \ldots, n$. Let $\hat{\theta}_i$ and $\widehat{SE}_i$ be the estimated effect size and its standard error for study $i$. The test statistic for testing $H_i: \theta_i = 0$ is $Z_i = \hat{\theta}_i / \widehat{SE}_i$. Although $\widehat{SE}_i$ is estimated from the data, for clarity of exposition we adopt the typical assumption in meta-analysis that the null distribution of $Z_i$ (possibly after logarithmic or other appropriate transformation) is (approximately) standard normal (Borenstein M, Hedges LV, Higgins JPT, and Rothstein HR, 2009; Deeks JJ, Higgins JPT, and Altman DG, 2017).
2.1 Random effects and fixed effect meta-analysis
The overwhelming majority of meta-analyses published over the past two decades are meta-analyses of effect sizes (Borenstein M, Hedges LV, Higgins JPT, and Rothstein HR, 2009). The primary aim of the meta-analysis of effect sizes is to estimate the overall treatment effect, whether efficacy or adverse response, from $(\hat{\theta}_i, \widehat{SE}_i)$, $i = 1, \ldots, n$. The summary effect estimate is the weighted average
$$\hat{\theta} = \frac{\sum_{i=1}^{n} w_i \hat{\theta}_i}{\sum_{i=1}^{n} w_i},$$
where $w_i > 0$. Let $\theta$ denote the expectation of the summary effect estimate. It is of interest to test the meta-analysis null hypothesis $H_0: \theta = 0$, as well as provide a confidence interval (CI) for $\theta$. The choice of weights, and the distribution of $\hat{\theta}$, depend on whether it is assumed that the effects are heterogeneous.
Assuming no heterogeneity, i.e., each study is estimating exactly the same quantity $\theta$, a fixed effect (FE) meta-analysis is performed where $w_i = 1/\widehat{SE}_i^2$. The standard error of the summary effect is therefore
$$\widehat{SE}(\hat{\theta}) = \frac{1}{\sqrt{\sum_{i=1}^{n} w_i}}.$$
The test statistic for $H_0$ is $Z = \hat{\theta} / \widehat{SE}(\hat{\theta})$, and the pooled $1-\alpha$ CI for $\theta$ is, typically, $\hat{\theta} \pm z_{1-\alpha/2}\,\widehat{SE}(\hat{\theta})$. If there is no heterogeneity, the pooled CI may be far narrower than the CIs of the individual studies; and if $H_0$ is false the test based on $Z$ is powerful. However, if there is heterogeneity, $H_0$ is false, and the $\theta_i$'s have mixed signs, then the test based on $Z$ can have low power because $\theta$ may be close to zero. Moreover, the CI for the summary effect is meaningless, since if the effect is decreasing in some studies but increasing in others, the researchers may be far more interested in investigating the sources of the sign inconsistencies rather than assessing the magnitude of $\theta$, which averages positive and negative terms.
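As a rough illustration of the fixed effect computation described above, the following Python sketch forms the inverse-variance weighted average, its standard error, the test statistic and a normal-approximation CI. The function name `fixed_effect_meta` and the four example studies are hypothetical and chosen only for illustration; they do not come from any Cochrane review.

```python
import math
from statistics import NormalDist


def fixed_effect_meta(estimates, std_errors, alpha=0.05):
    """Inverse-variance fixed effect meta-analysis (illustrative sketch).

    Returns (summary estimate, its standard error, z statistic,
    two-sided p-value, (ci_low, ci_high)).
    """
    weights = [1.0 / se ** 2 for se in std_errors]      # w_i = 1 / SE_i^2
    total_w = sum(weights)
    theta_hat = sum(w * e for w, e in zip(weights, estimates)) / total_w
    se_hat = 1.0 / math.sqrt(total_w)                    # SE of the weighted average
    z = theta_hat / se_hat
    norm = NormalDist()
    p_two_sided = 2.0 * norm.cdf(-abs(z))
    half_width = norm.inv_cdf(1.0 - alpha / 2.0) * se_hat
    return theta_hat, se_hat, z, p_two_sided, (theta_hat - half_width, theta_hat + half_width)


# Hypothetical log odds ratios and standard errors for four studies.
est = [0.40, 0.55, 0.30, 0.62]
se = [0.20, 0.25, 0.15, 0.30]
theta_hat, se_hat, z, p_value, ci = fixed_effect_meta(est, se)
print(f"summary={theta_hat:.3f}, SE={se_hat:.3f}, z={z:.2f}, p={p_value:.4f}")
```

Because the weights depend only on the standard errors, a single very precise study can dominate the summary, which is the situation the replicability analysis is meant to flag.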
A model that accounts for heterogeneity of effects is the random effects (RE) model. Some researchers argue that the fixed effect assumptions are implausible most of the time, and thus suggest always using the random effects model (Higgins J and Green S, 2011). Others choose the RE model over the FE model for inference on the summary effect a priori, either based on clinical knowledge or based on a heterogeneity summary statistic. The Cochrane Handbook for Systematic Reviews of Interventions § 9.5.4 (Deeks JJ, Higgins JPT, and Altman DG, 2017) cautions against choosing a RE over FE meta-analysis based on a statistical test for heterogeneity that uses the data for the meta-analysis. It is relevant to note that the RE model was also recommended by Kafkafi N, Benjamini Y, Sakov A, Elmer GI, and Golani I (2005) as the tool for assessing replicability across laboratories in animal phenotyping experiments.
In the RE model, $\theta_1, \ldots, \theta_n$ is an independent identically distributed sample from a distribution with (unknown) mean $\theta$ and variance $\tau^2$. In the Cochrane library, the random effects meta-analyses typically assume that the distribution of $\theta_i$ is Gaussian. The variance of $\theta_i$ is estimated by the method of DerSimonian and Laird (Borenstein M, Hedges LV, Higgins JPT, and Rothstein HR, 2009) to be
$$\hat{\tau}^2 = \max\left\{0, \frac{Q - (n-1)}{\sum_{i} w_i - \sum_{i} w_i^2 / \sum_{i} w_i}\right\},$$
where $Q = \sum_{i} w_i (\hat{\theta}_i - \hat{\theta}_{FE})^2$, $w_i = 1/\widehat{SE}_i^2$ are the fixed effect weights, and $\hat{\theta}_{FE}$ is the fixed effect summary estimate. The weights are inversely proportional to the total variability of $\hat{\theta}_i$, $w_i^* = 1/(\widehat{SE}_i^2 + \hat{\tau}^2)$. The standard error of the summary effect is
$$\widehat{SE}(\hat{\theta}_{RE}) = \frac{1}{\sqrt{\sum_{i=1}^{n} w_i^*}}.$$
If the $\theta_i$'s come from a Gaussian distribution, then estimating $\theta$ by an appropriate CI is useful if $\theta$ represents an overall effect size. This may be the case if the heterogeneity of effect sizes is due to clinical variability (e.g., participants, details of interventions or of outcomes). However, this may not be the case if the heterogeneity in intervention effects is due to methodological variability which introduces bias. In general, it is not possible to distinguish whether heterogeneity results from clinical or methodological variability (Deeks JJ, Higgins JPT, and Altman DG, 2017). Even when the source of variability is clinical, it is difficult to establish the validity of any distributional assumption on the $\theta_i$'s, and this is a common criticism of random effects meta-analyses (Deeks JJ, Higgins JPT, and Altman DG, 2017).
The default method of CI estimation is still the Wald-type interval $\hat{\theta}_{RE} \pm z_{1-\alpha/2}\,\widehat{SE}(\hat{\theta}_{RE})$, despite its low coverage. The performance of this method depends on the number of studies and the magnitude of $\tau^2$, with the poorest performance for few studies, compared to other available methods (Veroniki et al., 2019).
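The random effects computation can be sketched in Python as follows, assuming the standard moment-based form of the DerSimonian and Laird estimator (Cochran's Q computed with the fixed effect weights, truncated at zero). The function name `dersimonian_laird` is hypothetical, and the interval returned is the plain normal-approximation interval, which, as noted above, can have poor coverage when there are few studies.

```python
import math
from statistics import NormalDist


def dersimonian_laird(estimates, std_errors, alpha=0.05):
    """Random effects summary with the DerSimonian-Laird variance estimate (sketch).

    Returns (summary estimate, its standard error, tau^2 estimate, (ci_low, ci_high)).
    """
    n = len(estimates)
    w = [1.0 / se ** 2 for se in std_errors]             # fixed effect weights
    sum_w = sum(w)
    theta_fe = sum(wi * e for wi, e in zip(w, estimates)) / sum_w
    q = sum(wi * (e - theta_fe) ** 2 for wi, e in zip(w, estimates))  # Cochran's Q
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (n - 1)) / c) if c > 0 else 0.0  # truncated at zero
    w_re = [1.0 / (se ** 2 + tau2) for se in std_errors]  # total-variability weights
    sum_w_re = sum(w_re)
    theta_re = sum(wi * e for wi, e in zip(w_re, estimates)) / sum_w_re
    se_re = 1.0 / math.sqrt(sum_w_re)
    half_width = NormalDist().inv_cdf(1.0 - alpha / 2.0) * se_re
    return theta_re, se_re, tau2, (theta_re - half_width, theta_re + half_width)


# Two hypothetical, precisely estimated studies with opposite signs:
# the summary is zero and the between-study variance is large.
theta_re, se_re, tau2, ci = dersimonian_laird([-1.0, 1.0], [0.1, 0.1])
print(f"summary={theta_re:.3f}, SE={se_re:.3f}, tau2={tau2:.3f}")
```

In this toy input both studies are individually highly significant, yet the RE summary is centered at zero with a wide interval, which is the kind of sign inconsistency the later sections aim to quantify.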
The assumption that the $\theta_i$'s have a Gaussian distribution may be plausible if the heterogeneity of effect sizes is due to clinical variability (e.g., participants, details of interventions or of outcomes), rather than methodological variability which introduces bias. Either way, the $z$-test by DerSimonian and Laird is known for not controlling the type I error rate (IntHout J, Ioannidis J, and Borm G, 2014).
2.2 Combining independent studies with no assumptions
The random effects meta-analysis null, $H_0: \theta = 0$, is that the mean effect across studies is zero. A more fundamental aim is to test the global null hypothesis that the effect size is zero in each and every study,
$$H^{1/n}: \theta_1 = \cdots = \theta_n = 0.$$
If $H^{1/n}$ is true, then $H_0$ is true as well. But if $H^{1/n}$ is false, $H_0$ may still be true. Testing $H^{1/n}$ is useful regardless of the source of heterogeneity (clinical or methodological).
Many tests are available for $H^{1/n}$ using the independent test statistics $Z_1, \ldots, Z_n$, or their corresponding $p$-values $p_1, \ldots, p_n$ (Loughin T, 2004; Futschik A, Tuas T, and Zehetmayer S, 2018). The preferred test depends on the (unknown) alternative, and there is no single test that dominates all others. Let $f(p_1, \ldots, p_n)$ be a combining function for testing the global null. We consider Fisher's combining function, $f(p_1, \ldots, p_n) = -2\sum_{i=1}^{n} \log p_i$. If $H^{1/n}$ is true, $f(p_1, \ldots, p_n)$ has a chi-squared distribution with $2n$ degrees of freedom, so the global null $p$-value using Fisher's combining function is
$$\Pr\left(\chi^2_{2n} \geq -2\sum_{i=1}^{n} \log p_i\right). \qquad (1)$$
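A minimal Python sketch of Fisher's combining method follows. It relies on the closed form of the chi-squared survival function for an even number of degrees of freedom, so only the standard library is needed; the helper names `chi2_sf_even_df` and `fisher_global_pvalue` are hypothetical.

```python
import math


def chi2_sf_even_df(x, n):
    """P(chi-squared with 2n degrees of freedom >= x), closed form for even df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    term = total = 1.0
    for j in range(1, n):            # sum_{j=0}^{n-1} (x/2)^j / j!
        term *= half / j
        total += term
    return min(1.0, math.exp(-half) * total)


def fisher_global_pvalue(p_values):
    """Global null p-value from independent p-values in (0, 1]."""
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    return chi2_sf_even_df(statistic, len(p_values))


# Three hypothetical p-values, none individually below 0.05.
print(fisher_global_pvalue([0.08, 0.06, 0.10]))
```

With a single study the combined $p$-value equals the input $p$-value, which is a convenient sanity check on the implementation.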
Fisher’s combining method is popular in various application fields (e.g., genomic research, education, social sciences) since it has been shown to have excellent power properties (Owen A, 2009). It is rarely used in systematic reviews of randomized clinical trials, where the focus is on effect sizes. However, we believe it can be useful for assessing the consistency of the direction of effect, as we will show in § 3. Before doing this, we shall consider two extensions of this combining method, which may be useful for application in systematic reviews.
In the first, define $p_i^L$ as the $p$-value for the left-sided alternative that $\theta_i < 0$, and $p_i^R = 1 - p_i^L$ as the $p$-value for the right-sided alternative. Pearson suggested combining the left-sided and the right-sided $p$-values separately by Fisher's combining function, giving $S^L = -2\sum_{i=1}^{n} \log p_i^L$ and $S^R = -2\sum_{i=1}^{n} \log p_i^R$, respectively.
Pearson's test statistic is the maximum of the two resulting statistics,
$$S^P = \max(S^L, S^R).$$
Pearson's global null $p$-value is, therefore,
$$2\Pr\left(\chi^2_{2n} \geq S^P\right). \qquad (2)$$
This test has greater power than a test based on Fisher's combining method using two-sided $p$-values when the direction of the signal is consistent across studies, while not requiring us to know the common direction.
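The following Python sketch illustrates Pearson's directional combination, assuming approximately standard normal study statistics and a Bonferroni-type doubling of the one-sided combined $p$-value (capped at one). The function names are hypothetical, and the floor of `1e-300` is only a numerical guard against taking the logarithm of zero.

```python
import math
from statistics import NormalDist


def chi2_sf_even_df(x, n):
    """P(chi-squared with 2n degrees of freedom >= x), closed form for even df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    term = total = 1.0
    for j in range(1, n):
        term *= half / j
        total += term
    return min(1.0, math.exp(-half) * total)


def pearson_global_pvalue(z_scores):
    """Pearson's two-directional version of Fisher's method (sketch)."""
    norm = NormalDist()
    p_left = [max(norm.cdf(z), 1e-300) for z in z_scores]    # small when z is very negative
    p_right = [max(norm.cdf(-z), 1e-300) for z in z_scores]  # small when z is very positive
    s_left = -2.0 * sum(math.log(p) for p in p_left)
    s_right = -2.0 * sum(math.log(p) for p in p_right)
    statistic = max(s_left, s_right)
    return min(1.0, 2.0 * chi2_sf_even_df(statistic, len(z_scores)))


# Three hypothetical studies pointing in the same direction.
print(pearson_global_pvalue([2.5, 2.5, 2.5]))
```

The result is unchanged if every study statistic flips sign, reflecting that the common direction need not be specified in advance.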
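The second extension is useful if the $\theta_i$'s are suspected to have mixed signs. A potentially more powerful test for the global null may therefore be based on aggregation of the one-sided $p$-values that are at most a predefined threshold $t$ (Zaykin DV, Zhivotovsky LA, Westfall PH, and Weir BS, 2002), and the null distribution is adjusted accordingly. The directional test statistics are
$$S_t^L = -2\sum_{i=1}^{n} \log(p_i^L)\, I(p_i^L \leq t), \qquad S_t^R = -2\sum_{i=1}^{n} \log(p_i^R)\, I(p_i^R \leq t), \qquad (3)$$
and the truncated-Pearson test statistic is $S_t = \max(S_t^L, S_t^R)$. Clearly, for $t = 1$ this reduces to Pearson's test statistic.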
The null distribution has a simple form, and following Hsu J, Small DS, and Rosenbaum PR (2013) it is straightforward to see that the computation is as follows for
(4) 
where
is to the cumulative gamma distribution with scale parameter equal to one and shape parameter
, andis the cumulative Binomial distribution with
trials and probability of success
. Therefore, the truncatedPearson’s value is .If is false and the ’s have mixed signs, then testing is meaningless. However, testing whether the ’s have mixed signs is informative, and at level the conclusion that the ’s have mixed signs follows if . A major contribution of this paper is the evaluation of the extent of the “mixing", see details in § 3.2. Clearly, it is worth reporting the evidence towards mixed signs in order to interpret appropriately the value of and the CI provided for in the fixed, and especially random, metaanalysis.
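For readers who want to experiment with the truncated combination, here is a hedged Python sketch. The null tail probability is derived here by conditioning on the number $k$ of $p$-values at or below the threshold $t$: given $k$, each such $p$-value divided by $t$ is uniform, so the sum of their negative logarithms is a constant plus a Gamma variable with shape $k$ and scale one. This is a derivation under the independence and uniformity assumptions of the global null, not a transcription of expression (4). The function names, the doubling for the two-sided version, and the threshold 0.05 used in the example are all assumptions made for illustration.

```python
import math
from statistics import NormalDist


def gamma_sf_int_shape(y, k):
    """P(Gamma(shape=k, scale=1) >= y) for integer k (Erlang closed form)."""
    if y <= 0:
        return 1.0
    term = total = 1.0
    for j in range(1, k):
        term *= y / j
        total += term
    return min(1.0, math.exp(-y) * total)


def truncated_stat(p_values, t):
    """-2 * sum of log p over the p-values that are at most t."""
    return -2.0 * sum(math.log(max(p, 1e-300)) for p in p_values if p <= t)


def truncated_null_sf(s, n, t):
    """P(S_t >= s) under the global null with n independent uniform p-values."""
    if s <= 0:
        return 1.0
    total = 0.0
    for k in range(1, n + 1):
        prob_k = math.comb(n, k) * t ** k * (1.0 - t) ** (n - k)  # k p-values <= t
        total += prob_k * gamma_sf_int_shape(s / 2.0 + k * math.log(t), k)
    return min(1.0, total)


def truncated_pearson_pvalue(z_scores, t):
    """Two-directional truncated combination (doubling is an assumption)."""
    norm = NormalDist()
    p_left = [norm.cdf(z) for z in z_scores]
    p_right = [norm.cdf(-z) for z in z_scores]
    s = max(truncated_stat(p_left, t), truncated_stat(p_right, t))
    return min(1.0, 2.0 * truncated_null_sf(s, len(z_scores), t))


# Hypothetical studies with strong effects in opposite directions.
print(truncated_pearson_pvalue([3.0, -3.0, 0.2, -0.1], t=0.05))
```

Setting `t=1.0` recovers the untruncated chi-squared tail with $2n$ degrees of freedom, which is a useful consistency check.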
3 Replicability Analysis
In a meta-analysis, if the true and unknown treatment effect was present in the same direction in more than one study, we claim replicability. If in addition there were no treatment effects in the opposite direction, we claim consistency. We shall show how to evaluate the evidence towards replicability and consistency, by testing the null hypothesis that the intervention effect is zero except possibly in $u - 1$ studies, for $u = 1, \ldots, n$, using Pearson's truncated extension to Fisher's combining function defined in § 3.1.2.
3.1 The r-value
We suggest testing the null hypothesis that at most $u - 1$ studies have an effect in the same direction (Benjamini Y and Heller R, 2008; Benjamini Y, Heller R, and Yekutieli D, 2009; Heller R, 2010), denoted by $H^{u/n}$. By rejecting $H^{u/n}$, we conclude that at least $u$ out of the $n$ studies in the meta-analysis have an effect in the same direction. Specifically, this is the conclusion for a two-sided alternative. Directional replicability can be similarly defined, to claim that at least $u$ out of the $n$ studies in the meta-analysis have a positive effect (right-sided alternative), or a negative effect (left-sided alternative). The $p$-value for the test quantifies the evidence that the treatment effect was replicated in at least $u$ studies, and we call this $p$-value the $u$-out-of-$n$ $r$-value. The minimal requirement for replicability is with $u = 2$. The null hypothesis $H^{2/n}$ is true if at most one study has an effect in the same direction, so rejecting this null enables us to conclude that the significant finding does not hinge on a single study. Henceforth, the 2-out-of-$n$ $r$-value will be referred to simply as the $r$-value.
Let $\Pi(n-u+1)$ denote the set of all possible subsets of size $n - u + 1$ from $\{1, \ldots, n\}$, so it has cardinality $\binom{n}{u-1}$. Let $C$ denote a subset in $\Pi(n-u+1)$. The procedure for computing the $u$-out-of-$n$ $r$-value is as follows.
Step 1. Select appropriate left- and right-sided tests for the null hypotheses that $\theta_i = 0$ for all $i \in C$, $C \in \Pi(n-u+1)$.
Step 2. Compute the left- and right-sided $p$-values from testing the null hypothesis that $\theta_i = 0$ for all $i \in C$, for all $C \in \Pi(n-u+1)$. Denote them by $p^L_C$ and $p^R_C$.
Step 3. For a left-sided alternative that there exist at least $u$ studies with a negative effect, the $r$-value is
$$r^L = \max_{C \in \Pi(n-u+1)} p^L_C.$$
For a right-sided alternative that there exist at least $u$ studies with a positive effect, the $r$-value is
$$r^R = \max_{C \in \Pi(n-u+1)} p^R_C.$$
For a two-sided alternative that there exist at least $u$ studies with a positive effect, or at least $u$ studies with a negative effect, the $r$-value is
$$r = 2\min(r^L, r^R).$$
$u$-out-of-$n$ replicability at significance level $\alpha$ is claimed if $r \leq \alpha$. For $u = 2$ this means that the conclusion remains significant at level $\alpha$ using a meta-analysis of each of the $n$ subsets of $n - 1$ studies, see Benjamini Y and Heller R (2008) for a simple proof. Similarly, we claim at level $\alpha$ replicability of a decreasing effect if $r^L \leq \alpha$, and of an increasing effect if $r^R \leq \alpha$.
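The three steps can be sketched by direct enumeration in Python. The sketch below assumes the partial conjunction construction of Benjamini and Heller (2008): the directional $r$-value is the largest combined $p$-value over all subsets of $n - u + 1$ studies, and the two-sided $r$-value doubles the smaller directional one (capped at one). Fisher's combination of one-sided $p$-values is used as the combining function purely for illustration; `fisher_pvalue` and `r_values` are hypothetical names.

```python
import math
from itertools import combinations


def fisher_pvalue(p_values):
    """Fisher's combined p-value for independent p-values in (0, 1]."""
    n = len(p_values)
    half = -sum(math.log(p) for p in p_values)   # half of Fisher's statistic
    term = total = 1.0
    for j in range(1, n):                        # chi-squared tail, even df
        term *= half / j
        total += term
    return min(1.0, math.exp(-half) * total)


def r_values(p_left, p_right, u, combine=fisher_pvalue):
    """u-out-of-n r-values (left, right, two-sided) by enumerating subsets."""
    n = len(p_left)
    subsets = list(combinations(range(n), n - u + 1))
    r_left = max(combine([p_left[i] for i in c]) for c in subsets)
    r_right = max(combine([p_right[i] for i in c]) for c in subsets)
    return r_left, r_right, min(1.0, 2.0 * min(r_left, r_right))


# Hypothetical one-sided p-values: two studies suggest a negative effect.
p_left = [0.001, 0.002, 0.6, 0.7]
p_right = [1.0 - p for p in p_left]
print(r_values(p_left, p_right, u=2))
```

Enumeration costs $\binom{n}{u-1}$ evaluations, so it is practical only for small $n$ or small $u$; it is shown here because it mirrors the definition most directly.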
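3.1.1 The r-value for a fixed effect meta-analysis
The report of an $r$-value is valuable for the fixed effect model, with the underlying assumption that the effects in the studies are equal. The $r$-value quantifies the evidence against the null that at most one study has an effect. Using the $r$-value, it is possible to provide a lower confidence bound on the number of studies with effect.
The appropriate test to be selected in Step 1 is the fixed effect meta-analysis test statistic if it is believed that $\theta_i = \theta$ except possibly in $u - 1$ studies. For $u = 2$, the computation of the $r$-value for $H^{2/n}$ is as follows. Let the fixed effect meta-analysis summary effect estimate and its standard error, excluding study $i$, be
$$\hat{\theta}_{(-i)} = \frac{\sum_{j \neq i} w_j \hat{\theta}_j}{\sum_{j \neq i} w_j}, \qquad \widehat{SE}_{(-i)} = \frac{1}{\sqrt{\sum_{j \neq i} w_j}},$$
with $w_j = 1/\widehat{SE}_j^2$, and let $Z_{(-i)} = \hat{\theta}_{(-i)} / \widehat{SE}_{(-i)}$ be the test statistic for the null hypothesis that $\theta_j = 0$ for all $j \neq i$.
The $p$-values in Step 2 above are:
$$p^L_{(-i)} = \Phi(Z_{(-i)}), \qquad p^R_{(-i)} = 1 - \Phi(Z_{(-i)}), \qquad i = 1, \ldots, n,$$
where $\Phi$ is the standard normal cumulative distribution function.
The $r$-values in Step 3 above are:
$$r^L = \max_{i} p^L_{(-i)}, \qquad r^R = \max_{i} p^R_{(-i)}, \qquad r = 2\min(r^L, r^R).$$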
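A leave-one-out version of the fixed effect test gives a compact way to compute this $r$-value. The Python sketch below assumes standard normal test statistics and takes the two-sided $r$-value to be twice the smaller directional value, capped at one; the function names and the example data are hypothetical.

```python
import math
from statistics import NormalDist


def fe_z(estimates, std_errors):
    """Fixed effect z statistic: weighted sum over the square root of total weight."""
    w = [1.0 / se ** 2 for se in std_errors]
    return sum(wi * e for wi, e in zip(w, estimates)) / math.sqrt(sum(w))


def fe_r_value(estimates, std_errors):
    """2-out-of-n r-values (left, right, two-sided) for the fixed effect model."""
    est, ses = list(estimates), list(std_errors)
    norm = NormalDist()
    # z statistic of the meta-analysis that omits study i, for each i
    z_loo = [fe_z(est[:i] + est[i + 1:], ses[:i] + ses[i + 1:]) for i in range(len(est))]
    r_left = max(norm.cdf(z) for z in z_loo)      # evidence for a negative effect
    r_right = max(norm.cdf(-z) for z in z_loo)    # evidence for a positive effect
    return r_left, r_right, min(1.0, 2.0 * min(r_left, r_right))


# Hypothetical example: one strong study and two null studies.
est = [3.0, 0.0, 0.0]
se = [0.5, 0.5, 0.5]
z_all = fe_z(est, se)
p_meta = 2.0 * NormalDist().cdf(-abs(z_all))
print(f"meta-analysis p={p_meta:.5f}, r-value={fe_r_value(est, se)[2]:.3f}")
```

In this toy input the pooled test is highly significant, but omitting the first study leaves no signal at all, so the $r$-value is one: the finding rests entirely on a single study.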
Intuitively, the value should be larger than the metaanalysis value since a stronger scientific claim is made by rejecting than by rejecting . We formalize this in the following proposition.
Proposition 3.1
Let be the fixedeffects metaanalysis test statistic, with metaanalysis value , where and . Let be the out of values for , as defined in this section. Then , and moreover, if , , and if , .
See Appendix for the proof.
3.1.2 The r-value when combining studies with no assumptions
The only difference in the computation of the $r$-value from § 3.1.1 is the choice of test statistic in Step 1, which is no longer targeted towards the alternative that the non-null intervention effects have a common sign. When the $\theta_i$'s are expected to be of different magnitudes or signs, the test statistic for determining the lower bounds that is used in the fixed effect model can be improved. Instead, we suggest using the Pearson or truncated-Pearson combining function in expression (4).
For , the computation of the value for is as follows. Let the value statistic for
be , where and are the expressions in (4) after excluding the study.
The truncation value should be predefined, and is set to in this work. The values in Step 2 above, for , are as in equation (4) when replacing with .
The values in Step 3 above are as detailed in § 3.1.1
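Since the combining functions are monotone in the $p$-values, $r^L$ and $r^R$ can be computed efficiently by sorting the left-sided $p$-values so that $p^L_{(1)} \leq \cdots \leq p^L_{(n)}$. Then $r^L$ is the combined $p$-value of $p^L_{(u)}, \ldots, p^L_{(n)}$, i.e., the combined $p$-value based on the subset of the $n - u + 1$ largest left-sided $p$-values, and similarly, $r^R$ is the combined $p$-value based on the $n - u + 1$ largest right-sided $p$-values.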
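The sorting shortcut can be illustrated with a short Python sketch. It assumes a combining function whose combined $p$-value is non-decreasing in each input $p$-value, so that the maximum over subsets is attained at the subset of the largest $p$-values. Fisher's combination stands in for the combining function here; the brute-force version is included only to check the shortcut, and all function names are hypothetical.

```python
import math
from itertools import combinations


def fisher_pvalue(p_values):
    """Fisher's combined p-value for independent p-values in (0, 1]."""
    n = len(p_values)
    half = -sum(math.log(p) for p in p_values)
    term = total = 1.0
    for j in range(1, n):
        term *= half / j
        total += term
    return min(1.0, math.exp(-half) * total)


def r_values_sorted(one_sided_p, combine=fisher_pvalue):
    """Directional r-values for u = 1, ..., n; entry u-1 is the u-out-of-n value."""
    ordered = sorted(one_sided_p)
    # combine the n - u + 1 largest p-values for each u
    return [combine(ordered[u - 1:]) for u in range(1, len(ordered) + 1)]


def r_value_brute_force(one_sided_p, u, combine=fisher_pvalue):
    """Same quantity by enumerating every subset of size n - u + 1."""
    size = len(one_sided_p) - u + 1
    return max(combine(list(c)) for c in combinations(one_sided_p, size))


p_left = [0.30, 0.001, 0.04, 0.75, 0.02]   # hypothetical left-sided p-values
print(r_values_sorted(p_left))
```

Sorting once and combining suffixes costs far less than enumerating subsets, which matters when the bounds are computed for every $u$ and in both directions.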
3.2 Sign consistency analysis
Whether the $\theta_i$'s have mixed signs can be assessed by testing against both the right- and the left-sided alternatives.
Let be the maximal value of for which was rejected against the leftsided alternative at significance level ,
where is the leftsided out of value computed in Steps 1–3 above. Then we can conclude with confidence, that there are at least studies with a negative intervention effect(Heller R, 2010).
Similarly, compute
Combined, with confidence, there are at least studies with a negative intervention effect and studies with a positive intervention effect.
Metaanalysis evidence is said to be consistent in effect direction if (1) and , or (2) and . The larger the nonzero value is, the greater the consistency evidence. The evidence is inconsistent if . There is not enough evidence to assess the consistency of the metaanalysis otherwise.
We shall show how the consistency evidence complements the random or fixed effect meta-analysis findings in examples in § 5. We view this evaluation as particularly useful for accompanying the RE meta-analysis, in which it is assumed that the effects are unequal, but sampled from a Gaussian distribution.
If the $\theta_i$'s are sampled from a Gaussian distribution, then the protection from the danger that the significant conclusion may rely on a single study is almost automatically guaranteed, see § 4 for simulation results. However, if the Gaussian assumption on the $\theta_i$'s is false (and why should it be Gaussian?), we can still provide confidence bounds on the number of studies with increasing effect and on the number of studies with decreasing effect, so the consistency of the sign can be assessed. If the lower bound in one direction is zero, and in the other direction is greater than one, we conclude there is sign consistency. If on the other hand the lower bound is greater than zero in both directions, we have sign inconsistency. This does not mean that the random effects model is incorrect, but it does urge the researcher to examine why some studies deem the intervention effective and others harmful.
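The bookkeeping behind the sign consistency assessment can be sketched in Python as follows. The sketch takes directional $r$-values for $u = 1, \ldots, n$ as given, reports for each direction the largest $u$ whose $r$-value does not exceed the per-direction level, and then labels the evidence. Two choices here are assumptions rather than statements of the method: each direction is tested at level `alpha / 2` (a Bonferroni split, so that both bounds hold jointly with confidence $1 - \alpha$), and any nonzero bound counts as evidence in that direction. The function names and the input $r$-values are hypothetical.

```python
def lower_bound(r_values_by_u, level):
    """Largest u whose u-out-of-n r-value is at most `level` (0 if none).

    r_values_by_u[u - 1] must hold the directional r-value for u = 1, ..., n.
    """
    bound = 0
    for u, r in enumerate(r_values_by_u, start=1):
        if r <= level:
            bound = u
    return bound


def classify_consistency(u_left, u_right):
    """Label the evidence given lower bounds on negative and positive effects."""
    if u_left > 0 and u_right > 0:
        return "inconsistent"
    if u_left > 0 or u_right > 0:
        return "consistent"
    return "not enough evidence"


alpha = 0.05
r_left = [0.0004, 0.012, 0.31, 0.88]    # hypothetical left-sided r-values
r_right = [0.41, 0.97, 1.0, 1.0]        # hypothetical right-sided r-values
u_left = lower_bound(r_left, alpha / 2)
u_right = lower_bound(r_right, alpha / 2)
print(u_left, u_right, classify_consistency(u_left, u_right))
```

With these inputs the sketch reports at least two studies with a negative effect and none with a positive effect, so the evidence is labelled consistent.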
4 Simulations
Underlying the meta-analysis models is the assumption that all studies in a review share the same signal $\theta$, and the goal is to identify whether this signal is bigger or smaller than 0. The approach that we put forward is that when the underlying assumption is compromised by having only one out of $n$ studies in one direction away from 0, we should look suspiciously at such a finding. Even more so if a single other study has a signal in the opposite direction. However, if the conclusion reflects two or more studies with real signal in the same direction, we should be able to discover it. We therefore carry out a simulation study in which we compare and contrast the rejection probability of meta-analysis and replicability analysis.
We generate two groups of sizes 25 and standard deviation 1, so that
with . We consider a combination of studies, and various configurations of interest for the effect sizes . Similar results were obtained in additional simulations with studies (not shown). We used iterations for each simulation setting.
Random Effects Analysis
We consider several studies with heterogeneous but fixed signal. Although this is not the setting assumed by the RE model, the typical analysis that will be carried out (arguably) is the RE meta-analysis because of the heterogeneity in effect sizes. Therefore, we compare the rejection probability of the RE meta-analysis test with that of the replicability analysis test detailed in § 3.2 with and .
Figure 1 shows in the top left panel that the meta-analysis test has a rejection rate of at most 0.07 when a single study has signal and that the replicability analysis test is well below the nominal level. In the bottom row, the advantage of complementing the meta-analysis with a replicability analysis is manifest: when at least two studies have signal in the same direction, the replicability analysis test has power increasing to one to detect the replicated signals. On the other hand, the RE meta-analysis test has little (left panel) and zero (right panel) power to detect the replicated signals.
Fixed Effects Analysis
In order to quantify the behavior of the type I error probability and power in FE analysis, we carry out simulations assuming that if the effect is not null, it is fixed at a value , so . We examined the rejection rate with fixed effect metaanalysis as well as with replicability analysis, when the number of studies with signal varied from zero to eight.
Figure 2 shows that the significance of the fixed effect meta-analysis can, yet the significance of the replicability analysis cannot, be driven by a single study: when a single study is non-null, the rejection rate for replicability analysis is at the nominal 0.05 level, but the rejection rate for the fixed effect meta-analysis is far above it. As expected, the greater the number of studies with nonzero signal, the greater the rejection rate, and this rate is greater for meta-analysis than for replicability analysis.
Replicability analysis to complement RE metaanalysis
We consider several studies with heterogeneous signal generated independently from . This is the setting assumed by the RE model, so the ideal analysis is (arguably) the RE metaanalysis. We compare the rejection probability of the RE metaanalysis test with that of the replicability analysis test detailed in § 3.2 with and .
Figure 3 shows that the rejection rate is much higher for replicability analysis than for meta-analysis. In the RE model, the effect sizes are nonzero with probability one. Therefore, the replicability null hypothesis is never true, and the rejection rate increases with . The meta-analysis null hypothesis is false except when . Interestingly, at , the rejection rate for the RE meta-analysis test is 0.093 instead of the nominal 0.05 level, an inflation that is due primarily to the use of the $z$-test instead of the $t$-test (IntHout J, Ioannidis J, and Borm G, 2014).
The fact that the rejection rate is much higher for replicability analysis than for meta-analysis suggests that for many forest plots generated from the RE model, it may be possible to provide informative lower bounds on the number of studies with effect in each direction even when the RE meta-analysis $p$-value is non-significant. The top panel of Figure 4 illustrates such a dataset. Although the meta-analysis test is non-significant, the replicability analysis reports with 95% confidence that at least three studies have a decreased effect and at least four studies have an increased effect. In the bottom panel, both the meta-analysis test and the replicability analysis test are significant, and the replicability analysis complements the meta-analysis by reporting with 95% confidence that at least five studies have an increased effect, and no studies have a decreased effect.
5 Case studies from the Cochrane library
We provide examples of meta-analyses in the breast cancer domain for which we can, and cannot, claim replicability. For each example, we compute the $r$-value with $u = 2$, i.e., the smallest significance level at which we claim replicability. In addition, we evaluate the extent of the evidence for consistency by providing 95% confidence lower bounds for the number of studies with decreased effect and the number of studies with increased effect in the meta-analysis. We provide recommendations on how to incorporate these new analyses in the Cochrane reviews' abstract and forest plots.
5.1 Assessing replicability and consistency in fixed effect metaanalyses
We present two examples that were analyzed as a fixed effect meta-analysis by the authors of the systematic reviews. In both examples the meta-analysis null hypothesis of zero average effect was rejected at the 0.05 significance level (which according to Proposition 3.1 is a necessary requirement for an $r$-value of at most 0.05). However, the examples differ drastically in their evidence towards replicability. The replicability analysis supports the study conclusion in the first example but weakens it in the second example.
The first example is based on a meta-analysis in review CD002943 (Figure 5). The primary objective of this review was to assess the effectiveness of different strategies for increasing the participation rate of women invited to community breast cancer screening activities or mammography programs. In this meta-analysis, the effect of sending invitation letters was examined in five studies. Only two studies resulted in a significant (at the 0.05 level) positive effect. However, this does not mean that the effect is absent in the other three studies. Indeed, in our complementary analysis, we estimate that at least 3 studies have a positive effect. The authors write: "The odds ratio in relation to the outcome, ’attendance in response to the mammogram invitation during the 12 months after the invitation, was 1.66 (95% CI 1.43 to 1.92)". To this, we suggest adding: "The evidence towards an increased effect was replicable, with an . Moreover, with 95% confidence, we can conclude that at least three studies had an increased effect."
The second example is based on a meta-analysis in review CD007077 (Figure 6). The primary objective of this review was to assess the effectiveness of partial breast irradiation (PBI) or accelerated partial breast irradiation (APBI), i.e., the delivery of radiation to a limited volume of the breast around the tumor bed, sometimes with a shortened treatment duration. The main objective of this review is to determine whether PBI/APBI is equivalent to or better than conventional or hypofractionated whole breast radiotherapy (WBRT) after breast-conservation therapy for early-stage breast cancer. The primary outcome was Cosmesis. The lack of replicability in this study is not surprising, as the two largest studies report conflicting significant effects. However, since the summary effect size is significantly positive, the authors write that "Cosmesis (physician-reported) appeared worse with PBI/APBI (odds ratio (OR) 1.51, 95% CI 1.17 to 1.95, five studies, 1720 participants, low-quality evidence)". To this, we suggest adding: "We cannot rule out the possibility that this result is critically based on a single study ( )."
5.2 Assessing replicability and consistency in random effects metaanalyses
The values were computed using the test statistic based on the truncated Pearson combining method, as described in § 3.1.2, with and . These values may be useful regardless of whether the RE metaanalysis value is significant, as we shall demonstrate in two examples that were analyzed by the Cochrane authors using the RE model.
The first example is based on a meta-analysis in review CD006823 (Figure 7), where the meta-analysis finding was statistically significant. The authors examine the effects of wound drainage after axillary dissection for breast carcinoma on the incidence of postoperative Seroma formation. We establish replicability with an $r$-value of 0.0002, as well as consistency: with 95% confidence there are at least two studies with decreased effect and zero studies with increased effect. The authors write "The OR for Seroma formation was 0.46 (95% CI 0.23 to 0.91, P = 0.03) in favor of a reduced incidence of Seroma in participants with drains inserted." To this, we suggest adding "The evidence towards a decreased effect is consistent: there were at least two studies with decreased effect and no study with increased effect (with 95% confidence)."
The second example is based on a meta-analysis in review CD003366 (Figure 8). The authors compare taxane-containing chemotherapies: single agent taxane vs. Regimen C, on overall effect in Leukopaenia. Pooling 13 studies, the random effects meta-analysis fails to declare any significant difference between regimens, due to the highly significant yet contradicting results. The authors write: "Overall, there was no difference in the risk of Leukopaenia (RR 1.07; 95% CI 0.97 to 1.17; P = 0.16; participants = 6564; Analysis 5.2) with significant heterogeneity across the studies (I² = 90%; P < 0.00001)". We suggest adding: "There is inconsistent evidence for the direction of effect: an increased effect in at least three studies, and a decreased effect in at least two studies (with 95% confidence)."
6 The extent of the replicability problem in Cochrane systematic reviews
We took all the updated Cochrane Collaboration systematic reviews in the breast cancer domain. Our eligibility criteria were as follows: (a) the review included forest plots; (b) at least one fixed-effect primary outcome was reported as significant at the .05 level, which is the default significance level used in Cochrane Reviews; (c) the meta-analysis of at least one of the primary outcomes was based on at least three studies; (d) there was no reporting in the review of unreliable/biased primary outcomes or poor quality of available evidence; and (e) the data is available for download. We consider as primary outcomes the outcomes that were defined as primary by the review authors. If none were defined we selected the most important findings from the review summaries and treated the outcomes for these findings as primary. In the breast cancer domain 62 updated (up to February 2018) reviews were published by the Cochrane Breast Cancer Group in the Cochrane library, out of which we analyzed 23 reviews that met our eligibility criteria (16, 12, 5, 2 and 4 reviews were excluded for reasons a, b, c, d and e, respectively). Out of the 23 eligible reviews, 13 reviews had at least one primary outcome with a meta-analysis $p$-value of at most 0.05 but an $r$-value greater than 0.05.
We analyzed a total of 247 primary outcomes contributed by the eligible systematic reviews, of which 106 were fixed effect metaanalyses, as reported by the authors. Of the 70 outcomes with a statistically significant fixed effect p-value, 15 were sensitive to omitting one study (i.e., had an r-value greater than 0.05); see Figure 9.
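The check for sensitivity to omitting one study can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes an inverse-variance fixed effect model with a normal test statistic, and it assumes that the leave-one-out summary is twice the smaller of the two largest one-sided leave-one-out p-values. The function names and the data are hypothetical.

```python
import math


def upper_tail(z):
    """P(Z >= z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def fixed_effect_z(effects, ses):
    """Inverse-variance fixed effect z statistic for the pooled effect."""
    weights = [1.0 / se ** 2 for se in ses]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled / pooled_se


def two_sided_p(effects, ses):
    """Two-sided fixed effect metaanalysis p-value."""
    return min(1.0, 2.0 * upper_tail(abs(fixed_effect_z(effects, ses))))


def leave_one_out_summary(effects, ses):
    """Worst-case leave-one-out evidence over both directions (assumed form)."""
    right, left = [], []
    for i in range(len(effects)):
        z = fixed_effect_z(effects[:i] + effects[i + 1:], ses[:i] + ses[i + 1:])
        right.append(upper_tail(z))   # one-sided p-value for a positive effect
        left.append(upper_tail(-z))   # one-sided p-value for a negative effect
    return min(1.0, 2.0 * min(max(right), max(left)))


# Hypothetical log risk ratios and standard errors (not taken from any review).
effects = [-0.90, -0.05, 0.02, -0.03]
ses = [0.15, 0.20, 0.25, 0.20]
print("pooled p-value:", two_sided_p(effects, ses))
print("leave-one-out summary:", leave_one_out_summary(effects, ses))
```

In this made-up data set the pooled result is driven by the first study, so the pooled p-value is small while the leave-one-out summary is large.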
The remaining 141 outcomes were analyzed by the authors using a random effects model. For this model, the r-value may be smaller or larger than the metaanalysis p-value. Table 1 summarizes the number of consistent and inconsistent metaanalyses for the 141 outcomes. As expected, among the nonsignificant outcomes there is less consistency than among the significant outcomes. Nine inconsistent outcomes were detected among the nonsignificant outcomes, warranting further research into why the effect is increasing in some experiments yet decreasing in others. Among the 62 significant outcomes, consistency of effect direction could be established for 46 outcomes.
                       Nonsignificant p-value    Significant p-value
Consistent                        8                       46
Inconsistent                      9                        1
Not enough evidence              62                       15
7 Discussion
In Section 6 we report that 101 out of 132 significant metaanalysis outcomes were found to have a significant r-value, meaning that the replicability claim was not met in about 23% (31 of 132) of the outcomes that report a statistically significant effect size. This high percentage suggests that many systematic reviews may critically depend on the result of a single study, and can therefore benefit from complementing the metaanalysis with a report of the r-value and of lower bounds on the number of studies with increasing and decreasing effect. It may seem that if the number of studies is large, the metaanalysis cannot be driven by one outlying study. However, we found three fairly large fixed-effect analyses, with 11, 13, and 13 studies, for which the metaanalysis p-value was significant but the r-value was not. Reporting the r-value can be valuable whether the number of studies is small or large: a significant r-value when pooling a small number of studies reflects strong evidence towards replicability of effects; a nonsignificant r-value for numerous studies guards against unfounded conclusions.
High heterogeneity may lead to studies having opposite signs for estimated effect sizes. In such cases, random effects metaanalysis will mostly result in a nonsignificant p-value. Our complementary replicability analysis gives insight into the consistency of the effects. For example, we see from Table 1 that, among the 79 outcomes with a nonsignificant random effects metaanalysis p-value, we can establish consistency in 8 (about 10%) and inconsistency in 9 (about 11%).
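The shares quoted from Table 1 follow directly from its counts. The short sketch below recomputes the column totals and the share of each row within a column; the dictionary layout and the helper name are illustrative choices, and only the counts come from Table 1.

```python
# Counts copied from Table 1 (random effects metaanalyses).
table_1 = {
    "nonsignificant": {"consistent": 8, "inconsistent": 9, "not_enough_evidence": 62},
    "significant": {"consistent": 46, "inconsistent": 1, "not_enough_evidence": 15},
}


def column_shares(column):
    """Return the column total and each row's share of that total."""
    total = sum(column.values())
    return total, {row: count / total for row, count in column.items()}


nonsig_total, nonsig_shares = column_shares(table_1["nonsignificant"])
sig_total, sig_shares = column_shares(table_1["significant"])

print(nonsig_total, sig_total)
for row, share in nonsig_shares.items():
    print(f"nonsignificant, {row}: {share:.1%}")
```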
While we motivated and demonstrated our approach and its implications with examples from the Cochrane reviews, it should be clear that the methods we offer can be used in any metaanalysis. The only caveat is that metaanalysis is prone to publication bias, where only results that are significant at a given level are published. The Cochrane reviews are known to be careful in their search for eligible studies, avoiding this problem as much as possible. In other areas, where this may not be feasible, using conditional p-values rather than the raw ones in the procedures we offer may circumvent the problem (with an unfortunate loss of some power).
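One common way to form a conditional p-value is sketched below. It rests on assumptions that the text does not spell out: that a study is published only if its p-value is at most a threshold alpha, and that a null p-value is uniform on [0, 1], so that conditional on publication it is uniform on [0, alpha]. The function name and the default threshold are hypothetical.

```python
def conditional_p_value(p, alpha=0.05):
    """P(P <= p | P <= alpha) for a uniform null p-value: p / alpha (assumed selection rule)."""
    if not 0.0 <= p <= alpha:
        raise ValueError("this sketch only covers p-values that passed the selection threshold")
    return p / alpha


# Hypothetical published p-values, all of which passed the 0.05 threshold.
raw = [0.001, 0.01, 0.04]
adjusted = [conditional_p_value(p) for p in raw]
print(adjusted)
```

The adjusted values are larger than the raw ones, which is the loss of power mentioned above: a raw p-value just under the threshold carries almost no evidence once selection is accounted for.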
Acknowledgments
The authors thank Ian Pan for his help with extracting and processing data from Cochrane forest plots; David Steinberg for useful comments on an earlier version of this manuscript; and Daniel Yekutieli for useful discussions about the methodology.
Funding
This research was supported by the Israeli Science Foundation [grant 1049/16 (RH)]; and the European Research Council [grant FP7/2007-2013 ERC agreement no. PSARPS 294519 (YB, IJ, LS)].
References
Higgins J and Green S (editors). The Cochrane Handbook for Systematic Reviews of Interventions, Version 5.1.0 [updated March 2011].
Deeks JJ, Higgins JPT, and Altman DG (eds) on behalf of the Cochrane Statistical Methods Group. Chapter 9: Analysing data and undertaking meta-analyses. In: Higgins JPT, Churchill R, Chandler J, Cumpston MS (eds), Cochrane Handbook for Systematic Reviews of Interventions version 5.2.0 (updated June 2017), Cochrane, 2017. Available from www.training.cochrane.org/handbook
AnzuresCarbera J and Higgins J. Graphical displays for meta-analysis: An overview with suggestions for practice. Research Synthesis Methods 2010; 1: 66–80.
Bax L, Yu L, Ikeda N, Tsuruta H, and Moons K. Development and validation of MIX: comprehensive free software for meta-analysis of causal research data. BMC Medical Research Methodology 2006; 6 (50).
Benjamini Y and Heller R. Screening for partial conjunction hypotheses. Biometrics 2008; 64: 1215–1222.
Benjamini Y, Heller R, and Yekutieli D. Selective inference in complex research. Philosophical Transactions of the Royal Society 2009; 367: 4255–4271.
Heller R. Discussion of "Multiple Testing for Exploratory Research" by J. J. Goeman and A. Solari. Statistical Science 2010; 26 (4): 598–600.
Borenstein M, Hedges LV, Higgins JPT, and Rothstein HR. Introduction to Meta-Analysis. Wiley-Blackwell, 2009. Available from https://onlinelibrary.wiley.com/doi/pdf/10.1002/9780470743386
Kafkafi N, Benjamini Y, Sakov A, Elmer GI, and Golani I. Genotype-environment interactions in mouse behavior: a way out of the problem. Proceedings of the National Academy of Sciences 2005; 102 (12): 4619–4624.
Loughin T. A systematic comparison of methods for combining p-values from independent tests. Computational Statistics & Data Analysis 2004; 47: 467–485.
Futschik A, Tuas T, and Zehetmayer S. An omnibus test for the global null hypothesis. Statistical Methods in Medical Research 2018. DOI: 10.1177/0962280218768326.
Owen A. Karl Pearson's meta-analysis revisited. The Annals of Statistics 2009; 37 (6B): 3867–3892.
Zaykin DV, Zhivotovsky LA, Westfall PH, and Weir BS. Truncated Product Method of Combining P-values. Genetic Epidemiology 2002; 22: 170–185.
Hsu J, Small DS, and Rosenbaum PR. Effect Modification and Design Sensitivity in Observational Studies. Journal of the American Statistical Association 2013; 108 (501): 135–148.
IntHout J, Ioannidis J, and Borm G. The Hartung-Knapp-Sidik-Jonkman method for random effects meta-analysis is straightforward and considerably outperforms the standard DerSimonian-Laird method. BMC Medical Research Methodology 2014; 14: 25.
DerSimonian R and Laird N. Meta-Analysis in Clinical Trials Revisited. Contemp Clin Trials 2015; 45: 139–145.
Veroniki AA, Jackson D, Bender R, Kuss O, Langan D, Higgins JP, Knapp G, and Salanti G. Methods to calculate uncertainty in the estimated overall effect size from a random-effects meta-analysis. Research Synthesis Methods 2019; 10 (1): 23–43.
Gail M and Simon R. Testing for qualitative interactions between treatment effects and patient subsets. Biometrics 1985; 41 (2): 361–372.
Piantadosi S and Gail M. A comparison of the power of two tests for qualitative interactions. Statistics in Medicine 1993; 12 (13): 1239–1248.
Appendix A Proof of Proposition 3.1
Let , , and be the fixedeffect metaanalysis test statistic, estimated effect, and SE, respectively, for the intersection hypotheses indexed by . Since , the metaanalysis test statistic can be expressed in terms of , :
(5) 
Let . By definition, . We shall show that if , and . Clearly, since . Therefore, the result follows by showing that and .
We start by showing that . If , then by definition and therefore it follows that . If then
where the first inequality follows from (5) and the definition of , the second inequality follows since for all , and the last equality follows since
Since it thus follows that .
Next, we show that . By definition, . Since and
it follows that and therefore that .
Therefore, if we have and . Similar arguments show that if , and . It thus follows that .
Remark A.1
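The property that, with probability one, the global null p-value is smaller than the r-value, is not satisfied with popular combining functions such as Fisher, Simes, and Bonferroni (Benjamini Y and Heller R, 2008). For example, if $p_{(1)} \le \ldots \le p_{(n)}$ are the ordered p-values, then the Bonferroni metaanalysis p-value is $n p_{(1)}$, its r-value for $u = 2$ is $(n-1) p_{(2)}$, and $n p_{(1)}$ exceeds $(n-1) p_{(2)}$ whenever $p_{(2)} < \frac{n}{n-1}\, p_{(1)}$, an event of positive probability.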
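A small numerical check illustrates the Bonferroni case. The sketch assumes the standard forms: the Bonferroni global null p-value is n times the smallest p-value, and the Bonferroni partial conjunction p-value for at least u out of n is (n - u + 1) times the u-th smallest p-value, both capped at 1. The function names and the p-values are hypothetical.

```python
def bonferroni_global(pvalues):
    """Bonferroni p-value for the global null: n * p_(1), capped at 1."""
    n = len(pvalues)
    return min(1.0, n * min(pvalues))


def bonferroni_partial_conjunction(pvalues, u=2):
    """Bonferroni p-value for 'at least u of n effects': (n - u + 1) * p_(u), capped at 1."""
    n = len(pvalues)
    ordered = sorted(pvalues)
    return min(1.0, (n - u + 1) * ordered[u - 1])


# Hypothetical p-values in which the two smallest are close together.
pvalues = [0.010, 0.011, 0.5]
print(bonferroni_global(pvalues))
print(bonferroni_partial_conjunction(pvalues, u=2))
```

With these three p-values the global null p-value is larger than the partial conjunction p-value for u = 2, so the ordering guaranteed for the fixed effect model does not carry over to Bonferroni.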