The harmonic mean χ^2 test to substantiate scientific findings

11/24/2019 ∙ by Leonhard Held, et al. ∙ Universität Zürich

A new significance test is proposed to substantiate scientific findings from multiple primary studies investigating the same research hypothesis. The test statistic is based on the harmonic mean of the squared study-specific test statistics and can also include weights. Appropriate scaling ensures that, for any number of studies, the null distribution is a χ^2-distribution with one degree of freedom. The null distribution can be used to compute a one-sided p-value or to ensure Type-I error control at a pre-specified level. Further properties are discussed and a comparison with FDA's two-trials rule for drug approval is made, as well as with alternative research synthesis methods. An attractive feature of the new approach is that a claim of success requires each study to be convincing on its own to a certain degree depending on the significance level and the number of studies. As a by-product, the approach provides a calibration of the sceptical p-value recently proposed for the analysis of replication studies.

1 Introduction

Research synthesis has been characterized as the process of combining the results of multiple primary studies aimed at testing the same conceptual hypothesis. Meta-analysis is the preferred technique of quantitative research synthesis, as it provides overall effect estimates with confidence intervals and $p$-values through pooling and allows for the incorporation of heterogeneity between studies. However, meta-analysis can be criticized as too weak a technique if the goal is to substantiate an original claim through one or more additional independent studies. Specifically, a significant result may occur in a meta-analysis even if some of the individual studies have not been convincing on their own, perhaps even with effect estimates in the wrong direction. This may be acceptable if the unconvincing studies have been small, but seems less tolerable if each study was well-powered and well-conducted.

For example, consider the results from 5 clinical trials on the effect of Carvedilol, a beta- and alpha-blocker and an antioxidant drug for the treatment of patients with moderate to severe heart failure, on mortality (cf. Fisher, 1999a, Table 1). One-sided $p$-values (from log-rank tests) and hazard ratios (HR) are shown in Table 1, indicating a consistent reduction in mortality between 28 and 78% across the different studies.

study number   $p$-value   HR     log HR   SE
240            0.0245      0.22   -1.51    0.85
221            0.1305      0.57   -0.56    0.51
220            0.00025     0.27   -1.31    0.41
239            0.2575      0.53   -0.63    1.02
223            0.128       0.72   -0.33    0.29

Table 1: Results from 5 clinical trials on the effect of Carvedilol for the treatment of patients with moderate to severe heart failure. Shown are one-sided $p$-values, hazard ratios (HR), and the associated log hazard ratios (log HR) with standard errors (SE).

A meta-analysis could be applied to the data shown in Table 1, but the drug regulation industry (including the U.S. Food and Drug Administration, or FDA) typically relies instead on the “two-trials rule” (Kay, 2015, Section 9.4), also known as the “two pivotal study paradigm” (Hlavin et al., 2016), for approval. This simple decision rule requires “at least two adequate and well-controlled studies, each convincing on its own, to establish effectiveness” (FDA, 1998, p. 3). This is usually achieved by independently replicating the result of a first study in a second study, both significant at one-sided level $\alpha = 0.025$. However, in modern drug development often more than two trials are conducted and it is unclear how to extend the two-trials rule to this setting. Requiring at least 2 out of $n$ studies to be significant is too lax a criterion if the results from the non-significant studies are not taken into account at all. On the other hand, requiring all $n$ studies to be significant is too stringent. This problem applies to the Carvedilol example, where two trials are significant at the 2.5% level (one just with $p = 0.0245$) but where it is unclear whether the remaining three studies (with $p$-values 0.1305, 0.2575 and 0.128) can be considered as sufficiently “convincing on its own.”

This has led statistical researchers to discuss the possibility of pooling the results from the different studies into one $p$-value (Fisher, 1999b; Darken and Ho, 2004; Shun et al., 2005). Ronald Fisher’s method to combine $p$-values (Fisher, 1958) is often used for this task, e.g. in Fisher (1999a) for the Carvedilol example. However, Fisher’s method shares the problems of a meta-analysis as it can produce a significant overall result even if one of the trials was negative. For example, one completely unconvincing trial with a large one-sided $p$-value, combined with a convincing second one with a very small $p$-value, can give a Fisher’s combined $p$-value below $0.025^2 = 0.000625$, so a claim of success with respect to the Type-I error rate of the two-trials rule. On the other hand, two trials that both have the same $p$-value clearly below 0.025 may not be considered successful with Fisher’s method. Both decisions seem undesirable from a regulator’s perspective. Another problem is that Fisher’s method treats large and small studies equally. It can be extended to incorporate weights (Good, 1955), but then the null distribution no longer has a convenient form.
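The two situations can be made concrete with a short worked calculation. The $p$-value pairs below are purely illustrative assumptions chosen here, not values taken from the source. For two studies, Fisher’s statistic $-2\log(p_1 p_2)$ is compared with a $\chi^2$-distribution with 4 degrees of freedom, whose upper tail probability has the closed form shown in the first line.

```latex
\begin{align*}
p_F &= p_1 p_2 \,\bigl[\,1 - \log(p_1 p_2)\,\bigr] \\[4pt]
% illustrative pair 1: one unconvincing and one very convincing trial
p_1 = 0.5,\; p_2 = 0.0001: \quad
p_F &= 5 \times 10^{-5}\,(1 + 9.90) \approx 0.00055 < 0.000625 \\[4pt]
% illustrative pair 2: two trials with the same moderately small p-value
p_1 = p_2 = 0.01: \quad
p_F &= 10^{-4}\,(1 + 9.21) \approx 0.0010 > 0.000625
\end{align*}
```

Under these assumed values the first pair would pass the threshold $0.025^2$ although one trial shows no evidence at all, while the second pair would fail it although both trials are individually significant at the 2.5% level.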

The two-trials rule therefore remains the standard in drug regulation, but has additional deficiencies even for $n = 2$ studies, where independent $p$-value thresholding at $\alpha = 0.025$ may lead to decisions that are the opposite of what the evidence warrants. For example, two trials with $p$-values both just below 0.025 will lead to drug approval but carry less evidence for a treatment effect than one trial with a $p$-value just above 0.025 and the other one with a very small $p$-value, which would, however, not pass the two-trials rule. Rosenkrantz (2002) has therefore proposed a method to claim efficacy if one of two trials is significant while the other just shows a trend. He combines the two-trials rule with Fisher’s method and a relaxed criterion for significance of the two individual trials, i.e. a level larger than 0.025. A similar approach has been proposed by Maca et al. (2002) using Stouffer’s pooled rather than Fisher’s combined method. The arbitrariness in the choice of the relaxed significance criterion is less attractive, though, and it is not obvious how to extend the methods to results from more than two studies.

In this paper I develop a new method that addresses these issues and leads to more appropriate inferences, the harmonic mean test described in Section 2. At the Type-I error rate of the two-trials rule, the proposed test comes to opposite conclusions for the examples mentioned above: In contrast to Fisher’s method, it leads to approval of two trials that both have the same $p$-value clearly below 0.025, but not to approval if one trial has a very small and the other one a large $p$-value. In contrast to the two-trials rule, it leads to approval of one trial just above the 0.025 threshold combined with a highly significant one, but not to approval if both trials are only just significant. The work is motivated by a recent proposal on how to evaluate the success of replication studies (Held, 2020) and is based on the harmonic mean of the squared $Z$-scores. It can include weights for the individual studies and can be calibrated to ensure exact Type-I error control. Furthermore, the new approach implies useful bounds on the individual study-specific $p$-values $p_i$, thus formalizing the meaning of “at least two adequate and well-controlled studies, each convincing on its own”. The approach will be compared to the two-trials rule and illustrated on the Carvedilol data in Section 3. Section 4 outlines how the method can be used to calibrate the sceptical $p$-value (Held, 2020). I close with some discussion in Section 5.

2 The harmonic mean test

Suppose one-sided $p$-values $p_1, \ldots, p_n$ are available from $n$ independent studies. How can we combine the $p$-values into one $p$-value? Cousins (2007) compares some of the more prominent papers on this topic. Among them is Stouffer’s method, which is based on the $Z$-scores $Z_i = \Phi^{-1}(1 - p_i)$, where $\Phi^{-1}(\cdot)$ denotes the quantile function of the standard normal distribution. Under the assumption of no effect, the test statistic

$$Z = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} Z_i$$

is standard normally distributed. The corresponding $p$-value forms the basis of the “pooled-trials rule” and is equivalent to investigating significance of the overall effect estimate from a fixed-effects meta-analysis (Senn, 2007, Section 12.2.8). It can also be extended to include weights. Fisher’s method is also commonly used and compares $-2 \sum_{i=1}^{n} \log p_i$ with a $\chi^2$-distribution with $2n$ degrees of freedom to compute a combined $p$-value.
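As a minimal sketch of the two classical combination methods just described, the following Python code (standard library only) computes Stouffer’s pooled and Fisher’s combined $p$-value for the one-sided $p$-values of Table 1. The function names are hypothetical, and the closed-form tail probability for an even number of degrees of freedom is a standard identity rather than something stated in the source.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def stouffer_pvalue(pvalues):
    """One-sided p-value of Stouffer's (unweighted) pooled test."""
    z_scores = [STD_NORMAL.inv_cdf(1.0 - p) for p in pvalues]
    z_pooled = sum(z_scores) / math.sqrt(len(z_scores))
    return 1.0 - STD_NORMAL.cdf(z_pooled)


def fisher_pvalue(pvalues):
    """p-value of Fisher's combined test.

    The statistic -2 * sum(log p_i) is compared with a chi-squared
    distribution with 2n degrees of freedom. For an even number of degrees
    of freedom the upper tail has the finite-sum form used below.
    """
    n = len(pvalues)
    half_stat = -sum(math.log(p) for p in pvalues)  # statistic divided by 2
    return math.exp(-half_stat) * sum(
        half_stat**k / math.factorial(k) for k in range(n)
    )


# One-sided p-values of the five Carvedilol trials (Table 1).
carvedilol = [0.0245, 0.1305, 0.00025, 0.2575, 0.128]
print(f"Stouffer: {stouffer_pvalue(carvedilol):.5f}")
print(f"Fisher:   {fisher_pvalue(carvedilol):.5f}")
```

Both results can be compared with the values 0.00009 and 0.00013 reported for these data in Section 3.2.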

Here I propose a different quantity to assess the overall evidence for a treatment effect based on the harmonic mean of the squared $Z$-scores:

$$X^2 = \frac{n^2}{\sum_{i=1}^{n} 1/Z_i^2} \qquad (1)$$

This form is motivated by the special case of $n = 2$ successive studies, one original and one replication, where a reverse-Bayes approach for the assessment of replication success has recently been described (Held, 2020). If the two studies have equal precision (i.e. sample size), the assessment of replication success does not depend on the order of the two studies and is based on the test statistic $1/(1/Z_1^2 + 1/Z_2^2)$, compare Held (2020, equation (9)). Equation (1) extends this to $n$ studies with an additional multiplicative factor $n^2$, which ensures that the null distribution of $X^2$ does not depend on $n$. As a by-product, this enables us to calibrate the sceptical $p$-value proposed in Held (2020, Section 3), see Section 4 for details.

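Weights $w_1, \ldots, w_n$ can also be introduced in (1); then the test statistic

$$X_w^2 = \frac{w^2}{\sum_{i=1}^{n} w_i/Z_i^2}, \quad \text{where } w = \sum_{i=1}^{n} \sqrt{w_i}, \qquad (2)$$

should be used. Multiplication with $w^2$ ensures that the null distribution of $X_w^2$ depends neither on the weights nor on $n$.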

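The specific form of (2) deserves some additional comments. In practice we often have $Z_i = \hat\theta_i/\sigma_i$, where $\sigma_i = \kappa/\sqrt{m_i}$ is the standard error of the effect estimate $\hat\theta_i$, $\kappa^2$ is the one-unit variance and $m_i$ the effective sample size of study $i$. If we use weights $w_i = 1/\sigma_i^2$ equal to the precision of the effect estimates, (2) can be written as the unweighted harmonic mean of the squared effect estimates times a scaling factor $w^2/n$:

$$X_w^2 = \frac{w^2}{n} \cdot \frac{n}{\sum_{i=1}^{n} 1/\hat\theta_i^2}, \quad \text{where } w = \sum_{i=1}^{n} \frac{1}{\sigma_i} = \frac{1}{\kappa} \sum_{i=1}^{n} \sqrt{m_i}. \qquad (3)$$

In the special case of equal study-specific sample sizes $m_i = m$, the scaling factor reduces to $n m/\kappa^2$.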
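The step from the weighted statistic to the factorized form can be spelled out as follows. This is a sketch under the assumptions that the weighted statistic has the form $w^2/\sum_i w_i/Z_i^2$ with $w = \sum_i \sqrt{w_i}$ (as in Appendix A), that $Z_i = \hat\theta_i/\sigma_i$ with $\sigma_i = \kappa/\sqrt{m_i}$, and that precision weights $w_i = 1/\sigma_i^2$ are used.

```latex
\begin{align*}
\frac{w_i}{Z_i^2}
  &= \frac{1}{\sigma_i^2} \cdot \frac{\sigma_i^2}{\hat\theta_i^2}
   = \frac{1}{\hat\theta_i^2},
  \qquad
  w = \sum_{i=1}^{n} \sqrt{w_i} = \sum_{i=1}^{n} \frac{1}{\sigma_i}
    = \frac{1}{\kappa} \sum_{i=1}^{n} \sqrt{m_i}, \\[4pt]
X_w^2
  &= \frac{w^2}{\sum_{i=1}^{n} 1/\hat\theta_i^2}
   = \underbrace{\frac{w^2}{n}}_{\text{depends on sample sizes}}
     \cdot
     \underbrace{\frac{n}{\sum_{i=1}^{n} 1/\hat\theta_i^2}}_{\text{harmonic mean of } \hat\theta_i^2}, \\[4pt]
m_i = m \;\Longrightarrow\;
\frac{w^2}{n}
  &= \frac{(n \sqrt{m}/\kappa)^2}{n} = \frac{n\,m}{\kappa^2}.
\end{align*}
```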

There is a subtle difference between the two formulations $X^2$ and $X_w^2$. The unweighted test statistic (1) is based on the harmonic mean of the squared study-specific test statistics $Z_i^2$, $i = 1, \ldots, n$. If we increase the sample size of the different studies, (1) will therefore also tend to increase if there is a true non-zero effect. However, the test statistic (3) is based on the harmonic mean of the squared study-specific effect estimates $\hat\theta_i^2$, which should not be much affected by any increase of study-specific sample sizes because the study-specific estimates should then stabilize around their true values. It is the scaling factor $w^2/n$ that will react to an increase in study-specific sample sizes. The test statistic $X_w^2$ can thus be factorized into a component depending on sample sizes and a component depending on effect sizes.

Using properties of Lévy distributions it can be shown that under the null hypothesis of no effect, the distribution of both (1) and (2) is $\chi^2$ with one degree of freedom, see Appendix A for details. We can thus compute an overall $p$-value from (1) or (2) based on the $\chi^2(1)$ distribution function. However, we have to be careful since (1) does not take the direction of the effects into account. Usually we are interested in a pre-defined direction of the underlying effect, say $H_1$: $\theta > 0$ against $H_0$: $\theta = 0$, and we will have to adjust for the fact that (1) and (2) can be large for any of the $2^n$ possible combinations of the signs of $Z_1, \ldots, Z_n$, with all these combinations being equally likely under the null hypothesis. Since we are interested only in the case where all signs are positive, we have to adjust the $p$-value accordingly.

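To be specific, suppose all studies have a positive effect and the observed test statistic (1) or (2) is $x^2$, respectively $x_w^2$. The overall $p$-value from the proposed significance test is then

$$p = \frac{\Pr(\chi^2_1 \geq x^2)}{2^n} \qquad (4)$$

with $x_w^2$ in place of $x^2$ in the weighted case. Likewise we can obtain the critical value

$$c_\alpha = \left[\Phi^{-1}(1 - 2^{n-1} \alpha)\right]^2, \qquad (5)$$

i.e. the $(1 - 2^n \alpha)$-quantile of the $\chi^2$-distribution with one degree of freedom, for the test statistic (1) or (2) to control the Type-I error rate at some overall significance level $\alpha$. Note that the overall $p$-value (4) cannot be larger than $1/2^n$, as it should, since under the null hypothesis the probability to obtain $n$ positive results is $1/2^n$. We are only interested in this case, so if at least one of the studies has a negative effect we suggest reporting the inequality $p > 1/2^n$, for example $p > 1/4$ for $n = 2$ studies.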
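The computations of this section can be sketched in a few lines of Python (standard library only). The sketch assumes that the test statistic is $n^2/\sum_i 1/z_i^2$ (or $w^2/\sum_i w_i/z_i^2$ with $w = \sum_i \sqrt{w_i}$ in the weighted case), that the overall $p$-value is the $\chi^2(1)$ tail probability divided by $2^n$, and that the critical value is the corresponding quantile. All function names are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def harmonic_mean_chisq(z_scores, weights=None):
    """Test statistic n^2 / sum(1/z_i^2); with weights w_i it becomes
    w^2 / sum(w_i / z_i^2), where w = sum(sqrt(w_i))."""
    if weights is None:
        weights = [1.0] * len(z_scores)
    w = sum(math.sqrt(wi) for wi in weights)
    return w**2 / sum(wi / z**2 for wi, z in zip(weights, z_scores))


def chisq1_sf(x):
    """Upper tail probability of a chi-squared(1) variable at x >= 0."""
    return math.erfc(math.sqrt(x / 2.0))


def overall_pvalue(z_scores, weights=None):
    """Overall p-value; defined only if all effects are positive."""
    n = len(z_scores)
    if any(z <= 0 for z in z_scores):
        raise ValueError(f"not all effects are positive: report p > {1 / 2**n}")
    return chisq1_sf(harmonic_mean_chisq(z_scores, weights)) / 2**n


def critical_value(alpha, n):
    """Value c with chisq1_sf(c) / 2^n = alpha (requires 2^n * alpha < 1)."""
    return STD_NORMAL.inv_cdf(1.0 - 2 ** (n - 1) * alpha) ** 2


c = critical_value(0.025**2, n=2)
print(f"critical value for alpha = 0.025^2 and n = 2: {c:.2f}")
```

A statistic exactly equal to the critical value gives an overall $p$-value equal to the chosen level, which is a convenient consistency check for the two functions.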

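In what follows I restrict attention to the unweighted test statistic $X^2$ given in (1); similar results can be obtained for $X_w^2$ given in (2). Let $z_i = \Phi^{-1}(1 - p_i)$ denote the observed test statistic in the $i$-th study. I assume that $z_i > 0$ for all $i$, i.e. all effects go in the right direction. First note that the smallest squared test statistic multiplied by the number of studies is an upper bound on the harmonic mean $x^2/n$:

$$\frac{x^2}{n} = \frac{n}{\sum_{j=1}^{n} 1/z_j^2} \leq n \min_j z_j^2 \leq n z_i^2,$$

where the last inequality holds for all $i = 1, \ldots, n$. This implies $x^2 \leq n^2 z_i^2$ for the observed test statistic and any study $i$ and with equation (4) we obtain

$$p = \frac{\Pr(\chi^2_1 \geq x^2)}{2^n} \geq \frac{\Pr(\chi^2_1 \geq n^2 z_i^2)}{2^n} = \frac{1 - \Phi(n z_i)}{2^{n-1}}.$$

If $p \leq \alpha$ is required for a claim of success at level $\alpha$, then obviously $[1 - \Phi(n z_i)]/2^{n-1} \leq \alpha$ must hold, which can be re-written as $z_i \geq \sqrt{c_\alpha}/n$ with the critical value $c_\alpha$ as defined in (5). The restriction on the corresponding $p$-values is

$$p_i \leq 1 - \Phi(\sqrt{c_\alpha}/n). \qquad (6)$$

This is a necessary but not sufficient restriction on the study-specific $p$-values for a claim of success.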

It is also possible to derive the corresponding sufficient bound. Assume all $p$-values are equal (i.e. $z_1 = \ldots = z_n = z$), then the condition $x^2 = n z^2 \geq c_\alpha$ implies $z \geq \sqrt{c_\alpha}/\sqrt{n}$. Note that the sufficient bound on $z_i$ differs from the corresponding necessary bound only by the multiplicative factor $\sqrt{n}$. The restriction on the corresponding $p$-values is

$$p_i \leq 1 - \Phi(\sqrt{c_\alpha}/\sqrt{n}). \qquad (7)$$

$\alpha$      bound        $n=2$     $n=3$    $n=4$   $n=5$   $n=6$
1/1600        necessary    0.065     0.17     0.26    0.32    0.37
              sufficient   0.016     0.053    0.099   0.15    0.20
1/31574       necessary    0.028     0.11     0.19    0.26    0.30
              sufficient   0.0034    0.017    0.041   0.071   0.10
1/3488556     necessary    0.0075    0.058    0.13    0.19    0.24
              sufficient   0.00029   0.0032   0.011   0.024   0.04

Table 2: Necessary and sufficient bounds on the one-sided study-specific $p$-values for overall significance level $\alpha$ and different number of studies $n$

The necessary and sufficient bounds in (6) and (7), respectively, are shown in Table 2 for $\alpha = 1/1600 = 0.025^2$ (the two-trials rule), $\alpha = 1/31574$ (the “four-sigma rule”) and $\alpha = 1/3488556$ (the “five-sigma rule”). For example, for $n = 2$ studies and level $\alpha = 1/1600$, the requirement $p_i \leq 0.065$, $i = 1, 2$, is necessary for claiming success. If one of the two studies has a $p$-value larger than 0.065, a claim of success at level $1/1600$ is thus impossible, no matter how small the other $p$-value is. The stricter requirement $p_i \leq 0.016$, $i = 1, 2$, is sufficient for a claim of success at that level. For the five-sigma level $\alpha = 1/3488556$, the necessary bound is 0.0075 for $n = 2$ and 0.058 for $n = 3$ studies. The sufficient bound is 0.00029 for $n = 2$ and 0.0032 for $n = 3$.
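The entries of Table 2 can be recomputed with a short Python sketch. It assumes that the necessary bound is $1 - \Phi(\sqrt{c_\alpha}/n)$ and the sufficient bound is $1 - \Phi(\sqrt{c_\alpha}/\sqrt{n})$, with $\sqrt{c_\alpha} = \Phi^{-1}(1 - 2^{n-1}\alpha)$; the function name is hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def pvalue_bounds(alpha, n):
    """Return (necessary, sufficient) bounds on the one-sided study p-values."""
    # Square root of the critical value of the chi-squared(1) null distribution.
    sqrt_c = STD_NORMAL.inv_cdf(1.0 - 2 ** (n - 1) * alpha)
    necessary = 1.0 - STD_NORMAL.cdf(sqrt_c / n)
    sufficient = 1.0 - STD_NORMAL.cdf(sqrt_c / math.sqrt(n))
    return necessary, sufficient


for denom in (1600, 31574, 3488556):
    for n in range(2, 7):
        nec, suf = pvalue_bounds(1.0 / denom, n)
        print(f"alpha=1/{denom}  n={n}  necessary={nec:.2g}  sufficient={suf:.2g}")
```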

3 Comparison with the two-trials rule

The two-trials rule for drug approval is usually implemented by requiring that each study is significant at the one-sided level $\alpha = 0.025$, so the probability of $n = 2$ significant positive trials when there is no treatment effect is $\alpha^2 = 0.000625$. Suppose both studies have a positive effect in the right direction and the observed test statistic (1) is $x^2$. The $p$-value (4) of the harmonic mean test now reduces to $p = \Pr(\chi^2_1 \geq x^2)/4$. A critical value for the test statistic (1) can also be calculated using (5). For level $0.025^2 = 0.000625$ and $n = 2$ we obtain the value 9.14.

Figure 1: Comparison of different approaches for drug approval as a function of the two $p$-values $p_1$ and $p_2$ (left) and the two $z$-values $z_1$ and $z_2$ (right), respectively. The acceptance region of the two-trials rule is shown in grey. The acceptance regions of the other methods are below (left) or above (right) the corresponding curves. All methods control the Type-I error rate at $0.025^2 = 0.000625$ except for the liberal version of the harmonic mean test, which has Type-I error rate of approximately 0.0014. The contour lines in the right plot represent the distribution of $z_1$ and $z_2$ under the alternative if the two studies have 80% power at the one-sided 2.5% significance level.

Figure 1 compares the region for drug approval based on the two-trials rule with the proposed harmonic mean test. Shown are two versions of the latter, the “controlled” version based on level $0.025^2$, i.e. critical value 9.14, and a “liberal” version with critical value $2 \cdot [\Phi^{-1}(0.975)]^2 = 7.68$. This has been computed by equating the right-hand side of (7) with 0.025 and solving for the critical value. The liberal version thus ensures that approval by the two-trials rule always leads to approval by the harmonic mean test. The Type-I error rate of the liberal version is approximately 0.0014, inflated by a factor of about 2.2 compared to the $0.025^2$ level.

Also shown in Figure 1 is the corresponding region for drug approval of the pooled and combined method, both controlled at Type-I error $0.025^2 = 0.000625$. Both methods compensate smaller intersections with the two-trials rejection region with additional regions of rejection where one of the trials shows only weak or even no evidence for an effect. It is interesting to see that the harmonic mean test is closer to the two-trials rule than Stouffer’s pooled or Fisher’s combined method, which is particularly visible on the $z$-scale shown in the right plot of Figure 1. The latter two suffer from the possibility of approval if one of the $p$-values is very small while the other one is far away from traditional significance. A highly significant $p$-value may actually guarantee approval through Fisher’s method, no matter how large the $p$-value from the other study is. This is not possible for Stouffer’s method, but it may still happen that the effects from the two studies go in different directions with the combined effect being significant. As a consequence, the sufficient $p$-value bound, shown in the left plot of Figure 1, is considerably smaller for the pooled (0.011) and combined (0.008) method than for the controlled harmonic mean test (0.016). These features make both the pooled and the combined method less suitable for drug approval.

For comparison, the two-trials rule has the necessary and sufficient conditions $p_i \leq 0.025$, $i = 1, 2$. The harmonic mean test can be significant only if both $p$-values are small (at most 0.065 for the controlled version). This has been discussed in Section 2 and can also be seen from Figure 2, which shows the conditional power for drug approval given the $p$-value from the first study. The values represent the power to detect the observed effect from the first study with a second study of equal design and sample size. The two-trials rule has conditional power as described by Goodman (1992), but with a discontinuity at 0.025. The power curves of the two harmonic mean tests (calculated as described in Held (2020, Section 4)) are smooth, quickly approaching zero at 0.065 (controlled) and 0.083 (liberal), respectively. Both the combined and the pooled method have longer tails with non-zero conditional power even for a larger $p$-value of the first study. The conditional power of both of these methods can be derived in closed form.
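A minimal sketch of such a conditional power calculation is given below for the two-trials rule and the controlled harmonic mean test. It rests on one explicit assumption made here: given the first study, the $z$-value of the second study is taken to be normal with mean $z_1$ and unit variance, which is one way to formalize “power to detect the observed effect from the first study with a second study of equal design and sample size”. The function names are hypothetical, and the formulas are derived from the approval criteria in this section rather than copied from the source.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def conditional_power_two_trials(p1, alpha=0.025):
    """Pr(z2 >= z_alpha | z1) if the first study is significant, else 0."""
    if p1 > alpha:
        return 0.0
    z1 = STD_NORMAL.inv_cdf(1.0 - p1)
    return STD_NORMAL.cdf(z1 - STD_NORMAL.inv_cdf(1.0 - alpha))


def conditional_power_harmonic(p1, level=0.025**2):
    """Pr(4 / (1/z1^2 + 1/z2^2) >= c and z2 > 0 | z1) for n = 2 studies."""
    z1 = STD_NORMAL.inv_cdf(1.0 - p1)
    if z1 <= 0:
        return 0.0
    c = STD_NORMAL.inv_cdf(1.0 - 2.0 * level) ** 2  # critical value for n = 2
    slack = 4.0 / c - 1.0 / z1**2  # what remains for 1/z2^2
    if slack <= 0:
        return 0.0  # success is impossible whatever the second study shows
    return 1.0 - STD_NORMAL.cdf(1.0 / math.sqrt(slack) - z1)


for p1 in (0.001, 0.01, 0.025, 0.05, 0.07):
    print(p1,
          round(conditional_power_two_trials(p1), 3),
          round(conditional_power_harmonic(p1), 3))
```

Under this assumption the two-trials rule drops from 0.5 to zero as the first $p$-value crosses 0.025, whereas the harmonic mean test decays smoothly and reaches zero only beyond the necessary bound.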

Figure 2: Power for drug approval at level $0.025^2 = 0.000625$ conditional on the one-sided $p$-value of the first study. Power values of exactly zero are omitted.

3.1 Project power

Of central interest in drug development is often the “project power” for a claim of success (Maca et al., 2002) before the two trials are conducted. It is well known (Matthews, 2006) that under the alternative that was used to power the study, the distribution of $Z_1$ (and $Z_2$) is $N(\mu, 1)$ where

$$\mu = \Phi^{-1}(1 - \alpha) + \Phi^{-1}(1 - \beta)$$

and $1 - \beta$ denotes the power of the study. We can thus simulate independent $Z_1$ and $Z_2$ for $\alpha = 0.025$ and different values of the power $1 - \beta$ and compute the proportion of trial results with drug approval at level $\alpha^2 = 0.000625$. This is shown in Table 3 for the different methods.

As expected, the two-trials rule gives project power equal to $(1 - \beta)^2$, since the two trials are assumed to be independent, each significant with probability $1 - \beta$. The project power of the Type-I error controlled harmonic mean test is 4 to 7 percentage points larger, depending on the power of the two studies. The project power of the combined and pooled methods is even larger, but this comes at the price that approval may be granted even if one of the trials was not sufficiently convincing on its own.

Power   two-trials rule   harmonic   combined   pooled
70      49                56         58         61
80      64                71         74         77
90      81                87         90         91
95      90                94         96         97

Table 3: The probability of drug approval (in %) as a function of the original power of the two studies
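The simulation behind the first two columns of Table 3 can be sketched as follows, using only the Python standard library. The sketch assumes that each $z$-value is drawn from $N(\mu, 1)$ with $\mu = \Phi^{-1}(1 - \alpha) + \Phi^{-1}(1 - \beta)$ and that the controlled harmonic mean test approves if both effects are positive and $4/(1/z_1^2 + 1/z_2^2)$ reaches the critical value for level $0.025^2$. The function name, seed and number of simulations are arbitrary choices made here.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def project_power(power, n_sim=100_000, alpha=0.025, seed=1):
    """Monte Carlo estimate of the project power of the two-trials rule and
    of the Type-I error controlled harmonic mean test for two studies."""
    rng = random.Random(seed)
    z_alpha = STD_NORMAL.inv_cdf(1.0 - alpha)
    mu = z_alpha + STD_NORMAL.inv_cdf(power)  # mean of z under the alternative
    crit = STD_NORMAL.inv_cdf(1.0 - 2.0 * alpha**2) ** 2  # level alpha^2, n = 2
    two_trials = harmonic = 0
    for _ in range(n_sim):
        z1, z2 = rng.gauss(mu, 1.0), rng.gauss(mu, 1.0)
        if z1 >= z_alpha and z2 >= z_alpha:
            two_trials += 1
        if z1 > 0 and z2 > 0 and 4.0 / (1.0 / z1**2 + 1.0 / z2**2) >= crit:
            harmonic += 1
    return two_trials / n_sim, harmonic / n_sim


for power in (0.70, 0.80, 0.90, 0.95):
    tt, hm = project_power(power)
    print(f"power {power:.0%}: two-trials {tt:.0%}, harmonic {hm:.0%}")
```

Up to Monte Carlo error, the estimates should be close to the corresponding columns of Table 3.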

3.2 Application

Two advantages of the proposed method are that it allows for weighting and is readily applicable to the case where results from more than 2 studies are available. For practical illustration, I revisit the data shown in Table 1 on the effect of Carvedilol on mortality. Note that all $p$-values are below the necessary success bound 0.32 at the level of the two-trials rule, compare Table 2. Only the $p$-value of study #239 is above the sufficient bound 0.15, otherwise we could already claim success with the unweighted harmonic mean test.

Fisher (1999a) reports Fisher’s combined $p$-value, which is 0.00013. Stouffer’s pooled test gives the $p$-value 0.00009. The harmonic mean test gives $p = 0.00048$ (unweighted) and $p = 0.00034$ (weighted), so somewhat larger values. For the latter the weights have been chosen inversely proportional to the squared standard errors of the associated log hazard ratios also shown in Table 1, see Appendix B for further details. Note that all these $p$-values are smaller than the two-trials threshold 0.000625. However, Stouffer’s weighted test does not meet the two-trials criterion, as its $p$-value exceeds 0.000625.

Suppose now that the $p$-value in study #223 (the largest study with the smallest standard error) is 2 times as large, i.e. 0.256 rather than 0.128. This would be considered as unimportant by many scientists, as both $p$-values are non-significant anyway and far away from the 0.025 significance threshold. Keeping the standard error of the log relative risk fixed, the relative risk reduction in this study is now 17% rather than 28%. But this change has a large effect on the proposed methods: The unweighted and weighted harmonic mean test $p$-values increase by a factor of 2.5 and 7.9 to 0.0012 and 0.0027, respectively, so both would now fail the criterion for drug approval. The $p$-values of the unweighted and weighted versions of Stouffer’s test increase by a factor of 2.3 and 3.5 to 0.00021 and 0.005, respectively. Fisher’s combined $p$-value increases only by a factor of 1.7 to 0.00022, which is still below the 0.000625 threshold, just as Stouffer’s unweighted test. This illustrates that the harmonic mean test is more sensitive to studies with unconvincing results, i.e. relatively small effect sizes with large $p$-values.
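The harmonic mean test $p$-values quoted in this section can be recomputed from Table 1 with the following sketch. It assumes the statistic $w^2/\sum_i w_i/z_i^2$ with $w = \sum_i \sqrt{w_i}$ (unit weights in the unweighted case), an overall $p$-value equal to the $\chi^2(1)$ tail probability divided by $2^n$, and weights equal to the inverse squared standard errors; the function name is hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def harmonic_mean_pvalue(pvalues, weights=None):
    """Overall p-value of the harmonic mean test from one-sided p-values
    below 0.5, so that all effects are positive."""
    n = len(pvalues)
    if weights is None:
        weights = [1.0] * n
    z = [STD_NORMAL.inv_cdf(1.0 - p) for p in pvalues]
    w = sum(math.sqrt(wi) for wi in weights)
    x2 = w**2 / sum(wi / zi**2 for wi, zi in zip(weights, z))
    return math.erfc(math.sqrt(x2 / 2.0)) / 2**n  # chi-squared(1) tail / 2^n


# Table 1: one-sided p-values and standard errors of the log hazard ratios
# (study order 240, 221, 220, 239, 223).
pvalues = [0.0245, 0.1305, 0.00025, 0.2575, 0.128]
std_errors = [0.85, 0.51, 0.41, 1.02, 0.29]
precisions = [1.0 / se**2 for se in std_errors]

original = (harmonic_mean_pvalue(pvalues),
            harmonic_mean_pvalue(pvalues, precisions))

# Sensitivity check: the p-value of study 223 is doubled to 0.256.
modified_p = pvalues[:4] + [0.256]
modified = (harmonic_mean_pvalue(modified_p),
            harmonic_mean_pvalue(modified_p, precisions))

print("original: unweighted %.5f, weighted %.5f" % original)
print("modified: unweighted %.4f, weighted %.4f" % modified)
```

The weighted result reacts more strongly to the change because study #223 has the smallest standard error and therefore the largest weight.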

4 Calibrating the sceptical $p$-value

Replication studies are conducted in order to investigate whether an original finding can be confirmed in an independent study. The sceptical $p$-value $p_S$ has been proposed in Held (2020) as a method to assess the degree of replication success. The sceptical $p$-value combines statistical significance of the two studies with a comparison of effect and sample sizes. A small sceptical $p$-value $p_S \leq \alpha_S$ can be interpreted as replication success at level $\alpha_S$.

The sceptical $p$-value depends on the two $p$-values $p_1$ and $p_2$ of the original and replication study and on the variance ratio $c$, which can be written as the sample size of the replication study relative to the sample size of the original study. If the two studies have the same sample size ($c = 1$), the sceptical $p$-value depends only on the two $p$-values $p_1$ and $p_2$ in a symmetric fashion. This has similarities to the two-trials rule which requires two independent significant studies for drug approval. As usual in this context we will consider only one-sided $p$-values where the standard significance level is $\alpha = 0.025$. Without loss of generality we consider the alternative $H_1$: $\theta > 0$ against the point null $H_0$: $\theta = 0$.

The framework in Held (2020) was developed for two-sided $p$-values but a one-sided version of the sceptical $p$-value has also been proposed. Assume that the effect estimates of both studies are positive. For $c = 1$, the one-sided sceptical $p$-value $p_S = 1 - \Phi(z_S)$ can be computed from

$$z_S^2 = \frac{1}{1/z_1^2 + 1/z_2^2} \qquad (8)$$

where $z_1 = \Phi^{-1}(1 - p_1)$ is the test statistic based on the one-sided $p$-value of the first study and $z_2 = \Phi^{-1}(1 - p_2)$ is the test statistic based on the one-sided $p$-value of the second study. The direct connection to the harmonic mean test statistic (1), $z_S^2 = x^2/4$, makes it easy to choose the threshold $\alpha_S$ such that the sceptical $p$-value has the same Type-I error as the two-trials rule with Type-I error $0.025^2 = 0.000625$. With (5) we obtain the threshold $z_S^2 \geq 9.14/4 = 2.285$ for (8), which corresponds to a one-sided sceptical $p$-value threshold of $\alpha_S = 0.065$. It is no accident that this is exactly the same value as the necessary bound on the individual $p$-values for a claim of success, as shown in Table 2. The reason is that the property $z_S^2 \leq \min\{z_1^2, z_2^2\}$ combined with the requirement $p_S \leq \alpha_S$ implies that $p_i \leq \alpha_S$, $i = 1, 2$, must hold.

Alternatively we may pick the nominal threshold $\alpha_S = \alpha = 0.025$. The Type-I error rate of the nominal threshold is approximately $2.2 \times 10^{-5}$, about 28 times smaller than $\alpha^2 = 0.000625$. One may also re-consider the liberal threshold, i.e. the smallest threshold where $p_1 \leq \alpha$ and $p_2 \leq \alpha$ is sufficient for $p_S \leq \alpha_S$. For $\alpha = 0.025$ we obtain the liberal threshold $\alpha_S = 0.083$ and we know from Section 3 that the Type-I error rate of the liberal threshold is about 2.2 times larger than that of the controlled threshold $\alpha_S = 0.065$. All three thresholds are displayed for various values of $\alpha$ in Figure 3 together with the corresponding Type-I error rate values.
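The three thresholds and their Type-I error rates can be checked numerically with the sketch below. It assumes two studies of equal size, a one-sided sceptical $p$-value of $1 - \Phi(z_S)$ with $z_S^2 = 1/(1/z_1^2 + 1/z_2^2)$, and the relation $x^2 = 4 z_S^2$ to the harmonic mean test statistic, so that the Type-I error rate of a threshold is $\Pr(\chi^2_1 \geq 4 z_S^2)/4$. The names are hypothetical and the printed figures are computed here rather than quoted from the source.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()
ALPHA = 0.025


def type1_error(threshold):
    """Type-I error rate of declaring success if the one-sided sceptical
    p-value is at most `threshold` (two studies of equal size, threshold < 0.5)."""
    z_s = STD_NORMAL.inv_cdf(1.0 - threshold)
    # Pr(chi2_1 >= 4 z_s^2) = erfc(sqrt(2) * z_s); divide by 2^n = 4.
    return math.erfc(math.sqrt(2.0) * z_s) / 4.0


# Controlled: same Type-I error rate as the two-trials rule (ALPHA^2).
sqrt_c = STD_NORMAL.inv_cdf(1.0 - 2.0 * ALPHA**2)  # sqrt of critical value, n = 2
controlled = 1.0 - STD_NORMAL.cdf(sqrt_c / 2.0)
# Nominal: the usual one-sided significance level.
nominal = ALPHA
# Liberal: smallest threshold implied by both p-values being at most ALPHA.
liberal = 1.0 - STD_NORMAL.cdf(STD_NORMAL.inv_cdf(1.0 - ALPHA) / math.sqrt(2.0))

for name, thr in [("controlled", controlled), ("nominal", nominal), ("liberal", liberal)]:
    print(f"{name:10s} threshold={thr:.3f}  type-I error={type1_error(thr):.2e}")
```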

Figure 3: The three thresholds for the one-sided sceptical $p$-value as a function of $\alpha$ (left). The right plot gives the corresponding Type-I error rate curves, where the controlled threshold has Type-I error rate $\alpha^2$.

5 Discussion

There is considerable variation of clinical trial evidence for newly approved therapies (Downing et al., 2014). New methods are required to provide better inferences for the assessment of pivotal trials supporting novel therapeutic approval. The harmonic mean test is an attractive alternative to the two-trials rule as it has more power at the same Type-I error rate and avoids the evidence paradoxes that may occur close to the 0.025 threshold. It provides a principled extension to substantiate research findings from more than two trials, requesting each trial to be convincing on its own, and allows for weights.

The method implicitly assumes that each of the individual trials is well-powered for realistic treatment effects. The risk that the harmonic mean test fails increases substantially if some of the trials have low power. Implementation of this new method may therefore be seen as a means to ensure sufficiently powered individual studies.

The proposed method is different from the harmonic mean $p$-value (Good, 1958; Wilson, 2019; Held, 2019), where the null distribution is more difficult to compute (Wilson, 2019, Section 1 of Supplementary Material). The harmonic mean test is not directly linked to an effect estimate and a confidence interval. However, the test could easily be inverted to obtain a confidence interval based on study-specific test statistics for non-zero means. More research is needed to investigate this possibility.

The two-trials rule is the standard for many indications, including many neurodegenerative and cardiovascular diseases. However, approval of treatments in areas of high medical need may not follow the two-trials rule. An alternative approach is conditional approval based on “adaptive pathways” (European Medical Agency, 2016), where a temporary license is granted based on an initial positive trial. A second post-marketing clinical trial is then often required to confirm or revoke the initial decision (Zhang et al., 2019). This setting has much in common with replication studies that try to confirm original results in independent investigations (Roes, 2020) and the re-calibration of the sceptical $p$-value described in Section 4 may be useful to explore in this setting.

Acknowledgments

I am grateful to Karen Kafadar, Meinhard Kieser and Martin Posch for helpful discussions and suggestions. Support by the Swiss National Science Foundation (Project # 189295) is gratefully acknowledged.

References

  • Collett (2003) Collett, D. (2003). Modelling Survival Data in Medical Research. Chapman & Hall, 2nd edition.
  • Cousins (2007) Cousins, R. D. (2007). Annotated bibliography of some papers on combining significances or p-values. https://arxiv.org/abs/0705.2209.
  • Darken and Ho (2004) Darken, P. F. and Ho, S.-Y. (2004). A note on sample size savings with the use of a single well-controlled clinical trial to support the efficacy of a new drug. Pharmaceutical Statistics, 3(1):61–63.
  • Downing et al. (2014) Downing, N. S., Aminawung, J. A., Shah, N. D., Krumholz, H. M., and Ross, J. S. (2014). Clinical trial evidence supporting FDA approval of novel therapeutic agents, 2005-2012. JAMA: The Journal of the American Medical Association, 311(4):368–377. https://dx.doi.org/10.1001/jama.2013.282034.
  • European Medical Agency (2016) European Medical Agency (2016). Adaptive pathways workshop - Report on a meeting with stakeholders held at EMA on Thursday 8 December 2016. https://www.ema.europa.eu/en/documents/report/adaptive-pathways-workshop-report-meeting-stakeholders-8-december-2016_en.pdf.
  • FDA (1998) FDA (1998). Providing clinical evidence of effectiveness for human drug and biological products. Technical report, US Food and Drug Administration. www.fda.gov/regulatory-information/search-fda-guidance-documents/providing-clinical-evidence-effectiveness-human-drug-and-biological-products.
  • Fisher (1999a) Fisher, L. D. (1999a). Carvedilol and the Food and Drug Administration (FDA) approval process: The FDA paradigm and reflections on hypothesis testing. Controlled Clinical Trials, 20(1):16 – 39.
  • Fisher (1999b) Fisher, L. D. (1999b). One large, well-designed, multicenter study as an alternative to the usual FDA paradigm. Drug Information Journal, 33(1):265–271.
  • Fisher (1958) Fisher, R. A. (1958). Statistical Methods for Research Workers. Oliver & Boyd, Edinburgh, 13th ed.(rev.) edition.
  • Good (1955) Good, I. J. (1955). On the weighted combination of significance tests. Journal of the Royal Statistical Society. Series B (Methodological), 17(2):264–265.
  • Good (1958) Good, I. J. (1958). Significance tests in parallel and in series. Journal of the American Statistical Association, 53(284):799–813. http://dx.doi.org/10.1080/01621459.1958.10501480.
  • Goodman (1992) Goodman, S. N. (1992). A comment on replication, p-values and evidence. Statistics in Medicine, 11(7):875–879. https://doi.org/10.1002/sim.4780110705.
  • Held (2019) Held, L. (2019). On the Bayesian interpretation of the harmonic mean p-value. Proceedings of the National Academy of Sciences. https://doi.org/10.1073/pnas.1900671116.
  • Held (2020) Held, L. (2020). A new standard for the analysis and design of replication studies (with discussion). Journal of the Royal Statistical Society, Series A. To appear, https://arxiv.org/abs/1811.10287.
  • Hlavin et al. (2016) Hlavin, G., Koenig, F., Male, C., Posch, M., and Bauer, P. (2016). Evidence, eminence and extrapolation. Statistics in Medicine, 35(13):2117–2132. https://doi.org/10.1002/sim.6865.
  • Kay (2015) Kay, R. (2015). Statistical Thinking for Non-Statisticians in Drug Regulation. John Wiley & Sons, Chichester, U.K., second edition.
  • Maca et al. (2002) Maca, J., Gallo, P., Branson, M., and Maurer, W. (2002). Reconsidering some aspects of the two-trials paradigm. Journal of Biopharmaceutical Statistics, 12(2):107–119.
  • Matthews (2006) Matthews, J. N. (2006). Introduction to Randomized Controlled Clinical Trials. Chapman & Hall/CRC, second edition.
  • Nolan (2018) Nolan, J. P. (2018). Stable Distributions - Models for Heavy Tailed Data. Birkhauser, Boston. In progress, Chapter 1 online at http://fs2.american.edu/jpnolan/www/stable/chap1.pdf.
  • Roes (2020) Roes, K. C. B. (2020). Discussion of "A new standard for the analysis and design of replication studies" by Leonhard Held. Journal of the Royal Statistical Society, Series A. To appear.
  • Rosenkrantz (2002) Rosenkrantz, G. (2002). Is it possible to claim efficacy if one of two trials is significant while the other just shows a trend? Drug Information Journal, 36(1):875–879.
  • Senn (2007) Senn, S. (2007). Statistical Issues in Drug Development. John Wiley & Sons, Chichester, U.K., second edition.
  • Shun et al. (2005) Shun, Z., Chi, E., Durrleman, S., and Fisher, L. (2005). Statistical consideration of the strategy for demonstrating clinical evidence of effectiveness—one larger vs two smaller pivotal studies. Statistics in Medicine, 24(11):1619–1637.
  • Uchaikin and Zolotarev (1999) Uchaikin, V. V. and Zolotarev, V. M. (1999). Chance and Stability: Stable Distributions and Their Applications. Walter de Gruyter.
  • Wilson (2019) Wilson, D. J. (2019). The harmonic mean p-value for combining dependent tests. Proceedings of the National Academy of Sciences. https://doi.org/10.1073/pnas.1814092116.
  • Zhang et al. (2019) Zhang, A. D., Puthumana, J., Downing, N. S., Shah, N. D., Krumholz, H., and Ross, J. S. (2019). Clinical trial evidence supporting FDA approval of novel therapeutic agents over three decades, 1995-2017: Cross-sectional analysis. medRxiv. http://dx.doi.org/10.1101/19007047.

Appendix A The null distribution of the harmonic mean test statistic

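Under the null hypothesis, $Z_i$, $i = 1, \ldots, n$, is standard normally distributed, so $Z_i^2$ is $\chi^2$ with 1 degree of freedom, i.e. a gamma $G(1/2, 1/2)$ distribution. The random variable $Y_i = 1/Z_i^2$ is therefore inverse gamma distributed, $Y_i \sim IG(1/2, 1/2)$, also known as the standard Lévy distribution: $Y_i \sim \text{Lévy}(0, 1)$. More generally, the $IG(1/2, b/2)$ distribution corresponds to the $\text{Lévy}(0, b)$ distribution and belongs to the class of stable distributions (Uchaikin and Zolotarev, 1999, Section 2.3).

Now $Z_1, \ldots, Z_n$ are assumed to be independent, so $Y_1, \ldots, Y_n$ are also independent and we are interested in the distribution of the sum $\sum_{i=1}^{n} Y_i$, compare equation (1). The standard Lévy distribution is known to be stable, which means that the sum of independent standard Lévy random variables is again a Lévy random variable: $\sum_{i=1}^{n} Y_i \sim \text{Lévy}(0, n^2)$, which corresponds to an $IG(1/2, n^2/2)$ distribution. Therefore $1/\sum_{i=1}^{n} Y_i$ follows a $G(1/2, n^2/2)$ distribution and $X^2 = n^2/\sum_{i=1}^{n} Y_i$ follows a $G(1/2, 1/2)$, i.e. a $\chi^2$-distribution with one degree of freedom.

The weighted version $\sum_{i=1}^{n} w_i Y_i$ is also a Lévy random variable, $\sum_{i=1}^{n} w_i Y_i \sim \text{Lévy}(0, w^2)$, where $w = \sum_{i=1}^{n} \sqrt{w_i}$, see Nolan (2018, Proposition 1.17). Therefore $X_w^2 = w^2/\sum_{i=1}^{n} w_i Y_i$ also follows a $\chi^2$-distribution with one degree of freedom.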
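The distributional claim can be checked empirically with a small Monte Carlo sketch, which is an illustration added here and not part of the formal argument. It assumes the statistic $n^2/\sum_i 1/Z_i^2$ with independent standard normal $Z_i$ and estimates the probability of exceeding 3.841, the 95% quantile of the $\chi^2$-distribution with one degree of freedom. The function name, seed and simulation size are arbitrary choices.

```python
import random

CHISQ1_Q95 = 1.959964**2  # 95% quantile of chi-squared(1), about 3.841


def null_exceedance(n_studies, n_sim=50_000, seed=1):
    """Estimate Pr(X^2 > CHISQ1_Q95) when all z-scores are standard normal."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_sim):
        z = [rng.gauss(0.0, 1.0) for _ in range(n_studies)]
        x2 = n_studies**2 / sum(1.0 / zi**2 for zi in z)
        if x2 > CHISQ1_Q95:
            exceed += 1
    return exceed / n_sim


for n in (2, 5, 10):
    print(n, null_exceedance(n))
```

If the null distribution does not depend on the number of studies, all three estimates should be close to 0.05.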

Appendix B Further details on the Carvedilol example

The data shown in Table 1 are taken from Fisher (1999a, Table 1) for the outcome mortality. The discussion in Fisher (1999a, page 17) suggests that the $p$-values reported in the table come from a log-rank test. The relative risks reported in the table appear to be “instantaneous relative risks”, i.e. hazard ratios. I have calculated the standard error of the log hazard ratios from the limits of the 95% confidence intervals also reported in the table. Note that there is an apparent discrepancy between the $p$-value and the confidence interval reported for Study 240, with the one-sided log-rank $p$-value being just significant ($p = 0.0245$) whereas the 95% confidence interval for the hazard ratio runs from 0.04 to 1.14 and includes the reference value 1. Leaving rounding errors aside, the corresponding one-sided $p$-value from a Wald test is $p = 0.038$. This does not much affect the harmonic mean test but the two-trials rule would obviously no longer be fulfilled. The difference between log-rank and Wald is still surprising, but a similar example has been reported in Collett (2003, Example 3.3). I have decided to use the log-rank $p$-values as reported, whereas the standard errors of log hazard ratios are only used to weight the harmonic mean test statistic. Finally note that mortality was not the primary endpoint of the different studies, but Fisher (1999a) argues that “it is the most important endpoint” and “almost always of primary importance to patients and their loved ones”.