Why "Redefining Statistical Significance" Will Not Improve Reproducibility and Could Make the Replication Crisis Worse

11/21/2017
by Harry Crane
Rutgers University

A recent proposal to "redefine statistical significance" (Benjamin et al., Nature Human Behaviour, 2017) claims that false positive rates "would immediately improve" by factors greater than two and that replication rates would double simply by changing the conventional cutoff for 'statistical significance' from P < 0.05 to P < 0.005. I analyze the veracity of these claims, focusing especially on how Benjamin et al. neglect the effects of P-hacking in assessing the impact of their proposal. My analysis shows that once P-hacking is accounted for, the perceived benefits of the lower threshold all but disappear, prompting two main conclusions: (i) the claimed improvements to false positive rate and replication rate in Benjamin et al. (2017) are exaggerated and misleading; (ii) there are plausible scenarios under which the lower cutoff will make the replication crisis worse.


1. Introduction

The proposal to “redefine statistical significance” [2] (henceforth abbreviated as the ‘RSS proposal’ or simply ‘RSS’) is intended to counteract the so-called ‘reproducibility crisis’, i.e., the disproportionate fraction of statistically significant scientific findings that cannot be replicated by subsequent studies. In championing their proposal, the signatories of RSS claim, “This simple step would immediately improve the reproducibility of scientific research in many fields.” The authors go on to assert that “false positive rates would typically fall by factors greater than two” and suggest that replication rates would roughly double under their proposal.

In ignoring the effects of P-hacking on false positive rate and replication rate, RSS misses the whole point of the reproducibility crisis. By appealing to the same formal technique and empirical evidence [17] used to support the RSS proposal, I will unmask major conceptual and technical flaws in the RSS argument. (The analysis in Section 4 is based on the replication study in [17], which was conducted on a sample of 97 results in psychology. I do not analyze the results in [3], which performs a similar study for findings in experimental economics but on a much smaller sample of 18 studies.) The analysis presented here is not a counterproposal to RSS, but rather a refutation intended to elucidate the proposal's flaws and thereby neutralize the potential damage that would result from its implementation. (The analysis below focuses on the potential impact of the proposal concerning statistical 'significance'. The RSS proposal also includes a recommendation to regard findings with 0.005 < P < 0.05 as 'suggestive'. The motivation for this suggestion, its justification, and its perceived impact are not clearly articulated in [2], and thus will not be discussed further here.) Our analysis may be seen as complementary to, but should not be read in any way as an endorsement of, the critiques and alternative proposals by other authors [1, 12, 16, 21]. I discuss this last point further in Section 5.

Brief summary

P-hacking and the reproducibility crisis: like smoking and lung cancer, one cannot be discussed without the other [7, 9, 22]. (For simplicity, I use the term 'P-hacking' to refer to any unsound statistical practice used in justifying a scientific finding by a significant P-value, including but not limited to "cherry-picking, P-hacking, hunting for significance, selective reporting, multiple testing and other biasing selection effects" [15].) Most salient for our purpose here is that P-values obtained by P-hacking do not warrant the same interpretation as the standard theory assumes. To be clear, I do not use the term 'P-hacking' in a pejorative sense. For the purposes it is employed here, P-hacking need not be intentional or done with malicious intent. What is important, however, is that P-hacking cannot be ignored when studying the effect of the RSS and other proposals on reproducibility. Yet the RSS analysis does just that. Because they do not intend their proposal to combat P-hacking directly, the advocates of RSS seem to think that they can set it aside. But in ignoring the effects of P-hacking, the authors of RSS arrive at overly optimistic projections and tout misleading conclusions about the "immediate" benefits of their proposal. With P-hacking accounted for, we arrive at much more realistic, and sobering, conclusions about the potential impact of redefining statistical significance.

Figure 1. False positive rate (4) for different significance levels (α = 0.05 and α = 0.005) and hacking rates (γ). Solid lines correspond to the false positive rate without P-hacking (γ = 0); dashed lines to FPR with γ = 0.05 (i.e., 5% of all P-values are hacked); and dotted lines to FPR with γ = 0.15 (i.e., 15% of all P-values are hacked).

To foreshadow the analysis presented in Section 4, Figure 1 plots the false positive rate (FPR) against statistical power for different combinations of significance level and P-hacking rate (i.e., the proportion of all P-values obtained by P-hacking). The solid lines, which correspond to the traditional false positive rate in the absence of P-hacking (see Equation (1) below), are also shown as part of [2, Figure 2]. These lines suggest a substantial improvement to FPR under the reduced significance level: for statistical power of 0.80, the false positive rate is projected to decrease from 0.38 to 0.06. But if 15% of all P-values are hacked, then the false positive rate would decrease only from 0.75 to 0.71, about a 5% improvement, as a result of the lower cutoff. (The hacking rates γ = 0.05 and γ = 0.15 shown in Figure 1 are the extreme estimates obtained in Section 4, which, based on the replication study in [17], suggests that between 5% and 15% of all P-values are hacked.) And even at the low end of our estimate for the prevalence of P-hacking (i.e., γ = 0.05), the false positive rate would only improve from 0.57 to 0.44. Regardless of how much the false positive rate improves in relative terms, the end result (44%-71% false positives) is hardly worth celebrating: the false positive rate is bound to remain much higher than the 0.06 level suggested in [2]. In direct conflict with the main argument in [2], Figure 1 illustrates in stark terms the deception lurking in the RSS argument and makes clear that decreasing the significance level will hardly make a dent in the reproducibility crisis.

The forthcoming analysis develops the framework from which Figure 1 is derived. Though the analysis is set in the context of the RSS proposal, the principles underlying it are relevant beyond the specific numbers (i.e., 0.05 versus 0.005) or methods under consideration (e.g., hypothesis testing, Bayes factors, confidence intervals, etc.). P-hacking and other forms of statistical misuse and malpractice are endemic to science at all levels. Their role in the reproducibility crisis cannot be ignored, especially when evaluating proposals such as the one considered here.

2. Null Hypothesis Significance Testing (NHST)

Under the Null Hypothesis Significance Testing (NHST) paradigm, a null hypothesis (H₀) is tested against an alternative hypothesis (H₁) by computing a P-value, defined as the probability (under a true null hypothesis) that a certain test statistic attains a value as or more extreme than what is actually observed. Small P-values, which correspond to observations that are unlikely to have occurred if the null hypothesis were true, are interpreted as evidence against H₀. In fields adhering to the NHST paradigm, it is conventional to 'reject H₀' and confer the label of 'statistical significance' whenever a P-value falls below a prescribed threshold α.

                        H₀ true               H₀ false
Proportion              1/(1 + φ)             φ/(1 + φ)
Reject H₀               α/(1 + φ)             (1 − β)φ/(1 + φ)
Not Reject H₀           (1 − α)/(1 + φ)       βφ/(1 + φ)

Table 1. Relative proportion of null hypotheses falling under each possible combination of true/false and reject/not reject for a family of statistical tests with Type-I error rate α, Type-II error rate β, and prior odds φ.

The NHST protocol is prone to two types of error:

  • Type-I error: rejecting H₀ when H₀ is true; and

  • Type-II error: failing to reject H₀ when H₀ is false.

Since, when H₀ is true, the corresponding P-value is distributed uniformly on [0, 1], the probability of a Type-I error (i.e., obtaining 'P < α' when H₀ is true) is α. (In particular, when α = 0.05, the probability of Type-I error in any given test is 0.05.) For a given Type-I error probability α, the Type-II error rate, denoted β, is the probability of failing to reject a false null hypothesis. The power 1 − β is the probability of correctly rejecting a false null hypothesis. Though norms vary among disciplines, it is conventional in many fields to trade off between Type-I and Type-II error by aiming for 80% power at the 5% significance level. We use these as benchmarks in the empirical analysis below (see Section 4).
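For concreteness, a minimal simulation sketch of these two error rates (the effect size of 2.8 standard errors below is chosen only because it yields roughly 80% power at the 5% level in a two-sided z-test):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    alpha, delta, n_sim = 0.05, 2.8, 100_000   # delta: assumed effect size in standard errors

    # Test statistics under a true null (mean 0) and under a false null (mean delta).
    z_null = rng.normal(0.0, 1.0, n_sim)
    z_alt = rng.normal(delta, 1.0, n_sim)

    # Two-sided P-values; under the true null these are uniform on [0, 1].
    p_null = 2 * stats.norm.sf(np.abs(z_null))
    p_alt = 2 * stats.norm.sf(np.abs(z_alt))

    print("Type-I error rate:", np.mean(p_null < alpha))   # close to alpha = 0.05
    print("Power (1 - beta) :", np.mean(p_alt < alpha))    # close to 0.80 for delta = 2.8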

2.1. False positive rate

For a collection of statistical tests, the false positive rate (FPR) is the proportion of significant P-values (with P < α) obtained under a true null hypothesis. Since the FPR reflects the proportion of significant P-values obtained in a collection of hypothesis tests, it depends on the proportion of all tests for which H₀ is true, in addition to the Type-I and Type-II error rates. (This proportion is sometimes quoted in terms of the prior odds φ in favor of H₁, which is the relative proportion of false to true null hypotheses among all those tested.) For tests conducted under the standard protocol with Type-I and Type-II error probabilities α and β and prior odds φ, the false positive rate is given by

(1)    FPR(α, β, φ) = α / (α + (1 − β)φ).

These quantities are summarized in Table 1.
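As a quick numerical check of Equation (1), a minimal sketch using the benchmarks assumed in this paper (power 0.80 and prior odds 1:10) reproduces the headline figures cited from [2]:

    def fpr(alpha, power, phi):
        """False positive rate under sound NHST, Equation (1)."""
        return alpha / (alpha + power * phi)

    phi, power = 1 / 10, 0.80      # prior odds 1:10, conventional 80% power
    print(round(fpr(0.05, power, phi), 2))    # ~0.38 under the 0.05 cutoff
    print(round(fpr(0.005, power, phi), 2))   # ~0.06 under the 0.005 cutoff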

2.2. Replication Rate

Figure 2. Plot of false positive rate (solid) and replication rate (dashed) at different levels of power, for prior odds φ = 1/10 and significance levels 0.05 and 0.005. The plot illustrates the complementary relationship between FPR and RR under the perfect reproducibility assumption, as shown in (3).

We consider a significant result (P < α) to be reproducible (or replicable) if it is verifiable by subsequent testing. Attempts to replicate often appeal to the same or similar statistical methods as the initial study, and are thus prone to the same sources of statistical error. For a precise analysis of the replication rate, we rule out the possibility of Type-I and Type-II error in replication attempts and assume that 100% of true positives and 0% of false positives are replicable. (As a practical matter, this assumption is reasonable because replication studies typically seek to achieve high power at the 0.05 significance level, minimizing the occurrence of false positives and false negatives in replication attempts.) This assumption, which we call perfect reproducibility, is in the spirit of the reproducibility discussion in that it treats replication as a property of the finding itself: if the finding is true, then it is replicable; if it is false, then it is not. Under perfect reproducibility, the replication rate for a family of tests with significance level α, power 1 − β, and prior odds φ is the proportion of true positives,

(2)    RR(α, β, φ) = (1 − β)φ / (α + (1 − β)φ).

Comparing (1) and (2), we observe the complementary relationship between replication rate and false positive rate,

(3)    RR(α, β, φ) = 1 − FPR(α, β, φ).

See Figure 2 for a plot of FPR and RR against power for significance levels α = 0.05 and α = 0.005.
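The corresponding replication rates under perfect reproducibility follow from (2) and (3); a short sketch under the same assumed benchmarks gives the 62% figure that reappears in Section 4 as the predicted replication rate:

    def rr(alpha, power, phi):
        """Replication rate under perfect reproducibility, Equation (2)."""
        return power * phi / (alpha + power * phi)

    phi, power = 1 / 10, 0.80
    for alpha in (0.05, 0.005):
        # Complementary relationship (3): RR = 1 - FPR.
        fpr = alpha / (alpha + power * phi)
        print(alpha, round(rr(alpha, power, phi), 2), round(1 - fpr, 2))   # ~0.62 and ~0.94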

3. NHST with P-hacking

When analyzing the impact of P-hacking on replication, it is important to distinguish between the protocol of NHST and the policy for assigning the label of 'statistical significance' to small P-values. In a sound application of NHST, as assumed in Section 2, the protocol used to obtain the P-value is independent of the policy for conferring statistical significance. The calculations of false positive rate and replication rate in Section 2 are thus valid in a world in which all P-values are obtained independently of the prevailing cutoff, as the NHST paradigm prescribes. In such a world, all Type-I errors occur purely by chance, with probability α, and both FPR and RR can be substantially improved by decreasing the significance level while holding power fixed, or alternatively by increasing power while holding the significance level fixed, as Figure 2 indicates. That world is a far cry from the world in which we live.

In the real world, many (and perhaps most) Type-I errors are the result of systematic departures from the NHST protocol which cause the observed false positive rate to be much larger than predicted by (1). (These systematic departures are referred to generically as ‘P-hacking’ here.) As a consequence, the standard theory of Section 2 can no longer be applied to assess the false positive rate and replication rate in the presence of P-hacking.

To account for the effects of P-hacking, we distinguish between sound and unsound P-values:

  • A sound P-value is one for which the standard interpretation is valid (i.e., the probability, under the null hypothesis, that the test statistic attains a value as extreme or more extreme than observed).

  • An unsound (or hacked) P-value is one which does not warrant the above interpretation because proper statistical protocol was not followed.

For our purposes here, neither the specific type of P-hacking (e.g., multiple testing, selective reporting, etc.) nor whether the P-hacking was intentional is relevant. What matters is that (i) P-values obtained by P-hacking do not warrant the same interpretation as those obtained under the standard paradigm and (ii) since the P-hacking protocol depends on the policy (i.e., the cutoff value), the effect of a new policy on hacked P-values cannot be studied empirically based on data obtained under the present policy. When considering the likely behavior of hacked P-values under the new cutoff, we cannot ignore that the scientific teams behind these P-values have been successful both in obtaining a significant P-value at the current significance level and in justifying why their hacked P-value gives a finding worthy of publication. When the significance threshold changes, we can expect that these same scientists will still be successful in finding ways to obtain significant, publishable results. So, whereas the behavior of sound P-values can be treated as absolute—by virtue of being sound, their behavior is unaffected by policy changes—the behavior of unsound P-values must be interpreted relative to the prevailing policy. (In making this observation, I do not mean to represent P-hacking as a universally malicious, subversive activity by which ne'er-do-well scientists try to make a P-value as small as possible at all costs. In many cases, scientists have been trained that what we are here calling P-hacking is a legitimate research technique. I discuss practical matters of this sort in Section 5.)

For example, a hacked P-value of 0.045 under the current 0.05 significance regime should not be expected to remain at 0.045 after the cutoff changes to 0.005. By virtue of being hacked, this P-value is already smaller than it ought to be, as a consequence of so-called ‘researcher degrees of freedom’. If the significance policy were changed to 0.005, this same P-value should be expected to decrease, and would likely end up below 0.005 upon further application of said degrees of freedom.

3.1. False positive rate and replication rate under P-hacking

The classification of P-values as sound and unsound identifies two different regimes to consider when computing FPR. Of all computed P-values, we write γ to denote the proportion obtained by unsound protocol (i.e., not according to the theory of Section 2). With α₀ fixed as a baseline significance level, we assume that this entire proportion is significant at level α₀ when the prevailing significance cutoff is α₀. (In making this assumption, we ignore attempts at P-hacking which fail to obtain a P-value less than α₀. Since these P-values are never labeled 'significant', they are assumed not to be observed in the scientific literature and therefore do not contribute to the reproducibility crisis.) The remaining proportion 1 − γ of P-values are sound and behave according to the analysis in Section 2.

                        H₀ true                                   H₀ false
                        Sound                     Unsound         Sound
Proportion              (1 − γ)/(1 + φ)           γ               (1 − γ)φ/(1 + φ)
Reject H₀               (1 − γ)α₀/(1 + φ)         γ               (1 − γ)(1 − β)φ/(1 + φ)
Not Reject H₀           (1 − γ)(1 − α₀)/(1 + φ)   0               (1 − γ)βφ/(1 + φ)

Table 2. Relative proportion of null hypotheses corresponding to true/false positives/negatives in the presence of P-hacking. The proportion 1 − γ of sound tests has Type-I error rate α₀, Type-II error rate β, and prior odds φ. The table assumes α₀ as the baseline significance level, so that 100% of the proportion γ of hacked P-values are significant at level α₀. By setting γ = 0, these proportions coincide with those in Table 1.

The proportions of significant P-values of each type (i.e., sound true positive, sound false positive, and unsound false positive) for a given choice of γ and α₀ are now

    sound true positive:    (1 − γ)(1 − β)φ/(1 + φ),
    sound false positive:   (1 − γ)α₀/(1 + φ),
    unsound false positive: γ.

See Table 2 for a breakdown of these proportions. (Since the primary contribution to the reproducibility crisis is the disproportionate number of false positives obtained by P-hacking, we also rule out the possibility of obtaining a 'true positive' via P-hacking. Under perfect reproducibility, such findings would be reproducible, and thus would not contribute to reproducibility issues.) With the addition of γ, the false positive rate is now computed as

(4)    FPR(α₀, β, φ, γ) = [γ + (1 − γ)α₀/(1 + φ)] / [γ + (1 − γ)(α₀ + (1 − β)φ)/(1 + φ)].

Under the assumption of perfect reproducibility, we once again observe the complementary relationship between false positive rate and replication rate

(5)    RR(α₀, β, φ, γ) = 1 − FPR(α₀, β, φ, γ).

By setting γ = 0 we immediately recover (1), thus observing the precise sense in which the calculations in Section 2 ignore P-hacking.
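A minimal sketch of Equation (4), under the same assumed benchmarks (the hacking rates 0.05 and 0.15 anticipate the estimates obtained in Section 4), reproduces the with-hacking false positive rates quoted in the introduction:

    def fpr_hacked(alpha, power, phi, gamma):
        """False positive rate with a proportion gamma of hacked P-values, Equation (4).

        Every hacked P-value is assumed significant at the prevailing cutoff alpha
        and counted as a false positive (cf. Table 2)."""
        sig_false = gamma + (1 - gamma) * alpha / (1 + phi)
        sig_true = (1 - gamma) * power * phi / (1 + phi)
        return sig_false / (sig_false + sig_true)

    phi, power = 1 / 10, 0.80
    for gamma in (0.0, 0.05, 0.15):
        print(gamma, round(fpr_hacked(0.05, power, phi, gamma), 2))
        # gamma = 0    -> ~0.38 (recovers Equation (1))
        # gamma = 0.05 -> ~0.57
        # gamma = 0.15 -> ~0.75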

3.2. P-value distributions

To analyze how each of the three classes of significant P-values at level α₀ will behave upon lowering the cutoff to α, for α < α₀, we treat the behavior of sound P-values as absolute—sound P-values do not depend on the policy used in determining significance—and unsound P-values as relative—in the absence of other information, an unsound P-value satisfying P < α₀ when the cutoff is α₀ should also be expected to satisfy P < α when the cutoff is changed to α. The key point is that hacked P-values are not hacked specifically to be below α₀. They are hacked to be below the prevailing significance threshold, which just so happens to be α₀. When the threshold changes, so does the target, and likely so will the P-value.

A key consideration when assessing the impact of a regime change is the extent to which hacked P-values under one regime will persist (i.e., continue to be significant) under the new regime. We write F_α(t), 0 ≤ t ≤ α₀, to denote the proportion of hacked P-values in the range [0, t] when the significance cutoff is α, for α ≤ α₀. Since a hacked P-value is not expected to increase in value upon lowering the cutoff, we assume that these distributions are monotone with respect to the significance level, in the sense that a hacked P-value which is less than t at level α₀ will also be less than t if the cutoff is decreased to α. Thus, we assume F_α(t) ≥ F_{α₀}(t) for all t and all α ≤ α₀. In particular, by setting the baseline significance level at α₀, we assume F_{α₀}(α₀) = 1, so that γ = γ·F_{α₀}(α₀) is the proportion of all P-values that are unsound and significant at level α₀.

For α < α₀, F_α(α) is the proportion of all hacked P-values which are significant when the significance threshold is decreased to α. The proportion F_α(α) is central to our assessment of the RSS proposal, as it quantifies the extent to which hacked P-values that are significant at level α₀ will persist (and remain significant) under the lower cutoff. The analysis in [2] compares the empirical replication rates for P-values with P < 0.005 and 0.005 < P < 0.05 observed in [3, 17] to arrive at their suggested 2-to-1 increase in replication rate under the lower cutoff. But since the behavior of hacked P-values depends on the cutoff, we cannot naively estimate F_α(α) by F_{α₀}(α) (i.e., the proportion of hacked P-values that lie below α when the cutoff is α₀). Instead, we should expect F_α(α) to lie somewhere between F_{α₀}(α) and 1, reflecting the inevitability that P-hackers will adapt to the new significance regime.

Since F_α(α) depends on the cutoff α, we cannot estimate it from data observed under the current cutoff. When analyzing the potential impact of the lower cutoff, we consider all possible values in the range F_{α₀}(α) ≤ F_α(α) ≤ 1 by interpolation,

(6)    F_α(α) = ρ + (1 − ρ)F_{α₀}(α),

for 0 ≤ ρ ≤ 1. The parameter ρ quantifies the rate at which hacked P-values 'persist' after a change in cutoff. We therefore call ρ the persistence parameter and sometimes refer to F_α(α) simply as the persistence at level α. Note that ρ = 1 corresponds to maximal persistence (i.e., F_α(α) = 1), so that 100% of hacked P-values that are significant under the current cutoff are also significant at the new cutoff; and ρ = 0 corresponds to minimal persistence (i.e., F_α(α) = F_{α₀}(α)), so that only those hacked P-values that currently lie below α are significant at the new threshold.
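A small sketch of the interpolation (6); the value used below for F_{α₀}(α) is a placeholder of my own choosing, since in practice it would have to be estimated or bounded:

    def persistence(rho, F0_at_alpha):
        """Equation (6): proportion of hacked P-values significant at the new cutoff,
        interpolated between F0_at_alpha (rho = 0, minimal persistence) and 1
        (rho = 1, maximal persistence)."""
        return rho * 1.0 + (1 - rho) * F0_at_alpha

    # Hypothetical value: suppose 20% of hacked P-values already fall below 0.005
    # under the current 0.05 regime, i.e., F_{0.05}(0.005) = 0.20.
    F0 = 0.20
    for rho in (0.0, 0.5, 1.0):
        print(rho, persistence(rho, F0))   # 0.2, 0.6, 1.0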

3.3. False positive and replication rates under regime change

Under cutoff α:

                        H₀ true                                   H₀ false
                        Sound                     Unsound         Sound
Proportion              (1 − γ)/(1 + φ)           γ               (1 − γ)φ/(1 + φ)
Reject H₀               (1 − γ)α/(1 + φ)          γF_α(α)         (1 − γ)(1 − β)φ/(1 + φ)
Not Reject H₀           (1 − γ)(1 − α)/(1 + φ)    γ(1 − F_α(α))   (1 − γ)βφ/(1 + φ)

Table 3. Relative proportion of null hypotheses corresponding to true/false positives/negatives in the presence of P-hacking with persistence F_α(α) at level α. The proportion 1 − γ of sound tests has Type-I error rate α, Type-II error rate β, and prior odds φ.

In taking α₀ as the baseline, we set F_{α₀}(α₀) = 1 so that the proportion of unsound significant P-values at level α₀ equals γ, as in Table 2. Assuming γ is independent of the cutoff, the proportion of hacked P-values that are significant at the new cutoff α is given by F_α(α), as reflected in the updated proportions of Table 3. The cutoff-dependent distribution of unsound P-values now features in the calculation of the false positive rate and replication rate from (4) and (5) by

(7)    FPR(α, β, φ, γ, ρ) = [γF_α(α) + (1 − γ)α/(1 + φ)] / [γF_α(α) + (1 − γ)(α + (1 − β)φ)/(1 + φ)],

(8)    RR(α, β, φ, γ, ρ) = 1 − FPR(α, β, φ, γ, ρ).

The assumed form of F_α(α) in (6) and the convention F_{α₀}(α₀) = 1 imply F_{α₀}(α) ≤ F_α(α) ≤ 1, giving the bounds

(9)    FPR(α, β, φ, γ, ρ = 0) ≤ FPR(α, β, φ, γ, ρ) ≤ FPR(α, β, φ, γ, ρ = 1),

which we use to obtain conservative estimates in the forthcoming empirical analysis.
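Putting (6)-(8) together, the following sketch (same assumed benchmarks; F_{α₀}(0.005) is set to 0 here, a simplifying assumption meaning that no hacked P-values fall below the new cutoff until P-hackers adapt) reproduces the 0.44 and 0.71 false positive rates quoted in the introduction at maximal persistence:

    def fpr_new_cutoff(alpha, power, phi, gamma, rho, F0_at_alpha=0.0):
        """Equation (7): FPR at the new cutoff alpha with persistence parameter rho.

        F0_at_alpha is F_{alpha_0}(alpha), the proportion of hacked P-values already
        below alpha under the old cutoff (assumed 0 here for illustration)."""
        F = rho + (1 - rho) * F0_at_alpha              # Equation (6)
        sig_false = gamma * F + (1 - gamma) * alpha / (1 + phi)
        sig_true = (1 - gamma) * power * phi / (1 + phi)
        return sig_false / (sig_false + sig_true)

    phi, power, alpha_new = 1 / 10, 0.80, 0.005
    for gamma in (0.05, 0.15):
        f = fpr_new_cutoff(alpha_new, power, phi, gamma, rho=1.0)
        print(gamma, round(f, 2), round(1 - f, 2))   # FPR and RR, Equation (8)
        # gamma = 0.05 -> FPR ~0.44 ; gamma = 0.15 -> FPR ~0.71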

4. Analysis of “Redefining Statistical Significance”

For the sake of this analysis, we consider the potential effects of changing the baseline significance level from α₀ = 0.05 to α = 0.005. We take 0.80 as the conventional level of power (i.e., β = 0.20) at significance level α₀ = 0.05. We assume prior odds of φ = 1/10 (i.e., 1:10) for results in psychology, as suggested by empirical evidence in that field [10]. Since the analysis of FPR (Equation (7)) under the change in cutoff relies on the persistence parameter ρ, which is not estimable from data, we present all possible values of FPR over the entire range 0 ≤ ρ ≤ 1 whenever applicable. Although it is possible that the prior odds could change in response to redefining statistical significance, we assume for simplicity that φ will remain unchanged upon lowering the cutoff to 0.005.

4.1. Estimating the hacking rate

In showing the sensitivity of FPR (and therefore RR) to the rate of P-hacking (γ), Figure 1 immediately raises doubts about the analysis in [2]. The expressions for FPR and RR in (7) and (8), which account for P-hacking, allow us to examine these doubts and arrive at the conclusions highlighted in the abstract and introductory section. In particular, for γ ≥ 0.05, we see that the FPR is much higher than suggested by the analysis in [2], and any improvement due to lowering the cutoff will be minimal. To validate this observation, we estimate the hacking rate γ based on data from a recent replication project in psychology [17].

We obtain a range of estimates for γ by comparing the empirical observations about replication obtained in [17] with the predicted FPR and RR under P-hacking in (4) and (5). The replication study in [17] shows an empirical replication rate of 37% (36 out of 97) in the field of psychology, far below the rate of 62% predicted by (2) with α = 0.05, 1 − β = 0.80, and prior odds φ = 1/10. Fitting this observed value to (5) gives a point estimate of roughly γ = 0.07. By further comparing the observed replication rate among P-values which were originally in the ranges P < 0.005 (24 out of 47) and 0.005 < P < 0.05 (12 out of 50), we obtain estimates for γ ranging between 0.05 and 0.15. These estimates are consistent with the widespread belief that P-hacking is prevalent in the psychology literature. (These estimates, which were obtained merely by matching theoretical and empirical values of the replication rate based on a single study, are not put forward here as estimates of the 'real' rate of P-hacking in psychology or any other discipline. They are intended merely to provide a ballpark figure for the sake of analyzing the RSS proposal in light of P-hacking.)
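The point estimate can be reproduced by inverting (5) for γ; the following sketch does this analytically (my own rearrangement, under the same assumed benchmarks):

    def gamma_from_rr(rr_obs, alpha, power, phi):
        """Hacking rate gamma implied by an observed replication rate, obtained by
        solving Equation (5) for gamma (perfect reproducibility assumed)."""
        t = power * phi / (1 + phi)              # sound true positives
        s = (alpha + power * phi) / (1 + phi)    # all sound significant results
        return (t - rr_obs * s) / (t - rr_obs * s + rr_obs)

    # Observed replication rate from the psychology replication study [17]: 36 of 97.
    print(round(gamma_from_rr(36 / 97, alpha=0.05, power=0.80, phi=1 / 10), 3))   # ~0.072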

4.2. The impact of redefining statistical significance

Benjamin et al. cite two main pieces of evidence in support of their proposal to change the significance cutoff from 0.05 to 0.005:

  • (i) The observation (partially reconstructed in Figure 1) that the false positive rate is unacceptably high (greater than 0.33 for all levels of power) under the current 0.05 cutoff, and that these rates will fall to more acceptable levels (below 0.10 for many levels of power) under the new 0.005 cutoff. Compare the solid lines in Figure 1; see also [2, Figure 2].

  • (ii) The empirical studies of replication rates in psychology [17] and experimental economics [3], which suggest that the replication rate for findings with P-value in the range P < 0.005 is twice that of P-values in the complementary range 0.005 < P < 0.05.

Neither claim is valid. Claim (i) relies on theory (Section 2) which tacitly ignores P-hacking, and thus does not apply in the context for which the RSS proposal was designed. Claim (ii) ignores the dependence of the distribution of hacked P-values, F_α(α), on the significance cutoff α, a dependence supported by empirical studies [9, 14, 20] and by common sense (i.e., as P-hackers currently flout the prescribed protocol, knowingly or unknowingly, they should be expected to continue doing so when the prescription changes). With the dependence of F_α(α) on the significance cutoff accounted for, we see that the suggested benefits of lowering the significance threshold (i.e., FPR decreasing by a factor greater than two and replication rate doubling) are far less certain than presented in [2]. Quite simply, such projections are incongruous with reality.

Figure 3. Solid lines show the projected false positive rates at the 0.005 cutoff for γ = 0.05 (left) and γ = 0.15 (right) over the full range of persistence parameters ρ ∈ [0, 1]. Both plots assume 1 − β = 0.80 and prior odds φ = 1/10. The dotted lines are the false positive rates computed under the estimated value of γ at the 0.05 level (top line) and by setting γ = 0 for significance levels 0.05 (middle) and 0.005 (bottom).

Claimed benefits of the RSS proposal are exaggerated

Figure 3 shows how the false positive rate will change (as a function of the persistence ρ) compared to several reference points. The top dotted line in each panel is the FPR for γ = 0.05 (left) and γ = 0.15 (right) at the 0.05 significance level. In both cases, the FPR is much higher than the standard theory predicts in the absence of P-hacking (as given by the middle dotted line in both plots). The bottom dotted line is the FPR predicted at significance level 0.005 in the absence of P-hacking. The solid curve shows how FPR varies (at the 0.005 cutoff) with the persistence ρ. To appreciate the discrepancy between what can be expected in the presence of P-hacking and what is claimed in [2], note the difference between the solid lines in Figure 3 and the bottom dotted line in each plot, which represents the predicted FPR purported by the RSS analysis. Notice that FPR under the new 0.005 cutoff will remain above 20% even if persistence is relatively low and power is assumed to remain at 0.80 under the new cutoff. For high levels of persistence (e.g., ρ = 0.75 or greater), the FPR is near or far above what would be expected under the 0.05 level in the absence of P-hacking.

Figure 4. False positive rates under different combinations of power (1 − β) and P-hacking rate (γ) for significance level 0.05 (left) and 0.005 (right). The prior odds are taken to be 1:10 (i.e., φ = 1/10) in both cases, and power at the 0.05 level is fixed at 0.80.

False positive rate will remain high under RSS

A finer-grained analysis is given in Figure 4, which plots FPR under different combinations of power (1 − β) and hacking rate (γ). We note that the analysis put forward by RSS in support of redefining the cutoff from 0.05 to 0.005 assumes γ = 0, which for a power of 0.80 leads to a decrease in FPR from 0.38 (under the 0.05 cutoff) to 0.06 (under the 0.005 cutoff). In the absence of P-hacking, the perceived improvement is appealing in both relative (decrease by a factor greater than 6) and absolute (decrease of the false positive rate to 6%) terms. However, the improvements are substantially mitigated in the presence of even modest P-hacking: for γ = 0.05 and power of 0.80, FPR decreases from 0.57 to 0.44; and for γ = 0.15 and power 0.80, FPR decreases from 0.75 to 0.71. So if power can be maintained after the significance cutoff is decreased, then the false positive rate will go down, but by much more modest factors (between roughly 1.05 and 1.3) which are unlikely to be noticeable in practice and would still leave FPR at much higher levels than desired.
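These modest improvement factors can be checked directly (a self-contained sketch; maximal persistence ρ = 1 and the same assumed benchmarks as above):

    def fpr(alpha, power, phi, gamma, F=1.0):
        """FPR with hacked proportion gamma, of which a fraction F is significant at alpha."""
        sig_false = gamma * F + (1 - gamma) * alpha / (1 + phi)
        sig_true = (1 - gamma) * power * phi / (1 + phi)
        return sig_false / (sig_false + sig_true)

    phi, power = 1 / 10, 0.80
    for gamma in (0.05, 0.15):
        old = fpr(0.05, power, phi, gamma)     # current 0.05 regime
        new = fpr(0.005, power, phi, gamma)    # 0.005 regime, maximal persistence
        print(gamma, round(old, 2), round(new, 2), round(old / new, 2))
        # gamma = 0.05 -> 0.57 vs 0.44, improvement factor ~1.3
        # gamma = 0.15 -> 0.75 vs 0.71, improvement factor ~1.06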

The reproducibility crisis could get worse

So far we have granted the RSS proposal the benefit of the doubt in assuming that power of 0.80 can be maintained after the change in cutoff. This is possible, but not without additional costs (i.e., time and money) due to the need for larger sample sizes. Even granting this benefit, the unimpressive improvements to FPR shown in Figures 3 and 4 call into question the RSS assertion that the benefits of the lower cutoff would outweigh the costs of achieving the same level of power. Furthermore, the practical difficulty of achieving such high power without compromising other parts of the study cannot be ignored in fields such as psychology, where the reproducibility crisis seems most pronounced. It seems inevitable that the power of many studies would decrease under the new cutoff.

Figure 5. Ratio of RR under the 0.005 and 0.05 cutoffs based on the psychology replication data. (Left) Ratio of replication rates for γ = 0.05, over different combinations of power (1 − β) and persistence (ρ) under the 0.005 cutoff. (Right) The same ratio for γ = 0.15. Ratios less than 1 are colored in red, indicating that the replication rate is worse under the lower 0.005 cutoff.

To examine the impact of a potential loss in power, Figure 5 plots the ratio of the replication rate for different choices of power (1 − β) and persistence (ρ) under the 0.005 cutoff to the replication rate under the 0.05 level with power 0.80. Ratios smaller than 1 are colored red to indicate that the replication rate gets worse under the lower cutoff of 0.005. As expected, the replication rate improves as long as the same level of 0.80 power can be maintained at the lower cutoff. But only under special circumstances does the ratio exceed 2, as the authors suggest based on the empirical evidence in [3] and [17]. For the replication rate to double, the persistence can be no greater than 0.35 (at power 0.80 and γ = 0.15) and no greater than 0.15 (at power 0.80 and γ = 0.05). Based on these observations, it is hard to imagine a circumstance under which the replication rate would improve by anything close to the promised factor of two.

Since the RSS proposal takes no active step to combat the root cause of the reproducibility crisis (i.e., P-hacking), it is unrealistic to expect P-hacking to improve in any noticeable way under the new proposal. And if the same level of power cannot be maintained, then the situation is even more uncertain, with most scenarios suggesting modest gains or even losses in replication rate. For example, if power falls to 0.50 under the new 0.005 cutoff and P-hacking persists at a rate greater than 0.75, then the replication rate could increase by as much as 19% (at persistence ρ = 0.75 and γ = 0.05) or decrease by 19% (at persistence ρ = 1 and γ = 0.15). The replication rate under the new cutoff would vary between 20% and 51% in such a case, as compared to the 37% replication rate observed in [17].
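The bracketing figures in this paragraph can be checked with the same machinery (a sketch assuming power 0.50 at the new cutoff and, again, my simplifying choice F_{α₀}(0.005) ≈ 0):

    def replication_rate(alpha, power, phi, gamma, F):
        """Replication rate per Equations (7)-(8): true positives among significant results."""
        sig_false = gamma * F + (1 - gamma) * alpha / (1 + phi)
        sig_true = (1 - gamma) * power * phi / (1 + phi)
        return sig_true / (sig_false + sig_true)

    phi = 1 / 10
    for gamma, rho in ((0.05, 0.75), (0.15, 1.0)):
        rr_old = replication_rate(0.05, 0.80, phi, gamma, F=1.0)    # current regime
        rr_new = replication_rate(0.005, 0.50, phi, gamma, F=rho)   # new regime, power 0.50
        print(gamma, rho, round(rr_new, 2), round(rr_new / rr_old, 2))
        # gamma = 0.05, rho = 0.75 -> RR ~0.51, ratio ~1.19 (a 19% increase)
        # gamma = 0.15, rho = 1.00 -> RR ~0.20, ratio ~0.81 (a 19% decrease)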

4.3. How persistent are P-hackers?

A determining factor in evaluating the impact of the RSS proposal is the extent to which the lower cutoff will lead to a reduction in the proportion of hacked P-values that attain statistical significance. Since RSS takes no active steps to deter P-hacking, it is unlikely to have any positive impact on the prevalence of P-hacking. In particular, we cannot rule out the possibility that ρ = 1 (maximal persistence), in which case the benefit to reproducibility would be undetectable, or even negative if the same level of power cannot be achieved: see the top line in Figure 5. Without compelling evidence to the contrary, we should expect P-hacking to continue just as it does currently. The authors of RSS have tried to preempt this criticism, arguing that "[r]educing the P-value threshold complements—but does not substitute for—solutions to these other problems" [2]. Based on our above analysis, however, we find little 'complementary' about the RSS proposal. All available data point to a marginal improvement in FPR, and that only as long as the lower significance level has no adverse effect on the level of statistical power (1 − β), the hacking rate (γ), or the prior odds (φ). This alone is a strong assumption. Whatever improvements to reproducibility might result from other efforts to thwart P-hacking will have been achieved in spite of the RSS proposal.

5. Concluding remarks

Using the same theoretical device (i.e., false positive rate under NHST) and empirical evidence (i.e., the psychology replication study in [17]), we have analyzed the RSS proposal in light of claims that it will improve reproducibility. By accounting for the effects of P-hacking, we see that the claimed benefits to false positive rate and replication rate are much less certain than suggested in [2]. In fact, if the false positive rate were to decrease at all, the decrease would be virtually unnoticeable, and the rate would remain much higher than claimed in [2]. Replication rates will not come close to doubling unless the lower cutoff successfully eliminates all but a small fraction of P-hacking. All available evidence suggests that P-hacking would continue much as it does now, and replication rates would either improve marginally or even decrease, remaining far below what is necessary to improve reproducibility in any substantive way. Altogether, these observations point to one conclusion: the proposal to redefine statistical significance is severely flawed, presented under false pretenses, supported by a misleading analysis, and should not be adopted.

Defenders of the proposal will inevitably criticize this conclusion as "perpetuating the status quo," as one of them already has [13]. Such a rebuttal is in keeping with the spirit of the original RSS proposal, which has attained legitimacy not by coherent reasoning or compelling evidence but rather by appealing to the authority and number of its 72 authors. The RSS proposal is just the latest in a long line of recommendations aimed at resolving the crisis while perpetuating the cult of statistical significance [23] and propping up the flailing and failing scientific establishment under which the crisis has thrived. The proposal to redefine statistical significance is a defense of the status quo made by a group of authors who represent the status quo: advocates of the proposal not only embrace the bureaucracy of 'statistical significance' but also contribute to the replication crisis by peddling an argument grounded in unstated assumptions and incomplete analysis. This latter point cannot be overstated: if the authors of this proposal, many of whom are prominent in their respective fields and have been active in combatting the reproducibility crisis for years, are prone to the same fundamental errors that are largely responsible for the replication crisis, then what hope is there to stem the crisis in the scientific communities they lead?

While I am sympathetic to the sentiment prompting the various responses to RSS [1, 12, 16, 21], I am not optimistic that the problem can be addressed by ever expanding scientific regulation in the form of proposals and counterproposals advocating for pre-registered studies, banned methods, better study design, or generic ‘calls to action’. Those calling for bigger and better scientific regulations ought not forget that another regulation—the 5% significance level—lies at the heart of the crisis.

Beyond their implications for the specific recommendation to redefine statistical significance, the subtle but severe issues with the RSS proposal call into question the role of statistics in scientific research [5]. The past century has seen the steady rise of statistical thinking as a method for validating scientific findings [18]. Statistics is now the sole adjudicator of what is ‘significant’ and what is not, what is scientific and what is not, what is important and what is not. Other techniques have largely been phased out, slowly but steadily [8, 19], to such an extent that it is now difficult, or impossible, in some fields to describe what constitutes ‘knowledge’ without referring to the concept of statistical significance, or some similar statistical measure. Despite this, or perhaps because of it, science finds itself in the midst of crisis.

Viewed in this light, P-hacking and its consequences are seen for what they are: not defective or fraudulent behavior but rather a natural response to overbearing and repressive regulation. For the practicing scientist, statistics is just one of many potential ways to glean insights and validate findings. While some hypotheses may be properly tested and understood by statistical reasoning, others might not be amenable to any formal quantitative or statistical method. The rise of statistical thinking over the past century [8, 18] has crowded out these other ways of arriving at truth, which despite being more qualitative, less precise, and more ambiguous are no less valid. In such a parochial system, what option other than P-hacking is available to the scientist who earnestly believes they have come upon a veritable finding, which despite having a sound basis in logic and evidence cannot be tested statistically?

I have commented before [4] about the need to break free from the modern data-crazed mindset, which fetishizes the ‘statistical’ and brushes aside everything else. To borrow from Feyerabend [6, p. 7], “The only principle that does not inhibit progress is: anything goes”. Scientists ought to be encouraged to make convincing, cogent arguments for their hypotheses however they see fit, without the decree of mandated protocol or embargoed methods.

To be clear, I am not calling for a ‘ban’ of statistics—I am not calling for a ban of anything. The reproducibility crisis exposes the vulnerability of relying too heavily on any one paradigm, statistical or otherwise, and the folly of entrusting the arbitration of ‘knowledge’ to an enlightened intellectual bureaucracy which keeps the proletariat in check by ‘raising awareness’, banning methods, and redefining what is significant. These vulnerabilities are only magnified by the allure of publication, prestige, promotion, and the many other human (all too human) factors whose influence is undeniable and inevitable. Against conventional wisdom, I contend that the reproducibility crisis cannot be fixed by better statistical education, increased awareness, patronizing ‘how to’ guides (e.g., [11]), or enhanced oversight by journals and intellectual bureaucrats. These measures have been tried, and they have failed. Such steps only accelerate bureaucracy, which in turn calcifies the status quo and further promotes a collective inability to conceive of what constitutes ‘knowledge’ independently of the bureaucratic policies used to evaluate ‘findings’. The only way to reverse course is to loosen—not tighten—the restrictions on what makes an analysis scientific and a finding significant.

References

  • [1] V. Amrhein and S. Greenland. (2017). Remove, rather than redefine, statistical significance. Nature Human Behaviour.
  • [2] D.J. Benjamin, J.O. Berger, Magnus Johannesson, B. Nosek, E.J. Wagenmakers, R. Berk, K.A. Bollen, B. Brembs, L. Brown, C. Camerer, D. Cesarini, C. Chambers, M. Clyde, T. Cook, P. de Boeck, Z. Dienes, A. Dreber, K. Easwaran, C. Efferson, E. Fehr, F. Fidler, A.P. Field, M. Forster, E. George, R. Gonzales, S. Goodman, E. Green, D. Green, A. Greenwald, J. Hadfield, L. Hedges, L. Held, T.H. Ho, H. Hoijtink, J.H. Jones, D. Hruschka, K. Imai, G. Imbens, J. Ioannidis, M. Jeon, M. Kirchler, D. Laibson, J. List, R. Little, S. Lupia, E. Machery, S. Maxwell, M. McCarthy, D. Moore, S. Morgan, M. Munafò, S. Nakagawa, B. Nyhan, T. Parker, L. Pericchi, M. Perugini, J. Rouder, J. Rousseau, V. Savalei, F. Schoenbrodt, T. Sellke, R. Shiffrin, B. Sinclair, D. Tingley, T. Van Zandt, S. Vazire, D. Watts, C. Winship, R. Wolpert, Y. Xie, C. Young, J. Zinman, V.E. Johnson. (2017). Redefine statistical significance. Nature Human Behaviour.
  • [3] Colin F. Camerer, Anna Dreber, Eskil Forsell, Teck-Hua Ho, Jürgen Huber, Magnus Johannesson, Michael Kirchler, Johan Almenberg, Adam Altmejd, Taizan Chan, Emma Heikensten, Felix Holzmeister, Taisuke Imai, Siri Isaksson, Gideon Nave, Thomas Pfeiffer, Michael Razen, Hang Wu. (2016). Evaluating replicability of laboratory experiments in economics. Science.
  • [4] H. Crane. (2017). Comment on Gelman and Hennig: Beyond objective and subjective in statistics. Journal of the Royal Statistical Society, Series A.
  • [5] H. Crane and R. Martin. (2017). Is statistics meeting the needs of science?
  • [6] P. Feyerabend. (2010). Against Method, Fourth Edition. Verso.
  • [7] A. Gelman and E. Loken. (2014). Data-dependent analysis—a “garden of forking paths”—explains why many statistically significant comparisons don’t hold up. Am. Sci., 102(6):460.
  • [8] G. Gigerenzer, Z. Swijtink, T. Porter, L. Daston, J. Beatty, and L. Krüger. (1989). The Empire of Chance: How Probability Changed Science and Everyday Life.
  • [9] M.L. Head, L. Holman, R. Lanfear, A.T. Kahn, M.D. Jennions. (2015). The Extent and Consequences of P-Hacking in Science. PLOS Biology.
  • [10] V. E. Johnson et al. (2016). On the reproducibility of psychological science. J. Am. Stat. Assoc., 112, 1–10.
  • [11] R.E. Kass, B.S. Caffo, M. Davidian, X.-L. Meng, B. Yu, and N. Reid. (2016). Ten simple rules for effective statistical practice. PLOS Comput. Biol., 12(6):e1004961.
  • [12] D. Lakens, et al. Justify Your Alpha: A Response to “Redefine Statistical Significance”. PsyArXiv, 18 Sept. 2017. Web.
  • [13] E. Machery. (2017). Comment at http://philosophyofbrains.com/2017/10/02/should-we-redefine-statistical-significance-a-brains-blog-roundtable.aspx.
  • [14] E.J. Masicampo and D.R. Lalande. (2012). A peculiar prevalence of p values just below .05. Q. J. Exp. Psychol., 65:2271–2279.
  • [15] D. Mayo. (2017). Comment at http://philosophyofbrains.com/2017/10/02/should-we-redefine-statistical-significance-a-brains-blog-roundtable.aspx.
  • [16] B.B. McShane, D. Gal, A. Gelman, C. Robert, and J.L. Tackett. (2017). Abandon statistical significance. arXiv:1709.07588.
  • [17] Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349:6251.
  • [18] T.M. Porter. (1988). The Rise of Statistical Thinking: 1820–1900.
  • [19] T.M. Porter. (1996). Trust in Numbers.
  • [20] U. Simonsohn, L.D. Nelson, and J.P. Simmons. (2014) P-Curve: A Key to the File Drawer. Journal of Experimental Psychology.
  • [21] David Trafimow, Valentin Amrhein, Corson N. Areshenkoff, Carlos Barrera-Causil, Eric J. Beh, Yusuf Bilgiç, Roser Bono, Michael T. Bradley, William M. Briggs, Héctor A. Cepeda-Freyre, Sergio E. Chaigneau, Daniel R. Ciocca, Juan Carlos Correa, Denis Cousineau, Michiel R. de Boer, Subhra Sankar Dhar, Igor Dolgov, Juana Gómez-Benito, Marian Grendar, James Grice, Martin E. Guerrero-Gimenez, Andrés Gutiérrez, Tania B. Huedo-Medina, Klaus Jaffe, Armina Janyan, Ali Karimnezhad, Fränzi Korner-Nievergelt, Koji Kosugi, Martin Lachmair, Rubén Ledesma, Roberto Limongi, Marco Tullio Liuzza, Rosaria Lombardo, Michael Marks, Gunther Meinlschmidt, Ladislas Nalborczyk, Hung T. Nguyen, Raydonal Ospina, Jose D. Perezgonzalez, Roland Pfister, Juan José Rahona, David A. Rodríguez-Medina, Xavier Romão, Susana Ruiz-Fernández, Isabel Suarez, Marion Tegethoff, Mauricio Tejo, Rens van de Schoot, Ivan Vankov, Santiago Velasco-Forero, Tonghui Wang, Yuki Yamada, Felipe C. Zoppino, Fernando Marmolejo-Ramos. (2017). Manipulating the alpha level cannot cure significance testing – comments on "Redefine statistical significance". PeerJ Preprints.
  • [22] Wasserstein, R. L. and Lazar, N. A. (2016). The ASA’s statement on p-values: context, process, and purpose. Am. Stat., 70:129–133.
  • [23] S.T. Ziliak and D.N. McCloskey. (2008). The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives.