The posterior probability of a null hypothesis given a statistically significant result

01/21/2019 ∙ by Daniel J. Schad, et al. ∙ Universität Potsdam

Some researchers informally assume that, when they carry out a null hypothesis significance test, a statistically significant result lowers the probability of the null hypothesis being true. Although technically wrong (the null hypothesis does not have a probability associated with it), it is possible under certain assumptions to compute the posterior probability of the null hypothesis being true. We show that this intuitively appealing belief, that the probability of the null being true falls after a significant effect, is in general incorrect and only holds when statistical power is high and when, as suggested by Benjamin et al., 2018, a type I error level is defined that is lower than the conventional one (e.g., α = 0.005). We provide a Shiny app (https://danielschad.shinyapps.io/probnull/) that allows the reader to visualize the different possible scenarios.


Case 1: Investigating the posterior probability of the null hypothesis being true when power is low (Type II error 0.90) and the prior probability of the null hypothesis is high (Mean Prob(H0)=.90)

We next look at the posterior probability in different situations. As a first case, we investigate the posterior probability of the null hypothesis being true when the prior probability of the null hypothesis is high (Mean Prob(H0)=.90) and power is low (Type II error 0.90). We investigate several scenarios, using different Type I error levels (α = 0.05, α = 0.01, and α = 0.005).

Scenario 1: Low power (0.10), Type I error 0.05

Let the Type I error be 0.05 and the Type II error be 0.90, so that power is 1 − 0.90 = 0.10. Such low power is by no means an uncommon situation in areas like cognitive psychology, psycholinguistics, and linguistics; examples from psycholinguistics are discussed in Jäger et al. (2017); Vasishth et al. (2018); Nicenboim et al. (2018).
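The point calculation behind this scenario is Bayes' rule: P(H0 | significant) = α P(H0) / (α P(H0) + (1 − β)(1 − P(H0))), where β is the Type II error. A minimal R sketch of this computation (the helper name posterior_null is our own, not taken from the authors' code):

```r
# Posterior probability of the null given a significant result (Bayes' rule):
# P(H0 | sig) = alpha * P(H0) / (alpha * P(H0) + (1 - beta) * (1 - P(H0))),
# where alpha is the Type I error and 1 - beta is power.
posterior_null <- function(alpha, type2, prior_null) {
  power <- 1 - type2
  (alpha * prior_null) / (alpha * prior_null + power * (1 - prior_null))
}

# Scenario 1: alpha = 0.05, Type II error = 0.90, prior P(H0) = 0.90
posterior_null(alpha = 0.05, type2 = 0.90, prior_null = 0.90)
#> ~0.82: the significant result barely moves us off the 0.90 prior
```

Here α, β, and P(H0) are point values, but the function is vectorized, which we exploit below when these quantities are given distributions.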

Figure 2 shows the prior (blue) and posterior distributions: obtaining a significant result hardly shifts our belief regarding the null. This should be very surprising to researchers who believe that a significant result lowers the probability of the null hypothesis being true to a small value. Next, consider what happens when we reduce the Type I error to 0.01, which is lower than the traditional 0.05.

Scenario 2: Low power (0.10), Type I error 0.01

Many researchers (Benjamin et al., 2018) have suggested that lowering the Type I error will resolve many of the problems with NHST. Let's start by investigating what changes when we decrease the Type I error to 0.01 (researchers like Benjamin et al., 2018 have proposed 0.005 as a threshold for the Type I error; we turn to this proposal below). The Type II error is held constant at 0.90.

Figure 2 shows that lowering Type I error does shift our posterior probability of the null being true a bit more but not enough to have any substantial effect on our beliefs. It seems unreasonable to discard a null hypothesis if the posterior probability of it being true lies between 30 and 70%.
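The point value for this scenario follows from the same sketch (again, posterior_null is our illustrative helper):

```r
# Scenario 2: power still 0.10, but alpha lowered to 0.01
posterior_null(alpha = 0.01, type2 = 0.90, prior_null = 0.90)
#> ~0.47: a larger shift than in Scenario 1, but far too high to discard the null
```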

Scenario 3: Low power, Type I error 0.05, incorporating uncertainty about Type II error

So far, we have been assuming a point value for power. However, power is really a function that depends (inter alia) on the magnitude of the true (unknown) effect. Power therefore also has some uncertainty associated with it: we do not know the magnitude of the true effect, and we do not know the true standard deviation. We can introduce uncertainty about power (or, equivalently, uncertainty about the Type II error) by placing a prior distribution on the Type II error such that its mean is around 70%. Different levels of power (1 − Type II error) are visualized in Figure 3; the low-power situation, with mean power of 30%, is shown in the bottom row of the figure.
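Propagating this uncertainty amounts to a small Monte Carlo simulation, reusing the posterior_null helper from above. The sketch below uses Beta priors chosen only to match the means stated in the text and figure captions; these exact parameter values are our assumption, not necessarily the authors' specification:

```r
set.seed(1)
n_sim <- 1e5
# Assumed priors, matched to the stated means (not the authors' exact choices):
type2_draws <- rbeta(n_sim, 70, 30)   # Type II error, mean 0.70 (power ~0.30)
p0_draws    <- rbeta(n_sim, 90, 10)   # prior P(H0), mean 0.90
post <- posterior_null(alpha = 0.05, type2 = type2_draws, prior_null = p0_draws)
quantile(post, probs = c(0.025, 0.975))
#> a wide interval, roughly 0.45 to 0.8 under these assumed priors,
#> echoing (though not exactly reproducing) the 40-90% range of Figure 4a
```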

Figure 3: Visualization of the probability distributions for the Type II error corresponding to low power (mean power 30%), medium power (mean power 50%), and high power (mean power 90%). Recall that power is 1 − Type II error.
Figure 4: Probabilities for the null hypothesis, Prob(H0), considering uncertainty about power. Prior probability (blue) and posterior probabilities given a significant effect (grey) at a Type I error of 0.05 (upper panels), 0.01 (middle panels), and 0.005 (lower panels). This is shown for situations of low statistical power (mean Type II error rate of about 70%, mean power of about 30%), medium statistical power (mean Type II error rate of 50%, mean power of 50%), and high statistical power (mean Type II error rate of about 10%, mean power of about 90%), and for situations with a high (left panels), medium (middle panels), and low (right panels) prior probability for the null hypothesis.

Incorporating the uncertainty about the Type II error (equivalently, power) increases the uncertainty about the posterior probability of the null quite a bit. Compare Figure 2 (α = 0.05) and Figure 4a (low power). Figure 4a shows that the posterior of the null being true now lies between 40 and 90% (as opposed to 70 and 90% in Figure 2).

Scenario 4: Type I error 0.01, incorporating uncertainty in Type II error

Having incorporated uncertainty into the Type II error, consider now what happens if we lower the Type I error from 0.05 to 0.01. Figure 4d (low power) shows that the posterior distribution for the null hypothesis now shifts to the left quite a bit more, but with wide uncertainty (10-60%). Even with a low Type I error of 0.01, we should be quite unhappy rejecting the null if the posterior probability of the null being true lies anywhere in the wide range of 10 to 60%.

Scenario 5: Type I error 0.005, incorporating uncertainty in Type II error

Next, consider what happens if we lower the Type I error to 0.005, the suggestion of Benjamin et al. (2018). Perhaps surprisingly, Figure 4g shows that the posterior distribution for the null hypothesis does not shift much compared to Scenario 4 (see Figure 4d): the range is 6 to 45% (compare with 10-60% in Scenario 4). Thus, when power is low, there is simply no point in engaging in null hypothesis significance testing; merely lowering the Type I error threshold to 0.005 will not change much regarding our belief in the null hypothesis.
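Scenarios 3 to 5 amount to sweeping the Type I error level; a sketch reusing the Monte Carlo draws from above (still under our assumed priors):

```r
# Scenarios 3-5: low power with uncertainty, alpha at 0.05, 0.01, and 0.005
sapply(c(0.05, 0.01, 0.005), function(a) {
  quantile(posterior_null(a, type2_draws, p0_draws), probs = c(0.025, 0.975))
})
# Each column shifts left as alpha shrinks, but every interval stays wide:
# lowering alpha alone does not rescue a low-powered test.
```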

Case 2: Investigating the posterior probability of the null hypothesis being true when power is high

As a second case, we investigate the posterior probability of the null hypothesis being true when power is high. We consider the case where power is around 90%. We will still assume a high prior probability for the null (Mean Prob(H0) = .90). We will consider three scenarios: Type I error at 0.05, 0.01, and 0.005.

Scenario 6: High power (0.90), Type I error 0.05

First consider a situation where we have high power and the Type I error is at the conventional 0.05 value. The question here is: in high-power situations, does a significant effect shift our belief considerably away from the null? The prior on the Type II error is shown in Figure 3; its mean is 10%, implying a mean of 90% for the power distribution. Perhaps surprisingly, Figure 4a shows that even under high power, our posterior probability of the null being true does not shift dramatically: the probability lies between 20 and 60%.

Scenario 7: High power (mean 0.90), Type I error 0.01

Next, we reduce Type I error to 0.01. Figure 4d shows that when power is high and Type I error is set at 0.01, we get a big shift in posterior probability of the null being true: the range is 5-25%.

Scenario 8: High power (mean 0.90), Type I error 0.005

Next, in this high-power situation, we reduce Type I error to 0.005. Figure 4g shows that when power is high and Type I error is set at 0.005, we get a decisive shift in posterior probability of the null being true: the range is now 2-13%.
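The point values behind Scenarios 6-8 can be checked with the posterior_null helper sketched earlier:

```r
# Scenarios 6-8: high power (Type II error 0.10), sweeping alpha
sapply(c(0.05, 0.01, 0.005), posterior_null, type2 = 0.10, prior_null = 0.90)
#> ~0.33, ~0.09, ~0.05: only high power plus a small alpha is decisive
```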

Cases 3 and 4: Prior probability for the null is medium or low

Finally, we consider cases where the prior probability for the null is medium or low.

Low prior probability for the null: Mean Prob(H0)=.10

One possible objection to the above analyses is that the prior probability of the null hypothesis could often be much smaller than an average of 90%; indeed, in some situations the null hypothesis may be very unlikely. We therefore simulate a situation where the prior probability of the null is 10% on average. For this situation, Figures 4c, f, and i show that the posterior probability of the null is always decisively low. Even for a conventional Type I error of α = 0.05 in a low-powered study, the posterior probability of the null ranges from 1 to 25%, which is quite low, and with smaller α levels or higher power the effect is decisive.
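A quick point-value check with the same illustrative helper (taking the prior P(H0) of 0.10 as a point value here):

```r
# Low prior P(H0) = 0.10, low power, conventional alpha = 0.05
posterior_null(alpha = 0.05, type2 = 0.90, prior_null = 0.10)
#> ~0.05, consistent with the 1-25% range shown in Figure 4c
```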

Of course, this is not very informative, as we started out assuming that the null hypothesis was unlikely to be correct in the first place. We haven't learned much; the statistical significance test just confirms what we already believed with high certainty before we carried it out.

Medium prior probability for the null: Mean Prob(H0)=.50

Now consider the case where the prior probability of the null being true lies at an average of 50%. Here, we do not know a priori whether the null or the alternative hypothesis is true; both outcomes seem similarly likely. In this situation, when we use a conventional Type I error level of α = 0.05 in a low-powered study, a significant effect will bring our posterior probability for the null only to a range of 6-40%, and will thus leave us with much uncertainty after obtaining a significant effect.

However, either using a stricter Type I error level (e.g., α = 0.005) or running a high-powered study suffices to yield informative results. For a high-powered study and α = 0.05, a significant result will (under our assumptions) bring the posterior probability to 2-10% (Figure 4b), which is quite informative. And for a Type I error level of α = 0.005, a significant effect brings decisive evidence against the null for all the levels of power that we investigated (Figure 4h), with a posterior probability of 0.7-6% even for low-powered studies. This suggests that when the prior probabilities of the null and the alternative hypotheses are each at 50%, either high power or a strict Type I error of α = 0.005 will yield informative outcomes once a significant effect is observed.
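Point-value checks for the two informative routes, again using the illustrative posterior_null helper:

```r
# Medium prior P(H0) = 0.50: the two informative routes (point values)
posterior_null(alpha = 0.05,  type2 = 0.10, prior_null = 0.50)  # high power: ~0.05
posterior_null(alpha = 0.005, type2 = 0.90, prior_null = 0.50)  # strict alpha, low power: ~0.05
```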

Discussion

In summary, we analyzed the posterior probability of the null given a significant effect. We provide a Shiny app (https://danielschad.shinyapps.io/probnull/) that allows the user to compute the posterior distribution for different choices of the prior and of Type I and Type II error. For psychology and preclinical biomedical research, the prior odds of H1 relative to H0 are estimated to be about 1:10 (Benjamin et al., 2018), reflecting a high prior probability of the null of 90%. For this common and standard situation in psychology and other areas, when power is low, the posterior probability of the null being true does not change in any meaningful way after seeing a significant result, even if we lower the Type I error to 0.005. What shifts our belief in a meaningful way is reducing the Type I error to, say, 0.005 (as suggested by Benjamin et al., 2018, and others) combined with running a high-powered study; only this combination of high power and a small Type I error rate yields informative results.

One might object here that we set the prior probability of the null hypothesis being true at an unreasonably high value. This objection has some merit; although typically the prior probability for the null may lie at 90%, there may well be some situations where the null is unlikely to be true a priori. In this situation, our results show that a significant effect does indicate a very low posterior probability of the null. This is the case across a range of Type I error levels (α of 0.05, 0.01, 0.005) and for different levels of power (Figures 4c, f, and i). Even for low-power studies with α = 0.05, the posterior probability is between 1 and 25%, which is quite low. So yes, if the prior probability of the null being true is already low, then even with relatively low power and the standard Type I error level of 0.05, we are entitled to change our belief quite strongly against the null once we have a significant effect. An obvious issue here is that if we already do not believe in the null before we do the statistical test, why bother trying to reject it? And even if we were satisfied with rejecting a null hypothesis that we do not believe in to begin with, running low-power studies is always a bad idea because of Type M and Type S error: as Gelman and Carlin (2014) and many others before them have pointed out, significant effects from low-power studies will have exaggerated effect size estimates and could have the wrong sign. The probability of the null hypothesis being true is not the only important issue in a statistical test; accurate estimation of the parameter of interest and quantification of our uncertainty about that estimate are equally important.

In summary, we investigated the intuitive belief held by some researchers that finding a significant effect reduces the posterior probability of the null hypothesis to a low value. We show that this intuition is not true in general. The common situation in psychology and other areas is that the null hypothesis is a priori quite likely to be true. In such a situation, contrary to intuition, finding a significant effect leaves us with much posterior uncertainty about the null hypothesis being true. Obtaining a reasonable reduction in uncertainty is thus another reason to adopt the recent recommendation by Benjamin et al. (2018) to lower the Type I error to 0.005. Furthermore, conducting high-powered studies is an obvious but neglected remedy. Otherwise, the results will be indecisive.

Our key result is that the posterior probability for the null given a significant effect varies widely across settings involving different Type I and Type II errors and different prior probabilities for the null. The intuition that frequentist p-values may provide a shortcut to this information is in general misleading.
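To see this variability concretely, one can sweep point values over the grid of settings used above (a sketch using our posterior_null helper, not the authors' code):

```r
# All combinations of alpha, Type II error, and prior P(H0) considered above
grid <- expand.grid(alpha      = c(0.05, 0.01, 0.005),
                    type2      = c(0.90, 0.50, 0.10),
                    prior_null = c(0.90, 0.50, 0.10))
grid$post <- with(grid, posterior_null(alpha, type2, prior_null))
range(round(grid$post, 4))
#> the point values span roughly 0.0006 to 0.82 across the 27 settings
```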

Acknowledgements

Thanks to Valentin Amrhein and Sander Greenland for comments. Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – Project number 317633480, SFB 1287, Project Q, with principal investigators Shravan Vasishth and Ralf Engbert.

Author contribution

SV had the idea for the paper. SV and DJS performed the analyses. SV and DJS created the Shiny app. SV and DJS wrote the paper.

Availability of simulations and computer code

All the computer code used for the simulations reported in the present manuscript, and the code for generating all figures, will be freely available online at https://osf.io/9g5bp/. Moreover, we make a Shiny app available at https://danielschad.shinyapps.io/probnull/ that allows computing the posterior probability of the null given a significant effect for many different settings of Type I and Type II error and for different prior probabilities of the null.

References

  • Benjamin, D. J., Berger, J. O., Johannesson, M., Nosek, B. A., Wagenmakers, E.-J., Berk, R., et al. (2018). Redefine statistical significance. Nature Human Behaviour, 2, 6–10.
  • Betancourt, M. (2018). Calibrating model-based inferences and decisions. arXiv preprint arXiv:1803.08393.
  • Doherty, U. B., Benson, P. E., & Higham, S. M. (2002). Fluoride-releasing elastomeric ligatures assessed with the in situ caries model. The European Journal of Orthodontics, 24(4), 371–378.
  • Gelman, A., & Carlin, J. (2014). Beyond power calculations: Assessing Type S (sign) and Type M (magnitude) errors. Perspectives on Psychological Science, 9(6), 641–651.
  • Harris, M., & Taylor, G. (2003). Medical statistics made easy. Banbury, England: Scion.
  • Jäger, L. A., Engelmann, F., & Vasishth, S. (2017). Similarity-based interference in sentence comprehension: Literature review and Bayesian meta-analysis. Journal of Memory and Language, 94, 316–339. https://doi.org/10.1016/j.jml.2017.01.004
  • Kolmogorov, A. N. (2018). Foundations of the theory of probability: Second English edition. Courier Dover Publications.
  • McElreath, R. (2016). Statistical rethinking: A Bayesian course with examples in R and Stan (Vol. 122). CRC Press.
  • Nicenboim, B., Roettger, T. B., & Vasishth, S. (2018). Using meta-analysis for evidence synthesis: The case of incomplete neutralization in German. Journal of Phonetics, 70, 39–55. https://doi.org/10.1016/j.wocn.2018.06.001 (data and code: https://osf.io/g5ndw/)
  • Oakley, J. E., & O’Hagan, A. (2010). SHELF: The Sheffield Elicitation Framework (version 2.0). University of Sheffield, UK. http://tonyohagan.co.uk/shelf
  • O’Hagan, A., Buck, C. E., Daneshkhah, A., Eiser, J. R., Garthwaite, P. H., Jenkinson, D. J., … Rakow, T. (2006). Uncertain judgements: Eliciting experts’ probabilities. John Wiley & Sons.
  • Tam, C. W. M., Khan, A. H., Knight, A., Rhee, J., Price, K., McLean, K., et al. (2018). How doctors conceptualise ‘P’ values: A mixed methods study. Australian Journal of General Practice, 47(10), 705.
  • Vasishth, S., Mertzen, D., Jäger, L. A., & Gelman, A. (2018). The statistical significance filter leads to overoptimistic expectations of replicability. Journal of Memory and Language, 103, 151–175. https://doi.org/10.1016/j.jml.2018.07.004 (data and code: https://osf.io/eyphj/)