# P-value: A Blessing or a Curse for Evidence-Based Studies?

As a convention, the p-value is often computed in frequentist hypothesis testing and compared with the nominal significance level of 0.05 to determine whether or not to reject the null hypothesis. The smaller the p-value, the more significant the statistical test. We consider both one-sided and two-sided hypotheses in the composite hypothesis setting. For one-sided hypothesis tests, we establish the equivalence of the p-value and the Bayesian posterior probability of the null hypothesis, which gives the p-value an explicit interpretation of how strongly the data support the null. For two-sided hypothesis tests of a point null, we recast the problem as a combination of two one-sided hypotheses along opposite directions and put forward the notion of a two-sided posterior probability, which also has an equivalence relationship with the (two-sided) p-value. Extensive simulation studies are conducted to demonstrate the Bayesian posterior probability interpretation of the p-value. Contrary to common criticisms of the use of p-values in evidence-based studies, we justify the p-value's utility and reclaim its importance from the Bayesian perspective, and recommend its continued use in hypothesis testing. After all, the p-value is not all that bad.


## 1 Introduction

Hypothesis testing is ubiquitous in modern statistical applications and permeates many different fields such as biology, medicine, psychology, economics, and engineering. As a critical component of the hypothesis testing procedure (Lehmann and Romano, 2005), the p-value is defined as the probability of observing random data as or more extreme than the observed data given that the null hypothesis is true. In general, the statistical significance level or type I error rate is set at 5%, so that a p-value below 5% is considered significant, leading to rejection of the null hypothesis, and one above 5% insignificant, resulting in failure to reject the null.

Although the p-value is the most commonly used summary measure of the evidence or strength in the data regarding the null hypothesis, it has been the center of controversies and debates for decades. To clarify ambiguities surrounding the p-value, the American Statistical Association (Wasserstein and Lazar, 2016) issued a statement on p-values; in particular, its second point states that "p-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone." It is often argued that the p-value only gives information on how incompatible the data are with the null hypothesis, but does not provide any information on how likely the data would occur under the alternative hypothesis.

Extensive investigations have been conducted on the inadequacy of the p-value. Rosenthal and Rubin (1983) studied how p-values can be adjusted to allow for greater power when an order of importance exists on the hypothesis tests. Royall (1986) investigated the effect of sample size on the p-value. Schervish (1996) described computation of the p-value for one-sided point null hypotheses, and also discussed the intermediate interval hypothesis. Hung et al. (1997) studied the behavior of the p-value under the alternative hypothesis, which depends on both the true value of the tested parameter and the sample size. Rubin (1998) proposed an alternative randomization-based p-value for double-blind trials with non-compliance. Sackrowitz and Samuel-Cahn (1999) promoted more widespread use of the expected p-value in practice. Donahue (1999) suggested that the distribution of the p-value under the alternative hypothesis provides more information for rejection of implausible alternative hypotheses. As there is a widespread notion that medical research is interpreted mainly based on p-values, Ioannidis (2005) claimed that most of the published findings are false. Hubbard and Lindsay (2008) showed that the p-value tends to exaggerate the evidence against the null hypothesis. Simmons et al. (2011) demonstrated that the p-value is subject to manipulation to achieve the threshold of 0.05 and cautioned against its use. Nuzzo (2014) gave an editorial on why the p-value alone cannot serve as adequate statistical evidence for inference.

Criticisms of the p-value and null hypothesis significance testing have become even more contentious in recent years. If the keywords "misuse of p-value" or "ban p-value" are entered in a Google search, millions of results can be found that attack and bash the p-value. More seriously, several journals, e.g., Basic and Applied Social Psychology and Political Analysis, have banned the use of p-values in their publications (Trafimow and Marks, 2015; Gill, 2018). The controversy over the p-value has recently been reignited, centered around proposals to adjust, abandon, or provide alternatives to the p-value. Fidler et al. (2004) and Ranstam (2012) recommended use of the confidence interval as an alternative to the p-value, and Cumming (2014) called for abandoning the p-value in favor of reporting the confidence interval. Colquhoun (2014) investigated misinterpretation of the p-value as a culprit for the high false discovery rate. Concato and Hartigan (2016) suggested that the p-value should not be the primary focus of attention or the sole basis for evaluating scientific results. McShane et al. (2018) recommended that the role of the p-value as a threshold for screening scientific findings be demoted, and that the p-value not take priority over other statistical measures. Regarding reproducibility concerns in scientific research, Johnson (2013) traced one major cause of nonreproducibility to the routine use of the null hypothesis testing procedure. Leek et al. (2017) proposed abandonment of p-value thresholding and transparent reporting of false positive risk as remedies to the replicability issue in science. Benjamin et al. (2018) recommended shifting the significance threshold from 0.05 to 0.005, while Trafimow et al. (2018) argued that such a shift is futile and unacceptable.

Bayesian approaches are often advocated as a solution to the crisis resulting from abuse of the p-value. Goodman (1999) strongly supported use of the Bayes factor, in contrast to the p-value, as a measure of evidence for medical evidence-based research. Rubin (1984) proposed the predictive p-value as the tail-area probability of the posterior predictive distribution, and Meng (1994) further studied its properties. In applications to psychology, Wagenmakers (2007) revealed the issues with the p-value and recommended use of the Bayesian information criterion instead. In an effort to support the wider use of Bayesian statistics, Lee (2010) demonstrated that Bayesian approaches provide a superior alternative to frequentist methods based on p-values. Alongside its ban on the p-value, the journal Basic and Applied Social Psychology endorsed Bayesian approaches (Trafimow and Marks, 2015). Briggs (2017) proposed that the p-value be proscribed and substituted with the Bayesian posterior probability, while Savalei and Dunn (2015) expressed skepticism on the utility of abandoning the p-value and resorting to alternative hypothesis testing paradigms, such as the Bayesian approach, to solve the reproducibility issue.

On the other hand, extensive research has been conducted in an attempt to reconcile or account for the differences between frequentist and Bayesian hypothesis testing approaches (Berger, 2003; Bayarri and Berger, 2004). Berger and Sellke (1987), Berger and Delampady (1987), and Casella and Berger (1987) investigated the relationships between the p-value and the Bayesian measure of evidence against the null hypothesis. In particular, they provided an in-depth study of one-sided hypothesis testing and point null cases, and also discussed the posterior probability of the null hypothesis with respect to various prior distributions, including the mixture prior with a point mass at the null and a broader distribution over the alternative (Lindley, 1957). Sellke, Bayarri, and Berger (2001) proposed calibrating the p-value for testing precise null hypotheses.

Although the p-value is often regarded as an inadequate and insufficient representation of statistical evidence, it has not stalled scientific advancement over the past years. Jager and Leek (2014) surveyed high-profile medical journals and estimated the rate of false discoveries in the medical literature using reported p-values as the data, which led to the conclusion that the medical literature remains a reliable record of scientific progress. Murtaugh (2014) defended the use of the p-value on the grounds that it is closely linked to the confidence interval and to the difference in Akaike's information criterion. Despite the fact that Bayesian alternatives are often recommended as superior solutions to the various notorious drawbacks of the p-value, in many common cases the p-value in fact has a simple and clear Bayesian interpretation. We present the relationship between the frequentist p-value and the Bayesian posterior probability in several settings commonly encountered in clinical trials, and show that in both one-sided and two-sided hypothesis tests, asymptotic, and sometimes exact, equivalence can be established. Although by definition the p-value is not the probability that the null hypothesis is true, contrary to the conventional notion it does have a close correspondence to the Bayesian posterior probability of the null hypothesis being true. Based on the theoretical results of Dudley and Haughton (2002), we present several cases where the p-value and the posterior probability of the null are equivalent for one-sided tests. Further, we extend such equivalence results to two-sided hypothesis testing problems, where most of the controversies and discrepancies lie. In particular, we introduce the notion of a two-sided posterior probability, which matches the p-value from a two-sided hypothesis test. After all, we conclude that the p-value is not all that bad.

The rest of the paper is organized as follows. In Section 2, we present a motivating example that shows the similarity in operating characteristics of a frequentist hypothesis test and a Bayesian counterpart based on the posterior probability. In Section 3, we show that the p-value and the posterior probability have an equivalence relationship for the case of binary outcomes. In Section 4, we present such equivalence properties for univariate normal data with known and unknown variances, respectively, and in Section 5, we develop similar results for hypothesis tests involving multivariate data. Finally, Section 6 concludes with some remarks.

## 2 Motivating Example

The use of a binary endpoint is common in clinical trial design. A frequentist design typically utilizes an exact binomial test or a Z-test based on the normal approximation, and a Bayesian design often bases the decision on posterior probabilities. As a motivating example, we consider a two-arm clinical trial comparing the response rate $p_E$ of an experimental drug versus that $p_S$ of the standard drug. We are interested in testing a one-sided hypothesis,

$$H_0\colon p_E \le p_S \quad \text{versus} \quad H_1\colon p_E > p_S. \tag{2.1}$$

When there is sufficient evidence to support $H_1$, we would reject $H_0$ and claim that the experimental treatment is superior.

Under the frequentist approach, we construct a Z-test statistic,

$$Z = \frac{\hat{p}_E - \hat{p}_S}{[\{\hat{p}_E(1-\hat{p}_E) + \hat{p}_S(1-\hat{p}_S)\}/n]^{1/2}}, \tag{2.2}$$

where $n$ is the sample size per arm, $\hat{p}_E = y_E/n$ and $\hat{p}_S = y_S/n$ are the sample proportions, and $y_E$ and $y_S$ are the numbers of responders in the respective arms. We reject the null hypothesis if $Z > z_\alpha$, where $z_\alpha$ is the $100(1-\alpha)$th percentile of the standard normal distribution.
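As a concrete sketch of this frequentist decision rule (the function names and the numeric example are ours, not from the paper), the statistic in (2.2) and the rejection rule can be coded with the standard library alone:

```python
import math
from statistics import NormalDist

def z_statistic(y_e, y_s, n):
    """Two-sample Z statistic for proportions, as in equation (2.2).
    Assumes the sample proportions are strictly between 0 and 1."""
    p_e, p_s = y_e / n, y_s / n
    se = math.sqrt((p_e * (1 - p_e) + p_s * (1 - p_s)) / n)
    return (p_e - p_s) / se

def reject_null(y_e, y_s, n, alpha=0.05):
    """Reject H0: pE <= pS when Z exceeds the 100(1 - alpha)th normal percentile."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    return z_statistic(y_e, y_s, n) > z_alpha
```

For example, with 30 versus 20 responders out of 50 per arm, $Z \approx 2.04$ and the null is rejected at $\alpha = 0.05$.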

Under the Bayesian framework, we assume beta prior distributions for $p_E$ and $p_S$, i.e., $p_E \sim \mathrm{Beta}(a_E, b_E)$ and $p_S \sim \mathrm{Beta}(a_S, b_S)$. The binomial likelihood function for group $g$ can be formulated as

$$P(y_g \mid p_g) = \binom{n}{y_g} p_g^{y_g} (1-p_g)^{n-y_g}, \quad g = E, S.$$

The posterior distribution of $p_g$ is given by

$$p_g \mid y_g \sim \mathrm{Beta}(a_g + y_g,\; b_g + n - y_g),$$

for which the density function is denoted by $f(p_g \mid y_g)$. Let $\eta$ be a prespecified cutoff probability boundary. We declare treatment superiority if the posterior probability of $p_E$ being greater than $p_S$ exceeds the threshold $\eta$. Based on the posterior probability, we can construct a Bayesian decision rule so that the experimental treatment is declared superior if

$$\Pr(H_1 \mid y_E, y_S) = \Pr(p_E > p_S \mid y_E, y_S) > \eta, \tag{2.3}$$

where

$$\Pr(p_E > p_S \mid y_E, y_S) = \int_0^1 \int_{p_S}^1 f(p_E \mid y_E)\, f(p_S \mid y_S)\, dp_E\, dp_S.$$

Otherwise, we fail to declare treatment superiority, i.e., fail to reject the null hypothesis.
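The double integral above has no simple closed form, but it is easily approximated by Monte Carlo sampling from the two independent Beta posteriors. A minimal sketch (the function name and the default Beta(1, 1) priors are illustrative choices, not prescribed by the paper):

```python
import random

def posterior_prob_superiority(y_e, y_s, n, a=1.0, b=1.0, draws=100_000, seed=1):
    """Monte Carlo estimate of Pr(pE > pS | yE, yS) under Beta(a, b) priors:
    sample from the two independent Beta posteriors and count how often
    the experimental draw exceeds the standard draw."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(draws):
        p_e = rng.betavariate(a + y_e, b + n - y_e)
        p_s = rng.betavariate(a + y_s, b + n - y_s)
        hits += p_e > p_s
    return hits / draws
```

With 30 versus 20 responders out of 50 per arm, the estimate is close to $\Phi(Z) \approx 0.98$ from the Z-test of the previous paragraph, anticipating the equivalence discussed in Section 3.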

To maintain the frequentist type I error rate at $\alpha$, we need to set $\eta = 1 - \alpha$. The exact probabilities of committing type I and type II errors under the frequentist design are respectively given by

$$\alpha = \sum_{y_E=0}^{n} \sum_{y_S=0}^{n} P(y_E \mid p_E = p_S)\, P(y_S \mid p_S)\, I(Z > z_\alpha),$$

and

$$\beta = \sum_{y_E=0}^{n} \sum_{y_S=0}^{n} P(y_E \mid p_E = p_S + \delta)\, P(y_S \mid p_S)\, I(Z \le z_\alpha),$$

where $\delta$ is the desired treatment difference and $I(\cdot)$ is the indicator function. The exact error rates under the Bayesian test can be derived similarly by replacing $Z > z_\alpha$ with $\Pr(p_E > p_S \mid y_E, y_S) > \eta$ inside the indicator function.

As a numerical study, we consider a two-arm randomized trial with a type I error rate of 10% and 5% paired with a target power of 80% and 90%, respectively. Under equal randomization, to achieve the desired power, the required sample size per arm is

$$n = \frac{(z_\alpha + z_\beta)^2}{\delta^2} \{p_E(1-p_E) + p_S(1-p_S)\},$$

where the treatment difference is taken as $\delta = 0.1$ and 0.15. Under the Bayesian design, we assume non-informative prior distributions for $p_E$ and $p_S$. For comparison, we compute the type I error rate and power for both the Bayesian test with $\eta = 1 - \alpha$ and the frequentist Z-test with critical value $z_\alpha$. As shown in Figure 1, both designs produce similar operating characteristics: the type I error rate can be maintained at the nominal level, and the power attains the target level of 80% or 90% at the specified values of $\delta$. It is worth noting that because the endpoints are binary and the trial outcomes are discrete, exact calibration of the empirical type I error rate to the nominal level is not possible, particularly when the sample size is small. When we adopt a larger sample size by setting the type I error rate to 5% and the target power to 90%, the empirical type I error rate is closer to the nominal level, as shown in the blue lines.
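The sample size formula above is straightforward to evaluate; the sketch below uses illustrative rates $p_E = 0.35$ and $p_S = 0.2$ (so $\delta = 0.15$, one of the values considered in the text; the specific rates are our assumption):

```python
import math
from statistics import NormalDist

def sample_size_per_arm(p_e, p_s, alpha=0.05, power=0.90):
    """Per-arm sample size from the normal-approximation formula
    n = (z_alpha + z_beta)^2 / delta^2 * {pE(1-pE) + pS(1-pS)}, rounded up."""
    z = NormalDist().inv_cdf
    z_alpha, z_beta = z(1 - alpha), z(power)
    delta = p_e - p_s
    n = (z_alpha + z_beta) ** 2 / delta ** 2 * (p_e * (1 - p_e) + p_s * (1 - p_s))
    return math.ceil(n)
```

For instance, detecting $\delta = 0.15$ with 5% type I error and 90% power at these rates requires 148 patients per arm.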

## 3 Hypothesis Test for Binary Data

### 3.1 Two-Sample Hypothesis Test

We first study the relationship between the p-value and the posterior probability in a two-arm randomized clinical trial with dichotomous outcomes. We consider the one-sided hypothesis test in (2.1), and under the frequentist Z-test for two proportions given by (2.2), the p-value is

$$\text{p-value}_1 = 1 - \Phi(Z),$$

where $\Phi(\cdot)$ denotes the cumulative distribution function (CDF) of the standard normal distribution. At the significance level of $\alpha$, we reject the null hypothesis if the p-value is smaller than $\alpha$.

In the Bayesian paradigm, we base our decision on the posterior probability, as given in (2.3). We reject the null hypothesis if the posterior probability of $H_0$ is smaller than $\alpha$,

$$\mathrm{PoP}_1 = \Pr(p_E \le p_S \mid y_E, y_S) < \alpha.$$

As a numerical study, we set $n = 20$, 50, 100 and 500, and randomly draw integers between 0 and $n$ to be the values of $y_E$ and $y_S$; for each replication we compute the posterior probability of the null hypothesis and the p-value. As shown in Figure 2, all the paired values lie very close to the straight line $y = x$, indicating the equivalence between the p-value and the posterior probability of the null.
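A single replication of this comparison can be sketched as follows, computing the one-sided p-value $1 - \Phi(Z)$ alongside a Monte Carlo estimate of $\mathrm{PoP}_1$ under Beta(1, 1) priors (the helper name and the specific counts are ours; the point is the agreement of the two numbers, not their exact values):

```python
import math
import random
from statistics import NormalDist

def compare(y_e, y_s, n, draws=200_000, seed=7):
    """Return (one-sided p-value, Monte Carlo PoP1) for one replication."""
    p_e, p_s = y_e / n, y_s / n
    z = (p_e - p_s) / math.sqrt((p_e * (1 - p_e) + p_s * (1 - p_s)) / n)
    p_value = 1 - NormalDist().cdf(z)
    rng = random.Random(seed)
    # PoP1 = Pr(pE <= pS | data), estimated from independent Beta posteriors
    pop1 = sum(
        rng.betavariate(1 + y_e, 1 + n - y_e) <= rng.betavariate(1 + y_s, 1 + n - y_s)
        for _ in range(draws)
    ) / draws
    return p_value, pop1
```

With 60 versus 50 responders out of 100, the two quantities agree to about two decimal places, mirroring the scatter around $y = x$ in Figure 2.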

Figure 3 shows the differences between p-values and posterior probabilities under sample sizes of 20, 50, 100 and 500, respectively. As the sample size increases, the differences diminish toward 0, corroborating the asymptotic equivalence between the p-value and the posterior probability.

For two-sided hypothesis tests, we are interested in examining whether there is any difference in the treatment effect between the experimental drug and the standard drug,

$$H_0\colon p_E = p_S \quad \text{versus} \quad H_1\colon p_E \ne p_S.$$

The p-value under the two-sided hypothesis test is

$$\text{p-value}_2 = 2 - 2\Phi(|Z|) = 2[1 - \max\{\Phi(Z), \Phi(-Z)\}].$$

It is worth emphasizing that under the frequentist paradigm, the two-sided test can be viewed as a combination of two one-sided tests along opposite directions. Therefore, to construct an equivalent counterpart under the Bayesian paradigm, we may regard the problem as two opposite one-sided Bayesian tests and compute the posterior probabilities of the two opposite hypotheses; this approach to Bayesian hypothesis testing is different from the one commonly adopted in the literature, where a prior probability mass is imposed on the point null, e.g., see Berger and Sellke (1987), Berger and Delampady (1987), and Berger (2003).

If we define the two-sided posterior probability ($\mathrm{PoP}_2$) as

$$\mathrm{PoP}_2 = 2[1 - \max\{\Pr(p_E > p_S \mid y_E, y_S), \Pr(p_E < p_S \mid y_E, y_S)\}],$$

then its relationship with the p-value is similar to that of one-sided hypothesis testing, as shown in Figure 4.
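The two-sided quantities are simple transformations of their one-sided counterparts; a small sketch (helper names are ours, and the posterior is assumed continuous so that $\Pr(p_E = p_S \mid \text{data}) = 0$):

```python
from statistics import NormalDist

def two_sided_p_value(z):
    """p-value2 = 2 - 2 * Phi(|Z|) = 2[1 - max{Phi(Z), Phi(-Z)}]."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

def two_sided_pop(pr_e_greater):
    """PoP2 from the one-sided posterior probability Pr(pE > pS | data)."""
    return 2 * (1 - max(pr_e_greater, 1 - pr_e_greater))
```

For example, $Z = 1.96$ gives a two-sided p-value of about 0.05, and a one-sided posterior probability of 0.975 gives $\mathrm{PoP}_2 = 0.05$ as well.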

The equivalence of the p-value and the posterior probability in the case of binary outcomes can be established by applying the Bayesian central limit theorem. Under a large sample size, the posterior distributions of $p_E$ and $p_S$ can be approximated as

$$p_g \mid y_g \sim N(\hat{p}_g,\; \hat{p}_g(1-\hat{p}_g)/n), \quad g = E, S.$$

As $p_E$ and $p_S$ are independent, the posterior distribution of $p_E - p_S$ can be derived as

$$p_E - p_S \mid y_E, y_S \sim N(\hat{p}_E - \hat{p}_S,\; \{\hat{p}_E(1-\hat{p}_E) + \hat{p}_S(1-\hat{p}_S)\}/n).$$

Therefore, the posterior probability of $H_0$ is

$$\mathrm{PoP}_1 = \Pr(p_E \le p_S \mid y_E, y_S) \approx \Phi\left(-\frac{\hat{p}_E - \hat{p}_S}{[\{\hat{p}_E(1-\hat{p}_E) + \hat{p}_S(1-\hat{p}_S)\}/n]^{1/2}}\right) = \Phi(-Z),$$

which is equivalent to $\text{p-value}_1$. The equivalence relationship for a two-sided test can be derived along similar lines.

More generally, Dudley and Haughton (2002) proved that under mild regularity conditions, the posterior probability of a half space converges to the standard normal CDF transformation of the likelihood ratio test statistic. In a one-sided hypothesis test, the posterior probability of the half space is $1 - \mathrm{PoP}_1$, whereas the standard normal CDF transformation of the likelihood ratio test statistic equals one minus $\text{p-value}_1$; therefore, $\mathrm{PoP}_1$ and $\text{p-value}_1$ are asymptotically equivalent.

### 3.2 One-Sample Hypothesis Test

In a single-arm clinical trial with dichotomous outcomes, we are interested in examining whether the response rate $p_E$ of the experimental drug exceeds a prespecified threshold $p_0$, by formulating a one-sided hypothesis test,

$$H_0\colon p_E \le p_0 \quad \text{versus} \quad H_1\colon p_E > p_0.$$

In the frequentist paradigm, the p-value can be computed based on the exact binomial test. In the Bayesian paradigm, we assume a beta prior distribution for $p_E$, e.g., $p_E \sim \mathrm{Beta}(a, b)$. The posterior distribution of $p_E$ is given by $p_E \mid y_E \sim \mathrm{Beta}(a + y_E,\; b + n - y_E)$, for which the density function is denoted by $f(p_E \mid y_E)$. Based on the posterior probability, we can construct a Bayesian decision rule so that the experimental treatment is declared promising if

$$\Pr(H_1 \mid y_E) = \Pr(p_E > p_0 \mid y_E) > \eta,$$

where

$$\Pr(p_E > p_0 \mid y_E) = \int_{p_0}^1 f(p_E \mid y_E)\, dp_E.$$

Otherwise, we fail to declare treatment efficacy. As a result, the one-sided posterior probability is defined as

$$\mathrm{PoP}_1 = \Pr(H_0 \mid y_E) = \Pr(p_E \le p_0 \mid y_E).$$

For two-sided hypothesis tests, we are interested in examining whether the response rate of the experimental drug is different from $p_0$,

$$H_0\colon p_E = p_0 \quad \text{versus} \quad H_1\colon p_E \ne p_0.$$

The p-value can be computed based on the exact binomial test. If we define the two-sided posterior probability,

$$\mathrm{PoP}_2 = 2[1 - \max\{\Pr(p_E > p_0 \mid y_E), \Pr(p_E < p_0 \mid y_E)\}],$$

then its relationship with the p-value is similar to that of one-sided hypothesis testing, as shown in Figure 4.

In a numerical study, we set $n = 20$, 50, 100 and 500, fix the threshold $p_0$, and randomly draw integers between 0 and $n$ to be the values of $y_E$. We assume a noninformative prior for $p_E$, i.e., $a = b = 1$. Figure 2 shows the relationship between the posterior probability of the null hypothesis and the p-value, which clearly indicates that all the points lie very close to the straight line $y = x$.
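A sketch of one replication of this one-sample comparison follows. We take the one-sided exact binomial p-value to be $\Pr(Y \ge y_E \mid p_E = p_0)$, the standard convention (the paper does not spell it out), and estimate $\mathrm{PoP}_1$ by Monte Carlo under a Beta(1, 1) prior:

```python
import math
import random

def exact_binom_pvalue(y, n, p0):
    """One-sided exact binomial p-value: Pr(Y >= y | pE = p0)."""
    return sum(math.comb(n, k) * p0**k * (1 - p0)**(n - k) for k in range(y, n + 1))

def pop1_one_sample(y, n, p0, a=1.0, b=1.0, draws=100_000, seed=5):
    """Monte Carlo estimate of Pr(pE <= p0 | y) under a Beta(a, b) prior."""
    rng = random.Random(seed)
    return sum(rng.betavariate(a + y, b + n - y) <= p0 for _ in range(draws)) / draws
```

Because the exact test is discrete, the agreement is only approximate at moderate $n$ (e.g., $y_E = 30$, $n = 100$, $p_0 = 0.25$ gives values a few percentage points apart), consistent with the discreteness remark in Section 2.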

## 4 Hypothesis Test for Normal Data

### 4.1 Hypothesis Test with Known Variance

In a two-arm randomized clinical trial with normal endpoints, we are interested in comparing the means of the outcomes between the experimental and standard arms. Let $n$ denote the sample size for each arm, and let $(x_{Ei}, x_{Si})$, $i = 1, \ldots, n$, denote the paired data under the experimental and standard treatments. Assume $x_{Ei} \sim N(\mu_E, \sigma^2)$ and $x_{Si} \sim N(\mu_S, \sigma^2)$ with unknown means $\mu_E$ and $\mu_S$ but a known variance $\sigma^2$; for ease of exposition we take $\sigma^2 = 1$. Let $\bar{x}_E$ and $\bar{x}_S$ denote the sample means, and let $\theta = \mu_E - \mu_S$ and $\hat{\theta} = \bar{x}_E - \bar{x}_S$ denote the true and the observed treatment difference, respectively.

Considering the one-sided hypothesis test,

$$H_0\colon \theta \le 0 \quad \text{versus} \quad H_1\colon \theta > 0,$$

the frequentist Z-test statistic is formulated as $Z = \hat{\theta}\sqrt{n/2}$, which follows the standard normal distribution under the null hypothesis. Therefore, the p-value under the one-sided hypothesis test is given by

$$\text{p-value}_1 = \Pr(Z \ge \hat{\theta}\sqrt{n/2} \mid H_0) = 1 - \Phi(\hat{\theta}\sqrt{n/2}),$$

where, with a slight abuse of notation, $Z$ denotes a standard normal random variable.

In the Bayesian paradigm, if we assume an improper flat prior distribution, $\pi(\theta) \propto 1$, the posterior distribution of $\theta$ is

$$\theta \mid D \sim N(\hat{\theta},\; 2/n).$$

Therefore, the posterior probability of $\theta$ being smaller than or equal to 0 is

$$\mathrm{PoP}_1 = \Pr(\theta \le 0 \mid D) = 1 - \Phi(\hat{\theta}\sqrt{n/2}).$$

Under such an improper prior distribution of $\theta$, we can establish an exact equivalence relationship between $\text{p-value}_1$ and $\mathrm{PoP}_1$.

Under the two-sided hypothesis test, $H_0\colon \theta = 0$ versus $H_1\colon \theta \ne 0$, the p-value is given by

$$\text{p-value}_2 = 2[1 - \max\{\Pr(Z \ge z \mid H_0), \Pr(Z \le z \mid H_0)\}] = 2 - 2\max\{\Phi(\hat{\theta}\sqrt{n/2}),\; \Phi(-\hat{\theta}\sqrt{n/2})\},$$

where $z$ denotes the observed value of the test statistic.

Correspondingly, the two-sided posterior probability is defined as

$$\mathrm{PoP}_2 = 2[1 - \max\{\Pr(\theta < 0 \mid D), \Pr(\theta > 0 \mid D)\}] = 2 - 2\max\{\Phi(\hat{\theta}\sqrt{n/2}),\; \Phi(-\hat{\theta}\sqrt{n/2})\},$$

which is exactly the same as the (two-sided) p-value.
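Under the flat prior, the identity $\mathrm{PoP}_1 = \text{p-value}_1$ is exact rather than merely asymptotic, and this can be checked numerically (a sketch with an illustrative $\hat{\theta}$ and $n$, taking $\sigma^2 = 1$):

```python
import math
from statistics import NormalDist

def normal_known_var(theta_hat, n):
    """One-sided p-value and PoP1 under a flat prior (known unit variance):
    the two quantities coincide exactly."""
    z = theta_hat * math.sqrt(n / 2)
    p_value = 1 - NormalDist().cdf(z)                       # 1 - Phi(theta_hat * sqrt(n/2))
    pop1 = NormalDist(theta_hat, math.sqrt(2 / n)).cdf(0.0)  # Pr(theta <= 0 | D)
    return p_value, pop1
```

Both returned values are $1 - \Phi(\hat{\theta}\sqrt{n/2})$, so they agree up to floating-point error for any inputs.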

### 4.2 Hypothesis Test with Unknown Variance

In a more general setting, we consider the case where $\mu_E$, $\mu_S$ and $\sigma^2$ are all unknown parameters. We define $x_i = x_{Ei} - x_{Si}$, which follows the normal distribution $N(\theta, 2\sigma^2)$. For notational simplicity, let $\nu = 2\sigma^2$, and we are interested in modeling the joint posterior distribution of $\theta$ and $\nu$.

In the frequentist paradigm, Student's t-test statistic is

$$T = \frac{\hat{\theta}}{\sqrt{\sum_{i=1}^n (x_i - \hat{\theta})^2 / \{(n-1)n\}}}.$$

Therefore, the p-value under the one-sided hypothesis test is

$$\text{p-value}_1 = 1 - F_{t_{n-1}}(T),$$

where $F_{t_{n-1}}$ denotes the CDF of Student's $t$ distribution with $n-1$ degrees of freedom.

In the Bayesian paradigm, if we assume Jeffreys' prior for $\theta$ and $\nu$, $\pi(\theta, \nu) \propto \nu^{-3/2}$, the corresponding posterior distribution is

$$p(\theta, \nu \mid D) \propto \nu^{-(n+3)/2} \exp\left\{-\frac{\sum_{i=1}^n (x_i - \hat{\theta})^2 + n(\hat{\theta} - \theta)^2}{2\nu}\right\},$$

which matches the normal-inverse-chi-square distribution,

$$(\theta, \nu) \mid D \sim \text{N-Inv-}\chi^2\left(\hat{\theta},\; n,\; n,\; \sum_{i=1}^n (x_i - \hat{\theta})^2 / n\right).$$

Based on the posterior distribution, the one-sided posterior probability of the null hypothesis is $\mathrm{PoP}_1 = \Pr(\theta \le 0 \mid D)$.
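This posterior can be checked by Monte Carlo without special libraries: the posterior above implies that $\nu$ is marginally inverse-gamma with shape $n/2$ and scale $SS/2$, where $SS = \sum_i (x_i - \hat{\theta})^2$, and $\theta \mid \nu \sim N(\hat{\theta}, \nu/n)$. The sketch below (helper names are ours; the $t$ tail area is obtained by direct numerical integration of the $t$ density) compares $\mathrm{PoP}_1$ with the one-sided t-test p-value, which should agree closely for moderate $n$:

```python
import math
import random

def t_pvalue_one_sided(t_stat, df, grid=20000, upper=60.0):
    """Tail area 1 - F_t(T) of Student's t with df degrees of freedom,
    by trapezoidal integration of the t density from T to a large bound."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    f = lambda t: c * (1 + t * t / df) ** (-(df + 1) / 2)
    h = (upper - t_stat) / grid
    s = 0.5 * (f(t_stat) + f(upper)) + sum(f(t_stat + i * h) for i in range(1, grid))
    return s * h

def pop1_jeffreys(x, draws=100_000, seed=3):
    """Monte Carlo Pr(theta <= 0 | D) under Jeffreys' prior:
    draw nu ~ Inv-Gamma(n/2, SS/2), then theta | nu ~ N(theta_hat, nu/n)."""
    rng = random.Random(seed)
    n = len(x)
    theta_hat = sum(x) / n
    ss = sum((xi - theta_hat) ** 2 for xi in x)
    hits = 0
    for _ in range(draws):
        nu = (ss / 2) / rng.gammavariate(n / 2, 1.0)   # inverse-gamma draw
        theta = rng.gauss(theta_hat, math.sqrt(nu / n))
        hits += theta <= 0
    return hits / draws
```

The small residual discrepancy (degrees of freedom $n$ versus $n-1$) vanishes as $n$ grows, consistent with the asymptotic equivalence discussed in the text.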

As an alternative to Jeffreys' prior, we also consider a normal-inverse-gamma prior distribution for $\theta$ and $\nu$, $(\theta, \nu) \sim \text{N-IG}(\theta_0, \nu_0, \alpha, \beta)$, which belongs to the conjugate family of prior distributions for the normal likelihood function. As a result, the corresponding posterior distribution is also normal-inverse-gamma,

$$(\theta, \nu) \mid D \sim \text{N-IG}\left(\frac{\theta_0 \nu_0 + n\hat{\theta}}{\nu_0 + n},\; \nu_0 + n,\; \alpha + \frac{n}{2},\; \beta + \frac{1}{2}\sum_{i=1}^n (x_i - \hat{\theta})^2 + \frac{n\nu_0}{\nu_0 + n}\cdot\frac{(\hat{\theta} - \theta_0)^2}{2}\right).$$

For a two-sided hypothesis test, the p-value is

$$\text{p-value}_2 = 2 - 2F_{t_{n-1}}(|T|) = 2[1 - \max\{F_{t_{n-1}}(T), F_{t_{n-1}}(-T)\}].$$

Similarly, we define the two-sided posterior probability as

$$\mathrm{PoP}_2 = 2[1 - \max\{\Pr(\theta > 0 \mid D), \Pr(\theta < 0 \mid D)\}].$$

In a numerical study, we simulate a large number of trials, and for each replication we compute the posterior probability and the p-value. To ensure that the simulated p-values cover the entire range of $(0, 1)$, we generate values of $\theta$ from a distribution centered at zero and values of $\nu$ from a distribution truncated at zero. To construct a vague normal-inverse-gamma prior distribution, we take small values of $\nu_0$, $\alpha$ and $\beta$. Under Jeffreys' prior and the vague normal-inverse-gamma prior distributions, the equivalence relationships between p-values and the posterior probabilities are shown in Figure 5, with sample sizes of 20, 50 and 100, respectively.

In addition, we generate values of $\theta$ from a Gamma distribution, a Beta distribution, as well as a mixture of two normal distributions with equal weights. To ensure that the simulated p-values cover the entire range of $(0, 1)$, the simulated values of $\theta$ are further shifted by subtracting the mean of the corresponding distribution plus a uniform random variable. Under Jeffreys' prior, the equivalence relationships between p-values and the posterior probabilities are shown in Figure 6.

To study the effect of an informative prior and the sample size on the relationship between p-values and posterior probabilities, we construct an informative prior distribution on $\theta$ by choosing a small prior variance. Under such an informative prior distribution, the relationships between p-values and posterior probabilities under increasing sample sizes are shown in Figure 7. As the sample size increases, the equivalence relationship is gradually established. Moreover, we consider the case where the sample size is fixed but the prior variance increases, i.e., we let the prior variance change from 0.001 to 1. As shown in Figure 7, as the prior distribution becomes less informative, the equivalence relationship becomes more evident.

## 5 Hypothesis Test for Multivariate Normal Data

In hypothesis testing on the mean vector of a multivariate normal random variable, we consider $\mathbf{X} \sim N_q(\boldsymbol{\mu}, \Sigma)$, where $q$ is the dimension of the multivariate normal distribution. For ease of exposition, the covariance matrix $\Sigma$ is assumed to be known. Let $\mathbf{X}_1, \ldots, \mathbf{X}_n$ denote the observed multivariate vectors, and let $\bar{\mathbf{X}}$ denote the sample mean vector, so that $\bar{\mathbf{X}} \sim N_q(\boldsymbol{\mu}, \Sigma/n)$.

Consider the one-sided hypothesis test,

$$H_0\colon \mathbf{c}_k^\top \boldsymbol{\mu} \le 0 \text{ for some } k = 1, \ldots, K \quad \text{versus} \quad H_1\colon \mathbf{c}_k^\top \boldsymbol{\mu} > 0 \text{ for all } k = 1, \ldots, K,$$

where $\mathbf{c}_1, \ldots, \mathbf{c}_K$ are prespecified $q$-dimensional vectors. The likelihood ratio test statistics (Sasabuchi, 1980) are

$$Z_k = \frac{\mathbf{c}_k^\top \bar{\mathbf{X}}}{\sqrt{\mathbf{c}_k^\top \Sigma\, \mathbf{c}_k / n}}, \quad k = 1, \ldots, K, \tag{5.1}$$

and the corresponding p-values are

$$\text{p-value}_1^{(k)} = 1 - \Phi(Z_k).$$

The null hypothesis is rejected if all of the $K$ p-values are smaller than $\alpha$.

In the Bayesian paradigm, we assume a conjugate multivariate normal prior distribution for $\boldsymbol{\mu}$, $\boldsymbol{\mu} \sim N_q(\boldsymbol{\mu}_0, \Sigma_0)$. The corresponding posterior distribution is $\boldsymbol{\mu} \mid D \sim N_q(\boldsymbol{\mu}_n, \Sigma_n)$, where

$$\boldsymbol{\mu}_n = \Sigma_0 (\Sigma_0 + \Sigma/n)^{-1} \bar{\mathbf{X}} + \frac{1}{n}\Sigma (\Sigma_0 + \Sigma/n)^{-1} \boldsymbol{\mu}_0, \qquad \Sigma_n = \frac{1}{n}\Sigma (\Sigma_0 + \Sigma/n)^{-1} \Sigma_0.$$

The one-sided posterior probability corresponding to $\text{p-value}_1^{(k)}$ is

$$\mathrm{PoP}_1^{(k)} = \Pr(\mathbf{c}_k^\top \boldsymbol{\mu} \le 0 \mid D).$$

For two-sided hypothesis testing (Liu and Berger, 1995), we are interested in

$$H_0\colon \mathbf{c}_k^\top \boldsymbol{\mu} \le 0 \text{ for some } k \text{ and } \mathbf{c}_k^\top \boldsymbol{\mu} \ge 0 \text{ for some } k \quad \text{versus} \quad H_1\colon \mathbf{c}_k^\top \boldsymbol{\mu} > 0 \text{ for all } k, \text{ or } \mathbf{c}_k^\top \boldsymbol{\mu} < 0 \text{ for all } k.$$

Based on (5.1), the p-values are given by

$$\text{p-value}_2^{(k)} = 2 - 2\Phi(|Z_k|) = 2[1 - \max\{\Phi(Z_k), \Phi(-Z_k)\}].$$

The null hypothesis is rejected if all of the $K$ p-values are smaller than $\alpha$. Similar to the univariate case, we define the two-sided posterior probability,

$$\mathrm{PoP}_2^{(k)} = 2[1 - \max\{\Pr(\mathbf{c}_k^\top \boldsymbol{\mu} > 0 \mid D), \Pr(\mathbf{c}_k^\top \boldsymbol{\mu} < 0 \mid D)\}].$$

In a numerical study, we compute the posterior probabilities $\mathrm{PoP}_1^{(k)}$ and $\mathrm{PoP}_2^{(k)}$ for $k = 1, \ldots, K$, and compare them with the corresponding p-values. We take $\mathbf{c}_k$ to be a unit vector with 1 in the $k$th element and 0 otherwise, and assume a vague normal prior distribution for $\boldsymbol{\mu}$, i.e., $\boldsymbol{\mu}_0 = \mathbf{0}$ and $\Sigma_0$ proportional to $I_q$ with a large variance, where $I_q$ is the $q$-dimensional identity matrix. The relationship between the posterior probabilities and p-values is shown in Figure 8, which is very similar to that in the univariate setting and again demonstrates their equivalence.
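When $\Sigma = \sigma^2 I_q$ and $\mathbf{c}_k$ is the $k$th coordinate vector, each comparison reduces to a univariate calculation, and the posterior for the $k$th component of $\boldsymbol{\mu}$ under a $N(0, \tau^2 I_q)$ prior has the familiar conjugate-normal form. A sketch (the parameter values are illustrative, with a large $\tau^2$ playing the role of the vague prior):

```python
import math
from statistics import NormalDist

def pop_and_pvalue_coord(xbar_k, n, sigma2=1.0, tau2=100.0):
    """For c_k a unit coordinate vector, Sigma = sigma2 * I and Sigma0 = tau2 * I,
    return the one-sided p-value and PoP1 for the k-th coordinate."""
    z_k = xbar_k / math.sqrt(sigma2 / n)
    p_value = 1 - NormalDist().cdf(z_k)
    # conjugate posterior: mu_k | D ~ N(w * xbar_k, w * sigma2 / n),
    # with shrinkage weight w = tau2 / (tau2 + sigma2 / n)
    w = tau2 / (tau2 + sigma2 / n)
    pop1 = NormalDist(w * xbar_k, math.sqrt(w * sigma2 / n)).cdf(0.0)
    return p_value, pop1
```

With a vague prior (large $\tau^2$) the shrinkage weight is essentially 1, and the two quantities agree to several decimal places, matching the pattern in Figure 8.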

## 6 Discussion

Berger and Sellke (1987) studied the point null for two-sided hypothesis tests, and noted discrepancies between the frequentist test and the Bayesian test based on the posterior probability. The major difference between their work and the equivalence relationship between the posterior probability and the p-value established here lies in the assumption of the prior distribution. Berger and Sellke (1987) assumed a prior distribution with a point mass at the null hypothesis, which violates the regularity condition of continuity in Dudley and Haughton (2002), leading to the discrepancy between the posterior probability and the p-value. The equivalence relationship between the posterior probability and the p-value for one-sided tests can be established from the theoretical results of Dudley and Haughton (2002), where the posterior probability of a half space is proven to converge to the standard normal CDF transformation of the likelihood ratio test statistic. A future direction of research is on more complex composite hypothesis tests involving multivariate normal outcomes. Berger (1989) and Liu and Berger (1995) constructed uniformly more powerful tests than the likelihood ratio test for multivariate one-sided tests involving linear inequalities. Follmann (1996) proposed a simple alternative to the likelihood ratio test. It would be of interest to study the relationship of these tests with their Bayesian counterparts based on posterior probabilities.

## References

Bayarri, M. J. and Berger, J. O. (2004). The interplay of Bayesian and frequentist analysis. Statistical Science 19, 58–80.

Berger, J. O. (2003). Could Fisher, Jeffreys and Neyman have agreed on testing? (with discussion) Statistical Science 18, 1–32.

Berger, J. O. and Delampady M. (1987). Testing precise hypotheses. Statistical Science 2, 317–335.

Berger, J. O. and Sellke, T. (1987). Testing a point null hypothesis: the irreconcilability of P values and evidence. Journal of the American Statistical Association 82, 112–122.

Benjamin, D. J., Berger, J. O., Johannesson, M., Nosek, B. A., Wagenmakers, E., et al. (2018). Redefine statistical significance. Nature Human Behaviour 2, 6–10.

Briggs, W. M. (2017). The substitute for p-values. Journal of the American Statistical Association 112, 897–898.

Berger, R. L. (1989). Uniformly more powerful tests for hypotheses concerning linear inequalities and normal means. Journal of the American Statistical Association 84, 192–199.

Casella, G. and Berger, R. L. (1987). Reconciling Bayesian and frequentist evidence in the one-sided testing problem. (with discussion) Journal of the American Statistical Association 82, 106–111.

Concato, J. and Hartigan, J. A. (2016). P values: from suggestion to superstition. Journal of Investigative Medicine 64, 1166–1171.

Colquhoun, D. (2014). An investigation of the false discovery rate and the misinterpretation of p-values. Royal Society Open Science 1, 140216.

Cumming, G. (2014). The new statistics: why and how. Psychological Science 25, 7–29.

Donahue, R. M. J. (1999). A note on information seldom reported via the P value. The American Statistician 53, 303–306.

Dudley, R. M. and Haughton, D. (2002). Asymptotic normality with small relative errors of posterior probabilities of half-spaces. The Annals of Statistics 30, 1311–1344.

Fidler, F., Thomason, N., Cumming, G., Finch, S., Leeman, J. (2004). Editors can lead researchers to confidence intervals, but can’t make them think: Statistical reform lessons from medicine. Psychological Science 15, 119–126.

Follmann, D. (1996). A simple multivariate test for one-sided alternatives. Journal of the American Statistical Association 91, 854–861.

Gill, J. (2018). Comments from the New Editor. Political Analysis 26, 1–2.

Goodman, S. N. (1999). Toward evidence-based medical statistics. 1: the p value fallacy. Annals of Internal Medicine 130, 995–1004.

Hubbard, R. and Lindsay, R. M. (2008). Why P values are not a useful measure of evidence in statistical significance testing. Theory & Psychology 18, 69–88.

Hung, H. J., O’Neill, R. T., Bauer, P., Kohne, K. (1997). The behavior of the p-value when the alternative hypothesis is true. Biometrics 53, 11–22.

Ioannidis, J. P. (2005). Why most published research findings are false. PLoS Medicine 2, 124.

Jager, L. R. and Leek, J. T. (2014). An estimate of the science-wise false discovery rate and application to the top medical literature. Biostatistics 15, 1–12.

Johnson, V. E. (2013). Revised standards for statistical evidence. Proceedings of the National Academy of Sciences 110, 19313–19317.

Lee, J. J. (2010). Demystify statistical significance–time to move on from the p-value to Bayesian analysis. Journal of the National Cancer Institute 103, 16–20.

Leek, J., McShane, B. B., Gelman, A., Colquhoun, D., Nuijten, M. B., Goodman, S. N. (2017). Five ways to fix statistics. Nature 551, 557–559.

Lehmann, E. L. and Romano, J. P. (2005). Testing Statistical Hypotheses. New York: Springer.

Lindley, D. V. (1957). A statistical paradox. Biometrika 44, 187–192.

Liu, H. and Berger, R. L. (1995). Uniformly more powerful, one-sided tests for hypotheses about linear inequalities. The Annals of Statistics 23, 55–72.

McShane, B. B., Gal, D., Gelman, A., Robert, C., Tackett, J. L. (2018). Abandon statistical significance. arXiv:1709.07588

Meng, X. L. (1994). Posterior predictive p-values. The Annals of Statistics 22, 1142–1160.

Murtaugh, P. A. (2014). In defense of P values. Ecology 95, 611–617.

Nuzzo, R. (2014). Statistical errors: P values, the ‘gold standard’ of statistical validity, are not as reliable as many scientists assume. Nature 506, 150–152.

Ranstam, J. (2012). Why the P-value culture is bad and confidence intervals a better alternative. Osteoarthritis Cartilage 20, 805–808.

Rosenthal, R. and Rubin, D. B. (1983). Ensemble-adjusted p values. Psychological Bulletin 94, 540–541.

Royall, R. M. (1986). The effect of sample size on the meaning of significance tests. The American Statistician 40, 313–315.

Rubin, D. B. (1984). Bayesianly justifiable and relevant frequency calculations for the applied statistician. The Annals of Statistics 12, 1151–1172.

Rubin, D. B. (1998). More powerful randomization-based p-values in double-blind trials with non-compliance. Statistics in Medicine 17, 371–385.

Sackrowitz, H. and Samuel-Cahn, E. (1999). P values as random variable-expected P values. The American Statistician 53, 326–331.

Sasabuchi, S. (1980). A test of a multivariate normal mean with composite hypotheses determined by linear inequalities. Biometrika 67, 429–439.

Savalei, V. and Dunn, E. (2015). Is the call to abandon p-values the red herring of the replicability crisis? Frontiers in Psychology 6, 245.

Schervish, M. J. (1996). P values: what they are and what they are not. The American Statistician 50, 203–206.

Sellke, T., Bayarri, M. J., and Berger, J. O. (2001). Calibration of p-values for testing precise null hypotheses. The American Statistician 55, 62–71.

Simmons, J. P., Nelson, L. D., and Simonsohn, U. (2011). False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science 22, 1359–1366.

Trafimow, D., Amrhein, V., Areshenkoff, C. N., Barrera-Causil, C. J., Beh, E. J., et al. (2018). Manipulating the alpha level cannot cure significance testing. Frontiers in Psychology 9, 699.

Trafimow, D. and Marks, M. (2015). Editorial. Basic and Applied Social Psychology 37, 1–2.

Wagenmakers, E. J. (2007). A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review 14, 779–804.

Wasserstein, R. L. and Lazar, N. A. (2016). The ASA’s statement on p-values: context, process, and purpose. The American Statistician 70, 129–133.

Figure 1: Comparison of the type I error rate and power under the frequentist Z-test and the Bayesian test based on the posterior probability for detecting treatment differences δ=0.15 (left) and δ=0.1 (right).

Figure 2: The relationship between p-value and the posterior probability over 1000 replications under one-sided one-sample and two-sample hypothesis tests with binary outcomes under sample sizes of 20, 50, 100, and 500 per arm, respectively.

Figure 3: The differences between p-values and posterior probabilities over 1000 replications in one-sided two-sample hypothesis tests with binary outcomes under sample sizes of 20, 50, 100, and 500, respectively.

Figure 4: The relationship between p-value and the posterior probability over 1000 replications under two-sided one-sample and two-sample hypothesis tests with binary outcomes under a sample size of 500 per arm.

Figure 5: The relationship between p-value and the posterior probability over 1000 replications under one-sided and two-sided hypothesis tests with normal outcomes, assuming Jeffreys’ prior and a vague normal-inverse-gamma prior, under sample sizes of 20, 50, and 100, respectively.

Figure 6: The relationship between p-value and the posterior probability over 1000 replications under one-sided hypothesis tests with outcomes generated from Gamma, Beta, and mixture normal distributions, assuming Jeffreys’ prior for the normal distribution, under sample sizes of 20 and 50, respectively.

Figure 7: The relationship between p-value and the posterior probability Pr(μE≤μS|D) over 1000 replications under one-sided hypothesis tests with normal outcomes; left panel: assuming a fixed informative normal-inverse-gamma prior under increasing sample sizes of 1000, 10000, and 100000 (from top to bottom); right panel: assuming a fixed sample size of 1000 with an increasing prior variance of 0.001, 0.01, and 1 (from top to bottom).

Figure 8: The relationship between p-value and the posterior probability over 1000 replications under one-sided and two-sided hypothesis tests with multivariate normal outcomes under a sample size of 100.