Testing models and hypotheses against experimental data is a fundamental part of science, yet the statistical approaches for doing it remain contentious. In many fields, null hypothesis significance testing via p-values is popular. In this approach, one tests a particular null hypothesis by computing the probability of obtaining data as or more extreme than that observed, assuming that the null hypothesis is true. This probability, the p-value, may be used either as a measure of evidence against the null hypothesis or as a means of controlling a type-1 error rate (10.2307/20115217). This error rate is the rate at which we would reject the null hypothesis when it is true. If we reject the null hypothesis only when $p \le \alpha$, the type-1 error rate would be $\alpha$.
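The link between a p-value threshold and the type-1 error rate can be checked by simulation. The following is a minimal sketch under assumptions of our own: a toy null hypothesis in which a single datum is drawn from a standard normal, a one-sided p-value, and hypothetical function names.

```python
import random
from statistics import NormalDist

NULL = NormalDist(0.0, 1.0)


def p_value(x):
    """One-sided p-value: probability of a datum >= x under the null x ~ N(0, 1)."""
    return 1.0 - NULL.cdf(x)


def type1_error_rate(alpha, n_trials=50_000, seed=1):
    """Fraction of simulated null data sets rejected by the rule p <= alpha."""
    rng = random.Random(seed)
    rejections = sum(p_value(rng.gauss(0.0, 1.0)) <= alpha for _ in range(n_trials))
    return rejections / n_trials


rate = type1_error_rate(alpha=0.05)
print(f"estimated type-1 error rate: {rate:.3f}")
```

Up to Monte Carlo noise, the estimated rejection rate should match the chosen threshold.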
The use of p-values has been contentious for decades (Edwards1992), and in recent years the situation appeared to reach a critical point (benjamin2018redefine; lakens2018justify) after it became apparent that many effects discovered through null hypothesis significance testing were in fact spurious and could not be reproduced by other researchers (10.1371/journal.pmed.0020124). Many remedies have been proposed, from abandoning null hypothesis significance testing altogether (mcshane2019abandon) to Bayesian methods, including Bayes factors (Jeffreys:1939xee; doi:10.1080/01621459.1995.10476572).
In the Bayesian approach we compute directly the relative probability that two models, $M_0$ and $M_1$, predict the observed data, $D$,
$$B_{10} = \frac{p(D \mid M_1)}{p(D \mid M_0)}. \tag{1.1}$$
This so-called Bayes factor updates the relative plausibility of two models. The numerator and denominator in the Bayes factor are Bayesian evidences. In general, for composite models they may be written
$$p(D \mid M) = \int p(D \mid M, \theta) \, \pi(\theta) \, \mathrm{d}\theta, \tag{1.2}$$
that is, as the marginal likelihood, where $\theta$ are the model’s unknown parameters with prior density $\pi(\theta)$.
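As a minimal numerical sketch of the Bayes factor and marginal likelihood above, consider a toy problem that is our own assumption rather than an example from the references: a single datum $x \sim N(\mu, 1)$, a simple null model with $\mu = 0$, and a composite alternative with prior $\mu \sim N(0, \tau^2)$. The function names and the values $x = 2$ and $\tau = 1.5$ are illustrative. The evidence for the alternative is computed by quadrature and compared with the closed form that follows from the prior predictive $x \sim N(0, 1 + \tau^2)$.

```python
import math
from statistics import NormalDist


def likelihood(x, mu):
    """Likelihood of a single datum x ~ N(mu, 1)."""
    return NormalDist(mu, 1.0).pdf(x)


def evidence_alternative(x, tau, n_grid=4001, width=10.0):
    """Marginal likelihood under the prior mu ~ N(0, tau^2), by the trapezoid rule."""
    prior = NormalDist(0.0, tau)
    lo, hi = -width * tau, width * tau
    step = (hi - lo) / (n_grid - 1)
    total = 0.0
    for i in range(n_grid):
        mu = lo + i * step
        weight = 0.5 if i in (0, n_grid - 1) else 1.0
        total += weight * likelihood(x, mu) * prior.pdf(mu)
    return total * step


def bayes_factor(x, tau):
    """Bayes factor for the composite alternative against the simple null mu = 0."""
    return evidence_alternative(x, tau) / likelihood(x, 0.0)


def bayes_factor_closed_form(x, tau):
    """Same quantity from the analytic prior predictive x ~ N(0, 1 + tau^2)."""
    marginal = NormalDist(0.0, math.sqrt(1.0 + tau**2))
    return marginal.pdf(x) / NormalDist(0.0, 1.0).pdf(x)


numerical = bayes_factor(2.0, tau=1.5)
analytic = bayes_factor_closed_form(2.0, tau=1.5)
print(f"numerical: {numerical:.4f}, analytic: {analytic:.4f}")
```

In realistic problems the integral is rarely available in closed form, and the quadrature step would be replaced by whichever evidence estimator suits the model.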
In the null hypothesis significance test, we are left to decide what possible data would be more extreme than that observed. If we consider a fixed type-1 error rate, this amounts to choosing a rejection region, $R$, of size $\alpha$ in the sample space of the experiment. Although other points of view exist, the dominant approach is to choose the rejection region that maximises the chance of rejecting the null hypothesis when it is false, i.e., that maximises the power, $1 - \beta$. In other words, we choose $R$ such that, for fixed
$$\alpha = P(D \in R \mid M_0), \tag{1.3}$$
the power
$$1 - \beta = P(D \in R \mid M_1) \tag{1.4}$$
is maximised.
Equivalently, we minimise the type-2 error rate, $\beta$, the probability of not rejecting the null hypothesis when it is false. The Neyman-Pearson lemma (doi:10.1098/rsta.1933.0009) and the Karlin-Rubin theorem (casella2002statistical) indicate that the most powerful tests are often built upon likelihood ratios, and likelihood ratios are ubiquitous in significance tests.
We could, on the other hand, relinquish control of the type-1 error rate, but insist on a rejection region that minimises a weighted sum of type-1 and type-2 error rates (Lindley1953; Lehmann1958; MR0133898; 10.1371/journal.pone.0032734),
$$a \, \alpha + b \, \beta, \tag{1.5}$$
with weights $a$ and $b$. For the purposes of the calculations and arguments in this work, this is equivalent to the conventional approach discussed above. To see that, suppose that the region that minimises eq. 1.5 has size $\alpha$. Since this region minimises $\beta$ among all possible regions of size $\alpha$, it must be the one that maximises the power, $1 - \beta$.
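To make the link to likelihood ratios explicit, the minimisation can be sketched as follows. This is a standard argument rather than a derivation taken from the cited works, and the symbols are our assumptions: $a, b > 0$ denote the weights, $R$ the rejection region, $D$ the data, and $M_0$ and $M_1$ the null and alternative models, taken here to be simple.

```latex
\begin{align}
a\,\alpha + b\,\beta
  &= a \int_R p(D \mid M_0)\,\mathrm{d}D
     + b \left[1 - \int_R p(D \mid M_1)\,\mathrm{d}D\right] \\
  &= b + \int_R \left[a\,p(D \mid M_0) - b\,p(D \mid M_1)\right] \mathrm{d}D .
\end{align}
```

The integral is smallest when $R$ contains exactly the data for which the integrand is negative, that is, $R = \{D : p(D \mid M_1) / p(D \mid M_0) > a / b\}$, so the minimising region is again defined by a threshold on the likelihood ratio.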
In this work, we use the Bayes factor as a test statistic in a frequentist setting, argue that it is an optimal choice and that this may help reconcile advocates of Bayesian and frequentist procedures for hypothesis testing. The idea of using the Bayes factor in this way and its potential for harmonising Bayesian and frequentist procedures (berger2003; BAYARRI201690) dates as far back as the 1950s (10.1214/aoms/1177706790; good1961weight; 10.2307/2290192), though 10.1214/ss/1030037904 notes that it failed to gain traction among practitioners. In sec. 2 we briefly review the Neyman-Pearson lemma and the Karlin-Rubin theorem, and in sec. 3 and sec. 4, we show they may be applied to Bayes factors, such that the Bayes factor is the most powerful choice of test statistic. This was previously discussed formally in doi:10.1111/anzs.12171. Besides drawing attention to the relevant results and presenting them in an original, pedagogical manner, we combine the ideas of 10.1214/aoms/1177706790; good1961weight; 10.2307/2290192 and results of doi:10.1111/anzs.12171. We thus argue in sec. 5 that the Neyman-Pearson lemma and Karlin-Rubin theorem in fact put Bayes factors at the centre of frequentist hypothesis tests as they are an optimal choice of test statistic. This may help finally reconcile Bayesian and frequentist approaches to testing.
2 The Neyman-Pearson lemma
For simple hypotheses with no unknown parameters, the Neyman-Pearson lemma (doi:10.1098/rsta.1933.0009) tells us that the likelihood ratio,
$$\Lambda(D) = \frac{p(D \mid M_1)}{p(D \mid M_0)}, \tag{2.1}$$
is an optimal test statistic. The rejection region of size $\alpha$ that maximises the power must be delimited by a contour of the likelihood ratio, $R = \{D : \Lambda(D) \ge k\}$, where the value at the contour, $k$, depends on the desired type-1 error rate and would be found from eq. 1.3. The Neyman-Pearson lemma doesn’t extend to composite hypotheses that depend on unknown parameters, as the optimal choice of test statistic would in general depend on the assumed values for the unknown parameters. In some cases, a uniformly most powerful test (UMPT) exists, which maximises the power for any values of the unknown parameters in the alternative hypothesis. The Karlin-Rubin theorem demonstrates cases in which a UMPT exists; see sec. 4.
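The lemma can be illustrated with a toy pair of simple hypotheses, chosen by us for convenience: a single datum $x$ drawn from $N(0, 1)$ under the null and from $N(\mu_1, 1)$ with $\mu_1 > 0$ under the alternative. The sketch below, with hypothetical function names, compares the power of the likelihood-ratio region with that of a two-sided region of the same size.

```python
from statistics import NormalDist

NULL = NormalDist(0.0, 1.0)


def likelihood_ratio_test(alpha, mu1=1.0):
    """Threshold c and power of the region x > c, a contour of the likelihood ratio.

    For N(mu1, 1) against N(0, 1) with mu1 > 0, the likelihood ratio is
    exp(mu1 * x - mu1**2 / 2), which increases with x.
    """
    c = NULL.inv_cdf(1.0 - alpha)  # fixes the size: P(x > c | null) = alpha
    power = 1.0 - NormalDist(mu1, 1.0).cdf(c)
    return c, power


def two_sided_test(alpha, mu1=1.0):
    """Threshold c and power of the region |x| > c, which has the same size."""
    c = NULL.inv_cdf(1.0 - alpha / 2.0)
    alternative = NormalDist(mu1, 1.0)
    power = alternative.cdf(-c) + 1.0 - alternative.cdf(c)
    return c, power


c_lr, power_lr = likelihood_ratio_test(alpha=0.05)
c_two, power_two = two_sided_test(alpha=0.05)
print(f"likelihood ratio: x > {c_lr:.3f}, power {power_lr:.3f}")
print(f"two-sided: |x| > {c_two:.3f}, power {power_two:.3f}")
```

Both regions have the same type-1 error rate, but only the first is a contour of the likelihood ratio, so the lemma implies it has at least as much power.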
3 The Neyman-Pearson lemma for Bayes factors
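In the Bayesian formalism, for any composite model, we may find a simple model by marginalising the unknown parameters,
$$p(D \mid M) = \int p(D \mid M, \theta) \, \pi(\theta) \, \mathrm{d}\theta. \tag{3.1}$$
This is the prior predictive for the data and, when evaluated at the observed data, the Bayesian evidence for the model. We may thus use our Bayes factor in place of the likelihood ratio in the Neyman-Pearson lemma. In this case, though, the error rates are subtly re-interpreted, and to distinguish them we denote them with a bar,
$$\bar\alpha = P(D \in R \mid M_0) = \int P(D \in R \mid M_0, \theta_0) \, \pi_0(\theta_0) \, \mathrm{d}\theta_0, \tag{3.2}$$
$$\bar\beta = P(D \notin R \mid M_1) = \int P(D \notin R \mid M_1, \theta_1) \, \pi_1(\theta_1) \, \mathrm{d}\theta_1. \tag{3.3}$$
That is, here they are the expected error rates, averaging over the possible values for the unknown model parameters $\theta_0$ in the null hypothesis and for the unknown model parameters $\theta_1$ in the alternative hypothesis. They do not in general correspond to any observable long-run error rates.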
This requires, of course, choices of prior density, $\pi_0(\theta_0)$ and $\pi_1(\theta_1)$; see e.g., doi:10.1080/01621459.1996.10477003; consonni2018 for discussion of rules for choosing priors in a Bayesian setting. In a frequentist setting, the priors needn’t be interpreted in the same manner as in subjective or objective Bayesian approaches. The prior for any parameters in the alternative hypothesis, $\pi_1(\theta_1)$, may be thought of as a weight function that indicates which choices of parameters we want our test to be most powerful for (10.2307/2286006; BAYARRI201690). Similarly, the prior for any parameters in a composite null hypothesis, $\pi_0(\theta_0)$, weights which choices of parameters we most want to control the type-1 error rate for.
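We could instead consider a fixed maximum type-1 error rate, $\alpha$, for any values of the unknown parameters in the null hypothesis. If we assume that it occurs when $\theta_0 = \hat\theta_0$, we may write
$$\alpha = \max_{\theta_0} P(D \in R \mid M_0, \theta_0) = P(D \in R \mid M_0, \hat\theta_0). \tag{3.4}$$
This is in fact equivalent to a sharp prior, $\pi_0(\theta_0) = \delta(\theta_0 - \hat\theta_0)$, i.e., specifying through our prior that we must control the type-1 error rate for the parameters that maximise the type-1 error rate. In this case we would use the Bayes factor,
$$\hat B_{10} = \frac{p(D \mid M_1)}{p(D \mid M_0, \hat\theta_0)}, \tag{3.5}$$
in place of the likelihood ratio in the Neyman-Pearson lemma. Under the null hypothesis, the expected magnitude of the preference for the alternative model from the Bayes factor in eq. 1.1 cannot be larger than that from the one used when we control the maximum type-1 error rate in eq. 3.5. This result follows from Gibbs’ inequality,
$$\mathbb{E}\left[\ln \hat B_{10}\right] - \mathbb{E}\left[\ln B_{10}\right] = \int p(D \mid M_0) \, \ln \frac{p(D \mid M_0)}{p(D \mid M_0, \hat\theta_0)} \, \mathrm{d}D \ge 0, \tag{3.6}$$
where the expectations are taken over data distributed according to the prior predictive $p(D \mid M_0)$.
So whilst controlling the maximum error rate rather than the expected error rate may seem conservative, the computation involves the Bayes factor in eq. 3.5, which we expect to overstate the evidence against the null hypothesis.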
If the null hypothesis is simple, as is often the case, $\bar\alpha = \alpha$, and so the type-1 error rates may be interpreted in the usual, completely frequentist manner. The power, on the other hand, remains the expected power, averaged across the unknown parameters in the alternative hypothesis.
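The case of a simple null and a composite alternative can be sketched with a toy model of our own choosing: a single datum $x \sim N(\mu, 1)$, the null $\mu = 0$, and a weight function $\mu \sim N(0, \tau^2)$ for the alternative. Under these assumptions the Bayes factor increases with $|x|$, so a threshold on it is a two-sided region whose type-1 error rate is an ordinary frequentist one. The function names and numerical values are illustrative; the comparison is with a one-sided region of the same size.

```python
import math
from statistics import NormalDist

NULL = NormalDist(0.0, 1.0)


def bayes_factor(x, tau):
    """Bayes factor for mu ~ N(0, tau^2) against mu = 0, given x ~ N(mu, 1)."""
    return NormalDist(0.0, math.sqrt(1.0 + tau**2)).pdf(x) / NULL.pdf(x)


def bayes_factor_test(alpha, tau):
    """Bayes factor threshold k and expected power at frequentist size alpha.

    The Bayes factor increases with |x|, so B >= k is the region |x| >= c.
    """
    c = NULL.inv_cdf(1.0 - alpha / 2.0)
    k = bayes_factor(c, tau)
    predictive = NormalDist(0.0, math.sqrt(1.0 + tau**2))  # x under the alternative
    expected_power = 2.0 * (1.0 - predictive.cdf(c))
    return k, expected_power


def one_sided_test(alpha, tau):
    """Expected power of the region x >= c, which has the same size."""
    c = NULL.inv_cdf(1.0 - alpha)
    predictive = NormalDist(0.0, math.sqrt(1.0 + tau**2))
    return 1.0 - predictive.cdf(c)


k, power_bf = bayes_factor_test(alpha=0.05, tau=2.0)
power_one = one_sided_test(alpha=0.05, tau=2.0)
print(f"Bayes factor threshold {k:.3f}, expected power {power_bf:.3f}")
print(f"one-sided region, expected power {power_one:.3f}")
```

The expected power here is averaged over the weight function, so a different choice of $\tau$ or of prior shape would change both the threshold on the Bayes factor and the comparison.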
4 Karlin-Rubin theorem for Bayes factors
It would be desirable to extend the Neyman-Pearson lemma to composite models. Unfortunately, UMPTs do not always exist, as the optimal test generally depends on the values of the unknown parameters in the alternative hypothesis. There are approaches that sidestep the issue, such as a minimax treatment of type-1 error rates and power. The Karlin-Rubin theorem (casella2002statistical), on the other hand, extends the Neyman-Pearson lemma to a UMPT in a composite case in special circumstances.
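To apply the theorem, the null and alternative hypotheses must be disjoint regions of a one-dimensional parameter space separated by a boundary at $\theta_b$, that is,
$$M_0 : \theta \le \theta_b \quad \text{and} \quad M_1 : \theta > \theta_b. \tag{4.1}$$
Suppose that a sufficient test statistic, $T$, exists and that the likelihood ratio
$$\frac{p(T \mid \theta_1)}{p(T \mid \theta_0)} \tag{4.2}$$
is a monotonic non-decreasing function of $T$ for any $\theta_1 \ge \theta_0$. We call these conditions the monotone likelihood ratio (MLR) conditions. Under these conditions, the UMPT is a threshold on $T$, $T \ge t^\star$. The threshold is determined by fixing the maximum type-1 error rate,
$$\alpha = \max_{\theta \le \theta_b} P(T \ge t^\star \mid \theta), \tag{4.3}$$
where the maximum occurs at the boundary, $\theta = \theta_b$. This ensures the size of the test is no larger than $\alpha$ for any choice of unknown parameter in the null hypothesis.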
We can make a similar statement for our Bayes factor,
$$B_{10}(T) = \frac{\int_{\theta > \theta_b} p(T \mid \theta) \, \pi_1(\theta) \, \mathrm{d}\theta}{\int_{\theta \le \theta_b} p(T \mid \theta) \, \pi_0(\theta) \, \mathrm{d}\theta}. \tag{4.4}$$
We assume that the prior for the unknown parameters in the null hypothesis, $\pi_0$, was chosen. We suppose that, with that choice, the Bayes factor is always a monotonic function of $T$ for any choice of prior for the unknown parameters in the alternative model, $\pi_1$. This is satisfied by the MLR conditions of the Karlin-Rubin theorem for any choice of $\pi_0$; see App. A. The Karlin-Rubin theorem considered a fixed maximum type-1 error rate, $\alpha$. In our Bayesian interpretation, we could fix this or the mean error rate, $\bar\alpha$. If we fix the former, we consider the Bayes factor in eq. 3.5 in the following argument. If we choose to fix the expected type-1 error rate, on the other hand, we instead consider the Bayes factor in eq. 4.4 in the following argument.
By an application of the Neyman-Pearson lemma for Bayes factors, the most powerful test at a fixed size should be a threshold on $B_{10}$, $B_{10} \ge k$. As the Bayes factor is a monotonic function of $T$, this is equivalent to a threshold on $T$, $T \ge t^\star$. As $t^\star$ can be found independently from the prior $\pi_1$ by eq. 3.2 or 3.4, it must be the most powerful test for any choice of prior $\pi_1$, including point masses at any particular values of the unknown parameters. Thus, it is the UMPT.
We thus find connections between the Bayes factor and the UMPT. By generalising the Karlin-Rubin theorem, we find that a UMPT exists whenever the Bayes factor is a monotonic function of a sufficient statistic for any choice of prior for the unknown parameters in the alternative model. The Bayes factor corresponding to the observed $T$ must, however, depend on the choices of prior.
The Neyman-Pearson lemma leads to optimal test statistics for null hypothesis significance tests for simple hypotheses. Interpreting the Bayes factor as a likelihood ratio for two simple models leads to a Bayesian interpretation of the Neyman-Pearson lemma that goes beyond simple models. The Bayes factor maximises the expected power for a test of a fixed expected size.
On the Bayesian side, this could provide further justification for using Bayes factors for objective Bayesians or, more generally, for those who are concerned about the frequentist properties of Bayes factors. If we place a threshold on the Bayes factor, then for whatever type-1 error rate that threshold corresponds to, the Bayes factor is the statistic that maximises the expected power. On the frequentist side, even if you want to carry on computing p-values, there is justification for doing so using the Bayes factor as a test statistic, especially in the case of simple null hypotheses but composite alternatives. In the case of simple null hypotheses, using the Bayes factor results in the best expected power for a fixed, completely frequentist type-1 error rate. The only concession required is that, to construct the test, we must choose a weight function that marks where we want power and talk about expected power. The test itself would, however, remain strictly frequentist.
The Karlin-Rubin theorem extends the Neyman-Pearson lemma to particular composite models. We found that the conditions of the Karlin-Rubin theorem may be recast as the requirement that the Bayes factor is a monotonic function of a sufficient statistic for any choices of prior for unknown parameters. This led to a slightly novel proof of a generalised Karlin-Rubin theorem and a connection between the properties of Bayes factors and the existence of uniformly most powerful tests. The Bayesian interpretation of the Karlin-Rubin theorem provides conditions under which a test of fixed size always maximises the expected power, regardless of which prior was chosen for the unknown parameters in the alternative hypothesis in the computation of the expected power.
These results could help synthesise frequentist and Bayesian procedures, as they show that Bayes factors could lie at the heart of each one, and proponents of either should be interested, in principle, in computing the Bayes factor. The outstanding difference would be that the Bayesian would consider the magnitude of the observed Bayes factor, in accordance with the likelihood principle (zbMATH02166302), whereas the frequentist would consider the probability of obtaining a Bayes factor more extreme than that observed. In practice, computing the Bayes factor and finding its distribution could be challenging, as popular asymptotic approaches such as Wilks’ theorem (wilks1938) needn’t apply (though see 10.2307/25734099).
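When no asymptotic result is available, the distribution of the Bayes factor under a simple null can be estimated by simulation. The sketch below is one possible approach under a toy model that we assume for illustration: a single datum $x \sim N(\mu, 1)$, the null $\mu = 0$, and the weight function $\mu \sim N(0, \tau^2)$, for which the log Bayes factor has a closed form. The function names, seed and numerical values are hypothetical.

```python
import math
import random
from statistics import NormalDist


def log_bayes_factor(x, tau):
    """Log Bayes factor for mu ~ N(0, tau^2) against mu = 0, given x ~ N(mu, 1)."""
    shrink = tau**2 / (1.0 + tau**2)
    return 0.5 * shrink * x**2 - 0.5 * math.log(1.0 + tau**2)


def monte_carlo_p_value(x_obs, tau, n_sim=20_000, seed=7):
    """Fraction of simulated null data sets with a Bayes factor >= the observed one."""
    rng = random.Random(seed)
    observed = log_bayes_factor(x_obs, tau)
    exceed = sum(
        log_bayes_factor(rng.gauss(0.0, 1.0), tau) >= observed for _ in range(n_sim)
    )
    return exceed / n_sim


p_mc = monte_carlo_p_value(x_obs=2.0, tau=2.0)
# In this toy model the Bayes factor increases with |x|, so the exact
# p-value is the usual two-sided tail probability.
p_exact = 2.0 * (1.0 - NormalDist(0.0, 1.0).cdf(2.0))
print(f"Monte Carlo p-value: {p_mc:.4f}, exact: {p_exact:.4f}")
```

The Bayesian reading of the same calculation would stop at the observed Bayes factor itself, whereas the simulated tail probability is the frequentist summary.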
Appendix A Karlin-Rubin conditions for Bayes factors
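Let us check the implications of the MLR conditions in the Karlin-Rubin theorem on the Bayes factor. For the hypotheses under consideration in the Karlin-Rubin theorem in eq. 4.1, the Bayes factor could be written
$$B_{10}(T) = \frac{\int_{\theta_1 > \theta_b} p(T \mid \theta_1) \, \pi_1(\theta_1) \, \mathrm{d}\theta_1}{\int_{\theta_0 \le \theta_b} p(T \mid \theta_0) \, \pi_0(\theta_0) \, \mathrm{d}\theta_0} \tag{A.1}$$
for some suitably normalised choices of prior densities $\pi_0$ and $\pi_1$. Starting from eq. 4.2, for $T' \ge T$ and $\theta_1 \ge \theta_0$, we have
$$p(T' \mid \theta_1) \, \pi_1(\theta_1) \, p(T \mid \theta_0) \, \pi_0(\theta_0) \ge p(T \mid \theta_1) \, \pi_1(\theta_1) \, p(T' \mid \theta_0) \, \pi_0(\theta_0), \tag{A.2}$$
where we multiplied each side by the ratio of prior densities that appear in the Bayes factor and rearranged terms. Integrating each side with respect to $\theta_1$ and $\theta_0$ only over the regions guaranteeing $\theta_1 \ge \theta_0$, that is, $\theta_1 > \theta_b$ and $\theta_0 \le \theta_b$, we find
$$\int_{\theta_1 > \theta_b} p(T' \mid \theta_1) \, \pi_1(\theta_1) \, \mathrm{d}\theta_1 \int_{\theta_0 \le \theta_b} p(T \mid \theta_0) \, \pi_0(\theta_0) \, \mathrm{d}\theta_0 \ge \int_{\theta_1 > \theta_b} p(T \mid \theta_1) \, \pi_1(\theta_1) \, \mathrm{d}\theta_1 \int_{\theta_0 \le \theta_b} p(T' \mid \theta_0) \, \pi_0(\theta_0) \, \mathrm{d}\theta_0. \tag{A.3}$$
Finally, we rearrange terms to find
$$\frac{\int_{\theta_1 > \theta_b} p(T' \mid \theta_1) \, \pi_1(\theta_1) \, \mathrm{d}\theta_1}{\int_{\theta_0 \le \theta_b} p(T' \mid \theta_0) \, \pi_0(\theta_0) \, \mathrm{d}\theta_0} \ge \frac{\int_{\theta_1 > \theta_b} p(T \mid \theta_1) \, \pi_1(\theta_1) \, \mathrm{d}\theta_1}{\int_{\theta_0 \le \theta_b} p(T \mid \theta_0) \, \pi_0(\theta_0) \, \mathrm{d}\theta_0}, \tag{A.4}$$
such that $B_{10}(T') \ge B_{10}(T)$ for $T' \ge T$. We did not make use of any properties of any particular choice of prior densities. In other words, the ordinary MLR conditions of the Karlin-Rubin theorem mean that the Bayes factor is a monotonic function of $T$ for any choices of prior densities $\pi_0$ and $\pi_1$.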
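The monotonicity argument can be checked numerically for a family that satisfies the MLR conditions. The sketch below assumes, purely for illustration, a statistic $t \sim N(\theta, 1)$, a boundary at zero, and a few arbitrary prior shapes; the function names, the priors and the integration cutoff are all hypothetical choices.

```python
import math


def likelihood(t, theta):
    """Density of the sufficient statistic, t ~ N(theta, 1)."""
    return math.exp(-0.5 * (t - theta) ** 2) / math.sqrt(2.0 * math.pi)


def integrate(f, lo, hi, n=2000):
    """Midpoint rule on [lo, hi]."""
    step = (hi - lo) / n
    return step * sum(f(lo + (i + 0.5) * step) for i in range(n))


def bayes_factor(t, prior1, prior0, boundary=0.0, cutoff=10.0):
    """Bayes factor for theta > boundary against theta <= boundary.

    The priors are truncated at the cutoff, which is itself just another
    choice of prior, and left unnormalised, which only rescales the Bayes
    factor by a constant. Neither affects whether it is monotonic in t.
    """
    numerator = integrate(
        lambda th: likelihood(t, th) * prior1(th), boundary, boundary + cutoff
    )
    denominator = integrate(
        lambda th: likelihood(t, th) * prior0(th), boundary - cutoff, boundary
    )
    return numerator / denominator


def is_non_decreasing(values):
    """True if each value is at least as large as the one before it."""
    return all(a <= b for a, b in zip(values, values[1:]))


# Hypothetical prior shapes for the alternative (theta > 0) and the null (theta <= 0).
alternative_priors = {
    "exponential": lambda th: math.exp(-th),
    "flat": lambda th: 1.0,
    "bump": lambda th: th * math.exp(-th * th),
}
null_prior = lambda th: math.exp(th)

grid = [-3.0 + 0.25 * i for i in range(25)]
monotonic = {
    name: is_non_decreasing([bayes_factor(t, prior1, null_prior) for t in grid])
    for name, prior1 in alternative_priors.items()
}
print(monotonic)
```

A check on a finite grid and a handful of priors is only a sanity test of the algebra above, not a substitute for it.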