Optional Stopping with Bayes Factors: a categorization and extension of folklore results, with an application to invariant situations

by Allard Hendriksen, et al.

It is often claimed that Bayesian methods, in particular Bayes factor methods for hypothesis testing, can deal with optional stopping. We first give an overview, using only the most elementary probability theory, of three different mathematical meanings that various authors give to this claim: stopping rule independence, posterior calibration and (semi-) frequentist robustness to optional stopping. We then prove theorems to the effect that, while their practical implications are sometimes debatable, these claims do indeed hold in a general measure-theoretic setting. The novelty here is that we allow for nonintegrable measures based on improper priors, which leads to particularly strong results for the practically important case of models satisfying a group invariance (such as location or scale). When equipped with the right Haar prior, calibration and semi-frequentist robustness to optional stopping hold uniformly irrespective of the value of the underlying nuisance parameter, as long as the stopping rule satisfies a certain intuitive property.




1 Introduction

In recent years, a surprising number of scientific results have failed to hold up to continued scrutiny. Part of this ‘replicability crisis’ may be caused by practices that ignore the assumptions of traditional (frequentist) statistical methods (John et al., 2012). One of these assumptions is that the experimental protocol should be completely determined upfront. In practice, researchers often adjust the protocol due to unforeseen circumstances or collect data until a point has been proven. This practice, which is referred to as optional stopping, can cause true hypotheses to be wrongly rejected much more often than these statistical methods promise.
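As a rough illustration of this inflation, the sketch below simulates a true null hypothesis and compares a test run once at a fixed sample size with the same test rerun after every new observation until it rejects. The choice of a two-sided z-test with known unit variance, the look schedule (every observation from 10 to 100), the number of simulated experiments and all function names are assumptions made for illustration; none of them comes from the paper.

```python
import math
import random


def rejects_with_peeking(rng, max_n=100, min_n=10, z_crit=1.96):
    """One experiment under a true null (standard normal data).

    A two-sided z-test (known variance 1) is run after every observation
    from min_n onward; return True if any of those looks rejects.
    """
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(0.0, 1.0)
        if n >= min_n and abs(total) / math.sqrt(n) > z_crit:
            return True  # stop as soon as "a point has been proven"
    return False


def rejects_fixed_n(rng, n=100, z_crit=1.96):
    """Same test, but run exactly once at the pre-specified sample size n."""
    total = sum(rng.gauss(0.0, 1.0) for _ in range(n))
    return abs(total) / math.sqrt(n) > z_crit


def false_positive_rates(n_experiments=2000, seed=0):
    """Monte Carlo estimates of (rate with peeking, rate with fixed n)."""
    rng = random.Random(seed)
    peek = sum(rejects_with_peeking(rng) for _ in range(n_experiments))
    fixed = sum(rejects_fixed_n(rng) for _ in range(n_experiments))
    return peek / n_experiments, fixed / n_experiments


if __name__ == "__main__":
    peek_rate, fixed_rate = false_positive_rates()
    print(f"false positive rate with peeking: {peek_rate:.3f}")
    print(f"false positive rate with fixed n: {fixed_rate:.3f}")
```

Under these assumptions the fixed-sample rate should stay near the nominal 5%, while the rate with peeking is typically several times larger.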

Bayes factor hypothesis testing has long been advocated as an alternative to traditional testing that can resolve several of its problems; in particular, it was claimed early on that Bayesian methods continue to be valid under optional stopping (Lindley, 1957, Edwards et al., 1963). The latter paper claims that (with Bayesian methods) “it is entirely appropriate to collect data until a point has been proven or disproven, or until the data collector runs out of time, money, or patience.” In light of the replicability crisis, such claims have received much renewed interest (Wagenmakers, 2007, Rouder, 2014, Schönbrodt et al., 2017, Yu et al., 2013, Sanborn and Hills, 2013). But what do they mean mathematically? It turns out that different authors mean quite different things by ‘Bayesian methods handle optional stopping’; moreover, such claims are often shown to hold only in an informal sense, or in restricted contexts. Thus, the first goal of the present paper is to give a systematic overview and formalization of such claims in a simple, expository setting; the second goal is to extend the reach of such claims to more general settings, for which they have never been formally verified.

Overview and Most Important Result

In Section 2, we give a systematic overview of what we identified to be the three main mathematical senses in which Bayes factor methods can handle optional stopping, which we call $\tau$-independence, calibration, and (semi-) frequentist. We first do this in a setting chosen to be as simple as possible (finite sample spaces and strictly positive probabilities), allowing for straightforward statements and proofs of results.

In Section 3, we extend the statements and results to a much more general setting allowing for a wide range of sample spaces and measures, including measures based on improper priors. These are priors that are not integrable, thus not defining standard probability distributions over parameters, and as such they cause technical complications. Such priors are indispensable within the recently popularized default Bayes factors for common hypothesis tests (Rouder et al., 2009, 2012, Jamil et al., 2016).

In Section 4, we provide stronger results for the case in which both models satisfy the same group invariance. Most (but not all) default Bayes factor settings concern such situations; prominent examples are Jeffreys’ Bayesian one- and two-sample $t$-tests, going back to Jeffreys (1961), in which the models are location and location-scale families, respectively. Many more examples are given by Berger and various collaborators in a sequence of papers (Berger et al., 1998, Dass and Berger, 2003, Bayarri et al., 2012, 2016), who give compelling arguments for using the (typically improper) right Haar prior on the nuisance parameters in such situations; for example, in Jeffreys’ one-sample $t$-test, one puts a right Haar prior on the variance. Haar priors and group invariant models were studied extensively by Eaton (1989), Andersson (1982) and Wijsman (1990), whose results this paper depends on considerably. When nuisance parameters (shared by both $H_0$ and $H_1$) are of the right form and the right Haar prior is used, we can strengthen the results of Section 3: they now hold uniformly for all possible values of the nuisance parameters, rather than in the marginal, ‘on average’ sense we consider in Section 3. However, and this is our most important insight, we cannot take arbitrary stopping rules if we want to handle optional stopping in this strong sense: the stopping rules have to satisfy a certain intuitive condition, which will hold in many but not all practical cases: a rule such as ‘stop as soon as the Bayes factor is ’ is allowed, but a rule (in Jeffreys’ one-sample $t$-test) such as ‘stop as soon as ’ is not.

The paper ends with an Appendix containing all longer mathematical proofs.


Our analysis is restricted to Bayesian testing and model selection using the Bayes factor method; we do not make any claims about other types of Bayesian inference. Some of the results we present were already known, at least in simple settings; we refer in each case to the first appearance in the literature that we are aware of. The main mathematical novelties in the paper are the results on optional stopping in the general case with improper priors and in the group invariance case. The main difficulties here are that, (a), for fixed sample sizes, at least with continuous-valued data, the Bayes factor usually has a distribution with full support, i.e. its density is strictly positive on the positive reals, whereas with variable stopping times, the support of its distribution may have ‘gaps’ at which its density is zero or very near 0; and (b), as indicated, we need certain restrictions on the stopping times in order for the results to be valid.

Finally, as an important caveat, we point out that the idea that Bayesian methods can handle optional stopping has also been criticized, for example by Yu et al. (2013), Sanborn and Hills (2013), and also by ourselves (de Heide and Grünwald, 2018). There are two main issues: first, in many practical situations, many Bayesian statisticians use priors that are themselves dependent on parts of the data and/or the sampling plan and stopping time. Examples are Jeffreys prior with the multinomial model and the Gunel-Dickey default priors for 2x2 contingency tables advocated by Jamil et al. (2016). With such priors, final results evidently depend on the stopping rule employed; none of the results below continue to hold for such priors. The second issue is that all mathematical theorems below are just that, mathematical theorems. For them to have implications for practice, one needs to make additional assumptions which sometimes may not be warranted. In particular the second sense of ‘handling optional stopping’, calibration, relies on an analysis that holds under a Bayes marginal distribution, assigning probabilities that are really average (expected) probabilities with expectations taken over a prior. Yet many if not most priors used in practice (such as Cauchy priors over a location parameter) are of a ‘default’ or ‘pragmatic’ nature and are not really believed by the statistician, making the practical meaning of such an expectation questionable; de Heide and Grünwald (2018) discuss the issue at length.

2 The Simple Case

Consider a finite set $\mathcal{X}$ and a sample space $\Omega := \mathcal{X}^T$, where $T$ is some very large (but in this section, still finite) integer. One observes a sample $x^\tau \equiv x_1, \ldots, x_\tau$, which is an initial segment of $x_1, \ldots, x_T \in \mathcal{X}^T$. In the simplest case, $\tau = n$ is a sample size that is fixed in advance; but, more generally, $\tau$ is a stopping time defined by some stopping rule (which may or may not be known to the data analyst), defined formally below.

We consider a hypothesis testing scenario where we wish to distinguish between a null hypothesis $H_0$ and an alternative hypothesis $H_1$. Both $H_0$ and $H_1$ are sets of distributions on $\Omega$, and they are each represented by unique probability distributions $\bar{P}_0$ and $\bar{P}_1$ respectively. Usually, these are taken to be Bayesian marginal distributions, defined as follows. First one writes, for both $k \in \{0, 1\}$, $H_k = \{P_{\theta \mid k} : \theta \in \Theta_k\}$ with ‘parameter spaces’ $\Theta_k$; one then defines or assumes some prior probability distributions $\pi_0$ and $\pi_1$ on $\Theta_0$ and $\Theta_1$, respectively. The Bayesian marginal probability distributions are then the corresponding marginal distributions, i.e. for any set $A \subseteq \Omega$ they satisfy:

$$\bar{P}_k(A) = \int_{\Theta_k} P_{\theta \mid k}(A) \, d\pi_k(\theta), \qquad k \in \{0, 1\}. \tag{1}$$

For now we also further assume that for every $n \le T$ and every $x^n \in \mathcal{X}^n$, $\bar{P}_0(X^n = x^n) > 0$ and $\bar{P}_1(X^n = x^n) > 0$ (full support), where here as below we use random variable notation, $X^n = x^n$ denoting the event $\{x^n\} \times \mathcal{X}^{T-n} \subseteq \Omega$. We note that there exist approaches to testing and model choice such as testing by nonnegative martingales (Shafer et al., 2011, van der Pas and Grünwald, 2018) and minimum description length (Barron et al., 1998, Grünwald, 2007) in which the $\bar{P}_0$ and $\bar{P}_1$ may be defined in different (yet related) ways. Several of the results below extend to general $\bar{P}_0$ and $\bar{P}_1$; we return to this point at the end of the paper, in Section 5. In all cases, we further assume that we have determined an additional probability mass function $\pi$ on $\{H_0, H_1\}$, indicating the prior probabilities of the hypotheses. The evidence in favour of $H_1$ relative to $H_0$ given data $x^\tau$ is now measured either by the Bayes factor or the posterior odds. We now give the standard definition of these quantities for the case that $\tau = n$, i.e., that the sample size is fixed in advance. First, noting that all conditioning below is on events of strictly positive probability, by Bayes’ theorem, we can write for any $A \subseteq \Omega$, with $P(A \mid H_k) := \bar{P}_k(A)$,

$$\frac{\pi(H_1 \mid A)}{\pi(H_0 \mid A)} = \frac{P(A \mid H_1)}{P(A \mid H_0)} \cdot \frac{\pi(H_1)}{\pi(H_0)}, \tag{2}$$

where here as in the remainder of the paper we use the symbol $\pi$ to denote not just prior, but also posterior distributions on $\{H_0, H_1\}$. In the case that we observe $x^n$ for fixed $n$, the event $A$ is of the form $X^n = x^n$. Plugging this into (2), the left-hand side becomes the standard definition of posterior odds, and the first factor on the right is called the Bayes factor.
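To make the two quantities concrete, here is a minimal sketch for binary outcomes. The specific models are assumptions chosen for illustration and are not taken from the paper: the null is a single i.i.d. Bernoulli(1/2) distribution, and the alternative is the Bayes marginal of the Bernoulli family under a uniform prior on the success probability. Exact rational arithmetic is used so that the values can be checked by hand.

```python
from fractions import Fraction
from math import factorial


def marginal_h0(xs):
    """P(X^n = x^n | H0), with H0 assumed to be i.i.d. Bernoulli(1/2)."""
    return Fraction(1, 2) ** len(xs)


def marginal_h1(xs):
    """Bayes marginal under H1: Bernoulli(theta) with theta ~ Uniform(0, 1).

    Integrating theta^k (1 - theta)^(n - k) over [0, 1] gives
    k! (n - k)! / (n + 1)!.
    """
    n, k = len(xs), sum(xs)
    return Fraction(factorial(k) * factorial(n - k), factorial(n + 1))


def bayes_factor(xs):
    """First factor on the right-hand side of Bayes' theorem in odds form."""
    return marginal_h1(xs) / marginal_h0(xs)


def posterior_odds(xs, prior_h1=Fraction(1, 2)):
    """Bayes factor times the prior odds pi(H1) / pi(H0)."""
    prior_odds = prior_h1 / (1 - prior_h1)
    return bayes_factor(xs) * prior_odds


if __name__ == "__main__":
    sample = (1, 1, 1, 0)
    print(bayes_factor(sample))                               # 4/5
    print(posterior_odds(sample, prior_h1=Fraction(3, 4)))    # 12/5
```

With equal prior probabilities the prior odds are 1, so the posterior odds and the Bayes factor coincide.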

2.1 First Sense of Handling Optional Stopping: $\tau$-Independence

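Now, in reality we do not necessarily observe $X^n = x^n$ for fixed $n$ but rather $X^\tau = x^\tau$, where $\tau$ is a stopping time that may itself depend on (past) data (and that in some cases may in fact be unknown to us). This $\tau$ may be defined in terms of a stopping rule $f : \bigcup_{i=0}^{T} \mathcal{X}^i \to \{\text{stop}, \text{continue}\}$. $\tau \equiv \tau(x^T)$ is then defined as the random variable which, for any sample $x_1, \ldots, x_T$, outputs the smallest $n$ such that $f(x_1, \ldots, x_n) = \text{stop}$. For any given stopping time $\tau$, any $1 \le n \le T$ and sequence of data $x^n = x_1, \ldots, x_n$, we say that $x^n$ is compatible with $\tau$ if it satisfies $X^n = x^n \Rightarrow \tau = n$. We let $\mathcal{X}^\tau \subseteq \bigcup_{i=1}^{T} \mathcal{X}^i$ be the set of all sequences compatible with $\tau$.

Observations take the form $X^\tau = x^\tau$, which is equivalent to the event $X^n = x^n; \tau = n$ for some $n$ and some $x^n \in \mathcal{X}^n$, which of necessity must be compatible with $\tau$. We can thus instantiate (2) to

$$\frac{\pi(H_1 \mid X^n = x^n, \tau = n)}{\pi(H_0 \mid X^n = x^n, \tau = n)} = \frac{P(\tau = n \mid X^n = x^n, H_1) \, \pi(H_1 \mid X^n = x^n)}{P(\tau = n \mid X^n = x^n, H_0) \, \pi(H_0 \mid X^n = x^n)} = \frac{\pi(H_1 \mid X^n = x^n)}{\pi(H_0 \mid X^n = x^n)}, \tag{3}$$

where we note that, by definition, $P(\tau = n \mid X^n = x^n, H_k) = 1$ for $k \in \{0, 1\}$. Using Bayes’ theorem again, the expression on the right can be further rewritten as:

$$\frac{\pi(H_1 \mid X^n = x^n)}{\pi(H_0 \mid X^n = x^n)} = \frac{P(X^n = x^n \mid H_1)}{P(X^n = x^n \mid H_0)} \cdot \frac{\pi(H_1)}{\pi(H_0)}. \tag{4}$$

Combining (3) and (4) we get:

$$\gamma(x^n) := \frac{\pi(H_1 \mid X^n = x^n, \tau = n)}{\pi(H_0 \mid X^n = x^n, \tau = n)} = \beta(x^n) \cdot \frac{\pi(H_1)}{\pi(H_0)}, \qquad \beta(x^n) := \frac{P(X^n = x^n \mid H_1)}{P(X^n = x^n \mid H_0)}, \tag{5}$$

where we introduce the notation $\gamma(x^n)$ for the posterior odds and $\beta(x^n)$ for the Bayes factor based on sample $x^n$, calculated as if $n$ were fixed in advance.

We see that the stopping rule plays no role in the expression on the right. Thus, we have shown that, for any two stopping times $\tau_1$ and $\tau_2$ that are both compatible with some observed $x^n$, the posterior odds one arrives at will be the same irrespective of whether $x^n$ came to be observed because $\tau_1$ was used or because $\tau_2$ was used. We say that the posterior odds do not depend on the stopping rule and call this property $\tau$-independence. Incidentally, this also justifies that we write the posterior odds as $\gamma(x^n)$, a function of $x^n$ alone, without referring to the stopping time $\tau$.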
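The independence from the stopping rule can be checked by brute force in a small example. The sketch below assumes binary data, a horizon of six outcomes, a null of i.i.d. Bernoulli(1/2), an alternative given by the Bernoulli family with a uniform prior, equal prior odds, and two hypothetical stopping rules; all of these choices and all names are illustrative assumptions. It computes the odds given the stopped sample by summing over every full sequence that leads to stopping exactly at the observed prefix, and compares them with the fixed-sample Bayes factor.

```python
from fractions import Fraction
from itertools import product
from math import factorial

T = 6  # assumed horizon: longest sequence that can be observed


def p0(xs):
    """Probability of the sequence under H0: i.i.d. Bernoulli(1/2)."""
    return Fraction(1, 2) ** len(xs)


def p1(xs):
    """Bayes marginal under H1: Bernoulli(theta), theta ~ Uniform(0, 1)."""
    n, k = len(xs), sum(xs)
    return Fraction(factorial(k) * factorial(n - k), factorial(n + 1))


def stop_after_two_ones(prefix):
    """Hypothetical data-dependent rule: stop once two successes are seen."""
    return sum(prefix) >= 2


def stop_at_three(prefix):
    """Hypothetical fixed-sample rule: always stop after three outcomes."""
    return len(prefix) >= 3


def stopping_time(rule, full):
    """Smallest n >= 1 at which the rule says stop (len(full) if never)."""
    for n in range(1, len(full) + 1):
        if rule(full[:n]):
            return n
    return len(full)


def odds_given_stopped_sample(rule, observed):
    """Odds of H1 vs H0 (equal prior odds) given {X^n = observed, tau = n}."""
    n = len(observed)
    num = den = Fraction(0)
    for full in product((0, 1), repeat=T):
        if full[:n] == observed and stopping_time(rule, full) == n:
            num += p1(full)
            den += p0(full)
    return num / den


if __name__ == "__main__":
    observed = (1, 0, 1)  # compatible with both rules above
    print(odds_given_stopped_sample(stop_after_two_ones, observed))  # 2/3
    print(odds_given_stopped_sample(stop_at_three, observed))        # 2/3
    print(p1(observed) / p0(observed))                               # 2/3
```

Both rules stop at exactly the observed sample, and both give the same odds as the Bayes factor computed as if the sample size had been fixed at three.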

The fact that the posterior odds given $x^n$ do not depend on the stopping rule is the first (and still relatively weak) sense in which Bayesian methods handle optional stopping; it was perhaps first noted by Lindley (1957); another early source is Edwards et al. (1963). Lindley gave an (informal) proof in the context of specific parametric models; in Section 3.1 we show that the result indeed remains true for general $\sigma$-finite measures $\bar{P}_0$ and $\bar{P}_1$ representing $H_0$ and $H_1$. We note once again that the result only holds if the priors $\pi_0$ and $\pi_1$ on the parameters of $H_0$ and $H_1$ themselves do not depend on the stopping time $\tau$, an assumption that is violated in many default Bayesian methods.

2.2 Second Sense of Handling Optional Stopping: Calibration

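An alternative definition of handling optional stopping was introduced by Rouder (2014). Writing $\gamma$ for the posterior odds and $\beta$ for the Bayes factor, Rouder calls $\gamma(x^\tau)$ the nominal posterior odds calculated from an obtained sample $x^\tau$, and defines the observed posterior odds

$$\frac{\pi(H_1 \mid \gamma(x^\tau) = c)}{\pi(H_0 \mid \gamma(x^\tau) = c)}$$

as the posterior odds given the nominal odds. Rouder first notes that, at least if the sample size is fixed in advance to $n$, one expects these odds to be equal. For instance, if an obtained sample yields nominal posterior odds of 3-to-1 in favor of the alternative hypothesis, then it must be 3 times as likely that the sample was generated by the alternative probability measure. In the terminology of de Heide and Grünwald (2018), Bayes is calibrated for a fixed sample size $n$. Rouder then goes on to note that, if $n$ is determined by an arbitrary stopping time $\tau$ (based for example on optional stopping), then the odds will still be equal; in this sense, Bayesian testing is well-behaved in the calibration sense irrespective of the stopping rule/time. Formally, the requirement that the nominal and observed posterior odds be equal leads us to define the calibration hypothesis, which postulates that $c = \pi(H_1 \mid \gamma = c) / \pi(H_0 \mid \gamma = c)$ holds for any $c > 0$ that has non-zero probability. For simplicity, for now we only consider the case with equal prior odds for $H_0$ and $H_1$, so that $\gamma(x^n) = \beta(x^n)$. Then the calibration hypothesis says that, for arbitrary stopping time $\tau$, for every $c$ such that $\beta(x^\tau) = c$ for some $x^\tau \in \mathcal{X}^\tau$, one has

$$c = \frac{P(\beta(X^\tau) = c \mid H_1)}{P(\beta(X^\tau) = c \mid H_0)}. \tag{6}$$

In the present simple setting, this hypothesis is easily shown to hold, because we can write:

$$\frac{P(\beta(X^\tau) = c \mid H_1)}{P(\beta(X^\tau) = c \mid H_0)} = \frac{\sum_{y \in \mathcal{X}^\tau : \beta(y) = c} P(X^\tau = y \mid H_1)}{\sum_{y \in \mathcal{X}^\tau : \beta(y) = c} P(X^\tau = y \mid H_0)} = \frac{\sum_{y \in \mathcal{X}^\tau : \beta(y) = c} c \, P(X^\tau = y \mid H_0)}{\sum_{y \in \mathcal{X}^\tau : \beta(y) = c} P(X^\tau = y \mid H_0)} = c.$$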

Rouder noticed that the calibration hypothesis should hold as a mathematical theorem, without giving an explicit proof; he demonstrated it by computer simulation in a simple parametric setting. Deng et al. (2016) gave a proof for a somewhat more extended setting yet still with proper priors. In Section 3.2 we show that a version of the calibration hypothesis continues to hold for general measures based on improper priors, and in Section 4.4 we extend this further to strong calibration for group invariance settings as discussed below.
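The calibration property can also be verified exactly, rather than by simulation, in a small discrete example. The sketch below is an illustration under assumed choices that do not come from the paper: binary data, a horizon of eight outcomes, a null of i.i.d. Bernoulli(1/2), an alternative given by the Bernoulli family with a uniform prior, equal prior odds, and a hypothetical rule that stops once the Bayes factor is at least 3 or at most 1/3. It enumerates every stopped sample, groups the samples by the value of the Bayes factor, and compares that value with the ratio of the probabilities of observing it under the two hypotheses.

```python
from collections import defaultdict
from fractions import Fraction
from math import factorial


def p0(xs):
    """Probability of the sequence under H0: i.i.d. Bernoulli(1/2)."""
    return Fraction(1, 2) ** len(xs)


def p1(xs):
    """Bayes marginal under H1: Bernoulli(theta), theta ~ Uniform(0, 1)."""
    n, k = len(xs), sum(xs)
    return Fraction(factorial(k) * factorial(n - k), factorial(n + 1))


def bayes_factor(xs):
    return p1(xs) / p0(xs)


def stop_on_strong_evidence(prefix):
    """Hypothetical rule: stop when the Bayes factor is >= 3 or <= 1/3."""
    bf = bayes_factor(prefix)
    return bf >= 3 or bf <= Fraction(1, 3)


def stopped_samples(rule, horizon):
    """All sequences at which sampling stops (forced stop at the horizon)."""
    found = []

    def extend(prefix):
        if prefix and (rule(prefix) or len(prefix) == horizon):
            found.append(prefix)
            return
        for x in (0, 1):
            extend(prefix + (x,))

    extend(())
    return found


def observed_vs_nominal(rule, T=8):
    """Map each nominal Bayes factor value c to P1(BF = c) / P0(BF = c)."""
    mass0 = defaultdict(Fraction)
    mass1 = defaultdict(Fraction)
    for xs in stopped_samples(rule, T):
        c = bayes_factor(xs)
        mass0[c] += p0(xs)
        mass1[c] += p1(xs)
    return {c: mass1[c] / mass0[c] for c in mass0}


if __name__ == "__main__":
    for nominal, observed in sorted(observed_vs_nominal(stop_on_strong_evidence).items()):
        print(f"nominal {nominal}  observed {observed}")
```

In every row the observed odds equal the nominal odds, which is the content of the calibration hypothesis for this stopping rule.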

We note that this result, too, relies on the priors themselves not depending on the stopping time, an assumption which is violated in several standard default Bayes factor settings. We also note that, if one thinks of one’s priors in a default sense (they are practical but not necessarily fully believed), then the practical implications of calibration are limited, as shown experimentally by de Heide and Grünwald (2018). One would really like a stronger form of calibration in which (6) holds under a whole range of distributions in $H_0$ and $H_1$, rather than in terms of $\bar{P}_0$ and $\bar{P}_1$, which average over a prior that perhaps does not reflect one’s beliefs fully. For the case that $H_0$ and $H_1$ share a nuisance parameter $g$ taking values in some set $G$, one can define this strong calibration hypothesis as stating that, for all $c$ with $\beta(x^\tau) = c$ for some $x^\tau \in \mathcal{X}^\tau$, all $g \in G$,

$$c = \frac{P(\beta(X^\tau) = c \mid H_1, g)}{P(\beta(X^\tau) = c \mid H_0, g)}, \tag{7}$$

where $\beta$ is still defined as above; in particular, when calculating $\beta$ one does not condition on the parameter having the value $g$, but when assessing its likelihood as in (7) one does. de Heide and Grünwald (2018) show that the strong calibration hypothesis certainly does not hold for general parameters, but they also show by simulations that it does hold in the practically important case with group invariance and right Haar priors. In Section 4.4 we show that in such cases, one can indeed prove that a version of (7) holds.

2.3 Third Sense of Handling Optional Stopping: (Semi-) Frequentist

In classical, Neyman-Pearson style null hypothesis testing, the main concern is limiting the false positive rate of a hypothesis test. If this false positive rate is bounded above by some $\alpha > 0$, then a null hypothesis significance test (NHST) is said to have significance level $\alpha$, and if the significance level is independent of the stopping rule used, we say that the test is robust under frequentist optional stopping.

Definition 1.

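A function $S : \bigcup_{i=m}^{T} \mathcal{X}^i \to \{0, 1\}$ is said to be a frequentist sequential test with significance level $\alpha$ and minimal sample size $m$ that is robust under optional stopping relative to $H_0$ if for all $P \in H_0$

$$P\left(\exists n,\ m < n \le T : S(X^n) = 1\right) \le \alpha,$$

i.e. the probability that there is an $n$ at which $S(X^n) = 1$ (‘the test rejects $H_0$ when given sample $X^n$’) is bounded by $\alpha$.

In our present setting, we can take $m = 0$ (larger $m$ become important in Section 3.3), so $n$ runs from $1$ to $T$ and it is easy to show that, for any $0 < \alpha \le 1$, we have

$$\bar{P}_0\left(\exists n,\ 0 < n \le T : \frac{1}{\beta(X^n)} \le \alpha\right) \le \alpha. \tag{8}$$

Proof. For any fixed $\alpha$ and any sequence $x^T = x_1, \ldots, x_T$, let $\tau(x^T)$ be the smallest $n$ such that, for the initial segment $x^n$ of $x^T$, $\beta(x^n) \ge 1/\alpha$ (if no such $n$ exists we set $\tau(x^T) = T$). Then $\tau$ is a stopping time, $X^\tau$ is a random variable, and the probability in (8) is equal to the $\bar{P}_0$-probability that $\beta(X^\tau) \ge 1/\alpha$. Since $\beta(X^\tau)$ has expectation $\sum_{y \in \mathcal{X}^\tau} \bar{P}_1(X^\tau = y) = 1$ under $\bar{P}_0$, this probability is bounded by $\alpha$ by Markov’s inequality. ∎

It follows that, if $H_0$ is a singleton, then the sequential test $S$ that rejects $H_0$ (outputs $S(X^n) = 1$) whenever $\beta(x^n) \ge 1/\alpha$ is a frequentist sequential test with significance level $\alpha$ that is robust under optional stopping.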
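The bound can be checked exactly in a small example with a singleton null. The sketch below assumes binary data, a horizon of ten outcomes, a null consisting of the single distribution i.i.d. Bernoulli(1/2), and an alternative given by the Bernoulli family with a uniform prior; these choices and all names are illustrative assumptions rather than the paper's. It enumerates every full sequence and adds up the null probability of those on which the Bayes factor reaches the threshold at some sample size.

```python
from fractions import Fraction
from itertools import product
from math import factorial


def p0(xs):
    """Probability of the sequence under the singleton H0: Bernoulli(1/2)."""
    return Fraction(1, 2) ** len(xs)


def p1(xs):
    """Bayes marginal under H1: Bernoulli(theta), theta ~ Uniform(0, 1)."""
    n, k = len(xs), sum(xs)
    return Fraction(factorial(k) * factorial(n - k), factorial(n + 1))


def bayes_factor(xs):
    return p1(xs) / p0(xs)


def prob_ever_rejecting(alpha, T=10):
    """Exact H0-probability that BF(x^n) >= 1/alpha for some 1 <= n <= T."""
    threshold = 1 / alpha
    total = Fraction(0)
    for full in product((0, 1), repeat=T):
        if any(bayes_factor(full[:n]) >= threshold for n in range(1, T + 1)):
            total += p0(full)
    return total


if __name__ == "__main__":
    for alpha in (Fraction(1, 2), Fraction(1, 10), Fraction(1, 50)):
        p = prob_ever_rejecting(alpha)
        print(f"alpha = {alpha}: P0(ever reject) = {p} (about {float(p):.4f})")
```

In each case the probability of ever rejecting stays at or below the chosen level, even though the test is evaluated after every single observation.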

The fact that Bayes factor testing with singleton $H_0$ handles optional stopping in this frequentist way was noted by Edwards et al. (1963) and also emphasized by Good (1991), among many others. If $H_0$ is not a singleton, then (8) still holds, so the Bayes factor still handles optional stopping in a mixed frequentist (Type I-error) and Bayesian (marginalizing over the prior within $H_0$) sense. From a frequentist perspective, one may not consider this to be fully satisfactory; yet, as argued by Bayarri et al. (2016), this semi-frequentist sense of optional stopping, when applied to rather than , sometimes does correspond to what many frequentists consider acceptable in practice.

Moreover, in the practically important group invariance case, we can once again show that, if all parameters in $H_0$ are nuisance parameters equipped with the right Haar prior, then the Bayes factor is truly robust to optional stopping in the above frequentist sense, i.e. (8) will hold for all $P \in H_0$ and not just ‘on average’. While this is hinted at in several papers (e.g. Bayarri et al., 2016, Dass and Berger, 2003), it seems never to have been proven formally; we provide a proof in Section 4.5.

3 The General Case

Let be a measurable space. Fix some and consider a sequence of functions on so that each , takes values in some fixed set (‘outcome space’) with associated $\sigma$-algebra . When working with proper priors we invariably take and then we define and we let be the -fold product algebra of . When working with improper priors it turns out to be useful (more explanation further below) to take and define an initial sample random variable on , taking values in some set with associated $\sigma$-algebra . In that case we set, for , , and and we let be . In either case, we let be the $\sigma$-algebra (relative to ) generated by . Then is a filtration relative to and if we equip with a distribution then becomes a random process adapted to . A stopping time is now generalized to be a function such that for each , the event is -measurable; note that we only consider stopping after initial outcomes. Again, for a given stopping time and sequence of data , we say that is compatible with if it satisfies , i.e. .

$H_0$ and $H_1$ are now sets of probability distributions on . Again one writes where now the parameter sets (which, however, could themselves be infinite-dimensional) are themselves equipped with suitable $\sigma$-algebras .

We will still represent both and by unique measures and respectively, which we now allow to be based on (1) with improper priors and that may be infinite measures; as a result and are positive real measures that may themselves be infinite. We also allow to be a general (in particular uncountable) set. Both nonintegrability and uncountability cause complications, but these can be overcome if suitable Radon-Nikodym derivatives exist. To ensure this, we will assume that for all , for all and , , and are all mutually absolutely continuous and that the measures and are $\sigma$-finite. Then there also exists a measure on such that, for all such , , and are all mutually absolutely continuous: we can simply take , but in practice, it is often possible and convenient to take such that is the Lebesgue measure on , which is why we explicitly introduce it here.

The absolute continuity conditions guarantee that all required Radon-Nikodym derivatives exist. Finally, we assume that the measures and are proper probability measures for all . This final requirement is the reason why we sometimes need to consider and nonstandard sample spaces in the first place: in practice, one usually starts with the standard setting of a where and all have the same status. In all practical situations with improper priors and/or that we know of, there is a smallest finite and a set that has measure under all probability distributions in , such that, restricted to the sample space , the measures and are $\sigma$-finite and mutually absolutely continuous, and the posteriors (as defined in the standard manner in (11) below) are proper probability measures. One then sets to equal this , and sets , and the required properness will be guaranteed. Our initial sample is a variation of what is called (for example, by Bayarri et al. (2012)) a minimal sample. Yet, the sample size of a standard minimal sample is itself a random quantity; by restricting to , we can take its sample size to be constant rather than random, which will greatly simplify the treatment of optional stopping with group invariance; see Examples 1 and 2 below.

We henceforth refer to the setting now defined (with and initial space satisfying the requirements above) as the general case.

We need an analogue of (5) for this general case. If and are probability measures, then there is still a standard definition of conditional probability distributions in terms of conditional expectation for any given $\sigma$-algebra ; based on this, we can derive the required analogue in two steps. First, we consider the case that for some ; we know in advance that we observe for a fixed : the appropriate is then , is determined by hence can be written as , and a straightforward calculation gives that


where and are versions of the Radon-Nikodym derivatives defined relative to . The second step is now to follow exactly the same steps as in the derivation of (5), replacing by (9) wherever appropriate (we omit the details). This yields, for any such that , and for -almost every that is compatible with ,


where here as below, for , we abbreviate to .

The above expression for the posterior is valid if and are probability measures; we will simply take it as the definition of the Bayes factor for the general case. Again this coincides with standard usage for the improper prior case. In particular, let us define the conditional posteriors and Bayes factors given in the standard manner, by the formal application of Bayes’ rule, for and measurable and -measurable ,


where is defined as the value that (a version of) the conditional probability takes when , and is thus defined up to a set of -measure 0.

With these definitions, it is straightforward to derive the following coherence property, which automatically holds if the priors are proper, and which in combination with (10) expresses that first updating on and then on has the same result as updating based on the full at once:


3.1 $\tau$-independence, general case

The general version of the claim that the posterior odds do not depend on the specific stopping rule that was used is now immediate, since the expression (10) for the Bayes factor does not depend on the stopping time .

3.2 Calibration, general case

We will now show that the calibration hypothesis continues to hold in our general setting. From here onward, we make the further reasonable assumption that for every , (the stopping time is almost surely finite), and we define .

To prepare further, let be any collection of positive random variables such that for each , is -measurable. We can define the stopped random variable as


where we note that, under this definition, is well-defined even if .

We can define the induced measures on the positive real line under the null and alternative hypothesis for any probability measure on :


where denotes the Borel $\sigma$-algebra of . Note that, when we refer to , this is identical to for the stopping time which on all of stops at . The following lemma is crucial for passing from fixed-sample size to stopping-rule based results.

Lemma 1.

Let and be as above. Consider two probability measures and on . Suppose that for all , the following fixed-sample size calibration property holds:


Then we have


The proof is in Appendix A.

In this subsection we apply this lemma to the measures for arbitrary fixed , with their induced measures for the stopped posterior odds . Formally, the posterior odds as defined in (10) constitute a random variable for each , and, under our mutual absolute continuity assumption for and , can be directly written as . Since, by definition, the measures are probability measures, the Radon-Nikodym derivatives in (16) and (17) are well-defined.

Lemma 2.

We have for all , all :


Combining the two lemmas now immediately gives (19) below, and combining further with (13) and (10) gives (20):

Corollary 3.

In the setting considered above, we have for all :


and also


In words, the posterior odds remain calibrated under any stopping rule which stops almost surely at times .

For discrete and strictly positive measures with prior odds , we always have , and (19) is equivalent to (6). Note that -a.s. in (19) is equivalent to -a.s. because the two measures are assumed mutually absolutely continuous.

3.3 (Semi-) Frequentist Optional Stopping

In this section we consider our general setting as in the beginning of Section 3.2, i.e. with the added assumption that the stopping time is a.s. finite, and with . From here onward we shall further simplify slightly by assuming that the prior probabilities of $H_0$ and $H_1$ are equal, so that the posterior odds (given and possibly ) equal the Bayes factor .

Consider any initial sample and let and be the conditional Bayes marginal distributions as defined in (12). We first note that, by Markov’s inequality, for any nonnegative random variable on with, for all , , we must have, for , .
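The Markov step used here can be written out as follows. This is a sketch in assumed notation, since the original symbols are not recoverable from the text: $U$ stands for the nonnegative random variable, $P$ for any distribution under which its expectation is at most 1, and $\alpha \in (0, 1]$ for the significance level, with the convention $1/0 = \infty$.

```latex
\begin{align*}
P\!\left(\frac{1}{U} \le \alpha\right)
  &= P\!\left(U \ge \frac{1}{\alpha}\right)
  && \text{(since } U \ge 0 \text{ and } \alpha > 0\text{)} \\
  &\le \alpha \, \mathbb{E}_P[U]
  && \text{(Markov's inequality)} \\
  &\le \alpha
  && \text{(since } \mathbb{E}_P[U] \le 1\text{)}.
\end{align*}
```

The only property of $U$ that enters is the bound on its expectation, which is why the same argument applies to a stopped Bayes factor once that bound has been established.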

Proposition 4.

Let be any stopping rule satisfying our requirements. The stopped Bayes factor as defined by (14) (with in the role of ) is a random variable that satisfies, for all , , so that, by the reasoning above, .


The following implications are all immediate:

The desired result now follows by plugging in a particular stopping rule: let be the frequentist sequential test defined by setting, for all , : iff .

Corollary 5.

Let be the smallest for which . Then for arbitrarily large , when applied to the stopping rule , we find that

The corollary implies that the test is robust under optional stopping in the frequentist sense relative to $H_0$ (Definition 1). Note that, just as in the simple case, the setting is really just ‘semi-frequentist’ whenever $H_0$ is not a singleton.

4 Optional stopping with group invariance

Whenever the null hypothesis is composite, the previous results only hold under the marginal distribution or, in the case of improper priors, under . When a group structure can be imposed on the outcome space and (a subset of the) parameters that is common to $H_0$ and $H_1$, stronger results can be derived for calibration and frequentist optional stopping. Invariably, such parameters function as nuisance parameters and our results are obtained if we equip them with the so-called right Haar prior, which is usually improper. Below we show how we then obtain results that simultaneously hold for all values of the nuisance parameters. Such cases include many standard testing scenarios such as the (Bayesian variations of the) $t$-test, as illustrated in the examples below. Note though that our results do not apply to settings with improper priors for which no group structure exists; for example, if expresses that are i.i.d. Poisson, then from an objective Bayes or MDL point of view it makes sense to adopt Jeffreys’ prior for the Poisson model; this prior is improper, allows initial sample size , but does not allow for a group structure. For such a prior we can only use the marginal results Corollary 3 and Corollary 5.

4.1 Background for fixed sample sizes

Here we prepare for our results by providing some general background on invariant priors for Bayes factors with fixed sample size on models with nuisance parameters that admit a group structure, introducing the right Haar measure, the corresponding Bayes marginals, and (maximal) invariants. We use these results in Section 4.2 to derive Lemma 7, which gives us a strong version of calibration for fixed . The setting is extended to variable stopping times in Section 4.3, and then Lemma 7 is used in this extended setting to obtain our strong optional stopping results in Sections 4.4 and 4.5.

For now, we assume a sample space that is locally compact and Hausdorff, and that is a subset of some product space where is itself locally compact and Hausdorff. This requirement is met, for example, when and . In practice, the space is invariably a subset of that excludes some ‘singular’ outcomes that have measure $0$ under all hypotheses involved. We associate with its Borel $\sigma$-algebra which we denote as . Observations are denoted by the random vector . We thus consider outcomes of fixed sample size, denoting these as , returning to the case with stopping times in Sections 4.4 and 4.5.

We start with some group-theoretical preliminaries; for more details, see e.g. (Eaton, 1989, Wijsman, 1990, Andersson, 1982).

Definition 2 (Eaton (1989), Definition 2.1).

Let be a group of measurable one-to-one transformations on with identity . Let be a set. A function satisfying