Combining e-values and p-values

12/12/2019 ∙ by Vladimir Vovk, et al. ∙ University of Waterloo ∙ Royal Holloway, University of London

Multiple testing of a single hypothesis and testing multiple hypotheses are usually done in terms of p-values. In this paper we replace p-values with their Bayesian counterpart, e-values, which are, essentially, Bayes factors stripped of their Bayesian context. We demonstrate that e-values are often mathematically more tractable and develop procedures using e-values for multiple testing of a single hypothesis and testing multiple hypotheses.




1 Introduction

The problem of multiple testing of a single hypothesis is usually formalized as that of combining a set of p-values. The notion of p-values, however, has a strong competitor, which we refer to as e-values in this paper. E-values have been used widely, under different names and in different contexts. However, they have started being widely discussed in their pure form, regardless of the context, only recently: see, e.g., Shafer (2019) (who uses the term “betting score” for our “e-value”), Shafer and Vovk (2019, Section 11.5) (who use “Skeptic’s capital”), and Grünwald et al. (2019) (who use “S-value”).

Historically, the use of p-values vs e-values reflects the conventional division of statistics into frequentist and Bayesian (although a sizable fraction of people interested in the foundations of statistics, including the authors of this paper, are neither frequentists nor Bayesians). P-values are a hallmark of frequentist statistics, but Bayesians often regard p-values as misleading, preferring the use of Bayes factors (which can be combined with prior probabilities to obtain posterior probabilities). In the case of simple statistical hypotheses, a Bayes factor is the likelihood ratio of an alternative hypothesis to the null hypothesis (or vice versa, as in Shafer et al. 2011). The key property of the Bayes factor is that it is a nonnegative extended random variable whose expected value under the null hypothesis is at most 1. We express this property by saying that the Bayes factor is an e-value. (P-values are also known as “probability values”; similarly, we abbreviate “expectation values” to “e-values”.)

The literature on Bayes factors is vast; we only mention the influential review by Kass and Raftery (1995) and the historical investigation by Etz and Wagenmakers (2017).

The question of transforming p-values into e-values, or calibration of p-values, has a long history in Bayesian statistics. The idea was first raised by Berger and Delampady (1987, Section 4.2) (who, however, referred to the idea as “ridiculous”; since then the idea has been embraced by the Bayesian community). The class of calibrators (13) was proposed in Vovk (1993) and rediscovered in Sellke et al. (2001). A simple characterization of the class of all calibrators was first obtained in Shafer et al. (2011). A popular Bayesian point of view is that p-values tend to be misleading and need to be transformed into e-values in order to make sense of them. The problem of non-uniqueness of calibrators is sometimes solved by considering the VS bound of Section 5 (the best e-value that can be attained by the class (13), advocated by, e.g., Benjamin and Berger (2019), Recommendations 2 and 3), but this does not produce a valid e-value.

One area where both p-values and e-values have been used for a long time is the algorithmic theory of randomness (see, e.g., Shen et al. 2017), an area that originated in Kolmogorov’s work on the algorithmic foundations of probability and information (Kolmogorov, 1965, 1968). Martin-Löf (1966) introduced an algorithmic version of p-values, and then Levin (1976) introduced an algorithmic version of e-values. In the algorithmic theory of randomness people are often interested in low-accuracy results, and then p-values and e-values can be regarded as slight variations of each other. If $e$ is an e-value, $1/e$ will be a p-value; and vice versa, if $p$ is a p-value, $1/p$ will be an approximate e-value.

As we have said, the focus of this paper is on combining e-values and multiple hypotheses testing using e-values. The picture that arises for these two fields is remarkably different from its counterpart for p-values.

We start the main part of the paper by defining the notion of e-values and showing that the problem of merging e-values is more or less trivial: natural merging functions are essentially dominated by arithmetic mean (Section 2). In Section 3 we assume, additionally, that the e-variables being merged are independent, and show that the domination structure is much richer. In Section 4 we apply these results to multiple hypotheses testing. Section 5 reviews known results about relations between individual e-values and individual p-values; we will discuss how the former can be turned into the latter and vice versa (with very different domination structures for the two directions). In the next section, Section 6, we review known results about merging p-values and draw parallels with merging e-values; in the last subsection we assume that the p-values are independent. Section 7 discusses “cross-merging”: merging p-values into one e-value and merging e-values into one p-value. Section 8 is devoted to experimental results, and Section 9 concludes the main part of the paper. Appendix A contains several results that are less closely connected with the main messages of this paper.

2 Merging e-values

For a probability space $(\Omega,\mathcal{A},Q)$, an e-variable is an extended random variable $E:\Omega\to[0,\infty]$ satisfying $\mathbb{E}^Q[E]\le 1$ (we refer to it as “extended” since its values are allowed to be $\infty$). The values taken by e-variables will be referred to as e-values, and we denote the set of e-variables by $\mathcal{E}_Q$. It is important to allow $E$ to take value $\infty$; in the context of testing $Q$, observing $E=\infty$ for an a priori chosen e-variable $E$ means that we are entitled to reject $Q$ as null hypothesis.

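Let $K\ge 2$ be a positive integer (fixed throughout the paper). An e-merging function of $K$ e-values is an increasing Borel function $F:[0,\infty]^K\to[0,\infty]$ such that, for any probability space $(\Omega,\mathcal{A},Q)$ and random variables $E_1,\dots,E_K$ on it,

$$E_1,\dots,E_K\in\mathcal{E}_Q \implies F(E_1,\dots,E_K)\in\mathcal{E}_Q \tag{1}$$

(in other words, $F$ transforms e-values into an e-value). In this paper we will also refer to increasing Borel functions $F:[0,\infty)^K\to[0,\infty)$ satisfying (1) for all probability spaces and all e-variables $E_1,\dots,E_K$ taking values in $[0,\infty)$ as e-merging functions; such functions are canonically extended to e-merging functions $F:[0,\infty]^K\to[0,\infty]$ by setting them to $\infty$ on $[0,\infty]^K\setminus[0,\infty)^K$ (see Proposition A.1 in Appendix A).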

It suffices to require that (1) hold for a fixed atomless probability space $(\Omega,\mathcal{A},Q)$, as we explain in Appendix A (Proposition A.4). We will fix such a probability space for the rest of the paper (apart from Section 4 and Appendix A itself) and will let $\mathbb{E}[X]$ or $\mathbb{E}X$ stand for $\mathbb{E}^Q[X]$ for any random variable $X$.

An e-merging function $F$ dominates an e-merging function $G$ if $F\ge G$ (i.e., $F(\mathbf{e})\ge G(\mathbf{e})$ for all $\mathbf{e}\in[0,\infty)^K$). The domination is strict (and we say that $F$ strictly dominates $G$) if $F\ge G$ and $F(\mathbf{e})>G(\mathbf{e})$ for some $\mathbf{e}\in[0,\infty)^K$. We say that an e-merging function $F$ is admissible if it is not strictly dominated by any e-merging function; in other words, admissibility means being maximal in the partial order of domination. Finally, we say that an e-merging function is inadmissible if it is not admissible.

The notion of admissibility is much stronger than the notion of being “precise” that we used in Vovk and Wang (2019a). In the context of e-merging functions, an e-merging function $F$ is precise if $cF$ is not an e-merging function for any $c>1$.

Merging e-values via averaging

In this paper we are only interested in symmetric merging functions (i.e., those invariant w.r. to permutations of their arguments). The main message of this section is that the most useful (and the only useful, in a natural sense) symmetric e-merging function is the arithmetic mean

$$M_K(e_1,\dots,e_K) := \frac{e_1+\dots+e_K}{K}, \qquad e_1,\dots,e_K\in[0,\infty).$$

It will follow immediately from Proposition 3.1 that $M_K$ is admissible. But first we state formally the claim that $M_K$ is the only useful symmetric e-merging function.

An e-merging function $F$ essentially dominates an e-merging function $G$ if, for all $\mathbf{e}\in[0,\infty)^K$,

$$G(\mathbf{e})>1 \implies F(\mathbf{e})\ge G(\mathbf{e}).$$

This weakens the notion of domination in a natural way: now we require that $F$ is not worse than $G$ only in cases where $G$ is not useless; we are not trying to compare degrees of uselessness. The following proposition can be interpreted as saying that $M_K$ is at least as good as any other symmetric e-merging function.

Proposition 2.1.

Arithmetic mean essentially dominates any symmetric e-merging function.

In particular, if $F$ is an e-merging function that is symmetric and positively homogeneous (i.e., $F(\lambda\mathbf{e})=\lambda F(\mathbf{e})$ for all $\lambda>0$), then $F$ is dominated by $M_K$. This includes the e-merging functions discussed later in Section 6.
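As a minimal illustration of merging by averaging, the sketch below computes the arithmetic mean of a list of e-values. The function name `merge_mean` and the input checks are our own illustrative choices, not part of the paper.

```python
import math

def merge_mean(e_values):
    """Arithmetic-mean merging: average of nonnegative e-values (inf is allowed)."""
    e_values = list(e_values)
    if not e_values:
        raise ValueError("need at least one e-value")
    if any(math.isnan(e) or e < 0 for e in e_values):
        raise ValueError("e-values must be nonnegative")
    return sum(e_values) / len(e_values)

# Three e-values from (possibly dependent) experiments.
merged = merge_mean([4.0, 0.5, 1.5])
print(merged)
```

No independence assumption is needed here: the expectation of an average of e-variables is at most 1 by linearity.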

Proof of Proposition 2.1.

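Let $F$ be a symmetric e-merging function. First let us check that, for all $\mathbf{e}\in[0,\infty)^K$,

$$M_K(\mathbf{e})\ge 1 \implies F(\mathbf{e})\le M_K(\mathbf{e}). \tag{3}$$

Suppose for the purpose of contradiction that there exists $\mathbf{e}=(e_1,\dots,e_K)\in[0,\infty)^K$ such that

$$F(\mathbf{e}) > M_K(\mathbf{e}) \ge 1. \tag{4}$$

Write $a:=M_K(\mathbf{e})$ and $b:=F(\mathbf{e})$. Let $\Pi_K$ be the set of all permutations of $\{1,\dots,K\}$, $\pi$ be randomly and uniformly drawn from $\Pi_K$, and $(D_1,\dots,D_K):=(e_{\pi(1)},\dots,e_{\pi(K)})$.

Further, let $(D'_1,\dots,D'_K):=(D_1,\dots,D_K)1_A$, where $A$ is an event independent of $\pi$ and satisfying $Q(A)=1/a$ (the existence of such random $\pi$ and $A$ is guaranteed by Lemma A.2 in Appendix A). For each $k$, we have $\mathbb{E}[D'_k]=Q(A)\,\mathbb{E}[D_k]=a/a=1$, and hence $D'_k\in\mathcal{E}_Q$. Moreover, by symmetry,

$$\mathbb{E}[F(D'_1,\dots,D'_K)] = Q(A)F(\mathbf{e}) + (1-Q(A))F(\mathbf{0}) \ge b/a > 1,$$

a contradiction. Therefore, we conclude that there is no $\mathbf{e}$ such that (4) holds.

Now suppose $F(\mathbf{e})>1$. Our goal is to prove $F(\mathbf{e})\le M_K(\mathbf{e})$. Arguing indirectly, suppose $F(\mathbf{e})>M_K(\mathbf{e})$. If $M_K(\mathbf{e})\ge 1$, we get a contradiction by applying (3). And if $M_K(\mathbf{e})<1$, we can increase some or all components of $\mathbf{e}$ to get $M_K(\mathbf{e})=1$, and we will still have $F(\mathbf{e})>1$; this contradicts (3). ∎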

It is clear that arithmetic mean does not dominate every symmetric e-merging function; for example, the convex mixtures

$$\lambda + (1-\lambda)M_K, \qquad \lambda\in[0,1],$$

of the trivial e-merging function $1$ and $M_K$ are pairwise non-comparable (with respect to the relation of domination).

3 Merging independent e-values

In this section we consider merging functions for independent e-values; remember that in Section 2 we fixed an atomless probability space $(\Omega,\mathcal{A},Q)$. An ie-merging function of $K$ e-values is an increasing Borel function $F:[0,\infty)^K\to[0,\infty)$ such that $F(E_1,\dots,E_K)\in\mathcal{E}_Q$ for all independent $E_1,\dots,E_K\in\mathcal{E}_Q$. As for e-merging functions,

  • this definition is essentially equivalent to the definition involving $[0,\infty]$ rather than $[0,\infty)$ (by Proposition A.1 in Appendix A, which is still applicable in the context of merging independent e-values),

  • and this definition is equivalent to the definition involving the universal quantifier over all probability spaces (see Proposition A.6).

The definitions of domination, strict domination, admissibility, and inadmissibility are obtained from the definition of the previous section by replacing “e-merging” with “ie-merging”.


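Let $i\mathcal{E}_Q^K\subseteq\mathcal{E}_Q^K$ be the set of (component-wise) independent random vectors in $\mathcal{E}_Q^K$, and $\mathbf{1}$ be the all-1 vector in $\mathbb{R}^K$. The following proposition has already been used (in particular, it implies that arithmetic mean is an admissible e-merging function).

Proposition 3.1.

For an increasing Borel function $F:[0,\infty)^K\to[0,\infty)$, if $\mathbb{E}[F(\mathbf{E})]=1$ for all $\mathbf{E}\in\mathcal{E}_Q^K$ with $\mathbb{E}[\mathbf{E}]=\mathbf{1}$ (resp., for all $\mathbf{E}\in i\mathcal{E}_Q^K$ with $\mathbb{E}[\mathbf{E}]=\mathbf{1}$), then $F$ is an admissible e-merging function (resp., an admissible ie-merging function).

Proof.

It is obvious that $F$ is an e-merging function (resp., ie-merging function). Next we show that $F$ is admissible. Suppose for the purpose of contradiction that there exists an ie-merging function $G$ such that $G\ge F$ and $G(e_1,\dots,e_K)>F(e_1,\dots,e_K)$ for some $(e_1,\dots,e_K)\in[0,\infty)^K$. Take $(E_1,\dots,E_K)\in i\mathcal{E}_Q^K$ with $\mathbb{E}[(E_1,\dots,E_K)]=\mathbf{1}$ such that $Q((E_1,\dots,E_K)=(e_1,\dots,e_K))>0$. Such a random vector is easy to construct by considering any distribution with a positive mass on each of $e_1,\dots,e_K$. Then we have

$$Q\bigl(G(E_1,\dots,E_K)>F(E_1,\dots,E_K)\bigr)>0,$$

which implies

$$\mathbb{E}[G(E_1,\dots,E_K)] > \mathbb{E}[F(E_1,\dots,E_K)] = 1,$$

contradicting the assumption that $G$ is an ie-merging function. Therefore, no ie-merging function strictly dominates $F$. Noting that an e-merging function is also an ie-merging function, admissibility of $F$ is guaranteed under both settings. ∎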

If $E_1,\dots,E_K$ are independent e-variables, their product $E_1\cdots E_K$ will also be an e-variable. This is the analogue of Fisher’s [1932] method for p-values (according to the rough relation $1/e\sim p$ mentioned in Section 1 and discussed further in Section 5; Fisher’s method is discussed at the end of Section 6). The ie-merging function

$$(e_1,\dots,e_K)\mapsto e_1\cdots e_K$$

is admissible by Proposition 3.1. It will be referred to as the product (or multiplication) ie-merging function.

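More generally, we can see that the U-statistics

$$U_n(e_1,\dots,e_K) := \frac{1}{\binom{K}{n}}\sum_{\{k_1,\dots,k_n\}\subseteq\{1,\dots,K\}} e_{k_1}\cdots e_{k_n}, \qquad n\in\{0,1,\dots,K\}, \tag{7}$$

and their convex mixtures are ie-merging functions. Notice that this class includes product (for $n=K$), arithmetic average (for $n=1$), and constant 1 (for $n=0$). Proposition 3.1 implies that the U-statistics (7) and their convex mixtures are admissible ie-merging functions.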
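The following sketch shows one way to evaluate such U-statistics, assuming (7) is the average of the products over all $n$-element subsets of the input e-values. The name `u_statistic` is hypothetical, and the brute-force enumeration is only meant for small $K$.

```python
from itertools import combinations
from math import comb, prod

def u_statistic(e_values, n):
    """Average of the products over all n-element subsets of the e-values."""
    K = len(e_values)
    if not 0 <= n <= K:
        raise ValueError("n must be between 0 and K")
    # combinations(..., 0) yields one empty tuple, whose product is 1.
    return sum(prod(subset) for subset in combinations(e_values, n)) / comb(K, n)

e = [2.0, 3.0, 0.5]
print(u_statistic(e, 0))  # constant 1
print(u_statistic(e, 1))  # arithmetic mean
print(u_statistic(e, 3))  # product
```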

Let us now establish a very weak counterpart of Proposition 2.1 for independent e-values. An ie-merging function $F$ weakly dominates an ie-merging function $G$ if, for all $e_1,\dots,e_K$,

$$(e_1,\dots,e_K)\in[1,\infty)^K \implies F(e_1,\dots,e_K)\ge G(e_1,\dots,e_K).$$

In other words, we require that $F$ is not worse than $G$ if all input e-values are useful (and this requirement is weak because, especially for a large $K$, we are also interested in the case where some of the input e-values are useless).

Proposition 3.2.

The product weakly dominates any symmetric ie-merging function.


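Proof.

Let $F$ be an ie-merging function and suppose that there exists $(e_1,\dots,e_K)\in[1,\infty)^K$ such that

$$F(e_1,\dots,e_K) > e_1\cdots e_K.$$

Let $E_1,\dots,E_K$ be independent random variables such that each $E_k$ for $k\in\{1,\dots,K\}$ takes values in the two-element set $\{0,e_k\}$ and $E_k=e_k$ with probability $1/e_k$. Then each $E_k$ is an e-variable but

$$\mathbb{E}[F(E_1,\dots,E_K)] \ge F(e_1,\dots,e_K)\,Q(E_1=e_1,\dots,E_K=e_K) > e_1\cdots e_K\cdot\frac{1}{e_1}\cdots\frac{1}{e_K} = 1,$$

which contradicts $F$ being an ie-merging function. ∎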

Testing with martingales

The assumption of the independence of e-variables $E_1,\dots,E_K$ is not necessary for the product $E_1\cdots E_K$ to be an e-variable. It suffices to assume that $\mathbb{E}[E_k\mid E_1,\dots,E_{k-1}]\le 1$ a.s. for all $k\in\{1,\dots,K\}$. In this situation the sequence of the partial products $E_1\cdots E_k$, $k=0,1,\dots,K$, becomes a supermartingale (or a test supermartingale, in the terminology of Shafer et al. 2011 and Grünwald et al. 2019, meaning a nonnegative supermartingale with initial value 1). A possible interpretation of this test supermartingale is that the e-values $e_1,e_2,\dots$ are obtained by laboratories $1,2,\dots$ in this order, and laboratory $k$ makes sure that its result $e_k$ is a valid e-value given the previous results $e_1,\dots,e_{k-1}$.

4 Application to testing multiple hypotheses

As in Vovk and Wang (2019a), we will apply results for multiple testing of a single hypothesis (combining e-values in the context of Sections 2 and 3) to testing multiple hypotheses, spelling out the corresponding closed testing procedures (Marcus et al., 1976).

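Require: A sequence of e-values $e_1,\dots,e_K$.
1: Find a permutation $\pi$ of $\{1,\dots,K\}$ such that $e_{\pi(1)}\le\dots\le e_{\pi(K)}$.
2: Set $e_{(k)}:=e_{\pi(k)}$, $k\in\{1,\dots,K\}$ (these are the order statistics).
3: $S_0:=0$
4: for $i=1,\dots,K$ do
5:     $S_i:=S_{i-1}+e_{(i)}$
6: for $k=1,\dots,K$ do
7:     $e^*_{\pi(k)}:=e_{\pi(k)}$
8:     for $i=1,\dots,k-1$ do
9:         $e:=(e_{\pi(k)}+S_i)/(i+1)$
10:         if $e<e^*_{\pi(k)}$ then
11:             $e^*_{\pi(k)}:=e$
Algorithm 1 Closed method for adjusting e-values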

We are given a set of composite null hypotheses $H_k$, $k\in\{1,\dots,K\}$, and, for each $k$, an e-variable $E_k$ w.r. to $H_k$: $\mathbb{E}^Q[E_k]\le 1$ for any $Q\in H_k$. The closure for multiple testing of our e-merging procedure is given as Algorithm 1. The procedure adjusts the e-values $e_1,\dots,e_K$ obtained in the $K$ experiments (not necessarily independent) to new e-values $e^*_1,\dots,e^*_K$. Applying the procedure to the e-values $e_1,\dots,e_K$ produced by the e-variables $E_1,\dots,E_K$, we obtain extended random variables $E^*_1,\dots,E^*_K$ taking values $e^*_1,\dots,e^*_K$. First we define our desired property of validity for the procedure, which we will refer to as family-wise validity (FWV), in analogy with the standard family-wise error rate (FWER).

Formally, we are given subsets $H_1,\dots,H_K$ of the family $\mathfrak{P}(\Omega)$ of all probability measures on $(\Omega,\mathcal{A})$, and for each $k$, we are given an e-variable $E_k$ for testing $H_k$, as described earlier; suppose our procedure (such as the one given by Algorithm 1) produces extended random variables $E^*_1,\dots,E^*_K$ taking values in $[0,\infty]$. A conditional e-variable is a family of extended nonnegative random variables $E_Q$, $Q\in\mathfrak{P}(\Omega)$, that satisfies

$$\forall Q\in\mathfrak{P}(\Omega):\ \mathbb{E}^Q[E_Q]\le 1$$

(i.e., each $E_Q$ is in $\mathcal{E}_Q$). The procedure is family-wise valid (FWV) for the given $H_1,\dots,H_K$ if there exists a conditional e-variable $(E_Q)$ such that

$$\forall k\in\{1,\dots,K\}\ \forall Q\in H_k:\ E_Q\ge E^*_k$$

(where $E_Q\ge E^*_k$ means, as usual, that $E_Q(\omega)\ge E^*_k(\omega)$ for all $\omega\in\Omega$). We can say that such $(E_Q)$ witnesses the FWV property.

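Let us check that Algorithm 1 is family-wise valid. For $I\subseteq\{1,\dots,K\}$, the composite hypothesis $H_I$ is defined by

$$H_I := \Bigl(\bigcap_{k\in I}H_k\Bigr)\cap\Bigl(\bigcap_{k\notin I}H_k^c\Bigr), \tag{8}$$

where $H_k^c$ is the complement of $H_k$. The conditional e-variable witnessing that Algorithm 1 is family-wise valid is

$$E_Q := \frac{1}{|I_Q|}\sum_{k\in I_Q}E_k, \tag{9}$$

where $I_Q:=\{k \mid Q\in H_k\}$. The optimal adjusted e-variables can be defined as

$$E'_k := \min_{Q\in H_k}E_Q \ge \min_{I\subseteq\{1,\dots,K\}:\,k\in I}\ \frac{1}{|I|}\sum_{i\in I}E_i, \tag{10}$$

but for computational efficiency we use the conservative definition

$$E^*_k := \min_{I\subseteq\{1,\dots,K\}:\,k\in I}\ \frac{1}{|I|}\sum_{i\in I}E_i. \tag{11}$$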

Remark 4.1.

The inequality “$\ge$” in (10) holds as the equality “$=$” if all the intersections (8) are non-empty. If some of these intersections are empty, we can have a strict inequality. Algorithm 1 implements the definition (11). Therefore, it is valid regardless of whether some of the intersections (8) are empty; however, if they are, it may be possible to improve the adjusted e-values. According to Holm’s [1979] terminology, we allow “free combinations”. Shaffer (1986) pioneered methods that take account of the logical relations between the base hypotheses $H_k$.

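To obtain Algorithm 1, we rewrite the definitions (11) as

$$e^*_{\pi(k)} = \min_{i\in\{0,\dots,k-1\}}\frac{e_{\pi(k)}+e_{(1)}+\dots+e_{(i)}}{i+1}$$

for $k\in\{1,\dots,K\}$, where $\pi$ is the ordering permutation and $e_{(j)}=e_{\pi(j)}$ is the $j$th order statistic among $e_1,\dots,e_K$, as in Algorithm 1. (Indices $i\ge k$ can be ignored: adding a value that is at least $e_{\pi(k)}$ to the subset cannot decrease the average.) In lines 3–5 of Algorithm 1 we precompute the sums

$$S_i := e_{(1)}+\dots+e_{(i)}, \qquad i=0,\dots,K,$$

in lines 8–9 we compute

$$e := \frac{e_{\pi(k)}+e_{(1)}+\dots+e_{(i)}}{i+1}$$

for $i=1,\dots,k-1$, and as result of executing lines 6–11 we will have

$$e^*_{\pi(k)} = \min_{i\in\{0,\dots,k-1\}}\frac{e_{\pi(k)}+e_{(1)}+\dots+e_{(i)}}{i+1},$$

which shows that Algorithm 1 is an implementation of (11).

The computational complexity of Algorithm 1 is $O(K^2)$.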
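The closed adjustment can be sketched in Python as follows. This is only an illustration under the assumption that the adjusted e-value of hypothesis $k$ is the smallest arithmetic mean over all subsets of e-values containing $e_k$; the names `adjust_e_values` and `adjust_brute_force` are ours, and the brute-force version is included only as a cross-check on small inputs.

```python
from itertools import combinations

def adjust_e_values(e):
    """Quadratic-time closed adjustment under arithmetic-mean merging."""
    K = len(e)
    order = sorted(range(K), key=lambda k: e[k])  # indices in ascending order of e-value
    prefix = [0.0]                                # prefix[i] = sum of the i smallest e-values
    for idx in order:
        prefix.append(prefix[-1] + e[idx])
    adjusted = [0.0] * K
    for rank, idx in enumerate(order):            # rank = how many sorted values precede e[idx]
        best = e[idx]
        for i in range(1, rank + 1):              # average e[idx] with the i smallest values
            best = min(best, (e[idx] + prefix[i]) / (i + 1))
        adjusted[idx] = best
    return adjusted

def adjust_brute_force(e):
    """Same quantity by enumerating every subset containing k (exponential time)."""
    K = len(e)
    result = []
    for k in range(K):
        others = [i for i in range(K) if i != k]
        best = e[k]
        for size in range(1, K):
            for subset in combinations(others, size):
                best = min(best, (e[k] + sum(e[i] for i in subset)) / (size + 1))
        result.append(best)
    return result

print(adjust_e_values([8.0, 0.5, 3.0, 20.0]))
```

Precomputing the prefix sums is what keeps the inner loop at constant cost per candidate subset size.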

1:A sequence of e-values .
2:Let be the order statistics, as in Algorithm 1.
3:Let be the product of all , (and if there are no such ).
4:for  do
Algorithm 2 Closed method for adjusting independent e-values

In the case of independent e-variables, we have Algorithm 2. This algorithm assumes that the base e-variables are independent under any for any . The conditional e-variable witnessing that Algorithm 2 is family-wise valid is the one given by the product ie-merging function,

where is as in (9), and the adjusted e-variables are defined by

A remark similar to Remark 4.1 can also be made about Algorithm 2. The computational complexity of Algorithm 2 is (notice that the algorithm does not really require sorting the base e-values).
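For independent e-values, a closed adjustment based on the product can be sketched as below. We assume here that the adjusted e-value of hypothesis $k$ is the smallest product over all subsets of e-values containing $e_k$, and that all inputs are finite; the function name is hypothetical. Under that assumption the minimizing subset consists of $e_k$ together with every other e-value below 1, so no sorting is involved.

```python
from math import prod

def adjust_independent_e_values(e):
    """Closed adjustment under product merging (finite, nonnegative inputs assumed)."""
    # Product of all e-values below 1; the empty product is 1.
    a = prod(x for x in e if x < 1)
    # If e_k < 1 it is already a factor of a; otherwise multiply it in.
    return [a * max(x, 1.0) for x in e]

print(adjust_independent_e_values([0.5, 4.0, 0.25, 10.0]))
```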

5 Calibrating p-values and e-values

Similarly to the case of e-values, without loss of generality we fix an atomless probability space $(\Omega,\mathcal{A},Q)$ for all discussions of p-values (cf. Vovk and Wang 2019a, Section 2). A p-variable is a random variable $P:\Omega\to[0,1]$ satisfying

$$\forall\epsilon\in(0,1):\ Q(P\le\epsilon)\le\epsilon.$$

The set of all p-variables is denoted by $\mathcal{P}_Q$.

A calibrator is a function transforming p-values to e-values. Formally, a decreasing function $f:[0,1]\to[0,\infty]$ is a calibrator (or, more fully, p-to-e calibrator) if, for any p-variable $P\in\mathcal{P}_Q$, $f(P)\in\mathcal{E}_Q$. A calibrator $f$ is said to dominate a calibrator $g$ if $f\ge g$, and the domination is strict if $f\ne g$. A calibrator is admissible if it is not strictly dominated by any other calibrator.

The following proposition says that a calibrator is a nonnegative decreasing function integrating to 1 over the uniform probability measure.

Proposition 5.1.

A decreasing function $f:[0,1]\to[0,\infty]$ is a calibrator if and only if $\int_0^1 f\le 1$. It is admissible if and only if $f$ is upper semicontinuous, $f(0)=\infty$, and $\int_0^1 f=1$.

Of course, in the context of this proposition, being upper semicontinuous is equivalent to being left-continuous.


Proof.

Proofs of similar statements are given in, e.g., Vovk (1993, Theorem 7), Shafer et al. (2011, Theorem 3), and Shafer and Vovk (2019, Proposition 11.7), but we will give an independent short proof using our definitions. Suppose that $\int_0^1 f\le 1$ and $P$ is a p-variable, and let us show that $\mathbb{E}[f(P)]\le 1$. We can assume, without loss of generality, that the distribution of $P$ is uniform on $[0,1]$ replacing, if needed, $P$ with $P'\le P$ defined by

$$P'(\omega) := Q(P<P(\omega)) + \theta(\omega)\,Q(P=P(\omega)), \tag{12}$$

where $\theta$ is a random variable that is independent of $P$ and uniformly distributed on $[0,1]$ (for the existence of such $\theta$, at least in the case where $P$ is replaced by another random variable with the same distribution, see Lemma A.2; for the distribution of (12) being uniform, see, e.g., Ferguson 1967, Lemma 5.3.1). Now it remains to notice that $\mathbb{E}[f(P)]\le\mathbb{E}[f(P')]=\int_0^1 f\le 1$.

The second statement in Proposition 5.1 is obvious. ∎

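The following is a simple family of calibrators. Since $\int_0^1\kappa p^{\kappa-1}\,dp=1$, the functions

$$f_\kappa(p) := \kappa p^{\kappa-1} \tag{13}$$

are calibrators, where $\kappa\in(0,1)$. To solve the problem of choosing the parameter $\kappa$, sometimes the maximum

$$\mathrm{VS}(p) := \max_{\kappa\in[0,1]}f_\kappa(p) = \begin{cases} -\exp(-1)/(p\ln p) & \text{if } p\le\exp(-1),\\ 1 & \text{otherwise,}\end{cases}$$

is used; we will refer to it as the VS bound (abbreviating “Vovk–Sellke bound”, as used in, e.g., the JASP package). It is important to remember that $\mathrm{VS}(p)$ is not a valid e-value, but just an overoptimistic upper bound on what is achievable with the class (13).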
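A small numerical sketch of this family may help, assuming the class (13) consists of the functions $p\mapsto\kappa p^{\kappa-1}$ with $\kappa\in(0,1)$. The names `calibrate` and `vs_bound` are illustrative; the closed form used for the bound comes from maximizing over $\kappa$, with the maximum attained at $\kappa=-1/\ln p$ when $p<e^{-1}$.

```python
import math

def calibrate(p, kappa):
    """Calibrator from the assumed family: kappa * p**(kappa - 1), kappa in (0, 1)."""
    if not 0 < kappa < 1:
        raise ValueError("kappa must lie strictly between 0 and 1")
    if not 0 <= p <= 1:
        raise ValueError("p must lie in [0, 1]")
    return math.inf if p == 0 else kappa * p ** (kappa - 1)

def vs_bound(p):
    """Supremum of calibrate(p, kappa) over kappa; an upper bound, not an e-value."""
    if not 0 <= p <= 1:
        raise ValueError("p must lie in [0, 1]")
    if p == 0:
        return math.inf
    if p <= math.exp(-1):
        return -math.exp(-1) / (p * math.log(p))
    return 1.0

p = 0.01
print(calibrate(p, 0.5), vs_bound(p))
```

Any fixed `kappa` chosen in advance gives a valid e-value; picking `kappa` after seeing `p` (which is what the bound does) does not.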

In the opposite direction, an e-to-p calibrator is a function transforming e-values to p-values. Formally, a decreasing function $f:[0,\infty]\to[0,1]$ is an e-to-p calibrator if, for any e-variable $E\in\mathcal{E}_Q$, $f(E)\in\mathcal{P}_Q$. The following proposition, which is the analogue of Proposition 5.1 for e-to-p calibrators, says that there is, essentially, only one e-to-p calibrator, $f(t):=\min(1,1/t)$.

Proposition 5.2.

The function $f:[0,\infty]\to[0,1]$ defined by $f(t):=\min(1,1/t)$ is an e-to-p calibrator. It dominates every other e-to-p calibrator. In particular, it is the only admissible e-to-p calibrator.


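Proof.

The fact that $f$ is an e-to-p calibrator follows from Markov’s inequality: if $E\in\mathcal{E}_Q$ and $\epsilon\in(0,1)$,

$$Q(f(E)\le\epsilon) = Q(E\ge 1/\epsilon) \le \epsilon\,\mathbb{E}[E] \le \epsilon.$$

On the other hand, suppose that $g$ is another e-to-p calibrator. It suffices to check that $g$ is dominated by $f$. Suppose $g(t)<f(t)$ for some $t\in[0,\infty]$. Consider two cases:

  • If $g(t)<1/t$ for some $t>1$, fix such $t$ and consider an e-variable $E$ that is $t$ with probability $1/t$ and $0$ otherwise. Then $g(E)$ is $g(t)<1/t$ with probability $1/t$, whereas it would have satisfied $Q(g(E)\le g(t))\le g(t)<1/t$ had it been a p-variable.

  • If $g(t)<1$ for some $t\in[0,1]$, fix such $t$ and consider an e-variable $E$ that is $t$ a.s. Then $g(E)$ is $g(t)<1$ a.s., and so it is not a p-variable. ∎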

Proposition 5.1 implies that the domination structure of calibrators is very rich, whereas Proposition 5.2 implies that the domination structure of e-to-p calibrators is trivial.
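In code, the e-to-p direction is a one-liner. The sketch below assumes the calibrator is $t\mapsto\min(1,1/t)$, with the conventions $1/0=\infty$ and $1/\infty=0$; the function name `e_to_p` is our own.

```python
import math

def e_to_p(e):
    """Convert an e-value in [0, inf] to a p-value via min(1, 1/e)."""
    if math.isnan(e) or e < 0:
        raise ValueError("e-value must be nonnegative")
    if e == 0:
        return 1.0
    return min(1.0, 1.0 / e)

print(e_to_p(20.0))  # an e-value of 20 corresponds to a p-value of 0.05
```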

Remark 5.3.

A possible interpretation of this section’s results is that e-variables $E$ and p-variables $P$ are connected via a rough relation $1/E\sim P$, as already discussed in Section 1. In one direction, the statement is precise: the reciprocal of an e-variable is a p-variable by Proposition 5.2. On the other hand, using a calibrator (13) with a small $\kappa>0$ and ignoring positive constant factors (as customary in the algorithmic theory of randomness), we can see that the reciprocal of a p-variable is approximately an e-variable.

6 Merging p-values

Merging p-values is a much more difficult topic than merging e-values, but it is very well explored. First we review merging p-values without any assumptions, and then we move on to merging independent p-values.

A p-merging function of $K$ p-values is an increasing Borel function $F:[0,1]^K\to[0,1]$ such that $F(P_1,\dots,P_K)\in\mathcal{P}_Q$ whenever $P_1,\dots,P_K\in\mathcal{P}_Q$.

For merging p-values without the assumption of independence, we will concentrate on two natural families of p-merging functions. The older family is the one introduced by Rüger (1978), and the newer one was introduced in our paper Vovk and Wang (2019a). Rüger’s family is parameterized by $k\in\{1,\dots,K\}$, and its $k$th element is the function (shown by Rüger 1978 to be a p-merging function)

$$G_{k,K}(p_1,\dots,p_K) := \frac{K}{k}\,p_{(k)}\wedge 1, \tag{14}$$

where $p_{(k)}:=p_{\pi(k)}$ and $\pi$ is a permutation of $\{1,\dots,K\}$ ordering the p-values in the ascending order: $p_{\pi(1)}\le\dots\le p_{\pi(K)}$. The other family (Vovk and Wang, 2019a), which we will refer to as the $M$-family, is parameterized by $r\in[-\infty,\infty]$, and its element with index $r$ has the form $a_{r,K}M_{r,K}\wedge 1$, where

$$M_{r,K}(p_1,\dots,p_K) := \left(\frac{p_1^r+\dots+p_K^r}{K}\right)^{1/r} \tag{15}$$

and $a_{r,K}\ge 1$ is a suitable constant. We also define $M_{r,K}$ for $r\in\{0,\infty,-\infty\}$ as the limiting cases of (15), which correspond to the geometric average, the maximum, and the minimum, respectively.

The initial and final elements of both families coincide: the initial element is the Bonferroni p-merging function

$$(p_1,\dots,p_K)\mapsto K\min(p_1,\dots,p_K)\wedge 1, \tag{16}$$

and the final element is the maximum p-merging function

$$(p_1,\dots,p_K)\mapsto\max(p_1,\dots,p_K).$$

Similarly to the case of e-merging functions, we say that a p-merging function $F$ dominates a p-merging function $G$ if $F\le G$. The domination is strict if, in addition, $F(\mathbf{p})<G(\mathbf{p})$ for at least one $\mathbf{p}\in[0,1]^K$. We say that a p-merging function $F$ is admissible if it is not strictly dominated by any p-merging function $G$.
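To make Rüger’s family concrete, here is a short sketch that assumes its $k$th element is $(K/k)\,p_{(k)}$ truncated at 1, with $p_{(k)}$ the $k$th smallest p-value. The function names are illustrative; `bonferroni` and `maximum` are simply the two endpoints $k=1$ and $k=K$.

```python
def ruger(p_values, k):
    """k-th element of Rüger's family: (K/k) times the k-th smallest p-value, capped at 1."""
    K = len(p_values)
    if not 1 <= k <= K:
        raise ValueError("k must be between 1 and K")
    ordered = sorted(p_values)
    return min(1.0, K / k * ordered[k - 1])

def bonferroni(p_values):
    """Initial element (k = 1): K times the smallest p-value, capped at 1."""
    return ruger(p_values, 1)

def maximum(p_values):
    """Final element (k = K): the largest p-value."""
    return ruger(p_values, len(p_values))

p = [0.01, 0.2, 0.04, 0.5]
print(bonferroni(p), ruger(p, 2), maximum(p))
```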

The domination structure of p-merging functions is much richer than that of e-merging functions. The maximum p-merging function is clearly inadmissible (e.g., it is strictly dominated by $(p_1,\dots,p_K)\mapsto p_1$) while the Bonferroni p-merging function is admissible, as the following proposition shows.

Proposition 6.1.

The Bonferroni p-merging function (16) is admissible.


Denote by the Bonferroni p-merging function (16). Suppose the statement of the proposition is false and fix a p-merging function that strictly dominates . If whenever , then also when , since is increasing. Hence for some point ,

Fix such and set ; we know that . Since

we can take such that . Let be disjoint events such that for all and (their existence is guaranteed by the inequality ). Define random variables

. It is straightforward to check that . By writing and , we have

Therefore, is not a p-merging function, which gives us the desired contradiction. ∎

The general domination structure of p-merging functions appears to be very complicated, and is the subject of future planned work.

E-merging functions and the two families

The domination structure of the class of e-merging functions is very simple, as suggested by Proposition 2.1. It makes it very easy to understand what the e-merging analogues of Rüger’s family and the $M$-family are; when stating the analogues we will use the rough relation $1/e\sim p$ between e-values and p-values (see Remark 5.3).

For a sequence $e_1,\dots,e_K$, let $e_{[k]}:=e_{\pi(k)}$ be the order statistics numbered from the largest to the smallest; here $\pi$ is a permutation of $\{1,\dots,K\}$ ordering $e_k$ in the descending order: $e_{\pi(1)}\ge\dots\ge e_{\pi(K)}$. Let us check that the Rüger-type function $(e_1,\dots,e_K)\mapsto(k/K)e_{[k]}$ is a precise e-merging function. It is a merging function since it is dominated by arithmetic mean: indeed, the condition of domination

$$\frac{k}{K}e_{[k]}\le\frac{e_1+\dots+e_K}{K} \tag{17}$$

can be rewritten as

$$k\,e_{[k]}\le e_1+\dots+e_K$$

and so is obvious. As sometimes we have a strict inequality, the e-merging function is inadmissible (remember that we assume $K\ge 2$). The e-merging function is precise (by Proposition 2.1) because (17) holds as equality when the $k$ largest $e_i$, $i\in\{\pi(1),\dots,\pi(k)\}$, are all equal and greater than 1 and all the other $e_i$ are 0.

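In the case of the $M$-family, let us check that the function

$$F := \begin{cases} M_{r,K} & \text{if } r\le 1,\\ K^{1/r-1}M_{r,K} & \text{if } r>1,\end{cases} \tag{18}$$

is a precise e-merging function, for any $r\in[-\infty,\infty]$. For $r\le 1$, $M_{r,K}$ is increasing in $r$ (Hardy et al., 1952, Theorem 16), and so $M_{r,K}$ is dominated by arithmetic mean $M_K=M_{1,K}$, and so it is an e-merging function. For $r>1$ we can rewrite the function as

$$K^{1/r-1}M_{r,K}(e_1,\dots,e_K) = K^{-1}\left(e_1^r+\dots+e_K^r\right)^{1/r},$$

and we know that the last expression is a decreasing function of $r$ (Hardy et al., 1952, Theorem 19); therefore, $F$ is also dominated by $M_K$ and so is a merging function. The e-merging function $F$ is precise (for any $r$) since

$$r\le 1 \implies F(e,\dots,e)=e=M_K(e,\dots,e), \qquad r>1 \implies F(e,0,\dots,0)=e/K=M_K(e,0,\dots,0),$$

and so by Proposition 2.1 (applied to a sufficiently large $e$) $cF$ is not an e-merging function for any $c>1$. But $F$ is admissible if and only if $r=1$.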

Remark 6.2.

The rough relation $1/e\sim p$ also sheds light on the coefficient, $K^{1/r-1}$ for $r>1$, given in (18) in front of $M_{r,K}$. The coefficient $K^{1/r-1}$, $r>1$, in front of $M_{r,K}$ for averaging e-values corresponds to a coefficient of $K^{1+1/r}$, $r<-1$, in front of $M_{r,K}$ for averaging p-values. And indeed, by Proposition 5 of Vovk and Wang (2019a), the asymptotically precise coefficient in front of $M_{r,K}$, $r<-1$, for averaging p-values is $\frac{r}{r+1}K^{1+1/r}$. The extra factor $\frac{r}{r+1}$ appears because the reciprocal of a p-variable is only approximately, but not exactly, an e-variable.

Remark 6.3.

Our formulas for merging e-values are explicit and much simpler than the formulas for merging p-values given in Vovk and Wang (2019a). Merging e-values does not involve asymptotic approximations via the theory of robust risk aggregation, as used in that paper. This suggests that in some important respects e-values are easier objects to deal with than p-values.

Merging independent p-values

In this section we will discuss ways of combining p-values under the assumption that the p-values are independent.

One of the oldest and most popular methods for combining p-values is Fisher’s [1932, Section 21.1], which we already mentioned in Section 3. Fisher’s method is based on the product statistic $p_1\cdots p_K$ (with its low values significant) and uses the fact that $-2\ln(p_1\cdots p_K)$ has the $\chi^2$ distribution with $2K$ degrees of freedom when all $p_k$ are independent and distributed uniformly on the interval $[0,1]$.
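Fisher’s method can be sketched without external libraries, because the $\chi^2$ survival function with an even number $2K$ of degrees of freedom has the closed form $e^{-x/2}\sum_{j<K}(x/2)^j/j!$. The function name `fisher_combine` is our own, and strictly positive p-values are assumed so that the logarithms are finite.

```python
import math

def fisher_combine(p_values):
    """Combined p-value for independent p-values via Fisher's product statistic."""
    K = len(p_values)
    if K == 0 or any(not 0 < p <= 1 for p in p_values):
        raise ValueError("need p-values in (0, 1]")
    half = -sum(math.log(p) for p in p_values)  # x/2 with x = -2 ln(p_1 ... p_K)
    # Survival function of chi-squared with 2K dof: exp(-x/2) * sum_{j<K} (x/2)^j / j!
    term, total = 1.0, 1.0
    for j in range(1, K):
        term *= half / j
        total += term
    return math.exp(-half) * total

print(fisher_combine([0.05, 0.05]))
```

With a single p-value the method returns that p-value unchanged, which is a convenient sanity check.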

Simes (1986) proves a remarkable result for Rüger’s family (14) under the assumption that the p-values are independent: the minimum

$$\min_{k\in\{1,\dots,K\}}\frac{K}{k}\,p_{(k)}$$

of Rüger’s family over all $k$ turns out to be a p-merging function.
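A minimal sketch of the Simes combination, assuming it is the minimum over $k$ of $(K/k)$ times the $k$th smallest p-value (the name `simes` is illustrative). No truncation at 1 is needed because the term with $k=K$ is the largest p-value itself.

```python
def simes(p_values):
    """Simes combination: min over k of (K/k) * (k-th smallest p-value)."""
    K = len(p_values)
    if K == 0:
        raise ValueError("need at least one p-value")
    ordered = sorted(p_values)
    return min(K / k * p for k, p in enumerate(ordered, start=1))

print(simes([0.01, 0.2, 0.04, 0.5]))
```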

The counterpart of Simes’s result still holds for e-merging functions; moreover, now the input e-values do not have to be independent. Namely,

$$(e_1,\dots,e_K)\mapsto\max_{k\in\{1,\dots,K\}}\frac{k}{K}\,e_{[k]}$$

is an e-merging function. This follows immediately from (17), the left-hand side of which can be replaced by its maximum over $k$. And it also follows from (17) that there is no sense in using this counterpart; it is better to use arithmetic mean.

7 Cross-merging between e-values and p-values

In this section we will briefly discuss functions performing “cross-merging”: either merging several e-values into a p-value or several p-values into an e-value. Formally, an e-to-p merging function is a decreasing Borel function $F:[0,\infty]^K\to[0,1]$ such that $F(E_1,\dots,E_K)$ is a p-variable whenever $E_1,\dots,E_K$ are e-variables, and a p-to-e merging function is a decreasing Borel function $F:[0,1]^K\to[0,\infty]$ such that $F(P_1,\dots,P_K)$ is an e-variable whenever $P_1,\dots,P_K$ are p-variables. The message of this section is that cross-merging can be performed as composition of pure merging (applying an e-merging function or a p-merging function) and calibration (either e-to-p calibration or p-to-e calibration); however, in some important cases (we feel in the vast majority of cases) pure merging is more efficient, and should be done, in the domain of e-values.

Let us start from e-to-p merging. Given e-values $e_1,\dots,e_K$, we can merge them into one e-value by applying arithmetic mean, the only essentially admissible e-merging function (Proposition 2.1), and then by applying inversion $e\mapsto e^{-1}\wedge 1$, the only admissible e-to-p calibrator (Proposition 5.2). This gives us the e-to-p merging function

$$F(e_1,\dots,e_K) := \frac{K}{e_1+\dots+e_K}\wedge 1. \tag{20}$$

The following proposition shows that in this way we obtain the optimal symmetric e-to-p merging function.

Proposition 7.1.

The e-to-p merging function (20) dominates all symmetric e-to-p merging functions.


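Proof.

Denote by $F$ the e-to-p merging function (20). Suppose that a symmetric e-to-p merging function $G$ satisfies $G(e_1,\dots,e_K)<F(e_1,\dots,e_K)$ for some $(e_1,\dots,e_K)\in[0,\infty)^K$. The following arguments are similar to the proof of Proposition 2.1. As before, $\Pi_K$ is the set of all permutations on $\{1,\dots,K\}$, $\pi$ is randomly and uniformly drawn from $\Pi_K$, and $(D_1,\dots,D_K):=(e_{\pi(1)},\dots,e_{\pi(K)})$. Further, let $(D'_1,\dots,D'_K):=(D_1,\dots,D_K)1_A$, where $A$ is an event independent of $\pi$ and satisfying $Q(A)=F(e_1,\dots,e_K)$. For each $k$, we have $\mathbb{E}[D'_k]=Q(A)\,\frac{e_1+\dots+e_K}{K}\le 1$, and hence $D'_k\in\mathcal{E}_Q$. By the symmetry of $G$, we have $G(D'_1,\dots,D'_K)=G(e_1,\dots,e_K)$ on $A$, and hence

$$Q\bigl(G(D'_1,\dots,D'_K)\le G(e_1,\dots,e_K)\bigr)\ge Q(A)=F(e_1,\dots,e_K)>G(e_1,\dots,e_K).$$

This contradicts $G$ being an e-to-p merging function. ∎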

It is interesting that (20) can also be obtained by composing e-to-p calibration and improper pure p-merging. Given e-values $e_1,\dots,e_K$ we first transform them into p-values $1/e_1,\dots,1/e_K$ (in this paragraph we allow p-values greater than 1, as in Vovk and Wang 2019a). Wilson (2019) proposed harmonic mean as a p-merging function. The composition of these two transformations again gives us the e-to-p merging function (20). The problem with this argument is that, as Goeman et al. (2019, Wilson’s second claim) point out (with a reference to Vovk and Wang 2019a), Wilson’s method is in general not valid (we obtain a valid method if we multiply harmonic mean by $e\ln K$ for $K>2$, according to Vovk and Wang 2019a). Despite the illegitimate application of harmonic mean, the resulting function (20) is still a valid e-to-p merging function. At least in this context, we can see that e-to-p merging should be done by first doing pure merging and then e-to-p calibration, not vice versa.
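The coincidence of the two routes is easy to check numerically. The sketch below assumes (20) is $K/(e_1+\dots+e_K)$ truncated at 1, and compares it with inverting each e-value first and then taking the (truncated) harmonic mean of the resulting improper p-values. Both function names are hypothetical, and the second route assumes strictly positive e-values.

```python
def e_to_p_merge(e_values):
    """Average the e-values, then invert and truncate at 1."""
    K = len(e_values)
    total = sum(e_values)
    return 1.0 if total <= K else K / total

def harmonic_mean_route(e_values):
    """Invert each (strictly positive) e-value, then take the harmonic mean, capped at 1."""
    p = [1.0 / e for e in e_values]              # may exceed 1: improper p-values
    harmonic = len(p) / sum(1.0 / x for x in p)
    return min(1.0, harmonic)

e = [4.0, 8.0, 0.5, 2.0]
print(e_to_p_merge(e), harmonic_mean_route(e))
```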

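Now suppose we are given p-values $p_1,\dots,p_K$, and we would like to merge them into one e-value. Let $\kappa\in(0,1)$. Applying the calibrator (13), we obtain e-values $\kappa p_1^{\kappa-1},\dots,\kappa p_K^{\kappa-1}$, and since the average of e-values is an e-value,

$$F_\kappa(p_1,\dots,p_K) := \frac{\kappa}{K}\sum_{k=1}^K p_k^{\kappa-1} \tag{21}$$

is a p-to-e merging function.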
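This calibrate-then-average recipe can be sketched as follows, assuming the calibrator (13) is $p\mapsto\kappa p^{\kappa-1}$. The name `p_to_e_merge` and the default `kappa=0.5` are illustrative choices only; strictly positive p-values are assumed to keep the result finite.

```python
def p_to_e_merge(p_values, kappa=0.5):
    """Calibrate each p-value with kappa * p**(kappa - 1), then average."""
    K = len(p_values)
    if K == 0 or any(not 0 < p <= 1 for p in p_values):
        raise ValueError("need p-values in (0, 1]")
    if not 0 < kappa < 1:
        raise ValueError("kappa must lie strictly between 0 and 1")
    return kappa / K * sum(p ** (kappa - 1) for p in p_values)

print(p_to_e_merge([0.01, 0.04, 1.0], kappa=0.5))
```

The parameter `kappa` has to be fixed before looking at the p-values for the output to be a valid e-value.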

The following proposition will imply that all p-to-e merging functions (21) are admissible; moreover, it will show, in conjunction with Proposition 5.1, that for any admissible p-to-e calibrator $f$, the function

$$(p_1,\dots,p_K)\mapsto\frac{1}{K}\sum_{k=1}^K f(p_k)$$

is an admissible p-to-e merging function.

Proposition 7.2.

If is an upper semicontinuous and decreasing Borel function, for all