1 Introduction
The problem of multiple testing of a single hypothesis is usually formalized as that of combining a set of p-values. The notion of p-values, however, has a strong competitor, which we refer to as e-values in this paper. E-values have been used widely, under different names and in different contexts. However, they have started being widely discussed in their pure form, regardless of the context, only recently: see, e.g., Shafer (2019) (who uses the term “betting score” for our “e-value”), Shafer and Vovk (2019, Section 11.5) (who use “Skeptic’s capital”), and Grünwald et al. (2019) (who use “S-value”).
Historically, the use of p-values vs e-values reflects the conventional division of statistics into frequentist and Bayesian (although a sizable fraction of people interested in the foundations of statistics, including the authors of this paper, are neither frequentists nor Bayesians). P-values are a hallmark of frequentist statistics, but Bayesians often regard p-values as misleading, preferring the use of Bayes factors (which can be combined with prior probabilities to obtain posterior probabilities). In the case of simple statistical hypotheses, a Bayes factor is the likelihood ratio of an alternative hypothesis to the null hypothesis (or vice versa, as in Shafer et al. 2011). The key property of the Bayes factor is that it is a nonnegative extended random variable whose expected value under the null hypothesis is at most 1. We express this property by saying that the Bayes factor is an e-value. (P-values are also known as “probability values”; similarly, we abbreviate “expectation values” to “e-values”.) The literature on Bayes factors is vast; we only mention the influential review by Kass and Raftery (1995) and the historical investigation by Etz and Wagenmakers (2017).
The question of transforming p-values into e-values, or calibration of p-values, has a long history in Bayesian statistics. The idea was first raised by Berger and Delampady (1987, Section 4.2) (who, however, referred to the idea as “ridiculous”; since then the idea has been embraced by the Bayesian community). The class of calibrators was proposed in Vovk (1993) and rediscovered in Sellke et al. (2001). A simple characterization of the class of all calibrators was first obtained in Shafer et al. (2011). A popular Bayesian point of view is that p-values tend to be misleading and need to be transformed into e-values in order to make sense of them. The problem of non-uniqueness of calibrators is sometimes solved by considering the best e-value that can be attained by a class of calibrators (advocated by, e.g., Benjamin and Berger 2019, Recommendations 2 and 3), but this does not produce a valid e-value.

One area where both p-values and e-values have been used for a long time is the algorithmic theory of randomness (see, e.g., Shen et al. 2017), an area that originated in Kolmogorov’s work on the algorithmic foundations of probability and information (Kolmogorov, 1965, 1968). Martin-Löf (1966) introduced an algorithmic version of p-values, and then Levin (1976) introduced an algorithmic version of e-values. In the algorithmic theory of randomness people are often interested in low-accuracy results, and then p-values and e-values can be regarded as slight variations of each other. If E is an e-value, 1/E will be a p-value; and vice versa, if p is a p-value, 1/p will be an approximate e-value.
As we have said, the focus of this paper is on combining e-values and testing multiple hypotheses using e-values. The picture that arises for these two fields is remarkably different from its counterpart for p-values.
We start the main part of the paper by defining the notion of e-values and showing that the problem of merging e-values is more or less trivial: natural merging functions are essentially dominated by the arithmetic mean (Section 2). In Section 3 we assume, additionally, that the e-variables being merged are independent, and show that the domination structure is much richer. In Section 4 we apply these results to multiple hypothesis testing. Section 5 reviews known results about relations between individual e-values and individual p-values; we will discuss how the former can be turned into the latter and vice versa (with very different domination structures for the two directions). In the next section, Section 6, we review known results about merging p-values and draw parallels with merging e-values; in the last subsection we assume that the p-values are independent. Section 7 discusses “cross-merging”: merging p-values into one e-value and merging e-values into one p-value. Section 8 is devoted to experimental results, and Section 9 concludes the main part of the paper. Appendix A contains several results that are less closely connected with the main messages of this paper.
2 Merging e-values
For a probability space (Ω, 𝒜, Q), an e-variable is a nonnegative extended random variable E : Ω → [0, ∞] satisfying E_Q[E] ≤ 1 (we refer to it as “extended” since its values are allowed to be ∞). The values taken by e-variables will be referred to as e-values, and we denote the set of e-variables by ℰ_Q. It is important to allow E to take the value ∞; in the context of testing Q, observing E = ∞ for an a priori chosen e-variable E means that we are entitled to reject Q as null hypothesis.
Let K be a positive integer (fixed throughout the paper). An e-merging function of K e-values is an increasing Borel function F : [0, ∞]^K → [0, ∞] such that, for any probability space (Ω, 𝒜, Q) and e-variables E_1, …, E_K on it,

F(E_1, …, E_K) ∈ ℰ_Q (1)

(in other words, F transforms e-values into an e-value). In this paper we will also refer to increasing Borel functions F : [0, ∞)^K → [0, ∞) satisfying (1) for all probability spaces and all e-variables taking values in [0, ∞) as e-merging functions; such functions are canonically extended to e-merging functions by setting them to ∞ on [0, ∞]^K \ [0, ∞)^K (see Proposition A.1 in Appendix A).
It suffices to require that (1) hold for a fixed atomless probability space (Ω, 𝒜, Q), as we explain in Appendix A (Proposition A.4). We will fix such a probability space for the rest of the paper (apart from Section 4 and Appendix A itself) and will let E[X] stand for E_Q[X] for any random variable X.
An e-merging function F dominates an e-merging function G if F ≥ G (i.e., F(e) ≥ G(e) for all e ∈ [0, ∞]^K). The domination is strict (and we say that F strictly dominates G) if F ≥ G and F(e) > G(e) for some e. We say that an e-merging function is admissible if it is not strictly dominated by any e-merging function; in other words, admissibility means being maximal in the partial order of domination. Finally, we say that an e-merging function is inadmissible if it is not admissible.

The notion of admissibility is much stronger than the notion of being “precise” that we used in Vovk and Wang (2019a). In the context of e-merging functions, an e-merging function F is precise if cF is not an e-merging function for any c > 1.
Merging e-values via averaging
In this paper we are only interested in symmetric merging functions (i.e., those invariant w.r. to permutations of their arguments). The main message of this section is that the most useful (and the only useful, in a natural sense) symmetric e-merging function is the arithmetic mean

M_K(e_1, …, e_K) := (e_1 + ⋯ + e_K)/K. (2)
It will follow immediately from Proposition 3.1 that M_K is admissible. But first we state formally the claim that M_K is the only useful symmetric e-merging function.

An e-merging function F essentially dominates an e-merging function G if, for all e ∈ [0, ∞]^K,

G(e) > 1 ⟹ F(e) ≥ G(e).

This weakens the notion of domination in a natural way: now we require that F is not worse than G only in cases where G is not useless; we are not trying to compare degrees of uselessness. The following proposition can be interpreted as saying that M_K is at least as good as any other symmetric e-merging function.
Proposition 2.1.
The arithmetic mean M_K essentially dominates any symmetric e-merging function.

In particular, if F is an e-merging function that is symmetric and positively homogeneous (i.e., F(λe) = λF(e) for all λ ∈ (0, ∞) and all e), then F is dominated by M_K. This includes the e-merging functions discussed later in Section 6.
Proof of Proposition 2.1.
Let G be a symmetric e-merging function. First let us check that, for all e ∈ [0, ∞)^K,

G(e) > 1 ⟹ M_K(e) ≥ G(e). (3)

Suppose for the purpose of contradiction that there exists e = (e_1, …, e_K) such that

G(e) > M_K(e) and G(e) > 1. (4)

Let Π be the set of all permutations of {1, …, K}, π be randomly and uniformly drawn from Π, and E := (e_{π(1)}, …, e_{π(K)}).
It is clear that the arithmetic mean M_K does not dominate every symmetric e-merging function; for example, the convex mixtures

λ + (1 − λ) M_K, λ ∈ (0, 1), (5)

of the trivial e-merging function 1 and M_K are pairwise non-comparable (with respect to the relation of domination).
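As a concrete illustration, merging by the arithmetic mean (2) can be sketched in a few lines (a toy simulation; the function name and the distribution of the simulated e-variables are ours, not the paper's):

```python
import random

def merge_evalues_mean(evalues):
    """Arithmetic-mean e-merging function M_K, as in (2)."""
    return sum(evalues) / len(evalues)

# Sanity check: if each E_k has expectation <= 1 under the null,
# so does the merged value.  Simulate e-variables taking values
# in {0, 2} with probability 1/2 each, so E[E_k] = 1.
random.seed(0)
trials = 100_000
total = 0.0
for _ in range(trials):
    e = [random.choice([0.0, 2.0]) for _ in range(5)]
    total += merge_evalues_mean(e)
print(total / trials)  # should be close to 1
```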
3 Merging independent e-values
In this section we consider merging functions for independent e-values; remember that in Section 2 we fixed an atomless probability space (Ω, 𝒜, Q). An ie-merging function of K e-values is an increasing Borel function F : [0, ∞]^K → [0, ∞] such that F(E_1, …, E_K) ∈ ℰ_Q for all independent E_1, …, E_K ∈ ℰ_Q. As for e-merging functions, this definition is equivalent to the definition involving the universal quantifier over all probability spaces (see Proposition A.6).
The definitions of domination, strict domination, admissibility, and inadmissibility are obtained from the definitions of the previous section by replacing “e-merging” with “ie-merging”.
Let 𝒟 be the set of (componentwise) independent random vectors (E_1, …, E_K) of e-variables, and 1 := (1, …, 1) be the all-1 vector in ℝ^K. The following proposition has already been used (in particular, it implies that the arithmetic mean M_K is an admissible e-merging function).

Proposition 3.1.

For an increasing Borel function F, if E[F(E)] = 1 for all random vectors E = (E_1, …, E_K) of e-variables with E[E] = 1 (resp., for all E ∈ 𝒟 with E[E] = 1), then F is an admissible e-merging function (resp., an admissible ie-merging function).
Proof.
It is obvious that F is an e-merging function (resp., an ie-merging function). Next we show that F is admissible. Suppose for the purpose of contradiction that there exists an ie-merging function G such that G ≥ F and G(e) > F(e) for some e. Take an independent random vector E = (E_1, …, E_K) with E[E] = 1 such that P(E = e) > 0. Such a random vector is easy to construct by considering, for each k, any distribution with mean 1 and a positive mass on e_k. Then we have

E[G(E)] ≥ E[F(E)] + (G(e) − F(e)) P(E = e),

which implies

E[G(E)] > 1,

contradicting the assumption that G is an ie-merging function. Therefore, no ie-merging function strictly dominates F. Noting that an e-merging function is also an ie-merging function, admissibility of F is guaranteed under both settings. ∎
If E_1, …, E_K are independent e-variables, their product E_1 ⋯ E_K will also be an e-variable. This is the analogue of Fisher’s [1932] method for p-values (according to the rough relation mentioned in Section 1 and discussed further in Section 5; Fisher’s method is discussed at the end of Section 6). The ie-merging function

F(e_1, …, e_K) := e_1 ⋯ e_K (6)

is admissible by Proposition 3.1. It will be referred to as the product (or multiplication) ie-merging function.
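In code, the product (6) is immediate (a minimal sketch; the function name is ours, and validity requires independence):

```python
import math

def merge_evalues_product(evalues):
    """Product ie-merging function (6); valid only for independent e-values."""
    return math.prod(evalues)

# Two independent studies each reporting modest evidence (e = 5)
# combine into much stronger evidence:
print(merge_evalues_product([5.0, 5.0]))  # 25.0
```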
More generally, we can see that the U-statistics

U_k(e_1, …, e_K) := (1/C(K, k)) Σ_{1 ≤ i_1 < ⋯ < i_k ≤ K} e_{i_1} ⋯ e_{i_k}, k ∈ {0, 1, …, K}, (7)

where C(K, k) is the binomial coefficient, and their convex mixtures are ie-merging functions. Notice that this class includes the product (for k = K), the arithmetic average (for k = 1), and the constant 1 (for k = 0). Proposition 3.1 implies that the U-statistics (7) and their convex mixtures are admissible ie-merging functions.
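The U-statistics (7) can be computed by brute force over all k-subsets; this sketch (with our own naming) also exhibits the special cases k = 1 and k = K mentioned above:

```python
import math
from itertools import combinations

def u_statistic(evalues, k):
    """U-statistic (7): average of products over all k-element subsets."""
    K = len(evalues)
    if k == 0:
        return 1.0
    total = sum(math.prod(evalues[i] for i in s)
                for s in combinations(range(K), k))
    return total / math.comb(K, k)

e = [2.0, 0.5, 1.5]
print(u_statistic(e, 1))  # arithmetic mean: (2 + 0.5 + 1.5)/3
print(u_statistic(e, 3))  # product: 2 * 0.5 * 1.5 = 1.5
```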
Let us now establish a very weak counterpart of Proposition 2.1 for independent e-values. An ie-merging function F weakly dominates an ie-merging function G if, for all e ∈ [1, ∞)^K,

F(e) ≥ G(e).

In other words, we require that F is not worse than G if all input e-values are useful (and this requirement is weak because, especially for a large K, we are also interested in the case where some of the input e-values are useless).
Proposition 3.2.
The product (6) weakly dominates any symmetric ie-merging function.
Proof.
Indeed, suppose that there exists e = (e_1, …, e_K) ∈ [1, ∞)^K such that G(e) > e_1 ⋯ e_K for an ie-merging function G. Let E_1, …, E_K be independent random variables such that each E_k, for k ∈ {1, …, K}, takes values in the two-element set {0, e_k} and E_k = e_k with probability 1/e_k. Then each E_k is an e-variable but

E[G(E_1, …, E_K)] ≥ G(e) / (e_1 ⋯ e_K) > 1,

which contradicts G being an ie-merging function. ∎
Testing with martingales
The assumption of the independence of the e-variables E_1, …, E_K is not necessary for the product E_1 ⋯ E_K to be an e-variable. It suffices to assume that E[E_k | E_1, …, E_{k−1}] ≤ 1 a.s. for all k. In this situation the sequence of the partial products E_1 ⋯ E_k, k = 0, 1, …, K, becomes a supermartingale (or a test supermartingale, in the terminology of Shafer et al. 2011 and Grünwald et al. 2019, meaning a nonnegative supermartingale with initial value 1). A possible interpretation of this test supermartingale is that the e-values e_1, …, e_K are obtained by K laboratories in this order, and laboratory k makes sure that its result e_k is a valid e-value given the previous results e_1, …, e_{k−1}.
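The sequential picture can be sketched as follows: each laboratory multiplies the running product by its own e-value, and the product stays a valid e-value at every step (a toy simulation; the distribution used is ours and purely illustrative):

```python
import random

random.seed(1)

def next_evalue(rng):
    """A toy e-variable: values in {0.5, 1.5} with equal probability,
    so its (conditional) expectation is 1."""
    return rng.choice([0.5, 1.5])

# Partial products E_1 * ... * E_k form a test supermartingale:
# nonnegative, initial value 1, expectation <= 1 at every step.
trials, steps = 50_000, 5
avg_final = 0.0
for _ in range(trials):
    m = 1.0
    for _ in range(steps):
        m *= next_evalue(random)
    avg_final += m
print(avg_final / trials)  # should be close to 1
```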
4 Application to testing multiple hypotheses
As in Vovk and Wang (2019a), we will apply results for multiple testing of a single hypothesis (combining e-values in the context of Sections 2 and 3) to testing multiple hypotheses, spelling out the corresponding closed testing procedures (Marcus et al., 1976).
We are given a set of composite null hypotheses H_1, …, H_K and, for each k, an e-variable E_k w.r. to H_k: E_Q[E_k] ≤ 1 for any Q ∈ H_k. The closure for multiple testing of our e-merging procedure is given as Algorithm 1. The procedure adjusts the e-values e_1, …, e_K obtained in the K experiments (not necessarily independent) to new e-values e*_1, …, e*_K. Applying the procedure to the e-values produced by the e-variables E_1, …, E_K, we obtain extended random variables E*_1, …, E*_K taking values e*_1, …, e*_K. First we define our desired property of validity for the procedure, which we will refer to as familywise validity (FWV), in analogy with the standard familywise error rate (FWER).
Formally, we are given subsets H_1, …, H_K of the family of all probability measures on (Ω, 𝒜), and for each k, we are given an e-variable E_k for testing H_k, as described earlier; suppose our procedure (such as the one given by Algorithm 1) produces extended random variables E*_1, …, E*_K taking values in [0, ∞]. A conditional e-variable is a family of extended nonnegative random variables E^Q, indexed by the probability measures Q, that satisfies E_Q[E^Q] ≤ 1 for all Q (i.e., each E^Q is in ℰ_Q). The procedure is familywise valid (FWV) for the given H_1, …, H_K if there exists a conditional e-variable (E^Q) such that, for each Q and each k with Q ∈ H_k, E*_k ≤ E^Q (where ≤ means, as usual, that the inequality holds pointwise). We can say that such (E^Q) witnesses the FWV property.
Let us check that Algorithm 1 is familywise valid. For R ⊆ {1, …, K}, the composite hypothesis H_R is defined by

H_R := ⋂_{k ∈ R} H_k ∩ ⋂_{k ∉ R} H_k^c, (8)

where H^c is the complement of H. The conditional e-variable witnessing that Algorithm 1 is familywise valid is

E^Q := (1/|R_Q|) Σ_{k ∈ R_Q} E_k, (9)

where R_Q := {k ∈ {1, …, K} : Q ∈ H_k}. The optimal adjusted e-variables can be defined as

(10)

but for computational efficiency we use the conservative definition

E*_k := min_{R : k ∈ R} (1/|R|) Σ_{j ∈ R} E_j. (11)
Remark 4.1.
The inequality “≥” in (10) holds as the equality “=” if all the intersections (8) are nonempty. If some of these intersections are empty, we can have a strict inequality. Algorithm 1 implements the definition (11). Therefore, it is valid regardless of whether some of the intersections (8) are empty; however, if they are, it may be possible to improve the adjusted e-values. In Holm’s [1979] terminology, we allow “free combinations”. Shaffer (1986) pioneered methods that take account of the logical relations between the base hypotheses H_1, …, H_K.
To obtain Algorithm 1, we rewrite the definition (11) in terms of order statistics, for k ∈ {1, …, K}, where π is the ordering permutation and e_(i) is the ith order statistic among e_1, …, e_K, as in Algorithm 1. In lines 3–5 of Algorithm 1 we precompute the partial sums of the smallest order statistics; in lines 8–9 we compute the corresponding minima; and as a result of executing lines 6–11 we will have the adjusted e-values e*_1, …, e*_K. The computational complexity of Algorithm 1 is O(K log K).
In the case of independent e-variables, we have Algorithm 2. This algorithm assumes that the base e-variables E_k, k ∈ R, are independent under any Q ∈ H_R for any R ⊆ {1, …, K}. The conditional e-variable witnessing that Algorithm 2 is familywise valid is the one given by the product ie-merging function,

E^Q := ∏_{k ∈ R_Q} E_k,

where R_Q is as in (9), and the adjusted e-variables are defined by

E*_k := min_{R : k ∈ R} ∏_{j ∈ R} E_j.

A remark similar to Remark 4.1 can also be made about Algorithm 2. The computational complexity of Algorithm 2 is O(K) (notice that the algorithm does not really require sorting the base e-values).
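For small K, the closed testing adjustment can be sketched by brute force: the adjusted e-value for the kth hypothesis is the minimum of the merged e-value over all index sets containing k. This is our own exponential-time illustration of the conservative definition; the efficient Algorithms 1 and 2 avoid enumerating subsets:

```python
import math
from itertools import combinations

def adjusted_evalues(evalues, merge):
    """Brute-force closed testing: e*_k = min over sets R containing k
    of merge([e_j for j in R])."""
    K = len(evalues)
    adjusted = []
    for k in range(K):
        best = math.inf
        others = [j for j in range(K) if j != k]
        for r in range(K):  # number of extra indices besides k
            for extra in combinations(others, r):
                subset = [evalues[k]] + [evalues[j] for j in extra]
                best = min(best, merge(subset))
        adjusted.append(best)
    return adjusted

mean = lambda es: sum(es) / len(es)  # arbitrary dependence (Algorithm 1 setting)
product = math.prod                  # independent e-values (Algorithm 2 setting)

e = [8.0, 4.0, 0.5]
print(adjusted_evalues(e, mean))
print(adjusted_evalues(e, product))
```

Note how the useless third e-value (0.5) drags down the adjusted e-values of the other two hypotheses under averaging, and does so multiplicatively under the product.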
5 Calibrating p-values and e-values
Similarly to the case of e-values, without loss of generality we fix an atomless probability space (Ω, 𝒜, Q) for all discussions of p-values (cf. Vovk and Wang 2019a, Section 2). A p-variable is a random variable P satisfying

Q(P ≤ ε) ≤ ε for all ε ∈ (0, 1).

The set of all p-variables is denoted by 𝒫_Q.
A calibrator is a function transforming p-values to e-values. Formally, a decreasing function f : [0, 1] → [0, ∞] is a calibrator (or, more fully, a p-to-e calibrator) if, for any p-variable P, f(P) is an e-variable. A calibrator f is said to dominate a calibrator g if f ≥ g, and the domination is strict if f ≠ g. A calibrator is admissible if it is not strictly dominated by any other calibrator.
The following proposition says that a calibrator is a nonnegative decreasing function integrating to at most 1 over the uniform probability measure.
Proposition 5.1.
A decreasing function f : [0, 1] → [0, ∞] is a calibrator if and only if ∫_0^1 f(p) dp ≤ 1. It is admissible if and only if f is upper semicontinuous, f(0) = ∞, and ∫_0^1 f(p) dp = 1.

Of course, in the context of this proposition, being upper semicontinuous is equivalent to being left-continuous.
Proof.
Proofs of similar statements are given in, e.g., Vovk (1993, Theorem 7), Shafer et al. (2011, Theorem 3), and Shafer and Vovk (2019, Proposition 11.7), but we will give an independent short proof using our definitions. Suppose that ∫_0^1 f(p) dp ≤ 1 and P is a p-variable, and let us show that E[f(P)] ≤ 1. We can assume, without loss of generality, that the distribution of P is uniform on [0, 1], replacing, if needed, P with the randomized version P′ defined by

(12)

where U is a random variable that is independent of P and uniformly distributed on [0, 1] (for the existence of such U, at least in the case where P is replaced by another random variable with the same distribution, see Lemma A.2; for the distribution of (12) being uniform, see, e.g., Ferguson 1967, Lemma 5.3.1). Now it remains to notice that E[f(P′)] = ∫_0^1 f(p) dp ≤ 1.

The second statement of Proposition 5.1 is obvious. ∎
The following is a simple family of calibrators. Since ∫_0^1 κ p^{κ−1} dp = 1, the functions

f_κ(p) := κ p^{κ−1} (13)

are calibrators, where κ ∈ (0, 1). To solve the problem of choosing the parameter κ, sometimes the maximum

VS(p) := max_{κ ∈ (0, 1]} κ p^{κ−1}

is used; we will refer to it as the VS bound (abbreviating “Vovk–Sellke bound”, as used in, e.g., the JASP package). It is important to remember that VS(p) is not a valid e-value, but just an overoptimistic upper bound on what is achievable with the class (13).
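A small numerical sketch of the family (13) and the VS bound (the closed form −1/(e p ln p), valid for p < 1/e, follows from maximizing κ p^{κ−1} over κ; the function names are ours):

```python
import math

def calibrate(p, kappa):
    """p-to-e calibrator (13): f_kappa(p) = kappa * p**(kappa - 1)."""
    return kappa * p ** (kappa - 1)

def vs_bound(p):
    """Vovk-Sellke bound: max over kappa in (0, 1] of calibrate(p, kappa).
    The maximum equals -1/(e p ln p) for p < 1/e, and 1 otherwise."""
    if p >= 1 / math.e:
        return 1.0
    return -1.0 / (math.e * p * math.log(p))

p = 0.05
print(calibrate(p, 0.5))  # a valid e-value, ≈ 2.24
print(vs_bound(p))        # ≈ 2.46; an upper bound, not a valid e-value
```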
In the opposite direction, an e-to-p calibrator is a function transforming e-values to p-values. Formally, a decreasing function f : [0, ∞] → [0, 1] is an e-to-p calibrator if, for any e-variable E, f(E) is a p-variable. The following proposition, which is the analogue of Proposition 5.1 for e-to-p calibrators, says that there is, essentially, only one e-to-p calibrator, t ↦ min(1, 1/t).
Proposition 5.2.
The function f defined by f(t) := min(1, 1/t) is an e-to-p calibrator. It dominates every other e-to-p calibrator. In particular, it is the only admissible e-to-p calibrator.
Proof.
The fact that f(t) = min(1, 1/t) is an e-to-p calibrator follows from Markov’s inequality: if E is an e-variable and ε ∈ (0, 1),

Q(f(E) ≤ ε) ≤ Q(E ≥ 1/ε) ≤ ε E[E] ≤ ε.

On the other hand, suppose that g is another e-to-p calibrator. It suffices to check that g is dominated by f. Suppose g(t) < f(t) for some t. Consider two cases:

If g(t) < 1/t for some t ≥ 1, fix such t and consider an e-variable E that is t with probability 1/t and 0 otherwise. Then g(E) is at most g(t) with probability at least 1/t > g(t), whereas it would have satisfied Q(g(E) ≤ g(t)) ≤ g(t) had it been a p-variable.

If g(t) < 1 for some t < 1, fix such t and consider an e-variable E that is t a.s. Then g(E) is g(t) < 1 a.s., and so it is not a p-variable. ∎
Proposition 5.1 implies that the domination structure of p-to-e calibrators is very rich, whereas Proposition 5.2 implies that the domination structure of e-to-p calibrators is trivial.
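The two directions of calibration can be contrasted in a few lines (a sketch with our own naming): e-to-p calibration is essentially unique, while p-to-e calibration is a whole family, and a round trip loses a lot:

```python
def e_to_p(e):
    """The only admissible e-to-p calibrator: min(1, 1/e)."""
    return min(1.0, 1.0 / e) if e > 0 else 1.0

def p_to_e(p, kappa=0.5):
    """One member of the rich family (13) of p-to-e calibrators."""
    return kappa * p ** (kappa - 1)

print(e_to_p(20.0))          # 0.05
print(p_to_e(0.05))          # ≈ 2.24: calibration is lossy
print(e_to_p(p_to_e(0.05)))  # ≈ 0.45: a round trip is very lossy
```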
Remark 5.3.
A possible interpretation of this section’s results is that e-variables and p-variables are connected via the rough relation E ≈ 1/P, as already discussed in Section 1. In one direction, the statement is precise: the reciprocal of an e-variable (capped at 1) is a p-variable by Proposition 5.2. On the other hand, using a calibrator (13) with a small κ and ignoring positive constant factors (as customary in the algorithmic theory of randomness), we can see that the reciprocal of a p-variable is approximately an e-variable.
6 Merging p-values
Merging p-values is a much more difficult topic than merging e-values, but it is very well explored. First we review merging p-values without any assumptions, and then we move on to merging independent p-values.

A p-merging function of K p-values is an increasing Borel function F : [0, 1]^K → [0, 1] such that F(P_1, …, P_K) is a p-variable whenever P_1, …, P_K are p-variables.
For merging p-values without the assumption of independence, we will concentrate on two natural families of p-merging functions. The older family is the one introduced by Rüger (1978), and the newer one was introduced in our paper Vovk and Wang (2019a). Rüger’s family is parameterized by k ∈ {1, …, K}, and its kth element is the function (shown by Rüger 1978 to be a p-merging function)

(p_1, …, p_K) ↦ min(1, (K/k) p_{(k)}), (14)

where p_{(k)} := p_{π(k)} and π is a permutation of {1, …, K} ordering the p-values in the ascending order: p_{π(1)} ≤ ⋯ ≤ p_{π(K)}. The other family (Vovk and Wang, 2019a), which we will refer to as the M_r family, is parameterized by r ∈ [−∞, ∞], and its element with index r has the form a_{r,K} M_{r,K}, where

M_{r,K}(p_1, …, p_K) := ((p_1^r + ⋯ + p_K^r)/K)^{1/r} (15)

and a_{r,K} is a suitable constant. We also define M_{r,K} for r ∈ {0, ∞, −∞} as the limiting cases of (15), which correspond to the geometric average, the maximum, and the minimum, respectively.
The initial and final elements of both families coincide: the initial element is the Bonferroni p-merging function

(p_1, …, p_K) ↦ min(1, K min(p_1, …, p_K)), (16)

and the final element is the maximum p-merging function

(p_1, …, p_K) ↦ max(p_1, …, p_K).
Similarly to the case of e-merging functions, we say that a p-merging function F dominates a p-merging function G if F ≤ G. The domination is strict if, in addition, F(p) < G(p) for at least one p. We say that a p-merging function is admissible if it is not strictly dominated by any p-merging function.

The domination structure of p-merging functions is much richer than that of e-merging functions. The maximum p-merging function is clearly inadmissible, while the Bonferroni p-merging function is admissible, as the following proposition shows.
Proposition 6.1.
The Bonferroni p-merging function (16) is admissible.
Proof.
Denote by F the Bonferroni p-merging function (16). Suppose the statement of the proposition is false and fix a p-merging function G that strictly dominates F. If G(p) = F(p) whenever F(p) < 1, then also G(p) = F(p) when F(p) = 1, since G is increasing. Hence for some point p,

G(p) < F(p) < 1.

Fix such p; we know that G(p) < K min(p_1, …, p_K). We can take disjoint events A_1, …, A_K of suitable probabilities (their existence is guaranteed by the atomlessness of the probability space) and define random variables P_1, …, P_K supported on them. It is straightforward to check that each P_k is a p-variable and that

Q(G(P_1, …, P_K) ≤ G(p)) > G(p).

Therefore, G is not a p-merging function, which gives us the desired contradiction. ∎
The general domination structure of p-merging functions appears to be very complicated and is the subject of planned future work.
E-merging functions and the two families
The domination structure of the class of e-merging functions is very simple, as suggested by Proposition 2.1. This makes it very easy to understand what the e-merging analogues of Rüger’s family and the M_r family are; when stating the analogues we will use the rough relation E ≈ 1/P between e-values and p-values (see Remark 5.3).
For a sequence e_1, …, e_K, let e_(1) ≥ ⋯ ≥ e_(K) be the order statistics numbered from the largest to the smallest; here e_(i) := e_{π(i)}, where π is a permutation of {1, …, K} ordering e_1, …, e_K in the descending order: e_{π(1)} ≥ ⋯ ≥ e_{π(K)}. Let us check that the Rüger-type function (k/K) e_(k) is a precise e-merging function. It is an e-merging function since it is dominated by the arithmetic mean: indeed, the condition of domination

(k/K) e_(k) ≤ (e_1 + ⋯ + e_K)/K (17)

can be rewritten as

k e_(k) ≤ e_1 + ⋯ + e_K,

and so is obvious. As sometimes we have a strict inequality, the e-merging function (k/K) e_(k) is inadmissible (remember that we assume K ≥ 2). The e-merging function (k/K) e_(k) is precise (by Proposition 2.1) because (17) holds as an equality when the k largest values e_(1), …, e_(k) are all equal and greater than 1 and all the other values are 0.
In the case of the M_r family, let us check that the function

M_{r,K}(e_1, …, e_K) := ((e_1^r + ⋯ + e_K^r)/K)^{1/r} (18)

is a precise e-merging function, for any r ≤ 1. For r ∈ (0, 1], M_{r,K} is increasing in r (Hardy et al., 1952, Theorem 16), and so M_{r,K} is dominated by the arithmetic mean M_K = M_{1,K}; hence it is an e-merging function. For r < 0 we can rewrite the function in terms of the reciprocals of its arguments, and we know that the resulting expression is a decreasing function (Hardy et al., 1952, Theorem 19); therefore, M_{r,K} is also dominated by M_K and so is an e-merging function. The e-merging function M_{r,K} is precise (for any r ≤ 1) since M_{r,K}(e, …, e) = e, and so by Proposition 2.1 (applied to a sufficiently large e) cM_{r,K} is not an e-merging function for any c > 1. But M_{r,K} is admissible if and only if r = 1.
Remark 6.2.
The rough relation E ≈ 1/P also sheds light on the coefficient given in (18) in front of (e_1^r + ⋯ + e_K^r)^{1/r}. A coefficient in front of (e_1^r + ⋯ + e_K^r)^{1/r} for averaging e-values corresponds, via e_k = 1/p_k, to a coefficient in front of (p_1^{−r} + ⋯ + p_K^{−r})^{−1/r} for averaging p-values. And indeed, by Proposition 5 of Vovk and Wang (2019a), the asymptotically precise coefficient for averaging p-values agrees with the one obtained in this way up to a constant factor. The extra factor appears because the reciprocal of a p-variable is only approximately, but not exactly, an e-variable.
Remark 6.3.
Our formulas for merging e-values are explicit and much simpler than the formulas for merging p-values given in Vovk and Wang (2019a). Merging e-values does not involve asymptotic approximations via the theory of robust risk aggregation, as used in that paper. This suggests that in some important respects e-values are easier objects to deal with than p-values.
Merging independent p-values

In this subsection we discuss ways of combining p-values under the assumption that the p-values are independent.
One of the oldest and most popular methods for combining p-values is Fisher’s [1932, Section 21.1], which we already mentioned in Section 3. Fisher’s method is based on the product statistic p_1 ⋯ p_K (with its low values significant) and uses the fact that −2 ln(p_1 ⋯ p_K) has the χ² distribution with 2K degrees of freedom when all p_k are independent and distributed uniformly on the interval [0, 1].
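Fisher’s method can be implemented with the standard library alone, since the χ² survival function with an even number 2K of degrees of freedom has the closed form exp(−x/2) Σ_{j<K} (x/2)^j / j! (a sketch; the function names are ours):

```python
import math

def chi2_sf_even(x, df):
    """Survival function of the chi-squared distribution, df even."""
    k = df // 2
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, k):
        term *= half / j       # accumulate (x/2)^j / j!
        total += term
    return math.exp(-half) * total

def fisher_combine(pvalues):
    """Fisher's method: combined p-value from independent p-values."""
    stat = -2.0 * sum(math.log(p) for p in pvalues)
    return chi2_sf_even(stat, 2 * len(pvalues))

print(fisher_combine([0.1, 0.2, 0.05]))  # ≈ 0.0318
```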
Simes (1986) proved a remarkable result for Rüger’s family (14) under the assumption that the p-values are independent: the minimum

min_{k ∈ {1, …, K}} (K/k) p_{(k)} (19)

of Rüger’s family over all k turns out to be a p-merging function.
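Simes’s function (19) in code (a sketch with our own naming; valid under independence, unlike the individual Rüger functions (14), which need no independence):

```python
def simes(pvalues):
    """Simes's p-merging function (19): min over k of (K/k) * p_(k)."""
    K = len(pvalues)
    sorted_p = sorted(pvalues)
    return min(1.0, min(K / k * p for k, p in enumerate(sorted_p, start=1)))

print(simes([0.01, 0.04, 0.03, 0.5]))  # ≈ 0.04, attained at k = 1
```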
The counterpart of Simes’s result still holds for emerging functions; moreover, now the pvalues do not have to be independent. Namely,
is an emerging function. This follows immediately from (17), the lefthand side of which can be replaced by its maximum over . And it also follows from (17) that there is no sense in using this counterpart; it is better to use arithmetic mean.
7 Cross-merging between e-values and p-values
In this section we briefly discuss functions performing “cross-merging”: either merging several e-values into a p-value or several p-values into an e-value. Formally, an e-to-p merging function is a decreasing Borel function F : [0, ∞]^K → [0, 1] such that F(E_1, …, E_K) is a p-variable whenever E_1, …, E_K are e-variables, and a p-to-e merging function is a decreasing Borel function F : [0, 1]^K → [0, ∞] such that F(P_1, …, P_K) is an e-variable whenever P_1, …, P_K are p-variables. The message of this section is that cross-merging can be performed as a composition of pure merging (applying an e-merging function or a p-merging function) and calibration (either e-to-p calibration or p-to-e calibration); however, in some important cases (we feel in the vast majority of cases) pure merging is more efficient, and should be done, in the domain of e-values.
Let us start from e-to-p merging. Given e-values e_1, …, e_K, we can merge them into one e-value by applying the arithmetic mean, the only essentially admissible symmetric e-merging function (Proposition 2.1), and then obtain a p-value by applying the inversion t ↦ min(1, 1/t), the only admissible e-to-p calibrator (Proposition 5.2). This gives us the e-to-p merging function

F(e_1, …, e_K) := min(1, K/(e_1 + ⋯ + e_K)). (20)
The following proposition shows that in this way we obtain the optimal symmetric e-to-p merging function.
Proposition 7.1.
The e-to-p merging function (20) dominates all symmetric e-to-p merging functions.
Proof.
Suppose that a symmetric e-to-p merging function G satisfies G(e) < F(e) for some e = (e_1, …, e_K), where F is the function (20). The following argument is similar to the proof of Proposition 2.1. As before, Π is the set of all permutations of {1, …, K}, π is randomly and uniformly drawn from Π, and E := (e_{π(1)}, …, e_{π(K)}). Further, let E′ := E 1_A, where A is an event independent of π and satisfying Q(A) = F(e). For each k, we have E[E′_k] ≤ 1, and hence each E′_k is an e-variable. By the symmetry of G, we have G(E′) ≤ G(e) on A, and hence

Q(G(E′_1, …, E′_K) ≤ G(e)) ≥ Q(A) = F(e) > G(e).

This contradicts G being an e-to-p merging function. ∎
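The optimal symmetric e-to-p merging function (20) in code (a sketch; the naming is ours):

```python
def e_to_p_merge(evalues):
    """E-to-p merging function (20): average the e-values, then invert."""
    mean = sum(evalues) / len(evalues)
    return min(1.0, 1.0 / mean) if mean > 0 else 1.0

# Three studies: two informative, one useless.
print(e_to_p_merge([10.0, 40.0, 0.0]))  # mean ≈ 16.67, p-value ≈ 0.06
```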
It is interesting that (20) can also be obtained by composing e-to-p calibration and improper pure p-merging. Given e-values e_1, …, e_K, we first transform them into p-values 1/e_1, …, 1/e_K (in this paragraph we allow p-values greater than 1, as in Vovk and Wang 2019a). Wilson (2019) proposed the harmonic mean as a p-merging function. The composition of these two transformations again gives us the e-to-p merging function (20). The problem with this argument is that, as Goeman et al. (2019, Wilson’s second claim) point out (with a reference to Vovk and Wang 2019a), Wilson’s method is in general not valid (we obtain a valid method if we multiply the harmonic mean by a suitable constant, according to Vovk and Wang 2019a). Despite the illegitimate application of the harmonic mean, the resulting function (20) is still a valid e-to-p merging function. At least in this context, we can see that e-to-p merging should be done by first doing pure merging and then e-to-p calibration, not vice versa.

Now suppose we are given p-values p_1, …, p_K, and we would like to merge them into one e-value. Let κ ∈ (0, 1). Applying the calibrator (13), we obtain e-values κ p_1^{κ−1}, …, κ p_K^{κ−1}, and since the average of e-values is an e-value,

F(p_1, …, p_K) := (κ/K) (p_1^{κ−1} + ⋯ + p_K^{κ−1}) (21)

is a p-to-e merging function.
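The p-to-e merging function (21), composed of calibration followed by averaging, can be sketched as (naming ours):

```python
def p_to_e_merge(pvalues, kappa=0.5):
    """P-to-e merging function (21): calibrate each p-value via (13),
    then take the arithmetic mean of the resulting e-values."""
    return sum(kappa * p ** (kappa - 1) for p in pvalues) / len(pvalues)

print(p_to_e_merge([0.01, 0.04, 0.9]))  # ≈ 2.68
```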
The following proposition will imply that all p-to-e merging functions (21) are admissible; moreover, it will show, in conjunction with Proposition 5.1, that for any admissible p-to-e calibrator f, the function

(p_1, …, p_K) ↦ (f(p_1) + ⋯ + f(p_K))/K

is an admissible p-to-e merging function.
Proposition 7.2.
If is an upper semicontinuous and decreasing Borel function, for all