# True and false discoveries with e-values

We discuss controlling the number of false discoveries using e-values instead of p-values. Using e-values simplifies the known algorithms radically.


## 1 Introduction

There are numerous ways of merging p-values, both under the assumption of independence and in general. One of the more exotic ways is merging by averaging (Vovk and Wang, 2019b); however, the arithmetic average has to be scaled up by a factor of 2 (Rüschendorf, 1982; Meng, 1993) to get a valid merging function. The situation with e-values (as defined in Vovk and Wang (2019a)) is radically different: no scaling is required for arithmetic averaging, and, moreover, arithmetic averaging becomes essentially the only symmetric method of merging (Vovk and Wang, 2019a, Proposition 2.1).
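As a small illustration of the contrast just described, the following Python sketch implements the two averaging rules: the plain arithmetic mean for e-values, and twice the arithmetic mean (capped at 1) for p-values. The function names are ours and purely illustrative; the cap at 1 is a harmless convention that we assume here.

```python
def merge_e_values(e_values):
    """Merge e-values by arithmetic averaging (no scaling needed)."""
    return sum(e_values) / len(e_values)


def merge_p_values_by_averaging(p_values):
    """Merge p-values by averaging, scaled up by the factor of 2.

    The result is capped at 1 (an assumption made for convenience,
    since a p-value above 1 carries no extra information).
    """
    return min(1.0, 2.0 * sum(p_values) / len(p_values))


if __name__ == "__main__":
    print(merge_e_values([1.0, 2.0, 3.0]))
    print(merge_p_values_by_averaging([0.01, 0.03]))
```

The two functions differ only in the factor of 2, which is the price paid for averaging p-values under arbitrary dependence.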

In this note we define an analogue for e-values of a known procedure of controlling the numbers of false discoveries (Goeman et al., 2019). Of course, we base it on arithmetic averaging of e-values as merging function. We will freely use the definitions given in Vovk and Wang (2019a).

In Section 2 we describe our procedures, and in Section 3 report results of simple simulation studies.

## 2 Controlling true and false discoveries

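Suppose we are given e-values $e_1,\dots,e_K$ for testing hypotheses $H_1,\dots,H_K$ on a measurable space $\Omega$. Without loss of generality we assume that the e-values are sorted in ascending order, $e_1\le\dots\le e_K$.

Following Goeman and Solari (2011), we first fix a rejection set $R\subseteq\{1,\dots,K\}$ (the set of base hypotheses, represented by their indices, that the researcher chooses to reject). The discovery curve for $R$ is defined as

$$\mathrm{DC}_R(j) := \min_{I\subseteq\{1,\dots,K\}:\ |R\setminus I|<j}\ \frac{1}{|I|}\sum_{i\in I} e_i, \qquad j\in\{1,\dots,|R|\}$$

(notice that $I=\emptyset$ is excluded for any $j$).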

The main mathematical result of this note (the very simple Theorem 2.1 below) says that the discovery curve satisfies the following natural property of validity. Let us say that a function $D=D(j,\omega)$, $j\in\{1,\dots,|R|\}$, controls true discoveries if there exists a conditional e-variable $(E_Q)_{Q\in\mathcal{P}(\Omega)}$ such that

$$\forall Q\in\mathcal{P}(\Omega)\ \forall j\in\{1,\dots,|R|\}:\ \bigl(|\{k\in R\mid Q\notin H_k\}|\ge j\bigr)\ \vee\ \bigl(E_Q(\omega)\ge D(j,\omega)\bigr). \tag{1}$$

We will say that such a conditional e-variable witnesses that $D$ controls true discoveries. Notice that, intuitively, controlling true discoveries and controlling false discoveries are the same thing, since the total number of discoveries $|R|$ is known. Since $D$ is a random function of $j$, we sometimes write $D(j)$ instead of $D(j,\omega)$, suppressing the dependence on $\omega$ (as we do for other random functions, such as $\mathrm{DC}_R$).

The disjunction in (1) is of the kind discussed by Fisher (1973, Section III.1): assuming $D(j)$ is large for some $j$, either there are at least $j$ true discoveries (rejections of false null hypotheses) or a rare chance has occurred (namely, the observed e-value $E_Q$ is at least $D(j)$).

###### Theorem 2.1.

The discovery curve controls true discoveries.

###### Proof.

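The conditional e-variable witnessing that $\mathrm{DC}_R$ controls true discoveries will be the arithmetic mean

$$E_Q := \frac{1}{|I_Q|}\sum_{k\in I_Q} E_k, \tag{2}$$

where $I_Q := \{k\in\{1,\dots,K\}\mid Q\in H_k\}$ is the set of true null hypotheses under $Q$ and $E_k$ are the e-variables generating $e_k$, $k\in\{1,\dots,K\}$. If $I_Q=\emptyset$, every rejection is a true discovery, so the disjunction in (1) is obvious for any $j$; therefore, we can define the e-variable $E_Q$ arbitrarily in this case (e.g., as 0 or 1).

To check the disjunction in (1) for $I_Q\ne\emptyset$ and for given $j$ and $\omega$, let us assume that the second term of the disjunction is false, namely $E_Q(\omega)<\mathrm{DC}_R(j)$. By the definition (2), this means

$$\frac{1}{|I_Q|}\sum_{k\in I_Q} e_k < \mathrm{DC}_R(j),$$

and, by the definition of the discovery curve, we can see that $|R\setminus I_Q|\ge j$. Since $R\setminus I_Q$ is the set of true discoveries, we have at least $j$ true discoveries, which establishes the disjunction.

It is clear that $\mathrm{DC}_R$ is the largest function whose control of true discoveries is witnessed by the arithmetic mean (2) for all measurable spaces $\Omega$, all hypotheses $H_1,\dots,H_K$, and all rejection sets $R$. ∎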

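A polynomial-time algorithm for computing $\mathrm{DC}_R$ is given as Algorithm 1. It uses the notation

$$F_e(I) := \frac{1}{|I|}\sum_{i\in I} e_i, \qquad I\subseteq\{1,\dots,K\},\ I\ne\emptyset,$$

where $e := (e_1,\dots,e_K)$ is the vector of e-values (as we said earlier, arithmetic averaging is essentially the only symmetric e-merging function (Vovk and Wang, 2019a)).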
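The listing of Algorithm 1 is not reproduced here, so the following Python sketch should be read as one possible polynomial-time computation rather than as the authors' algorithm. It assumes that the discovery curve at $j$ is the minimum of the arithmetic mean $F_e(I)$ over nonempty sets $I$ with $|R\setminus I|<j$, which is the reading consistent with the proof of Theorem 2.1. Under that assumption, $I$ must contain at least $m=|R|-j+1$ elements of $R$; an exchange argument shows that it suffices to take the $m$ smallest e-values in $R$ and then add the remaining e-values in ascending order, keeping the best running mean. The function name and the 0-based indexing are ours.

```python
def discovery_curve(e_values, rejection_set):
    """Return [DC_R(1), ..., DC_R(|R|)] for 0-based indices in rejection_set.

    Assumes DC_R(j) = min of mean(e_i, i in I) over nonempty I
    that miss fewer than j elements of R.
    """
    K = len(e_values)
    # Indices of R ordered by ascending e-value.
    R = sorted(rejection_set, key=lambda k: e_values[k])
    curve = []
    for j in range(1, len(R) + 1):
        m = len(R) - j + 1              # I must contain at least m elements of R
        forced = set(R[:m])             # cheapest choice: the m smallest in R
        total = sum(e_values[k] for k in forced)
        rest = sorted(e_values[k] for k in range(K) if k not in forced)
        best = total / m
        # Add the remaining e-values in ascending order, tracking the best mean.
        for t, x in enumerate(rest, start=1):
            total += x
            best = min(best, total / (m + t))
        curve.append(best)
    return curve


if __name__ == "__main__":
    print(discovery_curve([0.5, 2.0, 8.0, 20.0], {2, 3}))
```

In this toy example the curve decreases in $j$: claiming more true discoveries is supported by weaker evidence.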

Next we will discuss a less flexible method in which we consider a family of rejection sets $R_r$, $r\in\{1,\dots,K\}$, that are chosen in an optimal way, in some sense. For each $r$, the set $R_r$ consisting of the indices of the $r$ largest e-values is the optimal rejection set of size $r$, meaning that, for any other set $R$ of size $r$, we have $\mathrm{DC}_R(j)\le\mathrm{DC}_{R_r}(j)$ for all $j$. In the terminology of statistical decision theory (Wald, 1950, Section 1.3), $\{R_1,\dots,R_K\}$ is the minimal complete class of rejection sets.

The output of Algorithm 2 is a matrix $\mathrm{DM}_{r,j}$, where $r\in\{1,\dots,K\}$ and $j\in\{1,\dots,r\}$. We regard $\mathrm{DM}$ as a $K\times K$ matrix whose elements above the main diagonal are undefined and refer to it as the discovery matrix. It is defined as

$$\mathrm{DM}_{r,j} := \mathrm{DC}_{R_r}(j), \qquad r\in\{1,\dots,K\},\ j\in\{1,\dots,r\},$$

where $R_r$ is the optimal rejection set of size $r$. Possible uses of such a matrix will be discussed at the end of the next section.
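The discovery matrix can be computed directly from its definition. The Python sketch below is a straightforward cubic-time illustration, not a reproduction of Algorithm 2; it assumes that the discovery curve at $j$ is the minimum arithmetic mean over nonempty index sets that miss fewer than $j$ elements of the rejection set, and that the optimal rejection set of size $r$ consists of the $r$ largest e-values. The function name is ours, rows and columns are stored 0-based, and undefined entries above the diagonal are represented by `None`.

```python
def discovery_matrix(e_values):
    """Return a K x K list of lists with DM[r-1][j-1] for j <= r, None elsewhere."""
    e = sorted(e_values)                 # ascending order
    K = len(e)
    dm = [[None] * K for _ in range(K)]
    for r in range(1, K + 1):
        top = e[K - r:]                  # the r largest e-values, ascending
        for j in range(1, r + 1):
            m = r - j + 1                # at least m rejected hypotheses must be in I
            total = sum(top[:m])         # cheapest choice: m smallest of the top r
            rest = e[:K - r] + top[m:]   # all other e-values, still ascending
            best = total / m
            for t, x in enumerate(rest, start=1):
                total += x
                best = min(best, total / (m + t))
            dm[r - 1][j - 1] = best
    return dm


if __name__ == "__main__":
    for row in discovery_matrix([8.0, 0.5, 20.0, 2.0]):
        print(row)
```

For a few hundred hypotheses this direct approach runs in a fraction of a second to a few seconds in pure Python; a faster implementation would reuse prefix sums across rows.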

## 3 Experiments

In our simulation studies we will visualize the discovery matrix in two simple cases and compare Algorithm 2 with a method based on p-values. Our setting will be similar to that of Vovk and Wang (2019a, Section 8), where we study family-wise validity.

Figure 1: The discovery matrix for 10 false (with δ=−3) and 10 true null hypotheses, as described in text.

Figure 1 shows the discovery matrix that Algorithm 2 gives in the following situation involving the Gaussian model $N(\mu,1)$. The null hypotheses are $N(0,1)$ and the alternatives are $N(\delta,1)$, where we take $\delta:=-3$. We generate 10 observations from $N(\delta,1)$ (the alternative distribution) and then 10 from $N(0,1)$ (the null distribution).

We colour-code the entries of discovery matrices following Jeffreys’s (1961, Appendix B) rule of thumb:

• If an entry $\mathrm{DM}_{r,j}$ is below 1, the null hypothesis is supported. Such entries are shown in green.

• If $\mathrm{DM}_{r,j}\in(1,\sqrt{10})\approx(1,3.16)$, the evidence against the null hypothesis is not worth more than a bare mention. Such entries are shown in lime (also known as yellow-green).

• If $\mathrm{DM}_{r,j}\in(\sqrt{10},10)\approx(3.16,10)$, the evidence against the null hypothesis is substantial. Such entries are shown in yellow.

• If $\mathrm{DM}_{r,j}\in(10,10^{3/2})\approx(10,31.6)$, the evidence against the null hypothesis is strong. Such entries are shown in red.

• If $\mathrm{DM}_{r,j}\in(10^{3/2},100)\approx(31.6,100)$, the evidence against the null hypothesis is very strong. Such entries are shown in dark red.

• If $\mathrm{DM}_{r,j}>100$, the evidence against the null hypothesis is decisive. Such entries are shown in black.

The full colour map is shown on the right of the figure with the thresholds between different colours given in terms of the decimal logarithm of $\mathrm{DM}_{r,j}$.
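The colour coding above amounts to binning the decimal logarithm of an entry at half-integer thresholds. A minimal Python sketch of such a classifier follows; the function name is hypothetical, and the treatment of values lying exactly on a threshold (assigned to the stronger category here) is our assumption.

```python
import math


def jeffreys_colour(value):
    """Map a discovery-matrix entry to a colour on Jeffreys's scale."""
    if value < 1:
        return "green"        # null hypothesis supported
    log10_value = math.log10(value)
    if log10_value < 0.5:
        return "lime"         # not worth more than a bare mention
    if log10_value < 1.0:
        return "yellow"       # substantial
    if log10_value < 1.5:
        return "red"          # strong
    if log10_value < 2.0:
        return "dark red"     # very strong
    return "black"            # decisive


if __name__ == "__main__":
    for v in (0.5, 2, 5, 20, 50, 1000):
        print(v, jeffreys_colour(v))
```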

The base e-values are the likelihood ratios

$$E(x) := \frac{\exp(-(x-\delta)^2/2)}{\exp(-x^2/2)} = \exp(\delta x-\delta^2/2) \tag{3}$$

of the alternative to the null density, where $x$ is the corresponding observation and $\delta:=-3$.

Figure 2: The discovery matrix for 100 false (with δ=−4) and 100 true null hypotheses, as described in text.

Figure 2 is the counterpart of Figure 1 for a larger number of hypotheses, where now the null hypotheses are $N(0,1)$ and the alternatives are $N(\delta,1)$ for $\delta:=-4$ (since we have more hypotheses to test, we make our task easier). We generate 100 observations from $N(-4,1)$ and then 100 from $N(0,1)$. The base e-values are (3) with $\delta:=-4$.
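The data-generating step of both experiments can be sketched in a few lines of Python. This is an illustration under our own assumptions, not the authors' notebook: the function names, the use of the standard `random` module, and the seed are ours, so the generated numbers will not match the figures.

```python
import math
import random


def base_e_value(x, delta):
    """Likelihood ratio (3) of N(delta, 1) to N(0, 1) at observation x."""
    return math.exp(delta * x - delta ** 2 / 2)


def simulate_e_values(n_false, n_true, delta, seed=0):
    """Draw n_false observations from N(delta, 1), then n_true from N(0, 1),
    and return the corresponding base e-values in that order."""
    rng = random.Random(seed)   # the seed is an arbitrary choice
    xs = [rng.gauss(delta, 1.0) for _ in range(n_false)]
    xs += [rng.gauss(0.0, 1.0) for _ in range(n_true)]
    return [base_e_value(x, delta) for x in xs]


if __name__ == "__main__":
    small = simulate_e_values(10, 10, delta=-3.0)     # setting of Figure 1
    large = simulate_e_values(100, 100, delta=-4.0)   # setting of Figure 2
    print(len(small), len(large))
```

Sorting the resulting e-values and passing them to a discovery-matrix routine would then give matrices of the kind shown in the figures.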

For comparison with methods based on p-values, in this version of this note we use Theorem 1 of Goeman et al. (2019) applied to Simes’s (1986) procedure for combining p-values and to the full rejection set $R:=\{1,\dots,K\}$, as implemented in the R package hommel. As the base p-values we take $\Phi(x)$, where $\Phi$ is the standard Gaussian distribution function and $x$ is the corresponding observation; these are the p-values found using the most powerful test given by the Neyman–Pearson lemma. The output of hommel is “With confidence: at least 86 discoveries”. Our method gives , which is even slightly better than the Vovk–Sellke bound

$$\frac{-\exp(-1)}{0.05\,\ln 0.05}\approx 2.456$$

corresponding to the significance level $0.05$ (despite the bound not being a bona fide e-value).
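The numerical value of the bound is easy to check. The short Python sketch below evaluates the expression displayed above for a general significance level; the function name is ours, and the restriction to levels below $1/e$ is the usual range of validity of the Vovk–Sellke bound.

```python
import math


def vovk_sellke_bound(p):
    """Evaluate -exp(-1) / (p * ln p) for a significance level 0 < p < 1/e."""
    if not 0.0 < p < 1.0 / math.e:
        raise ValueError("the bound is used only for 0 < p < 1/e")
    return -math.exp(-1.0) / (p * math.log(p))


if __name__ == "__main__":
    print(round(vovk_sellke_bound(0.05), 3))
```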

It is interesting that our method does not require the assumption of independence of e-values (whereas independence, in some form, is essential for Simes’s method, and not assuming Simes’s inequality in hommel leads to only 72 discoveries instead of 86).

In practice, a discovery matrix, such as that shown in Figure 2, can be used in different ways, for example:

• The researcher may have budget for a limited number of follow-up studies of the hypotheses. For example, if in the situation of Figure 2 her budget is 50 hypotheses, she just concentrates on row 50 (studying the 50 hypotheses with the largest e-values). The smallest (right-most) e-value in this row of the discovery matrix is , and so we have strong evidence that all the discoveries are real.

• The researcher might have some idea of what proportion of false discoveries she is willing to tolerate (in the spirit of choosing the false discovery rate a priori (Benjamini and Hochberg, 1995)). For example, if she is willing to tolerate of false discoveries and willing to use Jeffreys’s standard (e-value greater than 10) of strong evidence, she should concentrate on row 83 (i.e., study the 83 hypotheses with the largest e-values), which is the lowest row with less than of entries below 10.

• Alternatively, the researcher might have some idea of how many false discoveries she is willing to tolerate (in the spirit of $k$-FWER (Romano and Wolf, 2007)). If she is willing to tolerate false discoveries and still willing to use Jeffreys’s standard of strong evidence, she should concentrate on row 86, which is the lowest row with at most entries (in fact, exactly entries) below 10.

Of course, the researcher may know her hypotheses and relations between them very well, and after looking at the discovery matrix she may come up with her own rejection set $R$, as emphasized by Goeman and Solari (2011). In this case she should use Algorithm 1.
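The three uses listed above can be expressed as simple queries on a discovery matrix. The following Python sketch is only an illustration: the function names are hypothetical, the matrix is assumed to be stored as a list of rows in which row $r$ holds its $r$ defined entries first, and "lowest row" is read as the row with the largest index satisfying the condition.

```python
def weakest_entry(dm, r):
    """Smallest defined entry in row r (1-based): the evidence that all
    r discoveries are real."""
    return min(dm[r - 1][:r])


def lowest_row_by_count(dm, threshold, max_weak):
    """Largest r whose row has at most max_weak entries below threshold."""
    best = None
    for r, row in enumerate(dm, start=1):
        weak = sum(1 for v in row[:r] if v < threshold)
        if weak <= max_weak:
            best = r
    return best


def lowest_row_by_proportion(dm, threshold, max_fraction):
    """Largest r whose row has a fraction of entries below threshold
    that is strictly less than max_fraction."""
    best = None
    for r, row in enumerate(dm, start=1):
        weak = sum(1 for v in row[:r] if v < threshold)
        if weak / r < max_fraction:
            best = r
    return best


if __name__ == "__main__":
    # A made-up 4 x 4 discovery matrix (lower triangle only), for illustration.
    toy = [[12.0], [15.0, 11.0], [14.0, 10.5, 4.0], [9.0, 6.0, 3.0, 0.5]]
    print(weakest_entry(toy, 2))
    print(lowest_row_by_count(toy, 10.0, 1))
    print(lowest_row_by_proportion(toy, 10.0, 0.4))
```

Slicing each row to its first $r$ entries means the same functions also accept a full square matrix whose entries above the diagonal are placeholders.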

### Acknowledgments

In Section 3 we used the R package hommel written by Jelle Goeman, Rosa Meijer, and Thijmen Krebs; it is available on CRAN. For our simulation studies we used Python, and our Jupyter notebook is available on arXiv.

V. Vovk’s research has been partially supported by Astra Zeneca and Stena Line. R. Wang is supported by the Natural Sciences and Engineering Research Council of Canada (RGPIN-2018-03823, RGPAS-2018-522590).

## References

• Benjamini and Hochberg (1995) Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society B, 57:289–300, 1995.
• Fisher (1973) Ronald A. Fisher. Statistical Methods and Scientific Inference. Hafner, New York, third edition, 1973.
• Goeman and Solari (2011) Jelle J. Goeman and Aldo Solari. Multiple testing for exploratory research. Statistical Science, 26:584–597, 2011. Correction: 28:464.
• Goeman et al. (2019) Jelle J. Goeman, Rosa J. Meijer, Thijmen J. P. Krebs, and Aldo Solari. Simultaneous control of all false discovery proportions in large-scale multiple hypothesis testing. Biometrika, 106:841–856, 2019.
• Jeffreys (1961) Harold Jeffreys. Theory of Probability. Oxford University Press, Oxford, third edition, 1961.
• Meng (1993) Xiao-Li Meng. Posterior predictive p-values. Annals of Statistics, 22:1142–1160, 1993.
• Romano and Wolf (2007) Joseph P. Romano and Michael Wolf. Control of generalized error rates in multiple testing. Annals of Statistics, 35:1378–1408, 2007.
• Rüschendorf (1982) Ludger Rüschendorf. Random variables with maximum sums. Advances in Applied Probability, 14:623–632, 1982.
• Simes (1986) R. John Simes. An improved Bonferroni procedure for multiple tests of significance. Biometrika, 73:751–754, 1986.
• Vovk and Wang (2019a) Vladimir Vovk and Ruodu Wang. Combining e-values and p-values. Technical Report arXiv:1912.06116 [math.ST], arXiv.org e-Print archive, December 2019a.
• Vovk and Wang (2019b) Vladimir Vovk and Ruodu Wang. Combining p-values via averaging. Technical Report arXiv:1212.4966 [math.ST], arXiv.org e-Print archive, October 2019b. Journal version: Biometrika (to appear).
• Wald (1950) Abraham Wald. Statistical Decision Functions. Wiley, New York, 1950.