True and false discoveries with independent e-values

03/01/2020 ∙ by Vladimir Vovk, et al. ∙ University of Waterloo ∙ Royal Holloway, University of London

In this note we use e-values (a non-Bayesian version of Bayes factors) in the context of multiple hypothesis testing assuming that the base tests produce independent e-values. Our simulation studies and theoretical considerations suggest that, under this assumption, our new algorithms are superior to the known algorithms using independent p-values and to our recent algorithms using e-values that are not necessarily independent.




1 Introduction

Our recent paper [6] gives a generic procedure for turning e-merging functions into discovery matrices and applies it to arithmetic mean. Using arithmetic mean is very natural in the case of arbitrary dependence between the base e-values, at least in the symmetric case, since arithmetic mean essentially dominates any e-merging function [6, Theorem 5.1]. But in this note we will show that in the case of independent e-values we can greatly improve on arithmetic mean.

2 Discovery matrices for independent e-values

To make our exposition self-contained, we start from basic definitions (see our previous papers [4, 5, 6] exploring e-values for further information).

An e-variable $E$ on a probability space $(\Omega, \mathcal{A}, Q)$ is a nonnegative extended random variable $E: \Omega \to [0,\infty]$ such that $\mathbb{E}^Q[E] \le 1$. A measurable function $F: [0,\infty]^K \to [0,\infty]$ for an integer $K \ge 1$ is an ie-merging function if, for any probability space and any independent e-variables $E_1,\dots,E_K$ on it, the extended random variable $F(E_1,\dots,E_K)$ is an e-variable. We will only consider ie-merging functions that are increasing in each argument and are symmetric (do not depend on the order of their arguments).

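Important examples of ie-merging functions [5] are

$$U_n(e_1,\dots,e_K) := \frac{1}{\binom{K}{n}} \sum_{\{k_1,\dots,k_n\} \subseteq \{1,\dots,K\}} e_{k_1} \cdots e_{k_n}, \quad n \in \{1,\dots,K\}. \tag{1}$$

We will refer to them as the U-statistics (they are the standard U-statistics with product as kernel). The statistics $U_1$ play a special role since they belong to the narrower class of e-merging functions, meaning that $U_1(E_1,\dots,E_K)$ is an e-variable whenever $E_1,\dots,E_K$ are e-variables (not necessarily independent).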
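The difference between the two classes can be checked exactly on a tiny example. The sketch below assumes a hypothetical base e-variable that takes the values 0 and 2 with probability 1/2 each, and compares the average of pairwise products with the arithmetic mean, first for independent copies and then for fully dependent (identical) copies. The function names and the choice of distribution are illustrative assumptions only.

```python
from fractions import Fraction
from itertools import combinations, product


def u2(values):
    """Average of the products e_i * e_j over all unordered pairs i < j."""
    pairs = list(combinations(values, 2))
    return sum(a * b for a, b in pairs) / len(pairs)


def mean(values):
    """Arithmetic mean of the values."""
    return sum(values) / len(values)


# Hypothetical base e-variable: 0 or 2 with probability 1/2 each (mean 1).
outcomes = [Fraction(0), Fraction(2)]
K = 3

# Independent case: all 2**K outcome vectors are equally likely.
independent = list(product(outcomes, repeat=K))
exp_u2_independent = sum(u2(v) for v in independent) / len(independent)

# Fully dependent case: the same value is repeated K times.
dependent = [(x,) * K for x in outcomes]
exp_u2_dependent = sum(u2(v) for v in dependent) / len(dependent)
exp_u1_dependent = sum(mean(v) for v in dependent) / len(dependent)

print(exp_u2_independent)  # expectation under independence
print(exp_u2_dependent)    # expectation under full dependence
print(exp_u1_dependent)    # arithmetic mean under full dependence
```

In this example the expected pairwise-product average is 1 under independence but 2 under full dependence, while the expected arithmetic mean stays at 1, which is why only the arithmetic mean can be used without the independence assumption.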

Multiple hypothesis testing using $U_1$ was explored in [5, 6], and in this note we will mainly concentrate on $U_2$. It will be convenient to generalize (1) to the case $n > K$; namely, we set

$$U_n(e_1,\dots,e_K) := U_K(e_1,\dots,e_K) = e_1 \cdots e_K, \quad n > K$$

(we are mostly interested in the case $n = 2$ and $K = 1$).
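As an illustration, the U-statistics with product kernel can be evaluated by brute force over all subsets of a given size. The sketch below is not an efficient implementation; the function name `u_statistic` and the example values are hypothetical, and the fallback to the product of all values when the order exceeds the number of arguments is an assumption of this sketch.

```python
from itertools import combinations
from math import comb, prod


def u_statistic(n, e_values):
    """Mean of the products of the e-values over all subsets of size n.

    Assumption of this sketch: when n exceeds the number K of e-values,
    the function falls back to the product of all K values.
    """
    k = len(e_values)
    n = min(n, k)
    total = sum(prod(subset) for subset in combinations(e_values, n))
    return total / comb(k, n)


e = [0.5, 2.0, 4.0]
print(u_statistic(1, e))      # arithmetic mean
print(u_statistic(2, e))      # mean of the pairwise products
print(u_statistic(3, e))      # product of all three values
print(u_statistic(2, [3.0]))  # order 2 with a single value
```

The brute-force sum has $\binom{K}{n}$ terms, so it is only meant for small inputs.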

Let us fix the underlying sample space $(\Omega, \mathcal{A})$, which is simply a measurable space. Let $\mathfrak{P}(\Omega)$ be the set of all probability measures on the sample space. A simple hypothesis is $Q \in \mathfrak{P}(\Omega)$ and a (composite) hypothesis is $H \subseteq \mathfrak{P}(\Omega)$. An e-variable w.r. to a hypothesis $H$ is an extended random variable $E: \Omega \to [0,\infty]$ such that $\mathbb{E}^Q[E] \le 1$ for all $Q \in H$. It is clear that any ie-merging function transforms independent e-variables w.r. to $H$ (i.e., independent e-variables w.r. to any $Q \in H$) to an e-variable w.r. to $H$.

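An e-value is a value taken by an e-variable. Let us fix $K$, hypotheses $H_1,\dots,H_K$, and independent e-variables $E_1,\dots,E_K$ w.r. to $H_1,\dots,H_K$, respectively. (The e-variables are required to be independent under any probability measure $Q \in \mathfrak{P}(\Omega)$.) An e-test is a family $E_Q$, $Q \in \mathfrak{P}(\Omega)$, of nonnegative extended random variables such that $\mathbb{E}^Q[E_Q] \le 1$ for all $Q$.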

Let us say that a measurable function is a discovery matrix if there exists an e-test , , such that, for all and all ,


where $\wedge$ and $\vee$ stand for “and” and “or”, respectively. To emphasize that we interpret $D$ as a matrix, we write its values as $D_{r,j}$. The intuition behind (2) is that if $D_{r,j}$ is large and we reject $r$ hypotheses with largest $e_k$, we can count on at least $j$ true discoveries.

Algorithm 1 Discovery matrix
Require: Ie-merging functions $F_k$, $k \in \{1,\dots,K\}$.
Require: An increasing sequence of e-values $e_1 \le \dots \le e_K$.
for … do
    for … do
        …
        for … do
            …
            if … then
                …

Algorithm 1 is one way of constructing a discovery matrix based on a family of ie-merging functions $F_k$, $k \in \{1,\dots,K\}$. It uses the notation $F(e_I)$, where $I \subseteq \{1,\dots,K\}$ and $F$ is a symmetric function of $|I|$ arguments, to mean the value of $F$ on the sequence of $e_k$, $k \in I$, arranged in any order. The algorithm is an obvious modification of Algorithm 2 in [6]; now we apply it to arbitrary ie-merging functions (such as $U_2$) rather than just to arithmetic mean (i.e., $U_1$). As in [6], the e-values are assumed to be ordered, without loss of generality.

The validity of Algorithm 1 can be demonstrated by the argument in the proof of Theorem 2.1 in [6]. It is clear that, in the case of $U_2$, the assumption of independence of $E_1,\dots,E_K$ can be relaxed to the assumption that all covariances $\operatorname{cov}(E_i, E_j)$, $i \ne j$, are nonpositive.
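The reasoning behind the relaxation to nonpositive covariances can be spelled out in two lines. The notation is assumed here: $E_1,\dots,E_K$ are the base e-variables, each with expectation at most 1 under the null, and $U_2$ is the average of their pairwise products.

```latex
\begin{align*}
\mathbb{E}[E_i E_j]
  &= \operatorname{cov}(E_i, E_j) + \mathbb{E}[E_i]\,\mathbb{E}[E_j]
   \le 0 + 1 \cdot 1 = 1, \qquad i \ne j, \\
\mathbb{E}\bigl[U_2(E_1,\dots,E_K)\bigr]
  &= \frac{2}{K(K-1)} \sum_{i<j} \mathbb{E}[E_i E_j] \le 1 .
\end{align*}
```

Independence is the special case in which every covariance is zero.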

The discovery matrix constructed in Algorithm 1 does not depend on the probability spaces $(\Omega, \mathcal{A}, Q)$, hypotheses $H_1,\dots,H_K$, or e-variables $E_1,\dots,E_K$, and in this sense is universal (in the terminology of [6, Section 5]).
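A minimal Python sketch of one way to compute such a matrix is given below. It assumes that entry $(r, j)$ is the smallest merged e-value over sets consisting of the $r - j + 1$ smallest e-values among the $r$ largest ones, together with the $i$ smallest of the remaining ones, for $i = 0, \dots, K - r$. This reading, the names `u_statistic` and `discovery_matrix`, the convention for orders exceeding the number of arguments, and the example e-values are all assumptions of the sketch rather than a transcription of Algorithm 1.

```python
from itertools import combinations
from math import comb, prod


def u_statistic(n, values):
    """Mean of products over all subsets of size n (brute force)."""
    k = len(values)
    n = min(n, k)  # assumed convention when n exceeds the number of values
    return sum(prod(s) for s in combinations(values, n)) / comb(k, n)


def discovery_matrix(e_values, merge):
    """Return a dict {(r, j): value} for 1 <= j <= r <= K.

    merge is a symmetric increasing function applied to a list of e-values.
    """
    e = sorted(e_values)            # increasing order
    K = len(e)
    dm = {}
    for r in range(1, K + 1):
        rejected = e[K - r:]        # the r largest e-values
        outside = e[:K - r]         # the remaining K - r e-values
        for j in range(1, r + 1):
            core = rejected[:r - j + 1]
            best = merge(core)
            for i in range(1, K - r + 1):
                best = min(best, merge(core + outside[:i]))
            dm[(r, j)] = best
    return dm


e = [0.5, 1.0, 8.0, 20.0]  # made-up e-values for illustration
dm1 = discovery_matrix(e, lambda v: u_statistic(1, v))
dm2 = discovery_matrix(e, lambda v: u_statistic(2, v))
print(dm1[(2, 1)], dm2[(2, 1)])
```

Passing the merging function as an argument keeps the loop structure identical for the arithmetic mean and for the pairwise-product average, so the two matrices can be compared entry by entry.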

3 A toy simulation study

In this section we run Algorithm 1 applied to $U_1$ and $U_2$. Slightly generalizing the explanation in [6, Appendix B in Working Paper 27], we can see that the discovery matrix can be computed in time . For $U_1$, the time can be improved from to [6, Appendix B in Working Paper 27]. For $U_2$, we can easily improve the time to by noticing that

This is sufficient to cope with the case $K = 200$ that we usually use in our simulation studies.
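One elementary identity that permits this kind of speed-up is that the sum of all pairwise products equals half of the squared sum minus the sum of squares. Whether this is exactly the identity used by the authors is an assumption here. With running sums, the pairwise-product average of a growing set of e-values can then be updated in constant time per added value, as the following sketch with hypothetical function names illustrates.

```python
from itertools import combinations


def u2_bruteforce(values):
    """Average of pairwise products, computed directly from the definition."""
    pairs = list(combinations(values, 2))
    return sum(a * b for a, b in pairs) / len(pairs)


def u2_from_sums(k, s1, s2):
    """Pairwise-product average of k >= 2 values with sum s1 and sum of squares s2."""
    return (s1 * s1 - s2) / (k * (k - 1))


def u2_prefixes(values):
    """Pairwise-product average of every prefix of length >= 2, O(1) per step."""
    s1 = s2 = 0.0
    out = []
    for k, x in enumerate(values, start=1):
        s1 += x
        s2 += x * x
        if k >= 2:
            out.append(u2_from_sums(k, s1, s2))
    return out


e = [0.5, 1.0, 8.0, 20.0]  # made-up e-values for illustration
fast = u2_prefixes(e)
slow = [u2_bruteforce(e[:k]) for k in range(2, len(e) + 1)]
print(fast)
print(slow)
```

The two lists agree up to rounding, while the incremental version avoids enumerating the pairs.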

Figure 1: Left panel: the discovery matrix for the statistic $U_1$ (i.e., arithmetic mean) for 100 false and 100 true null hypotheses. Right panel: the $U_2$ analogue.

We generate the base e-values as in Section 3 of [6]

: the null hypothesis is

, , the first observations are generated from , the last from , all independently, and the base e-variables are the likelihood ratios

The results are shown in Figure 1 (whose left panel is identical to the left panel of Figure 2 in [6]); they are much better for $U_2$. Each panel shows the lower triangular matrix $D_{r,j}$, the left for $U_1$ and the right for $U_2$. The colour scheme used in this figure is inspired by Jeffreys’s [3, Appendix B] (as in [6]):

  • The entries with $D_{r,j}$ below 1 are shown in dark green; there is no evidence that there are at least $j$ true discoveries among $r$ hypotheses with the largest e-values.

  • The entries $D_{r,j} \in (1, \sqrt{10}) \approx (1, 3.16)$ are shown in green. For them the evidence is poor.

  • The entries $D_{r,j} \in (\sqrt{10}, 10) \approx (3.16, 10)$ are shown in yellow. The evidence is substantial.

  • The entries $D_{r,j} \in (10, 10^{3/2}) \approx (10, 31.6)$ are shown in red. The evidence is strong.

  • The entries $D_{r,j} \in (10^{3/2}, 100) \approx (31.6, 100)$ are shown in dark red. The evidence is very strong.

  • Finally, the entries $D_{r,j} > 100$ are shown in black, and for them the evidence is decisive.
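The colour scheme amounts to a lookup against a short list of thresholds. The sketch below assumes the grades of Jeffreys's scale at 1, $10^{1/2}$, 10, $10^{3/2}$ and 100; the function name `jeffreys_colour` and the treatment of values falling exactly on a threshold are choices of this sketch.

```python
from math import sqrt

# Upper bounds of the grades (assumed: Jeffreys's scale), with colour and label.
JEFFREYS_SCALE = [
    (1.0, "dark green", "no evidence"),
    (sqrt(10.0), "green", "poor"),
    (10.0, "yellow", "substantial"),
    (10.0 ** 1.5, "red", "strong"),
    (100.0, "dark red", "very strong"),
]


def jeffreys_colour(value):
    """Return (colour, label) for one entry of a discovery matrix."""
    for upper, colour, label in JEFFREYS_SCALE:
        if value < upper:
            return colour, label
    return "black", "decisive"


for entry in (0.5, 2.0, 5.0, 20.0, 50.0, 1000.0):
    print(entry, jeffreys_colour(entry))
```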

Figure 2: Left panel: the discovery p-matrix for the GWGS procedure. Right panel: the discovery matrix e-to-p calibrated.

It is interesting that after the crude e-to-p calibration our method produces p-values that look even better than the p-values produced by the GWGS procedure (in the terminology of [6]) designed specifically for p-values: see Figure 2.
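A simple e-to-p calibration maps an e-value $e$ to $\min(1, 1/e)$. The sketch below assumes that this is the crude calibration meant here; the function name `e_to_p` is hypothetical.

```python
def e_to_p(e_value):
    """Map an e-value to a p-value via min(1, 1/e); e-values of 0 map to 1."""
    if e_value <= 0:
        return 1.0
    return min(1.0, 1.0 / e_value)


for e in (0.5, 1.0, 20.0, 200.0, float("inf")):
    print(e, e_to_p(e))
```

Under this map an e-value of 20 corresponds to a p-value of 0.05 and an e-value of 200 to 0.005, which gives a rough way of reading the two colour scales against each other.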

In Figure 2 we use what we called Fisher’s scale in [6], but now we extend it by two further thresholds, one of which is 0.005, as advocated by [1]. Our colour scheme is:

  • P-values above 0.05 are shown in green; they are not significant.

  • P-values between 0.01 and 0.05 are shown in yellow; they are significant but not highly significant.

  • P-values between 0.005 and 0.01 are shown in red; they are highly significant (but fail to attain the more stringent criterion of significance advocated in [1]).

  • P-values between and are shown in dark red.

  • P-values below are shown in black; they can be regarded as providing decisive evidence against the null hypothesis (to use Jeffreys’s expression).

4 An attempt at a theoretical explanation

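We start from an alternative representation of $U_2$, which will shed some light on the expected performance of our algorithm.

Let $K \ge 2$, $\bar e$ be the arithmetic mean of $e_1,\dots,e_K \in [0,\infty)$, $q$ be the quadratic mean of $e_1,\dots,e_K$, and

$$s^2 := \frac{1}{K-1} \sum_{k=1}^K (e_k - \bar e)^2$$

be the sample variance of $e_1,\dots,e_K$.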

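Lemma 4.1.

For any $e_1,\dots,e_K \in [0,\infty)$,

$$U_2(e_1,\dots,e_K) = \bar e^2 - \frac{s^2}{K}. \tag{3}$$

Proof. By definition,

$$U_2(e_1,\dots,e_K) = \frac{2}{K(K-1)} \sum_{i<j} e_i e_j = \frac{\left(\sum_{k} e_k\right)^2 - \sum_{k} e_k^2}{K(K-1)} = \frac{K \bar e^2 - q^2}{K-1} = \bar e^2 - \frac{q^2 - \bar e^2}{K-1} = \bar e^2 - \frac{s^2}{K},$$

where $q^2 = \frac{1}{K} \sum_k e_k^2$ is the squared quadratic mean and the last step uses $s^2 = \frac{K}{K-1} (q^2 - \bar e^2)$. ∎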
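Corollary 4.2.

For any $e_1,\dots,e_K \in [0,\infty)$,

$$s^2 \le K \bar e^2.$$

For some $e_1,\dots,e_K$ the inequality holds as equality.

Proof. The first statement follows from $U_2 \ge 0$, and an example for the second one is $(e_1,\dots,e_K) = (1,0,\dots,0)$. ∎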

According to Corollary 4.2,

$$\frac{s^2}{K \bar e^2},$$

which we will call the relative (sample) variance of $e_1,\dots,e_K$, is a dimensionless quantity in the interval $[0,1]$. When $\bar e = 0$, we set the relative variance to $0$. The relative variance is zero if and only if all $e_k$ coincide, and it is 1 if and only if all $e_k$ but one are zero.

Using the notion of relative variance, we can rewrite (3) as

$$U_2(e_1,\dots,e_K) = \bar e^2 \left(1 - \frac{s^2}{K \bar e^2}\right).$$

We can see that the method of this paper based on $U_2$ has a potential for improving on the method of [6], but the best it can achieve is squaring the entries of the discovery matrix. An entry is squared if the multiset of e-values on which the infimum in the algorithm of [6] is attained consists of a single value. Otherwise we suffer as the e-values become more diverse.
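The representation in terms of the relative variance is easy to check numerically. The sketch below takes the relative variance to be the sample variance divided by $K$ times the squared arithmetic mean (set to 0 when the mean is 0) and compares the squared mean times one minus the relative variance with the pairwise-product average computed from its definition. The function names and the three test vectors are illustrative assumptions.

```python
from itertools import combinations
from statistics import mean, variance


def u2(values):
    """Average of pairwise products, computed directly from the definition."""
    pairs = list(combinations(values, 2))
    return sum(a * b for a, b in pairs) / len(pairs)


def relative_variance(values):
    """Sample variance divided by K times the squared mean (0 if the mean is 0)."""
    m = mean(values)
    if m == 0:
        return 0.0
    return variance(values) / (len(values) * m * m)


def u2_via_relative_variance(values):
    """Squared arithmetic mean, shrunk by the factor 1 - relative variance."""
    m = mean(values)
    return m * m * (1.0 - relative_variance(values))


examples = ([3.0, 3.0, 3.0], [1.0, 0.0, 0.0, 0.0], [0.5, 1.0, 8.0, 20.0])
for e in examples:
    print(e, u2(e), u2_via_relative_variance(e), relative_variance(e))
```

The first vector has relative variance 0, so the pairwise-product average equals the squared mean; the second has relative variance 1, so it vanishes; the third lies in between.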

5 Conclusion

The most natural direction of further research is to find computationally efficient procedures for computing discovery matrices based on $U_n$, $n > 2$.


We are grateful to Yuri Gurevich for useful discussions. In our simulation studies we used Python and R, including the package hommel [2].

V. Vovk’s research has been partially supported by Astra Zeneca and Stena Line. R. Wang is supported by the Natural Sciences and Engineering Research Council of Canada (RGPIN-2018-03823, RGPAS-2018-522590).


  • [1] Daniel J. Benjamin, James O. Berger, Magnus Johannesson, Brian A. Nosek, Eric-Jan Wagenmakers, Richard Berk, Kenneth A. Bollen, Björn Brembs, Lawrence Brown, Colin Camerer, David Cesarini, Christopher D. Chambers, Merlise Clyde, Thomas D. Cook, Paul De Boeck, Zoltan Dienes, Anna Dreber, Kenny Easwaran, Charles Efferson, Ernst Fehr, Fiona Fidler, Andy P. Field, Malcolm Forster, Edward I. George, Richard Gonzalez, Steven Goodman, Edwin Green, Donald P. Green, Anthony Greenwald, Jarrod D. Hadfield, Larry V. Hedges, Leonhard Held, Teck Hua Ho, Herbert Hoijtink, Daniel J. Hruschka, Kosuke Imai, Guido Imbens, John P. A. Ioannidis, Minjeong Jeon, James Holland Jones, Michael Kirchler, David Laibson, John List, Roderick Little, Arthur Lupia, Edouard Machery, Scott E. Maxwell, Michael McCarthy, Don Moore, Stephen L. Morgan, Marcus Munafó, Shinichi Nakagawa, Brendan Nyhan, Timothy H. Parker, Luis Pericchi, Marco Perugini, Jeff Rouder, Judith Rousseau, Victoria Savalei, Felix D. Schönbrodt, Thomas Sellke, Betsy Sinclair, Dustin Tingley, Trisha Van Zandt, Simine Vazire, Duncan J. Watts, Christopher Winship, Robert L. Wolpert, Yu Xie, Cristobal Young, Jonathan Zinman, and Valen E. Johnson. Redefine statistical significance: We propose to change the default p-value threshold for statistical significance from 0.05 to 0.005 for claims of new discoveries (Comment). Nature Human Behaviour, 2:6–10, 2018.
  • [2] Jelle J. Goeman, Rosa Meijer, and Thijmen Krebs. hommel: Methods for closed testing with Simes inequality, in particular Hommel’s method, 2019. R package version 1.5, available on CRAN.
  • [3] Harold Jeffreys. Theory of Probability. Oxford University Press, Oxford, third edition, 1961.
  • [4] Vladimir Vovk. Non-algorithmic theory of randomness. Technical Report arXiv:1910.00585 [math.ST], e-Print archive, October 2019. The conference version is to appear in: Fields of Logic and Computation III: Essays Dedicated to Yuri Gurevich on the Occasion of His 80th Birthday, ed. by Andreas Blass, Patrick Cégilski, Nachum Dershowitz, Manfred Droste, and Bernd Finkbeiner. Springer, 2020.
  • [5] Vladimir Vovk and Ruodu Wang. Combining e-values and p-values. Technical Report arXiv:1912.06116 [math.ST], e-Print archive, December 2019.
  • [6] Vladimir Vovk and Ruodu Wang. True and false discoveries with e-values. Technical Report arXiv:1912.13292 [math.ST], e-Print archive, December 2019. For the latest version, see, Working Paper 27.