Main comment
Glenn Shafer’s paper is a powerful appeal for a wider use of betting ideas and intuitions in statistics. He admits that p-values will never be completely replaced by betting scores, and I discuss this further in Appendix A (one of the two online appendices that I have prepared to meet the word limit). Both p-values and betting scores generalize Cournot’s principle [13], but they do so in different ways, and both ways are interesting and valuable.
Other authors have referred to betting scores as Bayes factors [16] and as e-values [23, 7]. For simple null hypotheses, betting scores and Bayes factors indeed essentially coincide [7, Section 1, interpretation 3], but for composite null hypotheses they are different notions, and using “Bayes factor” to mean “betting score” is utterly confusing to Bayesians [11]. However, the Bayesian connection still allows us to apply Jeffreys’s [9, Appendix B] rule of thumb to betting scores; namely, a p-value of 5% is roughly equivalent to a betting score of √10 ≈ 3.16, and a p-value of 1% to a betting score of 10. This agrees beautifully with Shafer’s rule (6), which gives, to two decimal places:
3.47 for p = 5%, instead of Jeffreys’s 3.16 (slight overshoot);

9.00 for p = 1%, instead of Jeffreys’s 10 (slight undershoot).
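As a sanity check, the comparison with Jeffreys’s rule of thumb can be computed directly. The sketch below assumes that Shafer’s rule (6) is the calibrator p ↦ 1/√p − 1 (an assumption consistent with the overshoot/undershoot pattern, not a quotation of his paper):

```python
import math

def shafer_betting_score(p):
    """Betting score from a p-value via the assumed calibrator 1/sqrt(p) - 1."""
    return 1 / math.sqrt(p) - 1

print(round(shafer_betting_score(0.05), 2))  # 3.47, vs Jeffreys's sqrt(10) ~ 3.16
print(round(shafer_betting_score(0.01), 2))  # 9.0, vs Jeffreys's 10
```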
The term “e-values” emphasizes the fundamental role of expectation in the definition of betting scores (somewhat similar to the role of probability in the definition of p-values). It appears that the natural habitat for “betting scores” is game-theoretic, while for “e-values” it is measure-theoretic
[14]; therefore, I will say “e-values” in the online appendices (Appendix A and [19]), which are based on measure-theoretic probability.

In the second online appendix [19] I give a new example showing that betting scores are not just about communication; they may allow us to solve real statistical and scientific problems (more examples will be given by my co-author Ruodu Wang). David Cox [4] discovered that splitting data at random not only allows flexible testing of statistical hypotheses but also achieves high efficiency. A serious objection to the method is that different people analyzing the same data may get very different answers (thus violating “inferential reproducibility” [6, 8]). Using e-values instead of p-values remedies the situation.
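One reason e-values play well with random splitting (a minimal illustration, not necessarily the construction used in [19]): the arithmetic mean of e-variables is again an e-variable, by linearity of expectation, so e-values obtained by different analysts from different random splits can be merged by plain averaging. The likelihood-ratio e-variable below is a hypothetical choice for illustration:

```python
import math
import random

random.seed(2)

def lr_e_value(xs):
    """Likelihood ratio of N(1,1) to N(0,1): a valid e-variable under the null."""
    return math.exp(sum(x - 0.5 for x in xs))

# Two analysts split the same null data into disjoint random halves;
# each computes an e-value, and the two e-values are merged by averaging.
n_sims = 20_000
total = 0.0
for _ in range(n_sims):
    data = [random.gauss(0, 1) for _ in range(4)]  # data from the null N(0,1)
    random.shuffle(data)
    e1 = lr_e_value(data[:2])
    e2 = lr_e_value(data[2:])
    total += (e1 + e2) / 2

print(total / n_sims)  # close to 1: the average is still an e-variable
```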
Acknowledgments
Thanks to Ruodu Wang for useful discussions and for sharing with me his much more extensive list of advantages of evalues. This research has been partially supported by Amazon, Astra Zeneca, and Stena Line.
References
 [1] James O. Berger and Mohan Delampady. Testing precise hypotheses (with discussion). Statistical Science, 2:317–352, 1987.
 [2] Jacob Bernoulli. Ars Conjectandi. Thurnisius, Basel, 1713.
 [3] Antoine-Augustin Cournot. Exposition de la théorie des chances et des probabilités. Hachette, Paris, 1843.
 [4] David R. Cox. A note on data-splitting for the evaluation of significance levels. Biometrika, 62:441–444, 1975.
 [5] Annie Duke. Thinking in Bets. Portfolio, New York, 2018.
 [6] Steven N. Goodman, Daniele Fanelli, and John P. A. Ioannidis. What does research reproducibility mean? Science Translational Medicine, 8:341ps12, 2016.
 [7] Peter Grünwald, Rianne de Heide, and Wouter M. Koolen. Safe testing. Technical Report arXiv:1906.07801 [math.ST], arXiv.org ePrint archive, June 2020.
 [8] Leonhard Held and Simon Schwab. Improving the reproducibility of science. Significance, 17(1):10–11, 2020.
 [9] Harold Jeffreys. Theory of Probability. Oxford University Press, Oxford, third edition, 1961.
 [10] Erich L. Lehmann and Joseph P. Romano. Testing Statistical Hypotheses. Springer, New York, third edition, 2005.
 [11] Christian P. Robert. Bayes factors and martingales, 2011. Entry in blog “Xi’an’s Og” for August 11.
 [12] Thomas Sellke, M. J. Bayarri, and James Berger. Calibration of p-values for testing precise null hypotheses. American Statistician, 55:62–71, 2001.
 [13] Glenn Shafer. From Cournot’s principle to market efficiency. In Jean-Philippe Touffut, editor, Augustin Cournot: Modelling Economics, chapter 4. Edward Elgar, Cheltenham, 2007.
 [14] Glenn Shafer. Personal communication. May 8, 2020.
 [15] Glenn Shafer. Testing by betting: A strategy for statistical and scientific communication. To be read before the Royal Statistical Society on September 9 and to appear as discussion paper in the Journal of the Royal Statistical Society A, 2020.
 [16] Glenn Shafer, Alexander Shen, Nikolai Vereshchagin, and Vladimir Vovk. Test martingales, Bayes factors, and p-values. Statistical Science, 26:84–101, 2011.
 [17] Judith ter Schure and Peter Grünwald. Accumulation bias in meta-analysis: the need to consider time in error control. Technical Report arXiv:1905.13494 [stat.ME], arXiv.org ePrint archive, May 2019.
 [18] Vladimir Vovk. A logic of probability, with application to the foundations of statistics (with discussion). Journal of the Royal Statistical Society B, 55:317–351, 1993.
 [19] Vladimir Vovk. A note on data splitting with e-values: online appendix to my comment on Glenn Shafer’s “Testing by betting”. Technical Report arXiv:2008.11474 [stat.ME], arXiv.org ePrint archive, August 2020.
 [20] Vladimir Vovk. Testing randomness online. Technical Report arXiv:1906.09256 [math.PR], arXiv.org ePrint archive, March 2020.
 [21] Vladimir Vovk, Bin Wang, and Ruodu Wang. Admissible ways of merging p-values under arbitrary dependence. Technical Report arXiv:2007.14208 [math.ST], arXiv.org ePrint archive, July 2020.
 [22] Vladimir Vovk and Ruodu Wang. True and false discoveries with e-values. Technical Report arXiv:1912.13292 [math.ST], arXiv.org ePrint archive, December 2019.
 [23] Vladimir Vovk and Ruodu Wang. Combining e-values and p-values. Technical Report arXiv:1912.06116 [math.ST], arXiv.org ePrint archive, May 2020.
 [24] Vladimir Vovk and Ruodu Wang. Combining p-values via averaging. Biometrika, 2020. To appear, published online.
Appendix A Cournot’s principle, p-values, and e-values
This is an online appendix to the main comment. It is based, to a large degree, on Glenn Shafer’s ideas about the philosophy of statistics. After a brief discussion of p-values and e-values as different extensions of Cournot’s principle, I list some of their advantages and disadvantages.
A.1 Three ways of testing
Both p-values and e-values are developments of Cournot’s principle [13], which is referred to simply as the standard way of testing in Shafer’s [15, Section 2.1]. If a given event has a small probability, we do not expect it to happen; this is Cournot’s bridge between probability theory and the world. (This bridge was discussed already by James Bernoulli [2]; Cournot’s [3] contribution was to say that this is the only bridge.) See Figure 1.

Cournot’s principle requires an a priori choice of a rejection region. Its disadvantage is that it is binary: either the null hypothesis is completely rejected or we find no evidence whatsoever against it. A p-variable is a nonnegative random variable P such that, for any ε > 0, the probability that P ≤ ε is at most ε under the null hypothesis; one way to define p-variables is via Shafer’s (3). An e-variable is a nonnegative random variable E whose expectation under the null hypothesis is at most 1; one way to define e-variables is via Shafer’s first displayed equation in Section 2. In p-testing, we choose a p-variable P in advance and reject the null hypothesis when the observed value of P (the p-value) is small, and in e-testing, we choose an e-variable E in advance and reject the null hypothesis when the observed value of E (the e-value) is large. In both cases, binary testing becomes graduated: now we have a measure of the amount of evidence found against the null hypothesis.

We can embed Cournot’s principle into both p-testing, with rejection region {P ≤ ε}, and e-testing (as Shafer [15, Section 2.1, (1)] explains), with rejection region {E ≥ c}, where c := 1/ε.
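The e-testing side of this embedding rests on Markov’s inequality: an e-variable has expectation at most 1 under the null, so the probability that it reaches 1/ε is at most ε. A minimal simulation (the likelihood-ratio e-variable below is a hypothetical choice for illustration):

```python
import math
import random

random.seed(0)

def e_variable(x):
    """Likelihood ratio of N(1,1) to N(0,1): a valid e-variable under the null."""
    return math.exp(x - 0.5)

eps = 0.05
n = 100_000
# Evaluate the e-variable on data actually drawn from the null N(0,1)
sample = [e_variable(random.gauss(0, 1)) for _ in range(n)]

mean_e = sum(sample) / n                                 # close to 1
rejection_rate = sum(e >= 1 / eps for e in sample) / n   # Markov: at most eps
print(mean_e, rejection_rate <= eps)
```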
There are numerous ways to transform p-values to e-values (to calibrate them) and essentially one way (p := min(1, 1/e)) to transform e-values to p-values, as discussed in detail in [22]. The idea of calibrating p-values originated in Bayesian statistics ([1, Section 4.2], [18, Section 9], [12]), and there is a wide range of admissible calibrators. Transforming e-values into p-values is referred to as e-to-p calibration in [22], where e ↦ min(1, 1/e) is shown to dominate any e-to-p calibrator [22, Proposition 2.2].

Moving between the p-domain and the e-domain is, however, very inefficient. Borrowing the idea of “round-trip efficiency” from energy storage, let us start from the highly statistically significant (1%) p-value, transform it to an e-value using Shafer’s [15, (6)] calibrator, obtaining the betting score 9, and then transform it back to a p-value using the only admissible e-to-p calibrator: 1/9 ≈ 11%. The resulting p-value of 11% is not even statistically significant (5%).
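The round trip can be computed directly. The sketch assumes that Shafer’s [15, (6)] calibrator has the form p ↦ 1/√p − 1 and that the admissible e-to-p calibrator is e ↦ min(1, 1/e):

```python
import math

def p_to_e(p):
    """Calibrate a p-value into an e-value (assumed form of Shafer's (6))."""
    return 1 / math.sqrt(p) - 1

def e_to_p(e):
    """The essentially unique admissible e-to-p calibrator."""
    return min(1.0, 1 / e)

p = 0.01            # highly statistically significant
e = p_to_e(p)       # betting score of 9
p_back = e_to_p(e)  # 1/9, about 11%: no longer significant at the 5% level
print(round(e, 2), round(p_back, 2))
```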
A.2 Some comparisons
Both p-values and e-values have important advantages, and I think they should complement (rather than compete with) each other. Let me list a few advantages of each that come first to mind. Advantages of p-values:

P-values can be more robust to our assumptions (perhaps implicit). Suppose, for example, that our null hypothesis is simple. When we have a clear alternative hypothesis (always assumed simple) in mind, the likelihood ratio has a natural property of optimality as an e-variable (Shafer [15, Section 2.2]), and the p-variable corresponding to the likelihood ratio as test statistic is also optimal (Neyman–Pearson lemma [10, Section 3.2, Theorem 1]). For some natural classes of alternative hypotheses, the resulting p-value will not depend on the choice of the alternative hypothesis in the class (see, e.g., [10, Chapter 3] for numerous examples; a simple example can be found in [19, Section 4]). This is not true for e-values.

There are many known efficient ways of computing p-values for testing nonparametric hypotheses that are already widely used in science.

In many cases, we know the distribution of p-values under the null hypothesis: it is uniform on the interval [0, 1]. If the null hypothesis is composite, we can test it by testing the simple hypothesis of uniformity for the p-values. A recent application of this idea is the use of conformal martingales for detecting deviations from the IID model [20].
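The uniformity of p-values under a simple null is easy to check by simulation; a minimal sketch with a hypothetical one-sided z-test:

```python
import math
import random

random.seed(1)

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# One-sided z-test p-values computed on data drawn from the null N(0, 1)
pvals = [1 - phi(random.gauss(0, 1)) for _ in range(100_000)]

# Under the null, the empirical CDF of the p-values tracks the diagonal
for t in (0.1, 0.5, 0.9):
    frac = sum(p <= t for p in pvals) / len(pvals)
    print(t, round(frac, 2))
```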
Advantages of e-values (starting from the advantages mentioned by Shafer [15, Section 1]):

E-values appear naturally as a technical tool when applying the duality theorem in deriving admissible functions for combining p-values [21].