1 Introduction
There has been a great deal of criticism of the notion of p-value lately, and in particular, Glenn Shafer [16] defended the use of betting scores instead. This paper refers to betting scores as e-values and demonstrates their advantages by establishing results that become much more precise when they are stated in terms of e-values instead of p-values.
Both p-values and e-values have been used, albeit somewhat implicitly, in the algorithmic theory of randomness: Martin-Löf’s tests of algorithmic randomness [14] are an algorithmic version of p-functions (i.e., functions producing p-values [5]), while Levin’s tests of algorithmic randomness [12, 2] are an algorithmic version of e-functions (this is the term we will use in this paper for functions producing e-values). Levin’s tests are a natural modification of Martin-Löf’s tests leading to simpler mathematical results; similarly, many mathematical results stated in terms of p-values become simpler when stated in terms of e-values.
The algorithmic theory of randomness is a powerful source of intuition, but strictly speaking, its results are not applicable in practice since they always involve unspecified additive or multiplicative constants. The goal of this paper is to explore ways of obtaining results that are more precise; in particular, results that may be applicable in practice. The price to pay is that our results may involve more quantifiers (usually hidden in our notation) and, therefore, their statements may at first appear less intuitive.
In Section 2 we define p-functions and e-functions in the context of testing simple statistical hypotheses, explore relations between them, and explain the intuition behind them. In Section 3 we generalize these definitions, results, and explanations to testing composite statistical hypotheses.
Section 4 is devoted to testing in Bayesian statistics and gives non-algorithmic results that are particularly clean and intuitive. They will be used as technical tools later in the paper. In Section 5 these results are slightly extended and then applied to clarifying the difference between statistical randomness and exchangeability. (In this paper we use “statistical randomness” to refer to being produced by an IID probability measure; there will always be either “algorithmic” or “statistical” standing next to “randomness” in order to distinguish between the two meanings.)
Section 6 explores the question of defining Bernoulli sequences, which was of great interest to Kolmogorov [7], Martin-Löf [14], and Kolmogorov’s other students. Kolmogorov defined Bernoulli sequences as exchangeable sequences, but we will see that another natural definition is narrower than exchangeability.
Kolmogorov paid particular attention to algorithmic randomness w.r. to uniform probability measures on finite sets. On one hand, he believed that his notion of algorithmic randomness in this context “can be regarded as definitive” [9], and on the other hand, he never seriously suggested any generalizations of this notion (and never endorsed generalizations proposed by his students). In Section 6 we state a simple result in this direction that characterizes the difference between Bernoulliness and exchangeability.
In Sections 4 and 6 we state our results first in terms of e-functions and then in terms of p-functions. Results in terms of e-functions are always simpler and cleaner, supporting Glenn Shafer’s recommendation in [16] to use betting scores more widely.
Remark 1.
There is no standard terminology for what we call e-values and e-functions. In addition to Shafer’s use of “betting scores” for our e-values,
No formal knowledge of the algorithmic theory of randomness will be assumed in this paper; the reader can safely ignore all comparisons between our results and results of the algorithmic theory of randomness.
Notation
Our notation will be mostly standard or defined at the point where it is first used. If is a class of valued functions on some set and is a function, we let stand for the set of all compositions , (i.e., is applied to elementwise). We will also use obvious modifications of this definition: e.g., would be interpreted as , where for .
2 Testing simple statistical hypotheses
Let be a probability measure on a measurable space . A p-function [5] is a measurable function such that, for any , . An e-function is a measurable function such that . (E-functions have been promoted in [16], [4], and [17, Section 11.5], but using different terminology.)
Let be the class of all p-functions and be the class of all e-functions, where the underlying measure is shown as subscript. We can define p-values and e-values as values taken by p-functions and e-functions, respectively. The intuition behind p-values and e-values will be discussed later in this section.
The following is an algorithm-free version of the standard relation (see, e.g., [13, Lemma 4.3.5]) between Martin-Löf’s and Levin’s algorithmic notions of randomness deficiency.
Proposition 2.
For any probability measure and ,
(1) 
Proof.
The right inclusion in (1) follows from the Markov inequality: if is an e-function,
Both p-functions and e-functions can be used for testing statistical hypotheses. In this section we only discuss simple statistical hypotheses, i.e., probability measures. Observing a large e-value or a small p-value w.r. to a simple statistical hypothesis entitles us to reject as the source of the observed data, provided the e-function or p-function was chosen in advance. The e-value can be interpreted as the amount of evidence against found by our chosen e-function. Similarly, the p-value reflects the amount of evidence against on a different scale; small p-values reflect a large amount of evidence against .
Remark 3.
Proposition 2 tells us that using p-values and using e-values are equivalent, on a rather crude scale. Roughly, a p-value of corresponds to an e-value of . The right inclusion in (1) says that any way of producing e-values can be translated into a way of producing p-values . On the other hand, the left inclusion in (1) says that any way of producing p-values can be translated into a way of producing e-values , where the “” assumes that we are interested in the asymptotics as , is small, and we ignore positive constant factors (as is customary in the algorithmic theory of randomness).
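The crude correspondence between e-values and p-values can be illustrated numerically. The following sketch is a hypothetical example, not taken from the paper: the null hypothesis Q is uniform on [0, 1], the e-function is f(x) = 2x (nonnegative and integrating to 1 under Q), and Markov’s inequality converts it into a p-function via p = min(1, 1/f).

```python
import random

random.seed(0)

# Null hypothesis Q: uniform on [0, 1].
# Hypothetical e-function: f(x) = 2x is nonnegative and integrates to 1 under Q.
def e_function(x):
    return 2.0 * x

# Markov's inequality turns any e-function into a p-function:
# p = min(1, 1/f) satisfies Q(p <= eps) <= eps for every eps in (0, 1).
def p_from_e(e):
    return min(1.0, 1.0 / e) if e > 0 else 1.0

sample = [random.random() for _ in range(100_000)]
e_vals = [e_function(x) for x in sample]
p_vals = [p_from_e(e) for e in e_vals]

# Defining property of an e-function: its mean under Q is (at most) 1.
mean_e = sum(e_vals) / len(e_vals)

# Empirical frequency of {p <= eps}; it should not noticeably exceed eps.
def rejection_rate(eps):
    return sum(1 for p in p_vals if p <= eps) / len(p_vals)

print(abs(mean_e - 1.0) < 0.02)
print(all(rejection_rate(eps) <= eps + 0.01 for eps in (0.01, 0.05, 0.25)))
```

The reverse direction of Proposition 2 (from p-functions to e-functions) incurs the logarithmic loss discussed in Remark 3, which is why no exact numerical inverse is attempted here.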
3 Testing composite statistical hypotheses
Let be a measurable space, which we will refer to as our sample space, and be another measurable space (our parameter space). We say that is a statistical model on if is a Markov kernel with source and target : each is a probability measure on , and for each measurable , the function of is measurable.
The notions of an e-function and a p-function each split in two. We are usually really interested only in the outcome , while the parameter is an auxiliary modelling tool. This motivates the following pair of simpler definitions. A measurable function is an e-function w.r. to the statistical model (which is our composite statistical hypothesis in this context) if
In other words, if , where is the upper envelope
(3) 
(in Bourbaki’s [1, IX.1.1] terminology, is an encumbrance provided the integral in (3) is understood as the upper integral). Similarly, a measurable function is a p-function w.r. to the statistical model if, for any ,
In other words, if, for any , .
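When the parameter and sample spaces are finite, the upper-envelope condition for e-functions w.r. to a composite hypothesis can be checked by direct enumeration. The model and candidate function below are made-up illustrations, not from the paper: a two-point parameter set indexing measures on a three-point sample space.

```python
# Made-up composite hypothesis: sample space {0, 1, 2}, parameter set {theta0, theta1}.
model = {
    "theta0": {0: 0.5, 1: 0.3, 2: 0.2},
    "theta1": {0: 0.2, 1: 0.3, 2: 0.5},
}

# Candidate e-function on the sample space (does not depend on the parameter).
f = {0: 1.5, 1: 0.5, 2: 0.25}

def expectation(measure, g):
    return sum(measure[x] * g[x] for x in measure)

# f is an e-function w.r. to the composite model iff its upper envelope,
# the supremum over theta of the expectation of f under P_theta, is at most 1.
upper_envelope = max(expectation(p, f) for p in model.values())
print(upper_envelope <= 1.0)
```

With continuous parameter spaces the supremum cannot be enumerated, of course; this is only a finite sanity check of the definition.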
Let be the class of all e-functions w.r. to the statistical model , and be the class of all p-functions w.r. to . We can easily generalize Proposition 2 (the proof stays the same).
Proposition 5.
For any statistical model and ,
For , we regard the e-value as the amount of evidence against the statistical model found by (which must be chosen in advance) when the outcome is . The interpretation of p-values is similar.
In some cases we would like to take the parameter into account more seriously. A measurable function is a conditional e-function w.r. to the statistical model if
Let be the class of all such functions. And a measurable function is a conditional p-function w.r. to if
Let be the class of all such functions.
We can embed (resp. ) into (resp. ) by identifying a function on domain with the function on domain that does not depend on , .
For , we can regard as the amount of evidence against the specific probability measure in the statistical model found by when the outcome is .
We can generalize Proposition 5 further as follows.
Proposition 6.
For any statistical model and ,
(4) 
4 The validity of Bayesian statistics
In this section we establish the validity of Bayesian statistics in our framework, mainly as a sanity check. We will translate the results in [24], which are stated in terms of the algorithmic theory of randomness, to our algorithm-free setting. It is interesting that the proofs simplify radically and become almost obvious. (And remarkably, one statement also simplifies.)
Let be a statistical model, as in the previous section, and be a probability measure on the parameter space . Together, and form a Bayesian model, and is known as the prior measure in this context.
The joint probability measure on the measurable space is defined by
for all measurable and . Let be the marginal distribution of on : for any measurable , .
The product of and is defined as the class of all measurable functions such that, for some and ,
(5) 
Such can be regarded as ways of finding evidence against being produced by the Bayesian model : to have evidence against being produced by we need to have evidence against being produced by the prior measure or evidence against being produced by ; we combine the last two amounts of evidence by multiplying them. The following proposition tells us that this product is precisely the amount of evidence against found by a suitable efunction.
Proposition 7.
If is a statistical model with a prior probability measure on , and is the joint probability measure on , then
(6)
Proposition 7 will be deduced from Theorem 14 in Section 5. It is the analogue of Theorem 1 in [24], which says, in the terminology of that paper, that the level of impossibility of a pair w.r. to the joint probability measure is the product of the level of impossibility of w.r. to the prior measure and the level of impossibility of w.r. to the probability measure . In an important respect, however, Proposition 7 is simpler than Theorem 1 in [24]: in the latter, the level of impossibility of w.r. to has to be conditional on the level of impossibility of w.r. to , whereas in the former there is no such conditioning. Besides, Proposition 7 is more precise: it does not involve any constant factors (specified or otherwise).
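The multiplicative combination of evidence behind Proposition 7 (an e-value for the prior times a conditional e-value for the kernel yields an e-value for the joint measure) can be verified by direct summation in a finite toy Bayesian model. All numbers below are illustrative assumptions.

```python
# Illustrative finite Bayesian model: a prior on two parameter values and a kernel.
prior = {"a": 0.5, "b": 0.5}
kernel = {"a": {0: 0.8, 1: 0.2}, "b": {0: 0.3, 1: 0.7}}

# f: e-function w.r. to the prior (mean 1 under the prior);
# g(theta, .): e-function w.r. to each P_theta (mean at most 1 under each row).
f = {"a": 1.6, "b": 0.4}
g = {"a": {0: 1.0, 1: 1.0}, "b": {0: 2.0, 1: 0.5}}

# Expectation of the product f(theta) * g(theta, omega) under the joint measure
# P(theta, omega) = prior(theta) * kernel[theta][omega]; by Fubini's theorem it
# is at most 1, so the product is an e-function w.r. to the joint measure.
joint_expectation = sum(
    prior[t] * kernel[t][w] * f[t] * g[t][w]
    for t in prior
    for w in (0, 1)
)
print(joint_expectation <= 1.0)
```

The nontrivial direction of Proposition 7 is the converse (every e-function for the joint measure factorizes this way); that requires the conditional-expectation construction in the proof of Theorem 14 and is not reproduced here.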
Remark 8.
The non-algorithmic formula (6) being simpler than its counterpart in the algorithmic theory of randomness is analogous to the non-algorithmic formula being simpler than its counterpart in the algorithmic theory of complexity, being entropy and being prefix complexity. The fact that does not coincide with to within an additive constant, being Kolmogorov complexity, was surprising to Kolmogorov and wasn’t noticed for several years [6, 7].
The inf-projection onto of an e-function w.r. to is the function defined by
Intuitively, regards as typical under the model if it can be extended to a typical for at least one . Let be the set of all such inf-projections.
The results in the rest of this section become simpler if the definitions of classes and are modified slightly: we drop the condition of measurability on their elements and replace all integrals by upper integrals and all measures by outer measures. We will use the modified definitions only in the rest of this section (we could have used them in the whole of this paper, but they become particularly useful here since projections of measurable functions do not have to be measurable [18]).
Proposition 9.
If is a probability measure on and is its marginal distribution on ,
(7) 
Proof.
To check the inclusion “” in (7), let , i.e., . Setting , we have (i.e., ) and is the inf-projection of onto .
To check the inclusion “” in (7), let and . We then have
Proposition 9 says that we can acquire evidence against an outcome being produced by the Bayesian model if and only if we can acquire evidence against being produced by the model for all .
We can combine Propositions 7 and 9, obtaining
The rough interpretation is that we can acquire evidence against being produced by if and only if we can, for each , acquire evidence against being produced by or acquire evidence against being produced by .
The following statements in terms of p-values are cruder, but their interpretation is similar.
Corollary 10.
If and is a Bayesian model,
Proof.
Corollary 11.
If , is a probability measure on , and is its marginal distribution on ,
where is defined similarly to (with in place of ).
Proof.
5 Parametric Bayesian models
Now we generalize the notion of a Bayesian model to that of a parametric Bayesian (or para-Bayesian) model. This is a pair consisting of a statistical model on a sample space and a statistical model on the sample space (so that the sample space of the second statistical model is the parameter space of the first statistical model). Intuitively, a para-Bayesian model is the counterpart of a Bayesian model in the situation of uncertainty about the prior: now the prior is a parametric family of probability measures rather than one probability measure.
The following definitions are straightforward generalizations of the definitions for the Bayesian case. The joint statistical model on the measurable space is defined by
(8) 
for all measurable and . For each , is the marginal distribution of on : for any measurable , . The product of and is still defined as the class of all measurable functions such that, for some and , we have the equality in (5) a.s., for all .
Remark 12.
Another representation of para-Bayesian models is as a sufficient statistic, as elaborated in [11]:

For the para-Bayesian model , the statistic is a sufficient statistic in the statistical model on the product space .

If is a sufficient statistic for a statistical model on a sample space , then is a para-Bayesian model, where is the distribution of , and are (fixed versions of) the conditional distributions given .
Remark 13.
Yet another way to represent a para-Bayesian model is as a Markov family with time horizon :

the initial state space is , the middle one is , and the final one is ;

there is no initial probability measure on , the statistical model is the first Markov kernel, and the statistical model is the second Markov kernel.
Theorem 14.
Proof.
The inclusion “” in (6) follows from the definition of : if and , we have, for all ,
To check the inclusion “” in (6), let . Define and by
(setting, e.g., in the last fraction). Since by definition, a.s., it suffices to check that and . The inclusion follows from the fact that, for any ,
And the inclusion follows from the fact that, for any ,
(we have rather than because of the possibility ). ∎
IID vs exchangeability
De Finetti’s theorem (see, e.g., [15, Theorem 1.49]) establishes a close connection between IID and exchangeability for infinite sequences in , where is a Borel measurable space: namely, the exchangeable probability measures are the convex mixtures of the IID probability measures (in particular, their upper envelopes, and therefore e- and p-functions, coincide). This subsection discusses a somewhat less close connection in the case of sequences of a fixed finite length.
Fix (time horizon), and let be the set of all sequences of elements of (a measurable space, not necessarily Borel) of length . An IID probability measure on is a measure of the type , where is a probability measure on . The configuration of a sequence is the multiset of all elements of , and a configuration measure is the pushforward of an IID probability measure on under the mapping . Therefore, a configuration measure is a measure on the set of all multisets in of size (with the natural quotient algebra).
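The pushforward defining a configuration measure can be computed explicitly in a small binary example, where a configuration (a multiset over {0, 1}) is determined by the number of 1s, so the pushforward of an IID measure is a binomial distribution. The parameters below are illustrative.

```python
from itertools import product
from collections import Counter

# Illustrative IID measure on binary sequences of length 3 with P(1) = 0.3.
p, n = 0.3, 3

def iid_prob(seq):
    k = sum(seq)
    return p ** k * (1 - p) ** (n - k)

# Push the IID measure forward under the configuration mapping; for binary
# sequences a configuration (multiset) is determined by the number of 1s.
config_measure = Counter()
for seq in product((0, 1), repeat=n):
    config_measure[sum(seq)] += iid_prob(seq)

# The resulting configuration measure is binomial(n, p) on {0, ..., n}:
# e.g., the configuration with one 1 has probability 3 * 0.3 * 0.7**2.
print(abs(config_measure[1] - 3 * 0.3 * 0.7 ** 2) < 1e-12)
```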
Let be the class of all e-functions w.r. to the family of all IID probability measures on and be the class of all e-functions w.r. to the family of all configuration probability measures. Let be the class of all e-functions w.r. to the family of all exchangeable probability measures on ; remember that a probability measure on is exchangeable if, for any permutation and any measurable set ,
The product of and is the set of all measurable functions such that, for some and ,
holds for almost all (under any IID probability measure).
Corollary 15.
It is true that
Proof.
It suffices to apply Theorem 14 in the situation where is the set of all configurations, is the probability measure on concentrated on the set of all sequences with the configuration and uniform on that set (we can order arbitrarily, and then assigns weight to each permutation of that ordering), is the set of all IID probability measures on , and is the pushforward of w.r. to the mapping . ∎
6 Bernoulli sequences: IID vs exchangeability
In this section we apply the definitions and results of the previous sections to the problem of defining Bernoulli sequences. Kolmogorov’s main publications on this topic are [7] and [8]. The results of this section will be algorithm-free versions of the results in [19] (also described in V’yugin’s review [25], Sections 11–13).
The definitions of the previous subsection simplify as follows. Now is the set of all binary sequences of length . Let be the class of all e-functions w.r. to the family of all Bernoulli IID probability measures on (this is a special case of ) and be the class of all e-functions w.r. to the family of all binomial probability measures on (this is a special case of ); remember that the Bernoulli measure with parameter is defined by , where is the number of 1s in , and the binomial measure with parameter is defined by . (The notation for the number of 1s in is motivated by being the sum of the elements of .)
We continue to use the notation for the class of all e-functions w.r. to the family of all exchangeable probability measures on ; a probability measure on is exchangeable if and only if depends on only via . It is clear that a function is in if and only if, for each ,
The product of and is the set of all functions for and . The following is a special case of Corollary 15.
Corollary 16.
It is true that
The intuition behind Corollary 16 is that a sequence is Bernoulli if and only if it is exchangeable and the number of 1s in it is binomial. The analogue of Corollary 16 in the algorithmic theory of randomness is Theorem 1 in [19], which says, using the terminology of that paper, that the Bernoulliness deficiency of equals the binomiality deficiency of plus the conditional randomness deficiency of in the set of all sequences in with 1s given the binomiality deficiency of . Corollary 16 is simpler since it does not involve any analogue of the condition “given the binomiality deficiency of ”. Theorem 1 of [19] was generalized to the non-binary case in [22] (Theorem 3 of [22], given without a proof, is an algorithmic analogue of Corollary 15).
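The level-set characterization of exchangeability e-functions stated above (for each number k of 1s, the average over the sequences with exactly k 1s is at most 1) can be checked by enumeration for small sequence lengths. The candidate function below, which bets that all 1s precede all 0s, is a made-up example.

```python
from itertools import product
from math import comb

n = 4
seqs = list(product((0, 1), repeat=n))

# Made-up candidate e-function: bets that all 1s come before all 0s,
# staking everything on the single "sorted" sequence at each level.
def f(x):
    return comb(n, sum(x)) if x == tuple(sorted(x, reverse=True)) else 0.0

# Exchangeability condition: for each number of 1s k, the average of f over
# the comb(n, k) sequences with exactly k 1s is at most 1 (the extreme points
# of the exchangeable measures are uniform on these level sets).
def level_average(k):
    level = [s for s in seqs if sum(s) == k]
    return sum(f(s) for s in level) / len(level)

print(all(level_average(k) <= 1.0 for k in range(n + 1)))
```

This f is exchangeability-valid yet carries no evidence about the binomiality of the number of 1s, which is exactly the ingredient Corollary 16 adds on top of exchangeability.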
Remark 17.
Kolmogorov’s definition of Bernoulli sequences is via exchangeability. We can regard this definition as an approximation to definitions taking into account the binomiality of the number of 1s. In the paper [7] Kolmogorov uses the word “approximately” when introducing his notion of Bernoulliness (p. 663, lines 5–6 after the 4th displayed equation). However, it would be wrong to assume that here he acknowledges disregarding the requirement that the number of 1s should be binomial; this is not what he meant when he used the word “approximately” [10].
The reason for Kolmogorov’s definition of Bernoulliness being different from the definitions based on e-values and p-values is that carries too much information about ; intuitively [20], contains not only useful information about the probability of 1 but also noise. To reduce the amount of noise, we will use an imperfect estimator of . Set
(9)
where stands for integer part. Let be the estimator of defined by , where is the element of the set (9) that is nearest to among those satisfying ; if such elements do not exist, set .
Denote by the partition of the set into the subsets , where . For any , denotes the element of the partition containing . Let be the class of all e-functions w.r. to the statistical model , being the uniform probability measure on . (This is a Kolmogorov-type statistical model, consisting of uniform probability measures on finite sets; see, e.g., [23, Section 4].)
Theorem 18.
For some universal constant ,
The analogue of Theorem 18 in the algorithmic theory of randomness is Theorem 2 in [19], and the proof of Theorem 18 can be extracted from that of Theorem 2 in [19] (details omitted).
Remark 19.
To conclude this section, let us extract corollaries in terms of p-values from Corollary 16 and Theorem 18; we will use the obvious notation , , and .
Corollary 20.
For each ,
(10) 
Proof.
Corollary 21.
There is a universal constant such that, for each ,
(11) 
Proof.
As in the previous proof, the left inclusion of (11) follows from
and the right inclusion from
where stands for a positive universal constant. ∎
7 Conclusion
In this section we discuss some directions of further research. A major advantage of the nonalgorithmic approach to randomness proposed in this paper is the absence of unspecified constants; in principle, all constants can be computed. The most obvious open problem is to find the best constant in Theorem 18.
In Section 6 we discussed a possible implementation of Kolmogorov’s idea of defining Bernoulli sequences. However, Kolmogorov’s idea was part of a wider programme; e.g., in [8, Section 5] he sketches a way of applying a similar approach to Markov sequences. For other possible applications, see [23, Section 4] (most of these applications were mentioned by Kolmogorov in his papers and talks). Analogues of Corollary 16 in Section 6 can be established for these other applications (cf. [11] and Remark 12), but it is not obvious whether Theorem 18 can be extended in a similar way.
References
 [1] Nicolas Bourbaki. Elements of Mathematics. Integration. Springer, Berlin, 2004. In two volumes. The French originals published in 1952–1969.
 [2] Peter Gács. Uniform test of algorithmic randomness over a general space. Theoretical Computer Science, 341:91–137, 2005. A later version of this paper (2013) is available on the author’s web site (accessed in October 2019).
 [3] Alex Gammerman and Vladimir Vovk. Data labelling apparatus and method thereof, 2003. US Patent Application 0236578 A1. Available on the Internet (accessed in October 2019).
 [4] Peter Grünwald, Rianne de Heide, and Wouter M. Koolen. Safe testing. Technical Report arXiv:1906.07801 [math.ST], arXiv.org ePrint archive, June 2019.

 [5] Yuri Gurevich and Vladimir Vovk. Test statistics and p-values. Proceedings of Machine Learning Research, 105:89–104, 2019. COPA 2019.
 [6] Andrei N. Kolmogorov. Several theorems on algorithmic entropy and the algorithmic amount of information. Успехи математических наук, 23(2):201, 1968. Russian original: Несколько теорем об алгоритмической энтропии и алгоритмическом количестве информации. Abstract of a talk before the Moscow Mathematical Society. Meeting of 31 October 1967.

 [7] Andrei N. Kolmogorov. Logical basis for information theory and probability theory. IEEE Transactions on Information Theory, IT-14:662–664, 1968. Russian original: К логическим основам теории информации и теории вероятностей, published in Проблемы передачи информации.
 [8] Andrei N. Kolmogorov. Combinatorial foundations of information theory and the calculus of probabilities. Russian Mathematical Surveys, 38:29–40, 1983. Russian original: Комбинаторные основания теории информации и исчисления вероятностей.
 [9] Andrei N. Kolmogorov. On logical foundations of probability theory. In Yu. V. Prokhorov and K. Itô, editors, Probability Theory and Mathematical Statistics, volume 1021 of Lecture Notes in Mathematics, pages 1–5. Springer, 1983. Talk at the Fourth USSR–Japan Symposium on Probability Theory and Mathematical Statistics (Tbilisi, August 1982) recorded by Alexander A. Novikov, Alexander K. Zvonkin, and Alexander Shen. Our quote follows Selected Works of A. N. Kolmogorov, volume II, Probability Theory and Mathematical Statistics, edited by A. N. Shiryayev, Kluwer, Dordrecht, p. 518.
 [10] Andrei N. Kolmogorov. Personal communication, ca 1983.
 [11] Steffen L. Lauritzen. Extremal Families and Systems of Sufficient Statistics, volume 49 of Lecture Notes in Statistics. Springer, New York, 1988.
 [12] Leonid A. Levin. Uniform tests of randomness. Soviet Mathematics Doklady, 17:337–340, 1976. Russian original: Равномерные тесты случайности.
 [13] Ming Li and Paul Vitányi. An Introduction to Kolmogorov Complexity and Its Applications. Springer, New York, third edition, 2008.
 [14] Per Martin-Löf. The definition of random sequences. Information and Control, 9:602–619, 1966.
 [15] Mark J. Schervish. Theory of Statistics. Springer, New York, 1995.
 [16] Glenn Shafer. The language of betting as a strategy for statistical and scientific communication. Technical Report arXiv:1903.06991 [math.ST], arXiv.org ePrint archive, March 2019.
 [17] Glenn Shafer and Vladimir Vovk. Game-Theoretic Foundations for Probability and Finance. Wiley, Hoboken, NJ, 2019.
 [18] Mikhail Ya. Souslin. Sur une définition des ensembles mesurables B sans nombres transfinis. Comptes rendus hebdomadaires des séances de l’Académie des sciences, 164:88–91, 1917.
 [19] Vladimir Vovk. On the concept of the Bernoulli property. Russian Mathematical Surveys, 41:247–248, 1986. Russian original: О понятии бернуллиевости. Another English translation with proofs: [21].
 [20] Vladimir Vovk. Learning about the parameter of the Bernoulli model. Journal of Computer and System Sciences, 55:96–104, 1997.
 [21] Vladimir Vovk. On the concept of Bernoulliness. Technical Report arXiv:1612.08859 [math.ST], arXiv.org ePrint archive, December 2016.
 [22] Vladimir Vovk, Alex Gammerman, and Craig Saunders. Machine-learning applications of algorithmic randomness. In Proceedings of the Sixteenth International Conference on Machine Learning, pages 444–453, San Francisco, CA, 1999. Morgan Kaufmann.
 [23] Vladimir Vovk and Glenn Shafer. Kolmogorov’s contributions to the foundations of probability. Problems of Information Transmission, 39:21–31, 2003.
 [24] Vladimir Vovk and Vladimir V. V’yugin. On the empirical validity of the Bayesian method. Journal of the Royal Statistical Society B, 55:253–266, 1993.
 [25] Vladimir V. V’yugin. Algorithmic complexity and stochastic properties of finite binary sequences. Computer Journal, 42:294–317, 1999.