Non-algorithmic theory of randomness

10/01/2019 ∙ by Vladimir Vovk, et al.

This paper proposes an alternative language for expressing results of the algorithmic theory of randomness. The language is more precise in that it does not involve unspecified additive or multiplicative constants, making mathematical results, in principle, applicable in practice. Our main testing ground for the proposed language is the problem of defining Bernoulli sequences, which was of great interest to Andrei Kolmogorov and his students.


1 Introduction

There has been a great deal of criticism of the notion of p-value lately, and in particular, Glenn Shafer [16] defended the use of betting scores instead. This paper refers to betting scores as e-values and demonstrates their advantages by establishing results that become much more precise when they are stated in terms of e-values instead of p-values.

Both p-values and e-values have been used, albeit somewhat implicitly, in the algorithmic theory of randomness: Martin-Löf’s tests of algorithmic randomness [14] are an algorithmic version of p-functions (i.e., functions producing p-values [5]) while Levin’s tests of algorithmic randomness [12, 2] are an algorithmic version of e-functions (this is the term we will use in this paper for functions producing e-values). Levin’s tests are a natural modification of Martin-Löf’s tests leading to simpler mathematical results; similarly, many mathematical results stated in terms of p-values become simpler when stated in terms of e-values.

The algorithmic theory of randomness is a powerful source of intuition, but strictly speaking, its results are not applicable in practice since they always involve unspecified additive or multiplicative constants. The goal of this paper is to explore ways of obtaining results that are more precise; in particular, results that may be applicable in practice. The price to pay is that our results may involve more quantifiers (usually hidden in our notation) and, therefore, their statements may at first appear less intuitive.

In Section 2 we define p-functions and e-functions in the context of testing simple statistical hypotheses, explore relations between them, and explain the intuition behind them. In Section 3 we generalize these definitions, results, and explanations to testing composite statistical hypotheses.

Section 4 is devoted to testing in Bayesian statistics and gives non-algorithmic results that are particularly clean and intuitive. They will be used as technical tools later in the paper. In Section 5 these results are slightly extended and then applied to clarifying the difference between statistical randomness and exchangeability. (In this paper we use “statistical randomness” to refer to being produced by an IID probability measure; there will always be either “algorithmic” or “statistical” standing next to “randomness” in order to distinguish between the two meanings.)

Section 6 explores the question of defining Bernoulli sequences, which was of great interest to Kolmogorov [7], Martin-Löf [14], and Kolmogorov’s other students. Kolmogorov defined Bernoulli sequences as exchangeable sequences, but we will see that another natural definition is narrower than exchangeability.

Kolmogorov paid particular attention to algorithmic randomness w.r. to uniform probability measures on finite sets. On one hand, he believed that his notion of algorithmic randomness in this context “can be regarded as definitive” [9], and on the other hand, he never seriously suggested any generalizations of this notion (and never endorsed generalizations proposed by his students). In Section 6 we state a simple result in this direction that characterizes the difference between Bernoulliness and exchangeability.

In Sections 4 and 6 we state our results first in terms of e-functions and then p-functions. Results in terms of e-functions are always simpler and cleaner, supporting Glenn Shafer’s recommendation in [16] to use betting scores more widely.

Remark 1.

There is no standard terminology for what we call e-values and e-functions. In addition to Shafer’s use of “betting scores” for our e-values,

  • Grünwald et al. [4] refer to e-values as “s-values” (“s” for “safe”),

  • and Gammerman and Vovk [3] refer to the inverses of e-values as “i-values” (“i” for “integral”).

No formal knowledge of the algorithmic theory of randomness will be assumed in this paper; the reader can safely ignore all comparisons between our results and results of the algorithmic theory of randomness.

Notation

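Our notation will be mostly standard or defined at the point where it is first used. If $\mathcal{F}$ is a class of $[0,\infty]$-valued functions on some set $\Omega$ and $g:[0,\infty]\to[0,\infty]$ is a function, we let $g(\mathcal{F})$ stand for the set of all compositions $g\circ f$, $f\in\mathcal{F}$ (i.e., $g$ is applied to $\mathcal{F}$ element-wise). We will also use obvious modifications of this definition: e.g., $0.5\mathcal{F}^{-0.5}$ would be interpreted as $g(\mathcal{F})$, where $g(u):=0.5u^{-0.5}$ for $u\in[0,\infty]$.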

2 Testing simple statistical hypotheses

Let $P$ be a probability measure on a measurable space $\Omega$. A p-function [5] is a measurable function $f:\Omega\to[0,\infty]$ such that, for any $\epsilon>0$, $P\{f\le\epsilon\}\le\epsilon$. An e-function is a measurable function $f:\Omega\to[0,\infty]$ such that $\int f\,dP\le1$. (E-functions have been promoted in [16], [4], and [17, Section 11.5], but using different terminology.)

Let $\mathcal{P}_P$ be the class of all p-functions and $\mathcal{E}_P$ be the class of all e-functions, where the underlying measure $P$ is shown as subscript. We can define p-values and e-values as values taken by p-functions and e-functions, respectively. The intuition behind p-values and e-values will be discussed later in this section.

The following is an algorithm-free version of the standard relation (see, e.g., [13, Lemma 4.3.5]) between Martin-Löf’s and Levin’s algorithmic notions of randomness deficiency.

Proposition 2.

For any probability measure $P$ and $\kappa\in(0,1)$,

$$\kappa\mathcal{P}_P^{\kappa-1}\subseteq\mathcal{E}_P\subseteq\mathcal{P}_P^{-1}.\tag{1}$$

Proof.

The right inclusion in (1) follows from the Markov inequality: if $f$ is an e-function, then, for any $\epsilon>0$,

$$P\{f^{-1}\le\epsilon\}=P\{f\ge1/\epsilon\}\le\epsilon\int f\,dP\le\epsilon.$$

The left inclusion in (1) follows from [17, Section 11.5]. The value of the constant $\kappa$ in front of the $\mathcal{P}_P^{\kappa-1}$ on the left-hand side of (1) follows from $\int_0^1 p^{\kappa-1}\,dp=1/\kappa$. ∎

Both p-functions and e-functions can be used for testing statistical hypotheses. In this section we only discuss simple statistical hypotheses, i.e., probability measures. Observing a large e-value or a small p-value w.r. to a simple statistical hypothesis $P$ entitles us to rejecting $P$ as the source of the observed data, provided the e-function or p-function were chosen in advance. The e-value can be interpreted as the amount of evidence against $P$ found by our chosen e-function. Similarly, the p-value reflects the amount of evidence against $P$ on a different scale; small p-values reflect a large amount of evidence against $P$.

Remark 3.

Proposition 2 tells us that using p-values and using e-values are equivalent, on a rather crude scale. Roughly, a p-value of $p$ corresponds to an e-value of $1/p$. The right inclusion in (1) says that any way of producing e-values $e$ can be translated into a way of producing p-values $1/e$. On the other hand, the left inclusion in (1) says that any way of producing p-values $p$ can be translated into a way of producing e-values $\kappa p^{\kappa-1}\approx1/p$, where the “$\approx$” assumes that we are interested in the asymptotics as $p\to0$, $\kappa>0$ is small, and we ignore positive constant factors (as customary in the algorithmic theory of randomness).
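The two translations described in Remark 3 can be sketched numerically. The following Python snippet is only an illustration under stated assumptions: the names `p_to_e` and `e_to_p`, the toy sample space {1, …, n} with the uniform measure, and the choice κ = 0.5 are not taken from the paper.

```python
import math

def p_to_e(p, kappa):
    """Left inclusion in (1): map a p-value p to the e-value kappa * p**(kappa - 1)."""
    if not 0 < kappa < 1:
        raise ValueError("kappa must lie in (0, 1)")
    return math.inf if p == 0 else kappa * p ** (kappa - 1)

def e_to_p(e):
    """Right inclusion in (1): map an e-value e to the p-value 1/e."""
    return math.inf if e == 0 else 1.0 / e

# Toy setting (an assumption of this sketch): uniform measure on {1, ..., n}.
n = 1000
kappa = 0.5

def p_func(w):
    # P{p_func <= eps} = floor(eps * n) / n <= eps, so this is a p-function.
    return w / n

def e_func(w):
    return p_to_e(p_func(w), kappa)

# An e-function must have expectation at most 1 under the uniform measure.
mean_e = sum(e_func(w) for w in range(1, n + 1)) / n

def back_to_p(w):
    # Translating the e-value back gives 2 * sqrt(p), not p: the scale is crude.
    return e_to_p(e_func(w))

# Check the defining property of a p-function on a grid of thresholds.
thresholds = [i / 100 for i in range(1, 101)]
p_property_holds = all(
    sum(back_to_p(w) <= eps for w in range(1, n + 1)) / n <= eps
    for eps in thresholds
)
```

The round trip p ↦ e ↦ 1/e does not return the original p-value, which is one way to see why the equivalence in Proposition 2 holds only on a crude scale.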

Remark 4.

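Proposition 2 can be greatly strengthened, under the assumptions of Remark 3. For example, we can replace (1) by

$$H_\kappa(\mathcal{P}_P)\subseteq\mathcal{E}_P\subseteq\mathcal{P}_P^{-1},$$

where

$$H_\kappa(v):=\begin{cases}\kappa(1+\kappa)^\kappa\,v^{-1}(-\ln v)^{-1-\kappa}&\text{if }v\le e^{-1-\kappa}\\0&\text{otherwise}\end{cases}\tag{2}$$

and $\kappa\in(0,\infty)$ (see [17, Section 11.1]). The value of the coefficient $\kappa(1+\kappa)^\kappa$ in (2) follows from

$$\int_0^{e^{-1-\kappa}}v^{-1}(-\ln v)^{-1-\kappa}\,dv=\frac{1}{\kappa(1+\kappa)^\kappa}.$$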
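The normalization behind a calibrator of the kind discussed in Remark 4 can be checked numerically. The sketch below assumes the specific form H_κ(v) = κ(1+κ)^κ v^{-1} (−ln v)^{−1−κ} for v ≤ exp(−1−κ) and 0 otherwise; this form, the function names, and the quadrature parameters are assumptions of the example rather than statements quoted from the paper.

```python
import math

def calibrator(v, kappa):
    """Assumed form of H_kappa(v); it vanishes above exp(-1 - kappa)."""
    if v <= 0:
        return math.inf
    if v > math.exp(-1 - kappa):
        return 0.0
    return kappa * (1 + kappa) ** kappa / (v * (-math.log(v)) ** (1 + kappa))

def integral_of_calibrator(kappa, s_max=60.0, steps=200_000):
    """Midpoint rule for the integral of H_kappa over [0, 1].

    After the substitution v = exp(-exp(s)), H_kappa(v) dv turns into
    kappa * (1 + kappa)**kappa * exp(-kappa * s) ds on [log(1 + kappa), infinity).
    Truncating at s_max drops a tail of size (1 + kappa)**kappa * exp(-kappa * s_max).
    """
    s_min = math.log(1 + kappa)
    width = (s_max - s_min) / steps
    coefficient = kappa * (1 + kappa) ** kappa
    total = 0.0
    for i in range(steps):
        s = s_min + (i + 0.5) * width
        total += coefficient * math.exp(-kappa * s) * width
    return total

total_mass = integral_of_calibrator(kappa=1.0)
```

A total mass of 1 is what makes H_κ applied to a p-function integrate to at most 1, in the same way as the constant κ does for κ p^{κ−1}.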

3 Testing composite statistical hypotheses

Let $\Omega$ be a measurable space, which we will refer to as our sample space, and $\Theta$ be another measurable space (our parameter space). We say that $P=(P_\theta\mid\theta\in\Theta)$ is a statistical model on $\Omega$ if $P$ is a Markov kernel with source $\Theta$ and target $\Omega$: each $P_\theta$ is a probability measure on $\Omega$, and for each measurable $A\subseteq\Omega$, the function $P_\theta(A)$ of $\theta\in\Theta$ is measurable.

The notions of an e-function and a p-function each split in two. We are usually really interested only in the outcome $\omega$, while the parameter $\theta$ is an auxiliary modelling tool. This motivates the following pair of simpler definitions. A measurable function $f:\Omega\to[0,\infty]$ is an e-function w.r. to the statistical model $P$ (which is our composite statistical hypothesis in this context) if

$$\forall\theta\in\Theta:\ \int f(\omega)\,P_\theta(d\omega)\le1.$$

In other words, if $P^*(f)\le1$, where $P^*$ is the upper envelope

$$P^*(f):=\sup_{\theta\in\Theta}\int f(\omega)\,P_\theta(d\omega)\tag{3}$$

(in Bourbaki’s [1, IX.1.1] terminology, $P^*$ is an encumbrance provided the integral in (3) is understood as the upper integral). Similarly, a measurable function $f:\Omega\to[0,\infty]$ is a p-function w.r. to the statistical model $P$ if, for any $\epsilon>0$,

$$\forall\theta\in\Theta:\ P_\theta\{f\le\epsilon\}\le\epsilon.$$

In other words, if, for any $\epsilon>0$, $P^*(1_{\{f\le\epsilon\}})\le\epsilon$.

Let $\mathcal{E}_P$ be the class of all e-functions w.r. to the statistical model $P$, and $\mathcal{P}_P$ be the class of all p-functions w.r. to $P$. We can easily generalize Proposition 2 (the proof stays the same).

Proposition 5.

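For any statistical model $P$ and $\kappa\in(0,1)$,

$$\kappa\mathcal{P}_P^{\kappa-1}\subseteq\mathcal{E}_P\subseteq\mathcal{P}_P^{-1}.$$

For $f\in\mathcal{E}_P$, we regard the e-value $f(\omega)$ as the amount of evidence against the statistical model $P$ found by $f$ (which must be chosen in advance) when the outcome is $\omega$. The interpretation of p-values is similar.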
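The definition of an e-function w.r. to a composite hypothesis can be checked directly on a small example. The sketch below is an illustration only: the binomial model with N = 10, the finite grid standing in for the parameter space [0, 0.5], and the likelihood ratio against the alternative 0.8 are assumptions of the example, and the maximum over a grid only approximates the supremum in the upper envelope.

```python
from math import comb

N = 10                                   # number of coin tosses (assumed)
THETAS = [i / 200 for i in range(101)]   # grid standing in for Theta = [0, 0.5]

def binomial_pmf(k, theta):
    """Probability of k successes in N tosses with success probability theta."""
    return comb(N, k) * theta ** k * (1 - theta) ** (N - k)

def upper_envelope(f):
    """Grid approximation of the sup over theta of the integral of f under P_theta."""
    return max(
        sum(f(k) * binomial_pmf(k, theta) for k in range(N + 1))
        for theta in THETAS
    )

def likelihood_ratio(k, alt=0.8, ref=0.5):
    """Likelihood ratio of theta = alt against the boundary point theta = ref."""
    return (alt / ref) ** k * ((1 - alt) / (1 - ref)) ** (N - k)

# The expectation under theta equals (0.4 + 1.2 * theta) ** N, which is at most 1
# on [0, 0.5], so the likelihood ratio is an e-function w.r. to this model.
envelope_value = upper_envelope(likelihood_ratio)
```

Here a large value of `likelihood_ratio(k)` counts as evidence against every θ ≤ 0.5 at once, which is what the upper envelope condition encodes.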

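In some cases we would like to take the parameter $\theta$ into account more seriously. A measurable function $f:\Omega\times\Theta\to[0,\infty]$ is a conditional e-function w.r. to the statistical model $P$ if

$$\forall\theta\in\Theta:\ \int f(\omega;\theta)\,P_\theta(d\omega)\le1.$$

Let $\bar{\mathcal{E}}_P$ be the class of all such functions. And a measurable function $f:\Omega\times\Theta\to[0,\infty]$ is a conditional p-function w.r. to $P$ if

$$\forall\epsilon>0\ \forall\theta\in\Theta:\ P_\theta\{\omega\mid f(\omega;\theta)\le\epsilon\}\le\epsilon.$$

Let $\bar{\mathcal{P}}_P$ be the class of all such functions.

We can embed $\mathcal{E}_P$ (resp. $\mathcal{P}_P$) into $\bar{\mathcal{E}}_P$ (resp. $\bar{\mathcal{P}}_P$) by identifying a function $f$ on domain $\Omega$ with the function $f'$ on domain $\Omega\times\Theta$ that does not depend on $\theta$, $f'(\omega;\theta):=f(\omega)$.

For $f\in\bar{\mathcal{E}}_P$, we can regard $f(\omega;\theta)$ as the amount of evidence against the specific probability measure $P_\theta$ in the statistical model $P$ found by $f$ when the outcome is $\omega$.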
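The difference between conditional and unconditional e-functions can be illustrated on a finite model. In the sketch below the binomial model, the parameter grid, and the likelihood-ratio construction f(k; θ) = R(k) / P_θ(k) with R uniform are all assumptions chosen for the example; the paper does not prescribe them.

```python
from math import comb

N = 6                                      # number of tosses (assumed)
THETAS = [i / 10 for i in range(1, 10)]    # assumed finite parameter grid 0.1, ..., 0.9

def binomial_pmf(k, theta):
    return comb(N, k) * theta ** k * (1 - theta) ** (N - k)

def conditional_e(k, theta):
    """f(k; theta) = R(k) / P_theta(k), where R is uniform on {0, ..., N}."""
    return (1 / (N + 1)) / binomial_pmf(k, theta)

def unconditional_e(k):
    """Infimum of f(k; theta) over theta: a function of the outcome alone."""
    return min(conditional_e(k, theta) for theta in THETAS)

# For every theta the conditional e-function integrates to exactly 1 ...
conditional_integrals = [
    sum(conditional_e(k, t) * binomial_pmf(k, t) for k in range(N + 1))
    for t in THETAS
]
# ... and the theta-free function integrates to at most 1 under every P_theta.
unconditional_integrals = [
    sum(unconditional_e(k) * binomial_pmf(k, t) for k in range(N + 1))
    for t in THETAS
]
```

The function `unconditional_e`, viewed as not depending on θ, is an instance of the embedding of unconditional e-functions into conditional ones.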

We can generalize Proposition 5 further as follows.

Proposition 6.

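For any statistical model $P$ and $\kappa\in(0,1)$,

$$\kappa\bar{\mathcal{P}}_P^{\kappa-1}\subseteq\bar{\mathcal{E}}_P\subseteq\bar{\mathcal{P}}_P^{-1}.\tag{4}$$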

Remarks 3 and 4 are also applicable in the context of Propositions 5 and 6.

4 The validity of Bayesian statistics

In this section we establish the validity of Bayesian statistics in our framework, mainly as a sanity check. We will translate the results in [24], which are stated in terms of the algorithmic theory of randomness, to our algorithm-free setting. It is interesting that the proofs simplify radically, and become almost obvious. (And remarkably, one statement also simplifies.)

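Let $P=(P_\theta\mid\theta\in\Theta)$ be a statistical model, as in the previous section, and $Q$ be a probability measure on the parameter space $\Theta$. Together, $P$ and $Q$ form a Bayesian model, and $Q$ is known as the prior measure in this context.

The joint probability measure $T$ on the measurable space $\Omega\times\Theta$ is defined by

$$T(A\times B):=\int_B P_\theta(A)\,Q(d\theta)$$

for all measurable $A\subseteq\Omega$ and $B\subseteq\Theta$. Let $Y$ be the marginal distribution of $T$ on $\Omega$: for any measurable $A\subseteq\Omega$, $Y(A):=T(A\times\Theta)$.

The product $\bar{\mathcal{E}}_P\mathcal{E}_Q$ of $\bar{\mathcal{E}}_P$ and $\mathcal{E}_Q$ is defined as the class of all measurable functions $h:\Omega\times\Theta\to[0,\infty]$ such that, for some $f\in\bar{\mathcal{E}}_P$ and $g\in\mathcal{E}_Q$,

$$h(\omega,\theta)=f(\omega;\theta)\,g(\theta)\quad T\text{-a.s.}\tag{5}$$

Such $h$ can be regarded as ways of finding evidence against $(\omega,\theta)$ being produced by the Bayesian model $(P,Q)$: to have evidence against $(\omega,\theta)$ being produced by $(P,Q)$ we need to have evidence against $\theta$ being produced by the prior measure $Q$ or evidence against $\omega$ being produced by $P_\theta$; we combine the last two amounts of evidence by multiplying them. The following proposition tells us that this product is precisely the amount of evidence against $T$ found by a suitable e-function.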

Proposition 7.

If $P=(P_\theta\mid\theta\in\Theta)$ is a statistical model with a prior probability measure $Q$ on $\Theta$, and $T$ is the joint probability measure on $\Omega\times\Theta$, then

$$\mathcal{E}_T=\bar{\mathcal{E}}_P\mathcal{E}_Q.\tag{6}$$

Proposition 7 will be deduced from Theorem 14 in Section 5. It is the analogue of Theorem 1 in [24], which says, in the terminology of that paper, that the level of impossibility of a pair $(\theta,\omega)$ w.r. to the joint probability measure $T$ is the product of the level of impossibility of $\theta$ w.r. to the prior measure $Q$ and the level of impossibility of $\omega$ w.r. to the probability measure $P_\theta$. In an important respect, however, Proposition 7 is simpler than Theorem 1 in [24]: in the latter, the level of impossibility of $\omega$ w.r. to $P_\theta$ has to be conditional on the level of impossibility of $\theta$ w.r. to $Q$, whereas in the former there is no such conditioning. Besides, Proposition 7 is more precise: it does not involve any constant factors (specified or otherwise).

Remark 8.

The non-algorithmic formula (6) being simpler than its counterpart in the algorithmic theory of randomness is analogous to the non-algorithmic formula $H(x,y)=H(x)+H(y\mid x)$ being simpler than its counterpart $K(x,y)=K(x)+K(y\mid x,K(x))$ in the algorithmic theory of complexity, $H$ being entropy and $K$ being prefix complexity. The fact that $C(x,y)$ does not coincide with $C(x)+C(y\mid x)$ to within an additive constant, $C$ being Kolmogorov complexity, was surprising to Kolmogorov and wasn’t noticed for several years [6, 7].

The inf-projection onto $\Omega$ of an e-function $g\in\mathcal{E}_T$ w.r. to $T$ is the function $\operatorname{proj}^{\inf}_\Omega g:\Omega\to[0,\infty]$ defined by

$$(\operatorname{proj}^{\inf}_\Omega g)(\omega):=\inf_{\theta\in\Theta}g(\omega,\theta).$$

Intuitively, $\operatorname{proj}^{\inf}_\Omega g$ regards $\omega$ as typical under the model if it can be extended to a typical $(\omega,\theta)$ for at least one $\theta$. Let $\operatorname{proj}^{\inf}_\Omega\mathcal{E}_T$ be the set of all such inf-projections.

The results in the rest of this section become simpler if the definitions of classes $\mathcal{E}$ and $\mathcal{P}$ are modified slightly: we drop the condition of measurability on their elements and replace all integrals by upper integrals and all measures by outer measures. We will use the modified definitions only in the rest of this section (we could have used them in the whole of this paper, but they become particularly useful here since projections of measurable functions do not have to be measurable [18]).

Proposition 9.

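If $T$ is a probability measure on $\Omega\times\Theta$ and $Y$ is its marginal distribution on $\Omega$,

$$\operatorname{proj}^{\inf}_\Omega\mathcal{E}_T=\mathcal{E}_Y.\tag{7}$$

Proof.

To check the inclusion “$\supseteq$” in (7), let $f\in\mathcal{E}_Y$, i.e., $\int f\,dY\le1$. Setting $g(\omega,\theta):=f(\omega)$, we have $\int g\,dT\le1$ (i.e., $g\in\mathcal{E}_T$) and $f$ is the inf-projection of $g$ onto $\Omega$.

To check the inclusion “$\subseteq$” in (7), let $g\in\mathcal{E}_T$ and $f:=\operatorname{proj}^{\inf}_\Omega g$. We then have

$$\int f\,dY=\int f(\omega)\,T(d\omega,d\theta)\le\int g(\omega,\theta)\,T(d\omega,d\theta)\le1.\qquad∎$$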

Proposition 9 says that we can acquire evidence against an outcome $\omega$ being produced by the Bayesian model $(P,Q)$ if and only if we can acquire evidence against $(\omega,\theta)$ being produced by the model for all $\theta$.
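Both inclusions in Proposition 9 can be traced on a finite example. The sketch below is a toy illustration: the sample space {0, 1, 2}, the parameter space {0, 1}, the joint measure `T`, and the particular functions `g` and `f0` are arbitrary assumptions, and exact rational arithmetic is used so that the inequalities are checked without rounding.

```python
from fractions import Fraction as F

OMEGA = range(3)
THETA = range(2)

# Joint probability measure on Omega x Theta, stored as T[omega][theta] (assumed values).
T = [[F(1, 4), F(1, 12)],
     [F(1, 8), F(1, 4)],
     [F(1, 8), F(1, 6)]]

# Marginal distribution on Omega.
Y = [sum(T[w][t] for t in THETA) for w in OMEGA]

def integral_T(g):
    return sum(g[w][t] * T[w][t] for w in OMEGA for t in THETA)

def integral_Y(f):
    return sum(f[w] * Y[w] for w in OMEGA)

def inf_projection(g):
    """Pointwise infimum over theta of a function on Omega x Theta."""
    return [min(g[w][t] for t in THETA) for w in OMEGA]

# An e-function w.r. to T: its integral is 23/24, which is at most 1.
g = [[F(2), F(0)],
     [F(1, 2), F(1)],
     [F(1, 2), F(1, 2)]]
f = inf_projection(g)          # its inf-projection is an e-function w.r. to Y

# Conversely, an e-function w.r. to Y extends to one w.r. to T by ignoring theta.
f0 = [F(3), F(0), F(0)]
extension = [[f0[w] for _ in THETA] for w in OMEGA]
```

The chain `integral_Y(f) <= integral_T(g) <= 1` is the finite counterpart of the displayed inequality in the proof.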

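We can combine Propositions 7 and 9 obtaining

$$\mathcal{E}_Y=\operatorname{proj}^{\inf}_\Omega\mathcal{E}_T=\operatorname{proj}^{\inf}_\Omega\left(\bar{\mathcal{E}}_P\mathcal{E}_Q\right).$$

The rough interpretation is that we can acquire evidence against $\omega$ being produced by $Y$ if and only if we can, for each $\theta$, acquire evidence against $\theta$ being produced by $Q$ or acquire evidence against $\omega$ being produced by $P_\theta$.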

The following statements in terms of p-values are cruder, but their interpretation is similar.

Corollary 10.

If and is a Bayesian model,

Proof.

We can rewrite (4) in Proposition 6 as

and as

with similar representations for (2) in Proposition 2 and (5) in Proposition 5. Therefore, by (6) in Proposition 7,

and

Corollary 11.

If , is a probability measure on , and is its marginal distribution on ,

where is defined similarly to (with in place of ).

Proof.

As in the proof of Corollary 10, we have

and

5 Parametric Bayesian models

Now we generalize the notion of a Bayesian model to that of a parametric Bayesian (or para-Bayesian) model. This is a pair $(P,Q)$ consisting of a statistical model $P=(P_\theta\mid\theta\in\Theta)$ on a sample space $\Omega$ and a statistical model $Q=(Q_\pi\mid\pi\in\Pi)$ on the sample space $\Theta$ (so that the sample space of the second statistical model is the parameter space of the first statistical model). Intuitively, a para-Bayesian model is the counterpart of a Bayesian model in the situation of uncertainty about the prior: now the prior is a parametric family of probability measures rather than one probability measure.

The following definitions are straightforward generalizations of the definitions for the Bayesian case. The joint statistical model $T=(T_\pi\mid\pi\in\Pi)$ on the measurable space $\Omega\times\Theta$ is defined by

$$T_\pi(A\times B):=\int_B P_\theta(A)\,Q_\pi(d\theta)\tag{8}$$

for all measurable $A\subseteq\Omega$ and $B\subseteq\Theta$. For each $\pi\in\Pi$, $Y_\pi$ is the marginal distribution of $T_\pi$ on $\Omega$: for any measurable $A\subseteq\Omega$, $Y_\pi(A):=T_\pi(A\times\Theta)$. The product $\bar{\mathcal{E}}_P\mathcal{E}_Q$ of $\bar{\mathcal{E}}_P$ and $\mathcal{E}_Q$ is still defined as the class of all measurable functions $h:\Omega\times\Theta\to[0,\infty]$ such that, for some $f\in\bar{\mathcal{E}}_P$ and $g\in\mathcal{E}_Q$, we have the equality in (5) $T_\pi$-a.s., for all $\pi\in\Pi$.

Remark 12.

Another representation of para-Bayesian models is as a sufficient statistic, as elaborated in [11]:

  • For the para-Bayesian model $(P,Q)$, the statistic $(\omega,\theta)\mapsto\theta$ is a sufficient statistic in the statistical model $T$ on the product space $\Omega\times\Theta$.

  • If $t$ is a sufficient statistic for a statistical model $(T_\pi\mid\pi\in\Pi)$ on a sample space $\Omega$, then $(P,Q)$ is a para-Bayesian model, where $Q_\pi$ is the distribution of $t$ under $T_\pi$, and $P_\theta$ are (fixed versions of) the conditional distributions given $t=\theta$.

Remark 13.

Yet another way to represent a para-Bayesian model is a Markov family with time horizon :

  • the initial state space is $\Pi$, the middle one is $\Theta$, and the final one is $\Omega$;

  • there is no initial probability measure on $\Pi$, the statistical model $Q$ is the first Markov kernel, and the statistical model $P$ is the second Markov kernel.

Theorem 14.

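If $(P,Q)$ is a para-Bayesian model with the joint statistical model $T$ (as defined by (8)), we have (6).

Proof.

The inclusion “$\supseteq$” in (6) follows from the definition of $T$: if $f\in\bar{\mathcal{E}}_P$ and $g\in\mathcal{E}_Q$, we have, for all $\pi\in\Pi$,

$$\int f(\omega;\theta)\,g(\theta)\,T_\pi(d\omega,d\theta)=\int\left(\int f(\omega;\theta)\,P_\theta(d\omega)\right)g(\theta)\,Q_\pi(d\theta)\le\int g(\theta)\,Q_\pi(d\theta)\le1.$$

To check the inclusion “$\subseteq$” in (6), let $h\in\mathcal{E}_T$. Define $g$ and $f$ by

$$g(\theta):=\int h(\omega,\theta)\,P_\theta(d\omega),\qquad f(\omega;\theta):=\frac{h(\omega,\theta)}{g(\theta)}$$

(setting, e.g., $0/0:=0$ in the last fraction). Since by definition, $h(\omega,\theta)=f(\omega;\theta)\,g(\theta)$ $T_\pi$-a.s., it suffices to check that $g\in\mathcal{E}_Q$ and $f\in\bar{\mathcal{E}}_P$. The inclusion $g\in\mathcal{E}_Q$ follows from the fact that, for any $\pi\in\Pi$,

$$\int g(\theta)\,Q_\pi(d\theta)=\int\int h(\omega,\theta)\,P_\theta(d\omega)\,Q_\pi(d\theta)=\int h\,dT_\pi\le1.$$

And the inclusion $f\in\bar{\mathcal{E}}_P$ follows from the fact that, for any $\theta\in\Theta$,

$$\int f(\omega;\theta)\,P_\theta(d\omega)=\frac{\int h(\omega,\theta)\,P_\theta(d\omega)}{g(\theta)}=\frac{g(\theta)}{g(\theta)}\le1$$

(we have $\le$ rather than $=$ because of the possibility $g(\theta)=0$). ∎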
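The factorization used in the second half of the proof can be carried out explicitly on finite spaces. The sketch below is a toy instance under assumptions made for the example: the sizes of the spaces, the kernel `P`, the two priors in `Q`, and the unnormalized function `raw` are arbitrary, and a finite family of two priors stands in for the family of priors of a para-Bayesian model.

```python
from fractions import Fraction as F

OMEGA = range(3)
THETA = range(2)

# P[theta][omega]: statistical model, a Markov kernel from Theta to Omega (assumed values).
P = [[F(1, 2), F(1, 4), F(1, 4)],
     [F(1, 10), F(3, 10), F(6, 10)]]

# Q[pi][theta]: a family of two priors on Theta (assumed values).
Q = [[F(1, 2), F(1, 2)],
     [F(9, 10), F(1, 10)]]

def joint_integral(h, prior):
    """Integral of h(omega, theta) w.r. to the joint measure built from P and the prior."""
    return sum(h[t][w] * P[t][w] * prior[t] for t in THETA for w in OMEGA)

# Start from an arbitrary nonnegative function and rescale it into an e-function
# w.r. to the joint statistical model.
raw = [[F(1), F(3), F(0)],
       [F(5), F(1, 2), F(2)]]
scale = max(joint_integral(raw, prior) for prior in Q)
h = [[value / scale for value in row] for row in raw]

# Factorization from the proof: g integrates out omega, f is the remaining ratio.
g = [sum(h[t][w] * P[t][w] for w in OMEGA) for t in THETA]
f = [[h[t][w] / g[t] if g[t] != 0 else F(0) for w in OMEGA] for t in THETA]

g_integrals = [sum(g[t] * prior[t] for t in THETA) for prior in Q]
f_integrals = [sum(f[t][w] * P[t][w] for w in OMEGA) for t in THETA]
```

In this instance `g` is an e-function w.r. to both priors and `f` is a conditional e-function, and their product recovers `h` exactly.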

IID vs exchangeability

De Finetti’s theorem (see, e.g., [15, Theorem 1.49]) establishes a close connection between IID and exchangeability for infinite sequences in $Z^\infty$, where $Z$ is a Borel measurable space: namely, the exchangeable probability measures are the convex mixtures of the IID probability measures (in particular, their upper envelopes, and therefore, e- and p-functions, coincide). This subsection discusses a somewhat less close connection in the case of sequences of a fixed finite length.

Fix $N\in\{1,2,\ldots\}$ (time horizon), and let $\Omega:=Z^N$ be the set of all sequences of elements of $Z$ (a measurable space, not necessarily Borel) of length $N$. An IID probability measure on $\Omega$ is a measure of the type $Q^N$, where $Q$ is a probability measure on $Z$. The configuration $\operatorname{conf}(\omega)$ of a sequence $\omega\in\Omega$ is the multiset of all elements of $\omega$, and a configuration measure is the pushforward of an IID probability measure on $\Omega$ under the mapping $\operatorname{conf}$. Therefore, a configuration measure is a measure on the set of all multisets in $Z$ of size $N$ (with the natural quotient $\sigma$-algebra).

Let $\mathcal{E}^{\mathrm{iid}}$ be the class of all e-functions w.r. to the family of all IID probability measures on $\Omega$ and $\mathcal{E}^{\mathrm{conf}}$ be the class of all e-functions w.r. to the family of all configuration probability measures. Let $\mathcal{E}^{\mathrm{exch}}$ be the class of all e-functions w.r. to the family of all exchangeable probability measures on $\Omega$; remember that a probability measure $P$ on $\Omega$ is exchangeable if, for any permutation $\sigma:\{1,\ldots,N\}\to\{1,\ldots,N\}$ and any measurable set $A\subseteq\Omega$,

$$P\{(z_1,\ldots,z_N)\mid(z_{\sigma(1)},\ldots,z_{\sigma(N)})\in A\}=P(A).$$

The product $\mathcal{E}^{\mathrm{exch}}\mathcal{E}^{\mathrm{conf}}$ of $\mathcal{E}^{\mathrm{exch}}$ and $\mathcal{E}^{\mathrm{conf}}$ is the set of all measurable functions $h:\Omega\to[0,\infty]$ such that, for some $f\in\mathcal{E}^{\mathrm{exch}}$ and $g\in\mathcal{E}^{\mathrm{conf}}$,

$$h(\omega)=f(\omega)\,g(\operatorname{conf}(\omega))$$

holds for almost all $\omega\in\Omega$ (under any IID probability measure).

Corollary 15.

It is true that

$$\mathcal{E}^{\mathrm{iid}}=\mathcal{E}^{\mathrm{exch}}\mathcal{E}^{\mathrm{conf}}.$$

Proof.

It suffices to apply Theorem 14 in the situation where $\Theta$ is the set of all configurations, $P_\theta$ is the probability measure on $\Omega$ concentrated on the set of all sequences with the configuration $\theta$ and uniform on that set (we can order $\theta$ arbitrarily, and then $P_\theta$ assigns weight $1/N!$ to each permutation of that ordering), $\Pi$ is the set of all IID probability measures on $\Omega$, and $Q_\pi$ is the pushforward of $\pi$ w.r. to the mapping $\operatorname{conf}$. ∎

6 Bernoulli sequences: IID vs exchangeability

In this section we apply the definitions and results of the previous sections to the problem of defining Bernoulli sequences. Kolmogorov’s main publications on this topic are [7] and [8]. The results of this section will be algorithm-free versions of the results in [19] (also described in V’yugin’s review [25], Sections 11–13).

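The definitions of the previous subsection simplify as follows. Now $\Omega:=\{0,1\}^N$ is the set of all binary sequences of length $N$. Let $\mathcal{E}^{\mathrm{Bern}}$ be the class of all e-functions w.r. to the family of all Bernoulli IID probability measures on $\Omega$ (this is a special case of $\mathcal{E}^{\mathrm{iid}}$) and $\mathcal{E}^{\mathrm{bin}}$ be the class of all e-functions w.r. to the family of all binomial probability measures on $\{0,\ldots,N\}$ (this is a special case of $\mathcal{E}^{\mathrm{conf}}$); remember that the Bernoulli measure $B_p$ with parameter $p\in[0,1]$ is defined by $B_p(\{\omega\}):=p^{\Sigma(\omega)}(1-p)^{N-\Sigma(\omega)}$, where $\Sigma(\omega)$ is the number of 1s in $\omega$, and the binomial measure $\mathrm{bin}_p$ with parameter $p\in[0,1]$ is defined by $\mathrm{bin}_p(\{k\}):=\binom{N}{k}p^k(1-p)^{N-k}$. (The notation $\Sigma(\omega)$ for the number of 1s in $\omega$ is motivated by $\Sigma(\omega)$ being the sum of the elements of $\omega$.)

We continue to use the notation $\mathcal{E}^{\mathrm{exch}}$ for the class of all e-functions w.r. to the family of all exchangeable probability measures on $\Omega$; a probability measure $P$ on $\Omega$ is exchangeable if and only if $P(\{\omega\})$ depends on $\omega$ only via $\Sigma(\omega)$. It is clear that a function $f:\Omega\to[0,\infty]$ is in $\mathcal{E}^{\mathrm{exch}}$ if and only if, for each $k\in\{0,\ldots,N\}$,

$$\binom{N}{k}^{-1}\sum_{\omega\in\Omega:\,\Sigma(\omega)=k}f(\omega)\le1.$$

The product $\mathcal{E}^{\mathrm{exch}}\mathcal{E}^{\mathrm{bin}}$ of $\mathcal{E}^{\mathrm{exch}}$ and $\mathcal{E}^{\mathrm{bin}}$ is the set of all functions $\omega\mapsto f(\omega)\,g(\Sigma(\omega))$ for $f\in\mathcal{E}^{\mathrm{exch}}$ and $g\in\mathcal{E}^{\mathrm{bin}}$. The following is a special case of Corollary 15.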

Corollary 16.

It is true that

$$\mathcal{E}^{\mathrm{Bern}}=\mathcal{E}^{\mathrm{exch}}\mathcal{E}^{\mathrm{bin}}.$$

The intuition behind Corollary 16 is that a sequence is Bernoulli if and only if it is exchangeable and the number of 1s in it is binomial. The analogue of Corollary 16 in the algorithmic theory of randomness is Theorem 1 in [19], which says, using the terminology of that paper, that the Bernoulliness deficiency of $\omega$ equals the binomiality deficiency of $\Sigma(\omega)$ plus the conditional randomness deficiency of $\omega$ in the set of all sequences in $\{0,1\}^N$ with $\Sigma(\omega)$ 1s given the binomiality deficiency of $\Sigma(\omega)$. Corollary 16 is simpler since it does not involve any analogue of the condition “given the binomiality deficiency of $\Sigma(\omega)$”. Theorem 1 of [19] was generalized to the non-binary case in [22] (Theorem 3 of [22], given without a proof, is an algorithmic analogue of Corollary 15).
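The factorization behind Corollary 16 can be carried out by hand for short binary sequences. The sketch below is an illustration under assumptions made for the example: N = 5, the “sticky” Markov-chain alternative, the particular Bernoulli e-function `h` (an alternative measure divided by the maximized Bernoulli likelihood), and the grid of parameter values used for the binomial check are not taken from the paper.

```python
from itertools import product
from math import comb

N = 5
OMEGA = list(product((0, 1), repeat=N))      # all binary sequences of length N

def bernoulli(omega, p):
    """Probability of omega under the Bernoulli measure with parameter p."""
    k = sum(omega)
    return p ** k * (1 - p) ** (N - k)

def sticky(omega, stay=0.9):
    """Assumed alternative: a Markov chain that repeats the previous bit w.p. stay."""
    prob = 0.5
    for a, b in zip(omega, omega[1:]):
        prob *= stay if a == b else 1 - stay
    return prob

def h(omega):
    """A Bernoulli e-function: the alternative over the maximized Bernoulli likelihood."""
    return sticky(omega) / bernoulli(omega, sum(omega) / N)

# Binomial part: average of h over all sequences with the same number of 1s.
G = [sum(h(w) for w in OMEGA if sum(w) == k) / comb(N, k) for k in range(N + 1)]

def g(k):
    return G[k]

def f(omega):
    """Exchangeability part: what remains of h after dividing out g."""
    return h(omega) / g(sum(omega))

# f is in the exchangeability class: its average over each level set is 1.
level_averages = [
    sum(f(w) for w in OMEGA if sum(w) == k) / comb(N, k) for k in range(N + 1)
]
# g is in the binomial class (checked on a grid of parameter values only).
P_GRID = [i / 100 for i in range(101)]
binomial_integrals = [
    sum(g(k) * comb(N, k) * p ** k * (1 - p) ** (N - k) for k in range(N + 1))
    for p in P_GRID
]
```

For a sequence such as (0, 0, 0, 1, 1) the value of `f` is large because the 1s are clustered, while `g` only looks at the count of 1s; their product is the Bernoulli e-value `h`.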

Remark 17.

Kolmogorov’s definition of Bernoulli sequences is via exchangeability. We can regard this definition as an approximation to definitions taking into account the binomiality of the number of 1s. In the paper [7] Kolmogorov uses the word “approximately” when introducing his notion of Bernoulliness (p. 663, lines 5–6 after the 4th displayed equation). However, it would be wrong to assume that here he acknowledges disregarding the requirement that the number of 1s should be binomial; this is not what he meant when he used the word “approximately” [10].

The reason for Kolmogorov’s definition of Bernoulliness being different from the definitions based on e-values and p-values is that $k$ carries too much information about $\omega$; intuitively [20], $k$ contains not only useful information about the probability $p$ of 1 but also noise. To reduce the amount of noise, we will use an imperfect estimator of $p$. Set

(9)

where stands for integer part. Let be the estimator of defined by , where is the element of the set (9) that is nearest to among those satisfying ; if such elements do not exist, set .

Denote by the partition of the set into the subsets , where . For any , denotes the element of the partition containing . Let be the class of all e-functions w.r. to the statistical model , being the uniform probability measure on . (This is a Kolmogorov-type statistical model, consisting of uniform probability measures on finite sets; see, e.g., [23, Section 4].)

Theorem 18.

For some universal constant ,

The analogue of Theorem 18 in the algorithmic theory of randomness is Theorem 2 in [19], and the proof of Theorem 18 can be extracted from that of Theorem 2 in [19] (details omitted).

Remark 19.

Paper [19] uses a net slightly different from (9); (9) was introduced in [20] and also used in [25].

In conclusion of this section, let us extract corollaries in terms of p-values from Corollary 16 and Theorem 18; we will use the obvious notation , , and .

Corollary 20.

For each ,

(10)
Proof.

Similarly to Corollary 10, the left inclusion of (10) follows from

and the right inclusion of (10) follows from

Corollary 21.

There is a universal constant such that, for each ,

(11)
Proof.

As in the previous proof, the left inclusion of (11) follows from

and the right inclusion from

where stands for a positive universal constant. ∎

7 Conclusion

In this section we discuss some directions of further research. A major advantage of the non-algorithmic approach to randomness proposed in this paper is the absence of unspecified constants; in principle, all constants can be computed. The most obvious open problem is to find the best constant in Theorem 18.

In Section 6 we discussed a possible implementation of Kolmogorov’s idea of defining Bernoulli sequences. However, Kolmogorov’s idea was part of a wider programme; e.g., in [8, Section 5] he sketches a way of applying a similar approach to Markov sequences. For other possible applications, see [23, Section 4] (most of these applications were mentioned by Kolmogorov in his papers and talks). Analogues of Corollary 16 in Section 6 can be established for these other applications (cf. [11] and Remark 12), but it is not obvious whether Theorem 18 can be extended in a similar way.

References

  • [1] Nicolas Bourbaki. Elements of Mathematics. Integration. Springer, Berlin, 2004. In two volumes. The French originals published in 1952–1969.
  • [2] Peter Gács. Uniform test of algorithmic randomness over a general space. Theoretical Computer Science, 341:91–137, 2005. A later version of this paper (2013) is available on the author’s web site (accessed in October 2019).
  • [3] Alex Gammerman and Vladimir Vovk. Data labelling apparatus and method thereof, 2003. US Patent Application 0236578 A1. Available on the Internet (accessed in October 2019).
  • [4] Peter Grünwald, Rianne de Heide, and Wouter M. Koolen. Safe testing. Technical Report arXiv:1906.07801 [math.ST], arXiv.org e-Print archive, June 2019.
  • [5] Yuri Gurevich and Vladimir Vovk. Test statistics and p-values. Proceedings of Machine Learning Research, 105:89–104, 2019. COPA 2019.
  • [6] Andrei N. Kolmogorov. Несколько теорем об алгоритмической энтропии и алгоритмическом количестве информации. Успехи математических наук, 23(2):201, 1968. Abstract of a talk before the Moscow Mathematical Society. Meeting of 31 October 1967.
  • [7] Andrei N. Kolmogorov. Logical basis for information theory and probability theory. IEEE Transactions on Information Theory, IT-14:662–664, 1968. Russian original: К логическим основам теории информации и теории вероятностей, published in Проблемы передачи информации.
  • [8] Andrei N. Kolmogorov. Combinatorial foundations of information theory and the calculus of probabilities. Russian Mathematical Surveys, 38:29–40, 1983. Russian original: Комбинаторные основания теории информации и исчисления вероятностей.
  • [9] Andrei N. Kolmogorov. On logical foundations of probability theory. In Yu. V. Prokhorov and K. Itô, editors, Probability Theory and Mathematical Statistics, volume 1021 of Lecture Notes in Mathematics, pages 1–5. Springer, 1983. Talk at the Fourth USSR–Japan Symposium on Probability Theory and Mathematical Statistics (Tbilisi, August 1982) recorded by Alexander A. Novikov, Alexander K. Zvonkin, and Alexander Shen. Our quote follows Selected Works of A. N. Kolmogorov, volume II, Probability Theory and Mathematical Statistics, edited by A. N. Shiryayev, Kluwer, Dordrecht, p. 518.
  • [10] Andrei N. Kolmogorov. Personal communication, ca 1983.
  • [11] Steffen L. Lauritzen. Extremal Families and Systems of Sufficient Statistics, volume 49 of Lecture Notes in Statistics. Springer, New York, 1988.
  • [12] Leonid A. Levin. Uniform tests of randomness. Soviet Mathematics Doklady, 17:337–340, 1976. Russian original: Равномерные тесты случайности.
  • [13] Ming Li and Paul Vitányi. An Introduction to Kolmogorov Complexity and Its Applications. Springer, New York, third edition, 2008.
  • [14] Per Martin-Löf. The definition of random sequences. Information and Control, 9:602–619, 1966.
  • [15] Mark J. Schervish. Theory of Statistics. Springer, New York, 1995.
  • [16] Glenn Shafer. The language of betting as a strategy for statistical and scientific communication. Technical Report arXiv:1903.06991 [math.ST], arXiv.org e-Print archive, March 2019.
  • [17] Glenn Shafer and Vladimir Vovk. Game-Theoretic Foundations for Probability and Finance. Wiley, Hoboken, NJ, 2019.
  • [18] Mikhail Ya. Souslin. Sur une définition des ensembles mesurables B sans nombres transfinis. Comptes rendus hebdomadaires des séances de l’Académie des sciences, 164:88–91, 1917.
  • [19] Vladimir Vovk. On the concept of the Bernoulli property. Russian Mathematical Surveys, 41:247–248, 1986. Russian original: О понятии бернуллиевости. Another English translation with proofs: [21].
  • [20] Vladimir Vovk. Learning about the parameter of the Bernoulli model. Journal of Computer and System Sciences, 55:96–104, 1997.
  • [21] Vladimir Vovk. On the concept of Bernoulliness. Technical Report arXiv:1612.08859 [math.ST], arXiv.org e-Print archive, December 2016.
  • [22] Vladimir Vovk, Alex Gammerman, and Craig Saunders. Machine-learning applications of algorithmic randomness. In Proceedings of the Sixteenth International Conference on Machine Learning, pages 444–453, San Francisco, CA, 1999. Morgan Kaufmann.
  • [23] Vladimir Vovk and Glenn Shafer. Kolmogorov’s contributions to the foundations of probability. Problems of Information Transmission, 39:21–31, 2003.
  • [24] Vladimir Vovk and Vladimir V. V’yugin. On the empirical validity of the Bayesian method. Journal of the Royal Statistical Society B, 55:253–266, 1993.
  • [25] Vladimir V. V’yugin. Algorithmic complexity and stochastic properties of finite binary sequences. Computer Journal, 42:294–317, 1999.