 # Prediction with eventual almost sure guarantees

We study the problem of sequentially predicting the properties of a probabilistic model and its next outcome over an infinite horizon, so that the prediction makes only finitely many errors with probability 1. We introduce a general framework that models such prediction problems. We prove some general properties of the framework and illustrate them with concrete examples.


## 1 Introduction

Suppose there is an unknown probabilistic model that generates samples sequentially. You can observe as many samples as you want from the model. At each time step you need to predict some property of the model, as well as the next sample, based on the samples observed so far. A predefined loss measures whether your prediction is correct. Under what conditions can you incur only finitely many losses with probability 1, so that your prediction is eventually almost surely correct? If you can do so, can you identify the point after which you will incur no more losses?

Such scenarios arise naturally in learning and estimation tasks when no uniform bounds are available. For example, suppose you want to estimate some property of a distribution but have little information about the distribution. Then there is no bounded sample complexity that lets you quantify the quality of your estimate. What, then, can you still hope for? One workaround is to show that your estimate eventually converges to the property being estimated, which is known as pointwise consistency. However, such a guarantee may be inadequate in the following sense. First, the ultimate goal of the estimation is to accomplish some task, so driving the estimation error below some artificial threshold may not be necessary. Second, knowing that the estimate eventually converges to the true value does not tell you when you can finish your task.

Problems of a similar flavor were initiated by Cover, who considered the problem of predicting the irrationality of the mean of a random variable from samples of the variable. Cover provided a surprising sufficient condition for this problem, showing that such a prediction scheme exists for all means except a measure-zero set of irrational numbers. Dembo and Peres generalized Cover's work by considering the testing of general hypotheses about distributions, and provided a topological criterion for the testability of their hypothesis testing problem. Kulkarni and Tse considered even more general identification problems of a similar flavor. However, all such works are restricted to identifying classes and use specific topological properties to characterize the problems.

In this paper, we deal with such problems in a much broader and more abstract setup. Our main contributions are summarized as follows. We introduce a general framework for arbitrary underlying random processes and losses, which covers all previously known results as special cases and brings more problems into the setup. We provide some general properties of the framework that follow purely from the definitions and are independent of the underlying random processes. We give some examples below to motivate the study of our problem.

### 1.1 Strong law of large numbers

Let be random variables over with , and . The strong law of large numbers asserts that for any , the events happen eventually almost surely. This assertion is somewhat tricky to prove, especially when we have no control on the moments. However, suppose we only want some prediction (not necessarily the empirical mean) such that the event happens eventually almost surely for all . We will see below that such a requirement can be satisfied with a rather simple proof.

### 1.2 Predicting general properties of distributions

Most previous results focus on predicting properties of the mean of distributions (Dembo and Peres also considered general properties, but their topological criterion is hard to verify in general). One such example is the following: for a given distribution over , can we predict whether the entropy of the distribution is or not, so that the prediction is correct eventually almost surely? It can be shown that such a prediction is not possible. However, it becomes possible if we also know that the expectation is finite. See Section 4.1 for a detailed discussion.

### 1.3 Online learning

Let be the class of functions over , such that for all we have if and otherwise, for some . Consider the following online learning game: we fix some , and at each time step we generate , sampled uniformly from . The learner needs to predict the value , and we reveal the true value after the prediction. The question is whether there is a prediction scheme under which the learner makes only finitely many errors w.p. 1. It can be shown that no such scheme exists. However, if we restrict to be rational, then one can show that such a scheme exists, even though the rational numbers are dense in . See Section 4.3 for more discussion.
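To make the positive claim for rational thresholds concrete, here is a minimal Python sketch of one possible scheme. It is an illustration under stated assumptions, not necessarily the construction used in Section 4.3. It assumes, hypothetically, that the threshold function has the form $h_a(x) = 1$ if $x \ge a$ and $0$ otherwise, with $a \in [0,1]$; the names `rationals_01`, `h`, and `play` are introduced here for illustration only. The learner walks through a fixed enumeration of the rationals and keeps the first one consistent with every label revealed so far. Since the true threshold is never eliminated, the number of errors is at most its position in the enumeration.

```python
from fractions import Fraction
from math import gcd


def rationals_01():
    """Enumerate the rationals in [0, 1]: 0, 1, 1/2, 1/3, 2/3, 1/4, ..."""
    yield Fraction(0)
    yield Fraction(1)
    q = 2
    while True:
        for p in range(1, q):
            if gcd(p, q) == 1:
                yield Fraction(p, q)
        q += 1


def h(a, x):
    # Assumed form of the threshold function: 1 if x >= a, else 0.
    return 1 if x >= a else 0


def play(true_a, xs):
    """Run the game on the sample sequence xs.

    Returns (number of errors, final hypothesis of the learner).
    """
    candidates = rationals_01()
    guess = next(candidates)
    history = []  # (sample, revealed label) pairs
    errors = 0
    for x in xs:
        y = h(true_a, x)          # label revealed after the prediction
        history.append((x, y))
        if h(guess, x) != y:
            errors += 1
            # Move on to the next enumerated rational that is
            # consistent with all labels revealed so far.
            while any(h(guess, xi) != yi for xi, yi in history):
                guess = next(candidates)
    return errors, guess
```

For instance, with true threshold $1/3$ (the fourth element of this enumeration), the learner errs at most three times on any sample sequence. No such argument is available for an arbitrary real threshold, because the reals cannot be enumerated.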

## 2 Main Results

We consider general prediction problems, and first develop some notation common to all problems. Let be a set, and let be a collection of probability models over an appropriate sigma algebra on that we will specify based on the problem at hand. We consider a discrete time random process generated by sampling from a probability law .

Prediction is modeled as a function , where denotes the set of all finite strings of sequences from , and is the set of all predictions. The loss function is a measurable function . We consider the property we are estimating to be defined implicitly by the subset of where .

We consider the following game that proceeds in time indexed by . The game has two parties: the learner and nature. Nature chooses some model to begin the game. At each time step , the learner makes a prediction based on the current observation generated according to . Nature then generates based on and .

The learner fails at step if . The goal of the learner is to optimize his strategy to minimize his cumulative loss in the infinite horizon, no matter what model the environment chooses at the beginning.

Note that the loss is a function of the probability model, in addition to the observed sample and our prediction on the sample. In the supervised setting, all models in the class incur the same loss function, so we can simply consider the loss to be a function from to . If the loss depends on the probability model as well, there may be no direct way to estimate the loss incurred from observing the sample, even after the prediction is made; we call such situations the unsupervised setting.

[predictability] A collection is -predictable, if there exists a prediction rule and a sample size such that for all ,

$$p\left(\sum_{i=n}^{\infty} \ell\big(p, X_1^i, \Phi(X_1^{i-1})\big) > 0\right) \le \eta,$$

i.e., the probability that the learner makes errors after step is at most , uniformly over .

A collection is said to be eventually almost surely (e.a.s.)-predictable, if there exists a prediction rule , such that for all

$$p\left(\sum_{n=1}^{\infty} \ell\big(p, X_1^n, \Phi(X_1^{n-1})\big) < \infty\right) = 1.$$

We need a technical definition that will help simplify notation further.

A nesting of is a collection of subsets of , such that and .

Our first result connects the two definitions above. We will subsequently unravel the theorem below in different contexts to illustrate it. We first consider the supervised setting.

Consider a collection with a loss (the supervised setting). is e.a.s.-predictable iff for all , there exists a nesting of such that for all , is -predictable.

The analogous result for the unsupervised setting is as follows.

Consider a collection with a loss (the unsupervised setting). is e.a.s.-predictable if there exists a nesting of such that for all , is -predictable.

Conversely, if is e.a.s.-predictable, then for all , there is a nesting of such that for all , is -predictable.

The two theorems above show that analyzing the e.a.s.-predictability of a class with respect to some loss is equivalent to studying how the class decomposes into uniformly predictable classes. Indeed, the primary challenge is to choose the decompositions carefully, but doing so allows us to tackle a wide range of problems.

In the rest of this section, we illustrate how the above theorems provide simpler, novel proofs of a variety of prior results. In Section 4, we apply the theorems to prove new results as well.

#### Cover’s problem

We now consider a problem introduced by Cover [3]. The task is to predict whether the parameter of a Bernoulli process is rational or not using samples from it.

Therefore our predictor . Cover [3] showed a scheme that predicts accurately, with only finitely many errors, for all rational sources and for a set of irrationals with Lebesgue measure 1. Here we give a more transparent version of Cover's proof, as well as of subsequent refinements, using Theorem 2 above.

Define the loss iff matches the irrationality of . Note that the setting is what we would call the “unsupervised” case and that there is no way to judge if our predictions thus far are right or wrong.

Let be an enumeration of rational numbers in . Let be the set of numbers in whose distance from is . For all , let

$$S_k = \left([0,1] \setminus \bigcup_{i=1}^{\infty} B\left(r_i, \frac{1}{k 2^i}\right)\right) \cup \{r_1, \cdots, r_k\}$$

be the set that excludes a ball centered on each rational number, but throws back in the first rational numbers. Note that the Lebesgue measure of is . Now contains exactly rational numbers, such that contains no other number with distance around each of the included rationals, while the rest of is irrational. Hence (see Lemma C in Appendix C), the set of Bernoulli processes with parameters in is -predictable for all .

From Theorem 2, we can conclude that the collection is e.a.s.-predictable. Note that every rational number belongs to , and the set of irrational numbers in has Lebesgue measure 1, proving [3, Theorem 1].
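The measure claim can be checked with a short union bound. The following is a sketch under the assumption that $B(r, \epsilon)$ denotes the set of points of $[0,1]$ within distance less than $\epsilon$ of $r$, so that its Lebesgue measure is at most $2\epsilon$; the symbol $\lambda$ for Lebesgue measure is introduced here for illustration.

$$\lambda\left(\bigcup_{i=1}^{\infty} B\left(r_i, \frac{1}{k 2^i}\right)\right) \le \sum_{i=1}^{\infty} \frac{2}{k 2^i} = \frac{2}{k}, \qquad\text{hence}\qquad \lambda(S_k) \ge 1 - \frac{2}{k}.$$

Since the rationals have measure zero, the irrationals in $\bigcup_k S_k$ have measure at least $1 - 2/k$ for every $k$, and therefore measure 1.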

Conversely, let and be the Bernoulli variables with parameters in . We show that if is e.a.s.-predictable for rationality of the underlying parameter, then such that

$$\inf\{|r - x| : r, x \in S_k \text{ and } r \text{ is rational},\ x \text{ is irrational}\} > 0.$$

Since is e.a.s.-predictable, Theorem 2 yields that for any , the collection can be decomposed as where each is -predictable and . Let be the set of parameters of the sources in . Intuitively (see Corollary C in Appendix C for formal proof), predictability of implies that we must have

$$\inf\{|u - v| : u, v \in S_k \text{ and } u \text{ rational},\ v \text{ irrational}\} > 0,$$

or else we would not be able to universally attest to rationality with confidence using a bounded number of samples.

Suppose we want to contain all rational numbers in . Then it follows (see Lemma C in Appendix C) that the subset of irrational numbers of must be nowhere dense. Therefore, the set of irrationals in is a meager set (a set of the first Baire category), completing the result in .

We can show that Theorem 2 naturally implies most of the results in  as well. We also show that the open question asked in  holds under a continuity condition on the predictor. The detailed analysis is left to Appendix A.

#### Poor man’s strong law

Let be the collection of distributions over with finite first moment. We show that for any , there exists a universal predictor , such that for all , the events happen finitely often w.p. .

Clearly, the empirical mean achieves the desired goal by the strong law of large numbers. Our purpose here is to use Theorem 2 to provide an alternate simple predictor.

Define , where is the random variable governed by distribution . Clearly . Now for all and all , is -predictable under loss . To see this, consider the estimator

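$$\Phi_\epsilon(X_1^n) = \begin{cases} 0, & \text{if } n \le N_{\epsilon,k}, \\ \dfrac{X'_1 + X'_2 + \cdots + X'_{N_{\epsilon,k}}}{N_{\epsilon,k}}, & \text{otherwise,} \end{cases}$$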

where and . Since , we know by Theorem 2 that is e.a.s.-predictable under loss .

It is not hard to show that the dependence of the predictor on can be removed, i.e. we can construct a predictor such that for all and , events happen finitely often w.p. . In other words, almost surely. See appendix D for a detailed construction.

A similar argument establishes a strong law of estimation for any real-valued function defined on distributions over , so long as the function is continuous on any bounded support, e.g., the entropy of distributions over .

#### A lower bound on accuracy of classifiers

For any and , we denote

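$$h_{\mathbf{w},a}(\mathbf{x}) = \begin{cases} 0, & \text{if } \mathbf{x} \cdot \mathbf{w} \le a, \\ 1, & \text{otherwise.} \end{cases}$$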

Let be any distribution on that is absolutely continuous with respect to the Lebesgue measure. Let be the class of all such linear classifiers.

The VC dimension of is .

A simple application of VC-theorem tells us that for any one can find some such that with confidence by observing at most

$$m = O\left(\frac{(d+1)\log(1/\epsilon) + \log(1/\delta)}{\epsilon}\right)$$

samples. By letting , we have . A natural question is whether a different approach could retain the same confidence, but improve the accuracy to say, the order of .

We show that this is not possible. Consider an online prediction game as follows. We choose some at the beginning. At step we generate a sample point and ask the learner to predict based on past observations . After the prediction is made, we reveal the true label, and the game proceeds. The learner incurs a loss at step if the prediction he made is different from . A corollary of Theorem 2, stated below, shows that this game is not e.a.s.-predictable.

Let be a class of measurable functions over and be an arbitrary distribution over . If there exists an uncountable subset such that , , then is not e.a.s.-predictable with the above process and loss.

Since the distribution is absolutely continuous w.r.t. Lebesgue measure, we know that there are uncountably many satisfying the requirements in Corollary 2.

This corollary now implies that the desired pair above is also impossible. Suppose otherwise, i.e., that the accuracy-confidence pair is achievable. Then the function learned at any stage makes an error in predicting the label of the th step with probability at most . The Borel-Cantelli lemma then implies that we make only finitely many errors w.p. 1, contradicting Corollary 2.
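The Borel-Cantelli step can be spelled out as follows. This is a sketch in which $E_n$ denotes the event that the learner errs at step $n$ and $\epsilon_n$ denotes the assumed bound on its probability; both names are introduced here for illustration. If the bounds are summable, then

$$\sum_{n=1}^{\infty} p(E_n) \le \sum_{n=1}^{\infty} \epsilon_n < \infty \quad\Longrightarrow\quad p(E_n \text{ occurs for infinitely many } n) = 0.$$

This direction of the lemma requires no independence between the events $E_n$, which matters here because the learned function at each step depends on all past samples.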

### 2.1 e.a.s.-Learnability

If, in addition, the learner can also specify with some confidence when he will stop making mistakes, we call the class e.a.s.-learnable, as formally defined below. A collection is said to be eventually almost surely (e.a.s.)-learnable if, for any confidence , there exists a universal prediction rule together with a stopping rule , such that for all

$$p\left(\sum_{k=\min\{n \mid \tau_\eta(X_1^n) = 1\}}^{\infty} \ell\big(p, X_1^k, \Phi_\eta(X_1^{k-1})\big) > 0\right) < \eta,$$

and

$$p\big(\exists n \in \mathbb{N},\ \tau_\eta(X_1^n) = 1\big) = 1.$$

Let and be two events. The conditions of Definition 2.1 are equivalent to saying

$$\sum_{n=1}^{\infty} p_X(A_n \cap B_n) < \eta,$$

and

$$\sum_{n=1}^{\infty} p_X(B_n) = 1.$$

In some cases, one may also wish to have the following guarantee

$$\sup_{n,\, w \in B_n} p_X(A_n \mid w) < \eta.$$

One should notice that Definition 2.1 takes all the randomness into account when considering confidence, while here we only consider the randomness after we stop. It turns out that such a guarantee is much stronger than Definition 2.1.

Let be a class of processes, be a subclass of . For any , define loss . The class is said to be identifiable in if is e.a.s.-learnable.

We will see shortly below, in Theorem 2.1, that e.a.s.-predictability for any class of i.i.d. processes is captured purely by the concept of identifiability. The following definition generalizes Definition 2.1 to the non-i.i.d. case.

Let be two collections of models. is said to be identifiable in if there is a nesting of such that for all and we have that is -predictable for property (i.e., the prediction incurs a loss at step if for , where ).

A simple example is as follows; we leave the proof to Appendix B. Let be the class of all distributions over , and let be a subset of . Then is identifiable in with i.i.d. sampling iff contains no limit point of under topology (i.e., is relatively open in ).

We have the following theorems, whose proofs are deferred to the next section. If a collection is e.a.s.-learnable with loss , then is also e.a.s.-predictable.

Let be a model class with loss . Suppose that for any there exists a countable collection (not necessarily nested) such that

• is -predictable,

• ,

• is identifiable in for all .

Then, is e.a.s.-learnable. Moreover, the conditions are necessary if the underlying processes in are i.i.d. Note that the first two conditions are the same as the necessary condition given in Theorem 2. We do not know whether condition is necessary in general and leave it as an open problem. However, for specific examples, we are able to show that the condition given in Theorem 2.1 is necessary and sufficient; see Section 4.2.

## 3 Proofs

In this section we prove the results stated in Section 2. We begin with a simple but powerful lemma, which illustrates our main proof technique. [Back and forth lemma] Let be countably many e.a.s.-predictable classes with loss in the supervised setting. Then, the class

$$\mathcal{P} = \bigcup_{i=1}^{\infty} \mathcal{P}_i$$

is also e.a.s.-predictable with loss .

###### Proof.

We construct a scheme that incurs only finitely many errors. Let be the predictor for that makes only a finite number of errors with probability 1, no matter what is in force. The prediction rule for works as follows:

• We maintain three indices and initialized to be . denotes the time step (namely we have observed thus far). At any given point, we only keep a finite number of predictors in contention—these will be . indicates which of the above predictors we will use to make the prediction on .

• At time step , use to predict.

• If , we make no change to and . If ,

• If , set and ;

• If , set and .

• Move to the next time step and repeat steps .
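The steps above can be sketched in Python as follows. The exact index updates are not recoverable from the text, so the rule used here is an assumption: on an error, move to the next predictor in contention, and when the last one in contention also errs, admit one more predictor and return to the first. The names `BackAndForth`, `get_predictor`, `s`, `j`, and `run` are hypothetical and introduced for illustration only.

```python
class BackAndForth:
    """Meta-predictor over a countable family of predictors (1-indexed)."""

    def __init__(self, get_predictor):
        self.get_predictor = get_predictor  # i -> predictor for the i-th class
        self.s = 1  # predictors 1..s are currently in contention
        self.j = 1  # index of the predictor used for the next prediction

    def predict(self, history):
        return self.get_predictor(self.j)(history)

    def update(self, loss):
        # Assumed update rule: indices change only when an error is made.
        if loss == 0:
            return
        if self.j < self.s:
            self.j += 1   # try the next predictor in contention
        else:
            self.s += 1   # admit one more predictor ...
            self.j = 1    # ... and go back to the first one


def run(meta, labels):
    """Play the supervised game on a label sequence; return the error count."""
    history, errors = [], 0
    for y in labels:
        loss = int(meta.predict(history) != y)
        errors += loss
        meta.update(loss)
        history.append(y)
    return errors
```

Under this rule every predictor is revisited infinitely often as long as errors keep occurring, which is the property the proof relies on. As a toy usage example, if the $i$-th predictor always outputs $i$ and the true label is always 3, the scheme visits predictors 1, 1, 2, 1, 2, 3 and then stays on the third one.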

Consider any . We prove incurs only finitely many errors with probability 1 when is in force. Assume the contrary, that on a set such that , makes infinitely many errors. Note that changes iff there is an error made. Then given any sequence in , infinitely many times and in addition makes infinitely many errors on the subsequence it is used on. But this is a contradiction since makes only finitely many errors with probability 1 on sequences generated by any . ∎

Let be the event that . We show that, conditioned on event , the predictor makes only finitely many errors. We observe that the number of errors is exactly the number of changes of , and when and , will no longer change. Now suppose the process makes infinitely many errors; then must reach for infinitely many time steps , by our definition of step . Therefore, there must be some status with and , which is a stopping status, a contradiction. The lemma now follows by observing that .

Let be countably many classes of models, such that is -predictable with loss . Then the class is e.a.s.-predictable with loss .

###### Proof.

Let be the -predictor for , and let be a number such that the probability makes errors after step is at most . The predictor is defined as follows. When the sequence length , use to make the prediction.

Let . For all , and in the phase is used, the probability of an error . The result follows by the Borel-Cantelli lemma. ∎

Let be a collection of models, be a loss function of , and be a nesting of . If is -predictable for all , then there exists another nesting of such that and is -predictable with sample size less than .

###### Proof.

Let be the empty set, which is trivially -predictable with sample size . Let be the sample size of that achieves -predictability. Clearly, we have . For , we let . The lemma now follows. ∎

We are now ready to prove the theorems for the supervised and unsupervised settings.

###### Proof of Theorem 2.

If is e.a.s.-predictable, by Definition 2, there exists predictor such that . For any , we define

$$\mathcal{P}_n = \{p \in \mathcal{P} \mid p(\Phi \text{ makes errors after time } n) < \eta\}.$$

For any , let event

$$A_k = \left\{X_1^{\infty} : \sum_{n=k}^{\infty} \ell\big(p, \Phi(X_1^{n-1}), X_n\big) > 0\right\},$$

we have as . Therefore, there must be some such that , and for such a number we have . We now have and , .

For the sufficiency part, we use an argument similar to that of Lemma 3. By assumption, for any , there exists a nesting of such that is -predictable. Without loss of generality, we can assume there exist predictors such that

$$\sup_{p \in \mathcal{P}_n^j} \{p(\Phi_{n,j} \text{ makes errors after time } n)\} \le 2^{-j}.$$

The prediction for works as follows:

• Since is countable, let be an enumeration of such that for we have . See the accompanying figure.

• At time step , use predictor to predict.

• If the loss we make no change to . Otherwise, set

• Move to next time step and repeat steps .

We claim that the predictor will make only finitely many errors with probability for all models in . Fix some . Let . Define event

$$A_j = \{\Phi_{b_j,j} \text{ makes errors after time step } b_j\},$$

we have . Therefore, with probability we have that happens finitely many times. By the construction of , we are able to reach any in at most many changes of with . Therefore, must stop making changes after hitting some in finitely many changes, i.e., makes only finitely many errors. ∎

###### Proof of Theorem 2.

The sufficiency part follows from Lemma 3, and the necessity part is identical to the proof of Theorem 2. ∎

We now prove theorem 2.1.

###### Proof of Theorem 2.1.

Since is e.a.s.-learnable, for each , we denote and . We partition the prediction into stages; at stage we use the prediction rule to make the prediction. If at some time step we have , we move to stage and continue. Now, by definition, the probability that the predictor makes errors at stage is at most . Since the stopping rule stops with probability , we know that w.p. we enter all the stages. The theorem now follows by the Borel-Cantelli lemma. ∎

We now prove theorem 2.1.

###### Proof of Theorem 2.1.

For each , by condition and , we have such that each is -predictable with sample size less than (see Lemma 3). Let be the nested class in Definition 2.1, with . The prediction proceeds in phases indexed by , initialized to be . We maintain another index , initialized to be . At phase we test whether the underlying model is in with confidence , which can be done by identifiability. If the test is positive, we stop and use the -predictor for to make the prediction. Otherwise, we increase and use the back-and-forth trick of Lemma 3 to update . We first observe that the probability of choosing the wrong class is at most . We now claim that we must eventually stop at the right class. Indeed, suppose ; then we reach infinitely many with and . Since every time we reach the right class we miss it with probability at most , by the Borel-Cantelli lemma we cannot miss it infinitely many times. The theorem now follows.

To prove the converse, for any , let . By the definition of e.a.s.-learnability, is -predictable and . We only need to show the identifiability of in for all . The key observation is that, since the process is i.i.d., we can estimate the probability of events with arbitrarily high accuracy and confidence by considering independent blocks of samples. Now, by the definition of , one can prove the identifiability. ∎

## 4 Examples

In this section we consider some concrete examples of the framework in Section 2.

### 4.1 Entropy prediction

Let be a class of distributions over . We consider the random processes to be i.i.d. processes of distributions from . Let be subsets of the positive reals and . The entropy prediction problem is, for any , to predict whether the entropy or .

For any integer and distribution , we define the tail entropy of with order to be:

$$H_n(p) = \sum_{k \ge n} -p_k \log p_k.$$
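The tail entropy is straightforward to compute for a finitely supported distribution. The sketch below assumes natural logarithms (the text does not fix the base) and represents a distribution as a dictionary from support points to probabilities; the name `tail_entropy` is illustrative.

```python
import math


def tail_entropy(p, n):
    """Tail entropy of order n: the sum of -p_k * log(p_k) over all k >= n.

    `p` maps support points k to probabilities p_k; zero-probability
    points contribute nothing, following the convention 0 * log 0 = 0.
    """
    return sum(-pk * math.log(pk) for k, pk in p.items() if k >= n and pk > 0)
```

For the uniform distribution on $\{1, 2, 3, 4\}$, the order-1 tail entropy is the full entropy $\log 4$, the order-3 tail entropy is $\tfrac{1}{2}\log 4$, and the tail entropy vanishes for any order beyond the support.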

Let be any function such that when . We say that the tail entropy of a class is dominated by if, for all , there exists a number such that for all we have

$$H_n(p) \le \rho(n).$$

Two sets are said to be -separable if there exist countably many sets and , such that , and for all , are disjoint closed sets. Without loss of generality, one may assume that have the following property

$$\inf\{|x - y| : x \in A_n,\ y \in B_n\} > 0,$$

by Lemma A in Appendix A.

We have the following theorem. Let be any function such that as , and let be the class of all distributions over with tail entropy dominated by . Then is e.a.s.-predictable with the entropy loss iff and are -separable.

###### Proof.

For sufficiency, we define . By dominance on the tail entropy, we have and , . Now, since are -separable, we have and such that , and for all

$$\epsilon_i \overset{\text{def}}{=} \inf\{|x - y| \mid x \in B_i,\ y \in C_i\} > 0.$$

Define , where and is chosen so that . We claim that and is -predictable. The first part is easy to verify. To see the second part, note that we can estimate the mass of below sufficiently accurately that the partial entropy differs by at most , with confidence at least . This can be done since the entropy function is uniformly continuous under distance on finite support. The theorem now follows by Theorem 2.

To prove the necessity, we construct a class of distributions with the following properties:

• The range covers all .

• For any there exist only one with , i.e. the distribution is parameterized by its entropy.

• For any such that , we have , i.e. the family is continuous w.r.t. under distance.

• The support of all distributions in is finite.

We construct inductively; at phase we add distributions over to the class. For , we add distribution with to . For , let be the distribution with maximum entropy added in phase , which is exactly the uniform distribution over . We add distributions with to , where has probability on . We note that

$$H(p^{(k)}_\epsilon) = h(\epsilon) + \log(k-1),$$

i.e., the distributions added in phase take entropy range exactly and are monotone increasing in . It can easily be checked that the properties are satisfied.

By Theorem 2, we have a nesting of such that each is -predictable for some with sample size . Denote by the set of entropies of distributions in that intersect with , and by the intersection with . We show that . Suppose otherwise; without loss of generality, one may assume there exists some which is a limit point of both and . Now, by the continuity property of , we can find some and such that . By Corollary C, this contradicts the -predictability of . This completes the proof. ∎

If there exist and such that as (by symmetry, the roles of and can be exchanged), then is not e.a.s.-predictable with the entropy loss.

###### Proof.

Suppose . Without loss of generality, we assume

$$\sum_{n=2}^{\infty} |x_n - x_{n-1}| < \infty, \tag{4.1}$$

by taking a subsequence from if necessary. Let be the class of all distributions with entropy in . By Theorem 2 we know that where is -predictable with sample size , and . Clearly, for any , if then and are in or simultaneously. Otherwise, one would not be able to distinguish them with only samples, since . We now construct a distribution but not in any , thus deriving a contradiction. Let be the first class that contains some such that and with finite support. We now recursively define with finite support and as follows. Suppose has been defined and . Let be the upper bound of the support of . We define , where is the uniform distribution over . We now choose and such that . This can always be done, since and the entropy function is continuous. Now, by the construction, is a Cauchy sequence with respect to total variation and converges to some . Let measure (not necessarily a probability measure) ; by (4.1) we have . Since , we know by the dominated convergence theorem that . However, by the construction, we have , meaning that . Since and the class is nested, we have , a contradiction.

The construction when is very similar; one just needs to choose the small enough so that the entropy of restricted to the support of is at least for some absolute constant , which ensures that for some constant . ∎

The above theorem shows that one cannot decide the problems: is finite? is ? eventually almost surely. However, one should note that we are able to decide the problem: is ? eventually almost surely. See Appendix D for a proof.

### 4.2 The insurance problem

We consider the problem introduced in prior work, including [10]. We consider the random processes to be i.i.d. processes of distributions on . The loss function is defined to be for all processes and . Let be a class of such probabilistic models. The following theorems were shown in [10]. [10, Theorem 1] The collection is e.a.s.-predictable iff there are countably many tight distribution classes such that

$$\mathcal{P} = \bigcup_{i=1}^{\infty} \mathcal{P}_i.$$

[10, Corollary 3] The collection is e.a.s.-learnable iff

$$\mathcal{P} = \bigcup_{i=1}^{\infty} \mathcal{P}_i,$$

where each is tight and relatively open in in the topology induced by metric on .

It is easy to see that the sufficiency parts of the two theorems above follow from Lemma 3 and Theorem 2.1, respectively. However, the necessity parts cannot be derived directly from the properties proved in Section 3, since they require specific properties of the underlying distributions considered. We reproduce the proof in Appendix B to keep the paper self-contained.

### 4.3 Online learning

Let be a set of binary measurable functions over , and let be an arbitrary distribution on . We consider the following prediction game with two parties, nature and the learner, both of whom know and . At the beginning, nature chooses some . At time step , nature independently samples , and the learner outputs a guess of based on his previous observations and the new sample . Nature then reveals after the guess has been made. The learner incurs a loss at step if . We prove the following theorem. The class is e.a.s.-predictable with the above process and loss iff

$$\mathcal{H} = \bigcup_{i=1}^{\infty} \mathcal{H}_i$$

where we have , for all .

###### Proof.

By Theorem 2 and Lemma 3, we know is e.a.s.-predictable if for all

$$\mathcal{H} = \bigcup_{i \in \mathbb{N}} \mathcal{H}_i,$$

where is -predictable with sample size less than . Fix some and the decomposition of . We show that can be partitioned into countably many classes such that any two functions within one class differ on a measure-zero set.

For any , we define a (pseudo)metric . We claim that there exists such that

$$\inf\{d(h_1, h_2) : h_1, h_2 \in \mathcal{H}_i \text{ and } d(h_1, h_2) > 0\} \ge \delta.$$

Otherwise, there exist such that , where is chosen so that . Let be the event that cannot be distinguished within samples. We have . Fix some predictor , and let for . We show that , which contradicts the -predictability of and thus establishes the claim. To do so, we use a probabilistic argument: let h be the random variable chosen uniformly from . We only need to show

 EhE\x∼μ∞1{Φ