We consider the general problem of sequence prediction, where a sequence of symbols $x_1, x_2, \ldots, x_{t-1}$ is drawn from an unknown computable distribution $\mu$, and the task is to predict the next symbol $x_t$. If $\mu$ belongs to some known countable class of distributions, then a Bayesian mixture over the class leads to good loss bounds: the expected loss is at most $L + O(\sqrt{L})$, where $L$ is the loss of the informed predictor that knows $\mu$ (Hutter, 2005, Thm. 3.48). These bounds are known to be tight.
The Solomonoff prior $M$ assigns to a string $x$ the probability that a universal Turing machine prints something starting with $x$ when fed with fair coin flips. Equivalently, the distribution $M$ can be seen as a Bayesian mixture that weighs each distribution according to its Kolmogorov complexity (Wood et al., 2013), assigning higher a priori probability to simpler hypotheses (Hutter, 2007). However, $M$ is incomputable (Leike and Hutter, 2015), which has thus far limited its application.
Schmidhuber has proposed a computable alternative to $M$ which discounts strings that are not efficiently computable (Schmidhuber, 2002). This distribution is called the speed prior because asymptotically only the computationally fastest distributions that explain the data contribute to the mixture. However, no loss bounds for Schmidhuber’s prior, which we write as $S_{Fast}$, are known except in the case where the data are drawn from a prior like Schmidhuber’s.
We introduce a prior $S_{Kt}$ that is related to both $S_{Fast}$ and $M$, and establish in Section 3 that it is also a speed prior in Schmidhuber’s sense. Our first main contribution is a bound on the loss incurred by an $S_{Kt}$-based predictor when predicting strings drawn from a distribution that is computable in polynomial time. This is proved in Section 4. The bounds we get are only a logarithmic factor worse than the bounds for the Solomonoff predictor. In particular, if the measure is deterministic and the loss function penalises errors, $S_{Kt}$-based prediction will only make a logarithmic number of errors. Therefore, $S_{Kt}$ is able to effectively learn the generating distribution $\mu$. Our second main contribution is a proof that the same bound holds for the loss incurred by an $S_{Fast}$-based predictor when predicting a string deterministically generated in polynomial time, shown in the same section.
In Section 5 we discuss the time complexity of $S_{Kt}$ and $S_{Fast}$. We show that $S_{Fast}$ is computable in exponential time while $S_{Kt}$ is computable in doubly-exponential time, but neither is computable in polynomial time, which limits their practical applicability. However, we also show that if we are predicting a sequence that is computable in polynomial time, it only takes polynomial time to compute $S_{Fast}$ and exponential time to compute $S_{Kt}$.
Although the results of this paper are theoretical and the algorithms seem impractical, related ideas from the field of algorithmic information theory have been approximated and put into practice. Examples include the Universal Similarity Metric’s use in clustering (Cilibrasi and Vitanyi, 2005), Solomonoff-based reinforcement learning (Veness et al., 2011), and the Levin search-inspired Optimal Ordered Problem Solver (Schmidhuber, 2004). However, using the theory to devise practical applications is a non-trivial task that we leave for future work.
2.1 Setup and notation
Throughout this paper, we use monotone Turing machines with a binary alphabet $\mathbb{B} = \{0, 1\}$, although all results generalise to arbitrary finite alphabets. A monotone machine is one with a unidirectional read-only input tape where the head can only move one way, a unidirectional write-only output tape where the head can only move one way, and some bidirectional work tapes. We say that a monotone machine $T$ computes string $x$ given program $p$ if the machine prints $x$ after reading all of $p$ but no more, and write $p \xrightarrow{T} x$ (Li and Vitányi, 2008, Def. 4.5.2). Some of these machines are universal Turing machines, or ‘UTM’s. A UTM $U$ can simulate all other machines, so that the output of $U$ given input $I(T)p$ is the same as the output of $T$ given input $p$. Here $I(T)$ is a prefix-free coding of a Turing machine $T$, that is, a coding such that for no two different machines $T$ and $T'$ is $I(T)$ a prefix of $I(T')$. Furthermore, we may assume this simulation occurs with only polynomial time overhead. In this paper, we fix a ‘reference’ UTM $U$, and whenever a function $f(T, \ldots)$ takes an argument $T$ that is a Turing machine, we will often write $f(\ldots) := f(U, \ldots)$, where we set $T$ to be the reference UTM.
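Our notation is fairly standard, with a few exceptions. If $p \xrightarrow{U} x$, then we simply write $p \to x$. We write $f(n) \stackrel{\times}{\leq} g(n)$ if $f(n) = O(g(n))$, and $f(n) \stackrel{\times}{=} g(n)$ if $f(n) \stackrel{\times}{\leq} g(n)$ and $g(n) \stackrel{\times}{\leq} f(n)$. Also, if $x$ is some string, we denote the length of $x$ by $|x|$. We write the set of finite binary strings as $\mathbb{B}^*$, the set of infinite binary sequences as $\mathbb{B}^\infty$, an element of $\mathbb{B}^\infty$ as $x_{1:\infty}$, the $n$th symbol of a string $x$ or $x_{1:\infty}$ as $x_n$, and the first $n$ symbols of any string $x$ or $x_{1:\infty}$ as $x_{1:n}$. $\#A$ is the cardinality of set $A$. Finally, we write $x \sqsubseteq y$ if string $x$ is a prefix of string $y$, and $x \sqsubset y$ if $x$ is a proper prefix of $y$.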
To define $S_{Fast}$, we first need to define the fast algorithm (called search by Li and Vitányi (2008), Ch. 7.5) after which it is named. This algorithm performs phase $i$ for each $i \in \mathbb{N}$, whereby $2^{i-|p|}$ instructions of all programs $p$ satisfying $|p| \leq i$ are executed as they would be on $U$, and the outputs are printed sequentially, separated by blanks. If string $x$ is computed by program $p$ in phase $i$, then we write $p \to_i x$. Then, $S_{Fast}$ is defined as
$$S_{Fast}(x) := \sum_{i=1}^{\infty} 2^{-i} \sum_{p \to_i x} 2^{-|p|}. \qquad (1)$$
This algorithm is inspired by the $Kt$ complexity of a string, defined as
$$Kt(x) := \min_p \{ |p| + \log t(U, p, x) \},$$
where $t(U, p, x)$ is the time taken for program $p$ to compute $x$ on the UTM $U$, and if program $p$ never computes $x$, we set $t(U, p, x) := \infty$ (Li and Vitányi, 2008, Def. 7.5.1). If we define the $Kt$-cost of a computation of a string $x$ by program $p$ as the minimand of $Kt$, that is,
$$Kt\text{-cost}(p, x) := |p| + \log t(p, x),$$
then we can see that program $p$ computes string $x$ in phase $i$ of fast iff $Kt\text{-cost}(p, x) \leq i$. As such, $S_{Fast}$ gives low probability to strings of high $Kt$ complexity.
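The equivalence between the phase schedule of fast and the $Kt$-cost threshold can be checked numerically. The following is a minimal sketch, assuming a computation is summarised only by a program length `plen` and a running time `t` in steps (both hypothetical stand-ins; no actual Turing machine is simulated):

```python
import math

def computes_in_phase(plen: int, t: int, i: int) -> bool:
    """Phase i of fast runs every program of length plen <= i for 2**(i - plen) steps."""
    return plen <= i and t <= 2 ** (i - plen)

def kt_cost(plen: int, t: int) -> float:
    """Kt-cost of a computation: program length plus log2 of its running time."""
    return plen + math.log2(t)

def first_phase(plen: int, t: int) -> int:
    """Smallest phase in which the computation finishes.

    For t >= 1, (t - 1).bit_length() equals ceil(log2(t)), so this is
    ceil(Kt-cost) computed in exact integer arithmetic.
    """
    return plen + (t - 1).bit_length()

# A program of length 3 that needs 5 steps first succeeds in phase 6,
# because 2**(6 - 3) = 8 >= 5 while 2**(5 - 3) = 4 < 5.
print(first_phase(3, 5))
```

Once a computation succeeds in some phase it also succeeds in every later phase, which is the fact used in the comparison of $S_{Fast}$ and $S_{Kt}$ in Section 3.1.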
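Similarly to the above, the monotone Kolmogorov complexity of $x$ is defined as
$$Km(x) := \min_p \{ |p| : p \to x \}.$$
If we define the minimand of $Km$ as
$$Km\text{-cost}(p, x) := \begin{cases} |p| & \text{if } p \to x, \\ \infty & \text{otherwise,} \end{cases}$$
then the Solomonoff prior $M(x) = \sum_{p \to x} 2^{-|p|}$ can be written as
$$M(x) = \sum_{p \to x} 2^{-Km\text{-cost}(p, x)}.$$
$M$ and $S_{Fast}$ are both semimeasures, but not measures: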
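A semimeasure is a function $\nu : \mathbb{B}^* \to [0, \infty)$ such that $\nu(\epsilon) \leq 1$ and $\nu(x) \geq \nu(x0) + \nu(x1)$ for all $x \in \mathbb{B}^*$, where $\epsilon$ is the empty string. If $\nu$ satisfies these with equality, we call $\nu$ a measure.
Semimeasures can be used for prediction:
If $\nu$ is a semimeasure, the $\nu$-probability of $x_t$ given $x_{<t}$ is $\nu(x_t \mid x_{<t}) := \nu(x_{1:t}) / \nu(x_{<t})$.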
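As an illustration of these two definitions, the sketch below checks the semimeasure inequalities on all strings up to a given length and computes a conditional probability as a ratio. The functions `uniform`, `leaky` and `too_heavy` are hypothetical examples introduced here, not objects from the paper.

```python
from fractions import Fraction

def is_semimeasure(nu, max_len: int) -> bool:
    """Check nu("") <= 1 and nu(x) >= nu(x0) + nu(x1) for all strings with |x| < max_len."""
    if nu("") > 1:
        return False
    frontier = [""]
    for _ in range(max_len):
        next_frontier = []
        for x in frontier:
            if nu(x) < nu(x + "0") + nu(x + "1"):
                return False
            next_frontier += [x + "0", x + "1"]
        frontier = next_frontier
    return True

def conditional(nu, x_prev: str, bit: str) -> Fraction:
    """nu-probability of `bit` given `x_prev`: the ratio nu(x_prev + bit) / nu(x_prev)."""
    return nu(x_prev + bit) / nu(x_prev)

def uniform(x: str) -> Fraction:
    return Fraction(1, 2) ** len(x)      # a measure: both conditions hold with equality

def leaky(x: str) -> Fraction:
    return Fraction(3, 8) ** len(x)      # a semimeasure that loses mass at every step

def too_heavy(x: str) -> Fraction:
    return Fraction(2, 3) ** len(x)      # violates nu(x) >= nu(x0) + nu(x1)

print(is_semimeasure(leaky, 6), conditional(leaky, "01", "1"))
```

For `leaky` the two conditional probabilities sum to 3/4 rather than 1, which is the sense in which a semimeasure may fail to be a measure.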
3 Speed priors
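By analogy to $M$, we can define a variant of the Solomonoff prior that penalises strings of high $Kt$ complexity more directly than $S_{Fast}$ does:
$$S_{Kt}(x) := \sum_{p \to x} 2^{-Kt\text{-cost}(p, x)} = \sum_{p \to x} \frac{2^{-|p|}}{t(p, x)}. \qquad (2)$$
$S_{Kt}$ is a semimeasure, but is not a measure.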
3.1 Similar definitions for $S_{Fast}$ and $S_{Kt}$
The definitions (1) of $S_{Fast}$ and (2) of $S_{Kt}$ have been given in different forms: the first in terms of phases of fast, and the second in terms of $Kt$-cost. In this subsection, we show that each can be rewritten in a form similar to the other’s definition, which sheds light on the differences and similarities between the two.
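Proposition 3.
$$S_{Fast}(x) \stackrel{\times}{=} \sum_{p \to x} \frac{2^{-2|p|}}{t(p, x)}.$$
Proof.
First, we note that for each program $p$ and string $x$, if $p \to_i x$, then $p \to_j x$ for all $j \geq i$. Now,
$$\sum_{j=i}^{\infty} 2^{-j} \times 2^{-|p|} = 2 \times 2^{-i} \times 2^{-|p|}, \qquad (3)$$
since all of the contributions to $S_{Fast}(x)$ from program $p$ in phases $j \geq i$ add up to twice the contribution from $p$ in phase $i$ alone.
Next, suppose $p \to_i x$. Then, by the definition of fast,
$$\log t(p, x) \leq i - |p| \iff |p| + \log t(p, x) \leq i.$$
Also, if $p \not\to_{i-1} x$, then either $|p| > i - 1$, implying $|p| + \log t(p, x) > i - 1$, or $t(p, x) > 2^{i-1-|p|}$, also implying $|p| + \log t(p, x) > i - 1$. Therefore, if $p \to_i x$ and $p \not\to_{i-1} x$, then
$$i - 1 < |p| + \log t(p, x) \leq i,$$
that is,
$$-|p| - \log t(p, x) - 1 < -i \leq -|p| - \log t(p, x). \qquad (4)$$
Subtracting $|p|$ and exponentiating yields
$$\frac{2^{-2|p|-1}}{t(p, x)} \leq 2^{-i-|p|} \leq \frac{2^{-2|p|}}{t(p, x)},$$
so that $2^{-i-|p|} \stackrel{\times}{=} 2^{-2|p|}/t(p, x)$, which, together with equation (3), proves the proposition. ∎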
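The constant factors in this argument can be verified with exact rational arithmetic. The sketch below assumes, as a simplification, that a computation is described only by a hypothetical pair (program length `plen`, running time `t`); it compares the total contribution of such a computation to the phase-based sum with the quantity $2^{-2|p|}/t(p,x)$.

```python
from fractions import Fraction

def first_phase(plen: int, t: int) -> int:
    """Smallest phase i with t <= 2**(i - plen); (t - 1).bit_length() is ceil(log2 t)."""
    return plen + (t - 1).bit_length()

def fast_contribution(plen: int, t: int) -> Fraction:
    """Sum over all phases j >= i of 2**-j * 2**-plen, using the geometric series (3)."""
    i = first_phase(plen, t)
    return 2 * Fraction(1, 2 ** (i + plen))

def cost_form(plen: int, t: int) -> Fraction:
    """The per-program term 2**(-2 * plen) / t from the cost-based expression."""
    return Fraction(1, 2 ** (2 * plen) * t)

# The phase-based contribution is within a factor of 2 of the cost-based term.
ratio = fast_contribution(3, 5) / cost_form(3, 5)
print(ratio)
```

In this toy setting the ratio always lies between 1 and 2, which is the multiplicative constant hidden in the $\stackrel{\times}{=}$ relation.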
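Proposition 4.
$$S_{Kt}(x) \stackrel{\times}{=} \sum_{i=1}^{\infty} 2^{-i} \sum_{p \to_i x} 1.$$
Proof.
Using equation (4), we have that if $p \to_i x$ and $p \not\to_{i-1} x$, then
$$\frac{2^{-|p|-1}}{t(p, x)} \leq 2^{-i} \leq \frac{2^{-|p|}}{t(p, x)}, \quad \text{so} \quad 2^{-i} \stackrel{\times}{=} \frac{2^{-|p|}}{t(p, x)}.$$
Summing over all programs $p$ such that $p \to_i x$ and $p \not\to_{i-1} x$, we have
$$2^{-i} \sum_{p \to_i x,\; p \not\to_{i-1} x} 1 \;\stackrel{\times}{=} \sum_{p \to_i x,\; p \not\to_{i-1} x} \frac{2^{-|p|}}{t(p, x)}.$$
Then, summing over all phases $i$, we have
$$\sum_{i=1}^{\infty} 2^{-i} \sum_{p \to_i x,\; p \not\to_{i-1} x} 1 \;\stackrel{\times}{=} \sum_{p \to x} \frac{2^{-|p|}}{t(p, x)}. \qquad (6)$$
Now, as noted in the proof of Proposition 3, if $q \to_i x$, then $q \to_j x$ for all $j \geq i$. Similarly to the start of that proof, we note that
$$\sum_{j=i}^{\infty} 2^{-j} \times 1 = 2 \times 2^{-i} \times 1.$$
The left hand side is the contribution of $q$ to the sum
$$\sum_{i=1}^{\infty} 2^{-i} \sum_{p \to_i x} 1,$$
and the right hand side is twice the contribution of $q$ to the sum
$$\sum_{i=1}^{\infty} 2^{-i} \sum_{p \to_i x,\; p \not\to_{i-1} x} 1.$$
Therefore the two sums agree up to a factor of 2, which, together with (6), proves the proposition. ∎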
3.2 $S_{Kt}$ is a speed prior
Although we have defined $S_{Kt}$, we have not shown any results that indicate it deserves to be called a speed prior. Two key properties of $S_{Fast}$ justify its description as a speed prior: firstly, that the cumulative prior probability measure of all $x$ incomputable in time $t$ is at most inversely proportional to $t$, and secondly, that if $x_{1:\infty} \in \mathbb{B}^\infty$, and program $p^* \in \mathbb{B}^*$ computes $x_{1:n}$ within at most $f(n)$ steps, then the contribution to $S_{Fast}(x_{1:n})$ by programs that take time much longer than $f(n)$ vanishes as $n \to \infty$ (Schmidhuber, 2002). In this subsection, we prove that both of these properties also hold for $S_{Kt}$. $S_{Fast}$ and $S_{Kt}$ are the only distributions that the authors are aware of that satisfy these two properties.
Let $C_t$ denote the set of strings $x$ that are incomputable in time $t$ (that is, there is no program $p$ such that $p \to x$ in $t$ or fewer timesteps) such that for any $y \sqsubset x$, the prefix $y$ is computable in time $t$. By definition, all strings that are incomputable in time $t$ have as a prefix an element of $C_t$, and $C_t$ is by construction a prefix-free set (that is, a set such that no element is a prefix of another element). Furthermore, the probability measure of all strings incomputable in time $t$ is simply the sum of the probabilities of all elements of $C_t$.
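Proposition 5.
$$\sum_{x \in C_t} S_{Kt}(x) \leq \frac{1}{t}.$$
Proof.
Every program that computes an element of $C_t$ takes more than $t$ steps to do so, hence
$$\sum_{x \in C_t} S_{Kt}(x) = \sum_{x \in C_t} \sum_{p \to x} \frac{2^{-|p|}}{t(p, x)} \leq \frac{1}{t} \sum_{x \in C_t} \sum_{p \to x} 2^{-|p|} \leq \frac{1}{t}$$
by the Kraft inequality, since the fact that $C_t$ is a prefix-free set guarantees that the set of programs that compute elements of $C_t$ is also prefix-free, due to our use of monotone machines. ∎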
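Proposition 6.
Let $x_{1:\infty} \in \mathbb{B}^\infty$ be such that there exists a program $p^* \in \mathbb{B}^*$ which outputs $x_{1:n}$ in $f(n)$ steps for all $n \in \mathbb{N}$. Let $g(n)$ grow faster than $f(n)$, i.e. $\lim_{n \to \infty} f(n)/g(n) = 0$. Then,
$$\lim_{n \to \infty} \frac{\sum_{p \to_{\geq g(n)} x_{1:n}} 2^{-|p|} / t(p, x_{1:n})}{\sum_{p \to_{\leq f(n)} x_{1:n}} 2^{-|p|} / t(p, x_{1:n})} = 0,$$
where $p \to_{\leq t} x$ iff program $p$ computes string $x$ in no more than $t$ steps, and $p \to_{\geq t} x$ iff $p$ computes $x$ and takes at least $t$ steps to do so.
An informal statement of this proposition is that contributions to $S_{Kt}(x_{1:n})$ by programs that take time longer than $g(n)$ steps to run are dwarfed by those by programs that take less than $f(n)$ steps to run. Therefore, asymptotically, only the fastest programs contribute to $S_{Kt}$.
Proof.
$$\lim_{n \to \infty} \frac{\sum_{p \to_{\geq g(n)} x_{1:n}} 2^{-|p|} / t(p, x_{1:n})}{\sum_{p \to_{\leq f(n)} x_{1:n}} 2^{-|p|} / t(p, x_{1:n})} \leq \lim_{n \to \infty} \frac{\sum_{p \to_{\geq g(n)} x_{1:n}} 2^{-|p|} / g(n)}{2^{-|p^*|} / f(n)} \qquad (7)$$
$$\leq \lim_{n \to \infty} \frac{f(n)}{g(n)} \cdot \frac{\sum_{p \to x_{1:n}} 2^{-|p|}}{2^{-|p^*|}} \qquad (8)$$
$$\leq \lim_{n \to \infty} \frac{f(n)}{g(n)} \cdot \frac{1}{2^{-|p^*|}} \qquad (9)$$
$$= 0.$$
Equation (7) comes from increasing $1/t(p, x_{1:n})$ to $1/g(n)$ in the numerator, and decreasing the denominator by throwing out all terms of the sum except that of $p^*$, which takes $f(n)$ time to compute $x_{1:n}$. Equation (8) takes $f(n)/g(n)$ out of the fraction, and increases the numerator by adding contributions from all programs that compute $x_{1:n}$. Equation (9) uses the Kraft inequality to bound $\sum_{p \to x_{1:n}} 2^{-|p|}$ from above by 1. Finally, we use the fact that $\lim_{n \to \infty} f(n)/g(n) = 0$. ∎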
4 Loss bounds
In this section, we prove a performance bound on $S_{Kt}$-based sequence prediction, when predicting a sequence drawn from a measure that is estimable in polynomial time. We also prove a similar bound on $S_{Fast}$-based sequence prediction when predicting deterministic sequences computable in polynomial time.
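For the purpose of this section, we write $S_{Kt}$ somewhat more explicitly as
$$S_{Kt}(x) = \sum_{p \xrightarrow{U} x} \frac{2^{-|p|}}{t(U, p, x)},$$
and give some auxiliary definitions. Let $\langle \cdot \rangle_{\mathbb{B}^*}$ be a prefix-free coding of the strings of finite length and $\langle \cdot \rangle_{\mathbb{N}}$ be a prefix-free coding of the integers, where both of these prefix-free codings are computable and decodable in polynomial time.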
A function $f : \mathbb{B}^* \to \mathbb{R}$ is finitely computable if there is some Turing machine $T_f$ that when given input $\langle x \rangle_{\mathbb{B}^*}$ prints $\langle m \rangle_{\mathbb{N}} \langle n \rangle_{\mathbb{N}}$ and then halts, where $f(x) = m/n$. The function $f$ is finitely computable in polynomial time if it takes $T_f$ at most $p(|x|)$ timesteps to halt on input $x$, where $p$ is a polynomial.
Let $f, g : \mathbb{B}^* \to \mathbb{R}$. $g$ is estimable in polynomial time by $f$ if $f$ is finitely computable in polynomial time and $f(x) \stackrel{\times}{=} g(x)$. The function $g$ is estimable in polynomial time if it is estimable in polynomial time by some function $f$.
First, note that this definition is reasonably weak, since we only require $f(x) \stackrel{\times}{=} g(x)$, rather than $f(x) = g(x)$. Also note that if $f$ is finitely computable in polynomial time, it is estimable in polynomial time by itself. For a measure $\mu$, estimability in polynomial time captures our intuitive notion of efficient computability: we only need to know $\mu$ up to a constant factor for prediction, and we can find this out in polynomial time.
We consider a prediction setup where a predictor outputs a prediction, and then receives some loss depending on the predicted next bit and the correct next bit. More formally, we have some loss function $\ell(x_t, y_t) \in [0, 1]$ defined for all $x_t, y_t \in \mathbb{B}$ and all $t \in \mathbb{N}$, representing the loss incurred for a prediction of $y_t$ when the actual next bit is $x_t$, which the predictor observes after prediction. One example of such a loss function is the 0-1 loss, which assigns 0 to a correct prediction and 1 to an incorrect prediction, although there are many others.
We define the $\Lambda_\rho$ predictor to be the predictor which minimises $\rho$-expected loss, outputting $y_t^{\Lambda_\rho} := \arg\min_{y_t} \sum_{x_t} \rho(x_t \mid x_{<t}) \, \ell(x_t, y_t)$ at time $t$. If the true distribution is $\mu$, we judge a predictor $\Lambda$ by its total $\mu$-expected loss in the first $n$ steps:
$$L^{\Lambda}_{n\mu} := \mathbb{E}_\mu \left[ \sum_{t=1}^{n} \ell(x_t, y_t^{\Lambda}) \right].$$
In particular, if we are using 0-1 loss, $L^{\Lambda}_{n\mu}$ is the expected number of errors made by $\Lambda$ up to time $n$ in the environment $\mu$.
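The prediction setup can be sketched in a few lines. The code below is only an illustration under simplifying assumptions: the predictive distribution is supplied as a hypothetical function `rho_next` returning the probability that the next bit is 1, the Laplace rule is used as a computable stand-in for a semimeasure (it is not one of the priors studied here), and the loss reported is the realised loss along one sequence rather than an expectation.

```python
def zero_one_loss(actual: int, predicted: int) -> float:
    """0-1 loss: 0 for a correct prediction, 1 for an incorrect one."""
    return 0.0 if actual == predicted else 1.0

def predict(rho_next, history, loss=zero_one_loss) -> int:
    """Return the prediction y that minimises the rho-expected loss of the next bit."""
    p1 = rho_next(history)
    expected = {y: (1 - p1) * loss(0, y) + p1 * loss(1, y) for y in (0, 1)}
    return min(expected, key=expected.get)   # ties are broken in favour of 0

def total_loss(rho_next, sequence, loss=zero_one_loss) -> float:
    """Realised cumulative loss of the predictor along one observed sequence."""
    return sum(loss(x, predict(rho_next, sequence[:t])) for t, x in enumerate(sequence))

def laplace(history) -> float:
    """Hypothetical stand-in predictor: Laplace's rule of succession."""
    return (sum(history) + 1) / (len(history) + 2)

# On the all-ones sequence the only error is the first, tie-broken, prediction.
print(total_loss(laplace, [1] * 10))
```

Under 0-1 loss the minimiser is simply the more probable next bit, so the realised loss counts prediction errors.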
Theorem 9 (Bound on prediction loss).
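If $\mu$ is a measure that is estimable in polynomial time by some semimeasure $\nu$, and $x$ is a sequence sampled from $\mu$, then the expected loss incurred by the $\Lambda_{S_{Kt}}$ predictor is bounded by
$$L^{\Lambda_{S_{Kt}}}_{n\mu} - L^{\Lambda_\mu}_{n\mu} \leq 2 D_n + 2 \sqrt{L^{\Lambda_\mu}_{n\mu} D_n},$$
where $D_n = O(\log n)$. (A similar bound that can be proved the same way is $\sqrt{L^{\Lambda_{S_{Kt}}}_{n\mu}} - \sqrt{L^{\Lambda_\mu}_{n\mu}} \leq \sqrt{2 D_n}$ for the same $D_n$ (Hutter, 2007, Eq. 8, 5).)
Since $L^{\Lambda_\mu}_{n\mu} \leq n$, this means that $\Lambda_{S_{Kt}}$ only incurs at most $O(\sqrt{n \log n})$ extra loss in expectation, although this bound will be much tighter in more structured environments where $\Lambda_\mu$ makes few errors, such as deterministic environments.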
In order to prove this theorem, we use the following lemma:
Let $\nu$ be a semimeasure that is finitely computable in polynomial time. There exists a Turing machine $T_\nu$ such that for all $x \in \mathbb{B}^*$
$$\nu(x) = \sum_{p \xrightarrow{T_\nu} x} 2^{-|p|} \qquad (10)$$
and
$$2^{-Km_{T_\nu}(x)} \geq \nu(x) / 4, \qquad (11)$$
where $Km_{T_\nu}(x)$ is the length of the shortest program for $x$ on $T_\nu$. (Note that this lemma would be false if we were to let $\nu$ be an arbitrary lower-semicomputable semimeasure, since if $\nu = M$, this would imply that $2^{-Km(x)} \stackrel{\times}{=} M(x)$, which was disproved by Gács (1983).)
Note that a proof already exists that there is some machine $T_\nu$ such that (10) holds (Li and Vitányi, 2008, Thm. 4.5.2), but it does not prove (11), and we wish to understand the operation of $T_\nu$ in order to prove Theorem 9.
Proof of Lemma 10.
The machine $T_\nu$ is essentially a decoder of an algorithmic coding scheme with respect to $\nu$. It uses the natural correspondence between $\mathbb{B}^\infty$ and $[0, 1]$, associating a binary string $x_1 x_2 x_3 \cdots$ with the real number $0.x_1 x_2 x_3 \cdots$. It determines the location of the input sequence on this line, and then assigns a certain interval for each output string, such that the width of the interval for output string $x$ is equal to $\nu(x)$. Then, if input string $p$ lies inside the interval for the output string $x$, it outputs $x$.
$T_\nu$ first calculates $\nu(0)$ and $\nu(1)$, and sets $[0, \nu(0))$ as the output interval for 0 and $[\nu(0), \nu(0) + \nu(1))$ as the output interval for 1. It then reads the input, bit by bit. After reading input $p_{1:n}$, it constructs the input interval $[0.p_1 p_2 \cdots p_n, \; 0.p_1 p_2 \cdots p_n 1111\cdots)$, which represents the interval that $0.p_1 p_2 \cdots p_n p_{n+1} \cdots$ could lie in. It then checks if this input interval is contained in one of the output intervals. If it is, then it prints the output appropriate for the interval, and if not, then it reads one more bit and repeats the process.
Suppose the first output bit is a 1. Then, $T_\nu$ calculates $\nu(10)$ and $\nu(11)$, and forms the new output intervals: $[\nu(0), \nu(0) + \nu(10))$ for outputting 0, and $[\nu(0) + \nu(10), \nu(0) + \nu(10) + \nu(11))$ for outputting 1. It then reads more input bits until the input interval lies within one of these new output intervals, and then outputs the appropriate bit. The computation proceeds in this fashion.
Equation (10) is satisfied, because $\sum_{p \xrightarrow{T_\nu} x} 2^{-|p|}$ is just the total length of all possible input intervals that fit inside the output interval for $x$, which by construction is $\nu(x)$.
To show that (11) is satisfied, note that $2^{-Km_{T_\nu}(x)}$ is the length of the largest input interval for $x$. Now, input intervals are binary intervals (that is, their start points and end points have a finite binary expansion), and for every interval $I$, there is some binary interval contained in $I$ with length at least $1/4$ that of $I$. Therefore, the output interval for $x$ contains some input interval with length at least $1/4$ that of the length of the output interval. Since the length of the output interval for $x$ is just $\nu(x)$, we can conclude that $2^{-Km_{T_\nu}(x)} \geq \nu(x)/4$. ∎
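The interval-based decoder described in this proof can be sketched as follows. This is an illustrative sketch rather than the construction from the paper: the semimeasure is passed as a Python function returning exact fractions, the names `decode`, `uniform` and `bernoulli` are hypothetical, and a `max_out` cap is added so that the loop terminates even when the semimeasure forces unboundedly many output bits.

```python
from fractions import Fraction

def decode(nu, program: str, max_out: int = 64) -> str:
    """Decode `program` by nesting input intervals inside output intervals of width nu(x)."""
    out = ""
    out_lo = Fraction(0)                          # left end of the output interval for `out`
    in_lo, in_width = Fraction(0), Fraction(1)    # input interval [in_lo, in_lo + in_width)
    pos = 0
    while len(out) < max_out:
        mid = out_lo + nu(out + "0")              # output interval for out+"0" is [out_lo, mid)
        hi = mid + nu(out + "1")                  # output interval for out+"1" is [mid, hi)
        in_hi = in_lo + in_width
        if out_lo <= in_lo and in_hi <= mid:
            out += "0"
        elif mid <= in_lo and in_hi <= hi:
            out += "1"
            out_lo = mid
        elif pos < len(program):
            in_width /= 2                         # reading one more bit halves the input interval
            if program[pos] == "1":
                in_lo += in_width
            pos += 1
        else:
            break                                 # input exhausted, nothing more can be printed
    return out

def uniform(x: str) -> Fraction:
    return Fraction(1, 2) ** len(x)

def bernoulli(q: Fraction):
    """i.i.d. measure that assigns probability q to each 0 and 1 - q to each 1."""
    return lambda x: q ** x.count("0") * (1 - q) ** x.count("1")

print(decode(uniform, "101"), decode(bernoulli(Fraction(3, 4)), "0"))
```

Under the uniform measure the decoder copies its input, while under a measure biased towards 0 a single input bit 0 already yields two output bits, which illustrates how likely strings receive short programs.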
Proof of Theorem 9.
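Using Lemma 10, we show a bound on $S_{Kt}$ that bounds its KL divergence with $\mu$. We then apply the unit loss bound (Hutter, 2005, Thm. 3.48) (originally shown for the Solomonoff prior, but valid for any prior) to show the desired result.
First, we reason about the running time of the shortest program that prints $x$ on the machine $T_\nu$ (defined in Lemma 10). Since we would only calculate $\nu(y0)$ and $\nu(y1)$ for $y \sqsubset x$, this amounts to $2|x|$ calculations. Each calculation need only take polynomial time in the length of its argument, because $T_\nu$ could just simulate the machine that takes input $x$ and returns the numerator and denominator of $\nu(x)$, prefix-free coded, and it only takes polynomial time to undo this prefix-free coding. Therefore, the calculations take at most $2|x| f(|x|) =: g(|x|)$ steps, where $f$ is a polynomial. We also, however, need to read all the bits of the input, construct the input intervals, and compare them to the output intervals. This takes time linear in the number of bits read, and for the shortest program that prints $x$, this number of bits is (by definition) $Km_{T_\nu}(x)$. Since $2^{-Km_{T_\nu}(x)} \stackrel{\times}{=} \nu(x)$, $Km_{T_\nu}(x) \leq -\log \nu(x) + O(1)$, and since $\nu(x) \stackrel{\times}{=} \mu(x)$, $-\log \nu(x) \leq -\log \mu(x) + O(1)$. Therefore, the total time taken is bounded above by $g(|x|) - O(1) \log \mu(x)$, where we absorb the additive constants into $g(|x|)$.
This out of the way, we can calculate
$$S_{Kt}(x) = \sum_{p \xrightarrow{U} x} \frac{2^{-|p|}}{t(U, p, x)} \stackrel{\times}{\geq} \sum_{p \xrightarrow{T_\nu} x} \frac{2^{-|p|}}{t(T_\nu, p, x)^{O(1)}} \geq \frac{2^{-Km_{T_\nu}(x)}}{\left( g(|x|) - O(1) \log \mu(x) \right)^{O(1)}} \stackrel{\times}{\geq} \frac{\mu(x)}{\left( g(|x|) - O(1) \log \mu(x) \right)^{O(1)}}. \qquad (12)$$
The first inequality uses the polynomial-overhead simulation of $T_\nu$ by $U$, the second keeps only the shortest program for $x$ on $T_\nu$, and the third uses (11) together with $\nu(x) \stackrel{\times}{=} \mu(x)$.
Now, the unit loss bound tells us that
$$L^{\Lambda_{S_{Kt}}}_{n\mu} - L^{\Lambda_\mu}_{n\mu} \leq 2 D_n(\mu \| S_{Kt}) + 2 \sqrt{L^{\Lambda_\mu}_{n\mu} D_n(\mu \| S_{Kt})},$$
where $D_n(\mu \| S_{Kt}) := \mathbb{E}_\mu \left[ \ln \left( \mu(x_{1:n}) / S_{Kt}(x_{1:n}) \right) \right]$ is the relative entropy. We can calculate $D_n(\mu \| S_{Kt})$ using equation (12):
$$D_n(\mu \| S_{Kt}) \stackrel{\times}{\leq} \mathbb{E}_\mu \left[ \ln \left( g(n) - O(1) \log \mu(x_{1:n}) \right) \right] \leq \ln \mathbb{E}_\mu \left[ g(n) - O(1) \log \mu(x_{1:n}) \right] = \ln \left( g(n) + O(1) H_\mu(x_{1:n}) \right) \leq \ln \left( g(n) + O(n) \right) \stackrel{\times}{=} \log n,$$
where the second inequality is Jensen’s inequality and $H_\mu(x_{1:n})$ denotes the binary entropy of the random variable $x_{1:n}$ with respect to $\mu$. ∎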
We therefore have a loss bound on the $S_{Kt}$-based sequence predictor in environments that are estimable in polynomial time by a semimeasure. Furthermore:
$L^{\Lambda_{S_{Kt}}}_{n\mu} \leq 2 D_n(\mu \| S_{Kt}) = O(\log n)$ for deterministic measures $\mu$ (that is, measures that give probability 1 to prefixes of one particular infinite sequence) computable in polynomial time, if correct predictions incur no loss.
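We should note that this method fails to prove similar bounds for $S_{Fast}$, since we instead get
$$S_{Fast}(x) \stackrel{\times}{=} \sum_{p \to x} \frac{2^{-2|p|}}{t(p, x)} \stackrel{\times}{\geq} \frac{\mu(x)^2}{\left( g(|x|) - O(1) \log \mu(x) \right)^{O(1)}}, \qquad (16)$$
which gives us
$$D_n(\mu \| S_{Fast}) \leq O(\log n) + H_\mu(x_{1:n}).$$
Since $H_\mu(x_{1:n})$ can grow linearly in $n$ (for example, take $\mu$ to be $\lambda$, the uniform measure), this can only prove a trivial linear loss bound without restrictions on the measure $\mu$. It is also worth explicitly noting that the constants hidden in the $O(\cdot)$ notation depend on the environment $\mu$, as will be the case for the rest of this paper.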
One important application of Theorem 9 is to the 0-1 loss function. Then, it states that a predictor that outputs the most likely successor bit according to $S_{Kt}$ only makes logarithmically many errors in a deterministic environment computable in polynomial time. In other words, $S_{Kt}$ quickly learns the sequence it is predicting, making very few errors.
Next, we show that $S_{Fast}$ makes only logarithmically many errors on a sequence deterministically computed in polynomial time. This follows from a rather simple argument.
Theorem 12 (Bound on prediction loss).
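Let $\mu$ be a deterministic environment and $x_{1:\infty}$ be the sequence whose prefixes $\mu$ assigns probability 1 to. If $x_{1:\infty}$ is computable in polynomial time by a program $p^x$, then $S_{Fast}$ only incurs logarithmic loss, if correct predictions incur no loss.
Using the unit loss bound,
$$L^{\Lambda_{S_{Fast}}}_{n\mu} = L^{\Lambda_{S_{Fast}}}_{n\mu} - L^{\Lambda_\mu}_{n\mu} \leq 2 D_n(\mu \| S_{Fast}) = -2 \ln S_{Fast}(x_{1:n}) \stackrel{\times}{\leq} 2|p^x| + \log t(p^x, x_{1:n}) \stackrel{\times}{=} \log n,$$
where the second inequality follows from Proposition 3 by keeping only the term of $p^x$, and the last step uses that $t(p^x, x_{1:n})$ is polynomial in $n$. ∎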
5 Time complexity
Although it has been proved that $S_{Fast}$ is computable (Schmidhuber, 2002), no bounds are given for its computational complexity. Given that the major advantage of $S_{Fast}$-based prediction over $M$-based prediction is its computability, it is of interest to determine the time required to compute $S_{Fast}$, and whether such a computation is feasible or not. The same questions apply to $S_{Kt}$, to a greater extent because we have not even yet shown that $S_{Kt}$ is computable.
In this section, we show that an arbitrarily good approximation to $S_{Fast}(x)$ is computable in time exponential in $|x|$, and an arbitrarily good approximation to $S_{Kt}(x)$ is computable in time doubly-exponential in $|x|$. We do this by explicitly constructing algorithms that perform phases of fast until enough contributions to $S_{Fast}$ or $S_{Kt}$ are found to constitute a sufficient proportion of the total.
We also show that no such approximation of $S_{Kt}$ or $S_{Fast}$ can be computed in polynomial time. We do this by contradiction: if it were possible to do so, we would be able to construct an ‘adversarial’ sequence that was computable in polynomial time, yet could not be predicted by our approximation.
Finally, we investigate the time taken to compute $S_{Kt}$ and $S_{Fast}$ along a polynomial-time computable sequence $x_{1:\infty}$. If we wanted to predict the most likely continuation of $x_{1:n}$ according to $S \in \{S_{Kt}, S_{Fast}\}$, we would have to compute an approximation to $S(x_{1:n}0)$ and $S(x_{1:n}1)$, to see which one was greater. We show that it is possible to compute these approximations in polynomial time for $S_{Fast}$ and in exponential time for $S_{Kt}$: an exponential improvement over the worst-case bounds in both cases.
5.1 Upper bounds
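Theorem 13 ($S_{Fast}$ computable in exponential time).
For any $\varepsilon > 0$, there exists an approximation $S^{\varepsilon}_{Fast}$ of $S_{Fast}$ such that $|S^{\varepsilon}_{Fast} / S_{Fast} - 1| \leq \varepsilon$ and $S^{\varepsilon}_{Fast}(x)$ is computable in time exponential in $|x|$.
First, we note that in phase $i$ of fast, we try out $2^1 + \cdots + 2^i \leq 2^{i+1}$ program prefixes $p$, and each prefix $p$ gets $2^{i-|p|}$ steps. Therefore, the total number of steps in phase $i$ is $2^1 \times 2^{i-1} + 2^2 \times 2^{i-2} + \cdots + 2^i \times 2^{i-i} = i 2^i$, and the total number of steps in the first $k$ phases is
$$\sum_{i=1}^{k} i 2^i = 2^{k+1}(k - 1) + 2.$$
Now, suppose we want to compute a sufficient approximation $S^{\varepsilon}_{Fast}(x)$. If we compute $k$ phases of fast and then add up all the contributions to $S_{Fast}(x)$ found in those phases, the remaining contributions must add up to at most $\sum_{i=k+1}^{\infty} 2^{-i} = 2^{-k}$. In order for the contributions we have added up to contribute at least $1 - \varepsilon$ of the total, it suffices to use $k$ such that
$$2^{-k} \leq \varepsilon \, S_{Fast}(x), \quad \text{that is,} \quad k \geq -\log \left( \varepsilon \, S_{Fast}(x) \right).$$
Now, since the uniform measure $\lambda$ is finitely computable in polynomial time, it is estimable in polynomial time by itself, so we can substitute $\lambda$ into equation (16) to obtain
$$S_{Fast}(x) \stackrel{\times}{\geq} \frac{2^{-2|x|}}{\left( g(|x|) + O(|x|) \right)^{O(1)}},$$
where $g$ is a polynomial. Hence it suffices to take $k = 2|x| + O(\log |x|) - \log \varepsilon$, for which the number of steps $2^{k+1}(k-1) + 2$ is exponential in $|x|$.
Therefore, $S^{\varepsilon}_{Fast}$ is computable in exponential time. ∎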
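The step count used in this proof can be checked by brute force. The sketch below assumes the accounting described above (every binary string of length between 1 and $i$ is treated as a program prefix and run for $2^{i-|p|}$ steps in phase $i$); the function names are hypothetical.

```python
def steps_in_phase(i: int) -> int:
    """Total steps in phase i: 2**l program prefixes of each length l <= i, 2**(i - l) steps each."""
    return sum(2 ** l * 2 ** (i - l) for l in range(1, i + 1))

def steps_up_to(k: int) -> int:
    """Total steps spent in phases 1, ..., k."""
    return sum(steps_in_phase(i) for i in range(1, k + 1))

# Each phase i costs i * 2**i steps, and the first k phases cost 2**(k + 1) * (k - 1) + 2.
for k in range(1, 6):
    print(k, steps_in_phase(k), steps_up_to(k))
```

Since the number of phases needed grows linearly in $|x|$ for fixed $\varepsilon$, this count is what makes the overall running time exponential in $|x|$.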
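Theorem 14 ($S_{Kt}$ computable in doubly-exponential time).
For any $\varepsilon > 0$, there exists an approximation $S^{\varepsilon}_{Kt}$ of $S_{Kt}$ such that $|S^{\varepsilon}_{Kt} / S_{Kt} - 1| \leq \varepsilon$ and $S^{\varepsilon}_{Kt}(x)$ is computable in time doubly-exponential in $|x|$.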
We again use the general strategy of computing phases of fast, and adding up all the contributions to we find. Once we have done this, the other contributions come from computations with -cost . Therefore, the programs making these contributions either have a program of length , or take time (or both).
First, we bound the contribution to by computations of time