Remarkably, there is a theoretically optimal way of making predictions based on observations, rooted in the early work of Solomonoff and Kolmogorov [62, 28]. The approach reflects basic principles of Occam’s razor: simple explanations of data are preferable to complex ones.
The theory of universal inductive inference quantifies what simplicity really means. Given certain very broad computability assumptions, it provides techniques for making optimally reliable statements about future events, given the past.
Once there is an optimal, formally describable way of predicting the future, we should be able to construct a machine that continually computes and executes action sequences that maximize expected or predicted reward, thus solving an ancient goal of AI research.
For many decades, however, AI researchers have not paid a lot of attention to the theory of inductive inference. Why not? There is another reason besides the fact that most of them have traditionally ignored theoretical computer science: the theory has been perceived as being associated with excessive computational costs. In fact, its most general statements refer to methods that are optimal (in a certain asymptotic sense) but incomputable. So researchers in machine learning and artificial intelligence have often resorted to alternative methods that lack a strong theoretical foundation but at least seem feasible in certain limited contexts. For example, since the early attempts at building a “General Problem Solver”[36, 43] much work has been done to develop mostly heuristic machine learning algorithms that solve new problems based on experience with previous problems. Many pointers to learning by chunking, learning by macros, hierarchical learning, learning by analogy, etc. can be found in Mitchell’s book  and Kaelbling’s survey .
Recent years, however, have brought substantial progress in the field of computable and feasible variants of optimal algorithms for prediction, search, inductive inference, problem solving, decision making, and reinforcement learning in very general environments. In what follows I will focus on the results obtained at IDSIA.
Sections 3, 4, 7 relate Occam’s razor and the notion of simplicity to the shortest algorithms for computing computable objects, and will concentrate on recent asymptotic optimality results for universal learning machines, essentially ignoring issues of practical feasibility—compare Hutter’s contribution  in this volume.
Section 5, however, will focus on our recent non-traditional simplicity measure which is not based on the shortest but on the fastest way of describing objects, and Section 6 will use this measure to derive non-traditional predictions concerning the future of our universe.
Sections 8, 9, 10 will finally address quite pragmatic issues and “true” time-optimality: given a problem and only so much limited computation time, what is the best way of spending it on evaluating solution candidates? In particular, Section 9
will outline a bias-optimal way of incrementally solving each task in a sequence of tasks with quickly verifiable solutions, given a probability distribution (the bias) on programs computing solution candidates. Bias shifts are computed by program prefixes that modify the distribution on their suffixes by reusing successful code for previous tasks (stored in non-modifiable memory). No tested program gets more runtime than its probability times the total search time. In illustrative experiments, ours becomes the first general system to learn a universal solver for arbitrary n-disk Towers of Hanoi tasks (minimal solution size 2^n - 1). It demonstrates the advantages of incremental learning by profiting from previously solved, simpler tasks involving samples of a simple context-free language. Section 10 discusses how to use this approach for building general reinforcement learners.
2 More Formally
What is the optimal way of predicting the future, given the past? Which is the best way to act such as to maximize one’s future expected reward? Which is the best way of searching for the solution to a novel problem, making optimal use of solutions to earlier problems?
Most previous work on these old and fundamental questions has focused on very limited settings, such as Markovian environments where the optimal next action, given past inputs, depends on the current input only.
We will concentrate on a much weaker and therefore much more general assumption, namely, that the environment’s responses are sampled from a computable probability distribution. If even this weak assumption were not true then we could not even formally specify the environment, let alone write reasonable scientific papers about it.
Let us first introduce some notation. B* denotes the set of finite sequences over the binary alphabet B = {0,1}, B∞ the set of infinite sequences over B, λ the empty string, and B# = B* ∪ B∞. x, y, z stand for strings in B#. If x ∈ B* then xy is the concatenation of x and y (e.g., if x = 10000 and y = 1111 then xy = 100001111). For x ∈ B*, l(x) denotes the number of bits in x, where l(x) = ∞ for x ∈ B∞; l(λ) = 0. x_n is the prefix of x consisting of the first n bits, if l(x) ≥ n, and x otherwise (x_0 := λ). log denotes the logarithm with basis 2; f, g denote functions mapping integers to integers. We write f(n) = O(g(n)) if there exist positive constants c, n0 such that f(n) ≤ c g(n) for all n > n0.

For simplicity let us consider universal Turing Machines (TMs) with input alphabet B and trinary output alphabet including the symbols “0”, “1”, and “ ” (blank). For efficiency reasons, the TMs should have several work tapes to avoid potential quadratic slowdowns associated with 1-tape TMs. The remainder of this paper assumes a fixed universal reference TM.
Now suppose bitstring x represents the data observed so far. What is its most likely continuation y ∈ B#? Bayes’ theorem yields

p(xy | x) = p(x | xy) p(xy) / p(x)    (1)

where p(x | xy) is the probability of x, given knowledge of xy, and p(x) is just a normalizing factor. So the most likely continuation y is determined by p(xy), the prior probability of xy. But which prior measure p is plausible? Occam’s razor suggests that the “simplest” y should be more probable. But which exactly is the “correct” definition of simplicity? Sections 3 and 4 will measure the simplicity of a description by its length. Section 5 will measure the simplicity of a description by the time required to compute the described object.
3 Prediction Using a Universal Algorithmic Prior Based on the Shortest Way of Describing Objects
Roughly forty years ago Solomonoff started the theory of universal optimal induction based on the apparently harmless simplicity assumption that p is computable. While Equation (1) makes predictions of the entire future, given the past, Solomonoff focuses just on the next bit in a sequence. Although this provokes surprisingly nontrivial problems associated with translating the bitwise approach to alphabets other than the binary one — this was achieved only recently — it is sufficient for obtaining essential insights. Given an observed bitstring x, Solomonoff assumes the data are drawn according to a recursive measure μ; that is, there is a program for a universal Turing machine that reads x ∈ B* and computes μ(x) and halts. He estimates the probability of the next bit (assuming there will be one), using the remarkable, well-studied, enumerable prior M [62, 77, 63, 15, 31]

M(x) = Σ_{program prefix q computes output starting with x} 2^{-l(q)}.

M is universal, dominating the less general recursive measures as follows: For all x ∈ B*,

M(x) ≥ c_μ μ(x),

where c_μ is a constant depending on μ but not on x. Solomonoff observed that the conditional M-probability of a particular continuation, given previous observations, converges towards the unknown conditional μ as the observation size goes to infinity, and that the sum over all observation sizes of the corresponding μ-expected deviations is actually bounded by a constant. Hutter (on the author’s SNF research grant “Unification of Universal Induction and Sequential Decision Theory”) recently showed that the number of prediction errors made by universal Solomonoff prediction is essentially bounded by the number of errors made by any other predictor, including the optimal scheme based on the true μ.
Recent Loss Bounds for Universal Prediction. A more general recent result is this. Assume we do know that p is in some set P of distributions. Choose a fixed weight w_q for each q in P such that the w_q add up to 1 (for simplicity, let P be countable). Then construct the Bayesmix M(x) = Σ_q w_q q(x), and predict using M instead of the optimal but unknown p. How wrong is it to do that? The recent work of Hutter provides general and sharp (!) loss bounds:

Let LM(n) and Lp(n) be the total expected unit losses of the M-predictor and the p-predictor, respectively, for the first n events. Then LM(n) − Lp(n) is at most of the order of √(Lp(n)). That is, M is not much worse than p. And in general, no other predictor can do better than that! In particular, if p is deterministic, then the M-predictor soon won’t make any errors any more.
If P contains all recursively computable distributions, then M becomes the celebrated enumerable universal prior. That is, after decades of somewhat stagnating research we now have sharp loss bounds for Solomonoff’s universal induction scheme (compare recent work of Merhav and Feder).
Solomonoff’s approach, however, is uncomputable. To obtain a feasible approach, reduce M to what you get if you, say, just add up weighted estimated future finance data probabilities generated by 1000 commercial stock-market prediction software packages. If only one of the probability distributions happens to be close to the true one (but you do not know which) you still should get rich.
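To make the mixture idea concrete, here is a minimal sketch, assuming two invented constant “expert” predictors standing in for the commercial packages; the function name and interface are illustrative, not from the literature:

```python
def bayesmix_predict(history, predictors, weights):
    """Return P(next bit = 1) under the Bayes mixture, plus the
    posterior weights of the predictors given the observed history."""
    # Posterior weight of predictor i is proportional to w_i times
    # the probability predictor i assigned to the history so far.
    post = []
    for w, pred in zip(weights, predictors):
        likelihood = 1.0
        for t, bit in enumerate(history):
            p1 = pred(history[:t])  # predictor's P(bit = 1 | prefix)
            likelihood *= p1 if bit == 1 else 1.0 - p1
        post.append(w * likelihood)
    total = sum(post)
    post = [p / total for p in post]
    # Mixture prediction: posterior-weighted average of the predictors.
    p1_mix = sum(q * pred(history) for q, pred in zip(post, predictors))
    return p1_mix, post

# Two toy "packages": one always predicts P(1) = 0.9, one always 0.5.
experts = [lambda h: 0.9, lambda h: 0.5]
p1, post = bayesmix_predict([1, 1, 1, 1], experts, [0.5, 0.5])
```

After four observed 1s, nearly all posterior mass sits on the 0.9-predictor, so the mixture predicts close to 0.9: the mixture is never much worse than its best member, which is the content of the loss bounds above.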
4 Super Omegas and Generalizations of Kolmogorov Complexity & Algorithmic Probability
An object X is formally describable if a finite amount of information completely describes X and only X. More to the point, X should be representable by a possibly infinite bitstring x such that there is a finite, possibly never halting program p that computes x and nothing but x in a way that modifies each output bit at most finitely many times; that is, each finite beginning of x eventually converges and ceases to change. This constructive notion of formal describability is less restrictive than the traditional notion of computability, mainly because we do not insist on the existence of a halting program that computes an upper bound of the convergence time of p’s n-th output bit. Formal describability thus pushes constructivism [5, 1] to the extreme, barely avoiding the nonconstructivism embodied by even less restrictive concepts of describability (compare computability in the limit [17, 40, 14] and Δ0n-describability [31, p. 46-47]).
The traditional theory of inductive inference focuses on Turing machines with one-way write-only output tape. This leads to the universal enumerable Solomonoff-Levin (semi) measure. We introduced more general, nonenumerable, but still limit-computable measures and a natural hierarchy of generalizations of algorithmic probability and Kolmogorov complexity [50, 52], suggesting that the “true” information content of some (possibly infinite) bitstring x actually is the size of the shortest nonhalting program that converges to x and nothing but x on a Turing machine that can edit its previous outputs. In fact, this “true” content is often smaller than the traditional Kolmogorov complexity. We showed that there are Super Omegas computable in the limit yet more random than Chaitin’s “number of wisdom” Omega (which is maximally random in a weaker traditional sense), and that any approximable measure of x is small for any x lacking a short description.
We also showed that there is a universal cumulatively enumerable measure of x based on the measure of all enumerable y lexicographically greater than x. It is more dominant yet just as limit-computable as Solomonoff’s M. That is, if we are interested in limit-computable universal measures, we should prefer the novel universal cumulatively enumerable measure over the traditional enumerable one. If we include in our Bayesmix such limit-computable distributions we obtain again sharp loss bounds for prediction based on the mix [50, 52].
Our approach highlights differences between countable and uncountable sets. Which are the potential consequences for physics? We argue that things such as uncountable time and space and incomputable probabilities actually should not play a role in explaining the world, for lack of evidence that they are really necessary . Some may feel tempted to counter this line of reasoning by pointing out that for centuries physicists have calculated with continua of real numbers, most of them incomputable. Even quantum physicists who are ready to give up the assumption of a continuous universe usually do take for granted the existence of continuous probability distributions on their discrete universes, and Stephen Hawking explicitly said: “Although there have been suggestions that space-time may have a discrete structure I see no reason to abandon the continuum theories that have been so successful.” Note, however, that all physicists in fact have only manipulated discrete symbols, thus generating finite, describable proofs of their results derived from enumerable axioms. That real numbers really exist in a way transcending the finite symbol strings used by everybody may be a figment of imagination — compare Brouwer’s constructive mathematics [5, 1] and the Löwenheim-Skolem Theorem [32, 61] which implies that any first order theory with an uncountable model such as the real numbers also has a countable model. As Kronecker put it: “Die ganze Zahl schuf der liebe Gott, alles Übrige ist Menschenwerk” (“God created the integers, all else is the work of man” ). Kronecker greeted with scepticism Cantor’s celebrated insight  about real numbers, mathematical objects Kronecker believed did not even exist.
Assuming our future lies among the few (countably many) describable futures, we can ignore uncountably many nondescribable ones, in particular, the random ones. Adding the relatively mild assumption that the probability distribution from which our universe is drawn is cumulatively enumerable provides a theoretical justification of the prediction that the most likely continuations of our universe are computable through short enumeration procedures. In this sense Occam’s razor is just a natural by-product of a computability assumption! But what about falsifiability? The pseudorandomness of our universe might be effectively undetectable in principle, because some approximable and enumerable patterns cannot be proven to be nonrandom in recursively bounded time.
The next sections, however, will introduce additional plausible assumptions that do lead to computable optimal prediction procedures.
5 Computable Predictions through the Speed Prior Based on the Fastest Way of Describing Objects
Unfortunately, while M and the more general priors of Section 4 are computable in the limit, they are not recursive, and thus practically infeasible. This drawback inspired less general yet practically more feasible principles of minimum description length (MDL) [71, 41] as well as priors derived from time-bounded restrictions of Kolmogorov complexity [28, 62, 9]. No particular instance of these approaches, however, is universally accepted or has a general convincing motivation that carries beyond rather specialized application scenarios. For instance, typical efficient MDL approaches require the specification of a class of computable models of the data, say, certain types of neural networks, plus some computable loss function expressing the coding costs of the data relative to the model. This provokes numerous ad-hoc choices.
Our recent work, however, offers an alternative to the celebrated but noncomputable algorithmic simplicity measure or Solomonoff-Levin measure discussed above [62, 77, 63]. We introduced a new measure S (a prior on the computable objects) which is not based on the shortest but on the fastest way of describing objects.
Let us assume that the observed data sequence is generated by a computational process, and that any possible sequence of observations is therefore computable in the limit . This assumption is stronger and more radical than the traditional one: Solomonoff just insists that the probability of any sequence prefix is recursively computable, but the (infinite) sequence itself may still be generated probabilistically.
Given our starting assumption that data are deterministically generated by a machine, it seems plausible that the machine suffers from a computational resource problem. Since some things are much harder to compute than others, the resource-oriented point of view suggests the following postulate.
Postulate 1. The cumulative prior probability measure of all x incomputable within time t by any method is at most inversely proportional to t.
This postulate leads to the Speed Prior S(x), the probability that the output of the following probabilistic algorithm GUESS starts with x:
Initialize: Set t := 1. Let the input scanning head of a universal TM point to the first cell of its initially empty input tape.
Forever repeat: While the number of instructions executed so far exceeds t: toss an unbiased coin; if heads is up set t := 2t; otherwise exit. If the input scanning head points to a cell that already contains a bit, execute the corresponding instruction (of the growing self-delimiting program, e.g., [30, 31]). Else toss the coin again, set the cell’s bit to 1 if heads is up (0 otherwise), and set t := t/2.
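A toy simulation of GUESS, assuming a stub machine that consumes exactly one instruction per guessed bit (so only the coin-driven doubling and halving of the time allowance t is illustrated; real GUESS runs a universal TM):

```python
import random

def guess(max_output_bits=8, rng=None):
    """Toy GUESS: return the sequence of randomly guessed input bits."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    t = 1.0        # current runtime allowance
    steps = 0      # instructions executed so far
    bits = []
    while len(bits) < max_output_bits:
        # While the allowance is exceeded: coin-toss to double t or exit.
        while steps > t:
            if rng.random() < 0.5:
                t *= 2.0            # heads: double the allowance
            else:
                return bits         # tails: exit
        # Guess a fresh input bit, halve t, execute one instruction.
        bits.append(1 if rng.random() < 0.5 else 0)
        t /= 2.0
        steps += 1
    return bits

bits = guess()
```

Halving t after each fresh bit is what makes fast programs far more probable than slow ones under the Speed Prior.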
Algorithm GUESS is very similar to a probabilistic search algorithm used in previous work on applied inductive inference [47, 49]. On several toy problems it generalized extremely well in a way unmatchable by traditional neural network learning algorithms.
With S comes a computable method AS for predicting optimally within ε accuracy. Consider a finite but unknown program p computing y. What if Postulate 1 holds but p is not optimally efficient, and/or computed on a computer that differs from our reference machine? Then we effectively do not sample beginnings y_k from S but from an alternative semimeasure S′. Can we still predict well? Yes, because the Speed Prior S dominates S′. This dominance is all we need to apply the recent loss bounds. The loss that we are expected to receive by predicting according to AS instead of using the true but unknown S′ does not exceed the optimal loss by much.
6 Speed Prior-Based Predictions for Our Universe
“In the beginning was the code.”
First sentence of the Great Programmer’s Bible
Physicists and economists and other inductive scientists make predictions based on observations. Astonishingly, however, few physicists are aware of the theory of optimal inductive inference [62, 28]. In fact, when talking about the very nature of their inductive business, many physicists cite rather vague concepts such as Popper’s falsifiability , instead of referring to quantitative results.
All widely accepted physical theories, however, are accepted not because they are falsifiable—they are not—or because they match the data—many alternative theories also match the data—but because they are simple in a certain sense. For example, the theory of gravitation is induced from locally observable training examples such as falling apples and movements of distant light sources, presumably stars. The theory predicts that apples on distant planets in other galaxies will fall as well. Currently nobody is able to verify or falsify this. But everybody believes in it because this generalization step makes the theory simpler than alternative theories with separate laws for apples on other planets. The same holds for superstring theory  or Everett’s many world theory , which presently also are neither verifiable nor falsifiable, yet offer comparatively simple explanations of numerous observations. In particular, most of Everett’s postulated many worlds will remain unobservable forever, but the assumption of their existence simplifies the theory, thus making it more beautiful and acceptable.
In Sections 3 and 4 we have made the assumption that the probabilities of next events, given previous events, are (limit-)computable. Here we make a stronger assumption by adopting Zuse’s thesis [75, 76], namely, that the very universe is actually being computed deterministically, e.g., on a cellular automaton (CA) [68, 70]. Quantum physics, quantum computation [3, 10, 38], Heisenberg’s uncertainty principle and Bell’s inequality  do not imply any physical evidence against this possibility, e.g., .
But then which is our universe’s precise algorithm? The following method  does compute it:
Systematically create and execute all programs for a universal computer, such as a Turing machine or a CA; the first program is run for one instruction every second step on average, the next for one instruction every second of the remaining steps on average, and so on.
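This allocation can be simulated deterministically; the finite program count and the abstract step counter below are assumptions of this sketch (the actual method enumerates and executes all programs):

```python
def dovetail_schedule(num_programs, total_steps):
    """Count how many instruction steps each program receives when
    program 0 runs every 2nd step, program 1 every 2nd remaining step,
    and so on (program i runs at steps whose 1-based index has exactly
    i trailing zero bits)."""
    counts = [0] * num_programs
    for step in range(1, total_steps + 1):
        s, i = step, 0
        while s % 2 == 0:   # count trailing zero bits of the step index
            s //= 2
            i += 1
        if i < num_programs:
            counts[i] += 1
    return counts

counts = dovetail_schedule(4, 16)
```

Program i receives roughly a fraction 2^-(i+1) of all steps, which is why the scheme computes each universe as quickly as that universe’s fastest program, up to a constant factor.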
This method in a certain sense implements the simplest theory of everything: all computable universes, including ours and ourselves as observers, are computed by the very short program that generates and executes all possible programs . In nested fashion, some of these programs will execute processes that again compute all possible universes, etc. . Of course, observers in “higher-level” universes may be completely unaware of observers or universes computed by nested processes, and vice versa. For example, it seems hard to track and interpret the computations performed by a cup of tea.
The simple method above is more efficient than it may seem at first glance. A bit of thought shows that it even has the optimal order of complexity. For example, it outputs our universe history as quickly as this history’s fastest program, save for a (possibly huge) constant slowdown factor that does not depend on output size.
Nevertheless, some universes are fundamentally harder to compute than others. This is reflected by the Speed Prior S discussed above (Section 5). So let us assume that our universe’s history is sampled from S or a less dominant prior reflecting suboptimal computation of the history. Now we can immediately predict:
1. Our universe will not get many times older than it is now — essentially, the probability that it will last 2^n times longer than it has lasted so far is at most 2^{-n}.
2. Any apparent randomness in any physical observation must be due to some yet unknown but fast pseudo-random generator PRG which we should try to discover.
2a. A re-examination of beta decay patterns may reveal that a very simple, fast, but maybe not quite trivial PRG is responsible for the apparently random decays of neutrons into protons, electrons and antineutrinos.
2b. Whenever there are several possible continuations of our universe corresponding to different Schrödinger wave function collapses — compare Everett’s widely accepted many worlds hypothesis — we should be more likely to end up in one computable by a short and fast algorithm. A re-examination of split experiment data involving entangled states such as the observations of spins of initially close but soon distant particles with correlated spins might reveal unexpected, nonobvious, nonlocal algorithmic regularity due to a fast PRG.
3. Any probabilistic algorithm depending on truly random inputs from the environment will not scale well in practice.
Prediction 2 is verifiable but not necessarily falsifiable within a fixed time interval given in advance. Still, perhaps the main reason for the current absence of empirical evidence in this vein is that few have looked for it.
In recent decades several well-known physicists have started writing about topics of computer science, e.g., [38, 10], sometimes suggesting that real world physics might allow for computing things that are not computable traditionally. Unimpressed by this trend, computer scientists have argued in favor of the opposite: since there is no evidence that we need more than traditional computability to explain the world, we should try to make do without this assumption, e.g., [75, 76, 13, 48].
7 Optimal Rational Decision Makers
So far we have talked about passive prediction, given the observations. Note, however, that agents interacting with an environment can also use predictions of the future to compute action sequences that maximize expected future reward. Hutter’s recent AIXI model (author’s SNF grant 61847) does exactly this, by combining Solomonoff’s M-based universal prediction scheme with an expectimax computation.
In cycle k, action y_k results in perception x_k and reward r_k, where all quantities may depend on the complete history. The perception x_k and reward r_k are sampled from the (reactive) environmental probability distribution μ. Sequential decision theory shows how to maximize the total expected reward, called value, if μ is known. Reinforcement learning is used if μ is unknown. AIXI defines a mixture distribution ξ as a weighted sum of distributions ν ∈ M, where M is any class of distributions including the true environment μ.
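The expectimax part can be sketched for a tiny finite-horizon case; the environment interface, percept set, and binary rewards below are invented for illustration (real AIXI plans with a full mixture over all computable environments):

```python
def expectimax(history, env_prob, actions, percepts, horizon):
    """Return (best expected value, best action) for the given history.
    env_prob(history, action, (x, r)) gives the probability of
    perceiving x with reward r after taking `action`."""
    if horizon == 0:
        return 0.0, None
    best_val, best_act = float("-inf"), None
    for a in actions:
        val = 0.0
        for x in percepts:
            for r in (0.0, 1.0):  # binary rewards, for simplicity
                pr = env_prob(history, a, (x, r))
                if pr == 0.0:
                    continue
                future, _ = expectimax(history + [(a, x, r)], env_prob,
                                       actions, percepts, horizon - 1)
                val += pr * (r + future)  # expected reward-to-go
        if val > best_val:
            best_val, best_act = val, a
    return best_val, best_act

# Toy deterministic environment: "good" always earns reward 1, "bad" 0.
def toy_env(history, action, xr):
    return 1.0 if xr == ("o", 1.0 if action == "good" else 0.0) else 0.0

value, action = expectimax([], toy_env, ["good", "bad"], ["o"], horizon=2)
```

With horizon 2 the recursion correctly values the always-rewarding action at 2 and selects it.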
It can be shown that the conditional probability of environmental inputs to an AIXI agent, given the agent’s earlier inputs and actions, converges, with increasing length of interaction, to the true, unknown probability μ, as long as the latter is recursively computable, analogously to the passive prediction case.
Recent work also demonstrated AIXI’s optimality in the following sense. The Bayes-optimal policy p^ξ based on the mixture ξ is self-optimizing in the sense that its average value converges asymptotically for all μ ∈ M to the optimal value achieved by the (infeasible) Bayes-optimal policy p^μ which knows μ in advance. The necessary condition that M admits self-optimizing policies is also sufficient. No other structural assumptions are made on M. Furthermore, p^ξ is Pareto-optimal in the sense that there is no other policy yielding higher or equal value in all environments ν ∈ M and a strictly higher value in at least one.
We can modify the AIXI model such that its predictions are based on the approximable Speed Prior S instead of the incomputable M. Thus we obtain the so-called AIS model. Using Hutter’s approach we can now show that the conditional probability of environmental inputs to an AIS agent, given the earlier inputs and actions, converges to the true but unknown probability, as long as the latter is dominated by S, such as the S′ above.
8 Optimal Universal Search Algorithms
In a sense, searching is less general than reinforcement learning because it does not necessarily involve predictions of unseen data. Still, search is a central aspect of computer science (and any reinforcement learner needs a searcher as a submodule—see Sections 10 and 11). Surprisingly, however, many books on search algorithms do not even mention the following very simple, asymptotically optimal, “universal” algorithm for a broad class of search problems.
Define a probability distribution P on a finite or infinite set of programs for a given computer. P represents the searcher’s initial bias (e.g., P could be based on program length, or on a probabilistic syntax diagram).
Method Lsearch: Set current time limit T := 1. While problem not solved do:
Test all programs q such that t(q), the maximal time spent on creating and running and testing q, satisfies t(q) < P(q) T. Set T := 2T.
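In code, the loop reads as follows; the candidate set, the uniform bias P, and the run-and-test stub are invented for illustration:

```python
def lsearch(programs, P, run_and_test, max_phases=32):
    """Levin-search sketch. P maps each candidate to its prior
    probability; run_and_test(q, budget) runs q for at most `budget`
    steps and returns (solved, steps_used)."""
    T = 1.0
    for _ in range(max_phases):
        for q in programs:
            budget = int(P[q] * T)  # time share proportional to P(q)*T
            if budget < 1:
                continue
            solved, _ = run_and_test(q, budget)
            if solved:
                return q
        T *= 2.0                    # double the time limit each phase
    return None

# Toy setup: candidate 3 "solves" the problem iff given >= 4 steps.
cands = [1, 2, 3]
P = {q: 1.0 / 3 for q in cands}
run = lambda q, budget: (q == 3 and budget >= 4, min(budget, 4))
winner = lsearch(cands, P, run)
```

No candidate q ever receives more than a P(q) fraction of a phase’s time limit, mirroring the time-allocation guarantee stated in the introduction.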
Lsearch (for Levin Search) may be the algorithm Levin was referring to in his 2-page paper which states that there is an asymptotically optimal universal search method for problems with easily verifiable solutions, that is, solutions whose validity can be quickly tested. Given some problem class, if some unknown optimal program p requires f(k) steps to solve a problem instance of size k, then Lsearch will need at most O(f(k)/P(p)) = O(f(k)) steps — the constant factor 1/P(p) may be huge but does not depend on k. Compare [31, p. 502-505] and the fastest way of computing all computable universes in Section 6.
Recently Hutter developed a more complex asymptotically optimal search algorithm for all well-defined problems, not just those with easily verifiable solutions. Hsearch cleverly allocates part of the total search time for searching the space of proofs to find provably correct candidate programs with provable upper runtime bounds, and at any given time focuses resources on those programs with the currently best proven time bounds. Unexpectedly, Hsearch manages to reduce the unknown constant slowdown factor of Lsearch to a value of 1 + ε, where ε is an arbitrary positive constant.
Unfortunately, however, the search in proof space introduces an unknown additive problem class-specific constant slowdown, which again may be huge. While additive constants generally are preferable over multiplicative ones, both types may make universal search methods practically infeasible.
Hsearch and Lsearch are nonincremental in the sense that they do not attempt to minimize their constants by exploiting experience collected in previous searches. Our method Adaptive Lsearch or Als tries to overcome this — compare Solomonoff’s related ideas [64, 65]. Essentially it works as follows: whenever Lsearch finds a program q that computes a solution for the current problem, q’s probability P(q) is substantially increased using a “learning rate,” while probabilities of alternative programs decrease appropriately. Subsequent Lsearches for new problems then use the adjusted P, etc. A nonuniversal variant of this approach was able to solve reinforcement learning (RL) tasks in partially observable environments unsolvable by traditional RL algorithms [74, 60].
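One simple form such a bias shift could take is a multiplicative update with renormalization; the exact rule below is an illustrative assumption, not the one used by Als:

```python
def adapt_bias(P, solver, learning_rate=0.5):
    """Move a fraction `learning_rate` of the probability mass toward
    the successful program, shrinking all others proportionally."""
    newP = {q: (1.0 - learning_rate) * p for q, p in P.items()}
    newP[solver] += learning_rate
    return newP

P = {"a": 0.25, "b": 0.25, "c": 0.5}
P = adapt_bias(P, "b")  # b's share grows; the distribution stays normalized
```

Because the shrink is proportional, the adjusted P remains a probability distribution and can be fed directly into the next Lsearch.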
Each Lsearch invoked by Als is optimal with respect to the most recent adjustment of P. On the other hand, the modifications of P themselves are not necessarily optimal. Recent work discussed in the next section overcomes this drawback in a principled way.
9 Optimal Ordered Problem Solver (OOPS)
Our recent Oops [53, 55] is a simple, general, theoretically sound, in a certain sense time-optimal way of searching for a universal behavior or program that solves each problem in a sequence of computational problems, continually organizing and managing and reusing earlier acquired knowledge. For example, the n-th problem may be to compute the n-th event from previous events (prediction), or to find a faster way through a maze than the one found during the search for a solution to the (n-1)-th problem (optimization).
Let us first introduce the important concept of bias-optimality, which is a pragmatic definition of time-optimality, as opposed to the asymptotic optimality of both Lsearch and Hsearch, which may be viewed as academic exercises demonstrating that the O() notation can sometimes be practically irrelevant despite its wide use in theoretical computer science. Unlike asymptotic optimality, bias-optimality does not ignore huge constant slowdowns:
Definition 1 (Bias-Optimal Searchers)
Given is a problem class R, a search space C of solution candidates (where any problem r ∈ R should have a solution in C), a task-dependent bias in form of conditional probability distributions P(q | r) on the candidates q ∈ C, and a predefined procedure that creates and tests any given q on any r ∈ R within time t(q, r) (typically unknown in advance). A searcher is n-bias-optimal (n ≥ 1) if for any maximal total search time T_max > 0 it is guaranteed to solve any problem r ∈ R if it has a solution p ∈ C satisfying t(p, r) ≤ P(p | r) T_max / n. It is bias-optimal if n = 1.
This definition makes intuitive sense: the most probable candidates should get the lion’s share of the total search time, in a way that precisely reflects the initial bias. Now we are ready to provide a general overview of the basic ingredients of oops [53, 55]:
Primitives. We start with an initial set of user-defined primitive behaviors. Primitives may be assembler-like instructions or time-consuming software, such as, say, theorem provers, or matrix operators for neural network-like parallel architectures, or trajectory generators for robot simulations, or state update procedures for multiagent systems, etc. Each primitive is represented by a token. It is essential that those primitives whose runtimes are not known in advance can be interrupted at any time.
Task-specific prefix codes. Complex behaviors are represented by token sequences or programs. To solve a given task represented by task-specific program inputs, oops tries to sequentially compose an appropriate complex behavior from primitive ones, always obeying the rules of a given user-defined initial programming language. Programs are grown incrementally, token by token; their beginnings or prefixes are immediately executed while being created; this may modify some task-specific internal state or memory, and may transfer control back to previously selected tokens (e.g., loops). To add a new token to some program prefix, we first have to wait until the execution of the prefix so far explicitly requests such a prolongation, by setting an appropriate signal in the internal state. Prefixes that cease to request any further tokens are called self-delimiting programs or simply programs (programs are their own prefixes). Binary self-delimiting programs were studied by Levin and Chaitin in the context of Turing machines and the theory of Kolmogorov complexity and algorithmic probability [62, 28]. Oops, however, uses a more practical, not necessarily binary framework.
The program construction procedure above yields task-specific prefix codes on program space: with any given task, programs that halt because they have found a solution or encountered some error cannot request any more tokens. Given the current task-specific inputs, no program can be the prefix of another one. On a different task, however, the same program may continue to request additional tokens. This is important for our novel approach—incrementally growing self-delimiting programs are unnecessary for the asymptotic optimality properties of Lsearch and Hsearch, but essential for oops.
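A minimal toy interpreter (its two-token instruction set is invented purely for illustration) makes the mechanism concrete: a prefix is executed while it grows, and it becomes a self-delimiting program exactly when it stops requesting further tokens:

```python
# Toy illustration of self-delimiting program growth: a prefix is executed
# token by token and explicitly requests prolongation via a flag in the
# internal state; once it stops requesting, it is a finished program.

def run_prefix(tokens):
    """Execute tokens; return (halted, value). 'inc' increments the state,
    'halt' ceases to request further tokens."""
    state = {"value": 0, "wants_more": True}
    for tok in tokens:
        if tok == "inc":
            state["value"] += 1          # modifies task-specific internal state
        elif tok == "halt":
            state["wants_more"] = False  # no further prolongation requested
            break
    return (not state["wants_more"], state["value"])

halted, value = run_prefix(["inc", "inc", "halt", "inc"])
# tokens after 'halt' are never consumed: the program is its own prefix
```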
Access to previous solutions. Let $p^n$ denote a found prefix solving the first $n$ tasks. The search for $p^{n+1}$ may greatly profit from the information conveyed by (or the knowledge embodied by) $p^1, p^2, \ldots, p^n$, which are stored or frozen in special nonmodifiable memory shared by all tasks, such that they are accessible to $p^{n+1}$ (this is another difference to nonincremental Lsearch and Hsearch). For example, $p^{n+1}$ might execute a token sequence that calls $p^n$ as a subprogram, or that copies $p^n$ into some internal modifiable task-specific memory, then modifies the copy a bit, then applies the slightly edited copy to the current task. In fact, since the number of frozen programs may grow to a large value, much of the knowledge embodied by $p^j$ may be about how to access and edit and use older $p^i$ ($i < j$).
Bias. The searcher’s initial bias is embodied by initial, user-defined, task dependent probability distributions on the finite or infinite search space of possible program prefixes. In the simplest case we start with a maximum entropy distribution on the tokens, and define prefix probabilities as the products of the probabilities of their tokens. But prefix continuation probabilities may also depend on previous tokens in context sensitive fashion.
Self-computed suffix probabilities. In fact, we permit that any executed prefix assigns a task-dependent, self-computed probability distribution to its own possible continuations. This distribution is encoded and manipulated in task-specific internal memory. So unlike with ALS (adaptive Levin search) we do not use a prewired learning scheme to update the probability distribution. Instead we leave such updates to prefixes whose online execution modifies the probabilities of their suffixes. By, say, invoking previously frozen code that redefines the probability distribution on future prefix continuations, the currently tested prefix may completely reshape the most likely paths through the search space of its own continuations, based on experience ignored by nonincremental Lsearch and Hsearch. This may introduce significant problem class-specific knowledge derived from solutions to earlier tasks.
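The idea can be sketched as follows (token names and probability values are invented for illustration): the probability of a prefix is the product of its token probabilities, but executing a token may itself overwrite the distribution governing subsequent tokens:

```python
# Illustrative sketch: prefix probability as a product of token probabilities,
# where an executed token may rewrite the (task-specific) distribution over
# its own continuations -- here via a hypothetical 'boost_loop' primitive.

def prefix_probability(tokens, initial_dist):
    dist = dict(initial_dist)   # modifiable, task-specific copy
    prob = 1.0
    for tok in tokens:
        prob *= dist[tok]
        if tok == "boost_loop":           # a self-modifying primitive:
            dist["loop"] = 0.7            # reshape the likely continuations
            dist["boost_loop"] = 0.2
            dist["halt"] = 0.1
    return prob

uniform = {"loop": 1 / 3, "boost_loop": 1 / 3, "halt": 1 / 3}
p = prefix_probability(["boost_loop", "loop", "halt"], uniform)
# = (1/3) * 0.7 * 0.1, since 'boost_loop' changed the suffix distribution
```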
Two searches. Essentially, oops provides equal resources for two near-bias-optimal searches (Def. 1) that run in parallel until $p^{n+1}$ is discovered and stored in non-modifiable memory. The first is exhaustive; it systematically tests all possible prefixes on all tasks up to $n+1$. Alternative prefixes are tested on all current tasks in parallel while still growing; once a task is solved, we remove it from the current set; prefixes that fail on a single task are discarded. The second search is much more focused; it only searches for prefixes that start with $p^n$, and only tests them on task $n+1$, which is safe, because we already know that such prefixes solve all tasks up to $n$.
Bias-optimal backtracking. Hsearch and Lsearch assume potentially infinite storage. Hence they may largely ignore questions of storage management. In any practical system, however, we have to efficiently reuse limited storage. Therefore, in both searches of oops, alternative prefix continuations are evaluated by a novel, practical, token-oriented backtracking procedure that can deal with several tasks in parallel, given some code bias in the form of previously found code. The procedure always ensures near-bias-optimality (Def. 1): no candidate behavior gets more time than it deserves, given the probabilistic bias. Essentially we conduct a depth-first search in program space, where the branches of the search tree are program prefixes, and backtracking (partial resets of partially solved task sets and modifications of internal states and continuation probabilities) is triggered once the sum of the runtimes of the current prefix on all current tasks exceeds the prefix probability multiplied by the total search time so far.
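The trigger condition itself is simple; a single-task sketch (function name hypothetical) reads:

```python
# Single-task sketch of oops' backtracking trigger: abandon a prefix once its
# consumed runtime exceeds its probability times the total search time spent
# so far, so that no candidate gets more time than its bias share deserves.

def should_backtrack(prefix_runtime, prefix_prob, total_search_time):
    return prefix_runtime > prefix_prob * total_search_time
```

A prefix with bias probability 0.1 thus deserves at most a tenth of the search time elapsed so far: after 20 elapsed units it may consume 2 units before backtracking is triggered.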
In case of unknown, infinite task sequences we can typically never know whether we already have found an optimal solver for all tasks in the sequence. But once we unwittingly do find one, at most half of the total future run time will be wasted on searching for alternatives. Given the initial bias and subsequent bias shifts due to $p^1, p^2, \ldots$, no other bias-optimal searcher can expect to solve the $n$-th task set substantially faster than oops. A by-product of this optimality property is that it gives us a natural and precise measure of bias and bias shifts, conceptually related to Solomonoff’s conceptual jump size [64, 65].
Since there is no fundamental difference between domain-specific problem-solving programs and programs that manipulate probability distributions and thus essentially rewrite the search procedure itself, we collapse both learning and metalearning in the same time-optimal framework.
An example initial language. For an illustrative application, we wrote an interpreter for a stack-based universal programming language inspired by Forth, with initial primitives for defining and calling recursive functions, iterative loops, arithmetic operations, and domain-specific behavior. Optimal metasearching for better search algorithms is enabled through the inclusion of bias-shifting instructions that can modify the conditional probabilities of future search options in currently running program prefixes.
Experiments. Using the assembler-like language mentioned above, we first teach oops something about recursion, by training it to construct samples of the simple context free language $1^n2^n$ ($n$ 1’s followed by $n$ 2’s), for $n$ up to 30 (in fact, the system discovers a universal solver for all $n$). This takes roughly 0.3 days on a standard personal computer (PC). Thereafter, within a few additional days, oops demonstrates incremental knowledge transfer: it exploits aspects of its previously discovered universal $1^n2^n$-solver, by rewriting its search procedure such that it more readily discovers a universal solver for all $n$ disk Towers of Hanoi problems—in the experiments it solves all instances up to $n = 30$ (solution size $2^n - 1$), but it would also work for $n > 30$. Previous, less general reinforcement learners and nonlearning AI planners tend to fail for much smaller instances.
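To see why the Hanoi instances grow hard so quickly, recall that the shortest n-disk solution takes 2^n - 1 moves; a standard recursive solver (shown for illustration only, not the solver oops discovers) makes this explicit:

```python
# Standard recursive Towers of Hanoi solver: move n disks from src to dst
# via aux. The optimal move count is 2**n - 1, which is why nonlearning
# planners fail long before n = 30 (over a billion moves).

def hanoi_moves(n, src="A", aux="B", dst="C"):
    """Return the optimal move sequence for n disks as (from, to) pairs."""
    if n == 0:
        return []
    return (hanoi_moves(n - 1, src, dst, aux)   # park n-1 disks on aux
            + [(src, dst)]                      # move the largest disk
            + hanoi_moves(n - 1, aux, src, dst))  # restack n-1 disks on dst

moves = hanoi_moves(10)
# len(moves) == 2**10 - 1 == 1023; for n = 30 the count is 2**30 - 1
```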
Future research may focus on devising particularly compact, particularly reasonable sets of initial codes with particularly broad practical applicability. It may turn out that the most useful initial languages are not traditional programming languages similar to the Forth-like one, but instead based on a handful of primitive instructions for massively parallel cellular automata [68, 70, 76]
, or on a few nonlinear operations on matrix-like data structures such as those used in recurrent neural network research [72, 44, 4]. For example, we could use the principles of oops to create a non-gradient-based, near-bias-optimal variant of Hochreiter’s successful recurrent network metalearner. It should also be of interest to study probabilistic Speed Prior-based oops variants and to devise applications of oops-like methods as components of universal reinforcement learners (see below). In ongoing work, we are applying oops to the problem of optimal trajectory planning for robotics in a realistic physics simulation. This involves the interesting trade-off between comparatively fast program-composing primitives or “thinking primitives” and time-consuming “action primitives”, such as stretch-arm-until-touch-sensor-input.
10 OOPS-Based Reinforcement Learning
At any given time, a reinforcement learner will try to find a policy (a strategy for future decision making) that maximizes its expected future reward. In many traditional reinforcement learning (RL) applications, the policy that works best in a given set of training trials will also be optimal in future test trials. Sometimes, however, it won’t. To see the difference between searching (the topic of the previous sections) and reinforcement learning (RL), consider an agent and two boxes. In the $n$-th trial the agent may open and collect the content of exactly one box. The left box will contain $100n$ Swiss Francs, the right box $2^n$ Swiss Francs, but the agent does not know this in advance. During the first 9 trials the optimal policy is “open left box.” This is what a good searcher should find, given the outcomes of the first 9 trials. But this policy will be suboptimal in trial 10. A good reinforcement learner, however, should extract the underlying regularity in the reward generation process and predict the future tasks and rewards, picking the right box in trial 10, without having seen it yet.
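Assuming, for illustration, that in trial n the left box pays 100n francs and the right box 2^n (amounts consistent with a policy switch at trial 10), the arithmetic is easy to check:

```python
# Illustrative payoffs (assumed, not derived): left box pays 100*n francs in
# trial n, the right box 2**n. A pure searcher fits the first 9 outcomes;
# a learner that models the reward process predicts the crossover at trial 10.

def best_box(n):
    return "left" if 100 * n > 2 ** n else "right"

choices = [best_box(n) for n in range(1, 11)]
# trials 1..9 favor the left box (e.g. 900 > 512); trial 10: 1000 < 1024
```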
The first general, asymptotically optimal reinforcement learner is the recent AIXI model [22, 24] (Section 7). It is valid for a very broad class of environments whose reactions to action sequences (control signals) are sampled from arbitrary computable probability distributions. This means that AIXI is far more general than traditional RL approaches. However, while AIXI clarifies the theoretical limits of RL, it is not practically feasible, just like Hsearch is not. From a pragmatic point of view, what we are really interested in is a reinforcement learner that makes optimal use of given, limited computational resources. In what follows, we will outline one way of using oops-like bias-optimal methods as components of general yet feasible reinforcement learners.
We need two oops modules. The first is called the predictor or world model. The second is an action searcher using the world model. The life of the entire system should consist of a sequence of cycles 1, 2, … At each cycle, a limited amount of computation time will be available to each module. For simplicity we assume that during each cycle the system may take exactly one action. Generalizations to actions consuming several cycles are straightforward though. At any given cycle, the system executes the following procedure:
For a time interval fixed in advance, the predictor is first trained in bias-optimal fashion to find a better world model, that is, a program that predicts the inputs from the environment (including the rewards, if there are any), given a history of previous observations and actions. So the $n$-th task ($n = 1, 2, \ldots$) of the first oops module is to find (if possible) a better predictor than the best found so far.
Once the current cycle’s time for predictor improvement is used up, the current world model (prediction program) found by the first oops module will be used by the second module, again in bias-optimal fashion, to search for a future action sequence that maximizes the predicted cumulative reward (up to some time limit). That is, the $n$-th task ($n = 1, 2, \ldots$) of the second oops module will be to find a control program that computes a control sequence of actions, to be fed into the program representing the current world model (whose input predictions are successively fed back to itself in the obvious manner), such that this control sequence leads to higher predicted reward than the one generated by the best control program found so far.
Once the current cycle’s time for control program search is used up, we will execute the current action of the best control program found in step 2. Now we are ready for the next cycle.
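A toy version of the three-step cycle can be sketched as follows (everything here is a deliberately simplified stand-in: the "predictor" merely memorizes observed rewards per action, and the "action search" greedily picks the best predicted action, preferring untried ones, in place of the two bias-optimal oops modules):

```python
# Toy sketch of the oops-rl cycle: per cycle, (1) refit a world model to the
# history, (2) search for the best action under that model, (3) execute it.

def run_cycles(env_rewards, n_cycles):
    """env_rewards: dict action -> fixed reward (the hidden environment)."""
    history = []        # observed (action, reward) pairs
    total_reward = 0.0
    for _ in range(n_cycles):
        # Step 1 (predictor module): model each action's reward from history.
        model = {}
        for a, r in history:
            model[a] = r
        # Step 2 (action searcher): pick the action with best predicted
        # reward; unseen actions default to an optimistic prediction.
        best = max(env_rewards, key=lambda a: model.get(a, float("inf")))
        # Step 3: execute one action, observe the reward, start next cycle.
        r = env_rewards[best]
        history.append((best, r))
        total_reward += r
    return total_reward

# After trying both actions once, the system keeps choosing the better one.
total = run_cycles({"a": 1.0, "b": 5.0}, 10)
```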
The approach is reminiscent of an earlier, heuristic, non-bias-optimal RL approach based on two adaptive recurrent neural networks, one representing the world model, the other one a controller that uses the world model to extract a policy for maximizing expected reward. The method was inspired by previous combinations of nonrecurrent, reactive world models and controllers [73, 37, 26].
At any given time, until which temporal horizon should the predictor try to predict? In the AIXI case, the proper way of treating the temporal horizon is not to discount it exponentially, as done in most traditional work on reinforcement learning, but to let the future horizon grow in proportion to the learner’s lifetime so far. It remains to be seen whether this insight carries over to oops-rl.
Despite the bias-optimality properties of oops for certain ordered task sequences, however, oops-rl is not necessarily the best way of spending limited time in general reinforcement learning situations. On the other hand, it is possible to use oops as a proof-searching submodule of the recent, optimal, universal, reinforcement learning Gödel machine  discussed in the next section.
11 The Gödel Machine
The Gödel machine explicitly addresses the ‘Grand Problem of Artificial Intelligence’ by optimally dealing with limited resources in general reinforcement learning settings, and with the possibly huge (but constant) slowdowns buried by AIXI in the somewhat misleading $O()$-notation. It is designed to solve arbitrary computational problems beyond those solvable by plain oops, such as maximizing the expected future reward of a robot in a possibly stochastic and reactive environment (note that the total utility of some robot behavior may be hard to verify—its evaluation may consume the robot’s entire lifetime).
How does it work? While executing some arbitrary initial problem solving strategy, the Gödel machine simultaneously runs a proof searcher which systematically and repeatedly tests proof techniques. Proof techniques are programs that may read any part of the Gödel machine’s state, and write on a reserved part which may be reset for each new proof technique test. In an example Gödel machine  this writable storage includes the variables proof and switchprog, where switchprog holds a potentially unrestricted program whose execution could completely rewrite any part of the Gödel machine’s current software. Normally the current switchprog is not executed. However, proof techniques may invoke a special subroutine check() which tests whether proof currently holds a proof showing that the utility of stopping the systematic proof searcher and transferring control to the current switchprog at a particular point in the near future exceeds the utility of continuing the search until some alternative switchprog is found. Such proofs are derivable from the proof searcher’s axiom scheme which formally describes the utility function to be maximized (typically the expected future reward in the expected remaining lifetime of the Gödel machine), the computational costs of hardware instructions (from which all programs are composed), and the effects of hardware instructions on the Gödel machine’s state. The axiom scheme also formalizes known probabilistic properties of the possibly reactive environment, and also the initial Gödel machine state and software, which includes the axiom scheme itself (no circular argument here). Thus proof techniques can reason about expected costs and results of all programs including the proof searcher.
Once check() has identified a provably good switchprog, the latter is executed (some care has to be taken here because the proof verification itself and the transfer of control to switchprog also consume part of the typically limited lifetime). The discovered switchprog represents a globally optimal self-change in the following sense: provably none of all the alternative switchprogs and proofs (that could be found in the future by continuing the proof search) is worth waiting for.
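Schematically (all names here are hypothetical, and real proof search and verification are of course far more involved), one step of the loop looks like:

```python
# Schematic sketch of one step of the Goedel machine's main loop. Proof
# techniques read the machine state and propose a (proof, switchprog) pair;
# check() decides whether the proof shows that executing switchprog now
# beats continuing the proof search.

def godel_machine_step(state, proof_techniques, check):
    for technique in proof_techniques:
        proof, switchprog = technique(state)  # may read any part of the state
        if check(proof, state):
            # provably useful self-change: run the rewrite, stop searching
            return switchprog(state)
    return state  # no provably good switchprog found; keep the old software

# toy demo: a "proof" claiming positive utility gain justifies the switch
techniques = [lambda s: ({"utility_gain": 1}, lambda st: st + 1)]
accept = lambda proof, state: proof["utility_gain"] > 0
new_state = godel_machine_step(5, techniques, accept)
```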
There are many ways of initializing the proof searcher. Although identical proof techniques may yield different proofs depending on the time of their invocation (due to the continually changing Gödel machine state), there is a bias-optimal and asymptotically optimal proof searcher initialization based on a variant of oops  (Section 9). It exploits the fact that proof verification is a simple and fast business where the particular optimality notion of oops is appropriate. The Gödel machine itself, however, may have an arbitrary, typically different and more powerful sense of optimality embodied by its given utility function.
Recent theoretical and practical advances are currently driving a renaissance in the fields of universal learners and optimal search. A new kind of AI is emerging. Does it really deserve the attribute “new,” given that its roots date back to the 1930s, when Gödel published the fundamental result of theoretical computer science and Zuse started to build the first general purpose computer (completed in 1941), and the 1960s, when Solomonoff and Kolmogorov published their first relevant results? An affirmative answer seems justified, since it is the recent results on practically feasible computable variants of the old incomputable methods that are currently reinvigorating the long dormant field. The “new” AI is new in the sense that it abandons the mostly heuristic or non-general approaches of the past decades, offering methods that are both general and theoretically sound, and provably optimal in a sense that does make sense in the real world.
We are led to claim that the future will belong to universal or near-universal learners that are more general than traditional reinforcement learners / decision makers depending on strong Markovian assumptions, or than learners based on traditional statistical learning theory, which often require unrealistic i.i.d. or Gaussian assumptions. Due to ongoing hardware advances the time has come for optimal search in algorithm space, as opposed to the limited space of reactive mappings embodied by traditional methods such as artificial feedforward neural networks.
It seems safe to bet that not only computer scientists but also physicists and other inductive scientists will start to pay more attention to the fields of universal induction and optimal search, since their basic concepts are irresistibly powerful and general and simple. How long will it take for these ideas to unfold their full impact? A very naive and speculative guess driven by wishful thinking might be based on identifying the “greatest moments in computing history”
and extrapolating from there. Which are those “greatest moments”? Obvious candidates are:
1623: first mechanical calculator by Schickard starts the computing age (followed by machines of Pascal, 1640, and Leibniz, 1670).
Roughly two centuries later: concept of a programmable computer (Babbage, UK, 1834-1840).
One century later: fundamental theoretical work on universal integer-based programming languages and the limits of proof and computation (Gödel, Austria, 1931, reformulated by Turing, UK, 1936); first working programmable computer (Zuse, Berlin, 1941).
(The next 50 years saw many theoretical advances as well as faster and faster switches—relays were replaced by tubes by single transistors by numerous transistors etched on chips—but arguably this was rather predictable, incremental progress without radical shake-up events.)
Half a century later: World Wide Web (UK’s Berners-Lee, Switzerland, 1990).
This list seems to suggest that each major breakthrough tends to come roughly twice as fast as the previous one. Extrapolating the trend, optimists should expect the next radical change to manifest itself one quarter of a century after the most recent one, that is, by 2015, which happens to coincide with the date when the fastest computers will match brains in terms of raw computing power, according to frequent estimates based on Moore’s law. The author is confident that the coming 2015 upheaval (if any) will involve universal learning algorithms and Gödel machine-like, optimal, incremental search in algorithm space —possibly laying a foundation for the remaining series of faster and faster additional revolutions culminating in an “Omega point” expected around 2040.
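The extrapolation is simple interval-halving arithmetic: the gap from the 1990 breakthrough back to the ~1940 one was about 50 years, so the projected future gaps are 25, 12.5, ... years, and the series converges to 1990 + 50 = 2040:

```python
# Interval-halving extrapolation behind the 2015/2040 guess: each projected
# breakthrough arrives after half the previous interval, so the event dates
# converge geometrically to last_event + interval.

def next_events(last_event, interval, n):
    """Project n further events, halving the interval each time."""
    events = []
    for _ in range(n):
        interval /= 2
        last_event += interval
        events.append(round(last_event))
    return events

projected = next_events(1990, 50, 3)  # first projection: 1990 + 25 = 2015
# the full series 1990 + 25 + 12.5 + ... approaches the year 2040
```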
Hutter’s frequently mentioned work was funded through the author’s SNF grant 2000-061847 “Unification of universal inductive inference and sequential decision theory.” Over the past three decades, numerous discussions with Christof Schmidhuber (a theoretical physicist) helped to crystallize the ideas on computable universes—compare his notion of “mathscape”.
-  M. Beeson. Foundations of Constructive Mathematics. Springer-Verlag, Heidelberg, 1985.
-  J. S. Bell. On the problem of hidden variables in quantum mechanics. Rev. Mod. Phys., 38:447–452, 1966.
-  C. H. Bennett and D. P. DiVincenzo. Quantum information and computation. Nature, 404(6775):256–259, 2000.
-  C. M. Bishop. Neural networks for pattern recognition. Oxford University Press, 1995.
-  L. E. J. Brouwer. Over de Grondslagen der Wiskunde. Dissertation, Doctoral Thesis, University of Amsterdam, 1907.
-  F. Cajori. History of mathematics (2nd edition). Macmillan, New York, 1919.
-  G. Cantor. Über eine Eigenschaft des Inbegriffes aller reellen algebraischen Zahlen. Crelle’s Journal für Mathematik, 77:258–263, 1874.
-  G.J. Chaitin. A theory of program size formally identical to information theory. Journal of the ACM, 22:329–340, 1975.
-  G.J. Chaitin. Algorithmic Information Theory. Cambridge University Press, Cambridge, 1987.
-  D. Deutsch. The Fabric of Reality. Allen Lane, New York, NY, 1997.
-  T. Erber and S. Putterman. Randomness in quantum mechanics – nature’s ultimate cryptogram? Nature, 318(7):41–43, 1985.
-  H. Everett III. ‘Relative State’ formulation of quantum mechanics. Reviews of Modern Physics, 29:454–462, 1957.
-  E. F. Fredkin and T. Toffoli. Conservative logic. International Journal of Theoretical Physics, 21(3/4):219–253, 1982.
-  R. V. Freyvald. Functions and functionals computable in the limit. Transactions of Latvijas Vlasts Univ. Zinatn. Raksti, 210:6–19, 1977.
-  P. Gács. On the relation between descriptional complexity and algorithmic probability. Theoretical Computer Science, 22:71–93, 1983.
-  K. Gödel. Über formal unentscheidbare Sätze der Principia Mathematica und verwandter Systeme I. Monatshefte für Mathematik und Physik, 38:173–198, 1931.
-  E. M. Gold. Limiting recursion. Journal of Symbolic Logic, 30(1):28–46, 1965.
-  M.B. Green, J.H. Schwarz, and E. Witten. Superstring Theory. Cambridge University Press, 1987.
-  S. Hochreiter, A. S. Younger, and P. R. Conwell. Learning to learn using gradient descent. In Lecture Notes on Comp. Sci. 2130, Proc. Intl. Conf. on Artificial Neural Networks (ICANN-2001), pages 87–94. Springer: Berlin, Heidelberg, 2001.
-  M. Hutter. Convergence and error bounds of universal prediction for general alphabet. Proceedings of the 12th European Conference on Machine Learning (ECML-2001), (TR IDSIA-07-01, cs.AI/0103015), 2001. (On J. Schmidhuber’s SNF grant 20-61847).
-  M. Hutter. General loss bounds for universal sequence prediction. In C. E. Brodley and A. P. Danyluk, editors, Proceedings of the International Conference on Machine Learning (ICML-2001), pages 210–217. Morgan Kaufmann, 2001. (On J. Schmidhuber’s SNF grant 20-61847).
-  M. Hutter. Towards a universal theory of artificial intelligence based on algorithmic probability and sequential decisions. Proceedings of the 12th European Conference on Machine Learning (ECML-2001), pages 226–238, 2001. (On J. Schmidhuber’s SNF grant 20-61847).
-  M. Hutter. The fastest and shortest algorithm for all well-defined problems. International Journal of Foundations of Computer Science, 13(3):431–443, 2002. (On J. Schmidhuber’s SNF grant 20-61847).
-  M. Hutter. Self-optimizing and Pareto-optimal policies in general environments based on Bayes-mixtures. In J. Kivinen and R. H. Sloan, editors, Proceedings of the 15th Annual Conference on Computational Learning Theory (COLT 2002), Lecture Notes in Artificial Intelligence, pages 364–379, Sydney, Australia, 2002. Springer. (On J. Schmidhuber’s SNF grant 20-61847).
-  M. Hutter. A gentle introduction to the universal algorithmic agent AIXI. In B. Goertzel and C. Pennachin, editors, Real AI: New Approaches to Artificial General Intelligence. Plenum Press, New York, 2003. To appear.
-  M. I. Jordan and D. E. Rumelhart. Supervised learning with a distal teacher. Technical Report Occasional Paper #40, Center for Cog. Sci., Massachusetts Institute of Technology, 1990.
-  L.P. Kaelbling, M.L. Littman, and A.W. Moore. Reinforcement learning: a survey. Journal of AI research, 4:237–285, 1996.
-  A.N. Kolmogorov. Three approaches to the quantitative definition of information. Problems of Information Transmission, 1:1–11, 1965.
-  L. A. Levin. Universal sequential search problems. Problems of Information Transmission, 9(3):265–266, 1973.
-  L. A. Levin. Laws of information (nongrowth) and aspects of the foundation of probability theory. Problems of Information Transmission, 10(3):206–210, 1974.
-  M. Li and P. M. B. Vitányi. An Introduction to Kolmogorov Complexity and its Applications (2nd edition). Springer, 1997.
-  L. Löwenheim. Über Möglichkeiten im Relativkalkül. Mathematische Annalen, 76:447–470, 1915.
-  N. Merhav and M. Feder. Universal prediction. IEEE Transactions on Information Theory, 44(6):2124–2147, 1998.
-  T. Mitchell. Machine Learning. McGraw Hill, 1997.
-  C. H. Moore and G. C. Leach. FORTH - a language for interactive computing, 1970. http://www.ultratechnology.com.
-  A. Newell and H. Simon. GPS, a program that simulates human thought. In E. Feigenbaum and J. Feldman, editors, Computers and Thought, pages 279–293. McGraw-Hill, New York, 1963.
-  D. Nguyen and B. Widrow. The truck backer-upper: An example of self learning in neural networks. In Proceedings of the International Joint Conference on Neural Networks, pages 357–363. IEEE Press, 1989.
-  R. Penrose. The Emperor’s New Mind. Oxford University Press, 1989.
-  K. R. Popper. The Logic of Scientific Discovery. Hutchinson, London, 1934.
-  H. Putnam. Trial and error predicates and the solution to a problem of Mostowski. Journal of Symbolic Logic, 30(1):49–57, 1965.
-  J. Rissanen. Stochastic complexity and modeling. The Annals of Statistics, 14(3):1080–1100, 1986.
-  H. Rogers, Jr. Theory of Recursive Functions and Effective Computability. McGraw-Hill, New York, 1967.
-  P. S. Rosenbloom, J. E. Laird, and A. Newell. The SOAR Papers. MIT Press, 1993.
-  D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing, volume 1, pages 318–362. MIT Press, 1986.
-  C. Schmidhuber. Strings from logic. Technical Report CERN-TH/2000-316, CERN, Theory Division, 2000. http://xxx.lanl.gov/abs/hep-th/0011065.
-  J. Schmidhuber. Reinforcement learning in Markovian and non-Markovian environments. In D. S. Lippman, J. E. Moody, and D. S. Touretzky, editors, Advances in Neural Information Processing Systems 3, pages 500–506. Morgan Kaufmann, 1991.
-  J. Schmidhuber. Discovering solutions with low Kolmogorov complexity and high generalization capability. In A. Prieditis and S. Russell, editors, Machine Learning: Proceedings of the Twelfth International Conference, pages 488–496. Morgan Kaufmann Publishers, San Francisco, CA, 1995.
-  J. Schmidhuber. A computer scientist’s view of life, the universe, and everything. In C. Freksa, M. Jantzen, and R. Valk, editors, Foundations of Computer Science: Potential - Theory - Cognition, volume 1337, pages 201–208. Lecture Notes in Computer Science, Springer, Berlin, 1997.
-  J. Schmidhuber. Discovering neural nets with low Kolmogorov complexity and high generalization capability. Neural Networks, 10(5):857–873, 1997.
-  J. Schmidhuber. Algorithmic theories of everything. Technical Report IDSIA-20-00, quant-ph/0011122, IDSIA, Manno (Lugano), Switzerland, 2000. Sections 1-5: see ; Section 6: see .
-  J. Schmidhuber. Sequential decision making based on direct search. In R. Sun and C. L. Giles, editors, Sequence Learning: Paradigms, Algorithms, and Applications. Springer, 2001. Lecture Notes on AI 1828.
-  J. Schmidhuber. Hierarchies of generalized Kolmogorov complexities and nonenumerable universal measures computable in the limit. International Journal of Foundations of Computer Science, 13(4):587–612, 2002.
-  J. Schmidhuber. Optimal ordered problem solver. Technical Report IDSIA-12-02, IDSIA, Manno-Lugano, Switzerland, 2002. Available at arXiv:cs.AI/0207097 or http://www.idsia.ch/~juergen/oops.html. Machine Learning Journal, Kluwer, 2003, accepted.
-  J. Schmidhuber. The Speed Prior: a new simplicity measure yielding near-optimal computable predictions. In J. Kivinen and R. H. Sloan, editors, Proceedings of the 15th Annual Conference on Computational Learning Theory (COLT 2002), Lecture Notes in Artificial Intelligence, pages 216–228. Springer, Sydney, Australia, 2002.
-  J. Schmidhuber. Bias-optimal incremental problem solving. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 1571–1578, Cambridge, MA, 2003. MIT Press.
-  J. Schmidhuber. Gödel machines: self-referential universal problem solvers making provably optimal self-improvements. Technical Report IDSIA-19-03, arXiv:cs.LO/0309048 v2, IDSIA, Manno-Lugano, Switzerland, October 2003.
-  J. Schmidhuber. The new AI: General & sound & relevant for physics. Technical Report TR IDSIA-04-03, Version 1.0, cs.AI/0302012 v1, February 2003.
-  J. Schmidhuber. Towards solving the grand problem of AI. In P. Quaresma, A. Dourado, E. Costa, and J. F. Costa, editors, Soft Computing and complex systems, pages 77–97. Centro Internacional de Mathematica, Coimbra, Portugal, 2003. Based on .
-  J. Schmidhuber and M. Hutter. NIPS 2002 workshop on universal learning algorithms and optimal search. Additional speakers: R. Solomonoff, P. M. B. Vitányi, N. Cesa-Bianchi, I. Nemenmann. Whistler, CA, 2002.
-  J. Schmidhuber, J. Zhao, and M. Wiering. Shifting inductive bias with success-story algorithm, adaptive Levin search, and incremental self-improvement. Machine Learning, 28:105–130, 1997.
-  T. Skolem. Logisch-kombinatorische Untersuchungen über Erfüllbarkeit oder Beweisbarkeit mathematischer Sätze nebst einem Theorem über dichte Mengen. Skrifter utgit av Videnskapsselskapet in Kristiania, I, Mat.-Nat. Kl., N4:1–36, 1919.
-  R.J. Solomonoff. A formal theory of inductive inference. Part I. Information and Control, 7:1–22, 1964.
-  R.J. Solomonoff. Complexity-based induction systems. IEEE Transactions on Information Theory, IT-24(5):422–432, 1978.
-  R.J. Solomonoff. An application of algorithmic probability to problems in artificial intelligence. In L. N. Kanal and J. F. Lemmer, editors, Uncertainty in Artificial Intelligence, pages 473–491. Elsevier Science Publishers, 1986.
-  R.J. Solomonoff. A system for incremental learning based on algorithmic probability. In Proceedings of the Sixth Israeli Conference on Artificial Intelligence, Computer Vision and Pattern Recognition, pages 515–527, Tel Aviv, Israel, 1989.
-  G. ’t Hooft. Quantum gravity as a dissipative deterministic system. Technical Report SPIN-1999/07/gr-gc/9903084, http://xxx.lanl.gov/abs/gr-qc/9903084, Institute for Theoretical Physics, Univ. of Utrecht, and Spinoza Institute, Netherlands, 1999. Also published in Classical and Quantum Gravity 16, 3263.
-  A. M. Turing. On computable numbers, with an application to the Entscheidungsproblem. Proceedings of the London Mathematical Society, Series 2, 41:230–267, 1936.
-  S. Ulam. Random processes and transformations. In Proceedings of the International Congress on Mathematics, volume 2, pages 264–275, 1950.
-  V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.
-  J. von Neumann. Theory of Self-Reproducing Automata. University of Illinois Press, Champaign, IL, 1966.
-  C. S. Wallace and D. M. Boulton. An information theoretic measure for classification. Computer Journal, 11(2):185–194, 1968.
-  P. J. Werbos. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard University, 1974.
-  P. J. Werbos. Learning how the world works: Specifications for predictive networks in robots and brains. In Proceedings of IEEE International Conference on Systems, Man and Cybernetics, N.Y., 1987.
-  M.A. Wiering and J. Schmidhuber. Solving POMDPs with Levin search and EIRA. In L. Saitta, editor, Machine Learning: Proceedings of the Thirteenth International Conference, pages 534–542. Morgan Kaufmann Publishers, San Francisco, CA, 1996.
-  K. Zuse. Rechnender Raum. Elektronische Datenverarbeitung, 8:336–344, 1967.
-  K. Zuse. Rechnender Raum. Friedrich Vieweg & Sohn, Braunschweig, 1969. English translation: Calculating Space, MIT Technical Translation AZT-70-164-GEMIT, Massachusetts Institute of Technology (Proj. MAC), Cambridge, Mass. 02139, Feb. 1970.
-  A. K. Zvonkin and L. A. Levin. The complexity of finite objects and the algorithmic concepts of information and randomness. Russian Math. Surveys, 25(6):83–124, 1970.