extended Solomonoff induction to the full AI (general reinforcement learning) setting where an agent is taking a sequence of actions that may affect the unknown environment to achieve as large an amount of reward as possible. The resulting agent was named AIXI. Here we take a closer look at what principles underlie Solomonoff induction and the AIXI agent. We are going to derive Solomonoff induction from four general principles and discuss how AIXI follows from extended versions of the same.
Our setting consists of a reference universal Turing machine (UTM), a binary sequence (produced by an unrevealed environment program on the reference machine) fed incrementally to the agent, and a loss function (or reward structure). We give the agent in question the task of choosing a program for the reference machine so as to minimize the loss. The loss is in general defined to be a function from a pair of programs, an environment program and an agent program, to real numbers. The loss function can be such that it is only the prediction (for a certain number of bits) produced by the program that matters, or it can care about exactly which program was presented. A loss function of the former kind leads to the agent performing the task of prediction, which is what Solomonoff induction is primarily concerned with, while the latter can be viewed as identifying an explanatory hypothesis, which is more closely related to the minimum message length principle [WB68, WD99, Wal05] or the minimum description length principle [Ris78, Grü07, Ris10]. Solomonoff induction uses a mixture of hypotheses to achieve the best possible prediction. Note that the fact that we pick one program does not rule out that the choice is internally based on a mixture. In the case when the loss only cares about the prediction, the program is only a representation of that prediction and not really a hypothesis.
The principles are designed to avoid stating what the internal workings of the agent should be and instead derive those as a consequence of the demands on the behaviour. Thus we demand rationality instead of stating explicitly that the agent should have probabilistic beliefs, and we demand time consistency instead of explicitly stating probabilistic conditioning. The computability principle avoids saying that the agent should have a hypothesis class that consists of all computable environments by instead demanding that it deliver a computation procedure (a program for our reference machine) that produces its prediction for the next few bits. The indifference principle states what the initial preferences of the agent must be, i.e. a demand for how the initial decision should be taken. The choice is based on symmetry with respect to a chosen representation scheme for sequences, e.g. programs on a reference machine. In other words, we do not allow the agent to be biased in a certain sense that depends on our reference machine. Informally we state the principles as follows:
Computability: If we are going to guess the future of a sequence, we should choose a computation procedure (a program for the reference machine) that produces the predicted bits.
Rationality: We should choose our predicted sequence such that the dependence on the priorities (formalized by a reward (or loss) structure) is consistent.
Indifference: The initial choice between programs only depends on their length and the priorities (again formalized by reward (or loss)).
Time Consistency: The choice of program is not changed by a new observation if the program’s output is consistent with the observation and the reward structure is still the same and concerned with the same bits.
Our reasoning leading from external behavioural principles to a completely defined internal procedure can be summarized as follows: the rationality principle tells us that we need to have probabilistic beliefs over some set of alternatives; the computability principle tells us what the alternatives are, namely programs; the indifference principle leads to a choice of the original beliefs; the time-consistency principle leads to a simple procedure for updating the beliefs that the rationality principle tells us must exist, namely conditioning. In total it leads to Solomonoff Induction.
We cannot remove any of the principles without losing the complete specification of a procedure. The first principle (computability) is part of the setup of what we ask the agent to do. Without the second (rationality) we lose the restriction that we take decisions based on maximum expected utility with respect to probabilistic beliefs, and one could then have an agent that always chose the same program (e.g. a very short one). Without the third principle (indifference) we could have any a priori beliefs, and without the fourth (time consistency) the agent could after a while change its mind regarding what beliefs it started with.
We are considering a setting where we give an agent a task that is defined by a reference machine (a UTM), a reward structure (or loss function if we negate) and a binary sequence that is presented one bit at a time. The binary sequence is generated by a program for the reference machine.
The agent must (as stated by the first principle) choose a program for the reference machine (whose output must be consistent with anything that we have seen, in case we have made observations) and then use its output (which can be of finite or infinite length) as a prediction. If we want to predict at least $m$ bits we have to restrict ourselves to machines that output at least $m$ bits. We will consider an enumeration $T_1, T_2, \ldots$ of all programs. We are also going to consider a class of reward structures $r_i(k)$. The meaning is that if we guess that the sequence is (the output of) $T_i$ and the actual sequence is (the output of) $T_k$, then we receive reward $r_i(k)$. Note that for any finite string there are always Turing machines that compute it. We will furthermore suppose that $r_i(k) \to 0$ as $k \to \infty$ for every $i$. This means that we consider it to be a harder and harder task to guess $T_k$ as $k$ gets really large. This assumption is not strictly necessary, as we will discuss later.
Section 2 provides background on Solomonoff induction and AIXI. In Section 3 we deal with the first two principles mentioned above about rationality and computability. In Section 4, we discuss the third principle which defines a prior from a (Universal Turing Machine) representation. Section 5 describes the sequence prediction algorithm that results from adding the fourth principle to what has been achieved in the previous sections. Section 6 extends our analysis to the case where an agent takes a sequence of actions that may affect its environment. Section 7 concerns equivalence between our beliefs over deterministic environments and beliefs over a much larger class of stochastic environments.
2.1 Sequence Prediction
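We consider both finite and infinite sequences from a finite alphabet $\mathcal{X}$. We denote the finite strings by $\mathcal{X}^*$ and we use the notation $x_{1:t}$ for the first $t$ elements in a sequence $x$. A function $\rho: \mathcal{X}^* \to [0,1]$
is a probability measure if
$$\rho(x) = \sum_{a \in \mathcal{X}} \rho(xa) \quad \text{for all } x \in \mathcal{X}^* \tag{1}$$
and $\rho(\epsilon) = 1$, where $\epsilon$ is the empty string. Such a function describes a priori probabilistic beliefs about the sequence. If the equality in (1) is instead $\geq$ and $\rho(\epsilon) \leq 1$, then we have a semi-measure. We define the probability of seeing the string $a$ after seeing $x$ as being $\rho(a \mid x) := \rho(xa)/\rho(x)$. If we have a loss function $L: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, we ([Hut07]) choose, after seeing the string $x$, to predict
$$\arg\min_{a \in \mathcal{X}} \sum_{b \in \mathcal{X}} L(a, b)\, \rho(b \mid x). \tag{2}$$
More generally, if we have an alphabet $\mathcal{Y}$ of actions we can take and a loss function $L: \mathcal{Y} \times \mathcal{X} \to \mathbb{R}$, we make the choice
$$\arg\min_{y \in \mathcal{Y}} \sum_{b \in \mathcal{X}} L(y, b)\, \rho(b \mid x). \tag{3}$$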
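As an illustration of the loss-minimizing choice described in this subsection, the following is a minimal Python sketch. The measure `biased_coin`, the 0/1 loss and the function names are hypothetical choices made only for this example; any function on finite strings that satisfies the measure conditions could be substituted.

```python
from fractions import Fraction


def predict(rho, history, alphabet, loss):
    """Return the guess that minimizes expected loss under rho(. | history)."""
    base = rho(history)
    if base == 0:
        raise ValueError("history has probability zero under rho")

    def expected_loss(guess):
        # Conditional probability of the next symbol b is rho(history + b) / rho(history).
        return sum(loss(guess, b) * rho(history + b) / base for b in alphabet)

    return min(alphabet, key=expected_loss)


def biased_coin(x):
    """Hypothetical measure: independent bits with P(1) = 3/4."""
    p = Fraction(1)
    for bit in x:
        p *= Fraction(3, 4) if bit == "1" else Fraction(1, 4)
    return p


def zero_one_loss(guess, actual):
    """Loss 0 for a correct guess, 1 for an error."""
    return 0 if guess == actual else 1


prediction = predict(biased_coin, "0010", "01", zero_one_loss)
print(prediction)
```

Exact rational arithmetic (`Fraction`) is used so that the additivity condition on the measure can be checked without rounding error.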
2.2 The Solomonoff Prior
Ray Solomonoff [Sol60] defined a set of priors that only differ by a multiplicative constant. We call them Solomonoff priors. To define them we need to first introduce some notions about Turing machines [Tur36].
A monotone Turing machine (which we will just call Turing machine and whose exact technical definition can be found in [LV08]) is a function from a set of (binary) strings to binary sequences that can either be finite or infinite. We demand that it be possible to describe the function as a machine with unidirectional input and output tapes, read/write heads, a bidirectional work tape and a finite state machine that decides the next action of the machine given the symbols under the head on the input and work tape. The input tape is read only and the output tape is write only. We write $T(p) = x*$ if the output of $T$ starts with $x$ when given input (program) $p$.
A universal Turing machine is a Turing machine that can emulate all other Turing machines in the sense that for every Turing machine $T$ there is at least one prefix $q$ such that when $qp$ is fed to the universal Turing machine, it computes the same output as $T$ would when fed $p$ (see [LV08, Hut05] for further details).
A sequence is called computable if some Turing machine outputs it, or in other words, if for every universal Turing machine there is a program that leads to this sequence being the output.
We can also define what we will call a computable environment from a Turing machine. A computable environment is something which you (an agent) feed an action to and the environment outputs a string which we call a perception. We can for example have a finite number of possible actions and we put one after another on the input tape of the machine. We wait until the previous input has been processed and one of finitely many outputs has been produced. The machine might halt after a finite number of actions have been processed or it might run for ever.
Definition 1 (Semi-measure from Turing machine).
Given a Turing machine $T$, we let
$$\lambda_T(x) := \sum_{p \,:\, T(p) = x*} 2^{-\ell(p)} \tag{4}$$
where $\ell(p)$ is the length of the program (input) $p$ and $T(p) = x*$ means that $T$ starts with outputting $x$ when fed $p$, though it might continue and output more afterwards.
If the Turing machine $T$ in Definition 1 is universal we call $\lambda_T$ a Solomonoff distribution. Solomonoff induction is defined by letting $\rho$ in Section 2.1 be the Solomonoff prior for some universal Turing machine. If $U$ is a universal Turing machine and $T$ is any Turing machine, there exists a constant $C > 0$ (namely $2^{-\ell(q)}$ where $q$ is the prefix that encodes $T$ in $U$) such that
$$\lambda_U(x) \geq C\, \lambda_T(x) \quad \text{for all } x. \tag{5}$$
The set of semi-measures $\lambda_T$ obtained from all Turing machines $T$ can be identified [LV08] with the set of all lower semi-computable semi-measures (see [LV08] for definitions and proofs). The property expressed by (5) is called universality (or dominance) and is the key to proving the strong convergence results of Solomonoff Induction [Sol78, LV08, Hut05, Hut07].
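The semi-measure of Definition 1 and the dominance property can be illustrated on a toy example. The sketch below is only an illustration under strong assumptions: the "machines" `T` and `U` are hypothetical finite lookup tables from programs to outputs, not real (let alone universal) Turing machines, and `U` is simply constructed so that it emulates `T` behind the prefix `q`.

```python
from fractions import Fraction


def semi_measure(machine, x):
    """Sum of 2^-len(p) over all programs p whose output starts with x."""
    return sum(Fraction(1, 2 ** len(p))
               for p, out in machine.items() if out.startswith(x))


# Hypothetical toy machine T: a finite, prefix-free table program -> output.
T = {"0": "0000", "10": "0101", "11": "1111"}

# Hypothetical machine U that emulates T behind the prefix q
# and additionally runs two programs of its own.
q = "01"
U = {q + p: out for p, out in T.items()}
U.update({"00": "0011", "1": "1010"})

C = Fraction(1, 2 ** len(q))  # the dominance constant 2^-len(q)

for x in ["", "0", "1", "00", "01", "11"]:
    print(repr(x), semi_measure(T, x), semi_measure(U, x))
```

Every program `p` of `T` reappears in `U` as `q + p` with weight reduced by the factor `C`, which is why the weights of `U` are at least `C` times those of `T` for every string.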
In the active case where an agent is taking a sequence of actions to achieve some sort of objective, we are trying to determine the best policy , defined as a function from a history of actions and perceptions to a choice of the next action . The function from the sequence prediction case is in the active case of the form and represents the probability of seeing given that we have chosen actions . We can again define a “learning” algorithm by conditioning on what we have seen to define
If and , then we also write for the left hand side in (6).
Suppose that we have an enumerated set of policies to choose from. Given a definition of reward for a sequence of percepts that can for example be defined as in reinforcement learning by splitting into observation and reward and using a discounted reward sum [SB98, Hut05], then we can define
and make the choice
If we have a class of environments (say the computable environments) and if is defined by saying that we assign probability to being the true environment, then we let if is the sequence of perceptions resulting from using policy in environment . Then and we choose the policy with index
As outlined in [Hut05], one can choose a Solomonoff distribution also over active environments. The resulting agent is referred to as AIXI.
3 Choosing a Program
In this section we describe the setup of the second principle mentioned in the introduction, namely rationality. The section is much briefer than what is suitable for the topic and we refer the reader to our companion paper [SH11] for a more comprehensive treatment. Rationality is meant in the sense of internal consistency [Sug91], which is how it has been used in [NM44] and [Sav54]. We set up simple axioms for a rational decision maker, which imply that the decisions can be explained (or defined) from probabilistic beliefs. The approach to probability by [Ram31, deF37] interprets probabilities as fair betting odds. There is an intuitive similarity between our setup and the idea of explaining/deriving probabilities as a bookmaker’s betting odds as done in [deF37] and [Ram31].
Before we consider the question regarding which program we want to choose, we will first consider the question of whether we are prepared to accept guessing $T_i$ for a given $r_i$ (i.e. accepting this bet). We suppose that the alternative is to abstain (reject) and receive zero reward. We introduce rationality axioms and prove that we must have probabilistic beliefs over the possible sequences. Note that for any given $i$, we have a sequence $r_i(\cdot)$ in $c_0$ (the space of real valued sequences that converge to $0$). We will set up some common sense rationality axioms for the way we make our decisions. We will demand that a decision can be taken for any reward structure ($r_i(k)$ with fixed $i$) from $c_0$. If $x$ is acceptable and $\lambda \geq 0$ then we want $\lambda x$ to be acceptable, since this is simply a multiple of the same. We also want the sum of two acceptable reward structures to be acceptable. If we cannot lose (receive negative reward) we are prepared to accept, while if we are guaranteed to gain we are not prepared to reject. We cannot remove any axiom without losing the conclusion.
Definition 2 (Rationality).
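Suppose that we have a function defining the decision reject/accept/either for every reward structure $x \in c_0$, and let $Z$ be the set of acceptable and $\tilde{Z}$ the set of rejectable reward structures. We say that the decisions are rational if:
1. $x \in Z$ if and only if $-x \in \tilde{Z}$.
2. If $x, y \in Z$ and $\lambda, \gamma \geq 0$, then $\lambda x + \gamma y \in Z$.
3. If $x_k \geq 0$ for all $k$ then $x \in Z$, while if $x_k > 0$ for all $k$ then $x \notin \tilde{Z}$.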
The following theorem connects our rationality axioms with the Hahn-Banach theorem [Kre89] and concludes that rational decisions can be described with a positive continuous linear functional on the space of reward structures. The Banach space dual of $c_0$ is $\ell^1$, which gives us a probabilistic representation of underlying beliefs.
Theorem 3 (Linear separation).
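Given the assumptions in Definition 2 there exists a positive continuous linear functional $f: c_0 \to \mathbb{R}$ defined by $f(x) = \sum_k p_k x_k$ where $p_k \geq 0$, and $0 < \sum_k p_k < \infty$, such that
$$\{x \mid f(x) > 0\} \subseteq Z \subseteq \{x \mid f(x) \geq 0\}.$$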
Proof. The second property tells us that $Z$ and $\tilde{Z}$ are convex cones. The first and third properties tell us that $Z \neq c_0$. Suppose that there is a point $x$ that lies in both the interior of $Z$ and of $\tilde{Z}$. Then the same is true for $-x$ according to the first property, and for the origin. That a ball around the origin lies in $Z$ means that $Z = c_0$, which is not true. Thus the interiors of $Z$ and $\tilde{Z}$ are disjoint open convex sets and can, therefore, be separated by a hyperplane (according to the Hahn-Banach theorem) which goes through the origin (since $0 \in Z \cap \tilde{Z}$ according to the first and third properties). Since a decision is defined for every reward structure, $Z \cup \tilde{Z} = c_0$. Given a separating hyperplane (between the interiors of $Z$ and $\tilde{Z}$), $Z$ must contain everything on one side. This means that $Z$ is a half space whose boundary is a hyperplane that goes through the origin, and the closure $\bar{Z}$ of $Z$ is a closed half space that can be written as $\{x \mid f(x) \geq 0\}$ for some $f$ in the Banach space dual of $c_0$. The third property tells us that $f$ is positive. ∎
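Theorem 3 also leads us to how to choose between different options. If we consider picking $T_i$ over $T_j$, we will do (accept) that if $r_i - r_j$ is accepted. This is the case if $\sum_k p_k r_i(k) \geq \sum_k p_k r_j(k)$. The conclusion is that if we are presented with a reward structure $r$ and a class $\{T_k\}$, and we assign probability $p_k$ to $T_k$ being the truth, then we choose
$$\arg\max_i \sum_k p_k\, r_i(k).$$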
If we replace the space $c_0$ by $\ell^\infty$ as the space of reward structures in Theorem 3, the conclusion (see [SH11]) is instead that $f$ is in the Banach space dual of $\ell^\infty$, which contains $\ell^1$ (the countably additive measures) but also functionals that cannot be written in the form $f(x) = \sum_k p_k x_k$. The dual of $\ell^\infty$ is sometimes called the ba space [Die84] and it consists of all finitely additive measures.
In this section we will discuss how indifference together with a representation leads to a choice of prior weights. The representation will be given in terms of codes that are strings of letters from a finite alphabet, and it tells us which distinctions we will apply our indifference principle to. Choosing the first bit can be viewed as choosing between two propositions, e.g. “$x$ is a vegetable” or “$x$ is a fruit”. More choices follow until a full specification (a code word for the given reference machine) is reached. The section describes the usual material on the Solomonoff distribution (see [LV08]) in a way that highlights in what sense it is based on indifference. The indifference principle itself is an external behavioural principle.
Definition 5 (Indifference).
Given a reward structure for two alternative outcomes of an event, where we receive $r_1$ or $r_2$ depending on the outcome, then if we are indifferent we accept this bet if $r_1 + r_2 \geq 0$. For an agent with probabilistic beliefs that maximizes expected utility, this means that equal probability is assigned to both possibilities.
We will discuss examples that are based on considering the set $\{$apple, orange, carrot$\}$ and the representation that is defined by first separating fruit from vegetables and then the fruits into apples and oranges.
We are about to open a box within which there is either a fruit or a vegetable. We have no other information (except possibly, a list of what is a fruit and what is a vegetable).
We are about to open a box within which there is either an apple, or an orange or a carrot. We have no other information.
Consider a representation where we use binary codes. If the first digit is a $0$ it means a vegetable, i.e. a carrot. No more digits are needed to describe the object. If the first digit is a $1$ it means a fruit. If the next digit after the $1$ is a $0$ it is an apple and if it is a $1$ it is an orange. In the absence of any other background knowledge/information, and given that we are going to be indifferent for this choice, we assign uniform probabilities for each choice of letter in the string. For our examples this results in probabilities $P(\text{fruit}) = P(\text{vegetable}) = 1/2$. After concluding this we consider the next distinction and conclude that $P(\text{apple} \mid \text{fruit}) = P(\text{orange} \mid \text{fruit}) = 1/2$. This means that the decision maker has the prior beliefs $P(\text{carrot}) = 1/2$, $P(\text{apple}) = P(\text{orange}) = 1/4$.
An alternative representation would be to have a trinary alphabet and give each object its own letter. The result of this is $P(\text{apple}) = P(\text{orange}) = P(\text{carrot}) = 1/3$, $P(\text{fruit}) = 2/3$ and $P(\text{vegetable}) = 1/3$.
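The two representations in the examples above can be compared with a short Python sketch. It assigns each code word the weight $k^{-\text{length}}$ for an alphabet of size $k$, which is what indifference over each letter amounts to. The concrete code words (0 for carrot, 10 for apple, 11 for orange, and one ternary letter per object) and the function name are assumptions made for illustration.

```python
from fractions import Fraction


def outcome_weights(code, alphabet_size=2):
    """Weight of each outcome: sum of alphabet_size^-len(word) over its code words."""
    weights = {}
    for word, outcome in code.items():
        weight = Fraction(1, alphabet_size ** len(word))
        weights[outcome] = weights.get(outcome, Fraction(0)) + weight
    return weights


# Binary representation: first digit separates vegetable from fruit,
# the second digit separates the two fruits (digit assignment is an assumption).
binary_code = {"0": "carrot", "10": "apple", "11": "orange"}

# Trinary representation: each object gets its own letter.
ternary_code = {"0": "apple", "1": "orange", "2": "carrot"}

binary = outcome_weights(binary_code)
ternary = outcome_weights(ternary_code, alphabet_size=3)

fruit_binary = binary["apple"] + binary["orange"]
fruit_ternary = ternary["apple"] + ternary["orange"]
print(binary, fruit_binary)
print(ternary, fruit_ternary)
```

The same three outcomes receive different prior beliefs under the two representations, which is the sense in which the choice of representation determines what the agent is indifferent between.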
The following formalizes the definition of a code and a prefix-free code. Since we are assuming that the possible outcomes are never special cases of each other, we need our code to be prefix-free. Furthermore, Kraft’s inequality says that $\sum_{c \in C} 2^{-\ell(c)} \leq 1$ if the set $C$ of binary code words is prefix-free.
Definition 8 (Codes).
A code for a set $A$ is a set $C$ of strings of letters from a finite alphabet and a surjective map from $C$ to $A$. We say that a code is prefix-free if no code string is a proper prefix of another.
Definition 9 (Computable Representation).
We say that a code is a computable representation if the map from code-strings to outcomes is a Turing machine.
In the definition below we provide the formula for how a binary representation of the letters in an alphabet leads to a choice of a distribution. It is easily extended to non-binary representations.
Definition 10 (Distribution from representation).
Given a binary prefix-free code $C$ for $A$ (our possible outcomes), the expression
$$w_a = \sum_{c \in C \,:\, c \text{ represents } a} 2^{-\ell(c)}$$
defines a measure over $A$.
Though the formula in Definition 10 uniquely determines the weights given a representation, there is still a very wide choice of representations. We deal with this concern by restricting ourselves to the class of universal representations, which have the property that given any other computable representation, the universal weights are at least a constant times the weights resulting from the other representation. See [Sol60, LV08, Hut05] for a more extensive treatment. These universal representations are defined by having a universal Turing machine (in our case the given reference machine) as the map from codes to outcomes.
Definition 11 (Universal Representation).
If a universal Turing machine is used for defining the map from codes to outcomes we say that we have a universal (computable) representation.
The weights $w_a$ that result from using a universal representation satisfy the property that if $w'_a$ are the resulting weights from another computable representation, then there is a constant $C > 0$ such that $w_a \geq C w'_a$ for all $a$. This follows directly from the universality of the Turing machine, which means that any other Turing machine can be simulated on the universal one by adding an extra prefix (interpreter) $q$ to each code. That is, feeding $qp$ to the universal machine gives the same output as feeding $p$ to the other machine. The constant is $C = 2^{-\ell(q)}$.
Applying Definition 10 together with a representation of finite strings based on a universal Turing machine gives us the Solomonoff semi-measure.
Given a universal Turing machine $U$ we create a set of codes $C$ from all programs that generate an output of at least $m$ bits. We let the code $p$ represent the finite string $x$ with $\ell(x) = m$ if $U(p) = x*$. We show below that this representation together with Definition 10 leads to the Solomonoff distribution for the next $m$ bits. By considering all $m$ we recover the Solomonoff semi-measure over $\{0,1\}^*$.
Formally, given $m$ we let $A = \{0,1\}^m$ (in Definition 10), we define $C$ as the set of programs $p$ for which $U(p)$ has at least $m$ bits, and conclude that
$$w_x = \sum_{p \,:\, U(p) = x*} 2^{-\ell(p)} = \lambda_U(x),$$
which is the Solomonoff semi-measure. ∎
Remark 13 (Unique Representation).
Given a universal Turing machine, we could choose to let only the shortest program that generates a certain output represent that output, and not all the programs that generate this output. The length of the shortest program that gives output $x$ is called the Kolmogorov complexity $K(x)$ of $x$. Using only the shortest program leads to the slightly different weights
$$w_x = 2^{-K(x)}.$$
5 Sequence Prediction
In this section we summarize how Solomonoff Induction as described in [Hut07] follows from what we have presented in Section 3 and Section 4 together with our fourth principle of time consistency. Consider a binary sequence that is revealed to us one bit at a time. We are trying to predict the future of the sequence, either one bit, several bits or all of them. By combining the conclusions of Sections 3 and 4, we can define a sequence prediction algorithm which turns out to be Solomonoff Induction. The results from Section 3 tell us that if we are going to be able to make rational guesses about which computable sequence we will see, we need to have probabilistic beliefs.
If we are interested in predicting a finite number of bits, we need to design the reward structure in Section 3 to reflect what we are interested in. If we want to predict the next bit we can let $r_i(k) = 1$ if $T_i$ and $T_k$ have the same next bit and $r_i(k) = 0$ otherwise. This leads to (a weighted majority decision to) predicting $1$ if the total probability $p_k$ of the programs $T_k$ whose next bit is $1$ exceeds the total probability of the programs whose next bit is $0$, and $0$ if the reverse inequality is true. The reasoning and result generalize naturally to predicting finitely many bits, and we can interpret this as minimizing the expected number of errors.
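The weighted majority decision for the next bit can be sketched in a few lines of Python. The hypothesis list below (candidate output sequences with prior weights) is hypothetical and stands in for the outputs of the enumerated programs; breaking ties in favour of `"0"` is also an arbitrary choice of this sketch.

```python
from fractions import Fraction


def predict_next_bit(hypotheses, t):
    """Weighted majority vote on bit number t (0-based) of the candidate sequences."""
    weight_one = sum(w for seq, w in hypotheses if seq[t] == "1")
    weight_zero = sum(w for seq, w in hypotheses if seq[t] == "0")
    # Ties are broken in favour of "0" (an arbitrary choice).
    return "1" if weight_one > weight_zero else "0"


# Hypothetical candidate sequences with their prior weights.
hypotheses = [
    ("0101", Fraction(1, 2)),
    ("0111", Fraction(1, 4)),
    ("0011", Fraction(1, 8)),
]

print(predict_next_bit(hypotheses, 1))
print(predict_next_bit(hypotheses, 2))
```

With the 0/1 reward described above, picking the bit with the larger total weight is exactly the choice that maximizes expected reward, i.e. minimizes the expected number of errors.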
Suppose that we have observed a number of bits of the sequence. This results in contradictions with many of the candidate sequences, and they can be ruled out. We next formally state the fourth principle from the introduction.
Definition 14 (Time-consistency).
Suppose that we are observing a sequence one bit at a time ( at time ). Suppose that we (at time ) want to predict the next bits of a sequence and our decisions (for any and ) are defined by a function from the set of all reward structures ( where in the binary case) to the set of strings of length .
Suppose that if and starts with . If it then follows that where is the restriction of to the strings that start with (and we identify such a string of length with the string of length that follow the first bit) and if this implication is true for any we say that we have time-consistency.
Suppose that we have a semi-measure and that we at time (given any loss ) predict the next bits according to
If we furthermore assume time-consistency and observe , then we predict
Suppose that there are and such that . This obviously contradicts time-consistency. In other words, time-consistency implies that the relative beliefs in strings that are not yet contradicted remain the same. Therefore, the decision function after seeing can be described by a semi-measure where the inconsistent alternatives have been ruled out and the others just renormalized. This is what (13) is describing. The only remaining point to make is that we have expressed (12) and (13) in terms of loss instead of reward, though it is simply a matter of changing the sign and exchanging max for min. ∎
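The updating procedure described in the proof (rule out the contradicted alternatives and renormalize the rest) can be sketched as follows. The prior over three candidate sequences is hypothetical and only serves to show that relative beliefs among the surviving alternatives are unchanged.

```python
from fractions import Fraction


def condition(hypotheses, observed):
    """Drop hypotheses contradicted by `observed` and renormalize the rest."""
    consistent = {seq: w for seq, w in hypotheses.items()
                  if seq.startswith(observed)}
    total = sum(consistent.values())
    if total == 0:
        raise ValueError("every hypothesis is contradicted by the observation")
    return {seq: w / total for seq, w in consistent.items()}


# Hypothetical prior beliefs over candidate sequences.
prior = {
    "0101": Fraction(1, 2),
    "0111": Fraction(1, 4),
    "1000": Fraction(1, 8),
}

posterior = condition(prior, "01")
print(posterior)
```

The ratio between the weights of `"0101"` and `"0111"` is 2 both before and after the observation, which is the property that time-consistency demands.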
6 The AIXI Agent
In this section we discuss extensions to the case where an agent is choosing a sequence of actions that affect the environment it is in. We will simply replace the principle that says that we predict computable sequences by one that says that we predict computable environments. The environments are such that the agent takes an action that is fed to the environment and the environment responds with an output that we call a perception. There is a finite alphabet for the action and one for the perception.
Our aim is to choose a policy for the agent. This is a function from the history of the actions and perceptions that have appeared so far to the action which the agent chooses next. Suppose that we have a class of policies $\{\pi_i\}$, a class of (all) computable environments $\{\nu_k\}$ and a reward structure $r_i(k)$ which is the total reward for using policy $\pi_i$ in environment $\nu_k$. To assume the property that $r_i(k) \to 0$ as $k \to \infty$ would mean that we assume that the stakes are lower in the environments of high index. This is somewhat restrictive, and there are alternatives to making this assumption (that the reward structure is in $c_0$). In a separate article [SH11] on rationality axioms we investigate the result of assuming that we instead have the larger space $\ell^\infty$ (see Remark 4). The conclusion is that we get finite additivity instead of countable additivity for the probability measure, but that we can get back to countable additivity by adding an extra monotonicity assumption. The arguments in Section 3 imply (given the reward structure) that we must assign probabilities $p_k$ for the environment being $\nu_k$ and choose a policy with index
$$\arg\max_i \sum_k p_k\, r_i(k).$$
This is what the AIXI agent described in [Hut05] is doing. The AIXI choice of weights corresponds to the choice $2^{-K(\nu)}$ (as in Remark 13), but for the class of lower semi-computable environments discussed below in Section 7.
The same updating technique as in Section 5, where we eliminate the environments which are inconsistent with what has occurred, is being used. This is deduced from the same time-consistency principle as for sequence prediction, just stating that the relative belief in environments that are still consistent will remain unchanged. This leads to the AIXI agent from [Hut05].
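The combination of the two steps, eliminating inconsistent environments and then choosing the policy with the largest expected total reward, can be sketched on a toy problem. This is not the AIXI agent itself: the three deterministic environments, the three policies, the prior weights and all function names are hypothetical, and a percept is reduced to a reward in {0, 1}.

```python
from fractions import Fraction


def consistent(env, history):
    """True if the environment reproduces every reward seen in the history."""
    past = []
    for action, reward in history:
        if env(past, action) != reward:
            return False
        past.append((action, reward))
    return True


def future_reward(policy, env, history, steps):
    """Total reward of a policy over `steps` further interactions."""
    history, total = list(history), 0
    for _ in range(steps):
        action = policy(history)
        reward = env(history, action)
        history.append((action, reward))
        total += reward
    return total


def choose_policy(policies, envs, weights, history, steps):
    """Index of the policy with maximal expected future reward after conditioning."""
    live = [(env, w) for env, w in zip(envs, weights) if consistent(env, history)]
    norm = sum(w for _, w in live)
    if norm == 0:
        raise ValueError("no environment is consistent with the history")

    def value(i):
        return sum(w / norm * future_reward(policies[i], env, history, steps)
                   for env, w in live)

    return max(range(len(policies)), key=value)


# Hypothetical deterministic environments (history, action) -> reward.
def env_likes_a(history, action):
    return 1 if action == "a" else 0

def env_likes_b(history, action):
    return 1 if action == "b" else 0

def env_alternate(history, action):
    wanted = "a" if len(history) % 2 == 0 else "b"
    return 1 if action == wanted else 0


policies = [
    lambda h: "a",                               # always a
    lambda h: "b",                               # always b
    lambda h: "a" if len(h) % 2 == 0 else "b",   # alternate a, b, a, ...
]
envs = [env_likes_a, env_likes_b, env_alternate]
weights = [Fraction(5, 8), Fraction(1, 8), Fraction(1, 4)]

initial_choice = choose_policy(policies, envs, weights, [], steps=4)
later_choice = choose_policy(policies, envs, weights, [("a", 1), ("a", 0)], steps=2)
print(initial_choice, later_choice)
```

Before any observation the heaviest environment dominates the choice; after the history `[("a", 1), ("a", 0)]` only the alternating environment remains consistent, so the alternating policy is chosen.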
7 Remarks on Stochastic Lower
Having the belief that the environment is computable does seem like a restrictive assumption, though we will here argue that it is in an interesting way equivalent to having beliefs over all lower semi-computable stochastic environments. The Solomonoff prior is based on having belief $2^{-\ell(p)}$ in having input program $p$ defining the environment. We can (proven up to a multiplicative factor in [LV08] and as an exact identity in [WSH11]), however, rewrite this prior as a mixture $\sum_\nu w_\nu \nu$ over all lower semi-computable environments $\nu$, where $w_\nu > 0$ for all $\nu$. Therefore, acting according to our Solomonoff mixture over computable environments is identical to acting according to beliefs over a much larger set of environments where we have randomness.
We defined four principles for universal sequence prediction and showed that Solomonoff induction and AIXI are determined from them. These principles are computability, rationality, indifference and time consistency. Computability tells us that Turing machines are the explanations we consider for what we are seeing. Rationality tells us that we have probabilistic beliefs over these. Time-consistency leads to the conclusion that we update these beliefs based on conditional probability, and the principle of indifference tells us how to choose the original beliefs based on how compactly the various Turing machines can be implemented on the reference machine.
This work was supported by ARC grant DP0988049.
- [deF37] B. deFinetti. La prevision: Ses lois logiques, ses sources subjectives. In Annales de l’Institut Henri Poincare 7, pages 1–68. Paris, 1937.
- [Die84] J. Diestel. Sequences and series in Banach spaces. Springer-Verlag, 1984.
- [Grü07] P. Grünwald. The Minimum Description Length Principle. MIT Press Books. The MIT Press, 2007.
- [Hut05] M. Hutter. Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability. Springer, Berlin, 2005.
- [Hut07] M. Hutter. On universal prediction and Bayesian confirmation. Theoretical Computer Science, 384:33–48, 2007.
- [Kre89] E. Kreyszig. Introductory Functional Analysis With Applications. Wiley, 1989.
- [LV08] M. Li and P. Vitányi. Kolmogorov Complexity and its Applications. Springer, 2008.
- [NM44] J. von Neumann and O. Morgenstern. Theory of Games and Economic Behavior. Princeton University Press, 1944.
- [Ram31] F. Ramsey. Truth and probability. In R. B. Braithwaite, editor, The Foundations of Mathematics and other Logical Essays, chapter 7, pages 156–198. Brace & Co., 1931.
- [RH11] S. Rathmanner and M. Hutter. A philosophical treatise of universal induction. Entropy, 13(6):1076–1136, 2011.
- [Ris78] J. Rissanen. Modeling By Shortest Data Description. Automatica, 14:465–471, 1978.
- [Ris10] J. Rissanen. Minimum description length principle. In C. Sammut and G. Webb, editors, Encyclopedia of Machine Learning, pages 666–668. Springer, 2010.
- [Sav54] L. Savage. The Foundations of Statistics. Wiley, New York, 1954.
- [SB98] R. Sutton and A. Barto. Reinforcement Learning: An Introduction (Adaptive Computation and Machine Learning). The MIT Press, March 1998.
- [SH11] P. Sunehag and M. Hutter. Axioms for rational reinforcement learning. In Proc. of 22nd International Conf. on Algorithmic Learning Theory, Espoo, Finland, 2011.
- [Sol60] R. Solomonoff. A Preliminary Report on a General Theory of Inductive Inference. Report V-131, Zator Co, Cambridge, Ma., 1960.
- [Sol78] R.J. Solomonoff. Complexity-based induction systems: comparisons and convergence theorems. IEEE Transactions on Information Theory, 24:422–432, 1978.
- [Sol96] R.J. Solomonoff. Does algorithmic probability solve the problem of induction? In Proceedings of the Information, Statistics and Induction in Science Conference, 1996.
- [Sug91] R. Sugden. Rational choice: A survey of contributions from economics and philosophy. Economic Journal, 101(407):751–85, July 1991.
- [Tur36] A. M. Turing. On Computable Numbers, with an application to the Entscheidungsproblem. Proc. London Math. Soc., 2(42):230–265, 1936.
- [Wal05] C.S. Wallace. Statistical and Inductive Inference by Minimum Message Length. Springer-Verlag (Information Science and Statistics), 2005.
- [WB68] C. S. Wallace and D.M. Boulton. An information measure for classification. Computer Journal, 11:185–194, 1968.
- [WD99] C. S. Wallace and D. L. Dowe. Minimum message length and Kolmogorov complexity. Computer Journal, 42:270–283, 1999.
- [WSH11] I. Wood, P. Sunehag, and M. Hutter. (Non-)Equivalence of universal priors. In Proc. of Solomonoff Memorial Conference, Melbourne, Australia, 2011.