MC-AIXI-CTW by Marcus Hutter and his students (in particular Daniel Visentin)
This paper introduces a principled approach for the design of a scalable general reinforcement learning agent. Our approach is based on a direct approximation of AIXI, a Bayesian optimality notion for general reinforcement learning agents. Previously, it has been unclear whether the theory of AIXI could motivate the design of practical algorithms. We answer this hitherto open question in the affirmative, by providing the first computationally feasible approximation to the AIXI agent. To develop our approximation, we introduce a new Monte-Carlo Tree Search algorithm along with an agent-specific extension to the Context Tree Weighting algorithm. Empirically, we present a set of encouraging results on a variety of stochastic and partially observable domains. We conclude by proposing a number of directions for future research.
A main difficulty of doing research in artificial general intelligence has always been in defining exactly what artificial general intelligence means. There are many possible definitions [LH07], but the AIXI formulation [Hut05] seems to capture in concrete quantitative terms many of the qualitative attributes usually associated with intelligence.
Consider an agent that exists within some (unknown to the agent) environment. The agent interacts with the environment in cycles. At each cycle, the agent executes an action and receives in turn an observation and a reward. There is no explicit notion of state, neither with respect to the environment nor internally to the agent. The general reinforcement learning problem is to construct an agent that, over time, collects as much reward as possible in this setting.
The AIXI agent is a mathematical solution to the general reinforcement learning problem. The AIXI setup mirrors that of the general reinforcement learning problem; however, the environment is assumed to be an unknown but computable function, i.e. the observations and rewards received by the agent, given its actions, can be computed by a Turing machine. Furthermore, the AIXI agent is assumed to exist for a finite, but arbitrarily large, amount of time. The AIXI agent results from a synthesis of two ideas:
the use of a finite-horizon expectimax operation from sequential decision theory for action selection; and
an extension of Solomonoff’s universal induction scheme [Sol64] for future prediction in the agent context.
More formally, let $U(q, a_1 a_2 \ldots a_n)$ denote the output of a universal Turing machine $U$ supplied with program $q$ and input $a_1 a_2 \ldots a_n$, let $m \in \mathbb{N}$ be a finite lookahead horizon, and let $\ell(q)$ denote the length in bits of program $q$. The action picked by AIXI at time $t$, having executed actions $a_1 a_2 \ldots a_{t-1}$ and received the sequence of observation-reward pairs $o_1 r_1 o_2 r_2 \ldots o_{t-1} r_{t-1}$ from the environment, is given by:
$$a_t = \arg\max_{a_t} \sum_{o_t r_t} \ldots \max_{a_{t+m}} \sum_{o_{t+m} r_{t+m}} [r_t + \cdots + r_{t+m}] \sum_{q \,:\, U(q, a_1 \ldots a_{t+m}) = o_1 r_1 \ldots o_{t+m} r_{t+m}} 2^{-\ell(q)}. \quad (1)$$
Intuitively, the agent considers the sum of the total reward over all possible futures (up to $m$ steps ahead), weighs each of them by the complexity of programs (consistent with the agent’s past) that can generate that future, and then picks the action that maximises expected future rewards. Equation (1) embodies in one line the major ideas of Bayes, Ockham, Epicurus, Turing, von Neumann, Bellman, Kolmogorov, and Solomonoff. The AIXI agent is rigorously shown in [Hut05] to be optimal in different senses of the word. (Technically, AIXI is Pareto optimal and ‘self-optimising’ in different classes of environment.) In particular, the AIXI agent will rapidly learn an accurate model of the environment and proceed to act optimally to achieve its goal.
The AIXI formulation also takes into account stochastic environments, because Equation (1) can be shown to be formally equivalent to the following expression:
$$a_t = \arg\max_{a_t} \sum_{o_t r_t} \ldots \max_{a_{t+m}} \sum_{o_{t+m} r_{t+m}} [r_t + \cdots + r_{t+m}] \sum_{\rho \in \mathcal{M}} 2^{-K(\rho)} \rho(o_1 r_1 \ldots o_{t+m} r_{t+m} \mid a_1 \ldots a_{t+m}), \quad (2)$$
where $\rho(o_1 r_1 \ldots o_{t+m} r_{t+m} \mid a_1 \ldots a_{t+m})$ is the probability of observing $o_1 r_1 \ldots o_{t+m} r_{t+m}$ given the actions $a_1 \ldots a_{t+m}$. The class $\mathcal{M}$ consists of all enumerable chronological semimeasures [Hut05], which includes all computable $\rho$, and $K(\rho)$ denotes the Kolmogorov complexity of $\rho$ [LV08].
The AIXI formulation is best understood as a rigorous definition of optimal decision making in general unknown
environments, and not as an algorithmic solution to the general AI problem. (AIXI, after all, is only asymptotically computable.) As such, its role in general AI research should be viewed as analogous to the role the minimax and empirical risk minimisation principles play in decision theory and statistical machine learning research. These principles define what is optimal behaviour if computational complexity is not an issue, and can provide important theoretical guidance in the design of practical algorithms. It is in this light that we see AIXI. This paper is an attempt to scale AIXI down to produce a practical agent that can perform well in a wide range of different, unknown and potentially noisy environments.
As can be seen in Equation (1), there are two parts to AIXI. The first is the expectimax search into the future which we will call planning. The second is the use of a Bayesian mixture over Turing machines to predict future observations and rewards based on past experience; we will call that learning. Both parts need to be approximated for computational tractability. There are many different approaches one can try. In this paper, we opted to use a generalised version of the UCT algorithm [KS06] for planning and a generalised version of the Context Tree Weighting algorithm [WST95] for learning. This harmonious combination of ideas, together with the attendant theoretical and experimental results, form the main contribution of this paper.
The paper is organised as follows. Section 2 describes the basic agent setting and discusses some design issues. Section 3 then presents a Monte Carlo Tree Search procedure that we will use to approximate the expectimax operation in AIXI. This is followed by a description of the context tree weighting algorithm and how it can be generalised for use in the agent setting in Section 4. We put the two ideas together in Section 5 to form our agent algorithm. Theoretical and experimental results are then presented in Sections 6 and 7. We end with a discussion of related work and other topics in Section 8.
A string $x_1 x_2 \ldots x_n$ of length $n$ is denoted by $x_{1:n}$. The prefix $x_{1:j}$ of $x_{1:n}$, $j \leq n$, is denoted by $x_{\leq j}$ or $x_{< j+1}$. The notation generalises for blocks of symbols: e.g. $ax_{1:n}$ denotes $a_1 x_1 a_2 x_2 \ldots a_n x_n$ and $ax_{<j}$ denotes $a_1 x_1 \ldots a_{j-1} x_{j-1}$. The empty string is denoted by $\epsilon$. The concatenation of two strings $s$ and $r$ is denoted by $sr$.
The (finite) action, observation, and reward spaces are denoted by $\mathcal{A}$, $\mathcal{O}$, and $\mathcal{R}$ respectively. Also, $\mathcal{X}$ denotes the joint perception space $\mathcal{O} \times \mathcal{R}$.
A history $h$ is a string $a_1 x_1 a_2 x_2 \ldots a_n x_n$, for some $n \geq 0$. A partial history is the prefix of some history.
The set of all history strings of maximum length $m$ will be denoted by $\mathcal{H}$.
The following definition states that the agent’s model of the environment takes the form of a probability distribution over possible observation-reward sequences conditioned on actions taken by the agent.
An environment model $\rho$ is a sequence of functions $\{\rho_0, \rho_1, \rho_2, \ldots\}$, $\rho_n : \mathcal{A}^n \to \text{Density}(\mathcal{X}^n)$, that satisfies:
1. $\forall a_{1:n} \, \forall x_{<n} : \; \rho_{n-1}(x_{<n} \mid a_{<n}) = \sum_{x_n \in \mathcal{X}} \rho_n(x_{1:n} \mid a_{1:n})$;
2. $\forall a_{1:n} \, \forall x_{1:n} : \; \rho_n(x_{1:n} \mid a_{1:n}) > 0$.
The first condition (called the chronological condition in [Hut05]) captures the natural constraint that an action has no effect on observations made before it. The second condition enforces the requirement that the probability of every possible observation-reward sequence is non-zero. This ensures that conditional probabilities are always defined. It is not a serious restriction in practice, as probabilities can get arbitrarily small. For convenience, we drop the index $n$ in $\rho_n$ from here onwards.
Given an environment model $\rho$, we have the following identities:
$$\rho(x_{1:n} \mid a_{1:n}) = \rho(x_1 \mid a_1)\,\rho(x_2 \mid a_1 x_1 a_2) \cdots \rho(x_n \mid ax_{<n} a_n) \quad (3)$$
$$\rho(x_n \mid ax_{<n} a_n) = \frac{\rho(x_{1:n} \mid a_{1:n})}{\rho(x_{<n} \mid a_{<n})} \quad (4)$$
We take reward to be a numeric value representing the magnitude of instantaneous pleasure experienced by the agent at any given time step. Our agent is a hedonist; its goal is to accumulate as much reward as it can during its lifetime. More precisely, in our setting the agent is only interested in maximising its future reward up to a fixed, finite, but arbitrarily large horizon $m$.
In order to act rationally, our agent seeks a policy that will allow it to maximise its future reward. Formally, a policy $\pi$ is a function that maps a history to an action. If we define $R_i(ax_{\leq t}) := r_i$ for $1 \leq i \leq t$, then we have the following definition for the expected future value of an agent acting under a particular policy.
Given a history $ax_{1:t}$, the $m$-horizon expected future reward of an agent acting under policy $\pi$ with respect to an environment model $\rho$ is:
$$v_{\rho}^{m}(\pi, ax_{1:t}) := \mathbb{E}_{\rho}\!\left[\sum_{i=t+1}^{t+m} R_i(ax_{\leq t+m})\right], \quad (5)$$
where for $t < i \leq t+m$, $a_i := \pi(ax_{<i})$. The quantity $v_{\rho}^{m}(\pi, ax_{1:t} a_{t+1})$ is defined similarly, except that $a_{t+1}$ is now no longer defined by $\pi$.
The optimal policy $\pi^*$ is the policy that maximises Equation (5). The maximal achievable expected future reward of an agent with history $h$ in environment $\rho$ looking $m$ steps ahead is $V_{\rho}^{m}(h) := v_{\rho}^{m}(\pi^*, h)$. It is easy to see that, for $h = ax_{1:t}$,
$$V_{\rho}^{m}(h) = \max_{a_{t+1}} \sum_{x_{t+1}} \cdots \max_{a_{t+m}} \sum_{x_{t+m}} \left[\sum_{i=t+1}^{t+m} r_i\right] \prod_{k=t+1}^{t+m} \rho(x_k \mid ax_{<k} a_k). \quad (6)$$
All of our subsequent efforts can be viewed as attempting to define an algorithm that determines a policy as close to the optimal policy as possible given reasonable resource constraints. Our agent is model based: we learn a model of the environment and use it to estimate the future value of our various actions at each time step. These estimates allow the agent to choose an approximately best action given limited computational resources.
We now discuss some high-level design issues before presenting our algorithm in the next section.
A major problem in general reinforcement learning is perceptual aliasing [Chr92], which refers to the situation where the instantaneous perceptual information (a single observation in our setting) does not provide enough information for the agent to act optimally. This problem is closely related to the question of what constitutes a state, an issue we discuss next.
A Markov state [SB98] provides a sufficient statistic for all future observations, and therefore provides sufficient information to represent optimal behaviour. No perceptual aliasing can occur with a Markov state. In Markov Decision Processes (MDPs) and Partially Observable Markov Decision Processes (POMDPs) all underlying environmental states are Markov.
A compact state representation is often assumed to generalise well and therefore enable efficient learning and planning. A common approach in reinforcement learning (RL) [SB98] is to approximate the environmental state by using a small number of handcrafted features. However, this approach requires both that the environmental state is known, and that sufficient domain knowledge is available to select the features.
In the general RL problem, neither the states nor the domain properties are known in advance. One approach to general RL is to find a compact representation of state that is approximately Markov [McC96, Sha07, SJR04, ST04], or a compact representation of state that maximises some performance criterion [Hut09b, Hut09a]. In practice, a Markov representation is rarely achieved in complex domains, and these methods must introduce some approximation, and therefore some level of perceptual aliasing.
In contrast, we focus on learning and planning methods that use the agent’s history as its representation of state. A history representation can be applied generally, without any domain knowledge. Importantly, a history representation requires no approximation and introduces no aliasing: each history is a perfect Markov state (or $n$-Markov for length-$n$ histories). In return for these advantages, we give up on compactness. The number of states in a history representation is exponential in the horizon length (or in $n$ for length-$n$ histories), and many of these histories may be equivalent. Nevertheless, a history representation can sometimes be more compact than the environmental state, as it ignores extraneous factors that do not affect the agent’s direct observations.
In order to form non-trivial plans that span multiple time steps, our agent needs to be able to predict the effects of its interaction with the environment. If a model of the environment is known, search-based methods offer one way of generating such plans. However, a general RL agent does not start with a model of the environment; it must learn one over time. Our agent builds an approximate model of the true environment from the experience it gathers when interacting with the real world, and uses it for online planning.
If the problem is small, model-based RL methods such as Value Iteration for MDPs can easily derive an optimal policy. However this is not appropriate for the larger problems more typical of the real world. Local search is one way to address this problem. Instead of solving the problem in its entirety, an approximate solution is computed before each decision is made. This approach has met with much success on difficult decision problems within the game playing research community and on large-sized POMDPs [RPPCD08].
The general RL problem is extremely difficult. On any real world problem, an agent is necessarily restricted to making approximately correct decisions. One of the distinguishing features of sophisticated heuristic decision making frameworks, such as those used in computer chess or computer go, is the ability of these frameworks to provide acceptable performance on hardware ranging from mobile phones through to supercomputers. To take advantage of the fast-paced advances in computer technology, we claim that a good autonomous agent framework should naturally and automatically scale with increasing computational resources. Both the learning and planning components of our approximate AIXI agent have been designed with scalability in mind.
One of the key resources in real world decision making is time.
As we are interested in a practical general agent framework, it is imperative that our agent be able to make good approximate decisions on demand. Different application domains have different real-world time constraints. We seek an agent framework that can make good, approximate decisions given anything from milliseconds to days of thinking time per action.
In this section we describe Predictive UCT, a Monte Carlo Tree Search (MCTS) technique for stochastic, partially observable domains that uses an incrementally updated environment model to predict and evaluate the possible outcomes of future action sequences.
The Predictive UCT algorithm is a straightforward generalisation of the UCT algorithm [KS06], a Monte Carlo planning algorithm that has proven effective in solving discounted or finite-horizon MDPs with large state spaces. The generalisation requires two parts:
The use of an environment model that is conditioned on the agent’s history, rather than a Markov state.
The updating of the environment model during search. This is essential for the algorithm to utilise the extra information an agent will have at a particular hypothetical future time point.
The generalisation involves a change in perspective which has significant practical ramifications in the context of general RL agents. Our extensions to UCT allow Predictive UCT, in combination with a sufficiently powerful predictive environment model $\rho$, to implicitly take into account the value of information in search and be applicable to partially observable domains.
Predictive UCT is a best-first Monte Carlo Tree Search technique that iteratively constructs a search tree in memory. The tree is composed of two interleaved types of nodes: decision nodes and chance nodes. These correspond to the alternating $\max$ and $\sum$ operations in the expectimax operation. Each node in the tree corresponds to a (partial) history $h$. If $h$ ends with an action, it is a chance node; if $h$ ends with an observation, it is a decision node. Each node contains a statistical estimate of the future reward.
Initially, the tree starts with a single decision node containing $|\mathcal{A}|$ children. Much like in existing MCTS methods [CWU08], there are four conceptual phases to a single iteration of Predictive UCT. The first is the selection phase, where the search tree is traversed from the root node to an existing leaf chance node $n$. The second is the expansion phase, where a new decision node is added as a child to $n$. The third is the simulation phase, where a playout policy in conjunction with the environment model $\rho$ is used to sample a possible future path from $n$ until a fixed distance $m$ from the root is reached. Finally, the backpropagation phase updates the value estimates for each node on the reverse trajectory leading back to the root. Whilst time remains, these four conceptual operations are repeated. Once the time limit is reached, an approximate best action can be selected by looking at the value estimates of the children of the root node.
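The four phases can be sketched compactly. The following is a minimal illustrative implementation, not the paper's algorithm: it assumes a memoryless toy environment (so no model update/revert is needed), collapses chance-node outcome sampling into a single `env.step` call, uses an unscaled UCB term, and all names (`Node`, `select_ucb`, `sample`) are our own.

```python
import math
import random

class Node:
    """Search-tree node; decision and chance nodes share this structure."""
    def __init__(self):
        self.children = {}   # action -> chance node, or observation -> decision node
        self.visits = 0
        self.value = 0.0     # mean of sampled future rewards through this node

def select_ucb(node, actions, C=1.0):
    """Selection phase: UCB over the chance-node children of a decision node."""
    def score(a):
        child = node.children.get(a)
        if child is None or child.visits == 0:
            return float("inf")          # unexplored actions are tried first
        return child.value + C * math.sqrt(math.log(node.visits) / child.visits)
    best = max(score(a) for a in actions)
    return random.choice([a for a in actions if score(a) == best])

def sample(node, env, depth, m):
    """One Predictive UCT iteration: selection, expansion, simulation
    (uniformly random playout), and backpropagation of the sampled return."""
    if depth == m:
        return 0.0
    if node.visits == 0:
        # simulation phase: random playout for the remaining m - depth steps
        ret = sum(env.step(random.choice(env.actions))[1] for _ in range(m - depth))
    else:
        a = select_ucb(node, env.actions)
        obs, r = env.step(a)                           # sample an outcome
        chance = node.children.setdefault(a, Node())   # expansion
        child = chance.children.setdefault(obs, Node())
        ret = r + sample(child, env, depth + 1, m)
        chance.value = (ret + chance.visits * chance.value) / (chance.visits + 1)
        chance.visits += 1
    node.value = (ret + node.visits * node.value) / (node.visits + 1)
    node.visits += 1
    return ret
```

Repeating `sample(root, env, 0, m)` until the time budget is exhausted and then picking the root child with the highest value estimate yields the approximate best action.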
During the selection phase, action selection at decision nodes is done using a policy that balances exploration and exploitation. This policy has two main effects:
to move the estimates of the future reward towards the maximum attainable future reward if the agent acted optimally.
to cause asymmetric growth of the search tree towards areas that have high predicted reward, implicitly pruning large parts of the search space.
The future reward at leaf nodes is estimated by choosing actions according to a heuristic policy until a total of $m$ actions have been made by the agent, where $m$ is the search horizon. This heuristic estimate helps the agent to focus its exploration on useful parts of the search tree, and in practice allows for a much larger search horizon than a brute-force expectimax search.
Predictive UCT builds a sparse search tree in the sense that observations are only added to chance nodes once they have been generated along some sample path. A full expectimax search tree would not be sparse; each possible stochastic outcome would be represented by a distinct node in the search tree. For expectimax, the branching factor at chance nodes is thus $|\mathcal{X}| = |\mathcal{O}||\mathcal{R}|$, which means that searching to even a moderately sized horizon $m$ is intractable.
Figure 1 shows an example Predictive UCT tree. Chance nodes are denoted with stars. Decision nodes are denoted by circles. The dashed lines from a star node indicate that not all of the children have been expanded. The squiggly line at the base of the leftmost leaf denotes the execution of a playout policy. The arrows proceeding up from this node indicate the flow of information back up the tree; this is defined in more detail in Section 3.
A decision node will always contain $|\mathcal{A}|$ distinct children, all of whom are chance nodes. Associated with each decision node representing a particular history $h$ will be a value function estimate, $\hat{V}(h)$. During the selection phase, a child will need to be picked for further exploration. Action selection in MCTS poses a classic exploration/exploitation dilemma. On one hand we need to allocate enough visits to all children to ensure that we have accurate estimates for them, but on the other hand we need to allocate enough visits to the maximal action to ensure convergence of the node's estimate to the value of the maximal child node.
Like UCT, Predictive UCT recursively uses the UCB policy [Aue02] from the -armed bandit setting at each decision node to determine which action needs further exploration. Although the uniform logarithmic regret bound no longer carries across from the bandit setting, the UCB policy has been shown to work well in practice in complex domains such as Computer Go [GW06] and General Game Playing [FB08]. This policy has the advantage of ensuring that at each decision node, every action eventually gets explored an infinite number of times, with the best action being selected exponentially more often than actions of lesser utility.
The visit count $T(h)$ of a decision node $h$ is the number of times $h$ has been sampled by the Predictive UCT algorithm. The visit count of the chance node found by taking action $a$ at $h$ is defined similarly, and is denoted by $T(ha)$.
Suppose $m$ is the search horizon and each single time-step reward is bounded in the interval $[\alpha, \beta]$. Given a node representing a history $h$ in the search tree, the action picked by the UCB action selection policy is:
$$a_{UCB}(h) := \arg\max_{a \in \mathcal{A}} \begin{cases} \dfrac{1}{m(\beta - \alpha)} \hat{V}(ha) + C \sqrt{\dfrac{\log T(h)}{T(ha)}} & \text{if } T(ha) > 0 \\ \infty & \text{otherwise,} \end{cases} \quad (7)$$
where $C$ is a positive parameter that controls the ratio of exploration to exploitation. If there are multiple maximal actions, one is chosen uniformly at random.
Note that we need a linear scaling of $\hat{V}(ha)$ in Definition 5 because the UCB policy is only applicable for rewards confined to the $[0, 1]$ interval.
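As a concrete sketch of this selection rule, the function below scales the value estimate into $[0, 1]$ by dividing by $m(\beta - \alpha)$ and treats unvisited actions as having infinite score. The function name and the dictionary-based bookkeeping are our own; the formula itself follows the definition above.

```python
import math
import random

def ucb_action(value, visits, total, actions, m, alpha, beta, C=1.0):
    """UCB action selection: value[a] estimates V(ha), visits[a] is T(ha),
    and total is T(h). Per-step rewards lie in [alpha, beta], so the value
    estimate is scaled into [0, 1] by dividing by m * (beta - alpha).
    Unvisited actions score infinity and are therefore tried first; ties
    are broken uniformly at random."""
    def score(a):
        if visits[a] == 0:
            return float("inf")
        return value[a] / (m * (beta - alpha)) + C * math.sqrt(math.log(total) / visits[a])
    best = max(score(a) for a in actions)
    return random.choice([a for a in actions if score(a) == best])
```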
Chance nodes follow immediately after an action is selected from a decision node. Each chance node $ha$ following a decision node $h$ contains an estimate of the future utility denoted by $\hat{V}(ha)$. Also associated with the chance node $ha$ is a density $\rho(\cdot \mid ha)$ over observation-reward pairs.
After an action $a$ is performed at node $h$, $\rho(\cdot \mid ha)$ is sampled once to generate the next observation-reward pair $or$. If $or$ has not been seen before, the node $haor$ is added as a child of $ha$. We will use the notation $\mathcal{X}_{ha} \subseteq \mathcal{X}$ to denote the children of the partial history $ha$ created so far.
If a leaf decision node is encountered at depth $k < m$ in the tree, a means of estimating the future reward for the remaining $m - k$ time steps is required. The agent applies its heuristic playout function $\Pi$ to estimate the sum of future rewards $\sum_{i=k+1}^{m} r_i$. A particularly simple, pessimistic baseline playout function is $\Pi_{random}$, which chooses an action uniformly at random at each time step.
A more sophisticated playout function that uses action probabilities estimated from previously taken real-world actions could potentially provide a better estimate. The quality of the actions suggested by such a predictor can be expected to improve over time, since it is trying to predict actions that are chosen by the agent after a Predictive UCT search. This powerful and intuitive method of constructing a generic heuristic will be explored further in a subsequent section.
Asymptotically, the heuristic playout policy makes no contribution to the value function estimates of Predictive UCT. When the remaining depth is zero, the playout policy always returns zero reward. As the number of simulations tends to infinity, the structure of the Predictive UCT search tree is equivalent to the exact depth expectimax tree with high probability. This implies that the asymptotic value function estimates of Predictive UCT are invariant to the choice of playout function. However, when search time is limited, the choice of playout policy will be a major determining factor of the overall performance of the agent.
After the selection phase is completed, a path of nodes $n_0 n_1 \ldots n_k$, $k \leq m$, will have been traversed from the root $n_0$ of the search tree to some leaf $n_k$. For each node $n_i$ on this path, the statistics maintained for the (partial) history $h_i$ associated with $n_i$ will be updated as follows:
$$\hat{V}(h_i) \leftarrow \frac{1}{T(h_i) + 1}\left[R_i + T(h_i)\,\hat{V}(h_i)\right] \quad (8)$$
$$T(h_i) \leftarrow T(h_i) + 1 \quad (9)$$
where $R_i$ denotes the sum of the rewards obtained, from the depth of node $n_i$ onwards, on the sampled trajectory.
Note that the same backup equations are applied to both decision and chance nodes.
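The backup amounts to an incremental running mean of the sampled returns at each node on the reverse trajectory. A minimal sketch (the `Stats` class and the `(path, rewards)` calling convention are our own):

```python
class Stats:
    """Per-node statistics: visit count T(h) and value estimate V(h)."""
    def __init__(self):
        self.visits = 0
        self.value = 0.0

def backpropagate(path, rewards):
    """Apply the backup equations along the reverse trajectory.

    path[i] is the node at depth i and rewards[i] is the reward observed
    at that step (with the leaf entry including any playout estimate).
    Each node's estimate becomes the running mean, over all simulations
    that passed through it, of the return measured from that node onward.
    The same update applies to decision and chance nodes."""
    ret = 0.0
    for node, r in zip(reversed(path), reversed(rewards)):
        ret += r
        node.value = (ret + node.visits * node.value) / (node.visits + 1)
        node.visits += 1
```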
Recall from Definition 2 that an environment model $\rho$ is a sequence of functions $\{\rho_0, \rho_1, \ldots\}$, where $\rho_n : \mathcal{A}^n \to \text{Density}(\mathcal{X}^n)$. When invoking the Sample routine to decide on an action, many hypothetical future experiences will be generated, with $\rho$ being used to simulate the environment at times $t+1, t+2, \ldots, t+m$. For the algorithm to work well in practice, we need to be able to perform the following two operations in time sublinear with respect to the length of the agent's entire experience string.
Update - given $\rho(x_{1:t} \mid a_{1:t})$, $a_{t+1}$, and $x_{t+1}$, produce $\rho(x_{1:t+1} \mid a_{1:t+1})$
Revert - given $\rho(x_{1:t+1} \mid a_{1:t+1})$, recover $\rho(x_{1:t} \mid a_{1:t})$
The revert operation is needed to restore the environment model to $\rho(\cdot \mid a_{1:t})$ after each simulation to time $t+m$ is performed. In Section 4, we will show how these requirements can be met efficiently by a certain kind of Bayesian mixture over a rich model class.
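The Update/Revert interface can be illustrated with a toy model (not the paper's mixture model): a single KT-style bit counter whose Update pushes the observed bit onto an undo stack so that Revert can restore the previous state in constant time. All names here are our own.

```python
class ReversibleModel:
    """A toy model supporting sublinear Update and Revert.

    The 'model' is just a pair of bit counts; the point is the interface:
    update() records enough information on an undo stack for revert() to
    restore the previous state in O(1), so the model can be rolled back
    cheaply after each simulated trajectory."""

    def __init__(self):
        self.counts = [0, 0]   # zeroes and ones seen so far
        self.undo_stack = []

    def update(self, bit):
        self.undo_stack.append(bit)
        self.counts[bit] += 1

    def revert(self):
        bit = self.undo_stack.pop()
        self.counts[bit] -= 1

    def predict(self, bit):
        # KT estimate of the next bit given the current counts
        return (self.counts[bit] + 0.5) / (sum(self.counts) + 1)
```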
We now give the pseudocode of the entire Predictive UCT algorithm.
Algorithm 1 is responsible for determining an approximate best action. Given the current history $h$, it first constructs a search tree containing estimates $\hat{V}(ha)$ for each $a \in \mathcal{A}$, and then selects a maximising action. An important property of Algorithm 1 is that it is anytime; an approximate best action is always available, whose quality improves with extra computation time.
For simplicity of exposition, Initialise can be understood to simply clear the entire search tree $\Psi$. In practice, it is possible to carry across information from one time step to another. If $\Psi_t$ is the search tree obtained at the end of time $t$, and $aor$ is the agent's actual action and experience at time $t$, then we can keep the subtree of $\Psi_t$ rooted at node $aor$ and make that the search tree for use at the beginning of the next time step. The remainder of the nodes in $\Psi_t$ can then be deleted.
As a Monte Carlo Tree Search routine, Algorithm 1 is embarrassingly parallel. The main idea is to concurrently invoke the Sample routine whilst providing appropriate locking mechanisms for the nodes in the search tree. An efficient parallel implementation is beyond the scope of the paper, but it is worth noting that ideas [CWH08] applicable to high performance Monte Carlo Go programs are easily transferred to our setting.
Algorithm 2 implements a single run through some trajectory in the search tree. It uses the SelectAction routine to choose moves at interior nodes, and invokes the playout policy at unexplored leaf nodes. Once a complete trajectory of length $m$ has been sampled, the recursion ensures that every visited node along the path to the leaf is updated as per the backup equations of Section 3.
The action chosen by SelectAction is specified by the UCB policy described in Definition 5. If the selected child has not been explored before, then a new node is added to the search tree. The constant $C$ is a parameter that is used to control the shape of the search tree; lower values of $C$ create deep, selective search trees, whilst higher values lead to shorter, bushier trees.
Context Tree Weighting (CTW) [WST95, WST97] is a theoretically well-motivated online binary sequence prediction algorithm that works well in practice [BEYY04]. It is an online Bayesian model averaging algorithm that computes a mixture of all prediction suffix trees [RST96] of a given bounded depth, with higher prior weight given to simpler models. We examine in this section several extensions of CTW needed for its use in the context of agents. Along the way, we will describe the CTW algorithm in detail.
We first look at how CTW can be generalised for use as an environment model (Definition 2), i.e. a function of the form $\rho_n : \mathcal{A}^n \to \text{Density}(\mathcal{X}^n)$. This means we need an extension of CTW that, incrementally, takes as input a sequence of actions and produces as output successive conditional probabilities over observations and rewards. The high-level view of the algorithm is as follows: we process observations and rewards one bit at a time using standard CTW, but bits representing actions are simply appended to the input sequence without updating the context tree. The algorithm is now described in detail. If we drop the action sequence throughout the following description, the algorithm reduces to the standard CTW algorithm.
We start with a brief review of the KT estimator [KT81] for Bernoulli distributions. Given a binary string $y_{1:t}$ with $a$ zeroes and $b$ ones, the KT estimate of the probability of the next symbol is as follows:
$$\Pr_{kt}(Y_{t+1} = 1 \mid y_{1:t}) = \frac{b + 1/2}{a + b + 1} \quad (10)$$
$$\Pr_{kt}(Y_{t+1} = 0 \mid y_{1:t}) = 1 - \Pr_{kt}(Y_{t+1} = 1 \mid y_{1:t}) \quad (11)$$
The KT estimator is obtained via a Bayesian analysis by putting a $(\frac{1}{2}, \frac{1}{2})$-Beta prior on the parameter of the Bernoulli distribution. From (10)-(11), we obtain the following expression for the block probability of a string:
$$\Pr_{kt}(y_{1:t}) = \Pr_{kt}(y_1 \mid \epsilon)\,\Pr_{kt}(y_2 \mid y_1) \cdots \Pr_{kt}(y_t \mid y_{<t}).$$
Given a binary string $s$, one can establish that $\Pr_{kt}(s)$ depends only on the number of zeroes and ones in $s$. If we let $0^a 1^b$ denote a string with $a$ zeroes and $b$ ones, then:
$$\Pr_{kt}(0^a 1^b) = \frac{\prod_{i=1}^{a}(i - 1/2)\,\prod_{j=1}^{b}(j - 1/2)}{(a+b)!}. \quad (12)$$
We write $\Pr_{kt}(a, b)$ to denote $\Pr_{kt}(0^a 1^b)$ in the following. The quantity $\Pr_{kt}(a, b)$ can be updated incrementally as follows:
$$\Pr_{kt}(a+1, b) = \frac{a + 1/2}{a + b + 1} \Pr_{kt}(a, b) \quad (13)$$
$$\Pr_{kt}(a, b+1) = \frac{b + 1/2}{a + b + 1} \Pr_{kt}(a, b) \quad (14)$$
with the base case being $\Pr_{kt}(0, 0) = 1$.
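The KT estimator and its incremental block-probability update can be captured in a few lines. The following is a minimal sketch (the class name is our own): the counts of zeroes and ones determine both the next-symbol probability and, via the incremental update, the block probability of the sequence seen so far.

```python
class KTEstimator:
    """Krichevsky-Trofimov estimator over a single bit stream."""

    def __init__(self):
        self.a = 0             # number of zeroes seen
        self.b = 0             # number of ones seen
        self.block_prob = 1.0  # Pr_kt of the sequence so far (empty string: 1)

    def prob_next(self, bit):
        """KT probability that the next symbol equals `bit`."""
        if bit == 1:
            return (self.b + 0.5) / (self.a + self.b + 1)
        return (self.a + 0.5) / (self.a + self.b + 1)

    def update(self, bit):
        """Observe `bit`: update the block probability incrementally, then the counts."""
        self.block_prob *= self.prob_next(bit)
        if bit == 1:
            self.b += 1
        else:
            self.a += 1
```

Since the block probability depends only on the counts, any two strings with the same number of zeroes and ones (e.g. 0101 and 0011) receive the same probability.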
We next describe prediction suffix trees, which are a form of variable-order Markov models.
A prediction suffix tree (PST) is a pair $(M, \Theta)$ satisfying the following:
$M$ is a binary tree where the left and right edges are labelled 1 and 0 respectively; and
associated with each leaf node $l$ in $M$ is a probability distribution over $\{0, 1\}$ parameterised by $\theta_l \in \Theta$ (the probability of 1).
We call $M$ the model of the PST and $\Theta$ the parameter of the PST, in accordance with the terminology of [WST95].
A prediction suffix tree $(M, \Theta)$ maps each binary string $y_{1:t}$, where $t \geq$ the depth of $M$, to a probability distribution over $\{0, 1\}$ in the natural way: we traverse the model $M$ by moving left or right at depth $d$ depending on whether the bit $y_{t-d}$ is one or zero, until we reach a leaf node in $M$, at which time we return its distribution. For example, the PST shown in Figure 2 maps the string 110 to the distribution at the leaf so reached. At the root node (depth 0), we move right because $y_3 = 0$. We then move left because $y_2 = 1$. We say this is the distribution associated with the string 110. Sometimes we need to refer to the leaf node holding the distribution associated with a string $s$; we denote that by $M(s)$, where $M$ is the model of the PST used to process the string.
To use a prediction suffix tree of depth $d$ for binary sequence prediction, we start with the distribution $\theta_l := 1/2$ at each leaf node $l$ of the tree. The first $d$ bits $y_{1:d}$ of the input sequence are set aside for use as an initial context, and the variable $h$ denoting the bit sequence seen so far is set to $y_{1:d}$. We then repeat the following steps as long as needed:
predict the next bit using the distribution $\theta_{M(h)}$ associated with $h$;
observe the next bit $y$, update $\theta_{M(h)}$ using Formula (10) by incrementing either $a$ or $b$ according to the value of $y$, and then set $h := hy$.
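These steps can be sketched as follows. The representation is our own and simplified: each node stores KT counts, only leaf counts are used, and the traversal reads the most recent history bits first (matching the mapping described above).

```python
class PSTNode:
    """Node of a prediction suffix tree; leaves hold KT counts [zeroes, ones]."""
    def __init__(self, one=None, zero=None):
        self.children = {} if one is None else {1: one, 0: zero}
        self.counts = [0, 0]

def find_leaf(root, history):
    """Traverse the model using the most recent history bits: at depth d,
    the bit history[-(d+1)] selects the 1-labelled or 0-labelled branch,
    until a leaf is reached."""
    node, d = root, 0
    while node.children:
        node = node.children[history[-(d + 1)]]
        d += 1
    return node

def predict_and_update(root, history, bit):
    """Return the KT probability that the next bit equals `bit` under the
    leaf distribution selected by `history`, then update that leaf's counts."""
    leaf = find_leaf(root, history)
    p = (leaf.counts[bit] + 0.5) / (sum(leaf.counts) + 1)
    leaf.counts[bit] += 1
    return p
```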
The above describes how a PST is used for binary sequence prediction. In the agent setting, we reduce the problem of predicting history sequences with general non-binary alphabets to that of predicting the bit representations of those sequences. Further, we only ever condition on actions and this is achieved by appending bit representations of actions to the input sequence without a corresponding update of the KT estimators. These ideas are now formalised.
For convenience, we will assume without loss of generality that $|\mathcal{A}| = 2^{l_{\mathcal{A}}}$ and $|\mathcal{X}| = 2^{l_{\mathcal{X}}}$ for some $l_{\mathcal{A}}, l_{\mathcal{X}} > 0$. Given $a \in \mathcal{A}$, we denote by $[\![a]\!] \in \{0, 1\}^{l_{\mathcal{A}}}$ the bit representation of $a$. Observation and reward symbols are treated similarly. Further, the bit representation of a symbol sequence $x_{1:t}$ is denoted by $[\![x_{1:t}]\!] = [\![x_1]\!][\![x_2]\!] \cdots [\![x_t]\!]$. The $i$th bit in $[\![x_{1:t}]\!]$ is denoted by $[\![x_{1:t}]\!]_i$ and the first $l$ bits of $[\![x_{1:t}]\!]$ are denoted by $[\![x_{1:t}]\!]_{1:l}$.
To do action-conditional prediction using a PST, we again start with $\theta_l := 1/2$ at each leaf node $l$ of the tree. We also set aside a sufficiently long initial portion of the binary history sequence corresponding to the first few cycles to initialise the variable $h$ as usual. The following steps are then repeated as long as needed:
set $h := h[\![a]\!]$, where $a$ is the current selected action;
for $i := 1$ to $l_{\mathcal{X}}$ do
predict the next bit using the distribution $\theta_{M(h)}$ associated with $h$;
observe the next bit $y$, update $\theta_{M(h)}$ using Formula (10) according to the value of $y$, and then set $h := hy$.
Now, let $M$ be the model of a prediction suffix tree, $L(M)$ the leaf nodes of $M$, $a_{1:t}$ an action sequence, and $x_{1:t}$ an observation-reward sequence. We have the following expression for the probability of $x_{1:t}$ given $M$ and $a_{1:t}$:
$$\Pr(x_{1:t} \mid M, a_{1:t}) = \prod_{i=1}^{t} \prod_{j=1}^{l_{\mathcal{X}}} \Pr\!\left([\![x_i]\!]_j \,\middle|\, M, [\![ax_{<i}a_i]\!]\,[\![x_i]\!]_{1:j-1}\right) = \prod_{l \in L(M)} \Pr_{kt}(h_l), \quad (15)$$
where $h_l$ is the (non-contiguous) subsequence of $[\![x_{1:t}]\!]$ that ended up in leaf node $l$ of $M$. More precisely, $h_l$ consists, in order, of exactly those bits of $[\![x_{1:t}]\!]$ whose preceding context in the binary history sequence leads to the leaf node $l$.
The above deals with action-conditional prediction using a single PST. We now show how we can perform action-conditional prediction using a Bayesian mixture of PSTs in an efficient way. First, we need a prior distribution on models of PSTs.
Our prior, containing an Ockham-like bias favouring simple models, is derived from a natural prefix coding of the tree structure of a PST. The coding scheme works as follows: given a model of a PST of maximum depth $D$, a pre-order traversal of the tree is performed. Each time an internal node is encountered, we write down 1. Each time a leaf node is encountered, we write a 0 if the depth of the leaf node is less than $D$; otherwise we write nothing. For example, if $D = 3$, the code for the model shown in Figure 2 is 10100; if $D = 2$, the code for the same model is 101. The cost $\Gamma_D(M)$ of a model $M$ is the length of its code, which is given by the number of nodes in $M$ minus the number of leaf nodes in $M$ of depth $D$. One can show that
$$\sum_{M \in \mathcal{C}_D} 2^{-\Gamma_D(M)} = 1,$$
where $\mathcal{C}_D$ is the set of all models of prediction suffix trees with depth at most $D$; i.e. the prefix code is complete. We remark that the above is another way of describing the coding scheme in [WST95]. We use $2^{-\Gamma_D(M)}$, which penalises large trees, to determine the prior weight of each PST model.
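The coding scheme can be sketched directly as a pre-order traversal. The tree below is our reading of the Figure 2 model: a root whose 1-child is a leaf and whose 0-child is internal with two leaves, since that shape reproduces the codes 10100 ($D = 3$) and 101 ($D = 2$) quoted above; the class name and the convention of visiting the 1-child before the 0-child are our assumptions.

```python
class Model:
    """PST model node: internal nodes have a 1-child and a 0-child."""
    def __init__(self, one=None, zero=None):
        self.children = {} if one is None else {1: one, 0: zero}

def model_code(node, D, depth=0):
    """Prefix code via pre-order traversal: '1' for an internal node,
    '0' for a leaf of depth less than D, nothing for a leaf at depth D."""
    if not node.children:
        return "" if depth == D else "0"
    return ("1" + model_code(node.children[1], D, depth + 1)
                + model_code(node.children[0], D, depth + 1))

def cost(node, D):
    """Cost of a model: its code length (node count minus depth-D leaf count)."""
    return len(model_code(node, D))
```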
The following is a key ingredient of the (action-conditional) CTW algorithm.
A context tree of depth $D$ is a perfect binary tree of depth $D$ where the left and right edges are labelled 1 and 0 respectively and attached to each node (both internal and leaf) is a probability on $\{0,1\}^*$.
The node probabilities in a context tree are estimated from data using KT estimators as follows. We update a context tree with the history sequence similarly to the way we use a PST, except that
the probabilities at each node on the path from the root to a leaf traversed by an observed bit are updated; and
The process can be best understood with an example. Figure 3 (left) shows a context tree of depth two. For expositional reasons, we show binary sequences at the nodes; the node probabilities are computed from these. Initially, the binary sequence at each node is empty. Suppose 1001 is the history sequence. Setting aside the first two bits 10 as an initial context, the tree in the middle of Figure 3 shows what we have after processing the third bit 0. The tree on the right is the tree we have after processing the fourth bit 1. In practice, we of course only have to store the counts of zeros and ones instead of complete subsequences at each node because, as we saw earlier in (12), the KT block probability depends only on these counts. Since the node probabilities are completely determined by the input sequence, we shall henceforth speak unambiguously about the context tree after seeing a sequence.
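The update just described can be sketched with a minimal context tree that stores only zero/one counts at each node (a hypothetical representation of ours, exploiting the fact that the KT block probability depends only on the counts):

```python
class CTNode:
    """Node of a perfect binary context tree storing KT counts."""
    def __init__(self, depth):
        self.counts = [0, 0]  # [zeros seen, ones seen]
        self.children = None if depth == 0 else (CTNode(depth - 1),
                                                 CTNode(depth - 1))

def update(root, context, bit):
    """Update the counts at every node on the path selected by the
    context, given most-recent-bit first."""
    node = root
    node.counts[bit] += 1
    for c in context:
        node = node.children[c]
        node.counts[bit] += 1

# The example: history 1001, depth-two context tree, first two bits 10
# set aside as the initial context.
root = CTNode(2)
update(root, [0, 1], 0)  # third bit 0; context (most recent first) is 0, 1
update(root, [0, 0], 1)  # fourth bit 1; context is 0, 0
```

After these two updates the root has seen both bits, while deeper nodes have seen only the bits whose contexts route through them.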
The context tree of depth $D$ after seeing a sequence $h$ has the following important properties:
the model of every PST of depth at most $D$ can be obtained from the context tree by pruning off appropriate subtrees and treating them as leaf nodes;
the block probability of $h$ as computed by each PST of depth at most $D$ can be obtained from the node probabilities of the context tree via Equation (15).
These properties, together with an application of the distributive law, form the basis of the highly efficient (action-conditional) CTW algorithm. We now formalise these insights.
We first need to define the weighted probabilities at each node of the context tree. Suppose $a_{1:t}$ is the action sequence and $x_{1:t}$ is the observation-reward sequence. Let $h_n$ be the (non-contiguous) subsequence of the binary history that ended up in node $n$ of the context tree. The weighted probability $P^n_w$ of each node $n$ in the context tree is defined inductively as follows:
$$P^n_w := \begin{cases} \Pr_{kt}(h_n) & \text{if $n$ is a leaf node;} \\ \tfrac{1}{2}\Pr_{kt}(h_n) + \tfrac{1}{2}\, P^{n_l}_w \, P^{n_r}_w & \text{otherwise,} \end{cases}$$
where $n_l$ and $n_r$ are the left and right children of $n$ respectively. Note that the set of sequences $\{h_n\}$ has a dependence on the action sequence $a_{1:t}$.
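The weighted-probability recursion can be sketched directly. Here a node is a `(counts, left, right)` triple (our ad-hoc representation), with `left`/`right` equal to `None` at a leaf; `kt_block` computes the KT block probability from the counts alone.

```python
def kt_block(counts):
    """KT block probability of any sequence with the given counts."""
    zeros, ones = counts
    p, a, b = 1.0, 0, 0
    for _ in range(zeros + ones):
        if a < zeros:
            p *= (a + 0.5) / (a + b + 1.0); a += 1
        else:
            p *= (b + 0.5) / (a + b + 1.0); b += 1
    return p

def weighted(node):
    """CTW recursion: Pr_kt at a leaf; at an internal node, the average
    of Pr_kt and the product of the children's weighted probabilities."""
    counts, left, right = node
    pkt = kt_block(counts)
    if left is None:  # leaf node
        return pkt
    return 0.5 * pkt + 0.5 * weighted(left) * weighted(right)
```

For instance, an internal node that has seen a single 0 but whose children are still empty gets weighted probability 1/2 · 1/2 + 1/2 · 1 · 1 = 3/4.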
If $n$ is a node at depth $d$ in a tree, the path description of $n$ is the string of edge labels on the path from the root to $n$ in the tree.
Let $D$ be the depth of the context tree. For each node $n$ in the context tree at depth $d$, we have, for all $a_{1:t}$ and for all $x_{1:t}$,
$$P^n_w = \sum_{M \in C_{D-d}} 2^{-\Gamma_{D-d}(M)} \prod_{n' \in L(M)} \Pr_{kt}(h_{nn'}),$$
where $h_{nn'}$ is the (non-contiguous) subsequence of the binary history that ended up in the node with path description $nn'$ in the context tree.
The proof proceeds by induction on $D - d$. The statement is clearly true for the leaf nodes at depth $D$. Assume now the statement is true for all nodes at depth $d + 1$, where $0 \le d < D$. Consider a node $n$ at depth $d$. Letting $d' := D - d$, we have
where $\widehat{M_1 M_2}$ denotes the tree in $C_{d'}$ whose left and right subtrees are $M_1$ and $M_2$ respectively. ∎
A corollary of Lemma 1 is that at the root node $\epsilon$ of the context tree we have
$$P^\epsilon_w = \sum_{M \in C_D} 2^{-\Gamma_D(M)} \prod_{n \in L(M)} \Pr_{kt}(h_n) \tag{17}$$
$$\phantom{P^\epsilon_w} = \sum_{M \in C_D} 2^{-\Gamma_D(M)} \prod_{n \in L(M)} \Pr_{kt}(h_n) \tag{18}$$
$$\phantom{P^\epsilon_w} = \sum_{M \in C_D} 2^{-\Gamma_D(M)} \Pr(x_{1:t} \mid M, a_{1:t}), \tag{19}$$
where the last step follows from Equation (15). Note carefully that $h_n$ in line (17) denotes the subsequence of the history that ended in the node with path description $n$ in the context tree, but $h_n$ in line (18) denotes the subsequence of the history that ended in the leaf node $n$ of $M$ if $M$ is used as the only model to process the history. Equation (19) shows that the quantity computed by the (action-conditional) CTW algorithm is exactly a Bayesian mixture of (action-conditional) PSTs.
The weighted probability $P^\epsilon_w$ is a block probability. To recover the conditional probability of $x_t$ given $ax_{<t}a_t$, we simply evaluate
$$\Pr(x_t \mid ax_{<t}a_t) = \frac{P^\epsilon_w(x_{1:t} \mid a_{1:t})}{P^\epsilon_w(x_{<t} \mid a_{<t})},$$
which follows directly from Equation (3). To sample from this conditional probability, we simply sample the individual bits of $x_t$ one by one. For brevity, we will sometimes use the notation $\Upsilon(x_{1:t} \mid a_{1:t})$ for the block probability $P^\epsilon_w(x_{1:t} \mid a_{1:t})$.
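Bit-by-bit sampling from the conditional can be sketched as follows, assuming only a `block_prob` function (an assumed interface, not the paper's API) that returns the weighted block probability of a bit sequence; each next-bit probability is a ratio of block probabilities, per the chain rule.

```python
import random

def sample_bits(block_prob, history, k, rng=random):
    """Sample k bits from the conditional distribution induced by a
    block probability: Pr(next bit = 1 | h) = block_prob(h+[1]) / block_prob(h)."""
    h = list(history)
    for _ in range(k):
        denom = block_prob(h)
        p1 = block_prob(h + [1]) / denom
        bit = 1 if rng.random() < p1 else 0
        h.append(bit)
    return h[len(history):]
```

For example, a block probability that assigns mass only to all-ones sequences deterministically yields ones.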
In summary, to do action-conditional prediction using a context tree, we set aside a sufficiently long initial portion of the binary history sequence, corresponding to the first few cycles, to initialise the context variable $h$, and then repeat the following steps as long as needed:
set $h := ha$, where $a$ is the binary encoding of the current selected action;
for $i := 1$ to $l$ do, where $l$ is the number of bits used to encode an observation-reward pair:
predict the next bit using the weighted probability $P^\epsilon_w$;
observe the next bit $y[i]$, update the context tree using $h$ and $y[i]$, calculate the new weighted probability $P^\epsilon_w$, and then set $h := hy[i]$.
Note that in practice the context tree need only be constructed incrementally as needed. The depth $D$ of the context tree can thus take on non-trivial values. The memory requirement of maintaining a context tree is discussed further in Section 7.
As explained in Section 3, the Revert operation is performed many times during search and it needs to be efficient. Saving and restoring a copy of the context tree is unsatisfactory. Luckily, the block probability estimated by CTW with a context depth of $D$ at time $t$ can be recovered from the block probability estimated at time $t+1$ in $O(D)$ operations in a rather straightforward way. Alternatively, a copy-on-write implementation can be used to modify the context tree during the simulation phase.
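One way such an $O(D)$ Revert might be realised, sketched with hypothetical names of our own: each update pushes an undo record for every node it touches, and reverting pops those records in reverse.

```python
class Node:
    """Context-tree node holding only KT counts, for illustration."""
    def __init__(self):
        self.counts = [0, 0]

def update_path(path, bit, undo_stack):
    """Update counts along a context path of D+1 nodes, recording each
    change so it can be undone in O(D)."""
    for node in path:
        node.counts[bit] += 1
        undo_stack.append((node, bit))

def revert_last(undo_stack, how_many):
    """Undo the most recent count updates (the Revert operation)."""
    for _ in range(how_many):
        node, bit = undo_stack.pop()
        node.counts[bit] -= 1
```

Cached node probabilities would be restored the same way; only the nodes on the path of the reverted symbol are touched.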
As foreshadowed in [Bun92, HS97], the CTW algorithm can be generalised to work with rich logical tree models [BD98, KW01, Llo03, Ng05, LN07] in place of prediction suffix trees. A full description of this extension, especially the part on predicate definition/enumeration and search, is beyond the scope of the paper and will be reported elsewhere. Here we outline the main ideas and point out how the extension can be used to incorporate useful background knowledge into our agent.
Let $\mathcal{P}$ be a set of predicates (boolean functions) on histories. A $\mathcal{P}$-model is a binary tree where each internal node is labelled with a predicate in $\mathcal{P}$ and the left and right outgoing edges at the node are labelled True and False respectively. A $\mathcal{P}$-tree is a pair $(M_\mathcal{P}, \Theta)$ where $M_\mathcal{P}$ is a $\mathcal{P}$-model and associated with each leaf node $l$ in $M_\mathcal{P}$ is a probability distribution over $\{0,1\}$ parameterised by $\theta_l \in \Theta$.
A $\mathcal{P}$-tree $(M_\mathcal{P}, \Theta)$ represents a function from histories to probability distributions on $\{0,1\}$ in the usual way. For each history $h$, the associated distribution is $\theta_{l_h}$, where $l_h$ is the leaf node reached by pushing $h$ down the model $M_\mathcal{P}$ according to whether it satisfies the predicates at the internal nodes.
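Pushing a history down a predicate tree can be sketched as follows, with an ad-hoc representation of our own: an internal node is a `(predicate, true_child, false_child)` triple and a leaf is a `('leaf', distribution)` pair, where the distribution is just Pr(next bit = 1).

```python
def evaluate(tree, history):
    """Return the leaf distribution reached by routing the history
    through the predicate tree."""
    node = tree
    while node[0] != 'leaf':
        predicate, true_child, false_child = node
        node = true_child if predicate(history) else false_child
    return node[1]

# Example: a depth-one tree branching on the predicate "last bit was 1";
# the leaf parameters 0.9 and 0.1 are made up for illustration.
tree = (lambda h: h[-1] == 1, ('leaf', 0.9), ('leaf', 0.1))
```

With a predicate class that tests suffixes of the history, this routing reduces to ordinary PST lookup.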
The use of general predicates on histories in $\mathcal{P}$-trees is a powerful way of extending the notion of a “context” in applications. To begin with, it is easy to see that, with a suitable choice of predicate class $\mathcal{P}$, both prediction suffix trees (Definition 6) and looping suffix trees [HJ06] can be represented as $\mathcal{P}$-trees. Much more background contextual information can be provided in this way to the agent to aid learning and action selection.
The following is a generalisation of Definition 7.
Let $\mathcal{P} = \{p_0, p_1, \ldots, p_{D-1}\}$ be a set of predicates on histories. A $\mathcal{P}$-context tree is a perfect binary tree of depth $D$ where
each internal node at depth $d$ is labelled by $p_d$ and the left and right outgoing edges at the node are labelled True and False respectively; and
attached to each node (both internal and leaf) is a probability on $\{0,1\}^*$.
We remark here that the (action-conditional) CTW algorithm can be generalised to work with $\mathcal{P}$-context trees in a natural way, and that a result analogous to Lemma 1, but with respect to a much richer class of models, can be established. A proof of a similar result is in [HS97]. Section 7 describes some experiments showing how predicate CTW can help in more difficult problems.
We now describe how the entire agent is constructed. At a high level, the combination is simple. The agent uses the action-conditional (predicate) CTW predictor $\Upsilon$ presented in Section 4 as a model of the (unknown) environment. At each time step, the agent first invokes the Predictive UCT routine to estimate the value of each candidate action. The agent then picks an action according to some standard exploration/exploitation strategy, such as $\epsilon$-Greedy or Softmax [SB98]. It then receives an observation-reward pair from the environment, which is used to update $\Upsilon$. Communication between the agent and the environment is done via binary codings of action, observation, and reward symbols. Figure 4 gives an overview of the agent/environment interaction loop.
It is worth noting that, in principle, the AIXI agent does not need to explore according to any heuristic policy. This is because the value of information, in terms of expected future reward, is implicitly captured in the expectimax operation defined in Equations (1) and (2). Theoretically, ignoring all computational concerns, it is sufficient to choose a large horizon and pick the action with the highest expected value at each time step.
Unfortunately, this property does not carry over to our approximate AIXI agent. In practice, the true environment will not be contained in our restricted model class, nor will we perform enough Predictive UCT simulations to converge to the optimal expectimax action, nor will the search horizon be as large as the agent’s maximal lifespan. Thus, the exploration/exploitation dilemma is a non-trivial problem for our agent. We found that the standard heuristic solutions to this problem, such as $\epsilon$-Greedy and Softmax exploration, were sufficient for obtaining good empirical results. We will revisit this issue in Section 7.
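The interaction loop of Figure 4, with the $\epsilon$-Greedy strategy discussed above, might be sketched as follows. Every callable here (`plan`, `env_step`, `update_model`) is an assumed interface standing in for the planner, environment, and CTW model, not code from the paper.

```python
import random

def agent_loop(env_step, plan, update_model, n_cycles, actions,
               epsilon=0.1, rng=random):
    """High-level agent/environment loop: plan with the learned model,
    act epsilon-greedily on the value estimates, then fold the percept
    back into the model."""
    total_reward = 0.0
    for _ in range(n_cycles):
        values = plan()                            # e.g. Predictive UCT estimates
        if rng.random() < epsilon:
            action = rng.choice(actions)           # explore
        else:
            action = max(actions, key=values.get)  # exploit
        observation, reward = env_step(action)
        update_model(action, observation, reward)
        total_reward += reward
    return total_reward
```

Setting `epsilon = 0` recovers pure greedy action selection with respect to the planner's value estimates.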
Some theoretical properties of our algorithm are now explored.
We first study the relationship between $\Upsilon$ and the universal predictor $\xi$ used by AIXI. Using $\Upsilon$ in place of $\xi$ in Equation (6), the optimal action for an agent at time $t$, having experienced $ax_{<t}$, is given by