1 Introduction
Reinforcement Learning (RL) [27]
aims to learn how to succeed in a task through trial and error. This active research area is well developed for environments that are Markov Decision Processes (MDPs); however, real-world environments are often partially observable and non-Markovian. The recently introduced Feature Markov Decision Process (ΦMDP) framework [14] attempts to reduce actual RL tasks to MDPs for the purpose of attacking the general RL problem, where the environment's model as well as the set of states are unknown. In [26], Sunehag and Hutter take a step further in the theoretical investigation of Feature Reinforcement Learning by proving consistency results. In this article, we develop an actual Feature Reinforcement Learning algorithm and empirically analyze its performance in a number of environments.

One of the most useful classes of maps $\phi$ that can be used to summarize histories as states of an MDP is the class of context trees. Our stochastic search procedure, the principal component of our ΦMDP algorithm GSA, works on a subset of all context trees, called Markov trees. Markov trees have previously been studied in [22], but under names like FSMX sources or FSM closed tree sources. The stochastic search procedure employed for our empirical investigation utilizes a parallel tempering methodology [7], [12] together with a specialized proposal distribution. In the experimental section, the performance of the ΦMDP algorithm, where stochastic search is conducted over the space of context-tree maps, is shown and compared with three other related context-tree-based methods.
Our ΦMDP algorithm is briefly summarized as follows. First, perform a certain number of random actions, then use this history to find a high-quality map $\phi$ by minimizing a cost function that evaluates the quality of each map. Quality here refers to the ability to predict rewards using the created states. We perform a search procedure to uncover high-quality maps, followed by learning on the MDP whose states are induced by the detected optimal map. The current history is then updated with the additional experiences obtained from interactions with the environment through Q-learning. After that, we may repeat the procedure, but without the random actions. The repetition refines the current "optimal" map, as longer histories provide more useful information for map evaluation. The ultimate policy of the algorithm is retrieved from the action values Q on the resulting MDP induced by the final optimal map.
Contributions. Our contributions are: extending the original ΦMDP cost function presented in [14] to allow for more discriminative learning and more efficient minimization (through stochastic search) of the cost; identifying the Markov action-observation context trees as an important class of feature maps for ΦMDP; proposing the GSA algorithm, in which several chosen learning and search procedures are logically combined; providing the first empirical analysis of the ΦMDP model; and designing a specialized proposal distribution for stochastic search over the space of Markov trees, which is of critical importance for finding the best possible ΦMDP agent.
Related Work. Our algorithm is a history-based method. This means that we utilize memory that can, in principle, be long, but in most of this article, as in the related works, is short-term. Given a history of observations, actions and rewards, we define states based on some map $\phi$. The main class of maps that we consider is based on context trees. The classical algorithm of this sort is U-Tree [21], which uses a local criterion based on a statistical test for splitting nodes in a context tree, whereas ΦMDP employs a global cost function. Because of this advantage, ΦMDP can potentially be used in conjunction with any optimization method to find the optimal model.
There has been a recent surge of interest in history-based methods with the introduction of the active-LZ algorithm [6], which generalizes the widely used Lempel-Ziv compression scheme to the reinforcement learning setting and assumes n-Markov models of environments; and MC-AIXI-CTW [28], which uses a Bayesian mixture of context trees and incorporates both the Context Tree Weighting algorithm [31] and UCT Monte Carlo planning [16]. These can all be viewed as attempts to resolve perceptual-aliasing problems with the help of short-term memory. This has turned out to be a more tractable approach than Baum-Welch methods for learning a Partially Observable Markov Decision Process (POMDP) [4], or than Predictive State Representations [24]. History-based methods attempt to directly learn the environment states, thereby avoiding the POMDP-learning problem [15], [20], which is extremely hard to solve. Model minimization [8] is a line of work that also seeks a minimal representation of the state space, but focuses on solving Markovian problems, while ΦMDP and the other aforementioned history-based methods target non-Markovian ones. It is also worth noting that there are various other attempts to find compact representations of MDP state spaces [18]; most of these, unlike our approach, address the planning problem, where the MDP model is given.

Paper Organization. The paper is organized as follows. Section 2 introduces preliminaries on Reinforcement Learning, Markov Decision Processes, Stochastic Search methods and Context Trees; these are the components from which the ΦMDP algorithm (GSA) is built. In Section 3 we assemble these components into our ΦMDP algorithm and describe our specialized search proposal distribution in detail. Section 4 presents experimental results on four domains. Finally, Section 5 summarizes the main results of this paper and briefly suggests possible research directions.
2 Preliminaries
2.1 Markov Decision Processes (MDP)
An environment is a process which at any discrete time $t$, given action $a_t$, produces an observation $o_t$ and a corresponding reward $r_t$. When the process is a Markov Decision Process [27], $o_t$ represents the environment state and is hence denoted by $s_t$ instead. Formally, a finite MDP is denoted by a quadruple $\langle S, A, T, R\rangle$ in which $S$ is a finite set of states; $A$ is a finite set of actions; $T = (T^a_{ss'})$ is a collection of transition probabilities of the next state $s_{t+1} = s'$ given the current state $s_t = s$ and action $a_t = a$; and $R: S \times A \times S \to \mathbb{R}$ is a reward function. The return at time step $t$ is the total discounted reward $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots$, where $\gamma$ is the geometric discount factor ($0 \le \gamma < 1$). Similarly, the action value in state $s$ following policy $\pi$ is defined as $Q^\pi(s,a) = E_\pi[R_t \mid s_t = s, a_t = a]$. For a known MDP, a useful way to find an estimate of the optimal action values $Q^*$
is to employ the Action-Value Iteration (AVI) algorithm, which is based on the optimal action-value Bellman equation [27] and iterates the update
$$Q(s,a) \leftarrow \sum_{s'} T^a_{ss'}\big[R(s,a,s') + \gamma \max_{a'} Q(s',a')\big].$$
If the MDP model is unknown, an effective estimation technique is provided by Q-learning, which incrementally updates estimates $Q_t$ through the equation
$$Q_{t+1}(s_t,a_t) = Q_t(s_t,a_t) + \alpha_t \delta_t,$$
where the feedback error is $\delta_t = r_{t+1} + \gamma \max_{a} Q_t(s_{t+1},a) - Q_t(s_t,a_t)$, and $\alpha_t$ is the learning rate at time $t$. Under the assumption of sufficient visits to all state-action pairs, Q-learning converges if and only if certain conditions on the learning rates are met [2], [27]. In practice, however, a small constant learning rate ($\alpha_t = \alpha$) is often adequate for obtaining a good estimate of $Q^*$. Q-learning is off-policy: it directly approximates $Q^*$ regardless of which actions are actually taken. This is particularly beneficial when handling the exploration-exploitation trade-off in RL.
It is well known that exploring the state-action space by always taking greedy actions with respect to the current estimate of $Q^*$ generally leads to suboptimal behavior. The simplest remedy for this inefficiency is the $\varepsilon$-greedy scheme: with probability $\varepsilon$ we take a random action, and with probability $1-\varepsilon$ the greedy action is selected. This method is simple, but has been shown to fail to properly resolve the exploration-exploitation trade-off. A more systematic strategy for exploring unseen scenarios, instead of just taking random actions, is to use optimistic initial values [27], [3]. To apply this idea to Q-learning, we simply initialize $Q$ with large values: if $r_{\max}$ denotes the maximal reward, initializations of at least $r_{\max}/(1-\gamma)$ are optimistic, since $Q^*(s,a) \le r_{\max}/(1-\gamma)$.
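The Q-learning update and the two exploration devices above can be condensed into a short sketch. This is a minimal tabular illustration under our own assumptions (list-of-lists value table, hypothetical function names), not the paper's implementation.

```python
import random

def q_learning_step(Q, s, a, r, s_next, alpha=0.01, gamma=0.999999):
    """One tabular Q-learning update: Q(s,a) += alpha * delta, where the
    feedback error is delta = r + gamma * max_a' Q(s', a') - Q(s, a)."""
    delta = r + gamma * max(Q[s_next]) - Q[s][a]
    Q[s][a] += alpha * delta
    return Q

def epsilon_greedy(Q, s, n_actions, epsilon=0.1, rng=random):
    """With probability epsilon take a random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[s][a])

def optimistic_Q(n_states, n_actions, r_max, gamma):
    """Initialize all action values to the upper bound r_max / (1 - gamma),
    which drives systematic exploration of unseen state-action pairs."""
    v0 = r_max / (1.0 - gamma)
    return [[v0] * n_actions for _ in range(n_states)]
```

Since the optimistic initial values upper-bound every true action value, greedy behavior with respect to them naturally visits undersampled pairs.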
2.2 Feature Reinforcement Learning
Problem description. An RL agent aims to find the optimal policy for taking actions, given the history of past observations, rewards and actions, in order to maximize the long-term reward signal. If the problem satisfies the MDP assumption then, as seen above, efficient solutions are available. We aim to attack the most challenging RL problem, where the environment's states and model are both unknown. In [13], this problem is named the Universal Artificial Intelligence (AI) problem, since almost all AI problems can be reduced to it.
ΦMDP framework. In [14], Hutter proposes a history-based method, a general statistical and information-theoretic framework called ΦMDP. This approach offers a critical preliminary reduction step to facilitate the agent's ultimate search for the optimal policy. The general ΦMDP framework endeavors to extract relevant features for reward prediction from the past history $h_t$ by using a feature map $\phi: \mathcal{H} \to S$, where $\mathcal{H}$ is the set of all finite histories. More specifically, we want the states $s_t = \phi(h_t)$ and the resulting tuple $\langle S, A, T, R\rangle$ to satisfy the Markov property of an MDP. As mentioned above, one of the most useful classes of $\phi$'s is the class of context trees, where each tree maps a history to the single state associated with one of its leaves. A more general class of $\phi$'s is Probabilistic-Deterministic Finite Automata (PDFA) [29], which map histories to MDP states such that the next state can be determined from the current state and the next observation. The primary purpose of ΦMDP is to find a map $\phi$ for which the rewards of the MDP induced by $\phi$ can be predicted well. This enables us to use MDP solvers, like AVI and Q-learning, on the induced MDP to find a good policy. The reduction quality of each $\phi$ is dictated by the reward predictability of the resulting induced MDP. A suitable cost function that measures the utility of $\phi$'s for this purpose is therefore essential, and the optimal $\phi$ is the one that minimizes this cost function.
Cost function. The cost used in this paper is an extended version of the original cost introduced in [14]. We define a cost that measures the reward predictability of each $\phi$, or more specifically of the MDP induced by that $\phi$. Accordingly, our cost includes the description length of the rewards; however, the rewards depend on the states as well, so the description length of the states must also be added. In other words, the cost comprises the coding of the rewards and of the resulting states, and is defined as
$$\mathrm{Cost}_\alpha(\phi \mid h_n) = \alpha\, \mathrm{CL}(s_{1:n} \mid a_{1:n}) + \mathrm{CL}(r_{1:n} \mid s_{1:n}, a_{1:n}),$$
where $s_{1:n} = s_1 s_2 \cdots s_n$ with $s_t = \phi(h_t)$, and $a_{1:n}$, $r_{1:n}$ are the corresponding action and reward sequences. For the coding we use the two-part code [30], [10]; hence the code length is $\mathrm{CL}(x) = \mathrm{CL}(x \mid \hat\theta) + \mathrm{CL}(\hat\theta)$, where $x$ denotes the data sampled from the model specified by parameters $\theta$. We employ optimal codes [5] for describing the data $x$, while the parameters are uniformly encoded to precision $1/\sqrt{\ell}$, where $\ell$ is the sequence length of $x$ [10]: $\mathrm{CL}(\hat\theta) = \frac{m}{2}\log \ell$, where $m$ is the number of parameters. The optimal $\phi$ is found via the optimization problem $\phi^{\mathrm{opt}} = \arg\min_\phi \mathrm{Cost}_\alpha(\phi \mid h_n)$.
Denote by $n^{s'}_{sa}$ the number of times the transition from state $s$ to $s'$ under action $a$ occurs in the history, and by $n^{r}_{sa}$ the analogous reward counts; let $n_{sa} = \sum_{s'} n^{s'}_{sa}$, let $|\cdot|$ denote the cardinality of a set, and let $H$ be the Shannon entropy of a random variable with distribution $p$: $H(p) = -\sum_i p_i \log p_i$. The state and reward cost functions can then be analytically computed as
$$\mathrm{CL}(s_{1:n} \mid a_{1:n}) = \sum_{s,a}\Big[n_{sa}\, H\Big(\frac{n^{\bullet}_{sa}}{n_{sa}}\Big) + \frac{|S|-1}{2}\log n_{sa}\Big],$$
and analogously for $\mathrm{CL}(r_{1:n} \mid s_{1:n}, a_{1:n})$ with the reward counts in place of the state counts. As we primarily want to find a $\phi$ with the best reward predictability, the factor $\alpha$ is introduced to stress the reward coding: very small values of $\alpha$ make the costs of high-quality $\phi$'s much lower. In other words, $\alpha$ amplifies the differences between high-quality $\phi$'s and bad ones, and this accelerates our stochastic search process described below.
We furthermore replace the factor $\frac{1}{2}$ in $\mathrm{CL}(\hat\theta)$ with $\frac{\beta}{2}$ to define $\mathrm{CL}_\beta$, for the purpose of being able to select the right model given limited data. The motivation for introducing $\beta$ is the following. For stationary environments the cost function is analytically of the form $an + b\log n + c$, where $a$ and $c$ are constants and $b$ is linear in the number of parameters. The optimal $\phi$ should be the one with the smallest value of $a$; the curse, however, is that in practice $b$ is often big, so in order to obtain the optimal $\phi$ with limited data, a small value of $\beta$ will help. We assert that with a very large number of samples $n$, the terms $b\log n$ and $c$ can be ignored ($\alpha = \beta = 1$ recovers the cost in [14]). The choice of small $\alpha$ and $\beta$ helps us overcome the model penalty more quickly and find the optimal map. This strategy is quite common practice in statistics, and also in the Minimum Description Length (MDL) community [10]; for instance, AIC [1] effectively uses a very small $\beta$.
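The two-part code length underlying the cost can be sketched in a few lines. This is a minimal illustration under our own assumptions (the counts summarize how often each symbol, e.g. each successor state of one state-action pair, occurred; `beta` scales the parameter penalty as described above); it is not the authors' code.

```python
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) of the empirical distribution
    defined by a list of nonnegative symbol counts."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def code_length(counts, beta=1.0):
    """Two-part code length of a sequence summarized by symbol counts:
    the data part n*H (optimal code under the estimated parameters)
    plus a beta-scaled parameter penalty (m/2)*log2(n), where m is the
    number of free parameters.  beta < 1 softens the model penalty."""
    n = sum(counts)
    m = len(counts) - 1  # free parameters of a categorical distribution
    return n * entropy(counts) + beta * (m / 2) * log2(n)
```

Summing such terms over all state-action pairs, once for successor states (weighted by a small $\alpha$) and once for rewards, gives a cost of the kind described above.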
2.3 Context Trees
The class of maps $\phi$ that we base our algorithm on is the class of context trees.
Observation Context Tree (OCT). OCTs are a class of maps used to extract relevant information from histories that include only past observations, not actions and rewards. OCTs are presented mainly to facilitate the definition of the Action-Observation Context Trees below.
Definition. Given an $m$-ary alphabet $\Sigma$, an OCT constructed from the alphabet is defined as an $m$-ary tree in which the edges coming from any internal node are labeled by the letters of $\Sigma$ from left to right in the given order.
Given an OCT $\mathcal{T}$ constructed from the alphabet $\Sigma$, the state suffix set, or briefly state set, $S$ induced from $\mathcal{T}$ is defined as the set of all strings of edge labels formed along a path from a leaf node to the root of $\mathcal{T}$. $\mathcal{T}$ is called a Markov tree if it has the so-called Markov property for its associated state set, that is, for every $s \in S$ and $o \in \Sigma$, the string $so$ has a unique suffix $s' \in S$. The state set of a Markov OCT is called a Markov state set. OCTs that do not have the Markov property are identified as non-Markov OCTs; non-Markov state sets are defined similarly.
Example. Figures 1(a)(A) and 1(a)(B) respectively represent two binary OCTs of depths two and three; likewise, Figures 1(b)(A) and 1(b)(B) illustrate two ternary OCTs of depths two and three.
As can be seen from Figure 1, trees 1(a)(A) and 1(b)(A) are Markov, while trees 1(a)(B) and 1(b)(B) are non-Markov. For the state set of tree 1(a)(A), given any state $s$ and any further observation $o$, there exists a unique state $s'$ in the set which is a suffix of $so$; hence tree 1(a)(A) is Markov. Table 1(a) represents this deterministic relation between $(s, o)$ and $s'$.


However, there is no such relation for tree 1(a)(B) and its state set: for some state $s$ and observation $o$, it is ambiguous whether the next state is 101 or 001. Table 1(b) clarifies the non-Markov property of tree 1(a)(B).
It is also worth illustrating how an OCT can be used as a map $\phi$: given a history consisting only of past observations, $\phi$ maps it to the unique state in the tree's state set that is a suffix of the history, as with the OCTs in Figure 1.
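The suffix-set view of context trees can be made concrete with a small sketch. The state sets below are our own illustrative examples, not the ones from Figure 1; the function names are likewise hypothetical.

```python
def history_to_state(state_set, history):
    """Map a history (string of observation symbols) to the unique
    element of state_set that is a suffix of it; a context-tree state
    set guarantees exactly one such suffix for long enough histories."""
    matches = [s for s in state_set if history.endswith(s)]
    assert len(matches) == 1, "state set must induce a unique suffix"
    return matches[0]

def is_markov(state_set, alphabet):
    """A state set is Markov iff for every state s and next symbol o,
    the string s+o again has exactly one suffix in the state set."""
    return all(
        sum((s + o).endswith(t) for t in state_set) == 1
        for s in state_set for o in alphabet
    )
```

For example, the binary state set `{"0", "01", "011", "111"}` is Markov, while `{"0", "001", "101", "11"}` is not: after state `"0"` and observation `"1"`, the next state could be either `"001"` or `"101"`, depending on older history.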
Action-Observation Context Tree (AOCT). AOCTs extend the OCTs presented above to the generic RL problem, where relevant histories contain both actions and observations.
Definition. Given two alphabets, an observation set $O$ and an action set $A$, an AOCT constructed from the two alphabets is defined as a tree in which any internal node at an even depth has branching factor $|O|$, with its edges labeled by the letters of $O$ from left to right in the given order; similarly, any internal node at an odd depth has branching factor $|A|$, with its edges labeled by the letters of $A$, also from left to right in the specified order.

The definitions of Markov and non-Markov AOCTs are similar to those of OCTs, except that the next observation is now replaced by the next action and observation. Formally, suppose $\mathcal{T}$ is an AOCT constructed from the above two alphabets and $S$ is the state suffix set of the tree; then $\mathcal{T}$ is defined as a Markov AOCT if it has the Markov property, that is, for every $s \in S$, $a \in A$ and $o \in O$, there exists a unique $s' \in S$ such that $s'$ is a suffix of $sao$. AOCTs that do not have the Markov property are categorized as non-Markov AOCTs.
The total number of AOCTs up to a given depth can be computed recursively: writing $N_d$ for the number of AOCTs of depth at most $d$ observation-action levels, $N_0 = 1$ and $N_d = 1 + \big(1 + N_{d-1}^{|A|}\big)^{|O|}$, since a node either remains a leaf or splits into its full set of children. As can easily be seen from the recursive formula, the total number of AOCTs is doubly exponential in the tree depth.
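The recursion can be realized directly, under our assumption that the root branches on observations and that depth is counted in observation-action pairs:

```python
def count_aocts(d, n_obs, n_act):
    """Number of AOCTs of depth at most d observation-action levels.
    A node either stays a leaf or splits into its full set of children,
    giving the doubly exponential recursion
        N_d = 1 + (1 + N_{d-1}^{|A|})^{|O|},   N_0 = 1."""
    if d == 0:
        return 1
    # Each of the |O| observation subtrees is either a leaf or splits
    # into |A| action subtrees of depth d-1.
    action_subtrees = 1 + count_aocts(d - 1, n_obs, n_act) ** n_act
    return 1 + action_subtrees ** n_obs
```

With binary observations and actions the counts already explode: 1, 5, 677, ... for depths 0, 1, 2, which illustrates why restricting to Markov trees and searching stochastically is necessary.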
An important point to note here is that in our experiments presented in Section 4, the space is limited to Markov AOCTs since, as explained above, the state suffix set induced from a non-Markov AOCT does not represent an MDP state set; to put it more clearly, in non-Markov AOCTs we cannot derive the next state from the current state and the next action and observation. The Markov constraint on AOCTs significantly reduces the search space for our stochastic search algorithm. In the U-Tree algorithm [21], no distinction between Markov and non-Markov trees is made; the algorithm attempts to search for the optimal tree over the whole space of AOCTs.
2.4 Stochastic search
While we have defined the cost criterion for evaluating maps, the problem of finding the optimal map remains. When the space is huge, e.g. the context-tree map space, where the number of $\phi$'s grows doubly exponentially with the tree depth, exhaustive search cannot deal with domains where the optimal $\phi$ is nontrivial. Stochastic search is a powerful tool for solving optimization problems where the landscape of the objective function is complex and it appears impossible to analytically or numerically find the exact, or even an approximate, global optimum. A typical stochastic search algorithm starts with a predefined or arbitrary configuration (initial argument of the objective function, or state of a system), and from this generates a sequence of configurations based on some predefined probabilistic criterion; the configuration with the best objective value is retained. A wide range of stochastic search methods have been proposed in the literature [23]; the most popular among these are simulated-annealing-type algorithms [19], [25]. An essential element of a simulated-annealing (SA) algorithm is a Markov Chain Monte Carlo (MCMC) sampling scheme in which a proposed new configuration $x'$ is drawn from a proposal distribution $q(x' \mid x)$, and we then change from configuration $x$ to $x'$ with probability $\min\big\{1, \frac{\pi(x')\,q(x \mid x')}{\pi(x)\,q(x' \mid x)}\big\}$, where $\pi$ is a target distribution. In an SA algorithm where the traditional Metropolis-Hastings sampling scheme is utilized, $\pi(x)$ is proportional to $e^{-f(x)/T}$, where $f$ is the objective function that we want to minimize and $T$ is some positive constant temperature. The ratio $\frac{q(x \mid x')}{q(x' \mid x)}$ is called the correction factor; it is there to compensate for bias in $q$.

Traditional SA uses an MCMC scheme with some temperature-decreasing strategy. Although shown to find the global optimum asymptotically [9], it generally works badly in practice, as we do not know which temperature-cooling scheme is appropriate for the problem under consideration. Fortunately, for the ΦMDP cost function we know the typical cost difference $\delta$ between two $\phi$'s, so the range of appropriate temperatures can be significantly narrowed. The search process may be further improved by running a number of SA procedures at various temperatures. Parallel Tempering (PT) [7], [12], an interesting variant of traditional SA, significantly improves the stochastic search process by smartly offering a swapping step, letting the search procedure use small temperatures for exploitation and big ones for exploration.
Parallel tempering. PT performs stochastic search over the product space $X^n$, where $X$ is the objective function's domain and $n$ is the parallel factor. Fixed temperatures $T_1 < T_2 < \cdots < T_n$ are chosen for the $n$ copies of the space. The temperatures are selected based on the formula $\frac{1}{T_{k+1}} = \frac{1}{T_k} - \frac{\ln(1/p_a)}{\delta}$, where $\delta$ is the "typical" difference between function values of two successive configurations, and $p_a$ is the lower bound for the swapping acceptance rate. The main steps of each PT loop are as follows:

1. Let $(x_1, \dots, x_n)$ be the current sample; draw $u \sim$ Uniform[0,1].

2. If $u$ falls below a fixed threshold, update every $x_k$ to $x'_k$ via some MCMC scheme like Metropolis-Hastings (parallel step).

3. Otherwise, randomly choose a neighboring pair, say $x_k$ and $x_{k+1}$, and accept the swap of $x_k$ and $x_{k+1}$ with probability $\min\Big\{1, \exp\Big[\Big(\frac{1}{T_k} - \frac{1}{T_{k+1}}\Big)\big(f(x_k) - f(x_{k+1})\big)\Big]\Big\}$ (swapping step).
The full details of PT are given in Algorithm 1.
If its swapping phase is excluded, PT is simply a combination of a fixed number of Metropolis-Hastings procedures. The central point that makes PT powerful is the swapping step, in which adjacent temperatures exchange their sampling regions. This means that a good configuration is allowed to use a cooler temperature and exploit what it has found, while a worse configuration is given a higher temperature, which results in more exploration.
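The PT loop described above can be sketched as follows. This is a generic minimal version with hypothetical helper signatures (`cost`, `propose`), not Algorithm 1 itself; `propose` is assumed to return a candidate together with its correction factor.

```python
import math
import random

def parallel_tempering(cost, propose, inits, temps, iters, p_swap=0.1, rng=random):
    """Minimize `cost` with one Metropolis-Hastings chain per temperature,
    occasionally swapping adjacent chains (hot chains explore, cool exploit)."""
    xs = list(inits)
    best = min(xs, key=cost)
    for _ in range(iters):
        if rng.random() < p_swap:
            # Swapping step: try to exchange a random adjacent pair.
            k = rng.randrange(len(xs) - 1)
            accept = math.exp(min(0.0, (1 / temps[k] - 1 / temps[k + 1])
                                       * (cost(xs[k]) - cost(xs[k + 1]))))
            if rng.random() < accept:
                xs[k], xs[k + 1] = xs[k + 1], xs[k]
        else:
            # Parallel step: MH update in every chain at its own temperature.
            for k in range(len(xs)):
                y, correction = propose(xs[k])
                a = math.exp((cost(xs[k]) - cost(y)) / temps[k]) * correction
                if rng.random() < min(1.0, a):
                    xs[k] = y
                if cost(xs[k]) < cost(best):
                    best = xs[k]
    return best
```

For symmetric proposals the correction factor is 1; for the tree proposal of Section 3.1 it must be computed from the split and merge permit sets.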
3 The ΦMDP Algorithm
We now describe how the generic ΦMDP algorithm works; the general algorithm is shown below (Algorithm 2). It first takes a number of random actions (5000 in all our experiments). It then defines the cost function based on this history, and stochastic search is used to find a map $\phi$ with low cost. Based on the optimal $\phi$, the history is transformed into a sequence of states, actions and rewards. We use optimistic frequency estimates from this history to estimate probability parameters for state transitions and rewards. More precisely, we use $\frac{r_1 + \cdots + r_k + r_{\max}}{k+1}$ instead of the plain average to estimate the expected reward, where $r_1, \dots, r_k$ are the rewards that have been observed for a certain state-action pair and $r_{\max}$ is the highest possible reward. These statistics are used to estimate Q-values using AVI. After this, the agent starts to interact with the environment again, using Q-learning initialized with the values that resulted from the performed AVI. The switch from AVI to Q-learning is natural, as Q-learning needs only one cheap update per time step, while AVI requires updating the whole environment model and running a number of value iterations. The first set of random actions might not be sufficient to characterize what the best maps look like, so it can be beneficial to add the new history gathered by the Q-learning interactions with the environment to the old history, and then repeat the process, but without the initial random sampling.
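The optimistic reward estimate and the AVI step can be sketched as below. The list-of-lists MDP representation and function names are our own assumptions, not the paper's code.

```python
def optimistic_mean_reward(rewards, r_max):
    """Optimistic estimate for one state-action pair: average the
    observed rewards together with one phantom r_max observation."""
    return (sum(rewards) + r_max) / (len(rewards) + 1)

def value_iteration(T, R, n_states, n_actions, gamma, sweeps=1000):
    """Action-Value Iteration: repeatedly apply the Bellman optimality
    update Q(s,a) = R(s,a) + gamma * sum_s' T(s,a,s') * max_a' Q(s',a'),
    where T[s][a][s'] are estimated transition probabilities and
    R[s][a] the (optimistic) expected rewards."""
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(sweeps):
        Q = [[R[s][a] + gamma * sum(T[s][a][s2] * max(Q[s2])
                                    for s2 in range(n_states))
              for a in range(n_actions)]
             for s in range(n_states)]
    return Q
```

The phantom $r_{\max}$ observation biases rarely visited pairs upward, so the subsequent Q-learning phase starts from optimistic values and keeps exploring them.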
In the four experiments in Section 4, PT is employed to search over the space of Markov AOCTs.
3.1 Proposal Distribution for Stochastic Search over the Markov-AOCT Space
The principal component of the above high-level algorithm GSA is the stochastic search procedure, for which several algorithms have been presented in Section 2.4. In these algorithms, an essential technical detail is the proposal distribution $q$. It is natural to generate the next tree (the next proposal, or configuration) from the current tree by splitting or merging nodes. It is possible to express the exact form of our proposal distribution and, based on this, to explain how the next tree is proposed from the current one; however, the analytical form of the distribution is cumbersome to specify, so for better exposition we opt to describe the exact behavior of the tree proposal instead.
The stochastic search procedure starts with a Markov AOCT in which all of the tree nodes are mergeable and splittable. In the course of the search, however, a tree node might become unmergeable, but not the other way round; and a splittable node might turn unsplittable and vice versa. These transitions occur as follows. A mergeable tree node of the current tree becomes unmergeable if the current tree was proposed from the previous tree by splitting that node and the cost of the current tree is smaller than that of the previous tree. A splittable leaf node of the current tree becomes unsplittable if the state associated with that node is not present in the current history; an unsplittable leaf node might revert to splittable when the state associated with it appears in the future updated history. The constraint on merging keeps short-term memory that has proved good for predicting rewards, while the constraint on splitting simply follows Occam's razor.
Merge and split permits. Given some current tree at a particular point of the stochastic search process, when the generation of the next tree proposal is considered, most of the tree nodes, though labeled splittable and/or mergeable, might lack a split permit, a merge permit, or both. A node has a split permit if it is a leaf node with a splittable label. When a leaf node is split, we simply add all possible children of this node and label the edges according to the definition of AOCTs. As mentioned above, the newly added leaf nodes might be labeled unmergeable if the cost of the new tree is smaller than that of the old one; these nodes might also be labeled unsplittable if the states associated with them are not present in the current history. A node has a merge permit if it is labeled mergeable and all of its children are leaf nodes. When a tree node is merged, all the edges and nodes associated with its children are removed.
Markov-merge and Markov-split permits. Since our search space is the class of Markov AOCTs, whenever a split or merge occurs, extra adjustments might be needed to make the new tree Markov. After a split, there might be nodes whose presence makes the tree violate the Markov property and which therefore need to be split. After we split all of those, we have to check again to see whether any other nodes now need to be split; this goes on until we have a Markov AOCT again. The same applies to merging.
When a node is Markov-split, it and all of the leaf nodes that consequently need to be split (including recursive splits) in order to make the tree Markov, are split. A tree node is said to have a Markov-split permit if it, and all the other nodes that would be split in a Markov-split of that node, have split permits. This notion is best illustrated with an example. First we define Markov and non-Markov states of an AOCT: a state of an AOCT is Markov if, given any next action-observation pair, the next state is determined; otherwise it is labeled non-Markov. Now in Figure 2(A), suppose the current Markov AOCT is the tree without dashed edges. After splitting the leaf node marked by * (the node associated with state 00101), the state 001 becomes non-Markov, so its node needs to be split. However, after splitting this node, state 0 becomes non-Markov, hence it needs splitting as well. In short, to split the node marked by *, the two nodes associated with states 001 and 0 have to be split as well, so as to ensure that the resulting tree is Markov. Similarly, a tree node has a Markov-merge permit if it, and all of the tree nodes that minimally and recursively need to be merged after the original node is merged in order to make the tree Markov, have merge permits. For example, in Figure 2(B), suppose the current tree is the tree including both solid and dashed edges; then the node marked by * has a Markov-merge permit if it, and the indicated nodes that need to be merged along with it, have merge permits. When a node with a Markov-merge permit is Markov-merged, it and its Markov-merge-associated nodes are merged.
Our procedure to generate the next tree from the current tree $T$ (i.e., to draw a sample from $q(\cdot \mid T)$) in the space of Markov AOCTs consists of the following main steps:

1. From the given tree, identify two sets: $\mathcal{S}$, containing the nodes with Markov-split permits, and $\mathcal{M}$, containing the nodes with Markov-merge permits.

2. Suppose that at least one of $\mathcal{S}$ and $\mathcal{M}$ is nonempty; otherwise the algorithm (GSA) must stop. If exactly one of them is empty, select a node uniformly at random from the other set; otherwise select $\mathcal{S}$ or $\mathcal{M}$ with probability 1/2 each, and then choose a tree node uniformly at random from the selected set.

3. Markov-split the chosen node if it belongs to $\mathcal{S}$; otherwise Markov-merge it.
Once we have drawn the new tree $T'$, the Metropolis-Hastings correction factor $\frac{q(T \mid T')}{q(T' \mid T)}$ can be straightforwardly calculated. For a split move it equals $\frac{c_{T'}/|\mathcal{M}'|}{c_{T}/|\mathcal{S}|}$, where $c_T = 1/2$ if both permit sets of $T$ are nonempty and $c_T = 1$ otherwise (the merge case is analogous); here $\mathcal{S}$ and $\mathcal{M}$ denote the sets of nodes with Markov-split and Markov-merge permits of $T$, and $\mathcal{S}'$ and $\mathcal{M}'$ the corresponding sets of $T'$.
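The proposal probabilities and the resulting correction factor can be sketched numerically. The helper names are hypothetical; the 1/2-versus-1 move-type probability follows the steps above, and the reverse of a split on the proposed tree is a merge (and vice versa).

```python
def proposal_prob(n_split, n_merge, move):
    """Probability of proposing one particular Markov-split ('split') or
    Markov-merge ('merge') from a tree with n_split split-permitted and
    n_merge merge-permitted nodes: pick the move type (probability 1/2,
    or 1 if the other set is empty), then a node uniformly from the set."""
    if move == "split":
        if n_split == 0:
            return 0.0
        return (1.0 if n_merge == 0 else 0.5) / n_split
    if n_merge == 0:
        return 0.0
    return (1.0 if n_split == 0 else 0.5) / n_merge

def correction_factor(n_split, n_merge, n_split_new, n_merge_new, move):
    """Metropolis-Hastings correction q(T|T')/q(T'|T): the probability of
    the reverse move on the proposed tree T', divided by the probability
    of the forward move on the current tree T."""
    reverse = "merge" if move == "split" else "split"
    return (proposal_prob(n_split_new, n_merge_new, reverse)
            / proposal_prob(n_split, n_merge, move))
```

In a real implementation the permit-set sizes would be recomputed from the tree after each Markov-split or Markov-merge, since those compound operations can change both sets.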
Sharing. If the stochastic search algorithm utilized is PT, we apply another trick to effectively accelerate the search process. Whenever a node is labeled unmergeable, that is, splitting this node decreases the cost function (in other words, a good additional piece of relevant short-term memory for predicting rewards has been found), the states associated with the new nodes created by the split are replicated in the trees at the other temperatures.
4 Experiments
4.1 Experimental Setup
Parameter                          Component        Value
α (state-coding weight)            Cost function    0.1
β (model-penalty weight)           Cost function    0.1
Initial random actions             GSA              5000
Repetitions                        GSA              1
Iterations                         PT               100
Parallel factor n                  PT               10
Typical cost difference δ          PT
Swap-acceptance lower bound p_a    PT               0.7
Discount factor γ                  AVI, Q-Learning  0.999999
Learning rate                      Q-Learning       0.01
In this section we present our empirical studies of the ΦMDP algorithm GSA described in Section 3. In all of our experiments, stochastic search (PT) is applied in the space of Markov AOCTs.
Across the variety of tested domains, our algorithm produces consistent results using the same set of parameters. These parameters are shown in Table 2, and are not fine-tuned.
The results of ΦMDP and the three competitors in the four environments listed below are shown in Figures 3, 4, 7, 8 and the relay-maze plot. In each of the plots, various time points are chosen to assess and compare the quality of the policies learned by the four approaches. In order to evaluate how good a learned policy is, at each point the learning process of each agent, and the exploration of the three competitors, are temporarily switched off. The statistic selected to compare the quality of learning is the reward averaged over 5000 actions using the current policy. For stability, the statistic is further averaged over 10 runs.
As shown in more detail below, ΦMDP is superior to U-Tree and active-LZ, and is comparable to MC-AIXI-CTW in short-term-memory domains. The overall conclusions are clear, and we therefore omit error bars.
4.2 Environments and results
We describe each environment and the resulting performance, as well as the tree that ΦMDP found in the cheese-maze domain.
4×4 Grid.
The domain is a 4×4 grid world. At each time step, the agent can move one cell left, right, up or down within the grid. The observations are uninformative. When the agent enters the bottom-right corner of the grid, it gets a reward of 1 and is automatically and randomly sent back to one of the remaining 15 cells. Entering any cell other than the bottom-right one gives the agent a zero reward. To achieve the maximal total reward, the agent must be able to remember a series of smart actions without any clue about its relative position in the grid.
The context tree found contains 34 states. Some series of actions that take the agent towards the bottom-right corner of the grid are present in the context tree. As shown in the 4×4-grid plot in Figure 3, after 5000 experiences gathered from the random policy, ΦMDP finds the optimal policy, as do MC-AIXI-CTW and U-Tree. Active-LZ, however, does not converge to an optimal policy even after 50,000 learning cycles.
Tiger. The tiger domain is described as follows. There are two doors, left and right; an amount of gold and a tiger are placed behind them in random order. The agent has three possible actions: listen to predict the position of the tiger, open the left door, and open the right door. If the agent listens, it has to pay some money (reward of −1), and the probability that it hears correctly is 0.85. If the agent opens either door and finds the gold, the obtained reward is 10; otherwise it faces the tiger and receives a reward of −100. After a door is opened, the episode ends, and in the next episode the tiger again sits randomly behind either the left or the right door.
Our parallel tempering procedure found a context tree consisting of 39 states, including some important states in which the history is such that the agent has listened a few times before opening a door. It can be seen from the tiger plot in Figure 4 that the policy ΦMDP found after 5,000 learning experiences yields positive reward on average, and from time point 10,000 on it achieves rewards as high as MC-AIXI-CTW's. U-Tree appears to learn more slowly, but eventually manages to obtain positive average rewards after 50,000 cycles, like ΦMDP and MC-AIXI-CTW. Active-LZ performs far worse. The optimal policy that ΦMDP, MC-AIXI-CTW and U-Tree ultimately found is the following: first listen twice; if the two listening outcomes are consistent, open the door predicted to hide the gold; otherwise take one more listening action and open the appropriate door based on the majority.
Cheese Maze.
This domain, as shown in Figure 5, consists of an eleven-cell maze containing a piece of cheese. The agent is a mouse that attempts to find the cheese. The agent’s starting position for each episode is one of the eleven cells, chosen uniformly at random. The actions available to the agent are: move one cell left (0), right (1), up (2) or down (3); however, if the agent runs into a wall, its position in the maze remains unchanged. At each cell the agent can observe which of the directions left, right, up and down are blocked by a wall. If the wall-blocking status of each direction is represented by 1 (blocked) or 0 (free), then an observation is the four-digit binary number whose digits, from left to right, are the wall-blocking statuses of the up, left, down and right directions. For example, 0101 = 5 and 0111 = 7, as described in Figure 5. The agent gets a reward of −1 when moving into a free cell without the cheese; hitting a wall gives it a reward of −10; and a reward of 10 is given when it finds the cheese. As can be seen, some observations alone are insufficient for the mouse to locate itself unambiguously in the maze. Hence, the mouse must learn to resolve these observation ambiguities to be able to find the optimal policy.
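The observation encoding above is a direct binary read-off of the four wall statuses; as a one-line sketch (the function name is ours):

```python
def cheese_maze_observation(up, left, down, right):
    """Pack the wall-blocking statuses (1 = blocked, 0 = free) into the
    four-digit binary observation, most significant digit first:
    up, left, down, right."""
    return (up << 3) | (left << 2) | (down << 1) | right

# the examples from the text: 0101 = 5 and 0111 = 7
```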
Our algorithm found a context tree consisting of 43 states that contains the tree shown in Figure 6. The tree splits from the root into the possible observations. The ambiguous observations are then split into the four possible actions, and some of these actions, namely those that come from a different location rather than from a wall collision, are split further into the possible preceding observations. This resolves which of the identically observed cells the agent is in. The states in this tree resolve the most important ambiguities of the raw observations, and an optimal policy can be found. The domain contains an infinite number of longer dependencies, of which our found states pick up a small subset. The cheese-maze plot in Figure 7 shows that after the initial 5,000 experiences, ΦMDP is marginally worse than MC-AIXI-CTW but better than U-Tree and Active-LZ. From time point 10,000 there is no difference between ΦMDP and MC-AIXI-CTW; U-Tree and Active-LZ remain inferior.
Kuhn Poker.
In Kuhn poker [17], a deck of only three cards (Jack, Queen and King) is used. The agent always plays second in any game (episode). After each player puts a chip into play, the players are dealt a card each. Then the first player says bet or pass, and the second player chooses bet or pass. If player one says pass and player two says bet, player one must choose again between bet and pass. Whenever a player says bet, they must put in another chip. If one player bets and the other passes, the bettor gets all the chips in play; otherwise the player with the highest card gets the chips. Player one plays according to a fixed but stochastic Nash-optimal strategy [11]. ΦMDP finds a compact set of states. It can be observed from the Kuhn-poker plot in Figure 8 that ΦMDP is comparable to MC-AIXI-CTW and much better than U-Tree and Active-LZ, which lose money.
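The betting rules above determine the payoff of an episode completely. The following sketch resolves a finished hand from the agent's (player two's) perspective; the function name, card codes and action tuples are illustrative assumptions:

```python
RANK = {"J": 0, "Q": 1, "K": 2}  # Jack < Queen < King

def kuhn_payoff(p1_card, p2_card, actions):
    """Net chips won by player 2 (the agent) after both antes.
    `actions` is the completed action sequence, e.g. ("pass", "bet", "bet").
    Legal sequences end after two moves, or three when play goes
    pass -> bet -> (bet | pass)."""
    pot = 2 + actions.count("bet")       # two antes plus one chip per bet
    if actions[-1] == "pass" and "bet" in actions:
        # One player bet and the other passed: the bettor takes the pot.
        # (bet, pass) means player 2 folded; (pass, bet, pass) means player 1 did.
        folder_is_p2 = len(actions) == 2
        return -1 if folder_is_p2 else +1
    # Both passed, or the bet was called: showdown, highest card wins.
    stake = pot // 2                     # each player's total contribution
    return stake if RANK[p2_card] > RANK[p1_card] else -stake
```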
5 Conclusions
Based on the Feature Reinforcement Learning framework [14], we defined practical reinforcement learning agents that perform very well empirically. We evaluated a reasonably simple instantiation of our algorithm that first takes random actions, then finds a map through a search procedure, and finally performs Q-learning on the MDP defined by the map’s state set.
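The loop just summarized can be sketched schematically as follows; all four callables are hypothetical placeholders for the components described in the paper, not its actual interface:

```python
def phi_mdp_agent(act_randomly, search_map, q_learn, n_random, n_iters):
    """Schematic sketch of the Feature RL loop:
      1. gather an initial history with random actions,
      2. search for a low-cost context-tree map phi of histories to states,
      3. run Q-learning on the MDP whose states phi induces,
      4. append the new experience and repeat without random actions."""
    history = [act_randomly() for _ in range(n_random)]
    q_values = None
    for _ in range(n_iters):
        phi = search_map(history)          # minimize the map's cost function
        q_values, new_experience = q_learn(phi, history)
        history += new_experience          # longer histories refine the map
    # the final policy is read off the Q-values of the last induced MDP
    return q_values
```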
We performed an evaluation on four test domains used to evaluate MC-AIXI-CTW in [28]; those domains are all suitably attacked with context-tree methods. We defined a ΦMDP agent for a class of maps based on context trees and compared it to three other context-tree-based methods. Key to the success of our ΦMDP agent was the development of a suitable stochastic search method for the class of Markov AOCTs. We combined parallel tempering with a specialized proposal distribution, resulting in an effective stochastic search procedure. The ΦMDP agent outperforms both the classical U-Tree algorithm [21] and the recent Active-LZ algorithm [6], and is competitive with the newest state-of-the-art method MC-AIXI-CTW [28]. The main reason ΦMDP outperforms U-Tree is that ΦMDP uses a global criterion (enabling the use of powerful global optimizers), whereas U-Tree uses a local split-merge criterion. ΦMDP also performs significantly better than Active-LZ. Active-LZ learns slowly because it overestimates the environment model (assuming a Markov or complete context-tree environment model), which leads to unreliable value-function estimates.
Below are some detailed advantages of ΦMDP over MC-AIXI-CTW:

- ΦMDP is more efficient than MC-AIXI-CTW in both computation and memory usage. ΦMDP only needs an initial number of samples; it then finds the optimal map, uses AVI to find the MDP parameters, and from then on needs only a Q-learning update per iteration. MC-AIXI-CTW, on the other hand, requires model updating, planning and value-reverting at every single cycle, which together are orders of magnitude more expensive than Q-learning. In the experiments, ΦMDP finished in minutes while MC-AIXI-CTW needed hours. Another disadvantage of MC-AIXI-CTW is that it is a memory-hungry algorithm: ΦMDP learns the best tree representation using stochastic search, which expands a tree towards relevant histories, whereas MC-AIXI-CTW learns a mixture of trees in which the number of tree nodes (and thereby the memory usage) grows linearly with time.

- ΦMDP learns a single state representation and can use many classical RL algorithms, e.g. Q-Learning, for MDP learning and planning.

- Another key benefit is that ΦMDP represents a more discriminative approach than MC-AIXI-CTW, since it aims primarily at predicting future rewards rather than fully modeling the observation sequence. If the observation sequence is very complex, this becomes essential.
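Since the map reduces histories to a finite state set, the classical tabular Q-learning update applies unchanged on the induced MDP. A minimal sketch, where states are the map's outputs and the step-size and discount values are illustrative rather than the paper's settings:

```python
from collections import defaultdict

def make_q():
    # Q[state][action], defaulting to 0 for unseen state-action pairs
    return defaultdict(lambda: defaultdict(float))

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """One tabular Q-learning step on the induced MDP, where s = phi(history)."""
    best_next = max(Q[s_next].values(), default=0.0)
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
```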
On the other hand, to be fair, it should be noted that compared to ΦMDP, MC-AIXI-CTW is more principled. The results presented in this paper are encouraging, since they show that we can achieve results comparable to the more sophisticated MC-AIXI-CTW algorithm on problems where only short-term memory is needed. We plan to utilize the aforementioned advantages of the ΦMDP framework, such as flexibility in environment modeling and computational efficiency, to attack larger and more complex problems.
Acknowledgement
This work was supported by ARC grant DP0988049 and by NICTA. We also thank Joel Veness and Daniel Visentin for their assistance with the experimental comparison.
References
 [1] Akaike, H.: A new look at the statistical model identification. IEEE Transactions on Automatic Control 19, 716–723 (1974)
 [2] Bertsekas, D.P., Tsitsiklis, J.N.: Neuro-Dynamic Programming. Athena Scientific, Belmont, MA (1996)
 [3] Brafman, R.I., Tennenholtz, M.: R-max, a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research 3, 213–231 (2002)
 [4] Chrisman, L.: Reinforcement learning with perceptual aliasing: The perceptual distinctions approach. In: AAAI. pp. 183–188 (1992)
 [5] Cover, T.M., Thomas, J.A.: Elements of Information Theory. John Wiley and Sons (1991)
 [6] Farias, V., Moallemi, C., Van Roy, B., Weissman, T.: Universal reinforcement learning. IEEE Transactions on Information Theory 56(5), 2441–2454 (May 2010)
 [7] Geyer, C.J.: Markov chain Monte Carlo maximum likelihood. In: Computing Science and Statistics: the 23rd Symposium on the Interface. pp. 156–163. Interface Foundation, Fairfax (1991)
 [8] Givan, R., Dean, T., Greig, M.: Equivalence notions and model minimization in Markov decision processes. Artificial Intelligence 147, 163–223 (2003)
 [9] Granville, V., Kr̆ivánek, M., Rasson, J.P.: Simulated annealing: A proof of convergence. IEEE Transactions on Pattern Analysis and Machine Intelligence 16(6), 652–656 (June 1994)
 [10] Grünwald, P.D.: The Minimum Description Length Principle. The MIT Press (2007)
 [11] Hoehn, B., Southey, F., Holte, R.C., Bulitko, V.: Effective short-term opponent exploitation in simplified poker. In: AAAI. pp. 783–788 (2005)
 [12] Hukushima, K., Nemoto, K.: Exchange Monte Carlo method and application to spin glass simulations. Journal of the Physical Society of Japan 65(4), 1604–1608 (1996)
 [13] Hutter, M.: Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability. Springer, Berlin (2005)
 [14] Hutter, M.: Feature reinforcement learning: Part I. Unstructured MDPs. Journal of Artificial General Intelligence (2009)
 [15] Kaelbling, L.P., Littman, M.L., Cassandra, A.R.: Planning and acting in partially observable stochastic domains. Artificial Intelligence 101, 99–134 (1998)
 [16] Kocsis, L., Szepesvári, C.: Bandit based Monte-Carlo planning. In: The European Conference on Machine Learning. pp. 282–293 (2006)
 [17] Kuhn, H.W.: A simplified two-person poker. In: Contributions to the Theory of Games. pp. 97–103 (1950)
 [18] Li, L., Walsh, T.J., Littman, M.L.: Towards a unified theory of state abstraction for MDPs. In: Proceedings of the International Symposium on Artificial Intelligence and Mathematics (2006)
 [19] Liu, J.S.: Monte Carlo Strategies in Scientific Computing. Springer (2001)
 [20] Madani, O., Hanks, S., Condon, A.: On the undecidability of probabilistic planning and related stochastic optimization problems. Artificial Intelligence 147, 5–34 (2003)
 [21] McCallum, A.K.: Reinforcement Learning with Selective Perception and Hidden State. Ph.D. thesis, Department of Computer Science, University of Rochester (1996)
 [22] Rissanen, J.: A universal data compression system. IEEE Transactions on Information Theory 29(5), 656–663 (1983)
 [23] Schneider, J., Kirkpatrick, S.: Stochastic Optimization. Springer, first edn. (2006)
 [24] Singh, S.P., James, M.R., Rudary, M.R.: Predictive state representations: A new theory for modeling dynamical systems. In: Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence. pp. 512–518. Banff, Canada (2004)
 [25] Suman, B., Kumar, P.: A survey of simulated annealing as a tool for single and multiobjective optimization. Journal of the Operational Research Society 57, 1143–1160 (2006)
 [26] Sunehag, P., Hutter, M.: Consistency of feature Markov processes. In: Proc. 21st International Conf. on Algorithmic Learning Theory (ALT’10). LNAI, vol. 6331, pp. 360–374. Springer, Canberra (2010)
 [27] Sutton, R., Barto, A.: Reinforcement Learning: An Introduction. The MIT Press (1998)
 [28] Veness, J., Ng, K.S., Hutter, M., Uther, W., Silver, D.: A Monte-Carlo AIXI approximation. Journal of Artificial Intelligence Research 40(1), 95–142 (2011)
 [29] Vidal, E., Thollard, F., Higuera, C.D.L., Casacuberta, F., Carrasco, R.C.: Probabilistic finite-state machines. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(7), 1013–1025 (July 2005)
 [30] Wallace, C.S., Dowe, D.L.: Minimum message length and Kolmogorov complexity. Computer Journal 42(4), 270–283 (1999)
 [31] Willems, F.M.J., Shtarkov, Y.M., Tjalkens, T.J.: The context tree weighting method: Basic properties. IEEE Transactions on Information Theory 41, 653–664 (1995)