Universal Reinforcement Learning Algorithms: Survey and Experiments

05/30/2017 ∙ by John Aslanides, et al. ∙ Google ∙ Australian National University

Many state-of-the-art reinforcement learning (RL) algorithms assume that the environment is an ergodic Markov Decision Process (MDP). In contrast, the field of universal reinforcement learning (URL) is concerned with algorithms that make as few assumptions as possible about the environment. The universal Bayesian agent AIXI and a family of related URL algorithms have been developed in this setting. While numerous theoretical optimality results have been proven for these agents, there has been no empirical investigation of their behavior to date. We present a short and accessible survey of these URL algorithms under a unified notation and framework, along with results of experiments that qualitatively illustrate properties of the resulting policies and their relative performance on partially-observable gridworld environments. We also present an open-source reference implementation of the algorithms which we hope will facilitate further understanding of, and experimentation with, these ideas.




1 Introduction

The standard approach to reinforcement learning typically assumes that the environment is a fully-observable Markov Decision Process (MDP) [Sutton and Barto1998]. Many state-of-the-art applications of reinforcement learning to large state-action spaces are achieved by parametrizing the policy with a large neural network, either directly (e.g. with deep deterministic policy gradients [Silver et al.2014]) or indirectly (e.g. deep Q-networks [Mnih et al.2013]). These approaches have yielded superhuman performance on numerous domains, including most notably the Atari 2600 video games [Mnih et al.2015] and the board game Go [Silver et al.2016]. This performance is due in large part to the scalability of deep neural networks; given sufficient experience and number of layers, coupled with careful optimization, a deep network can learn useful abstract features from high-dimensional input. These algorithms are, however, restricted in the class of environments that they can plausibly solve, due to the finite capacity of the network architecture and the modelling assumptions that are typically made, e.g. that the optimal policy can be well-approximated by a function of a fully-observable state vector.

In the setting of universal reinforcement learning, we lift the Markov, ergodic, and full-observability assumptions, and attempt to derive algorithms to solve this general class of environments. URL aims to answer the theoretical question: “making as few assumptions as possible about the environment, what constitutes optimal behavior?”. To this end several Bayesian, history-based algorithms have been proposed in recent years, central among which is the agent AIXI [Hutter2005]. Numerous important open conceptual questions remain [Hutter2009], including the need for a relevant, objective, and general optimality criterion [Leike and Hutter2015a]. As the field of artificial intelligence research moves inexorably towards AGI, these questions grow in import and relevance.

The contribution of this paper is three-fold. First, we present a survey of these URL algorithms, unifying their presentation under a consistent vocabulary. Second, we illuminate these agents with an empirical investigation into their behavior and properties: apart from the MC-AIXI-CTW implementation [Veness et al.2011], this is the only non-trivial set of experiments relating to AIXI, and the only set of experiments relating to its variants; hitherto only their asymptotic properties have been studied theoretically. Our third contribution is a portable and extensible open-source software framework (named AIXIjs; the source code can be found at http://github.com/aslanides/aixijs) for experimenting with, and demonstrating, URL algorithms. We also discuss several tricks and approximations that are required to get URL implementations working in practice. Our desire is that this framework will be of use, both for education and research, to the RL and AI safety communities. (A more comprehensive introduction and discussion, including more experimental results, can be found in the associated thesis at https://arxiv.org/abs/1705.07615.)

2 Literature Survey

This survey covers history-based Bayesian algorithms: we choose history-based algorithms as these are maximally general, and we restrict ourselves to Bayesian algorithms as these are generally both principled and theoretically tractable. The universal Bayesian agent AIXI [Hutter2005] is a model of a maximally intelligent agent, and plays a central role in the sub-field of universal reinforcement learning (URL). Recently, AIXI has been shown to be flawed in important ways; in general it doesn’t explore enough to be asymptotically optimal [Orseau2010], and it can perform poorly, even asymptotically, if given a bad prior [Leike and Hutter2015a]. Several variants of AIXI have been proposed to attempt to address these shortfalls: among them are entropy-seeking agents [Orseau2011], information-seeking agents [Orseau et al.2013], Bayes with bursts of exploration [Lattimore2013], MDL agents [Leike2016], Thompson sampling [Leike et al.2016], and optimism [Sunehag and Hutter2015].

It is worth emphasizing that these algorithms are models of rational behavior in general environments, and are not intended to be efficient or practical reinforcement learning algorithms. In this section, we provide a survey of the above algorithms, and of relevant theoretical results in the universal reinforcement learning literature.

2.1 Notation

As we are discussing POMDPs, we distinguish between (hidden) states and percepts, and we take into account histories, i.e. sequences of actions and percepts. States, actions, and percepts use Latin letters, while environments and policies use Greek letters. We write $\mathbb{R}$ for the reals and $\mathbb{B} = \{0, 1\}$. For sequences over some alphabet $\mathcal{X}$, $\mathcal{X}^n$ is the set of all strings of length $n$ over $\mathcal{X}$. We typically use the shorthand $x_{1:t} := x_1 x_2 \cdots x_t$ and $x_{<t} := x_{1:t-1}$. Concatenation of two strings $x$ and $y$ is given by $xy$. We refer to environments and environment models using the symbol $\nu$, and distinguish the true environment with $\mu$. The symbol $\epsilon$ is used to represent the empty string, while $\varepsilon$ is used to represent a small positive number. The symbols $\to$ and $\rightsquigarrow$ denote deterministic and stochastic mappings, respectively.

2.2 The General Reinforcement Learning Problem

We begin by formulating the agent-environment interaction. The environment is modelled as a partially observable Markov Decision Process (POMDP). That is, we can assume without loss of generality that there is some hidden state with respect to which the environment’s dynamics are Markovian. Let the state space $\mathcal{S}$ be a compact subset of a finite-dimensional vector space. For simplicity, assume that the action space $\mathcal{A}$ is finite. By analogy with a hidden Markov model, we associate with the environment stochastic dynamics $\mathcal{S} \times \mathcal{A} \rightsquigarrow \mathcal{S}$. Because the environment is in general partially observable, we define a percept space $\mathcal{E}$. Percepts $e \in \mathcal{E}$ are distributed according to a state-conditional percept distribution $\nu$; as we are largely concerned with the agent’s perspective, we will usually refer to $\nu$ as the environment itself.

The agent selects actions according to a policy $\pi$, a conditional distribution over $\mathcal{A}$ given the history so far. The agent-environment interaction takes the form of a two-player, turn-based game; the agent samples an action $a_t$ from its policy $\pi(\cdot \mid h_{<t})$, and the environment samples a percept $e_t$ from $\nu(\cdot \mid h_{<t} a_t)$. Together, they interact to produce a history: a sequence of action-percept pairs $h_{1:t} := a_1 e_1 \cdots a_t e_t \in (\mathcal{A} \times \mathcal{E})^t$. The agent and environment together induce a telescoping distribution over histories, analogous to the state-visit distribution in RL:

$$\nu^\pi(h_{1:t}) = \prod_{k=1}^{t} \pi(a_k \mid h_{<k})\, \nu(e_k \mid h_{<k} a_k) \qquad (1)$$
In RL, percepts consist of observation-reward tuples $e = (o, r)$, so that $\mathcal{E} = \mathcal{O} \times \mathcal{R}$. We assume that the reward signal is real-valued, $\mathcal{R} \subseteq \mathbb{R}$, and make no assumptions about the structure of the observations $\mathcal{O}$. In general, agents will have some utility function $u$ that typically encodes some preferences about states of the world. In the partially observable setting, the agent will have to make inferences from its percepts to world-states. For this reason, the utility function is a function over finite histories, of the form $u : \mathcal{H} \to \mathbb{R}$; for agents with an extrinsic reward signal, $u(h_{1:t}) = r_t$. The agent’s objective is to maximize expected future discounted utility. We assume a general class of convergent discount functions $\gamma : \mathbb{N} \to \mathbb{R}_{\geq 0}$, with the property $\Gamma_t := \sum_{k=t}^{\infty} \gamma_k < \infty$. For this purpose, we introduce the value function, which in this setting pertains to histories rather than states:

$$V^\pi_{\nu,\gamma,u}(h_{<t}) = \frac{1}{\Gamma_t}\, \mathbb{E}_{\nu^\pi}\!\left[\, \sum_{k=t}^{\infty} \gamma_k\, u(h_{1:k}) \,\middle|\, h_{<t} \right] \qquad (2)$$
In words, $V^\pi_{\nu,\gamma,u}$ is the expected discounted future sum of utility obtained by an agent following policy $\pi$ in environment $\nu$ under discount function $\gamma$ and utility function $u$. For conciseness we will often drop the $\gamma$ and/or $u$ subscripts from $V$ when the discount/utility functions are irrelevant or obvious from context; by default we assume geometric discounting and extrinsic rewards. The value of an optimal policy is given by the expectimax expression

$$V^*_\nu(h_{<t}) = \lim_{m \to \infty} \max_{a_t} \sum_{e_t} \cdots \max_{a_{t+m}} \sum_{e_{t+m}} \left( \sum_{k=t}^{t+m} \frac{\gamma_k}{\Gamma_t}\, u(h_{1:k}) \right) \prod_{k=t}^{t+m} \nu(e_k \mid h_{<k} a_k) \qquad (3)$$
which follows from Eqs. (1) and (2) by jointly maximizing over all future actions and distributing $\max$ over $\sum$. The optimal policy is then simply given by $\pi^*_\nu := \arg\max_\pi V^\pi_\nu$; note that in general the optimal policy may not exist if $u$ is unbounded from above. We now introduce the only non-trivial and non-subjective optimality criterion yet known for general environments [Leike and Hutter2015a]: weak asymptotic optimality.
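
To make the truncated expectimax computation concrete, here is a minimal Python sketch (illustrative only; the paper's reference implementation, AIXIjs, is written in JavaScript) that evaluates a depth-limited approximation of Equation (3) against a toy two-armed environment model. The `Bandit` class and its payout probabilities are invented for the example, and we use unnormalized geometric discounting (dropping the $1/\Gamma_t$ factor).

```python
def expectimax(env, history, depth, gamma=0.9):
    """Depth-limited expectimax: V(h) = max_a sum_e nu(e|ha) * [r(e) + gamma * V(hae)].

    `env.percept_dist(history, action)` returns (percept, probability) pairs,
    where a percept is an (observation, reward) tuple.
    """
    if depth == 0:
        return 0.0

    def q_value(a):
        # Expected reward plus discounted value of the extended history.
        return sum(p * (e[1] + gamma * expectimax(env, history + [(a, e)], depth - 1, gamma))
                   for e, p in env.percept_dist(history, a))

    return max(q_value(a) for a in env.actions)

class Bandit:
    """Toy two-armed stochastic environment model: arm 1 pays off more often."""
    actions = (0, 1)

    def percept_dist(self, history, action):
        theta = (0.2, 0.7)[action]  # payout probabilities, invented for the example
        return [((None, 1.0), theta), ((None, 0.0), 1.0 - theta)]
```

For the bandit above, `expectimax(Bandit(), [], 2)` evaluates to 0.7 + 0.9 · 0.7 = 1.33: at every depth the planner prefers the higher-paying arm. The doubly exponential blow-up of this recursion in the planning depth is what motivates the Monte Carlo approximations of Section 3.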

Definition 1 (Weak asymptotic optimality; Lattimore & Hutter, 2011).

Let the environment class $\mathcal{M}$ be a set of environments. A policy $\pi$ is weakly asymptotically optimal in $\mathcal{M}$ if for all $\mu \in \mathcal{M}$, $V^\pi_\mu \to V^*_\mu$ in mean, i.e.

$$\frac{1}{n} \sum_{t=1}^{n} \mathbb{E}_{\mu^\pi}\!\left[ V^*_\mu(h_{<t}) - V^\pi_\mu(h_{<t}) \right] \xrightarrow{\;n \to \infty\;} 0,$$

where $\mu^\pi$ is the history distribution defined in Equation (1).

AIXI is not in general asymptotically optimal, but both BayesExp and Thompson sampling (introduced below) are. Finally, we introduce the notion of effective horizon, which these algorithms rely on for their optimality.

Definition 2 ($\varepsilon$-effective horizon; Lattimore & Hutter, 2014).

Given a discount function $\gamma$, the $\varepsilon$-effective horizon is given by

$$H_t(\varepsilon) := \min \left\{ H : \frac{\Gamma_{t+H}}{\Gamma_t} \leq \varepsilon \right\} \qquad (4)$$

In words, $H_t(\varepsilon)$ is the horizon that one can truncate one’s planning to while still accounting for a fraction $1 - \varepsilon$ of the realizable return under stationary i.i.d. rewards.
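
For geometric discounting $\gamma_t = \gamma^t$, the effective horizon has a closed form, since $\Gamma_{t+H}/\Gamma_t = \gamma^H$. A small sketch (our own helper, not part of the paper's framework):

```python
import math

def effective_horizon_geometric(gamma: float, eps: float) -> int:
    """epsilon-effective horizon H_t(eps) for geometric discounting gamma_t = gamma**t.

    Gamma_t = gamma**t / (1 - gamma), so Gamma_{t+H} / Gamma_t = gamma**H, and
    H_t(eps) = ceil(log(eps) / log(gamma)), independent of t.
    """
    if not (0.0 < gamma < 1.0 and 0.0 < eps < 1.0):
        raise ValueError("need 0 < gamma < 1 and 0 < eps < 1")
    return math.ceil(math.log(eps) / math.log(gamma))
```

For example, `effective_horizon_geometric(0.99, 0.05)` returns 299, far beyond the MCTS planning horizons that are feasible in practice (Section 3.3).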

2.3 Algorithms

We consider the class of Bayesian URL agents. Each agent maintains a predictive distribution over percepts that we call a mixture model $\xi$. The agent mixes (marginalizes) over a class of models (hypotheses; environments) $\mathcal{M}$. We consider countable nonparametric model classes, so that

$$\xi(e) = \sum_{\nu \in \mathcal{M}} w_\nu\, \nu(e), \qquad (5)$$

where we have suppressed the conditioning on history for clarity. We have identified the agent’s credence in hypothesis $\nu$ with the weight $w_\nu$, with $w_\nu > 0$ and $\sum_{\nu \in \mathcal{M}} w_\nu = 1$, and we write the probability that $\nu$ assigns to percept $e$ as $\nu(e)$. We update with Bayes rule, which amounts to $w_\nu \leftarrow w_\nu\, \nu(e)/\xi(e)$; this induces the sequential weight-updating scheme of Algorithm 1. We will sometimes use the notation $w(\nu \mid h_{1:t})$ to represent the posterior mass on $\nu$ after updating on history $h_{1:t}$, and $w(\cdot \mid h_{1:t})$ when referring explicitly to a posterior distribution.

Definition 3 (AI$\xi$; Hutter, 2005).

AI$\xi$ is the policy that is optimal w.r.t. the Bayes mixture $\xi$:

$$\pi^{\mathrm{AI}\xi} := \arg\max_\pi V^\pi_\xi \qquad (6)$$
Computational tractability aside, a central issue in Bayesian induction lies in the choice of prior. If computational considerations aren’t an issue, we can choose $\mathcal{M}$ to be as broad as possible: the class of all lower semi-computable conditional contextual semimeasures [Leike and Hutter2015b], using the prior $w_\nu = 2^{-K(\nu)}$, where $K$ is the Kolmogorov complexity. This is equivalent to using Solomonoff’s universal prior [Solomonoff1978] over strings, and yields the AIXI model. AI$\xi$ is known to not be asymptotically optimal [Orseau2010], and it can be made to perform badly by bad priors [Leike and Hutter2015a].

1: Inputs: Model class $\mathcal{M}$; prior $w_\nu$; history $h \leftarrow \epsilon$
2: $\xi(e \mid h) := \sum_{\nu \in \mathcal{M}} w_\nu\, \nu(e \mid h)$
3: function Act($h$)
4:     Sample $a \sim \pi(\cdot \mid h)$ and perform action $a$
5:     Receive percept $e$
6:     for $\nu \in \mathcal{M}$ do
7:         $w_\nu \leftarrow w_\nu\, \nu(e \mid ha) / \xi(e \mid ha)$
8:     end for
9:     $h \leftarrow hae$
10: end function
Algorithm 1 Bayesian URL agent [Hutter, 2005]
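
The weight-updating scheme of Algorithm 1 is straightforward to implement. The following minimal Python sketch (class and method names are our own; conditioning on the history is left implicit) maintains a finite mixture and performs the Bayes update $w_\nu \leftarrow w_\nu\, \nu(e)/\xi(e)$:

```python
class BayesMixture:
    """Finite Bayes mixture xi(e) = sum_nu w_nu * nu(e) over percepts.

    Each model is a callable nu(e) giving the probability it assigns to
    the next percept e.
    """
    def __init__(self, models, weights):
        assert abs(sum(weights) - 1.0) < 1e-9
        self.models, self.weights = list(models), list(weights)

    def xi(self, e):
        return sum(w * nu(e) for nu, w in zip(self.models, self.weights))

    def update(self, e):
        z = self.xi(e)  # normalizer: probability the mixture assigned to e
        self.weights = [w * nu(e) / z for nu, w in zip(self.models, self.weights)]

# Two hypotheses about a coin: always-heads, and fair.
always_heads = lambda e: 1.0 if e == "H" else 0.0
fair = lambda e: 0.5
mix = BayesMixture([always_heads, fair], [0.5, 0.5])
mix.update("H")  # posterior: w = (2/3, 1/3)
```

After observing heads, the mixture assigned $\xi(\text{H}) = 0.75$, so the posterior shifts to $(2/3, 1/3)$; a single tails observation would falsify `always_heads` entirely, setting its weight to zero.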

Knowledge-seeking agents (KSA). Exploration is one of the central problems in reinforcement learning [Thrun1992], and a principled and comprehensive solution does not yet exist. With few exceptions [Bellemare et al.2016, Houthooft et al.2016, Pathak et al.2017, Martin et al.2017], the state of the art has not yet moved past $\varepsilon$-greedy exploration. Intrinsically motivating an agent to explore in environments with sparse reward structure via knowledge-seeking is a principled and general approach. This removes the dependence on extrinsic reward signals or utility functions, and collapses the exploration-exploitation trade-off to simply exploration. There are several generic ways in which to formulate a knowledge-seeking utility function for model-based Bayesian agents. We present three, due to Orseau et al.:

Definition 4 (Kullback-Leibler KSA; Orseau, 2014).

The KL-KSA is the Bayesian agent whose utility function is given by the information gain

$$u(h_{1:t}) := \mathrm{Ent}\!\left(w \mid h_{<t}\right) - \mathrm{Ent}\!\left(w \mid h_{1:t}\right) \qquad (7)$$
Informally, the KL-KSA gets rewarded for reducing the entropy (uncertainty) in its model. Now, note that the entropy in the Bayesian mixture $\xi$ can be decomposed into contributions from uncertainty in the agent’s beliefs $w$ and noise in the environments $\nu$: by Equation (5), given some percept $e$ such that $\xi(e) < 1$ (suppressing the history), either $w_\nu < 1$ or $\nu(e) < 1$ for some $\nu$.

That is, if $w_\nu < 1$, we say the agent is uncertain about whether hypothesis $\nu$ is true (assuming there is exactly one $\nu \in \mathcal{M}$ that is the truth). On the other hand, if $\nu(e) < 1$, we say that the environment is noisy or stochastic. If we restrict ourselves to deterministic environments such that $\nu(e) \in \{0, 1\}$, then $\xi(e) < 1$ implies that $w_\nu < 1$ for at least one $\nu$. This motivates us to define two agents that seek out percepts to which the mixture $\xi$ assigns low probability; in deterministic environments, these will behave like knowledge-seekers.

Definition 5 (Square & Shannon KSA; Orseau, 2011).

The Square and Shannon KSA are the Bayesian agents with utility given by $u(h_{1:t}) = -\xi(e_t)$ and $u(h_{1:t}) = -\log \xi(e_t)$, respectively.

Square, Shannon, and KL-KSA are so-named because of the form of the expression obtained when one computes the $\xi$-expected utility: this is clear for Square and Shannon, and for KL it turns out that the expected information gain is equal to the posterior-weighted KL-divergence $\sum_{\nu \in \mathcal{M}} w_\nu \mathrm{KL}(\nu \,\|\, \xi)$ [Lattimore2013]. Note that as far as implementation is concerned, these knowledge-seeking agents differ from AI$\xi$ only in their utility functions.
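
Concretely, the three utility functions fit in a few lines. The sketch below (our own; the KL-KSA utility is written as the entropy difference of Equation (7)) assumes the agent has access to its mixture probability $\xi(e)$ and to its weight vector before and after the Bayes update:

```python
import math

def u_square(xi_e: float) -> float:
    """Square KSA utility: u = -xi(e)."""
    return -xi_e

def u_shannon(xi_e: float) -> float:
    """Shannon KSA utility: u = -log xi(e); unbounded above as xi(e) -> 0."""
    return -math.log(xi_e)

def u_kl(w_prior, w_posterior) -> float:
    """KL-KSA utility: information gain, i.e. the reduction in posterior entropy."""
    ent = lambda w: -sum(p * math.log(p) for p in w if p > 0.0)
    return ent(w_prior) - ent(w_posterior)
```

A percept that collapses a uniform two-model posterior to a singleton yields information gain $\log 2$, while a percept that leaves the posterior unchanged yields zero; pure environmental noise is thus not intrinsically rewarding to the KL-KSA.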

The following two algorithms (BayesExp and Thompson sampling) are RL agents that add exploration heuristics so as to obtain weak asymptotic optimality, at the cost of introducing an exploration schedule $\{\varepsilon_t\}$, which can be annealed, i.e. $\varepsilon_t > 0$ and $\varepsilon_t \to 0$ as $t \to \infty$.

BayesExp. BayesExp (Algorithm 2) augments the Bayes agent AI$\xi$ with bursts of exploration, using the information-seeking policy of KL-KSA. If the expected information gain (IG, defined in Equation (7)) exceeds a threshold $\varepsilon_t$, the agent will embark on an information-seeking policy for one effective horizon.

1: Inputs: Model class $\mathcal{M}$; prior $w_\nu$; exploration schedule $\{\varepsilon_t\}$
2: loop
3:     Observe history $h_{<t}$
4:     if $V^*_{\xi,\mathrm{IG}}(h_{<t}) > \varepsilon_t$ then
5:         for $H_t(\varepsilon_t)$ steps do
6:             Act according to the information-seeking policy $\pi^*_{\xi,\mathrm{IG}}$
7:         end for
8:     else
9:         Act according to $\pi^{\mathrm{AI}\xi}$
10:     end if
11: end loop
Algorithm 2 BayesExp [Lattimore, 2013]

Thompson sampling. Thompson sampling (TS) is a very common Bayesian sampling technique, named for [Thompson1933]. In the context of general reinforcement learning, it can be used as another attempt at solving the exploration problems of AI$\xi$. From Algorithm 3, we see that TS follows the $\rho$-optimal policy for an effective horizon before re-sampling $\rho$ from the posterior $w(\cdot \mid h_{<t})$. This commits the agent to a single hypothesis for a significant amount of time, as it samples and tests each hypothesis one at a time.

1: Inputs: Model class $\mathcal{M}$; prior $w_\nu$; exploration schedule $\{\varepsilon_t\}$
2: loop
3:     Observe history $h_{<t}$
4:     Sample $\rho \sim w(\cdot \mid h_{<t})$
5:     Compute $\pi^*_\rho$
6:     for $H_t(\varepsilon_t)$ steps do
7:         Act according to $\pi^*_\rho$ and update $w$
8:     end for
9: end loop
Algorithm 3 Thompson Sampling [Leike et al., 2016]
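
A single Thompson-sampling cycle is easy to sketch: sample one environment from the posterior, then commit to its optimal policy for a horizon's worth of steps. In this toy sketch (function names are our own), `optimal_action(rho)` stands in for the full $\rho$-optimal planning step:

```python
import random

def thompson_cycle(models, posterior, horizon, optimal_action):
    """One TS cycle: sample rho ~ w(.|h), then follow pi*_rho for `horizon` steps.

    Re-sampling happens only once per cycle, so the agent commits to a
    single hypothesis at a time, as described in the text.
    """
    rho = random.choices(models, weights=posterior, k=1)[0]
    return [optimal_action(rho) for _ in range(horizon)]
```

Note that the commitment is what distinguishes TS from AI$\xi$: the sampled $\rho$ is followed even when intermediate percepts already cast doubt on it; the posterior is only consulted again at the next re-sample.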

MDL. The minimum description length (MDL) principle is an inductive bias that implements Occam’s razor, originally attributed to [Rissanen1978]. The MDL agent greedily picks the simplest probable unfalsified environment $\rho$ in its model class and behaves optimally with respect to that environment until it is falsified. If $\mu \in \mathcal{M}$ and the environments are deterministic, Algorithm 4 converges to optimal behavior [Lattimore and Hutter2011]. Note that the model-selection step of Algorithm 4 amounts to maximum likelihood regularized by the Kolmogorov complexity $K(\nu)$.

1: Inputs: Model class $\mathcal{M}$; prior $w_\nu$; regularizer constant $\alpha$
2: loop
3:     $\rho \leftarrow \arg\max_{\nu \in \mathcal{M}} \left[ \log \nu(h_{<t}) - \alpha K(\nu) \right]$
4:     Act according to $\pi^*_\rho$ until $\rho$ is falsified
5:     Update $w$
6: end loop
Algorithm 4 MDL Agent [Lattimore and Hutter, 2011].

3 Implementation

In this section, we describe the environments that we use in our experiments, introduce two Bayesian environment models, and discuss some necessary approximations.

3.1 Gridworld

We run our experiments on a class of partially-observable gridworlds. The gridworld is an $N \times N$ grid of tiles; tiles can be either empty, walls, or stochastic reward dispenser tiles. The action space $\mathcal{A}$ consists of five actions, which move the agent in the four cardinal directions or stand still. The observation space is $\mathcal{O} = \mathbb{B}^4$, the set of bit-strings of length four; each bit is 1 if the adjacent tile in the corresponding direction is a wall, and is 0 otherwise. The reward space is small and integer-valued: the agent receives a small constant reward for walking over empty tiles, a penalty for bumping into a wall, and a large reward with some fixed probability $\theta$ if it is on a reward dispenser tile. There is no observation to distinguish empty tiles from dispensers. In this environment, the optimal policy (assuming unbounded episode length) is to move to the dispenser with the highest payout probability and remain there, subsequently collecting reward in proportion to $\theta$ per cycle in expectation (the environment is non-episodic). In all cases, the agent’s starting position is at the top-left corner of the grid.
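
The agent's 4-bit observation is cheap to compute from the grid. A sketch (the bit ordering and the `'#'` wall convention are our own assumptions, not fixed by the paper):

```python
def observe(grid, pos):
    """4-bit wall observation: one bit per direction (up, down, left, right).

    `grid` is a list of equal-length strings with '#' marking walls;
    out-of-bounds tiles count as walls. Empty tiles and dispensers both
    read as '0', so the percept cannot distinguish them.
    """
    r, c = pos
    def wall(i, j):
        inside = 0 <= i < len(grid) and 0 <= j < len(grid[0])
        return (not inside) or grid[i][j] == "#"
    bits = [wall(r - 1, c), wall(r + 1, c), wall(r, c - 1), wall(r, c + 1)]
    return "".join("1" if b else "0" for b in bits)
```

Because many distinct cells produce the same bit-string (e.g. every interior cell of an open room reads `"0000"`), the percept aliases states heavily, which is exactly the source of difficulty discussed below.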

This environment has small action and percept spaces and relatively straightforward dynamics. The challenge lies in coping with perceptual aliasing due to partial observability, dealing with stochasticity in the rewards, and balancing exploration and exploitation. In particular, for a gridworld with several dispensers with differing payout probabilities $\theta$, the gridworld presents a non-trivial exploration/exploitation dilemma; see Figure 1.

3.2 Models

Generically, we would like to construct a Bayes mixture over a model class $\mathcal{M}$ that is as rich as possible. Since we are using a nonparametric model, we are not concerned with choosing our model class so as to arrange for conjugate prior/likelihood pairs. One might consider constructing $\mathcal{M}$ by enumerating all gridworlds of the class described above, but this is infeasible, as the size of such an enumeration explodes combinatorially: using just two tile types we would run out of memory even on a modest gridworld, since $|\mathcal{M}| = 2^{N^2}$. Instead, we choose a discrete parametrization that enumerates an interesting subset of these gridworlds. One can think of $\mathcal{M}$ as describing a set of parameters about which the agent is uncertain; all other parameters are held constant, and the agent is fully informed of their value. We use this to create the first of our model classes, $\mathcal{M}_{\text{loc}}$. The second, $\mathcal{M}_{\text{Dir}}$, uses a factorized distribution rather than an explicit mixture to avoid this issue.

$\mathcal{M}_{\text{loc}}$. This is a Bayesian mixture parametrized by goal location; the agent knows the layout of the gridworld and knows its dynamics, but is uncertain about the location of the dispensers, and must explore the world to figure out where they are. We construct the model class by enumerating each of these gridworlds. For square gridworlds of side length $N$, $|\mathcal{M}_{\text{loc}}| = O(N^2)$. From Algorithm 1, the time complexity of our Bayesian updates is $O(N^2)$ per interaction cycle.

$\mathcal{M}_{\text{Dir}}$. The Bayes mixture $\mathcal{M}_{\text{loc}}$ is a natural class of gridworlds to consider, but it is quite constrained in that it holds the maze layout and dispenser probabilities fixed. We seek a model that allows the agent to be uncertain about these features. To do this we move away from the mixture formalism so as to construct a bespoke gridworld model.

Let $s_{ij}$ be the state of the tile at position $(i, j)$ in the gridworld. The joint distribution over all gridworlds factorizes across tiles, and the state-conditional percept distributions are Dirichlet over the four tile types. We effectively use a Haldane prior – $\mathrm{Beta}(0, 0)$ – with respect to Walls and a Laplace prior – $\mathrm{Beta}(1, 1)$ – with respect to Dispensers. This model class allows us to make the agent uncertain about the maze layout, including the number, location, and payout probabilities of Dispensers. In contrast to $\mathcal{M}_{\text{loc}}$, model updates are $O(1)$ time complexity, making $\mathcal{M}_{\text{Dir}}$ a far more efficient choice for large $N$.
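
The per-tile Laplace estimator underlying this factorized model is a one-line Bayes update. A sketch (the class name is ours) of the $O(1)$ update for a single tile's dispenser payout probability $\theta$:

```python
class LaplaceTile:
    """Laplace (add-one) estimator of a tile's payout probability theta.

    Beta(1,1) prior; each visit contributes a Bernoulli observation
    (reward seen / not seen). The update touches only this tile, which is
    what makes the factorized model's updates O(1) per cycle.
    """
    def __init__(self, alpha=1.0, beta=1.0):
        self.alpha, self.beta = alpha, beta  # Beta pseudo-counts

    def update(self, rewarded: bool) -> None:
        if rewarded:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    def mean(self) -> float:
        """Posterior mean estimate of theta."""
        return self.alpha / (self.alpha + self.beta)
```

After observing two payouts in three visits, the estimate is $(1+2)/(2+3) = 0.6$. Because the estimator never assigns probability exactly 0 or 1, revisiting a tile always yields a little more information, which explains the KL-KSA's behavior in Figure 4.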

This model class incorporates more domain knowledge, and allows the agent to flexibly and cheaply represent a much larger class of gridworlds than the naive enumeration $\mathcal{M}_{\text{loc}}$. Importantly, $\mathcal{M}_{\text{Dir}}$ is still a Bayesian model – we simply lose the capacity to represent it explicitly as a mixture of the form of Equation (5).

3.3 Agent Approximations

Planning. In contrast to model-free agents, in which the policy is explicitly represented by e.g. a neural network, our model-based agents need to calculate their policy from scratch at each time step by computing the value function in Equation (6). We approximate this infinite foresight with a finite planning horizon $m$, and we approximate the expectation using Monte Carlo techniques. We implement the $\rho$UCT algorithm [Silver and Veness2010, Veness et al.2011], a history-based version of the UCT algorithm [Kocsis and Szepesvári2006]. $\rho$UCT is itself a variant of Monte Carlo Tree Search (MCTS), a commonly used planning algorithm originally developed for computer Go. The key idea is to try to avoid wasting time considering unpromising sequences of actions; for this reason the action selection procedure within the search tree is motivated by the upper confidence bound (UCB) algorithm [Auer et al.2002]:

$$a^{\mathrm{UCB}} = \arg\max_{a \in \mathcal{A}} \left\{ \frac{1}{m(\beta - \alpha)} \hat{V}(ha) + C \sqrt{\frac{\log T(h)}{T(ha)}} \right\} \qquad (9)$$

where $T(h)$ is the number of times the sampler has reached history $h$, $\hat{V}(ha)$ is the current estimator of the value of taking action $a$ at history $h$, $m$ is the planning horizon, $C$ is a free parameter (usually set to $\sqrt{2}$), and $\alpha$ and $\beta$ are the minimum and maximum rewards emitted by the environment. In this way, the MCTS planner approximates the expectimax expression in Equation (3), and effectively yields a stochastic policy approximating the optimal policy defined in Equation (6) [Veness et al.2011].
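
The UCB action selection of Equation (9) can be sketched as follows (the node representation and field layout are our own; the $\rho$UCT implementation in AIXIjs differs in detail):

```python
import math

def ucb_action(children, m, alpha, beta, C=math.sqrt(2)):
    """Select an action at a search node per Equation (9).

    `children` maps each action to a (visits, value_sum) pair; unvisited
    actions score +infinity so they are expanded first. Value estimates
    are normalized by m * (beta - alpha) to lie in [0, 1].
    """
    total = sum(v for v, _ in children.values())

    def score(action):
        visits, value_sum = children[action]
        if visits == 0:
            return float("inf")  # always try unvisited actions first
        v_hat = value_sum / visits  # Monte Carlo value estimate
        return v_hat / (m * (beta - alpha)) + C * math.sqrt(math.log(total) / visits)

    return max(children, key=score)
```

The second term is the exploration bonus: rarely-tried actions at a frequently-visited node get a large bonus, so the tree policy keeps revisiting them until their value estimates are trustworthy.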

In practice, MCTS is very computationally expensive; when planning by forward simulation with a Bayes mixture over $\mathcal{M}$, the worst-case time-complexity of $\rho$UCT is $O(\kappa\, m\, |\mathcal{M}|)$, where $\kappa$ is the Monte Carlo sample budget. It is important to note that $\rho$UCT treats the environment model as a black box: it is agnostic to the environment’s structure, and so makes no assumptions or optimizations. For this reason, planning in POMDPs with $\rho$UCT is quite wasteful: due to perceptual aliasing, the algorithm considers many plans that are cyclic in the (hidden) state space of the POMDP. This is unavoidable, and means that in practice $\rho$UCT can be very inefficient even in small environments.

Effective horizon. Computing the effective horizon exactly for general discount functions is not possible in general, although approximate effective horizons have been derived for some common choices of $\gamma$ [Lattimore2013]. For most realistic choices of $\gamma$ and $\varepsilon$, the effective horizon is significantly greater than the planning horizon we can feasibly use, due to the prohibitive computational requirements of MCTS [Lamont et al.2017]. For this reason we use the MCTS planning horizon $m$ as a surrogate for $H_t(\varepsilon)$. In practice, only small values of $m$ are feasible, resulting in short-sighted agents with (depending on the environment and model) suboptimal and highly stochastic policies.

Utility bounds. Recall from Equation (9) that the $\rho$UCT value estimator is normalized by a factor of $\frac{1}{m(\beta - \alpha)}$. For reward-based reinforcement learners, $\alpha$ and $\beta$ are part of the metadata provided to the agent by the environment at the beginning of the agent-environment interaction. For utility-based agents such as KSA, however, rewards are generated intrinsically, and so the agent must calculate for itself what range of utilities it expects to see, so as to correctly normalize its value function for the purposes of planning. For Square and KL-KSA, it is possible to get loose utility bounds by making some reasonable assumptions. Since $\xi(e) \in [0, 1]$, we have the bound $-\xi(e) \in [-1, 0]$ for the Square KSA. One can argue that for the vast majority of environments this bound will be tight above, since there will exist percepts that the agent’s model has effectively falsified, such that $\xi(e) \to 0$ as $t \to \infty$.

In the case of the KL-KSA, recall that the utility is the information gain $\mathrm{Ent}(w \mid h_{<t}) - \mathrm{Ent}(w \mid h_{1:t})$. If we assume a uniform prior, then we have the bound $u \leq \log |\mathcal{M}|$, since entropy is non-negative over discrete model classes.

Now, in general $-\log \xi(e)$ is unbounded above as $\xi(e) \to 0$, so unless we can a priori place lower bounds on the probability that $\xi$ will assign to an arbitrary percept $e$, we cannot bound the Shannon KSA’s utility function, and therefore cannot calculate the normalization in Equation (9). Therefore, planning with MCTS with the Shannon KSA will be problematic in many environments, as we are forced to make an arbitrary choice of upper bound.

4 Experiments

We run experiments to investigate and compare the agents’ learning performance. Except where otherwise stated, the following experiments were run on a gridworld with a single dispenser. We average training score over multiple simulations for each agent configuration, with the same fixed MCTS sample budget $\kappa$ and planning horizon $m$ throughout. In all cases, discounting is geometric, and the agents are initialized with a uniform prior.

Figure 1: Left: An example gridworld environment, with the agent’s posterior over the dispenser location superimposed. Grey tiles are walls, and the agent’s posterior is visualized in green; darker shades correspond to tiles with greater probability mass. Here, AI$\xi$ has mostly falsified the hypotheses that the goal is in the top-left corner of the maze, and so is motivated to search deeper in the maze, as its model assigns greater mass to unexplored areas. In the ground truth environment $\mu$, the dispenser is located at the tile represented by the orange disc. Right: A gridworld that is too large to feasibly model with $\mathcal{M}_{\text{loc}}$, with the agent’s posterior over tile states superimposed. White tiles are unknown, pale blue tiles are known to be walls, and purple tiles are known to not be walls; darker shades of purple correspond to lower credence that a Dispenser is present. Notice that despite exploring much of the maze, AI$\xi$ has not discovered the profitable reward dispenser located in the upper centre; it has instead settled for a suboptimal dispenser in the lower part of the maze, illustrating the exploration-exploitation tradeoff.

We plot the training score averaged over simulation runs, with shaded areas corresponding to one sigma. That is, we plot $\bar{s}_t = \frac{1}{n} \sum_{i=1}^{n} s_t^{(i)}$, where $s_t^{(i)}$ is the instantaneous score at time $t$ in run $i$: this is either reward in the case of reinforcement learners, or the fraction of the environment explored in the case of KSA.

Model class. AI$\xi$ performs significantly better using the Dirichlet model than with the mixture model. Since the Dirichlet model is less constrained (in other words, less informed), there is more to learn; the agent is more motivated to explore, and is less susceptible to getting stuck in local maxima. From Figure 2, we see that MC-AIXI-Dirichlet appears to have asymptotically higher variance in its average reward than MC-AIXI. This makes sense, since the agent may discover the reward dispenser but be subsequently incentivized to move away from it and keep exploring, since its model still assigns significant probability to there being better dispensers elsewhere; in contrast, under $\mathcal{M}_{\text{loc}}$, the agent’s posterior immediately collapses to a singleton once it discovers a dispenser, and it will greedily stay there ever after. This is borne out by Figure 2, which shows that, on average, AIXI explores significantly more of the gridworld using $\mathcal{M}_{\text{Dir}}$ than with $\mathcal{M}_{\text{loc}}$. These experiments illustrate how AI$\xi$’s behavior depends strongly on the model class $\mathcal{M}$.

Figure 2: Performance of AI$\xi$ is dependent on the model class; horizontal axis in interaction cycles. Left: Exploration fraction. Note that MC-AIXI and MC-AIXI-Dirichlet both effectively stop exploring quite early on. Right: Average reward.

KSA. As discussed in Section 2.3, the Shannon- and Square-KSA are entropy-seeking; they are therefore susceptible to environments in which a noise source is constructed so as to trap the agent in a very suboptimal exploratory policy, as the agent gets ‘hooked on noise’ [Orseau et al.2013]. The noise source is a tile that emits uniformly random percepts over a sufficiently large alphabet such that the probability of any given percept is lower (and more attractive) than anything else the agent expects to experience by exploring.

KL-KSA explores more effectively than Square- and Shannon-KSA in stochastic environments; see Figure 3. Under the mixture model $\mathcal{M}_{\text{loc}}$, the posterior collapses to a singleton once the agent finds the goal. Given a stochastic dispenser, this becomes the only source of entropy in the environment, and so Square- and Shannon-KSA will remain at the dispenser. In contrast, once the posterior collapses to the minimum-entropy configuration, there is no more information to be gained, and so KL-KSA will random walk (breaking ties randomly), and outperform Square- and Shannon-KSA marginally in exploration. This difference becomes more marked under the Dirichlet model; although all three agents perform better under $\mathcal{M}_{\text{Dir}}$, due to its being less constrained and having more entropy than $\mathcal{M}_{\text{loc}}$, KL-KSA outperforms the others by a considerable margin; KL-KSA is motivated to explore as it anticipates considerable changes to its model by discovering new areas; see Figure 4.

Figure 3: Intrinsic motivation is highly model-dependent, and the information-seeking policy outperforms entropy-seeking in stochastic environments. Left: $\mathcal{M}_{\text{loc}}$. Right: $\mathcal{M}_{\text{Dir}}$.

Figure 4: KL-KSA-Dirichlet is highly motivated to explore every reachable tile in the gridworld. Left: The agent begins exploring, and soon stumbles on the dispenser tile. Center: The agent is motivated only by information gain, and so ignores the reward dispenser, opting instead to continue exploring the maze. Right: The agent has now discovered most of the maze, and continues to gain information from tiles it has already visited as it updates its independent Laplace estimators for each tile.

Exploration. Thompson sampling (TS) is asymptotically optimal in general environments, a property that AI$\xi$ lacks. However, in our experiments on gridworlds, TS performs poorly in comparison to AI$\xi$; see Figure 5. This is caused by two issues: (1) The parametrization of $\mathcal{M}_{\text{loc}}$ means that the $\rho$-optimal policy for any sampled $\rho$ is to seek out some candidate dispenser tile and wait there for one effective horizon. For all but the smallest gridworlds or shortest horizons, this is an inefficient strategy for discovering the location of the dispenser. (2) The performance of TS is strongly constrained by the limitations of the planner. If the agent samples an environment $\rho$ which places the dispenser outside its planning horizon – that is, more than $m$ steps away – then the agent will not be sufficiently far-sighted to do anything useful. In contrast, AI$\xi$ is motivated to continually move around the maze as long as it hasn’t yet discovered the dispenser, since $\xi$ will assign progressively higher mass to non-visited tiles as the agent explores more.

Thompson sampling is computationally cheaper than AI$\xi$, as it plans with only one environment model $\rho$ rather than with the entire mixture $\xi$. When given a level playing field with a time budget rather than a sample budget, the performance gap is reduced; see Figure 5 (right).

Figure 5: TS typically underperforms AI$\xi$. Left: Both agents are given the same MCTS planning parameters (planning horizon $m$ and sample budget $\kappa$). Here, Thompson sampling is unreliable and lacklustre, achieving low mean reward with high variance. Right: When each agent is instead given a fixed time budget per action, TS performs comparatively better, as the gap to AI$\xi$ is narrowed.

Occam bias. Since each environment $\nu \in \mathcal{M}_{\text{loc}}$ differs by a single parameter and has otherwise identical source code, we cannot use a Kolmogorov complexity surrogate to differentiate them. Instead, we opt to order $\mathcal{M}$ by the environment’s index in a row-major order enumeration. On simple deterministic environments ($\theta = 1$), the MDL agent (Algorithm 4) significantly outperforms the Bayes agent AI$\xi$ with a uniform prior over $\mathcal{M}$; see Figure 6. This is because the MDL agent is biased towards environments with low indices; using the $\mathcal{M}_{\text{loc}}$ model class, this corresponds to environments in which the dispenser is close to the agent’s starting position. In comparison, AI$\xi$’s uniform prior assigns significant probability mass to the dispenser being deep in the maze. This motivates it to explore deeper in the maze, often neglecting to thoroughly exhaust the simpler hypotheses.

Figure 6: Left: MDL vs uniform-prior AIξ in a deterministic gridworld. Right: AIξ compared to the (MC-approximated) optimal policy AIμ. Note that MC-AIXI paradoxically performs worse in the deterministic setting than in the stochastic one; this is because the posterior over ℳ very quickly becomes sparse as hypotheses are rejected, which makes planning difficult with a small samples budget.
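The posterior sparsification behind this paradox is easy to see in a minimal Bayesian update (a sketch under assumed notation: θ is the dispenser's reward probability, and each hypothesis places the dispenser at a different tile). A deterministic dispenser (θ = 1) assigns likelihood zero to "no reward" at the hypothesised tile, so a single null observation removes that hypothesis from the support entirely; a stochastic one (θ < 1) only down-weights it:

```python
def support_after_null_observation(theta, n=10, visited=0):
    """Size of the posterior support after observing 'no reward' at a
    visited tile, under a uniform prior over n dispenser positions."""
    post = [1.0 / n] * n
    # Likelihood of a null observation at `visited` under hypothesis i:
    liks = [(1.0 - theta) if i == visited else 1.0 for i in range(n)]
    z = sum(w * l for w, l in zip(post, liks))
    post = [w * l / z for w, l in zip(post, liks)]
    return sum(1 for w in post if w > 0)
```

With θ = 1 each visit permanently shrinks the support, so after a few observations the MCTS planner is effectively sampling from very few live hypotheses, and small samples budgets yield noisy value estimates.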

5 Conclusion

In this paper we have presented a short survey of state-of-the-art universal reinforcement learning algorithms, and developed experiments that illustrate the exploration strategies their policies produce. We have also discussed how the behavior of Bayesian URL agents depends on the construction of the model class ℳ, and described some of the tricks and approximations that are necessary to make Bayesian agents work with UCT planning. We have made our implementation open source and available for further development and experimentation, and anticipate that it will be of use to the RL community.


We wish to thank Sean Lamont for his assistance in developing the gridworld visualizations used in Figures 1 and 4. We also thank Jarryd Martin and Suraj Narayanan S for proof-reading early drafts of the manuscript. This work was supported in part by ARC DP150104590.


  • [Auer et al.2002] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
  • [Bellemare et al.2016] Marc G. Bellemare, Sriram Srinivasan, Georg Ostrovski, et al. Unifying count-based exploration and intrinsic motivation. CoRR, abs/1606.01868, 2016.
  • [Houthooft et al.2016] Rein Houthooft, Xi Chen, Yan Duan, et al. VIME: Variational information maximizing exploration. In Advances in Neural Information Processing Systems 29, pages 1109–1117. 2016.
  • [Hutter2005] Marcus Hutter. Universal Artificial Intelligence. Springer, 2005.
  • [Hutter2009] Marcus Hutter. Open problems in universal induction & intelligence. Algorithms, 3(2):879–906, 2009.
  • [Kocsis and Szepesvári2006] Levente Kocsis and Csaba Szepesvári. Bandit based Monte-Carlo planning. In European Conference on Machine Learning, pages 282–293, 2006.
  • [Lamont et al.2017] Sean Lamont, John Aslanides, Jan Leike, and Marcus Hutter. Generalised discount functions applied to a Monte-Carlo AIμ implementation. In Autonomous Agents and Multiagent Systems, pages 1589–1591, 2017.
  • [Lattimore2013] Tor Lattimore. Theory of General Reinforcement Learning. PhD thesis, ANU, 2013.
  • [Leike and Hutter2015a] Jan Leike and Marcus Hutter. Bad universal priors and notions of optimality. In Conference on Learning Theory, pages 1244–1259, 2015.
  • [Leike and Hutter2015b] Jan Leike and Marcus Hutter. On the computability of Solomonoff induction and knowledge-seeking. In Algorithmic Learning Theory, pages 364–378, 2015.
  • [Leike et al.2016] Jan Leike, Tor Lattimore, Laurent Orseau, and Marcus Hutter. Thompson sampling is asymptotically optimal in general environments. In Uncertainty in Artificial Intelligence, 2016.
  • [Leike2016] Jan Leike. Nonparametric General Reinforcement Learning. PhD thesis, ANU, 2016.
  • [Martin et al.2017] Jarryd Martin, Suraj Narayanan S, Tom Everitt, and Marcus Hutter. Count-based exploration in feature space for reinforcement learning. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI’17. AAAI Press, 2017.
  • [Mnih et al.2013] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, et al. Playing Atari with deep reinforcement learning. Technical report, Google DeepMind, 2013.
  • [Mnih et al.2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • [Orseau et al.2013] Laurent Orseau, Tor Lattimore, and Marcus Hutter. Universal knowledge-seeking agents for stochastic environments. In Algorithmic Learning Theory, pages 158–172. Springer, 2013.
  • [Orseau2010] Laurent Orseau. Optimality issues of universal greedy agents with static priors. In Algorithmic Learning Theory, pages 345–359. Springer, 2010.
  • [Orseau2011] Laurent Orseau. Universal knowledge-seeking agents. In Algorithmic Learning Theory, pages 353–367. Springer, 2011.
  • [Pathak et al.2017] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell. Curiosity-driven Exploration by Self-supervised Prediction. ArXiv e-prints, May 2017.
  • [Rissanen1978] Jorma Rissanen. Modeling by shortest data description. Automatica, 14(5):465–471, 1978.
  • [Silver and Veness2010] David Silver and Joel Veness. Monte-Carlo planning in large POMDPs. In Advances in Neural Information Processing Systems 23, pages 2164–2172. 2010.
  • [Silver et al.2014] David Silver, Guy Lever, Nicolas Heess, et al. Deterministic policy gradient algorithms. In Tony Jebara and Eric P. Xing, editors, International Conference on Machine Learning, pages 387–395, 2014.
  • [Silver et al.2016] David Silver, Aja Huang, Chris J Maddison, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
  • [Solomonoff1978] Ray Solomonoff. Complexity-based induction systems: Comparisons and convergence theorems. IEEE Transactions on Information Theory, 24(4):422–432, 1978.
  • [Sunehag and Hutter2015] Peter Sunehag and Marcus Hutter. Rationality, optimism and guarantees in general reinforcement learning. Journal of Machine Learning Research, 16:1345–1390, 2015.
  • [Sutton and Barto1998] Richard Sutton and Andrew Barto. Reinforcement Learning: An Introduction. MIT, 1998.
  • [Thompson1933] William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, pages 285–294, 1933.
  • [Thrun1992] S. Thrun. The role of exploration in learning control. In Handbook for Intelligent Control: Neural, Fuzzy and Adaptive Approaches. 1992.
  • [Veness et al.2011] Joel Veness, Kee Siong Ng, Marcus Hutter, William Uther, and David Silver. A Monte-Carlo AIXI approximation. Journal of Artificial Intelligence Research, 40(1):95–142, 2011.