1 Introduction
“Efforts to solve [an instance of the exploration-exploitation problem] so sapped the energies and minds of Allied analysts that the suggestion was made that the problem be dropped over Germany, as the ultimate instrument of intellectual sabotage.” –Peter Whittle [Whittle1979]
The Allied analysts were considering the simplest possible problem in which there is a tradeoff to be made between exploiting (taking the apparently best option) and exploring (choosing a different option to learn more). We tackle what we consider the most difficult instance of the exploration-exploitation tradeoff problem: when the environment could be any computable probability distribution, not just a multi-armed bandit, how can one achieve optimal performance in the limit?
Our work is within the Reinforcement Learning (RL) paradigm: an agent selects an action, and the environment responds with an observation and a reward. The interaction may end, or it may continue forever. Each interaction cycle is called a timestep. The agent has a discount function that weights its relative concern for the reward it achieves at various future timesteps. The agent’s job is to select actions that maximize the total expected discounted reward it achieves in its lifetime. The “value” of an agent’s policy at a certain point in time is the expected total discounted reward it achieves after that time if it follows that policy. One formal specification of the exploration-exploitation problem is: what policy can an agent follow so that the policy’s value approaches the value of the optimal informed policy with probability 1, even when the agent doesn’t start out knowing the true dynamics of its environment?
Most work in RL makes strong assumptions about the environment, for instance that the environment is Markov. Impressive recent developments in reinforcement learning often make use of the Markov assumption, including Deep Q Networks [Mnih et al.2015], A3C [Mnih et al.2016], Rainbow [Hessel et al.2018], and AlphaZero [Silver et al.2017]. Another example of making strong assumptions in RL comes from some model-based algorithms that implicitly assume that the environment is representable by, for example, a fixed-size neural network, or whatever construct is used to model the environment.
Many recent developments in RL are largely about tractably learning to exploit; how to explore intelligently is a separate problem. We address the latter problem. Our approach, inquisitiveness, is based on the Knowledge Seeking Agent for Stochastic Environments of Orseau et al. [Orseau et al.2013], which selects the actions that best inform the agent about what environment it is in. Our Inquisitive Reinforcement Learner (Inq) explores like a knowledge-seeking agent, and is more likely to explore when there is apparently (according to its current beliefs) more to be learned. Sometimes exploring well requires “expeditions,” or many consecutive exploratory actions. Inq entertains expeditions of all lengths, although it follows the longer ones less often, and it doesn’t resolutely commit in advance to seeing the expedition through.
This is a very human approach to information acquisition. When we spot an opportunity to learn something about our natural environment, we feel inquisitive. We get distracted. We are inclined to check it out, even if we don’t see directly in advance how this information might help us better achieve our goals. Moreover, if we can tell that the opportunity to learn something requires a longer term project, we may find ourselves less inquisitive.
For the class of computable environments (stochastic environments that follow a computable probability distribution), it was previously unknown whether any policy could achieve strong asymptotic optimality (convergence of the value to optimality with probability 1). Lattimore and Hutter [Lattimore and Hutter2011] showed that no deterministic policy could achieve this. The key advantage that stochastic policies have is that they can let the exploration probability go to 0 while still exploring infinitely often.
There is a weaker notion of optimality, “weak asymptotic optimality,” for which positive results already exist; this condition requires that the average value over the agent’s lifetime approach optimality. Lattimore and Hutter [Lattimore and Hutter2011] identified a weakly asymptotically optimal agent for deterministic computable environments; the agent maintains a list of environments consistent with its observations, exploiting as if it is in the first such one, and exploring in bursts. A recent algorithm for a Thompson Sampling Bayesian agent was shown, with an elegant proof, to be weakly asymptotically optimal in all computable environments, but not strongly asymptotically optimal [Leike et al.2016].
In Section 2, we formally describe the RL setup and present notation. In Section 3, we present the algorithm for Inq. In Section 4, we prove our main result: that Inq is strongly asymptotically optimal. In Section 5, we present experimental results comparing Inq to weakly asymptotically optimal agents. Finally, we discuss the relevance of this exploration regime to tractable algorithms. Appendix A collates notation and definitions for quick reference. Appendix B contains the proofs of the lemmas.
2 Notation
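We follow the notation in [Orseau et al.2013]. The reinforcement learning setup is as follows: $\mathcal{A}$ is a finite set of actions available to the agent; $\mathcal{O}$ is a finite set of observations it might observe, and $\mathcal{R}$ is the set of possible rewards. The set of all possible interactions in a timestep is $\mathcal{H} := \mathcal{A} \times \mathcal{O} \times \mathcal{R}$. A reinforcement learner’s policy $\pi$ is a stochastic function which outputs an action given an interaction history, denoted by $\pi : \mathcal{H}^* \rightsquigarrow \mathcal{A}$. ($\mathcal{X}^* := \bigcup_{i=0}^{\infty} \mathcal{X}^i$ represents all finite strings from an alphabet $\mathcal{X}$.) An environment $\nu$ is a stochastic function which outputs an observation and reward given an interaction history and an action: $\nu : \mathcal{H}^* \times \mathcal{A} \rightsquigarrow \mathcal{O} \times \mathcal{R}$.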
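A policy and an environment induce a probability measure over $\mathcal{H}^\infty$, the set of all possible infinite histories. For $h \in \mathcal{H}^*$, $P^\pi_\nu(h)$ denotes the probability that an infinite history begins with $h$ when actions are sampled from the policy $\pi$, and observations and rewards are sampled from the environment $\nu$. Formally, we define this inductively: $P^\pi_\nu(\epsilon) := 1$, where $\epsilon$ is the empty history, and for $h \in \mathcal{H}^*$, $a \in \mathcal{A}$, $o \in \mathcal{O}$, $r \in \mathcal{R}$, we define $P^\pi_\nu(h a o r) := P^\pi_\nu(h)\, \pi(a \mid h)\, \nu(o r \mid h a)$. In an infinite history $h_{1:\infty} \in \mathcal{H}^\infty$, $a_t$, $o_t$, and $r_t$ refer to the $t$th action, observation and reward, and $h_t$ refers to the $t$th timestep: $a_t o_t r_t$. $h_{<t}$ refers to the first $t-1$ timesteps, and $h_{t:k}$ refers to the string of timesteps $t$ through $k$ (inclusive). Strings of actions, observations, and rewards are notated similarly.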
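As a minimal sketch of the inductive definition above, the probability of a finite history can be computed by multiplying, timestep by timestep, the policy's probability of the action and the environment's probability of the observation-reward pair. The names `history_probability`, `uniform_policy`, and `coin_env`, and the toy two-action environment, are illustrative assumptions and do not come from the paper.

```python
from typing import Callable, Sequence, Tuple

Step = Tuple[str, str, float]   # one timestep: (action, observation, reward)
History = Tuple[Step, ...]


def history_probability(
    history: Sequence[Step],
    policy: Callable[[History, str], float],
    environment: Callable[[History, str, str, float], float],
) -> float:
    """Probability that an interaction begins with `history`.

    The empty history has probability 1; each further timestep multiplies in
    policy(action | past) and environment(observation, reward | past, action).
    """
    prob = 1.0
    for i, (action, obs, reward) in enumerate(history):
        past = tuple(history[:i])
        prob *= policy(past, action) * environment(past, action, obs, reward)
    return prob


def uniform_policy(past: History, action: str) -> float:
    """Hypothetical policy: picks "left" or "right" with equal probability."""
    return 0.5 if action in ("left", "right") else 0.0


def coin_env(past: History, action: str, obs: str, reward: float) -> float:
    """Hypothetical environment: always observes "x"; reward 1 is more likely after "left"."""
    if obs != "x":
        return 0.0
    p_reward = 0.8 if action == "left" else 0.3
    if reward == 1.0:
        return p_reward
    if reward == 0.0:
        return 1.0 - p_reward
    return 0.0


example = [("left", "x", 1.0), ("right", "x", 0.0)]
print(history_probability(example, uniform_policy, coin_env))
```

The history dependence of both functions is kept in the signatures even though the toy policy and environment ignore it, since the setting described here makes no Markov assumption.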
A Bayesian agent considers a class of environments $\mathcal{M}$ a priori feasible. Its “beliefs” take the form of a probability distribution over which environment is the true one. We call this the agent’s belief distribution. In our formulation, Inq considers any computable environment feasible, and starts with a prior belief distribution based on the environments’ Kolmogorov complexities: that is, the length of the shortest program that computes the environment on some reference machine. However, all our results hold as long as the true environment is contained in the class of environments that are considered feasible, and as long as the prior belief distribution assigns nonzero probability to each environment in the class. We take $\mathcal{M}$ to be the class of all computable environments, and $w(\nu) := 2^{-K(\nu)}$ to be the prior probability of the environment $\nu$, where $K$ is the Kolmogorov complexity. A smaller class and a different prior probability could easily be substituted for $\mathcal{M}$ and $w$.
We use $\xi$ to denote the agent’s beliefs about future observations. Together with a policy $\pi$ it defines a Bayesian mixture measure: $P^\pi_\xi(\cdot) := \sum_{\nu \in \mathcal{M}} w(\nu) P^\pi_\nu(\cdot)$. The posterior belief distribution of the agent after observing a history $h_{<t}$ is $w(\nu \mid h_{<t}) := w(\nu) P^\pi_\nu(h_{<t}) / P^\pi_\xi(h_{<t})$. This definition is independent of the choice of $\pi$ as long as $P^\pi_\xi(h_{<t}) > 0$; we can fix a reference policy just for this definition if we like. We sometimes also refer to the conditional distribution $\xi(o_t r_t \mid h_{<t} a_t) := \sum_{\nu \in \mathcal{M}} w(\nu \mid h_{<t})\, \nu(o_t r_t \mid h_{<t} a_t)$.
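The posterior update described above can be illustrated with a small sketch over a finite, hand-picked class of environments. This is an assumption made for illustration only: the prior used in the paper is based on Kolmogorov complexity, which is not computable, so the weights below are simply made-up numbers. The `likelihood` values stand for the probability each environment assigns to the observations and rewards seen so far; the policy's action probabilities appear in both numerator and denominator and cancel, which is why they are omitted.

```python
from typing import Dict


def posterior(prior: Dict[str, float], likelihood: Dict[str, float]) -> Dict[str, float]:
    """Bayes' rule over a finite class of named environments.

    prior[name]      -- prior weight of the environment (hypothetical values)
    likelihood[name] -- probability that environment assigns to the observed
                        observations and rewards, given the actions taken
    """
    evidence = sum(prior[name] * likelihood[name] for name in prior)
    if evidence <= 0.0:
        raise ValueError("the observed history has zero probability under the mixture")
    return {name: prior[name] * likelihood[name] / evidence for name in prior}


# Three hypothetical environments; "C" is ruled out by the observed history.
prior = {"A": 0.5, "B": 0.25, "C": 0.25}
likelihood = {"A": 0.1, "B": 0.4, "C": 0.0}
beliefs = posterior(prior, likelihood)
print(beliefs)
```

An environment that assigns zero probability to the observed history receives zero posterior weight and can never regain it, while the true environment, which always assigns positive probability to what is actually observed, is never eliminated.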
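The agent’s discount at a timestep $t$ is denoted $\gamma_t$. To normalize the agent’s value to $[0, 1]$, we introduce $\Gamma_t := \sum_{k=t}^{\infty} \gamma_k$. We consider an agent with a bounded horizon: $\forall \varepsilon > 0 \ \exists m \ \forall t : \Gamma_{t+m} / \Gamma_t \le \varepsilon$. Intuitively, this means that the agent does not become more and more farsighted over time. A classic discount function giving a bounded horizon is a geometric one: for $\gamma \in (0, 1)$, $\gamma_t = \gamma^t$. The value of a policy $\pi$ in an environment $\nu$, given a history $h_{<t}$, is
$$V^{\pi}_{\nu}(h_{<t}) := \frac{1}{\Gamma_t}\, \mathbb{E}^{\pi}_{\nu}\left[ \sum_{k=t}^{\infty} \gamma_k r_k \,\middle|\, h_{<t} \right] \tag{1}$$
Here, the expectation is with respect to the probability measure $P^\pi_\nu$. Reinforcement Learning is the attempt to find a policy that makes this value high, without access to $\nu$.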
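To make the normalization and the bounded-horizon condition concrete, here is a small sketch for geometric discounting. It assumes that the normalizer at time t is the tail sum of the discounts from t onward, which for a geometric discount has a closed form. The function names are hypothetical, and `normalized_return` scores a single finite reward sequence rather than taking the expectation that defines the value.

```python
from typing import Sequence


def tail_normalizer(gamma: float, t: int) -> float:
    """Closed form of gamma^t + gamma^(t+1) + ... for 0 < gamma < 1."""
    return gamma ** t / (1.0 - gamma)


def normalized_return(rewards: Sequence[float], gamma: float, t: int = 1) -> float:
    """Discounted sum of rewards received at times t, t+1, ..., divided by the normalizer.

    With rewards in [0, 1] the result lies in [0, 1]. Rewards beyond the end
    of the sequence are treated as 0 (a truncation made for this sketch).
    """
    discounted = sum(gamma ** (t + i) * r for i, r in enumerate(rewards))
    return discounted / tail_normalizer(gamma, t)


def horizon_ratio(gamma: float, t: int, m: int) -> float:
    """Share of the remaining discount mass at time t that lies m or more steps ahead."""
    return tail_normalizer(gamma, t + m) / tail_normalizer(gamma, t)


print(normalized_return([1.0] * 2000, gamma=0.9))   # close to 1: maximal reward throughout
print(horizon_ratio(0.9, t=5, m=10), horizon_ratio(0.9, t=50, m=10))
```

For a geometric discount the ratio depends only on `m` and not on `t`, which is the sense in which the horizon is bounded: a single look-ahead length captures all but a fixed fraction of the remaining discount mass at every timestep.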
3 Inquisitive Reinforcement Learner
We first describe how Inq exploits, then how it explores. It exploits by maximizing the discounted sum of its reward in expectation over its current beliefs, and it explores by following maximally informative “exploratory expeditions” of various lengths.
The optimal policy with respect to an environment $\nu$ is the policy that maximizes the value.
$$\pi^*_\nu := \operatorname*{argmax}_{\pi \in \Pi} V^{\pi}_{\nu}(h_{<t}) \tag{2}$$
where $\Pi$ is the space of all policies. An optimal deterministic policy always exists [Lattimore and Hutter2014b]. When exploiting, Inq simply maximizes the value according to its belief distribution $\xi$. Since this policy is deterministic, we write $a^*(h_{<t})$ to mean the unique action at time $t$ for which $\pi^*_\xi(a^*(h_{<t}) \mid h_{<t}) = 1$. That is the exploitative action.
The most interesting feature of Inq is how it gets distracted by the opportunity to explore. Inq explores to learn. An agent has learned from an observation if its belief distribution changes significantly after making that observation. If the belief distribution has hardly changed, then the observation was not very informative. The typical information-theoretic measure for how well a distribution $Q$ approximates a distribution $P$ is the KL-divergence, $\mathrm{KL}(P \,\|\, Q)$. Thus, a principled way to quantify the information that an agent gains in a timestep is the KL-divergence between the belief distribution at time $t$ and the belief distribution at time $t+1$. This is the rationale behind the construction of the Knowledge Seeking Agent of Orseau et al. [Orseau et al.2013], which maximizes this expected information gain.
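Letting $h_{<t} \in \mathcal{H}^{t-1}$ and $h' \in \mathcal{H}^{m}$, the information gain at time $t$ is defined:
$$\mathrm{IG}(h' \mid h_{<t}) := \sum_{\nu \in \mathcal{M}} w(\nu \mid h_{<t} h') \log \frac{w(\nu \mid h_{<t} h')}{w(\nu \mid h_{<t})} \tag{3}$$
Recall that $w(\nu \mid h_{<t})$ is the posterior probability assigned to $\nu$ after observing $h_{<t}$.
An $m$-step expedition, denoted $\alpha_m$, represents all contingencies for how an agent will act for the next $m$ timesteps. It is a deterministic policy that takes history-fragments of length less than $m$ and returns an action:
$$\alpha_m : \bigcup_{i=0}^{m-1} \mathcal{H}^i \to \mathcal{A} \tag{4}$$
$P^{\alpha_m}_\xi(h_{t:t+m-1} \mid h_{<t})$ is a conditional distribution defined for $h_{t:t+m-1} \in \mathcal{H}^m$, which represents the conditional probability of observing $h_{t:t+m-1}$ if the expedition $\alpha_m$ is followed starting at time $t$, after observing $h_{<t}$. Now we can consider the information-gain value of an $m$-step expedition. It is the expected information gain upon following that expedition:
$$V^{\mathrm{IG}}(\alpha_m, h_{<t}) := \sum_{h_{t:t+m-1} \in \mathcal{H}^m} P^{\alpha_m}_\xi(h_{t:t+m-1} \mid h_{<t})\, \mathrm{IG}(h_{t:t+m-1} \mid h_{<t}) \tag{5}$$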
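The expected information gain of a single expedition can be sketched for a finite class of environments as follows. This is an illustrative simplification rather than the paper's construction: the expedition is fixed in advance, each environment is summarized by the distribution it induces over the expedition's possible outcomes, and the information gain of an outcome is taken to be the KL-divergence of the posterior from the current beliefs (the direction is an assumption of this sketch). All names and numbers are hypothetical.

```python
import math
from typing import Dict


def information_gain(before: Dict[str, float], after: Dict[str, float]) -> float:
    """KL-divergence of the updated beliefs `after` from the beliefs `before`."""
    return sum(q * math.log(q / before[name]) for name, q in after.items() if q > 0.0)


def expected_information_gain(
    weights: Dict[str, float],
    outcome_probs: Dict[str, Dict[str, float]],
) -> float:
    """Expected information gain of one fixed expedition.

    weights[name]          -- current belief in each environment
    outcome_probs[name][h] -- probability, under that environment, that
                              following the expedition produces outcome h
    """
    outcomes = set().union(*(dist.keys() for dist in outcome_probs.values()))
    total = 0.0
    for h in outcomes:
        # Probability of the outcome under the Bayesian mixture.
        mix = sum(weights[n] * outcome_probs[n].get(h, 0.0) for n in weights)
        if mix == 0.0:
            continue
        after = {n: weights[n] * outcome_probs[n].get(h, 0.0) / mix for n in weights}
        total += mix * information_gain(weights, after)
    return total


beliefs = {"A": 0.5, "B": 0.5}
# An expedition whose outcome reveals the environment with certainty ...
revealing = {"A": {"1": 1.0}, "B": {"0": 1.0}}
# ... and one on which both environments behave identically.
uninformative = {"A": {"1": 0.5, "0": 0.5}, "B": {"1": 0.5, "0": 0.5}}
print(expected_information_gain(beliefs, revealing))
print(expected_information_gain(beliefs, uninformative))
```

With two equally weighted environments, a fully revealing expedition is worth log 2 nats, and an expedition on which the environments agree is worth nothing, so an agent that explores in proportion to this quantity ignores the second expedition entirely.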
At a time $t$, one might consider many expeditions: the one-step expedition which maximizes expected information gain, the two-step expedition doing the same, etc. Or one might consider carrying on with an expedition that began three timesteps ago.
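Definition 1.
At time $t$, the $m$-$k$ expedition is the $m$-step expedition beginning at time $t-k$ which maximized the expected information gain from that point. (Ties in the argmax are broken arbitrarily.)
$$\alpha^{\mathrm{IG}}_{m,k}(h_{<t}) := \operatorname*{argmax}_{\alpha_m} V^{\mathrm{IG}}(\alpha_m, h_{<t-k}) \tag{6}$$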
Example expeditions are diagrammed in Figure 1.
Expeditions are functions which return an action given what’s been seen so far on the expedition. The $m$-$k$ exploratory action is the action to take at time $t$ according to the $m$-$k$ expedition:
$$a^{\mathrm{IG}}_{m,k}(h_{<t}) := \alpha^{\mathrm{IG}}_{m,k}(h_{<t})(h_{t-k:t-1}) \tag{7}$$
Naturally, this is only defined for $k < m$, since the expedition function can’t accept a history fragment of length $m$ or more, and $m$ must be positive. Note also that if $k = 0$, $h_{t-k:t-1}$ evaluates to the empty string, $\epsilon$.
Definition 2.
Let $\rho(h_{<t}, m, k)$ be the probability of taking the $m$-$k$ exploratory action after observing a history $h_{<t}$.
(8) 
where is an exploration constant.
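This makes the total probability of exploration, summed over every valid pair $(m, k)$ with $0 \le k < m$:
$$\beta(h_{<t}) := \sum_{m \in \mathbb{N}} \sum_{k < m} \rho(h_{<t}, m, k) \tag{9}$$
The feature that makes Inq inquisitive is that $\rho(h_{<t}, m, k)$, the probability of taking the $m$-$k$ exploratory action, is proportional to the expected information gain from the $m$-$k$ expedition. Note that completing an $m$-step expedition requires randomly deciding to explore in that way on $m$ separate occasions. While this may seem inefficient, if the agent always got boxed into long expeditions, the value of its policy would plummet infinitely often.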
Finally, Inq’s policy, defined in Algorithm 1, takes the $m$-$k$ exploratory action with probability $\rho(h_{<t}, m, k)$, and takes the exploitative action otherwise. (Footnote 3: This algorithm is written in a simplified way that does not halt, but if a real number in $[0, 1]$ is sampled first, the actions can be assigned to disjoint intervals successively until the sampled real number lands in one of them.)
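The interval-based sampling mentioned in the footnote can be sketched as follows. This is a hedged illustration, not the paper's Algorithm 1: the exploration probabilities are supplied by a hypothetical callable `rho(m, k)` (the dependence on the history is suppressed), and the enumeration is cut off at a hypothetical maximum expedition length `max_m`, a truncation introduced only so that the sketch always terminates.

```python
from typing import Callable, Hashable


def select_action(
    u: float,
    rho: Callable[[int, int], float],
    exploratory_action: Callable[[int, int], Hashable],
    exploit_action: Hashable,
    max_m: int,
) -> Hashable:
    """Pick an action from a single uniform sample u in [0, 1).

    Each pair (m, k) with 0 <= k < m is assigned the next disjoint interval of
    width rho(m, k). If u lands in that interval, the corresponding
    exploratory action is returned; if u lies beyond all of them, the
    exploitative action is returned.
    """
    upper = 0.0
    for m in range(1, max_m + 1):
        for k in range(m):
            upper += rho(m, k)
            if u < upper:
                return exploratory_action(m, k)
    return exploit_action


# Hypothetical exploration probabilities: 0.1 for each pair with m <= 2.
toy_rho = lambda m, k: 0.1 if m <= 2 else 0.0
label = lambda m, k: ("explore", m, k)

print(select_action(0.05, toy_rho, label, "exploit", max_m=5))
print(select_action(0.25, toy_rho, label, "exploit", max_m=5))
print(select_action(0.90, toy_rho, label, "exploit", max_m=5))
```

In practice `u` would be drawn once per timestep, for example with `random.random()`. Drawing it before enumerating the pairs is what lets the search stop as soon as the running total exceeds `u`, rather than requiring every exploration probability to be computed first.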
4 Strong Asymptotic Optimality
Here we present our central result: that the value of Inq’s policy approaches the optimal value. We present the theorem, motivate the result, and proceed to the proof. We recommend that the reader keep Appendix A at hand for quickly looking up definitions and notation.
Before presenting the theorem, we clarify an assumption, and define the optimal value. We call the true environment $\mu$, and we assume that $\mu \in \mathcal{M}$. For the class of computable environments, this is a very unassuming assumption. The optimal value is simply the value of the optimal policy with respect to the true environment:
$$V^*_\mu(h_{<t}) := V^{\pi^*_\mu}_{\mu}(h_{<t}) \tag{10}$$
Recall also that we have assumed the agent has a bounded horizon in the sense that $\forall \varepsilon > 0 \ \exists m \ \forall t : \Gamma_{t+m} / \Gamma_t \le \varepsilon$. The Strong Asymptotic Optimality theorem is that under these conditions, the value of Inq’s policy approaches the optimal value with probability 1, when actions are sampled from Inq’s policy and observations and rewards are sampled from the true environment $\mu$.
Theorem 3 (Strong Asymptotic Optimality).
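As $t \to \infty$, the value of Inq’s policy converges to the optimal value $V^*_\mu(h_{<t})$ with probability 1, where $\mu$ is the true environment.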
For a Bayesian agent, uncertainty about on-policy observations goes to 0. Since “on-policy” for Inq includes, with some probability, all maximally informative expeditions, Inq eventually has little uncertainty about the result of any course of action, and can therefore successfully select the optimal course. For any fixed horizon, Inq’s mixture measure $\xi$ approaches the true environment $\mu$.
We use the following notation for a particular KL-divergence that plays a central role in the proof:
(11) 
This quantifies the difference between the expected observations of two different environments that would arise in the next timesteps when following policy . denotes the limit of the above as , which exists by [Orseau et al.2013, proof of Theorem 3].
In dealing with the KL-divergence, we simplify matters by adopting the conventions $0 \log 0 := 0$ and $0 \log \frac{0}{0} := 0$.
We begin with a lemma that equates the information gain value of an expedition with the expected prediction error. The KL-divergence on the right-hand side represents how different $\nu$ and $\xi$ appear when following the expedition in question.
Lemma 4.
Proof.
This result is shown in [Orseau et al.2013, Eq. 4]. ∎
We would like to use the fact that Bayesians become good at on-policy prediction to bound Inq’s uncertainty regarding potential expeditions. Thus we have,
Lemma 5.
The left-hand side is the on-policy prediction error, and will be pushed down to 0, and the right-hand side is the prediction error for a given expedition times the probability of seeing that expedition through. The proofs of the lemmas are in Appendix B.
Recall that $w(\nu \mid h_{<t})$ is the posterior weight that Inq assigns to the environment $\nu$ after observing $h_{<t}$. Another piece of notation we need is the infimum of Inq’s belief that $\mu$ is the true environment:
We show that this is strictly positive with probability 1.
Lemma 6.
Next we show that the exploration probability goes to 0. From here, all “w.p.1” statements mean with probability 1, if not otherwise specified.
Lemma 7.
The next lemma states that the expectation of a certain sum is finite: the sum over time of the inaccuracy (measured in a certain way) of $\xi$’s forecasts regarding the outcome of the $m$-step expedition that begins at that time. This holds even though most of those expeditions are not followed. On a first reading of the lemma, ignore the quantifiers and the conditional.
Lemma 8.
:
The proof roughly follows this argument: if all exploration probabilities go to 0, then the informativeness of the maximally informative expeditions goes to 0, so the informativeness of all expeditions goes to 0, meaning the prediction error as measured by the KL-divergence goes to 0.
Lemma 9 shows that the probabilities assigned by $\xi$ converge to those of $\mu$. Lemma 8 showed this as well in a way, but in a less usable form.
Lemma 9.
As , , , :
Finally, we prove the Strong Asymptotic Optimality Theorem.
Proof of Theorem 3.
Let $\varepsilon > 0$. Since the agent has a bounded horizon, there exists an $m$ such that for all $t$, $\Gamma_{t+m} / \Gamma_t \le \varepsilon$. Recall
(12) 
Let
(13) 
Since ,
(14) 
Following from that we get,
(15) 
, , , and all hold with probability 1. follows from Lemma 9: for all , for all conditional probabilities of histories of length , with probability 1, and the countable sum is bounded (by ). follows from adding more nonnegative terms to the sum. follows being the optimal policy, and therefore it accrues at least as much expected reward in environment as does (see the definition of ). follows from , and . follows from Lemma 9 just as did. follows because the product in the denominator is the probability that mimics for consecutive timesteps, and by Lemma 7 there is a time after which this probability is uniformly strictly positive. follows from Lemma 7: with probability 1. follows from adding more nonnegative terms to the sum. Finally, follows from the value being normalized to by .
. Letting , we can combine the equations above to give
(16) 
Since ,
(17) 
∎
Strong Asymptotic Optimality is not a guarantee of efficacy; consider an agent that “commits suicide” on the first timestep, and thereafter receives a reward of 0 no matter what it does. This agent is asymptotically optimal, but not very useful. In general, when considering many environments with many different “traps,” bounded regret is impossible to guarantee [Hutter2005], but one can still demand from a reinforcement learner that it make the best of whatever situation it finds itself in by correctly identifying (in the limit) the optimal policy.
We suspect that strong asymptotic optimality would not hold if Inq had an unbounded horizon, since its horizon of concern may grow faster than it can learn about progressively more long-term dynamics of the environment. Indeed, we tentatively suspect that it is impossible for an agent with an unbounded time horizon to be strongly asymptotically optimal in the class of all computable environments. If that is true, then the assumptions that our result relies on (namely that the true environment is computable, and the agent has a bounded horizon) are the bare minimum for strong asymptotic optimality to be possible.
Inq is not computable; in fact, no computable policy can be strongly asymptotically optimal in the class of all computable environments (Lattimore and Hutter [Lattimore and Hutter2011] show this for deterministic policies, but a simple modification extends this to stochastic policies). For many smaller environment classes, however, Inq would be computable, for example if $\mathcal{M}$ is finite, and perhaps for decidable $\mathcal{M}$ in general. The central result, that inquisitiveness is an effective exploration strategy, applies to any Bayesian agent.
5 Experimental Results
We compared Inq with other known weakly asymptotically optimal agents, Thompson sampling and BayesExp [Lattimore and Hutter2014a], in the gridworld environment using AIXIjs [Aslanides2017] which has previously been used to compare asymptotically optimal agents [Aslanides et al.2017]. We tested in gridworlds, and gridworlds, both with a single dispenser with probability of dispensing reward . Following the conventions of [Aslanides et al.2017] we averaged over 50 simulations, used discount factor , 600 MCTS samples, and planning horizon of 6. The code used for this experiment is available online at https://github.com/ejcatt/aixijs, and this version of Inq can be run in the browser at https://ejcatt.github.io/aixijs/demo.html#inq.
In the gridworlds Inq performed comparably to both BayesExp and Thompson sampling. However, in the gridworlds Inq performed comparably to BayesExp, and outperformed Thompson sampling.
6 Conclusion
We have shown that it is possible for an agent with a bounded horizon to be strongly asymptotically optimal in the class of all computable environments. No existing RL agent has as strong an optimality guarantee as Inq. The nature of the exploration regime that accomplishes this is perhaps of wider interest. We formalize an agent that gets distracted from reward maximization by its inquisitiveness: the more it expects to learn from an expedition, the more inclined it is to take it.
We have confirmed experimentally that inquisitiveness is a practical and effective exploration strategy for Bayesian agents with manageable model classes.
There are two main avenues for future work we would like to see. The first regards possible extensions of inquisitiveness: we have defined inquisitiveness for Bayesian agents with countable model classes, but inquisitiveness could also be defined for a Bayesian agent with a continuous model class, such as a Q-learner using a Bayesian Neural Network. The second avenue regards the theory of strong asymptotic optimality itself: is Inq strongly asymptotically optimal for more farsighted discounters? If not, can it be modified to accomplish that? Or is it indeed impossible for an agent with an unbounded horizon to be strongly asymptotically optimal in the class of computable environments? Answers to these questions, besides being interesting in their own right, will likely inform the design of tractable exploration strategies, in the same way that this work has done.
References

 [Aslanides et al.2017] John Aslanides, Jan Leike, and Marcus Hutter. Universal reinforcement learning algorithms: survey and experiments. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 1403–1410. AAAI Press, 2017.
 [Aslanides2017] John Aslanides. AIXIjs: A software demo for general reinforcement learning. arXiv preprint arXiv:1705.07615, 2017.
 [Hessel et al.2018] M. Hessel, J. Modayil, H. van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver. Rainbow: Combining improvements in deep reinforcement learning. In Proc. of AAAI Conference on Artificial Intelligence, 2018.
 [Hutter2005] Marcus Hutter. Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability. Springer, Berlin, 2005.
 [Lattimore and Hutter2011] Tor Lattimore and Marcus Hutter. Asymptotically optimal agents. In Proc. 22nd International Conf. on Algorithmic Learning Theory (ALT’11), volume 6925 of LNAI, pages 368–382, Espoo, Finland, 2011. Springer.
 [Lattimore and Hutter2014a] Tor Lattimore and Marcus Hutter. Bayesian reinforcement learning with exploration. In International Conference on Algorithmic Learning Theory, pages 170–184. Springer, 2014.
 [Lattimore and Hutter2014b] Tor Lattimore and Marcus Hutter. General time consistent discounting. Theoretical Computer Science, 519:140–154, 2014.
 [Leike et al.2016] Jan Leike, Tor Lattimore, Laurent Orseau, and Marcus Hutter. Thompson sampling is asymptotically optimal in general environments. In Proc. 32nd International Conf. on Uncertainty in Artificial Intelligence (UAI’16), pages 417–426, New Jersey, USA, 2016. AUAI Press.
 [Mnih et al.2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

 [Mnih et al.2016] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.
 [Orseau et al.2013] Laurent Orseau, Tor Lattimore, and Marcus Hutter. Universal knowledge-seeking agents for stochastic environments. In Proc. 24th International Conf. on Algorithmic Learning Theory (ALT’13), volume 8139 of LNAI, pages 158–172, Singapore, 2013. Springer.
 [Silver et al.2017] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017.
 [Whittle1979] Peter Whittle. Discussion of Dr Gittins’ paper. Journal of the Royal Statistical Society, 41:164–177, 1979.
Appendix A: Definitions and Notation – Quick Reference
Appendix B: Proofs of Lemmas
Lemma 5.
Proof.
(a) is the definition of this KL-divergence. (b) follows from conditional probabilities canceling, which is nonzero for all nonzero terms in the sum. Note assigns probability to all actions in . (c) follows from expanding into the probabilities of the actions and probabilities of the observations and rewards. (d) follows because the KL-divergence is nonnegative when the first term is a measure and the second a (semi)measure, and the probability of following the action sequence under is at least the probability of exploring that way for timesteps. Note that is a semimeasure even though it is a mixture over measures because . (e) follows from . (f) applies the definition of the KL-divergence. ∎
Lemma 6. (with probability 1)
Proof.
We consider a general , inducing the measure over . Suppose . for all histories generated by . Therefore, , and . We show that this has probability .
Let . We first show that is a martingale.
By the martingale convergence theorem , for , the sample space, and some . Therefore, . ∎
Lemma 7.