1 Introduction
This article studies a fundamental question in artificial intelligence: given a set of environments, how do we define an agent that eventually acts optimally regardless of which of the environments it is in? This question relates to the even more fundamental question of what intelligence is.
[Hut05] defines an intelligent agent as one that can act well in a large range of environments. He studies arbitrary classes of environments, with particular attention to universal classes such as all computable (deterministic) environments and all lower semicomputable (stochastic) environments. He defines the AIXI agent as a Bayesian reinforcement learning agent with a universal hypothesis class and a Solomonoff prior. This agent has some interesting optimality properties. Besides maximizing expected utility with respect to the a priori distribution by design, it is also Pareto optimal and self-optimizing when this is possible for the considered class. It was, however, shown in [Ors10] that it is not guaranteed to be asymptotically optimal for all computable (deterministic) environments. [LH11a] shows that this is not surprising since, at least for geometric discounting, no agent can be. [LH11a] also shows that in a weaker (in average) sense, optimality can be achieved for the class of all computable environments using an algorithm that includes long exploration phases. Furthermore, it is simple to realize that Bayesian agents do not always achieve optimality for a finite class of deterministic environments even if all prior weights are strictly positive.
We use the principle of optimism to define an agent that, for any finite class of deterministic environments, eventually acts optimally. We extend our results to the case of finite and compact classes of stochastic environments. In the deterministic case we also prove finite error bounds. Optimism has previously been used to design exploration strategies for both discounted and undiscounted MDPs [KS98, SL05, AO06, LH12], though here we define optimistic algorithms for any finite class of environments.
Related work. Besides AIXI [Hut05], which was discussed above, [LH11a] introduces an agent that achieves asymptotic optimality in an average sense for the class of all deterministic computable environments. There is, however, no time step after which it is optimal at every time step. This is due to an infinite number of long exploration phases. We introduce an agent that, for finite classes of environments, does eventually achieve optimality at every time step. For the stochastic case, the agent achieves, with any given probability, optimality within $\varepsilon$ for any $\varepsilon > 0$. Our very simple agent relies on the principle of optimism, used previously in the restrictive MDP case with discounting [KS98, SL05, LH12] and without [AO06], instead of an indefinite number of explicitly enforced bursts of exploration. [RH08] also introduces an agent that relies on bursts of exploration with the aim of achieving asymptotic optimality. Its asymptotic optimality guarantees are restricted to a setting where all environments satisfy a certain restrictive value-preservation property. [EDKM05] studied learning in general Partially Observable Markov Decision Processes (POMDPs). Though POMDPs constitute a very general reinforcement learning setting, we are interested in agents that can be given any (deterministic or stochastic) class of environments and successfully utilize the knowledge that the true environment lies in this class.
Background. We will consider an agent [RN10, Hut05] that interacts with an environment through performing actions from a finite set and receives observations from a finite set and rewards from a finite set . Let be the set of histories and the return
with the obvious extension to infinite sequences. A function from to is called a deterministic environment (studied in Section 2). A function is called a policy or an agent. We define the value function by where the sequence are the rewards achieved by following from time step onwards in environment after having seen .
Instead of viewing the environment as a function from to we can equivalently write it as a function where we write for the function value of . It equals zero if in the first formulation is not sent to and if it is. In the case of stochastic environments, which we will study in Section 3, we instead have a function such that . Furthermore, we define where . is a probability measure over strings or sequences as will be discussed in the next section and we can define by conditioning on . We define as the expected return of policy .
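As a small illustration of how a reward sequence is valued under geometric discounting, the following sketch computes a discounted sum of rewards. The function name `discounted_return`, the convention that the first reward carries weight 1, and the example numbers are assumptions made for illustration; the article's exact normalization and indexing may differ.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * r_k over a finite reward sequence, with k starting at 0."""
    total = 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
    return total


# Hypothetical reward sequence and discount factor.
example = discounted_return([1.0, 0.0, 1.0], gamma=0.5)  # 1 + 0 + 0.25
```

Later rewards are weighted less, which is why a finite prefix of the reward sequence already determines the value up to a small error.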
A special case of an environment is a Markov Decision Process (MDP) [SB98]. This is the classical setting for reinforcement learning. In this case the environment does not depend on the full history but only on the latest observation and action and is, therefore, a function of the latest observation-action pair only. In this situation one often refers to the observations as states, since the latest observation tells us everything we need to know. There is then an optimal policy that can be represented as a function from states to actions: we only need to base our decision on the latest observation. Several algorithms [KS98, SL05, LH12] have been devised for solving discounted ($\gamma < 1$) MDPs for which one can prove PAC (Probably Approximately Correct) bounds. These are finite-time bounds that hold with high probability and depend only polynomially on the number of states, the number of actions and the discount factor. These methods rely on optimism to make the agent sufficiently explorative. Optimism roughly means that one has high expectations for what one does not yet know. Optimism was also used to prove regret bounds for undiscounted ($\gamma = 1$) MDPs in [AO06], which was extended to feature MDPs in [MMR11]. Note that these methods are restricted to MDPs, whereas we do not make any (Markov, ergodicity, stationarity, etc.) assumptions on the environments, only on the size of the class.
Outline. In this article we will define optimistic agents in a far more general setting than MDPs and prove asymptotic optimality results. The question of their mere existence is already nontrivial, hence asymptotic results deserve attention. In Section 2 we consider finite classes of deterministic environments and introduce a simple optimistic agent that is guaranteed to eventually act optimally. We also provide finite error bounds. In Section 3 we generalize to finite classes of stochastic environments and in Section 4 to compact classes.
2 Finite Classes of Deterministic Environments
Given a finite class of deterministic environments, we define an algorithm that, for any unknown environment from the class, eventually achieves optimal behavior in the sense that there exists a time step from which maximum reward is achieved onwards. The algorithm chooses an optimistic hypothesis from the class, in the sense that it picks the environment in which one can achieve the highest reward (in case of a tie, it chooses the environment that comes first in an enumeration of the class), and then follows the policy that is optimal for this environment. If this hypothesis is contradicted by the feedback from the environment, a new optimistic hypothesis is picked from the environments that are still consistent with the history so far. This technique has the important consequence that if the hypothesis is not contradicted, we are still acting optimally when optimizing for this incorrect hypothesis.
Let be the history up to time generated by policy in environment . In particular let be the history generated by Algorithm 1 (policy ) interacting with the actual “true” environment . At the end of cycle we know . An environment is called consistent with if . Let be the environments consistent with . The algorithm only needs to check whether and for each , since previous cycles ensure and trivially . The maximization in Algorithm 1 that defines optimism at time is performed over all , the set of consistent hypotheses at time , and is the class of all deterministic policies.
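The selection-and-exclusion loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the article's Algorithm 1 verbatim: environments are modeled as Python functions from (history, action) to (observation, reward), the optimal value is approximated by exhaustive search over a truncated horizon, and all names (`plan`, `run_optimistic_agent`, `env_generous`, `env_modest`) and constants are hypothetical.

```python
from itertools import product

ACTIONS = (0, 1)   # hypothetical finite action set
GAMMA = 0.5        # hypothetical discount factor
HORIZON = 6        # truncation horizon for planning (an approximation)


def plan(env, history):
    """Return (value, first_action) of the best action sequence in `env`,
    using a discounted return truncated after HORIZON steps."""
    best_value, best_action = -1.0, ACTIONS[0]
    for seq in product(ACTIONS, repeat=HORIZON):
        h, value = history, 0.0
        for k, a in enumerate(seq):
            obs, reward = env(h, a)
            value += (GAMMA ** k) * reward
            h = h + ((a, obs, reward),)
        if value > best_value:
            best_value, best_action = value, seq[0]
    return best_value, best_action


def run_optimistic_agent(env_class, true_env, steps):
    """Conservative optimistic agent: keep the current optimistic hypothesis
    until the true environment contradicts it, then select a new one."""
    consistent = list(env_class)
    history, rewards, hypothesis = (), [], None
    for _ in range(steps):
        if hypothesis is None:
            # Optimism: highest achievable value; max() keeps the first on ties.
            hypothesis = max(consistent, key=lambda e: plan(e, history)[0])
        _, action = plan(hypothesis, history)
        obs, reward = true_env(history, action)
        # Discard every environment that predicted a different outcome.
        consistent = [e for e in consistent if e(history, action) == (obs, reward)]
        if hypothesis not in consistent:
            hypothesis = None
        history = history + ((action, obs, reward),)
        rewards.append(reward)
    return rewards, consistent


def env_generous(history, action):
    # Claims that action 1 always pays reward 1.
    return (0, 1.0 if action == 1 else 0.0)


def env_modest(history, action):
    # Action 0 pays 0.5, action 1 pays nothing.
    return (0, 0.5 if action == 0 else 0.0)


rewards, remaining = run_optimistic_agent(
    [env_generous, env_modest], true_env=env_modest, steps=5
)
```

In this toy class the agent first trusts the generous hypothesis, observes a contradicting reward in the first cycle, discards that hypothesis, and from then on acts optimally in the true environment.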
Theorem 1 (Optimality, Finite Deterministic Class).
If we use Algorithm 1 () in an environment , then there is such that
A key to proving Theorem 1 is time-consistency [LH11b] of geometric discounting. The following lemma tells us that if we act optimally with respect to a chosen optimistic hypothesis, it remains optimistic until contradicted.
Lemma 2 (Time-consistency).
Suppose , that we act according to from time to time and that is still consistent at time , then .
Proof.
Suppose that
for some , . It holds that
where is the accumulated
reward between and . Let be a policy
that equals from to and then equals
. It follows that
which
contradicts the assumption
.
Therefore, for all ,
.
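The identity behind this argument can be written out explicitly. The notation is an assumption made for illustration, since it is not fixed in the text above: $h_t$ is the history after $t$ cycles, $r_i$ is the reward in cycle $i$, $\gamma \in (0,1)$ is the discount factor, $V^{\pi}_{\nu}(h_t) = \sum_{i > t} \gamma^{i-t-1} r_i$ is the value of following $\pi$ in $\nu$ after $h_t$, and $t < t'$. The article's indexing may differ by a shift.

```latex
V^{\pi}_{\nu}(h_t)
  = \underbrace{\sum_{i=t+1}^{t'} \gamma^{\,i-t-1}\, r_i}_{=:\,C}
  \;+\; \gamma^{\,t'-t}\, V^{\pi}_{\nu}(h_{t'})
```

Every environment that is still consistent at time $t'$ reproduces the same rewards between $t$ and $t'$ under the same actions, so $C$ is common to all of them. A pair with strictly larger value at $h_{t'}$ would therefore, after prefixing the same actions, also have strictly larger value at $h_t$, because $\gamma^{t'-t} > 0$.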
Proof.
(Theorem 1) At time we know . If some is inconsistent with , i.e. , it gets removed, i.e. is not in for all .
Since is finite, such inconsistencies can only happen finitely often, i.e. from some onwards we have for all . Since , we know that .
Assume henceforth. The optimistic hypothesis will not change after this point. If the optimistic hypothesis is the true environment , we have obviously chosen the true optimal policy.
In general, the optimistic hypothesis is such that it will never be contradicted while actions are taken according to , hence do not change anymore. This implies
for all . The first equality follows from equals from onwards. The second equality follows from consistency of with . The third equality follows from optimism, the constancy of , , and for , and time-consistency of geometric discounting (Lemma 2). The last inequality follows from . The reverse inequality
follows from . Therefore is acting optimally at all times .
Besides the eventual optimality guarantee above, we also provide a bound on the number of time steps for which the value of following Algorithm 1 is more than a certain $\varepsilon$ less than optimal. The reason this bound holds is that such suboptimality can only occur for a certain number of time steps before a point where the current hypothesis becomes inconsistent, and the number of such inconsistency points is bounded by the number of environments.
Theorem 3 (Finite error bound).
Following (Algorithm 1),
for all but at most time steps .
Proof.
Consider the truncated value
where the sequence are the rewards achieved by following from time to in after seeing . By letting (which is positive due to negativity of both numerator and denominator) we achieve . Let be the policy-environment pair selected by Algorithm 1 in cycle .
Let us first assume , i.e. is consistent with , and hence and do not change from (inner loop of Algorithm 1). Then
Now let be the times at which the currently selected gets inconsistent with , i.e. . Therefore (only) at times , which implies except possibly for . Finally
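The role of the truncation horizon in this proof can be illustrated numerically. The sketch below assumes rewards in [0, 1], so that the discounted reward collected after h steps is at most gamma**h / (1 - gamma), and picks h so that this tail equals a target eps. The formula is one choice consistent with the remark that numerator and denominator are both negative; the function names are hypothetical.

```python
import math


def truncation_horizon(eps, gamma):
    """Horizon h with gamma**h / (1 - gamma) == eps.
    Requires 0 < gamma < 1 and 0 < eps * (1 - gamma) < 1, so both logs are negative."""
    return math.log(eps * (1.0 - gamma)) / math.log(gamma)


def tail_bound(gamma, h):
    """Largest discounted reward obtainable after h steps when rewards lie in [0, 1]."""
    return gamma ** h / (1.0 - gamma)


h = truncation_horizon(eps=0.25, gamma=0.5)   # log(0.125) / log(0.5)
tail = tail_bound(0.5, h)
```

A smaller eps or a discount factor closer to 1 increases the horizon, which is what makes the number of possibly suboptimal steps per excluded environment grow.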
We refer to the algorithm above as the conservative agent since it sticks to its model for as long as it can. The corresponding liberal agent reevaluates its optimistic hypothesis at every time step and can switch between different optimistic policies at any time. Algorithm 1 is actually a special case of this, as shown by Lemma 2. The liberal agent is really a class of algorithms, consisting of exactly those algorithms that are optimistic at every time step without further restrictions. The conservative agent is the subclass of algorithms that switch hypothesis only when the previous one is contradicted. The results for the conservative agent can be extended to the liberal one, but we omit that here for space reasons.
3 Stochastic Environments
A stochastic hypothesis may never become completely inconsistent in the sense of assigning zero probability to the observed sequence while still assigning very different probabilities than the true environment. Therefore, we exclude based on a threshold for the probability assigned to the generated history. Unlike in the deterministic case, a hypothesis can cease to be the optimistic one without having been excluded. We, therefore, only consider an algorithm that reevaluates its optimistic hypothesis at every time step. Algorithm 2 specifies the procedure and Theorem 4 states that it is asymptotically optimal.
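The threshold-based exclusion can be sketched as follows. Since Algorithm 2 is not reproduced in this text, the concrete rule is an assumption: a hypothesis is kept while its likelihood for the observed history is at least a fraction z of the largest likelihood in the class. The i.i.d. Bernoulli hypotheses, which ignore actions, and the names `bernoulli_log_likelihood` and `surviving` are hypothetical stand-ins for general stochastic environments.

```python
import math


def bernoulli_log_likelihood(p, bits):
    """Log-probability that an i.i.d. Bernoulli(p) source assigns to `bits`."""
    return sum(math.log(p if b else 1.0 - p) for b in bits)


def surviving(log_likelihoods, z):
    """Indices of hypotheses whose likelihood is at least z times the largest one."""
    best = max(log_likelihoods)
    return [i for i, ll in enumerate(log_likelihoods) if ll - best >= math.log(z)]


params = [0.2, 0.5, 0.8]                 # hypothetical finite class
bits = [1, 1, 1, 1, 0, 1, 1, 1, 0, 1]    # observed history: eight ones, two zeros
scores = [bernoulli_log_likelihood(p, bits) for p in params]
kept = surviving(scores, z=0.1)
```

Working with log-likelihoods avoids numerical underflow on long histories. A hypothesis that is kept need not be the optimistic one, which is why the optimistic choice is reevaluated at every step in the stochastic case.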
Theorem 4 (Optimality, Finite Stochastic Class).
Define by using Algorithm 2 with any threshold and a finite class of stochastic environments containing the true environment , then with probability there exists, for every , a number such that
We borrow some techniques from [Hut09], which introduced a “merging of opinions” result that generalizes the classical theorem by [BD62]. The classical result says that it is sufficient that the true measure (over infinite sequences) is absolutely continuous with respect to a chosen a priori distribution to guarantee that they will almost surely merge in the sense of total variation distance. The generalized version is given in Lemma 6. When we combine a policy with an environment by letting the actions be taken by the policy, we have defined a measure, denoted by , on the space of infinite sequences from a finite alphabet. We denote such a sample sequence by and the :th to :th elements of by . The $\sigma$-algebra is generated by the cylinder sets and a measure is determined by its values on those sets. To simplify notation in the next lemmas we will write , meaning that where and . Furthermore, .
Definition 5 (Total Variation Distance).
The total variation distance between two measures (on infinite sequences of elements from a finite alphabet) and is defined to be
where is in the previously specified $\sigma$-algebra generated by the cylinder sets.
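For intuition, the finite analogue of this definition can be computed directly. The sketch below treats distributions on a small finite set rather than measures on infinite sequences, which is a simplifying assumption; it uses the standard identity that the supremum over events of |p(A) - q(A)| equals half the sum of pointwise absolute differences, and checks it by enumerating all events. The function names are hypothetical.

```python
from itertools import combinations


def total_variation(p, q):
    """sup over events A of |p(A) - q(A)| on a finite set, computed as
    0.5 * sum_x |p(x) - q(x)|."""
    return 0.5 * sum(abs(p[x] - q[x]) for x in p)


def total_variation_bruteforce(p, q):
    """Same quantity by enumerating every event A (exponential; tiny sets only)."""
    outcomes = list(p)
    best = 0.0
    for size in range(len(outcomes) + 1):
        for event in combinations(outcomes, size):
            gap = abs(sum(p[x] for x in event) - sum(q[x] for x in event))
            best = max(best, gap)
    return best


p = {"a": 0.5, "b": 0.25, "c": 0.25}
q = {"a": 0.25, "b": 0.25, "c": 0.5}
d_fast = total_variation(p, q)
d_slow = total_variation_bruteforce(p, q)
```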
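The results from [Hut09] are based on the fact that the likelihood ratio of a candidate measure to the true measure, evaluated on the observed prefix of the sequence, is a martingale sequence and therefore converges with probability 1 [Doo53]. The crucial question is whether the limit is strictly positive or not. The following lemma shows that with probability 1 we are either in the case where the limit is 0 or in the case where the two measures, conditioned on the observed prefix, merge in total variation distance. We say that two environments merge under a policy if the total variation distance between the measures they induce under that policy, conditioned on the history, tends to zero.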
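The behavior of the likelihood ratio along a sample path can be illustrated with a simulation. The sketch assumes i.i.d. Bernoulli sources as a stand-in for general measures on sequences and tracks the logarithm of the ratio to avoid underflow; the function name, parameters and seed are hypothetical.

```python
import math
import random


def log_likelihood_ratio(p_true, p_other, steps, seed=0):
    """Log of other(w_1:t) / true(w_1:t) along one sample path drawn from
    an i.i.d. Bernoulli(p_true) source, after `steps` symbols."""
    rng = random.Random(seed)
    log_ratio = 0.0
    for _ in range(steps):
        bit = 1 if rng.random() < p_true else 0
        num = p_other if bit else 1.0 - p_other
        den = p_true if bit else 1.0 - p_true
        log_ratio += math.log(num) - math.log(den)
    return log_ratio


same = log_likelihood_ratio(0.5, 0.5, steps=100)      # identical measures
apart = log_likelihood_ratio(0.5, 0.8, steps=2000)    # distinct measures
```

When the two sources coincide the ratio stays at 1 (log ratio 0). When they differ, the log ratio drifts downwards at a rate given by the relative entropy between them, so the ratio itself tends to 0, which is the case in which the hypothesis is eventually excluded by a threshold test.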
Lemma 6 (Generalized merging of opinions [Hut09]).
For any measures and it holds that where
Lemma 7 (Value convergence for merging environments).
Given a policy and environments and it follows that
Proof.
The lemma follows from the general inequality
by inserting and and
, and using .
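An inequality of this kind can be checked numerically in the finite case. The sketch assumes distributions on a small finite set and a bounded function f, and verifies that the gap between the two expectations is at most the range of f times the total variation distance. This is a finite illustration with hypothetical names and numbers, not the article's exact statement.

```python
def total_variation(p, q):
    """Total variation distance between two distributions on the same finite set."""
    return 0.5 * sum(abs(p[x] - q[x]) for x in p)


def expectation(p, f):
    """Expected value of f under the distribution p."""
    return sum(p[x] * f[x] for x in p)


p = {"a": 0.5, "b": 0.25, "c": 0.25}
q = {"a": 0.25, "b": 0.25, "c": 0.5}
f = {"a": 1.0, "b": 0.5, "c": 0.0}       # a bounded "return" for each outcome

gap = abs(expectation(p, f) - expectation(q, f))
bound = (max(f.values()) - min(f.values())) * total_variation(p, q)
```

In this example the bound is attained, since f is largest exactly where p exceeds q and smallest where q exceeds p. With f playing the role of the bounded discounted return, merging in total variation forces the value functions of the two environments to converge.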
The following lemma replaces the property for deterministic environments that either they are consistent indefinitely or the probability of the generated history becomes 0.
Lemma 8 (Merging of environments).
Suppose we are given two environments (the true one) and and a policy (defined e.g. by Algorithm 2). Let and . Then with probability we have that
The next lemma tells us what happens after all the environments that will be removed have been removed but we state it as if this was time for notational simplicity.
Lemma 9 (Optimism is nearly optimal).
Suppose that we have a (finite or infinite) class of (possibly) stochastic environments containing the true environment . Also suppose that none of these environments are excluded at any time by Algorithm 2 () during an infinite history that has been generated by running in . Given there is such that
if
Proof.
(Theorem 4) Given a policy , let where is the true environment and where . Let the outcome sequence (the sequence ) be denoted by . It follows from Doob’s Martingale inequality [Doo53] that for all
This proves, using a union bound, that the probability of Algorithm 2 ever excluding the true environment is less than .
The limits converge almost surely, as argued before using the martingale convergence theorem. Lemma 8 tells us that any given environment (with probability one) is either eventually excluded or is permanently included and merges with the true one under . The remaining environments do, according to (and in the sense of) Lemma 8, merge with the true environment. Lemma 7 tells us that the difference between value functions (for the same policy) of merging environments converges to zero. Since there are finitely many environments and the ones that remain indefinitely in merge with the true environment under , there is for every a such that when following , it holds for all that
The proof is concluded by Lemma 9 in the case where the true environment remains indefinitely included, which happens with probability .
4 Compact Classes
In this section we discuss infinite but compact classes of stochastic environments. First note that without further assumptions, asymptotic optimality can be impossible to achieve, even for countably infinite deterministic environments [LH11a]. Here we consider classes that are compact with respect to the total variation distance, or more precisely with respect to
where is total variation distance from Section 3. An example is the class of Markov Decision Processes (or POMDPs) with a certain number of states. Algorithm 2 does need modification to achieve asymptotic optimality in the compact case. An alternative to modifying the algorithm is to be satisfied with reaching optimality within a prechosen . This can be achieved by first choosing a finite covering of with balls of total variation radius less than and use Algorithm 2 with the centers of these balls. To have an algorithm that for any eventually achieves optimality within is a more demanding task. This is because we need to be able to say that the true environment will remain indefinitely in the considered class with a given confidence. For this purpose we introduce a confidence radius inspired by MDP solving algorithms like MBIE [SL05] and UCRL [AO06]. We still use the notation as in Algorithm 2 and we define Algorithm 3 based on replacing it with a larger . If we do not do this the true environment is likely to be excluded.
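The finite-covering idea can be sketched with a greedy construction. The code takes a finite sample of candidate environments and a user-supplied distance and returns centers such that every sampled point lies within eps of some center. The one-parameter family and the absolute-difference distance used in the example are hypothetical stand-ins and are not the article's total variation metric; any metric can be plugged in.

```python
def greedy_cover(points, distance, eps):
    """Pick centers so that every point is at distance < eps from some center."""
    centers = []
    for x in points:
        # x becomes a new center only if no existing center is already close to it.
        if all(distance(x, c) >= eps for c in centers):
            centers.append(x)
    return centers


# Hypothetical one-parameter class sampled on a grid, with a stand-in metric.
grid = [i / 100 for i in range(101)]
centers = greedy_cover(grid, distance=lambda a, b: abs(a - b), eps=0.25)
covered = all(any(abs(x - c) < 0.25 for c in centers) for x in grid)
```

The finite-class algorithm could then be run on the centers only, at the price of settling for optimality up to an error that depends on eps.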
Definition 10 (Confidence radius).
We denote all environments within from by
Given we say that is a confidence radius sequence if almost surely and if the true environment is in for all with probability .
Definition 11 (Algorithm 3).
Given a class of environments that is compact in the total variation distance we define Algorithm 3 as being Algorithm 2 with replaced by
Definition 12 (Radon-Nikodym differentiable class).
Suppose that the class is such that if is the true environment, then for any policy it holds with probability one that for all , converges as to some random variables . We call such a class Radon-Nikodym (RN) differentiable. If the property holds with respect to a specific policy we say that the class is RN-differentiable with respect to .
Remark 13.
Every countable class is RN-differentiable and so is the class of MDPs with a certain number of states. The MBIE [SL05] and UCRL [AO06] algorithms are based on the fact that one can define confidence radii for MDPs, though their bounds need separate intervals for each state-action pair depending on the number of visits. For an ergodic MDP all state-action pairs will almost surely be seen infinitely often and the maximum length of those intervals will tend to zero. Therefore, one can define a radius based on this maximum length or, alternatively, one can easily allow Algorithm 3 to run with such rectangular sets instead.
Theorem 14 (Optimality, Compact Stochastic Class).
Suppose we use Algorithm 3 with threshold , a compact (in total variation) RN-differentiable class (with respect to is enough) of stochastic environments and a confidence radius sequence for . Denote the resulting policy by . If the true environment is in , then with probability there is, for every , a time such that
Lemma 15 (Uniform exclusion).
Let and where is the true environment and the policy defined by Algorithm 3. For any outcome sequence , let
For any closed subset of and for every , there is such that for every in this subset there is such that
Proof.
Since is compact and the subset in question is closed, it follows that it is also compact. Using the Arzelà-Ascoli Theorem [Rud76] we conclude that there is a subsequence such that converges uniformly to on which means that there is such that for all and we can let .
Proof.
(Theorem 14) The strategy is to use that every environment that will be excluded, and that does not lie within a certain distance of some environment that merges with the true one, will be excluded after a certain finite time. Then we can say that the remaining environments’ value functions differ at most by a certain amount and we can apply Lemma 9.
We can with probability one say that for each , it will hold that converges and each environment will be in or . is compact (in the total variation distance topology) since it is a closed subset (again in the topology defined by ) of the compact set .
For any we can do the following: For each , consider a total variation ball of radius where . Note that for all whenever . The collection of these balls induces an open cover of the compact set and it follows that there is a finite subcover. Consider the balls in this finite cover that intersect with . Let be the union of these finitely many open balls. Let . is then a closed subset of . We want to say that there is a finite time after which all environments in will have been excluded from . This happens if , defined as the union of the closed balls of radius at every point in , has been excluded from . If is large enough for , then is also a closed subset of . Lemma 15 tells us that all of the environments in will have been excluded from after a finite amount of time and, therefore, all the environments in will have been excluded from . Thus and in particular the optimistic hypothesis will be in when . Let be the optimistic hypothesis at time and the optimistic policy.
Each parameter in (and in particular ) lies within of a ball with center which lies within of a point . Hence and .
Due to the uniform merging of environments (under ) on , there is such that . We conclude that and since
From Lemma 9 we know that if we picked small enough we know that for , for all . Furthermore, by picking sufficiently small we can, for , ensure that there is such that . Given that the true environment remains indefinitely in , which happens with at least probability , it follows that
5 Conclusions
We introduced optimistic agents for finite and compact classes of arbitrary environments and proved asymptotic optimality. In the deterministic case we also bound the number of time steps for which the value of following the algorithm is more than a certain amount lower than optimal. Future work includes investigating finite-error bounds for classes of stochastic environments.
Acknowledgement. This work was supported by ARC grant DP120100950. The authors are grateful for feedback from Tor Lattimore and Wen Shao.
References
 [AO06] P. Auer and R. Ortner. Logarithmic online regret bounds for undiscounted reinforcement learning. In Proceedings of NIPS’2006, pages 49–56, 2006.
 [BD62] D. Blackwell and L. Dubins. Merging of Opinions with Increasing Information. The Annals of Mathematical Statistics, 33(3):882–886, 1962.
 [Doo53] J. Doob. Stochastic processes. Wiley, New York, NY, 1953.
 [EDKM05] E. Even-Dar, S. Kakade, and Y. Mansour. Reinforcement learning in POMDPs without resets. In Proceedings of IJCAI-05, pages 690–695, 2005.
 [Hut05] M. Hutter. Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability. Springer, Berlin, 2005.
 [Hut09] M. Hutter. Discrete MDL predicts in total variation. In Advances in Neural Information Processing Systems 22: (NIPS’2009), pages 817–825, 2009.

 [KS98] M. J. Kearns and S. Singh. Near-optimal reinforcement learning in polynomial time. In Proceedings of the International Conference on Machine Learning (ICML’1998), pages 260–268, 1998.
 [LH11a] T. Lattimore and M. Hutter. Asymptotically optimal agents. In Proc. of Algorithmic Learning Theory (ALT’2011), volume 6925 of Lecture Notes in Computer Science, pages 368–382. Springer, 2011.
 [LH11b] T. Lattimore and M. Hutter. Time consistent discounting. In Proc. 22nd International Conf. on Algorithmic Learning Theory (ALT’11), volume 6925 of LNAI, pages 383–397, Espoo, Finland, 2011. Springer, Berlin.
 [LH12] T. Lattimore and M. Hutter. PAC bounds for discounted MDPs. In Proc. 23rd International Conf. on Algorithmic Learning Theory (ALT’12), volume 7568 of LNAI, Lyon, France, 2012. Springer, Berlin.
 [MMR11] O.-A. Maillard, R. Munos, and D. Ryabko. Selecting the state-representation in reinforcement learning. In Advances in Neural Information Processing Systems 24 (NIPS’2011), pages 2627–2635, 2011.
 [Ors10] L. Orseau. Optimality issues of universal greedy agents with static priors. In Proc. of Algorithmic Learning Theory, (ALT’2010), volume 6331 of Lecture Notes in Computer Science, pages 345–359. Springer, 2010.
 [RH08] D. Ryabko and M. Hutter. On the possibility of learning in reactive environments with arbitrary dependence. Theor. C.S., 405(3):274–284, 2008.
 [RN10] S. J. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, Englewood Cliffs, NJ, 3rd edition, 2010.
 [Rud76] W. Rudin. Principles of mathematical analysis. McGraw-Hill, 1976.
 [SB98] R. Sutton and A. Barto. Reinforcement Learning. The MIT Press, 1998.

 [SL05] A. Strehl and M. Littman. A theoretical analysis of model-based interval estimation. In Proceedings of ICML 2005, pages 856–863, 2005.