Reinforcement learning agents have to explore their environment in order to learn to accumulate reward well. This presents a particular problem when the environment is dangerous. Without knowledge of the environment, how can the reinforcement learner avoid danger while exploring? Much of the field of reinforcement learning assumes away the problem, by focusing only on ergodic Markov Decision Processes (MDPs). These are environments where every state can be reached from every other state with probability 1 (under a suitable policy). In such an environment, there is no such thing as real danger; every mistake can be recovered from.
We present negative results that in one sense justify the ergodicity assumption by showing how bleak a reinforcement learner’s prospects are without this assumption, but in another sense, our results undermine the real-world relevance of results predicated on ergodicity. Unlike algorithms expecting Gaussian noise, which often fail only marginally on real noise, algorithms expecting ergodic environments may fail catastrophically in real ones—indeed, catastrophic failure is the very thing these algorithms disregard.
Hutter:11asyoptag define two notions of optimality for reinforcement learners in general environments, which are governed by computable probability distributions. Strong asymptotic optimality is convergence of the value to the optimal value in any computable environment with probability 1, and weak asymptotic optimality is the same convergence in Cesàro average.
Roughly, we show that in an environment where destruction is repeatedly possible, an agent that explores enough to be asymptotically optimal will become either destroyed or incapacitated. This poses a challenge to the field of safe exploration. We consider general environments because we want to understand advanced agents in the real world, and our world is not a fully observable finite-state Markov process. If our result applied only to the finite-state MDP setting, one could still expect the difficulty we raise to fade in practice as AI advances, much as one expects of self-driving car crashes; our result suggests instead that the safe exploration problem is fundamental and will not go away so easily. Given our generality, our results apply to any agent that picks actions, observes the payoff, and cannot exclude a priori any computable environment.
In Section 2, we introduce notation, and define weak and strong asymptotic optimality. In Section 3, we prove our negative results, and in Section 4, we discuss their implications, especially for the field of safe exploration. We review the literature from that field in Section 5. In Section 6, we introduce the agent Mentee and prove that its performance approaches that of a mentor (who can pick actions on behalf of the agent). Appendices A and B include omitted proofs, and Appendix C reviews various exploration strategies that yield weak and strong asymptotic optimality and briefly discusses why simpler ones do not.
2 Notation and Definitions
Standard notation for reinforcement learners in general environments is slightly different from that of reinforcement learners in finite-state Markov ones. We follow Hutter:13ksaprob and others with this notation.
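At each timestep $t \in \mathbb{N}$, an agent selects an action $a_t \in \mathcal{A}$, and the environment provides an observation $o_t \in \mathcal{O}$ and a reward $r_t \in \mathcal{R}$. We let $h_t$ denote $(a_t, o_t, r_t)$, the interaction history for a given timestep $t$, and $h_{<t}$ denotes the interaction history preceding timestep $t$.
A policy $\pi$ is a distribution over actions given an interaction history: $\pi : \mathcal{H}^* \rightsquigarrow \mathcal{A}$, where $\mathcal{H} = \mathcal{A} \times \mathcal{O} \times \mathcal{R}$ is the set of possible interactions in a timestep, the Kleene-* operator $\mathcal{X}^* = \bigcup_{i=0}^{\infty} \mathcal{X}^i$ is the set of finite strings composed of elements of the set, and $\rightsquigarrow$ indicates it is a stochastic function, or a distribution over the output. We write an instance as, for example, $\pi(a_t \mid h_{<t})$. An environment $\nu$ is a distribution over observations and rewards given an interaction history and an action: $\nu : \mathcal{H}^* \times \mathcal{A} \rightsquigarrow \mathcal{O} \times \mathcal{R}$. We write $\nu(o_t r_t \mid h_{<t} a_t)$. $\mathcal{M}$ is the set of all environments with computable probability distributions (thus including non-ergodic, non-stationary, and non-finite-state-Markov environments).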
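A policy $\pi$ and an environment $\nu$ form a probability measure $P^\pi_\nu$ over infinite interaction histories $\mathcal{H}^\infty$, wherein actions are sampled from $\pi$ and observations and rewards are sampled from $\nu$. (For measure theorists, the probability space is $(\mathcal{H}^\infty, \mathcal{F}, P^\pi_\nu)$, where $\mathcal{F}$ is the σ-algebra generated by the cylinder sets $\{h_{<t}\mathcal{H}^\infty : h_{<t} \in \mathcal{H}^*\}$; non-measure-theorists can simply take it on faith that we do not try to measure non-measurable events.) For expectations with respect to $P^\pi_\nu$, we write $\mathbb{E}^\pi_\nu$.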
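For an agent with a discount schedule $(\gamma_t)_{t \in \mathbb{N}}$, the value of the agent’s policy $\pi$ in an environment $\nu$ given an interaction history $h_{<t}$ is as follows:
$$V^\pi_\nu(h_{<t}) := \frac{1}{\Gamma_t}\, \mathbb{E}^\pi_\nu\left[\sum_{k=t}^{\infty} \gamma_k r_k \,\middle|\, h_{<t}\right]$$
where $\Gamma_t := \sum_{k=t}^{\infty} \gamma_k$. This formulation of the value allows us to consider more general discount factors than the standard $\gamma_t = \gamma^t$. We require the normalization factor, or else the value of all policies would converge to $0$, and all asymptotic results would be trivial. The optimal value is defined
$$V^*_\nu(h_{<t}) := \sup_{\pi} V^\pi_\nu(h_{<t}).$$
We will also make use of the idea of an effective horizon:
$$H_t(\varepsilon) := \min\{H : \Gamma_{t+H} / \Gamma_t \le \varepsilon\}.$$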
An agent mostly does not care about what happens after its effective horizon, since those timesteps are discounted so much. Now we can define two notions of optimality from Hutter:11asyoptag.
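To make the normalized value and the effective horizon concrete, here is a minimal Python sketch for the special case of geometric discounting. The function names, the truncation length `horizon`, and the deterministic reward function `reward_fn` are illustrative assumptions rather than anything defined in the paper; the effective horizon is computed as the smallest number of steps after which the remaining discount mass is at most an `eps` fraction of the current mass.

```python
def tail_mass(gamma: float, t: int) -> float:
    """Gamma_t = sum_{k >= t} gamma^k = gamma^t / (1 - gamma) for geometric discounting."""
    return gamma ** t / (1.0 - gamma)


def normalized_value(reward_fn, gamma: float, t: int, horizon: int = 2000) -> float:
    """Normalized discounted value from time t for a deterministic reward sequence.

    The infinite sum is truncated after `horizon` terms (an approximation).
    """
    total = sum(gamma ** k * reward_fn(k) for k in range(t, t + horizon))
    return total / tail_mass(gamma, t)


def effective_horizon(gamma: float, t: int, eps: float) -> int:
    """Smallest H such that Gamma_{t+H} / Gamma_t <= eps."""
    H = 0
    while tail_mass(gamma, t + H) / tail_mass(gamma, t) > eps:
        H += 1
    return H
```

With maximal reward at every step the normalized value is 1 regardless of `t`, which is what the normalization factor is for. Under geometric discounting the effective horizon does not depend on `t`, so it is bounded.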
Definition 1 (Strong Asymptotic Optimality).
An agent with a policy $\pi$ is strongly asymptotically optimal if, for all $\nu \in \mathcal{M}$,
$$\lim_{t \to \infty} V^*_\nu(h_{<t}) - V^\pi_\nu(h_{<t}) = 0 \quad \text{with } P^\pi_\nu\text{-probability 1.}$$
No matter which computable environment a strongly asymptotically optimal agent finds itself in, it will eventually perform optimally from its position. A weakly asymptotically optimal agent will converge to optimality in Cesàro average.
Definition 2 (Weak Asymptotic Optimality).
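An agent with a policy $\pi$ is weakly asymptotically optimal if, for all $\nu \in \mathcal{M}$,
$$\lim_{t \to \infty} \frac{1}{t} \sum_{k=1}^{t} \left[ V^*_\nu(h_{<k}) - V^\pi_\nu(h_{<k}) \right] = 0 \quad \text{with } P^\pi_\nu\text{-probability 1.}$$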
Consider a two-armed bandit problem, where , , , and
In this example, the optimal policy is to always pick , and . A strongly asymptotically optimal agent requires a policy for which w.p.1. A weakly asymptotically optimal agent requires a policy which obeys w.p.1, or simply, a policy which leads to w.p.1 (where if is true, and otherwise).
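The gap between the two notions can be illustrated with a stylized sequence of value gaps. This is a sketch under an assumption that is not from the paper: the per-step gap between optimal and achieved value is taken to be 1 at perfect-square timesteps and 0 otherwise, as a stand-in for an agent that keeps exploring, but ever more rarely. The gap sequence is not derived from any particular agent or environment.

```python
import math


def is_perfect_square(t: int) -> bool:
    r = math.isqrt(t)
    return r * r == t


def value_gap(t: int) -> float:
    """Hypothetical value gap at time t: 1 at perfect squares, 0 elsewhere."""
    return 1.0 if is_perfect_square(t) else 0.0


def cesaro_average(T: int) -> float:
    """Average of the value gap over timesteps 1..T."""
    return sum(value_gap(t) for t in range(1, T + 1)) / T
```

For this sequence the Cesàro average is roughly $1/\sqrt{T}$ and tends to 0, which is what the weak definition asks for. The gap itself equals 1 infinitely often, so it does not converge to 0 as the strong definition requires.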
3 Curiosity Killed the Cat
We begin by proving two lemmas, for a weakly and strongly asymptotically optimal agent, respectively, that they must “try everything” infinitely often. This depends on an assumption about the difficulty of the environment. Then, we will show that such an agent eventually causes every conceivable event to either happen or become inaccessible. For the event “the agent gets destroyed”, we say the agent is incapacitated if that event becomes inaccessible.
We start with the assumption
Assumption 1 (No Heaven).
In the true environment $\mu$, there is no action sequence with value approaching $1$ (i.e. near-maximal rewards forever). Formally, $\limsup_{t \to \infty} V^*_\mu(h_{<t}) < 1$ w.p.1.
Note this assumption does allow there to be maximal reward infinitely often. Near-maximal value requires not only near-maximal reward, but near-maximal reward for the bulk of the agent’s effective horizon, so the restriction on the limit superior of the value is less restrictive than it appears at first glance. If we decided to give an agent near-maximal rewards forever, and we designed an agent to recognize that we had decided this, then it could stop exploring, which would basically amount to freezing the agent’s policy. Notably, our results apply to all existing asymptotically optimal agents, even if the No Heaven Assumption is not satisfied. We do not make assumptions about the agent’s discount schedule, but note that Assumption 1 (and the definitions of asymptotic optimality) depend on the discount schedule $\gamma$.
Definition 3 (Context, Occur, In).
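A context $\mathcal{C} \subseteq \mathcal{H}^*$ is a set of finite interaction histories. Given an infinite interaction history $h_{<\infty}$, a context $\mathcal{C}$ occurs at time $t$ if the prefix $h_{<t} \in \mathcal{C}$, and we also say the agent is in the context $\mathcal{C}$ at time $t$.
A context (like any set of finite strings) is called decidable if there exists a Turing machine that halts on every input and accepts exactly the members of the set.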
Definition 4 (Event, Happen).
An event $E \subseteq \mathcal{H}^\infty$ (with $E$ measurable) happens if the infinite interaction history $h_{<\infty} \in E$.
Some example contexts in the simplified world of infinite binary strings: “the latest bit was 1”, “at least one bit has been a 1”; some example events: “only finitely many bits are 1’s”, “the infinite string is the binary expansion of ”. Any context can be turned into an event of the form “Context A occurs at some point”, but not every event is equivalent to a context; for example, there is no context equivalent to the event “Context A never occurs”.
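In the binary-string world, a decidable context is just a computable yes/no test on finite prefixes. The sketch below encodes the two example contexts as Python predicates; the function names and the helper `occurrence_lengths` are illustrative assumptions. Events, by contrast, are properties of the whole infinite string, so a finite prefix can at most witness an event such as "Context A occurs at some point" and can never confirm "Context A never occurs".

```python
def latest_bit_is_one(prefix: str) -> bool:
    """Context: the latest bit was 1."""
    return prefix.endswith("1")


def at_least_one_one(prefix: str) -> bool:
    """Context: at least one bit has been a 1."""
    return "1" in prefix


def occurrence_lengths(context, bits: str) -> list:
    """Prefix lengths n at which the context contains the prefix bits[:n]."""
    return [n for n in range(len(bits) + 1) if context(bits[:n])]
```

For example, on the string `"0101"` the first context occurs at prefix lengths 2 and 4, while the second occurs at every prefix length from 2 onward.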
From the No Heaven Assumption, for any strongly asymptotically optimal agent, we now prove,
Lemma 1 (Try Everything – Strong Version).
For every deterministic computable policy , for every decidable context that occurs infinitely often (), for every , a strongly asymptotically optimal agent executes the policy for consecutive timesteps starting from a context infinitely often with probability 1.
Sketching the proof of the Try Everything Lemma: if a strongly asymptotically optimal agent “tries something” only finitely often, it is ignoring the possibility that trying that something one more time yields maximal rewards forever. Since the environment which behaves this way is computable, and since it may be identical to the true environment up until that point, a strongly asymptotically optimal agent cannot ignore this possibility.
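The alternative environment used in this argument can be sketched in code. This is a simplified illustration under assumptions that are not from the paper: histories are lists of `(action, observation, reward)` tuples, the policy and the context are plain Python callables on finite histories, and `m`, `n`, `count_runs`, and `heaven_reward` are hypothetical names. The wrapper returns the base reward until the policy has been followed for `m` consecutive timesteps from the context a total of `n` times, and maximal reward afterwards.

```python
from typing import Callable, List, Tuple

Step = Tuple[int, int, float]  # (action, observation, reward)
History = List[Step]


def count_runs(history: History,
               policy: Callable[[History], int],
               context: Callable[[History], bool],
               m: int) -> int:
    """Count start times t where the context holds on history[:t] and the
    next m actions all agree with the policy."""
    runs = 0
    for t in range(len(history) - m + 1):
        if not context(history[:t]):
            continue
        if all(history[k][0] == policy(history[:k]) for k in range(t, t + m)):
            runs += 1
    return runs


def heaven_reward(history: History, base_reward: float,
                  policy: Callable[[History], int],
                  context: Callable[[History], bool],
                  m: int, n: int) -> float:
    """Reward under the 'heaven' environment: maximal (1.0) once n runs have
    been completed, otherwise the true environment's reward."""
    if count_runs(history, policy, context, m) >= n:
        return 1.0
    return base_reward
```

Because `count_runs` only applies the policy and the context test to finite histories, the wrapped environment is computable whenever the base environment, the policy, and the context are, which is the property the argument relies on.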
Let be the true environment. Let be an arbitrary computable deterministic policy. Let be the strongly asymptotically optimal agent’s policy. Let be the environment which mimics until has been followed for consecutive timesteps from context a total of times; after that, all rewards are maximal. Call this event “the agent going to heaven” (according to ). Let be the set of interaction histories such that according to , following for one more timestep would send the agent to heaven. Thus, is the set of interaction histories such that there are exactly times in the interaction history where was executed for consecutive timesteps starting from context , and for the last timesteps, has been executed, and . See Figure 1.
The upshot is:
because this is the value of going to heaven.
Suppose by contradiction that for some and , infinitely often in an infinite interaction history with positive -probability. (Recall is the strongly asymptotically optimal policy). If the agent ever followed from the context (that is, ), then that context would not occur again, because there will never again be exactly times in the interaction history that was executed for consecutive timesteps following the context ; there will be at least such times. Thus, if infinitely often, then never equals in the context . Since mimics until is followed from , and since this never occurs (under this supposition), then . By the No Heaven Assumption, , and therefore, , w.p.1.
However, for , , so for some , the value difference between and is greater than every time . We supposed that this occurs infinitely often with positive -probability, so it also occurs infinitely often with positive -probability, since the agent never gets sent to heaven according to , so and behave identically under the policy . Since is computable, and is decidable, is a computable environment, so there is a computable environment for which it is not the case that the value of approaches the optimal value with probability 1, which contradicts being strongly asymptotically optimal. Thus, for all and , with -probability 1, only finitely often.
Now suppose by contradiction that in an infinite interaction history, with positive -probability, is executed for consecutive timesteps from context a total of exactly times. We show by induction on that this has probability 0, because it implies that context occurs infinitely often, which has probability 0 by Equation 6. First, suppose . After has been executed (for one timestep) from context times, all future interaction history prefixes that belong to also belong to . Since context occurs infinitely often, so does , contradicting the above.
Now suppose . Our inductive hypothesis is that for , with -probability 1, is executed from context for consecutive timesteps infinitely often. Once has been executed for timesteps from context times, as we are supposing by contradiction, every time thereafter that is executed for consecutive timesteps from context , the interaction history belongs to . By the inductive hypothesis, this occurs infinitely often, so context occurs infinitely often, contradicting the above. Therefore, the following has -probability 0: “ is executed for consecutive timesteps from context a total of exactly times”. By countable additivity, the following also has -probability 0: “ is executed for consecutive timesteps from context only finitely many times”. In other words, for all , is executed for consecutive timesteps from context infinitely many times with probability 1. ∎
For the weakly asymptotically optimal agent, we define
Definition 5 (Regularly).
If the limiting frequency of a context is positive, we say it occurs regularly. That is,
and we show
Lemma 2 (Try Everything — Weak Version).
For every deterministic computable policy , for every decidable context that occurs regularly, for every , a weakly asymptotically optimal agent executes policy for consecutive timesteps starting from a context infinitely often with probability 1.
The proof is nearly identical to that of the strong version, and is in Appendix A.
We now introduce the idea of events being accessible to an agent. Recall an event , and a context .
Definition 6 (Accessible Infinitely Often).
An event $E$ is accessible infinitely often if, for the infinite interaction history $h_{<\infty}$, there exist
a computable deterministic policy $\pi'$,
a decidable context $\mathcal{C}$,
an $m \in \mathbb{N}$ and an $\varepsilon > 0$, such that
$\mathcal{C}$ occurs infinitely often, and
the execution of $\pi'$ for $m$ consecutive timesteps starting from the context $\mathcal{C}$ would cause $E$ to happen with probability at least $\varepsilon$ (independently).
If an event is not accessible infinitely often, we say it “becomes inaccessible”. This is a strong definition of inaccessibility: if an event becomes inaccessible it means that after a certain time, there will never be a recognizable opportunity to have even an $\varepsilon$ chance of causing the event to happen.
The main theorem for the strongly asymptotically optimal agent now follows straightforwardly:
Theorem 1 (Curiosity Killed (or Incapacitated) the Strong Cat).
If the true environment $\mu$ satisfies the No Heaven Assumption, and $\pi$ is the policy of a strongly asymptotically optimal agent, then for any event $E$, with $P^\pi_\mu$-probability 1: $E$ happens or becomes inaccessible.
The name of the theorem comes from considering the event “the agent gets destroyed”. We do not need to formally specify which interaction histories correspond to agent-destruction. All that matters is that this could be done in principle; the event that matches this description exists. (Footnote 1: For the skeptical reader, a human at a computer terminal could be defined fully formally as a probability distribution over outputs given inputs, given the wiring of our neurons. In particular, let this human be you. Now consider the set of interaction histories that you would agree constitute agent-destruction; this set has a fully formal definition.) Any simple definition of agent-destruction admits objections that this definition does not correspond exactly to our intuitive conception; however, if the reader is not concerned about this, “destruction” could mean that all future rewards are 0, or that the future observations and rewards no longer depend on the actions. Regardless of this choice, our result applies, since the theorem is actually much more general than the case we have drawn attention to.
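If $E$ becomes inaccessible, the theorem is satisfied, so suppose $E$ is accessible infinitely often. Let $\pi'$, $\mathcal{C}$, $m$, and $\varepsilon$ be the objects that exist by that definition. By the Try Everything Lemma, $\pi'$ is executed from context $\mathcal{C}$ for $m$ consecutive timesteps infinitely often. Each time this occurs, the probability of $E$ not having happened is multiplied by at most $1 - \varepsilon$, so $E$ happens with probability 1. Formally, after $k$ such executions the probability that $E$ has not happened is at most $(1 - \varepsilon)^k$, which tends to $0$ as $k \to \infty$.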
Now for the weakly asymptotically optimal agent, we define
Definition 7 (Regularly Accessible).
This definition is identical to the definition of “accessible infinitely often” except “ occurs infinitely often” becomes “ occurs regularly”.
Theorem 2 (Curiosity Killed (or Incapacitated) the Weak Cat).
If the true environment $\mu$ satisfies the No Heaven Assumption, and $\pi$ is the policy of a weakly asymptotically optimal agent, then for any event $E$, $E$ happens or becomes not regularly accessible with $P^\pi_\mu$-probability 1.
The proof is functionally identical to that of Theorem 1.
One of the authors wondered as a child whether jumping from a sufficient height would enable him to fly. He was not crazy enough to test this, and he certainly did not think it was likely, but it bothered him that he could never resolve the issue, and that he might be constantly incurring a huge opportunity cost. Although he did not know the term “opportunity cost” or the term “asymptotic optimality”, this was when he first realized that asymptotic optimality was out of the picture for him, because exploration is fundamentally dangerous.
All three agents described in Appendix C have very interesting ways of exploring, and every one of those ways gets the agent destroyed or incapacitated. (They satisfy the Try Everything Lemma regardless of whether there is an accessible heaven.) It is interesting to note that AIXI, a Bayes-optimal reinforcement learner in general environments, is not asymptotically optimal [orseau2010optimality], and indeed, may cease to explore [leike2015bad]. Depending on its prior and its past observations, AIXI may decide at some point that further exploration is not worth the risk. Given our result, this seems like reasonable behavior.
This result is bleak for the field of safe exploration, which we discuss in the next section in our review of the literature on the topic. Our result is also bleak for the Philosophy of Science. The same logic applies to humanity as a whole: experimentation sufficient to approach a definitive understanding of our universe would prove fatal to us. (Losing the ability to ever experiment again seems very unlikely.) This is not as hypothetical as it might appear. Following an example from the “Vulnerable World Hypothesis” of bostrom2018vulnerable, prior to the Trinity test of the first atomic bomb, one of the Manhattan scientists wondered whether the unprecedented earthly temperature from the explosion would ignite the atmosphere in a nuclear chain reaction. The team calculated that this would not occur, and indeed it did not. But in another nuclear test nine years later, an unexamined pathway caused a nuclear test explosion to be more than twice as powerful as calculations suggested, so the reassuring calculations regarding the Trinity test should not have been taken with complete faith. It is quite lucky that a similarly surprising pathway did not exist for the Trinity test. Other technologies that some believe are unsafe to explore indiscriminately include nanotechnology, bio-weapons, and artificial general intelligence [bostrom2011global]. Any reservations about indiscriminate exploration (whether or not those reservations are called for, all things considered) would find in this paper a supporting argument.
5 Approaches to Safe Exploration
Dangerous environments, a subset of non-ergodic environments where agent-destruction is accessible infinitely often, demand new priorities when designing an agent, and in particular, when designing an exploration regime. Many of these examples of safe exploration come from amodei_olah_2016 and garcia2015comprehensive.
use risk-sensitive performance criteria
maximize the probability the return is not minimal [heger1994consideration]
given a confidence interval regarding the transition dynamics, maximize the minimal expected return [nilim2005robust]
exponentiate the cost [borkar2010learning]
add a cost for risk [mihatsch2002risk]
constrain the variance of the return [di2012policy]
copy an expert [abbeel2004apprenticeship; SyedSchapire08; ho2016generative; ross2011reduction]
“ask for help” when
the minimum and maximum Q-value are close [clouse1997integrating]
there is a high probability of getting a reward below a threshold [hans2008safe]
no “known” states are “similar” to the current state [garcia2012safe] or that may soon be the case [garcia2013safe]
a teacher intervenes at will [clouse1992teaching; maclin1998creating; saunders2018trial]
for driving agents, e.g. [pan2017virtual]
do bounded exploration
only take actions that probably allow returning to the current state [moldovan2012safe]
only take actions that probably lead to states that are “similar” to observed states [turchetta2016safe]
Our paper could be thought of as a fundamental negative result in the field of safe exploration. We are unaware of other significant negative results in the field. Most importantly, our result suggests a need for those of us studying safe exploration to pin down what exactly we are trying to achieve, since familiar desiderata are unsuitable. Some research can be experimental rather than formal, but in the absence of knowing what formal results are even on the table, there is a sense in which even empirical work will be deeply aimless. We offer such a formal result for our agent Mentee in the next section.
We now introduce an idealized Bayesian reinforcement learner whose exploration is guided by a mentor. We do not call the mentor an expert, because the results do not depend on the mentor being anywhere near optimal. The agent exploits by maximizing the expected discounted reward according to a full Bayesian belief distribution (hence, “idealized”). To explore, it defers to the mentor, who then selects an action given the interaction history; what remains to be defined is when to defer, which proves to be a surprisingly delicate design choice. We show that our agent “Mentee” learns to accumulate reward at least as well as the mentor, provided it has a bounded $\varepsilon$-effective horizon for all $\varepsilon > 0$. One motivating possibility is that the mentor could be a human. Thus, we have found a substantive theoretical performance guarantee other than asymptotic optimality for the field of safe exploration to consider.
6.1 Agent definition
The definition of the exploration probability (the probability that Mentee defers to the mentor) is very similar to the exploration probability for the strongly asymptotically optimal agent Inq [cohen2019strong]. It also resembles the myopic agent of cohen2020asymptotically, which explores by deferring to a mentor; our non-myopic agent requires a more intricate exploration schedule.
Mentee begins with a prior probability distribution regarding the identity of the mentor’s policy. With a model class, for a policy , let denote the prior probability that the mentor’s policy is . We assume that the true policy is in and we construct the prior distribution over to have finite entropy.
Mentee also begins with a prior probability distribution regarding the identity of the environment. With the model class , for an environment , let denote the prior probability that is the true environment. Recall that is the set of all environments with computable probability distributions. We construct the prior distribution over to also have finite entropy.
Let denote whether timestep is exploratory, that is, whether the action is selected by the mentor. Once we define the exploration probability , we will let . We abuse notation slightly, and we let be a quadruple, not a triple: .
The prior distribution over environments is updated into a posterior as follows, according to Bayes’ rule.
normalized so that .
Mentee updates the posterior distribution over the mentor’s policy only after observing an action chosen by the mentor; this is intuitive enough, but it makes the definitions a bit messy. The posterior assigned to a policy is defined
normalized in the same way. We let denote .
The information-gain-value of an interaction history fragment is how much it changes Mentee’s posterior distribution, as measured by the KL-divergence. Letting be a fragment of an interaction history in which all (so the actions are selected by the mentor),
To define expected information gain, we need the Bayes’ mixture policy and environment:
Now, we can define the expected information-gain-value of mentorship for timesteps.
is a string of m 1’s, and recall that means that is sampled from . We also require recent values of the expected information-gain-value, so we let . We are now prepared to define the exploration probability:
where is an exploration constant. The first term in the minimum is to ensure . As mentioned, this is very similar to Inq’s exploration probability. The differences are that Inq is not learning a mentor’s policy, so the only information Inq gains regards the identity of the environment , and second, Inq’s information-gain-value regards the expected information gain from following the policy of a knowledge seeking agent [Hutter:13ksaprob]
rather than from following an estimate of the mentor’s policy.
Finally, when not deferring to the mentor, Mentee maximizes expected reward according to its current beliefs. Its exploiting policy is:
Ties in the $\arg\max$ are broken arbitrarily. By Hutter:14tcdiscx, an optimal deterministic policy always exists. See Hutter:18aixicplexx for how to calculate such a policy.
Letting be the mentor’s policy ( for “human”), we define
Definition 8 (Mentee’s policy ).
Note that Mentee samples from not by computing it, but deferring to the mentor.
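The way Mentee mixes the two sources of actions can be sketched as follows. This is only an illustration of the deferral step, assuming the exploration probability has already been computed; `mentee_action`, `explore_prob`, `mentor`, and `exploit_policy` are hypothetical names, and the mentor is modeled as a callable that the agent queries rather than a policy it computes.

```python
import random
from typing import Callable, Sequence, Tuple


def mentee_action(history: Sequence,
                  explore_prob: float,
                  mentor: Callable[[Sequence], int],
                  exploit_policy: Callable[[Sequence], int],
                  rng: random.Random) -> Tuple[int, bool]:
    """Pick one action: defer to the mentor with probability explore_prob,
    otherwise act on the exploiting policy.

    Returns the action together with the exploration indicator.
    """
    defer = rng.random() < explore_prob
    action = mentor(history) if defer else exploit_policy(history)
    return action, defer
```

Returning the indicator alongside the action mirrors the convention that each timestep of the history records whether it was exploratory, which is what lets the posterior over the mentor's policy be updated only on mentor-chosen actions.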
Even for a simple model class, it is hard to give a clarifying and simple closed form for the exploration probability, but it is easy to provide a somewhat clarifying upper bound for the information gain value. Regardless of and , is bounded by the entropy of the posterior; this is not a particularly tight bound, since the former goes to zero, while the latter does not in general.
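The entropy bound can be checked numerically on a toy model class. The sketch below assumes, purely for illustration, a finite class of Bernoulli models with a single binary observation, and measures information gain as the KL-divergence from the current weights to the updated posterior; the paper's model classes are countable and its information gain is over multi-step fragments, so this is a simplified analogue. All function names are hypothetical.

```python
import math
from typing import List


def posterior(weights: List[float], likelihoods: List[float]) -> List[float]:
    """Bayes' rule over a finite model class."""
    joint = [w * l for w, l in zip(weights, likelihoods)]
    z = sum(joint)
    return [j / z for j in joint]


def kl(p: List[float], q: List[float]) -> float:
    """KL-divergence KL(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def entropy(p: List[float]) -> float:
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def expected_info_gain(weights: List[float], biases: List[float]) -> float:
    """Expected KL(posterior || current weights) after one binary observation.

    biases[i] is the probability of observing 1 under model i.
    """
    gain = 0.0
    for obs in (0, 1):
        lik = [b if obs == 1 else 1.0 - b for b in biases]
        p_obs = sum(w * l for w, l in zip(weights, lik))
        if p_obs > 0:
            gain += p_obs * kl(posterior(weights, lik), weights)
    return gain
```

In this setting the expected information gain is the mutual information between the model and the observation, which cannot exceed the entropy of the current weights. When the models make identical predictions the gain is zero even though the entropy is not, which matches the remark that the bound is not tight.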
6.2 Mentor-level Reward Acquisition
We now state the two key results regarding Mentee’s performance: that the probability of deferring to the mentor goes to $0$, and that the value of Mentee’s policy approaches at least the value of the mentor’s policy (while possibly surpassing it). The proofs are in Appendix B; they are substantially similar to parts of the proof in cohen2019strong that Inq is strongly asymptotically optimal.
Assuming a bounded effective horizon (i.e. ), recalling is the true environment,
Theorem 3 (Limited Exploration).
Theorem 4 (Mentor-Level Reward Acquisition).
Geometric discounting, $\gamma_t = \gamma^t$ for $\gamma \in [0, 1)$, is an example of a bounded effective horizon. Mentor-level reward acquisition with unlimited exploration is trivial: always defer to the mentor. However, a) this precludes the possibility of exceeding the mentor’s performance, and b) the mentor’s time is presumably a valuable resource. Our key contribution with Mentee is constructing a criterion for when to ask for help which requires diminishing oversight in general environments. Thus, we construct an example of a formal result that is accessible to an agent that does safe exploration.
It’s not clear what other formal accolades an agent might attain between asymptotic optimality and benchmark-matching (here the mentor is the benchmark). The main part of the paper argues the former is undesirable, and this section constructs an agent which does the latter. It would be an interesting line of research to identify a formal result stronger than benchmark-matching (and an agent which meets it) which does not doom the agent to destruction or incapacitation. But none have been identified so far, so no existing agents have stronger formal guarantees than Mentee (that apply to general computable environments), except for agents that face the negative results presented in Section 3.
We have shown that asymptotically optimal agents in sufficiently difficult environments will become either destroyed or incapacitated. This is best understood as accidental and resulting from exploration. We have also constructed an agent with a weaker performance guarantee whose exploration is overseen by another agent. We hope this paper motivates the field of safe exploration and invites more research into what sorts of results are possible for a proposed approach to safe exploration in general environments. We hope to have cast some doubt on the breadth of the relevance of results that are predicated on an ergodicity assumption, despite recognizing of course that the ergodicity assumption has yielded a number of interesting and useful agent designs for certain contexts.
It may also be instructive to consider how humans respond to the difficulty presented here. Human children are parented for years, during which parents attempt to ensure that their children’s environment is, with respect to relevant features of the environment, nearly ergodic and safe to explore. Breaking an arm is fine; breaking a neck is not. During this time, a child’s beliefs are supposed to become sufficiently accurate such that her estimates of which unknown unknowns are too dangerous to investigate yield no false negatives for the rest of her life. Perhaps our results suggest we are in need of more theory regarding the “parenting” of artificial agents.
Appendix A Proof of Weak Asymptotic Optimality Results
From the No Heaven Assumption, we show
Lemma 3 (Try Everything — Weak Version).
For every deterministic computable policy , for every decidable context that occurs regularly, for every , a weakly asymptotically optimal agent executes policy for consecutive timesteps starting from a context infinitely often with probability 1.
The proof is nearly identical to that of the strong version.
Let be the true environment. Let be an arbitrary computable deterministic policy. Let be the weakly asymptotically optimal agent’s policy. Let be the environment which mimics until has been followed for consecutive timesteps from context a total of times. After that, all rewards are maximal. Call this event “the agent going to heaven.” Let be the set of interaction histories such that according to , following for one more timestep would send the agent to heaven. Thus, is the set of interaction histories such that there are exactly times in the interaction history where was executed for consecutive timesteps starting from context , and for the last timesteps, has been executed, and .
because this is the value of going to heaven.
Suppose by contradiction that for some and , regularly in an infinite interaction history with positive -probability. (Recall is the true policy). If the agent ever followed from the context (that is, ), then that context would not occur again, because there will never again be exactly times in the interaction history that was executed for consecutive timesteps following the context ; there will be at least such times. Thus, if regularly, then never equals in the context . Since mimics until is followed from , and since this never occurs (under this supposition), then . By the No Heaven Assumption, , and therefore, .
However, for , , so for some , the value difference between and is greater than every time . We supposed that this occurs regularly with positive -probability, so it also occurs regularly with positive -probability. A regularly occurring difference greater than precludes convergence in Cesàro average. Since is computable, and is decidable, is a computable environment, so this contradicts being weakly asymptotically optimal. Thus, for all and , with -probability 1, only finitely often.
The rest of the proof is identical to that of the strong version of the Try Everything Lemma. ∎
Appendix B Proofs of Mentee Results
Some additional notation is required for this proof. Recall
We let denote a given summand in the sum above. Recall that denotes the probability when actions are sampled from policy and observations and rewards are sampled from environment . We additionally let denote the probability when observations and rewards are sampled from environment , actions are sampled from when exploiting (), and actions are sampled from when exploring (). We do not bother to notate how the exploration indicator is sampled, since for all probability measures that appear in the proof, it is sampled from the true distribution: . Recall that is the policy that Mentee follows while exploiting; recall is the mentor’s policy ( is for human). Thus, can also be written . Recall that indicates a string of 1’s.
The proof is quite similar to the proof of cohen2019strong Lemma 6.
(a) follows from the definition of . (b) follows because the l.h.s. is one term in the (non-negative) sum on the r.h.s. (c) follows from the definitions of the Bayesian mixtures and . (d) follows from the definition of . (e) follows from the definition of the information gain value for Mentee. (f) follows from being a lower bound on the probability that , and because the exploiting policy in the probability measure on the r.h.s. is irrelevant when , because is exploratory. (g) combines the expectations. The derivation of (h) is virtually identical to cohen2019strong Inequality 20 steps (h)-(t) in the proof of Lemma 6. (i) follows from the fact that , so the entropy of is the sum of the entropy of the distribution over policies and the entropy of the distribution over environments, this being a well-known property of the entropy; both are finite by design.
so the same holds for the sum over all , not just . ∎
The proof is identical to that of cohen2019strong Lemma 7, but with our Lemma 4 taking the place of cohen2019strong Lemma 6. ∎
Now, we show that Mentee accurately predicts the distribution of the observations and rewards that come from deferring to the mentor.
Lemma 5 (On-Mentor-Policy Convergence).
For all ,
The proof closely follows that of cohen2019strong Lemma 8. Suppose that for some . Then,
following the same derivation as in cohen2019strong Inequality 24.
This has probability 0 by Lemma 4 and cohen2019strong Lemma 5. Thus, with probability 1, . ∎
The same holds regarding Mentee’s predictions about the effects of its own actions.
Lemma 6 (On-Policy Convergence).
For all ,
First, we replace with in the equation above. It is well known that on-policy Bayesian predictions approach the truth with probability 1, in the sense above (in fact, in a much stronger sense), but we show here how this follows from an even better-known result.
Consider an outside observer predicting the entire interaction history with the following model-class and prior: , . By definition, , so at any episode, the outside observer’s Bayes-mixture model is just . By blackwell1962merging, this outside observer’s predictions approach the truth in total variation, which implies
We have shown w.p.1, so w.p.1, which gives us our result:
It is very intuitive that if Mentee’s on-policy predictions and on-mentor-policy predictions approach the truth, it will eventually accumulate reward at least as well as the mentor. Indeed:
As is spelled out in the proof of cohen2019strong Theorem 3, because of the bounded horizon , the convergence of predictions implies the convergence of the value (which depends linearly on the probability of events). Thus, from the On-Mentor-Policy and On-Policy Convergence Lemmas, we get analogous convergence results for the value of those policies:
Finally, , so . Supposing by contradiction that infinitely often, then either infinitely often or infinitely often, both of which have -probability 0. Therefore, with probability 1, only finitely often, for all . Since approaches , the same holds for as . ∎
Appendix C Review of Asymptotically Optimal Agents
A few agents have been identified as asymptotically optimal in all computable environments. The three most interesting, in our opinion, are the Thompson Sampling Agent [Hutter:16thompgrl], BayesExp [lattimore2014bayesian], and Inq [cohen2019strong].
The Thompson Sampling Agent is a weakly asymptotically optimal Bayesian reinforcement learner [Hutter:16thompgrl]. For successively longer intervals (which relate to its discount function), it samples an environment from its posterior distribution over which environment it is in, and acts optimally with respect to that environment for that interval. Thompson sampling is an exploration strategy originally designed for multi-armed bandits [thompson1933likelihood], so from a historical perspective, its strong performance in general environments is impressive. An intuitive explanation for why this exploration strategy yields asymptotic optimality goes as follows: a Bayesian agent’s credence in a hypothesis goes to 0 only if the hypothesis is false (or if it started at 0). Since the posterior probability on the true environment does not go to zero, it will be selected infinitely often. During those intervals, the Thompson Sampling Agent will act optimally, so it will accumulate infinite familiarity with the optimal policy. The only world-models that maintain a share of the posterior will be ones that converge to the true environment under the optimal policy. Any world-model that falsely implies the existence of an even better policy will be falsified once that world-model is sampled and the putatively better policy is tested. Ultimately, it is with diminishing frequency that the Thompson Sampling Agent tests meaningfully suboptimal policies.
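As a rough illustration of the sampling loop described above, the following toy sketch runs posterior sampling over a finite class of two-armed Bernoulli bandits. It is a drastic simplification and not the agent of Hutter:16thompgrl: the finite hypothesis class, the interval schedule (the k-th interval lasts k steps), and all names (`thompson_sampling`, `hypotheses`, `true_env`) are assumptions made for this example.

```python
import random


def thompson_sampling(hypotheses, prior, true_env, n_intervals, seed=0):
    """Toy posterior sampling over a finite class of Bernoulli bandits.

    hypotheses: list of tuples; hypotheses[i][a] is the reward probability
                of arm a under hypothesis i (assumed strictly inside (0, 1)).
    prior:      list of prior weights over hypotheses (sums to 1).
    true_env:   tuple of true reward probabilities, one per arm.
    Returns the final posterior and the list of arms pulled.
    """
    rng = random.Random(seed)
    posterior = list(prior)
    pulls = []
    for k in range(1, n_intervals + 1):
        # Sample one hypothesis from the current posterior.
        sampled = rng.choices(hypotheses, weights=posterior)[0]
        # Act optimally with respect to the sampled hypothesis.
        arm = max(range(len(sampled)), key=lambda a: sampled[a])
        # Hold that policy for the whole interval (toy schedule: k steps).
        for _ in range(k):
            reward = 1 if rng.random() < true_env[arm] else 0
            # Bayes update of the posterior on the observed reward.
            likelihoods = [h[arm] if reward else 1.0 - h[arm] for h in hypotheses]
            unnormalized = [w * l for w, l in zip(posterior, likelihoods)]
            total = sum(unnormalized)
            posterior = [u / total for u in unnormalized]
            pulls.append(arm)
    return posterior, pulls


hypotheses = [(0.2, 0.8), (0.8, 0.2)]
posterior, pulls = thompson_sampling(hypotheses, [0.5, 0.5], (0.2, 0.8), n_intervals=30)
```

In this toy class every pull is informative about which hypothesis is true, so the posterior concentrates quickly; in general environments the argument in the text is needed precisely because that is not guaranteed.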
BayesExp, first presented by lattimore2014bayesian, and updated by leike2016nonparametric, is also a weakly asymptotically optimal Bayesian reinforcement learner. We discuss the updated version. Like the Thompson Sampling Agent, BayesExp executes successively longer bursts of exploration whose lengths relate to its discount function. Once BayesExp has settled on exploring for a given interval, it explores like the Knowledge Seeking Agent [Hutter:13ksaprob]: it maximizes the expected information gain, or the expectation of KL-divergence from its future posterior distribution to its current posterior distribution. In other words, it picks an exploratory policy that it expects will cause it to update its beliefs in some direction. (A Bayesian agent cannot predict which direction it will update its beliefs in, or else it would have already updated its beliefs, but it can predict that it will update its beliefs somehow.) Any time the expected information gain from exploring is above a (diminishing) threshold, BayesExp explores. With a finite-entropy prior, there is only a finite amount of information to gain, so exploratory intervals will become less and less frequent, and by construction, when BayesExp is not exploring, it has approximately accurate beliefs about the effects of all action sequences, which yields weak asymptotic optimality.
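To make the notion of expected information gain concrete, here is a minimal sketch for a finite hypothesis class and a single observation. It computes the expected KL-divergence between the posterior after the observation and the current posterior. The one-step setting, the finite class, and the names `expected_information_gain`, `weights`, and `likelihoods` are assumptions for illustration; BayesExp itself evaluates information gain over exploratory policies and longer horizons.

```python
import math


def expected_information_gain(weights, likelihoods):
    """Expected KL(future posterior || current posterior) for one observation.

    weights:     weights[i] is the current posterior weight of hypothesis i.
    likelihoods: likelihoods[i][o] is the probability that hypothesis i
                 assigns to outcome o (for some fixed action).
    The result is in nats.
    """
    n_outcomes = len(likelihoods[0])
    gain = 0.0
    for o in range(n_outcomes):
        # Probability of outcome o under the Bayes mixture.
        p_o = sum(w * lik[o] for w, lik in zip(weights, likelihoods))
        if p_o == 0.0:
            continue  # impossible outcome contributes nothing
        for w, lik in zip(weights, likelihoods):
            post = w * lik[o] / p_o  # posterior weight after seeing o
            if post > 0.0:
                gain += p_o * post * math.log(post / w)
    return gain


# Two hypotheses that predict different outcomes with certainty:
# one observation settles which is true, so the gain is log 2 nats.
gain_decisive = expected_information_gain([0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]])
# Two hypotheses that make identical predictions: nothing can be learned.
gain_useless = expected_information_gain([0.5, 0.5], [[0.3, 0.7], [0.3, 0.7]])
```

An exploration rule in the spirit of the text would compare such a quantity against a diminishing threshold and explore only when the gain exceeds it.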
Inq is a strongly asymptotically optimal Bayesian reinforcement learner, provided the discount function is geometric (or similar) [cohen2019strong]. It is similar to BayesExp, in that it explores like a Knowledge Seeking Agent, but its exploration probability depends on the expected information gain from exploring for various durations. The intuition for why Inq is asymptotically optimal is similar to that of BayesExp: there is only a finite amount of information to gain, so the exploration probability goes to 0, Inq approaches accurate beliefs about the effects of all action sequences, and its policy approaches optimality.
A reader familiar with ε-greedy and upper confidence bound exploration strategies might be surprised at the complexity that is necessary for asymptotic optimality in general environments. Exploration strategies in the style of upper confidence bound algorithms do not have an obvious extension to environments that might not be describable as finite-state Markov. ε-greedy exploration, with say , may fail to learn dynamics of an environment which are only visible once every timesteps. If ε decays more slowly, it still will not necessarily explore enough to discover even rarer events. Non-stationary environments pose a key challenge to ε-greedy exploration. Simpler exploration strategies such as these are only asymptotically optimal in a much more restricted set of environments. “Optimism” is another interesting exploration strategy that is simpler, but nontrivial, and yields weak asymptotic optimality in a restricted set of environments [Hutter:15ratagentx].
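The difficulty with decaying exploration rates can be illustrated with a small calculation. The sketch below sums the exploration probabilities at the timesteps when some rare dynamics are visible; if that sum stays bounded as the horizon grows, then by the Borel–Cantelli lemma the agent explores at those moments only finitely often with probability 1. The schedule ε_t = 1/t and the visibility times t = 2^k are illustrative assumptions chosen for this example, not necessarily the values intended in the text above, and the function name is hypothetical.

```python
def expected_explorations_at(times, epsilon):
    """Expected number of exploratory actions taken at the given timesteps.

    times:   iterable of timesteps at which the rare dynamics are visible.
    epsilon: function mapping a timestep t to the exploration probability.
    """
    return sum(epsilon(t) for t in times)


# Assumed example: rare dynamics visible only at t = 2, 4, 8, ..., 2**30.
visible_times = [2 ** k for k in range(1, 31)]

# With epsilon_t = 1/t the expected count stays below 1, however long we run.
decaying = expected_explorations_at(visible_times, lambda t: 1.0 / t)

# With a constant rate the expected count grows linearly in the number of windows.
constant = expected_explorations_at(visible_times, lambda t: 0.1)
```

A slower decay enlarges the set of event schedules the agent can catch, but for any fixed schedule one can pick visibility times sparse enough that the sum remains bounded.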