The dream of artificial general intelligence is to create an agent that, starting with no knowledge of its environment, eventually learns to behave optimally. This means it should be able to learn chess just by playing, or Go, or how to drive a car or mow the lawn, or any task we could conceivably be interested in assigning it.
Before considering the existence of universally intelligent agents, we must be precise about what is meant by optimality. If the environment and goal are known, then subject to computation issues, the optimal policy is easy to construct using an expectimax search from sequential decision theory [NR03]. However, if the true environment is unknown then the agent will necessarily spend some time exploring, and so cannot immediately play according to the optimal policy. Given a class of environments, we suggest two definitions of asymptotic optimality for an agent.
An agent is strongly asymptotically optimal if for every environment in the class it plays optimally in the limit.
It is weakly asymptotically optimal if for every environment in the class it plays optimally on average in the limit.
The key difference is that a strong asymptotically optimal agent must eventually stop exploring, while a weak asymptotically optimal agent may explore forever, but with decreasing frequency.
In this paper we consider the (non-)existence of weak/strong asymptotically optimal agents in the class of all deterministic computable environments. The restriction to deterministic is for the sake of simplicity and because the results for this case are already sufficiently non-trivial to be interesting. The restriction to computable is more philosophical. The Church-Turing thesis is the unprovable hypothesis that anything that can intuitively be computed can also be computed by a Turing machine. Applying this to physics leads to the strong Church-Turing thesis that the universe is computable (possibly stochastically computable, i.e. computable when given access to an oracle of random noise). Having made these assumptions, the largest interesting class then becomes the class of computable (possibly stochastic) environments.
In [Hut04], Hutter conjectured that his universal Bayesian agent, AIXI, was weakly asymptotically optimal in the class of all computable stochastic environments. Unfortunately this was recently shown to be false in [Ors10], where it is proven that no Bayesian agent (with a static prior) can be weakly asymptotically optimal in this class (or even in the class of computable deterministic environments). The key idea behind Orseau’s proof was to show that AIXI eventually stops exploring. This is somewhat surprising because it is normally assumed that Bayesian agents solve the exploration/exploitation dilemma in a principled way. This result is a bit reminiscent of Bayesian (passive induction) inconsistency results [DF86a, DF86b], although the details of the failure are very different.
We extend the work of [Ors10], where only Bayesian agents are considered, to show that non-computable weak asymptotically optimal agents do exist in the class of deterministic computable environments for some discount functions (including geometric), but not for others. We also show that no asymptotically optimal agent can be computable, and that for all “reasonable” discount functions there does not exist a strong asymptotically optimal agent.
The weak asymptotically optimal agent we construct is similar to AIXI, but with an exploration component similar to $\epsilon$-learning for finite state Markov decision processes or the UCB algorithm for bandits. The key is to explore sufficiently often and deeply to ensure that the environment used for the model is an adequate approximation of the true environment. At the same time, the agent must explore infrequently enough that it actually exploits its knowledge. Whether or not it is possible to get this balance right depends, somewhat surprisingly, on how forward looking the agent is (determined by the discount function). That it is sometimes not possible to explore enough to learn the true environment without damaging even a weak form of asymptotic optimality is unexpected.
Note that the exploration/exploitation problem is well-understood in the bandit case [ACBF02, BF85] and for (finite-state stationary) Markov decision processes [SL08]. In these restrictive settings, various satisfactory optimality criteria are available. In this work, we make no assumptions such as the Markov property, stationarity or ergodicity; we assume only that the environment is computable. So far, no satisfactory optimality definition is available for this general case.
2 Notation and Definitions
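Strings. A finite string $a$ over alphabet $\mathcal{A}$ is a finite sequence $a_1 a_2 a_3 \cdots a_{n-1} a_n$ with $a_i \in \mathcal{A}$. An infinite string $\omega$ over alphabet $\mathcal{A}$ is an infinite sequence $\omega_1 \omega_2 \omega_3 \cdots$. $\mathcal{A}^n$, $\mathcal{A}^*$ and $\mathcal{A}^\infty$ are the sets of strings of length $n$, strings of finite length, and infinite strings respectively. Let $x$ be a string (finite or infinite) then substrings are denoted $x_{s:t} := x_s x_{s+1} \cdots x_{t-1} x_t$ where $s, t \in \mathbb{N}$ and $s \leq t$. Strings may be concatenated. Let $x, y \in \mathcal{A}^*$ of length $n$ and $m$ respectively, and $\omega \in \mathcal{A}^\infty$. Then define $xy := x_1 x_2 \cdots x_{n-1} x_n y_1 y_2 \cdots y_{m-1} y_m$ and $x\omega := x_1 x_2 \cdots x_{n-1} x_n \omega_1 \omega_2 \omega_3 \cdots$. Some useful shorthands,
$$x_{<t} := x_{1:t-1}, \qquad yx_{<t} := y_1 x_1 y_2 x_2 \cdots y_{t-1} x_{t-1}. \tag{1}$$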
The second of these is ambiguous with concatenation, so wherever appears we assume the interleaving definition of (1) is intended. For example, it will be common to see , which represents the string . For binary strings, we write and to mean the number of 0’s and number of 1’s in respectively.
Environments and Optimality. Let , and be action, observation and reward spaces respectively. Let . An agent interacts with an environment as illustrated in the diagram on the right. First, the agent takes an action, upon which it receives a new observation/reward pair. The agent then takes another action, receives another observation/reward pair, and so on indefinitely. The goal of the agent is to maximise its discounted rewards over time. In this paper we consider only deterministic environments where the next observation/reward pair is determined by a function of the previous actions, observations and rewards.
Definition 1 (Deterministic Environment).
A deterministic environment is a function where is the observation/reward pair given after action is taken in history . Wherever we write we implicitly assume and refer to and without defining them. An environment is computable if there exists a Turing machine that computes it.
Note that since environments are deterministic the next observation need not depend on the previous observations (only on the actions). We choose to keep the dependence, as the proofs become clearer when both the action and observation sequences are visible.
and are finite, .
Definition 3 (Policy).
A policy is a function from a history to an action .
As expected, a policy and environment can interact with each other to generate a play-out sequence of action/observation/reward tuples.
Definition 4 (Play-out Sequence).
We define the play-out sequence inductively by and .
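The interaction loop behind a play-out sequence can be sketched in Python. This is only an illustration under assumptions, not the formal definition: histories are modelled as tuples of (action, observation, reward) triples, and the names `play_out`, `alternating_env`, `always_zero` and `alternate` are hypothetical.

```python
def play_out(policy, env, steps):
    """Generate the first `steps` (action, observation, reward) triples.

    policy: maps a history (tuple of triples) to an action.
    env:    maps (history, action) to an (observation, reward) pair.
    """
    history = []
    for _ in range(steps):
        action = policy(tuple(history))
        observation, reward = env(tuple(history), action)
        history.append((action, observation, reward))
    return history


def alternating_env(history, action):
    """Hypothetical deterministic environment: reward 1.0 whenever the
    action differs from the previous one (observations are unused)."""
    previous = history[-1][0] if history else None
    return None, (1.0 if action != previous else 0.0)


def always_zero(history):
    """Hypothetical policy that ignores the history."""
    return 0


def alternate(history):
    """Hypothetical policy that switches action at every step."""
    return len(history) % 2
```

Because both the policy and the environment are deterministic functions of the history, the same pair always produces the same play-out sequence.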
We need to define the value of a policy in environment . To avoid the possibility of infinite rewards, we will use discounted values. While it is common to use only geometric discounting, we have two reasons to allow arbitrary time-consistent discount functions.
Geometric discounting has a constant effective horizon, but we feel agents should be allowed to use a discount function that leads to a growing horizon. This is seen in other agents, such as humans, who generally become less myopic as they grow older. See [FOO02] for an overview of generic discounting.
The existence of asymptotically optimal agents depends critically on the effective horizon of the discount function.
Definition 5 (Discount Function).
A regular discount function $\gamma \in \mathbb{R}^\infty$ is a vector satisfying $\gamma_k \geq 0$ and $0 < \sum_{t=k}^\infty \gamma_t < \infty$ for all $k \in \mathbb{N}$.
The first condition is natural for any definition of a discount function. The upper bound in the second condition is often cited as the purpose of a discount function (to prevent infinite utilities), but economists sometimes use non-summable discount functions, such as hyperbolic. The strict lower bound in the second condition guarantees the agent cares about the infinite future, and is required to make asymptotic analysis interesting. We only consider discount functions satisfying all three conditions. In the following, let
$$\Gamma_t := \sum_{i=t}^\infty \gamma_i, \qquad H_t(p) := \min_{h \in \mathbb{N}} \left\{ h : \frac{1}{\Gamma_t} \sum_{k=t}^{t+h} \gamma_k > p \right\}.$$
An infinite sequence of rewards starting at time $t$, $r_t, r_{t+1}, r_{t+2}, \cdots$, is given a value of $\frac{1}{\Gamma_t} \sum_{i=t}^\infty \gamma_i r_i$. The term $\frac{1}{\Gamma_t}$ is a normalisation term to ensure that values scale in such a way that they can still be compared in the limit. A discount function is computable if there exists a Turing machine computing it. All well known discount functions, such as geometric, fixed horizon and hyperbolic are computable. Note that $H_t(p)$ exists for all $p \in [0, 1)$ and represents the effective horizon of the agent. After $H_t(p)$ time-steps into the future, starting at time $t$, the agent stands to gain/lose at most $1 - p$.
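The effective horizon can be computed directly for simple discount functions. The sketch below assumes that the normalisation term is the tail sum $\Gamma_t = \sum_{k \geq t} \gamma_k$ and that the effective horizon $H_t(p)$ is the least $h$ for which the normalised partial sum of $\gamma_t, \ldots, \gamma_{t+h}$ exceeds $p$. The function names are hypothetical, and exact rational arithmetic is used so that the strict inequality is not affected by rounding. The closed-form tails are $\Gamma_t = g^t/(1-g)$ for $\gamma_k = g^k$ and $\Gamma_t = 1/t$ for $\gamma_k = 1/(k(k+1))$ (a telescoping sum).

```python
from fractions import Fraction


def effective_horizon(gamma, tail, t, p):
    """Least h with (1 / Gamma_t) * sum_{k=t}^{t+h} gamma_k > p.

    gamma: k -> gamma_k, tail: t -> Gamma_t, with 0 <= p < 1.
    Terminates because the normalised partial sums converge to 1.
    """
    gamma_t = tail(t)
    partial = Fraction(0)
    h = 0
    while True:
        partial += gamma(t + h)
        if partial / gamma_t > p:
            return h
        h += 1


G = Fraction(1, 2)  # assumed geometric discount rate


def geometric(k):
    return G ** k


def geometric_tail(t):
    return G ** t / (1 - G)


def slow(k):
    return Fraction(1, k * (k + 1))


def slow_tail(t):
    return Fraction(1, t)


geometric_horizons = [effective_horizon(geometric, geometric_tail, t, Fraction(9, 10))
                      for t in (1, 10, 100)]   # [3, 3, 3]
slow_horizons = [effective_horizon(slow, slow_tail, t, Fraction(1, 2))
                 for t in (1, 10, 100)]        # [1, 10, 100]
```

In this example the geometric horizon does not depend on $t$, whereas the horizon of $\gamma_k = 1/(k(k+1))$ grows linearly with $t$.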
Definition 6 (Values and Optimal Policy).
The value of policy when starting from history in environment is . The optimal policy and its value are defined and .
Assumption 2 combined with Theorem 6 in [LH11] guarantees the existence of . Note that the normalisation term does not change the policy, but is used to ensure that values scale appropriately in the limit. For example, when discounting geometrically we have, for some and so and .
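The effect of the normalisation term can be seen in a small numerical sketch. It assumes that a reward sequence $r_t, r_{t+1}, \ldots$ is valued at $\frac{1}{\Gamma_t} \sum_{i \geq t} \gamma_i r_i$ with $\Gamma_t = \sum_{i \geq t} \gamma_i$, uses geometric discounting with an assumed rate of $1/2$, and treats rewards beyond the supplied finite list as zero. The function names are hypothetical.

```python
from fractions import Fraction

G = Fraction(1, 2)  # assumed geometric discount rate


def geometric(k):
    return G ** k


def geometric_tail(t):
    # Closed form of sum_{k >= t} G**k.
    return G ** t / (1 - G)


def raw_value(rewards, gamma, t):
    """Unnormalised discounted sum of rewards r_t, r_{t+1}, ..."""
    return sum(gamma(t + i) * r for i, r in enumerate(rewards))


def normalised_value(rewards, gamma, tail, t):
    """Discounted sum divided by the tail sum Gamma_t."""
    return raw_value(rewards, gamma, t) / tail(t)


rewards = [1, 1, 1]
early = normalised_value(rewards, geometric, geometric_tail, 1)
late = normalised_value(rewards, geometric, geometric_tail, 50)
```

Under these assumptions the same three rewards have the same normalised value at time 1 and at time 50, while the unnormalised sums shrink towards zero as $t$ grows, which is why normalised values remain comparable in the limit.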
Definition 7 (Asymptotic Optimality).
Let be a finite or countable set of environments and be a discount function. A policy is a strong asymptotically optimal policy in if
It is a weak asymptotically optimal policy if
Strong asymptotic optimality demands that the value of a single policy converges to the value of the optimal policy for all in the class. This means that in the limit, a strong asymptotically optimal policy will obtain the maximum value possible in that environment.
Weak asymptotic optimality is similar, but only requires the average value of the policy to converge to the average value of the optimal policy. This means that a weak asymptotically optimal policy can still make infinitely many bad mistakes, but must do so for only a fraction of the time that converges to zero. Strong asymptotic optimality implies weak asymptotic optimality.
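The gap between the two notions can be illustrated numerically. The sketch below uses a hypothetical sequence of value gaps (optimal value minus achieved value) in which the agent makes a maximal error exactly at the perfect squares $t = 1, 4, 9, \ldots$ and no error otherwise. It is an assumed toy sequence, not one derived from a particular environment.

```python
import math


def value_gap(t):
    """Hypothetical gap between optimal and achieved value at time t:
    1.0 at perfect squares, 0.0 elsewhere."""
    return 1.0 if math.isqrt(t) ** 2 == t else 0.0


def running_average(n):
    """Average gap over time-steps 1..n."""
    return sum(value_gap(t) for t in range(1, n + 1)) / n


late_gap = value_gap(10_000)            # still 1.0, since 10_000 = 100**2
average_100 = running_average(100)      # 10 squares out of 100 steps
average_10000 = running_average(10_000) # 100 squares out of 10_000 steps
```

The gap itself does not converge to zero, so such a sequence would violate the strong criterion, yet the running average is roughly $1/\sqrt{n}$ and vanishes, which is all the weak criterion asks for.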
While the definition of strong asymptotic optimality is rather natural, the definition of weak asymptotic optimality appears somewhat more arbitrary. The purpose of the average is to allow the agent to make a vanishing fraction of serious errors over its (infinite) life-time. We believe this is a necessary condition for an agent to learn the true environment. Of course, it would be possible to insist that the agent make only serious errors rather than , which would make a stronger version of weak asymptotic optimality. Our choice is the weakest notion of optimality of the above form that still makes sense, which turns out to be already too strong for some discount rates.
Note that for both versions of optimality an agent would be considered optimal if it actively undertook a policy that led it to an extremely bad “hell” state from which it could not escape. Since the state cannot be escaped, its policy would then coincide with the optimal policy and so it would be considered optimal. Unfortunately, this problem seems to be an unavoidable consequence of learning algorithms in non-ergodic environments in general, including the currently fashionable PAC algorithms for arbitrary finite Markov decision processes.
3 Non-Existence of Asymptotically Optimal Policies
We present the negative theorem in three parts. The first shows that, at least for computable discount functions, there does not exist a strong asymptotically optimal policy. The second shows that any weak asymptotically optimal policy must be incomputable while the third shows that there exist discount functions for which even incomputable weak asymptotically optimal policies do not exist.
Theorem 8.
Let $\mathcal{M}$ be the class of all deterministic computable environments and $\gamma$ a computable discount function, then:
1. There does not exist a strong asymptotically optimal policy in $\mathcal{M}$.
2. There does not exist a computable weak asymptotically optimal policy in $\mathcal{M}$.
3. If $\gamma_k := \frac{1}{k(k+1)}$ then there does not exist a weak asymptotically optimal policy in $\mathcal{M}$.
Part 1 of Theorem 8 says there is no strong asymptotically optimal policy in the class of all computable deterministic environments when the discount function is computable. It is likely there exist non-computable discount functions for which there are strong asymptotically optimal policies. Unfortunately the discount functions for which this is true are likely to be somewhat pathological and not realistic.
Given that strong asymptotic optimality is too strong, we should search for weak asymptotically optimal policies. Part 2 of Theorem 8 shows that any such policy is necessarily incomputable. This result features no real new ideas and relies on the fact that you can use a computable policy to hand-craft a computable environment in which it does very badly [Leg06]. In general this approach fails for incomputable policies because the hand-crafted environment will then not be computable. Note that this does not rule out the existence of a stochastically computable weak asymptotically optimal policy.
It turns out that even weak asymptotic optimality is too strong for some discount functions. Part 3 of Theorem 8 gives an example discount function for which no such policy (computable or otherwise) exists. In the next section we introduce a weak asymptotically optimal policy for geometric discounting (the construction may be extended to other discount functions). Note that $\gamma_k = \frac{1}{k(k+1)}$ is an example of a discount function whose effective horizon grows at least linearly, $H_t(p) = \Omega(t)$. It is also analytically easy to work with.
All negative results are proven by contradiction, and follow the same basic form.
Assume $\pi$ is a computable/arbitrary weak/strong asymptotically optimal policy.
Therefore is weak/strong asymptotically optimal in for some particular .
Construct , which is indistinguishable from under , but where is not weak/strong asymptotically optimal in .
Proof of Theorem 8, Part 1.
Let and . Now assume some policy is a strong asymptotically optimal policy. Define an environment by,
That is is the reward given when taking action having previously taken actions . Note that we have omitted the observations as . It is easy to see that the optimal policy for all with corresponding value . Since is strongly asymptotically optimal,
Assume there exists a time-sequence such that (and hence ) for all . Therefore by the definition of the value function,
where (5) follows from the definitions of the value function and , and the assumption in the previous line. (6) follows by algebra and the definition of . This contradicts (4). Therefore for any strong asymptotically optimal policy there exists a such that for all , for some . I.e, cannot take sub-optimal action too frequently. In particular, it cannot take action for large contiguous blocks of time. Construct a new environment defined by
Note that is computable if is and that by construction the play-out sequences for and when using policy are identical. We now consider the optimal policy in . For any consider the value of policy defined by for all .
This is because spends time-steps playing and receiving reward before “unlocking” a reward of on all subsequent plays. On the other hand, because can never unlock the reward of because it never plays for a contiguous block of time-steps. By the definition of the optimal policy, . Therefore
Therefore there does not exist an asymptotically optimal policy in . ∎
Proof of Theorem 8, Part 2.
Let and . Now let be the class of all computable deterministic environments and be an arbitrary discount function. Suppose is computable and consider the environment defined by
Since is computable is as well. Therefore . Now for all while . Therefore and so is not weakly asymptotically optimal. ∎
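The diagonalisation idea behind this part can be sketched in Python. As an assumption, the environment below gives reward 0 whenever the action matches what the given policy would choose on the same history and reward 1 otherwise; actions are binary and observations are omitted. The names `adversarial_env`, `majority_policy`, `contrarian` and `total_reward` are hypothetical.

```python
def adversarial_env(policy):
    """Build an environment that punishes exactly the actions `policy` takes."""
    def env(history, action):
        predicted = policy(history)
        return None, (0.0 if action == predicted else 1.0)
    return env


def total_reward(policy, env, steps):
    """Undiscounted sum of rewards over the first `steps` interactions."""
    history = ()
    total = 0.0
    for _ in range(steps):
        action = policy(history)
        observation, reward = env(history, action)
        history += ((action, observation, reward),)
        total += reward
    return total


def majority_policy(history):
    """Hypothetical computable policy: repeat the most frequent past action."""
    ones = sum(action for action, _, _ in history)
    return 1 if 2 * ones > len(history) else 0


def contrarian(policy):
    """Policy that always plays the opposite of `policy` on the same history."""
    return lambda history: 1 - policy(history)


env = adversarial_env(majority_policy)
own_reward = total_reward(majority_policy, env, 20)
best_reward = total_reward(contrarian(majority_policy), env, 20)
```

The environment is computable only because it can call the policy, which is why the same construction is not available against an incomputable policy.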
Proof of Theorem 8, Part 3.
Recall and so . Now let and . Define by
where will be chosen later. As before, . Assume is weakly asymptotically optimal. Therefore
We show by contradiction that cannot explore (take action ) too often. Assume there exists an infinite time-sequence such that for all . Then for we have
The first inequality follows from (9) and because the maximum value of any play-out sequence in is . The second by algebra. Therefore , which contradicts (7). Therefore there does not exist a time-sequence such that for all .
So far we have shown that cannot “explore” for consecutive time-steps starting at time-step , infinitely often. We now construct an environment similar to where this is required. Choose to be larger than the last time-step at which for all Define by
Now we compare the values in environment of and at times . Since does not take action for consecutive time-steps at any time after , it never “unlocks” the reward of 1 and so . Now let for all . Therefore, for ,
We believe it should be possible to generalise the above to computable discount functions with with for infinitely many , but the proof will likely be messy.
4 Existence of Weak Asymptotically Optimal Policies
In the previous section we showed there did not exist a strong asymptotically optimal policy (for most discount functions) and that any weak asymptotically optimal policy must be incomputable. In this section we show that a weak asymptotically optimal policy exists for geometric discounting (and is, of course, incomputable).
The policy is reminiscent of $\epsilon$-exploration in finite state MDPs (or UCB for bandits) in that it spends most of its time exploiting the information it already knows, while still exploring sufficiently often (and for sufficiently long) to detect any significant errors in its model.
The idea will be to use a model-based policy that chooses its current model to be the first environment in the model class (all computable deterministic environments) consistent with the history seen so far. With increasing probability it takes the best action according to this policy, while still occasionally exploring randomly. When it explores it always does so in bursts of increasing length.
Definition 9 (History Consistent).
A deterministic environment is consistent with history if .
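Selecting the first consistent model can be sketched as follows. This is an illustration under assumptions: the model class is replaced by a short finite list of hypothetical environments (`env_one`, `env_zero`, `env_late`), histories are tuples of (action, observation, reward) triples, and an environment counts as consistent when it reproduces every observation/reward pair in the history.

```python
def is_consistent(env, history):
    """True if env reproduces every (observation, reward) pair in history."""
    for t, (action, observation, reward) in enumerate(history):
        if env(history[:t], action) != (observation, reward):
            return False
    return True


def first_consistent(models, history):
    """Index of the first model consistent with history, or None."""
    for index, env in enumerate(models):
        if is_consistent(env, history):
            return index
    return None


def env_one(history, action):
    return None, float(action == 1)


def env_zero(history, action):
    return None, 0.0


def env_late(history, action):
    # Pays for action 1 only from the third time-step onwards.
    return None, float(action == 1 and len(history) >= 2)


models = [env_one, env_zero, env_late]
history = ()
indices = [first_consistent(models, history)]
for _ in range(3):
    action = 1
    observation, reward = env_late(history, action)  # env_late plays the true environment
    history += ((action, observation, reward),)
    indices.append(first_consistent(models, history))
```

In this run the selected index never decreases, because a model that is inconsistent with a history remains inconsistent with every extension of it, and it is bounded by the index of the true environment.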
Definition 10 (Weak Asymptotically Optimal Policy).
Let and be a countable class of deterministic environments. Define a probability measure on inductively by, . Now let be sampled from and define by
Next let be sampled from the uniform measure (each bit of is independently sampled from a Bernoulli distribution) and define a policy by,
where with . Note that is always finite because there exists an such that , in which case is necessarily consistent with .
Intuitively, at time-steps when the agent will explore for time-steps. if the agent is exploring at time and is the action taken if exploring at time-step . will be used later, with if the agent will explore at least once in the interval . If the agent is not exploring then it acts according to the optimal policy for the first consistent environment in .
Theorem 11.
Let $\gamma_t = \gamma^t$ with $\gamma \in (0, 1)$ (geometric discounting); then the policy $\pi$ defined in Definition 10 is weakly asymptotically optimal in the class of all deterministic computable environments with probability 1.
That is only convenience, rather than necessity. The policy is easily generalised to arbitrary finite .
is essentially a stochastic policy. With some technical difficulties it is possible to construct an equivalent deterministic policy. This is done by choosing to be any -Martin-Löf random sequence and to be a sequence that is Martin-Löf random w.r.t. the uniform measure. The theorem then holds for all deterministic environments. The proof is somewhat delicate and may not extend nicely to stochastic environments. For an introduction to Kolmogorov complexity and Martin-Löf randomness, see [LV08]. For a reason why the stochastic case may not go through as easily, see [HM07].
The policy defined in Definition 10 is not computable for two reasons. First, because it relies on the stochastic sequences and . Second, because the operation of finding the first environment consistent with the history is not computable (the class of computable environments is not recursively enumerable [LV08]). We do not know if there exists a weak asymptotically optimal policy that is computable when given access to a random number generator (or if it is given and ).
The bursts of exploration are required for optimality. Without them it will be possible to construct counter-example environments similar to those used in part 3 of Theorem 8.
Before the proof we require some more definitions and lemmas. Easier proofs are omitted.
Definition 12 (-Difference).
Let and be two environments consistent with history , then is -different to if there exists satisfying
Intuitively, is -different to at history if playing the optimal policy for for time-steps makes inconsistent with the new history. Note that -difference is not symmetric.
If and and is an indicator sequence with (where the indicator of an expression is 1 if the expression is true and 0 otherwise), then .
See the appendix for the proof.
Let be a sequence with for all . The following properties of are true with probability .
For any , .
If and then for infinitely many .
1. Let , and be the event that .
Using the definition of to compute the expectation and applying the Markov inequality
gives that . Therefore .
Therefore the Borel-Cantelli lemma gives that occurs for only finitely many with probability .
We now assume that and show that must occur infinitely often.
By the definition of and our assumption we have that there exists a sequence
such that for all .
Let and note that , which is
exactly . Therefore there exist infinitely many such that occurs and so
with probability 1.
2. The probability that for all is , by Lemma 13. Therefore the probability that for only finitely many is zero. Therefore there exist infinitely many with with probability , as required. ∎
Lemma 15 (Approximation Lemma).
Let and be policies, an environment and . Let be an arbitrary history and be the future action/observation/reward triples when playing policy . If then .
Recall that and are the optimal policies in environments and respectively (see Definition 6).
Lemma 16 (-difference).
If then is -different to on .
Follows from the approximation lemma. ∎
We are now ready to prove the main theorem.
Proof of Theorem 11.
Let be the policy defined in Definition 10 and be the true (unknown) environment. Recall that with is the first model consistent with the history at time and is used by when not exploring. First we claim there exists a and environment such that for all . Two facts,
If is inconsistent with history then it is also inconsistent with for all .
is consistent with for all .
By 1) we have that the sequence is monotone increasing. By 2) we have that the sequence is bounded by with . The claim follows since any bounded monotone sequence of natural numbers converges in finite time. Let be the environment to which converges. Note that must be consistent with history for all . We now show by contradiction that the optimal policy for is weakly asymptotically optimal in environment . Suppose it were not, then
Let be defined by if and only if,
By Lemma 14 there exists (with probability one) an infinite sequence