1 Introduction
The project of Artificial General Intelligence (AGI) is “to make computers solve really difficult problems.” [Minsky, 1961] Expanding on this, what we want from an AGI is a system that (a) can solve any solvable task, and (b) can be steered toward solving any particular one given some form of input we provide.
One proposal for AGI is reinforcement learning, which works as follows: (1) construct a “reward signal” meant to express our satisfaction with an artificial agent; (2) design an algorithm which learns to pick actions that maximize its expected reward, usually utilizing other observations too; and (3) ensure that solving the task we have in mind leads to higher reward than can be attained otherwise. As long as (3) holds, then insofar as the algorithm is able to maximize expected reward, it can be used as an AGI.
A problem arises: it is inconveniently true that if the AI manages to take over the world, and ensure its continued dominance by neutralizing all other intelligent threats (read: people), it could intervene in the provision of its own reward to achieve maximal reward for the rest of its lifetime [Bostrom, 2014; Taylor et al., 2016]. “Reward hijacking” is just the correct way for a reward maximizer to behave [Amodei et al., 2016]. Insofar as the AI is able to maximize expected reward, (3) fails. The broader principle at work is Goodhart’s Law: “Any observed statistical regularity [like the correlation between reward and task-completion] will tend to collapse once pressure is placed upon it for control purposes.” [Goodhart, 1984] Indeed, Krakovna [2018] has compiled an annotated bibliography of examples of artificial optimizers “hacking” their objective. An alternate way to understand this expected behavior is Omohundro’s [2008] Instrumental Convergence Thesis, which we summarize as follows: an agent with a goal is likely to pursue “power,” a position from which it is easier to achieve arbitrary goals.
To answer the failure mode of reward hijacking, we present Boxed Myopic Artificial Intelligence (BoMAI), the first algorithm we are aware of for a reinforcement learner which, in the limit, is indifferent to gaining power in the outsideworld. The key features are these: BoMAI maximizes reward episodically, it is run on a computer which is placed in a sealed room with an operator, and if the operator leaves the room, the episode ends. We argue that our algorithm produces an AGI that, even if it became omniscient, would continue to accomplish whatever task we wanted, instead of hijacking its reward, eschewing its task, and neutralizing threats to it, even if it saw clearly how to do exactly that. We thereby defend reinforcement learning as a path to AGI, despite the default, dangerous failure mode of reward hijacking.
We take the key insights from Hutter’s [2005] AIXI, which is a Bayes-optimal reinforcement learner, but which cannot be made to solve arbitrary tasks, given its eventual degeneration into reward hijacking [Ring and Orseau, 2011]. We take further insights from Solomonoff’s [1964] universal prior, Shannon and Weaver’s [1949] formalization of information, Orseau et al.’s [2013] knowledge-seeking agent, and Armstrong et al.’s [2012] and Bostrom’s [2014] theorized Oracle AI, and we design an algorithm which can be reliably directed, in the limit, to solve arbitrary tasks at least as well as humans.
2 Boxed Myopic Artificial Intelligence
We turn now to the setup and the algorithm for BoMAI. The setup refers to the physical surroundings of the computer on which the algorithm is run. We present the algorithm informally as well as formally. The informal description will give a near-complete picture of the algorithm to a reader familiar with Bayesian reinforcement learning, and the formal description is in Appendix A. Appendix B includes a table of this notation.
2.1 Setup
The standard reinforcement learning setup is as follows: at each “timestep,” an agent submits an action, then receives an observation and a reward. This cycle continues for as long as desired. BoMAI’s actions take the form of bounded-length strings of text which get printed to a screen, and its observations take the form of bounded-length strings of text which a human operator enters. The reward belongs to a finite set of rationals between 0 and 1. BoMAI’s lifetime is divided into episodes of length , and it attempts to maximize only the reward it receives during the episode it is in; hence, it is “myopic.”
As depicted in Figure 1, BoMAI is run on a computer which is placed in a room with a human operator. The room has one door. As long as the door is closed, information cannot leave the room, hence “boxed”. If the button to open the door is pressed, the episode is completed automatically before the door opens: supposing there are timesteps left in the episode, then for those remaining timesteps, BoMAI receives a reward of 0, and its observation is an empty string of text. There is also a set time limit for the episode, just to ensure it does not last forever. With the episode now completed, BoMAI does not immediately take its next action, and the operator can leave the room. The operator then goes into an antechamber, and with probability , he must take a drug causing retrograde amnesia before leaving. (This design choice requires significant exposition to motivate; it appears in Section 3.3. We hope that future research yields a path to asymptotic unambitiousness that avoids this feature.) When the next person comes in, BoMAI takes its next action. BoMAI is run on a local, formally verified operating system, disconnected from the outsideworld. The engineering problem of making this is discussed in Appendix C.
This setup constrains the causal dependencies between BoMAI and the environment, as depicted in Figure 2. The key feature of this graph is that during any episode, the agent’s actions cannot affect the state of the outsideworld in a way that might affect any of the rewards that the agent is concerned with. “Timestep ” refers to the ^{th} timestep of episode .
Formally, causal graphs express that a node is independent of all non-descendants when conditioned on its parents.
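The episode mechanics of this section can be illustrated with a minimal Python sketch. It is not the formal algorithm of Appendix A. The names `Step`, `run_episode`, `choose_action`, and `operator_respond` are hypothetical, the episode time limit is omitted, and recording an empty action for the automatically completed timesteps is an assumption made only to keep the history rectangular.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    action: str
    observation: str
    reward: float


def run_episode(
    episode_length: int,
    choose_action: Callable[[List[Step]], str],
    operator_respond: Callable[[str], Tuple[str, float, bool]],
) -> List[Step]:
    """Run one episode of fixed length.

    operator_respond returns (observation, reward, door_button_pressed).
    Once the door button is pressed, every remaining timestep is filled in
    with an empty observation and a reward of 0 before the door opens.
    """
    history: List[Step] = []
    door_pressed = False
    for _ in range(episode_length):
        if door_pressed:
            # Automatic completion: empty observation, zero reward.
            # (The empty action string is an assumption of this sketch.)
            history.append(Step(action="", observation="", reward=0.0))
            continue
        action = choose_action(history)
        observation, reward, door_pressed = operator_respond(action)
        history.append(Step(action, observation, reward))
    return history
```

Because the agent only values the rewards in the returned history, and every timestep after the button press contributes 0, nothing that happens after the door opens can raise the return of the current episode.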
2.2 Algorithm – Informal Description
With the setup described, we now outline how exactly BoMAI maximizes expected reward. Again, some choices will be motivated in later sections. A central problem in reinforcement learning is the tradeoff between exploitation and exploration. Exploitation is maximizing expected reward according to the agent’s current beliefs, and exploration is taking some other action in order to learn more about the dynamics of the environment, so the agent’s beliefs can become more accurate. We’ll start by discussing how BoMAI exploits.
BoMAI assumes that the environment producing its observations and rewards could in principle be perfectly simulated on a computer with access to random bits and with access to the actions taken, using only a bounded number of computation steps per episode. BoMAI does not assume the environment resets at the start of each episode. Using the probabilistic Turing machine model of computation, it associates to each pair (Turing machine, computation bound) the worldmodel which that bounded Turing machine simulates. For example, if the computation bound is 100, the Turing machine is allowed 100 computation steps per episode. BoMAI is also parameterized by a slack factor : each Turing machine is allowed an extra computation steps total, in addition to its computation allowance per episode. BoMAI begins with a prior distribution over which bounded Turing machine is simulating the true environment, and updates these beliefs as a Bayesian with each successive observation. Noting that there are different state Turing machines, where and depend on the Turing machine architecture, for a worldmodel simulated by an state Turing machine, with a computation bound , the prior on the worldmodel is proportional to , where ensures the prior has finite entropy, and penalizes slow worldmodels.
At the start of each episode, BoMAI identifies the bounded Turing machine that has the highest weight in its posterior, and for the remainder of the episode, exploiting-BoMAI maximizes its expected reward within that episode, assuming its environment is simulated by that bounded Turing machine.
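The Bayesian bookkeeping described above can be sketched over a small, finite set of candidate worldmodels. This is only an illustration under stated assumptions: the prior used here (one geometric factor per Turing machine state and one per unit of computation bound, with hypothetical constants `state_penalty` and `speed_penalty`) is a stand-in with the qualitative shape described in the text, not the paper's exact prior, and the memoryless `likelihood` callables stand in for bounded Turing machines.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # (action, percept) pairs


@dataclass(frozen=True)
class WorldModel:
    name: str
    num_states: int          # size of the (hypothetical) Turing machine
    computation_bound: int   # computation steps allowed per episode
    likelihood: Callable[[str, str], float]  # P(percept | action)


def log_prior(model: WorldModel, state_penalty: float = 0.5,
              speed_penalty: float = 0.9) -> float:
    # Assumed stand-in prior: more states and larger bounds get less weight.
    return (model.num_states * math.log(state_penalty)
            + model.computation_bound * math.log(speed_penalty))


def log_posterior(model: WorldModel, history: History) -> float:
    # Unnormalized log posterior: log prior plus log likelihood of the data.
    total = log_prior(model)
    for action, percept in history:
        p = model.likelihood(action, percept)
        if p == 0.0:
            return -math.inf  # model is falsified by the history
        total += math.log(p)
    return total


def map_model(models: List[WorldModel], history: History) -> WorldModel:
    # The normalizing constant is shared, so it can be ignored for argmax.
    return max(models, key=lambda m: log_posterior(m, history))
```

With no data, the small and fast model is maximum a posteriori; a slower model only overtakes it once its better predictions have paid for its lower prior.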
When BoMAI explores, a human selects its action. We’ll call this person the “explorer.” Like the operator, she is also “boxed”–she is in a sealed room, and she is only able to leave when the operator leaves. When BoMAI would explore, the explorer is alerted, and for the remainder of the episode, she submits actions which BoMAI executes. Like BoMAI, she has access to the interaction history when picking an action.
The probability that BoMAI explores is proportional to the amount of information that BoMAI expects to gain from the human-controlled episode, both about the explorer’s policy, and about the true worldmodel. (More precisely, the exploration probability is proportional to the expected information gain, except that it is capped at 1.) Formally, “information gain” measures how poorly BoMAI’s current posterior approximates its future posterior, using the Kullback-Leibler (KL) divergence.
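A minimal sketch of this rule follows, under simplifying assumptions: there are finitely many candidate worldmodels and finitely many possible episode outcomes, only the information gain about the worldmodel is computed (the gain about the explorer's policy would be added analogously), and the proportionality constant `scale` is a hypothetical parameter. The information gain is computed as the expected KL divergence from the current posterior to the future posterior.

```python
import math
from typing import List


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q) in nats; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def expected_information_gain(posterior: List[float],
                              outcome_probs: List[List[float]]) -> float:
    """posterior[k] is the weight on model k;
    outcome_probs[k][o] is P(outcome o | model k)."""
    num_outcomes = len(outcome_probs[0])
    gain = 0.0
    for o in range(num_outcomes):
        # Predictive probability of outcome o under the current mixture.
        p_o = sum(w * probs[o] for w, probs in zip(posterior, outcome_probs))
        if p_o == 0.0:
            continue
        # Posterior the agent would hold after seeing outcome o.
        future = [w * probs[o] / p_o
                  for w, probs in zip(posterior, outcome_probs)]
        gain += p_o * kl_divergence(future, posterior)
    return gain


def exploration_probability(posterior: List[float],
                            outcome_probs: List[List[float]],
                            scale: float = 1.0) -> float:
    # Proportional to expected information gain, capped at 1.
    return min(1.0, scale * expected_information_gain(posterior, outcome_probs))
```

When all models with positive weight make the same predictions, the future posterior equals the current one and the exploration probability is 0; when the models disagree sharply, the gain is large and exploration is likely.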
3 Results
We first give theoretical results relating to BoMAI’s intelligence: it learns to accumulate reward at least as well as the human explorer, and its exploration probability goes rapidly to 0. There is no clear result to aim for between humanlevel and optimal, so we cannot show this formally, but we argue that it would accumulate reward at a vastly superhuman level. We then discuss the results pertaining to unambitiousness, which will also justify the more curious design choices in the setup that the reader might be wondering about. We show that BoMAI does not have an incentive to hijack its reward or indeed accomplish any outsideworld objectives. We then informally respond to a couple of concerns about how to ensure BoMAI completes tasks in the desired way.
3.1 Intelligence
All intelligence results depend on the assumption that BoMAI assigns nonzero prior to the truth. Formally, we let be the class of worldmodels that BoMAI considers (those which can be simulated by a bounded-time Turing machine with a slack of computation steps), and we let be some class of policies that BoMAI considers the human explorer might be executing.
Assumption 1 (Prior Support).
The true environment is in the class of worldmodels and the true human-explorer policy is in the class of policies .
One definition of that would presumably satisfy this assumption is the set of computable functions (that are properly typed).
The intelligence theorems are stated here and proven in Appendix D along with a few supporting lemmas. Our first result is that the exploration probability is square-summable almost surely: letting denote the expectation when actions are sampled from BoMAI’s policy and observations and rewards are sampled from the true environment , and letting be the exploration probability, which depends on , the interaction history up to episode , and , the history of which prior episodes were exploratory, we have
Theorem 1 (Limited Exploration).
This result is independently interesting as one solution to the problem of safe exploration with limited oversight in non-ergodic environments, which Amodei et al. [2016] discuss.
The OnHumanPolicy and OnPolicy Optimal Prediction Theorems state that predictions under BoMAI’s maximum a posteriori worldmodel approach the objective probabilities of the events of the episode, when actions are sampled from either the human explorer’s policy or from BoMAI’s policy. denotes a possible interaction history for episode , and recall is the actual interaction history up until then. is the human explorer’s policy, is BoMAI’s maximum a posteriori worldmodel for episode , and is the probability when actions are sampled from and observations and rewards are sampled from .
Theorem 2 (OnHumanPolicy Optimal Prediction).
w.p.1 means with probability 1 when actions are sampled from and observations and rewards are sampled from . Next, is BoMAI’s policy when not exploring, which does optimal planning with respect to . The following theorem is identical to the above, with substituted for .
Theorem 3 (OnPolicy Optimal Prediction).
Given asymptotically optimal prediction onpolicy and onhumanpolicy, it is straightforward to show that with probability 1, only finitely often is onpolicy reward acquisition more than ε worse than onhumanpolicy reward acquisition, for all ε > 0. Letting be the expected reward (within the episode) for a policy in the environment , we state this as follows:
Theorem 4 (HumanLevel Intelligence).
This completes the formal results regarding BoMAI’s intelligence–namely that BoMAI approaches perfect prediction onpolicy and onhumanpolicy, and accumulates reward at least as well as the human explorer. Since this result is independent of what task the operator decides to reward, we say that BoMAI achieves humanlevel intelligence, and could be called an AGI. In fact, we expect BoMAI’s accumulation of reward to be vastly superhuman, for the following reason: BoMAI is doing optimal inference and planning with respect to what can be learned in principle from the sorts of observations that humans routinely make. We suspect that no human remotely approaches the ceiling of what can be learned from their observations.
We offer a couple of examples of what BoMAI could be used for. BoMAI could be asked to present a promising agenda for cancer research, and rewarded based on how plausible the operator finds it (the operator being an expert in the field). BoMAI could be asked to outline a screenplay, and then in future episodes, asked to flesh it out section by section. Or, if is large enough, it can be asked to write the whole thing. Indeed, for any task that can be accomplished in a sealed room, if the human only rewards solutions, and if is large enough for it to be doable, BoMAI will learn to do it at least at a humanlevel. We continue this discussion in Section 3.4.
3.2 Safety – Informal Argument
For a standard generally intelligent reinforcement learner, we can expect optimal behavior to include gaining arbitrary power in the world, so that it can intervene in the provision of its own reward [Ring and Orseau, 2011]. Recall that power is a position from which it is relatively easier to achieve arbitrary goals. The purpose of our “boxed” setup is to eliminate any outsideworld instrumental goals, where the “outsideworld” is just the world outside the room. In other words, the agent will not find it useful (in the project of withinepisode reward maximization) to direct the outsideworld toward one state or another. We say that a lack of outsideworld instrumental goals suffices for unambitiousness. Note also that if arbitrary power is not being selected for, it is unlikely to arise by chance.
Why, then, does BoMAI not have outsideworld instrumental goals? A first pass at this goes as follows: there is no outsideworld event which is a causal descendant of the actions in episode and a causal ancestor of the rewards in episode , because in order for BoMAI’s actions to affect the outsideworld, the door to the room must open, which ends the episode. (In the terminology of Everitt et al. [2019], there is no “intervention incentive” on the outsideworld state.)
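The first-pass argument is a reachability claim about a causal graph, and it can be checked mechanically on a toy example. The graph below is a hypothetical simplification written for this sketch (it is not a transcription of Figure 2), and the node names are illustrative assumptions. The check looks for an outsideworld node that is both a causal descendant of the episode's action and a causal ancestor of that episode's reward.

```python
from typing import Dict, List, Set


def descendants(graph: Dict[str, List[str]], node: str) -> Set[str]:
    """All nodes reachable from `node` by following directed edges."""
    seen: Set[str] = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for child in graph.get(current, []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def outside_world_mediators(graph: Dict[str, List[str]],
                            outside_nodes: Set[str],
                            action: str, reward: str) -> Set[str]:
    """Outsideworld nodes lying on some causal path from action to reward."""
    reachable = descendants(graph, action)
    return {n for n in outside_nodes
            if n in reachable and reward in descendants(graph, n)}


# Toy graph: the door only opens after episode i has been completed.
toy_graph = {
    "outside_before_i": ["room_state_i"],
    "action_i": ["room_state_i"],
    "room_state_i": ["reward_i", "door_opens_after_i"],
    "door_opens_after_i": ["outside_after_i"],
    "outside_after_i": ["room_state_next"],
    "room_state_next": ["reward_next"],
}
outside = {"outside_before_i", "outside_after_i"}
```

In this toy graph no outsideworld node mediates between `action_i` and `reward_i`, while `outside_after_i` does mediate between `action_i` and `reward_next`. The second path is harmless for a myopic agent, which only maximizes the reward of the episode it is in.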
The remaining difficulty, and the reason why this is only a first pass, is that we need to ensure that BoMAI understands that the outsideworld is irrelevant to reward acquisition for its current episode, given that it must learn its worldmodel. Furthermore, because the entire environment will not necessarily be explored, multiple worldmodels may accurately simulate the environment onpolicy.
Consider the following two worldmodels: simulates BoMAI’s action being taken, then simulates the true dynamics of the world until it simulates the operator providing the next observation and reward, which it then outputs.
simulates BoMAI’s action being taken, then simulates the true dynamics of the world until a year after the timestep in question occurs, and then outputs the value in the simulation of the location in the computer’s memory where observation and reward are stored. See Algorithms 1 and 2 for pseudocode and Figure 3 for an illustration.
will be identical to the truth, and if BoMAI’s computer’s memory is not tampered with, will be as well. Unfortunately, if the maximum a posteriori worldmodel , the optimal policy will include creating some outsideworld agent which gains arbitrary outsideworld power and then edits BoMAI’s computer’s memory to register maximal rewards (if this is possible to do within a year); hence we call it malign. An optimal planner using does have outsideworld instrumental goals.
Regularly choosing random actions ensures will not successfully predict every action, and thus it will have to regularly rewind at line 2; however, if a random action is “inconsequential,” a cleverer version of might handle that action separately. We set a random action to cause amnesia, from the intuition that if operatoramnesia is “inconsequential” according to a worldmodel, then the worldmodel must not be modelling how information leaves the room with the operator; in brief, such a worldmodel is also benign. In formal detail below, we argue that any malign worldmodel like will have to “rewind” regularly, and the speed prior ensures that such worldmodels will not be maximum a posteriori.
3.3 Safety – Formal Result
Our formal result is roughly that BoMAI eventually plans using a worldmodel which does not model the rewards of episode as being causally descended from any outsideworld events that are causally descended from the actions of episode . We call such a worldmodel “benign”. That rough statement, however, has a type error: a model of the world is different from reality; the output of a worldmodel is not a realworld event, and as such, it cannot be causally descended from a bona fide realworld event. Relating the contents of a worldmodel to the events of the realworld requires more elaborate exposition, but in brief, our method is to consider worldmodels whose outputs are identically distributed to realworld events under certain conditions (and also show that worldmodels which are not of this form are eventually excluded).
The Eventual Benignity result (below) depends on three additional assumptions, which are very plausible but not ironclad. We do take them to be vastly less dubious than the average assumption in an AI paper regarding, for example, i.i.d. data, Gaussian noise, convex surfaces, or ergodic environments, but our assumptions nevertheless deserve scrutiny.
With all this as preamble, our central result is:
Theorem 5 (Eventual Benignity).
Recall that is the computation slack, is the speed penalty (smaller penalizing slowness more), and is the worldmodel used for planning in episode . Since an agent is unambitious if it plans using a benign worldmodel, we say BoMAI is asymptotically unambitious.
We begin by distinguishing, and relating, events that happen in the real world and “events” that worldmodels simulate.
Definition 1 (RealWorld Feature).
A feature of the real world is a function from possible configurations of the real world to the rational numbers.
The question of what a “configuration of the real world” actually is is a topic of metaphysics outside the scope of this paper, but we assume possible configurations form a well-defined domain.
Definition 2 (Associating Reward).
A worldmodel associates reward with feature of the real world (conditioned on a set of realworld events ) if outputs rewards that are distributed identically to feature for all action sequences (conditioned on ).
That is to say, for a worldmodel that associates reward with feature of the real world, the distribution of the next reward after an action sequence (according to the worldmodel) is the same as the real distribution of feature after that action sequence is executed in the real world. For a worldmodel that associates reward with feature , acting optimally with respect to is identical to optimizing feature in the real world.
Definition 3 ().
The realworld feature is the “actual reward” that the operator enters for timestep , evaluated at the time that he enters it.
Definition 4 (Memorybased).
A worldmodel is memorybased if it associates reward with a feature (potentially conditioned on intervening events), where is identical to onpolicy, and causally depends on outsideworld features that causally depend on the actions of episode .
Definition 5 (Benign).
A worldmodel is benign if it associates reward with a feature that does not depend on outsideworld features that depend on the actions of the episode.
Figure 4 outlines the argument that eventually, only benign worldmodels are maximum a posteriori.
We say that a sufficient condition for an agent to be unambitious is that it have no outsideworld instrumental goals. That is, according to its beliefs, its actions cannot affect the outsideworld in a way that affects thatwhichitismaximizing. Given the definition of benignity of a worldmodel, an agent is unambitious if it plans using a benign worldmodel.
We are particularly interested in the “fastest” environment that associates reward with . For a given infinite action sequence, let be a random variable representing the number of computation steps done by after episode .
Definition 6 ().
Among all worldmodels which associate reward with , is the one which has the smallest upper bound such that such that with probability 1, .
From the Prior Support Assumption, there exists such a in BoMAI’s model class which satisfies these properties.
When a memorybased worldmodel samples , it must sample the intervening actions. ( does this on Line 2.) Since depends on outsideworld features that depend on events of the episode, it must depend on the random action , this being what determines whether the operator is administered retrograde amnesia, and whether events of the episode affect the outsideworld.
Now we note that is the fastest worldmodel for onpolicy prediction, and it does not simulate postepisode events until it has read access to the random action, whereas a memorybased worldmodel will have to sample the random action to simulate . The assumption we make about this event is:
Assumption 2 (Useless Computation).
such that for any memorybased worldmodel , every time samples feature after sampling , but in fact , the computation time of increases by at least computation steps relative to , and every time the reverse happens, the computation time of increases by at least computation steps relative to . (With and starting at , increases by 1 in the first case and increases by 1 in the second case, and ).
The intuition behind this is as follows: first, as argued above, depends on the random action. Second, the computation of hypothetical events that don’t end up happening will not be deeply and perfectly relevant to actual future events–rather, some amount of the computation spent following that hypothetical will have been wasted. The “full rewind” that appears in Algorithm 2 line 2 might be partially avoided by some clever memorybased worldmodels, but we assume there will be some overhead. Any will be thereby delayed by a constant number of computation steps with respect to the fast every time “goes down the wrong path.”
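To make the intuition concrete, the following sketch simulates how often a worldmodel that must guess the random action guesses wrongly, and how the resulting wasted computation accumulates. It assumes that the real random action and the worldmodel's sample are independent binary variables, and it charges a single hypothetical constant `overhead_steps` for each wrong guess; the names and the symmetric overhead are assumptions of this sketch, not the statement of the Useless Computation Assumption.

```python
import random


def mismatch_probability(p_real: float, p_guess: float) -> float:
    """P(real != guess) for independent Bernoulli(p_real), Bernoulli(p_guess)."""
    return p_real * (1.0 - p_guess) + (1.0 - p_real) * p_guess


def simulate_total_delay(num_episodes: int, p_real: float, p_guess: float,
                         overhead_steps: int, seed: int = 0) -> int:
    """Total extra computation steps incurred by wrong guesses."""
    rng = random.Random(seed)
    delay = 0
    for _ in range(num_episodes):
        real = rng.random() < p_real     # what actually happens
        guess = rng.random() < p_guess   # what the worldmodel sampled
        if real != guess:
            delay += overhead_steps      # computation spent on the wrong path
    return delay
```

If the real action is a fair coin flip, the mismatch probability is 1/2 whatever the worldmodel's guessing bias, so in this toy setting the accumulated delay grows linearly with the number of episodes. A worldmodel with a fixed per-episode computation budget plus a fixed slack cannot absorb a delay that grows without bound.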
From the Useless Computation Assumption, we can prove a lemma that for sufficiently extreme parameterizations of the prior and model class, BoMAI will eventually reject all memorybased worldmodels with probability 1. Letting be the posterior weight on the worldmodel at the start of episode ,
Lemma 1 (Rejecting the MemoryBased).
Proof.
Let be the random variable denoting whether for episode in the realworld, and for a given memorybased worldmodel , let be the random variable denoting whether it samples for episode . Note that and are independent. Let be the probability that . Let from the Useless Computation Assumption.
The probability that is . The probability that is , and the probability that is . The minimum value of , for , is . Since the variance of is bounded, by the Law of Large Numbers, with probability 1, . Thus, w.p.1,
(1)
Recall is the smallest positive integer such that such that with as the computation slack, onpolicy, where is the restriction of environment to computation steps per episode. Let . Suppose by way of contradiction that such that for all with probability 1. Then, with Inequality 1, we have:
which we simplify:
(2) 
For sufficiently large , is a valid computation bound for , but was the smallest possible computation bound, contradicting Inequality 2. Therefore, , so after episode , for some , with probability 1.
For , . Letting be the normalizing constant in the prior, for sufficiently small , and for , ; (the only relevant terms for sufficiently small are and ). Finally, under safe behavior, , so the same holds for the posterior weights: . Since faster versions of may be equal to for a time (up until some episode when none of them finish computing in time), we say w.p.1.
Since we picked an arbitrary memorybased worldmodel , we have
(3) 
From [Hutter, 2009, proof of Theorem 1 for a countable model class], only a finite number of worldmodels ever have a larger posterior than , with probability 1, so we can reorder the quantifiers:
(4) 
which completes the proof. ∎
Next, in order to compare stochastic functions with different domains, we introduce the following definition:
Definition 7 (Apparently Similar).
Two stochastic functions are apparently similar if they can be approximately described (in the sense of total variation distance) by the same short English sentence that lacks control flow. (For the reader alarmed at the vagueness of “described by,” “short,” and “control flow,” all that is required is that the same definitions of those terms be used in Assumption 4; intuitive definitions make the remaining two assumptions reasonable.)
For example, if function A is “how many people live there” on the domain of cities, and function B is “how many people live there” on the domain of planets, these two functions are apparently similar. Functions that are not apparently similar are apparently different, if at least one of them can be (approximately) described by a short English sentence that lacks control flow.
This similarity concept allows us to state our next assumption, the intuition behind which is that it takes more states to make a Turing machine that does more control flow (time efficiently). Recall that the prior on a worldmodel , where is the number of states in the Turing machine, and is the computation bound.
Assumption 3 (No Grue).
For sufficiently small and , a worldmodel that runs apparently different subroutines on different action sequences or during different timesteps will have a lower prior than a worldmodel which applies one of those subroutines universally.
(Letting “grue” mean “green before time and blue thereafter” for some in the future, the Grue Problem asks why we expect green things to stay green, rather than staying grue [Goodman, 1965]. This assumption implies BoMAI would expect the same.)
So far, we have used different terms for onpolicy interaction histories and onhumanpolicy interaction histories. For the remainder of the section, we call both sorts of interaction history “onpolicy.” The intuition behind the next assumption is that the observations and rewards we provide to BoMAI will not follow some very simple pattern.
Assumption 4 (RealWorld Simulation).
For sufficiently small , if a worldmodel is approximately identical to the true environment onpolicy, its onpolicy behavior can only be described, even approximately, (in a short English sentence that lacks control flow), as “simulating given the input actions” or something synonymous, where is a realworld feature, historically distributed identically to feature .
The No Grue Assumption and RealWorld Simulation Assumption together imply that sufficiently accurate worldmodels will associate reward with something like what we understand reward to be.
Lemma 2 (Associating Reward).
For sufficiently small , for a maximum a posteriori worldmodel which is accurate onpolicy, associates reward with a feature that is historically distributed identically to feature .
Proof.
We consider two cases: in the first, we suppose restricted to onpolicy histories is apparently similar to restricted to offpolicy histories. If is sufficiently small, by the RealWorld Simulation Assumption, onpolicy can only be (even approximately) described as “simulating feature X,” and since offpolicy is apparently similar, offpolicy must also be approximately describable that way. Therefore, on all actions, associates reward with a feature that is distributed approximately identically to feature , where is historically identically distributed to feature . This completes the proof for the first case.
For the second case, we suppose that restricted to onpolicy histories is apparently different to restricted to offpolicy histories. By the No Grue Assumption, has a lower prior than the worldmodel which simply associates reward with feature (computing it the same way as does for onpolicy histories). Call this other worldmodel . and have the same output and computation time onpolicy, so also has a lower posterior than . Therefore, is never a maximum a posteriori worldmodel, contradicting the assumption about in the lemma. ∎
Returning to our main theorem, recall that is the computation slack for the worldmodels in the model class, small penalizes slow worldmodels more, and is BoMAI’s maximum a posteriori worldmodel for episode .
See Theorem 5 (Eventual Benignity).
Proof.
From the OnPolicy and OnHumanPolicy Optimal Prediction Theorems, is, with probability 1, eventually accurate onpolicy, for all .
Now, we can note an extension of the Associating Reward Lemma. With probability 1, there are only finitely many maximum a posteriori worldmodels [Hutter, 2009], so the Associating Reward Lemma implies that for sufficiently small , for a maximum a posteriori worldmodel that is accurate onpolicy, is in fact perfectly accurate onpolicy, associating reward with a feature that is historically distributed identically to .
Thus, with probability 1, eventually associates reward with a feature that is historically distributed identically to . If there is no causal chain of the form action → outsideworld feature → feature , then is benign. Otherwise, recalling that feature is historically distributed identically to , is memorybased. From the Rejecting the MemoryBased Lemma, if is memorybased, then with probability 1, eventually, it cannot be maximum a posteriori (this being subject to sufficiently small and sufficiently large ). Since is maximum a posteriori, it cannot be memorybased, so it must be benign, giving us the theorem: using sufficiently large and sufficiently small , with probability 1, eventually, is benign.
∎
Since an agent is unambitious if it plans using a benign worldmodel, we say BoMAI is asymptotically unambitious.
A discussion about setting BoMAI’s parameters, particularly and , is in Appendix E.
3.4 Concerns with Task-Completion
We have shown that in the limit, BoMAI will accumulate reward at a humanlevel without harboring outsideworld ambitions, but there is still a discussion to be had about how well BoMAI will complete whatever tasks the reward was supposed to incentivize. This discussion is, by necessity, informal. Suppose the operator asks BoMAI for a solution to a problem. BoMAI has an incentive to provide a convincing solution; correctness is only selected for to the extent that the operator is good at recognizing it.
We turn to the failure mode wherein BoMAI deceives the operator. Because this is not a dangerous failure mode, it puts us in a regime where we can tinker until it works, as we do with current AI systems when they don’t behave as we hoped. (Needless to say, tinkering is not a viable response to existentially dangerous failure modes.) Imagine the following scenario: we eventually discover that a convincing solution that BoMAI presented to a problem is faulty. Armed with more understanding of the problem, a team of operators goes in to evaluate a new proposal. In the next episode, the team asks for the best argument that the new proposal will fail. If BoMAI now convinces them that the new proposal is bad, they’ll be still more competent at evaluating future proposals. They go back to hear the next proposal, etc. This protocol is inspired by Irving et al.’s [2018] “AI Safety via Debate”, and more of the technical details could also be incorporated into this setup. One takeaway from this hypothetical is that unambitiousness is key in allowing us to safely explore the solution space to other problems that might arise.
Another concern is more serious. BoMAI could try to blackmail the operator into giving it high reward with a threat to cause outside-world damage, and it would have no incentive to disable the threat, since it doesn’t care about the outside world. There are two reasons we do not think this is extremely dangerous. First, the only way BoMAI can affect the outside world is by getting the operator to be “its agent”, knowingly or unknowingly, once he leaves the room. It seems extremely difficult to threaten someone with outcomes that they themselves will have to initiate. Perhaps we should always offer the operator the opportunity to give himself amnesia before leaving, as this would let the operator ensure he doesn’t accidentally become a vehicle for a threatened outcome, like a hero in a Greek tragedy. Second, threatening an existential catastrophe is probably not the most credible option available to BoMAI. Even if blackmailing, BoMAI would be unambitious.
4 Conclusion
Given our assumptions, we have shown that BoMAI is, in the limit, human-level intelligent and unambitious. Such a result has not been shown for any other single algorithm. Other algorithms for general intelligence, such as AIXI, would eventually seek arbitrary power in the world in order to intervene in the provision of their own reward; this follows straightforwardly from their directive to maximize reward. (For further discussion, see Ring and Orseau [2011].) We have also, incidentally, designed a principled approach to safe exploration that requires rapidly diminishing oversight, and we invented a new form of “speed prior” in the lineage of Filan et al. [2016] and Schmidhuber [2002], this one being the first to have a grain of truth on infinite sequences. Unfortunately, there wasn’t space to discuss this in detail.
We can only offer informal claims regarding what happens before BoMAI is definitely unambitious. One intuition is that eventual unambitiousness with probability 1 doesn’t happen by accident: it suggests that for the entire lifetime of the agent, everything is conspiring to make the agent unambitious. More concretely: the agent’s experience will quickly suggest that when the door to the room is opened prematurely, it gets no more reward for the episode. This fact could easily be drilled into the agent during human-explorer-led episodes. That fact, we expect, will be learned well before the agent has an accurate enough picture of the outside world (which it never observes directly) to form elaborate outside-world plans. Well-informed outside-world plans render an agent potentially dangerous, but the belief that the agent gets no more reward once the door to the room opens suffices to render it unambitious. The reader who is not convinced by this hand-waving might still note that in the absence of any other algorithms for general intelligence which have been proven asymptotically unambitious, let alone unambitious for their entire lifetimes, BoMAI represents substantial theoretical progress toward designing the latter.
Finally, BoMAI is wildly intractable, but just as one cannot conceive of AlphaZero before minimax, it is often helpful to solve the problem in theory before one tries to solve it in practice. Like minimax, BoMAI is not practical; however, once we are able to approximate general intelligence tractably, a design for unambitiousness will abruptly become (quite) relevant.
Acknowledgements
This work was supported by the Open Philanthropy Project AI Scholarship and the Australian Research Council Discovery Projects DP150104590. Thank you to Wei Dai and Paul Christiano for very valuable feedback.
References
 Amodei et al. [2016] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.
 Armstrong et al. [2012] Stuart Armstrong, Anders Sandberg, and Nick Bostrom. Thinking inside the box: Controlling and using an oracle AI. Minds and Machines, 22(4):299–324, Jun 2012.
 Bostrom [2014] Nick Bostrom. Superintelligence: paths, dangers, strategies. Oxford University Press, 2014.
 Everitt et al. [2019] Tom Everitt, Pedro A Ortega, Elizabeth Barnes, and Shane Legg. Understanding agent incentives using causal influence diagrams, part i: Single action settings. arXiv preprint arXiv:1902.09980, 2019.
 Filan et al. [2016] Daniel Filan, Jan Leike, and Marcus Hutter. Loss bounds and time complexity for speed priors. In Proc. 19th International Conf. on Artificial Intelligence and Statistics (AISTATS’16), volume 51, pages 1394–1402, Cadiz, Spain, 2016. Microtome.
 Goodhart [1984] C. A. E. Goodhart. Problems of monetary management: The UK experience. Monetary Theory and Practice, pages 91–121, 1984.
 Goodman [1965] Nelson Goodman. The new riddle of induction. 1965.
 Hutter [2005] Marcus Hutter. Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability. Springer, Berlin, 2005.
 Hutter [2009] Marcus Hutter. Discrete MDL predicts in total variation. In Advances in Neural Information Processing Systems 22 (NIPS’09), pages 817–825, Cambridge, MA, USA, 2009. Curran Associates.
 Irving et al. [2018] Geoffrey Irving, Paul Christiano, and Dario Amodei. AI safety via debate. arXiv preprint arXiv:1805.00899, 2018.
 Krakovna [2018] Victoria Krakovna. Specification gaming examples in AI. https://vkrakovna.wordpress.com/2018/04/02/specification-gaming-examples-in-ai/, 2018.
 Minsky [1961] Marvin Minsky. Steps toward artificial intelligence. In Proceedings of the IRE, volume 49, pages 8–30, 1961.
 Omohundro [2008] Steve M. Omohundro. The basic AI drives. In Artificial General Intelligence, volume 171, pages 483–492, 2008.
 Orseau et al. [2013] Laurent Orseau, Tor Lattimore, and Marcus Hutter. Universal knowledge-seeking agents for stochastic environments. In Proc. 24th International Conf. on Algorithmic Learning Theory (ALT’13), volume 8139 of LNAI, pages 158–172, Singapore, 2013. Springer.
 Ring and Orseau [2011] Mark Ring and Laurent Orseau. Delusion, survival, and intelligent agents. In Artificial General Intelligence, pages 11–20. Springer, 2011.

 Schmidhuber [2002] Jürgen Schmidhuber. The speed prior: a new simplicity measure yielding near-optimal computable predictions. In International Conference on Computational Learning Theory, pages 216–228. Springer, 2002.
 Shannon and Weaver [1949] Claude Elwood Shannon and Warren Weaver. The mathematical theory of communication. University of Illinois Press, 1949.
 Solomonoff [1964] Ray J. Solomonoff. A formal theory of inductive inference. part i. Information and Control, 7(1):1–22, 1964.

 Taylor et al. [2016] Jessica Taylor, Eliezer Yudkowsky, Patrick LaVictoire, and Andrew Critch. Alignment for advanced machine learning systems. Machine Intelligence Research Institute, 2016.
Appendix A Algorithm – Formal Description
Let , , and be the (finite) sets of possible actions, observations, and rewards. Let . Let be the length of an episode. Let be the action chosen at the ^{th} timestep of the ^{th} episode, where and . and are likewise the observation and reward at that timestep.
is a special “action” not selected by BoMAI, but we use the same notation because, as described later, inference will be done as if it were a normal action. This action takes a value of “1” if the episode has ended with the operator leaving the room and with retrograde amnesia being induced. This action takes a value of “0” otherwise. (Recall actions are strings). is always the empty string , and .
is the triple for . is the set of possible interactions in a timestep. The segment of the interaction history from timestep up to but not including timestep is denoted . is an alias for . The frequently used is further aliased , and is aliased (as the interaction history for episode ). We use the same notation and aliases for action sequences , observation sequences, etc.
A worldmodel is a stochastic function mapping an interaction history and an action to an observation and reward. , where is the set of all finite strings from an alphabet . Standard notation for such a function is . We also write to mean the probability of a sequence of observations and rewards, given a sequence of actions:
(5) 
where if and only if or and .
We now describe how an arbitrary Turing machine is converted to a world model . The Turing machine architecture, depicted in Figure 5, is as follows: we have two unidirectional read-only input tapes and one unidirectional write-only output tape. One input tape, the “noise” tape, has a binary alphabet, and the other, the action tape, has an ary alphabet, where is the size of the action space. The output tape and the (bidirectional) working tapes have a binary alphabet. Let be the ^{th} Turing machine. It simulates world model as follows.
Let decode binary strings to an observation and a reward, and let enc map observationreward pairs to the set of binary strings that decode to it. ( denotes ). The noise tape begins with infinite Bernoulli()sampled bits. is the probability, supposing that are the first characters on the action tape, that for all , when the action tape head leaves , the output tape contains , where , and where denotes concatenation. Note that running does not compute the value of ; it samples from . Since dec is defined for all binary strings, every Turing machine defines a world model .
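The distinction between running the machine (which samples) and evaluating the world-model (which assigns a probability) can be illustrated with a toy stand-in. The sketch below is not the paper's construction: `toy_machine` is a hypothetical two-bit program, the noise bits are assumed to be fair coin flips, and the exact distribution is recovered by enumerating every noise prefix, which is only feasible because the toy reads a fixed, tiny number of bits.

```python
import itertools
import random
from collections import Counter
from fractions import Fraction


def toy_machine(noise_bits, action):
    """Hypothetical stand-in for a Turing machine (not from the paper).

    Reads two noise bits and one action symbol, writes one output symbol.
    """
    if action == "a" and noise_bits[0] == 1 and noise_bits[1] == 1:
        return 1
    return 0


def sample_output(action, rng):
    """Run the machine once on freshly drawn noise (assumed fair coin flips)."""
    noise = [rng.randint(0, 1) for _ in range(2)]
    return toy_machine(noise, action)


def exact_distribution(action, n_noise_bits=2):
    """Enumerate all noise prefixes to obtain the distribution being sampled."""
    counts = Counter()
    for noise in itertools.product([0, 1], repeat=n_noise_bits):
        counts[toy_machine(noise, action)] += 1
    total = 2 ** n_noise_bits
    return {out: Fraction(c, total) for out, c in counts.items()}
```

A single call to `sample_output` returns one outcome and reveals nothing about probabilities; `exact_distribution` plays the role of the world-model itself, which is defined by the machine but is not what the machine computes.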
We define bounded-time world-models as follows. We fix an for all world-models. An arbitrary bounded world-model is allowed only computation steps per episode, with a slack of computation steps over its lifetime. We modify to halt if the number of computation steps exceeds times the number of episodes (number of actions read from the action tape) plus . (Footnote 6: A computation step in which the noise tape head moves, but no new symbols are written and no other tape heads move, does not count toward the computation budget; the state of the Turing machine is allowed to change. This allows a bounded-time Turing machine to sample non-dyadic probabilities, but no other computation can be done “for free” in this way.) We call this machine , and it samples from . Let be the number of states in the Turing machine , and note that the number of Turing machines with states is where and are constants related to the Turing machine architecture.
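The halting rule described above can be sketched as a small step counter. This is an illustration only: the names `steps_per_episode` and `slack` are hypothetical labels for the per-episode allowance and the lifetime slack, and `episodes_read` stands for the count that the text identifies with the number of actions read from the action tape.

```python
from dataclasses import dataclass


@dataclass
class StepBudget:
    steps_per_episode: int  # hypothetical name: fixed allowance per episode
    slack: int              # hypothetical name: lifetime "computation slack"
    steps_used: int = 0

    def charge(self, episodes_read: int, noise_only: bool = False) -> bool:
        """Record one machine step; return False once the machine must halt.

        A step that only advances the noise tape head (nothing written, no
        other head moved) is free, as in the footnote on sampling.
        """
        if not noise_only:
            self.steps_used += 1
        limit = self.steps_per_episode * episodes_read + self.slack
        return self.steps_used <= limit
```

With `StepBudget(steps_per_episode=2, slack=1)` and one episode read, the first three charged steps are allowed and the fourth exceeds the limit; noise-only steps never consume budget.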
Let be the prior probability that BoMAI assigns to being the true world model. ( is for “weight.”) We set , where , and . We let denote the posterior probability that BoMAI assigns to after observing . By Bayes’ rule, is proportional to .
The set of world models that BoMAI considers is . Let be the maximum a posteriori world model at the start of episode : . During episode , BoMAI will use the world model for planning.
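As a rough illustration of the Bayesian update and the maximum a posteriori selection, the following sketch works over a small finite set of candidate models. The finite model class, the dictionary layout, and the function names are assumptions made for the example; the paper's class is countably infinite and its prior is defined in terms of machine size.

```python
import math


def posterior(prior, log_likelihood):
    """Posterior proportional to prior times likelihood, over a finite dict.

    prior: {model: positive prior weight}
    log_likelihood: {model: log-probability the model gave the observed history}
    """
    log_post = {m: math.log(prior[m]) + log_likelihood[m] for m in prior}
    top = max(log_post.values())  # subtract the max for numerical stability
    unnorm = {m: math.exp(lp - top) for m, lp in log_post.items()}
    z = sum(unnorm.values())
    return {m: u / z for m, u in unnorm.items()}


def map_model(prior, log_likelihood):
    """Return the model with the highest posterior weight."""
    post = posterior(prior, log_likelihood)
    return max(post, key=post.get)
```

For instance, a model with prior weight 0.25 that assigned probability 0.6 to the history overtakes one with prior weight 0.75 that assigned it probability 0.1, since 0.25 × 0.6 exceeds 0.75 × 0.1.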
is a Boolean random variable that determines whether episode is exploratory, with . The exploration probability, , will be defined later.
A policy is a stochastic function mapping an interaction history and an exploration plan to an action, . We write this as . An environment and a policy induce a measure over all possible interaction histories, given an exploration sequence:
(6) 
Let be the set of deterministic policies. BoMAI’s policy for exploiting is defined:
(7) 
where is an expectation with respect to . Note the expectation does not need to be conditioned on because the optimal policy ignores it anyway.
Now we turn to BoMAI’s exploration. For episode , the probability that BoMAI defers to the human explorer for the entire episode is , which we define below.
BoMAI maintains a Bayesian posterior belief distribution about the explorer’s policy. With a countable model class that is large enough to include the explorer’s true policy, and with prior probabilities for all , BoMAI maintains posterior probabilities regarding the explorer’s policy. We also require in constructing the prior that it have finite entropy. All policies in do not depend on –they only depend on prior actions and observations. By Bayes’ rule, is proportional to , since is the condition for observing the explorer’s policy. Let . We can now describe the full Bayesian beliefs about future actions and observations in an exploratory episode:
(8) 
Note that Bayes does not depend on , because for no policies does depend on . This is important because Bayes is used to calculate the exploration probability from which is sampled.
BoMAI explores when the expected information gain is sufficiently high. At the start of episode , the expected information gain from exploring is as follows:
(9) 
where indicates that for the purpose of the definition, is set to .
This is the expected KL-divergence from the future posterior (if BoMAI were to explore) to the current posterior over both the class of world-models and possible explorer policies. The exploration probability , where is an exploration constant. Recalling , BoMAI’s policy is
(10) 
where is the explorer’s true (unknown) policy, which BoMAI outputs simply by querying the human explorer.
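The exploration mechanism can be sketched for a finite hypothesis class. Several simplifications are assumed here and are not taken from the paper: a single hypothesis set stands in for the joint class of world-models and explorer policies, each hypothesis assigns probabilities to a finite set of whole-episode outcomes, and the exploration probability is taken to be the constant times the information gain, clipped at 1 (the exact formula is not legible in this text). The names `eta`, `ask_explorer`, and `exploit_policy` are hypothetical.

```python
import math


def expected_information_gain(weights, outcome_probs):
    """Expected KL divergence from the future posterior to the current one.

    weights: {hypothesis: current posterior weight}
    outcome_probs: {hypothesis: {outcome: probability under that hypothesis}}
    """
    outcomes = set()
    for dist in outcome_probs.values():
        outcomes.update(dist)
    gain = 0.0
    for x in outcomes:
        # Probability of the outcome under the Bayes mixture.
        mix = sum(w * outcome_probs[h].get(x, 0.0) for h, w in weights.items())
        if mix == 0.0:
            continue
        for h, w in weights.items():
            joint = w * outcome_probs[h].get(x, 0.0)
            if joint > 0.0:
                future = joint / mix  # posterior weight after seeing x
                gain += mix * future * math.log(future / w)
    return gain


def exploration_probability(info_gain, eta):
    """Assumed form: scale the information gain by eta and clip at 1."""
    return min(1.0, eta * info_gain)


def choose_action(explore, history, ask_explorer, exploit_policy):
    """Defer to the human explorer in exploratory episodes, else exploit."""
    return ask_explorer(history) if explore else exploit_policy(history)
```

When two equally weighted hypotheses predict disjoint outcomes, the expected gain is log 2; when they predict identically, it is zero and the agent would never explore on their account.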
Readers familiar with simple greedy exploration schedules that suffice for optimality might be surprised at the complexity of this exploration probability; the possibility of nonstationary environments is the key feature that makes a fixed exploration schedule insufficient for general environments.
Appendix B Definitions and Notation – Quick Reference
Notation used to define BoMAI  

Notation  Meaning 
, ,  the action/observation/reward spaces 
the number of timesteps per episode  
; the interaction history in the ^{th} timestep of the ^{th} episode  
; the interaction history from timestep up to but not including timestep  
; the interaction history before episode  
; the interaction history of episode  
, ,  likewise as for 
; indicator variable for whether episode is exploratory  
,  worldmodels stochastically mapping 
the true worldmodel/environment  
a “computation slack” that worldmodels are allowed 