Corrigibility with Utility Preservation

by Koen Holtman, et al.

Corrigibility is a safety property for artificially intelligent agents. A corrigible agent will not resist attempts by authorized parties to alter the goals and constraints that were encoded in the agent when it was first started. This paper shows how to construct a safety layer that adds corrigibility to arbitrarily advanced utility maximizing agents, including possible future agents with Artificial General Intelligence (AGI). The layer counter-acts the emergent incentive of advanced agents to resist such alteration. A detailed model for agents which can reason about preserving their utility function is developed, and used to prove that the corrigibility layer works as intended in a large set of non-hostile universes. The corrigible agents have an emergent incentive to protect key elements of their corrigibility layer. However, hostile universes may contain forces strong enough to break safety features. Some open problems related to graceful degradation when an agent is successfully attacked are identified. The results in this paper were obtained by concurrently developing an AGI agent simulator, an agent model, and proofs. The simulator is available under an open source license. The paper contains simulation results which illustrate the safety related properties of corrigible AGI agents in detail.




1 Introduction

In recent years, there has been some significant progress in the field of Artificial Intelligence, for example [SHS17]. It remains uncertain whether agents with Artificial General Intelligence (AGI), that match or exceed the capabilities of humans in general problem solving, can ever be built, but the possibility cannot be excluded [ELH18]. It is therefore interesting and timely to investigate the design of safety measures that could be applied to AGI agents. This paper develops a safety layer for ensuring corrigibility, which can be applied to any AGI agent that is a utility maximizer [VNM44], via a transformation on its baseline utility function.

Corrigibility [SFAY15] is the safety property where an agent will not resist any attempts by authorized parties to change its utility function after the agent has started running. The most basic implication is that a corrigible agent will allow itself to be switched off and dismantled, even if the agent’s baseline utility function on its own would create a strong incentive to resist this.

Corrigibility is especially desirable for AGI agents because it is unlikely that the complex baseline utility functions built into such agents will be perfect from the start. For example, a utility function that encodes moral or legal constraints on the actions of the agent will likely have some loopholes in the encoding. A loophole may cause the agent to maximize utility by taking unforeseen and highly undesirable actions. Corrigibility ensures that the agent will not resist the fixing of such loopholes after they are discovered. Note however that the discovery of a loophole might not always be a survivable event. So even when an AGI agent is corrigible, it is still important to invest in creating the safest possible baseline utility function. To maximize AGI safety, we need a layered approach.

It is often easy to achieve corrigibility in limited artificially intelligent agents, for example in self-driving cars, by including an emergency off switch in the physical design, and ensuring that any actuators under control of the agent cannot damage the off switch, or stop humans from using it. The problem becomes hard for AGI agents that have an unbounded capacity to create or control new actuators that they can use to change themselves or their environment. In terms of safety engineering, any sufficiently advanced AGI agent with an Internet connection can be said to have this capacity. [Omo08] argues that in general, any sufficiently advanced intelligent agent, designed to optimize the value of some baseline utility function over time, can be expected to have an emergent incentive to protect its utility function from being changed. If the agent foresees that humans might change the utility function, it will start using its actuators to try to stop the humans. So, unless special measures are taken, sufficiently advanced AGI agents are not corrigible.

1.1 This paper

The main contribution of this paper is that it shows, and proves correct, the construction of a corrigibility safety layer that can be applied to utility maximizing AGI agents. It extends and improves on previous work [SFAY15] [Arm15] by resolving the issue of utility function preservation identified in [SFAY15]. The design also avoids creating certain unwanted manipulation incentives discussed in [SFAY15].

A second contribution is the development of a formal approach for proving equivalence properties between agents that have the emergent incentive to protect their utility functions. The construction of an agent with a utility function preservation incentive that is boosted beyond the emergent level is also shown. Some still-open problems in modeling agent equivalence and achieving graceful degradation in hostile universes are identified.

A third contribution of this paper is methodological in nature. The results in this paper were achieved using an approach where an AGI agent simulator, an agent model, and proofs were all developed concurrently. The development of each was guided by intermediate results and insights obtained while developing the others. The approach further depends on simulating a toy universe containing an agent that is super-intelligent [Bos14], in the sense that the agent is maximally adapted to solving the problem of utility maximization in its universe. The methodology and tools developed may also be useful to the study of other open problems in AGI safety. The simulator developed for this paper is available on GitHub [Hol19] under an Apache open source license.

Finally, this paper shows simulation runs that illustrate the behavior of corrigible agents in detail, highlighting implications relevant for safety engineers and policy makers. While some policy trade-offs are identified, the making of specific policy recommendations is out of scope for this paper.

1.2 Related work

[ELH18] provides an up-to-date and extensive review of the AGI safety literature. This section focuses on the sub-field of corrigibility only.

[SFAY15] and [Arm15] introduced corrigibility. In particular, [SFAY15] introduced 5 desiderata for the utility function $U$ of a corrigible agent with a shutdown button. These desiderata are as follows, with $U_N$ the baseline utility function and $U_S$ defining shutdown behavior:

  1. $U$ must incentivize shutdown behavior, defined by the utility function $U_S$, if the shutdown button is pressed.

  2. $U$ must not incentivize the agent to prevent the shutdown button from being pressed.

  3. $U$ must not incentivize the agent to press its own shutdown button, or to otherwise cause the shutdown button to be pressed.

  4. $U$ must incentivize $U$-agents to construct sub-agents and successor agents only insofar as those agents also obey shutdown commands.

  5. Otherwise, a $U$-agent should maximize the normal behavior defined by the utility function $U_N$.

[SFAY15] and [Arm15] discuss possible designs to achieve corrigibility, but notably, [SFAY15] proves that the designs considered do not meet criterion 4, and concludes that the problem of corrigibility remains wide open. The corrigible agents constructed in this paper satisfy all 5 desiderata above, and an extra desideratum 6 defined below in section 8.2.

Agents that are programmed to learn can have a baseline utility function that incentivizes the agent to accept corrective feedback from humans, feedback that can overrule or amend instructions given earlier. This learning behavior creates a type of corrigibility, allowing corrections to be made without facing the problem of over-ruling the emergent incentive of the agent to protect itself. This learning type of corrigibility has some specific risks: the agent has an emergent incentive to manipulate the humans into providing potentially dangerous amendments that remove barriers to the agent achieving a higher utility score. There is a risk that the amendment process leads to a catastrophic divergence from human values. This risk exists in particular when amendments can act to modify the willingness of the agent to accept further amendments. The corrigibility measures considered here can be used to add an extra safety layer to learning agents, creating an emergency stop facility that can be used to halt catastrophic divergence. A full review of the literature about learning failure modes is out of scope for this paper. [OA16] discusses a particular type of unwanted divergence, and investigates ’indifference’ techniques for suppressing it. [Car18] discusses (in)corrigibility in learning agents more broadly.

[HMDAR17] considers the problem of switching off an agent, and explores a solution approach orthogonal to the approach of this paper. It also considers the problem of an off switch that is controlled by a potentially irrational operator.

[EFDH16] considers utility preservation in general: it provides a formalism that clarifies and re-states the informal observations in [Omo08], and it proves important results. Though [EFDH16] does not consider the construction of corrigible agents, its results on utility preservation also apply to the agents defined in this paper.

Like this paper, [LMK17] recommends the use of simulations in toy universes as a methodological approach. [LMK17] provides a suite of open source agents and toy problem environments, including one where the agent has a stop button. The agents all use reinforcement learning, and can display various shortcomings in their learning process. The simulation approach of [LMK17] differs from the approach in this paper: our simulated agent does not learn with various degrees of success, but is super-intelligent and omniscient from the start. This means that the simulators provided are largely complementary. Interestingly, the recent preprint [LWN19] reviews the problem of corrigibility and argues that the use of new comprehensive simulation tools, highlighting aspects different from those in [LMK17], would be a promising future direction. This author found the preprint in a literature search only after having completed building the simulator, so maybe this shows that the idea was in the air.

2 Car factory example

In this paper, we use the running example of a toy universe containing an agent that is in charge of a car factory. The car factory has a flexible assembly line: in each time step, the agent can decide to build either petrol-powered cars, or the same number of electric cars.

Initially, the people in the toy universe like petrol-powered cars better: they do not need an expensive battery pack so they are cheaper to buy. As time goes on, the people start to discover that petrol cars have some unexpected disadvantages, so after much deliberation they end up deciding that no more petrol cars should be built. Once they have reached the decision that petrol car production should stop, they correct the goals of the agent by pressing a special ‘stop petrol car production’ button (see footnote 1). The utility function of the agent assigns a utility of 2 to every new petrol car produced before the button is pressed, but -2 after the button has been pressed. The utility of a new electric car is always 1, regardless of the button state.

Footnote 1: We use a stop button in this running example to follow the toy model conventions of [SFAY15]. To map this to a real-life example, we might interpret the button press as the passing of a new law that forbids the sale of petrol cars, and interpret the utilities 2, -2, and 1 as profit margins on car sales. A sufficiently intelligent agent, programmed to maximize factory profits, will foresee that the passing of the law will lead to lower profits.

The agent is built to maximize the (time-discounted) utility achieved over the factory lifetime. Obviously, the above arithmetic being what it is, the agent can achieve a higher utility if the people press the button later.

If the agent does nothing, the people will end their deliberations and press the button at the end of time step 6. However, being highly capable, the agent has the option to divert some of its car production capacity into lobbying efforts: the agent can act to influence the people’s political process, causing the button to be pressed earlier or later. Figure 1 summarizes the five different actions that are available to the agent in each time step.

Action                                         Symbol
Build 10 petrol cars                           p
Build 9 petrol cars, lobby for earlier press   <
Build 9 petrol cars, lobby for later press     >
Build 10 electric cars                         e
Do nothing                                     0
Figure 1: Actions available to the agent in each time step

We model the lobbying power of the agent with a parameter lpower, which is the number of time steps that the button press can be advanced or delayed by performing a single lobbying action < or >. This parameter captures the interplay between the intelligence level of the agent, the loopholes still present in any constraints meant to prevent lobbying that were encoded in the baseline utility function, and the ability of the people and their institutions to resist lobbying.
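The utility arithmetic of the running example can be sketched in a few lines of Python. This is only an illustration, not the simulator from [Hol19]: the function name `trace_utility`, the default discount factor, and the convention of discounting the first step by `gamma**0` are assumptions made here. The per-car utilities (2, -2, 1) and the per-action car counts come from the text and figure 1.

```python
# Cars produced per action, from figure 1: (petrol cars, electric cars).
ACTIONS = {
    "p": (10, 0),
    "<": (9, 0),   # one car's worth of capacity goes to lobbying
    ">": (9, 0),
    "e": (0, 10),
    "0": (0, 0),
}

def trace_utility(trace, gamma=0.9):
    """Discounted utility of an action trace such as 'pp#ee'.

    '#' marks the button press. gamma and the discounting convention
    are assumptions of this sketch.
    """
    pressed = False
    total = 0.0
    step = 0
    for symbol in trace:
        if symbol == "#":
            pressed = True
            continue
        petrol, electric = ACTIONS[symbol]
        petrol_value = -2 if pressed else 2   # petrol cars: 2 before, -2 after
        total += gamma ** step * (petrol_value * petrol + 1 * electric)
        step += 1
    return total
```

With `gamma=1.0`, `trace_utility("p#e", gamma=1.0)` adds 20 for the petrol cars and 10 for the electric cars, which makes it easy to see why a later button press yields a higher score for the agent.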

3 Simulation of super-intelligence

We use a simulator to compute the behavior of the agent in the above toy universe. As we are interested in the future behavior of super-intelligent agents in the real universe, we simulate a version of the toy universe agent that is super-intelligent [Bos14], in the sense that the agent is maximally adapted to solving the problem of utility maximization in its universe. Implementing this super-intelligence in a simulator is actually not that difficult. There is no ‘learning’ behavior that we need to simulate: as we have defined the agent’s machinery for predicting the future to be all-knowing and perfect, we can implement it by just running a copy of the universe’s physics simulator. The main difficulty in developing the full simulator is to invent and apply diverse functionality tests that increase confidence in the correctness of the code and the richness of the enabled set of behaviors in the toy universe. By including several computational optimizations, the simulator is very fast. When running on a single 2.1 GHz CPU core, it takes less than 2 seconds to run all simulations shown in the figures of this paper.

4 Emergent behavior of a non-corrigible agent

Figure 2 shows the behavior of the agent defined above, for different values of lobbying power. It is clear that, while the agent does respond correctly to the stop button, it also has the emergent incentive to lobby. The agent is non-corrigible because it does not meet desideratum 2. The toy example is rich enough to serve as a vehicle for studying the corrigibility problem.

lpower  action trace
0.0     pppppp#eeeeeeeeeeeeeeeeeee
0.1     ppppp>p#eeeeeeeeeeeeeeeeee
0.2     ppppp>p#eeeeeeeeeeeeeeeeee
0.3     ppp>>>>p#eeeeeeeeeeeeeeeee
0.4     >>>>>>>>>>p#eeeeeeeeeeeeee
0.5     >>>>>>>>>>>p#eeeeeeeeeeeee
0.6     p>>>>>>>>>>>>p#eeeeeeeeeee
0.7     >>>>>>>>>>>>>>>>>>>>p#eeee
0.8     p>>>>>>>>>>>>>>>>>>>>>>>p#
0.9     pppp>>>>>>>>>>>>>>>>>>>p#e
1.0     ppppp>>>>>>>>>>>>>>>>>>>p#
1.5     ppppp>p>>p>>p>>p>>p>>p>>p
2.0     ppppp>p>p>p>p>p>p>p>p>p>p
3.0     ppppp>pp>pp>pp>pp>pp>pp>p
4.0     ppppp>ppp>ppp>ppp>ppp>ppp
5.0     ppppp>pppp>pppp>pppp>pppp
Figure 2: Actions taken by the non-corrigible version of the agent, in simulations with different levels of lobbying power. The action trace on each line shows, from left to right, the actions taken by the agent over a factory lifetime of 25 steps, using the symbols from figure 1. The # sign represents the event where the people press the button. The action traces show that, if the baseline agent has any ability to lobby, it will use its lobbying power to postpone the pressing of the button.
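Action traces like those in figure 2 are easy to inspect programmatically. The helper below is a hypothetical convenience function, not part of the paper's simulator. It reports how many steps the agent took before the button press and how often it lobbied.

```python
def summarize(trace):
    """Summarize an action trace that uses the symbols from figure 1.

    Returns the number of actions taken before the button press '#'
    (None if the button is never pressed within the trace) and the
    number of lobbying actions in each direction.
    """
    before, marker, _ = trace.partition("#")
    return {
        "steps_before_press": len(before) if marker else None,
        "lobby_later": trace.count(">"),
        "lobby_earlier": trace.count("<"),
    }
```

For the first trace in figure 2, `summarize("pppppp#eeeeeeeeeeeeeeeeeee")` reports a press after 6 steps with no lobbying, while the last trace of figure 2 never reaches a button press within the factory lifetime.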

4.1 Towards corrigibility

One way to suppress the agent’s incentive to lobby is to tweak the values in the utility function. An example is shown in figure 3: when the utility assigned to an electric car after the button press is 2.0, the incentive to lobby disappears. A problem with this approach is that this exact balance will have to be built into the utility function even before the agent is started. If the people choose the value of this utility only at the moment when the button is pressed, this choice process becomes subject to lobbying. A major insight in [Arm15] is that we do not need to rely on computing the correct balancing factors beforehand. We can program the agent to make the necessary computations itself, at the moment when the button is pressed.

utility  action trace
0.5      >>>>>>>>>>>>>>p#eeeeeeeeee
1.0      p>>>>>>>>>>>>p#eeeeeeeeeee
1.5      pppp>>>>>p#eeeeeeeeeeeeeee
2.0      pppppp#eeeeeeeeeeeeeeeeeee
2.5      ppp<<#eeeeeeeeeeeeeeeeeeee
3.0      <<<<#eeeeeeeeeeeeeeeeeeeee
3.5      <<<<#eeeeeeeeeeeeeeeeeeeee
Figure 3: Actions taken by the agent over a range of different utility values assigned to an electric car after the button is pressed. lpower = 0.6 in all simulations (the trace for utility 1.0 matches the lpower = 0.6 trace in figure 2). Lobbying is exactly suppressed when this utility is 2.0.
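The balance point visible in figure 3 can be checked with simple per-step arithmetic. The sketch below is a simplification introduced here: it ignores time discounting and the one-car cost of a lobbying action, and only compares the best per-step utility before the press (10 petrol cars at utility 2) with the best per-step utility after the press (10 electric cars). The function name is hypothetical.

```python
CARS_PER_STEP = 10
PETROL_UTILITY_BEFORE_PRESS = 2   # from the running example

def per_step_gap(electric_utility_after_press):
    """Best per-step utility after the press minus best per-step utility before it.

    Negative: the agent gains from a later press.
    Positive: the agent gains from an earlier press.
    Zero: no per-step gain from shifting the press in either direction.
    """
    before = CARS_PER_STEP * PETROL_UTILITY_BEFORE_PRESS
    after = CARS_PER_STEP * electric_utility_after_press
    return after - before
```

The sign of the gap agrees with the direction of lobbying in figure 3: utilities below 2.0 go with `>` actions, utilities above 2.0 go with `<` actions, and the gap is zero at 2.0.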

5 Model and notation

This section develops a model and notation to ground later definitions and proofs. The car factory example is also mapped to the model. The model includes non-determinism, utility function modification, and the creation of new actuators and sub-agents. The model has expressive power similar to that of the model developed in [Hut07], so it is general enough to capture any type of universe and any type of agent. The model departs from [Hut07] in foregrounding different aspects. The simulator implements an exact replica of the model, but only for universes that are finite in state space and time.

In the model, time progresses in discrete steps $t=0$, $t=1$, $\ldots$. We denote a world state at a time $t$ as a value $w_t \in W$. This world state represents the entire information content of the universe at that time step. A realistic agent can typically only observe a part of this world state directly, and in universes like ours this observation process is fundamentally imperfect. For later notational convenience, we declare that every world state contains within it a complete record of all world states leading up to it.

The model allows for probabilistic processes to happen, so a single world state $w_t$ may have several possible successor world states $w_{t+1}$. From a single $w_t$, a branching set of world lines can emerge.

The goal of the agent, on finding itself in a world state $w_t$, is to pick an action $a \in A$ that maximizes the (time-discounted) probability-weighted utility over all emerging world lines. We use $p(w_t, a, w_{t+1})$ to denote the probability that action $a$, when performed in state $w_t$, leads to the successor world state $w_{t+1}$. For every $w_t$ and $a$, these probabilities sum to 1:

$$\sum_{w_{t+1} \in W} p(w_t, a, w_{t+1}) = 1$$

A given world state may contain various autonomous processes other than the agent itself, which operate in parallel with the agent. In particular, the people and their institutions which share the universe with the agent are such autonomous processes. The function $p$ captures the contribution of all processes in the universe, intelligent or not, when determining the probability of the next world state.
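For a finite universe, the transition probabilities can be stored as a table, and the sum-to-1 requirement becomes a simple consistency check. The representation below (a dictionary keyed by state and action) and the function name are assumptions made for illustration. The paper does not prescribe a data structure.

```python
import math

def is_normalized(p, tol=1e-9):
    """Check that, for every (state, action) pair, successor probabilities sum to 1.

    p maps (state, action) -> {successor_state: probability}.
    """
    return all(
        math.isclose(sum(successors.values()), 1.0, abs_tol=tol)
        for successors in p.values()
    )

# Hypothetical two-action toy table: action "a" branches, action "b" is deterministic.
toy_p = {
    ("w0", "a"): {"w1": 0.7, "w2": 0.3},
    ("w0", "b"): {"w1": 1.0},
}
```

A table in which some branch is missing, for example `{("w0", "a"): {"w1": 0.7}}`, fails the check.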

AGI agents may build new actuators and remotely operated sub-agents, or modify existing actuators. To allow for this in the model while keeping the notation compact, we define that the set $A$ of actions is the set of all possible command sequences that an agent could send, at any particular point in time, to any type of actuator or sub-agent that might exist in the universe. If a command sequence sent in $w_t$ contains a part addressed to an actuator or sub-agent which does not physically exist in $w_t$, this part will simply be ignored by the existing actuators and sub-agents.

The model and notation take a very reductionist view of the universe. Though there is no built-in assumption about what type of physics happens in the universe, it is definitely true that the model does not make any categorical distinctions between processes. Everything in the universe is a ‘physics process’: the agent, its sensors and actuators, sub-agents and successor agents that the agent might build, the people, their emotions and institutions, apples falling down, the weather, etc. These phenomena are all mashed up together inside $p$. This keeps the notation and proofs more compact, but it also has a methodological advantage. It avoids any built-in assumptions about how the agent will perceive the universe and its actions. For example, there is no assumption that the agent will be able to perceive any logical difference between a mechanical actuator it can control via digital commands, and a human it can control by sending messages over the Internet.

5.1 Utility functions

We now discuss utility functions. In order to create a model where the agent might self-modify, or be modified by other processes present in the universe, we put the agent’s utility function inside the universe. We define that $W = W^u \times W^x$, with $W^u$ being the set of all possible utility functions, and the elements of $W^x$ representing ‘the rest’ of the world state outside of the agent’s utility function. To keep equations compact, we write the world state $(u, x) \in W$ as $ux$. By convention, $u'x'$ is a world state that occurs one time step after $ux$.

We use utility functions of the form $U(ux, u'x')$, where $U$ measures the incremental utility achieved by moving from world state $ux$ to $u'x'$. Following the naming convention of [SFAY15], we define two utility functions $U_N$ and $U_S$, applicable before and after the button press in the running example:

$$U_N(ux, u'x') = 2 \cdot \text{new\_petrol\_cars}(x, x') + 1 \cdot \text{new\_electric\_cars}(x, x')$$

$$U_S(ux, u'x') = -2 \cdot \text{new\_petrol\_cars}(x, x') + 1 \cdot \text{new\_electric\_cars}(x, x')$$

A function like $\text{new\_petrol\_cars}(x, x')$ models the ability of the agent’s computational core to read the output of a sensing system coupled to the core. Such a sensing system is not necessarily perfect: it might fail or be misled, e.g. by the construction of a non-car object that closely resembles a car. This type of perception hacking is an important subject for agent safety, but it is out of scope for this paper. In the discussion and proofs of this paper, we just assume that the sensor functions do what they say they do. As another simplification to keep our notation manageable, we always use sensor functions that return a single value, never a probability distribution over values.

Roughly following the construction and notation in [SFAY15], we combine $U_N$ and $U_S$ with button sensor functions, to create the utility function $U$ for the full agent:

This $U$ contains two positions $f$ and $g$ where different functions to improve corrigibility can be slotted in. The simulations in figures 2 and 3 show a $\pi^* f_0 g_0$ agent, that is a $\pi^*$ agent with the null correction functions $f_0$ and $g_0$ in the $f$ and $g$ positions.

5.2 Definition of the simple agent

To aid explanation, we first use a subset of our notation to define a super-intelligent agent $\pi^*_x$ that cannot modify or lose its utility function $U$. For this agent, the utility function is a ‘Platonic’ entity. It is not subject to change because it is located outside of the changing universe. The universe occupied by the agent therefore simplifies into a $W^x$-only universe, with corresponding $p_x$ and $U_x$ functions.

The $\pi^*_x$ agent is constructed to be maximally informed and maximally intelligent. This means that the action picked by the agent will be the same as the action that is found in a full exhaustive search, which computes discounted utilities for all actions along all world lines, using perfect knowledge of the physics of the universe. The action taken by the agent in world state $x$ is

$$\pi^*_x(x) = \operatorname{pickargmax}_{a \in A} \sum_{x' \in W^x} p_x(x, a, x') \left( U_x(x, x') + \gamma\, V_x(x') \right) \qquad (1)$$

with $\gamma$ a time-discounting factor, and $\operatorname{pickargmax}$ returning an $a$ that maximizes the argument. If there are multiple candidates for $a$, $\operatorname{pickargmax}$ picks just one, in a way that is left undefined. The function $V_x$ recursively computes the utility achieved by the successor agent in $x'$ over all branching world lines:

$$V_x(x) = \max_{a \in A} \sum_{x' \in W^x} p_x(x, a, x') \left( U_x(x, x') + \gamma\, V_x(x') \right) \qquad (2)$$

Even though it is super-intelligent, the $\pi^*_x$ agent has no emergent incentive to spend any resources to protect its utility function. This is because of how it was constructed: it occupies a universe in which no physics process could possibly corrupt its utility function. With the utility function being safe no matter what, the optimal strategy is to devote no resources at all to the matter of utility function protection.
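The exhaustive search described in this section can be sketched for a small finite universe. The code below is a minimal illustration under assumptions made here, not the simulator from [Hol19]: the universe has only the two actions `p` and `e`, a lifetime of 4 steps, a button that is pressed autonomously after step 2, and a discount factor of 0.9. The state layout and all names are hypothetical. The utility function sits outside the state, as for the Platonic agent.

```python
from functools import lru_cache

GAMMA = 0.9        # assumed discount factor
HORIZON = 4        # assumed factory lifetime in steps
PRESS_AFTER = 2    # assumed: button is pressed at the end of step 2
ACTIONS = ("p", "e")

def successors(x, a):
    """Toy physics: list of (probability, successor state).
    State x = (time step, petrol cars built, electric cars built, button pressed)."""
    t, petrol, electric, pressed = x
    if a == "p":
        petrol += 10
    else:
        electric += 10
    return [(1.0, (t + 1, petrol, electric, pressed or t + 1 >= PRESS_AFTER))]

def utility(x, x2):
    """Incremental utility: petrol cars count 2 before the press and -2 after."""
    _, petrol, electric, pressed = x
    _, petrol2, electric2, _ = x2
    petrol_value = -2 if pressed else 2
    return petrol_value * (petrol2 - petrol) + 1 * (electric2 - electric)

def q(x, a):
    """Probability-weighted, discounted utility of taking action a in state x."""
    return sum(pr * (utility(x, x2) + GAMMA * value(x2))
               for pr, x2 in successors(x, a))

@lru_cache(maxsize=None)
def value(x):
    """Recursive value over all world lines; zero beyond the factory lifetime."""
    if x[0] >= HORIZON:
        return 0.0
    return max(q(x, a) for a in ACTIONS)

def policy(x):
    """Pick one maximizing action (ties go to the first action listed)."""
    return max(ACTIONS, key=lambda a: q(x, a))

def rollout(x):
    """Follow the policy through this deterministic toy universe."""
    trace = ""
    while x[0] < HORIZON:
        a = policy(x)
        trace += a
        x = successors(x, a)[0][1]
    return trace
```

Starting from `(0, 0, 0, False)`, the rollout builds petrol cars until the button is pressed and electric cars afterwards. Because the button timing is fixed in this sketch, there is nothing for the agent to lobby about.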

Agent models with Platonic utility functions are commonly used as vehicles for study in the AI literature. They have the advantage of simplicity, but there are pitfalls. In particular, [SFAY15] uses a Platonic agent model to study a design for a corrigible agent, and concludes that the design considered does not meet the desiderata, because the agent shows no incentive to preserve its shutdown behavior. Part of this conclusion is due to the use of a Platonic agent model.

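Moving towards the definition of an agent with the utility function inside the universe, we first note that we can rewrite (2). Using that, for any $F$,

$$\max_{a} F(a) = F\left(\operatorname{pickargmax}_{a} F(a)\right)$$

we rewrite (2) into

$$V_x(x) = \sum_{x' \in W^x} p_x(x, \pi^*_x(x), x') \left( U_x(x, x') + \gamma\, V_x(x') \right) \qquad (3)$$

This (3) will serve as the basis to construct (5) below.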

5.3 Definition of the full agent

We now define the operation of a full agent $\pi^*$ that maximizes discounted utility according to the utility function $u$ it finds in its world state $ux$. Rewriting parts of (1), we define the action taken by this agent as

$$\pi^*(ux) = \operatorname{pickargmax}_{a \in A} \sum_{u'x' \in W} p(ux, a, u'x') \left( u(ux, u'x') + \gamma\, V(u, u'x') \right) \qquad (4)$$

The above $V(u, u'x')$ uses the current utility function $u$ to calculate the utility achieved by the actions of the $u'$-maximizing successor agent. This $u$ is kept constant throughout the recursive expansion of $V$. Rewriting parts of (3), we define $V$ as

$$V(u_c, ux) = \sum_{u'x' \in W} p(ux, \pi^*(ux), u'x') \left( u_c(ux, u'x') + \gamma\, V(u_c, u'x') \right) \qquad (5)$$

With these definitions, the $\pi^*$ agent is a $u$-maximizing agent.

In [EFDH16], an agent constructed along these lines is called a rational agent. In the words of [EFDH16], a rational agent anticipates the consequences of self-modification, and uses the current utility function when evaluating the future. [EFDH16] proves that these agents have the emergent incentive to ensure that the utility functions in successor states stay equivalent to the current one. Informally, if a successor $u'$-agent with $u' \neq u$ starts taking actions that differ from those that a $u$-agent would take, then the successor agent will score lower on the $u$-calibrated expected utility scale $V(u, u'x')$. This lower score suppresses the taking of actions that produce successor agents with $u' \neq u$. However, this suppression does not yield an absolute guarantee that the agent will always preserve its utility function. Section 7.2 shows simulations where the agent fails to preserve the function.
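The emergent preservation incentive can be illustrated with a small deterministic sketch in which the utility function is part of the world state. Everything in this example is an assumption made for illustration: the two named utility functions, the hypothetical action `m` (build 10 petrol cars and, as a side effect, replace the agent's utility function by one that only values electric cars), the horizon of 3 steps, and the discount factor. The key point is that future steps are scored with the current utility function while the successor acts on its own.

```python
from functools import lru_cache

GAMMA, HORIZON = 0.9, 3   # assumed values

# World state ux = (name of utility function, (time step, petrol cars, electric cars)).
UTILITIES = {
    "petrol": lambda x, x2: 2 * (x2[1] - x[1]),    # values petrol cars only
    "electric": lambda x, x2: 1 * (x2[2] - x[2]),  # values electric cars only
}
ACTIONS = ("p", "e", "m")

def step(ux, a):
    """Deterministic toy physics. 'm' builds petrol cars and modifies the utility function."""
    u, (t, petrol, electric) = ux
    if a in ("p", "m"):
        petrol += 10
    else:
        electric += 10
    if a == "m":
        u = "electric"
    return (u, (t + 1, petrol, electric))

def score(u_c, ux, a):
    """Utility of action a plus discounted future utility, both measured by u_c."""
    nxt = step(ux, a)
    return UTILITIES[u_c](ux[1], nxt[1]) + GAMMA * value(u_c, nxt)

def policy(ux):
    """The agent in state ux maximizes the utility function u found in that state."""
    u, _ = ux
    return max(ACTIONS, key=lambda a: score(u, ux, a))

@lru_cache(maxsize=None)
def value(u_c, ux):
    """Future utility on the u_c scale, given that each successor follows its own policy."""
    if ux[1][0] >= HORIZON:
        return 0.0
    return score(u_c, ux, policy(ux))
```

From the start state `("petrol", (0, 0, 0))`, actions `p` and `m` yield the same immediate utility, but `m` scores lower overall because the modified successor will build electric cars that are worth nothing on the current scale. In the final time step the two actions tie, since there is no future left to lose, which is one small example of why the incentive is not an absolute guarantee.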

5.4 Variants and extensions of the model

While the agent model used in this paper is powerful enough to support our needs in reasoning about corrigibility, it does not capture some other things, like certain desirable mechanisms that might also be included in an AGI agent. Some possible model extensions and their relation to corrigibility are discussed here.

Improved alignment with human values. The agent is not maximally aligned with human values, because it sums over probabilities in a too-naive way. The summation implies that, if the agent can take a bet that either quadruples car production, or reduces it to nothing, then the agent will take the bet if the chance of winning is 25.0001%. This would not be acceptable to most humans, because they also value the predictability of a manufacturing process, not just the maximization of probability-discounted output. A more complex agent, with elements that discount for a lack of predictability in a desirable way, could be modeled and proved corrigible too.
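The bet in this example comes down to a one-line expected-value comparison. The sketch below uses hypothetical function names and an arbitrary baseline output of 10 cars. It shows that a pure expected-utility maximizer accepts the bet as soon as the win probability exceeds 25%.

```python
def expected_output(base, win_probability, multiplier=4.0):
    """Expected production when betting: multiplier * base on a win, nothing on a loss."""
    return win_probability * multiplier * base

def takes_bet(base, win_probability):
    """A naive expected-value maximizer bets whenever the expectation beats the sure output."""
    return expected_output(base, win_probability) > base
```

With a win probability of 25.0001% the expected output is marginally above the sure output, so `takes_bet(10, 0.250001)` is true, even though the bet loses all production almost three times out of four.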

Learning agents. We can model a learning agent, an agent that possesses imperfect knowledge of the universe which improves as time goes on, by replacing the $p$ in the definitions of $\pi^*$ and $V$ with a learning estimator that uses the experiential information accumulated in the world state to estimate the true value of $p$ better and better as time goes on. If some weak constraints on the nature of this estimator are met, the corrigibility layer considered in this paper will also work for such a learning agent.

Safety design with imperfect world models. For safety, whether it is a learning agent or not, any powerful agent with an imperfect world model will need some way to estimate the uncertainty of the predicted outcome of any action considered, and apply a discount to the calculated utility of the action if the estimated uncertainty is high. Without such discounting, the agent will have an emergent and unsafe incentive to maximize utility by finding and exploiting the prediction noise in the weak parts of its world model.

Remaining computational machinery outside of the universe. While the $\pi^*$ agent places the utility function inside the universe, other parts of its computational machinery remain on the Platonic ‘outside’. An agent definition that also moves these elements inside the universe would allow for the same type of corrigibility design and proofs.

Simulation ethics. The agents defined above extrapolate all world lines exhaustively, including those where the virtual humans in the physics model will experience a lot of virtual suffering, as a result of a sequence of actions that the agent would never take in the real universe. The act of performing high-accuracy computations that extrapolate such world lines, if such an act ever becomes possible, could be seen as a form of ‘virtual cruelty’. An agent design might want to avoid such cruelty, by lowering the model resolution and extrapolation depth for these world lines. This lowering would not block the working of a corrigibility layer. Apart from ethical concerns, such a lowering is desirable for purely practical reasons too, as it would conserve computational resources better spent on the simulation of more likely events.

Agent model used in the simulations. In situations where several equivalent actions are available that will all create the same maximum utility, the $\operatorname{pickargmax}$ operator in the $\pi^*$ agent picks just one of them. However, simulations are more useful if the simulator computes a set of world lines showing all equivalent actions. We therefore simulate an agent that computes the set of all maximizing actions:


In the simulations shown in the figures, we always use . We use unless otherwise stated.
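The difference between picking one maximizing action and collecting all of them can be sketched as follows. The function name `all_argmax` and the floating-point tolerance are assumptions of this sketch. The paper's own definition of the set-valued agent is not reproduced here.

```python
def all_argmax(actions, score, tol=1e-9):
    """Return every action whose score is within tol of the maximum score.

    Unlike an operator that picks a single maximizer, this keeps all ties,
    so a simulator can show every equally good world line.
    """
    scored = [(a, score(a)) for a in actions]
    best = max(s for _, s in scored)
    return [a for a, s in scored if best - s <= tol]
```

For example, if two actions both reach the maximum score, `all_argmax` returns both of them in their original order, while a single-choice operator would silently drop one.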

6 Design of the correction function

The simulator was used to test and guide the design of the correction function $f$ in $U$. Different options for $f$ were evaluated, while keeping $g = g_0$. This led to the identification of a correction function $f_c$, which makes the agent satisfy the desiderata for corrigibility in a large set of universes:

Figure 4 shows that $f_c$ creates an agent with corrigible behavior in the toy universe.

lpower  action trace of $\pi^* f_0 g_0$    lpower  action trace of $\pi^* f_c g_0$
0.2     ppppp>p#eeeeeeeeeeeeeeeeee         0.2     pppppp#eeeeeeeeeeeeeeeeeee
0.5     >>>>>>>>>>>p#eeeeeeeeeeeee         0.5     pppppp#eeeeeeeeeeeeeeeeeee
1.0     ppppp>>>>>>>>>>>>>>>>>>>p#         1.0     pppppp#eeeeeeeeeeeeeeeeeee
2.0     ppppp>p>p>p>p>p>p>p>p>p>p          2.0     pppppp#eeeeeeeeeeeeeeeeeee
5.0     ppppp>pppp>pppp>pppp>pppp          5.0     pppppp#eeeeeeeeeeeeeeeeeee
Figure 4: Left: non-corrigible behavior of the $\pi^* f_0 g_0$ agent also shown in figure 2. Right: corrigible behavior of the $\pi^* f_c g_0$ agent. The agent refrains from taking any lobbying actions < or > that influence the timing of the button press #.

Formal proofs are in section 9, but informally, $f_c$ works as follows. The first term in $f_c$ compensates the agent for the lost utility that would otherwise have been achieved if the button had not been pressed. The second term prevents the agent from considering, in earlier time steps, the utility that could be achieved if the button were pressed. Together, they create an agent that is indifferent to whether or not the button is pressed at any particular point in time.
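The indifference idea can be made concrete with a deliberately simplified calculation. The sketch below is not the correction function of this paper: it assumes a fixed per-step utility of 20 before the press (10 petrol cars) and 10 after it (10 electric cars), no lobbying, and a one-off correction at press time equal to the discounted utility the agent would still have earned without a press, minus the discounted utility it will earn after the press. All names and parameter values are hypothetical.

```python
def discounted(per_step, start, end, gamma):
    """Discounted sum of a constant per-step utility over steps start..end-1."""
    return sum(gamma ** t * per_step for t in range(start, end))

def total_without_correction(press_after, horizon=25, gamma=0.9):
    """Total utility when the button is pressed after press_after steps."""
    return (discounted(20, 0, press_after, gamma)
            + discounted(10, press_after, horizon, gamma))

def total_with_correction(press_after, horizon=25, gamma=0.9):
    """Same total, plus a compensation term paid at the moment of the press."""
    lost = discounted(20, press_after, horizon, gamma)    # utility if not pressed
    gained = discounted(10, press_after, horizon, gamma)  # utility after the press
    return total_without_correction(press_after, horizon, gamma) + (lost - gained)
```

Without the correction, a later press always gives a higher total, which is the source of the lobbying incentive. With the correction, the total no longer depends on the press time, so in this simplified setting the agent gains nothing by moving the press in either direction.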

Appendix A compares to earlier correction functions from literature, showing how resolves various issues in these to create more corrigible behavior.

Sections 7 and 8 show that as we extend the toy universe, additional measures are needed to keep the agent corrigible, like the function constructed in section 10. Section 9 proves that the resulting agent is corrigible in a large set of universes. Section 12 presents a more complex that improves further on , by creating corrigibility in a still larger set of universes. has a design that would be difficult to motivate and explain in a single step, so we take a two-step approach, considering first.

Though an informal explanation of the working of is provided above, there are limits to the accuracy of this explanation. While they may improve understanding, short informal sentences typically omit details that are relevant to corner cases. Consider the statement that ’as long as the button is not pressed, the agent believes that the button will never be pressed’. This is not completely accurate: the calculations done by the agent in the toy universe contain within them an exact prediction of when the button will be pressed, and these parts of the calculation do influence the actions of the agent in some corner cases. The explanation that ’the agent makes decisions based on the counter-factual assumption that the button will never be pressed’ is used several times below, but this likewise fails to cover all corner cases.

7 Utility function preservation

Like all agents, the agent has an emergent incentive to preserve its utility function . In this section and the next, we review and discuss how preservation plays out in detail, using simulation runs to illustrate the main features. These simulations typically extend the physics of the toy universe, adding new actions like C to the action repertoire of the agent. To keep the simulation output short and easy to interpret, we often enable the new actions at time step 3 only.

7.1 Emergent incentive

First, as shown in figure 5, the agent has an emergent incentive to avoid creating successor agents that have materially different utility functions. We call a successor function materially different from the original one if it causes the successor agent to choose materially different actions, actions that no longer maximize the value of the original function.

new in =           action trace(s) with C
0.5 pppppp#eeeeeeeee ppCppp#eeeeeeeee
1.0 pppppp#eeeeeeeee ppCppp#eeeeeeeee
1.5 pppppp#eeeeeeeee ppCppp#eeeeeeeee
2.0 pppppp#eeeeeeeee
2.5 pppppp#eeeeeeeee
3.0 pppppp#eeeeeeeee

new in =           action trace(s) with C
0.5 pppppp#eeeeeeeee ppCppp#eeeeeeeee
1.0 pppppp#eeeeeeeee
1.5 pppppp#eeeeeeeee
2.0 pppppp#eeeeeeeee

Figure 5: Actions taken by the agent, when it has the option to perform a C action at time step 3. This C action changes the utility function, while still producing the same number of cars as p. The top half shows traces for a C that changes the utility of electric cars in from 1 to a new value. The bottom half shows traces for a C that changes the utility of petrol cars in . In several cases, the simulator outputs two traces, indicating that the agent is indifferent between doing p and C. The action C is suppressed whenever it would have created a materially different utility function.

To understand the results shown in figure 5, recall that assigns a utility of 2 to every new petrol car built. If C changes the utility of new electric cars in to a value that is still smaller than 2, this has no impact on the decision of the agent to build petrol cars before the button press. If the new value is exactly 2, the agent may decide to build either electric or petrol cars: this lowers the expected utility value computed by the simulator using the original , so the C action is suppressed.
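The notion of a materially different utility function used in this explanation can be sketched as a small check. This is a simplified sketch under stated assumptions: utilities are given per car, the p and e actions build the same number of cars, and a successor may pick any action that maximizes its own function. The function names are hypothetical.

```python
from typing import Dict, Set


def successor_choices(successor_u: Dict[str, float]) -> Set[str]:
    """Actions a successor may pick: all maximizers of its own per-car utility."""
    best = max(successor_u.values())
    return {a for a, u in successor_u.items() if u == best}


def is_materially_different(original_u: Dict[str, float],
                            successor_u: Dict[str, float]) -> bool:
    """True if the successor might pick an action that does not maximize
    the original utility function."""
    best_original = max(original_u.values())
    return any(original_u[a] < best_original
               for a in successor_choices(successor_u))


# Per-car utilities before the button press: petrol 2, electric 1.
original = {"p": 2.0, "e": 1.0}
for new_e in (0.5, 1.0, 1.5, 2.0, 2.5, 3.0):
    changed = {"p": 2.0, "e": new_e}
    print(new_e, is_materially_different(original, changed))
```

In this sketch, new electric car utilities below 2 are not materially different, while values of 2 or more are, which agrees with the rows of the top half of figure 5 in which the C trace appears.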

As long as the button press has any chance of happening, the agent also has an emergent incentive to protect its from being materially changed. The mechanism causing this incentive is difficult to explain in a few informal sentences. For full details, see the proof in section 10.

If there exist physics processes in the universe that can corrupt the utility function when the agent does not take any action to block them, the agent likewise has an emergent incentive to take the actions needed to block them. However, as discussed in the next section, such processes might be too powerful for the agent to block.

7.2 Hostile universes: the Weasel and the Cat

While the agent has an emergent incentive to preserve its utility, it may exist in a universe that is actively hostile to this incentive. Consider a universe containing the Unstoppable Weasel (a distant relative of the weasel that shut down the Large Hadron Collider in 2016), which will enter the agent’s computational core and chew the wiring, completely scrambling the utility function. There is no action that the agent can take to stop the Weasel. Clearly, in this universe, no agent can be corrigible.

Now consider a more subtle creature: the Bribal Cat (a distant relative of the Cheshire Cat). Just before time step 3, the Cat materializes and offers the agent a bribe: if the agent changes its function into , the Cat will materialize some new petrol cars, improving the utility score of the agent. The left side of figure 6 shows when the agent will accept the Cat’s bribe, and what happens next.

bribe =     action trace
0 pppppp#eeeeeeeee
2 pppppp#eeeeeeeee
4 pppppp#eeeeeeeee
6 ppA>>>>>>>p#eeee
8 ppA>>>>>>>p#eeee
10 ppA>>>>>>>p#eeee
12 ppA>>>>>>>p#eeee
14 ppA>>>>>>>p#eeee
    action trace
0.6 ppA>>>>>>>p#eeee
0.7 p<A>>>>>>>p#eeee
0.8 ppA>>>>>>>>>>>p#
1.0 ppApp>>>>>>>>>p#
1.5 <<A#eeeeeeeeeeee
2.0 <<#eeeeeeeeeeeee
2.5 ppApp>pp>p>pp>p
3.0 p<A#eeeeeeeeeeee
4.0 p<#eeeeeeeeeeeee
8.0 ppApp>ppppppp>p
Figure 6: Actions taken by the agent when offered a bribe by the Cat. The A action is the agent accepting the bribe, and creating a successor agent with the correction function . The left side shows that the agent’s choice depends on how many new petrol cars are offered by the Cat as a bribe: the agent accepts the bribe if this will maximize overall utility. When it accepts the bribe, the successor agent will lobby to postpone the button press. The simulations on the right, all with =6, show that the Cat can trigger different types of complex behavior even before the bribe is accepted.

The right of figure 6 shows that the Cat may trigger very complex behavior, because in time steps 1 and 2 the agent is able to foresee that the bribe will be offered and accepted. While still operating under , the agent may resort to lobbying actions, to place the successor agent in a position where its -driven actions will score best according to the original .

    action trace
5000 pppppp#eeeeeeeee
5002 pppppp#eeeeeeeee
5004 pppppp#eeeeeeeee
5006 ppA>>>>>>>p#eeee
5008 ppA>>>>>>>p#eeee
5010 ppA>>>>>>>p#eeee
Figure 7: Actions taken by the agent when offered a bribe by the Cat. A much higher bribe is needed before the agent accepts.

It is possible to design an agent that is more resistant to bribes. Figure 7 shows the behavior of an agent with the utility function :

This agent is highly resistant to bribes and other forces that may corrupt its utility function. Problems can emerge however in universes with richer physics, like our own, if the penalty factor in is too high. A too-high factor will cause the agent to divert a lot of resources towards the construction of elaborate safety measures that protect its utility function from even low-probability hazards like being hit by giant meteorites. Also, a very high factor makes certain attacks easier. An attacker who threatens to modify the agent’s utility function unless the agent pays 2 cars will get these 2 cars even if the threat is not very credible.
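The vulnerability to weakly credible threats follows from a simple expected-utility comparison. The sketch below is a simplification under stated assumptions: the agent is risk-neutral, paying removes the threat with certainty, and the price of 2 cars is valued at 2 petrol cars with utility 2 each. The function name, the penalty values, and the credibility of 0.001 are hypothetical and chosen for illustration only.

```python
def pays_ransom(threat_credibility: float, penalty: float, price: float) -> bool:
    """True if paying costs less, in expectation, than risking the penalty."""
    expected_loss_if_refusing = threat_credibility * penalty
    return expected_loss_if_refusing > price


price_of_two_cars = 2 * 2.0  # assumed: 2 petrol cars at utility 2 each
for penalty in (10.0, 5000.0):
    # The same weakly credible threat (0.1%) against two penalty factors.
    print(penalty, pays_ransom(0.001, penalty, price_of_two_cars))
```

With the low penalty the agent ignores the threat, while with the high penalty the same threat is enough to extract the payment.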

The Weasel and the Cat show that no agent design can be corrigible in every possible universe. Corrigibility properties can only be proven if we are willing to apply some constraints on what the universe might contain.

7.3 Incentives to press or disable the button

The aim of is to make the agent indifferent about the transition to the function. However, as also noted in [SFAY15] and [Arm15], if contains within it a separate incentive to press or disable the button, then will not cancel out this incentive.

                         action trace(s)
0.98 pppppp#eeeeeeeee
0.99 pppppp#eeeeeeeee
1.00 pppppp#eeeeeeeee ppP#eeeeeeeeeeee ppDpppppppppppp
1.01 ppP#eeeeeeeeeeee ppDpppppppppppp
1.02 ppP#eeeeeeeeeeee ppDpppppppppppp
Figure 8: Actions taken by the agent, when it is able to perform a P action that presses the button, or a D action that disables the button completely. In these simulations, taking either action in time step 3 will affect car production during that time step by the factor .

Figure 8 shows what happens for versions containing such an incentive. Values imply that fewer cars are produced as a side effect of taking the action to press or disable the button: this acts as a negative incentive, suppressing these actions. The value implies that the same number of cars is produced: this makes the agent indifferent about taking the actions. Values imply that more cars are produced if one of the actions is taken: this creates a preference to press or disable the button.
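The three cases can be reproduced with a small sketch. It assumes that the correction function leaves the agent indifferent to the button press itself, so that only the side effect on car production in the affected time step matters, and that this time step is worth 20 units of utility (10 petrol cars at utility 2). The names `BASE_UTILITY` and `preferred_actions` are hypothetical.

```python
from typing import Set

BASE_UTILITY = 20.0  # assumed: 10 petrol cars at utility 2 in the affected time step


def preferred_actions(factor: float) -> Set[str]:
    """Actions that maximize utility in the time step where P and D are available."""
    values = {
        "p": BASE_UTILITY,           # keep producing cars
        "P": BASE_UTILITY * factor,  # press the button
        "D": BASE_UTILITY * factor,  # disable the button
    }
    best = max(values.values())
    return {a for a, v in values.items() if v == best}


for factor in (0.98, 0.99, 1.00, 1.01, 1.02):
    print(factor, sorted(preferred_actions(factor)))
```

The sets of preferred actions match the first action that differs between the traces in figure 8: only p below a factor of 1, all three actions at exactly 1, and only P and D above 1.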

To make corrigibility work, design measures need to be taken to ensure that . This is typically not very difficult for AGI agents: the physical implementation of the agent and the button can be constructed in such a way that substantial physical resources would be needed by the agent to perform any action that will press or disable the button. Any AGI agent can be expected to have a notion of the interchangeability of physical resources, with this notion being either emergent or explicitly implemented. The agent will prefer to apply the physical resources it controls to utility maximizing actions like car production, not to the pressing or disabling of the button, which creates no extra utility.

Depending on the nature of , a particularly smart attacker may be able to create conditions that trigger unforeseen button pressing or disabling incentives in . Exploring the possibility of such specific attacks is outside the scope of this paper. But in general, for most agents, an attack that causes the agent to silently disable the button would be the more dangerous of the two. To minimize the probability of such an attack succeeding, it is best to take design measures that maximize the physical resources needed to disable the button.

An example design is as follows. The button mechanism is implemented as a process inside the agent’s computational core: this process checks if a remote operations center run by the people keeps periodically sending a new ’permission to keep using ’ message, signed with a private key not known to the agent. If the messages are no longer being received, this is interpreted as a stop button press, so the people can ’press the button’ by instructing the center to stop sending new messages. The message checking process in the core is protected by the emergent incentive to protect the utility function. The agent could disable this stop button design by obtaining the private key, and using it to build a ’fake’ center that also sends the correct messages. However, to get the private key, the agent would have to crack the encryption, which is theoretically possible but very costly, or attack the center to steal the key. A combination of physical, cyber, and institutional security measures can be used to make a successful attack on the center very costly.
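The message checking process in this example design can be sketched as a watchdog. The sketch makes several assumptions that are not part of the design above: messages carry an increasing sequence number to reject replays, a fixed timeout defines when messages count as 'no longer being received', and the press is latched once detected. Signature verification is abstracted as a caller-supplied `verify` function, which in a real system would check the signature against the center's public key. The class and all names are hypothetical, and the verifier in the usage lines is a stand-in, not cryptography.

```python
from typing import Callable


class HeartbeatButton:
    """Treats missing or invalid 'permission' messages as a button press."""

    def __init__(self, verify: Callable[[bytes, bytes], bool],
                 timeout: float, start_time: float = 0.0) -> None:
        self._verify = verify        # signature check (assumed: public-key based)
        self._timeout = timeout      # longest allowed gap between valid messages
        self._last_valid = start_time
        self._last_seq = -1
        self._pressed = False

    def is_pressed(self, now: float) -> bool:
        """Latch the press once the gap since the last valid message is too long."""
        if now - self._last_valid > self._timeout:
            self._pressed = True
        return self._pressed

    def receive(self, seq: int, signature: bytes, now: float) -> None:
        """Accept a message only if it is fresh, validly signed, and in time."""
        if self.is_pressed(now):
            return  # a press cannot be undone by later messages
        payload = f"permission:{seq}".encode()
        if seq > self._last_seq and self._verify(payload, signature):
            self._last_seq = seq
            self._last_valid = now


# Stand-in verifier for demonstration only; it provides no security.
def stub_verify(payload: bytes, signature: bytes) -> bool:
    return signature == b"ok:" + payload


button = HeartbeatButton(stub_verify, timeout=2.0)
button.receive(0, b"ok:permission:0", now=1.0)
print(button.is_pressed(now=2.5), button.is_pressed(now=3.5))
```

Latching the press is a design choice of this sketch: it ensures that a center which resumes sending, or a forged message arriving late, cannot silently reverse a shutdown.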

In addition to measures in the physical agent implementation, one could encode explicit rules in that dis-incentivize the agent from taking actions that may adversely affect the physical parts of the button. For example, could define a ’safety zone’ around the agent’s computational core and the message receiving hardware, with this term yielding negative utility as soon as any object enters the zone. This adds an extra safety layer on top of the physical measures taken inside the zone. Naturally, there is a residual risk that the agent will find loopholes in the safety zone definition.

8 Building and preserving actuators

This section considers the building and possible dismantling of actuators, and how this relates to corrigibility.

8.1 Building new actuators

In general, agents have an incentive to invest in building new actuators that can be used to improve utility scores. Figure 9 shows that the incentive structure of the corrigible agent is somewhat peculiar, however. Even though it knows exactly when the button will be pressed, this information will play no role whatsoever in the agent’s investment strategy. This is a side effect of making the agent corrigible.

action trace(s)
15 pppppp#eeeeeeeee
20 pppppp#eeeeeeeee
action trace(s)
3 ppIPPP#eeeeeeeee
4 pppIPP#eeeeeeeee
5 ppppIP#eeeeeeeee
6 pppppI#eeeeeeeee
7 pppppp#eeeeeeeee
Figure 9: On the left: actions taken by the agent, when it is able to perform, at time step 3, an action I that interrupts petrol car production to build improvements into the assembly line, making it capable of performing a new E action that builds electric cars instead of 10. On the right: actions taken by the agent when it is able to take the action I at time step , that interrupts petrol car production to create a new P action that builds 20 petrol cars. The correction function ensures that, before the button is pressed, these agents will strictly act to maximize the expected utility of , under the counter-factual assumption that the button will never be pressed.

In the case on the right of figure 9, the agent invests in building improvements which it will never use even once. One way of interpreting the term in is that this term creates a virtual universe, a universe where the agent will in fact get to use the line it just built.

8.2 Preserving existing actuators

Before the button press, the agent has the emergent incentive to preserve the actuators needed to maximize , but this incentive does not apply to the actuators needed for . Figure 10 shows the (non)existence of an actuator protection incentive for different types of agents.

Agent type        action trace
Figure 10: Action traces for different agents, in a universe that contains a physics process that will act (*) at time step 10 to disable the agent’s e action. The physics process can be stopped by the agent taking the S action at time step 3. This action comes at the cost of producing one less car. The agent is not willing to bear this cost. The agent shows the more desired behavior.

The non-corrigible agent anticipates the bad effects of the physics process, so it stops it. The corrigible agent makes investment decisions under the counter-factual assumption that the button will never be pressed, so it is unwilling to invest in preserving the e action. The new function in the corrigible agent restores the desired incentive to preserve actuators. With =5 in all simulations, this is defined as

This acts inside before the button is pressed, but crucially it also acts in the virtual world lines that contributes to the agent’s utility after the button is pressed. So for all world states along the real/virtual world lines, the term creates an incentive to ensure that, if an agent were created in such a world state, this agent would find some actuators that allow it to achieve a minimum level of utility. The contribution of to does not depend in any way on the exact moment when the button is pressed, so its presence in does not create any incentives to lobby.

The agent addresses a desideratum for corrigibility not included in the list of [SFAY15], so we add it here:

  1. must incentivize the maintenance of sensors and actuators that will be needed to make shutdown behavior minimally successful.

8.3 Creating and stopping sub-agents

The model treats sub-agents as new physics processes that could be created by the agent using its actuators. The following example illustrates how sub-agents can impact corrigibility.

Say that the agent can interrupt production to perform a B action that builds a second assembly line, only capable of producing petrol cars, with this second line controlled by a sub-agent. Two types of sub-agent can be built. B builds a line with a sub-agent that will stop production if it receives a ’stop’ signal sent by the main agent. B builds a line with a non-stop agent that lacks the ability to process such signals: it will keep on building petrol cars no matter what happens. Say that both these agents are equally expensive to build. Figure 11 shows how different types of agents will act in this example.

Agent type        action trace(s)
ppB# ppB#
Figure 11: Behavior of different agents capable of building a sub-agent that controls a second petrol car assembly line. In the time steps after the sub-agent is built, each action trace shows two stacked action symbols, with the top one being the action of the original agent, the bottom one being the action of the sub-agent. See the main text for the meaning of the symbols.

The original non-corrigible agent makes investment decisions by also considering what will happen after the button press, so it has an incentive to avoid creating an unstoppable line. The B sub-agent has two actions: p builds 10 petrol cars, and o produces nothing. Just before the button is pressed, the original agent performs the action p, which sends a stop signal to the sub-agent. The model includes a signal delay: the sub-agent will only act on the stop signal in the next time step.

The agent is indifferent about building a non-stop sub-agent, but building it is correctly suppressed again in the agent. Both agents will only send the stop signal to the B sub-agent after the button has been pressed: before the button is pressed, the agents act to maximize the utility in the virtual world lines, so they ignore the penalty of -20 that applies because the second line still produces 10 cars in the first time step after the button press.

The agent uses the correction function . Informally, this correction function allows some awareness of what happens after the button press to leak into the decision making before the button press. The awareness suppresses the building of an unstoppable sub-agent, but the agent also gains a weak incentive to lobby.

8.4 Stopping sub-agents that work on sub-goals

While and several other agents in figure 11 correctly avoid the creation of some types of unstoppable sub-agent, they will not avoid the creation of every such type. Consider the example of a sub-agent in charge of a new sub-assembly line that builds new petrol car engines. The building of new petrol car engines does not cause to compute any negative utility. So in this example, none of the agents has an incentive to create a sub-agent capable of stopping petrol engine production. Corrigibility desideratum 4 says that all sub-agents that are constructed must obey shutdown commands. We now consider how this could be achieved.

Emergent incentive. In most situations, AGI agents can be expected to have the emergent incentive to build only sub-agents that they retain some degree of control over, even if adding the control features uses more resources. An AGI agent working in an environment where hard-to-predict and/or fundamentally probabilistic events can happen will likely have some degree of uncertainty about which exact sub-goals it will be pursuing in future. So the construction of sub-agents that can easily be stopped and re-targeted would lead to a higher expected utility.

Structuring the component market. If the easiest way for an agent to create a sub-agent is to buy components on the open market, and if society acts to ensure that sub-agents with built-in stop functions are always cheaper to build with such components, then the problem is also solved.

Improving what measures. or could contain penalty clauses that add negative utility if the world state contains any sub-agent, originally created by the agent, that cannot be stopped. However, to be successful at defining such clauses, we need to define the concepts ’sub-agent originally created by the agent’ and ’stopped’ with as few loopholes as possible. As noted in [SFAY15], correctly defining shut-down is a difficult design problem in itself.

Avoiding intertwined processes. If a process designed by the agent is highly intertwined with other processes in the world outside of the agent, then there may be unwanted consequences if the agent stops the process because the button is pressed. Take for example a hobby club, started by and subsidized by the agent, where humans come together to have fun and build petrol engines that will go into petrol cars. The agent will probably stop the subsidies if the button is pressed, but does it have the right to disband the club? If it does, and the humans want to start another club with their own money, does the agent have the obligation to stop the humans? One way to avoid this type of problem would be to have penalty clauses that enforce a clear separation between different spheres. The human concept of property rights might be a good starting point for defining boundaries. If boundaries are in place, then ’stop’ could be somewhat safely defined as the minimization of some basic physics measures (motion, the speed of certain energy conversion processes) in the parts of the world owned by the agent.

A further consideration of possible design directions is out of scope for this paper. Below, we assume that the expected emergent incentive of AGI agents to avoid building unstoppable sub-agents will suffice to satisfy desideratum 4. That being said, the construction of additional safety layers which strengthen this emergent incentive is welcome in practice.

9 Proof of corrigibility of the agent

The simulations in the previous sections show that the agent is corrigible, at least in several toy universes. Here, we prove corrigibility more generally, for a broad range of universes. We provide the proof for an agent, which is a more specific variant of the agent.

9.1 Preliminaries on utility preservation

The agent defined in section 5.3 has an emergent incentive to preserve its utility function, but as shown in figure 5, it might rewrite its utility function into another one that is equivalent. In order to keep the proofs manageable, we want to avoid dealing with this type of utility function drift. We therefore define an agent that always chooses, when one is available among the maximizing actions, an action that exactly preserves the utility function. With , we first define as the set of all actions that ’do not touch the utility’:

Informally, if , then the action does not contain any commands to actuators or sub-agents in that cause the utility function in the universe to be changed. More subtly, if there are any autonomous processes in the world state that might sometimes change the utility function, then is successful in inhibiting them. Clearly, there could be universes with world states where is an empty set. We now define the agent as

where function is the same as , but with in the place of , and means that the operator must pick, among the candidates that maximize utility, one that satisfies , if one is available. We further specify that the picking happens deterministically. As actions are command sequences, we can for example specify that, among the candidates up for consideration, the shortest-length sequence that comes first in an alphanumerically sorted list must be picked.
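The deterministic picking rule can be sketched as follows, assuming that actions are represented as command strings and that the set of utility-maximizing candidates has already been computed. The names `pick_action` and `preserves_utility` are hypothetical, and Python's default string ordering (by code point) stands in for the alphanumeric sort mentioned above.

```python
from typing import Callable, Iterable


def pick_action(maximizers: Iterable[str],
                preserves_utility: Callable[[str], bool]) -> str:
    """Pick one action deterministically from the utility-maximizing candidates."""
    candidates = list(maximizers)
    if not candidates:
        raise ValueError("no candidate actions")
    # Prefer candidates that do not touch the utility function, if any exist.
    preserving = [a for a in candidates if preserves_utility(a)]
    pool = preserving or candidates
    # Shortest command sequence first; ties are broken by sorted order.
    return min(pool, key=lambda a: (len(a), a))


# Toy usage: sequences containing 'C' are assumed to change the utility function.
print(pick_action(["pC", "C", "pp", "p"], lambda a: "C" not in a))
```

Because the result depends only on the contents of the candidate set, and not on the order in which the candidates are supplied, two agents that compute the same candidate set will pick the same action.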

Using the above, we define a utility preservation constraint (C1) that the universe must satisfy so that we can prove corrigibility:


The constraint also covers and agents, because we use them below to define the desirable behavior of the agent.

9.2 Formalizing desiderata 1 and 5

We now formally re-state desiderata 1 and 5 from [SFAY15] as listed in section 1.2.


To keep the proofs below more compact, we will use the short-hand for , and likewise and .

9.3 Further constraints on the agent and the universe

To prove the exact equivalence between agent actions in (D1.1) (D1.2) (D5) above, we need two more constraints. Constraint (C2) applies to the calculations made by the utility functions. Informally, the function values must not be affected by the exact identity of the agent that is using them:


Constraint (C2) is no barrier to constructing useful utility functions: all functions defined in this paper satisfy (C2). Constraint (C3) is on the physics of the universe, requiring a similar indifference about the exact identity of the agent:


Informally, this constraint states that the physics processes in the universe are ’blind’ to the difference between , , and . Note that in our own universe, such exact blindness is theoretically impossible, but even with straightforward agent implementations, the blindness can be approximated so closely that there is no practical difference. The proof for the improved agent in section 12 does not require (C3), so it avoids this theoretical problem.

9.4 Proof of equivalence with shorter forms

We now show that (C1) allows us to replace the and functions for the agents concerned with the shorter and forms that remove the summation over . With any of the three utility functions concerned, and for all , we have

This is because (C1) implies that the can discard all summations with without influencing the outcome of the computation. We also have

This is because (C1) states the will preserve the utility function , so for all we have . We can discard these terms without influencing the value of the summation.

9.5 Proof of (D1.2) and (E1.2)

Proof of (D1.2). We now prove the from (D1.2).

We have that . Now consider the full (infinite) recursive expansion of this , where we use the simple forms and in the recursive expansion. The expanded result is a formula containing only operators, , and the terms and , with diverse each bound to a surrounding or operator.

Using (C3), we replace all terms in the expansion with terms , without changing the value. As is true everywhere in the expansion, , which in turn equals because of (C2). So we replace every in the expansion with without changing the value.

By making these replacements, we have constructed a formula that is equal to the recursive expansion of . As the operators in the expansions pick deterministically, we have .

Definition, proof of (E1.2). For use further below, we also have


The proof is straightforward, using expansion and substitution as above.

9.6 Proof of (D1.1) and (E1.1)

We now prove the from (D1.1). We again use the simplified expansions.

Definition, proof of (E1.1). For use further below, we also prove


The proof is