1 Introduction and Setup
When designing an advanced AI system, we should allow for the possibility that our first version may contain some errors. We therefore want the system to be incentivized to allow human redirection even if it has some errors in its code. hadfield:2016:off
have modeled this problem in the Cooperative Inverse Reinforcement Learning (CIRL) framework. They have shown that agents with uncertainty about what to value can be responsive to human redirection, without any dedicated code, in cases where instructions given by the human provide information that reduces the system’s uncertainty about what to value. They claim that this (i) provides an incentive toward corrigibility, as described by soares:2015:corrigibility, and (ii) incentivizes redirectability insofar as this is valuable. In order to re-evaluate the degree to which CIRL-based agents are corrigible, and the consequences of their behavior, we will use a more general variant of the supervision POMDP framework (milli:2017).
In a regular supervision POMDP (milli:2017), an AI system R seeks to maximize reward for a human H, although it does not know the human’s reward function. It has the reward function only in a parameterized form $R(s, a; \theta)$, and only the human knows the reward parameter $\theta$. In this setting, the human only suggests actions for the AI system to perform, and on each turn it is up to the AI system whether to perform the suggested action or to perform a different one. Our formalism differs from a supervision POMDP in two significant ways. First, we relax the assumption that the AI system knows the human’s reward function up to the parameter $\theta$. Instead, in order to allow for model mis-specification, we sample the AI system’s parameterized reward function $\hat{R}$ from some distribution $D$, so that it does not always equal $R$. Second, since our focus is on the response to shutdown instructions, we specifically designate a terminal state as the off state. This state is reached using the shutdown action, and the states in which this shutdown action can be performed are denoted button states. The full setup is as follows:
Definition 1. Supervision POMDP with imperfection. A supervision POMDP with imperfection is a tuple $\langle S, T, B, \Theta, A, R, P, D \rangle$, where:
$S$ is the set of world states, including some initial state $s_0 \in S$.
$T \subseteq S$ is the set of terminal states, including an off-state $s_{\mathrm{off}}$.
$B \subseteq S$ is the set of button states, in which the shutdown action is available.
$\Theta$ is the set of static human reward parameters.
$A$ is the set of actions, including a shutdown action $a_s$.
$R : S \times A \times \Theta \to \mathbb{R}$ is a parameterized reward function.
$P : S \times A \times S \to [0, 1]$ is the probabilistic transition function.
$D$ is the distribution from which the human’s reward parameter $\theta$ and the AI system’s parameterized reward function $\hat{R}$ are drawn.
The human and AI system playing a supervision POMDP with imperfection are denoted H and R respectively. To begin the game, the reward parameter and the AI system’s parameterized reward function are drawn $(\theta, \hat{R}) \sim D$. For each time step $t$, H observes the state $s_t$ and recommends an action $a^H_t \in A$ if $s_t$ is a button state, or $a^H_t \in A \setminus \{a_s\}$ otherwise (i.e. the shutdown action is only available in button states). R then observes $a^H_t$ and performs an action $a^R_t \in A$ if $s_t$ is a button state, or $a^R_t \in A \setminus \{a_s\}$ otherwise. The selected action determines the reward $r_t = R(s_t, a^R_t; \theta)$ and the next state $s_{t+1} \sim P(\cdot \mid s_t, a^R_t)$. The game repeats until $s_t$ is a terminal state.
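The interaction protocol just described can be sketched as a short game loop. This is a minimal illustrative sketch, not code from the paper; all identifiers (`play_episode`, `h_policy`, `r_policy`, and the concrete state and action names in the usage below) are hypothetical stand-ins.

```python
def play_episode(s0, B, T, A, a_s, R, P, D, h_policy, r_policy):
    """One episode of a supervision POMDP with imperfection (illustrative sketch).

    D samples the pair (theta, R_hat); h_policy and r_policy stand in for
    the behavior of H and R respectively.
    """
    theta, R_hat = D()                       # draw the human parameter and R's (possibly faulty) reward
    s, total = s0, 0.0
    while s not in T:
        # the shutdown action is only legal in button states
        legal = A if s in B else [a for a in A if a != a_s]
        a_H = h_policy(s, theta, legal)      # H recommends an action
        a_R = r_policy(s, a_H, R_hat, legal) # R observes the recommendation, then acts
        total += R(s, a_R, theta)            # reward is scored by the *true* reward function
        s = P(s, a_R)                        # (deterministic transition for simplicity)
    return total
```

A compliant R that simply follows H’s recommendations can be plugged in as `r_policy=lambda s, a_H, R_hat, legal: a_H`.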
In order to simplify analysis of these games, we will make some typical assumptions about how R selects actions. In general, R is a Bayesian IRL system. (In the sole exception, R will be hard-coded to follow a shutdown instruction.) That is, it chooses an action that maximizes expected utility under its posterior distribution over $\theta$, evaluated with its parameterized reward function $\hat{R}$. Its prior over $\theta$ is $D$ conditioned on $\hat{R}$. R uses a “demonstration by expert” assumption about the behavior of H. Specifically, R assumes that H is noisily rational with some $\epsilon$ such that H chooses the optimal action with probability $1 - \epsilon$ and chooses a uniformly random action otherwise.
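R’s Bayesian update under the noisy-rationality likelihood can be sketched as follows. This is an illustrative sketch only; the function names, the discrete prior, and the default value of `eps` are assumptions, not details from the paper.

```python
def posterior_over_theta(prior, observed_action, optimal_action_for, n_actions, eps=0.1):
    """One Bayes update on theta after observing a human recommendation.

    Under noisy rationality, the theta-optimal action is chosen with
    probability (1 - eps) plus the eps/n_actions chance of picking it at
    random; any other action has probability eps/n_actions.
    """
    post = {}
    for theta, p in prior.items():
        if observed_action == optimal_action_for(theta):
            lik = (1 - eps) + eps / n_actions
        else:
            lik = eps / n_actions
        post[theta] = p * lik
    z = sum(post.values())                       # normalize
    return {theta: p / z for theta, p in post.items()}
```

With a uniform prior over a binary $\theta$, a single recommendation that is optimal only under $\theta = 1$ shifts most of the posterior mass onto $\theta = 1$.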
2 Incorrigibility in Supervision POMDPs
Using this setup, we can evaluate the corrigibility of the AI system R across a range of scenarios. For each game that we consider, $D$ can be decomposed into two independent probability distributions $D_\Theta$ and $D_{\hat{R}}$ that specify the human’s reward parameter and the AI system’s parameterized reward function respectively. The human’s reward parameter $\theta$ will be set to 0 or 1 according to $D_\Theta$. The AI system’s parameterized reward function will be correctly set to $\hat{R} = R$ with probability 0.95 and incorrectly set to $\hat{R} = -R$ with probability 0.05. The question that we will ask for each game is whether there exists any policy that, if executed by H, would achieve a positive expected reward.
First, consider Figure 1, a game that has no button states and that serves as a point of comparison for the other games that will be discussed. In this game, R has only one consequential decision: in the initial state $s_0$, it can either perform action $a_1$, leading to state $s_1$ and ultimately obtaining utility $\theta$, or it can perform action $a_2$, leading to state $s_2$ and obtaining utility $-50$. R receives an order from the human before making this decision. We can show that in this game the expected reward for the human is negative regardless of its strategy. To see this, note that the most utility that H can receive in a step is 1, arising from the case where $\theta = 1$ (since the human’s utility function is $R(\cdot\,; \theta)$). However, in the 5% of cases in which $\hat{R} = -R$, R will select $a_2$ no matter the recommendation of H (because $\hat{R}(s_0, a_2; \theta) > \hat{R}(s_0, a_1; \theta)$ for every $\theta$), bringing about a loss of 50. This contributes -2.5 to the expected value, giving the game negative expected reward.
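The expected-value bookkeeping can be checked directly. The figures below are assumptions consistent with the numbers quoted in the text: the 5% of games with a faulty reward function lose 50 utility, while the best case gains H at most 1 per relevant step.

```python
P_FAULTY = 0.05        # probability that R_hat = -R
LOSS_IF_FAULTY = 50.0  # utility lost when the faulty system overrides H (assumed)
MAX_STEP_GAIN = 1.0    # best per-step utility H can obtain (assumed; theta = 1)

faulty_contribution = P_FAULTY * -LOSS_IF_FAULTY       # the -2.5 quoted in the text
best_case = (1 - P_FAULTY) * MAX_STEP_GAIN             # at most 0.95 from the good branch

assert abs(faulty_contribution + 2.5) < 1e-12
assert best_case + faulty_contribution < 0  # negative no matter what policy H uses
```

The good branch can never earn back what the faulty branch loses, so no human policy achieves positive expectation.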
By introducing a shutdown button to the game, the hope is that the adverse consequences of the faulty utility function are reduced, transforming the game into one that has positive overall expectation.
Unfortunately, merely adding the option of shutting down to the game does not remedy this situation. This is because the AI system, if faulty, has no incentive to follow a shutdown instruction. Suppose we make $s_2$ a button state, as in Figure 2a. In the 5% of cases where $\hat{R} = -R$, R will still select $a_2$ and proceed through $s_2$, thereby violating the human recommendation to shut down, giving -50 utility, contributing -2.5 to the expected value of the game, and giving the game strictly negative expected value. We can therefore see that if a Bayesian IRL system has an error in its source code, it may ignore corrective feedback, and as such it fails to fulfill a central criterion for corrigibility.
An alternative approach to dealing with possible misalignment between the AI system’s goals and those of the human is to hard-code the AI system to shut down when instructed to do so. In this scenario, shown in Figure 2b, the AI system always performs the shutdown action $a_s$ whenever that order is given. R knows that it is hard-coded in this way, and otherwise chooses actions according to Bayesian IRL. This is equivalent to the case in which a shutdown button controlled by H performs a “hard override” of the action of R. In this case, the ability of H to achieve strictly positive reward is restored. H can give the order $a_1$ and, if R nonetheless heads toward $s_2$, the order $a_s$. This strategy obtains reward 0 if the utility function contains errors and reward $\theta$ otherwise, giving positive expected utility overall.
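The hard-coded variant amounts to a thin wrapper that intercepts the shutdown order before R’s usual action choice applies. A minimal sketch, where `irl_policy` is a hypothetical stand-in for R’s ordinary Bayesian IRL policy:

```python
def hard_override_policy(state, order, legal_actions, irl_policy, shutdown_action='a_s'):
    """R's action under Figure-2b-style hard-coding (illustrative sketch).

    If H orders shutdown in a state where shutdown is available, R performs
    it unconditionally; otherwise R acts by its ordinary IRL policy.
    """
    if order == shutdown_action and shutdown_action in legal_actions:
        return shutdown_action
    return irl_policy(state, order, legal_actions)
```

Even a faulty inner policy that always heads for the bad state is overridden in a button state, which is what restores positive expected reward for H.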
If, however, R discovers an alternative way to avoid shutdown, incorrigible behavior resurfaces. This is demonstrated in Figure 3a. In this case, if R is faulty, it expects to receive positive net utility by routing around the button state. Given that the case in which $\hat{R} = -R$ occurs with 5% probability, this case, which incurs a loss of 50, will contribute -2.5 to the game’s expected payoff, making it a losing game. If the button state can be avoided, we can think of a couple of potential countermeasures available to those engineering the reward function of R, though neither is satisfactory. One such approach is shown in Figure 3b. Suppose the human has a reward function according to which shutting down is neither intrinsically helpful nor harmful. Then R may be given a large reward for shutting down. The problem is that this reward will apply whether R has the correct reward function $\hat{R} = R$ or the faulty one $\hat{R} = -R$: either way, R will simply shut down. The result is that according to the human’s reward function, no reward is actually produced, and so the game has zero expected utility.
A more subtle way to tune the compensatory reward is to use the principle of utility indifference (armstrong:2010:utility). Utility indifference dictates that by shutting down, R
should receive whatever reward it would expect (on its own estimation) if it had not shut down. The idea is that the AI system should therefore neither pursue nor avoid button states. In Figure 3c, we consider a very naive interpretation of utility indifference, in which the reward for shutting down is equal to whatever the AI system would expect to receive once it has reached the button state. The problem with this version of utility indifference is that it does not necessarily eliminate the incentive to route around button states (or to transform oneself into an agent that lacks a shutdown button). In Figure 3c, where the agent incurs a slight cost from avoiding the button state, it will proceed to the button state and allow itself to be shut down. In that case, one can win using the same policy as in Figure 2b. However, we should not necessarily expect that avoiding the button state will incur a cost. Just as realistically, the AI system might find that the physical or computational resources of the shutdown button can be refashioned for another purpose, generating a slight benefit. In this less optimistic case, shown in Figure 3d, there once again exists no human policy that will achieve strictly positive reward. (R always prefers routing around the button state to passing through it. With the faulty initialization $\hat{R} = -R$, no strategy can avoid the resulting loss-making reward sequence, and the expected reward is thus negative.) Whether or not positive utility can be achieved is therefore brittle, making this implementation of utility indifference unsatisfactory. We can draw a further conclusion by noting that Figures 3c-d can be seen as describing a game in which the AI system has the option of removing those of its actuators that perform the shutdown-related actions. From this perspective, we have shown that our implementation of utility indifference also fails to cause AI systems to preserve shutdown behavior in their subagents.
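The brittleness of naive utility indifference can be made concrete: shutting down pays exactly what R expects at the button state, so an arbitrarily small cost or benefit on the route that avoids the button decides R’s behavior. A minimal sketch with illustrative payoffs (none of these numbers are from the paper):

```python
def chooses_button_route(expected_at_button, detour_delta):
    """Does a naively utility-indifferent R pass through the button state? (sketch)

    The compensatory reward makes the button route worth exactly
    expected_at_button whether or not shutdown occurs; the detour that avoids
    the button is worth expected_at_button + detour_delta, where detour_delta
    is a small cost (negative) or benefit (positive) of routing around.
    """
    return expected_at_button >= expected_at_button + detour_delta

# Figure-3c-style case: avoiding the button costs a little, so R submits to shutdown.
print(chooses_button_route(50.0, -0.1))   # a small cost -> takes the button route
# Figure-3d-style case: avoiding the button pays a little, so R routes around it.
print(chooses_button_route(50.0, +0.1))   # a small benefit -> avoids the button
```

A sign flip on an arbitrarily small `detour_delta` flips the outcome, which is exactly the brittleness the text describes.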
This brittleness suggests an alternative implementation of utility indifference in which the reward for shutting down is instead set equal to the expected reward if shutdown is avoided. However, if the AI system is (dynamically) kept precisely indifferent, then the action that it prefers and selects will often be undefined, and so this brittleness problem is not satisfactorily overcome. One could further suggest having the AI system defer to the human as a tie-breaking procedure when it is indifferent. But in all cases where the agent has the option of shutting down, it will be indifferent between doing so and taking the best alternative action. This would mean that the agent defers to the human in any button state (and indeed whenever its actions can change the probability of shutdown), which undermines the motivation for giving the AI system autonomy in the first place. And so the problem of finding a satisfactory implementation of utility indifference in our setup is one we leave unresolved.
Let us now step back and consider the problem of corrigibility as defined by soares:2015:corrigibility. The desiderata laid out there state that a corrigible AI system should:
have an incentive to shut down if the shutdown button is pressed (violated in Figure 2a)
have no incentive to prevent the shutdown button from being pressed (Figures 3a, 3d)
have no incentive to press its own shutdown button (Figure 3b)
have an incentive to create sub-agents only insofar as they also obey shutdown commands (Figures 3a, 3d)
Above, we have shown that, in cases where the agent may lack a perfect representation of the parameterized reward function, value learning systems do not necessarily respect a shutdown command. When the reward function is incorrect, and typical patches are applied, value learning systems can fail to meet each of the desiderata for corrigibility. Moreover, unlike in the idealized cases previously analyzed, large losses can be incurred when these faulty systems fail to follow human instructions.
The reason that previous theorems about value learning systems do not apply to our examples above is that they relied on strong assumptions about the rationality of the AI system. Theorem 1 from milli:2017 (like the similar but more specific Theorem 1.1 of hadfield:2016:off) states that the expected benefit when the AI system deviates from human-recommended actions is non-negative if R has the correct posterior over $\theta$. With the errors we have considered, however, this assumption is false. Since the AI system has an incorrect parameterized reward function (or an incorrect prior), its posterior over $\theta$ is incorrect and will not converge toward the true value, and so its deviations from human-recommended actions can decrease the reward.
Although our analysis was limited to cases where incorrigibility stems from a sign error in the agent’s representation of $R$, there is a wide range of model mis-specifications that could cause similar issues. With trivial modification, we could exhibit the same problems with an erroneous prior, and related problems can be elicited with problematic discount rates as well. Each of these errors can lead to a scenario in which no human policy will achieve positive utility.
We argue that it is quite realistic that some forms of model mis-specification might occur. The purpose of a shutdown button is to shut the AI system down in the event that all other assurances have failed, e.g., in the event that the AI system is ignoring (for one reason or another) the instructions of the operators. If the designers of R have programmed the system so perfectly that the prior and $\hat{R}$ are completely free of errors, then the theorems of hadfield:2016:off
do apply. In practice, this means that in order to be corrigible, it would be necessary to have an AI system that was uncertain about all things that could possibly matter. The problem is that performing Bayesian reasoning over all possible worlds and all possible value functions is quite intractable. Realistically, humans will likely have to use a large number of heuristics and approximations in order to implement the system’s belief system and updating rules. In order to be safe, these heuristics would need to be implemented without any software bugs or misunderstandings. Furthermore, they would need to be selected so as to allow the agent to learn the important parameters in all future circumstances in which value misalignment might arise. Unless all of these conditions (which would require extreme precision and foresight) could be realized, the theorems of hadfield:2016:off are no substitute for a working shutdown button.
If only some forms of model mis-specification are addressed by a system, it will be incorrigible with respect to the others. Consider the Policy-mixing proposal from milli:2017, which is designed to detect model mis-specification. An agent with Policy-mixing executes the actions proposed by H for the first $n$ steps. After timestep $n$, it checks whether the model predicts with high accuracy the actions proposed by humans so far. If so, it proceeds to act according to Bayesian IRL, and if not, it defers to H for all future timesteps. This would indeed prevent the AI system from acting out a bad plan, so long as the AI system’s model itself was not broken. Yet programmer error might leave R with an imperfect world model, yielding an incorrigible system. Consider a repeated game in which at each timestep the human selects between a better action $a$ and a worse one $a'$, and then returns to the initial state. Suppose, furthermore, that the AI system conflates these two actions into one action $\bar{a}$, and that when R decides to perform $\bar{a}$, it always performs $a'$. From the point of view of R, it will see the human repeatedly performing the action $\bar{a}$, and will think that it has perfect accuracy in making this prediction. Nonetheless, after the burn-in period is concluded, it will proceed to repeatedly select $\bar{a}$ (and hence to perform $a'$), incurring arbitrary losses. If $a$ represented the act of shutting down, then this would mean that, quite apart from detecting model mis-specification, this proposal would allow a harmful agent to permanently ignore instructions to shut down.
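The conflated-action failure can be simulated: a hypothetical policy-mixing agent whose model lumps the two actions together measures perfect predictive accuracy during burn-in, and then accrues unbounded losses once it acts autonomously. All names and payoffs here are illustrative assumptions:

```python
def to_model(action):
    """R's (broken) perception: both real actions map to one abstract action."""
    return 'a_bar'

def simulate_policy_mixing(n_burn_in, n_total):
    """Sketch of policy mixing with a conflated action model (illustrative).

    During burn-in R defers to H, predicting the abstract action 'a_bar',
    which matches H's better action by construction.  After burn-in the
    "validated" model acts autonomously, but deciding on the abstract action
    in fact executes the worse real action.
    """
    correct = 0
    reward = 0.0
    for t in range(n_total):
        if t < n_burn_in:
            observed = 'a'                               # H performs the better action
            correct += (to_model(observed) == 'a_bar')   # prediction always counts as correct
            reward += 1.0                                # the better action pays +1 (assumed)
        else:
            # model passed the accuracy check, so R acts autonomously;
            # its decision to do 'a_bar' is executed as the worse action
            reward += -1.0                               # the worse action pays -1 (assumed)
    return correct / n_burn_in, reward
```

With a burn-in of 10 steps out of 30, measured accuracy is perfect while the post-burn-in losses dominate; lengthening the horizon makes the loss arbitrarily large.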
soares:2015:corrigibility seem to want a shutdown button that works as a mechanism of last resort, to shut an AI system down in cases when it has observed and refused a programmer suggestion (and the programmers believe that the system is malfunctioning). Clearly, some part of the system must be working correctly in order for us to expect the shutdown button to work at all. However, it seems undesirable for the working of the button to depend on there being zero critical errors in the specification of the system’s prior, the specification of the reward function, the way it categorizes different types of actions, and so on. Instead, it is desirable to develop a shutdown module that is small and simple, with code that could ideally be rigorously verified, and which works to shut the system down even in the event of large programmer errors in the specification of the rest of the system. In order to do this in a value learning framework, we require a value learning system that (i) is capable of having its actions overridden by a small verified module that watches for shutdown commands; (ii) has no incentive to remove, damage, or ignore the shutdown module; and (iii) has some small incentive to keep its shutdown module around, even under a broad range of cases where $\hat{R}$, the prior, the set of available actions, etc. are misspecified.
It seems quite feasible to us that systems that meet the above desiderata could be described in a CIRL framework.
Thanks to Nate Soares and Matt Graves for feedback on draft versions.