1 Introduction
Should robots be obedient? The reflexive answer to this question is yes. A coffee-making robot that doesn’t listen to your coffee order is not likely to sell well. Highly capable autonomous systems that don’t obey human commands run substantially higher risks, ranging from property damage to loss of life (Asaro, 2006; Lewis, 2014) to potentially catastrophic threats to humanity (Bostrom, 2014; Russell et al., 2015). Indeed, there are several recent examples of research that considers the problem of building agents that at the very least obey shutdown commands (Soares et al., 2015; Orseau and Armstrong, 2016; Hadfield-Menell et al., 2017).
However, in the long term making systems blindly obedient doesn’t seem right either. A self-driving car should certainly defer to its owner when she tries taking over because it’s driving too fast in the snow. But on the other hand, the car shouldn’t let a child accidentally turn on the manual driving mode.
The suggestion that it might sometimes be better for an autonomous system to be disobedient is not new (Weld and Etzioni, 1994; Scheutz and Crowell, 2007). For example, this is the idea behind “Do What I Mean” systems (Teitelman, 1970) that attempt to act based on the user’s intent rather than the user’s literal order.
A key contribution of this paper is to formalize this idea, so that we can study properties of obedience in AI systems. Specifically, we focus on investigating how the tradeoff between the robot’s level of obedience and the value it attains for its owner is affected by the rationality of the human, the way the robot learns about the human’s preferences over time, and the accuracy of the robot’s model of the human. We argue that these properties are likely to have a predictable effect on the robot’s obedience and the value it attains.
We start with a model of the interaction between a human $\mathbf{H}$ and robot $\mathbf{R}$ (we use “robot” to refer to any autonomous system) that enables us to formalize $\mathbf{R}$’s level of obedience (Section 2). $\mathbf{H}$ and $\mathbf{R}$ are cooperative, but $\mathbf{H}$ knows the reward parameters $\theta$ and $\mathbf{R}$ does not. $\mathbf{H}$ can order $\mathbf{R}$ to take an action and $\mathbf{R}$ can decide whether to obey or not. We show that if $\mathbf{R}$ tries to infer $\theta$ from $\mathbf{H}$’s orders and then acts by optimizing its estimate of $\theta$, then it can always do better than a blindly obedient robot when $\mathbf{H}$ is not perfectly rational (Section 3). Thus, forcing $\mathbf{R}$ to be blindly obedient does not come for free: it requires giving up the potential to surpass human performance.
We cast the problem of estimating $\theta$ from $\mathbf{H}$’s orders as an inverse reinforcement learning (IRL) problem (Ng et al., 2000; Abbeel and Ng, 2004). We analyze the obedience and value attained by robots with different estimates for $\theta$ (Section 4). In particular, we show that a robot that uses a maximum likelihood estimate (MLE) of $\theta$ is more obedient to $\mathbf{H}$’s first order than any other robot.
Finally, we examine how $\mathbf{R}$’s value and obedience are impacted when it has a misspecified model of $\mathbf{H}$’s policy or $\theta$ (Section 5). We find that when $\mathbf{R}$ uses the MLE it is robust to misspecification of $\mathbf{H}$’s rationality level (i.e. it takes the same actions that it would have with the true model), although with the optimal policy it is not. This suggests that we may want to use policies that are alternative to the “optimal” one because they are more robust to model misspecification.
If $\mathbf{R}$ is missing features of $\theta$, then it is less obedient than it should be, whereas with extra, irrelevant features $\mathbf{R}$ is more obedient. This suggests that to ensure that $\mathbf{R}$ errs on the side of obedience we should equip it with a more complex model. When $\mathbf{R}$ has extra features, it still attains more value than a blindly obedient robot. But if $\mathbf{R}$ is missing features, then it is possible for $\mathbf{R}$ to be better off being obedient. We use the fact that with the MLE $\mathbf{R}$ should nearly always obey $\mathbf{H}$’s first order (as proved in Section 4) to enable $\mathbf{R}$ to detect when it is missing features and act accordingly obedient.
Overall, we conclude that in the long term we should aim for $\mathbf{R}$ to intelligently decide when to obey $\mathbf{H}$ or not, since with a perfect model $\mathbf{R}$ can always do better than being blindly obedient. But our analysis also shows that $\mathbf{R}$’s value and obedience can easily be impacted by model misspecification. So in the meantime, it is critical to ensure that our approximations err on the side of obedience and are robust to model misspecification.
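2 Human-Robot Interaction Model
Suppose $\mathbf{H}$ is supervising $\mathbf{R}$ in a task. At each step $\mathbf{H}$ can order $\mathbf{R}$ to take an action, but $\mathbf{R}$ chooses whether to listen or not. We wish to analyze $\mathbf{R}$’s incentive to obey $\mathbf{H}$ given that

$\mathbf{H}$ and $\mathbf{R}$ are cooperative (have a shared reward)

$\mathbf{H}$ knows the reward parameters, but $\mathbf{R}$ does not

$\mathbf{R}$ can learn about the reward through $\mathbf{H}$’s orders

$\mathbf{H}$ may act suboptimally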
We first contribute a general model for this type of interaction, which we call a supervision POMDP. Then we add a simplifying assumption that makes this model clearer to analyze while still maintaining the above properties, and focus on this simplified version for the rest of the paper.
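Supervision POMDP. At each step in a supervision POMDP $\mathbf{H}$ first orders $\mathbf{R}$ to take a particular action and then $\mathbf{R}$ executes an action it chooses. The POMDP is described by a tuple $M = \langle S, \Theta, A, R, T, P_0, \gamma \rangle$. $S$ is a set of world states. $\Theta$ is a set of static reward parameters. The hidden state space of the POMDP is $S \times \Theta$ and at each step $\mathbf{R}$ observes the current world state and $\mathbf{H}$’s order. $A$ is $\mathbf{R}$’s set of actions. $R : S \times A \times \Theta \to \mathbb{R}$ is a parametrized, bounded function that maps a world state, the robot’s action, and the reward parameters to the reward. $T : S \times A \times S \to [0, 1]$ returns the probability of transitioning to a state given the previous state and the robot’s action. $P_0 : S \times \Theta \to [0, 1]$ is a distribution over the initial world state and reward parameters. $\gamma$ is the discount factor.
We assume that there is a (bounded) featurization of state-action pairs $\phi : S \times A \to \mathbb{R}^d$ and the reward function is a linear combination of the reward parameters $\theta \in \mathbb{R}^d$ and these features: $R(s, a; \theta) = \theta^\top \phi(s, a)$. For clarity, we write $A$ as $A^H$ when we mean $\mathbf{H}$’s orders and as $A^R$ when we mean $\mathbf{R}$’s actions. $\mathbf{H}$’s policy $\pi_H$ is Markovian: $\pi_H : S \times \Theta \times A^H \to [0, 1]$. $\mathbf{R}$’s policy can depend on the history of previous states, orders, and actions: $\pi_R : [S \times A^H \times A^R]^* \times S \times A^H \to A^R$.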
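Human and Robot. Let $Q(s, a; \theta)$ be the Q-value function under the optimal policy for the reward function parametrized by $\theta$.
A rational human gives the optimal order, i.e. follows the policy
$$\pi_H^*(s, a; \theta) = \begin{cases} 1 & \text{if } a = \arg\max_{a'} Q(s, a'; \theta) \\ 0 & \text{otherwise} \end{cases}$$
A noisily rational human follows the policy
$$\tilde{\pi}_H(s, a; \theta, \beta) = \frac{\exp(Q(s, a; \theta) / \beta)}{\sum_{a'} \exp(Q(s, a'; \theta) / \beta)} \qquad (1)$$
$\beta$ is the rationality parameter. As $\beta \to 0$, $\mathbf{H}$ becomes rational ($\tilde{\pi}_H \to \pi_H^*$). And as $\beta \to \infty$, $\mathbf{H}$ becomes completely random ($\tilde{\pi}_H \to \mathrm{Unif}(A)$).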
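The noisily rational policy of Equation 1 can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `noisily_rational_policy` and the example Q-values are assumptions made for this sketch, and the policy is taken to be proportional to $\exp(Q/\beta)$ as described above.

```python
import math

def noisily_rational_policy(q_values, beta):
    """Return order probabilities proportional to exp(Q / beta)."""
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(q_values)
    weights = [math.exp((q - m) / beta) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical Q-values for three actions (illustrative only).
q = [1.0, 2.0, 0.5]

# Small beta: almost all probability mass on the best action (index 1).
near_rational = noisily_rational_policy(q, beta=0.01)

# Large beta: close to uniform over the three actions.
near_random = noisily_rational_policy(q, beta=1000.0)
```

The two calls illustrate the limits stated in the text: a small $\beta$ approaches the rational policy and a large $\beta$ approaches a uniformly random one.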
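Let $h = \langle (s_1, o_1), \dots, (s_n, o_n) \rangle$ be the history of past states and orders, where $(s_n, o_n)$ is the current state and order. A blindly obedient robot’s policy is to always follow the human’s order:
$$\pi_R^O(h) = o_n$$
An IRL robot, IRL-$\mathbf{R}$, is one whose policy is to maximize an estimate, $\hat{\theta}_n(h)$, of $\theta$:
$$\pi_R(h) = \arg\max_a Q(s_n, a; \hat{\theta}_n(h)) \qquad (2)$$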
Simplification to Repeated Game. For the rest of the paper, unless otherwise noted, we focus on a simpler repeated game in which each state is independent of the next, i.e. $T(s, a, s')$ is independent of $s$ and $a$. The repeated game eliminates any exploration-exploitation tradeoff: $Q(s, a; \hat{\theta}_n) = \hat{\theta}_n^\top \phi(s, a)$. But it still maintains the properties listed at the beginning of this section, allowing us to more clearly analyze their effects.
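In the repeated game the two robot policies reduce to very little code. The sketch below assumes the linear reward $\theta^\top \phi(s, a)$ from the model; the function names, the feature values, and the estimate `theta_hat` are hypothetical, and ties are broken in favor of the order (the convention later assumed in Theorem 2).

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(u, v))

def obedient_policy(order, features, theta_hat=None):
    """Blindly obedient robot: always execute the human's order."""
    return order

def irl_policy(order, features, theta_hat):
    """IRL robot: execute argmax_a theta_hat . phi(a), preferring the order on ties."""
    values = [dot(theta_hat, phi) for phi in features]
    best = max(values)
    if values[order] == best:
        return order
    return values.index(best)

# Hypothetical features phi(s, a) for three actions and a hypothetical estimate.
features = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
theta_hat = [0.2, 0.9]

obeyed = obedient_policy(0, features)            # follows order 0
overridden = irl_policy(0, features, theta_hat)  # action 1 has the highest estimated reward
```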
3 Justifying Autonomy
In this section we show that there exists a tradeoff between the performance of a robot and its obedience. This provides a justification for why one might want a robot that isn’t obedient: robots that are sometimes disobedient perform better than robots that are blindly obedient.
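We define $\mathbf{R}$’s obedience, $O$, as the probability that $\mathbf{R}$ follows $\mathbf{H}$’s order:
$$O_n = P(\pi_R(h) = o_n)$$
To study how much of an advantage (or disadvantage) $\mathbf{R}$ gains from $\mathbf{H}$, we define the autonomy advantage, $\Delta$, as the expected extra reward $\mathbf{R}$ receives over following $\mathbf{H}$’s order:
$$\Delta_n = E[R(s_n, \pi_R(h); \theta) - R(s_n, o_n; \theta)]$$
We will drop the subscript on $O_n$ and $\Delta_n$ when talking about properties that hold $\forall n$. We will also use $R_n(\pi)$ to denote the reward of policy $\pi$ at step $n$, and $\phi_n(a) = \phi(s_n, a)$.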
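Both quantities can be estimated by simulation. The sketch below is an illustration under stated assumptions, not the paper's experimental code: it draws a fresh $\theta$ and fresh Gaussian features on every round (so it estimates single-step obedience and advantage), and it compares a blindly obedient robot with a hypothetical "oracle" robot that is handed the true $\theta$, which serves only as an upper bound.

```python
import math
import random

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def sample_order(theta, features, beta, rng):
    """Sample an order from the noisily rational policy exp(theta . phi / beta)."""
    qs = [dot(theta, phi) / beta for phi in features]
    m = max(qs)
    weights = [math.exp(q - m) for q in qs]
    return rng.choices(range(len(features)), weights=weights)[0]

def estimate_obedience_and_advantage(robot, rounds=2000, n_actions=5, dim=3,
                                     beta=1.0, seed=0):
    """Monte Carlo estimates of obedience O and autonomy advantage Delta."""
    rng = random.Random(seed)
    obeyed, extra_reward = 0, 0.0
    for _ in range(rounds):
        theta = [rng.gauss(0, 1) for _ in range(dim)]
        features = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_actions)]
        order = sample_order(theta, features, beta, rng)
        action = robot(theta, features, order)
        obeyed += (action == order)
        extra_reward += dot(theta, features[action]) - dot(theta, features[order])
    return obeyed / rounds, extra_reward / rounds

# A blindly obedient robot, and a hypothetical oracle that knows theta.
obedient = lambda theta, features, order: order
oracle = lambda theta, features, order: max(
    range(len(features)), key=lambda a: dot(theta, features[a]))

o_obedient, d_obedient = estimate_obedience_and_advantage(obedient)
o_oracle, d_oracle = estimate_obedience_and_advantage(oracle)
```

The obedient robot has obedience one and advantage zero by construction; the oracle trades obedience for a positive advantage.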
Remark 1.
For the robot to gain any advantage from being autonomous, it must sometimes be disobedient: $\Delta > 0 \implies O < 1$.
This is because whenever $\mathbf{R}$ is obedient $\Delta = 0$. This captures the fact that a blindly obedient $\mathbf{R}$ is limited by $\mathbf{H}$’s decision making ability. However, if $\mathbf{R}$ follows a type of IRL policy, then $\mathbf{R}$ is guaranteed a positive advantage when $\mathbf{H}$ is not rational. The next theorem states this formally.
Theorem 1.
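The optimal robot $\mathbf{R}^*$ is an IRL-$\mathbf{R}$ whose policy $\pi_R^*$ has $\hat{\theta}$ equal to the posterior mean of $\theta$. $\mathbf{R}^*$ is guaranteed a nonnegative advantage on each round: $\forall n\ \Delta_n \ge 0$, with equality if and only if $\forall n\ \pi_R^* = \pi_R^O$.
Proof.
When each step is independent of the next, $\mathbf{R}$’s optimal policy is to pick the action that is optimal for the current step (Kaelbling et al., 1996). This results in $\mathbf{R}$ picking the action that is optimal for the posterior mean,
$$\pi_R^*(h) = \arg\max_a E[\phi_n(a)^\top \theta \mid h] = \arg\max_a \phi_n(a)^\top E[\theta \mid h]$$
By definition $E[R_n(\pi_R^*)] \ge E[R_n(\pi_R^O)]$. Thus, $\forall n\ \Delta_n = E[R_n(\pi_R^*) - R_n(\pi_R^O)] \ge 0$. Also, by definition, $\forall n\ \Delta_n = 0 \iff \pi_R^* = \pi_R^O$. ∎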
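The posterior-mean policy in the proof can be illustrated with a discrete set of candidate reward parameters. This is a sketch under assumptions: the candidate set, the uniform prior, the features, and the helper names are all hypothetical, and the human is modeled as noisily rational with a known $\beta$.

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def order_likelihood(order, features, theta, beta):
    """Probability that a noisily rational human with parameters theta gives `order`."""
    qs = [dot(theta, phi) / beta for phi in features]
    m = max(qs)
    weights = [math.exp(q - m) for q in qs]
    return weights[order] / sum(weights)

def posterior(prior, thetas, history, beta):
    """Bayes update over a finite candidate set given (features, order) pairs."""
    weights = []
    for p, theta in zip(prior, thetas):
        for features, order in history:
            p *= order_likelihood(order, features, theta, beta)
        weights.append(p)
    z = sum(weights)
    return [w / z for w in weights]

def posterior_mean(post, thetas):
    dim = len(thetas[0])
    return [sum(p * th[i] for p, th in zip(post, thetas)) for i in range(dim)]

# Hypothetical setup: two candidate thetas, two actions, one observed order.
thetas = [[1.0, 0.0], [0.0, 1.0]]
features = [[1.0, 0.0], [0.0, 1.0]]
history = [(features, 1)]  # the human ordered action 1

post = posterior([0.5, 0.5], thetas, history, beta=1.0)
theta_bar = posterior_mean(post, thetas)
values = [dot(theta_bar, phi) for phi in features]
best_action = values.index(max(values))
```

Because the robot maximizes expected reward under the posterior mean, the value of its chosen action under that mean is never below the value of the order, which is the inequality used in the proof.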
In addition to $\mathbf{R}^*$ being an IRL-$\mathbf{R}$, the following IRL-$\mathbf{R}$s also converge to the maximum possible autonomy advantage.
Theorem 2.
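Let $\bar{\Delta}_n = E[\max_a R(s_n, a; \theta) - R(s_n, o_n; \theta)]$ be the maximum possible autonomy advantage and $\bar{O}_n = P(R(s_n, o_n; \theta) = \max_a R(s_n, a; \theta))$ be the probability $\mathbf{H}$’s order is optimal. Assume that when there are multiple optimal actions $\mathbf{R}$ picks $\mathbf{H}$’s order if it is optimal. If $\pi_R$ is an IRL-$\mathbf{R}$ policy (Equation 2) and $\hat{\theta}_n$ is strongly consistent, i.e. $P(\hat{\theta}_n = \theta) \to 1$, then $\Delta_n - \bar{\Delta}_n \to 0$ and $O_n - \bar{O}_n \to 0$.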
Proof.
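$\Delta_n - \bar{\Delta}_n \to 0$ because $P(\hat{\theta}_n = \theta) \to 1$ and $R$ is bounded. Similarly, $O_n - \bar{O}_n \to 0$.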
∎
Remark 2.
In the limit $\Delta_n$ is higher for less optimal humans (humans with a lower expected reward $E[R(s_n, o_n; \theta)]$).
Theorem 3.
The optimal robot $\mathbf{R}^*$ is blindly obedient if and only if $\mathbf{H}$ is rational: $\pi_R^* = \pi_R^O \iff \pi_H = \pi_H^*$
Proof.
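Let $\Theta_h = \{\theta \in \Theta : o_i \in \arg\max_a \theta^\top \phi_i(a),\ i = 1, \dots, n\}$ be the subset of $\Theta$ for which $o_1, \dots, o_n$ are optimal. If $\mathbf{H}$ is rational, then $\mathbf{R}$’s posterior only has support over $\Theta_h$. So, for every action $a$,
$$E[\theta \mid h]^\top \phi_n(o_n) = E[\theta^\top \phi_n(o_n) \mid h] \ge E[\theta^\top \phi_n(a) \mid h] = E[\theta \mid h]^\top \phi_n(a)$$
Thus, $\mathbf{H}$ is rational $\implies \pi_R^* = \pi_R^O$.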
We have shown that making $\mathbf{R}$ blindly obedient does not come for free. A positive $\Delta$ requires being sometimes disobedient (Remark 1). Under the optimal policy $\mathbf{R}$ is guaranteed a positive $\Delta$ when $\mathbf{H}$ is not rational. And in the limit, $\mathbf{R}$ converges to the maximum possible advantage. Furthermore, the more suboptimal $\mathbf{H}$ is, the more of an advantage $\mathbf{R}$ eventually earns (Remark 2). Thus, making $\mathbf{R}$ blindly obedient requires giving up on this potential $\Delta > 0$.
However, as Theorem 2 points out, as $n \to \infty$ $\mathbf{R}$ also only listens to $\mathbf{H}$’s order when it is optimal. Thus, $\Delta$ and $O$ come at a tradeoff. Autonomy advantage requires giving up obedience, and obedience requires giving up autonomy advantage.
4 Approximations via IRL
$\mathbf{R}^*$ is an IRL-$\mathbf{R}$ with $\hat{\theta}$ equal to the posterior mean, i.e. $\mathbf{R}^*$ performs Bayesian IRL (Ramachandran and Amir, 2007). However, as others have noted, Bayesian IRL can be very expensive in complex environments (Michini and How, 2012). We could instead approximate $\mathbf{R}^*$ by using a less expensive IRL algorithm. Furthermore, by Theorem 2 we can guarantee convergence to optimal behavior.
Simpler choices for $\hat{\theta}$ include the maximum-a-posteriori (MAP) estimate, which has previously been suggested as an alternative to Bayesian IRL (Choi and Kim, 2011), or the maximum likelihood estimate (MLE). If $\mathbf{H}$ is noisily rational (Equation 1) and $\beta = 1$, then the MLE is equivalent to Maximum Entropy IRL (Ziebart et al., 2008).
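For a noisily rational human in the repeated game, the log likelihood of the orders is concave in $\theta$, so an MLE can be approximated by plain gradient ascent. The sketch below is a generic illustration, not the authors' notebook code: the function names, step size, iteration count, and the synthetic data (including `true_theta`) are assumptions. The gradient of each order's log probability is $(\phi(o) - E_{\tilde{\pi}}[\phi]) / \beta$.

```python
import math
import random

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def order_probs(theta, features, beta):
    """Noisily rational order distribution, proportional to exp(theta . phi / beta)."""
    qs = [dot(theta, phi) / beta for phi in features]
    m = max(qs)
    weights = [math.exp(q - m) for q in qs]
    total = sum(weights)
    return [w / total for w in weights]

def log_likelihood(theta, history, beta):
    return sum(math.log(order_probs(theta, f, beta)[o]) for f, o in history)

def mle_gradient_ascent(history, dim, beta=1.0, step=0.1, iters=200):
    """Approximate the MLE by ascending the mean log likelihood from theta = 0."""
    theta = [0.0] * dim
    for _ in range(iters):
        grad = [0.0] * dim
        for features, order in history:
            probs = order_probs(theta, features, beta)
            for i in range(dim):
                expected = sum(p * phi[i] for p, phi in zip(probs, features))
                grad[i] += (features[order][i] - expected) / beta
        theta = [t + step * g / len(history) for t, g in zip(theta, grad)]
    return theta

# Synthetic orders from a hypothetical noisily rational human (beta = 1).
rng = random.Random(0)
true_theta = [1.0, -0.5]
history = []
for _ in range(50):
    features = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(4)]
    order = rng.choices(range(4), weights=order_probs(true_theta, features, 1.0))[0]
    history.append((features, order))

theta_hat = mle_gradient_ascent(history, dim=2)
```

When the orders are perfectly separable the likelihood has no finite maximizer, which is why Theorem 4 below constrains the norm of the estimate; a fixed iteration count plays a similar role in this sketch.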
Although Theorem 2 allows us to justify approximations at the limit, it is also important to ensure that $\mathbf{R}$’s early behavior is not dangerous. Specifically, we may want $\mathbf{R}$ to err on the side of obedience early on. To investigate this we first prove a necessary property for any IRL-$\mathbf{R}$ to follow $\mathbf{H}$’s order:
Lemma 1.
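(Undominated necessary) Call $o_n$ undominated if there exists $\theta \in \Theta$ such that $o_n$ is optimal, i.e. $o_n = \arg\max_a \theta^\top \phi(s_n, a)$. It is necessary for $o_n$ to be undominated for an IRL-$\mathbf{R}$ to execute $o_n$.
Proof.
$\mathbf{R}$ executes $a = \arg\max_a \hat{\theta}_n^\top \phi(s_n, a)$, so it is not possible for $\mathbf{R}$ to execute $o_n$ if there is no choice of $\theta$ that makes $o_n$ optimal. This can happen when one action dominates another action in value. For example, suppose there are three actions whose features are arranged so that, for every $\theta$, one of the other two actions has higher value than the action $\mathbf{H}$ picks. Then there is no $\theta \in \Theta$ that makes $\mathbf{H}$’s order optimal, and thus $\mathbf{R}$ will never follow it. ∎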
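A dominated order can be illustrated with a simple sampling check. This is a heuristic sketch under assumptions: the three collinear feature vectors below are hypothetical, chosen so that the middle action is never strictly best, and random search over $\theta$ can only confirm that an action is undominated (a failed search is evidence, not proof, of domination).

```python
import random

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def is_strictly_optimal(action, features, theta):
    """True if `action` has strictly higher value than every other action under theta."""
    value = dot(theta, features[action])
    return all(value > dot(theta, phi)
               for i, phi in enumerate(features) if i != action)

def found_supporting_theta(action, features, samples=5000, seed=0):
    """Randomly search for a theta that makes `action` strictly optimal."""
    rng = random.Random(seed)
    dim = len(features[0])
    for _ in range(samples):
        theta = [rng.gauss(0, 1) for _ in range(dim)]
        if is_strictly_optimal(action, features, theta):
            return True
    return False

# Hypothetical features: action 1 lies exactly between actions 0 and 2.
features = [[-1.0, -1.0], [0.0, 0.0], [1.0, 1.0]]

middle_undominated = found_supporting_theta(1, features)  # no theta can work
outer_undominated = found_supporting_theta(2, features)   # any theta with positive sum works
```

For these features the value of action 1 is always zero while the other two values are negatives of each other, so one of them is at least as good for every $\theta$.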
One basic property we may want $\mathbf{R}$ to have is for it to listen to $\mathbf{H}$ early on. The next theorem looks at what we can guarantee about $\mathbf{R}$’s obedience to the first order when $\mathbf{H}$ is noisily rational.
Theorem 4.
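(Obedience to noisily rational $\mathbf{H}$ on 1st order)

(a) When $\Theta = \mathbb{R}^d$ the MLE does not exist after one order. But if we constrain the norm of $\hat{\theta}$ to not be too large, then we can ensure that $\mathbf{R}$ follows an undominated $o_1$. In particular, $\exists K$ such that when $\mathbf{R}$ plans using the MLE constrained to $\|\hat{\theta}\|_2 \le K$, $\mathbf{R}$ executes $o_1$ if and only if $o_1$ is undominated.

(b) If any IRL robot follows $o_1$, so does MLE-$\mathbf{R}$. In particular, if $\mathbf{R}^*$ follows $o_1$, so does MLE-$\mathbf{R}$.

(c) If $\mathbf{R}$ uses the MAP or posterior mean, it is not guaranteed to follow an undominated $o_1$. Furthermore, even if $\mathbf{R}^*$ follows $o_1$, MAP-$\mathbf{R}$ is not guaranteed to follow $o_1$.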
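Proof.

(a) The only if condition holds from Lemma 1. Suppose $o_1$ is undominated. Then there exists $\theta^*$ such that $o_1$ is optimal for $\theta^*$. $o_1$ is still optimal for a scaled version, $c\theta^*$. As $c \to \infty$, $\tilde{\pi}_H(o_1; c\theta^*) \to 1$, but never reaches it. Thus, the MLE does not exist.
However, since $\tilde{\pi}_H(o_1; c\theta^*)$ monotonically increases towards 1, $\exists C$ such that for $c > C$, $\tilde{\pi}_H(o_1; c\theta^*) > 0.5$. If $K > C\|\theta^*\|$, then $o_1$ will be optimal for the MLE $\hat{\theta}_1$ because $\tilde{\pi}_H(o_1; \hat{\theta}_1) > 0.5$ and $\mathbf{R}$ executes $a = \arg\max_a \hat{\theta}_1^\top \phi(a)$. Therefore, in practice we can simply use the MLE while constraining $\|\hat{\theta}\|_2$ to be less than some very large number.

(b) From Lemma 1, if any IRL-$\mathbf{R}$ follows $o_1$, then $o_1$ is undominated. Then by (a) MLE-$\mathbf{R}$ follows $o_1$.

(c) For space we omit explicit counterexamples, but both statements hold because we can construct adversarial priors for which $o_1$ is suboptimal for the posterior mean and for which $o_1$ is optimal for the posterior mean, but not for the MAP.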
∎
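The scaling argument in part (a) of the proof is easy to check numerically. The sketch below assumes a noisily rational human with $\beta = 1$; the features and the parameter vector `theta_star` are hypothetical, chosen so that action 0 is the unique optimum. Scaling `theta_star` by larger and larger constants pushes the likelihood of ordering action 0 towards one without ever attaining it, which is why the unconstrained MLE does not exist after one order.

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def order_likelihood(order, features, theta, beta=1.0):
    """Probability of `order` under the noisily rational policy exp(theta . phi / beta)."""
    qs = [dot(theta, phi) / beta for phi in features]
    m = max(qs)
    weights = [math.exp(q - m) for q in qs]
    return weights[order] / sum(weights)

# Hypothetical features and a theta that makes action 0 uniquely optimal.
features = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
theta_star = [1.0, 0.2]

scales = [1, 2, 5, 10, 20]
likelihoods = [
    order_likelihood(0, features, [c * t for t in theta_star]) for c in scales
]
```

Once the likelihood of the order exceeds 0.5 the order necessarily has the highest value, so any norm bound large enough to allow such a scaling makes the constrained MLE robot follow the order.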
Theorem 4 suggests that at least at the beginning when $\mathbf{R}$ uses the MLE it errs on the side of giving us the “benefit of the doubt”, which is exactly what we would want out of an approximation.
Figure 1(a) and 1(b) plot $\Delta$ and $O$ for an IRL robot that uses the MLE. As expected, $\mathbf{R}$ gains more reward than a blindly obedient one ($\Delta > 0$), eventually converging to the maximum autonomy advantage (Figure 1(a)). On the other hand, as $\mathbf{R}$ learns about $\theta$, its obedience also decreases, until eventually it only listens to the human when she gives the optimal order (Figure 1(b)).
As pointed out in Remark 2, $\Delta$ is eventually higher for more irrational humans. However, a more irrational human also provides noisier evidence of $\theta$, so the rate of convergence of $\Delta$ is also slower. So, although initially $\Delta$ may be lower for a more irrational $\mathbf{H}$, in the long run there is more to gain from being autonomous when interacting with a more irrational human. Figure 3 shows this empirically.
All experiments in this paper use the following parameters unless otherwise noted. At the start of each episode and at each step . There are 10 actions, 10 features, and . (All experiments can be replicated using the Jupyter notebook available at http://github.com/smilli/obedience.)
Finally, even with good approximations we may still have good reason to hesitate about disobedient robots. The naive analysis presented so far assumes that $\mathbf{R}$’s models are perfect, but it is almost certain that $\mathbf{R}$’s models of complex things like human preferences and behavior will be incorrect. By Lemma 1, $\mathbf{R}$ will not obey even the first order made by $\mathbf{H}$ if there is no $\theta \in \Theta$ that makes $\mathbf{H}$’s order optimal. So clearly, it is possible to have disastrous effects by having an incorrect model of $\Theta$. In the next section we look at how misspecification of possible human preferences ($\Theta$) and human behavior ($\pi_H$) can cause the robot to be overconfident and in turn less obedient than it should be. The autonomy advantage can easily become the rebellion regret.
5 Model Misspecification
Incorrect Model of Human Behavior. Having an incorrect model of $\mathbf{H}$’s rationality ($\beta$) does not change the actions of MLE-$\mathbf{R}$, but does change the actions of $\mathbf{R}^*$.
Theorem 5.
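(Incorrect model of human policy) Let $\beta_0$ be $\mathbf{H}$’s true rationality and $\beta'$ be the rationality that $\mathbf{R}$ believes $\mathbf{H}$ has. Let $\hat{\theta}$ and $\hat{\theta}'$ be $\mathbf{R}$’s estimate under the true model and misspecified model, respectively. Call $\mathbf{R}$ robust if its actions under $\beta'$ are the same as its actions under $\beta_0$.

(a) MLE-$\mathbf{R}$ is robust.

(b) $\mathbf{R}^*$ is not robust.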
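Proof.

(a) The log likelihood is concave in $\eta = \theta / \beta$. So, $\hat{\theta}'_n = (\beta' / \beta_0)\, \hat{\theta}_n$. This does not change $\mathbf{R}$’s action:
$$\arg\max_a \hat{\theta}_n'^\top \phi_n(a) = \arg\max_a \hat{\theta}_n^\top \phi_n(a)$$

(b) Counterexamples can be constructed based on the fact that as $\beta \to 0$, $\mathbf{H}$ becomes rational, but as $\beta \to \infty$, $\mathbf{H}$ becomes completely random. Thus, the likelihood will “win” over the prior for $\beta \to 0$, but not when $\beta \to \infty$.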
∎
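The argument for part (a) rests on the likelihood depending on $\theta$ and $\beta$ only through the ratio $\theta / \beta$. A small numerical check makes this concrete. The data, the parameter vector, and the two rationality values below are hypothetical; the sketch only verifies the rescaling identity and that a positive rescaling leaves the greedy action unchanged.

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def log_likelihood(theta, history, beta):
    """Log likelihood of the orders under the noisily rational model."""
    total = 0.0
    for features, order in history:
        qs = [dot(theta, phi) / beta for phi in features]
        m = max(qs)
        log_z = m + math.log(sum(math.exp(q - m) for q in qs))
        total += qs[order] - log_z
    return total

def best_action(theta, features):
    values = [dot(theta, phi) for phi in features]
    return values.index(max(values))

# Hypothetical orders: (features of each action, index of the ordered action).
history = [
    ([[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]], 0),
    ([[0.2, 0.9], [-0.4, 0.1], [0.7, 0.3]], 2),
]
theta = [0.3, -0.7]
beta_true, beta_wrong = 0.5, 3.0

# Rescaling theta by beta_wrong / beta_true keeps theta / beta fixed.
rescaled = [t * beta_wrong / beta_true for t in theta]
ll_true = log_likelihood(theta, history, beta_true)
ll_wrong = log_likelihood(rescaled, history, beta_wrong)
same_action = best_action(theta, history[1][0]) == best_action(rescaled, history[1][0])
```

Since every $\theta$ has a rescaled counterpart with identical likelihood, the maximizer under the wrong $\beta$ is just a positive multiple of the maximizer under the true one, and the robot's action is the same.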
MLE-$\mathbf{R}$ is more robust than the optimal $\mathbf{R}^*$. This suggests a reason beyond computational savings for using approximations: the approximations may be more robust to misspecification than the optimal policy.
Remark 3.
Theorem 5 may give us insight into why Maximum Entropy IRL (which is the MLE with $\beta = 1$) works well in practice. In simple environments where noisy rationality can be used as a model of human behavior, getting the level of noisiness right doesn’t matter.
Incorrect Model of Human Preferences. The simplest way that $\mathbf{H}$’s preferences may be misspecified is through the featurization of $\theta$. Suppose $\theta \in \mathbb{R}^d$. $\mathbf{R}$ believes that $\Theta = \mathbb{R}^{d'}$. $\mathbf{R}$ may be missing features ($d' < d$) or may have irrelevant features ($d' > d$). $\mathbf{R}$ observes a $d'$-dimensional feature vector for each action: $\phi_n(a) \in \mathbb{R}^{d'}$. The true $\theta$ depends on only the first $d$ features, but $\mathbf{R}$ estimates $\hat{\theta} \in \mathbb{R}^{d'}$. Figure 4 shows how $\Delta$ and $O$ change over time as a function of the number of features for an MLE-$\mathbf{R}$. When $\mathbf{R}$ has irrelevant features it still achieves a positive $\Delta$ (and still converges to the maximum $\Delta$ because $\hat{\theta}$ remains consistent over a superset of $\Theta$). But if $\mathbf{R}$ is missing features, then $\Delta$ may be negative, and thus $\mathbf{R}$ would be better off being blindly obedient instead. Furthermore, when $\mathbf{R}$ contains extra features it is more obedient than it would be with the true model. But if $\mathbf{R}$ is missing features, then it is less obedient than it should be. This suggests that to ensure $\mathbf{R}$ errs on the side of obedience we should err on the side of giving $\mathbf{R}$ a more complex model.
Detecting Misspecification. If $\mathbf{R}$ has the wrong model of $\Theta$, $\mathbf{R}$ may be better off being obedient. In the remainder of this section we look at how $\mathbf{R}$ can detect that it is missing features and act accordingly obedient.
Remark 4.
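(Policy mixing) We can make $\mathbf{R}$ more obedient, while maintaining convergence to the maximum advantage, by mixing $\mathbf{R}$’s policy $\pi_R$ with a blindly obedient policy:
$$\pi_R'(h) = 1\{\delta_n = 0\}\, \pi_R^O(h) + 1\{\delta_n = 1\}\, \pi_R(h)$$
where $P(\delta_n = 0) = c_n$ with $c_n \to 0$. In particular, we can have an initial “burn-in” period where $\mathbf{R}$ is blindly obedient for a finite number of rounds before switching to $\pi_R$.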
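Policy mixing can be sketched as a wrapper around whatever action the IRL policy proposes. This is an illustration under assumptions: the function names, the decaying schedule `1 / n`, and the burn-in length are hypothetical choices, and the only property the remark relies on is that the probability of forced obedience goes to zero.

```python
import random

def mixed_policy(order, irl_action, n, rng, obey_prob=lambda n: 1.0 / n):
    """On round n, obey with probability obey_prob(n); otherwise act on the IRL estimate."""
    if rng.random() < obey_prob(n):
        return order
    return irl_action

def burn_in_policy(order, irl_action, n, burn_in=10):
    """Special case: obey for the first `burn_in` rounds, then switch to the IRL action."""
    return order if n <= burn_in else irl_action

rng = random.Random(0)
early = burn_in_policy(order=0, irl_action=1, n=5)   # still in burn-in: obey
late = burn_in_policy(order=0, irl_action=1, n=11)   # after burn-in: IRL action
always_obey = mixed_policy(0, 1, n=3, rng=rng, obey_prob=lambda n: 1.0)
never_obey = mixed_policy(0, 1, n=3, rng=rng, obey_prob=lambda n: 0.0)
```

The burn-in variant corresponds to a schedule that is one for a finite number of rounds and zero afterwards.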
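By Theorem 4 we know MLE-$\mathbf{R}$ will always obey $\mathbf{H}$’s first order if it is undominated. This means that for MLE-$\mathbf{R}$, $O_1$ should be close to one if dominated orders are expected to be rare. As pointed out in Remark 4 we can have an initial “burn-in” period where $\mathbf{R}$ always obeys $\mathbf{H}$. Let $\mathbf{R}$ have a burn-in obedience period of $B$ rounds. $\mathbf{R}$ uses this burn-in period to calculate the sample obedience on the first order, treating each of the $B$ orders as if it were the first:
$$\tilde{O}_1 = \frac{1}{B} \sum_{i=1}^{B} 1\{\arg\max_a \hat{\theta}_1(o_i)^\top \phi_i(a) = o_i\}$$
If $\tilde{O}_1$ is not close to one, then it is likely that $\mathbf{R}$ has the wrong model of $\Theta$, and would be better off just being obedient. So, we can choose some small $\epsilon$ and make $\mathbf{R}$’s policy
$$\pi_R(h) = \begin{cases} o_n & n \le B \\ o_n & n > B,\ \tilde{O}_1 < 1 - \epsilon \\ \arg\max_a \hat{\theta}_n^\top \phi_n(a) & n > B,\ \tilde{O}_1 \ge 1 - \epsilon \end{cases} \qquad (3)$$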
Figure 5 shows the $\Delta$ of this robot as compared to the MLE-$\mathbf{R}$ from Figure 4 after using the first ten orders as a burn-in period. This robot achieves higher $\Delta$ than MLE-$\mathbf{R}$ when missing features and still does as well as MLE-$\mathbf{R}$ when it isn’t missing features.
Note that this strategy relies on the fact that MLE-$\mathbf{R}$ has the property of always following an undominated first order. If $\mathbf{R}$ were using the optimal policy, it is unclear what kind of simple property we could use to detect missing features. This gives us another reason for using an approximation: we may be able to leverage its properties to detect misspecification.
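The detection strategy of this section can be sketched as two small functions: one computes the sample obedience over the burn-in orders, and one applies the resulting guard. This is a hedged illustration rather than the authors' code: the function names, the threshold `epsilon=0.1`, and the callback `first_order_action` (which stands in for "the action the MLE robot would take after seeing only this one order") are assumptions.

```python
def sample_first_order_obedience(burn_in_rounds, first_order_action):
    """Fraction of burn-in orders the robot would have followed as a first order.

    burn_in_rounds: list of (features, order) pairs.
    first_order_action: callable (features, order) -> action index.
    """
    followed = sum(
        1 for features, order in burn_in_rounds
        if first_order_action(features, order) == order
    )
    return followed / len(burn_in_rounds)

def guarded_policy(n, order, irl_action, burn_in, sample_obedience, epsilon=0.1):
    """Obey during burn-in, and keep obeying if the model looks misspecified."""
    if n <= burn_in:
        return order
    if sample_obedience < 1.0 - epsilon:
        return order
    return irl_action

# Toy burn-in data with placeholder features (None) and hypothetical robots.
rounds = [(None, 0), (None, 1), (None, 2), (None, 0)]
well_specified = sample_first_order_obedience(rounds, lambda f, o: o)   # follows every order
misspecified = sample_first_order_obedience(rounds, lambda f, o: 0)     # follows only order 0

trusting = guarded_policy(11, order=0, irl_action=3, burn_in=10,
                          sample_obedience=well_specified)
deferring = guarded_policy(11, order=0, irl_action=3, burn_in=10,
                           sample_obedience=misspecified)
```

A robot whose sample obedience falls well below one defers to the human indefinitely, while a robot that would have followed nearly every first order is allowed to act on its estimate after the burn-in.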
6 Related Work
Ensuring Obedience. There are several recent examples of research that aim to provably ensure that $\mathbf{H}$ can interrupt $\mathbf{R}$ (Soares et al., 2015; Orseau and Armstrong, 2016; Hadfield-Menell et al., 2017). Hadfield-Menell et al. (2017) show that $\mathbf{R}$’s obedience depends on a tradeoff between $\mathbf{R}$’s uncertainty about $\theta$ and $\mathbf{H}$’s rationality. However, they considered $\mathbf{R}$’s uncertainty in the abstract. In practice $\mathbf{R}$ would need to learn about $\theta$ through $\mathbf{H}$’s behavior. Our work analyzes how the way $\mathbf{R}$ learns about $\theta$ impacts its performance and obedience.
Intent Inference For Assistance. Instead of just being blindly obedient, an autonomous system can infer $\mathbf{H}$’s intention and actively assist $\mathbf{H}$ in achieving it. Do What I Mean software packages interpret the intent behind what a programmer wrote to automatically correct programming errors (Teitelman, 1970). When a user uses a telepointer, network lag can cause jitter in her cursor’s path. Gutwin et al. (2003) address this by displaying a prediction of the user’s desired path, rather than the actual cursor path.
Similarly, in assistive teleoperation, the robot does not directly execute $\mathbf{H}$’s (potentially noisy) input. It instead acts based on an inference of $\mathbf{H}$’s intent. In Dragan and Srinivasa (2012) $\mathbf{R}$ acts according to an arbitration between $\mathbf{H}$’s policy and $\mathbf{R}$’s prediction of $\mathbf{H}$’s policy. Like our work, Javdani et al. (2015) formalize assistive teleoperation as a POMDP in which $\mathbf{H}$’s goals are unknown, and try to optimize an inference of $\mathbf{H}$’s goal. While assistive teleoperation a priori assumes that $\mathbf{R}$ should act assistively, we show that under model misspecification sometimes it is better for $\mathbf{R}$ to simply defer to $\mathbf{H}$, and contribute a method to decide between active assistance and blind obedience (Remark 4).
Inverse Reinforcement Learning. We use inverse reinforcement learning (Ng et al., 2000; Abbeel and Ng, 2004) to infer $\theta$ from $\mathbf{H}$’s orders. We analyze how different IRL algorithms affect autonomy advantage and obedience, properties not previously studied in the literature. In addition, we analyze how misspecification of the features of the space of reward parameters or of $\mathbf{H}$’s rationality impacts autonomy advantage and obedience.
IRL algorithms typically assume that $\mathbf{H}$ is rational or noisily rational. We show that Maximum Entropy IRL (Ziebart et al., 2008) is robust to misspecification of a noisily rational $\mathbf{H}$’s rationality ($\beta$). However, humans are not truly noisily rational, and in the future it is important to investigate other models of humans in IRL and their potential misspecifications. Evans et al. (2016) take a step in this direction and model $\mathbf{H}$ as temporally inconsistent and potentially having false beliefs. In addition, IRL assumes that $\mathbf{H}$ acts without awareness of $\mathbf{R}$’s presence; cooperative inverse reinforcement learning (Hadfield-Menell et al., 2016) relaxes this assumption by modeling the interaction between $\mathbf{H}$ and $\mathbf{R}$ as a two-player cooperative game.
7 Conclusion
To summarize our key takeaways:

($\Delta > 0$) If $\mathbf{H}$ is not rational, then $\mathbf{R}^*$ can always attain a positive $\Delta$. Thus, forcing $\mathbf{R}$ to be blindly obedient requires giving up on a positive $\Delta$.

($\Delta$ vs $O$) There exists a tradeoff between $\Delta$ and $O$. At the limit $\mathbf{R}^*$ attains the maximum $\Delta$, but only obeys $\mathbf{H}$’s order when it is the optimal action.

(MLE-$\mathbf{R}$) When $\mathbf{H}$ is noisily rational MLE-$\mathbf{R}$ is at least as obedient as any other IRL-$\mathbf{R}$ to $\mathbf{H}$’s first order. This suggests that the MLE is a good approximation to $\mathbf{R}^*$ because it errs on the side of obedience.

(Wrong $\beta$) MLE-$\mathbf{R}$ is robust to having the wrong model of the human’s rationality ($\beta$), but $\mathbf{R}^*$ is not. This suggests that we may not want to use the “optimal” policy because it may not be very robust to misspecification.

(Wrong $\Theta$) If $\mathbf{R}$ has extra features, it is more obedient than with the true model, whereas if it is missing features, then it is less obedient. If $\mathbf{R}$ has extra features, it will still converge to the maximum $\Delta$. But if $\mathbf{R}$ is missing features, it is sometimes better for $\mathbf{R}$ to be obedient. This implies that erring on the side of extra features is far better than erring on the side of fewer features.

(Detecting wrong $\Theta$) We can detect missing features by checking how likely MLE-$\mathbf{R}$ is to follow the first order.
Overall, our analysis suggests that in the long term we should aim to create robots that intelligently decide when to follow orders, but in the meantime it is crucial to ensure that these robots err on the side of obedience and are robust to misspecified models.
8 Acknowledgements
We thank Daniel Filan for feedback on an early draft.
References

 Abbeel and Ng (2004) Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In International Conference on Machine learning, page 1. ACM, 2004.
 Asaro (2006) Peter M Asaro. What Should We Want From a Robot Ethic. International Review of Information Ethics, 6(12):9–16, 2006.
 Bostrom (2014) Nick Bostrom. Superintelligence: Paths, Dangers, Strategies. OUP Oxford, 2014.
 Choi and Kim (2011) Jaedeug Choi and Kee-Eung Kim. MAP Inference for Bayesian Inverse Reinforcement Learning. In Advances in Neural Information Processing Systems, pages 1989–1997, 2011.
 Diaconis and Freedman (1986) Persi Diaconis and David Freedman. On the consistency of Bayes estimates. The Annals of Statistics, pages 1–26, 1986.
 Dragan and Srinivasa (2012) Anca D Dragan and Siddhartha S Srinivasa. Formalizing assistive teleoperation. MIT Press, July, 2012.

 Evans et al. (2016) Owain Evans, Andreas Stuhlmüller, and Noah D Goodman. Learning the preferences of ignorant, inconsistent agents. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pages 323–329. AAAI Press, 2016.
 Gutwin et al. (2003) Carl Gutwin, Jeff Dyck, and Jennifer Burkitt. Using cursor prediction to smooth telepointer jitter. In Proceedings of the 2003 international ACM SIGGROUP conference on Supporting group work, pages 294–301. ACM, 2003.
 Hadfield-Menell et al. (2016) Dylan Hadfield-Menell, Stuart J Russell, Pieter Abbeel, and Anca Dragan. Cooperative Inverse Reinforcement Learning. In Advances in Neural Information Processing Systems, pages 3909–3917, 2016.
 Hadfield-Menell et al. (2017) Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, and Stuart Russell. The Off-Switch Game. In IJCAI, 2017.
 Javdani et al. (2015) Shervin Javdani, Siddhartha S Srinivasa, and J Andrew Bagnell. Shared autonomy via hindsight optimization. In Robotics: Science and Systems, 2015.
 Kaelbling et al. (1996) Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore. Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.
 Lewis (2014) John Lewis. The Case for Regulating Fully Autonomous Weapons. Yale Law Journal, 124:1309, 2014.
 Michini and How (2012) Bernard Michini and Jonathan P How. Improving the Efficiency of Bayesian Inverse Reinforcement Learning. In IEEE International Conference on Robotics and Automation (ICRA), pages 3651–3656. IEEE, 2012.
 Ng et al. (2000) Andrew Y Ng, Stuart J Russell, et al. Algorithms for inverse reinforcement learning. In International Conference on Machine Learning, pages 663–670, 2000.
 Orseau and Armstrong (2016) Laurent Orseau and Stuart Armstrong. Safely Interruptible Agents. In UAI, 2016.
 Ramachandran and Amir (2007) Deepak Ramachandran and Eyal Amir. Bayesian Inverse Reinforcement Learning. In IJCAI, 2007.
 Russell et al. (2015) Stuart Russell, Daniel Dewey, and Max Tegmark. Research Priorities for Robust and Beneficial Artificial Intelligence. AI Magazine, 36(4):105–114, 2015.
 Scheutz and Crowell (2007) Matthias Scheutz and Charles Crowell. The Burden of Embodied Autonomy: Some Reflections on the Social and Ethical Implications of Autonomous Robots. In Workshop on Roboethics at the International Conference on Robotics and Automation, Rome, 2007.
 Soares et al. (2015) Nate Soares, Benja Fallenstein, Stuart Armstrong, and Eliezer Yudkowsky. Corrigibility. In Workshops at the Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
 Teitelman (1970) Warren Teitelman. Toward a Programming Laboratory. Software Engineering Techniques, page 108, 1970.
 Weld and Etzioni (1994) Daniel Weld and Oren Etzioni. The First Law of Robotics (a call to arms). In AAAI, volume 94, pages 1042–1047, 1994.
 Ziebart et al. (2008) Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum Entropy Inverse Reinforcement Learning. In AAAI, volume 8, pages 1433–1438. Chicago, IL, USA, 2008.