1 Introduction
It is difficult to specify reward functions that always lead to the desired behavior. Recent work has argued that the reward specified by a human is merely a source of information about what people actually want a robot to optimize, i.e., the intended reward (Hadfield-Menell et al., 2017b; Ratner et al., 2018). Luckily, it is not the only one. Robots can also learn about the intended reward from demonstrations (IRL) (Ng and Russell, 2000; Abbeel and Ng, 2004), by asking us to make comparisons between trajectories (Wirth et al., 2017; Sadigh et al., 2017; Christiano et al., 2017), or by grounding our instructions (MacGlashan et al., 2015; Fu et al., 2019).
Perhaps even more fortunate is that we seem to leak information left and right about the intended reward. For instance, if we push the robot away, this shouldn’t just modify the robot’s current behavior – it should also inform the robot about our preferences more generally (Jain et al., 2015; Bajcsy et al., 2017). If we turn the robot off in a state of panic to prevent it from disaster, this shouldn’t just stop the robot right now. It should also inform the robot about the intended reward function so that the robot avoids the same disaster in the future: the robot should infer that whatever it was about to do has a tragically low reward. Even the current state of the world ought to inform the robot about our preferences – it is a direct result of us having been acting in the world according to these preferences (Shah et al., 2019)! For instance, those shoes didn’t magically align themselves at the entrance; someone put effort into arranging them that way, so their state alone should tell the robot something about what we want.
Overall, there is much information out there, some purposefully communicated, some leaked. While existing papers show us how to tap into some of it, one can only imagine that there is much more that is yet untapped. There are probably new yet-to-be-invented ways for people to purposefully provide feedback to robots – e.g. guiding them on which part of a trajectory was particularly good or bad. And there will probably be new realizations about ways in which human behavior already leaks information, beyond the state of the world or turning the robot off. How will robots make sense of all these diverse sources of information?
Our insight is that there is a way to interpret all this information in a single unifying formalism. The critical observation is that human behavior is a reward-rational implicit choice – a choice from an implicit set of options, which is approximately rational for the intended reward. This observation leads to a recipe for making sense of human behavior, from language to switching the robot off. The recipe has two ingredients: 1) the set of options the person (implicitly) chose from, and 2) a grounding function that maps these options to robot behaviors. This is admittedly obvious for traditional feedback. In comparison feedback, for instance, the set of options is just the two robot behaviors presented to the human to compare, and the grounding is identity. In other types of behavior though, it is much less obvious. Take switching the robot off. The set of options is implicit: you can turn it off, or you can do nothing. The formalism says that when you turn it off, it should know that you could have done nothing, but (implicitly) chose not to. That, in turn, should propagate to the robot’s reward function. For this to happen, the robot needs to ground these options to robot behaviors: identity is no longer enough, because it cannot directly evaluate the reward of an utterance or of getting turned off, but it can evaluate the reward of robot actions or trajectories. Turning the robot off corresponds to a trajectory – whatever the robot did until the off-button was pushed, followed by doing nothing for the rest of the time horizon. Doing nothing corresponds to the trajectory the robot was going to execute. Now, the robot knows you prefer the former to the latter. We have taken a high-level human behavior, and turned it into a direct comparison on robot trajectories with respect to the intended reward, thereby gaining reward information.
Despite their diversity, we show that many sources of information proposed thus far can be characterized as instantiating this formalism, offering a unifying lens through which to view prior work. At the same time, we emphasize that the formalism also provides a recipe for interpreting new types of information: new types of feedback that people can give the robot, or new sources of leaked information. We showcase this via two examples of new sources. First, we exemplify how an explicit feedback type, credit assignment, could be implemented via the formalism. Second, we identify a hidden source of information in the type of feedback itself. Naively, each piece of human feedback is evidence about the intended reward. The human corrects the robot, and the robot updates its belief based on the implicit choice the human made. The human speaks, and the robot does another update. But humans actually have access to different types of feedback all at once: they can demonstrate, correct, speak. We observe that each time they provide feedback to the robot, they are actually making a “meta” implicit choice of which feedback they provide. When a person decides to turn off the robot, they are implicitly deciding not to provide a correction, or communicate via language. Thus, our formalism suggests that the type of feedback itself leaks information about the intended reward, and we can instantiate it to leverage this leaked information.
We admit that our paper is an unusual one. Superficially, it may resemble a survey paper since we review many prior methods – but this is to showcase how the formalism can be instantiated in surprisingly many ways, and doing so can recover those prior methods. It may also look like a proposal for a new method, because the “meta” choice we discover has the potential to improve reward learning – but it is merely an example that showcases the utility of the formalism as a recipe for leveraging future sources of information. Our key contribution is the formalism itself: taking a well-understood fact, that Inverse Reinforcement Learning treats demonstrations as implicit choices among all possible trajectories, and clarifying how to generalize it to even unlikely sources of information, like turning robots off or looking at a single world state. We hope this will be a useful perspective for newcomers and experts alike.
2 A formalism for reward learning
2.1 Reward-rational implicit choice
In reward learning, the robot’s goal is to learn a reward function $r$ from human behavior.
(Implicit/explicit) set of options $\mathcal{C}$. We interpret human behavior as choosing an option $c^*$ from a set of options $\mathcal{C}$. Different behavior types will correspond to different explicit or implicit sets $\mathcal{C}$. For example, when a person is asked for a trajectory comparison, they are explicitly shown two trajectories and they pick one. However, when the person gives a demonstration, we think of the possible options $\mathcal{C}$ as implicitly being all possible trajectories the person could have demonstrated. The implicit/explicit distinction brings out a general tradeoff in reward learning. The cleverness of implicit choice sets is that even when we cannot enumerate and show all options to the human, e.g. in demonstrations, we can still rely on the human to optimize over the set. On the other hand, an implicit set is also risky – since it is not explicitly observed, we may get it wrong, potentially resulting in worse reward inference.
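Table 1: Choice set $\mathcal{C}$ and grounding $\psi$ for each feedback type (entries summarized in words from Sections 3 and 4).

Feedback | Choices $\mathcal{C}$ | Grounding $\psi$
Comparisons (Wirth et al., 2017) | the two trajectories shown to the human | identity
Demonstrations (Ng and Russell, 2000) | all possible trajectories (implicit) | identity
Corrections (Bajcsy et al., 2017) | all possible configuration differences (implicit) | the robot’s trajectory with the local change propagated along it
Improvement (Jain et al., 2015) | the improved trajectory and the robot’s original trajectory | identity
Language (Matuszek et al., 2012) | the instructions considered in-domain | uniform distribution over trajectories consistent with the utterance
Proxy Rewards (Hadfield-Menell et al., 2017b) | the proxy rewards the designer may have chosen | distribution over the robot’s trajectories in the training environment given the proxy reward
Reward and Punishment (Griffith et al., 2013) | reward or punish | reward: the robot’s trajectory; punish: the trajectory the human expected
Initial state (Shah et al., 2019) | the possible initial states | uniform distribution over human trajectories that end at that state when the robot is deployed
Meta-choice (Sec. 4.2) | the available feedback types | distribution over trajectories induced by the human’s choice within that feedback type
Credit assignment (Sec. 4.1) | all segments of a fixed length of the trajectory | identity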
The grounding function $\psi$. We link the human’s choice to the reward by thinking of the choice as (approximately) maximizing the reward. However, it is not immediately clear what it means for the human to maximize reward when choosing feedback, because the feedback may not be a (robot) trajectory, and the reward is only defined over trajectories. For example, in language feedback, the human describes what they want in words. What is the reward of the sentence, “Do not go over the water”?
To overcome this syntax mismatch, we map options in $\mathcal{C}$ to (distributions over) trajectories with a grounding function $\psi : \mathcal{C} \to f_{\Xi}$, where $f_{\Xi}$ is the set of distributions over trajectories $\Xi$ for the robot. Different types of feedback will correspond to different groundings. In some instances, such as kinesthetic demonstrations or trajectory comparisons, the mapping is simply the identity. In others, like corrections, language, or proxy rewards, the grounding is more complex (see Section 3).
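Human policy. Given the set of choices $\mathcal{C}$ and the grounding function $\psi$, the human’s approximately rational choice $c^* \in \mathcal{C}$ can now be modeled via a Boltzmann-rational policy, a policy in which the probability of choosing an option grows exponentially with its reward:

$$\mathbb{P}(c^* \mid r, \mathcal{C}) = \frac{\exp\left(\beta \cdot \mathbb{E}_{\xi \sim \psi(c^*)}[r(\xi)]\right)}{\sum_{c \in \mathcal{C}} \exp\left(\beta \cdot \mathbb{E}_{\xi \sim \psi(c)}[r(\xi)]\right)}, \tag{1}$$

where the parameter $\beta$ is a coefficient that models how rational the human is. Often, we simplify Equation 1 to the case where $\psi$ is a deterministic mapping from choices in $\mathcal{C}$ to trajectories in $\Xi$, instead of distributions over trajectories. Then, the probability of choosing $c^*$ can be written as:^{1}

$$\mathbb{P}(c^* \mid r, \mathcal{C}) = \frac{\exp\left(\beta \cdot r(\psi(c^*))\right)}{\sum_{c \in \mathcal{C}} \exp\left(\beta \cdot r(\psi(c))\right)}. \tag{2}$$

^{1} One can also consider a variant of Equation 1 in which choices are grounded to actions, rather than trajectories, and are evaluated via a Q-value function, rather than the reward function. That is, $\psi : \mathcal{C} \to \mathcal{A}$ and $\mathbb{P}(c^* \mid r, \mathcal{C}) \propto \exp\left(\beta \cdot Q^*(s, \psi(c^*))\right)$.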
Boltzmann-rational policies are widespread in psychology (Baker et al., 2009; Goodman et al., 2009; Goodman and Stuhlmüller, 2013), economics (Bradley and Terry, 1952; Luce, 1959; Plackett, 1975), and AI (Ziebart et al., 2008; Ramachandran and Amir, 2007; Finn et al., 2016; Bloem and Bambos, 2014; Dragan et al., 2013) as models of human choices, actions, or inferences. But why are they a reasonable model? While there are many possible motivations, we contribute a derivation (Appendix A) as the maximum-entropy distribution over choices for a satisficing agent, i.e. an agent that in expectation makes a choice with $\epsilon$-optimal reward. A higher value of $\epsilon$ results in a lower value of $\beta$, modeling less optimal humans.
Definition 2.1 (Reward-rational choice).
Finally, putting it all together, we call a type of feedback a reward-rational choice if, given a grounding function $\psi$, it can be modeled as a choice from an (explicit or implicit) set $\mathcal{C}$ that (approximately) maximizes reward, i.e., as in Equation 1.
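To make the human model concrete, the following is a minimal sketch of the Boltzmann-rational choice model for a deterministic grounding (Equation 2). The function name `choice_probabilities`, the two-feature trajectory summaries, and the linear reward are illustrative assumptions, not part of the formalism itself.

```python
import math

def choice_probabilities(options, grounding, reward, beta):
    """Boltzmann-rational choice with a deterministic grounding:
    P(c) is proportional to exp(beta * reward(grounding(c)))."""
    scores = [beta * reward(grounding(c)) for c in options]
    top = max(scores)  # subtract the max score for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return {c: w / total for c, w in zip(options, weights)}

# Hypothetical comparison between two trajectories, each summarized by
# illustrative features (rug cells entered, goal reached).
through_rug, around_rug = (3, 1), (0, 1)

def reward(xi):
    # Assumed linear reward: penalize rug cells, reward reaching the goal.
    return -1.0 * xi[0] + 2.0 * xi[1]

identity = lambda xi: xi  # comparisons ground to the trajectories themselves
probs = choice_probabilities([through_rug, around_rug], identity, reward, beta=1.0)
print(probs)
```

Raising `beta` in this sketch concentrates the probability on the higher-reward option, while `beta = 0` yields a uniform choice regardless of reward.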
2.2 Robot inference
Each feedback is an observation about the reward, which means the robot can run Bayesian inference to update its belief over the rewards. For a deterministic grounding,

$$\mathbb{P}(r \mid c^*) = \frac{\exp\left(\beta \cdot r(\psi(c^*))\right)}{\sum_{c \in \mathcal{C}} \exp\left(\beta \cdot r(\psi(c))\right)} \cdot \frac{\mathbb{P}(r)}{\mathbb{P}(c^*)}, \tag{3}$$

where $\mathbb{P}(r)$ is the prior over rewards and $\mathbb{P}(c^*)$ is the normalization over possible reward functions. The inference above is often intractable, and so reward learning work leverages approximations (Blei et al., 2017), or computes only the MLE for a parametrization of rewards (more recently as weights in a neural network on raw input (Christiano et al., 2017; Ibarz et al., 2018)).
Finally, when the human is highly rational ($\beta \to \infty$), the only choices in $\mathcal{C}$ with a non-negligible probability of being picked are the choices that exactly maximize reward. Thus, the human’s choice $c^*$ can be interpreted as constraints on the reward function (e.g., Ratliff et al., 2006):

$$\text{Find } r \text{ such that } r(\psi(c^*)) \geq r(\psi(c)) \quad \forall c \in \mathcal{C}. \tag{4}$$
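As an illustration of the Bayesian update described in this section, the sketch below performs exact inference over a small finite set of candidate reward functions. Restricting the hypothesis space to a finite list is an assumption made purely to keep the normalization tractable; the candidate rewards and trajectory features are hypothetical.

```python
import math

def choice_likelihood(chosen, options, grounding, reward, beta):
    """P(chosen | reward) under the Boltzmann model with deterministic grounding."""
    scores = [beta * reward(grounding(c)) for c in options]
    top = max(scores)
    normalizer = sum(math.exp(s - top) for s in scores)
    return math.exp(beta * reward(grounding(chosen)) - top) / normalizer

def posterior(candidates, prior, chosen, options, grounding, beta):
    """Bayes' rule over a finite list of candidate reward functions."""
    unnormalized = [
        choice_likelihood(chosen, options, grounding, r, beta) * p
        for r, p in zip(candidates, prior)
    ]
    evidence = sum(unnormalized)
    return [u / evidence for u in unnormalized]

# Two hypothetical reward hypotheses over (rug cells entered, goal reached).
avoids_rug = lambda xi: -1.0 * xi[0] + 2.0 * xi[1]
ignores_rug = lambda xi: 2.0 * xi[1]

options = [(3, 1), (0, 1)]  # the human is shown both trajectories
chosen = (0, 1)             # and picks the one that stays off the rug
belief = posterior([avoids_rug, ignores_rug], [0.5, 0.5],
                   chosen, options, lambda xi: xi, beta=1.0)
print(belief)
```

In this toy case the comparison shifts belief toward the hypothesis that penalizes the rug, since the rug-indifferent hypothesis assigns the observed choice only probability one half.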
3 Looking backward: unifying prior work
We now instantiate the formalism above with different behavior types from prior work, constructing their choice sets $\mathcal{C}$ and groundings $\psi$. Some are obvious – comparisons and demonstrations especially. Others – initial state, off, reward/punish – are more subtle, and it takes slightly modifying their original methods to achieve unification, speaking to the nontrivial nuances of identifying a common formalism.
Table 1 lists $\mathcal{C}$ and $\psi$ for each feedback, while Table 2 shows the deterministic constraint on rewards each behavior imposes, along with the probabilistic observation model – highlighting, despite the differences in feedback, the pattern of the (exponentiated) choice reward in the numerator, and the normalization over $\mathcal{C}$ in the denominator. Fig. 1 will serve as the illustration for these types, looking at a grid world navigation task around a rug. The space of rewards we use for illustration is three-dimensional weight vectors for avoiding the rug, not getting dirty, and reaching the goal.
Trajectory comparisons. In trajectory comparisons (Wirth et al., 2016; Sadigh et al., 2017; Christiano et al., 2017; Ibarz et al., 2018), the human is typically shown two trajectories $\xi_1$ and $\xi_2$, and then asked to select the one that they prefer. They are perhaps the most obvious exemplar of reward-rational choice: the set of choices $\mathcal{C} = \{\xi_1, \xi_2\}$ is explicit, and the grounding $\psi$ is simply the identity. As Fig. 1 shows, for linear reward functions, a comparison corresponds to a hyperplane that cuts the space of feasible reward functions in half. For all the reward functions left, the chosen trajectory has higher reward than the alternative.
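The half-space interpretation of a comparison can be sketched as follows for linear rewards. The feature vectors and candidate weight vectors are hypothetical; the check simply tests the sign of the reward difference between the chosen trajectory and the alternative.

```python
def consistent_with_comparison(weights, features_chosen, features_other):
    """True if a linear reward w . phi ranks the chosen trajectory at least
    as high as the alternative, i.e. w . (phi_chosen - phi_other) >= 0."""
    margin = sum(w * (a - b)
                 for w, a, b in zip(weights, features_chosen, features_other))
    return margin >= 0.0

# Hypothetical features: (rug cells entered, dirt picked up, goal reached).
phi_chosen = (0, 0, 1)
phi_other = (3, 1, 1)

candidate_weights = [(-1.0, -1.0, 1.0), (1.0, 0.0, 1.0), (0.0, -0.5, 2.0)]
feasible = [w for w in candidate_weights
            if consistent_with_comparison(w, phi_chosen, phi_other)]
print(feasible)
```

Only weight vectors on one side of the hyperplane defined by the feature difference survive, which is the deterministic counterpart of the probabilistic update.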
Demonstrations. In demonstrations, the human is asked to demonstrate the optimal behavior. Reward learning from demonstrations is often called inverse reinforcement learning (IRL) and is one of the most established types of feedback for reward learning (Ng and Russell, 2000; Abbeel and Ng, 2004; Ziebart et al., 2008). Since the work of Bayesian IRL (Ramachandran and Amir, 2007) and maximum entropy IRL (Ziebart et al., 2008), Boltzmann-rational policies have become a common way to model the human (Bloem and Bambos, 2014; Finn et al., 2016; Ho and Ermon, 2016). Unlike in comparisons, in demonstrations, the human is not explicitly given a set of choices. However, we assume that the human is implicitly optimizing over all possible trajectories (Fig. 1 (1st row, 2nd column) shows these choices in gray). Thus, demonstrations are a reward-rational choice in which the set of choices $\mathcal{C}$ is (implicitly) the set of trajectories $\Xi$. Again, the grounding $\psi$ is the identity. In Fig. 1, fewer rewards are consistent with a demonstration than with a comparison.
Corrections (Bajcsy et al., 2017) are the first type of feedback we consider that has both an implicit set of choices $\mathcal{C}$ and a nontrivial (not equal to identity) grounding. Corrections are most common in physical human-robot interaction (pHRI), in which a human physically corrects the motion of a robot. There are many ways to model corrections; we describe a simple, representative way. The robot executes a trajectory $\xi_R$, and the human intervenes by applying a correction $\Delta q$ that modifies the robot’s current configuration. Therefore, the set of choices $\mathcal{C}$ consists of all possible configuration differences $\Delta q$ the person could have used (Fig. 1, 1st row, 3rd column, shows possible $\Delta q$s in gray and the selected one in orange). The way we can ground these choices is by propagating the local change to the rest of the trajectory via some operator $A^{-1}$, e.g. the inverse of $A = K^\top K$, with $K$ the finite differencing matrix: $\psi(\Delta q) = \xi_R + A^{-1} \Delta q$ (orange trajectory in the figure).
Improvement. Prior work (Jain et al., 2015) has also modeled a variant of corrections in which the human provides an improved trajectory $\xi_{\text{improved}}$, which is treated as better than the robot’s original $\xi_R$. Although not implemented this way by Jain et al. (2015), we can interpret improvement as follows: the set of options $\mathcal{C}$ consists of only $\xi_R$ and $\xi_{\text{improved}}$ now, as opposed to all the trajectories obtainable by propagating local corrections; the grounding is identity, resulting in essentially a comparison between the robot’s trajectory and the user-provided one.
Off. In “off” feedback (Hadfield-Menell et al., 2017a), the robot executes a trajectory, and at any point, the human may switch the robot off. “Off” appears to be a very sparse signal, and it is not spelled out in prior work how one might learn a reward from it. Reward-rational choice suggests that we first uncover the implicit set of options the human was choosing from. In this case, the set of options consists of turning the robot off or not doing anything at all: $\mathcal{C} = \{\text{off}, -\}$. Next, we must ask how to evaluate the reward of the two options, i.e., what is the grounding? Well, not intervening means that the robot continues on its current trajectory, and intervening means that it stays at its current position. Thus, the choices $\mathcal{C}$ map to the trajectories $\{\xi_R^{0:t}\, \xi_R^{t} \ldots \xi_R^{t},\ \xi_R\}$.
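The grounding of “off” described above can be sketched as follows, under the assumptions that a trajectory is a list of state labels and that its reward is the sum of per-state rewards. The state labels, reward values, and function names are hypothetical.

```python
import math

def ground_off_options(planned, t):
    """Map the two implicit options at time t to robot trajectories:
    'off' keeps the executed prefix and then stays at the current state,
    'continue' is the trajectory the robot was going to execute."""
    stopped = planned[:t + 1] + [planned[t]] * (len(planned) - t - 1)
    return {"off": stopped, "continue": planned}

def probability_of_off(planned, t, state_reward, beta):
    """Boltzmann probability that the human switches the robot off at time t."""
    grounded = ground_off_options(planned, t)
    returns = {c: sum(state_reward(s) for s in xi) for c, xi in grounded.items()}
    top = max(returns.values())
    weights = {c: math.exp(beta * (g - top)) for c, g in returns.items()}
    return weights["off"] / sum(weights.values())

# Hypothetical plan that is about to step into lava on its way to the goal.
planned = ["floor", "floor", "lava", "goal"]
state_reward = {"floor": 0.0, "lava": -10.0, "goal": 1.0}.get
p_off = probability_of_off(planned, t=1, state_reward=state_reward, beta=1.0)
print(p_off)
```

Under this assumed reward the human almost surely switches the robot off; conversely, observing “off” is strong evidence for rewards under which the remainder of the plan is worse than standing still.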
Language. Humans might use rich language to instruct the robot, like “Avoid the rug.” Let $G(\lambda) \subseteq \Xi$ be the trajectories that are consistent with an utterance $\lambda \in \Lambda$ (e.g. all trajectories that do not enter the rug). Language is a reward-rational choice in which the set of options $\mathcal{C}$ is the set of instructions considered in-domain $\Lambda$ and the grounding $\psi$ maps an utterance $\lambda$ to the uniform distribution over consistent trajectories $\mathrm{Unif}(G(\lambda))$. In language feedback, a key difficulty is learning which robot trajectories are consistent with a natural language instruction, the language grounding problem (which is where we borrow the term “grounding” from) (Matuszek et al., 2012; Tellex et al., 2011; Fu et al., 2019). Fig. 1 shows the grounding for avoiding the rug in orange – all trajectories from start to goal that do not enter rug cells.
Proxy rewards (Hadfield-Menell et al., 2017b) are expert-specified rewards that do not necessarily lead to the desired behavior in all situations, but can be trusted on the training environments. Therefore, rather than taking a specified reward at face value, we can interpret it as evidence about the true reward. Proxy reward feedback is a reward-rational choice in which the set of choices $\mathcal{C}$ is the set of proxy rewards the designer may have chosen, $\tilde{\mathcal{R}}$. The reward designer is assumed to be approximately optimal, i.e. they are more likely to pick a proxy reward $\tilde{r} \in \tilde{\mathcal{R}}$ if it leads to better trajectories. Thus, the grounding $\psi$ maps a proxy reward $\tilde{r}$ to the distribution over trajectories that the robot takes in the training environment given the proxy reward (Hadfield-Menell et al., 2017b; Mindermann et al., 2018; Ratner et al., 2018). Fig. 1 shows the grounding for a proxy reward for reaching the goal, avoiding the rug, and not getting the rug dirty – many feasible rewards would produce similar behavior as the proxy.
Reward and punishment (Griffith et al., 2013; Loftin et al., 2014). The human can either reward ($+1$) or punish ($-1$) the robot for its trajectory $\xi_R$; the set of options is $\mathcal{C} = \{+1, -1\}$. A naive implementation would interpret reward and punishment literally, i.e. as a scalar reward signal for a reinforcement learning agent; however, empirical studies show that humans reward and punish based on how well the robot performs relative to their expectations (MacGlashan et al., 2017). Thus, we can use our formalism to interpret that: reward ($+1$) grounds to the robot’s trajectory $\xi_R$, while punish ($-1$) grounds to the trajectory the human expected, $\xi_{\text{expected}}$ (not necessarily observed).
Initial state. Shah et al. (2019) observe that when the robot is deployed in an environment that humans have acted in, the current state of the environment is already optimized for what humans want, and thus contains information about the reward. For example, suppose the environment has a goal state which the robot can reach through either a paved path or a carpet. If the carpet is pristine and untrodden, then humans must have intentionally avoided walking on it in the past (even though the robot hasn’t observed this past behavior), and the robot can reasonably infer that it too should not go on the carpet. Thus, the initial state the robot observes when deployed is also a reward-rational implicit choice. The set of choices $\mathcal{C}$ is the set of possible initial states $\mathcal{S}$. The grounding function $\psi$ maps a state $s \in \mathcal{S}$ to the uniform distribution over any human trajectory $\xi_H$ that starts from a specified time before the robot was deployed ($t = -T$) and ends at state $s$ at the time the robot was deployed ($t = 0$), i.e. $\psi(s) = \mathrm{Unif}(\{\xi_H^{-T:0} \mid \xi_H^{0} = s\})$.^{2} Fig. 1 shows the result of this inference, recovering as much information as with the correction or language.

^{2} Note that this is only equivalent to the algorithm from Shah et al. (2019) for a linear model of noisy rationality, but not actually equivalent for Boltzmann.
Summary. From demonstrations to reward/punishment to the initial state of the world, the robot can extract information from humans by modeling them as making approximate reward-rational choices. Often, the choices are implicit, like in turning the robot off or providing language instructions. Sometimes, the choices are not made in order to purposefully communicate about the reward, and rather end up leaking information about it, like in the initial state, or even in corrections or turning the robot off. Regardless, this unifying lens enables us to better understand, as in Fig. 1, how all these sources of information relate and compare.
4 Looking forward: new behavior types
The types of feedback or behavior we have discussed so far are by no means the only types possible. New ones will inevitably be invented. But when designing a new type of feedback, it is often difficult to understand what the relationship is between the reward $r$ and the feedback $c^*$. Reward-rational choice suggests a recipe for uncovering this link – define what the implicit set of options the human is choosing from is, and how those options ground to trajectories. Then, Equation 1 provides a formal model for the human feedback.
To illustrate the utility of this, we identify two new types in this section. We start by hypothesizing another type of direct feedback that a robot could ask a human for, which we call credit assignment. We then move on to an indirect source of information that we identified as a direct consequence of applying the formalism to combine feedback types.
4.1 A simple starting example
We first provide a quick example of a new (hypothetical) type of feedback, which we call credit assignment. Given a trajectory $\xi$ of length $T$, the human is asked to pick a segment of length $k < T$ that has maximal reward. We doubt the set of choices in an implementation of credit assignment would be explicit; the implicit set of choices $\mathcal{C}$ is then the set of all segments of length $k$. The grounding function $\psi$ is simply the identity. With this choice of $\mathcal{C}$ and $\psi$ in hand, the human can now be modeled according to Equation 1, as we show in the last rows of Tables 1 and 2.
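A minimal sketch of credit assignment under this model follows, assuming (for illustration only) that the reward of a segment is the sum of hypothetical per-step rewards. The choice set is every contiguous segment of the given length, and the grounding is the identity.

```python
import math

def segment_choice_probabilities(step_rewards, k, beta):
    """Boltzmann-rational choice over all contiguous segments of length k.
    Each segment is identified by its (start, end) indices, end exclusive."""
    starts = range(len(step_rewards) - k + 1)
    scores = [beta * sum(step_rewards[i:i + k]) for i in starts]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return {(i, i + k): w / total for i, w in zip(starts, weights)}

# Hypothetical per-step rewards of a length-4 trajectory; segments of length 2.
step_rewards = [0.0, 1.0, 2.0, -1.0]
segment_probs = segment_choice_probabilities(step_rewards, k=2, beta=1.0)
most_likely = max(segment_probs, key=segment_probs.get)
print(most_likely, segment_probs)
```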
While we do not argue that our formalism will apply to all types of feedback, we believe that it applies to many, even to types that initially seem to have a more obvious, literal interpretation (e.g. reward and punishment, Section 3).
4.2 Meta-choice: a new source of information
So far we have talked about learning from individual types of behaviors. But we do not want our robots stuck with a single type: we want them to 1) read into all the leaked information, and 2) learn from all the purposeful feedback. Luckily, the reward-rational choice framework also guides us in formalizing combinations of behavior.
In what follows, we start with a naive way of combining feedback types: treat each individual feedback received as a reward-rational choice, and update the robot’s belief. We then observe that the moment we open it up to multiple types of feedback, the person is not stuck with a single type and is actually choosing which type to use. We propose that this itself is a reward-rational implicit choice, and therefore leaks information about the reward.
Naive: conditional independence given reward. Suppose we observe multiple types of behavior from a human. For example, we might have received a demonstration from a human during training, and then a correction during deployment, which was followed by the human prematurely switching the robot off. The observational model in (2) for a single type of behavior also provides a natural way to model combinations of behavior. If each observation is conditionally independent given the reward, then according to (2), the probability of observing a vector $\mathbf{c} = (c_1, \ldots, c_n)$ of $n$ behavioral signals (of possibly different types) is equal to

$$\mathbb{P}(\mathbf{c} \mid r) = \prod_{i=1}^{n} \frac{\exp\left(\beta_i \cdot r(\psi_i(c_i))\right)}{\sum_{c \in \mathcal{C}_i} \exp\left(\beta_i \cdot r(\psi_i(c))\right)}. \tag{5}$$
Given this likelihood function for the human’s behavior, the robot can infer the reward function using the approaches and approximations described in Sec. 2.2. Recent work has often focused on combining specific types of behavior, such as trajectory comparisons and demonstrations (Ibarz et al., 2018; Palan et al., 2019). We note that the formulation in Equation 5 is general and applies to any combination. In Appendix B, we describe a case study on a novel combination of feedback types: proxy rewards, a physical improvement, and comparisons in which we use a constraint-based approximation (see Equation 4) to Equation 5.
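The conditionally independent combination can be sketched as a sum of per-observation log-probabilities, each with its own choice set, grounding, and rationality coefficient. The dictionary layout of an observation, the state labels, and the reward values below are illustrative assumptions rather than a prescribed interface.

```python
import math

def log_choice_probability(obs, reward):
    """Log-probability of one observed choice under the Boltzmann model."""
    scores = [obs["beta"] * reward(obs["grounding"](c)) for c in obs["options"]]
    top = max(scores)
    log_normalizer = top + math.log(sum(math.exp(s - top) for s in scores))
    return obs["beta"] * reward(obs["grounding"](obs["choice"])) - log_normalizer

def joint_log_likelihood(observations, reward):
    """Sum of log-probabilities: observations are independent given the reward."""
    return sum(log_choice_probability(obs, reward) for obs in observations)

# Hypothetical trajectories as tuples of state labels.
planned = ("floor", "floor", "lava", "goal")
stopped = ("floor", "floor", "floor", "floor")
safe = ("floor", "floor", "floor", "goal")
state_reward = {"floor": 0.0, "lava": -10.0, "goal": 1.0}
reward = lambda xi: sum(state_reward[s] for s in xi)

observations = [
    # A comparison: explicit choice set, identity grounding.
    {"options": [planned, safe], "choice": safe,
     "grounding": lambda xi: xi, "beta": 1.0},
    # An "off" event: implicit choice set, grounded to trajectories.
    {"options": ["off", "continue"], "choice": "off",
     "grounding": {"off": stopped, "continue": planned}.get, "beta": 2.0},
]
log_likelihood = joint_log_likelihood(observations, reward)
print(log_likelihood)
```

Working in log space keeps the product numerically stable when many pieces of feedback are combined.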
We note that the RRC probabilistic model also immediately suggests a way to actively select the type of feedback to ask for: greedily maximize information gain. In Appendix C, we describe experiments with active selection of feedback types. In the environments we tested, we found that demonstrations are optimal early on, when little is known about the reward, while comparisons became optimal later, as a way to fine-tune the reward. The finding provides validation for the approach pursued by Palan et al. (2019) and Ibarz et al. (2018). Both papers manually define the mixing procedure we found to be optimal: initially train the reward model using human demonstrations, and then fine-tune with comparisons.
Meta-choice. The assumption of conditional independence that the formulation in (5) uses is natural and makes sense in many settings. For example, during training time, we might control what feedback type we ask the human for. We might start by asking the human for demonstrations, but then move on to other types of feedback, like corrections or comparisons, to get more fine-grained information about the reward function. Since the human is only ever considering one type of feedback at a time, the conditional independence assumption makes sense.
But the assumption breaks when the human has access to multiple types of feedback at once because the types of feedback the robot can interpret influence what the human does in the first place.^{3} If the human intervenes and turns the robot off, that means one thing if this were the only feedback type available, and a whole different thing if, say, corrections were available too. In the latter case, we have more information: we know that the user chose to turn the robot off rather than provide a correction.

^{3} We note that this adaptation by the human only applies to types of behavior that the human uses to purposefully communicate with the robot, as opposed to sources of information like initial state.
Thus, our key insight is that the type of feedback itself leaks information about the reward, and the RRC framework gives us a recipe for formalizing this new source: we need to uncover the set of options the human is choosing from. The human has two stages of choice: the first is the choice between feedback types, e.g. corrections, language, turn-off, etc., and the second is the choice within the chosen feedback type, e.g. the specific correction that the human gave. Our formalism can leverage both sources of information by defining a hierarchy of reward-rational choice.
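Suppose the user has access to $n$ types of feedback with associated choice sets $\mathcal{C}_1, \ldots, \mathcal{C}_n$, groundings $\psi_1, \ldots, \psi_n$, and Boltzmann rationalities $\beta_1, \ldots, \beta_n$. For simplicity, we assume deterministic groundings. The set of choice sets for the first-stage choice is $\mathcal{C}_0 = \{\mathcal{C}_1, \ldots, \mathcal{C}_n\}$. The grounding $\psi_0$ for the first-stage choice maps a feedback type $\mathcal{C}_i$ to the distribution of trajectories defined by the human’s behavior and grounding in the second stage:

$$\psi_0(\mathcal{C}_i)(\xi) = \sum_{c \in \mathcal{C}_i \,:\, \psi_i(c) = \xi} \mathbb{P}(c \mid r, \mathcal{C}_i), \tag{6}$$

where, as usual, $\mathbb{P}(c \mid r, \mathcal{C}_i)$ is given by Equation 1. Instantiating Equation 1 to model the first-stage decision as well results in the following model for the human picking feedback type $\mathcal{C}_i$:

$$\mathbb{P}(\mathcal{C}_i \mid r, \mathcal{C}_0) = \frac{\exp\left(\beta_0 \cdot \mathbb{E}_{\xi \sim \psi_0(\mathcal{C}_i)}[r(\xi)]\right)}{\sum_{j=1}^{n} \exp\left(\beta_0 \cdot \mathbb{E}_{\xi \sim \psi_0(\mathcal{C}_j)}[r(\xi)]\right)}. \tag{7}$$

Finally, the probability that the human gives feedback $c_i \in \mathcal{C}_i$ is

$$\mathbb{P}(c_i \mid r, \mathcal{C}_0) = \mathbb{P}(\mathcal{C}_i \mid r, \mathcal{C}_0) \cdot \mathbb{P}(c_i \mid r, \mathcal{C}_i).$$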
The first-stage decision can be interpreted as the human meta-reasoning over the best type of feedback. The benefit of modeling the hierarchy is that we can cleanly separate and consider noise at both the level of meta-reasoning ($\beta_0$) and the level of execution of feedback ($\beta_i$). Noise at the meta-reasoning level models the human’s imperfection in picking the optimal type of feedback. Noise at the execution level might model the fact that the human has difficulty in physically correcting a heavy and unintuitive robot.^{4}

^{4} Although we modeled rationality with respect to the reward that the robot should optimize, we can easily extend our formalism to capture that the person might trade off between that and their own effort – this is especially interesting at this meta-choice level, where one type of feedback might be much more difficult and thus people might want to avoid it unless it is particularly informative.
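The two-stage hierarchy can be sketched as follows: the second stage is a Boltzmann choice within each feedback type, and the first stage is a Boltzmann choice over feedback types, scored by the expected reward of the trajectory each type would produce. The feedback types, trajectory labels, and reward values are hypothetical, and the function names are illustrative.

```python
import math

def softmax(scores):
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def meta_choice_model(feedback_types, reward, beta_meta):
    """feedback_types maps a type name to (options, grounding, beta).
    Returns P(type) and, for each type, P(option | type)."""
    within, expected = {}, {}
    for name, (options, grounding, beta) in feedback_types.items():
        rewards = [reward(grounding(c)) for c in options]
        probs = softmax([beta * r for r in rewards])
        within[name] = dict(zip(options, probs))
        # Expected reward of the trajectory this feedback type would produce.
        expected[name] = sum(p * r for p, r in zip(probs, rewards))
    names = list(feedback_types)
    type_probs = dict(zip(names, softmax([beta_meta * expected[n] for n in names])))
    return type_probs, within

# Hypothetical setting in which every available correction passes through lava.
trajectory_reward = {"stay": 0.0, "planned": -10.0,
                     "left_via_lava": -5.0, "right_via_lava": -4.0}
feedback_types = {
    "off": (["off", "continue"],
            {"off": "stay", "continue": "planned"}.get, 1.0),
    "correction": (["push_left", "push_right"],
                   {"push_left": "left_via_lava",
                    "push_right": "right_via_lava"}.get, 1.0),
}
type_probs, within = meta_choice_model(feedback_types, trajectory_reward.get,
                                       beta_meta=1.0)
p_off = type_probs["off"] * within["off"]["off"]  # joint probability of "off"
print(type_probs, p_off)
```

A robot that reasons only within the “off” type would use `within["off"]` alone; the extra factor `type_probs["off"]` is what carries the information that corrections were available and not chosen.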
We showcase the potential importance of accounting for the meta-choice in an experiment in a gridworld setting, in which an agent navigates to a goal state while avoiding lava (Figure 2, left). The reward function is a linear combination of 2 features that encode the goal and lava. The human has access to two channels of feedback: “off” and corrections. We simulate the human feedback as choosing between feedback types according to the hierarchical model above. We manipulate three factors: 1) whether the robot is naive, i.e. only accounts for the information within the feedback type, or meta-reasons, i.e. accounts for the other feedback types that were available but not chosen; 2) the meta-rationality parameter $\beta_0$ modeling human imperfection in selecting the optimal type of feedback; and 3) the location of the lava, so that the rational meta-choice changes from off to corrections. We measure regret over holdout environments.
Figure 2 (left) depicts the possible grounded trajectories for corrections and for off. For the top, off is optimal because all corrections go through lava. For the bottom, the rational meta-choice is to correct. In both cases, we find that meta-reasoning gains the learner more information, as seen in the belief (center). For the top, where the person turns it off, the robot can be more confident that lava is bad. For the bottom, the fact that the person had the off option and did not use it informs the robot about the importance of reaching the goal. This translates into lower regret (right), especially as $\beta_0$ increases and there is more signal in the feedback type choice. Since this advantage depends on knowledge of the human’s meta-rationality ($\beta_0$), in Appendix D, we also describe experiments testing the effect of misspecifying $\beta_0$.
5 Discussion
Summary. We have proposed a framework, reward-rational (implicit) choice, that provides a unified lens for reward learning from diverse types of human behavior. Not only is the framework conceptually useful for understanding types of behavior that have already been proposed in existing work, it also provides a recipe for interpreting new sources of information. We used the framework to properly interpret learning from combinations of behavior. In doing so, we uncovered an important new source of information: the human’s choice of the type of feedback to provide – a reward-rational choice in itself.
Limitations and Future Work. In this paper, we have focused on outlining the core aspects of the formalism. But there is much future work to be done in empirically evaluating different types of behavior with human subjects. How well can people provide each type of feedback? What types do people prefer? How well can people meta-reason over what type of feedback to provide?
Further, by pointing out that prior work often interprets human behavior as an implicit choice, we surface a major concern with reward learning: what if we get the implicit set of choices wrong? More broadly, our formalism naturally draws attention to the important question of misspecification: how does getting the set of options, grounding function, or rationality coefficient wrong affect the robot’s inference? Our experiments in Appendix D on the impact of misspecifying the meta-reasoning rationality coefficient are a start to studying misspecification, but there is much future work to do both in understanding misspecification and coming up with solutions to avoid it or be robust to it.
Overall, we see our formalism as providing conceptual clarity for existing and future methods for learning from human behavior, and a fruitful base for future work on multi-behavior-type reward learning.
References

 Abbeel and Ng [2004] Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, page 1. ACM, 2004.
 Bajcsy et al. [2017] Andrea Bajcsy, Dylan P Losey, Marcia K O’Malley, and Anca D Dragan. Learning robot objectives from physical human interaction. Conference on Robot Learning (CoRL), 2017.
 Baker et al. [2009] Chris L Baker, Rebecca Saxe, and Joshua B Tenenbaum. Action understanding as inverse planning. Cognition, 113(3):329–349, 2009.
 Blei et al. [2017] David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
 Bloem and Bambos [2014] Michael Bloem and Nicholas Bambos. Infinite time horizon maximum causal entropy inverse reinforcement learning. In 53rd IEEE Conference on Decision and Control, pages 4911–4916. IEEE, 2014.
 Bradley and Terry [1952] Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.
 Christiano et al. [2017] Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, pages 4299–4307, 2017.
 Dragan et al. [2013] Anca D Dragan, Kenton CT Lee, and Siddhartha S Srinivasa. Legibility and predictability of robot motion. In Proceedings of the 8th ACM/IEEE international conference on Human-robot interaction, pages 301–308. IEEE Press, 2013.
 Finn et al. [2016] Chelsea Finn, Sergey Levine, and Pieter Abbeel. Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning, pages 49–58, 2016.
 Fu et al. [2019] Justin Fu, Anoop Korattikara, Sergey Levine, and Sergio Guadarrama. From language to goals: Inverse reinforcement learning for vision-based instruction following. arXiv preprint arXiv:1902.07742, 2019.
 Goodman and Stuhlmüller [2013] Noah D Goodman and Andreas Stuhlmüller. Knowledge and implicature: Modeling language understanding as social cognition. Topics in cognitive science, 5(1):173–184, 2013.
 Goodman et al. [2009] Noah D Goodman, Chris L Baker, and Joshua B Tenenbaum. Cause and intent: Social reasoning in causal learning. In Proceedings of the 31st annual conference of the cognitive science society, pages 2759–2764. Citeseer, 2009.
 Griffith et al. [2013] Shane Griffith, Kaushik Subramanian, Jonathan Scholz, Charles L Isbell, and Andrea L Thomaz. Policy shaping: Integrating human feedback with reinforcement learning. In Advances in neural information processing systems, pages 2625–2633, 2013.

 Hadfield-Menell et al. [2017a] Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, and Stuart Russell. The off-switch game. In Workshops at the Thirty-First AAAI Conference on Artificial Intelligence, 2017a.
 Hadfield-Menell et al. [2017b] Dylan Hadfield-Menell, Smitha Milli, Pieter Abbeel, Stuart J Russell, and Anca Dragan. Inverse reward design. In Advances in neural information processing systems, pages 6765–6774, 2017b.
 Ho and Ermon [2016] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pages 4565–4573, 2016.
 Ibarz et al. [2018] Borja Ibarz, Jan Leike, Tobias Pohlen, Geoffrey Irving, Shane Legg, and Dario Amodei. Reward learning from human preferences and demonstrations in Atari. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
 Jain et al. [2015] Ashesh Jain, Shikhar Sharma, Thorsten Joachims, and Ashutosh Saxena. Learning preferences for manipulation tasks from online coactive feedback. The International Journal of Robotics Research, 34(10):1296–1313, 2015.
 Jaynes [1957] Edwin T Jaynes. Information theory and statistical mechanics. Physical review, 106(4):620, 1957.
 Loftin et al. [2014] Robert Tyler Loftin, James MacGlashan, Bei Peng, Matthew E Taylor, Michael L Littman, Jeff Huang, and David L Roberts. A strategy-aware technique for learning behaviors from discrete human feedback. In AAAI Conference on Artificial Intelligence, 2014.
 Luce [1959] R Duncan Luce. Individual choice behavior: A theoretical analysis. Wiley, 1959.
 MacGlashan et al. [2015] James MacGlashan, Monica Babes-Vroman, Marie desJardins, Michael L Littman, Smaranda Muresan, Shawn Squire, Stefanie Tellex, Dilip Arumugam, and Lei Yang. Grounding English commands to reward functions. In Robotics: Science and Systems, 2015.
 MacGlashan et al. [2017] James MacGlashan, Mark K Ho, Robert Loftin, Bei Peng, Guan Wang, David L Roberts, Matthew E Taylor, and Michael L Littman. Interactive learning from policy-dependent human feedback. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 2285–2294. JMLR.org, 2017.
 Matuszek et al. [2012] Cynthia Matuszek, Nicholas FitzGerald, Luke Zettlemoyer, Liefeng Bo, and Dieter Fox. A joint model of language and perception for grounded attribute learning. In International Conference on Machine Learning (ICML), 2012.
 Mindermann et al. [2018] Sören Mindermann, Rohin Shah, Adam Gleave, and Dylan Hadfield-Menell. Active inverse reward design. In Proceedings of the 1st Workshop on Goal Specifications for Reinforcement Learning, 2018.
 Ng and Russell [2000] Andrew Y Ng and Stuart J Russell. Algorithms for inverse reinforcement learning. In ICML, volume 1, page 2, 2000.
 Ortega and Braun [2013] Pedro A Ortega and Daniel A Braun. Thermodynamics as a theory of decision-making with information-processing costs. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 469(2153):20120683, 2013.
 Palan et al. [2019] Malayandi Palan, Nicholas C Landolfi, Gleb Shevchuk, and Dorsa Sadigh. Learning reward functions by integrating human demonstrations and preferences. In RSS, 2019.
 Plackett [1975] Robin L Plackett. The analysis of permutations. Journal of the Royal Statistical Society: Series C (Applied Statistics), 24(2):193–202, 1975.
 Ramachandran and Amir [2007] Deepak Ramachandran and Eyal Amir. Bayesian inverse reinforcement learning. In IJCAI, volume 7, pages 2586–2591, 2007.
 Ratliff et al. [2006] Nathan D Ratliff, J Andrew Bagnell, and Martin A Zinkevich. Maximum margin planning. In Proceedings of the 23rd international conference on Machine learning, pages 729–736. ACM, 2006.
 Ratner et al. [2018] Ellis Ratner, Dylan Hadfield-Menell, and Anca D Dragan. Simplifying reward design through divide-and-conquer. arXiv preprint arXiv:1806.02501, 2018.
 Sadigh et al. [2017] Dorsa Sadigh, Anca D Dragan, Shankar Sastry, and Sanjit A Seshia. Active preference-based learning of reward functions. In Robotics: Science and Systems (RSS), 2017.
 Schulman et al. [2014] John Schulman, Yan Duan, Jonathan Ho, Alex Lee, Ibrahim Awwal, Henry Bradlow, Jia Pan, Sachin Patil, Ken Goldberg, and Pieter Abbeel. Motion planning with sequential convex optimization and convex collision checking. International Journal of Robotics Research (IJRR), 2014.
 Shah et al. [2019] Rohin Shah, Dmitrii Krasheninnikov, Jordan Alexander, Pieter Abbeel, and Anca Dragan. Preferences implicit in the state of the world. In ICLR, 2019.
 Simon [1956] Herbert A Simon. Rational choice and the structure of the environment. Psychological review, 63(2):129, 1956.
 Tellex et al. [2011] Stefanie Tellex, Thomas Kollar, Steven Dickerson, Matthew R Walter, Ashis Gopal Banerjee, Seth Teller, and Nicholas Roy. Understanding natural language commands for robotic navigation and mobile manipulation. In Twenty-Fifth AAAI Conference on Artificial Intelligence, 2011.
 Wirth et al. [2016] Christian Wirth, Johannes Fürnkranz, and Gerhard Neumann. Model-free preference-based reinforcement learning. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
 Wirth et al. [2017] Christian Wirth, Riad Akrour, Gerhard Neumann, and Johannes Fürnkranz. A survey of preference-based reinforcement learning methods. The Journal of Machine Learning Research, 18(1):4945–4990, 2017.
 Ziebart et al. [2008] Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438. Chicago, IL, USA, 2008.
Appendix A Bounded rationality, maximum entropy, and Boltzmann-rational policies
A perspective on reward learning that makes use, at its core, of the Boltzmann model from Equation 1 would not be complete without a formal justification for it within our context. In this section, we derive it as the maximum-entropy distribution for the choices made by a bounded, satisficing human. Our explanation is complementary to that of Ortega and Braun [2013], who derive an axiomatic, thermodynamic framework for modeling bounded-rational decision making. Their framework leads to much the same interpretation of the Boltzmann-rational distribution, but is significantly more complex than needed for our purposes.
A perfectly rational human choosing from the choice set $\mathcal{C}$ would always pick the choice with optimal reward, $\max_{c \in \mathcal{C}} r(\psi(c))$. However, since humans are bounded, we do not expect them to perform optimally. Herbert Simon proposed the influential idea that humans are boundedly rational and merely satisfice [Simon, 1956], rather than maximize, i.e., they pick an option above some satisfactory threshold, rather than picking the best possible option.
We can abstractly model a satisficing human by modeling their expected reward as equal to a satisficing threshold, $\max_{c \in \mathcal{C}} r(\psi(c)) - \epsilon$, where $\epsilon \geq 0$ is the amount of expected error. The maximum possible error, $\epsilon_{\max} = \max_{c \in \mathcal{C}} r(\psi(c)) - \min_{c \in \mathcal{C}} r(\psi(c))$, corresponds to anti-rationality, i.e., always picking the worst option.
Given the constraint that the human’s expected reward is satisfactory, how should we pick a distribution to model the human’s choices? The principle of maximum entropy [Jaynes, 1957] gives us a guide. If we want to encode no extra information in the distribution, then we ought to pick the distribution that maximizes entropy subject to the constraint on the satisficing threshold.
Definition A.1 (Satisficing MaxEnt problem).
Let $p$ be a distribution over the choice set $\mathcal{C}$ and let $f$ be a density for $p$ with respect to a base measure $\mu$. The Shannon entropy of $p$ is defined as $H(p) = -\int_{\mathcal{C}} f(c) \log f(c) \, d\mu(c)$. The satisficing maximum entropy problem is to find a distribution that maximizes entropy subject to the satisficing constraint (8):
$$\max_{p} \; H(p) \quad \text{subject to} \quad \mathbb{E}_{c \sim p}\left[r(\psi(c))\right] = \max_{c \in \mathcal{C}} r(\psi(c)) - \epsilon \qquad (8)$$
It is well known that the maximum-entropy distribution subject to linear constraints (such as a constraint on the mean, as in (8)) is the unique exponential-family distribution that satisfies the constraints. Thus, for our special case, the maximum-entropy distribution is the Boltzmann distribution with rationality coefficient $\beta$ satisfying the satisficing constraint.
Theorem A.1 ([Jaynes, 1957]).
The solution to the satisficing maximum entropy problem is the Boltzmann distribution $p(c) \propto \exp(\beta \cdot r(\psi(c)))$, where $\beta$ is the unique value satisfying the satisficing constraint (8).
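As a hedged sketch of why the maximizer takes this form, consider the special case of a finite choice set with a counting base measure (our simplifying assumption). The symbols $\mathcal{L}$ and $\lambda$ below are introduced only for this illustration: form the Lagrangian of the constrained entropy maximization and set its derivative with respect to each $p(c)$ to zero.

```latex
\begin{align*}
\mathcal{L}(p, \beta, \lambda)
  &= -\sum_{c \in \mathcal{C}} p(c) \log p(c)
     + \beta \Big( \sum_{c \in \mathcal{C}} p(c)\, r(\psi(c))
       - \big( \max_{c' \in \mathcal{C}} r(\psi(c')) - \epsilon \big) \Big)
     + \lambda \Big( \sum_{c \in \mathcal{C}} p(c) - 1 \Big) \\
\frac{\partial \mathcal{L}}{\partial p(c)}
  &= -\log p(c) - 1 + \beta\, r(\psi(c)) + \lambda = 0 \\
\Rightarrow \quad p(c)
  &= \exp(\lambda - 1)\, \exp\big(\beta\, r(\psi(c))\big)
   = \frac{\exp\big(\beta\, r(\psi(c))\big)}{\sum_{c' \in \mathcal{C}} \exp\big(\beta\, r(\psi(c'))\big)}
\end{align*}
```

The multiplier $\lambda$ is fixed by normalization, and the multiplier $\beta$ on the satisficing constraint plays the role of the rationality coefficient.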
Since the expected reward is monotonically increasing in the rationality parameter $\beta$, the satisficing error $\epsilon$ and the rationality coefficient $\beta$ have a one-to-one relationship, as summarized in the following corollary.
Corollary A.1.
The solution to the satisficing maximum entropy problem is a Boltzmann-rational policy where the rationality coefficient $\beta$ is monotonically decreasing in the satisficing error $\epsilon$. In particular, we have the following:
Human type | Error $\epsilon$ | Rationality $\beta$
Perfectly rational | $0$ | $\beta \to \infty$
Random | $\max_{c} r(\psi(c)) - \mathrm{mean}_{c}\, r(\psi(c))$ | $\beta = 0$
Anti-rational | $\max_{c} r(\psi(c)) - \min_{c} r(\psi(c))$ | $\beta \to -\infty$
Thus, we see that the idea of bounded rationality, as in satisficing, and Boltzmann-rationality are in fact equivalent. By following the principle of maximum entropy, Boltzmann-rationality provides a way to model a satisficing human, without implicitly adding in any other assumptions about the human’s choice.
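The one-to-one relationship between satisficing error and rationality can be checked numerically. The following is a minimal sketch, assuming a finite choice set with made-up rewards; the function names (`boltzmann`, `expected_reward`, `beta_for_error`) and the bisection bounds are our own illustrative choices, not part of the original formalism. It recovers the rationality coefficient whose expected reward meets a given satisficing threshold.

```python
import math

def boltzmann(rewards, beta):
    """Boltzmann distribution p(c) proportional to exp(beta * r(c))."""
    m = max(beta * r for r in rewards)  # subtract the max for numerical stability
    weights = [math.exp(beta * r - m) for r in rewards]
    total = sum(weights)
    return [w / total for w in weights]

def expected_reward(rewards, beta):
    """Expected reward of a Boltzmann-rational chooser with coefficient beta."""
    return sum(p * r for p, r in zip(boltzmann(rewards, beta), rewards))

def beta_for_error(rewards, eps, lo=-50.0, hi=50.0, iters=200):
    """Bisect for the beta whose expected reward equals max(rewards) - eps.

    Relies on the expected reward being monotonically increasing in beta.
    The search interval [lo, hi] is an assumption of this sketch.
    """
    target = max(rewards) - eps
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if expected_reward(rewards, mid) < target:
            lo = mid  # expected reward too low: the human must be more rational
        else:
            hi = mid
    return 0.5 * (lo + hi)

rewards = [0.0, 1.0, 3.0]  # hypothetical rewards of the grounded choices
eps_random = max(rewards) - sum(rewards) / len(rewards)
beta_random = beta_for_error(rewards, eps_random)  # close to 0 (random human)
beta_good = beta_for_error(rewards, 0.5)           # small error: positive beta
beta_bad = beta_for_error(rewards, 2.5)            # large error: negative beta
```

Smaller errors map to larger coefficients, matching the monotonic relationship in Corollary A.1; the limiting rows of the table correspond to the coefficient growing without bound in either direction.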
Appendix B A case study on combining feedback types
Fig. 3 illustrates a case study for teaching a robot arm a reward for motion planning through a novel combination of feedback types. In each environment, the robot arm must plan a trajectory from a start configuration to a designated goal configuration. We want this trajectory to properly trade off efficiency against staying at appropriate distances from the human and from the table. Hand-tuning a reward function that returns desirable trajectories in all possible environments is very challenging. For example, increasing the efficiency weight to produce a smoother trajectory in one environment can break the behavior in another environment, where the robot now gets too close to the human. In fact, the first type of feedback in the case study illustrates this: we design a (proxy) reward function that works well in two (training) environments (top left), but there are many rewards that are consistent with that behavior, yet produce vastly different behaviors in the two test environments (right).
Therefore, we start by defining a proxy reward, but then follow it up with more feedback: an improvement, and a comparison between two trajectories. This narrows down the space of rewards such that the robot can now generalize what to do outside of the training environments, as shown by two testing environments (right).
Cost Function and Features.
Efficiency is the sum of squared configuration-space distances between consecutive trajectory waypoints. The table and human features are expressed as 1 minus a radial basis function of a modified distance between the object and the robot’s end-effector position (obtained from the forward kinematics, which maps a configuration to its end-effector location in $\mathbb{R}^3$). For the table, this modification is to only consider distance in the z-coordinate, effectively measuring the distance from the robot’s end effector to the table plane. For the human, the modification is to treat the human as an axis and consider distance in 2 dimensions after projecting onto the plane normal to that axis. In Figure 3, the main obstacle is either the human’s body or his arm, and the axis is chosen accordingly: along the body when the body is the obstacle, and along the arm when the arm is the obstacle. This considers the human not as just a point, but rather as a line along the body or arm axis.
Optimization. We approximate the space of reward parameters by uniform discretization of the surface of the nonnegative octant of the 3-dimensional sphere (1371 points). Robotic motion planners cannot, in general, compute the globally optimal trajectory for a given reward parameter $\theta$, so we resort to computing a set of locally optimal trajectories for each $\theta$ via TrajOpt [Schulman et al., 2014]. The optimal trajectory for a given $\theta$ is then defined as the lowest-cost trajectory in that set.
Proxy Reward. For this case study, the robot begins by asking the human designer for a proxy reward (cost) function. It is difficult for humans to provide proxies that work across all environments [Ratner et al., 2018], so the robot asks for a proxy that produces the desired behavior in the two training environments. The human can provide proxy weights $\theta_{\text{proxy}}$ that produce trajectories matching those of the true weights $\theta^*$ (depicted in orange in Figure 3). Providing a proxy applies constraints that shrink our feasible set from $\mathcal{F}_0$ to $\mathcal{F}_{\text{proxy}}$:
$$\mathcal{F}_{\text{proxy}} = \left\{ \theta \in \mathcal{F}_0 \;:\; \xi^*_{\theta, E_i} = \xi^*_{\theta_{\text{proxy}}, E_i} \text{ for } i \in \{1, 2\} \right\}$$
where $\xi^*_{\theta, E_i}$ denotes the optimal trajectory (unique in our case study) w.r.t. cost parameter $\theta$ in environment $E_i$. The new feasible set contains only the parameters that produce optimal trajectories with respect to the true weights in environments 1 and 2. Although it is a subset of the original feasible set $\mathcal{F}_0$, the new feasible set $\mathcal{F}_{\text{proxy}}$ is still reasonably large (Figure 3, top, middle, orange area). Furthermore, although the proxy produces optimal trajectories in environments 1 and 2, it does not necessarily do so in environments 3 and 4. Figure 3 (top, right) illustrates the different trajectories that result from optimizing different $\theta \in \mathcal{F}_{\text{proxy}}$. To further narrow our feasible set, we will ask for another form of feedback: improvement.
Improvement. The robot will now (actively) provide a nominal trajectory, and ask the human to improve it, i.e., alter the trajectory to better suit their preferences. Suppose the robot presents the human with the nominal trajectory $\xi_{\text{nominal}}$ shown in gray (Figure 3, middle, left). This nominal trajectory is inefficient, staying too close to the table. Based on $\theta^*$, the human could provide the improved orange trajectory $\xi_{\text{improved}}$ (Figure 3, middle, left) that is more efficient and doesn’t emphasize closeness to the table as much. Writing $C_\theta(\xi)$ for the cost of trajectory $\xi$ under parameter $\theta$, this improvement reduces our feasible set from $\mathcal{F}_{\text{proxy}}$ to $\mathcal{F}_{\text{improve}}$:
$$\mathcal{F}_{\text{improve}} = \left\{ \theta \in \mathcal{F}_{\text{proxy}} \;:\; C_\theta(\xi_{\text{improved}}) < C_\theta(\xi_{\text{nominal}}) \right\}$$
Figure 3 (middle, middle) shows the effect of applying this constraint: the orange feasible set has shrunk, but not enough to guarantee optimal behavior in all environments. The improvement establishes that closeness to the table should not come at the cost of efficiency. As a result, it removes the red trajectory in environment 3, which greatly traded off efficiency for proximity to the table (Figure 3, middle, right). To further fine-tune, we will ask the human to answer a trajectory comparison.
Trajectory Comparison. The robot presents the human with two trajectories (Figure 3, bottom, left, orange and gray) and asks which incurs less cost. The human answers "orange", the trajectory that prioritizes efficiency over distance to the table. This comparison feedback shrinks our feasible set from $\mathcal{F}_{\text{improve}}$ to $\mathcal{F}_{\text{compare}}$:
$$\mathcal{F}_{\text{compare}} = \left\{ \theta \in \mathcal{F}_{\text{improve}} \;:\; C_\theta(\xi_{\text{orange}}) < C_\theta(\xi_{\text{gray}}) \right\}$$
We finally see a very small orange feasible set (Figure 3, bottom, middle). Appropriately, in all four environments now, every $\theta \in \mathcal{F}_{\text{compare}}$ produces a trajectory $\xi^*_{\theta, E_i}$ s.t. $\xi^*_{\theta, E_i} = \xi^*_{\theta^*, E_i}$. This is illustrated in Figure 3 (bottom, right), as only the optimal green trajectory remains in each environment.
Our case study showcases the usefulness of combining types of feedback. A designer might start with their best guess at a reward function, the robot might misbehave in new environments, and the designer or even an end-user might observe this and intervene to correct or stop the robot. Over time, the robot should narrow in on what people actually want it to do.
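The successive shrinking of the feasible set can be illustrated with a toy sketch. Everything concrete below is a hypothetical stand-in rather than the case study's data: trajectories are represented directly by three made-up feature values (efficiency, table, human), the cost is assumed linear in those features, the candidate weights come from a coarse grid rather than the 1371-point discretization, and each type of feedback is treated as a hard constraint.

```python
import itertools
import math

def cost(theta, phi):
    """Linear cost: weighted sum of a trajectory's feature values."""
    return sum(t * f for t, f in zip(theta, phi))

def best(theta, trajectories):
    """Index of the lowest-cost trajectory in a finite candidate set."""
    return min(range(len(trajectories)), key=lambda i: cost(theta, trajectories[i]))

def unit(v):
    """Project a nonzero weight vector onto the unit sphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return tuple(x / norm for x in v)

# Candidate weights on the nonnegative octant of the unit sphere (coarse grid).
grid = [i / 10 for i in range(11)]
thetas = [unit(v) for v in itertools.product(grid, repeat=3) if any(v)]
theta_true = unit((2.0, 1.0, 2.0))  # hypothetical "true" weights of the human
thetas.append(theta_true)

# Proxy: keep weights whose optimal trajectory matches the designer's intended
# behavior in each training environment (simulated here from theta_true).
train_envs = [
    [(1.0, 0.9, 0.2), (1.6, 0.2, 0.2), (1.2, 0.3, 0.9)],
    [(1.0, 0.5, 1.0), (1.3, 0.6, 0.1), (2.5, 0.1, 0.1)],
]
proxy_choice = [best(theta_true, env) for env in train_envs]
f_proxy = [th for th in thetas
           if all(best(th, env) == c for env, c in zip(train_envs, proxy_choice))]

# Improvement: the improved trajectory must cost less than the nominal one.
nominal, improved = (1.55, 0.1, 0.3), (1.2, 0.6, 0.3)
f_improve = [th for th in f_proxy if cost(th, improved) < cost(th, nominal)]

# Comparison: the chosen (orange) trajectory must cost less than the other.
orange, gray = (1.0, 0.8, 0.4), (1.36, 0.2, 0.4)
f_compare = [th for th in f_improve if cost(th, orange) < cost(th, gray)]

print(len(thetas), len(f_proxy), len(f_improve), len(f_compare))
```

Each constraint only ever removes candidates, so the sets are nested, and the weights that generated the feedback survive every step. Treating feedback as a hard constraint corresponds to a perfectly rational human; a noisy human would instead call for a soft, Boltzmann-weighted update.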
Appendix C Actively selecting which type of feedback to use
Given that we can mix and match types of feedback, we may also wonder what the best type to ask for at each point in time is. The probabilistic model defined by reward-rational choice hints at how to select the feedback type: pick the one that maximizes expected information gain. We point this out, not because using information gain as an active learning metric is a new idea, but because the ability to use it arises immediately as an application of the formalism.
Suppose we can select between $n$ types of feedback with choice sets $\mathcal{C}_1, \dots, \mathcal{C}_n$ to ask the user for. Let $b_t(r)$ be the robot’s belief distribution over rewards at time $t$. The type of feedback that (greedily) maximizes information gain for the next time step is
$$i^* = \arg\max_{i \in \{1, \dots, n\}} \; \mathbb{E}_{r, c_i}\left[ \log \frac{P(c_i \mid r, \mathcal{C}_i)}{\int P(c_i \mid r', \mathcal{C}_i)\, b_t(r')\, dr'} \right]$$
where $r$ is distributed according to the robot’s current belief $b_t$, $c_i$ is the random variable corresponding to the user’s choice within feedback type $i$, and $P(c_i \mid r, \mathcal{C}_i)$ is defined according to the human model in Equation 1.
To showcase the benefit of actively selecting feedback types, we run an experiment with demonstrations and comparisons. We measure regret (the maximum and expected difference, on hold-out environments, in ground-truth reward between (1) optimizing the ground-truth reward and (2) optimizing the learned reward). We manipulate whether we have access to demonstrations only, comparisons only, or both, as well as the number of feedback instances queried.
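The greedy information-gain rule can be sketched for the simplest possible setting: a discrete belief over a handful of reward hypotheses and small, finite choice sets. The function names and the two toy feedback types below are illustrative assumptions, not the experimental setup; the human model is the Boltzmann model with a rationality coefficient `beta`.

```python
import math

def boltzmann(values, beta):
    """Boltzmann distribution proportional to exp(beta * value)."""
    m = max(beta * v for v in values)  # subtract the max for numerical stability
    w = [math.exp(beta * v - m) for v in values]
    z = sum(w)
    return [x / z for x in w]

def info_gain(belief, choice_rewards, beta=1.0):
    """Mutual information between the reward hypothesis and the human's choice.

    belief[k]            -- probability of reward hypothesis k
    choice_rewards[k][j] -- reward of (grounded) choice j under hypothesis k
    """
    likelihoods = [boltzmann(row, beta) for row in choice_rewards]  # P(c | r_k)
    n_choices = len(choice_rewards[0])
    marginal = [sum(b * lik[j] for b, lik in zip(belief, likelihoods))
                for j in range(n_choices)]
    gain = 0.0
    for b, lik in zip(belief, likelihoods):
        for j in range(n_choices):
            if b > 0 and lik[j] > 0:
                gain += b * lik[j] * math.log(lik[j] / marginal[j])
    return gain

def select_feedback_type(belief, feedback_types, beta=1.0):
    """Return the feedback type with the largest expected information gain."""
    gains = {name: info_gain(belief, rewards, beta)
             for name, rewards in feedback_types.items()}
    return max(gains, key=gains.get), gains

belief = [0.5, 0.5]  # two equally likely reward hypotheses
feedback_types = {
    # the two hypotheses disagree about which option is better
    "comparison": [[1.0, 0.0], [0.0, 1.0]],
    # both hypotheses rank the options identically, so the answer is uninformative
    "uninformative": [[1.0, 0.0], [1.0, 0.0]],
}
best_type, gains = select_feedback_type(belief, feedback_types, beta=2.0)
```

A query whose answer is predicted identically under every hypothesis carries no information, so the rule prefers queries on which the current hypotheses disagree.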
One may initially wonder whether comparisons are necessary, given that demonstrations seem to provide so much information early on. Overall, we observe that demonstrations are optimal early on, when little is known about the reward, while comparisons become optimal later, as a way to fine-tune the reward (Fig. 5 shows our results). The observation also serves to validate the approach contributed by Palan et al. [2019] and Ibarz et al. [2018] in the applications of motion planning and Atari game playing, respectively. Both papers manually define the mixing procedure we found to be optimal: initially train the reward model using human demonstrations, and then fine-tune with comparisons.
Experiment Details. We tested 3 different active learning methods: active querying of demonstrations, active querying of comparisons, and active querying of demonstrations and comparisons, across 8 different gridworld environments depicted in Figure 4. The top 4 environments were used in training while the bottom 4 were held out for testing. Each environment is a 25x25 gridworld MDP with a linear reward function in 3 features: the RGB color values of each pixel. We assign each environment 10 different start-goal pairs from which the algorithms can ask queries. The goal of each algorithm is to efficiently recover a ground-truth reward through querying.
Since our rewards are linear in RGB, the feasible reward set consists of 3D parameters that weight the value of each feature in the reward function. The feasible set can be constrained to the surface of the 3D unit sphere since reward functions in MDPs are scale invariant. We uniformly discretize points on the surface of the 3D sphere to obtain a finite approximation of the feasible set. To approximate the set of trajectories, we first compute the optimal trajectory under each discretized reward parameter. We also include trajectories that are not the result of optimizing reward functions, by inserting noise into the value function when computing optimal trajectories as above.
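The discretization step can be sketched as follows. The text does not say how the points are placed, so this sketch swaps in a Fibonacci lattice, a common way to obtain roughly (not exactly) uniform points on a sphere; the point count of 500, the function names, and the two example trajectories are likewise our own assumptions. The final loop illustrates the scale invariance that justifies restricting the parameters to the unit sphere.

```python
import math

def fibonacci_sphere(n):
    """Roughly uniform points on the unit sphere via a Fibonacci lattice."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for i in range(n):
        z = 1.0 - 2.0 * (i + 0.5) / n  # evenly spaced heights in (-1, 1)
        radius = math.sqrt(1.0 - z * z)
        phi = golden_angle * i
        points.append((radius * math.cos(phi), radius * math.sin(phi), z))
    return points

def trajectory_reward(theta, features):
    """Linear reward: theta weights the summed RGB features of a trajectory."""
    return sum(t * f for t, f in zip(theta, features))

thetas = fibonacci_sphere(500)  # hypothetical size of the discretized set

# Scaling theta by a positive constant never changes which trajectory is preferred.
traj_a, traj_b = (3.0, 1.0, 0.5), (1.0, 2.0, 2.0)  # made-up summed RGB features
same_ranking = []
for theta in thetas:
    scaled = tuple(2.0 * t for t in theta)
    prefers_a = trajectory_reward(theta, traj_a) > trajectory_reward(theta, traj_b)
    prefers_a_scaled = trajectory_reward(scaled, traj_a) > trajectory_reward(scaled, traj_b)
    same_ranking.append(prefers_a == prefers_a_scaled)
```

Because only the direction of the weight vector affects the ranking of trajectories, a two-dimensional surface suffices to represent every distinct reward in this family.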
Demonstrations and Comparisons as Hard Constraints. The algorithms recover the ground-truth reward by narrowing a set of feasible rewards with active queries. We use $\mathcal{F}_t$ to denote the set of feasible rewards at iteration $t$ of querying. Demonstrations and comparisons shrink the feasible set in the following way: a demonstration keeps only the rewards under which the demonstrated trajectory is optimal, and a comparison keeps only the rewards under which the chosen trajectory is at least as good as the rejected one.
For our experiments, we performed greedy volume removal over the possible start-goal pairs that we specified in each environment.
For demonstrations, we look for the start-goal pair that in expectation produces a demonstration that leaves the smallest feasible set (the size of the feasible set is the volume or diameter described below). For comparisons, we look for the pair of trajectories that produces the minimum worst-case feasible region remaining. For the method with demonstrations and comparisons, we compute the above 2 metrics and select the feedback type with the smaller feasible region. We run this algorithm for 10 iterations and average our results across 50 different ground-truth rewards. We plot several statistics for each iteration in Figure 5, including the volume and diameter of the feasible set and the maximum and average regret, where regret is computed over hold-out environments and the start-goal pairs in each MDP. Each metric is a proxy for how accurate our estimate of the ground-truth reward is. We notice that the combination of demonstrations and comparisons achieves lower volume, diameter, max regret, and average regret than demonstrations alone, and that it achieves this in fewer iterations than comparisons alone.
Appendix D Experiments on meta-choice (Section 4.2)
In our main meta-reasoning experiments, we assumed that the simulated human meta-reasoned with a given meta-rationality coefficient $\beta_0$ and that our algorithm somehow knew this quantity. However, in practice, we will not have access to $\beta_0$. This brings about an interesting question: what are the effects of inference under a misspecified $\beta_0$? What are the effects of overestimating or underestimating the human’s rationality?
To test this, we designed an experiment in which our simulated human provided supervision with fixed ground-truth parameters, including the meta-rationality coefficient $\beta_0$, while our algorithm performed belief updates with various assumed values $\hat{\beta}_0$ above and below $\beta_0$. The first way to measure the extent of misspecification is to measure the KL divergence between the belief induced by $\beta_0$ and that induced by $\hat{\beta}_0$.
Additionally, we wanted to measure the expected regret given a human that provides supervision with rationality $\beta_0$ and an algorithm that performs belief updates with rationality $\hat{\beta}_0$.
We plot the results in Figure 6, averaged over 50 randomly sampled reward functions. We notice that when the human does not meta-reason ($\beta_0 = 0$), the KL divergence in the belief distribution update is large. In comparison, with any moderate level of meta-reasoning, the KL divergence is very low. We notice this too in the expected regret. Note that the minimum expected regret is not achieved by $\hat{\beta}_0 = \beta_0$. This is because $\beta_0$ is used to compute the frequency at which the human provides each type of feedback as an answer. Simply matching $\hat{\beta}_0$ with $\beta_0$ doesn’t guarantee minimum expected regret (the optimal $\hat{\beta}_0$ for minimizing expected regret is a function of $\beta_0$). These experiments suggest that if we detect that the human is poor at meta-reasoning (low $\beta_0$), it is safer to drop the meta-reasoning assumption. However, if the human is displaying meta-reasoning, we can leverage this to improve learning.
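To make the misspecification measurement concrete, here is a minimal sketch under stated assumptions. It assumes the human picks a feedback type with probability Boltzmann in that type's expected reward (with a meta-rationality coefficient `beta0`) and then picks a choice within the type with coefficient `beta`. The two reward hypotheses, the reward numbers, and all function names are made up for illustration and do not reproduce the experiments behind Figure 6.

```python
import math

def boltzmann(values, beta):
    """Boltzmann distribution proportional to exp(beta * value)."""
    m = max(beta * v for v in values)  # subtract the max for numerical stability
    w = [math.exp(beta * v - m) for v in values]
    z = sum(w)
    return [x / z for x in w]

def meta_choice_probs(type_rewards, beta, beta0):
    """P(feedback type | reward), Boltzmann in each type's expected reward (assumed form)."""
    expected = []
    for rewards in type_rewards:
        p = boltzmann(rewards, beta)
        expected.append(sum(pi * ri for pi, ri in zip(p, rewards)))
    return boltzmann(expected, beta0)

def posterior(prior, hypotheses, observed_type, observed_choice, beta, beta0):
    """Belief update from both the feedback type and the choice made within it.

    hypotheses[k][i][j] -- reward of choice j of feedback type i under hypothesis k
    """
    unnorm = []
    for b, type_rewards in zip(prior, hypotheses):
        p_type = meta_choice_probs(type_rewards, beta, beta0)[observed_type]
        p_choice = boltzmann(type_rewards[observed_type], beta)[observed_choice]
        unnorm.append(b * p_type * p_choice)
    z = sum(unnorm)
    return [u / z for u in unnorm]

def kl(p, q):
    """KL divergence between two discrete beliefs with matching support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Feedback type 0: two corrections (both pass through lava); type 1: turning off.
hypotheses = [
    [[-10.0, -8.0], [0.0]],  # hypothesis 0: lava is very bad
    [[2.0, 1.0], [0.0]],     # hypothesis 1: lava is harmless, reaching the goal pays
]
prior = [0.5, 0.5]

# The human turns the robot off (type 1, choice 0).
post_true = posterior(prior, hypotheses, 1, 0, beta=1.0, beta0=1.0)
post_half = posterior(prior, hypotheses, 1, 0, beta=1.0, beta0=0.5)
post_none = posterior(prior, hypotheses, 1, 0, beta=1.0, beta0=0.0)

kl_half = kl(post_true, post_half)  # learner underestimates meta-rationality
kl_none = kl(post_true, post_none)  # learner ignores the meta-choice entirely
```

In this toy setting, a learner that ignores the meta-choice gains nothing from the off signal, because off has only one option, whereas a learner that models the meta-choice shifts its belief toward lava being bad. The gap between the two beliefs is what the KL divergence quantifies.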