## 1 Introduction

The accelerating progress in artificial intelligence (AI) and robotics is bound to have a substantial impact on society, simultaneously unlocking new potential in augmenting and transcending human capabilities while also posing significant challenges to safe and effective human-robot interaction. In the short term, integrating robotic systems into human-dominated environments will require them to assess the intentions and preferences of their users in order to assist them effectively, while avoiding failures due to poor coordination. In the long term, ensuring that advanced and highly autonomous AI systems will be beneficial to individuals and society will hinge on their ability to correctly assimilate human values and objectives Amodei2017.

We envision the short- and long-term challenges as being inherently coupled, and predict that improving the ability of robots to understand and coordinate with their human users will inform solutions to the general AI *value-alignment problem*.

Successful value alignment requires moving from typical single-agent AI formulations to robots that account for a second agent—the human—who determines what the objective is.
In other words, value alignment is fundamentally a multi-agent problem.
Cooperative Inverse Reinforcement Learning (CIRL) formulates value alignment as a two-player game in which a human and a robot share a common reward function, but *only the human* has
knowledge of
this reward Hadfield-Menell2016a.
In practice, solving a CIRL game requires more than multi-agent decision theory: we are not dealing with *any* multi-agent system,
but with a human-robot system.
This poses a unique challenge in that humans do not behave like idealized rational agents Tversky1974.
However, humans do
excel at social interaction
and are extremely perceptive of the mental states of others Heider1944; Meltzoff1995.
They will naturally project mental states such as beliefs and intentions onto their robotic collaborators, becoming invaluable allies in our robots’ quest for value alignment.

In the coming decades, tackling the value-alignment problem will be crucial to building collaborative robots that know what their human users want.
In this paper, we show that value alignment is possible not just in theory, but also in practice.
We introduce a solution for CIRL based on a model of the human agent that is grounded in
cognitive science findings regarding human decision making Baker2014 and pedagogical reasoning Shafto2014.
Our solution leverages two closely related insights to facilitate value alignment.
First,
to the extent that improving their collaborator’s understanding of their goals may be conducive to success, people will tend to behave *pedagogically*, deliberately choosing their actions to be informative about these goals.
Second, the robot should anticipate this pedagogical reasoning in interpreting the actions of its human users, akin to how a *pragmatic* listener interprets a speaker’s utterance in natural language.
Jointly, pedagogical actions and pragmatic interpretations enable stronger and faster inferences among people Shafto2014.
Our result suggests that it is possible for robots to partake in this naturally-emerging equilibrium,
ultimately becoming more perceptive and competent collaborators.

## 2 Solving Value Alignment using Cognitive Models

### 2.1 Cooperative Inverse Reinforcement Learning (CIRL)

Cooperative Inverse Reinforcement Learning (CIRL) Hadfield-Menell2016a formalizes value alignment as a two-player game, which we briefly present here.
Consider two agents, a human $\mathbf{H}$ and a robot $\mathbf{R}$, engaged in a dynamic collaborative task involving a (possibly infinite) sequence of steps.
The goal of both agents is to achieve the best possible outcome according to some objective $\theta$.
However, this objective is only known to $\mathbf{H}$.
In order to contribute to the objective, $\mathbf{R}$ will need to make inferences about $\theta$ from the actions of $\mathbf{H}$ (an Inverse Reinforcement Learning (IRL) problem), and $\mathbf{H}$ will have an incentive to behave informatively so that $\mathbf{R}$ becomes more helpful, hence the term *cooperative* IRL.

Formally, a CIRL game is a dynamic (Markov) game of two players ($\mathbf{H}$ and $\mathbf{R}$), described by a tuple $\langle S, \{A^{\mathbf{H}}, A^{\mathbf{R}}\}, T, \Theta, R, P_0, \gamma \rangle$, where:

- $S$ is the set of possible states of the world;
- $A^{\mathbf{H}}, A^{\mathbf{R}}$ are the sets of actions available to $\mathbf{H}$ and $\mathbf{R}$ respectively;
- $T(\cdot \mid s, a^{\mathbf{H}}, a^{\mathbf{R}})$ is a discrete transition measure over the next state, conditioned on the previous state and the actions of $\mathbf{H}$ and $\mathbf{R}$ (the theoretical formulation is easily extended to arbitrary measurable sets; we limit our analysis to finite state and objective sets for computational tractability and clarity of exposition);
- $\Theta$ is the set of possible objectives;
- $R(s, a^{\mathbf{H}}, a^{\mathbf{R}}; \theta)$ is a cumulative reward function assigning a real value to every tuple of state and actions for a given objective $\theta \in \Theta$;
- $P_0$ is a probability measure on the initial state and the objective;
- $\gamma \in [0, 1)$ is a geometric time discount factor making future rewards gradually less valuable.

### 2.2 Pragmatic Robots for Pedagogic Humans

Asymmetric information structures in games (even static ones) generally induce an *infinite hierarchy of beliefs*: our robot will need to maintain a Bayesian belief over the human’s objectives to decide on its actions.
To reason about the robot’s decisions, the human would in principle need to maintain a belief on the robot’s belief, which will in turn inform her decisions, thereby requiring the robot to maintain a belief on the human’s belief about its own belief, and so on Zamir2012.
In Hadfield-Menell2016a, it was shown that an *optimal* pair of strategies can be found for any CIRL game by solving a partially observable Markov decision process (POMDP); this avoids the bottomless recursion as long as both agents are rational and can coordinate perfectly before the start of the game.

Unfortunately,
when dealing with human agents,
rationality and prior coordination are nontrivial assumptions.
Finding an equivalent tractability result for more realistic human models is therefore crucial to using the CIRL formulation to solve real-world value-alignment problems.
We find the key insight in cognitive studies of human *pedagogical reasoning* Shafto2014, in which a teacher chooses actions or utterances to influence the beliefs of a learner who is aware of the teacher’s intention.
The teacher can then exploit the fact that the learner interprets utterances pragmatically.
Infinite recursion is averted by finding a fixed-point relation between the teacher’s best utterance and the learner’s best interpretation, exploiting a common modeling assumption in Bayesian theory of mind:
the learner models the teacher as a *noisily rational* decision maker Luce1959, who is *likelier* to choose utterances that cause the learner to place a high posterior belief on the correct hypothesis, given the learner’s current belief. While in reality the teacher cannot exactly compute the learner’s belief, the model supposes that she estimates it (from the learner’s previous responses to her utterances) and introduces noise in her decisions to capture estimation inaccuracies. This framework can predict complex behaviors observed in human teaching-learning interactions, in which pedagogical utterances and pragmatic interpretations permit efficient communication Shafto2014.

We adopt an analogous modeling framework to that in Shafto2014 for value alignment,
with a critical difference: the ultimate objective of the human is not to explicitly improve the robot’s understanding of the true objective, but to optimize the team’s expected performance *towards* this objective.
Pedagogic behavior thus emerges implicitly to the extent that a well-informed robot becomes a better collaborator.
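To make the cited teacher-learner fixed point concrete, here is a minimal Python sketch of a recursion in the style of Shafto2014: the teacher's utterance distribution is a soft-max of the learner's posterior on the true hypothesis, and the learner's posterior is the Bayesian inversion of the teacher's distribution; iterating the two updates converges to a fixed point. All names and the specific parameterization are our own illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def pedagogic_fixed_point(prior, consistent, beta=5.0, n_iters=50):
    """Iterate teacher/learner updates to a pedagogic-pragmatic fixed point.

    prior:      (H,) array, prior over hypotheses.
    consistent: (H, D) 0/1 array; consistent[h, d] = 1 if datum d is
                compatible with hypothesis h (the 'literal' likelihood).
    Returns (teacher, learner): teacher[h, d] = P_teacher(d | h),
                                learner[h, d] = P_learner(h | d).
    """
    teacher = consistent / consistent.sum(axis=1, keepdims=True)  # literal teacher
    for _ in range(n_iters):
        # Pragmatic learner: Bayesian inversion of the teacher's choice rule.
        joint = teacher * prior[:, None]
        learner = joint / joint.sum(axis=0, keepdims=True)
        # Pedagogic teacher: soft-max preference for data that make the
        # learner place high posterior belief on the true hypothesis.
        teacher = np.exp(beta * learner) * consistent
        teacher /= teacher.sum(axis=1, keepdims=True)
    return teacher, learner
```

For instance, with two hypotheses and three possible examples, `pedagogic_fixed_point(np.array([0.5, 0.5]), np.array([[1, 1, 0], [0, 1, 1]]))` concentrates each hypothesis's teaching distribution on the example that uniquely identifies it.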

### 2.3 Pragmatic-Pedagogic Equilibrium Solution to CIRL

The robot does not have access to the true objective $\theta$, but rather maintains a belief over $\Theta$. We assume that this belief on $\Theta$ can be expressed parametrically (this is always true if $\Theta$ is a finite set), and define $\mathcal{B}$ to be the corresponding (finite-dimensional) parameter space, denoting $\mathbf{R}$’s belief by $b \in \mathcal{B}$. While in reality the human cannot directly observe $b$, we assume, as in Shafto2014, that she can compute it or infer it from the robot’s behavior (and model estimation inaccuracies as noise in her policy). We can then let $Q(s, b, a^{\mathbf{H}}, a^{\mathbf{R}}; \theta)$ represent the state-action value function of the CIRL game for a given objective $\theta$, which we are seeking to compute: if $\theta$ is the true objective known to $\mathbf{H}$, then $Q(s, b, a^{\mathbf{H}}, a^{\mathbf{R}}; \theta)$ represents the best performance the team can expect to achieve if $\mathbf{H}$ chooses $a^{\mathbf{H}}$ and $\mathbf{R}$ chooses $a^{\mathbf{R}}$ from state $s$, with $\mathbf{R}$’s current belief being $b$.

In order to solve for $Q$, we seek to establish an appropriate dynamic programming relation for the game, given a well-defined information structure and a model of the human’s decision making.
Since it is typically possible for people to predict a robot’s next action once they observe its beginning Dragan2014, we assume that $\mathbf{H}$ can observe $a^{\mathbf{R}}$ at each turn before committing to $a^{\mathbf{H}}$.
A well-established model of human decision making in psychology and econometrics is the Luce choice rule, which models people’s decisions probabilistically, making high-utility choices more likely than those with lower utility Luce1959.
In particular, we employ a common case of the Luce choice rule, the Boltzmann (or *soft-max*) noisy rationality model Baker2014, in which the probability of a choice decays exponentially as its utility decreases in comparison to competing options.
The relevant utility metric in our case is the sought $Q$ (which captures $\mathbf{H}$’s best expected outcome for each of her available actions $a^{\mathbf{H}}$).
Therefore the probability that $\mathbf{H}$ will choose action $a^{\mathbf{H}}$ has the form

$$\pi^{\mathbf{H}}\big(a^{\mathbf{H}} \mid s, b, a^{\mathbf{R}}; \theta\big) \;=\; \frac{e^{\beta\, Q(s, b, a^{\mathbf{H}}, a^{\mathbf{R}}; \theta)}}{\sum_{\tilde a \in A^{\mathbf{H}}} e^{\beta\, Q(s, b, \tilde a, a^{\mathbf{R}}; \theta)}}, \qquad (1)$$

where $\beta$ is termed the *rationality coefficient* of $\mathbf{H}$ and quantifies the concentration of $\mathbf{H}$’s choices around the optimum; as $\beta \to \infty$, $\mathbf{H}$ becomes a perfectly rational agent, while, as $\beta \to 0$, $\mathbf{H}$ becomes indifferent to $Q$.
The above expression can be interpreted by $\mathbf{R}$ as the *likelihood* of action $a^{\mathbf{H}}$ given a particular $\theta$.
The evolution of $\mathbf{R}$’s belief is then given (deterministically) by the Bayesian update

$$b'(\theta) \;\propto\; \pi^{\mathbf{H}}\big(a^{\mathbf{H}} \mid s, b, a^{\mathbf{R}}; \theta\big)\; b(\theta). \qquad (2)$$

Jointly, (1) and (2) define a fixed-point equation analogous to the one in Shafto2014, which states how $\mathbf{R}$ should pragmatically update $b$ based on a noisily rational pedagogic $\mathbf{H}$. This amounts to a deterministic transition function for $\mathbf{R}$’s belief, $b' = f(b, s, a^{\mathbf{H}}, a^{\mathbf{R}})$. Crucially, however, the fixed-point relation derived here involves $Q$ itself, which we have yet to compute.
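As an illustration, here is a minimal Python sketch of the belief transition defined by (1) and (2) for a finite objective set, assuming (hypothetically) that the value function is stored as one array per objective, indexed by state and by both agents' actions; the dependence of $Q$ on the belief $b$ is suppressed for brevity, and $Q$ itself still has to be computed via the dynamic program below.

```python
import numpy as np

def boltzmann_likelihood(Q_theta, s, aR, beta):
    """Eq. (1): soft-max probability over H's actions aH, given objective theta."""
    q = Q_theta[s, :, aR]                    # Q(s, ., aR; theta) for every aH
    p = np.exp(beta * (q - q.max()))         # subtract max for numerical stability
    return p / p.sum()

def belief_update(b, Q, s, aH, aR, beta):
    """Eq. (2): Bayesian update of R's belief b over objectives after seeing aH."""
    likelihood = np.array([boltzmann_likelihood(Q[theta], s, aR, beta)[aH]
                           for theta in range(len(b))])
    posterior = likelihood * b
    return posterior / posterior.sum()
```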

Unlike $\mathbf{H}$, $\mathbf{R}$ is modeled as a rational agent; however, not knowing the true $\theta$, the best $\mathbf{R}$ can do is to maximize the expectation of $Q$ based on its current belief $b$ (we assume for simplicity that the optimum is unique or that a well-defined disambiguation rule exists; note that this does not imply *certainty equivalence*, nor do we assume separation of estimation and control: $\mathbf{R}$ is fully reasoning about how its actions and those of $\mathbf{H}$ may affect its future beliefs):

$$a^{\mathbf{R}^*}(s, b) \;=\; \arg\max_{a^{\mathbf{R}} \in A^{\mathbf{R}}} \; \sum_{\theta \in \Theta} b(\theta) \sum_{a^{\mathbf{H}} \in A^{\mathbf{H}}} \pi^{\mathbf{H}}\big(a^{\mathbf{H}} \mid s, b, a^{\mathbf{R}}; \theta\big)\, Q\big(s, b, a^{\mathbf{H}}, a^{\mathbf{R}}; \theta\big). \qquad (3)$$

Combining (2) with the state transition measure $T$, we can define the Bellman equation for $Q$ under the noisily rational policy (1) for any given $\theta$:

$$Q\big(s, b, a^{\mathbf{H}}, a^{\mathbf{R}}; \theta\big) \;=\; R\big(s, a^{\mathbf{H}}, a^{\mathbf{R}}; \theta\big) \;+\; \gamma\, \mathbb{E}_{s',\, a^{\mathbf{H}'}} \Big[ Q\big(s', b', a^{\mathbf{H}'}, a^{\mathbf{R}'}; \theta\big) \Big], \qquad (4)$$

where $s' \sim T(\cdot \mid s, a^{\mathbf{H}}, a^{\mathbf{R}})$; $b' = f(b, s, a^{\mathbf{H}}, a^{\mathbf{R}})$ as given by (2); $a^{\mathbf{R}'} = a^{\mathbf{R}^*}(s', b')$ as given by (3); and $a^{\mathbf{H}'} \sim \pi^{\mathbf{H}}(\cdot \mid s', b', a^{\mathbf{R}'}; \theta)$ as given by (1). Note that $\mathbf{H}$’s next action implicitly depends on $\mathbf{R}$’s action at the next turn.

Substituting (1-3) into (4), we obtain the sought dynamic programming relation for the CIRL problem under a noisily rational-pedagogic human and a pragmatic robot. The human is pedagogic because she takes actions according to (1), which takes into account how her actions will influence the robot’s belief about the objective. The robot is pragmatic because it assumes the human is actively aware of how her actions convey the objective, and interprets them accordingly.

The resulting problem is similar to a POMDP (in this case formulated in belief-state MDP form), with the important difference that the belief transition depends on the value function itself. In spite of this complication, the problem can be solved in backward time through dynamic programming: each Bellman update will be based on a pragmatic-pedagogic fixed point that encodes an equilibrium between the $Q$ function (and therefore $\mathbf{H}$’s policy for choosing her action) and the belief transition $f$ (that is, $\mathbf{R}$’s rule for interpreting $\mathbf{H}$’s actions). Evidence in Shafto2014 suggests that people are proficient at finding such equilibria, even though uniqueness is not guaranteed in general; the study of disambiguation is an open research direction.
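The sketch below illustrates one way such a solver could be organized for finite states, actions, and objectives, with beliefs discretized onto a grid. It is a simplification under our own assumptions (hypothetical names throughout): the fixed point between the $Q$ update and the belief transition is resolved approximately by repeated sweeps rather than by an exact fixed-point computation.

```python
import itertools
import numpy as np

def solve_cirl(n_S, n_AH, n_AR, n_Theta, T, Rw, gamma, beta, belief_grid, n_sweeps=50):
    """Approximate pragmatic-pedagogic value iteration on a discretized belief grid.

    T:  dict mapping (s, aH, aR) -> {s_next: prob}.
    Rw: Rw[theta][s, aH, aR], immediate reward for objective theta.
    belief_grid: list of length-n_Theta arrays (each summing to 1).
    Returns Q[theta][s, b_idx, aH, aR].
    """
    Q = {th: np.zeros((n_S, len(belief_grid), n_AH, n_AR)) for th in range(n_Theta)}

    def human_policy(th, s, b_idx, aR):                     # Eq. (1)
        q = Q[th][s, b_idx, :, aR]
        p = np.exp(beta * (q - q.max()))
        return p / p.sum()

    def belief_next(b_idx, s, aH, aR):                      # Eq. (2), snapped to grid
        b = belief_grid[b_idx]
        lik = np.array([human_policy(th, s, b_idx, aR)[aH] for th in range(n_Theta)])
        post = lik * b
        post /= post.sum()
        return int(np.argmin([np.linalg.norm(post - g) for g in belief_grid]))

    def robot_action(s, b_idx):                             # Eq. (3)
        b = belief_grid[b_idx]
        vals = [sum(b[th] * human_policy(th, s, b_idx, aR) @ Q[th][s, b_idx, :, aR]
                    for th in range(n_Theta)) for aR in range(n_AR)]
        return int(np.argmax(vals))

    for _ in range(n_sweeps):                               # Eq. (4): Bellman sweeps
        for th, s, b_idx, aH, aR in itertools.product(
                range(n_Theta), range(n_S), range(len(belief_grid)),
                range(n_AH), range(n_AR)):
            b1 = belief_next(b_idx, s, aH, aR)
            backup = Rw[th][s, aH, aR]
            for s1, p in T[(s, aH, aR)].items():
                aR1 = robot_action(s1, b1)
                piH1 = human_policy(th, s1, b1, aR1)
                backup += gamma * p * (piH1 @ Q[th][s1, b1, :, aR1])
            Q[th][s, b_idx, aH, aR] = backup
    return Q
```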

## 3 A Proof-of-Concept

We introduce the benchmark domain ChefWorld, a household collaboration setting in which a human $\mathbf{H}$ seeks to prepare a meal with the help of an intelligent robotic manipulator $\mathbf{R}$. There are multiple possible meals that $\mathbf{H}$ may want to prepare using the available ingredients, and $\mathbf{R}$ does not know beforehand which one she has chosen (we assume $\mathbf{H}$ cannot or will not tell $\mathbf{R}$ explicitly). The team obtains a reward only if $\mathbf{H}$’s intended recipe is successfully cooked. If $\mathbf{H}$ is aware of $\mathbf{R}$’s uncertainty, she should take actions that give $\mathbf{R}$ actionable information, particularly the information that she expects will allow $\mathbf{R}$ to be as helpful as possible as the task progresses.

Our problem has 3 ingredients, each with 2 or 3 states: spinach (absent, chopped), tomatoes (absent, chopped, puréed), and bread (absent, sliced, toasted). Recipes correspond to (joint) target states for the food. Soup requires the tomatoes to be chopped then puréed, the bread to be sliced then toasted, and no spinach. Salad requires the spinach and tomatoes to be chopped, and the bread to be sliced then toasted. $\mathbf{H}$ and $\mathbf{R}$ can slice or chop any of the foods, while only $\mathbf{R}$ can purée tomatoes or toast bread.
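For concreteness, here is one way (a hypothetical encoding of our own, not the implementation used for the reported results) the ChefWorld states and recipes just described could be represented for a solver like the one sketched in Section 2.3:

```python
from itertools import product

# Per-ingredient states (indices into each tuple below).
SPINACH = ("absent", "chopped")
TOMATO = ("absent", "chopped", "pureed")
BREAD = ("absent", "sliced", "toasted")

# A world state is one joint configuration of the three ingredients.
STATES = list(product(range(len(SPINACH)), range(len(TOMATO)), range(len(BREAD))))

# Recipes are (joint) target states; reward is earned only on reaching H's target.
RECIPES = {
    "soup": (SPINACH.index("absent"), TOMATO.index("pureed"), BREAD.index("toasted")),
    "salad": (SPINACH.index("chopped"), TOMATO.index("chopped"), BREAD.index("toasted")),
}

def reward(state, recipe):
    """Sparse reward: 1 if the team has produced the intended recipe, else 0."""
    return 1.0 if state == RECIPES[recipe] else 0.0
```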

A simple scenario with the above two recipes is solved using discretized belief-state value iteration and presented as an illustrative example in Fig. 1. $\mathbf{R}$ has a wrong initial belief about $\mathbf{H}$’s intended recipe. Under standard IRL, $\mathbf{H}$ fails to communicate her recipe. But if $\mathbf{R}$ is pragmatic and $\mathbf{H}$ is pedagogic, $\mathbf{H}$ is able to change $\mathbf{R}$’s belief and they successfully collaborate to make the meal.

In addition, we computed the solution to games with 4 recipes through a modification of POMDP value iteration (Table 1). In the pragmatic-pedagogic CIRL equilibrium with $\beta = 5$, $\mathbf{H}$ and $\mathbf{R}$ successfully cook the correct recipe 97% of the time, whereas under the standard IRL framework (with $\mathbf{H}$ acting as an expert disregarding $\mathbf{R}$’s inferences) they only succeed 46% of the time, less than half as often.

| | Boltzmann ($\beta$ = 1) | Boltzmann ($\beta$ = 2.5) | Boltzmann ($\beta$ = 5) | Rational |
|---|---|---|---|---|
| IRL | 0.2351 | 0.3783 | 0.4555 | 0.7083 |
| CIRL | 0.2916 | 0.7026 | 0.9727 | 1.0000 |

Table 1: Success rates for the 4-recipe ChefWorld game under standard IRL and pragmatic-pedagogic CIRL, for different models of the human.

## 4 Discussion

We have presented here an analysis of the AI value alignment problem that incorporates a well-established model of human decision making and theory of mind into the game-theoretic framework of cooperative inverse reinforcement learning (CIRL). Using this analysis, we derive a Bellman backup that allows solving the dynamic game through dynamic programming. At every instant, the backup rule is based on a pragmatic-pedagogic equilibrium between the robot and the human: the robot is uncertain about the objective and therefore incentivized to learn it from the human, whereas the human has an incentive to help the robot infer the objective so that it can become more helpful.

We note that this type of pragmatic-pedagogic equilibrium, recently studied in the cognitive science literature for human teaching and learning Shafto2014, may not be unique in general: there may exist two actions for $\mathbf{H}$ and two corresponding interpretations for $\mathbf{R}$ leading to different fixed points. For example, $\mathbf{H}$ could press a blue or a red button, which $\mathbf{R}$ could then interpret as asking it to pick up a blue or a red object. Although we might feel that blue-blue/red-red is a more intuitive pairing, blue-red/red-blue is valid as well: that is, if $\mathbf{H}$ thinks that $\mathbf{R}$ will interpret pressing the blue button as asking for the red object, then she will certainly be incentivized to press blue when she wants red; and in this case $\mathbf{R}$’s policy should consistently be to pick up the red object upon $\mathbf{H}$’s press of the blue button. When multiple conventions are possible, human beings tend to naturally disambiguate between them, converging on salient equilibria or “focal points” schelling1960strategy. Accounting for this phenomenon is likely to be instrumental for developing competent human-centered robots.

On the other hand, it is important to point out that, although they are computationally simpler than more general multi-agent planning problems, POMDPs are still PSPACE-complete mundhenk2000complexity, so reducing pragmatic-pedagogic equilibrium computation to solving a modified POMDP falls short of rendering the problem tractable in general. However, finding a POMDP-like Bellman backup does open the door to efficient CIRL solution methods that leverage and benefit from the extensive research on practical algorithms for approximate planning in large POMDPs silver2010monte.

We find the results in this work promising for two reasons. First, they provide insight into how CIRL games can be not only theoretically formulated but also practically solved. Second, they demonstrate, for the first time, formal solutions to value alignment that depart from the ideal assumption of a rational human agent and instead benefit from modern studies of human cognition. We predict that developing efficient solution approaches and incorporating more realistic human models will constitute important and fruitful research directions for value alignment.
