Repeated Inverse Reinforcement Learning

05/15/2017 ∙ by Kareem Amin, et al. ∙ Google, University of Michigan

We introduce a novel repeated Inverse Reinforcement Learning problem: the agent has to act on behalf of a human in a sequence of tasks and wishes to minimize the number of tasks in which it surprises the human by acting suboptimally with respect to how the human would have acted. Each time the human is surprised, the agent is provided a demonstration of the desired behavior by the human. We formalize this problem, including how the sequence of tasks is chosen, in a few different ways and provide some foundational results.




1 Introduction

One challenge in building AI agents that learn from experience is how to set their goals or rewards. In the Reinforcement Learning (RL) setting, one interesting answer to this question is inverse RL (or IRL) in which the agent infers the rewards of a human by observing the human’s policy in a task (Ng and Russell, 2000). Unfortunately, the IRL problem is ill-posed for there are typically many reward functions for which the observed behavior is optimal in a single task (Abbeel and Ng, 2004). While the use of heuristics to select from among the set of feasible reward functions has led to successful applications of IRL to the problem of learning from demonstration (e.g., Abbeel et al., 2007), not identifying the reward function poses fundamental challenges to the question of how well and how safely the agent will perform when using the learned reward function in other tasks.

We formalize multiple variations of a new repeated IRL problem in which the agent and (the same) human face multiple tasks over time. We separate the reward function into two components, one which is invariant across tasks and can be viewed as intrinsic to the human, and a second that is task specific. As a motivating example, consider a human doing tasks throughout a work day, e.g., getting coffee, driving to work, interacting with co-workers, and so on. Each of these tasks has a task-specific goal, but the human brings to each task intrinsic goals that correspond to maintaining health, financial well-being, not violating moral and legal principles, etc. In our repeated IRL setting, the agent presents a policy for each new task that it thinks the human would do. If the agent’s policy “surprises” the human by being sub-optimal, the human presents the agent with the optimal policy. The objective of the agent is to minimize the number of surprises to the human, i.e., to generalize the human’s behavior to new tasks.

In addition to addressing generalization across tasks, the repeated IRL problem we introduce and our results are of interest in resolving the question of unidentifiability of rewards from observations in standard IRL. Our results are also of interest to a particular aspect of the concern about how to make sure that the AI systems we build are safe, or AI safety. Specifically, the issue of reward misspecification is often mentioned in AI safety articles (e.g., Bostrom, 2003; Russell et al., 2015; Amodei et al., 2016). These articles mostly discuss broad ethical concerns and possible research directions, while our paper develops mathematical formulations and algorithmic solutions to a specific way of addressing reward misspecification.

In summary form, our contributions include: (1) an efficient reward-identification algorithm when the agent can choose the tasks in which it observes human behavior; (2) an upper bound on the number of total surprises when no assumptions are made on the tasks, along with a corresponding lower bound; (3) an extension to the setting where the human provides sample trajectories instead of complete behavior; and (4) identification guarantees when the agent can only choose the task rewards but is given a fixed task environment.

2 Markov Decision Processes (MDPs)

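An MDP is specified by its state space $\mathcal{S}$, action space $\mathcal{A}$, initial state distribution $\mu \in \Delta(\mathcal{S})$, transition function (or dynamics) $P: \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$, reward function $Y: \mathcal{S} \to \mathbb{R}$, and discount factor $\gamma \in [0, 1)$. We assume finite $\mathcal{S}$ and $\mathcal{A}$, and $\Delta(\mathcal{S})$ is the space of all distributions over $\mathcal{S}$. A policy $\pi: \mathcal{S} \to \mathcal{A}$ describes an agent’s behavior by specifying the action to take in each state. The (normalized) value function or long-term utility of $\pi$ is defined as $V^\pi(s) = (1-\gamma)\, \mathbb{E}\left[\sum_{t=1}^{\infty} \gamma^{t-1} Y(s_t) \mid s_0 = s; \pi\right]$. (Footnote 1: Here we differ (w.l.o.g.) from common IRL literature in assuming that reward occurs after transition.) Similarly, the Q-value function is $Q^\pi(s, a) = (1-\gamma)\, \mathbb{E}\left[\sum_{t=1}^{\infty} \gamma^{t-1} Y(s_t) \mid s_0 = s, a_0 = a; \pi\right]$. Where necessary we will use the notation $V^\pi_{P,Y}$ to avoid ambiguity about the dynamics and the reward function. Let $\pi^\star$ be an optimal policy, which maximizes $V^\pi$ and $Q^\pi$ in all states (and actions) simultaneously.

Given an initial distribution over states, $\mu$, a scalar value that measures the goodness of $\pi$ is defined as $\mathbb{E}_{s \sim \mu}[V^\pi(s)]$. We introduce some further notation to express $\mathbb{E}_{s \sim \mu}[V^\pi(s)]$ in vector-matrix form. Let $\eta^\pi_{\mu,P} \in \mathbb{R}^{|\mathcal{S}|}$ be the normalized state occupancy under initial distribution $\mu$, dynamics $P$, and policy $\pi$, whose $s$-th entry is $(1-\gamma)\, \mathbb{E}\left[\sum_{t=1}^{\infty} \gamma^{t-1} \mathbb{I}(s_t = s) \mid s_0 \sim \mu; \pi\right]$ ($\mathbb{I}(\cdot)$ is the indicator function). This vector can be computed in closed-form as $\eta^\pi_{\mu,P} = (1-\gamma) \left(\mu^\top P^\pi (I_{|\mathcal{S}|} - \gamma P^\pi)^{-1}\right)^\top$, where $P^\pi$ is an $|\mathcal{S}| \times |\mathcal{S}|$ matrix whose $(s, s')$-th element is $P(s' \mid s, \pi(s))$, and $I_{|\mathcal{S}|}$ is the $|\mathcal{S}| \times |\mathcal{S}|$ identity matrix. For convenience we will also treat the reward function $Y$ as a vector in $\mathbb{R}^{|\mathcal{S}|}$, and we have

$$\mathbb{E}_{s \sim \mu}[V^\pi(s)] = Y^\top \eta^\pi_{\mu,P}. \tag{1}$$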
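The identity between the expected normalized value and the inner product of the reward vector with the state occupancy can be checked numerically. The following is a minimal sketch on a made-up 3-state chain (the transition matrix, reward vector, and function names are illustrative assumptions, not taken from the paper). It assumes the conventions stated in this section: values are normalized by $(1-\gamma)$ and reward is collected after each transition. To stay dependency-free it sums the truncated series for the occupancy instead of using a matrix inverse.

```python
# Sketch: check E_{s~mu}[V^pi(s)] = Y . eta on a hypothetical 3-state chain.
# Assumed conventions: values normalized by (1 - gamma); reward Y(s_t) is
# collected after each transition (t = 1, 2, ...).

def step(dist, P_pi):
    """Push a state distribution one step through the row-stochastic P_pi."""
    n = len(dist)
    return [sum(dist[s] * P_pi[s][s2] for s in range(n)) for s2 in range(n)]

def state_occupancy(mu, P_pi, gamma, horizon=2000):
    """Truncated series (1 - gamma) * sum_{t>=1} gamma^(t-1) * mu^T P_pi^t."""
    eta = [0.0] * len(mu)
    dist, weight = list(mu), 1.0 - gamma
    for _ in range(horizon):
        dist = step(dist, P_pi)  # distribution of s_t
        eta = [e + weight * d for e, d in zip(eta, dist)]
        weight *= gamma
    return eta

def normalized_values(P_pi, Y, gamma, sweeps=2000):
    """Iterate V(s) = sum_s' P_pi[s][s'] * ((1 - gamma) * Y[s'] + gamma * V(s'))."""
    n = len(Y)
    V = [0.0] * n
    for _ in range(sweeps):
        V = [sum(P_pi[s][s2] * ((1 - gamma) * Y[s2] + gamma * V[s2])
                 for s2 in range(n)) for s in range(n)]
    return V

# Hypothetical toy instance (not from the paper).
P_pi = [[0.1, 0.9, 0.0],
        [0.0, 0.2, 0.8],
        [0.5, 0.0, 0.5]]
mu = [1.0, 0.0, 0.0]
Y = [0.0, 0.3, -0.7]
gamma = 0.9

eta = state_occupancy(mu, P_pi, gamma)
V = normalized_values(P_pi, Y, gamma)
lhs = sum(m * v for m, v in zip(mu, V))    # E_{s~mu}[V^pi(s)]
rhs = sum(y * e for y, e in zip(Y, eta))   # Y^T eta
print(abs(lhs - rhs) < 1e-9, abs(sum(eta) - 1.0) < 1e-9)
```

The second printed check reflects that a normalized occupancy vector sums to one, a fact used later when one coordinate of the reward is canonicalized.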


3 Problem setup

Here we define the repeated IRL problem. The human’s reward function $\theta_\star$ captures his/her safety concerns and intrinsic/general preferences. This $\theta_\star$ is unknown to the agent and is the object of interest herein, i.e., if $\theta_\star$ were known to the agent, the concerns addressed in this paper would be solved. We assume that the human cannot directly communicate $\theta_\star$ to the agent but can evaluate the agent’s behavior in a task as well as demonstrate optimal behavior. Each task comes with an external reward function $R$, and the goal is to maximize the reward with respect to $Y := \theta_\star + R$ in each task.

As a concrete example, consider an agent for an autonomous vehicle. In this case, $\theta_\star$ represents the cross-task principles that define good driving (e.g., courtesy towards pedestrians and other vehicles), which are often difficult to explicitly describe. In contrast, $R$, the task-specific reward, could reward the agent for successfully completing parallel parking. While $R$ is easier to construct, it may not completely capture what a human deems good driving. (For example, an agent might successfully parallel park while still boxing in neighboring vehicles.)

More formally, a task is defined by a pair $(E, R)$, where $E = (\mathcal{S}, \mathcal{A}, \mu, P, \gamma)$ is the task environment (i.e., a controlled Markov process) and $R$ is the task-specific reward function (task reward). We assume that all tasks share the same $\mathcal{S}, \mathcal{A}, \gamma$, with $|\mathcal{A}| \ge 2$, but may differ in the initial distribution $\mu$, dynamics $P$, and task reward $R$; all of the task-specifying quantities are known to the agent. In any task, the human’s optimal behavior is always with respect to the reward function $Y = \theta_\star + R$. We emphasize again that $\theta_\star$ is intrinsic to the human and remains the same across all tasks. Our use of task-specific reward functions allows for greater generality than the usual IRL setting, and most of our results apply equally to the case where $R \equiv 0$.

While $\theta_\star$ is private to the human, the agent has some prior knowledge on $\theta_\star$, represented as a set of possible parameters $\Theta_0 \subset \mathbb{R}^{|\mathcal{S}|}$ that contains $\theta_\star$. Throughout, we assume that the human’s reward has bounded and normalized magnitude, that is, $\|\theta_\star\|_\infty \le 1$.

A demonstration in $(E, R)$ reveals $\pi^\star$, optimal for $Y = \theta_\star + R$ under environment $E$, to the agent. A common assumption in the IRL literature is that the full mapping $s \mapsto \pi^\star(s)$ is revealed, which can be unrealistic if some states are unreachable from the initial distribution. We address the issue by requiring only the state occupancy vector $\eta^{\pi^\star}_{\mu,P}$. In Section 7 we show that this also allows an easy extension to the setting where the human only demonstrates trajectories instead of providing a policy.

Under the above framework for repeated IRL, we consider two settings that differ in how the sequence of tasks is chosen. In both settings, we will want to minimize the number of demonstrations needed.

1. (Section 5) Agent chooses the tasks, observes the human’s behavior in each of them, and infers the reward function. In this setting, where the agent is powerful enough to choose tasks arbitrarily, we will show that the agent is able to identify the human’s reward function, which of course implies the ability to generalize to new tasks.

2. (Section 6) Nature chooses the tasks, and the agent proposes a policy in each task. The human demonstrates a policy only if the agent’s policy is significantly suboptimal (i.e., a mistake). In this setting we will derive upper and lower bounds on the number of mistakes our agent will make.

4 The challenge of identifying rewards

Note that it is impossible to identify $\theta_\star$ from watching human behavior in a single task. This is because any $\theta_\star$ is fundamentally indistinguishable from an infinite set of reward functions that yield exactly the policy observed in the task. We introduce the idea of behavioral equivalence below to tease apart two separate issues wrapped up in the challenge of identifying rewards.

Definition 1.

Two reward functions $\theta, \theta' \in \mathbb{R}^{|\mathcal{S}|}$ are behaviorally equivalent in all MDP tasks if, for any $(E, R)$, the sets of optimal policies for $(R + \theta)$ and $(R + \theta')$ are the same.

We argue that the task of identifying the reward function should amount only to identifying the (behavioral) equivalence class to which $\theta_\star$ belongs. In particular, identifying the equivalence class is sufficient to get perfect generalization to new tasks. Any remaining unidentifiability is merely representational and of no real consequence. Next we present a constraint that captures the reward functions that belong to the same equivalence class.

Proposition 1.

Two reward functions $\theta$ and $\theta'$ are behaviorally equivalent in all MDP tasks if and only if $\theta - \theta' = c \cdot \mathbf{1}_{|\mathcal{S}|}$ for some $c \in \mathbb{R}$, where $\mathbf{1}_{|\mathcal{S}|}$ is an all-1 vector of length $|\mathcal{S}|$.

The proof is elementary and deferred to Appendix A. For any class of $\theta$’s that are equivalent to each other, we can choose a canonical element to represent this class. For example, we can fix an arbitrary reference state $s_{\text{ref}} \in \mathcal{S}$, and fix the reward of this state to $0$ for $\theta_\star$ and all candidate $\theta$’s. In the rest of the paper, we will always assume such canonicalization in the MDP setting, hence $\theta_\star \in \Theta_0 \subseteq \{\theta \in [-1, 1]^{|\mathcal{S}|} : \theta(s_{\text{ref}}) = 0\}$.
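The "if" direction of Proposition 1 (a constant shift of the reward leaves the optimal policies unchanged) can be illustrated by brute force on a small MDP. The sketch below is an illustration under stated assumptions, not the paper's proof: the 3-state, 2-action dynamics, the reward vectors, and all function names are hypothetical, and policies are compared by evaluating every deterministic policy under the normalized, reward-after-transition convention.

```python
from itertools import product

def policy_values(P, Y, policy, gamma, sweeps=500):
    """Normalized value of a deterministic policy; P[s][a][s2] are transition probs."""
    n = len(Y)
    V = [0.0] * n
    for _ in range(sweeps):
        V = [sum(P[s][policy[s]][s2] * ((1 - gamma) * Y[s2] + gamma * V[s2])
                 for s2 in range(n)) for s in range(n)]
    return V

def optimal_policies(P, Y, gamma, tol=1e-9):
    """All deterministic policies that are optimal in every state (up to tol)."""
    n, k = len(P), len(P[0])
    values = {pi: policy_values(P, Y, pi, gamma)
              for pi in product(range(k), repeat=n)}
    best = [max(v[s] for v in values.values()) for s in range(n)]
    return {pi for pi, v in values.items()
            if all(v[s] >= best[s] - tol for s in range(n))}

# Hypothetical toy task (not from the paper).
P = [
    [[0.9, 0.1, 0.0], [0.0, 0.5, 0.5]],
    [[0.0, 1.0, 0.0], [0.3, 0.0, 0.7]],
    [[0.2, 0.2, 0.6], [1.0, 0.0, 0.0]],
]
gamma = 0.8
theta = [0.5, -0.2, 0.0]      # candidate human reward
R = [0.1, 0.4, -0.3]          # task reward
c = 0.37                      # constant shift

Y1 = [t + r for t, r in zip(theta, R)]
Y2 = [t + c + r for t, r in zip(theta, R)]
opt_a = optimal_policies(P, Y1, gamma)
opt_b = optimal_policies(P, Y2, gamma)
print(opt_a == opt_b)
```

Under the normalized value function, shifting every state's reward by $c$ shifts every policy's value by the same $c$, so the ranking of policies cannot change; the enumeration simply confirms this on one instance.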

5 Agent chooses the tasks

In this section, the protocol is that the agent chooses a sequence of tasks $\{(E_t, R_t)\}$. For each task $(E_t, R_t)$, the human reveals $\pi^\star_t$, which is optimal for environment $E_t$ and reward function $\theta_\star + R_t$. Our goal is to design an algorithm which chooses $\{(E_t, R_t)\}$ and identifies $\theta_\star$ to a desired accuracy, $\epsilon$, using as few tasks as possible. Theorem 1 shows that a simple algorithm can identify $\theta_\star$ after only $O(\log(1/\epsilon))$ tasks, if any tasks may be chosen. Roughly speaking, the algorithm amounts to a binary search on each component of $\theta_\star$ by manipulating the task reward $R_t$. (Footnote 2: While we present a proof that manipulates $R_t$, an only slightly more complex proof applies to the setting where all the $R_t$ are exactly zero and the manipulation is limited to the environment (Amin and Singh, 2016).) See the proof for the algorithm specification. As noted before, once the agent has identified $\theta_\star$ within an appropriate tolerance, it can compute a sufficiently-near-optimal policy for all tasks, thus completing the generalization objective through the far stronger identification objective in this setting.

Theorem 1.

If $\theta_\star \in \Theta_0 \subseteq \{\theta \in [-1, 1]^{|\mathcal{S}|} : \theta(s_{\text{ref}}) = 0\}$, there exists an algorithm that outputs $\theta \in \mathbb{R}^{|\mathcal{S}|}$ that satisfies $\|\theta - \theta_\star\|_\infty \le \epsilon$ after $O(\log(1/\epsilon))$ demonstrations.

Proof.

The algorithm chooses the following fixed environment in all tasks: for each $s \in \mathcal{S} \setminus \{s_{\text{ref}}\}$, let one action be a self-loop, and the other action transitions to $s_{\text{ref}}$. In $s_{\text{ref}}$, all actions cause self-loops. The initial distribution over states is uniform over $\mathcal{S} \setminus \{s_{\text{ref}}\}$. Each task only differs in the task reward $R_t$ (where $R_t(s_{\text{ref}}) \equiv 0$ always). After observing the state occupancy of the optimal policy, for each $s$ we check if the occupancy is equal to $0$. If so, it means that the demonstrated optimal policy chooses to go to $s_{\text{ref}}$ from $s$ in the first time step, and $\theta_\star(s) + R_t(s) \le \theta_\star(s_{\text{ref}}) + R_t(s_{\text{ref}}) = 0$; if not, we have $\theta_\star(s) + R_t(s) \ge 0$. Consequently, after each task we learn the relationship between $\theta_\star(s)$ and $-R_t(s)$ on each $s \in \mathcal{S} \setminus \{s_{\text{ref}}\}$, so conducting a binary search by manipulating $R_t(s)$ will identify $\theta_\star$ to $\epsilon$-accuracy after $O(\log(1/\epsilon))$ tasks. ∎
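The binary search described in the proof can be sketched in a few lines. This is an illustrative simulation rather than the authors' implementation: `human_stays` is a hypothetical stand-in for reading the demonstrated occupancy (it reports, for each non-reference state, whether the optimal policy self-loops), ties are broken toward leaving, and rewards are assumed to lie in $[-1, 1]$ with the reference state fixed at zero.

```python
def human_stays(theta_star, R):
    """Hypothetical demonstration oracle for the fixed self-loop environment:
    True where the optimal policy self-loops, False where it moves to s_ref.
    In the proof this is read off the demonstrated state occupancy."""
    return [th + r > 0 for th, r in zip(theta_star, R)]

def identify(theta_star, eps):
    """Binary search on every non-reference coordinate in parallel.
    Returns the estimate and the number of tasks (demonstrations) used."""
    d = len(theta_star)
    lo, hi = [-1.0] * d, [1.0] * d
    tasks = 0
    while max(h - l for l, h in zip(lo, hi)) > 2 * eps:
        mid = [(l + h) / 2 for l, h in zip(lo, hi)]
        R = [-m for m in mid]          # task reward asks: is theta(s) > mid(s)?
        stays = human_stays(theta_star, R)
        tasks += 1
        for s in range(d):
            if stays[s]:
                lo[s] = mid[s]         # theta(s) > mid(s)
            else:
                hi[s] = mid[s]         # theta(s) <= mid(s)
    return [(l + h) / 2 for l, h in zip(lo, hi)], tasks

# Hypothetical rewards of the non-reference states.
theta_star = [0.3, -0.85, 0.0, 0.62]
estimate, tasks = identify(theta_star, eps=0.01)
error = max(abs(e - t) for e, t in zip(estimate, theta_star))
print(tasks, error <= 0.01)
```

Because one task probes all coordinates at once, the number of tasks depends only on the accuracy and not on the number of states, which matches the $O(\log(1/\epsilon))$ statement of Theorem 1.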

6 Nature chooses the tasks

While Theorem 1 yields a strong identification guarantee, it also relies on a strong assumption, that $\{(E_t, R_t)\}$ may be chosen by the agent in an arbitrary manner. In this section, we let nature, who is allowed to be adversarial for the purpose of the analysis, choose $\{(E_t, R_t)\}$.

Generally speaking, we cannot obtain identification guarantees in such an adversarial setup. As an example, if $R_t \equiv 0$ and $E_t$ remains the same over time, we are essentially back to the classical IRL setting and suffer from the degeneracy issue. However, generalization to future tasks, which is our ultimate goal, is easy in this special case: after the initial demonstration, the agent can mimic it to behave optimally in all subsequent tasks without requiring further demonstrations. More generally, if nature repeats similar tasks, then the agent obtains little new information, but presumably it knows how to behave in most cases; if nature chooses a task unfamiliar to the agent, then the agent is likely to err, but it may learn about $\theta_\star$ from the mistake.

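To formalize this intuition, we consider the following protocol: nature chooses a sequence of tasks $\{(E_t, R_t)\}$ in an arbitrary manner. For every task $(E_t, R_t)$, the agent proposes a policy $\pi_t$. The human examines the policy’s value under $\mu_t$, and if the loss

$$l_t = \mathbb{E}_{s \sim \mu_t}\left[V^{\pi^\star_t}_{P_t,\, \theta_\star + R_t}(s)\right] - \mathbb{E}_{s \sim \mu_t}\left[V^{\pi_t}_{P_t,\, \theta_\star + R_t}(s)\right] \tag{2}$$

is less than some $\epsilon$ then the human is satisfied and no demonstration is needed; otherwise a mistake is counted and $\eta^{\pi^\star_t}_{\mu_t, P_t}$ is revealed to the agent (note that $\pi^\star_t$ can be computed by the agent if needed from $\eta^{\pi^\star_t}_{\mu_t, P_t}$ and its knowledge of the task). The main goal of this section is to design an algorithm that has a provable guarantee on the total number of mistakes.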

On human supervision  Here we require the human to evaluate the agent’s policies in addition to providing demonstrations. We argue that this is a reasonable assumption because (1) only a binary signal is needed as opposed to the precise value of $l_t$, and (2) if a policy is suboptimal but the human fails to realize it, arguably it should not be treated as a mistake. Meanwhile, we will also provide identification guarantees in Section 6.4, as the human will be relieved from the supervision duty once $\theta_\star$ is identified.

Before describing and analyzing our algorithm, we first notice that Equation 2 can be rewritten as

$$l_t = (\theta_\star + R_t)^\top \left(\eta^{\pi^\star_t}_{\mu_t, P_t} - \eta^{\pi_t}_{\mu_t, P_t}\right) \tag{3}$$

using Equation 1. So effectively, the given environment $E_t$ in each round induces a set of state occupancy vectors $\{\eta^{\pi}_{\mu_t, P_t} : \pi \in (\mathcal{S} \to \mathcal{A})\}$, and we want the agent to choose the vector that has the largest dot product with $\theta_\star + R_t$. The exponential size of the set will not be a concern because our main result (Theorem 2) has no dependence on the number of vectors, and only depends on the dimension of those vectors. The result is enabled by studying the linear bandit version of the problem, which subsumes the MDP setting for our purpose and is also a model of independent interest.

6.1 The linear bandit setting

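In the linear bandit setting, $\mathcal{A}$ is a finite action space with size $|\mathcal{A}| = K$. Each task is denoted as a pair $(X, R)$, where $R \in \mathbb{R}^d$ is the task-specific reward function as before. $X = [x^{(1)} \cdots x^{(K)}]$ is a $d \times K$ feature matrix, where $x^{(i)}$ is the feature vector for the $i$-th action, and $\|x^{(i)}\|_1 \le 1$. When we reduce MDPs to linear bandits, each element of $\mathcal{A}$ corresponds to an MDP policy, and the feature vector is the state occupancy of that policy.

As before, $R, \theta_\star \in \mathbb{R}^d$ are the task reward and the human’s unknown reward, respectively. The initial uncertainty set for $\theta_\star$ is $\Theta_0 \subseteq [-1, 1]^d$. The value of the $i$-th action is calculated as $(\theta_\star + R)^\top x^{(i)}$, and $a^\star$ is the action that maximizes this value. Every round the agent proposes an action $a \in \mathcal{A}$, whose loss is defined as $l_t = (\theta_\star + R)^\top (x^{a^\star} - x^{a})$, where $x^{a}$ denotes the feature vector of action $a$.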
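To make the linear bandit quantities concrete, the following minimal sketch computes action values and the loss of a proposed action as the gap to the best action under the combined reward $\theta_\star + R$. The feature vectors, rewards, and function names are hypothetical and chosen only for illustration; each feature vector has $\ell_1$ norm at most one.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def best_action(X, theta_star, R):
    """Index of the action maximizing (theta_star + R)^T x; X holds one feature
    vector per action."""
    reward = [t + r for t, r in zip(theta_star, R)]
    return max(range(len(X)), key=lambda i: dot(reward, X[i]))

def loss(X, theta_star, R, a):
    """Value gap between the best action and the proposed action a."""
    reward = [t + r for t, r in zip(theta_star, R)]
    a_star = best_action(X, theta_star, R)
    return dot(reward, X[a_star]) - dot(reward, X[a])

# Hypothetical task with d = 2 and K = 3 actions.
X = [[0.5, 0.5], [1.0, 0.0], [0.0, -0.5]]
theta_star = [0.2, -0.4]   # human's hidden reward
R = [0.1, 0.1]             # task reward

print(best_action(X, theta_star, R))
print([round(loss(X, theta_star, R, a), 6) for a in range(len(X))])
```

With a mistake threshold of, say, $\epsilon = 0.2$, proposing action 0 in this toy task would count as a mistake while proposing action 2 would not.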

We now show how to embed the previous MDP setting in linear bandits.

Example 1.

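Given an MDP problem with variables $\mathcal{S}, \mathcal{A}, \gamma, \theta_\star, s_{\text{ref}}, \Theta_0, \{(E_t, R_t)\}$, we can convert it into a linear bandit problem as follows: (all variables with prime belong to the linear bandit problem, and we use $v^{\setminus i}$ to denote the vector $v$ with the $i$-th coordinate removed)

  • $d = |\mathcal{S}| - 1$, $\theta_\star' = \theta_\star^{\setminus s_{\text{ref}}}$, $\Theta_0' = \{\theta^{\setminus s_{\text{ref}}} : \theta \in \Theta_0\}$.

  • $x_t^{\pi} = (\eta^{\pi}_{\mu_t, P_t})^{\setminus s_{\text{ref}}}$. $R_t' = R_t^{\setminus s_{\text{ref}}} - R_t(s_{\text{ref}}) \cdot \mathbf{1}_d$.

Note that there is a more straightforward conversion by letting $d = |\mathcal{S}|$, $\theta_\star' = \theta_\star$, $R_t' = R_t$, and $x_t^{\pi} = \eta^{\pi}_{\mu_t, P_t}$, which also preserves losses. We perform a more succinct conversion in Example 1 by canonicalizing both $\theta_\star$ (already assumed) and $R_t$ (explicitly done here) and dropping the coordinate for $s_{\text{ref}}$ in all relevant vectors.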
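Why canonicalizing the task reward and dropping the reference coordinate preserves losses can be seen from a short derivation. The sketch below assumes only that every normalized state occupancy vector sums to one (as in Section 2) and that $\theta_\star(s_{\text{ref}}) = 0$; here $\eta^\star$ and $\eta$ are shorthand, introduced for this illustration, for the occupancies of the demonstrated and the proposed policy.

```latex
\begin{align*}
\mathbf{1}^\top \eta^\star = \mathbf{1}^\top \eta = 1
  &\;\Longrightarrow\; \mathbf{1}^\top (\eta^\star - \eta) = 0, \\
(\theta_\star + R)^\top (\eta^\star - \eta)
  &= \bigl(\theta_\star + R - R(s_{\text{ref}})\,\mathbf{1}\bigr)^\top (\eta^\star - \eta).
\end{align*}
```

The shifted reward $\theta_\star + R - R(s_{\text{ref}})\,\mathbf{1}$ is zero at $s_{\text{ref}}$, so that coordinate contributes nothing to the inner product and can be removed from the reward and occupancy vectors without changing the loss.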

MDPs with linear rewards

In the IRL literature, a generalization of the MDP setting is often considered, in which the reward is linear in state features $\phi(s) \in \mathbb{R}^d$ (Ng and Russell, 2000; Abbeel and Ng, 2004). In this new setting, $\theta_\star$ and $R$ are reward parameters, and the actual reward is $(\theta_\star + R)^\top \phi(s)$. This new setting can also be reduced to linear bandits similarly to Example 1, except that the state occupancy is replaced by the discounted sum of expected feature values. Our main result, Theorem 2, will still apply automatically, but now the guarantee will only depend on the dimension of the feature space and has no dependence on $|\mathcal{S}|$. We include the conversion below but do not further discuss this setting in the rest of the paper.

Example 2.

Consider an MDP problem with state features, defined by , where task reward and background reward in state are and respectively, and . Suppose always holds, then we can convert it into a linear bandit problem as follows: . , , and remain the same. . Note that the division of in is for the purpose of normalization, so that .

6.2 Ellipsoid Algorithm for Repeated Inverse Reinforcement Learning

We propose Algorithm 1, and provide the mistake bound in the following theorem.

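1:  Input: $\Theta_0$.
2:  $\Theta_1 := \mathrm{MVEE}(\Theta_0)$.
3:  for $t = 1, 2, \ldots$ do
4:     Nature reveals $(X_t, R_t)$.
5:     Learner plays $a_t = \arg\max_{a \in \mathcal{A}} (c_t + R_t)^\top x_t^{a}$, where $c_t$ is the center of $\Theta_t$. $\Theta_{t+1} := \Theta_t$.
6:     if $l_t > \epsilon$ then
7:        Human reveals $a_t^\star$. $\Theta_{t+1} := \mathrm{MVEE}(\{\theta \in \Theta_t : (\theta - c_t)^\top (x_t^{a_t^\star} - x_t^{a_t}) \ge 0\})$.
8:     end if
9:  end for
Algorithm 1 Ellipsoid Algorithm for Repeated Inverse Reinforcement Learning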
Theorem 2.

For $\Theta_0 = [-1, 1]^d$, the number of mistakes made by Algorithm 1 is guaranteed to be $O(d^2 \log(d/\epsilon))$.

To prove Theorem 2, we quote a result from linear programming literature in Lemma 1, which is found in standard lecture notes (e.g., (O’Donnell, 2011), Theorem 8.8; see also (Grötschel et al., 2012), Lemma 3.1.34).

Lemma 1 (Volume reduction in ellipsoid algorithm).

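Given any non-degenerate ellipsoid $B$ in $\mathbb{R}^d$ centered at $c \in \mathbb{R}^d$, and any non-zero vector $v \in \mathbb{R}^d$, let $B^+$ be the minimum-volume enclosing ellipsoid (MVEE) of $\{u \in B : (u - c)^\top v \ge 0\}$. We have $\frac{\mathrm{vol}(B^+)}{\mathrm{vol}(B)} \le e^{-\frac{1}{2(d+1)}}$.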

Proof of Theorem 2.

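Whenever a mistake is made, we can induce the constraint $(\theta_\star + R_t)^\top (x_t^{a_t^\star} - x_t^{a_t}) > \epsilon$. Meanwhile, since $a_t$ is greedy w.r.t. $c_t$, we have $(c_t + R_t)^\top (x_t^{a_t^\star} - x_t^{a_t}) \le 0$, where $c_t$ is the center of $\Theta_t$ as in Line 5. Taking the difference of the two inequalities, we obtain

$$(\theta_\star - c_t)^\top (x_t^{a_t^\star} - x_t^{a_t}) > \epsilon. \tag{4}$$

Therefore, the update rule on Line 7 of Algorithm 1 preserves $\theta_\star$ in $\Theta_{t+1}$. Since the update makes a central cut through the ellipsoid, Lemma 1 applies and the volume shrinks every time a mistake is made. To prove the theorem, it remains to upper bound the initial volume and lower bound the terminal volume of $\Theta_t$. We first show that an update never eliminates $B_\infty(\theta_\star, \epsilon/2)$, the $\ell_\infty$ ball centered at $\theta_\star$ with radius $\epsilon/2$. This is because any eliminated $\theta$ satisfies $(\theta - c_t)^\top (x_t^{a_t^\star} - x_t^{a_t}) < 0$. Combining this with Equation 4, we have

$$\epsilon < (\theta_\star - \theta)^\top (x_t^{a_t^\star} - x_t^{a_t}) \le \|\theta_\star - \theta\|_\infty \, \|x_t^{a_t^\star} - x_t^{a_t}\|_1 \le 2 \|\theta_\star - \theta\|_\infty.$$

The last step follows from $\|x\|_1 \le 1$. We conclude that any eliminated $\theta$ should be $\epsilon/2$ far away from $\theta_\star$ in $\ell_\infty$ distance. Hence, we can lower bound the volume of $\Theta_t$ for any $t$ by that of $\Theta_0 \cap B_\infty(\theta_\star, \epsilon/2)$, which contains an $\ell_\infty$ ball with radius $\epsilon/4$ at its smallest (when $\theta_\star$ is one of $\Theta_0$’s vertices). To simplify calculation, we relax this lower bound (volume of the $\ell_\infty$ ball) to the volume of the inscribed $\ell_2$ ball.

Finally we put everything together: let $M_T$ be the number of mistakes made from round $1$ to $T$, $C_d$ be the volume of the unit hypersphere in $\mathbb{R}^d$ (i.e., the $\ell_2$ ball with radius $1$), and $\mathrm{vol}(\cdot)$ denote the volume of an ellipsoid. We have

$$\frac{M_T}{2(d+1)} \le \log \mathrm{vol}(\Theta_1) - \log \mathrm{vol}(\Theta_{T+1}) \le \log\left(C_d (\sqrt{d})^d\right) - \log\left(C_d (\epsilon/4)^d\right) = d \log \frac{4\sqrt{d}}{\epsilon}.$$

So $M_T \le 2d(d+1) \log \frac{4\sqrt{d}}{\epsilon} = O(d^2 \log \frac{d}{\epsilon})$. ∎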
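The ellipsoid procedure analyzed above can be simulated on random linear bandit tasks. The following is a hedged sketch rather than the authors' code: it assumes the learner is greedy with respect to the ellipsoid center plus the task reward, uses the standard central-cut update for the minimum-volume enclosing ellipsoid of a half-ellipsoid (valid for $d \ge 2$), starts from the ball of radius $\sqrt{d}$ that encloses $[-1, 1]^d$, and draws tasks at random instead of adversarially. All names (`central_cut`, `run`) and the toy parameters are hypothetical.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mat_vec(A, v):
    return [dot(row, v) for row in A]

def central_cut(c, A, v):
    """MVEE of {theta in E : (theta - c)^T v >= 0}, where
    E = {theta : (theta - c)^T A^{-1} (theta - c) <= 1} and d >= 2."""
    d = len(c)
    Av = mat_vec(A, v)
    b = [x / math.sqrt(dot(v, Av)) for x in Av]
    c_new = [ci + bi / (d + 1) for ci, bi in zip(c, b)]
    scale = d * d / (d * d - 1.0)
    A_new = [[scale * (A[i][j] - 2.0 / (d + 1) * b[i] * b[j])
              for j in range(d)] for i in range(d)]
    return c_new, A_new

def run(theta_star, eps, rounds, n_actions=5, seed=0):
    """Simulate the protocol on random tasks; returns (mistakes, final center)."""
    rng = random.Random(seed)
    d = len(theta_star)
    c = [0.0] * d
    # Ball of radius sqrt(d): the MVEE of the cube [-1, 1]^d.
    A = [[float(d) if i == j else 0.0 for j in range(d)] for i in range(d)]
    mistakes = 0
    for _ in range(rounds):
        X = []
        for _ in range(n_actions):           # features with l1 norm equal to 1
            x = [rng.uniform(-1, 1) for _ in range(d)]
            norm = sum(abs(t) for t in x)
            X.append([t / norm for t in x])
        R = [rng.uniform(-1, 1) for _ in range(d)]
        guess = [ci + ri for ci, ri in zip(c, R)]
        truth = [ti + ri for ti, ri in zip(theta_star, R)]
        agent = max(range(n_actions), key=lambda a: dot(guess, X[a]))
        human = max(range(n_actions), key=lambda a: dot(truth, X[a]))
        v = [p - q for p, q in zip(X[human], X[agent])]
        if dot(truth, v) > eps:               # loss exceeds tolerance: a mistake
            mistakes += 1
            c, A = central_cut(c, A, v)       # keep the half containing theta_star
    return mistakes, c

theta_star = [0.4, -0.7, 0.1]                 # hypothetical hidden reward
eps, d = 0.1, 3
mistakes, center = run(theta_star, eps, rounds=2000)
bound = 2 * d * (d + 1) * math.log(4 * math.sqrt(d) / eps)
print(mistakes, mistakes <= bound)
```

The comparison against `bound` mirrors the volume argument in the proof; random tasks are typically far easier than adversarial ones, so the simulated count says nothing about tightness.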

6.3 Lower bound

In Section 5, we get an upper bound on the number of demonstrations, which has no dependence on $|\mathcal{S}|$ (which corresponds to $d + 1$ in linear bandits). Comparing Theorem 2 to Theorem 1, one may wonder whether the polynomial dependence on $d$ is an artifact of the inefficiency of Algorithm 1. We clarify this issue by proving a lower bound, showing that $\Omega(d \log(1/\epsilon))$ mistakes are inevitable in the worst case when nature chooses the tasks. We provide a proof sketch below, and the complete proof is deferred to Appendix E.

Theorem 3.

For any randomized algorithm in the linear bandit setting, there always exists $\theta_\star \in [-1, 1]^d$ and an adversarial sequence of $\{(X_t, R_t)\}$ that potentially adapts to the algorithm’s previous decisions, such that the expected number of mistakes made by the algorithm is $\Omega(d \log(1/\epsilon))$. (Footnote 3: While our Algorithm 1 is deterministic, randomization is often crucial for online learning in general (Shalev-Shwartz, 2011).)

Proof Sketch.

We randomize $\theta_\star$ by sampling each element i.i.d. from $\mathrm{Unif}([-1, 1])$. We will prove that there exists a strategy of choosing $(X_t, R_t)$ such that any algorithm’s expected number of mistakes is $\Omega(d \log(1/\epsilon))$, which proves the theorem as max is no less than average.

In our construction, $X_t = [\mathbf{0}_d, e_{j_t}]$, where $j_t$ is some index to be specified. Hence, every round the agent is essentially asked to decide whether $\theta_\star(j_t) \ge -R_t(j_t)$. The adversary’s strategy goes in phases, and $R_t$ remains the same during each phase. Every phase has $d$ rounds where $j_t$ is enumerated over $\{1, \ldots, d\}$.

The adversary will use $R_t$ to shift the posterior on $\theta_\star(j_t) + R_t(j_t)$ so that it is centered around the origin; in this way, the agent has about $1/2$ probability to make an error (regardless of the algorithm), and the posterior interval will be halved. Overall, the agent makes $d/2$ mistakes in each phase, and there will be about $\log(1/\epsilon)$ phases in total, which gives the lower bound. ∎

Applying the lower bound to MDPs   The above lower bound is stated for linear bandits. In principle, we need to prove a lower bound for MDPs separately, because linear bandits are more general than MDPs for our purpose, and the hard instances in linear bandits may not have corresponding MDP instances. In Lemma 2 below, we show that a certain type of linear bandit instance can always be emulated by MDPs with the same number of actions, and the hard instances constructed in Theorem 3 indeed satisfy the conditions for such a type; in particular, we require the feature vectors to be non-negative and have $\ell_1$ norm bounded by $1$. As a corollary, an $\Omega(|\mathcal{S}| \log(1/\epsilon))$ lower bound for the MDP setting (even with a small action space $|\mathcal{A}| = 2$) follows directly from Theorem 3. The proof of Lemma 2 is deferred to Appendix B.

Lemma 2 (Linear bandit to MDP conversion).

Let $(X, R)$ be a linear bandit task, and $K$ be the number of actions. If every $x^{(a)}$ is non-negative and $\|x^{(a)}\|_1 \le 1$, then there exists an MDP task $(E, R')$ with $d + 1$ states and $K$ actions, such that under some choice of $s_{\text{ref}}$, converting $(E, R')$ as in Example 1 recovers the original problem.

6.4 On identification when nature chooses tasks

While Theorem 2 successfully controls the number of total mistakes, it completely avoids the identification problem and does not guarantee recovery of $\theta_\star$. In this section we explore further conditions under which we can obtain identification guarantees when nature chooses the tasks.

The first condition, stated in Proposition 2, implies that if we have made all the possible mistakes, then we have indeed identified $\theta_\star$, where the identification accuracy is determined by the tolerance parameter $\epsilon$ that defines what is counted as a mistake. Due to space limits, the proof is deferred to Appendix C.

Proposition 2.

Consider the linear bandit setting. If there exists such that for any round , no more mistakes can be ever made by the algorithm for any choice of and any tie-breaking mechanism, then we have .

While the above proposition shows that identification is guaranteed if the agent exhausts the mistakes, the agent has no ability to actively fulfill this condition when nature chooses tasks. For a stronger identification guarantee, we may need to grant the agent some freedom in choosing the tasks.

Identification with fixed environment   Here we consider a setting that fits in between Section 5 (completely active) and Section 6.1 (completely passive), where the environment $E$ (hence the induced feature vectors $\{x^{(1)}, x^{(2)}, \ldots, x^{(K)}\}$) is given and fixed, and the agent can arbitrarily choose the task reward $R_t$. The goal is to obtain an identification guarantee in this intermediate setting.

Unfortunately, a degenerate case can be easily constructed that prevents the revelation of any information about $\theta_\star$. In particular, if $x^{(1)} = x^{(2)} = \cdots = x^{(K)}$, i.e., the environment is completely uncontrolled, then all actions are equally optimal and nothing can be learned. More generally, if for some $v \ne 0$ we have $v^\top x^{(1)} = v^\top x^{(2)} = \cdots = v^\top x^{(K)}$, then we may never recover $\theta_\star$ along the direction of $v$. In fact, Proposition 1 can be viewed as an instance of this result where $v = \mathbf{1}_{|\mathcal{S}|}$ (recall that $\mathbf{1}_{|\mathcal{S}|}^\top \eta^{\pi}_{\mu,P} \equiv 1$), and that is why we have to remove such redundancy in Example 1 in order to discuss identification in MDPs. Therefore, to guarantee identification in a fixed environment, the feature vectors must have significant variation in all directions, and we capture this intuition by defining a diversity score (Definition 2) and showing that the identification accuracy depends inversely on the score (Theorem 4).

Definition 2.

Given the feature matrix whose size is , define as the -th largest singular value of


Theorem 4.

For a fixed feature matrix , if , then there exists a sequence with and a sequence of tie-break choices of the algorithm, such that after round we have

The proof is deferred to Appendix D. The dependence in Theorem 4 may be of concern as can be exponentially large. However, Theorem 4 also holds if we replace by any matrix that consists of ’s columns, so we may choose a small yet most diverse set of columns as to optimize the bound.

7 Working with trajectories

1:  Input: .  
2:  , , , .
3:  for  do
4:     Nature reveals . Agent rolls-out a trajectory using greedily w.r.t. .
5:     .
6:     if agent takes in with  then
7:        Human produces an -step trajectory from . Let the empirical state occupancy be .
8:        , .
9:        Let be the state occupancy of from initial state , and .
10:        if  then
11:              , , .
12:        end if
13:     end if
14:  end for
Algorithm 2 Trajectory version of Algorithm 1 for MDPs

In previous sections, we have assumed that the human evaluates the agent’s performance based on the state occupancy of the agent’s policy, and demonstrates the optimal policy in terms of state occupancy as well. In practice, we would like to instead assume that for each task, the agent rolls out a trajectory, and the human shows an optimal trajectory if he/she finds the agent’s trajectory unsatisfying. We are still concerned about upper bounding the number of total mistakes, and aim to provide a parallel version of Theorem 2.

Unlike in traditional IRL, in our setting the agent is also acting, which gives rise to many subtleties. First, the total reward on the agent’s single trajectory is a random variable, and may deviate from the expected value of its policy. Therefore, it is generally impossible to decide if the agent’s policy is near-optimal, and instead we assume that the human can check if each action that the agent takes in the trajectory is near-optimal: when the agent takes $a$ at state $s$, an error is counted if and only if $Q^\star(s, a) < V^\star(s) - \epsilon$. This criterion can be viewed as a noisy version of the one used in previous sections, as taking expectation of $V^\star(s) - Q^\star(s, \pi(s))$ over the occupancy induced by $\pi$ will recover Equation 2.

While this resolves the issue on the agent’s side, how should the human provide his/her optimal trajectory? The most straightforward protocol is that the human rolls out a trajectory from the initial distribution of the task, $\mu_t$. We argue that this is not a reasonable protocol for two reasons: (1) in expectation, the reward collected by the human may be less than that by the agent, because conditioning on the event that an error is spotted may introduce a selection bias; (2) the human may not encounter the problematic state in his/her own trajectory, hence the information provided in the trajectory may be irrelevant.

To resolve this issue, we consider a different protocol where the human rolls out a trajectory using an optimal policy from the very state where the agent errs.

Now we discuss how we can prove a parallel of Theorem 2 under this new protocol. First, let’s assume that the demonstration were still given in the form of a state occupancy vector starting at the problematic state. In this case, we can reduce to the setting of Section 6 by changing $\mu_t$ to a point mass on the problematic state. (Footnote 4: At first glance this might seem suspicious: the problematic state is random and depends on the learner’s current policy, but in RL the initial distribution is usually fixed and the learner has no control over it. This concern is removed thanks to our adversarial setup on $(E_t, R_t)$ (of which $\mu_t$ is a component).) To apply the algorithm and the analysis in Section 6, it remains to show that the notion of error in this section (a suboptimal action) implies the notion of error in Section 6 (a suboptimal policy): let $s$ be the problematic state and $\pi$ be the agent’s policy, we have $V^\pi(s) = Q^\pi(s, \pi(s)) \le Q^\star(s, \pi(s)) < V^\star(s) - \epsilon$. So whenever a suboptimal action is spotted in state $s$, it indeed implies that the agent’s policy is suboptimal for $s$ as the initial state. Hence, we can run Algorithm 1 as-is and Theorem 2 immediately applies.

To tackle the remaining issue that the demonstration is in terms of a single trajectory, we will not update $\Theta_t$ after each mistake as in Algorithm 1, but only make an update after every mini-batch of mistakes, and aggregate them to form accurate update rules. See Algorithm 2. The formal guarantee of the algorithm is stated in Theorem 5, whose proof is deferred to Appendix G.
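The idea of aggregating noisy single-trajectory demonstrations over a mini-batch can be sketched as follows. The exact estimator, horizon, and batch size used by Algorithm 2 are not recoverable from the text here, so this is only a generic illustration under stated assumptions: each trajectory lists the states visited after each transition, its empirical occupancy is the discounted visit count normalized by $(1-\gamma)$, and a batch is summarized by a plain average. The function names are hypothetical.

```python
def empirical_occupancy(trajectory, n_states, gamma):
    """Discounted visit counts of an H-step trajectory s_1, ..., s_H (states
    reached after each transition), normalized by (1 - gamma)."""
    z = [0.0] * n_states
    weight = 1.0 - gamma
    for s in trajectory:
        z[s] += weight
        weight *= gamma
    return z

def batch_average(vectors):
    """Coordinate-wise mean of a mini-batch of occupancy estimates."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

# Two hypothetical 3-step demonstrations over 3 states.
batch = [empirical_occupancy(traj, n_states=3, gamma=0.5)
         for traj in ([1, 2, 2], [1, 1, 0])]
avg = batch_average(batch)
print(batch)
print(avg)
```

A finite horizon $H$ truncates the occupancy, so each estimate sums to $1 - \gamma^H$ rather than one; averaging over a batch reduces the sampling noise of individual trajectories but not this truncation bias.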

Theorem 5.

, with probability at least , the number of mistakes made by Algorithm 2 with parameters and where , is at most . (Footnote 5: Here we use the simpler conversion explained right after Example 1. We can certainly improve the dimension to by dropping the coordinate in all relevant vectors but that complicates presentation.) (Footnote 6: A term is suppressed in .)

8 Related work & Conclusions

Most existing work in IRL focused on inferring the reward function using data acquired from a fixed environment (Ng and Russell, 2000; Abbeel and Ng, 2004; Coates et al., 2008; Ziebart et al., 2008; Ramachandran and Amir, 2007; Syed and Schapire, 2007; Regan and Boutilier, 2010). (Footnote 7: While we do not discuss it here, in the economics literature, the problem of inferring an agent’s utility from behavior-queries has long been studied under the heading of utility or preference elicitation (Chajewska et al., 2000; Von Neumann and Morgenstern, 2007; Regan and Boutilier, 2009, 2011; Rothkopf and Dimitrakakis, 2011). While our result in Section 5 uses similar techniques to elicit the reward function, we do so purely by observing the human’s behavior without external source of information (e.g., query responses).) There is prior work on using data collected from multiple, but exogenously fixed, environments to predict agent behavior (Ratliff et al., 2006). There are also applications where methods for single-environment MDPs have been adapted to multiple environments (Ziebart et al., 2008). Nevertheless, all these works consider the objective of mimicking an optimal behavior in the presented environment(s), and do not aim at generalization to new tasks, which is the main contribution of this paper. Recently, Hadfield-Menell et al. (2016) proposed cooperative inverse reinforcement learning, where the human and the agent act in the same environment, allowing the human to actively resolve the agent’s uncertainty on the reward function. However, they only consider a single environment (or task), and the unidentifiability issue of IRL still exists. Combining their framework with our resolution to unidentifiability (by multiple tasks) can be an interesting future direction.


This work was supported in part by NSF grant IIS 1319365 (Singh & Jiang) and in part by a Rackham Predoctoral Fellowship from the University of Michigan (Jiang). Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the views of the sponsors.


  • Abbeel and Ng (2004) Pieter Abbeel and Andrew Y Ng. Apprenticeship Learning via Inverse Reinforcement Learning. In Proceedings of the 21st International Conference on Machine learning, page 1. ACM, 2004.
  • Abbeel et al. (2007) Pieter Abbeel, Adam Coates, Morgan Quigley, and Andrew Y Ng. An application of reinforcement learning to aerobatic helicopter flight. Advances in neural information processing systems, 19:1, 2007.
  • Amin and Singh (2016) Kareem Amin and Satinder Singh. Towards resolving unidentifiability in inverse reinforcement learning. arXiv preprint arXiv:1601.06569, 2016.
  • Amodei et al. (2016) Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.
  • Bostrom (2003) Nick Bostrom. Ethical issues in advanced artificial intelligence. Science Fiction and Philosophy: From Time Travel to Superintelligence, pages 277–284, 2003.
  • Chajewska et al. (2000) Urszula Chajewska, Daphne Koller, and Ronald Parr. Making rational decisions using adaptive utility elicitation. In AAAI/IAAI, pages 363–369, 2000.
  • Coates et al. (2008) Adam Coates, Pieter Abbeel, and Andrew Y Ng. Learning for control from multiple demonstrations. In Proceedings of the 25th International Conference on Machine Learning, pages 144–151. ACM, 2008.
  • Grötschel et al. (2012) Martin Grötschel, László Lovász, and Alexander Schrijver. Geometric algorithms and combinatorial optimization, volume 2. Springer Science & Business Media, 2012.
  • Hadfield-Menell et al. (2016) Dylan Hadfield-Menell, Stuart J Russell, Pieter Abbeel, and Anca Dragan. Cooperative inverse reinforcement learning. In Advances in Neural Information Processing Systems, pages 3909–3917, 2016.
  • Ng and Russell (2000) Andrew Y Ng and Stuart J Russell. Algorithms for inverse reinforcement learning. In Proceedings of the 17th International Conference on Machine Learning, pages 663–670, 2000.
  • O’Donnell (2011) Ryan O’Donnell. 15-859(E) – linear and semidefinite programming: lecture notes. Carnegie Mellon University, 2011.
  • Ramachandran and Amir (2007) Deepak Ramachandran and Eyal Amir. Bayesian inverse reinforcement learning. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI), 2007.
  • Ratliff et al. (2006) Nathan D Ratliff, J Andrew Bagnell, and Martin A Zinkevich. Maximum margin planning. In Proceedings of the 23rd International Conference on Machine Learning, pages 729–736. ACM, 2006.
  • Regan and Boutilier (2009) Kevin Regan and Craig Boutilier. Regret-based reward elicitation for Markov decision processes. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 444–451. AUAI Press, 2009.
  • Regan and Boutilier (2010) Kevin Regan and Craig Boutilier. Robust policy computation in reward-uncertain MDPs using nondominated policies. In AAAI, 2010.
  • Regan and Boutilier (2011) Kevin Regan and Craig Boutilier. Eliciting additive reward functions for Markov decision processes. In IJCAI Proceedings-International Joint Conference on Artificial Intelligence, volume 22, page 2159, 2011.
  • Rothkopf and Dimitrakakis (2011) Constantin A Rothkopf and Christos Dimitrakakis. Preference elicitation and inverse reinforcement learning. In Machine Learning and Knowledge Discovery in Databases, pages 34–48. Springer, 2011.
  • Russell et al. (2015) Stuart Russell, Daniel Dewey, and Max Tegmark. Research priorities for robust and beneficial artificial intelligence. AI Magazine, 36(4):105–114, 2015.
  • Shalev-Shwartz (2011) Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2011.
  • Syed and Schapire (2007) Umar Syed and Robert E Schapire. A game-theoretic approach to apprenticeship learning. In Advances in Neural Information Processing Systems, pages 1449–1456, 2007.
  • Von Neumann and Morgenstern (2007) John Von Neumann and Oskar Morgenstern. Theory of games and economic behavior (60th Anniversary Commemorative Edition). Princeton University Press, 2007.
  • Ziebart et al. (2008) Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In AAAI, pages 1433–1438, 2008.


Appendix A Proof of Proposition 1

To show that implies behavioral equivalence, we note that for any policy the occupancy vector always satisfies , so , and therefore the set of optimal policies is the same.

To show the other direction, we prove that if , then there exists such that the sets of optimal policies differ. In particular, we choose , so that all policies are optimal under . Since , there exist states and such that . Suppose is the one with the smaller sum of rewards; then we can make an absorbing state, and have two deterministic actions in that transition to and respectively. Under , the self-loop in state is suboptimal, and this completes the proof. ∎
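The first direction of the argument above (a constant shift of the rewards leaves the set of optimal policies unchanged) can be checked numerically. The following is a minimal sketch rather than the construction used in the paper: it assumes a small random MDP with state-based rewards, a discount factor of 0.9, and a shift of 2.5, and the names `random_mdp`, `optimal_policy`, and `theta` are hypothetical choices for this illustration only.

```python
import random

def random_mdp(n_states, n_actions, rng):
    """Random transition model: P[s][a] is a probability vector over next states."""
    P = []
    for _ in range(n_states):
        per_action = []
        for _ in range(n_actions):
            w = [rng.random() for _ in range(n_states)]
            total = sum(w)
            per_action.append([x / total for x in w])
        P.append(per_action)
    return P

def expected_next(P, V, s, a):
    """Expected next-state value after taking action a in state s."""
    return sum(p * v for p, v in zip(P[s][a], V))

def optimal_policy(P, reward, gamma=0.9, iters=500):
    """Value iteration with state-based rewards; returns the greedy policy."""
    n_states, n_actions = len(P), len(P[0])
    V = [0.0] * n_states
    for _ in range(iters):
        V = [reward[s] + gamma * max(expected_next(P, V, s, a)
                                     for a in range(n_actions))
             for s in range(n_states)]
    return [max(range(n_actions), key=lambda a: expected_next(P, V, s, a))
            for s in range(n_states)]

rng = random.Random(0)                      # fixed seed (assumption, for repeatability)
P = random_mdp(5, 3, rng)
theta = [rng.uniform(-1.0, 1.0) for _ in range(5)]
theta_shifted = [r + 2.5 for r in theta]    # same rewards plus a constant

policy_theta = optimal_policy(P, theta)
policy_shifted = optimal_policy(P, theta_shifted)
print(policy_theta == policy_shifted)
```

Because the shift adds the same amount to the value of every policy, the greedy action in each state is unaffected; exact ties between actions would need separate handling, which this sketch does not attempt.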

Appendix B Proof of Lemma 2

The construction is as follows. Choose as the initial state, and make all other states absorbing. Let and restricted to coincide with . The remaining work is to design the transition distribution of each action in so that the induced state occupancy matches exactly one column of .

Fix any action , and let be the feature that we want to associate with it. The next-state distribution of is as follows: with probability the next state is itself, and the probability of transitioning to the -th state in is . Given and , it is easy to verify that this is a valid distribution.

Now we calculate the occupancy of policy . The normalized occupancy on is

The remaining occupancy, with a total mass of , is split among proportional to . Therefore, when we convert the MDP problem as in Example 1, the corresponding feature vector is exactly , so we recover the original linear bandit problem. ∎
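The structure of this occupancy calculation can be illustrated numerically. The sketch below is not the paper's exact construction: it assumes a hypothetical self-loop probability `p_stay`, a hypothetical target distribution `q` over the absorbing states, a discount factor `gamma`, and the normalized occupancy convention of (1 - gamma) times the discounted sum of state probabilities starting at time 0. Under these assumptions the mass on the initial state is (1 - gamma) / (1 - gamma * p_stay), and the rest is split among the absorbing states in proportion to `q`.

```python
def normalized_occupancy(p_stay, q, gamma, horizon=2000):
    """Normalized discounted state occupancy for a single action.

    State 0 is the initial state. With probability p_stay the process stays
    in state 0; otherwise it jumps to absorbing state j + 1 with
    probability q[j]. All of these quantities are illustrative assumptions.
    """
    k = len(q)
    dist = [1.0] + [0.0] * k          # state distribution at the current step
    occ = [0.0] * (k + 1)
    weight = 1.0 - gamma              # (1 - gamma) * gamma ** t
    for _ in range(horizon):
        for s in range(k + 1):
            occ[s] += weight * dist[s]
        leaving = dist[0] * (1.0 - p_stay)
        dist = [dist[0] * p_stay] + [dist[j + 1] + leaving * q[j]
                                     for j in range(k)]
        weight *= gamma
    return occ

gamma, p_stay, q = 0.9, 0.5, [0.2, 0.3, 0.5]
occ = normalized_occupancy(p_stay, q, gamma)
mass_on_start = (1.0 - gamma) / (1.0 - gamma * p_stay)

print(occ[0], mass_on_start)
print([occ[j + 1] / (1.0 - mass_on_start) for j in range(len(q))])  # close to q
```

Choosing the self-loop probability therefore controls how much occupancy stays on the initial state, while the jump distribution controls the shape of the remainder, which is the lever the proof uses to match a prescribed feature vector.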

Appendix C Proof of Proposition 2

Assume towards contradiction that . We will choose to make the algorithm err. In particular, let , so that the algorithm acts greedily with respect to . Since , any action would be a valid choice for the algorithm.

On the other hand, implies that there exists a coordinate such that where is a basis vector. Let and . So the value of action is always under any reward function (including ), and the value of action is whose absolute value is greater than . At least one of the two actions is more than suboptimal, and the algorithm may take either of them, so the algorithm can err again. ∎

Appendix D Proof of Theorem 4

It suffices to show that in any round , if , then . The bound on follows directly from Theorem 2. Similar to the proof of Proposition 2, our choice of the task reward is , so that any would be a valid choice of , and we will choose the worst action. Note that ,

So it suffices to show that there exists , such that . Let , and the precondition implies that .

Define a matrix of size , where each column contains exactly one and one (the remaining entries are ), and the columns enumerate all possible positions of them. With the help of this matrix, we can rewrite the desired result (, s.t. ) as . We relax the left-hand side as , and will provide a lower bound on . Note that

because every row of is some multiple of (recall Definition 2), and every column of is orthogonal to . Let be the vector normalized to unit length,

We lower bound each of the three terms. For the first term, we have the precondition . The second term is left-multiplied by a unit vector, so its norm can be lower bounded by the smallest non-zero singular value of (recall that is full-rank), which is .
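To make the matrix defined earlier in this proof concrete, here is a small sketch of one natural reading of it. The specific entries are an assumption, since the symbols are not legible in this copy: each column is taken to hold a single 1 and a single -1 with zeros elsewhere, with one column per ordered pair of distinct coordinates, giving d(d - 1) columns. The function name `difference_matrix` is hypothetical.

```python
from itertools import permutations

def difference_matrix(d):
    """Rows of a d x d(d-1) matrix whose columns are e_i - e_j for all i != j.

    Assumed reading of the construction: one +1 and one -1 per column,
    zeros elsewhere, columns enumerating every ordered pair (i, j).
    """
    pairs = list(permutations(range(d), 2))
    rows = [[0] * len(pairs) for _ in range(d)]
    for col, (i, j) in enumerate(pairs):
        rows[i][col] = 1
        rows[j][col] = -1
    return rows

d = 4
D = difference_matrix(d)
n_cols = len(D[0])
column_sums = [sum(D[r][c] for r in range(d)) for c in range(n_cols)]

print(n_cols)        # number of ordered pairs of distinct coordinates
print(column_sums)   # every entry is 0
```

Under this reading every column sums to zero, so it is orthogonal to the all-ones vector, which is consistent with the orthogonality property the proof relies on.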

To lower bound the last term, note that and rows of are orthogonal to and so is , so

Putting all the pieces together, we have

Appendix E Proof of Theorem 3

As a standard trick, we randomize by sampling each element i.i.d. from . We will prove that there exists a strategy of choosing such that any algorithm’s expected number of mistakes is , where the expectation is with respect to the randomness of and the internal randomness of the algorithm. This immediately implies a worst-case result, since the maximum is no less than the average (with respect to the sampling of ).

In our construction, , where is some index to be specified. Hence, in every round the agent is essentially asked to decide whether . The adversary’s strategy goes in phases, and remains the same during each phase. Every phase has rounds where is enumerated over . To fully specify the adversary’s strategy, it remains to specify for each phase.

In the first phase, . For each coordinate , the information revealed to the agent is one of the following: , , , . For clarity we first make a simplification, that the revealed information is either or ; we will deal with the subtleties related to at the end of the proof.

In the second phase, we fix as

Since is randomized i.i.d. for each coordinate, the posterior of conditioned on the revealed information is