Maximum Likelihood Constraint Inference for Inverse Reinforcement Learning

09/12/2019
by   Dexter R. R. Scobee, et al.
UC Berkeley

While most approaches to the problem of Inverse Reinforcement Learning (IRL) focus on estimating a reward function that best explains an expert agent's policy or demonstrated behavior on a control task, it is often the case that such behavior is more succinctly described by a simple reward combined with a set of hard constraints. In this setting, the agent is attempting to maximize cumulative rewards subject to these given constraints on their behavior. We reformulate the problem of IRL on Markov Decision Processes (MDPs) such that, given a nominal model of the environment and a nominal reward function, we seek to estimate state, action, and feature constraints in the environment that motivate an agent's behavior. Our approach is based on the Maximum Entropy IRL framework, which allows us to reason about the likelihood of an expert agent's demonstrations given our knowledge of an MDP. Using our method, we can infer which constraints can be added to the MDP to most increase the likelihood of observing these demonstrations. We present an algorithm which iteratively infers the Maximum Likelihood Constraint to best explain observed behavior, and we evaluate its efficacy using both simulated behavior and recorded data of humans navigating around an obstacle.


1 Introduction

Advances in mechanical design and artificial intelligence continue to expand the horizons of robotic applications. In these new domains, it can be difficult to design a specific robot behavior by hand, and even manually specifying a task for a reinforcement-learning-enabled agent is notoriously difficult [ho_2015_bad_rewards, amodei_2016_safety_problems]. Inverse Reinforcement Learning (IRL) techniques can help to alleviate this burden by automatically identifying the objectives driving certain behavior. Since first being introduced as Inverse Optimal Control by Kalman [kalman_1964_ioc], much of the work on IRL has focused on learning environmental rewards, that is, a function mapping state-action pairs to real numbers [ng_2000_irl, abbeel_2004_apprenticeship, ratliff_2006_mmp, ziebart_2008_maxent]. While these types of IRL algorithms have proven useful in a variety of situations [abbeel_2007_flying, vasquez_2014_crowd_navigation, ziebart_2010_thesis, scobee_2018_haptic], their underlying assumption that reward functions fully capture task specifications makes them ill-suited to problem domains with hard constraints or non-Markovian objectives.

Recent work has attempted to address these pitfalls by using demonstrations to learn a rich class of possible specifications [vazquez_2018_specifications]. Others have focused specifically on learning constraints, that is, behaviors that are expressly forbidden or infeasible [pardowitz_2005_seq_constraints, perez_2017_clearn, subramani_2018_geometric_constraints, mcpherson_2018_s3, chou_2018_cfd]. It is towards this problem of constraint inference that we turn our attention. In this work, we present a novel method for inferring constraints, drawing primarily from the Maximum Entropy approach to IRL [ziebart_2008_maxent]. We use this framework to reason about the likelihood of observing a set of demonstrations given a nominal task description, as well as about their likelihood if we imposed additional constraints on the task. This knowledge allows us to select a constraint, or set of constraints, which maximizes the demonstrations' likelihood and best explains the differences between expected and demonstrated behavior. Our method improves on prior work by being able to simultaneously consider constraints on states, actions, and features in a Markov Decision Process (MDP), providing a principled ranking of all options according to their effect on demonstration likelihood.

2 Related Work

2.1 Inverse Reinforcement Learning

A formulation of the IRL problem was first proposed by kalman_1964_ioc as the Inverse problem of Optimal Control (IOC). Given a dynamical system and a control law, the author sought to identify which function(s) the control law was designed to optimize. This problem was brought into the domain of MDPs and Reinforcement Learning (RL) by ng_2000_irl, who proposed IRL as the task of finding, given an MDP and a policy (or trajectories sampled according to that policy), a reward function with respect to which that policy is optimal.

One of the chief difficulties in the problem of IRL is the fact that a policy can be optimal with respect to a potentially infinite set of reward functions. The most trivial example of this is the fact that all policies are optimal with respect to a null reward function that always returns zero. Much of the subsequent work in IRL has been devoted to developing approaches that address this ambiguity by imposing additional structure to make the problem well-posed [abbeel_2004_apprenticeship, ratliff_2006_mmp]. ziebart_2008_maxent approach the problem by employing the principle of maximum entropy [jaynes_1957_maxent], which allows the authors to develop an IRL algorithm that produces a single stochastic policy that matches feature counts without adding any additional constraints to the produced behavior. This so-called Maximum Entropy IRL (MaxEnt) provides a framework for reasoning about demonstrations from experts who are noisily optimal. The induced probability distribution over trajectories forms the basis for our efforts in identifying the most likely behavior-modifying constraints.

2.2 Beyond Reward Functions

While Markovian rewards do often provide a succinct and expressive way to specify the objectives of a task, they cannot capture all possible task specifications. vazquez_2018_specifications highlight the utility of non-Markovian Boolean specifications, which can describe complex objectives (e.g., do this before that) and compose in an intuitive way (e.g., avoid obstacles and reach the goal). The authors of [vazquez_2018_specifications] draw inspiration from the MaxEnt framework to develop their technique for using demonstrations to calculate the posterior probability that an agent is attempting to satisfy a Boolean specification.

A subset of these types of specifications that is of particular interest to us is the specification of constraints, which are states, actions, or features of the environment that must be avoided. chou_2018_lcfd explore how to infer trajectory feature constraints given a nominal model of the environment (lacking the full set of constraints) and a set of demonstrated trajectories. The core of their approach is to sample from the set of trajectories which have better performance than the demonstrated trajectories. They then infer that the set of possible constraints is the subset of the feature space that contains the higher-reward sampled trajectories, but not the demonstrated trajectories. Intuitively, they reason that if the demonstrator could have passed through those features to earn a higher reward, but did not, then there must have been a previously unknown constraint preventing that behavior. However, while their approach does allow for a cost function to rank elements from the set of possible constraints, the authors do not offer a mechanism for determining what cost function will best order these constraints.

Our approach to constraint inference from demonstrations addresses this open question by providing a principled ranking of the likelihood of constraints. We adapt the MaxEnt framework to allow us to reason about how adding a constraint will affect the likelihood of demonstrated behaviors, and we can then select the constraints which maximize this likelihood. We consider feature-space constraints as in [chou_2018_lcfd], and we explicitly augment the feature space with state- and action-specific features to directly compare the impacts of state-, action-, and feature-based constraints on demonstration likelihood.

3 Maximum Likelihood Constraint Inference

3.1 Problem Formulation

Following the formulation presented in [ziebart_2008_maxent], we base our work in the setting of a (finite-state) Markov Decision Process (MDP). We define an MDP $\mathcal{M}$ as a tuple $(\mathcal{S}, \{\mathcal{A}_s\}, P, D_0, \phi, R)$, where $\mathcal{S}$ is a finite set of discrete states; $\{\mathcal{A}_s\}$ is a set of the sets of actions available to be taken for each state $s$, such that $\mathcal{A}_s \subseteq \mathcal{A}$, where $\mathcal{A}$ is a finite set of discrete actions; $P$ is a set of state transition probability distributions such that $P(s' \mid s, a)$ is the probability of transitioning to state $s'$ after taking action $a$ from state $s$; $D_0 : \mathcal{S} \to [0, 1]$ is an initial state distribution; $\phi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^{k}_{\geq 0}$ is a mapping to a $k$-dimensional space of non-negative features; and $R : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is a reward function. A trajectory $\xi$ through this MDP is a sequence of states $s_t$ and actions $a_t$ such that $s_0$ is drawn from $D_0$ and state $s_{t+1}$ is drawn from $P(\cdot \mid s_t, a_t)$. Actions are chosen by an agent navigating the MDP according to a, potentially time-varying, policy $\pi$ such that $\pi_t(\cdot \mid s_t)$ is a probability distribution over actions in $\mathcal{A}_{s_t}$. We denote a finite-time trajectory of length $T$ by $\xi_{0:T}$.
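To make this notation concrete, the following Python sketch shows one possible in-memory representation of such an MDP; the names (FiniteMDP, trajectory_reward) and the dictionary-based encoding are our own illustration, not part of the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Callable
import numpy as np

State = int
Action = int

@dataclass
class FiniteMDP:
    """Minimal container for the MDP tuple (S, {A_s}, P, D_0, phi, R)."""
    states: List[State]                                          # S
    actions: Dict[State, List[Action]]                           # A_s for each state s
    transition: Dict[Tuple[State, Action], Dict[State, float]]   # P(s' | s, a)
    initial_dist: Dict[State, float]                             # D_0(s)
    features: Callable[[State, Action], np.ndarray]              # phi(s, a), non-negative
    reward: Callable[[State, Action], float]                     # R(s, a)

def trajectory_reward(mdp: FiniteMDP,
                      traj: List[Tuple[State, Action]],
                      gamma: float = 1.0) -> float:
    """Discounted reward accumulated along a (state, action) trajectory."""
    return sum(gamma ** t * mdp.reward(s, a) for t, (s, a) in enumerate(traj))
```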

At every time step $t$, a trajectory will accumulate features equal to $\phi(s_t, a_t)$. We use the notation $\phi_i(s, a)$ to refer to the $i$-th element of the feature map, and we use the label $f_i$ to denote the $i$-th feature itself. We also introduce an augmented indicator feature mapping $\hat{\phi} : \mathcal{S} \times \mathcal{A} \to \{0, 1\}^{k + |\mathcal{S}| + |\mathcal{A}|}$. This augmented feature map uses binary variables to indicate the presence of a feature and expands the feature space by adding binary features to track occurrences of each state and action, such that

$$\hat{\phi}(s, a) = \Big[\mathbb{1}\big(\phi_1(s, a) > 0\big), \ldots, \mathbb{1}\big(\phi_k(s, a) > 0\big),\; \mathbb{1}(s = s_1), \ldots, \mathbb{1}(s = s_{|\mathcal{S}|}),\; \mathbb{1}(a = a_1), \ldots, \mathbb{1}(a = a_{|\mathcal{A}|})\Big]^\top \qquad (1)$$
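As an illustration of (1), the sketch below builds the augmented indicator map from an original feature map; the helper name augmented_features and the dense indicator encoding are our assumptions.

```python
import numpy as np
from typing import Callable, List

def augmented_features(phi: Callable[[int, int], np.ndarray],
                       states: List[int],
                       all_actions: List[int]) -> Callable[[int, int], np.ndarray]:
    """Return phi_hat(s, a): binary indicators for each original feature,
    each state, and each action, in the order given in Eq. (1)."""
    s_index = {s: i for i, s in enumerate(states)}
    a_index = {a: i for i, a in enumerate(all_actions)}

    def phi_hat(s: int, a: int) -> np.ndarray:
        base = (np.asarray(phi(s, a)) > 0).astype(float)      # 1(phi_i(s, a) > 0)
        s_ind = np.zeros(len(states)); s_ind[s_index[s]] = 1.0       # 1(s = s_i)
        a_ind = np.zeros(len(all_actions)); a_ind[a_index[a]] = 1.0  # 1(a = a_j)
        return np.concatenate([base, s_ind, a_ind])

    return phi_hat
```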

Typically, agents are modeled as trying to maximize, or approximately maximize, the total reward earned for a trajectory $\xi$, given by $R(\xi) = \sum_{t} \gamma^{t} R(s_t, a_t)$, where $\gamma \in (0, 1]$ is a discount factor. Therefore, an agent's policy is closely tied to the form of the MDP's reward function.

Conventional IRL focuses on inferring a reward function that explains an agent's policy, revealed through the behavior observed in a set of demonstrated trajectories $\mathcal{D}$. However, our method for constraint inference poses a different challenge: given an MDP $\mathcal{M}$, including a reward function, and a set of demonstrations $\mathcal{D}$, find the constraint set $\mathcal{C}^*$ which maximizes the likelihood of observing these demonstrations. We define our notion of constraints in the following section.

3.2 Constraints for MDPs

Figure 1: Human trajectories overlaid on a grid world MDP. The shaded region represents an obstacle in the human’s environment, and the red “X”s represent learned constraints.

Constraints are those behaviors that are not disallowed explicitly by the structure of the MDP, but which would be infeasible or prohibited for the underlying system being modeled by the MDP. This sort of discrepancy can occur when a general or simplified MDP is designed without exact knowledge of specific constraints for the modeled system. For instance, for a general MDP modeling the behavior of cars, we might want to include states for speeds up to 500 and actions for accelerations up to 12. However, for a specific car on a specific roadway, the set of states where the vehicle travels above 100 may be prohibited because of a speed limit, and the set of actions where the vehicle accelerates above 4 may be infeasible because of the physical limitations of the vehicle’s engine. Therefore, any MDP trajectory of this specific car system would not contain a state-action pair which violates these legal and physical limits. Figure 1 shows an example of constraints driving behavior.

We define a constraint set $\mathcal{C} \subseteq \mathcal{S} \times \mathcal{A}$ as a set of state-action pairs that violate some specification of the modeled system. We consider three general classes of constraints: state constraints, action constraints, and feature constraints. A state constraint set $\mathcal{C}_{s_i}$ includes all state-action pairs such that the state component is $s_i$. An action constraint set $\mathcal{C}_{a_j}$ includes all state-action pairs such that the action component is $a_j$. A feature constraint set $\mathcal{C}_{f_i}$ includes all state-action pairs that produce a non-zero value for feature $f_i$.

If we augment the set of features as described in (1), it is straightforward to see that state and action constraints become special cases of feature constraints, since $\mathcal{C}_{s_i}$ and $\mathcal{C}_{a_j}$ can each be written as $\mathcal{C}_{f_\ell}$ for the corresponding added indicator feature $f_\ell$. It is also evident that we can obtain compound constraints, respecting two or more conditions, by taking the union of constraint sets to obtain $\mathcal{C} = \mathcal{C}_1 \cup \mathcal{C}_2$.

3.2.1 Adding Constraints to an MDP

We need to be able to reason about how adding a constraint to an MDP will influence the behavior of agents navigating that environment. Because constraints are sets of state-action pairs, imposing a constraint within an MDP means restricting the set of actions that can be taken from certain states. For a given constraint set $\mathcal{C}$, we can replace the set of available actions $\mathcal{A}_s$ in every state $s$ with an alternative set given by

$$\mathcal{A}_s^{\mathcal{C}} = \{a \in \mathcal{A}_s : (s, a) \notin \mathcal{C}\} \qquad (2)$$

Performing such substitutions for an MDP $\mathcal{M}$ will lead to a modified MDP $\mathcal{M}^{\mathcal{C}}$ such that $\mathcal{M}^{\mathcal{C}} = (\mathcal{S}, \{\mathcal{A}_s^{\mathcal{C}}\}, P, D_0, \phi, R)$.

The question then arises as to how we should treat states whose action set $\mathcal{A}_s^{\mathcal{C}}$ becomes empty. Since an agent arriving in such an empty state would have no valid action to select, any trajectory visiting an empty state must be deemed invalid. Indeed, such empty action sets will be produced for any state $s$ whose entire action set is constrained, that is, $\{(s, a) : a \in \mathcal{A}_s\} \subseteq \mathcal{C}$.

For MDPs with deterministic transitions, it is clear that any agent respecting these constraints will not visit an empty state. If we consider the set of empty states $\mathcal{S}_\emptyset = \{s : \mathcal{A}_s^{\mathcal{C}} = \emptyset\}$, then for the purposes of reasoning about an agent's behavior, we can impose the additional constraint set $\mathcal{C}_{\mathcal{S}_\emptyset} = \bigcup_{s \in \mathcal{S}_\emptyset} \mathcal{C}_s$. In this work, we will always implicitly add this constraint set, such that $\mathcal{M}^{\mathcal{C}}$ will be taken to mean $\mathcal{M}^{\mathcal{C} \cup \mathcal{C}_{\mathcal{S}_\emptyset}}$, and we recursively add these constraints until reaching a fixed point.
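The following sketch shows one possible reading of this construction for deterministic transitions: it applies (2) and then repeatedly constrains state-action pairs whose successor state has an empty action set, until a fixed point is reached. The function names and the successor-function interface are hypothetical, and the sketch assumes every legitimate state of the nominal MDP has at least one available action.

```python
from typing import Callable, Dict, List, Set, Tuple

State, Action = int, int

def apply_constraints(actions: Dict[State, List[Action]],
                      constraint: Set[Tuple[State, Action]]) -> Dict[State, List[Action]]:
    """Eq. (2): remove constrained (s, a) pairs from each state's action set."""
    return {s: [a for a in acts if (s, a) not in constraint]
            for s, acts in actions.items()}

def close_empty_states(actions: Dict[State, List[Action]],
                       constraint: Set[Tuple[State, Action]],
                       successor: Callable[[State, Action], State]) -> Set[Tuple[State, Action]]:
    """For deterministic transitions, repeatedly constrain (s, a) pairs whose
    successor state has an empty action set, until a fixed point is reached."""
    constraint = set(constraint)
    while True:
        restricted = apply_constraints(actions, constraint)
        empty = {s for s, acts in restricted.items() if not acts}
        # Pairs that would move the agent into an empty (dead-end) state.
        new = {(s, a) for s, acts in restricted.items() for a in acts
               if successor(s, a) in empty}
        if not new:
            return constraint
        constraint |= new
```

The fixed-point loop terminates because the constraint set only grows and is bounded by the finite set of state-action pairs.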

For MDPs with stochastic transitions, the semantics of an empty state are less obvious and could lend themselves to multiple interpretations depending on the nature of the system being modeled. We offer a possible treatment in the appendix.

3.3 Demonstration Likelihood Maximization

Under the maximum entropy model presented in [ziebart_2008_maxent], the probability of a certain finite-length trajectory $\xi$ being executed by an agent traversing a deterministic MDP is exponentially proportional to the reward earned by that trajectory:

$$P(\xi \mid \mathcal{M}) = \frac{e^{\beta R(\xi)}}{Z} \mathbb{1}_{\mathcal{M}}(\xi) \qquad (3)$$

where $Z$ is the partition function, $\mathbb{1}_{\mathcal{M}}(\xi)$ indicates whether the trajectory is feasible for this MDP, and $\beta \in [0, \infty)$ is a parameter describing how closely an agent adheres to the task of optimizing the reward function (as $\beta \to \infty$, the agent becomes a perfect optimizer, and as $\beta \to 0$, the agent's actions become perfectly random). In the sequel, we assume that a given reward function will appropriately capture the role of $\beta$, so we omit $\beta$ from our notation without loss of generality.

In the case of finite horizon planning, the partition function will be the sum of the exponentially weighted rewards for all feasible trajectories on the MDP $\mathcal{M}^{\mathcal{C}}$ of length no greater than the planning horizon. We denote this set of trajectories by $\Xi_{\mathcal{M}^{\mathcal{C}}}$. Because adding constraints modifies the set of feasible trajectories, we express this dependence as

$$Z(\mathcal{C}) = \sum_{\xi \in \Xi_{\mathcal{M}^{\mathcal{C}}}} e^{R(\xi)} \qquad (4)$$

Assuming independence among demonstrated trajectories, the probability of observing a set of demonstrations $\mathcal{D}$ is given by the product

$$P(\mathcal{D} \mid \mathcal{M}, \mathcal{C}) = \prod_{\xi \in \mathcal{D}} \frac{e^{R(\xi)}}{Z(\mathcal{C})} \mathbb{1}_{\mathcal{M}^{\mathcal{C}}}(\xi) \qquad (5)$$

Our goal is to maximize the demonstration probability given by (5). Because we take the reward function and demonstrations as given, our only available decision variable in this maximization is the constraint set $\mathcal{C}$, which alters the indicator $\mathbb{1}_{\mathcal{M}^{\mathcal{C}}}$ and the partition function $Z(\mathcal{C})$:

$$\mathcal{C}^* = \operatorname*{arg\,max}_{\mathcal{C} \in \mathcal{C}_{\mathcal{H}}} P(\mathcal{D} \mid \mathcal{M}, \mathcal{C}) \qquad (6)$$

where $\mathcal{C}_{\mathcal{H}}$ is the hypothesis space of possible constraints.

From the form of (5), it is clear that to solve (6), we must choose a constraint set that does not invalidate any demonstrated trajectory while simultaneously minimizing the value of $Z(\mathcal{C})$. Consider the set of trajectories that would be made infeasible by augmenting the MDP with constraint $\mathcal{C}$, which we denote by $\Xi^{\mathcal{C}-} = \Xi_{\mathcal{M}} \setminus \Xi_{\mathcal{M}^{\mathcal{C}}}$. The value of $Z(\mathcal{C})$ is minimized when we maximize the sum of exponentiated rewards of these infeasible trajectories. Considering the form of the trajectory probability given by (3), we can see that this sum is proportional to the total probability of observing a trajectory from $\Xi^{\mathcal{C}-}$ on the original MDP:

$$\sum_{\xi \in \Xi^{\mathcal{C}-}} e^{R(\xi)} \propto P(\xi \in \Xi^{\mathcal{C}-} \mid \mathcal{M}) \qquad (7)$$

This insight leads us to the final form of the optimization:

$$\mathcal{C}^* = \operatorname*{arg\,max}_{\mathcal{C} \in \mathcal{C}_{\mathcal{H}}} P(\xi \in \Xi^{\mathcal{C}-} \mid \mathcal{M}) \quad \text{subject to} \quad \mathbb{1}_{\mathcal{M}^{\mathcal{C}}}(\xi) = 1 \;\; \forall \xi \in \mathcal{D} \qquad (8)$$

In order to solve (8), we must reason about the probability distribution of trajectories on the original MDP , then find the constraint such that contains the most probability mass while not containing any demonstrated trajectories. While equation (8) is derived for deterministic MDPs, if we can assume, as proposed in [ziebart_2008_maxent], that for a given stochastic MDP, the stochastic outcomes have little effect on an agent’s behavior and the partition function, then the solution to (8) will also approximate the optimal constraint selection for that MDP. However, in order to fully address the stochastic case, we would need to reformulate our approach based on maximum causal entropy [ziebart_2010_thesis]. We save this extension for future work.
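For a small deterministic MDP with a short horizon, the selection rule (8) can be implemented by direct enumeration. The sketch below is our own illustration, not the authors' implementation: given an enumeration of all feasible finite-horizon trajectories, it scores each candidate constraint by the probability mass of the trajectories it would eliminate, skipping candidates that would invalidate a demonstration.

```python
import numpy as np
from typing import Callable, Dict, Sequence, Set, Tuple

Traj = Tuple[Tuple[int, int], ...]   # sequence of (state, action) pairs

def maxent_trajectory_probs(trajs: Sequence[Traj],
                            reward: Callable[[int, int], float]) -> np.ndarray:
    """Eq. (3) with beta folded into the reward: P(traj) proportional to exp(R(traj))."""
    r = np.array([sum(reward(s, a) for s, a in traj) for traj in trajs])
    w = np.exp(r - r.max())          # subtract max for numerical stability
    return w / w.sum()

def most_likely_constraint(trajs: Sequence[Traj],
                           probs: np.ndarray,
                           demos: Sequence[Traj],
                           candidates: Dict[str, Set[Tuple[int, int]]]) -> str:
    """Eq. (8): pick the candidate constraint whose eliminated trajectories carry
    the most probability mass on the nominal MDP, while keeping every
    demonstration feasible. Returns None if no candidate is admissible."""
    demo_set = set(demos)
    best_name, best_mass = None, -1.0
    for name, c in candidates.items():
        def violates(traj: Traj) -> bool:
            return any((s, a) in c for s, a in traj)
        if any(violates(d) for d in demo_set):
            continue                  # constraint would invalidate a demonstration
        mass = sum(p for traj, p in zip(trajs, probs) if violates(traj))
        if mass > best_mass:
            best_name, best_mass = name, mass
    return best_name
```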

3.3.1 Constraint Hypothesis Space

In order for the solutions to (8) to be meaningful, we must be careful with our choice of the constraint hypothesis space $\mathcal{C}_{\mathcal{H}}$. For instance, if we let $\mathcal{C}_{\mathcal{H}} = 2^{\mathcal{S} \times \mathcal{A}}$, the power set of all state-action pairs, then the optimal solution will always be to choose the most restrictive constraint set, which forbids every state-action pair not observed in the demonstration set.

One approach to avoid this trivial solution is to use domain knowledge of the modeled system to restrict to a library of plausible or common constraints. The authors of [mcpherson_2018_s3] construct such a library by using reachability theory to calculate a family of likely unsafe sets.

Another approach, which we will explore in this work, is the use of minimal constraint sets for our hypothesis space. These minimal sets constrain a single state, action, or feature, and were introduced in Section 3.2 as $\mathcal{C}_{s_i}$, $\mathcal{C}_{a_j}$, and $\mathcal{C}_{f_i}$, respectively. By iteratively selecting minimal constraint sets, it is possible to gradually grow the full estimated constraint set and avoid over-fitting to the demonstrations. Section 3.4 details our approach for selecting the most likely minimal constraint, and Section 3.5 details our approach for iteratively growing the estimated constraint set.

3.4 Probability Mass for Minimal Constraints

As detailed in Section 3.3, the most likely constraint set is the one whose eliminated trajectories have the highest probability of being demonstrated on the original, unconstrained MDP. Therefore, to find the most likely of the minimal constraints, we must find the expected proportion of trajectories that will contain any given state or action, or accrue any given feature. By using our augmented indicator feature map from (1), we can reduce this problem to examining only feature accruals.

Input: an MDP $\mathcal{M}$, a (time-varying) policy $\pi_t$, a time horizon $T$

Output: expected feature accrual history $\bar{\Phi}$

1:/* Initialize state visitation and feature accrual history */
2:for $s \in \mathcal{S}$ do
3:   $D_0(s) \leftarrow$ initial probability of state $s$
4:   $\Phi_0(s) \leftarrow \mathbf{0}$
5:end for
6:$\bar{\Phi}_0 \leftarrow \mathbf{0}$
7:/* Track feature accruals over the time horizon */
8:for $t = 0, \ldots, T - 1$ do
9:   for $s \in \mathcal{S}$ do
10:      for $a \in \mathcal{A}_s$ do
11:         /* New feature accruals */
12:         /* "$\odot$" denotes element-wise multiplication */
13:         $\Phi_t^{a}(s) \leftarrow \pi_t(a \mid s)\big(\Phi_t(s) + (D_t(s) - \Phi_t(s)) \odot \hat{\phi}(s, a)\big)$
14:      end for
15:   end for
16:   for $s' \in \mathcal{S}$ do
17:      $D_{t+1}(s') \leftarrow \sum_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}_s} P(s' \mid s, a)\, \pi_t(a \mid s)\, D_t(s)$
18:      $\Phi_{t+1}(s') \leftarrow \sum_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}_s} P(s' \mid s, a)\, \Phi_t^{a}(s)$
19:   end for
20:   $\bar{\Phi}_{t+1} \leftarrow \sum_{s \in \mathcal{S}} \Phi_{t+1}(s)$
21:end for
22:Return $\bar{\Phi} = [\bar{\Phi}_0, \ldots, \bar{\Phi}_T]$
Algorithm 1 Feature Accrual History Calculation

In [ziebart_2008_maxent], the authors present their forward-backward algorithm for calculating expected feature counts for an agent following a policy in the maximum entropy setting. This algorithm nearly suffices for our purposes, but it computes the expectation of the total number of times a feature will be accrued (i.e. how often will this feature be observed per trajectory), rather than the expectation of the number of trajectories that will accrue that feature at any point. To address this problem, we present a modified form of the “forward” pass as Algorithm 1. Our algorithm tracks state visitations as well as feature accruals at each state, which allows us to produce the same maximum entropy distribution over trajectories as [ziebart_2008_maxent] while not counting additional accruals for trajectories that have already accrued a feature.

The input of Algorithm 1 includes the MDP itself, a time horizon, and a time-varying policy. This policy should capture the expected behavior of the demonstrator on the nominal MDP $\mathcal{M}$, and can be computed via the "backward" part of the algorithm from [ziebart_2008_maxent]. The output of Algorithm 1, $\bar{\Phi}$, is an array such that the $t$-th column is a vector whose $i$-th entry is the expected proportion of trajectories to have accrued the $i$-th feature by time $t$. In particular, the $i$-th element of the final column, $\bar{\Phi}_T$, is equal to $P(\xi \in \Xi^{\mathcal{C}_{f_i}-} \mid \mathcal{M})$, which allows us to directly select the most likely minimal constraint according to (8).
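A vectorized sketch of this forward pass, under our reading of Algorithm 1 (the array shapes and the name feature_accrual_history are assumptions): it propagates state visitation probabilities under the time-varying policy and, for each state, tracks the probability that a trajectory has already accrued each augmented feature, so a trajectory is counted at most once per feature. The policy array is assumed to assign zero probability to unavailable actions.

```python
import numpy as np

def feature_accrual_history(P, pi, phi_hat, D0, T):
    """Expected proportion of trajectories that have accrued each augmented
    feature by each time step.

    P        : array [S, A, S], transition probabilities P(s' | s, a)
    pi       : array [T, S, A], time-varying policy pi_t(a | s)
    phi_hat  : array [S, A, K], binary augmented features (Eq. (1))
    D0       : array [S], initial state distribution
    T        : planning horizon
    Returns  : array [K, T + 1]; column t gives accrual probabilities by time t.
    """
    S, A, K = phi_hat.shape
    D = D0.copy()               # D[s]      = P(at state s at time t)
    Phi = np.zeros((S, K))      # Phi[s, k] = P(at s at time t AND feature k accrued)
    hist = np.zeros((K, T + 1))

    for t in range(T):
        D_next = np.zeros(S)
        Phi_next = np.zeros((S, K))
        for s in range(S):
            for a in range(A):
                w = D[s] * pi[t, s, a]          # mass taking (s, a) at time t
                if w == 0.0:
                    continue
                # mass that has accrued feature k after (s, a), without double counting
                accrued = Phi[s] * pi[t, s, a] + (D[s] - Phi[s]) * pi[t, s, a] * phi_hat[s, a]
                D_next += w * P[s, a]
                Phi_next += P[s, a][:, None] * accrued[None, :]
        D, Phi = D_next, Phi_next
        hist[:, t + 1] = Phi.sum(axis=0)
    return hist
```

Because the policy is Markovian, conditioning on the current state makes past feature accruals independent of future actions, so tracking the joint mass of "state now, feature already accrued" suffices to avoid double counting.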

3.5 Maximum-Coverage-Based Iterative Constraint Inference

When using minimal constraint sets as the constraint hypothesis space, it is possible that the most likely constraint still does not provide a satisfactory explanation for the demonstrated behavior. In this case, it can be beneficial to combine minimal constraints. If the task of solving (8) is framed as finding the combination of constraint sets that "covers" the most probability mass, then the problem becomes a direct analog of the classic maximum coverage problem [hochbaum_1998_max_coverage]. While this problem is known to be NP-hard, there exists a simple greedy algorithm with known suboptimality bounds [hochbaum_1998_max_coverage].

Input: MDP $\mathcal{M}$, constraint hypothesis space $\mathcal{C}_{\mathcal{H}}$,

empirical probability distribution $\hat{P}_{\mathcal{D}}$, threshold $d_{KL}$

Output: estimated constraint set $\mathcal{C}^*$

1:$\mathcal{C}^* \leftarrow \emptyset$
2:for $i = 1, 2, \ldots$ do
3:   $\mathcal{C}_i \leftarrow$ solution to (8) using $\mathcal{M}^{\mathcal{C}^*}$, $\mathcal{C}_{\mathcal{H}}$, and $\mathcal{D}$
4:   $d_i \leftarrow D_{KL}\big(\hat{P}_{\mathcal{D}} \,\|\, P(\xi \mid \mathcal{M}^{\mathcal{C}^*})\big)$
5:   if $d_i \leq d_{KL}$ then
6:      break
7:   end if
8:   $\mathcal{C}^* \leftarrow \mathcal{C}^* \cup \mathcal{C}_i$
9:end for
10:Return $\mathcal{C}^*$
Algorithm 2 Greedy Iterative Constraint Inference

We present Algorithm 2 as our approach for adapting this greedy heuristic to solve the problem of constraint inference. At each iteration, we grow our estimated constraint set by augmenting it with the constraint set in our hypothesis space that covers the most currently uncovered probability mass. By analogy to the maximum coverage problem, we derive the following bound on the suboptimality of our approach.

Theorem 1.

Let $\mathcal{C}_{\mathcal{H}}^{i}$ be the set of all constraints $\mathcal{C}$ such that $\mathcal{C} = \bigcup_{j=1}^{i} \mathcal{C}_j$ for $\mathcal{C}_j \in \mathcal{C}_{\mathcal{H}}$, and let $\mathcal{C}^{i\dagger}$ be the solution to (8) using $\mathcal{C}_{\mathcal{H}}^{i}$ as the constraint hypothesis space. It follows, then, that at the end of the $i$-th iteration of Algorithm 2,

$$P\big(\xi \in \Xi^{\mathcal{C}^*-} \mid \mathcal{M}\big) \;\geq\; \Big(1 - \big(1 - \tfrac{1}{i}\big)^{i}\Big)\, P\big(\xi \in \Xi^{\mathcal{C}^{i\dagger}-} \mid \mathcal{M}\big) \;>\; \Big(1 - \tfrac{1}{e}\Big)\, P\big(\xi \in \Xi^{\mathcal{C}^{i\dagger}-} \mid \mathcal{M}\big).$$

This bound is directly analogous to the suboptimality bound for the greedy solution to the maximum coverage problem proven in [hochbaum_1998_max_coverage]. For space, the proof is included in the appendix.

Rather than selecting the number of constraints to be used ahead of time, we check a condition based on KL divergence to decide if we should continue to add constraints. The quantity $D_{KL}\big(\hat{P}_{\mathcal{D}} \,\|\, P(\xi \mid \mathcal{M}^{\mathcal{C}^*})\big)$ provides a measure of how well the distribution over trajectories induced by our inferred constraints, $P(\xi \mid \mathcal{M}^{\mathcal{C}^*})$, agrees with the empirical probability distribution over trajectories observed in the demonstrations, $\hat{P}_{\mathcal{D}}$. The threshold parameter $d_{KL}$ is chosen to avoid over-fitting to the demonstrations, combating the tendency to select additional constraints that may only marginally better align our predictions with the demonstrations.
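A sketch of this greedy loop with the KL-divergence stopping rule is shown below; select_constraint and trajectory_distribution are hypothetical stand-ins for the machinery of (8) and the MaxEnt trajectory distribution, and the exact ordering of the stopping check and the update reflects our reading of Algorithm 2.

```python
import numpy as np

def kl_divergence(p_emp: np.ndarray, p_model: np.ndarray, eps: float = 1e-12) -> float:
    """D_KL(p_emp || p_model) over a common enumeration of trajectories."""
    return float(np.sum(p_emp * np.log((p_emp + eps) / (p_model + eps))))

def greedy_constraint_inference(select_constraint, trajectory_distribution,
                                p_empirical, hypothesis_space, d_kl_threshold,
                                max_iters=10):
    """Greedy loop in the spirit of Algorithm 2: repeatedly add the single most
    likely constraint (via Eq. (8)) until the predicted trajectory distribution
    is within d_kl_threshold of the empirical one.

    select_constraint(current_set, hypothesis_space) -> best candidate constraint
        (a set of (state, action) pairs), or None if none keeps the demos feasible
    trajectory_distribution(current_set) -> np.ndarray over the same trajectory
        enumeration as p_empirical
    """
    estimated = set()
    for _ in range(max_iters):
        if kl_divergence(p_empirical, trajectory_distribution(estimated)) <= d_kl_threshold:
            break                      # current constraints already explain the demos
        best = select_constraint(estimated, hypothesis_space)
        if best is None:               # nothing left that keeps demos feasible
            break
        estimated |= set(best)
    return estimated
```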

4 Examples

(a) True MDP
(b) Nominal MDP
(c) Feature constraint estimate
(d) Action constraint estimate
(e) State constraint estimate
Figure 2: Algorithm performance on a synthetic grid world MDP. Each subfigure represents the MDP by showing (clockwise from left) its states, actions, and features. Each element is shaded according to the proportion of trajectories that are expected to accrue the respective augmented feature, computed via Algorithm 1. Constraints are marked with a red "X," and bright bounding boxes mark the green and blue feature-producing states. The results here are shown for a set of demonstrations sampled according to the expectation for the True MDP (a). We begin with the nominal MDP shown in (b), and produce (c), (d), and (e) by applying Algorithm 2. Note that (c), (d), and (e) show the selected feature, action, and state constraints, respectively.

4.1 Synthetic Grid World

We consider the grid world MDP presented in Figure 2. The environment consists of a 9-by-9 grid of states, and the actions are to move up, down, left, right, or diagonally by one cell. The objective is to move from the starting state in the bottom-left corner to the goal state in the bottom-right corner. Every state-action pair produces a distance feature, and the MDP reward is the negative of that distance, which encourages short trajectories. There are additionally two more features, denoted green and blue, which are produced by taking actions from certain states, as shown in Figure 2.
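For reference, a minimal sketch of how such a grid world could be encoded (our own construction; the cell indexing, successor function, and Euclidean step-length feature are assumptions consistent with the description above, and the green and blue features are omitted).

```python
import numpy as np
from itertools import product

def build_grid_world(n=9):
    """Sketch of the synthetic grid world: n-by-n states, king-move actions,
    and a distance feature whose negative value is the reward."""
    states = list(product(range(n), range(n)))          # (col, row) cells, row 0 at bottom
    moves = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0)]
    actions = {}
    for (x, y) in states:
        actions[(x, y)] = [(dx, dy) for dx, dy in moves
                           if 0 <= x + dx < n and 0 <= y + dy < n]

    def successor(s, a):                                # deterministic transitions
        return (s[0] + a[0], s[1] + a[1])

    def distance_feature(s, a):                         # Euclidean step length
        return float(np.hypot(a[0], a[1]))

    def reward(s, a):                                   # shorter trajectories preferred
        return -distance_feature(s, a)

    start, goal = (0, 0), (n - 1, 0)                    # bottom-left to bottom-right
    return states, actions, successor, distance_feature, reward, start, goal
```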

The true MDP, from which agents generate trajectories, is shown in Figure 2(a), including its constraints. The nominal, more generic MDP shown in Figure 2(b) is what we take as $\mathcal{M}$ for applying the iterative maximum likelihood constraint inference in Algorithm 2, with feature accruals estimated using Algorithm 1. While Figures 2(c) through 2(e) show the iteratively estimated constraints, which align with the true constraints, it is interesting to note that not all constraints present in the true MDP are identified. For instance, it is so unlikely that an agent would ever select the up-left diagonal action that the absence of that action from the demonstrated trajectories is unsurprising, and the action is therefore not selected as an estimated constraint.

Figure 3 shows how the performance of our approach varies with the number of available demonstrations and the selection of the threshold $d_{KL}$. The false positive rate shown in Figure 3(a) is the proportion of selected constraints which are not constraints of the true system. As we would expect, having more demonstrations available reduces the rate at which obstacles are incorrectly identified. Further, Figure 3(b) shows that more demonstrations also allow the behavior predicted by the inferred constraints to better align with the observations. It is interesting to note, however, that with fewer than 10 demonstrations and a very low $d_{KL}$, we may produce very low KL divergence, but at the cost of a high false positive rate. This phenomenon highlights the role of selecting $d_{KL}$ to avoid over-fitting. The chosen threshold achieves a good balance, producing few false positives given sufficient examples while also producing low KL divergences, and we used this threshold to produce the results in Figures 2 and 1.

(a) False positive rate
(b) KL divergence
Figure 3: Algorithm performance on the synthetic grid world. Each data point represents the mean result of 10 independent trajectory draws, and the margins show standard error.

4.2 Human Obstacle Avoidance

In our second example, we analyze trajectories from humans as they navigate around an obstacle on the floor. We map these continuous trajectories into trajectories through a grid world where each cell represents a 1-by-1 area on the ground. The human agents attempt to reach a fixed goal state from a given initial state, as shown in Figure 1. We performed MaxEnt IRL on human demonstrations of the task without the obstacle to obtain the nominal distance-based reward function. We restrict ourselves to estimating only state constraints, as we do not supply our algorithm with knowledge of any additional features in the environment, and we assume that the humans' motion is otherwise unrestrained.

Demonstrations were collected from 16 volunteers, and the results of performing constraint inference are shown in Figure 1. Our method is able to successfully predict the existence of a central obstacle. While we do not estimate every constrained state, the constraints that we do estimate make all of the obstacle states unlikely to be visited. In order to identify those states as additional constraints, we would have to decrease our threshold, which could also lead to more spurious constraint selections, such as the three shown in Figure 1.

5 Conclusion and Future Work

We have presented our novel technique for learning constraints from demonstrations. We improve upon previous work in constraint-learning IRL by providing a principled framework for identifying the most likely constraint(s), and we do so in a way that explicitly makes state, action, and feature constraints all directly comparable to one another. We believe that the numerical results presented in Section 4 are promising and highlight the usefulness of our approach.

Despite its benefits, one drawback of our approach is that the formulation is based on (3), which only exactly holds for deterministic MDPs. As mentioned in Section 3.3, we plan to investigate the use of a maximum causal entropy approach to address this issue and fully handle stochastic MDPs. Additionally, the methods presented here require all demonstrations to contain no violations of the constraints we will estimate. We believe that softening this requirement, which would allow reasoning about the likelihood of constraints that are occasionally violated in the demonstration set, may be beneficial in cases where trajectory data is collected without explicit labels of success or failure. Finally, the structure of Algorithm 1, which tracks the expected feature accruals of trajectories over time, suggests that we may be able to reason about non-Markovian constraints by using this historical information to our advantage.

Overall, we believe that our formulation of maximum likelihood constraint inference for IRL shows promising results and presents attractive avenues for further investigation.

References

Appendix

Appendix A Adding constraints to Stochastic MDPs

For MDPs with non-deterministic transitions, the semantics of an empty state are less obvious and could lend themselves to multiple interpretations depending on the nature of the system being modeled. In our context, we use constraints to describe how observed behaviors from demonstrations differ from possible behaviors allowed by the nominal MDP structure. We therefore assume that any demonstrations provided are, by the fact that they were selected to be provided, consistent with the system's constraints, including avoiding empty states. This assumption implies that any stochastic state transitions that would have led to an empty state will not be observed in trajectories from the demonstration set. The omission of these transitions means that, for a given state-action pair $(s, a)$, if $\sum_{s' \in \mathcal{S}_\emptyset} P(s' \mid s, a) > 0$, then a proportion of the occurrences of this pair as an agent navigates the environment will be excluded from demonstrations. Therefore, as we modify the MDP to reason about demonstrated behavior, we need updated transition probabilities which eliminate the probability mass of transitioning to empty states, an event which will never be observed in a demonstration. Such modified probabilities can be given as

$$\bar{P}(s' \mid s, a) = \begin{cases} \dfrac{P(s' \mid s, a)}{1 - \sum_{s'' \in \mathcal{S}_\emptyset} P(s'' \mid s, a)}, & s' \notin \mathcal{S}_\emptyset \\[1.5ex] 0, & s' \in \mathcal{S}_\emptyset \end{cases} \qquad (9)$$

We must also capture the change to observed state-action pair frequencies by understanding that any observed policy $\bar{\pi}_t$ will be related to an agent's actual policy $\pi_t$ according to

$$\bar{\pi}_t(a \mid s) \propto \pi_t(a \mid s) \left(1 - \sum_{s' \in \mathcal{S}_\emptyset} P(s' \mid s, a)\right) \qquad (10)$$

It is important to note that the modifications presented in (9) and (10) for non-deterministic MDPs are not meant to directly reflect the reality of the underlying system (we wouldn't expect the actual transition dynamics to change, for instance), but to reflect the apparent behavior that we would expect to observe in the subset of trajectories that would be selected as demonstrations. We further note that applying these modifications to deterministic MDPs will result in the same expected behavior as augmenting the constraint set with $\mathcal{C}_{\mathcal{S}_\emptyset}$.
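A sketch of one possible implementation of these modifications, following the renormalization reading of (9) and (10) above; the array-based interface, the function names, and the use of integer state indices are our assumptions.

```python
import numpy as np

def modified_transitions(P, empty_states):
    """One reading of Eq. (9): remove transition mass into empty states and
    renormalize over the remaining successors. P has shape [S, A, S]."""
    P_bar = P.copy()
    P_bar[:, :, list(empty_states)] = 0.0
    totals = P_bar.sum(axis=2, keepdims=True)
    # State-action pairs with no surviving successors stay all-zero; they are
    # themselves removed by the empty-state closure.
    np.divide(P_bar, totals, out=P_bar, where=totals > 0)
    return P_bar

def observed_policy(pi, P, empty_states):
    """One reading of Eq. (10): actions whose outcomes can land in an empty
    state are under-represented in demonstrations in proportion to that risk."""
    keep_prob = 1.0 - P[:, :, list(empty_states)].sum(axis=2)   # [S, A]
    pi_obs = pi * keep_prob
    totals = pi_obs.sum(axis=1, keepdims=True)
    np.divide(pi_obs, totals, out=pi_obs, where=totals > 0)
    return pi_obs
```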

Appendix B Proof for Theorem 1

Theorem 2.

Let $\mathcal{C}_{\mathcal{H}}^{i}$ be the set of all constraints $\mathcal{C}$ such that $\mathcal{C} = \bigcup_{j=1}^{i} \mathcal{C}_j$ for $\mathcal{C}_j \in \mathcal{C}_{\mathcal{H}}$, and let $\mathcal{C}^{i\dagger}$ be the solution to (8) using $\mathcal{C}_{\mathcal{H}}^{i}$ as the constraint hypothesis space. It follows, then, that at the end of the $i$-th iteration of Algorithm 2,

$$P\big(\xi \in \Xi^{\mathcal{C}^*-} \mid \mathcal{M}\big) \;\geq\; \Big(1 - \big(1 - \tfrac{1}{i}\big)^{i}\Big)\, P\big(\xi \in \Xi^{\mathcal{C}^{i\dagger}-} \mid \mathcal{M}\big) \;>\; \Big(1 - \tfrac{1}{e}\Big)\, P\big(\xi \in \Xi^{\mathcal{C}^{i\dagger}-} \mid \mathcal{M}\big).$$

Proof.

The problem of finding $\mathcal{C}^{i\dagger}$ is analogous to solving the maximum coverage problem, where the set of elements to be covered is the set of trajectories $\Xi_{\mathcal{M}}$ and the weight of each element $\xi$ is $P(\xi \mid \mathcal{M})$. Because Algorithm 2 constructs $\mathcal{C}^*$ iteratively by taking the union of the previous value of $\mathcal{C}^*$ and the minimal constraint set which solves (8), the value of $\mathcal{C}^*$ at the end of the $i$-th iteration is analogous to the greedy solution of the maximum coverage problem after $i$ selections. Therefore, we can directly apply the suboptimality bound for the greedy solution proven in [hochbaum_1998_max_coverage] to arrive at our given bound on eliminated probability mass. ∎