A Survey of Inverse Reinforcement Learning: Challenges, Methods and Progress

06/18/2018 ∙ by Saurabh Arora, et al. ∙ University of Georgia

Inverse reinforcement learning is the problem of inferring the reward function of an observed agent, given its policy or behavior. Researchers perceive IRL both as a problem and as a class of methods. By categorically surveying the current literature in IRL, this article serves as a reference for researchers and practitioners in machine learning to understand the challenges of IRL and select the approaches best suited for the problem at hand. The survey formally introduces the IRL problem along with its central challenges, which include accurate inference, generalizability, correctness of prior knowledge, and growth in solution complexity with problem size. The article elaborates on how the current methods mitigate these challenges. We further discuss the extensions of traditional IRL methods: (i) inaccurate and incomplete perception, (ii) incomplete model, (iii) multiple rewards, and (iv) non-linear reward functions. This discussion concludes with some broad advances in the research area and currently open research questions.


1 Introduction

Inverse reinforcement learning (IRL) seeks to model the preferences of another agent using its observed behavior, thereby avoiding a manual specification of its reward function Russell1998; Ng2000. In the past decade or so, IRL has attracted several researchers in the communities of artificial intelligence, psychology, control theory, and machine learning. IRL is appealing because of its potential to use data recorded in everyday tasks (e.g., driving data) to build autonomous agents capable of modeling and socially collaborating with others in our society.

We study this problem and associated advances in a structured way to address the needs of readers with different levels of familiarity with the field. For clarity, we use a contemporary example to illustrate IRL’s use and associated challenges. Consider a self-driving car in role B in Fig. 1. To safely merge into a congested freeway, it should model the behavior of the car in role A; this car forms the immediate traffic. We may use previously collected trajectories of cars in role A, near freeway entry ramps, to learn the preferences of a typical driver as she approaches a merge (NGSIM NGSIM is one such existing data set).

Figure 1: Red car B is trying to merge into the lane, and Green car A is the immediate traffic. The transparent images of cars show their positions before merging, and the opaque images depict one of their possible positions after the merger. Color should be used for this figure in print.

Approaches for IRL predominantly ascribe a Markov decision process (MDP) Puterman1994 to the interaction of the observed agent with its environment, whose solution is a policy that maps states to actions. The reward function of this MDP is unknown, and the observed agent is assumed to follow an optimal policy for the MDP. In the traffic merge example, the MDP represents the driving process of Car A. The driver of Car A chooses actions (deceleration, braking, low acceleration, and others) based on its optimal policy. Car B needs to reach the end of the merging lane before or after Car A in order to merge safely.

1.1 Significance of IRL

Researchers in the areas of machine learning and artificial intelligence have developed a substantial interest in IRL because it caters to the following needs.

1.1.1 Demonstration Substitutes Manual Specification of Reward

Typically, if a designer wants a specific behavior in an agent, she manually formulates the problem as a forward learning or forward control task solvable using solution techniques in RL, optimal control, or predictive control. A key element of this formulation is a specification of the agent’s preferences and goals via a reward function. In the traffic merge example, we may hand design a reward function for Car A. For example, a reward of +1 if taking an action in a state decreases the relative velocity of Car A w.r.t. Car B within a predefined distance from the merging junction, thereby allowing for a safe merge. Analogously, a negative reward of -1 if taking an action in a state increases the relative velocity of Car A w.r.t. Car B. However, an accurate specification of the reward function often needs much trial-and-error to properly tune variables such as the thresholds for distance and velocity, and the relative magnitudes of the rewards. As such, the specification of a reward function is often time-intensive and cumbersome.

The need to pre-specify the reward function limits the applicability of RL and optimal control to problems where a lucid reward function can be easily specified. IRL offers a way to broaden the applicability and reduce the manual specification of the model, given that the desired policy or demonstrations of desired behavior are available. While acquiring the complete desired policy is usually infeasible, we have easier access to demonstrations of behaviors, often in the form of recorded data. For example, all state to action mappings for all contingencies for Car A are not typically available, but datasets such as NGSIM contain trajectories of Car A in real-world driving. Thus, IRL forms a key method for learning from demonstration argall2009survey.

A topic in control theory related to IRL is inverse optimal control Boyd94. While the input in both IRL and inverse optimal control is trajectories consisting of state-action pairs, the target of learning in the latter is the function mapping the states of the observed agent to her actions. The learning agent may use this policy to imitate it or deliberate with it in its own decision-making process.

1.1.2 Improved Generalization

A reward function represents the preferences of an agent in a succinct form, and is amenable to transfer to another agent. The learned reward function may be used as is if the subject agent shares the same environment as the other; otherwise it provides a useful basis if the agent configurations differ mildly. Indeed, Russell Russell1998 believes that the reward function is inherently more transferable than the observed agent’s policy. This is because even a slight change in the environment – e.g., a few more states – renders the learned policy completely unusable, because it may not be revisable in straightforward ways.

1.1.3 Potential Applications

While introducing IRL, Russell Russell1998 alluded to its potential application in providing computational models of human and animal behavior because these are difficult to specify. In this regard, Baker et al. Baker2009_action and Ullman et al. Ullman2009 demonstrate the inference of a human’s goal as an inverse planning problem in an MDP. Furthermore, IRL’s use toward apprenticeship has rapidly expanded the set of visible applications such as:

Figure 2: Helicopter maneuvers using RL on a reward function learned from an expert pilot through IRL. The image is reprinted from Abbeel2007 with permission from the publisher.

  1. Autonomous vehicle control by learning from an expert to create an agent with the expert’s preferences. Examples of such applications include helicopter flight control Abbeel2007 illustrated in Figure 2, boat sailing Neu2007, socially adaptive navigation that avoids colliding into humans by learning from human walking trajectories Kretzschmar_interactingpedestrians; Kim2016_adaptivenavigation, and so on;

  2. Learning from another agent to predict its behavior. In this use, applications include destination or route prediction for taxis Ziebart2008; Ziebart_Cabbie, footstep prediction for planning legged locomotion Ratliff2009, anticipation of pedestrian interactions Ziebart_predictionpedestrian, energy efficient driving Vogel_efficientdriving, and penetration of a perimeter patrol Bogert_mIRL_Int_2014 by learning the patrollers’ preferences and patrolling route.

1.2 Importance of this Survey

This article is a reflection on the research area of IRL focusing on the following important aspects:

  1. Formally introducing IRL and its importance, by means of various examples, to the researchers and practitioners unaware of the field;

  2. A study of the challenges that make the IRL problem difficult, and a brief review of the current (partial) solutions;

  3. Qualitative assessment and comparisons among different methods to evaluate them coherently. This will considerably help readers decide on the approach suitable for the problem at hand;

  4. Identification of key milestones achieved by the methods in this field along with their common shortcomings.

  5. Identification of the open avenues for future research.

Given the time and space constraints, a single article may not cover all methods in this growing field. Nevertheless, we have sought to make this survey as comprehensive as possible.

Section 2 mathematically states the IRL problem. After the statement, a description of the fundamental challenges – Accurate Inference, Generalizability, Correctness of Prior Knowledge, Growth in Solution Complexity with Problem Size, and Direct Learning of Reward or Policy Matching – ensues in Section 3. Section 4 discusses the methods for solving the IRL problem, which leads into Section 5 on a descriptive assessment of how these methods mitigate the challenges. Section 6 outlines the extensions of the original IRL problem and the methods implementing them. As an overview of the progress in this area, Section 7 explains common milestones and shortcomings in IRL research. The article concludes with open research questions.

2 Formal Definition of IRL

In order to formally define IRL, we must first decide on a framework for modeling the observed agent’s behavior. While methods variously ascribe frameworks such as an MDP, hidden-parameter MDP, or a POMDP to the expert, we focus on the most popular model by far, which is the MDP.

Definition 1 (MDP).

An MDP $\mathcal{M} = \langle S, A, T, R, \gamma \rangle$ models an agent’s sequential decision-making process. $S$ is a finite set of states and $A$ is a set of actions. Mapping $T: S \times A \rightarrow \mathrm{Prob}(S)$ defines a probability distribution over the set of next states conditioned on the agent taking action $a$ at state $s$; here $\mathrm{Prob}(S)$ denotes the set of all probability distributions over $S$. $T(s' \mid s, a)$ is the probability that the system transitions to state $s'$. Reward function $R: S \rightarrow \mathbb{R}$ specifies the scalar reinforcement incurred for reaching state $s$; another variant, $R: S \times A \rightarrow \mathbb{R}$, outputs the reward received after taking action $a$ at state $s$. Discount factor $\gamma \in [0, 1)$ weights the rewards accumulated along a trajectory $\tau = \big( (s_0, a_0), (s_1, a_1), \ldots \big)$.

A policy is a function mapping the current state to the next action choice(s). It can be deterministic, $\pi: S \rightarrow A$, or stochastic, $\pi: S \rightarrow \mathrm{Prob}(A)$. For a policy $\pi$, value function $V^{\pi}: S \rightarrow \mathbb{R}$ gives the value of a state as the long-term expected cumulative reward incurred from the state by following $\pi$. The value of a policy is,

$$V^{\pi} = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t}\, R(s_t) \,\middle|\, \pi \right] \tag{1}$$

The goal of solving the MDP is an optimal policy $\pi^{*}$ such that $V^{\pi^{*}}(s) \geq V^{\pi}(s)$ for all $s \in S$ and all policies $\pi$. The action-value function for $\pi$, $Q^{\pi}: S \times A \rightarrow \mathbb{R}$, maps a state-action pair to the long-term expected cumulative reward incurred after taking action $a$ from $s$ and following policy $\pi$ thereafter. We also define the optimal action-value function as $Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a)$. Subsequently, $\pi^{*}(s) \in \arg\max_{a \in A} Q^{*}(s, a)$. Another perspective on the value function involves multiplying the reward with the converged state-visitation frequency $\psi_{\pi}(s)$, the (discounted) number of times state $s$ is visited on using policy $\pi$. The latter is given by:

$$\psi_{\pi}^{k+1}(s) = \mu_0(s) + \gamma \sum_{s' \in S} \sum_{a \in A} \psi_{\pi}^{k}(s')\, \pi(a \mid s')\, T(s \mid s', a) \tag{2}$$

where $\psi_{\pi}^{0}(s)$ is initialized as 0 for all states and $\mu_0$ is the initial state distribution. Let $\Psi$ be the space of all such frequency functions. Iterating the above until the state-visitation frequency stops changing yields the converged frequency function, $\psi_{\pi}$. We may write the value function as, $V^{\pi} = \sum_{s \in S} \psi_{\pi}(s)\, R(s)$.
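As a concrete illustration of the recursion in Eq. 2, the following Python sketch iterates the state-visitation frequencies to convergence for a deterministic policy. The array shapes, helper name, and tolerance are assumptions made for illustration, not part of any surveyed method.

```python
import numpy as np

def state_visitation_frequency(T, policy, mu0, gamma=0.9, tol=1e-6):
    """Iterate the recursion of Eq. 2 until the frequencies stop changing.

    T: array of shape (n_actions, n_states, n_states) with T[a, s, s'] = T(s' | s, a).
    policy: array mapping each state to the action chosen there (deterministic).
    mu0: initial state distribution over states.
    """
    n_states = T.shape[1]
    psi = np.zeros(n_states)                      # psi^0 initialized to 0
    while True:
        psi_next = mu0 + gamma * np.array([
            sum(psi[sp] * T[policy[sp], sp, s] for sp in range(n_states))
            for s in range(n_states)
        ])
        if np.max(np.abs(psi_next - psi)) < tol:  # converged frequency function
            return psi_next
        psi = psi_next
```

The value of the policy can then be recovered as `psi @ R` for a state-based reward vector `R`, matching the expression for the value function above.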

We are now ready to give the formal problem definition of IRL. We adopt the conventional terminology in IRL, referring to the observed agent as an expert and the subject agent as the learner.

Definition 2 (IRL).

Let an MDP without reward, $\mathcal{M} \setminus R_E = \langle S, A, T, \gamma \rangle$, represent the dynamics of the expert $E$. Let $\mathcal{D} = \{\tau_1, \tau_2, \ldots, \tau_N\}$ be the set of demonstrated trajectories. A trajectory in $\mathcal{D}$ is denoted as $\tau = \big( (s_0, a_0), (s_1, a_1), \ldots \big)$. We assume that all $\tau \in \mathcal{D}$ are perfectly observed. Then, determine $\hat{R}_E$ that best explains either the policy $\pi_E$ or the observed behavior in the form of the demonstrated trajectories.

Figure 3: A schematic showing the subject agent (shaded in blue) performing RL Kaelbling1996. In forward learning or RL, the agent chooses an action at a known state and receives a reward in return. In inverse learning or IRL, the input and output for the learner are reversed. The learner L perceives the states and actions of expert E (an alternative to the input policy $\pi_E$), and infers a reward function, $\hat{R}_E$, of E as output. Note that the learned reward function may not exactly correspond to the true reward function. Color should be used for this figure in print.

Notice that IRL inverts the RL problem. Whereas RL seeks to learn the optimal policy given a reward function, IRL seeks to learn the reward function that best supports a given policy or observed behavior. We illustrate this relationship between RL and IRL in Fig. 3.

3 Primary Challenges in IRL

IRL is challenging because the optimization associated with finding a reward function that best explains observations is essentially ill-posed. Furthermore, the computational costs of solving the problem tend to grow disproportionately with the size of the problem. We discuss these challenges in detail below, but prior to this discussion, we establish some notation. Let $\pi^{\hat{R}}$ be the policy obtained by optimally solving the MDP with reward function $\hat{R}$.

3.1 Accurate Inference

Classical IRL takes an expert demonstration of a task consisting of a finite set of trajectories, knowledge of the environment and expert dynamics, and generates the expert’s potential reward function; this is illustrated in Fig. 4.

Figure 4: Pipeline for a classical IRL process. The learner receives an optimal policy or trajectories as input. The prior domain knowledge (shown here as a pentagon) includes completely observable states and fully known transition probabilities.

A critical challenge, first noticed by Ng and Russell Ng2000, is that many reward functions (including highly degenerate ones such as a function with all reward values zero) explain the observations. This is because the input is usually a finite and small set of trajectories (or a policy), and many reward functions in the set of all reward functions can generate policies that realize the observed demonstration. Thus, IRL suffers from an ambiguity of solution.

Figure 5: IRL in which the learner's sensing of the expert's states, actions, or both is noisy.

In addition, the learner’s sensing of the expert’s state and action could be imprecise due to noisy sensors, as illustrated in Fig. 5. For example, the vehicle tracking data in NGSIM mixes up car IDs at multiple points in the data due to imprecise object recognition when cars are close to each other. Inaccurate input may negatively impact the learning of the reward function. This could be corrected to some extent if a model of the observation noise is available and the IRL solution method integrates this model.

Given the difficulty of ensuring accurate inference, it is pertinent to contemplate how we may measure accuracy. If the true reward function $R_E$ is available for purposes of evaluation, then accuracy is the closeness of the learned reward function $\hat{R}_E$ to $R_E$. However, a direct comparison of rewards is not useful because an MDP’s optimal policy is invariant under affine transformations of the reward function Russell03:Artificial. On the other hand, two reward functions similar for the most part but differing for some state-action pairs may produce considerably different policies (behaviors). To make the evaluation targeted, a comparison of the behavior generated from the learned reward function with the true behavior of the expert is more appropriate. In other words, we may compare the policy generated from the MDP with $\hat{R}_E$ to the true policy $\pi_E$. The latter is either given or is generated using the true reward function. A limitation of this measure of accuracy is that a difference between the two policies in just one state could still have a significant impact, because performing the correct action at that state may be crucial to realizing the task. Consequently, this measure of closeness is inadequate because it would report just a small difference despite the high significance.

This brings us to the conclusion that we should measure the difference in values of the learned and true policies. Specifically, we may measure the error in inverse learning, called the inverse learning error (ILE), as $\mathrm{ILE} = \lVert V^{\pi_E} - V^{\pi_L} \rVert_p$, where $V^{\pi_E}$ is the value function for the actual policy and $V^{\pi_L}$ is that for the learned policy Choi2011.
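A minimal sketch of computing ILE, assuming the value functions of the expert's and learned policies have already been evaluated under the true reward; the choice of norm is a parameter here, not prescribed by the survey:

```python
import numpy as np

def inverse_learning_error(V_expert, V_learned, p=1):
    """ILE as the p-norm distance between the value vectors of the expert's
    policy and the learned policy, both evaluated under the true reward."""
    return np.linalg.norm(np.asarray(V_expert) - np.asarray(V_learned), ord=p)
```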

3.2 Generalizability

Generalization refers to the extrapolation of learned information to the states and actions unobserved in the demonstration and to starting the task at new initial states. Observed trajectories typically encompass a subset of the state space and the actions performed from those states. A well-generalized reward function should reflect the expert’s overall preferences relevant to the task. The challenge is to generalize correctly to the unobserved space using data that often covers a fraction of the complete space.

Notice that generalizability promotes the temptation of training the learner using fewer examples because the latter now possesses the ability to extrapolate. However, less data may contribute to greater approximation error in $\hat{R}_E$ and inaccurate inference.

ILE continues to be pertinent by offering a way to measure the generalizability of the learned information as well. This is because it compares value functions, which are defined over all states. Another procedure for evaluating generalizability is to simply withhold a few of the demonstration trajectories from the learner. These can be used as labeled test data for comparing with the output of the learned policy on the undemonstrated state-action pairs.

3.3 Correctness of Prior Knowledge

Many IRL methods Ng2000; Abbeel2004; Ratliff2006; Neu2007; Ziebart2008 represent the reward function, $\hat{R}_E$, as a weighted combination of feature functions; the problem then reduces to finding the values of the weights. Each feature function, $\phi_k$, is given and is intended to model a facet of the expert’s preferences.

Prior knowledge enters IRL via the specification of the feature functions in $\hat{R}_E$ and the transition function $T$ in the MDP ascribed to the expert. Consequently, the accuracy of IRL is sensitive to the selection of feature functions that encompass the various facets of the expert’s true reward function. Indeed, Neu et al. Neu2007 prove that IRL’s accuracy is closely tied to the scaling of the correct features. Furthermore, it also depends on how accurately the dynamics of the expert are modeled by the ascribed MDP. If the dynamics are not deterministic, due to say some noise in the expert’s actuators, the corresponding stochasticity needs to be precisely modeled in the transitions.

Given the significant role of prior knowledge in IRL, the challenge is two-fold: we must ensure its accuracy, but this is often difficult to achieve in practice; and we must reduce the sensitivity of the solution method to the accuracy of prior knowledge or replace the knowledge with learned information.

3.4 Growth in Solution Complexity with Problem Size

Methods for IRL are iterative as they involve a constrained search through the space of reward functions. As the number of iterations may vary based on whether the optimization is convex, whether it is linear, whether the gradient can be computed quickly, or none of these, we focus on analyzing the complexity of each iteration, namely its time and space complexity.

Each iteration’s time is dominated by the complexity of solving the ascribed MDP using the currently learned reward function. While the complexity of solving an MDP is polynomial in the size of its parameters, parameters such as the state space are impacted by the curse of dimensionality – its size is exponential in the number of components of the state vector (dimensions). Furthermore, the state space in domains such as robotics is often continuous, and an effective discretization also leads to an exponential blowup in the number of discrete states. Therefore, increasing problem size adversely impacts the run time of each iteration of IRL methods.

Another type of complexity affecting IRL is sample complexity, which refers to the number of trajectories present in the input demonstration. As the problem size increases, the expert must demonstrate more trajectories in order to maintain the required level of coverage in the training data. Abbeel and Ng Abbeel2004 analyze the impact of sample complexity on the convergence of their IRL method.

3.5 Direct Learning of Reward or Policy Matching

Two distinct approaches to IRL present themselves, each with its own set of challenges. The first seeks to approximate the reward function by tuning it using the input data. The second approach focuses on learning a policy by matching its actions (or action values) with the demonstrated behavior, using the reward function only as a means.

Success of the first approach hinges on selecting accurate and complete prior knowledge (e.g., the set of feature functions) that composes the reward function. Though learning a reward function offers better generalization to the task at hand, it may lead to policies that do not fully reproduce the observed trajectories. The second approach is limited in its generalizability – learned policies may perform poorly in portions of the state space not present in the input data. More importantly, Neu et al. Neu2008 point out that this optimization is convex only when the actions are deterministic and the demonstration spans the complete state space. Consequently, choosing between the two approaches involves prioritizing generalizability over precise imitation or vice versa.

The next section categorizes the early, foundational methods in IRL based on the mathematical framework they use for learning, and discusses them in some detail.

4 Foundational Methods

A majority of IRL methods fit a template of key steps. We show this template in Algorithm 1 and present the methods in the context of this template. Such a presentation allows us to compare and contrast various methods.

Input: Trajectories demonstrating the desired behavior, i.e., $\mathcal{D} = \{\tau_1, \tau_2, \ldots, \tau_N\}$
Output: $\hat{R}_E$
1 Model the dynamics of the expert’s observed behavior as an MDP without a reward function;
2 Initialize a parameterized form (linear approximation, distribution over rewards, or others) of the solution;
3 Solve the MDP with the current solution to generate the learned behavior;
4 Update the parameters to minimize the divergence between the observed behavior and the learned behavior;
Repeat the previous two steps until the divergence is reduced to the desired level.
Algorithm 1 Template for IRL

Existing methods seek to learn the expert’s preferences, a reward function $\hat{R}_E$, represented in different forms such as a linear combination of weighted feature functions, a probability distribution over multiple real-valued maps from states to reward values, and so on. The parameters of $\hat{R}_E$ vary with the type of representation (weights, variables defining the shape of a distribution). IRL involves solving the MDP with the function hypothesized in the current iteration and updating the parameters, constituting a search that terminates when the behavior derived from the current solution converges to the observed behavior.
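The template of Algorithm 1 can be summarized as the following Python skeleton. All callables (`solve_mdp`, `divergence`, `update`) are placeholders that a concrete method would instantiate; none of the names come from a specific surveyed algorithm.

```python
def irl_template(demonstration, init_params, solve_mdp, divergence, update, tol=1e-3):
    """Generic skeleton of Algorithm 1.

    solve_mdp(params): solve the MDP under the current reward hypothesis and
        return the learned behavior (e.g., a policy or feature expectations).
    divergence(observed, learned): scalar mismatch between behaviors.
    update(params, observed, learned): return the new reward parameters.
    """
    params = init_params
    while True:
        learned = solve_mdp(params)                        # step 3: forward problem
        if divergence(demonstration, learned) <= tol:      # termination check
            return params
        params = update(params, demonstration, learned)    # step 4: parameter update
```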

This section categorizes methods according to the technique they use for learning – maximum margin based optimization, maximum entropy based optimization, Bayesian inference, regression, classification, and a few others. In addition to explaining how various methods instantiate the steps in the template, we discuss in the next section how they take steps to mitigate some or all of the primary learning challenges introduced previously in Section 3.

4.1 Maximum Margin Optimization

The general framework of maximum margin prediction aims to learn a reward function that makes the demonstrated policy look better than alternative policies by a margin. Methods under this category address the ambiguity of the solution (discussed in Section 3.1) by deciding on a solution that maximizes a margin over others.

Linear Programming

One of the earliest methods for IRL is Ng and Russell’s Ng2000, which takes the expert’s policy as input. It formulates a linear program to retrieve a reward function that not only produces the given policy as the optimal output of the complete MDP, but also maximizes the sum of the differences between the value of the optimal action and that of the next-best action over all states. In addition to maximizing this margin, it also prefers reward functions with smaller values as a form of regularization. We may express the reward function as a linear, weighted sum of basis functions, $\hat{R}_E(s) = \sum_{k=1}^{d} w_k\, \phi_k(s)$, where $\phi_k$ is a basis function and $w_k$ its weight, for all $k \in \{1, \ldots, d\}$. For state spaces that are not discrete or small, the constraints defined for each discrete state must be generalized to the continuous state space, or the algorithm can sample a large subset of the state space and restrict the constraints to these sampled states only (as Ng and Russell suggested).
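A sketch of Ng and Russell's linear program for a small, discrete MDP is given below, using scipy's LP solver. The variable layout, the regularization weight, and the reward bound are illustrative assumptions; the structure follows the margin and regularization terms described above, with the reward defined directly over states rather than basis functions for simplicity.

```python
import numpy as np
from scipy.optimize import linprog

def lp_irl(P, policy, gamma=0.9, l1=1.0, r_max=1.0):
    """Sketch of the Ng & Russell (2000) linear program for IRL.

    P: array of shape (n_actions, n_states, n_states); P[a, s, s'] = T(s' | s, a).
    policy: array of length n_states giving the expert's action in each state.
    Returns one reward vector over states consistent with the program.
    """
    n_actions, n_states, _ = P.shape
    # M = (I - gamma * P_pi)^{-1}, where P_pi follows the expert's policy.
    P_pi = np.array([P[policy[s], s] for s in range(n_states)])
    M = np.linalg.inv(np.eye(n_states) - gamma * P_pi)

    # Decision variables x = [R (n), t (n), u (n)]; minimize -(sum t) + l1 * (sum u).
    c = np.concatenate([np.zeros(n_states), -np.ones(n_states), l1 * np.ones(n_states)])

    A_ub, b_ub = [], []
    for s in range(n_states):
        for a in range(n_actions):
            if a == policy[s]:
                continue
            # Margin of the expert's action over action a at state s.
            row_R = (P_pi[s] - P[a, s]) @ M
            # t_s <= margin  ->  -row_R . R + t_s <= 0
            row = np.zeros(3 * n_states); row[:n_states] = -row_R; row[n_states + s] = 1.0
            A_ub.append(row); b_ub.append(0.0)
            # margin >= 0   ->  -row_R . R <= 0
            row = np.zeros(3 * n_states); row[:n_states] = -row_R
            A_ub.append(row); b_ub.append(0.0)
    for s in range(n_states):
        # |R_s| <= u_s encoded as two inequalities (L1 regularization).
        row = np.zeros(3 * n_states); row[s] = 1.0; row[2 * n_states + s] = -1.0
        A_ub.append(row); b_ub.append(0.0)
        row = np.zeros(3 * n_states); row[s] = -1.0; row[2 * n_states + s] = -1.0
        A_ub.append(row); b_ub.append(0.0)

    bounds = [(-r_max, r_max)] * n_states + [(None, None)] * n_states + [(0, None)] * n_states
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub), bounds=bounds, method="highs")
    return res.x[:n_states]
```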

Apprenticeship learning

Two methods Abbeel2004 under this general label extend the linear programming approach and perform max-margin optimization. Noting that the learner does not typically have access to the expert’s policy, max-margin and projection take a demonstration (defined in Def. 2) as input and learn a linear reward function. The expected feature count, or feature expectation, is the expected discounted sum of feature values under a policy,

$$\mu(\pi) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t}\, \phi(s_t) \,\middle|\, \pi \right] \tag{3}$$

The authors successfully used these expected feature counts or feature expectations as a representation of the value function of Eq. 1: $V^{\pi} = w^{\top} \mu(\pi)$. Given a set of trajectories $\mathcal{D} = \{\tau_1, \ldots, \tau_N\}$ generated by the expert, the empirical estimate of $\mu(\pi_E)$ is

$$\hat{\mu}_E = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{|\tau_i|} \gamma^{t}\, \phi\big(s_t^{(i)}\big) \tag{4}$$
Figure 6: Iterations of the max-margin method (inspired from Figure 1 in Abbeel2004) computing the weight vector, $w$, and the feature expectations, $\mu(\pi)$. $\hat{\mu}_E$ is the estimate of the feature counts of the expert. $w^{(i)}$ is the learned weight vector in the $i$th iteration and $\pi^{(i)}$ is the corresponding optimal policy for the intermediate hypothesis.

The methods seek a reward function that minimizes the difference between the feature expectations of the policy computed by the learner and the empirically computed feature expectations of the expert. Both methods iteratively tune the weights by computing the policy for the intermediate hypothesis at each step and using it to obtain intermediate feature counts. These counts are compared with the feature counts of the expert, as shown in Fig. 6, and the weights are updated at each step. Abbeel and Ng Abbeel2004 point out that the performance of these methods is contingent on matching the feature expectations, which may not yield an accurate $\hat{R}_E$ because feature expectations are based on the policy. An advantage of these methods is that their sample complexity depends on the number of features and not on the complexity of the expert’s policy or the size of the state space.
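The empirical feature expectations of Eq. 4 can be computed directly from the demonstrated trajectories, as in the short sketch below; the trajectory and feature-function formats are assumptions for illustration.

```python
import numpy as np

def empirical_feature_expectations(trajectories, phi, gamma=0.9):
    """Empirical estimate of the expert's feature expectations (Eq. 4).

    trajectories: list of trajectories, each a list of (state, action) pairs.
    phi: function mapping a state-action pair to a feature vector.
    """
    mu = None
    for traj in trajectories:
        for t, (s, a) in enumerate(traj):
            f = (gamma ** t) * np.asarray(phi(s, a), dtype=float)
            mu = f if mu is None else mu + f
    return mu / len(trajectories)
```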

An alternative to minimizing the value-loss in the projection method is to minimize, for each state, the difference between the expert’s action choices and the learner’s. Viewing a policy as a Boltzmann distribution over the action-values, Neu and Szepesvari Neu2007 present hybrid-IRL as a gradient descent method that searches the space of reward hypotheses, with a reward function parameterizing a policy. As the behavior of the expert is available instead of the policy, the difference above is computed using the empirically computed frequencies of the visitations of each state (Equation 2) and the frequencies of taking specific actions in that state.

A variant of the projection method described above is Syed et al.’s Syed2008 Multiplicative Weights for Apprenticeship Learning (mwal); the initial model and input are the same in both methods. However, it presents the learner as a max player choosing a policy and its environment as an adversary selecting a reward hypothesis. This formulation transforms the value-loss objective in the projection method into a minimax objective for a zero-sum game between the learner and its environment, with the weights as output.

Maximum margin planning

Under the assumption that each trajectory reflects a distinct policy, Ratliff et al. Ratliff2006 associate each trajectory with an MDP. While these MDPs could differ in their state and action sets, transition functions, and feature functions, they share the same weight vector. The MMP loss function quantifies the closeness between the learned behavior and the demonstrated behavior. The inputs to this function are the state-action frequency counts (computed by including actions along with states within Equation 2) of either trajectories or policies:

$$\mathcal{L}_i(\mu) = l_i^{\top} \mu \tag{5}$$

where $l_i$ defines the payoff on failing to match the state-action pairs of $\tau_i$, and $\mu$ is a vector of state-action visitation frequencies. Each of the MDPs includes the corresponding empirical feature expectation and a loss function, both computed using the associated trajectory.

The weight vector is obtained by solving a quadratic program that is constrained to have a positive margin between the expected value of the observed policy and the expected values of all other policies combined with their loss. The former is obtained by multiplying the empirical state-visitation frequency from the observed trajectory with the weighted feature functions (i.e., the reward). The solution is the weight vector that makes the expected value of the optimal policy closest to the expected value of the observed policy, for each of the MDPs.
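A hedged sketch of one structured-margin (sub)gradient step in the spirit of mmp is shown below. The helper `solve_loss_augmented_mdp` is hypothetical: it stands in for planning in the MDP whose reward is augmented by the loss field, which the actual method performs with its own planner and quadratic program.

```python
import numpy as np

def mmp_subgradient_step(w, F, mu_demo, loss_vec, solve_loss_augmented_mdp,
                         lr=0.1, reg=1.0):
    """One (sub)gradient step for a structured-margin objective.

    F: feature matrix of shape (n_features, n_state_actions) for one example MDP.
    mu_demo: state-action visitation frequencies of the demonstrated trajectory.
    loss_vec: per state-action payoff l_i for failing to match the demonstration.
    solve_loss_augmented_mdp(r): hypothetical helper returning the visitation
        frequencies of the optimal policy for the loss-augmented reward r.
    """
    mu_star = solve_loss_augmented_mdp(F.T @ w + loss_vec)
    subgrad = reg * w + F @ (mu_star - mu_demo)   # regularizer + margin violation
    return w - lr * subgrad
```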

Figure 7: An iteration of learch in the feature space (Fig. 3 in Ratliff et al. Ratliff2009, excerpted with permission). The method considers a reward function as the negative of a cost function. The blue path depicts the demonstrated trajectory, and the green path shows the maximum return (or minimum cost) trajectory according to the current intermediate reward hypothesis. Step 1 is the determination of the points where the reward should be modified, marked for a decrease or an increase. Step 2 generalizes the modifications to the entire space, computing the next hypothesis and the corresponding maximum return trajectory. The next iteration repeats these two steps. Color should be used for this figure in print.

Ratliff et al. Ratliff2009 improve on the previous maximum-margin prediction using quadratic programming in a subsequent method labeled learn to search (learch). Figure 7 explains how an iteration of learch increases the cost (decreases the reward) for the actions that cause the learned behavior to deviate from the demonstrated behavior. For optimization, learch uses an exponentiated functional gradient descent in the space of reward functions (represented as cost maps).

4.2 Entropy Optimization

IRL is essentially an ill-posed problem because multiple reward functions can explain the expert’s behavior. While the max-margin approach to the solution is expected to be valid in some domains, it introduces a bias into the learned reward function in general. Thus, multiple methods take recourse to the maximum entropy principle Jaynes_MaxEnt to obtain a distribution over potential reward functions while avoiding any bias. The distribution that maximizes entropy makes minimal commitments beyond the constraints and is therefore least wrong.

Maximum entropy IRL

This popular method for IRL introduced the principle of maximum entropy as a way of addressing the challenge of accurate inference in IRL. Maximum entropy IRL Ziebart2008 recovers a distribution over all trajectories which has the maximum entropy among all such distributions, under the constraint that the feature expectations of the learned policy match those of the demonstrated behavior. Mathematically, this problem can be formulated as a convex, nonlinear optimization:

$$\max_{P \in \Delta} \; -\sum_{\tau} P(\tau) \log P(\tau) \quad \text{s.t.} \quad \sum_{\tau} P(\tau) = 1, \qquad \sum_{\tau} P(\tau)\, \mu(\tau) = \hat{\mu}_E \tag{6}$$

where $\Delta$ is the space of all possible distributions $P(\tau)$, $\mu(\tau)$ is the feature count of trajectory $\tau$, and $\hat{\mu}_E$ is the expectation over the features computed from the observations of the expert. Ziebart et al. Ziebart2008 used Lagrangian relaxation to bring both constraints into the objective function and then solved the dual for the weights by utilizing exponentiated gradient descent.
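After the Lagrangian relaxation, the gradient of the dual objective is the difference between the empirical and model feature expectations. The sketch below performs a plain gradient ascent step on the weights (Ziebart et al. used exponentiated gradient descent); `expected_svf` is a hypothetical helper returning the expected state-visitation frequencies under the maximum entropy distribution induced by the current reward.

```python
import numpy as np

def maxent_gradient_step(theta, phi, mu_hat_E, expected_svf, lr=0.01):
    """One gradient step on the MaxEnt IRL dual objective.

    theta: current reward weights; the reward of state s is theta @ phi[s].
    phi: feature matrix of shape (n_states, n_features).
    mu_hat_E: empirical feature expectations of the expert (Eq. 4).
    expected_svf(theta): expected state-visitation frequencies under the
        distribution induced by the current reward hypothesis.
    """
    D = expected_svf(theta)              # shape (n_states,)
    grad = mu_hat_E - phi.T @ D          # feature-matching gradient
    return theta + lr * grad
```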

Figure 8: A Markov Random Field which favors the policies choosing similar actions in neighboring states. This structure generalizes the feature information beyond individual states. The figure is reprinted from Boularias2012 with permission from publisher.

A disadvantage of the above formulation is that it becomes intractable for long trajectories because the space of trajectories grows exponentially with the length. A related method Boularias2012 instead formulates the problem as one of finding a distribution over the space of deterministic policies, $\Pi$, given as:

$$\max_{P \in \Delta_{\Pi}} \; -\sum_{\pi \in \Pi} P(\pi) \log P(\pi) \quad \text{s.t.} \quad \sum_{\pi} P(\pi) = 1, \qquad \sum_{\pi} P(\pi)\, \mu(\pi) = \hat{\mu}_E \tag{7}$$

Due to the constraint of matching feature expectations, the process is equivalent to maximizing the likelihood of the expert’s behavior under the maximum entropy distribution. The set of deterministic policies has size $|A|^{|S|}$, which is independent of the length and the number of demonstrated trajectories, but it could become considerably large with the size of the state space.

Some real applications exhibit a structure in which neighboring states prefer similar optimal actions. To exploit this information, Boularias et al. Boularias2012 introduced special features in the aforementioned constraints to make the solution distribution favor policies for which neighboring states have similar optimal actions. This method, structured apprenticeship learning, results in a Markov random field-like distribution (Figure 8).

Relative entropy IRL

Entropy optimization is also useful where the transition dynamics are not known. In reirl Boularias2011relative, the relative entropy (also known as the Kullback-Leibler divergence Kullback68informationtheory) between two distributions over trajectories is minimized and the corresponding reward function is obtained.

From the space of all possible distributions, the learner chooses an arbitrary distribution over trajectories whose feature expectations lie within a pre-determined closeness to the true feature expectations of the input demonstration. Using a given baseline policy, the learner empirically computes another distribution by sampling trajectories. Then the method learns the reward function by minimizing the relative entropy between these two probability distributions. Notably, the input behavior need not be optimal for this method. An analytical computation of the solution to this optimization problem requires a known transition function; absent such a model, the learner estimates the solution using importance sampling.
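The sample-based estimate in reirl can be pictured as a self-normalized importance-sampling computation, as in the sketch below. The function and argument names are placeholders, and the target distribution shown (proportional to the exponentiated linear reward of a trajectory) is an assumption consistent with the entropy-based formulation above.

```python
import numpy as np

def importance_sampled_feature_expectations(sampled_trajs, traj_features,
                                            reward_weights, baseline_logprob):
    """Self-normalized importance-sampling estimate of feature expectations
    under the current reward hypothesis, using trajectories sampled from a
    baseline policy.

    traj_features(tau): discounted feature count of trajectory tau.
    reward_weights: current weight vector w; the target distribution is taken
        proportional to exp(w . features(tau)).
    baseline_logprob(tau): log-probability of tau under the baseline policy.
    """
    feats = np.array([traj_features(t) for t in sampled_trajs])
    logw = feats @ reward_weights - np.array([baseline_logprob(t) for t in sampled_trajs])
    logw -= logw.max()                      # numerical stability
    w = np.exp(logw)
    w /= w.sum()                            # normalize the importance weights
    return w @ feats
```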

Recently, Finn et al. Finn_gcl extend the above sample-based estimation approach (guided cost learning, gcl) to model-free maximum entropy optimization by allowing a neural network to represent the reward function Wulfmeier2015. Replacing an uninformed baseline policy, which effectively leads to a uniform distribution over trajectories, sample generation is guided by policy optimization that uses the current reward hypothesis to update the baseline policy, which in turn guides sample generation. In this way, RL is interleaved with IRL to generate more trajectories that are supported by a larger reward. As Wulfmeier et al. Wulfmeier2015 point out, the difference in the visitation frequencies can be passed as an error signal to the neural network, and backpropagation computes the gradient.

4.3 Bayesian Update

Some methods treat the state-action pairs in a trajectory as observations that facilitate a Bayesian update of a prior distribution over candidate reward solutions. This approach yields a different but principled way for IRL that has spawned various improvements.

Bayesian IRL

A posterior distribution over candidate reward functions is obtained using the following Bayesian update:

$$P(\hat{R}_E \mid \tau) = \frac{P(\tau \mid \hat{R}_E)\, P(\hat{R}_E)}{P(\tau)} \propto P(\tau \mid \hat{R}_E)\, P(\hat{R}_E) \tag{8}$$

where $\tau$ is a demonstrated trajectory, and $P(\tau \mid \hat{R}_E) = \prod_{(s, a) \in \tau} P\big((s, a) \mid \hat{R}_E\big)$.

Ramachandran and Amir Ramachandran2007 define the likelihood as a logit or exponential distribution of the Q-value of the state-action pair:

$$P\big((s, a) \mid \hat{R}_E\big) = \frac{e^{\alpha\, Q^{*}(s, a;\, \hat{R}_E)}}{Z} \tag{9}$$

where $\alpha$ controls the randomness in action selection (the lower the $\alpha$, the more exploratory the action) and $Z$ is the partition function or normalization constant. Given a candidate reward hypothesis, some state-action pairs are more likely than others, as given by the likelihood function. As the space of reward functions is technically continuous, Ramachandran and Amir present a random walk algorithm for implementing the update and obtaining a sampled posterior.
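Given the Q-values computed for a candidate reward, the log-likelihood of a demonstration under the Boltzmann model of Eq. 9 can be evaluated as below; this quantity drives the accept/reject decisions of the random walk. The array layout is an assumption for illustration.

```python
import numpy as np

def demo_log_likelihood(demo, Q, alpha=1.0):
    """Log-likelihood of a demonstration under the Boltzmann likelihood of Eq. 9.

    demo: iterable of (state, action) index pairs.
    Q: array of shape (n_states, n_actions) of action values for the candidate reward.
    alpha: confidence parameter; lower values model a more exploratory expert.
    """
    logp = 0.0
    for s, a in demo:
        z = alpha * Q[s]
        z -= z.max()                          # numerical stability
        logp += z[a] - np.log(np.exp(z).sum())
    return logp
```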

An extension of Bayesian IRL Lopes2009 measures the state-based entropy of the posterior over the reward functions. For a state-action pair $(s, a)$ and a chosen reward function $R$, let $\mu_{sa}(R)$ denote the probability that the expert takes action $a$ in state $s$ under $R$. The posterior distribution induces the following discretized distribution over the values of $\mu_{sa}$:

$$P\big(\mu_{sa} \in I_k\big) = \int_{\{R \,:\, \mu_{sa}(R) \in I_k\}} P(R \mid \tau)\, dR \tag{10}$$

where $I_k$ is a subinterval of the interval $I = [0, 1]$.

The state-based entropy of this distribution is

$$H(s) = -\sum_{a \in A} \sum_{k} P\big(\mu_{sa} \in I_k\big) \log P\big(\mu_{sa} \in I_k\big) \tag{11}$$

In an example of active learning, the learner can query the expert for sample demonstrations in states exhibiting high entropy, with the aim of learning a posterior that is well informed at all states. Prior knowledge about the structural similarities between states improves the effectiveness of this approach.
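A minimal sketch of the active-learning criterion follows, assuming the posterior is represented by samples and that a per-state quantity in [0, 1] (such as the probability of the demonstrated action being optimal) has been computed for each sample; the bin count and the query rule are illustrative assumptions.

```python
import numpy as np

def state_entropies(posterior_samples, n_bins=10):
    """Per-state entropy of a sampled, discretized posterior.

    posterior_samples: array of shape (n_samples, n_states); each entry is a
        per-state quantity in [0, 1] computed for one sampled reward function.
    Returns an entropy value per state; the learner queries the expert at the
    states with the highest entropy.
    """
    n_samples, n_states = posterior_samples.shape
    H = np.zeros(n_states)
    for s in range(n_states):
        counts, _ = np.histogram(posterior_samples[:, s], bins=n_bins, range=(0.0, 1.0))
        p = counts / n_samples
        p = p[p > 0]
        H[s] = -(p * np.log(p)).sum()
    return H
```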

birl via Bisimulation

Melo et al. S.Melo2010 proposed a bisimulation technique that imposes structure in their IRL algorithm, improving its generalization capacity while avoiding the computations needed to solve the MDP multiple times. The method uses an MDP metric quantifying the equivalence between states based on the closeness of the output action-distributions of a (stochastic) policy at the states Ferns2004; Taylor2009. The approach models the policy as a classifier in which the state in each state-action pair is the datum and the action is the label. The authors devise a kernel based on the metric and use kernel regression to generalize the demonstrated behavior to unobserved states. After the regression, the algorithm updates a distribution over the parameters of the policy using Bayesian inference. The authors extend their method to active learning, in which the learner queries the expert at the states with the maximum variance of the distribution.

Bayesian nonparametrics

Michini et al. Michini_CRPBNIRL1; Michini_CRPBNIRL2 (crp-bnirl) show a straightforward application of nonparametric clustering by letting a Chinese restaurant process partition a trajectory into as many subtrajectories as needed and assign a reward function to each subtrajectory, to best fit the observed data. Subtrajectories are presumed to represent subtasks with their own subgoals. However, large state spaces and long trajectories make the calculation of the demonstration likelihood computationally intensive. The authors use RTDP (real-time dynamic programming) and action comparison with existing closed-loop controllers to avoid discretizing the state or action space for this calculation. The comparison also allows real-time learning.

4.4 Regression and Classification

Classical machine learning techniques such as regression and classification have also played a significant role in IRL.

Feature construction using regression

In a unique effort to automatically learn the feature functions, Levine et al. Levine2010_featureconstruction begin with primitive binary feature functions (logical conjunctions) that form components of the final features used in the reward function. Starting with an empty set of features, the method firl iteratively updates the reward hypothesis and adds hypothesized feature functions to it. The search involves building a minimal regression tree on the state space that encodes those conjunctions, resulting in new binary feature functions which make $\hat{R}_E$ consistent with the demonstrated behavior.

Multi-label classification

Klein et al. Klein2012structured viewed IRL as a multi-label classification problem (scirl) in which the demonstration is the training data and the state-action pairs are the data-label pairs. They consider the action values (the score function of the classifier) in terms of feature expectations $\mu(\pi)$ (see Eq. 3), which makes the weights $w$ shared parameters between the Q-function and the reward function. A multi-label classification algorithm computes the solution by inferring a weight vector that minimizes the classification error.

An extension of the above method Klein_CSI (called csi) estimates the transition probabilities if they are unknown. csi utilizes standard regression on a simulated demonstration data set to estimate the transition model and thereby learn the reward function (not necessarily linear this time).

4.5 Other Miscellaneous Techniques

We briefly review a few other IRL methods that do not fall in the broad categories discussed previously.

gpirl - Gaussian Process IRL

gpirl uses a Gaussian process to approximate the reward as a function non-linear in the base features Levine2011. Instead of a single reward function, a hypothesis here is the mean of a parametric Gaussian posterior over possible rewards. The method optimizes the likelihood of the actions in the demonstration to compute the mean.

Maximum likelihood estimate

Rather than the maximum-a-posteriori estimate of the expert’s reward function obtained from Bayesian learning, Vroman et al. Babes-Vroman2011 choose the reward function that leads to the maximum likelihood estimate. In mlirl, the formulation of the estimate is the same as in Bayesian learning. However, this method remains challenged by the degeneracy and ill-posedness of the IRL problem.

Path integral

Kalakrishnan et al. Kalakrishnan_continousspace propose pi-irl, a path integral approach to learning continuous-state, continuous-action reward functions by sampling a local region around each demonstrated optimal trajectory. The optimization process does not consider the trajectories that incur significant control input because they deviate from the demonstration; the technique applies to high-dimensional problems due to the assumed local optimality of the demonstration.

5 Mitigating Challenges

This section elaborates on how the methods reviewed in the previous section mitigate the challenges introduced in Section 3, to help a practitioner make an informed choice about the method that may address the challenges in her domain.

5.1 Improvement in Accuracy of IRL

The accuracy depends on several aspects of the learning process. Most existing methods aim at ensuring that the input is accurate, reducing the ambiguity among solutions, improving feature selection, and offering algorithmic performance guarantees.

5.1.1 Learning from Faulty Input

Perturbations

Sub-optimal actions characterize a perturbed demonstration. Melo et al. Melo_perturbedinput provide a formal analysis and characterization of the space of solutions for the cases when some of the actions in the demonstration are not optimal (a perturbation in the distribution modeling the expert’s policy) and when the demonstration does not include samples from all states.

Some methods such as reirl stay robust to perturbation, whereas other IRL methods may learn inaccurate feature weights Abbeel2007 or predict the control poorly Ratliff2006, resulting in an unstable learned behavior. Methods such as maxentirl, birl, mlirl, and gpirl use probabilistic frameworks to account for the perturbation. For example, mlirl allows tuning of its model parameter $\alpha$ (Equation 9) to model the learned policy as random when the demonstrated behavior is expected to be noisy and suboptimal Babes-Vroman2011. On the other hand, methods such as mmp and learch introduce slack variables in their optimization objectives for this purpose. Using the application of helicopter flight control, Ziebart et al. ZiebartBD_compare_MMP show that the robustness of maxent against an imperfect demonstration is better than that of mmp.

Coates et al. Coates2008 introduce a generative model-based technique of trajectory learning that de-noises the noisy demonstration in order to learn from noise-free (unobserved) trajectories. In a subsequent method, Coates et al. apply IRL to the resultant noise-free trajectories Coates2009, using apprenticeship learning. Silver et al. Ratliff2009; Silver2008 also account for the actuation noise in a demonstration by relaxing the constraints in the optimization objective. Instead of constraining the learned policy to follow the demonstration exactly, the modified constraints make it follow the lowest-cost paths that are close to the examples (e.g., the set of paths with sufficiently small loss w.r.t. the demonstration). As an extreme instance of learning with sub-optimal input, Shiarlis et al. Shiarlis_2016_failed demonstrate IRL with demonstrations that failed to complete a task. Despite these contributions, research in IRL has not benefited from an explicit focus on robust learning in the context of faulty demonstrations in application domains.

Figure 9: Learning with a perturbed demonstration in an unstructured terrain. Figure reprinted from Silver2008 with permission from MIT Press. An expert provides three demonstration trajectories - red, blue, and green [top left]. The portion of terrain traveled by the presumably achievable red trajectory should have low cost (high reward), as the expert is presumed optimal. But the path is not optimal; it is not even achievable by any planning system with predefined features, because passing through the grass is always cheaper than taking a wide berth around it. The assumed optimality of the expert forces the optimization procedure in IRL methods to lower the cost (increase the reward) for the features encountered along the path, i.e., the features for grass. This influences the efficiency of learning behavior on other paths (blue path) [bottom left]. Using a normalized functional gradient Silver2008 makes this lowering effect vanish [bottom right]. Color should be used for this figure in print.
Over Training

A sub-optimal demonstration may also be a trajectory whose length is much longer than desired. mmp converges to a solution by reducing the cost for following all demonstrated trajectories as compared to other, simulated trajectories. The reduction happens by exploiting the relatively lower visitation counts of the demonstration in the undesired part of the state space. However, mmp attempts this reduction for a sub-optimal demonstration as well, which could be avoided if the learning method distinguished an unusually long demonstration from optimal ones. Silver et al. Silver2008; Ratliff2009 implement such a solution by applying a functional gradient normalized by the state visitation counts of the whole trajectory (Fig. 9).

5.1.2 Ambiguity in Hypotheses

Various methods mitigate this challenge of ambiguity and degeneracy by better characterizing the space of solutions. This includes using heuristics, prior domain knowledge, and additional optimization constraints.

Heuristics

max-margin and mwal avoid degenerate solutions by using heuristics that favor a learned value close to the expert’s. Specifically, mmp avoids degeneracy by using an objective (loss) function which degenerate rewards cannot minimize, because the function is proportional to state-visitation frequencies Neu2008. hybrid-IRL avoids degeneracy in the same way as mmp, and makes the solution unique by preferring a reward function that corresponds to a stochastic policy with the same action selection as the expert’s. Naturally, if no single non-degenerate solution makes the demonstration optimal, an unambiguous output is impossible using these methods Ziebart2010.

Constraints and Prior Knowledge

Many methods embrace the ambiguity by modeling the uncertainty of the output as a probability distribution over solutions or over the trajectories corresponding to the solutions. In this regard, maxentirl infers a unique reward function by using a probabilistic framework that avoids any constraint other than making the value-loss zero. The likelihood objectives in Bayesian inference techniques and gpirl limit the probability mass of their output, a posterior distribution, to the specific subset of reward functions which supports the demonstrated behavior. This change in probability mass shapes the mean of the posterior as well, which is the unique output of these methods.

Active learning uses the state-wise entropy of the posterior to select the most informative states Lopes2009. The selection mechanism builds on birl and enhances the uniqueness of the solution (the mean reward function) as compared to birl. In general, all these methods add optimization constraints and prior domain knowledge to exploit those aspects that contribute to the dissimilarity among solutions, in order to distinguish some solutions over others.

5.1.3 Lowering Sensitivity to Features

The performance of methods such as projection, max-margin, mmp, mwal, learch, and mlirl is highly sensitive to the selection of features, a challenge pointed out in Section 3.3. A few of the methods that use reward features attempt to mitigate this challenge. hybrid-IRL, which uses policy matching, and all maximum entropy based methods tune distributions over the policies or trajectories, which reduces the impact that feature selection has on the performance of IRL Neu2008.

Apart from the selection of appropriate features, the size of the feature space influences the error in the learned feature expectations for the methods that rely on $\hat{\mu}_E$, e.g., projection, mmp, mwal, and maxentirl. If a reward function is linear with the magnitude of its features bounded from above, then the high-probability bound on the error scales linearly with the number of features Ziebart2008. However, maximum entropy based methods show an improvement in this aspect, with a milder, sub-linear dependence on the number of features.

5.1.4 Theoretically Guaranteed Accuracy

From a theoretical viewpoint, some methods have better analytically proved performance guarantees than others. The maximum entropy probability distribution over the space of policies (or trajectories) minimizes the worst-case expected loss Grunwald2004. Consequently, maxentirl learns a behavior which is neither much better nor much worse than the expert’s Dimitrakakis2012. However, the worst-case analysis may not represent the performance on real applications, because the performance of optimization-based learning methods can be improved by exploiting favorable properties of the application domain. Classification-based approaches such as csi and scirl admit a theoretical guarantee for the quality of $\hat{R}_E$, in terms of the optimality of the learned behavior, given that both the classification and regression errors are small. Nevertheless, these methods may not reduce the loss as much as mwal, as the latter is, to our knowledge, the only method whose value-loss has no lower bound – it may even outperform the expert Syed2008.

Few methods analyze and bound the inverse learning error (ILE) for a given confidence of success and a given minimum number of demonstrations. The analysis relies on determining the value of a policy using feature expectations (Abbeel2004; Ziebart2008; Choi2011), state visitation frequencies (Neu2007; Ziebart2008; Ratliff2006), or occupancy measures (Babes-Vroman2011).

For any lfd method based on feature expectations or occupancy measures, there exists a probabilistic upper bound on the bias in $\hat{\mu}_E$, and thereby on the ILE, for a given minimum sample complexity Abbeel2004; Syed2008_supplement; Vroman2014. As a state visitation frequency is analogous to an occupancy measure, the bounds derived for the latter are applicable to the methods which use the former, e.g., mmp, hybrid-irl, and maxentirl. Subsequently, each derived bound on the bias can be used to analytically compute the maximum error in learning for a given minimum sample complexity Abbeel2004; Syed2008_supplement; Vroman2014. Lee et al. Lee_Improved_Projection change the criterion (and thereby the direction) for updating the current solution in the max-margin and projection methods. The method demonstrates a proven improvement in the accuracy of the solution as compared to that of the previous methods.

For specific problem domains, some of the algorithms for lfd that use IRL empirically show less value-loss than predecessor methods – these include mlirl, learch, and gpirl. Compared to the action-based models of Bayesian inference and hybrid-IRL, the entropy-based approaches show an improved match between the trajectories (not just state values) in the learned behavior and those in the demonstrated behavior Ziebart2008. Thus, the latter closely match not only the state-values of the expert and the learner but also their action-values. Similarly, among the relatively new entropy-based methods, gcl learns more accurately than reirl in complex application domains. However, this improved performance may entail a higher sample complexity.

5.2 Analysis and Improvement of Complexity

Figure 10: State equivalence in an MDP. The similarity in actions and transitions for states 1 and 2 makes them equivalent. Therefore, the selection of optimal actions through the expert’s policy will be similar in both states. A demonstration of the optimal action in one implies its optimality in the other. The illustration is inspired by Fig. 1 in Melo et al. S.Melo2010.

Active birl offers a benefit over traditional birl by exhibiting reduced sample complexity (see Section 3.4). This is because it seeks to ascertain the most informative states where a demonstration is needed, and queries for it. Consequently, fewer demonstrations are often needed and the method becomes more targeted. Of course, this efficiency comes at the computational expense of interacting with the expert. In a different direction, Melo et al. S.Melo2010 exploit an equivalence among the states to reduce the sample complexity, as shown in Fig. 10. Likewise, reirl uses fewer samples (input trajectories) as compared to alternative methods, including a model-free variant of mmp Boularias2011relative.

While emphasis on reducing time and space complexity is generally lacking among IRL techniques, a small subset does seek to reduce the time complexity. An analysis of birl shows that computing the policy for the mean of the posterior distribution over solutions is computationally more efficient than the direct minimization of the expected value-loss over the posterior Ramachandran2007. Specifically, the Markov chain approximating the Bayesian posterior, with a uniform prior, converges in polynomial time. Next, mwal requires a number of iterations that is logarithmic in the number of features for convergence, which is lower than the corresponding bound for the projection method. birl using bisimulation (Section 4.3) has low computational cost because it does not need to solve the MDP repeatedly, and the computation of the bisimulation metric over the state space occurs once regardless of the number of applications. Although an iteration of firl is slower than that of mmp and projection due to the computationally expensive step of regression, the former converges in fewer iterations than the latter.

Some optimization methods employ more affordable techniques for gradient computation. In contrast to the fixed-point method in hybrid-irl, the approximation method in active birl has a cost that is polynomial in the number of states. For maximum entropy based parameter estimation, gradient-based methods (e.g., BFGS fletcher1987practical) outperform iterative scaling approaches Malouf_ComparisonEstimatns.

5.3 Continuous State Spaces

Few approaches for IRL learn a reward function for continuous state spaces. A prominent group of methods in this regard are the path integral Theodorou_pathintegral based approaches Aghasadeghi_continousspace; Kalakrishnan_continousspace. pi-irl aims for local optimality of demonstrations to avoid the complexity of full forward learning in a continuous space. This approximation makes it scale well to high-dimensional continuous spaces and large demonstration data. Although the performance of path integral algorithms is sensitive to the choice of samples in the demonstration, they show promising progress in scalability. Similarly, Kretzschmar et al. Kretzschmar_interactingpedestrians apply maxentirl to learn the probability distribution over navigation trajectories of interacting pedestrians, using a subset of their continuous-space trajectories. A mixture distribution models both the discrete and the continuous navigation decisions. Further, to avoid an expensive discretization in large continuous spaces, crp-bnirl approximates the demonstration likelihood by comparing actions to an existing closed-loop controller. Melo et al. S.Melo2010 (Section 4.3, birl via bisimulation) suggest the use of MDP metrics for continuous state spaces to exploit their good theoretical properties, avoiding the cost of supervised or optimization-based learning in continuous spaces.

5.4 High Dimensional and Large Spaces

In many IRL methods such as maxentirl, the probability of choosing the observed demonstration trajectories is

$$P(\tau \mid w) = \frac{e^{\,w^{\top} \mu(\tau)}}{Z(w)} \tag{12}$$

The complexity of computing the partition function $Z(w)$ scales exponentially with the dimensionality of the state space, because it requires finding the complete policy under the current solution Ziebart2010.

Approximating Likelihood

The approximation of the reward likelihood by Levine et al. Levine2012ICML removes the exponential dependence on the dimensionality of the state space, and replaces the goal of global optimality of the input with local optimality. Other approaches for likelihood approximation in a high-dimensional state space include the use of importance sampling by reirl, the down-scaling of the state space using low-dimensional features Vernaza_highdimension_approximation, and the assumption by pi-irl that demonstrations are locally optimal. For the optimizations involved in maximum entropy methods, limited-memory variable metric optimization methods such as L-BFGS are shown to perform better than alternatives because they implicitly approximate the likelihood in the vicinity of the current solution Malouf_ComparisonEstimatns.

Task Decomposition

Instead of demonstrating complete trajectories for large tasks, a problem designer can often decompose the task hierarchically. An expert may then easily give demonstrations at different levels of implementation. The modularity of this process significantly reduces the complexity of learning. Kolter et al. Kolter_2007_HierarchicalAL propose exploiting the hierarchical structure of a physical system (quadruped locomotion) to scale IRL up from low-dimensional to high-dimensional complex domains. Likewise, to make a high-dimensional navigation task computationally tractable, Rothkopf et al. Rothkopf_modular_highDim utilize the independence between the components of a task to introduce decomposition into birl. The independence propagates to the transitions, stochastic reward functions, and Q-values of the subtasks, and the algorithm computes the individual contribution of each of the components. By decomposing a task into subtasks wherever possible, crp-bnirl allows a parallelized pre-computation of the (approximate) action-value function.

Also, the action comparison allows a problem designer to use prior knowledge to trade off computational complexity against accuracy (at the expense of increasing the problem-design cost). However, the reward functions learned by crp-bnirl are limited to reaching particular subgoal states only.

Speeding up Forward Learning

We may reduce the cost of repeatedly computing state values when evaluating the intermediate policies learned during IRL.

Syed et al. SyedLPAL2008 present a linear program for imitating the expert's policy that makes solving the MDP less expensive in large state spaces with many basis functions. Babes-Vroman et al. Babes-Vroman2011 take the dual of this linear program to include reward learning in the method, calling it lpirl, while keeping its computational benefits intact. Modeling the expert's policy as a classifier defined using a bisimulation-based kernel, 'birl via Bisimulation' uses a supervised learning approach (regression) to avoid repeatedly solving the MDP, irrespective of the nature of the underlying reward function. Similarly, csi and scirl do not need to repeatedly solve the MDP because they update a solution by exploiting the structure imposed on the MDP by their classification-based models. Another method Todorov2009 avoids this computation by requiring the system dynamics to have state value function bases, which are more difficult to construct than the commonly used reward function bases. As an alternative, Levine et al. Levine2012ICML avoid repeated forward learning by using a local approximation of the reward function around the expert's trajectories. In contrast, crp-bnirl achieves the same goal by limiting the computation to those states which improve the action values around the observations; this computation depends only on the size of the input demonstration rather than that of the complete state space.
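For concreteness, the following sketch sets up an occupancy-measure linear program in the spirit of the imitation LP of Syed et al., on a small random MDP with hypothetical features and expert feature expectations; it is an illustrative formulation, not the authors' exact program.

```python
import numpy as np
from scipy.optimize import linprog

# Tiny random MDP (stand-in data; features and expert expectations are hypothetical).
rng = np.random.default_rng(1)
S, A, K, gamma = 5, 2, 3, 0.9
T = rng.dirichlet(np.ones(S), size=(S, A))        # T[s, a, s'] = P(s' | s, a)
phi = rng.random((S, A, K))                        # per state-action feature vectors
rho0 = np.full(S, 1.0 / S)                         # initial state distribution
mu_E = rng.random(K) / (1 - gamma)                 # expert feature expectations (given)

# Decision variables: occupancy measures x(s,a) flattened, plus the margin B.
n_x = S * A
c = np.zeros(n_x + 1)
c[-1] = -1.0                                       # maximize B  <=>  minimize -B

# Bellman flow constraints: sum_a x(s,a) - gamma * sum_{s',a'} T(s|s',a') x(s',a') = rho0(s)
A_eq = np.zeros((S, n_x + 1))
for s in range(S):
    for a in range(A):
        A_eq[s, s * A + a] += 1.0
    for sp in range(S):
        for ap in range(A):
            A_eq[s, sp * A + ap] -= gamma * T[sp, ap, s]
b_eq = rho0

# Margin constraints: for each feature i, sum_{s,a} x(s,a) phi_i(s,a) - mu_E[i] >= B
A_ub = np.zeros((K, n_x + 1))
for i in range(K):
    A_ub[i, :n_x] = -phi[:, :, i].reshape(-1)
    A_ub[i, -1] = 1.0
b_ub = -mu_E

bounds = [(0, None)] * n_x + [(None, None)]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
x = res.x[:n_x].reshape(S, A)
policy = x / x.sum(axis=1, keepdims=True)          # mixed policy recovered from occupancies
print("margin B =", -res.fun)
```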

5.5 Generalizability

Few approaches explicitly learn a reward function that correctly represents the expert's preferences for unseen state-action pairs, or one that is valid in an environment that differs from the input. An added benefit is that such methods may need fewer demonstrations. gpirl can predict the reward for unseen states lying within the domains of the features for a Gaussian process. Furthermore, firl can use the learned reward function in an environment slightly different from the original environment used for demonstration, provided its base features are similar to those in the original environment. Similarly, gcl admits enhanced generalization by learning new instances of a previously learned task without repeated reward learning. 'birl with bisimulation' achieves improved generalization by partitioning the state space based on a relaxed equivalence between states.

6 Extensions of Irl

Having surveyed the foundational methods of IRL and how various challenges are mitigated, we now discuss important ways in which the assumptions of the original IRL problem have been relaxed to enable advances toward real-world applications.

6.1 Incomplete and Imperfect Observation

Learners in the real world must deal with noisy sensors and may not perceive the complete demonstration trajectory. For example, the merging car B in our illustrative example described in Section 1, Figure 1, may not see car A in the merging lane until A comes into its sensor view. This is often complicated by the fact that car B's sensors may be partially blocked by other cars in front of it, which further occludes car A.

6.1.1 Extended Definition

As shown in Fig. 11, this modifies the traditional IRL problem and we provide a new definition below.

Figure 11: IRL with incomplete or imperfect perception of the input trajectory. The learner is limited to using just the perceivable portions.
Definition (IRL with imperfect perception).

Let an MDP without reward, $\mathcal{M} = \langle S, A, T, \gamma \rangle$, represent the dynamics of the expert $E$. Let the set of demonstrated trajectories be $\mathcal{D} = \{\tau_1, \tau_2, \ldots, \tau_N\}$. We assume that either some portion of each trajectory $\tau_i$ is not observed, or some of the observed state-action pairs in a trajectory could differ from the ones actually demonstrated. Thus, let $S_{obs} \subseteq S$ and $A_{obs} \subseteq A$ be the subsets of states and actions that are observed. Then, determine $\hat{R}_E$ that best explains either the policy $\pi_E$ or the demonstrated trajectories $\mathcal{D}$.

Observing the trajectories imperfectly may require the learner to infer the unobserved state-action pairs, or the true ones, from the available information, which is challenging.

6.1.2 Methods

Pomdp-Irl

An expert with noisy sensors that senses its state imperfectly can be modeled as a partially observable MDP (POMDP) Kaelbling_pomdp. The expert's uncertainty about its current physical state is modeled as a belief (distribution) over its state space, and the expert's policy is then a mapping from beliefs to optimal actions. Choi et al. Choi2011 propose making available to the learner either this policy, or the prior belief along with the sequence of the expert's observations and actions (from which the expert's sequence of beliefs can be reconstructed). The POMDP policy is represented as a finite-state machine whose nodes give the actions to perform and whose edges are labeled by the observations received. The learner conducts a search through the space of policies by gradually improving on the previous policy until it explains the observed behavior. Figure 12 illustrates this approach.

Figure 12: In this illustration, similar to the one in Choi et al. Choi2011, consider a POMDP with two actions and two observations. The policy (solid lines) is a finite-state machine with nodes associated with actions and edges labeled by observations. The one-step deviating policies (dashed lines) are slight modifications of this policy: each visits a different node at one step and thereafter becomes the same as the original. Comparing the policy with its one-step deviations characterizes the set of potential solutions. Since such policies are suboptimal yet similar to the expert's, comparing against them instead of all possible policies reduces computation.
Irl with Occlusion

Bogert et al. Bogert_mIRL_Int_2014 introduce an extension, IRL*, for settings where the learner is unable to see portions of the demonstrated trajectories due to occlusion. The maximum entropy formulation of Boularias et al. Boularias2012 is generalized to allow feature expectations that span the observable state space only. This method is applied to the domain of multi-robot patrolling, as illustrated in Fig. 13.

Figure 13: Prediction of experts' behavior using multi-robot IRL Bogert_mIRL_Int_2014 in a multi-robot patrolling problem (left). The learner L (green) needs to cross hallways patrolled by two experts (black and gray), each driven by its own reward function. It has to reach the goal 'X' without being detected. Due to occlusion, only portions of the experts' trajectories are visible to L. After learning the reward functions, L computes the experts' policies and projects their trajectories forward in time and space to predict the possible locations of the patrollers at each future time step. These predictions help create L's own policy, as shown in the figure on the right. Color should be used for this figure in print.

Wang et al. Wang2002; Wang2012 introduce the principle of latent maximum entropy, which extends the principle of maximum entropy to problems with hidden variables. Using this extension, Bogert et al. Bogert_EM_hiddendata_fruit continued this line of work on incomplete observations and generalized multi-agent maxentirl to the setting where some causal factors of the expert's actions are hidden from the learner. For example, the force applied by a human while picking ripe fruit usually differs from that applied to unripe fruit, but this difference is hidden from an observing co-worker robot. An expectation-maximization scheme is introduced in which the E-step computes an expectation over the hidden variables and the M-step performs the maxent optimization.

Figure 14: Hidden variable MDP. The learner receives a noisy observation of the state reached after taking an action. The source of the illustration is hiddenMDP_Kitani2012. The figure is shown here with permission from the publisher.
Hidden Variable Inverse Optimal Control

A hidden variable MDP incorporates the probability of the learner's noisy observation (see Figure 14), conditioned on the current state, as an additional feature in the feature vector. Hidden variable inverse optimal control (hioc) hiddenMDP_Kitani2012 casts maxentirl to the setting modeled by the hidden variable MDP with a linear reward. Consequently, the expression for the likelihood of the expert's behavior incorporates the additional feature and its weight. During maximization, the tuning of the weights also adjusts the weight of this observation feature, thereby determining the reliability of the imperfect observations.
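A minimal sketch of this augmentation is shown below: the observation log-likelihood is appended to the reward features as one extra entry, and its weight (here a fixed stand-in value) is what the maxent learning would tune. All quantities are hypothetical.

```python
import numpy as np

def augmented_features(phi_s, obs_loglik_s):
    """Append the observation log-likelihood as one extra feature (hioc-style augmentation)."""
    return np.concatenate([phi_s, [obs_loglik_s]])

# Stand-in quantities for a single state: base reward features and the learner's
# noisy-observation log-likelihood at that state (both hypothetical values).
phi_s = np.array([0.3, 1.0, 0.0])
obs_loglik_s = np.log(0.7)

theta = np.array([0.5, -0.1, 0.2])     # weights on the base reward features
w_obs = 0.8                            # weight on the observation feature; tuned jointly
                                       # during learning to reflect observation reliability

reward = np.concatenate([theta, [w_obs]]) @ augmented_features(phi_s, obs_loglik_s)
print(f"augmented linear reward at this state: {reward:.3f}")
```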

We note that all the aforementioned methods use feature expectation matching for learning. By comparison, in the models underlying IRL* and hioc (an MDP and a hidden variable MDP, respectively), only the learner has an incomplete observation and the expert is aware of its own state, whereas in POMDP-IRL the expert cannot completely observe its state.

6.2 Multiple Tasks

Human drivers often exhibit differing driving styles based on traffic conditions as they drive toward their destination. For example, the style of driving on a merging lane of a freeway is distinctly different prior to joining the lane, while joining it, and after joining it. We may model such distinct behaviors of the expert(s) as driven by differing reward functions. Consequently, there is a need for methods that learn multiple reward functions simultaneously (Figure 15).

6.2.1 Extended Definition

Definition (Multi-task IRL).

Let the dynamics of the experts be represented by $k$ MDPs, each without a reward function; the number of MDPs, $k$, may not be known. Let the set of demonstrated trajectories be $\mathcal{D} = \{\tau_1, \tau_2, \ldots, \tau_N\}$. Determine the reward functions $\hat{R}_1, \hat{R}_2, \ldots, \hat{R}_k$ that best explain the observed behavior.

Figure 15: Multi-task IRL involves learning multiple reward functions. The input is mixed trajectories executed by a single expert or multiple experts realizing behaviors driven by different reward functions. The output is a set of reward functions, each associated with the subset of input trajectories potentially generated by it.

Associating each subset of input trajectories with the reward function that likely generated it makes this extension challenging. The problem becomes more complex when the number of involved tasks is not known.

6.2.2 Methods

Diverse methods have sought to address this problem and we briefly review them below.

mIrl - Multi-robot maxentirl

Bogert et al. Bogert_mIRL_Int_2014 extend IRL* and maxentirl to the context of multiple experts who may interact, albeit sparsely. While the navigation dynamics of each expert are modeled separately, the interaction is modeled as a strategic game between two players; this promotes scalability to many robots. The experiments illustrated in Fig. 13 (left) demonstrate modeling the interaction in maxentirl. Alternatively, all robots may be modeled jointly as a multiagent system. Reddy et al. Reddy_DecMulti adopt this approach and model multiple interacting experts as a decentralized general-sum stochastic game. Lin et al. Lin_MultiGame propose a Bayesian method that learns the distribution over rewards in a sequential zero-sum stochastic multi-agent game (not an MDP).

Maximum likelihood Irl for varying tasks

Babes-Vroman et al. Babes-Vroman2011 assume that the linear reward function of an expert can change over time across a chain of tasks. The method aims to learn multiple reward functions with common features. Given prior knowledge of the number of reward functions, the solution for each target function is a pair consisting of a weight vector and a correspondence probability; the latter ties a cluster of trajectories to a reward function. The process iteratively clusters trajectories based on the current hypothesis, followed by an application of mlirl to update the weights. This approach is reminiscent of expectation-maximization for clustering Gaussian data.
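The sketch below illustrates this EM-style loop on stand-in data: the E-step computes correspondence probabilities, and the M-step updates the weight vectors with a responsibility-weighted gradient step on a simplified Boltzmann likelihood over trajectory feature counts, which stands in for the full mlirl update.

```python
import numpy as np

rng = np.random.default_rng(2)
K, D, n_traj, n_iters = 2, 3, 20, 30     # reward functions, feature dim, trajectories, EM sweeps

# Stand-in data: per-trajectory discounted feature counts. In the real method these come
# from demonstrations, and the M-step is a full mlirl update.
F = rng.random((n_traj, D))

def log_softmax(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def traj_logliks(theta_k):
    """Stand-in log-likelihood of every trajectory under weight vector theta_k."""
    return log_softmax(F @ theta_k)

theta = rng.normal(size=(K, D))          # one weight vector per reward function
pi = np.full(K, 1.0 / K)                 # mixing proportions

for _ in range(n_iters):
    # E-step: responsibilities (correspondence probabilities) of clusters for trajectories.
    log_r = np.log(pi + 1e-12) + np.column_stack([traj_logliks(theta[k]) for k in range(K)])
    r = np.exp(log_r - log_r.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)

    # M-step: update mixing proportions and take a responsibility-weighted gradient step.
    pi = r.mean(axis=0)
    for k in range(K):
        p = np.exp(log_softmax(F @ theta[k]))        # model distribution over trajectories
        grad = (r[:, k, None] * (F - p @ F)).sum(axis=0)
        theta[k] += 0.1 * grad

print("responsibilities of the first three trajectories:\n", np.round(r[:3], 2))
```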

Figure 16: Parametric multi-task birl. Tuning the independent hyperprior modifies the dependent priors on reward functions and the priors on policies; the diagram shows this as a sequence of priors. The variation in priors is assumed to influence the observed trajectories. A Bayesian update outputs an approximated posterior over rewards and policies.
Parametric multi-task birl

Allowing for multiple experts with a known number of tasks, birl is generalized to a hierarchical Bayesian network by introducing a hyperprior that imposes a probability measure atop the space of priors over the joint reward-policy space. Dimitrakakis and Rothkopf Dimitrakakis2012 show how to sample the prior from a posterior updated using the input demonstration. This posterior (and thus the sampled prior) may differ for one expert trying to solve different tasks or for multiple experts trying to solve different tasks. Within the context of our running example, Fig. 16 illustrates how this approach may be used to learn priors for multiple drivers on a merging lane of a freeway.

Parametric multi-task birl assumes that different experts share a common prior distribution over policies. This prior can be either a joint prior over reward functions and policy parameters, or a joint prior over policies and their optimality. The method hypothesizes a posterior distribution over rewards by updating the prior based on a weighted likelihood of the demonstration; this process associates subsets of demonstration trajectories with the solutions likely to generate them.

Nonparametric multi-task birl

dpm-birl is a clustering method that learns multiple reward functions from unlabeled, fixed-length trajectories Choi2012. It differs from the previous methods because the number of experts is not known. The method therefore places a nonparametric Dirichlet process prior over the reward functions and aims to assign each input trajectory to the reward function that potentially generates it, thereby forming clusters of trajectories. Learning proceeds through Bayesian updates that compute the joint posterior over the reward functions and the sought cluster assignments. The process iterates until the reward functions and clusters stabilize.
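The following sketch shows the Chinese-restaurant-process style Gibbs sweep at the core of such nonparametric clustering, with a placeholder likelihood standing in for the Bayesian reward model of dpm-birl; the data and scores are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, n_traj, n_sweeps = 1.0, 15, 20        # DP concentration, trajectories, Gibbs sweeps
scores = rng.random(n_traj)                  # hypothetical per-trajectory summary statistic

def loglik(i, members):
    """Stand-in log-likelihood of trajectory i under a cluster's reward function.
    In dpm-birl this is the Bayesian likelihood under the cluster's reward posterior."""
    return -abs(scores[i] - np.mean([scores[j] for j in members])) if members else -1.0

clusters = [list(range(n_traj))]             # start with every trajectory in one cluster

for _ in range(n_sweeps):
    for i in range(n_traj):
        for c in clusters:                   # remove trajectory i from its current cluster
            if i in c:
                c.remove(i)
        clusters = [c for c in clusters if c]
        # CRP: existing cluster with prob ~ size * likelihood, new cluster with prob ~ alpha * likelihood.
        log_p = np.array([np.log(len(c)) + loglik(i, c) for c in clusters]
                         + [np.log(alpha) + loglik(i, [])])
        p = np.exp(log_p - log_p.max())
        p /= p.sum()
        choice = rng.choice(len(p), p=p)
        if choice == len(clusters):
            clusters.append([i])             # open a new cluster (a new reward function)
        else:
            clusters[choice].append(i)

print("number of clusters found:", len(clusters))
print("clusters:", clusters)
```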

6.3 Incomplete Model

The original IRL problem assumes complete knowledge of the transition model and the features. However, knowing the transition probabilities that fully represent the dynamics, or specifying the complete feature set, is challenging and often unrealistic. Hand-designed features introduce structure to the reward, but they increase the engineering burden. Inverse learning is difficult when the learner is partially unaware of the expert's dynamics or when the known features cannot sufficiently model the expert's preferences. Consequently, the learner must estimate the missing components while inferring the reward function.

6.3.1 Extended Definition

Figure 17: IRL with incomplete model of transition probabilities.
Definition (Incomplete Dynamics).

Let an MDP without reward, $\mathcal{M} = \langle S, A, T_{p}, \gamma \rangle$, represent the dynamics between an expert and its environment, where $T_{p}$ specifies the probabilities for only a subset of all possible transitions. Let the set of demonstrated trajectories be $\mathcal{D} = \{\tau_1, \tau_2, \ldots, \tau_N\}$. Then, determine the reward $\hat{R}_E$ that best explains either the input policy $\pi_E$ or the observed demonstration $\mathcal{D}$.

We illustrate the extension in Fig. 17.

Definition (Incomplete Features).

Let an MDP without reward, $\mathcal{M} = \langle S, A, T, \gamma \rangle$, represent the dynamics of an expert and its environment. Let the reward function depend on the feature set $\Phi$. The input is either a policy $\pi_E$ or a demonstration $\mathcal{D}$. If the given feature set $\Phi$ is incomplete, find the features and the reward function that best explain the input.

6.3.2 Methods

Incomplete Dynamics

While the majority of IRL methods assume completely specified dynamics, some researchers have aimed to learn the dynamics in addition to the reward function. For example, mwal Syed2008 obtains a maximum likelihood estimate of the unknown transition probabilities by computing the frequencies of state-action pairs that are observed more than a preset threshold number of times. The process completes the transition function by routing the transitions of the remaining state-action pairs to an absorbing state. To give formal guarantees for the accuracy of the learned dynamics, and thereby the reward function, the algorithm leverages a theoretical upper bound on the error of the learned transition model when the learner receives a precomputed minimum amount of demonstration. The upper bound on the learning error in the MDP with the learned transition model then extends naturally to the actual MDP of the expert Syed2008_supplement.
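A minimal sketch of this frequency-and-threshold estimate, with synthetic transition data and an illustrative threshold, is given below.

```python
import numpy as np
from collections import defaultdict

# Stand-in demonstration data: a list of (state, action, next_state) transitions.
rng = np.random.default_rng(4)
S, A, threshold = 5, 2, 3
ABSORBING = S                                  # index of an extra absorbing state
demo = [(int(s), int(a), int(sp))
        for s, a, sp in zip(rng.integers(0, S, 200),
                            rng.integers(0, A, 200),
                            rng.integers(0, S, 200))]

counts = defaultdict(lambda: np.zeros(S))
visits = defaultdict(int)
for s, a, sp in demo:
    counts[(s, a)][sp] += 1
    visits[(s, a)] += 1

# Learned transition model over S+1 states (the extra one is absorbing).
T_hat = np.zeros((S + 1, A, S + 1))
T_hat[ABSORBING, :, ABSORBING] = 1.0           # absorbing state loops onto itself
for s in range(S):
    for a in range(A):
        if visits[(s, a)] >= threshold:
            # frequently observed pair: maximum likelihood estimate from frequencies
            T_hat[s, a, :S] = counts[(s, a)] / visits[(s, a)]
        else:
            # rarely or never observed pair: route all probability to the absorbing state
            T_hat[s, a, ABSORBING] = 1.0

assert np.allclose(T_hat.sum(axis=2), 1.0)
```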

Figure 18: In mIRL*\t, each transition of an expert is associated with a set of transition features, and each state-action pair has an intended next state. Computing the unknown transition probabilities from the probabilities of the transition features is feasible because different transitions share transition features among them. The source of the illustration is Bogert_mIRL_woT_Int_2015, and the figure is reprinted with the author's permission.

While mwal assumes that the learner fully observes the states, mIRL* Bogert_mIRL_Int_2014 focuses on limited observations with unknown transition probabilities and multiple experts. Bogert and Doshi model each transition as an event composed of underlying components. For example, movement by a robot may be decomposed into its left and right wheels moving at some angular velocity; the probability of moving forward is then the joint probability of the left and right wheels rotating with the same velocity. The learner knows the intended next state for a state-action pair, and the probability not assigned to the intended state is distributed equally among the unintended next states. Importantly, the components, also called transition features, are likely to be shared between observable and unobservable transitions, as shown in Fig. 18. Therefore, a fixed distribution over the transition features determines the transition probabilities. The frequencies of state-action pairs in the demonstrations provide a set of empirical joint probabilities as potential solutions; the preferred solution is the distribution of component probabilities with the maximum entropy, computed through optimization. The learned information generalizes better in mIRL*\t than in mwal because the structure introduced by a shared basis is more generalizable, in the space of transition probabilities, than a local frequency-based estimate.

On the other hand, Boularias et al. Boularias2011relative show that transition models approximated from a small set of demonstrations may result in highly inaccurate solutions, which raises doubts about the efficacy of the previous two methods. To this end, pi-irl learns a reward function without any input transition probabilities. Furthermore, for estimating unknown dynamics, gcl Levine_UnknownDynamics_NN_policy iteratively runs a linear-Gaussian controller (the current policy) to generate trajectory samples, fits local linear-Gaussian dynamics to them using linear regression, and updates the controller under the fitted dynamics.
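The dynamics-fitting step of this loop can be sketched as an ordinary least-squares fit of a local linear-Gaussian model to rollout tuples; the data below are synthetic and the controller update is omitted.

```python
import numpy as np

rng = np.random.default_rng(5)
ds, da, n = 4, 2, 200                          # state dim, action dim, samples from the controller

# Stand-in rollout data: (s_t, a_t, s_{t+1}) tuples collected by running the current controller.
S_t = rng.normal(size=(n, ds))
A_t = rng.normal(size=(n, da))
true_F = rng.normal(scale=0.3, size=(ds, ds + da))
S_next = S_t @ true_F[:, :ds].T + A_t @ true_F[:, ds:].T + 0.05 * rng.normal(size=(n, ds))

# Fit s_{t+1} ~ F [s_t; a_t] + f by least squares (one local linear-Gaussian model).
X = np.hstack([S_t, A_t, np.ones((n, 1))])     # append a bias column for the offset f
W, *_ = np.linalg.lstsq(X, S_next, rcond=None)
F_hat, f_hat = W[:-1].T, W[-1]

# Gaussian noise covariance of the fitted dynamics, estimated from the residuals.
residuals = S_next - X @ W
Sigma_hat = residuals.T @ residuals / n
print("dynamics fit error:", np.linalg.norm(F_hat - true_F))
```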

csi also learns the reward function without transition probabilities as input, by regression on a separate dataset of non-expert transitions.

Incomplete Features

A generalization of mmp that focuses on IRL where the feature vectors are insufficient to explain the expert's behavior is mmpboost Ratliff2007. The method assumes that a predefined set of primitive features, which are easier to specify, generates the reward feature functions. In the space of nonlinear functions of the base features, mmpboost searches for new features that make the demonstrated trajectories more likely and any alternative (simulated) trajectories less likely. Consequently, the hypothesized reward function performs better than the one built from the original feature functions. Further, it is well known that methods employing L1-regularized objectives learn robustly when the input features are not all relevant Ng_RegularizationComparison. In addition to mmpboost, gpirl also uses base features to learn a new set of features that corresponds better to the observed behavior. Wulfmeier et al. Wulfmeier2015 and Finn et al. Finn_gcl propose neural networks as function approximators that avoid the cumbersome hand-engineering of appropriate reward features.

In some applications, it is important to capture the logical relationships between base features to learn an optimal function representing the expert's reward. Most methods do not determine these relationships automatically, but firl constructs features by capturing them. In contrast, bnp-firl uses an Indian buffet process prior to derive a Markov chain Monte Carlo (MCMC) procedure for the Bayesian inference of the features and weights of a linear reward function Choi2013bayesian. The authors demonstrate that this procedure constructs more succinct features than those of firl. Of course, all these methods are applicable only in domains where the base feature space contains primitive features sufficient to satisfactorily express the reward.

6.4 Nonlinear Reward Function

A majority of IRL methods assume that the solution is a weighted linear combination of a set of reward features. While this is sufficient for many domains, a linear representation may be overly simplistic in complex real-world tasks, especially when raw sensory input is used to compute the reward values Finn_gcl. Also, the analysis of the learner's performance w.r.t. the best solution is compromised when a linear form restricts the class of possible solutions. A significant challenge in relaxing this assumption is that nonlinear reward functions may take any shape, which could lead to a very large number of parameters.

As our original definition of IRL given in Def. 2 does not involve the structure of the learned reward function, it continues to represent the problem in the context of nonlinear reward functions as well.

Figure 19: Learning a nonlinear reward function with boosted features improves performance over a linear reward. The learner needs to imitate example paths drawn by humans on overhead imagery. Upper left panel: base features for a region. Upper right panel: image of the region used for testing; the red path is a demonstrated path and the green path is a learned path. Lower left panel: a cost map (un-boosted linear reward function) inferred using mmp, with uniformly high cost everywhere. Lower right panel: results of mmpboost. Since mmpboost creates new features by searching through a space of nonlinear reward functions, it performs significantly better. We reprinted this figure from Ratliff2007 with permission from MIT Press. Color should be used for this figure in print.

Methods such as projection, mmp, and maxentirl assume that the expert's reward is a weighted linear combination of a set of features; the assumption is required for computational expediency. To overcome this restriction, mmpboost, learch, and gpirl infer a nonlinear reward function: mmpboost and learch use a matrix of features in an image (a cost map), and gpirl uses a Gaussian process to represent the nonlinear function. Figure 19 shows the benefit of a nonlinear form with boosted features compared to a restrictive linear form. In addition, Wulfmeier et al. Wulfmeier2015 and Finn et al. Finn_gcl represent a complex nonlinear cost function using a neural network approximation, avoiding the assumption of a linear form.

6.5 Other Extensions

Ranchod et al. Ranchod_SkillTransfer present a Bayesian method to segment a set of unstructured demonstration trajectories, identifying the reward functions (as reusable skills in a task) that maximize the likelihood of the demonstration. The method is nonparametric because the number of reward functions is unknown.

Figure 20: The three-step process of IRL in a relational domain - classification outputs a score function, reward shaping optimizes the output, and regression approximates the reward function corresponding to the optimal score function. The author has shared this figure with us. Color should be used for this figure in print.

Munzer et al. Munzer_relationalIRL extend the classification-regression steps in csi to include relational learning, in order to benefit from the strong generalization and transfer properties associated with relational-learning representations. The process shapes the reward function from the score function computed by the classification step (see Fig. 20).

7 Discussion

The previous section explained how researchers have recently extended initial IRL methods to broaden their scope. This section discusses the common milestones and limitations of the current literature. We then briefly discuss Bayesian vis-a-vis entropy-based inference and optimization in general.

7.1 Milestones

Apart from the achievements in Section 5, IRL methods have achieved the milestones listed here.

7.1.1 Formalization of Reliable Evaluation Metric

As mentioned in Section 3.1, evaluation metrics like reward-loss are not reliable. In the pursuit of the different challenges of the IRL problem, some researchers formalized the norm of the value-loss as a metric for error in IRL. Choi et al. Choi2011 name this normed loss the Inverse Learning Error (ILE). Given the expert's policy $\pi_E$ and the learned policy $\pi_L$ in the space of MDP policies, ILE is calculated as

$ILE = \left\| V^{\pi_E} - V^{\pi_L} \right\|_p$   (13)

where both value functions are evaluated under the true reward. If the actual $\pi_E$ is unavailable, we compute ILE using a policy estimated from the demonstrated behavior.
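A straightforward way to compute ILE, assuming access to the true reward and exact policy evaluation, is sketched below on a small random MDP; the policies and reward are stand-ins.

```python
import numpy as np

def policy_value(T, R, policy, gamma):
    """Exact policy evaluation: solve (I - gamma * T_pi) V = R_pi."""
    S = R.shape[0]
    T_pi = np.array([T[s, policy[s]] for s in range(S)])     # S x S transition matrix
    R_pi = np.array([R[s, policy[s]] for s in range(S)])
    return np.linalg.solve(np.eye(S) - gamma * T_pi, R_pi)

def inverse_learning_error(T, R_true, pi_E, pi_L, gamma=0.9, p=1):
    """ILE = || V^{pi_E} - V^{pi_L} ||_p, both evaluated on the true reward."""
    V_E = policy_value(T, R_true, pi_E, gamma)
    V_L = policy_value(T, R_true, pi_L, gamma)
    return np.linalg.norm(V_E - V_L, ord=p)

# Stand-in MDP and policies for illustration.
rng = np.random.default_rng(6)
S, A = 5, 2
T = rng.dirichlet(np.ones(S), size=(S, A))     # T[s, a, s']
R_true = rng.random((S, A))
pi_E = rng.integers(0, A, S)                   # expert's policy (known or estimated)
pi_L = rng.integers(0, A, S)                   # policy optimal under the learned reward
print("ILE =", inverse_learning_error(T, R_true, pi_E, pi_L))
```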

7.1.2 Achieving Monotonic Probability - Optimality Relation

In an output policy or an output distribution over trajectories, the probability mass assigned to an action or a trajectory should be proportional to the value obtained by following it. An IRL algorithm should thus assign a high probability to the optimal behavior demonstrated to the learner. Unexpectedly, max-margin and mwal are prone to computing a policy that assigns zero probability to the actions in the demonstration Ziebart2010. The label bias in action-value based methods (e.g., birl, hybrid-irl, and mlirl) may result in an output distribution without monotonicity between the output probability mass and the value of a sample Ziebart2008.

Different methods address this issue. mmp and its extensions avoid the label bias by making the solution policy's state-action visitations follow those of the expert Neu2008. maxent-based IRL methods address the issue by distributing probability mass based on the entropy value and feature expectation matching Ziebart2010. Further, gpirl addresses it by assigning a higher probability mass to the reward function corresponding to the demonstrated behavior, owing to the zero variance of the posterior distribution. Similarly, pi-irl maintains this monotonicity by using a continuous-state continuous-time model in which the likelihood of a trajectory is proportional to its exponentiated cost.

7.2 Shortcomings

The methods surveyed in this article encounter the following shortcomings.

7.2.1 Hand Crafting of Prior Knowledge

The IRL problem was originally devised to avoid the effort of hand tuning the reward function to obtain the desired behavior in an agent (expert). Although a variety of remarkable endeavors exist for solving this problem, all the methods contain some inevitable parameters that are painstakingly hard to tune in state spaces larger than a few dozen states (e.g., the base features of a reward function, an appropriate prior distribution for implementing Bayesian methods, the control parameter in Equation 9). Therefore, although every solution is a trade-off between a desired outcome and a realistically achievable outcome, it remains an open question whether research in IRL has satisfactorily achieved the desired outcome.

7.2.2 Scarcity of Analysis on Learning Bias

A limited number of IRL papers have explicitly analyzed the bias in the estimation of the value of the observed behavior. The authors of max-margin, projection, mwal, and mlirl bound this bias from above to derive a probabilistic lower bound on the sample complexity for achieving a learning error within a desired upper bound. The bias is a significant contributor to the error and a measure of how efficiently the available demonstration is used. Therefore, it deserves analysis in every IRL method.

8 Conclusions and Future Work

Since its introduction in 1998 by Russell, IRL has witnessed a significantly improved understanding of its inherent challenges, various methods for their mitigation, and extensions of the problem toward real-world applications. This survey explicitly focused on the specific ways by which methods mitigate challenges and contribute to the progress of IRL research, because we believe that successful approaches in IRL will combine the synergy of different methods to solve complex learning problems. Our improved understanding has also revealed new questions; more development and clarification are needed to answer them.

Direct and indirect Learning

When the state space is large and precise tuning of the reward parameters is cumbersome, directly learning a reward function results in better generalization than policy matching Neu2007 (see Section 3.5). However, the issue of choosing between these two routes, or exploiting their synergies, warrants a more thorough analysis.

Heuristics

Choi et al. Choi2011 observed that when they evaluated the values of the learned policy and the expert's policy on the learned reward, both were optimal and about equal. However, the learned policy does not achieve the same value as the expert's when the true reward is used for the evaluation. This is, of course, a quantification of the reward ambiguity challenge, which we pointed out earlier in this survey, and it significantly limits learning accuracy. We believe that the choice of heuristics in optimization may worsen or mitigate this issue.

Convergence analysis

The projection and mwal methods analyze their convergence by exploiting the geometric structure in the space of latent variables - feature expectations Abbeel2004. Both methods aim to match the observed feature expectations of the expert with the learner's estimated feature expectations. The geometric relationship between the expert's feature expectations and the outputs of successive iterations helps derive the error after each iteration; bounding this error gives the number of required iterations. Both methods use Hoeffding's inequality to relate this bound to the minimum sample complexity for a bounded bias in the estimated feature expectations. A comprehensive theoretical analysis of both the sample complexity and the number of iterations required to achieve a bounded error is important for knowing the cost of learning in problems with high-dimensional or large state spaces. However, most other methods, such as those that optimize maximum entropy and Bayesian methods, do not provide such analyses. (As POMDP-IRL is derived using the projection method, for which a sample complexity bound exists, a similar analysis for it seems feasible.) Such gaps are opportunities to formally analyze the convergence of these methods.
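As an illustration of how such a bound translates into a demonstration budget, the sketch below computes the number of trajectories sufficient, via Hoeffding's inequality and a union bound, to estimate every feature expectation within a tolerance, under the assumption that features lie in [0, 1]; the numbers are illustrative only.

```python
import numpy as np

def empirical_feature_expectations(trajs, phi, gamma):
    """mu_hat = (1/m) * sum over trajectories of sum_t gamma^t * phi(s_t, a_t)."""
    mu = np.zeros(phi.shape[-1])
    for traj in trajs:                      # each traj is a sequence of (state, action) pairs
        mu += sum(gamma**t * phi[s, a] for t, (s, a) in enumerate(traj))
    return mu / len(trajs)

def hoeffding_sample_size(k, eps, delta, gamma):
    """Trajectories needed so every coordinate of mu_hat is within eps of its mean,
    with probability at least 1 - delta. Assumes each feature lies in [0, 1], so each
    discounted sum lies in [0, 1/(1 - gamma)]; Hoeffding's inequality plus a union
    bound over the k features gives the stated sample size."""
    return int(np.ceil(np.log(2 * k / delta) / (2 * eps**2 * (1 - gamma)**2)))

# Illustrative numbers (not taken from the surveyed papers).
print(hoeffding_sample_size(k=10, eps=0.1, delta=0.05, gamma=0.9))
```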

Figure 21: The state of an i-POMDP evolves in an interactive state space that encompasses the computable models (beliefs, preferences, and capabilities) of other agents and the physical states of the environment. The agent maintains and updates its models of the other agents.
Multi-expert Interaction

Recent work on IRL for multi-agent interactive systems can be extended to include more general classes of interaction-based models to increase the potential for applications Lin_MultiGame; Reddy_DecMulti. These classes include models for fully observable state spaces (Markov games Littman1994markov_games, multi-agent Markov decision processes Boutilier1999, and interaction-driven Markov games Spaan_InteractiveFulObs) and partially observable states (partially observable identical-payoff stochastic games Peshkin00, multi-agent team decision problems PynadathT02, decentralized Markov decision processes Bernstein02, and i-POMDPs Gmytrasiewicz_Doshi_IPOMDP, illustrated in Fig. 21). Outside the domain of IRL, we note behavior-prediction approaches related to inverse optimal control in multi-agent game-theoretic settings WaughZB_InverseEquilibrium. The authors derive a regret-based criterion, applicable to Markov multi-agent games as well: for any linear reward function, the learned behavior of the agents should have regret no greater than that of the observed behavior.

Non-stationary rewards

Most methods assume a fixed reward function that does not change. However, the preferences of experts may change with time, making the reward function time-varying. Babes-Vroman et al. Babes-Vroman2011 capture such dynamic reward functions as multiple reward functions, but this approximation is crude. A more reasonable start in this research direction is the reward model in Kalakrishnan_continousspace.

Function approximations

As mentioned in Section 3.4, we need computationally inexpensive approximations of complex nonlinear reward functions in large state spaces. Unlike shallow local architectures, a deep neural network architecture is expressive given a large number of demonstration trajectories Bengio2007. Recent preliminary work has investigated neural network function approximation and the computation of the gradient of the maxent objective function using back-propagation, making IRL scalable to large spaces with nonlinear rewards Wulfmeier2015; Finn_gcl. Future work in this direction may bring IRL closer to real-world applications.

Efficient maxentIRL

The optimizations in Lafferty1996; WuKhudanpur are likely to yield improvements in maximum entropy IRL algorithms. When the input (demonstration) data is structured accordingly, these algorithms make the evaluation of the partition function efficient Malouf_ComparisonEstimatns.

Moving away from passive observation of the expert performing the task, Lopes et al. Lopes2009 allow the learner to query the expert for actions at selected states. While such active learning is shown to reduce the number of samples needed to learn the estimated reward, as we may expect, its usefulness is limited to settings where the learner may interact with the expert.

Some further observations merit enquiry and provide good avenues for future research. First, a principled way to construct and impose meaningful, generalizable features is necessary in high-dimensional spaces, in order to avoid requiring numerous demonstration trajectories and handcrafted features in such spaces. Second, future research could focus on the accuracy achieved with suboptimal input, as it is natural for a human demonstrator to make mistakes while training a learner. Third, there is a need for more work on modeling the sensing noise of the learner observing the expert, especially in robotic applications. Finally, improvements in generalization are required to reduce the sample complexity and the number of iterations needed to achieve a pre-specified accuracy Lopes2009. Some methods have shown improvements in this context Levine2011; Levine2010_featureconstruction; S.Melo2010.