Risk-sensitive Inverse Reinforcement Learning via Semi- and Non-Parametric Methods

11/28/2017 · Sumeet Singh et al. · Princeton University, Stanford University

The literature on Inverse Reinforcement Learning (IRL) typically assumes that humans take actions in order to minimize the expected value of a cost function, i.e., that humans are risk neutral. Yet, in practice, humans are often far from being risk neutral. To fill this gap, the objective of this paper is to devise a framework for risk-sensitive IRL in order to explicitly account for a human's risk sensitivity. To this end, we propose a flexible class of models based on coherent risk measures, which allow us to capture an entire spectrum of risk preferences from risk-neutral to worst-case. We propose efficient non-parametric algorithms based on linear programming and semi-parametric algorithms based on maximum likelihood for inferring a human's underlying risk measure and cost function for a rich class of static and dynamic decision-making settings. The resulting approach is demonstrated on a simulated driving game with ten human participants. Our method is able to infer and mimic a wide range of qualitatively different driving styles from highly risk-averse to risk-neutral in a data-efficient manner. Moreover, comparisons of the Risk-Sensitive (RS) IRL approach with a risk-neutral model show that the RS-IRL framework more accurately captures observed participant behavior both qualitatively and quantitatively, especially in scenarios where catastrophic outcomes such as collisions can occur.


1 Introduction

Imagine a world where robots and humans coexist and work seamlessly together. In order to realize this vision, robots must be able to (1) accurately predict the actions of humans in their environment, (2) quickly learn the preferences of human agents in their proximity and act accordingly, and (3) learn how to accomplish new tasks from human demonstrations. Inverse Reinforcement Learning (IRL) (Russell, 1998; Ng and Russell, 2000; Abbeel and Ng, 2005; Levine and Koltun, 2012; Ramachandran and Amir, 2007; Ziebart et al., 2008; Englert and Toussaint, 2015) is a powerful and flexible framework for tackling these challenges and has been previously used for a wide range of tasks, including modeling and mimicking human driver behavior (Abbeel and Ng, 2004; Kuderer et al., 2015; Sadigh et al., 2016a), pedestrian trajectory prediction (Ziebart et al., 2009; Mombaur et al., 2010), and legged robot locomotion (Zucker et al., 2010; Kolter et al., 2007; Park and Levine, 2013). More recently, the popular technique of Max-Entropy (MaxEnt) IRL, an inspiration for some of the techniques leveraged in this work, has been adopted in a deep learning framework (Wulfmeier et al., 2015) and embedded within the guided policy optimization algorithm (Finn et al., 2016). The underlying assumption behind IRL is that humans act optimally with respect to an (unknown) cost function. The goal of IRL is then to infer this cost function from observed actions of the human. By learning the human's underlying preferences (in contrast to, e.g., directly learning a policy for a given task), IRL allows one to generalize one's predictions to novel scenarios and environments.

The prevalent modeling assumption made by existing IRL techniques is that humans take actions in order to minimize the expected value of a random cost. Such a model, referred to as the expected value (EV) model, implies that humans are risk neutral with respect to the random cost; yet, humans are often far from being risk neutral. A generalization of the EV model is represented by the expected utility (EU) theory in economics (von Neumann and Morgenstern, 1944), whereby one assumes that a human is an optimizer of the expected value of a disutility function of a random cost. Despite the historical prominence of EU theory in modeling human behavior, a large body of literature from the theory of human decision making strongly suggests that humans behave in a manner that is inconsistent with the EU model. At a high level, the EU model has two main limitations: (1) experimental evidence consistently confirms that this model is lacking in its ability to describe human behavior in risky scenarios (Allais, 1953; Ellsberg, 1961; Kahneman and Tversky, 1979), and (2) the EU model assumes that humans make no distinction between scenarios in which the probabilities of outcomes are known and ones in which they are unknown, which is often not the case. Consequently, a robot interacting with a human in a safety-critical setting (e.g., autonomous driving or navigation using shared autonomy), while leveraging such an inference model, could make incorrect assumptions about the human agent's behavior, potentially leading to catastrophic outcomes.

The known and unknown probability scenarios are referred to as risky and ambiguous, respectively, in the decision theory literature. An elegant illustration of the role of ambiguity is provided by the Ellsberg paradox (Ellsberg, 1961). Imagine an urn (Urn 1) containing 50 red and 50 black balls. Urn 2 also contains 100 red and black balls, but the relative composition of colors is unknown. Suppose that there is a monetary payoff if a red ball is drawn (and no payoff for black). In human experiments, subjects display an overwhelming preference towards having a ball drawn from Urn 1. However, now suppose the subject is told that a black ball carries the payoff (and no payoff for red). Humans still prefer to draw from Urn 1. This is a paradox, since choosing to draw from Urn 1 in the first case (payoff for red) indicates that the human assesses the proportion of red in Urn 1 to be higher than in Urn 2, while choosing Urn 1 in the second case (payoff for black) indicates that the human assesses a lower proportion of red in Urn 1 than in Urn 2. Indeed, there is no utility function for the two outcomes that can resolve such a contradictory assessment of underlying probabilities, since it stems from a subjective distortion of outcome probabilities rather than rewards.

The limitations of EU theory in modeling human behavior have prompted substantial work on various alternative theories such as rank-dependent expected utility (Quiggin, 1982), expected uncertain utility (Gul and Pesendorfer, 2014), the dual theory of choice (distortion risk measures) (Yaari, 1987), prospect theory (Kahneman and Tversky, 1979; Barberis, 2013), and many more (see (Majumdar and Pavone, 2017) for a recent review of the various axiomatic underpinnings of these risk measures). Further, one way to interpret the Ellsberg paradox is that humans are not only risk averse, but also ambiguity averse – an observation that has sparked an alternative set of literature in decision theory on “ambiguity-averse” modeling; see, e.g., the recent review (Gilboa and Marinacci, 2016). The assumptions made by EU theory thus represent significant restrictions from a modeling perspective in an IRL context, since a human expert is likely to be both risk and ambiguity averse, especially in safety-critical applications such as driving, where outcomes are inherently ambiguous and can incur very high cost.

The key insight of this paper is to address these challenges by modeling humans as evaluating costs according to an (unknown) risk measure. A risk measure is a function that maps an uncertain cost to a real number (the expected value is thus a particular risk measure and corresponds to risk neutrality). In particular, we will consider the class of coherent risk measures (CRMs) (Artzner et al., 1999; Shapiro, 2009; Ruszczyński, 2010). CRMs were proposed within the operations research community and have played an influential role within the modern theory of risk in finance (Rockafellar and Uryasev, 2000; Acerbi and Tasche, 2002; Acerbi, 2002; Rockafellar, 2007). This theory has also recently been adopted for risk-sensitive (RS) Model Predictive Control and decision making (Chow and Pavone, 2014; Chow et al., 2015), and guiding autonomous robot exploration for maximizing information gain in time-varying environments (Axelrod et al., 2016).

Coherent risk measures enjoy a number of advantages over EV and EU theories in the context of IRL. First, they capture an entire spectrum of risk assessments from risk-neutral to worst-case and thus offer a significant degree of modeling flexibility. Second, they capture risk sensitivity in an axiomatically justified manner; specifically, they formally capture a number of intuitive properties that one would expect any risk measure to satisfy (see Section 2.2). Third, a representation theorem for CRMs (Section 2.2) implies that they can be interpreted as computing the expected value of a cost function in a worst-case sense over a set of probability distributions (referred to as the risk envelope). Thus, CRMs capture both risk and ambiguity aversion within the same modeling framework since the risk envelope can be interpreted as capturing uncertainty about the underlying probability distribution that generates outcomes in the world. Finally, they are tractable from a computational perspective; the representation theorem allows us to solve both the inverse and forward problems in a computationally tractable manner for a rich class of static and dynamic decision-making settings.

Statement of contributions: This paper presents an IRL algorithm that explicitly takes into account risk sensitivity under general axiomatically-justified risk models that jointly capture risk and ambiguity within the same modeling framework. To this end, this paper makes four primary contributions. First, we propose a flexible modeling framework for capturing risk sensitivity in humans by assuming that the human demonstrator (hereby referred to as the “expert”) acts according to a CRM. This framework allows us to capture an entire spectrum of risk assessments from risk-neutral to worst-case. Second, we develop efficient algorithms based on Linear Programming (LP) for inferring an expert’s underlying risk measure for a broad range of static (Section 3) decision-making settings, including a proof of convergence of the predictive capability of the algorithm in the case where we only attempt to learn the risk measure. We additionally consider cases where both the cost and risk measure of the expert are unknown. Third, we develop a maximum likelihood based model for inferring the expert’s risk measure and cost function for a rich class of dynamic decision-making settings (Section 4), generalizing our work in (Majumdar et al., 2017). Fourth, we demonstrate our approach on a simulated driving game (visualized in Figure 1) using a state-of-the-art commercial driving simulator and present results on ten human participants (Section 5). We show that our approach is able to infer and mimic qualitatively different driving styles ranging from highly risk-averse to risk-neutral using only a minute of training data from each participant. We also compare the predictions made by our risk-sensitive IRL (RS-IRL) approach with one that models the expert using expected value theory and demonstrate that the RS-IRL framework more accurately captures observed participant behavior both qualitatively and quantitatively, especially in scenarios involving significant risk to the participant-driven car.

(a) Visualization of simulator during the interactive game experiment as seen by participant.
(b) Logitech G29 game input hardware consists of a force-feedback steering wheel and accelerator and brake pedals.
Figure 1: The simulated driving game considered in this paper. The human controls the follower car using a force-feedback steering wheel and two pedals and must follow the leader (an “erratic driver”) as closely as possible without colliding. We observed a wide range of behaviors from participants reflecting varying attitudes towards risk.

Related Work: Safety-critical control and decision making applications demand increased resilience to events of low probability and detrimental consequences (e.g., a UAV crashing due to unexpectedly large wind gusts or an autonomous car failing to accommodate an erratic neighboring vehicle). Such problems have inspired the recent advancement of various restricted versions of the problems considered here. In particular, there is a large body of work on RS decision making. For instance, in (Howard and Matheson, 1972) the authors leverage the exponential (or entropic) risk. This has historically been a very popular technique for parameterizing risk attitudes in decision theory but suffers from the usual drawbacks of the EU framework such as the calibration theorem (Rabin, 2000). The latter states that very little risk aversion over moderate costs leads to unrealistically high degrees of risk aversion over large costs, which is undesirable from a modeling perspective. Other RS Markov Decision Process (MDP) formulations include Markowitz-inspired mean-variance (Filar et al., 1989; Tamar et al., 2012), percentile criteria on objectives (Wu and Yuanlie, 1999) and constraints (Geibel and Wysotzki, 2005), and cumulative prospect theory (Prashanth et al., 2016). This has driven research in the design of learning-based solution algorithms, i.e., RS reinforcement learning (Mihatsch and Neuneier, 2002; Bäuerle and Ott, 2011; Tamar et al., 2012; Petrik and Subramanian, 2012; Shen et al., 2014; Tamar et al., 2016). Ambiguity in MDPs is also well studied via the robust MDP framework, see e.g., (Nilim and El Ghaoui, 2005; Xu and Mannor, 2010), as well as (Osogami, 2012; Chow et al., 2015), where the duality between risk and ambiguity arising from CRMs is exploited. The key difference between this literature and the present work is that we consider the inverse reinforcement learning problem.

Results in the RS-IRL setting are more limited and have largely been pursued in the neuroeconomics literature (Glimcher and Fehr, 2014). For example, (Hsu et al., 2005) performed Functional Magnetic Resonance Imaging (FMRI) studies of humans making decisions in risky and ambiguous settings and modeled risk and ambiguity aversion using parametric utility and weighted probability models. In a similar vein, (Shen et al., 2014) models risk aversion using utility-based shortfalls (with utility functions fixed a priori) and presents FMRI studies on humans performing a sequential investment task. While this literature may be interpreted in the context of IRL, the models used to predict risk and ambiguity aversion are quite limited. Risk in (Sadigh et al., 2016b) is captured via a single parameter representing the aggressiveness of the expert driver – a fairly limited model that additionally does not account for probabilistic uncertainty. More recently, the authors in (Ratliff and Mazumdar, 2017) leverage the shortfall-risk model and associated value decomposition introduced in (Shen et al., 2014) to devise a gradient-based RS-IRL algorithm. The model again assumes an a priori known risk measure and parameterized utility function, and the learning loss function is taken to be the likelihood of the observed actions assuming a Boltzmann distribution fit to the optimal state-action values. There are two key limitations of this approach. First, learning is performed assuming a known utility functional and risk measure – both of which, in general, are difficult to fix a priori for a given application. Second, computing gradients involves taking expectations with respect to the optimal policy as determined by the current value of the parameters and thus must be determined by solving the fixed-point equations defining the “forward” RL problem – a computationally demanding task for large or infinite domains. This limitation is not an artifact of RS-IRL but in fact a standard complexity issue in any MaxEnt IRL-based algorithm. In contrast, this work (1) harnesses the elegant dual representation results for CRMs to avoid having to assume a known risk functional, and (2) solves a significantly less complex forward problem by leveraging a receding-horizon planning model for the expert – a technique used to great effect also in (Sadigh et al., 2016a).

A first version of this work was presented in (Majumdar et al., 2017). In this revised and extended edition, we include the following additional contributions: (1) a significant improvement in the multi-step RS-IRL model which now accounts for an expert planning over sequential disturbance modes (as opposed to the single-branch model in (Majumdar et al., 2017)); (2) a formal proof of convergence guaranteeing that in the limit, the single-step RS-IRL model will exactly replicate the expert’s behavior; (3) introduction of a new maximum likelihood based approach for inferring both the risk measure and cost function for the multi-step model without assuming any a priori functional form; (4) extensive experimental validation on a realistic driving simulator where we demonstrate a significant improvement in predictive performance enabled by the RS-IRL algorithm over the standard risk-neutral model.

2 Problem Formulation

2.1 Dynamics

Consider the following discrete-time dynamical system:

$x_{k+1} = f(x_k, u_k, w_k), \qquad (1)$

where $k$ is the time index, $x_k \in \mathbb{R}^n$ is the state, $u_k \in \mathbb{R}^m$ is the control input, and $w_k \in \mathcal{W}$ is the disturbance. The control input is assumed to be bounded component-wise: $\underline{u} \le u_k \le \bar{u}$. We take $\mathcal{W} = \{w^{(1)}, \dots, w^{(L)}\}$ to be a finite set with probability mass function (pmf) $p = [p(1), \dots, p(L)]^\top$, where $p(j) > 0$ for all $j$. The time-sampling of the disturbance will be discussed in Section 4. We assume that we are given demonstrations from an expert in the form of sequences of state-control pairs and that the expert has knowledge of the underlying dynamics (1) but does not necessarily have access to the disturbance pmf $p$.
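For concreteness, the following minimal Python sketch rolls out a system of this form; the scalar linear dynamics, disturbance set, pmf, and policy are illustrative placeholders rather than the setup used in our experiments.

```python
import numpy as np

def simulate(f, x0, policy, W, p, N, rng=np.random.default_rng(0)):
    """Roll out x_{k+1} = f(x_k, u_k, w_k) for N steps.

    W : list of the L possible disturbance realizations.
    p : probability mass function over W (unknown to the expert).
    Returns the demonstration as a list of (state, control, disturbance) tuples.
    """
    x, demo = x0, []
    for _ in range(N):
        u = policy(x)                       # expert's (bounded) control input
        w = W[rng.choice(len(W), p=p)]      # disturbance sampled from the pmf
        demo.append((x, u, w))
        x = f(x, u, w)
    return demo

# Illustrative placeholders: scalar linear dynamics with two disturbance modes.
f = lambda x, u, w: x + u + w
demo = simulate(f, x0=0.0, policy=lambda x: np.clip(-0.5 * x, -1.0, 1.0),
                W=[-0.1, 0.3], p=[0.7, 0.3], N=10)
```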

2.2 Model of the Expert

We model the expert as a risk-sensitive decision-making agent acting according to a coherent risk measure (defined formally below). We refer to such a model as a coherent risk model.

We assume that the expert has a cost function $C(x, u, w)$ that captures his/her preferences about outcomes. Let $Z$ denote the cumulative cost accrued by the agent over a horizon $N$:

$Z := \sum_{k=0}^{N-1} C(x_k, u_k, w_k). \qquad (2)$

Note that since the disturbance process is stochastic, $Z$ is a random variable adapted to the sequence $w_0, \dots, w_{N-1}$. A risk measure is a function that maps this uncertain cost to a real number. We will assume that the expert is assessing risks according to a coherent risk measure, defined as follows.

Definition 1 (Coherent Risk Measures).

Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space and let $\mathcal{Z}$ be the space of random variables on $\Omega$. A coherent risk measure (CRM) is a mapping $\rho : \mathcal{Z} \to \mathbb{R}$ that obeys the following four axioms. For all $Z, Z' \in \mathcal{Z}$:

A1. Monotonicity: $Z \le Z' \implies \rho(Z) \le \rho(Z')$.

A2. Translation invariance: $\rho(Z + c) = \rho(Z) + c$, $\forall\, c \in \mathbb{R}$.

A3. Positive homogeneity: $\rho(\lambda Z) = \lambda\, \rho(Z)$, $\forall\, \lambda \ge 0$.

A4. Subadditivity: $\rho(Z + Z') \le \rho(Z) + \rho(Z')$.

These axioms were originally proposed in (Artzner et al., 1999) to ensure the “rationality” of risk assessments. For example, A1 states that if a random cost $Z$ is less than or equal to a random cost $Z'$ regardless of the disturbance realizations, then $Z$ must be considered less risky (one may think of the different random costs as arising from different control policies). A4 reflects the intuition that a risk-averse agent should prefer to diversify. We refer the reader to (Artzner et al., 1999; Majumdar and Pavone, 2017) for a thorough justification of these axioms. An important characterization of CRMs is provided by the following representation theorem.

Theorem 1 (Representation Theorem for Coherent Risk Measures (Artzner et al., 1999)).

Let $(\Omega, 2^{\Omega}, \mathbb{P})$ be a probability space, where $\Omega$ is a finite set with cardinality $L$, $2^{\Omega}$ is the $\sigma$-algebra over its subsets, probabilities are assigned according to $p = [p(1), \dots, p(L)]^\top$, and $\mathcal{Z}$ is the space of random variables on $\Omega$. Denote by $\mathcal{D}$ the set of valid probability densities:

$\mathcal{D} := \Big\{ \zeta \in \mathbb{R}^{L} : \sum_{j=1}^{L} p(j)\, \zeta(j) = 1, \ \zeta \ge 0 \Big\}. \qquad (3)$

Define $\mathbb{E}_{\zeta}[Z] := \sum_{j=1}^{L} p(j)\, \zeta(j)\, Z(j)$, where $\zeta \in \mathcal{D}$ and $Z \in \mathcal{Z}$. A risk measure $\rho$ with respect to the space $\mathcal{Z}$ is a CRM if and only if there exists a compact convex set $\mathcal{B} \subseteq \mathcal{D}$ such that for any $Z \in \mathcal{Z}$:

$\rho(Z) = \max_{\zeta \in \mathcal{B}} \mathbb{E}_{\zeta}[Z]. \qquad (4)$

This theorem is important for two reasons. Conceptually, it gives us an interpretation of CRMs as computing the worst-case expectation of the cost function over a set of densities (referred to as the risk envelope). Coherent risk measures thus allow us to consider risk and ambiguity (ref. Section 1) in a unified framework since one may interpret an agent acting according to a coherent risk model as being uncertain about the underlying probability density. Second, it provides us with an algorithmic handle over CRMs and will form the basis of our approach to measuring experts’ risk preferences.
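As a standard illustration of Theorem 1 (a textbook example, not specific to this paper's experiments), the Conditional Value-at-Risk at level $\alpha \in (0, 1]$ is a CRM whose risk envelope consists of all densities bounded above by $1/\alpha$:

```latex
% Dual (risk-envelope) representation of CVaR at level alpha.
\mathrm{CVaR}_{\alpha}(Z)
  \;=\; \max_{\zeta \in \mathcal{B}_{\alpha}} \mathbb{E}_{\zeta}[Z],
\qquad
\mathcal{B}_{\alpha}
  \;=\; \Big\{ \zeta \in \mathcal{D} \;:\; 0 \le \zeta(j) \le \tfrac{1}{\alpha},
               \ j = 1, \dots, L \Big\}.
```

Setting $\alpha = 1$ forces $\zeta \equiv 1$ and recovers the risk-neutral expectation, while $\alpha \to 0$ recovers the worst-case cost.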

In this work, we will take the risk envelope to be a polytope. We refer to such risk measures as polytopic risk measures, which were also considered in (Eichhorn and Römisch, 2005). By absorbing the density $\zeta$ into the pmf $p$ (i.e., working with the distribution $q(j) := p(j)\,\zeta(j)$), we can represent (without loss of generality) a polytopic risk measure as:

$\rho(Z) = \max_{q \in \mathcal{P}} \mathbb{E}_{q}[Z] = \max_{q \in \mathcal{P}} \sum_{j=1}^{L} q(j)\, Z(j), \qquad (5)$

where $\mathcal{P}$ is a polytopic subset of the probability simplex:

$\mathcal{P} = \{ q \in \Delta^{L} : A\, q \le b \}, \qquad (6)$

where $\Delta^{L} := \{ q \in \mathbb{R}^{L} : \sum_{j=1}^{L} q(j) = 1, \ q \ge 0 \}$. Polytopic risk measures constitute a rich class of risk measures, encompassing a spectrum ranging from risk neutrality ($\mathcal{P} = \{p\}$) to worst-case assessments ($\mathcal{P} = \Delta^{L}$). Examples include CVaR, mean absolute semi-deviation, spectral risk measures, optimized certainty equivalent, and the distributionally robust risk (Chow and Pavone, 2014). We further note that the ambiguity interpretation of CRMs is reminiscent of Gilboa & Schmeidler’s Minmax EU model for ambiguity-aversion (Gilboa and Schmeidler, 1989), which was shown to outperform various competing models in (Hey et al., 2010) for single-stage decision problems, albeit with more restrictions on the set $\mathcal{P}$.
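The following sketch evaluates a polytopic risk measure of the form (5)–(6) by solving the inner maximization as a small LP with scipy; the CVaR-style envelope used in the example is illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def polytopic_risk(Z, A, b):
    """Evaluate rho(Z) = max_q q^T Z over {q : A q <= b} intersected with the
    probability simplex, i.e., the inner maximization of eq. (5)."""
    L = len(Z)
    # linprog minimizes, so minimize -Z^T q; the default bounds give q >= 0.
    res = linprog(c=-np.asarray(Z), A_ub=A, b_ub=b,
                  A_eq=np.ones((1, L)), b_eq=[1.0])
    assert res.success
    return -res.fun

# Example: a CVaR-style envelope {q : q_j <= p_j / alpha} with a uniform pmf.
Z, p, alpha = np.array([1.0, 2.0, 10.0]), np.ones(3) / 3, 0.5
print(polytopic_risk(Z, A=np.eye(3), b=p / alpha))   # 22/3 ~= 7.33
```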

Goal: Given demonstrations from the expert in the form of state-control trajectories, our goal is to approximate the expert’s risk preferences by finding an approximation of the risk envelope $\mathcal{P}$.

3 Risk-sensitive IRL: Single Decision Period

In this section we consider the single-step decision problem, i.e., $N = 1$ in equation (2). Thus, the probability space is simply the finite disturbance set $\mathcal{W}$ equipped with the pmf $p$.

3.1 Known Cost Function

We first consider the static decision-making setting where the expert’s cost function is known but the risk measure is unknown. A coherent risk model then implies that the expert is solving the following optimization problem at state $x$ in order to compute an optimal action:

$\min_{\underline{u} \le u \le \bar{u}} \ \rho\big(C(x, u, w)\big) \qquad (7)$

$= \min_{\underline{u} \le u \le \bar{u}} \ \max_{q \in \mathcal{P}} \ \sum_{j=1}^{L} q(j)\, C_j(x, u), \qquad (8)$

where $\rho$ is a CRM with respect to the probability space defined over $\mathcal{W}$. In the last equation, $C_j(x, u)$ is the cost when the disturbance $w^{(j)}$ is realized. Since the inner maximization problem is linear in $q$, the optimal value is achieved at a vertex of the polytope $\mathcal{P}$. Denoting the set of vertices of $\mathcal{P}$ as $V(\mathcal{P})$, we can thus rewrite problem (7) above as follows:

$\min_{\underline{u} \le u \le \bar{u}} \ \max_{q \in V(\mathcal{P})} \ \sum_{j=1}^{L} q(j)\, C_j(x, u). \qquad (9)$

If the cost function is convex in the control input $u$, the resulting optimization problem is convex. Given a dataset of state-control pairs of the expert taking action $u^*$ at state $x^*$, our goal is to deduce an approximation of $\mathcal{P}$ from the given data. The key idea of our technical approach is to examine the Karush-Kuhn-Tucker (KKT) conditions for Problem (9). The use of KKT conditions for Inverse Optimal Control is a technique also adopted in (Englert and Toussaint, 2015). The KKT conditions are necessary for optimality in general and are also sufficient in the case of convex problems. We can thus use the KKT conditions, along with the dataset, to constrain the vertices of $\mathcal{P}$. In other words, the KKT conditions will allow us to constrain where the vertices of $\mathcal{P}$ must lie in order to be consistent with the fact that the state-control pairs represent optimal solutions to Problem (9). Importantly, we will not assume access to the number of vertices of $\mathcal{P}$.

Let $(x^*, u^*)$ be an optimal state-control pair and let $\mathcal{S}^{+}$ and $\mathcal{S}^{-}$ denote the sets of components of the control input that are saturated above and below, respectively (i.e., $u^*_i = \bar{u}_i$ for $i \in \mathcal{S}^{+}$ and $u^*_i = \underline{u}_i$ for $i \in \mathcal{S}^{-}$).

Theorem 2 (KKT-Based Inference).

Consider the following optimization problem:

(10)

Denote the optimal value of this problem by and define the halfspace:

(11)

Then, the risk envelope satisfies

Proof.

The KKT conditions for Problem (9) are:

(12)
(13)
(14)
(15)

where are multipliers. Now, suppose there are multiple optimal vertices for Problem (9) in the sense that , . Defining , we see that satisfies:

(16)

and since . Now, since satisfies the constraints of Problem (10) (which are implied by the KKT conditions), it follows that . From problem (9), we see that and thus . ∎

Problem (10) is a Linear Program (LP) and can thus be solved efficiently. For each demonstration $(x^*, u^*)$, Theorem 2 provides a halfspace constraint on the risk envelope $\mathcal{P}$. By aggregating these constraints, we obtain a polytopic outer approximation of $\mathcal{P}$. This is summarized in Algorithm 1. Note that Algorithm 1 operates sequentially through the data and is thus directly applicable in online settings.

1:  Initialize $\hat{\mathcal{P}}_0 \leftarrow \Delta^{L}$
2:  for $i = 1, \dots, D$ do
3:     Solve Linear Program (10) with the demonstration $(x^{(i)}, u^{(i)})$ to obtain a halfspace $\mathcal{H}_i$
4:     Update $\hat{\mathcal{P}}_i \leftarrow \hat{\mathcal{P}}_{i-1} \cap \mathcal{H}_i$
5:  end for
6:  Return $\hat{\mathcal{P}}_D$
Algorithm 1 Outer Approximate Risk Envelope
Remark 1.

Algorithm 1 is a non-parametric algorithm for inferring the expert’s risk measure; i.e., we are not fitting parameters for an a priori chosen risk measure. Instead, by operating directly within the dual space, Algorithm 1 can recover, within the class of all CRMs, the risk measure that best explains the expert’s demonstrations.

As we collect more halfspace constraints in Algorithm 1, the feasible set in Problem (10) can be intersected with the current outer approximation $\hat{\mathcal{P}}$ of the risk envelope. It is easily verified that the results of Theorem 2 still hold. This allows us to obtain a tighter (i.e., lower) upper bound on the optimal value of Problem (10), thus resulting in tighter halfspace constraints for each new demonstration processed by the algorithm.
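A minimal sketch of this aggregation loop is shown below; the per-demonstration LP of Problem (10) is abstracted as a hypothetical callable `halfspace_from_demo`, which may optionally use the current approximation to tighten its bound, as just described.

```python
import numpy as np

def outer_approximate_envelope(demos, halfspace_from_demo, L):
    """Sketch of Algorithm 1: intersect one halfspace {q : a^T q <= beta} per
    demonstration to build an outer approximation of the risk envelope."""
    A = np.zeros((0, L))      # start from the whole probability simplex:
    b = np.zeros(0)           # no side constraints yet
    for (x_star, u_star) in demos:
        # Placeholder for the LP of Problem (10); (A, b) is the current approximation.
        a_i, b_i = halfspace_from_demo(x_star, u_star, A, b)
        A = np.vstack([A, a_i])
        b = np.append(b, b_i)
    return A, b               # approximation = simplex intersected with {q : A q <= b}
```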

Denote by $\hat{\mathcal{P}}_i$ the output of Algorithm 1 after processing sequentially the first $i$ demonstrations. Observe that $\hat{\mathcal{P}}_{i+1} \subseteq \hat{\mathcal{P}}_i$ for all $i$. We can then define the limiting set as $\hat{\mathcal{P}}_{\infty} := \bigcap_{i \ge 1} \hat{\mathcal{P}}_i$. An important consideration for this algorithm is whether it is possible to recover, at least from an imitation perspective, the risk envelope from sufficiently many optimal demonstrations. In other words, we are specifically interested in the question of whether the limiting set (whenever such a limit exists) allows one to exactly predict the actions of a decision maker that operates under a risk model characterized by the true set $\mathcal{P}$. In the following theorem we establish, under mild technical conditions, that this is indeed possible. The proof is provided in Appendix A.

Theorem 3 (Convergence of Algorithm 1).

Let $\mathcal{X}$ be a convex, compact subset of the state space. Let $\{(x^{(i)}, u^{(i)})\}_{i=1}^{\infty}$ be a set of infinitely many optimal demonstrations such that the sequence of states $\{x^{(i)}\}$ is dense in $\mathcal{X}$. Assume that the following technical conditions hold:

  1. The expert’s cost vector $[C_1(x, u), \dots, C_L(x, u)]^\top$ is strictly convex with respect to $u$.

  2. For all $j \in \{1, \dots, L\}$ and any state $x$, the cost function $C_j(x, \cdot)$ associated with the $j$-th disturbance has bounded level sets.

Finally, for any risk envelope $\mathcal{P}'$ and any state $x$, define $u^{\star}(x; \mathcal{P}')$ as the optimal control action of an expert with risk envelope $\mathcal{P}'$ at state $x$. Then, for any state $x \in \mathcal{X}$,

$u^{\star}(x; \hat{\mathcal{P}}_{\infty}) = u^{\star}(x; \mathcal{P}). \qquad (17)$

That is, for any state $x \in \mathcal{X}$, the optimal action predicted using the limiting envelope $\hat{\mathcal{P}}_{\infty}$ matches that computed using the true expert polytope $\mathcal{P}$.

Once we have recovered an approximation $\hat{\mathcal{P}}$ of $\mathcal{P}$, we can solve the “forward” problem (i.e., compute actions at a given state $x$) by solving the optimization problem (7) with $\hat{\mathcal{P}}$ as the risk envelope.
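As a sketch of this forward computation, the snippet below minimizes over the control the worst expected cost across the vertices of an (approximate) envelope, using cvxpy; the quadratic one-step cost is an illustrative choice that matches the linear-quadratic example in the next subsection.

```python
import numpy as np
import cvxpy as cp

def forward_action(x, vertices, A_list, B_list, Q, R, u_min, u_max):
    """Solve a one-step forward problem of the form (9): min over u of the max,
    over envelope vertices q, of the expected cost sum_j q_j C_j(x, u)."""
    u = cp.Variable(R.shape[0])
    # Cost under each disturbance realization (convex in u for PSD Q, R).
    costs = [cp.quad_form(A @ x + B @ u, Q) + cp.quad_form(u, R)
             for A, B in zip(A_list, B_list)]
    # Expected cost under each vertex of the (approximate) risk envelope.
    expected = [sum(q[j] * costs[j] for j in range(len(costs))) for q in vertices]
    prob = cp.Problem(cp.Minimize(cp.maximum(*expected)),
                      [u >= u_min, u <= u_max])
    prob.solve()
    return u.value
```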

3.1.1 Example: Linear-Quadratic System

As a simple illustrative example to gain intuition for the convergence properties of Algorithm 1, consider a linear dynamical system with multiplicative uncertainty of the form $x_{k+1} = A(w_k)\, x_k + B(w_k)\, u_k$. We consider the one-step decision-making process with a quadratic cost on state and action: $C(x, u, w) = x_{+}^\top Q\, x_{+} + u^\top R\, u$, where $x_{+} = A(w)\, x + B(w)\, u$ is the successor state. We consider a low-dimensional state and control space, with a small number of disturbance realizations for ease of visualization of the risk envelope. The different $A(w^{(j)})$ and $B(w^{(j)})$ matrices corresponding to each realization are generated randomly by independently sampling their elements from the standard normal distribution. The cost matrix $Q$ is a randomly generated positive semi-definite matrix and $R$ is the identity. States are drawn at random. The true envelope $\mathcal{P}$ was generated by taking the convex hull of a set of random samples in the probability simplex $\Delta^{L}$.
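A sketch of this randomized setup is given below; the dimensions, the number of disturbance modes, and the use of Dirichlet samples for the simplex are illustrative choices, not the exact values used to generate the figures.

```python
import numpy as np
from scipy.spatial import ConvexHull

rng = np.random.default_rng(0)
n, m, L = 2, 2, 3          # illustrative state/control dimensions and number of modes

# Random system matrices for each disturbance realization (standard normal entries).
A_list = [rng.standard_normal((n, n)) for _ in range(L)]
B_list = [rng.standard_normal((n, m)) for _ in range(L)]

# Random positive semi-definite state cost and identity control cost.
M = rng.standard_normal((n, n))
Q, R = M.T @ M, np.eye(m)

# True risk envelope: convex hull of random points in the probability simplex.
samples = rng.dirichlet(np.ones(L), size=20)
hull = ConvexHull(samples[:, :-1])        # hull computed in (L-1)-dim simplex coordinates
true_vertices = samples[hull.vertices]    # vertices of the expert's true envelope
```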

Figure 2 shows the outer approximations of the risk envelope obtained using Algorithm 1. We observe rapid convergence (after approximately 20 sampled states) of the outer approximations (red) to the true risk envelope (green).

(a) 5 data points
(b) 10 data points
(c) 15 data points
(d) 20 data points
Figure 2: Rapid convergence of the outer approximation of the risk envelope.

Figure 3 shows the mean squared error (on an independent test set with 30 demonstrations) between actions predicted using the sequentially refined polytope approximations generated by Algorithm 1 and the expert’s true actions, as a function of the number of training demonstrations.

Figure 3: Rapid decrease of the mean squared error between predicted and expert’s actions as a function of the number of demonstrations.

3.2 Unknown Cost Function

Now we consider the more general case where both the expert’s cost function and risk measure are unknown. We parameterize the cost function as a linear combination of $F$ features of the state, control, and disturbance. Then, the expected value of the cost function w.r.t. a distribution $q$ can be written as $\sum_{j=1}^{L} q(j)\, C_j(x, u)$, where:

$C_j(x, u) = \sum_{i=1}^{F} \alpha_i\, \phi_i(x, u, w^{(j)}), \qquad (18)$

with nonnegative weights $\alpha_i$. Since the solution of problem (7) solved by the expert is invariant to positive scalings of the cost function due to the positive homogeneity property of coherent risk measures (see Definition 1), one can assume without loss of generality that the feature weights sum to 1.

With this cost structure, we see that the KKT conditions derived in Section 3.1 involve products of the feature weights $\alpha_i$ and the vertices of $\mathcal{P}$. Similarly, an analogous version of optimization problem (10) can be used to bound the optimal value. This problem again contains products of the unknown feature weights and the probability vertex $q$. The key idea here is to replace each product $\alpha_i\, q(j)$ by a new decision variable $c_{ij}$, which allows us to re-write problem (10) as an LP in the variables $c_{ij}$, with the addition of the following two simple constraints: $c_{ij} \ge 0$, and $\sum_{i=1}^{F}\sum_{j=1}^{L} c_{ij} = 1$ (since both the feature weights and any probability vector sum to 1). In a manner analogous to Theorem 2, this optimization problem allows us to obtain bounding hyperplanes in the space of product variables, which can then be aggregated as in Algorithm 1. Denoting the resulting polytope in the space of product variables as $\mathcal{C}$, we can then proceed to solve the “forward” problem (i.e., computing actions at a given state $x$) by solving the following optimization problem:

$\min_{\underline{u} \le u \le \bar{u}} \ \max_{c \in V(\mathcal{C})} \ \sum_{i=1}^{F} \sum_{j=1}^{L} c_{ij}\, \phi_i(x, u, w^{(j)}). \qquad (19)$

This problem can be solved by enumerating the vertices of the polytope $\mathcal{C}$ in a manner similar to problem (9). Similar to the case where the cost function is known, this provides us with a way to conservatively approximate the expert’s decision-making process (in the sense that we are considering a larger risk envelope).

3.2.1 Approximate Recovery of Cost and Risk Measure

While the procedure described above operates in the space of product variables and does not require explicitly recovering the cost function and risk envelope separately, it may nevertheless be useful to do so for two reasons. First, the number of vertices of $\mathcal{C}$ may be quite large (since the space of product variables may be high dimensional) and thus solving the forward problem (19) may be computationally expensive. Recovering the cost and risk envelope separately allows us to solve a smaller optimization problem (since the risk envelope is lower dimensional in this case). Second, recovering the cost and risk measure separately may provide additional intuition and insights into the expert’s decision-making process and may also allow us to make useful predictions in novel settings (e.g., where we expect the expert’s risk measure to be the same but not the cost function or vice versa).

Here we describe a procedure for approximately recovering the feature weights $\alpha$ and the risk envelope $\mathcal{P}$ from the polytope $\mathcal{C}$. The key observation that makes this possible is that the matrix $[c_{ij}]$ containing the product variables is, by definition, equal to the outer product $\alpha\, q^\top$. Hence, for $i = 1, \dots, F$, we have:

$\sum_{j=1}^{L} c_{ij} = \alpha_i \sum_{j=1}^{L} q(j) = \alpha_i. \qquad (20)$

The last equality follows from the fact that $q$ is a probability vector and sums to 1. Similarly, for $j = 1, \dots, L$, we have:

$\sum_{i=1}^{F} c_{ij} = q(j) \sum_{i=1}^{F} \alpha_i = q(j). \qquad (21)$

The last equality follows from the fact that we assumed without loss of generality that the feature weights sum to 1.

Let $V(\mathcal{C})$ be the set of vertices of the polytope $\mathcal{C}$. We can then apply equations (20) and (21) to each vertex. If we have exactly recovered the polytope in the space of product variables, the estimates of the feature weights obtained from the different vertices will be equal. Since in general this will not be the case, the different estimates of the weights will differ (but ideally only slightly). We can take the mean of the different estimates from (20) as our estimate for the weight vector $\alpha$, and the convex hull of the estimates of the probability vectors from equation (21), applied to each vertex of $\mathcal{C}$, as our estimate of the risk envelope.

It is important to be able to gauge the quality of the estimates we obtain from the procedure above. We can do this in two ways. First, if the estimates of the weight vector are tightly clustered, this is a good indication that we have an accurate recovery. Second, if each vertex of the polytope $\mathcal{C}$ is close to a rank-one matrix, then this is again a good indication (since the true product-variable matrix equals the rank-one outer product $\alpha\, q^\top$).
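A minimal numpy sketch of this recovery procedure (using the row/column-sum identities (20) and (21)) is given below; the rank-one check mirrors the second diagnostic mentioned above.

```python
import numpy as np

def recover_weights_and_envelope(product_vertices):
    """Recover feature-weight and probability-vector estimates from vertices of
    the product-variable polytope; each vertex is an F x L matrix close to the
    rank-one outer product (weights) x (probabilities)^T."""
    alpha_estimates = [C.sum(axis=1) for C in product_vertices]   # row sums, eq. (20)
    q_estimates = [C.sum(axis=0) for C in product_vertices]       # column sums, eq. (21)
    alpha_hat = np.mean(alpha_estimates, axis=0)
    # Diagnostic: how close each vertex is to rank one (second singular value ~ 0).
    rank_one_gap = [np.linalg.svd(C, compute_uv=False)[1] for C in product_vertices]
    # The envelope estimate is the convex hull of the returned probability estimates.
    return alpha_hat, np.array(q_estimates), rank_one_gap
```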

3.2.2 Example: Linear-Quadratic System

Consider the same system as in Section 3.1.1, but now we assume that the cost function is unknown. We take the cost function to be the weighted sum of three quadratic features (i.e., $F = 3$). The quadratic features are generated randomly by sampling the elements of their defining matrices from the standard normal distribution. The corresponding weights are drawn uniformly between 0 and 1 and are normalized to sum to 1.

Figure 4 a) illustrates the tightness of the approximate envelope as compared with the true polytope $\mathcal{P}$, while Figure 4 b) is a scatter plot of the first two feature weights (the third is uniquely determined given the first two) as recovered by applying eq. (20) to each vertex of the product-variable polytope $\mathcal{C}$. Notice that the cost weight estimates are tightly clustered, as desired.

(a) Polytope estimate.
(b) Feature weight estimates.
Figure 4: Approximated risk envelope and cost function weights from 200 state-control pair demonstrations.

4 Risk-sensitive IRL: Multi-step case

We now consider the dynamical system given by (1) and generalize the one-step decision problem to the multi-step setting. We consider a model where the disturbance is sampled every $T$ time-steps and held constant in the interim. Such a model generalizes settings where disturbances are sampled i.i.d. at every time-step (corresponding to $T = 1$ in our model), as it also allows us to model delays in the expert’s reaction to changing disturbances. We model the expert as planning in a receding-horizon manner by looking ahead for a finite horizon. Owing to the need to account for future disturbances, the multi-step finite-horizon problem is a search over control policies (i.e., the executed control inputs depend on which disturbance is realized).

4.1 Prepare-React Model

In this section we reprise the “prepare” – “react” model introduced in (Majumdar et al., 2017), and depicted below in Figure 5.

Figure 5: Scenario tree as visualized at the current time-step. The disturbance is sampled every $T$ steps. The control look-ahead has two phases: “prepare” and “react”.

The expert’s policy is decomposed into two phases (shown in Figure 5), referred to as “prepare” and “react.” The “prepare” phase precedes each disturbance realization, while the “react” phase follows it. Intuitively, this model captures the idea that in the period preceding a disturbance (i.e., the “prepare” phase) the expert controls the system to a state from which he/she can recover well (in the “react” phase) once a disturbance is realized. Studies showing that humans have a relatively short look-ahead horizon in uncertain decision-making settings lend credence to such a model (Carton et al., 2016). As in (Majumdar et al., 2017), the delay parameter splitting the two phases is learned directly from the demonstrations. In this work, we wish to overcome the primary limitation of the dynamic model in (Majumdar et al., 2017), namely that of single branching events in each prediction/planning horizon. To do this, we first must define the notion of dynamic risk measures, used to assess risk over sequential realizations of uncertainty.

4.2 Dynamic Risk Measures

Consider a discrete-time stochastic cost sequence $Z_1, \dots, Z_K$, where $Z_k$ belongs to $\mathcal{Z}_k$, the space of real-valued non-negative random variables at stage $k$. Let $\mathcal{Z}_{k,K} := \mathcal{Z}_k \times \dots \times \mathcal{Z}_K$. A dynamic risk measure is a sequence of risk measures $\{\rho_{k,K}\}_{k=1}^{K}$, each mapping a future stream of random costs in $\mathcal{Z}_{k,K}$ into a risk assessment at stage $k$, and satisfying the monotonicity property $\rho_{k,K}(Z) \le \rho_{k,K}(Z')$ for all $Z, Z' \in \mathcal{Z}_{k,K}$ such that $Z \le Z'$ (component-wise). The monotonicity property is an intuitive extension of the monotonicity property for single-step risk assessments, and an arguably defensible axiom for all risk assessments.

To give dynamic risk measures a concrete functional form, we need to generalize the CRM axioms presented in Definition 1 to the dynamic case.

Definition 2 (Coherent One-Step Conditional Risk Measures).

A coherent one-step conditional risk measure is a mapping $\rho_k : \mathcal{Z}_{k+1} \to \mathcal{Z}_k$, for each stage $k$, that obeys the following four axioms. For all $Z, Z' \in \mathcal{Z}_{k+1}$ and $W \in \mathcal{Z}_k$:

A1. Monotonicity: $Z \le Z' \implies \rho_k(Z) \le \rho_k(Z')$.

A2. Translation invariance: $\rho_k(Z + W) = W + \rho_k(Z)$.

A3. Positive homogeneity: $\rho_k(\lambda Z) = \lambda\, \rho_k(Z)$, $\forall\, \lambda \ge 0$.

A4. Subadditivity: $\rho_k(Z + Z') \le \rho_k(Z) + \rho_k(Z')$.

Note that each $\rho_k(Z)$ is a random variable on the space $\mathcal{Z}_k$ and, given the discrete underlying probability space, each component of $\rho_k(Z)$ is uniquely identified by the sequence of disturbances preceding stage $k$ (hence the term conditional). Furthermore, it is readily observed that a mapping $\rho_k$ is a coherent one-step conditional risk measure if and only if each component of $\rho_k$ is a CRM.

As investigated in (Ruszczyński, 2010), in order for dynamic risk assessments to satisfy the intuitive monotonicity condition and to ensure rationality of evaluations over time, a dynamic risk measure must have the following compositional form:

$\rho_{k,K}(Z_k, \dots, Z_K) = Z_k + \rho_k\Big( Z_{k+1} + \rho_{k+1}\big( Z_{k+2} + \dots + \rho_{K-1}(Z_K) \big) \Big), \qquad (22)$

where each $\rho_k$ is a coherent one-step conditional risk measure. Figure 6 provides a helpful visualization of this compounded functional form.

Figure 6: A scenario tree with three uncertain outcomes at each stage. The one-step risk mapping $\rho_1$ maps the random stage-2 cost to a risk assessment at stage 1; this assessment is itself a random variable, with one component per node at stage 1. The component associated with the highlighted node at stage 1 (green node) is a coherent one-step conditional risk mapping over the children of that node at stage 2. The mapping $\rho_0$ subsequently maps the risk assessments at stage 1 back to stage 0.

4.3 Multiple-Branch Prepare-React Model

We are now ready to define the multiple-branch formulation of the expert’s multi-stage problem, as envisioned at a given time-step, with a look-ahead horizon spanning multiple branching events (i.e., multiple future disturbance re-samplings within the prediction horizon). As in the prepare-react model introduced earlier, we assume that the disturbance mode for the first steps of the horizon corresponds to the mode currently in progress, following which the disturbance is re-sampled every $T$ steps. A schematic of the multiple-branch generalization of the prepare-react model is depicted below in Figure 7.

Figure 7: Multiple-branch scenario tree schematic for the prepare-react model, as “visualized” by the expert at the current time-step. The disturbance is sampled every $T$ steps. The control look-ahead consists of multiple nested branches of “prepare” and “react”; displayed in the figure (shaded green) is one such nested branch. To evaluate costs over later stages, it is assumed that the expert leverages a conditional CRM where, for each realization of the preceding disturbances (identified uniquely by the observed disturbance branch), the expert uses a static CRM over the nested outcomes (shown in green for one possible realization). The observed control sequence is the initial “prepare” – “react” sequence corresponding to the actually realized disturbance.

Within the expert’s multi-stage optimization problem, consider the states predicted over the look-ahead horizon together with the predicted (stage-wise) disturbance sequence. As the optimization is over “prepare” – “react” control policies, let each stage’s “prepare” – “react” control policy (i.e., the control input sequence for the time-steps within that stage) be a function of the partial predicted disturbance history and the next predicted disturbance mode (for notational clarity, we suppress the obvious dependence on the initial state and the partial policy history). The stage-zero mode represents the actual disturbance mode in progress at the time of solving the multi-stage optimization problem. Note that, by causality, only the “react” portion of a stage’s policy may be a function of the next disturbance mode, but not the “prepare” portion. Finally, denote the accumulated cost over a stage’s time-steps given that stage’s “prepare” – “react” control policy. The expert’s multi-period optimization problem is then given as:

(23)

where each $\rho_k$ is a coherent one-step conditional risk measure such that each component of $\rho_k$ is a CRM with respect to the probability space defined over $\mathcal{W}$ and characterized by the fixed risk envelope $\mathcal{P}$. Leveraging the translational invariance property, the objective may be equivalently re-written by pulling each stage’s known cost outside the corresponding risk operator.

One should notice that (1) for each stage, the cost sequence is split across the risk operator due to the “prepare”–“react” structure and the translational invariance property, and (2) each risk mapping is over a sum of costs since disturbances are sampled every $T$ steps. These two observations elucidate the stage-wise decomposition of the dynamic risk measure as introduced in (22), where each stage corresponds to the cost accrued over the time-steps in between consecutive disturbance branching events. The observed input from the expert is the initial “prepare” – “react” control sequence corresponding to the actual disturbance mode sampled at the first branching event, following which the expert re-solves the problem.

Notice that when the planning horizon contains a single branching event, we recover the single-branch prepare/react model presented in (Majumdar et al., 2017). The success of the dynamic model in (Majumdar et al., 2017) followed from reducing the multi-step inference problem to be mathematically equivalent to the single-step case by first inferring the (un-observed) control policies of the human agent corresponding to the un-realized disturbance branches. Consider the scenario tree decomposition in Figure 5. If, say, the third disturbance mode is realized, then we only observe the “react” control sequence corresponding to the third branch. The algorithm in (Majumdar et al., 2017) proceeded by first inferring the “react” control sequences for the un-observed branches and then constructing a bounding hyperplane using a similar version of problem (10). In a multiple-branch setting, however, it is exceedingly difficult to exactly infer (or approximate) the unobserved control policies, as each of these policies involves an unobserved nested optimization over future branching events. Consequently, the optimality conditions of an observed control policy are defined by equalities that are non-linear in the unobserved variables. Therefore, extending the use of KKT conditions to infer an outer approximation of the risk envelope in the style of Theorem 2 leads to an intractable non-convex optimization problem. To address this fundamental observability issue, we introduce a semi-parametric representation of the risk envelope, discussed next.

4.3.1 Semi-Parametric CRM

Fix a set of normal vectors $\{h_s\}_{s=1}^{S} \subset \mathbb{R}^{L}$. Define the polytope $\mathcal{P}(d)$, parameterized by the offset vector $d = [d_1, \dots, d_S]^\top$, as:

$\mathcal{P}(d) = \big\{ q \in \Delta^{L} : h_s^\top q \le d_s, \ s = 1, \dots, S \big\}, \qquad (24)$

where, for each $s$, $d_s \in \mathbb{R}$ is the offset associated with the normal $h_s$. The CRM induced by this dual representation is denoted as $\rho_d$ and assumes the functional form $\rho_d(Z) = \max_{q \in \mathcal{P}(d)} \mathbb{E}_q[Z]$, where $Z$ is a discrete random variable with $L$ possible realizations. This induced CRM is termed semi-parametric since, unlike methods where one seeks to find the parameters defining a fixed disutility function (e.g., Shen et al. (2014); Ratliff and Mazumdar (2017)), here we do not assume a fixed chosen risk measure. Instead, by parameterizing the risk functional in the dual space (via its risk-envelope characterization), we retain the generality to recover any polytopic CRM, given a sufficient number of normal vectors $h_s$. A potential method to choose the normal vectors is to take the halfspace normals from the multi-step KKT method described in (Majumdar et al., 2017).

In order to ensure that the polytope described by (24), i.e., the intersection of the halfspaces with the probability simplex, is non-empty, define the extended polytope $\{(q, d) : q \in \Delta^{L}, \ h_s^\top q \le d_s, \ s = 1, \dots, S\}$ and define $\mathcal{D}_{\mathcal{P}}$ to be the projection of this extended polytope onto the $d$ variables. Then, $d \in \mathcal{D}_{\mathcal{P}}$ ensures that the polytope $\mathcal{P}(d)$ is non-empty. It is readily observed that the set $\mathcal{D}_{\mathcal{P}}$ is also a polytope.
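Membership of an offset vector in this projected polytope can be checked directly with a feasibility LP, as in the sketch below (scipy-based, with the normal vectors stacked row-wise in a matrix `H`).

```python
import numpy as np
from scipy.optimize import linprog

def envelope_nonempty(H, d):
    """Check whether {q in the probability simplex : H q <= d} is non-empty,
    i.e., whether the offset vector d yields a valid semi-parametric envelope."""
    S, L = H.shape
    res = linprog(c=np.zeros(L),                     # pure feasibility problem
                  A_ub=H, b_ub=d,
                  A_eq=np.ones((1, L)), b_eq=[1.0])  # q >= 0 via the default bounds
    return res.success
```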

Remark 2.

While we lose the outer-approximation of the risk envelope and convergence guarantees associated with the KKT method, in its place we obtain a tractable algorithm that enables us to accommodate a substantially larger class of dynamic decision-making inference problems. Experimental results, as discussed in Section 5, confirm that the method works well in approximating a wide range of risk profiles.

4.3.2 Constrained Maximum Likelihood

Given the semi-parametric representation of the risk envelope in (24), the RS-IRL problem reduces to inference over the offset vector $d$ and the cost-weight vector $\alpha$. We will perform this inference using a constrained maximum likelihood model. Consider, first, the following likelihood model:

(25)

where, similar to the motivation for the MaxEnt IRL model (Ziebart et al., 2008), we take the likelihood of an observed control policy to be proportional to the exponential of the negative optimal value of (23), computed using the envelope (24) and conditioned on the parameters $(d, \alpha)$ (see Appendix B for a detailed derivation of this distribution). While the original MaxEnt IRL model is motivated by finding the maximum entropy distribution subject to an expected feature-matching constraint, the robust performance of this model even in the absence of such a statistical motivation has been extensively observed and leveraged in the IRL literature.

A key limitation of this formulation, however, is the complexity of the partition function and the resulting gradients. The form of (25) is a distribution over all possible policies over the planning horizon. This makes sampling-based approximations intractable, as similarly observed in (Kretzschmar et al., 2016), and Laplace integral-based approximations, as used in (Levine and Koltun, 2012), too imprecise.

In order to construct a tractable algorithm, we employ the simplification whereby, at the beginning of any “prepare” stage in Figure 7, the expert may only choose from a finite set of open-loop control trajectories $\{\tilde{u}_1, \dots, \tilde{u}_M\}$, each spanning one full stage, thereby eliminating the notion of a “react” policy and replacing it with an open-loop sequence spanning the entire “prepare” – “react” stage. These trajectories can be chosen, for instance, by running the K-means clustering algorithm on the raw input trajectories. This simplification allows us to interpret problem (23) as a game between the expert with action set $\{\tilde{u}_1, \dots, \tilde{u}_M\}$ and nature with action set $\mathcal{W}$, and to uniquely identify any predicted state using the predicted game history, i.e., the disturbance history and control history. Leveraging this discrete representation and dynamic programming, we can construct the optimal solution to the expert’s multi-stage optimization problem using a “risk-sensitive” Bellman recursion, defined below.

Terminal Stage: For all possible game histories at stage , define

Recursion: For all possible game histories at stage , for :

First Stage:

The value computed at the first stage is the optimal value of problem (23). In the equations above, it is understood that for each stage, the cost sequence is evaluated based on the previous disturbance mode for the initial portion of the stage, followed by the newly sampled mode for the remaining steps.
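A minimal sketch of such a recursion over the discretized scenario tree is shown below; the stage-cost callable, the tree depth, and the recursive (non-memoized) evaluation are illustrative simplifications of the recursion described above.

```python
import numpy as np

def risk_sensitive_value(stage, history, K, actions, modes, vertices, stage_cost):
    """Risk-sensitive Bellman recursion over the scenario tree: at each stage
    the expert picks an open-loop trajectory, and the one-step conditional risk
    of the continuation is evaluated as a max over the vertices of the risk
    envelope (cf. eq. (5))."""
    if stage == K:
        return 0.0
    best = np.inf
    for a in actions:
        # Stage cost plus cost-to-go under each disturbance mode realized next.
        outcomes = np.array([
            stage_cost(history, a, w)
            + risk_sensitive_value(stage + 1, history + [(a, w)], K,
                                   actions, modes, vertices, stage_cost)
            for w in modes])
        risk = max(q @ outcomes for q in vertices)   # one-step conditional CRM
        best = min(best, risk)
    return best
```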

Given the structure of the optimal solution of problem (23), presented in Bellman form above using the true CRM, we now construct a computationally tractable likelihood model for the parameters $d$ and $\alpha$ by defining the soft risk-sensitive Bellman recursion using the semi-parametric CRM $\rho_d$. For the terminal stage, define

(26)

and for all :

(27)
(28)

Let $\tilde{u}^*$ be the closest (in norm) trajectory in the discrete action set to the observed control sequence over the corresponding stage (we use the stage-wise indexing here for notational consistency between the stage-wise decomposition of the multi-step problem and the demonstrated action trajectories). Similar to the MaxEnt IRL approach, we allow for imperfect human demonstrations by postulating that actions with lower risk-sensitive cost-to-go are exponentially preferred, i.e.,

$\mathbb{P}\big(\tilde{u}^* \mid \text{game history}\big) \propto \exp\big(-\beta\, \tilde{Q}(\tilde{u}^*)\big), \qquad (29)$

where $\tilde{Q}$ denotes the soft risk-sensitive cost-to-go defined by the recursion above and $\beta > 0$ is an inverse temperature parameter. Thus, the likelihood of the parameters $(d, \alpha)$ is given by: