Inverse Risk-Sensitive Reinforcement Learning

03/29/2017 ∙ by Lillian J. Ratliff, et al. ∙ UC Berkeley ∙ University of Washington

We address the problem of inverse reinforcement learning in Markov decision processes where the agent is risk-sensitive. In particular, we model risk-sensitivity in a reinforcement learning framework by making use of models of human decision-making having their origins in behavioral psychology, behavioral economics, and neuroscience. We propose a gradient-based inverse reinforcement learning algorithm that minimizes a loss function defined on the observed behavior. We demonstrate the performance of the proposed technique on two examples, the first of which is the canonical Grid World example and the second of which is a Markov decision process modeling passengers' decisions regarding ride-sharing. In the latter, we use pricing and travel time data from a ride-sharing company to construct the transition probabilities and rewards of the Markov decision process.


I Introduction

The modeling and learning of human decision-making behavior is increasingly important as critical systems begin to rely more on automation and artificial intelligence. Yet, in this task we face a number of challenges, not least of which is the fact that humans are known to behave in ways that are not completely rational. There is mounting evidence that humans often use reference points—e.g., the status quo, former experiences, or recent expectations about the future that are otherwise perceived to be related to the decision the human is making [1, 2]. It has also been observed that their decisions are impacted by their perception of the external world (exogenous factors) and their present state of mind (endogenous factors), as well as by how the decision is framed or presented [3].

The success of descriptive behavioral models in capturing human behavior has long been touted by the psychology community and, more recently, by the economics community. In the engineering context, by contrast, humans have largely been modeled under rationality assumptions from the so-called normative point of view, where things are modeled as they ought to be, as opposed to the descriptive, as-is point of view.

However, risk-sensitivity in the context of learning to control stochastic dynamical systems (see, e.g.[4, 5]) has been fairly extensively explored in engineering. Many of these approaches are targeted at mitigating risks due to uncertainties in controlling a system such as a plant or robot where risk-aversion

is captured by leveraging techniques such as exponential utility functions or minimizing mean-variance criteria.

Complex risk-sensitive behavior arising from human interaction with automation is only recently coming into focus. Human decision makers can be at once risk-averse and risk-seeking depending on their frame of reference. The adoption of diverse behavioral models in engineering—in particular, in learning and control—is growing due to the fact that humans are increasingly playing an integral role in automation at both the individual and societal scale. Learning accurate models of human decision-making is important for both prediction and description. For example, control/incentive schemes need to predict human behavior as a function of external stimuli, including not only potential disturbances but also the control/incentive mechanism itself. Policy makers and regulatory agencies, on the other hand, are interested in interpreting human reactions to implemented regulations and policies.

Approaches for integrating risk-sensitivity into control and reinforcement learning problems via behavioral models have recently emerged [6, 7, 8, 9, 10]. These approaches largely assume a risk-sensitive Markov decision process (MDP) formulated based on a model that captures behavioral aspects of the human’s decision-making process. We refer to the problem of learning the optimal policy in this setting as the forward problem. Our primary interest is in solving the so-called inverse problem, which seeks to estimate the decision-making process given a set of demonstrations; yet, doing so requires a well-formulated forward problem with convergence guarantees.

Inverse reinforcement learning in the context of recovering policies directly (or indirectly via first learning a representation for the reward) has long been studied in the context of expected utility maximization and MDPs [11, 12, 13]. We may care about, e.g., producing the value and reward functions (or at least characterizing the space of such functions) that produce behaviors matching those observed. On the other hand, we may want to extract the optimal policy from a set of demonstrations so that we can reproduce the behavior in support of, e.g., designing incentives or control policies. In this paper, our focus is on the combination of these two tasks.

We model human decision-makers as risk-sensitive Q-learning agents where we exploit very rich behavioral models from behavioral psychology and economics that capture a whole spectrum of risk-sensitive behaviors and loss aversion. We first derive a reinforcement learning algorithm that leverages convex risk metrics and behavioral value functions. We provide convergence guarantees via a contraction mapping argument. In comparison to previous work in this area [14], we show that the behavioral value functions we introduce satisfy the assumptions of our theorems.

Given the forward risk-sensitive reinforcement learning algorithm, we propose a gradient-based learning algorithm for inferring the decision-making model parameters from demonstrations—that is, we propose a framework for solving the inverse risk-sensitive reinforcement learning problem with theoretical guarantees. We show that the gradient of the loss function with respect to the model parameters is well-defined and computable via a contraction map argument. We demonstrate the efficacy of the learning scheme on the canonical Grid World example and a passenger’s view of ride-sharing modeled as an MDP with parameters estimated from real-world data.

The work in this paper significantly extends our previous work [15] first, by providing the proofs for the theoretical results appearing in the earlier work and second, by providing a more extensive theory for both the forward and inverse risk-sensitive reinforcement problems.

The remainder of this paper is organized as follows. In Section II, we overview the model we assume for risk-sensitive agents, show that it is amenable to integration with the behavioral models, and present our risk-sensitive Q-learning convergence results. In Section III, we formulate the problem and propose a gradient–based algorithm to solve it. Examples that demonstrate the ability of the proposed scheme to capture a wide breadth of risk-sensitive behaviors are provided in Section IV. We comment on connections to recent related work in Section V. Finally, we conclude with some discussion in Section VI.

II Risk-Sensitive Reinforcement Learning

In order to learn a decision-making model for an agent who faces sequential decisions in an uncertain environment, we leverage a risk-sensitive Q-learning model that integrates coherent risk metrics with behavioral models. In particular, the model we use is based on a model first introduced in [16] and later refined in [8, 7].

The primary difference between the work presented in this section and previous work (for further details on the relationship between the work in this paper and related works, including our previous work, see Section V) is that we (i) introduce a new prospect theory based value function and (ii) provide a convergence theorem whose assumptions are satisfied by the behavioral models we use. Under the assumption that the agent is making decisions according to this model, in the sequel we formulate a gradient–based method for learning the policy as well as the parameters of the agent’s value function.

II-A Markov Decision Process

We consider a class of finite MDPs consisting of a state space , an admissible action space for each state, a transition kernel that denotes the probability of moving from one state to another given an action, and a reward function (it is possible to consider a more general reward structure; however, we exclude this case in order to not further bog down the notation), where is the space of bounded disturbances and has distribution . Including disturbances allows us to model random rewards; we use the notation to denote the random reward having distribution .
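For concreteness, the following Python sketch encodes such a finite MDP as plain data structures; the state names, admissible action sets, transition probabilities, and the bounded disturbance distribution are all illustrative placeholders rather than anything specified in the paper.

    import random

    # Illustrative finite MDP; all names and numbers below are placeholders.
    states = ["s0", "s1"]
    actions = {"s0": ["stay", "go"], "s1": ["stay"]}      # admissible actions per state

    # Transition kernel: P[s][a][s_next] = probability of moving from s to s_next under a.
    P = {
        "s0": {"stay": {"s0": 1.0}, "go": {"s0": 0.2, "s1": 0.8}},
        "s1": {"stay": {"s1": 1.0}},
    }

    def reward(s, a, w):
        # Random reward r(s, a, w); the bounded disturbance w makes the reward random.
        base = {"stay": 0.0, "go": 1.0}[a]
        return base + w

    def step(s, a):
        # Sample a successor state and a realized reward.
        successors, probs = zip(*P[s][a].items())
        s_next = random.choices(successors, weights=probs)[0]
        w = random.uniform(-0.1, 0.1)                     # bounded disturbance
        return s_next, reward(s, a, w)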

In the classical expected utility maximization framework, the agent seeks to maximize the expected discounted rewards by selecting a Markov policy —that is, for an infinite horizon MDP, the optimal policy is obtained by maximizing

(1)

where is the initial state and is the discount factor.

The risk-sensitive problem transforms the above problem to account for salient features of the human decision-making process such as loss aversion, reference point dependence, and risk-sensitivity. Specifically, we introduce two key components, value functions and valuation functions, that allow our model to capture these features. The former captures risk-sensitivity, loss-aversion, and reference point dependence in its transformation of outcome values to their value as perceived by the agent, and the latter generalizes the expectation operator to more general measures of risk—specifically, convex risk measures.

II-B Value Functions

Given the environmental and reward uncertainties, we model the outcome of each action as a real-valued random variable on a finite event space, where the –th event occurs with a probability specified by an element of the space of probability distributions on that event space. Analogous to the expected utility framework, agents make choices based on the value of the outcome as determined by a value function .

There are a number of existing approaches to defining value functions that capture risk-sensitivity and loss aversion. These approaches derive from a variety of fields including behavioral psychology/economics, mathematical finance, and even neuroscience.

One of the principal features of human decision-making is that losses are perceived as more significant than gains of equal true value. The models with the greatest efficacy in capturing this effect are convex and concave in different regions of the outcome space. Prospect theory, e.g., is built on one such model [17, 18]. The value function most commonly used in prospect theory is given by

(2)

where is the reference point that the decision-maker compares outcomes against in determining if the decision is a loss or gain. The parameters control the degree of loss-aversion and risk-sensitivity; e.g.,

  1. implies preferences that are risk-averse on gains and risk-seeking on losses (concave in gains, convex in losses);

  2. implies risk-neutral preferences;

  3. implies preferences that are risk-averse on losses and risk-seeking on gains (convex in gains, concave in losses).

Experimental results for a series of one-off decisions have typically found parameter values indicating that humans are risk-averse on gains and risk-seeking on losses.
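For concreteness, the sketch below implements the Kahneman–Tversky value function on which (2) is based; the parameter names used here (reference point y0, loss-aversion weight lam, and separate exponents for gains and losses) are our own labels, and the default numbers are the classic experimental estimates rather than values taken from this paper.

    def prospect_value(y, y0=0.0, lam=2.25, gamma_gain=0.88, gamma_loss=0.88):
        # Concave power function above the reference point, convex below it,
        # with losses scaled by the loss-aversion weight lam > 1.
        if y >= y0:
            return (y - y0) ** gamma_gain
        return -lam * (y0 - y) ** gamma_loss

With exponents below one and lam > 1, this reproduces the risk-averse-on-gains, risk-seeking-on-losses shape described in the first item of the list above.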

In addition to the non-linear transformation of outcome values, prospect theory models the commonly observed effect of under/over-weighting the likelihood of events via a warping of event probabilities [19]. Other concepts such as framing, reference dependence, and loss aversion—captured, e.g., in the parameters in (2)—have also been widely observed in experimental studies (see, e.g., [20, 21, 22]).

Outside of the prospect theory value function, other mappings have been proposed to capture risk-sensitivity. Proposed in [8], the linear mapping

(3)

with is one such example. This value function can be viewed as a special case of (2).

Another example is the entropic map which is given by

(4)

where controls the degree of risk-sensitivity. The entropic map, however, is either convex or concave on the entire outcome space.

Motivated by the empirical evidence supporting the prospect theoretic value function and numerical considerations of our algorithm, we introduce a value function that retains the shape of the prospect theory value function while improving the performance (in terms of convergence speed) of the forward and inverse reinforcement learning procedures we propose. In particular, we define the locally Lipschitz-prospect () value function given by

(5)

with and , a small constant. This value function is Lipschitz continuous on a bounded domain. Moreover, the derivative of the function is bounded away from zero at the reference point. Hence, in practice it has better numerical properties.

We remark that, for given parameters , the function has the same risk-sensitivity as the prospect value function with those same parameters. Moreover, as the value function approaches the prospect value function and thus, qualitatively speaking, the degree of Lipschitzness decreases as .
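The exact form of (5) is not reproduced here, but one plausible way to realize a locally Lipschitz variant with the properties just described (same qualitative shape as the prospect value function, a slope bounded away from zero at the reference point, and recovery of the prospect value function as the small constant goes to zero) is the following sketch; it illustrates the idea and is not necessarily the definition used in the paper.

    def smoothed_prospect_value(y, y0=0.0, lam=2.25, gamma_gain=0.88,
                                gamma_loss=0.88, eps=1e-2):
        # Illustrative smoothing only -- not claimed to match (5) exactly.
        # Shifting the argument by eps keeps the derivative at the reference
        # point finite and nonzero; the prospect value is recovered as eps -> 0.
        if y >= y0:
            return (y - y0 + eps) ** gamma_gain - eps ** gamma_gain
        return -lam * ((y0 - y + eps) ** gamma_loss - eps ** gamma_loss)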

The fact that each of these value functions are defined by a small number of parameters that are highly interpretable in terms of risk-sensitivity and loss-aversion is one of the motivating factors for integrating them into a reinforcement learning framework. It is our aim to design learning algorithms that will ultimately provide the theoretical underpinnings for designing incentives and control policies taking into consideration salient features of human decision-making behavior.

II-C Valuation Functions via Convex Risk Metrics

To further capture risk-sensitivity, valuation functions generalize the expectation operator, which considers average or expected outcomes, to measures of risk. (In the case of two events, the valuation function can also capture warping of probabilities; alternative approaches based on cumulative prospect theory for the more general case have been examined [6].)

[Monetary Risk Measure [23]] A functional on the space of measurable functions defined on a probability space is said to be a monetary risk measure if is finite and if, for all , satisfies the following:

  1. (monotone)

  2. (translation invariant)

If a monetary risk measure satisfies

(6)

for , then it is a convex risk measure. If, additionally, is positive homogeneous, i.e., if for all , then we call a coherent risk measure. While the results apply to coherent risk measures, we will primarily focus on convex measures of risk, a less restrictive class, which are generated by sets of acceptable positions.

Denote the space of probability measures on by . [Acceptable Positions] Consider a value function , a probability measure , and an acceptance level with in the domain of . The set

(7)

is the set of acceptable positions. The above definition can be extended to the entire class of probability measures on as follows:

(8)

with constants such that .

[[23, Proposition 4.7]] Suppose the class of acceptable positions is a non-empty subset of satisfying

  1. , , and

  2. given , .

Then, induces a monetary measure of risk . If is convex, then is a convex measure of risk. Furthermore, if is a cone, then is a coherent risk metric. Note that a monetary measure of risk induced by a set of acceptable positions is given by

(9)

The following proposition is key for extending the expectation operator to more general measures of risk. [[23, Proposition 4.104]] Consider

(10)

for a continuous value function , acceptance level for some in the domain of , and probability measure . Suppose that is strictly increasing on for some . Then, the corresponding is a convex measure of risk which is continuous from below. Moreover, is the unique solution to

(11)

Proposition II-C also implies that for each value function, we can define an acceptance set which in turn induces a convex risk metric . Let us consider an example. [Entropic Risk Metric [23]] Consider the entropic value function . It has been used extensively in the field of risk measures [23], in neuroscience to capture risk sensitivity in motor control [9] and even more so in control of MDPs (see, e.g.[24]).

The entropic value function with an acceptance level can be used to define the acceptance set

(12)

with corresponding risk metric

(13)
(14)

The parameter controls the risk preference; indeed, this can be seen by considering the Taylor expansion [23, Example 4.105].

As a further comment, this particular risk metric is equivalent (up to an additive constant) to the so-called entropic risk measure, which is given by

(15)

where is the set of all measures on that are absolutely continuous with respect to and where is the relative entropy function.
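As a small numerical illustration of this example, the sketch below evaluates the standard entropic risk measure ρ(X) = (1/θ) log E[exp(−θX)] for a discrete random variable; the variable names and the example gamble are placeholders.

    import math

    def entropic_risk(outcomes, probs, theta):
        # Entropic risk: rho(X) = (1/theta) * log E[exp(-theta * X)].
        # As theta -> 0 this tends to -E[X]; larger theta penalizes variability more.
        mgf = sum(p * math.exp(-theta * x) for x, p in zip(outcomes, probs))
        return math.log(mgf) / theta

    # A zero-mean gamble is assigned a strictly positive risk that grows with theta.
    outcomes, probs = [-1.0, 1.0], [0.5, 0.5]
    print(entropic_risk(outcomes, probs, 0.1), entropic_risk(outcomes, probs, 2.0))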

Let us recall the concept of a valuation function introduced and used in [23, 7, 25]. [Valuation Function] A mapping is called a valuation function if for each , (i) whenever (monotonic) and (ii) for any (translation invariant). Such a map is used to characterize an agent’s preferences—that is, one prefers to whenever .

We will consider valuation functions that are convex risk metrics induced by a value function and a probability measure . To simplify notation, from here on out we will suppress the dependence on the probability measure .

For each state–action pair, we define a valuation map such that is a valuation function induced by an acceptance set with respect to value function and acceptance level .

If we let , the optimization problem in (1) generalizes to

(16)

where we define .

II-D Risk-Sensitive Q-Learning Convergence

In the classical reinforcement learning framework, the Bellman equation is used to derive a Q-learning procedure. Generalizations of the Bellman equation for risk-sensitive reinforcement learning—derived, e.g., in [14, 8]—have been used to formulate an action–value function or Q-learning procedure for the risk-sensitive reinforcement learning problem. In particular, as shown in [14], if satisfies

(17)

then holds for all ; moreover, a deterministic policy is optimal if  [14, Thm. 5.5]. The action–value function is defined such that (17) becomes

(18)

for all .

Given a value function and acceptance level , we use the coherent risk metric induced state-action valuation function given by

(19)

where the expectation is taken with respect to . Hence, by a direct application of Proposition II-C, if is continuous and strictly increasing, then is the unique solution to .

As shown in [7, Proposition 3.1], by letting , we have that corresponds to and, in particular,

(20)

where, again, the expectation is taken with respect to .

The above leads naturally to a Q-learning procedure,

(21)

where the non-linear transformation is applied to the temporal difference instead of simply to the reward . Transforming the temporal differences avoids certain pitfalls of the reward transformation approach, such as poor convergence performance. This procedure has convergence guarantees, even in this more general setting, under some assumptions on the value function .
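A minimal sketch of one step of this procedure is given below, with the value function u applied to the temporal difference as in (21); the Q-table layout and helper names are illustrative, and u can be any of the value functions from Section II-B (e.g., the prospect_value sketch above).

    def risk_sensitive_q_update(Q, s, a, r, s_next, alpha, gamma, u, actions):
        # Temporal difference: reward plus discounted best successor value,
        # minus the current estimate.
        td = r + gamma * max(Q[(s_next, b)] for b in actions[s_next]) - Q[(s, a)]
        # The non-linear value function u is applied to the temporal difference,
        # not to the reward, as in (21).
        Q[(s, a)] += alpha * u(td)
        return Q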

[Q-learning Convergence [7, Theorem 3.2]] Suppose that is in , is strictly increasing in , and there exist constants such that for all . Moreover, suppose that there exists a such that . If the non-negative learning rates are such that and , , then the procedure in (21) converges to for all with probability one. The assumptions on are fairly standard, and the core of the convergence proof is based on the Robbins–Siegmund Theorem appearing in the seminal work [26].
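The learning-rate conditions are the usual stochastic-approximation requirements that the rates sum to infinity while their squares have a finite sum; one standard schedule satisfying them (with illustrative constants) is sketched below.

    def learning_rate(t, c=1.0, kappa=0.7):
        # With 0.5 < kappa <= 1, the sequence c / (t + 1)**kappa is non-negative,
        # sums to infinity, and has a finite sum of squares.
        return c / (t + 1) ** kappa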

We note that the assumptions on the value function of Theorem II-D are fairly restrictive, excluding many of the value functions presented in Section II-B. For example, value functions of the form and do not satisfy the global Lipschitz condition.

We generalize the convergence result in Theorem II-D by modifying the assumptions on the value function to ensure that we have convergence of the Q-learning procedure for the and entropic value functions. The value function satisfies the following:

  1. it is strictly increasing in and there exists a such that ;

  2. it is locally Lipschitz on any ball of finite radius centered at the origin;

Note that in comparison to the assumptions of Theorem II-D, we have removed the assumption that the derivative of is bounded away from zero, and relaxed the global Lipschitz assumption on . We remark that the and entropic value functions satisfy these assumptions for all parameters and MDPs.

Let be a complete metric space endowed with the norm and let be the space of maps . Further, define . We then re-write the –update equation in the form

(22)

where and we have suppressed the dependence of on . This is a standard update equation form in, e.g., the stochastic approximation algorithm literature [27, 28, 29]. In addition, we define the map given by

(23)

which we will prove is a contraction.

Suppose that satisfies Assumption II-D and that for each the reward is bounded almost surely—that is, there exists such that almost surely. Moreover, let , for , be the Lipschitz constant of on .

  1. Let be a closed ball of radius centered at zero. Then, is a contraction.

  2. Suppose is chosen such that

    (24)

    where . Then, has a unique fixed point in .

The proof of the above theorem is provided in Appendix -A.

The following proposition shows that the and entropic value functions satisfy the assumption in (24). Moreover, it shows that value functions which satisfy Assumption II-D also satisfy (24). Consider an MDP with reward bounded almost surely by and , and consider the condition

(25)
  1. Suppose satisfies Assumption II-D and that for some , for all . Then (25) holds.

  2. Suppose is an value function with arbitrary parameters satisfying Assumption II-D. Then there exists a such that the value function satisfies (25).

  3. Suppose that is an entropic value function. Then there exists a such that for any where satisfies Assumption II-D, (25) holds with .

With Theorem II-D and Proposition II-D, we can prove convergence of Q-learning for risk-sensitive reinforcement learning. [Q-learning Convergence on ] Suppose that satisfies Assumption II-D and that for each the reward is bounded almost surely—that is, there exists such that almost surely. Moreover, suppose the ball is chosen such that (24) holds. If the non-negative learning rates are such that and , , then the procedure in (21) converges to with probability one. Given Theorem II-D, the proof of Theorem 25 follows directly from the same argument as provided in [14]. The key aspect of the proof is combining the fixed-point result of Theorem II-D with the Robbins–Siegmund Theorem [26].

Theorem II-D, Proposition II-D, and Theorem 25 extend the results for risk-sensitive reinforcement learning presented in [14] by relaxing the assumptions on the value functions for which the Q-learning procedure converges.

III Inverse Risk-Sensitive Reinforcement Learning

We formulate the inverse risk-sensitive reinforcement learning problem as follows. First, we select a parametric class of policies, , and parametric value function , where is a family of value functions and .

We use value functions such as those described in Section II-B; e.g., if is the prospect theory value function defined in (2), then the parameter vector consists of the corresponding loss-aversion and risk-sensitivity parameters together with the reference point. For mappings and , we now indicate their dependence on —that is, we will write and where . Note that since is the temporal difference, it also depends on , and we will indicate this dependence where it is not directly obvious by writing .

It is common in the literature to adopt a smooth map that operates on the action-value function space for defining the parametric policy space—e.g., Boltzmann policies of the form

(26)

applied to the action-value functions, where controls how close is to a greedy policy, which we define to be any policy such that at all states . We will utilize policies of this form. Note that, as is pointed out in [30], the benefit of selecting strictly stochastic policies is that, if the true agent’s policy is deterministic, uniqueness of the solution is forced.
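The following sketch computes such a Boltzmann policy from a table of action values; beta is our label for the parameter in (26) that controls how close the policy is to greedy.

    import math

    def boltzmann_policy(Q, s, actions, beta):
        # pi(a | s) is proportional to exp(beta * Q(s, a)); large beta approaches
        # a greedy policy, small beta approaches the uniform policy.
        prefs = [beta * Q[(s, a)] for a in actions[s]]
        m = max(prefs)                        # subtract the max for numerical stability
        weights = [math.exp(p - m) for p in prefs]
        z = sum(weights)
        return {a: w / z for a, w in zip(actions[s], weights)}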

We aim to tune the parameters so as to minimize some loss which is a function of the parameterized policy . By an abuse of notation, we introduce the shorthand .

III-A Inverse Reinforcement Learning Optimization Problem

The optimization problem is specified by

(27)

Given a set of demonstrations , it is our goal to recover the policy and estimate the value function.

There are several possible loss functions that may be employed. For example, suppose we elect to minimize the negative weighted log-likelihood of the demonstrated behavior which is given by

(28)

where may, e.g., be the normalized empirical frequency of observing pairs in , i.e.  where is the frequency of .
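Under the assumption that the weights are exactly these normalized empirical frequencies, the negative weighted log-likelihood in (28) can be computed as in the sketch below; demos is a list of observed state–action pairs and policy(s) returns the learned Boltzmann policy at state s.

    import math
    from collections import Counter

    def neg_weighted_log_likelihood(demos, policy):
        # Weight each (state, action) pair by its normalized empirical frequency
        # in the demonstration set, as in (28).
        counts = Counter(demos)
        total = sum(counts.values())
        loss = 0.0
        for (s, a), c in counts.items():
            loss -= (c / total) * math.log(policy(s)[a])
        return loss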

Related to maximizing the log-likelihood, an alternative loss function is the relative entropy or Kullback-Leibler (KL) divergence between the empirical distribution of the state-action trajectories and their distribution under the learned policy—that is,

(29)

where

(30)

is the KL divergence, is the sequence of observed states, and is the empirical distribution on the trajectories of .

III-B Gradient–Based Approach

We propose to solve the problem of estimating the parameters of the agent’s value function and approximating the agent’s policy via gradient methods, which requires computing the derivative of with respect to . Hence, given the form of the Q-learning procedure in which the temporal differences are transformed as in (21), we need to derive a mechanism for obtaining the optimal , show that it is in fact differentiable, and derive a procedure for obtaining its derivative.

Using some basic calculus, given the form of the smoothing map in (26), we can compute the derivative of the policy with respect to for an element of :

(31)
(32)
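Assuming (26) is the standard Boltzmann form, the derivative of the policy with respect to the action values in (31)–(32) is the usual softmax Jacobian; the sketch below returns J[a][b] = ∂π(a|s)/∂Q(s,b) for a fixed state, with beta again denoting the greediness parameter.

    def policy_jacobian_wrt_Q(pi_s, beta):
        # pi_s: dict mapping each admissible action a to pi(a | s) at a fixed state s.
        # Softmax Jacobian: d pi(a|s) / d Q(s,b) = beta * pi(a|s) * (1[a == b] - pi(b|s)).
        acts = list(pi_s)
        return {a: {b: beta * pi_s[a] * ((1.0 if a == b else 0.0) - pi_s[b])
                    for b in acts}
                for a in acts}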

We show that can be calculated almost everywhere on by solving fixed-point equations similar to the Bellman-optimality equations.

To do this, we require some assumptions on the value function . The value function satisfies the following conditions:

  1. is strictly increasing in and for each , there exists a such that ;

  2. for each , on any ball centered around the origin of finite radius, is locally Lipschitz in with constant and locally Lipschitz on with constant ;

  3. there exists such that for all .

Define and . As before, let . We re-write the –update equation as

(33)

where

is the temporal difference, and we have suppressed the dependence of on . In addition, define the map such that

(34)

where . This map is a contraction for each . Indeed, fixing , when satisfies Assumption III-B, then for cases where , was shown to be a contraction in [8] and in the more general setting (i.e. ), in [7].

Our first main result on inverse risk-sensitive reinforcement learning, which is the theoretical underpinning of our gradient-based algorithm, gives us a mechanism to compute the derivative of with respect to as a solution to a fixed-point equation via a contraction mapping argument.

Let be the derivative of with respect to the –th argument where . Assume that satisfies Assumption III-B. Then the following statements hold:

  1. is locally Lipschitz continuous as a function of —that is, for any , , for some ;

  2. except on a set of measure zero, the gradient is given by the solution of the fixed–point equation

    (35)

    where and is the action that maximizes where is any policy that is greedy with respect to .

We provide the proof in Appendix -C. To give a high-level outline, we use an induction argument combined with a contraction mapping argument on the map

(36)

The almost everywhere differentiability follows from Rademacher’s Theorem (see, e.g.[31, Thm. 3.1]).

Theorem III-B gives us a procedure—namely, a fixed–point equation which is a contraction—to compute the derivative so that we can compute the derivative of our loss function . Hence the gradient method provided in Algorithm 1 for solving the inverse risk-sensitive reinforcement learning problem is well formulated.

1:procedure RiskIRL()
2:     Initialize: an initial parameter estimate and the iteration counter k ← 0
3:     while the gradient norm exceeds the tolerance & the iteration limit is not reached do
4:         compute the gradient of the loss at the current parameter estimate using (31)–(32) and the fixed-point equation (35)
5:         step size ← LineSearch()
6:         update the parameter estimate by stepping in the negative gradient direction scaled by the step size
7:         k ← k + 1
8:     return the final parameter estimate
Algorithm 1 Gradient-Based Risk-Sensitive IRL

The prospect theory value function given in (2) is not globally Lipschitz in —in particular, it is not Lipschitz near the reference point —for values of and less than one. Moreover, for certain parameter combinations, it may not even be differentiable. The function, on the other hand, is locally Lipschitz and its derivative near the reference point is bounded away from zero. This makes it a more viable candidate for numerical implementation. Its derivative, however, is not bounded away from zero as .

This being said, we note that if the procedure for computing (which implements repeated applications of the map ) is initialized with finite for all and is bounded for all possible pairs, then the derivative of will always be bounded away from zero for all values realized in the procedure. An analogous statement can be made regarding the computation of . Hence, the procedures for computing and for all the value functions we consider (excluding the classical prospect value function) are guaranteed to converge (except on a set of measure zero).

Let us translate this remark into a formal result. Consider a modified version of Assumption III-B: The value function satisfies the following:

  1. it is strictly increasing in and for each , there exists a such that ;

  2. for each , it is Lipschitz in with constant and locally Lipschitz on with constant .

Simply speaking, analogous to Assumption II-D, we have removed the uniform lower bound on the derivative of . Moreover, Theorem II-D gives us that , as defined in (53), is a contraction on a ball of finite radius for each under Assumption II-D.

Assume that satisfies Assumption III-B and that the reward is bounded almost surely by . Then the following statements hold.

  1. For any ball , is locally Lipschitz-continuous on as a function of —that is, for any , , for some .

  2. For each , let be the ball with radius satisfying

    (37)

    Except on a set of measure zero, the gradient is given by the solution of the fixed–point equation

    (38)

    where and is the action that maximizes with being any policy that is greedy with respect to .

The proof (provided in Appendix -D) of the above theorem follows the same techniques as in Theorem II-D and Theorem III-B.

Note that for each fixed , condition (37) is the same as condition (24). Moreover, Proposition II-D shows that for the and entropic value functions, such a must exist for any choice of parameters.

III-C Complexity

Small dataset size is often a challenge in modeling sequential human decision-making, owing in large part to the frequency and time scale on which decisions are made in many applications. To properly understand how our gradient-based approach performs for different amounts of data, we analyze the case where the loss function is either the negative log-likelihood of the data—see (28) above—or the sum over states of the KL divergence between the policy under our learned value function and the empirical policy of the agent—see (29) above. These are two of the more common loss functions used in the literature.

We first note that maximizing the log-likelihood is equivalent to minimizing a weighted sum over states of the KL divergence between the empirical policy of the true agent, , and the policy under the learned value function, . In particular, through some algebraic manipulation the weighted log-likelihood can be re-written as

(39)

where is the frequency of state normalized by . This approach has the added benefit that it is independent of and therefore will not be affected by scaling of the value functions [30].

Both cost functions are natural metrics for performance in that they minimize a measure of the divergence between the optimal policy under the learned agent and the empirical policy of the true agent. While the KL divergence is not suitable for our analysis, since it is not a metric on the space of probability distributions, it does provide an upper bound on the total variation (TV) distance via Pinsker’s inequality:

(40)

where is the TV distance between and , defined as

(41)

The TV distance between distributions is a proper metric. Furthermore, use of the two cost functions described above also translates to minimizing the TV distance, since it is upper bounded by a function of the KL divergence.
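As a quick numerical illustration of (40)–(41), the sketch below computes the TV distance and KL divergence between two per-state action distributions and checks Pinsker's inequality; the example probabilities are arbitrary.

    import math

    def tv_distance(p, q):
        # Total variation distance between two distributions over the same action set.
        return 0.5 * sum(abs(p[a] - q[a]) for a in p)

    def kl_divergence(p, q):
        # KL divergence D(p || q); assumes q(a) > 0 wherever p(a) > 0.
        return sum(p[a] * math.log(p[a] / q[a]) for a in p if p[a] > 0)

    pi_hat = {"stay": 0.7, "go": 0.3}     # empirical policy at some state
    pi_theta = {"stay": 0.6, "go": 0.4}   # policy under the learned value function

    # Pinsker's inequality: TV <= sqrt(KL / 2).
    assert tv_distance(pi_hat, pi_theta) <= math.sqrt(kl_divergence(pi_hat, pi_theta) / 2)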

We first note that, for each state , we would ideally like to obtain a bound on , the TV distance between the agent’s true policy and the estimated policy . However, we only have access to the empirical policy . We therefore use the triangle inequality to obtain an upper bound on in terms of quantities that we can calculate explicitly or bound. In particular, we derive the following bound:

(42)

Note that is tantamount to a training error as metricized by the TV distance, and is upper bounded by a function of the KL divergence (which appears in the loss function) via (40).

The first term in (42), , is the distance between the empirical policy and the true policy in state . Using the Dvoretzky–Kiefer–Wolfowitz inequality (see, e.g., [32, 33]), this term can be bounded above with high probability. Indeed,

(43)

where is the number of samples from the distribution and is the cardinality of the action set. Combining this bound with (42), we get that, with probability ,

(44)

Supposing Algorithm 1 achieves a sufficiently small training error , the second term above can be bounded above by a small calculable amount, which we denote by . Supposing is also sufficiently small, the dominating term in the distance between and is the first term on the right-hand side of (44). This gives us a per-state convergence rate, which is seen qualitatively in our experiments on sample complexity outlined in Section IV-A4.

We note that this bound is for each individual state . Thus, for states that are visited more frequently by the agent, we have better guarantees on how well the policy under the learned value function approximates the true policy. Moreover, it suggests ways of designing data collection schemes to better understand the agent’s actions in less explored regions of the state space.

IV Examples

Let us now demonstrate the performance of the proposed method on two examples. While we are able to formulate the inverse risk-sensitive reinforcement learning problem for parameter vectors that include and , in the following examples we use and . The purpose of doing this is to explore the effects of changing the value function parameters on the resulting policy.