I Introduction
Modeling and learning human decision-making behavior is becoming increasingly important as critical systems rely more heavily on automation and artificial intelligence. This task poses a number of challenges, not least of which is that humans are known to behave in ways that are not completely rational. There is mounting evidence that humans often use reference points (e.g., the status quo, former experiences, or recent expectations about the future) that are perceived to be related to the decision being made [1, 2]. It has also been observed that their decisions are affected by their perception of the external world (exogenous factors) and their present state of mind (endogenous factors), as well as by how the decision is framed or presented [3].
The success of descriptive behavioral models in capturing human behavior has long been touted by the psychology community and, more recently, by the economics community. In the engineering context, humans have largely been modeled under rationality assumptions from the so-called normative point of view, in which things are modeled as they ought to be, in contrast to a descriptive, as-is point of view.
However, risk-sensitivity in the context of learning to control stochastic dynamical systems (see, e.g., [4, 5]) has been explored fairly extensively in engineering. Many of these approaches aim to mitigate risks due to uncertainties in controlling a system such as a plant or robot, where risk-aversion is captured by techniques such as exponential utility functions or the minimization of mean-variance criteria.
Complex risk-sensitive behavior arising from human interaction with automation is only recently coming into focus. Human decision makers can be at once risk-averse and risk-seeking depending on their frame of reference. The adoption of diverse behavioral models in engineering, in particular in learning and control, is growing because humans are increasingly playing an integral role in automation at both the individual and societal scale. Learning accurate models of human decision-making is important for both prediction and description. For example, control/incentive schemes need to predict human behavior as a function of external stimuli, including not only potential disturbances but also the control/incentive mechanism itself. On the other hand, policy makers and regulatory agencies, for example, are interested in interpreting human reactions to implemented regulations and policies.
Approaches for integrating risk-sensitivity into control and reinforcement learning problems via behavioral models have recently emerged [6, 7, 8, 9, 10]. These approaches largely assume a risk-sensitive Markov decision process (MDP) formulated from a model that captures behavioral aspects of the human's decision-making process. We refer to the problem of learning the optimal policy in this setting as the forward problem. Our primary interest is in solving the so-called inverse problem, which seeks to estimate the decision-making process given a set of demonstrations; doing so, however, requires a well-formulated forward problem with convergence guarantees.
Inverse reinforcement learning, in which policies are recovered directly (or indirectly by first learning a representation of the reward), has long been studied in the context of expected utility maximization and MDPs [11, 12, 13]. We may care about, e.g., producing the value and reward functions (or at least characterizing the space of these functions) that produce behaviors matching those observed. On the other hand, we may want to extract the optimal policy from a set of demonstrations so that we can reproduce the behavior in support of, e.g., designing incentives or control policies. In this paper, our focus is on the combination of these two tasks.
We model human decision-makers as risk-sensitive Q-learning agents, exploiting rich models from behavioral psychology and economics that capture a whole spectrum of risk-sensitive behaviors and loss aversion. We first derive a reinforcement learning algorithm that leverages convex risk metrics and behavioral value functions. We provide convergence guarantees via a contraction mapping argument. In comparison to previous work in this area [14], we show that the behavioral value functions we introduce satisfy the assumptions of our theorems.
Given the forward risk-sensitive reinforcement learning algorithm, we propose a gradient-based learning algorithm for inferring the decision-making model parameters from demonstrations; that is, we propose a framework for solving the inverse risk-sensitive reinforcement learning problem with theoretical guarantees. We show that the gradient of the loss function with respect to the model parameters is well-defined and computable via a contraction map argument. We demonstrate the efficacy of the learning scheme on the canonical Grid World example and on a passenger's view of ride-sharing modeled as an MDP with parameters estimated from real-world data.
The work in this paper significantly extends our previous work [15]: first, by providing the proofs for the theoretical results appearing in the earlier work, and second, by providing a more extensive theory for both the forward and inverse risk-sensitive reinforcement learning problems.
The remainder of this paper is organized as follows. In Section II, we overview the model we assume for risk-sensitive agents, show that it is amenable to integration with the behavioral models, and present our risk-sensitive Q-learning convergence results. In Section III, we formulate the inverse problem and propose a gradient-based algorithm to solve it. Examples that demonstrate the ability of the proposed scheme to capture a wide breadth of risk-sensitive behaviors are provided in Section IV. We comment on connections to recent related work in Section V. Finally, we conclude with some discussion in Section VI.
II Risk-Sensitive Reinforcement Learning
In order to learn a decision-making model for an agent who faces sequential decisions in an uncertain environment, we leverage a risk-sensitive Q-learning model that integrates coherent risk metrics with behavioral models. In particular, the model we use is based on one first introduced in [16] and later refined in [8, 7].
The primary difference between the work presented in this section and previous work is that we (i) introduce a new prospect-theory-based value function and (ii) provide a convergence theorem whose assumptions are satisfied by the behavioral models we use. (For further details on the relationship between the work in this paper and related works, including our previous work, see Section V.) Under the assumption that the agent makes decisions according to this model, in the sequel we formulate a gradient-based method for learning the policy as well as the parameters of the agent's value function.
II-A Markov Decision Process
We consider a class of finite MDPs consisting of a state space $X$, an admissible action space $A(x) \subset A$ for each $x \in X$, a transition kernel $P(x'|x,a)$ that denotes the probability of moving from state $x$ to $x'$ given action $a$, and a reward function $r : X \times A \times W \to \mathbb{R}$, where $W$ is the space of bounded disturbances and $w \in W$ has distribution $P_r(\cdot|x,a)$. (It is possible to consider a more general reward structure; we exclude this case to avoid further complicating the notation.) Including disturbances allows us to model random rewards; we use the notation $R(x,a)$ to denote the random reward having distribution $P_r(\cdot|x,a)$.
In the classical expected utility maximization framework, the agent seeks to maximize the expected discounted rewards by selecting a Markov policy $\pi$; that is, for an infinite horizon MDP, the optimal policy is obtained by maximizing, over $\pi$,
$$\mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t} R(x_t, a_t) \,\Big|\, x_0\Big] \tag{1}$$
where $x_0$ is the initial state and $\gamma \in (0,1)$ is the discount factor.
The risk-sensitive problem transforms the above problem to account for salient features of the human decision-making process such as loss aversion, reference point dependence, and risk-sensitivity. Specifically, we introduce two key components, value functions and valuation functions, that allow our model to capture these features. The former captures risk-sensitivity, loss aversion, and reference point dependence by transforming outcome values into their values as perceived by the agent; the latter generalizes the expectation operator to more general measures of risk, specifically convex risk measures.
II-B Value Functions
Given the environmental and reward uncertainties, we model the outcome of each action as a real-valued random variable $Y : \Omega \to \mathbb{R}$, where $\Omega$ denotes a finite event space and $Y(i)$ is the outcome of the $i$-th event, which occurs with probability $\mu(i)$, where $\mu \in \Delta(\Omega)$, the space of probability distributions on $\Omega$. Analogous to the expected utility framework, agents make choices based on the value of the outcome determined by a value function $v : \mathbb{R} \to \mathbb{R}$.
There are a number of existing approaches to defining value functions that capture risk-sensitivity and loss aversion. These approaches derive from a variety of fields including behavioral psychology/economics, mathematical finance, and even neuroscience.
One of the principal features of human decision-making is that a loss is perceived as more significant than a gain of equal true value. The models with the greatest efficacy in capturing this effect are convex and concave in different regions of the outcome space. Prospect theory, e.g., is built on one such model [17, 18]. The value function most commonly used in prospect theory is given by
$$v(y) = \begin{cases} k_+ (y - y_0)^{\zeta_+}, & y > y_0 \\ -k_- (y_0 - y)^{\zeta_-}, & y \le y_0 \end{cases} \tag{2}$$
where $y_0$ is the reference point against which the decision-maker compares outcomes to determine whether an outcome is a loss or a gain. The parameters $(k_+, k_-, \zeta_+, \zeta_-)$ control the degree of loss aversion and risk-sensitivity; e.g.,

$0 < \zeta_+, \zeta_- < 1$ implies preferences that are risk-averse on gains and risk-seeking on losses (concave in gains, convex in losses);

$\zeta_+ = \zeta_- = 1$ implies risk-neutral preferences;

$\zeta_+, \zeta_- > 1$ implies preferences that are risk-averse on losses and risk-seeking on gains (convex in gains, concave in losses).
Experimental results for a series of one-off decisions have indicated that typically $0 < \zeta_+, \zeta_- < 1$, thereby indicating that humans are risk-averse on gains and risk-seeking on losses.
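As a minimal illustration of the shape properties listed above, the following Python sketch evaluates a prospect-theory value function of the standard two-piece power form. The function names `prospect_value` and `midpoint_gap`, the argument names `k_plus`, `k_minus`, `zeta_plus`, `zeta_minus`, and the default parameter values are assumptions chosen for illustration; they are not taken from the paper.

```python
def prospect_value(y, y0=0.0, k_plus=1.0, k_minus=2.0,
                   zeta_plus=0.5, zeta_minus=0.5):
    """Two-piece power value function around the reference point y0 (assumed form)."""
    if y > y0:
        return k_plus * (y - y0) ** zeta_plus      # gains branch
    return -k_minus * (y0 - y) ** zeta_minus       # losses branch


def midpoint_gap(v, a, b):
    """v at the midpoint of [a, b] minus the average of v(a) and v(b).

    A positive gap indicates concavity on [a, b]; a negative gap indicates convexity.
    """
    return v(0.5 * (a + b)) - 0.5 * (v(a) + v(b))


if __name__ == "__main__":
    # With exponents below one: concave on gains, convex on losses.
    print("gap on gains :", midpoint_gap(prospect_value, 1.0, 3.0))
    print("gap on losses:", midpoint_gap(prospect_value, -3.0, -1.0))
    # With k_minus > k_plus, a unit loss weighs more than a unit gain.
    print("v(+1), v(-1) :", prospect_value(1.0), prospect_value(-1.0))
```

With exponents below one, the gap is positive on gains and negative on losses, which matches the risk-averse-on-gains, risk-seeking-on-losses case; choosing `k_minus > k_plus` is one simple way to express loss aversion in this sketch.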
In addition to the nonlinear transformation of outcome values, prospect theory models the commonly observed tendency to underweight or overweight the likelihood of events via a warping of event probabilities [19]. Other concepts such as framing, reference dependence, and loss aversion (captured, e.g., in the parameters in (2)) have also been widely observed in experimental studies (see, e.g., [20, 21, 22]).
Outside of the prospect theory value function, other mappings have been proposed to capture risk-sensitivity. Proposed in [8], the linear mapping
$$v(y) = \begin{cases} (1-\kappa)(y - y_0), & y > y_0 \\ (1+\kappa)(y - y_0), & y \le y_0 \end{cases} \tag{3}$$
with $\kappa \in (-1,1)$ and reference point $y_0$ is one such example. This value function can be viewed as a special case of (2).
Another example is the entropic map which is given by
(4) 
where controls the degree of risk-sensitivity. The entropic map, however, is either convex or concave on the entire outcome space.
Motivated by the empirical evidence supporting the prospect-theoretic value function and by numerical considerations of our algorithm, we introduce a value function that retains the shape of the prospect theory value function while improving the performance (in terms of convergence speed) of the forward and inverse reinforcement learning procedures we propose. In particular, we define the locally Lipschitz-prospect ($\ell$-prospect) value function given by
$$v(y) = \begin{cases} k_+ (y - y_0 + \epsilon)^{\zeta_+} - k_+ \epsilon^{\zeta_+}, & y > y_0 \\ -k_- (y_0 - y + \epsilon)^{\zeta_-} + k_- \epsilon^{\zeta_-}, & y \le y_0 \end{cases} \tag{5}$$
with $k_+, k_-, \zeta_+, \zeta_- > 0$, reference point $y_0$, and $\epsilon > 0$, a small constant. This value function is Lipschitz continuous on a bounded domain. Moreover, the derivative of the function is bounded away from zero at the reference point. Hence, in practice it has better numerical properties.
We remark that, for given parameters $(k_+, k_-, \zeta_+, \zeta_-)$, the $\ell$-prospect function has the same risk-sensitivity as the prospect value function with those same parameters. Moreover, as $\epsilon \to 0$ the $\ell$-prospect value function approaches the prospect value function and thus, qualitatively speaking, the degree of Lipschitzness decreases as $\epsilon \to 0$.
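To make the numerical motivation concrete, the sketch below compares the slope at the reference point of a two-piece power (prospect-type) value function with that of a shifted variant of the kind described above. The specific shifted form, the names `prospect_value`, `l_prospect_value`, `forward_slope`, and all parameter values are assumptions made for illustration only.

```python
def prospect_value(y, y0=0.0, k_plus=1.0, k_minus=2.0,
                   zeta_plus=0.5, zeta_minus=0.5):
    """Two-piece power value function (assumed standard form)."""
    if y > y0:
        return k_plus * (y - y0) ** zeta_plus
    return -k_minus * (y0 - y) ** zeta_minus


def l_prospect_value(y, y0=0.0, k_plus=1.0, k_minus=2.0,
                     zeta_plus=0.5, zeta_minus=0.5, eps=1e-2):
    """Shifted variant: the argument is offset by eps and the constant
    k * eps**zeta is removed so that the value at y0 is zero (assumed form)."""
    if y > y0:
        return k_plus * (y - y0 + eps) ** zeta_plus - k_plus * eps ** zeta_plus
    return -k_minus * (y0 - y + eps) ** zeta_minus + k_minus * eps ** zeta_minus


def forward_slope(v, h, y0=0.0):
    """Forward-difference slope of v just above the reference point."""
    return (v(y0 + h) - v(y0)) / h


if __name__ == "__main__":
    for h in (1e-2, 1e-4, 1e-6):
        print(h, forward_slope(prospect_value, h), forward_slope(l_prospect_value, h))
```

In this sketch the unshifted slope grows without bound as `h` shrinks when the exponent is below one, while the shifted slope settles near `k_plus * zeta_plus * eps ** (zeta_plus - 1)`, which is finite and nonzero; shrinking `eps` pushes that limit upward, consistent with the remark that the degree of Lipschitzness decreases as the constant goes to zero.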
The fact that each of these value functions is defined by a small number of parameters that are highly interpretable in terms of risk-sensitivity and loss aversion is one of the motivating factors for integrating them into a reinforcement learning framework. It is our aim to design learning algorithms that will ultimately provide the theoretical underpinnings for designing incentives and control policies that take into consideration salient features of human decision-making behavior.
II-C Valuation Functions via Convex Risk Metrics
To further capture risk-sensitivity, valuation functions generalize the expectation operator, which considers average or expected outcomes, to measures of risk. (In the case of two events, the valuation function can also capture warping of probabilities. Alternative approaches based on cumulative prospect theory for the more general case have been examined [6].)
[Monetary Risk Measure [23]] A functional on the space of measurable functions defined on a probability space is said to be a monetary risk measure if is finite and if, for all , satisfies the following:

(monotone)

(translation invariant)
If a monetary risk measure $\rho$ satisfies
$$\rho(\lambda X + (1-\lambda) X') \le \lambda \rho(X) + (1-\lambda)\rho(X') \tag{6}$$
for $\lambda \in [0,1]$, then it is a convex risk measure. If, additionally, $\rho$ is positive homogeneous, i.e., if $\lambda \ge 0$ implies $\rho(\lambda X) = \lambda \rho(X)$, then we call $\rho$ a coherent risk measure. While the results apply to coherent risk measures, we will primarily focus on convex measures of risk, a less restrictive class, that are generated by a set of acceptable positions.
Denote the space of probability measures on by . [Acceptable Positions] Consider a value function , a probability measure , and an acceptance level with in the domain of . The set
(7) 
is the set of acceptable positions. The above definition can be extended to the entire class of probability measures on as follows:
(8) 
with constants such that .
[[23, Proposition 4.7]] Suppose the class of acceptable positions is a nonempty subset of satisfying

, , and

given , .
Then, induces a monetary measure of risk . If is convex, then is a convex measure of risk. Furthermore, if is a cone, then is a coherent risk metric. Note that a monetary measure of risk induced by a set of acceptable positions is given by
(9) 
The following proposition is key for extending the expectation operator to more general measures of risk. [[23, Proposition 4.104]] Consider
(10) 
for a continuous value function , acceptance level for some in the domain of , and probability measure . Suppose that is strictly increasing on for some . Then, the corresponding is a convex measure of risk which is continuous from below. Moreover, is the unique solution to
(11) 
Proposition IIC also implies that for each value function, we can define an acceptance set which in turn induces a convex risk metric . Let us consider an example. [Entropic Risk Metric [23]] Consider the entropic value function . It has been used extensively in the field of risk measures [23], in neuroscience to capture risk sensitivity in motor control [9] and even more so in control of MDPs (see, e.g., [24]).
The entropic value function with an acceptance level can be used to define the acceptance set
(12) 
with corresponding risk metric
(13)  
(14) 
The parameter controls the risk preference; indeed, this can be seen by considering the Taylor expansion [23, Example 4.105].
As a further comment, this particular risk metric is equivalent (up to an additive constant) to the so-called entropic risk measure, which is given by
(15) 
where is the set of all measures on that are absolutely continuous with respect to and where is the relative entropy function.
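The role of the risk parameter can be checked numerically through the Taylor expansion mentioned above. The sketch below assumes the certainty-equivalent form $\frac{1}{\lambda}\log \mathbb{E}[\exp(\lambda X)]$ for a finite outcome distribution and compares it with the second-order approximation $\mathbb{E}[X] + \frac{\lambda}{2}\mathrm{Var}[X]$. The function names, the sign convention, and the example distribution are assumptions for illustration and may differ from the convention used in the paper.

```python
import math


def entropic_certainty_equivalent(outcomes, probs, lam):
    """(1/lam) * log E[exp(lam * X)] for a finite distribution (assumed convention)."""
    moment = sum(p * math.exp(lam * y) for y, p in zip(outcomes, probs))
    return math.log(moment) / lam


def mean_and_variance(outcomes, probs):
    """Mean and variance of a finite distribution."""
    mean = sum(p * y for y, p in zip(outcomes, probs))
    var = sum(p * (y - mean) ** 2 for y, p in zip(outcomes, probs))
    return mean, var


if __name__ == "__main__":
    outcomes, probs = [-1.0, 0.0, 2.0], [0.3, 0.4, 0.3]
    mean, var = mean_and_variance(outcomes, probs)
    for lam in (-1.0, -0.01, 0.01, 1.0):
        exact = entropic_certainty_equivalent(outcomes, probs, lam)
        approx = mean + 0.5 * lam * var
        print(f"lam={lam:+.2f} exact={exact:.5f} second-order={approx:.5f}")
```

Under this convention a positive parameter rewards variance and a negative one penalizes it, so the sign of the parameter determines whether the induced preference is risk-seeking or risk-averse.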
Let us recall the concept of a valuation function introduced and used in [23, 7, 25]. [Valuation Function] A mapping is called a valuation function if for each , (i) whenever (monotonic) and (ii) for any (translation invariant). Such a map is used to characterize an agent’s preferences—that is, one prefers to whenever .
We will consider valuation functions that are convex risk metrics induced by a value function and a probability measure . To simplify notation, from here on out we will suppress the dependence on the probability measure .
For each state–action pair, we define a valuation map such that is a valuation function induced by an acceptance set with respect to value function and acceptance level .
II-D Risk-Sensitive Q-Learning Convergence
In the classical reinforcement learning framework, the Bellman equation is used to derive a Q-learning procedure. Generalizations of the Bellman equation for risk-sensitive reinforcement learning (derived, e.g., in [14, 8]) have been used to formulate an action-value function or Q-learning procedure for the risk-sensitive reinforcement learning problem. In particular, as shown in [14], if satisfies
(17) 
then holds for all ; moreover, a deterministic policy is optimal if [14, Thm. 5.5]. The action–value function is defined such that (17) becomes
(18) 
for all .
Given a value function and acceptance level , we use the coherent risk metric induced stateaction valuation function given by
(19) 
where the expectation is taken with respect to . Hence, by a direct application of Proposition IIC, if is continuous and strictly increasing, then is the unique solution to .
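Because the valuation is characterized as the unique solution of a scalar equation, it can be computed numerically by one-dimensional root finding. The sketch below assumes the equation takes the form $\mathbb{E}[v(Y - m)] = v_0$ over a finite outcome distribution, with $v$ continuous, strictly increasing, and satisfying $v(0) = v_0$, so that the root lies between the smallest and largest outcome. The function name `valuation`, the bisection scheme, and the example data are illustrative assumptions, not the paper's implementation.

```python
import math


def valuation(outcomes, probs, v, v0, iterations=200):
    """Solve E[v(Y - m)] = v0 for m by bisection.

    Assumes v is continuous and strictly increasing with v(0) = v0, so that
    m -> E[v(Y - m)] - v0 is decreasing and changes sign on [min(Y), max(Y)].
    """
    def excess(m):
        return sum(p * v(y - m) for y, p in zip(outcomes, probs)) - v0

    lo, hi = min(outcomes), max(outcomes)
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if excess(mid) >= 0.0:
            lo = mid          # the root is at or above mid
        else:
            hi = mid          # the root is below mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    outcomes, probs = [-1.0, 0.0, 2.0], [0.3, 0.4, 0.3]
    # Identity value function with v0 = 0 recovers the expectation.
    print(valuation(outcomes, probs, lambda y: y, 0.0))
    # Exponential value function with v0 = 1 recovers (1/lam) * log E[exp(lam * Y)].
    lam = 0.5
    print(valuation(outcomes, probs, lambda y: math.exp(lam * y), 1.0))
```

With the identity value function the routine returns the mean, which is one way to see that this construction generalizes the expectation operator.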
As shown in [7, Proposition 3.1], by letting , we have that corresponds to and, in particular,
(20) 
where, again, the expectation is taken with respect to .
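The above leads naturally to a Q-learning procedure,
$$Q(x_t,a_t) \leftarrow Q(x_t,a_t) + \alpha_t(x_t,a_t)\Big[v\big(r_t + \gamma \max_{a} Q(x_{t+1},a) - Q(x_t,a_t)\big) - v_0\Big] \tag{21}$$
where $\alpha_t$ is the learning rate, $v_0$ is the acceptance level, and the nonlinear transformation $v$ is applied to the temporal difference $r_t + \gamma \max_{a} Q(x_{t+1},a) - Q(x_t,a_t)$ instead of simply the reward $r_t$. Transformation of the temporal differences avoids certain pitfalls of the reward transformation approach such as poor convergence performance. This procedure has convergence guarantees even in this more general setting under some assumptions on the value function $v$.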
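The update described above can be sketched in a few lines of Python. The code assumes a tabular action-value function stored as nested dictionaries, a piecewise-linear value function applied to the temporal difference, and an acceptance level of zero; the names `linear_value`, `risk_sensitive_q_update`, `simulate_single_state`, and the toy single-state problem are hypothetical and serve only to illustrate the mechanics.

```python
import random


def linear_value(kappa):
    """Piecewise-linear value function on temporal differences (reference point 0)."""
    def v(d):
        return (1.0 - kappa) * d if d > 0 else (1.0 + kappa) * d
    return v


def risk_sensitive_q_update(Q, x, a, r, x_next, alpha, gamma, v, v0=0.0):
    """One tabular update: the value function is applied to the temporal difference."""
    td = r + gamma * max(Q[x_next].values()) - Q[x][a]
    Q[x][a] += alpha * (v(td) - v0)
    return Q[x][a]


def simulate_single_state(kappa, steps=20000, seed=0):
    """Single state, single action, rewards of +1 or -1 with equal probability, gamma = 0."""
    rng = random.Random(seed)
    Q = {0: {0: 0.0}}
    v = linear_value(kappa)
    for t in range(steps):
        r = rng.choice([-1.0, 1.0])
        risk_sensitive_q_update(Q, 0, 0, r, 0, alpha=1.0 / (t + 1), gamma=0.0, v=v)
    return Q[0][0]


if __name__ == "__main__":
    print("kappa = 0.0:", simulate_single_state(0.0))   # close to the mean reward, 0
    print("kappa = 0.5:", simulate_single_state(0.5))   # close to -0.5
```

In this toy problem the fixed point solves $\mathbb{E}[v(R - Q)] = 0$, which gives $Q = -\kappa$: a positive `kappa` weights negative temporal differences more heavily and pulls the learned value below the mean reward, while `kappa = 0` reduces to ordinary Q-learning. The step sizes `1 / (t + 1)` are one choice that is square-summable but not summable.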
[Qlearning Convergence [7, Theorem 3.2]] Suppose that is in , is strictly increasing in and there exists constants such that for all . Moreover, suppose that there exists a such that . If the nonnegative learning rates are such that and , , then the procedure in (21) converges to for all with probability one. The assumptions on are fairly standard and the core of the convergence proof is based on the Robbins–Siegmund Theorem appearing in the seminal work [26].
We note that the assumptions on the value function of Theorem IID are fairly restrictive, excluding many of the value functions presented in Section IIB. For example, value functions of the form and do not satisfy the global Lipschitz condition.
We generalize the convergence result in Theorem IID by modifying the assumptions on the value function to ensure that we have convergence of the Qlearning procedure for the and entropic value functions. The value function satisfies the following:

it is strictly increasing in and there exists a such that ;

it is locally Lipschitz on any ball of finite radius centered at the origin;
Note that in comparison to the assumptions of Theorem IID, we have removed the assumption that the derivative of is bounded away from zero, and relaxed the global Lipschitz assumption on . We remark that the and entropic value functions satisfy these assumptions for all parameters and MDPs.
Let be a complete metric space endowed with the norm and let be the space of maps . Further, define . We then rewrite the –update equation in the form
(22) 
where and we have suppressed the dependence of on . This is a standard update equation form in, e.g., the stochastic approximation algorithm literature [27, 28, 29]. In addition, we define the map given by
(23) 
which we will prove is a contraction.
Suppose that satisfies Assumption IID and that for each the reward is bounded almost surely—that is, there exists such that almost surely. Moreover, let , for , the Lipschitz constant of on .

Let be a closed ball of radius centered at zero. Then, is a contraction.

Suppose is chosen such that
(24) where . Then, has a unique fixed point in .
The proof of the above theorem is provided in Appendix A.
The following proposition shows that the and entropic value functions satisfy the assumption in (24). Moreover, it shows that the value functions which satisfy Assumption IID also satisfy (24). Consider a MDP with reward bounded almost surely by and and consider the condition
(25) 
With Theorem IID and Proposition IID, we can prove convergence of Qlearning for risksensitive reinforcement learning. [Qlearning Convergence on ] Suppose that satisfies Assumption IID and that for each the reward is bounded almost surely—that is, there exists such that almost surely. Moreover, suppose the ball is chosen such that (24) holds. If the nonnegative learning rates are such that and , , then the procedure in (21) converges to with probability one. Given Theorem IID, the proof of Theorem 25 follows directly the same proof as provided in [14]. The key aspect of the proof is combining the fixed point result of Theorem IID with Robbins–Siegmund Theorem [26].
III Inverse Risk-Sensitive Reinforcement Learning
We formulate the inverse risksensitive reinforcement learning problem as follows. First, we select a parametric class of policies, , and parametric value function , where is a family of value functions and .
We use value functions such as those described in Section IIB; e.g., if is the prospect theory value function defined in (2
), then the parameter vector is
. For mappings and , we now indicate their dependence on —that is, we will write and where . Note that since is the temporal difference it also depends on and we will indicate this dependence where it is not directly obvious by writing .It is common in the literature to adopt a smooth map that operates on the actionvalue function space for defining the parametric policy space—e.g., Boltzmann policies of the form
$$G_\beta(Q)(a|x) = \frac{\exp(\beta Q(x,a))}{\sum_{a'} \exp(\beta Q(x,a'))} \tag{26}$$
to the action-value functions $Q$, where $\beta > 0$ controls how close $G_\beta(Q)$ is to a greedy policy, which we define to be any policy $\pi$ such that $\sum_{a} \pi(a|x) Q(x,a) = \max_{a} Q(x,a)$ at all states $x$. We will utilize policies of this form. Note that, as is pointed out in [30], the benefit of selecting strictly stochastic policies is that if the true agent's policy is deterministic, uniqueness of the solution is forced.
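A Boltzmann policy of the kind described above can be computed with a numerically stable softmax over the action values in a given state. In the sketch below, the function name `boltzmann_policy`, the dictionary representation of action values, and the example numbers are assumptions for illustration.

```python
import math


def boltzmann_policy(q_values, beta):
    """Map the action values of one state to action probabilities.

    q_values: dict mapping action -> action value in the given state.
    beta: non-negative inverse temperature; larger beta is closer to greedy.
    """
    q_max = max(q_values.values())                 # subtract the max for stability
    weights = {a: math.exp(beta * (q - q_max)) for a, q in q_values.items()}
    normalizer = sum(weights.values())
    return {a: w / normalizer for a, w in weights.items()}


if __name__ == "__main__":
    q = {"left": 1.0, "right": 2.0, "stay": 0.0}
    for beta in (0.0, 1.0, 50.0):
        print(beta, boltzmann_policy(q, beta))
```

Setting `beta` to zero yields the uniform policy, and increasing it concentrates the probability mass on the maximizing action while keeping the policy strictly stochastic for any finite value.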
We aim to tune the parameters so as to minimize some loss which is a function of the parameterized policy . By an abuse of notation, we introduce the shorthand .
III-A Inverse Reinforcement Learning Optimization Problem
The optimization problem is specified by
(27) 
Given a set of demonstrations , it is our goal to recover the policy and estimate the value function.
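There are several possible loss functions that may be employed. For example, suppose we elect to minimize the negative weighted log-likelihood of the demonstrated behavior, which is given by
$$\ell(\theta) = -\sum_{(x,a) \in \mathcal{D}} w(x,a) \log \pi_\theta(a|x) \tag{28}$$
where $\mathcal{D}$ is the set of demonstrations, $\pi_\theta$ is the parameterized policy, and $w(x,a)$ may, e.g., be the normalized empirical frequency of observing $(x,a)$ pairs in $\mathcal{D}$, i.e., $w(x,a) = n(x,a)/|\mathcal{D}|$, where $n(x,a)$ is the frequency of $(x,a)$.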
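The negative weighted log-likelihood can be evaluated directly from a list of demonstrated state-action pairs. The sketch below assumes the weights are the normalized empirical frequencies of the pairs and that the sum runs over distinct pairs; the function name, the data layout (a list of `(state, action)` tuples and a nested dictionary for the policy), and the toy data are hypothetical.

```python
import math
from collections import Counter


def negative_weighted_log_likelihood(demonstrations, policy):
    """Negative log-likelihood with weights n(x, a) / N (assumed weighting).

    demonstrations: list of (state, action) pairs.
    policy: dict mapping state -> dict mapping action -> probability.
    """
    counts = Counter(demonstrations)
    total = len(demonstrations)
    return -sum((n / total) * math.log(policy[x][a])
                for (x, a), n in counts.items())


if __name__ == "__main__":
    demos = [(0, "l"), (0, "l"), (0, "r"), (1, "l")]
    policy = {0: {"l": 0.5, "r": 0.5}, 1: {"l": 1.0, "r": 0.0}}
    print(negative_weighted_log_likelihood(demos, policy))
```

With frequency weights, the loss equals the average negative log-probability that the policy assigns to the demonstrated actions, so it is infinite (here, a math domain error) if the policy assigns zero probability to any demonstrated pair; strictly stochastic policies avoid this.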
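Related to maximizing the log-likelihood, an alternative loss function is the relative entropy or Kullback–Leibler (KL) divergence between the empirical distribution of the state-action trajectories and their distribution under the learned policy, that is,
$$\ell(\theta) = \sum_{x \in \mathcal{D}_x} D_{KL}\big(\hat{\pi}(\cdot|x) \,\|\, \pi_\theta(\cdot|x)\big) \tag{29}$$
where
$$D_{KL}(P \,\|\, Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)} \tag{30}$$
is the KL divergence, $\mathcal{D}_x$ is the sequence of observed states, and $\hat{\pi}$ is the empirical distribution on the trajectories of $\mathcal{D}$.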
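A KL-based loss of this type can be sketched by first estimating the empirical policy from the demonstrations and then summing the per-state divergences. The code below sums over the distinct observed states, which is an assumption (the sum could instead count repeated visits); the function names and toy data are likewise hypothetical.

```python
import math
from collections import Counter, defaultdict


def empirical_policy(demonstrations):
    """Per-state action frequencies estimated from (state, action) pairs."""
    counts = defaultdict(Counter)
    for x, a in demonstrations:
        counts[x][a] += 1
    return {x: {a: n / sum(c.values()) for a, n in c.items()}
            for x, c in counts.items()}


def kl_divergence(p, q):
    """KL divergence of q from p; terms with p(a) = 0 contribute zero."""
    return sum(pa * math.log(pa / q[a]) for a, pa in p.items() if pa > 0.0)


def kl_loss(demonstrations, policy):
    """Sum over distinct observed states of KL(empirical policy || learned policy)."""
    pi_hat = empirical_policy(demonstrations)
    return sum(kl_divergence(pi_hat[x], policy[x]) for x in pi_hat)


if __name__ == "__main__":
    demos = [(0, "l"), (0, "l"), (0, "r"), (1, "l")]
    policy = {0: {"l": 0.5, "r": 0.5}, 1: {"l": 0.9, "r": 0.1}}
    print(kl_loss(demos, policy))
```

The loss is zero exactly when the learned policy reproduces the empirical action frequencies in every observed state.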
III-B Gradient-Based Approach
We propose to solve the problem of estimating the parameters of the agent’s value function and approximating the agent’s policy via gradient methods which requires computing the derivative of with respect to . Hence, given the form of the Qlearning procedure where the temporal differences are transformed as in (21), we need to derive a mechanism for obtaining the optimal , show that it is in fact differentiable, and derive a procedure for obtaining the derivative.
Using some basic calculus, given the form of smoothing map in (26), we can compute the derivative of the policy with respect to for an element of :
(31)  
(32) 
We show that can be calculated almost everywhere on by solving fixedpoint equations similar to the Bellmanoptimality equations.
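For a Boltzmann policy, the chain rule gives the policy derivative in terms of the derivative of the action values: $\partial \pi(a|x) = \pi(a|x)\,\beta\,\big(\partial Q(x,a) - \sum_{a'} \pi(a'|x)\,\partial Q(x,a')\big)$, assuming the inverse temperature $\beta$ does not itself depend on the parameter being differentiated. The sketch below checks this identity against a central finite difference for a single state; the function names and numbers are illustrative assumptions, and the derivative of the action values is simply supplied as a vector rather than computed from a fixed-point equation.

```python
import math


def boltzmann(q, beta):
    """Softmax over a list of action values for one state."""
    q_max = max(q)
    weights = [math.exp(beta * (qi - q_max)) for qi in q]
    total = sum(weights)
    return [w / total for w in weights]


def policy_derivative(q, dq, beta):
    """Analytic derivative of the Boltzmann policy given dq = dQ/dtheta."""
    pi = boltzmann(q, beta)
    avg = sum(p * d for p, d in zip(pi, dq))
    return [p * beta * (d - avg) for p, d in zip(pi, dq)]


def finite_difference(q, dq, beta, h=1e-6):
    """Central difference along the direction dq, i.e. Q(theta) = q + theta * dq."""
    plus = boltzmann([qi + h * di for qi, di in zip(q, dq)], beta)
    minus = boltzmann([qi - h * di for qi, di in zip(q, dq)], beta)
    return [(a - b) / (2.0 * h) for a, b in zip(plus, minus)]


if __name__ == "__main__":
    q, dq, beta = [1.0, 2.0, 0.5], [0.3, -1.0, 2.0], 1.5
    print(policy_derivative(q, dq, beta))
    print(finite_difference(q, dq, beta))
```

Because the action probabilities sum to one, the components of the derivative sum to zero, which is a quick sanity check on any implementation.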
To do this, we require some assumptions on the value function . The value function satisfies the following conditions:

is strictly increasing in and for each , there exists a such that ;

for each , on any ball centered around the origin of finite radius, is locally Lipschitz in with constant and locally Lipschitz on with constant ;

there exists such that for all .
Define and . As before, let . We rewrite the –update equation as
(33) 
where
is the temporal difference, and we have suppressed the dependence of on . In addition, define the map such that
(34) 
where . This map is a contraction for each . Indeed, fixing , when satisfies Assumption IIIB, then for cases where , was shown to be a contraction in [8] and in the more general setting (i.e. ), in [7].
Our first main result on inverse risk-sensitive reinforcement learning, which is the theoretical underpinning of our gradient-based algorithm, gives us a mechanism to compute the derivative of with respect to as a solution to a fixed-point equation via a contraction mapping argument.
Let be the derivative of with respect to the –th argument where . Assume that satisfies Assumption IIIB. Then the following statements hold:

is locally Lipschitz continuous as a function of —that is, for any , , for some ;

except on a set of measure zero, the gradient is given by the solution of the fixed–point equation
(35) where and is the action that maximizes where is any policy that is greedy with respect to .
We provide the proof in Appendix C. To give a high-level outline, we use an induction argument combined with a contraction mapping argument on the map
(36) 
The almost everywhere differentiability follows from Rademacher’s Theorem (see, e.g., [31, Thm. 3.1]).
Theorem IIIB gives us a procedure—namely, a fixed–point equation which is a contraction—to compute the derivative so that we can compute the derivative of our loss function . Hence the gradient method provided in Algorithm 1 for solving the inverse risksensitive reinforcement learning problem is well formulated.
The prospect theory value function given in (2) is not globally Lipschitz in —in particular, it is not Lipschitz near the reference point —for values of and less than one. Moreover, for certain parameter combinations, it may not even be differentiable. The function, on the other hand, is locally Lipschitz and its derivative near the reference point is bounded away from zero. This makes it a more viable candidate for numerical implementation. Its derivative, however, is not bounded away from zero as .
This being said, we note that if the procedure for computing follows an algorithm which implements repeated applications of the map is initialized with being finite for all and is bounded for all possible pairs, then the derivative of will always be bounded away from zero for all realized values of in the procedure. An analogous statement can be made regarding the computation of . Hence, the procedures for computing and for all the value functions we consider (excluding the classical prospect value function) are guaranteed to converge (except on a set of measure zero).
Let us translate this remark into a formal result. Consider a modified version of Assumption IIIB: The value function satisfies the following:

it is strictly increasing in and for each , there exists a such that ;

for each , it is Lipschitz in with constant and locally Lipschitz on with constant .
Simply speaking, analogous to Assumption IID, we have removed the uniform lower bound on the derivative of . Moreover, Theorem IID gives us that , as defined in (53), is a contraction on a ball of finite radius for each under Assumption IID.
Assume that satisfies Assumption IIIB and that the reward is bounded almost surely by . Then the following statements hold.

For any ball , is locally Lipschitzcontinuous on as a function of —that is, for any , , for some .

For each , let be the ball with radius satisfying
(37) Except on a set of measure zero, the gradient is given by the solution of the fixed–point equation
(38) where and is the action that maximizes with being any policy that is greedy with respect to .
The proof (provided in Appendix D) of the above theorem follows the same techniques as in Theorem IID and Theorem IIIB.
III-C Complexity
Small dataset size is often a challenge in modeling sequential human decision-making, owing in large part to the frequency and time scale on which decisions are made in many applications. To understand how our gradient-based approach performs for different amounts of data, we analyze the case in which the loss function is either the negative log-likelihood of the data (see (28) above) or the sum over states of the KL divergence between the policy under our learned value function and the empirical policy of the agent (see (29) above). These are two of the more common loss functions used in the literature.
We first note that maximizing the loglikelihood is equivalent to minimizing a weighted sum over states of the KL divergence between the empirical policy of the true agent, , and the policy under the learned value function, . In particular, through some algebraic manipulation the weighted loglikelihood can be rewritten as
(39) 
where is the frequency of state normalized by . This approach has the added benefit that it is independent of and therefore will not be affected by scaling of the value functions [30].
Both cost functions are natural performance metrics in that they minimize a measure of the divergence between the optimal policy under the learned agent and the empirical policy of the true agent. While the KL divergence is not itself suitable for our analysis, since it is not a metric on the space of probability distributions, it does provide an upper bound on the total variation (TV) distance via Pinsker's inequality:
$$\delta_{TV}(P, Q) \le \sqrt{\tfrac{1}{2} D_{KL}(P \,\|\, Q)} \tag{40}$$
where $\delta_{TV}(P, Q)$ is the TV distance between distributions $P$ and $Q$, defined as
$$\delta_{TV}(P, Q) = \tfrac{1}{2} \sum_{i} |P(i) - Q(i)| \tag{41}$$
The TV distance between distributions is a proper metric. Furthermore, minimizing either of the two cost functions described above also drives down the TV distance, since the latter is upper bounded by a function of the KL divergence.
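Pinsker's inequality is easy to check numerically on finite distributions. The sketch below uses the convention that the TV distance is half the $\ell_1$ distance and the KL divergence uses natural logarithms; the function names and the random test distributions are assumptions for illustration.

```python
import math
import random


def tv_distance(p, q):
    """Total variation distance between two finite distributions (half the L1 distance)."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def kl_divergence(p, q):
    """KL divergence of q from p in nats; assumes q has full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def random_distribution(size, rng):
    """A random distribution with strictly positive entries."""
    raw = [rng.random() + 1e-3 for _ in range(size)]
    total = sum(raw)
    return [r / total for r in raw]


def pinsker_holds(p, q, slack=1e-12):
    """True when TV(p, q) <= sqrt(KL(p || q) / 2), up to a small numerical slack."""
    return tv_distance(p, q) <= math.sqrt(0.5 * kl_divergence(p, q)) + slack


if __name__ == "__main__":
    rng = random.Random(0)
    p, q = random_distribution(4, rng), random_distribution(4, rng)
    print(tv_distance(p, q), math.sqrt(0.5 * kl_divergence(p, q)))
```

The bound is one-directional: a small KL divergence forces a small TV distance, but a small TV distance does not by itself bound the KL divergence.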
For each state $x$, we would ideally like a bound on $\delta_{TV}(\pi(\cdot|x), \pi_\theta(\cdot|x))$, the TV distance between the agent's true policy $\pi$ and the estimated policy $\pi_\theta$. However, we only have access to the empirical policy $\hat{\pi}_n$. We therefore use the triangle inequality to obtain an upper bound in terms of quantities that we can either calculate explicitly or bound. In particular, we derive the following bound:
$$\delta_{TV}(\pi(\cdot|x), \pi_\theta(\cdot|x)) \le \delta_{TV}(\pi(\cdot|x), \hat{\pi}_n(\cdot|x)) + \delta_{TV}(\hat{\pi}_n(\cdot|x), \pi_\theta(\cdot|x)) \tag{42}$$
Note that $\delta_{TV}(\hat{\pi}_n(\cdot|x), \pi_\theta(\cdot|x))$ is tantamount to a training error measured in the TV distance, and is upper bounded by a function of the KL divergence (which appears in the loss function) via (40).
The first term in (42) is the distance between the empirical policy and the true policy in the state under consideration. Using the Dvoretzky–Kiefer–Wolfowitz inequality (see, e.g., [32, 33]), this term can be bounded above with high probability. Indeed,
(43) 
where is the number of samples from the distribution and is the cardinality of the action set. Combining this bound with (42), we get that, with probability ,
(44) 
Supposing Algorithm 1 achieves a sufficiently small training error , the second term above can be bounded above by a calculable small amount which we define notationally to be . Supposing is also sufficiently small, the dominating term in the distance between and is the first term on the righthand side in (44). This gives us a convergence rate on the per state level. This rate is seen qualitatively in our experiments on sample complexity outlined in Section IVA4.
We note that this bound is for each individual state . Thus, for states that are visited more frequently by the agent, we have better guarantees on how well the policy under the learned value function approximates the true policy. Moreover, it suggests ways of designing data collection schemes to better understand the agent’s actions in less explored regions of the state space.
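The per-state dependence on the number of samples can be illustrated with a small simulation: draw actions from a fixed policy in one state, form the empirical policy, and measure its TV distance to the truth as the sample count grows. This is only an empirical illustration and does not reproduce the bound above; the function names, the three-action policy, and the trial counts are assumptions.

```python
import random


def sample_empirical_policy(true_policy, n, rng):
    """Empirical action frequencies from n draws of the true policy in one state."""
    actions = list(true_policy)
    weights = [true_policy[a] for a in actions]
    draws = rng.choices(actions, weights=weights, k=n)
    return {a: draws.count(a) / n for a in actions}


def tv_distance(p, q):
    """Total variation distance between two action distributions."""
    return 0.5 * sum(abs(p[a] - q[a]) for a in p)


def mean_tv_error(true_policy, n, trials=200, seed=0):
    """Average TV distance between the empirical and true policy over repeated trials."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += tv_distance(sample_empirical_policy(true_policy, n, rng), true_policy)
    return total / trials


if __name__ == "__main__":
    policy = {"left": 0.2, "stay": 0.3, "right": 0.5}
    for n in (10, 100, 1000):
        print(n, mean_tv_error(policy, n))
```

In runs of this sketch the average error shrinks as the number of visits to the state grows, which is consistent with the observation that frequently visited states admit better guarantees than rarely visited ones.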
IV Examples
Let us now demonstrate the performance of the proposed method on two examples. While we are able to formulate the inverse risksensitive reinforcement learning problem for parameter vectors that include and , in the following examples we use and . The purpose of doing this is to explore the effects of changing the value function parameters on the resulting policy.
In all experiments, our optimization objective is the negative loglikelihood of the data, defined in (